# Aim: To perform numeric data preprocessing and analysis

## Task 1(ii): Web Scraping using Beautiful Soup

#### Importing libraries

In [1]:
from bs4 import BeautifulSoup
import requests

- The BeautifulSoup library parses and extracts data from HTML content, and the requests library fetches that content over HTTP. Together they are commonly used for web scraping and data extraction tasks in Python.

In [2]:
url="https://www.python.org/"
response=requests.get(url)
print(response)

<Response [200]>


- This code uses the requests library to send an HTTP GET request to "https://www.python.org/" and stores the server's reply in the response variable. Printing the response object shows only its status code, here 200 (OK). The headers and the body of the page are available through attributes such as response.headers and response.content.
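The request above assumes the server answers promptly and successfully. A minimal sketch of a more defensive fetch is shown here. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page body as bytes, or None on failure."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of ignoring them.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.content


# Usage (needs network access):
# html = fetch_html("https://www.python.org/")
```

Returning None on failure keeps the calling code simple. Re-raising the exception would be an equally reasonable choice when a failed download should stop the notebook.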

In [3]:
html=response.content
print(html)



- response.content holds the raw body of the page as bytes. The BeautifulSoup class from the bs4 (Beautiful Soup 4) library turns this content into a BeautifulSoup object, which is used for parsing and navigating through the HTML.

In [4]:
soup=BeautifulSoup(html,'html.parser')

- This creates a BeautifulSoup object named soup by parsing the HTML content stored in the html variable. The 'html.parser' argument selects Python's built-in HTML parser as the parser BeautifulSoup uses to interpret the HTML.
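To see what the parsed object offers without downloading anything, the same call can be tried on a small inline document. The markup in `sample_html` is made up for this sketch.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

soup = BeautifulSoup(sample_html, "html.parser")

title_text = soup.title.string   # text inside the <title> tag
first_p = soup.find("p")         # first <p> tag in document order

print(title_text)
print(first_p["class"], first_p.get_text())
```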

### (a) extract all the URLs that are nested within li tags

In [5]:
li_tags=soup.find_all('li')
for i in li_tags:
    a_tag=i.find_all('a')
    for k in a_tag:
        if "href" in k.attrs:
            print("The URL's from the webpage python.org that are nested within <li> tag is",k["href"])

The URL's from the webpage python.org that are nested within <li> tag is /
The URL's from the webpage python.org that are nested within <li> tag is https://www.python.org/psf/
The URL's from the webpage python.org that are nested within <li> tag is https://docs.python.org
The URL's from the webpage python.org that are nested within <li> tag is https://pypi.org/
The URL's from the webpage python.org that are nested within <li> tag is /jobs/
The URL's from the webpage python.org that are nested within <li> tag is /community-landing/
The URL's from the webpage python.org that are nested within <li> tag is #
The URL's from the webpage python.org that are nested within <li> tag is javascript:;
The URL's from the webpage python.org that are nested within <li> tag is javascript:;
The URL's from the webpage python.org that are nested within <li> tag is javascript:;
The URL's from the webpage python.org that are nested within <li> tag is javascript:;
The URL's from the webpage python.org that a

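- This code segment searches for all li (list item) HTML tags within the parsed HTML content using the find_all method. For each li tag found, it looks for nested a (anchor) tags using i.find_all('a'). Within this nested loop, it checks whether the anchor tag has an "href" attribute using "href" in k.attrs and, if so, prints its value. Because find_all searches all descendants, a link inside a nested li is reported once for every li that encloses it. The output also mixes absolute URLs with relative paths (such as /jobs/) and placeholder values (such as # and javascript:;).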
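The loop above prints href values exactly as they appear in the markup. One possible refinement, sketched here on made-up sample markup, resolves relative paths against the page URL with `urllib.parse.urljoin`, skips placeholder links, and removes duplicates. The function name `li_links` is hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.python.org/"

# Made-up markup that imitates a navigation list with one nested menu.
sample_html = """
<ul>
  <li><a href="/">Python</a></li>
  <li><a href="https://docs.python.org">Docs</a></li>
  <li><a href="/jobs/">Jobs</a>
    <ul><li><a href="/jobs/types/">Types</a></li></ul>
  </li>
  <li><a href="#">Menu</a></li>
  <li><a href="javascript:;">Toggle</a></li>
  <li><a>No href</a></li>
</ul>
"""


def li_links(html, base_url):
    """Hypothetical helper: absolute, de-duplicated URLs found inside <li> tags."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for li in soup.find_all("li"):
        # href=True keeps only anchors that actually carry an href attribute.
        for a in li.find_all("a", href=True):
            href = a["href"]
            if href.startswith(("#", "javascript:")):
                continue  # placeholder links, not real destinations
            absolute = urljoin(base_url, href)
            if absolute not in links:  # nested <li> tags would repeat links
                links.append(absolute)
    return links


links = li_links(sample_html, BASE_URL)
for link in links:
    print(link)
```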

### (b) list of all the h1, h2, h3 tags from the webpage

- list of h1 tags

In [6]:
h1_tags=soup.find_all('h1')
print("<h1> tags")
for i in h1_tags:
    print(i)

<h1> tags
<h1 class="site-headline">
<a href="/"><img alt="python™" class="python-logo" src="/static/img/python-logo.png"/></a>
</h1>
<h1>Functions Defined</h1>
<h1>Compound Data Types</h1>
<h1>Intuitive Interpretation</h1>
<h1>All the Flow You’d Expect</h1>
<h1>Quick &amp; Easy to Learn</h1>


- This code segment uses BeautifulSoup's find_all method to locate all h1 (top-level heading) HTML tags within the parsed HTML content. It then loops over the h1_tags list and prints each tag in full, including its markup and any nested tags.

- list of h2 tags

In [7]:
h2_tags=soup.find_all('h2')
print("<h2> tags")
for i in h2_tags:
    print(i)

<h2> tags
<h2 class="widget-title"><span aria-hidden="true" class="icon-get-started"></span>Get Started</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-download"></span>Download</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-documentation"></span>Docs</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-jobs"></span>Jobs</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-news"></span>Latest News</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-calendar"></span>Upcoming Events</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-success-stories"></span>Success Stories</h2>
<h2 class="widget-title"><span aria-hidden="true" class="icon-python"></span>Use Python for…</h2>
<h2 class="widget-title">
<span class="prompt">&gt;&gt;&gt;</span> <a href="/dev/peps/">Python Enhancement Proposals<span class="say-no-more"> (PEPs)</span></a>: The future of Python<span class="say-no-more"> is discussed 

- This code segment uses BeautifulSoup's find_all method to locate all h2 (second-level heading) HTML tags within the parsed HTML content. It then loops over the h2_tags list and prints each tag in full, including its markup and any nested tags.

- list of h3 tags

In [8]:
h3_tags=soup.find_all('h3')
print("<h3> tags")
for i in h3_tags:
    print(i)

<h3> tags


- This code segment uses BeautifulSoup's find_all method to locate all h3 (third-level heading) HTML tags within the parsed HTML content. It then loops over the h3_tags list and prints each tag in full. Nothing is printed after the header line, which indicates that find_all returned no h3 tags for this page.
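The three cells above differ only in the tag name. As a sketch of a more compact alternative, find_all also accepts a list of names, and get_text(strip=True) returns the heading text without the markup. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with a mix of heading levels.
sample_html = """
<h1 class="site-headline"><a href="/">Python</a></h1>
<h2 class="widget-title"><span class="icon"></span>Get Started</h2>
<h1>Functions Defined</h1>
<p>Not a heading</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# A list of names matches any of them; results stay in document order.
headings = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(["h1", "h2", "h3"])
]

for name, text in headings:
    print(f"<{name}> {text}")
```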

### (c) extract all the text from the given web page

In [9]:
text=soup.find_all('p')
for i in text:
    print("The text in the page is",i.text)

The text in the page is Notice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. 
The text in the page is The core of extensible programming is defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. More about defining functions in Python 3
The text in the page is Lists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. More about lists in Python 3
The text in the page is Calculations are simple with Python, and expression syntax is straightforward: the operators +, -, * and / work as expected; parentheses () can be used for grouping. More about simple math functions in Python 3.
The text in the page is Python knows the usual control flow statements that other languages speak — if, for, while and 

- This code segment uses BeautifulSoup's find_all method to locate all p (paragraph) HTML tags within the parsed HTML content. It then loops over the resulting list, extracts the text content of each p tag using i.text, and prints it. Text that sits outside p tags, such as headings and list items, is not included.
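If the goal is the text of the whole page rather than only its paragraphs, one option is get_text on the soup object itself. The sketch below uses made-up markup and first removes script and style tags so that their contents do not end up in the result.

```python
from bs4 import BeautifulSoup

# Made-up markup with text in several kinds of tags.
sample_html = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Heading</h1><p>First paragraph.</p>
<script>var x = 1;</script>
<ul><li>Item</li></ul></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Drop tags whose contents are code or styling rather than readable text.
for tag in soup(["script", "style"]):
    tag.decompose()

# One piece of text per line, with surrounding whitespace stripped.
text = soup.get_text(separator="\n", strip=True)
lines = text.splitlines()
print(text)
```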

### (d) find and print all li tags of a given web page.

In [10]:
li_tags=soup.find_all('li')
for i in li_tags:
    print("The <li> tags of the web page is:",i)

The <li> tags of the web page is: <li class="python-meta current_item selectedcurrent_branch selected">
<a class="current_item selectedcurrent_branch selected" href="/" title="The Python Programming Language">Python</a>
</li>
The <li> tags of the web page is: <li class="psf-meta">
<a href="https://www.python.org/psf/" title="The Python Software Foundation">PSF</a>
</li>
The <li> tags of the web page is: <li class="docs-meta">
<a href="https://docs.python.org" title="Python Documentation">Docs</a>
</li>
The <li> tags of the web page is: <li class="pypi-meta">
<a href="https://pypi.org/" title="Python Package Index">PyPI</a>
</li>
The <li> tags of the web page is: <li class="jobs-meta">
<a href="/jobs/" title="Python Job Board">Jobs</a>
</li>
The <li> tags of the web page is: <li class="shop-meta">
<a href="/community-landing/">Community</a>
</li>
The <li> tags of the web page is: <li aria-haspopup="true" class="tier-1 last">
<a class="action-trigger" href="#"><strong><small>A</small> A<

- This code segment uses BeautifulSoup's find_all method to locate all li (list item) HTML tags within the parsed HTML content. It then loops over the li_tags list and prints each tag in full, including its attributes and the tags nested inside it.
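Printing whole tags can be hard to scan. As a sketch of a more compact view, the class attribute and the visible text of each li tag can be collected instead. The sample markup below is made up and only imitates the structure seen in the output above.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating a short navigation list.
sample_html = """
<ul>
  <li class="psf-meta"><a href="https://www.python.org/psf/">PSF</a></li>
  <li class="docs-meta"><a href="https://docs.python.org">Docs</a></li>
  <li>Plain item</li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class is a multi-valued attribute, so BeautifulSoup returns it as a list;
# get() supplies an empty list when the attribute is missing.
items = [
    (li.get("class", []), li.get_text(strip=True))
    for li in soup.find_all("li")
]

for classes, text in items:
    print(classes, text)
```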