## Aim: To perform numeric data preprocessing and analysis

### Task 1: Web Scraping using Beautiful Soup

#### Importing libraries

- The BeautifulSoup library parses HTML content obtained via HTTP requests and extracts data from it. It is commonly used for web scraping and data extraction tasks in Python. The requests library sends the HTTP requests themselves.

In [1]:
from bs4 import BeautifulSoup
import requests

In [2]:
url="https://quotes.toscrape.com/"
response=requests.get(url)
print(response)

<Response [200]>


- This code fetches the page at "https://quotes.toscrape.com/" using the requests library, which sends an HTTP GET request to the specified URL. The response from the server is stored in the response variable. Printing the response object shows only its HTTP status code (here 200, meaning the request succeeded); the headers and the page body are available through attributes such as response.headers and response.content.
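
The 200 in the printed response is an HTTP status code. As a minimal sketch (the helper name describe_status is hypothetical and not part of the original notebook), the standard library's http.HTTPStatus can turn such a code into a readable phrase:

```python
from http import HTTPStatus

def describe_status(code):
    """Return a status code with its standard reason phrase, e.g. '200 OK'."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not define
    return f"{status.value} {status.phrase}"

print(describe_status(200))
print(describe_status(404))
```

With a real response object, the same integer is available as response.status_code.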

In [3]:
html=response.content
print(html)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\xe2\x80\

- The BeautifulSoup class from the bs4 (Beautiful Soup 4) library is called with the raw HTML and the name of a parser ('html.parser', the one built into Python) to create a BeautifulSoup object. This object is used for parsing and navigating through the HTML content.

In [4]:
soup=BeautifulSoup(html,'html.parser')
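
Note that response.content is a bytes object (hence the b'...' prefix and escapes such as \xe2\x80\x9c in the output above). BeautifulSoup accepts bytes and decodes them while parsing. A small offline sketch, using a made-up HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for response.content (bytes, UTF-8 encoded).
sample_bytes = (
    b'<html><head><meta charset="UTF-8"><title>Demo</title></head>'
    b'<body><span class="text">\xe2\x80\x9cHello\xe2\x80\x9d</span></body></html>'
)

demo_soup = BeautifulSoup(sample_bytes, 'html.parser')

# The escaped bytes become real curly quotation marks after parsing.
quote_text = demo_soup.find('span', class_='text').text
print(quote_text)
```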

#### (a) Find the title tag

In [5]:
title_tag=soup.title
print("Title Tag:",title_tag)

# title tag as string
print("String Title Tag:",title_tag.string)

Title Tag: <title>Quotes to Scrape</title>
String Title Tag: Quotes to Scrape


- title_tag is assigned the title HTML tag from the parsed HTML content. The <title> tag holds the title of the web page. The print("Title Tag:", title_tag) line prints the whole tag, including the opening and closing <title> markers and their contents, while title_tag.string returns only the text inside the tag.
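
One detail worth knowing: .string returns a tag's text only when the tag contains a single piece of text. When the tag mixes text with other tags, it returns None, whereas get_text() joins all the nested text. A short sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup(
    "<title>Quotes to Scrape</title><p>Hello <b>world</b></p>", "html.parser"
)

title_string = demo.title.string   # single text child -> its text
p_string = demo.p.string           # text plus a <b> child -> None
p_text = demo.p.get_text()         # all nested text joined together

print(title_string, p_string, p_text, sep=" | ")
```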

#### (b) Retrieve all the paragraph tags

In [6]:
paragraph_tag=soup.find_all('p')
print("Paragraph Tags")
for i in paragraph_tag:
    print(i)

Paragraph Tags
<p>
<a href="/login">Login</a>
</p>
<p class="text-muted">
                Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a>
</p>
<p class="copyright">
                Made with <span class="zyte">❤</span> by <a class="zyte" href="https://www.zyte.com">Zyte</a>
</p>


- In this code, the variable paragraph_tag is assigned a list of all the <p> (paragraph) HTML tags found within the parsed HTML content using BeautifulSoup's find_all method. The find_all method is used to locate all instances of a particular HTML tag.
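
find_all also accepts attribute filters. As an illustration on a made-up snippet modeled on the output above, class_ restricts the match to a CSS class and limit caps the number of results:

```python
from bs4 import BeautifulSoup

demo_html = """
<p><a href="/login">Login</a></p>
<p class="text-muted">Quotes by: GoodReads.com</p>
<p class="copyright">Made with love</p>
"""
demo = BeautifulSoup(demo_html, "html.parser")

all_paragraphs = demo.find_all("p")               # every <p> tag
muted = demo.find_all("p", class_="text-muted")   # only <p class="text-muted">
first_two = demo.find_all("p", limit=2)           # stop after two matches

print(len(all_paragraphs), len(muted), len(first_two))
```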

#### (c) Extract the text in the first paragraph tag

In [7]:
if paragraph_tag:
    print("The text in the first paragraph tag is",paragraph_tag[0].text)
else:
    print("No paragraph tags found")

The text in the first paragraph tag is 
Login



- The if condition checks whether the paragraph_tag list is non-empty, meaning at least one <p> tag was found in the parsed HTML content. If so, paragraph_tag[0].text extracts the text content of the first <p> tag, which is then printed; otherwise the else branch prints a fallback message.
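
The blank lines around "Login" in the output come from whitespace inside the <p> tag, which .text preserves. A sketch on a made-up snippet showing how get_text(strip=True) removes it:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<p>\n<a href="/login">Login</a>\n</p>', "html.parser")
first_p = demo.find("p")

raw_text = first_p.text                     # keeps the surrounding newlines
clean_text = first_p.get_text(strip=True)   # strips whitespace from each text piece

print(repr(raw_text))
print(repr(clean_text))
```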

#### (d) Find all the h2 tags

In [8]:
h2_tags=soup.find_all('h2')
print("<h2> tags")
for i in h2_tags:
    print(i)

<h2> tags
<h2>Top Ten tags</h2>


- This code searches for all h2 (second-level heading) HTML tags in the parsed HTML content using BeautifulSoup's find_all method and stores them in the h2_tags list. The loop for i in h2_tags: iterates over the list and prints each tag, including its HTML markup and content. The page contains a single h2 tag.

#### (e) Find the length of the text of the first h2 tag

In [9]:
if h2_tags:
    print("The length of the first h2 tag is:",len(h2_tags[0].text))

The length of the first h2 tag is: 12


- In this code, the if condition checks whether the h2_tags list is non-empty, meaning at least one h2 tag was found in the parsed HTML content. If so, h2_tags[0].text extracts the text content of the first h2 tag, and len returns its number of characters (12 for "Top Ten tags").
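
When only the first match is needed, find returns that tag directly, or None when nothing matches. A minimal sketch with a hypothetical helper, first_text_length, run on a made-up snippet:

```python
from bs4 import BeautifulSoup

def first_text_length(soup, tag_name):
    """Return the length of the first matching tag's text, or None if absent."""
    tag = soup.find(tag_name)
    if tag is None:
        return None
    return len(tag.text)

demo = BeautifulSoup("<h2>Top Ten tags</h2>", "html.parser")
print(first_text_length(demo, "h2"))
print(first_text_length(demo, "h3"))
```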

#### (f) Find the text of the first a tag

In [None]:
a_tags=soup.find_all('a')
if a_tags:
    print("Text in first <a> tag is:",a_tags[0].text)

- This code uses BeautifulSoup's find_all method to locate all a (anchor) HTML tags in the parsed HTML content and stores them in the a_tags list. The if condition checks that the list is non-empty, indicating at least one anchor tag is present. If so, it prints the text content of the first anchor tag using a_tags[0].text.

#### (g) Find the href of the first a tag

In [None]:
if a_tags:
    print("The href of the first <a> tag is",a_tags[0]["href"])

- This code reuses the a_tags list created in (f). The if condition checks that the list is non-empty, indicating at least one anchor tag is present. If so, it prints the value of the href attribute of the first anchor tag using a_tags[0]["href"].
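
Indexing a tag with ["href"] raises a KeyError when the attribute is missing, while .get("href") returns None instead. A brief sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<a href="/">Home</a><a name="top">Anchor</a>', "html.parser")
with_href, without_href = demo.find_all("a")

print(with_href["href"])          # attribute exists
print(without_href.get("href"))   # missing attribute -> None instead of KeyError
```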

#### (h) Extract all the URLs from the webpage python.org that are nested within li tags

In [None]:
# Creating a response variable 
python_url="https://www.python.org"
python_response=requests.get(python_url)
print(python_response)

In [None]:
python_html=python_response.content
python_soup=BeautifulSoup(python_html,'html.parser')

In [None]:
li_tags=python_soup.find_all('li')
for i in li_tags:
    a_tag=i.find_all('a')
    for k in a_tag:
        if "href" in k.attrs:
            print("URL nested within an <li> tag on python.org:",k["href"])
    

- This code locates all li (list item) HTML tags using BeautifulSoup's find_all method and stores them in the li_tags list. It then iterates through each li tag and searches for nested a (anchor) tags using i.find_all('a'). For each anchor tag found, it checks whether the "href" attribute exists using "href" in k.attrs and, if so, prints the URL.
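
Two details are worth keeping in mind with this loop: an anchor inside a nested list sits within more than one li tag, so it is visited once per enclosing li, and many href values are relative paths. The sketch below runs on a made-up snippet (not the real python.org markup), removes duplicates, and resolves relative links with urllib.parse.urljoin:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.python.org"
demo_html = """
<ul>
  <li><a href="/about/">About</a>
    <ul><li><a href="/about/apps/">Applications</a></li></ul>
  </li>
  <li><a href="https://docs.python.org">Docs</a></li>
  <li><a>No link here</a></li>
</ul>
"""
demo = BeautifulSoup(demo_html, "html.parser")

urls = []
for li in demo.find_all("li"):
    # href=True keeps only anchors that actually carry an href attribute
    for a in li.find_all("a", href=True):
        absolute = urljoin(base_url, a["href"])
        if absolute not in urls:   # the nested anchor is reached via two <li> tags
            urls.append(absolute)

for u in urls:
    print(u)
```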

#### (i) Quotes on quotes.toscrape.com often are categorized with tags. On the first page, create a dict for each quote using the BeautifulSoup object

In [None]:
quote_tag=soup.find_all("span")
for i in quote_tag:
    print(i)
# From this output we can see that the quote text is in spans with the class "text"
# and that the author's name is in the element with the class "author"

In [None]:
quote_tag=soup.find_all('div',class_="quote")
quotes=[]

for i in quote_tag:
    quote_text=i.find('span',class_="text").text
    author_text=i.find('small',class_="author").text
    tags=[tag.text for tag in i.find_all('a',class_="tag")]
    
    quote_dict={
        'Quote':quote_text,
        'Author':author_text,
        'Tags':tags
    }
    quotes.append(quote_dict)

for j in quotes:
    print(j)
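
The same extraction can be tried offline. The sketch below runs the loop on a made-up snippet that imitates the structure of one quote block (the markup is an assumption, simplified from the site), wrapped in a hypothetical parse_quotes helper:

```python
from bs4 import BeautifulSoup

demo_html = """
<div class="quote">
  <span class="text">“A sample quote.”</span>
  <span>by <small class="author">Jane Doe</small></span>
  <div class="tags">
    <a class="tag" href="/tag/life/">life</a>
    <a class="tag" href="/tag/humor/">humor</a>
  </div>
</div>
"""

def parse_quotes(soup):
    """Build one dict per <div class="quote"> block."""
    result = []
    for block in soup.find_all("div", class_="quote"):
        result.append({
            "Quote": block.find("span", class_="text").text,
            "Author": block.find("small", class_="author").text,
            "Tags": [tag.text for tag in block.find_all("a", class_="tag")],
        })
    return result

parsed = parse_quotes(BeautifulSoup(demo_html, "html.parser"))
print(parsed)
```

Note that find returns None when an element is missing, so a block without a matching span or small tag would raise an AttributeError on .text.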


#### (k) Putting all quotes and author names in a CSV file

In [None]:
import pandas as pd
import csv
df=pd.DataFrame(quotes,columns=['Quote','Author'])

- The pandas library is used to create a DataFrame named df from the quotes list. Passing columns=['Quote','Author'] keeps only those two keys of each dict, so the 'Tags' entries are left out of the DataFrame.

In [None]:
df

In [None]:
# Storing the dataframe in a csv file
df.to_csv('quotes.csv',index=False, sep=';')
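
Because the DataFrame above is built with only the 'Quote' and 'Author' columns, the tags are not written to the file. If they should be kept, one option is the standard library's csv module. This is an alternative sketch, not the notebook's method, with made-up sample data and a hypothetical helper name. Tags are joined with '|' so each quote stays on one row:

```python
import csv
import io

def quotes_to_csv(quotes, delimiter=";"):
    """Serialize quote dicts to CSV text, flattening the tag list into one field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Quote", "Author", "Tags"],
                            delimiter=delimiter)
    writer.writeheader()
    for q in quotes:
        writer.writerow({
            "Quote": q["Quote"],
            "Author": q["Author"],
            "Tags": "|".join(q["Tags"]),
        })
    return buffer.getvalue()

sample = [{"Quote": "A sample quote.", "Author": "Jane Doe", "Tags": ["life", "humor"]}]
csv_text = quotes_to_csv(sample)
print(csv_text)
```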