# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



### Not for Grading

## Learning Objectives

At the end of the experiment, you will be able to:

*   extract data from a web page
*   understand how Beautiful Soup is used to parse HTML

## Web Scraping
Web scraping is a technique for automatically accessing and extracting large amounts of information from a website.
It is also known as web data extraction. The Python libraries `requests` and Beautiful Soup are powerful tools for web scraping.

### Importing required packages


In [None]:
import requests         # requests is a library for sending HTTP requests in Python
import urllib.request   # urllib.request is a Python standard-library module for fetching URLs

# Importing BeautifulSoup package
from bs4 import BeautifulSoup as bs  # BeautifulSoup parses HTML so that data can be extracted from web pages

### Download the HTML page of a specified path 


As an example, download the web page of a Shakespeare story using the `urllib` library.

`urllib` is a standard-library package for working with URLs; its `urllib.request` module can download a web page.


Shakespeare web page [Source](http://shakespeare.mit.edu/comedy_errors/full.html)

In [None]:
# URL of the Shakespeare story
url = "http://shakespeare.mit.edu/comedy_errors/full.html"

In [None]:
# Download the page at the given URL and save it to the specified local file
urllib.request.urlretrieve(url, "shakespeare_comedy_play.html")  # Returns a (filename, headers) tuple
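To see what `urlretrieve` returns without depending on a network connection, the following sketch copies a small local file through a `file://` URL. The file names and the sample HTML are made up for illustration; the call itself is the same one used above.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical source file standing in for a remote web page
    source = pathlib.Path(tmp_dir) / "source.html"
    source.write_text("<html><title>Demo</title></html>", encoding="utf-8")

    # urlretrieve copies the resource at the URL into the given file name
    target = pathlib.Path(tmp_dir) / "copy.html"
    saved_path, headers = urllib.request.urlretrieve(source.as_uri(), str(target))

    # Read the saved copy back while the temporary directory still exists
    saved_text = target.read_text(encoding="utf-8")

print(saved_path)   # Path of the saved file
print(saved_text)   # Contents that were downloaded
```

The first element of the returned tuple is the local file name, and the second holds the response headers.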

### Directly get the HTML page without downloading

The **requests** module can fetch the HTML page directly, without saving it to a file first.


The requests module allows you to send HTTP requests using Python.
An HTTP request returns a Response object with all the response data (content, encoding, status, etc.).

`requests.get()` sends a GET request to a web page and returns a Response object, which includes the status code.



In [None]:
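# Request the page; report the error and stop if the request fails
try:
    html_page = requests.get(url, timeout=30)
    html_page.raise_for_status()   # Raise an HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as err:
    print("Request failed:", err)
    raise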
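When several pages need to be fetched, the error handling can be wrapped in a small helper. The sketch below is one possible way to do it; `fetch_html` and its `timeout` default are hypothetical names chosen for this example, not part of the `requests` library.

```python
import requests


def fetch_html(page_url, timeout=10):
    """Return the Response for page_url, or None if the request fails."""
    try:
        response = requests.get(page_url, timeout=timeout)
        response.raise_for_status()  # Treat 4xx/5xx status codes as errors
    except requests.exceptions.RequestException as err:
        # RequestException is the base class of the errors raised by requests
        print(f"Request failed: {err}")
        return None
    return response


# A URL without a scheme is rejected before any network traffic is sent
failed_result = fetch_html("not-a-valid-url")
print(failed_result)
```

Returning `None` makes the failure explicit, so the caller can check the result before reading `.content`.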

In [None]:
# Get the raw content (bytes) of the requested HTML page
page_content = html_page.content
print(page_content)


### Extract data from HTML page using BeautifulSoup


BeautifulSoup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

`html.parser` is Python's built-in HTML parser; BeautifulSoup uses it to identify the tags and parse the data.

In [None]:
# Parse the page content into a BeautifulSoup object
shakespeare_bs_object = bs(page_content, "html.parser")
print(shakespeare_bs_object)

In [None]:
print(shakespeare_bs_object.title.text)
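The parsing step can be tried on a small inline document, which makes the result easy to inspect. The HTML below is a made-up sample, not the Shakespeare page; the sketch only assumes that `bs4` is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page
sample_html = """
<html>
  <head><title>Sample Play</title></head>
  <body>
    <h3>ACT I</h3>
    <p>First line.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_title_tag = sample_soup.title        # The whole tag, including markup
sample_title_text = sample_soup.title.text  # Only the text inside the tag
sample_first_h3 = sample_soup.h3.text       # Text of the first h3 tag

# BeautifulSoup also accepts bytes, such as the .content of a response
soup_from_bytes = BeautifulSoup(sample_html.encode("utf-8"), "html.parser")

print(sample_title_tag)
print(sample_title_text)
print(sample_first_h3)
```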

### To view the content from the HTML page without any tags

`get_text()` extracts the text from an HTML page by removing the tags.

In [None]:
# To see only the text without HTML tags
shakespeare_text = shakespeare_bs_object.get_text()
print(shakespeare_text)
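By default `get_text()` joins the text pieces exactly as they appear in the markup, which can run words together. The following sketch, on a made-up HTML fragment, shows the `separator` and `strip` arguments as one way to get cleaner output.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the real page is much longer
fragment = "<html><body><h3>ACT I</h3><p>Enter <b>DUKE</b></p></body></html>"
fragment_soup = BeautifulSoup(fragment, "html.parser")

# Default: text pieces are concatenated with nothing between them
raw_text = fragment_soup.get_text()

# separator puts a string between pieces; strip trims whitespace from each piece
clean_text = fragment_soup.get_text(separator=" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```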

#### Save the extracted text into a `.txt` file

In [None]:
# Open a file and write the text to it; the file is closed automatically
with open("shakespeare.txt", "w", encoding="utf-8") as f:
    f.write(shakespeare_text)

### Other ways to navigate the data using BeautifulSoup

BeautifulSoup gives access to the different HTML tags in a page, such as the `title` tag and the `h3` tag.

Title tag: To get the title of the HTML page (title of the story)


In [None]:
# Returns the title tag of the HTML page, including the markup
print(shakespeare_bs_object.title)

In [None]:
# To get only the text between the title tag
print(shakespeare_bs_object.title.text) 

H3 tag: To get the third-level headings of the web page

In [None]:
# find_all returns a list of all elements with the given tag
# e.g. find all the third-level headings in the HTML page and count them
h3tags = shakespeare_bs_object.find_all("h3")
print("Number of headings in the web page:", len(h3tags))

In [None]:
# Get only the text of the first h3 tag
print(h3tags[0].text)
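To collect the text of every heading rather than only the first, a list comprehension over the result of `find_all` is a common pattern. The sketch below uses a made-up fragment; the heading texts are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up fragment with three third-level headings
headings_html = """
<h3>ACT I</h3>
<h3>SCENE I. A hall.</h3>
<p>Some dialogue.</p>
<h3>SCENE II. The Mart.</h3>
"""
headings_soup = BeautifulSoup(headings_html, "html.parser")

# Text of every h3 tag, in document order
heading_texts = [tag.get_text(strip=True) for tag in headings_soup.find_all("h3")]

# limit stops the search after the given number of matches
first_two_h3 = headings_soup.find_all("h3", limit=2)

print(heading_texts)
print(len(first_two_h3))
```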

### Extract tabular data from a web page


Select a table from the Wikipedia page "List of mountains by elevation" ([link](https://en.wikipedia.org/wiki/List_of_mountains_by_elevation)) and create a DataFrame from it.

#### Download the HTML page

Use the **requests** module to fetch the HTML page directly.

In [None]:
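url2 = "https://en.wikipedia.org/wiki/List_of_mountains_by_elevation"

# Request the page; report the error and stop if the request fails
try:
    wiki_html = requests.get(url2, timeout=30)
    wiki_html.raise_for_status()   # Raise an HTTPError for 4xx/5xx responses
except requests.exceptions.RequestException as err:
    print("Request failed:", err)
    raise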

#### Extract data from Wikipedia HTML page using BeautifulSoup

As before, BeautifulSoup parses the downloaded content with `html.parser` and builds a parse tree from which the data can be extracted.

In [None]:
# Parse the page content into a BeautifulSoup object
mountains_wiki_page = bs(wiki_html.content, "html.parser")
mountains_wiki_page

**The HTML table element represents tabular data in a two-dimensional table made up of rows and columns**

`table` is the tag that encloses the whole table

`tr` represents each row in the table

`th` represents a heading cell in the table

`td` represents each data cell of the table
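The roles of these four tags are easier to see on a tiny table. The sketch below parses a made-up table (the peak names and values are placeholders, not real data) and separates the heading cells from the data cells.

```python
from bs4 import BeautifulSoup

# Made-up table with one heading row and two data rows
tiny_table_html = """
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Peak A</td><td>1000</td></tr>
  <tr><td>Peak B</td><td>900</td></tr>
</table>
"""
tiny_soup = BeautifulSoup(tiny_table_html, "html.parser")

tiny_table = tiny_soup.find("table")   # The table tag
tiny_rows = tiny_table.find_all("tr")  # One tr tag per row

# The first row holds th (heading) cells
header_cells = [th.get_text(strip=True) for th in tiny_rows[0].find_all("th")]

# The remaining rows hold td (data) cells
body_cells = [
    [td.get_text(strip=True) for td in row.find_all("td")]
    for row in tiny_rows[1:]
]

print(header_cells)
print(body_cells)
```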

To get the table from the Wikipedia page, use the `find` function.

In [None]:
# 'find' returns only the first matching table tag
table = mountains_wiki_page.find("table")
print(table)

To extract the headings (`th` tags) of the table, use `find_all` and take the text of each heading tag.

In [None]:
th = table.find_all("th")                         # Find list of 'th' tags
headings = [i.get_text(strip=True) for i in th]   # Extract text from 'th' tags, trimming whitespace
print(headings)

Extract the data from each row of the table (`tr` tag) by appending the text of each cell (`td` tag) to `data`, then create a DataFrame.

In [None]:
import pandas as pd

data = []

# Find all the tr tags and extract the cell text from each row
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # Skip rows without td cells (such as the heading row)
        data.append([cell.get_text(strip=True) for cell in cells])

# Create a DataFrame from the data
df = pd.DataFrame(data, columns = headings)
df
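Scraped cells are always strings, so numeric columns usually need converting before analysis. The sketch below walks through the same row-by-row approach on a made-up table (placeholder names and values) and then converts one column to numbers; the column names are assumptions for this example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table; the values are placeholders, not real elevations
demo_table_html = """
<table>
  <tr><th>Mountain</th><th>Metres</th></tr>
  <tr><td>Peak A</td><td>1,000</td></tr>
  <tr><td>Peak B</td><td>900</td></tr>
</table>
"""
demo_table = BeautifulSoup(demo_table_html, "html.parser").find("table")

demo_columns = [th.get_text(strip=True) for th in demo_table.find_all("th")]

demo_records = []
for row in demo_table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # The heading row has no td cells, so it is skipped
        demo_records.append([cell.get_text(strip=True) for cell in cells])

demo_df = pd.DataFrame(demo_records, columns=demo_columns)

# Remove thousands separators, then convert the strings to numbers
demo_df["Metres"] = pd.to_numeric(demo_df["Metres"].str.replace(",", "", regex=False))

print(demo_df)
```

On a real page the heading and data cells may not line up this neatly, so it is worth checking that each row has as many cells as there are headings.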

#### Saving the DataFrame to a `.csv` file

In [None]:
df.to_csv("mountains_list_elevation.csv")
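By default `to_csv` also writes the row index as an unnamed first column. Whether that is wanted depends on the use case; the sketch below compares both forms on a made-up DataFrame, returning the CSV as a string instead of writing a file.

```python
import pandas as pd

# Made-up data; names and values are placeholders
csv_demo_df = pd.DataFrame({"Mountain": ["Peak A", "Peak B"], "Metres": [1000, 900]})

# With no path argument, to_csv returns the CSV text
csv_with_index = csv_demo_df.to_csv()
csv_without_index = csv_demo_df.to_csv(index=False)

print(csv_with_index)
print(csv_without_index)
```

Passing `index=False` keeps the file to just the table columns, which avoids an extra unnamed column when the file is read back.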