# Module - Beautiful Soup (bs4)

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

To install: pip install beautifulsoup4

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

In [2]:
# Sample HTML Data:

data = '''
<html>
    <head>
        <title>Web Scraping</title>
    </head>
    <body>
        <h1>Web Scraping</h1>
        <h2>web Scraping</h2>
        <h3>web Scraping</h3>

        <p>this is the session that will give the idea about collecting our own data</p>

        <ul class = 'test1' id = 'test1'>
            <li>python</li>
            <li>ml</li>
            <li>Dl</li>
            <li>NLP</li>
        </ul>
        
        <ul class = 'test2' id = 'test2'>
            <li>python</li>
            <li>ml</li>
            <li>Dl</li>
            <li>NLP</li>
        </ul>
        
        <ul class = 'test3' id = 'test3'>
            <li>python</li>
            <li>ml</li>
            <li>Dl</li>
            <li>NLP</li>
        </ul>

        <ol>
            <li>python</li>
            <li>ml</li>
            <li>Dl</li>
            <li>NLP</li>
        </ol>

        <table>
            <tr>
              <th>Company</th>
              <th>Contact</th>
              <th>Country</th>
            </tr>
            <tr>
              <td>Alfreds Futterkiste</td>
              <td>Maria Anders</td>
              <td>Germany</td>
            </tr>
            <tr>
              <td>Centro comercial Moctezuma</td>
              <td>Francisco Chang</td>
              <td>Mexico</td>
            </tr>
          </table>

    </body>
</html>
'''

In [3]:
# To view HTML content in Jupyter Notebook:

from IPython.display import display, HTML
display(HTML(data))

Company,Contact,Country
Alfreds Futterkiste,Maria Anders,Germany
Centro comercial Moctezuma,Francisco Chang,Mexico


In [4]:
soup = bs(data, 'html.parser')   # Name the parser explicitly so the result does not depend on what is installed
soup

<html>
<head>
<title>Web Scraping</title>
</head>
<body>
<h1>Web Scraping</h1>
<h2>web Scraping</h2>
<h3>web Scraping</h3>
<p>this is the session that will give the idea about collecting our own data</p>
<ul class="test1" id="test1">
<li>python</li>
<li>ml</li>
<li>Dl</li>
<li>NLP</li>
</ul>
<ul class="test2" id="test2">
<li>python</li>
<li>ml</li>
<li>Dl</li>
<li>NLP</li>
</ul>
<ul class="test3" id="test3">
<li>python</li>
<li>ml</li>
<li>Dl</li>
<li>NLP</li>
</ul>
<ol>
<li>python</li>
<li>ml</li>
<li>Dl</li>
<li>NLP</li>
</ol>
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
</body>
</html>

# HTML Tags

HTML tags are keywords that tell a web browser how to format and display content. Tags let the browser distinguish markup from plain text. Most HTML elements have three parts: an opening tag, the content, and a closing tag. Some tags, such as <br> and <img>, are void elements and have no closing tag.

A web browser reads an HTML document from top to bottom and from left to right. HTML tags are used to build the document and describe how each part should be rendered. Each tag has its own set of properties.

An HTML file needs a few essential tags so that the browser can tell HTML apart from plain text. Beyond those, you can use as many tags as your page requires.

All HTML tags must be enclosed within angle brackets: < >.
Each HTML tag performs a different task.
If you use an opening tag <tag>, you must also use a closing tag </tag> (void elements are the exception).

https://www.w3schools.com/TAgs/default.asp
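
The parts of a tag described above map onto the objects Beautiful Soup returns. The following is a minimal sketch on a made-up snippet (the markup and the names `snippet`, `doc`, and `p` are illustrative assumptions, not part of the sample data used in this notebook):

```python
from bs4 import BeautifulSoup

# A tiny, made-up snippet: one paired tag with attributes and one void tag.
snippet = '<p class="intro" id="first">Hello, <b>world</b>!</p><br>'
doc = BeautifulSoup(snippet, 'html.parser')

p = doc.find('p')
print(p.name)          # the tag name
print(p.attrs)         # the attributes as a dict; class is stored as a list
print(p.get_text())    # the text content with nested tags removed
print(doc.find('br'))  # a void element has no content and no closing tag
```

Note that `class` comes back as a list because an element may carry several classes.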

In [5]:
soup.find('head')           # Returns the entire element, from the opening <head> to the closing </head>

<head>
<title>Web Scraping</title>
</head>

In [6]:
soup.find('title').text     # Returns only the text as a string

'Web Scraping'

In [7]:
soup.find('p').text         # Finds only the first occurrence of the 'p' tag and returns its content as a string

'this is the session that will give the idea about collecting our own data'

In [8]:
soup.find('ul')             # find works with any tag name, but returns only the first occurrence

<ul class="test1" id="test1">
<li>python</li>
<li>ml</li>
<li>Dl</li>
<li>NLP</li>
</ul>

In [9]:
soup.find_all('ul')       # find_all returns all occurrences of the tag

[<ul class="test1" id="test1">
 <li>python</li>
 <li>ml</li>
 <li>Dl</li>
 <li>NLP</li>
 </ul>,
 <ul class="test2" id="test2">
 <li>python</li>
 <li>ml</li>
 <li>Dl</li>
 <li>NLP</li>
 </ul>,
 <ul class="test3" id="test3">
 <li>python</li>
 <li>ml</li>
 <li>Dl</li>
 <li>NLP</li>
 </ul>]

In [10]:
# Loop over the list of tags and print the text of each one

for i in soup.find_all('ul'):
    print(i.text)


python
ml
Dl
NLP


python
ml
Dl
NLP


python
ml
Dl
NLP



In [11]:
# If a tag name alone matches more than one element, find_all can also filter by attribute.
# Here we match 'ul' (unordered list) elements whose class is 'test2'.
soup.find_all('ul', {'class':'test2'})

[<ul class="test2" id="test2">
 <li>python</li>
 <li>ml</li>
 <li>Dl</li>
 <li>NLP</li>
 </ul>]
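
The attribute filter above can be written in a few equivalent ways. The sketch below uses a shortened copy of the sample lists (an assumption made to keep the example self-contained) and compares the dictionary form with the `class_` and `id` keyword arguments and with a CSS selector passed to `select`:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample lists, kept small for illustration.
html = '''
<ul class="test1" id="test1"><li>python</li><li>ml</li></ul>
<ul class="test2" id="test2"><li>Dl</li><li>NLP</li></ul>
'''
doc = BeautifulSoup(html, 'html.parser')

by_dict = doc.find_all('ul', {'class': 'test2'})   # attribute dictionary
by_kwarg = doc.find_all('ul', class_='test2')      # class_ avoids the Python keyword 'class'
by_id = doc.find_all('ul', id='test2')             # filter on the id attribute instead
by_css = doc.select('ul.test2 > li')               # CSS selector: <li> children of ul.test2

items = [li.get_text() for li in by_css]
print(items)
```

All three `find_all` calls return the same single `<ul>` element; `select` is convenient when the target is nested inside the matched element.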

# Practical Use Case 1:

Extract the table data from https://en.wikipedia.org/wiki/List_of_countries_by_coffee_production
and export it to a CSV file.

In [12]:
url = "https://en.wikipedia.org/wiki/List_of_countries_by_coffee_production"

In [13]:
page = requests.get(url)              # Keep TLS certificate verification enabled (the default)
soup = bs(page.text, 'html.parser')
soup



<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of countries by coffee production - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"810da5da-1b4e-478f-9666-01b19c8f6dc4","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_by_coffee_production","wgTitle":"List of countries by coffee production","wgCurRevisionId":1085331524,"wgRevisionId":1085331524,"wgArticleId":36196672,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidat

In [14]:
table = soup.find('table')
table

<table border="1" class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Country
</th>
<th>60 kilogram bags
</th>
<th>Metric tons
</th>
<th>Pounds
</th></tr>
<tr>
<td>1
</td>
<td><span class="flagicon"><a href="/wiki/Brazil" title="Brazil"><img alt="Brazil" class="thumbborder" data-file-height="504" data-file-width="720" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/0/05/Flag_of_Brazil.svg/22px-Flag_of_Brazil.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/0/05/Flag_of_Brazil.svg/33px-Flag_of_Brazil.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/0/05/Flag_of_Brazil.svg/43px-Flag_of_Brazil.svg.png 2x" width="22"/></a></span> <a href="/wiki/Coffee_production_in_Brazil" title="Coffee production in Brazil">Brazil</a>
</td>
<td>44,200,000
</td>
<td>2,652,000
</td>
<td>5,714,381,000
</td></tr>
<tr>
<td>2
</td>
<td><span class="flagicon"><a href="/wiki/Vietnam" title="Vietnam"><img alt="Vietnam" class="thumbborder" data-file-height="600" d

In [15]:
top_10_data = []
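for row in table.find_all('tr')[1:10]:   # Skip the header row; this slice yields 9 rows (use [1:11] for 10)
    temp = row.text.replace('\n\n', ' ').strip()
    temp = temp.split()                  # Splits on whitespace, so a multi-word country name would be broken up
    top_10_data.append(temp)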

In [16]:
top_10_data

[['1', 'Brazil', '44,200,000', '2,652,000', '5,714,381,000'],
 ['2', 'Vietnam', '27,500,000', '1,650,000', '3,637,627,000'],
 ['3', 'Colombia', '13,500,000', '810,000', '1,785,744,000'],
 ['4', 'Indonesia', '11,000,000', '660,000', '1,455,050,000'],
 ['5', 'Ethiopia', '6,400,000', '384,000', '846,575,000'],
 ['6', 'Honduras', '5,800,000', '348,000', '767,208,000'],
 ['7', 'India', '5,800,000', '348,000', '767,208,000'],
 ['8', 'Uganda', '4,800,000', '288,000', '634,931,000'],
 ['9', 'Mexico', '3,900,000', '234,000', '515,881,000']]

In [17]:
# Convert the table to a dataframe:

df = pd.DataFrame(top_10_data)
df

Unnamed: 0,0,1,2,3,4
0,1,Brazil,44200000,2652000,5714381000
1,2,Vietnam,27500000,1650000,3637627000
2,3,Colombia,13500000,810000,1785744000
3,4,Indonesia,11000000,660000,1455050000
4,5,Ethiopia,6400000,384000,846575000
5,6,Honduras,5800000,348000,767208000
6,7,India,5800000,348000,767208000
7,8,Uganda,4800000,288000,634931000
8,9,Mexico,3900000,234000,515881000
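
Splitting the row text on whitespace works for the rows above, but it would break cells that contain spaces, and the export step is not shown. The sketch below reads each cell separately and writes CSV. It runs on the small company table from the sample HTML so that it stays offline, and it writes to an in-memory buffer, which is an assumption made for illustration; replacing the buffer with `open('some_file.csv', 'w', newline='')` would write to disk.

```python
import csv
import io
from bs4 import BeautifulSoup

# Part of the sample table from the beginning of this notebook.
html = '''
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Centro comercial Moctezuma</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>
'''
table = BeautifulSoup(html, 'html.parser').find('table')

rows = []
for tr in table.find_all('tr'):
    # One list entry per cell, so spaces inside a cell are preserved
    cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    rows.append(cells)

# Write the rows as CSV into an in-memory buffer
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```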


# Practical Use Case 2:

Extract the tabular data from the below url:

https://www.imdb.com/chart/top/

In [18]:
url = "https://www.imdb.com/chart/top/"
page = requests.get(url)              # Keep TLS certificate verification enabled (the default)
soup = bs(page.text, 'html.parser')
table = soup.find('table')
table



<table class="chart full-width" data-caller-name="chart-top250movie">
<colgroup>
<col class="chartTableColumnPoster"/>
<col class="chartTableColumnTitle"/>
<col class="chartTableColumnIMDbRating"/>
<col class="chartTableColumnYourRating"/>
<col class="chartTableColumnWatchlistRibbon"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>
</thead>
<tbody class="lister-list">
<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.234390369769061" name="ir"></span>
<span data-value="7.791552E11" name="us"></span>
<span data-value="2614694" name="nv"></span>
<span data-value="-1.765609630230939" name="ur"></span>
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption" height="67" src="https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
    

In [19]:
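top_250_data = []
for row in table.find_all('tr')[1:251]:   # Skip the header row and take up to 250 movies
    # Collapse newlines, then split on runs of three spaces to separate the fields
    temp = row.text.replace('\n', ' ').strip().split('   ')
    # Fields 2 and 3 hold the title (with year) and the rating
    top_250_data.append([temp[2].strip(), temp[3].strip()])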

In [20]:
top_250_data

[['The Shawshank Redemption (1994)', '9.2'],
 ['The Godfather (1972)', '9.2'],
 ['The Dark Knight (2008)', '9.0'],
 ['The Godfather Part II (1974)', '9.0'],
 ['12 Angry Men (1957)', '8.9'],
 ["Schindler's List (1993)", '8.9'],
 ['The Lord of the Rings: The Return of the King (2003)', '8.9'],
 ['Pulp Fiction (1994)', '8.9'],
 ['The Lord of the Rings: The Fellowship of the Ring (2001)', '8.8'],
 ['Il buono, il brutto, il cattivo (1966)', '8.8'],
 ['Forrest Gump (1994)', '8.8'],
 ['Fight Club (1999)', '8.7'],
 ['Inception (2010)', '8.7'],
 ['The Lord of the Rings: The Two Towers (2002)', '8.7'],
 ['The Empire Strikes Back (1980)', '8.7'],
 ['The Matrix (1999)', '8.7'],
 ['Goodfellas (1990)', '8.7'],
 ["One Flew Over the Cuckoo's Nest (1975)", '8.6'],
 ['Se7en (1995)', '8.6'],
 ['Shichinin no samurai (1954)', '8.6'],
 ["It's a Wonderful Life (1946)", '8.6'],
 ['The Silence of the Lambs (1991)', '8.6'],
 ['Cidade de Deus (2002)', '8.6'],
 ['Saving Private Ryan (1998)', '8.6'],
 ['La vita è 
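
Splitting on runs of spaces depends on the exact whitespace in the page. An alternative sketch reads the values from the tags and attributes visible in the table output above (`posterColumn`, the `rk` and `ir` spans, and the image `alt` text). The HTML below is a trimmed copy of that output, and the live page markup may differ, so treat the selectors as assumptions:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first row shown in the table output above.
html = '''
<table><tbody class="lister-list">
<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="9.234390369769061" name="ir"></span>
<a href="/title/tt0111161/"> <img alt="The Shawshank Redemption"/></a>
</td>
</tr>
</tbody></table>
'''
doc = BeautifulSoup(html, 'html.parser')

movies = []
for cell in doc.find_all('td', class_='posterColumn'):
    # 'name' is also a find() parameter, so the HTML name attribute goes in attrs
    rank = int(cell.find('span', attrs={'name': 'rk'})['data-value'])
    rating = float(cell.find('span', attrs={'name': 'ir'})['data-value'])
    title = cell.find('img')['alt']
    movies.append((rank, title, round(rating, 1)))

print(movies)
```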

In [21]:
# Convert the table to a dataframe:

df = pd.DataFrame(top_250_data, columns = ['Title', 'IMDb Rating'])
df



# Practical Use Case 3:

Extract data from 2013 to 2015 (for all 12 months) from https://en.tutiempo.net/climate/01-2013/ws-421820.html

~30 days * 12 months * 3 years = ~1080 rows of data

Each page at the above link holds the data for one month.
Hence we need to change the month and year in the URL to fetch the data for every month of every year.
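
The URL pattern can be sketched as a small helper that builds one address per month. The function name `monthly_urls` and its `station` parameter are hypothetical names introduced for illustration; the URL format itself is taken from the link above.

```python
def monthly_urls(start_year, end_year, station='ws-421820'):
    """Build one URL per month for every year in the inclusive range."""
    base = 'https://en.tutiempo.net/climate'
    return [
        # {month:02d} zero-pads the month, so 1 becomes '01'
        f'{base}/{month:02d}-{year}/{station}.html'
        for year in range(start_year, end_year + 1)
        for month in range(1, 13)
    ]

urls = monthly_urls(2013, 2015)
print(len(urls))   # 3 years * 12 months
print(urls[0])
print(urls[-1])
```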

# Part 1: Data Collection

In [22]:
# Collecting and saving the html files in local storage

In [23]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import numpy as np
import os
import time
import warnings
warnings.filterwarnings('ignore')

In [24]:
def retrieve_html(start_year, end_year):
    """Download one page per month and save it as Webscraping_Data/html_data/<year>/<MM>.html."""
    for year in range(start_year, end_year + 1):
        # Create the directory for this year if it does not exist yet
        out_dir = os.path.join('Webscraping_Data', 'html_data', str(year))
        os.makedirs(out_dir, exist_ok=True)

        for month in range(1, 13):
            # {month:02d} zero-pads the month, so 1 becomes '01'
            url = f'https://en.tutiempo.net/climate/{month:02d}-{year}/ws-421820.html'
            data = requests.get(url).text

            # File Handling - Save the page as an html file so later steps can run offline
            with open(os.path.join(out_dir, f'{month:02d}.html'), 'w') as f:
                f.write(data)

In [25]:
# Create a folder structure to store the data so that we can work offline:

# Webscraping_Data
#     - html_data
#                - 2013
#                      - 01.html
#                      - 02.html
#                      - 03.html

In [26]:
# Call function to retrieve HTML data from the url. Comment it out afterwards.
# Note: Internet connection is needed to execute this:

start_time = time.time()
retrieve_html(2013, 2015)
end_time = time.time()
print("Done, Total Time: ", end_time - start_time)

Done, Total Time:  29.475650787353516


# Part 2: Read and Consolidate the HTML Files

In [27]:
df_res = pd.DataFrame()
# Extract all the folder names in the directory 'Webscraping_Data/html_data':
base_dir = os.path.join('Webscraping_Data', 'html_data')   # os.path.join keeps the path portable across operating systems
for _, dirs, _ in os.walk(base_dir):
    break

# Extract all the file names in each folder:
for sub_dir in dirs:
    for _, _, files in os.walk(os.path.join(base_dir, sub_dir)):
        for file in files:
            path = os.path.join(base_dir, sub_dir, file)
            #print(path)            
            
            with open(path, 'r') as page:
                soup = bs(page, 'html.parser')
                table = soup.find('table', {'class':'medias mensuales numspan'})
                
                daily_data = []
                for rows in table.find_all('td'):  
                    daily_data.append(rows.text)
                    
                res_daily_data = []
                daily = []
                n = 0

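                for val in daily_data:
                    if n<=13:
                        # Collect the first 14 cells of the current table row
                        n = n + 1
                        res_daily_data.append(val)
                        
                    else:
                        # The 15th cell is skipped; store the finished row and start a new one
                        daily.append(res_daily_data)
                        res_daily_data = []
                        n = 0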
                
                df = pd.DataFrame(daily, index = None, columns = ['Day', 
                                                                  'Average Temperature',
                                                                  'Maximum temperature',
                                                                  'Minimum temperature', 
                                                                  'Atmospheric pressure at sea level (hPa)', 
                                                                  'Average relative humidity', 
                                                                  'Total rainfall and / or snowmelt (mm)', 
                                                                  'Average visibility (Km)', 
                                                                  'Average wind speed (Km/h)', 
                                                                  'Maximum sustained wind speed (Km/h)', 
                                                                  'VG', 'RA', 'SN', 'TS'])
                
                df['Day'] = df['Day'].astype(str).str.zfill(2)
                df['Date'] = df['Day'] + '/' + file.split('.')[0] + '/' + sub_dir
                df = df[:-1]
                df = df[['Date', 
                         'Average Temperature',
                         'Maximum temperature',
                         'Minimum temperature', 
                         'Atmospheric pressure at sea level (hPa)', 
                         'Average relative humidity', 
                         'Total rainfall and / or snowmelt (mm)', 
                         'Average visibility (Km)', 
                         'Average wind speed (Km/h)', 
                         'Maximum sustained wind speed (Km/h)', 
                         'VG', 'RA', 'SN', 'TS']]
                #print (df)
                df_res = pd.concat([df_res, df])   # DataFrame.append was removed in pandas 2.0
                
final_features = df_res.iloc[:, :10].copy()   # Copy so that later edits do not act on a slice of df_res
final_features.to_csv('Webscraping_Data/independent_variables.csv', index = None)
final_features.head(10)

Unnamed: 0,Date,Average Temperature,Maximum temperature,Minimum temperature,Atmospheric pressure at sea level (hPa),Average relative humidity,Total rainfall and / or snowmelt (mm),Average visibility (Km),Average wind speed (Km/h),Maximum sustained wind speed (Km/h)
0,01/01/2013,9.1,15.3,4.0,1015.6,90.0,0.0,0.5,0.0,-
1,02/01/2013,7.4,9.8,4.8,1017.6,93.0,0.0,0.5,4.3,9.4
2,03/01/2013,7.8,12.7,4.4,1018.5,87.0,0.0,0.6,4.4,11.1
3,04/01/2013,,,,,,,,,
4,05/01/2013,,,,,,,,,
5,06/01/2013,,,,,,,,,
6,07/01/2013,6.7,13.4,2.4,1019.4,82.0,0.0,0.6,4.8,11.1
7,08/01/2013,8.6,15.5,3.3,1018.7,72.0,0.0,0.8,8.1,20.6
8,09/01/2013,12.4,20.9,4.4,1017.3,61.0,0.0,1.3,8.7,22.2
9,10/01/2013,,,,,,,,,
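
The counter loop in the cell above regroups a flat list of cell values into rows: it keeps 14 values and skips the next one. A minimal sketch of the same idea with slicing follows. The helper name `chunk_rows` is hypothetical, and the width of 15 cells per table row is an assumption inferred from that loop, not something confirmed from the page itself.

```python
def chunk_rows(values, width, keep):
    """Split a flat list into rows of `width` cells and keep the first `keep` cells of each."""
    rows = []
    # Step through the list one full row at a time; an incomplete final row is ignored
    for start in range(0, len(values) - width + 1, width):
        rows.append(values[start:start + keep])
    return rows

# Stand-in values: two rows of 15 cells each
flat = [str(i) for i in range(30)]
rows = chunk_rows(flat, width=15, keep=14)
print(len(rows))
print(rows[1][0], rows[1][-1])
```

Iterating over `table.find_all('tr')` and reading the cells of each row would avoid the need to know the row width at all.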


# Replace Null and Zero Values with Median Value

In [28]:
# Replace empty cells in the 'Average Temperature' column with NaN and drop those rows:
final_features['Average Temperature'] = final_features['Average Temperature'].replace('', np.nan)
final_features = final_features.dropna(subset=['Average Temperature'])

# Replace '-' values with NaN
final_features = final_features.replace('-', np.nan)

# Replace 0 values with NaN (matches numeric zeros only; the string '0' is left as is)
final_features = final_features.replace(0, np.nan)

# Replace all NaN values with the column median
final_features = final_features.fillna(final_features.median())

# Save the final dataframe as a csv file for later reference
final_features.to_csv('Webscraping_Data/cleaned_data.csv', index = None)
final_features.head()

Unnamed: 0,Date,Average Temperature,Maximum temperature,Minimum temperature,Atmospheric pressure at sea level (hPa),Average relative humidity,Total rainfall and / or snowmelt (mm),Average visibility (Km),Average wind speed (Km/h),Maximum sustained wind speed (Km/h)
0,01/01/2013,9.1,15.3,4.0,1015.6,90,0,0.5,0.0,14.8
1,02/01/2013,7.4,9.8,4.8,1017.6,93,0,0.5,4.3,9.4
2,03/01/2013,7.8,12.7,4.4,1018.5,87,0,0.6,4.4,11.1
6,07/01/2013,6.7,13.4,2.4,1019.4,82,0,0.6,4.8,11.1
7,08/01/2013,8.6,15.5,3.3,1018.7,72,0,0.8,8.1,20.6


In [29]:
# Check for NaN Values:
final_features.isnull().values.any()

False
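
Values scraped from HTML arrive as strings, so the median step depends on how pandas treats text columns. The following is a hedged sketch of one way to make the cleaning explicit: convert the measurement columns to numbers first, then fill the gaps. It uses a few values from the preview above in a small stand-in frame; the names `raw`, `clean`, and `numeric_cols` are illustrative, and this is not the method used in the cells above.

```python
import pandas as pd

# Stand-in frame with a few values from the preview above; every value is a string
raw = pd.DataFrame({
    'Date': ['01/01/2013', '02/01/2013', '03/01/2013', '04/01/2013'],
    'Average Temperature': ['9.1', '7.4', '7.8', ''],
    'Maximum sustained wind speed (Km/h)': ['-', '9.4', '11.1', ''],
})

numeric_cols = raw.columns.drop('Date')
clean = raw.copy()

# errors='coerce' turns anything that is not a number ('' or '-') into NaN
clean[numeric_cols] = clean[numeric_cols].apply(pd.to_numeric, errors='coerce')

# Drop days without a temperature reading, then fill the remaining gaps with the column median
clean = clean.dropna(subset=['Average Temperature'])
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())
print(clean)
```

After the conversion the columns have a numeric dtype, so later statistics do not depend on string handling.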

In [30]:
final_features.to_csv('Webscraping_Data/Weather_Report.csv', index = None)