## Project to demonstrate my knowledge of web scraping, data analysis and Python

In [66]:
from bs4 import BeautifulSoup 
import requests

<h2 id="BSO">Beautiful Soup Objects</h2>

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It does this by representing the HTML as a set of objects whose methods are used to parse the document. We can navigate the HTML as a tree and/or filter out what we are looking for.

Consider the following HTML:

In [67]:
%%html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>

We can store it as a string in the variable <code>html</code>:

In [68]:
html="""
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>
"""

To parse a document, pass it into the <code>BeautifulSoup</code> constructor. This returns a <code>BeautifulSoup</code> object, which represents the document as a nested data structure.

In [69]:
soup=BeautifulSoup(html,"html5lib")

First, the document is converted to Unicode (a character set that includes ASCII), and HTML entities are converted to Unicode characters. Beautiful Soup then transforms the HTML document into a tree of Python objects. The <code>BeautifulSoup</code> object can create other types of objects. In this lab, we will cover <code>BeautifulSoup</code> and <code>Tag</code> objects, which for the purposes of this lab can be treated as identical, and <code>NavigableString</code> objects.
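The three object types mentioned above can be inspected directly. The following is a minimal sketch, assuming a shortened copy of the HTML (the names `sample`, `sample_soup` and `bold_tag` are illustrative) and the built-in `html.parser` so that no extra parser is needed:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Shortened, illustrative copy of the document used in this notebook
sample = "<html><body><h3><b id='boldest'>Lebron James</b></h3></body></html>"
sample_soup = BeautifulSoup(sample, "html.parser")

bold_tag = sample_soup.b

# Show the class of each object
print(type(sample_soup).__name__)
print(type(bold_tag).__name__)
print(type(bold_tag.string).__name__)

# BeautifulSoup is a subclass of Tag, which is why both can be navigated alike
print(isinstance(sample_soup, Tag))
# NavigableString is a subclass of str
print(isinstance(bold_tag.string, str))
```

Because `BeautifulSoup` derives from `Tag`, methods such as `find_all()` are available on the whole document and on any single tag.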

We can use the <code>prettify()</code> method to display the HTML with its nested structure.

In [70]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>



## Tags

Let's say we want the title of the page and the name of the top-paid player. We can use the <code>Tag</code> object, which corresponds to an HTML tag in the original document, for example the <code>title</code> tag.

In [71]:
tag_object=soup.title
print(f"tag object: {tag_object}")
print(tag_object.string) # .string returns only the text inside the tag

tag object: <title>Page Title</title>
Page Title


In [72]:
tag_object=soup.h3.string
print(
    f"Type of the string attribute: {type(tag_object)} \n",
    tag_object+" hi \n",
    tag_object[:5]
      )
 

Type of the string attribute: <class 'bs4.element.NavigableString'> 
 Lebron James hi 
 Lebro


As this shows, the <code>string</code> attribute of an HTML tag behaves like a Python string: we can concatenate text to it and slice it.
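As a small illustration of that behaviour, the sketch below applies ordinary string methods to a `NavigableString` and converts it to a plain `str`. The names `snippet`, `name` and `plain_name` are made up for this example.

```python
from bs4 import BeautifulSoup

snippet = "<h3><b id='boldest'>Lebron James</b></h3>"
name = BeautifulSoup(snippet, "html.parser").b.string

# Ordinary str methods work on a NavigableString
print(name.upper())
print(name.split())

# str() gives a plain string that no longer refers to the parse tree
plain_name = str(name)
print(type(plain_name).__name__)
```

Converting with `str()` can be useful when the text is kept for a long time, since a plain string does not hold a reference to the parsed document.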

We can also move through the tree with these attributes:

1. This moves us up the tree

    - parent

2. This moves us sideways (left to right) to the next node on the same level of the tree
    - next_sibling
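The two attributes listed above can be tried on a small fragment. This is only a sketch with an illustrative two-tag snippet. It also shows one detail worth knowing: `next_sibling` returns the very next node, which may be the whitespace between two tags, whereas `find_next_sibling()` skips ahead to the next tag.

```python
from bs4 import BeautifulSoup

# Illustrative fragment: note the newline between </h3> and <p>
snippet = "<body><h3><b>Lebron James</b></h3>\n<p>Salary: $ 92,000,000</p></body>"
nav_soup = BeautifulSoup(snippet, "html.parser")

b_tag = nav_soup.b
h3_tag = b_tag.parent            # go up one level
print(h3_tag.name)

# The immediate sibling is the newline text node, not the <p> tag
print(repr(h3_tag.next_sibling))

# find_next_sibling() ignores text nodes and returns the next matching tag
print(h3_tag.find_next_sibling("p"))
```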

In [73]:
tag_object=soup.h3
tag_child=tag_object.b # the <b> tag is a child of the <h3> tag
print(tag_child)

<b id="boldest">Lebron James</b>


In [74]:
tag_child.parent # returns the enclosing tag, including everything nested inside it

<h3><b id="boldest">Lebron James</b></h3>

In [84]:
tag_object=soup.h3
sibling_1=tag_object.find_next_sibling(True)
print(sibling_1)

<p> Salary: $ 92,000,000 </p>


In [92]:
sibling_2=sibling_1.find_next_sibling(True)
print(sibling_2)

<h3> Stephen Curry</h3>


## Let's try with tables

In [113]:
%%html
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>

Flight No  Launch site  Payload mass
1          Florida      300 kg
2          Texas        94 kg
3          Florida      80 kg


In [98]:
html2="""
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>"""

In [99]:
table_bs=BeautifulSoup(html2,"html5lib")
print(table_bs.prettify())

<html>
 <head>
 </head>
 <body>
  <table>
   <tbody>
    <tr>
     <td id="flight">
      Flight No
     </td>
     <td>
      Launch site
     </td>
     <td>
      Payload mass
     </td>
    </tr>
    <tr>
     <td>
      1
     </td>
     <td>
      <a href="https://en.wikipedia.org/wiki/Florida">
       Florida
      </a>
     </td>
     <td>
      300 kg
     </td>
    </tr>
    <tr>
     <td>
      2
     </td>
     <td>
      <a href="https://en.wikipedia.org/wiki/Texas">
       Texas
      </a>
     </td>
     <td>
      94 kg
     </td>
    </tr>
    <tr>
     <td>
      3
     </td>
     <td>
      <a href="https://en.wikipedia.org/wiki/Florida">
       Florida
      </a>
      <a>
      </a>
     </td>
     <td>
      80 kg
     </td>
    </tr>
   </tbody>
  </table>
 </body>
</html>



In [110]:
rows=table_bs.find_all(name="tr") # find_all returns a list with every tag that has this name
print(rows)

[<tr>
    <td id="flight">Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>, <tr> 
    <td>1</td>
    <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a></td>
    <td>300 kg</td>
  </tr>, <tr>
    <td>2</td>
    <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
    <td>94 kg</td>
  </tr>, <tr>
    <td>3</td>
    <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
    <td>80 kg</td>
  </tr>]


### A table can be parsed like a matrix: iterate over the rows first, then over the columns of each row

The first row holds the header of the table.
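To make the matrix idea concrete, here is a minimal sketch that stores the cells in a list of lists, one inner list per row. It assumes a simplified copy of the table without links, and the names `small_table`, `matrix`, `header` and `body` are illustrative.

```python
from bs4 import BeautifulSoup

# Simplified, illustrative copy of the launch table (no links)
small_table = """<table>
<tr><td>Flight No</td><td>Launch site</td><td>Payload mass</td></tr>
<tr><td>1</td><td>Florida</td><td>300 kg</td></tr>
<tr><td>2</td><td>Texas</td><td>94 kg</td></tr>
</table>"""
table_soup = BeautifulSoup(small_table, "html.parser")

# Outer loop over rows, inner loop over the cells of each row
matrix = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in table_soup.find_all("tr")
]

# Split the header row from the data rows
header, *body = matrix
print(header)
print(body[1][1])  # data row 1, column 1
```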

In [112]:
for row in rows: 
    cols=row.find_all("td")
    for col in cols:
        print(col.string)
        

Flight No
Launch site
Payload mass
1
Florida
300 kg
2
Texas
94 kg
3
None
80 kg
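The `None` printed for the launch site of flight 3 comes from the unclosed `<a>` in that row: the parser ends up with two `<a>` tags inside the cell, and `.string` returns `None` whenever a tag has more than one child. A possible workaround, sketched here on an illustrative copy of that row, is `get_text()`, which joins the text of all descendants:

```python
from bs4 import BeautifulSoup

# Illustrative copy of the third data row as the parser repaired it
row_html = """<table><tr>
<td>3</td>
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a><a> </a></td>
<td>80 kg</td>
</tr></table>"""
cells = BeautifulSoup(row_html, "html.parser").find_all("td")

for cell in cells:
    # .string is None for the cell with two children; get_text() still returns its text
    print(repr(cell.string), "->", cell.get_text(strip=True))
```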


If we want the links, we can look for the <code>a</code> tags and get their <code>href</code> attribute.

In [120]:
table_bs.find_all(name="a") # this finds every <a> tag,
# but an <a> tag does not necessarily have an href attribute

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a> </a>]

In [121]:
table_bs.find_all(href=True) # this matches only tags that have an href attribute,
# so every result is a link

[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>,
 <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>,
 <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]

In [128]:
for link in table_bs.find_all(href=True):
    print(link.get("href"))

https://en.wikipedia.org/wiki/Florida
https://en.wikipedia.org/wiki/Texas
https://en.wikipedia.org/wiki/Florida


## Scrape data from HTML tables into a DataFrame using BeautifulSoup and Pandas

In [134]:
import pandas as pd

The webpage we use contains HTML tables with data about the world population.

In [135]:
url = "https://en.wikipedia.org/wiki/World_population"

Before scraping a website, you need to examine its contents and the way the data is organized on it.

Then, get the contents of the webpage in text format and store them in a variable called <code>data</code>.

In [136]:
data=requests.get(url).text
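The one-line request above does not check whether the download succeeded. A slightly more defensive variant could look like the sketch below. The helper name `fetch_html` and the `User-Agent` value are assumptions made for this example, not part of the original notebook. Some sites, Wikipedia among them, ask automated clients to identify themselves with such a header.

```python
import requests


def fetch_html(page_url, timeout=10):
    """Return the body of page_url as text, raising an error on HTTP failures."""
    # Hypothetical identification string; replace it with your own
    headers = {"User-Agent": "scraping-notebook-example/0.1"}
    response = requests.get(page_url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.text
```

With this helper, `data = fetch_html(url)` would take the place of `requests.get(url).text`, and a failed request stops the notebook instead of silently parsing an error page.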

In [137]:
soup=BeautifulSoup(data,"html.parser")

Now we have the whole webpage, so we need to find all the tables. Remember that <code>find_all()</code> returns a list in which each entry is one matching tag.

In [139]:
tables=soup.find_all("table") # the first parameter is name, which matches tag names

How many tables does the webpage have?

In [142]:
print(f"The webpage has {len(tables)} tables")

The webpage has 28 tables


Let's suppose we are looking for the table of the 10 most densely populated countries. We can look through the tables list and find the right one based on the data in each table, or we can search for the table's title if it appears in the table's HTML, although this option might not always work.

In [146]:
for index, table in enumerate(tables):
    if ("10 most densely populated countries" in str(table)):
        table_index=index
print(f"The table '10 most densely populated countries' is in the position {table_index + 1} of the list, \n"+
      f" which means the index is equal to {table_index}")

The table '10 most densely populated countries' is in the position 7 of the list, 
 which means the index is equal to 6
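The search above matches the title anywhere in the table's HTML and leaves `table_index` undefined when nothing matches. One alternative, sketched here on a made-up two-table page, is to compare only the `<caption>` text and return `None` when no table matches. The function name `find_table_index` and the first caption are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up page with two captioned tables
page = """
<table><caption>Some other table</caption><tr><td>1</td></tr></table>
<table><caption>10 most densely populated countries <small>(with population above 5 million)</small></caption>
<tr><td>1</td></tr></table>
"""
demo_tables = BeautifulSoup(page, "html.parser").find_all("table")


def find_table_index(table_list, caption_text):
    """Return the index of the first table whose caption contains caption_text, else None."""
    for index, table in enumerate(table_list):
        caption = table.caption  # None when the table has no <caption>
        if caption is not None and caption_text in caption.get_text():
            return index
    return None


print(find_table_index(demo_tables, "10 most densely populated countries"))
```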


In [147]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
  <sup class="reference" id="cite_ref-:10_105-0">
   <a href="#cite_note-:10-105">
    [101]
   </a>
  </sup>
 </caption>
 <tbody>
  <tr>
   <th scope="col">
    Rank
   </th>
   <th scope="col">
    Country
   </th>
   <th scope="col">
    Population
   </th>
   <th scope="col">
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th scope="col">
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload

Now we need to retrieve the headers of the data

In [151]:
t_pop=tables[table_index].tr # t_pop = header row of the "10 most densely populated countries" table
t_pop

<tr>
<th scope="col">Rank
</th>
<th scope="col">Country
</th>
<th scope="col">Population
</th>
<th scope="col">Area<br/><small>(km<sup>2</sup>)</small>
</th>
<th scope="col">Density<br/><small>(pop/km<sup>2</sup>)</small>
</th></tr>

In [173]:
columns_names=[]
for header in t_pop.find_all(name="th"):
    print(header)
    if header.string is not None:
        columns_names.append(header.string.replace("\n",""))
        
columns_names

<th scope="col">Rank
</th>
<th scope="col">Country
</th>
<th scope="col">Population
</th>
<th scope="col">Area<br/><small>(km<sup>2</sup>)</small>
</th>
<th scope="col">Density<br/><small>(pop/km<sup>2</sup>)</small>
</th>


['Rank', 'Country', 'Population']

As we can see, the column names "Area" and "Density" are split across several nested objects, so <code>header.string</code> returns <code>None</code> in those cases. We could retrieve those names with regular expressions, but we will not do so in this notebook.
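Besides regular expressions, Beautiful Soup itself offers a way to read text that is spread over nested tags. The sketch below uses a shortened copy of the header row. `stripped_strings` yields every text fragment inside a tag, so its first item is the short name, while `get_text(strip=True)` joins all fragments.

```python
from bs4 import BeautifulSoup

# Shortened copy of the header row printed above
header_html = """<table><tr>
<th scope="col">Rank
</th>
<th scope="col">Area<br/><small>(km<sup>2</sup>)</small>
</th>
<th scope="col">Density<br/><small>(pop/km<sup>2</sup>)</small>
</th></tr></table>"""
header_row = BeautifulSoup(header_html, "html.parser").tr

# First text fragment of each header cell
short_names = [next(th.stripped_strings) for th in header_row.find_all("th")]
# All text fragments of each header cell, joined without a separator
full_names = [th.get_text(strip=True) for th in header_row.find_all("th")]

print(short_names)
print(full_names)
```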

We add the two missing column names by hand. Once we have the column names, we need to get the values of each row and column.

In [174]:
columns_names.append("Area")
columns_names.append("Density")
columns_names

['Rank', 'Country', 'Population', 'Area', 'Density']

In [230]:

rows_data=[] # list that will hold one dictionary per table row
for row in tables[table_index].find_all("tr"):
    cols = row.find_all("td") # list of <td> tags; it is empty for the header row
    if cols:
        # remove the newlines left inside each cell, then pair the values with the column names
        values=[col.text.replace("\n","") for col in cols[:5]]
        rows_data.append(dict(zip(columns_names,values)))

pop_data=pd.DataFrame(rows_data)
pop_data

   Rank         Country  Population     Area  Density
0     1       Singapore     5921231      719     8235
1     2      Bangladesh   165650475   148460     1116
2     3  Palestine[102]     5223000     6025      867
3     4          Taiwan    23580712    35980      655
4     5     South Korea    51844834    99720      520
5     6         Lebanon     5296814    10400      509
6     7          Rwanda    13173730    26338      500
7     8         Burundi    12696478    27830      456
8     9           India  1389637446  3287263      423
9    10     Netherlands    17400824    41543      419
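Every value scraped cell by cell arrives as text, so the numeric columns need converting before any calculation. A minimal sketch of that step, using two rows copied from the table above and an illustrative DataFrame named `scraped`:

```python
import pandas as pd

# Two rows copied from the table above; every value is text, as scraped
scraped = pd.DataFrame({
    "Country": ["Singapore", "Bangladesh"],
    "Population": ["5921231", "165650475"],
    "Density": ["8235", "1116"],
})

numeric_columns = ["Population", "Density"]
for column in numeric_columns:
    # Remove thousands separators, if a page uses them, before converting
    cleaned = scraped[column].str.replace(",", "", regex=False)
    scraped[column] = pd.to_numeric(cleaned)

print(scraped.dtypes)
print(scraped["Population"].sum())
```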


## Scrape data from HTML tables into DataFrame using BeautifulSoup and read_html

In [233]:
pd.read_html(str(tables[table_index]),flavor="bs4")

[   Rank         Country  Population  Area (km2)  Density (pop/km2)
 0     1       Singapore     5921231         719               8235
 1     2      Bangladesh   165650475      148460               1116
 2     3  Palestine[102]     5223000        6025                867
 3     4          Taiwan    23580712       35980                655
 4     5     South Korea    51844834       99720                520
 5     6         Lebanon     5296814       10400                509
 6     7          Rwanda    13173730       26338                500
 7     8         Burundi    12696478       27830                456
 8     9           India  1389637446     3287263                423
 9    10     Netherlands    17400824       41543                419]

Note that <code>pd.read_html()</code> always returns a list of DataFrames, even when it finds only one table, so we select the first element.
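Recent pandas versions emit a deprecation warning when `read_html()` receives the HTML as a plain string and recommend wrapping it in `io.StringIO`. The sketch below shows that form on a small made-up table. It assumes, like the cell above, that `bs4` and `html5lib` are installed for `flavor="bs4"`.

```python
from io import StringIO

import pandas as pd

# Small made-up table with values taken from the output above
table_html = """<table>
<tr><th>Rank</th><th>Country</th><th>Density</th></tr>
<tr><td>1</td><td>Singapore</td><td>8235</td></tr>
<tr><td>2</td><td>Bangladesh</td><td>1116</td></tr>
</table>"""

# StringIO makes the string look like a file, which read_html accepts
frames = pd.read_html(StringIO(table_html), flavor="bs4")
density_df = frames[0]

print(len(frames))
print(density_df.columns.tolist())
```

In this notebook the equivalent call would be `pd.read_html(StringIO(str(tables[table_index])), flavor="bs4")`.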

In [240]:
pop_data=pd.read_html(str(tables[table_index]),flavor="bs4")[0]
pop_data

   Rank         Country  Population  Area (km2)  Density (pop/km2)
0     1       Singapore     5921231         719               8235
1     2      Bangladesh   165650475      148460               1116
2     3  Palestine[102]     5223000        6025                867
3     4          Taiwan    23580712       35980                655
4     5     South Korea    51844834       99720                520
5     6         Lebanon     5296814       10400                509
6     7          Rwanda    13173730       26338                500
7     8         Burundi    12696478       27830                456
8     9           India  1389637446     3287263                423
9    10     Netherlands    17400824       41543                419


In [241]:
pop_data=pop_data.replace("Palestine[102]","Palestine")
pop_data

   Rank      Country  Population  Area (km2)  Density (pop/km2)
0     1    Singapore     5921231         719               8235
1     2   Bangladesh   165650475      148460               1116
2     3    Palestine     5223000        6025                867
3     4       Taiwan    23580712       35980                655
4     5  South Korea    51844834       99720                520
5     6      Lebanon     5296814       10400                509
6     7       Rwanda    13173730       26338                500
7     8      Burundi    12696478       27830                456
8     9        India  1389637446     3287263                423
9    10  Netherlands    17400824       41543                419
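Replacing one exact value works here, but the citation number can change whenever the page is edited. A more general cleanup, sketched on an illustrative three-row column, removes any bracketed number with a regular expression:

```python
import pandas as pd

# Illustrative column with one citation marker, as in the table above
countries = pd.DataFrame({"Country": ["Singapore", "Palestine[102]", "Taiwan"]})

# \[\d+\] matches a citation marker such as [102]
countries["Country"] = countries["Country"].str.replace(r"\[\d+\]", "", regex=True)

print(countries["Country"].tolist())
```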


## Use read_html() to get DataFrames directly from a URL

In [242]:
df_list=pd.read_html(url,flavor="bs4")

In [247]:
print("pd.read_html() may return fewer tables from a webpage than\n"+
      "BeautifulSoup.find_all('table') finds.\n\n")
print(f"Number of tables found with BeautifulSoup.find_all('table'): {len(tables)}\n")
print(f"Number of tables found with pd.read_html(): {len(df_list)}")

pd.read_html() may return fewer tables from a webpage than
BeautifulSoup.find_all('table') finds.


Number of tables found with BeautifulSoup.find_all('table'): 28

Number of tables found with pd.read_html(): 25


In [256]:
df_list[table_index]

   Rank         Country  Population  Area (km2)  Density (pop/km2)  Population trend[citation needed]
0     1           India  1389637446     3287263                423                            Growing
1     2        Pakistan   242923845      796095                305                    Rapidly growing
2     3      Bangladesh   165650475      148460               1116                            Growing
3     4           Japan   124214766      377915                329                     Declining[103]
4     5     Philippines   114597229      300000                382                            Growing
5     6         Vietnam   103808319      331210                313                            Growing
6     7  United Kingdom    67791400      243610                278                            Growing
7     8     South Korea    51844834       99720                520                             Steady
8     9          Taiwan    23580712       35980                655                             Steady
9    10       Sri Lanka    23187516       65610                353                            Growing

Because the two lists have different lengths, the same index points to a different table here. The <code>match</code> parameter of <code>pd.read_html()</code> selects tables by the text they contain instead of by position.


In [257]:
pd.read_html(url,match="10 most densely populated countries",flavor="bs4")[0]

   Rank         Country  Population  Area (km2)  Density (pop/km2)
0     1       Singapore     5921231         719               8235
1     2      Bangladesh   165650475      148460               1116
2     3  Palestine[102]     5223000        6025                867
3     4          Taiwan    23580712       35980                655
4     5     South Korea    51844834       99720                520
5     6         Lebanon     5296814       10400                509
6     7          Rwanda    13173730       26338                500
7     8         Burundi    12696478       27830                456
8     9           India  1389637446     3287263                423
9    10     Netherlands    17400824       41543                419
