# <center> Web Data Sources </center>

- [What are Web Data Sources](#section_1)
- [Pandas read_html() Function](#section_2)

<hr>

### What are Web Data Sources <a class="anchor" id="section_1"></a>

Data professionals sometimes need to pull external data sets from web pages into their analysis projects. 

For example, let's say you are working on a project and you need a data set about the population of each country. You do your research and find a table on a Wikipedia page that lists each country's population and its percentage change.

Let's see how we can scrape this table into a Pandas DataFrame and make use of it.

### Pandas read_html() Function <a class="anchor" id="section_2"></a>

The Pandas library offers the built-in [read_html()](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) function, which parses the HTML tables on a web page into a list of Pandas DataFrames. This gives users a fast way to access data tables embedded in a web page's HTML code. 

In the example below, we will extract [this table](https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)) from the Wikipedia page.

In [1]:
# Import pandas library
import pandas as pd

In [2]:
# Extract the HTML table from the wiki page above
web_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)")

# Display the data type and the number of items
type(web_data), len(web_data)

(list, 2)

We notice that the `web_data` variable is a Python list of 2 items.

The `read_html()` function searches the page for `<table>` elements and adds each one it can parse to the list as a separate DataFrame.
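To make this behaviour concrete without depending on a live page, here is a minimal sketch that runs `read_html()` on a small inline HTML string. The HTML below is invented for illustration (only the two population figures come from the Wikipedia table), and the example assumes an HTML parser that Pandas supports, such as `lxml`, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content: one paragraph and two <table> elements
html = """
<html><body>
  <p>Population figures</p>
  <table>
    <tr><th>Country</th><th>Population</th></tr>
    <tr><td>China</td><td>1433783686</td></tr>
    <tr><td>India</td><td>1366417754</td></tr>
  </table>
  <table>
    <tr><th>Topic</th><th>Links</th></tr>
    <tr><td>Global</td><td>Demographics of the world</td></tr>
  </table>
</body></html>
"""

# Wrapping the string in StringIO lets read_html() treat it like a file
tables = pd.read_html(StringIO(html))

# One DataFrame per <table>; the paragraph is ignored
print(type(tables), len(tables))
print(tables[0])
```

Each `<table>` becomes its own list item, which is why the Wikipedia page above produced more than one DataFrame.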

<br>

To find the table we need, we must first examine the items in this list to see what each one actually represents. 
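One possible way to examine the items is to loop over the list and print each DataFrame's position, shape and first few column names. The sketch below uses a hand-built list named `tables` as a hypothetical stand-in for `web_data`, so it runs without network access.

```python
import pandas as pd

# Hypothetical stand-ins for the DataFrames returned by pd.read_html()
tables = [
    pd.DataFrame({
        "Country / Area": ["China[a]", "India"],
        "Population(1 July 2019)": [1433783686, 1366417754],
    }),
    pd.DataFrame({"Unnamed: 0": ["Global"], "Links": ["Current population"]}),
]

# Summarize every candidate table: position, shape, first few column names
summary = [(i, df.shape, list(df.columns[:3])) for i, df in enumerate(tables)]

for index, shape, columns in summary:
    print(index, shape, columns)
```

A quick overview like this is usually enough to tell a data table from a navigation box before displaying any of them in full.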

In [3]:
# Select the first item in the list
web_countries_table = web_data[0]

# Display the DataFrame
web_countries_table

Unnamed: 0,Country / Area,UN continentalregion[4],UN statisticalsubregion[4],Population(1 July 2018),Population(1 July 2019),Change
0,China[a],Asia,Eastern Asia,1427647786,1433783686,+0.43%
1,India,Asia,Southern Asia,1352642280,1366417754,+1.02%
2,United States,Americas,Northern America,327096265,329064917,+0.60%
3,Indonesia,Asia,South-eastern Asia,267670543,270625568,+1.10%
4,Pakistan,Asia,Southern Asia,212228286,216565318,+2.04%
...,...,...,...,...,...,...
229,Falkland Islands (United Kingdom),Americas,South America,3234,3377,+4.42%
230,Niue,Oceania,Polynesia,1620,1615,−0.31%
231,Tokelau (New Zealand),Oceania,Polynesia,1319,1340,+1.59%
232,Vatican City[z],Europe,Southern Europe,801,799,−0.25%


If we look closely, we notice some extra characters embedded in this table, such as brackets `[]`, parentheses `()` and percentage signs `%`. These extra characters should not be part of our DataFrame.

Later in this course, we will learn how to clean this DataFrame in order to prepare the data for further analysis.
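As a small preview only, the sketch below shows one way such characters could be removed with Pandas string methods. The `sample` DataFrame is a hypothetical stand-in holding a few values copied from the table above, and this is not necessarily the approach the course takes later.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the scraped table
sample = pd.DataFrame({
    "Country / Area": ["China[a]", "India", "Niue", "Vatican City[z]"],
    "Change": ["+0.43%", "+1.02%", "\u22120.31%", "\u22120.25%"],
})

cleaned = sample.copy()

# Drop bracketed footnote markers such as "[a]" or "[z]"
cleaned["Country / Area"] = (
    cleaned["Country / Area"].str.replace(r"\[.*?\]", "", regex=True).str.strip()
)

# Convert "+0.43%" to a float; the page uses a Unicode minus sign (U+2212),
# which must become an ASCII hyphen before the conversion
cleaned["Change"] = (
    cleaned["Change"]
    .str.replace("\u2212", "-", regex=False)
    .str.rstrip("%")
    .astype(float)
)

print(cleaned)
```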

<br>

Let’s quickly explore the other item in our `web_data` list.

If we select the second item, we get another table whose column names are cluttered with CSS rules from the page's navigation box. Clearly this is not the one we need. 

In [4]:
# Select the second item in the list
df_countries = web_data[1]

# Display the DataFrame
df_countries

Unnamed: 0,".mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:""[ ""}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:"" ]""}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteLists of countries by population statistics",".mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}.mw-parser-output .navbar-collapse{float:left;text-align:left}.mw-parser-output .navbar-boxtext{word-spacing:0}.mw-parser-output .navbar ul{display:inline-block;white-space:nowrap;line-height:inherit}.mw-parser-output .navbar-brackets::before{margin-right:-0.125em;content:""[ ""}.mw-parser-output .navbar-brackets::after{margin-left:-0.125em;content:"" ]""}.mw-parser-output .navbar li{word-spacing:-0.125em}.mw-parser-output .navbar a>span,.mw-parser-output .navbar a>abbr{text-decoration:inherit}.mw-parser-output .navbar-mini abbr{font-variant:small-caps;border-bottom:none;text-decoration:none;cursor:inherit}.mw-parser-output .navbar-ct-full{font-size:114%;margin:0 7em}.mw-parser-output .navbar-ct-mini{font-size:114%;margin:0 4em}vteLists of countries by population statistics.1"
0,Global,Current population Demographics of the world
1,Continents/subregions,Africa Antarctica Asia Europe North America Ca...
2,Intercontinental,Americas Arab world Commonwealth of Nations Eu...
3,Cities/urban areas,World cities National capitals Megacities Mega...
4,Past and future,Past and future population World population es...
5,Population density,Current density Past and future population den...
6,Growth indicators,Population growth rate Natural increase Net re...
7,Other demographics,Age at childbearing Age at first marriage Age ...
8,Health,Antidepressant consumption Antiviral medicatio...
9,Education and innovation,Bloomberg Innovation Index Education Index Int...


<br>

Trying to select a third item (`web_data[2]`) would raise an `IndexError`, because this list has only 2 items. The table we need is the first item, so we select it again. 

In [6]:
# Select the first item in the list again
df_countries = web_data[0]

# Display the DataFrame
df_countries

Unnamed: 0,Country / Area,UN continentalregion[4],UN statisticalsubregion[4],Population(1 July 2018),Population(1 July 2019),Change
0,China[a],Asia,Eastern Asia,1427647786,1433783686,+0.43%
1,India,Asia,Southern Asia,1352642280,1366417754,+1.02%
2,United States,Americas,Northern America,327096265,329064917,+0.60%
3,Indonesia,Asia,South-eastern Asia,267670543,270625568,+1.10%
4,Pakistan,Asia,Southern Asia,212228286,216565318,+2.04%
...,...,...,...,...,...,...
229,Falkland Islands (United Kingdom),Americas,South America,3234,3377,+4.42%
230,Niue,Oceania,Polynesia,1620,1615,−0.31%
231,Tokelau (New Zealand),Oceania,Polynesia,1319,1340,+1.59%
232,Vatican City[z],Europe,Southern Europe,801,799,−0.25%
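Asking for an item beyond the end of the list fails because a 2-item list only has positions 0 and 1. The sketch below reproduces this with a plain list standing in for `web_data`; the helper name `get_table` is hypothetical and is just one way to guard the lookup.

```python
# Hypothetical stand-in for web_data: a list with two items
tables = ["first table", "second table"]


def get_table(items, index):
    """Return items[index], or None when the index is out of range."""
    try:
        return items[index]
    except IndexError:
        # Python raises IndexError for positions past the end of a list
        return None


print(get_table(tables, 0))
print(get_table(tables, 2))
```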


<br>

In summary, the Pandas `read_html()` function can help us quickly extract tabular web data without writing scraping code by hand with Python libraries such as Beautiful Soup and Selenium. 

However, there will be scenarios where Pandas struggles with web scraping and we need to resort to specialized libraries. 