# Extract, Transform, and Load (ETL)

ETL is a common necessity for data engineering and data processing pipelines.
The source of the data may be other structured databases, unstructured data stores, data APIs, etc.

ETL can be a simple data acquisition task, such as the one shown below.

![Automated data acquisition](../images/AutomatedDataAcquisition.png)

**Or, it may be part of a larger process to accumulate data and information in support of advanced analytical systems.**

![Automated data acquisition feeding analytics](../images/AutomatedDataAcquisition_to_Analytics.png)

---

## In the context of ETL, you now have the tools to perform this activity.

In the data loading lab, you read in three data files and then massaged the Pandas data frame to prepare the data for loading and to understand the semantics of the data.
You then loaded the database with data from the files.

We just need to understand how to acquire data from a remote resource, such as the web or an API, and process it with Pandas.

Additionally, in this notebook we will see how to use the SQLAlchemy library to simplify data loading.

## Tasks:

 **Consider**:
 + https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)  
 
In the cells below, 

 1. Define a table for information about the world's countries.
 1. Describe some challenges you foresee with the data.
 1. Review and modify code cells that pull down the data from the tables into a data frame
 1. Load the data into your database
 1. Test loaded data with SQL queries

### 1. Define Tables
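
As a starting point, the sketch below drafts one possible table definition. The column names follow the six columns of the Wikipedia table, but the types, the `country` primary key, and the in-memory SQLite database (used only so the snippet runs anywhere) are assumptions. On the course PostgreSQL server you would run your own `CREATE TABLE` in your schema.

```python
import sqlite3

# Hypothetical DDL: column names and types are assumptions for illustration.
CREATE_COUNTRY_POPULATION = """
CREATE TABLE country_population (
    country           VARCHAR(100) PRIMARY KEY,
    region            VARCHAR(50)  NOT NULL,
    subregion         VARCHAR(50)  NOT NULL,
    population_2018   BIGINT,
    population_2019   BIGINT,
    population_change NUMERIC(6, 2)
)
"""

# An in-memory SQLite database stands in for the real server here.
conn = sqlite3.connect(":memory:")
conn.execute(CREATE_COUNTRY_POPULATION)

# List the column names SQLite recorded for the new table.
columns = [row[1] for row in conn.execute("PRAGMA table_info(country_population)")]
print(columns)
```

Whether `country` is a safe primary key depends on the names being unique in the source table, which is worth checking as part of the challenges in the next task.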

### 2. Describe the challenges

### 3. Data Scraping Code

In [2]:
# Import the library to query a website
import requests
# import Beautiful soup library to access functions to parse the data returned from the website
from bs4 import BeautifulSoup



In [3]:
# specify the url
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
# Open website URL and return the html to the variable 'response'
response = requests.get(url)
print(response.encoding)
print(response)

UTF-8
<Response [200]>


The response we get from the web is typically HTML content. 
We can read the content of the server's response. 
Below, when a `BeautifulSoup` object is created from an HTML response, we explicitly reference the text format (`response.text`).

The encoding of this response is 'UTF-8', as shown in the output above. 

[Click here for additional documentation about the response object.](http://docs.python-requests.org/en/master/user/quickstart/#response-content)
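
To make the relationship between the raw bytes and `response.text` concrete, here is a small sketch that does not touch the network. The byte string is a made-up stand-in for `response.content`; decoding it with the reported encoding is, in essence, what requests does to produce `response.text`.

```python
# A made-up stand-in for response.content (raw bytes from the server).
raw_bytes = "<td>Côte d'Ivoire</td><td>−0.27%</td>".encode("utf-8")

# Stand-in for response.encoding, which was reported as UTF-8 above.
encoding = "utf-8"

# Decoding the bytes with that encoding yields the text form.
text = raw_bytes.decode(encoding)

# Byte count and character count differ when the text is not pure ASCII.
print(len(raw_bytes), len(text))
print(text)
```

Decoding with the wrong encoding would garble characters such as `ô` and the minus sign, which matters later when the numeric columns are parsed.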



In [4]:
# Parse the html in the 'response' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(response.text, "html")

### Inspect the page source to determine how to extract the table into its own soup object.

We see that the table tag has class settings of:
 * sortable 
 * wikitable 
 * jquery-tablesorter
 
```HTML
<table class="sortable wikitable jquery-tablesorter">
```

We want to focus on the `wikitable`.  
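
The class-based lookup can be tried on a tiny inline document before running it against the live page. The HTML string below is invented for illustration. `find_all` with `class_` matches a tag when any one of its classes equals the given name, which is why `wikitable` alone is enough. The explicit `html.parser` argument is a choice made here so that the snippet needs no extra parser package.

```python
from bs4 import BeautifulSoup

# Invented HTML: one unrelated table and one table carrying the wikitable class.
sample_html = """
<table class="infobox"><tr><td>not wanted</td></tr></table>
<table class="sortable wikitable jquery-tablesorter">
  <tr><th>Country/Area</th><th>Change</th></tr>
  <tr><td><a href="/wiki/Japan">Japan</a></td><td>−0.27%</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Every table in the document versus only those with the wikitable class.
every_table = sample_soup.find_all("table")
wiki_tables = sample_soup.find_all("table", class_="wikitable")
print(len(every_table), len(wiki_tables))

# Pull the link text out of the first data row, guarding against a missing <a>.
first_cell = wiki_tables[0].find_all("tr")[1].find("td")
link = first_cell.find("a")
name = link.get_text(strip=True) if link is not None else first_cell.get_text(strip=True)
print(name)
```

The guard on `link` is an addition for robustness: a cell without a hyperlink would otherwise raise an `AttributeError` when `.find` is called on `None`.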

In [18]:
# We can fetch all Tables with a find_all() 
all_tables=soup.find_all('table')
print(type(all_tables))
print(len(all_tables))


# We can call find_all again with a class filter and take the first match, [0]
right_table=soup.find_all('table', class_='wikitable')[0]
print(type(right_table))

<class 'bs4.element.ResultSet'>
2
<class 'bs4.element.Tag'>


#### Look at the first couple rows

In [19]:
first_two_rows = right_table.findAll("tr")[0:2]

print("Header")
print("-"*30)
print(first_two_rows[0])

print("="*30)

print("First Data row")
print("-"*30)
print(first_two_rows[1])


Header
------------------------------
<tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
<th>Country/Area
</th>
<th><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN continental<br/>region</a><sup class="reference" id="cite_ref-region_4-0"><a href="#cite_note-region-4">[4]</a></sup>
</th>
<th><a href="/wiki/United_Nations_geoscheme" title="United Nations geoscheme">UN statistical<br/>subregion</a><sup class="reference" id="cite_ref-region_4-1"><a href="#cite_note-region-4">[4]</a></sup>
</th>
<th>Population<br/>(1 July 2018)
</th>
<th>Population<br/>(1 July 2019)
</th>
<th>Change
</th></tr>
First Data row
------------------------------
<tr>
<td style="text-align:left"><span class="datasortkey" data-sort-value="China"><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic

The `Tag` element is the table.

**Examining the HTML Table Header, we have these columns**

 * Country/Area
 * UN continental region
 * UN statistical subregion
 * Population (1 July 2018)
 * Population (1 July 2019)
 * Change


#### TODO: Replace all the `#?` with one or more lines or portions of code.

In [24]:
# We will use the locale library so we can use 
# atof and atoi to convert alphanumeric to float and integers, respectively.
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' ) 

country=[]
region=[]
subregion=[]
population_2018=[]
population_2019=[]
population_change=[]


# Notice we are skipping the head row in this table
for row in right_table.findAll("tr")[1:]: 
    # for each row, pull out the td elements.
    cells = row.findAll('td') # To store all other details
    
    if len(cells) > 2:  # Only extract information from table body rows, not heading rows
        
        # for the country name, we need to find the name (text) in the Country Hyperlink (a)
        countr_cell = cells[0].find('a').find(text=True)
        country.append(countr_cell)

      
        # for the region name, we need to find the name (text) in the Region Hyperlink (a)
        region_text = cells[1].find('a').find(text=True)
        region.append(
            region_text
        )
        
        # for the subregion name, we need to find the name (text) in the Subregion Hyperlink (a)
        subregion_text = cells[2].find('a').find(text=True)
        subregion.append(
            subregion_text
        )

        print("Area: {},{},{}".format(countr_cell,region_text,subregion_text))        

        # Convert the population data from text to a numeric data type
        population_2018.append(locale.atoi(cells[3].find(text=True)))
        population_2019.append(locale.atoi(cells[4].find(text=True)))
        
        
        
        change_pull = cells[5].find(text=True)
        print(change_pull)
        
        # Note the Unicode minus sign (U+2212) in the table needs to be changed
        # to a regular hyphen-minus to be parsed as a negative value
        population_change.append(
                            locale.atof(change_pull.replace('%','').replace('−','-'))
                            )

#print(population_change)
#print(len(country))
#print(len(region))
#print(len(subregion))
#print(len(population_2018))
#print(len(population_2019))
#print(len(population_change))

Area: China,Asia,Eastern Asia
+0.43%
Area: India,Asia,Southern Asia
+1.02%
Area: United States,Americas,Northern America
+0.60%
Area: Indonesia,Asia,South-eastern Asia
+1.10%
Area: Pakistan,Asia,Southern Asia
+2.04%
Area: Brazil,Americas,South America
+0.75%
Area: Nigeria,Africa,Western Africa
+2.60%
Area: Bangladesh,Asia,Southern Asia
+1.03%
Area: Russia,Europe,Eastern Europe
+0.09%
Area: Mexico,Americas,Central America
+1.10%
Area: Japan,Asia,Eastern Asia
−0.27%
Area: Ethiopia,Africa,Eastern Africa
+2.61%
Area: Philippines,Asia,South-eastern Asia
+1.37%
Area: Egypt,Africa,Northern Africa
+2.00%
Area: Vietnam,Asia,South-eastern Asia
+0.96%
Area: DR Congo,Africa,Middle Africa
+3.24%
Area: Germany,Europe,Western Europe
+0.47%
Area: Turkey,Asia,Western Asia
+1.32%
Area: Iran,Asia,Southern Asia
+1.36%
Area: Thailand,Asia,South-eastern Asia
+0.25%
Area: United Kingdom,Europe,Northern Europe
+0.58%
Area: France,Europe,Western Europe
+0.21%
Area: Italy,Europe,Southern Europe
−0.13%
Area: Sou
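
The numeric conversions in the loop depend on the `en_US.UTF-8` locale being installed on the machine. As an alternative sketch (not the approach used in the cell above), the two hypothetical helpers below, `parse_population` and `parse_change`, do the same cleanup with plain string methods, assuming the cells look like `1,427,647,786` and `−0.27%`.

```python
def parse_population(text: str) -> int:
    """Convert a string such as '1,427,647,786' to an integer."""
    return int(text.strip().replace(",", ""))


def parse_change(text: str) -> float:
    """Convert a string such as '+0.43%' or '−0.27%' to a float.

    The table uses the Unicode minus sign (U+2212), which float()
    does not accept, so it is replaced with an ASCII hyphen first.
    """
    cleaned = text.strip().replace("%", "").replace("\u2212", "-")
    return float(cleaned)


print(parse_population("1,427,647,786"))
print(parse_change("+0.43%"))
print(parse_change("\u22120.27%"))
```

Keeping the conversion in small functions also makes it easy to test the awkward cases (thousands separators, signs, trailing whitespace) before running the full scrape.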

##### Now that we have built all our columns, stack them into a data frame!

In [25]:
import pandas as pd

# Note, in the table definition above, we listed 
# the country name first to use as a primary key
df=pd.DataFrame({
                'country': country,
                'region': region,
                'subregion': subregion,
                'population_2018': population_2018,
                'population_2019': population_2019,
                'population_change': population_change
            }
    )


In [26]:
df.head()

|   | country | region | subregion | population_2018 | population_2019 | population_change |
|---|---------|--------|-----------|-----------------|-----------------|-------------------|
| 0 | China | Asia | Eastern Asia | 1427647786 | 1433783686 | 0.43 |
| 1 | India | Asia | Southern Asia | 1352642280 | 1366417754 | 1.02 |
| 2 | United States | Americas | Northern America | 327096265 | 329064917 | 0.6 |
| 3 | Indonesia | Asia | South-eastern Asia | 267670543 | 270625568 | 1.1 |
| 4 | Pakistan | Asia | Southern Asia | 212228286 | 216565318 | 2.04 |


In [27]:
df.tail()

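|     | country | region | subregion | population_2018 | population_2019 | population_change |
|-----|---------|--------|-----------|-----------------|-----------------|-------------------|
| 228 | Montserrat | Americas | Caribbean | 4993 | 4989 | -0.08 |
| 229 | Falkland Islands | Americas | South America | 3234 | 3377 | 4.42 |
| 230 | Niue | Oceania | Polynesia | 1620 | 1615 | -0.31 |
| 231 | Tokelau | Oceania | Polynesia | 1319 | 1340 | 1.59 |
| 232 | Vatican City | Europe | Southern Europe | 801 | 799 | -0.25 |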


### Check our column data types!
Does this match the data types we sketched out in the `CREATE TABLE` statement above?
If you need to adjust the definition, this would be the time.
Alternatively, we can adjust the columns using Pandas techniques.

In [28]:
df.dtypes

country               object
region                object
subregion             object
population_2018        int64
population_2019        int64
population_change    float64
dtype: object
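
If the types do not line up, one way to adjust them on the Pandas side is `astype`. The sketch below uses a two-row stand-in frame with values copied from the output above; the population column is deliberately stored as strings to show the conversion, which is an assumption made for illustration rather than something that happens in this notebook.

```python
import pandas as pd

# Stand-in frame: populations deliberately left as strings.
sample = pd.DataFrame({
    "country": ["China", "India"],
    "population_2019": ["1433783686", "1366417754"],
    "population_change": [0.43, 1.02],
})

# Cast the string column to a 64-bit integer and confirm the result.
sample["population_2019"] = sample["population_2019"].astype("int64")
print(sample.dtypes)
```

A 64-bit integer in Pandas corresponds naturally to a `BIGINT` column in SQL, and `float64` to a double-precision or numeric column.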

Once our Pandas data frame and the SQL table are in line, we can load the data into the database.

---

### 4. Load the data into your database

This time, instead of loading the data manually, we are going to use the SQLAlchemy library.


In [30]:
import getpass
mypasswd = getpass.getpass()
username = 'jch5x8'
host = 'pgsql.dsa.lan'
database = 'dsa_student'

········


In [31]:
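# Connect to the database
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# SQLAlchemy connection parameters.
# 'postgresql' is the dialect name; the older 'postgres' alias was removed in SQLAlchemy 1.4.
# On SQLAlchemy 1.4 or newer, build the URL with URL.create(**postgres_db) instead.
postgres_db = {'drivername': 'postgresql',
               'username': username,
               'password': mypasswd,
               'host': host,
               'database': database}
engine = create_engine(URL(**postgres_db), echo=True)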



#### When you run the cell below, carefully examine the output so you see what the SQLAlchemy library is doing!

In [32]:

# With the SQLAlchemy engine created, use to_sql to load the data frame
df.to_sql('country_population', # The table to load
          engine,             # The engine created above
          schema=username,    # The schema where the table lives, our pawprint
          if_exists='append', # If the table already exists, append the rows to the end of it
          index=False,        # Do not write the data frame's row index as a column
          chunksize=50)       # Insert 50 records from the data frame at a time


2021-12-03 17:17:32,494 INFO sqlalchemy.engine.base.Engine select version()
2021-12-03 17:17:32,495 INFO sqlalchemy.engine.base.Engine {}
2021-12-03 17:17:32,497 INFO sqlalchemy.engine.base.Engine select current_schema()
2021-12-03 17:17:32,498 INFO sqlalchemy.engine.base.Engine {}
2021-12-03 17:17:32,500 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS VARCHAR(60)) AS anon_1
2021-12-03 17:17:32,500 INFO sqlalchemy.engine.base.Engine {}
2021-12-03 17:17:32,502 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS VARCHAR(60)) AS anon_1
2021-12-03 17:17:32,502 INFO sqlalchemy.engine.base.Engine {}
2021-12-03 17:17:32,503 INFO sqlalchemy.engine.base.Engine show standard_conforming_strings
2021-12-03 17:17:32,503 INFO sqlalchemy.engine.base.Engine {}
2021-12-03 17:17:32,505 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where n.nspname=%(schema)s and relname=%(name)s
2021-12-03 17:17:32

2021-12-03 17:17:32,580 INFO sqlalchemy.engine.base.Engine INSERT INTO jch5x8.country_population (country, region, subregion, population_2018, population_2019, population_change) VALUES (%(country)s, %(region)s, %(subregion)s, %(population_2018)s, %(population_2019)s, %(population_change)s)
2021-12-03 17:17:32,580 INFO sqlalchemy.engine.base.Engine ({'country': 'Antigua and Barbuda', 'region': 'Americas', 'subregion': 'Caribbean', 'population_2018': 96286, 'population_2019': 97118, 'population_change': 0.86}, {'country': 'Isle of Man', 'region': 'Europe', 'subregion': 'Northern Europe', 'population_2018': 84077, 'population_2019': 84584, 'population_change': 0.6}, {'country': 'Andorra', 'region': 'Europe', 'subregion': 'Southern Europe', 'population_2018': 77006, 'population_2019': 77142, 'population_change': 0.18}, {'country': 'Dominica', 'region': 'Americas', 'subregion': 'Caribbean', 'population_2018': 71625, 'population_2019': 71808, 'population_change': 0.26}, {'country': 'Cayman 
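
Because `if_exists='append'` adds rows every time the cell runs, re-running the load stores the data again. The sketch below shows that behavior on a throwaway in-memory SQLite connection, which is an assumption made so the example runs without the course PostgreSQL server; the frame is a two-row stand-in.

```python
import sqlite3

import pandas as pd

sample = pd.DataFrame({
    "country": ["Niue", "Tokelau"],
    "population_2019": [1615, 1340],
})

# Throwaway in-memory database standing in for the real server.
conn = sqlite3.connect(":memory:")

# Loading twice with 'append' stores every row twice.
sample.to_sql("country_population", conn, if_exists="append", index=False)
sample.to_sql("country_population", conn, if_exists="append", index=False)
rows_after_append = conn.execute("SELECT COUNT(*) FROM country_population").fetchone()[0]

# 'replace' drops and recreates the table, leaving a single copy.
sample.to_sql("country_population", conn, if_exists="replace", index=False)
rows_after_replace = conn.execute("SELECT COUNT(*) FROM country_population").fetchone()[0]

print(rows_after_append, rows_after_replace)
```

If your own table declares a primary key on `country`, a second append would be expected to fail with a uniqueness error instead of silently duplicating rows.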

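### 5. Test loaded data with SQL queries

Run the query below in `psql`, replacing `SSO` with your own schema name (your pawprint). The `\x` meta-command turns on expanded display so each column prints on its own line.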



```SQL
\x
select * from SSO.country_population limit 2;
```

---

#### TODO: Run the SQL in your database to verify the data was loaded.

If the data was not loaded, please restart from the top and carefully check and redo each step.



#### Now that the data is loaded, let's pull it back out!





In [33]:
df_backout = pd.read_sql_table(
    'country_population',
    con = engine,             # The engine created above
    schema= username   # The schema where the table lives, our pawprint
)

2021-12-03 17:19:40,443 INFO sqlalchemy.engine.base.Engine SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = %(schema)s AND c.relkind in ('r', 'p')
2021-12-03 17:19:40,444 INFO sqlalchemy.engine.base.Engine {'schema': 'jch5x8'}
2021-12-03 17:19:40,461 INFO sqlalchemy.engine.base.Engine SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = %(schema)s AND c.relkind IN ('v', 'm')
2021-12-03 17:19:40,462 INFO sqlalchemy.engine.base.Engine {'schema': 'jch5x8'}
2021-12-03 17:19:40,469 INFO sqlalchemy.engine.base.Engine 
            SELECT c.oid
            FROM pg_catalog.pg_class c
            LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
            WHERE (n.nspname = %(schema)s)
            AND c.relname = %(table_name)s AND c.relkind in
            ('r', 'v', 'm', 'f', 'p')
        
2021-12-03 17:19:40,470 INFO sqlalchemy.engine.base.Engine {'schema': 'jch5x8', 'table_name': 'country

In [34]:
df_backout.head(10)

|   | country | region | subregion | population_2018 | population_2019 | population_change |
|---|---------|--------|-----------|-----------------|-----------------|-------------------|
| 0 | China | Asia | Eastern Asia | 1427647786 | 1433783686 | 0.43 |
| 1 | India | Asia | Southern Asia | 1352642280 | 1366417754 | 1.02 |
| 2 | United States | Americas | Northern America | 327096265 | 329064917 | 0.6 |
| 3 | Indonesia | Asia | South-eastern Asia | 267670543 | 270625568 | 1.1 |
| 4 | Pakistan | Asia | Southern Asia | 212228286 | 216565318 | 2.04 |
| 5 | Brazil | Americas | South America | 209469323 | 211049527 | 0.75 |
| 6 | Nigeria | Africa | Western Africa | 195874683 | 200963599 | 2.6 |
| 7 | Bangladesh | Asia | Southern Asia | 161376708 | 163046161 | 1.03 |
| 8 | Russia | Europe | Eastern Europe | 145734038 | 145872256 | 0.09 |
| 9 | Mexico | Americas | Central America | 126190788 | 127575529 | 1.1 |


In [35]:
df_backout.tail(10)

|     | country | region | subregion | population_2018 | population_2019 | population_change |
|-----|---------|--------|-----------|-----------------|-----------------|-------------------|
| 223 | Tuvalu | Oceania | Polynesia | 11508 | 11646 | 1.2 |
| 224 | Wallis and Futuna | Oceania | Polynesia | 11661 | 11432 | -1.96 |
| 225 | Nauru | Oceania | Micronesia | 10670 | 10756 | 0.81 |
| 226 | Saint Helena | Africa | Western Africa | 6035 | 6059 | 0.4 |
| 227 | Saint Pierre and Miquelon | Americas | Northern America | 5849 | 5822 | -0.46 |
| 228 | Montserrat | Americas | Caribbean | 4993 | 4989 | -0.08 |
| 229 | Falkland Islands | Americas | South America | 3234 | 3377 | 4.42 |
| 230 | Niue | Oceania | Polynesia | 1620 | 1615 | -0.31 |
| 231 | Tokelau | Oceania | Polynesia | 1319 | 1340 | 1.59 |
| 232 | Vatican City | Europe | Southern Europe | 801 | 799 | -0.25 |
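
A quick way to confirm the round trip is to compare what went in with what came out. The sketch below does this against an in-memory SQLite database with a three-row stand-in frame (values copied from the output above). Both the SQLite connection and the frame names `original` and `restored` are assumptions for illustration; with the course database you could compare `df` and `df_backout` in the same way.

```python
import sqlite3

import pandas as pd

original = pd.DataFrame({
    "country": ["Niue", "Tokelau", "Vatican City"],
    "population_2019": [1615, 1340, 799],
    "population_change": [-0.31, 1.59, -0.25],
})

# Throwaway in-memory database standing in for the real server.
conn = sqlite3.connect(":memory:")
original.to_sql("country_population", conn, index=False)

# Read the table back and sort both frames so row order cannot matter.
restored = pd.read_sql_query("SELECT * FROM country_population", conn)
left = original.sort_values("country").reset_index(drop=True)
right = restored.sort_values("country").reset_index(drop=True)

# Compare the shapes and then the values record by record.
same_shape = left.shape == right.shape
same_values = left.to_dict("records") == right.to_dict("records")
print(same_shape, same_values)
```

If the load cell was run more than once with `if_exists='append'`, the shape check is the first thing that would fail, since the table would hold more rows than the data frame.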


# Save your notebook, then `File > Close and Halt`

---