# Extraction, Transform, and Load


---
This exercise is similar to the ones in the lab and practice. Here we create a database table of freshwater withdrawal by country, using Wikipedia as the data source.

## Tasks:

 **Consider**:
 + https://en.wikipedia.org/wiki/List_of_countries_by_freshwater_withdrawal
 
In the cells below, 

 1. Define a table for information about freshwater withdrawal by each country
 1. Describe some challenges you foresee with the data
 1. Review and modify code cells that pull down the data from the tables into a data frame
 1. Load the data into your database
 1. Test loaded data with SQL queries

### 1. Define Tables

In the following cell, define the table for storing the data.

In [1]:
import getpass
mypasswd = getpass.getpass()
username = 'bmgwd9'
host = 'pgsql.dsa.lan'
database = 'dsa_student'

········


In [2]:
# Connect to the database
from sqlalchemy.engine.url import URL
from sqlalchemy import create_engine

# SQLAlchemy connection parameters
# ('postgresql' is the dialect name; the 'postgres' alias is deprecated)
postgres_db = {'drivername': 'postgresql',
               'username': username,
               'password': mypasswd,
               'host': host,
               'database': database}
engine = create_engine(URL(**postgres_db), echo=True)
del mypasswd

Now create a query string with the `CREATE TABLE` statement and execute it with the SQLAlchemy engine.

In [38]:
query = """
DROP TABLE IF EXISTS withdrawal;
CREATE TABLE withdrawal(
    rank                     INT,
    country                  varchar(100),
    total_withdrawal         REAL,
    per_capita_withdrawal    REAL,
    domestic_withdrawal      REAL,
    industrial_withdrawal    REAL,
    agricultural_withdrawal  REAL,
    date                     REAL,
    PRIMARY KEY (country)
);
"""

with engine.connect() as connection:
    res = connection.execute(query)
    print(res)

2020-12-04 23:15:30,552 INFO sqlalchemy.engine.base.Engine 
DROP TABLE IF EXISTS withdrawal;
CREATE TABLE withdrawal(
    rank                     INT,
    country                  varchar(100),
    total_withdrawal         REAL,
    per_capita_withdrawal    REAL,
    domestic_withdrawal      REAL,
    industrial_withdrawal    REAL,
    agricultural_withdrawal  REAL,
    date                     REAL,
    PRIMARY KEY (country)
);

2020-12-04 23:15:30,553 INFO sqlalchemy.engine.base.Engine {}
2020-12-04 23:15:30,570 INFO sqlalchemy.engine.base.Engine COMMIT
<sqlalchemy.engine.result.ResultProxy object at 0x7fd3dd89a1d0>


### 2. Data Scraping

In [39]:
#import the library to query a website
import requests
# import Beautiful soup library to access functions to parse the data returned from the website
from bs4 import BeautifulSoup



In [40]:
# specify the url
url = "https://en.wikipedia.org/wiki/List_of_countries_by_freshwater_withdrawal"
# Open website URL and return the html to the variable 'response'
response = requests.get(url)
print(response.encoding)
print(response.status_code)

UTF-8
200


The response we get from a web server is typically HTML content, and we can read it from the response object.
Below, when a `BeautifulSoup` object is created from the HTML response, we explicitly pass the decoded text (`response.text`).

The encoding reported for this page is 'UTF-8', as shown above.

[Click here for additional documentation about the response object.](http://docs.python-requests.org/en/master/user/quickstart/#response-content)
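
To make the role of the encoding concrete: `response.text` is the raw body (`response.content`) decoded with `response.encoding`. The sketch below mimics that step with plain bytes so it runs without a network connection. The sample string is adapted from the table header on this page; the variable names are illustrative only.

```python
# Bytes as they would arrive from the server (what requests exposes as response.content)
raw_bytes = "Total withdrawal (km³/year)".encode("utf-8")

# Decoding with the correct encoding recovers the text (the job response.text does)
text = raw_bytes.decode("utf-8")
print(text)

# Decoding with the wrong encoding garbles non-ASCII characters such as the superscript 3
garbled = raw_bytes.decode("latin-1")
print(garbled)
```

If scraped text ever shows stray characters in front of symbols like `³`, a mismatched encoding is a likely cause.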



In [41]:
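# Parse the HTML in the 'response' variable and store it in Beautiful Soup format.
# Naming the parser explicitly ("html.parser" ships with Python) keeps results consistent across machines.
soup = BeautifulSoup(response.text, "html.parser")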

### Inspect the page source to determine how you need to extract the tables into its own soup object.


In [42]:
print(soup.prettify())

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of countries by freshwater withdrawal - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"3827b415-bf7c-40be-9649-44a06de40580","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_countries_by_freshwater_withdrawal","wgTitle":"List of countries by freshwater withdrawal","wgCurRevisionId":945992976,"wgRevisionId":945992976,"wgArticleId":18012399,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: archived copy as title","Articles with 

#### Look at the first couple rows

In [43]:
# We can fetch all tables with find_all()
all_tables = soup.find_all('table')
print(type(all_tables))
print(f"Num of available tables = {len(all_tables)}")

wikitables = soup.find_all('table', {'class': "wikitable"})
print(f"Num of available wikitables = {len(wikitables)}")

# In this case the first (and only) wikitable is the right table
right_table = soup.find_all('table', class_='wikitable')[0]
print(type(right_table))


first_two_rows = right_table.find_all("tr")[0:2]

print("Header")
print("-"*30)
print(first_two_rows[0])

print("="*30)

print("First Data row")
print("-"*30)
print(first_two_rows[1])


<class 'bs4.element.ResultSet'>
Num of available tables = 3
Num of available wikitables = 1
<class 'bs4.element.Tag'>
Header
------------------------------
<tr>
<th>Rank</th>
<th>Country</th>
<th>Total<br/>withdrawal<br/>(km³/year)</th>
<th>Per capita<br/>withdrawal<br/>(m³/year)</th>
<th>Domestic<br/>withdrawal<br/>(%)</th>
<th>Industrial<br/>withdrawal<br/>(%)</th>
<th>Agricultural<br/>withdrawal<br/>(%)</th>
<th>Date of <br/> Information
</th></tr>
First Data row
------------------------------
<tr>
<td>1</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="900" data-file-width="1350" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/23px-Flag_of_India.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/35px-Flag_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/4/41/Flag_of_India.svg/45px-Flag_of_India.svg.png 2x" width="23"/> </span><a href="/wiki/India" title="I

#### TODOs: 

**Examine the HTML table header and identify the columns. Feel free to automate the extraction of the header.**
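
One way to automate the header extraction is sketched below. To keep the example self-contained it uses the standard library's `html.parser` on a copy of the header row printed above, rather than the `soup` object; `HeaderParser` and `to_column_name` are hypothetical helper names, not part of the exercise code.

```python
import re
from html.parser import HTMLParser

# Copy of the header row printed above
HEADER_HTML = """<tr>
<th>Rank</th>
<th>Country</th>
<th>Total<br/>withdrawal<br/>(km³/year)</th>
<th>Per capita<br/>withdrawal<br/>(m³/year)</th>
<th>Domestic<br/>withdrawal<br/>(%)</th>
<th>Industrial<br/>withdrawal<br/>(%)</th>
<th>Agricultural<br/>withdrawal<br/>(%)</th>
<th>Date of <br/> Information
</th></tr>"""


class HeaderParser(HTMLParser):
    """Collect the text of each <th> cell, treating <br/> as a space."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self._parts = None  # text pieces of the <th> that is currently open

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._parts = []
        elif tag == "br" and self._parts is not None:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag == "th" and self._parts is not None:
            self.headers.append("".join(self._parts))
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def to_column_name(header):
    """Drop any '(unit)' suffix and convert the label to snake_case."""
    label = re.sub(r"\(.*?\)", "", header)
    return "_".join(label.lower().split())


parser = HeaderParser()
parser.feed(HEADER_HTML)
columns = [to_column_name(h) for h in parser.headers]
print(columns)
```

Most of the generated names line up with the `CREATE TABLE` statement; the last header becomes `date_of_information`, so it would still need to be mapped to the `date` column by hand.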




In [52]:
# We will use the locale library so we can use 
# atof and atoi to convert numeric strings (including thousands separators) to floats and integers, respectively.
import locale
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' ) 

rank=[]
country=[]
total_withdrawal=[]
per_capita_withdrawal=[]
domestic_withdrawal=[]
industrial_withdrawal=[]
agricultural_withdrawal=[]
date=[]


# skip the first row, as we don't need the headers
for row in right_table.findAll("tr")[1:]: 
    # for each row, pull out the td elements.
    cells = row.findAll('td') # To store all other details
    

#     print(f"cells = {cells}")    
    
    this_rank = cells[0].find(text=True)
    print("Processing rank {}".format(this_rank))

    # Skip rows whose rank is not a number; only ranked rows are converted
    if (not this_rank.isnumeric()):
        print("Non-Ranked, skipping")
        continue
        
    rank.append(locale.atoi(this_rank))

    # for the country name, we need to find the name (text) in the Country hyperlink (a)
    country_name = cells[1].find('a').find(text=True)
    print(country_name)

    country.append(country_name)

    # Convert the data from text to numeric data types
    total_withdrawal.append(locale.atof(cells[2].find(text=True)))

    # Per capita withdrawal may be blank; store None (NULL in the database) in that case
    per_capita_text = cells[3].find(text=True)
    if per_capita_text:
        per_capita_withdrawal.append(locale.atoi(per_capita_text))
    else:
        per_capita_withdrawal.append(None)

    # The three percentage columns may be blank or non-numeric
    for idx, column in ((4, domestic_withdrawal),
                        (5, industrial_withdrawal),
                        (6, agricultural_withdrawal)):
        cell_text = cells[idx].find(text=True)
        if cell_text and cell_text.isnumeric():
            column.append(locale.atoi(cell_text))
        else:
            column.append(None)

    # Rows whose date cell reads 'cu' have no usable year
    year_text = cells[7].find(text=True)
    if year_text != 'cu\n':
        date.append(locale.atoi(year_text))
    else:
        date.append(None)

Processing rank 1
India
Processing rank 2
China
Processing rank 3
United States
Processing rank 4
Vietnam
Processing rank 5
Japan
Processing rank 6
Indonesia
Processing rank 7
Thailand
Processing rank 8
Uzbekistan
Processing rank 9
Mexico
Processing rank 10
Russia
Processing rank 11
Iran
Processing rank 12
Pakistan
Processing rank 13
Egypt
Processing rank 14
Brazil
Processing rank 15
Bangladesh
Processing rank 16
Canada
Processing rank 17
Italy
Processing rank 18
Iraq
Processing rank 19
Turkey
Processing rank 20
Germany
Processing rank 21
Sudan
Processing rank 22
Spain
Processing rank 23
Ukraine
Processing rank 24
Burma
Processing rank 25
Turkmenistan
Processing rank 26
Colombia
Processing rank 27
France
Processing rank 28
South Korea
Processing rank 29
Philippines
Processing rank 30
Australia
Processing rank 31
Kazakhstan
Processing rank 32
Hungary
Processing rank 33
Afghanistan
Processing rank 34
Syria
Processing rank 35
Peru
Processing rank 36
Saudi Arabia
Processing rank 37
Azerbai
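
The loop above depends on the `en_US.UTF-8` locale being installed; where it is missing, `locale.setlocale` raises `locale.Error`. A locale-independent alternative is sketched below. `parse_number` is a hypothetical helper, and it assumes the table uses commas only as thousands separators.

```python
def parse_number(text, as_int=False):
    """Return a number parsed from table-cell text, or None if it is not numeric."""
    if text is None:
        return None
    # Remove surrounding whitespace/newlines and comma thousands separators
    cleaned = text.strip().replace(",", "")
    try:
        return int(cleaned) if as_int else float(cleaned)
    except ValueError:
        # Blank cells and placeholders such as 'cu' end up here
        return None


print(parse_number("1,600\n", as_int=True))
print(parse_number("645.84"))
print(parse_number("cu\n"))
print(parse_number(None))
```

Returning `None` for anything unparseable keeps every column list the same length, which the data frame construction in the next step requires.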

##### Now that we have built all our columns, stack into a data frame!

In [53]:
import pandas as pd

df=pd.DataFrame({'country': country,
                'rank': rank,
                'total_withdrawal': total_withdrawal,
                'per_capita_withdrawal': per_capita_withdrawal,
                'domestic_withdrawal': domestic_withdrawal,
                'industrial_withdrawal': industrial_withdrawal,
                'agricultural_withdrawal': agricultural_withdrawal,
                'date': date
                })

df.head()

| | country | rank | total_withdrawal | per_capita_withdrawal | domestic_withdrawal | industrial_withdrawal | agricultural_withdrawal | date |
|---|---|---|---|---|---|---|---|---|
| 0 | India | 1 | 645.84 | 585.0 | 8.0 | 5.0 | 86.0 | 2000.0 |
| 1 | China | 2 | 549.76 | 415.0 | 7.0 | 26.0 | 68.0 | 2000.0 |
| 2 | United States | 3 | 477.0 | 1600.0 | 13.0 | 46.0 | 41.0 | 2000.0 |
| 3 | Vietnam | 4 | 169.39 | 1072.0 | 2.0 | 2.0 | 96.0 | 2000.0 |
| 4 | Japan | 5 | 88.43 | 690.0 | 20.0 | 18.0 | 62.0 | 2000.0 |


### Check column data types!
Do these match the data types we sketched out in the `CREATE TABLE` statement above?
If you need to adjust the definition, this would be the time.
Alternatively, we can adjust the columns using pandas techniques.

In [54]:
df.dtypes

country                     object
rank                         int64
total_withdrawal           float64
per_capita_withdrawal      float64
domestic_withdrawal        float64
industrial_withdrawal      float64
agricultural_withdrawal    float64
date                       float64
dtype: object
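
The integer-valued columns show up as `float64` because pandas stores a column that mixes integers and `None` as floats. If whole-number years are preferred, one option is the nullable `Int64` dtype, sketched below on a small stand-in frame. The third row is a made-up placeholder that supplies a missing value, and adopting this would also mean declaring `date` as `INT` in the table.

```python
import pandas as pd

# Small stand-in for the scraped data frame; "Placeholder" is not a real row
sample = pd.DataFrame({
    "country": ["India", "China", "Placeholder"],
    "rank": [1, 2, 3],
    "date": [2000.0, 2000.0, None],
})
print(sample.dtypes)

# float64 -> nullable Int64: whole numbers stay integers, missing values become <NA>
sample["date"] = sample["date"].astype("Int64")
print(sample.dtypes)
print(sample)
```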

Once the pandas data frame and the SQL table line up, we can load the data into the database.

---

### 4. Load the data into your database using SQLAlchemy


In [55]:
## Load the data frame into the table with the DataFrame.to_sql function
df.to_sql('withdrawal',       # The table to load
          engine,             # The engine created above
          schema=username,    # The schema where the table lives, our pawprint
          if_exists='append', # If the table already exists, append the rows to it
          index=False,        # Do not write the data frame index as a column
          chunksize=20)       # Insert 20 records from the data frame at a time

2020-12-04 23:22:58,686 INFO sqlalchemy.engine.base.Engine select relname from pg_class c join pg_namespace n on n.oid=c.relnamespace where n.nspname=%(schema)s and relname=%(name)s
2020-12-04 23:22:58,687 INFO sqlalchemy.engine.base.Engine {'schema': 'bmgwd9', 'name': 'withdrawal'}
2020-12-04 23:22:58,757 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2020-12-04 23:22:58,758 INFO sqlalchemy.engine.base.Engine INSERT INTO bmgwd9.withdrawal (country, rank, total_withdrawal, per_capita_withdrawal, domestic_withdrawal, industrial_withdrawal, agricultural_withdrawal, date) VALUES (%(country)s, %(rank)s, %(total_withdrawal)s, %(per_capita_withdrawal)s, %(domestic_withdrawal)s, %(industrial_withdrawal)s, %(agricultural_withdrawal)s, %(date)s)
2020-12-04 23:22:58,759 INFO sqlalchemy.engine.base.Engine ({'country': 'India', 'rank': 1, 'total_withdrawal': 645.84, 'per_capita_withdrawal': 585.0, 'domestic_withdrawal': 8.0, 'industrial_withdrawal': 5.0, 'agricultural_withdrawal': 86.0, 'date

2020-12-04 23:22:58,810 INFO sqlalchemy.engine.base.Engine ({'country': 'Mali', 'rank': 61, 'total_withdrawal': 6.55, 'per_capita_withdrawal': 484.0, 'domestic_withdrawal': 9.0, 'industrial_withdrawal': 1.0, 'agricultural_withdrawal': 90.0, 'date': 2000.0}, {'country': 'Romania', 'rank': 62, 'total_withdrawal': 6.5, 'per_capita_withdrawal': 299.0, 'domestic_withdrawal': 9.0, 'industrial_withdrawal': 34.0, 'agricultural_withdrawal': 57.0, 'date': 2003.0}, {'country': 'Algeria', 'rank': 63, 'total_withdrawal': 6.07, 'per_capita_withdrawal': 185.0, 'domestic_withdrawal': 22.0, 'industrial_withdrawal': 13.0, 'agricultural_withdrawal': 65.0, 'date': 2000.0}, {'country': 'Ethiopia', 'rank': 64, 'total_withdrawal': 5.56, 'per_capita_withdrawal': 72.0, 'domestic_withdrawal': 6.0, 'industrial_withdrawal': 0.0, 'agricultural_withdrawal': 94.0, 'date': 2002.0}, {'country': 'Tanzania', 'rank': 65, 'total_withdrawal': 5.18, 'per_capita_withdrawal': 135.0, 'domestic_withdrawal': 10.0, 'industrial_wi

2020-12-04 23:22:58,836 INFO sqlalchemy.engine.base.Engine INSERT INTO bmgwd9.withdrawal (country, rank, total_withdrawal, per_capita_withdrawal, domestic_withdrawal, industrial_withdrawal, agricultural_withdrawal, date) VALUES (%(country)s, %(rank)s, %(total_withdrawal)s, %(per_capita_withdrawal)s, %(domestic_withdrawal)s, %(industrial_withdrawal)s, %(agricultural_withdrawal)s, %(date)s)
2020-12-04 23:22:58,836 INFO sqlalchemy.engine.base.Engine ({'country': 'Singapore', 'rank': 141, 'total_withdrawal': 0.19, 'per_capita_withdrawal': 44.0, 'domestic_withdrawal': 45.0, 'industrial_withdrawal': 51.0, 'agricultural_withdrawal': 4.0, 'date': 1975.0}, {'country': 'Botswana', 'rank': 142, 'total_withdrawal': 0.19, 'per_capita_withdrawal': 107.0, 'domestic_withdrawal': 41.0, 'industrial_withdrawal': 18.0, 'agricultural_withdrawal': 41.0, 'date': 2000.0}, {'country': 'Guinea-Bissau', 'rank': 143, 'total_withdrawal': 0.18, 'per_capita_withdrawal': 113.0, 'domestic_withdrawal': 13.0, 'industria
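
One thing to keep in mind: `country` is the primary key and `if_exists='append'` adds rows to the existing table, so running the load cell a second time without first re-running the `DROP TABLE` / `CREATE TABLE` cell should be rejected as a duplicate-key violation. The sketch below shows the effect with an in-memory SQLite database standing in for PostgreSQL (an assumption made so the example runs anywhere) and an abbreviated table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for PostgreSQL here
conn.execute("""
    CREATE TABLE withdrawal(
        rank              INT,
        country           VARCHAR(100),
        total_withdrawal  REAL,
        PRIMARY KEY (country)
    )
""")

rows = [(1, "India", 645.84), (2, "China", 549.76)]
insert = "INSERT INTO withdrawal (rank, country, total_withdrawal) VALUES (?, ?, ?)"

conn.executemany(insert, rows)      # the first load succeeds
conn.commit()

try:
    conn.executemany(insert, rows)  # loading the same rows again repeats the keys
    conn.commit()
    second_load = "succeeded"
except sqlite3.IntegrityError as err:
    conn.rollback()
    second_load = f"rejected: {err}"

print(second_load)
row_count = conn.execute("SELECT COUNT(*) FROM withdrawal").fetchone()[0]
print(row_count)
```

PostgreSQL reports the problem with its own unique-violation error rather than `sqlite3.IntegrityError`, but the remedy is the same: recreate (or empty) the table before reloading.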

### 5. Test loaded data with SQL queries


#### TODO: Run the SQL in your database to verify the data was loaded.



In [57]:
with engine.connect() as connection:
    res = connection.execute("select * from withdrawal limit 2")
    for row in res:
        print(row)

2020-12-04 23:28:39,148 INFO sqlalchemy.engine.base.Engine select * from withdrawal limit 2
2020-12-04 23:28:39,149 INFO sqlalchemy.engine.base.Engine {}
(1, 'India', 645.84, 585.0, 8.0, 5.0, 86.0, 2000.0)
(2, 'China', 549.76, 415.0, 7.0, 26.0, 68.0, 2000.0)
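
Beyond a `LIMIT 2` spot check, a few aggregate queries give more confidence in the load. The sketch below runs them against an in-memory SQLite copy of the five rows shown by `df.head()`; SQLite is an assumption made so the example is self-contained, and the same SQL should carry over to the PostgreSQL table with little or no change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE withdrawal(
    rank INT, country VARCHAR(100), total_withdrawal REAL,
    per_capita_withdrawal REAL, domestic_withdrawal REAL,
    industrial_withdrawal REAL, agricultural_withdrawal REAL,
    date REAL, PRIMARY KEY (country))""")
rows = [
    (1, "India", 645.84, 585.0, 8.0, 5.0, 86.0, 2000.0),
    (2, "China", 549.76, 415.0, 7.0, 26.0, 68.0, 2000.0),
    (3, "United States", 477.0, 1600.0, 13.0, 46.0, 41.0, 2000.0),
    (4, "Vietnam", 169.39, 1072.0, 2.0, 2.0, 96.0, 2000.0),
    (5, "Japan", 88.43, 690.0, 20.0, 18.0, 62.0, 2000.0),
]
conn.executemany("INSERT INTO withdrawal VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# 1. The row count should match len(df)
row_count = conn.execute("SELECT COUNT(*) FROM withdrawal").fetchone()[0]

# 2. How many rows are missing a date?
missing_dates = conn.execute(
    "SELECT COUNT(*) FROM withdrawal WHERE date IS NULL").fetchone()[0]

# 3. The three sector shares are percentages; flag rows whose sum is far from 100
off_total = conn.execute("""
    SELECT country,
           domestic_withdrawal + industrial_withdrawal + agricultural_withdrawal AS share_sum
    FROM withdrawal
    WHERE ABS(domestic_withdrawal + industrial_withdrawal + agricultural_withdrawal - 100) > 2
""").fetchall()

# 4. The largest total withdrawal should belong to the rank-1 country
top_country = conn.execute(
    "SELECT country FROM withdrawal ORDER BY total_withdrawal DESC LIMIT 1").fetchone()[0]

print(row_count, missing_dates, off_total, top_country)
```

The tolerance of 2 percentage points in the third query is an arbitrary choice that allows for rounding in the source table.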



### 6. Now that the data is loaded, let's pull it back out and store it in a data frame!





In [59]:
df_backout = pd.read_sql_table(
    'withdrawal',
    con = engine,             # The engine created above
    schema= username   # The schema where the table lives, our pawprint
)

df_backout.head()

2020-12-04 23:29:33,485 INFO sqlalchemy.engine.base.Engine SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = %(schema)s AND c.relkind in ('r', 'p')
2020-12-04 23:29:33,486 INFO sqlalchemy.engine.base.Engine {'schema': 'bmgwd9'}
2020-12-04 23:29:33,499 INFO sqlalchemy.engine.base.Engine SELECT c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = %(schema)s AND c.relkind IN ('v', 'm')
2020-12-04 23:29:33,500 INFO sqlalchemy.engine.base.Engine {'schema': 'bmgwd9'}
2020-12-04 23:29:33,527 INFO sqlalchemy.engine.base.Engine 
            SELECT c.oid
            FROM pg_catalog.pg_class c
            LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
            WHERE (n.nspname = %(schema)s)
            AND c.relname = %(table_name)s AND c.relkind in
            ('r', 'v', 'm', 'f', 'p')
        
2020-12-04 23:29:33,528 INFO sqlalchemy.engine.base.Engine {'schema': 'bmgwd9', 'table_name': 'withdra

| | rank | country | total_withdrawal | per_capita_withdrawal | domestic_withdrawal | industrial_withdrawal | agricultural_withdrawal | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | India | 645.84 | 585.0 | 8.0 | 5.0 | 86.0 | 2000.0 |
| 1 | 2 | China | 549.76 | 415.0 | 7.0 | 26.0 | 68.0 | 2000.0 |
| 2 | 3 | United States | 477.0 | 1600.0 | 13.0 | 46.0 | 41.0 | 2000.0 |
| 3 | 4 | Vietnam | 169.39 | 1072.0 | 2.0 | 2.0 | 96.0 | 2000.0 |
| 4 | 5 | Japan | 88.43 | 690.0 | 20.0 | 18.0 | 62.0 | 2000.0 |


# Save your notebook, then `File > Close and Halt`

---