### This is a template for an ETL pipeline. The template contains 3 parts:
* Data extraction (from .csv/.json/.xml/.sql/API)
* Data transformation (cleaning/combining/datatype processing/date parsing/encoding/missing values/duplicates/outliers/scaling)
* Data loading
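
The three parts can be sketched end to end on a tiny in-memory example. This is only a minimal illustration, not the template's actual pipeline: the function names `extract`, `transform` and `load`, the sample columns, and the `population` table name are all hypothetical.

```python
import io
import sqlite3

import pandas as pd


def extract(source):
    """Read CSV data (a path or file-like object) with every column as a string."""
    return pd.read_csv(source, dtype=str)


def transform(df):
    """Drop duplicate rows and rows with missing values, then fix the data types."""
    out = df.drop_duplicates().dropna().copy()
    out["year"] = out["year"].astype(int)
    out["population"] = out["population"].astype(float)
    return out


def load(df, conn, table_name):
    """Write the cleaned frame to a SQL table, replacing any existing table."""
    df.to_sql(table_name, conn, if_exists="replace", index=False)


# Hypothetical sample data standing in for a real source file:
# one duplicated row and one row with a missing value
raw_csv = io.StringIO(
    "country,year,population\n"
    "A,2000,100\n"
    "A,2000,100\n"
    "B,2000,\n"
    "C,2001,300\n"
)

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_csv)), conn, "population")
result = pd.read_sql("SELECT * FROM population ORDER BY country", conn)
conn.close()
print(result)
```

Keeping each stage in its own function makes it easier to swap the source (CSV, JSON, API) or the destination without touching the cleaning logic.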


## 1. Extract data


## Extract from CSV

In [None]:
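import pandas as pd

# default read: pandas infers the data type of each column
df_projects = pd.read_csv('projects_data.csv')
# alternative: read every column as a string to avoid type inference
df_projects = pd.read_csv('projects_data.csv', dtype=str)
# skip the first 4 lines of the file before reading the header row
df_population = pd.read_csv('population_data.csv', skiprows=4)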

In [None]:
# print the first 10 lines to inspect the raw layout of the file
with open('population_data.csv') as f:
    for i in range(10):
        line = f.readline()
        print('line: ', i, line)
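
Inspecting the first lines of a raw file is what tells you how many lines to skip before the header row. The sketch below shows the effect of `skiprows` on a hypothetical in-memory file; the preamble lines and the column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents: 4 preamble lines, then the header row and the data
raw_text = (
    "Data Source,Example\n"
    "Note,placeholder line 2\n"
    "Note,placeholder line 3\n"
    "Note,placeholder line 4\n"
    "Country Name,Country Code,1960\n"
    "A,AAA,1\n"
    "B,BBB,2\n"
)

# skiprows=4 discards the 4 preamble lines, so the 5th line becomes the header
df_sample = pd.read_csv(io.StringIO(raw_text), skiprows=4)
print(df_sample)
```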

In [None]:
df_projects.head()

# Count the number of null values in each column
df_projects.isnull().sum()

# Count the null values in each row (summing across columns)
df_population.isnull().sum(axis=1)

# Output every row that contains at least one null value
df_population[df_population.isnull().any(axis=1)]
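
The difference between the per-column and per-row null counts is easier to see on a small frame. The `df_demo` frame below is a hypothetical example, not part of the datasets used in this template.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame with three missing values
df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

# default axis=0: one count per column
nulls_per_column = df_demo.isnull().sum()

# axis=1: one count per row
nulls_per_row = df_demo.isnull().sum(axis=1)

# boolean mask keeps only the rows with at least one null
rows_with_nulls = df_demo[df_demo.isnull().any(axis=1)]

print(nulls_per_column)
print(nulls_per_row)
print(rows_with_nulls)
```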

In [None]:
df_projects.shape

In [None]:
df_population = df_population.drop('Unnamed: 62', axis=1)

## Extract from JSON

In [None]:
def print_lines(n, file_name):
    """Print the first n lines of a file."""
    with open(file_name) as f:
        for i in range(n):
            print(f.readline())

print_lines(1, 'population_data.json')

The first "line" in the file is actually the entire file. JSON is a compact way of representing data in a dictionary-like format. Luckily, pandas has a method to [read in a json file](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).

If you open the link with the documentation, you'll see there is an *orient* option that can handle JSON formatted in different ways:
```
'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
'records' : list like [{column -> value}, ... , {column -> value}]
'index' : dict like {index -> {column -> value}}
'columns' : dict like {column -> {index -> value}}
'values' : just the values array
```

In this case, the JSON is formatted with a 'records' orientation, so you'll need to use that value in the read_json() method. You can tell that the format is 'records' by comparing the pattern in the documentation with the pattern in the JSON file.
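
To make the comparison concrete, here is a minimal sketch that reads the same two rows from a 'records'-oriented string and from a 'columns'-oriented string. Both JSON strings are hypothetical samples, not excerpts from population_data.json.

```python
import io

import pandas as pd

# 'records': a list with one {column -> value} dict per row
records_json = '[{"name": "A", "value": 1}, {"name": "B", "value": 2}]'

# 'columns': one {index -> value} dict per column
columns_json = '{"name": {"0": "A", "1": "B"}, "value": {"0": 1, "1": 2}}'

# wrap the strings in StringIO so read_json treats them as file-like objects
df_records = pd.read_json(io.StringIO(records_json), orient="records")
df_columns = pd.read_json(io.StringIO(columns_json), orient="columns")

print(df_records)
print(df_columns)
```

Both calls produce the same two-row table; only the shape of the input JSON differs.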

Next, read in the population_data.json file using pandas.

In [None]:
import pandas as pd
df_json = pd.read_json('population_data.json', orient='records')

In [None]:
import json

# read in the JSON file
with open('population_data.json') as f:
    json_data = json.load(f)

# print the first record in the JSON file
print(json_data[0])
print('\n')

# show that JSON data is essentially a dictionary
print(json_data[0]['Country Name'])
print(json_data[0]['Country Code'])

## Extract from XML

In [None]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup

# open the population_data.xml file and load into Beautiful Soup
with open("population_data.xml") as fp:
    soup = BeautifulSoup(fp, "lxml") # lxml is the Parser type

# output the first 5 records in the xml file
# this is an example of how to navigate with BeautifulSoup

# use the find_all method to get all record tags in the document
for i, record in enumerate(soup.find_all('record'), start=1):
    # use the find_all method to get all fields in each record
    for field in record.find_all('field'):
        print(field['name'], ': ', field.text)
    print()
    if i == 5:
        break

Create a data frame from the xml file.
The dataframe should have the following layout:

| Country or Area | Year | Item | Value |
|----|----|----|----|
| Aruba | 1960 | Population, total | 54211 |
| Aruba | 1961 | Population, total | 55348 |
etc...

In [None]:
# collect the text of every field, keyed by the field's name attribute
data_dictionary = {'Country or Area':[], 'Year':[], 'Item':[], 'Value':[]}

# use the find_all method to get all record tags in the document
for record in soup.find_all('record'):
    for field in record.find_all('field'):
        data_dictionary[field['name']].append(field.text)

# long format: one row per country and year, as in the layout above
df = pd.DataFrame.from_dict(data_dictionary)
# optional: reshape to wide format, one row per country and one column per year
df = df.pivot(index='Country or Area', columns='Year', values='Value')
df.reset_index(level=0, inplace=True)
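
The same record/field traversal can be tried without the data file. The sketch below uses the standard-library `xml.etree.ElementTree` parser instead of BeautifulSoup, on an inline XML string. The nesting of the sample XML is an assumption about the file's structure; the two Aruba values are the ones shown in the layout table.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Assumed structure: <record> elements containing named <field> elements
xml_text = """<data>
  <record>
    <field name="Country or Area">Aruba</field>
    <field name="Item">Population, total</field>
    <field name="Year">1960</field>
    <field name="Value">54211</field>
  </record>
  <record>
    <field name="Country or Area">Aruba</field>
    <field name="Item">Population, total</field>
    <field name="Year">1961</field>
    <field name="Value">55348</field>
  </record>
</data>"""

root = ET.fromstring(xml_text)

# one dict per record, mapping each field's name attribute to its text
rows = [
    {field.get("name"): field.text for field in record.iter("field")}
    for record in root.iter("record")
]

# long format: one row per country and year
df_long = pd.DataFrame(rows, columns=["Country or Area", "Year", "Item", "Value"])

# wide format: one row per country, one column per year
df_wide = df_long.pivot(
    index="Country or Area", columns="Year", values="Value"
).reset_index()

print(df_long)
print(df_wide)
```

Note that every value is still a string at this point; converting `Year` and `Value` to numeric types belongs to the transformation step.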

## Extract from SQL Databases

### Demo: SQLite3 and Pandas

In [None]:
import sqlite3
import pandas as pd

# connect to the database
conn = sqlite3.connect('population_data.db')

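# run a query that returns every column
pd.read_sql('SELECT * FROM population_data', conn)
# run a query that selects specific columns; a numeric column name such as "1960" must be quoted
pd.read_sql('SELECT "Country_Name", "Country_Code", "1960" FROM population_data', conn)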
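
The query pattern can be tested without the population_data.db file by building a small in-memory SQLite table first. This is a sketch under assumptions: the table contents are a stand-in (one Aruba row using the 1960 value from the XML section, plus a clearly made-up placeholder row), and the `?` placeholder with `params` is one way to pass values safely rather than something the original demo uses.

```python
import sqlite3

import pandas as pd

# in-memory database, so nothing is written to disk
conn_demo = sqlite3.connect(":memory:")

# stand-in table: the second row is a made-up placeholder
pd.DataFrame(
    {
        "Country_Name": ["Aruba", "Example Land"],
        "Country_Code": ["ABW", "XXX"],
        "1960": [54211, 0],
    }
).to_sql("population_data", conn_demo, index=False)

# double quotes mark identifiers; ? is filled in from params
query = (
    'SELECT "Country_Name", "1960" '
    'FROM population_data WHERE "Country_Code" = ?'
)
df_one = pd.read_sql(query, conn_demo, params=("ABW",))

# release the connection once the data is in a DataFrame
conn_demo.close()
print(df_one)
```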

### Demo: SQLAlchemy and Pandas
If you are working with a different type of database such as MySQL or PostgreSQL, you can use the SQLAlchemy library with pandas. Here are the instructions for connecting to [different types of databases using SQLAlchemy](http://docs.sqlalchemy.org/en/latest/core/engines.html).

In [None]:
import pandas as pd
from sqlalchemy import create_engine

# 'sqlite:///' followed by an absolute path gives four slashes in total
engine = create_engine('sqlite:////home/workspace/3_sql_exercise/population_data.db')
pd.read_sql("SELECT * FROM population_data", engine)

## Extract from APIs

In [None]:
import requests
import pandas as pd

url = 'http://api.worldbank.org/v2/countries/br;cn;us;de/indicators/SP.POP.TOTL/?format=json&per_page=1000'
r = requests.get(url)
r.json()

This JSON data isn't quite ready for a pandas data frame. Notice that the JSON response is a list with two entries. The first entry is
```
{'lastupdated': '2018-06-28',
  'page': 1,
  'pages': 1,
  'per_page': 1000,
  'total': 232}
```

That first entry is metadata about the results. For example, it says that there is one page returned with 232 results.

The second entry is another list containing the data. This data would need some cleaning to be used in a pandas data frame. That would happen later in the transformation step of an ETL pipeline. Run the cell below to read the results into a dataframe and see what happens.

In [None]:
pd.DataFrame(r.json()[1])
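
One possible way to clean the response is sketched below on a hard-coded stand-in, so it runs without a network call. The stand-in is abridged and hypothetical: it assumes each record holds nested `indicator` and `country` dictionaries, and its country name and population values are placeholders, not real API output.

```python
import pandas as pd

# Hypothetical, abridged stand-in for r.json(): [metadata, list of records]
response_json = [
    {"page": 1, "pages": 1, "per_page": 1000, "total": 2},
    [
        {
            "indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
            "country": {"id": "XX", "value": "Example Land"},
            "date": "2001",
            "value": 200,  # placeholder value
        },
        {
            "indicator": {"id": "SP.POP.TOTL", "value": "Population, total"},
            "country": {"id": "XX", "value": "Example Land"},
            "date": "2000",
            "value": 100,  # placeholder value
        },
    ],
]

# split the two entries: metadata first, records second
metadata, records = response_json

# json_normalize flattens nested dicts into dotted column names
df_flat = pd.json_normalize(records)

# keep the columns of interest and give them plain names
df_api = df_flat[["country.value", "date", "value"]].rename(
    columns={"country.value": "country", "date": "year", "value": "population"}
)
df_api["year"] = df_api["year"].astype(int)

print(metadata["total"], "records reported")
print(df_api)
```

Compared with `pd.DataFrame(r.json()[1])`, which leaves whole dictionaries inside the nested columns, flattening produces one scalar value per cell.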