# ETL Pipeline: Extraction (part 2)

Extraction is the first step in an ETL pipeline. In this notebook, we will work through extracting World Bank data that is saved in different formats, including CSV, JSON, and XML.

### Extracting a json file

In [1]:
def print_lines(n, file_name):
    """Print the first n lines of a file.

    INPUT:
    n = number of lines
    file_name = name of file
    """
    # the with-statement closes the file even if reading fails
    with open(file_name) as f:
        for _ in range(n):
            print(f.readline())

In [None]:
#  call print_lines function to read the first line of the population_data.json file
print_lines(1, 'population_data.json')

The first "line" in the file is actually the entire file. JSON is a compact way of representing data in a dictionary-like format. We will now use the pandas [`read_json`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html) method to read in the file. The file is in records format, that is, a list of objects with one object per row, so we pass `orient="records"`.
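
To make the records layout concrete, here is a minimal sketch that reads a small JSON string instead of the real file. The two-record string `sample_json` and the name `df_sample` are made up for illustration; the field names and values are borrowed from the first rows shown in this notebook, and only three of the columns are included.

```python
import io
import pandas as pd

# Hypothetical two-record sample, shaped like the rows of population_data.json.
# Records format is a list of objects, one object per row.
sample_json = (
    '[{"Country Name": "Aruba", "Country Code": "ABW", "1960": 54211.0},'
    ' {"Country Name": "Afghanistan", "Country Code": "AFG", "1960": 8996351.0}]'
)

# Wrap the string in StringIO so pandas treats it as a file-like object
df_sample = pd.read_json(io.StringIO(sample_json), orient="records")
print(df_sample)
```

Each object becomes one row, and the object keys become the column names.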

In [3]:
import pandas as pd
df_json = pd.read_json('population_data.json', orient='records')

In [4]:
df_json.head()

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Aruba,ABW,"Population, total",SP.POP.TOTL,54211.0,55438.0,56225.0,56695.0,57032.0,57360.0,...,101353.0,101453.0,101669.0,102053.0,102577.0,103187.0,103795.0,104341.0,104822.0,105264.0
1,Afghanistan,AFG,"Population, total",SP.POP.TOTL,8996351.0,9166764.0,9345868.0,9533954.0,9731361.0,9938414.0,...,27294031.0,28004331.0,28803167.0,29708599.0,30696958.0,31731688.0,32758020.0,33736494.0,34656032.0,35530081.0
2,Angola,AGO,"Population, total",SP.POP.TOTL,5643182.0,5753024.0,5866061.0,5980417.0,6093321.0,6203299.0,...,21759420.0,22549547.0,23369131.0,24218565.0,25096150.0,25998340.0,26920466.0,27859305.0,28813463.0,29784193.0
3,Albania,ALB,"Population, total",SP.POP.TOTL,1608800.0,1659800.0,1711319.0,1762621.0,1814135.0,1864791.0,...,2947314.0,2927519.0,2913021.0,2905195.0,2900401.0,2895092.0,2889104.0,2880703.0,2876101.0,2873457.0
4,Andorra,AND,"Population, total",SP.POP.TOTL,13411.0,14375.0,15370.0,16412.0,17469.0,18549.0,...,83861.0,84462.0,84449.0,83751.0,82431.0,80788.0,79223.0,78014.0,77281.0,76965.0


### Extracting an xml file

Here we will work with the same data, except that it is now in XML format.

In [5]:
# read the first few lines of the population_data.xml file
print_lines(15, 'population_data.xml')

﻿<?xml version="1.0" encoding="utf-8"?>

<Root xmlns:wb="http://www.worldbank.org">

  <data>

    <record>

      <field name="Country or Area" key="ABW">Aruba</field>

      <field name="Item" key="SP.POP.TOTL">Population, total</field>

      <field name="Year">1960</field>

      <field name="Value">54211</field>

    </record>

    <record>

      <field name="Country or Area" key="ABW">Aruba</field>

      <field name="Item" key="SP.POP.TOTL">Population, total</field>

      <field name="Year">1961</field>

      <field name="Value">55438</field>

    </record>



- XML looks very similar to HTML. 
- XML is formatted with tags having values inside the tags. 
- XML is not as easy to navigate as JSON. 

Older versions of pandas cannot read XML directly (`pandas.read_xml` was only added in pandas 1.3). One reason is that tag names are user-defined, so every XML file might be structured differently. Let us therefore see how we can read and navigate XML files ourselves.

In [6]:
# import the BeautifulSoup library
from bs4 import BeautifulSoup

# open the population_data.xml file and load into Beautiful Soup
with open("population_data.xml") as f:
    soup = BeautifulSoup(f, "lxml")  # "lxml" selects the lxml parser

In [7]:
# print the fields of the first five records in the XML file
i = 0
# get all record tags in the document
for record in soup.find_all('record'):
    i += 1
    # use the find_all method to get all fields in each record
    for field in record.find_all('field'):
        print(field['name'], ': ', field.text)
    print()
    if i == 5:
        break

Country or Area :  Aruba
Item :  Population, total
Year :  1960
Value :  54211

Country or Area :  Aruba
Item :  Population, total
Year :  1961
Value :  55438

Country or Area :  Aruba
Item :  Population, total
Year :  1962
Value :  56225

Country or Area :  Aruba
Item :  Population, total
Year :  1963
Value :  56695

Country or Area :  Aruba
Item :  Population, total
Year :  1964
Value :  57032
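
The same record-by-record traversal can also be sketched with Python's standard-library `xml.etree.ElementTree`, shown here only as an alternative to Beautiful Soup and not as the approach this notebook uses. The string `sample_xml` is a hypothetical two-record excerpt modeled on the lines printed earlier, and `records` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt with the same structure as population_data.xml
sample_xml = """<Root xmlns:wb="http://www.worldbank.org">
  <data>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Item" key="SP.POP.TOTL">Population, total</field>
      <field name="Year">1960</field>
      <field name="Value">54211</field>
    </record>
    <record>
      <field name="Country or Area" key="ABW">Aruba</field>
      <field name="Item" key="SP.POP.TOTL">Population, total</field>
      <field name="Year">1961</field>
      <field name="Value">55438</field>
    </record>
  </data>
</Root>"""

root = ET.fromstring(sample_xml)

records = []
# iter() walks the whole tree, so nested <record> tags are found too
for record in root.iter("record"):
    # map each field's "name" attribute to its text content
    row = {field.get("name"): field.text for field in record.findall("field")}
    records.append(row)

for row in records:
    print(row)
```

Note that every value comes back as a string, including the year and the population count.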



In [8]:
# collect each field's text under its name, one list entry per record
data_dictionary = {'Country or Area':[], 'Year':[], 'Item':[], 'Value':[]}

for record in soup.find_all('record'):
    for field in record.find_all('field'):
        data_dictionary[field['name']].append(field.text)

# reshape from one row per (country, year) to one row per country
df = pd.DataFrame.from_dict(data_dictionary)
df = df.pivot(index='Country or Area', columns='Year', values='Value')
df.reset_index(level=0, inplace=True)

In [9]:
df.head()

Year,Country or Area,1960,1961,1962,1963,1964,1965,1966,1967,1968,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,Afghanistan,8996351,9166764,9345868,9533954,9731361,9938414,10152331,10372630,10604346,...,27294031,28004331,28803167,29708599,30696958,31731688,32758020,33736494,34656032,35530081
1,Albania,1608800,1659800,1711319,1762621,1814135,1864791,1914573,1965598,2022272,...,2947314,2927519,2913021,2905195,2900401,2895092,2889104,2880703,2876101,2873457
2,Algeria,11124888,11404859,11690153,11985136,12295970,12626952,12980267,13354197,13744387,...,34860715,35465760,36117637,36819558,37565847,38338562,39113313,39871528,40606052,41318142
3,American Samoa,20013,20486,21117,21882,22698,23520,24321,25116,25885,...,57030,56227,55637,55320,55230,55307,55437,55537,55599,55641
4,Andorra,13411,14375,15370,16412,17469,18549,19647,20758,21890,...,83861,84462,84449,83751,82431,80788,79223,78014,77281,76965
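
The reshaping step is easier to follow on a tiny table. The sketch below uses a hypothetical four-row frame, `long_df`, with values taken from the outputs in this notebook, and pivots it from one row per country and year to one row per country. Because text extracted from XML is always a string, it also converts the year columns to numbers with `pd.to_numeric`, a step the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical long-format data: one row per (country, year), values as strings
long_df = pd.DataFrame({
    "Country or Area": ["Aruba", "Aruba", "Andorra", "Andorra"],
    "Year": ["1960", "1961", "1960", "1961"],
    "Value": ["54211", "55438", "13411", "14375"],
})

# One row per country, one column per year
wide_df = long_df.pivot(index="Country or Area", columns="Year", values="Value")
wide_df = wide_df.reset_index()

# Convert the year columns from strings to numbers
year_columns = ["1960", "1961"]
wide_df[year_columns] = wide_df[year_columns].apply(pd.to_numeric)

print(wide_df)
```

After the pivot the rows are sorted by country name, which is why the XML result starts with Afghanistan while the JSON result starts with Aruba.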


#### Conclusion

We see that XML is a bit tricky to handle: the records had to be walked tag by tag and then reshaped by hand. JSON, if everything is formatted correctly, is especially easy to work with.