# ETL with Python

# pETL package
pETL is a general-purpose Python package for extracting, transforming, and loading tables of data.
<br>
Extract & Load: http://petl.readthedocs.io/en/latest/io.html
<br>
Transform: http://petl.readthedocs.io/en/latest/transform.html

Install petl by running ```pip install petl``` from the command line.

In [None]:
import petl as etl
import datetime

# Extract

### Extract JSON file and display

In [None]:
filename= 'yelp_academic_dataset_users_nofriendlist_PA.json'

In [None]:
t1 = etl.fromjson(filename)
t1.display(10)

Creating the users dimension:

- user_id = varchar(20)
- friends_count = INT
- review_count = INT
- fans = INT
- is_elite = Binary
- yelping_since = Date

In [None]:
t2 = t1.cut(['user_id','review_count','fans','elite','yelping_since'])
t2.display(10)

# Transform

In [None]:
# Show which Python types occur in each column
fields = t2.fieldnames()
for f in fields:
    print('%s\t%s' % (f, t2.typecounter(f)))

Convert the unicode `elite` values to a binary flag (this creates the is_elite field)

In [None]:
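def to_binary(text):
    # Assumes non-elite users are stored as text like "[u'None']",
    # so that characters 3-6 spell 'None'.
    if text[3:7] == 'None':
        return 0  # not elite
    return 1      # any other value counts as elite

t3 = t2.convert('elite', to_binary)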
t3.display(10)
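
The slice comparison above depends on exact character positions. As an alternative sketch, assuming the `elite` values are stringified Python lists such as `[u'None']` or `[2012, 2013]` (an assumption about the data, not something confirmed here), the standard library's `ast.literal_eval` can parse the text before it is inspected. The helper names `parse_list_text` and `is_elite` are hypothetical.

```python
import ast


def parse_list_text(text):
    """Parse text such as "[u'None']" into a Python list; return None on failure."""
    try:
        value = ast.literal_eval(text)
    except (TypeError, ValueError, SyntaxError):
        return None
    return value if isinstance(value, list) else None


def is_elite(text):
    """Return 0 for an empty or 'None'-only list, 1 otherwise, None if unparseable."""
    items = parse_list_text(text)
    if items is None:
        return None
    if not items or items in (['None'], [None]):
        return 0
    return 1
```

In the notebook this could stand in for `to_binary`, for example `t2.convert('elite', is_elite)`.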


Build the friends_count column by joining the users table with the friends table

In [None]:
source = 'Pittsburgh_full_friend_text.json'
t4 = etl.fromjson(source)
t5 = t3.join(t4,                              # right table
             lkey='user_id', rkey='user_id',  # join equality columns
             rprefix='t4_')                   # prefix for the right table's columns (optional)

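def friend_count(text):
    # Count the entries of a stringified list such as "[u'a', u'b']".
    if text[0] == '[' and text[-1] == ']':
        if text[3:7] == 'None' or text[1:-1].strip() == '':
            return 0  # placeholder 'None' entry or empty list: no friends
        return len(text[1:-1].split(', '))
    return None  # value is not a bracketed list

t6 = t5.convert('t4_friends', friend_count)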
t6.display(10)

Add a new numeric user ID and rename the columns

In [None]:
t7 = t6.addrownumbers()
t8 = t7.rename({'row':'user_id','user_id':'yelp_user_id','elite':'is_elite','t4_friends':'friends_count'})
t8.display(10)


In [None]:
# Re-check the column types after the transformations
fields = t8.fieldnames()
for f in fields:
    print('%s\t%s' % (f, t8.typecounter(f)))
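
The users dimension lists yelping_since as a Date, but none of the transform steps converts that column. A minimal sketch of one way to do it with the `datetime` module, assuming the values are text in `YYYY-MM` form (an assumption; adjust `fmt` to whatever the file actually contains). The function name `to_date` is hypothetical.

```python
import datetime


def to_date(text, fmt='%Y-%m'):
    """Parse text such as '2012-03' into a datetime.date; return None if it does not match."""
    try:
        return datetime.datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        return None


# In the notebook this could be applied as:
#   t8.convert('yelping_since', to_date)
```

With the default format, a missing day is filled in as the first of the month.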


# Load

We can save the output in multiple ways. First, let's try CSV (which we already know how to load into MySQL).

Now we will work with a MySQL cursor to load the data. First we'll create the schema and tables.

In [None]:
import MySQLdb as mdb

In [None]:
con = mdb.connect(host='127.0.0.1', user='root', passwd='root')  # optional: db="schema_name"
# setting a cursor
cur = con.cursor()     # get the cursor
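
The statements that create the schema and table are not shown in this notebook. Below is a minimal sketch of what they might look like. The column names follow `t8` after the rename and the types follow the users dimension listed earlier; the `TINYINT(1)` choice for the binary flag, the primary key, and the function name `create_dim_users` are assumptions. Declaring `yelping_since` as `DATE` also assumes the values have been converted to dates before loading.

```python
# Hypothetical DDL for the target table; adjust names and types to your data.
CREATE_SCHEMA_SQL = "CREATE DATABASE IF NOT EXISTS yelp_pittsburgh"

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS yelp_pittsburgh.dim_users (
    user_id        INT NOT NULL PRIMARY KEY,
    yelp_user_id   VARCHAR(20),
    review_count   INT,
    fans           INT,
    is_elite       TINYINT(1),
    yelping_since  DATE,
    friends_count  INT
)
"""


def create_dim_users(cur):
    """Run both statements on an open DB-API cursor, such as a MySQLdb cursor."""
    cur.execute(CREATE_SCHEMA_SQL)
    cur.execute(CREATE_TABLE_SQL)


# In the notebook, call this once before appending rows:
#   create_dim_users(cur)
```

Keeping the DDL in plain strings makes it easy to run the same statements from the MySQL client instead of from Python.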

#### Append data to existing tables

In [None]:
cur.execute('SET SQL_MODE=ANSI_QUOTES')
t8.appenddb(cur,'dim_users',schema='yelp_pittsburgh',commit=True)

#### Now we can find all our data in the MySQL server:

In [None]:
# Query the table that was just loaded
cur.execute(""" SELECT *
                FROM yelp_pittsburgh.dim_users
                LIMIT 10 ;""")

print(cur.fetchall())

## P.S. - Pandas

Another very commonly used package for data manipulation is Pandas (http://pandas.pydata.org/), which is used not only for ETL but also for data analysis, visualization, and data mining. You're welcome to check out this package and use it in your projects.

In [None]:
cur.close()  # release the cursor
con.close()  # close the connection to the server