## INTRO TO DATA ENGINEERING - ETL Part 2

- Extract refers to pulling the source data from the original database or data source. With ETL, the data goes into a temporary staging area. With ELT, it goes immediately into a data lake storage system.

- Transform refers to changing the structure of the information so that it integrates with the target data system and the rest of the data in that system.

- Load refers to depositing the information into a data storage system.
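
The three steps above can be sketched end to end with nothing but the standard library. This is a minimal illustration, assuming two in-memory SQLite databases stand in for the production source and the target store; the table names, the sample rows, and the `extract`, `transform`, and `load` helpers are hypothetical and are not part of this notebook's pipeline.

```python
import sqlite3


def extract(source):
    """Extract: pull the raw rows out of the source database."""
    return source.execute("SELECT id, comments FROM reviews").fetchall()


def transform(rows):
    """Transform: drop empty comments and trim surrounding whitespace."""
    return [(rid, text.strip()) for rid, text in rows if text and text.strip()]


def load(target, rows):
    """Load: write the transformed rows into the target database."""
    target.execute("CREATE TABLE IF NOT EXISTS reviews_clean (id INTEGER, comments TEXT)")
    target.executemany("INSERT INTO reviews_clean VALUES (?, ?)", rows)
    target.commit()


# Hypothetical source data standing in for a production table.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE reviews (id INTEGER, comments TEXT)")
source.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, " Muy bien "), (2, ""), (3, "Great stay")],
)

target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))

loaded = target.execute("SELECT id, comments FROM reviews_clean ORDER BY id").fetchall()
print(loaded)
```

The rest of this notebook follows the same shape, with pandas doing the transform and SQLAlchemy engines providing the connections.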

In [1]:
import pandas as pd
import requests
import sqlalchemy
from Dependencies import credential_db
from langdetect import detect
import warnings
warnings.filterwarnings("ignore")

### 1. EXTRACT

#### Pull the reviews data from production

First, connect to the jatimcamp5 production database and check the connection.

In [2]:
conn = sqlalchemy.create_engine('mysql+pymysql://{0}:{1}@{2}/{3}'.format(credential_db.db_jatimcamp5_username,
                                                                         credential_db.db_jatimcamp5_password, 
                                                                         credential_db.db_jatimcamp5_host, 
                                                                         credential_db.db_jatimcamp5_name))

In [3]:
conn

Engine(mysql+pymysql://etlonly:***@35.225.122.70/jatimCamp5_production)

Displaying the engine does not open a connection, so run a query to check whether the connection actually works.

In [4]:
query_table = """show tables;"""
df_query = pd.read_sql(query_table,conn)

In [5]:
df_query

Unnamed: 0,Tables_in_jatimCamp5_production
0,calendars
1,listings
2,reviews
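
A lighter connectivity check than listing tables is to run a trivial query and catch the failure. The sketch below assumes an in-memory SQLite engine in place of the MySQL one, so it runs without credentials; `can_connect` and `demo_engine` are hypothetical names introduced for illustration.

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import SQLAlchemyError


def can_connect(engine):
    """Return True if a trivial query succeeds on the engine, False otherwise."""
    try:
        # create_engine is lazy; the connection is only opened here.
        with engine.connect() as connection:
            connection.execute(text("SELECT 1"))
        return True
    except SQLAlchemyError:
        return False


# Stand-in engine (assumption): in-memory SQLite instead of the production MySQL.
demo_engine = create_engine("sqlite://")
print(can_connect(demo_engine))
```

The same helper should work unchanged with the `conn` engine created earlier, since it only relies on `engine.connect()`.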


The connection works. Now extract the data from `reviews` and show it.

In [6]:
query_rev = """select * from reviews"""
df_rev = pd.read_sql(query_rev, conn)
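
Reading a whole table in one call is fine for 500 rows. For a larger table, one option is the `chunksize` parameter of `pd.read_sql`, which yields DataFrames piece by piece. The sketch below assumes an in-memory SQLite table with made-up rows; `demo_conn` and the sample data are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the production table.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE reviews (id INTEGER, comments TEXT)")
demo_conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(i, f"comment {i}") for i in range(5)],
)
demo_conn.commit()

# With chunksize set, read_sql returns an iterator of DataFrames.
chunk_sizes = []
frames = []
for chunk in pd.read_sql("SELECT id, comments FROM reviews", demo_conn, chunksize=2):
    chunk_sizes.append(len(chunk))
    frames.append(chunk)

# Stitch the pieces back together once each chunk has been processed.
df_all = pd.concat(frames, ignore_index=True)
print(chunk_sizes, df_all.shape)
```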

In [7]:
df_rev.head(3)

Unnamed: 0,index,listing_id,id,date,reviewer_id,reviewer_name,comments
0,0,1497879,21943405,2014-10-27,19785528,Jose Edwin,El apartamento está perfecto. Es tranquilo y e...
1,1,1497879,22093757,2014-10-29,22488375,Kayla,Sara was beyond nice and very helpful with all...
2,2,1497879,22229170,2014-11-02,12920446,Tim,We arrived on Friday night to be met by Sara. ...


In [8]:
df_rev.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
index            500 non-null int64
listing_id       500 non-null int64
id               500 non-null int64
date             500 non-null object
reviewer_id      500 non-null int64
reviewer_name    500 non-null object
comments         500 non-null object
dtypes: int64(4), object(3)
memory usage: 27.5+ KB


In [9]:
df_rev.shape

(500, 7)

We have 500 rows and 7 columns.

### 2. TRANSFORM

#### Keep Spanish-language comments only

First, apply `detect` from the `langdetect` library to each comment and store the result in a new column.

In [10]:
df_rev['comments_langdect'] = df_rev['comments'].apply(detect)
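
`detect` raises an exception on text with no detectable features (an empty string, for example), which would abort the whole `apply`. A minimal sketch of a defensive wrapper follows. `safe_detect`, its `detector` parameter, the `"unknown"` fallback, and `fake_detect` are hypothetical names introduced here for illustration and are not part of `langdetect`.

```python
def safe_detect(text, detector, default="unknown"):
    """Return detector(text), or `default` when the text is unusable or detection fails."""
    # Skip values that are not text at all (None, NaN) and blank strings.
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return detector(text)
    except Exception:
        # langdetect raises its own exception type when it finds no features;
        # catching Exception keeps this sketch independent of that import.
        return default


def fake_detect(text):
    """Hypothetical stand-in for langdetect.detect so the sketch runs without the library."""
    if text.isdigit():
        raise ValueError("no features in text")
    return "es" if "departamento" in text else "en"


samples = ["El departamento es chico", "Great stay", "", None, "12345"]
labels = [safe_detect(s, fake_detect) for s in samples]
print(labels)
```

In this notebook the call would presumably become `df_rev['comments'].apply(safe_detect, detector=detect)`. Detection results can also vary between runs on short or ambiguous text; the `langdetect` README suggests setting `DetectorFactory.seed = 0` before the first call to make them repeatable.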

In [11]:
df_rev.head(3)

Unnamed: 0,index,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_langdect
0,0,1497879,21943405,2014-10-27,19785528,Jose Edwin,El apartamento está perfecto. Es tranquilo y e...,es
1,1,1497879,22093757,2014-10-29,22488375,Kayla,Sara was beyond nice and very helpful with all...,en
2,2,1497879,22229170,2014-11-02,12920446,Tim,We arrived on Friday night to be met by Sara. ...,en


Next, select the rows whose `comments_langdect` value is `es` (Spanish).

In [12]:
df_rev[df_rev['comments_langdect'] == 'es'].head(3)

Unnamed: 0,index,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_langdect
0,0,1497879,21943405,2014-10-27,19785528,Jose Edwin,El apartamento está perfecto. Es tranquilo y e...,es
50,50,1497879,32503601,2015-05-19,28626807,Angel,Sara es muy amable. Nos dio todas las indicaci...,es
279,279,4924009,28826784,2015-03-30,28815032,Romina,Regular. El departamento estaba sin terminar. ...,es


In [13]:
filter_es = df_rev['comments_langdect'] == 'es'
df_rev[filter_es]

Unnamed: 0,index,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_langdect
0,0,1497879,21943405,2014-10-27,19785528,Jose Edwin,El apartamento está perfecto. Es tranquilo y e...,es
50,50,1497879,32503601,2015-05-19,28626807,Angel,Sara es muy amable. Nos dio todas las indicaci...,es
279,279,4924009,28826784,2015-03-30,28815032,Romina,Regular. El departamento estaba sin terminar. ...,es
299,299,4924009,50926436,2015-10-16,37419331,Mercedes,"El departamento es mas chico de lo que parece,...",es
326,326,4924009,97468253,2016-08-27,62577797,Gustavo G.,Todo más qué bien. El departamento es chico pe...,es
408,408,4967219,41686881,2015-08-08,39697072,Maria Camila,Excelente ubicacion a unos pocos pasos de la e...,es
421,421,4967219,46369603,2015-09-10,29021507,Andrea,Muy amables. Habitación y sabanas y toallas mu...,es


In [14]:
df_rev_clean = df_rev[filter_es]
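
A boolean-mask selection like this one can return an object that pandas treats as derived from the original frame, so later column assignments may trigger a `SettingWithCopyWarning` (silenced in this notebook by the `warnings.filterwarnings("ignore")` call). A small sketch of the usual remedy, an explicit `.copy()`, is shown below; the `demo` frame and the `n_chars` column are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of df_rev.
demo = pd.DataFrame({
    "comments": ["Muy amables", "Great stay", "Todo bien"],
    "comments_langdect": ["es", "en", "es"],
})

mask = demo["comments_langdect"] == "es"

# .copy() makes the filtered frame independent of the original.
demo_clean = demo.loc[mask].copy()

# Adding a column to the copy leaves the original frame untouched.
demo_clean["n_chars"] = demo_clean["comments"].str.len()
print(demo_clean.index.tolist(), demo.columns.tolist())
```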

In [15]:
df_rev_clean.head(3)

Unnamed: 0,index,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_langdect
0,0,1497879,21943405,2014-10-27,19785528,Jose Edwin,El apartamento está perfecto. Es tranquilo y e...,es
50,50,1497879,32503601,2015-05-19,28626807,Angel,Sara es muy amable. Nos dio todas las indicaci...,es
279,279,4924009,28826784,2015-03-30,28815032,Romina,Regular. El departamento estaba sin terminar. ...,es


In [16]:
df_rev_clean.shape

(7, 8)

Out of 500 rows, the filter keeps only 7.

### 3. LOAD
Push the data to the data warehouse.

In [17]:
conn_dwh = sqlalchemy.create_engine('mysql+pymysql://{0}:{1}@{2}/{3}'.format(credential_db.db_jatimcamp5_DWH_username, 
                                                                             credential_db.db_jatimcamp5_DWH_password, 
                                                                             credential_db.db_jatimcamp5_DWH_host, 
                                                                             credential_db.db_jatimcamp5_DWH_name))

In [18]:
conn_dwh

Engine(mysql+pymysql://etlonly:***@35.225.122.70/jatimCamp5_dwh)

Push the data to the data warehouse, then read it back to check the result.

In [19]:
df_rev_clean.to_sql(con=conn_dwh, name='REVIEW_CAHYA_995', if_exists='replace')

In [20]:
query_rev = 'select * from REVIEW_CAHYA_995'

df_rev_dwh = pd.read_sql_query(query_rev, conn_dwh)

In [21]:
df_rev_dwh

Unnamed: 0,level_0,index,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_langdect
0,0,0,1497879,21943405,2014-10-27,19785528,Jose Edwin,El apartamento está perfecto. Es tranquilo y e...,es
1,50,50,1497879,32503601,2015-05-19,28626807,Angel,Sara es muy amable. Nos dio todas las indicaci...,es
2,279,279,4924009,28826784,2015-03-30,28815032,Romina,Regular. El departamento estaba sin terminar. ...,es
3,299,299,4924009,50926436,2015-10-16,37419331,Mercedes,"El departamento es mas chico de lo que parece,...",es
4,326,326,4924009,97468253,2016-08-27,62577797,Gustavo G.,Todo más qué bien. El departamento es chico pe...,es
5,408,408,4967219,41686881,2015-08-08,39697072,Maria Camila,Excelente ubicacion a unos pocos pasos de la e...,es
6,421,421,4967219,46369603,2015-09-10,29021507,Andrea,Muy amables. Habitación y sabanas y toallas mu...,es
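
The extra `level_0` column in this output comes from `to_sql` writing the DataFrame index as a column by default; because a column named `index` already exists, pandas falls back to the label `level_0`. A minimal sketch of the difference, assuming an in-memory SQLite database and a made-up two-row frame (`demo`, `demo_dwh`, and the table names are hypothetical):

```python
import sqlite3

import pandas as pd

# Hypothetical frame that, like df_rev_clean, already has a column named "index".
demo = pd.DataFrame(
    {"index": [0, 50], "comments": ["Muy amables", "Todo bien"]},
    index=[0, 50],
)
demo_dwh = sqlite3.connect(":memory:")

# Default behaviour: the DataFrame index is written as an extra column.
demo.to_sql("with_index", demo_dwh, if_exists="replace")

# index=False writes only the actual columns.
demo.to_sql("without_index", demo_dwh, if_exists="replace", index=False)

cols_with = pd.read_sql("SELECT * FROM with_index", demo_dwh).columns.tolist()
cols_without = pd.read_sql("SELECT * FROM without_index", demo_dwh).columns.tolist()
print(cols_with)
print(cols_without)
```

Passing `index=False` in the `to_sql` call above would likely avoid the duplicated column in the warehouse table, at the cost of no longer storing the original row positions separately.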


Success! The 7 filtered rows are now in the data warehouse.

Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable.
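
One way to honour that responsibility is to scope each connection with a `with` block and dispose of the engine once the work is done. The sketch below assumes an in-memory SQLite engine so it runs without credentials; `demo_engine` is a hypothetical name.

```python
from sqlalchemy import create_engine, text

# Stand-in engine (assumption): in-memory SQLite instead of the warehouse.
demo_engine = create_engine("sqlite://")

try:
    # The with-block returns the connection to the pool when it exits.
    with demo_engine.connect() as connection:
        answer = connection.execute(text("SELECT 1 + 1")).scalar()
finally:
    # dispose() closes the pooled connections held by the engine.
    demo_engine.dispose()

print(answer)
```

In this notebook the same pattern would apply to both `conn` and `conn_dwh` after the final read.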