## US Weather and Immigration Study
### Data Engineering Capstone Project

#### Project Summary
In this project, an ETL pipeline is created to combine data from four different data sets (immigration, temperature, demographics, and airports) to assess immigration patterns in the US.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Imports
import pandas as pd
import os
import psycopg2
import boto3
from pyspark.sql import SparkSession
from io import StringIO # python3; python2: BytesIO 

In [2]:
# Get settings for connecting to AWS and Redshift from environment variables.
# Credentials must not be hardcoded in the notebook.
# Note: write_to_staging passes AWS_SECRET as aws_access_key_id and
# AWS_KEY as aws_secret_access_key, so set the variables accordingly.
AWS_SECRET = os.environ['AWS_SECRET']
AWS_KEY = os.environ['AWS_KEY']
DB_USER = os.environ['DB_USER']
DB_PASSWORD = os.environ['DB_PASSWORD']
ARN = os.environ['ARN']

DB = 'dev'
HOST = 'redshift-cluster-1.ca7m8qui9aaj.us-west-2.redshift.amazonaws.com'

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

In this project, immigration, temperature, demographic, and airport data will be processed into a form that data analysts can use to analyze immigration trends in the US.

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

In [3]:
# Read Temperature Data
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
pd_temp_df = pd.read_csv(fname)

In [4]:
# Create Spark Session
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()
#df_spark =spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
#write to parquet
#df_spark.write.parquet("sas_data")

In [5]:
# Read Immigration Data
sprk_im_df = spark.read.parquet("sas_data")

In [6]:
# Read Demographic Data
pd_demographic_df = pd.read_csv('us-cities-demographics.csv')

In [7]:
# Read in Airport Data
pd_airport_df = pd.read_csv('airport-codes_csv.csv')

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data
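
The checks described above could be sketched as a small helper that reports the row count, missing values per column, and duplicate rows. This is a minimal sketch and not part of the original pipeline; `summarize_quality` and the toy DataFrame are hypothetical names used only for illustration.

```python
import pandas as pd


def summarize_quality(df):
    """Return basic data quality metrics for a pandas DataFrame (hypothetical helper)."""
    return {
        'rows': len(df),
        'null_counts': df.isnull().sum().to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
    }


# toy data for illustration only
toy_df = pd.DataFrame({
    'City': ['Abilene', 'Abilene', None],
    'AverageTemperature': [2.101, 2.101, 6.926],
})
report = summarize_quality(toy_df)
print(report)
```

Running a summary like this before and after each cleaning step makes it easier to document how many rows `dropna()` and `drop_duplicates()` remove.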

In [8]:
# helper function to write dfs to S3 storage for staging
def write_to_staging(df, bucket, csv_name):
    """Function that writes a df to csv in S3
    Args:
        df (dataframe): pandas dataframe
        bucket (str): S3 bucket name
        csv_name (str): name for output csv
    """
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)
    s3_resource = boto3.resource('s3',
                                aws_access_key_id=AWS_SECRET, 
                                aws_secret_access_key=AWS_KEY, 
                                region_name='us-west-2')
    s3_resource.Object(bucket, csv_name).put(Body=csv_buffer.getvalue())

### Temperature

In [9]:
pd_temp_df.shape

(8599212, 7)

In [10]:
pd_temp_df.columns

Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude'],
      dtype='object')

In [11]:
# countries = pd_temp_df['Country'].unique()
# countries.sort()
# countries

In [12]:
pd_temp_df = pd_temp_df[pd_temp_df['Country']=='United States']
pd_temp_df.shape

(687289, 7)

In [13]:
pd_temp_df = pd_temp_df.dropna()
pd_temp_df.shape

(661524, 7)

In [14]:
pd_temp_df.head(2)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
47555,1820-01-01,2.101,3.217,Abilene,United States,32.95N,100.53W
47556,1820-02-01,6.926,2.853,Abilene,United States,32.95N,100.53W


In [15]:
# format lat and lon
pd_temp_df['Latitude'] = pd_temp_df['Latitude'].str[:-1] 
pd_temp_df['Longitude'] = '-' + pd_temp_df['Longitude'].str[:-1] 
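
The slicing above assumes that every latitude ends in `N` and every longitude ends in `W`, as in the rows shown. A slightly more general sketch derives the sign from the hemisphere letter instead; `parse_coordinate` is a hypothetical helper, not part of the original notebook.

```python
def parse_coordinate(value):
    """Convert a string such as '32.95N' or '100.53W' to a signed float.

    Hypothetical helper: south and west become negative.
    """
    magnitude = float(value[:-1])
    hemisphere = value[-1].upper()
    if hemisphere in ('S', 'W'):
        return -magnitude
    if hemisphere in ('N', 'E'):
        return magnitude
    raise ValueError(f'unexpected hemisphere suffix in {value!r}')


print(parse_coordinate('32.95N'))   # 32.95
print(parse_coordinate('100.53W'))  # -100.53
```

It could be applied with `pd_temp_df['Latitude'].apply(parse_coordinate)`, which also yields floats rather than strings.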

In [16]:
# convert from Celsius to Fahrenheit
pd_temp_df['AverageTemperature'] = pd_temp_df['AverageTemperature'] * 1.8 + 32
# the uncertainty is a temperature difference, so it is only scaled (no +32 offset)
pd_temp_df['AverageTemperatureUncertainty'] = pd_temp_df['AverageTemperatureUncertainty'] * 1.8

In [17]:
# remove country
pd_temp_df = pd_temp_df.drop('Country', axis=1)

In [18]:
pd_temp_df.head(2)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Latitude,Longitude
47555,1820-01-01,35.7818,5.7906,Abilene,32.95,-100.53
47556,1820-02-01,44.4668,5.1354,Abilene,32.95,-100.53


In [19]:
temperature_data_dict = {
                          'dt': 'date' , 
                          'AverageTemperature': 'average temperature in Fahrenheit' , 
                          'AverageTemperatureUncertainty': 'temperature uncertainty in Fahrenheit', 
                          'City': 'City', 
                          'Latitude': 'latitude', 
                          'Longitude': 'longitude'
                        }

In [20]:
pd_temp_df['dt']

47555      1820-01-01
47556      1820-02-01
47557      1820-03-01
47558      1820-04-01
47559      1820-05-01
47560      1820-06-01
47561      1820-07-01
47562      1820-08-01
47563      1820-09-01
47564      1820-10-01
47565      1820-11-01
47566      1820-12-01
47567      1821-01-01
47568      1821-02-01
47569      1821-03-01
47570      1821-04-01
47571      1821-05-01
47572      1821-06-01
47573      1821-07-01
47574      1821-08-01
47575      1821-09-01
47576      1821-10-01
47577      1821-11-01
47578      1821-12-01
47579      1822-01-01
47580      1822-02-01
47581      1822-03-01
47582      1822-04-01
47583      1822-05-01
47584      1822-06-01
              ...    
8439217    2011-04-01
8439218    2011-05-01
8439219    2011-06-01
8439220    2011-07-01
8439221    2011-08-01
8439222    2011-09-01
8439223    2011-10-01
8439224    2011-11-01
8439225    2011-12-01
8439226    2012-01-01
8439227    2012-02-01
8439228    2012-03-01
8439229    2012-04-01
8439230    2012-05-01
8439231   

In [21]:
# We can see that the temperature data only goes to 2013,
# but the immigration data is from 2016 so we will not use 
# the temperature data in future analysis

### Immigration

In [22]:
#for col in sprk_im_df.limit(1).toPandas().columns:
#    print(col)

In [23]:
# create a table view to query against
sprk_im_df.createOrReplaceTempView("immigration")

In [24]:
# we do not need to keep i94yr as it has a single value
years = spark.sql('''
    SELECT 
        DISTINCT i94yr
    FROM
        immigration
''')
years.toPandas()

Unnamed: 0,i94yr
0,2016.0


In [25]:
# we do not need to keep i94mon as it has a single value
months = spark.sql('''
    SELECT 
        DISTINCT i94mon
    FROM
        immigration
''')
months.toPandas()

Unnamed: 0,i94mon
0,4.0


In [26]:
# inspect the distinct arrival dates (currently disabled)
# arrdate = spark.sql('''
#     SELECT 
#         DISTINCT arrdate
#     FROM
#         immigration
# ''')
# arrdate.toPandas()

In [27]:
i94visa = spark.sql('''
    SELECT 
        DISTINCT i94visa
    FROM
        immigration
''')
i94visa.toPandas()

Unnamed: 0,i94visa
0,1.0
1,3.0
2,2.0


In [28]:
# visatype = spark.sql('''
#     SELECT 
#         DISTINCT visatype
#     FROM
#         immigration
# ''')
# visatype.toPandas()

In [29]:
# Create a data dict for kept columns and use it to remove unnecessary columns
immigration_data_dict = {'i94cit': 'country of citizenship code' , 
                          'i94port': 'port code' , 
                          'i94mode':  'transportation code', 
                          'arrdate': 'arrival date',
                          'i94res': 'country of residence code', 
                          'i94addr': 'state code', 
                          'depDate': 'departure date',
                          'i94bir': 'immigrant age', 
                          'i94visa': 'visa category code',
                          'gender': 'gender'}

imigration_keep_cols = list(immigration_data_dict.keys())

sprk_im_df = sprk_im_df.select(imigration_keep_cols)

sprk_im_df.limit(3).toPandas()



Unnamed: 0,i94cit,i94port,i94mode,arrdate,i94res,i94addr,depDate,i94bir,i94visa,gender
0,245.0,LOS,1.0,20574.0,438.0,CA,20582.0,40.0,1.0,F
1,245.0,LOS,1.0,20574.0,438.0,NV,20591.0,32.0,1.0,F
2,245.0,LOS,1.0,20574.0,438.0,WA,20582.0,29.0,1.0,M
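
The `arrdate` and `depDate` values above look like SAS numeric dates, which count days since 1960-01-01. That reading is an assumption based on the source being a `.sas7bdat` export, but it is consistent with the single `i94yr`/`i94mon` values found earlier. A minimal conversion sketch, with `sas_to_date` as a hypothetical helper:

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date


def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01) to a datetime.date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20574.0))  # 2016-04-30
print(sas_to_date(20582.0))  # 2016-05-08
```

Converting these columns before staging would let the warehouse store them as dates rather than `varchar`.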


In [30]:
pd_im_df = sprk_im_df.toPandas()

In [31]:
pd_im_df.shape

(3096313, 10)

In [32]:
# drop rows with null values
pd_im_df = pd_im_df.dropna()

In [33]:
pd_im_df['i94port'].unique()

array(['LOS', 'HHW', 'HOU', 'NEW', 'WAS', 'MIA', 'DAL', 'NYC', 'ORL',
       'SFR', 'CHI', 'TOR', 'SEA', 'BOS', 'PHI', 'SAJ', 'CLG', 'DET',
       'MON', 'VCV', 'MAA', 'POO', 'PHO', 'HAM', 'DEN', 'FTL', 'ATL',
       'CLT', 'CIN', 'LVG', 'SDP', 'SLC', 'NAS', 'SPM', 'DUB', 'AGA',
       'SAI', 'EDA', 'OTT', 'TAM', 'NCA', 'SNJ', 'WIN', 'HAL', 'OGG',
       'FMY', 'CLM', 'MIL', 'OAK', 'WPB', 'PSP', 'LIH', 'NSV', 'X96',
       'AUS', 'BAL', 'ROC', 'RDU', 'KOA', 'NOL', 'XXX', 'STT', 'W55',
       'SNA', 'YGF', 'OPF', 'CHR', 'INT', 'BUF', 'SAC', 'ONT', 'BRO',
       'LAR', 'SAA', 'STL', 'HAR', 'CLE', 'SRQ', 'INP', 'KAN', 'PIT',
       'MCA', 'FOK', 'LNB', 'SFB', 'X44', 'CRQ', 'PEM', 'ELP', 'PEV',
       'BLA', 'HIG', 'CHM', 'DER', 'AXB', 'SWE', 'VIC', 'MEM', 'LYN',
       'PEN', 'PHU', 'POR', 'SAV', 'NOR', 'SUM', 'BEE', 'STR', 'RIF',
       'SYR', 'EPI', 'GAL', 'ABG', 'MDT', 'PTL', 'HID', 'BTN', 'OTM',
       'ANA', 'KEY', 'SSM', 'DOU', 'HTM', 'NOG', 'LEW', 'PBB', 'CLS',
       'DAC', 'FTC',

In [34]:
# drop any duplicate rows
pd_im_df = pd_im_df.drop_duplicates()

In [35]:
# write the cleaned immigration data to csv in staging
write_to_staging(pd_im_df, 'qscapstone', 'immigration.csv')

### Airport 

In [36]:
pd_airport_df.head(2)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"


In [37]:
pd_airport_df.shape

(55075, 12)

In [38]:
#pd_airport_df['iso_country'].unique()

In [39]:
pd_airport_df = pd_airport_df[(pd_airport_df['iso_country']=='US') & (pd_airport_df['iata_code'].notnull())]

In [40]:
pd_airport_df.shape

(2019, 12)

In [41]:
pd_airport_df.head(2)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
440,07FA,small_airport,Ocean Reef Club Airport,8.0,,US,US-FL,Key Largo,07FA,OCA,07FA,"-80.274803161621, 25.325399398804"
594,0AK,small_airport,Pilot Station Airport,305.0,,US,US-AK,Pilot Station,,PQS,0AK,"-162.899994, 61.934601"


In [42]:
pd_airport_df['same_id'] = pd_airport_df['ident'] == pd_airport_df['local_code']

In [43]:
pd_airport_df['same_id'].sum()

210

In [44]:
pd_airport_df[pd_airport_df['same_id']==False].head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,same_id
3143,2IG4,small_airport,Ed-Air Airport,426.0,,US,US-IN,Oaktown,I20,OTN,I20,"-87.4997024536, 38.851398468",False
3643,2Z1,seaplane_base,Entrance Island Seaplane Base,0.0,,US,US-AK,Entrance Island,,HBH,,"-133.43848, 57.412201",False
10470,AHT,closed,Amchitka Army Airfield,215.0,,US,US-AK,Amchitka Island,PAHT,AHT,,"179.259166667, 51.3777777778",False
11498,ARX,closed,Asbury Park Neptune Air Terminal,95.0,,US,US-NJ,Asbury Park,,ARX,,"-74.0908333333, 40.2193055556",False
11676,AUS,closed,Austin Robert Mueller Municipal,,,US,US-TX,,KAUS,AUS,,"-97.6997852325, 30.2987223546",False


In [45]:
im_ports = pd_im_df['i94port'].unique()
im_ports.sort()
im_ports

array(['5KE', '5T6', 'ABG', 'ABQ', 'ABS', 'ADS', 'ADT', 'ADW', 'AGA',
       'AGN', 'ALC', 'ANA', 'ANC', 'AND', 'ANZ', 'APF', 'ATL', 'ATW',
       'AUS', 'AXB', 'BAL', 'BAU', 'BDL', 'BEB', 'BED', 'BEE', 'BGM',
       'BHX', 'BLA', 'BOA', 'BOS', 'BQN', 'BRG', 'BRO', 'BTN', 'BUF',
       'BWA', 'BWM', 'CAE', 'CAL', 'CHA', 'CHI', 'CHM', 'CHR', 'CHS',
       'CHT', 'CIN', 'CLE', 'CLG', 'CLM', 'CLS', 'CLT', 'CNA', 'COB',
       'COL', 'CRP', 'CRQ', 'CRY', 'DAB', 'DAC', 'DAL', 'DEN', 'DER',
       'DET', 'DLB', 'DLR', 'DNA', 'DNS', 'DOU', 'DPA', 'DUB', 'DVL',
       'EDA', 'EGP', 'ELP', 'EPI', 'FAL', 'FAR', 'FCA', 'FER', 'FMY',
       'FOK', 'FPR', 'FRB', 'FRT', 'FTC', 'FTF', 'FTK', 'FTL', 'FWA',
       'GAL', 'GPM', 'GSP', 'HAL', 'HAM', 'HAR', 'HEF', 'HEL', 'HHW',
       'HID', 'HIG', 'HNN', 'HNS', 'HOU', 'HPN', 'HSV', 'HTM', 'HVR',
       'ICT', 'INP', 'INT', 'JAC', 'JFA', 'JKM', 'KAN', 'KEY', 'KOA',
       'LAN', 'LAR', 'LAU', 'LCB', 'LEW', 'LEX', 'LIH', 'LLB', 'LNB',
       'LOI', 'LOS',

In [46]:
iata_codes = pd_airport_df['iata_code'].unique()
iata_codes.sort()
iata_codes

array(['AAF', 'AAP', 'ABE', ..., 'ZNC', 'ZPH', 'ZZV'], dtype=object)

In [47]:
# Here we can see that we can join the immigration data to the airport data on pd_im_df.i94port = pd_airport_df.iata_code
len(list(set(iata_codes).intersection(set(im_ports))))

126
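
Since only some port codes overlap with the IATA codes, it can help to measure how many immigration rows actually find an airport before relying on the join. The following is a sketch on toy rows (the codes and airport names are taken from the outputs above, the visa values are illustrative); it uses a left join with pandas' `indicator` option so unmatched rows stay visible.

```python
import pandas as pd

# toy rows for illustration only
toy_im = pd.DataFrame({
    'i94port': ['LOS', 'AUS', 'XXX'],
    'i94visa': [1.0, 2.0, 2.0],
})
toy_air = pd.DataFrame({
    'iata_code': ['AUS', 'OCA'],
    'name': ['Austin Robert Mueller Municipal', 'Ocean Reef Club Airport'],
})

# a left join keeps every immigration row; _merge records whether an airport matched
merged = toy_im.merge(
    toy_air, how='left', left_on='i94port', right_on='iata_code', indicator=True
)
matched = int((merged['_merge'] == 'both').sum())
print(f'{matched} of {len(merged)} immigration rows matched an airport')
```

An inner join, as used later for the fact table, silently drops the unmatched rows, so a count like this documents how much data the join discards.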

In [48]:
pd_airport_df.head(1)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,same_id
440,07FA,small_airport,Ocean Reef Club Airport,8.0,,US,US-FL,Key Largo,07FA,OCA,07FA,"-80.274803161621, 25.325399398804",True


In [49]:
pd_airport_df['latitude'] = pd_airport_df['coordinates'].str.split(',').str[1]
pd_airport_df['longitude'] = pd_airport_df['coordinates'].str.split(',').str[0]
#pd_airport_df['coordinates'].str.split(',')

In [50]:
air_data_dict = {
    'ident': 'primary key identifier', 
    'type': 'type of airport', 
    'name': 'airport name', 
    'iso_region': 'airport region', 
    'municipality': 'airport municipality', 
    'iata_code': 'code for airport, maps to i94port in immigration',
    'latitude': 'latitude',
    'longitude': 'longitude'  
}

air_keep_cols = list(air_data_dict.keys())
pd_airport_df = pd_airport_df[air_keep_cols]
pd_airport_df.head()

Unnamed: 0,ident,type,name,iso_region,municipality,iata_code,latitude,longitude
440,07FA,small_airport,Ocean Reef Club Airport,US-FL,Key Largo,OCA,25.325399398804,-80.274803161621
594,0AK,small_airport,Pilot Station Airport,US-AK,Pilot Station,PQS,61.934601,-162.899994
673,0CO2,small_airport,Crested Butte Airpark,US-CO,Crested Butte,CSE,38.851918,-106.928341
1088,0TE7,small_airport,LBJ Ranch Airport,US-TX,Johnson City,JCY,30.251800537100003,-98.6224975586
1402,13MA,small_airport,Metropolitan Airport,US-MA,Palmer,PMX,42.2233009338,-72.31140136719999


In [51]:
# write airports to s3 for staging
pd_airport_df = pd_airport_df.drop_duplicates()
write_to_staging(pd_airport_df, 'qscapstone', 'airports.csv')

### Demographic 

In [52]:
pd_demographic_df.head(1)

# here we can see that we will need to perform some string parsing 
# to get this dataset into a usable form

Unnamed: 0,City;State;Median Age;Male Population;Female Population;Total Population;Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count
0,Silver Spring;Maryland;33.8;40601;41862;82463;...


In [53]:
col_string = pd_demographic_df.columns[0]
original_col = col_string
# split column names out
cols = col_string.split(';')
# replace spaces with underscores
for i in range(len(cols)):
    cols[i] = cols[i].replace(' ', '_')
    cols[i] = cols[i].replace('-', '_')

cols

['City',
 'State',
 'Median_Age',
 'Male_Population',
 'Female_Population',
 'Total_Population',
 'Number_of_Veterans',
 'Foreign_born',
 'Average_Household_Size',
 'State_Code',
 'Race',
 'Count']

In [54]:
pd_demographic_df = pd_demographic_df[original_col].str.split(';',expand=True)
pd_demographic_df.columns = cols
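
An alternative to splitting the single column by hand is to tell pandas the delimiter when reading the file. The sketch below uses a shortened in-memory sample based on the first row shown above rather than the real file; reading with `sep=';'` also lets pandas infer numeric types instead of leaving every value as a string.

```python
import pandas as pd
from io import StringIO

# shortened sample in the same semicolon-delimited layout as the real file
sample = StringIO(
    'City;State;Median Age;Total Population\n'
    'Silver Spring;Maryland;33.8;82463\n'
)
df = pd.read_csv(sample, sep=';')
# normalize column names the same way as the loop above
df.columns = [c.replace(' ', '_').replace('-', '_') for c in df.columns]
print(df.dtypes)
```

With the real file this would presumably be `pd.read_csv('us-cities-demographics.csv', sep=';')`.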

In [55]:
# write demographics to csv in s3 storage
pd_demographic_df = pd_demographic_df.drop_duplicates()
write_to_staging(pd_demographic_df, 'qscapstone', 'demographics.csv')

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The data model for this project conforms to a star schema in which the fact table holds the immigration data and is connected to dimension tables for airports and demographics.
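
To illustrate how analysts would query such a schema, here is a minimal sketch using an in-memory SQLite database in place of Redshift. The table definitions are simplified, hypothetical versions of the ones created later, and the rows are toy data.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
cur = demo_conn.cursor()

# simplified, hypothetical versions of the fact and dimension tables
cur.execute('CREATE TABLE airports_dimension (iata_code TEXT PRIMARY KEY, name TEXT)')
cur.execute('CREATE TABLE immigration_facts (im_id INTEGER PRIMARY KEY, i94port TEXT, i94visa TEXT)')

cur.execute("INSERT INTO airports_dimension VALUES ('OCA', 'Ocean Reef Club Airport')")
cur.executemany(
    'INSERT INTO immigration_facts VALUES (?, ?, ?)',
    [(0, 'OCA', '1'), (1, 'OCA', '2'), (2, 'OCA', '2')],
)

# query the fact table and pull descriptive attributes from the dimension
cur.execute('''
    SELECT a.name, f.i94visa, COUNT(*) AS arrivals
    FROM immigration_facts AS f
    JOIN airports_dimension AS a ON f.i94port = a.iata_code
    GROUP BY a.name, f.i94visa
    ORDER BY f.i94visa
''')
rows = cur.fetchall()
print(rows)
demo_conn.close()
```

The fact table stays narrow and holds one row per arrival, while descriptive attributes such as the airport name live in the dimension and are joined in only when needed.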

#### 3.2 Mapping Out Data Pipelines
1) Clean data sets and create staging tables in S3 bucket "qscapstone"
    immigration.csv
    airports.csv
    demographics.csv
    
2) Read data sets into Spark for joining/querying
3) Create immigration_facts fact table
4) Create airports dimension table which joins to immigration_facts
5) Create demographics dimension table which joins to immigration_facts
6) Create cube tables for airports to serve analytics on travel patterns
7) Create cube tables for demographics to serve analytics on travel patterns
8) Write all tables to Redshift
9) Run tests to ensure data integrity/pipeline success
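
The cube-table steps above are not implemented in this notebook. One way they could look, sketched here in pandas on toy rows, is a pre-aggregated count of arrivals per port and visa category; the column choice is an assumption for illustration.

```python
import pandas as pd

# toy fact rows for illustration only
facts = pd.DataFrame({
    'i94port': ['LOS', 'LOS', 'HOU', 'HOU', 'HOU'],
    'i94visa': [1.0, 2.0, 2.0, 2.0, 3.0],
})

# one row per (port, visa category) with the number of arrivals
cube = (
    facts.groupby(['i94port', 'i94visa'])
    .size()
    .reset_index(name='arrivals')
)
print(cube)
```

In Redshift the same result could be produced with a `GROUP BY` query and stored as its own table for analysts.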

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [56]:
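def connect_to_reshift(host, dbname, user, password, port):
    """Function that returns a psycopg2 db connection
    Args:
        host (str): Redshift cluster endpoint
        dbname (str): database name
        user (str): database user
        password (str): user password
        port (int): port the cluster listens on
    Returns:
        connection: psycopg2 connection
    """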
    connection_string = f'host={host} dbname={dbname} user={user} password={password} port={port}'
    conn = psycopg2.connect(connection_string)
    
    return conn       

In [57]:
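def drop_tables(conn, tables):
    """Function that drops the given tables
    Args:
        conn (connection): psycopg2 connection
        tables ([str]): names of the tables to drop
    Returns:
        None
    """
    
    cur = conn.cursor()
    
    for table in tables:
        try:
            # IF EXISTS avoids an error when the table was never created
            cur.execute('DROP TABLE IF EXISTS ' + table)
            conn.commit()
        except psycopg2.Error as e:
            # roll back only the failed statement so later drops can proceed
            conn.rollback()
            print('error dropping table: ', table, e)
        
    cur.close()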

In [60]:
drop_tables(conn, ['staging_airports', 'staging_demographics', 'staging_immigration'])

In [76]:
def create_tables(conn):
    '''Function that creates tables in redshift
    Args:
        conn (connection): psycopg2 connection
    Returns:
        None
    '''
    
    create_staging_immigration = '''
    CREATE TABLE IF NOT EXISTS staging_immigration(
        im_id integer identity(0,1) PRIMARY KEY,
        i94cit varchar, 
        i94port varchar NOT NULL, 
        i94mode varchar,
        arrdate varchar,
        i94res varchar,
        i94addr varchar,
        depDate varchar,
        i94bir varchar,
        i94visa varchar,
        gender varchar
    );
    '''
    
    create_staging_airports = '''
    CREATE TABLE IF NOT EXISTS staging_airports(
        ident varchar PRIMARY KEY,
        type varchar,
        name varchar, 
        iso_region varchar, 
        municipality varchar, 
        iata_code varchar, 
        latitude float, 
        longitude float
    );
    '''
    
    create_staging_demographics = '''
    CREATE TABLE IF NOT EXISTS staging_demographics(
         city_id integer identity(0,1) PRIMARY KEY,
         City varchar,
         State varchar,
         Median_Age real,
         Male_Population integer,
         Female_Population integer,
         Total_Population integer,
         Number_of_Veterans integer,
         Foreign_born integer,
         Average_Household_Size real,
         State_Code varchar,
         Race varchar,
         Count integer,
         UNIQUE(City, State)
    );
    '''
    
    create_immigration_facts = '''
    CREATE TABLE IF NOT EXISTS immigration_facts(
        im_id integer PRIMARY KEY,
        i94port varchar NOT NULL,
        city_id integer NOT NULL,
        i94cit varchar, 
        i94mode varchar,
        arrdate varchar,
        i94res varchar,
        i94addr varchar,
        depDate varchar,
        i94bir varchar,
        i94visa varchar,
        gender varchar
    );
    '''
    
    create_airports_dimension = '''
    CREATE TABLE IF NOT EXISTS airports_dimension(
        ident varchar NOT NULL,
        im_id integer NOT NULL,
        type varchar,
        name varchar, 
        iso_region varchar, 
        municipality varchar, 
        iata_code varchar, 
        latitude float, 
        longitude float
    );
    '''
    
    create_demographics_dimension = '''
    CREATE TABLE IF NOT EXISTS demographics_dimension(
         city_id integer NOT NULL,
         im_id integer NOT NULL,
         City varchar,
         State varchar,
         Median_Age real,
         Male_Population integer,
         Female_Population integer,
         Total_Population integer,
         Number_of_Veterans integer,
         Foreign_born integer,
         Average_Household_Size real,
         State_Code varchar,
         Race varchar,
         Count integer,
         UNIQUE(City, State)
    );
    '''
    
    
    
    
    
    cur = conn.cursor()
    
    # staging tables first, then the fact and dimension tables
    create_queries = [create_staging_immigration, 
                      create_staging_airports, 
                      create_staging_demographics,
                      create_immigration_facts, 
                      create_airports_dimension, 
                      create_demographics_dimension]
    
    for query in create_queries:
        try:
            cur.execute(query)
            conn.commit()
        except psycopg2.Error as e:
            conn.rollback()
            print('error in making table: ', query, e)
            
    cur.close()

In [77]:
conn = connect_to_reshift(HOST, DB, DB_USER, DB_PASSWORD, 5439)
create_tables(conn)

In [None]:
# cur.execute('SElect * from staging_demographics')
# result = cur.fetchone()
# result

In [64]:
def copy_staging(conn, bucket, csv_files, tables):
    '''Function that copies csv files to staging tables
    Args:
        conn (connection): psycopg2 connection
        bucket (str): aws bucket
        csv_files ([str]): list of csv_files
        tables ([str]): corresponding tables for copying
    Returns:
        None
    '''
    
    staging_copy = """
    COPY {}
    FROM '{}'
    IAM_ROLE '{}'
    REGION 'us-west-2'
    CSV IGNOREHEADER 1;
    """
    
    cur = conn.cursor()
                    
    # pair each csv file with its target staging table
    for csv_file, table in zip(csv_files, tables):
        path = 's3://{}/{}'.format(bucket, csv_file)
        query = staging_copy.format(table, path, ARN)
        try:
            cur.execute(query)
            conn.commit()
        except psycopg2.Error as e:
            conn.rollback()
            print('The following query failed: ', query, e)
    
    cur.close()

In [65]:
conn = connect_to_reshift(HOST, DB, DB_USER, DB_PASSWORD, 5439)
copy_staging(conn, 'qscapstone', ['airports.csv', 'demographics.csv', 'immigration.csv'], 
             ['staging_airports', 'staging_demographics', 'staging_immigration'])

In [78]:
def insert_facts_table(conn):
    """Function that inserts rows into the facts table
    Args:
        conn (connection): psycopg2 connection
    Returns:
        None
    """
    
    cur = conn.cursor()
    
    query = '''
    INSERT INTO immigration_facts
    (SELECT 
        si.im_id,
        si.i94port,
        sd.city_id,
        si.i94cit,
        si.i94mode,
        si.arrdate,
        si.i94res,
        si.i94addr,
        si.depDate,
        si.i94bir,
        si.i94visa,
        si.gender
    FROM
        staging_immigration AS si INNER JOIN 
        staging_airports AS sa ON si.i94port = sa.iata_code INNER JOIN
        staging_demographics AS sd ON sa.municipality = sd.city);
    '''
    
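    try:
        cur.execute(query)
        conn.commit()
    except psycopg2.Error as e:
        # report the failure instead of silently discarding it
        conn.rollback()
        print('error inserting into immigration_facts: ', e)
        
    cur.close()   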

In [79]:
insert_facts_table(conn)

In [None]:
def insert_dimensions_table():
    """Function that inserts rows into the dimension tables
    Args:
    Returns
    """

In [None]:
def perform_aggregation():
    """Function that performs aggregation on dimensions tables
    Args:
    Returns
    """

In [None]:
def Pipeline():
    '''Function to perform ETL
    Args:
    Returns:
        int: 0 on completion
    '''
    # connect to Redshift
    conn = connect_to_reshift(HOST, DB, DB_USER, DB_PASSWORD, 5439) 
    
    # drop staging tables (list named so it does not shadow drop_tables())
    tables_to_drop = [
        'staging_immigration',
        'staging_airports',
        'staging_demographics'
    ]
    
    drop_tables(conn, tables_to_drop)
    
    # create tables
    create_tables(conn)
    
    # copy staging tables
    copy_staging(conn, 'qscapstone', ['airports.csv', 'demographics.csv', 'immigration.csv'], 
             ['staging_airports', 'staging_demographics', 'staging_immigration'])
    
    # insert facts table
    insert_facts_table(conn)
        
    # insert dimension tables
    insert_dimensions_table()
    
    # create aggregation tables
    perform_aggregation()
    
    # release the database connection
    conn.close()
    
    return 0

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
def check_data():
    '''Function to perform data quality checks
    Args:
    Returns:
    '''
    
    return
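
The checks listed above are not yet implemented. A minimal sketch of a count check and a null check follows; `check_row_count` and `check_no_nulls` are hypothetical helpers, and the demo runs against an in-memory SQLite table rather than Redshift. They only rely on the DB-API cursor interface, so they should also work with a psycopg2 connection, though that has not been verified here.

```python
import sqlite3


def check_row_count(conn, table, minimum=1):
    """Return True if `table` holds at least `minimum` rows (hypothetical check)."""
    cur = conn.cursor()
    # table names come from trusted code, not user input
    cur.execute(f'SELECT COUNT(*) FROM {table}')
    count = cur.fetchone()[0]
    cur.close()
    return count >= minimum


def check_no_nulls(conn, table, column):
    """Return True if `column` in `table` contains no NULL values (hypothetical check)."""
    cur = conn.cursor()
    cur.execute(f'SELECT COUNT(*) FROM {table} WHERE {column} IS NULL')
    nulls = cur.fetchone()[0]
    cur.close()
    return nulls == 0


# demo against an in-memory SQLite table with toy data
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE immigration_facts (im_id INTEGER, i94port TEXT)')
demo_conn.execute("INSERT INTO immigration_facts VALUES (0, 'LOS')")
count_ok = check_row_count(demo_conn, 'immigration_facts')
nulls_ok = check_no_nulls(demo_conn, 'immigration_facts', 'i94port')
print(count_ok, nulls_ok)
demo_conn.close()
```

A `check_data` implementation could call helpers like these for each table and raise an error when any of them returns `False`.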

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.