### Capstone Project (Data Engineering)

#### Project Summary and outline
This project aims to answer questions about US immigration trends:
1. Most popular cities
2. Gender distribution of immigrants
3. Visa type distribution
4. Average age of immigrants
5. Average temperature per month per city

The data is taken from three different sources:
1. I94 immigration dataset for 2016
2. City temperature data
3. US city demographic data from OpenSoft

The design has four dimension tables (cities, immigrants, monthly average city temperature, and time) and one fact table (immigration).

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
#!pip install pyspark
from pyspark.sql  import SparkSession

import psycopg2
from datetime import datetime, timedelta
import re
from pyspark.sql.types import *
from pyspark.sql.functions import *
import numpy as np
import pandas as pd
import glob 
import cleanup as cleanup

### Step 1: Scope the Project and Gather Data

#### Scope 
The goal of this project is to pull data from three different sources and create fact and dimension tables for analyzing US immigration using city demographics, seasons, and average temperature.

#### Describe and Gather Data 

I94 Immigration Data: This data comes from the U.S. National Tourism and Trade Office and contains various statistics on international visitor arrivals in the USA.
World Temperature Data: This data comes from Kaggle and contains average temperatures by city.
U.S. City Demographic Data: This data comes from OpenSoft and contains information about the demographics of US cities, such as median age and male and female population.


# Load data from CSV file


### Follow the steps below, repeating steps 2 and 3 to load the airport codes, immigration, and US cities demographics data
1. Create a Spark session (set the app name to Capstone)
2. Read the CSV file
3. Show the data frame

In [2]:
# create pyspark session
pySparkSession = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()


## Load airports information

In [3]:
# Load airport codes data 
airports_info_df = pySparkSession.read.csv("csv/airport-codes_csv.csv",header=True)
airports_info_df.toPandas().head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [4]:
airports_info_df.columns
airports_info_df.describe()

DataFrame[summary: string, ident: string, type: string, name: string, elevation_ft: string, continent: string, iso_country: string, iso_region: string, municipality: string, gps_code: string, iata_code: string, local_code: string, coordinates: string]

## Immigration data

In [5]:
# Load immigration data from sas7bdat files
# Only i94_apr16_sub.sas7bdat is used in this project, in order to avoid memory errors
i94_all_files = glob.glob("../../data/18-83510-I94-Data-2016/*.sas7bdat")
i94_fname = "../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat"
i94_df = pySparkSession.read.format("com.github.saurfang.sas.spark").load(i94_fname)
i94_df.limit(5).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


In [6]:
# Convert SAS numeric dates to ISO date strings and null out implausible birth years, to make querying easier
convert_isoformat = udf(lambda x: (datetime(1960, 1, 1).date() + timedelta(x)).isoformat() if x else None)
valid_birth_year = udf(lambda yr: yr if (yr and 1900 <= yr <= 2016) else None)
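
SAS stores dates as a day count from 1 January 1960, which is why the UDF above adds the value to that epoch. The sketch below repeats the conversion in plain Python so individual values can be spot-checked outside Spark. The function name `sas_days_to_iso` is an assumption made for this illustration and is not part of the notebook.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # day 0 of the SAS date scale


def sas_days_to_iso(days):
    """Convert a SAS day count to an ISO date string, or None if missing."""
    if days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(days))).isoformat()


# 20545.0 is one of the arrdate values shown in the immigration sample
print(sas_days_to_iso(20545.0))  # 2016-04-01
```

Testing `days is None` rather than truthiness keeps day 0 (1960-01-01) convertible, at the cost of one extra line.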

In [7]:
i94_df =  i94_df \
          .withColumn('arrdate', convert_isoformat(i94_df.arrdate)) \
          .withColumn('depdate', convert_isoformat(i94_df.depdate)) \
          .withColumn("biryear", valid_birth_year(i94_df.biryear)) \
          .dropDuplicates()

In [8]:
i94_df.createOrReplaceTempView('staging_i94')

## US cities demographics
Contains information about city demographics data

In [9]:
demographics_df = pySparkSession.read.csv("csv/us-cities-demographics.csv",inferSchema=True, header=True, sep=';')
demographics_df.toPandas().head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [10]:
demographics_df.describe()

DataFrame[summary: string, City: string, State: string, Median Age: string, Male Population: string, Female Population: string, Total Population: string, Number of Veterans: string, Foreign-born: string, Average Household Size: string, State Code: string, Race: string, Count: string]

In [11]:
demographics_df.columns

['City',
 'State',
 'Median Age',
 'Male Population',
 'Female Population',
 'Total Population',
 'Number of Veterans',
 'Foreign-born',
 'Average Household Size',
 'State Code',
 'Race',
 'Count']

In [12]:
### Load global temperature
path = '../../data2/GlobalLandTemperaturesByCity.csv'
temperature_df = pySparkSession.read.csv(path,inferSchema=True, header=True)
temperature_df.limit(20).toPandas().head(10)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
9,1744-08-01,,,Århus,Denmark,57.05N,10.33E


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.



### Missing values and duplicate data cleanup
## i94_df, demographics_df and temperature_df

In [13]:
i94_df_cleanup = cleanup.drop_empty_columns(i94_df,["arrdate","i94addr","visatype","biryear","gender","depdate"])
i94_df_cleanup = cleanup.drop_duplicate_rows(i94_df_cleanup)

Dropping missing data...
+------+------+------+------+------+-------+----------+-------+-------+----------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+
|cicid |i94yr |i94mon|i94cit|i94res|i94port|arrdate   |i94mode|i94addr|depdate   |i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear|dtaddto |gender|insnum|airline|admnum         |fltno|visatype|
+------+------+------+------+------+-------+----------+-------+-------+----------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+
|30.0  |2016.0|4.0   |101.0 |101.0 |ATL    |2016-04-01|1.0    |NJ     |2016-05-04|49.0  |2.0    |1.0  |20160401|TIA     |null |G      |O      |null   |M      |1967.0 |09302016|M     |null  |OS     |9.247020943E10 |00089|B2      |
|84.0  |2016.0|4.0   |103.0 |103.0 |BOS    |2016-04-01|

In [14]:
demographics_df_cleanup = cleanup.drop_empty_columns(demographics_df,["city","state"])
demographics_df_cleanup = cleanup.drop_duplicate_rows(demographics_df_cleanup)

Dropping missing data...
+----------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+---------------------------------+------+
|City            |State         |Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|Race                             |Count |
+----------------+--------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+---------------------------------+------+
|Silver Spring   |Maryland      |33.8      |40601          |41862            |82463           |1562              |30908       |2.6                   |MD        |Hispanic or Latino               |25924 |
|Quincy          |Massachusetts |41.0      |44129          |49500            |93629           |4147              |32935       |2.39                  |MA        |Wh

In [15]:
temperature_df_cleanup = cleanup.drop_empty_columns(temperature_df,["dt","AverageTemperature"])
temperature_df_cleanup = cleanup.drop_duplicate_rows(temperature_df_cleanup)

Dropping missing data...
+-------------------+-------------------+-----------------------------+-----+-------+--------+---------+
|dt                 |AverageTemperature |AverageTemperatureUncertainty|City |Country|Latitude|Longitude|
+-------------------+-------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01 00:00:00|6.068              |1.7369999999999999           |Århus|Denmark|57.05N  |10.33E   |
|1744-04-01 00:00:00|5.7879999999999985 |3.6239999999999997           |Århus|Denmark|57.05N  |10.33E   |
|1744-05-01 00:00:00|10.644             |1.2830000000000001           |Århus|Denmark|57.05N  |10.33E   |
|1744-06-01 00:00:00|14.050999999999998 |1.347                        |Århus|Denmark|57.05N  |10.33E   |
|1744-07-01 00:00:00|16.082             |1.396                        |Århus|Denmark|57.05N  |10.33E   |
|1744-09-01 00:00:00|12.780999999999999 |1.454                        |Århus|Denmark|57.05N  |10.33E   |
|1744-10-01 00:00:00|7.95     

In [16]:
demographics_df_cleanup.describe()

DataFrame[summary: string, City: string, State: string, Median Age: string, Male Population: string, Female Population: string, Total Population: string, Number of Veterans: string, Foreign-born: string, Average Household Size: string, State Code: string, Race: string, Count: string]

In [17]:
demographics_df_cleanup.limit(10).toPandas().head(10)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Maple Grove,Minnesota,38.6,31780,36601,68381,2943,7645,2.64,MN,White,59683
1,Concord,California,39.6,62310,66358,128668,6287,37428,2.72,CA,White,92575
2,Highlands Ranch,Colorado,39.6,49186,53281,102467,4840,8827,2.72,CO,Asian,5650
3,Asheville,North Carolina,37.9,42100,46407,88507,4973,6630,2.18,NC,American Indian and Alaska Native,496
4,Westland,Michigan,39.9,37742,44253,81995,4756,6429,2.41,MI,Black or African-American,16422
5,Wichita Falls,Texas,34.0,55775,48934,104709,7800,9855,2.41,TX,Hispanic or Latino,23061
6,Clovis,California,37.8,52392,51780,104172,6173,13409,2.76,CA,White,78029
7,Waldorf,Maryland,33.6,35640,39872,75512,6932,5954,2.69,MD,Asian,4100
8,Schaumburg,Illinois,36.9,35971,39840,75811,2019,24614,2.72,IL,White,43688
9,Winston-Salem,North Carolina,34.7,112520,128712,241232,14521,24302,2.47,NC,White,139301


In [18]:
demographics_df_cleanup.count()

2891

### temperature

In [19]:
temperature_df_cleanup.limit(20).toPandas().head(10)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1850-04-01,,,Bontang,Indonesia,0.80N,118.13E
1,1859-05-01,,,Bontang,Indonesia,0.80N,118.13E
2,1862-09-01,,,Bontang,Indonesia,0.80N,118.13E
3,1863-09-01,,,Butembo,Congo (Democratic Republic Of The),0.80N,29.73E
4,1864-07-01,20.314,1.248,Butembo,Congo (Democratic Republic Of The),0.80N,29.73E
5,1880-05-01,26.876,0.844,Bontang,Indonesia,0.80N,118.13E
6,1886-04-01,20.908,1.257,Butembo,Congo (Democratic Republic Of The),0.80N,29.73E
7,1887-05-01,20.622,1.14,Butembo,Congo (Democratic Republic Of The),0.80N,29.73E
8,1887-09-01,20.435,1.215,Butembo,Congo (Democratic Republic Of The),0.80N,29.73E
9,1900-03-01,25.904,1.29,Bitung,Indonesia,0.80N,124.55E


In [20]:
temperature_df_cleanup_us = temperature_df_cleanup.filter("Country == 'United States'")
temperature_df_cleanup_us.limit(10).toPandas().head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1848-06-01,24.97,1.799,Abilene,United States,32.95N,100.53W
1,1892-05-01,21.656,0.501,Abilene,United States,32.95N,100.53W
2,1917-02-01,8.004,0.518,Abilene,United States,32.95N,100.53W
3,1937-04-01,17.291,0.307,Abilene,United States,32.95N,100.53W
4,1942-09-01,21.529,0.319,Abilene,United States,32.95N,100.53W


In [21]:
temperature_df_cleanup_us.count()

687289

In [22]:
temperature_df_cleanup_us.describe()

DataFrame[summary: string, AverageTemperature: string, AverageTemperatureUncertainty: string, City: string, Country: string, Latitude: string, Longitude: string]

In [23]:
# Convert date to datetime; df_temp_con is the name the later temperature cells use
df_temp_con = temperature_df_cleanup_us.withColumn("convertedDate", to_date(temperature_df_cleanup_us.dt))

In [24]:
df_temp_con.select(min('convertedDate')).collect()

In [None]:
def create_dim_table(label):
    '''
      Extract code/value pairs for a label from I94_SAS_file_content.SAS
      :param label: name of the label block to extract
      :return: list of (code, value) tuples
    '''
    with open('I94_SAS_file_content.SAS') as file_content:
        raw_labels = file_content.read()
    # Keep only the text from the label up to the terminating semicolon
    labels = raw_labels[raw_labels.index(label):]
    labels = labels[:labels.index(';')]
    lines = labels.splitlines()
    code_value_list = []
    for line in lines:
        try:
            code, value = line.split('=')
            code = code.strip().strip("'").strip('"')
            value = value.strip().strip("'").strip('"').strip()
            code_value_list.append((code, value))
        except ValueError:
            # Skip lines that are not of the form code = value
            pass

    return code_value_list
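
To show the kind of result this parsing is meant to produce, here is a small sketch that applies the same slice-then-split idea to an in-memory string instead of the file. The helper name `parse_label_block` and the sample text are assumptions for illustration only; the sample copies the general shape of a label block and is not the real file content.

```python
def parse_label_block(text, label):
    """Return (code, value) pairs from the block that starts at `label` and ends at ';'."""
    block = text[text.index(label):]
    block = block[:block.index(';')]
    pairs = []
    for line in block.splitlines():
        if '=' not in line:
            continue  # skip the header line and any blank lines
        code, value = line.split('=', 1)
        code = code.strip().strip("'\"")
        value = value.strip().strip("'\"").strip()
        pairs.append((code, value))
    return pairs


# Made-up sample in the layout of a label block
sample = """value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""

print(parse_label_block(sample, "i94model"))
```

Splitting with `split('=', 1)` keeps values that contain an equals sign intact, which a plain `split('=')` would reject.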


### Immigration data

In [None]:
i94_df_cleanup.count()

In [None]:
i94_df_cleanup.limit(10).toPandas().head(10)

### Filter valid ports

In [None]:
i94_sas_label_des_filename = "I94_SAS_file_content.SAS"
with open(i94_sas_label_des_filename) as f:
    lines = f.readlines()

re_compiled = re.compile(r"\'(.*)\'.*\'(.*)\'")
valid_ports = {}
for line in lines[302:961]:
    results = re_compiled.search(line)
    valid_ports[results.group(1)] = results.group(2)
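
The loop above assumes every line in the slice matches the pattern; `search` returns `None` for a line that does not, and calling `.group` on that raises an error. The sketch below uses the same regular expression on a few sample lines and skips non-matching ones. The sample lines are illustrative and not copied from the label file, and stripping the trailing padding from the description is an added assumption.

```python
import re

port_pattern = re.compile(r"\'(.*)\'.*\'(.*)\'")

# Illustrative lines in the layout the pattern expects
sample_lines = [
    "   'ALC'\t=\t'ALCAN, AK             '",
    "   'ANC'\t=\t'ANCHORAGE, AK         '",
    "/* a line with no quoted pair */",
]

ports = {}
for line in sample_lines:
    match = port_pattern.search(line)
    if match is None:
        continue  # not a code/description line
    ports[match.group(1)] = match.group(2).strip()

print(ports)
```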


### valid states

In [None]:
valid_states = demographics_df.toPandas()["State Code"].unique().tolist()
type(valid_states)
print(valid_states)

In [None]:
#valid_states = demographics_df.select('State Code').distinct().collect()
#print(valid_states.toPandas())

In [None]:
demographics_df_cleanup.select('State Code').distinct().count()

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The following star schema is used because it is simple and powerful.


### staging_i94
    id
    date
    city_code
    state_code
    age
    gender
    visa_type
    count

### staging_temperature
    year
    month
    city_code
    city_name
    avg_temp
    lat
    long

### staging_demographics
    city_code
    state_code
    city_name
    medianAge
    male_pop
    female_pop
    veterans
    foreign_born
    total_pop
### Dimension Tables
### dim_immigration
    id
    gender
    age
    visa_type

### dim_demographics
    city_code
    state_code
    city_name
    medianAge
    male_pop
    female_pop
    veterans
    foreign_born
    total_pop
    lat
    long
### dim_monthly_city_temp
    city_code
    year
    month
    avg_temp

### dim_time
    date
    dayofweek
    weekofyear
    month
### Fact Table
### immigrations
    id
    state_code
    city_code
    date
    count

### Refer to the tables.png file for the schema

#### 3.2 Mapping Out Data Pipelines

### Steps necessary to pipeline the data into the chosen data model

1. Clean the data: handle nulls, fix data types, remove duplicates, etc.
2. Load the staging tables stag_i94_df, stag_temp_df and stag_demo_df
3. Create the dimension tables imm_df, city_df, monthly_city_temp_df and time_df
4. Create the fact table immigration_df with the immigration count, mapping id to imm_df, city_code to city_df and monthly_city_temp_df, and date to time_df to ensure referential integrity
5. Save the processed dimension and fact tables as parquet for downstream queries
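
The referential-integrity goal in step 4 can be illustrated with plain Python sets: every key used by the fact table should also exist in the matching dimension table. This is only a sketch on made-up rows, and `missing_keys` is a hypothetical helper. In Spark, the same question is usually asked with a `left_anti` join between the fact and dimension DataFrames.

```python
def missing_keys(fact_rows, dim_rows, key):
    """Return the fact key values that have no matching row in the dimension."""
    dim_keys = {row[key] for row in dim_rows}
    fact_keys = {row[key] for row in fact_rows}
    return sorted(fact_keys - dim_keys)


# Tiny made-up rows, for illustration only
fact = [
    {"id": 1, "city_code": "NYC"},
    {"id": 2, "city_code": "ATL"},
    {"id": 3, "city_code": "XXX"},
]
dim_city = [{"city_code": "NYC"}, {"city_code": "ATL"}]

print(missing_keys(fact, dim_city, "city_code"))  # ['XXX']
```

An empty result means every fact row can be joined to the dimension on that key.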

### Clean immigration data

In [None]:
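# UDF that keeps a state code only if it appears in valid_states;
# anything else becomes the string 'None' so it can be filtered out later
@udf(StringType())
def state_validation(st):
    if st in valid_states:
        return st
    return 'None'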

In [None]:
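# Convert a SAS date (days since 1960-01-01) to an ISO date string.
# arrdate was already converted to a string in cell 7, so strings pass through unchanged.
@udf(StringType())
def conv_date(x):
    if x is None:
        return None
    if isinstance(x, str):
        return x
    return (datetime(1960, 1, 1).date() + timedelta(days=x)).isoformat()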


In [None]:
# Remove rows with missing values (a null in any of the columns i94port, i94addr, gender)
i94_c_d = i94_df.dropna(how="any", subset=["i94port","i94addr","gender"])

In [None]:
i94_c_d.limit(10).toPandas().head(10)

In [None]:
i94_c_d = i94_c_d.withColumn("i94addr", state_validation(i94_c_d.i94addr))

In [None]:
i94_c_d= i94_c_d.withColumn("arrdate", conv_date(i94_c_d.arrdate))


In [None]:
i94_c_d = i94_c_d.filter(i94_c_d.i94addr != 'None')


In [None]:
i94_c_d.count()

In [None]:
### staging i94 df table
# i94port is the port of entry (city code); i94addr is the state code
i94_s_t = i94_c_d.select(
    col("cicid").alias("id"),
    col("arrdate").alias("date"),
    col("i94port").alias("city_code"),
    col("i94addr").alias("state_code"),
    col("i94bir").alias("age"),
    col("gender").alias("gender"),
    col("i94visa").alias("visa_type"), "count").drop_duplicates()



In [None]:
i94_s_t.limit(10).toPandas().head(10)

In [None]:
# UDF that maps a city name to the first port code whose description contains it
@udf(StringType())
def city_to_port(city):
    if city is None:
        return None
    for key in valid_ports:
        if city.lower() in valid_ports[key].lower():
            return key
    return None
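
The mapping above relies on a case-insensitive substring match, so the first port whose description contains the city name wins. The plain-Python sketch below shows that behaviour, including a false positive. The function name `city_to_port_plain` and the three-entry `ports` dictionary are assumptions for illustration, not the real mapping.

```python
def city_to_port_plain(city, ports):
    """Return the first port code whose description contains the city name."""
    if not city:
        return None
    needle = city.lower()
    for code, description in ports.items():
        if needle in description.lower():
            return code
    return None


# Illustrative subset of a port mapping (assumed values)
ports = {
    "ATL": "ATLANTA, GA",
    "NYC": "NEW YORK, NY",
    "NEW": "NEWARK/TETERBORO, NJ",
}

print(city_to_port_plain("Newark", ports))  # NEW
print(city_to_port_plain("York", ports))    # NYC, although "York" is a different city
print(city_to_port_plain(None, ports))      # None
```

If false positives like the second call matter, comparing the city against the text before the comma would be stricter, at the cost of missing combined entries.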

In [None]:
# Temperature clean up: keep US rows only
df_temp_con = df_temp_con.filter(df_temp_con["Country"] == "United States")

In [None]:
# Remove rows with a missing city name
df_temp_con = df_temp_con.dropna(how="any", subset=["City"])

In [None]:
cleaned_temp_df = df_temp_con \
    .withColumn("year", year(df_temp_con["dt"])) \
    .withColumn("month", month(df_temp_con["dt"])) \
    .withColumn("i94port", city_to_port(df_temp_con["City"])) \
    .withColumn("AverageTemperature", col("AverageTemperature").cast("float")) \
    .dropna(how='any', subset=["i94port"])

cleaned_temp_df.limit(10).toPandas().head(10)

In [None]:
# Consider only the data for the year 2013
cleaned_temp_df = cleaned_temp_df.filter(cleaned_temp_df["year"] == 2013)

In [None]:
stag_temp_df = cleaned_temp_df.select(col("year"), col("month"), col("i94port").alias("city_code"),
                                         round(col("AverageTemperature"), 1).alias("avg_temp"),
                                         col("Latitude").alias("lat"), col("Longitude").alias("long")).drop_duplicates()

In [None]:
print(stag_temp_df.count())
stag_temp_df.limit(5).toPandas()

In [None]:
stag_temp_df.printSchema()

In [None]:
# The percentage is stored as race_pct: Spark column names are case-insensitive by
# default, so naming it "race" would overwrite the original Race label column
c_demo_df = demographics_df.withColumn("medianAge", demographics_df['Median Age']) \
    .withColumn("male_pop", (demographics_df['Male Population'] / demographics_df['Total Population']) * 100) \
    .withColumn("female_pop", (demographics_df['Female Population'] / demographics_df['Total Population']) * 100) \
    .withColumn("veterans", (demographics_df['Number of Veterans'] / demographics_df['Total Population']) * 100) \
    .withColumn("foreign_born", (demographics_df['Foreign-born'] / demographics_df['Total Population']) * 100) \
    .withColumn("race_pct", (demographics_df['Count'] / demographics_df['Total Population']) * 100) \
    .withColumn("city_code", city_to_port(demographics_df["City"])) \
    .dropna(how='any', subset=["city_code"])

c_demo_df.limit(10).toPandas().head(10)


In [None]:
# Keep the race label and its share of the population as separate columns
cleaned_demo_df = c_demo_df.select(col("City").alias("city_name"),
                                   col("State Code").alias("state_code"),
                                   "medianAge", "male_pop", "female_pop", "veterans",
                                   "foreign_born",
                                   col("Total Population").alias("total_pop"),
                                   col("Race").alias("race"),
                                   "race_pct").drop_duplicates()

cleaned_demo_df.count()

In [None]:
# One row per city, with one column per race holding its percentage
p_demo_df = cleaned_demo_df.groupBy("city_name", "state_code", "medianAge", "male_pop",
                                        "female_pop","veterans", "foreign_born", "total_pop").pivot("race").avg("race_pct")

p_demo_df = p_demo_df.withColumn("city_code", city_to_port(p_demo_df["city_name"])) \
    .dropna(how='any', subset=["city_code"])

p_demo_df.limit(10).toPandas().head(10)



In [None]:
stag_demo_df = p_demo_df.select("city_code", "state_code", "city_name", "medianAge", \
                                    round(col("male_pop"), 1).alias("male_pop"),\
                                    round(col("female_pop"), 1).alias("female_pop"),\
                                    round(col("veterans"), 1).alias("veterans"),\
                                    round(col("foreign_born"), 1).alias("foreign_born"), "total_pop")
stag_demo_df.limit(10).toPandas()
stag_demo_df.printSchema()

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
im_df = i94_s_t.select("id", "gender", "age", "visa_type").drop_duplicates()

In [None]:
#im_df.count()

In [None]:
c_df = stag_demo_df.join(stag_temp_df, "city_code") \
    .select("city_code", "state_code", "city_name", "medianAge", "male_pop", "female_pop", "veterans",
           "foreign_born", "total_pop", "lat", "long").drop_duplicates()
c_df.limit(10).toPandas().head(10)

In [None]:
m_df = stag_temp_df.select("city_code", "year", "month", "avg_temp").drop_duplicates()
m_df.limit(10).toPandas().head(10)

In [None]:
#m_df.count()

In [None]:
time_df = i94_s_t.withColumn("dayofweek", dayofweek("date"))\
                .withColumn("weekofyear", weekofyear("date"))\
                .withColumn("month", month("date"))
                        
time_df = time_df.select("date", "dayofweek", "weekofyear", "month").drop_duplicates()
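
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), and `weekofyear` follows the ISO 8601 week definition. The sketch below reproduces those conventions with the standard library so a few dates can be checked by hand. The helper name `time_parts` is an assumption for illustration.

```python
from datetime import date


def time_parts(iso_date):
    """Return (dayofweek, weekofyear, month) using Spark's numbering conventions."""
    d = date.fromisoformat(iso_date)
    dayofweek = d.isoweekday() % 7 + 1  # Spark: 1 = Sunday ... 7 = Saturday
    weekofyear = d.isocalendar()[1]     # ISO 8601 week number
    return dayofweek, weekofyear, d.month


print(time_parts("2016-04-01"))  # (6, 13, 4): a Friday in ISO week 13
```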

In [None]:
#time_df.count()

In [None]:
time_df.limit(5).toPandas().head(5)

In [None]:
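# Build the fact table from the cleaned immigration data
immigration_df = i94_c_d.select(col("cicid").alias("id"), col("i94addr").alias("state_code"),
                                col("i94port").alias("city_code"), col("arrdate").alias("date"),
                                "count").drop_duplicates()

# Write to dimension tables (im_df holds the id, gender, age and visa_type columns)
im_df.write.mode("overwrite").partitionBy("gender", "age").parquet("immigrants")
c_df.write.mode("overwrite").partitionBy("state_code").parquet("cities")
m_df.write.mode("overwrite").parquet("monthly_city_temperatues")
time_df.write.mode("overwrite").parquet("time")

# Write to fact table
immigration_df.write.mode("overwrite").partitionBy("state_code", "city_code").parquet("immigration")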

### 4.2 Data Quality Checks

In [None]:

def load_parquets():
    # Load the immigrants parquet files, create a view and query it
    read_im_df = pySparkSession.read.parquet("immigrants/")
    read_im_df.createOrReplaceTempView("immigrants")
    table_im_df = pySparkSession.sql("select * from immigrants limit 10")
    cleanup.data_quality_check(table_im_df, "immigrants")
    table_im_df.printSchema()
    # Load the cities parquet files, create a view and query it
    table_c_df = pySparkSession.read.parquet("cities/")
    table_c_df.createOrReplaceTempView("cities")
    citi_table = pySparkSession.sql("select * from cities limit 10")
    cleanup.data_quality_check(citi_table, "cities")
    # Load the monthly city temperature parquet files, create a view and query it
    table_m_df = pySparkSession.read.parquet("monthly_city_temperatues/")
    table_m_df.createOrReplaceTempView("city_temperatures")
    city_temparature_table = pySparkSession.sql("select * from city_temperatures limit 10")
    cleanup.data_quality_check(city_temparature_table, "city_temperatures")
    # Load the time parquet files, create a view and query it
    table_time = pySparkSession.read.parquet("time/")
    table_time.createOrReplaceTempView("time")
    time_table = pySparkSession.sql("select * from time limit 10")
    cleanup.data_quality_check(time_table, "time")



In [None]:
load_parquets()

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.


### Dimension Tables

### city_df
    city_code: represents city port code
    state_code: represents state code of the city
    city_name: represents name of the city
    medianAge: represents median age of the city
    male_pop: represents city's male population in %
    female_pop: represents city's female population in %
    veterans: represents city's veteran population in %
    foreign_born: represents city's foreign born population in %
    total_pop: represents city's total population
    lat: represents latitude of the city
    long: represents longitude of the city
	
### imm_df
    id: represents id of immigrant
    gender: represents gender of immigrant
    age: represents age of immigrant
    visa_type: represents immigrant's visa type

### monthly_city_temp_df
    city_code: represents city port code
    year: represents year
    month: represents month 
    avg_temp: represents average temperature in city for given month

### time_df
    date: represents date
    dayofweek: represents day of the week
    weekofyear: represents week of year
    month: represents month
### Fact Table
### immigration_df
    id: represents id
    state_code: represents state code of arrival city
    city_code: represents city port code of arrival city
    date: represents date of arrival
    count: represents count of immigrant's entries into the US

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.


Apache Spark was chosen for its ability to process large data sets, its APIs for reading data, and its convenient DataFrame manipulation functions.

* Propose how often the data should be updated and why.

The immigration (I94) data set and the related data can be updated monthly, because the reports built on them are produced monthly or seasonally.

* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x: An Amazon EMR cluster with Apache Spark installed can be used to process the larger volume before the results are stored on S3, which scales automatically as the data grows.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day: We can schedule the job in Apache Airflow to run every day so that the dashboard is refreshed by 7am.
 * The database needed to be accessed by 100+ people: We can use Amazon Redshift to store the staging, dimension and fact tables. It runs as a cluster, which improves query performance and lets many people query the data at the same time.

In [None]:
# read from dimension tables
read_im_df = pySparkSession.read.parquet("immigrants/")


In [None]:
read_im_df.printSchema()

In [None]:
def execute_quries():
    # Immigrant count by gender
    visa_type_count_male = pySparkSession.sql("select count(*),gender from immigrants  group by gender limit 10")
    visa_type_count_male.show()

    # Avg temperature per month per city (only 2013 is loaded into city_temperatures)
    city_temparature_table = pySparkSession.sql("select * from city_temperatures where year = 2013")
    city_temparature_table.show()
    

In [None]:
execute_quries()

In [None]:
# Avg temperature in year 2013 per month per city
city_temparature_table = pySparkSession.sql("select * from city_temperatures where year = 2013")
city_temparature_table.show()


In [None]:
# Most popular cities: total_pop is a column of the cities table, not immigrants
popular_cities = pySparkSession.sql("select distinct city_code, total_pop from cities order by total_pop desc")
popular_cities.show()

In [None]:
check_stats()

In [None]:
# Most popular cities


In [None]:
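# The DataFrames created inside load_parquets() are local to that function,
# so print the schemas from the registered temp views instead
pySparkSession.table("cities").printSchema()
pySparkSession.table("city_temperatures").printSchema()
pySparkSession.table("time").printSchema()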