# Immigrant cities data
### Data Engineering Capstone Project

#### Project Summary

In the Udacity provided project, I'll work with four datasets to complete the project. The main dataset will include data on immigration to the United States, and supplementary datasets will include data on airport codes, U.S. city demographics, and temperature data.

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [21]:
# Do all imports and installs here
import pandas as pd
import re
pd.set_option('display.max_columns', 30) # to view all columns 
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import count,when,isnan, col, udf, year, month, round, dayofweek, weekofyear, isnull
from pyspark.sql.types import StringType

### Step 1: Scope the Project and Gather Data

#### Scope 
The objective of this project is to collect data from three different sources and produce fact and dimension tables in order to conduct immigration analysis using Spark and Pandas in the United States utilizing criteria such as city average temperature, city demographics, population number, and percentage.

#### Describe and Gather Data 
**I94 Immigration Data:** This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. [This](https://www.trade.gov/national-travel-and-tourism-office) is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.

**World Temperature Data:** This dataset came from Kaggle. You can read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

**U.S. City Demographic Data:** This data comes from OpenSoft. You can read more about it [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).



In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

df_i94 = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
df_temp = spark.read.format("csv").option("delimiter", ",").option("header", "true").load('../../data2/GlobalLandTemperaturesByCity.csv')
df_demo = spark.read.format("csv").option("delimiter", ";").option("header", "true").load('us-cities-demographics.csv')
df_airport = spark.read.format("csv").option("delimiter", ",").option("header", "true").load('airport-codes_csv.csv')

In [3]:
#write to parquet
#df_spark.write.parquet("sas_data")
#df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 

#### Check number and duplicates of rows for each dataset

In [4]:
print('Number immigration rows: ',df_i94.count())
print('Number of distinct immigration rows: ',df_i94.distinct().count())
print('Number demographics rows: ',df_demo.count())
print('Number of distinct demographics rows: ',df_demo.distinct().count())
print('Number temperature rows: ',df_temp.count())
print('Number of distinct temperature rows: ',df_temp.distinct().count())

Number immigration rows:  3096313
Number of distinct immigration rows:  3096313
Number demographics rows:  2891
Number of distinct demographics rows:  2891
Number temperature rows:  8599212
Number of distinct temperature rows:  8599212


#### Check Schema for each dataset

In [5]:
print('Immigration Schema:')
df_i94.printSchema()
print('Demographics Schema:')
df_demo.printSchema()
print('Temperature Schema:')
df_temp.printSchema()

Immigration Schema:
root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum:

#### Display five records for each dataset

In [6]:
df_i94.limit(5).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,37.0,2.0,1.0,,,,T,,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,25.0,3.0,1.0,20130811.0,SEO,,G,,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,55.0,2.0,1.0,20160401.0,,,T,O,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,28.0,2.0,1.0,20160401.0,,,O,O,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,4.0,2.0,1.0,20160401.0,,,O,O,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


In [7]:
df_demo.limit(5).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402


In [8]:
df_temp.limit(5).toPandas()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


#### Check the number of nulls for each column in each dataset 

In [9]:
df_i94.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_i94.columns]
   ).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,0,0,0,0,0,0,0,239,152592,142457,802,0,0,1,1881250,3088187,238,138429,3095921,138429,802,477,414269,2982605,83627,0,19549,0


occup, entdepu and insnum columns seem to be useless since they are over 3 million records are missing.

In [10]:
df_demo.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_demo.columns]
   ).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,0,0,0,3,3,0,13,13,16,0,0,0


There are few null values, the dataset will not be affected dropping these rows.

In [11]:
df_temp.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_temp.columns]
   ).toPandas()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,0,364130,364130,0,0,0,0


There are a lot of null values, these nulls will be dropped. 

There are a lot of null values in iata_code column, these nulls will be dropped. 

In [12]:
# check duplicates in cicid column
if df_i94.count() > df_i94.dropDuplicates(['cicid']).count():
    raise ValueError('Data has duplicates')

#### Cleaning Steps

In [13]:
# Drop columns from Immigration dataset
cols = ('occup', 'entdepu','insnum')
df_i94 = df_i94.drop(*cols)
df_i94 = df_i94.dropna(how="any", subset=['i94port', 'i94addr', 'gender'])
df_i94.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = true)
 |-- fltno: string (nullable = true)
 |-- visatype: string (nullable = true)



In [14]:
df_i94.limit(5).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,entdepa,entdepd,matflag,biryear,dtaddto,gender,airline,admnum,fltno,visatype
0,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,25.0,3.0,1.0,20130811,SEO,G,,,1991.0,D/S,M,,3736796000.0,296,F1
1,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,55.0,2.0,1.0,20160401,,T,O,M,1961.0,09302016,M,OS,666643200.0,93,B2
2,27.0,2016.0,4.0,101.0,101.0,BOS,20545.0,1.0,MA,20549.0,58.0,1.0,1.0,20160401,TIA,G,O,M,1958.0,04062016,M,LH,92478760000.0,422,B1
3,28.0,2016.0,4.0,101.0,101.0,ATL,20545.0,1.0,MA,20549.0,56.0,1.0,1.0,20160401,TIA,G,O,M,1960.0,04062016,F,LH,92478900000.0,422,B1
4,29.0,2016.0,4.0,101.0,101.0,ATL,20545.0,1.0,MA,20561.0,62.0,2.0,1.0,20160401,TIA,G,O,M,1954.0,09302016,M,AZ,92503780000.0,614,B2


In [15]:
print('Number of demographics rows before dropping null values: ',df_demo.count())
df_demo = df_demo.na.drop("any")
print('Number of demographics rows after dropping null values: ', df_demo.count())

Number of demographics rows before dropping null values:  2891
Number of demographics rows after dropping null values:  2875


In [16]:
df_demo.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_demo.columns]
   ).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,0,0,0,0,0,0,0,0,0,0,0,0


In [17]:
print('Number of temperature rows before dropping null values: ',df_temp.count())
df_temp = df_temp.na.drop("any")
print('Number of temperature rows after dropping null values: ', df_temp.count())

Number of temperature rows before dropping null values:  8599212
Number of temperature rows after dropping null values:  8235082


In [18]:
df_temp.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df_temp.columns]
   ).toPandas()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,0,0,0,0,0,0,0


In [19]:
# Create list of valid ports
i94_sas_label_descriptions = 'I94_SAS_Labels_Descriptions.SAS'
with open(i94_sas_label_descriptions) as f:
    lines = f.readlines()

re_compiled = re.compile(r"\'(.*)\'.*\'(.*)\'")
valid_ports = {}
for line in lines[302:961]:
    results = re_compiled.search(line)
    valid_ports[results.group(1)] = results.group(2)

# Create list of valid states
valid_states = df_demo.toPandas()['State Code'].unique()
print(valid_states)

['MD' 'MA' 'AL' 'CA' 'NJ' 'IL' 'AZ' 'MO' 'NC' 'PA' 'KS' 'FL' 'TX' 'VA' 'NV'
 'CO' 'MI' 'CT' 'MN' 'UT' 'AR' 'TN' 'OK' 'WA' 'NY' 'GA' 'NE' 'KY' 'SC' 'LA'
 'NM' 'IA' 'RI' 'DC' 'WI' 'OR' 'NH' 'ND' 'DE' 'OH' 'ID' 'IN' 'AK' 'MS' 'HI'
 'SD' 'ME' 'MT']


In [22]:
# Create udf to convert SAS date to PySpark date 
@udf(StringType())
def convert_datetime(x):
    if x:
        return (datetime(1960, 1, 1).date() + timedelta(x)).isoformat()
    return None

# Create udf to validate state
@udf(StringType())
def validate_state(x):  
    if x in valid_states:
        return x
    return 'null'

In [23]:
# Extract valid states 
df_i94_cleaned = df_i94.withColumn('i94addr', validate_state(df_i94['i94addr']))

# Convert arrdate from SAS to PySpark format
df_i94_cleaned = df_i94.withColumn('arrdate', convert_datetime(df_i94['arrdate']))

# filter out null values
df_i94_cleaned = df_i94_cleaned.filter(df_i94_cleaned.i94addr != 'null')


df_i94_staging = df_i94_cleaned.select(col('cicid').alias('id'), 
                                       col('arrdate').alias('date'),
                                       col('i94port').alias('city_code'),
                                       col('i94addr').alias('state_code'),
                                       col('i94bir').alias('age'),
                                       col('gender').alias('gender'),
                                       col('i94visa').alias('visa_type'),
                                       col('i94mode').alias('transportation_type'),
                                       'count').drop_duplicates()

df_i94_staging.limit(5).toPandas()

Unnamed: 0,id,date,city_code,state_code,age,gender,visa_type,transportation_type,count
0,158.0,2016-04-01,NEW,NY,24.0,F,2.0,1.0,1.0
1,538.0,2016-04-01,PHI,PR,53.0,M,2.0,1.0,1.0
2,1047.0,2016-04-01,NEW,NY,45.0,M,2.0,1.0,1.0
3,1258.0,2016-04-01,NYC,NY,49.0,F,2.0,1.0,1.0
4,1517.0,2016-04-01,NYC,NY,50.0,F,2.0,1.0,1.0


In [24]:
# Calculate percentages of numeric columns and create new ones
df_demo_cleaned = df_demo.withColumn("median_age", df_demo['Median Age']) \
    .withColumn("total_pop",df_demo['Total Population'])\
    .withColumn("num_male_pop", df_demo['Male Population']) \
    .withColumn("prct_male_pop", (df_demo['Male Population'] / df_demo['Total Population']) * 100) \
    .withColumn("num_female_pop", df_demo['Female Population']) \
    .withColumn("prct_female_pop", (df_demo['Female Population'] / df_demo['Total Population']) * 100) \
    .withColumn("num_veterans", df_demo['Number of Veterans']) \
    .withColumn("prct_veterans", (df_demo['Number of Veterans'] / df_demo['Total Population']) * 100) \
    .withColumn("num_foreign_born", df_demo['Foreign-born'] ) \
    .withColumn("prct_foreign_born", (df_demo['Foreign-born'] / df_demo['Total Population']) * 100) \
    .withColumn("race", df_demo['Race']) \
    .withColumn("state_code",df_demo['State Code'])\
    .withColumn("city",df_demo['City'])\
    .dropna(how='any', subset=["state_code"])

df_demo_staging = df_demo_cleaned.select("median_age",'total_pop','num_male_pop','prct_male_pop',"num_female_pop",
                                         "prct_female_pop","num_veterans","prct_veterans","num_foreign_born","prct_foreign_born",
                                         "race",'state_code','city' )
                                         

df_demo_staging.limit(5).toPandas()

Unnamed: 0,median_age,total_pop,num_male_pop,prct_male_pop,num_female_pop,prct_female_pop,num_veterans,prct_veterans,num_foreign_born,prct_foreign_born,race,state_code,city
0,33.8,82463,40601,49.235415,41862,50.764585,1562,1.894183,30908,37.481052,Hispanic or Latino,MD,Silver Spring
1,41.0,93629,44129,47.131765,49500,52.868235,4147,4.429183,32935,35.176067,White,MA,Quincy
2,38.5,84839,38040,44.837869,46799,55.162131,4819,5.680171,8229,9.699549,Asian,AL,Hoover
3,34.5,175232,88127,50.291613,87105,49.708387,5821,3.321882,33878,19.333227,Black or African-American,CA,Rancho Cucamonga
4,34.6,281913,138040,48.965461,143873,51.034539,5829,2.067659,86253,30.595609,White,NJ,Newark


In [25]:
# filter out the temperature dataset by United States

df_temp_cleaned= df_temp.filter(col('Country') == 'United States') \
    .withColumn('year', year(df_temp['dt'])) \
    .withColumn('month', month(df_temp['dt'])) \
    .withColumn('week#',weekofyear(df_temp['dt']))\
    .withColumn("city", df_temp["City"])\
    .withColumn("AverageTemperature", col("AverageTemperature").cast("float")) \
    .dropna(how='any', subset=["city"])

# use temperatures from the year 2006 and above
df_temp_cleaned = df_temp_cleaned.filter(df_temp_cleaned["year"] >= 2006)

df_temp_staging = df_temp_cleaned.select(col('year'), 
                                         col('month'), 
                                         col('week#'),
                                         col('city'),
                                         round(col('AverageTemperature'), 1).alias('avg_temperature'),
                                         col('Latitude'), 
                                         col('Longitude')).drop_duplicates()

df_temp_staging.limit(5).toPandas()

Unnamed: 0,year,month,week#,city,avg_temperature,Latitude,Longitude
0,2008,1,1,Abilene,5.8,32.95N,100.53W
1,2009,10,40,Amarillo,13.3,34.56N,101.19W
2,2013,6,22,Anaheim,18.6,32.95N,117.77W
3,2007,11,44,Anchorage,-5.3,61.88N,151.13W
4,2012,10,40,Ann Arbor,10.8,42.59N,82.91W


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model

#### Staging tables:
| Immigration stage | Demographics stage | Temperature stage |
|:-----|:------:|:-----:|
|id                   |total_pop              |year
|date                 |num_male_pop          | month
|city_code            |prct_male_pop         |week#
|state_code           |num_female_pop        | city
|age                  |prct_female_pop         | avg_temperature
|gender               |num_veterans           | Latitude
|visa_type           |prct_veterans         |Longitude
|transportation_type |num_foreign_born
|                     |prct_foreign_born
|                      |race  
|                     |  state_code 
|                      |city
|                     | median_age



#### Fact table:
 | Immigration fact|
 | ----|
 |id|
 | date|
 |city|
 | city_code|
 |state_code|
 | count|
 

#### Dimension  tables
|immigration dim | demographics dim| temperature dim | time dim
|:----|:---:|:---:|:---|
|id|                 state_code        |city| date
|age|                city              | year| year
|visa_type           | median_age     |  month| month
|transportation_type| num_male_pop      |week#| week#
|   gender           |prct_male_pop     |avg_temperature|day
|                    |num_female_pop|
|                     |prct_female_pop|
|                   | num_veterans| 
|                     |prct_veterans|
|                      |num_foreign_born|
|                    |  prct_foreign_born|
|                  |total_pop| 
|                  |race| 
|                  |   Longitude  |
|                   | Latitude|


          
    

#### 3.2 Mapping Out Data Pipelines
Listing the steps necessary to pipeline the data into the chosen data model

1. Clean the data from null values, duplicates, etc
2. Load staging tables.
3. Create fact and dimension tables


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Building the data pipelines to create the data model.

In [26]:
df_immigration_dim = df_i94_staging.select('id','age','visa_type','transportation_type','gender').drop_duplicates()

In [27]:
df_immigration_dim.limit(5).toPandas()

Unnamed: 0,id,age,visa_type,transportation_type,gender
0,151887.0,51.0,2.0,1.0,M
1,158024.0,64.0,2.0,1.0,F
2,620410.0,22.0,2.0,1.0,M
3,689745.0,57.0,2.0,1.0,F
4,706743.0,34.0,1.0,1.0,M


In [28]:
df_demographics_dim =df_demo_staging.join(df_temp_staging,'city').select('state_code',
                                                                         'city',
                                                                         'median_age',
                                                                         'total_pop',
                                                                         'num_male_pop',
                                                                         'prct_male_pop',
                                                                         'num_female_pop',
                                                                         'prct_female_pop',
                                                                         'num_veterans',
                                                                         'prct_veterans',
                                                                         'num_foreign_born',
                                                                         'prct_foreign_born',
                                                                         'race',
                                                                         'Longitude',
                                                                         'Latitude').drop_duplicates()
                                                                         

In [29]:
df_demographics_dim.limit(5).toPandas()

Unnamed: 0,state_code,city,median_age,total_pop,num_male_pop,prct_male_pop,num_female_pop,prct_female_pop,num_veterans,prct_veterans,num_foreign_born,prct_foreign_born,race,Longitude,Latitude
0,VA,Arlington,34.4,229164,113981,49.737742,115183,50.262258,12719,5.550174,55205,24.089735,White,76.99W,39.38N
1,MA,Boston,31.8,669469,322149,48.120077,347320,51.879923,18350,2.740978,190123,28.399074,Asian,72.00W,42.59N
2,GA,Columbus,33.7,200579,98785,49.249921,101794,50.750079,21747,10.842112,10376,5.173024,Hispanic or Latino,85.21W,32.95N
3,CA,Garden Grove,37.8,175384,86779,49.479428,88605,50.520572,4316,2.460886,80677,46.000205,Asian,117.77W,32.95N
4,CA,Rialto,31.6,103137,49902,48.384188,53235,51.615812,2036,1.974073,32741,31.745155,White,116.76W,34.56N


In [30]:
df_temperature_dim = df_temp_staging.select('city','year','month','week#','avg_temperature').drop_duplicates()

In [31]:
df_temperature_dim.limit(5).toPandas()

Unnamed: 0,city,year,month,week#,avg_temperature
0,Chattanooga,2010,5,17,21.700001
1,Richmond,2010,12,48,0.1
2,Jacksonville,2010,12,48,10.5
3,Metairie,2011,12,48,13.5
4,Plano,2006,1,52,11.8


In [32]:
df_immigration_fact = df_i94_staging.join(df_demographics_dim,'state_code').join(df_temperature_dim,'city').select('id',
                                                                                                                  'date',
                                                                                                                  'city',
                                                                                                                  'city_code',
                                                                                                                  'state_code',
                                                                                                                  'count').drop_duplicates()

In [33]:
df_immigration_fact.limit(5).toPandas()

Unnamed: 0,id,date,city,city_code,state_code,count
0,4646.0,2016-04-01,Worcester,BOS,MA,1.0
1,12839.0,2016-04-01,Worcester,DUB,MA,1.0
2,13836.0,2016-04-01,Worcester,BOS,MA,1.0
3,25082.0,2016-04-01,Worcester,BOS,MA,1.0
4,28524.0,2016-04-01,Worcester,BOS,MA,1.0


In [34]:
df_time_dim = df_i94_staging.select('date')\
                            .withColumn('year', year(df_i94_staging['date']))\
                            .withColumn('month', month(df_i94_staging['date']))\
                            .withColumn('day',dayofweek(df_i94_staging['date']))\
                            .withColumn('week#',weekofyear(df_i94_staging['date'])).drop_duplicates()

In [35]:
df_time_dim.limit(5).toPandas()

Unnamed: 0,date,year,month,day,week#
0,2016-04-18,2016,4,2,16
1,2016-04-03,2016,4,1,13
2,2016-04-11,2016,4,2,15
3,2016-04-05,2016,4,3,14
4,2016-04-19,2016,4,3,16


#### 4.2 Data Quality Checks


In [36]:
# Function to check tables availability 
def check_tables(df):
    if df is not None:
        print("Data quality check PASSED.\nFact and Dimension tables are available.")
        return True      
    else:
        print("Data quality check failed.\nThere are some tables that are missing!")
        return False
        
check_tables(df_immigration_dim) & \
check_tables(df_demographics_dim) & \
check_tables(df_temperature_dim) &\
check_tables(df_immigration_fact) & \
check_tables(df_time_dim)

Data quality check PASSED.
Fact and Dimension tables are available.
Data quality check PASSED.
Fact and Dimension tables are available.
Data quality check PASSED.
Fact and Dimension tables are available.
Data quality check PASSED.
Fact and Dimension tables are available.
Data quality check PASSED.
Fact and Dimension tables are available.


True

In [None]:
# Function to check if there are values that exist in tables
def values_check(df):
    if df.count() !=0:
        print("Data quality check passed.\nValues in Fact and Dimension tables are available.")
        return True
    else:
        print("Data quality check failed.\nThere are some tables that are empty.")
        return False

values_check(df_immigration_dim) & \
values_check(df_demographics_dim) & \
values_check(df_temperature_dim) & \
values_check(df_immigration_fact) & \
values_check(df_time_dim)

Data quality check passed.
Values in Fact and Dimension tables are available.
Data quality check passed.
Values in Fact and Dimension tables are available.
Data quality check passed.
Values in Fact and Dimension tables are available.


#### 4.3 Data dictionary 
 
    
 | Immigration fact|description|
 | ----|-
  |id:| id 
 | date:| arrival date 
 |city:| arrival city 
  | city_code:| arrival city code
  |state_code:| arrival state code
 | count:| used to count how many arrival to US

 &nbsp;
 
|immigration dim | description|
|:----|-|
|id:|   immigrant's id            
|age:|   immigrant's age             
|visa_type:   |    immigrant's visa type     
|transportation_type:| immigrant's transportation type
|   gender:  |  immigrant's gender        

&nbsp;

|demographics dim| description|
|:----|-|
|state_code:        |city port code|
|city:              | city name|
| median_age:     | median age of the city|
| num_male_pop:  |  number of the male population 
|prct_male_pop:  |   percentage of the male population
|num_female_pop:| number of the female population
|prct_female_pop:|percentage of the female population
| num_veterans:| number of the veterans population
|prct_veterans:| percentage of the veterans population
|num_foreign_born:| number of the foreign population
|  prct_foreign_born:|percentage of the foreign population
|total_pop:| total number of the city's population
|race:| Respondent race
|Longitude:  | city longitude
| Latitude:| city latitude

&nbsp;

 |temperature dim | description|
|:----|-|
|city:| city name
| year:| year of the record
|  month:| month of the record 
|week#:| week of the record 
|avg_temperature:| average temperature

&nbsp;

| time dim|description|
|:---|--|
| date:|date of the record|
| year:|year of the record|
| month:|month of the record|
| week#:|week of the record |
|day:|day week of the record |



#### Step 5: Complete Project Write Up

Spark is chosen for this project as it isprovides a faster and more general data processing platform. Spark lets you run programs up to 100x faster in memory, or 10x faster on disk, than Hadoop. 

The data is used for reporting purposes. Whenever new data is needed, this code provides the ability to have cleaned data and organized.

How you would approach the problem differently under the following scenarios:

If the data was increased by 100x: In the future, we could explore scaling up the number of EC2 instances hosting Spark, as well as adding more Spark work nodes. Processing time should be able to be sped up as a result of increased capacity, which might come from either vertical or horizontal scaling.

If the data populates a dashboard that must be updated on a daily basis by 7am every day: We might explore utilizing Airflow to plan and automate the data pipeline jobs, which would be quite convenient. We may be able to satisfy customer requirements thanks to the built-in retry and monitoring mechanisms.

If the database needed to be accessed by 100+ people: We may think about putting our solution in a production-scale data warehouse in the cloud, which will have more capacity to service more users and will include workload management to guarantee that resources are distributed equally across users.