# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import psycopg2
from pyspark.sql import SparkSession
from insert_data import insert_fact, insert_dim_imm_per, insert_dim_imm_air, insert_dim_demo_info, insert_dim_demo_stat, insert_dim_temp, insert_cntry_code, insert_port_code, insert_mode, insert_state_code, insert_visa

### Step 1: Scope the Project and Gather Data

#### Scope 

The purpose of this project is to see the connection between the temperature of the cities that people immigrate to. We could check the preferred destination cities for immigrants based on the immigration year as well. There will be 1 fact table with 5 dimension tables processed from the datasets below. This data model will be used for a further data analysis using SQL in order to see the relationship as I mentioned previously.

__Data Used:__
- I94 Immigration Data
- Temperature Data
- Demographics Data

__Tools Used:__
- AWS
    - Redshift
- Python
    - PySpark
- SQL

#### I94 Immigration Data
This data comes from [National Travel and Trourism Office(NTTO)](https://www.trade.gov/national-travel-and-tourism-office). The subject of the data is the immigrants going to the U.S. and the information is gives are where they come from, birth year, gender, visa type, etc.

In [2]:
# Read in the data here
fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df = pd.read_sas(fname, 'sas7bdat', encoding="ISO-8859-1")

In [3]:
df.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


In [4]:
df.columns

Index(['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate',
       'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count',
       'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu',
       'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline',
       'admnum', 'fltno', 'visatype'],
      dtype='object')

In [5]:
fact_immigration = df[['cicid', 'i94yr', 'i94mon', 'i94port', 'arrdate', 'depdate', 'i94mode', 'i94visa']]
fact_immigration.columns = ['cic_id', 'year', 'month', 'dep_city', 'arrival_date', 'dep_date', 'travel_code', 'visa']
fact_immigration['country'] = 'United States'
fact_immigration.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,cic_id,year,month,dep_city,arrival_date,dep_date,travel_code,visa,country
0,6.0,2016.0,4.0,XXX,20573.0,,,2.0,United States
1,7.0,2016.0,4.0,ATL,20551.0,,1.0,3.0,United States
2,15.0,2016.0,4.0,WAS,20545.0,20691.0,1.0,2.0,United States
3,16.0,2016.0,4.0,NYC,20545.0,20567.0,1.0,2.0,United States
4,17.0,2016.0,4.0,NYC,20545.0,20567.0,1.0,2.0,United States


In [6]:
dim_immigration_personal = df[['cicid', 'i94cit', 'i94res', 'biryear', 'gender', 'insnum']]
dim_immigration_personal.columns = ['cic_id', 'citizen_country', 'resident_country', 'birthyear', 'gender', 'ins_number']
dim_immigration_personal.head()

Unnamed: 0,cic_id,citizen_country,resident_country,birthyear,gender,ins_number
0,6.0,692.0,692.0,1979.0,,
1,7.0,254.0,276.0,1991.0,M,
2,15.0,101.0,101.0,1961.0,M,
3,16.0,101.0,101.0,1988.0,,
4,17.0,101.0,101.0,2012.0,,


In [7]:
dim_immigration_air = df[['cicid', 'airline', 'admnum', 'fltno', 'i94visa', 'visatype']]
dim_immigration_air.columns = ['cic_id', 'airline', 'admin_number', 'flight_number', 'visa', 'visa_type']
dim_immigration_air.head()

Unnamed: 0,cic_id,airline,admin_number,flight_number,visa,visa_type
0,6.0,,1897628000.0,,2.0,B2
1,7.0,,3736796000.0,296.0,3.0,F1
2,15.0,OS,666643200.0,93.0,2.0,B2
3,16.0,AA,92468460000.0,199.0,2.0,B2
4,17.0,AA,92468460000.0,199.0,2.0,B2


#### Temperature Data
This data comes from [Kaggle](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data), and it shows the temperature in different cities around the world. It's recorded monthly, and the values are the average temperature of that month.

In [8]:
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
df_temp = pd.read_csv(fname)

In [9]:
df_temp.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [10]:
df_temp_us = df_temp[df_temp['Country'] == 'United States']
df_temp_us.columns = ['dt', 'avg_temp', 'avg_temp_uncertainty', 'city', 'country', 'latitude', 'longitude']
df_temp_us.head()

Unnamed: 0,dt,avg_temp,avg_temp_uncertainty,city,country,latitude,longitude
47555,1820-01-01,2.101,3.217,Abilene,United States,32.95N,100.53W
47556,1820-02-01,6.926,2.853,Abilene,United States,32.95N,100.53W
47557,1820-03-01,10.767,2.395,Abilene,United States,32.95N,100.53W
47558,1820-04-01,17.989,2.202,Abilene,United States,32.95N,100.53W
47559,1820-05-01,21.809,2.036,Abilene,United States,32.95N,100.53W


In [11]:
df_temp_us['dt'] = pd.to_datetime(df_temp_us['dt'])
df_temp_us['month'] = pd.DatetimeIndex(df_temp_us['dt']).month
df_temp_us['year'] = pd.DatetimeIndex(df_temp_us['dt']).year
df_temp_us.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,dt,avg_temp,avg_temp_uncertainty,city,country,latitude,longitude,month,year
47555,1820-01-01,2.101,3.217,Abilene,United States,32.95N,100.53W,1,1820
47556,1820-02-01,6.926,2.853,Abilene,United States,32.95N,100.53W,2,1820
47557,1820-03-01,10.767,2.395,Abilene,United States,32.95N,100.53W,3,1820
47558,1820-04-01,17.989,2.202,Abilene,United States,32.95N,100.53W,4,1820
47559,1820-05-01,21.809,2.036,Abilene,United States,32.95N,100.53W,5,1820


#### US Cities Demographics Data
This data comes from [OpenSoft](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/), and it shows the demographics of the cities in the U.S. including the population, median age, the state the city belongs to, etc.

In [12]:
fname = './us-cities-demographics.csv'
df_demo = pd.read_csv(fname, delimiter=';')

In [13]:
df_demo.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [14]:
df_demo_info = df_demo[['City', 'State', 'Male Population', 'Female Population', 'Total Population', 'Number of Veterans', 'Foreign-born', 'State Code', 'Race']]
df_demo_info.columns = ['city', 'state', 'm_population', 'f_population', 'total_population', 'num_of_veterans', 'foreign_born', 'state_code', 'race']
df_demo_info.head()

Unnamed: 0,city,state,m_population,f_population,total_population,num_of_veterans,foreign_born,state_code,race
0,Silver Spring,Maryland,40601.0,41862.0,82463,1562.0,30908.0,MD,Hispanic or Latino
1,Quincy,Massachusetts,44129.0,49500.0,93629,4147.0,32935.0,MA,White
2,Hoover,Alabama,38040.0,46799.0,84839,4819.0,8229.0,AL,Asian
3,Rancho Cucamonga,California,88127.0,87105.0,175232,5821.0,33878.0,CA,Black or African-American
4,Newark,New Jersey,138040.0,143873.0,281913,5829.0,86253.0,NJ,White


In [15]:
df_demo_stat = df_demo[['City', 'State', 'Median Age', 'Average Household Size', 'State Code']]
df_demo_stat.columns = ['city', 'state', 'median_age', 'avg_household_size', 'state_code']
df_demo_stat.head()

Unnamed: 0,city,state,median_age,avg_household_size,state_code
0,Silver Spring,Maryland,33.8,2.6,MD
1,Quincy,Massachusetts,41.0,2.39,MA
2,Hoover,Alabama,38.5,2.58,AL
3,Rancho Cucamonga,California,34.5,3.18,CA
4,Newark,New Jersey,34.6,2.73,NJ


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [16]:
fact_immigration.dtypes

cic_id          float64
year            float64
month           float64
dep_city         object
arrival_date    float64
dep_date        float64
travel_code     float64
visa            float64
country          object
dtype: object

#### Converting floats into datetime format for fact_immigration table

In [17]:
fact_immigration['arrival_date'] = pd.to_datetime(fact_immigration['arrival_date'], unit='D',
               origin=pd.Timestamp('1960-01-01'))
fact_immigration['dep_date'] = pd.to_datetime(fact_immigration['dep_date'], unit='D',
               origin=pd.Timestamp('1960-01-01'))

fact_immigration = fact_immigration[pd.notnull(fact_immigration['arrival_date'])]
fact_immigration = fact_immigration[pd.notnull(fact_immigration['dep_date'])]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


In [18]:
fact_immigration.head()

Unnamed: 0,cic_id,year,month,dep_city,arrival_date,dep_date,travel_code,visa,country
2,15.0,2016.0,4.0,WAS,2016-04-01,2016-08-25,1.0,2.0,United States
3,16.0,2016.0,4.0,NYC,2016-04-01,2016-04-23,1.0,2.0,United States
4,17.0,2016.0,4.0,NYC,2016-04-01,2016-04-23,1.0,2.0,United States
5,18.0,2016.0,4.0,NYC,2016-04-01,2016-04-11,1.0,1.0,United States
6,19.0,2016.0,4.0,NYC,2016-04-01,2016-04-14,1.0,2.0,United States


#### Dropping the non-integer rows in flight_number column in the dim_immigration_air table

In [19]:
dim_immigration_air.where(dim_immigration_air['flight_number'] == 'LAND')

Unnamed: 0,cic_id,airline,admin_number,flight_number,visa,visa_type
0,,,,,,
1,,,,,,
2,,,,,,
3,,,,,,
4,,,,,,
5,,,,,,
6,,,,,,
7,,,,,,
8,,,,,,
9,,,,,,


In [20]:
dim_immigration_air['flight_number'] = dim_immigration_air[dim_immigration_air['flight_number'].apply(lambda x: str(x).isdigit())]
dim_immigration_air['flight_number'] = pd.to_numeric(dim_immigration_air['flight_number'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


#### Extracting necessary information from i94 description file

In [21]:
with open("I94_SAS_Labels_Descriptions.SAS") as f:
    contents = f.readlines()

In [22]:
countries = contents[9:298]
ports = contents[302:962]
modes = contents[972:976]
states = contents[981:1036]
visa = contents[1046:1049]


country = [x.strip().split('=') for x in countries]
country_code = [x[0].replace("'","") for x in country]
country_name = [x[1].replace("'","") for x in country]
df_country = pd.DataFrame({'country_code':country_code, 'country_name':country_name})

port = [x.strip().split('=') for x in ports]
port_code = [x[0].replace("'","").strip('\t') for x in port]
port_loc = [x[1].replace("'","").strip('\t') for x in port]
port_loc_c = [x.split(',')[0] for x in port_loc]
port_loc_s = [x.split(',')[-1] for x in port_loc]
df_port = pd.DataFrame({'port_code':port_code, 'port_loc_city':port_loc_c, 'port_loc_state':port_loc_s})

mode = [x.strip().split('=') for x in modes]
code = [x[0] for x in mode]
mode_name = [x[1].replace("'", "").strip(";") for x in mode]
df_mode = pd.DataFrame({'code':code, 'mode':mode_name})

state = [x.strip().split('=') for x in states]
code = [x[0].replace("'", "") for x in state]
state_name = [x[1].replace("'", "") for x in state]
df_state = pd.DataFrame({'state_code':code, 'state':state_name})

visa_list = [x.strip().split('=') for x in visa]
code = [x[0] for x in visa_list]
visa_name = [x[1] for x in visa_list]
df_visa = pd.DataFrame({'code':code, 'visa':visa_name})

In [30]:
df_port.dtypes

port_code         object
port_loc_city     object
port_loc_state    object
dtype: object

#### Dropping missing values and duplicate data

In [23]:
# immigration data
fact_immigration.dropna(subset=['cic_id'])
dim_immigration_personal.dropna(subset=['cic_id'])
dim_immigration_air.dropna(subset=['cic_id'])

fact_immigration.drop_duplicates(subset = 'cic_id', keep = 'first')
dim_immigration_personal.drop_duplicates(subset = 'cic_id', keep = 'first')
dim_immigration_air.drop_duplicates(subset = 'cic_id', keep = 'first')

# temperature data
df_temp_us.dropna()
df_temp_us.drop_duplicates(subset = 'dt', keep = 'first')

# demography data
df_demo_info.dropna()
df_demo_stat.dropna()
df_demo_info.drop_duplicates(subset = 'city', keep = 'first', inplace=True)
df_demo_stat.drop_duplicates(subset = 'city', keep = 'first', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
<img width="802" alt="Screen Shot 2021-09-09 at 10 46 07 PM" src="https://user-images.githubusercontent.com/79597984/132707777-a124e5d3-03d0-45c6-9bd5-61a8fc6f5618.png">

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model
1. Create the data model by copying the drop and create statements from `create_tables.sql` file into the query editor in redshift.
2. Run step 1 & 2 above to clean the data.
3. Insert the data into the tables created by running the below code in step `4.1`.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [24]:
conn = psycopg2.connect("host=redshift-cluster-1.******.us-west-2.redshift.amazonaws.com dbname=dev user=awsuser password=****** port=5439")
cur = conn.cursor()
conn.autocommit = True

In [25]:
def insert_data(df_table, query):
    for index, row in df_table.head(1000).iterrows():
        cur.execute(query, list(row.values))
        
tables = [fact_immigration, dim_immigration_personal, dim_immigration_air, df_demo_info, df_demo_stat, df_temp_us]
insert_q = [insert_fact, insert_dim_imm_per, insert_dim_imm_air, insert_dim_demo_info, insert_dim_demo_stat, insert_dim_temp]

for i in range(len(tables)):
    insert_data(tables[i], insert_q[i])

DataError: invalid input syntax for integer: "AL"


In [None]:
aux_tables = [df_country, df_port, df_mode, df_state, df_visa]
insert_aux = [insert_cntry_code, insert_port_code, insert_mode, insert_state_code, insert_visa]

for i in range(len(aux_tables)):
    insert_data(aux_tables[i], insert_aux[i])

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

#### Checking if the duplicated rows are all removed

In [26]:
def is_duplicated(p_key, table):
    duplicate_query = f"""
            SELECT {p_key}, COUNT({p_key})
            FROM {table}
            GROUP BY {p_key}
            HAVING COUNT({p_key})>1;
        """
    cur.execute(duplicate_query)
    data = cur.fetchall()
    
    if not data:
        print(f"There's no duplicated rows in the table {table}")
    else:
        print(f"There's {len(data)} duplicated rows in {p_key} in the table {table}")

In [27]:
p_keys = ['cic_id', 'cic_id', 'cic_id', 'city', 'city', 'dt']
tables = ['fact_immigration', 'dim_immigration_personal', 'dim_immigration_air', 'dim_demo_info', 'dim_demo_stat', 'dim_temp']

for i in range(len(tables)):
    is_duplicated(p_keys[i], tables[i])

There's no duplicated rows in the table fact_immigration
There's no duplicated rows in the table dim_immigration_personal
There's no duplicated rows in the table dim_immigration_air
There's no duplicated rows in the table dim_demo_info
There's no duplicated rows in the table dim_demo_stat
There's no duplicated rows in the table dim_temp


#### Checking if the rows exist after inserting data

In [28]:
def is_empty(table):
    query = f"SELECT COUNT(*) FROM {table};"
    cur.execute(query) 
    data = cur.fetchall()
    
    if data[0][0] < 1:
        print(f"No data found in the table {table}")
    else:
        print(f"Successfully executed with {data[0][0]} records in the table {table}!")

In [29]:
tables = ['fact_immigration', 'dim_immigration_personal', 'dim_immigration_air', 'dim_demo_info', 'dim_demo_stat', 'dim_temp']

for table in tables:
    is_empty(table)

Successfully executed with 1000 records in the table fact_immigration!
Successfully executed with 1000 records in the table dim_immigration_personal!
Successfully executed with 1000 records in the table dim_immigration_air!
Successfully executed with 567 records in the table dim_demo_info!
Successfully executed with 567 records in the table dim_demo_stat!
Successfully executed with 1000 records in the table dim_temp!


#### 4.3 Data dictionary 
Sources:
- Immigration dataset : [US NTTO website](https://travel.trade.gov/research/reports/i94/historical/2016.html)
- Temperature dataset : [Kaggle](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)
- Demographic dataset : [OpenSoft](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)

__Fact Table__
- fact_immigration<br>
`cic_id` : city id<br>
`year` : year the object came in<br>
`month` : month the object came in<br>
`dep_city` : departing city<br>
`arrival_date` : arrival date in the US<br>
`dep_date` : departure date from the US<br>
`travel_code` : travel code<br>
`visa` : visa codes in 3 categories<br>

__Dimension Tables__
- dim_immigration_personal<br>
`cic_id` : city id<br>
`citizen_country` : the city the object is citizen in<br>
`resident_country` : residing city<br>
`birthyear` : birthyear in 4 digits<br>
`gender` : gender<br>
`ins_number` : INS number<br>
<br>
- dim_immigration_air<br>
`cic_id` : city id<br>
`airline` : airline used to arrive in the US<br>
`admin_number` : admission number<br>
`flight_number` : flight number of Airline used to arrive in the US<br>
`visa` : visa codes in 3 categories<br>
`visa_type` : class of admission legally admitting the non-immigrant to temporarily stay in the US<br>
<br>
- dim_temp<br>
`dt` : date<br>
`avg_temp` : average temperature<br>
`avg_temp_uncertainty` : average tempertaure uncertainty<br>
`city` : city <br>
`country` : country<br>
`latitude` : latitude<br>
`longitude` : longitude<br>
<br>
- dim_demo_info<br>
`city` : city in the US<br>
`state` : state in the US<br>
`m_population` : male population of the city<br>
`f_population` : female population of the city<br>
`total_population` : total population of the city<br>
`num_of_veterans` : number of veterans in the city<br>
`foreign_born` : number of foreign-borns in. the city <br>
`state_code` : state code<br>
`race` : race of the object<br>
<br>
- dim_demo_stat<br>
`city` : city in the US<br>
`state` : state in the US<br>
`median_age` : median age of the population in the city<br>
`avg_household_size` : average household size of the population of the city<br> 
`state_code` : state code<br>


#### Step 5: Complete Project Write Up

##### Steps taken in the project
1. Load the data using Pandas DataFrame and then process them
2. Clean the processed data frames
    - Remove the rows with unmatching data types
    - Remove empty rows
    - Remove duplicates
    - Convert the column to a better data type
3. Create the tables using SQL and Redshift, and also create the schema
4. Connect to Redshift cluster and insert the data we prepared in step 1 & 2
5. Go through Data qulity check in order to see if the tables were made in the way we wanted
    - Check if there's any duplicated data
    - Check if any of the rows are empty

##### Purpose of the output model
The output data model will be used for further analysis of the information we obtained from the above steps. For example, we would be able to find out the relationship between the temperature and the population of the city in the US.

##### Choice of tools
1. Pandas DataFrame to process and clean the data before inserting.
2. Spark(PySpark) to insert the data into the schema created with SQL.

__Why use star schema?__
I used star schema because it's an easy schema for me to aggregate the tables. Also, there was no need for me to use the snowflake schema since the data model I wanted didn't need multiple level of relationships.

##### How often the data should be updated and why
The immigration data should be updated monthly since it's in a monthly form, and the temperature data should be updated annually.

##### Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 - Using `Spark` along with `EMR(Hadoop)` should help me process the data in much bigger size. This way, I would be able to handle the copy of the data on cloud.
 
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 - For the automated system like this situation, `Apache Airflow` would be the best choice since we can schedule the ETL process within Airflow. With this powerful automation tool, we can also send out emails to other teams in charge when there's an error in the task.
 
 * The database needed to be accessed by 100+ people.
 - `AWS Redshift` will allow us to have out database accessed by a number of people in a stable manner. 

##### Sample Query
Query to check the median age of the city the immigrant arrived to, where its average household size is bigger than 3, along with the birthyear(age) of the immigrant.
<img width="1208" alt="Screen Shot 2021-09-15 at 2 54 51 AM" src="https://user-images.githubusercontent.com/79597984/133317339-2048c67c-0bb3-4650-a3c8-60a5ca9287be.png">

__Ouput__
<img width="1214" alt="Screen Shot 2021-09-15 at 2 53 04 AM" src="https://user-images.githubusercontent.com/79597984/133317394-fe98b36b-4310-4883-a9a2-712a1bfc3e75.png">
