In [1]:
%%html
<style>
table {float:left}
</style>

# Immigration Data ETL
## Data Engineering Capstone Project

### Project Introduction
"Flyby Salad" is a take-away food franchise with over 100 locations at airports worldwide. They offer a variety of snacks, meals and drinks to travellers and specialise in using only fresh and healthy ingredients. Flyby Salad's menu offerings are tailored to specific traveller groups who enjoy a "taste of home" while abroad.

To improve the menu offerings, Flyby Salad's data analytics team has hired you, the Data Engineer, to set up a new dataset with fact and dimension tables for further analysis. The dataset shall be created from immigration data provided by the U.S. Customs and Border Protection Agency as well as weather data, demographic information and others.

You are to analyze and select appropriate data sources and create an ETL pipeline with a set of fact and dimension tables for further analysis using suitable AWS services. A regular daily run should ensure up-to-date information is available to Flyby Salad's product management department.

The project follows these steps:
    
| Description                               | Where to find |
| ----------------------------------------- | ---------------: |
| Step 1: Scope the Project and Gather Data | **This notebook** |
| Step 2: Explore and Assess the Data       | Notebook `02_Immigration-Data_ExploreAssess.ipynb` |
| Step 3: Define the Data Model             | Notebook `03_Immigration-Data_DataModel.ipynb` |
| Step 4: Run ETL to Model the Data         | Notebook `04_Immigration-Data.ipynb` |
| Step 5: Complete Project Write Up         | File `Readme.md` | 


## Step 1 - Scoping and Data Gathering
This section contains a general description and overview of all datasets that were suggested for the project scope. Each dataset is then described using a structured approach (explained in the next section).

Step 1 will conclude with a summary on the project scope at the end of this notebook.


### Describing the approach
The datasets provided by Udacity for the Capstone Project include:
* I94 Immigration data from 2016 provided by the U.S. Customs and Border Protection agency
* World Temperature Data
* U.S. cities demographic data
* An airport code table

These datasets comprise the _possible analytics scope_, from which the actual scope is then derived based on an assessment of the data.

In order to make a reasoned selection each dataset will be collected at least once for assessment. The findings are included in the following chapters of this notebook, even if the dataset is not used in Step 2. No additional datasets will be added to th

In Step 1 we will describe each of the datasets mentioned above and document the results in this notebook. Each description will follow the same pattern and shall address the following issues:
1. Create a Pandas dataframe from the data or a sample of the data
1. Get a first understanding of the data and create insights from this "first impression"
1. Analyse available documentation for each dataset
1. Create statistics about number of entries, duplicates, missing values and other basic statistics of _numeric and non-numeric_ columns. Conclude with a summary that contains:
    1. Findings about the set's columns' semantics, possible ways to link the data, possible analytic value
    1. Identify missing values, duplicates and other inconsistencies
    1. Reason if and why the dataset should be added to / removed from scope
1. Conclusion with regards to the resulting project scope
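
The basic statistics named in the list above (number of entries, duplicates, missing values, unique values) can be sketched in a few lines of Pandas. This is only an illustration of the idea and not the `summarize_data` helper imported from `nb_helpers` later on; the function name `quick_profile` and the tiny sample frame are assumptions made for this example.

```python
import pandas as pd


def quick_profile(df):
    """Return dtype, missing count, fill rate and unique count per column."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'fill_pct': df.notna().mean() * 100,
        'unique': df.nunique(dropna=True),
    })


# Hypothetical sample; the last row duplicates the third one on purpose
sample = pd.DataFrame({
    'cicid': [6.0, 7.0, 15.0, 15.0],
    'gender': [None, 'M', 'M', 'M'],
})

profile = quick_profile(sample)
n_duplicates = int(sample.duplicated().sum())
print(profile)
print('Rows: {}, duplicated rows: {}'.format(len(sample), n_duplicates))
```

Keeping the result as a dataframe indexed by column name makes it easy to filter, for instance for columns with a low fill rate.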


### Imports and Installs

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import logging
import sys
from datetime import datetime
from os.path import getsize
from nb_helpers import summarize_data, get_sas_definitions, read_sas_in_chunks, read_csv_print

# Logging
logging.basicConfig(
    level=logging.ERROR,
    format='%(asctime)s %(levelname)s \t %(message)s ',
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('log')

# Improve view
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


### Data Description Section 1 - I94 Immigration Dataset

#### I94 Immigration Dataset Description
The dataset contains immigration data provided by US immigration authorities. The data is collected via form **I94** and covers people travelling to and from the US who are **neither United States citizens nor lawful permanent residents** of the US.

    “Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” (https://en.wikipedia.org/wiki/Form_I-94)

An overview of this dataset is also outlined [here](https://travel.trade.gov/research/programs/i94/description.asp).

Data files and formats:
- Data files are stored in the proprietary SAS `sas7bdat` format
- One folder exists per year
- One file exists per month (~500 MB)

Description file:
- A description file for the fields was included, named *I94_SAS_Labels_Descriptions.SAS*
- The file contains field descriptions for each column
- And it contains value constraints for some columns, namely: *i94cnty, i94port, i94mode, i94addr*

#### I94 Immigration Data First Read
As Pandas has a method to import SAS data we will be using this mechanism. The following code will read a defined number of lines only due to performance reasons.

Reading the whole SAS file in the workspace resulted in long wait times for no obvious reason. Downloading and reading the file locally (4-Cores, 8GB laptop) took only 80 seconds.
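
The helper `read_sas_in_chunks` used below lives in `nb_helpers` and is not shown in this notebook. The following is a minimal sketch of how such a chunked read could work, assuming it relies on the `chunksize` parameter of `pd.read_sas` and stops once enough rows are collected. The function name `take_rows` and the in-memory stand-in chunks are assumptions made for this example.

```python
import pandas as pd


def take_rows(chunks, max_lines):
    """Concatenate dataframe chunks until max_lines rows have been collected."""
    parts, total = [], 0
    for chunk in chunks:
        parts.append(chunk.iloc[:max_lines - total])
        total += len(parts[-1])
        if total >= max_lines:
            break  # stop early so the remaining file is never read
    return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()


# With a real file the chunks would come from the Pandas SAS reader, e.g.:
# reader = pd.read_sas(sas_file, format='sas7bdat', chunksize=50000)
# sas_df = take_rows(reader, max_lines=100000)

# In-memory stand-in: three chunks with four rows each
demo_chunks = (pd.DataFrame({'cicid': range(i, i + 4)}) for i in range(0, 12, 4))
demo_df = take_rows(demo_chunks, max_lines=10)
print(len(demo_df))
```

Breaking out of the loop right after the limit is reached avoids reading one more chunk than necessary.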

In [2]:
# Read in the data using a wrapper for the read_sas() method
# Configuration
sas_file =  '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
sas_file_format = 'sas7bdat'
max_lines=100000     # Set the desired line number here
for_lines=50000     # Set the desired lines for each cycle here

sas_df = read_sas_in_chunks(sas_file, sas_file_format, max_lines, for_lines)


'START reading SAS file ../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat, total filesize is: 450 Mb'

'Importing lines from 1 to 50000 of total 100000 lines'

'Importing lines from 50001 to 100000 of total 100000 lines'

'STOP reading SAS files'

'First lines of data and data types:'

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,37.0,2.0,1.0,,,,T,,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,25.0,3.0,1.0,20130811.0,SEO,,G,,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,55.0,2.0,1.0,20160401.0,,,T,O,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,28.0,2.0,1.0,20160401.0,,,O,O,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,4.0,2.0,1.0,20160401.0,,,O,O,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,float64,float64,float64,float64,float64,object,float64,float64,object,float64,float64,float64,float64,object,object,object,object,object,object,object,float64,object,object,object,object,float64,object,object


'DONE reading SAS data in chunks, time elapsed is    7.28 seconds'

**Summary on first read of data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| all | (N) We are not importing everything here, since the files amount to about 6 GB in total | |
| |  (N) Using the "chunksize" parameter and then breaking from the loop, so that we have a handy 100,000 lines ||
| all | In total 28 columns exist, 15 columns contain strings (object type) and 13 contain numbers (float64 type) | |
| all | At first sight one can already spot unfamiliar date columns (arrdate, depdate, etc.) with various datatypes | |
| all | Several rows have missing values | |
| all | Some columns contain obviously integer values but float64 was assigned | |
| all | Some categorical columns seem to exist | |
| `dtadfile` | Contains an obviously malformed entry in row 2 ("2013...") since the dataset is supposed to contain data from April 2016 only | Search for more of such values |

#### Documentation and data dictionary analysis


**Documentation sources**

A link was provided to the data source at the Visitor Arrivals Program: [LINK](https://travel.trade.gov/research/reports/i94/historical/2016.html)

The following related pages were also analyzed:
* Approved listing of [countries by world region](https://travel.trade.gov/research/programs/i94/1999-2020%20Region%20Dictionary.xlsx), stored as an Excel file
    * Recommendation: use as source to validate data
* A [Q&A section](https://travel.trade.gov/research/programs/ifs/qamythbuster.asp) which contains indications of the completeness and accuracy of the data
    * Those should be considered before making assumptions or drawing conclusions from the data
* Detailed descriptions of the data collection [methodology](https://travel.trade.gov/research/programs/ifs/description.asp)

**Parsing the description file / data dictionary**

The workspace contains a field description file for the dataset named `I94_SAS_Labels_Descriptions.SAS`

The file seems pretty well structured, so I wrote a quick parser to automatically check the description file (see [SAS-Description-Parser](https://r766466c839826xjupyterlnnfq3jud.udacity-student-workspaces.com/lab/tree/SAS-Description-Parser.ipynb) for further details).
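
The parser itself is kept in the linked notebook and is not reproduced here. As a rough illustration of the idea, the sketch below extracts `code = 'label'` pairs with a regular expression. The line layout in `sample_lines` is an assumption about how value constraints might appear in the description file, and `parse_value_pairs` is a name made up for this example.

```python
import re

# Matches lines such as  'AL'='ALABAMA'  or  582 =  'MEXICO'
PAIR = re.compile(r"^\s*'?([^'=]+?)'?\s*=\s*'([^']*)'")


def parse_value_pairs(lines):
    """Collect code -> label pairs from lines of a SAS label description."""
    mapping = {}
    for line in lines:
        match = PAIR.match(line)
        if match:
            mapping[match.group(1).strip()] = match.group(2).strip()
    return mapping


# Hypothetical excerpt, only meant to exercise the pattern
sample_lines = [
    "/* hypothetical excerpt */",
    "   'AL'='ALABAMA'",
    "   'ATL'\t=\t'ATLANTA, GA           '",
    "   582 =  'MEXICO'",
]

constraints = parse_value_pairs(sample_lines)
print(constraints)
```

A dictionary like this can later serve as the list of allowed values when validating columns with value constraints.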

**Definitions**

| **Variable name** | **Data Type** | **Description** |
|---------------|---------------|---------------|
| i94yr | float64 | 4 digit year |
| i94mon | float64 | Numeric month |
| i94cit | float64 | This format shows all the valid and invalid codes for processing |
| i94res | float64 | This format shows all the valid and invalid codes for processing |
| i94port | object | This format shows all the valid and invalid codes for processing |
| arrdate | float64 | is the Arrival Date in the USA. It is a SAS date numeric field that a permanent format has not been applied.  Please apply whichever date format |
| i94mode | float64 | There are missing values as well as not reported (9) |
| i94addr | object | There is lots of invalid codes in this variable and the list below |
| depdate | float64 | is the Departure Date from the USA. It is a SAS date numeric field that a permanent format has not been applied.  Please apply whichever date format |
| i94bir | float64 | Age of Respondent in Years |
| i94visa | float64 | Visa codes collapsed into three categories: 1 = Business, 2 = Pleasure, 3 = Student |
| count | float64 | Used for summary statistics |
| dtadfile | object | Character Date Field |
| visapost | object | Department of State where Visa was issued |
| occup | object | Occupation that will be performed in U.S. |
| entdepa | object | Arrival Flag |
| entdepd | object | Departure Flag |
| entdepu | object | Update Flag |
| matflag | object | Match flag |
| biryear | float64 | 4 digit year of birth |
| dtaddto | object | Character Date Field |
| gender | object | Non |
| insnum | object | INS number |
| airline | object | Airline used to arrive in U.S. |
| admnum | float64 | Admission Number |
| fltno | object | Flight number of Airline used to arrive in U.S. |
| visatype | object | Class of admission legally admitting the non |

**Summary on data documentation and descriptions:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `i94cnty, i94port, i94mode, i94addr` | have value constraints (lists of allowed entry values) | Change data types, validate content (see below) |
| `i94cnty` | contains country short codes and their corresponding country names | Change to category field |
| `i94port` | contains port/airport codes from various cities, apparently without specific selection criteria (most of the codes are cities in the US, but there are also city codes from Europe and Asia) | Change data type to string, validate using the airport table `airport-codes_csv.csv` |
| `i94mode` | is a code for the mode of travel (by air, by sea or by land) or unknown | Change to category field |
| `i94addr` | is a code for the state in which the immigrant's temporary address is located (aka "First Intended Address") | Change to category field |


#### I94 analysis of numeric columns
The Pandas describe() function creates a basic set of descriptive statistics for each numeric column in the data frame.

In [3]:
summarize_data(sas_df, ['numbers'])

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 100000 rows and 13 columns. Godspeed!'

'Quantifying NUMERIC data types in columns:'

'cicid, i94yr, i94mon, i94cit, i94res, arrdate, i94mode, depdate, i94bir, i94visa, count, biryear, admnum'

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,biryear,admnum
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,99999.0,96194.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,78261.79979,2016.0,4.0,296.76358,300.37938,20545.00034,1.00555,20560.189638,40.3371,1.90722,1.0,1975.6629,70120770000.0
std,65475.726951,0.0,0.0,208.55203,209.822608,0.090554,0.084849,22.70339,17.845464,0.341896,0.0,17.845464,20945540000.0
min,6.0,2016.0,4.0,101.0,101.0,20545.0,1.0,20544.0,0.0,1.0,1.0,1916.0,39845880.0
25%,30037.75,2016.0,4.0,135.0,131.0,20545.0,1.0,20550.0,28.0,2.0,1.0,1962.0,55430250000.0
50%,58714.5,2016.0,4.0,209.0,209.0,20545.0,1.0,20553.0,40.0,2.0,1.0,1976.0,55457610000.0
75%,91764.25,2016.0,4.0,464.0,509.0,20545.0,1.0,20560.0,54.0,2.0,1.0,1988.0,92467440000.0
max,218060.0,2016.0,4.0,734.0,749.0,20573.0,9.0,20716.0,100.0,3.0,1.0,2016.0,92517120000.0
Unique,100000.0,1.0,1.0,180.0,193.0,3.0,3.0,172.0,100.0,3.0,1.0,100.0,100000.0
Missing,0.0,0.0,0.0,0.0,0.0,0.0,1.0,3806.0,0.0,0.0,0.0,0.0,0.0


'Columns with missing values:'

'i94mode, depdate'

'Columns with less than 10 unique values:'

'i94yr, i94mon, arrdate, i94mode, i94visa, count'

"Unique values in column 'i94yr': "

'      2016'

"Unique values in column 'i94mon': "

'         4'

"Unique values in column 'arrdate': "

'     20573,      20551,      20545'

"Unique values in column 'i94mode': "

'         1,          2,          9'

"Unique values in column 'i94visa': "

'         2,          3,          1'

"Unique values in column 'count': "

'         1'

'Data Quantification Done, time elapsed is    0.69 sec'

**Summary on numeric data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `cicid` | is unique for all 100,000 lines (check `len(sas_df['cicid'].unique())`) and appears to be the primary key for each record | Change to int, use as primary key |
| | The following columns appear to indicate datetime-related values: | |
| `i94yr` | indicates the year the I94 form was filled in; `i94mon` indicates the month | Change to int |
| `arrdate` | is the immigrant's arrival date | Change to datetime64 |
| `depdate` | is the date of the immigrant's (planned) departure | Change to datetime64 |
| `i94mode` | has already been identified as a category variable; the integers here are just codes indicating whether the immigrant travelled by land, air or sea (or unknown) | Change to category |
| `i94visa` | was apparently not identified correctly by my parser; it has value constraints (1 = Business, 2 = Pleasure, 3 = Student) | Change to category |
| `i94cit` and `i94res` | are again not numeric but indicate the immigrant's countries of citizenship ("cit") and residence ("res") | Change to category, validate data using value constraints |
| `admnum` | is the admission number | Use as key variable to connect several rows |
| `i94bir` | appears to be the immigrant's age at the time of admission (in other words, the difference between `i94yr` and `biryear`) | Change to int |
| `biryear` | marks the immigrant's year of birth | Change to int |
| `count` | is for statistical purposes according to the description | Change to int |

#### I94 Analysis of non-numeric columns
Measuring the number of NaN entries and unique values

In [4]:
summarize_data(sas_df, ['object'])

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 100000 rows and 15 columns. Godspeed!'

'Quantifying NON-NUMERIC data types in columns:'

'i94port, i94addr, dtadfile, visapost, occup, entdepa, entdepd, entdepu, matflag, dtaddto, gender, insnum, airline, fltno, visatype'

Unnamed: 0,i94port,i94addr,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,dtaddto,gender,insnum,airline,fltno,visatype
Datatype,object,object,object,object,object,object,object,object,object,object,object,object,object,object,object
Lines,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000,100000
Non-Null,100000,96415,99999,39610,334,100000,96194,13,96194,99998,86465,0,99998,99999,100000
,0,3585,1,60390,99666,0,3806,99987,3806,2,13535,100000,2,1,0
Fill-%,100,96.415,99.999,39.61,0.334,100,96.194,0.013,96.194,99.998,86.465,0,99.998,99.999,100
Unique,92,126,2,277,42,8,8,2,1,247,2,0,173,2206,14
Uniq-%,0.092,0.126,0.002,0.277,0.042,0.008,0.008,0.002,0.001,0.247,0.002,0,0.173,2.206,0.014


'Columns with missing values:'

'i94addr, dtadfile, visapost, occup, entdepd, entdepu, matflag, dtaddto, gender, insnum, airline, fltno'

'Columns with less than 10 unique values:'

'dtadfile, entdepa, entdepd, entdepu, matflag, gender, insnum'

"Unique values in column 'dtadfile': "

'20130811, 20160401'

"Unique values in column 'entdepa': "

'T, G, O, H, U, B, K, M'

"Unique values in column 'entdepd': "

'O, K, I, Q, R, N, M, J'

"Unique values in column 'entdepu': "

'U, Y'

"Unique values in column 'matflag': "

'M'

"Unique values in column 'gender': "

'M, F'

"Unique values in column 'insnum': "

''

'Data Quantification Done, time elapsed is     2.8 sec'

**Summary on non-numeric data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `i94port, i94addr, dtadfile` | Value constraints exist | Change to category, validate against list of constraints |
| `dtadfile` | is the date on which the form was entered into the database | Change to datetime64 |
| `dtaddto` | is the date until which the immigrant is admitted to stay in the US | Change to datetime64 |
| `visapost, occup` | rarely filled, not fit for analysis | |
| `entdepa, entdepd, entdepu, matflag` | Unclear description in the data dictionary | Exclude from analysis |
| `gender` | Immigrant gender | Change to category |
| `insnum` | Immigration registration number, not filled in data sample | Check if field is filled in complete dataset |
| `visatype` | Type of issued visa | Check [online sources](https://travel.trade.gov/research/programs/i94/methodology.asp) for list of possible types |
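
The two character date fields appear to use different layouts in the sample: `dtadfile` looks like `YYYYMMDD` (e.g. `20160401`) and `dtaddto` like `MMDDYYYY` (e.g. `10282016`), with non-date codes such as `D/S` mixed in. Both formats are inferred from the rows shown above and are therefore assumptions, as is the helper name `parse_char_date`. A minimal sketch of a tolerant conversion:

```python
from datetime import datetime


def parse_char_date(value, fmt):
    """Parse a character date; return None for missing or non-date codes."""
    if value is None:
        return None
    try:
        return datetime.strptime(str(value).strip(), fmt).date()
    except ValueError:
        return None  # e.g. 'D/S' or NaN rendered as text


added_to_file = parse_char_date('20160401', '%Y%m%d')   # dtadfile layout
admitted_until = parse_char_date('10282016', '%m%d%Y')  # dtaddto layout
no_date = parse_char_date('D/S', '%m%d%Y')              # code, not a date
print(added_to_file, admitted_until, no_date)
```

Returning `None` instead of raising keeps the non-date codes visible as missing values, so they can be counted before deciding how to treat them.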


#### I94 Immigration Dataset conclusion

The immigration dataset contains significant datapoints about Flyby Salad's customer focus group. However, a number of columns with valuable information, like the immigrant's occupation (`occup`), show missing values.

A number of columns can be transformed into more suitable data types such as integer or datetime types before filling the fact or dimension tables.

For some columns the exact meaning was not clear (e.g. matflag) so it should be considered to remove them from the dataset (in reality you would ask a contact with more immigration business knowledge about their meaning).

Given the exact information about immigrant's age, country of origin and citizenship, gender and purpose of travel the Immigration Dataset offers valuable opportunities for fact and dimension tables.

### Data Description Section 2 - World Temperature Data

#### General Temperature Data Description

The World Temperature Dataset contains temperatures on land by city on a global scale. A detailed description can be found [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

The source indicates that other datasets exist which summarize the data e.g. by country or major cities.

#### Documentation Analysis

The dataset from Kaggle is documented on its hosting [site](https://www.kaggle.com/colinpbowen/starter-climate-change-earth-surface-e24bc90c-4).

The set consists of seven columns:
* `dt` is the timestamp of the temperature measurement
* `AverageTemperature` displays the average temperature in degrees Celsius
* `AverageTemperatureUncertainty` shows possible deviations from the average (95% confidence)
* `City` - name of the city, has 3,448 distinct values
* `Country` - name of the country, has 159 distinct values
* `Latitude, Longitude` - location of measurement

####  World Temperature Data First read

As Pandas has a method to import CSV data we will be using this mechanism.

Instead of reading just a chunk of the file we will read it in full here.

In [5]:
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
temperature_df = read_csv_print(fname, ',')

'START reading CSV file ../../data2/GlobalLandTemperaturesByCity.csv of (Filesize: 5.1e+02 Mb)'

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


'Done. Operation took 2.1e+01 seconds'

**Summary on first read:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| all | Dataset contains over 8.5 million rows in total | |
| `dt` | starting in 1743 | Convert to datetime64 |
| `AverageTemperature` | mind the missing values | Check for isna() |
| `AverageTemperatureUncertainty` | same as above | |
| `City, Country` | None | |
| `Latitude, Longitude` | | |

#### Analysis of numeric columns in Temperature Data

Only two numeric columns were identified:  `AverageTemperature`, `AverageTemperatureUncertainty`

In [6]:
summarize_data(temperature_df, ['numbers'])

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 8599212 rows and 2 columns. Godspeed!'

'Quantifying NUMERIC data types in columns:'

'AverageTemperature, AverageTemperatureUncertainty'

Unnamed: 0,AverageTemperature,AverageTemperatureUncertainty
count,8235082.0,8235082.0
mean,16.72743,1.028575
std,10.35344,1.129733
min,-42.704,0.034
25%,10.299,0.337
50%,18.831,0.591
75%,25.21,1.349
max,39.651,15.396
Unique,111994.0,10902.0
Missing,364130.0,364130.0


'Columns with missing values:'

'AverageTemperature, AverageTemperatureUncertainty'

'Columns with less than 10 unique values:'

''

'Data Quantification Done, time elapsed is     7.5 sec'

**Summary on numeric data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `AverageTemperature` | About 364,000 missing values | Exclude NaN from analysis |
| `AverageTemperatureUncertainty` | About 364,000 missing values, which is expected due to missing temperature data | Exclude NaN from analysis |

#### Analysis of non-numeric columns in Temperature

Non-numeric data columns, qualitative analysis

In [7]:
summarize_data(temperature_df, ['object'])

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 8599212 rows and 5 columns. Godspeed!'

'Quantifying NON-NUMERIC data types in columns:'

'dt, City, Country, Latitude, Longitude'

Unnamed: 0,dt,City,Country,Latitude,Longitude
Datatype,object,object,object,object,object
Lines,8599212,8599212,8599212,8599212,8599212
Non-Null,8599212,8599212,8599212,8599212,8599212
,0,0,0,0,0
Fill-%,100,100,100,100,100
Unique,3239,3448,159,73,1227
Uniq-%,0.0376662,0.0400967,0.00184901,0.000848915,0.0142687


'Columns with missing values:'

''

'Columns with less than 10 unique values:'

''

'Data Quantification Done, time elapsed is 6.2e+01 sec'

In [8]:
# Convert "dt" to datetime and sort, then check latest measurement date

# pd.to_datetime avoids the unit-less astype('datetime64'), which newer Pandas versions reject
temperature_df['dt'] = pd.to_datetime(temperature_df['dt'])
temperature_df = temperature_df.sort_values(by='dt', ascending=False)
print('The last datapoint is from the following date: {}'.format(temperature_df['dt'].iloc[0].date()))

The last datapoint is from the following date: 2013-09-01


In [9]:
# Check distribution of data points per City

# Count datapoints per city
count_df = temperature_df[['City', 'dt']].copy()
count_df = count_df.groupby(by=['City']).count()
count_df = count_df.sort_values(by='dt', ascending=False)
entries = len(temperature_df)
num_of_cit = len(count_df)
print('Average datapoints per city: {:6.0f} entries'.format((entries / num_of_cit)))
print('City with most data points: {}\t\t\t{} entries'.format(count_df.head(1).index[0], count_df['dt'].head(1).values[0]))
print('City with least data points: {}\t\t\t{} entries'.format(count_df.tail(1).index[0], count_df['dt'].tail(1).values[0]))


Average datapoints per city:   2494 entries
City with most data points: Springfield			9545 entries
City with least data points: Port Moresby			1581 entries


**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `City, dt` | city names and timestamps are one possible way to **link this dataset to the I94 immigration data** | Join datasets on a combined key of both fields |
| `dt` | The "freshest" data point is from September 2013 |  |
| all | There is a significant **imbalance in the number of data points**: while the average is 2,494 entries, the city with the most entries has 9,545 and the city with the fewest entries has 1,581 | |
| all | The number of datapoints per date also varies between about 700 and 3,500 | |


#### Conclusion on World Temperature Dataset

The "World Temperature Dataset" could be linked to the I94 immigration data by a combined key of `City` and `dt`. This would allow temperature data to be added as a dimension table when analyzing immigration data.

However, the dataset is not well suited to be analyzed in conjunction with the Immigration Data, since there is **no time period overlap**. Without temperature data from the time period of the provided immigration data, no significant findings from data analyses can be expected.

**Conclusion**: the World Temperature Dataset is ruled **out of scope**.
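
The missing overlap can be expressed as a simple interval check. The sketch below uses the first and last temperature dates seen in this notebook (1743-11-01 and 2013-09-01) and assumes the immigration data spans the calendar year 2016. The function name `periods_overlap` is made up for this example.

```python
from datetime import date


def periods_overlap(start_a, end_a, start_b, end_b):
    """Two closed periods overlap if each one starts before the other ends."""
    return start_a <= end_b and start_b <= end_a


temperature_period = (date(1743, 11, 1), date(2013, 9, 1))
immigration_period = (date(2016, 1, 1), date(2016, 12, 31))  # assumed full year

overlap = periods_overlap(*temperature_period, *immigration_period)
print('Periods overlap: {}'.format(overlap))
```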

### Data Description Section 3 - U.S. City Demographic Data

#### Demographic Data Description

A dataset of demographic data is provided from [opendatasoft](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/). It contains statistics about the population structure in various U.S. cities with a population greater than or equal to 65,000.

The data was gathered in the US Census Bureau's 2015 American Community Survey.

#### Demographic Data gathering and first read

We will use Pandas standard CSV reading method and read the complete file

In [10]:
fname = "us-cities-demographics.csv"
dem_data_df = read_csv_print(fname)

'START reading CSV file us-cities-demographics.csv of (Filesize:    0.24 Mb)'

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


'Done. Operation took    0.12 seconds'

#### Documentation Analysis

The dataset's documentation is rather brief and available  [online](https://public.opendatasoft.com/explore/embed/dataset/us-cities-demographics/table/?dataChart=eyJxdWVyaWVzIjpbeyJjb25maWciOnsiZGF0YXNldCI6InVzLWNpdGllcy1kZW1vZ3JhcGhpY3MiLCJvcHRpb25zIjp7fX0sImNoYXJ0cyI6W3siYWxpZ25Nb250aCI6dHJ1ZSwidHlwZSI6ImNvbHVtbiIsImZ1bmMiOiJBVkciLCJ5QXhpcyI6Im1lZGlhbl9hZ2UiLCJzY2llbnRpZmljRGlzcGxheSI6dHJ1ZSwiY29sb3IiOiIjRkY1MTVBIn1dLCJ4QXhpcyI6ImNpdHkiLCJtYXhwb2ludHMiOjUwLCJzb3J0IjoiIn1dLCJ0aW1lc2NhbGUiOiIiLCJkaXNwbGF5TGVnZW5kIjp0cnVlLCJhbGlnbk1vbnRoIjp0cnVlfQ%3D%3D). Meanings of column headings are pretty self-explanatory. No deeper analysis required.

#### Analysis of numeric columns in demographic data

The demographic data contains various numeric columns.

In [11]:
summarize_data(dem_data_df, ['numbers'])

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 2891 rows and 8 columns. Godspeed!'

'Quantifying NUMERIC data types in columns:'

'Median Age, Male Population, Female Population, Total Population, Number of Veterans, Foreign-born, Average Household Size, Count'

Unnamed: 0,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,Count
count,2891.0,2888.0,2888.0,2891.0,2878.0,2878.0,2875.0,2891.0
mean,35.494881,97328.43,101769.6,198966.8,9367.832523,40653.6,2.742543,48963.77
std,4.401617,216299.9,231564.6,447555.9,13211.219924,155749.1,0.433291,144385.6
min,22.9,29281.0,27348.0,63215.0,416.0,861.0,2.0,98.0
25%,32.8,39289.0,41227.0,80429.0,3739.0,9224.0,2.43,3435.0
50%,35.3,52341.0,53809.0,106782.0,5397.0,18822.0,2.65,13780.0
75%,38.0,86641.75,89604.0,175232.0,9368.0,33971.75,2.95,54447.0
max,70.5,4081698.0,4468707.0,8550405.0,156961.0,3212500.0,4.98,3835726.0
Unique,180.0,593.0,594.0,594.0,577.0,587.0,161.0,2785.0
Missing,0.0,3.0,3.0,0.0,13.0,13.0,16.0,0.0


'Columns with missing values:'

'Male Population, Female Population, Number of Veterans, Foreign-born, Average Household Size'

'Columns with less than 10 unique values:'

''

'Data Quantification Done, time elapsed is    0.16 sec'

**Summary:**

* All columns contain integer values with the exception of `Median Age` and `Average Household Size` which are floating point values
* Only a small number of values are missing; the dataset seems rather complete

#### Analysis of non-numeric columns

The demographic dataset contains some string data which will be analyzed here.

In [12]:
summarize_data(dem_data_df, ['object'])

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 2891 rows and 4 columns. Godspeed!'

'Quantifying NON-NUMERIC data types in columns:'

'City, State, State Code, Race'

Unnamed: 0,City,State,State Code,Race
Datatype,object,object,object,object
Lines,2891,2891,2891,2891
Non-Null,2891,2891,2891,2891
,0,0,0,0
Fill-%,100,100,100,100
Unique,567,49,49,5
Uniq-%,19.6126,1.69492,1.69492,0.172951


'Columns with missing values:'

''

'Columns with less than 10 unique values:'

'Race'

"Unique values in column 'Race': "

'Hispanic or Latino, White, Asian, Black or African-American, American Indian and Alaska Native'

'Data Quantification Done, time elapsed is    0.13 sec'

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `City` | can be used to link data to I94 immigration dataset | Apply as key |
| `State Code` | can be used to link data to I94 immigration dataset | Apply as key |
| `Race` | Only 5 different values exist indicating this column as a category | Add category/ dimension |
| `all` | No missing values | |
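
The recommendation for `Race` could be sketched as follows: store the column as a pandas categorical and derive a small dimension table from its categories. The rows below are made up for illustration, and the names `race_dim` and `race_id` are hypothetical, not part of the project code.

```python
import pandas as pd

# Made-up rows using column names and Race values reported above
sample_df = pd.DataFrame({
    "City": ["Newark", "Newark", "Anchorage"],
    "State Code": ["NJ", "NJ", "AK"],
    "Race": ["White", "Asian", "White"],
    "Count": [100, 50, 70],
})

# Store the low-cardinality column as a categorical
sample_df["Race"] = sample_df["Race"].astype("category")

# Dimension table: one row per category, keyed by the category code
race_dim = pd.DataFrame({
    "race_id": range(len(sample_df["Race"].cat.categories)),
    "race": sample_df["Race"].cat.categories,
})

# The integer code of each row links back to the dimension table
sample_df["race_id"] = sample_df["Race"].cat.codes

print(race_dim)
print(sample_df)
```

Category codes depend on the set of values present, so a production pipeline would probably assign stable keys instead of relying on them.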

#### Dataset summary and conclusion

**Summary:**

* The demographic dataset appears to be of good quality with few missing values
* The dataset is small, with a file size of only 0.24 MB
* It can be linked to the immigration dataset, for instance to research relations between demographics and travelling patterns

**Conclusion**: The dataset on U.S. city demographics is considered **in scope**

### Data Description Section 1 - Airport Code Table

#### Airport Code Table Data Description

A list of airports from [datahub.io](https://datahub.io/core/airport-codes#data) with various datapoints was provided for this project.

#### Airport Code Table Data Gathering and first read

Again, we use the pandas CSV reader to import the complete file.

In [13]:
fname = 'airport-codes_csv.csv'
airport_df = read_csv_print(fname, ',')

'START reading CSV file airport-codes_csv.csv of (Filesize:     5.7 Mb)'

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


'Done. Operation took    0.51 seconds'

#### Documentation Analysis

According to the documentation found online, the list contains the following data:
* either IATA airport codes consisting of 3 letters
* or ICAO codes consisting of 4 letters

The list contains over 50,000 entries for airports and adds information for each airport such as:
* additional codes like local codes
* location information such as city, country and geo-coordinates
* elevation of the airport above sea-level
* type of airport (is it Newark or just a helipad?)
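
In the preview above, the geo-coordinates arrive as a single string per airport. A minimal sketch of splitting them into numeric columns is shown below. It assumes the order is "longitude, latitude", which fits the US locations in the preview but is not confirmed by the documentation quoted here.

```python
import pandas as pd

# Two coordinate strings copied from the preview above
sample_df = pd.DataFrame({
    "ident": ["00A", "00AK"],
    "coordinates": [
        "-74.93360137939453, 40.07080078125",
        "-151.695999146, 59.94919968",
    ],
})

# Assumption: the string holds "longitude, latitude" in that order
parts = sample_df["coordinates"].str.split(",", expand=True)
sample_df["longitude"] = parts[0].str.strip().astype(float)
sample_df["latitude"] = parts[1].str.strip().astype(float)

print(sample_df[["ident", "longitude", "latitude"]])
```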

#### Analysis of numeric columns

`elevation_ft` is the only numeric field in the dataset.

In [14]:
summarize_data(airport_df, 'numbers')

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 55075 rows and 1 columns. Godspeed!'

'Quantifying NUMERIC data types in columns:'

'elevation_ft'

Unnamed: 0,elevation_ft
count,48069.0
mean,1240.789677
std,1602.363459
min,-1266.0
25%,205.0
50%,718.0
75%,1497.0
max,22000.0
Unique,5449.0
Missing,7006.0


'Columns with missing values:'

'elevation_ft'

'Columns with less than 10 unique values:'

''

'Data Quantification Done, time elapsed is   0.099 sec'

#### Analysis of non-numeric columns

In [15]:
summarize_data(airport_df, 'objects')

'Running Data Quantifier with parameter: '

' and example threshhold is '

'The dataframe has 55075 rows and 11 columns. Godspeed!'

'Quantifying NON-NUMERIC data types in columns:'

'ident, type, name, continent, iso_country, iso_region, municipality, gps_code, iata_code, local_code, coordinates'

Unnamed: 0,ident,type,name,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
Datatype,object,object,object,object,object,object,object,object,object,object,object
Lines,55075,55075,55075,55075,55075,55075,55075,55075,55075,55075,55075
Non-Null,55075,55075,55075,27356,54828,55075,49399,41030,9189,28686,55075
,0,0,0,27719,247,0,5676,14045,45886,26389,0
Fill-%,100,100,100,49.6704,99.5515,100,89.6941,74.4984,16.6845,52.0853,100
Unique,55075,7,52144,6,243,2810,27133,40850,9042,27436,54874
Uniq-%,100,0.0127099,94.6782,0.0108942,0.441217,5.10213,49.2655,74.1716,16.4176,49.8157,99.635


'Columns with missing values:'

'continent, iso_country, municipality, gps_code, iata_code, local_code'

'Columns with less than 10 unique values:'

'type, continent'

"Unique values in column 'type': "

'heliport, small_airport, closed, seaplane_base, balloonport, medium_airport, large_airport'

"Unique values in column 'continent': "

'OC, AF, AN, EU, AS, SA'

'Data Quantification Done, time elapsed is    0.68 sec'

**Summary on non-numeric columns:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `continent, iso_country, municipality, gps_code, iata_code, local_code` | All fields would be suitable analytics dimensions, but the `*_code` columns show a high number of empty cells | Analyze possible hidden duplicates (e.g. by comparing location data) to improve data quality |
| `type, continent` | While the `type` column is completely filled, `continent` has about 50% missing values | Use column `iso_country` to update continents |
| `iata_code` | High number of missing values | Still attempt to match to column `i94port` of the immigration dataset |
| `iso_region, municipality` | Possibly matching to `i94addr` | Remove prefix "US-" and join tables |
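
Before filling `continent` from `iso_country`, one possible cause of the gaps is worth checking. The list of unique continent values above contains no code for North America, and the US airports in the preview show an empty `continent`. By default, `pandas.read_csv` treats the string `NA` as a missing value, so a continent code `NA` (and likewise the ISO country code `NA` for Namibia) would be read as `NaN`. The sketch below uses made-up inline rows rather than the project file, and it is an assumption that the `read_csv_print` helper relies on pandas' default missing-value handling.

```python
import io

import pandas as pd

# Made-up rows: one with continent "NA", one with country code "NA"
csv_text = (
    "ident,continent,iso_country\n"
    "00A,NA,US\n"
    "XX01,AF,NA\n"
)

# Default behaviour: the literal string "NA" is parsed as missing
default_df = pd.read_csv(io.StringIO(csv_text))

# Treat only empty cells as missing, so "NA" survives as a regular string
fixed_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print(default_df.isna().sum())
print(fixed_df.isna().sum())
```

If this turns out to be the cause, most of the missing `continent` values and the missing `iso_country` values might be recovered at read time, leaving less to derive afterwards.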

#### Dataset conclusion on airport codes

**Summary:**

* The dataset offers various analytical dimensions e.g.
    * the `type` column adds a bit of detail to airports immigrants use
    * the `coordinates` column may be used to create geo-visualizations
    * `name` and `iata_code` can be used to enrich available data points
* The dataset should be considered in scope since it is possible to link it with immigration data using one of two columns.
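
Linking via `iso_region` could be sketched as shown below. The airport rows are taken from the preview above, while the immigration rows and the `record_id` column are made up. It is also an assumption that `i94addr` holds two-letter U.S. state codes.

```python
import pandas as pd

# Subset of the airport preview shown earlier
airports = pd.DataFrame({
    "ident": ["00A", "00AK"],
    "iso_region": ["US-PA", "US-AK"],
})

# Made-up immigration rows; record_id is a hypothetical identifier
immigration = pd.DataFrame({
    "record_id": [1, 2, 3],
    "i94addr": ["PA", "AK", "NY"],
})

# Remove the leading "US-" so both tables share the same state code format
airports["state_code"] = airports["iso_region"].str.replace(r"^US-", "", regex=True)

# A left join keeps immigration rows that have no matching airport region
joined = immigration.merge(
    airports, how="left", left_on="i94addr", right_on="state_code"
)

print(joined)
```

In the full dataset, a state contains many airports, so a join on the state code alone multiplies rows. Matching on `iata_code` and `i94port`, where available, would be the more selective key.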

## Step 1 Summary and resulting project scope

To analyze the immigration dataset, 3 additional datasets were suggested as possible analysis dimensions and data sources: world temperature data, U.S. city demographic data and detailed information about airports.

Based on the sections above, the following scope decisions were made with regard to the data and its **analytical use case**:

* **I94 Immigration data** is considered **in scope** regarding the following analytical tasks:
    * Develop a scalable automated extraction procedure using Spark Data Lake
    * Load and Transform the data into fact and dimension tables
    * Develop Airflow routines to manage the process
* **Airport Codes** are considered **in scope** and will be used to
    * Enrich the immigration dataset with complete and updated values in dimension tables
* **Demographic data** is considered **in scope** and will be used to
    * Add demographic information per location


The **world temperature dataset** was moved out of the project scope since it does not contain temperature data from the available immigration dataset's time period.

The airport codes and demographic data can be linked with the immigration data via location names or codes. Both are therefore considered in scope of this project.

While both of these datasets are small (5.7 and 0.24 MB), the immigration dataset is comparatively large. For production use, a scalable import tool is required.

For many columns, quality characteristics were identified, such as candidate categorical data types, fill methods for missing values and key fields for relationships. These recommendations will be implemented in Step 4 of this project.

Step 1 of this project created a more technical understanding of the data and its quality. In Step 2, some exploratory analysis will be performed to learn more about the required data cleaning actions.

---