# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import logging
import sys

# Logging
logging.basicConfig(
    level=logging.ERROR,
    format='%(asctime)s %(levelname)s \t %(message)s ',
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('log')

# Improve view
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


In [2]:
# Function definitions

def quantify_data(df, type_choice=None, examples=10):
    """Print descriptive statistics, missing-value counts and unique-value counts.

    df          -- DataFrame to analyse
    type_choice -- list of 'all', 'numbers' and/or 'object' (default: ['all'])
    examples    -- columns with at most this many unique values have them printed
    """
    # If type_choice is not set, analyse both numeric and string data;
    # otherwise only the data types listed in type_choice are analysed
    if type_choice is None:
        type_choice = ['all']
    print('Running Data Quantifier with parameter: ', ', '.join(type_choice),
          ' and example threshhold is ', examples)
    
    # Analysis section
    if (('all' in type_choice) or ('numbers' in type_choice)):
        # NUMERIC DATA ANALYSIS
        sub_df = df.select_dtypes(exclude=['object'])
        print('\nQuantifying NUMERIC data types in columns:\n',  ', '.join(sub_df.columns), '\n')
        # Get descriptive statistics
        stat_df = sub_df.describe()
        # Count missing values per column
        miss_df = pd.DataFrame.from_dict({'Missing': sub_df.isna().sum()})
        #miss_df = miss_df['Missing'].astype(int)
        #mis_val_cols = miss_df.loc[miss_df['Missing'] > 0].columns
        mis_val_cols = miss_df[miss_df > 0].dropna().index
        # Count unique values per column
        uniq_df = pd.DataFrame.from_dict({'Unique': sub_df.nunique()})
        #uniq_df = uniq_df['Unique'].astype(int)
        # Get list of example values for columns which have less than x unique values
        uni_val_cols = uniq_df[uniq_df <= examples].dropna().index
        uniq_df = uniq_df.transpose()
        miss_df = miss_df.transpose()
        stat_df = pd.concat([stat_df, uniq_df, miss_df])
        display(stat_df)
        print('Columns with missing values: ', ','.join(mis_val_cols), '\n')
        for unique_value_column in uni_val_cols:
            unique_values = df[unique_value_column].drop_duplicates()
            msg = 'Unique values in column \'{}\': \n'.format(unique_value_column)
            print(msg, unique_values.values, '\n')
        #print('Columns with missing values: ', ','.join(mis_val_cols))

    if (('all' in type_choice) or ('object' in type_choice)):
        # STRING DATA ANALYSIS
        sub_df = df.select_dtypes(include=['object'])
        print('\nQuantifying NON-NUMERIC data types in columns:\n',  ', '.join(sub_df.columns))
        stat_df = pd.DataFrame.from_dict(data=dict(sub_df.dtypes), orient='index', columns=['Datatype'])
        stat_df['Lines'] = len(df)
        stat_df['Non-Null'] = df.count()
        stat_df['NaN'] = df.isna().sum()
        stat_df['Fill-%'] = df.count() / len(df) *100
        stat_df['Unique'] = df.nunique()
        stat_df['Uniq-%'] = stat_df['Unique'] / stat_df['Lines'] *100
        mis_val_cols = list(stat_df.loc[stat_df['Fill-%'] < 100].index)
        uni_val_cols = list(stat_df.loc[stat_df['Unique'] <= examples].index)
        display(stat_df.transpose())
        print('Columns with missing values: ', ','.join(mis_val_cols), '\n')
        for unique_value_column in uni_val_cols:
            unique_values = df[unique_value_column].drop_duplicates()
            msg = 'Unique values in column \'{}\': \n'.format(unique_value_column)
            print(msg, unique_values.values, '\n')
    print("\n\nData Quantification Done\n\n")

# Step 1 - Scoping and Data Gathering
**Task: Scope the Project and Gather Data**

*Identify and gather the data you'll be using for your project (at least two sources and more than 1 million rows). See Project Resources for ideas of what data you can use.*

*Explain what end use cases you'd like to prepare the data for (e.g., analytics table, app back-end, source-of-truth database, etc.)*


## Step 1a - General Scope and Data Gathering Description
The Udacity provided datasets for the Capstone Project include:
* I94 Immigration data from 2016 provided by U.S. Customs and Border Protection agency
* World Temperature Data
* U.S. cities demographic data
* An airport code table

Each dataset has been collected at least once for assessment. The findings are included in the following chapters of this notebook, even if the dataset is not used in Step 2.

Regarding the scope itself the following findings are relevant:
* **I94 Immigration data** is considered **in scope** regarding the following analytical tasks:
    * Develop a scalable automated extraction procedure using Spark Data Lake
    * Load and Transform the data into fact and dimension tables
    * Develop Airflow routines to manage the process
* **Airport Codes** are considered **in scope** and will be used
    * to enrich the immigration dataset with complete and updated values
* **World Temperature data** is considered **out of scope** since no analytics questions for this dataset in conjunction with immigration data could be identified _and_ the datasets' time periods do not overlap
* **Demographic data** is considered

**Approach to describe and gather data**

Descriptions for each dataset will be given in the sections below. Each description shall include:
1. A first read of the dataset using Python and Pandas default methods
1. "First Impression" notes about the extracted data
1. Analysis of dataset documentation, enclosed data dictionaries, etc.
1. Findings about data meaning, quality, possible relationships and definitions for
    1. Numeric columns (including missing values, uniqueness and descriptive statistics)
    1. Non-numeric columns

## Step 1b - I94 Dataset of U.S. Customs and Border Protection department

### A - I94 Immigration Dataset Description
The dataset contains immigration data provided by US immigration authorities. The data is collected via form **I94** and covers people travelling to and from the US who are **neither United States citizens nor lawful permanent residents** of the US.

    “Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” (https://en.wikipedia.org/wiki/Form_I-94)

An overview of this dataset is also outlined [here](https://travel.trade.gov/research/programs/i94/description.asp)

Data files and formats:
- Data files are stored in the proprietary SAS sas7bdat format
- One folder exists per year
- One file exists per month (~500 MB)

Description file:
- A description file for the fields was included, named *I94_SAS_Labels_Descriptions.SAS*
- The file contains field descriptions for each column
- And it contains value constraints for some columns, namely: *i94cnty, i94port, i94mode, i94addr*

### B - I94 Immigration Data Data Gathering and first read

As Pandas has a method to import SAS data we will be using this mechanism. The following code will read a defined number of lines only due to performance reasons.

In [3]:
# Read in the data using the read_sas() method
sas_file = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
chunks = []
lines_imported = 0
max_lines = 6000     # Set the desired total number of lines here
for_lines = 2000     # Set the desired number of lines per chunk here

print('START reading SAS file ', sas_file)
# With chunksize set, read_sas() returns an iterator that yields the file in chunks
for chunk in pd.read_sas(sas_file, format='sas7bdat', encoding="ISO-8859-1", chunksize=for_lines):
    last_lines = lines_imported + 1
    lines_imported = lines_imported + len(chunk)
    chunks.append(chunk)
    print('\t\t\tImporting lines from {} to {} of total {} lines'.format(last_lines, lines_imported, max_lines))
    if lines_imported >= max_lines:
        print('STOP reading SAS files')
        break

# Combine the chunks once at the end (DataFrame.append was removed in pandas 2.0)
sas_df = pd.concat(chunks)
sas_df.head()

START reading SAS file  ../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat
			Importing lines from 1 to 2000 of total 6000 lines
			Importing lines from 2001 to 4000 of total 6000 lines
			Importing lines from 4001 to 6000 of total 6000 lines
STOP reading SAS files


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,37.0,2.0,1.0,,,,T,,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,25.0,3.0,1.0,20130811.0,SEO,,G,,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,55.0,2.0,1.0,20160401.0,,,T,O,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,28.0,2.0,1.0,20160401.0,,,O,O,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,4.0,2.0,1.0,20160401.0,,,O,O,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


**Notes and Findings on First Read:**
* (N) We are not importing everything here, since the files amount to about 6 GB in total
* (N) We use the `chunksize` parameter and break from the loop once 6,000 lines (three chunks of 2,000) have been read, which gives us a handy sample
* (F) In total 28 columns exist, 15 columns contain strings (object type) and 13 contain numbers (float64 type)
* At first sight one can already spot unfamiliar date columns (arrdate, depdate, etc.) with various datatypes
* Several rows have missing values
* Some columns contain obviously integer values but float64 was assigned
* Some categorical columns seem to exist
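
The float64 and categorical findings above could be addressed with a type cleanup along the following lines. This is only a sketch on a hypothetical three-row miniature of the sample (values copied from the first rows shown above); the choice of `int_cols` and of `visatype` as a categorical column are assumptions, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sample, using values from the first rows shown above
sample = pd.DataFrame({
    'cicid': [6.0, 7.0, 15.0],
    'i94mode': [np.nan, 1.0, 1.0],
    'i94visa': [2.0, 3.0, 2.0],
    'visatype': ['B2', 'F1', 'B2'],
})

# Columns assumed to hold whole numbers; 'Int64' is pandas' nullable integer type,
# so missing values survive the conversion
int_cols = ['cicid', 'i94mode', 'i94visa']
cleaned = sample.astype({col: 'Int64' for col in int_cols})

# Low-cardinality text columns can be stored as categoricals
cleaned['visatype'] = cleaned['visatype'].astype('category')

print(cleaned.dtypes)
```

The nullable `Int64` type is used instead of plain `int64` because columns such as `i94mode` contain missing values.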

### C - Documentation Analysis

The workspace contains a field description file for the dataset named `I94_SAS_Labels_Descriptions.SAS`

The file seems pretty well structured, so I wrote a quick parser to automatically check the description file (see [SAS-Description-Parser](https://r766466c839826xjupyterlnnfq3jud.udacity-student-workspaces.com/lab/tree/SAS-Description-Parser.ipynb) for further details).

**Definitions**

| **Variable name** | **Data Type** | **Description** |
|---------------|---------------|---------------|
| i94yr | float64 | 4 digit year |
| i94mon | float64 | Numeric month |
| i94cit | float64 | This format shows all the valid and invalid codes for processing |
| i94res | float64 | This format shows all the valid and invalid codes for processing |
| i94port | object | This format shows all the valid and invalid codes for processing |
| arrdate | float64 | Arrival date in the USA. It is a SAS date numeric field to which a permanent format has not been applied; a date format still has to be applied |
| i94mode | float64 | There are missing values as well as not reported (9) |
| i94addr | object | There are lots of invalid codes in this variable; the description file lists the valid ones |
| depdate | float64 | Departure date from the USA. It is a SAS date numeric field to which a permanent format has not been applied; a date format still has to be applied |
| i94bir | float64 | Age of Respondent in Years |
| i94visa | float64 | Visa codes collapsed into three categories: 1 = Business, 2 = Pleasure, 3 = Student |
| count | float64 | Used for summary statistics |
| dtadfile | object | Character Date Field |
| visapost | object | Department of State where Visa was issued |
| occup | object | Occupation that will be performed in U.S. |
| entdepa | object | Arrival Flag |
| entdepd | object | Departure Flag |
| entdepu | object | Update Flag |
| matflag | object | Match flag |
| biryear | float64 | 4 digit year of birth |
| dtaddto | object | Character Date Field |
| gender | object | Non |
| insnum | object | INS number |
| airline | object | Airline used to arrive in U.S. |
| admnum | float64 | Admission Number |
| fltno | object | Flight number of Airline used to arrive in U.S. |
| visatype | object | Class of admission legally admitting the non |
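
The definitions of `arrdate` and `depdate` ask for a date format to be applied. A minimal sketch of such a conversion is given here, assuming the usual SAS convention that date values count days since 1960-01-01; the small `dates` frame and the `_dt` column suffix are made up for illustration, with values copied from the first rows of the sample.

```python
import numpy as np
import pandas as pd

# arrdate / depdate values taken from the first rows of the sample
dates = pd.DataFrame({
    'arrdate': [20573.0, 20551.0, 20545.0],
    'depdate': [np.nan, np.nan, 20691.0],
})

# Assumption: SAS date values count days since 1960-01-01
SAS_EPOCH = pd.Timestamp('1960-01-01')

for col in ['arrdate', 'depdate']:
    # Missing values (NaN) are converted to NaT
    dates[col + '_dt'] = pd.to_datetime(dates[col], unit='D', origin=SAS_EPOCH)

print(dates)
```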

**Findings on value constraints**

Columns `i94cnty, i94port, i94mode, i94addr` have value constraints (lists with allowed entry values) which are outlined here:
* `i94cnty` contains country short codes and their corresponding country names
* `i94port` contains port/airport codes from various cities
    * There doesn't seem to be a specific selection criterion
    * Although most of the codes are cities in the US, we also see city codes from Europe and Asia
* `i94mode` is a code for the way of travelling (by Air, by Sea or by Land) or unknown
* `i94addr` is a code for the state in which the immigrant's temporary address is located (aka "First Intended Address")
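
A value constraint can be applied as a simple lookup. The sketch below uses the three `i94visa` categories from the definitions table, because those are the only code/label pairs spelled out in this notebook; the `visits` frame is hypothetical, and the value 5.0 is invented to stand in for a code outside the allowed list.

```python
import pandas as pd

# Categories as listed in the definitions table for i94visa
I94VISA_LABELS = {1: 'Business', 2: 'Pleasure', 3: 'Student'}

# Hypothetical sample; 5.0 stands in for a code outside the allowed list
visits = pd.DataFrame({'i94visa': [2.0, 3.0, 1.0, 5.0]})

# map() yields NaN for codes missing from the lookup, which flags violations
visits['visa_category'] = visits['i94visa'].map(I94VISA_LABELS)
invalid = visits[visits['visa_category'].isna()]

print(visits)
print('Rows violating the value constraint:', len(invalid))
```

The same pattern would apply to `i94cnty`, `i94port`, `i94mode` and `i94addr` once their allowed values have been parsed from the description file.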


### D - Analysis of numeric columns
The Pandas describe() function creates a basic set of descriptive statistics for each numeric column in the data frame.

In [18]:
quantify_data(sas_df, ['numbers'])

Running Data Quantifier with parameter:  numbers  and example threshhold is  10

Quantifying NUMERIC data types in columns:
 cicid, i94yr, i94mon, i94cit, i94res, arrdate, i94mode, depdate, i94bir, i94visa, count, biryear, admnum 



Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,arrdate,i94mode,depdate,i94bir,i94visa,count,biryear,admnum
count,6000.0,6000.0,6000.0,6000.0,6000.0,6000.0,5999.0,5821.0,6000.0,6000.0,6000.0,6000.0,6000.0
mean,3516.653833,2016.0,4.0,107.9335,111.358833,20545.005667,1.004501,20556.275039,39.435167,1.886,1.0,1976.564834,59268510000.0
std,2053.802328,0.0,0.0,8.384875,34.445892,0.369672,0.066942,15.564844,17.166816,0.327651,0.0,17.166816,13751950000.0
min,6.0,2016.0,4.0,101.0,101.0,20545.0,1.0,20546.0,0.0,1.0,1.0,1929.0,664491000.0
25%,1678.75,2016.0,4.0,104.0,104.0,20545.0,1.0,20550.0,27.0,2.0,1.0,1964.0,55426680000.0
50%,3448.5,2016.0,4.0,108.0,108.0,20545.0,1.0,20552.0,40.0,2.0,1.0,1976.000003,55443320000.0
75%,5354.25,2016.0,4.0,111.0,111.0,20545.0,1.0,20558.0,52.0,2.0,1.0,1989.000003,55457830000.0
max,6956.0,2016.0,4.0,692.0,692.0,20573.0,2.0,20715.0,87.0,3.0,1.0,2016.000003,92516690000.0
Unique,6000.0,1.0,1.0,12.0,48.0,3.0,2.0,124.0,86.0,3.0,1.0,169.0,6000.0
Missing,0.0,0.0,0.0,0.0,0.0,0.0,1.0,179.0,0.0,0.0,0.0,0.0,0.0


Columns with missing values:  i94mode,depdate 

Unique values in column 'i94yr': 
 [ 2016.] 

Unique values in column 'i94mon': 
 [ 4.] 

Unique values in column 'arrdate': 
 [ 20573.  20551.  20545.] 

Unique values in column 'i94mode': 
 [ nan   1.   2.] 

Unique values in column 'i94visa': 
 [ 2.  3.  1.] 

Unique values in column 'count': 
 [ 1.] 



Data Quantification Done




**Summary on numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `cicid` | is unique for all 6,000 imported lines (check `len(sas_df['cicid'].unique())`) and appears to be the primary key for each record | |
| | The following columns appear to indicate datetime related values: | |
| `i94yr` | indicates the year the I94 form was filled in; `i94mon` indicates the month | |
| `arrdate` | is the immigrant's arrival date | |
| `depdate` | is the date of the immigrant's (planned) departure | |
| `dtadfile` | is the date on which the form was entered into the database | |
| `dtaddto` | is the date to which the immigrant is admitted to stay in the US | |
| `i94mode` | has already been identified as a categorical variable; the numbers are codes indicating whether the immigrant travelled by Land, Air or Sea (or unknown) | |
| `i94visa` | was apparently not identified correctly by my parser; it has value constraints (1 = Business, 2 = Pleasure, 3 = Student) | |
| `i94cit` and `i94res` | are again not numeric quantities but codes for the immigrant's countries of citizenship (`cit`) and residence (`res`) | |
| `admnum` | is the admission number | |
| `i94bir` | appears to be the immigrant's age at the time of admission (in other words, the difference between `i94yr` and `biryear`) | |
| `biryear` | is the immigrant's year of birth | |
| `count` | is used for summary statistics according to the description | |
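
The assumption that `i94bir` is the difference between `i94yr` and `biryear` can be checked directly. The following sketch does so on a hypothetical three-row frame with values copied from the first rows of the sample. A tolerance of half a year is used instead of exact equality, since the statistics above show small floating point noise in `biryear` (for example 1976.000003).

```python
import pandas as pd

# Values taken from the first rows of the sample
people = pd.DataFrame({
    'i94yr':   [2016.0, 2016.0, 2016.0],
    'biryear': [1979.0, 1991.0, 1961.0],
    'i94bir':  [37.0, 25.0, 55.0],
})

# Age derived from the two year columns
derived_age = people['i94yr'] - people['biryear']

# Rows where the stored age deviates from the derived age by more than half a year
mismatches = people[(derived_age - people['i94bir']).abs() > 0.5]
print('Rows where i94bir differs from i94yr - biryear:', len(mismatches))
```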

### E - Analysis of non-numeric columns
Measuring the number of NaN entries and unique values

In [13]:
quantify_data(sas_df, ['object'])




**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| several | columns have missing values (i94addr,dtadfile,visapost,occup,entdepd,entdepu,matflag,dtaddto,gender,insnum,airline,fltno) | |


### F - Dataset conclusion

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

## Step 1c - World Temperature Data

### A - World Temperature Data Description

Lorem Ipsum

### B - World Temperature Data Gathering and first read

As Pandas has a method to import CSV data, we will be using this mechanism. In contrast to the SAS import above, the following code reads the complete file.

In [15]:
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
csv_df = pd.read_csv(fname)

### C - Documentation Analysis

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### D - Analysis of numeric columns

Lorem Ipsum

In [16]:
quantify_data(csv_df, ['numbers'])

Running Data Quantifier with parameter:  numbers  and example threshhold is  10

Quantifying NUMERIC data types in columns:
 AverageTemperature, AverageTemperatureUncertainty 



Unnamed: 0,AverageTemperature,AverageTemperatureUncertainty
count,8235082.0,8235082.0
mean,16.72743,1.028575
std,10.35344,1.129733
min,-42.704,0.034
25%,10.299,0.337
50%,18.831,0.591
75%,25.21,1.349
max,39.651,15.396
Unique,111994.0,10902.0
Missing,364130.0,364130.0


Columns with missing values:  AverageTemperature,AverageTemperatureUncertainty 



Data Quantification Done




**Summary on numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### E - Analysis of non-numeric columns

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### F - Dataset conclusion

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

## Step 1d - U.S. City Demographic Data

### A - U.S. City Demographic Data Description

Lorem Ipsum

### B - U.S. City Demographic Data Gathering and first read

### C - Documentation Analysis

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### D - Analysis of numeric columns

Lorem Ipsum

**Summary on numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### E - Analysis of non-numeric columns

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### F - Dataset conclusion

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

## Step 1e - Airport Code Table

### A - Airport Code Table Description

Lorem Ipsum

### B - Airport Code Table Gathering and first read

### C - Documentation Analysis

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### D - Analysis of numeric columns

Lorem Ipsum

**Summary on numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### E - Analysis of non-numeric columns

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### F - Dataset conclusion

Lorem Ipsum

**Summary on non-numeric data:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

In [17]:
from pyspark.sql import SparkSession

# Start a Spark session with the spark-sas7bdat package for reading SAS files
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [11]:
# Write to parquet and read the data back from there
df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
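
Two of the checks listed above (unique key and source/count) could be sketched as follows. This is an illustration only: the helper names `check_unique_key` and `check_row_count` and the small `fact` table are hypothetical, and `cicid` is assumed to be the key because it appeared to be unique in the sample.

```python
import pandas as pd

def check_unique_key(df, key):
    """Return True if column `key` has no missing and no duplicated values."""
    return bool(df[key].notna().all() and df[key].is_unique)

def check_row_count(df, expected_rows):
    """Return True if the table holds exactly the number of rows read from the source."""
    return len(df) == expected_rows

# Hypothetical fact table; the last row repeats a key on purpose
fact = pd.DataFrame({'cicid': [6, 7, 15, 15], 'i94visa': [2, 3, 2, 2]})

results = {
    'unique_key': check_unique_key(fact, 'cicid'),
    'row_count': check_row_count(fact, expected_rows=4),
}
print(results)
```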
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.