# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

#### Imports and Installs

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np
import logging
import sys
from nb_helpers import summarize_data, get_sas_definitions, read_sas_in_chunks

# Logging
logging.basicConfig(
    level=logging.ERROR,
    format='%(asctime)s %(levelname)s \t %(message)s ',
    datefmt='%Y-%m-%d %H:%M:%S',
    stream=sys.stdout,
)
log = logging.getLogger('log')

# Improve view
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)


# Function definitions

def quantify_data(df, type_choice=None, examples=10):
    """Print summary statistics for the numeric and/or non-numeric columns of df.

    type_choice: list containing 'all', 'numbers' and/or 'object' (default: ['all'])
    examples: columns with at most this many unique values get their values printed
    """
    # If type_choice is not set, analyze both numeric and string data
    if type_choice is None:
        type_choice = ['all']
    print('Running Data Quantifier with parameter: ', ', '.join(type_choice),
          ' and example threshold is ', examples)
    
    # Analysis section
    if (('all' in type_choice) or ('numbers' in type_choice)):
        # NUMERIC DATA ANALYSIS
        sub_df = df.select_dtypes(exclude=['object'])
        print('\nQuantifying NUMERIC data types in columns:\n',  ', '.join(sub_df.columns), '\n')
        # Get descriptive statistics
        stat_df = sub_df.describe()
        # Count missing values per column
        miss_df = pd.DataFrame.from_dict({'Missing': sub_df.isna().sum()})
        #miss_df = miss_df['Missing'].astype(int)
        #mis_val_cols = miss_df.loc[miss_df['Missing'] > 0].columns
        mis_val_cols = miss_df[miss_df > 0].dropna().index
        # Count unique values per column
        uniq_df = pd.DataFrame.from_dict({'Unique': sub_df.nunique()})
        #uniq_df = uniq_df['Unique'].astype(int)
        # Get list of example values for columns which have less than x unique values
        uni_val_cols = uniq_df[uniq_df <= examples].dropna().index
        uniq_df = uniq_df.transpose()
        miss_df = miss_df.transpose()
        stat_df = pd.concat([stat_df, uniq_df, miss_df])
        display(stat_df)
        print('Columns with missing values: ', ','.join(mis_val_cols), '\n')
        for unique_value_column in uni_val_cols:
            unique_values = df[unique_value_column].drop_duplicates()
            msg = 'Unique values in column \'{}\': \n'.format(unique_value_column)
            print(msg, unique_values.values, '\n')
        #print('Columns with missing values: ', ','.join(mis_val_cols))

    if (('all' in type_choice) or ('object' in type_choice)):
        # STRING DATA ANALYSIS
        sub_df = df.select_dtypes(exclude=['float64'])
        print('\nQuantifying NON-NUMERIC data types in columns:\n',  ', '.join(sub_df.columns))
        stat_df = pd.DataFrame.from_dict(data=dict(sub_df.dtypes), orient='index', columns=['Datatype'])
        stat_df['Lines'] = len(df)
        stat_df['Non-Null'] = df.count()
        stat_df['NaN'] = df.isna().sum()
        stat_df['Fill-%'] = df.count() / len(df) *100
        stat_df['Unique'] = df.nunique()
        stat_df['Uniq-%'] = stat_df['Unique'] / stat_df['Lines'] *100
        mis_val_cols = list(stat_df.loc[stat_df['Fill-%'] < 100].index)
        uni_val_cols = list(stat_df.loc[stat_df['Unique'] <= examples].index)
        display(stat_df.transpose())
        print('Columns with missing values: ', ','.join(mis_val_cols), '\n')
        for unique_value_column in uni_val_cols:
            unique_values = df[unique_value_column].drop_duplicates()
            msg = 'Unique values in column \'{}\': \n'.format(unique_value_column)
            print(msg, unique_values.values, '\n')
    print("\n\nData Quantification Done\n\n")

# Step 1 - Scoping and Data Gathering
**Task: Scope the Project and Gather Data**

*Identify and gather the data you'll be using for your project (at least two sources and more than 1 million rows). See Project Resources for ideas of what data you can use.*

*Explain what end use cases you'd like to prepare the data for (e.g., analytics table, app back-end, source-of-truth database, etc.)*


## Step 1a - General Scope and Data Gathering Description
The datasets provided by Udacity for the Capstone Project include:
* I94 Immigration data from 2016 provided by the U.S. Customs and Border Protection agency
* World Temperature Data
* U.S. cities demographic data
* An airport code table

Each dataset has been collected at least once for assessment. The findings are included in the following chapters of this notebook, even if the dataset is not used in Step 2.

Regarding the scope itself the following findings are relevant:
* **I94 Immigration data** is considered **in scope** regarding the following analytical tasks:
    * Develop a scalable automated extraction procedure using Spark Data Lake
    * Load and Transform the data into fact and dimension tables
    * Develop Airflow routines to manage the process
* **Airport Codes** are considered **in scope** and will be used
    * to enrich the immigration dataset with complete and updated values
* **World Temperature data** is considered **out of scope** since no analytics questions for this dataset in conjunction with immigration data could be identified _and_ the datasets' time periods do not overlap
* **Demographic data** is considered

**Approach to describe and gather data**

Descriptions for each dataset will be given in the sections below. Each description shall include:
1. A first read of the dataset using Python and Pandas default methods
1. "First Impression" notes about the extracted data
1. Analysis of dataset documentation, enclosed data dictionaries, etc.
1. Findings about data meaning, quality, possible relationships and definitions for
    1. Numeric columns (including missing values, uniqueness and descriptive statistics)
    1. Non-numeric columns

## Step 1b - I94 Dataset of U.S. Customs and Border Protection department

### A - I94 Immigration Dataset Description
The dataset contains immigration data provided by US immigration authorities. The data is collected via form **I-94** and covers arrivals to and departures from the US of people who are **neither United States citizens nor lawful permanent residents** of the US.

    “Form I-94, the Arrival-Departure Record Card, is a form used by the U.S. Customs and Border Protection (CBP) intended to keep track of the arrival and departure to/from the United States of people who are not United States citizens or lawful permanent residents (with the exception of those who are entering using the Visa Waiver Program or Compact of Free Association, using Border Crossing Cards, re-entering via automatic visa revalidation, or entering temporarily as crew members)” (https://en.wikipedia.org/wiki/Form_I-94)

An overview of this dataset is also outlined [here](https://travel.trade.gov/research/programs/i94/description.asp).

Data files and formats:
- Data files are stored in SAS's proprietary sas7bdat format
- One folder exists per year
- One file exists per month (~500 MB)

Description file:
- A description file for the fields was included, named *I94_SAS_Labels_Descriptions.SAS*
- The file contains field descriptions for each column
- And it contains value constraints for some columns, namely: *i94cnty, i94port, i94mode, i94addr*

### B - I94 Immigration Data Gathering and first read

Pandas can import SAS data directly, so we use that mechanism. For performance reasons, the following code reads only a defined number of lines.

In [2]:
# Read in the data using a wrapper for the read_sas() method
# Configuration
sas_file = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
sas_file_format = 'sas7bdat'
max_lines = 6000     # Total number of lines to read
for_lines = 2000     # Number of lines to read per chunk

# Assumption: read_sas_in_chunks (imported from nb_helpers above) is the
# current name of the former sas_chunk_reader and takes the same arguments
sas_df = read_sas_in_chunks(sas_file, sas_file_format, max_lines, for_lines)
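The `nb_helpers` module is not reproduced in this notebook, so the chunked read can only be sketched here. The following is a minimal sketch, not the helper's actual implementation: `take_rows` and `read_sas_head` are hypothetical names, and the only pandas API used is `pd.read_sas` with its `format` and `chunksize` parameters.

```python
import pandas as pd


def take_rows(chunks, max_lines):
    """Concatenate DataFrame chunks until max_lines rows are collected.

    Hypothetical helper: `chunks` is any iterable of DataFrames.
    """
    collected, total = [], 0
    for chunk in chunks:
        # Keep only as many rows as are still needed
        collected.append(chunk.iloc[:max_lines - total])
        total += len(collected[-1])
        if total >= max_lines:
            break
    if not collected:
        return pd.DataFrame()
    return pd.concat(collected, ignore_index=True)


def read_sas_head(path, file_format, max_lines, chunk_lines):
    """Read at most max_lines rows of a SAS file, chunk_lines rows at a time."""
    # With chunksize set, pd.read_sas returns an iterable reader instead of
    # loading the whole file into memory
    reader = pd.read_sas(path, format=file_format, chunksize=chunk_lines)
    try:
        return take_rows(reader, max_lines)
    finally:
        reader.close()
```

Stopping the loop as soon as enough rows are collected avoids reading the rest of the file, which is the point of the chunked approach.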

**Summary on first read of data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| all | (N) We are not importing everything here, since the files amount to about 6GB in total | |
| |  (N) Using "chunksize" parameter and then breaking from the loop, so that we have handy 2,000 lines ||
| all | In total 28 columns exist, 15 columns contain strings (object type) and 13 contain numbers (float64 type) | |
| all | At first sight one can already spot unfamiliar date columns (arrdate, depdate, etc.) with various datatypes | |
| all | Several rows have missing values | |
| all | Some columns contain obviously integer values but float64 was assigned | |
| all | Some categorical columns seem to exist | |


### C - Documentation and data description analysis


**Documentation sources**

A link was provided to the data source at the Visitor Arrivals Program: [LINK](https://travel.trade.gov/research/reports/i94/historical/2016.html)

The following related pages were also analyzed:
* Approved listing of [countries by world region](https://travel.trade.gov/research/programs/i94/1999-2020%20Region%20Dictionary.xlsx), stored as an Excel file
    * Recommendation: use as source to validate data
* A [Q&A section](https://travel.trade.gov/research/programs/ifs/qamythbuster.asp) which contains indications of the completeness and accuracy of the data
    * These should be considered before making assumptions or drawing conclusions from the data
* Detailed descriptions of the data collection [methodology](https://travel.trade.gov/research/programs/ifs/description.asp)

**Parsing the description file / data dictionary**

The workspace contains a field description file for the dataset named `I94_SAS_Labels_Descriptions.SAS`

The file seems pretty well structured, so I wrote a quick parser to automatically check the description file (see [SAS-Description-Parser](https://r766466c839826xjupyterlnnfq3jud.udacity-student-workspaces.com/lab/tree/SAS-Description-Parser.ipynb) for further details).
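The idea of extracting value constraints from such a file can be sketched as follows. This is a hedged illustration, not the parser from the linked notebook: `parse_value_labels` is a hypothetical name, and the sample text assumes that constraints appear as `code = 'label'` lines, which should be checked against the real description file.

```python
import re

# Matches lines such as:   1 = 'Business'   or   'NYC' = 'NEW YORK, NY'
VALUE_LINE = re.compile(r"^\s*'?([^'=\s][^'=]*?)'?\s*=\s*'([^']*)'")


def parse_value_labels(text):
    """Collect code -> label pairs from lines of the form  code = 'label'."""
    labels = {}
    for line in text.splitlines():
        match = VALUE_LINE.match(line)
        if match:
            labels[match.group(1).strip()] = match.group(2).strip()
    return labels


# Assumed layout, for illustration only; the labels are those listed for i94visa
sample = """
/* I94VISA - Visa codes collapsed into three categories */
value i94visa
   1 = 'Business'
   2 = 'Pleasure'
   3 = 'Student'
;
"""
labels = parse_value_labels(sample)
```

Codes are kept as strings here, so numeric codes would still need a cast before they are compared with the float64 columns.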

**Definitions**

| **Variable name** | **Data Type** | **Description** |
|---------------|---------------|---------------|
| i94yr | float64 | 4 digit year |
| i94mon | float64 | Numeric month |
| i94cit | float64 | This format shows all the valid and invalid codes for processing |
| i94res | float64 | This format shows all the valid and invalid codes for processing |
| i94port | object | This format shows all the valid and invalid codes for processing |
| arrdate | float64 | Arrival date in the USA. SAS date numeric field to which no permanent format has been applied; a date format must be applied |
| i94mode | float64 | There are missing values as well as not reported (9) |
| i94addr | object | Contains many invalid codes; the description file lists the valid values |
| depdate | float64 | Departure date from the USA. SAS date numeric field to which no permanent format has been applied; a date format must be applied |
| i94bir | float64 | Age of Respondent in Years |
| i94visa | float64 | Visa codes collapsed into three categories: 1 = Business, 2 = Pleasure, 3 = Student |
| count | float64 | Used for summary statistics |
| dtadfile | object | Character Date Field |
| visapost | object | Department of State where Visa was issued |
| occup | object | Occupation that will be performed in U.S. |
| entdepa | object | Arrival Flag |
| entdepd | object | Departure Flag |
| entdepu | object | Update Flag |
| matflag | object | Match flag |
| biryear | float64 | 4 digit year of birth |
| dtaddto | object | Character Date Field |
| gender | object | Non |
| insnum | object | INS number |
| airline | object | Airline used to arrive in U.S. |
| admnum | float64 | Admission Number |
| fltno | object | Flight number of Airline used to arrive in U.S. |
| visatype | object | Class of admission legally admitting the non |

**Summary on data documentation and descriptions:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `i94cnty, i94port, i94mode, i94addr` | have value constraints (lists of allowed entry values) | Change data types, validate content (see below) |
| `i94cnty` | contains country short codes and their corresponding country names | Change to category field |
| `i94port` | contains port/airport codes from various cities, apparently without specific selection criteria (most codes are US cities, but there are also city codes from Europe and Asia) | Change data type to string, validate against the airport table in airport-codes_csv.csv |
| `i94mode` | is a code for the mode of travel (by Air, by Sea or by Land) or unknown | Change to category field |
| `i94addr` | is a code for the state in which the immigrant's temporary address is located (aka "First Intended Address") | Change to category field |
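The action items "change to category" and "validate content" could look roughly like this. It is a minimal sketch under assumptions: `to_labelled_category` and `invalid_codes` are hypothetical helpers, and the example uses the three `i94visa` labels from the data dictionary because the full constraint lists are not reproduced in this notebook.

```python
import pandas as pd

# Value constraints for i94visa as stated in the data dictionary
I94VISA_LABELS = {1: 'Business', 2: 'Pleasure', 3: 'Student'}


def to_labelled_category(codes, labels):
    """Map numeric codes to a categorical column; unknown codes become NaN."""
    mapped = codes.map(labels.get)
    return pd.Series(
        pd.Categorical(mapped, categories=list(labels.values())),
        index=codes.index,
        name=codes.name,
    )


def invalid_codes(codes, labels):
    """Return the distinct non-missing codes that are not in the constraint list."""
    present = codes.dropna()
    return sorted(present[~present.isin(list(labels))].unique())


# Illustrative values only
sample = pd.Series([1.0, 2.0, 3.0, 5.0, float('nan')], name='i94visa')
visa_category = to_labelled_category(sample, I94VISA_LABELS)
unexpected = invalid_codes(sample, I94VISA_LABELS)
```

Listing the unexpected codes before the conversion shows how much data would silently turn into NaN.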


### D - Analysis of numeric columns
The Pandas describe() function creates a basic set of descriptive statistics for each numeric column in the data frame.

In [None]:
summarize_data(sas_df, ['numbers'])

**Summary on numeric data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `cicid` | is unique for all 2,000 lines (check `len(sas_df['cicid'].unique())`) and appears to be the primary key for each record | Change to int, use as primary key |
| | The following columns appear to hold datetime-related values: | |
| `i94yr`, `i94mon` | indicate the year and the month in which the I94 form was filled in | Change to int |
| `arrdate` | is the immigrant's arrival date | Change to datetime64 |
| `depdate` | is the date of the immigrant's (planned) departure | Change to datetime64 |
| `i94mode` | has already been identified as a category variable; the integers here are just codes indicating whether the immigrant travelled by Land, Air or Sea (or unknown) | Change to category |
| `i94visa` | was apparently not parsed correctly by my parser; it has value constraints (1 = Business, 2 = Pleasure, 3 = Student) | Change to category |
| `i94cit` and `i94res` | are again not numeric but indicate the immigrant's countries of citizenship ("cit") and residence ("res") | Change to category, validate data using value constraints |
| `admnum` | is the admission number | Use as key variable to connect several rows |
| `i94bir` | appears to be the immigrant's age at the time of admission (in other words, the difference between `i94yr` and `biryear`) | Change to int |
| `biryear` | is the immigrant's year of birth | Change to int |
| `count` | is for statistical purposes according to the description | Change to int |
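The two most frequent action items in the table, converting SAS date numerics to datetime64 and casting whole-number floats to int, can be sketched as below. SAS date values count days from 1960-01-01; the function names are hypothetical and the sample values are illustrative only.

```python
import pandas as pd

SAS_EPOCH = '1960-01-01'  # SAS date values count days from this date


def sas_date_to_datetime(sas_days):
    """Convert SAS date numerics (days since 1960-01-01) to datetime64; NaN becomes NaT."""
    return pd.to_datetime(sas_days, unit='D', origin=SAS_EPOCH)


def float_to_nullable_int(values):
    """Cast whole-number floats to pandas' nullable Int64, keeping missing values."""
    return values.astype('Int64')


# Illustrative values only
arrival = sas_date_to_datetime(pd.Series([0.0, 20545.0, float('nan')]))
year = float_to_nullable_int(pd.Series([2016.0, float('nan')]))
```

The nullable `Int64` dtype is used instead of plain `int` because several columns contain missing values, which a NumPy integer column cannot hold.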

### E - Analysis of non-numeric columns
Measuring the number of NaN entries and unique values

In [None]:
quantify_data(sas_df, ['object'])

**Summary on non-numeric data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `i94port, i94addr, dtadfile` | Value constraints exist | Change to category, validate against list of constraints |
| `dtadfile` | is the date on which the form was entered into the database | Change to datetime64 |
| `dtaddto` | is the date until which the immigrant is admitted to stay in the US | Change to datetime64 |
| `visapost, occup` | rarely filled, not fit for analysis | |
| `entdepa, entdepd, entdepu, matflag` | Unclear description in the data dictionary | Exclude from analysis |
| `gender` | Immigrant gender | Change to category |
| `insnum` | Immigration registration number, not filled in data sample | Check if field is filled in complete dataset |
| `visatype` | Type of issued visa | Check [online sources](https://travel.trade.gov/research/programs/i94/methodology.asp) for list of possible types |
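Converting the character date fields to datetime64 could be sketched as follows. The data dictionary only says "Character Date Field", so the format string below is an assumption that must be verified against the actual values; `parse_char_date` is a hypothetical helper and the sample values are illustrative.

```python
import pandas as pd


def parse_char_date(values, date_format):
    """Parse character date fields; values that do not match become NaT."""
    return pd.to_datetime(values, format=date_format, errors='coerce')


# Illustrative values only; '%Y%m%d' is an assumed format, not a documented one
sample = pd.Series(['20160401', '20160430', 'not a date', None])
parsed = parse_char_date(sample, '%Y%m%d')
```

With `errors='coerce'` malformed entries are turned into NaT instead of raising, so the number of NaT values after parsing doubles as a quick data quality indicator.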


### F - Dataset conclusion

**Summary:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

## Step 1c - World Temperature Data

### A - World Temperature Data Description

The World Temperature Dataset contains land temperatures by city on a global scale. A detailed description can be found [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

The source indicates that other datasets exist which summarize the data e.g. by country or major cities.

### B - Documentation Analysis

The dataset is documented on its Kaggle hosting [site](https://www.kaggle.com/colinpbowen/starter-climate-change-earth-surface-e24bc90c-4).

The set consists of seven columns:
* `dt` is the timestamp of the temperature measurement
* `AverageTemperature` displays the average temperature in degrees Celsius
* `AverageTemperatureUncertainty` shows possible deviations from the average (95% confidence)
* `City` - name of the city, has 3,448 distinct values
* `Country` - name of the country
* `Latitude, Longitude` - location of measurement

### C - World Temperature Data Gathering and first read

As Pandas has a method to import CSV data we will be using this mechanism.

Instead of reading just a chunk of the file we will read it in full here.

In [None]:
from os.path import getsize

fname = '../../data2/GlobalLandTemperaturesByCity.csv'
csv_df = pd.read_csv(fname)
size = int(getsize(fname) / 1024 / 1024)  # File size in MB
print('Reading {} lines from file {} (Size: {} MB)\n'.format(len(csv_df), fname, size))
print('First lines of data and data types:')
csv_df_typ = pd.DataFrame(csv_df.dtypes).transpose()
display(csv_df.head(), csv_df_typ)

**Summary on first read:**

| **Column Name** | **Comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| all | Dataset contains over 8.5 million rows in total | |
| `dt` | starting in 1743 | Convert to datetime64 |
| `AverageTemperature` | mind the missing values | Check for isna() |
| `AverageTemperatureUncertainty` | same as above | |
| `City, Country` | None | |
| `Latitude, Longitude` | | |

### D - Analysis of numeric columns in Temperature Data

Only two numeric columns were identified:  `AverageTemperature`, `AverageTemperatureUncertainty`

In [None]:
summarize_data(csv_df, ['numbers'])

**Summary on numeric data:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `AverageTemperature` | About 364,000 missing values | Exclude NaN from analysis |
| `AverageTemperatureUncertainty` | About 364,000 missing values, which is expected due to missing temperature data | Exclude NaN from analysis |
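Whether the uncertainty is missing exactly where the temperature is missing can be checked with a few lines of pandas. This is a minimal sketch: `missing_report` is a hypothetical helper and the sample frame contains made-up values for illustration only.

```python
import pandas as pd


def missing_report(df, columns):
    """Count missing values per column and rows where all given columns are missing."""
    per_column = df[columns].isna().sum()
    jointly_missing = int(df[columns].isna().all(axis=1).sum())
    return per_column, jointly_missing


# Made-up values for illustration only
sample = pd.DataFrame({
    'AverageTemperature': [6.1, None, 7.3, None],
    'AverageTemperatureUncertainty': [1.7, None, 1.2, None],
})
per_column, jointly_missing = missing_report(
    sample, ['AverageTemperature', 'AverageTemperatureUncertainty']
)
```

If the joint count equals both per-column counts, the two columns are missing in the same rows, and a single `dropna(subset=...)` on either column removes them.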

### E - Analysis of non-numeric columns in Temperature

Non-numeric data columns, qualitative analysis

In [None]:
summarize_data(csv_df, ['object'])

In [None]:
# Convert "dt" to datetime and sort, then check latest measurement date

csv_df['dt'] = pd.to_datetime(csv_df['dt'])
csv_df = csv_df.sort_values(by=['dt'], ascending=False)
print('The last datapoint is from the following date: {}'.format(csv_df['dt'].iloc[0].date()))

In [None]:
# Check distribution of data points per City

# Count datapoints per city
count_df = csv_df[['City', 'dt']].copy()
count_df = count_df.groupby(by=['City']).count()
count_df = count_df.sort_values(by='dt', ascending=False)
entries = len(csv_df)
num_of_cit = len(count_df)
print('Average datapoints per city: {:6.0f} entries'.format((entries / num_of_cit)))
print('City with most data points: {}\t\t\t{} entries'.format(count_df.head(1).index[0], count_df['dt'].head(1).values[0]))
print('City with least data points: {}\t\t\t{} entries'.format(count_df.tail(1).index[0], count_df['dt'].tail(1).values[0]))


**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| `dt` | The most recent data point is from September 2013 | Match with immigration data |
| all | There is a significant **imbalance in the number of data points**: while the average is 2,494 entries per city, the city with the most entries has 9,545 and the city with the fewest has 1,581 | |
| all | The number of data points per day also varies between about 700 and 3,500 | |


### F - Dataset conclusion

The World Temperature Dataset is not well suited to be analyzed in conjunction with the Immigration Data, since the time periods do not overlap. Without temperature data from the time period of the provided immigration data, no significant findings can be expected from a joint analysis.

The World Temperature Dataset is ruled **out of scope**.
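The overlap argument can be expressed as a simple interval check. The sketch below uses the periods stated in this notebook (temperature data starting in 1743 and ending in September 2013, immigration data from 2016); the exact day boundaries are assumptions and `periods_overlap` is a hypothetical helper.

```python
import pandas as pd


def periods_overlap(start_a, end_a, start_b, end_b):
    """Return True if the closed intervals [start_a, end_a] and [start_b, end_b] intersect."""
    return max(start_a, start_b) <= min(end_a, end_b)


# Day boundaries are assumed; only the years and months come from the findings above
temperature_period = (pd.Timestamp('1743-01-01'), pd.Timestamp('2013-09-30'))
immigration_period = (pd.Timestamp('2016-01-01'), pd.Timestamp('2016-12-31'))
overlap = periods_overlap(*temperature_period, *immigration_period)
```

Here `overlap` evaluates to a false value, which matches the out-of-scope decision.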

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

## Step 1d - U.S. City Demographic Data

### A - Demographic Data Description

A dataset of U.S. city demographic data is provided.

### B - Demographic Data Gathering and first read

Pandas can import CSV data directly, so we use that mechanism. The following code reads the file in full; note the semicolon separator.

In [None]:
dem_df = pd.read_csv("us-cities-demographics.csv", sep=";")
dem_df.head()

### C - Documentation Analysis

Lorem Ipsum

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### D - Analysis of numeric columns

Lorem Ipsum

In [None]:
summarize_data(dem_df, ['numbers'])

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### E - Analysis of non-numeric columns

Lorem Ipsum

In [None]:
summarize_data(dem_df, ['object'])

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### F - Dataset conclusion

Lorem Ipsum

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

## Step 1e - Airport Code Table

### A - Airport Code Data Description

Lorem Ipsum

### B - Airport Code Data Gathering and first read

Pandas can import CSV data directly, so we use that mechanism. The following code reads the file in full.

In [None]:
# Assumption: the airport code table is the file airport-codes_csv.csv
# referenced in Step 1b; adjust the path to its actual location
fname = 'airport-codes_csv.csv'
csv_df = pd.read_csv(fname)

### C - Documentation Analysis

Lorem Ipsum

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### D - Analysis of numeric columns

Lorem Ipsum

In [None]:
summarize_data(csv_df, ['numbers'])

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### E - Analysis of non-numeric columns

Lorem Ipsum

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

### F - Dataset conclusion

Lorem Ipsum

**Summary:**

| **Column Name** | **Notes, comments and findings:**                  | **Recommended action items** |
|------------|--------------------------------------------------|----------------------------|
| <FIELD> | <COMMENT, NOTE, FINDING> | <ACTION> |

In [None]:
	
from pyspark.sql import SparkSession

# Build a Spark session with the spark-sas7bdat package for reading SAS files
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [None]:
# Write to parquet, then read the data back from the parquet files
df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
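Two of the checks listed above, a unique key constraint and a source/count comparison, can be sketched in pandas as follows. The helper names are hypothetical, the sample frame is illustrative, and `cicid` is used as the key because it was identified as the likely primary key in Step 1.

```python
import pandas as pd


def check_unique_key(df, key):
    """Integrity check: the key column has no missing and no duplicate values."""
    return bool(df[key].notna().all() and not df[key].duplicated().any())


def check_row_count(source_rows, target_df):
    """Completeness check: the target holds exactly as many rows as the source."""
    return len(target_df) == source_rows


# Illustrative data only
fact = pd.DataFrame({'cicid': [1, 2, 3], 'i94yr': [2016, 2016, 2016]})
key_ok = check_unique_key(fact, 'cicid')
count_ok = check_row_count(3, fact)
```

Returning booleans instead of raising keeps the checks easy to collect into a report; a pipeline task could raise an error when any of them is false.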
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.