# Project Title
### Data Engineering Capstone Project

#### Project Summary
This project analyzes data about immigration to the US. To do so, we export all the data to an S3 bucket and load it from there into Redshift as staging tables. Once in Redshift, we apply a series of transformations to produce a schema that we can analyze.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd

In [2]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? etc.<br>
<br>
In this project we use data that appears to come from different sources to analyze immigration to the United States. We work with three datasets: one containing immigration records, one containing information on airports, and one describing the ids used in the immigration dataset. The end solution is the table schema (Tables_schema.png), on which I can run queries to analyze the information. For this project I chose the Python programming language, with its libraries for interacting with the AWS Cloud ecosystem (boto3) and for connecting to Postgres databases (psycopg2), among others.

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? <br>
The datasets used in this project are the ones provided by Udacity as a project suggestion for completing this course. A total of three datasets are used. 

The first one _(`immigration_data_sample.csv`)_ consists of raw immigration data, made up mostly of id columns. A quick look shows that the data must be cleaned before use: for example, integer information _(such as year and month)_ is stored as float. This table was used heavily to build our `fact_imigration` table. <br>

Another dataset used was `airport-codes_csv.csv`. It contains detailed information on airports, ranging from simple fields such as the name to more complex ones such as coordinates. Looking at just the top 5 records of the dataset, we can see that NULL values will be a common issue here. Another interesting column is the coordinates one: each value holds something like an array with the latitude and longitude. We will not split it into two columns, since we are not going to use it in our analysis. Even though it is not used, this column is still imported into our staging table.<br>

Lastly, we have the `I94_SAS_Labels_Descriptions.SAS` file, which took some effort to process. For this task I used the work done by a Udacity student and made some small adjustments to produce a .csv file to be analyzed. The data consists basically of key-value pairs containing the id and its description for every column in the immigration table. This information was used extensively when producing the dimension tables.<br>

The `us-cities-demographics.csv` was not used.


In [3]:
df_i = pd.read_csv('/home/felipe/Documentos/Udacity/Data Engineering Nanodegree/Udacity_final_project/data/immigration_data_sample.csv')


In [4]:
df_i.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,,G,O,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,,Z,K,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Missing values _(`immigration_data_sample.csv`)_
As we can see below, there are a lot of columns with null values in the immigration dataset.<br>
Some cases are critical and make the column impossible to analyze, such as `occup`, `insnum` and `entdepu`.

In [18]:
df_i.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 29 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1000 non-null   int64  
 1   cicid       1000 non-null   float64
 2   i94yr       1000 non-null   float64
 3   i94mon      1000 non-null   float64
 4   i94cit      1000 non-null   float64
 5   i94res      1000 non-null   float64
 6   i94port     1000 non-null   object 
 7   arrdate     1000 non-null   float64
 8   i94mode     1000 non-null   float64
 9   i94addr     941 non-null    object 
 10  depdate     951 non-null    float64
 11  i94bir      1000 non-null   float64
 12  i94visa     1000 non-null   float64
 13  count       1000 non-null   float64
 14  dtadfile    1000 non-null   int64  
 15  visapost    382 non-null    object 
 16  occup       4 non-null      object 
 17  entdepa     1000 non-null   object 
 18  entdepd     954 non-null    object 
 19  entdepu     0 non-null      
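The non-null counts above can also be expressed as a missing-value ratio per column, which makes the critical columns easier to rank. The following is a minimal sketch on a tiny, made-up stand-in for `df_i`; the values and the 90% threshold are assumptions for illustration, not part of the project code.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df_i; the values are illustrative, not taken from the real file.
sample = pd.DataFrame({
    "cicid": [1.0, 2.0, 3.0, 4.0],
    "occup": [np.nan, np.nan, np.nan, "STU"],
    "entdepu": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, most affected columns first.
null_ratio = sample.isna().mean().sort_values(ascending=False)

# Columns above an (assumed) 90% threshold are candidates to leave out of the analysis.
mostly_empty = null_ratio[null_ratio > 0.9].index.tolist()

print(null_ratio)
print(mostly_empty)
```

Running the same two lines on the real `df_i` would list every column ordered by how empty it is.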

#### Incorrect Data Types _(`immigration_data_sample.csv`)_
Columns that should be integers, such as `i94yr` and `i94mon`, are stored as float64.<br>
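In this project the type issues are fixed with SQL in Redshift, but the same idea can be sketched in pandas for exploration purposes. The example below assumes a small made-up frame and uses the nullable `Int64` dtype so that columns with missing values, such as `depdate`, can also be converted.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the float-typed columns of the immigration sample.
sample = pd.DataFrame({
    "i94yr": [2016.0, 2016.0, 2016.0],
    "i94mon": [4.0, 4.0, 4.0],
    "depdate": [20573.0, np.nan, 20571.0],
})

# "Int64" (capital I) is pandas' nullable integer type: NaN becomes <NA>
# instead of forcing the whole column back to float.
int_columns = ["i94yr", "i94mon", "depdate"]
converted = sample.astype({col: "Int64" for col in int_columns})

print(converted.dtypes)
print(converted)
```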

#### Duplicate Values _(`immigration_data_sample.csv`)_
No duplicates were found in the `cicid` column. This is important because we are going to use it as a key for the future `dim_imgrant` table.

In [19]:
df_i.groupby(['cicid'])['cicid'].count().sort_index(ascending=False)

cicid
6061994.0    1
6058513.0    1
6057910.0    1
6057882.0    1
6055844.0    1
            ..
18310.0      1
17786.0      1
13826.0      1
13213.0      1
13208.0      1
Name: cicid, Length: 1000, dtype: int64
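The group-by above shows a count of 1 for the displayed ids. A more direct check, sketched here on the five `cicid` values from the `df_i.head()` output, is to ask pandas whether the column is unique and to list any offending rows.

```python
import pandas as pd

# The five cicid values shown in the df_i.head() output.
sample = pd.DataFrame(
    {"cicid": [4084316.0, 4422636.0, 1195600.0, 5291768.0, 985523.0]}
)

# True when every value appears exactly once.
print(sample["cicid"].is_unique)

# keep=False marks every member of a duplicated group, so the rows can be inspected.
duplicate_rows = sample[sample["cicid"].duplicated(keep=False)]
print(len(duplicate_rows))
```

On the full dataset, an empty `duplicate_rows` frame would confirm that `cicid` is safe to use as a key.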

#### Weird Column names _(`immigration_data_sample.csv`)_
`i94bir` seems to refer to the immigrant's age. Judging by the column name, we would expect it to refer to something like "birthday".

In [6]:
df_i.i94bir.head()

0    61.0
1    26.0
2    76.0
3    25.0
4    19.0
Name: i94bir, dtype: float64
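One way to support the reading of `i94bir` as an age is to compare it with `i94yr - biryear`. The sketch below uses only the five rows shown in the `df_i.head()` output, so it is a spot check rather than proof for the whole dataset.

```python
import pandas as pd

# Values copied from the first five rows of the df_i.head() output.
sample = pd.DataFrame({
    "i94yr": [2016.0, 2016.0, 2016.0, 2016.0, 2016.0],
    "biryear": [1955.0, 1990.0, 1940.0, 1991.0, 1997.0],
    "i94bir": [61.0, 26.0, 76.0, 25.0, 19.0],
})

# If i94bir is an age, it should equal the arrival year minus the birth year.
derived_age = sample["i94yr"] - sample["biryear"]
all_match = bool((derived_age == sample["i94bir"]).all())

print(all_match)
```

For these five rows the derived age and `i94bir` agree in every case.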

#### Cleaning Steps
All the tables used in this project were uploaded as is to S3 and loaded into Redshift. The issues they presented were dealt with using SQL in the Redshift environment. The only exception to this rule is the information contained in the `I94_SAS_Labels_Descriptions.SAS` file. In this case we had to perform the transformations described and documented in `reading_sas_file.py`. <br>
The code in this file is a mix of code found [here](https://knowledge.udacity.com/questions/125439) and some adjustments I made myself. The result is a .csv file containing all the ids and descriptions for the columns, plus a final column indicating which column of the immigration dataset each entry refers to. This .csv was imported as a staging table into Redshift.
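To give an idea of what this kind of transformation involves, here is a minimal sketch of extracting (column, id, description) rows from SAS label text. It is not the code in `reading_sas_file.py`: the `SAMPLE` text is an assumed excerpt of the file's layout, and `parse_labels` and both regular expressions are hypothetical names introduced for illustration.

```python
import re

# Assumed excerpt of the label file layout; the real file is much longer.
SAMPLE = """
/* I94MODE - There are missing values as well as not reported (9) */
value i94model
	1 = 'Air'
	2 = 'Sea'
	3 = 'Land'
	9 = 'Not reported' ;

/* I94ADDR - state of address */
value i94addrl
	'AL'='ALABAMA'
	'AK'='ALASKA'
;
"""

# A comment line such as "/* I94MODE - ..." names the column the labels belong to.
HEADER = re.compile(r"/\*\s*(\w+)")
# A label line is "<id> = '<description>'", where the id may or may not be quoted.
PAIR = re.compile(r"^\s*'?([^'=]+?)'?\s*=\s*'([^']*)'")


def parse_labels(text):
    """Return a list of (source_column, id, description) tuples."""
    rows = []
    column = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            column = header.group(1).lower()
            continue
        pair = PAIR.match(line)
        if pair and column:
            rows.append((column, pair.group(1).strip(), pair.group(2).strip()))
    return rows


for row in parse_labels(SAMPLE):
    print(row)
```

Each tuple maps directly to one line of the resulting .csv: the id, its description, and the column it came from.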

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
The reference for this information is [here](https://knowledge.udacity.com/questions/234597).<br>
This model is composed of 3 staging tables, 6 dimension tables and 1 fact table.

Staging Tables
1. staging_imigration: Contains the raw data from the `immigration_data_sample.csv`/parquet
2. staging_airport_codes: Contains information on airports
3. staging_sas_information: A compilation of ids and descriptions from the columns in the `immigration_data_sample.csv`/parquet

Dimension Tables
1. dim_modal: Contains information on the mode of transport ('Air', 'Sea', 'Land' and 'Not reported');
2. dim_port: Contains information on the port of arrival;
3. dim_imgrant: Contains information on the immigrant himself/herself;
4. dim_country: Contains the name of the country and its respective id;
5. dim_state: Contains the name of the state and its respective id;
6. dim_visa_motive: Contains information on the motive of the visa

Fact Table
1. fact_imigration: Contains information on the act of immigration itself.

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model<br>
Since the data was provided by Udacity, we start our pipeline with the files already at hand.<br>
- Generate the `sas_descriptive_information.csv` by executing `reading_sas_file.py`;
- Upload all the .csv files into the `nogfel-imigration` S3 bucket, following the paths below:
    - s3://nogfel-imigration/airport_data for `airport-codes_csv.csv`;
    - s3://nogfel-imigration/imigration_data for `immigration_data_sample.csv` and;
    - s3://nogfel-imigration/sas_data for `sas_descriptive_information.csv`
- Execute `create_cluster_aws.py` to create the Redshift cluster and the necessary infrastructure on AWS Cloud;
- Execute `create_and_load_tables.py` to create all the tables and load them with the data needed for the analysis;
- Execute `quality_checks.py` to check that the loaded data is as expected.
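The upload step in the list above can be sketched with boto3's `upload_file` method on the S3 client. The bucket name and prefixes come from the list; the `upload_all` helper and the local `data` directory are assumptions made for this example, and the function takes the client as a parameter so it can be tried without AWS access.

```python
from pathlib import Path

BUCKET = "nogfel-imigration"

# S3 prefix for each local file, following the layout listed above.
UPLOAD_PLAN = {
    "airport-codes_csv.csv": "airport_data",
    "immigration_data_sample.csv": "imigration_data",
    "sas_descriptive_information.csv": "sas_data",
}


def upload_all(s3_client, data_dir="data"):
    """Upload every planned file and return the S3 keys that were written.

    `data_dir` is a hypothetical local folder holding the three .csv files.
    """
    keys = []
    for filename, prefix in UPLOAD_PLAN.items():
        key = f"{prefix}/{filename}"
        s3_client.upload_file(str(Path(data_dir) / filename), BUCKET, key)
        keys.append(key)
    return keys


# Usage (requires boto3 and configured AWS credentials):
#   import boto3
#   upload_all(boto3.client("s3"))
```

Keeping the file-to-prefix mapping in one dictionary makes it easy to see, and change, where each dataset lands in the bucket.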

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
All the necessary code is available in the files:
- `create_cluster_aws.py`;
- `create_and_load_tables.py` and;
- `quality_checks.py`

The files must be executed in the order above for the job to run correctly.
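As an illustration of the loading step, the staging tables can be filled from S3 with Redshift `COPY` statements. This is a sketch, not the content of `create_and_load_tables.py`: the `build_copy_statements` helper is hypothetical, the IAM role ARN is a placeholder, and `IGNOREHEADER 1` assumes each .csv has a header row.

```python
# Staging table names and S3 locations as described in this notebook.
STAGING_SOURCES = {
    "staging_imigration": "s3://nogfel-imigration/imigration_data",
    "staging_airport_codes": "s3://nogfel-imigration/airport_data",
    "staging_sas_information": "s3://nogfel-imigration/sas_data",
}


def build_copy_statements(iam_role_arn):
    """Return one Redshift COPY statement per staging table."""
    statements = []
    for table, s3_path in STAGING_SOURCES.items():
        statements.append(
            f"COPY {table} FROM '{s3_path}' "
            f"IAM_ROLE '{iam_role_arn}' "
            "CSV IGNOREHEADER 1;"
        )
    return statements


# Placeholder ARN for illustration only; use the role attached to the cluster.
for statement in build_copy_statements(
    "arn:aws:iam::123456789012:role/example-redshift-role"
):
    print(statement)
```

Each statement could then be run through a psycopg2 cursor with `cursor.execute(statement)` once the cluster is available.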

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
The quality check code is available in the `quality_checks.py` file.<br><br>
<font color='orange'>Insert a picture of the output of the `quality_checks.py` showing that everything is Ok.</font>
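The kind of checks listed above (row counts and unique keys) can be sketched as follows. This is not the code in `quality_checks.py`: an in-memory SQLite database stands in for Redshift so the example runs anywhere, the helper names are hypothetical, and the `gender` column of `dim_imgrant` is an assumption.

```python
import sqlite3


def table_has_rows(conn, table):
    """Count check: the table must not be empty after the load."""
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count > 0


def key_is_unique(conn, table, key):
    """Integrity check: every non-null key value appears exactly once."""
    total, distinct = conn.execute(
        f"SELECT COUNT({key}), COUNT(DISTINCT {key}) FROM {table}"
    ).fetchone()
    return total == distinct


# SQLite stand-in for the Redshift cluster, with a few made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_imgrant (cicid INTEGER, gender TEXT)")
conn.executemany(
    "INSERT INTO dim_imgrant VALUES (?, ?)",
    [(4084316, "F"), (4422636, "M"), (1195600, "M")],
)

checks = {
    "dim_imgrant has rows": table_has_rows(conn, "dim_imgrant"),
    "dim_imgrant.cicid is unique": key_is_unique(conn, "dim_imgrant", "cicid"),
}
for name, passed in checks.items():
    print(f"{name}: {'OK' if passed else 'FAILED'}")
```

Against Redshift, the same two queries could be issued through a psycopg2 cursor for each dimension table and for the fact table.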

#### 4.3 Data dictionary 

The data dictionary presented here was based on [this](https://knowledge.udacity.com/questions/875627) reference.<br>
The data dictionary is available in [this Google Sheets file](https://docs.google.com/spreadsheets/d/1Y7uEm-tTa66jRgp2h-mtDe9kk_6elhQmDKd_N2US3q8/edit#gid=0). Read access is available to anyone who has the link.


### Step 5: Complete Project Write Up
##### Clearly state the rationale for the choice of tools and technologies for the project
_This project made use of AWS S3 for storing raw data, Pandas (a Python library) for data exploration and Redshift for data wrangling._<br>
##### Propose how often the data should be updated and why.
_The immigration dataset, which was heavily used for generating the fact table, only contains information for one month, April 2016, so it should be updated monthly. The other two do not need to be updated this frequently; an update every two or three months will be enough._
##### Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.<br>
 _In this scenario we could make use of Spark running on Amazon EMR, a service designed for processing large amounts of data._

 * The data populates a dashboard that must be updated on a daily basis by 7am every day.<br>
 _Apache Airflow would be the go-to product for this scenario. The whole ETL process created here would be migrated to Airflow to make use of DAGs and the scheduler in order to meet the specification._
 
 * The database needed to be accessed by 100+ people.<br>
 _The project already makes use of Amazon Redshift, which can handle up to 500 connections ([source](https://docs.aws.amazon.com/redshift/latest/mgmt/amazon-redshift-limits.html)) and would be more than enough to handle this demand._