Objective: test the translation of the SAS date columns into Python datetimes. There are two SAS date columns:

* `arrdate`: EDA found no `null` values.
* `depdate`: EDA found records with `null` values.

More insights:

* It appears that the monthly datasets are mutually exclusive: the same CICID (say, 5748517) does not necessarily represent the same immigrant in the Jan dataset and in the Feb dataset (and so on). That is, CICID appears to be just a record identifier within each monthly dataset.
* It also appears that `i94yr` and `i94mon` correspond to the arrival date (into the US), based on one example. We can validate this by comparing the year-month of the dataset with the year-month of the arrival date.

In [1]:
import os
import pyspark
import configparser
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

In [2]:
# Make Jupyter Notebook display pandas DataFrames in full: show all columns and do not truncate column values.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 1000)

# Dev config from the non-secret configuration file
config_dev = configparser.ConfigParser()
with open('aws_dev.cfg') as cfg_file:  # close the file handle after reading
    config_dev.read_file(cfg_file)

PAR_I94_FILE_BY_NONE = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_NONE')
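
For reference, the lookup above expects `aws_dev.cfg` to contain a `DATA_PATHS_LOCAL` section with a `PAR_I94_FILE_BY_NONE` key. A minimal sketch of that layout, parsed from an inline string rather than a file; the path value and the names `sample_cfg`, `config_example`, and `example_path` are made-up placeholders, not the real configuration:

```python
import configparser

# Hypothetical file contents: only the section and key names come from the
# notebook; the path value is a placeholder for illustration.
sample_cfg = """
[DATA_PATHS_LOCAL]
PAR_I94_FILE_BY_NONE = /tmp/example/i94_parquet
"""

config_example = configparser.ConfigParser()
config_example.read_string(sample_cfg)

# Same lookup as in the notebook cell, against the sample contents
example_path = config_example.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_NONE')
print(example_path)
```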

In [3]:
spark = SparkSession.builder.\
    config("spark.jars.repositories", "https://repos.spark-packages.org/").\
    config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
    enableHiveSupport().getOrCreate()

In [4]:
df_i94 = spark.read.parquet(PAR_I94_FILE_BY_NONE)

In [5]:
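def convert_datetime(sas_date):
    """Convert a SAS date (days since 1960-01-01) to a Python date; return None if missing or invalid."""
    if sas_date is None:
        return None
    try:
        # SAS date values count days from the epoch 1960-01-01.
        return (datetime(1960, 1, 1) + timedelta(days=int(sas_date))).date()
    except (TypeError, ValueError, OverflowError):
        # Non-numeric, NaN, or out-of-range input: treat as missing.
        return None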
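
As a quick sanity check of the conversion rule (a SAS date is a count of days since 1960-01-01), the sketch below applies it in plain Python to a few `arrdate` and `depdate` values that appear in the query output for CICID 5748517. The helper name `sas_days_to_date` is hypothetical and only mirrors the idea of the function above; it is not part of the notebook.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day

def sas_days_to_date(sas_days):
    """Hypothetical helper: SAS day count -> datetime.date, None if missing."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))

# Values taken from the query output for CICID 5748517
print(sas_days_to_date(20484.0))  # 2016-01-31
print(sas_days_to_date(20545.0))  # 2016-04-01
print(sas_days_to_date(None))     # None
```

The first two results agree with the `arrdate_pydt` and `depdate_pydt` values shown in the Spark output, and a missing `depdate` stays missing rather than becoming 1960-01-01.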

In [6]:
udf_datetime_from_sas = F.udf(convert_datetime, T.DateType())

In [7]:
df_i94 = df_i94.withColumn('arrdate_pydt', udf_datetime_from_sas(df_i94.arrdate))

In [8]:
df_i94 = df_i94.withColumn('depdate_pydt', udf_datetime_from_sas(df_i94.depdate))

In [9]:
df_i94.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

# Hypothesis 1

Hypothesis: CICID values are scoped to each monthly i94 dataset, so the same CICID in different monthly datasets does not identify the same person.

Observation: it appears that the monthly datasets are mutually exclusive. The same CICID (say, 5748517) does not necessarily represent the same immigrant in the Jan dataset and in the Feb dataset (and so on). In the query output below, CICID 5748517 appears in several months with differing citizenship (`i94cit`), residence (`i94res`), and port (`i94port`) values, so CICID appears to be just a record identifier within each monthly dataset.
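
The reasoning behind this check can be sketched in plain Python: if one CICID maps to more than one citizenship/residence profile, it is unlikely to identify a single person. The rows below are copied from the query output for CICID 5748517; the variable names are illustrative only.

```python
from collections import defaultdict

# (cicid, i94mon, i94cit, i94res, i94port) copied from the query output
rows = [
    (5748517, 1, 135, 135, "LOS"),
    (5748517, 3, 135, 135, "NYC"),
    (5748517, 4, 245, 438, "LOS"),
    (5748517, 6, 252, 209, "AGA"),
]

# Collect the distinct (citizenship, residence) pairs seen for each CICID
profiles = defaultdict(set)
for cicid, mon, cit, res, port in rows:
    profiles[cicid].add((cit, res))

# A CICID with more than one profile is unlikely to be a stable person identifier
reused = {cicid: len(p) for cicid, p in profiles.items() if len(p) > 1}
print(reused)  # {5748517: 3}
```

If a row key spanning all months is needed, the composite (`i94yr`, `i94mon`, `cicid`) would be a candidate, though that is an assumption that still needs its own uniqueness check.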

In [12]:
df_i94.createOrReplaceTempView("df_i94")

df_i94_agg_1 = spark.sql("""
select
    cicid,
    i94yr,
    i94mon,
    i94cit,
    i94res,
    i94port,
    arrdate,
    arrdate_pydt,
    depdate,
    depdate_pydt
from df_i94
where cicid = 5748517
order by 1,2,3
limit 1000
""")

df_i94_agg_1.show()

+---------+------+------+------+------+-------+-------+------------+-------+------------+
|    cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|arrdate_pydt|depdate|depdate_pydt|
+---------+------+------+------+------+-------+-------+------------+-------+------------+
|5748517.0|2016.0|   1.0| 135.0| 135.0|    LOS|20484.0|  2016-01-31|   null|        null|
|5748517.0|2016.0|   3.0| 135.0| 135.0|    NYC|20541.0|  2016-03-28|20545.0|  2016-04-01|
|5748517.0|2016.0|   4.0| 245.0| 438.0|    LOS|20574.0|  2016-04-30|20582.0|  2016-05-08|
|5748517.0|2016.0|   6.0| 252.0| 209.0|    AGA|20631.0|  2016-06-26|   null|        null|
|5748517.0|2016.0|   7.0| 251.0| 251.0|    NYC|20659.0|  2016-07-24|20665.0|  2016-07-30|
|5748517.0|2016.0|   8.0| 117.0| 117.0|    WAS|20691.0|  2016-08-25|20699.0|  2016-09-02|
|5748517.0|2016.0|   9.0| 254.0| 276.0|    AGA|20723.0|  2016-09-26|20728.0|  2016-10-01|
|5748517.0|2016.0|  10.0| 111.0| 111.0|    NYC|20755.0|  2016-10-28|20762.0|  2016-11-04|
|5748517.0

# Hypothesis 2

Hypothesis: the year-month of the i94 form (`i94yr` and `i94mon`) exactly matches the year-month of the arrival date into the US (`year(arrdate_pydt)` and `month(arrdate_pydt)`).

Observation: `i94yr` and `i94mon` appear to correspond to the arrival date (into the US). We validate this by comparing the year-month of the dataset with the year-month of the arrival date. The aggregation below returns exactly one group per month, and in every group the arrival year-month equals `i94yr`/`i94mon`, so no record has a mismatched arrival month.
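
The comparison can also be expressed per record. A minimal sketch in plain Python, using a few (`i94yr`, `i94mon`, `arrdate`) triples copied from the query output for CICID 5748517; the function name `arrival_matches_form` is hypothetical and not part of the notebook.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day

# (i94yr, i94mon, arrdate) copied from the query output for CICID 5748517
records = [
    (2016.0, 1.0, 20484.0),
    (2016.0, 3.0, 20541.0),
    (2016.0, 4.0, 20574.0),
]

def arrival_matches_form(i94yr, i94mon, arrdate):
    """Hypothetical check: does the arrival year-month equal the form's?"""
    arrival = SAS_EPOCH + timedelta(days=int(arrdate))
    return (arrival.year, arrival.month) == (int(i94yr), int(i94mon))

# Records whose arrival year-month differs from the form's year-month
mismatches = [r for r in records if not arrival_matches_form(*r)]
print(len(mismatches))  # 0
```

An empty mismatch list on the full dataset would be equivalent to the grouped query producing only rows where `i94yr`/`i94mon` equal `arr_year`/`arr_mon`.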

In [18]:
df_i94_agg_2 = spark.sql("""
select
    i94yr,
    i94mon,
    year(arrdate_pydt) as arr_year,
    month(arrdate_pydt) as arr_mon,
    count(*)
from df_i94
group by 1,2,3,4
order by 1,2,3,4
""")

df_i94_agg_2.show()

+------+------+--------+-------+--------+
| i94yr|i94mon|arr_year|arr_mon|count(1)|
+------+------+--------+-------+--------+
|2016.0|   1.0|    2016|      1| 2847924|
|2016.0|   2.0|    2016|      2| 2570543|
|2016.0|   3.0|    2016|      3| 3157072|
|2016.0|   4.0|    2016|      4| 3096313|
|2016.0|   5.0|    2016|      5| 3444249|
|2016.0|   6.0|    2016|      6| 3574989|
|2016.0|   7.0|    2016|      7| 4265031|
|2016.0|   8.0|    2016|      8| 4103570|
|2016.0|   9.0|    2016|      9| 3733786|
|2016.0|  10.0|    2016|     10| 3649136|
|2016.0|  11.0|    2016|     11| 2914926|
|2016.0|  12.0|    2016|     12| 3432990|
+------+------+--------+-------+--------+



In [19]:
spark.stop()