Objective: `i94port` is the port of entry into the US. Does it contain only airport codes? Is there any correlation between `i94port` and `i94mode` (`1 = "Air"`, `2 = "Sea"`, `3 = "Land"`, `9 = "Not Reported"`)? Is it possible that `i94port` is only populated when the mode is Air?

Summary: `i94port` does not appear to have a meaningful relationship with `i94mode`. For example, some records carry an airport code but a mode of Sea or Land, so it is probably best not to relate the two for the time being.

Observation from manual search: a record may have an `i94mode` of 2 (Sea) but an `i94port` that is an airport code (e.g. MIA).
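
One way to make this observation systematic is to list every port code that occurs under more than one arrival mode. The sketch below is a minimal plain-Python illustration, not the notebook's method: the helper name `ports_with_multiple_modes` is hypothetical, and the sample rows are a handful of `(i94mode, i94port, count)` values copied from the `i94mode`/`i94port` aggregation output later in this notebook.

```python
from collections import defaultdict


def ports_with_multiple_modes(rows):
    """Return {port: set of modes} for ports seen under more than one mode.

    `rows` is an iterable of (i94mode, i94port, count) tuples.
    This helper is hypothetical and only illustrates the check.
    """
    modes_by_port = defaultdict(set)
    for mode, port, _count in rows:
        modes_by_port[port].add(mode)
    return {port: modes for port, modes in modes_by_port.items() if len(modes) > 1}


# A few rows taken from the (i94mode, i94port) aggregation output; None stands for null.
sample_rows = [
    (None, "MIA", 2),
    (None, "XXX", 73927),
    (0.0, "SAI", 2),
    (0.0, "XXX", 2),
    (1.0, "48Y", 2),
    (1.0, "AGA", 1336365),
]

mixed = ports_with_multiple_modes(sample_rows)
print(mixed)
```

On the full data the same function could presumably be fed the collected rows of the `i94mode`/`i94port` aggregation, since each row unpacks into three values; that is an assumption, not something this notebook runs.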

Useful references:

* US Airport code search: e.g. https://www.iata.org/en/publications/directories/code-search/?airport.search=JFK (e.g. JFK is an airport)
* US Seaport code search: e.g. https://www.freightos.com/freight-resources/seaport-code-name-finder/  (e.g. 48Y may be an airport or seaport.)

In [1]:
import os
import pyspark
import configparser
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

In [2]:
# Make Jupyter Notebook display pandas DataFrames in full: show all columns and do not truncate column values.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 1000)

# Dev config from the non-secret configuration file
config_dev = configparser.ConfigParser()
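# Use a context manager so the config file handle is closed after reading
with open('aws_dev.cfg') as cfg_file:
    config_dev.read_file(cfg_file)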

PAR_I94_DIR_BY_NONE = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_NONE')

In [3]:
spark = (
    SparkSession.builder
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)

In [4]:
df_i94 = spark.read.parquet(PAR_I94_DIR_BY_NONE)

In [5]:
df_i94.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = true)

In [6]:
df_i94.createOrReplaceTempView("df_i94")


In [8]:
df_i94_agg_1 = spark.sql("""
select
    i94mode,
    case
        when i94port is not null then 'Y'
        else 'N'
    end as has_i94port,
    count(*)
from df_i94
group by 1,2
order by 1,2
""")

df_i94_agg_1.show()

+-------+-----------+--------+
|i94mode|has_i94port|count(1)|
+-------+-----------+--------+
|   null|          Y|   73949|
|    0.0|          Y|       4|
|    1.0|          Y|39166088|
|    2.0|          Y|  387184|
|    3.0|          Y| 1095001|
|    9.0|          Y|   68303|
+-------+-----------+--------+
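
Every group above has `has_i94port = Y`, so `i94port` is never null for any mode, which answers the last question in the objective. To put the counts in proportion, the sketch below recomputes each mode's share of all rows in plain Python. The counts are copied from the table above and the mode labels come from the objective; the variable names are illustrative, and mode 0 and null have no documented label in this notebook.

```python
# Row counts per i94mode, copied from the aggregation output (None stands for null).
counts_by_mode = {
    None: 73949,
    0.0: 4,
    1.0: 39166088,
    2.0: 387184,
    3.0: 1095001,
    9.0: 68303,
}

# Labels from the objective; null and 0 have no documented meaning here.
mode_labels = {1.0: "Air", 2.0: "Sea", 3.0: "Land", 9.0: "Not Reported"}

total_rows = sum(counts_by_mode.values())
share_by_mode = {mode: n / total_rows for mode, n in counts_by_mode.items()}

for mode, share in share_by_mode.items():
    label = mode_labels.get(mode, "undocumented")
    print(f"{mode!s:>5} {label:<13} {share:8.4%}")
```

Air arrivals make up roughly 96% of the rows, so any comparison of port codes across modes rests on comparatively small Sea and Land samples.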



# Arrival mode (i94mode) and Port code (i94port)

In [10]:
df_i94_agg_2 = spark.sql("""
select
    i94mode,
    i94port,
    count(*)
from df_i94
group by 1,2
order by 1,2
""")

df_i94_agg_2.show(1000)

+-------+-------+--------+
|i94mode|i94port|count(1)|
+-------+-------+--------+
|   null|    ATL|       1|
|   null|    BOS|       1|
|   null|    CHI|       1|
|   null|    DAL|       1|
|   null|    DET|       1|
|   null|    HHW|       2|
|   null|    HOU|       1|
|   null|    LIH|       1|
|   null|    LOS|       1|
|   null|    MIA|       2|
|   null|    NYC|       3|
|   null|    ORL|       1|
|   null|    PHI|       1|
|   null|    PSP|       1|
|   null|    SAA|       1|
|   null|    SFR|       3|
|   null|    XXX|   73927|
|    0.0|    SAI|       2|
|    0.0|    XXX|       2|
|    1.0|    48Y|       2|
|    1.0|    5T6|     161|
|    1.0|    ABE|       2|
|    1.0|    ABG|      10|
|    1.0|    ABQ|      36|
|    1.0|    ABS|       4|
|    1.0|    ACY|       5|
|    1.0|    ADS|     388|
|    1.0|    ADW|     505|
|    1.0|    AFW|      19|
|    1.0|    AGA| 1336365|
|    1.0|    AGM|       2|
|    1.0|    AGU|      12|
|    1.0|    ALB|      46|
|    1.0|    ANA|       2|
|
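
The null-mode rows shown above can be checked against the first aggregation. The sketch below, in plain Python with illustrative variable names, sums the per-port counts for null `i94mode` as they appear in the output and measures how much of that total falls under `XXX`. Treating `XXX` as a placeholder for an unknown port is an assumption; the notebook does not say what it means.

```python
# Per-port counts for rows whose i94mode is null, copied from the output above.
null_mode_counts = {
    "ATL": 1, "BOS": 1, "CHI": 1, "DAL": 1, "DET": 1, "HHW": 2,
    "HOU": 1, "LIH": 1, "LOS": 1, "MIA": 2, "NYC": 3, "ORL": 1,
    "PHI": 1, "PSP": 1, "SAA": 1, "SFR": 3, "XXX": 73927,
}

null_mode_total = sum(null_mode_counts.values())
xxx_share = null_mode_counts["XXX"] / null_mode_total

print(null_mode_total)      # should equal the null-mode count in the first aggregation
print(f"{xxx_share:.4%}")   # share of null-mode rows with port code XXX
```

The total agrees with the 73949 null-mode rows reported by the first aggregation, and almost all of them carry `XXX`, so a missing mode and an uninformative port code tend to occur together.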

In [12]:
spark.stop()