Objective: load the partitioned i94 dataset (partitioned by `i94yr` and `i94mon`), filtering on the partition columns:

* Attempt 1: where ... `i94yr=2016 and i94mon=3`.  (successful)
* Attempt 2: where ... `i94yr=2016 and (i94mon between 3 and 6)`.  (successful)

Lesson learnt: the `partitionBy` columns of a partitioned dataset are still available as columns, as long as we read from the base directory of the dataset (rather than from an individual partition sub-directory).
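The reason the partition columns survive is that their values are encoded in the directory names (`column=value`) and are recovered from the file paths at read time. The following is a minimal plain-Python sketch of that idea only; it is not Spark's implementation, and the example path and the helper name `parse_partition_path` are hypothetical.

```python
from pathlib import PurePosixPath


def parse_partition_path(path):
    """Collect key=value directory segments from a Hive-style partition path."""
    partitions = {}
    # Only look at the directories, not the file name itself
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical file path; the real layout depends on how the dataset was written
example = "i94_parquet/i94yr=2016.0/i94mon=3.0/part-00000.snappy.parquet"
print(parse_partition_path(example))
# {'i94yr': '2016.0', 'i94mon': '3.0'}
```

Reading a single partition sub-directory directly would leave no `column=value` segments below the read root, which is why the columns would be missing in that case.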

Open-ended question: which partitioning design is better?


* Attempt 1: 
    * where ... `i94yr=2016 and i94mon=3`, or
    * where ... `ym=201603`
* Attempt 2 (a range of months):
    * where ... `i94yr=2016 and (i94mon between 3 and 9)`, or
    * where ... `ym between 201603 and 201609`
    
Each option has its pros and cons. The bottom line is that I believe both options work.

Partitioning by `i94yr` and `i94mon` makes it easier to query by year alone or by month alone. But if we want data spanning multiple years, the where clause gets longer.
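To illustrate how the where clause grows once a range crosses a year boundary, here is a small sketch that builds the filter string for the `i94yr`/`i94mon` design. The helper name `year_month_where` is hypothetical, and the 2017 range is only an illustration (this notebook only reads 2016 data).

```python
def year_month_where(start, end):
    """Build a filter on i94yr/i94mon covering start..end inclusive.

    start and end are (year, month) tuples.
    """
    (start_year, start_month), (end_year, end_month) = start, end
    if (start_year, start_month) > (end_year, end_month):
        raise ValueError("start must not be after end")
    clauses = []
    for year in range(start_year, end_year + 1):
        # The first and last years are partial; years in between are full
        first = start_month if year == start_year else 1
        last = end_month if year == end_year else 12
        clauses.append(f"(i94yr={year} and i94mon between {first} and {last})")
    return " or ".join(clauses)


print(year_month_where((2016, 3), (2016, 6)))
# (i94yr=2016 and i94mon between 3 and 6)
print(year_month_where((2016, 11), (2017, 2)))
# (i94yr=2016 and i94mon between 11 and 12) or (i94yr=2017 and i94mon between 1 and 2)
```

The resulting string can be passed to `.where(...)` in the same way as the hand-written filters in this notebook.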

Partitioning by a newly created `ym` column has effectively the opposite pros and cons.
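For comparison, a sketch of the `ym` design, assuming `ym` is derived as `year * 100 + month` (for example `201603`). That derivation and the helper names `to_ym` and `ym_where` are assumptions for illustration; the dataset in this notebook is not partitioned this way.

```python
def to_ym(year, month):
    """Combine a year and a month into a single integer such as 201603."""
    year, month = int(year), int(month)
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return year * 100 + month


def ym_where(start, end):
    """Build a filter on ym covering start..end inclusive, given (year, month) tuples."""
    return f"ym between {to_ym(*start)} and {to_ym(*end)}"


print(to_ym(2016.0, 3.0))
# 201603
print(ym_where((2016, 3), (2016, 9)))
# ym between 201603 and 201609
print(ym_where((2016, 11), (2017, 2)))
# ym between 201611 and 201702
```

A range crossing a year boundary stays a single `between`, but selecting "every March" or "all of 2016" is no longer a simple equality on one column.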

In [1]:
import os
import pyspark
import configparser
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

In [2]:
# Ensure Jupyter Notebook display pandas dataframe fully. Show all columns. Do not truncate column value.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 1000)

# Dev config from the non-secret configuration file
config_dev = configparser.ConfigParser()
with open('aws_dev.cfg') as config_file:
    config_dev.read_file(config_file)

PAR_I94_FILE_BY_I94YR_I94MON = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_I94YR_I94MON')
PAR_I94_DIR_BY_I94YR_I94MON = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_DIR_BY_I94YR_I94MON')

In [3]:
spark = SparkSession.builder.\
    config("spark.jars.repositories", "https://repos.spark-packages.org/").\
    config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
    enableHiveSupport().getOrCreate()

Reference: https://stackoverflow.com/questions/47191078/spark-sql-queries-on-partitioned-data-using-date-ranges

```
spark.read.parquet("hdfs:///basepath")
  .where('ts >= 201710060000L && 'ts <= 201711030000L)
```

Attempt 1: read one particular year-month (partition)

In [14]:
df_i94 = spark.read.parquet(PAR_I94_DIR_BY_I94YR_I94MON)\
    .where('i94yr=2016 and i94mon=3')

# This works too (the partition values are read back as floating-point numbers, e.g. 2016.0 and 3.0)
# df_i94 = spark.read.parquet(PAR_I94_DIR_BY_I94YR_I94MON)\
#     .where('i94yr=2016.0 and i94mon=3.0')

In [15]:
df_i94.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = true)
 |-- fltno: string (nullable = true)
 |-- visatype: string (nullable = true)

In [16]:
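def convert_datetime(sas_date):
    """Convert a SAS date (days since 1960-01-01) to a datetime; return None if invalid."""
    try:
        # Treat the literal string 'null' as day 0, i.e. 1960-01-01
        if sas_date == 'null':
            sas_date = 0
        start_cutoff = datetime(1960, 1, 1)
        return start_cutoff + timedelta(days=int(sas_date))
    except (TypeError, ValueError, OverflowError):
        # None, NaN, or otherwise unconvertible input
        return None

udf_datetime_from_sas = F.udf(convert_datetime, T.DateType())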
df_i94 = df_i94.withColumn('arrdate', udf_datetime_from_sas(df_i94.arrdate))
df_i94 = df_i94.withColumn('depdate', udf_datetime_from_sas(df_i94.depdate))
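As a sanity check on the conversion above, the same arithmetic can be tried outside Spark. This is a standalone sketch; the helper name `sas_days_to_date` is hypothetical and it returns a `date` rather than a `datetime`.

```python
from datetime import date, timedelta

# SAS dates count days from 1 January 1960
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(sas_days):
    """Convert a SAS day count (int or float) to a date; None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_days_to_date(0))
# 1960-01-01
print(sas_days_to_date(20514.0))
# 2016-03-01
```

So an `arrdate` of `20514.0` corresponds to the first day of the March 2016 partition.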

In [17]:
df_i94.createOrReplaceTempView('df_i94')

In [18]:
spark.sql("""
select
    year(arrdate) as year_arrdate,
    month(arrdate) as month_arrdate,
    count(*)
from df_i94
group by 1, 2
order by 1, 2
""").show()

+------------+-------------+--------+
|year_arrdate|month_arrdate|count(1)|
+------------+-------------+--------+
|        2016|            3| 3157072|
+------------+-------------+--------+



In [19]:
spark.sql("""
select
    i94yr,
    i94mon,
    count(*)
from df_i94
group by 1, 2
order by 1, 2
""").show()

+------+------+--------+
| i94yr|i94mon|count(1)|
+------+------+--------+
|2016.0|   3.0| 3157072|
+------+------+--------+



Attempt 2: read a range of months

In [21]:
df_i94 = spark.read.parquet(PAR_I94_DIR_BY_I94YR_I94MON)\
    .where('i94yr=2016 and (i94mon between 3 and 6)')

In [25]:
df_i94 = df_i94.withColumn('arrdate', udf_datetime_from_sas(df_i94.arrdate))
df_i94 = df_i94.withColumn('depdate', udf_datetime_from_sas(df_i94.depdate))

In [26]:
df_i94.createOrReplaceTempView('df_i94')

In [27]:
spark.sql("""
select
    year(arrdate) as year_arrdate,
    month(arrdate) as month_arrdate,
    count(*)
from df_i94
group by 1, 2
order by 1, 2
""").show()

+------------+-------------+--------+
|year_arrdate|month_arrdate|count(1)|
+------------+-------------+--------+
|        2016|            3| 3157072|
|        2016|            4| 3096313|
|        2016|            5| 3444249|
|        2016|            6| 3574989|
+------------+-------------+--------+



In [28]:
spark.sql("""
select
    i94yr,
    i94mon,
    count(*)
from df_i94
group by 1, 2
order by 1, 2
""").show()

+------+------+--------+
| i94yr|i94mon|count(1)|
+------+------+--------+
|2016.0|   3.0| 3157072|
|2016.0|   4.0| 3096313|
|2016.0|   5.0| 3444249|
|2016.0|   6.0| 3574989|
+------+------+--------+



In [29]:
spark.stop()