Objective: for the DoubleType columns in the i94 dataset, find out whether the values are integer-like in nature (a decimal part would not make sense) or float-like in nature (a decimal part makes sense).

Spoiler alert: all the numeric DoubleType columns in the i94 dataset appear to be integer-like in nature. Column `admnum` is `bigint`-like, whereas the rest of the DoubleType columns are `int`-like.

In [1]:
import os
import pyspark
import configparser
import pandas as pd
from pyspark.sql import SparkSession

In [2]:
# Make Jupyter Notebook display pandas DataFrames in full: show all columns and do not truncate column values.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 1000)

In [3]:
# Dev config from the non-secret configuration file
config_dev = configparser.ConfigParser()
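# Use a context manager so the config file handle is closed after reading
with open('aws_dev.cfg') as cfg_file:
    config_dev.read_file(cfg_file)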

PAR_I94_FILE_BY_NONE = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_NONE')
# PAR_I94_FILE_BY_I94MON = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_I94MON')
# PAR_I94_FILE_BY_I94YR_I94MON = config_dev.get('DATA_PATHS_LOCAL', 'PAR_I94_FILE_BY_I94YR_I94MON')

In [4]:
spark = SparkSession.builder.\
    config("spark.jars.repositories", "https://repos.spark-packages.org/").\
    config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
    enableHiveSupport().getOrCreate()

In [5]:
df_i94 = spark.read.parquet(PAR_I94_FILE_BY_NONE)

In [6]:
df_i94.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

Should the `double` columns be stored as `integer` columns?

For instance, if every value in a column that is currently auto-parsed as `double` is equal to its integer form, we may conclude that the column ought to be stored as an integer column instead of a double column.
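
The same test can be tried on a handful of values in plain Python before running it over a whole column. This is only a sketch: `is_integer_like` is a hypothetical helper that is not part of this notebook, and it simply mirrors the idea of comparing each non-null value with its truncated integer form.

```python
def is_integer_like(values):
    """Return True if every non-null value equals its truncated integer form."""
    return all(v == int(v) for v in values if v is not None)

print(is_integer_like([2016.0, 3.0, None]))  # True
print(is_integer_like([2016.0, 3.5]))        # False
```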

In [34]:
def get_cols_with_dtype(df, dtype: str) -> list:
    """ Given a Spark DataFrame and a target data type, return the names of all columns of that data type.
    Example Usage: get_cols_with_dtype(df_i94, 'double')
    """
    # df.dtypes is a list of (column_name, type_string) tuples
    return [name for name, col_type in df.dtypes if col_type.startswith(dtype)]

In [35]:
# get_cols_with_dtype(df_i94, 'double')

In [36]:
def should_be_int(df, colname: str) -> bool:
    """ Check an entire Spark DataFrame column. Return True if every non-null value equals its integer form.
    Example Usage: should_be_int(df_i94, 'cicid')
    """
    df.createOrReplaceTempView("_tmp_df")

    sql_str = f"""\
        select count({colname}) as records
        from _tmp_df
        where ({colname} is not null) and (abs({colname} - bigint({colname})) > 0)
    """  
    
    # run spark SQL
    #print(sql_str)
    _tmp_df = spark.sql(sql_str)

    # Grab the value of `records`    
    diff_records = _tmp_df.head()[0]

    return diff_records == 0


In [37]:
# should_be_int(df_i94, 'cicid')

In [38]:
def which_int_type(df, colname: str) -> str:
    """ Check entire Spark DataFrame Column. Should the int column be of IntegerType (int) or LongType (bigint)? 
    https://spark.apache.org/docs/latest/sql-ref-datatypes.html
        ByteType: Represents 1-byte signed integer numbers. The range of numbers is from -128 to 127.
        ShortType: Represents 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
        IntegerType: Represents 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
        LongType: Represents 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.        
    """
    
    df.createOrReplaceTempView("_tmp_df")
     
    sql_str = f"""\
        select
            case
                when max(abs({colname})) <= 2147483647 then 'IntegerType'
                else 'LongType'
            end as proposed
        from _tmp_df
        where {colname} is not null
    """  
    
    # run spark SQL
    #print(sql_str)
    _tmp_df = spark.sql(sql_str)

    # Grab the value of `proposed`
    proposed = _tmp_df.head()[0]
    
    return proposed

In [39]:
# which_int_type(df_i94, 'cicid')

In [40]:
# which_int_type(df_i94, 'admnum')

In [41]:
def optimise_double(df):
    """ For each double column, propose IntegerType or LongType if all values are whole numbers; otherwise keep double. """
    double_list = get_cols_with_dtype(df, 'double')
    #print(double_list)
    return [{
        "column_name": colname,
        "proposed": which_int_type(df, colname) if should_be_int(df, colname) else 'double'
    } for colname in double_list]

In [42]:
optimise_double(df_i94)

[{'column_name': 'cicid', 'proposed': 'IntegerType'},
 {'column_name': 'i94yr', 'proposed': 'IntegerType'},
 {'column_name': 'i94mon', 'proposed': 'IntegerType'},
 {'column_name': 'i94cit', 'proposed': 'IntegerType'},
 {'column_name': 'i94res', 'proposed': 'IntegerType'},
 {'column_name': 'arrdate', 'proposed': 'IntegerType'},
 {'column_name': 'i94mode', 'proposed': 'IntegerType'},
 {'column_name': 'depdate', 'proposed': 'IntegerType'},
 {'column_name': 'i94bir', 'proposed': 'IntegerType'},
 {'column_name': 'i94visa', 'proposed': 'IntegerType'},
 {'column_name': 'count', 'proposed': 'IntegerType'},
 {'column_name': 'biryear', 'proposed': 'IntegerType'},
 {'column_name': 'admnum', 'proposed': 'LongType'}]
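
If these proposals were ever applied, one option would be to turn them into Spark SQL cast expressions. The sketch below is hypothetical: `build_cast_exprs` and the `SQL_TYPE` mapping are not part of this notebook. It only builds strings, on the assumption that they would later be passed to `DataFrame.selectExpr`. Column names are wrapped in backticks so that a name such as `count` is read as a column identifier.

```python
# Map the proposal labels used above to Spark SQL type names.
SQL_TYPE = {"IntegerType": "int", "LongType": "bigint"}

def build_cast_exprs(all_columns, proposals):
    """Build one select expression per column; cast only columns with an integer proposal."""
    proposed = {p["column_name"]: p["proposed"] for p in proposals}
    exprs = []
    for col in all_columns:
        sql_type = SQL_TYPE.get(proposed.get(col))
        if sql_type is None:
            # No integer proposal for this column: keep it unchanged.
            exprs.append(f"`{col}`")
        else:
            exprs.append(f"cast(`{col}` as {sql_type}) as `{col}`")
    return exprs

# Illustrative subset of columns and proposals.
proposals = [
    {"column_name": "i94yr", "proposed": "IntegerType"},
    {"column_name": "admnum", "proposed": "LongType"},
]
exprs = build_cast_exprs(["i94yr", "i94port", "admnum"], proposals)
print(exprs)
```

With a live session this might be used as `df_i94.selectExpr(*build_cast_exprs(df_i94.columns, optimise_double(df_i94)))`, followed by `printSchema()` to confirm the resulting types.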

Mini insights:

* These numeric columns are currently auto-parsed as DoubleType.
* This notebook aims to understand the nature of these columns: conceptually, are they integer-like, i.e. a fit for IntegerType (int) or LongType (bigint)?
* This notebook is purely exploratory. DoubleType covers the widest numeric range of the types considered here and might be the safest choice, so we may still keep these numeric columns as double. It is nonetheless important to appreciate the integer-like nature of these columns (e.g. year 2016 makes more sense than year 2016.0; likewise, month 3 makes more sense than month 3.0, and so on).
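
One point that may be worth checking when weighing DoubleType against the integer types: a double represents whole numbers exactly only up to 2^53, so a wider range does not by itself guarantee that every large identifier is stored exactly. The plain Python sketch below illustrates this with Python floats, which are IEEE 754 doubles on typical platforms; the constant name is an assumption made for illustration and nothing here touches the i94 data.

```python
# Magnitude up to which every whole number is exactly representable as a double.
MAX_EXACT_INT_IN_DOUBLE = 2 ** 53

# Beyond 2**53, neighbouring whole numbers can collapse to the same double.
print(float(MAX_EXACT_INT_IN_DOUBLE) == float(MAX_EXACT_INT_IN_DOUBLE + 1))  # True

# The IntegerType range sits well inside the exact zone.
print(MAX_EXACT_INT_IN_DOUBLE > 2147483647)  # True

# The LongType range extends beyond it.
print(MAX_EXACT_INT_IN_DOUBLE < 9223372036854775807)  # True
```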


In [43]:
spark.stop()