# Udacity Environment Setup

To set up the Spark environment in the Udacity workspace, use the sample code snippet from: https://knowledge.udacity.com/questions/586304

Useful links:

https://stackoverflow.com/questions/35497069/passing-ipython-variables-as-arguments-to-bash-commands

In [1]:
import os
import pyspark
import configparser
from pyspark.sql import SparkSession

In [2]:
DATA_PATH_MODE = 'DATA_PATHS_LOCAL'
assert DATA_PATH_MODE in ['DATA_PATHS_LOCAL']

In [5]:
# Dev config from the non-secret configuration file
config_dev = configparser.ConfigParser()
with open('aws_dev.cfg') as cfg_file:
    config_dev.read_file(cfg_file)

RAW_I94_MTHLY_SAS7BDAT_DIR = config_dev.get('DATA_PATHS_UDACITY', 'RAW_I94_MTHLY_SAS7BDAT_DIR')
RAW_GLOBAL_CITY_LAND_TEMPERATURE = config_dev.get('DATA_PATHS_UDACITY', 'RAW_GLOBAL_CITY_LAND_TEMPERATURE')

PROCESSED_I94_DIR = config_dev.get(DATA_PATH_MODE, 'PROCESSED_I94_DIR')
RAW_I94_SAMPLE = config_dev.get(DATA_PATH_MODE, 'RAW_I94_SAMPLE')
RAW_AIRPORT_CODES = config_dev.get(DATA_PATH_MODE, 'RAW_AIRPORT_CODES')
RAW_US_CITY_DEMOGRAPHICS = config_dev.get(DATA_PATH_MODE, 'RAW_US_CITY_DEMOGRAPHICS')

print(f"(Udacity Server) RAW_I94_MTHLY_SAS7BDAT_DIR: {RAW_I94_MTHLY_SAS7BDAT_DIR}")
print(f"(Udacity Server) RAW_GLOBAL_CITY_LAND_TEMPERATURE: {RAW_GLOBAL_CITY_LAND_TEMPERATURE}")
print(f"PROCESSED_I94: {PROCESSED_I94_DIR}")
print(f"RAW_I94_SAMPLE: {RAW_I94_SAMPLE}")
print(f"RAW_AIRPORT_CODES: {RAW_AIRPORT_CODES}")
print(f"RAW_US_CITY_DEMOGRAPHICS: {RAW_US_CITY_DEMOGRAPHICS}")

(Udacity Server) RAW_I94_MTHLY_SAS7BDAT_DIR: /data/18-83510-I94-Data-2016
(Udacity Server) RAW_GLOBAL_CITY_LAND_TEMPERATURE: /data2/GlobalLandTemperaturesByCity.csv
PROCESSED_I94: sas_data/
RAW_I94_SAMPLE: raw_input_data/i94_sample_csv/immigration_data_sample.csv
RAW_AIRPORT_CODES: raw_input_data/airport_codes_csv/airport-codes_csv.csv
RAW_US_CITY_DEMOGRAPHICS: raw_input_data/us_cities_demographics_csv/us-cities-demographics.csv
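
The contents of `aws_dev.cfg` are not shown in this notebook. As a rough sketch, the layout below is one that would be consistent with the `config_dev.get(...)` calls and the printed paths above. The section and key names come from the code and the values from the output, but the file layout itself is an assumption.

```python
import configparser

# Hypothetical contents of aws_dev.cfg (layout assumed, not taken from the real file).
SAMPLE_CFG = """
[DATA_PATHS_UDACITY]
RAW_I94_MTHLY_SAS7BDAT_DIR = /data/18-83510-I94-Data-2016
RAW_GLOBAL_CITY_LAND_TEMPERATURE = /data2/GlobalLandTemperaturesByCity.csv

[DATA_PATHS_LOCAL]
PROCESSED_I94_DIR = sas_data/
RAW_I94_SAMPLE = raw_input_data/i94_sample_csv/immigration_data_sample.csv
RAW_AIRPORT_CODES = raw_input_data/airport_codes_csv/airport-codes_csv.csv
RAW_US_CITY_DEMOGRAPHICS = raw_input_data/us_cities_demographics_csv/us-cities-demographics.csv
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)

# Fail early with a readable message if the chosen mode has no section.
mode = "DATA_PATHS_LOCAL"
if not config.has_section(mode):
    raise KeyError(f"Section {mode!r} is missing from the configuration")

# Option lookups are case-insensitive by default in configparser.
processed_dir = config.get(mode, "PROCESSED_I94_DIR")
print(f"PROCESSED_I94: {processed_dir}")
```

Keeping one section per path mode means that switching environments only requires changing `DATA_PATH_MODE`, not the code that reads the paths.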


In [6]:
# Not required in the Udacity workspace (tested), but kept here for reference in case it is needed elsewhere.
# Source: https://knowledge.udacity.com/questions/586304
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["PATH"] = "/opt/conda/bin:/opt/spark-2.4.3-bin-hadoop2.7/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/lib/jvm/java-8-openjdk-amd64/bin"
# os.environ["SPARK_HOME"] = "/opt/spark-2.4.3-bin-hadoop2.7"
# os.environ["HADOOP_HOME"] = "/opt/spark-2.4.3-bin-hadoop2.7"

In [7]:
spark = SparkSession.builder\
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")\
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")\
    .enableHiveSupport()\
    .getOrCreate()

In [8]:
cwd = os.getcwd()
print(f"Current working directory: {cwd}")

Current working directory: /home/workspace


In [9]:
type(spark)

pyspark.sql.session.SparkSession

# Udacity Environment

For learning purposes, let's find out what the Udacity virtual environment looks like. If we later wish to reproduce the results outside the Udacity environment, we can reverse-engineer it: rebuild a similar environment (e.g. via Docker) and get our code working there.

In [10]:
!python --version

Python 3.6.3


In [11]:
!aws --version

aws-cli/1.16.17 Python/3.6.3 Linux/4.15.0-1083-gcp botocore/1.12.7


In [12]:
spark.sparkContext.version

'2.4.3'
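
To make the versions above easier to carry into a Dockerfile or a requirements file, it may help to record them programmatically as well. The sketch below uses only the standard library. The helper name `describe_environment` is hypothetical, and it covers only the interpreter and OS, not Spark or the AWS CLI.

```python
import platform
from typing import Dict


def describe_environment() -> Dict[str, str]:
    """Collect basic interpreter and OS details for later reproduction."""
    return {
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
    }


# Print one "key: value" line per setting so the output can be pasted into notes.
for key, value in describe_environment().items():
    print(f"{key}: {value}")
```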

# Main EDA Notebook

Our main EDA Notebook: `/home/workspace/EDA.ipynb`

# Utility Functions

To keep the analysis concise, we have refactored some reusable Python procedures into the functions below (following some initial Notebook iterations).

In [13]:
def spark_read_csv_raw(
        spark: pyspark.sql.SparkSession,
        file_path: str
    ) -> pyspark.sql.dataframe.DataFrame:
    """
    Read a CSV file into a Spark DataFrame. Read each entire row into one column.
    Useful for an initial Data Quality Check.
    Useful for deciding on the best delimiter to use.
    """
    
    return spark\
        .read\
        .format("csv")\
        .option("delimiter", "\n")\
        .load(file_path)
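
Once a few raw lines are available, the delimiter choice can also be sanity-checked without Spark. The sketch below is a simple heuristic, not part of the original workflow: it parses the sample lines with each candidate delimiter and keeps the one that gives the same field count (greater than one) on every line. The function name `guess_delimiter` and the candidate list are assumptions made for illustration.

```python
import csv
from typing import List, Optional


def guess_delimiter(lines: List[str], candidates: str = ",;|\t") -> Optional[str]:
    """Return the candidate that splits every line into the same number (>1) of fields."""
    best, best_width = None, 1
    for delimiter in candidates:
        # csv.reader respects quoting, so a quoted "a, b" value counts as one field.
        widths = {len(row) for row in csv.reader(lines, delimiter=delimiter)}
        if len(widths) == 1:
            width = widths.pop()
            if width > best_width:
                best, best_width = delimiter, width
    return best


demographics_lines = [
    "City;State;Median Age;Male Population;Female Population;Total Population;"
    "Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count",
    "Silver Spring;Maryland;33.8;40601;41862;82463;1562;30908;2.6;MD;Hispanic or Latino;25924",
]
airport_lines = [
    "ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,"
    "gps_code,iata_code,local_code,coordinates",
    '00A,heliport,Total Rf Heliport,11,NA,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"',
]

print(repr(guess_delimiter(demographics_lines)))  # ';'
print(repr(guess_delimiter(airport_lines)))       # ','
```

A heuristic like this only confirms what a visual check of the raw rows suggests; it returns `None` when no candidate splits the lines consistently.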

In [14]:
def spark_read_csv(
        spark: pyspark.sql.SparkSession,
        file_path: str, 
        delimiter: str = ",", 
        header: bool = True
    ) -> pyspark.sql.dataframe.DataFrame:
    """
    Read a CSV file into a Spark DataFrame. Specify how we wish to read the CSV file.
    This returns a Spark DataFrame that is useful for downstream analytics.
    """
    
    return spark\
        .read\
        .format("csv")\
        .option("delimiter", delimiter)\
        .option("header", header)\
        .load(file_path)


In [15]:
def spark_df_overview(
        spark_df: pyspark.sql.dataframe.DataFrame,
        spark_df_name: str = "spark_df_name_not_provided",
        top_x: int = 5
    ) -> None:
    """ Given a Spark DataFrame, print out useful high level summary 
    Ref: why use take() instead of collect():
    https://sparkbyexamples.com/spark/show-top-n-rows-in-spark-pyspark/
    """
 
    # Print the Spark DataFrame Name for ease of reading of console output
    print(f"Spark DataFrame Name: {spark_df_name}")

    # Total column count
    print(f"Total Columns: {len(spark_df.columns)}")

    # Total row count
    spark_df_rows = spark_df.count()
    print(f"Total Rows: {spark_df_rows}")

    # Basic structure of the Spark DataFrame
    print("Spark DataFrame Structure")
    spark_df.printSchema()

    # Print up to top_x sample rows. take() returns at most top_x rows,
    # so iterating over its result is safe even when the DataFrame has
    # fewer than top_x rows (or none at all).
    for row_index, row in enumerate(spark_df.take(top_x)):
        print(f"*** Sample Row: {row_index} ***")
        print(row)
    

# Where are the data

The provided data can be found in the Udacity virtual workspace environment.

In [17]:
data_desc = \
f"""
* Data Domain 1: I94 Immigration Data (Primary)

    * Dataset 1.1: I94 Immigration Data (2016?) - Sample CSV.
        Path: {RAW_I94_SAMPLE}
        
    * Dataset 1.2: I94 Immigration Data (2016?) - Sample Parquet.
        Path: {PROCESSED_I94_DIR}
        Note that these Parquet files originated from SAS Datasets (1.3 below).
        From analysis it appears that the Parquet Files (in path `sas_data`) come from the 
        Apr 2016 SAS dataset: `i94_apr16_sub.sas7bdat`. 
        (Of course! created by the Notebook: `Capstone Project Template.ipynb` - duh!)
        
    * Dataset 1.3: I94 Immigration Data (2016?) - SAS Datasets.
        Path: {RAW_I94_MTHLY_SAS7BDAT_DIR}
        It contains 12 SAS Dataset (1 for each month in year 2016).
        Each SAS Dataset has a naming convention like this: `i94_<mmm>16_sub.sas7bdat`,
        where `mmm` is in this list `['apr','aug','dec','feb','jan','jul','jun','mar','may','nov','oct','sep']`
        
* Data Domain 2: World Temperature Data (Secondary)

    * Dataset 2.1: World Temperature Data - CSV.
        Path: {RAW_GLOBAL_CITY_LAND_TEMPERATURE}
        
* Data Domain 3: US Cities Demographics (Secondary)

    * Data Set 3.1: US Cities Demographics - CSV.
        Path: {RAW_US_CITY_DEMOGRAPHICS}
    
* Data Domain 4: Airport Codes (Secondary)

    * Data Set 4.1: Airport Codes - CSV.
        Path: {RAW_AIRPORT_CODES}    
"""

print(data_desc)


* Data Domain 1: I94 Immigration Data (Primary)

    * Dataset 1.1: I94 Immigration Data (2016?) - Sample CSV.
        Path: raw_input_data/i94_sample_csv/immigration_data_sample.csv
        
    * Dataset 1.2: I94 Immigration Data (2016?) - Sample Parquet.
        Path: sas_data/
        Note that these Parquet files originated from SAS Datasets (1.3 below).
        From analysis it appears that the Parquet Files (in path `sas_data`) come from the 
        Apr 2016 SAS dataset: `i94_apr16_sub.sas7bdat`. 
        (Of course! created by the Notebook: `Capstone Project Template.ipynb` - duh!)
        
    * Dataset 1.3: I94 Immigration Data (2016?) - SAS Datasets.
        Path: /data/18-83510-I94-Data-2016
        It contains 12 SAS Dataset (1 for each month in year 2016).
        Each SAS Dataset has a naming convention like this: `i94_<mmm>16_sub.sas7bdat`,
        where `mmm` is in this list `['apr','aug','dec','feb','jan','jul','jun','mar','may','nov','oct','sep']`
        
* Data Do

# EDA

## Data Domain 1: I94 Immigration Data (Primary)

This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace. [This](https://travel.trade.gov/research/reports/i94/historical/2016.html) is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.

### Dataset 1.1: I94 Immigration Data (2016?) - Sample CSV

In [18]:
!ls -lh $RAW_I94_SAMPLE

-rw-r--r-- 1 root root 142K Mar 15  2019 raw_input_data/i94_sample_csv/immigration_data_sample.csv
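
The `!ls -lh $VARIABLE` form relies on IPython expanding Python variables inside shell commands, so it does not work in a plain Python script. As a sketch of a portable alternative, the helpers below report a file size in a roughly similar human-readable form. The names `human_size` and `describe_file` are hypothetical, and the rounding is not identical to that of `ls`.

```python
import os
import tempfile


def human_size(num_bytes: int) -> str:
    """Format a byte count with binary (1024-based) unit suffixes."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G"):
        if size < 1024:
            return f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}T"


def describe_file(path: str) -> str:
    """Return '<size> <path>' for an existing file."""
    return f"{human_size(os.path.getsize(path))} {path}"


# Demonstrate on a throwaway 2048-byte file so the example runs anywhere.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 2048)
    demo_path = handle.name

print(describe_file(demo_path))
os.remove(demo_path)
```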


In [19]:
# Read raw CSV into a Spark DataFrame where each whole row is stored in one column
df_immigration_sample_raw = spark_read_csv_raw(
    spark=spark,
    file_path=RAW_I94_SAMPLE)

# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_immigration_sample_raw,
    spark_df_name="df_immigration_sample_raw")

Spark DataFrame Name: df_immigration_sample_raw
Total Columns: 1
Total Rows: 1001
Spark DataFrame Structure
root
 |-- _c0: string (nullable = true)

*** Sample Row: 0 ***
Row(_c0=',cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype')
*** Sample Row: 1 ***
Row(_c0='2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,07202016,F,,JL,56582674633.0,00782,WT')
*** Sample Row: 2 ***
Row(_c0='2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94361995930.0,XBLNG,B2')
*** Sample Row: 3 ***
Row(_c0='589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,07052016,M,,LH,55780468433.0,00464,WT')
*** Sample Row: 4 ***
Row(_c0='2631158,5291768.0,2016.0,4.0,297.0,297.0,L

In [20]:
# Now read CSV again with additional config (Header, Delimiter, etc.)
df_immigration_sample = spark_read_csv(
    spark=spark,
    file_path=RAW_I94_SAMPLE, 
    delimiter=",", 
    header=True)

# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_immigration_sample,
    spark_df_name="df_immigration_sample")

Spark DataFrame Name: df_immigration_sample
Total Columns: 29
Total Rows: 1000
Spark DataFrame Structure
root
 |-- _c0: string (nullable = true)
 |-- cicid: string (nullable = true)
 |-- i94yr: string (nullable = true)
 |-- i94mon: string (nullable = true)
 |-- i94cit: string (nullable = true)
 |-- i94res: string (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: string (nullable = true)
 |-- i94mode: string (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: string (nullable = true)
 |-- i94bir: string (nullable = true)
 |-- i94visa: string (nullable = true)
 |-- count: string (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: string (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- ge

**User Notes**: the first column `_c0` has no name in the CSV header and is likely an identifier for the specific row in the full dataset (e.g. the row index of the sampled record in the originating full dataset).
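
A further observation on the sample rows above: `arrdate` and `depdate` hold values such as `20566.0`. One plausible reading, which is an assumption and not confirmed in this notebook, is that they are SAS date values, i.e. day counts from 1960-01-01. Under that assumption `20566` maps to 2016-04-22, which agrees with `dtadfile` `20160422` in the same row. The helper name `sas_date_to_date` below is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

# SAS date values count days from this epoch.
SAS_EPOCH = date(1960, 1, 1)


def sas_date_to_date(sas_days: Optional[float]) -> Optional[date]:
    """Convert a SAS day count to a date, passing missing values through."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_date_to_date(20566.0))  # 2016-04-22
print(sas_date_to_date(20573.0))  # 2016-04-29
```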

### Dataset 1.2: I94 Immigration Data (2016?) - Parquet Files (originated from SAS Datasets?)

In [21]:
!ls -lh $PROCESSED_I94_DIR

total 46M
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00001-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00002-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00003-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00004-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00005-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.2M Mar 19  2019 part-00006-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.3M Mar 19  2019 part-00007-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-- 1 root root 3.2M Mar 19  2019 part-00008-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
-rw-r--r-

In [22]:
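# Read the Parquet files from the configured path (sas_data/) instead of a hard-coded string
df_immigration_parquet = spark.read.parquet(PROCESSED_I94_DIR)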

In [23]:
# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_immigration_parquet,
    spark_df_name="df_immigration_parquet")

Spark DataFrame Name: df_immigration_parquet
Total Columns: 28
Total Rows: 3096313
Spark DataFrame Structure
root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)


### Dataset 1.3: I94 Immigration Data (2016?) - SAS Datasets

In [24]:
!ls -lh $RAW_I94_MTHLY_SAS7BDAT_DIR

total 6.0G
-rw-r--r-- 1 root root 451M May 31  2018 i94_apr16_sub.sas7bdat
-rw-r--r-- 1 root root 597M May 31  2018 i94_aug16_sub.sas7bdat
-rw-r--r-- 1 root root 500M May 31  2018 i94_dec16_sub.sas7bdat
-rw-r--r-- 1 root root 374M May 31  2018 i94_feb16_sub.sas7bdat
-rw-r--r-- 1 root root 415M May 31  2018 i94_jan16_sub.sas7bdat
-rw-r--r-- 1 root root 620M May 31  2018 i94_jul16_sub.sas7bdat
-rw-r--r-- 1 root root 684M May 31  2018 i94_jun16_sub.sas7bdat
-rw-r--r-- 1 root root 459M May 31  2018 i94_mar16_sub.sas7bdat
-rw-r--r-- 1 root root 501M May 31  2018 i94_may16_sub.sas7bdat
-rw-r--r-- 1 root root 424M May 31  2018 i94_nov16_sub.sas7bdat
-rw-r--r-- 1 root root 531M May 31  2018 i94_oct16_sub.sas7bdat
-rw-r--r-- 1 root root 543M May 31  2018 i94_sep16_sub.sas7bdat
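
The listing is sorted alphabetically, not by calendar month. If all twelve files are to be processed later, the paths can be generated in calendar order from the naming convention `i94_<mmm>16_sub.sas7bdat`. The following is a minimal sketch; the function name `monthly_sas_paths` and its `year_suffix` parameter are hypothetical.

```python
from typing import List

MONTH_ABBREVIATIONS = (
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
)


def monthly_sas_paths(base_dir: str, year_suffix: str = "16") -> List[str]:
    """Build the 12 monthly SAS dataset paths in calendar order."""
    base = base_dir.rstrip("/")
    return [
        f"{base}/i94_{month}{year_suffix}_sub.sas7bdat"
        for month in MONTH_ABBREVIATIONS
    ]


paths = monthly_sas_paths("/data/18-83510-I94-Data-2016")
print(len(paths))  # 12
print(paths[3])    # /data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat
```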


In [25]:
# Note that each SAS Dataset is quite large (about 0.5 GB each).
# For now, let's just read in one and see what it looks like

In [26]:
df_immigration_sas = spark\
    .read\
    .format('com.github.saurfang.sas.spark')\
    .load(f"{RAW_I94_MTHLY_SAS7BDAT_DIR}/i94_apr16_sub.sas7bdat")

In [27]:
# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_immigration_sas,
    spark_df_name="df_immigration_sas")

Spark DataFrame Name: df_immigration_sas
Total Columns: 28
Total Rows: 3096313
Spark DataFrame Structure
root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |--

## Data Domain 2: World Temperature Data (Secondary)

This dataset came from Kaggle. You can read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

### Dataset 2.1: World Temperature Data - CSV:

In [28]:
!ls -lh $RAW_GLOBAL_CITY_LAND_TEMPERATURE

-rw-r--r-- 1 1002 1003 509M Mar 30  2019 /data2/GlobalLandTemperaturesByCity.csv


In [29]:
# Read raw CSV into a Spark DataFrame where each whole row is stored in one column
df_temperature_raw = spark_read_csv_raw(
    spark=spark,
    file_path=RAW_GLOBAL_CITY_LAND_TEMPERATURE)

# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_temperature_raw,
    spark_df_name="df_temperature_raw")

Spark DataFrame Name: df_temperature_raw
Total Columns: 1
Total Rows: 8599213
Spark DataFrame Structure
root
 |-- _c0: string (nullable = true)

*** Sample Row: 0 ***
Row(_c0='dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude')
*** Sample Row: 1 ***
Row(_c0='1743-11-01,6.068,1.7369999999999999,Århus,Denmark,57.05N,10.33E')
*** Sample Row: 2 ***
Row(_c0='1743-12-01,,,Århus,Denmark,57.05N,10.33E')
*** Sample Row: 3 ***
Row(_c0='1744-01-01,,,Århus,Denmark,57.05N,10.33E')
*** Sample Row: 4 ***
Row(_c0='1744-02-01,,,Århus,Denmark,57.05N,10.33E')


In [30]:
# Now read CSV again with additional config (Header, Delimiter, etc.)
df_temperature = spark_read_csv(
    spark=spark,
    file_path=RAW_GLOBAL_CITY_LAND_TEMPERATURE, 
    delimiter=",", 
    header=True)

# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_temperature,
    spark_df_name="df_temperature")

Spark DataFrame Name: df_temperature
Total Columns: 7
Total Rows: 8599212
Spark DataFrame Structure
root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: string (nullable = true)
 |-- AverageTemperatureUncertainty: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)

*** Sample Row: 0 ***
Row(dt='1743-11-01', AverageTemperature='6.068', AverageTemperatureUncertainty='1.7369999999999999', City='Århus', Country='Denmark', Latitude='57.05N', Longitude='10.33E')
*** Sample Row: 1 ***
Row(dt='1743-12-01', AverageTemperature=None, AverageTemperatureUncertainty=None, City='Århus', Country='Denmark', Latitude='57.05N', Longitude='10.33E')
*** Sample Row: 2 ***
Row(dt='1744-01-01', AverageTemperature=None, AverageTemperatureUncertainty=None, City='Århus', Country='Denmark', Latitude='57.05N', Longitude='10.33E')
*** Sample Row: 3 ***
Row(dt='1744-02-01', Ave
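
Note that `Latitude` and `Longitude` arrive as strings with a hemisphere suffix (e.g. `57.05N`, `10.33E`). Any downstream numeric use would first need signed decimal degrees. The sketch below shows one possible conversion; the helper name `parse_coordinate` is hypothetical, and it follows the usual convention that south and west are negative.

```python
def parse_coordinate(text: str) -> float:
    """Convert strings such as '57.05N' or '10.33E' to signed decimal degrees."""
    text = text.strip()
    if not text or text[-1].upper() not in "NSEW":
        raise ValueError(f"Unrecognised coordinate: {text!r}")
    hemisphere = text[-1].upper()
    # South and west are negative by convention.
    sign = -1.0 if hemisphere in "SW" else 1.0
    return sign * float(text[:-1])


print(parse_coordinate("57.05N"))  # 57.05
print(parse_coordinate("74.00W"))  # -74.0
```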

## Data Domain 3: US Cities Demographics (Secondary)

This data comes from OpenDataSoft. You can read more about it [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).

### Dataset 3.1: US Cities Demographics - CSV:

In [31]:
!ls -lh $RAW_US_CITY_DEMOGRAPHICS

-rw-r--r-- 1 root root 246K Mar 15  2019 raw_input_data/us_cities_demographics_csv/us-cities-demographics.csv


In [32]:
# Read raw CSV into a Spark DataFrame where each whole row is stored in one column
df_us_cities_demo_raw = spark_read_csv_raw(
    spark=spark,
    file_path=RAW_US_CITY_DEMOGRAPHICS)

# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_us_cities_demo_raw,
    spark_df_name="df_us_cities_demo_raw")

Spark DataFrame Name: df_us_cities_demo_raw
Total Columns: 1
Total Rows: 2892
Spark DataFrame Structure
root
 |-- _c0: string (nullable = true)

*** Sample Row: 0 ***
Row(_c0='City;State;Median Age;Male Population;Female Population;Total Population;Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count')
*** Sample Row: 1 ***
Row(_c0='Silver Spring;Maryland;33.8;40601;41862;82463;1562;30908;2.6;MD;Hispanic or Latino;25924')
*** Sample Row: 2 ***
Row(_c0='Quincy;Massachusetts;41.0;44129;49500;93629;4147;32935;2.39;MA;White;58723')
*** Sample Row: 3 ***
Row(_c0='Hoover;Alabama;38.5;38040;46799;84839;4819;8229;2.58;AL;Asian;4759')
*** Sample Row: 4 ***
Row(_c0='Rancho Cucamonga;California;34.5;88127;87105;175232;5821;33878;3.18;CA;Black or African-American;24437')


In [33]:
# Now read CSV again with additional config (Header, Delimiter, etc.)
df_us_cities_demo = spark_read_csv(
    spark=spark,
    file_path=RAW_US_CITY_DEMOGRAPHICS, 
    delimiter=";", 
    header=True)

# Print Spark DataFrame overview
spark_df_overview(
    spark_df=df_us_cities_demo,
    spark_df_name="df_us_cities_demo")

Spark DataFrame Name: df_us_cities_demo
Total Columns: 12
Total Rows: 2891
Spark DataFrame Structure
root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: string (nullable = true)
 |-- Male Population: string (nullable = true)
 |-- Female Population: string (nullable = true)
 |-- Total Population: string (nullable = true)
 |-- Number of Veterans: string (nullable = true)
 |-- Foreign-born: string (nullable = true)
 |-- Average Household Size: string (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: string (nullable = true)

*** Sample Row: 0 ***
Row(City='Silver Spring', State='Maryland', Median Age='33.8', Male Population='40601', Female Population='41862', Total Population='82463', Number of Veterans='1562', Foreign-born='30908', Average Household Size='2.6', State Code='MD', Race='Hispanic or Latino', Count='25924')
*** Sample Row: 1 ***
Row(City='Quincy', State='Massachusetts', Median

## Data Domain 4: Airport Codes (Secondary)

This is a simple table of airport codes and corresponding cities. It comes from [here](https://datahub.io/core/airport-codes#data).

### Dataset 4.1: Airport Codes - CSV:

In [34]:
!ls -lh $RAW_AIRPORT_CODES

-rw-r--r-- 1 root root 5.8M Mar 15  2019 raw_input_data/airport_codes_csv/airport-codes_csv.csv


In [35]:
df_airport_raw = spark_read_csv_raw(
    spark=spark,
    file_path=RAW_AIRPORT_CODES)

spark_df_overview(
    spark_df=df_airport_raw,
    spark_df_name="df_airport_raw")

Spark DataFrame Name: df_airport_raw
Total Columns: 1
Total Rows: 55076
Spark DataFrame Structure
root
 |-- _c0: string (nullable = true)

*** Sample Row: 0 ***
Row(_c0='ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates')
*** Sample Row: 1 ***
Row(_c0='00A,heliport,Total Rf Heliport,11,NA,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"')
*** Sample Row: 2 ***
Row(_c0='00AA,small_airport,Aero B Ranch Airport,3435,NA,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"')
*** Sample Row: 3 ***
Row(_c0='00AK,small_airport,Lowell Field,450,NA,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"')
*** Sample Row: 4 ***
Row(_c0='00AL,small_airport,Epps Airpark,820,NA,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"')


In [36]:
df_airport = spark_read_csv(
    spark=spark,
    file_path=RAW_AIRPORT_CODES, 
    delimiter=",", 
    header=True)

spark_df_overview(
    spark_df=df_airport,
    spark_df_name="df_airport")

Spark DataFrame Name: df_airport
Total Columns: 12
Total Rows: 55075
Spark DataFrame Structure
root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

*** Sample Row: 0 ***
Row(ident='00A', type='heliport', name='Total Rf Heliport', elevation_ft='11', continent='NA', iso_country='US', iso_region='US-PA', municipality='Bensalem', gps_code='00A', iata_code=None, local_code='00A', coordinates='-74.93360137939453, 40.07080078125')
*** Sample Row: 1 ***
Row(ident='00AA', type='small_airport', name='Aero B Ranch Airport', elevation_ft='3435', continent='NA', is
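
The `coordinates` column holds two numbers in a single string. In the sample rows the first value looks like a longitude (e.g. `-151.695999146` is outside the valid latitude range), so the sketch below assumes the order is longitude, then latitude. Both that ordering and the helper name `split_coordinates` are assumptions made for illustration.

```python
from typing import Tuple


def split_coordinates(text: str) -> Tuple[float, float]:
    """Split 'lon, lat' into two floats; raises ValueError on any other shape."""
    longitude, latitude = (float(part) for part in text.split(","))
    return longitude, latitude


longitude, latitude = split_coordinates("-74.93360137939453, 40.07080078125")
print(longitude, latitude)
```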

# Mini Conclusion

We now have a feel for the raw datasets provided by Udacity. We have also successfully read these raw files into Spark DataFrames for downstream processing.