# Goal
### Overall Goal
See how useful Spark is for exploring a large data set.
__*PUDF_base_all_tab.txt*__ is 10 GB and contains 18 M observations of 260 features.

### Specific to this notebook
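Generate a cleaner data set and do additional analysis on the columns below. Struck-through columns are removed and italicized columns have their invalid values handled, as justified in the tables that follow: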
1. PROVIDER_NAME
2. *ADMIT_WEEKDAY*
3. *pat_age*
4. ~~RACE~~
5. *ETHNICITY*
6. ~~FIRST_PAYMENT_SRC~~
7. ~~SECONDARY_PAYMENT_SRC~~
8. ADMITTING_DIAGNOSIS
9. *TYPE_OF_ADMISSION*
10. *SOURCE_OF_ADMISSION*
11. ~~SEX_CODE~~

__*Justifying Columns to be Removed*__
<table align="left">
<tr>
<th>Column</th><th>Reason</th>
</tr>
<tr><td>SEX_CODE</td><td>Data seems invalid. Male and Female together account for less than 40% of observations.</td></tr>
<tr><td>RACE</td><td>Over 70% of the race values are not in the dictionary.</td></tr>
<tr><td>FIRST_PAYMENT_SRC</td><td>Variable is not in the dictionary, and its values don't match any related columns in the dictionary.</td></tr>
<tr><td>SECONDARY_PAYMENT_SRC</td><td>Variable is not in the dictionary, and its values don't match any related columns in the dictionary.</td></tr>
</table>
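
The percentages in the table above come down to one measurement: the share of observations whose value appears in the data dictionary. A minimal plain-Python sketch of that measurement follows. The helper name `valid_share`, the sample values, and the code set `{"M", "F"}` are hypothetical and are not taken from the data set or its dictionary.

```python
from collections import Counter


def valid_share(values, valid_codes):
    """Return the fraction of observations whose value is in valid_codes."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    valid = sum(n for value, n in counts.items() if value in valid_codes)
    return valid / total


# Hypothetical sample and code set, for illustration only.
sample_sex_codes = ["M", "F", "9", "", "X", "M", "9", "9", "U", "9"]
share = valid_share(sample_sex_codes, {"M", "F"})
print("{:.0%} of the sample values are in the dictionary".format(share))
```

On the full data set the per-value counts would come from a Spark `groupBy(col).count()` rather than a Python list; the ratio is computed the same way.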

__*Handling Invalid Data*__
<table align="left">
<tr>
<th>Column</th><th>Comments</th>
</tr>
<tr><td>SOURCE_OF_ADMISSION</td><td>Most values are not in the dictionary, but the valid ones, such as <strong>doctor's referral</strong> and <strong>ED</strong>, seem useful. Will treat explicit and implicit invalid values as null.</td></tr>
<tr><td>ETHNICITY</td><td>Most of the data is invalid. However, the valid values indicate whether the patient is Latin or not. Decided to keep the valid values and assume all invalid ones are non-Latin.</td></tr>
<tr><td>TYPE_OF_ADMISSION</td><td>Most values are in the dictionary. Will treat explicit and implicit invalid values as null.</td></tr>
<tr><td>ADMIT_WEEKDAY</td><td>Most values are in the dictionary. Will treat explicit and implicit invalid values as null.</td></tr>
<tr><td>pat_age</td><td>Will remove a few non-numeric values.</td></tr>
<tr><td>LENGTH_OF_STAY</td><td>Most values are in the dictionary. Will treat explicit and implicit invalid values as null.</td></tr>
</table>
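
The table above uses three rules: replace values outside the dictionary with null, replace them with a default category (ETHNICITY), and discard non-numeric values (pat_age). The plain-Python sketch below makes each rule concrete on single values. All function names, code sets, and sample values are hypothetical; the real codes live in the data dictionary.

```python
def null_if_invalid(value, valid_codes):
    """Keep a value that is in the dictionary; otherwise return None (null)."""
    return value if value in valid_codes else None


def default_if_invalid(value, valid_codes, default):
    """Keep a value that is in the dictionary; otherwise use a default category."""
    return value if value in valid_codes else default


def numeric_or_none(value):
    """Parse an integer age; return None for missing or non-numeric input."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Hypothetical code sets and raw values, for illustration only.
admission_types = {"1", "2", "3"}
ethnicity_codes = {"1", "2"}

cleaned_types = [null_if_invalid(v, admission_types) for v in ["1", "`", "3", None]]
cleaned_ethnicity = [default_if_invalid(v, ethnicity_codes, "2") for v in ["1", "X", ""]]
cleaned_ages = [numeric_or_none(v) for v in ["42", "n/a", None, "7"]]
```

In Spark, the null rule could plausibly be written with `pyspark.sql.functions.when` and `Column.isin`, because `when` without an `otherwise` clause yields null for rows that do not match.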

In [1]:
from os import path, getcwd
import sys
import pandas as pd
import numpy as np

In [15]:
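def get_distinct_number(df, col):
    """Return the number of distinct values in `col`, or 0 on bad input.

    Nulls form their own group, so a column with nulls counts one extra value.
    """
    if df is None or col is None:
        print("The dataframe or col is invalid")
        return 0
    elif col not in df.columns:
        print("The column '{}' is not in the dataframe.".format(col))
        # Return 0 here too, so callers always get a number back.
        return 0
    else:
        # One group per distinct value, so counting the groups counts the values.
        return df.groupBy(col).count().count()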

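def get_topN_group(df, col, sort=True, highest_first=True):
    """Return a DataFrame of per-value counts for `col`, or 0 on bad input.

    Despite the name, no top-N limit is applied; call .limit(n) or .show(n)
    on the result to keep only the first n groups.
    """
    if df is None or col is None:
        print("The dataframe or col is invalid")
        return 0
    elif col not in df.columns:
        print("The column '{}' is not in the dataframe.".format(col))
        # Match the invalid-input branch instead of implicitly returning None.
        return 0
    else:
        counts = df.groupBy(col).count()
        if sort:
            # highest_first=True sorts descending by count.
            return counts.orderBy("count", ascending=not highest_first)
        return counts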

In [2]:
texas_df = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true', delimiter='\t')\
    .load(path.join(getcwd(), "..", "data", "PUDF_base_all_tab.txt"))

# Munging