# Goal
### Overall Goal
See how useful Spark is for exploring a large data set.
__*PUDF_base_all_tab.txt*__ is a 10 GB file with 18 M observations on over 250 features.

### Specific to this notebook
Explore how clean eight (*) columns of interest are:
1. PROVIDER_NAME
2. ADMIT_WEEKDAY
3. PAT_AGE
4. RACE
5. ETHNICITY
6. FIRST_PAYMENT_SRC
7. SECONDARY_PAYMENT_SRC
8. ADMITTING_DIAGNOSIS

In [1]:
from os import path, getcwd
import sys

In [2]:
texas_df = spark.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true', delimiter='\t')\
    .load(path.join(getcwd(), "..", "data", "PUDF_base_all_tab.txt"))

# Munging

# Data Exploration
Checking the completeness of the variables listed under __*Goal*__.

### LOS & PAT_AGE

In [5]:
# Count the non-numeric placeholder values in LENGTH_OF_STAY and PAT_AGE.
print("There are non-numerics for LOS and PAT_AGE.\n" + \
      "# LOS with value 'LENGTH_OF_STAY': {}.\n" \
      .format(
        texas_df.filter(texas_df["LENGTH_OF_STAY"] == 'LENGTH_OF_STAY').count()) + \
      "# LOS with value '*': {}.\n" \
      .format(
      texas_df.filter(texas_df["LENGTH_OF_STAY"] == '*').count()) + \
      "# PAT_AGE with value '*': {}.\n" \
      .format(
      texas_df.filter(texas_df["PAT_AGE"] == '*').count()) + \
      "# PAT_AGE with value 'ZZ': {}." \
      .format(texas_df.filter(texas_df["PAT_AGE"] == 'ZZ').count()))

There are non-numerics for LOS and PAT_AGE.
# LOS with value 'LENGTH_OF_STAY': 3.
# LOS with value '*': 8181.
# PAT_AGE with value '*': 983.
# PAT_AGE with value 'ZZ': 152283.
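
Each `filter(...).count()` call above is a separate Spark action, so without caching the file is scanned once per count. A minimal sketch of one way to collect the same four counts in a single aggregation follows, assuming `texas_df` is loaded as in the earlier cell. `SENTINELS`, `sentinel_aliases`, and `count_sentinels` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
# Hypothetical helpers (not from the original notebook).
# Placeholder values to look for, keyed by column name.
SENTINELS = {
    "LENGTH_OF_STAY": ["LENGTH_OF_STAY", "*"],
    "PAT_AGE": ["*", "ZZ"],
}


def sentinel_aliases(sentinels):
    """Return one (column, value, alias) triple per placeholder check."""
    return [
        (column, value, "{}__{}".format(column, i))
        for column, values in sentinels.items()
        for i, value in enumerate(values)
    ]


def count_sentinels(df, sentinels):
    """Count every placeholder value with one aggregation over df."""
    # Imported here so that sentinel_aliases can be used without Spark installed.
    from pyspark.sql import functions as F

    checks = sentinel_aliases(sentinels)
    # sum(when(col == value, 1).otherwise(0)) counts matching rows per check.
    exprs = [
        F.sum(F.when(F.col(column) == value, 1).otherwise(0)).alias(alias)
        for column, value, alias in checks
    ]
    row = df.agg(*exprs).first()
    return {(column, value): row[alias] for column, value, alias in checks}
```

With the DataFrame from this notebook, `count_sentinels(texas_df, SENTINELS)` would return a dictionary keyed by `(column, value)` pairs.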


## Patient Gender

In [25]:
%timeit -n1 -r1 texas_df.groupBy("SEX_CODE").count().orderBy("count", ascending=False).show(20)

+--------+-------+
|SEX_CODE|  count|
+--------+-------+
|       F|1661639|
|       M|1027482|
|    null| 992226|
|   88888| 102457|
|   78521|  70316|
|   78572|  63415|
|       *|  60799|
|   79936|  51113|
|   78501|  50928|
|   78577|  49470|
|   75217|  48805|
|   78539|  48298|
|   75228|  42459|
|   78596|  41771|
|   78550|  41233|
|   75211|  40880|
|   78520|  40843|
|   75216|  40407|
|   77084|  39612|
|   78207|  39258|
+--------+-------+
only showing top 20 rows

1 loop, best of 1: 7min 37s per loop


In [22]:
%timeit -n1 -r1 texas_df.groupBy("LENGTH_OF_STAY").count().orderBy("count", ascending=False).show(20)

+--------------+--------+
|LENGTH_OF_STAY|   count|
+--------------+--------+
|             2|10877316|
|             1| 4410786|
|          0002|  735987|
|          0003|  467292|
|          0001|  466910|
|          0004|  268435|
|          0005|  174801|
|          0006|  130284|
|          0007|  106240|
|          0008|   76916|
|          0009|   56394|
|          0010|   44955|
|          null|   43369|
|          0011|   36529|
|          0012|   28968|
|          0013|   25607|
|          0014|   24318|
|          0015|   19085|
|          0016|   15173|
|          0017|   13030|
+--------------+--------+
only showing top 20 rows

1 loop, best of 1: 3min 13s per loop
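
The table above lists the same length of stay under two spellings, for example `2` and `0002`. The sketch below shows one possible way to merge such rows after the grouped counts have been collected to the driver. `normalize_count` and `merge_counts` are hypothetical helpers, and the sample rows are copied from the output above.

```python
import re
from collections import Counter


def normalize_count(value):
    """Return value as an int, or None if it is not a plain digit string."""
    if value is None:
        return None
    text = str(value).strip()
    # Accept ASCII digits only, so '*' and 'LENGTH_OF_STAY' map to None.
    if re.fullmatch(r"[0-9]+", text) is None:
        return None
    return int(text)


def merge_counts(rows):
    """Sum the counts of rows whose keys normalize to the same integer."""
    merged = Counter()
    for raw_value, count in rows:
        merged[normalize_count(raw_value)] += count
    return dict(merged)


# A few (LENGTH_OF_STAY, count) rows taken from the output above.
rows = [
    ("2", 10877316),
    ("1", 4410786),
    ("0002", 735987),
    ("0001", 466910),
    (None, 43369),
]
merged = merge_counts(rows)
print(merged[2])  # 11613303
print(merged[1])  # 4877696
```

Values that cannot be parsed, including nulls, are grouped under the key `None` so they stay visible rather than being dropped.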


## Admit Diagnosis

In [17]:
## Number of Diagnosis Codes
%timeit -n1 -r1 print(texas_df.groupBy("ADMITTING_DIAGNOSIS").count().count())

11303
1 loop, best of 1: 2min 20s per loop


## Patient Race and Ethnicity

In [19]:
%timeit -n1 -r1 texas_df.groupBy("ETHNICITY").count().orderBy("count", ascending=False).show(20)

+---------+-------+
|ETHNICITY|  count|
+---------+-------+
|     0.00|8905012|
|        2|2043158|
|        1| 765130|
|  1182.00|  32114|
|  2400.00|  28429|
|  1773.00|  27205|
|  2000.00|  26900|
|   591.00|  23950|
|  1604.00|  21883|
|  1528.00|  21617|
|  2160.00|  19552|
|  1850.00|  19396|
|  1406.00|  19042|
|  1108.00|  18642|
|  1456.00|  18492|
|  2364.00|  17633|
|  1000.00|  16816|
|  1624.00|  16401|
|  3000.00|  16201|
|  3240.00|  15961|
+---------+-------+
only showing top 20 rows

1 loop, best of 1: 2min 15s per loop


In [21]:
%timeit -n1 -r1 texas_df.groupBy("RACE").count().orderBy("count", ascending=False).show(20)

+----+--------+
|RACE|   count|
+----+--------+
| 111|15198686|
|   4| 1802112|
|   5|  600784|
|   3|  354070|
|   2|   45705|
| 121|   40574|
| 211|   37878|
| 110|   27573|
| 131|   16543|
| 181|   11919|
|   1|   10989|
| 114|    5890|
|null|    4777|
| 641|     149|
| 711|      96|
| 210|      88|
| 281|      77|
| 171|      46|
| 134|      37|
|   *|      23|
+----+--------+
only showing top 20 rows

1 loop, best of 1: 59.1 s per loop


### Providers

In [20]:
## Number of Providers
%timeit -n1 -r1 print(texas_df.groupBy("PROVIDER_NAME").count().count())

778
1 loop, best of 1: 1min 33s per loop


# Basic Stats on LOS and PAT_AGE

In [5]:
%timeit -n1 -r1 texas_df.describe("LENGTH_OF_STAY", "PAT_AGE").show()

+-------+-----------------+------------------+
|summary|   LENGTH_OF_STAY|           PAT_AGE|
+-------+-----------------+------------------+
|  count|         18114764|          18147843|
|   mean|2.280949958520556|11.599688703625358|
| stddev|9.438761332878363| 4.850573319201647|
|    min|                *|                 *|
|    max|   LENGTH_OF_STAY|                ZZ|
+-------+-----------------+------------------+

1 loop, best of 1: 1min 20s per loop


In [4]:
%timeit -n1 -r1 texas_df.stat.cov("LENGTH_OF_STAY", "PAT_AGE")

IllegalArgumentException: u'requirement failed: Currently covariance calculation for columns with dataType StringType not supported.'
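
The exception indicates that both columns were read as strings. One possible workaround is to cast both columns to double and drop the resulting nulls before calling `stat.cov`. This is sketched below under the assumption that Spark's cast returns null for unparseable values such as `*` and `ZZ` (the non-ANSI behaviour). `cov_after_cast` is a hypothetical helper name, not part of the original notebook.

```python
def cov_after_cast(df, col_a, col_b):
    """Sample covariance of two string columns after casting them to double."""
    # Imported here so the definition itself does not require Spark.
    from pyspark.sql import functions as F

    numeric = df.select(
        F.col(col_a).cast("double").alias(col_a),
        F.col(col_b).cast("double").alias(col_b),
    ).dropna()  # assumption: non-numeric strings became null in the cast
    return numeric.stat.cov(col_a, col_b)
```

With the DataFrame from this notebook the call would be `cov_after_cast(texas_df, "LENGTH_OF_STAY", "PAT_AGE")`. Rows where either value is a placeholder are excluded from the result.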