# Exercises

## Data Acquisition

These exercises use the `case.csv`, `dept.csv`, and `source.csv` files from the San Antonio 311 call data set.

1. Read the case, department, and source data into their own Spark dataframes.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [2]:
spark = SparkSession.builder.getOrCreate()

In [3]:
source_df = spark.read.csv('data/source.csv', sep = ',', header = True, inferSchema = True)

In [4]:
source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|Merlene Blodgett|
|   103582|     Carmen Cura|
|   106463| Richard Sanchez|
|   119403|  Betty De Hoyos|
|   119555|  Socorro Quiara|
+---------+----------------+
only showing top 5 rows



In [27]:
case_df = spark.read.csv('data/case.csv', sep = ',', header = True, inferSchema = True)

In [6]:
case_df.show(5, vertical = True, truncate = False)

-RECORD 0-----------------------------------------------------
 case_id              | 1014127332                            
 case_opened_date     | 1/1/18 0:42                           
 case_closed_date     | 1/1/18 12:29                          
 SLA_due_date         | 9/26/20 0:42                          
 case_late            | NO                                    
 num_days_late        | -998.5087616000001                    
 case_closed          | YES                                   
 dept_division        | Field Operations                      
 service_request_type | Stray Animal                          
 SLA_days             | 999.0                                 
 case_status          | Closed                                
 source_id            | svcCRMLS                              
 request_address      | 2315  EL PASO ST, San Antonio, 78207  
 council_district     | 5                                     
-RECORD 1----------------------------------------------

In [7]:
dept_df = spark.read.csv('data/dept.csv', sep = ',', header = True, inferSchema = True)

In [8]:
dept_df.show(5, vertical = True)

-RECORD 0--------------------------------------
 dept_division          | 311 Call Center      
 dept_name              | Customer Service     
 standardized_dept_name | Customer Service     
 dept_subject_to_SLA    | YES                  
-RECORD 1--------------------------------------
 dept_division          | Brush                
 dept_name              | Solid Waste Manag... 
 standardized_dept_name | Solid Waste          
 dept_subject_to_SLA    | YES                  
-RECORD 2--------------------------------------
 dept_division          | Clean and Green      
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 3--------------------------------------
 dept_division          | Clean and Green N... 
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 4-------------------------------

2. Let's see how writing to the local disk works in Spark:
    - Write the code necessary to store the source data in both CSV and JSON format. Store these as `sources_csv` and `sources_json`.
    - Inspect your folder structure. What do you notice?

In [9]:
source_df.write.csv('data/sources_csv', mode = 'overwrite', header = True)

In [10]:
source_df.write.json('data/sources_json', mode = 'overwrite')
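
One way to answer the folder-structure question is to list what Spark actually wrote. Spark treats the path passed to `write.csv` or `write.json` as a directory and writes one `part-` file per partition into it, usually alongside a `_SUCCESS` marker; on a local filesystem you may also see hidden `.crc` checksum files. The sketch below uses only the standard library. The helper name `list_output_files` is made up for illustration, and the paths assume the two write cells above have been run.

```python
from pathlib import Path


def list_output_files(output_dir):
    """Return the relative path of every file under output_dir, sorted."""
    root = Path(output_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob('*') if p.is_file()
    )


# Paths match the write cells above; a directory that has not been
# written yet is simply skipped.
for output_dir in ('data/sources_csv', 'data/sources_json'):
    if Path(output_dir).is_dir():
        print(output_dir)
        for name in list_output_files(output_dir):
            print('   ', name)
```

Because each output path is a directory rather than a single file, reading the data back with `spark.read.csv` or `spark.read.json` should point at the directory, not at an individual part file.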

3. Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.

In [11]:
case_df.show(5, vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 1/1/18 0:42          
 case_closed_date     | 1/1/18 12:29         
 SLA_due_date         | 9/26/20 0:42         
 case_late            | NO                   
 num_days_late        | -998.5087616000001   
 case_closed          | YES                  
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 1/1/18 0:46          
 case_closed_date     | 1/3/18 8:11          
 SLA_due_date         | 1/5/18 8:30          
 case_late            | NO                   
 num_days_late        | -2.0126041

In [29]:
case_df.describe()

DataFrame[summary: string, case_id: string, case_opened_date: string, case_closed_date: string, case_due_date: string, case_late: string, num_days_late: string, case_closed: string, dept_division: string, service_request_type: string, SLA_days: string, case_status: string, source_id: string, request_address: string, council_district: string]
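
Note that `describe()` returns a summary dataframe whose columns are all strings, so the output above does not reveal the types Spark inferred. The `dtypes` attribute of a dataframe (a list of `(column, type)` pairs) and the `printSchema()` method show the actual types. The sketch below is a small helper, `string_columns` (a hypothetical name), that lists the columns still stored as strings. The `SimpleNamespace` object is only a stand-in with made-up types so the snippet runs without a Spark session.

```python
from types import SimpleNamespace


def string_columns(df):
    """Return the names of the columns whose Spark type is 'string'."""
    return [name for name, dtype in df.dtypes if dtype == 'string']


# Stand-in for a Spark dataframe: only the `dtypes` attribute is imitated,
# and the types listed here are illustrative, not the real inferred schema.
fake_df = SimpleNamespace(dtypes=[
    ('case_id', 'int'),
    ('case_opened_date', 'string'),
    ('case_late', 'string'),
    ('num_days_late', 'double'),
])

print(string_columns(fake_df))  # ['case_opened_date', 'case_late']
```

With the notebook's dataframes, `string_columns(case_df)` would list the candidates to consider for casting.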

In [28]:
# Rename Columns
# We'll rename `SLA_due_date` to `case_due_date` to match the naming of the other date columns.

case_df = case_df.withColumnRenamed('SLA_due_date', 'case_due_date')
case_df.show(5, vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 1/1/18 0:42          
 case_closed_date     | 1/1/18 12:29         
 case_due_date        | 9/26/20 0:42         
 case_late            | NO                   
 num_days_late        | -998.5087616000001   
 case_closed          | YES                  
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 1/1/18 0:46          
 case_closed_date     | 1/3/18 8:11          
 case_due_date        | 1/5/18 8:30          
 case_late            | NO                   
 num_days_late        | -2.0126041

In [30]:
# Correct Data Types.
# Two columns, `case_closed` and `case_late`, store yes/no values.
# Currently Spark treats them as strings; let's turn them into booleans.

# Demonstrating that we only have YES/NO in each field.

case_df.groupBy('case_closed', 'case_late').count().show()


+-----------+---------+------+
|case_closed|case_late| count|
+-----------+---------+------+
|         NO|      YES|  6525|
|        YES|      YES| 87978|
|         NO|       NO| 11585|
|        YES|       NO|735616|
+-----------+---------+------+



In [31]:
case_df = (
    case_df.withColumn('case_closed', expr('case_closed == "YES"'))
    .withColumn('case_late', expr('case_late == "YES"'))
)

case_df.select('case_closed', 'case_late').show(5)

+-----------+---------+
|case_closed|case_late|
+-----------+---------+
|       true|    false|
|       true|    false|
|       true|    false|
|       true|    false|
|       true|     true|
+-----------+---------+
only showing top 5 rows



In [32]:
case_df.describe()

DataFrame[summary: string, case_id: string, case_opened_date: string, case_closed_date: string, case_due_date: string, num_days_late: string, dept_division: string, service_request_type: string, SLA_days: string, case_status: string, source_id: string, request_address: string, council_district: string]

Now we will handle the three columns that contain dates. We'll use Spark's `to_timestamp` function for this.

For this to work properly, we need to provide the date format when we call `to_timestamp`. The format syntax differs from the date formatting we've worked with in pandas because Spark uses Java's `SimpleDateFormat` patterns.
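
For comparison with pandas-style parsing, the sketch below parses the first record's dates with Python's `datetime.strptime`, whose directives `%m/%d/%y %H:%M` play the same role here as the Spark pattern `M/d/yy H:mm` used in this notebook. It also checks two assumptions that the data set does not state: that `SLA_days` is the gap between the due date and the opened date, and that `num_days_late` is the gap between the closed date and the due date. The sample values are copied from the first record of `case.csv` as displayed earlier.

```python
from datetime import datetime

# strptime directives that correspond to the Spark pattern 'M/d/yy H:mm'
PY_FMT = '%m/%d/%y %H:%M'

# Values from the first record of case.csv as displayed in this notebook
opened = datetime.strptime('1/1/18 0:42', PY_FMT)
closed = datetime.strptime('1/1/18 12:29', PY_FMT)
due = datetime.strptime('9/26/20 0:42', PY_FMT)

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: SLA_days = due - opened, measured in days
sla_days = (due - opened).total_seconds() / SECONDS_PER_DAY

# Assumption: num_days_late = closed - due, in days (negative means early)
days_late = (closed - due).total_seconds() / SECONDS_PER_DAY

print(opened)                # 2018-01-01 00:42:00
print(sla_days)              # 999.0
print(round(days_late, 2))   # -998.51
```

The first record lists `SLA_days` as 999.0 and `num_days_late` as roughly -998.5088, so both assumptions are consistent with that row. The small difference in the second value is expected because the displayed dates carry no seconds. A check like this is a quick way to confirm that each parsed column came from the intended source column.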

In [33]:
case_df.select('case_opened_date', 'case_closed_date', 'case_due_date').describe()

DataFrame[summary: string, case_opened_date: string, case_closed_date: string, case_due_date: string]

In [34]:
case_df.select('case_opened_date', 'case_closed_date', 'case_due_date').show(5)

+----------------+----------------+-------------+
|case_opened_date|case_closed_date|case_due_date|
+----------------+----------------+-------------+
|     1/1/18 0:42|    1/1/18 12:29| 9/26/20 0:42|
|     1/1/18 0:46|     1/3/18 8:11|  1/5/18 8:30|
|     1/1/18 0:48|     1/2/18 7:57|  1/5/18 8:30|
|     1/1/18 1:29|     1/2/18 8:13| 1/17/18 8:30|
|     1/1/18 1:34|    1/1/18 13:29|  1/1/18 4:34|
+----------------+----------------+-------------+
only showing top 5 rows



In [35]:
fmt = 'M/d/yy H:mm'
fmt

'M/d/yy H:mm'

In [36]:
case_df = (
    case_df.withColumn('case_opened_date', to_timestamp('case_opened_date', fmt))
    # Parse the closed date from its own column, not from case_opened_date.
    .withColumn('case_closed_date', to_timestamp('case_closed_date', fmt))
    .withColumn('case_due_date', to_timestamp('case_due_date', fmt))
)

case_df.select('case_opened_date', 'case_closed_date', 'case_due_date')



DataFrame[case_opened_date: timestamp, case_closed_date: timestamp, case_due_date: timestamp]

In [37]:
case_df.select('case_opened_date', 'case_closed_date', 'case_due_date').show(5)

+-------------------+-------------------+-------------------+
|   case_opened_date|   case_closed_date|      case_due_date|
+-------------------+-------------------+-------------------+
|2018-01-01 00:42:00|2018-01-01 12:29:00|2020-09-26 00:42:00|
|2018-01-01 00:46:00|2018-01-03 08:11:00|2018-01-05 08:30:00|
|2018-01-01 00:48:00|2018-01-02 07:57:00|2018-01-05 08:30:00|
|2018-01-01 01:29:00|2018-01-02 08:13:00|2018-01-17 08:30:00|
|2018-01-01 01:34:00|2018-01-01 13:29:00|2018-01-01 04:34:00|
+-------------------+-------------------+-------------------+
only showing top 5 rows



In [41]:
case_df.describe().show(5,vertical = True)

-RECORD 0------------------------------------
 summary              | count                
 case_id              | 841704               
 num_days_late        | 841671               
 dept_division        | 841704               
 service_request_type | 841704               
 SLA_days             | 841671               
 case_status          | 841704               
 source_id            | 841704               
 request_address      | 841704               
 council_district     | 841704               
-RECORD 1------------------------------------
 summary              | mean                 
 case_id              | 1.0139680837676392E9 
 num_days_late        | -49.07486758369357   
 dept_division        | null                 
 service_request_type | null                 
 SLA_days             | 59.25478976660689    
 case_status          | null                 
 source_id            | 136602.73663950132   
 request_address      | null                 
 council_district     | 4.62516870

In [40]:
case_df.show(5,vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 12:29:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 2018-01-01 00:46:00  
 case_closed_date     | 2018-01-03 08:11:00  
 case_due_date        | 2018-01-05 08:30:00  
 case_late            | false                
 num_days_late        | -2.0126041

Now let's look at `dept.csv`.

In [43]:
dept_df.show(5,vertical = True)

-RECORD 0--------------------------------------
 dept_division          | 311 Call Center      
 dept_name              | Customer Service     
 standardized_dept_name | Customer Service     
 dept_subject_to_SLA    | YES                  
-RECORD 1--------------------------------------
 dept_division          | Brush                
 dept_name              | Solid Waste Manag... 
 standardized_dept_name | Solid Waste          
 dept_subject_to_SLA    | YES                  
-RECORD 2--------------------------------------
 dept_division          | Clean and Green      
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 3--------------------------------------
 dept_division          | Clean and Green N... 
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 4-------------------------------

In [50]:
dept_df.groupBy('dept_name','dept_division').count().show(40, truncate=False)

+-------------------------+-----------------------------+-----+
|dept_name                |dept_division                |count|
+-------------------------+-----------------------------+-----+
|Trans & Cap Improvements |Traffic Engineering Design   |1    |
|Trans & Cap Improvements |Signals                      |1    |
|Parks and Recreation     |Tree Crew                    |1    |
|Metro Health             |Vector                       |1    |
|Code Enforcement Services|Code Enforcement             |1    |
|Trans & Cap Improvements |Storm Water                  |1    |
|Parks and Recreation     |Clean and Green              |1    |
|Animal Care Services     |Field Operations             |1    |
|Trans & Cap Improvements |Director's Office Horizontal |1    |
|Customer Service         |311 Call Center              |1    |
|Code Enforcement Services|Graffiti                     |1    |
|null                     |Code Enforcement (Internal)  |1    |
|Solid Waste Management   |Waste Collect

In [53]:
# Change `dept_subject_to_SLA` to Boolean

dept_df.select('dept_subject_to_SLA').groupBy('dept_subject_to_SLA').count().show()

+-------------------+-----+
|dept_subject_to_SLA|count|
+-------------------+-----+
|                YES|   31|
|                 NO|    8|
+-------------------+-----+



In [54]:
dept_df = dept_df.withColumn('dept_subject_to_SLA', expr('dept_subject_to_SLA == "YES"'))

dept_df.show(5, vertical = True)

-RECORD 0--------------------------------------
 dept_division          | 311 Call Center      
 dept_name              | Customer Service     
 standardized_dept_name | Customer Service     
 dept_subject_to_SLA    | true                 
-RECORD 1--------------------------------------
 dept_division          | Brush                
 dept_name              | Solid Waste Manag... 
 standardized_dept_name | Solid Waste          
 dept_subject_to_SLA    | true                 
-RECORD 2--------------------------------------
 dept_division          | Clean and Green      
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | true                 
-RECORD 3--------------------------------------
 dept_division          | Clean and Green N... 
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | true                 
-RECORD 4-------------------------------

In [57]:
dept_df.describe()

DataFrame[summary: string, dept_division: string, dept_name: string, standardized_dept_name: string]

Let's look at `source_df`.

In [55]:
source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|Merlene Blodgett|
|   103582|     Carmen Cura|
|   106463| Richard Sanchez|
|   119403|  Betty De Hoyos|
|   119555|  Socorro Quiara|
+---------+----------------+
only showing top 5 rows



In [56]:
source_df.describe()

DataFrame[summary: string, source_id: string, source_username: string]

In [58]:
source_df = source_df.withColumn('source_username', trim(lower(source_df.source_username)))

source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|merlene blodgett|
|   103582|     carmen cura|
|   106463| richard sanchez|
|   119403|  betty de hoyos|
|   119555|  socorro quiara|
+---------+----------------+
only showing top 5 rows



1. How old is the latest (in terms of days past SLA) currently open issue? How long has the oldest (in terms of days since opened) currently open issue been open?