# Spark Wrangle Exercises
These exercises use the `case.csv`, `dept.csv`, and `source.csv` files from the San Antonio 311 call dataset.
- Read the `case`, `department`, and `source` data into their own spark dataframes.
- Let's see how writing to the local disk works in spark:
    - Write the code necessary to store the source data in both __csv__ and __json__ format. Store these as `sources_csv` and `sources_json`.
    - Inspect your _folder_ structure. What do you notice?

- Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.  

In [1]:
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StructField, StringType

In [2]:
# Create a spark session object to build spark dataframes.
spark = SparkSession.builder.getOrCreate()

In [3]:
# 1. Read the case, dept, and source datasets into spark dataframes.
df_case = spark.read.csv('data/case.csv', header=True, sep=',', inferSchema=True)
df_dept = spark.read.csv('data/dept.csv', header=True, sep=',', inferSchema=True)
df_source = spark.read.csv('data/source.csv', header=True, sep=',', inferSchema=True)

## case.csv

In [4]:
# The dimensions of the case dataset
df_case.count(), len(df_case.columns)

(841704, 14)

In [46]:
df_case.summary?

In [49]:
# Show the number of non-null values per column: the first row of summary() is the count.
df_case.summary().show(1, truncate=False, vertical=True)

-RECORD 0----------------------
 summary              | count  
 case_id              | 841704 
 case_opened_date     | 841704 
 case_closed_date     | 823594 
 SLA_due_date         | 841671 
 case_late            | 841704 
 num_days_late        | 841671 
 case_closed          | 841704 
 dept_division        | 841704 
 service_request_type | 841704 
 SLA_days             | 841671 
 case_status          | 841704 
 source_id            | 841704 
 request_address      | 841704 
 council_district     | 841704 
only showing top 1 row



In [5]:
# Display the column names and data types
df_case.dtypes

[('case_id', 'int'),
 ('case_opened_date', 'string'),
 ('case_closed_date', 'string'),
 ('SLA_due_date', 'string'),
 ('case_late', 'string'),
 ('num_days_late', 'double'),
 ('case_closed', 'string'),
 ('dept_division', 'string'),
 ('service_request_type', 'string'),
 ('SLA_days', 'double'),
 ('case_status', 'string'),
 ('source_id', 'string'),
 ('request_address', 'string'),
 ('council_district', 'int')]

In [19]:
# Display records from the case dataset to understand what an observation represents.
# The `case` dataset has many columns that wrap around when displayed.
# Pass: truncate=False, vertical=True to display the full values of each column.
df_case.show(2, truncate=False, vertical=True)

-RECORD 0----------------------------------------------------
 case_id              | 1014127332                           
 case_opened_date     | 1/1/18 0:42                          
 case_closed_date     | 1/1/18 12:29                         
 SLA_due_date         | 9/26/20 0:42                         
 case_late            | NO                                   
 num_days_late        | -998.5087616000001                   
 case_closed          | YES                                  
 dept_division        | Field Operations                     
 service_request_type | Stray Animal                         
 SLA_days             | 999.0                                
 case_status          | Closed                               
 source_id            | svcCRMLS                             
 request_address      | 2315  EL PASO ST, San Antonio, 78207 
 council_district     | 5                                    
-RECORD 1----------------------------------------------------
 case_id

### `case` dataset: Record 0 takeaway
Each row represents a 311 call.
- Unique case_id
- Case opened and closed datetimes.
    - The time between them is the total time it took the department to complete the case: paperwork, travel, tasks, etc.
- The SLA due date is set 999 days from the case open date.
- This type of case is late if it stays open longer than 999 days.
- For this closed case, the number of days late matches the case closed date minus the SLA due date, so it would be -999 at the moment the case was opened.
    - The value here is about -998.5 because the case was closed roughly half a day after it was opened.
- Department division identifies the city department assigned to the case.
- Service request type identifies the type of request issued to the department.
- SLA days is the maximum number of days this case can remain open before it is late.
- Case status identifies whether a case is open or closed.
- Source id may represent the division/personnel generating the request.
- Request address is the location requesting 311 service.
- Council district represents the district a location resides in.
    - Council representatives can find common 311 requests among their constituents.
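
The arithmetic behind these takeaways can be checked with plain Python on the three timestamps from Record 0. This is only a sanity check under the assumption that `num_days_late` is the closed date minus the SLA due date. The displayed timestamps are rounded to the minute, so the result agrees with the recorded `-998.5087616000001` only to about three decimal places.

```python
from datetime import datetime

# Timestamps copied from Record 0; the format is inferred from the displayed values.
fmt = "%m/%d/%y %H:%M"
opened = datetime.strptime("1/1/18 0:42", fmt)
closed = datetime.strptime("1/1/18 12:29", fmt)
sla_due = datetime.strptime("9/26/20 0:42", fmt)

seconds_per_day = 86400

# Days between opening and the SLA due date (should match SLA_days).
sla_days = (sla_due - opened).total_seconds() / seconds_per_day

# Days past the SLA due date at closing (negative means closed early).
days_late = (closed - sla_due).total_seconds() / seconds_per_day

print(sla_days)
print(days_late)
```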

## dept.csv

In [6]:
# The dimensions of the dept dataset
df_dept.count(), len(df_dept.columns)

(39, 4)

In [7]:
# Display the column names and datatypes
# Column data types are correct
df_dept.dtypes

[('dept_division', 'string'),
 ('dept_name', 'string'),
 ('standardized_dept_name', 'string'),
 ('dept_subject_to_SLA', 'string')]

## source.csv

In [8]:
# The dimensions of the source dataset
df_source.count(), len(df_source.columns)

(140, 2)

In [9]:
# Display the column names and datatypes
# Column data types are correct
df_source.dtypes

[('source_id', 'string'), ('source_username', 'string')]

Save the source dataset in csv and JSON format.
```python
df_source.write.json('data/source_json', mode='overwrite')
df_source.write.csv('data/source_csv', mode='overwrite')
```

## 1. How old is the latest (in terms of days past SLA) currently open issue? How long has the oldest (in terms of days since opened) currently open issue been open?
> This question refers to the `case` dataset.



In [37]:
df_case.where(col('case_status') == 'Open').orderBy(col('num_days_late')).show(1, truncate=False, vertical=True)

-RECORD 0----------------------------------------------------
 case_id              | 1014437272                           
 case_opened_date     | 4/18/18 14:37                        
 case_closed_date     | null                                 
 SLA_due_date         | null                                 
 case_late            | NO                                   
 num_days_late        | null                                 
 case_closed          | NO                                   
 dept_division        | District 3                           
 service_request_type | Request for Research/Information     
 SLA_days             | null                                 
 case_status          | Open                                 
 source_id            | np26458                              
 request_address      | 500  HANSFORD ST, San Antonio, 78210 
 council_district     | 3                                    
only showing top 1 row



## 2. How many Stray Animal cases are there?

## 3. How many service requests that are assigned to the Field Operations department (dept_division) are not classified as "Officer Standby" request type (service_request_type)?

## 4. Convert the council_district column to a string column.

## 5. Extract the year from the case_closed_date column.

## 6. Convert num_days_late from days to hours in a new column, num_hours_late.

## 7. Join the case data with the source and department data.

## 8. Are there any cases that do not have a request source?

## 9. What are the top 10 service request types in terms of number of requests?

## 10. What are the top 10 service request types in terms of average days late?

## 11. Does number of days late depend on department?

## 12. How does the number of days late depend on department and request type?