# Exercises

## Data Aquisition

These exercises use the `case.csv`, `dept.csv`, and `source.csv` files from the San Antonio 311 call data set.

1. Read the case, department, and source data into their own spark dataframes.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [2]:
spark = SparkSession.builder.getOrCreate()

In [3]:
source_df = spark.read.csv('data/source.csv', sep = ',', header = True, inferSchema = True)

In [4]:
source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|Merlene Blodgett|
|   103582|     Carmen Cura|
|   106463| Richard Sanchez|
|   119403|  Betty De Hoyos|
|   119555|  Socorro Quiara|
+---------+----------------+
only showing top 5 rows



In [5]:
case_df = spark.read.csv('data/case.csv', sep = ',', header = True, inferSchema = True)

In [6]:
case_df.show(5, vertical = True, truncate = False)

-RECORD 0-----------------------------------------------------
 case_id              | 1014127332                            
 case_opened_date     | 1/1/18 0:42                           
 case_closed_date     | 1/1/18 12:29                          
 SLA_due_date         | 9/26/20 0:42                          
 case_late            | NO                                    
 num_days_late        | -998.5087616000001                    
 case_closed          | YES                                   
 dept_division        | Field Operations                      
 service_request_type | Stray Animal                          
 SLA_days             | 999.0                                 
 case_status          | Closed                                
 source_id            | svcCRMLS                              
 request_address      | 2315  EL PASO ST, San Antonio, 78207  
 council_district     | 5                                     
-RECORD 1----------------------------------------------

In [7]:
dept_df = spark.read.csv('data/dept.csv', sep = ',', header = True, inferSchema = True)

In [8]:
dept_df.show(5, vertical = True)

-RECORD 0--------------------------------------
 dept_division          | 311 Call Center      
 dept_name              | Customer Service     
 standardized_dept_name | Customer Service     
 dept_subject_to_SLA    | YES                  
-RECORD 1--------------------------------------
 dept_division          | Brush                
 dept_name              | Solid Waste Manag... 
 standardized_dept_name | Solid Waste          
 dept_subject_to_SLA    | YES                  
-RECORD 2--------------------------------------
 dept_division          | Clean and Green      
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 3--------------------------------------
 dept_division          | Clean and Green N... 
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 4-------------------------------

2. Let's see how writing to the local disk works in spark:
    - Write the code necessary to store the source data in both csv, and json format, store these as `sources_csv` and `sources_json`
    - Inspect your folder structure. What do you notice?

In [9]:
source_df.write.csv('data/sources_csv', mode = 'overwrite', header = True)

In [10]:
source_df.write.json('data/sources_json', mode = 'overwrite')

3. Inspect the data in your dataframes. Are the data types appropriate? Write the code necessary to cast the values to the appropriate types.

In [11]:
case_df.show(5, vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 1/1/18 0:42          
 case_closed_date     | 1/1/18 12:29         
 SLA_due_date         | 9/26/20 0:42         
 case_late            | NO                   
 num_days_late        | -998.5087616000001   
 case_closed          | YES                  
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 1/1/18 0:46          
 case_closed_date     | 1/3/18 8:11          
 SLA_due_date         | 1/5/18 8:30          
 case_late            | NO                   
 num_days_late        | -2.0126041

In [12]:
case_df.describe()

DataFrame[summary: string, case_id: string, case_opened_date: string, case_closed_date: string, SLA_due_date: string, case_late: string, num_days_late: string, case_closed: string, dept_division: string, service_request_type: string, SLA_days: string, case_status: string, source_id: string, request_address: string, council_district: string]

In [13]:
# Rename Columns
# We'll rename this column to match the other date-type columns.

case_df = case_df.withColumnRenamed('SLA_due_date', 'case_due_date')
case_df.show(5, vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 1/1/18 0:42          
 case_closed_date     | 1/1/18 12:29         
 case_due_date        | 9/26/20 0:42         
 case_late            | NO                   
 num_days_late        | -998.5087616000001   
 case_closed          | YES                  
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 1/1/18 0:46          
 case_closed_date     | 1/3/18 8:11          
 case_due_date        | 1/5/18 8:30          
 case_late            | NO                   
 num_days_late        | -2.0126041

In [14]:
# Correct Data Types.
# Two columns `case_closed` and `case_late` store yes/no values.
# Currently spark thinks they are strings; Let's turn them into booleans

# demostrating we only have yes/no in each field.

case_df.groupBy('case_closed', 'case_late').count().show()


+-----------+---------+------+
|case_closed|case_late| count|
+-----------+---------+------+
|         NO|      YES|  6525|
|        YES|      YES| 87978|
|         NO|       NO| 11585|
|        YES|       NO|735616|
+-----------+---------+------+



In [15]:
case_df = case_df.withColumn('case_closed', expr('case_closed == "YES"')).withColumn('case_late', expr('case_late == "YES"'))

case_df.select('case_closed', 'case_late').show(5)

+-----------+---------+
|case_closed|case_late|
+-----------+---------+
|       true|    false|
|       true|    false|
|       true|    false|
|       true|    false|
|       true|     true|
+-----------+---------+
only showing top 5 rows



In [16]:
case_df.describe()

DataFrame[summary: string, case_id: string, case_opened_date: string, case_closed_date: string, case_due_date: string, num_days_late: string, dept_division: string, service_request_type: string, SLA_days: string, case_status: string, source_id: string, request_address: string, council_district: string]

Now we will handle the 3 columns that have dates in them. We'll use spark's `to_timestamp` function for this.

In order to work properly, we'll need to provide the date format when use the `to_timestamp`. The date format is a little different than the date functionality we've worked with in pandas, this because it using JaveSimpleDateFormat.

In [17]:
case_df.select('case_opened_date', 'case_closed_date', 'case_due_date').describe()

DataFrame[summary: string, case_opened_date: string, case_closed_date: string, case_due_date: string]

In [18]:
case_df.select('case_opened_date', 'case_closed_date', 'case_due_date').show(5)

+----------------+----------------+-------------+
|case_opened_date|case_closed_date|case_due_date|
+----------------+----------------+-------------+
|     1/1/18 0:42|    1/1/18 12:29| 9/26/20 0:42|
|     1/1/18 0:46|     1/3/18 8:11|  1/5/18 8:30|
|     1/1/18 0:48|     1/2/18 7:57|  1/5/18 8:30|
|     1/1/18 1:29|     1/2/18 8:13| 1/17/18 8:30|
|     1/1/18 1:34|    1/1/18 13:29|  1/1/18 4:34|
+----------------+----------------+-------------+
only showing top 5 rows



In [19]:
fmt = 'M/d/yy H:mm'
fmt

'M/d/yy H:mm'

In [20]:
case_df = (
    case_df.withColumn('case_opened_date', to_timestamp('case_opened_date', fmt))
    .withColumn('case_closed_date', to_timestamp('case_opened_date', fmt))
    .withColumn('case_due_date', to_timestamp('case_due_date', fmt))
)

case_df.select('case_opened_date', 'case_closed_date', 'case_due_date')



DataFrame[case_opened_date: timestamp, case_closed_date: timestamp, case_due_date: timestamp]

In [21]:
case_df.select('case_opened_date', 'case_closed_date', 'case_due_date').show(5)

+-------------------+-------------------+-------------------+
|   case_opened_date|   case_closed_date|      case_due_date|
+-------------------+-------------------+-------------------+
|2018-01-01 00:42:00|2018-01-01 00:42:00|2020-09-26 00:42:00|
|2018-01-01 00:46:00|2018-01-01 00:46:00|2018-01-05 08:30:00|
|2018-01-01 00:48:00|2018-01-01 00:48:00|2018-01-05 08:30:00|
|2018-01-01 01:29:00|2018-01-01 01:29:00|2018-01-17 08:30:00|
|2018-01-01 01:34:00|2018-01-01 01:34:00|2018-01-01 04:34:00|
+-------------------+-------------------+-------------------+
only showing top 5 rows



In [22]:
case_df.describe().show(5,vertical = True)

-RECORD 0------------------------------------
 summary              | count                
 case_id              | 841704               
 num_days_late        | 841671               
 dept_division        | 841704               
 service_request_type | 841704               
 SLA_days             | 841671               
 case_status          | 841704               
 source_id            | 841704               
 request_address      | 841704               
 council_district     | 841704               
-RECORD 1------------------------------------
 summary              | mean                 
 case_id              | 1.0139680837676392E9 
 num_days_late        | -49.07486758369357   
 dept_division        | null                 
 service_request_type | null                 
 SLA_days             | 59.25478976660689    
 case_status          | null                 
 source_id            | 136602.73663950132   
 request_address      | null                 
 council_district     | 4.62516870

In [23]:
case_df.show(5,vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 00:42:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 2018-01-01 00:46:00  
 case_closed_date     | 2018-01-01 00:46:00  
 case_due_date        | 2018-01-05 08:30:00  
 case_late            | false                
 num_days_late        | -2.0126041

Now let's look at `dept.csv`

In [24]:
dept_df.show(5,vertical = True)

-RECORD 0--------------------------------------
 dept_division          | 311 Call Center      
 dept_name              | Customer Service     
 standardized_dept_name | Customer Service     
 dept_subject_to_SLA    | YES                  
-RECORD 1--------------------------------------
 dept_division          | Brush                
 dept_name              | Solid Waste Manag... 
 standardized_dept_name | Solid Waste          
 dept_subject_to_SLA    | YES                  
-RECORD 2--------------------------------------
 dept_division          | Clean and Green      
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 3--------------------------------------
 dept_division          | Clean and Green N... 
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | YES                  
-RECORD 4-------------------------------

In [25]:
dept_df.groupBy('dept_name','dept_division').count().show(40, truncate=False)

+-------------------------+-----------------------------+-----+
|dept_name                |dept_division                |count|
+-------------------------+-----------------------------+-----+
|Trans & Cap Improvements |Traffic Engineering Design   |1    |
|Trans & Cap Improvements |Signals                      |1    |
|Parks and Recreation     |Tree Crew                    |1    |
|Metro Health             |Vector                       |1    |
|Code Enforcement Services|Code Enforcement             |1    |
|Trans & Cap Improvements |Storm Water                  |1    |
|Parks and Recreation     |Clean and Green              |1    |
|Animal Care Services     |Field Operations             |1    |
|Trans & Cap Improvements |Director's Office Horizontal |1    |
|Customer Service         |311 Call Center              |1    |
|Code Enforcement Services|Graffiti                     |1    |
|null                     |Code Enforcement (Internal)  |1    |
|Solid Waste Management   |Waste Collect

In [26]:
# Change `dept_subject_to_SLA` to Boolean

dept_df.select('dept_subject_to_SLA').groupBy('dept_subject_to_SLA').count().show()

+-------------------+-----+
|dept_subject_to_SLA|count|
+-------------------+-----+
|                YES|   31|
|                 NO|    8|
+-------------------+-----+



In [27]:
dept_df = dept_df.withColumn('dept_subject_to_SLA', expr('dept_subject_to_SLA == "YES"'))

dept_df.show(5, vertical = True)

-RECORD 0--------------------------------------
 dept_division          | 311 Call Center      
 dept_name              | Customer Service     
 standardized_dept_name | Customer Service     
 dept_subject_to_SLA    | true                 
-RECORD 1--------------------------------------
 dept_division          | Brush                
 dept_name              | Solid Waste Manag... 
 standardized_dept_name | Solid Waste          
 dept_subject_to_SLA    | true                 
-RECORD 2--------------------------------------
 dept_division          | Clean and Green      
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | true                 
-RECORD 3--------------------------------------
 dept_division          | Clean and Green N... 
 dept_name              | Parks and Recreation 
 standardized_dept_name | Parks & Recreation   
 dept_subject_to_SLA    | true                 
-RECORD 4-------------------------------

In [28]:
dept_df.describe()

DataFrame[summary: string, dept_division: string, dept_name: string, standardized_dept_name: string]

Let's look at `source_df`.

In [29]:
source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|Merlene Blodgett|
|   103582|     Carmen Cura|
|   106463| Richard Sanchez|
|   119403|  Betty De Hoyos|
|   119555|  Socorro Quiara|
+---------+----------------+
only showing top 5 rows



In [30]:
source_df.describe()

DataFrame[summary: string, source_id: string, source_username: string]

In [31]:
source_df = source_df.withColumn('source_username', trim(lower(source_df.source_username)))

source_df.show(5)

+---------+----------------+
|source_id| source_username|
+---------+----------------+
|   100137|merlene blodgett|
|   103582|     carmen cura|
|   106463| richard sanchez|
|   119403|  betty de hoyos|
|   119555|  socorro quiara|
+---------+----------------+
only showing top 5 rows



1. How old is the latest(in terms of days past SLA) currently open issue? How long has the oldest (in terms of days since opened) currently opened issue been open?

In [33]:
case_df.describe().show(vertical=True)

-RECORD 0------------------------------------
 summary              | count                
 case_id              | 841704               
 num_days_late        | 841671               
 dept_division        | 841704               
 service_request_type | 841704               
 SLA_days             | 841671               
 case_status          | 841704               
 source_id            | 841704               
 request_address      | 841704               
 council_district     | 841704               
-RECORD 1------------------------------------
 summary              | mean                 
 case_id              | 1.0139680837676392E9 
 num_days_late        | -49.07486758369357   
 dept_division        | null                 
 service_request_type | null                 
 SLA_days             | 59.25478976660689    
 case_status          | null                 
 source_id            | 136602.73663950132   
 request_address      | null                 
 council_district     | 4.62516870

In [55]:
max_date = case_df.select(max('case_closed_date')).first()[0]

In [56]:
case_df = (
    case_df.withColumn('case_age',datediff(lit(max_date), 'case_opened_date'))
)

case_df.show(5, vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 00:42:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 dept_division        | Field Operations     
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
 case_age             | 219                  
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 2018-01-01 00:46:00  
 case_closed_date     | 2018-01-01 00:46:00  
 case_due_date        | 2018-01-05 08:30:00  
 case_late            | false     

In [57]:
case_df.filter('! case_closed').select('SLA_days').describe().show()

+-------+-----------------+
|summary|         SLA_days|
+-------+-----------------+
|  count|            18081|
|   mean|67.47167643113069|
| stddev|83.56549024394532|
|    min|            0.125|
|    max|       1419.00191|
+-------+-----------------+



In [58]:
case_df.filter('! case_closed').select('case_age').describe().show()

+-------+------------------+
|summary|          case_age|
+-------+------------------+
|  count|             18110|
|   mean| 82.87338487023744|
| stddev|114.68853606815411|
|    min|                 0|
|    max|               584|
+-------+------------------+



In [59]:
case_df.filter(expr('! case_closed')).sort(desc('num_days_late')).show(3, vertical=True)


-RECORD 0------------------------------------
 case_id              | 1013225646           
 case_opened_date     | 2017-01-01 13:48:00  
 case_closed_date     | 2017-01-01 13:48:00  
 case_due_date        | 2017-01-17 08:30:00  
 case_late            | true                 
 num_days_late        | 348.6458333          
 case_closed          | false                
 dept_division        | Code Enforcement     
 service_request_type | No Address Posted    
 SLA_days             | 15.77859954          
 case_status          | Open                 
 source_id            | svcCRMSS             
 request_address      | 7299  SHADOW RIDG... 
 council_district     | 6                    
 case_age             | 584                  
-RECORD 1------------------------------------
 case_id              | 1013225651           
 case_opened_date     | 2017-01-01 13:57:00  
 case_closed_date     | 2017-01-01 13:57:00  
 case_due_date        | 2017-01-17 08:30:00  
 case_late            | true      

2. How many Stray Animal cases are there?

In [70]:
case_df = case_df.withColumn('service_request_type', lower(case_df.service_request_type))

In [72]:
case_df.show(5,vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 00:42:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 dept_division        | Field Operations     
 service_request_type | stray animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
 case_age             | 219                  
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 2018-01-01 00:46:00  
 case_closed_date     | 2018-01-01 00:46:00  
 case_due_date        | 2018-01-05 08:30:00  
 case_late            | false     

In [77]:
case_df.where('service_request_type == lower("Stray Animal")').count()

26760

3. How many service requests that are assigned to the Field Operations department(`dept_division`) are not classified as "Officer Standby" request type

In [92]:
case_df.filter('dept_division =="Field Operations"').filter('service_request_type != lower("Officer Standby")').count()

113902

convert the `council_district` column to a string column.

In [93]:
case_df.groupBy('council_district').count().show()

+----------------+------+
|council_district| count|
+----------------+------+
|               1|119309|
|               6| 74095|
|               3|102706|
|               5|114609|
|               9| 40916|
|               4| 93778|
|               8| 42345|
|               7| 72445|
|              10| 62926|
|               2|114745|
|               0|  3830|
+----------------+------+



In [94]:
case_df = case_df.withColumn('council_district', col('council_district').cast("string"))

In [95]:
case_df

DataFrame[case_id: int, case_opened_date: timestamp, case_closed_date: timestamp, case_due_date: timestamp, case_late: boolean, num_days_late: double, case_closed: boolean, dept_division: string, service_request_type: string, SLA_days: double, case_status: string, source_id: string, request_address: string, council_district: string, case_age: int]

5. Extract the year from the `case_closed_date` column.

In [97]:
case_df = case_df.withColumn('year_closed', year('case_closed_date'))

case_df.show(5,vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 00:42:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 dept_division        | Field Operations     
 service_request_type | stray animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
 case_age             | 219                  
 year_closed          | 2018                 
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 2018-01-01 00:46:00  
 case_closed_date     | 2018-01-01 00:46:00  
 case_due_date        | 2018-01-05

6. Convert `num_days_late` from days to hours in new column `num_hours_late`. 

In [100]:
case_df = case_df.withColumn('num_hours_late', expr('num_days_late*24'))
case_df

DataFrame[case_id: int, case_opened_date: timestamp, case_closed_date: timestamp, case_due_date: timestamp, case_late: boolean, num_days_late: double, case_closed: boolean, dept_division: string, service_request_type: string, SLA_days: double, case_status: string, source_id: string, request_address: string, council_district: string, case_age: int, year_closed: int, num_hours_late: double]

7. Join the case data with the source and deptarment data.

In [102]:
df = (
    case_df
    # left join on dept_division
    .join(dept_df, 'dept_division', 'left')
    .drop(dept_df.dept_division)
    .drop(dept_df.dept_name)
    .drop(case_df.dept_division)
    .withColumnRenamed('standardized_dept_name', 'department')
#     .withColumn('dept_subject_to_SLA', col('dept_subject_to_SLA == "YES"'))
)

In [103]:
df.show(5, vertical = True)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 00:42:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 service_request_type | stray animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 5                    
 case_age             | 219                  
 year_closed          | 2018                 
 num_hours_late       | -23964.2102784       
 department           | Animal Care Services 
 dept_subject_to_SLA  | true                 
-RECORD 1------------------------------------
 case_id              | 1014127333           
 case_opened_date     | 2018-01-01

8. Are there any cases that do not have a request source?

In [115]:
df.where('source_id == "0" ').show(5, vertical = True)

(0 rows)



In [137]:
source_id_df = df.select(df.source_id.isNull().alias('null_id'))

In [138]:
source_id_df

DataFrame[null_id: boolean]

In [144]:
source_id_df.select(source_id_df.null_id, when(source_id_df.null_id == True, '1').otherwise(0).alias('null_int')).select(sum('null_int')).show()

+-------------+
|sum(null_int)|
+-------------+
|          0.0|
+-------------+



9. What are the top 10 service request types in termsof number of requests?

In [149]:
df.groupBy('department','service_request_type').count().sort(desc('count')).show(10, truncate=False)

+------------------------+--------------------------------+-----+
|department              |service_request_type            |count|
+------------------------+--------------------------------+-----+
|Solid Waste             |no pickup                       |86855|
|DSD/Code Enforcement    |overgrown yard/trash            |65895|
|DSD/Code Enforcement    |bandit signs                    |32910|
|Solid Waste             |damaged cart                    |30338|
|DSD/Code Enforcement    |front or side yard parking      |28794|
|Animal Care Services    |stray animal                    |26760|
|Animal Care Services    |aggressive animal(non-critical) |24882|
|Solid Waste             |cart exchange request           |22024|
|DSD/Code Enforcement    |junk vehicle on private property|21473|
|Trans & Cap Improvements|pot hole repair                 |20616|
+------------------------+--------------------------------+-----+
only showing top 10 rows



10. What are the top 10 service request types in terms of average days?

In [156]:
df.groupBy('department', 'service_request_type').agg(avg('case_age').alias('avg_case_age')).sort(desc('avg_case_age')).show(10, truncate = False)

+------------------------+--------------------------------------+------------------+
|department              |service_request_type                  |avg_case_age      |
+------------------------+--------------------------------------+------------------+
|Animal Care Services    |spay/neuter request response          |570.0             |
|Trans & Cap Improvements|floodplain inquiry                    |560.0             |
|DSD/Code Enforcement    |record keeping of used mattresses     |554.0             |
|DSD/Code Enforcement    |labeling for used mattress            |553.4285714285714 |
|DSD/Code Enforcement    |license requied used mattress sales   |551.8571428571429 |
|DSD/Code Enforcement    |signage requied for sale of used mattr|550.8333333333334 |
|DSD/Code Enforcement    |structure/housing maintenance         |507.1828153564899 |
|Trans & Cap Improvements|sign fabrication - internal           |499.32365145228215|
|DSD/Code Enforcement    |storage of used mattress              |

In [162]:
df.groupBy('department', 'service_request_type').agg(avg('num_days_late').alias('avg_days_late')).sort(desc('avg_days_late')).show(10, truncate = False)




+------------------------+--------------------------------------+------------------+
|department              |service_request_type                  |avg_days_late     |
+------------------------+--------------------------------------+------------------+
|DSD/Code Enforcement    |zoning: junk yards                    |175.9563621042095 |
|DSD/Code Enforcement    |labeling for used mattress            |162.43032902285717|
|DSD/Code Enforcement    |record keeping of used mattresses     |153.99724039428568|
|DSD/Code Enforcement    |signage requied for sale of used mattr|151.63868055333333|
|DSD/Code Enforcement    |storage of used mattress              |142.112556415     |
|DSD/Code Enforcement    |zoning: recycle yard                  |135.92851612479797|
|DSD/Code Enforcement    |donation container enforcement        |131.75610506358706|
|DSD/Code Enforcement    |license requied used mattress sales   |128.79828704142858|
|Trans & Cap Improvements|traffic signal graffiti               |

11. Does number of days late depend on department?

In [164]:
df.groupBy('department').agg(avg('num_days_late')).show()

+--------------------+-------------------+
|          department| avg(num_days_late)|
+--------------------+-------------------+
|         Solid Waste| -2.193864424022545|
|Animal Care Services|-226.16549770717506|
|Trans & Cap Impro...|-20.509793501785314|
|  Parks & Recreation| -5.283345998745901|
|    Customer Service|  59.49019459221518|
|        Metro Health| -4.904223205386017|
|        City Council|               null|
|DSD/Code Enforcement| -38.32346772537388|
+--------------------+-------------------+



12. How doo number of days late depend on department and request type?

In [169]:
df.groupBy('department', 'service_request_type').agg(avg('num_days_late')).sort('avg(num_days_late)').show(100)

+--------------------+--------------------+-------------------+
|          department|service_request_type| avg(num_days_late)|
+--------------------+--------------------+-------------------+
|        City Council|cco_request for r...|               null|
|        City Council|request for resea...|               null|
|Trans & Cap Impro...|  engineering design|      -1399.1272335|
|Trans & Cap Impro...|signal timing mod...|-1247.0797799732143|
|Animal Care Services|        stray animal|  -998.804572616083|
|  Parks & Recreation|major park improv...| -280.2546235360405|
|Trans & Cap Impro...|sidewalk cost sha...|-184.87626063647144|
|DSD/Code Enforcement|multi tenant exte...|-135.71588128047625|
|DSD/Code Enforcement|   cps energy towers|-129.84778717829747|
|DSD/Code Enforcement|cps energy wood p...|-129.30905202721226|
|DSD/Code Enforcement|cps energy metal ...|-129.17919786427768|
|DSD/Code Enforcement|multi tenant inte...| -125.1431856354651|
|DSD/Code Enforcement|temporary obstruc.