## Explore Exercises: Spark
### Corey Solitaire
`12.01.2020`

#### Imports

In [11]:
import warnings

warnings.filterwarnings("ignore")

import pyspark.sql
from pyspark.sql.functions import *

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from wrangle import wrangle_311

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = wrangle_311(spark)
print("\ndf shape: (%d, %d)\n" % (df.count(), len(df.columns)))
df.show(1, vertical=True)

[wrangle.py] reading case.csv
[wrangle.py] handling data types
[wrangle.py] parsing dates
[wrangle.py] adding features
[wrangle.py] joining departments

df shape: (841704, 20)

-RECORD 0------------------------------------
 case_id              | 1014127332           
 case_opened_date     | 2018-01-01 00:42:00  
 case_closed_date     | 2018-01-01 12:29:00  
 case_due_date        | 2020-09-26 00:42:00  
 case_late            | false                
 num_days_late        | -998.5087616000001   
 case_closed          | true                 
 service_request_type | Stray Animal         
 SLA_days             | 999.0                
 case_status          | Closed               
 source_id            | svcCRMLS             
 request_address      | 2315  EL PASO ST,... 
 council_district     | 005                  
 num_weeks_late       | -142.6441088         
 zipcode              | 78207                
 case_age             | 219                  
 days_to_closed       | 0                

## Exercises:

- Answer the questions below using whichever combination of techniques from the lesson you think is appropriate.

1. How many different cases are there, by department?

In [13]:
df.groupby("department").count().show()

+--------------------+------+
|          department| count|
+--------------------+------+
|         Solid Waste|279270|
|Animal Care Services|116915|
|Trans & Cap Impro...| 96193|
|  Parks & Recreation| 19907|
|    Customer Service|  2849|
|        Metro Health|  5163|
|        City Council|    33|
|DSD/Code Enforcement|321374|
+--------------------+------+



2. Does the percentage of cases that are late vary by department?

In [19]:
(
    df.groupBy("department")
    .pivot("case_late")
    .agg(round(mean("case_lifetime"), 2))
    .orderBy("department")
    .show(truncate=False)
)

+------------------------+------+------+
|department              |false |true  |
+------------------------+------+------+
|Animal Care Services    |0.25  |26.97 |
|City Council            |138.94|null  |
|Customer Service        |16.83 |134.21|
|DSD/Code Enforcement    |11.38 |98.71 |
|Metro Health            |4.87  |16.8  |
|Parks & Recreation      |5.9   |36.53 |
|Solid Waste             |2.11  |13.67 |
|Trans & Cap Improvements|7.82  |30.22 |
+------------------------+------+------+



In [22]:
(
    df.where(df.case_late == True)
    .groupBy("department")
    .agg(round(mean("case_lifetime"), 2))
    .orderBy("department")
    .show(truncate=False)
)

+------------------------+----------------------------+
|department              |round(avg(case_lifetime), 2)|
+------------------------+----------------------------+
|Animal Care Services    |26.97                       |
|Customer Service        |134.21                      |
|DSD/Code Enforcement    |98.71                       |
|Metro Health            |16.8                        |
|Parks & Recreation      |36.53                       |
|Solid Waste             |13.67                       |
|Trans & Cap Improvements|30.22                       |
+------------------------+----------------------------+



3. On average, how late are the late cases by department?

In [24]:
(
    df.where(df.case_late == True)
    .groupBy("department")
    .agg(round(mean("num_days_late"), 2))
    .orderBy("department")
    .show(truncate=False)
)

+------------------------+----------------------------+
|department              |round(avg(num_days_late), 2)|
+------------------------+----------------------------+
|Animal Care Services    |23.46                       |
|Customer Service        |87.68                       |
|DSD/Code Enforcement    |49.38                       |
|Metro Health            |6.54                        |
|Parks & Recreation      |22.35                       |
|Solid Waste             |7.19                        |
|Trans & Cap Improvements|10.6                        |
+------------------------+----------------------------+



4. Which service request type is the most late? What about just for Parks & Rec?

In [44]:
(
    df.where(df.case_late == True)
    .groupBy("department", "service_request_type")
    .agg(round(mean("num_days_late"), 2).alias("avg_days_late"))
    .orderBy(desc("avg_days_late"))
    .show(truncate=False)
)

+------------------------+----------------------------------------+-------------+
|department              |service_request_type                    |avg_days_late|
+------------------------+----------------------------------------+-------------+
|DSD/Code Enforcement    |Zoning: Recycle Yard                    |210.89       |
|DSD/Code Enforcement    |Zoning: Junk Yards                      |200.21       |
|DSD/Code Enforcement    |Structure/Housing Maintenance           |190.21       |
|DSD/Code Enforcement    |Donation Container Enforcement          |171.09       |
|DSD/Code Enforcement    |Storage of Used Mattress                |163.97       |
|DSD/Code Enforcement    |Labeling for Used Mattress              |162.43       |
|DSD/Code Enforcement    |Record Keeping of Used Mattresses       |154.0        |
|DSD/Code Enforcement    |Signage Requied for Sale of Used Mattr  |151.64       |
|Trans & Cap Improvements|Traffic Signal Graffiti                 |137.65       |
|DSD/Code Enforc

In [46]:
# Just Parks & Recreation: filter to the department before aggregating
(
    df.where(df.case_late == True)
    .where(df.department == "Parks & Recreation")
    .groupBy("department", "service_request_type")
    .agg(round(mean("num_days_late"), 2).alias("avg_days_late"))
    .orderBy(desc("avg_days_late"))
    .show(truncate=False)
)

+------------------+-------------------------------------+-------------+
|department        |service_request_type                 |avg_days_late|
+------------------+-------------------------------------+-------------+
|Parks & Recreation|Amenity Park Improvement             |76.87        |
|Parks & Recreation|Major Park Improvement Install       |75.79        |
|Parks & Recreation|Reservation Assistance               |66.03        |
|Parks & Recreation|Park Building Maint Invest           |59.37        |
|Parks & Recreation|Sportfield Lighting                  |51.48        |
|Parks & Recreation|Electrical                           |42.95        |
|Parks & Recreation|Tree Removal                         |40.28        |
|Parks & Recreation|Landscape Maintenance                |38.87        |
|Parks & Recreation|Heavy Equipment                      |38.57        |
|Parks & Recreation|Miscellaneous Park Equipment         |33.62        |
|Parks & Recreation|Tree Trimming/Maintenance      

5. For the DSD/Code Enforcement department, what are the most common service request types? Look at other departments too.

6. Does whether or not it's a weekend matter for when a case is opened/closed?

7. On average, how many cases are opened a day for the Customer Service department?

8. Does the number of service requests for the solid waste department vary by day of the week?