# Capstone Project 2
# How Soon Will a Complaint be Resolved?
## A Case Study on New York City 311 Calls
## Notebook in PySpark

Data Source: NYC Open Data - 311 Service Requests from 2010 to Present
URL: https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9
Analyst: Eugene Wen

### Subsample datasets for model development
First, install the `subsample` package through pip. Then change directory to the NYC311 data folder and run the following command to randomly draw 500,000 rows:

`subsample -n 500000 311_Service_Requests_from_2010_to_Present.csv -r > nyc311_sample.csv`  

Note that this sample is used only for development. The final report will use the full 10 GB dataset for analysis.
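
If the `subsample` package is not available, a similar fixed-size random draw can be sketched with the Python standard library. The function below is a minimal illustration using reservoir sampling; its name and parameters are hypothetical, and it is not a description of how `subsample` itself works.

```python
import csv
import random


def reservoir_sample_csv(in_path, out_path, n, seed=None):
    """Write the header plus up to n uniformly sampled data rows to out_path.

    Hypothetical helper: reads the file once, so it never holds more than
    n rows in memory. Returns the number of data rows written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(in_path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)  # keep the header row out of the sample
        for i, row in enumerate(reader):
            if i < n:
                # Fill the reservoir with the first n rows.
                reservoir.append(row)
            else:
                # Replace a kept row with probability n / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < n:
                    reservoir[j] = row
    with open(out_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

A call such as `reservoir_sample_csv('311_Service_Requests_from_2010_to_Present.csv', 'nyc311_sample.csv', 500000)` would mirror the command above. The sampled rows are not guaranteed to keep their original order.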

In [2]:
# Load packages
import pandas as pd
from pyspark.sql import SparkSession

In [3]:
# Start a spark session
spark = SparkSession.builder.appName('nyc311').getOrCreate()

In [4]:
# Load sample dataset
df = spark.read.csv('../NYC311/nyc311_sample.csv', inferSchema=True, header=True)
df.printSchema()

root
 |-- Unique Key: integer (nullable = true)
 |-- Created Date: string (nullable = true)
 |-- Closed Date: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Agency Name: string (nullable = true)
 |-- Complaint Type: string (nullable = true)
 |-- Descriptor: string (nullable = true)
 |-- Location Type: string (nullable = true)
 |-- Incident Zip: string (nullable = true)
 |-- Incident Address: string (nullable = true)
 |-- Street Name: string (nullable = true)
 |-- Cross Street 1: string (nullable = true)
 |-- Cross Street 2: string (nullable = true)
 |-- Intersection Street 1: string (nullable = true)
 |-- Intersection Street 2: string (nullable = true)
 |-- Address Type: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Landmark: string (nullable = true)
 |-- Facility Type: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Due Date: string (nullable = true)
 |-- Resolution Description: string (nullable = true)
 |-- Resolution Actio

In [10]:
# Check the first three columns we are interested in.
df.select('Unique Key', 'Created Date', 'Closed Date').show()

+----------+--------------------+--------------------+
|Unique Key|        Created Date|         Closed Date|
+----------+--------------------+--------------------+
|  32199603|12/14/2015 12:00:...|01/04/2016 12:00:...|
|  20074547|03/21/2011 04:22:...|03/23/2011 02:49:...|
|  28951515|09/25/2014 06:18:...|09/25/2014 06:19:...|
|  17575598|07/03/2010 10:11:...|07/07/2010 12:00:...|
|  28270434|06/16/2014 12:00:...|06/18/2014 12:00:...|
|  34115581|08/18/2016 02:24:...|08/31/2016 12:47:...|
|  28261221|06/14/2014 09:12:...|06/14/2014 10:44:...|
|  22829180|03/06/2012 12:05:...|03/07/2012 01:36:...|
|  29709630|01/13/2015 05:51:...|01/13/2015 05:51:...|
|  20809019|07/11/2011 12:00:...|07/18/2011 12:00:...|
|  29450972|12/07/2014 12:00:...|12/09/2014 12:00:...|
|  34728843|11/07/2016 10:24:...|11/08/2016 10:58:...|
|  34260889|09/07/2016 06:03:...|09/27/2016 03:49:...|
|  36413130|06/11/2017 01:20:...|06/11/2017 04:35:...|
|  25301783|04/04/2013 11:30:...|04/05/2013 12:49:...|
|  1599464

In [13]:
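# Calculate the response time as the difference between closed date and created date, in minutes.
# Caveat: both columns were inferred as strings (see the schema above), so this subtraction returns null for every row.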

df = df.withColumn("Resp_time", df['Closed Date'] - df['Created Date'])
df['Unique Key', 'Resp_time'].show()

+----------+---------+
|Unique Key|Resp_time|
+----------+---------+
|  32199603|     null|
|  20074547|     null|
|  28951515|     null|
|  17575598|     null|
|  28270434|     null|
|  34115581|     null|
|  28261221|     null|
|  22829180|     null|
|  29709630|     null|
|  20809019|     null|
|  29450972|     null|
|  34728843|     null|
|  34260889|     null|
|  36413130|     null|
|  25301783|     null|
|  15994647|     null|
|  20885742|     null|
|  19128116|     null|
|  32064460|     null|
|  30699222|     null|
+----------+---------+
only showing top 20 rows
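
The all-null `Resp_time` column most likely arises because `Created Date` and `Closed Date` are strings in the schema, so the subtraction has no numeric values to work with. One possible fix, sketched below, is to parse both columns with an explicit format and subtract the resulting epoch seconds. The format strings are an assumption: the displayed values are truncated, so the layout (for example `12/14/2015 12:00:00 AM`) should be checked against the raw CSV. The helper names `minutes_between` and `add_resp_time` are hypothetical.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "12/14/2015 12:00:00 AM"; verify against the raw CSV.
PY_FMT = '%m/%d/%Y %I:%M:%S %p'      # Python strptime pattern
SPARK_FMT = 'MM/dd/yyyy hh:mm:ss a'  # matching Spark datetime pattern


def minutes_between(created, closed, fmt=PY_FMT):
    """Return the minutes from created to closed for two timestamp strings."""
    delta = datetime.strptime(closed, fmt) - datetime.strptime(created, fmt)
    return delta.total_seconds() / 60


def add_resp_time(df, fmt=SPARK_FMT):
    """Return df with a Resp_time column in minutes (hypothetical helper)."""
    # Imported here so the pure-Python check above runs without Spark installed.
    from pyspark.sql import functions as F

    # unix_timestamp parses each string into seconds since the epoch.
    created = F.unix_timestamp(F.col('Created Date'), fmt)
    closed = F.unix_timestamp(F.col('Closed Date'), fmt)
    return df.withColumn('Resp_time', (closed - created) / 60)


# Sanity-check the assumed format on a made-up pair of values before using Spark.
print(minutes_between('01/01/2020 12:00:00 AM', '01/01/2020 01:30:00 AM'))
```

In the notebook this would be used as `df = add_resp_time(df)` followed by `df.select('Unique Key', 'Resp_time').show()`. Rows with a missing `Closed Date` would still produce null, which is worth keeping in mind for the data wrangling step.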



### Data Wrangling


In [None]:
# Check missing patterns

### Exploratory Data Analysis

In [None]:
# Plot response time distribution

### Machine Learning Models

In [14]:
# First run

In [15]:
# Tuning with optimal subset

In [16]:
# Final model choice with optimal hyperparameter(s)

### Results

In [17]:
# Final model training

In [18]:
# Final model evaluation