# Client Project: The Lab @ DC

## Project Title: {here}

### Authors: Kihoon Sohn, Brian Collins, Harsha Goonawardana, Priya Kakkar
- Cohorts of the Data Science Immersive, General Assembly @ Washington DC campus

In this notebook, we have Exploratory Data Analysis on the City Service Requests / ShotsSpotters datasets. **This is notebook 2 of 3.**

### Import Libraries

In [75]:
# import basic libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

### Read CSVs

In [76]:
csr_train   = pd.read_csv('./assets/csr/csr_train.csv', low_memory=False)
csr_test    = pd.read_csv('./assets/csr/csr_test.csv', low_memory=False)
shots_train = pd.read_csv('./assets/mpd/shots_train.csv', low_memory=False)
shots_test  = pd.read_csv('./assets/mpd/shots_test.csv', low_memory=False)

##### check null values and basic info on the datasets

In [77]:
csr_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1231233 entries, 0 to 1231232
Data columns (total 30 columns):
X                             1231233 non-null float64
Y                             1231233 non-null float64
OBJECTID                      1231233 non-null int64
SERVICECODE                   1231233 non-null object
SERVICECODEDESCRIPTION        1231233 non-null object
SERVICETYPECODEDESCRIPTION    1230379 non-null object
ORGANIZATIONACRONYM           1231232 non-null object
SERVICECALLCOUNT              1231233 non-null int64
ADDDATE                       1231233 non-null object
RESOLUTIONDATE                1145187 non-null object
SERVICEDUEDATE                1218530 non-null object
SERVICEORDERDATE              1231233 non-null object
INSPECTIONFLAG                1231233 non-null object
INSPECTIONDATE                434130 non-null object
INSPECTORNAME                 40361 non-null object
SERVICEORDERSTATUS            1230380 non-null object
STATUS_CODE               

In [78]:
csr_train.isnull().sum().sort_values(ascending=False)

INSPECTORNAME                 1190872
INSPECTIONDATE                 797103
DETAILS                        444580
MARADDRESSREPOSITORYID         189162
STATUS_CODE                    151801
RESOLUTIONDATE                  86046
CITY                            50324
STATE                           50324
STREETADDRESS                   49730
SERVICEDUEDATE                  12703
WARD                             6221
PRIORITY                         2677
SERVICETYPECODEDESCRIPTION        854
SERVICEORDERSTATUS                853
ZIPCODE                            16
ORGANIZATIONACRONYM                 1
SERVICEREQUESTID                    0
XCOORD                              0
INSPECTIONFLAG                      0
SERVICEORDERDATE                    0
YCOORD                              0
LATITUDE                            0
ADDDATE                             0
SERVICECALLCOUNT                    0
LONGITUDE                           0
SERVICECODEDESCRIPTION              0
SERVICECODE 

In [79]:
shots_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28343 entries, 0 to 28342
Data columns (total 7 columns):
ID           28339 non-null object
Type         28343 non-null object
Date         28343 non-null object
Time         28343 non-null object
Source       28343 non-null object
Latitude     28343 non-null float64
Longitude    28343 non-null float64
dtypes: float64(2), object(5)
memory usage: 1.5+ MB


In [80]:
shots_train.isnull().sum().sort_values(ascending=False)

ID           4
Longitude    0
Latitude     0
Source       0
Time         0
Date         0
Type         0
dtype: int64

In [81]:
csr_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 152796 entries, 0 to 152795
Data columns (total 30 columns):
X                             152796 non-null float64
Y                             152796 non-null float64
OBJECTID                      152796 non-null int64
SERVICECODE                   152796 non-null object
SERVICECODEDESCRIPTION        152796 non-null object
SERVICETYPECODEDESCRIPTION    152795 non-null object
ORGANIZATIONACRONYM           152795 non-null object
SERVICECALLCOUNT              152796 non-null int64
ADDDATE                       152796 non-null object
RESOLUTIONDATE                131874 non-null object
SERVICEDUEDATE                152791 non-null object
SERVICEORDERDATE              152796 non-null object
INSPECTIONFLAG                152796 non-null object
INSPECTIONDATE                10705 non-null object
INSPECTORNAME                 0 non-null float64
SERVICEORDERSTATUS            152796 non-null object
STATUS_CODE                   152796 non-null 

In [82]:
csr_test.isnull().sum()

X                                  0
Y                                  0
OBJECTID                           0
SERVICECODE                        0
SERVICECODEDESCRIPTION             0
SERVICETYPECODEDESCRIPTION         1
ORGANIZATIONACRONYM                1
SERVICECALLCOUNT                   0
ADDDATE                            0
RESOLUTIONDATE                 20922
SERVICEDUEDATE                     5
SERVICEORDERDATE                   0
INSPECTIONFLAG                     0
INSPECTIONDATE                142091
INSPECTORNAME                 152796
SERVICEORDERSTATUS                 0
STATUS_CODE                        0
SERVICEREQUESTID                   0
PRIORITY                           0
STREETADDRESS                  10788
XCOORD                             0
YCOORD                             0
LATITUDE                           0
LONGITUDE                          0
CITY                           10786
STATE                          10786
ZIPCODE                            8
M

### Basic EDAs and Data Cleaning

##### Basic EDAs on Shots dataset

In [83]:
shots_train.head()

Unnamed: 0,ID,Type,Date,Time,Source,Latitude,Longitude
0,5D39700,Multiple_Gunshots,2014-01-01,00:00:02,WashingtonDC5D,38.917,-77.012
1,5D39701,Multiple_Gunshots,2014-01-01,00:00:06,WashingtonDC5D,38.917,-77.002
2,5D39702,Multiple_Gunshots,2014-01-01,00:00:07,WashingtonDC5D,38.917,-76.987
3,7D119445,Multiple_Gunshots,2014-01-01,00:00:10,WashingtonDC7D,38.823,-77.0
4,1D55993,Multiple_Gunshots,2014-01-01,00:00:10,WashingtonDC1D,38.893,-76.993


In [84]:
shots_train.shape

(28343, 7)

In [85]:
shots_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28343 entries, 0 to 28342
Data columns (total 7 columns):
ID           28339 non-null object
Type         28343 non-null object
Date         28343 non-null object
Time         28343 non-null object
Source       28343 non-null object
Latitude     28343 non-null float64
Longitude    28343 non-null float64
dtypes: float64(2), object(5)
memory usage: 1.5+ MB


In [86]:
shots_train.isnull().sum().sort_values(ascending=False)

ID           4
Longitude    0
Latitude     0
Source       0
Time         0
Date         0
Type         0
dtype: int64

In [87]:
shots_train[shots_train.ID.isnull()]

Unnamed: 0,ID,Type,Date,Time,Source,Latitude,Longitude
17514,,Multiple_Gunshots,2015-12-27,19:41:22,WashingtonDC5D,38.931,-76.97
20378,,Gunshot_or_Firecracker,2016-06-28,11:41:36,WashingtonDC6D,38.894,-76.924
24456,,Single_Gunshot,2017-02-22,17:10:13,WashingtonDC7D,38.841,-76.976
24691,,Single_Gunshot,2017-03-15,20:08:03,WashingtonDC7D,38.84,-76.988


In [88]:
def shot_spot_preprocess(df):
    shots_train.set_index(['ID'], inplace=True)
    shots_train.Source = shots_train.Source.apply(lambda DC: DC.replace('WashingtonDC', ''))
    shots_train.Date = pd.to_datetime(shots_train.Date, infer_datetime_format=True)
    return df

shots_train = shot_spot_preprocess(shots_train)

In [89]:
# shots_train.to_csv('./assets/mpd/shots_train_preprocessed.csv')

In [90]:
shots_train.Source.value_counts()

7D    10342
6D     8407
5D     3216
4D     2701
1D     1954
3D     1723
Name: Source, dtype: int64

In [91]:
shots_train.Type.value_counts()

Multiple_Gunshots         15858
Single_Gunshot            10034
Gunshot_or_Firecracker     2451
Name: Type, dtype: int64

##### Basic EDAs on City Service Requests Datasets

In [92]:
csr_train.iloc[:5, :6]

Unnamed: 0,X,Y,OBJECTID,SERVICECODE,SERVICECODEDESCRIPTION,SERVICETYPECODEDESCRIPTION
0,-76.972735,38.897957,463232,S0011,Alley Cleaning,Street Cleaning
1,-76.991907,38.922865,463233,S0321,Recycling Collection - Missed,Recycling
2,-76.970891,38.874749,463234,S0031,Bulk Collection,Bulk Collection
3,-77.022678,38.942819,463235,S0311,Rat Abatement,DOH
4,-77.04884,38.89896,463236,S0276,Parking Meter Repair,TOA


##### findings above
- X and Y seems to be closely related to Lats/Longs - drop!
- Is service code description the same as service type, just a little more information?
- Object ID can be indexed
- Service code can be grouped by with code description

In [93]:
csr_train.iloc[:5, 6:12]

Unnamed: 0,ORGANIZATIONACRONYM,SERVICECALLCOUNT,ADDDATE,RESOLUTIONDATE,SERVICEDUEDATE,SERVICEORDERDATE
0,DPW,1,2014-01-02T13:27:40.000Z,2014-01-15T07:43:42.000Z,2014-02-18T13:27:40.000Z,2014-01-02T13:27:40.000Z
1,DPW,1,2014-01-02T13:46:57.000Z,2014-01-06T12:39:39.000Z,2014-01-06T13:46:57.000Z,2014-01-02T13:46:57.000Z
2,DPW,1,2014-01-02T13:57:46.000Z,2014-01-14T14:29:16.000Z,2014-01-23T13:57:46.000Z,2014-01-02T13:57:46.000Z
3,DOH,1,2014-01-02T13:43:20.000Z,,2014-02-24T13:43:20.000Z,2014-01-02T13:43:20.000Z
4,DDOT,1,2014-01-02T16:00:59.000Z,2014-01-07T16:33:48.000Z,2014-01-09T16:00:59.000Z,2014-01-02T16:00:59.000Z


##### findings above

- Organization to know who is on the task.
- Service call count to see how many times a call is needed.
- Everything just needs to be set to datetime.
- Organization acronym / service call count can be dropped.

In [94]:
csr_train.iloc[:5, 12:18]

Unnamed: 0,INSPECTIONFLAG,INSPECTIONDATE,INSPECTORNAME,SERVICEORDERSTATUS,STATUS_CODE,SERVICEREQUESTID
0,N,,,CLOSED,,14-00000654
1,N,2014-01-06T12:39:00.000Z,"Bryant, Kevin",CLOSED,,14-00000686
2,N,,,CLOSED,,14-00000707
3,N,,,OPEN,,14-00000677
4,N,,,CLOSED,,14-00000877


##### findings above
- Service order data good.  Needs to be broken down into Datetime.
- Inspection Flag, what does that mean?
- Remove Inspector Name
- What does Status code contain? A lot of NaN could be bad
- Service Order Status, Important maybe?

In [95]:
csr_train.iloc[:5, 18:24]

Unnamed: 0,PRIORITY,STREETADDRESS,XCOORD,YCOORD,LATITUDE,LONGITUDE
0,STANDARD,2301 BENNING ROAD NE,402365.36,136678.02,38.89795,-76.972732
1,STANDARD,1004 RHODE ISLAND AVENUE NE,400701.99,139442.62,38.922857,-76.991905
2,STANDARD,2333 FAIRLAWN AVENUE SE,402526.12,134101.82,38.874742,-76.970889
3,STANDARD,720 VARNUM STREET NW,398034.18,141657.91,38.942811,-77.022676
4,STANDARD,700 - 799 BLOCK OF 22ND STREET NW,395763.56,136790.11,38.898952,-77.048838


##### findings above
- Service Request ID seems unimportant
- What is the XCOORD and YCOORD?
- Street Address can help us find our quandrants. Do we also want the address or is Latitude and Longitude
- Priority - how many unique values are in there?

In [96]:
csr_train.iloc[:5, 24:31]

Unnamed: 0,CITY,STATE,ZIPCODE,MARADDRESSREPOSITORYID,WARD,DETAILS
0,WASHINGTON,DC,20002.0,48983.0,Ward 7,There is some dumping in the rear of this addr...
1,WASHINGTON,DC,20018.0,76304.0,Ward 5,Has not been collected the past 4 weeks.
2,WASHINGTON,DC,20020.0,286919.0,Ward 7,"1 television, 2 vacuums, 1 boom box,"
3,WASHINGTON,DC,20011.0,249794.0,Ward 4,requesting ratb abatement
4,WASHINGTON,DC,20052.0,,2,Broken Parking Meter


##### findings above
- Remove City and State
- clean up the Ward to just numbers
- Fix Zipcode to be int.
- Longitude is the same as the X column
- Details can be vectorized.
- What is MARADDRESSREPOSITORYID?

##### Check with some categorical columns' values and find typos or misspelled. 

In [97]:
csr_train.isnull().sum()

X                                   0
Y                                   0
OBJECTID                            0
SERVICECODE                         0
SERVICECODEDESCRIPTION              0
SERVICETYPECODEDESCRIPTION        854
ORGANIZATIONACRONYM                 1
SERVICECALLCOUNT                    0
ADDDATE                             0
RESOLUTIONDATE                  86046
SERVICEDUEDATE                  12703
SERVICEORDERDATE                    0
INSPECTIONFLAG                      0
INSPECTIONDATE                 797103
INSPECTORNAME                 1190872
SERVICEORDERSTATUS                853
STATUS_CODE                    151801
SERVICEREQUESTID                    0
PRIORITY                         2677
STREETADDRESS                   49730
XCOORD                              0
YCOORD                              0
LATITUDE                            0
LONGITUDE                           0
CITY                            50324
STATE                           50324
ZIPCODE     

In [98]:
csr_train.PRIORITY.value_counts()

STANDARD     1179004
URGENT         24585
EMERGNCY       23635
EMERGENCY       1189
PRIORITY         141
ESCALATED          1
PRIOR003           1
Name: PRIORITY, dtype: int64

##### Preprocessing

In [99]:
csr_train.columns

Index(['X', 'Y', 'OBJECTID', 'SERVICECODE', 'SERVICECODEDESCRIPTION',
       'SERVICETYPECODEDESCRIPTION', 'ORGANIZATIONACRONYM', 'SERVICECALLCOUNT',
       'ADDDATE', 'RESOLUTIONDATE', 'SERVICEDUEDATE', 'SERVICEORDERDATE',
       'INSPECTIONFLAG', 'INSPECTIONDATE', 'INSPECTORNAME',
       'SERVICEORDERSTATUS', 'STATUS_CODE', 'SERVICEREQUESTID', 'PRIORITY',
       'STREETADDRESS', 'XCOORD', 'YCOORD', 'LATITUDE', 'LONGITUDE', 'CITY',
       'STATE', 'ZIPCODE', 'MARADDRESSREPOSITORYID', 'WARD', 'DETAILS'],
      dtype='object')

In [100]:
csr_test.columns

Index(['X', 'Y', 'OBJECTID', 'SERVICECODE', 'SERVICECODEDESCRIPTION',
       'SERVICETYPECODEDESCRIPTION', 'ORGANIZATIONACRONYM', 'SERVICECALLCOUNT',
       'ADDDATE', 'RESOLUTIONDATE', 'SERVICEDUEDATE', 'SERVICEORDERDATE',
       'INSPECTIONFLAG', 'INSPECTIONDATE', 'INSPECTORNAME',
       'SERVICEORDERSTATUS', 'STATUS_CODE', 'SERVICEREQUESTID', 'PRIORITY',
       'STREETADDRESS', 'XCOORD', 'YCOORD', 'LATITUDE', 'LONGITUDE', 'CITY',
       'STATE', 'ZIPCODE', 'MARADDRESSREPOSITORYID', 'WARD', 'DETAILS'],
      dtype='object')

In [101]:
# create preprocess function

# after the basic EDA above, decided to drop columns below
drop_cols = ['X', 'Y', 'ORGANIZATIONACRONYM', 'SERVICECALLCOUNT',
             'SERVICEDUEDATE', 'SERVICEORDERDATE', 'INSPECTIONFLAG',
             'INSPECTIONDATE', 'INSPECTORNAME',
             'STREETADDRESS', 'XCOORD', 'YCOORD', 'CITY', 'STATE',
             'MARADDRESSREPOSITORYID', 'DETAILS']


def crimespot_preprocess(df):
    # Removing unused or redundent information
    df.drop(drop_cols, axis=1, inplace=True)
    
    # Easier to work with lowercase columns
    df.columns = map(str.lower, df.columns) 

    # replace values
    df.priority = df.priority.replace("EMERGNCY", "EMERGENCY")
        
    # treat zipcode as string and strip
    df.zipcode = df.zipcode.astype(str).str.strip().str.strip('.0')

    # (KS) make single line to combine Brian's on 'Ward'
    df.ward = df.ward.astype(str).map(lambda x: x.strip('Ward')).str.strip().str.strip('.0')

    # create binary classfication column by resolution date info
    df['resolved'] = [0 if x == True else 1 for x in df['resolutiondate'].isnull()]

    # fill nan values to 0
    df.resolutiondate.fillna('0', inplace=True)
    
    # clean up datetime related data
    timestamp = ['adddate', 'resolutiondate']
    for x in timestamp:
        df[x] = df[x].astype(str).map(lambda x: x.strip('Z').replace('T', ' ')).astype('datetime64[ns]')
        
    # calculate the time difference between resolutiondate - adddate by hours
    df['turnover'] = (df['resolutiondate']-df['adddate']).astype('timedelta64[h]')*df['resolved']

    df['servicecodedescription'] = [x.lower() for x in df['servicecodedescription']]

    return df

csr_train = crimespot_preprocess(csr_train)
csr_test  = crimespot_preprocess(csr_test)

### Turnover time and resolution examine

In [102]:
# no resolution date gets 0

csr_train['resolved'].value_counts(normalize=True)

1    0.930114
0    0.069886
Name: resolved, dtype: float64

In [103]:
csr_test['resolved'].value_counts(normalize=True)

1    0.863072
0    0.136928
Name: resolved, dtype: float64

In [104]:
# turnover stats
csr_train['turnover'].describe().apply(lambda x: format(x, 'f'))

count    1231233.000000
mean         571.527879
std         1747.934413
min            0.000000
25%            6.000000
50%           72.000000
75%          268.000000
max        28247.000000
Name: turnover, dtype: object

In [105]:
csr_test['turnover'].describe().apply(lambda x: format(x, 'f'))

count    152796.000000
mean        124.583209
std         255.951533
min           0.000000
25%           1.000000
50%          28.000000
75%         163.000000
max        3855.000000
Name: turnover, dtype: object

In [106]:
# let's set threshold as 100 days has not been resolved case.

csr_train[csr_train.turnover > 24000]

Unnamed: 0,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,zipcode,ward,resolved,turnover
115,462863,S0287,sign removal investigation,Toa-Trans Sys Mnt-Signs,2014-01-02 14:20:00,2017-03-24 13:48:10,CLOSED,CLOSED,14-00000640,STANDARD,38.941250,-77.016082,20011,4,1,28247.0
2284,477074,S0000,abandoned vehicle - on public property,PEMA- Parking Enforcement Management Administr...,2014-01-06 08:35:00,2016-12-17 02:59:50,CLOSED,CLOSED,14-00002671,STANDARD,38.900807,-76.995916,20002,6,1,25818.0
6618,486603,INFLIGRE,light-infrastructure,Transportation Operations Administration,2014-02-18 14:06:00,2017-01-27 10:58:26,CLOSED,CLOSED,14-00036323,STANDARD,38.876256,-77.006990,20003,6,1,25772.0
13105,493090,S0376,sign new investigation,Toa-Trans Sys Mnt-Signs,2014-02-06 16:04:00,2016-11-21 14:43:10,CLOSED,CLOSED,14-00028706,STANDARD,38.931513,-76.991731,20017,5,1,24454.0
13973,493958,S0000,abandoned vehicle - on public property,PEMA- Parking Enforcement Management Administr...,2014-01-27 10:43:00,2016-12-18 12:14:11,CLOSED,CLOSED,14-00018692,STANDARD,38.860496,-76.997453,2002,8,1,25345.0
14525,494510,S0361,sidewalk repair,Toa-Street & Bridge Maintenance,2014-01-13 12:52:00,2017-03-25 14:48:07,CLOSED,CLOSED,14-00008849,STANDARD,38.967892,-77.020137,20012,4,1,28009.0
14882,494867,S0361,sidewalk repair,Toa-Street & Bridge Maintenance,2014-01-13 12:12:00,2017-03-29 08:23:08,CLOSED,CLOSED,14-00008784,STANDARD,38.902014,-77.026204,20001,2,1,28100.0
15865,495850,S0000,abandoned vehicle - on public property,PEMA- Parking Enforcement Management Administr...,2014-02-05 16:46:00,2016-12-18 09:32:19,CLOSED,CLOSED,14-00027687,STANDARD,38.955822,-77.024285,20011,4,1,25120.0
16551,496536,S0000,abandoned vehicle - on public property,PEMA- Parking Enforcement Management Administr...,2014-02-10 13:22:00,2016-12-18 10:33:18,CLOSED,CLOSED,14-00031228,STANDARD,38.879378,-76.943238,20019,7,1,25005.0
16955,496940,S0003,abandoned vehicle - on private property,PEMA- Parking Enforcement Management Administr...,2014-01-18 19:13:00,2016-12-18 10:26:08,CLOSED,CLOSED,14-00014062,STANDARD,38.948808,-77.030522,20011,4,1,25551.0


In [107]:
csr_train.dtypes

objectid                               int64
servicecode                           object
servicecodedescription                object
servicetypecodedescription            object
adddate                       datetime64[ns]
resolutiondate                datetime64[ns]
serviceorderstatus                    object
status_code                           object
servicerequestid                      object
priority                              object
latitude                             float64
longitude                            float64
zipcode                               object
ward                                  object
resolved                               int64
turnover                             float64
dtype: object

In [108]:
csr_train.isnull().sum().sort_values(ascending=False)

status_code                   151801
priority                        2677
servicetypecodedescription       854
serviceorderstatus               853
turnover                           0
resolved                           0
ward                               0
zipcode                            0
longitude                          0
latitude                           0
servicerequestid                   0
resolutiondate                     0
adddate                            0
servicecodedescription             0
servicecode                        0
objectid                           0
dtype: int64

In [109]:
csr_test.isnull().sum().sort_values(ascending=False)

servicetypecodedescription    1
turnover                      0
resolved                      0
ward                          0
zipcode                       0
longitude                     0
latitude                      0
priority                      0
servicerequestid              0
status_code                   0
serviceorderstatus            0
resolutiondate                0
adddate                       0
servicecodedescription        0
servicecode                   0
objectid                      0
dtype: int64

##### Spatial data (ward, zipcode)
- Initially, team planned to use `ward`, `zipcode` for anlaysis, however we decided to use geopandas to assign PSA to each values. Therefore, no needs to keep `ward`, `zipcode` here.

In [110]:
csr_train.ward.value_counts(normalize=True)

2      0.207058
6      0.174543
4      0.129094
5      0.124368
1      0.103957
7      0.097007
3      0.093190
8      0.065731
nan    0.005053
Name: ward, dtype: float64

- zipcode has bunch of spoiled values.

In [111]:
csr_train.zipcode.value_counts()

20002    144190
20011    123232
20001    119199
20019     90170
20009     80421
2002      66351
20007     61568
20003     61316
2001      51723
20016     48526
20032     37641
20018     37447
20008     34014
20005     33947
20017     33281
20015     31462
20037     30120
20036     29633
20024     29490
20012     28990
20006     22403
20004     19572
20052      4942
20415      1011
20059       966
2025        854
2024        636
20405       601
20201       553
20057       523
          ...  
20886         1
20744         1
20746         1
31088         1
-1865         1
-12           1
16            1
36            1
60435         1
18            1
20135         1
1             1
20613         1
-4115         1
20903         1
22406         1
2             1
-1254         1
9002          1
2122          1
2077          1
20783         1
11            1
20705         1
83127         1
20906         1
-2326         1
-2207         1
10533         1
22153         1
Name: zipcode, Length: 1

In [112]:
# As stated above, decided to drop two spacial columns
csr_train.drop(['ward', 'zipcode'], axis=1, inplace=True)
csr_test.drop(['ward', 'zipcode'], axis=1, inplace=True)

### Clean servicecode / description

In [113]:
csr_train.columns

Index(['objectid', 'servicecode', 'servicecodedescription',
       'servicetypecodedescription', 'adddate', 'resolutiondate',
       'serviceorderstatus', 'status_code', 'servicerequestid', 'priority',
       'latitude', 'longitude', 'resolved', 'turnover'],
      dtype='object')

In [114]:
# 'service code' and 'service code description' mismatches its values

print("train set service code(counts)  : " ,csr_train['servicecode'].nunique())
print("train set svc_code descr(counts): ", csr_train['servicecodedescription'].nunique())
print("----------")
print("test set service code(counts)   : " ,csr_test['servicecode'].nunique())
print("test set svc_code descr(counts) : ", csr_test['servicecodedescription'].nunique())

train set service code(counts)  :  164
train set svc_code descr(counts):  214
----------
test set service code(counts)   :  103
test set svc_code descr(counts) :  105


In [115]:
# obviously, 'service code description' needs to be cleaned and merged by one adequate description.
# (e.g. dead animal collection v. dead animal pickup) 

csr_train.groupby(["servicecode", "servicecodedescription"]).count().head()

Unnamed: 0_level_0,Unnamed: 1_level_0,objectid,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover
servicecode,servicecodedescription,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
11,dead animal collection,6930,6927,6930,6930,6927,6930,6930,6927,6930,6930,6930,6930
11,dead animal pickup,3506,3506,3506,3506,3506,2704,3506,3506,3506,3506,3506,3506
BEDBUGS,bed bugs,21,21,21,21,21,21,21,21,21,21,21,21
BICYCLE,abandoned bicycle,2556,2556,2556,2556,2556,2556,2556,1532,2556,2556,2556,2556
C62313,christmas tree removal-seasonal,549,549,549,549,549,549,549,549,549,549,549,549


In [116]:
# original code inspration: courtesy of Ben Shaver
# replace duplicated code descriptions in each service code.

def code_breaker(df):
    foo = df.groupby(["servicecode", "servicecodedescription"]).count()
    bar = foo.reset_index().groupby('servicecode')['servicecodedescription'].count().sort_values(ascending=False)

#     print(bar[bar > 1].index)

    for code in bar[bar > 1].index:
        replacement = df[df['servicecode'] == code]['servicecodedescription'].value_counts().argmax()
        df.loc[df['servicecode'] == code, 'servicecodedescription'] = replacement
    return print(df.shape)

code_breaker(csr_train)
code_breaker(csr_test)



(1231233, 14)
(152796, 14)


In [117]:
# set the index to list
svc_descr_train =list(csr_train.servicecodedescription.value_counts().index)
svc_descr_test  =list(csr_test.servicecodedescription.value_counts().index)

In [118]:
# cross check which index is missing to dictionary for the mapping below
[i for i in svc_descr_test if i not in svc_descr_train]

['roadway repair',
 'snow sidewalk shoveling enforcement exemption',
 'utility repair issue',
 'bicycle services',
 'snow/ice removal (roadways and bridge walkways only)',
 'signed street sweeping missed',
 'doee - foam ban / food container requirements',
 'snow other (snow vehicle / property damage)',
 'doee - bag law tips',
 'eviction']

In [119]:
# map descriptions into small numbers of category

mapping = {
    'parking meter repair': 'parking meter repair',
    'bulk collection': 'collection',
    'parking enforcement': 'parking enforcement',
    'pothole': 'maintenance', 
    'streetlight repair investigation': 'light repair',
    'emergency no-parking verification': 'parking enforcement',
    'trash collection - missed': 'collection',
    'alley cleaning': 'street cleaning',
    'sanitation enforcement' : 'sanitation',
    'container removal': 'collection',
    'residential parking permit violation': 'parking enforcement', 
    'recycling collection - missed': 'collection',
    'street cleaning': 'street cleaning', 
    'illegal dumping': 'dumping', 
    'roadway signs': 'signs',
    'sidewalk repair': 'maintenance', 
    'tree inspection': 'tree related',
    'abandoned vehicle - on public property': 'maintenance', 
    'graffiti removal': 'graffiti removal',
    'tree pruning': 'tree related', 
    'rodent inspection and treatment': 'pesticide',
    'snow/ice removal': 'snow related',       
    'tree planting': 'tree related', 
    'tru report': 'report', 
    'out of state parking violation (rosa)': 'parking enforcement',
    'dead animal collection': 'collection', 
    'traffic signal issue': 'signs',
    'dmv - drivers license/id issues': 'dmv related',
    'sidewalk shoveling enforcement exemption': 'maintenance', 
    'street repair': 'maintenance',
    'tree removal': 'tree related', 
    'alleylight repair investigation': 'light repair',
    'yard waste - missed': 'collection', 
    'sign replacement': 'signs',
    'residential snow removal (servedc)': 'snow related',
    'dmv - vehicle registration issues': 'dmv related', 
    'recycling cart delivery': 'collection',
    'trash cart - delivery': 'collection', 
    'alley repair': 'maintenance', 
    'supercan - delivery': 'collection',
    'abandoned vehicle - on private property': 'safety', 
    'grass and weeds mowing': 'maintenance',
    'sign new investigation': 'signs', 
    'fems - community events': 'maintenance', 
    'vacant lot': 'transportation',
    'traffic safety investigation': 'safety', 
    'utility repair investigation': 'maintenance',
    'abandoned bicycle': 'bicycle related', 
    'bicycle issues': 'bicycle related', 
    'curb and gutter repair': 'maintenance',
    'dc government information': 'dc gov',
    'leaf season collection': 'collection',
    'fems - smoke alarm application': 'maintenance', 
    'how is my driving - complaint': 'transportation',
    'public space litter can-collection': 'collection', 
    'roadway striping / markings': 'maintenance',
    'public space litter can- installation/removal/repair': 'maintenance',
    'dmv - vehicle title issues': 'dmv related',
    'dmv - copy of ticket': 'dmv related',
    'supercan - repair': 'maintenance', 
    'sign removal investigation': 'signs',
    'snow removal complaints for sidewalks': 'snow related', 
    'trash cart repair': 'maintenance',
    'illegal poster': 'maintenance', 
    'marking maintenance': 'maintenance',
    'doee - general environmental concerns': 'safety', 
    'street sweeping': 'maintenance',
    'dmv - forms, applications, and manuals request': 'dmv related',
    'doee - construction – erosion runoff': 'maintenance', 
    'bus/rail issues': 'transportation',
    'dmv - online processing issues': 'dmv related', 
    'resident parking permit': 'parking enforcement',
    'recycling cart - repair': 'maintenance', 
    'signs conflicting': 'signs',
    'dpw correspondence tracking': 'dc gov',
    'christmas tree removal-seasonal': 'tree related',
    'child safety seat program': 'safety', 
    'dmv - refunds - tickets': 'dmv related',
    '311force reported issues': 'dc gov', 
    'insect treatment': 'pesticide',
    'christmas tree removal - seasonal': 'tree related', 
    'insects': 'pesticide', 
    'dmv - hearings': 'dmv related',
    'dmv - vehicle insurance lapse': 'dmv related', 
    'dmv - processing center manager': 'dmv related',
    'wire down/power outage': 'maintenance', 
    'doee - nuisance odor complaints': 'maintenance',
    'emergency - trees': 'tree related', 
    'marking modification': 'maintenance',
    'dmv - ticket payment dispute': 'dmv related',
    'school crossing guard': 'safety',
    'safe routes to school': 'safety', 
    'doee - engine idling tips': 'maintenance',
    'dmv - driver and vehicle services refund': 'dmv related',
    'dmv - adjudication supervisor': 'dmv related', 
    'hypothermia shelter information': 'safety',
    'snow metro bus shelter/stop': 'snow related',
    'graffiti removal - paint voucher request': 'graffiti removal',
    'dmv - driver record issues': 'dmv related', 
    'dmv - drivers license/id reinstatement': 'dmv related',
    'fems - fire safety education': 'safety', 
    'dcra - grass and weeds': 'safety',
    'dmv - etims ticket alert services issues': 'dmv related', 
    'snow towing': 'snow related',
    'homeless services - winter/hypothermia season': 'safety',
    'how is my driving - compliment': 'transportation', 
    'recycling- information request': 'maintenance',
    'ouc nye test': 'dc gov', 
    'dmv - vehicle inspection issues': 'dmv related',
    'emergency - power outage/wires down': 'maintenance', 
    'parks and recreation': 'maintenance',
    'ddot citation': 'dc gov', 
    'light-light pole': 'light repair', 
    'illegal fireworks': 'safety',
    'dmv - appeal': 'dmv related', 
    'dmv - offset tracking': 'dmv related', 
    'marking removal': 'maintenance',
    'dcra - trash and debris': 'maintenance', 
    'doee - ban on foam food containers': 'safety',
    'ddoe - bag law tips': 'maintenance', 
    'homeless encampment': 'safety',
    'snow ticket reimbursement': 'snow related', 
    'yard waste - missed - customer follow-up': 'collection',
    'light-infrastructure': 'light repair',
    'recycling collection - missed - customer follow-up': 'collection',
    'light-tunnel/underpass light repair': 'light repair', 
    'recycling - commercial only': 'maintenance',
    'trash collection - missed customer follow-up': 'collection',
    'dc 311 service requests': 'dc gov', 
    'dcra - vacant building': 'maintenance', 
    'snow other': 'snow related',
    'bed bugs': 'pesticide', 
    'fems - honor guard': 'dc gov', 
    'streetcar': 'maintenance',
    'dmv - ticket ombudsman': 'dmv related', 
    'light-overhead guide sign lighting repair': 'light repair',
    'ticket ombudsman': 'maintenance', 
    'dcra - misc': 'maintenance', 
    'sanitation enforcement - customer follow-up': 'maintenance',
    'hoarding': 'maintenance',
    'homeless services - hypothermia/cold/winter - protection items': 'safety',
    'trash container - delivery - customer follow-up': 'maintenance',
    'graffiti removal - customer follow-up': 'graffiti removal', 
    'emergency - flooding': 'safety',
    'recycling - school program': 'safety', 
    'fems - 20/20 vision plan': 'safety',
    'recycling container delivery - customer follow-up': 'collection',
    'dds - serious medication error': 'safety', 
    'supercan - repair - customer follow-up': 'maintenance',
    'emergency - senior assistance': 'safety',
    'dcra - zoning': 'maintenance',
    'report invalid address to gis dept': 'safety',
    'bulk collection - unscheduled': 'collection',
    'homeless services - hypothermia/cold/winter - safety checks': 'safety',
    'dhs - iris update': 'maintenance', 
    'dds - theft of personal property': 'safety',
    'emergency - transportation': 'transportation',
    'emergency - supplies': 'maintenance',
    'school transit subsidy program': 'safety',
    'homeless services - hypothermia/cold/winter - transport to shelter': 'safety',
    'emergency - heating and cooling': 'maintenance',
    'signs - conflicting': 'signs',
    'survey sr type': 'dc gov',
    'roadway repair': 'maintenance',
    'snow sidewalk shoveling enforcement exemption': 'snow related', 
    'utility repair issue': 'maintenance',
    'bicycle services': 'bicycle related',
    'snow/ice removal (roadways and bridge walkways only)': 'snow related',
    'signed street sweeping missed': 'maintenance',
    'doee - foam ban / food container requirements': 'safety',
    'snow other (snow vehicle / property damage)': 'snow related',
    'doee - bag law tips': 'maintenance',
    'eviction': 'safety'
}

In [120]:
csr_train['servicecodedescription'] = csr_train['servicecodedescription'].map(mapping)
csr_train['servicecodedescription'].value_counts()


collection              308906
parking meter repair    223577
parking enforcement     206177
maintenance             130821
tree related             54693
light repair             52047
street cleaning          51731
signs                    41402
sanitation               29834
dmv related              23652
snow related             21505
dumping                  20510
graffiti removal         15924
pesticide                14497
report                   11253
safety                   10129
transportation            5893
bicycle related           5052
dc gov                    3630
Name: servicecodedescription, dtype: int64

In [121]:
csr_test['servicecodedescription'] = csr_test['servicecodedescription'].map(mapping)
csr_test['servicecodedescription'].value_counts()


collection              41098
parking enforcement     32257
parking meter repair    18817
maintenance             16484
signs                    7381
tree related             5975
light repair             5400
sanitation               5096
street cleaning          3969
dmv related              3277
dumping                  2927
graffiti removal         2735
pesticide                2328
report                   1629
safety                   1229
snow related              862
bicycle related           674
transportation            601
dc gov                     57
Name: servicecodedescription, dtype: int64

In [122]:
print(csr_train['servicecodedescription'].nunique())
print(csr_test['servicecodedescription'].nunique())

19
19


In [123]:
csr_train.isnull().sum().sort_values(ascending=False)

status_code                   151801
priority                        2677
servicetypecodedescription       854
serviceorderstatus               853
turnover                           0
resolved                           0
longitude                          0
latitude                           0
servicerequestid                   0
resolutiondate                     0
adddate                            0
servicecodedescription             0
servicecode                        0
objectid                           0
dtype: int64

In [124]:
csr_test.isnull().sum().sort_values(ascending=False)

servicetypecodedescription    1
turnover                      0
resolved                      0
longitude                     0
latitude                      0
priority                      0
servicerequestid              0
status_code                   0
serviceorderstatus            0
resolutiondate                0
adddate                       0
servicecodedescription        0
servicecode                   0
objectid                      0
dtype: int64

In [125]:
csr_train.head()

Unnamed: 0,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover
0,463232,S0011,street cleaning,Street Cleaning,2014-01-02 13:27:40,2014-01-15 07:43:42.000000000,CLOSED,,14-00000654,STANDARD,38.89795,-76.972732,1,306.0
1,463233,S0321,collection,Recycling,2014-01-02 13:46:57,2014-01-06 12:39:39.000000000,CLOSED,,14-00000686,STANDARD,38.922857,-76.991905,1,94.0
2,463234,S0031,collection,Bulk Collection,2014-01-02 13:57:46,2014-01-14 14:29:16.000000000,CLOSED,,14-00000707,STANDARD,38.874742,-76.970889,1,288.0
3,463235,S0311,pesticide,DOH,2014-01-02 13:43:20,1753-08-29 22:43:41.128654848,OPEN,,14-00000677,STANDARD,38.942811,-77.022676,0,-0.0
4,463236,S0276,parking meter repair,TOA,2014-01-02 16:00:59,2014-01-07 16:33:48.000000000,CLOSED,,14-00000877,STANDARD,38.898952,-77.048838,1,120.0


### Geopandas to PSA

In [126]:
import geopandas as gpd
from shapely.geometry import Point

In [127]:

geometry = [Point(xy) for xy in zip(csr_train['longitude'], csr_train['latitude'])]
crs = {'init': 'epsg:4326'}
gdf = gpd.GeoDataFrame(csr_train, geometry=geometry, crs=crs)

psa = gpd.read_file('./assets/Police_Service_Areas.geojson')

psa = psa[['PSA','geometry']]

csr_train = gpd.sjoin(gdf, psa, how='left', op='within')


In [128]:
csr_train.head()

Unnamed: 0,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover,geometry,index_right,PSA
0,463232,S0011,street cleaning,Street Cleaning,2014-01-02 13:27:40,2014-01-15 07:43:42.000000000,CLOSED,,14-00000654,STANDARD,38.89795,-76.972732,1,306.0,POINT (-76.97273246 38.89794972),22.0,507.0
1,463233,S0321,collection,Recycling,2014-01-02 13:46:57,2014-01-06 12:39:39.000000000,CLOSED,,14-00000686,STANDARD,38.922857,-76.991905,1,94.0,POINT (-76.99190473 38.92285708),42.0,504.0
2,463234,S0031,collection,Bulk Collection,2014-01-02 13:57:46,2014-01-14 14:29:16.000000000,CLOSED,,14-00000707,STANDARD,38.874742,-76.970889,1,288.0,POINT (-76.97088871 38.87474188),10.0,605.0
3,463235,S0311,pesticide,DOH,2014-01-02 13:43:20,1753-08-29 22:43:41.128654848,OPEN,,14-00000677,STANDARD,38.942811,-77.022676,0,-0.0,POINT (-77.02267596 38.94281117),46.0,407.0
4,463236,S0276,parking meter repair,TOA,2014-01-02 16:00:59,2014-01-07 16:33:48.000000000,CLOSED,,14-00000877,STANDARD,38.898952,-77.048838,1,120.0,POINT (-77.04883778217619 38.8989521053866),23.0,207.0


In [129]:
geometry = [Point(xy) for xy in zip(csr_test['longitude'], csr_test['latitude'])]
crs = {'init': 'epsg:4326'}
gdf = gpd.GeoDataFrame(csr_test, geometry=geometry, crs=crs)

psa = gpd.read_file('./assets/Police_Service_Areas.geojson')

psa = psa[['PSA','geometry']]

csr_test = gpd.sjoin(gdf, psa, how='left', op='within')
csr_test.reset_index()

Unnamed: 0,index,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover,geometry,index_right,PSA
0,0,308864,S05SL,light repair,Transportation Operations Administration,2018-01-02 11:19:40,2018-01-03 20:28:07.000000000,CLOSED,CLOSED,18-00001277,STANDARD,38.943383,-77.017422,1,33.0,POINT (-77.01742151000001 38.94338256),46.0,407.0
1,1,308865,EMNPV,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 16:02:18,2018-01-03 19:24:34.000000000,CLOSED,CLOSED,18-00004925,STANDARD,38.926447,-77.023318,1,3.0,POINT (-77.0233175 38.92644728),35.0,304.0
2,2,308866,S0261,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 18:03:59,2018-01-03 19:52:35.000000000,CLOSED,CLOSED,18-00005187,STANDARD,38.889467,-76.977983,1,1.0,POINT (-76.97798256 38.88946701),16.0,108.0
3,3,308867,HYPSHEIN,safety,311- Call Center,2018-01-03 18:40:16,2018-01-03 18:40:16.000000000,CLOSED,CLOSED,18-00005252,STANDARD,38.916998,-77.031951,1,0.0,POINT (-77.0319507377 38.9169981213),31.0,305.0
4,4,308868,S0261,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 18:09:57,2018-01-03 18:49:01.000000000,VOIDED,CLOSED,18-00005198,STANDARD,38.924503,-77.028874,1,0.0,POINT (-77.02887352 38.92450322),35.0,304.0
5,5,308869,S0321,collection,SWMA- Solid Waste Management Admistration,2018-01-03 20:26:42,2018-01-08 14:48:07.000000000,CLOSED,CLOSED,18-00005351,STANDARD,38.914609,-77.029331,1,114.0,POINT (-77.02933118999999 38.9146089),31.0,305.0
6,6,308870,S0321,collection,SWMA- Solid Waste Management Admistration,2018-01-03 19:54:09,2018-01-08 04:53:25.000000000,CLOSED,CLOSED,18-00005323,STANDARD,38.892373,-76.919587,1,104.0,POINT (-76.9195865305 38.892373122),18.0,608.0
7,7,308871,S0261,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 13:51:20,2018-01-03 21:32:21.000000000,CLOSED,CLOSED,18-00004455,STANDARD,38.908443,-76.983073,1,7.0,POINT (-76.98307265 38.90844316),25.0,506.0
8,8,308872,S0321,collection,SWMA- Solid Waste Management Admistration,2018-01-04 04:46:54,2018-01-07 08:43:41.000000000,CLOSED,CLOSED,18-00005454,STANDARD,38.889487,-76.923695,1,75.0,POINT (-76.92369518 38.88948669),13.0,604.0
9,9,308873,SRC02,snow related,SNOW,2018-01-04 06:09:56,2018-01-08 15:52:37.000000000,CLOSED,CLOSED,18-00005465,STANDARD,38.897048,-76.974572,1,105.0,POINT (-76.97457181999999 38.89704768),22.0,507.0


In [130]:
csr_train.reset_index()
csr_test.reset_index()

Unnamed: 0,index,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover,geometry,index_right,PSA
0,0,308864,S05SL,light repair,Transportation Operations Administration,2018-01-02 11:19:40,2018-01-03 20:28:07.000000000,CLOSED,CLOSED,18-00001277,STANDARD,38.943383,-77.017422,1,33.0,POINT (-77.01742151000001 38.94338256),46.0,407.0
1,1,308865,EMNPV,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 16:02:18,2018-01-03 19:24:34.000000000,CLOSED,CLOSED,18-00004925,STANDARD,38.926447,-77.023318,1,3.0,POINT (-77.0233175 38.92644728),35.0,304.0
2,2,308866,S0261,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 18:03:59,2018-01-03 19:52:35.000000000,CLOSED,CLOSED,18-00005187,STANDARD,38.889467,-76.977983,1,1.0,POINT (-76.97798256 38.88946701),16.0,108.0
3,3,308867,HYPSHEIN,safety,311- Call Center,2018-01-03 18:40:16,2018-01-03 18:40:16.000000000,CLOSED,CLOSED,18-00005252,STANDARD,38.916998,-77.031951,1,0.0,POINT (-77.0319507377 38.9169981213),31.0,305.0
4,4,308868,S0261,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 18:09:57,2018-01-03 18:49:01.000000000,VOIDED,CLOSED,18-00005198,STANDARD,38.924503,-77.028874,1,0.0,POINT (-77.02887352 38.92450322),35.0,304.0
5,5,308869,S0321,collection,SWMA- Solid Waste Management Admistration,2018-01-03 20:26:42,2018-01-08 14:48:07.000000000,CLOSED,CLOSED,18-00005351,STANDARD,38.914609,-77.029331,1,114.0,POINT (-77.02933118999999 38.9146089),31.0,305.0
6,6,308870,S0321,collection,SWMA- Solid Waste Management Admistration,2018-01-03 19:54:09,2018-01-08 04:53:25.000000000,CLOSED,CLOSED,18-00005323,STANDARD,38.892373,-76.919587,1,104.0,POINT (-76.9195865305 38.892373122),18.0,608.0
7,7,308871,S0261,parking enforcement,PEMA- Parking Enforcement Management Administr...,2018-01-03 13:51:20,2018-01-03 21:32:21.000000000,CLOSED,CLOSED,18-00004455,STANDARD,38.908443,-76.983073,1,7.0,POINT (-76.98307265 38.90844316),25.0,506.0
8,8,308872,S0321,collection,SWMA- Solid Waste Management Admistration,2018-01-04 04:46:54,2018-01-07 08:43:41.000000000,CLOSED,CLOSED,18-00005454,STANDARD,38.889487,-76.923695,1,75.0,POINT (-76.92369518 38.88948669),13.0,604.0
9,9,308873,SRC02,snow related,SNOW,2018-01-04 06:09:56,2018-01-08 15:52:37.000000000,CLOSED,CLOSED,18-00005465,STANDARD,38.897048,-76.974572,1,105.0,POINT (-76.97457181999999 38.89704768),22.0,507.0


In [131]:
# By examining 11 null PSA data points, it is all location Maryland side of the state boarder. 
# Therefore, it has no PSA assignment and we decided to drop values.

csr_train[csr_train.PSA.isnull()]

Unnamed: 0,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover,geometry,index_right,PSA
359657,810193,S0276,parking meter repair,TOA,2015-03-14 09:06:56,2015-03-14 09:48:05,CLOSED,CLOSED,15-00057929,STANDARD,38.974781,-77.013805,1,0.0,POINT (-77.01380472855159 38.974781058004),,
361301,811994,S0261,parking enforcement,Parking Enforcement,2015-03-16 08:23:08,2015-03-16 09:18:10,CLOSED,CLOSED,15-00059020,STANDARD,38.974781,-77.013805,1,0.0,POINT (-77.01380472855159 38.974781058004),,
386281,836451,S0301,maintenance,Street & Bridge Maintenance,2015-04-06 13:35:27,2015-04-11 15:24:03,CLOSED,CLOSED,15-00083523,EMERGENCY,38.974781,-77.013805,1,121.0,POINT (-77.01380472855159 38.974781058004),,
422659,929948,S0457,tree related,Urban Forrestry,2015-05-19 11:16:11,2015-05-20 09:24:38,CLOSED,CLOSED,15-00125094,EMERGENCY,38.974781,-77.013805,1,22.0,POINT (-77.01380472855159 38.974781058004),,
486403,1051125,S0301,maintenance,Street & Bridge Maintenance,2015-03-30 10:19:57,2015-04-08 10:50:43,CLOSED,CLOSED,15-00075932,STANDARD,38.96092,-77.086078,1,216.0,POINT (-77.086078338608 38.9609197722797),,
717491,1288399,S0276,parking meter repair,Transportation Operations Administration,2016-05-02 12:49:58,2016-05-02 13:33:51,CLOSED,CLOSED,16-00486049,STANDARD,38.974781,-77.013805,1,0.0,POINT (-77.01380472861619 38.9747813913021),,
747953,1319113,S0166,maintenance,SWMA- Solid Waste Management Admistration,2016-05-26 17:14:09,2016-06-01 15:41:04,CLOSED,CLOSED,16-00523158,STANDARD,38.961193,-77.085719,1,142.0,POINT (-77.08571903000001 38.96119269),,
1012261,1585730,SIGTRAMA,signs,Toa- Trans Sys Mnt,2017-04-27 15:28:29,2017-05-04 09:24:14,CLOSED,CLOSED,17-00214970,STANDARD,38.974781,-77.013805,1,161.0,POINT (-77.01380472861619 38.9747813913021),,
1090334,1676507,S0276,parking meter repair,Transportation Operations Administration,2017-07-24 17:29:48,2017-07-24 18:43:09,CLOSED,CLOSED,17-00400767,STANDARD,38.974781,-77.013805,1,1.0,POINT (-77.01380472861619 38.9747813913021),,
1105404,1696451,MARKINST,maintenance,Transportation Operations Administration,2017-08-07 12:50:18,2017-09-20 15:38:17,CLOSED,CLOSED,17-00435055,STANDARD,38.974781,-77.013805,1,1058.0,POINT (-77.01380472861619 38.9747813913021),,


In [132]:
# 2 PSA null data turns out Maryland side. decided to drop
csr_test[csr_test.PSA.isnull()]

Unnamed: 0,objectid,servicecode,servicecodedescription,servicetypecodedescription,adddate,resolutiondate,serviceorderstatus,status_code,servicerequestid,priority,latitude,longitude,resolved,turnover,geometry,index_right,PSA
39640,357097,GRAFF,graffiti removal,SWMA- Solid Waste Management Admistration,2018-02-20 15:20:10,2018-05-01 06:13:30,CLOSED,CLOSED,18-00088276,STANDARD,38.974781,-77.013805,1,1670.0,POINT (-77.01380472859999 38.9747813913),,
108043,447888,S0276,parking meter repair,Transportation Operations Administration,2018-05-05 12:49:07,2018-05-08 13:03:57,CLOSED,CLOSED,18-00228733,STANDARD,38.974781,-77.013805,1,72.0,POINT (-77.01380472859999 38.9747813913),,


In [133]:
csr_train.dropna(subset=['PSA'], inplace=True)
csr_train.PSA=csr_train.PSA.astype(int)

csr_test.dropna(subset=['PSA'], inplace=True)
csr_test.PSA=csr_test.PSA.astype(int)

In [134]:
csr_train.columns

Index(['objectid', 'servicecode', 'servicecodedescription',
       'servicetypecodedescription', 'adddate', 'resolutiondate',
       'serviceorderstatus', 'status_code', 'servicerequestid', 'priority',
       'latitude', 'longitude', 'resolved', 'turnover', 'geometry',
       'index_right', 'PSA'],
      dtype='object')

In [135]:
# drop unnecessary columns

cols = ['servicerequestid', 'servicetypecodedescription', 
        'objectid', 'servicecode', 'resolutiondate',
        'longitude', 'latitude', 'index_right', 'geometry']
csr_train.drop(cols, axis=1, inplace=True)
csr_test.drop(cols, axis=1, inplace=True)


### Feature Engineering

##### Date separator

In [136]:
# def date_separate(df):
#     df = df.copy()
#     df['Year'] = pd.DatetimeIndex(df['adddate']).year
#     df['Month'] = pd.DatetimeIndex(df['adddate']).month
#     df['Day'] = pd.DatetimeIndex(df['adddate']).day
#     df.drop(['adddate'], axis=1, inplace=True)
#     return df

# csr_train = date_separate(csr_train)
# csr_test  = date_separate(csr_test)

##### Groupby datasets

In [137]:
csr_train.rename({'servicecodedescription': 'svc_descr'}, axis=1, inplace=True)
csr_test.rename({'servicecodedescription': 'svc_descr'}, axis=1, inplace=True)

In [138]:
csr_train.groupby(['PSA', 'svc_descr']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,adddate,serviceorderstatus,status_code,priority,resolved,turnover
PSA,svc_descr,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
101,bicycle related,161,161,157,141,161,161
101,collection,110,110,103,110,110,110
101,dc gov,54,54,50,54,54,54
101,dmv related,57,57,56,57,57,57
101,dumping,17,17,14,17,17,17
101,graffiti removal,70,70,69,70,70,70
101,light repair,552,552,538,552,552,552
101,maintenance,1239,1239,1085,1239,1239,1239
101,parking enforcement,2922,2922,2690,2922,2922,2922
101,parking meter repair,16889,16873,13978,16873,16889,16889


In [139]:

def svc_descr(df):
    df['svc_descr'] = [type_ if type_ in ['collection', 'maintenance', 
                                         'light repair','graffiti removal',
                                         'street cleaning', 'parking meter repair']
                      else "OTHER" for type_ in df['svc_descr']]
    return print(df.shape)
svc_descr(csr_train)
svc_descr(csr_test)

(1231222, 8)
(152794, 8)


In [140]:
csr_train['svc_descr'].value_counts()

OTHER                   448223
collection              308906
parking meter repair    223574
maintenance             130817
light repair             52047
street cleaning          51731
graffiti removal         15924
Name: svc_descr, dtype: int64

In [141]:
csr_test['svc_descr'].value_counts()

OTHER                   64293
collection              41098
parking meter repair    18816
maintenance             16484
light repair             5400
street cleaning          3969
graffiti removal         2734
Name: svc_descr, dtype: int64

In [142]:
csr_train = pd.get_dummies(csr_train, columns=['svc_descr'])
csr_test = pd.get_dummies(csr_test, columns=['svc_descr'])


In [143]:
cats = ['svc_descr_collection','svc_descr_maintenance',
     'svc_descr_light repair', 'svc_descr_graffiti removal',
     'svc_descr_street cleaning', 'svc_descr_parking meter repair']

In [144]:
# csr_train.groupby(['PSA', 'Year', 'Month'])[cats].agg(sum)

In [145]:
csr_train.groupby(['PSA'])['svc_descr_collection',
                           'svc_descr_maintenance',
                           'svc_descr_light repair',
                           'svc_descr_graffiti removal',
                           'svc_descr_street cleaning',
                           'svc_descr_parking meter repair'].agg(sum)

Unnamed: 0_level_0,svc_descr_collection,svc_descr_maintenance,svc_descr_light repair,svc_descr_graffiti removal,svc_descr_street cleaning,svc_descr_parking meter repair
PSA,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
101,110.0,1239.0,552.0,70.0,101.0,16889.0
102,75.0,1199.0,484.0,130.0,101.0,21196.0
103,367.0,1870.0,441.0,61.0,271.0,10506.0
104,11739.0,2864.0,1118.0,492.0,3240.0,4654.0
105,1325.0,2611.0,712.0,131.0,362.0,17447.0
106,4017.0,1587.0,806.0,102.0,473.0,3650.0
107,11967.0,3116.0,1850.0,179.0,1316.0,7090.0
108,11287.0,2967.0,1314.0,102.0,2075.0,205.0
201,9958.0,3936.0,2448.0,78.0,333.0,1459.0
202,9773.0,4369.0,1721.0,301.0,398.0,8644.0


##### Save it to pickle

In [146]:
import pickle

In [147]:
csr_train.to_pickle("./assets/csr/csr_train_EDAed.pkl")

In [148]:
csr_test.to_pickle("./assets/csr/csr_test_EDAed.pkl")