# Feature Extraction on 311 Data

Find just the top ten 311 complaints
Join 311 data with NYC zipcodes to ensure that all reported zicodes are valid
Create a zipcode fingerprint of the ratios of each type of call by using a pivot table
We'll do this again for each year, in order to get our training and evaluation sets later on.


Sampling is important because we have too much data to work with effectively. 

The 311 call information represents 20 million calls, and over 11 gigs of data. So I stripped out the columns that I though would be most useful, and from those I'll sample down to 10,000 rows.

In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize']=(16,16)

## read in the sampled 311 calls

Now, I have both the sampled and unsampled at this point. And I think it makes sense to use the entire 311 call data as long as it is constrained to the weather data time limits, starting at 01/01/2015

In [2]:
df_311_sampled = pd.read_csv("csvs/311_unsampled.csv",            
                        usecols=["Unique Key","Created Date","Agency", "Agency Name","Complaint Type","Descriptor","Incident Zip"],  
                        index_col=0,
                        dtype={"Incident Zip": 'str'})

  mask |= (ar1 == a)


In [3]:
df_311_sampled['Date Only'] = df_311_sampled['Created Date'].apply(lambda x: x[0:10])
df_311_sampled.drop(columns=['Created Date'], inplace=True)
#df_311_sampled

## Aggregate all the calls by date

At this point it makes sense to aggregate all the calls by date, as long as we don't lose call type information or location.  I'll save it and call it Incidents

In [4]:
df_incidents = (df_311_sampled.groupby(['Date Only']).count())
df_incidents.drop(axis=1, columns=['Agency','Agency Name','Complaint Type', 'Descriptor'], inplace=True)
df_incidents.rename(columns={'Incident Zip': 'count'}, inplace=True)

df_incidents.to_csv('csvs/incidents.csv')
df_incidents = pd.read_csv("csvs/incidents.csv", parse_dates=[0]) 


In [5]:
df_incidents['count'].std()

1187.9181970473126

In [6]:
df_incidents['count'].mean()

6496.650436953808

In [7]:
df_incidents.shape

(1602, 2)

## Just get counts of most frequent call types

There is probably an easier way or more correct way to do this. But it works.

In [8]:
pd.DataFrame(df_311_sampled.groupby(['Complaint Type']).count()).sort_values(by=['Agency'], ascending=False)

Unnamed: 0_level_0,Agency,Agency Name,Descriptor,Incident Zip,Date Only
Complaint Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
HEAT/HOT WATER,1008816,1008816,1008816,1001770,1008816
Noise - Residential,961345,961345,961345,958734,961345
Illegal Parking,592285,592285,592285,587872,592285
Blocked Driveway,544351,544351,544351,542687,544351
Street Condition,452214,452214,452214,438224,452214
Street Light Condition,375473,375473,375468,226485,375473
UNSANITARY CONDITION,355950,355950,355950,355569,355950
Water System,307990,307990,307990,302164,307990
Request Large Bulky Item Collection,300118,300118,300118,298891,300118
Noise - Street/Sidewalk,279389,279389,279389,277784,279389


In [9]:
df_top_ten = df_311_sampled[df_311_sampled['Complaint Type'].isin(["HEAT/HOT WATER", "Noise - Residential", "Illegal Parking", "Blocked Driveway", "Street Condition", "Street Light Condition", "UNSANITARY CONDITION", "Water System", "Request Large Bulky Item Collection", "Noise - Street/Sidewalk"
                                                            ])]

In [10]:
df_top_ten.to_csv("csvs/top_ten_311.csv")

In [11]:
df_top_ten.head()

Unnamed: 0_level_0,Agency,Agency Name,Complaint Type,Descriptor,Incident Zip,Date Only
Unique Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
31723508,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,10023,2015-10-10
31723509,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,10034,2015-10-10
31723510,DEP,Department of Environmental Protection,Water System,Hydrant Running (WC3),10021,2015-10-10
31723512,DOT,Department of Transportation,Street Condition,Pothole,10470,2015-10-10
31723515,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,11224,2015-10-10


In [12]:
df_zips = pd.read_csv("csvs/nyc-zip-code-latitude-and-longitude.csv", usecols=[1], dtype={0: 'str', 1:'str'}) 
#df_incidents.drop(axis=1, columns=['Unique Key'], inplace=True)
#df_incidents.set_index('Date Only', inplace=True)

## Zipcode Trick

I will join the top ten incidents list with the valid zipcodes in NYC. This will remove any zipcodes reported outside of NYC.


In [13]:
df_311zip = df_top_ten.join(df_zips, on='Incident Zip', how="inner")
#pd.DataFrame(df_311_sampled.groupby(['Incident Zip','Complaint Type']).count())

print(df_311zip.shape)
df_311zip.dropna(inplace=True)
df_311zip.head()

(4913644, 7)


Unnamed: 0_level_0,Agency,Agency Name,Complaint Type,Descriptor,Incident Zip,Date Only,Zip
Unique Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
31723508,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,10023,2015-10-10,10023
31724314,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,10023,2015-10-10,10023
31724946,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,10023,2015-10-10,10023
31725751,HPD,Department of Housing Preservation and Develop...,HEAT/HOT WATER,ENTIRE BUILDING,10023,2015-10-10,10023
31726082,NYPD,New York City Police Department,Noise - Residential,Loud Music/Party,10023,2015-10-10,10023


In [14]:
df_311zip.to_csv("csvs/top_ten_w_zips.csv")

In [15]:
df_311zip_one_year = df_311zip[df_311zip["Date Only"] >= '2018-5-29']

In [16]:
df_311zip_one_year_grouped = df_311zip_one_year.groupby(['Zip','Complaint Type']).count()

In [17]:
df_311zip_one_year_grouped.to_csv("csvs/zips_grouped.csv")

## Get percentage of each call type by zip code

In this trick we have to get the total number of calls per zipcode. And then get the total number of calls per call type within each zipcode. Then we can divide to get a normalized percentage, or really a range from 0 to 1 for each  call type.

In [18]:
df_zip_one_year_grouped = df_311zip_one_year.groupby('Zip').count()
df_zip_one_year_grouped.head()                                                             

Unnamed: 0_level_0,Agency,Agency Name,Complaint Type,Descriptor,Incident Zip,Date Only
Zip,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
10001,1509,1509,1509,1509,1509,1509
10002,3876,3876,3876,3876,3876,3876
10003,2824,2824,2824,2824,2824,2824
10004,235,235,235,235,235,235
10005,250,250,250,250,250,250


In [19]:
df_311zip_one_year_grouped2 = df_311zip_one_year_grouped.div(df_zip_one_year_grouped, level='Zip')['Agency']

df_311zip_one_year_grouped2 = df_311zip_one_year_grouped2.reindex()
df_311zip_one_year_grouped2.to_csv("csvs/percent_complaints_by_zip.csv")

df_311zip_one_year_grouped2.head()


  after removing the cwd from sys.path.


Zip    Complaint Type         
10001  Blocked Driveway           0.029821
       HEAT/HOT WATER             0.111332
       Illegal Parking            0.203446
       Noise - Residential        0.228628
       Noise - Street/Sidewalk    0.104042
Name: Agency, dtype: float64

In [20]:
df_311_by_zip = pd.read_csv("csvs/percent_complaints_by_zip.csv")
df_311_by_zip.rename(index=str, columns={"10001":"Zip", "Blocked Driveway": "Complaint", "0.02982107355864811":"Fraction" }, inplace=True)
df_311_by_zip.head()

Unnamed: 0,Zip,Complaint,Fraction
0,10001,HEAT/HOT WATER,0.111332
1,10001,Illegal Parking,0.203446
2,10001,Noise - Residential,0.228628
3,10001,Noise - Street/Sidewalk,0.104042
4,10001,Request Large Bulky Item Collection,0.088138


## Make a pivot table

This gives us a table form of each call type by zipcode, which we can use in the next step with a k-means cluster analysis.



In [None]:
df_zip_by_311 = pd.pivot_table(df_311_by_zip, index=["Zip"],columns=['Complaint']).fillna(0)

In [22]:
df_zip_by_311.to_csv("csvs/zip_by_311.csv")

In [23]:
df_zip_by_311.head()

Unnamed: 0_level_0,Fraction,Fraction,Fraction,Fraction,Fraction,Fraction,Fraction,Fraction,Fraction,Fraction
Complaint,Blocked Driveway,HEAT/HOT WATER,Illegal Parking,Noise - Residential,Noise - Street/Sidewalk,Request Large Bulky Item Collection,Street Condition,Street Light Condition,UNSANITARY CONDITION,Water System
Zip,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
10001,0.0,0.111332,0.203446,0.228628,0.104042,0.088138,0.153082,0.003313,0.017893,0.060305
10002,0.013932,0.271156,0.186533,0.241744,0.067595,0.081269,0.063467,0.025026,0.025542,0.023736
10003,0.008144,0.211402,0.172096,0.188031,0.088173,0.157932,0.098442,0.002833,0.03187,0.041076
10004,0.017021,0.029787,0.331915,0.055319,0.059574,0.038298,0.33617,0.076596,0.008511,0.046809
10005,0.016,0.068,0.32,0.092,0.128,0.12,0.16,0.036,0.016,0.044
