<h1 style="text-align:center">Data Science and Machine Learning Capstone Project</h1>
<img style="float:right" src="https://prod-edxapp.edx-cdn.org/static/edx.org/images/logo.790c9a5340cb.png">
<p style="text-align:center">IBM: DS0720EN</p>
<p style="text-align:center">Question 2 of 4</p>

1. [Problem Statement](#problem)
2. [Question 2](#question)
3. [Analyzing and Visualizing](#analysis)
4. [Concluding Remarks](#conclusion)

<a id="problem"></a>
## Problem Statement
---

The people of New York use the 311 system to report complaints about non-emergency problems to local authorities. These problems are assigned to various agencies in New York. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, the large volume of complaints and the sudden increase is impacting the overall efficiency of operations of the agency.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions, and those answers must be supported by data and analytics. This notebook addresses the second of those questions.

<a id="question"></a>
## Question 2
---

Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or street (where the complaints are severe) for the specific type of complaints you identified in response to Question 1?

### Approach
Analyze the data to see if there is a higher correlation between the HEATING complaints and any particular borough, ZIP code, or street.

### Load Data
Separately, the [New York 311](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) data was loaded through [SODA](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status) into a pandas DataFrame and then saved to a pickle file.

In [1]:
import pandas as pd
df = pd.read_pickle('C:\\Users\\It_Co\\Documents\\DataScience\\Capstone\\ny311full.pkl') # Local
#df = pd.read_pickle('./ny311.pkl') #IBM Cloud / Watson Studio
df.shape

(5862383, 15)
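
The separate loading step is not shown in this notebook. A minimal sketch of how it might look, assuming the SODA endpoint and column list from the link above; `build_soda_url` and `load_and_pickle` are hypothetical helper names, and the download itself needs network access:

```python
from urllib.parse import urlencode

# Endpoint and column list copied from the SODA link in the "Load Data" section.
SODA_ENDPOINT = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"
COLUMNS = (
    "created_date", "unique_key", "complaint_type", "incident_zip",
    "incident_address", "street_name", "address_type", "city",
    "resolution_description", "borough", "latitude", "longitude",
    "closed_date", "location_type", "status",
)

def build_soda_url(limit=1000, agency="HPD", columns=COLUMNS):
    """Build a SODA query URL restricted to one agency and a subset of columns."""
    params = {"$limit": limit, "Agency": agency, "$select": ",".join(columns)}
    # safe="$," keeps the SODA parameter names and the comma-separated list readable.
    return SODA_ENDPOINT + "?" + urlencode(params, safe="$,")

def load_and_pickle(url, pickle_path):
    """Download the CSV into a DataFrame and save it as a pickle (hypothetical helper)."""
    import pandas as pd  # imported lazily so the URL helper works without pandas
    frame = pd.read_csv(url)
    frame.to_pickle(pickle_path)
    return frame

# Small illustrative query: three columns, five rows.
url = build_soda_url(limit=5, columns=("created_date", "unique_key", "complaint_type"))
print(url)
```

Starting with a small `limit` is a cheap way to check the column selection before requesting the full data set.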

<a id="analysis"></a>
### Analyzing and Visualizing
---

#### Reduce data to relevant rows and columns

In [2]:
#Remove rows that were not for the complaint types identified in question one.
df.drop(df[df["complaint_type"].isin(["HEAT/HOT WATER","HEATING"])==False].index, inplace=True)
#Double check that the correct rows were removed.
df['complaint_type'].value_counts()

HEAT/HOT WATER    1152592
HEATING            887869
Name: complaint_type, dtype: int64

In [3]:
#Remove columns deemed unnecessary for this question.
df.drop(['created_date','complaint_type','resolution_description','closed_date','location_type','status','address_type'], axis=1, inplace=True)
df.shape

(2040461, 8)

#### Wrangle (Clean) the data

##### Identify and handle missing values

In [4]:
#See if any data is null.
df.isnull().sum()

unique_key              0
incident_zip        18970
incident_address        1
street_name             1
city                18843
borough                 0
latitude            18966
longitude           18966
dtype: int64

<p style="color:Red;"><b>Analysis</b>:  There are three items the question hinges upon:  ZIP Code, Borough, and Street.</p>

In [5]:
#There were some null zip codes.  Examine them more closely.
df[df['incident_zip'].isnull()==True].describe(include="all")

Unnamed: 0,unique_key,incident_zip,incident_address,street_name,city,borough,latitude,longitude
count,18970.0,0.0,18969,18969,127,18970,4.0,4.0
unique,18970.0,,8306,2205,2,6,,
top,24968225.0,,34 ARDEN STREET,GRAND CONCOURSE,QUEENS,BRONX,,
freq,1.0,,232,365,121,5957,,
mean,,,,,,,40.724393,-73.849776
std,,,,,,,0.0,0.0
min,,,,,,,40.724393,-73.849776
25%,,,,,,,40.724393,-73.849776
50%,,,,,,,40.724393,-73.849776
75%,,,,,,,40.724393,-73.849776


<p style="color:Red;"><b>Decision</b>:  There is nothing "worth the trouble" in the data upon which to try to fill in the missing zip codes.  If these 18K rows were deemed super important, a library for looking up a zip code based on an address would be necessary.  Because these rows still have Borough, don't drop the row.  Just leave them as NaN values.</p>
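
If these rows were later deemed important, one option short of an external lookup library would be to borrow the ZIP code from other complaints at the same address. This is a swapped-in technique, not something the notebook does; the sketch below uses toy data with hypothetical values and assumes the most frequent known ZIP per address is the right one:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) shaped like the relevant columns.
toy = pd.DataFrame({
    "incident_address": ["34 ARDEN STREET", "34 ARDEN STREET", "1 MAIN STREET", "9 ELM ROAD"],
    "incident_zip": [10040.0, np.nan, 11201.0, np.nan],
})

# Most frequent known ZIP per address, built only from rows that have a ZIP.
known = toy.dropna(subset=["incident_zip"])
zip_by_address = known.groupby("incident_address")["incident_zip"].agg(lambda s: s.mode().iloc[0])

# Fill only the missing ZIPs; addresses with no known ZIP stay NaN.
filled = toy["incident_zip"].fillna(toy["incident_address"].map(zip_by_address))
print(filled.tolist())
```

Addresses are only unique within a borough, so on the real data the grouping key would probably need to include `borough` as well.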

In [6]:
# Let's take a look at the one row with the missing street.
df[df["street_name"].isnull()]

Unnamed: 0,unique_key,incident_zip,incident_address,street_name,city,borough,latitude,longitude
251660,21833329,,,,,BRONX,,


<p style="color:Red;"><b>Decision</b>:  There is nothing in the data upon which to base a guess at the street.  To keep the later standardization of street_name simpler, drop the row with the null street.</p>

In [7]:
# Drop it.
df.drop(df[df["street_name"].isnull()].index, axis=0, inplace=True)

##### Standardize values from different sources into the same format, units, or convention

<p style="color:Red;"><b>Decision</b>:  While looking for null values, I noticed there were rows with the city missing.  How could the city ever be missing, when the borough is filled in?  Let's explore the data to answer this question.</p>

In [10]:
#Normalize all strings to uppercase so different casing won't appear as separate values.
df['incident_address'] = df['incident_address'].str.upper()
df['street_name'] = df['street_name'].str.upper()
df['city'] = df['city'].str.upper()
df['borough'] = df['borough'].str.upper()

In [11]:
df['borough'].value_counts()

BRONX            569959
BROOKLYN         543166
MANHATTAN        398552
UNSPECIFIED      282917
QUEENS           228447
STATEN ISLAND     17419
Name: borough, dtype: int64

<p style="color:Red;">The expected five boroughs, but then also:  UNSPECIFIED.  What does that mean?</p>

In [12]:
df[df["borough"]=="UNSPECIFIED"]["city"].value_counts()

BROOKLYN               93388
BRONX                  88585
NEW YORK               59095
JAMAICA                 5020
STATEN ISLAND           3462
ASTORIA                 3381
FLUSHING                3154
RIDGEWOOD               2273
FAR ROCKAWAY            2040
WOODSIDE                1773
ELMHURST                1696
JACKSON HEIGHTS         1479
CORONA                  1398
FOREST HILLS            1340
REGO PARK               1134
SOUTH RICHMOND HILL     1005
QUEENS VILLAGE           928
SUNNYSIDE                877
OZONE PARK               835
RICHMOND HILL            778
HOLLIS                   776
WOODHAVEN                770
EAST ELMHURST            752
SPRINGFIELD GARDENS      675
SAINT ALBANS             660
SOUTH OZONE PARK         606
KEW GARDENS              599
ARVERNE                  589
LONG ISLAND CITY         454
ROSEDALE                 416
OAKLAND GARDENS          397
MASPETH                  363
ROCKAWAY PARK            361
BAYSIDE                  236
FRESH MEADOWS 

<p style="color:Red;"><b>Insight</b>:  When the borough is UNSPECIFIED, the CITY column often holds either the borough <i>or even a "neighborhood" (a division below borough)</i>.  The city is actually "correct" (NEW YORK) only 59K times.  For the most part, the city column is a de facto "neighborhood" column.</p>

<p style="color:Red;">Because borough is something further analysis will key from, standardize the data where possible.</p>

In [13]:
import numpy as np
#Correct rows where borough was entered in the city column with "UNSPECIFIED" in the borough column.
five_boroughs = ["BROOKLYN","BRONX","MANHATTAN","QUEENS","STATEN ISLAND"]
which_rows_to_adjust = df[(df["borough"]=='UNSPECIFIED')&df["city"].isin(five_boroughs)].index
df.loc[which_rows_to_adjust,'borough']=df.loc[which_rows_to_adjust,'city']
df.loc[which_rows_to_adjust,'city']=np.nan

<p style="color:Red;">About 185K previously "UNSPECIFIED" rows will now show up under the correct borough during later analysis.</p>
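
The same correction can be written with a boolean mask, which also makes it easy to count how many rows were moved. A small sketch on toy data (the rows are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "borough": ["UNSPECIFIED", "UNSPECIFIED", "BRONX", "UNSPECIFIED"],
    "city": ["BROOKLYN", "JAMAICA", "BRONX", "NEW YORK"],
})
five_boroughs = ["BROOKLYN", "BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

# Rows where the borough is unspecified but the city column holds a borough name.
mask = (toy["borough"] == "UNSPECIFIED") & toy["city"].isin(five_boroughs)

# Move the borough name into the borough column and blank out the city.
toy.loc[mask, "borough"] = toy.loc[mask, "city"]
toy.loc[mask, "city"] = np.nan

moved = int(mask.sum())
print(moved)
print(toy)
```

Only the first toy row qualifies: JAMAICA is a neighborhood and NEW YORK is not a borough name, so both are left for the later steps.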

In [14]:
#See which boroughs appear when the CITY is showing up as "NEW YORK".
df[df['city']=='NEW YORK']['borough'].value_counts()

MANHATTAN      393941
UNSPECIFIED     59095
Name: borough, dtype: int64

In [15]:
#See if all the MANHATTAN borough entries filled in the CITY as "NEW YORK" or if they sometimes have a "neighborhood".
df[df['borough']=='MANHATTAN']['city'].value_counts()

NEW YORK    393941
BRONX            9
Name: city, dtype: int64

<p style="color:Red;"><b>Insight</b>:  A few rows were entered very oddly, with BRONX as the city and MANHATTAN as the borough.</p>

<p style="color:Red;"><b>Decision</b>:  Ambiguous and only 9 rows out of millions, so drop the data so it won't confuse further analysis.</p>

In [16]:
print(df.shape) #before
#Drop a few rows of ambiguous data.
df.drop(df[(df["borough"]=='MANHATTAN')&(df["city"]=='BRONX')].index, axis=0, inplace=True)
print(df.shape) #after

(2040460, 8)
(2040451, 8)


<p style="color:Red;"><b>Decision</b>:  There are almost 400K rows with NEW YORK as the city and MANHATTAN as the borough.  The other 59K rows with city NEW YORK have UNSPECIFIED as the borough.  Assume the borough is also MANHATTAN for those.</p>

In [17]:
#Fill in UNSPECIFIED borough when city was entered as NEW YORK.
which_rows_to_adjust = df[(df["borough"]=='UNSPECIFIED')&(df["city"]=='NEW YORK')].index
df.loc[which_rows_to_adjust,'borough']="MANHATTAN"
df.loc[which_rows_to_adjust,'city']=np.nan

In [19]:
#The "NEW YORK" rows are the only ones that technically have the "city" column valued correctly,
#but since every other row uses city as "neighborhood", standardize these to null as well.
which_rows_to_adjust = df[(df["city"]=='NEW YORK')].index
df.loc[which_rows_to_adjust,'city']=np.nan

In [20]:
#Double check how many unspecified boroughs.
df['borough'].value_counts()

BRONX            658544
BROOKLYN         636554
MANHATTAN        457638
QUEENS           228450
UNSPECIFIED       38384
STATEN ISLAND     20881
Name: borough, dtype: int64

<p style="color:Red;"><b>Decision</b>:  There are still 38K rows with unspecified boroughs, but the neighborhoods are in the city column, so use them to map to boroughs.</p>

In [21]:
#See the neighborhoods of the still unspecified boroughs.
neighborhoods = df[(df['borough']=='UNSPECIFIED')&(df['city'].isnull()==False)]['city'].unique()
neighborhoods.sort()
neighborhoods

array(['ARVERNE', 'ASTORIA', 'BAYSIDE', 'BELLEROSE', 'BREEZY POINT',
       'CAMBRIA HEIGHTS', 'COLLEGE POINT', 'CORONA', 'EAST ELMHURST',
       'ELMHURST', 'FAR ROCKAWAY', 'FLORAL PARK', 'FLUSHING',
       'FOREST HILLS', 'FRESH MEADOWS', 'GLEN OAKS', 'HOLLIS',
       'HOWARD BEACH', 'JACKSON HEIGHTS', 'JAMAICA', 'KEW GARDENS',
       'LITTLE NECK', 'LONG ISLAND CITY', 'MASPETH', 'MIDDLE VILLAGE',
       'NEW HYDE PARK', 'OAKLAND GARDENS', 'OZONE PARK', 'QUEENS VILLAGE',
       'REGO PARK', 'RICHMOND HILL', 'RIDGEWOOD', 'ROCKAWAY PARK',
       'ROSEDALE', 'SAINT ALBANS', 'SOUTH OZONE PARK',
       'SOUTH RICHMOND HILL', 'SPRINGFIELD GARDENS', 'SUNNYSIDE',
       'WHITESTONE', 'WOODHAVEN', 'WOODSIDE'], dtype=object)

<p style="color:Red;"><b>Insight</b>:  Checking https://en.wikipedia.org/wiki/List_of_Queens_neighborhoods shows that all but NEW HYDE PARK are in QUEENS.</p>

<p style="color:Red;"><b>Decision</b>:  NEW HYDE PARK isn't in any borough but is right on the border with QUEENS, and had only 2 complaints, so just consider it within QUEENS.</p>

In [22]:
#Fix rows with unspecified borough but with a neighborhood (in the city column) that indicates it is in QUEENS.
which_rows_to_adjust = df[(df["borough"]=='UNSPECIFIED')&df["city"].isin(neighborhoods)].index
df.loc[which_rows_to_adjust,'borough']="QUEENS"
df['borough'].value_counts()

BRONX            658544
BROOKLYN         636554
MANHATTAN        457638
QUEENS           266565
STATEN ISLAND     20881
UNSPECIFIED         269
Name: borough, dtype: int64
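
An explicit lookup table is another way to express this step, and it keeps assumptions such as the NEW HYDE PARK decision visible in one place. A sketch on toy data; `neighborhood_to_borough` is a hypothetical name and lists only a few of the neighborhoods above:

```python
import pandas as pd

# Illustrative subset of the neighborhood list; each entry is an explicit assumption.
neighborhood_to_borough = {
    "ASTORIA": "QUEENS",
    "JAMAICA": "QUEENS",
    "NEW HYDE PARK": "QUEENS",  # outside the city line; folded into QUEENS per the decision above
}

toy = pd.DataFrame({
    "borough": ["UNSPECIFIED", "UNSPECIFIED", "BRONX", "UNSPECIFIED"],
    "city": ["ASTORIA", "NEW HYDE PARK", "Unspecified", "SOMEWHERE ELSE"],
})

# Look up each neighborhood; unknown names map to NaN and are left untouched.
mapped = toy["city"].map(neighborhood_to_borough)
fill = (toy["borough"] == "UNSPECIFIED") & mapped.notna()
toy.loc[fill, "borough"] = mapped[fill]
print(toy["borough"].tolist())
```

A table like this would also extend naturally if neighborhoods from other boroughs turned up in the city column.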

<p style="color:Red;"><b>Insight</b>:  The above will ultimately become the raw numbers that answer part of the question.</p>

In [23]:
#Examine rows for those final 269 unspecified boroughs
df[df['borough']=='UNSPECIFIED'].describe()

Unnamed: 0,incident_zip,latitude,longitude
count,0.0,0.0,0.0
mean,,,
std,,,
min,,,
25%,,,
50%,,,
75%,,,
max,,,


<p style="color:Red;"><b>Decision</b>:  Nothing "worth the trouble" to base further adjustments on for only 269 rows.  Set the borough to null on these rows so that they will not skew any subsequent analysis that examines borough.</p>

In [24]:
#Fix rows with unspecified borough and no other practical information from which to derive it.
which_rows_to_adjust = df[(df["borough"]=='UNSPECIFIED')&df["city"].isnull()].index
df.loc[which_rows_to_adjust,'borough']=np.nan

In [25]:
#Fix rows with city (neighborhood) equal to borough
which_rows_to_adjust = df[(df["borough"]==df["city"])].index
df.loc[which_rows_to_adjust,'city']=np.nan

In [26]:
#Fix rows with city that started off null or was adjusted subsequently to be null
which_rows_to_adjust = df[df["city"].isnull()].index
df.loc[which_rows_to_adjust,'city']="Unspecified"

<p style="color:Red;">At this point borough is reasonably populated with all the UNSPECIFIED and weird cases standardized.</p>

In [27]:
#Learn the nature of the zip code data.
print(df[(df['incident_zip'].isnull()==True)].shape)
print(df[(df['incident_zip'].isnull()==False)].shape)
print(df[(df['incident_zip'].isnull()==False)&(df['incident_zip'] % 1.0 == 0.0)].shape)
print(df[(df['incident_zip']<10000.0)].shape)
print(df[(df['incident_zip']>99999.0)].shape)
print(df[(df['incident_zip']>9999.0)&(df['incident_zip']<100000.0)].shape)

(18969, 8)
(2021482, 8)
(2021482, 8)
(0, 8)
(0, 8)
(2021482, 8)


<p style="color:Red;"><b>Insight</b>:  The zip codes are all float64.  All of them have zero after the decimal point.  They are all five digits before the decimal point.  Almost 19K of the 2 Million+ rows are null.</p>
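
Because every non-null ZIP code is a whole five-digit number, the float64 column could be converted to a more natural representation before grouping or display. A sketch on a toy Series, assuming a pandas version that supports the nullable `Int64` and `string` dtypes:

```python
import numpy as np
import pandas as pd

zips = pd.Series([10040.0, 11233.0, np.nan])

# Nullable integer dtype keeps missing values as <NA> instead of forcing floats.
as_int = zips.astype("Int64")

# Five-character text labels; zfill is a no-op here but guards against short values.
as_text = as_int.astype("string").str.zfill(5)
print(as_text.tolist())
```

Text labels avoid ZIP codes being rendered as `10040.0` in tables and chart axes.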

<p style="color:Red;">The question calls for using street too.</p>

In [28]:
df['street_name'].value_counts()

GRAND CONCOURSE                   35960
BROADWAY                          23497
OCEAN AVENUE                      17882
ARDEN STREET                      15841
MORRIS AVENUE                     15792
ST NICHOLAS AVENUE                14945
AMSTERDAM AVENUE                  11844
ELMHURST AVENUE                   10946
BOYNTON AVENUE                    10809
DR M L KING JR BOULEVARD          10037
OCEAN PARKWAY                      9965
WALTON AVENUE                      9579
BAILEY AVENUE                      9561
RIVERSIDE DRIVE                    9189
LINDEN BOULEVARD                   9173
SEDGWICK AVENUE                    9111
DECATUR AVENUE                     9095
NOSTRAND AVENUE                    8826
CRESTON AVENUE                     8683
SHERMAN AVENUE                     7885
BEDFORD AVENUE                     7530
SHERIDAN AVENUE                    7502
EASTERN PARKWAY                    7424
3 AVENUE                           7323
WALLACE AVENUE                     7287


<p style="color:Red;"><b>Insight</b>:  Before any attempt at standardization, there are 5979 distinct street_name values in the data.  There does not appear to be much standardization in the data.</p>

<p style="color:Red;"><b>Goal</b>:  Standardize the street names so different representations of the same street are all counted into the same totals.</p>

<p style="color:Red;"><b>Insights</b></p>
<p>This section will be added to over and over again with any insights gleaned while running the cells in the subsequent "Work area" section.  It is placed ahead of that section because the insights will usually be captured as coded lists or function definitions, which are re-run every time they change.</p>

In [252]:
# Some street values have multiple spaces in a row.
import re
def standardize_spaces(street):
    effective_street = street.strip() #Remove leading and trailing spaces.
    effective_street = re.sub(' +', ' ', effective_street) #Squeeze multiple adjacent spaces into just one space.
    return effective_street

In [253]:
# Some streets have problematic characters.  For example:  ST. ANN'S AVENUE also exists without period or apostrophe.
problem_characters = ['.', '\'']
def replace_problem_characters(raw):
    result = raw
    for character in problem_characters:
        result = result.replace(character,'')
    return result

In [254]:
#Streets are not always called a street.
street_suffixes = [
    "STREET","AVENUE","BOULEVARD","PLACE","ROAD","PARKWAY","CONCOURSE","DRIVE",
    "TERRACE","HIGHWAY","PARK","EXPRESSWAY","SQUARE","PLAZA","OVAL","CRESCENT",
    "LANE","COURT","EXTENSION","TURNPIKE", "LOOP", "ESTATE",
    "SR" #Service Road
]

In [255]:
#Streets often have a directional element
directionals = ["SOUTH","NORTH","EAST","WEST"]

In [256]:
#Some streets are known by just one name
mononymous_streets = ["BROADWAY","BOWERY"]

In [257]:
#Some words that are entered in a non-standard way or with typos need to be standardized.
word_replacements = [
    ("AVE","AVENUE"),
    ("ST","STREET"),
    ("RD","ROAD"),
    ("NICHLAS","NICHOLAS"),
    ("NICHALOS","NICHOLAS")
]

def replace_words(raw):
    split_raw = raw.split()
    for (old, new) in word_replacements:
        found_at_index = next((i for i, x in enumerate(split_raw) if x==old), None)
        if found_at_index!=None:
            split_raw[found_at_index] = new
    return " ".join(split_raw)

In [258]:
#Abbreviations that are sometimes not abbreviations.  Example:  AVENUE N.
abbreviation_replacements = [
    ("N","NORTH"),
    ("S","SOUTH"),
    ("E","EAST"),
    ("W","WEST")
]

def replace_tricky_abbreviations(raw):
    split_raw = raw.split()
    if len(split_raw)!=2:  # AVENUE N, E STREET, etc.
        for (old, new) in abbreviation_replacements:
            found_at_index = next((i for i, x in enumerate(split_raw) if x==old), None)
            if found_at_index!=None:
                split_raw[found_at_index] = new
    return " ".join(split_raw)

In [259]:
#Some phrases need custom replacement because they involve words that individually would be misinterpreted.
phrase_replacements = [
    ("DR M L KING JR","MARTIN LUTHER KING"),
    ("DR MARTIN L KING","MARTIN LUTHER KING"),
    ("MARTIN LUTHER KING","MARTIN LUTHER KING"),
    ("MARTIN L KING JR","MARTIN LUTHER KING"),
    ("MARTIN L KING","MARTIN LUTHER KING"),
    ("ST NICHOLAS","SAINT NICHOLAS"),
    ("ST JOHN","SAINT JOHN"),
    ("ST MARK","SAINT MARK"),
    ("ST ANN","SAINT ANN"),
    ("ST LAWRENCE","SAINT LAWRENCE"),
    ("ST PAUL","SAINT PAUL"),
    ("ST PETER","SAINT PETER"),
    ("ST RAYMOND","SAINT RAYMOND"),
    ("ST THERESA","SAINT THERESA"),
    ("ST FELIX","SAINT FELIX"),
    ("ST MARY","SAINT MARY"),
    ("ST OUEN","SAINT OUEN"),
    ("ST JAMES","SAINT JAMES"),
    ("ST GEORGE","SAINT GEORGE"),
    ("ST EDWARD","SAINT EDWARD"),
    ("ST CHARLES","SAINT CHARLES"),
    ("ST FRANCIS","SAINT FRANCIS"),
    ("ST ANDREW","SAINT ANDREW"),
    ("ST JUDE","SAINT JUDE"),
    ("ST LUKE","SAINT LUKE"),
    ("ST JOSEPH","SAINT JOSEPH"),
    ("N D PERLMAN","NATHAN PERLMAN"),
    ("O BRIEN","OBRIEN"),
    ("F D R","FDR"),
    ("EXPRESSWAY N SR","EXPRESSWAY SR N")
]

def replace_phrases(raw):
    result = raw
    for (old,new) in phrase_replacements:
        result = result.replace(old,new)
    return result
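
One caveat with plain `str.replace` is that it matches substrings, so a phrase such as `ST JOHN` can also match inside `WEST JOHN`. A possible refinement, shown as a sketch rather than the notebook's method, anchors each phrase at the start of a word with a regular expression; `replace_phrases_at_word_start` is a hypothetical name and the list is an abbreviated subset of `phrase_replacements`:

```python
import re

# Abbreviated, illustrative subset of phrase_replacements.
sample_replacements = [("ST JOHN", "SAINT JOHN"), ("ST MARK", "SAINT MARK")]

def replace_phrases_at_word_start(raw, replacements=sample_replacements):
    """Replace each phrase only where it begins at a word boundary."""
    result = raw
    for old, new in replacements:
        # \b requires a word boundary before the phrase, so WEST JOHN does not match.
        # No trailing boundary, so ST JOHNS still becomes SAINT JOHNS.
        result = re.sub(r"\b" + re.escape(old), new, result)
    return result

plain = "WEST JOHN STREET".replace("ST JOHN", "SAINT JOHN")  # substring match
anchored = replace_phrases_at_word_start("WEST JOHN STREET")
possessive = replace_phrases_at_word_start("ST JOHNS PLACE")
print(plain)
print(anchored)
print(possessive)
```

The trailing boundary is left out on purpose so that possessive forms, which lose their apostrophe in `replace_problem_characters`, are still caught.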

<p style="color:Red;"><b>Standardize Function</b></p>
<p>This function will be built up gradually as insights are gained, ultimately resulting in a single function that standardizes a street name.</p>

In [260]:
def standardize_street(street):
    r = street
    r = standardize_spaces(r)
    r = replace_problem_characters(r)
    r = replace_phrases(r)
    r = replace_words(r)
    r = replace_tricky_abbreviations(r)
    return r
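
The order of the steps matters. One effect of running `replace_phrases` before `replace_words` is that `ST NICHOLAS` becomes `SAINT NICHOLAS` before a lone `ST` can be expanded to `STREET`. A minimal sketch of that ordering effect, using simplified stand-in helpers (`replace_word` and `replace_phrase` are hypothetical, not the notebook's functions):

```python
def replace_word(raw, old, new):
    """Replace whole-word occurrences of old with new (simplified stand-in)."""
    return " ".join(new if word == old else word for word in raw.split())

def replace_phrase(raw, old, new):
    """Replace a multi-word phrase using plain substring replacement."""
    return raw.replace(old, new)

street = "ST NICHOLAS AVE"

# Phrases first, then words: the saint abbreviation is resolved before ST is expanded.
step = replace_phrase(street, "ST NICHOLAS", "SAINT NICHOLAS")
step = replace_word(step, "ST", "STREET")
phrase_first = replace_word(step, "AVE", "AVENUE")

# Words first, then phrases: ST is already STREET, so the phrase rule no longer matches.
step = replace_word(street, "ST", "STREET")
step = replace_word(step, "AVE", "AVENUE")
words_first = replace_phrase(step, "ST NICHOLAS", "SAINT NICHOLAS")

print(phrase_first)
print(words_first)
```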

<p style="color:Red;"><b>Work area</b></p>
<p>The following sections will be run over and over again, not necessarily in order, each time with modifications to filter out words already considered or to drill into a specific case under consideration.  Each time, the insights gained will be applied within the "Insights" section above.</p>

In [261]:
def is_word_present(x, word):
    split_x = x.split()
    if word in split_x:
        return True
    else:
        return False

def is_not_last_word(x, word):
    split_x = x.split()
    #Compare against the last word itself, not its index.
    last_word = split_x[-1] if split_x else None
    if word!=last_word:
        return True
    else:
        return False
    
def is_word_present_but_not_last(x, word):
    if is_word_present(x,word):
        if is_not_last_word(x,word):
            return True
    return False

def is_small_alpha_word_present(x, size):
    split_x = x.split()
    for a in split_x:
        if len(a)==size:
            if a.isalpha():
                return True
    return False

In [262]:
#Standardize what we can
df['standardized_street_name'] = df['street_name'].apply(standardize_street)

In [263]:
#Examine the partially standardized results to gain more insights.
#The following cell needs to change often.  Only the most common types of lines are shown.

In [264]:
#df['standardized_street_name'].value_counts()
df[df['standardized_street_name'].apply(is_small_alpha_word_present, args=(2,))]['standardized_street_name'].value_counts()
#df[df['standardized_street_name'].apply(xxx)]['standardized_street_name'].value_counts()
#df[df['standardized_street_name'].str.contains('F D R')]['standardized_street_name'].value_counts()

FT WASHINGTON AVENUE                  6698
DE KALB AVENUE                        3689
MT HOPE PLACE                         1433
FT HAMILTON PARKWAY                   1016
MC DONALD AVENUE                       707
MC CLELLAN STREET                      559
FT INDEPENDENCE STREET                 514
MC KEEVER PLACE                        505
HOR HARDING EXPRESSWAY SR NORTH        367
SHORE PARKWAY SR NORTH                 357
MC GRAW AVENUE                         354
MT MORRIS PARK WEST                    338
FT GEORGE AVENUE                       297
SHORE PARKWAY SR SOUTH                 256
DE REIMER AVENUE                       252
MC BRIDE STREET                        237
FR CAPODANNO BOULEVARD                 213
LA FONTAINE AVENUE                     212
MC KINLEY AVENUE                       137
FT GREENE PLACE                        137
DE WITT AVENUE                         136
EAST MT EDEN AVENUE                    115
VAN WYCK EXPRESSWAY SR WEST            109
MC GUINNESS

In [None]:
# Continue here
#
# Examine the above.  Decide if it is worth standardizing "MC" (space or no space after), "FT" (fort), etc.
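
If standardizing these prefixes turns out to be worthwhile, one possible shape for it is sketched below. This is only an illustration: the helper names are hypothetical, and treating every `FT` as FORT and every `MT` as MOUNT is an assumption that would need checking against the data.

```python
import re

# Hypothetical expansions; whether FT always means FORT (and MT always MOUNT) is an assumption.
prefix_expansions = {"FT": "FORT", "MT": "MOUNT"}

def join_mc_prefix(street):
    """Join a standalone MC to the following word: MC DONALD -> MCDONALD."""
    return re.sub(r"\bMC (?=[A-Z])", "MC", street)

def expand_prefixes(street):
    """Expand whole-word prefixes listed in prefix_expansions."""
    return " ".join(prefix_expansions.get(word, word) for word in street.split())

samples = ["MC DONALD AVENUE", "FT WASHINGTON AVENUE", "EAST MT EDEN AVENUE"]
results = [expand_prefixes(join_mc_prefix(name)) for name in samples]
for name in results:
    print(name)
```

Joining `MC` to the next word only helps if the unspaced spelling is the more common one in the data, which the value counts above would need to confirm.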

In [34]:
#Get all of the individual words represented within the street_name column in a countable form.
words = list()
df['standardized_street_name'].str.split().apply(words.extend)
word_series = pd.Series(words)
counts = word_series.value_counts()
counts

AVENUE         869520
STREET         786344
EAST           248298
WEST           179592
BOULEVARD       76973
PLACE           76390
ROAD            59069
PARKWAY         48542
GRAND           43831
CONCOURSE       36598
ST              35569
OCEAN           28277
BROADWAY        24642
PARK            24258
NICHOLAS        17809
MORRIS          17225
ARDEN           15847
SOUTH           15811
DRIVE           15639
WASHINGTON      13737
NORTH           13418
BEACH           13079
AMSTERDAM       11845
L               11487
3               11238
LINDEN          10995
BEDFORD         10988
ELMHURST        10947
TERRACE         10942
BOYNTON         10815
                ...  
SEGART              1
URSINA              1
2177                1
ND                  1
ZION                1
HOLSMAN             1
TENTH               1
WOLFFE              1
FORBEL              1
CLEARMONT           1
LORIN               1
HA                  1
BRADEY              1
DOOP                1
HARVEY    

In [33]:
# Drop words that don't appear often enough to fuss with
counts.drop(counts[counts < 100].index, inplace=True) 
counts.shape

(1308,)

In [None]:
# Drop words that are ONLY numbers
counts.drop(counts[counts.index.str.isdigit()].index, inplace=True) 
counts.shape
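
The same word tally and filtering can be sketched without pandas using `collections.Counter`, which may be easier to reason about when experimenting. The street names below are a tiny illustrative sample, and `min_count` stands in for the threshold of 100 used above:

```python
from collections import Counter

street_names = ["GRAND CONCOURSE", "EAST 11 STREET", "WEST 141 STREET", "3 AVENUE"]

# Count every individual word across all street names.
word_counts = Counter(word for name in street_names for word in name.split())

# Keep words that appear at least min_count times and are not purely numeric.
min_count = 2  # the notebook uses 100 on the full data
frequent_words = {
    word: count
    for word, count in word_counts.items()
    if count >= min_count and not word.isdigit()
}
print(frequent_words)
```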

In [35]:
df[(df['street_name'].isnull()==False)&(df['street_name'].str.contains('ST'))]

Unnamed: 0,unique_key,incident_zip,incident_address,street_name,city,borough,latitude,longitude
2,43917912,11233.0,1711 FULTON STREET,FULTON STREET,Unspecified,BROOKLYN,40.679340,-73.930435
25,43916888,10031.0,620 WEST 141 STREET,WEST 141 STREET,Unspecified,MANHATTAN,40.824162,-73.952978
26,43917860,10032.0,527 WEST 162 STREET,WEST 162 STREET,Unspecified,MANHATTAN,40.836441,-73.940802
34,43920891,10009.0,635 EAST 11 STREET,EAST 11 STREET,Unspecified,MANHATTAN,40.726767,-73.977783
37,43919611,11225.0,1087 CARROLL STREET,CARROLL STREET,Unspecified,BROOKLYN,40.667366,-73.953965
38,43920920,11234.0,1495 EAST 46 STREET,EAST 46 STREET,Unspecified,BROOKLYN,40.624272,-73.931343
44,43919638,10002.0,246 BROOME STREET,BROOME STREET,Unspecified,MANHATTAN,40.717936,-73.989585
59,43914332,10002.0,135 ELDRIDGE STREET,ELDRIDGE STREET,Unspecified,MANHATTAN,40.718995,-73.991544
65,43915680,10002.0,575 GRAND STREET,GRAND STREET,Unspecified,MANHATTAN,40.713707,-73.979140
66,43914296,11233.0,1711 FULTON STREET,FULTON STREET,Unspecified,BROOKLYN,40.679340,-73.930435


<p style="color:Red;">What standardizations can we do based on these insights?</p>

In [None]:
#Code helpers to standardize

def fix_last_word(street, old, new):
    split_street = street.split()
    last_index = len(split_street) - 1
    if split_street[last_index]==old:
        split_street[last_index] = new
    effective_street = " ".join(split_street)
    return effective_street

def fix_middle_words(street, old, new):
    split_street = street.split()
    found_at_index = next((i for i, x in enumerate(split_street) if x==old), None)
    last_index = len(split_street) - 1
    if found_at_index!=None:
        if found_at_index!=last_index:
            split_street[found_at_index] = new
    effective_street = " ".join(split_street)
    return effective_street

def standardize_mlk(street):
    effective_street = street
    effective_street = effective_street.replace("DR M L KING JR","mlk")
    effective_street = effective_street.replace("DR MARTIN L KING","mlk")
    effective_street = effective_street.replace("MARTIN LUTHER KING","mlk")
    effective_street = effective_street.replace("MARTIN L KING JR","mlk")
    effective_street = effective_street.replace("MARTIN L KING","mlk")
    effective_street = effective_street.replace("mlk","MARTIN LUTHER KING")
    return effective_street

def standardize_street(street):
    effective_street = street.strip() #Remove leading and trailing spaces.
    effective_street = re.sub(' +', ' ', effective_street) #Squeeze multiple spaces to one space.
    
    effective_street = standardize_mlk(effective_street)
    effective_street = fix_last_word(effective_street,"AVE","AVENUE")
    effective_street = fix_last_word(effective_street,"ST","STREET")
    effective_street = fix_last_word(effective_street,"RD","ROAD")
    
    #Assumption: popular_streets and popular_suffixes are not defined anywhere in this notebook.
    #They are taken here to mean the mononymous_streets and street_suffixes lists from the "Insights" section.
    popular_streets = mononymous_streets
    popular_suffixes = street_suffixes
    
    split_street = effective_street.split()
    last_word = split_street[-1]
   
    if effective_street in popular_streets:
        return "accounted for - popular"
    if len(split_street)==2 and last_word in popular_suffixes:
        return "accounted for - 2 word normal"
    if len(split_street)==2 and split_street[0]=="AVENUE":
        return "accounted for - avenue X"
    if last_word in directionals and len(split_street)>2:
        if split_street[-2] in popular_suffixes:
            return "accounted for - normal with directional"
    if len(split_street)==2 and split_street[0] in directionals:
        if split_street[1] in popular_streets:
            return "accounted for - directional popular"
    
    return effective_street
    #return last_word

In [None]:
# Create a new column by applying the standardizations to each street.
df['standardized_street_name'] = df['street_name'].apply(standardize_street)

In [None]:
df['standardized_street_name'].value_counts()

In [None]:
df[df['standardized_street_name']=='SOUTH']['street_name'].value_counts()

In [None]:
df[(df['street_name'].isnull()==False)&(df['street_name'].str.contains('  '))].head()

In [None]:
df[df['street_name'].str.contains('CONCOURSE')]['street_name'].value_counts()

### Visualize data

In [None]:
graph_width_max = df['borough'].value_counts().max()

In [None]:
%matplotlib inline 
import matplotlib.pyplot as plt

In [None]:
#Gather totals for graphic
totals_for_graph = df['borough'].value_counts()

In [None]:
#Plot
totals_for_graph.plot(kind='barh', figsize=(13, 4), color='steelblue')
plt.xlim(0, graph_width_max)
plt.xlabel('Number of complaints')
plt.title('HEATING / HOT WATER complaints by borough')
for index, value in enumerate(totals_for_graph): 
    label = format(int(value), ',') # format int with commas
    # place text at the end of bar (subtracting 47000 from x, and 0.1 from y to make it fit within the bar)
    plt.annotate(label, xy=(value - 47000, index - 0.10), color='white')
plt.show()
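
The question also asks about ZIP codes and streets. One way the cleaned columns could be summarized for that purpose is sketched below on toy data; the ZIP codes and counts are hypothetical, `top_n` is a hypothetical helper, and the street ranking is grouped with borough because the same street name can exist in more than one borough:

```python
import pandas as pd

# Toy data (hypothetical values) using the column names from this notebook.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "BRONX", "BROOKLYN", "BROOKLYN"],
    "incident_zip": [10458.0, 10458.0, 10467.0, 11226.0, 11226.0],
    "standardized_street_name": [
        "GRAND CONCOURSE", "GRAND CONCOURSE", "DECATUR AVENUE",
        "OCEAN AVENUE", "OCEAN AVENUE",
    ],
})

def top_n(frame, columns, n=10):
    """Return the n most frequent values (or value combinations) with their counts."""
    return frame.groupby(columns).size().sort_values(ascending=False).head(n)

top_zips = top_n(toy, "incident_zip", n=2)
top_streets = top_n(toy, ["borough", "standardized_street_name"], n=2)
print(top_zips)
print(top_streets)
```

Rows with a null ZIP code or borough are excluded by `groupby` by default, which matches the earlier decision to leave those values as NaN.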

<a id="conclusion"></a>
## Concluding Remarks
---

xxx.

The Department of Housing Preservation and Development of New York City should focus on the following particular set of boroughs, ZIP codes, and streets (where the complaints are severe) for the "HEAT/HOT WATER" + "HEATING" complaint types:

<p style="color:Red;">xxx</p>

xxx