<h1 style="text-align:center">Data Science and Machine Learning Capstone Project</h1>
<img style="float:right" src="https://prod-edxapp.edx-cdn.org/static/edx.org/images/logo.790c9a5340cb.png">
<p style="text-align:center">IBM: DS0720EN</p>
<p style="text-align:center">Question 2 of 4</p>

1. [Problem Statement](#problem)
2. [Question 2](#question)
3. [Analyzing and Visualizing](#analysis)
4. [Concluding Remarks](#conclusion)

<a id="problem"></a>
## Problem Statement
---

The people of New York use the 311 system to report non-emergency problems to local authorities. These problems are assigned to various New York agencies. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, their large volume and sudden increase are impacting the overall efficiency of the agency's operations.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions, and those answers must be supported by data and analytics. This notebook addresses the second of those questions.

<a id="question"></a>
## Question 2
---

Should the Department of Housing Preservation and Development of New York City focus on any particular set of boroughs, ZIP codes, or streets (where the complaints are severe) for the specific type of complaints you identified in response to Question 1?

### Approach
Analyze the data to see whether the "HEAT/HOT WATER" and "HEATING" complaints are concentrated in any particular borough, ZIP code, or street.
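One way to make this approach concrete is to count complaints per borough, ZIP code, and street, then rank the groups by count and by share of all complaints. The sketch below is only an illustration: the rows are made up (not real 311 records), and `top_locations` is a hypothetical helper, not part of this notebook's analysis.

```python
import pandas as pd

# Toy rows for illustration only; these are not real 311 records.
toy = pd.DataFrame({
    "borough": ["BRONX", "BRONX", "BROOKLYN", "BRONX", "QUEENS"],
    "incident_zip": ["10458", "10458", "11226", "10467", "11373"],
    "street_name": ["GRAND CONCOURSE", "GRAND CONCOURSE", "OCEAN AVENUE",
                    "GRAND CONCOURSE", "BROADWAY"],
})

def top_locations(frame, column, n=3):
    """Return the n most frequent values in `column` with their share of all rows."""
    counts = frame[column].value_counts().head(n)
    return pd.DataFrame({"complaints": counts, "share": counts / len(frame)})

# Rank each geographic level separately.
for col in ["borough", "incident_zip", "street_name"]:
    print(top_locations(toy, col))
```

Reporting the share alongside the raw count makes it easier to judge whether a location stands out or simply reflects a larger number of rows.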

### Load Data
The [New York 311](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) data was loaded separately through a [SODA](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status) query into a pandas DataFrame and then saved to a pickle file.
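The separate loading step is not shown in this notebook. A minimal sketch of how it might look follows, assuming the same endpoint and query parameters as the SODA link above; `build_soda_url` and `download_to_pickle` are hypothetical helper names, and the small limit and short column list are chosen only to keep the example quick.

```python
from urllib.parse import urlencode

# Endpoint taken from the SODA link above.
SODA_ENDPOINT = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"

def build_soda_url(limit, columns, agency="HPD"):
    """Build a SODA CSV query URL restricted to one agency and a column subset."""
    params = {"$limit": limit, "Agency": agency, "$select": ",".join(columns)}
    # Keep "$" and "," unescaped so the URL matches the form of the link above.
    return SODA_ENDPOINT + "?" + urlencode(params, safe="$,")

def download_to_pickle(url, pickle_path):
    """Download the CSV into a DataFrame and cache it as a pickle (needs network)."""
    import pandas as pd  # imported here so the URL helper works without pandas
    frame = pd.read_csv(url)
    frame.to_pickle(pickle_path)
    return frame

url = build_soda_url(1000, ["created_date", "unique_key", "complaint_type", "borough"])
print(url)
```

Caching the result as a pickle avoids repeating a multi-million-row download every time the notebook is rerun.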

In [1]:
import pandas as pd
df = pd.read_pickle('C:\\Users\\It_Co\\Documents\\DataScience\\Capstone\\ny311full.pkl') # Local
#df = pd.read_pickle('./ny311.pkl') #IBM Cloud / Watson Studio
df.shape

(5862383, 15)

<a id="analysis"></a>
## Analyzing and Visualizing
---

### Reduce data to relevant rows and columns

In [2]:
#Remove rows that are not one of the complaint types identified in Question 1.
df.drop(df[~df["complaint_type"].isin(["HEAT/HOT WATER","HEATING"])].index, inplace=True)
#Double check that the correct rows were removed.
df['complaint_type'].value_counts()

HEAT/HOT WATER    1152592
HEATING            887869
Name: complaint_type, dtype: int64

In [3]:
#Remove columns deemed unnecessary for this question.
df.drop(['created_date','complaint_type','resolution_description','closed_date','location_type','status','address_type'], axis=1, inplace=True)
df.shape

(2040461, 8)

### Wrangle any unruly data

In [4]:
#Normalize strings so different casing won't appear as separate values.
df['incident_address'] = df['incident_address'].str.upper()
df['street_name'] = df['street_name'].str.upper()
df['city'] = df['city'].str.upper()
df['borough'] = df['borough'].str.upper()

In [5]:
#See if any data is null.
df.isnull().sum()

unique_key              0
incident_zip        18970
incident_address        1
street_name             1
city                18843
borough                 0
latitude            18966
longitude           18966
dtype: int64

<p style="color:Red;">How is it that the city could ever be missing, when the borough is not?</p>

In [6]:
df['borough'].value_counts()

BRONX            569960
BROOKLYN         543166
MANHATTAN        398552
UNSPECIFIED      282917
QUEENS           228447
STATEN ISLAND     17419
Name: borough, dtype: int64

<p style="color:Red;">The five expected boroughs appear, but so does UNSPECIFIED. What does that mean?</p>

In [7]:
df[df["borough"]=="UNSPECIFIED"]["city"].value_counts().head(10)

BROOKLYN         93388
BRONX            88585
NEW YORK         59095
JAMAICA           5020
STATEN ISLAND     3462
ASTORIA           3381
FLUSHING          3154
RIDGEWOOD         2273
FAR ROCKAWAY      2040
WOODSIDE          1773
Name: city, dtype: int64

<p style="color:Red;"><b>Insight</b>: When the borough is UNSPECIFIED, the CITY column often holds the borough <i>or even a "neighborhood" (a division below borough)</i>. Among the values shown, the city is actually "correct" (NEW YORK) only about 59K times. For the most part, the city column is a de facto "neighborhood" column.</p>

In [8]:
import numpy as np
#Correct rows where borough was entered in the city column with "UNSPECIFIED" in the borough column.
five_boroughs = ["BROOKLYN","BRONX","MANHATTAN","QUEENS","STATEN ISLAND"]
which_rows_to_adjust = df[(df["borough"]=='UNSPECIFIED')&df["city"].isin(five_boroughs)].index
df.loc[which_rows_to_adjust,'borough']=df.loc[which_rows_to_adjust,'city']
df.loc[which_rows_to_adjust,'city']=np.nan

<p style="color:Red;">More than 185K previously "UNSPECIFIED" rows (BROOKLYN, BRONX, and STATEN ISLAND alone account for 185,435) will now show up under the correct borough during later analysis.</p>
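The effect of this correction can be checked on a small frame before trusting it on two million rows. The sketch below uses made-up rows and applies the same rule with a boolean mask rather than an index lookup; the name `moved` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

five_boroughs = ["BROOKLYN", "BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

# Made-up rows: two have a borough name in `city`, one has a neighborhood.
toy = pd.DataFrame({
    "borough": ["UNSPECIFIED", "UNSPECIFIED", "UNSPECIFIED", "QUEENS"],
    "city": ["BROOKLYN", "BRONX", "JAMAICA", "ASTORIA"],
})

# Rows whose borough is UNSPECIFIED but whose city is really a borough name.
mask = (toy["borough"] == "UNSPECIFIED") & toy["city"].isin(five_boroughs)

# Move the borough name into the borough column, then blank out the city.
toy.loc[mask, "borough"] = toy.loc[mask, "city"]
toy.loc[mask, "city"] = np.nan

moved = int(mask.sum())
print(f"rows reassigned: {moved}")
print(toy)
```

Counting the mask gives the exact number of reassigned rows, and the JAMAICA row shows that neighborhood values are left untouched by this rule.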

In [12]:
#See which boroughs appear when the CITY is "NEW YORK".
df[df['city']=='NEW YORK']['borough'].value_counts()

MANHATTAN      393941
UNSPECIFIED     59095
Name: borough, dtype: int64

In [13]:
#See whether all MANHATTAN borough entries have "NEW YORK" as the CITY or whether some have a "neighborhood" instead.
df[df['borough']=='MANHATTAN']['city'].value_counts()

NEW YORK    393941
BRONX            9
Name: city, dtype: int64

<p style="color:Orange;">Continue here</p>

In [None]:
#Compare the number of rows in the five boroughs with the number of rows whose city is "NEW YORK".
print(df[df['borough'].isin(five_boroughs)]['unique_key'].count())
print(df[df['city']=="NEW YORK"]['unique_key'].count())

<p style="color:Red;">Why are so many rows assigned to one of the five boroughs without having NEW YORK as the city?</p>

In [None]:
df[df['borough'].isin(five_boroughs)&(df['city']!="NEW YORK")]['city'].value_counts()

<p style="color:Red;">Insight:  The boroughs or even neighborhoods are sometimes entered in the "city" field.</p>

In [None]:
#Correct rows where borough was also entered as the city.
which_rows_to_adjust = df[(df["borough"].isin(five_boroughs))&(df["city"]==df["borough"])].index
which_rows_to_adjust
#df.loc[which_rows_to_adjust,'borough']=df.loc[which_rows_to_adjust,'city']
#df.loc[which_rows_to_adjust,'city']="NEW YORK"
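Rows that remain UNSPECIFIED with a neighborhood in the city column (JAMAICA, ASTORIA, and so on) could in principle be resolved by learning a neighborhood-to-borough lookup from the rows where the borough is already known. This is one possible approach, not something the notebook does; the sketch uses made-up rows, and `city_to_borough` is a hypothetical name.

```python
import pandas as pd

# Made-up rows; "NOWHERE" stands in for a city value with no known borough.
toy = pd.DataFrame({
    "borough": ["QUEENS", "QUEENS", "UNSPECIFIED", "BRONX", "UNSPECIFIED"],
    "city": ["JAMAICA", "ASTORIA", "JAMAICA", "BRONX", "NOWHERE"],
})

# Learn the lookup only from rows whose borough is already known.
known = toy[toy["borough"] != "UNSPECIFIED"]
# For each city value, take the borough it most often appears with.
city_to_borough = known.groupby("city")["borough"].agg(lambda s: s.mode().iloc[0])

unspecified = toy["borough"] == "UNSPECIFIED"
inferred = toy.loc[unspecified, "city"].map(city_to_borough)
# Only overwrite where the lookup found a match; unmatched rows stay UNSPECIFIED.
toy.loc[unspecified, "borough"] = inferred.fillna("UNSPECIFIED")
print(toy)
```

Using the most frequent borough per city value tolerates a few mislabeled rows, but any neighborhood name shared by more than one borough would need manual review.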

<a id="conclusion"></a>
## Concluding Remarks
---

xxx.

The Department of Housing Preservation and Development of New York City should focus on the following particular set of boroughs, ZIP codes, and streets (where the complaints are severe) for the "HEAT/HOT WATER" + "HEATING" complaint types:

<p style="color:Red;">xxx</p>

xxx