<h1 style="text-align:center">Data Science and Machine Learning Capstone Project</h1>
<img style="float:right" src="https://prod-edxapp.edx-cdn.org/static/edx.org/images/logo.790c9a5340cb.png">
<p style="text-align:center">IBM: DS0720EN</p>
<p style="text-align:center">Question 3 of 4</p>

1. [Problem Statement](#problem)
2. [Question 3](#question)
3. [Data Cleaning and Standardization](#wrangling)
4. [Analyzing and Visualizing](#analysis)
5. [Concluding Remarks](#conclusion)

<a id="problem"></a>
# Problem Statement
---

The people of New York use the 311 system to report complaints about non-emergency problems to local authorities. These problems are assigned to various agencies in New York. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, the large volume of complaints and the sudden increase is impacting the overall efficiency of operations of the agency.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions, and each answer must be supported by data and analytics. This notebook addresses the third of those four questions.

<a id="question"></a>
# Question 3
---

Does the Complaint Type that you identified in response to Question 1 have an obvious relationship with any particular characteristic or characteristics of the houses?

## Approach
Determine how to link the 311 data with the PLUTO data, then identify whether there are any correlations between the HEAT/HOT WATER complaints from Question 1 and the PLUTO house information.

## Load Data
Separately from this notebook:

The [New York 311](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) data was loaded through a [SODA](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status) query into a pandas DataFrame and then saved to a pickle file.
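
For reference, the query URL linked above can be assembled from its parts. This is only a sketch: the endpoint, the `Agency` filter, `$limit`, and the `$select` columns are copied from that link, while the variable names are my own. The download lines are left commented out because the actual loading code is not part of this notebook.

```python
from urllib.parse import urlencode

# Endpoint and parameters copied from the SODA link in the text above.
BASE_URL = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"
columns = [
    "created_date", "unique_key", "complaint_type", "incident_zip",
    "incident_address", "street_name", "address_type", "city",
    "resolution_description", "borough", "latitude", "longitude",
    "closed_date", "location_type", "status",
]
params = {"$limit": 100000000, "Agency": "HPD", "$select": ",".join(columns)}

# Keep "$" and "," unescaped so the URL reads like the original link.
query_url = BASE_URL + "?" + urlencode(params, safe="$,")
print(query_url)

# Hypothetical download step (needs network access and a lot of memory):
# import pandas as pd
# b311 = pd.read_csv(query_url)
# b311.to_pickle("ny311full.pkl")
```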

The [New York PLUTO](https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/xuk2-nczf) data was downloaded. The instructions at ( Course / 1. Project Challenge Details and Setup / Datasets Used in this Course / Datasets ) said "Use only the part that is specific to the borough that you are interested in based on your analysis." My answer for Question 2 suggested that the borough with the biggest HEAT/HOT WATER problem was BRONX. For that reason, only the BX_18v1.csv file was loaded into a pandas DataFrame and then saved to a pickle file.

In [1]:
import pandas as pd
files_path = 'C:\\Users\\It_Co\\Documents\\DataScience\\Capstone\\' #local
#files_path = './' #IBM Cloud / Watson Studio

b311 = pd.read_pickle(files_path + 'ny311full.pkl')

#bPlu = pd.read_csv(files_path + 'BX_18v1.csv', usecols=['Address','BldgArea','BldgDepth','BuiltFAR','CommFAR','FacilFAR','Lot','LotArea','LotDepth','NumBldgs','NumFloors','OfficeArea','ResArea','ResidFAR','RetailArea','YearBuilt','YearAlter1','ZipCode', 'YCoord', 'XCoord'])
#bPlu.to_pickle(files_path + 'BX_18v1.pkl')
bPlu = pd.read_pickle(files_path + 'BX_18v1.pkl')

print("NY 311 shape %s" % (b311.shape,))
print("BRONX PLUTO shape %s" % (bPlu.shape,))

NY 311 shape (5862383, 15)
BRONX PLUTO shape (89854, 20)


<a id="wrangling"></a>
# Data Cleaning and Standardization
---

Observations with missing or malformed data elements will need to be corrected or removed. The 311 and PLUTO data sets will need to be "joined" on their common "address" element, which means the addresses must first be standardized to a consistent layout so they can be compared reliably.

## NY 311

### General

In [2]:
#Normalize relevant strings to uppercase so different casing won't appear as separate values.
b311['incident_address'] = b311['incident_address'].str.upper()
b311['city'] = b311['city'].str.upper()
b311['borough'] = b311['borough'].str.upper()
#We only care whether a complaint is the combined "heating and hot water" type from Question 1 or some other type.
#Compute the mask once, before any value is overwritten.
is_heat = b311["complaint_type"].isin(["HEAT/HOT WATER","HEATING"])
b311["complaint_type"] = is_heat.map({True: "HEAT/HOT WATER", False: "OTHER"})

In [3]:
#Print some initial information for comparison during later steps.
print("shape %s" % str(b311.shape))
print("--nulls below--")
print(b311.isnull().sum())
print("--types below--")
print(b311.dtypes)
b311.head()

shape (5862383, 15)
--nulls below--
created_date                   0
unique_key                     0
complaint_type                 0
incident_zip               80611
incident_address           52831
street_name                52831
address_type               84775
city                       80210
resolution_description     13194
borough                        0
latitude                   80585
longitude                  80585
closed_date               123122
location_type                  0
status                         0
dtype: int64
--types below--
created_date               object
unique_key                 object
complaint_type             object
incident_zip              float64
incident_address           object
street_name                object
address_type               object
city                       object
resolution_description     object
borough                    object
latitude                  float64
longitude                 float64
closed_date                object
location_type              object
status                     object
dtype: object

Unnamed: 0,created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status
0,2019-09-28T20:31:44.000,43918968,OTHER,11435.0,141-40 PERSHING CRESCENT,PERSHING CRESCENT,ADDRESS,JAMAICA,The following complaint conditions are still o...,QUEENS,40.712047,-73.815983,,RESIDENTIAL BUILDING,Open
1,2019-09-28T20:16:49.000,43917365,OTHER,11223.0,1702 WEST 1 STREET,WEST 1 STREET,ADDRESS,BROOKLYN,The following complaint conditions are still o...,BROOKLYN,40.606232,-73.974553,,RESIDENTIAL BUILDING,Open
2,2019-09-28T16:10:10.000,43917912,HEAT/HOT WATER,11233.0,1711 FULTON STREET,FULTON STREET,ADDRESS,BROOKLYN,The complaint you filed is a duplicate of a co...,BROOKLYN,40.67934,-73.930435,,RESIDENTIAL BUILDING,Open
3,2019-09-28T14:50:47.000,43918939,OTHER,10456.0,881 CAULDWELL AVENUE,CAULDWELL AVENUE,ADDRESS,BRONX,The following complaint conditions are still o...,BRONX,40.822361,-73.907267,,RESIDENTIAL BUILDING,Open
4,2019-09-28T09:28:25.000,43916893,OTHER,11205.0,196 CLINTON AVENUE,CLINTON AVENUE,ADDRESS,BROOKLYN,The following complaint conditions are still o...,BROOKLYN,40.692252,-73.968649,,RESIDENTIAL BUILDING,Open


### Standardize Borough
Reusing the findings from the borough standardization done for Question 2.

In [4]:
b311['borough'].value_counts()

BROOKLYN         1692795
BRONX            1566246
MANHATTAN        1020161
UNSPECIFIED       873226
QUEENS            624323
STATEN ISLAND      85632
Name: borough, dtype: int64

In [5]:
import numpy as np
#Correct rows where borough was entered in the city column with "UNSPECIFIED" in the borough column.
five_boroughs = ["BROOKLYN","BRONX","MANHATTAN","QUEENS","STATEN ISLAND"]
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&b311["city"].isin(five_boroughs)].index
b311.loc[which_rows_to_adjust,'borough']=b311.loc[which_rows_to_adjust,'city']
b311.loc[which_rows_to_adjust,'city']=np.nan
#Drop a few rows of ambiguous data.
b311.drop(b311[(b311["borough"]=='MANHATTAN')&(b311["city"]=='BRONX')].index, axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
#Fill in UNSPECIFIED borough when city was entered as NEW YORK.
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&(b311["city"]=='NEW YORK')].index
b311.loc[which_rows_to_adjust,'borough']="MANHATTAN"
b311.loc[which_rows_to_adjust,'city']=np.nan
#Technically, the "NEW YORK" rows are the only ones where the "city" column was valued correctly,
#but every other row uses city as "neighborhood", so standardize these by nulling the city.
which_rows_to_adjust = b311[(b311["city"]=='NEW YORK')].index
b311.loc[which_rows_to_adjust,'city']=np.nan
#Any still unspecified boroughs with a value in "city" are in the Queens borough.  The "city" is actually a "neighborhood".
queens_neighborhoods = b311[(b311['borough']=='UNSPECIFIED')&(b311['city'].isnull()==False)]['city'].unique()
#Standardize borough for Queens neighborhoods.
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&b311["city"].isin(queens_neighborhoods)].index
b311.loc[which_rows_to_adjust,'borough']="QUEENS"
#Null the borough if it still shows up as unspecified borough as there is no other information from which to derive it.
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&b311["city"].isnull()].index
b311.loc[which_rows_to_adjust,'borough']=np.nan
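
The backfill rules in the cell above can be tried out on a few rows before running them against the full data. The sketch below uses a made-up four-row frame (the values are illustrative, not taken from the 311 data) and passes boolean masks directly to `.loc` instead of going through `.index`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each case handled above.
toy = pd.DataFrame({
    "borough": ["UNSPECIFIED", "UNSPECIFIED", "BRONX", "UNSPECIFIED"],
    "city":    ["BRONX", "NEW YORK", "BRONX", np.nan],
})
five_boroughs = ["BROOKLYN", "BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

# Rule 1: the borough name was typed into the city column.
mask = (toy["borough"] == "UNSPECIFIED") & toy["city"].isin(five_boroughs)
toy.loc[mask, "borough"] = toy.loc[mask, "city"]
toy.loc[mask, "city"] = np.nan

# Rule 2: a city of NEW YORK is treated as MANHATTAN.
mask = (toy["borough"] == "UNSPECIFIED") & (toy["city"] == "NEW YORK")
toy.loc[mask, "borough"] = "MANHATTAN"
toy.loc[mask, "city"] = np.nan

# Rule 3: nothing left to derive the borough from, so null it.
mask = (toy["borough"] == "UNSPECIFIED") & toy["city"].isnull()
toy.loc[mask, "borough"] = np.nan

print(toy)
```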

In [6]:
b311['borough'].value_counts()

BROOKLYN         1988158
BRONX            1816914
MANHATTAN        1175218
QUEENS            728676
STATEN ISLAND      99998
Name: borough, dtype: int64

### Filter NY 311 data by borough to only include BRONX
The instructions at ( Course / 1. Project Challenge Details and Setup / Datasets Used in this Course / Datasets ) said "Use only the part that is specific to the borough that you are interested in based on your analysis."  My answer for Question 2 suggested the borough with the biggest HEAT/HOT WATER problem was BRONX.  For that reason, I am only considering the BRONX data.

In [7]:
b311.drop(b311[(b311["borough"]!='BRONX')].index, axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
b311['borough'].value_counts()

BRONX    1816914
Name: borough, dtype: int64

### Remove unnecessary columns

In [8]:
#Remove columns deemed unnecessary for this question.
b311.drop(['created_date','street_name','address_type','city','resolution_description','borough','closed_date','location_type','status','unique_key','latitude','longitude'], axis=1, inplace=True)
print(b311.shape)
print(b311.isnull().sum())
b311.head()

(1816914, 3)
complaint_type         0
incident_zip        8083
incident_address       1
dtype: int64


Unnamed: 0,complaint_type,incident_zip,incident_address
0,OTHER,10456.0,881 CAULDWELL AVENUE
1,HEAT/HOT WATER,10457.0,4487 3 AVENUE
2,OTHER,10452.0,133 CLARKE PLACE EAST
3,OTHER,10453.0,2076 CRESTON AVENUE
4,OTHER,10471.0,6035 BROADWAY


In [9]:
# Drop the one observation with the missing address as there will be no way to tie it to any PLUTO data.
b311.dropna(subset=['incident_address'], axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
print(b311.isnull().sum())
b311['incident_address'].value_counts().head()

complaint_type         0
incident_zip        8082
incident_address       0
dtype: int64


1025 BOYNTON AVENUE        9854
3810 BAILEY AVENUE         7171
750 GRAND CONCOURSE        4412
888 GRAND CONCOURSE        4271
3555 BRUCKNER BOULEVARD    4076
Name: incident_address, dtype: int64

## BRONX PLUTO

In [10]:
print("shape %s" % str(bPlu.shape))
print("---isnull follows---")
print(bPlu.isnull().sum())
bPlu.head()

shape (89854, 20)
---isnull follows---
Lot              0
ZipCode        329
Address         69
LotArea          0
BldgArea         0
ResArea          0
OfficeArea       0
RetailArea       0
NumBldgs         0
NumFloors        0
LotDepth         0
BldgDepth        0
YearBuilt        0
YearAlter1       0
BuiltFAR         0
ResidFAR         0
CommFAR          0
FacilFAR         0
XCoord        3259
YCoord        3259
dtype: int64


Unnamed: 0,Lot,ZipCode,Address,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
0,1,10454.0,122 BRUCKNER BOULEVARD,15000,0,0,0,0,1,0.0,200.0,0.0,0,0,0.0,6.02,5.0,6.5,1005957.0,232162.0
1,4,10454.0,126 BRUCKNER BOULEVARD,13770,752,0,272,0,2,1.0,100.0,16.0,1931,1994,0.05,6.02,5.0,6.5,1006076.0,232156.0
2,10,10454.0,138 BRUCKNER BOULEVARD,35000,39375,0,0,0,1,2.0,200.0,200.0,1931,0,1.13,6.02,5.0,6.5,1006187.0,232036.0
3,17,10454.0,144 BRUCKNER BOULEVARD,2500,12500,12500,0,0,1,5.0,100.0,85.0,1931,2001,5.0,6.02,5.0,6.5,1006299.0,232033.0
4,18,10454.0,148 BRUCKNER BOULEVARD,1875,8595,6876,0,1719,1,5.0,75.0,70.0,1920,2009,4.58,6.02,5.0,6.5,1006363.0,232040.0


### General

In [11]:
bPlu.dtypes

Lot             int64
ZipCode       float64
Address        object
LotArea         int64
BldgArea        int64
ResArea         int64
OfficeArea      int64
RetailArea      int64
NumBldgs        int64
NumFloors     float64
LotDepth      float64
BldgDepth     float64
YearBuilt       int64
YearAlter1      int64
BuiltFAR      float64
ResidFAR      float64
CommFAR       float64
FacilFAR      float64
XCoord        float64
YCoord        float64
dtype: object

In [12]:
#Normalize relevant strings to uppercase so different casing won't appear as separate values.
bPlu['Address'] = bPlu['Address'].str.upper()

In [13]:
# Drop the observations with missing address as there will be no way to tie it to any 311 data.
bPlu.dropna(subset=['Address'], axis=0, inplace=True)
bPlu.reset_index(drop=True, inplace=True)

## Standardization of Addresses
Leveraging standardization methods developed during question 2.

In [14]:
print("BRONX 311 unique addresses: %s" % b311['incident_address'].unique().size)
print("BRONX PLUTO unique addresses: %s" % bPlu['Address'].unique().size)

BRONX 311 unique addresses: 29216
BRONX PLUTO unique addresses: 87017


<p style="color:Red;">Determine how much the two address sets overlap. Ideally, all 29K BRONX 311 addresses will be represented in the PLUTO set.</p>

In [15]:
def WhichAddressesNotInPluto(howManyTopToShow):
    complaints = set(b311['incident_address'].unique())
    pluto = set(bPlu['Address'].unique())
    #Determine which 311 addresses were not found in PLUTO to gain insight as to why.
    differences = complaints.difference(pluto)
    print("Records not in PLUTO: %s.  Percent: %s" % (len(differences), "{:.2%}".format(len(differences) / len(complaints))))
    print("---Top %i---" % howManyTopToShow)
    print(b311[b311['incident_address'].isin(differences)]['incident_address'].value_counts().head(howManyTopToShow))
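
The function above reports the share of unique addresses that are missing from PLUTO. A complementary check, sketched below on made-up data, measures the share of complaint rows that would be lost, since a few busy addresses can account for many complaints. The frame names and addresses here are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the 311 and PLUTO frames.
complaints = pd.DataFrame(
    {"incident_address": ["1 A STREET", "1 A STREET", "2 B AVENUE", "9 Z ROAD"]}
)
lots = pd.DataFrame({"Address": ["1 A STREET", "2 B AVENUE", "3 C PLACE"]})

# A left merge with indicator=True labels each complaint row as
# "both" (matched) or "left_only" (no PLUTO address found).
checked = complaints.merge(
    lots.drop_duplicates("Address"),
    left_on="incident_address",
    right_on="Address",
    how="left",
    indicator=True,
)
unmatched_share = (checked["_merge"] == "left_only").mean()
print("Complaint rows without a PLUTO match: {:.2%}".format(unmatched_share))
```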

In [16]:
WhichAddressesNotInPluto(3)

Records not in PLUTO: 6753.  Percent: 23.11%
---Top 3---
2090 EAST TREMONT AVENUE         3940
266 BEDFORD PARK BOULEVARD       2594
1425 DR M L KING JR BOULEVARD    2534
Name: incident_address, dtype: int64


<p style="color:Red;">23 percent of the unique addresses in the BRONX 311 data cannot be matched to the BRONX PLUTO data.</p>

### Borrow some Python functions developed during Question 2
With minor improvements so they work better with full addresses instead of just street names.

In [17]:
# Some street values have multiple spaces in a row.
import re
def standardize_spaces(raw):
    result = raw.strip() #Remove leading and trailing spaces.
    result = re.sub(' +', ' ', result) #Squeeze multiple adjacent spaces into just one space.
    return result

In [18]:
# Some streets have problematic characters.  For example:  ST. ANN'S AVENUE also exists without the period or apostrophe.
problem_characters = ['.', '\'']
def replace_problem_characters(raw):
    result = raw
    for (character) in problem_characters:
        result = result.replace(character,'')
    return result

In [19]:
#Some words are sometimes entered in a non-standard way or with typos and need to be standardized.
word_replacements = [("AVE","AVENUE"),("ST","STREET"),("RD","ROAD"),("FT","FORT"),("BX","BRONX"),("MT","MOUNT"),
                     ("NICHLAS","NICHOLAS"),("NICHALOS","NICHOLAS"),("EXPRE","EXPRESSWAY"),("HARACE","HORACE"),
                     ("NO","NORTH"),("AV","AVENUE"),("CRK","CREEK"),("FR","FATHER"),("JR","JUNIOR"),("GR","GRAND"),
                     ("CT","COURT"),
                     ("SR",""), # Service Road.  These are always near a similarly named street.  Lump together.
                     ("QN","QUEENS"),
                     ("ND",""), # A space between a number and ND such as EAST 52 ND STREET.  Note ST and RD can be street or road.
                     ("PO","POND"),("BO","BOND"),("GRA","GRAND"),("REV","REVEREND"),("CO-OP","COOP"),
                     ("GRANDCONCOURSE", "GRAND CONCOURSE"),("CENTRL", "CENTRAL"),("BLVD","BOULEVARD"),
                     ("FREDRICK", "FREDERICK"),("DOUGLAS", "DOUGLASS"),("MALCOM", "MALCOMN"),
                     ("NORTHEN", "NORTHERN"),("AVNEUE","AVENUE"),
                    ("N","NORTH"),("S","SOUTH"),("E","EAST"),("W","WEST"),("SW","SOUTHWEST"),
                             ("NW","NORTHWEST"),("SE","SOUTHEAST"),("NE","NORTHEAST")]
def replace_words(raw):
    split_raw = raw.split()
    for (old, new) in word_replacements:
        found_at_index = next((i for i, x in enumerate(split_raw) if x==old), None)
        if found_at_index!=None:
            split_raw[found_at_index] = new
    return standardize_spaces(" ".join(split_raw))

In [20]:
#Some words are actually prefixes of the following word.  Example the LA prefix of LA GRANGE.
word_prefixes = ["DE","MC","LA","VAN","MAC","CO"]
def concatenate_prefixes(raw):
    split_raw = raw.split()
    last_word = len(split_raw) - 1
    for (prefix) in word_prefixes:
        found_at_index = next((i for i, x in enumerate(split_raw) if x==prefix), None)
        if found_at_index!=None:
            if len(split_raw)>1:
                if found_at_index != last_word:
                    split_raw[found_at_index] = ''
                    split_raw[found_at_index+1] = prefix + split_raw[found_at_index+1]
                    return standardize_spaces(" ".join(split_raw))
    return raw

In [21]:
#Some phrases need custom replacement because they involve multiple words or are easily misinterpreted out of context.
phrase_replacements = [("DR M L KING JR","MARTIN LUTHER KING"),("DR MARTIN L KING","MARTIN LUTHER KING"),
    ("MARTIN LUTHER KING","MARTIN LUTHER KING"),("MARTIN L KING JR","MARTIN LUTHER KING"),
    ("MARTIN L KING","MARTIN LUTHER KING"),("ST NICHOLAS","SAINT NICHOLAS"),("ST JOHN","SAINT JOHN"),
    ("ST MARK","SAINT MARK"),("ST ANN","SAINT ANN"),("ST LAWRENCE","SAINT LAWRENCE"),("ST PAUL","SAINT PAUL"),
    ("ST PETER","SAINT PETER"),("ST RAYMOND","SAINT RAYMOND"),("ST THERESA","SAINT THERESA"),("ST FELIX","SAINT FELIX"),
    ("ST MARY","SAINT MARY"),("ST OUEN","SAINT OUEN"),("ST JAMES","SAINT JAMES"),("ST GEORGE","SAINT GEORGE"),
    ("ST EDWARD","SAINT EDWARD"),("ST CHARLES","SAINT CHARLES"),("ST FRANCIS","SAINT FRANCIS"),
    ("ST ANDREW","SAINT ANDREW"),("ST JUDE","SAINT JUDE"),("ST LUKE","SAINT LUKE"),("ST JOSEPH","SAINT JOSEPH"),
    ("N D PERLMAN","NATHAN PERLMAN"),("O BRIEN","OBRIEN"),("F D R","FDR"),("EXPRESSWAY N SR","EXPRESSWAY SR N"),
    ("HOR HARDING","HORACE HARDING"),
    ("SERVICE ROAD",""), # These are always near a similarly named street. Lump together.
    ("DUMMY",""),("ADAM C POWELL","ADAM CLAYTON POWELL"),("POWELL COVE","POWELLS COVE")]
def replace_phrases(raw):
    result = raw
    for (old,new) in phrase_replacements:
        result = standardize_spaces(result.replace(old,new))
    return result

In [22]:
# 1ST, 2ND, 3RD, 4TH, ... nTH
# Remove the suffixes leaving the numbers by themselves.
number_suffixes = ["ST","ND","RD","TH"]
digits=["1","2","3","4","5","6","7","8","9","0"]
def remove_number_suffixes(raw):
    split_raw = raw.split()
    for suffix in number_suffixes:
        found_at_index = next((i for i, x in enumerate(split_raw) if x[0] in digits and x.endswith(suffix)), None) 
        if found_at_index!=None:            
            split_raw[found_at_index] = split_raw[found_at_index][:-2]
            return standardize_spaces(" ".join(split_raw))
    return raw

In [23]:
def standardize_street(street):
    r = street 
    r = standardize_spaces(r) 
    r = replace_problem_characters(r) 
    r = replace_phrases(r) 
    r = replace_words(r) 
    r = concatenate_prefixes(r) 
    r = remove_number_suffixes(r) 
    return r
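
To make the word-replacement step concrete, here is a small self-contained sketch of the same idea using a dictionary lookup per token. It is not the pipeline above: `WORD_MAP` is only an illustrative subset of the replacement table, `normalize_tokens` is a hypothetical name, and it skips the phrase, prefix, and number-suffix steps.

```python
import re

# Illustrative subset of the replacement table above (assumption).
WORD_MAP = {"AVE": "AVENUE", "ST": "STREET", "BLVD": "BOULEVARD",
            "E": "EAST", "W": "WEST"}

def normalize_tokens(raw):
    """Uppercase, strip periods/apostrophes, and expand known abbreviations."""
    cleaned = re.sub(r"[.']", "", raw.upper())
    tokens = cleaned.split()  # split() also squeezes repeated spaces
    return " ".join(WORD_MAP.get(token, token) for token in tokens)

print(normalize_tokens("  2090  e. Tremont ave "))
```

One difference worth noting: the dictionary lookup expands every matching token, whereas `replace_words` above only changes the first occurrence of each word in an address.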

### Standardize the address in both the 311 and PLUTO data

In [24]:
b311['incident_address'] = b311['incident_address'].apply(standardize_street)

In [25]:
bPlu['Address'] = bPlu['Address'].apply(standardize_street)

In [26]:
# See if there was an improvement in how well the 311 data can be merged with the PLUTO data by address.
WhichAddressesNotInPluto(3)

Records not in PLUTO: 3033.  Percent: 11.39%
---Top 3---
2090 EAST TREMONT AVENUE             3940
1425 MARTIN LUTHER KING BOULEVARD    2534
1259 CLAY AVENUE                     2297
Name: incident_address, dtype: int64


<p style="color:Red;">The percentage of unique addresses in the BRONX 311 data that could not be matched to the BRONX PLUTO data set improved from 23% to under 12%.</p>

## Merging the 311 and PLUTO datasets

In [27]:
combined = b311.merge(bPlu, left_on='incident_address', right_on='Address', how='inner')
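
One thing to watch with this merge: the PLUTO frame has more rows than unique addresses (see the counts earlier), so an address that appears on more than one lot will duplicate every complaint filed at it. The sketch below, on made-up data, shows the effect and one way to detect it with the `validate` argument of `merge`. Keeping the first lot per address is an arbitrary choice made for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical data: one address listed on two different lots.
complaints = pd.DataFrame({
    "incident_address": ["1 A STREET", "1 A STREET"],
    "complaint_type": ["OTHER", "HEAT/HOT WATER"],
})
lots = pd.DataFrame({"Address": ["1 A STREET", "1 A STREET"], "Lot": [1, 2]})

# A plain inner merge pairs each complaint with each matching lot.
naive = complaints.merge(lots, left_on="incident_address",
                         right_on="Address", how="inner")
print(len(naive))

# validate="many_to_one" raises if the right-hand keys are not unique.
try:
    complaints.merge(lots, left_on="incident_address", right_on="Address",
                     how="inner", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

# Assumption for illustration: keep the first lot per address.
deduped = lots.drop_duplicates("Address", keep="first")
safe = complaints.merge(deduped, left_on="incident_address",
                        right_on="Address", how="inner",
                        validate="many_to_one")
print(len(safe))
```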

In [29]:
combined.drop(['incident_address'], axis=1, inplace=True)
print(combined.shape)
combined.head()

(1628421, 22)


Unnamed: 0,complaint_type,incident_zip,Lot,ZipCode,Address,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,...,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
0,OTHER,10456.0,52,10456.0,881 CAULDWELL AVENUE,1800,2640,1950,0,0,...,100.0,35.0,1901,2000,1.47,2.43,0.0,4.8,1009827.0,238873.0
1,OTHER,10456.0,52,10456.0,881 CAULDWELL AVENUE,1800,2640,1950,0,0,...,100.0,35.0,1901,2000,1.47,2.43,0.0,4.8,1009827.0,238873.0
2,OTHER,10456.0,52,10456.0,881 CAULDWELL AVENUE,1800,2640,1950,0,0,...,100.0,35.0,1901,2000,1.47,2.43,0.0,4.8,1009827.0,238873.0
3,OTHER,10456.0,52,10456.0,881 CAULDWELL AVENUE,1800,2640,1950,0,0,...,100.0,35.0,1901,2000,1.47,2.43,0.0,4.8,1009827.0,238873.0
4,OTHER,10456.0,52,10456.0,881 CAULDWELL AVENUE,1800,2640,1950,0,0,...,100.0,35.0,1901,2000,1.47,2.43,0.0,4.8,1009827.0,238873.0


In [36]:
combined.describe()

Unnamed: 0,incident_zip,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
count,1621311.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1628421.0,1627360.0,1627360.0
mean,10460.55,99.47213,10460.56,13638.88,50159.75,48623.59,211.562,715.9137,1.112133,5.266443,114.0181,89.14868,1927.279,489.4255,3.641243,3.376216,0.09826427,4.650286,1014326.0,249154.6
std,6.482533,595.261,6.454374,28386.88,70553.75,68649.24,5132.482,2275.009,0.852602,2.431705,45.17917,36.38943,93.32986,857.6975,2.584701,1.567639,0.5465802,1.412978,6300.347,7957.665
min,10451.0,1.0,10451.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1003067.0,230939.0
25%,10456.0,16.0,10456.0,4526.0,15500.0,14656.0,0.0,0.0,1.0,5.0,100.0,72.0,1922.0,0.0,2.91,2.43,0.0,4.8,1009528.0,242594.0
50%,10459.0,35.0,10459.0,10000.0,39338.0,38250.0,0.0,0.0,1.0,5.0,100.0,87.0,1927.0,0.0,3.88,3.44,0.0,4.8,1013194.0,248679.0
75%,10467.0,62.0,10467.0,15100.0,64824.0,63000.0,0.0,0.0,1.0,6.0,120.0,97.33,1931.0,0.0,4.49,3.44,0.0,4.8,1018532.0,255238.0
max,10803.0,9100.0,10475.0,3392065.0,5541031.0,5529331.0,1311800.0,314924.0,129.0,42.0,2276.0,950.0,2017.0,2017.0,259.8,10.0,9.0,10.0,1044583.0,271831.0


In [37]:
#Drop lots that do not contain any actual buildings.
combined.drop(combined[combined["NumBldgs"]==0].index, axis=0, inplace=True)
combined.reset_index(drop=True, inplace=True) #reset indexes since columns or rows have been dropped.

In [38]:
combined.describe()

Unnamed: 0,incident_zip,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
count,1617996.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1625085.0,1624163.0,1624163.0
mean,10460.56,99.38981,10460.57,13598.06,50257.46,48718.39,211.9715,717.3176,1.114416,5.276678,113.991,89.32799,1930.996,490.3905,3.648247,3.377283,0.09831756,4.651429,1014338.0,249156.1
std,6.481216,595.8329,6.453017,28373.3,70588.52,68683.16,5137.707,2276.959,0.8519849,2.422676,45.1983,36.20783,39.20272,858.2665,2.582123,1.567945,0.5467056,1.412805,6295.964,7958.586
min,10451.0,1.0,10451.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1003067.0,230939.0
25%,10456.0,16.0,10456.0,4526.0,15565.0,14862.0,0.0,0.0,1.0,5.0,100.0,72.0,1922.0,0.0,2.91,2.43,0.0,4.8,1009540.0,242586.0
50%,10459.0,35.0,10459.0,10000.0,39432.0,38283.0,0.0,0.0,1.0,5.0,100.0,87.0,1927.0,0.0,3.89,3.44,0.0,4.8,1013200.0,248679.0
75%,10467.0,62.0,10467.0,15083.0,64895.0,63000.0,0.0,0.0,1.0,6.0,119.83,97.5,1931.0,0.0,4.49,3.44,0.0,4.8,1018533.0,255238.0
max,10803.0,9100.0,10475.0,3392065.0,5541031.0,5529331.0,1311800.0,314924.0,129.0,42.0,2276.0,950.0,2017.0,2017.0,259.8,10.0,9.0,10.0,1044583.0,271831.0


In [40]:
#Drop buildings that have no building area recorded.
combined.drop(combined[combined["BldgArea"]==0].index, axis=0, inplace=True)
combined.reset_index(drop=True, inplace=True) #reset indexes since columns or rows have been dropped.
combined.describe()

Unnamed: 0,incident_zip,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
count,1617874.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624962.0,1624064.0,1624064.0
mean,10460.56,99.28458,10460.57,13598.27,50261.26,48722.08,211.9875,717.3719,1.114393,5.276879,113.9917,89.33445,1931.129,490.4277,3.648523,3.377206,0.09831688,4.651377,1014338.0,249156.4
std,6.481163,595.1759,6.452964,28374.19,70589.84,68684.45,5137.901,2277.037,0.8519829,2.422312,45.19628,36.20137,35.77631,858.2884,2.582026,1.567917,0.5467059,1.412822,6295.999,7958.631
min,10451.0,1.0,10451.0,0.0,18.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1003067.0,230939.0
25%,10456.0,16.0,10456.0,4526.0,15565.0,14862.0,0.0,0.0,1.0,5.0,100.0,72.0,1922.0,0.0,2.92,2.43,0.0,4.8,1009540.0,242586.0
50%,10459.0,35.0,10459.0,10000.0,39432.0,38300.0,0.0,0.0,1.0,5.0,100.0,87.0,1927.0,0.0,3.89,3.44,0.0,4.8,1013200.0,248679.0
75%,10467.0,62.0,10467.0,15083.0,64923.0,63000.0,0.0,0.0,1.0,6.0,119.64,97.5,1931.0,0.0,4.49,3.44,0.0,4.8,1018533.0,255238.0
max,10803.0,9100.0,10475.0,3392065.0,5541031.0,5529331.0,1311800.0,314924.0,129.0,42.0,2276.0,950.0,2017.0,2017.0,259.8,10.0,9.0,10.0,1044583.0,271831.0


In [41]:
#Drop buildings with no recorded year built.
combined.drop(combined[combined["YearBuilt"]==0].index, axis=0, inplace=True)
combined.reset_index(drop=True, inplace=True) #reset indexes since columns or rows have been dropped.
combined.describe()

Unnamed: 0,incident_zip,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
count,1617525.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1624613.0,1623715.0,1623715.0
mean,10460.56,98.94322,10460.57,13599.93,50258.74,48721.6,210.936,716.8341,1.114352,5.27752,114.001,89.35268,1931.544,490.4935,3.648796,3.377241,0.09827177,4.651353,1014338.0,249157.1
std,6.481462,593.0537,6.453256,28376.47,70591.22,68688.07,5135.001,2275.111,0.8519959,2.421568,45.18729,36.1833,21.88438,858.3269,2.582015,1.567867,0.5466727,1.412771,6295.89,7958.908
min,10451.0,1.0,10451.0,0.0,18.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1800.0,0.0,0.0,0.0,0.0,0.0,1003067.0,230939.0
25%,10456.0,16.0,10456.0,4526.0,15565.0,14862.0,0.0,0.0,1.0,5.0,100.0,72.0,1922.0,0.0,2.92,2.43,0.0,4.8,1009540.0,242587.0
50%,10459.0,35.0,10459.0,10000.0,39432.0,38300.0,0.0,0.0,1.0,5.0,100.0,87.0,1927.0,0.0,3.89,3.44,0.0,4.8,1013200.0,248679.0
75%,10467.0,62.0,10467.0,15083.0,64895.0,63000.0,0.0,0.0,1.0,6.0,119.83,97.5,1931.0,0.0,4.49,3.44,0.0,4.8,1018533.0,255243.0
max,10803.0,9100.0,10475.0,3392065.0,5541031.0,5529331.0,1311800.0,314924.0,129.0,42.0,2276.0,950.0,2017.0,2017.0,259.8,10.0,9.0,10.0,1044583.0,271831.0


In [42]:
#Drop buildings with no floors recorded.
combined.drop(combined[combined["NumFloors"]==0].index, axis=0, inplace=True)
combined.reset_index(drop=True, inplace=True) #reset indexes since columns or rows have been dropped.
combined.describe()

Unnamed: 0,incident_zip,Lot,ZipCode,LotArea,BldgArea,ResArea,OfficeArea,RetailArea,NumBldgs,NumFloors,LotDepth,BldgDepth,YearBuilt,YearAlter1,BuiltFAR,ResidFAR,CommFAR,FacilFAR,XCoord,YCoord
count,1617052.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1624138.0,1623240.0,1623240.0
mean,10460.56,98.95074,10460.57,13602.23,50269.96,48734.86,210.9977,717.0119,1.114385,5.279064,114.0052,89.36795,1931.523,490.637,3.649204,3.377627,0.09829682,4.651762,1014336.0,249156.4
std,6.480814,593.1396,6.452601,28379.88,70597.96,68693.69,5135.751,2275.409,0.8521183,2.42024,45.19033,36.17553,21.85143,858.4114,2.582069,1.567636,0.546744,1.412428,6294.318,7958.63
min,10451.0,1.0,10451.0,0.0,18.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1800.0,0.0,0.0,0.0,0.0,0.0,1003067.0,230939.0
25%,10456.0,16.0,10456.0,4539.0,15570.0,14875.0,0.0,0.0,1.0,5.0,100.0,72.0,1922.0,0.0,2.92,2.43,0.0,4.8,1009540.0,242587.0
50%,10459.0,35.0,10459.0,10000.0,39432.0,38325.0,0.0,0.0,1.0,5.0,100.0,87.0,1927.0,0.0,3.89,3.44,0.0,4.8,1013199.0,248679.0
75%,10467.0,62.0,10467.0,15083.0,64923.0,63000.0,0.0,0.0,1.0,6.0,119.83,97.5,1931.0,0.0,4.49,3.44,0.0,4.8,1018533.0,255238.0
max,10803.0,9100.0,10475.0,3392065.0,5541031.0,5529331.0,1311800.0,314924.0,129.0,42.0,2276.0,950.0,2017.0,2017.0,259.8,10.0,9.0,10.0,1044583.0,271831.0
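
The four filters above each drop rows where a column holds 0 as a stand-in for "missing". As a sketch of an equivalent single pass, the example below applies the same rule to a made-up frame with one boolean mask. The toy values are illustrative only.

```python
import pandas as pd

# Hypothetical rows: only the first one has every required value filled in.
toy = pd.DataFrame({
    "NumBldgs":  [1, 0, 1, 1, 2],
    "BldgArea":  [2640, 0, 0, 8595, 12500],
    "YearBuilt": [1901, 0, 1931, 0, 1920],
    "NumFloors": [3.0, 0.0, 1.0, 5.0, 0.0],
})

# Columns where a value of 0 means "not recorded".
required = ["NumBldgs", "BldgArea", "YearBuilt", "NumFloors"]

# Keep a row only if none of the required columns is 0.
keep = (toy[required] != 0).all(axis=1)
cleaned = toy[keep].reset_index(drop=True)
print(cleaned)
```

Running the filters one at a time, as the notebook does, has the advantage of showing how many rows each rule removes.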


## Save combined and cleaned data

In [43]:
combined.to_pickle(files_path + 'combined.pkl')

<a id="analysis"></a>
# Analyzing and Visualizing
---

## Load combined and cleaned data

In [45]:
#Load
combined = pd.read_pickle(files_path + 'combined.pkl')
combined.shape

(1624138, 22)

## Analyze

In [None]:
# Check for correlations
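
One possible way to fill in this step is sketched below. It encodes the complaint type as a 0/1 indicator and computes the Pearson correlation of each numeric house characteristic with that indicator. The six rows are invented purely so the code runs and say nothing about the real data. The column names match the combined frame, and `is_heat` is a hypothetical helper column.

```python
import pandas as pd

# Invented values for illustration only; not results from the 311/PLUTO data.
toy = pd.DataFrame({
    "complaint_type": ["HEAT/HOT WATER", "OTHER", "HEAT/HOT WATER",
                       "OTHER", "HEAT/HOT WATER", "OTHER"],
    "YearBuilt": [1920, 1985, 1927, 1990, 1931, 2001],
    "NumFloors": [6.0, 2.0, 5.0, 3.0, 6.0, 2.0],
})

# 1 for the complaint type of interest, 0 for everything else.
toy["is_heat"] = (toy["complaint_type"] == "HEAT/HOT WATER").astype(int)

# Pearson correlation of each feature with the 0/1 indicator.
features = ["YearBuilt", "NumFloors"]
correlations = toy[features].corrwith(toy["is_heat"]).sort_values()
print(correlations)
```

On the real `combined` frame each row is a complaint, so buildings with many complaints carry more weight. Aggregating to one row per address before correlating would be a reasonable alternative.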

<a id="conclusion"></a>
# Concluding Remarks
---

The analysis cell above is still empty, so this notebook does not yet establish whether the HEAT/HOT WATER (including HEATING) complaint type, identified in Question 1 as the most prevalent complaint type, has an obvious relationship with any characteristic of the houses in the BRONX (the borough identified in Question 2).