<h1 style="text-align:center">Data Science and Machine Learning Capstone Project</h1>
<img style="float:right" src="https://prod-edxapp.edx-cdn.org/static/edx.org/images/logo.790c9a5340cb.png">
<p style="text-align:center">IBM: DS0720EN</p>
<p style="text-align:center">Question 3 of 4</p>

1. [Problem Statement](#problem)
2. [Question 3](#question)
3. [Data Cleaning and Standardization](#wrangling)
4. [Analyzing and Visualizing](#analysis)
5. [Concluding Remarks](#conclusion)

<a id="problem"></a>
# Problem Statement
---

The people of New York use the 311 system to report complaints about non-emergency problems to local authorities. Various agencies in New York are assigned these problems. The Department of Housing Preservation and Development of New York City is the agency that processes 311 complaints related to housing and buildings.

In the last few years, the number of 311 complaints coming to the Department of Housing Preservation and Development has increased significantly. Although these complaints are not necessarily urgent, their large volume and sudden increase are impacting the overall efficiency of the agency's operations.

Therefore, the Department of Housing Preservation and Development has approached your organization to help them manage the large volume of 311 complaints they are receiving every year.

The agency needs answers to several questions, and those answers must be supported by data and analytics. This notebook addresses the third of those questions.

<a id="question"></a>
# Question 3
---

Does the Complaint Type that you identified in response to Question 1 have an obvious relationship with any particular characteristic or characteristics of the houses?

## Approach
Determine whether any building characteristics in the PLUTO housing database correlate with whether a building experienced HEAT/HOT WATER complaints (the complaint type identified in Question 1), by comparing the buildings that had such complaints against all buildings in the database.

## Load Data
Separately from this notebook:

The [New York 311](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) data was loaded by [SODA](https://data.cityofnewyork.us/resource/fhrw-4uyv.csv?$limit=100000000&Agency=HPD&$select=created_date,unique_key,complaint_type,incident_zip,incident_address,street_name,address_type,city,resolution_description,borough,latitude,longitude,closed_date,location_type,status) into a Pandas DataFrame then saved to a pickle file.
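
The download itself is not part of this notebook. As a rough sketch of how a SODA request like the one linked above could be assembled, the snippet below builds the query URL from its parts. The helper name `build_soda_url`, the shortened column list, and the small `$limit` are assumptions made for illustration, not the exact request that was used.

```python
from urllib.parse import urlencode

# Endpoint taken from the SODA link above.
SODA_ENDPOINT = "https://data.cityofnewyork.us/resource/fhrw-4uyv.csv"

def build_soda_url(endpoint, agency, columns, limit):
    """Hypothetical helper: assemble a SODA query URL for one agency."""
    params = {
        "$limit": limit,
        "Agency": agency,
        "$select": ",".join(columns),
    }
    # Keep '$' and ',' readable instead of percent-encoding them.
    return endpoint + "?" + urlencode(params, safe="$,")

# Shortened column list and small limit, for illustration only.
url = build_soda_url(SODA_ENDPOINT, "HPD",
                     ["created_date", "complaint_type", "incident_address"],
                     limit=1000)
print(url)
```

A URL built this way can be passed to `pandas.read_csv`, and the resulting frame saved with `DataFrame.to_pickle`, which matches the load-then-pickle flow described above.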

The [New York PLUTO](https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/xuk2-nczf) data was downloaded.  The instructions at ( Course / 1. Project Challenge Details and Setup / Datasets Used in this Course / Datasets ) said "Use only the part that is specific to the borough that you are interested in based on your analysis."  My answer for Question 2 suggested that the borough with the biggest HEAT/HOT WATER problem was BRONX.  For that reason, only the BX_18v1.csv file was loaded into a Pandas DataFrame and then saved to a pickle file.

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
import scipy.stats as stats
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
from sklearn.metrics import jaccard_score, classification_report, log_loss, confusion_matrix

files_path = 'C:\\Users\\It_Co\\Documents\\DataScience\\Capstone\\' #local
#files_path = './' #IBM Cloud / Watson Studio

In [None]:
b311 = pd.read_pickle(files_path + 'ny311full.pkl')

#file_columns = ['Address','BldgArea','BldgDepth','BuiltFAR','CommFAR','FacilFAR','Lot','LotArea','LotDepth','NumBldgs','NumFloors','OfficeArea','ResArea','ResidFAR','RetailArea','YearBuilt','YearAlter1','ZipCode', 'YCoord', 'XCoord']
#df = pd.read_csv(files_path + 'BX_18v1.csv', usecols=file_columns)
#df = pd.concat([df, pd.read_csv(files_path + 'BK_18v1.csv', usecols=file_columns)])
#df = pd.concat([df, pd.read_csv(files_path + 'MN_18v1.csv', usecols=file_columns)])
#df = pd.concat([df, pd.read_csv(files_path + 'QN_18v1.csv', usecols=file_columns)])
#df = pd.concat([df, pd.read_csv(files_path + 'SI_18v1.csv', usecols=file_columns)])
#df.to_pickle(files_path + 'q3.pkl')

df = pd.read_pickle(files_path + 'q3.pkl')

print("NY 311 shape %s" % (b311.shape,))
print("PLUTO shape %s" % (df.shape,))

<a id="wrangling"></a>
# Data Cleaning and Standardization
---

Correct or remove observations with missing or malformed data.  The 311 and the PLUTO data sets will need to be "joined" together by the common "address" element, which means the addresses will need to be standardized to a consistent layout to allow the addresses to be compared consistently.

## NY 311

### General

In [None]:
#Remove columns deemed unnecessary for this question.
b311.drop(['created_date','street_name','address_type','resolution_description','closed_date','location_type','status','unique_key','latitude','longitude'], axis=1, inplace=True)
#Only use the combined "heating and hot water" complaints determined from Question 1.
b311['complaint_type'] = b311['complaint_type'].str.upper()
b311.drop(b311[b311["complaint_type"].isin(["HEAT/HOT WATER","HEATING"])==False].index, axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
#Adjust all relevant strings to uppercase so different casing won't appear as separate values.
b311['incident_address'] = b311['incident_address'].str.upper()
b311['city'] = b311['city'].str.upper()
b311['borough'] = b311['borough'].str.upper()

In [None]:
#Print some initial information for comparison during later steps.
print("shape %s" % str(b311.shape))
print("--nulls below--")
print(b311.isnull().sum())
print("--types below--")
print(b311.dtypes)
b311.head()

### Standardize Borough
This reuses findings from the borough standardization done for Question 2.

In [None]:
b311['borough'].value_counts()

In [None]:
#Correct rows where borough was entered in the city column with "UNSPECIFIED" in the borough column.
five_boroughs = ["BROOKLYN","BRONX","MANHATTAN","QUEENS","STATEN ISLAND"]
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&b311["city"].isin(five_boroughs)].index
b311.loc[which_rows_to_adjust,'borough']=b311.loc[which_rows_to_adjust,'city']
b311.loc[which_rows_to_adjust,'city']=np.nan
#Drop a few rows of ambiguous data.
b311.drop(b311[(b311["borough"]=='MANHATTAN')&(b311["city"]=='BRONX')].index, axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
#Fill in UNSPECIFIED borough when city was entered as NEW YORK.
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&(b311["city"]=='NEW YORK')].index
b311.loc[which_rows_to_adjust,'borough']="MANHATTAN"
b311.loc[which_rows_to_adjust,'city']=np.nan
#Rows with city "NEW YORK" are technically the only ones with a correct "city" value,
#but every other row uses the city column as "neighborhood", so null these for consistency.
which_rows_to_adjust = b311[(b311["city"]=='NEW YORK')].index
b311.loc[which_rows_to_adjust,'city']=np.nan
#Any still unspecified boroughs with a value in "city" are in the Queens borough.  The "city" is actually a "neighborhood".
queens_neighborhoods = b311[(b311['borough']=='UNSPECIFIED')&(b311['city'].isnull()==False)]['city'].unique()
#Standardize borough for Queens neighborhoods.
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&b311["city"].isin(queens_neighborhoods)].index
b311.loc[which_rows_to_adjust,'borough']="QUEENS"
#Null the borough if it still shows up as unspecified borough as there is no other information from which to derive it.
which_rows_to_adjust = b311[(b311["borough"]=='UNSPECIFIED')&b311["city"].isnull()].index
b311.loc[which_rows_to_adjust,'borough']=np.nan
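
To make the effect of these rules easier to follow, here is a minimal sketch that applies three of them to a toy frame. The rows are invented for illustration, and the sketch uses boolean masks directly instead of index lookups, which should be equivalent for this purpose.

```python
import numpy as np
import pandas as pd

# Toy rows (invented for illustration) covering the main cases handled above.
toy = pd.DataFrame({
    "borough": ["UNSPECIFIED", "UNSPECIFIED", "UNSPECIFIED", "BRONX"],
    "city":    ["BRONX",       "NEW YORK",    None,          "BRONX"],
})
five_boroughs = ["BROOKLYN", "BRONX", "MANHATTAN", "QUEENS", "STATEN ISLAND"]

# Case 1: the borough name was typed into the city column.
mask = toy["borough"].eq("UNSPECIFIED") & toy["city"].isin(five_boroughs)
toy.loc[mask, "borough"] = toy.loc[mask, "city"]
toy.loc[mask, "city"] = np.nan

# Case 2: city "NEW YORK" is treated as Manhattan.
mask = toy["borough"].eq("UNSPECIFIED") & toy["city"].eq("NEW YORK")
toy.loc[mask, "borough"] = "MANHATTAN"
toy.loc[mask, "city"] = np.nan

# Case 3: nothing is left from which to derive the borough.
mask = toy["borough"].eq("UNSPECIFIED") & toy["city"].isnull()
toy.loc[mask, "borough"] = np.nan

print(toy)
```

The order matters: case 3 has to run last, because cases 1 and 2 null the city column for rows they have already resolved.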

In [None]:
b311['borough'].value_counts()

### Filter NY 311 data by borough to only include BRONX
The instructions at ( Course / 1. Project Challenge Details and Setup / Datasets Used in this Course / Datasets ) said "Use only the part that is specific to the borough that you are interested in based on your analysis."  My answer for Question 2 suggested the borough with the biggest HEAT/HOT WATER problem was BRONX.  For that reason, I am only considering the BRONX data.

In [None]:
b311.drop(b311[(b311["borough"]!='BRONX')].index, axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
b311['borough'].value_counts()

### Remove unnecessary columns
These were only necessary to standardize and then filter by borough.

In [None]:
#Remove columns no longer necessary
b311.drop(['borough','city'], axis=1, inplace=True)
print(b311.shape)
print(b311.isnull().sum())
b311.head(3)

In [None]:
# Drop observations with missing address as there will be no way to tie them to any PLUTO data.
b311.dropna(subset=['incident_address'], axis=0, inplace=True)
b311.reset_index(drop=True, inplace=True)
print(b311.isnull().sum())
b311['incident_address'].value_counts().head()

## BRONX PLUTO

In [None]:
print("shape %s" % str(df.shape))
print("---isnull follows---")
print(df.isnull().sum())
df.head()

### General

In [None]:
df.dtypes

In [None]:
#Adjust relevant strings to uppercase so different casing won't appear as separate values.
df['Address'] = df['Address'].str.upper()

In [None]:
# Drop the observations with missing address as there will be no way to tie them to any 311 data.
df.dropna(subset=['Address'], axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

## Standardization of Addresses
Leveraging standardization methods developed during question 2.

In [None]:
print("BRONX 311 unique addresses: %s" % b311['incident_address'].unique().size)
print("BRONX PLUTO unique addresses: %s" % df['Address'].unique().size)

<p style="color:Red;">Determine how much the two address sets overlap.  Ideally all 29K BRONX 311 addresses will be represented in the PLUTO set.</p>

In [None]:
def WhichAddressesNotInPluto(howManyTopToShow):
    complaints = set(b311['incident_address'].unique())
    pluto = set(df['Address'].unique())
    #Determine which 311 addresses were not found in PLUTO to gain insight as to why.
    differences = complaints.difference(pluto)
    print("Records not in PLUTO: %s.  Percent: %s" % (len(differences), "{:.2%}".format(len(differences) / len(complaints))))
    print("---Top %i---" % howManyTopToShow)
    print(b311[b311['incident_address'].isin(differences)]['incident_address'].value_counts().head(howManyTopToShow))

In [None]:
WhichAddressesNotInPluto(3)

<p style="color:Red;">Over 20 percent of addresses in the BRONX 311 data cannot be merged to the BRONX PLUTO data prior to standardization.</p>

### Borrow some Python functions developed during Question 2
With minor improvements so they work better with full addresses instead of just street names.

In [None]:
# Some street values have multiple spaces in a row.
import re
def standardize_spaces(raw):
    result = raw.strip() #Remove leading and trailing spaces.
    result = re.sub(' +', ' ', result) #Squeeze multiple adjacent spaces into just one space.
    return result

In [None]:
# Some streets have problematic characters.  For example:  ST. ANN'S AVENUE also exists without the period or apostrophe.
problem_characters = ['.', '\'']
def replace_problem_characters(raw):
    result = raw
    for (character) in problem_characters:
        result = result.replace(character,'')
    return result

In [None]:
#Some words are entered in a non-standard way or with typos and need to be standardized.
word_replacements = [("AVE","AVENUE"),("ST","STREET"),("RD","ROAD"),("FT","FORT"),("BX","BRONX"),("MT","MOUNT"),
                     ("NICHLAS","NICHOLAS"),("NICHALOS","NICHOLAS"),("EXPRE","EXPRESSWAY"),("HARACE","HORACE"),
                     ("NO","NORTH"),("AV","AVENUE"),("CRK","CREEK"),("FR","FATHER"),("JR","JUNIOR"),("GR","GRAND"),
                     ("CT","COURT"),
                     ("SR",""), # Service Road.  These are always near a similarly named street.  Lump together.
                     ("QN","QUEENS"),
                     ("ND",""), # A space between a number and ND such as EAST 52 ND STREET.  Note ST and RD can be street or road.
                     ("PO","POND"),("BO","BOND"),("GRA","GRAND"),("REV","REVEREND"),("CO-OP","COOP"),
                     ("GRANDCONCOURSE", "GRAND CONCOURSE"),("CENTRL", "CENTRAL"),("BLVD","BOULEVARD"),
                     ("FREDRICK", "FREDERICK"),("DOUGLAS", "DOUGLASS"),("MALCOM", "MALCOLM"),
                     ("NORTHEN", "NORTHERN"),("AVNEUE","AVENUE"),
                    ("N","NORTH"),("S","SOUTH"),("E","EAST"),("W","WEST"),("SW","SOUTHWEST"),
                             ("NW","NORTHWEST"),("SE","SOUTHEAST"),("NE","NORTHEAST")]
def replace_words(raw):
    split_raw = raw.split()
    for (old, new) in word_replacements:
        found_at_index = next((i for i, x in enumerate(split_raw) if x==old), None)
        if found_at_index!=None:
            split_raw[found_at_index] = new
    return standardize_spaces(" ".join(split_raw))
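
As written, `replace_words` changes only the first token that matches each entry, so an address containing the same abbreviation twice keeps the second one. If replacing every match were preferred, a dictionary lookup per token is one way to do it. The sketch below is a hypothetical variant with an abbreviated mapping, not the function used in this notebook, and it does not handle the entries that map to an empty string.

```python
# Abbreviated mapping for illustration; the full list is word_replacements above.
sample_replacements = {"AVE": "AVENUE", "ST": "STREET", "E": "EAST", "W": "WEST"}

def replace_all_words(raw, replacements):
    """Hypothetical variant: swap every whole-word match, not only the first."""
    return " ".join(replacements.get(word, word) for word in raw.split())

first = replace_all_words("1234 E 180 ST", sample_replacements)
second = replace_all_words("50 W BURNSIDE AVE", sample_replacements)
print(first)   # 1234 EAST 180 STREET
print(second)  # 50 WEST BURNSIDE AVENUE
```

Because each token is looked up exactly once, this variant also cannot re-replace a word that an earlier rule just produced, which is a behavioral difference from the sequential loop above.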

In [None]:
#Some words are actually prefixes of the following word.  For example, the LA prefix of LA GRANGE.
word_prefixes = ["DE","MC","LA","VAN","MAC","CO"]
def concatenate_prefixes(raw):
    split_raw = raw.split()
    last_word = len(split_raw) - 1
    for (prefix) in word_prefixes:
        found_at_index = next((i for i, x in enumerate(split_raw) if x==prefix), None)
        if found_at_index!=None:
            if len(split_raw)>1:
                if found_at_index != last_word:
                    split_raw[found_at_index] = ''
                    split_raw[found_at_index+1] = prefix + split_raw[found_at_index+1]
                    return standardize_spaces(" ".join(split_raw))
    return raw

In [None]:
#Some phrases need custom replacement because they involve multiple words or are easily misinterpreted out of context.
phrase_replacements = [("DR M L KING JR","MARTIN LUTHER KING"),("DR MARTIN L KING","MARTIN LUTHER KING"),
    ("MARTIN LUTHER KING","MARTIN LUTHER KING"),("MARTIN L KING JR","MARTIN LUTHER KING"),
    ("MARTIN L KING","MARTIN LUTHER KING"),("ST NICHOLAS","SAINT NICHOLAS"),("ST JOHN","SAINT JOHN"),
    ("ST MARK","SAINT MARK"),("ST ANN","SAINT ANN"),("ST LAWRENCE","SAINT LAWRENCE"),("ST PAUL","SAINT PAUL"),
    ("ST PETER","SAINT PETER"),("ST RAYMOND","SAINT RAYMOND"),("ST THERESA","SAINT THERESA"),("ST FELIX","SAINT FELIX"),
    ("ST MARY","SAINT MARY"),("ST OUEN","SAINT OUEN"),("ST JAMES","SAINT JAMES"),("ST GEORGE","SAINT GEORGE"),
    ("ST EDWARD","SAINT EDWARD"),("ST CHARLES","SAINT CHARLES"),("ST FRANCIS","SAINT FRANCIS"),
    ("ST ANDREW","SAINT ANDREW"),("ST JUDE","SAINT JUDE"),("ST LUKE","SAINT LUKE"),("ST JOSEPH","SAINT JOSEPH"),
    ("N D PERLMAN","NATHAN PERLMAN"),("O BRIEN","OBRIEN"),("F D R","FDR"),("EXPRESSWAY N SR","EXPRESSWAY SR N"),
    ("HOR HARDING","HORACE HARDING"),
    ("SERVICE ROAD",""), # These are always near a similarly named street. Lump together.
    ("DUMMY",""),("ADAM C POWELL","ADAM CLAYTON POWELL"),("POWELL COVE","POWELLS COVE")]
def replace_phrases(raw):
    result = raw
    for (old,new) in phrase_replacements:
        result = standardize_spaces(result.replace(old,new))
    return result

In [None]:
# 1ST, 2ND, 3RD, 4TH, ... nTH
# Remove the suffixes leaving the numbers by themselves.
number_suffixes = ["ST","ND","RD","TH"]
digits=["1","2","3","4","5","6","7","8","9","0"]
def remove_number_suffixes(raw):
    split_raw = raw.split()
    for suffix in number_suffixes:
        found_at_index = next((i for i, x in enumerate(split_raw) if x[0] in digits and x.endswith(suffix)), None) 
        if found_at_index!=None:            
            split_raw[found_at_index] = split_raw[found_at_index][:-2]
            return standardize_spaces(" ".join(split_raw))
    return raw
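
The suffix removal above stops after the first token it changes. For comparison, a regular expression can strip ordinal suffixes from every all-digit token in one pass. This is a sketch of an alternative, with the hypothetical name `strip_ordinal_suffixes`. Unlike the function above, it only matches tokens made up entirely of digits plus the suffix.

```python
import re

# Digits followed by an ordinal suffix, matched as a whole token.
ORDINAL = re.compile(r"\b(\d+)(?:ST|ND|RD|TH)\b")

def strip_ordinal_suffixes(raw):
    """Hypothetical regex variant of remove_number_suffixes."""
    return ORDINAL.sub(r"\1", raw)

a = strip_ordinal_suffixes("2010 EAST 233RD STREET")
b = strip_ordinal_suffixes("1ST AVENUE")
print(a)  # 2010 EAST 233 STREET
print(b)  # 1 AVENUE
```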

In [None]:
def standardize_street(street):
    r = street 
    r = standardize_spaces(r) 
    r = replace_problem_characters(r) 
    r = replace_phrases(r) 
    r = replace_words(r) 
    r = concatenate_prefixes(r) 
    r = remove_number_suffixes(r) 
    return r

### Standardize the address in both the 311 and PLUTO data

In [None]:
b311['incident_address'] = b311['incident_address'].apply(standardize_street)

In [None]:
df['Address'] = df['Address'].apply(standardize_street)

In [None]:
# See if there was an improvement in how well the 311 data can be merged with the PLUTO data by address.
WhichAddressesNotInPluto(3)

<p style="color:Red;">The percentage of addresses in the 311 data that can be matched to an entry in the PLUTO data set improved measurably.</p>

In [None]:
#Save the file to more quickly repeat additional work below.
df.to_pickle(files_path + 'q3clean.pkl')

## Add latent variable to indicate if the address had any HEAT complaints.
This makes it more convenient to see whether having a complaint correlates with anything else in the data.

In [None]:
df = pd.read_pickle(files_path + 'q3clean.pkl')

In [None]:
#Latent variable that indicates that there has been at least one complaint at the address.
df['Complaints'] = df['Address'].isin(b311['incident_address'].unique()).astype('int64')
grouped_addresses = b311.groupby('incident_address')
grouped_addresses.groups

#How many complaints at the address?
#counts = b311['incident_address'].value_counts(sort=False).sort_index()
#df['Complaints'] = df['Address'].map(counts)
#df['Complaints'].replace(to_replace=np.nan, value=0, inplace=True)
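
The cell above keeps a 0/1 indicator and leaves the per-address count commented out. The difference between the two can be seen on a toy example. The addresses and the column names `HasComplaint` and `ComplaintCount` are invented for illustration.

```python
import pandas as pd

# Invented addresses for illustration only.
complaints = pd.Series(["10 MAIN STREET", "10 MAIN STREET", "22 OAK AVENUE"])
lots = pd.DataFrame({"Address": ["10 MAIN STREET", "22 OAK AVENUE", "5 ELM ROAD"]})

# Indicator: 1 if the lot's address appears in any complaint, else 0.
lots["HasComplaint"] = lots["Address"].isin(complaints.unique()).astype("int64")

# Count: number of complaints per address, 0 where none were filed.
counts = complaints.value_counts()
lots["ComplaintCount"] = lots["Address"].map(counts).fillna(0).astype("int64")

print(lots)
```

The indicator suits a yes/no classifier such as logistic regression, while the count would preserve how heavily each address complains.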

In [None]:
#del b311 # Maybe save a little memory later

In [None]:
print (df.shape)
df['Complaints'].value_counts()

In [None]:
#Convert each unique address into a unique number.
#A category with a lot of different values, not a continuous value.
address_lookup = {k:v + 1 for v, k in enumerate(df['Address'].unique().tolist())}
df["Address"] = df["Address"].map(address_lookup)
len(df)

In [None]:
#Save the file to more quickly repeat additional work below.
df.to_pickle(files_path + 'q3clean.pkl')

<p style="color:Red;">Saving the data allows restarting from this point without repeating the address standardization and other slow steps as work continues.</p>

## Imputation
Handling "missing" features: those that are null or hold a zero placeholder instead of a "real" value.

In [None]:
df = pd.read_pickle(files_path + 'q3clean.pkl')

In [None]:
def check_nulls():
    for col in df.columns:
        nulls = len(df[df[col].isnull()])
        if nulls > 0:
            print("%s:  %i of %i (%s)" % (col, nulls, df.shape[0], "{:.2%}".format(nulls / df.shape[0])))
check_nulls()

In [None]:
#Replace with zeroes to kick these forward to the next section.
df['ZipCode'].replace(to_replace=np.nan, value=0, inplace=True)
df['XCoord'].replace(to_replace=np.nan, value=0, inplace=True)
df['YCoord'].replace(to_replace=np.nan, value=0, inplace=True)
check_nulls()

In [None]:
def check_zeroes():
    for col in df.columns:
        zeroes = len(df[df[col].eq(0)])
        if zeroes > 0:
            print("%s:  %i of %i (%s)" % (col, zeroes, df.shape[0], "{:.2%}".format(zeroes / df.shape[0])))
check_zeroes()

In [None]:
# Drop rows that have only a small number of zero values in a column.
df.drop(df[df["ZipCode"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["LotArea"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["LotDepth"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["BldgDepth"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["BuiltFAR"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["FacilFAR"].eq(0)].index, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
check_zeroes()

In [None]:
# Drop further rows that have zero placeholders in any of these columns.
df.drop(df[df["NumBldgs"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["NumFloors"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["YearBuilt"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["XCoord"].eq(0)].index, axis=0, inplace=True)
df.drop(df[df["YCoord"].eq(0)].index, axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)
check_zeroes()

<p style="color:Red;">Several features have a significant number of zeroes, and for some the vast majority of values are zero.  This means the mere presence or absence of a value could itself be worth correlating.  I am going to create a latent variable for each of these features that records whether a value is present.  These latent variables can be used as filters when running regression checks on the underlying feature, and they can themselves be the subject of regression checks.</p>
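
A small sketch of the idea, using invented values: the flag records whether a real value is present, and can then either filter rows or be analyzed on its own.

```python
import pandas as pd

# Invented values: most lots have no office area.
toy = pd.DataFrame({"OfficeArea": [0, 0, 1200, 0, 450],
                    "Complaints": [0, 1, 0, 0, 1]})

# Presence flag: 1 where a real value exists, 0 where the zero placeholder is used.
toy["IsOfficeArea"] = toy["OfficeArea"].ne(0).astype("int64")

# The flag can filter rows before analyzing the underlying feature...
with_office = toy[toy["IsOfficeArea"].eq(1)]
# ...or be analyzed as a feature in its own right.
share_flagged = toy["IsOfficeArea"].mean()

print(with_office)
print(share_flagged)
```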

In [None]:
df.insert(len(df.columns) - 1, 'IsResArea', df['ResArea'].ne(0).astype('int64'))
df.insert(len(df.columns) - 1, 'IsOfficeArea', df['OfficeArea'].ne(0).astype('int64'))
df.insert(len(df.columns) - 1, 'IsRetailArea', df['RetailArea'].ne(0).astype('int64'))
df.insert(len(df.columns) - 1, 'IsYearAlter1', df['YearAlter1'].ne(0).astype('int64'))
df.insert(len(df.columns) - 1, 'IsResidFAR', df['ResidFAR'].ne(0).astype('int64'))
df.insert(len(df.columns) - 1, 'IsCommFAR', df['CommFAR'].ne(0).astype('int64'))

In [None]:
df.head()

In [None]:
print(df['IsResArea'].value_counts())
print(df['IsOfficeArea'].value_counts())
print(df['IsRetailArea'].value_counts())
print(df['IsYearAlter1'].value_counts())
print(df['IsResidFAR'].value_counts())
print(df['IsCommFAR'].value_counts())

## Scaling, Centering

In [None]:
def scale_and_center(column_name):
    scale=MinMaxScaler()
    scale.fit(df[[column_name]])
    df[[column_name]]=scale.transform(df[[column_name]])
scale_and_center('LotArea')
scale_and_center('BldgArea')
scale_and_center('NumBldgs')
scale_and_center('NumFloors')
scale_and_center('LotDepth')
scale_and_center('BldgDepth')
scale_and_center('YearBuilt')
scale_and_center('BuiltFAR')
scale_and_center('FacilFAR')
scale_and_center('XCoord')
scale_and_center('YCoord')

In [None]:
def scale_and_center_latent(column_name, latent):
    scale=MinMaxScaler()
    #scale.fit(df.loc[df[latent].eq(1), column_name].to_frame())
    df.loc[df[latent].eq(1), column_name] = scale.fit_transform(df.loc[df[latent].eq(1), column_name].to_frame())
scale_and_center_latent('ResArea', 'IsResArea')
scale_and_center_latent('OfficeArea', 'IsOfficeArea')
scale_and_center_latent('RetailArea', 'IsRetailArea')
scale_and_center_latent('YearAlter1', 'IsYearAlter1')
scale_and_center_latent('ResidFAR', 'IsResidFAR')
scale_and_center_latent('CommFAR', 'IsCommFAR')
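
The conditional scaling above can be reproduced by hand to see what it does. The sketch below applies the min-max formula only to rows where a value is present, using invented numbers and plain pandas rather than `MinMaxScaler`.

```python
import pandas as pd

# Invented values; 0 is the "no alteration" placeholder.
toy = pd.DataFrame({"YearAlter1": [0.0, 1980.0, 2000.0, 0.0, 1990.0]})
present = toy["YearAlter1"].ne(0)

# Min-max scaling by hand: x' = (x - min) / (max - min), using present rows only.
values = toy.loc[present, "YearAlter1"]
toy.loc[present, "YearAlter1"] = (values - values.min()) / (values.max() - values.min())

scaled = toy["YearAlter1"].tolist()
print(scaled)  # [0.0, 0.0, 1.0, 0.0, 0.5]
```

Two things are worth noticing. Min-max scaling maps values into the range 0 to 1 but does not center them around zero. Also, the smallest present value becomes 0.0, the same as the placeholder, so the `Is...` flags remain the only reliable way to tell the two apart after scaling.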

In [None]:
df.describe()

## Save the cleaned data.

In [None]:
df.to_pickle(files_path + 'q3clean.pkl')

<a id="analysis"></a>
# Analyzing and Visualizing
---

## Load combined and cleaned data

In [None]:
df = pd.read_pickle(files_path + 'q3clean.pkl')
print(df.shape[0])
print(len(df[df['Complaints'].ne(0)]))

## Analyze, using visualizations as necessary

### Logistic Regression
Logistic regression was chosen instead of linear regression because the Y values are zero or one, which makes this a classification problem rather than the prediction of a continuous value.

In [None]:
def regress_logistically(d):
    print("-----")
    print(d.columns)
    y_data = d['Complaints']
    x_data = d.drop('Complaints',axis=1)
    x_train,x_test,y_train,y_test = train_test_split(x_data,y_data,test_size=0.20,random_state=1)
    print("number of test samples: ", x_test.shape[0])
    print("number of training samples: ",x_train.shape[0])
    m = linear_model.LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=250)
    m.fit(x_train, y_train)
    y_hat = m.predict(x_test)
    print("Jaccard score: ", jaccard_score(y_test, y_hat))
    print("Log loss: ", log_loss(y_test, m.predict_proba(x_test)))
    print(confusion_matrix(y_test, y_hat, labels=[1,0]))
    print(classification_report(y_test, y_hat))
    #Pair each coefficient with its feature name (x_data excludes the Complaints target).
    coefficients = list(zip(x_data.columns, m.coef_[0]))
    def sortco(a):
        return abs(a[1])
    coefficients.sort(key=sortco)
    print(coefficients)

In [None]:
#See how the conditional items fare.
regress_logistically(df[df['IsResArea'].eq(1)][['ResArea','Complaints']])
regress_logistically(df[df['IsOfficeArea'].eq(1)][['OfficeArea','Complaints']])
regress_logistically(df[df['IsRetailArea'].eq(1)][['RetailArea','Complaints']])
regress_logistically(df[df['IsYearAlter1'].eq(1)][['YearAlter1','Complaints']])
regress_logistically(df[df['IsResidFAR'].eq(1)][['ResidFAR','Complaints']])
regress_logistically(df[df['IsCommFAR'].eq(1)][['CommFAR','Complaints']])

In [None]:
# See how all the other items fare.
regress_logistically(df[['LotArea','BldgArea', 'NumBldgs','NumFloors','LotDepth','BldgDepth','YearBuilt',
                          'BuiltFAR','FacilFAR','XCoord','YCoord','Complaints']])

In [None]:
# Take out the location based items and dubious items.
regress_logistically(df[['LotArea','BldgArea', 'NumBldgs','NumFloors','LotDepth','BldgDepth','YearBuilt',
                          'Complaints']])

<p style="color:Red;">The relationships are too weak to produce a well-scoring logistic regression.  Because about 85% of the properties did not have a complaint, even a coin-toss model gets a high F1 score for the "no complaint" class, but scores very poorly for the "complaint" class.  The top row of the confusion matrix (the properties that actually had complaints) is always very poor.</p>
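
The effect of the imbalance can be checked with simple arithmetic. The sketch below uses invented counts that match the roughly 85% / 15% split and scores a baseline that always predicts "no complaint".

```python
# Invented counts matching the roughly 85% / 15% split described above.
n_negative, n_positive = 850, 150

# A baseline that always predicts "no complaint" (class 0).
true_negative, false_positive = n_negative, 0
false_negative, true_positive = n_positive, 0

accuracy = (true_positive + true_negative) / (n_negative + n_positive)
recall_positive = true_positive / (true_positive + false_negative)
# Jaccard for the positive class: TP / (TP + FP + FN).
jaccard_positive = true_positive / (true_positive + false_positive + false_negative)

print(accuracy, recall_positive, jaccard_positive)
```

Such a baseline reaches 85% accuracy while never identifying a single complaint, which is why overall accuracy says little here and the scores for the complaint class are the ones to watch.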

### Pearson Correlation

In [None]:
# Check for correlations with heatmap.
AllForHeat = df.corr()
#AllForHeat.shape
AllForHeat.head()

In [None]:
fig, ax = plt.subplots()
im = ax.pcolor(AllForHeat, cmap='RdBu_r')
#Place the ticks at the center of each cell first, then attach the labels to those ticks.
ax.set_xticks(np.arange(AllForHeat.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(AllForHeat.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(AllForHeat.columns, minor=False)
ax.set_yticklabels(AllForHeat.index, minor=False)
ax.set_title("heat map of correlations")
plt.xticks(rotation=90)
plt.colorbar(im)
plt.show()

<p style="color:Red;">Looking along the top row or rightmost column (the Complaints entries), there appear to be some correlations, though none are very strong.  Looking at the actual numbers might help see them better.</p>

In [None]:
#See numerically what the strongest correlations are.
AllForHeat[(AllForHeat['Complaints'].ge(0.05))|(AllForHeat['Complaints'].le(-0.05))]['Complaints'].sort_values(ascending=False)

In [None]:
# Get the pearson correlations and confidence measures
for col in df.columns:
    pearson_coef, p_value = stats.pearsonr(df[col], df['Complaints'])
    if abs(pearson_coef) > 0.20 and p_value < 0.10:
        print(col, pearson_coef, p_value)

<p style="color:Red;">There are no really "strong" correlations, but there are some weak ones with high confidence (low p-values).  Double check the details of a few of these.</p>
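
For reference, the Pearson coefficient reported by `stats.pearsonr` is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on invented data. When one variable is a 0/1 flag, as `Complaints` is here, this is the same quantity as the point-biserial correlation.

```python
import math

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented data: number of floors and a 0/1 complaint flag.
floors = [1, 2, 2, 5, 6, 6]
complaint = [0, 0, 0, 1, 1, 1]
r = pearson(floors, complaint)
print(round(r, 3))
```

The toy data is deliberately clean, so its coefficient is far higher than anything found in the real data.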

In [None]:
# Address correlates?  Why?
total = df.shape[0]
complainers = df[df['Complaints'].eq(1)].shape[0]
print (total, complainers, complainers / total)

<p style="color:Red;">I think address correlates only because each address either had a complaint or not, so the correlation is about the same as the percentage of addresses that had a complaint.  Also, Question 2 already narrowed the data down geographically, so I am dismissing this "correlation" as not relevant to the question at hand.  To a lesser extent the X coordinate and ZIP code are in the same boat, though their correlation may be higher (but still very low) than the one with address.</p>

In [None]:
#YearAlter1 correlates?  Why?
total = df.shape[0]
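#Count the properties with no recorded alteration year (IsYearAlter1 is 0 where YearAlter1 was zero).
never_altered = df[df['IsYearAlter1'].eq(0)].shape[0]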
print (total, never_altered, never_altered / total)

In [None]:
#Alteration year - take a closer look.
alterations = df[['YearAlter1','Complaints']]
alterations = pd.concat([alterations, pd.get_dummies(alterations['YearAlter1'])], axis=1)
for col in alterations.columns:
    pearson_coef, p_value = stats.pearsonr(alterations[col], alterations['Complaints'])
    if abs(pearson_coef) > 0.05 and p_value < 0.10:
        print(col, pearson_coef, p_value)

<p style="color:Red;">The only even mildly significant YearAlter1 correlations involve the observations where this field is zero, which is about 2/3 of the data.  My conclusion is that this is an artifact of the high number of zeroes: they skew the correlation toward the lowest values, so the zeroes themselves appear more correlated.</p>

In [None]:
#check a few of these to see if they are different data points that capture the same relationship.
def see_pearson(a,b):
    pearson_coef, p_value = stats.pearsonr(df[a], df[b])
    print(a, b, pearson_coef, p_value)
see_pearson('ResidFAR','FacilFAR')
see_pearson('ResidFAR','BuiltFAR')
see_pearson('FacilFAR','BuiltFAR')

<p style="color:Red;">ResidFAR and FacilFAR are strongly correlated with each other.  The documentation says FAR is the "Floor Area Ratio" between the building and the lot.  The residential and facility variants are about the same.  What they have in common is that both are "not commercial".</p>
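
As a worked example of the ratio itself, the sketch below computes a floor area ratio for an invented building. The function name and the numbers are assumptions for illustration only.

```python
def floor_area_ratio(building_floor_area, lot_area):
    """FAR: total building floor area divided by the area of the tax lot."""
    return building_floor_area / lot_area

# Invented example: a 4-story building covering half of a 5,000 sq ft lot.
floors, footprint, lot = 4, 2500, 5000
far = floor_area_ratio(floors * footprint, lot)
print(far)  # 2.0
```

This also shows why the ratio tends to rise with the number of floors: each extra story adds floor area without adding lot area.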

#### Visualize strongest Pearson correlations

In [None]:
#Visualize
#df.corr()['Complaints'].map(abs).plot(kind='bar', title='Correlation with heating complaints',figsize=(8,3))
df.corr()['Complaints'].plot(kind='bar', grid=True, title='Correlation with heating complaints',figsize=(8,3))
plt.xlabel('Building Characteristics'); plt.ylabel('Correlation')
plt.show()

<a id="conclusion"></a>
# Concluding Remarks
---

The HEAT/HOT WATER complaint type (Question 1) reported in the BRONX borough (Question 2) has its most apparent relationships with the following housing characteristics:
<ul>
<li>NumFloors (Number of Floors):  0.37 correlation.
<li>BuiltFAR (Total building floor area divided by the area of the tax lot):  0.35 correlation.
<li>ResidFAR (Maximum allowable residential floor area ratio):  0.32 correlation.
</ul>
The inability to fit an effective logistic regression is further evidence that the relationships are not especially strong, but the low p-values of the Pearson correlations suggest that these relationships are real rather than due to chance.

In short, the taller the building and the larger its floor area relative to its lot, the more likely it is to have heating complaints.  This applies especially to residential buildings.