# Building a Predictive Model

Using the Department of Housing Preservation and Development (HPD) dataset alongside the PLUTO dataset, we will build a predictive model. Its output will indicate the complaint type and whether or not the complaint is high priority.

Our earlier analysis showed that HEAT/HOT WATER is the complaint type behind the high complaint volume, and that a large share of these complaints originate in the Bronx.

Comparing the HPD dataset with PLUTO, we spotted some correlation between building characteristics and complaint type, specifically residfar, bldgfar, age, and numfloors.

Using this information we will now work on building a predictive model. 

## Load Libraries

In [1]:
import pandas as pd
import numpy as np

## Load Data

In [2]:
# The Department of Housing Preservation and Development
Orig_HPD = pd.read_csv("HBD_v1.csv")
# Pluto data
Orig_Pluto = pd.read_csv("Pluto_v1.csv")

## Data Wrangling

### Looking at the HPD Dataset

In [None]:
Orig_HPD.shape

Again we have 6,019,843 samples and 17 features. This DataFrame was saved from our previous exercise, so it already includes some of the features and adjustments that we need.

In [3]:
Orig_HPD.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unique_Key,Created_Date,Closed_Date,Complaint_Type,Location_Type,Incident_Zip,Incident Address,Street_Name,Address_Type,City,Status,Resolution Description,Borough,Latitude,Longitude
0,0,0,45531130,02/02/2020 06:09:17 AM,,HEAT/HOT WATER,RESIDENTIAL BUILDING,10019.0,426 WEST 52 STREET,WEST 52 STREET,ADDRESS,NEW YORK,Open,The following complaint conditions are still o...,MANHATTAN,40.765132,-73.988993
1,1,1,45529784,02/02/2020 02:15:24 PM,,UNSANITARY CONDITION,RESIDENTIAL BUILDING,11204.0,1751 67 STREET,67 STREET,ADDRESS,BROOKLYN,Open,The following complaint conditions are still o...,BROOKLYN,40.618484,-73.992673
2,2,2,45527528,02/02/2020 02:27:41 AM,,HEAT/HOT WATER,RESIDENTIAL BUILDING,11372.0,87-15 37 AVENUE,37 AVENUE,ADDRESS,Jackson Heights,Open,The following complaint conditions are still o...,QUEENS,40.750269,-73.879432
3,3,3,45530329,02/02/2020 12:13:18 PM,,HEAT/HOT WATER,RESIDENTIAL BUILDING,10458.0,2405 SOUTHERN BOULEVARD,SOUTHERN BOULEVARD,ADDRESS,BRONX,Open,The following complaint conditions are still o...,BRONX,40.853773,-73.881558
4,4,4,45528814,02/02/2020 01:59:44 PM,,APPLIANCE,RESIDENTIAL BUILDING,11209.0,223 78 STREET,78 STREET,ADDRESS,BROOKLYN,Open,The following complaint conditions are still o...,BROOKLYN,40.629745,-74.030533


In [3]:
Orig_HPD.drop(['Unnamed: 0', 'Unnamed: 0.1','Location_Type','Street_Name','Address_Type', 'City','Status', 'Unique_Key','Created_Date',
              'Closed_Date','Resolution Description'],axis=1,inplace=True)

In [4]:
Orig_HPD.rename(columns={"Incident_Zip": "zipcode"}, inplace=True)
Orig_HPD.rename(columns={"Incident Address": "address"}, inplace=True)
Orig_HPD['zipcode'] = Orig_HPD['zipcode'].astype("str")

In [5]:
Orig_HPD['count'] = Orig_HPD.groupby(['address'])['Complaint_Type'].transform('count')
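
As a small illustration of what `transform('count')` does here, the sketch below uses a made-up four-row frame (the addresses are invented for the example). Each row receives the number of complaints filed at its address, and a row whose address is missing receives `NaN`, which would be consistent with `count` and `address` showing the same number of nulls further down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real frame is Orig_HPD.
toy = pd.DataFrame({
    "address": ["1 A STREET", "1 A STREET", "2 B AVENUE", np.nan],
    "Complaint_Type": ["HEAT/HOT WATER", "PLUMBING", "APPLIANCE", "ELECTRIC"],
})

# transform('count') broadcasts the per-address count back to every row.
toy["count"] = toy.groupby("address")["Complaint_Type"].transform("count")
print(toy)
```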

In [6]:
Type = pd.get_dummies(Orig_HPD['Complaint_Type'])

In [7]:
Type.columns.unique()

Index(['AGENCY', 'APPLIANCE', 'CONSTRUCTION', 'DOOR/WINDOW', 'ELECTRIC',
       'ELEVATOR', 'FLOORING/STAIRS', 'GENERAL', 'GENERAL CONSTRUCTION',
       'HEAT/HOT WATER', 'HPD LITERATURE REQUEST', 'MOLD', 'NONCONST',
       'OUTSIDE BUILDING', 'PAINT/PLASTER', 'PLUMBING', 'SAFETY', 'STRUCTURAL',
       'UNSANITARY CONDITION', 'VACANT APARTMENT', 'WATER LEAK'],
      dtype='object')

In [8]:
# Drop the first nine dummy columns (everything before 'HEAT/HOT WATER')
Type.drop(Type.iloc[:, 0:9], axis=1, inplace=True)

In [9]:
# Drop every column after the first, leaving only 'HEAT/HOT WATER'
Type.drop(Type.iloc[:, 1:], axis=1, inplace=True)
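
The two positional drops above leave only the `HEAT/HOT WATER` indicator. A shorter route to the same 0/1 column, sketched here on invented rows, is to compare the complaint type against the label directly. This avoids depending on the column order that `get_dummies` happens to produce.

```python
import pandas as pd

# Hypothetical rows standing in for Orig_HPD['Complaint_Type'].
complaints = pd.DataFrame(
    {"Complaint_Type": ["HEAT/HOT WATER", "APPLIANCE", "HEAT/HOT WATER", "PLUMBING"]}
)

# 1 where the complaint is HEAT/HOT WATER, 0 otherwise.
complaints["HEAT/HOT WATER"] = (
    complaints["Complaint_Type"] == "HEAT/HOT WATER"
).astype(int)
print(complaints)
```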

**Merge back into DF**

In [10]:
HPD = Orig_HPD.merge(Type, left_index=True, right_index=True)
HPD.head()

Unnamed: 0,Complaint_Type,zipcode,address,Borough,Latitude,Longitude,count,HEAT/HOT WATER
0,HEAT/HOT WATER,10019.0,426 WEST 52 STREET,MANHATTAN,40.765132,-73.988993,2.0,1
1,UNSANITARY CONDITION,11204.0,1751 67 STREET,BROOKLYN,40.618484,-73.992673,267.0,0
2,HEAT/HOT WATER,11372.0,87-15 37 AVENUE,QUEENS,40.750269,-73.879432,442.0,1
3,HEAT/HOT WATER,10458.0,2405 SOUTHERN BOULEVARD,BRONX,40.853773,-73.881558,288.0,1
4,APPLIANCE,11209.0,223 78 STREET,BROOKLYN,40.629745,-74.030533,95.0,0


In [12]:
HPD.dtypes

Complaint_Type     object
zipcode            object
address            object
Borough            object
Latitude          float64
Longitude         float64
count             float64
HEAT/HOT WATER      uint8
dtype: object

In [15]:
HPD.isnull().sum()

Complaint_Type        0
zipcode           80697
address           52825
Borough               0
Latitude          80671
Longitude         80671
count             52825
HEAT/HOT WATER        0
dtype: int64

In [11]:
HPD.replace("nan", np.nan, inplace=True)

In [12]:
HPD['HEAT/HOT WATER'] = HPD['HEAT/HOT WATER'].astype("int")

In [13]:
HPD.dropna(subset=['address'],axis=0,inplace=True)

### Looking at the Pluto Dataset

In [14]:
#Change format of how float is displayed
pd.options.display.float_format = "{:.2f}".format

In [15]:
Orig_Pluto.head()

Unnamed: 0.1,Unnamed: 0,address,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,yearbuilt,yearalter1,builtfar,residfar,commfar,facilfar,officearea,retailarea,zipcode,Age,borough
0,0,JOE DIMAGGIO HIGHWAY,401.0,246896.0,0.0,,0.0,0.0,0.0,1949,0,0.0,0.0,0.0,0.0,,,,71,MN
1,7,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,1900,2001,3.97,4.0,0.0,4.0,0.0,7200.0,10003.0,120,MN
2,10,243 MORELAND STREET,30.0,2807.0,1770.0,1770.0,1.0,2.0,50.0,2019,0,0.63,0.5,0.0,1.0,0.0,0.0,10306.0,1,SI
3,12,454 BEACH 125 STREET,70.0,4000.0,1286.0,1286.0,1.0,1.67,36.0,1950,0,0.32,0.5,0.0,1.0,0.0,0.0,11694.0,70,QN
4,13,460 BEACH 125 STREET,72.0,5000.0,2352.0,1344.0,1.0,1.0,55.0,1970,0,0.47,0.5,0.0,1.0,0.0,0.0,11694.0,50,QN


In [19]:
Orig_Pluto.dtypes

Unnamed: 0      int64
address        object
lot           float64
lotarea       float64
bldgarea      float64
resarea       float64
numbldgs      float64
numfloors     float64
bldgdepth     float64
yearbuilt       int64
yearalter1      int64
builtfar      float64
residfar      float64
commfar       float64
facilfar      float64
officearea    float64
retailarea    float64
zipcode       float64
Age             int64
borough        object
dtype: object

In [16]:
Orig_Pluto['zipcode'] = Orig_Pluto['zipcode'].astype("str")

In [17]:
Pluto = Orig_Pluto[["address","lot","lotarea","bldgarea","resarea","numbldgs","numfloors",
                "bldgdepth","builtfar","residfar","zipcode","borough","Age","yearbuilt"]]
Pluto.head()

Unnamed: 0,address,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,builtfar,residfar,zipcode,borough,Age,yearbuilt
0,JOE DIMAGGIO HIGHWAY,401.0,246896.0,0.0,,0.0,0.0,0.0,0.0,0.0,,MN,71,1949
1,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,10003.0,MN,120,1900
2,243 MORELAND STREET,30.0,2807.0,1770.0,1770.0,1.0,2.0,50.0,0.63,0.5,10306.0,SI,1,2019
3,454 BEACH 125 STREET,70.0,4000.0,1286.0,1286.0,1.0,1.67,36.0,0.32,0.5,11694.0,QN,70,1950
4,460 BEACH 125 STREET,72.0,5000.0,2352.0,1344.0,1.0,1.0,55.0,0.47,0.5,11694.0,QN,50,1970


In [18]:
Pluto['yearbuilt'] =  Pluto['yearbuilt'].astype("int")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [19]:
Pluto.head()

Unnamed: 0,address,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,builtfar,residfar,zipcode,borough,Age,yearbuilt
0,JOE DIMAGGIO HIGHWAY,401.0,246896.0,0.0,,0.0,0.0,0.0,0.0,0.0,,MN,71,1949
1,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,10003.0,MN,120,1900
2,243 MORELAND STREET,30.0,2807.0,1770.0,1770.0,1.0,2.0,50.0,0.63,0.5,10306.0,SI,1,2019
3,454 BEACH 125 STREET,70.0,4000.0,1286.0,1286.0,1.0,1.67,36.0,0.32,0.5,11694.0,QN,70,1950
4,460 BEACH 125 STREET,72.0,5000.0,2352.0,1344.0,1.0,1.0,55.0,0.47,0.5,11694.0,QN,50,1970
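
The `SettingWithCopyWarning` raised by the `yearbuilt` cast arises because `Pluto` was created by selecting columns from `Orig_Pluto`. One common way to avoid it, sketched below on a hypothetical two-row frame, is to take an explicit `.copy()` when building the subset so that later assignments clearly target an independent DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for Orig_Pluto.
source = pd.DataFrame({
    "address": ["146 2 AVENUE", "243 MORELAND STREET"],
    "yearbuilt": [1900.0, 2019.0],
    "lot": [1.0, 30.0],
})

# .copy() makes the subset independent of the source frame.
subset = source[["address", "yearbuilt"]].copy()
subset["yearbuilt"] = subset["yearbuilt"].astype("int")
print(subset.dtypes)
```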


An issue came up where the string "nan" in the `zipcode` column was not being detected by isnull(). To correct this, I replaced it with np.nan, which revealed 549 null values in the zipcode column.

In [20]:
Pluto.replace("nan", np.nan, inplace=True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  method=method,
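
To illustrate the issue described above: once missing zip codes are stored as the text "nan", `isnull()` no longer treats them as missing. The sketch below builds such a column by hand (the values are illustrative) and compares the null count before and after the replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical zip codes stored as text, including a literal "nan" string.
zips = pd.Series(["10003.0", "nan", "11694.0"], name="zipcode")

before = int(zips.isnull().sum())      # the string "nan" is not a null
restored = zips.replace("nan", np.nan)
after = int(restored.isnull().sum())   # now counted as missing
print(before, after)
```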


In [21]:
# Treat zeros as missing values
Pluto.replace(0, np.nan, inplace=True)

In [22]:
Pluto.isnull().sum()

address          0
lot              0
lotarea          7
bldgarea      2522
resarea      58295
numbldgs      1625
numfloors     2309
bldgdepth     5965
builtfar      2843
residfar     24119
zipcode        549
borough          0
Age             12
yearbuilt        0
dtype: int64

### Merging datasets

In [23]:
MergedDF = pd.merge( left=Pluto, right=HPD, how='inner', on='address')
MergedDF.head()

Unnamed: 0,address,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,builtfar,residfar,...,borough,Age,yearbuilt,Complaint_Type,zipcode_y,Borough,Latitude,Longitude,count,HEAT/HOT WATER
0,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,...,MN,120.0,1900,HEAT/HOT WATER,10003.0,MANHATTAN,40.73,-73.99,21.0,1
1,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,...,MN,120.0,1900,HEAT/HOT WATER,10003.0,MANHATTAN,40.73,-73.99,21.0,1
2,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,...,MN,120.0,1900,HEAT/HOT WATER,10003.0,MANHATTAN,40.73,-73.99,21.0,1
3,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,...,MN,120.0,1900,DOOR/WINDOW,10003.0,MANHATTAN,40.73,-73.99,21.0,0
4,146 2 AVENUE,1.0,9412.0,37353.0,30153.0,5.0,5.0,86.0,3.97,4.0,...,MN,120.0,1900,ELECTRIC,10003.0,MANHATTAN,40.73,-73.99,21.0,0
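
Because the merge key is `address` alone, every building row is paired with every complaint row that shares the address text. If the same address string exists in more than one zip code, rows from different buildings could be combined. Whether that happens in this data is not shown here, so the sketch below uses invented rows to compare a merge on `address` with a merge on `address` plus `zipcode`.

```python
import pandas as pd

# Hypothetical data: "1 MAIN STREET" exists in two different zip codes.
buildings = pd.DataFrame({
    "address": ["146 2 AVENUE", "1 MAIN STREET", "1 MAIN STREET"],
    "zipcode": ["10003.0", "10301.0", "11354.0"],
    "numfloors": [5.0, 2.0, 6.0],
})
complaints = pd.DataFrame({
    "address": ["146 2 AVENUE", "146 2 AVENUE", "1 MAIN STREET"],
    "zipcode": ["10003.0", "10003.0", "11354.0"],
    "Complaint_Type": ["HEAT/HOT WATER", "ELECTRIC", "PLUMBING"],
})

# Address only: the single "1 MAIN STREET" complaint matches both buildings.
by_address = pd.merge(buildings, complaints, how="inner", on="address")
# Address and zip code: the complaint matches only the building in its zip code.
by_address_zip = pd.merge(buildings, complaints, how="inner", on=["address", "zipcode"])
print(len(by_address), len(by_address_zip))
```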


In [24]:
MergedDF.set_index('Complaint_Type')

Unnamed: 0_level_0,address,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,builtfar,residfar,zipcode_x,borough,Age,yearbuilt,zipcode_y,Borough,Latitude,Longitude,count,HEAT/HOT WATER
Complaint_Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
HEAT/HOT WATER,146 2 AVENUE,1.00,9412.00,37353.00,30153.00,5.00,5.00,86.00,3.97,4.00,10003.0,MN,120.00,1900,10003.0,MANHATTAN,40.73,-73.99,21.00,1
HEAT/HOT WATER,146 2 AVENUE,1.00,9412.00,37353.00,30153.00,5.00,5.00,86.00,3.97,4.00,10003.0,MN,120.00,1900,10003.0,MANHATTAN,40.73,-73.99,21.00,1
HEAT/HOT WATER,146 2 AVENUE,1.00,9412.00,37353.00,30153.00,5.00,5.00,86.00,3.97,4.00,10003.0,MN,120.00,1900,10003.0,MANHATTAN,40.73,-73.99,21.00,1
DOOR/WINDOW,146 2 AVENUE,1.00,9412.00,37353.00,30153.00,5.00,5.00,86.00,3.97,4.00,10003.0,MN,120.00,1900,10003.0,MANHATTAN,40.73,-73.99,21.00,0
ELECTRIC,146 2 AVENUE,1.00,9412.00,37353.00,30153.00,5.00,5.00,86.00,3.97,4.00,10003.0,MN,120.00,1900,10003.0,MANHATTAN,40.73,-73.99,21.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
HEAT/HOT WATER,1088 LIBERTY AVENUE,18.00,2500.00,3750.00,750.00,1.00,2.00,75.00,1.50,1.25,11208.0,BK,85.00,1935,11208.0,BROOKLYN,40.68,-73.87,43.00,1
HEAT/HOT WATER,1088 LIBERTY AVENUE,18.00,2500.00,3750.00,750.00,1.00,2.00,75.00,1.50,1.25,11208.0,BK,85.00,1935,11208.0,BROOKLYN,40.68,-73.87,43.00,1
HEAT/HOT WATER,1088 LIBERTY AVENUE,18.00,2500.00,3750.00,750.00,1.00,2.00,75.00,1.50,1.25,11208.0,BK,85.00,1935,11208.0,BROOKLYN,40.68,-73.87,43.00,1
HEAT/HOT WATER,1088 LIBERTY AVENUE,18.00,2500.00,3750.00,750.00,1.00,2.00,75.00,1.50,1.25,11208.0,BK,85.00,1935,11208.0,BROOKLYN,40.68,-73.87,43.00,1


In [25]:
MergedDF.sort_values("Complaint_Type", inplace=True)

In [26]:
MergedDF = MergedDF.drop_duplicates()

In [27]:
MergedDF.head(5)

Unnamed: 0,address,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,builtfar,residfar,...,borough,Age,yearbuilt,Complaint_Type,zipcode_y,Borough,Latitude,Longitude,count,HEAT/HOT WATER
3849534,1890 ANDREWS AVENUE SOUTH,28.0,14299.0,59433.0,57183.0,2.0,6.0,116.0,4.16,3.44,...,BX,96.0,1924,AGENCY,10453.0,BRONX,40.85,-73.91,210.0,0
4098727,109-29 SUTPHIN BOULEVARD,7.0,11514.0,55015.0,55015.0,1.0,6.0,95.0,4.78,2.0,...,QN,16.0,2004,AGENCY,11435.0,QUEENS,40.69,-73.8,93.0,0
3105063,2297 SEDGWICK AVENUE,57.0,3676.0,14000.0,14000.0,1.0,4.0,48.0,3.81,3.44,...,BX,110.0,1910,AGENCY,10468.0,BRONX,40.86,-73.91,114.0,0
1682983,1725 61 STREET,68.0,2500.0,3337.0,3337.0,1.0,2.0,81.0,1.33,1.25,...,BK,89.0,1931,AGENCY,11204.0,BROOKLYN,40.62,-73.99,55.0,0
1947894,1038 LOWELL STREET,39.0,3900.0,10950.0,10950.0,1.0,5.0,68.0,2.81,3.44,...,BX,109.0,1911,AGENCY,10459.0,BRONX,40.83,-73.89,131.0,0


Dropping duplicates removes identical rows, so repeated complaints of the same type at the same address collapse into a single row, while different complaint types at that address are kept.

In [31]:
MergedDF.isnull().sum()

address           0
lot               0
lotarea           0
bldgarea          0
resarea           0
numbldgs          0
numfloors         0
bldgdepth         0
builtfar          0
residfar          0
zipcode_x         0
borough           0
Age               0
yearbuilt         0
Complaint_Type    0
zipcode_y         0
Borough           0
Latitude          0
Longitude         0
count             0
HEAT/HOT WATER    0
dtype: int64

In [29]:
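# numeric_only=True restricts the median to numeric columns (needed in recent pandas versions)
MergedDF.fillna(MergedDF.median(numeric_only=True), inplace=True)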

In [30]:
MergedDF.to_csv("MergedDF.csv")

In [151]:
MergedDF.shape

(688663, 21)

# Data Selection

In [2]:
MergedDF = pd.read_csv("MergedDF.csv")

In [32]:
MergedDF.rename(columns={"HEAT/HOT WATER": "HEAT_HOT_WATER"}, inplace=True)

In [33]:
Data = MergedDF.drop(['address','zipcode_x','zipcode_y', 'Latitude','Longitude','borough', 'Complaint_Type','Borough','HEAT_HOT_WATER','count'],axis=1)

In [34]:
Data.head()

Unnamed: 0,lot,lotarea,bldgarea,resarea,numbldgs,numfloors,bldgdepth,builtfar,residfar,Age,yearbuilt
3849534,28.0,14299.0,59433.0,57183.0,2.0,6.0,116.0,4.16,3.44,96.0,1924
4098727,7.0,11514.0,55015.0,55015.0,1.0,6.0,95.0,4.78,2.0,16.0,2004
3105063,57.0,3676.0,14000.0,14000.0,1.0,4.0,48.0,3.81,3.44,110.0,1910
1682983,68.0,2500.0,3337.0,3337.0,1.0,2.0,81.0,1.33,1.25,89.0,1931
1947894,39.0,3900.0,10950.0,10950.0,1.0,5.0,68.0,2.81,3.44,109.0,1911


In [35]:
Data['Age'].describe()

count   688663.00
mean        89.14
std         28.15
min          1.00
25%         87.00
50%         94.00
75%        108.00
max        255.00
Name: Age, dtype: float64

In [38]:
# Define X and y for our dataset
X = np.asarray(Data[['lot','resarea','numbldgs','numfloors','bldgdepth','residfar']])
X[0:5]

array([[2.8000e+01, 5.7183e+04, 2.0000e+00, 6.0000e+00, 1.1600e+02,
        3.4400e+00],
       [7.0000e+00, 5.5015e+04, 1.0000e+00, 6.0000e+00, 9.5000e+01,
        2.0000e+00],
       [5.7000e+01, 1.4000e+04, 1.0000e+00, 4.0000e+00, 4.8000e+01,
        3.4400e+00],
       [6.8000e+01, 3.3370e+03, 1.0000e+00, 2.0000e+00, 8.1000e+01,
        1.2500e+00],
       [3.9000e+01, 1.0950e+04, 1.0000e+00, 5.0000e+00, 6.8000e+01,
        3.4400e+00]])

In [39]:
y = MergedDF['HEAT_HOT_WATER']
y[0:5]

3849534    0
4098727    0
3105063    0
1682983    0
1947894    0
Name: HEAT_HOT_WATER, dtype: int32

In [43]:
import statsmodels.api as sm
# Note: sm.Logit does not add an intercept; X here holds only the six features
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.527946
         Iterations 6
                          Results: Logit
Model:              Logit            Pseudo R-squared: -0.014     
Dependent Variable: HEAT_HOT_WATER   AIC:              727165.6874
Date:               2020-06-09 21:57 BIC:              727234.3425
No. Observations:   688663           Log-Likelihood:   -3.6358e+05
Df Model:           5                LL-Null:          -3.5854e+05
Df Residuals:       688657           LLR p-value:      1.0000     
Converged:          1.0000           Scale:            1.0000     
No. Iterations:     6.0000                                        
---------------------------------------------------------------------
       Coef.     Std.Err.       z        P>|z|      [0.025     0.975]
---------------------------------------------------------------------
x1     0.0000      0.0000      9.9740    0.0000     0.0000     0.0000
x2     0.0000      0.0000     49.4931    

#### Normalizing the dataset

In [41]:
from sklearn import preprocessing
from sklearn.preprocessing import RobustScaler
from scipy import stats

In [8]:
X = preprocessing.RobustScaler().fit(X).transform(X)
X[0:5]

array([[-0.85365854,  7.63791416,  4.        ,  1.        ,  1.10344828,
         0.88888889],
       [-0.73170732,  7.92981966,  0.        ,  1.5       ,  2.44827586,
         0.88888889],
       [-0.09756098,  1.44806375,  0.        ,  1.        ,  1.24137931,
         0.64      ],
       [-0.53658537, -0.14818957,  0.        , -0.5       , -0.20689655,
        -0.55555556],
       [-0.68292683,  1.8487348 ,  0.        ,  1.5       ,  1.10344828,
         0.64      ]])

In [42]:
stats.boxcox(X[0,])

(array([2.31306633, 3.94058236, 0.63979489, 1.46323195, 2.86689618,
        1.07274565]),
 -0.234274051987872)

#### Train_Test_Split

In [45]:
from sklearn.model_selection import train_test_split

X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=4)
print('Train set :', X_train.shape, y_train.shape)
print('Test set: ', X_test.shape, y_test.shape)

Train set : (482064, 6) (482064,)
Test set:  (206599, 6) (206599,)
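
As the cells are ordered here, the `RobustScaler` is fit on all of `X` before the split, so the test rows influence the scaling statistics. A minimal sketch of the alternative, on synthetic data with invented names (`X_demo`, `y_demo`), is to split first, fit the scaler on the training rows only, and pass `stratify` so both splits keep a similar class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for X and y (six features, roughly 22% positives).
rng = np.random.default_rng(4)
X_demo = rng.normal(size=(1000, 6))
y_demo = (rng.random(1000) < 0.22).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=4, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = RobustScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)
print(X_tr_s.shape, X_te_s.shape)
```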


In [10]:
from sklearn import datasets
from sklearn import svm

In [None]:
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

# MODELING

## **LOGISTIC REGRESSION**

In [46]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

LR = LogisticRegression(penalty ='l1',C=0.01, solver='liblinear', max_iter=100, fit_intercept=True, n_jobs=1,
                       warm_start=True).fit(X_train, y_train)

In [47]:
#Dummy Classifier
from sklearn.dummy import DummyClassifier
clf = DummyClassifier(strategy= 'most_frequent').fit(X_train,y_train)
y_pred = clf.predict(X_test)

#Distribution of y test
print('y actual : \n' +  str(y_test.value_counts()))

#Distribution of y predicted
print('y predicted : \n' + str(pd.Series(y_pred).value_counts()))

y actual : 
0    161978
1     44621
Name: HEAT_HOT_WATER, dtype: int64
y predicted : 
0    206599
dtype: int64


In [48]:

# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Dummy Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.7840212198510157
Precision Score : 0.0
Recall Score : 0.0


  _warn_prf(average, modifier, msg_start, len(result))


F1 Score : 0.0
Confusion Matrix : 
[[161978      0]
 [ 44621      0]]


In [49]:
clf = LogisticRegression().fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Model Evaluation metrics 
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Logistic Regression Classifier Confusion matrix
from sklearn.metrics import confusion_matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.7840115392620487
Precision Score : 0.375
Recall Score : 6.723291723627888e-05
F1 Score : 0.0001344417307132134
Confusion Matrix : 
[[161973      5]
 [ 44618      3]]
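
The scores above can be reproduced by hand from the confusion matrix, which makes it easier to see why the accuracy looks reasonable while recall is almost zero. The sketch below uses only the four counts printed above. The baseline is the accuracy obtained by always predicting class 0, as the dummy classifier does.

```python
import numpy as np

# Confusion matrix reported for the logistic regression classifier.
cm = np.array([[161973, 5],
               [44618, 3]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# Accuracy of always predicting the majority class (0).
baseline = (tn + fp) / cm.sum()
print(accuracy, precision, recall, baseline)
```

Only 8 of the 206,599 test samples are predicted as HEAT/HOT WATER, so the model behaves almost exactly like the majority-class baseline.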


In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
import math
import pandas
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.utils import shuffle

In [None]:
def svr_model(X, y):
    gsc = GridSearchCV(
        estimator=SVR(kernel='rbf'),
        param_grid={
            'C': [0.1, 1, 100, 1000],
            'epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
            'gamma': [0.0001, 0.001, 0.005, 0.1, 1, 3, 5]
        },
        cv=5, scoring='neg_mean_squared_error', verbose=0, n_jobs=-1)

    grid_result = gsc.fit(X, y)
    best_params = grid_result.best_params_
    best_svr = SVR(kernel='rbf', C=best_params["C"], epsilon=best_params["epsilon"], gamma=best_params["gamma"],
                   coef0=0.1, shrinking=True,
                   tol=0.001, cache_size=200, verbose=False, max_iter=-1)

    scoring = {
               'abs_error': 'neg_mean_absolute_error',
               'squared_error': 'neg_mean_squared_error'}

    scores = cross_validate(best_svr, X, y, cv=10, scoring=scoring, return_train_score=True)
    return "MAE :", abs(scores['test_abs_error'].mean()), "| RMSE :", math.sqrt(abs(scores['test_squared_error'].mean()))

In [None]:
svr_model(X,y)

#### Predict using testset:

In [None]:
yhat = LR.predict(X_test)
yhat

#### predict_proba returns estimates for all classes

In [None]:
yhat_prob = LR.predict_proba(X_test)
yhat_prob

In [None]:
y_pred = LR.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(LR.score(X_test, y_test)))
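
Since the target is imbalanced, accuracy alone says little about how well the positive class is detected. One option scikit-learn offers is `class_weight="balanced"`, which reweights samples inversely to class frequency. The sketch below is an illustration on synthetic one-feature data with roughly an 80/20 split, not a result for the HPD data. It compares recall with and without the reweighting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(4)
# Synthetic problem: class 1 is shifted by one standard deviation.
n_neg, n_pos = 4000, 1000
X_demo = np.concatenate(
    [rng.normal(0.0, 1.0, n_neg), rng.normal(1.0, 1.0, n_pos)]
).reshape(-1, 1)
y_demo = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

plain = LogisticRegression().fit(X_demo, y_demo)
balanced = LogisticRegression(class_weight="balanced").fit(X_demo, y_demo)

recall_plain = recall_score(y_demo, plain.predict(X_demo))
recall_balanced = recall_score(y_demo, balanced.predict(X_demo))
print(recall_plain, recall_balanced)
```

Higher recall usually comes at the cost of precision, so both should be checked together.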

## K - NEAREST NEIGHBOR

In [None]:
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

#### training

In [None]:
K=100

neigh = KNeighborsClassifier(n_neighbors=K).fit(X_train,y_train)
neigh

#### Predicting

In [None]:
yhat=neigh.predict(X_test)
yhat[0:5]

#### Accuracy Evaluation

In [None]:
from sklearn import metrics

print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, neigh.predict(X_test)))

**How to choose right K**

In [None]:
ks=25
mean_acc = np.zeros((ks-1))
std_acc = np.zeros((ks-1))
ConfusionMatrix=[]

for n in range(1,ks):
    #Train Model and Predict
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat=neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    
    std_acc[n-1]=np.std(yhat == y_test)/np.sqrt(yhat.shape[0])
    
mean_acc

In [None]:
plt.plot(range(1,ks),mean_acc,'g')
plt.fill_between(range(1,ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
plt.legend(('Accuracy ', '+/- 1xstd'))
plt.ylabel('Accuracy ')
plt.xlabel('Number of Neighbors (K)')
plt.tight_layout()
plt.show()

In [None]:
print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1) 
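
The loop above picks K by scoring each candidate on the test set, which means the test set is no longer an unbiased check of the chosen model. A common alternative, sketched here on synthetic data with an illustrative candidate list, is to select K by cross-validation on the training data with `GridSearchCV` and keep the test set for the final evaluation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature problem standing in for X_train, y_train.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# 5-fold cross-validation over a small, illustrative set of K values.
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 5, 15, 25]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)
best_k = search.best_params_["n_neighbors"]
print(best_k, search.best_score_)
```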

### In Conclusion
I do not think that the KNN algorithm makes the most sense in this case.

# SVM

In [None]:
Data.head()

In [None]:
Data.dtypes

In [11]:
from sklearn import svm, datasets
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [13]:
parameters = {'kernel':('linear','rbf'), 'gamma':[1,0.1,0.01,0.001],
              'C':[0.1,1,10,100]}
grid = GridSearchCV(SVC(),parameters, refit=True,verbose=10, n_jobs=2)

In [None]:
from tqdm import tqdm

tqdm.pandas(desc="My progressbar")

In [None]:
grid.fit(X_train, y_train)

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   1 tasks      | elapsed: 22.3min
[Parallel(n_jobs=2)]: Done   4 tasks      | elapsed: 51.0min
[Parallel(n_jobs=2)]: Done   9 tasks      | elapsed: 71.2min
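
The log above shows individual fits taking tens of minutes. `SVC` training time grows at least quadratically with the number of samples, so a 482,064-row training set is expensive for a 160-fit grid. One way to make the search tractable, sketched below on synthetic data, is to tune on a stratified subsample. The subsample size of 500 and the reduced grid are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the training data, with a circular class boundary.
rng = np.random.default_rng(1)
X_big = rng.normal(size=(5000, 2))
y_big = (X_big[:, 0] ** 2 + X_big[:, 1] ** 2 > 1.5).astype(int)

# Draw a stratified subsample to keep each SVC fit cheap.
X_small, _, y_small, _ = train_test_split(
    X_big, y_big, train_size=500, stratify=y_big, random_state=4
)

grid_small = GridSearchCV(
    SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=3, n_jobs=1
)
grid_small.fit(X_small, y_small)
print(grid_small.best_params_, grid_small.best_score_)
```

The best parameters found on the subsample can then be used for a single fit on a larger portion of the data.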


fitting 

In [None]:
grid.fit(Data_x, MergedDF['HEAT_HOT_WATER'])

## Evaluation

In [None]:
yhat = clf.predict(X_test)
yhat [0:5]

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import itertools

In [None]:
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted') 

In [None]:
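# jaccard_similarity_score was removed in scikit-learn 0.23; jaccard_score replaces it
# (for binary labels it scores the positive class)
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat)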

In [None]:
accuracy = metrics.accuracy_score(y_test,yhat)*100
accuracy

# Random Forest

In [None]:
feature_list = list(Data)

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state=42)

#Train the model on training data
rf.fit(X_train, y_train)

In [63]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100,
                              bootstrap =True,
                              max_features = 'sqrt')

model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [64]:
rf_pred = model.predict(X_test)
rf_probs = model.predict_proba(X_test)[:,1]

In [65]:
from sklearn.metrics import roc_auc_score

roc_value = roc_auc_score(y_test, rf_probs)
print("this is the roc value: ", roc_value)

this is the roc value:  0.4672526083369931
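
For reference, an ROC AUC of 0.5 corresponds to ranking positives and negatives at random, so a value of about 0.47 means the forest's probabilities rank the two classes slightly worse than chance. The sketch below uses four invented scores to show how the metric behaves: it is the share of (negative, positive) pairs that are ranked correctly, and inverting the scores mirrors it around 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Three of the four (negative, positive) pairs are ordered correctly.
auc = roc_auc_score(y_true, scores)
# Flipping the scores flips the ranking.
auc_inverted = roc_auc_score(y_true, 1.0 - scores)
print(auc, auc_inverted)
```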


In [66]:
#Use the forest's predict method on the test data
predictions = model.predict(X_test)

#Calculate the absolute errors
errors = abs(predictions - y_test)

# Print out the mean absolute error (MAE); for 0/1 labels this is the misclassification rate
print("Mean Absolute Error: ",round(np.mean(errors),2))

Mean Absolute Error:  0.25
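
For 0/1 labels, each absolute error is either 0 (correct) or 1 (wrong), so the mean absolute error equals one minus the accuracy. An MAE of 0.25 therefore corresponds to an accuracy of about 75%. The sketch below checks the identity on eight invented labels.

```python
import numpy as np

# Hypothetical true labels and predictions.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 0])
y_hat = np.array([0, 0, 1, 0, 0, 0, 1, 0])

mae = np.mean(np.abs(y_hat - y_true))   # share of wrong predictions
accuracy = np.mean(y_hat == y_true)     # share of correct predictions
print(mae, accuracy)
```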


# XGBOOST

In [51]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.1.1-py3-none-win_amd64.whl (54.4 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.1.1


In [52]:
from sklearn import svm, datasets
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from numpy import loadtxt
from xgboost import XGBClassifier

In [60]:
# instantiate the model (it is fit on the training data below)
model = XGBClassifier(learning_rate=0.001,n_estimators=750,objective='binary:logistic')

In [55]:
parameters = {'gamma':[0.01,1],
              'C':[0.01,1]}
grid = GridSearchCV(model,parameters, refit=True,verbose=10, n_jobs=1)

In [61]:
# fit model on training data
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.001, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=750, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [62]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [58]:
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 78.41%
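
The 78.41% accuracy is close to the share of the majority class in the test set reported earlier (161,978 of 206,599), so accuracy alone does not show whether the model detects any HEAT/HOT WATER complaints. One option that XGBoost exposes for imbalanced binary targets is `scale_pos_weight`; a commonly suggested starting value is the ratio of negative to positive samples. The sketch below only computes that ratio and does not fit a model. It uses the test-set counts printed earlier as a stand-in for the training counts, which is an assumption.

```python
import numpy as np

# Class counts taken from the y_test distribution printed earlier.
y_counts_demo = np.array([0] * 161978 + [1] * 44621)

n_neg, n_pos = np.bincount(y_counts_demo)
# Candidate starting value for XGBClassifier(scale_pos_weight=...).
ratio = n_neg / n_pos
majority_share = n_neg / (n_neg + n_pos)
print(round(ratio, 2), round(majority_share, 4))
```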
