In [1]:
import pandas as pd
import numpy as np
import sklearn.naive_bayes as NB
import sklearn.model_selection as cv
import sklearn.metrics as m
from sklearn import preprocessing

# Naive Bayes

## Read the data

As usual, before analyzing the data we read the CSV file and store its contents in a variable.

In [2]:
data = pd.read_csv('../datasets/preprocessed/train.csv', sep=',', na_values="NA")

In [3]:
data.head()

Unnamed: 0,MSSubClass,MSZoning,LotArea,LotShape,LandContour,LotConfig,Neighborhood,Condition1,BldgType,HouseStyle,...,MiscVal,SaleType,SaleCondition,SalePrice,MasVnr,SecondFloor,Baths,Porch,Pool,Id
0,G,RH,0.185945,1.0,Lvl,Inside,Edwards,Artery,1Fam,2Story,...,0.0,WD,Normal,Level2,0.0,1.0,0.4,True,0.0,0
1,A,RL,0.19889,1.0,Lvl,Inside,NAmes,Norm,1Fam,1Story,...,0.0,WD,Family,Level2,1.0,0.0,0.0,True,0.0,1
2,L,RL,0.260616,1.0,Lvl,Corner,NridgHt,Norm,Twnhs,1Story,...,0.0,New,Partial,Level4,1.0,0.0,0.4,True,0.0,2
3,A,RL,0.25123,1.0,Lvl,Inside,NAmes,Norm,1Fam,1Story,...,0.0,WD,Abnorml,Level1,1.0,0.0,0.0,False,0.0,3
4,E,RL,0.174186,1.0,Lvl,Inside,SWISU,Norm,1Fam,1.5Fin,...,0.0,WD,Normal,Level2,0.0,1.0,0.4,True,0.0,4


## Prepare the data

First, we split the data into two variables: num_data, which holds only the numerical columns, and cat_data, which holds only the categorical ones. We drop the "Id" column from num_data because it is just an identifier, and we drop the "SalePrice" column from cat_data because it is the target we want to predict.

In [4]:
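# Numerical features; Id is only a row identifier, so it is dropped
num_data = data.select_dtypes(include=np.number).drop(columns='Id')
# Categorical features (booleans and strings); SalePrice is the target, so it is dropped
cat_data = data.select_dtypes(include=['bool','object']).drop(columns='SalePrice')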
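
To illustrate how this dtype-based split behaves, here is a minimal sketch on a toy frame. The values are made up for illustration and only the column names mirror the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names come from the dataset.
toy = pd.DataFrame({
    "LotArea": [0.19, 0.20, 0.26],
    "MSZoning": pd.Series(["RH", "RL", "RL"], dtype=object),
    "Porch": [True, True, False],
    "SalePrice": pd.Series(["Level2", "Level2", "Level4"], dtype=object),
    "Id": [0, 1, 2],
})

# Numeric dtypes only (booleans are not counted as numbers), minus the identifier.
num_toy = toy.select_dtypes(include=np.number).drop(columns="Id")
# Boolean and string (object) columns, minus the prediction target.
cat_toy = toy.select_dtypes(include=["bool", "object"]).drop(columns="SalePrice")

print(list(num_toy.columns))
print(list(cat_toy.columns))
```

Boolean columns such as Porch end up on the categorical side because `np.number` does not match the `bool` dtype.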

## Train a model with numeric columns

First, we train a model using the numerical values in num_data. We use Gaussian Naive Bayes which, as the output shows, gives a very poor accuracy on the test split: only 0.219.

In [5]:
X = num_data
Y = data.loc[:,'SalePrice']

X_train, X_test, y_train, y_test = cv.train_test_split(X, Y, test_size=.3, random_state=1)

gnb = NB.GaussianNB()
gnb.fit(X_train,y_train)
gnb.score(X_test,y_test)

0.2185430463576159

## Train a model with categorical columns

Next, we train a model with the categorical values in cat_data. Since we're using Multinomial Naive Bayes, which requires numerical inputs instead of strings, we first preprocess the categories so that each distinct value gets an integer ID. As the output shows, the score is now 0.665, which is about three times better than the previous one but still leaves room for improvement.

In [6]:
X = cat_data.copy()  # work on a copy so cat_data itself is left untouched
Y = data.loc[:,'SalePrice']

# Replace each category with an integer code
for col in X.columns:
    X[col] = pd.factorize(X[col])[0]

X_train, X_test, y_train, y_test = cv.train_test_split(X, Y, test_size=.3, random_state=1)

mnb = NB.MultinomialNB()
mnb.fit(X_train,y_train)
mnb.score(X_test,y_test)

0.6655629139072847
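
As a small illustration of the encoding step, this sketch shows what `pd.factorize` returns for a short column. The values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical zoning values, used only to show the encoding.
zoning = pd.Series(["RL", "RH", "RL", "FV"], dtype=object)

# factorize returns integer codes plus the unique values, in order of first appearance.
codes, uniques = pd.factorize(zoning)

print(codes)
print(list(uniques))
```

The codes are assigned in order of first appearance, so their magnitudes carry no meaning of their own. Multinomial Naive Bayes nevertheless treats its inputs as counts, which may be one reason the score leaves room for improvement.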

## Cross validation of the best model

Now we cross-validate the categorical model, since it is the best one so far. We apply the same preprocessing as before and use Multinomial Naive Bayes again, this time computing cross_val_score over 10 stratified folds. The mean score is 0.613, somewhat below the 0.665 obtained on the single train/test split. We also build the confusion matrix from the cross-validated predictions and compute the accuracy, which has almost the same value, and we finish with the classification report.

In [7]:
# random_state only has an effect with shuffle=True, and recent scikit-learn versions reject it otherwise
kfold = cv.StratifiedKFold(n_splits=10)

X = cat_data.copy()  # work on a copy so cat_data itself is left untouched
Y = data.loc[:,'SalePrice']

# Replace each category with an integer code
for col in X.columns:
    X[col] = pd.factorize(X[col])[0]

mnb = NB.MultinomialNB()

cvs = cv.cross_val_score(mnb,X=X,y=Y,cv=kfold)
np.mean(cvs)



0.6131782178217822
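
To show what stratification does, the following sketch splits a toy, imbalanced label vector and counts the classes in each test fold. The 80/20 class ratio and the 5 folds are assumptions chosen for readability, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 80 samples of class 0 and 20 of class 1 (hypothetical).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not matter for splitting

skf = StratifiedKFold(n_splits=5)

# Count how many samples of each class land in every test fold.
fold_counts = [
    np.bincount(y[test_idx], minlength=2).tolist()
    for _, test_idx in skf.split(X, y)
]
print(fold_counts)
```

Each test fold keeps the 4:1 class ratio of the full label vector, which is why stratified folds are a reasonable choice for an unbalanced target such as SalePrice.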

In [8]:
pred = cv.cross_val_predict(mnb, X=X, y=Y, cv=kfold)  

print(m.confusion_matrix(Y, pred))
print(m.accuracy_score(Y, pred))

[[ 46  32   1   0   0]
 [ 61 403 132  14  18]
 [  2  46 154  10  11]
 [  0   7  30   5  17]
 [  0   1   4   3   9]]
0.6133200795228628


In [9]:
print(m.classification_report(Y, pred))

              precision    recall  f1-score   support

      Level1       0.42      0.58      0.49        79
      Level2       0.82      0.64      0.72       628
      Level3       0.48      0.69      0.57       223
      Level4       0.16      0.08      0.11        59
      Level5       0.16      0.53      0.25        17

    accuracy                           0.61      1006
   macro avg       0.41      0.51      0.43      1006
weighted avg       0.67      0.61      0.63      1006
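
The figures in the report can be recomputed from the confusion matrix printed above. This sketch copies that matrix and derives recall, precision, f1-score and accuracy with NumPy. Rows are the true levels and columns are the predicted ones.

```python
import numpy as np

# Confusion matrix copied from the cross-validated predictions above
# (rows: true Level1..Level5, columns: predicted Level1..Level5).
cm = np.array([
    [46, 32, 1, 0, 0],
    [61, 403, 132, 14, 18],
    [2, 46, 154, 10, 11],
    [0, 7, 30, 5, 17],
    [0, 1, 4, 3, 9],
])

tp = np.diag(cm)                  # correctly classified instances per level
recall = tp / cm.sum(axis=1)      # divide by the row totals (true counts)
precision = tp / cm.sum(axis=0)   # divide by the column totals (predicted counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(np.round(f1, 2))
print(round(accuracy, 4))
```

For Level4, only 5 of the 59 true instances sit on the diagonal, which explains its recall of 0.08.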



## Balancing the dataset

Despite the overall accuracy of 0.613, some SalePrice categories are poorly classified. Level4, for example, has an f1-score as low as 0.11. This is likely due to the unbalanced dataset, so we balance the data so that the number of instances in each SalePrice category is roughly the same.

In [10]:
Y.value_counts()

Level2    628
Level3    223
Level1     79
Level4     59
Level5     17
Name: SalePrice, dtype: int64
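
For context, the counts above can be turned into a simple reference point: the accuracy of always predicting the most frequent level. This is only an arithmetic sketch using the printed counts.

```python
# Class counts copied from the value_counts() output above.
counts = {"Level2": 628, "Level3": 223, "Level1": 79, "Level4": 59, "Level5": 17}

total = sum(counts.values())
majority = max(counts, key=counts.get)  # most frequent level
baseline = counts[majority] / total     # accuracy of always predicting it

print(total, majority, round(baseline, 3))
```

On the full dataset this majority-class baseline is about 0.624, so the imbalance alone accounts for much of the 0.613 accuracy reported earlier.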

In [11]:
print(data['SalePrice'].unique())

X1 = data[data['SalePrice'] == 'Level1']
X2 = data[data['SalePrice'] == 'Level2']
X3 = data[data['SalePrice'] == 'Level3']
X4 = data[data['SalePrice'] == 'Level4']
X5 = data[data['SalePrice'] == 'Level5']

# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat:
# Level1 x3, a random third of Level2, Level3 as is, Level4 x4 and Level5 x10
bdata = pd.concat(
    [X1] * 3 + [X2.sample(frac=1/3)] + [X3] + [X4] * 4 + [X5] * 10,
    ignore_index=True,
)

bdata['SalePrice'].value_counts()

['Level2' 'Level4' 'Level1' 'Level3' 'Level5']


Level1    237
Level4    236
Level3    223
Level2    209
Level5    170
Name: SalePrice, dtype: int64
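
As an alternative to hand-picked replication factors, a similar balancing could be sketched with pandas' grouped sampling. This is not the method used in this notebook: the toy frame, the `target` size and the use of sampling with replacement are all assumptions made for illustration.

```python
import pandas as pd

# Toy, unbalanced frame (hypothetical values; only the column name SalePrice is real).
toy = pd.DataFrame({
    "feature": range(12),
    "SalePrice": ["Level2"] * 8 + ["Level1"] * 3 + ["Level5"] * 1,
})

target = 4  # hypothetical number of rows wanted per level

# Draw `target` rows from every level; replace=True lets small levels be oversampled.
balanced = (
    toy.groupby("SalePrice")
       .sample(n=target, replace=True, random_state=0)
       .reset_index(drop=True)
)

print(balanced["SalePrice"].value_counts().to_dict())
```

With either approach, duplicated rows can end up in both the training and the test folds, so cross-validated scores on the balanced data may be optimistic for the oversampled levels.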

After balancing the dataset, the cross-validated accuracy is lower than before (0.553 versus 0.613), but the f1-scores for the different levels are more consistent, with smaller differences between them.

In [12]:
X = bdata.select_dtypes(include=['bool','object']).drop(columns=['SalePrice'])
Y = bdata['SalePrice']

kfold = cv.StratifiedKFold(n_splits=10) 

# Replace each category with an integer code
for col in X.columns:
    X[col] = pd.factorize(X[col])[0]

mnb = NB.MultinomialNB()

cvs = cv.cross_val_score(mnb,X=X,y=Y,cv=kfold)
np.mean(cvs)

0.5525441329179648

In [13]:
pred = cv.cross_val_predict(mnb, X=X, y=Y, cv=kfold)  

print(m.confusion_matrix(Y, pred))
print(m.accuracy_score(Y, pred))

[[206  26   3   1   1]
 [ 50  78  58  19   4]
 [  4  58  83  60  18]
 [  0  16  50 115  55]
 [  0   0  20  38 112]]
0.5525581395348838


In [14]:
print(m.classification_report(Y, pred))

              precision    recall  f1-score   support

      Level1       0.79      0.87      0.83       237
      Level2       0.44      0.37      0.40       209
      Level3       0.39      0.37      0.38       223
      Level4       0.49      0.49      0.49       236
      Level5       0.59      0.66      0.62       170

    accuracy                           0.55      1075
   macro avg       0.54      0.55      0.54      1075
weighted avg       0.54      0.55      0.55      1075



## Naive Bayes using PCA 

Using PCA we obtain significantly better results than before: Gaussian Naive Bayes reaches a score of 0.732 on the test split and 0.756 under cross validation. This is an improvement, but the numbers are still somewhat lower than desired.

In [15]:
data_pca = pd.read_csv('../datasets/preprocessed/train_pca.csv', sep=',', na_values="NA")

In [16]:
X = data_pca.drop(columns='Id')
Y = data.loc[:,'SalePrice']
X.head()

Unnamed: 0,0,1,2,3,4,5
0,-0.753748,0.636941,-0.164025,-0.204247,0.549835,0.539191
1,-0.117306,-0.688384,-0.493263,0.796882,-0.008604,-0.239237
2,0.787114,-0.469485,-0.626444,-0.272131,-0.467151,0.305142
3,-0.172825,-0.708999,-0.480148,0.864432,0.145394,-0.106748
4,-0.488369,0.642514,-0.179286,-0.202405,0.023691,-0.091447


In [17]:
X_train, X_test, y_train, y_test = cv.train_test_split(X, Y, test_size=.3, random_state=1)

gnb = NB.GaussianNB()
gnb.fit(X_train,y_train)
gnb.score(X_test,y_test)

0.7317880794701986

In [18]:
cvs = cv.cross_val_score(gnb,X=X,y=Y,cv=kfold)
np.mean(cvs)

0.7564455445544553

In [19]:
pred = cv.cross_val_predict(gnb, X=X, y=Y, cv=kfold)  

print(m.confusion_matrix(Y, pred))
print(m.accuracy_score(Y, pred))
print(m.classification_report(Y, pred))

[[ 35  44   0   0   0]
 [ 24 559  44   1   0]
 [  3  72 132  15   1]
 [  0   3  25  28   3]
 [  0   0   2   8   7]]
0.7564612326043738
              precision    recall  f1-score   support

      Level1       0.56      0.44      0.50        79
      Level2       0.82      0.89      0.86       628
      Level3       0.65      0.59      0.62       223
      Level4       0.54      0.47      0.50        59
      Level5       0.64      0.41      0.50        17

    accuracy                           0.76      1006
   macro avg       0.64      0.56      0.60      1006
weighted avg       0.75      0.76      0.75      1006
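
The PCA-transformed file is loaded ready-made here, so the way it was produced is not shown. As a rough sketch of how PCA and Gaussian Naive Bayes can be combined in one scikit-learn pipeline, the example below uses synthetic data. The scaling step, the dataset and the number of folds are assumptions, and the 6 components simply mirror the six component columns shown above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); the real features come from train_pca.csv.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3, random_state=0
)

# Scale, project onto 6 principal components, then fit Gaussian Naive Bayes.
model = make_pipeline(StandardScaler(), PCA(n_components=6), GaussianNB())

# The whole pipeline is refit on the training part of every fold.
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))
print(scores.mean())
```

Putting PCA inside the pipeline means the projection is learned only from the training part of each fold, which keeps the test part out of the preprocessing.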

