# Milestone 4 - BayesOpEd

## Emily McAfee

Instructions

1) Leverage the Naïve Bayes algorithm to build a classification model using the data from previous milestones.

2) Briefly summarize your findings on using Naïve Bayes.

3) Is Naïve Bayes more accurate than the regression model you used in Milestone 3?

In [1]:
# Load packages
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
import seaborn as sns

In [2]:
# Load data
filename = "https://drive.google.com/uc?export=download&id=1TqcnKvPJnuuEMVZYpx2H-hOf5Kaw_LUu"

# Read .csv file in with pandas
housedata = pd.read_csv(filename, header = 0)

In [3]:
# Check data
print(housedata.dtypes)
print(housedata.loc[:,'bedrooms'].head())
housedata.loc[:, 'id'].head()
housedata.loc[:,'date'].head()
housedata.columns

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
0    3
1    3
2    2
3    4
4    3
Name: bedrooms, dtype: int64


Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [4]:
# Reduce features
housedata2 = housedata.drop(['id', 'date', 'grade', 'zipcode', 'lat', 'long', 
                             'sqft_living15', 'sqft_lot15','sqft_above', 'sqft_basement',
                            'yr_built', 'yr_renovated'], axis = 1)
housedata2.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition
0,221900.0,3,1.0,1180,5650,1.0,0,0,3
1,538000.0,3,2.25,2570,7242,2.0,0,0,3
2,180000.0,2,1.0,770,10000,1.0,0,0,3
3,604000.0,4,3.0,1960,5000,1.0,0,0,5
4,510000.0,3,2.0,1680,8080,1.0,0,0,3


In [5]:
# Add column that we want to base our train/test on
housedata2['price_group'] = np.where(housedata2.price >= 600000, 0, 1)

# Check data
housedata2.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,price_group
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,1
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,1
2,180000.0,2,1.0,770,10000,1.0,0,0,3,1
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,0
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,1


In [6]:
# Replace 0 and 1 with category labels
housedata2.price_group = housedata2.price_group.replace(1, 'affordable')
housedata2.price_group = housedata2.price_group.replace(0, 'not_affordable')
housedata2.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,price_group
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,affordable
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,affordable
2,180000.0,2,1.0,770,10000,1.0,0,0,3,affordable
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,not_affordable
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,affordable


In [7]:
# Convert numerical variables to categorical variables
housedata2.bedrooms = pd.cut(housedata2.bedrooms, 5, labels = False)
housedata2.bathrooms = pd.cut(housedata2.bathrooms, 5, labels = False)
housedata2.sqft_living = pd.cut(housedata2.sqft_living, 5, labels = False)
housedata2.sqft_lot = pd.cut(housedata2.sqft_lot, 5, labels = False)
housedata2.floors = pd.cut(housedata2.floors, 5, labels = False)
housedata2.view = pd.cut(housedata2.view, 5, labels = False)
housedata2.condition = pd.cut(housedata2.condition, 5, labels = False)
housedata2.waterfront = pd.cut(housedata2.waterfront, 2, labels = False)


In [8]:
housedata2

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,price_group
0,221900.0,0,0,0,0,0,0,0,2,affordable
1,538000.0,0,1,0,0,1,0,0,2,affordable
2,180000.0,0,0,0,0,0,0,0,2,affordable
3,604000.0,0,1,0,0,0,0,0,4,not_affordable
4,510000.0,0,1,0,0,0,0,0,2,affordable
...,...,...,...,...,...,...,...,...,...,...
21608,360000.0,0,1,0,0,3,0,0,2,affordable
21609,400000.0,0,1,0,0,1,0,0,2,affordable
21610,402101.0,0,0,0,0,1,0,0,2,affordable
21611,400000.0,0,1,0,0,1,0,0,2,affordable


In [9]:
for col in ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition','price_group']:
    housedata2[col] = housedata2[col].astype('category')

housedata2.dtypes

# Drop original price column
housedata3 = housedata2.drop(['price'], axis = 1)


In [10]:
# Train and Test the model
label_col = 'price_group'

# Convert categorical values to numeric vector features
feature_vecs = np.array([
    housedata3[c].cat.codes
    for c in housedata3.columns
    if c != label_col]).T

feature_vecs.shape
feature_vecs

array([[0, 0, 0, ..., 0, 0, 2],
       [0, 1, 0, ..., 0, 0, 2],
       [0, 0, 0, ..., 0, 0, 2],
       ...,
       [0, 0, 0, ..., 0, 0, 2],
       [0, 1, 0, ..., 0, 0, 2],
       [0, 0, 0, ..., 0, 0, 2]], dtype=int8)

In [11]:
# Convert label (affordable v. not_affordable) to numeric values
labels = housedata3[label_col].cat.codes

# Look at mapping for first five values
list(zip(housedata3[label_col][:5], labels[:5]))

[('affordable', 0),
 ('affordable', 0),
 ('affordable', 0),
 ('not_affordable', 1),
 ('affordable', 0)]
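The mapping above follows from how pandas orders string categories: alphabetically, so 'affordable' becomes code 0 and 'not_affordable' becomes code 1. A minimal sketch on a made-up three-row series (the name `example_labels` is hypothetical):

```python
import pandas as pd

# Hypothetical three-row label column
example_labels = pd.Series(
    ['affordable', 'not_affordable', 'affordable']).astype('category')

# Categories are sorted alphabetically; the position in that order is the code
code_to_label = dict(enumerate(example_labels.cat.categories))
codes = example_labels.cat.codes.tolist()

print(code_to_label)
print(codes)
```

This matters for the metrics later on: scikit-learn's `precision_score` and `recall_score` treat label 1 as the positive class by default, which here is 'not_affordable'.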

In [12]:
# Get this model going
import sklearn.naive_bayes

# Define the model
model = sklearn.naive_bayes.MultinomialNB(alpha = 1e-7)

# Train the model with house data
model.fit(feature_vecs, labels)

MultinomialNB(alpha=1e-07, class_prior=None, fit_prior=True)
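`MultinomialNB` treats each feature value as a count, while our features are category codes. scikit-learn also provides `CategoricalNB`, which models each code as a separate category. The sketch below compares the two on synthetic data. The arrays `X` and `y` are hypothetical stand-ins for `feature_vecs` and `labels`, and this is an alternative to consider, not the model used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical stand-in for feature_vecs: 200 rows, 3 features coded 0..4
X = rng.integers(0, 5, size=(200, 3))
# Hypothetical label that depends only on the first feature
y = (X[:, 0] >= 3).astype(int)

# CategoricalNB estimates one probability per (feature, category, class)
cat_model = CategoricalNB(alpha=1.0).fit(X, y)
# MultinomialNB interprets the same codes as event counts
multi_model = MultinomialNB(alpha=1.0).fit(X, y)

cat_acc = cat_model.score(X, y)
multi_acc = multi_model.score(X, y)
print('CategoricalNB accuracy = %.3f, MultinomialNB accuracy = %.3f' % (cat_acc, multi_acc))
```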

In [13]:
# Evaluate the model performance
predicted_price = model.predict(feature_vecs[:10])
price_probabilities = model.predict_proba(feature_vecs[:10])

results = pd.DataFrame({
    'price_group': housedata3['price_group'][:10],
    'predicted': pd.Categorical.from_codes(
        predicted_price, housedata3['price_group'][:10].cat.categories),
    'proba(affordable)' : price_probabilities[:,0],
    'proba(not_affordable)' : price_probabilities[:,1],
})
results

Unnamed: 0,price_group,predicted,proba(affordable),proba(not_affordable)
0,affordable,affordable,0.829477,0.170523
1,affordable,affordable,0.790122,0.209878
2,affordable,affordable,0.829477,0.170523
3,not_affordable,affordable,0.897079,0.102921
4,affordable,affordable,0.811022,0.188978
5,not_affordable,not_affordable,0.348056,0.651944
6,affordable,affordable,0.790122,0.209878
7,affordable,affordable,0.829477,0.170523
8,affordable,affordable,0.829477,0.170523
9,affordable,affordable,0.790122,0.209878


In [14]:
# Compute confusion matrix and performance metrics for this model
import sklearn.metrics

def confusion_matrix(labels, predicted_labels, label_classes):
    # Rows are actual classes, columns are predicted classes
    return pd.DataFrame(
        sklearn.metrics.confusion_matrix(labels, predicted_labels),
        index=label_classes,
        columns=label_classes)

def performance(results):
    # Precision and recall treat code 1 ('not_affordable') as the positive class
    actual = results['price_group'].cat.codes
    predicted = results['predicted'].cat.codes
    accuracy = sklearn.metrics.accuracy_score(actual, predicted)
    precision = sklearn.metrics.precision_score(actual, predicted)
    recall = sklearn.metrics.recall_score(actual, predicted)

    print('Accuracy = %.3f, Precision = %.3f, Recall = %.3f' % (accuracy, precision, recall))

    return confusion_matrix(
        results['price_group'],
        results['predicted'],
        results.price_group.cat.categories)

In [15]:
performance(results)

Accuracy = 0.900, Precision = 1.000, Recall = 0.500


actual \ predicted,affordable,not_affordable
affordable,8,0
not_affordable,1,1


In [16]:
# Apply to all data
predicted_price = model.predict(feature_vecs)
price_probabilities = model.predict_proba(feature_vecs)

results_all = pd.DataFrame({
        'price_group': housedata3['price_group'],
        'predicted': pd.Categorical.from_codes(
            predicted_price, housedata3['price_group'].cat.categories),
        'proba(affordable)': price_probabilities[:, 0],
        'proba(not_affordable)': price_probabilities[:, 1],
    })
performance(results_all)

Accuracy = 0.801, Precision = 0.737, Recall = 0.502


actual \ predicted,affordable,not_affordable
affordable,14105,1142
not_affordable,3169,3197
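The three metrics can be recomputed by hand from the confusion matrix, which makes clear what each one measures. The counts below are copied from the output above. Treating 'not_affordable' as the positive class is an assumption based on scikit-learn's default of label 1 being positive. The always-'affordable' baseline is added here only for context.

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted)
tn, fp = 14105, 1142   # actual affordable: predicted affordable / not_affordable
fn, tp = 3169, 3197    # actual not_affordable: predicted affordable / not_affordable

total = tn + fp + fn + tp

accuracy = (tp + tn) / total    # share of all homes classified correctly
precision = tp / (tp + fp)      # share of predicted not_affordable that are correct
recall = tp / (tp + fn)         # share of actual not_affordable that were found

# Accuracy of a trivial model that always predicts 'affordable'
baseline = (tn + fp) / total

print('Accuracy = %.3f, Precision = %.3f, Recall = %.3f, Baseline = %.3f'
      % (accuracy, precision, recall, baseline))
```

The baseline of roughly 71% puts the 80% accuracy in context: the model beats always guessing the majority class, but by a smaller margin than the raw figure suggests.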


In [17]:
# Let's try to improve the model with stronger additive (Laplace-style) smoothing, alpha = 3
model = sklearn.naive_bayes.MultinomialNB(alpha=3)
model.fit(feature_vecs, labels)

MultinomialNB(alpha=3, class_prior=None, fit_prior=True)

In [18]:
# Apply model to house data
predicted_price = model.predict(feature_vecs)
price_probabilities = model.predict_proba(feature_vecs)

results_all = pd.DataFrame({
        'price_group': housedata3['price_group'],
        'predicted': pd.Categorical.from_codes(
            predicted_price, housedata3['price_group'].cat.categories),
        'proba(affordable)': price_probabilities[:, 0],
        'proba(not_affordable)': price_probabilities[:, 1],
    })
performance(results_all)

Accuracy = 0.801, Precision = 0.737, Recall = 0.502


actual \ predicted,affordable,not_affordable
affordable,14105,1142
not_affordable,3169,3197
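The identical results can be understood from the additive smoothing formula used by multinomial naive Bayes, (count + alpha) / (total + alpha * n_features). The sketch below applies it to made-up counts (all numbers are hypothetical) to show that changing alpha from 1e-7 to 3 barely moves the estimate when counts are large, but moves it noticeably when counts are small.

```python
def smoothed_prob(count, total, n_features, alpha):
    """Additive-smoothing estimate of a feature probability within one class."""
    return (count + alpha) / (total + alpha * n_features)

alphas = (1e-7, 3)

# Hypothetical large counts, similar in scale to a data set with thousands of rows
large = [smoothed_prob(6000, 40000, 8, a) for a in alphas]

# Hypothetical small counts, where smoothing has a visible effect
small = [smoothed_prob(1, 10, 8, a) for a in alphas]

print('large counts: %.5f v. %.5f' % tuple(large))
print('small counts: %.5f v. %.5f' % tuple(small))
```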


## Summary
Looking at the first 10 rows scored by our naive Bayes model, we see one classification error and nine correct classifications. The error is interesting because the predicted probabilities are not close (about 90% v. 10%, rather than something like 51% v. 49%). However, looking back at the data, that home is priced at 604,000, only 4,000 above our 600,000 'affordable' cutoff. Since the model never sees price directly, it makes sense that it could not pick up such a small difference in price.

When we compute the confusion matrix on all of the data (the same data used to fit the model), we get an accuracy of 80%, a precision of 74%, and a recall of 50%, with 'not_affordable' as the positive class. We have many more affordable homes in our sample than not affordable homes (15,247 v. 6,366), which is great, but it likely impacts our model through the class priors. To see if we can improve the model, we increase the smoothing parameter (Laplace-style smoothing). As the output above shows, the results are the same as without the extra smoothing. This is expected: with more than 21,000 rows, the category counts are large enough that adding a small constant to them barely changes the estimated probabilities.

For milestone three, we used a linear regression to predict price. We obtained an adjusted R-squared of .49, meaning that about 50% of the variance in price can be explained by the principal components we generated. Overall, then, the data we have is moderately helpful for predicting price with a linear regression model. With the naive Bayes model here, we have high accuracy, high precision, and moderate recall. I argue that the naive Bayes model is more accurate for predicting price, as it has an 80% accuracy rate, while our principal components in the linear regression only account for about 50% of the variability in price. The two numbers measure different things, though (classification accuracy on two price groups versus explained variance in a continuous price), so the comparison is indicative rather than exact.
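One way to put the two models on a common footing would be to threshold the regression's predicted prices at the same 600,000 cutoff and score the resulting groups with accuracy on held-out rows. The sketch below does this on synthetic data. The variables `sqft` and `price` are made up, and scikit-learn's `LinearRegression` is swapped in for the statsmodels regression on principal components from milestone three, so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical data: price grows with square footage, plus noise
sqft = rng.uniform(500, 4000, size=1000)
price = 250 * sqft + rng.normal(0, 150_000, size=1000)
X = sqft.reshape(-1, 1)
cutoff = 600_000  # same cutoff used for price_group

# Hold out 30% of the rows so the scores are not computed on training data
X_train, X_test, p_train, p_test = train_test_split(
    X, price, test_size=0.3, random_state=0)

reg = LinearRegression().fit(X_train, p_train)
pred_price = reg.predict(X_test)

# Turn actual and predicted prices into groups (1 = not_affordable)
actual_group = (p_test >= cutoff).astype(int)
pred_group = (pred_price >= cutoff).astype(int)

reg_accuracy = accuracy_score(actual_group, pred_group)
r_squared = reg.score(X_test, p_test)
print('R-squared = %.3f, accuracy at cutoff = %.3f' % (r_squared, reg_accuracy))
```

Scoring the naive Bayes model on the same held-out rows would then give two accuracy figures that can be compared directly.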