# Worksheet 14

Name: Calvin Li  
UID: U51621195

### Topics

- Naive Bayes
- Model Evaluation

### Naive Bayes

| Attribute A | Attribute B | Attribute C | Class |
|-------------|-------------|-------------|-------|
| Yes         | Single      | High        | No    |
| No          | Married     | Mid         | No    |
| No          | Single      | Low         | No    |
| Yes         | Married     | High        | No    |
| No          | Divorced    | Mid         | Yes   |
| No          | Married     | Low         | No    |
| Yes         | Divorced    | High        | No    |
| No          | Single      | Mid         | Yes   |
| No          | Married     | Low         | No    |
| No          | Single      | Mid         | Yes   |

a) Compute the following probabilities:

- P(Attribute A = Yes | Class = No)
- P(Attribute B = Divorced | Class = Yes)
- P(Attribute C = High | Class = No)
- P(Attribute C = Mid | Class = Yes)

P(Attribute A = Yes | Class = No) = $\frac{3}{7}$  
P(Attribute B = Divorced | Class = Yes) = $\frac{1}{3}$  
P(Attribute C = High | Class = No) = $\frac{3}{7}$  
P(Attribute C = Mid | Class = Yes) = $\frac{3}{3} = 1$ (all three Class = Yes records have Attribute C = Mid)

b) Classify the following unseen records:

- (Yes, Married, Mid)
- (No, Divorced, High)
- (No, Single, High)
- (No, Divorced, Low)

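(Yes, Married, Mid) = No  
(No, Divorced, High) = No  
(No, Single, High) = No  
(No, Divorced, Low) = No

Without smoothing, each record contains an attribute value that never occurs with Class = Yes (Attribute A = Yes, Attribute C = High, or Attribute C = Low), so its Yes score is 0 while its No score is positive.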
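
The classifications can be checked with a minimal counting-based sketch. It assumes no smoothing and uses exact fractions; the helper names `conditional`, `nb_score`, and `classify` are made up for this illustration.

```python
from fractions import Fraction

# Training records from the table: (Attribute A, Attribute B, Attribute C, Class)
records = [
    ("Yes", "Single", "High", "No"),
    ("No", "Married", "Mid", "No"),
    ("No", "Single", "Low", "No"),
    ("Yes", "Married", "High", "No"),
    ("No", "Divorced", "Mid", "Yes"),
    ("No", "Married", "Low", "No"),
    ("Yes", "Divorced", "High", "No"),
    ("No", "Single", "Mid", "Yes"),
    ("No", "Married", "Low", "No"),
    ("No", "Single", "Mid", "Yes"),
]

def conditional(attr_index, value, label):
    """P(attribute = value | Class = label), estimated by counting (no smoothing)."""
    in_class = [r for r in records if r[3] == label]
    matches = [r for r in in_class if r[attr_index] == value]
    return Fraction(len(matches), len(in_class))

def nb_score(record, label):
    """Unnormalized posterior: P(label) times the product of P(x_i | label)."""
    score = Fraction(sum(r[3] == label for r in records), len(records))
    for i, value in enumerate(record):
        score *= conditional(i, value, label)
    return score

def classify(record):
    """Pick the class with the larger unnormalized posterior."""
    return max(["Yes", "No"], key=lambda label: nb_score(record, label))

unseen = [
    ("Yes", "Married", "Mid"),
    ("No", "Divorced", "High"),
    ("No", "Single", "High"),
    ("No", "Divorced", "Low"),
]
for record in unseen:
    print(record, "->", classify(record),
          "| Yes score:", nb_score(record, "Yes"),
          "| No score:", nb_score(record, "No"))
```

With Laplace smoothing the zero factors would become small positive values, so the answers could change; that variant is not worked out here.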

### Model Evaluation

a) Write a function to generate the confusion matrix for a list of actual classes and a list of predicted classes

In [26]:
actual_class = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
predicted_class = ["Yes", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No"]

def confusion_matrix(actual, predicted):
    TP, FP, TN, FN = 0, 0, 0, 0
    for i in range(len(actual)):
        if actual[i] == "Yes" and predicted[i] == "Yes":
            TP += 1
        elif actual[i] == "Yes" and predicted[i] == "No":
            FN += 1
        elif actual[i] == "No" and predicted[i] == "Yes":
            FP += 1
        elif actual[i] == "No" and predicted[i] == "No":
            TN += 1
    return [[TP, FN], [FP, TN]]

print(confusion_matrix(actual_class, predicted_class))

[[2, 1], [3, 4]]


b) Assume you have the following Cost Matrix:

|            | predicted = Y | predicted = N |
|------------|---------------|---------------|
| actual = Y |       -1      |       5       |
| actual = N |        10     |       0       |

What is the cost of the above classification?

TP cost = 2 * -1 = -2  
FP cost = 3 * 10 = 30  
FN cost = 1 * 5 = 5  
TN cost = 4 * 0 = 0

Overall cost: 30 + 5 + (-2) + 0 = 35 - 2 = 33

c) Write a function that takes in the actual values, the predictions, and a cost matrix and outputs a cost. Test it on the above example.

In [27]:
cost_matrix = [[-1, 5],[10, 0]]

def cost(actual, predicted, cost_matrix):
    total = 0
    for i in range(len(actual)):
        if actual[i] == "Yes" and predicted[i] == "Yes":
            total += cost_matrix[0][0]
        elif actual[i] == "Yes" and predicted[i] == "No":
            total += cost_matrix[0][1]
        elif actual[i] == "No" and predicted[i] == "Yes":
            total += cost_matrix[1][0]
        elif actual[i] == "No" and predicted[i] == "No":
            total += cost_matrix[1][1]
    return total

print(cost(actual_class, predicted_class, cost_matrix))

33
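
The same total can also be obtained from the confusion matrix, since it and the cost matrix share the layout `[[TP, FN], [FP, TN]]`. A small sketch, where `cost_from_confusion` is a hypothetical helper name:

```python
# Both matrices use rows = actual (Yes, No) and columns = predicted (Yes, No).
confusion = [[2, 1], [3, 4]]
cost_matrix = [[-1, 5], [10, 0]]

def cost_from_confusion(confusion, cost_matrix):
    """Sum of count * unit cost over matching cells of the two matrices."""
    return sum(
        count * unit_cost
        for conf_row, cost_row in zip(confusion, cost_matrix)
        for count, unit_cost in zip(conf_row, cost_row)
    )

print(cost_from_confusion(confusion, cost_matrix))  # 2*(-1) + 1*5 + 3*10 + 4*0 = 33
```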


d) Implement functions for the following:

- accuracy
- precision
- recall
- f-measure

and apply them to the above example.

In [28]:
def accuracy(confusion):
    a = confusion[0][0]
    b = confusion[0][1]
    c = confusion[1][0]
    d = confusion[1][1]
    return (a + d) / (a + b + c + d)

def precision(confusion):
    a = confusion[0][0]
    c = confusion[1][0]
    return a / (a + c)

def recall(confusion):
    a = confusion[0][0]
    b = confusion[0][1]
    return a / (a + b)

def f_measure(precision, recall):
    return (2 * precision * recall) / (precision + recall)


confusion = confusion_matrix(actual_class, predicted_class)
precision_value = precision(confusion)
recall_value = recall(confusion)

print("accuracy:", accuracy(confusion))
print("precision:", precision(confusion))
print("recall:", recall(confusion))
print("f-measure:", f_measure(precision_value, recall_value))

accuracy: 0.6
precision: 0.4
recall: 0.6666666666666666
f-measure: 0.5
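
The functions above divide by zero when a model never predicts "Yes" (precision) or when there are no actual "Yes" records (recall). One possible guard is sketched below; returning 0.0 in those cases is an assumed convention, and `safe_divide` and `metrics_from_confusion` are illustrative names.

```python
def safe_divide(numerator, denominator, default=0.0):
    """Return numerator / denominator, or `default` when the denominator is 0."""
    return numerator / denominator if denominator else default

def metrics_from_confusion(confusion):
    """Compute the four metrics from a [[TP, FN], [FP, TN]] confusion matrix."""
    (tp, fn), (fp, tn) = confusion
    precision = safe_divide(tp, tp + fp)
    recall = safe_divide(tp, tp + fn)
    return {
        "accuracy": safe_divide(tp + tn, tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": safe_divide(2 * precision * recall, precision + recall),
    }

# The worksheet example
print(metrics_from_confusion([[2, 1], [3, 4]]))
# A classifier that always predicts "No": precision has a zero denominator
print(metrics_from_confusion([[0, 3], [0, 7]]))
```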


## Challenge (Midterm prep part 2)

In this exercise you will update your submission to the titanic competition.

a) First let's add new numerical features / columns to the datasets that might be related to the survival of individuals.

- `has_cabin` should have a value of 0 if the `cabin` feature is `nan` and 1 otherwise
- `family_members` should have the total number of family members (by combining `SibSp` and `Parch`)
- `title_type`: from the title extracted from the name, we will categorize it into 2 types: `common` for titles that many passengers have, `rare` for titles that few passengers have. Map `common` to 1 and `rare` to 0. Describe what threshold you used to define `common` and `rare` titles and how you found it.
- `fare_price`: using k-means clustering on the fare column, find an appropriate number of clusters / groups of similar fares. Using the clusters you created, `fare_price` should be an ordinal variable that represents the expensiveness of the fare. For example, if you split fare into 3 clusters (0 - 15, 15 - 40, and 40+), then the `fare_price` value should be `0` for `fare` values 0 - 15, `1` for 15 - 40, and `2` for 40+.
- Create two additional numerical features of your invention that you think could be relevant to the survival of individuals.

Note: The features must be numerical because the sklearn `DecisionTreeClassifier` can only take on numerical features.

In [141]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
#setting Cabin to either 1 or 0
train_df["Cabin"] = train_df["Cabin"].notnull().astype(int)
#adding Family members
train_df["family_members"] = train_df["SibSp"] + train_df["Parch"]
#Any title that has a frequency of over 100 is considered common
#These are Mr, Miss, and Mrs
train_df["title_type"] = train_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
frequency = train_df['title_type'].value_counts()
common = frequency[frequency > 100].index
train_df["title_type"] = train_df["title_type"].apply(lambda x: 1 if x in common else 0)
#fare_price
fare_data = np.array(train_df["Fare"]).reshape(-1, 1)
k_means = KMeans(n_clusters=5, random_state=50, n_init=10)
k_means.fit(fare_data)
centroids = k_means.cluster_centers_
train_df["fare_price"] = k_means.predict(fare_data)
#categorizing the age
train_df["Age"] = train_df["Age"].fillna(train_df["Age"].mean())
age_bins = [0, 10, 18, 60, float('inf')]
age_labels = ["child", "teen", "adult", "elder"]
train_df["Categorized_Age"] = pd.cut(train_df["Age"], bins=age_bins, labels=age_labels, right=True)
#changing categorical into numerical
le = LabelEncoder()
#female is 0 and male is 1
train_df["Sex"] = le.fit_transform(train_df["Sex"])
#adult = 0, child = 1, elder = 2, teen = 3 (LabelEncoder sorts the labels alphabetically)
train_df["Categorized_Age"] = le.fit_transform(train_df["Categorized_Age"])
survived = train_df["Survived"]
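
One caveat on `fare_price`: the label numbers that `KMeans` assigns are arbitrary, so cluster 0 is not necessarily the cheapest group. A possible way to make the labels ordinal is to rank the clusters by centroid value, as sketched below. `ordinal_cluster_labels` is a hypothetical helper, and the centroids and labels are made-up stand-ins for `k_means.cluster_centers_` and `k_means.predict(fare_data)`.

```python
import numpy as np

def ordinal_cluster_labels(centers, labels):
    """Relabel clusters so 0 is the lowest centroid and k - 1 the highest."""
    centers = np.asarray(centers).ravel()
    order = np.argsort(centers)           # cluster ids sorted by centroid value
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))   # rank[cluster_id] = ordinal position
    return rank[np.asarray(labels)]

# Made-up centroids and raw labels for illustration only
centers = np.array([[30.0], [8.0], [250.0]])
raw_labels = np.array([1, 0, 2, 1, 0])
print(ordinal_cluster_labels(centers, raw_labels))  # [0 1 2 0 1]
```

Applying the same fitted clusters and the same ranking to both the training and test fares would keep the encoding consistent between the two datasets.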

b) Using a method covered in class, tune the parameters of a decision tree model on the titanic dataset (containing all numerical features including the ones you added above). Evaluate this model locally and report its performance.

Note: make sure you are not tuning your parameters on the same dataset you are using to evaluate the model. Also explain how you know you are not overfitting to the training set.

In [142]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

train_df = train_df.drop(["Name", "Age", "SibSp", "Parch", "Ticket", "Fare", "Embarked", "Survived"], axis=1)
x_train, x_test, y_train, y_test = train_test_split(train_df, survived, test_size=0.2, random_state=50)

param_grid = {
    'max_depth': [3, 5, 7, 9, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt_classifier = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train, y_train)

best_params = grid_search.best_params_
print("Best Parameters:", best_params)

best_dt_classifier = DecisionTreeClassifier(**best_params)
best_dt_classifier.fit(x_train, y_train)

y_pred = best_dt_classifier.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

#every new feature added to train, add to test
#setting Cabin to either 1 or 0
test_df["Cabin"] = test_df["Cabin"].notnull().astype(int)
#adding Family members
test_df["family_members"] = test_df["SibSp"] + test_df["Parch"]
#Any title that has a frequency of over 100 is considered common
#(recomputed on the test set here, so the common titles may differ from Mr, Miss, and Mrs in train)
test_df["title_type"] = test_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
frequency = test_df["title_type"].value_counts()
common = frequency[frequency > 100].index
test_df["title_type"] = test_df["title_type"].apply(lambda x: 1 if x in common else 0)
#fare_price
test_df["Fare"] = test_df["Fare"].fillna(test_df["Fare"].mean())
fare_data = np.array(test_df["Fare"]).reshape(-1, 1)
k_means = KMeans(n_clusters=5, random_state=50, n_init=10)
k_means.fit(fare_data)
centroids = k_means.cluster_centers_
test_df["fare_price"] = k_means.predict(fare_data)
#categorizing the age
test_df["Age"] = test_df["Age"].fillna(test_df["Age"].mean())
age_bins = [0, 10, 18, 60, float("inf")]
age_labels = ["child", "teen", "adult", "elder"]
test_df["Categorized_Age"] = pd.cut(test_df["Age"], bins=age_bins, labels=age_labels, right=True)
#changing categorical into numerical
le = LabelEncoder()
#female is 0 and male is 1
test_df["Sex"] = le.fit_transform(test_df["Sex"])
#adult = 0, child = 1, elder = 2, teen = 3 (LabelEncoder sorts the labels alphabetically)
test_df["Categorized_Age"] = le.fit_transform(test_df["Categorized_Age"])


test_df = test_df.drop(["Name", "Age", "SibSp", "Parch", "Ticket", "Fare", "Embarked"], axis=1)

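#predict with the tuned decision tree from this cell (ensemble_classifier is only defined in part d)
prediction = best_dt_classifier.predict(test_df)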
result = pd.DataFrame({'PassengerId': test_df["PassengerId"], 'Survived': prediction})
result.to_csv('Titanic.csv', index=False)
result.name = "Predicted Titanic Passenger Survivability"
result

Best Parameters: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy: 0.8100558659217877


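|     | PassengerId | Survived |
|-----|-------------|----------|
| 0   | 892         | 0        |
| 1   | 893         | 0        |
| 2   | 894         | 0        |
| 3   | 895         | 0        |
| 4   | 896         | 0        |
| ... | ...         | ...      |
| 413 | 1305        | 0        |
| 414 | 1306        | 1        |
| 415 | 1307        | 0        |
| 416 | 1308        | 0        |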


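This model is unlikely to be overfitting because the parameters were tuned with 5-fold cross-validation on the training split only: each parameter combination is trained on k - 1 folds and validated on the remaining fold, so the chosen parameters are not tailored to a single split. The reported accuracy is then measured on a held-out test split that was never used for tuning, so the model is more general rather than being fitted to the one dataset used to train it.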
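
One way to back this claim with numbers is to compare the training accuracy, the cross-validated accuracy, and the held-out accuracy: a training score far above the other two would point to overfitting. The sketch below uses synthetic data from `make_classification` as a stand-in for the titanic features, so the dataset, the reduced parameter grid, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered titanic features
X, y = make_classification(n_samples=400, n_features=8, random_state=50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=50)

# Tune on the training split only, with 5-fold cross-validation
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=50),
    {"max_depth": [3, 5, None]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)

# grid.predict uses the best estimator refit on the whole training split
train_acc = accuracy_score(y_tr, grid.predict(X_tr))
cv_acc = grid.best_score_
test_acc = accuracy_score(y_te, grid.predict(X_te))

print("train accuracy:", train_acc)
print("mean CV accuracy:", cv_acc)
print("held-out accuracy:", test_acc)
```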

c) Try reducing the dimension of the dataset and create a Naive Bayes model. Evaluate this model.

In [90]:
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

pca = PCA(n_components=2)
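#errors="ignore" skips any columns that were already dropped in part b
reduced = pca.fit_transform(train_df.drop(["Survived", "Name", "Age", "SibSp", "Parch", "Ticket", "Fare", "Embarked"], axis=1, errors="ignore"))

#survived holds the labels saved before the Survived column was dropped
X_train, X_test, y_train, y_test = train_test_split(reduced, survived, test_size=0.2, random_state=50)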

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

y_pred = nb_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.5642458100558659
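
PCA picks the directions of largest variance, so an unscaled column with a wide range (such as `PassengerId`) can dominate the two components. Standardizing the features first is one common remedy. The sketch below compares the two pipelines on synthetic data with one artificially inflated column; the data, the inflation factor, and the names are assumptions, and whether scaling improves the titanic score above is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column 0 is inflated to mimic an id-like wide-range feature
X, y = make_classification(n_samples=300, n_features=8, random_state=50)
X[:, 0] *= 500

# PCA on the raw features versus PCA on standardized features
unscaled = make_pipeline(PCA(n_components=2), GaussianNB())
scaled = make_pipeline(StandardScaler(), PCA(n_components=2), GaussianNB())

unscaled_scores = cross_val_score(unscaled, X, y, cv=5, scoring="accuracy")
scaled_scores = cross_val_score(scaled, X, y, cv=5, scoring="accuracy")

print("unscaled PCA + NB:", unscaled_scores.mean())
print("scaled PCA + NB:", scaled_scores.mean())
```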


d) Create an ensemble classifier using a combination of KNN, Decision Trees, and Naive Bayes models. Evaluate this classifier.

In [131]:
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

#errors="ignore" skips any columns that were already dropped in part b
train_df = train_df.drop(["Name", "Age", "SibSp", "Parch", "Ticket", "Fare", "Embarked", "Survived"], axis=1, errors="ignore")
x_train, x_test, y_train, y_test = train_test_split(train_df, survived, test_size = 0.2, random_state=50)

knn_classifier = KNeighborsClassifier()
dt_classifier = DecisionTreeClassifier()
nb_classifier = GaussianNB()

ensemble_classifier = VotingClassifier(estimators=[
    ('knn', knn_classifier),
    ('dt', dt_classifier),
    ('nb', nb_classifier)
], voting='hard')

ensemble_classifier.fit(x_train, y_train)

y_pred = ensemble_classifier.predict(x_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#test_df already received the new features and had the unused columns dropped in part b,
#so the preprocessing is not repeated here

prediction = ensemble_classifier.predict(test_df)
result = pd.DataFrame({'PassengerId': test_df["PassengerId"], 'Survived': prediction})
result.to_csv('titanic.csv', index=False)
result.name = "Predicted Titanic Passenger Survivability"
result

Accuracy: 0.776536312849162


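|     | PassengerId | Survived |
|-----|-------------|----------|
| 0   | 892         | 0        |
| 1   | 893         | 0        |
| 2   | 894         | 0        |
| 3   | 895         | 0        |
| 4   | 896         | 0        |
| ... | ...         | ...      |
| 413 | 1305        | 0        |
| 414 | 1306        | 1        |
| 415 | 1307        | 0        |
| 416 | 1308        | 0        |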
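
To judge whether the ensemble adds anything over its members, each model can be scored with the same cross-validation folds. This is a rough sketch on synthetic data, not the titanic features; the dataset and the `members` / `mean_scores` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=50)

members = [
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(random_state=50)),
    ("nb", GaussianNB()),
]
models = dict(members)
models["ensemble"] = VotingClassifier(estimators=members, voting="hard")

# Mean 5-fold accuracy for each member and for the hard-voting ensemble
mean_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in mean_scores.items():
    print(name, score)
```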


e) Update your kaggle submission using the best model you created (best model means the one that performed the best on your local evaluation)

https://www.kaggle.com/competitions/titanic/leaderboard?search=Calvin0824

Kaggle User: Calvin0824  
Kaggle Score: 0.78468  
Kaggle Rank: 2245

## Some useful code for the midterm

In [None]:
import seaborn as sns
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.datasets import fetch_lfw_people
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

sns.set()

# Get face data
faces = fetch_lfw_people(min_faces_per_person=60)

# plot face data
fig, ax = plt.subplots(3, 5)
for i, axi in enumerate(ax.flat):
    axi.imshow(faces.images[i], cmap='bone')
    axi.set(xticks=[], yticks=[],
            xlabel=faces.target_names[faces.target[i]])
plt.show()

# split train test set
Xtrain, Xtest, ytrain, ytest = train_test_split(faces.data, faces.target, random_state=42)

pca = PCA(n_components=150, whiten=True)
svc = SVC(kernel='rbf', class_weight='balanced')
svcpca = make_pipeline(pca, svc)

# Tune model to find best values of C and gamma using cross validation
param_grid = {'svc__C': [1, 5, 10, 50],
              'svc__gamma': [0.0001, 0.0005, 0.001, 0.005]}
kfold = 10
grid = GridSearchCV(svcpca, param_grid, cv=kfold)
grid.fit(Xtrain, ytrain)

print(grid.best_params_)

# use the best params explicitly here
pca = PCA(n_components=150, whiten=True)
svc = SVC(kernel='rbf', class_weight='balanced', C=10, gamma=0.005)
svcpca = make_pipeline(pca, svc)

model = BaggingClassifier(svcpca, n_estimators=100).fit(Xtrain, ytrain)
yfit = model.predict(Xtest)

fig, ax = plt.subplots(6, 6)
for i, axi in enumerate(ax.flat):
    axi.imshow(Xtest[i].reshape(62, 47), cmap='bone')
    axi.set(xticks=[], yticks=[])
    axi.set_ylabel(faces.target_names[yfit[i]].split()[-1],
                   color='black' if yfit[i] == ytest[i] else 'red')
fig.suptitle('Predicted Names; Incorrect Labels in Red', size=14)
plt.show()

mat = confusion_matrix(ytest, yfit)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=faces.target_names,
            yticklabels=faces.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

print("Accuracy = ", accuracy_score(ytest, yfit))