# Brady Engelke
# MSBA 6420 - Predictive Analytics
## HW 2 
## 10/14/19

## Q1) (10 points)

Choose a problem from a past job, hobby, or interest that would make for a good predictive modeling classification application. Describe it in one page or less using the relevant concepts. Your description should be as complete and precise as possible.

### a) What exactly is the business decision you want to support with this solution?

This past spring, I worked as a Resorts Estimating Intern for Walt Disney Imagineering, where I was exposed to the process that Resort Executives and Project Managers follow when building a resort. Before construction begins, all stakeholders must agree on the project scope, timeline, and, most importantly, cost. I believe any executive who plays a pivotal role in deciding whether a $100M+ project proceeds should consult a classification model that predicts whether the resort at hand is likely to be profitable for Disney. An executive involved in a decision of this magnitude has an overwhelming number of factors to consider, which makes this a well-suited business scenario for classification modeling. I anticipate that executives who adopt this approach will, on average, select more profitable projects for Disney than those who do not.

### b) Describe the use phase.

Leading up to construction for a particular resort, there are a number of Investment Committee meetings where all stakeholders congregate to decide whether the project should move forward. Ultimately, it is up to the CFO of Walt Disney Imagineering to make the final call. Thus, I believe the CFO would benefit the most from investing in a data scientist. The CFO's data scientist should manage a classification model for each of the pertinent projects the CFO will have to make an upcoming decision on, consistently updating the models as new information about projects flows in. I am confident the CFO will find it useful to consult a project's classification model prior to any of these Investment Committee meetings before digesting all of the details of the project shared by the Resort Executives & Project Managers.

### c) Why did you select this as a good predictive modeling problem?

I was surprised to see the limited use of data science at Walt Disney Imagineering. Considering the sheer magnitude of investments that Walt Disney Imagineering executives manage, it seems necessary to start cultivating data science expertise to help inform decision making.

### d) How and where would you get the data?

The estimating department at Walt Disney Imagineering warehouses cost data and construction documentation for all resorts it has built over the past 50 years. The Finance department at Walt Disney Imagineering also tracks the running profitability of all resorts that have been built. If these two data sources can be consolidated, the CFO's data scientist would have plenty of data to begin building classification models.

### e) Explain precisely why and how you expect doing the predictive modeling will add value.

The CFO likely has a well-founded intuition when assessing whether a project will be profitable or not, but overseeing a massive operation with billions of dollars at stake doesn't leave one with much time to study the details of all projects under development. By hiring a data scientist to conduct predictive modeling, the CFO would be better informed when forecasting the profitability of a resort via comparing the profitability of resorts built in the past that have similar feature sets as the resort under analysis. The predictive models would be able to raise red &/or green flags about upcoming projects that may be difficult to pinpoint without crunching the numbers. Furthermore, the models would likely provide confidence in many aspects of the CFO's intuition.

### f) What exactly is the quantity that you inherently do not know and need to predict?

Will the future profitability of the resort under development be above Disney's minimum acceptable rate of return (MARR)? (yes or no)

### g) Is this a classification, ranking, or probability estimation problem?

It is a classification problem. There is a clear distinction between resorts built in the past that have performed above Disney's MARR and those that have not. This binary variable will serve as the class variable for the training and testing data. The application could also be approached as a ranking or probability estimation problem, depending on the preferences of the CFO.

### h) What are the features? Provide a list of at least 5 features that you think (a) you can get and (b) you think might be useful.

Features:

1. Total Cost of Resort
2. Total # of Rooms
3. # of Disney Vacation Club Rooms / Total # of Rooms
4. # of Restaurants
5. Target Disney Customer
6. # sq ft of resort site
7. General contractor
8. MEP contractors
9. Integrated IP (Mickey, Star Wars, etc.)
10. Geographic Theming (Wilderness, Tropical, Victorian, etc.)

This is a set of features that would capture the basic financial, operational, and creative components that are always coordinated prior to making the final decision of whether to build a resort. 

### i) What exactly would be your training data?

The training data would be a consolidation of all of the Excel spreadsheets that currently contain most of this information. Some of the most useful features may have to be derived and manually encoded. The consolidated master dataframe could be split into 75% for training and 25% for testing, with k-fold cross-validation on the training portion to make better use of the limited data.

## Q2) (20 points)

You have a fraud detection task (predicting whether a given credit card transaction is “fraud” vs. “non-fraud”) and you built a classification model for this purpose. For any credit card transaction, your model estimates the probability that this transaction is “fraud”. The following table represents the probabilities that your model estimated for the validation dataset containing 10 records.

In [None]:
Actual Class (from validation data) | Estimated Probability of Record Belonging to Class “fraud”
           
                              fraud | 0.95
                              fraud | 0.91
                              fraud | 0.75
                          non-fraud | 0.67
                              fraud | 0.61
                          non-fraud | 0.46
                              fraud | 0.42
                          non-fraud | 0.25
                          non-fraud | 0.09
                          non-fraud | 0.04

### a) What is the overall accuracy of your model, if the chosen probability cutoff value is 0.3? What is the overall accuracy of your model, if the chosen probability cutoff value is 0.8?

In [None]:
cutoff_1 = (8 / 10) * 100 # for a cutoff of 0.3
cutoff_2 = (7 / 10) * 100 # for a cutoff of 0.8
print('The accuracy of this model with a cutoff of 0.3 is {0}%'.format(cutoff_1))
print('The accuracy of this model with a cutoff of 0.8 is {0}%'.format(cutoff_2))
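The two accuracies above are entered by hand. As a cross-check, here is a minimal sketch that derives them from the validation table, assuming a record is predicted "fraud" whenever its estimated probability is at or above the cutoff. The helper name `accuracy_at` is hypothetical.

```python
# Validation table from Q2: (actual class, estimated P(fraud))
records = [
    ('fraud', 0.95), ('fraud', 0.91), ('fraud', 0.75), ('non-fraud', 0.67),
    ('fraud', 0.61), ('non-fraud', 0.46), ('fraud', 0.42), ('non-fraud', 0.25),
    ('non-fraud', 0.09), ('non-fraud', 0.04),
]

def accuracy_at(cutoff, records):
    """Share of records classified correctly when p >= cutoff means 'fraud'."""
    correct = 0
    for actual, p in records:
        predicted = 'fraud' if p >= cutoff else 'non-fraud'
        correct += (predicted == actual)
    return correct / len(records)

for cutoff in (0.3, 0.8):
    print('cutoff {0}: accuracy = {1:.0%}'.format(cutoff, accuracy_at(cutoff, records)))
```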

### b) What probability cutoff value should you choose, in order to have Precision fraud = 100% for your model? What is the overall accuracy of your model in this case?

To have Precision = 100%, you would set the probability cutoff = 0.70. This ensures that all transactions predicted to be fraud are indeed fraud. The overall accuracy for this model would be 80%.

### c) What probability cutoff value should you choose, in order to have Recall fraud = 100% for your model? What is the overall accuracy of your model in this case?

To have Recall = 100%, you would set the probability cutoff = 0.30. This ensures that all transactions that are indeed fraud are predicted to be fraud. The overall accuracy for this model would be 80%.
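The cutoffs chosen in b) and c) can be checked with a short sketch that counts true positives, false positives, and false negatives at a given cutoff. It assumes the same "p >= cutoff means fraud" rule, and `precision_recall` is a hypothetical helper name.

```python
# Validation table from Q2: (actual class, estimated P(fraud))
records = [
    ('fraud', 0.95), ('fraud', 0.91), ('fraud', 0.75), ('non-fraud', 0.67),
    ('fraud', 0.61), ('non-fraud', 0.46), ('fraud', 0.42), ('non-fraud', 0.25),
    ('non-fraud', 0.09), ('non-fraud', 0.04),
]

def precision_recall(cutoff, records, positive='fraud'):
    """Return (precision, recall) for the positive class at the given cutoff."""
    tp = fp = fn = 0
    for actual, p in records:
        predicted_positive = p >= cutoff
        if predicted_positive and actual == positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif actual == positive:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    recall = tp / (tp + fn) if (tp + fn) else float('nan')
    return precision, recall

for cutoff in (0.70, 0.30):
    precision, recall = precision_recall(cutoff, records)
    print('cutoff {0:.2f}: precision = {1:.2f}, recall = {2:.2f}'.format(cutoff, precision, recall))
```

At a cutoff of 0.70 every predicted fraud is a real fraud but two frauds are missed. At 0.30 every fraud is caught at the price of two false alarms.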

### d) Draw an ROC curve for your model.

In [None]:
import matplotlib.pyplot as plt

ps = [.04, .09, .25, .42, .46, .61, .67, .75, .91, .95]
actuals = ['non-fraud', 'non-fraud', 'non-fraud', 'fraud', 
           'non-fraud', 'fraud', 'non-fraud', 'fraud', 'fraud', 'fraud']
# Candidate cutoffs: every observed score plus 0 and 1, ordered from strictest to loosest
thresh = [0.00, .04, .09, .25, .42, .46, .61, .67, .75, .91, .95, 1.00]
thresh.sort(reverse = True)
tpr = []
fpr = []

for i in thresh:
    tp_c = 0
    fp_c = 0
    count_p = 0
    count_n = 0
    for p, a in zip(ps, actuals):
        if i <= p and a == 'fraud':
            tp_c += 1
            count_p += 1
        elif i <= p and a == 'non-fraud':
            fp_c += 1
            count_n += 1
        elif i > p and a == 'fraud':
            count_p += 1
        else:
            count_n += 1
        
    tpr.append(tp_c / count_p) 
    fpr.append(fp_c / count_n)

plt.plot(fpr, tpr, color = 'lightblue', linewidth = 3)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

Note: in some Jupyter setups the plot only renders the second time the cell above is run.
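The curve can be summarized by the area under it. A minimal sketch, assuming the pairwise (rank-based) reading of AUC: the share of (fraud, non-fraud) pairs in which the fraud record receives the higher score, with ties counted as half. The variable names are illustrative.

```python
ps = [.04, .09, .25, .42, .46, .61, .67, .75, .91, .95]
actuals = ['non-fraud', 'non-fraud', 'non-fraud', 'fraud',
           'non-fraud', 'fraud', 'non-fraud', 'fraud', 'fraud', 'fraud']

pos = [p for p, a in zip(ps, actuals) if a == 'fraud']
neg = [p for p, a in zip(ps, actuals) if a == 'non-fraud']

# Score 1 for each pair ranked correctly, 0.5 for a tie, 0 otherwise
wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
auc = wins / (len(pos) * len(neg))
print('AUC = {0:.2f}'.format(auc))
```

Of the 25 possible pairs, 22 are ranked correctly, so the area is 0.88.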

### e) Given the following three cost structures (see cost matrices below), determine the probability cutoff (threshold) value from the [0,1] interval that would give the best average misclassification cost performance for your model under each cost structure? (If there are several “best” threshold values for a given cost structure, providing one of them is sufficient.) In summary, for each of the three cost structures, report the best (lowest) average misclassification cost that can be achieved by your model, as well as the corresponding probability cutoff value that allows to achieve this cost.

In [None]:
        Cost 1               Cost 2                 Cost 3
         |  Actual            |  Actual              |  Actual
Predicted| Yes | No  Predicted|  Yes | No   Predicted|  Yes | No
      Yes|  0    1         Yes|   0    2          Yes|  0     1
       No|  1    0          No|   1    0           No|  2     0

In [None]:
import numpy as np
ps = [.04, .09, .25, .42, .46, .61, .67, .75, .91, .95]
actuals = ['non-fraud', 'non-fraud', 'non-fraud', 'fraud', 
           'non-fraud', 'fraud', 'non-fraud', 'fraud', 'fraud', 'fraud']
n = len(ps)
thresh = np.arange(0, 101, 1) / 100  # candidate cutoffs 0.00, 0.01, ..., 1.00
# Each cost pair is [cost of a false negative, cost of a false positive]
c1 = [1, 1]
c1_avg = []
c2 = [1, 2]
c2_avg = []
c3 = [2, 1]
c3_avg = []

for cut in thresh:
    c1_count = 0
    c2_count = 0
    c3_count = 0
    for p, a in zip(ps, actuals):
        # A record is predicted 'fraud' when p >= cut (same rule as the ROC cell)
        if p < cut and a == 'fraud':            # false negative
            c1_count += c1[0]
            c2_count += c2[0]
            c3_count += c3[0]
        elif p >= cut and a == 'non-fraud':     # false positive
            c1_count += c1[1]
            c2_count += c2[1]
            c3_count += c3[1]

    # Average misclassification cost per record at this cutoff
    c1_avg.append(c1_count / n)
    c2_avg.append(c2_count / n)
    c3_avg.append(c3_count / n)

c1_avg, c2_avg, c3_avg = np.array(c1_avg), np.array(c2_avg), np.array(c3_avg)
c1_min = c1_avg.min()
c2_min = c2_avg.min()
c3_min = c3_avg.min()

# First (lowest) cutoff that attains each minimum
t1 = thresh[c1_avg == c1_min][0]
t2 = thresh[c2_avg == c2_min][0]
t3 = thresh[c3_avg == c3_min][0]

print('The minimum average misclassification cost for cost structure 1 = {0} with an associated threshold of {1}'.format(c1_min, t1))
print('The minimum average misclassification cost for cost structure 2 = {0} with an associated threshold of {1}'.format(c2_min, t2))
print('The minimum average misclassification cost for cost structure 3 = {0} with an associated threshold of {1}'.format(c3_min, t3))


## Q3) (10 points)

Assume that you built a model for predicting consumer credit ratings and evaluated it on the validation dataset of 5 records. Based on the following 5 actual and predicted credit ratings (see table below), calculate the following performance metrics for your model: MAE, MAPE, RMSE, and Average error.

In [None]:
Actual Credit Rating | Predicted Credit Rating
                 670 | 710
                 680 | 660
                 550 | 600
                 740 | 800
                 700 | 600

In [None]:
import numpy as np
y = np.array([670, 680, 550, 740, 700])
y_hat = np.array([710, 660, 600, 800, 600])
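e = y_hat - y                    # signed errors (predicted - actual)
e_abs = np.abs(e)
n = len(y)
ape = np.abs(e / y)              # absolute percentage errors, as fractions

mae = e_abs.sum()/n
ae = e.sum()/n
mape = ape.sum()/n
rmse = np.sqrt((e**2).sum()/n)   # ** is exponentiation; ^ would be bitwise XOR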

print('MAE = {0}'.format(mae))
print('MAPE = {0:.2f}'.format(mape))
print('RMSE = {0:.2f}'.format(rmse))
print('Avg Error = {0}'.format(ae))
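As a hand check, assuming errors are defined as predicted minus actual (the convention used for `e` in the cell above), the four metrics for these five records work out as follows. MAPE is shown as a fraction, so 0.0808 corresponds to about 8.08%.

```latex
\begin{align*}
e_i &= \hat{y}_i - y_i, \qquad e = (40,\ -20,\ 50,\ 60,\ -100), \qquad n = 5 \\[4pt]
\text{MAE} &= \frac{1}{n}\sum_{i=1}^{n} |e_i| = \frac{40 + 20 + 50 + 60 + 100}{5} = \frac{270}{5} = 54 \\[4pt]
\text{Average error} &= \frac{1}{n}\sum_{i=1}^{n} e_i = \frac{40 - 20 + 50 + 60 - 100}{5} = \frac{30}{5} = 6 \\[4pt]
\text{MAPE} &= \frac{1}{n}\sum_{i=1}^{n} \left|\frac{e_i}{y_i}\right|
  = \frac{1}{5}\left(\frac{40}{670} + \frac{20}{680} + \frac{50}{550} + \frac{60}{740} + \frac{100}{700}\right) \approx 0.0808 \\[4pt]
\text{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}
  = \sqrt{\frac{1600 + 400 + 2500 + 3600 + 10000}{5}} = \sqrt{3620} \approx 60.17
\end{align*}
```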

## Q4) (10 points each)

### a) What is overfitting? Please explain it briefly and precisely.

Overfitting occurs when a model picks up chance occurrences in the training data that look like interesting patterns but do not generalize beyond the training data, which hurts the model's out-of-sample performance.
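A minimal sketch of the idea, under deliberately artificial assumptions: the labels are pure noise, so there is no real pattern to learn, and the model is a 1-nearest-neighbor rule that simply memorizes the training set. The data, the seed, and the helper names are all made up for illustration.

```python
import random

random.seed(0)

def make_data(n):
    # One random feature and a coin-flip label: nothing here generalizes
    return [(random.random(), random.choice([0, 1])) for _ in range(n)]

train = make_data(200)
test = make_data(200)

def predict_1nn(x, train):
    """Return the label of the training point whose feature is closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]

def accuracy(data, train):
    return sum(predict_1nn(x, train) == y for x, y in data) / len(data)

train_acc = accuracy(train, train)   # each point is its own nearest neighbor
test_acc = accuracy(test, train)     # memorized noise does not carry over

print('training accuracy: {0:.2f}'.format(train_acc))
print('test accuracy:     {0:.2f}'.format(test_acc))
```

The gap between the two numbers is the signature of overfitting: the model reproduces the training labels exactly, yet on new data it should do no better than guessing.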

### b) Briefly compare SVM Regression and Classification.

SVM Regression follows the same methodology as SVM Classification: it identifies a hyperplane, a margin, and support vectors by optimizing an objective function subject to a set of constraints. The key difference is that SVM Regression fits the hyperplane to the data and specifies a distance from it (epsilon) within which error is tolerated. Points inside this epsilon tube incur no penalty, and only the error of points outside the tube is minimized. SVM Classification instead looks for a linear discriminant that maximizes the margin between the classes. The margin boundaries are set by the support vectors on each side of the hyperplane, and SVC ignores all non-support-vector points that lie outside the margin.
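The contrast can be made concrete through the per-point loss each method penalizes. This is a sketch of the standard hinge and epsilon-insensitive losses, not of a full solver. The function names and the default `epsilon` value are illustrative choices.

```python
def hinge_loss(y, score):
    """SVC-style loss. y is -1 or +1; score is the signed distance-like output.

    The loss is zero once the point is on the correct side and outside the margin.
    """
    return max(0.0, 1.0 - y * score)

def epsilon_insensitive_loss(y, prediction, epsilon=0.1):
    """SVR-style loss. Zero while the prediction is within epsilon of the target."""
    return max(0.0, abs(y - prediction) - epsilon)

# Classification: a confident correct point costs nothing, a point inside the margin does
print(hinge_loss(+1, 2.0), hinge_loss(+1, 0.5))

# Regression: a small miss inside the tube costs nothing, a larger miss is penalized
print(epsilon_insensitive_loss(3.0, 3.05), epsilon_insensitive_loss(3.0, 3.5))
```

In both cases the points with zero loss do not pull on the fitted hyperplane, which is why only the support vectors matter.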

## Q5) (25 points)

Perform a predictive modeling analysis on this same dataset (Problem 5 of HW1) using the decision tree, k-NN techniques, logistic regression and SVM (explore how well model performs for several different hyper-parameter values). Present a brief overview of your predictive modeling process, explorations, and discuss your results. Make sure you present information about the model “goodness” (possible things to think about: confusion matrix, predictive accuracy, precision, recall, f-measure). Briefly discuss ROC and lift curves.

In [None]:
import pandas as pd
from sklearn import tree
from sklearn import neighbors
from sklearn import linear_model
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import numpy as np
from sklearn.metrics import roc_auc_score

In [None]:
columns = ['ID','Diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 
           'area_mean','smoothness_mean', 'compactness_mean', 'concavity_mean', 
           'concave_points_mean','symmetry_mean', 'fractal_dim_mean','radius_se', 
           'texture_se', 'perimeter_se','area_se', 'smoothness_se', 'compactness_se',
           'concavity_se', 'concave_points_se','symmetry_se', 'fractal_dim_se', 
           'radius_worst', 'texture_worst','perimeter_worst','area_worst',
           'smoothness_worst', 'compactness_worst', 'concavity_worst', 
           'concave_points_worst',  'symmetry_worst','fractal_dim_worst']
# Name the columns at load time so the numeric features keep their numeric dtypes
df1 = pd.read_csv('wdbc.data', header = None, names = columns)

In [None]:
class_series = df1['Diagnosis']
del df1['Diagnosis']
del df1['ID']

In [None]:
df1.head()

In [None]:
class_series.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df1, class_series, stratify = class_series,
                                                    test_size = 0.25, random_state = 0)

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Support Vector Machine - GridSearch for optimal hyper-parameter settings. I am new to tuning SVM classifiers, so I have specified a fairly large parameter grid.

Note that GridSearchCV occasionally returns slightly different best hyper-parameters from run to run, so the hyper-parameters hard-coded in the final model of each type may not match the output of a fresh search exactly.

In [None]:
hps = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['poly'], 'degree': [2, 3, 4, 5], 'C':[1, 10, 100, 1000]}]

svc_grid = GridSearchCV(SVC(), hps, cv = 5, scoring = 'roc_auc').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(svc_grid.best_score_))
print('Best hyper-parameters: {0}'.format(svc_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, svc_grid.predict(X_test)))

Below are the four competing models. For every model except the SVM, I have carried over the hyper-parameter settings from HW1, where I had already tuned them for this dataset. The SVM settings come from the grid search above.

In [None]:
lr_clf = linear_model.LogisticRegression(multi_class='ovr', solver ='liblinear', penalty = 'l2')
dt_clf = tree.DecisionTreeClassifier(criterion = "gini", max_depth = 4, min_samples_leaf = 1)
k_nn_clf = neighbors.KNeighborsClassifier(n_neighbors = 26, weights = 'distance')
svm_clf = SVC(kernel = 'rbf', gamma = 0.001, C = 100)

Next, build out classification reports for each model. Then compare the ROC_AUC score of the top two models. The intuition of the ROC curve was illustrated earlier in this notebook.

Logistic Regression Model Evaluation

In [None]:
lr_clf.fit(X_train, y_train)
y_true, y_pred = y_test, lr_clf.predict(X_test)
print(classification_report(y_true, y_pred))

Decision Tree Model Evaluation

In [None]:
dt_clf.fit(X_train, y_train)
y_true, y_pred = y_test, dt_clf.predict(X_test)
print(classification_report(y_true, y_pred))

k-NN Model Evaluation

In [None]:
k_nn_clf.fit(X_train, y_train)
y_true, y_pred = y_test, k_nn_clf.predict(X_test)
print(classification_report(y_true, y_pred))

SVC Model Evaluation

In [None]:
svm_clf.fit(X_train, y_train)
y_true, y_pred = y_test, svm_clf.predict(X_test)
print(classification_report(y_true, y_pred))

As shown above, Logistic Regression and SVC performed the best on this dataset. Below is a comparison of the two models in terms of the ROC_AUC score.

In [None]:
print('The roc_auc score for logistic regression:')
roc_auc_score(y_true, lr_clf.decision_function(X_test))

In [None]:
print('The roc_auc score for SVC:')
roc_auc_score(y_true, svm_clf.decision_function(X_test))

SVC had slightly more area under the ROC curve. Thus, I recommend using the SVC for this classification problem.
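The comparison above relies on ROC alone. A lift curve answers a related question: if only the top-scored records are acted on, how much richer in positives is that group than the dataset as a whole? A minimal sketch, reusing the ten validation scores from Q2 rather than the models fitted here (an assumption made to keep the example self-contained). `lift_at` is a hypothetical helper name.

```python
ps = [.04, .09, .25, .42, .46, .61, .67, .75, .91, .95]
actuals = ['non-fraud', 'non-fraud', 'non-fraud', 'fraud',
           'non-fraud', 'fraud', 'non-fraud', 'fraud', 'fraud', 'fraud']

# Rank records from the highest to the lowest estimated probability
ranked = sorted(zip(ps, actuals), reverse=True)
base_rate = sum(a == 'fraud' for a in actuals) / len(actuals)

def lift_at(k):
    """Positive rate among the top-k scored records divided by the overall positive rate."""
    hit_rate = sum(a == 'fraud' for _, a in ranked[:k]) / k
    return hit_rate / base_rate

for k in (3, 5, 10):
    print('lift in top {0}: {1:.2f}'.format(k, lift_at(k)))
```

A lift of 2.0 in the top three records means that group contains fraud at twice the base rate. Lift always falls back to 1.0 once every record is included.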

## Q6) (25 points)

Download the dataset on car evaluations from http://archive.ics.uci.edu/ml/datasets/Car+Evaluation (this link also has the description of the data). This dataset has 1728 records, each record representing a car evaluation. Each car evaluation is described with 7 attributes. 6 of the attributes represent car characteristics, such as buying price, price of the maintenance, number of doors, capacity in terms of persons to carry, the size of luggage boot, and estimated safety of the car. The seventh variable represents the evaluation of the car (unacceptable, acceptable, good, very good).

Your task: Among the basic classification techniques that you are familiar with (i.e., decision tree, k-NN, logistic regression, NB, SVM) use all that would be applicable to this dataset to predict the evaluation of the cars based on their characteristics. Explore how well these techniques perform for several different parameter values. Present a brief overview of your predictive modeling process, explorations, and discuss your results. Present your final model (i.e., the best predictive model that you were able to come up with), and discuss its performance in a comprehensive manner (overall accuracy; per-class performance, i.e., whether this model predicts all classes equally well, or if there some classes for which it does much better than others; etc.).

Note that in this classification problem your input variables are ordinal. Should you treat them as numeric or categorical? (What are pros and cons?) You can try building your models both ways; which demonstrate better predictive performance?

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
cars = pd.read_csv('car.data', header = None)
cars_df = pd.DataFrame(cars.values, columns = ['buying', 'maint', 'doors', 'persons', 'log_boot', 'safety', 'class'])
cars_df.head()

In [None]:
cars_df.describe()

In [None]:
cars_df.dtypes

In [None]:
classes = cars_df['class']
del cars_df['class']

In [None]:
classes.head()

I convert the dataframe using both OneHotEncoding and LabelEncoding. LabelEncoding treats each field as ordinal data, whereas OneHotEncoding treats each level of each factor variable as its own binary variable. LabelEncoding lets us keep only 6 dimensions, but it forces the questionable assumption that the levels of an ordinal column are equally spaced (i.e., the distance between 1->2 and 2->3 for the maint column is the same). For this dataset it is difficult to assess whether this assumption has merit, so I try both types of encoding and compare the resulting model performance.
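One caveat with integer codes is that they only carry ordinal meaning if they follow the natural order of the levels. Codes assigned in alphabetical order, which is what sorting string levels produces, do not. A small sketch using the four levels of the buying column as listed in the UCI description. The intended order is my assumption about what "ordinal" should mean here.

```python
levels_seen = ['vhigh', 'high', 'med', 'low']      # levels of the 'buying' column

# Codes assigned by sorting the strings alphabetically
alphabetical = {level: code for code, level in enumerate(sorted(levels_seen))}

# Codes assigned from an explicit low-to-high order
intended_order = ['low', 'med', 'high', 'vhigh']
ordinal = {level: code for code, level in enumerate(intended_order)}

print('alphabetical:', alphabetical)
print('ordinal:     ', ordinal)
```

With alphabetical codes, 'high' (0) ends up below 'low' (1), so a model that treats the column as numeric sees a scrambled scale. The explicit order keeps low < med < high < vhigh.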

OneHotEncoding

In [None]:
cars_df_dummies = pd.get_dummies(cars_df)
cars_df_dummies.head()

LabelEncoding

In [None]:
# Map each ordinal column to integer codes that follow its natural order.
# (A plain astype('category') sorts string levels alphabetically, which scrambles the order.)
level_orders = {
    'buying':   ['low', 'med', 'high', 'vhigh'],
    'maint':    ['low', 'med', 'high', 'vhigh'],
    'doors':    ['2', '3', '4', '5more'],
    'persons':  ['2', '4', 'more'],
    'log_boot': ['small', 'med', 'big'],
    'safety':   ['low', 'med', 'high'],
}
for col, order in level_orders.items():
    cars_df[col] = pd.Categorical(cars_df[col], categories = order, ordered = True)
    # Each column gets its own *_cat field (safety no longer overwrites log_boot_cat)
    cars_df[col + '_cat'] = cars_df[col].cat.codes

In [None]:
del cars_df['buying']
del cars_df['maint']
del cars_df['doors']
del cars_df['persons']
del cars_df['log_boot']
del cars_df['safety']
cars_df.head()

Below I tune the hyper-parameters and assess the prediction accuracy of the following models: Gaussian Naive Bayes, Logistic Regression, Decision Tree, k-NN, and SVM. For each model, I make sure the settings handle multi-class classification. I select the best model based on the classification reports, specifically the out-of-sample weighted-average f1-score. I first use the OneHotEncoded dataframe and then repeat the same process with the LabelEncoded dataframe to see whether either format is more conducive to a high-quality predictive model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cars_df_dummies, classes, stratify = classes,
                                                    test_size = 0.25, random_state = 0)

Gaussian Naive Bayes

In [None]:
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)
y_true, y_pred = y_test, gnb_clf.predict(X_test)
print(classification_report(y_true, y_pred))

Logistic Regression

In [None]:
lr_hps = {'penalty': ['l1', 'l2']}
lr_clf = linear_model.LogisticRegression(multi_class = 'multinomial', solver ='saga')
lr_grid = GridSearchCV(lr_clf, lr_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(lr_grid.best_score_))
print('Best hyper-parameters: {0}'.format(lr_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, lr_grid.predict(X_test)))

In [None]:
lr_clf = linear_model.LogisticRegression(multi_class = 'multinomial', solver ='saga', penalty = 'l1')
lr_clf.fit(X_train, y_train)
y_true, y_pred = y_test, lr_clf.predict(X_test)
print(classification_report(y_true, y_pred))

Decision Tree

In [None]:
depths = list(range(1,20))
leaf_mins = list(range(1, 20))
dt_hps = dict(max_depth = depths, min_samples_leaf = leaf_mins)
dt_clf = tree.DecisionTreeClassifier(criterion = 'entropy')
dt_grid = GridSearchCV(dt_clf, dt_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(dt_grid.best_score_))
print('Best hyper-parameters: {0}'.format(dt_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, dt_grid.predict(X_test)))

In [None]:
dt_clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 15, min_samples_leaf = 1)
dt_clf.fit(X_train, y_train)
y_true, y_pred = y_test, dt_clf.predict(X_test)
print(classification_report(y_true, y_pred))

k-NN

In [None]:
k_range = list(range(1,30))
weight_options = ["uniform", "distance"]
knn_hps = dict(n_neighbors = k_range, weights = weight_options)
knn_clf = neighbors.KNeighborsClassifier()
knn_grid = GridSearchCV(knn_clf, knn_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(knn_grid.best_score_))
print('Best hyper-parameters: {0}'.format(knn_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, knn_grid.predict(X_test)))

In [None]:
knn_clf = neighbors.KNeighborsClassifier(n_neighbors = 9, weights = 'distance')
knn_clf.fit(X_train, y_train)
y_true, y_pred = y_test, knn_clf.predict(X_test)
print(classification_report(y_true, y_pred))

SVC

In [None]:
svm_hps = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['poly'], 'degree': [2, 3, 4, 5], 'C':[1, 10, 100, 1000]}]

svm_clf = SVC(decision_function_shape = 'ovr')

svm_grid = GridSearchCV(svm_clf, svm_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(svm_grid.best_score_))
print('Best hyper-parameters: {0}'.format(svm_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, svm_grid.predict(X_test)))

In [None]:
svm_clf = SVC(decision_function_shape = 'ovr', C = 1000, degree = 3, kernel = 'poly')
svm_clf.fit(X_train, y_train)
y_true, y_pred = y_test, svm_clf.predict(X_test)
print(classification_report(y_true, y_pred))

LabelEncoded

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cars_df, classes, stratify = classes,
                                                    test_size = 0.25, random_state = 0)

Gaussian Naive Bayes

In [None]:
gnb_clf = GaussianNB()
gnb_clf.fit(X_train, y_train)
y_true, y_pred = y_test, gnb_clf.predict(X_test)
print(classification_report(y_true, y_pred))

Logistic Regression

In [None]:
lr_hps = {'penalty': ['l1', 'l2']}
lr_clf = linear_model.LogisticRegression(multi_class = 'multinomial', solver ='saga')
lr_grid = GridSearchCV(lr_clf, lr_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(lr_grid.best_score_))
print('Best hyper-parameters: {0}'.format(lr_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, lr_grid.predict(X_test)))

In [None]:
lr_clf = linear_model.LogisticRegression(multi_class = 'multinomial', solver ='saga', penalty = 'l1')
lr_clf.fit(X_train, y_train)
y_true, y_pred = y_test, lr_clf.predict(X_test)
print(classification_report(y_true, y_pred))

Decision Tree

In [None]:
depths = list(range(1,20))
leaf_mins = list(range(1, 20))
dt_hps = dict(max_depth = depths, min_samples_leaf = leaf_mins)
dt_clf = tree.DecisionTreeClassifier(criterion = 'entropy')
dt_grid = GridSearchCV(dt_clf, dt_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(dt_grid.best_score_))
print('Best hyper-parameters: {0}'.format(dt_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, dt_grid.predict(X_test)))

In [None]:
dt_clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 10, min_samples_leaf = 8)
dt_clf.fit(X_train, y_train)
y_true, y_pred = y_test, dt_clf.predict(X_test)
print(classification_report(y_true, y_pred))

k-NN

In [None]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [None]:
k_range = list(range(1,30))
weight_options = ["uniform", "distance"]
knn_hps = dict(n_neighbors = k_range, weights = weight_options)
knn_clf = neighbors.KNeighborsClassifier()
knn_grid = GridSearchCV(knn_clf, knn_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(knn_grid.best_score_))
print('Best hyper-parameters: {0}'.format(knn_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, knn_grid.predict(X_test)))

In [None]:
knn_clf = neighbors.KNeighborsClassifier(n_neighbors = 13, weights = 'uniform')
knn_clf.fit(X_train, y_train)
y_true, y_pred = y_test, knn_clf.predict(X_test)
print(classification_report(y_true, y_pred))

SVM

Warning: This next cell takes a substantial amount of time to complete and does not provide any insight that alters the final recommendation. Feel free to skip.

In [None]:
svm_hps = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]},
                    {'kernel': ['poly'], 'degree': [2, 3, 4, 5], 'C':[1, 10, 100, 1000]}]

svm_clf = SVC(decision_function_shape = 'ovr')

svm_grid = GridSearchCV(svm_clf, svm_hps, cv = 5, scoring = 'f1_weighted').fit(X_train, y_train)

In [None]:
print('Best score: {0}'.format(svm_grid.best_score_))
print('Best hyper-parameters: {0}'.format(svm_grid.best_params_))
print("Prediction Accuracy: ", accuracy_score(y_test, svm_grid.predict(X_test)))

In [None]:
svm_clf = SVC(decision_function_shape = 'ovr', C = 1000, degree = 4, kernel = 'poly')
svm_clf.fit(X_train, y_train)
y_true, y_pred = y_test, svm_clf.predict(X_test)
print(classification_report(y_true, y_pred))

### Final Recommendation

Across all models, predictive performance dropped substantially when I switched to the LabelEncoded data. This suggests that, for this dataset, it is not safe to assume the levels of an ordinal column are equally spaced. Among the models fit to the OneHotEncoded data, the SVM classifier performed best and the Decision Tree came in second. Keep in mind that the SVM took by far the longest to fit. If the model will be retrained or used at high volume, the Decision Tree may be the more practical choice. Below are the confusion matrices and test accuracy for these two models.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(cars_df_dummies, classes, stratify = classes,
                                                    test_size = 0.25, random_state = 0)

In [None]:
svm_clf = SVC(decision_function_shape = 'ovr', C = 1000, degree = 3, kernel = 'poly')
svm_clf.fit(X_train, y_train)
y_true, y_pred = y_test, svm_clf.predict(X_test)
svm_cm = confusion_matrix(y_test, y_pred, labels = ['unacc', 'acc', 'good', 'vgood'])  # 'vgood' is the label used in car.data
print('SVM Confusion Matrix:')
print(svm_cm)
print('')
print("Prediction Accuracy: ", accuracy_score(y_test, svm_clf.predict(X_test)))

In [None]:
dt_clf = tree.DecisionTreeClassifier(criterion = 'entropy', max_depth = 15, min_samples_leaf = 1)
dt_clf.fit(X_train, y_train)
y_true, y_pred = y_test, dt_clf.predict(X_test)
dt_cm = confusion_matrix(y_test, y_pred, labels = ['unacc', 'acc', 'good', 'vgood'])  # 'vgood' is the label used in car.data
print('Decision Tree Confusion Matrix:')
print(dt_cm)
print('')
print("Prediction Accuracy: ", accuracy_score(y_test, dt_clf.predict(X_test)))

On the test set, the SVM classifies every record correctly, whereas the Decision Tree has the most difficulty predicting the class label 'acc'. Otherwise, the Decision Tree performs extremely well.
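Per-class performance can be read straight off a confusion matrix. The sketch below shows the arithmetic on hypothetical counts, which are made up for illustration and are not the output of the models above. It assumes the scikit-learn layout in which rows are actual classes and columns are predicted classes.

```python
labels = ['unacc', 'acc', 'good', 'vgood']

# Hypothetical counts for illustration only; rows = actual, columns = predicted
cm = [
    [300,  3,  0,  0],
    [  4, 90,  2,  0],
    [  0,  1, 16,  0],
    [  0,  1,  0, 15],
]

per_class = {}
for i, label in enumerate(labels):
    tp = cm[i][i]
    actual_total = sum(cm[i])                     # row sum: records truly in this class
    predicted_total = sum(row[i] for row in cm)   # column sum: records predicted as this class
    per_class[label] = {
        'precision': tp / predicted_total,
        'recall': tp / actual_total,
    }

for label in labels:
    print('{0:>6}: precision = {1:.3f}, recall = {2:.3f}'.format(
        label, per_class[label]['precision'], per_class[label]['recall']))
```

A class whose off-diagonal counts are concentrated in one neighboring row or column, as 'acc' is here, shows up with lower precision or recall than the rest.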