### Comparing Models

Now that you have seen a variety of models for regression and classification problems, it is good to step back and weigh the pros and cons of these options.  In the case of classification models, there are at least three things to consider:

1. Is the model good at handling imbalanced classes?
2. Does the model train quickly?
3. Does the model yield interpretable results?

The importance of each consideration varies with your dataset and goals.  Your task is to review the models covered so far and discuss the pros and cons of each.  Two example datasets are provided to represent two very different tasks, in which the interpretability of the model may matter to differing degrees.
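To make the interpretability question concrete, here is a minimal sketch of what "interpretable results" can look like for two of the models.  It uses scikit-learn's built-in breast cancer dataset purely as a stand-in (an assumption for illustration, not one of the assignment datasets), and the `max_depth=2` setting is chosen only to keep the printed rules short.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset for illustration only (not the churn or digits data)
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = list(data.feature_names)

# Logistic regression: with standardized features, the size of each
# coefficient gives a rough ranking of how strongly a feature shifts the log-odds
log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
log_reg.fit(X, y)
coefs = log_reg[-1].coef_.ravel()
top = np.argsort(np.abs(coefs))[::-1][:5]
for i in top:
    print(f"{feature_names[i]:<25} {coefs[i]:+.3f}")

# Decision tree: a shallow tree can be printed as a set of if/else rules
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

KNN and a kernel SVM offer no comparably direct summary, which is one reason interpretability can favor logistic regression or a small tree even when their scores are slightly lower.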

### Data and Task

Your goal is to discuss the pros and cons of Logistic Regression, Decision Trees, KNN, and SVM for the tasks below.  Consider at least the three questions above and list any additional considerations you believe are important to determining the "best" model for the task.  Share your response with your peers on the class discussion board.  

**TASK 1**: Predicting Customer Churn

Suppose you are tasked with producing a model to predict customer churn.  Which of your classification models would you use and what are the pros and cons of this model for this task?  Be sure to consider interpretability, imbalanced classes, and the speed of training.



In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import time 

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.linear_model import LogisticRegression, SGDRegressor, LinearRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix

from sklearn.datasets import load_digits

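def evaluate_model1(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (train accuracy, test accuracy, fit time in seconds)."""
    # Time only the call to fit so that scoring does not inflate the fit time
    start_time = time.time()
    model.fit(X_train, y_train)
    fit_time = time.time() - start_time

    # For classifiers, score() reports mean accuracy
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    return train_score, test_score, fit_time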

The churn data is loaded below.  Note that the handwritten digit data used in Task 2 is already split into features and target (`digits`, `labels`) when it is loaded.

In [8]:
churn = pd.read_csv('data/telecom_churn.csv').dropna()
churn = churn.drop(['State'],axis=1)
#churn.head()

categorical = ['International plan', 'Voice mail plan']
numerical = churn.drop(columns=categorical).columns
numerical = numerical.drop(['Churn'])

In [9]:
X = churn.drop('Churn', axis=1)
y = churn['Churn'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical),
        ('cat', OneHotEncoder(), categorical)
    ])

# Create Pipelines
log_reg = make_pipeline(preprocessor, LogisticRegression(max_iter=10000))
dec_tree = make_pipeline(preprocessor, DecisionTreeClassifier())
knn = make_pipeline(preprocessor, KNeighborsClassifier())
svm = make_pipeline(preprocessor, SVC())

# Evaluating models
lr_train_score, lr_test_score, lr_fit_time = evaluate_model1(log_reg,X_train,y_train,X_test,y_test)
dt_train_score, dt_test_score, dt_fit_time = evaluate_model1(dec_tree,X_train,y_train,X_test,y_test)
knn_train_score, knn_test_score, knn_fit_time = evaluate_model1(knn,X_train,y_train,X_test,y_test)
svm_train_score, svm_test_score, svm_fit_time = evaluate_model1(svm,X_train,y_train,X_test,y_test)

# Creating a results DataFrame
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'KNN', 'SVM'],
    'Training Score': [lr_train_score, dt_train_score, knn_train_score, svm_train_score],
    'Test Score': [lr_test_score, dt_test_score, knn_test_score, svm_test_score],
    'Fit Time': [lr_fit_time, dt_fit_time, knn_fit_time, svm_fit_time]
})

print(results)

                 Model  Training Score  Test Score  Fit Time
0  Logistic Regression        0.863545    0.859712  0.036906
1        Decision Tree        1.000000    0.912470  0.041891
2                  KNN        0.912765    0.887290  0.008976
3                  SVM        0.935574    0.919664  0.136637
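Because `score` reports accuracy, these numbers are easier to judge next to a majority-class baseline, which is what the imported `DummyClassifier` and `confusion_matrix` are useful for.  The sketch below is a hedged illustration on synthetic data from `make_classification`; the roughly 85/15 class split is an assumption made for the example and is not taken from the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed ~85% / 15% split; not the churn data)
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": LogisticRegression(max_iter=10000),
    "logreg_balanced": LogisticRegression(max_iter=10000, class_weight="balanced"),
}

scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    # Balanced accuracy averages recall over classes, so always predicting
    # the majority class scores 0.5 no matter how skewed the data is
    scores[name] = (accuracy_score(y_te, pred), balanced_accuracy_score(y_te, pred))
    print(name, scores[name])
    print(confusion_matrix(y_te, pred))
```

If a model's accuracy sits close to the baseline's, most of its score may come from predicting the majority class; the confusion matrix shows how many minority-class cases it actually catches.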


**TASK 2**: Recognizing Handwritten Digits

Suppose you are tasked with training a model to recognize handwritten digits.  Which of your classifiers would you use here and why?  Again, be sure to consider the balance of classes, speed of training, and importance of interpretability.
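One quick way to address the class-balance question for this task is to count the examples per digit.  This is a minimal sketch using the same `load_digits` call as the notebook; the ratio printed at the end is just one possible summary of balance.

```python
from sklearn.datasets import load_digits

# Load features and target as pandas objects, as in the notebook
digits, labels = load_digits(return_X_y=True, as_frame=True)

# Number of examples for each digit 0-9
class_counts = labels.value_counts().sort_index()
print(class_counts)

# A ratio near 1 means the classes are close to evenly represented
ratio = class_counts.min() / class_counts.max()
print(f"smallest/largest class ratio: {ratio:.2f}")
```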



In [10]:
digits, labels = load_digits(return_X_y=True, as_frame=True)
X_train2, X_test2, y_train2, y_test2 = train_test_split(digits, labels, random_state = 42)


# Evaluating models: time only the call to fit, then score on train and test
start_time = time.time()
logreg2 = LogisticRegression(max_iter=10000).fit(X_train2, y_train2)
logreg2_time = time.time() - start_time
logreg2_train_score = logreg2.score(X_train2, y_train2)
logreg2_test_score = logreg2.score(X_test2, y_test2)

start_time = time.time()
dectree2 = DecisionTreeClassifier().fit(X_train2, y_train2)
dectree2_time = time.time() - start_time
dectree2_train_score = dectree2.score(X_train2, y_train2)
dectree2_test_score = dectree2.score(X_test2, y_test2)

start_time = time.time()
knn2 = KNeighborsClassifier().fit(X_train2, y_train2)
knn2_time = time.time() - start_time
knn2_train_score = knn2.score(X_train2, y_train2)
knn2_test_score = knn2.score(X_test2, y_test2)

start_time = time.time()
svm2 = SVC().fit(X_train2, y_train2)
svm2_time = time.time() - start_time
svm2_train_score = svm2.score(X_train2, y_train2)
svm2_test_score = svm2.score(X_test2, y_test2)

# Creating a results DataFrame
results2 = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'KNN', 'SVM'],
    'Training Score': [logreg2_train_score, dectree2_train_score, knn2_train_score, svm2_train_score],
    'Test Score': [logreg2_test_score, dectree2_test_score, knn2_test_score, svm2_test_score],
    'Fit Time': [logreg2_time, dectree2_time, knn2_time, svm2_time]
})

print(results2.head())

                 Model  Training Score  Test Score  Fit Time
0  Logistic Regression        1.000000    0.973333  0.119295
1        Decision Tree        1.000000    0.860000  0.014960
2                  KNN        0.988864    0.993333  0.002517
3                  SVM        0.996288    0.986667  0.038898
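The scores and fit times above come from a single train/test split and a single timed fit, so small differences between models may be noise.  One way to check, sketched here under the assumption that 5-fold cross-validation is an acceptable protocol for this comparison, is `cross_validate`, which returns per-fold test scores and fit times.  The `random_state` on the tree is added only to make the sketch repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

summary = {}
for name, model in models.items():
    # cv=5 fits each model five times, once per held-out fold
    cv = cross_validate(model, X, y, cv=5)
    summary[name] = (cv["test_score"].mean(),
                     cv["test_score"].std(),
                     cv["fit_time"].mean())
    print(f"{name:<20} accuracy {summary[name][0]:.3f} "
          f"+/- {summary[name][1]:.3f}, mean fit time {summary[name][2]:.4f}s")
```

Note that KNN's short fit time reflects that it mostly stores the training data at fit time; much of its cost is paid at prediction time, which `cross_validate` reports separately as `score_time`.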
