# Lab 2: classification methods

This lab is due by midnight on Saturday, Feb 19th.

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn import neighbors
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split


In [3]:
# You will need to change this for your environment
DATA_ROOT = '../ALL CSV FILES - 2nd Edition/'

In [4]:
# Note the 'index_col' argument here, which makes slicing easier below.
market = pd.read_csv(DATA_ROOT + 'Smarket.csv', index_col=0, parse_dates=True)
market.head()

Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
2001-01-01,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2001-01-01,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
2001-01-01,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
2001-01-01,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
2001-01-01,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up


## Logistic Regression


In [4]:
# We will re-use this formula with other learning methods below
all_lags = 'Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume'

marklr = smf.glm(formula=all_lags, data=market, family=sm.families.Binomial())
mlr_res = marklr.fit()
print(mlr_res.summary())

# The predicted values are probabilities
mlr_prob = mlr_res.predict()
print('predicted probabilities:', mlr_prob[0:10])

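# Here we create a set of qualitative predictions by thresholding on the probabilities.
# The fitted values are P(Down) (see 'Dep. Variable' in the summary), so values below 0.5 map to "Up".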
predictions_nominal = ["Up" if x < 0.5 else "Down" for x in mlr_prob]
print('qualitative predictions:', predictions_nominal[0:10])

# Note: the '.T' here to take the transpose so that the true classes are columns and the predicted classes are rows,
# matching the class slides
print('confusion matrix:\n', confusion_matrix(market["Direction"], predictions_nominal).T)

print(classification_report(market["Direction"], predictions_nominal, digits=3))

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['Direction[Down]', 'Direction[Up]']   No. Observations:                 1250
Model:                                              GLM   Df Residuals:                     1243
Model Family:                                  Binomial   Df Model:                            6
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -863.79
Date:                                  Sat, 12 Feb 2022   Deviance:                       1727.6
Time:                                          20:20:51   Pearson chi2:                 1.25e+03
No. Iterations:                                       4                                         
Covariance Type:                              nonrobust                                         
                 coef    std e
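The `Dep. Variable` line of the summary lists `Direction[Down]` first, which is why probabilities below 0.5 are labelled "Up" in the cell above. The same thresholding step can be sketched with NumPy; the array `p_down` and its values are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

# Hypothetical fitted probabilities of the "Down" class (illustrative values only).
p_down = np.array([0.49, 0.52, 0.50, 0.47, 0.61])

# Predict "Up" when P(Down) < 0.5, otherwise "Down".
labels = np.where(p_down < 0.5, "Up", "Down")
print(labels.tolist())
```

A probability of exactly 0.5 falls on the "Down" side here, matching the list comprehension used in the lab code.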

In [5]:
# Split the data into training and test sets, training on everything up to and including 2004 data
# and testing on 2005 and later data:
x_train = market[:'2004'][:]
y_train = market[:'2004']['Direction']

x_test = market['2005':][:]
y_test = market['2005':]['Direction']
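The split above is chronological rather than random, and pandas label slices include both endpoints, so `:'2004'` keeps all of 2004. As a rough sketch of the same idea without a date index, a boolean mask on a year column gives the same partition. The `years` array below is a made-up stand-in, not the Smarket data.

```python
import numpy as np

# Hypothetical year labels, one per observation (illustrative values only).
years = np.array([2001, 2002, 2003, 2004, 2004, 2005, 2005])

train_mask = years <= 2004   # everything up to and including 2004
test_mask = ~train_mask      # 2005 and later

train_idx = np.flatnonzero(train_mask)
test_idx = np.flatnonzero(test_mask)
print(train_idx, test_idx)
```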

In [6]:
# Fit a logistic regression to the training data and (below) evaluate it using the test data
mlr_04 = smf.glm(formula=all_lags, data=x_train, family=sm.families.Binomial())
res_04 = mlr_04.fit()
print(res_04.summary())

# Build predictions of the test data using a 0.5 threshold
prob_04 = res_04.predict(x_test)
pred_04 = ['Up' if x < 0.5 else 'Down' for x in prob_04]

print('confusion matrix:\n', confusion_matrix(y_test, pred_04).T)
print(classification_report(y_test, pred_04))

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['Direction[Down]', 'Direction[Up]']   No. Observations:                  998
Model:                                              GLM   Df Residuals:                      991
Model Family:                                  Binomial   Df Model:                            6
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -690.55
Date:                                  Sat, 12 Feb 2022   Deviance:                       1381.1
Time:                                          20:30:57   Pearson chi2:                     998.
No. Iterations:                                       4                                         
Covariance Type:                              nonrobust                                         
                 coef    std e

## Your job: build and test a LR model with only the two predictors with the best p-values above

Looking at the model summary above, that will be Lag1 and Lag2.

Build the new model below, and generate a new confusion matrix and classification report as above.

In [7]:
# Build a model using just lag1 and lag2 and test it (skip the code for the lab)

slr = smf.glm(formula='Direction ~ Lag1 + Lag2', data=x_train, family=sm.families.Binomial())
slr_fit = slr.fit()
print(slr_fit.summary())
prob_slr = slr_fit.predict(x_test)
pred_slr = ['Up' if x < 0.5 else 'Down' for x in prob_slr]
print('confusion matrix:\n', confusion_matrix(y_test, pred_slr).T)
print(classification_report(y_test, pred_slr))

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['Direction[Down]', 'Direction[Up]']   No. Observations:                  998
Model:                                              GLM   Df Residuals:                      995
Model Family:                                  Binomial   Df Model:                            2
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -690.70
Date:                                  Sat, 12 Feb 2022   Deviance:                       1381.4
Time:                                          20:32:43   Pearson chi2:                     998.
No. Iterations:                                       4                                         
Covariance Type:                              nonrobust                                         
                 coef    std e

## Questions 1 - 3

Question 1: How does the overall accuracy of this smaller model compare with that of the full model above?

Question 2: Show how to use the confusion matrix to derive the overall accuracy as shown in the classification report.
(The calculations can be typed here and do not have to be shown with code.)

Question 3: How does the interpretability of the second model compare with the first in your opinion? Justify your answer.


Q1
For the smaller model, the accuracy is 0.56; for the bigger model above, it is only 0.48. The smaller model is more accurate, although its accuracy is still not high.

Q2
smaller model: (35+106)/(35+35+76+106)=141/252=0.5595 
bigger model: (77+44)/(77+97+34+44)=121/252=0.4802
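The arithmetic above is the sum of the diagonal divided by the sum of all cells. A minimal sketch with NumPy follows; the cell counts come from the calculation above, and the placement of the two off-diagonal counts is assumed (it does not affect accuracy).

```python
import numpy as np

# Cell counts from the Q2 arithmetic; off-diagonal placement is an assumption.
small = np.array([[35, 35], [76, 106]])
big = np.array([[77, 97], [34, 44]])

def accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    return np.trace(cm) / cm.sum()

acc_small = accuracy(small)
acc_big = accuracy(big)
print(round(acc_small, 4), round(acc_big, 4))
```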

Q3
Because both models are logistic regressions, their interpretability does not differ much. Each coefficient is read the same way: if X increases by 1 while all other predictors are held constant, the log odds, log(P(Down)/P(Up)), change by the value of the coefficient. The second model is somewhat more interpretable because it has fewer variables. The bigger model includes Volume, which is a different kind of variable from the lags, and this makes it slightly trickier to interpret. Also, with fewer lag variables in the smaller model, correlation among the lag predictors is a less serious issue.
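To make the log-odds reading concrete: a coefficient of beta on the log-odds scale corresponds to multiplying the odds by exp(beta) for each one-unit increase in the predictor. The value of `beta` below is hypothetical and chosen only for illustration; it is not read from the model summaries.

```python
import math

# Hypothetical coefficient for one predictor (illustrative value, not from the fitted model).
beta = -0.05

# A one-unit increase in the predictor multiplies the odds P(Down)/P(Up) by exp(beta).
odds_ratio = math.exp(beta)
print(round(odds_ratio, 3))
```

An odds ratio just below 1 means the odds of "Down" shrink slightly as the predictor increases, with the other predictors held fixed.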

## K-Nearest Neighbors

We now build a model for the same data with K-Nearest Neighbors.

In [8]:
knn = neighbors.KNeighborsClassifier(n_neighbors=1)

# Restrict the training and test data to only have the 'Lag1' and 'Lag2' predictor variables.
# (This code fits the model and makes predictions in one line.)
pred = knn.fit(x_train[['Lag1', 'Lag2']], y_train).predict(x_test[['Lag1', 'Lag2']])

print('KNN confusion matrix:\n', confusion_matrix(y_test, pred).T)
print(classification_report(y_test, pred))

KNN confusion matrix:
 [[43 58]
 [68 83]]
              precision    recall  f1-score   support

        Down       0.43      0.39      0.41       111
          Up       0.55      0.59      0.57       141

    accuracy                           0.50       252
   macro avg       0.49      0.49      0.49       252
weighted avg       0.50      0.50      0.50       252



In [9]:
# KNN with K of 1 performed poorly, let's try K of 3

knn = neighbors.KNeighborsClassifier(n_neighbors=3)
pred = knn.fit(x_train[['Lag1', 'Lag2']], y_train).predict(x_test[['Lag1', 'Lag2']])

print('KNN confusion matrix:\n', confusion_matrix(y_test, pred).T)
print(classification_report(y_test, pred))

KNN confusion matrix:
 [[48 55]
 [63 86]]
              precision    recall  f1-score   support

        Down       0.47      0.43      0.45       111
          Up       0.58      0.61      0.59       141

    accuracy                           0.53       252
   macro avg       0.52      0.52      0.52       252
weighted avg       0.53      0.53      0.53       252



## Your task: try some more values for K (number of neighbors) and report on which has best overall accuracy

In [17]:
# That was an improvement, try some other values to compare

def print_knn(k):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    pred = knn.fit(x_train[['Lag1', 'Lag2']], y_train).predict(x_test[['Lag1', 'Lag2']])
    print('KNN confusion matrix:\n', confusion_matrix(y_test, pred).T)
    print(classification_report(y_test, pred))
    
for k in range(1,10):
    print("k =",k)
    print_knn(k)
    print()

k = 1
KNN confusion matrix:
 [[43 58]
 [68 83]]
              precision    recall  f1-score   support

        Down       0.43      0.39      0.41       111
          Up       0.55      0.59      0.57       141

    accuracy                           0.50       252
   macro avg       0.49      0.49      0.49       252
weighted avg       0.50      0.50      0.50       252


k = 2
KNN confusion matrix:
 [[74 93]
 [37 48]]
              precision    recall  f1-score   support

        Down       0.44      0.67      0.53       111
          Up       0.56      0.34      0.42       141

    accuracy                           0.48       252
   macro avg       0.50      0.50      0.48       252
weighted avg       0.51      0.48      0.47       252


k = 3
KNN confusion matrix:
 [[48 55]
 [63 86]]
              precision    recall  f1-score   support

        Down       0.47      0.43      0.45       111
          Up       0.58      0.61      0.59       141

    accuracy                        

## Question 4:

Question 4: Which of the other K values that you tried for K-Nearest neighbors worked the best, based on overall accuracy?

Q4
When k=3, the overall accuracy is the highest among all k from 1 to 9 (the loop uses range(1, 10)). It is 0.53.
k=4 has the second highest accuracy. It is 0.52, only slightly less accurate than k=3. Apart from a dip at k=2 (0.48), accuracy rises to its peak at k=3 and then decreases as k increases further.
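One thing to keep in mind when comparing k values: with two classes and an even k, the neighbor vote can tie, and the tie has to be broken somehow. The sketch below is a plain NumPy majority vote in which ties go to the lowest class index. That tie rule is an assumption made for illustration, not a statement about scikit-learn's internals, although it would be consistent with the k = 2 run above predicting Down far more often (167 of 252 cases) than k = 1 (101) or k = 3 (103).

```python
import numpy as np

def majority_vote(neighbor_labels, n_classes=2):
    """Return the winning class index; ties go to the lowest index (argmax behaviour)."""
    counts = np.bincount(neighbor_labels, minlength=n_classes)
    return int(np.argmax(counts))

# Class 0 = "Down", class 1 = "Up" (alphabetical order, as in the reports above).
tie_result = majority_vote(np.array([0, 1]))        # one vote each with k = 2
clear_result = majority_vote(np.array([1, 1, 0]))   # clear majority with k = 3
print(tie_result, clear_result)
```

Restricting the search to odd k avoids ties altogether in a two-class problem.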

## Linear discriminant analysis

In [18]:
lda = LinearDiscriminantAnalysis()
ldm = lda.fit(x_train[['Lag1', 'Lag2']], y_train)

print('Priors:', ldm.priors_)
print('Means:', ldm.means_)
print('Coefficients:', ldm.coef_)

pred = ldm.predict(x_test[['Lag1', 'Lag2']])
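# Note: the arguments are (pred, y_test) here, so after '.T' the rows are the true classes and the
# columns are the predicted classes, the opposite orientation to the confusion matrices printed earlier.
print(confusion_matrix(pred, y_test).T)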
print(classification_report(y_test, pred))


Priors: [0.49198397 0.50801603]
Means: [[ 0.04279022  0.03389409]
 [-0.03954635 -0.03132544]]
Coefficients: [[-0.05544078 -0.0443452 ]]
[[ 35  76]
 [ 35 106]]
              precision    recall  f1-score   support

        Down       0.50      0.32      0.39       111
          Up       0.58      0.75      0.66       141

    accuracy                           0.56       252
   macro avg       0.54      0.53      0.52       252
weighted avg       0.55      0.56      0.54       252
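The orientation of a confusion matrix depends on both the argument order and any transpose. A small hand-rolled counter makes this visible; `confusion` is a hypothetical helper written for this illustration (it is not the scikit-learn function), and the labels are made up.

```python
import numpy as np

def confusion(rows, cols, labels):
    """Entry [i, j] counts positions where rows == labels[i] and cols == labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    table = np.zeros((len(labels), len(labels)), dtype=int)
    for r, c in zip(rows, cols):
        table[index[r], index[c]] += 1
    return table

# Made-up labels for illustration only.
y_true = ["Down", "Down", "Down", "Up", "Up"]
y_pred = ["Down", "Up", "Up", "Up", "Up"]
labels = ["Down", "Up"]

true_by_pred = confusion(y_true, y_pred, labels)   # rows = true, columns = predicted
pred_by_true = confusion(y_pred, y_true, labels)   # rows = predicted, columns = true
print(true_by_pred)
print(pred_by_true.T)
```

Swapping the arguments and then transposing gives back the rows-are-true layout, so the row sums match the `support` column of the classification report.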



## Quadratic discriminant analysis

In [19]:
qda = QuadraticDiscriminantAnalysis()
qdm = qda.fit(x_train[['Lag1', 'Lag2']], y_train)

print('Priors:', qdm.priors_)
print('Means:', qdm.means_)

q_pred = qdm.predict(x_test[['Lag1', 'Lag2']])
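# With (q_pred, y_test) and '.T', the rows are the true classes and the columns are the predicted classes.
print(confusion_matrix(q_pred, y_test).T)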
print(classification_report(y_test, q_pred))

Priors: [0.49198397 0.50801603]
Means: [[ 0.04279022  0.03389409]
 [-0.03954635 -0.03132544]]
[[ 30  81]
 [ 20 121]]
              precision    recall  f1-score   support

        Down       0.60      0.27      0.37       111
          Up       0.60      0.86      0.71       141

    accuracy                           0.60       252
   macro avg       0.60      0.56      0.54       252
weighted avg       0.60      0.60      0.56       252



## Question 5

Question 5: which of the methods that you tried produced the best results for predicting Direction from Lag1 and Lag2?

Q5
QDA performs best. Its accuracy is 0.60, higher than LDA (0.56), logistic regression (0.56), and KNN (0.53 at k=3). This suggests that the decision boundary may not be linear and may be better approximated by a quadratic one.
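A useful reference point for these numbers is the accuracy of always predicting the majority class. The sketch below uses only figures printed above (the `support` column and the diagonals of the confusion matrices); the comparison itself is an added check, not part of the lab instructions.

```python
# Test-set class counts from the "support" column of the reports above.
support = {"Down": 111, "Up": 141}
n_test = sum(support.values())

# Correct predictions (diagonal of each confusion matrix printed above).
correct = {"KNN (k=3)": 48 + 86, "LDA": 35 + 106, "QDA": 30 + 121}

baseline = max(support.values()) / n_test   # always predict the majority class ("Up")
accuracy = {name: hits / n_test for name, hits in correct.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.3f} (baseline {baseline:.3f})")
```

On this test set the majority-class baseline happens to equal the LDA accuracy, so only QDA clearly improves on it.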

# Carseats data

Now load the carseats data and try to predict whether the store is located in the US from the other predictor variables.

Report below on your findings about (at least) three different learning approaches, comparing their overall accuracy.

If you use K-nearest neighbors, be sure to try a few different values for K and report on the best one, showing your work.

If you use logistic regression, try to find a simple model with good accuracy by dropping predictors with high p-values.

In [8]:
seats = pd.read_csv(DATA_ROOT + 'Carseats.csv')
seats.head()

Unnamed: 0,Sales,CompPrice,Income,Advertising,Population,Price,ShelveLoc,Age,Education,Urban,US
0,9.5,138,73,11,276,120,Bad,42,17,Yes,Yes
1,11.22,111,48,16,260,83,Good,65,10,Yes,Yes
2,10.06,113,35,10,269,80,Medium,59,12,Yes,Yes
3,7.4,117,100,4,466,97,Medium,55,14,Yes,Yes
4,4.15,141,64,3,340,128,Bad,38,13,Yes,No


In [11]:
# Pick random training and test sets for your analysis:
x_train, x_test, y_train, y_test = train_test_split(seats, seats['US'],
                                                    train_size=0.8, test_size=0.2)

# Hint: if you need to remove some predictors for training or testing in any of the learning methods,
# you can use the pandas 'drop' function to drop the corresponding columns, e.g.
# x_train.drop(columns=['US']).head()

# Hint 2: if you want to write a formula and include a lot of columns, you could use the method
# that was shown in lab 1, e.g.:
#sm.OLS.from_formula('medv ~ ' + '+'.join(df.columns.difference(['medv', 'age', 'indus'])), df)
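Because the split above is random, every run produces different training and test sets, and therefore slightly different accuracies. In scikit-learn this is usually addressed by passing `random_state` to `train_test_split`. The underlying idea can be sketched with a seeded NumPy shuffle; the row count of 400 is an assumption based on the 320 training and 80 test rows reported in the outputs, and the seed value is arbitrary.

```python
import numpy as np

n_rows = 400                       # assumed: 320 training + 80 test rows, as in the outputs
rng = np.random.default_rng(0)     # fixed seed so the shuffle is repeatable (seed is arbitrary)
order = rng.permutation(n_rows)

n_train = round(0.8 * n_rows)
train_idx, test_idx = order[:n_train], order[n_train:]
print(len(train_idx), len(test_idx))
```

Re-running with the same seed yields the same partition, which makes the reported accuracies reproducible.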


In [12]:
# Your code goes here. I would recommend using a different cell for each learning method:

# learning method 1: logistic regression
# Use all variables in the first logistic regression
m1 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US'])), data=x_train, family=sm.families.Binomial())
m1_fit = m1.fit()
print(m1_fit.summary())

prob_m1 = m1_fit.predict(x_test)
pred_m1 = ['Yes' if x < 0.5 else 'No' for x in prob_m1]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m1).T)
print(classification_report(y_test, pred_m1))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      308
Model Family:                   Binomial   Df Model:                           11
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.087
Date:                   Tue, 15 Feb 2022   Deviance:                       158.17
Time:                           19:56:14   Pearson chi2:                     294.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Inte
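The formula string in the cell above is assembled from the column names. A plain-Python sketch of that step is shown here; `build_formula` is a hypothetical helper introduced for illustration, and the column list is copied from the `seats.head()` output.

```python
# Column names as shown in the seats.head() output above.
columns = ["Sales", "CompPrice", "Income", "Advertising", "Population",
           "Price", "ShelveLoc", "Age", "Education", "Urban", "US"]

def build_formula(response, columns, dropped=()):
    """Hypothetical helper: join all columns except the response and the dropped ones."""
    excluded = {response, *dropped}
    predictors = sorted(c for c in columns if c not in excluded)
    return f"{response} ~ " + "+".join(predictors)

formula = build_formula("US", columns, dropped=["CompPrice", "Price"])
print(formula)
```

Sorting the predictors mirrors `Index.difference`, which also returns its result in sorted order.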

In [23]:
# learning method 1: logistic regression
# drop the variable with the highest p-value: CompPrice
m2 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice'])), data=x_train, family=sm.families.Binomial())
m2_fit = m2.fit()
print(m2_fit.summary())

prob_m2 = m2_fit.predict(x_test)
pred_m2 = ['Yes' if x < 0.5 else 'No' for x in prob_m2]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m2).T)
print(classification_report(y_test, pred_m2))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      309
Model Family:                   Binomial   Df Model:                           10
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.119
Date:                   Tue, 15 Feb 2022   Deviance:                       158.24
Time:                           20:07:58   Pearson chi2:                     305.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Inte

In [25]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: Price
m3 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price'])), data=x_train, family=sm.families.Binomial())
m3_fit = m3.fit()
print(m3_fit.summary())

prob_m3 = m3_fit.predict(x_test)
pred_m3 = ['Yes' if x < 0.5 else 'No' for x in prob_m3]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m3).T)
print(classification_report(y_test, pred_m3))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      310
Model Family:                   Binomial   Df Model:                            9
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.141
Date:                   Tue, 15 Feb 2022   Deviance:                       158.28
Time:                           20:09:10   Pearson chi2:                     306.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
Inte

In [26]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: ShelveLoc
m4 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price','ShelveLoc'])), data=x_train, family=sm.families.Binomial())
m4_fit = m4.fit()
print(m4_fit.summary())

prob_m4 = m4_fit.predict(x_test)
pred_m4 = ['Yes' if x < 0.5 else 'No' for x in prob_m4]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m4).T)
print(classification_report(y_test, pred_m4))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      312
Model Family:                   Binomial   Df Model:                            7
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.534
Date:                   Tue, 15 Feb 2022   Deviance:                       159.07
Time:                           20:09:48   Pearson chi2:                     326.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0

In [27]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: Sales
m5 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price','ShelveLoc','Sales'])), data=x_train, family=sm.families.Binomial())
m5_fit = m5.fit()
print(m5_fit.summary())

prob_m5 = m5_fit.predict(x_test)
pred_m5 = ['Yes' if x < 0.5 else 'No' for x in prob_m5]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m5).T)
print(classification_report(y_test, pred_m5))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      313
Model Family:                   Binomial   Df Model:                            6
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.577
Date:                   Tue, 15 Feb 2022   Deviance:                       159.15
Time:                           20:10:51   Pearson chi2:                     332.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        0

In [28]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: Urban
m6 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price','ShelveLoc','Urban','Sales'])), data=x_train, family=sm.families.Binomial())
m6_fit = m6.fit()
print(m6_fit.summary())

prob_m6 = m6_fit.predict(x_test)
pred_m6 = ['Yes' if x < 0.5 else 'No' for x in prob_m6]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m6).T)
print(classification_report(y_test, pred_m6))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      314
Model Family:                   Binomial   Df Model:                            5
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.663
Date:                   Tue, 15 Feb 2022   Deviance:                       159.33
Time:                           20:11:32   Pearson chi2:                     354.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.31

In [32]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: Education
m7 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price','ShelveLoc','Urban','Sales','Education'])), data=x_train, family=sm.families.Binomial())
m7_fit = m7.fit()
print(m7_fit.summary())

prob_m7 = m7_fit.predict(x_test)
pred_m7 = ['Yes' if x < 0.5 else 'No' for x in prob_m7]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m7).T)
print(classification_report(y_test, pred_m7))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      315
Model Family:                   Binomial   Df Model:                            4
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -79.837
Date:                   Tue, 15 Feb 2022   Deviance:                       159.67
Time:                           20:14:29   Pearson chi2:                     383.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.99

In [33]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: Age
m8 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price','ShelveLoc','Urban','Sales','Education','Age'])), data=x_train, family=sm.families.Binomial())
m8_fit = m8.fit()
print(m8_fit.summary())

prob_m8 = m8_fit.predict(x_test)
pred_m8 = ['Yes' if x < 0.5 else 'No' for x in prob_m8]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m8).T)
print(classification_report(y_test, pred_m8))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      316
Model Family:                   Binomial   Df Model:                            3
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -80.064
Date:                   Tue, 15 Feb 2022   Deviance:                       160.13
Time:                           20:15:04   Pearson chi2:                     435.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       1.41

In [34]:
# learning method 1: logistic regression
# drop the variable with the next highest p-value: Income
m9 = smf.glm('US ~ '+ '+'.join(x_train.columns.difference(['US','CompPrice','Price','ShelveLoc','Urban','Sales','Education','Age','Income'])), data=x_train, family=sm.families.Binomial())
m9_fit = m9.fit()
print(m9_fit.summary())

prob_m9 = m9_fit.predict(x_test)
pred_m9 = ['Yes' if x < 0.5 else 'No' for x in prob_m9]
print('confusion matrix:\n', confusion_matrix(y_test, pred_m9).T)
print(classification_report(y_test, pred_m9))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      317
Model Family:                   Binomial   Df Model:                            2
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -81.089
Date:                   Tue, 15 Feb 2022   Deviance:                       162.18
Time:                           20:16:00   Pearson chi2:                     473.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.64

In [36]:
# learning method 1: logistic regression
# From backward selection, we identify advertising and population as the two key variables
# Note that accuracy does not drop as we drop variables
# Below is our final model
lg = smf.glm('US ~ Advertising+Population', data=x_train, family=sm.families.Binomial())
lg_fit = lg.fit()
print(lg_fit.summary())

prob_lg = lg_fit.predict(x_test)
pred_lg = ['Yes' if x < 0.5 else 'No' for x in prob_lg]
print('confusion matrix:\n', confusion_matrix(y_test, pred_lg).T)
print(classification_report(y_test, pred_lg))

                   Generalized Linear Model Regression Results                   
Dep. Variable:     ['US[No]', 'US[Yes]']   No. Observations:                  320
Model:                               GLM   Df Residuals:                      317
Model Family:                   Binomial   Df Model:                            2
Link Function:                     logit   Scale:                          1.0000
Method:                             IRLS   Log-Likelihood:                -81.089
Date:                   Tue, 15 Feb 2022   Deviance:                       162.18
Time:                           20:18:36   Pearson chi2:                     473.
No. Iterations:                        8                                         
Covariance Type:               nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       0.64
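The manual sequence of cells above follows a backward-elimination pattern: fit, find the predictor with the largest p-value, drop it, and refit. A generic sketch of that loop is given below. `backward_eliminate` and `pvalues_for` are hypothetical names, and the p-values are made-up constants that stand in for a real refit. In a real run the p-values change after every drop, and a categorical predictor such as ShelveLoc contributes one p-value per dummy level.

```python
def backward_eliminate(predictors, pvalues_for, threshold=0.05):
    """Repeatedly drop the predictor with the largest p-value until all are <= threshold.

    `pvalues_for` is a hypothetical callable: given a list of predictor names it
    refits the model and returns a dict {name: p-value}.
    """
    remaining = list(predictors)
    while remaining:
        pvals = pvalues_for(remaining)
        worst = max(remaining, key=lambda name: pvals[name])
        if pvals[worst] <= threshold:
            break
        remaining.remove(worst)
    return remaining

# Stand-in for a real refit: fixed, made-up p-values for illustration only.
FAKE_P = {"Advertising": 0.001, "Population": 0.03, "Income": 0.20, "Age": 0.45}

def fake_pvalues(names):
    return {name: FAKE_P[name] for name in names}

kept = backward_eliminate(FAKE_P, fake_pvalues)
print(kept)
```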

In [37]:
# learning method 2: KNN
# Include advertising and population as predictors
def print_knn(k):
    knn = neighbors.KNeighborsClassifier(n_neighbors=k)
    pred = knn.fit(x_train[['Advertising', 'Population']], y_train).predict(x_test[['Advertising','Population']])
    print('KNN confusion matrix:\n', confusion_matrix(y_test, pred).T)
    print(classification_report(y_test, pred))
    
for k in range(1,10):
    print("k =",k)
    print_knn(k)
    print()

k = 1
KNN confusion matrix:
 [[22 10]
 [ 5 43]]
              precision    recall  f1-score   support

          No       0.69      0.81      0.75        27
         Yes       0.90      0.81      0.85        53

    accuracy                           0.81        80
   macro avg       0.79      0.81      0.80        80
weighted avg       0.83      0.81      0.82        80


k = 2
KNN confusion matrix:
 [[25 14]
 [ 2 39]]
              precision    recall  f1-score   support

          No       0.64      0.93      0.76        27
         Yes       0.95      0.74      0.83        53

    accuracy                           0.80        80
   macro avg       0.80      0.83      0.79        80
weighted avg       0.85      0.80      0.81        80


k = 3
KNN confusion matrix:
 [[20  9]
 [ 7 44]]
              precision    recall  f1-score   support

          No       0.69      0.74      0.71        27
         Yes       0.86      0.83      0.85        53

    accuracy                        
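KNN relies on distances between rows (Euclidean by default in scikit-learn), so a predictor measured on a larger scale can dominate the neighbor search. Population takes values in the hundreds while Advertising is mostly below 30, so standardizing both columns first may be worth trying. A minimal sketch with made-up rows follows; these are not the Carseats values.

```python
import numpy as np

# Made-up rows of (Advertising, Population) for illustration only.
train = np.array([[0.0, 120.0], [10.0, 400.0], [5.0, 260.0], [15.0, 60.0]])
test = np.array([[8.0, 300.0]])

# Use training-set statistics for both sets so no test information leaks into the scaling.
mean = train.mean(axis=0)
std = train.std(axis=0)
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(train_scaled.round(2))
```

After scaling, each column has mean 0 and standard deviation 1 on the training set, so both predictors contribute comparably to the distance.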

In [39]:
# learning method 3: LDA
# Include advertising and population as predictors
lda = LinearDiscriminantAnalysis()
ldm = lda.fit(x_train[['Advertising','Population']], y_train)

print('Priors:', ldm.priors_)
print('Means:', ldm.means_)
print('Coefficients:', ldm.coef_)

pred = ldm.predict(x_test[['Advertising','Population']])
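# With (pred, y_test) and '.T', the rows are the true classes and the columns are the predicted classes.
print(confusion_matrix(pred, y_test).T)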
print(classification_report(y_test, pred))



Priors: [0.359375 0.640625]
Means: [[  0.55652174 246.06086957]
 [ 10.20487805 268.76097561]]
Coefficients: [[ 0.43865566 -0.00362916]]
[[26  1]
 [12 41]]
              precision    recall  f1-score   support

          No       0.68      0.96      0.80        27
         Yes       0.98      0.77      0.86        53

    accuracy                           0.84        80
   macro avg       0.83      0.87      0.83        80
weighted avg       0.88      0.84      0.84        80
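The per-class figures in the report can be recovered from the confusion matrix printed above. Its row sums (27 and 53) match the `support` column, so rows are true classes and columns are predictions. A short NumPy sketch of that derivation:

```python
import numpy as np

# Confusion matrix printed above; rows are true classes, columns are predictions (No, Yes).
cm = np.array([[26, 1], [12, 41]])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / number of true members of each class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / number of predictions of each class
accuracy = np.trace(cm) / cm.sum()
print(recall.round(2), precision.round(2), round(accuracy, 2))
```

The low precision for "No" comes from the 12 US stores that were predicted as non-US.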



## Questions 6-8

(Each of the three questions below carries the same weight as the earlier questions.)

Question 6: What was the first method you tried, and what was its best overall accuracy?

Question 7: What was the second method you tried, and what was its best overall accuracy?

Question 8: What was the third method you tried, and what was its best overall accuracy?


In all my models, I only include advertising and population as predictors. These two predictors have the lowest p-values in the logistic regression.

Q6
The first method is logistic regression. It is quite accurate: the accuracy is 0.86 in the simpler model with only two predictors, which is the highest among the three methods I tried.

Q7
The second method is KNN. It is slightly less accurate than logistic regression but still quite accurate. I tried k=1 to k=9 (the loop uses range(1, 10)). The accuracy is highest when k=4, at 0.84. As k increases, accuracy first increases and then decreases. The lower accuracy at small k suggests that the most flexible models overfit the data and that the decision boundary in this example is simple (close to linear).

Q8
The third method is LDA. Its accuracy is also high, at 0.84, slightly below logistic regression. Since logistic regression and LDA both perform well, the decision boundary may be approximately linear.
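When comparing accuracies this close, it helps to remember how small the test set is. As a rough check, the binomial standard error of an accuracy estimate on 80 test rows can be computed as below. This treats each test prediction as an independent trial, which is a simplifying assumption.

```python
import math

n_test = 80                      # size of the Carseats test set used above
for acc in (0.86, 0.84):         # accuracies reported in Q6-Q8
    se = math.sqrt(acc * (1 - acc) / n_test)   # binomial standard error of a proportion
    print(f"accuracy {acc:.2f}: standard error about {se:.3f}")
```

With a standard error of roughly 0.04, a gap of 0.02 between methods is within the noise of a single random split.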