## Chapter 4, Question 13

### This question should be answered using the Weekly data set, which is part of the ISLP package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

In [42]:
!pip install ISLP



In [43]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data
from sklearn.metrics import confusion_matrix, accuracy_score

Weekly = load_data('Weekly')

### Review of (d). Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

In [44]:
# Train on 1990-2008 and hold out 2009-2010 for testing
train_data = Weekly[Weekly.Year <= 2008]
test_data = Weekly[Weekly.Year > 2008]

# Lag2 is the only predictor; encode Direction as 1 = Up, 0 = Down
X_train = train_data[['Lag2']]
y_train = (train_data['Direction'] == 'Up').astype(int)

X_test = test_data[['Lag2']]
y_test = (test_data['Direction'] == 'Up').astype(int)

### (e) Repeat (d) using LDA.

In [45]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Use LDA for classification
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
lda_predictions = lda_model.predict(X_test)

# Evaluate LDA model
cm_lda = confusion_matrix(y_test, lda_predictions)
print("Confusion Matrix (Test Data - LDA):\n", cm_lda)

accuracy_lda = accuracy_score(y_test, lda_predictions)
print("\nOverall Accuracy (Test Data - LDA):", accuracy_lda)

Confusion Matrix (Test Data - LDA):
 [[ 9 34]
 [ 5 56]]

Overall Accuracy (Test Data - LDA): 0.625
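
To read this matrix, recall that scikit-learn's `confusion_matrix` puts the true classes in rows and the predicted classes in columns, with labels in sorted order (here 0 = Down, 1 = Up). The following is a small sketch of the arithmetic behind the reported accuracy and the per-class rates, using only the four counts printed above; the variable names are illustrative.

```python
# Counts copied from the LDA confusion matrix above.
# Rows are true classes, columns are predicted classes (0 = Down, 1 = Up).
cm = [[9, 34],
      [5, 56]]

tn, fp = cm[0]   # true Down: predicted Down, predicted Up
fn, tp = cm[1]   # true Up:   predicted Down, predicted Up
total = tn + fp + fn + tp

accuracy = (tn + tp) / total     # fraction of all weeks classified correctly
up_recall = tp / (tp + fn)       # fraction of Up weeks predicted as Up
down_recall = tn / (tn + fp)     # fraction of Down weeks predicted as Down
up_precision = tp / (tp + fp)    # fraction of Up predictions that were correct

print(f"accuracy     = {accuracy:.3f}")
print(f"Up recall    = {up_recall:.3f}")
print(f"Down recall  = {down_recall:.3f}")
print(f"Up precision = {up_precision:.3f}")
```

LDA is right on most Up weeks but on only about a fifth of Down weeks, so the 62.5% accuracy largely reflects its tendency to predict Up.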


### (f) Repeat (d) using QDA.

In [46]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Use QDA for classification
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train, y_train)
qda_predictions = qda_model.predict(X_test)

# Evaluate QDA model
cm_qda = confusion_matrix(y_test, qda_predictions)
print("Confusion Matrix (Test Data - QDA):\n", cm_qda)

accuracy_qda = accuracy_score(y_test, qda_predictions)
print("\nOverall Accuracy (Test Data - QDA):", accuracy_qda)

Confusion Matrix (Test Data - QDA):
 [[ 0 43]
 [ 0 61]]

Overall Accuracy (Test Data - QDA): 0.5865384615384616
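
The first column of this matrix is all zeros, which means QDA predicted Up for every test week. Its accuracy is therefore just the share of Up weeks in 2009 and 2010. A quick check of that arithmetic, using the counts printed above (variable names are illustrative):

```python
# Counts copied from the QDA confusion matrix above (rows: true, columns: predicted).
cm = [[0, 43],
      [0, 61]]

predicted_down = cm[0][0] + cm[1][0]   # column 0: weeks predicted as Down
true_up = sum(cm[1])                   # row 1: weeks that were actually Up
total = sum(cm[0]) + sum(cm[1])

# Accuracy of a rule that predicts Up for every week
always_up_accuracy = true_up / total

print("Weeks predicted Down:", predicted_down)
print("Accuracy of always predicting Up:", always_up_accuracy)
```

Any model that scores at or below this value on the held-out data is doing no better than ignoring the predictor altogether.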


### (g) Repeat (d) using KNN with K = 1.

In [47]:
from sklearn.neighbors import KNeighborsClassifier

# Use KNN with K=1 for classification
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)

# Evaluate KNN model
cm_knn = confusion_matrix(y_test, knn_predictions)
print("Confusion Matrix (Test Data - KNN):\n", cm_knn)

accuracy_knn = accuracy_score(y_test, knn_predictions)
print("\nOverall Accuracy (Test Data - KNN):", accuracy_knn)

Confusion Matrix (Test Data - KNN):
 [[22 21]
 [32 29]]

Overall Accuracy (Test Data - KNN): 0.49038461538461536
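
With a single predictor, K = 1 simply copies the label of the training week whose Lag2 value is closest to the test week's value, which makes the predictions noisy. Below is a minimal pure-Python sketch of that rule on made-up numbers; the data and the helper `knn_predict` are hypothetical and are not scikit-learn's implementation.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k=1):
    """Predict the label of x by majority vote among its k nearest 1-D training points."""
    # Sort training indices by absolute distance to x and keep the k closest.
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up training returns and directions (1 = Up, 0 = Down).
train_x = [-2.0, -0.5, 0.1, 0.4, 1.5, 3.0]
train_y = [0, 1, 0, 1, 1, 1]

pred_k1 = knn_predict(train_x, train_y, 0.15, k=1)  # nearest point is 0.1, labelled 0
pred_k3 = knn_predict(train_x, train_y, 0.15, k=3)  # neighbours 0.1, 0.4, -0.5 vote 0, 1, 1
print(pred_k1, pred_k3)
```

A single nearby point decides the K = 1 prediction, while K = 3 pools several neighbours, which is one reason larger values of K are worth trying in (j).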


### (h) Repeat (d) using naive Bayes.

In [48]:
from sklearn.naive_bayes import GaussianNB

# Use Naive Bayes for classification
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)

# Evaluate Naive Bayes model
cm_nb = confusion_matrix(y_test, nb_predictions)
print("Confusion Matrix (Test Data - Naive Bayes):\n", cm_nb)

accuracy_nb = accuracy_score(y_test, nb_predictions)
print("\nOverall Accuracy (Test Data - Naive Bayes):", accuracy_nb)

Confusion Matrix (Test Data - Naive Bayes):
 [[ 0 43]
 [ 0 61]]

Overall Accuracy (Test Data - Naive Bayes): 0.5865384615384616
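
This matrix is identical to the QDA one, which is what one would expect with a single predictor: Gaussian naive Bayes and QDA both model Lag2 within each class as a normal distribution with its own mean and variance. As a sketch, using notation assumed here rather than taken from the notebook (class prior $\pi_k$, class mean $\mu_k$, class variance $\sigma_k^2$, and $\pi$ without a subscript for the usual constant), the naive Bayes score for class $k$ is:

```latex
\log\bigl(\pi_k f_k(x)\bigr)
  = \log \pi_k - \tfrac{1}{2}\log\bigl(2\pi\sigma_k^2\bigr) - \frac{(x-\mu_k)^2}{2\sigma_k^2}
  = \underbrace{-\tfrac{1}{2}\log \sigma_k^2 - \frac{(x-\mu_k)^2}{2\sigma_k^2} + \log \pi_k}_{\delta_k^{\mathrm{QDA}}(x)}
    \; - \; \tfrac{1}{2}\log(2\pi)
```

The two scores differ only by a term that does not depend on the class, so both rules pick the same class for every $x$ when they use the same parameter estimates. Implementations may estimate the variances slightly differently, so the fitted models need not agree exactly in general, but here the predictions coincide.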


### (i) Which of these methods appears to provide the best results on this data?

In [49]:
# accuracy_test: logistic regression test accuracy, assumed to be defined in the part (d) cell
print("Comparing the overall accuracy of the models on the test data:")
print(f"- Logistic Regression: {accuracy_test}")
print(f"- LDA: {accuracy_lda}")
print(f"- QDA: {accuracy_qda}")
print(f"- KNN (K=1): {accuracy_knn}")
print(f"- Naive Bayes: {accuracy_nb}")

print("\nThe methods with the highest accuracy on the test data are logistic regression and LDA.")

Comparing the overall accuracy of the models on the test data:
- Logistic Regression: 0.625
- LDA: 0.625
- QDA: 0.5865384615384616
- KNN (K=1): 0.49038461538461536
- Naive Bayes: 0.5865384615384616

The methods with the highest accuracy on the test data are logistic regression and LDA.


### (j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

In [51]:
import itertools

X_train_full = train_data[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
X_test_full = test_data[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]

# Generate all combinations of predictors
predictor_combinations = []
for i in range(1, len(X_train_full.columns) + 1):
    predictor_combinations.extend(itertools.combinations(X_train_full.columns, i))

# Loop over different combinations of predictors and values of K
best_accuracy = 0
best_combination = None
best_k = None

for combination in predictor_combinations:
    X_train_subset = X_train_full[list(combination)]
    X_test_subset = X_test_full[list(combination)]
    for k in [1, 3, 5, 7]:
        knn_model = KNeighborsClassifier(n_neighbors=k)
        knn_model.fit(X_train_subset, y_train)
        knn_predictions = knn_model.predict(X_test_subset)
        accuracy = accuracy_score(y_test, knn_predictions)

        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_combination = combination
            best_k = k

print(f"Best Accuracy: {best_accuracy}")
print(f"Best Predictor Combination: {best_combination}")
print(f"Best K: {best_k}")

Best Accuracy: 0.625
Best Predictor Combination: ('Lag4',)
Best K: 7


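Searching over every combination of the five lag predictors with K in {1, 3, 5, 7}, the best KNN model uses Lag4 alone with K = 7 and reaches a held-out accuracy of 0.625. This matches, but does not beat, the highest accuracy in (i), obtained by logistic regression and LDA with Lag2. Only KNN was varied here; transformations, interactions, and the other methods were not explored.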
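
Because the search loop updates only when the accuracy is strictly greater, it reports the first configuration to reach the best value and drops any later ties. The following sketch shows one way to keep every result so that ties stay visible. The `evaluate` function is a hypothetical stand-in for the fit-and-score step and returns made-up scores, not values from the Weekly data.

```python
import itertools

lags = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']
k_values = [1, 3, 5, 7]

# Every non-empty subset of the five lags: 2**5 - 1 = 31 subsets.
subsets = [c for r in range(1, len(lags) + 1)
           for c in itertools.combinations(lags, r)]

def evaluate(subset, k):
    """Hypothetical stand-in for fit + accuracy_score; returns a made-up score."""
    return round(0.5 + 0.01 * ((len(subset) * k) % 7), 2)

# Keep every (accuracy, subset, k) triple instead of only the running best.
results = [(evaluate(s, k), s, k) for s in subsets for k in k_values]
best = max(acc for acc, _, _ in results)
ties = [(s, k) for acc, s, k in results if acc == best]

print(len(subsets), len(results))   # 31 subsets, 124 fits
print("Configurations tied for best:", len(ties))
```

In the notebook, replacing `evaluate` with the actual KNN fit and `accuracy_score` call would list every predictor set and K that reaches 0.625, not just the first one found.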