## Chapter 4, Question 13  
This question should be answered using the `Weekly` data set, which is part of the ISLP package. This data is similar in nature to the `Smarket` data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from ISLP import load_data
from sklearn.metrics import confusion_matrix, accuracy_score

Weekly = load_data('Weekly')

### Review of Part(d)  
(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor.
Compute the confusion matrix and the overall fraction of correct predictions for the held out data
(that is, the data from 2009 and 2010).

In [2]:
train_data = Weekly[Weekly.Year <= 2008]
test_data = Weekly[Weekly.Year > 2008]

X_train = train_data[['Lag2']]
y_train = train_data['Direction'].apply(lambda x: 1 if x == 'Up' else 0)

X_test = test_data[['Lag2']]
y_test = test_data['Direction'].apply(lambda x: 1 if x == 'Up' else 0)
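
The cell above only prepares the train/test split; the logistic regression that part (d) asks for is not fitted there. A minimal sketch of that step follows, assuming scikit-learn's `LogisticRegression` as the fitting routine (the `statsmodels` import at the top of the notebook would serve equally well). The helper name `fit_logistic_and_score` and the synthetic stand-in data are hypothetical, and `C=1e10` is an assumption used to make scikit-learn's default L2 penalty negligible so the fit approximates an unpenalized logistic regression.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


def fit_logistic_and_score(X_fit, y_fit, X_holdout, y_holdout):
    """Fit a (nearly) unpenalized logistic regression and score it on held-out data."""
    # Assumption: a very large C makes the default L2 penalty negligible.
    model = LogisticRegression(C=1e10)
    model.fit(X_fit, y_fit)
    predictions = model.predict(X_holdout)  # class 1 when the estimated P(class 1) > 0.5
    cm = confusion_matrix(y_holdout, predictions, labels=[0, 1])
    return cm, accuracy_score(y_holdout, predictions)


# Synthetic stand-in for a single-predictor split (NOT the Weekly data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 1))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)

cm_demo, acc_demo = fit_logistic_and_score(
    X_demo[:200], y_demo[:200], X_demo[200:], y_demo[200:]
)
print("Confusion matrix (synthetic demo):\n", cm_demo)
print("Accuracy (synthetic demo):", acc_demo)
```

In this notebook the equivalent call would be `fit_logistic_and_score(X_train, y_train, X_test, y_test)`, using the variables defined in the cell above.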

### Part(e) 
(e) Repeat (d) using LDA

In [3]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Use LDA for classification
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
lda_predictions = lda_model.predict(X_test)

# Evaluate LDA model
cm_lda = confusion_matrix(y_test, lda_predictions)
print("Confusion Matrix (Test Data - LDA):\n", cm_lda)

accuracy_lda = accuracy_score(y_test, lda_predictions)
print("\nOverall Accuracy (Test Data - LDA):", accuracy_lda)

Confusion Matrix (Test Data - LDA):
 [[ 9 34]
 [ 5 56]]

Overall Accuracy (Test Data - LDA): 0.625
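
The overall accuracy hides how unevenly the errors are spread across the two classes. As a small sketch, the per-class rates can be read off the matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, with 0 = `Down` and 1 = `Up` (the encoding used for `y_train`). The helper name `rates_from_confusion` is hypothetical.

```python
def rates_from_confusion(cm):
    """Return accuracy and per-class recall from a 2x2 confusion matrix.

    Rows are true classes, columns are predicted classes (0 = Down, 1 = Up).
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "down_recall": tn / (tn + fp),  # share of Down weeks predicted as Down
        "up_recall": tp / (fn + tp),    # share of Up weeks predicted as Up
    }


# LDA confusion matrix copied from the output above
lda_rates = rates_from_confusion([[9, 34], [5, 56]])
for name, value in lda_rates.items():
    print(f"{name}: {value:.3f}")
```

Read this way, LDA catches 56 of the 61 `Up` weeks but only 9 of the 43 `Down` weeks, so most of its accuracy comes from leaning heavily toward `Up`.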


### Part(f) 
(f) Repeat (d) using QDA

In [4]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Use QDA for classification
qda_model = QuadraticDiscriminantAnalysis()
qda_model.fit(X_train, y_train)
qda_predictions = qda_model.predict(X_test)

# Evaluate QDA model
cm_qda = confusion_matrix(y_test, qda_predictions)
print("Confusion Matrix (Test Data - QDA):\n", cm_qda)

accuracy_qda = accuracy_score(y_test, qda_predictions)
print("\nOverall Accuracy (Test Data - QDA):", accuracy_qda)

Confusion Matrix (Test Data - QDA):
 [[ 0 43]
 [ 0 61]]

Overall Accuracy (Test Data - QDA): 0.5865384615384616


### Part(g) 
(g) Repeat (d) using KNN with K=1

In [5]:
from sklearn.neighbors import KNeighborsClassifier

# Use KNN with K=1 for classification
knn_model = KNeighborsClassifier(n_neighbors=1)
knn_model.fit(X_train, y_train)
knn_predictions = knn_model.predict(X_test)

# Evaluate KNN model
cm_knn = confusion_matrix(y_test, knn_predictions)
print("Confusion Matrix (Test Data - KNN):\n", cm_knn)

accuracy_knn = accuracy_score(y_test, knn_predictions)
print("\nOverall Accuracy (Test Data - KNN):", accuracy_knn)

Confusion Matrix (Test Data - KNN):
 [[21 22]
 [30 31]]

Overall Accuracy (Test Data - KNN): 0.5


### Part(h) 
(h) Repeat (d) using naive Bayes

In [6]:
from sklearn.naive_bayes import GaussianNB

# Use Naive Bayes for classification
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_predictions = nb_model.predict(X_test)

# Evaluate Naive Bayes model
cm_nb = confusion_matrix(y_test, nb_predictions)
print("Confusion Matrix (Test Data - Naive Bayes):\n", cm_nb)

accuracy_nb = accuracy_score(y_test, nb_predictions)
print("\nOverall Accuracy (Test Data - Naive Bayes):", accuracy_nb)

Confusion Matrix (Test Data - Naive Bayes):
 [[ 0 43]
 [ 0 61]]

Overall Accuracy (Test Data - Naive Bayes): 0.5865384615384616
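
The QDA and naive Bayes matrices both have an all-zero first column, meaning neither model ever predicts `Down`. A quick check, sketched here in plain Python with the counts copied from the outputs above, shows that their accuracy is exactly what a constant "always `Up`" rule would score. The function name `always_up_accuracy` and the variable names are hypothetical.

```python
def always_up_accuracy(cm):
    """Accuracy of predicting class 1 (Up) for every week.

    Rows of the 2x2 matrix are true classes; the second row holds the true Up weeks.
    """
    n_up = sum(cm[1])
    n_total = sum(cm[0]) + sum(cm[1])
    return n_up / n_total


all_up_counts = [[0, 43], [0, 61]]  # QDA and naive Bayes output above
lda_counts = [[9, 34], [5, 56]]     # LDA output from part (e)

baseline = always_up_accuracy(all_up_counts)
lda_accuracy = (lda_counts[0][0] + lda_counts[1][1]) / (
    sum(lda_counts[0]) + sum(lda_counts[1])
)

print(f"Always-Up baseline: {baseline:.4f}")
print(f"LDA margin over baseline: {lda_accuracy - baseline:.4f}")
```

This baseline is a useful yardstick for every method in the exercise: a classifier only adds value on the held-out weeks if it beats 61 correct out of 104.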


### Part(i) 
(i) Which of these methods appears to provide the best results on this data?

#### Conclusion:
Based on the held-out accuracy for 2009 and 2010 (104 weeks):
- **LDA** gives the best result of the four methods fitted here, with an accuracy of **62.5%** (65 of 104 weeks correct).
- **QDA** and **naive Bayes** both predict `Up` for every week, so their accuracy of **58.65%** is simply the share of `Up` weeks in the test period (61 of 104).
- **KNN with K=1** is the weakest at **50%**, no better than a coin flip.
- The logistic regression from part (d) is not fitted in this notebook, so it is left out of this comparison.

Of the methods evaluated, LDA with `Lag2` appears to provide the best results on this data, although its margin over always predicting `Up` is only four weeks (65 versus 61 correct).

### Part(j) 
(j) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

In [10]:
import itertools

# Prepare the predictors and response variable
X_train_full = train_data[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
y_train = train_data['Direction'].apply(lambda x: 1 if x == 'Up' else 0)

X_test_full = test_data[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5']]
y_test = test_data['Direction'].apply(lambda x: 1 if x == 'Up' else 0)

# Generate every non-empty subset of the predictors, smallest subsets first
predictor_combinations = [
    combination
    for size in range(1, len(X_train_full.columns) + 1)
    for combination in itertools.combinations(X_train_full.columns, size)
]

# Initialize variables to store the best results
best_accuracy = 0
best_combination = None
best_k = None

# Function to evaluate KNN with different combinations of predictors and K values
def evaluate_knn(X_train, y_train, X_test, y_test):
    global best_accuracy, best_combination, best_k
    for combination in predictor_combinations:
        X_train_subset = X_train[list(combination)]
        X_test_subset = X_test[list(combination)]
        for k in [1, 3, 5, 7]:
            knn_model = KNeighborsClassifier(n_neighbors=k)
            knn_model.fit(X_train_subset, y_train)
            knn_predictions = knn_model.predict(X_test_subset)
            accuracy = accuracy_score(y_test, knn_predictions)

            # Update the best results if the current accuracy is higher
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_combination = combination
                best_k = k

# Evaluate KNN with the defined function
evaluate_knn(X_train_full, y_train, X_test_full, y_test)

# Output the best results
print(f"Best Accuracy: {best_accuracy:.4f}")
print(f"Best Predictor Combination: {best_combination}")
print(f"Best K: {best_k}")

Best Accuracy: 0.6250
Best Predictor Combination: ('Lag4',)
Best K: 7


#### Conclusion

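The search above covers the KNN classifier only, over every subset of `Lag1` to `Lag5` and K in {1, 3, 5, 7}:

- **Best predictor set**: `Lag4` alone gave the highest held-out accuracy, **62.5%**, which matches but does not beat LDA with `Lag2` from part (e).
- **Best K**: **7**, the largest value tried. Averaging over more neighbours reduces the variance of the fit relative to K=1, and because the best value sits at the edge of the grid, larger K values may be worth trying.

Two caveats apply. The predictor set and K were chosen by their accuracy on the 2009 and 2010 data itself, so 62.5% is an optimistic estimate of out-of-sample performance. The exercise also asks for transformations, interactions, the other methods, and the confusion matrix of the best model, none of which this search covers.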
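
Because the search above picks its winner by test accuracy, one common remedy is to choose K on a validation slice carved out of the training period and touch the held-out years only once. The sketch below illustrates that workflow on synthetic data and also prints the confusion matrix that part (j) asks for. The function name `select_k_on_validation`, the split sizes, and the candidate K values are assumptions for illustration, not the notebook's procedure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier


def select_k_on_validation(X_fit, y_fit, X_val, y_val, candidate_ks):
    """Return the K with the highest validation accuracy, plus all scores."""
    scores = {}
    for k in candidate_ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
        scores[k] = accuracy_score(y_val, model.predict(X_val))
    best = max(scores, key=scores.get)  # ties go to the earliest K in candidate_ks
    return best, scores


# Synthetic stand-in for one lag predictor (NOT the Weekly data), in time order.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(500, 1))
y_all = (X_all[:, 0] + rng.normal(scale=2.0, size=500) > 0).astype(int)

# Earliest rows fit the model, the next block picks K, the last block is the test.
X_fit, y_fit = X_all[:300], y_all[:300]
X_val, y_val = X_all[300:400], y_all[300:400]
X_hold, y_hold = X_all[400:], y_all[400:]

chosen_k, val_scores = select_k_on_validation(X_fit, y_fit, X_val, y_val, [1, 3, 5, 7])

# Refit on the fit and validation rows together, then score the test block once.
final_model = KNeighborsClassifier(n_neighbors=chosen_k)
final_model.fit(np.vstack([X_fit, X_val]), np.concatenate([y_fit, y_val]))
hold_predictions = final_model.predict(X_hold)

print("Validation accuracy by K:", val_scores)
print("Chosen K:", chosen_k)
print("Test confusion matrix:\n", confusion_matrix(y_hold, hold_predictions, labels=[0, 1]))
print("Test accuracy:", accuracy_score(y_hold, hold_predictions))
```

With the `Weekly` data, the analogous split might fit on the earlier training years, validate on the last few years before 2009, and keep 2009 and 2010 for the single final evaluation.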