This question should be answered using the Weekly data set, which is part of the ISLP package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

In [0]:
# general imports
import numpy as np
import pandas as pd

In [0]:
# import data visualisation packages
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [0]:
# load and preprocess data
url = "abfss://training@sa8451learningdev.dfs.core.windows.net/interpretable_machine_learning/eml_data/Weekly.csv"
Weekly = spark.read.option("header", "true").csv(url).toPandas()
Weekly.set_index('_c0', inplace=True)

float_cols = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume", "Today"]
int_cols = ['Year']
str_cols = ["Direction"]
Weekly[float_cols] = Weekly[float_cols].astype(float)
Weekly[int_cols] = Weekly[int_cols].astype(int)
Weekly[str_cols] = Weekly[str_cols].astype(str)

**a. Produce some numerical and graphical summaries of the `Weekly` data. Do there appear to be any patterns?**

In [0]:
Weekly.head()

In [0]:
Weekly.info()

In [0]:
Weekly.describe()

In [0]:
# Direction is a string column, so restrict the covariance to numeric columns
Weekly.cov(numeric_only=True)

In [0]:
# Direction is a string column, so restrict the correlation to numeric columns
Weekly.corr(numeric_only=True)

In [0]:
sns.pairplot(Weekly)

There is a strong, discernible relationship between `Year` and `Volume`. It shows up both in their correlation
(~0.84) and in the pair plot, where the points follow a clear upward trend through which one could draw a regression
line. No other pair of variables shows a discernible relationship.
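One way to back up a reading like this numerically is to rank the off-diagonal entries of the correlation matrix. The sketch below does that on a small synthetic frame; the column names `year`, `volume`, `lag` and the helper `strongest_pairs` are illustrative assumptions, not part of `Weekly`.

```python
import numpy as np
import pandas as pd

def strongest_pairs(df, top=3):
    """Return the `top` column pairs with the largest absolute correlation."""
    corr = df.corr(numeric_only=True)
    # keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# synthetic data: volume drifts upward with year, lag is pure noise
rng = np.random.default_rng(0)
year = np.repeat(np.arange(1990, 2011), 10)
demo = pd.DataFrame({
    "year": year,
    "volume": 0.1 * (year - 1990) + rng.normal(scale=0.3, size=year.size),
    "lag": rng.normal(size=year.size),
})

ranked = strongest_pairs(demo)
print(ranked)
```

On the real data the same helper could be called as `strongest_pairs(Weekly)`, assuming the numeric columns have already been cast as in the preprocessing cell.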

**b. Use the full data set to perform a logistic regression with `Direction` as the response and the five lag variables plus `Volume` as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?**

In [0]:
from sklearn.linear_model import LogisticRegression

In [0]:
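# Predictors requested by the question: the five lags plus Volume.
# Today and Year are left out; Direction is the sign of Today, so keeping it would leak the response.
X = Weekly[['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume']]
y = Weekly['Direction']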

In [0]:
# Direction is binary, so multi_class is not needed (the argument is deprecated in recent scikit-learn releases)
glmfit = LogisticRegression(solver='lbfgs').fit(X, y)

In [0]:
coefficients = glmfit.coef_

In [0]:
print(coefficients)
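scikit-learn's `LogisticRegression` exposes coefficients but no standard errors or p-values, so the significance part of the question cannot be answered from `coef_` alone. As a rough illustration of where such p-values come from, the sketch below fits an unpenalised logistic regression by Newton-Raphson and runs a Wald test on each coefficient. It uses synthetic data and a hypothetical helper `logit_wald`; a statistics library that prints a summary table would be the usual choice in practice, and `Direction` would first need to be encoded as 0/1.

```python
import math
import numpy as np

def logit_wald(X, y, n_iter=25):
    """Fit an unpenalised logistic regression by Newton-Raphson and
    return coefficients, standard errors, z-statistics and p-values."""
    Xd = np.column_stack([np.ones(len(X)), X])  # intercept column first
    beta = np.zeros(Xd.shape[1])

    def info_and_score(b):
        p = 1.0 / (1.0 + np.exp(-Xd @ b))
        info = Xd.T @ (Xd * (p * (1.0 - p))[:, None])  # Fisher information
        return info, Xd.T @ (y - p)

    for _ in range(n_iter):
        info, score = info_and_score(beta)
        beta = beta + np.linalg.solve(info, score)

    info, _ = info_and_score(beta)  # information at the final estimate
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    z = beta / se
    # two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, se, z, pvals

# synthetic data: x1 has a real effect on the response, x2 has none
rng = np.random.default_rng(0)
n = 1000
X_demo = rng.normal(size=(n, 2))
prob_up = 1.0 / (1.0 + np.exp(-(0.2 + 1.0 * X_demo[:, 0])))
y_demo = (rng.random(n) < prob_up).astype(float)

beta, se, z, pvals = logit_wald(X_demo, y_demo)
for name, b, s, p in zip(["const", "x1", "x2"], beta, se, pvals):
    print(f"{name:<6} coef={b: .3f}  se={s:.3f}  p={p:.4f}")
```

A predictor would be called statistically significant when its p-value falls below the chosen threshold (0.05 is a common convention). Because scikit-learn applies an L2 penalty by default, its coefficients will generally differ somewhat from an unpenalised fit like this one.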

**c. Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.**

In [0]:
from sklearn.metrics import confusion_matrix, classification_report

In [0]:
glmpred = glmfit.predict(X)

In [0]:
print(confusion_matrix(y, glmpred))

In [0]:
print(classification_report(y, glmpred))

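A confusion matrix with only 3 Type I errors (false positives) looks REALLY good, but it is an artefact of two things. First, it arises only when `Today` is included in `X`: `Direction` is simply the sign of `Today`, so the response leaks into the predictors. Second, the test data is the same as the training data, which inflates accuracy on its own. Restricted to the five lags and `Volume`, the model makes far more errors of both types.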
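To make "types of mistakes" concrete, the following sketch unpacks a 2×2 confusion matrix into accuracy and the two error rates. It assumes the scikit-learn layout (rows are true classes, columns are predictions, labels in sorted order, so `Down` comes before `Up`); the helper name `summarize_confusion` and the counts in `demo_cm` are made up for illustration.

```python
import numpy as np

def summarize_confusion(cm):
    """Summarise a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    total = tn + fp + fn + tp
    return {
        "accuracy": float((tn + tp) / total),
        # Type I error rate: true negatives that were predicted positive
        "false_positive_rate": float(fp / (fp + tn)),
        # Type II error rate: true positives that were predicted negative
        "false_negative_rate": float(fn / (fn + tp)),
    }

demo_cm = [[40, 10], [5, 45]]  # illustrative counts only
stats = summarize_confusion(demo_cm)
print(stats)
```

With the notebook's objects this could be called as `summarize_confusion(confusion_matrix(y, glmpred))`.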

**d. Now fit the logistic regression model using a training data period from 1990 to 2008, with `Lag2` as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).**

In [0]:
X_train = Weekly.loc[Weekly['Year'] < 2009, ['Lag2']]

In [0]:
X_test = Weekly.loc[Weekly['Year'] >= 2009, ['Lag2']]

In [0]:
y_train = Weekly.loc[Weekly['Year'] < 2009, 'Direction'].to_numpy()

In [0]:
y_test = Weekly.loc[Weekly['Year'] >= 2009, 'Direction'].to_numpy()

In [0]:
glmfit1 = LogisticRegression(solver='liblinear').fit(X_train, y_train)

In [0]:
glmfit1pred = glmfit1.predict(X_test)

In [0]:
print(confusion_matrix(y_test, glmfit1pred))

In [0]:
print(classification_report(y_test, glmfit1pred))

No jackpot this time. This model gives a more realistic picture: on the held-out 2009 and 2010 data, both Type I and Type II errors increase.
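The held-out evaluation in (d) rests on splitting by time rather than at random, so that the model is only ever tested on weeks that come after its training period. A minimal sketch of that split as a reusable function follows; `split_by_year` is a hypothetical helper and the five-row frame is toy data, not `Weekly`.

```python
import pandas as pd

def split_by_year(df, first_test_year, feature_cols, target_col, year_col="Year"):
    """Train on rows before `first_test_year`, test on rows from that year onward."""
    is_test = df[year_col] >= first_test_year
    X_train = df.loc[~is_test, feature_cols]
    X_test = df.loc[is_test, feature_cols]
    y_train = df.loc[~is_test, target_col].to_numpy()
    y_test = df.loc[is_test, target_col].to_numpy()
    return X_train, X_test, y_train, y_test

demo = pd.DataFrame({
    "Year": [2007, 2008, 2008, 2009, 2010],
    "Lag2": [0.1, -0.4, 0.3, 1.2, -0.7],
    "Direction": ["Up", "Down", "Up", "Up", "Down"],
})
X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_by_year(
    demo, 2009, ["Lag2"], "Direction"
)
print(len(X_train_demo), len(X_test_demo))
```

Keeping the feature list as a parameter makes it easy to repeat the experiment with other predictors without rewriting the filter.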

**e. Repeat (d) using LDA.**

In [0]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [0]:
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

In [0]:
ldapred = lda.predict(X_test)

In [0]:
print(confusion_matrix(y_test, ldapred))

In [0]:
print(classification_report(y_test, ldapred))

LDA does not provide any significant improvement over Logistic Regression. In fact, its accuracy is the same as that of
Logistic Regression.

**f. Repeat (d) using QDA.**

In [0]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [0]:
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

In [0]:
qdapred = qda.predict(X_test)

In [0]:
print(confusion_matrix(y_test, qdapred))

In [0]:
# The report warns when a class is never predicted, because its precision and F-score are undefined.
# zero_division=0 reports them as 0 instead of silencing every warning globally.
print(classification_report(y_test, qdapred, zero_division=0))

As we can see, QDA picks up more true positives, but at a heavy cost: it rarely or never predicts the negative
class, so true negatives fall and false positives rise. Whether this is critical depends on context. For example, a bank
using a model to flag potential delinquencies might accept such a trade-off, since it is likely to prioritise avoiding
false negatives (people who will default, but the model says they will not) over avoiding false positives (people who
will not default, but the model says they will).
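A useful reference point for a lopsided classifier is the trivial rule that always predicts one class. The sketch below shows, on made-up labels, that such a rule has accuracy equal to that class's share of the data, with perfect recall for the favoured class and zero recall for the other. The helper `always_predict` and the 6/4 label split are illustrative assumptions.

```python
import numpy as np

def always_predict(label, y_true):
    """Score a classifier that predicts `label` for every observation."""
    y_true = np.asarray(y_true)
    y_pred = np.full(y_true.shape, label)
    accuracy = float(np.mean(y_pred == y_true))
    # recall per class: share of that class's observations predicted correctly
    recall = {
        str(c): float(np.mean(y_pred[y_true == c] == c))
        for c in np.unique(y_true)
    }
    return accuracy, recall

y_demo = ["Up"] * 6 + ["Down"] * 4  # toy labels, 60% Up
acc, recall = always_predict("Up", y_demo)
print(acc, recall)
```

Comparing a model's test accuracy with this baseline shows how much it actually adds beyond guessing the majority class.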

**g. Repeat (d) using KNN with K = 1.**

In [0]:
from sklearn.neighbors import KNeighborsClassifier

In [0]:
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

In [0]:
knnpred = knn.predict(X_test)

In [0]:
print(confusion_matrix(y_test, knnpred))

In [0]:
print(classification_report(y_test, knnpred))

The precision of K-nearest neighbours drops. With K = 1 the decision boundary is highly non-linear, whereas the
accuracy of Logistic Regression and LDA suggests that the true decision boundary is closer to linear. As such,
K-nearest neighbours likely overfits the training data and generalises poorly to the test data.
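The overfitting argument can be checked by comparing training and test accuracy. With K = 1 every training point is its own nearest neighbour, so training accuracy is perfect whenever the feature values are distinct, while test accuracy can be much lower. The sketch below illustrates this on synthetic, weakly informative data; the data-generating rule and the choice of K = 25 as a comparison are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_syn = rng.normal(size=(400, 1))
# labels depend only weakly on the feature, so most of the variation is noise
y_syn = np.where(0.3 * X_syn[:, 0] + rng.normal(size=400) > 0, "Up", "Down")
X_tr, X_te = X_syn[:300], X_syn[300:]
y_tr, y_te = y_syn[:300], y_syn[300:]

scores = {}
for k in (1, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"K={k:<3} train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

A large gap between the two numbers for K = 1 is the signature of memorising the training set rather than learning a stable boundary.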

**i. Which of these methods appears to provide the best results on this data?**

The accuracy reports suggest that Logistic Regression and Linear Discriminant Analysis provide the best results
on this data.
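A side-by-side table makes this comparison easier to read than separate reports. The sketch below fits the four classifiers used in (d) to (g) on one synthetic train/test split and prints test accuracy in descending order. The data are simulated, so the ranking it prints says nothing about `Weekly`; swapping in `X_train`, `X_test`, `y_train` and `y_test` from (d) would give the real comparison.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for a single noisy predictor and a binary response
rng = np.random.default_rng(2)
X_sim = rng.normal(size=(500, 1))
y_sim = np.where(0.5 * X_sim[:, 0] + rng.normal(size=500) > 0, "Up", "Down")
X_fit, X_hold = X_sim[:400], X_sim[400:]
y_fit, y_hold = y_sim[:400], y_sim[400:]

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "KNN (K=1)": KNeighborsClassifier(n_neighbors=1),
}

# fit each model on the training part and score it on the held-out part
accuracy = {
    name: accuracy_score(y_hold, model.fit(X_fit, y_fit).predict(X_hold))
    for name, model in models.items()
}
for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{acc:.3f}")
```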