# 6.3: Classification Exercises

## Getting Started

### Import Libraries 

We import our standard libraries and specific objects/libraries at the top level of our notebook.

In [1]:
# Import libraries and objects
import numpy as np
from ISLP import load_data
from ISLP.models import ModelSpec as MS
import warnings 
warnings.filterwarnings('ignore') # mute warning messages
from ISLP import confusion_table
from sklearn.neighbors import KNeighborsClassifier

First, load our `Smarket` data.

In [2]:
Smarket = load_data('Smarket')
Smarket

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.010 | 1.19130 | 0.959 | Up |
| 1 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.29650 | 1.032 | Up |
| 2 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.41120 | -0.623 | Down |
| 3 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.27600 | 0.614 | Up |
| 4 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.20570 | 0.213 | Up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1245 | 2005 | 0.422 | 0.252 | -0.024 | -0.584 | -0.285 | 1.88850 | 0.043 | Up |
| 1246 | 2005 | 0.043 | 0.422 | 0.252 | -0.024 | -0.584 | 1.28581 | -0.955 | Down |
| 1247 | 2005 | -0.955 | 0.043 | 0.422 | 0.252 | -0.024 | 1.54047 | 0.130 | Up |
| 1248 | 2005 | 0.130 | -0.955 | 0.043 | 0.422 | 0.252 | 1.42236 | -0.298 | Down |


We can view the variable names.

In [3]:
Smarket.columns

Index(['Year', 'Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Volume', 'Today',
       'Direction'],
      dtype='object')

### K-Nearest Neighbors

We will now perform KNN using the `KNeighborsClassifier()` function. This function is similar
to the other model-fitting functions we've used throughout these exercises.

In [22]:
import numpy as np
from ISLP import load_data
from ISLP.models import ModelSpec as MS
import warnings 
warnings.filterwarnings('ignore') # mute warning messages
from ISLP import confusion_table
from sklearn.neighbors import KNeighborsClassifier

Smarket = load_data('Smarket')

# Exclude Direction because it is our CLASSIFICATION (response) variable
# Exclude Year because we use it to split the sample into TRAIN and TEST
# Exclude Today because it is the same-day return, which directly determines Direction
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)  # model specification built from the columns in 'allvars'

# Boolean Series indexed like Smarket: True if Year < 2005 (train), False otherwise (test)
train = (Smarket.Year < 2005)
# print("train:", train)

# using the train array (true/false) separate the data into TRAIN and TEST
Smarket_train = Smarket.loc[train] 
Smarket_test = Smarket.loc[~train] 
print ("Smarket_train.shape:", Smarket_train.shape) # it will have 998 rows of data
print ("Smarket_test.shape:", Smarket_test.shape) # it will have 252 rows of data


# Create the Data Set
# X has all the predictor variables
X = design.fit_transform(Smarket)
#print ("X:",X)
# create the response Series: True where Direction is 'Up', False otherwise
y = Smarket.Direction == 'Up'
# print("y:", y)

# split X and y into train and test using the boolean 'train' mask
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]

#print ("y_train",y_train)

# D holds the Direction labels ('Up'/'Down'); split them into TRAIN and TEST with the same mask
D = Smarket.Direction
L_train, L_test = D.loc[train], D.loc[~train]
print("Y:", D)


X_train, X_test = [np.asarray(X) for X in [X_train, X_test]]

knn1 = KNeighborsClassifier(n_neighbors=1) # n_neighbors = K = 1
knn1.fit(X_train, L_train)

knn1_pred = knn1.predict(X_test)
confusion_table(knn1_pred, L_test)


Smarket_train.shape: (998, 9)
Smarket_test.shape: (252, 9)
Y: 0         Up
1         Up
2       Down
3         Up
4         Up
        ... 
1245      Up
1246    Down
1247      Up
1248    Down
1249    Down
Name: Direction, Length: 1250, dtype: object


| Predicted (rows) vs. Truth (columns) | Down | Up |
| --- | --- | --- |
| Down | 50 | 62 |
| Up | 61 | 79 |


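**Reading Confusion Tables**

Rows are the model's predictions and columns are the true outcomes.

- Predicted Down, truth Down (correct): the model predicted Down on 50 days that did go Down.
- Predicted Up, truth Up (correct): the model predicted Up on 79 days that did go Up.
- Predicted Down, truth Up (mistake): the model predicted Down on 62 days that in fact went Up.
- Predicted Up, truth Down (mistake): the model predicted Up on 61 days that in fact went Down.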
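
To make the reading concrete, here is a minimal sketch that recomputes summary numbers from the four counts in the table. The dictionary layout and the variable names are illustrative choices, not part of `ISLP`; only the counts come from the output above.

```python
# Counts copied from the confusion table above.
# Keys are (predicted, truth) pairs.
table = {
    ("Down", "Down"): 50,  # predicted Down, truth Down (correct)
    ("Down", "Up"): 62,    # predicted Down, truth Up (mistake)
    ("Up", "Down"): 61,    # predicted Up, truth Down (mistake)
    ("Up", "Up"): 79,      # predicted Up, truth Up (correct)
}

total = sum(table.values())

# Correct predictions sit on the diagonal of the table.
correct = table[("Down", "Down")] + table[("Up", "Up")]
accuracy = correct / total

# Of the days the model predicted Up, what share really went Up?
predicted_up = table[("Up", "Down")] + table[("Up", "Up")]
precision_up = table[("Up", "Up")] / predicted_up

print(f"total test days: {total}")
print(f"accuracy: {accuracy:.4f}")
print(f"share correct when predicting Up: {precision_up:.4f}")
```

The diagonal sum divided by the total should agree with `np.mean(knn1_pred == L_test)` computed on the same predictions.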

The results using $K=1$ are not very good, since only about $51\%$ of the
test observations are correctly predicted. Of course, it may be that $K=1$
results in an overly flexible fit to the data.

In [6]:
# accuracy from the confusion table diagonal, and computed directly from the predictions
(50+79)/252, np.mean(knn1_pred == L_test)

# both give about 51.19% accuracy

(0.5119047619047619, 0.5119047619047619)

As we can see, KNN with $K=1$ gives only about 51% accuracy, which is barely better than random chance.
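
One way to put this accuracy in context is to compare it with a classifier that ignores the predictors and always predicts the more common class. The sketch below does that arithmetic from the confusion table counts. For illustration it reads the class frequencies off the test set, which is an assumption of this sketch; in practice the majority class would be taken from the training labels.

```python
# Column totals of the confusion table (true outcomes in the 2005 test set).
truth_down = 50 + 61   # days that actually went Down
truth_up = 62 + 79     # days that actually went Up
n_test = truth_down + truth_up

# Accuracy of the K=1 model: diagonal of the confusion table over the total.
knn1_accuracy = (50 + 79) / n_test

# Accuracy of always predicting the majority class.
baseline_accuracy = max(truth_down, truth_up) / n_test

print(f"KNN (K=1) accuracy:      {knn1_accuracy:.4f}")
print(f"majority-class baseline: {baseline_accuracy:.4f}")
```

On these counts the always-Up baseline is right on 141 of 252 days, so the $K=1$ model does not beat it.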

**Try running KNN for several values of $K$ and summarize the results for the best model you find.
Out of all the classification methods we tried, which performs best on the `Smarket` data? Give
some explanation for why that might be.**
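
As a starting point for the exercise, the following is a minimal, dependency-free sketch of a sweep over several values of $K$. It runs on hypothetical toy data (two noisy clusters generated with Python's `random` module), not on `Smarket`, and the helpers `knn_predict` and `accuracy` are illustrative names written for this sketch rather than library functions.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


# Hypothetical toy data: two noisy clusters standing in for the two classes.
rng = random.Random(0)


def sample(label, center, n):
    return [([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label) for _ in range(n)]


data = sample("Down", -1.0, 60) + sample("Up", 1.0, 60)
rng.shuffle(data)
train, test = data[:80], data[80:]
train_X, train_y = [p for p, _ in train], [label for _, label in train]
test_X, test_y = [p for p, _ in test], [label for _, label in test]

# Odd values of K avoid tied votes when there are two classes.
results = {k: accuracy(train_X, train_y, test_X, test_y, k) for k in (1, 3, 5, 9, 15)}
best_k = max(results, key=results.get)

for k, acc in results.items():
    print(f"K={k:2d}  test accuracy={acc:.3f}")
print("best K:", best_k)
```

With the objects already defined in this notebook, the same loop would fit `KNeighborsClassifier(n_neighbors=k)` on `X_train, L_train` and compare its predictions with `L_test`. Picking $K$ by test accuracy gives an optimistic estimate, so a separate validation split or cross-validation is the safer choice when reporting a final number.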

*These exercises were adapted from:* James, Gareth, et al. *An Introduction to Statistical Learning: with Applications in Python*. Springer, 2023.

**Classification**

- Computes the probability that an observation belongs to a category.
- The assignment is based on a threshold:
    - If the probability that the observation belongs to a category is greater than 0.5 (50%), assign the observation to that category.
- Common classification methods:
    - Logistic Regression
    - Support Vector Machine
    - Decision Trees
- Related unsupervised methods (they do not use class labels, so they are not classifiers themselves):
    - PCA
    - Clustering

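Why not use Linear Regression?
- Because classification assigns the response to one of a set of specific, predetermined categories rather than predicting a quantity.
- Suppose we are trying to diagnose a patient with one of: stroke, drug overdose, or epileptic seizure. This can be coded as:

$$
y = \begin{cases}
1 & \text{if stroke} \\
2 & \text{if drug overdose} \\
3 & \text{if epileptic seizure}
\end{cases}
$$

- This coding implies an ordering of the conditions and equal spacing between them, which has no natural justification, so a linear regression fitted to $y$ would depend on an arbitrary choice.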
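
A small sketch can show why this numeric coding is a problem for linear regression: the fitted line, and therefore the predicted diagnosis, changes when the same three categories are numbered in a different order. The data below are a hypothetical toy example (one made-up symptom score per patient), and `fit_line` and `predict_label` are helper names invented for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_label(coding, xs, labels, x_new):
    """Fit on the numeric codes, then map the rounded prediction back to a label."""
    intercept, slope = fit_line(xs, [coding[label] for label in labels])
    code = min(max(round(intercept + slope * x_new), 1), 3)  # clamp to the valid codes 1..3
    decode = {value: label for label, value in coding.items()}
    return decode[code]


# Hypothetical toy data: one numeric symptom score per patient.
xs = [1, 2, 3, 4, 5, 6]
labels = ["stroke", "stroke", "drug overdose", "drug overdose",
          "epileptic seizure", "epileptic seizure"]

# Two equally arbitrary ways of numbering the same three categories.
coding_a = {"stroke": 1, "drug overdose": 2, "epileptic seizure": 3}
coding_b = {"epileptic seizure": 1, "stroke": 2, "drug overdose": 3}

pred_a = predict_label(coding_a, xs, labels, 1.0)
pred_b = predict_label(coding_b, xs, labels, 1.0)
print(pred_a, "|", pred_b)  # stroke | drug overdose
```

The same patient gets a different diagnosis depending only on how the categories were numbered, which is why methods designed for categorical responses are preferred.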








![image](../08_notes/k_nearest_neighbours.png)

**KNN (K-Nearest Neighbours)**

- Defining K (the number of nearest neighbours used)
    1. Pick a point to classify and find the K observations closest to it.
    2. In the figure, the observations are BLUE and ORANGE dots.
    3. If 2 out of the 3 nearest are BLUE, then the point is classified as BLUE: BLUE has an estimated 2/3 (66.7%) probability and ORANGE 1/3 (33.3%).
    4. The neighbourhood has a centre (the point being classified) and a radius (the distance to the K-th nearest observation), so the observations that fall inside the radius are the ones that vote.
        - Analogy: X could be the average size of a squirrel, so all animals in the sample within ±2 standard deviations of that size could be considered squirrels.

- Decision Boundaries
    1. Decision boundaries divide the graph into regions; an observation's classification depends on the region it falls in.


- Note:
    - Other scikit-learn methods for grouping observations (clustering, which is unsupervised): https://scikit-learn.org/stable/modules/clustering.html
    - We should start with K = 1 and then increase K, looking for the lowest test error.
    - A common rule of thumb is to start with K = sqrt(N)/2 (where N is the number of samples).
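
The voting step from the notes above can be sketched without any library. The coordinates below are hypothetical, chosen so that the 3 nearest neighbours of the query point are 2 BLUE and 1 ORANGE, and `knn_class_probabilities` is a helper name made up for this sketch. The last lines apply the sqrt(N)/2 rule of thumb to the 998 training rows of this notebook.

```python
import math
from collections import Counter


def knn_class_probabilities(points, labels, query, k):
    """Estimate class probabilities as the share of each class among the k nearest points."""
    nearest = sorted(zip(points, labels), key=lambda pl: math.dist(pl[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: count / k for label, count in counts.items()}


# Hypothetical coordinates: three points near the query, two far away.
points = [(1.0, 1.0), (1.5, 0.5), (0.5, 1.5), (4.0, 4.0), (5.0, 4.5)]
labels = ["BLUE", "ORANGE", "BLUE", "ORANGE", "ORANGE"]
query = (1.0, 1.2)

probs = knn_class_probabilities(points, labels, query, k=3)
prediction = max(probs, key=probs.get)  # class with the largest vote share
print(probs, "->", prediction)

# Rule-of-thumb starting value for K with N = 998 training rows.
n_train = 998
k_start = max(1, round(math.sqrt(n_train) / 2))
print("starting K:", k_start)  # starting K: 16
```

The rule of thumb only suggests where to begin; the value of K that works best still has to be checked against held-out data.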

![image](../08_notes/knn_algorithim.png)

