# Classification (02)

In this lab we will go through:
- Naive Bayes
- K-Nearest Neighbors
- Poisson Regression

For this lab, we will examine the `Smarket` data set, which contains a number of numeric variables plus a variable called `Direction` that takes the two labels `Up` and `Down`.

Our goal is to predict `Direction` using the other features.

In [None]:
library(e1071)
library(ISLR2)
attach(Smarket)

In [None]:
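# Logical vector: TRUE for observations before 2005 (the training period)
train <- (Year < 2005)

# Test data (observations from 2005)
Smarket.test <- Smarket[!train, ]
dim(Smarket.test)

# Train data
Smarket.train <- Smarket[train, ]
dim(Smarket.train)

# True labels for the test period
Direction.2005 <- Direction[!train]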

## Naive Bayes

We are using the `naiveBayes()` function, which is part of the `e1071` library.

By default, this implementation of the naive Bayes classifier models each quantitative feature using a Gaussian distribution. However, a kernel density method can also be used to estimate the distributions.

In [None]:
nb.fit <- naiveBayes(Direction ~ Lag1 + Lag2, data = Smarket,
subset = train)
nb.fit

The output contains the estimated mean and standard deviation for each variable in each class.

In [None]:
mean(Lag1[train][Direction[train] == 'Down'])
sd(Lag1[train][Direction[train] == 'Down'])
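
To make the link between these per-class means and standard deviations and the classifier itself more concrete, here is a minimal sketch of how a Gaussian naive Bayes posterior could be computed by hand. It uses simulated data rather than `Smarket`, and the names `x1`, `x2`, `y`, `class_score`, and `new_obs` are illustrative assumptions, not part of `e1071`.

```r
# Simulated two-class data (illustrative only, not Smarket)
set.seed(1)
n <- 200
y <- factor(rep(c("Down", "Up"), each = n / 2))
x1 <- rnorm(n, mean = ifelse(y == "Up", 1, -1))
x2 <- rnorm(n, mean = ifelse(y == "Up", 0.5, -0.5))

# Unnormalized posterior for one class: the class prior times the product of
# the per-feature Gaussian densities (the "naive" independence assumption)
class_score <- function(cls, new_obs) {
  in_class <- y == cls
  prior <- mean(in_class)
  prior *
    dnorm(new_obs[["x1"]], mean(x1[in_class]), sd(x1[in_class])) *
    dnorm(new_obs[["x2"]], mean(x2[in_class]), sd(x2[in_class]))
}

new_obs <- c(x1 = 0.8, x2 = 0.3)
scores <- sapply(levels(y), class_score, new_obs = new_obs)
posterior <- scores / sum(scores)  # normalize so the probabilities sum to 1
posterior
names(which.max(posterior))        # predicted class
```

The class with the largest normalized score is the prediction; the normalized scores play the same role as the class probabilities a fitted model reports.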

In [None]:
nb.class = predict(nb.fit, Smarket.test)
table(nb.class, Direction.2005)

mean(nb.class == Direction.2005)

`Naive Bayes` performs very well on this data, with accurate predictions over `59%` of the time. This is slightly worse than `QDA`, but much better than `LDA`.
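
The accuracy computed with `mean()` is the share of test observations that fall on the diagonal of the confusion table. A minimal sketch with made-up labels (the vectors `truth` and `pred` are hypothetical, not taken from `Smarket`):

```r
# Toy predictions and true labels (made up for illustration)
truth <- factor(c("Up", "Down", "Up", "Up",   "Down", "Down", "Up", "Down"))
pred  <- factor(c("Up", "Up",   "Up", "Down", "Down", "Down", "Up", "Up"))

conf_mat <- table(pred, truth)
conf_mat

# Correct predictions sit on the diagonal of the confusion matrix
acc_from_table <- sum(diag(conf_mat)) / sum(conf_mat)
acc_direct <- mean(pred == truth)
c(acc_from_table, acc_direct)
```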

The `predict()` function can also generate estimates of the probability that each observation belongs to a particular class.

In [None]:
nb.preds = predict(nb.fit, Smarket.test, type = "raw")
nb.preds[1:5, ]
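
Class labels follow from such probabilities by picking, for each row, the class with the largest posterior. The sketch below illustrates this on a small hand-written matrix; `post` and its values are hypothetical and only mimic the layout of `nb.preds`.

```r
# Hypothetical posterior probabilities: one row per observation,
# one column per class
post <- matrix(c(0.49, 0.51,
                 0.52, 0.48,
                 0.45, 0.55),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("Down", "Up")))

# Assign each observation to the class with the largest posterior
labels <- colnames(post)[apply(post, 1, which.max)]
labels

# With two classes this is the same as thresholding P(Up) at 0.5
thresholded <- ifelse(post[, "Up"] > 0.5, "Up", "Down")
thresholded
```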

## K-Nearest Neighbors

We will use the `knn()` function, which is part of the `class` library. Rather than a two-step approach in which we first fit the model and then use it to make predictions, `knn()` forms predictions using a single command.

The function requires four inputs:
1. A matrix containing the predictors associated with the training data, labeled `train.X` below.
2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled `test.X` below.
3. A vector containing the class labels for the training observations, labeled `train.Direction` below.
4. A value for `K`, the number of nearest neighbors to be used by the classifier.
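
To show what happens with these four inputs, here is a from-scratch sketch of a K-nearest-neighbors classifier on a tiny made-up data set. The helper `knn_by_hand` and the `toy_*` objects are hypothetical and are not part of the `class` library; unlike `knn()`, this sketch resolves tied votes by taking the first class rather than choosing at random.

```r
# A from-scratch k-NN classifier for illustration only
knn_by_hand <- function(train_x, test_x, train_labels, k) {
  apply(test_x, 1, function(pt) {
    # Euclidean distance from this test point to every training point
    d <- sqrt(rowSums(sweep(train_x, 2, pt)^2))
    nearest <- order(d)[1:k]
    votes <- table(train_labels[nearest])
    names(votes)[which.max(votes)]  # majority vote (first class wins ties)
  })
}

# Tiny made-up data set: two well-separated clusters
toy_train_x <- rbind(c(0, 0), c(0, 1), c(1, 0),
                     c(5, 5), c(5, 6), c(6, 5))
toy_train_labels <- factor(c("Down", "Down", "Down", "Up", "Up", "Up"))
toy_test_x <- rbind(c(0.5, 0.5), c(5.5, 5.5))

toy_pred <- knn_by_hand(toy_train_x, toy_test_x, toy_train_labels, k = 3)
toy_pred
```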

In [None]:
library(class)

In [None]:
train.X = cbind(Lag1, Lag2)[train, ] #cbind() is short for column bind, binds variables together
test.X = cbind(Lag1, Lag2)[!train, ]
train.Direction = Direction[train]

We set a random `seed` before we apply `knn()` because if several observations are tied as nearest neighbors, then `R` will randomly break the tie.

In [None]:
set.seed(1)
knn.pred = knn(train.X, test.X, train.Direction, k=1)

table(knn.pred, Direction.2005)
mean(knn.pred==Direction.2005) #performance

The results using `K = 1` are not very good, since only `50%` of the observations are correctly predicted. Of course, it may be that `K = 1` results in an overly flexible fit to the data.

In [None]:
# Return the k-NN predictions for the test data using three neighbors (K = 3)
knn.pred = function(){
    # your code here
    
}
knn.pred = knn.pred()

In [None]:
table(knn.pred, Direction.2005)

#Test the performance of new model
stopifnot(round(mean(knn.pred == Direction.2005),2) == 0.54)

In [None]:
knn.pred = knn(train.X, test.X, train.Direction, k=4)
mean(knn.pred == Direction.2005)

We can see that the results have improved slightly when we increase the value of `K` from `1` to `3`. But increasing `K` further turns out to provide no further improvements. 

It appears that for this data, QDA provides the best results of the methods that we have examined so far.

## Poisson Regression

We will use the `glm()` function with the argument `family = poisson` to define a Poisson regression model.
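
As a minimal sketch of what such a model estimates, the example below simulates counts whose log-mean is linear in a single predictor and checks that `glm()` recovers the coefficients. The data and the names `x`, `counts`, `true_beta`, and `sim.pois` are assumptions made for illustration and are unrelated to the data set used in this lab.

```r
# Simulated counts whose log-mean is linear in x (illustrative only)
set.seed(1)
n <- 5000
x <- runif(n)
true_beta <- c(0.5, 1.2)
counts <- rpois(n, lambda = exp(true_beta[1] + true_beta[2] * x))

sim.pois <- glm(counts ~ x, family = poisson)
coef(sim.pois)  # should be close to true_beta

# Coefficients act on the log scale, so exp(coef) is the multiplicative
# change in the expected count for a one-unit increase in x
exp(coef(sim.pois)["x"])
```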

We will fit a Poisson regression model to the `Bikeshare` data set found in the `ISLR2` library, which measures the number of bike rentals (`bikers`) per hour in Washington, DC.

In [None]:
attach(Bikeshare) #attaching the data set to R's context

In [None]:
dim(Bikeshare)
names(Bikeshare)

In [None]:
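# Sum-to-zero contrasts: the last level's coefficient is minus the sum of the others
contrasts(Bikeshare$hr) = contr.sum(24)
contrasts(Bikeshare$mnth) = contr.sum(12)
mod.pois = glm(bikers ~ mnth + hr + workingday + temp + weathersit, 
               data = Bikeshare, family = poisson)
summary(mod.pois)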

We will plot the coefficients associated with `mnth` and `hr` to visualize them.

In [None]:
coef.mnth <- c(coef(mod.pois)[2:12], -sum(coef(mod.pois)[2:12]))


plot(coef.mnth, xlab = "Month", ylab = "Coefficient", 
     xaxt = "n", col = "blue", pch = 19, type = "o")
axis(side = 1, at = 1:12, 
     labels = c("J", "F", "M", "A", "M", "J", "J", "A", "S", "O", "N", "D"))

coef.hours <- c(coef(mod.pois)[13:35], -sum(coef(mod.pois)[13:35]))
plot(coef.hours, xlab = "Hour", ylab = "Coefficient", col = "blue", pch = 19, type = "o")
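
Appending `-sum(...)` as the coefficient of the last level is only valid when the factor is coded with sum-to-zero contrasts (`contr.sum`); under R's default treatment contrasts the omitted level is the first one and its effect is zero. The sketch below checks the sum-to-zero identity on simulated data; `g`, `counts`, `fit.sum`, and `level_effects` are hypothetical names used only here.

```r
# Simulated counts for a three-level factor (illustrative only)
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 50))
counts <- rpois(150, lambda = c(a = 2, b = 5, c = 9)[as.character(g)])

# Fit with sum-to-zero coding; only the first two level effects are reported
fit.sum <- glm(counts ~ g, family = poisson,
               contrasts = list(g = "contr.sum"))

# Recover the effect of the last level as minus the sum of the others
level_effects <- c(coef(fit.sum)[2:3], -sum(coef(fit.sum)[2:3]))
names(level_effects) <- levels(g)
level_effects

# Check: exp(intercept + effect) reproduces the mean count of each group
fitted_means <- exp(coef(fit.sum)[1] + level_effects)
group_means <- tapply(counts, g, mean)
all.equal(as.numeric(fitted_means), as.numeric(group_means), tolerance = 1e-6)
```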

We can once again use the `predict()` function to obtain the fitted values (predictions) from this Poisson regression model.

In [None]:
mod.pred = predict(mod.pois, type = "response")
summary(mod.pred)
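
The argument `type = "response"` returns predictions on the scale of the expected count; `type = "link"` returns the linear predictor on the log scale. A small sketch on simulated data (the objects `x`, `counts`, and `sim.fit` are assumptions for illustration) shows how the two are related:

```r
# Simulated data (illustrative only)
set.seed(1)
x <- runif(200)
counts <- rpois(200, lambda = exp(1 + x))
sim.fit <- glm(counts ~ x, family = poisson)

eta <- predict(sim.fit, type = "link")      # linear predictor, log scale
mu  <- predict(sim.fit, type = "response")  # expected counts

all.equal(exp(eta), mu)  # the two scales differ only by exp()
all(mu > 0)              # fitted Poisson means are always positive
```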