# Classification

## Terms

**Binary Classification** - A classification where only two classes are involved. <br>
**Multiclass Classification** - A classification where more than two classes are involved.

## Important Packages

- ```forcats```
    - forcats package enables us to easily manipulate factors in R.
    - factors are a special categorical type of variable in R that are often used for class label data.
- ```tidymodels```
    - a collection of modelling packages; it includes ```parsnip```, which provides the interface to the K-nearest neighbors algorithm.
    - the tidymodels collection also provides ```workflows```.
- ```parsnip```
    - part of the ```tidymodels``` metapackage (loaded by ```library(tidymodels)```).
    - provides ```nearest_neighbor()```, the model specification for K-nearest neighbors; the computation itself is done by an engine such as ```kknn```.

## Basic Functions

### Dealing with dataframe

- ```glimpse(dataframe)```
    - previews the dataframe, making it easier to inspect the data.
- ```factor(col_name, levels = c(..., ..., ...))``` 
    - encodes a vector as a factor; allows you to specify the levels and whether they are ordered.
    - first argument is the column you want to convert.
    - second argument (```levels```) lists the values/categories/levels in the order you want them.
- ```as.factor()```
    - simply coerces an existing vector to a factor, if possible.
- ```as_factor(column)``` 
    - used with mutate, and turns a vector of another type (for example, character) into type factor.
    - converts the column/variable into a statistical categorical variable.
- ```add_row(df, col_name_1 = ..., col_name_2 = ..., ..., col_name_n = ...)```
    - creates and adds a row/observation to the df
    - specify the name and respective values of each column of the df in the argument.
- ```levels(vector)```
    - factors have "levels" which we can think of as categories
    - returns the name of each category in a column
    - requires a vector as an argument (might need to use ```pull()```)
- ```pull(dataframe, column)```
    - pull allows us to extract a specific column.
- ```dist()```
    - finds the Euclidean distance between the specified observations of the dataframe.
    - used with ```slice()``` to first obtain the rows; the result is then piped into ```dist()```.
    - if there are more than 2 rows, the result holds the distance between each pair of rows; pipe it into ```as.matrix()``` to view it as a matrix.
- ```bind_cols(col_object, df)```
    - binds the column (vector) in argument 1 to a dataframe in argument 2.
- ```rename(df, new_col_name = old_col_name)```
    - renames the column name
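
The factor-related functions in the list above can be tried out on a small vector. This is a minimal base R sketch; the label values ```"M"``` and ```"B"``` and the variable names are made up for illustration.

```r
# Hypothetical class labels stored as plain character strings
labels_chr <- c("M", "B", "B", "M", "B")

# factor() lets us choose the levels and their order explicitly
labels_fct <- factor(labels_chr, levels = c("M", "B"))
levels(labels_fct)   # "M" "B"

# as.factor() takes no levels argument; for character input the
# levels come out in sorted order
labels_auto <- as.factor(labels_chr)
levels(labels_auto)  # "B" "M"
```

Choosing the level order explicitly matters later, because functions that report on a factor list its categories in level order.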
    
    
```r
dist_two_rows <- df %>%
                 slice(1, 2) %>%
                 select(col1, col2, col3) %>%
                 dist()
```

***NOTE: ```slice(1,5)``` slices row 1 and 5, ```slice(1:5)``` slices row 1 to 5***
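
To see what ```dist()``` and ```as.matrix()``` return when there are more than two rows, here is a small base R sketch. The data frame ```toy``` and its values are hypothetical, chosen so the distances are easy to check by hand.

```r
# Hypothetical toy data: three observations, two numeric columns
toy <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))

# dist() returns a compact "dist" object with the pairwise Euclidean distances
toy_dist <- dist(toy)

# as.matrix() expands it into a full symmetric matrix with zeros on the diagonal
toy_dist_matrix <- as.matrix(toy_dist)
toy_dist_matrix

# rows 1 and 2 are (0, 0) and (3, 4), so their distance is sqrt(3^2 + 4^2) = 5
toy_dist_matrix[1, 2]
```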

### Scatter Plot

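```scale_color_manual(labels = c("1", "2"), values = c("orange2", "steelblue2"))``` - Sets the legend labels and the colors used for each level of the factor mapped to color, so the classes can be told apart when the predictor variables are plotted against each other.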

<img src="media/scatter_plot_scale_color_manual.png" width="200px">

## Classification with *K*-nearest neighbors

The *K*-nearest neighbors classifier finds the *K* "nearest" or "most similar" observations in our training set, and then uses their labels (by majority vote) to make a prediction for the new observation's label.

<div style="display:flex; flex-direction:row;">
    <div>
        <p>When K = 1:</p>
        <img src="media/k = 1.png" width="400px">
    </div>
    <div style="margin-left: 50px">
        <p>When K = 3:</p>
        <img src="media/k = 3.png" width="400px">
    </div>
</div>

### Distance between points

The distance from one point to another can be calculated by: $\sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_m - b_m)^2}$

To get the distances between our new observation and each of the observations in the training set to find *K = 5* neighbors:

```R
cancer |>
  select(ID, Perimeter, Concavity, Symmetry, Class) |>
  mutate(dist_from_new = sqrt((Perimeter - new_obs_Perimeter)^2 + 
                              (Concavity - new_obs_Concavity)^2 +
                                (Symmetry - new_obs_Symmetry)^2)) |>
  arrange(dist_from_new) |>
  slice(1:5) # take the first 5 rows
```
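
Once the *K* nearest rows are found, the prediction is the most common label among them. The following is a base R sketch of that whole procedure, not the ```tidymodels``` implementation; the data frame ```train```, the point ```new_obs``` and all values are hypothetical.

```r
# Hypothetical training data: two predictors and a class label
train <- data.frame(
  x     = c(1, 2, 3, 6, 7, 8),
  y     = c(1, 2, 1, 6, 7, 6),
  class = c("B", "B", "B", "M", "M", "M")
)
new_obs <- c(x = 2, y = 2)  # hypothetical new observation
k <- 3

# Euclidean distance from the new observation to every training row
d <- sqrt((train$x - new_obs[["x"]])^2 + (train$y - new_obs[["y"]])^2)

# labels of the k nearest rows, then the most common label among them
nearest_labels <- train$class[order(d)[1:k]]
predicted <- names(which.max(table(nearest_labels)))
predicted  # "B": the three closest rows are all labelled "B"
```

Note that ```which.max()``` breaks ties by taking the first label in table order, so an odd *K* is a simple way to avoid ties in binary problems.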

## *K*-neighbors Classification Using ```tidymodels```

```library(tidymodels)```

### Creating a model

We create a ***model specification*** for *K*-nearest neighbors:

```r
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
            set_engine("kknn") %>%
            set_mode("classification")
```

- ```weight_func = "rectangular"``` - specifies that each of the *K* nearest neighbors gets an equal vote, regardless of how far away it is.
- ```neighbors = 5``` - specifies we want *K* = 5
- ```set_engine("kknn")``` - specifies that we want to use the ```kknn``` package to train the model.
- ```set_mode("classification")``` - specifies that this is a classification problem.

### Fitting the data to the model

In order to fit the model to the data, we need to pass the model specification and the data set to the ```fit``` function.

```R
knn_fit <- knn_spec %>%
           fit(label ~ predictor + predictor, data = dataframe)
```

- ```label ~ predictor + predictor``` - specifies the label and the predictors.

NOTE: You can use ```label ~ .``` to indicate that we want to use every variable **except** ```label``` (the response) as a predictor.

### Predicting a new observation

To predict a new observation, we use the ```predict()``` function.

```R
new_obs <- tibble(Perimeter = 0, Concavity = 3.5)
predict(knn_fit, new_obs)
```

## Data preprocessing with ```tidymodels```

### Centering and Scaling

<div style="display:flex; flex-direction:row; width: 900px;">
    <p>
        Imagine a dataset with salary and years of education. When computing the neighbor distances, a difference of $1000 in salary is numerically huge compared to a difference of 10 years of education; however, conceptually, we understand that it's the opposite. 10 years of education is <b><i>huge</i></b> compared to a difference of one thousand dollars in yearly salary.
To scale and center our data, we need to find each variable's mean and standard deviation. For each observed value of the variable, we subtract the mean and divide by the standard deviation. When we do this, the data is said to be <i>standardized</i>, and all variables in a data set will have a mean of 0 and a standard deviation of 1.
    </p>
    <img src="media/scaled_vs_unscaled.png" width="300px;" style="margin-left: 100px;">
</div>
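
The standardization described above can be checked by hand on one variable. This is a base R sketch with made-up salary values, separate from the ```recipe``` approach used in these notes.

```r
# Hypothetical yearly salaries
salary <- c(40000, 52000, 61000, 75000, 98000)

# subtract the mean, then divide by the standard deviation
salary_standardized <- (salary - mean(salary)) / sd(salary)

# after standardizing, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding)
mean(salary_standardized)
sd(salary_standardized)
```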


In ```tidymodels``` all data preprocessing happens using a ```recipe```.

```R
some_recipe <- recipe(label ~ ., data = unscaled_dataframe) %>%
               step_scale(all_predictors()) %>%
               step_center(all_predictors()) %>%
               prep()
```

- ```step_scale``` and ```step_center``` - two separate recipe steps: ```step_scale``` divides each predictor by its standard deviation, and ```step_center``` subtracts each predictor's mean.
- ```prep``` - calculates the standard deviations and means required to scale and center the data. If you print the recipe before calling ```prep()```, it lists the preprocessing steps it will take.

Notably, there are selectors other than ```all_predictors()``` that can be used:
- ```all_nominal()``` and ```all_numeric()``` - specifies all categorical or numeric variables.
- ```all_predictors()``` and ```all_outcomes()``` - specifies all predictor or target variables.
- ```Area, Smoothness``` - just specifying the variables that should be scaled.
- ```-Class``` - everything except for this variable.


***NOTE: if some of our predictors are not numbers (qualitative), then we will need to use ```all_numeric_predictors()```***

To finish scaling the data, we use the ```bake``` function.

```r
scaled_dataframe <- bake(some_recipe, unscaled_dataframe)
```

```bake()``` - applies the result of ```prep()``` to the unscaled dataframe and stores the output in ```scaled_dataframe```.

### Balancing

***Class imbalance*** - where one label is much more common than the other.

Class imbalance is a problem because the *K*-nearest neighbor algorithm uses the labels of nearby points to predict the label of a new point; therefore, if there are many more data points with one label overall, the algorithm is likely to pick that label in general.

To fix this problem, we will ***oversample*** the rare class, replicating rare observations multiple times in our data set to give them more voting power in the *K*-nearest neighbor algorithm.

```R
library(themis)

ups_recipe <- recipe(Class ~ ., data = rare_cancer) %>%
              step_upsample(Class, over_ratio = 1, skip = FALSE) %>%
              prep()

upsampled_cancer <- bake(ups_recipe, rare_cancer)
```
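
To make the idea of oversampling concrete, here is a base R sketch that replicates rare-class labels until both classes have the same count. It only illustrates the concept; it is not how ```step_upsample()``` is implemented, and the labels and variable names are hypothetical.

```r
set.seed(1)

# Hypothetical imbalanced labels: 8 benign, 2 malignant
class_labels <- c(rep("B", 8), rep("M", 2))

# positions of the rare class, and how many extra copies are needed
rare_idx <- which(class_labels == "M")
n_extra  <- sum(class_labels == "B") - length(rare_idx)

# draw rare observations with replacement to fill the gap
extra_idx <- rare_idx[sample.int(length(rare_idx), n_extra, replace = TRUE)]
balanced  <- c(class_labels, class_labels[extra_idx])

table(balanced)  # both classes now have 8 observations
```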

<img src="media/class_imbalance.png" width="700px;">


***NOTE:*** The ```prep()``` function makes the calculations, while the ```bake()``` function applies them to a dataframe and returns the processed data.

### Putting it together in a ```workflow```

Workflows allow us to chain together multiple data analysis steps without manually running the intermediate steps (such as ```prep()``` and ```bake()```).

```R
# load the unscaled cancer data 
# and make sure the target Class variable is a factor
unscaled_cancer <- read_csv("data/unscaled_wdbc.csv") |>
  mutate(Class = as_factor(Class))

# create the KNN model
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
  set_engine("kknn") |>
  set_mode("classification")

# create the centering / scaling recipe
uc_recipe <- recipe(Class ~ Area + Smoothness, data = unscaled_cancer) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Using a workflow to chain together the data analysis steps
knn_fit <- workflow() |>
  add_recipe(uc_recipe) |>
  add_model(knn_spec) |>
  fit(data = unscaled_cancer)

# predicting
prediction <- predict(knn_fit, new_observation)

```

## Evaluating Accuracy

To understand how well our classifier performs, we can start by splitting the data into a ***training set*** and a ***testing set***, and only use the training set when building the classifier. Then, to evaluate the accuracy of the classifier, we set aside the true labels from the test set, use the classifier to predict the labels of the test set, and compare the predictions with the true labels.

$prediction \space accuracy = \frac{number \space of \space correct \space predictions}{total \space number \space of \space predictions}$
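
As a quick check of the formula, prediction accuracy is the share of predictions that match the true labels. A base R sketch with hypothetical label vectors:

```r
# Hypothetical true labels and predictions for five test observations
truth      <- c("B", "B", "M", "M", "B")
prediction <- c("B", "M", "M", "M", "B")

# number of correct predictions divided by total number of predictions
accuracy <- sum(prediction == truth) / length(truth)
accuracy  # 4 correct out of 5 = 0.8
```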

### Randomness and Seeds

#### Random Numbers

R's ```sample(num:num, number_in_list, replace = TRUE)``` function can get us a vector of random numbers.

```r
random_nums <- sample(0:9, 10, replace = TRUE)
random_nums

# example output (differs between runs because no seed is set):
# [1] 4 9 5 9 6 8 4 4 8 8
```

Though the numbers look random, they are determined by a seed value. We can set the seed value using ```set.seed(some_number)```, which makes ```sample```'s output reproducible.

```r
set.seed(1)
random_nums1 <- sample(0:9, 10, replace = TRUE)
```
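
A short base R sketch of why the seed matters: resetting the same seed before each call reproduces the same draw. The variable names are made up for illustration.

```r
# same seed, same draw
set.seed(1)
first_draw <- sample(0:9, 10, replace = TRUE)

set.seed(1)
second_draw <- sample(0:9, 10, replace = TRUE)

identical(first_draw, second_draw)  # TRUE
```

Set the seed once at the start of an analysis rather than before every random call, so that the whole analysis is reproducible from top to bottom.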

### Evaluating Accuracy with ```tidymodels```

#### Splitting dataset to training and testing

The ```initial_split``` function handles the procedure of splitting the data. It ***shuffles*** the data before splitting (ensures the ordering present in the data does not influence the data that ends up in the training/testing sets), and ***stratifies*** the data by the class label (ensures that roughly the same proportion of each class is present in the training and testing set).

```r
dataset_split <- initial_split(data_set, prop = 0.75, strata = label)
dataset_train <- training(dataset_split)
dataset_test <- testing(dataset_split)
```

- ```prop``` specifies the proportion of the dataset that ends up in the training set.
- ```strata``` specifies the categorical variable (the label) to stratify by.
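
To illustrate what shuffling and stratifying mean, here is a base R sketch of a stratified 75/25 split on hypothetical labels. It is only a conceptual illustration; the exact procedure used by ```initial_split``` may differ.

```r
set.seed(1)

# Hypothetical labels: 12 benign, 8 malignant
labels <- c(rep("B", 12), rep("M", 8))
prop   <- 0.75

# within each class, randomly pick 75% of the row indices for training
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(prop * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# both sets keep the original 60% / 40% class proportions
table(labels[train_idx])  # 9 "B" and 6 "M"
table(labels[test_idx])   # 3 "B" and 2 "M"
```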

#### Dealing with the label in the testing set

After creating the predictions, we can use ```bind_cols(dataset)``` to merge the original testing set with the set of predictions.
```r
dataset_test_predictions <- predict(knn_fit, dataset_test) %>%
                            bind_cols(dataset_test)
```

or

```r
dataset_test_predictions <- dataset_test %>%
                            bind_cols(predict(knn_fit, dataset_test))
```

<img src="media/predicting_labels.png" width="400px">

#### Computing the Accuracy

Using the ```metrics``` function, we can assess the quality of our model.

```r
dataset_test_predictions %>%
    metrics(truth = label, estimate = .pred_class) %>%
    filter(.metric == "accuracy")
```

- ```truth``` - Argument specifies the column in ```dataset_test_predictions``` that holds the true labels.
- ```estimate``` - Argument specifies the column that holds the predictions the model generated.

<img src="media/classifier_metrics.png" width="400px">

We can also look at the *confusion matrix* for the classifier, which shows the table of predicted labels and true labels, using ```conf_mat```:

```r
confusion <- dataset_test_predictions %>%
             conf_mat(truth = label, estimate = .pred_class)
```

<img src="media/conf_mat.png" width="500px">

The confusion matrix is essentially a classification table: the columns represent the actual class and the rows represent the predicted class.

<img src="media/confusion_matrix.png" width="400px">

- A **true positive** is an outcome where the model correctly predicts the positive class.
- A **true negative** is an outcome where the model correctly predicts the negative class.
- A **false positive** is an outcome where the model incorrectly predicts the positive class.
- A **false negative** is an outcome where the model incorrectly predicts the negative class.
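
The four outcomes can be read off a confusion table. Below is a base R sketch using ```table()``` with rows as predictions and columns as true labels; the label vectors are hypothetical, and treating ```"M"``` as the positive class is an assumption made for this example.

```r
# Hypothetical true labels and predictions; "M" is treated as the positive class
truth <- factor(c("M", "M", "M", "B", "B", "B", "B", "B"), levels = c("M", "B"))
pred  <- factor(c("M", "M", "B", "B", "B", "B", "M", "B"), levels = c("M", "B"))

# rows = predicted class, columns = actual class
confusion <- table(Prediction = pred, Truth = truth)
confusion

true_positive  <- confusion["M", "M"]  # predicted M, actually M: 2
false_negative <- confusion["B", "M"]  # predicted B, actually M: 1
false_positive <- confusion["M", "B"]  # predicted M, actually B: 1
true_negative  <- confusion["B", "B"]  # predicted B, actually B: 4
```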

#### Critically Analyze Performance

A *good* value for the accuracy depends on the application. For instance, if you are predicting whether a tumor is benign or malignant when it is benign 99% of the time, it is very easy to obtain a 99% accuracy just by guessing benign for each observation. It is also important to note the kind of mistakes the classifier is making. For instance, if it identifies the tumor as benign when it is malignant, this might mean the patient is not receiving enough medical care.

You can compare your classifier to the ***majority classifier*** (which always guesses the majority class label from the training data, regardless of the predictor variables' values). This gives you a baseline when considering accuracies. If the majority classifier obtains a 90% accuracy on a problem, then you want your classifier to do better than that.
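
The majority-classifier baseline is easy to compute by hand. A base R sketch with hypothetical training and test labels:

```r
# Hypothetical labels
train_labels <- c(rep("B", 9), rep("M", 1))
test_labels  <- c(rep("B", 8), rep("M", 2))

# the majority classifier always predicts the most common training label
majority_class <- names(which.max(table(train_labels)))

# its accuracy on the test set is the share of test labels equal to that class
baseline_accuracy <- mean(test_labels == majority_class)
baseline_accuracy  # 8 of 10 test labels are "B", so 0.8
```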

## Tuning the Classifier

### Cross-Validation

- ```vfold_cv(training_dataframe, v = ..., strata = target_column)```
    - This function splits our training data into $V$ folds automatically.
    - This is to be done after the data has been split into ***training*** and ***test*** sets.
    - Cross-validation uses a random process to select how to partition the training data. Use ```set.seed()``` to make it reproducible.
- ```fit_resamples(..., resamples = df_vfold)```
    - It is used instead of ```fit()``` when doing cross-validation ***for a single, fixed number of neighbors***.
    - This runs cross-validation on each train/validation split
    - first argument is the ```workflow()```, which is piped in.
- ```tune_grid(..., resamples = df_vfold, grid = n)```
    - used instead of ```fit_resamples()``` when doing cross-validation over several candidate numbers of neighbours.
    - fits the model for each value in a range of parameter values
    - third argument (```grid```) specifies which values of $K$ to try: an integer $n$ means at most $n$ values are tried, and a data frame lists the values explicitly.
    - first argument is the ```workflow()``` which is piped in.
    - We set the seed prior to tuning to ensure results are reproducible.
- ```collect_metrics(...)```
    - used instead of ```metrics()``` function when doing cross-validation.
    - used to aggregate the mean and standard error of the classifier's validation accuracy across the folds.
    - argument is the result of ```fit_resamples()``` or ```tune_grid()```, which is piped in.
- ```tune()```
    - each parameter in the model to be tuned should be specified as ```tune()``` in the model specification rather than given a particular value.

In cross-validation, we split our overall training data into $C$ evenly sized chunks. Then, iteratively use 1 chunk as the validation set and combine the remaining $C-1$ chunks as the training set. Here, $C = 5$ different chunks of the data are used, resulting in 5 different choices for the validation set; this is called 5-fold cross-validation.

$cross \space validation \space accuracy = \frac{accuracy_1 + accuracy_2 + accuracy_3 + accuracy_4 + accuracy_5}{folds}$
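
The fold-by-fold procedure can be sketched in base R. To keep the example self-contained, a majority-class guess stands in for the *K*-nearest neighbors classifier; the labels and variable names are hypothetical.

```r
set.seed(1)

# Hypothetical training labels: 14 benign, 6 malignant
labels  <- c(rep("B", 14), rep("M", 6))
n_folds <- 5

# randomly assign each observation to one of 5 equally sized folds
fold_id <- sample(rep(1:n_folds, length.out = length(labels)))

# each fold takes one turn as the validation set; the rest is used for training
fold_accuracy <- sapply(1:n_folds, function(f) {
  train_labels <- labels[fold_id != f]
  valid_labels <- labels[fold_id == f]
  majority <- names(which.max(table(train_labels)))  # stand-in classifier
  mean(valid_labels == majority)
})

# cross-validation accuracy is the average of the per-fold accuracies
cv_accuracy <- mean(fold_accuracy)
cv_accuracy
```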

<img src="media/cross_validation.png" width="400px">

To split our training data into v folds, we use ```vfold_cv```:
```r
dataset_vfold <- vfold_cv(dataset_train, v = folds, strata = label)
dataset_vfold
```

The ```strata``` argument makes sure that each fold contains roughly the same proportion of each class as the full training set (e.g. if the label is ```am```, every fold would have a similar share of ```am```s and not-```am```s).

<img src="media/vfold.png" width="400px">

Then, to run the cross-validation on each train/validation split, we use ```fit_resamples``` function instead of ```fit``` in the workflow.

```r
dataset_recipe <- recipe(label ~ predictors, data = dataset_train) %>%
                  step_scale(all_predictors()) %>%
                  step_center(all_predictors())

knn_fit <- workflow() %>%
           add_recipe(dataset_recipe) %>%
           add_model(knn_spec) %>%
           fit_resamples(resamples = dataset_vfold)
```

We will then use ```collect_metrics``` to aggregate the mean and standard error.

```r
knn_fit %>% 
    collect_metrics()
```

<img src="media/metrics_of_cross_validation.png" width="400px">

We can choose any number of folds, and typically the more folds we use, the better our accuracy estimate will be (lower standard error); however, more folds also means more computation, and hence more time to run the analysis. So, we usually choose 5 or 10 for $C$.

### Using the classifier to choose the value $K$

We can use ```tune()``` in the parameters of the model to tune the model.

```r
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")
```

We create a data frame of the possible $K$ values to try using ```seq()``` and pass it to the ```grid``` argument of ```tune_grid```.

```r
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))

knn_results <- workflow() |>
               add_recipe(dataset_recipe) |>
               add_model(knn_spec) |>
               tune_grid(resamples = dataset_vfold, grid = k_vals) |>
               collect_metrics()

accuracies <- knn_results |>
  filter(.metric == "accuracy")

accuracies
```

```tune_grid()``` computes a set of performance metrics (accuracy or RMSE) for a predefined set of tuning parameters that correspond to a model or recipe across one or more resamples of the data.


<img src="media/choosing_k_1.png" width="500px">


Then, we can plot the accuracy against $K$ to see which values of $K$ give the best accuracy.

<img src="media/choosing_k_2.png" width="200px">

When selecting $K$, we are looking for a value where:
- we get roughly optimal accuracy, so that our model will be accurate.
- changing the value to a nearby one doesn't decrease accuracy too much, so that our choice is reliable in the presence of uncertainty.
- the cost of predicting isn't prohibitive (if $K$ is too large, predicting becomes expensive).

### Under/Overfitting

When you keep increasing $K$, the accuracy of the classifier starts decreasing. 

<img src="media/k_too_high.png" width="200px">

**Underfitting**: If the model *isn't influenced enough* by the training data, it is said to **underfit** the data. As we increase the number of neighbors, more and more of the training observations (and those that are farther away) get a "say" in what the class of a new observation is. This causes an "averaging effect" to take place.

**Overfitting**: In contrast, when we decrease the number of neighbours, each individual data point has a stronger and stronger vote regarding nearby points. Since the data is noisy, this causes a more "jagged" boundary corresponding to a *less simple* model. In the extreme, when $K = 1$, the classifier is essentially matching each new observation to its closest neighbor in the training set.

<img src="media/under_overfitting.png" width="500px">
<img src="media/over_underfitting.png" width="500px">

### Summary

<img src="media/tuning_summary.png" width="500px">

The overall workflow is as follows:
1. First read the data into R and apply ```as_factor()``` on the column/variable you want to choose as your target variable.
2. Use the ```initial_split``` function to split the data into a training and test set. Set the ```strata``` argument to the class label variable. Put the test set aside for now.
3. Use the ```vfold_cv``` function to split up the ***training data*** for cross-validation.
4. Create a ```recipe``` that specifies the class label and predictors, as well as preprocessing steps for all variables. Pass the training data as the data argument of the recipe.
5. Create a ```nearest_neighbor``` model specification, with ```neighbors = tune()```.
6. Add the ```recipe``` and model specification to a ```workflow()```, and use the ```tune_grid``` function on the train/validation splits to estimate the classifier accuracy for a range of $K$ values.
> ```tune_grid()``` and ```fit_resamples()``` are used for ***cross-validation on the training data***, whereas ```fit()``` trains the final model, which is then evaluated on the ***test*** set. <br>
> ```fit_resamples()``` can only be used when you want to estimate the performance ***for one specified number of neighbours***. <br>
> ```tune_grid()``` is a much better alternative as you can compare the performance of different numbers of neighbours by performing cross-validation for each one.
7. Pick a value of $K$ that yields a high accuracy estimate that doesn’t change much if you change $K$ to a nearby value.
8. Make a new model specification for the best parameter value (i.e., $K$), and retrain the classifier using the ```fit``` function.
9. Evaluate the estimated accuracy of the classifier on the test set using the ```predict``` function.

**Strengths:** $K$-nearest neighbors classification:
1. Simple and intuitive
2. Requires few assumptions
3. Works for binary and multi-class classification problems.

**Weaknesses:** $K$-nearest neighbors classification:
1. Becomes very slow as the training data gets larger.
2. May not perform well with a large number of predictors.
3. May not perform well when classes are imbalanced.

## Predictor Variable Selection

With more irrelevant predictors, the model accuracy estimate decreases.

<img src="media/model_accuracy_and_irrelevant_predictors.png" width="400px">

Note that it still outperforms the baseline majority classifier because the tuning procedure for the $K$-nearest neighbors classifier combats the extra randomness from the irrelevant variables by increasing the number of neighbors.

#### Finding a good subset of predictors (Method 1 - Best subset selection)

1. Create a separate model for every possible subset of predictors.
2. Tune each one using cross-validation.
3. Pick the subset of predictors that gives you the highest cross-validation accuracy.

This is really expensive when it comes to computing power.

#### Finding a good subset of predictors (Method 2 - Forward Selection)

1. Start with a model with no predictors
2. Run the following 3 steps until you run out of predictors.
    1. For each unused predictor, add it to the model to form a *candidate* model.
    2. Tune all of the candidate models.
    3. Update the model to be the candidate model with the highest cross-validation accuracy.
3. Select the model that provides the best trade-off between accuracy and simplicity.

Starting with $m$ total predictors, in the first iteration, you make $m$ candidate models, each with 1 predictor. In the second iteration, you make $m-1$ candidate models, each with two predictors. This continues. You will end up training $\frac{1}{2} m (m + 1)$ separate models.
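
The model counts for the two methods can be compared with a little arithmetic. This base R sketch assumes $m = 6$ predictors (matching the six columns selected in the implementation) and uses the fact that a set of $m$ predictors has $2^m - 1$ non-empty subsets.

```r
m <- 6  # assumed number of candidate predictors

# forward selection: m candidates, then m - 1, ..., down to 1
forward_models <- sum(m:1)
forward_models                    # 21
forward_models == m * (m + 1) / 2 # TRUE, matches the formula

# best subset selection: one model per non-empty subset of predictors
best_subset_models <- 2^m - 1
best_subset_models                # 63
```

The gap widens quickly as $m$ grows, which is why forward selection is the cheaper option for larger predictor sets.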

##### Implementation

First, we use the ```select``` function to extract the "total" set of predictors that we are willing to work with.

```r
cancer_subset <- cancer_irrelevant |> 
  select(Class, 
         Smoothness, 
         Concavity, 
         Perimeter, 
         Irrelevant1, 
         Irrelevant2, 
         Irrelevant3)

names <- colnames(cancer_subset |> select(-Class))
```

The key idea is to use the ```paste``` function to create a model formula for each subset of predictors for which we want to build a model. The ```collapse``` argument tells ```paste``` what to put between the items in the list; to make a formula, we need to put a ```+``` symbol between each variable.

```r
example_formula <- paste("Class", "~", paste(names, collapse="+"))
example_formula

# [1] "Class ~ Smoothness+Concavity+Perimeter+Irrelevant1+Irrelevant2+Irrelevant3"
```
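
The string built by ```paste``` is still plain text; ```as.formula()``` turns it into a formula object that modelling functions accept. A small base R sketch with a hypothetical two-predictor subset:

```r
# Hypothetical subset of predictors
predictors <- c("Smoothness", "Concavity")

# build the formula text, then convert it to a formula object
formula_string <- paste("Class", "~", paste(predictors, collapse = "+"))
model_formula  <- as.formula(formula_string)

class(model_formula)     # "formula"
all.vars(model_formula)  # "Class" "Smoothness" "Concavity"
```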

Finally, we need to write code that performs the task of sequentially finding the best predictor to add to the model. For each set of predictors to try, we construct a model formula, pass it to a ```recipe```, build a ```workflow``` that tunes a $K$-NN classifier using 5-fold cross-validation, and finally record the estimated accuracy.

```r
# create an empty tibble to store the results
accuracies <- tibble(size = integer(), 
                     model_string = character(), 
                     accuracy = numeric())

# create a model specification
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
     set_engine("kknn") |>
     set_mode("classification")

# create a 5-fold cross-validation object
cancer_vfold <- vfold_cv(cancer_subset, v = 5, strata = Class)

# store the total number of predictors
n_total <- length(names)

# stores selected predictors
selected <- c()

# for every size from 1 to the total number of predictors
for (i in 1:n_total) {
    # for every predictor still not added yet
    accs <- list()
    models <- list()
    for (j in 1:length(names)) {
        # create a model string for this combination of predictors
        preds_new <- c(selected, names[[j]])
        model_string <- paste("Class", "~", paste(preds_new, collapse="+"))

        # create a recipe from the model string
        cancer_recipe <- recipe(as.formula(model_string), 
                                data = cancer_subset) |>
                          step_scale(all_predictors()) |>
                          step_center(all_predictors())

        # tune the KNN classifier with these predictors, 
        # and collect the accuracy for the best K
        acc <- workflow() |>
          add_recipe(cancer_recipe) |>
          add_model(knn_spec) |>
          tune_grid(resamples = cancer_vfold, grid = 10) |>
          collect_metrics() |>
          filter(.metric == "accuracy") |>
          summarize(mx = max(mean))
        acc <- acc$mx |> unlist()

        # add this result to the dataframe
        accs[[j]] <- acc
        models[[j]] <- model_string
    }
    jstar <- which.max(unlist(accs))
    accuracies <- accuracies |> 
      add_row(size = i, 
              model_string = models[[jstar]], 
              accuracy = accs[[jstar]])
    selected <- c(selected, names[[jstar]])
    names <- names[-jstar]
}
accuracies

```

<img src="media/forward_selection.png" width="600px">

In order to find the right model from the sequence, we will balance high accuracy and model simplicity. We will do that by looking for the local maxima.

<img src="media/num_predictors_graph.png" width="400px">