## Homework Week 7

1. Why does the holdout method for model selection suggest separating the data into three parts: a training set, a validation set, and a test set?
2. Given a data set (wine), split data (20% test), apply pipeline to standardize the data, classify the data using KNeighborsClassifier (n_neighbors=10), print the test accuracy.

```python
from sklearn import datasets
df = datasets.load_wine()
X = df.data
y = df.target
```

3. What is a learning curve? Based on the learning curve, how do you know whether the model is overfitting?
4. In the above data set, fit KNN using 10-fold cross validation and grid search to optimize the number of neighbors; print the optimized parameters and the test accuracy.
5. Calculate the accuracy, precision and recall based on the following confusion matrix.

| | Predicted No | Predicted Yes |
|--|--------------|---------------|
| Actual No | 50 | 10 |
| Actual Yes | 5 | 100 |

6. Read the last section in Chapter 6 of the textbook, "Dealing with class imbalance". Discuss why accuracy is not a valid metric for an imbalanced dataset. What other metrics can be used instead?

**Problem 1: Why does the holdout method for model selection suggest separating the data into three parts: a training set, a validation set, and a test set?**

The holdout method separates the data into three parts: a training set, a validation set, and a test set. The training set is used to fit the candidate models. The validation set is used to repeatedly evaluate those models and tune their hyperparameters, after which one model is selected. Because the validation set has been reused during selection, its score is an optimistic estimate of performance, so the final model is assessed once on the test set. The test set stands in for unseen data and gives an unbiased idea of how the final model performs on new data.
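As a rough illustration of the three-way split described above, here is a minimal sketch on the wine data. The 60/20/20 proportions and the variable names `X_rest`, `X_val`, and so on are assumptions chosen for the example, not something the assignment prescribes.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_wine(return_X_y=True)

# First set aside 20% as the test set; it stays untouched until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

# Then split the remainder into training and validation sets
# (25% of the remainder is roughly 20% of the full data).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```

Models would be fit on `X_train`, compared on `X_val`, and only the selected model scored on `X_test`.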

**Problem 2: Given a data set (wine), split data (20% test), apply pipeline to standardize the data, classify the data using KNeighborsClassifier (n_neighbors=10), print the test accuracy.**

```python
from sklearn import datasets
df = datasets.load_wine()
X = df.data
y = df.target
```

```python
from sklearn.model_selection import train_test_split

# split the data into an 80% training set and a 20% test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    stratify=y,
    random_state=1)
```

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# pipeline to standardize the data and apply KNeighborsClassifier
pipe = make_pipeline(StandardScaler(),
                     KNeighborsClassifier(n_neighbors=10))

# fit on the training set and print the accuracy on the test set
pipe.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe.score(X_test, y_test))
```

```
Test Accuracy: 0.944
```


**Problem 3: What is a learning curve? Based on the learning curve, how do you know whether the model is overfitting?**

A learning curve is a plot of the model's training and validation accuracies as a function of the training set size. It is an easy way to detect whether the model suffers from overfitting (high variance) or underfitting (high bias). A large gap between the training accuracy and the validation accuracy, with the training accuracy much higher, indicates overfitting. Conversely, if both the training accuracy and the validation accuracy are low, the model is underfitting.
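The numbers behind such a curve can be computed with scikit-learn's `learning_curve`. The sketch below is one way to do it for the KNN pipeline from Problem 2; the five training-set fractions and the printed table (rather than a plot) are choices made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))

# Score the pipeline on growing fractions of the training folds (10-fold CV).
# shuffle=True matters here because the wine samples are ordered by class.
train_sizes, train_scores, val_scores = learning_curve(
    estimator=pipe, X=X, y=y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=10, shuffle=True, random_state=1)

# Average over the 10 folds for each training-set size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print("n=%3d  train=%.3f  validation=%.3f  gap=%.3f" % (n, tr, va, tr - va))
```

A gap that stays large as `n` grows points to overfitting, while two low, close values point to underfitting.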

**Problem 4: In the above data set, fit KNN using 10-fold cross validation and grid search to optimize the number of neighbors; print the optimized parameters and the test accuracy.**

The following code runs a grid search over only the n_neighbors parameter to find the number of neighbors with the best cross-validated accuracy.

```python
from sklearn.model_selection import GridSearchCV

# candidate values for n_neighbors: 1 through 100
param_range = list(range(1, 101))

# parameter grid with just the n_neighbors hyperparameter
param_grid = [{'kneighborsclassifier__n_neighbors': param_range}]

# grid search with 10-fold cross-validation
gs = GridSearchCV(estimator=pipe,
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=10,
                  n_jobs=-1)

gs = gs.fit(X_train, y_train)  # fit on the training set only

# best mean cross-validated accuracy and the n_neighbors value that achieved it
print("Best Accuracy: %.3f" % gs.best_score_)
print("Optimized Number of Neighbors: %d"
      % gs.best_params_['kneighborsclassifier__n_neighbors'])
```

```
Best Accuracy: 0.971
Optimized Number of Neighbors: 23
```


The next block of code adds the hyperparameters 'weights' and 'algorithm' into the grid search.

```python
from sklearn.model_selection import GridSearchCV

param_range = list(range(1, 101))

params = {'kneighborsclassifier__n_neighbors': param_range,
          'kneighborsclassifier__weights': ('uniform', 'distance'),
          'kneighborsclassifier__algorithm': ('ball_tree', 'kd_tree', 'brute')}

gs = GridSearchCV(estimator=pipe,
                  param_grid=params,
                  scoring='accuracy',
                  cv=10,
                  n_jobs=-1)

gs = gs.fit(X_train, y_train)
print("Best Accuracy: %.3f" % gs.best_score_)
print("Best Parameters:")
for key, value in gs.best_params_.items():
    print(key, ':', value)
```

```
Best Accuracy: 0.979
Best Parameters:
kneighborsclassifier__algorithm : ball_tree
kneighborsclassifier__n_neighbors : 41
kneighborsclassifier__weights : distance
```
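The accuracies above are mean cross-validation scores on the training set, while the problem also asks for the test accuracy. A minimal sketch of that last step follows. To keep it quick and self-contained it rebuilds the split and pipeline and searches only `n_neighbors` from 1 to 30, which is a narrower range than the searches above and an assumption of this example.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Smaller grid than in the searches above, chosen only to keep the sketch fast.
param_grid = {'kneighborsclassifier__n_neighbors': list(range(1, 31))}

gs = GridSearchCV(estimator=pipe, param_grid=param_grid,
                  scoring='accuracy', cv=10)
gs.fit(X_train, y_train)

# With the default refit=True, gs.score uses the best pipeline,
# refit on the whole training set, to score the held-out test set.
test_acc = gs.score(X_test, y_test)
print("Optimized Parameters:", gs.best_params_)
print("Test Accuracy: %.3f" % test_acc)
```

The same `gs.score(X_test, y_test)` call would apply to either grid search above once it has been fit.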


**Problem 5: Calculate the accuracy, precision and recall based on the following confusion matrix.**

| | Predicted No | Predicted Yes |
|--|--------------|---------------|
| Actual No | 50 | 10 |
| Actual Yes | 5 | 100 |

**Accuracy** = (True Yes + True No) / Total = (100 + 50) / 165 = 0.91

**Precision** = True Yes / (True Yes + False Yes) = 100 / (100 + 10) = 0.91

**Recall** = True Yes / (True Yes + False No) = 100 / (100 + 5) = 0.95
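The arithmetic above can be checked with a few lines of plain Python. The variable names `tn`, `fp`, `fn`, and `tp` are just labels for the four cells of the confusion matrix, with "Yes" treated as the positive class.

```python
# Cells of the confusion matrix, with "Yes" as the positive class.
tn, fp = 50, 10   # Actual No:  predicted No, predicted Yes
fn, tp = 5, 100   # Actual Yes: predicted No, predicted Yes

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("Accuracy:  %.3f" % accuracy)    # 0.909
print("Precision: %.3f" % precision)   # 0.909
print("Recall:    %.3f" % recall)      # 0.952
```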

**Problem 6: Read the last section in Chapter 6 of the textbook, "Dealing with class imbalance". Discuss why accuracy is not a valid metric for an imbalanced dataset. What other metrics can be used instead?**

In an imbalanced data set, accuracy is not a valid metric for assessing the predictive power of a model. To use an example from my final project in a previous class, I was dealing with a data set where the target variable was 10% 'Yes - purchased the item' and 90% 'No - did not purchase'. The model predicted mostly Nos and had 91% accuracy, but I was looking to predict the Yeses, and the model was very poor at that: it predicted almost zero Yeses. The accuracy looked great, but the model was not answering the question I needed it to answer: will the member make a purchase?

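Recall is one metric that gives insight when accuracy is misleading. In my model, recall was 0.069: fewer than 7% of the true Yeses were predicted correctly. Precision and the F1 score, which combines precision and recall, are useful in the same way because they focus on the class of interest.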
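To make the gap between accuracy and recall concrete, here is a small sketch with made-up labels, not the data from my project. It assumes 100 samples, 10% of them Yes, and a hypothetical model that predicts No for every sample.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10 Yes (1) and 90 No (0).
y_true = [1] * 10 + [0] * 90

# A "model" that always predicts No.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)   # recall for the Yes class

print("Accuracy: %.2f" % acc)   # 0.90
print("Recall:   %.2f" % rec)   # 0.00
```

The accuracy simply mirrors the share of the majority class, while recall shows that no Yes was found.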

There are a number of ways to deal with this issue:
1. You can assign a larger penalty to wrong predictions on the minority class.
2. You can upsample the minority class or downsample the majority class. In the example from my project, I ended up downsampling the majority class to create a smaller data set where the target variable was 50% Nos and 50% Yeses.
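The downsampling option can be sketched with `sklearn.utils.resample`. The data here is synthetic (random features with a 90/10 class split) and stands in for the project data, which is not reproduced in this notebook.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic stand-in data: 900 No (0) and 100 Yes (1) samples.
rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

X_major, X_minor = X[y == 0], X[y == 1]

# Draw, without replacement, as many majority samples as there are minority samples.
X_major_down = resample(X_major, replace=False,
                        n_samples=len(X_minor), random_state=1)

# Stack the two classes back together into a balanced data set.
X_bal = np.vstack([X_major_down, X_minor])
y_bal = np.hstack([np.zeros(len(X_major_down), dtype=int),
                   np.ones(len(X_minor), dtype=int)])

print(np.bincount(y_bal))   # [100 100]
```

For the first option, several scikit-learn classifiers, such as `LogisticRegression` and `SVC`, accept `class_weight='balanced'` to penalize mistakes on the minority class more heavily. `KNeighborsClassifier` has no such parameter.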