# Homework 07, Matt Briskey

### 1. Why does the holdout method for model selection suggest separating the data into three parts: a training set, a validation set, and a test set?

In machine learning model selection, we tune the hyperparameters of different models to improve their predictions on unseen data. If we reused the same test set over and over during model selection, it would effectively become part of the training data, and the selected model would be more likely to overfit to it. The test score would then be an overly optimistic estimate of generalization performance. It therefore helps to split the non-test data into a training set, used to fit each model, and a validation set, used to compare models and hyperparameter settings. The test set is held back until the end, so the final model is evaluated on data that played no part in fitting or selection.
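
The three-way split can be sketched as follows. This is only an illustration, not part of the assignment: the 60/20/20 proportions, the candidate values of `k`, and the variable names (`X_rest`, `X_val`, `val_scores`, `best_k`) are assumptions chosen for the example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Step 1: hold back 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

# Step 2: split the remainder into training and validation sets
# (25% of the remaining 80% is 20% of the full data).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=1)

# Model selection: compare candidate values of k on the validation set only.
val_scores = {}
for k in (1, 5, 10, 15):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# Final estimate: refit on training + validation data, then score once on the test set.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)

print("Best k:", best_k)
print("Test accuracy: {:.1%}".format(test_score))
```

The test set is touched exactly once, after `best_k` has been chosen, which is what keeps the final score an honest estimate.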

### 2. Given a data set (wine), split data (20% test), apply pipeline to standardize the data, classify the data using KNeighborsClassifier (n_neighbors=10), print the test accuracy.


In [1]:
from sklearn import datasets
df = datasets.load_wine()
X = df.data
y = df.target

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = \
    train_test_split(X, y,
                     test_size=0.20,
                     stratify=y,
                     random_state=1)

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Silence FutureWarning messages so the output stays readable
from warnings import simplefilter
simplefilter(action='ignore', category=FutureWarning)

# Create a pipeline that standardizes the features, then applies KNeighborsClassifier
pipe_knn = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=10))
])

# Fit the pipeline on the training data (the scaler is fit on X_train only)
pipe_knn.fit(X_train, y_train)

# Calculate the test accuracy
test_accuracy = pipe_knn.score(X_test, y_test)

print("Test Accuracy: ", '{:.1%}'.format(test_accuracy))

Test Accuracy:  94.4%


### 3. What is a learning curve? Based on a learning curve, how do you know if the model is overfitting or not?

A learning curve is a graphical representation of the performance of a machine learning model on the training and validation sets. It shows how the model's performance improves or stabilizes as more data is used for training. Learning curves are useful for diagnosing model performance and identifying if the model is overfitting or underfitting.

**Overfitting** \
In an overfitting scenario, the model performs well on the training data but poorly on the validation or test data. The learning curve will show a large gap between the training and validation/test performance. The training accuracy will be high, while the validation/test accuracy will be significantly lower and may even plateau or decrease as more data is added. This indicates that the model is memorizing the training data and failing to generalize well to unseen data.

**Underfitting** \
In contrast, underfitting occurs when the model performs poorly on both the training and validation/test data. The learning curve will show low accuracy scores for both the training and validation/test sets, with the scores being similar or only slightly improving with more data. This suggests that the model is too simple to capture the underlying patterns in the data.
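
As a rough illustration of how these curves can be obtained, the sketch below uses scikit-learn's `learning_curve` on the wine data. The choice of five training sizes, 5-fold cross-validation, and `n_neighbors=10` are assumptions made for the example; the numbers it prints are not results from this homework.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))

# Score the model on growing subsets of the training folds.
# shuffle=True matters here because the wine samples are ordered by class.
train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=1)

# Average over the cross-validation folds for each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):3d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

A persistently large `gap` with high training accuracy points toward overfitting, while low values in both columns point toward underfitting.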

### 4. In the above data set, fit KNN using 10-fold cross validation and grid search to optimize the number of neighbors; print the optimized parameters and the test accuracy.

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline with StandardScaler and KNeighborsClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Define the parameter grid for grid search
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15]
}

# Create a GridSearchCV object with the pipeline and parameter grid
grid_search = GridSearchCV(pipeline, param_grid, cv=10)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best parameters and best score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score:", '{:.1%}'.format(best_score))

# Predict the labels for the test set using the best estimator from grid search
y_pred = grid_search.best_estimator_.predict(X_test)

# Calculate the test accuracy
test_accuracy = accuracy_score(y_test, y_pred)

print("Test Accuracy: ", '{:.1%}'.format(test_accuracy))


Best Parameters: {'knn__n_neighbors': 5}
Best Score: 97.2%
Test Accuracy:  94.4%


### 5. Calculate the accuracy, precision and recall based on the following confusion matrix.

| | Predicted NO | Predicted YES |
| --- | --- | --- |
| Actual NO | 50 | 10 |
| Actual YES | 5 | 100 |

**Accuracy** \
Accuracy = (TP + TN)/(FP + FN + TP + TN) \
Accuracy = (100 + 50) / (100 + 50 + 10 + 5) \
Accuracy = 150/165 \
Accuracy = 90.9%

**Precision** \
Precision = TP/(TP + FP) \
Precision = 100 / (100 + 10) \
Precision = 100 / 110 \
Precision = 90.9%

**Recall** \
Recall = TP/(TP + FN) \
Recall = 100 /(100 + 5) \
Recall = 100/105 \
Recall = 95.2%
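
The arithmetic above can be double-checked with a few lines of plain Python. The variable names are illustrative; YES is treated as the positive class, so TN = 50, FP = 10, FN = 5, and TP = 100.

```python
# Counts read off the confusion matrix, with YES as the positive class.
tn, fp = 50, 10   # row "Actual NO"
fn, tp = 5, 100   # row "Actual YES"

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("Accuracy:  {:.1%}".format(accuracy))
print("Precision: {:.1%}".format(precision))
print("Recall:    {:.1%}".format(recall))
```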

### 6. Read the last section in Chapter 6 of the textbook, "Dealing with class imbalance". Discuss why accuracy is not a valid metric for an imbalanced dataset. What other metrics can be used instead?

Accuracy isn't a valid measure for an imbalanced dataset because the skewed class distribution lets a model reach high accuracy without learning anything useful about the data: a classifier that always predicts the majority class in a 90/10 dataset is already 90% accurate. Additionally, misclassification costs are often unequal, which accuracy does not reflect. For these reasons, precision, recall, and the area under the ROC curve should be used when the dataset has a class imbalance.
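
A small made-up example shows the effect. The 95/5 class split and the always-negative "model" below are assumptions chosen purely for illustration, and precision is reported as 0 when no positive predictions are made (a convention, since it is otherwise undefined).

```python
# Hypothetical imbalanced labels: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# No positive predictions were made, so fall back to 0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print("Accuracy:  {:.1%}".format(accuracy))
print("Precision: {:.1%}".format(precision))
print("Recall:    {:.1%}".format(recall))
```

Accuracy looks strong even though the model never identifies a single positive example, which recall makes obvious.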