# Final KNN Classification Model

In [1]:
# Importing Baseline Model
%run ./baseline_classifier.ipynb

Predicted Category: ['meme' 'layer1' 'meme' 'meme']
  Correct Category: ['layer1' 'meme' 'meme' 'meme']
Model Accuracy: 0.5
Cross-Validation Scores: [1.  1.  0.5 1.  0.5]
Mean Accuracy: 0.8
Standard Deviation: 0.2449489742783178


## Increasing Dataset Size

In [2]:
print(f'# of coins in previous dataset: {coins.shape[0]}')

coins = pd.read_json('./src/components/data/model_coins.json')
coins.index = coins.index.strftime('%m/%d/%Y')
coins = coins.bfill().T
coin_labels = pd.Series(
    (['layer1']*10 + ['meme']*10) * 2,
    index=coins.index
)

coins = coins.assign(category=coin_labels)

print(f'# of coins in dataset now: {coins.shape[0]}')

# of coins in previous dataset: 20
# of coins in dataset now: 40


With a training dataset twice the size, we would expect our model's performance to improve. To see the effect, let's re-evaluate our baseline model with no changes except for the dataset it is fitted on.

In [3]:
X, y = coins.drop(columns='category'), coins['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

knn_classifier.fit(X_train, y_train)

In [4]:
y_pred = knn_classifier.predict(X_test)
print(f'Predicted Category: {y_pred}')
print(f'  Correct Category: {y_test.values}')

accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')

Predicted Category: ['meme' 'layer1' 'meme' 'layer1' 'layer1' 'meme' 'meme' 'meme']
  Correct Category: ['meme' 'layer1' 'meme' 'layer1' 'layer1' 'layer1' 'layer1' 'layer1']
Model Accuracy: 0.625


In [5]:
cv_scores = cross_val_score(knn_classifier, X, y, cv=5, scoring='accuracy')

print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean Accuracy: {cv_scores.mean()}')
print(f'Standard Deviation: {cv_scores.std()}')

Cross-Validation Scores: [0.625 0.875 0.75  0.625 0.625]
Mean Accuracy: 0.7
Standard Deviation: 0.09999999999999999


Although the mean accuracy is lower (0.7 versus the baseline's 0.8), the standard deviation has dropped from about 0.245 to 0.1. If you look at the scores of the individual folds, you'll notice that they are much more consistent. This suggests an <b>improvement in our model's ability to generalize to unseen data</b>. <br>
<br>
Next, we can aim to increase the model's accuracy by tuning the hyperparameters.
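As a quick check on the comparison above, the mean and standard deviation can be recomputed directly from the two sets of fold scores printed earlier. This is only a small sketch using Python's standard `statistics` module; `pstdev` is the population standard deviation, which matches NumPy's default `std()`. The names `baseline_scores`, `larger_scores`, and `summarize` are made up for this illustration.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation outputs above.
baseline_scores = [1.0, 1.0, 0.5, 1.0, 0.5]          # 20-coin dataset
larger_scores = [0.625, 0.875, 0.75, 0.625, 0.625]   # 40-coin dataset


def summarize(scores):
    """Return the mean, population std, and max-min spread of fold scores."""
    return {
        'mean': mean(scores),
        'std': pstdev(scores),
        'spread': max(scores) - min(scores),
    }


for name, scores in [('baseline', baseline_scores), ('larger', larger_scores)]:
    s = summarize(scores)
    print(f"{name}: mean={s['mean']:.3f}, std={s['std']:.4f}, spread={s['spread']:.3f}")

# baseline: mean=0.800, std=0.2449, spread=0.500
# larger: mean=0.700, std=0.1000, spread=0.250
```

The spread between the best and worst fold is halved, which is the consistency the paragraph above refers to.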

## Tuning Hyperparameters

### Baseline Model Hyperparameters
The current parameters for our baseline model are the default values of:
<ul>
    <li>rsi__window: 14</li>
    <li>knn__n_neighbors: 5</li>
</ul>
We can double check by accessing the pipeline's attributes.

In [6]:
print(f"rsi__window param: {knn_classifier.get_params()['rsi__window']}")
print(f"knn__n_neighbors param: {knn_classifier.get_params()['knn__n_neighbors']}\n")

knn_classifier

rsi__window param: 14
knn__n_neighbors param: 5



To find the most effective hyperparameters, <code>GridSearchCV</code> takes a dictionary of parameter options, fits and cross-validates the model with every possible combination of those values, and selects the combination with the best mean score.

In [7]:
# Importing GridSearchCV
from sklearn.model_selection import GridSearchCV

In [8]:
param_grid = {
    'rsi__window': [14, 21, 28],
    'knn__n_neighbors': [5, 7, 9]
}

grid_search = GridSearchCV(knn_classifier, param_grid, cv=5, scoring='accuracy')

grid_search.fit(X, y)

print(f'Best parameters: {grid_search.best_params_}')
print(f'Best cross-validation score: {grid_search.best_score_}')

Best parameters: {'knn__n_neighbors': 5, 'rsi__window': 28}
Best cross-validation score: 0.825


Below is a table of evaluations from each simulation that <code>GridSearchCV</code> performed, sorted by descending score.

In [9]:
(
    pd
    .DataFrame(grid_search.cv_results_)
    [['param_knn__n_neighbors', 'param_rsi__window', 'mean_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)

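| | param_knn__n_neighbors | param_rsi__window | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 2 | 5 | 28 | 0.825 | 1 |
| 1 | 5 | 21 | 0.8 | 2 |
| 5 | 7 | 28 | 0.8 | 2 |
| 4 | 7 | 21 | 0.775 | 4 |
| 8 | 9 | 28 | 0.775 | 4 |
| 7 | 9 | 21 | 0.725 | 6 |
| 0 | 5 | 14 | 0.7 | 7 |
| 3 | 7 | 14 | 0.7 | 7 |
| 6 | 9 | 14 | 0.7 | 7 |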
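The nine rows above correspond to the 3 x 3 grid of candidate values. As a rough sketch of what "every possible combination" means, the grid can be enumerated by hand with `itertools.product`. This is not how `GridSearchCV` is implemented internally, only an illustration of the search space; `combinations` and `n_folds` are names introduced here.

```python
from itertools import product

# Same candidate values as the grid search above.
param_grid = {
    'rsi__window': [14, 21, 28],
    'knn__n_neighbors': [5, 7, 9],
}

# Build one dict per combination of candidate values.
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

n_folds = 5  # matches cv=5 in the grid search

print(f'Candidate combinations: {len(combinations)}')            # 9
print(f'Cross-validation fits: {len(combinations) * n_folds}')   # 45
print(f'First candidate: {combinations[0]}')
```

With `cv=5`, each of the 9 candidates is fitted and scored 5 times. With the default `refit=True`, the best candidate is then fitted once more on all of the data passed to `fit`, which is what `best_estimator_` returns.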


### Final KNN Classification Model

We redefine our model with the best parameters from our grid search. Now, <code>knn_classifier</code> is our final classification model, tuned with the most effective hyperparameters.

In [10]:
knn_classifier = grid_search.best_estimator_

print(f"Tuned rsi__window Parameter: {knn_classifier.get_params()['rsi__window']}")
print(f"Tuned knn__n_neighbors Parameter: {knn_classifier.get_params()['knn__n_neighbors']}\n")

knn_classifier

Tuned rsi__window Parameter: 28
Tuned knn__n_neighbors Parameter: 5



In [11]:
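# Note: grid_search.fit(X, y) above used the full dataset, and best_estimator_
# is refit on all of it, so X_test is not unseen data here and this score is
# likely optimistic.
y_pred = knn_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Tuned Model Accuracy on Test Set: {accuracy}')
print(classification_report(y_test, y_pred))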

scores = cross_val_score(knn_classifier, X, y, cv=5, scoring='accuracy')
print(f'Tuned Model Cross-Validation Scores: {scores}')
print(f'Tuned Model Mean Accuracy: {scores.mean()}')
print(f'Tuned Model Standard Deviation: {scores.std()}')

Tuned Model Accuracy on Test Set: 0.875
              precision    recall  f1-score   support

      layer1       1.00      0.83      0.91         6
        meme       0.67      1.00      0.80         2

    accuracy                           0.88         8
   macro avg       0.83      0.92      0.85         8
weighted avg       0.92      0.88      0.88         8

Tuned Model Cross-Validation Scores: [0.875 0.875 0.875 0.75  0.75 ]
Tuned Model Mean Accuracy: 0.825
Tuned Model Standard Deviation: 0.06123724356957946
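One caveat about the test-set score above: `grid_search.fit(X, y)` was given the full dataset, so the tuned model has already seen the rows in `X_test`, and the 0.875 is likely optimistic. A common alternative is to split first and tune only on the training portion. The sketch below assumes synthetic data from `make_classification` and uses a `StandardScaler` step as a stand-in for the notebook's RSI transformer, since the real pipeline is defined in the baseline notebook. All `demo_` and `X_demo`-style names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the coin data: 40 samples, 2 classes (assumption).
X_demo, y_demo = make_classification(n_samples=40, n_features=10, random_state=0)

# Hold out the test set BEFORE tuning; stratify keeps the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# StandardScaler is only a placeholder for the notebook's RSI step.
demo_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Tune on the training portion only, so the test rows stay unseen.
demo_search = GridSearchCV(
    demo_pipeline, {'knn__n_neighbors': [5, 7, 9]}, cv=5, scoring='accuracy'
)
demo_search.fit(X_tr, y_tr)

# score() uses the refit best estimator and the 'accuracy' scorer.
held_out_accuracy = demo_search.score(X_te, y_te)
print(f'Best parameters: {demo_search.best_params_}')
print(f'Held-out accuracy: {held_out_accuracy}')
```

Fixing `random_state` in the split also makes the reported test-set numbers reproducible from run to run.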


## Final KNN Classifier Results

In addition to a slight improvement in the classifier's mean cross-validation accuracy, from the baseline's <b>0.8</b> to <b>0.825</b>, the standard deviation of the accuracy scores dropped substantially, from the baseline's <b>~0.2449</b> all the way down to <b>~0.0612</b>.