# Baseline KNN Classification Model

In [1]:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, classification_report

## Retrieving Coin Data

First, I load the data that I previously retrieved from CoinGecko, the same data used for the visualization above. To prepare this data for the classification model, I take these steps:
<ol>
    <li>Import data into a Pandas DataFrame</li>
    <li>Reformat the datetime index into readable strings</li>
    <li>Backward fill historical price data for coins that have not existed for the full past 365 days, eliminating null values, and keep the first 20 coins</li>
    <li>Transpose data so that rows are observations of coins and columns are their features (price for each date)</li>
    <li>Create a new column for each coin's correct category for classification training</li>
</ol>

In [2]:
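# 1. Import data into a DataFrame (rows are dates, columns are coins)
data = pd.read_json('./src/components/data/model_coins.json')
# 2. Format the datetime index as MM/DD/YYYY strings
data.index = data.index.strftime('%m/%d/%Y')
# 3. Backward fill missing prices, then keep the first 20 coins
data = data.bfill()[data.columns[:20]]
# 4. Transpose so that rows are coins and columns are dates
coin_prices = data.T
# 5. Label each coin; this assumes the first 10 coins are layer 1 and the next 10 are meme coins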
coin_labels = pd.Series(
    (['layer1'] * 10 + ['meme'] * 10),
    index=coin_prices.index
)
# Create final raw dataframe
coins = coin_prices.assign(category=coin_labels)
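
To make steps 2 through 4 concrete, here is a minimal sketch on a toy table. The coin names and prices are made up for illustration and are not from the CoinGecko data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy price table: rows are dates, columns are coins.
prices = pd.DataFrame(
    {
        "OldCoin": [1.0, 1.1, 1.2, 1.3],
        "NewCoin": [np.nan, np.nan, 5.0, 5.5],  # no price before day 3
    },
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]),
)

# Step 2: format the DatetimeIndex as MM/DD/YYYY strings.
prices.index = prices.index.strftime("%m/%d/%Y")

# Step 3: backward fill copies the first known price into the earlier gaps.
filled = prices.bfill()

# Step 4: transpose so each row is one coin and each column is one date.
coin_rows = filled.T
print(coin_rows)
```

One consequence worth keeping in mind: a backfilled coin has a flat price before its first real observation, so its early price changes are all zero.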

In [3]:
coins

Unnamed: 0,06/10/2023,06/11/2023,06/12/2023,06/13/2023,06/14/2023,06/15/2023,06/16/2023,06/17/2023,06/18/2023,06/19/2023,...,05/31/2024,06/01/2024,06/02/2024,06/03/2024,06/04/2024,06/05/2024,06/06/2024,06/07/2024,06/08/2024,category
Bitcoin,26469.58,25858.12,25916.58,25910.36,25872.21,25107.75,25564.6,26327.33,26501.04,26333.09,...,68372.492884,67474.954837,67704.326418,67740.016902,68808.293686,70600.011167,71184.599431,70759.588193,69439.266066,layer1
Ethereum,1839.208,1754.673,1751.725,1742.596,1736.789,1650.677,1664.977,1716.377,1726.373,1719.273,...,3748.639152,3761.069246,3813.452442,3780.711985,3766.63765,3814.93203,3871.082091,3812.701857,3686.234183,layer1
BNB,260.4054,239.2653,235.1993,231.0663,243.2521,237.5082,236.0971,239.0363,244.4267,244.0027,...,595.081944,593.707316,600.923979,602.94714,626.564555,686.510668,699.924112,710.043483,681.815182,layer1
Solana,17.31985,15.71229,15.55217,15.21912,14.99608,14.51189,14.74177,15.30565,15.62635,15.43449,...,167.014473,165.91353,165.963235,163.118541,164.812549,171.728129,173.769571,170.37272,162.059675,layer1
Toncoin,1.707057,1.487333,1.502982,1.521668,1.510969,1.394937,1.39765,1.401355,1.407832,1.419981,...,6.473915,6.359331,6.297774,6.811063,6.805827,7.280971,7.206922,7.531571,7.208148,layer1
Cardano,0.2952223,0.2763099,0.2728992,0.2756566,0.2749143,0.2632091,0.2617003,0.263348,0.2669649,0.2611272,...,0.44634,0.447536,0.44987,0.446384,0.456805,0.461416,0.461678,0.458149,0.446164,layer1
Avalanche,13.75152,11.68805,11.56476,11.53143,11.76897,11.36969,11.39559,11.53351,11.60604,11.33278,...,35.980317,36.089247,35.772729,34.935003,35.036731,36.091757,36.535981,35.92617,33.426148,layer1
TRON,0.07203536,0.0697718,0.0702222,0.07108634,0.07194792,0.07103752,0.07095858,0.07054005,0.07160175,0.07018383,...,0.112022,0.112066,0.112482,0.114705,0.113418,0.114491,0.114695,0.114711,0.112643,layer1
Bitcoin Cash,110.7194,103.6478,102.7803,102.7908,105.1761,101.7949,104.6548,107.9994,106.5237,107.0653,...,465.232758,455.213551,463.061046,458.379174,464.742407,477.324874,495.594456,495.600495,473.895378,layer1
NEAR Protocol,1.387753,1.204537,1.206346,1.19885,1.197304,1.173567,1.193292,1.213846,1.256589,1.238284,...,7.285283,7.253502,7.369173,7.187357,7.118196,7.425486,7.660889,7.340025,6.833469,layer1


## Building Custom Transformer for Model Pipeline

Next, to engineer the new RSI features from the historical price data, we must build a custom transformer class for the pipeline. According to Fidelity.com, the Relative Strength Index (RSI) is a momentum oscillator that measures the velocity (speed and direction) of price movement. The RSI ranges from 0 to 100, representing the scale between oversold and overbought, respectively. <br>
<br>
The RSI is calculated by the following formula: <br>
RSI = 100 - (100 / (1 + (avg price gain / avg price loss))) <br>
<br>
Within the class, I create a function <code>calc_rsi</code> to calculate a row of RSIs for a single coin. We will be using a default window of 14, which means that each RSI value is computed over a 14-day period. This also means that <b>the first 13 days of RSI for each coin will be null</b>, as the RSI requires 14 days of price data. <br>
<br>
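Because of this, in the <code>transform</code> function, before we concatenate the RSIs for each coin to the price data, we remove the leading days of data from both the price data and the RSI. Note that the code slices with <code>iloc[:, self.window:]</code>, which drops the first 14 columns: the 13 null RSI days plus day 14, whose window still includes the undefined first-day price change (counted as 0).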
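
The formula above can be checked by hand on a short price series. This is a minimal sketch for a single window, not the transformer itself: <code>rsi_for_window</code> is a hypothetical helper, the prices are made up, and it takes the prices spanning one window (n + 1 prices give n price changes). It follows the same edge-case convention as <code>calc_rsi</code>, returning 50 when prices are flat.

```python
import numpy as np

def rsi_for_window(prices):
    """RSI for one window, using simple averages of gains and losses."""
    deltas = np.diff(np.asarray(prices, dtype=float))
    avg_gain = np.clip(deltas, 0, None).mean()   # average of upward moves
    avg_loss = np.clip(-deltas, 0, None).mean()  # average of downward moves, as magnitudes
    if avg_gain == 0 and avg_loss == 0:
        return 50.0   # flat prices: treated as neutral
    if avg_loss == 0:
        return 100.0  # only gains
    return float(100 - 100 / (1 + avg_gain / avg_loss))

# Price changes are [1, 1, -1, 2]: avg gain = 1.0, avg loss = 0.25, RS = 4.
print(rsi_for_window([10, 11, 12, 11, 13]))  # 80.0
```

With an average gain four times the average loss, the RSI lands at 80, near the overbought end of the scale.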

In [4]:
class RSITransformer(BaseEstimator, TransformerMixin):
    def __init__(self, window=14):
        self.window = window
    
    def calc_rsi(self, prices):
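        # Day-over-day price change; the first value is NaN
        delta = prices.diff(1)
        # Rolling means of gains and of losses (losses as positive magnitudes)
        gain = (delta.where(delta > 0, 0)).rolling(window=self.window).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=self.window).mean()

        gain.name = 'gain'
        loss.name = 'loss'
        rs = pd.DataFrame((gain, loss)).T
        
        # Edge cases: flat prices -> 50, no gains -> 0, no losses -> 100
        rsi = rs.apply(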
            lambda x: 
            50 if all(x == 0) else 
            0 if x.iloc[0] == 0 else 
            100 if x.iloc[1] == 0 else 
            (100 - (100 / (1 + (x.iloc[0]/x.iloc[1]))))
            , axis=1
        )
        rsi.name = prices.name
        
        return rsi

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_transformed = X.copy()
        rsi_features = []
        
        # Compute one RSI series per coin (one row of X per coin)
        for coin in X_transformed.index:
            coin_rsi = self.calc_rsi(X_transformed.loc[coin])
            rsi_features.append(coin_rsi)
    
        # Prefix each date label with 'RSI_' so RSI columns do not clash with price columns
        rsi_combined = pd.DataFrame(rsi_features).T.reset_index()
        rsi_combined['index'] = rsi_combined['index'].apply(lambda x: f'RSI_{x[:10]}')
        rsi_combined = rsi_combined.set_index('index')
        rsi_combined.index.name = None
    
        # Drop the first `window` columns from both tables, then join prices and RSIs side by side
        rsi_combined = rsi_combined.T.iloc[:, self.window:]
        X_transformed = X_transformed.iloc[:, self.window:]
        X_transformed = pd.concat([X_transformed, rsi_combined], axis=1)
    
        return X_transformed

## Building KNN Pipeline

Finally, we build the KNN classification model pipeline that takes in raw data and performs the preprocessing for the model before fitting the KNN classifier. For this baseline model, we will use an RSI window of 14 and a K-neighbors value of 5.

In [5]:
knn_classifier = Pipeline([
    ('rsi', RSITransformer(window=14)),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
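
The scaling step matters because KNN compares coins by distance, and prices range from fractions of a cent to tens of thousands of dollars while RSI stays between 0 and 100. Here is a minimal sketch of that effect on made-up toy data (the values and labels are hypothetical, and it uses 1 neighbor only to keep the example small):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one large-scale feature and one small-scale feature.
X_toy = np.array([[30000.0, 0.1], [31000.0, 0.2], [29000.0, 0.9], [30500.0, 0.8]])
y_toy = np.array(["a", "a", "b", "b"])
query = np.array([[30900.0, 0.85]])

# Without scaling, the large-scale feature dominates the distance.
raw_knn = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)
raw_pred = raw_knn.predict(query)[0]

# With scaling inside a pipeline, both features contribute comparably.
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
toy_pipeline.fit(X_toy, y_toy)
scaled_pred = toy_pipeline.predict(query)[0]

print(f"unscaled: {raw_pred}, scaled: {scaled_pred}")
```

The same query point gets a different label once both features are put on a comparable scale, which is why the scaler sits in front of the classifier.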

To test the model, I split the data 80/20 into a training set and a testing set for evaluation.

In [6]:
X = coins.drop(columns=['category'])
y = coins['category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
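
With only 20 coins, a purely random split can leave the 4 test coins unbalanced between categories, and the result changes on every run. As a sketch of one possible alternative (not what this notebook does), <code>train_test_split</code> accepts <code>stratify</code> and <code>random_state</code>. The toy data and the seed value 42 below are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real features: 20 rows, 10 per category.
X_toy = pd.DataFrame({"feature": range(20)})
y_toy = pd.Series(["layer1"] * 10 + ["meme"] * 10)

# stratify keeps the class ratio in both splits; random_state makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_te.value_counts().to_dict())
```

Here the test set always holds two coins of each category, which makes a single accuracy number a little easier to interpret.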

In [7]:
knn_classifier.fit(X_train, y_train)

#### Time to see the classifier in action! 
Above, I fit the model on a random training set of coin data and their correct categories.

Let's see how it did. Now we use the rest of the data, the testing set, to evaluate its performance.

In [8]:
y_pred = knn_classifier.predict(X_test)
print(f'Predicted Category: {y_pred}')
print(f'  Correct Category: {y_test.values}')

accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')

Predicted Category: ['layer1' 'layer1' 'meme' 'layer1']
  Correct Category: ['meme' 'layer1' 'meme' 'layer1']
Model Accuracy: 0.75
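
To see where the 0.75 comes from, the printed labels above can be scored directly. This sketch copies the two label lists from the output and adds a confusion matrix, which this notebook does not otherwise compute.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels copied from the printed output above.
y_true = ["meme", "layer1", "meme", "layer1"]
y_hat = ["layer1", "layer1", "meme", "layer1"]

acc = accuracy_score(y_true, y_hat)  # 3 of 4 correct
print(acc)  # 0.75

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_true, y_hat, labels=["layer1", "meme"])
print(cm)
```

The single error is a meme coin predicted as layer 1. With 4 test coins, each mistake moves the accuracy by 25 percentage points.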


I repeat this evaluation process over the entire dataset by splitting it into 5 folds and using each fold once as the testing set while training on the other four, also known as <b>K-Fold Cross-Validation</b>.

In [9]:
cv_scores = cross_val_score(knn_classifier, X, y, cv=5, scoring='accuracy')

print(f'Cross-Validation Scores: {cv_scores}')
print(f'Mean Accuracy: {cv_scores.mean()}')
print(f'Standard Deviation: {cv_scores.std()}')

Cross-Validation Scores: [1.  1.  0.5 1.  0.5]
Mean Accuracy: 0.8
Standard Deviation: 0.2449489742783178
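
The fold scores only take values like 0.5 and 1.0 because each fold tests on just 4 coins. As far as I know, <code>cross_val_score</code> with an integer <code>cv</code> and a classifier uses <code>StratifiedKFold</code>, so this sketch reproduces the fold sizes on placeholder features and then recomputes the summary statistics from the printed scores.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same label layout as the notebook; the features are placeholders.
y_toy = np.array(["layer1"] * 10 + ["meme"] * 10)
X_toy = np.zeros((20, 1))

fold_sizes = []
skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy)):
    labels, counts = np.unique(y_toy[test_idx], return_counts=True)
    fold_sizes.append(len(test_idx))
    print(fold, len(test_idx), dict(zip(labels.tolist(), counts.tolist())))

# Scores copied from the output above.
cv_scores = np.array([1.0, 1.0, 0.5, 1.0, 0.5])
mean_acc = cv_scores.mean()
std_acc = cv_scores.std()  # population standard deviation (ddof=0)
print(mean_acc, std_acc)
```

Each fold holds 2 layer 1 coins and 2 meme coins, so a score of 0.5 means two of four test coins were misclassified in that fold.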


## Results
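Looks like our model predicts the correct category of these coins with an average accuracy of 80%! Not bad, though with only 20 coins and a standard deviation of about 0.24 across folds, this estimate is noisy. Let's tune the model to maximize its performance.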