# URL:https://www.kaggle.com/code/cdeotte/rapids-knn-starter-ensemble-lb-0-961-wow

# RAPIDS KNN Starter Notebook - LB Ensemble 0.961! Wow!
In this notebook, we train a RAPIDS KNN model and ensemble it with the best public notebook. The best public notebook achieves `LB = 0.954` and our ensemble achieves `LB = 0.961`. Wow!

# DISCLAIMER
Note that the ensemble weights in this notebook are overfitted to the public LB and will not generalize to the private LB. The purpose of this notebook is to demonstrate that KNN offers model diversity and helps improve ensembles.

To find the correct ensemble weights, we use the KNN OOF predictions together with the OOF predictions from all other models in our ensemble. We find the weights that optimize the OOF ensemble AUC locally and then use these weights for test predictions during submission (and we ignore the public LB score)! Discussion [here][1]
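
The weight search described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code used for this notebook: the helper names (`average_ranks`, `auc_score`, `search_knn_weight`), the weight grid, and the toy predictions are all assumptions, and it handles only two models. AUC is computed with a small NumPy helper instead of `sklearn.metrics.roc_auc_score` to keep the block self-contained.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied values share the average of their ranks."""
    x = np.asarray(x, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)  # highest rank held by each unique value
    return (last_rank - (counts - 1) / 2.0)[inverse]

def auc_score(y_true, y_score):
    """ROC AUC via the rank-sum (Mann-Whitney) identity."""
    y_true = np.asarray(y_true).ravel()
    ranks = average_ranks(y_score)
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def search_knn_weight(y_true, oof_knn, oof_other, grid):
    """Scan w in grid and score the blend w * KNN + (1 - w) * other on OOF ranks."""
    r_knn, r_other = average_ranks(oof_knn), average_ranks(oof_other)
    best_w, best_auc = None, -np.inf
    for w in grid:
        auc = auc_score(y_true, w * r_knn + (1.0 - w) * r_other)
        if auc > best_auc:
            best_w, best_auc = float(w), auc
    return best_w, best_auc

# SYNTHETIC STAND-INS FOR REAL OOF PREDICTIONS (ASSUMPTION, NOT COMPETITION DATA)
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
oof_other_demo = y_demo + rng.normal(0.0, 0.6, size=500)  # stronger model
oof_knn_demo = y_demo + rng.normal(0.0, 1.5, size=500)    # weaker model

grid_demo = np.linspace(-0.5, 1.0, 31)  # negative KNN weights are allowed
best_w, best_auc = search_knn_weight(y_demo, oof_knn_demo, oof_other_demo, grid_demo)
print(f"best KNN weight = {best_w:.2f}, OOF ensemble AUC = {best_auc:.3f}")
```

On real data, the inputs would be the OOF arrays from each model's K-fold loop, and the selected weight would be applied unchanged to the test predictions.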

# UPDATE
In version 2 we increase `n_neighbors` from `101` to `201` and give the feature `day` more importance during the KNN distance computation. This improves the ensemble score from `LB 0.956` to `LB 0.961`, woohoo!

# Load Data

[1]: https://www.kaggle.com/competitions/playground-series-s5e3/discussion/568455

In [1]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

train = pd.read_csv("/kaggle/input/playground-series-s5e3/train.csv")
print("Train shape", train.shape )
train.head()

Train shape (2190, 13)


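| | id | day | pressure | maxtemp | temparature | mintemp | dewpoint | humidity | cloud | sunshine | winddirection | windspeed | rainfall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1017.4 | 21.2 | 20.6 | 19.9 | 19.4 | 87.0 | 88.0 | 1.1 | 60.0 | 17.2 | 1 |
| 1 | 1 | 2 | 1019.5 | 16.2 | 16.9 | 15.8 | 15.4 | 95.0 | 91.0 | 0.0 | 50.0 | 21.9 | 1 |
| 2 | 2 | 3 | 1024.1 | 19.4 | 16.1 | 14.6 | 9.3 | 75.0 | 47.0 | 8.3 | 70.0 | 18.1 | 1 |
| 3 | 3 | 4 | 1013.4 | 18.1 | 17.8 | 16.9 | 16.8 | 95.0 | 95.0 | 0.0 | 60.0 | 35.6 | 1 |
| 4 | 4 | 5 | 1021.8 | 21.3 | 18.4 | 15.2 | 9.6 | 52.0 | 45.0 | 3.6 | 40.0 | 24.8 | 0 |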


In [2]:
test = pd.read_csv("/kaggle/input/playground-series-s5e3/test.csv")
print("Test shape:", test.shape )
test.head()

Test shape: (730, 12)


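| | id | day | pressure | maxtemp | temparature | mintemp | dewpoint | humidity | cloud | sunshine | winddirection | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2190 | 1 | 1019.5 | 17.5 | 15.8 | 12.7 | 14.9 | 96.0 | 99.0 | 0.0 | 50.0 | 24.3 |
| 1 | 2191 | 2 | 1016.5 | 17.5 | 16.5 | 15.8 | 15.1 | 97.0 | 99.0 | 0.0 | 50.0 | 35.3 |
| 2 | 2192 | 3 | 1023.9 | 11.2 | 10.4 | 9.4 | 8.9 | 86.0 | 96.0 | 0.0 | 40.0 | 16.9 |
| 3 | 2193 | 4 | 1022.9 | 20.6 | 17.3 | 15.2 | 9.5 | 75.0 | 45.0 | 7.1 | 20.0 | 50.6 |
| 4 | 2194 | 5 | 1022.2 | 16.1 | 13.8 | 6.4 | 4.3 | 68.0 | 49.0 | 9.2 | 20.0 | 19.4 |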


In [3]:
RMV = ['rainfall','id']
FEATURES = [c for c in train.columns if c not in RMV]
print("Our features are:")
print( FEATURES )

Our features are:
['day', 'pressure', 'maxtemp', 'temparature', 'mintemp', 'dewpoint', 'humidity', 'cloud', 'sunshine', 'winddirection', 'windspeed']


# KNN Model
We train a 5-fold RAPIDS KNN classification model using 201 neighbors! We standardize all features to mean=0, std=1 because KNN distances are sensitive to feature scale.
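
To make concrete what a KNN classifier's predicted probability means, here is a small CPU sketch in NumPy. It is not how cuML implements `KNeighborsClassifier`: it assumes uniform neighbor weights and brute-force Manhattan distances, and the function name `knn_proba_manhattan` and the toy data are made up for illustration.

```python
import numpy as np

def knn_proba_manhattan(X_train, y_train, X_query, k):
    """Fraction of the k nearest training rows (L1 distance) with label 1."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)
    # PAIRWISE L1 DISTANCES, SHAPE (n_query, n_train)
    dist = np.abs(X_query[:, None, :] - X_train[None, :, :]).sum(axis=2)
    # INDICES OF THE k CLOSEST TRAINING ROWS FOR EACH QUERY ROW
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return y_train[nearest].mean(axis=1)

# TOY 1-D DATA (HYPOTHETICAL)
X_toy = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y_toy = [0, 0, 1, 1, 1, 1]
proba_toy = knn_proba_manhattan(X_toy, y_toy, [[1.2], [10.5]], k=3)
print(proba_toy)  # first query: 1 of 3 neighbors is positive; second: 3 of 3
```

With `k` neighbors and uniform weights, each prediction is a multiple of `1/k`, so a larger `k` gives a finer-grained but smoother score.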

**UPDATE**: We adjust the weights of the features to increase/decrease importance of certain features during KNN distance computation.
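
As a rough illustration of why these weights matter: under the Manhattan distance (`p=1`), multiplying a standardized feature by a weight `w` multiplies that feature's contribution to every distance by `w`. The sketch below checks this on random data. The helper `weighted_standardize` and the two-column toy matrix are assumptions for the example, not part of the notebook.

```python
import numpy as np

def weighted_standardize(X_fit, X_apply, weights):
    """Standardize X_apply with X_fit statistics, then scale each column by its weight."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0, ddof=1)  # ddof=1 matches the pandas .std() default
    return weights * (X_apply - mean) / std

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))   # HYPOTHETICAL COLUMNS: [day, pressure]
weights_demo = np.array([24.0, 1.0])

Z_plain = weighted_standardize(X_demo, X_demo, np.ones(2))
Z_weighted = weighted_standardize(X_demo, X_demo, weights_demo)

# PER-FEATURE CONTRIBUTIONS TO THE L1 DISTANCE BETWEEN ROWS 0 AND 1
gap_plain = np.abs(Z_plain[0] - Z_plain[1])
gap_weighted = np.abs(Z_weighted[0] - Z_weighted[1])
print("per-feature gaps, unweighted:", gap_plain)
print("per-feature gaps, weighted:  ", gap_weighted)
print("Manhattan distance, weighted:", gap_weighted.sum())
```

With a weight of 24 on the first column, rows that differ in that column are pushed far apart, so the nearest neighbors are mostly rows with similar values of it.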

In [4]:
from sklearn.model_selection import KFold
from cuml.neighbors import KNeighborsClassifier

In [5]:
# WEIGHTS TO ADJUST IMPORTANCE OF FEATURES DURING KNN
WGT = {'day': 24, 'pressure': 1, 'maxtemp': 1, 'temparature': 1, 'mintemp': 1, 'dewpoint': 1, 'humidity': 1, 
       'cloud': 1, 'sunshine': 1, 'winddirection': 1, 'windspeed': 1}

In [6]:
%%time
FOLDS = 5
kf = KFold(n_splits=FOLDS, shuffle=True, random_state=777)
    
oof_knn = np.zeros(len(train))
pred_knn = np.zeros(len(test))

for i, (train_index, test_index) in enumerate(kf.split(train)):

    print("#"*25)
    print(f"### Fold {i+1}")
    print("#"*25)
    
    x_train = train.loc[train_index,FEATURES].copy()
    y_train = train.loc[train_index,"rainfall"]    
    x_valid = train.loc[test_index,FEATURES].copy()
    y_valid = train.loc[test_index,"rainfall"]
    x_test = test[FEATURES].copy()

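    for c in FEATURES:
        # STANDARDIZE WITH TRAIN FOLD STATISTICS, THEN APPLY FEATURE WEIGHT
        m = x_train[c].mean()
        s = x_train[c].std()
        x_train[c] = WGT[c] * (x_train[c]-m)/s
        x_valid[c] = WGT[c] * (x_valid[c]-m)/s
        x_test[c] = WGT[c] * (x_test[c]-m)/s
        # AFTER STANDARDIZING, 0 IS THE TRAIN FOLD MEAN, SO FILL MISSING VALUES WITH 0
        x_train[c] = x_train[c].fillna(0)
        x_valid[c] = x_valid[c].fillna(0)
        x_test[c] = x_test[c].fillna(0)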

    model = KNeighborsClassifier(n_neighbors=201, p=1)
    model.fit(x_train.values, y_train.values)

    # INFER OOF
    oof_knn[test_index] = model.predict_proba(x_valid.values)[:,1]
    # INFER TEST
    pred_knn += model.predict_proba(x_test.values)[:,1]

# COMPUTE AVERAGE TEST PREDS
pred_knn /= FOLDS

#########################
### Fold 1
#########################
#########################
### Fold 2
#########################
#########################
### Fold 3
#########################
#########################
### Fold 4
#########################
#########################
### Fold 5
#########################
CPU times: user 2.32 s, sys: 493 ms, total: 2.81 s
Wall time: 3.8 s
Parser   : 113 ms


In [7]:
from sklearn.metrics import roc_auc_score
true = train.rainfall.values
m = roc_auc_score(true, oof_knn)
print(f"KNN CV Score AUC = {m:.3f}")

KNN CV Score AUC = 0.751


# Submission CSV Ensemble!
We load the submission from version 1 of the best public notebook, which achieves `LB 0.954` (from [here][2]). Then we ensemble our new KNN model predictions with weights `-0.25 * KNN + 1.25 * Public`. We use `scipy.stats.rankdata` to normalize predictions before ensembling. We achieve `LB 0.961`, hooray!

[2]: https://www.kaggle.com/code/act18l/lb-probing
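
The rank blend can be traced by hand on a tiny example. The sketch below uses four made-up predictions per model (not competition data) and a NumPy helper, `average_ranks`, that is assumed here as a stand-in for `scipy.stats.rankdata` with its default `average` tie handling. Since `-0.25 * KNN + 1.25 * Public` equals `Public + 0.25 * (Public - KNN)`, the blend starts from the public ranking and nudges a row up when the public model ranks it higher than KNN does.

```python
import numpy as np

def average_ranks(x):
    """Ranks starting at 1; tied values share the average of their ranks."""
    x = np.asarray(x, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)  # highest rank held by each unique value
    return (last_rank - (counts - 1) / 2.0)[inverse]

# TOY PREDICTIONS (HYPOTHETICAL); THE PUBLIC VALUES ARE TIED ON ROWS 0 AND 2
knn_demo = np.array([0.9, 0.2, 0.6, 0.4])
public_demo = np.array([2.0, 0.1, 2.0, 0.3])

# RANKS: KNN -> 4, 1, 3, 2 AND PUBLIC -> 3.5, 1, 3.5, 2
blend = -0.25 * average_ranks(knn_demo) + 1.25 * average_ranks(public_demo)
# RE-RANK AND SCALE INTO (0, 1]; THE ORDER, AND SO THE AUC, IS UNCHANGED
final = average_ranks(blend) / len(blend)

print(blend)  # values: 3.375, 1.0, 3.625, 2.0
print(final)  # values: 0.75, 0.25, 1.0, 0.5
```

Rows 0 and 2 are tied in the public predictions, and the negative KNN weight breaks the tie in favor of row 2, the one KNN scores lower.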

In [8]:
print("Best Public Notebook achieves LB = 0.954!")
best_public = pd.read_csv("/kaggle/input/lb-915-public-notebook/submission95427.csv")
display( best_public.head() )
best_public = best_public.rainfall.values

Best Public Notebook achieves LB = 0.954!


| | id | rainfall |
|---|---|---|
| 0 | 2190 | 2.0 |
| 1 | 2191 | 2.0 |
| 2 | 2192 | 2.0 |
| 3 | 2193 | 0.084932 |
| 4 | 2194 | 0.019863 |


In [9]:
from scipy.stats import rankdata

print("Ensemble achieves LB = 0.961! Hooray!")
sub = pd.read_csv("/kaggle/input/playground-series-s5e3/sample_submission.csv")
sub.rainfall = -0.25 * rankdata( pred_knn ) + 1.25 * rankdata( best_public )
sub.rainfall = rankdata( sub.rainfall ) / len(sub)
print( sub.shape )
sub.to_csv("submission_ensemble.csv",index=False)
sub.head()

Ensemble achieves LB = 0.961! Hooray!
(730, 2)


| | id | rainfall |
|---|---|---|
| 0 | 2190 | 0.99726 |
| 1 | 2191 | 0.99589 |
| 2 | 2192 | 0.99863 |
| 3 | 2193 | 0.121918 |
| 4 | 2194 | 0.057534 |
