<h1>kNN model on kaggle-pet competition</h1>

we are following Stanford CS231n, and trying to apply their assignements to this competition.

CS231N: http://cs231n.github.io/assignments2018/assignment1/
<p>sci-kit knn: https://scikit-learn.org/stable/modules/neighbors.html

Main differences between assignment and this notebook:
    - we will be using Knn for regression task, and not for classification
    - we will be using sci-kit implementation of knn and not written by us
    

In [48]:
import pandas as pd
import numpy as np

In [148]:
from time import ctime
from sklearn import neighbors
from sklearn import metrics

In [40]:
import os
os.listdir('../data')
assert 'out_breed.csv' in os.listdir('../data') # this assert breaks if the data is configured uncorrectly

In [105]:
breeds = pd.read_csv('../data/out_breed.csv')
colors = pd.read_csv('../data/out_color.csv')
states = pd.read_csv('../data/out_state.csv')
train  = pd.read_csv('../data/out_train.csv')
test   = pd.read_csv('../data/out_test.csv')
sub    = pd.read_csv('../data/out_submission.csv')

In [106]:
train.columns

Index(['Unnamed: 0', 'Type', 'Name', 'Age', 'Breed1', 'Breed2', 'Gender',
       'Color1', 'Color2', 'Color3', 'MaturitySize', 'FurLength', 'Vaccinated',
       'Dewormed', 'Sterilized', 'Health', 'Quantity', 'Fee', 'State',
       'RescuerID', 'VideoAmt', 'Description', 'PetID', 'PhotoAmt',
       'AdoptionSpeed', 'dataset_type'],
      dtype='object')

In [164]:
class PredictiveModel(object):
    """
    base class for the prediction task of Adoption Prediction competition
    
    KNN-classifier
    """
    
    def __init__(self, name, neighbors_number=15):
        self.name = name
        self.model = None
        self.predictions = None
        self.neighbors_number = neighbors_number
        print("{} [{}.__init__] initialized succesfully".format(ctime(), self.name))
        
    def train(self, X, Y):
        """
        train method, feature generation is inside here, data cleaning outside
        
        Args:
            X: pandas.DataFrame, shape = (, 24)
            Y: pandas.Series
        """
        print("{} [{}.train] start training".format(ctime(), self.name))
        
        KNNclassifier = neighbors.KNeighborsClassifier(self.neighbors_number)
        KNNclassifier.fit(X, Y)
        self.model = KNNclassifier
        
        print("{} [{}.train] trained succefully".format(ctime(), self.name))

        
    def predict(self, X):
        """
        predict method, feature generation is inside here, data cleaning outside
        
        Args:
            X: pandas.DataFrame, shape = (, 24)
        Returns:
            Y: pandas.Series
            
        Raise:
            .not trained
        """
        print("{} [{}.train] start predictions".format(ctime(), self.name))
        if not self.model:
            raise Exception("{} [{}.predict] ERROR model is not trained, you need to call {}.train first".format(ctime(), self.name, self.name))
            
        predictions = self.model.predict(X)
        self.predictions = predictions
        
        print("{} [{}.train] predicted succesfully".format(ctime(), self.name))
        return predictions
    
    def evaluate(self, labels):
        """
        evaluate predictions accuracy using competition metric "Quadratic Weighted Kappa"
        more here https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html
        
        Args:
            labels: truth-values, pandas.Series
        
        returns: float
        
        NOTE [Interpreting the Quadratic Weighted Kappa Metric]:
        (https://www.kaggle.com/aroraaman/quadratic-kappa-metric-explained-in-5-simple-steps)
        
        A weighted Kappa is a metric which is used to calculate the amount of similarity between predictions and actuals. A perfect score of  1.0 is granted when both the predictions and actuals are the same. 
        Whereas, the least possible score is -1 which is given when the predictions are furthest away from actuals. In our case, consider all actuals were 0's and all predictions were 4's. This would lead to a QWKP score of -1.
        The aim is to get as close to 1 as possible. Generally a score of 0.6+ is considered to be a really good score.
        """
        print("{} [{}.train] start evaluation".format(ctime(), self.name))
        if not self.predictions:
            raise Exception("{} [{}.predict] ERROR model didn't predict, you need to call {}.predict first".format(ctime(), self.name, self.name))
            
        labels_array = np.array(labels)
        if not labels_array.shape == self.predictions.shape:
            raise Exception("{} [{}.predict] ERROR the shape of truth value (labels) and self.predictions is different, you are giving the wrong number of labels".format(ctime(), self.name, self.name))      
            
        score = metrics.cohen_kappa_score(labels_array, self.predictions)
        
        print("{} [{}.train] predicted succesfully".format(ctime(), self.name))
        return score

In [165]:
"""
Y is our target value, Adoption Speed can be a value [1,2,3,4]
"""
Y = train['AdoptionSpeed']

In [166]:
"""
this is a really primitive data cleaning to make KNN works: we drop the followings
- AdoptionSpeed, is target
- Unnamed:0, dataset_type, is useless
- Name, RescuerId, Description, PhotoAmt, VideoAmt, PetID: this are all strings valued not able to be processed by KNN
"""
X = train.drop(["AdoptionSpeed", "Unnamed: 0", "dataset_type", "Name", "RescuerID", "Description", "PhotoAmt","VideoAmt","PetID"], axis=1)

In [167]:
assert X.shape[0] == Y.shape[0]

In [168]:
train_X, train_Y = X[:train_size], Y[:train_size]

In [169]:
train_size = int(len(X)*0.8)

In [170]:
test_X, test_Y = X[train_size:], Y[train_size:]

In [171]:
assert train_X.shape[0] == train_Y.shape[0]
assert test_X.shape[0] == test_Y.shape[0]

In [172]:
model = PredictiveModel("test")

Wed Feb 27 20:20:38 2019 [test.__init__] initialized succesfully


In [173]:
model.train(train_X, train_Y)

Wed Feb 27 20:20:38 2019 [test.train] start training
Wed Feb 27 20:20:38 2019 [test.train] trained succefully


In [174]:
predictions = model.predict(test_X)

Wed Feb 27 20:20:39 2019 [test.train] start predictions
Wed Feb 27 20:20:39 2019 [test.train] predicted succesfully


In [177]:
model.evaluate(test_Y)

Wed Feb 27 20:22:26 2019 [test.train] start evaluation


ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()