# Only a few days left to go! 😱
**Just joining this challenge? You're at the right place.**

The goal is to predict ***pet adoptability***, i.e., given the profile of a cat or dog in a shelter, guess *how fast* he/she will get adopted by a caring family.

The "adoption speed" has 5 possible values, between 0 and 4. From the [official data description](https://www.kaggle.com/c/petfinder-adoption-prediction/data):

> The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way: 
* 0 - Pet was adopted on the same day as it was listed. 
* 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
* 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
* 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
* 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).

Pretty straightforward.
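
If it helps to see the rule as code, here is a minimal sketch of the mapping described above. The function name `adoption_speed` and its `days_until_adoption` argument are hypothetical (the dataset already ships the label), and treating anything slower than 90 days as class 4 is an assumption based on the note that no pet waited between 90 and 100 days.

```python
def adoption_speed(days_until_adoption):
    """Map a waiting time in days to the 0-4 label described above.

    `days_until_adoption` is a hypothetical input: None means the pet
    was not adopted during the observation window.
    """
    if days_until_adoption is None:
        return 4
    if days_until_adoption == 0:
        return 0
    if days_until_adoption <= 7:
        return 1
    if days_until_adoption <= 30:
        return 2
    if days_until_adoption <= 90:
        return 3
    # Assumption: anything slower than 90 days counts as "no adoption".
    return 4


print([adoption_speed(d) for d in (0, 3, 7, 8, 30, 31, 90, None)])
```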

# How are submissions compared?

> Submissions are scored based on the **quadratic weighted kappa**, which measures the agreement between two ratings. This metric typically varies from 0 (random agreement between raters) to 1 (complete agreement between raters). In the event that there is less agreement between the raters than expected by chance, the metric may go below 0. The quadratic weighted kappa is calculated between the scores which are expected/known and the predicted scores.

[You can read more on Wikipedia](https://en.wikipedia.org/wiki/Cohen%27s_kappa#Weighted_kappa) if you want, but you don't need to. Here's the implementation most kernels use:


In [None]:
import scipy as sp
import scipy.optimize  # make sure the sp.optimize submodule used below is loaded
import numpy as np

from collections import Counter
from functools import partial
from math import sqrt

from sklearn.metrics import cohen_kappa_score, mean_squared_error
from sklearn.metrics import confusion_matrix as sk_cmatrix

# FROM: https://www.kaggle.com/myltykritik/simple-lgbm-image-features

# The following 3 functions have been taken from Ben Hamner's github repository
# https://github.com/benhamner/Metrics
def confusion_matrix(rater_a, rater_b, min_rating=None, max_rating=None):
    """
    Returns the confusion matrix between rater's ratings
    """
    assert(len(rater_a) == len(rater_b))
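    # min/max are taken per rater: with NumPy arrays, `rater_a + rater_b`
    # would add element-wise instead of concatenating.
    if min_rating is None:
        min_rating = min(min(rater_a), min(rater_b))
    if max_rating is None:
        max_rating = max(max(rater_a), max(rater_b))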
    num_ratings = int(max_rating - min_rating + 1)
    conf_mat = [[0 for i in range(num_ratings)]
                for j in range(num_ratings)]
    for a, b in zip(rater_a, rater_b):
        conf_mat[a - min_rating][b - min_rating] += 1
    return conf_mat


def histogram(ratings, min_rating=None, max_rating=None):
    """
    Returns the counts of each type of rating that a rater made
    """
    if min_rating is None:
        min_rating = min(ratings)
    if max_rating is None:
        max_rating = max(ratings)
    num_ratings = int(max_rating - min_rating + 1)
    hist_ratings = [0 for x in range(num_ratings)]
    for r in ratings:
        hist_ratings[r - min_rating] += 1
    return hist_ratings


def quadratic_weighted_kappa(y, y_pred):
    """
    Calculates the quadratic weighted kappa.

    quadratic_weighted_kappa calculates the quadratic weighted kappa
    value, which is a measure of inter-rater agreement between two raters
    that provide discrete numeric ratings.  Potential values range from -1
    (representing complete disagreement) to 1 (representing complete
    agreement).  A kappa value of 0 is expected if all agreement is due to
    chance.

    quadratic_weighted_kappa(y, y_pred), where y and y_pred each
    correspond to a list of integer ratings.  These lists must have the
    same length.

    The ratings should be integers, and it is assumed that they contain
    the complete range of possible ratings: the minimum and maximum
    possible ratings are inferred from the data.
    """
    rater_a = y
    rater_b = y_pred
    min_rating=None
    max_rating=None
    rater_a = np.array(rater_a, dtype=int)
    rater_b = np.array(rater_b, dtype=int)
    assert(len(rater_a) == len(rater_b))
    if min_rating is None:
        min_rating = min(min(rater_a), min(rater_b))
    if max_rating is None:
        max_rating = max(max(rater_a), max(rater_b))
    conf_mat = confusion_matrix(rater_a, rater_b,
                                min_rating, max_rating)
    num_ratings = len(conf_mat)
    num_scored_items = float(len(rater_a))

    hist_rater_a = histogram(rater_a, min_rating, max_rating)
    hist_rater_b = histogram(rater_b, min_rating, max_rating)

    numerator = 0.0
    denominator = 0.0

    for i in range(num_ratings):
        for j in range(num_ratings):
            expected_count = (hist_rater_a[i] * hist_rater_b[j]
                              / num_scored_items)
            d = pow(i - j, 2.0) / pow(num_ratings - 1, 2.0)
            numerator += d * conf_mat[i][j] / num_scored_items
            denominator += d * expected_count / num_scored_items

    return (1.0 - numerator / denominator)
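
To get a feel for what "quadratic" means here, the toy sketch below scores three sets of made-up predictions. It uses scikit-learn's `cohen_kappa_score` with `weights='quadratic'` rather than the function above; the two should agree as long as every class between the minimum and maximum rating actually occurs, which is the case in this example.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up labels covering all five classes twice.
y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# Same as y_true, except the first pet is predicted one class too slow...
near_miss = [1, 1, 2, 3, 4, 0, 1, 2, 3, 4]
# ...or four classes too slow.
far_miss = [4, 1, 2, 3, 4, 0, 1, 2, 3, 4]

kappa_perfect = cohen_kappa_score(y_true, y_true, weights='quadratic')
kappa_near = cohen_kappa_score(y_true, near_miss, weights='quadratic')
kappa_far = cohen_kappa_score(y_true, far_miss, weights='quadratic')
print(kappa_perfect, kappa_near, kappa_far)
```

A single mistake that is off by one class barely moves the score, while a single mistake that is off by four classes drops it to 0.6: errors are penalized by the square of their distance.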

# What Data is Available?

The challenge includes **text, tabular, and image data**. Let's load it to have a look.

In [None]:
import pandas as pd

train = pd.read_csv('../input/petfinder-adoption-prediction/train/train.csv')
test = pd.read_csv('../input/petfinder-adoption-prediction/test/test.csv')
X = pd.concat([train, test], ignore_index=True, sort=False)
X.head()

The files `breed_labels.csv`, `color_labels.csv` and `state_labels.csv` are dictionaries mapping respectively breeds, colors and places to their full names. Additionally, **sentiment data** from Google's Natural Language API and **image metadata** from Google's Vision API are provided in the `*_sentiment.zip` and `*_metadata.zip` files respectively.
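
As a rough sketch of how such a label file can be used, the snippet below maps breed IDs to names on toy data. The frames are built in memory so the example is self-contained, and the column names (`BreedID`, `BreedName`, `Breed1`) are assumptions made for this sketch; check the real files before relying on them.

```python
import pandas as pd

# Toy stand-ins for breed_labels.csv and train.csv. The column names
# (BreedID, BreedName, Breed1) are assumptions made for this sketch.
breed_labels = pd.DataFrame({
    'BreedID': [1, 2, 3],
    'BreedName': ['Breed A', 'Breed B', 'Breed C'],
})
pets = pd.DataFrame({'PetID': ['p1', 'p2', 'p3'], 'Breed1': [2, 3, 99]})

# Build an ID -> name lookup and apply it; IDs missing from the lookup become NaN.
breed_names = breed_labels.set_index('BreedID')['BreedName']
pets['Breed1Name'] = pets['Breed1'].map(breed_names)
print(pets)
```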

We will not do any EDA here; many good kernels can give you the information you need, e.g.:
* https://www.kaggle.com/artgor/exploration-of-data-step-by-step
* https://www.kaggle.com/erikbruin/petfinder-my-detailed-eda-and-xgboost-baseline
* https://www.kaggle.com/jaseziv83/an-extensive-eda-of-petfinder-my-data

**You are also allowed to use external data sources.** All the allowed sources are listed in [the Official External Data Disclosure Thread](https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/75943). It is now too late to ask to use anything else.

# So, is this a classification or a regression task?

It can be seen as both. But participants are overwhelmingly using **regression models** because they give better results:

> I tried all "optimal" methods for classification I found in research articles over the last 20 years and they work less well than regression with cutoff searching on this dataset… -- [Amit Steinberg](https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/81375#481535)

So let's go with regression. What is ***cutoff searching***? 

You will first extract features from the data, then fit a regression model. Let's say for a given animal your model gives a value of `1.6`. Should you consider it as part of **class 1** ("pet was adopted between 1 and 7 days") or **class 2** ("between 8 and 30 days")?

It is not clear where the ***cutoffs*** between the classes should be! You want to optimize the cutoffs so they give you the best **evaluation metric**. In practice, most kernels optimize the cutoff coefficients using the [Nelder-Mead method](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method). You can read more about this, and explore alternative optimization methods, in [Grandmaster Abhishek's post](https://www.kaggle.com/c/petfinder-adoption-prediction/discussion/76107).
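
To make the idea concrete, here is a minimal sketch of how a set of four cutoffs turns continuous predictions into classes. The helper name `apply_cutoffs` and the prediction values are made up for illustration; it uses `np.digitize` with `right=True`, which should match the right-closed intervals that `pd.cut` uses by default.

```python
import numpy as np


def apply_cutoffs(predictions, cutoffs):
    """Turn continuous predictions into classes 0-4 using four cutoffs.

    right=True assigns a value equal to a cutoff to the lower class,
    mirroring the right-closed intervals of pd.cut.
    """
    return np.digitize(predictions, np.sort(cutoffs), right=True)


predictions = np.array([0.2, 1.6, 2.4, 3.9])

# "Rounding" cutoffs halfway between the classes: 1.6 lands in class 2.
rounded = apply_cutoffs(predictions, [0.5, 1.5, 2.5, 3.5])
# Moving the second cutoff up to 1.7 puts the same prediction in class 1.
shifted = apply_cutoffs(predictions, [0.5, 1.7, 2.5, 3.5])
print(rounded, shifted)
```

The model output is the same in both cases; only the cutoffs changed, and so did the predicted class. That is why the cutoffs are worth tuning against the metric.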

Here's an example implementation that optimizes the cutoffs using Nelder-Mead:

In [None]:
# taken from https://www.kaggle.com/ranjoranjan/single-xgboost-model
class OptimizedRounder(object):
    def __init__(self):
        self.coef_ = 0
    
    def _kappa_loss(self, coef, X, y):
        preds = pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3, 4])
        return -cohen_kappa_score(y, preds, weights='quadratic')
    
    def fit(self, X, y):
        loss_partial = partial(self._kappa_loss, X = X, y = y)
        initial_coef = [0.5, 1.5, 2.5, 3.5]
        self.coef_ = sp.optimize.minimize(loss_partial, initial_coef, method='nelder-mead')
    
    def predict(self, X, coef):
        preds = pd.cut(X, [-np.inf] + list(np.sort(coef)) + [np.inf], labels = [0, 1, 2, 3, 4])
        return preds
    
    def coefficients(self):
        return self.coef_['x']

You will then optimize the coefficients with:

```
optR = OptimizedRounder()
optR.fit(regression_predictions, original_labels)
print(optR.coefficients())
```
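
If you want to try the idea before any model is trained, the sketch below runs the same kind of Nelder-Mead search on synthetic data. Everything in it is made up for illustration: the labels are random, the "predictions" are deliberately squeezed toward the middle of the range, and the helper names `apply_cutoffs` and `neg_qwk` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score


def apply_cutoffs(predictions, cutoffs):
    # Classes 0-4 from four cutoffs (sorted, in case the optimizer reorders them).
    return np.digitize(predictions, np.sort(cutoffs), right=True)


def neg_qwk(cutoffs, predictions, labels):
    # Nelder-Mead minimizes, so return the negated kappa.
    classes = apply_cutoffs(predictions, cutoffs)
    return -cohen_kappa_score(labels, classes, weights='quadratic')


# Synthetic data: fake regression output compressed into roughly [1, 3].
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=2000)
predictions = 1.0 + 0.5 * labels + rng.normal(0, 0.4, size=2000)

initial = [0.5, 1.5, 2.5, 3.5]
result = minimize(neg_qwk, initial, args=(predictions, labels), method='Nelder-Mead')

baseline_qwk = -neg_qwk(initial, predictions, labels)
optimized_qwk = -result.fun
print("QWK with rounding cutoffs: ", baseline_qwk)
print("QWK with optimized cutoffs:", optimized_qwk)
print("Cutoffs:", np.sort(result.x))
```

Because the starting cutoffs are part of the initial simplex, the optimized score should never be worse than the rounding baseline on the data it was tuned on. Whether the gain carries over to unseen data is a separate question.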

# Let's build some features!

OK, we now have a basic understanding of the problem. Let's extract some features and actually train a model. Most of the code below is taken from [the *single-xgboost-model* notebook](https://www.kaggle.com/ranjoranjan/single-xgboost-model/notebook), which itself reuses lots of shared code snippets from different public kernels.

**We will first build a simple model using only the basic data from `train.csv` and `test.csv`. Then you should improve it by adding features from the text, images or any other external source.**

# Simplest Model
We start by making a copy of the original data so we leave the original `X` untouched. We drop all text columns for now. 

In [None]:
X_temp = X.copy()

to_drop_columns = ['PetID', 'Name', 'RescuerID', 'Description']
X_temp = X_temp.drop(to_drop_columns, axis=1)

We split the data back into train and test sets, remove `AdoptionSpeed` from the test set, and fill the remaining missing values with `-1`.

In [None]:
X_train = X_temp.loc[np.isfinite(X_temp.AdoptionSpeed), :]
X_test = X_temp.loc[~np.isfinite(X_temp.AdoptionSpeed), :]
X_test = X_test.drop(['AdoptionSpeed'], axis=1)
X_train_non_null = X_train.fillna(-1)
X_test_non_null = X_test.fillna(-1)
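
The split relies on a side effect of `pd.concat`: the test file has no `AdoptionSpeed` column, so its rows end up with `NaN` there, and `np.isfinite` separates the two groups. A toy sketch of the same mechanics, with made-up frames and an `Age` column chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Toy frames: only the training rows carry a label.
toy_train = pd.DataFrame({'Age': [3, 12, 24], 'AdoptionSpeed': [2, 4, 1]})
toy_test = pd.DataFrame({'Age': [6, np.nan]})

# The test frame has no AdoptionSpeed column, so concat fills it with NaN.
combined = pd.concat([toy_train, toy_test], ignore_index=True, sort=False)

is_train = np.isfinite(combined['AdoptionSpeed'])
train_part = combined.loc[is_train, :]
test_part = combined.loc[~is_train, :].drop(['AdoptionSpeed'], axis=1)

# Fill the remaining missing feature values with a sentinel.
test_part = test_part.fillna(-1)
print(train_part.shape, test_part.shape)
```

Note that this only works because no training row has a missing label; a training row with `NaN` in `AdoptionSpeed` would silently end up in the test part.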

That's it! We're ready to train our first model.

# Training a Model
These are the XGBoost parameters; check out [the documentation for more info](https://xgboost.readthedocs.io/en/latest/parameter.html):

In [None]:
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

xgb_params = {
    'eval_metric': 'rmse',
    'seed': 1337,
    'eta': 0.0123,
    'subsample': 0.8,
    'colsample_bytree': 0.85,
    'tree_method': 'gpu_hist',
    'device': 'gpu',
    'silent': 1,
}

We train the model using stratified k-fold cross-validation:

In [None]:
def run_xgb(params, X_train, X_test):
    n_splits = 10
    verbose_eval = 1000
    num_rounds = 60000
    early_stop = 500

    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1337)

    oof_train = np.zeros((X_train.shape[0]))
    oof_test = np.zeros((X_test.shape[0], n_splits))

    i = 0
    for train_idx, valid_idx in kf.split(X_train, X_train['AdoptionSpeed'].values):

        X_tr = X_train.iloc[train_idx, :]
        X_val = X_train.iloc[valid_idx, :]

        y_tr = X_tr['AdoptionSpeed'].values
        X_tr = X_tr.drop(['AdoptionSpeed'], axis=1)

        y_val = X_val['AdoptionSpeed'].values
        X_val = X_val.drop(['AdoptionSpeed'], axis=1)

        d_train = xgb.DMatrix(data=X_tr, label=y_tr, feature_names=X_tr.columns)
        d_valid = xgb.DMatrix(data=X_val, label=y_val, feature_names=X_val.columns)

        watchlist = [(d_train, 'train'), (d_valid, 'valid')]
        model = xgb.train(dtrain=d_train, num_boost_round=num_rounds, evals=watchlist,
                         early_stopping_rounds=early_stop, verbose_eval=verbose_eval, params=params)

        valid_pred = model.predict(xgb.DMatrix(X_val, feature_names=X_val.columns), ntree_limit=model.best_ntree_limit)
        test_pred = model.predict(xgb.DMatrix(X_test, feature_names=X_test.columns), ntree_limit=model.best_ntree_limit)

        oof_train[valid_idx] = valid_pred
        oof_test[:, i] = test_pred

        i += 1
    return model, oof_train, oof_test
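
The out-of-fold bookkeeping in `run_xgb` may be easier to follow on a toy problem. The sketch below keeps the same structure but swaps XGBoost for scikit-learn's `Ridge` so it runs in a second on synthetic data; the data, the model choice and the `times_validated` counter are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import StratifiedKFold

# Synthetic data: 3 features, integer labels in 0-4 driven by the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_toy = np.clip(np.round(2 + X_toy[:, 0] + noise), 0, 4).astype(int)
X_new = rng.normal(size=(50, 3))  # plays the role of the test set

n_splits = 5
kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=1337)

oof_train = np.zeros(len(X_toy))            # one prediction per training row
oof_test = np.zeros((len(X_new), n_splits))  # one column of test predictions per fold
times_validated = np.zeros(len(X_toy), dtype=int)

for fold, (train_idx, valid_idx) in enumerate(kf.split(X_toy, y_toy)):
    fold_model = Ridge().fit(X_toy[train_idx], y_toy[train_idx])
    # Each training row is predicted by the one model that never saw it.
    oof_train[valid_idx] = fold_model.predict(X_toy[valid_idx])
    times_validated[valid_idx] += 1
    oof_test[:, fold] = fold_model.predict(X_new)

# The final test prediction is the average over the fold models.
test_pred = oof_test.mean(axis=1)
print(oof_train[:5], test_pred[:5])
```

Stratifying on the label keeps the class proportions roughly equal across folds, which matters when some adoption speeds are rare.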

In [None]:
model, oof_train, oof_test = run_xgb(xgb_params, X_train_non_null, X_test_non_null)

Done! Let's compare the distribution of the out-of-fold predictions on the train set with the distribution of the averaged predictions on the test set:

In [None]:
import seaborn as sns

def plot_pred(pred):
    sns.distplot(pred, kde=True, hist_kws={'range': [0, 5]})
    
plot_pred(oof_train)
plot_pred(oof_test.mean(axis=1))

Let's compute the Quadratic Weighted Kappa this gets us. Remember that this score is computed from out-of-fold predictions on our training set, and the cutoffs are tuned on those same predictions, so it is probably a little optimistic. **It won't be the same as what appears on the leaderboard**, as the leaderboard uses a secret validation set. It will also be different from the final competition score, which will be calculated based on yet another secret validation set.

In [None]:
optR = OptimizedRounder()
optR.fit(oof_train, X_train['AdoptionSpeed'].values)
coefficients = optR.coefficients()
valid_pred = optR.predict(oof_train, coefficients)
qwk = quadratic_weighted_kappa(X_train['AdoptionSpeed'].values, valid_pred)
print("QWK = ", qwk)

Let's see which features ended up being the **most important** in this model:

In [None]:
xgb.plot_importance(model)

We can now write the `submission.csv` file to submit it to the challenge.

In [None]:
test_predictions = optR.predict(oof_test.mean(axis=1), coefficients).astype(np.int8)
submission = pd.DataFrame({'PetID': test['PetID'].values, 'AdoptionSpeed': test_predictions})
submission.to_csv('submission.csv', index=False)
submission.head()
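
Before uploading, it can be worth checking the file's shape. The helper below is a hypothetical sketch, not part of any Kaggle tooling: it only checks the two columns written above, the set of pet IDs, and that every label is one of the five classes. The toy frame and IDs are made up.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Hypothetical sanity checks for a submission DataFrame."""
    assert list(submission.columns) == ['PetID', 'AdoptionSpeed']
    assert len(submission) == len(expected_ids)
    assert set(submission['PetID']) == set(expected_ids)
    assert submission['AdoptionSpeed'].isin([0, 1, 2, 3, 4]).all()
    return True


# Toy example with made-up IDs.
toy_submission = pd.DataFrame({'PetID': ['a1', 'b2', 'c3'], 'AdoptionSpeed': [2, 4, 0]})
submission_ok = check_submission(toy_submission, ['a1', 'b2', 'c3'])
print(submission_ok)
```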

**OK! This is great: we have built features, fit a regression model, and optimized the cutoff values.** Now, we want to add more features to hopefully improve our model.

`# WIP`