In [None]:
%matplotlib inline
import itertools
from pprint import pprint

import pymongo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import matplotlib.colors as mcolors

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

import topcoder_mongo as DB
import topcoder_ml as TML
import static_var as S
import util as U

sns.set(
    rc={
        'axes.facecolor':'#121212',
        'figure.facecolor':'#121212',
        'text.color': 'white',
        'axes.titlecolor': 'white',
        'axes.labelcolor': 'white',
        'xtick.color': 'white',
        'ytick.color': 'white',
        'figure.autolayout': True,
    },
)

pd.set_option('display.max_rows', 500)

A practical question that I have never been able to settle:

**When should I standardize my dataset, and should I standardize all features?**
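One common way to approach the second half of that question is to scale only selected columns. The sketch below is an illustration under assumptions, not the approach taken in this notebook: the toy array is made up, and treating column 0 as the only continuous feature is hypothetical. The scaler is fitted on the training rows only, so test-set statistics do not leak into the scaling.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data (made up): column 0 is continuous, columns 1-2 are binary flags.
X_scale_demo = np.array([
    [1000.0, 1.0, 0.0],
    [2000.0, 0.0, 1.0],
    [3000.0, 1.0, 1.0],
    [4000.0, 0.0, 0.0],
])
X_scale_train, X_scale_test = X_scale_demo[:3], X_scale_demo[3:]

# Standardize only column 0; pass the binary columns through unchanged.
partial_scaler = ColumnTransformer(
    [('continuous', StandardScaler(), [0])],
    remainder='passthrough',
)

# Fit on the training rows only, then reuse the fitted statistics on the test rows.
X_scale_train_std = partial_scaler.fit_transform(X_scale_train)
X_scale_test_std = partial_scaler.transform(X_scale_test)
```

Tree-based models such as `GradientBoostingRegressor` split on thresholds, so they are largely insensitive to this kind of rescaling; it tends to matter more for distance-based or gradient-based linear models.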

## Retrieve training data

In [None]:
feature, target = TML.get_training_data()
X, y = feature.to_numpy(), target.to_numpy()

In [None]:
X.shape, y.shape

Let's visualize the distribution of `top2_prize`. I plot the frequency of prizes in $50-wide bins.

In [None]:
target.top2_prize.min(), target.top2_prize.max()

In [None]:
bins = int((2700 - 300) / 50)
fig, ax = plt.subplots(figsize=(16, 6.67), dpi=200)

sns.histplot(x=target.top2_prize, bins=bins, lw=0.5, ax=ax)
sns.despine(ax=ax, left=True)
ax.set_xlim(300, 2700)
ax.xaxis.grid(False)
ax.yaxis.grid(True, color='white', alpha=0.5)
ax.set_title('Top2 Prize Distribution')
ax.set_xlabel('Top2 Prize')
ax.xaxis.set_major_locator(mticker.MultipleLocator(100))

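# Label each bar with its count. The loop uses `bar_x`/`cnt` rather than
# `x`/`y` so that the target array `y` defined earlier is not overwritten.
for p in ax.patches:
    cnt = int(p.get_height())
    bar_x = p.get_x() + p.get_width() / 2
    ax.annotate(str(cnt), xy=(bar_x, cnt), xytext=(bar_x, cnt + 5), color='white', alpha=0.85, ha='center')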

I decided to run a Mongo query to get the challenge ids for each bin, because achieving that with `pandas` would take more tweaking than just running a (still relatively complicated) query.

In [None]:
# 49 edges from 300 to 2700 in $50 steps, paired into 48 (low, high) intervals.
prize_intv_points = np.linspace(300, 2700, int((2700 - 300) / 50) + 1)
prize_interval = list(zip(prize_intv_points, prize_intv_points[1:]))
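The query itself is not shown here. As a rough sketch of one possible shape, MongoDB's `$bucket` aggregation stage can group documents into the same $50 intervals and collect the ids in each. The field names `top2_prize` and `challenge_id` and the `collection` handle are assumptions about the schema, not taken from the actual query.

```python
# Plain Python ints so the boundaries serialize cleanly to BSON.
bucket_boundaries = list(range(300, 2700 + 1, 50))  # 49 edges -> 48 buckets

prize_bucket_pipeline = [
    {
        '$bucket': {
            # Hypothetical field name holding the prize value.
            'groupBy': '$top2_prize',
            'boundaries': bucket_boundaries,
            # Each bucket is [low, high); values outside the range,
            # including a prize of exactly 2700, land in 'other'.
            'default': 'other',
            'output': {
                'count': {'$sum': 1},
                # Hypothetical field name holding the challenge id.
                'challenge_ids': {'$push': '$challenge_id'},
            },
        }
    }
]

# With a live connection, `collection` being a hypothetical pymongo Collection:
# results = list(collection.aggregate(prize_bucket_pipeline))
```

Because the upper edge of every bucket is exclusive, the top prize needs either the `default` bucket used here or one extra boundary above 2700.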

## 10-Fold Cross Validation Predict

### Cross Validation Strategy

The "Independent and Identically Distributed" assumption that 

> _all samples stem from the same generative process and that the generative process is assumed to have no memory of past generated samples_

may not hold for the Topcoder dataset, so the following cross-validation strategies will be used to split the training and testing sets.

1. Split the dataset by `top2_prize` as if it were a classification problem, i.e. make sure the different prizes are represented in the validation set.
2. Split by `project_id` (assuming that challenges are dependent within each project).
3. Split by `sub_track` (assuming that challenges are dependent within each sub-track).
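The strategies above map fairly directly onto scikit-learn splitters. This is a minimal sketch on synthetic data, not the notebook's actual pipeline: the `_demo` arrays are made up, `StratifiedKFold` stands in for the prize-based split, and `GroupKFold` stands in for the `project_id` split (the `sub_track` split would work the same way with a different group array).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Synthetic stand-ins (made up): 200 challenges, 4 prize levels, 20 projects.
rng = np.random.default_rng(0)
X_cv_demo = rng.normal(size=(200, 5))
prize_cv_demo = np.tile([600, 900, 1200, 1500], 50)   # 50 challenges per prize
project_cv_demo = np.repeat(np.arange(20), 10)        # 10 challenges per project

# Strategy 1: treat the prize as a class label so that every fold
# contains each prize level in roughly equal proportion.
stratified_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
stratified_folds = list(stratified_cv.split(X_cv_demo, prize_cv_demo))

# Strategy 2: keep all challenges of a project on the same side of the
# split, so a project never appears in both training and validation.
group_cv = GroupKFold(n_splits=10)
group_folds = list(group_cv.split(X_cv_demo, prize_cv_demo, groups=project_cv_demo))

for train_idx, test_idx in group_folds:
    shared = set(project_cv_demo[train_idx]) & set(project_cv_demo[test_idx])
    assert not shared  # no project leaks across the split
```

With real prizes, rare values may have fewer samples than folds; in that case the prizes could be binned first (for example into the $50 intervals used earlier) before being passed as the stratification label.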