# Setup

### Imports

In [31]:
# Standard
import math
from time import time

# Pandas and plotting
import pandas as pd
import numpy as np
from plotnine import *

# SK Learn requirements
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score



# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Load Data

In [2]:
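import os

# Build the path portably instead of hard-coding Windows separators
train_path = os.path.join(os.getcwd(), "data", "train.tsv")
train = pd.read_csv(train_path, sep='\t')

# Set random seed for reproducibility
# (the pd.np alias was removed in pandas 2.0, so call NumPy directly)
np.random.seed(369)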

# Preprocessing
## Parse out individual category levels
Product category names consist of subcategory trees (e.g. Men/Tops/T-shirts). No category appears to have more than 5 levels, and the vast majority have just 3. Below, we create a variable for each level.

In [3]:
from sklearn import preprocessing

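# Copy so that adding columns below does not trigger SettingWithCopyWarning
train_x = train[["brand_name", "shipping", "category_name"]].copy()
train_y = train.price

def piece(string, delim, n):
    """Return the n-th (0-based) piece of `string` split on `delim`, or NaN if absent."""
    if pd.isna(string):
        # Keep missing category names missing instead of turning them into "nan"
        return np.nan
    string = str(string)
    if string.count(delim) < n:
        return np.nan
    return string.split(delim)[n]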

train_x["category1"] = train_x.category_name.map(lambda x: piece(x, "/", 0))
train_x["category2"] = train_x.category_name.map(lambda x: piece(x, "/", 1))
train_x["category3"] = train_x.category_name.map(lambda x: piece(x, "/", 2))
train_x["category4"] = train_x.category_name.map(lambda x: piece(x, "/", 3))
train_x["category5"] = train_x.category_name.map(lambda x: piece(x, "/", 4))
train_x = train_x.drop(columns='category_name')
train_x.head()

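| | brand_name | shipping | category1 | category2 | category3 | category4 | category5 |
|---|---|---|---|---|---|---|---|
| 0 | NaN | 1 | Men | Tops | T-shirts | NaN | NaN |
| 1 | Razer | 0 | Electronics | Computers & Tablets | Components & Parts | NaN | NaN |
| 2 | Target | 1 | Women | Tops & Blouses | Blouse | NaN | NaN |
| 3 | NaN | 1 | Home | Home Décor | Home Décor Accents | NaN | NaN |
| 4 | NaN | 0 | Women | Jewelry | Necklaces | NaN | NaN |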
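The same split can also be sketched with pandas' vectorized string methods, which avoids one Python-level `map` per level. The helper below is a hypothetical alternative, not the notebook's method: the name `split_categories` and the toy `demo` series are made up for illustration, and it assumes no category has more than `max_levels` levels.

```python
import pandas as pd

def split_categories(names, max_levels=5, delim="/"):
    """Split 'A/B/C' style names into one column per level (hypothetical helper)."""
    # expand=True returns one column per piece; shorter names are padded with missing values
    parts = names.str.split(delim, expand=True)
    # Force exactly max_levels columns: missing levels become NaN, deeper levels are dropped
    parts = parts.reindex(columns=range(max_levels))
    parts.columns = [f"category{i + 1}" for i in range(max_levels)]
    return parts

# Toy data for illustration only
demo = pd.Series([
    "Men/Tops/T-shirts",
    "Electronics/Computers & Tablets/Components & Parts",
    None,
])
levels = split_categories(demo)
print(levels)
```

A missing category name stays missing in every level, so it is picked up by the fill step that follows.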


## Fill in missing data

For now, we represent any missing data with the string 'zMissing'.

In [4]:
train_x = train_x.fillna('zMissing')
train_x.head()

| | brand_name | shipping | category1 | category2 | category3 | category4 | category5 |
|---|---|---|---|---|---|---|---|
| 0 | zMissing | 1 | Men | Tops | T-shirts | zMissing | zMissing |
| 1 | Razer | 0 | Electronics | Computers & Tablets | Components & Parts | zMissing | zMissing |
| 2 | Target | 1 | Women | Tops & Blouses | Blouse | zMissing | zMissing |
| 3 | zMissing | 1 | Home | Home Décor | Home Décor Accents | zMissing | zMissing |
| 4 | zMissing | 0 | Women | Jewelry | Necklaces | zMissing | zMissing |


# One hot encode
http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features

Create dummy variables out of the categorical variables. First we use LabelEncoder to replace category names with integer codes. Then we use OneHotEncoder to pivot those codes out into columns.

**OneHotEncoder**

* Encode categorical integer features using a one-hot aka one-of-K scheme.
* The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
* The output will be a sparse matrix where each column corresponds to one possible value of one feature.
* It is assumed that input features take on values in the range [0, n_values).
* This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.



In [5]:
# pandas is proving to be too slow:
# dummies = pd.get_dummies(train_x)
# Let's try it with sklearn instead.
encoder = preprocessing.OneHotEncoder()
label_encoder = preprocessing.LabelEncoder()
# Refit the label encoder on each column to map its values to integer codes
data_label_encoded = train_x.apply(label_encoder.fit_transform)
# One-hot encode the integer codes into a sparse matrix
train_x_encoded = encoder.fit_transform(data_label_encoded)
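To make the two-step encoding concrete, here is a small sketch on a made-up three-row frame. The `toy` data is an assumption for illustration only; the calls mirror the cell above.

```python
import pandas as pd
from sklearn import preprocessing

# Toy data for illustration only
toy = pd.DataFrame({
    "brand_name": ["Razer", "zMissing", "Target"],
    "shipping": [0, 1, 1],
})

# Step 1: LabelEncoder maps each column's values to integer codes
# (codes follow the sorted order of the distinct values in that column).
toy_labels = toy.apply(preprocessing.LabelEncoder().fit_transform)

# Step 2: OneHotEncoder pivots every code into its own 0/1 column
# and returns a sparse matrix.
toy_encoded = preprocessing.OneHotEncoder().fit_transform(toy_labels)

print(toy_labels)
print(toy_encoded.toarray())
```

Three distinct brands plus two shipping values give five output columns, and each row contains exactly one 1 per original feature.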

# Supervised Learning

LightGBM appears to be the de facto standard boosting algorithm right now. https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf

![image.png](attachment:image.png)

In [6]:
# This competition uses root mean squared log error (RMSLE).
# Let's use it to evaluate our models as well.
from sklearn.metrics import make_scorer

# vectorized error calc
def rmsle(y, y0):
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y)-np.log1p(y0), 2)))

rmsle_score = make_scorer(rmsle, greater_is_better=False)

def plot_scores(scores):
    # make_scorer(..., greater_is_better=False) negates the scores, so take absolute values
    scores = np.abs(scores)
    mean_score = round(scores.mean(), 5)
    p = ggplot(pd.DataFrame({'scores': scores}), aes('scores')) + \
        geom_density(fill="lightblue", alpha=1/3) + geom_point(aes(x="scores", y=0), shape=6, size=4, colour="orange") + \
        ggtitle("Mean 10-FCV RMSLE on Training Set: {0}".format(mean_score)) + \
        ylab("Density") + xlab("RMSLE") + \
        theme_light()
    return p
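As a quick sanity check of the metric, the sketch below evaluates the same formula on a couple of made-up prices. The function is repeated only so the block runs on its own; the arrays are assumptions for illustration.

```python
import numpy as np

def rmsle(y, y0):
    # Same formula as in the cell above, repeated so this block is self-contained
    assert len(y) == len(y0)
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y0), 2)))

# Toy prices for illustration only
actual = np.array([10.0, 100.0])

# Perfect predictions give an error of zero
perfect = rmsle(actual, actual)

# The error is relative: doubling every prediction costs roughly the same
# for an item priced at 10 as for an item priced at 100
doubled = rmsle(actual, 2 * actual)

print(perfect, doubled)
```

Because the scorer is built with `greater_is_better=False`, scikit-learn reports the negated RMSLE so that higher is always better, which is why `plot_scores` takes absolute values before plotting.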

## LightGBM
From the abstract of the LightGBM paper:

Gradient Boosting Decision Tree (GBDT) is a popular machine learning algorithm, and has quite a few effective implementations such as XGBoost and pGBRT. Although many engineering optimizations have been adopted in these implementations, the efficiency and scalability are still unsatisfactory when the feature dimension is high and data size is large …

To tackle this problem, we propose two novel techniques: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size. With EFB, we bundle mutually exclusive features (i.e., they rarely take nonzero values simultaneously), to reduce the number of features …

Our experiments on multiple public datasets show that, LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy.

### Parameter Tuning
#### learning_rate 

* Description: Each boosting iteration is supposed to improve the training loss. The improvement is multiplied by the learning rate in order to perform smaller updates. Smaller updates overfit the data more slowly but require more iterations for training.
* Range: ]0, ∞[
* Major Impact: Model Performance, Number of Iterations 
* Minor Impact: Training Time 
* Strategy: 
  * Smaller is usually better for the final training run, but set it larger while tuning other hyperparameters. Consider a learning rate of 0.05 or lower for training, and 0.10 or larger while tinkering with the other hyperparameters.
  * Once your learning rate is fixed, do not change it. It is not a good practice to consider the learning rate as a hyperparameter to tune. Learning rate should be set according to your training speed and performance tradeoff. Do not let an optimizer tune it.
  * This hyperparameter needs to work well with num_iterations. For instance, what takes 5 iterations at a learning rate of 0.1 would take roughly 500 iterations at a learning rate of 0.001 (assuming the number of iterations scales inversely with the learning rate), which might be obnoxious for large datasets.
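The tradeoff between learning rate and iteration count can be sketched with a toy model in which every boosting iteration removes a fixed fraction `learning_rate` of the remaining error. This is an assumption made purely for illustration (real boosted trees do not behave this cleanly), and `iterations_to_reach` is a hypothetical helper name.

```python
def iterations_to_reach(learning_rate, target_residual):
    """Count shrunken updates until the remaining error fraction drops to the target.

    Toy model: each iteration multiplies the remaining error by (1 - learning_rate).
    """
    residual, n = 1.0, 0
    while residual > target_residual:
        residual *= (1.0 - learning_rate)
        n += 1
    return n

# Same target error, two different learning rates
fast = iterations_to_reach(0.1, 0.6)    # 5 iterations: 0.9 ** 5 is about 0.59
slow = iterations_to_reach(0.001, 0.6)  # about 100 times as many iterations
print(fast, slow)
```

Under this simplification, dividing the learning rate by 100 multiplies the required iterations by about 100, which is why a very small learning rate gets expensive on large datasets.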

#### num_iterations 
* Description: Number of boosting iterations. 
* Range: [1, ∞[
* Major Impact: Model Performance, Training Time 
* Minor Impact: RAM Usage 
* Strategy:
  * Larger is usually better until the model starts to overfit the training data, so keep an eye on overfitting.
  * Typical: 110% of the mean number of iterations found by cross-validation, or a very large number of iterations if using early stopping.
  * Combine with early stopping to stop boosting automatically.

#### early_stopping_round 
* Description: Maximum number of iterations without improvement.
* Range: [0, ∞[
* Major Impact: Model Performance, Number of Iterations, Training Time 
* Strategy:
  * Larger is usually better. Typically around 50.
  * Setting it too large risks overfitting, because a lucky improvement on the validation set can keep training from stopping. Scale this parameter appropriately with the learning rate (usually: linearly).
  * Tip: make sure you supply a validation dataset to monitor; otherwise this parameter has no effect.
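The stopping rule described above can be sketched in a few lines of plain Python. This is an illustration of the idea, not LightGBM's implementation: `early_stop`, `patience`, and the list of validation losses are all hypothetical.

```python
def early_stop(valid_losses, patience):
    """Return (best_iteration, last_iteration_run), both 0-based.

    Training stops once `patience` iterations pass without a new best validation loss.
    """
    best_loss, best_iter, last_iter = float("inf"), -1, -1
    for last_iter, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, last_iter
        elif last_iter - best_iter >= patience:
            break
    return best_iter, last_iter

# Made-up validation losses: a plateau after iteration 2, then a late improvement at 5
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59, 0.60, 0.61, 0.62, 0.63]

impatient = early_stop(losses, patience=2)  # stops on the plateau
patient = early_stop(losses, patience=3)    # survives the plateau and finds iteration 5
print(impatient, patient)
```

With too little patience the run stops on the plateau and misses the later improvement, which is the sense in which larger values are usually better.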
  
#### num_leaves (Use this in favor of maximum depth when using LightGBM) 
* Description: Maximum number of leaves for each trained tree. Restricting the number of leaves acts as regularization by preventing very deep trees.
* Range: [1, ∞[
* Major Impact: Maximum Depth, Model Performance, Number of Iterations, Training Time, RAM Usage
* Minor Impact: None
* Strategy:
  * Larger is usually better, but the model overfits faster.
  * Typical: 255, usually one of {15, 31, 63, 127, 255, 511, 1023, 2047, 4095}.
  * This is the most sensitive hyperparameter for gradient boosting: tune it with the maximum depth.
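The typical values listed above are all of the form 2^d - 1, which ties the leaf count to tree depth: a binary tree of depth d has at most 2^d leaves. A minimal sketch (the helper name `max_leaves` is hypothetical):

```python
def max_leaves(depth):
    """Upper bound on the number of leaves of a binary tree with the given depth."""
    return 2 ** depth

# One less than the bound for depths 4 through 12
typical_num_leaves = [max_leaves(d) - 1 for d in range(4, 13)]
print(typical_num_leaves)
# [15, 31, 63, 127, 255, 511, 1023, 2047, 4095]
```

For example, `num_leaves = 255` is just under the 256 leaves a depth-8 tree could hold, so a depth limit of 8 would not cut such a tree short.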

#### bagging_fraction 
* Description: Fraction of rows randomly sampled each time bagging is performed (how often is set by the bagging frequency). This turns ordinary gradient descent into stochastic gradient descent, which may not always be better.
* Range: ]0, 1] 
* Major Impact: Early Stopping, Model Performance, Number of Iterations 
* Minor Impact: Training Time 
* Strategy:
  * Try [0.70, 0.85, 1.0].
  * In addition, this is the second most sensitive hyperparameter for gradient boosting: tune it with the column sampling.
		
#### feature_fraction
* Description: Each tree trained at each iteration sees only a random subset of the features; this parameter sets the fraction.
* Range: ]0, 1]
* Major Impact: Early Stopping, Model Performance, Number of Iterations
* Minor Impact: Training Time
* Strategy:
  * Smaller is usually better. Typically around 0.70.
  * This is the second most sensitive hyperparameter for gradient boosting: tune it with the row sampling.
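Row sampling (bagging_fraction) and column sampling (feature_fraction) can be pictured as picking a random sub-block of the training matrix before each tree is built. The NumPy sketch below only illustrates that idea and is not how LightGBM implements it internally; `sample_rows_and_columns` and the toy matrix are hypothetical.

```python
import numpy as np

def sample_rows_and_columns(X, bagging_fraction, feature_fraction, rng):
    """Return a random sub-block of X plus the chosen row and column indices."""
    n_rows, n_cols = X.shape
    n_keep_rows = max(1, round(bagging_fraction * n_rows))
    n_keep_cols = max(1, round(feature_fraction * n_cols))
    # Sample without replacement so no row or column is picked twice
    rows = np.sort(rng.choice(n_rows, size=n_keep_rows, replace=False))
    cols = np.sort(rng.choice(n_cols, size=n_keep_cols, replace=False))
    return X[np.ix_(rows, cols)], rows, cols

rng = np.random.default_rng(369)
X = np.arange(1000).reshape(100, 10)  # toy matrix: 100 rows, 10 features

subset, rows, cols = sample_rows_and_columns(X, 0.85, 0.70, rng)
print(subset.shape)  # (85, 7)
```

Drawing a fresh sub-block for every tree means no single tree sees all of the data, which is where the regularizing effect comes from.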

#### Boosting
* Description: Boosting method. One of:
  * "gbdt" (Gradient Boosted Decision Trees) which is the default boosting method using Decision Trees and Stochastic Gradient Descent;
  * "dart" (Dropout Additive Regression Trees) is similar to Dropout in neural networks, except you are applying this idea to trees (dropping trees randomly).
  * "goss" (Gradient-based One-Side Sampling) which is a method using subsampling to converge faster/better using Stochastic Gradient Descent. From the paper: With GOSS, we exclude a significant proportion of data instances with small gradients, and only use the rest to estimate the information gain. We prove that, since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size.
* Major Impact: Model Performance, Number of Iterations, Training Time
* Minor Impact: RAM Usage
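Pulling the suggestions above together, a starting configuration might look like the dictionary below. It is a sketch rather than a tuned setup: the values are the "typical" ones quoted in this section, while `num_iterations = 10000` and `bagging_freq = 1` are assumptions added here (LightGBM only performs bagging when `bagging_freq` is greater than zero).

```python
# Starting point assembled from the typical values discussed above
lgb_params = {
    "boosting": "gbdt",
    "learning_rate": 0.05,        # 0.05 or lower for the final training run
    "num_iterations": 10000,      # assumption: large cap, relying on early stopping
    "early_stopping_round": 50,   # only has an effect with a validation set
    "num_leaves": 255,
    "bagging_fraction": 0.85,     # row sampling
    "bagging_freq": 1,            # assumption: resample rows at every iteration
    "feature_fraction": 0.70,     # column sampling
}

for name, value in lgb_params.items():
    print(f"{name}: {value}")
```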

#### metric
* If you want to focus on the mean, use 'mse'. If you want to focus on the median, use 'mae'. See this post for more details.
* huber: Less sensitive to outliers.
* fair: Supposedly a better way to measure mae.
* poisson: Effective for count data.
* quantile
* quantile_l2
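The mean-versus-median remark can be checked numerically: the constant prediction that minimizes squared error is the mean, and the one that minimizes absolute error is the median. The sketch below uses a brute-force grid search over made-up prices; nothing in it comes from the competition data.

```python
import numpy as np

# Toy prices with one large outlier (illustration only)
prices = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Try every constant prediction from 0 to 100 in steps of 0.01
candidates = np.linspace(0, 100, 10001)
errors = prices[None, :] - candidates[:, None]

mse = (errors ** 2).mean(axis=1)
mae = np.abs(errors).mean(axis=1)

best_for_mse = candidates[mse.argmin()]
best_for_mae = candidates[mae.argmin()]

print(best_for_mse, prices.mean())      # both close to 22
print(best_for_mae, np.median(prices))  # both close to 3
```

The outlier drags the squared-error optimum far above most of the prices, while the absolute-error optimum stays at the middle value.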

# Score Distribution