# Training Script
In this notebook, we create the training script whose hyperparameters will be tuned. This script is stored alone in a `scripts` directory both for ease of reference and because the Azure ML SDK limits the contents of this directory to at most 300 MB.

The notebook's cells are appended in turn to the training script, so you must run them _in order_ for the script to be written correctly. If you edit the cells, preserve the blank lines at the start and end of each one; they prevent the contents of consecutive cells from being improperly concatenated.
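
To see why those blank lines matter, here is a minimal sketch of the append behavior. It assumes, for illustration only, that each cell's text is written to the file verbatim; `write_cells` is a hypothetical helper, not part of IPython or of this notebook.

```python
import os
import tempfile


def write_cells(cells, path):
    """Write the first cell to path, then append the others (hypothetical helper)."""
    for index, cell in enumerate(cells):
        mode = "w" if index == 0 else "a"
        with open(path, mode) as fp:
            fp.write(cell)
    with open(path) as fp:
        return fp.read()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "script.py")
    # Without surrounding blank lines, the two statements run together.
    joined = write_cells(["import os", "import json"], path)
    # With blank lines at the start and end, each statement keeps its own line.
    separated = write_cells(["\nimport os\n", "\nimport json\n"], path)

print(repr(joined))
print(repr(separated))
```

Under that assumption, the first result is a single invalid line, while the second keeps the two statements apart.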

The script sections are
- [import libraries](#import),
- [define utility functions and classes](#utility),
- [define the script input parameters](#parameters),
- [load and prepare the training and tuning data](#data),
- [define the training pipeline](#pipeline),
- [train the model](#train),
- [score the tuning data](#score), and
- [compute the tuning data performance](#performance).

[The cell following the script](#run) runs that script using the training and tuning data created by [the first notebook](00_Data_Prep.ipynb).

Start by creating the `scripts` directory, if it does not already exist.

In [None]:
import os
os.makedirs("scripts", exist_ok=True)

## Load libraries <a id='import'></a>

In [None]:
%%writefile scripts/TrainClassifier.py

from __future__ import print_function
import os
import warnings
import argparse
import json
import pandas as pd
from itertools import groupby
import lightgbm as lgb
from sklearn.feature_extraction import text
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib
from azureml.core import Run
import azureml.core

warnings.filterwarnings(action='ignore', category=UserWarning, module='lightgbm')


## Define utility functions and classes <a id='utility'></a>

In [None]:
%%writefile --append scripts/TrainClassifier.py

def log_evaluation(logger, metric_index=0, period=1):
    """Create a callback that logs the evaluation results.

    Parameters
    ----------
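    logger : function
        A function called as ``logger(name, value)``, such as ``run.log``.
    metric_index : int, optional (default=0)
        The index in the evaluation result list of the metric to log.
    period : int, optional (default=1)
        The period at which to log the evaluation results.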

    Returns
    -------
    callback : function
        The callback that logs the evaluation results every ``period`` iteration(s).
    """
    def callback(env):
        """internal function"""
        if period > 0 and env.evaluation_result_list and (env.iteration + 1) % period == 0:
            value = env.evaluation_result_list[metric_index]
            logger(value[1], value[2])
    callback.order = 10
    return callback


def cumulative_gain_metric(groups, max_gain, score_at=1, metric_name=None):
    """Return a function that computes the normalized cumulative gain metric."""
    if metric_name is None:
        eval_name = "gain@" + str(score_at)
    else:
        eval_name = metric_name

    def cumulative_gain(y_true, y_pred, weight):
        """
        Compute the normalized cumulative gain. Returns the tuple:
            (eval_name, eval_result, is_bigger_better)
        This function assumes the data are sorted by groups.
        """
        gain = sum([sum([v
                         for _, _, v in sorted(g,
                                               key=lambda x: x[1],
                                               reverse=True)[:score_at]])
                    for _, g in groupby(zip(groups, y_pred, y_true),
                                        key=lambda x: x[0])])
        eval_result = gain / max_gain
        return (eval_name, eval_result, True)

    return cumulative_gain


## Define the input parameters <a id='parameters'></a>
One of the most important parameters is `estimators`, the number of estimators, which lets you trade off model gain, training time, and model size. The table below illustrates how run time, model size, and gain vary with the number of estimators. The default value is 100.

| Estimators | Run time (s) | Size (MB) | Gain@1 | Gain@2 | Gain@3 |
|------------|--------------|-----------|------------|------------|------------|
|        100 |           40 |  2 | 25.02% | 38.72% | 47.83% |
|       1000 |          177 |  4 | 46.79% | 60.80% | 69.11% |
|       2000 |          359 |  7 | 51.38% | 65.93% | 73.09% |
|       4000 |          628 | 12 | 53.39% | 67.40% | 74.74% |
|       8000 |          904 | 22 | 54.62% | 67.77% | 75.35% |
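
The Gain@k columns report the normalized cumulative gain computed by `cumulative_gain_metric`: within each group, the k highest-scored rows are kept, their labels are summed over all groups, and the total is divided by the maximum possible gain. Here is a toy sketch of that computation; the groups, labels, and scores are made up, and `gain_at` is a hypothetical name used only here.

```python
from itertools import groupby


def gain_at(groups, y_true, y_pred, k, max_gain):
    """Normalized cumulative gain at k; rows must already be sorted by group."""
    rows = zip(groups, y_pred, y_true)
    gain = 0
    for _, group_rows in groupby(rows, key=lambda row: row[0]):
        # Keep the k rows with the highest predicted scores in this group.
        top_k = sorted(group_rows, key=lambda row: row[1], reverse=True)[:k]
        gain += sum(label for _, _, label in top_k)
    return gain / max_gain


groups = [1, 1, 1, 2, 2]            # made-up group ids, sorted
y_true = [0, 1, 0, 1, 0]            # one true match per group
y_pred = [0.2, 0.9, 0.1, 0.3, 0.8]  # made-up scores
max_gain = sum(y_true)

gain_at_1 = gain_at(groups, y_true, y_pred, k=1, max_gain=max_gain)
gain_at_2 = gain_at(groups, y_true, y_pred, k=2, max_gain=max_gain)
print(gain_at_1, gain_at_2)  # 0.5 1.0
```

In group 2 the true match is only ranked second, so it counts toward Gain@2 but not Gain@1.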

Other parameters that may be useful to tune include the following:
* `ngrams`: the maximum n-gram size for features, an integer of at least 1 (default 1),
* `min_child_samples`: the minimum number of samples in a leaf, an integer of at least 1 (default 20),
* `match`: the maximum number of training examples per duplicate question, an integer of at least 2 (default 20), and
* `unweighted`: whether to skip the sample weights that compensate for unbalanced data, either `Yes` or `No` (default `No`, which uses the weights).
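
When weighting is on, the script gives each label the weight `n_samples / (n_labels * label_count)`, so that every label contributes the same total weight. The following is a minimal pure-Python sketch with made-up labels; the script itself computes the same quantity with pandas.

```python
from collections import Counter

labels = [0, 0, 0, 1]  # made-up, unbalanced labels
label_counts = Counter(labels)
n_samples = len(labels)
n_labels = len(label_counts)

# Rare labels get proportionally larger weights.
weight = {label: n_samples / (n_labels * count)
          for label, count in label_counts.items()}

# One weight per training row, looked up by that row's label.
sample_weight = [weight[label] for label in labels]

# The total weight carried by each label.
total_by_label = {label: weight[label] * count
                  for label, count in label_counts.items()}
print(weight)
print(total_by_label)
```

Here label 0 appears three times and label 1 once, yet each label ends up carrying a total weight of about 2.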

In [None]:
%%writefile --append scripts/TrainClassifier.py

if __name__ == '__main__':
    
    print('azureml.core.VERSION={}'.format(azureml.core.VERSION))
    
    parser = argparse.ArgumentParser(description='Fit and evaluate a model'
                                     ' based on train and tune datasets.')
    parser.add_argument('--data-folder', help='the path to the data',
                        dest='data_folder', default='.')
    parser.add_argument('--inputs', help='the inputs directory',
                        default='data')
    parser.add_argument('--data', help='the training dataset name',
                        default='balanced_pairs_train.tsv')
    parser.add_argument('--tune', help='the tune dataset name',
                        default='balanced_pairs_tune.tsv')
    parser.add_argument('--estimators',
                        help='the number of learner estimators',
                        type=int, default=100)
    parser.add_argument('--min_child_samples',
                        help='the minimum number of samples in a child(leaf)',
                        type=int, default=20)
    parser.add_argument('--ngrams',
                        help='the maximum size of word ngrams',
                        type=int, default=1)
    parser.add_argument('--match',
                        help='the maximum number of duplicate matches',
                        type=int, default=20)
    parser.add_argument('--unweighted',
                        help='whether or not to use instance weights',
                        default='No')
    parser.add_argument('--period',
                        help='the period for performance reporting',
                        type=int, default=100)
    parser.add_argument("--rank", help="the maximum rank of a correct match",
                        type=int, default=3)
    parser.add_argument('--outputs', help='the outputs directory',
                        default='outputs')
    parser.add_argument('--save', help='the model file base name', default='None')
    parser.add_argument('--verbose',
                        help='the verbosity of the estimator',
                        type=int, default=-1)
    parser.add_argument('--input-steps-data', dest="input_steps_data",
                        help='to share data between different steps in a pipeline',
                        default='data')
    parser.add_argument('--hyperparameters',
                        help='hyperparameter config file base name', default='None')
    args = parser.parse_args()
    

## Load and prepare the training data <a id='data'></a>
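Besides reading the data, the cell below overrides the flag values with any matching entries from an optional JSON config file, casting each entry to the type of the corresponding flag value. Here is a standalone sketch of that merge; the config string is made up for illustration.

```python
import json

# Values as they would come from the command-line flags.
hyperparameters = {
    "estimators": 100,
    "ngrams": 1,
    "min_child_samples": 20,
    "match": 20,
    "unweighted": "No",
}

# A made-up config in which numbers arrive as a string and as a float.
config = json.loads('{"estimators": "1000", "ngrams": 2.0, "unweighted": "Yes"}')

# Cast each known key to the type of the existing value, then merge.
for key, current in hyperparameters.items():
    if key in config:
        config[key] = type(current)(config[key])
hyperparameters.update(config)
print(hyperparameters)
```

Keys absent from the config keep their flag values. Note that a string such as `"2.0"` would fail the `int` cast, so integer settings should be stored as integers or integer-valued strings.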

In [None]:
%%writefile --append scripts/TrainClassifier.py

    # Get a run logger.
    run = Run.get_context()

    # What to name the metric logged
    score_at = args.rank
    metric_name = "gain@" + str(score_at)

    print('Prepare the training data.')
    
    # Paths to the input data.
    data_folder_path = args.data_folder
    inputs_path = os.path.join(data_folder_path, args.inputs)
    data_path = os.path.join(inputs_path, args.data)
    tune_path = os.path.join(inputs_path, args.tune)

    # Paths for the output data.
    outputs_path = args.outputs
    model_path = os.path.join(outputs_path, '{}.pkl'.format(args.save))
    
    # Paths for steps data
    hyperparameters_path = os.path.join(args.input_steps_data, args.hyperparameters)

    # Create the outputs folder.
    os.makedirs(outputs_path, exist_ok=True)
    
    # Create a dict of hyperparameters from the input flags.
    hyperparameters = {
        "estimators": args.estimators,
        "ngrams": args.ngrams,
        "min_child_samples": args.min_child_samples,
        "match": args.match,
        "unweighted": args.unweighted
    }
    
    # Update the hyperparameters with values from a config file.
    if args.hyperparameters != "None" and os.path.isfile(hyperparameters_path):
        with open(hyperparameters_path) as fp:
            hyperparameters_config = json.load(fp)
        for key in hyperparameters.keys():
            if key in hyperparameters_config:
                hyperparameters_config[key] = type(hyperparameters[key])(hyperparameters_config[key])
        hyperparameters.update(hyperparameters_config)

    # Define the input data columns.
    feature_columns = ['Text_x', 'Text_y']
    label_column = 'Label'
    group_column = 'Id_x'
    dupes_answerid_column = 'AnswerId_x'
    questions_answerid_column = 'AnswerId_y'
    name_columns = ['Id_x', 'Id_y']
    weight_column = 'Weight'

    # Load the training data.
    print('Reading {}'.format(data_path))
    train = pd.read_csv(data_path, sep='\t', encoding='latin1')

    # Limit the number of training duplicate matches.
    train = train[train.n < hyperparameters["match"]]
    
    # Sort the data by the group column (needed for computing the gain)
    train.sort_values(group_column, inplace=True)

    # Report on the dataset.
    print('train: {:,} rows with {:.2%} matches'
          .format(train.shape[0], train[label_column].mean()))
    
    # Load the tuning data.
    print('Reading {}'.format(tune_path))
    tune = pd.read_csv(tune_path, sep='\t', encoding='latin1')
    
    # Sort the data by the group column (needed for computing the gain)
    tune.sort_values(group_column, inplace=True)

    # Report on the dataset.
    print('tune: {:,} rows with {:.2%} matches'
          .format(tune.shape[0], tune[label_column].mean()))
    
    # Compute instance weights.
    if hyperparameters["unweighted"] == 'Yes':
        print('No sample weights.')
        labels = train[label_column].unique()
        weight = pd.Series([1.0] * labels.shape[0], labels)
    else:
        print('Using sample weights.')
        label_counts = train[label_column].value_counts()
        weight = train.shape[0] / (label_counts.shape[0] * label_counts)
        print(weight)
    train[weight_column] = train[label_column].apply(lambda x: weight[x])

    # Select and format the training data.
    train_X = train[feature_columns]
    train_y = train[label_column]
    train_group = train[group_column]
    train_sample_weight = train[weight_column]
    train_names = train[name_columns]
    tune_X = tune[feature_columns]
    tune_y = tune[label_column]
    tune_group = tune[group_column]
    

## Define the featurization and estimator <a id='pipeline'></a>

In [None]:
%%writefile --append scripts/TrainClassifier.py

    print('Define the model pipeline.')

    # Select the training hyperparameters.
    n_estimators = hyperparameters["estimators"]
    min_child_samples = hyperparameters["min_child_samples"]
    if hyperparameters["ngrams"] > 0:
        ngram_range = (1, hyperparameters["ngrams"])
    else:
        ngram_range = None
    period = args.period

    # Verify that the hyperparameter settings are valid.
    if n_estimators <= 0:
        raise ValueError('n_estimators must be > 0')
    if min_child_samples <= 0:
        raise ValueError('min_child_samples must be > 0')
    if (ngram_range is None
        or type(ngram_range) is not tuple
        or len(ngram_range) != 2
        or ngram_range[0] < 1
        or ngram_range[0] > ngram_range[1]):
        raise ValueError('ngram_range must be a tuple with two integers (a, b) where a > 0 and a <= b')

    # Define the featurization pipeline
    featurization = [
        (column, text.TfidfVectorizer(ngram_range=ngram_range), column)
        for column in feature_columns]
    features = ColumnTransformer(featurization)

    # Define the estimator.
    estimator = lgb.LGBMClassifier(n_estimators=n_estimators,
                                   min_child_samples=min_child_samples,
                                   verbose=args.verbose)

    # Put them together into the model pipeline.
    model = Pipeline([
        ('features', features),
        ('estimator', estimator)
    ])
    
    # Report the featurization.
    print('Estimators={:,}'.format(n_estimators))
    print('Ngram range={}'.format(ngram_range))
    print('Min child samples={}'.format(min_child_samples))
    

## Train the model <a id='train'></a>
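During fitting, the `log_evaluation` callback reports the chosen metric every `period` iterations. The sketch below replays that logic with a stand-in for LightGBM's callback environment. `FakeEnv` and the metric values are made up for illustration, and the result tuples are assumed to have the form `(dataset name, metric name, value, is_higher_better)`.

```python
from collections import namedtuple

# Hypothetical stand-in for the environment LightGBM passes to callbacks.
FakeEnv = namedtuple("FakeEnv", ["iteration", "evaluation_result_list"])


def log_evaluation(logger, metric_index=0, period=1):
    """Same logic as the script's callback factory."""
    def callback(env):
        if (period > 0 and env.evaluation_result_list
                and (env.iteration + 1) % period == 0):
            value = env.evaluation_result_list[metric_index]
            logger(value[1], value[2])
    callback.order = 10
    return callback


logged = []
callback = log_evaluation(lambda name, value: logged.append((name, value)),
                          metric_index=1, period=2)

gains = [0.1, 0.2, 0.3, 0.4, 0.5]  # made-up metric values
for iteration, gain in enumerate(gains):
    results = [("tune", "binary_logloss", 0.5, False),
               ("tune", "gain@3", gain, True)]
    callback(FakeEnv(iteration, results))

print(logged)  # [('gain@3', 0.2), ('gain@3', 0.4)]
```

With five iterations and a period of 2, the last iteration is never logged by the callback, which is why the script logs the final metric value separately when `n_estimators` is not a multiple of `period`.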

In [None]:
%%writefile --append scripts/TrainClassifier.py

    print('Fitting the model.')

    # Featurize the train and tune dataset.  It's important to only fit the
    # featurizer on the training data, so that the tuning data is treated the
    # same way the testing data will be later on.
    train_X_features = model.named_steps["features"].fit_transform(train_X)
    tune_X_features = model.named_steps["features"].transform(tune_X)

    # Fit the model.
    eval_callback = log_evaluation(run.log, metric_index=1, period=period)
    max_gain = tune_y.sum()
    eval_metric = cumulative_gain_metric(
        tune_group, max_gain, score_at=score_at, metric_name=metric_name)
    model.named_steps["estimator"].fit(
        train_X_features, train_y, sample_weight=train_sample_weight,
        feature_name=model.named_steps["features"].get_feature_names(),
        eval_set=[(tune_X_features, tune_y)], eval_names=["tune"],
        eval_metric=eval_metric,
        callbacks=[eval_callback], verbose=False
    )
    if period > 0 and n_estimators % period != 0:
        metric_value = (model.named_steps["estimator"]
                        .evals_result_["tune"][metric_name][-1])
        run.log(metric_name, metric_value)

    # Write the model to file.
    if args.save != 'None':
        print('Saving the model to {}'.format(model_path))
        joblib.dump(model, model_path)
        print('{}: {:.2f} MB'
              .format(model_path, os.path.getsize(model_path)/(2**20)))

## Run the script to see that it works <a id='run'></a>
Set the number of estimators, which determines the effort expended to train the classifier.

In [None]:
estimators = 1000

Run the classifier script. This should take about 10 minutes.

In [None]:
%run -t scripts/TrainClassifier.py --estimators $estimators --match 5 --ngrams 2 --min_child_samples 10 --save model

In [the next notebook](02_Testing_Script.ipynb), we create and run the testing script.