In [1]:
# default_exp model
# %load_ext lab_black

# nb_black if running in jupyter
%load_ext nb_black

%load_ext autoreload
# automatically reload Python modules when their source changes
%autoreload 2


In [2]:
# hide
from nbdev.showdoc import *


# Model

> In this notebook we create and test our machine learning model. The output should be a Python class, but we start by creating general Python functions that we use for our problem.

***input***: toy dataset from data-notebook

***output***: python module containing ML model class or a set of general Python functions

***description:***

In this notebook we hypothesize, explain and explore machine learning models to solve our problem.

This notebook contains an example ML model for classifying the library classification dataset with Random Forest Classifier or some other sklearn-classifier.

*Template notes:*
*Adjust the running number, name, header and top cell `#default_exp module_name` of the notebooks accordingly. Remember to add `# export` to top of all cells containing functions or classes that you have defined and want to use outside this notebook.*


## Import relevant modules

In [3]:
# export
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    train_test_split,
    StratifiedKFold,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing


In [4]:
# FIX THIS!!!!

# Use 'nbdev_build_lib' shell command to update library
# from ml_project_template.plot import plot_trellis, plot_histogram


## Define notebook parameters

Remember, only simple assignments here!

 - **toy_data_file** = File location of a small toy dataset file created when the 00_data notebook is executed, e.g. *data/preprocessed_data/dataset_toy_all_classes.csv*
 - **all_classes_data_file** = File location of the full dataset for training the selected algorithm, e.g. *data/preprocessed_data/dataset_clean_all_classes.csv*
 - **input_data_file** = File location of the actual input data for which we want to predict classes, e.g. *data/preprocessed_data/input_file.csv*

In [5]:
# Parameters

# this cell is tagged with 'parameters'
toy_data_file = "data/preprocessed_data/dataset_toy_all_classes.csv"
all_classes_data_file = "data/preprocessed_data/dataset_clean_all_classes.csv"
input_data_file = "data/preprocessed_data/input_file.csv"
seed = 0


Make immediate derivations from the parameters:

In [6]:
np.random.seed(seed)


## Import toy data for testing

In [7]:
toy_df = pd.read_csv(toy_data_file, index_col=0)
toy_df.head(30)

Unnamed: 0,record_id,084,092,093,094,095,650
4601,420908822165,78.8911,78.8911,,78.8911,788.33,rock
98805,420908631171,99.1,99.1,99.1,99.1,990.1,"suomalaiset,taidemaalarit"
22333,420907981954,78.462,,,,784.142,perinnemusiikki
2969,420908153948,69.3,69.3,,69.3,675.8,"markkinointitutkimus,markkinointi,tietojärjest..."
59116,420908158377,68.2,,,68.2,691.1,"ruokaohjeet,ruoanvalmistus,pula-ajat"
1545,420908925952,59.31,59.31,59.31,59.31,696.1,"hudvård,kvinnor,naturliga ämnen,näring,massage..."
71997,420908390431,15.9,15.9,15.9,15.9,192.0,"kummitukset,yliluonnolliset olennot"
31844,420908970852,25.5,,,,258.0,"diakonia,järjestöt,historia,uskonnolliset järj..."
35847,420908722756,62.511,62.511,,62.511,624.8,"museoajoneuvot,järjestöt,entistäminen,autot,hi..."
52747,420908582677,75.72,75.72,75.72,75.72,756.2,"fotografering,digitalteknik,digitalkameror,bil..."



# Selecting the model

Some useful links for choosing the estimator
 - [Classifier comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)
 - [Choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
 
We tested the following models both with toy dataset and actual dataset:
 - [Random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=randomforest#sklearn.ensemble.RandomForestClassifier)
 - [Support Vector(SVC)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
 - [ExtraTrees](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html)
 - [KNeighbors classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html?highlight=kneighbors#sklearn.neighbors.KNeighborsClassifier)
 - [Decision tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decisiontree#sklearn.tree.DecisionTreeClassifier)
 - [SGD classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html?highlight=sgd%20classifier#sklearn.linear_model.SGDClassifier)
 - [linear SVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
 - [Naive Bayes](https://scikit-learn.org/stable/modules/classes.html?highlight=naive%20bayes#module-sklearn.naive_bayes) 


The best results (see the test output later in this notebook) were obtained with the Random Forest Classifier.
## The math behind Random Forest Classifier:
Random Forest constructs a multitude of decision trees and classifies by selecting the class predicted by the most trees. With $B$ trees $h_1, \dots, h_B$, the predicted class for input $x$ is

$$
\hat{y}(x) = \underset{c}{\operatorname{arg\,max}} \sum_{b=1}^{B} \mathbb{1}[h_b(x) = c]
$$
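
The voting idea described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `majority_vote` and the example votes are hypothetical, and scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees for one sample."""
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for a single library item
# (labels are 095 classes scaled to integers, as done later in this notebook)
votes = [788330000, 990100000, 788330000, 788330000, 675800000]
predicted = majority_vote(votes)
print(predicted)
```

Three of the five hypothetical trees agree, so their label wins the vote.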


The worst results were obtained with the Gaussian Naive Bayes Classifier.
#### Naive Bayes:

Naive Bayes is a classifier that assumes the features are independent of each other. It calculates the posterior probability P(c|x) from the class prior probability P(c), the predictor prior probability P(x) and the likelihood P(x|c):

$$
P(c|x)= {\frac{P(x|c)P(c)}{P(x)}}
$$
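
As a small worked example of the formula above, the sketch below applies Bayes' rule to two classes. The priors and likelihoods are made-up numbers chosen for illustration, and `posterior` is a hypothetical helper, not part of this project.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)."""
    # P(x) is obtained by summing P(x|c) * P(c) over all classes
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}


# Hypothetical values: class "990.1" is more common a priori,
# but the observed features are far more likely under class "788.33"
priors = {"788.33": 0.2, "990.1": 0.8}
likelihoods = {"788.33": 0.9, "990.1": 0.1}

post = posterior(priors, likelihoods)
for label, probability in post.items():
    print(f"P({label} | x) = {probability:.3f}")
```

Even though class "990.1" has the larger prior, the strong likelihood makes "788.33" the more probable class after seeing the features.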


## General python functions before constructing the model class

First we create some general functions for splitting the data into training and validation sets, preprocessing the data, fitting the selected model, predicting, and printing the results. We use `f1_score` as the loss (quality) metric and a sklearn `Pipeline` (`pipe`) for predicting.
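
To see why a macro-averaged F1 score can be more informative than plain accuracy when there are many classes, here is a minimal pure-Python sketch. The helpers `macro_f1` and `accuracy` and the toy labels are assumptions made for illustration; the notebook itself uses sklearn's `f1_score(..., average="macro")`, which likewise takes the unweighted mean of the per-class F1 scores.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical case: the model always predicts the most common class
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 1, 1]

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Accuracy rewards the model for the frequent class, while macro F1 drops sharply because the two rare classes are never predicted correctly.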

In [8]:
# export

# FIX THIS IMPORT!!!!
# These functions won't work if we don't define these also in this code block
# from lib_classification.plot import plot_trellis, plot_histogram


seed = 0


"""
Labels can't be of type float for classification. Thus we multiply floats
so that there are no decimals. When printing the results we do the opposite operation.

max_decimals tells the number of possible decimals in library classification.
Set max_decimals to 0 if you want to omit decimals altogether
"""
max_decimals = 6
multiply_factor = 10 ** max_decimals

"""
Sklearn models can't handle NaN values, replace them with a suitable value, default = 0
"""
replace_nan = 0


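def keywords_to_features(df_to_parse, keywords=None):
    """
    Parse dataframe keywords and create features of them
    """
    # print(f"KEYWORDS: {len(keywords)}")

    # A mutable default argument ([]) would be shared between calls,
    # so create a fresh list when no keywords are given
    if keywords is None:
        keywords = []

    # Create keyword list
    if len(keywords) == 0: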
        for i in range(len(df_to_parse)):
            item_keywords_str = (str)(df_to_parse.iloc[(i), 6])
            item_keywords_lst = item_keywords_str.split(",")

            for word in item_keywords_lst:
                word = word.strip().lower()
                if word not in keywords:
                    keywords.append(word)

    # Add keyword columns with keyword as a title (value will be 0 or 1 depending on whether the keyword belongs to the volume)
    # NOTE: "A Pandas Series is like a column in a table" (https://www.w3schools.com/python/pandas/pandas_series.asp)
    # Assigning an empty Series creates the column filled with NaN; the NaN values become 0 below
    for keyword in keywords:
        df_to_parse[keyword] = pd.Series([], dtype="int64")
    df_to_parse = df_to_parse.reset_index(drop=True)

    # Fill features with keywords attached to item with value "1"
    for i in range(len(df_to_parse)):
        item_keywords_str = (str)(df_to_parse.iloc[(i), 6])
        item_keywords_lst = item_keywords_str.split(",")

        for word in item_keywords_lst:
            word = word.strip().lower()
            if word in keywords:
                df_to_parse.at[i, word] = 1

    # Drop column with comma-separated keywords and fill NaN with 0
    df_to_parse = df_to_parse.drop(["650"], axis=1)
    df_to_parse = df_to_parse.fillna(0)

    # print(f"DF TO PARSE SHAPE: {df_to_parse.shape}")
    # print(f"KEYWORDS: {len(keywords)}")
    return df_to_parse, keywords


def split_X_y(df):
    """
    Split dataframe into features and labels
    """
    # X = df.iloc[:, :-1]  # .to_numpy()
    # y = df.iloc[:, -1]  # .to_numpy()

    # for col in df.columns:
    #    print(f"*{col}*")

    X = df.copy().reset_index(drop=True)
    y = X.pop("095").reset_index(drop=True)

    return X, y


def modify_lib_data(X, y=None):
    """
    Do the needed modification for library data
    """

    # Sklearn GaussianNB doesn't handle NaN-values in input.
    # We fill the NaN values with 0.
    X = X.fillna(replace_nan)

    # Change datatypes for features and labels
    X = X.astype(
        {
            "record_id": "int",
            "084": "category",
            "092": "category",
            "093": "category",
            "094": "category",
        }
    )

    # for some reason y is of type Series
    # We need dataframe
    # y = y.to_frame()

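    # Convert labels from float to big integers,
    # Note: Type 'Category' won't work with categorization models (at least not with GaussianNB)
    # Round before casting: float multiplication can land just below the intended
    # integer, and a plain cast to int would then truncate it to the wrong label
    if y is not None:
        y = y.multiply(multiply_factor).round().astype("int64")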

    return X, y


def reverse_mod_lib_data(X, y_pred, y=None):
    """
    Reverse the library data back to original format
    """

    # Sklearn GaussianNB doesn't handle NaN-values in input.
    # We filled the NaN values with 0 and now change them back.
    # Note: this also turns genuine zeros (e.g. keyword columns set to 0) into NaN.
    X = X.replace(0, np.nan)

    # Change datatypes for the class features back from category to float
    X = X.astype(
        {
            "record_id": "int",
            "084": "float",
            "092": "float",
            "093": "float",
            "094": "float",
        }
    )

    # Convert labels from scaled integers back to floats
    if y is not None:
        y = y.multiply(1 / multiply_factor)
    y_pred = y_pred.multiply(1 / multiply_factor)

    return X, y, y_pred


def get_train_test_data(X, y, seed, stratify=True, test_size=0.2, shuffle=True):
    """
    Split the data into training and test sets
    """

    # Stratify won't work with all datasets, it requires at least 2 rows for each label value
    if stratify:
        return train_test_split(
            X, y, test_size=test_size, shuffle=shuffle, stratify=y, random_state=seed
        )

    else:
        return train_test_split(
            X, y, test_size=test_size, shuffle=shuffle, random_state=seed
        )


def fit(model, scaler, X_train, X_test, y_train, y_test):
    """
    Fit the scaler and the model as one pipeline on the training data
    """

    pipe = Pipeline([("scaler", scaler), ("model", model)])
    pipe.fit(X_train, y_train)

    return pipe


def predict(pipe, X):
    """
    Use the model (pipe object) to predict labels
    """

    y_pred = pipe.predict(X)
    # pred_probabilities = pipe.predict_proba(X)

    # Print probabilities for first data point only
    # print(
    #    f"\nPredicted probability of each label for first data point:\n{pred_probabilities[0]}"
    # )

    return y_pred


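def get_train_loss(pipe, X_train, y_train):
    """
    Return the training score of the fitted model

    Note: despite the function name, pipe.score returns mean accuracy
    for classifiers, so higher is better.
    """

    return pipe.score(X_train, y_train)


def get_test_loss(pipe, X_test, y_test):
    """
    Return the test score of the fitted model (mean accuracy, higher is better)
    """
    return pipe.score(X_test, y_test)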


def loss(pipe, X, y):
    """
    Return loss (model quality metric)

    Note that this may be a different metric than the one that the model optimizer is using (scoring method).
    For example for LogisticRegression the scoring method is mean accuracy,
    but we might want to track for example f1-score for loss because it is better balanced.
    """

    # return mean_squared_error(predict(pipe, X), y)
    return f1_score(y, predict(pipe, X), average="macro")


def print_loss(pipe, X, y, X_train, y_train, X_test, y_test, model_name, dataset_name):
    """
    Print training and validation errors
    """
    print("\n******************************************************************")
    print(f"  Results for {model_name} with {dataset_name}:")
    print("******************************************************************")

    print(f"Training error: {get_train_loss(pipe, X_train, y_train)}")
    print(f"Validation error: {get_test_loss(pipe, X_test, y_test)}")
    print(f"Loss: {loss(pipe, X, y)}")

    # train_test_df = X_train.iloc[:,1:].copy()
    # train_test_df["prediction_correct"] = (predict(pipe, X_train) - y_train.values == 0)
    # display(train_test_df.head)
    # _ = plot_trellis(train_test_df, legend_title="prediction", true_label="correct")


def print_details(X, y, y_pred, label_name="label", pred_column_name="pred", n_rows=10):
    """
    Print the results for observation
    """

    y_compare = pd.concat([y, y_pred], axis=1)

    print(f"\nOriginal and predicted labels (first {n_rows} rows):")
    display(y_compare.head(n_rows))

    X_compare = pd.concat([X, y_compare], axis=1)
    false_preds = X_compare[X_compare[label_name] != X_compare[pred_column_name]]
    n_false_preds = len(false_preds)
    n_right_preds = len(X_compare) - n_false_preds
    print(f"Number of false predictions: {n_false_preds}")
    print(f"Number of right predictions: {n_right_preds}")
    print("\n\nAll false predictions in dataset:")
    display(false_preds)

    # print(
    #    "\nHow different classifications correlate with each other on true and false predictions:"
    # )
    X_compare["prediction_correct"] = (
        X_compare[label_name] - X_compare[pred_column_name] == 0
    )
    # display(X_compare.head())
    # FIX THIS IMPORT!!!
    # _ = plot_trellis(X_compare.iloc[:,1:], legend_title="prediction", true_label="correct")


def test_model(
    model, scaler, df, model_name, dataset_name, test_size=0.2, verbose=True
):
    """
    Test the model with the help of functions above
    """

    label_name = "095"
    pred_column_name = "095_PRED"

    # Create features and labels
    X, y = split_X_y(df)

    # Modify library data as needed
    X, y = modify_lib_data(X, y)

    # Split data into training and test sets
    X_train, X_test, y_train, y_test = get_train_test_data(
        X, y, seed, stratify=False, test_size=test_size
    )

    # Fit and predict
    pipe = fit(model, scaler, X_train, X_test, y_train, y_test)
    y_pred = predict(pipe, X)
    score = loss(pipe, X, y)

    # convert predictions from numpy to dataframe and set an easy column name
    y_pred = pd.DataFrame(y_pred)
    y_pred = y_pred.rename(columns={y_pred.columns[0]: pred_column_name})

    if verbose:
        print_loss(
            pipe, X, y, X_train, y_train, X_test, y_test, model_name, dataset_name
        )

    # Modify the library data back to original format
    X, y, y_pred = reverse_mod_lib_data(X, y_pred, y)

    # Print the results with desired column names
    if verbose:
        print_details(X, y, y_pred, label_name, pred_column_name, 30)

    # print(score)
    return pipe, score


def predict_hkl_class(pipe, items_to_classify, info="", y=None):
    """
    Predict the actual HKL class
    """
    pred_column_name = "095_PRED"

    # Modify library data as needed
    X, y = modify_lib_data(items_to_classify, y)

    y_pred = predict(pipe, X)

    # convert predictions from numpy to dataframe and set an easy column name
    y_pred = pd.DataFrame(y_pred)
    y_pred = y_pred.rename(columns={y_pred.columns[0]: pred_column_name})

    # Modify the library data back to original format
    X, y, y_pred = reverse_mod_lib_data(X, y_pred, y)

    result = pd.concat([X, y_pred], axis=1)
    result["Info"] = info

    return result
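

The label handling in the functions above relies on scaling float classes to integers and back. The sketch below illustrates that round trip and why rounding before the integer cast matters. The helpers `encode_label` and `decode_label` are hypothetical stand-ins for the scaling step in `modify_lib_data` and `reverse_mod_lib_data`, assuming six decimals as in `max_decimals`.

```python
MAX_DECIMALS = 6  # assumed to match max_decimals in the notebook
FACTOR = 10 ** MAX_DECIMALS


def encode_label(value):
    """Scale a float class such as 788.33 to an integer label."""
    # round() guards against products that land just below the intended integer
    return int(round(value * FACTOR))


def decode_label(code):
    """Scale an integer label back to the original float class."""
    return code / FACTOR


# A classic float pitfall: the product is slightly below 29, so int() truncates
print(int(0.29 * 100), round(0.29 * 100))

code = encode_label(788.33)
print(code, decode_label(code))
```

Without the rounding step, two rows with the same class could in principle end up with integer labels that differ by one, which the classifier would treat as two separate classes.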


## Final toy dataset:

In [9]:
toy_df, keywords = keywords_to_features(toy_df)

# printing out the number of keywords and rows in our final dataset for training and validating the model
print(f"Number of keywords in input dataset is: {toy_df.shape[1] - 6} should be equal with {len(keywords)}")
print(f"Input data: {toy_df.shape[0]} rows.")
toy_df.head()


Number of keywords in input dataset is: 1917 should be equal with 1917
Input data: 500 rows.


Unnamed: 0,record_id,084,092,093,094,095,rock,suomalaiset,taidemaalarit,perinnemusiikki,...,laskentatoimi,toimintolaskenta,kaupunkiarkeologia,virkatalot,kauppiaat,liikemiehet,pormestarit,maaherrat,kenraalikuvernöörit,talot
0,420908822165,78.8911,78.8911,0.0,78.8911,788.33,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,420908631171,99.1,99.1,99.1,99.1,990.1,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,420907981954,78.462,0.0,0.0,0.0,784.142,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,420908153948,69.3,69.3,0.0,69.3,675.8,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,420908158377,68.2,0.0,0.0,68.2,691.1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



## Start testing

Now we can test and print the results with one single function call.

In [10]:
# Define some initial params
k = 5
test_size = 0.2

#######################################
# Test Random Forest Classifier
#######################################

_ = test_model(
    RandomForestClassifier(max_depth=15),
    StandardScaler(),
    toy_df,
    "RANDOM FOREST CLASSIFIER",
    "TOY DATASET",
    test_size,
)

#
# Skip hyperparameter tuning for now, maybe implement this later
#

#cv = StratifiedKFold(n_splits=k)
#print(cross_val_score(pipe, X_train, y_train, cv=cv))

## optimize
#param_grid = {
#    "estimator__C": np.logspace(-4, 4, 10),
#}

# make_pipeline(Imputer(),StandardScaler(),PCA(n_components=2),SVC(random_state=1))

# cv = StratifiedKFold(n_splits=5)
#gs = GridSearchCV(
#    estimator=pipe,
#    param_grid=param_grid,
#    scoring="accuracy",
#    cv=cv,
#    return_train_score=True,
#)
#gs.fit(X_train, y_train)
#
#print("Best Estimator: \n{}\n".format(gs.best_estimator_))
#print("Best Parameters: \n{}\n".format(gs.best_params_))
#print("Best Test Score: \n{}\n".format(gs.best_score_))
#print(
#    "Best Training Score: \n{}\n".format(
#        gs.cv_results_["mean_train_score"][gs.best_index_]
#    )
#)
#print("All Training Scores: \n{}\n".format(gs.cv_results_["mean_train_score"]))
#print("All Test Scores: \n{}\n".format(gs.cv_results_["mean_test_score"]))
# # This prints out all results during Cross-Validation in details
# print("All Meta Results During CV Search: \n{}\n".format(gs.cv_results_))

# Reset pipeline with best params
#pipe.set_params(estimator__C=gs.best_params_["estimator__C"])
#pipe.fit(X_train, y_train)
#print("Test score with best params (should equal to Best Test Score above)")
#print(pipe.score(X_test, y_test))



******************************************************************
  Results for RANDOM FOREST CLASSIFIER with TOY DATASET:
******************************************************************
Training error: 0.85
Validation error: 0.1
Loss: 0.6364728419944833

Original and predicted labels (first 30 rows):


Unnamed: 0,095,095_PRED
0,788.33,788.33
1,990.1,788.33
2,784.142,784.142
3,675.8,675.8
4,691.1,691.1
5,696.1,696.1
6,192.0,788.33
7,258.0,788.33
8,624.8,624.8
9,756.2,756.2


Number of false predictions: 150
Number of right predictions: 350


All false predictions in dataset:


Unnamed: 0,record_id,084,092,093,094,rock,suomalaiset,taidemaalarit,perinnemusiikki,markkinointitutkimus,...,kaupunkiarkeologia,virkatalot,kauppiaat,liikemiehet,pormestarit,maaherrat,kenraalikuvernöörit,talot,095,095_PRED
1,420908631171,99.10000,99.1000,99.100,99.100,,1.0,1.0,,,...,,,,,,,,,990.100,788.33
6,420908390431,15.90000,15.9000,15.900,15.900,,,,,,...,,,,,,,,,192.000,788.33
7,420908970852,25.50000,,,,,,,,,...,,,,,,,,,258.000,788.33
10,420908629131,40.00000,,,40.800,,,,,,...,,,,,,,,,462.000,788.33
11,420908138656,79.18100,79.1810,79.181,79.181,,,,,,...,,,,,,,,,793.400,788.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
483,420909116320,88.50000,88.5000,,88.500,,,,,,...,,,,,,,,,882.000,788.33
484,420908591076,33.25000,,,33.250,,,,,,...,,,,,,,,,332.230,788.33
489,420908562236,78.89150,78.8915,,,,,,,,...,,,,,,,,,788.140,788.33
492,420909000824,79.37000,79.3700,,79.370,,,,,,...,,,,,,,,,796.200,788.33



## Test results with toy dataset

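 - 355/500 right (71%)
 - Training error: 0.8575 (mean accuracy from `pipe.score`, so higher is better)
 - Validation error: 0.12 (mean accuracy from `pipe.score`, so higher is better)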

# ACTUAL PREDICTING

Because the results with the toy dataset were not very encouraging (compare the training error with the validation error) and because the dataset for fitting the model is huge, we came up with the idea of limiting the training/validation set to those rows that share at least some feature values with the input data (the data we actually want to predict).

*Note: We should refactor this next cell into separate functions and do Python class implementation and move the actual predicting to "02_Loss.ipynb"-notebook.*
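
The row-limiting idea described above can be sketched with pandas as follows. This is a simplified illustration covering only the widest stage (any class column matches); `select_training_rows`, `CLASS_COLUMNS` and the three-row example frame are hypothetical and not part of the project code.

```python
import numpy as np
import pandas as pd

CLASS_COLUMNS = ["084", "092", "093", "094"]  # known class features of an item


def select_training_rows(all_df, item, max_rows=2000, seed=0):
    """Keep rows that share at least one class value with the new item."""
    mask = pd.Series(False, index=all_df.index)
    for col in CLASS_COLUMNS:
        # NaN never compares equal, so missing values do not count as a match
        mask = mask | (all_df[col] == item[col])
    subset = all_df[mask]
    # Cap the size so that fitting stays fast
    if len(subset) > max_rows:
        subset = subset.sample(max_rows, random_state=seed)
    return subset


# Hypothetical miniature of the full dataset
all_df = pd.DataFrame(
    {
        "084": [78.89, 99.1, 15.9],
        "092": [78.89, 99.1, np.nan],
        "093": [np.nan, 99.1, 15.9],
        "094": [78.89, 99.1, 15.9],
        "095": [788.33, 990.1, 192.0],
    }
)

# Missing classes of the new item are set to -1 so that they match nothing
new_item = {"084": 78.89, "092": -1, "093": -1, "094": 78.89}
training_rows = select_training_rows(all_df, new_item)
print(training_rows)
```

Only the first row shares a class value with the new item, so the model for this item would be trained on that row's neighbourhood of the classification instead of on the whole dataset.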

In [11]:
import time

start = time.time()
pd.options.display.float_format = (
    lambda x: "{:.0f}".format(x) if int(x) == x else "{:,.2f}".format(x)
)


############################################################
#
# HERE WE DO THE ACTUAL PREDICTING
#   - We should probably refactor this to "02_Loss.ipynb"
#
###########################################################


# Define some initial params
k = 5
test_size = 0.2
MAX_TRAININGSET_SIZE = 2000  # We never use bigger training set than this
MIN_TRAININGSET_SIZE = 50  # Warn if training set is smaller than this
MAX_POSSIBLE_CLASSES = 20  # Warn if number of possible classes is bigger than this


# Read input data (generated in '00_data.ipynb')
input_df = pd.read_csv(
    input_data_file, index_col=0
)  # simulated input data from whole training set
INPUT_ROWS = 20  # len(input_df) to handle full input file

all_classes_df = pd.read_csv(all_classes_data_file, index_col=0)  # whole training set
validation_df = input_df  # we have real 095 values in simulated input file
errors_df = pd.DataFrame()  # append try/catch errors here
final_result = pd.DataFrame()  # append prediction results in this dataframe


# UNCOMMENT THESE IF WE HAVE A REAL INPUT FILE
#
# input_data_file = "data/preprocessed_data/testiaineisto.csv"
# output_data_file = "data/preprocessed_data/testiaineisto_output.csv"
# input_df = pd.read_csv(input_data_file, index_col=False)
# validation_df = None
# display(input_df.head())


# Choose which predictors you want to use
# score_means holds the average score of each predictor over 1000 library items
# This doesn't improve results significantly, so you should probably just omit it
# Try it out if you want
predictors = [
    RandomForestClassifier(max_depth=15),
    SVC(probability=True),
    ExtraTreesClassifier(max_depth=15),
    KNeighborsClassifier(n_neighbors=3),
    DecisionTreeClassifier(max_depth=15),
    SGDClassifier(max_iter=1000),
    LinearSVC(dual=False),
    GaussianNB(),
]
predictor_names = [
    "RandomForest",
    "SVC",
    "ExtraTrees",
    "KNeighbors",
    "DecisionTree",
    "SGD",
    "LinearSVC",
    "GaussianNB",
]
score_means = [
    0.78,  # RandomForest:  681/977 RIGHT, TIME: 4046
    0.47,  # SVC:           668/977 RIGHT, TIME: 4657
    0.78,  # Extratrees:    660/977 RIGHT, TIME: 3996
    0.35,  # KNeighbors:    658/977 RIGHT, TIME: 3859
    0.8,  # DecisionTree:  632/977 RIGHT, TIME: 'decent'
    0.68,  # SGD:           623/977 RIGHT, TIME: 4088
    0.69,  # LinearSVC:     623/977 RIGHT, TIME: 8460
    0.74,  # GaussianNB:    572/977 RIGHT, TIME: 3958
]
precisions = [
    681 / 977,  # RandomForest
    668 / 977,  # SVC
    660 / 977,  # Extratrees
    658 / 977,  # KNeighbors
    632 / 977,  # DecisionTree
    623 / 977,  # SGD
    623 / 977,  # LinearSVC
    572 / 977,  # GaussianNB
]


# Comment these out if you want to try all the predictors listed above
#
predictors = [RandomForestClassifier(max_depth=15)]
predictor_names = ["RandomForest"]
score_means = [0.78]


#
# LOOP THROUGH NEW LIBRARY ITEMS, CLASSIFY THEM AND ADD TO RESULT DATAFRAME
#
for i in range(INPUT_ROWS):
    # for i in range(len(input_df)):

    # Get one new item from input
    item_df = input_df.iloc[[i]]
    item_df = item_df.fillna(
        -1
    )  # replace NaN with -1 to omit class matching when selecting best training set

    # parse new item keywords and get known classes of new item
    item_df, keywords_lst = keywords_to_features(item_df.copy())
    item_df = item_df.drop(["095"], axis=1)
    class_084 = item_df.loc[0, "084"]
    class_092 = item_df.loc[0, "092"]
    class_093 = item_df.loc[0, "093"]
    class_094 = item_df.loc[0, "094"]

    #
    # GET BEST TRAINING SET: start from the strictest match and widen it
    # while the set is still smaller than MAX_TRAININGSET_SIZE
    # 1. all classes match the new item's classes
    # 2. 093 OR 092 class matches
    # 3. 093 OR 092 OR 094 class matches
    # 4. any of the classes match
    #
    training_set_df = all_classes_df[
        (all_classes_df["093"] == class_093)
        & (all_classes_df["092"] == class_092)
        & (all_classes_df["094"] == class_094)
        & (all_classes_df["084"] == class_084)
    ]

    if len(training_set_df) < MAX_TRAININGSET_SIZE:
        training_set_df = all_classes_df[
            (all_classes_df["093"] == class_093) | (all_classes_df["092"] == class_092)
        ]

    if len(training_set_df) < MAX_TRAININGSET_SIZE:
        training_set_df = all_classes_df[
            (all_classes_df["093"] == class_093)
            | (all_classes_df["092"] == class_092)
            | (all_classes_df["094"] == class_094)
        ]

    if len(training_set_df) < MAX_TRAININGSET_SIZE:
        training_set_df = all_classes_df[
            (all_classes_df["093"] == class_093)
            | (all_classes_df["092"] == class_092)
            | (all_classes_df["094"] == class_094)
            | (all_classes_df["084"] == class_084)
        ]

    # If training set grew too large then limit it here to MAX_TRAININGSET_SIZE
    if len(training_set_df) > MAX_TRAININGSET_SIZE:
        training_set_df = training_set_df.sample(MAX_TRAININGSET_SIZE)

    # PARSE KEYWORDS AND INITIALIZE SOME VARIABLES
    training_set_df, items_keywords_lst = keywords_to_features(
        training_set_df.copy(), keywords_lst
    )
    num_possibles = training_set_df["095"].nunique()
    training_set_size = len(training_set_df)
    info = ""
    warning = False

    # Handle some exceptions
    if training_set_size < 1 or num_possibles == 1:
        info = ""
        prediction = np.nan
        predictor = "None"

        if training_set_size < 1:
            info = "No training data."
            score = -1
        elif num_possibles == 1:
            info = "Only one class in training set."
            prediction = training_set_df["095"].iloc[0]
            predictor = "One class, no predictor"
            score = 1

        result_df = item_df.copy()
        result_df["095_PRED"] = prediction
        result_df["Info"] = info
        result_df["score"] = score
        result_df["predictor"] = predictor
        final_result = pd.concat([final_result, result_df])
        print(
            f"{i}: PREDICTOR={result_df.loc[0]['predictor']}, SCORE={result_df.loc[0]['score']}, SETSIZE={training_set_size}"
        )
        continue

    # Warn if there are many possible classes or training set is too small
    if num_possibles > MAX_POSSIBLE_CLASSES:
        info = (
            info
            + " WARNING: Over "
            + str(MAX_POSSIBLE_CLASSES)
            + " possible classes ("
            + str(num_possibles)
            + ")"
        )
        warning = True
    if training_set_size < MIN_TRAININGSET_SIZE:
        info = info + " WARNING: training set smaller than " + str(MIN_TRAININGSET_SIZE)
        warning = True

    # print(item_df.shape)
    # print(training_set_df.shape)
    # print(item_df.loc[0, "084"])
    # print(training_set_size)
    # print("Uniques:", num_possibles)
    # print(training_set_df["095"].unique())

    #
    # INNER LOOP: FIND A DECENT ESTIMATOR FROM PREDICTORS LIST
    #
    prev_score_estimator = 0
    for j in range(len(predictors)):
        predictor = predictors[j]
        meanscore = score_means[j]
        # print(str(i) + ": CLASS=" + str(item_df.loc[0, "084"]))
        # print(str(type(predictor)) + ": " + str(meanscore))

        try:
            pipe, score = test_model(
                predictor,
                StandardScaler(),
                training_set_df,
                str(type(predictor)),
                "WHOLE TRAINING SET" + str(len(training_set_df)),
                test_size,
                verbose=False,
            )
        except Exception:
            # Record the item that failed and move on to the next predictor
            errors_df = pd.concat([errors_df, item_df])
            continue

        # compare score for this item with predictor's average score
        score_estimator = score / meanscore

        # Add prediction into results if first predictor OR better estimator found
        if j == 0 or score_estimator > prev_score_estimator:
            result_df = predict_hkl_class(pipe, item_df, info)
            result_df["score"] = round(score_estimator, 2)
            result_df["predictor"] = predictor_names[j]
            prev_score_estimator = score_estimator

        if score_estimator < 1.2:
            continue
        else:
            break

    # APPEND NEW ITEM WITH PREDICTED 095 CLASS TO FINAL RESULT DATAFRAME
    result_df["095_PRED"] = result_df["095_PRED"].round(6)
    final_result = pd.concat([final_result, result_df])
    print(
        f"{i}: PREDICTOR={result_df.loc[0]['predictor']}, SCORE={result_df.loc[0]['score']}, SET SIZE={training_set_size}    {info}"
    )

# replace -1 back to nan
final_result = final_result.replace(-1, np.nan)


#
# IF WE KNOW THE REAL 095 VALUES WE PRINT SOME EXTRA INFO
#
if validation_df is not None:
    final_result["095_CORRECT"] = np.nan
    for rec_id in final_result["record_id"].tolist():
        right_hkl_class = float(
            validation_df.loc[validation_df["record_id"] == rec_id, "095"].iloc[0]
        )
        final_result.loc[
            final_result.record_id == rec_id, "095_CORRECT"
        ] = right_hkl_class

    # For float comparison we need to use the 6 decimal round function
    false_preds = final_result[
        final_result["095_PRED"].round(6) != final_result["095_CORRECT"].round(6)
    ]
    n_false_preds = len(false_preds)
    n_right_preds = len(final_result) - n_false_preds

    print(f"\nNumber of false predictions: {n_false_preds}")
    print(f"Number of right predictions: {n_right_preds}")
    print("\n\nAll false predictions in dataset:")
    display(false_preds)
    display(false_preds["predictor"].value_counts())


#
# RE-ARRANGE THE FINAL RESULT SET AND DISPLAY RESULTS
#
column_names = [
    "095_PRED",
    "095_CORRECT",
    "predictor",
    "score",
    "Info",
    "084",
    "092",
    "093",
    "094",
    "record_id",
]
final_result = final_result.reindex(columns=column_names)
display(final_result.head(100))


# Debug data (comment these out if you do not need it)
print("NUMBER OF ERRORS: " + str(len(errors_df)))
print("SCORE MEAN:", final_result["score"].mean())
display(final_result["predictor"].value_counts())
display(errors_df.head(100))
print("Execution time:", int((time.time() - start)))

# output_data_file = "data/preprocessed_data/output_file.csv"
# final_result.to_csv(output_data_file)

0: PREDICTOR=RandomForest, SCORE=0.43, SET SIZE=220    
9: PREDICTOR=RandomForest, SCORE=1.03, SET SIZE=359    
11: PREDICTOR=RandomForest, SCORE=0.89, SET SIZE=422    
12: PREDICTOR=RandomForest, SCORE=0.74, SET SIZE=233    
13: PREDICTOR=RandomForest, SCORE=0.7, SET SIZE=1172    
15: PREDICTOR=RandomForest, SCORE=1.07, SET SIZE=362    
16: PREDICTOR=RandomForest, SCORE=0.85, SET SIZE=408    

Number of false predictions: 1
Number of right predictions: 19


All false predictions in dataset:


Unnamed: 0,record_id,084,092,093,094,rock,suomalaiset,taidemaalarit,perinnemusiikki,markkinointitutkimus,...,liikemiehet,pormestarit,maaherrat,kenraalikuvernöörit,talot,095_PRED,Info,score,predictor,095_CORRECT
0,420909105645,99.13,99.13,99.13,99.13,,,,,,...,,,,,,993.1,WARNING: Over 20 possible classes (99),0.08,RandomForest,993


RandomForest    1
Name: predictor, dtype: int64

Unnamed: 0,095_PRED,095_CORRECT,predictor,score,Info,084,092,093,094,record_id
0,821.1,821.1,RandomForest,0.43,,86.22,86.22,86.22,86.22,420908943150
0,691.12,691.12,RandomForest,0.42,WARNING: Over 20 possible classes (30),59.34,68.22,68.22,68.22,420909061903
0,691.1,691.1,RandomForest,0.17,WARNING: Over 20 possible classes (43),68.2,68.2,68.2,68.2,420909016138
0,788.11,788.11,RandomForest,0.43,WARNING: Over 20 possible classes (40),78.89,78.89,78.89,78.89,420908852805
0,798.0,798.0,RandomForest,0.58,WARNING: Over 20 possible classes (50),65.0,65.0,65.0,65.0,420908697168
0,993.1,993.1,RandomForest,0.09,WARNING: Over 20 possible classes (99),99.13,99.13,99.13,99.13,420908579965
0,188.1,188.1,RandomForest,0.17,WARNING: Over 20 possible classes (99),17.3,17.3,17.3,17.3,420908946672
0,993.1,993.0,RandomForest,0.08,WARNING: Over 20 possible classes (99),99.13,99.13,99.13,99.13,420909105645
0,613.1,613.1,RandomForest,0.36,WARNING: Over 20 possible classes (33),59.34,59.34,59.34,59.34,420909072938
0,788.6,788.6,RandomForest,1.03,,78.31,78.31,78.31,78.31,420907850892


NUMBER OF ERRORS: 0
SCORE MEAN: 0.46299999999999997


RandomForest    20
Name: predictor, dtype: int64

Execution time: 180



## Observations
We noticed that when the input data is complete (that is, no feature values are missing), the random forest predictor does very well, even with tens of possible label classes. For example, when predicting 20 labels the model got 18 right, and even in the remaining two the main class was correct (the error was in the decimals).
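
The distinction between an exact match and a main-class match can be checked with a few lines of pandas. This is a minimal sketch: the three rows are copied from the result table above, the column names follow `final_result`, and treating the integer part of the class number as the "main class" is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the result table above (one of them is the false prediction).
results = pd.DataFrame(
    {
        "095_PRED": [821.1, 993.1, 788.6],
        "095_CORRECT": [821.1, 993.0, 788.6],
    }
)

# Exact match: round to 6 decimals before comparing floats.
exact = results["095_PRED"].round(6) == results["095_CORRECT"].round(6)

# Main-class match (assumption): compare only the integer part of the class number.
main_class = np.floor(results["095_PRED"]) == np.floor(results["095_CORRECT"])

print(f"exact matches: {exact.sum()}/{len(results)}")
print(f"main-class matches: {main_class.sum()}/{len(results)}")
```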
 
We ran a large number of tests with the 100-row input dataset. The main findings are:
 
 - Selecting the best model on the fly, based on testing with all classes with 1000 items and the average loss-function value, gave 675/977 right; elapsed time: 9442ms
 - Using a mixture of RandomForestClassifier, SVC, ExtraTreesClassifier and KNeighborsClassifier gave 682/977; elapsed time: 5568ms
 - Results for all models:
     - RandomForestClassifier(max_depth=15), 681/977, SCORE MEAN: 0.78, TIME: 4046ms
     - SVC(probability=True), 668/977, SCORE MEAN: 0.47, TIME: 4657ms
     - ExtraTreesClassifier(max_depth=15), 660/977, SCORE MEAN: 0.78, TIME: 3996ms
     - KNeighborsClassifier(n_neighbors=3), 658/977, SCORE MEAN: 0.35, TIME: 3859ms
     - DecisionTreeClassifier(max_depth=15), 632/977, SCORE MEAN: 0.8 (TIME missing)
     - SGDClassifier(max_iter=1000), 623/977, SCORE MEAN: 0.68, TIME: 4088ms
     - LinearSVC(dual=False), 623/977, SCORE MEAN: 0.69, TIME: 8460ms
     - GaussianNB(), 572/977, SCORE MEAN: 0.74, TIME: 3958ms
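
A comparison of this kind can be sketched as a loop that fits each candidate, counts the correct predictions and measures the elapsed time. The sketch below is not the code used to produce the numbers above: it runs on synthetic data from `make_classification`, the split sizes and `random_state` values are assumptions, and its output will not match the figures in the list.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the library classification data (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A subset of the models listed above, with the same hyperparameters.
candidates = {
    "RandomForestClassifier": RandomForestClassifier(max_depth=15, random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(max_depth=15, random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=15, random_state=0),
    "GaussianNB": GaussianNB(),
}

rows = []
for name, model in candidates.items():
    started = time.perf_counter()
    model.fit(X_train, y_train)
    n_right = int((model.predict(X_test) == y_test).sum())
    elapsed_ms = (time.perf_counter() - started) * 1000
    rows.append((name, n_right, len(y_test), elapsed_ms))

# Best model first, in the same "right/total" style as the list above.
for name, n_right, n_total, elapsed_ms in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"{name}: {n_right}/{n_total}, TIME: {elapsed_ms:.0f}ms")
```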
  
### Some other observations about data and models:

#### Gaussian Naive Bayes
 - The label cannot be a float: sklearn treats it as a continuous target (casting to the 'category' dtype did not help)
    - Workaround: multiply the HKLJ-CLASS by a large number and cast it to int
 - Most sklearn estimators, including GaussianNB, do not accept NaN values
    - Workaround: replace NaN values with 0
 - Many rows carry no information at all from the other library classification systems
    - Use keywords as additional info
    - Omit the rows that have no class information at all from the other classification systems
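
The two workarounds for labels and missing values can be sketched as follows. The data frame is a made-up miniature that reuses column names and some values from the result table; the scale factor `SCALE = 1000` is an assumption and only needs to be large enough to keep all decimals of the class numbers.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

SCALE = 1000  # assumption: three decimals keep all classes distinct

# Made-up miniature dataset with missing feature values.
df = pd.DataFrame(
    {
        "084": [86.22, 68.2, np.nan, 78.89, 86.22, 68.2],
        "092": [86.22, np.nan, 59.34, 78.89, 86.22, 68.2],
        "095": [821.1, 691.1, 613.1, 788.11, 821.1, 691.1],
    }
)

# Workaround 1: replace NaN values in the features with 0.
X = df[["084", "092"]].fillna(0)

# Workaround 2: turn the float label into an integer label.
y = (df["095"] * SCALE).round().astype(int)

model = GaussianNB().fit(X, y)

# Scale the predictions back to the original class numbers.
pred = model.predict(X) / SCALE
print(pred)
```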

#### Complement Naive Bayes
 - This is an enhancement of Multinomial Naive Bayes
 - We can't use StandardScaler
     - StandardScaler centers each feature around zero, and the algorithm rejects the resulting negative values with the error "negative values in input"
     - Workaround: use MinMaxScaler
 - Quick testing shows much worse results than Gaussian Naive Bayes
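
The scaler problem is easy to reproduce. The sketch below uses a made-up four-row feature matrix and binary labels (both assumptions): `StandardScaler` produces zero-mean features, so some values are negative and `ComplementNB` raises a `ValueError`, while `MinMaxScaler` keeps the training data in the range 0 to 1.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features and labels for illustration only.
X = np.array([[86.22, 0.0], [68.2, 68.2], [59.34, 68.22], [78.89, 78.89]])
y = np.array([0, 1, 1, 0])

# StandardScaler output contains negative values, which ComplementNB rejects.
try:
    make_pipeline(StandardScaler(), ComplementNB()).fit(X, y)
    standard_scaler_ok = True
except ValueError as err:
    standard_scaler_ok = False
    print("StandardScaler + ComplementNB failed:", err)

# MinMaxScaler maps the training data to [0, 1], so the fit succeeds.
min_max_pipe = make_pipeline(MinMaxScaler(), ComplementNB()).fit(X, y)
print(min_max_pipe.predict(X))
```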
 

## Define base class for your ML model

We should implement this next. 

## Output of this notebook

The result of this notebook is a collection of methods ready for evaluation with the real data.

You should export classes and functions to `model.py` with `# nbdev_build_lib` (workflows will do this automatically).