# Prediction of internal failures in a production line

The dataset comes from the [Bosch production line performance competition](https://www.kaggle.com/c/bosch-production-line-performance/), whose goal is to predict internal failures from the thousands of measurements and tests made on each component along the assembly line.

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Each feature name encodes the production line, the station on that line, and a feature number. For example, `L3_S36_F3939` is feature number 3939, measured at station 36 on line 3.
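The naming convention is easy to parse programmatically, which helps when grouping columns by line or station. The following is a minimal sketch; `FeatureName` and `parse_feature_name` are hypothetical helpers, not part of the notebook's `utils` module, and the pattern only covers names of the `L…_S…_F…` form described above.

```python
import re
from typing import NamedTuple


class FeatureName(NamedTuple):
    line: int
    station: int
    feature: int


# Matches names such as "L3_S36_F3939" (hypothetical helper pattern).
_FEATURE_RE = re.compile(r"^L(\d+)_S(\d+)_F(\d+)$")


def parse_feature_name(name: str) -> FeatureName:
    """Split a column name into (line, station, feature number)."""
    match = _FEATURE_RE.match(name)
    if match is None:
        # Columns such as "Id" or "Response" do not follow the convention.
        raise ValueError(f"not a line/station/feature name: {name!r}")
    return FeatureName(*(int(group) for group in match.groups()))


print(parse_feature_name("L3_S36_F3939"))
# FeatureName(line=3, station=36, feature=3939)
```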

## Libraries

In [1]:
import utils
import metric

import numpy as np
import gc
import xgboost as xgb

import pickle
from bayes_opt import BayesianOptimization
from functools import partial


## Modelling

In [2]:
# Load checkpoint (saved at the end of the EDA notebook)
file_name = "./datasets.pkl"
with open(file_name, "rb") as open_file:
    X_train, X_holdout, y_train, y_holdout, skf = pickle.load(open_file)

### Evaluation metric

We need a function that computes the Matthews correlation coefficient (MCC) efficiently for xgboost. We'll use numba for this, so that the probability threshold can be optimised as well. The cell below times it on one million synthetic predictions:

In [4]:
y_prob0 = np.random.rand(1000000)
y_prob  = y_prob0 + 0.4 * np.random.rand(1000000) - 0.02
y_true  = (y_prob0 > 0.6).astype(int)

%timeit metric.eval_mcc(y_true, y_prob)

del y_prob0, y_prob, y_true
gc.collect();

168 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
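The `metric` module itself is not shown here. As an illustration of the idea only, the sketch below finds the best MCC over all thresholds in plain NumPy (no numba): it sorts the predictions once and evaluates the confusion matrix for every "top-k are positive" cut via a cumulative sum. The name `best_mcc` is hypothetical, and the sketch assumes binary 0/1 labels and ignores tied probabilities.

```python
import numpy as np


def best_mcc(y_true, y_prob):
    """Return (best MCC, threshold), predicting 1 when y_prob >= threshold.

    Simplification: tied probabilities are not handled specially.
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    order = np.argsort(-y_prob, kind="stable")   # highest probability first
    y_sorted = y_true[order]
    n = y_true.size
    n_pos = y_true.sum()
    n_neg = n - n_pos

    # Cut k (1..n): the k highest-scoring samples are predicted positive.
    tp = np.cumsum(y_sorted).astype(float)
    fp = np.arange(1, n + 1) - tp
    fn = n_pos - tp
    tn = n_neg - fp

    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    with np.errstate(divide="ignore", invalid="ignore"):
        mcc = np.where(denom > 0, (tp * tn - fp * fn) / denom, 0.0)

    k = int(np.argmax(mcc))
    return float(mcc[k]), float(y_prob[order][k])


print(best_mcc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
# (1.0, 0.8)
```

The single sort makes the sweep O(n log n), rather than recomputing the confusion matrix from scratch for each candidate threshold.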


### k-fold CV

We'll use xgboost as the learning algorithm. The wrapper `utils.CV` performs k-fold CV and returns the average validation MCC; we maximise that score over the hyperparameters with Bayesian optimisation:

In [6]:
# Search bounds (min, max) for each tree-booster hyperparameter
params = {
    "eta": (0.05, 0.3),
    "gamma": (0, 100),
    "max_depth": (5, 50),
    "num_boost_round": (10, 100),
    "subsample": (0.5, 0.95),
    "colsample_bytree": (0.5, 0.95),
    "alpha": (0, 10),
    "lamda": (0, 10),  # spelled "lamda" here; mapped to xgboost's "lambda" at training time
}

# Function handle
f = partial(utils.CV, X_train, y_train, skf)

optimizer = BayesianOptimization(f, params, random_state = 111)
optimizer.maximize(init_points = 20, n_iter = 10)


|   iter    |  target   |   alpha   | colsam... |    eta    |   gamma   |   lamda   | max_depth | num_bo... | subsample |
-------------------------------------------------------------------------------------------------------------------------
|  1        |  0.2763   |  6.122    |  0.5761   |  0.159    |  76.93    |  2.953    |  11.71    |  12.02    |  0.6891   |
|  2        |  0.2893   |  2.387    |  0.6519   |  0.2977   |  23.77    |  0.8119   |  35.13    |  65.91    |  0.6234   |
|  3        |  0.3089   |  4.662    |  0.5533   |  0.06849  |  90.08    |  7.94     |  42.83    |  83.37    |  0.9459   |
|  4        |  0.2689   |  5.773    |  0.8662   |  0.1553   |  2.745    |  4.5
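`utils.CV` is not shown in this notebook, so the following is only a rough sketch of the shape such an objective could take. The optimizer proposes every hyperparameter as a float (note `max_depth = 11.71` in the log above), so integer-valued ones have to be cast before use, and `partial` binds the data so that only hyperparameters remain as keyword arguments. All names here (`cv_objective`, `dummy_fit_and_score`, the list-of-index-pairs `folds`) are assumptions for illustration; the dummy scorer stands in for training xgboost on one split and returning its validation MCC.

```python
from functools import partial
from statistics import mean

# Hyperparameters that must be integers (assumed list).
INTEGER_PARAMS = ("max_depth", "num_boost_round")


def cv_objective(X, y, folds, fit_and_score, **hyperparams):
    """Average the validation score over pre-computed (train, valid) folds."""
    hp = dict(hyperparams)
    for name in INTEGER_PARAMS:
        if name in hp:
            hp[name] = int(hp[name])  # e.g. 11.71 -> 11
    scores = [fit_and_score(X, y, train_idx, valid_idx, hp)
              for train_idx, valid_idx in folds]
    return mean(scores)


def dummy_fit_and_score(X, y, train_idx, valid_idx, hp):
    # Placeholder for "train on train_idx, return MCC on valid_idx".
    # It simply echoes max_depth so the integer cast is visible.
    return float(hp["max_depth"])


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
folds = [([0, 1], [2, 3]), ([2, 3], [0, 1])]

# Bind the data; the optimizer then calls f(**hyperparameters).
f = partial(cv_objective, X, y, folds, dummy_fit_and_score)
print(f(eta=0.159, max_depth=11.71, num_boost_round=12.02))
# 11.0
```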

Let's retrain with the best hyperparameters on the full training set:

In [7]:
# Make dmatrices
dtrain = xgb.DMatrix(X_train, y_train)
dheld  = xgb.DMatrix(X_holdout, y_holdout.to_numpy())

# Class counts, used below to up-weight the rare positive class
sum_neg, sum_pos = np.sum(y_train == 0), np.sum(y_train == 1)

# Make parameter dict for xgboost
xgb_params = {"nthread": -1, "booster":"gbtree", "objective": "binary:logistic", "eval_metric": "auc", "tree_method": "hist",
              "eta":              optimizer.max["params"]["eta"], 
              "gamma":            optimizer.max["params"]["gamma"], 
              "max_depth":        int(optimizer.max["params"]["max_depth"]), 
              "subsample":        optimizer.max["params"]["subsample"],
              "alpha":            optimizer.max["params"]["alpha"], 
              "lambda":           optimizer.max["params"]["lamda"],
              "colsample_bytree": optimizer.max["params"]["colsample_bytree"],
              "scale_pos_weight": sum_neg / sum_pos}

# Train with the tuned parameters; early stopping monitors the training set,
# which is the only entry in evals
clf = xgb.train(params = xgb_params,
                dtrain = dtrain,
                feval  = metric.mcc_eval,
                evals  = [(dtrain, 'train')],
                maximize = True,
                verbose_eval = False,
                num_boost_round = int(optimizer.max["params"]["num_boost_round"]),
                early_stopping_rounds = 10)
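The `scale_pos_weight` value above is the ratio of negative to positive examples, which the xgboost documentation suggests as a typical setting for imbalanced classes. With that weight, the positives carry the same total weight as the negatives. A toy illustration follows; the helper name and the label counts are made up for the example.

```python
import numpy as np


def compute_scale_pos_weight(y):
    """Ratio of negative to positive labels (hypothetical helper)."""
    y = np.asarray(y)
    sum_neg, sum_pos = np.sum(y == 0), np.sum(y == 1)
    if sum_pos == 0:
        raise ValueError("no positive examples in y")
    return sum_neg / sum_pos


# Toy labels: 995 passing parts, 5 failures (made-up counts).
y_toy = np.array([0] * 995 + [1] * 5)
weight = compute_scale_pos_weight(y_toy)
print(weight)
# 199.0

# Each positive now counts 199 times, so both classes weigh the same in total.
assert np.sum(y_toy == 1) * weight == np.sum(y_toy == 0)
```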

Let's predict on the held-out set and compute the MCC:

In [8]:
y_prob = clf.predict(dheld)
print(f"Heldout Set MCC: {round(metric.eval_mcc(y_holdout.to_numpy(), y_prob), 3)}")

Heldout Set MCC: 0.255
