# Bosch Production Line Performance
### Reduce manufacturing failures


A good chocolate soufflé is decadent, delicious, and delicate. But it's a challenge to prepare. When you pull a disappointingly deflated dessert out of the oven, you instinctively retrace your steps to identify at what point you went wrong. Bosch, one of the world's leading manufacturing companies, has an imperative to ensure that the recipes for the production of its advanced mechanical components meet the highest quality and safety standards. Part of doing so is closely monitoring its parts as they progress through the manufacturing processes.

<img src="https://kaggle2.blob.core.windows.net/competitions/kaggle/5357/media/BoschManufacturingKaggleImage.jpg" style="max-height:400px; text-align:center;">

Because Bosch records data at every step along its assembly lines, it can apply advanced analytics to improve these manufacturing processes. However, the intricacies of the data and the complexities of the production line pose problems for current methods.

In this competition, Bosch is challenging Kagglers to predict internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable Bosch to bring quality products at lower costs to the end user.

### Evaluation

Submissions are evaluated on the Matthews correlation coefficient (MCC) between the predicted and the observed response. The MCC is given by:

<img src="https://latex.codecogs.com/gif.latex?MCC&space;=&space;\frac{(TP*TN)&space;-&space;(FP&space;*&space;FN)}{\sqrt{(TP&plus;FP)(TP&plus;FN)(TN&space;&plus;&space;FP)(TN&plus;FN)}}" title="MCC = \frac{(TP*TN) - (FP * FN)}{\sqrt{(TP+FP)(TP+FN)(TN + FP)(TN+FN)}}" />

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
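As a quick illustration of the formula, the MCC can be computed directly from the four confusion-matrix counts. This is a minimal sketch: the function name and the toy counts are made up for this example, and returning 0 when the denominator is zero is an assumed convention, not something the competition text specifies.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Compute the Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: treat an undefined MCC (zero denominator) as 0.0
    if denominator == 0:
        return 0.0
    return numerator / denominator


# Toy counts, for illustration only
print(mcc_from_counts(tp=6, tn=3, fp=1, fn=2))
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right). This makes it more informative than plain accuracy when one class is rare.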

### Description of Data

The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).

The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.
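The naming convention can be unpacked with a small regular expression. The sketch below is only an illustration: `FeatureName` and `parse_feature_name` are hypothetical helpers introduced here, not part of the dataset or of any library.

```python
import re
from typing import NamedTuple


class FeatureName(NamedTuple):
    line: int      # production line
    station: int   # station on that line
    number: int    # feature number


# Matches names such as "L3_S36_F3939"
_FEATURE_PATTERN = re.compile(r"^L(\d+)_S(\d+)_F(\d+)$")


def parse_feature_name(column):
    """Split a feature column name into line, station, and feature number."""
    match = _FEATURE_PATTERN.match(column)
    if match is None:
        raise ValueError(f"not a feature column name: {column!r}")
    line, station, number = (int(part) for part in match.groups())
    return FeatureName(line, station, number)


print(parse_feature_name("L3_S36_F3939"))
# FeatureName(line=3, station=36, number=3939)
```

Parsing the names this way makes it possible to group columns by line or station. Columns that do not follow the pattern, such as `Id` and `Response`, raise a `ValueError`.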

On account of the large size of the dataset, we have separated the files by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.
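Taken literally, the rule above maps a date column to the feature column with the same line and station and a number one lower. The helper below is a hypothetical sketch of that rule only. It assumes the "previous feature number" is always the date number minus one, so the returned name should be checked against the columns actually present in the files.

```python
import re

# Matches date column names such as "L0_S0_D1"
_DATE_PATTERN = re.compile(r"^(L\d+_S\d+)_D(\d+)$")


def feature_for_date_column(date_column):
    """Return the feature column name that a date column is assumed to timestamp."""
    match = _DATE_PATTERN.match(date_column)
    if match is None:
        raise ValueError(f"not a date column name: {date_column!r}")
    prefix, number = match.groups()
    # Assumption: the timestamped feature has the date number minus one
    return f"{prefix}_F{int(number) - 1}"


print(feature_for_date_column("L0_S0_D1"))
```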

In addition to being one of the largest datasets (in terms of number of features) ever hosted on Kaggle, the ground truth for this competition is highly imbalanced. Together, these two attributes are expected to make this a challenging problem.

### File descriptions

* **train_numeric.csv** - the training set numeric features (this file contains the 'Response' variable)
* **test_numeric.csv** - the test set numeric features (you must predict the 'Response' for these Ids)
* **train_categorical.csv** - the training set categorical features
* **test_categorical.csv** - the test set categorical features
* **train_date.csv** - the training set date features
* **test_date.csv** - the test set date features
* **sample_submission.csv** - a sample submission file in the correct format

## Notebook Objective: Reviewing the Data

In any process involving data, the first goal should always be understanding the data. This involves looking at the data and answering a range of questions including (but not limited to):

* What **features** (columns) does the dataset contain?
* **How many records** (rows) have been provided?
* **What kind** of features:
    * Date format?
    * Numerical values?
    * Categorical values?
    * Ordinal values?
* Are there **missing values**?
* How do the different features **relate to each other**?
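Most of these questions map onto one-line pandas calls. The sketch below runs them on a tiny made-up frame whose column names imitate the competition files; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for a real file; all values are made up
df = pd.DataFrame({
    "Id": [4, 6, 7],
    "L0_S0_F0": [0.03, None, 0.088],
    "L0_S1_F25": [None, None, "T1"],
    "Response": [0, 0, 1],
})

n_rows, n_cols = df.shape            # how many records and features?
column_names = list(df.columns)      # which features does it contain?
column_dtypes = df.dtypes            # what kind of values does each hold?
missing_share = df.isna().mean()     # fraction of missing values per column
# Pairwise correlation between the numeric columns only
numeric_corr = df.select_dtypes("number").corr()

print(n_rows, n_cols)
print(column_dtypes)
print(missing_share)
print(numeric_corr)
```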

## 1. Import all the libraries

#### Installation required
pip install dask toolz

In [1]:
from datetime import datetime

import numpy as np
import pandas as pd

import dask.dataframe as dd

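# Note: bokeh.charts is only available in older Bokeh releases;
# it was later split out into the separate bkcharts package.
import bokeh.charts as bk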
import bokeh.plotting as bk_plt
import bokeh.models as bk_md

bk.output_notebook()

## 2. Load every CSV file with Dask and Pandas
### 2.0 Load a Small Dummy File to Check DataFrame Loading with Dask

In [2]:
df_test = dd.read_csv('data/dummy_data.csv', dtype='str')
df_test.head(n=5)

Unnamed: 0,Id,L0_S1_F25,L0_S1_F27,L0_S1_F29,L0_S1_F31,L0_S2_F33,L0_S2_F35,L0_S2_F37,L0_S2_F39,L0_S2_F41,...,L3_S49_F4225,L3_S49_F4227,L3_S49_F4229,L3_S49_F4230,L3_S49_F4232,L3_S49_F4234,L3_S49_F4235,L3_S49_F4237,L3_S49_F4239,L3_S49_F4240
0,1,t1,t2,t3,,,,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,5,,,,,,,,,,...,,,,,,,,,,
4,8,,,,,,,,,,...,,,,,,,,,,


### 2.1 Load Sample Submission File

In [3]:
df_sample_submission = pd.read_csv("data/sample_submission.csv")
df_sample_submission.head(n=5) # Only display a few lines and not the whole dataframe

Unnamed: 0,Id,Response
0,1,0
1,2,0
2,3,0
3,5,0
4,8,0


### 2.2 Load Training Numeric Features File

In [4]:
df_train_numeric = dd.read_csv('data/train_numeric.csv')
df_train_numeric.head(n=5)

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262,Response
0,4,0.03,-0.034,-0.197,-0.179,0.118,0.116,-0.015,-0.032,0.02,...,,,,,,,,,,0
1,6,,,,,,,,,,...,,,,,,,,,,0
2,7,0.088,0.086,0.003,-0.052,0.161,0.025,-0.015,-0.072,-0.225,...,,,,,,,,,,0
3,9,-0.036,-0.064,0.294,0.33,0.074,0.161,0.022,0.128,-0.026,...,,,,,,,,,,0
4,11,-0.055,-0.086,0.294,0.33,0.118,0.025,0.03,0.168,-0.169,...,,,,,,,,,,0


### 2.3 Load Testing Numeric Features File

In [5]:
df_test_numeric = dd.read_csv('data/test_numeric.csv')
df_test_numeric.head(n=5)

Unnamed: 0,Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L0_S0_F8,L0_S0_F10,L0_S0_F12,L0_S0_F14,L0_S0_F16,...,L3_S50_F4243,L3_S50_F4245,L3_S50_F4247,L3_S50_F4249,L3_S50_F4251,L3_S50_F4253,L3_S51_F4256,L3_S51_F4258,L3_S51_F4260,L3_S51_F4262
0,1,,,,,,,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,5,-0.016,-0.026,-0.033,-0.016,0.205,-0.157,0.0,0.008,0.087,...,,,,,,,,,,
4,8,,,,,,,,,,...,,,,,,,,,,


### 2.4 Load Training Categorical Features File

In [6]:
df_train_categorical = dd.read_csv('data/train_categorical.csv', dtype='str')
df_train_categorical.head(n=5)

Unnamed: 0,Id,L0_S1_F25,L0_S1_F27,L0_S1_F29,L0_S1_F31,L0_S2_F33,L0_S2_F35,L0_S2_F37,L0_S2_F39,L0_S2_F41,...,L3_S49_F4225,L3_S49_F4227,L3_S49_F4229,L3_S49_F4230,L3_S49_F4232,L3_S49_F4234,L3_S49_F4235,L3_S49_F4237,L3_S49_F4239,L3_S49_F4240
0,4,,,,,,,,,,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,,,,,,,,,,...,,,,,,,,,,
3,9,,,,,,,,,,...,,,,,,,,,,
4,11,,,,,,,,,,...,,,,,,,,,,


### 2.5 Load Testing Categorical Features File

In [7]:
df_test_categorical = dd.read_csv('data/test_categorical.csv', dtype='str')
df_test_categorical.head(n=5)

Unnamed: 0,Id,L0_S1_F25,L0_S1_F27,L0_S1_F29,L0_S1_F31,L0_S2_F33,L0_S2_F35,L0_S2_F37,L0_S2_F39,L0_S2_F41,...,L3_S49_F4225,L3_S49_F4227,L3_S49_F4229,L3_S49_F4230,L3_S49_F4232,L3_S49_F4234,L3_S49_F4235,L3_S49_F4237,L3_S49_F4239,L3_S49_F4240
0,1,,,,,,,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,5,,,,,,,,,,...,,,,,,,,,,
4,8,,,,,,,,,,...,,,,,,,,,,


### 2.6 Load Training Date Features File

In [8]:
df_train_date = dd.read_csv('data/train_date.csv')
df_train_date.head(n=5)

Unnamed: 0,Id,L0_S0_D1,L0_S0_D3,L0_S0_D5,L0_S0_D7,L0_S0_D9,L0_S0_D11,L0_S0_D13,L0_S0_D15,L0_S0_D17,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,4,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,82.24,...,,,,,,,,,,
1,6,,,,,,,,,,...,,,,,,,,,,
2,7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,1618.7,...,,,,,,,,,,
3,9,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,1149.2,...,,,,,,,,,,
4,11,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,602.64,...,,,,,,,,,,


### 2.7 Load Testing Date Features File

In [9]:
df_test_date = dd.read_csv('data/test_date.csv')
df_test_date.head(n=5)

Unnamed: 0,Id,L0_S0_D1,L0_S0_D3,L0_S0_D5,L0_S0_D7,L0_S0_D9,L0_S0_D11,L0_S0_D13,L0_S0_D15,L0_S0_D17,...,L3_S50_D4246,L3_S50_D4248,L3_S50_D4250,L3_S50_D4252,L3_S50_D4254,L3_S51_D4255,L3_S51_D4257,L3_S51_D4259,L3_S51_D4261,L3_S51_D4263
0,1,,,,,,,,,,...,,,,,,,,,,
1,2,,,,,,,,,,...,,,,,,,,,,
2,3,,,,,,,,,,...,,,,,,,,,,
3,5,255.45,255.45,255.45,255.45,255.45,255.45,255.45,255.45,255.45,...,,,,,,,,,,
4,8,,,,,,,,,,...,,,,,,,,,,


## 3. Dataset Analysis
### 3.1 Dataset Size

In [10]:
# Dask dataframes are lazy: .compute() triggers the actual pass over each file.
# Counting the entries of the Id column gives the number of rows.
df_train_numeric_size = df_train_numeric.Id.size.compute()
df_train_categorical_size = df_train_categorical.Id.size.compute()
df_train_date_size = df_train_date.Id.size.compute()

df_test_numeric_size = df_test_numeric.Id.size.compute()
df_test_categorical_size = df_test_categorical.Id.size.compute()
df_test_date_size = df_test_date.Id.size.compute()

In [11]:
print("Number of lines - df_train_numeric:", df_train_numeric_size)
print("Number of lines - df_train_categorical:", df_train_categorical_size)
print("Number of lines - df_train_date:", df_train_date_size)

print("Number of lines - df_test_numeric:", df_test_numeric_size)
print("Number of lines - df_test_categorical:", df_test_categorical_size)
print("Number of lines - df_test_date:", df_test_date_size)

Number of lines - df_train_numeric: 1183747
Number of lines - df_train_categorical: 1183747
Number of lines - df_train_date: 1183747
Number of lines - df_test_numeric: 1183748
Number of lines - df_test_categorical: 1183748
Number of lines - df_test_date: 1183748
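
The three files of each split report the same number of rows, which is what you would expect if they describe the same parts. Counts like these can also be cross-checked without Dask by streaming a file once. The helper below is a hypothetical sketch, not part of the notebook; it assumes each file has a single header row.

```python
import csv


def count_data_rows(path):
    """Count the records in a CSV file, excluding the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be exactly one line)
        return sum(1 for _ in reader)
```

For example, `count_data_rows("data/train_numeric.csv")` should agree with the Dask result above if the file is laid out as assumed. The file is read line by line, so memory use stays small even for very wide rows.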
