# Predicting States of Manufacturing Control Data Using Neural Nets ⚙️

**Note: Use GPU for Training...**

**Objective:** Build a powerful NN model that gives a good estimate of the machine state.

**Strategy:** I think I will follow this strategy:

**Level 1 Getting Started**

* Quick EDA to identify potential opportunities.
* Simple pre-processing step to encode categorical features.
* A basic CV strategy using 80% for training and 20% for validation.
* Looking at the feature importances.
* Creating a submission file.
* Submit the file to Kaggle.

**Level 2 Feature Engineering**

* Feature engineering using text information. (Massive boost in the score)
* Cross validation loop.

**Level 3 Model Optimization**
* Work in Progress...

---
**Other Similar Implementations**
I have been working on other architectures at the same time to see what works more efficiently.

XGBoost and LGBM Models

https://www.kaggle.com/code/cv13j0/tps-may22-eda-gbdt

---




**Data Description**

For this challenge, you are given (simulated) manufacturing control data and are tasked to predict whether the machine is in state 0 or state 1. 
The data has various feature interactions that may be important in determining the machine state.

Good luck!

**Files**
* train.csv - the training data, which includes normalized continuous data and categorical data
* test.csv - the test set; your task is to predict binary target variable which represents the state of a manufacturing process
* sample_submission.csv - a sample submission file in the correct format

---
**Notebooks Ideas and Credits**

I took ideas or inspiration from the following notebooks. If you enjoy my work, please take a look at the notebooks that inspired it.

TPSMAY22 Gradient-Boosting Quickstart: https://www.kaggle.com/code/ambrosm/tpsmay22-gradient-boosting-quickstart/notebook

---

# 1. Loading the Required Libraries

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
import datetime

---


# 2. Setting the Notebook

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to use 5 decimal places so tables look nicer.
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 3. Loading the Information (CSV) Into A Dataframe

In [None]:
%%time
# Load the CSV information into a Pandas DataFrame...
trn_data = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/train.csv')
tst_data = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/test.csv')

sub = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/sample_submission.csv')

---

# 4. Exploring the Information Available

## 4.1. Analysing the Train Dataset

In [None]:
%%time
# Explore the shape of the DataFrame...
trn_data.shape

In [None]:
%%time
# Display simple information of the variables in the dataset...
trn_data.info()

In [None]:
%%time
# Display the first few rows of the DataFrame...
trn_data.head()

In [None]:
%%time
# Generate a simple statistical summary of the DataFrame, Only Numerical...
trn_data.describe()

In [None]:
%%time
# Calculates the total number of missing values...
trn_data.isnull().sum().sum()

In [None]:
%%time
# Display the number of missing values by variable...
trn_data.isnull().sum()

In [None]:
%%time
# Display the number of unique values for each variable...
trn_data.nunique()

In [None]:
# Display the number of unique values for each variable, sorted by quantity...
trn_data.nunique().sort_values(ascending = True)

In [None]:
%%time
# Check some of the categorical variables
categ_cols = ['f_29','f_30','f_13', 'f_18','f_17','f_14','f_11','f_10','f_09','f_15','f_07','f_12','f_16','f_08','f_27']
trn_data[categ_cols].sample(5)

In [None]:
%%time
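# Generate a quick correlation matrix to understand the dataset better.
# Only numeric columns are used; f_27 is a string column and cannot be correlated.
correlation = trn_data.select_dtypes(include = 'number').corr()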

In [None]:
%%time
# Display the correlation matrix
correlation

In [None]:
%%time
# Check the most correlated variables to the target
correlation['target'].sort_values(ascending = False)[:5]

In [None]:
%%time
# Check the least correlated variables to the target
correlation['target'].sort_values(ascending = True)[:5]

---

## 4.2. Analysing the Train Labels Dataset

In [None]:
%%time
# Check how well balanced is the dataset
trn_data['target'].value_counts()

In [None]:
%%time
# Check some statistics on the target variable
trn_data['target'].describe()

---

# 5. Feature Engineering

## 5.1 Text-Based Features

In [None]:
%%time
# The idea is to create a simple function to count the occurrences of each letter in feature 27.
# Feature 27 seems quite important.

def count_sequence(df, field):
    '''
    For each letter of the alphabet, add a new feature with the number of
    occurrences of that letter in the provided sequence column.
    '''
    alphabet = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
    
    for letter in alphabet:
        df[letter + '_count'] = df[field].str.count(letter)
    
    # Use the field argument here instead of a hard-coded column name.
    df["unique_characters"] = df[field].apply(lambda s: len(set(s)))
    return df
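
To make the letter-count idea concrete, here is a minimal sketch on a toy DataFrame. The column name `seq` and the sample strings are made up for illustration; the notebook applies the same idea to `f_27`.

```python
import string

import pandas as pd

# Toy data: the column name `seq` and its values are illustrative assumptions.
toy = pd.DataFrame({'seq': ['ABABAB', 'AAAAAC', 'BCDBCD']})

# One count column per uppercase letter, mirroring the idea behind count_sequence.
for letter in string.ascii_uppercase:
    toy[letter + '_count'] = toy['seq'].str.count(letter)

# Number of distinct characters in each string.
toy['unique_characters'] = toy['seq'].apply(lambda s: len(set(s)))

print(toy[['seq', 'A_count', 'B_count', 'C_count', 'unique_characters']])
```

Letters that never occur produce all-zero columns, which carry no signal and could be dropped before training.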

In [None]:
%%time
# Use the newly created function to generate more features.
# trn_data = count_sequence(trn_data, 'f_27')
# tst_data = count_sequence(tst_data, 'f_27')

In [None]:
%%time
def count_chars(df, field):
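    '''
    Encode each of the first 10 characters of the field as an integer
    (A = 0, B = 1, ...) and count the unique characters per row.
    '''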
    
    for i in range(10):
        df[f'ch_{i}'] = df[field].str.get(i).apply(ord) - ord('A')
        
    df["unique_characters"] = df[field].apply(lambda s: len(set(s)))
    return df

In [None]:
%%time
# Use the newly created function to generate more features.
trn_data = count_chars(trn_data, 'f_27')
tst_data = count_chars(tst_data, 'f_27')
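
As a quick illustration of the positional encoding used by `count_chars`, the sketch below runs the same `ord` arithmetic on three-character toy strings. The column name `seq` and the values are assumptions for the example only.

```python
import pandas as pd

# Toy data: `seq` and its values are illustrative assumptions, not competition data.
toy = pd.DataFrame({'seq': ['ABC', 'CAB', 'AAZ']})

# Encode the character at each position as an integer offset from 'A'.
for i in range(3):
    toy[f'ch_{i}'] = toy['seq'].str.get(i).apply(ord) - ord('A')

print(toy)
```

Each position becomes its own numeric column, so the network can learn position-specific effects that plain letter counts would hide.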

## 5.2 Stats Features

In [None]:
%%time
continuous_feat = ['f_00', 'f_01', 'f_02', 'f_03', 'f_04', 'f_05', 'f_06', 'f_19', 'f_20', 'f_21', 'f_22', 'f_23', 'f_24', 'f_25', 'f_26', 'f_28']

def stat_features(df, cols = continuous_feat):
    '''
    Add row-wise summary statistics computed over the given columns.
    '''
    
    df['f_sum']  = df[cols].sum(axis=1)
    df['f_min']  = df[cols].min(axis=1)
    df['f_max']  = df[cols].max(axis=1)
    df['f_std']  = df[cols].std(axis=1)
    # DataFrame.mad was removed in pandas 2.0, so compute the mean absolute deviation directly.
    df['f_mad']  = df[cols].sub(df[cols].mean(axis=1), axis=0).abs().mean(axis=1)
    df['f_mean'] = df[cols].mean(axis=1)
    df['f_kurt'] = df[cols].kurt(axis=1)
    # Sum the boolean mask to count positive values; count() would return the number of columns.
    df['f_count_pos']  = df[cols].gt(0).sum(axis=1)

    return df

In [None]:
%%time
trn_data = stat_features(trn_data, continuous_feat)
tst_data = stat_features(tst_data, continuous_feat)
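
The row-wise statistics are easier to verify on a tiny frame. The following sketch uses made-up columns `a`, `b`, `c` and shows two details worth checking: a hand-written mean absolute deviation (pandas 2.0 removed `DataFrame.mad`), and the difference between `sum` and `count` on a boolean mask.

```python
import pandas as pd

# Toy data: column names and values are illustrative assumptions.
toy = pd.DataFrame({'a': [1.0, -2.0], 'b': [3.0, 0.0], 'c': [-1.0, 2.0]})
cols = ['a', 'b', 'c']

row_mean = toy[cols].mean(axis=1)
toy['f_mean'] = row_mean

# Mean absolute deviation around the row mean, computed by hand.
toy['f_mad'] = toy[cols].sub(row_mean, axis=0).abs().mean(axis=1)

# sum() counts the True entries; count() counts non-null entries, True or False.
toy['f_count_pos'] = toy[cols].gt(0).sum(axis=1)
non_null_per_row = toy[cols].gt(0).count(axis=1)

print(toy)
print(non_null_per_row.tolist())
```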

In [None]:
trn_data.head()

---

# 7. Pre-Processing Labels

In [None]:
%%time
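# Define a label encoding function
from sklearn.preprocessing import LabelEncoder

def encode_features(df_trn, df_tst, cols = ['f_27']):
    # Fit one encoder per column on train and test together so both share the same mapping.
    for col in cols:
        encoder = LabelEncoder().fit(pd.concat([df_trn[col], df_tst[col]]))
        df_trn[col + '_enc'] = encoder.transform(df_trn[col])
        df_tst[col + '_enc'] = encoder.transform(df_tst[col])
    return df_trn, df_tst

trn_data, tst_data = encode_features(trn_data, tst_data)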
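
One thing worth checking with label encoding is that train and test share the same mapping. The sketch below, on made-up values, contrasts two independently fitted encoders with a single encoder fitted on both sets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: the values are illustrative assumptions.
trn = pd.Series(['B', 'C', 'B'])
tst = pd.Series(['C', 'D'])

# Fitted separately, 'C' gets code 1 in train but code 0 in test.
sep_trn = LabelEncoder().fit_transform(trn)
sep_tst = LabelEncoder().fit_transform(tst)

# Fitted once on the union, every value has a single code in both sets.
shared = LabelEncoder().fit(pd.concat([trn, tst]))
shared_trn = shared.transform(trn)
shared_tst = shared.transform(tst)

print(sep_trn, sep_tst)
print(shared_trn, shared_tst)
```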

In [None]:
# Check the results of the transformation
trn_data.head()

## 7.2 - One-Hot Encode

In [None]:
# We will proceed to one-hot encode all of these variables...
# f_29           2
# f_30           3
# f_13          13
# f_18          14
# f_17          14
# f_14          14
# f_11          14
# f_10          15
# f_09          15
# f_15          15
# f_07          16
# f_12          16
# f_16          16
# f_08          16

In [None]:
%%time
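def one_hot_encoder(df_trn, df_tst, var_list):
    '''
    One-hot encode var_list on train and test together so both get the same columns.
    '''
    df_trn['is_train'] = 1
    df_tst['is_train'] = 0

    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
    combined = pd.concat([df_trn, df_tst])
    combined = pd.get_dummies(combined, columns = var_list)
    return combined[combined['is_train'] == 1], combined[combined['is_train'] == 0]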
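
The reason for combining train and test before calling `pd.get_dummies` is column alignment: a category that appears in only one set still gets a column in both. A minimal sketch on toy frames (the column `cat` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data: category 0 appears only in train and category 2 only in test.
trn = pd.DataFrame({'cat': [0, 1], 'target': [0, 1]})
tst = pd.DataFrame({'cat': [1, 2]})

trn['is_train'] = 1
tst['is_train'] = 0

# Encode both sets in one pass, then split them apart again.
combined = pd.concat([trn, tst], ignore_index=True)
combined = pd.get_dummies(combined, columns=['cat'])

trn_ohe = combined[combined['is_train'] == 1]
tst_ohe = combined[combined['is_train'] == 0]

print(list(trn_ohe.columns))
```

Both outputs end up with `cat_0`, `cat_1` and `cat_2`, so a model trained on one can score the other without reindexing.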

In [None]:
%%time
#trn_data, tst_data = one_hot_encoder(trn_data,tst_data, [
                                                         #'f_29',
                                                         #'f_30',
                                                         #'f_13',
                                                         #'f_18',
                                                         #'f_17',
                                                         #'f_14',
                                                         #'f_11',
                                                         #'f_10',
                                                         #'f_09',
                                                         #'f_15',
                                                         #'f_07',
                                                         #'f_12',
                                                         #'f_16',
                                                         #'f_08'
                                                         #])

---

# 8. Feature Selection for Baseline Model

In [None]:
%%time
# Define what will be used in the training stage
ignore = ['id', 
          'f_27', 
          'f_27_enc', 
          'is_train', 
          'target'] # f_27 has been label encoded...

features = [feat for feat in trn_data.columns if feat not in ignore]
target_feature = 'target'

---

# 9. Creating a Simple Train / Test Split Strategy

In [None]:
%%time
# Creates a simple train split breakdown for baseline model
from sklearn.model_selection import train_test_split
test_size_pct = 0.20
X_train, X_valid, y_train, y_valid = train_test_split(trn_data[features], trn_data[target_feature], test_size = test_size_pct, random_state = 42)
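
The split above is purely random. If the class ratio needs to be identical in both parts, `train_test_split` also accepts a `stratify` argument. The sketch below uses synthetic arrays, not the competition data, to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows with a perfectly balanced binary target.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

print(X_train.shape, X_valid.shape)
print(y_valid.mean())
```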

---

# 10. Building a Baseline NN Model, Simple Split

In [None]:
%%time
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, InputLayer, Add, BatchNormalization, Dropout

from sklearn.preprocessing import StandardScaler
import random

In [None]:
%%time
np.random.seed(1)
random.seed(1)
tf.random.set_seed(1)

In [None]:
%%time
def nn_model():
    '''
    Build a fully connected network with swish activations and a sigmoid output.
    '''
    
    activation_func = 'swish'
    inputs = Input(shape = (len(features),))
    
    x = Dense(64, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(30e-6), 
              activation = activation_func)(inputs)
    
    #x = BatchNormalization()(x)
    
    x = Dense(64, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(30e-6), 
              activation = activation_func)(x)
    
    #x = BatchNormalization()(x)
    
    x = Dense(64, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(30e-6), 
              activation = activation_func)(x)
    
    #x = BatchNormalization()(x)
    
    x = Dense(64, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(30e-6), 
              activation = activation_func)(x)
    
    #x = BatchNormalization()(x)

    x = Dense(16, 
              #use_bias  = True, 
              kernel_regularizer = tf.keras.regularizers.l2(30e-6), 
              activation = activation_func)(x)
    
    #x = BatchNormalization()(x)

    x = Dense(1 , 
              #use_bias  = True, 
              #kernel_regularizer = tf.keras.regularizers.l2(30e-6),
              activation = 'sigmoid')(x)
    
    model = Model(inputs, x)
    
    return model

In [None]:
%%time
architecture = nn_model()
architecture.summary()

In [None]:
%%time
# Defining model parameters...
BATCH_SIZE         = 4096
EPOCHS             = 200 
EPOCHS_COSINEDECAY = 300 
DIAGRAMS           = True
USE_PLATEAU        = False
INFERENCE          = False
VERBOSE            = 0 
TARGET             = 'target'

In [None]:
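%%time
# Defining model training function...
def fit_model(X_train, y_train, X_val, y_val, run = 0):
    '''
    Scale the data, train the network on one fold, and record the validation AUC
    and test predictions. Relies on the globals history_list, score_list,
    predictions and fold defined in the cross-validation cell.
    '''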
    lr_start = 0.01
    start_time = datetime.datetime.now()
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)

    epochs = EPOCHS    
    lr = ReduceLROnPlateau(monitor = 'val_loss', factor = 0.7, patience = 4, verbose = VERBOSE)
    es = EarlyStopping(monitor = 'val_loss',patience = 12, verbose = 1, mode = 'min', restore_best_weights = True)
    tm = tf.keras.callbacks.TerminateOnNaN()
    callbacks = [lr, es, tm]
    
    # Cosine Learning Rate Decay
    if not USE_PLATEAU:
        epochs = EPOCHS_COSINEDECAY
        lr_end = 0.0002

        def cosine_decay(epoch):
            if epochs > 1:
                w = (1 + math.cos(epoch / (epochs - 1) * math.pi)) / 2
            else:
                w = 1
            return w * lr_start + (1 - w) * lr_end
        
        lr = LearningRateScheduler(cosine_decay, verbose = 0)
        callbacks = [lr, tm]
        
    model = nn_model()
    optimizer_func = tf.keras.optimizers.Adam(learning_rate = lr_start)
    loss_func = tf.keras.losses.BinaryCrossentropy()
    model.compile(optimizer = optimizer_func, loss = loss_func)
    
    X_val = scaler.transform(X_val)
    validation_data = (X_val, y_val)
    
    history = model.fit(X_train, 
                        y_train, 
                        validation_data = validation_data, 
                        epochs          = epochs,
                        verbose         = VERBOSE,
                        batch_size      = BATCH_SIZE,
                        shuffle         = True,
                        callbacks       = callbacks
                       )
    
    history_list.append(history.history)
    print(f'Training loss:{history_list[-1]["loss"][-1]:.3f}')
    callbacks, es, lr, tm, history = None, None, None, None, None
    
    
    y_val_pred = model.predict(X_val, batch_size = BATCH_SIZE, verbose =VERBOSE)
    score = roc_auc_score(y_val, y_val_pred)
    print(f'Fold {run}.{fold} | {str(datetime.datetime.now() - start_time)[-12:-7]}'
          f'| AUC: {score:.5f}')
    
    score_list.append(score)
    
    tst_data_scaled = scaler.transform(tst_data[features])
    tst_pred = model.predict(tst_data_scaled)
    predictions.append(tst_pred)
    
    return model
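
The cosine schedule inside `fit_model` can be inspected on its own. The sketch below lifts the same formula into a standalone function, with the notebook's values (300 epochs, start rate 0.01, end rate 0.0002) as default arguments; turning them into parameters is an assumption made here for readability.

```python
import math

def cosine_decay(epoch, epochs=300, lr_start=0.01, lr_end=0.0002):
    # w goes from 1 at the first epoch to 0 at the last, following half a cosine wave.
    if epochs > 1:
        w = (1 + math.cos(epoch / (epochs - 1) * math.pi)) / 2
    else:
        w = 1
    # Blend the start and end learning rates with weight w.
    return w * lr_start + (1 - w) * lr_end

for epoch in (0, 75, 150, 225, 299):
    print(epoch, round(cosine_decay(epoch), 6))
```

The rate starts at `lr_start`, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches `lr_end`.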

In [None]:
%%time
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, roc_curve
import math

# Create empty lists to store NN information...
history_list = []
score_list   = []
predictions  = []

# Define kfolds for training purposes...
kf = KFold(n_splits = 5)

for fold, (trn_idx, val_idx) in enumerate(kf.split(trn_data)):
    X_train, X_val = trn_data.iloc[trn_idx][features], trn_data.iloc[val_idx][features]
    y_train, y_val = trn_data.iloc[trn_idx][TARGET], trn_data.iloc[val_idx][TARGET]
    
    fit_model(X_train, y_train, X_val, y_val)
    
print(f'OOF AUC: {np.mean(score_list):.5f}')
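
Note that the value printed above is the mean of the per-fold AUCs. A pooled out-of-fold AUC, computed once over all validation predictions, is a closely related number that can differ slightly. The sketch below shows both on synthetic data, with `LogisticRegression` as a lightweight stand-in for the neural network; neither the data nor the model comes from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic data: only the first column carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

oof = np.zeros(len(y))   # one out-of-fold prediction per row
fold_scores = []

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for trn_idx, val_idx in kf.split(X):
    model = LogisticRegression().fit(X[trn_idx], y[trn_idx])
    oof[val_idx] = model.predict_proba(X[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y[val_idx], oof[val_idx]))

print(f'Mean of fold AUCs: {np.mean(fold_scores):.5f}')
print(f'Pooled OOF AUC:    {roc_auc_score(y, oof):.5f}')
```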

In [None]:
# OOF AUC: 0.99658... 10 Folds, Batch Normalization, Using One-Hot Features, Epochs = 150, [64,64,64,16,1]...
# OOF AUC: 0.99653... 05 Folds, No Batch Normalization, Using Partial One-Hot Features, Epochs = 150, [64,64,64,16,1]...
# OOF AUC: 0.99757... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 150, [64,64,64,16,1]...
# OOF AUC: 0.99766... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 200, [64,64,64,16,1]...
# OOF AUC: 0.99771... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 300, [64,64,64,16,1]...
# OOF AUC: 0.99759... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 300, [96,64,64,16,1]...
# OOF AUC: 0.99772... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 300, [64,64,32,16,1]...
# OOF AUC: 0.99772... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 300, [64,64,32,16,1]...
# OOF AUC: 0.99769... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 300, [64,64,64,16,1], Stat Features = Yes
# OOF AUC: 0.99769... 05 Folds, No Batch Normalization, No One-Hot Features, Epochs = 300, [256,64,64,16,1], Stat Features = Yes

---

# 11. Understanding Model Behavior, Feature Importance

In [None]:
# Work in Progress...
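
One possible direction for this section is permutation importance: shuffle one feature at a time and measure how much the validation AUC drops. The sketch below is only an illustration of that idea on synthetic data, using `LogisticRegression` as a stand-in model. The `predict` helper is hypothetical; for the Keras model it would wrap `model.predict` on the scaled features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: only column 0 determines the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)

model = LogisticRegression().fit(X, y)

def predict(data):
    # Hypothetical wrapper returning the positive-class probability.
    return model.predict_proba(data)[:, 1]

baseline = roc_auc_score(y, predict(X))

importances = []
for col in range(X.shape[1]):
    X_perm = X.copy()
    # Shuffling a column breaks its link to the target while keeping its distribution.
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    importances.append(baseline - roc_auc_score(y, predict(X_perm)))

for col, drop in enumerate(importances):
    print(f'feature {col}: AUC drop {drop:.4f}')
```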

---

# 12. Baseline Model Submission File Generation

In [None]:
%%time
# Review the format of the submission file
sub.head()

In [None]:
%%time
# Populate the submission dataset with the averaged predictions and create an output file
sub['target'] = np.array(predictions).mean(axis = 0).ravel()
sub.to_csv('my_submission_050722.csv', index = False)
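
The averaging step is worth a quick shape check. Each array that `model.predict` appends to `predictions` has shape `(n_rows, 1)`, so the fold average is still two-dimensional until it is flattened. A small sketch with made-up numbers:

```python
import numpy as np

# Two folds, two test rows each; the values are illustrative assumptions.
predictions = [np.array([[0.1], [0.9]]),
               np.array([[0.3], [0.7]])]

# Stack to shape (n_folds, n_rows, 1) and average over the fold axis.
avg = np.array(predictions).mean(axis=0)

# Flatten to one value per test row before assigning to a DataFrame column.
flat = avg.ravel()

print(avg.shape, flat)
```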

In [None]:
%%time
%%script false --no-raise-error
# Create submission
print(f"{len(features)} features")

pred_list = []
for seed in range(10):
    model = fit_model(X_tr, y_tr, run = seed)
    model.fit(X_tr.values, y_tr)
    pred_list.append(scipy.stats.rankdata(model.predict(tst_data[features].values, batch_size = BATCH_SIZE)))
    print(f"{seed:2}", pred_list[-1])
print()

submission = tst_data[['id']].copy()
submission[TARGET] = np.array(pred_list).mean(axis = 0)

submission.to_csv('submission_nn_05012022.csv', index = False)
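
The disabled cell above blends ranks rather than raw probabilities. Because AUC depends only on the ordering of the predictions, averaging ranks is a valid way to combine models whose outputs are on different scales. A minimal sketch with two made-up prediction vectors:

```python
import numpy as np
from scipy.stats import rankdata

# Predictions from two hypothetical runs for the same three test rows.
preds_a = np.array([0.10, 0.80, 0.40])
preds_b = np.array([0.30, 0.95, 0.20])

# Convert each run to ranks (1 = lowest score), then average the ranks per row.
ranks = np.array([rankdata(preds_a), rankdata(preds_b)])
blend = ranks.mean(axis=0)

print(blend)
```

The blended values are no longer probabilities, which is fine for an AUC-scored submission but not for a metric such as log loss.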

In [None]:
%%time
# Review the submission file as a final step to upload to Kaggle.
sub.head()

---