# Predicting States of Manufacturing Control Data 🏭

**Objective:** Build a powerful GBDT model that predicts the machine state well, measured by ROC AUC on held-out data.

**Strategy:** I plan to work in two levels:

**Level 1 Getting Started**

* Quick EDA to identify potential opportunities.
* Simple pre-processing step to encode categorical features.
* A basic hold-out strategy using 80% for training and 20% for validation (see `test_size_pct` in section 9).
* Looking at the feature importances.
* Creating a submission file.
* Submit the file to Kaggle.

**Level 2 Feature Engineering**
* Feature engineering using text information. (Massive boost in the score)
* Cross validation loop (**Work in Progress...**)

---

**Data Description**

For this challenge, you are given (simulated) manufacturing control data and are tasked to predict whether the machine is in state 0 or state 1. 
The data has various feature interactions that may be important in determining the machine state.

Good luck!

**Files**
* train.csv - the training data, which includes normalized continuous data and categorical data
* test.csv - the test set; your task is to predict the binary target variable, which represents the state of a manufacturing process
* sample_submission.csv - a sample submission file in the correct format

---

**Notebooks Ideas and Credits**

I took ideas and inspiration from the following notebooks. If you enjoy my work, please take a look at the notebooks that inspired it.

**TPSMAY22 Gradient-Boosting Quickstart:** https://www.kaggle.com/code/ambrosm/tpsmay22-gradient-boosting-quickstart/notebook




---

# 1. Loading the Required Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

---


# 2. Setting the Notebook

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to only use 2 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 3. Loading the Information (CSV) Into A Dataframe

In [None]:
%%time
# Load the CSV information into a Pandas DataFrame...
trn_data = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/train.csv')
tst_data = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/test.csv')

sub = pd.read_csv('/kaggle/input/tabular-playground-series-may-2022/sample_submission.csv')

---

# 4. Exploring the Information Available

## 4.1. Analysing the Train Dataset

In [None]:
%%time
# Explore the shape of the DataFrame...
trn_data.shape

In [None]:
%%time
# Display simple information of the variables in the dataset...
trn_data.info()

In [None]:
%%time
# Display the first few rows of the DataFrame...
trn_data.head()

In [None]:
%%time
# Generate a simple statistical summary of the DataFrame, Only Numerical...
trn_data.describe()

In [None]:
%%time
# Calculates the total number of missing values...
trn_data.isnull().sum().sum()

In [None]:
%%time
# Display the number of missing values by variable...
trn_data.isnull().sum()

In [None]:
%%time
# Display the number of unique values for each variable...
trn_data.nunique()

In [None]:
# Display the number of unique values for each variable, sorted by quantity...
trn_data.nunique().sort_values(ascending = True)

In [None]:
%%time
# Check some of the categorical variables
categ_cols = ['f_29','f_30','f_13', 'f_18','f_17','f_14','f_11','f_10','f_09','f_15','f_07','f_12','f_16','f_08','f_27']
trn_data[categ_cols].sample(5)

In [None]:
%%time
# Generate a quick correlation matrix to understand the dataset better.
# f_27 holds text, so only the numeric columns are passed to corr().
correlation = trn_data.select_dtypes('number').corr()

In [None]:
%%time
# Display the correlation matrix
correlation

In [None]:
%%time
# Check the most correlated variables to the target
correlation['target'].sort_values(ascending = False)[:5]

In [None]:
%%time
# Check the most negatively correlated variables to the target
correlation['target'].sort_values(ascending = True)[:5]
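
Sorting by the signed correlation puts strong negative relationships at the bottom of one list and the top of the other. A minimal sketch of ranking by absolute correlation instead, on a made-up frame (the column values are illustrative, not the competition data):

```python
import pandas as pd

# Hypothetical toy frame: two numeric features, one text feature, one target.
toy = pd.DataFrame({
    'f_00':   [0.1, 0.4, 0.35, 0.8, 0.9, 0.2],
    'f_01':   [1.0, 0.9, 0.7, 0.2, 0.1, 0.8],
    'f_27':   ['AB', 'BA', 'AA', 'BB', 'AB', 'BA'],
    'target': [0, 0, 0, 1, 1, 0],
})

# Keep numeric columns only, then take each feature's correlation with the target.
corr_to_target = toy.select_dtypes('number').corr()['target'].drop('target')

# Rank by strength of the linear relationship, ignoring its sign.
ranked = corr_to_target.abs().sort_values(ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a tree model.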

---

## 4.2. Analysing the Train Labels Dataset

In [None]:
%%time
# Check how well balanced the dataset is
trn_data['target'].value_counts()

In [None]:
%%time
# Check some statistics on the target variable
trn_data['target'].describe()
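
Raw counts are easier to compare as proportions. A minimal sketch on a made-up target column (the values are illustrative, not the competition data):

```python
import pandas as pd

# Hypothetical toy target; the notebook itself uses trn_data['target'].
toy_target = pd.Series([0, 1, 1, 0, 1, 0, 0, 1, 1, 1], name='target')

# normalize=True returns class shares instead of raw counts.
class_share = toy_target.value_counts(normalize=True).sort_index()
print(class_share)

# For a 0/1 target, the mean equals the share of the positive class.
positive_rate = toy_target.mean()
print(f'Positive rate: {positive_rate:.2f}')
```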

---

# 5. Feature Engineering

## 5.1 Text-Based Features

In [None]:
%%time
# The idea is to create a simple function that counts the letters in feature 27.
# Feature 27 seems quite important.

def count_sequence(df, field):
    '''
    For each letter of the alphabet, add a new feature holding the number of
    occurrences of that letter in the given text field.
    '''
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

    for letter in alphabet:
        df[letter + '_count'] = df[field].str.count(letter)

    # Use the field argument here as well (it was hard-coded to 'f_27').
    df["unique_characters"] = df[field].apply(lambda s: len(set(s)))
    return df

In [None]:
%%time
# Use the new function to generate more features (disabled; count_chars below is used instead).
# trn_data = count_sequence(trn_data, 'f_27')
# tst_data = count_sequence(tst_data, 'f_27')
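
To make the letter-count idea concrete, here is a minimal sketch on made-up strings. It assumes `f_27` holds short strings of uppercase letters; the three values below are illustrative only.

```python
import pandas as pd

# Hypothetical toy strings standing in for f_27.
toy = pd.DataFrame({'f_27': ['ABABABABAB', 'AAAAAAAAAA', 'ABCDEFGHIJ']})

# Count how often a given letter occurs in each string.
toy['A_count'] = toy['f_27'].str.count('A')
toy['B_count'] = toy['f_27'].str.count('B')

# Number of distinct letters per string.
toy['unique_characters'] = toy['f_27'].apply(lambda s: len(set(s)))
print(toy)
```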

In [None]:
%%time
def count_chars(df, field):
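    '''
    For each of the first 10 character positions of the text field, add an
    integer feature ch_i (A -> 0, B -> 1, ...), plus the number of distinct
    characters in the string.
    '''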
    
    for i in range(10):
        df[f'ch_{i}'] = df[field].str.get(i).apply(ord) - ord('A')
        
    df["unique_characters"] = df[field].apply(lambda s: len(set(s)))
    return df

In [None]:
%%time
# Use the new function to generate more features.
trn_data = count_chars(trn_data, 'f_27')
tst_data = count_chars(tst_data, 'f_27')
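
The positional encoding is easiest to see on a couple of made-up strings. This sketch assumes, as the function above does, that every string has at least 10 uppercase letters; the two values are illustrative only.

```python
import pandas as pd

# Hypothetical toy strings standing in for f_27.
toy = pd.DataFrame({'f_27': ['ABABABABAB', 'BBBCCCAAAD']})

# Position i of the string becomes an integer: 'A' -> 0, 'B' -> 1, 'C' -> 2, ...
for i in range(10):
    toy[f'ch_{i}'] = toy['f_27'].str.get(i).apply(ord) - ord('A')

print(toy[['f_27', 'ch_0', 'ch_1', 'ch_9']])
```

Unlike the letter counts, these features keep the order of the characters, which a tree model can split on position by position.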

---

# 7. Pre-Processing Labels

In [None]:
%%time
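# Define a label encoding function
from sklearn.preprocessing import LabelEncoder

def encode_features(train_df, test_df, cols = ('f_27',)):
    # Fit one encoder per column on train and test together, so that the
    # same string always maps to the same integer code in both frames.
    for col in cols:
        encoder = LabelEncoder()
        encoder.fit(pd.concat([train_df[col], test_df[col]]))
        train_df[col + '_enc'] = encoder.transform(train_df[col])
        test_df[col + '_enc'] = encoder.transform(test_df[col])
    return train_df, test_df

trn_data, tst_data = encode_features(trn_data, tst_data)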

In [None]:
# Check the results of the transformation
trn_data.head()
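
A label encoder assigns codes in sorted order of the values it was fitted on, so two encoders fitted on different frames can give the same string different codes. A minimal sketch with made-up strings (not competition data):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns standing in for f_27 in train and test.
toy_train = pd.Series(['AAB', 'ABC', 'CCC'])
toy_test = pd.Series(['ABC', 'BBB'])

# Fitted separately, the string 'ABC' receives a different code in each frame.
code_in_train = LabelEncoder().fit(toy_train).transform(['ABC'])[0]
code_in_test = LabelEncoder().fit(toy_test).transform(['ABC'])[0]
print(code_in_train, code_in_test)

# Fitted once on the union, the codes agree across both frames.
shared = LabelEncoder().fit(pd.concat([toy_train, toy_test]))
train_codes = shared.transform(toy_train)
test_codes = shared.transform(toy_test)
print(train_codes, test_codes)
```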

---

# 8. Feature Selection for Baseline Model

In [None]:
%%time
# Define what will be used in the training stage
ignore = ['id', 'target', 'f_27',  'f_27_enc'] # Exclude the raw f_27 text and its label-encoded copy...

features = [feat for feat in trn_data.columns if feat not in ignore]
target_feature = 'target'

---

# 9. Creating a Simple Train / Test Split Strategy

In [None]:
%%time
# Create a simple train / validation split for the baseline model
from sklearn.model_selection import train_test_split
test_size_pct = 0.20
X_train, X_valid, y_train, y_valid = train_test_split(trn_data[features], trn_data[target_feature], test_size = test_size_pct, random_state = 42)
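
The split above is purely random. If the classes turn out to be imbalanced, passing `stratify` keeps the class ratio the same in both parts. A minimal sketch on synthetic data (the arrays and the 30% positive rate are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data with roughly 30% positives.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.3).astype(int)

# stratify=y_toy preserves the class ratio in the train and validation parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

print(f'all: {y_toy.mean():.3f}  train: {y_tr.mean():.3f}  valid: {y_va.mean():.3f}')
```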

---

# 10. Building a Baseline GBT Model, Simple Split

## 10.1 XGBoost Model

In [None]:
%%time
%%script false --no-raise-error
# Import the model libraries
from xgboost  import XGBClassifier

In [None]:
%%time
%%script false --no-raise-error
# Define the model parameters; to get started, most values are left at their defaults
xgb_params = {'n_estimators'     : 8192,
              'min_child_weight' : 96,
              #'max_depth'        : 6,
              #'learning_rate'    : 0.15,
              #'subsample'        : 0.95,
              #'colsample_bytree' : 0.95,
              #'reg_lambda'       : 1.50,
              #'reg_alpha'        : 1.50,
              #'gamma'            : 1.50,
              'max_bin'          : 512,
              'random_state'     : 46,
              'objective'        : 'binary:logistic',
              'tree_method'      : 'gpu_hist',
             }

In [None]:
%%time
%%script false --no-raise-error
# Instantiate the XGBoost model using the previous parameters
xgb = XGBClassifier(**xgb_params)
xgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'], early_stopping_rounds = 256, verbose = 250)

In [None]:
%%time
%%script false --no-raise-error
# Check the model performance in the validation dataset
from sklearn.metrics import roc_auc_score
val_preds = xgb.predict_proba(X_valid[features])[:, 1]
roc_auc_score(y_valid, val_preds)

In [None]:
# Record some of the model results for future improvement
# Local Score = 0.9454953628406088 First Model Run >>> LB Score = 0.93147
# Local Score = 0.9448767329168479 First Model Run >>> LB Score = 0.93205

# 0.9816418086418166

---

## 10.2 LGBM Model

In [None]:
%%time
%%script false --no-raise-error
# Import the model libraries
from lightgbm import LGBMClassifier

In [None]:
%%time
%%script false --no-raise-error
# Define the model parameters; to get started, most values are left at their defaults
lgb_params = {'n_estimators'      : 8192,
              'min_child_samples' : 96,
              'max_bins'          : 512,
              'random_state'      : 46,
             }

In [None]:
%%time
%%script false --no-raise-error
# Instantiate the LightGBM model using the previous parameters
lgb = LGBMClassifier(**lgb_params)
lgb.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'], early_stopping_rounds = 256, verbose = 250)

In [None]:
%%time
%%script false --no-raise-error
# Check the model performance in the validation dataset
from sklearn.metrics import roc_auc_score
val_preds = lgb.predict_proba(X_valid[features])[:, 1]
roc_auc_score(y_valid, val_preds)

---

# 11. Building a Baseline GBT Model, Kfold Loop

In [None]:
%%time
from lightgbm import LGBMClassifier
from xgboost  import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, roc_curve
import math

In [None]:
%%time
# Define the model parameters for both libraries
lgb_params = {'n_estimators'      : 8192, # Was 8192...
              'min_child_samples' : 96,
              'max_bins'          : 512,
              'random_state'      : 46,
             }

xgb_params = {'n_estimators'     : 8192,
              'min_child_weight' : 96,
              'max_depth'        : 6,
              'learning_rate'    : 0.15,
              'subsample'        : 0.95,
              'colsample_bytree' : 0.95,
              'reg_lambda'       : 1.50,
              'reg_alpha'        : 1.50,
              'gamma'            : 1.50,
              'max_bin'          : 512,
              'random_state'     : 46,
              'objective'        : 'binary:logistic',
              'tree_method'      : 'gpu_hist',
             }

In [None]:
%%time
# Create empty lists to store the fold scores and the test predictions...

score_list   = []
predictions  = [] 
# Define kfolds for training purposes...
kf = KFold(n_splits = 5)

for fold, (trn_idx, val_idx) in enumerate(kf.split(trn_data)):
    print(f'Training Fold {fold} ...')
    X_train, X_valid = trn_data.iloc[trn_idx][features], trn_data.iloc[val_idx][features]
    y_train, y_valid = trn_data.iloc[trn_idx][target_feature], trn_data.iloc[val_idx][target_feature]
    
    # LGBM (Uncomment to use, and Comment the XGBoost Part... LGBM Takes forever)
    # model = LGBMClassifier(**lgb_params)
    # model.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'], early_stopping_rounds = 256, verbose = 0)
    
    # XGBoost
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['auc'], early_stopping_rounds = 256, verbose = 0)
    
    y_valid_pred = model.predict_proba(X_valid.values)[:,1]
    score = roc_auc_score(y_valid, y_valid_pred)

    score_list.append(score)
    print(f"Fold {fold}, AUC = {score:.3f}")
    print()
    
    tst_pred = model.predict_proba(tst_data[features].values)[:,1]
    predictions.append(tst_pred)

print(f'Mean fold AUC: {np.mean(score_list):.3f}')
print('.........')
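
The loop above averages the per-fold AUC values. A related figure is the out-of-fold (OOF) AUC, which scores all validation predictions together in one pass. The sketch below shows both on synthetic data, using `LogisticRegression` as a fast stand-in for the boosted models; the data and the model choice are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Hypothetical synthetic binary classification data.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, n_clusters_per_class=1, random_state=0
)

oof_pred = np.zeros(len(y_toy))  # one validation prediction per training row
fold_scores = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for trn_idx, val_idx in kf.split(X_toy):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_toy[trn_idx], y_toy[trn_idx])
    # Each row is predicted by the one model that did not see it in training.
    oof_pred[val_idx] = clf.predict_proba(X_toy[val_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_toy[val_idx], oof_pred[val_idx]))

mean_fold_auc = np.mean(fold_scores)
oof_auc = roc_auc_score(y_toy, oof_pred)
print(f'Mean fold AUC: {mean_fold_auc:.3f}  OOF AUC: {oof_auc:.3f}')
```

The two numbers are usually close but not identical, because AUC is a ranking metric and the OOF version also compares predictions across folds.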

# 11. Understanding Model Behavior, Feature Importance

In [None]:
%%time
# Define a function to plot the feature importance properly
def plot_feature_importance(importance, names, model_type, max_features = 10):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_df = fi_df.head(max_features)

    #Define size of bar plot
    plt.figure(figsize=(8,6))
    
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
%%time
# Use the feature importance function to visualize the most valuable features.
# `model` is the estimator from the last fold of the loop above (XGBoost as written).
import seaborn as sns
import matplotlib.pyplot as plt
plot_feature_importance(model.feature_importances_,X_train.columns,'XGBoost ', max_features = 25)
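
The same ranking can be inspected as a table without plotting. A minimal sketch with made-up importance values (the names and numbers are hypothetical, not results from this model):

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and importance scores.
names = np.array(['f_00', 'f_01', 'ch_0', 'unique_characters'])
importance = np.array([0.10, 0.05, 0.30, 0.55])

fi_df = (
    pd.DataFrame({'feature_names': names, 'feature_importance': importance})
    .sort_values('feature_importance', ascending=False)
    .head(3)                      # keep the top features only
    .reset_index(drop=True)
)

# Express each importance as a share of the total over all features.
fi_df['share'] = fi_df['feature_importance'] / importance.sum()
print(fi_df)
```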

---

# 12. Baseline Model Submission File Generation

In [None]:
%%time
# Review the format of the submission file
sub.head()

In [None]:
%%time
# Average the fold predictions into the submission DataFrame and write the output file
sub['target'] = np.array(predictions).mean(axis=0)
sub.to_csv('my_submission_043022.csv', index = False)
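
The averaging step stacks one prediction vector per fold and takes the mean down the fold axis. A minimal sketch with made-up probabilities for three folds and three test rows:

```python
import numpy as np

# Hypothetical per-fold test predictions (one array per fold).
fold_preds = [
    np.array([0.2, 0.8, 0.5]),
    np.array([0.4, 0.6, 0.5]),
    np.array([0.3, 0.7, 0.8]),
]

stacked = np.array(fold_preds)   # shape: (n_folds, n_test_rows)
blend = stacked.mean(axis=0)     # average over folds, one value per test row
print(stacked.shape, blend)
```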

In [None]:
%%time
# Review the submission file as a final step to upload to Kaggle.
sub.head()

---