<a href="https://www.kaggle.com/code/ahmetekiz/tps-aug-2022-blending?scriptVersionId=105980532" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Tabular Playground Series - Aug 2022

In this notebook, I train several base models for TPS Aug 2022 and blend their predictions. The competition describes the data as follows:

>This data represents the results of a large product testing study. For each **product_code** you are given a number of product **attributes** (fixed for the code) as well as a number of **measurement values** for each individual product, representing various lab testing methods. Each product is used in a simulated real-world environment experiment, and absorbs a certain amount of fluid (**loading**) to see whether or not it fails.

>Your task is to use the data to predict individual product failures of new codes with their individual lab test results.

**Evaluation**: Submissions are evaluated on area under the ROC curve between the **predicted probability** and the observed target.
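
Since AUC depends only on how the predicted probabilities rank the products, it can be read as the fraction of (failure, normal) pairs in which the failing product gets the higher score. A minimal pure-Python sketch of that reading follows; `pairwise_auc` and the toy labels and scores are illustrative and not part of the competition code.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy labels (1 = failure) and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_prob)
# Any monotonic transform of the scores keeps the ranking, hence the AUC
auc_squared = pairwise_auc(y_true, [p ** 2 for p in y_prob])
print(auc, auc_squared)
```

On this toy input both values are 0.75: three of the four (failure, normal) pairs are ordered correctly, and squaring the scores does not change the ranking. For real data, `roc_auc_score` from scikit-learn is the practical choice.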

1. My First Notebook on this Competition: https://www.kaggle.com/code/ahmetekiz/tps-aug-2022-starter

### This notebook's progress
Blending AUC score (cross-validation) - submission score
1. Blending optimized models, with missing-value indicator columns: 0.58764 - 0.58318
1. Blending basic models, without missing-value indicator columns: 0.58799 - 0.58473

<a id="0"></a> <br>
# Table of Contents    
1. [A Glance at the Data](#2) 
1. [Missing Values](#4)
1. [Preprocess](#5)
1. [Create Folds](#6)
1. [Train and Make Predictions](#7)
    1. [LogisticRegression](#8)
    1. [XGBoost](#9)
    1. [CatBoostClassifier](#10)
    1. [LGBMClassifier](#11)
    1. [Tensorflow ANN Model](#14)
1. [Merge New Sets](#17)
1. [Blending Results](#18)
1. [Submission](#30)

<a id="2"></a> <br>
# A Glance at the Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt
import seaborn as sns

# Preprocess
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import GroupKFold
from sklearn import model_selection

# Metrics
from sklearn.metrics import roc_auc_score

# Model
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC

from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

from colorama import Fore, Back, Style

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df_train_full = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2022/train.csv", index_col='id')
df_test = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2022/test.csv", index_col='id')

df_train = df_train_full.drop(['failure'], axis=1)

In [None]:
df_train_full.head()

In [None]:
df_train_full.info()

In [None]:
df_train_full.describe()

### Ratio of the Failure on Train Set

In [None]:
colors = ["#22a7f0","#c9e52f"]
labels = ['Normal', 'Failure']
explode = [0,0.05]

print(df_train_full.failure.value_counts())
           
plt.figure(figsize = (5,5))
plt.pie(df_train_full.failure.value_counts(),explode=explode, labels=labels, colors=colors, autopct='%1.1f%%')
plt.title("Failure Rate of Products on Training Set", color = 'blue',fontsize = 14)

### Ratio of NaN Values
There are many missing values. We will deal with them later.

In [None]:
cm = sns.light_palette("yellow", as_cmap=True)
pd.DataFrame({"NaN Count": df_train_full.isna().sum(),
              "NaN Ratio": df_train_full.isna().sum()/len(df_train_full)}).sort_values(by="NaN Count",
                                                                 ascending=False).style.background_gradient(cmap=cm)

<a id="4"></a> <br>
# Missing Values

In [None]:
df_train.isnull().sum()

<div class="alert alert-block alert-warning">
<b>Remember:</b> Null Values can be a sign of failure.
</div>
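
One way to check this idea is to compare the failure rate of rows where a measurement is missing with rows where it is present. The sketch below uses a made-up column name and made-up values, so it shows only the mechanics, not a finding about the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one measurement column with gaps, plus the target
toy = pd.DataFrame({
    "measurement_x": [17.5, np.nan, 18.1, np.nan, 16.9, 17.2],
    "failure":       [0,    1,      0,    1,      0,    1],
})

# 1 where the value was missing, 0 otherwise
toy["measurement_x_was_missing"] = toy["measurement_x"].isna().astype(int)

# Mean of the 0/1 target per group = failure rate per group
failure_rate = toy.groupby("measurement_x_was_missing")["failure"].mean().to_dict()
print(failure_rate)
```

If the two rates differ clearly on the real data, the `_was_missing` indicator may carry signal that imputation alone would erase.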

In [None]:
cols_with_missing = [col for col in df_train.columns if df_train[col].isnull().any()]
cols_with_missing

In [None]:
# names of the indicator columns for columns with missing values
new_columns = [f"{col}_was_missing" for col in cols_with_missing]

print(new_columns)

In [None]:
numerical_cols = [c for c in df_train.columns if df_train[c].dtypes in ['int', 'float']]
print("Numerical Columns\n", numerical_cols)

categorical_cols = [c for c in df_train.columns if df_train[c].dtypes in ['object']]
print("\nCategorical Columns\n", categorical_cols)

In [None]:
# numerical_cols += new_columns
# print(numerical_cols)

<a id="5"></a> <br>
# Preprocess

In [None]:
# Preprocessing for numerical data
# (the imputer strategy was previously 'constant')
numerical_transformer = Pipeline(steps=[('imputer',SimpleImputer(strategy='median')),
                                        ('std_scaler', StandardScaler())
                                       ]) 


# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer([
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
])

### Apply Preprocess
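Fitting the preprocessor inside each fold raises a concern: the test data contains categories (for example, new product codes) that the training data doesn't contain. `OneHotEncoder(handle_unknown='ignore')` handles this by encoding an unseen category as all zeros. Here, the preprocessor is first fitted once on the full training set to check the output shape and the generated columns.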
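
The unseen-category concern can be checked on a toy example. This is a minimal sketch with made-up product codes, assuming the same `handle_unknown='ignore'` setting as the categorical pipeline above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up codes: "F" appears only in the test data
train_codes = np.array([["A"], ["B"], ["A"]])
test_codes = np.array([["B"], ["F"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_codes)

# The encoder returns a sparse matrix by default, so convert it for display
encoded_test = encoder.transform(test_codes).toarray()
print(encoder.categories_)
print(encoded_test)
```

The row for "B" gets a 1 in the "B" column, while the row for the unseen "F" is all zeros, so the transform does not fail on new codes.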

In [None]:
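def do_preprocess(df, preprocessor, train=False, shapes=False, missing_columns=False):
    """Fit (train=True) or apply the preprocessor and return the transformed array."""
    if missing_columns:
        # Add a boolean indicator for every column that had missing values in training.
        # Note: the indicators reach the model only if the ColumnTransformer selects
        # them (e.g. via numerical_cols); its default remainder='drop' discards
        # every column that is not listed.
        df = df.copy()  # avoid mutating the caller's DataFrame
        for col in cols_with_missing:
            df[col + '_was_missing'] = df[col].isnull()

    if train:
        df = preprocessor.fit_transform(df)
    else:
        df = preprocessor.transform(df)

    if shapes:
        print(df.shape)

    return df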

In [None]:
x_train = do_preprocess(df_train, preprocessor, train=True, shapes=True)

## New Columns After Preprocessing

In [None]:
for c in preprocessor.named_transformers_['cat']['onehot'].categories_:
    print(c)

cat_one_hot_attribs = np.concatenate([c for c in preprocessor.named_transformers_['cat']['onehot'].categories_])
print("\nCategorical Columns:\n", cat_one_hot_attribs) 
print("\nNumerical Columns:\n", numerical_cols)

all_cols = np.concatenate((numerical_cols, cat_one_hot_attribs))
print("\nAll Columns:\n", all_cols)
print("\nAll Columns Shape:", all_cols.shape)

In [None]:
# Show feature importances
# code source : https://www.kaggle.com/code/ambrosm/tpsaug22-eda-which-makes-sense
def plot_model_feature_importance(feature_importance_list, features, title_name='Model Feature Importance', number_of_features=10):
    """
    feature_importance_list: array of feature importances
    features: array of column (feature) names
    title_name: title of the graph
    number_of_features: how many features to show on the graph
    """
    importance_df = pd.DataFrame(np.array(feature_importance_list).T, index=features)
    importance_df['mean'] = importance_df.mean(axis=1).abs()
    importance_df['feature'] = features
    importance_df = importance_df.sort_values('mean', ascending=False).reset_index().head(number_of_features)
    plt.figure(figsize=(14, 4))
    plt.barh(importance_df.index, importance_df['mean'], color='lightgreen')
    plt.gca().invert_yaxis()
    plt.yticks(ticks=importance_df.index, labels=importance_df['feature'])
    plt.title(title_name)
    plt.show()
    
    return importance_df # to show the dataframe

# importance_df = plot_model_feature_importance(feature_importances, all_cols, number_of_features=20)

<a id="6"></a> <br>
# Create Folds

In [None]:
# df_train_full is for kfold

df_train_full = df_train_full.copy()
df_train_full["kfold"] = -1 

df_train_full.head()

In [None]:
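kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

# kf.split yields positional indices, so assign by position rather than by 'id' label
kfold_col = df_train_full.columns.get_loc("kfold")
for fold, (train_indices, valid_indices) in enumerate(kf.split(X=df_train_full)):
    df_train_full.iloc[valid_indices, kfold_col] = fold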

df_train_full.head()

In [None]:
# keep the ids as a regular column, then reset the index
df_train_full['id'] = df_train_full.index 
df_train_full = df_train_full.reset_index(drop=True)
df_train_full.head()
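
Because the task is to predict failures for new product codes, a grouped split is a possible alternative to the shuffled `KFold` above: each validation fold then holds codes the model has not seen during training. The sketch below is not what this notebook uses; it runs `GroupKFold` on a made-up frame whose column names mirror the competition data.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Made-up frame: five product codes with two rows each
toy = pd.DataFrame({
    "product_code": list("AABBCCDDEE"),
    "loading": range(10),
})
toy["kfold"] = -1

gkf = GroupKFold(n_splits=5)
kfold_col = toy.columns.get_loc("kfold")
for fold, (_, valid_idx) in enumerate(gkf.split(toy, groups=toy["product_code"])):
    # split() yields positional indices, so assign by position
    toy.iloc[valid_idx, kfold_col] = fold

# Every code should land in exactly one fold
folds_per_code = toy.groupby("product_code")["kfold"].nunique()
print(toy)
```

With five equally sized codes and five splits, each fold validates on one whole code, which is closer to the test-time situation than mixing rows of the same code across folds.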

<a id="7"></a> <br>
# Train and Make Predictions

I used optimized parameters from: https://www.kaggle.com/code/juhjoo/0-5902-tps-aug-lightgbm-xgboost-ann-ensemble

<a id="8"></a> <br>
# LogisticRegression

In [None]:
%%time
final_test_predictions = []
final_valid_predictions = {}
scores = []  # roc auc scores

for fold in range(5): #5
    x_train = df_train_full[df_train_full.kfold != fold].reset_index(drop=True)
    x_val = df_train_full[df_train_full.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()
    
    valid_ids = x_val['id'].values.tolist()
#     print(valid_ids)
    
    y_train = x_train.failure
    y_val = x_val.failure
    
    # drop the id and target columns
    x_train = x_train.drop(['id','failure'], axis=1)
    x_val = x_val.drop(['id','failure'], axis=1)
#     print(x_train.head())
    
    #preprocess
    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer([
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
        ])
    

    x_train = do_preprocess(x_train, preprocessor, train=True)
    x_val = do_preprocess(x_val, preprocessor, train=False)
    x_test = do_preprocess(x_test, preprocessor, train=False)
    
    #select model
    model = LogisticRegression(max_iter=500, C=0.0001, penalty='l2', solver='newton-cg')
#     model = LogisticRegression()
    
    # train and predict
    model.fit(x_train, y_train)

    #Evaluating on Validation Set
    y_val_pred = model.predict_proba(x_val)[:,1]
    score = roc_auc_score(y_val, y_val_pred)
    print(f"auc = {score:.5f}")
    scores.append(score)
    print(y_val_pred.shape)
    print(type(y_val_pred))
    
    # predict test set
    y_test_pred = model.predict_proba(x_test)[:,1]    
        
    # save test and validation predictions on a list and a dict
    final_test_predictions.append(y_test_pred)
    final_valid_predictions.update(dict(zip(valid_ids, y_val_pred)))
    

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_1"]
final_valid_predictions.to_csv("train_pred_1.csv", index=False)

submission = pd.DataFrame({'id': df_test.index, 'pred_1': y_test_pred})
submission['pred_1'] = np.mean(np.column_stack(final_test_predictions), axis=1)
submission.to_csv('test_pred_1.csv', index=False)
submission
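
The bookkeeping in the loop above has two parts: validation predictions are collected per id, so every training row ends up with exactly one out-of-fold prediction, and test predictions are averaged over the folds. A minimal sketch with made-up numbers (two folds and three test rows for brevity):

```python
import numpy as np

# Made-up test predictions from two folds, three test rows each
fold_test_preds = [np.array([0.2, 0.4, 0.6]), np.array([0.4, 0.2, 0.8])]

# Stack folds as columns, then average across columns: one value per test row
test_mean = np.mean(np.column_stack(fold_test_preds), axis=1)

# Out-of-fold predictions: each fold contributes its own validation ids
oof = {}
oof.update(dict(zip([10, 11], [0.3, 0.7])))  # ids validated in fold 0
oof.update(dict(zip([12, 13], [0.1, 0.9])))  # ids validated in fold 1

print(test_mean)
print(oof)
```

Because the folds do not overlap, the dictionary never overwrites an id, and the saved `train_pred_*.csv` files can later be merged back on `id`.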

In [None]:
final_valid_predictions.head()

In [None]:
final_valid_predictions.shape

<a id="9"></a> <br>
# XGBoost

In [None]:
%%time
final_test_predictions = []
final_valid_predictions = {}
scores = []  # roc auc scores

for fold in range(5): #5
    x_train =  df_train_full[df_train_full.kfold != fold].reset_index(drop=True)
    x_val = df_train_full[df_train_full.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()
    
    valid_ids = x_val.id.values.tolist()
#     print(valid_ids)
    
    y_train = x_train.failure
    y_val = x_val.failure
    
    # drop the id and target columns
    x_train = x_train.drop(['id','failure'], axis=1)
    x_val = x_val.drop(['id','failure'], axis=1)
#     print(x_train.head())
    
    #preprocess
    preprocessor = ColumnTransformer([
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
        ])
    
    x_train = do_preprocess(x_train, preprocessor, train=True, missing_columns=True)
    x_val = do_preprocess(x_val, preprocessor, train=False, missing_columns=True)
    x_test = do_preprocess(x_test, preprocessor, train=False, missing_columns=True)
   
    # select model
    Best_trial = {'use_label_encoder': False, 
                  'n_estimators': 35410, 
                  'learning_rate': 0.020400576769972683, 
                  'subsample': 0.87, 
                  'colsample_bytree': 0.47, 
                  'max_depth': 7, 
                  'gamma': 9.200000000000001, 
                  'booster': 'gbtree', 
                  'reg_lambda': 0.08160432917864537, 
                  'reg_alpha': 0.16867965528493895, 
                  'random_state': 42, 
                  'n_jobs': 4, 
                  'min_child_weight': 9.16763314844842,
                  'eval_metric': 'auc',  # auc, rmse, mae
                  'objective': 'binary:logistic',
                  'tree_method': 'gpu_hist',  # requires a GPU runtime
                 }
    
    
    model = XGBClassifier(**Best_trial)
    
#     model = XGBClassifier()
        
    # train and predict
    model.fit(x_train, y_train)

    #Evaluating on Validation Set
    y_val_pred = model.predict_proba(x_val)[:,1]
    score = roc_auc_score(y_val, y_val_pred)
    print(f"auc = {score:.5f}")
    scores.append(score)
    
    # predict test set
    y_test_pred = model.predict_proba(x_test)[:,1]    
        
    # save test and validation predictions on a list and a dict
    final_test_predictions.append(y_test_pred)
    final_valid_predictions.update(dict(zip(valid_ids, y_val_pred)))
    

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_2"]
final_valid_predictions.to_csv("train_pred_2.csv", index=False)

submission = pd.DataFrame({'id': df_test.index, 'pred_2': y_test_pred})
submission['pred_2'] = np.mean(np.column_stack(final_test_predictions), axis=1)
submission.to_csv('test_pred_2.csv', index=False)
submission

<a id="10"></a> <br>
# CatBoostClassifier

In [None]:
%%time
final_test_predictions = []
final_valid_predictions = {}
scores = []  # roc auc scores

for fold in range(5): #5
    x_train =  df_train_full[df_train_full.kfold != fold].reset_index(drop=True)
    x_val = df_train_full[df_train_full.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()
    
    valid_ids = x_val.id.values.tolist()
#     print(valid_ids)
    
    y_train = x_train.failure
    y_val = x_val.failure
    
    # drop the id and target columns
    x_train = x_train.drop(['id','failure'], axis=1)
    x_val = x_val.drop(['id','failure'], axis=1)
#     print(x_train.head())
    
    #preprocess
    preprocessor = ColumnTransformer([
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
        ])
    
    x_train = do_preprocess(x_train, preprocessor, train=True, missing_columns=True)
    x_val = do_preprocess(x_val, preprocessor, train=False, missing_columns=True)
    x_test = do_preprocess(x_test, preprocessor, train=False, missing_columns=True)
    
    # select model
    Best_trial = {'learning_rate': 0.3772498790776553, 
                  'l2_leaf_reg': 37.93184269682747, 
                  'bagging_temperature': 1.471236889897418, 
                  'random_strength': 1.7032726681064423, 
                  'depth': 1, 
                  'min_data_in_leaf': 232,
                  'verbose': 0}
    
    model = CatBoostClassifier(**Best_trial)
#     model = CatBoostClassifier(iterations=10,verbose=5)
    
    # train and predict
    model.fit(x_train, y_train)

    #Evaluating on Validation Set
    y_val_pred = model.predict_proba(x_val)[:,1]
    score = roc_auc_score(y_val, y_val_pred)
    print(f"auc = {score:.5f}")
    scores.append(score)
    
    # predict test set
    y_test_pred = model.predict_proba(x_test)[:,1]    
        
    # save test and validation predictions on a list and a dict
    final_test_predictions.append(y_test_pred)
    final_valid_predictions.update(dict(zip(valid_ids, y_val_pred)))
    

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_3"]
final_valid_predictions.to_csv("train_pred_3.csv", index=False)

submission = pd.DataFrame({'id': df_test.index, 'pred_3': y_test_pred})
submission['pred_3'] = np.mean(np.column_stack(final_test_predictions), axis=1)
submission.to_csv('test_pred_3.csv', index=False)
submission

<a id="11"></a> <br>
# LGBMClassifier

In [None]:
%%time
final_test_predictions = []
final_valid_predictions = {}
scores = []  # roc auc scores

for fold in range(5): #5
    x_train =  df_train_full[df_train_full.kfold != fold].reset_index(drop=True)
    x_val = df_train_full[df_train_full.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()
    
    valid_ids = x_val.id.values.tolist()
#     print(valid_ids)
    
    y_train = x_train.failure
    y_val = x_val.failure
    
    # drop the id and target columns
    x_train = x_train.drop(['id','failure'], axis=1)
    x_val = x_val.drop(['id','failure'], axis=1)
#     print(x_train.head())
    
    #preprocess
    preprocessor = ColumnTransformer([
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
        ])
    
    x_train = do_preprocess(x_train, preprocessor, train=True, missing_columns=True)
    x_val = do_preprocess(x_val, preprocessor, train=False, missing_columns=True)
    x_test = do_preprocess(x_test, preprocessor, train=False, missing_columns=True)
    
    # select model
    Best_trial = {'n_estimators': 13688, 
                  'reg_alpha': 0.00029898817414380905, 
                  'reg_lambda': 0.022911278881140307, 
                  'colsample_bytree': 0.9, 
                  'num_leaves': 923, 
                  'feature_fraction': 0.4880516337950229, 
                  'bagging_fraction': 0.9230012153122733, 
                  'bagging_freq': 2, 
                  'min_child_samples': 95, 
                  'subsample': 0.61, 
                  'learning_rate': 0.05092641982004301, 
                  'max_depth': 1, 
                  'random_state': 42, 
                  'n_jobs': 4,
                  'metrics' : ['binary_logloss','auc']  # auc, rmse, mae
                  }
    
    model = LGBMClassifier(**Best_trial)
#     model = LGBMClassifier()
    
    # train and predict
    model.fit(x_train, y_train)

    #Evaluating on Validation Set
    y_val_pred = model.predict_proba(x_val)[:,1]
    score = roc_auc_score(y_val, y_val_pred)
    print(f"auc = {score:.5f}")
    scores.append(score)
    
    # predict test set
    y_test_pred = model.predict_proba(x_test)[:,1]    
        
    # save test and validation predictions on a list and a dict
    final_test_predictions.append(y_test_pred)
    final_valid_predictions.update(dict(zip(valid_ids, y_val_pred)))
    

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_4"]
final_valid_predictions.to_csv("train_pred_4.csv", index=False)

submission = pd.DataFrame({'id': df_test.index, 'pred_4': y_test_pred})
submission['pred_4'] = np.mean(np.column_stack(final_test_predictions), axis=1)
submission.to_csv('test_pred_4.csv', index=False)
submission

<a id="14"></a> <br>
# Tensorflow ANN Model

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Dropout, Flatten, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping

print("Tensorflow version:",tf.__version__)

In [None]:
def build_model():
    model = keras.Sequential()
    model.add(Dense(256, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(64, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))

    # Create a fresh optimizer for every model instead of sharing one instance across folds
    optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=['accuracy'])

    return model

In [None]:
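def step_decay(epoch):
    # Despite the name, this is inverse-time decay: lr = initial_lr / (1 + decay_rate * epoch).
    # LearningRateScheduler applies this value at the start of every epoch,
    # replacing the learning rate the optimizer was created with.
    decay_rate = 1
    initial_learning_rate = 0.01
    lrate = initial_learning_rate*(1/(1+(decay_rate*epoch)))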
#     print("Learning Rate: ",lrate)
#     print("Epoch: ",epoch)
    return lrate


lrate = tf.keras.callbacks.LearningRateScheduler(step_decay)
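
To see how quickly this schedule shrinks the learning rate, the same formula can be evaluated for the first few epochs without TensorFlow. The function name `inverse_time_decay` is only illustrative; the constants are the ones used above.

```python
def inverse_time_decay(epoch, initial_learning_rate=0.01, decay_rate=1):
    """Same formula as the schedule above: lr0 / (1 + decay_rate * epoch)."""
    return initial_learning_rate * (1 / (1 + decay_rate * epoch))


schedule = [inverse_time_decay(epoch) for epoch in range(5)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.5f}")
```

The rate halves after the first epoch and is down to a fifth of its start value by epoch 4, so most of the learning happens in the earliest epochs.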

In [None]:
def plot_history(history):
    
    loss = history.history["loss"]
    val_loss = history.history["val_loss"]
    epochs = range(1, len(loss) + 1)
    plot1 = plt.figure(1)
    plt.plot(epochs, loss, "bo", label="Training loss")
    plt.plot(epochs, val_loss, "b", label="Validation loss")
    plt.title("Training and validation loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()
    plt.show()
    
    acc = history.history["accuracy"]
    val_acc = history.history["val_accuracy"]
    plot2 = plt.figure(2)
    plt.plot(epochs, acc, "bo", label="Training accuracy")
    plt.plot(epochs, val_acc, "b", label="Validation accuracy")
    plt.title("Training and validation accuracy")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()

In [None]:
%%time
final_test_predictions = []
final_valid_predictions = {}
scores = []  # roc auc scores

for fold in range(5): #5
    x_train =  df_train_full[df_train_full.kfold != fold].reset_index(drop=True)
    x_val = df_train_full[df_train_full.kfold == fold].reset_index(drop=True)
    x_test = df_test.copy()
    
    valid_ids = x_val.id.values.tolist()
#     print(valid_ids)
    
    y_train = x_train.failure
    y_val = x_val.failure
    
    # drop the id and target columns
    x_train = x_train.drop(['id','failure'], axis=1)
    x_val = x_val.drop(['id','failure'], axis=1)
#     print(x_train.head())
    
    #preprocess
    preprocessor = ColumnTransformer([
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
        ])
    
    x_train = do_preprocess(x_train, preprocessor, train=True, missing_columns=True)
    x_val = do_preprocess(x_val, preprocessor, train=False, missing_columns=True)
    x_test = do_preprocess(x_test, preprocessor, train=False, missing_columns=True)
    
    # select model
    model = build_model()
    num_epochs = 200
    early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=20)
    history = model.fit(x_train, y_train, 
                        validation_data=(x_val, y_val), 
                        epochs=num_epochs, 
                        batch_size=2048, 
                        verbose=0, 
                        callbacks=[lrate, early_stopping])

    plot_history(history)
    
    #Evaluating on Validation Set
    y_val_pred = model.predict(x_val).reshape(-1)
    score = roc_auc_score(y_val, y_val_pred)
    print(f"auc = {score:.5f}")
    scores.append(score)
    
    # predict test set
    y_test_pred = model.predict(x_test).reshape(-1)
        
    # save test and validation predictions on a list and a dict
    final_test_predictions.append(y_test_pred)
    final_valid_predictions.update(dict(zip(valid_ids, y_val_pred)))
    

print(np.mean(scores), np.std(scores))
final_valid_predictions = pd.DataFrame.from_dict(final_valid_predictions, orient="index").reset_index()
final_valid_predictions.columns = ["id", "pred_5"]
final_valid_predictions.to_csv("train_pred_5.csv", index=False)

submission = pd.DataFrame({'id': df_test.index, 'pred_5': y_test_pred})
submission['pred_5'] = np.mean(np.column_stack(final_test_predictions), axis=1)
submission.to_csv('test_pred_5.csv', index=False)
submission

<a id="17"></a> <br>
# Merge New Sets

In [None]:
df1 = pd.read_csv("train_pred_1.csv")
df2 = pd.read_csv("train_pred_2.csv")
df3 = pd.read_csv("train_pred_3.csv")
df4 = pd.read_csv("train_pred_4.csv")
df5 = pd.read_csv("train_pred_5.csv")

df_test1 = pd.read_csv("test_pred_1.csv")
df_test2 = pd.read_csv("test_pred_2.csv")
df_test3 = pd.read_csv("test_pred_3.csv")
df_test4 = pd.read_csv("test_pred_4.csv")
df_test5 = pd.read_csv("test_pred_5.csv")

df = df_train_full.copy()

df = df.merge(df1, on="id", how="left")
df = df.merge(df2, on="id", how="left")
df = df.merge(df3, on="id", how="left")
df = df.merge(df4, on="id", how="left")
df = df.merge(df5, on="id", how="left")

df_test = df_test.merge(df_test1, on="id", how="left")
df_test = df_test.merge(df_test2, on="id", how="left")
df_test = df_test.merge(df_test3, on="id", how="left")
df_test = df_test.merge(df_test4, on="id", how="left")
df_test = df_test.merge(df_test5, on="id", how="left")

print(df.columns)

df.head()

In [None]:
df_test.head()

In [None]:
df.columns

<a id="18"></a> <br>
# Blending Results

In [None]:
useful_features = ["pred_1", "pred_2", "pred_3", "pred_4", "pred_5"]

numerical_cols_2 = [c for c in df_train.columns if df_train[c].dtypes in ['int', 'float']]
print("Numerical Columns\n", numerical_cols_2)

categorical_cols = [c for c in df_train.columns if df_train[c].dtypes in ['object']]
print("\nCategorical Columns\n", categorical_cols)

In [None]:
useful_features = ["pred_1", "pred_2", "pred_3", "pred_4", "pred_5"]
# df_test = df_test[useful_features]

# useful_features = np.concatenate((numerical_cols_2, categorical_cols, useful_features))

final_predictions = []
scores = []

for fold in range(5):
    x_train =  df[df.kfold != fold]
    x_val = df[df.kfold == fold]
    x_test = df_test.copy()

    y_train = x_train.failure
    y_val = x_val.failure

        
    x_train = x_train[useful_features]
    x_val = x_val[useful_features]
    x_test = x_test[useful_features]
    
    #preprocess
    preprocessor = StandardScaler()
    
    x_train = preprocessor.fit_transform(x_train)
    x_val = preprocessor.transform(x_val)
    x_test =  preprocessor.transform(x_test)
    
    # train model
#     model = LogisticRegression()
    model = LogisticRegression(max_iter=500, C=0.0001, penalty='l2', solver='newton-cg')
    model.fit(x_train, y_train)
    
    #Evaluating on Validation Set
    y_val_pred = model.predict_proba(x_val)[:,1]
    score = roc_auc_score(y_val, y_val_pred)
    print(f"auc = {score:.5f}")
    scores.append(score)
    
    # predict test set
    y_test_pred = model.predict_proba(x_test)[:,1]  
    
    preds_valid = model.predict(x_val)
    test_preds = model.predict(x_test)
    
    final_predictions.append(y_test_pred)
    

print(np.mean(scores), np.std(scores))
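
A useful sanity check for a stacked blend like the one above is a plain average of the base predictions, since blending helps mostly when the base models make different mistakes. The sketch below is a baseline for comparison, not the method used in this notebook, and all numbers are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up targets and predictions from two hypothetical base models
y = np.array([0, 0, 1, 1])
pred_a = np.array([0.1, 0.6, 0.4, 0.8])  # ranks the second row too high
pred_b = np.array([0.5, 0.1, 0.4, 0.9])  # ranks the first row too high

# Plain average of the two prediction columns
blend = np.mean(np.column_stack([pred_a, pred_b]), axis=1)

auc_a = roc_auc_score(y, pred_a)
auc_b = roc_auc_score(y, pred_b)
auc_blend = roc_auc_score(y, blend)
print(auc_a, auc_b, auc_blend)
```

Each toy model scores 0.75 alone, while the average scores 1.0 because the two models misrank different rows. If the logistic-regression blend on the real data does not beat such an average, the meta-model is probably adding little.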

In [None]:
# coefficients of the blending model fitted in the last fold
feature_importances = model.coef_.ravel()
model_names = ["Logistic Regression", "XGBoost", "CatBoostClassifier", "LGBMClassifier", 'ANN']
importance_df = plot_model_feature_importance(feature_importances, model_names, number_of_features=20)

<a id="30"></a> <br>
# Submission

In [None]:
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2022/sample_submission.csv')
sample_submission.head()

In [None]:
sample_submission.failure = np.mean(np.column_stack(final_predictions), axis=1)
sample_submission.to_csv("submission.csv", index=False)
sample_submission