<h2 style="color:#2c3f51"> TPS MAY 2022 </h2>

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import optuna
import gc

from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split

from lightgbm import LGBMClassifier

import time


# **Loading the Data**

In [3]:
train = pd.read_csv("../input/tabular-playground-series-may-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-may-2022/test.csv")

# **Data Analysis**

Concatenating train and test data.

In [4]:
data = pd.concat([train, test], sort=True).reset_index(drop=True)

data.describe()

In [5]:
data.info()

The data has 16 float64 columns, 16 int64 columns, and 1 object column.

So the data has 32 numerical variables and 1 categorical variable.
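The dtype breakdown reported by `data.info()` can also be counted directly. The sketch below uses a small made-up frame; its column names and values are illustrative assumptions, not the competition data.

```python
import pandas as pd

# Toy frame standing in for the concatenated data (hypothetical columns)
toy = pd.DataFrame({
    "f_float": [0.1, 0.2, 0.3],
    "f_int": [1, 2, 3],
    "f_str": ["ABC", "BCD", "CDE"],
})

# Count how many columns there are of each dtype
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Split the column names into numerical and non-numerical groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

Running the same two `select_dtypes` calls on `data` would give the numerical and categorical column lists without reading them off the `info()` output by hand.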

# **Feature Engineering**

**1. Feature transformation**

* Missing values Imputation
* Handling categorical Features
* Feature scaling

**Missing values Imputation**

Checking whether the data contains any missing or null values.

In [7]:
train.isna().sum().any()

# There are no missing values.
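To see what this check reports when values are missing, here is a minimal sketch on a made-up frame (the columns `a` and `b` are hypothetical). It also shows the per-column counts, which help locate a gap once one is found.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical columns)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4, 5, 6]})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# True if at least one column contains a missing value
has_missing = bool(missing_per_column.any())

print(missing_per_column[missing_per_column > 0])
print(has_missing)
```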

**Handling categorical Features  @ambrosm**

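f_27 is the only categorical column; it holds a ten-character string.
We split the f_27 string into ten separate integer features, one per character position (each character encoded as its offset from 'A'), and we add a count of the unique characters in the string.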

In [8]:
for df in [train, test]:
    for i in range(10):
        df[f'ch{i}'] = df.f_27.str.get(i).apply(ord) - ord('A')
    df["unique_characters"] = df.f_27.apply(lambda s: len(set(s)))
features = [f for f in test.columns if f != 'id' and f != 'f_27']
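To make the encoding concrete, the sketch below applies the same two steps to a toy frame. The two strings are made up for illustration; only the column name `f_27` is taken from the data.

```python
import pandas as pd

# Toy frame with two made-up ten-letter strings
toy = pd.DataFrame({"f_27": ["ABABDADBAB", "ACACCADCEB"]})

# One integer feature per character position: 'A' -> 0, 'B' -> 1, ...
for i in range(10):
    toy[f"ch{i}"] = toy["f_27"].str.get(i).apply(ord) - ord("A")

# Number of distinct characters in each string
toy["unique_characters"] = toy["f_27"].apply(lambda s: len(set(s)))

print(toy.drop(columns="f_27"))
```

Each string becomes ten small integers plus one count, so the original `f_27` column can be dropped before modelling.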

**2. Feature Selection**

Let's find the best-performing features by fitting a model, taking the top 5 features by importance, and applying further statistical methods to them.

* Splitting the data
* Hyperparameter tuning model with optuna
* Best performing features.

**Split Data**

In [9]:
y = train["target"]
X = train.drop(columns=["id","f_27", "target"])

test = test.drop(columns=["id","f_27"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

**Hyperparameter tuning**

We use Optuna to search for the parameters that give the best ROC AUC on the held-out split.

In [12]:
optuna.logging.set_verbosity(optuna.logging.WARNING)

# Settings that are fixed rather than tuned
fixed_params = {'device': 'gpu', 'n_jobs': -1, 'verbose': -1}

def objectivesLGBM(trial):
    params = {
        'max_depth' : trial.suggest_int("max_depth", 1, 16),
        'n_estimators': trial.suggest_int('n_estimators', 5, 5000),
        'random_state': trial.suggest_int("random_state", 0, 722),
        # learning_rate must be strictly positive, so the range starts above 0
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 100, 2000),
        'max_bin': trial.suggest_int('max_bin', 2, 100),
        **fixed_params
    }

    model = LGBMClassifier(**params)
    model.fit(X_train, y_train)

    # Score on held-out rows; scoring on the training rows rewards overfitting
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

opt = optuna.create_study(direction='maximize')
opt.optimize(objectivesLGBM, n_trials=5)

# best_params holds only the tuned values, so add the fixed settings back
params = {**opt.best_params, **fixed_params}
model = LGBMClassifier(**params)
model.fit(X_train, y_train)


print("Training score :", model.score(X_train, y_train))

# ROC AUC is computed from predicted probabilities, not hard class labels
pred_y_test = model.predict_proba(X_test)[:, 1]
print("Roc auc score  :", roc_auc_score(y_test, pred_y_test))
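ROC AUC is a ranking metric, so the input passed to `roc_auc_score` matters: probabilities keep the ordering of the rows, while hard labels collapse it to two values. The sketch below illustrates the difference on made-up labels and scores; the 0.5 threshold is an assumption for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.2, 0.45, 0.3, 0.48, 0.9])

# Hard labels at a 0.5 threshold: rows with the same label become tied
labels = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, labels)

print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels  :", auc_from_labels)
```

The two numbers differ because thresholding throws away the ranking among rows that end up with the same label.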

**Best features**

In [11]:
feature_imp = pd.DataFrame(sorted(zip(model.feature_importances_,X.columns)), columns=['Value','Feature'])
feature_imp = feature_imp.sort_values(by = "Value", ascending=False)

selected_features = feature_imp[:5]['Feature'].tolist()
selected_features
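The ranking step can be tried in isolation. This is a minimal sketch with hypothetical importances and feature names rather than values from the fitted model.

```python
import pandas as pd

# Hypothetical importances and feature names, for illustration only
importances = [12, 40, 7, 25, 31, 3]
names = ["f_00", "f_01", "f_02", "f_03", "f_04", "f_05"]

# Index by feature name, sort descending, and keep the top k names
ranking = pd.Series(importances, index=names).sort_values(ascending=False)
top_k = ranking.head(3).index.tolist()

print(ranking)
print(top_k)
```

A `Series` indexed by feature name keeps the value and the name together, so a single sort is enough.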

**3. Feature construction**

Adding more statistical features to the best performing features.

* Mean
* Standard Deviation
* Max - Min
* Mean absolute Deviation

In [None]:
def add_basics_features(data, features):
    # Row-wise statistics across the selected features. A column-level
    # scalar such as data[feature].mean() would only add constant columns.
    block = data[features]
    row_mean = block.mean(axis=1)

    data['selected_mean'] = row_mean
    data['selected_std'] = block.std(axis=1)
    data['selected_max_min'] = block.max(axis=1) - block.min(axis=1)
    # Series.mad() was removed in pandas 2.0, so compute it explicitly
    data['selected_mad'] = block.sub(row_mean, axis=0).abs().mean(axis=1)

    return data

X = add_basics_features(X, selected_features)
test = add_basics_features(test, selected_features)
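As a worked example of the four statistics, the sketch below computes them per row on a toy frame, assuming the statistics are taken across the selected features within each row. The column names `a`, `b`, `c` are placeholders. The mean absolute deviation is written out by hand because `Series.mad()` is no longer available in pandas 2.x.

```python
import pandas as pd

# Toy frame; the column names are placeholders for the selected features
toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 2.0], "c": [5.0, 2.0]})
cols = ["a", "b", "c"]

block = toy[cols]
row_mean = block.mean(axis=1)

stats = pd.DataFrame({
    "mean": row_mean,
    # pandas uses the sample standard deviation (ddof=1) by default
    "std": block.std(axis=1),
    "max_min": block.max(axis=1) - block.min(axis=1),
    # mean absolute deviation: average distance of each value from the row mean
    "mad": block.sub(row_mean, axis=0).abs().mean(axis=1),
})

print(stats)
```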

### Cross Validation

In [None]:
kf = StratifiedKFold(n_splits=14, shuffle=True, random_state = 0)
y_predict = []

for fold, (train_index, val_index) in enumerate(kf.split(X, y)):
        print('*'*14, f" Fold {fold} ", '*'*14, '\n')
        
        # Split Data
        # kf.split yields positional indices, so select rows with iloc
        X_train = X.iloc[train_index]
        X_val = X.iloc[val_index]

        y_train = y.iloc[train_index]
        y_val = y.iloc[val_index]
        
        # Create Model here
        model = LGBMClassifier(**params)
        
        # Fit Model
        model.fit(X_train,y_train)

        # Make X_val prediction
        y_pred = model.predict_proba(X_val)[:,1]
        
        # Make Test prediction
        y_predict.append(model.predict_proba(test)[:,1])

        # Evaluate Model
        print("Training score :", model.score(X_train, y_train))
        print("Roc auc score  :", roc_auc_score(y_val, y_pred), '\n')
        
        # Free the memory
        del X_train, y_train, model, X_val, y_val, y_pred
        gc.collect()
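The per-fold scores can be complemented by a single out-of-fold (OOF) score, in which every training row is predicted by a model that never saw it. The sketch below shows the idea on synthetic data; `LogisticRegression` and the 5-fold setting are stand-ins chosen to keep the example small, not the tuned LightGBM setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary data; LogisticRegression is a stand-in for the tuned model
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = np.full(len(y_toy), np.nan)  # out-of-fold probability for every row

for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy, y_toy)):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    # Each row is predicted exactly once, by a model that did not train on it
    oof[val_idx] = clf.predict_proba(X_toy[val_idx])[:, 1]

oof_auc = roc_auc_score(y_toy, oof)
print(f"OOF ROC AUC: {oof_auc:.4f}")
```

Unlike the mean of the per-fold scores, the OOF score is computed over all rows at once, which makes it directly comparable between runs that use different numbers of folds.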
        

### Training with complete data

In [None]:
model = LGBMClassifier(**params)

model.fit(X, y)

y_predict.append(model.predict_proba(test)[:,1])

In [None]:
del model, X, y, test, X_test, y_test
gc.collect()

In [None]:
np.array(y_predict).mean(axis=0)
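The averaging step stacks one prediction vector per model and takes the mean over the model axis. Here is a minimal sketch with three made-up prediction vectors.

```python
import numpy as np

# Three hypothetical prediction vectors, one per model, for four test rows
fold_predictions = [
    np.array([0.2, 0.8, 0.5, 0.1]),
    np.array([0.4, 0.6, 0.5, 0.3]),
    np.array([0.3, 0.7, 0.8, 0.2]),
]

# Stack into shape (n_models, n_rows) and average over the model axis
stacked = np.array(fold_predictions)
blended = stacked.mean(axis=0)

print(stacked.shape)
print(blended)
```

With `axis=0` the result has one value per test row, which is the shape the submission column expects.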

# Submission

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-may-2022/sample_submission.csv")
submission.shape

In [None]:
submission['target'] = np.array(y_predict).mean(axis=0)
submission.to_csv('submission.csv', index=False)
submission