##### PROBLEM STATEMENT: Cirrhosis Patient Survival Prediction

Develop a machine learning model using 19 clinical features to predict the survival state of patients with liver cirrhosis. The survival states are D (death), C (censored), and CL (censored due to liver transplantation); this notebook encodes them as D = 0, CL = 1, and C = 2. The data come from a Mayo Clinic study on primary biliary cirrhosis (PBC) of the liver conducted between 1974 and 1984.

# 1. LOAD DATA

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
train= pd.read_csv('/kaggle/input/playground-series-s3e26/train.csv')
test=pd.read_csv('/kaggle/input/playground-series-s3e26/test.csv')
submission=pd.read_csv('/kaggle/input/playground-series-s3e26/sample_submission.csv')
original = pd.read_csv('../input/cirrhosis-patient-survival-prediction/cirrhosis.csv')

In [None]:
original.shape


In [None]:
train.shape

In [None]:
train.drop(columns=["id"],inplace=True)
test.drop(columns=["id"],inplace=True)
original.drop(columns=["ID"],inplace=True)

In [None]:
train_copy=train.copy()
test_copy=test.copy()
original_copy=original.copy()
original["original"]=1

train["original"]=0
test["original"]=0



target='Status'

train.head()

# 2. DATA CLEANING
## 2.1 MISSING VALUE CHECKS

In [None]:
# Import necessary libraries
from prettytable import PrettyTable  # Renders a plain-text table

# Create a PrettyTable object to store information about the data
table = PrettyTable()

# Set names for the columns in the table
table.field_names = ['Feature', 'Data Type', 'Train Missing %', 'Test Missing %', 'Original Missing %']

# Iterate through each column in the 'train_copy' DataFrame
for column in train_copy.columns:
    # Get the data type of the current column
    data_type = str(train_copy[column].dtype)
    
    # Percentage of missing values in 'train_copy' for the current column
    missing_pct_train = np.round(100 - train_copy[column].count() / train_copy.shape[0] * 100, 1)
    
    # The target column does not exist in the test data
    if column != target:
        # Percentage of missing values in 'test_copy' for the current column
        missing_pct_test = np.round(100 - test_copy[column].count() / test_copy.shape[0] * 100, 1)
    else:
        missing_pct_test = "NA"
    
    # Percentage of missing values in 'original_copy' for the current column
    missing_pct_orig = np.round(100 - original_copy[column].count() / original_copy.shape[0] * 100, 1)
    
    # Add a row to the table with information about the current column
    table.add_row([column, data_type, missing_pct_train, missing_pct_test, missing_pct_orig])

# Print the table
print(table)

# Drop rows with missing values only after measuring them,
# otherwise the 'Train Missing %' column would always read 0
train_copy = train_copy.dropna()


In [None]:
msno.matrix(original_copy, color=  (0.8, 0.56, 0.65))
plt.title("Original Data Missing Value Matrix", fontsize=16)
plt.show()

The share of missing values in the original set is quite large, more than 25% in many columns, so the rows with missing values had to be dropped.
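
The per-column missing rate behind that decision can also be computed directly with pandas. The following is a minimal sketch on a made-up frame (the values in `toy` are illustrative and are not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the original dataset (values are made up)
toy = pd.DataFrame({
    "Bilirubin": [1.1, np.nan, 0.8, 2.3],
    "Copper": [np.nan, np.nan, 54.0, 120.0],
    "Stage": [1.0, 2.0, 3.0, 4.0],
})

# Fraction of missing cells per column, expressed as a percentage
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)

# Dropping incomplete rows keeps only fully observed patients
complete = toy.dropna()
print(complete.shape)
```

Dropping rows is the simplest option; it trades sample size for not having to choose an imputation strategy.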

When I checked the file, the data unexpectedly had no duplicate rows.

In [None]:
train_copy.duplicated().sum()


In [None]:
original_copy = original_copy.dropna()
original_copy.shape

In [None]:
train_copy=pd.concat([train_copy,original_copy],axis=0)
train_copy.reset_index(inplace=True,drop=True)
train_copy.shape

# 3. EXPLORATORY DATA
## 3.1 Dataset info

Features in the dataset are: 'N_Days', 'Drug', 'Age', 'Sex', 'Ascites', 'Hepatomegaly', 'Spiders', 'Edema', 'Bilirubin', 'Cholesterol', 'Albumin', 'Copper', 'Alk_Phos', 'SGOT', 'Tryglicerides', 'Platelets', 'Prothrombin', 'Stage', and the target 'Status'.

N_Days: Number of days of follow-up, from registration to death, transplantation, or the study analysis time.

Drug: The drug given in the trial (D-penicillamine or placebo).

Age: Patient's age, recorded in days.

Sex: Patient's sex (M/F).

Ascites: The presence of fluid in the abdomen (N/Y).

Hepatomegaly: The presence of an enlarged liver (N/Y).

Spiders: The presence of spider angiomas, small dilated blood vessels visible in the skin (N/Y).

Edema: Swelling caused by fluid accumulation in the tissues (N, S, or Y).

Bilirubin: Serum bilirubin level, an indicator of liver health.

Cholesterol: Serum cholesterol level.

Albumin: A blood protein that plays an important role in maintaining oncotic pressure and fluid balance.

Copper: The amount of copper excreted in the urine.

Alk_Phos: Alkaline phosphatase enzyme level in the blood.

SGOT: Aspartate aminotransferase enzyme level in the blood.

Tryglicerides: Triglyceride level in the blood (the column name keeps the dataset's spelling).

Platelets: Platelet count in the blood.

Prothrombin: Prothrombin time, a measure of how long the blood takes to clot.

Stage: Histologic stage of the disease (1 to 4).

Status: Patient status (D, C, or CL); this is the target variable.

In [None]:
train_copy.info()

## 3.2 Numerical Feature Distributions

In [None]:
# Filter numerical columns
cont_cols = [f for f in train_copy.columns if train_copy[f].dtype != 'O' and train_copy[f].nunique() > 2 and f != target]

# Number of rows in the subplot grid
n_rows = len(cont_cols)

# Create figure and axes
fig, axs = plt.subplots(n_rows, 2, figsize=(12, 4 * n_rows))
fig.tight_layout()

# Set color palette
sns.set_palette([(0.8, 0.56, 0.65), 'crimson',  (0.99, 0.8, 0.3)])

# Iterate through each numerical column and plot violin plots
for i, col in enumerate(cont_cols):
    # Use flat to flatten axs into a 1D array
    sns.violinplot(x=target, y=col, data=train_copy, ax=axs.flat[i])
    axs.flat[i].set_title(f'{col.title()} Distribution by Target (Train)', fontsize=14)
    axs.flat[i].set_xlabel('Outcome', fontsize=12)
    axs.flat[i].set_ylabel(col.title(), fontsize=12)
    sns.despine()

# Remove excess plots
for ax_row in axs:
    for ax in ax_row:
        if not ax.has_data():
            fig.delaxes(ax)    

plt.show()


In [None]:
plt.figure(figsize=(10, 8))
mask = np.triu(np.ones_like(train_copy[cont_cols].corr(), dtype=bool))
sns.heatmap(train_copy[cont_cols].corr(), annot=True, mask=mask)
plt.show()

## 3.3 Categorical Features Analysis
The categorical columns are selected with two conditions.

Part 1:

(train_copy[f].dtype != 'O' and train_copy[f].nunique() / train_copy.shape[0] < 0.025):
Select columns whose data type is not object ('O') and whose number of unique values is less than 2.5% of the number of rows in the DataFrame. Numeric columns with so few distinct values are treated as categorical here.

Part 2:

(train_copy[f].dtype == 'O' and f != target):
Select columns whose data type is object ('O'), except the target column, which is analyzed separately.
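
The selection rule can be tried on a small frame to see which columns it picks. This is a sketch under stated assumptions: `select_categorical` is a hypothetical helper name, the toy values are made up, and the string columns are created with an explicit object dtype so that the `dtype == 'O'` test applies.

```python
import pandas as pd

def select_categorical(df, target, max_unique_ratio=0.025):
    """Return the columns treated as categorical under the two conditions."""
    n_rows = df.shape[0]
    selected = []
    for col in df.columns:
        is_object = df[col].dtype == "O"
        low_cardinality = df[col].nunique() / n_rows < max_unique_ratio
        if (not is_object and low_cardinality) or (is_object and col != target):
            selected.append(col)
    return selected

# Toy data with 200 rows, made up for illustration
toy = pd.DataFrame({
    "Stage": [1, 2, 3, 4] * 50,                                   # 4 unique / 200 rows = 0.02
    "N_Days": list(range(200)),                                   # 200 unique values
    "Sex": pd.Series(["F", "M"] * 100, dtype=object),             # object dtype
    "Status": pd.Series(["C", "D", "CL", "C"] * 50, dtype=object) # target
})
print(select_categorical(toy, target="Status"))
```

Note that the threshold is relative to the row count, so the same numeric column can switch groups when the dataset grows or shrinks.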

In [None]:
# Filter categorical columns based on conditions
cat_cols = [f for f in train_copy.columns if
            ((train_copy[f].dtype != 'O' and train_copy[f].nunique() / train_copy.shape[0] < 0.025) or
             (train_copy[f].dtype == 'O' and f != target))]

# Define color palette
custom_palette = [(0.8, 0.56, 0.65), 'crimson', (0.99, 0.8, 0.3)]

# Draw a stacked bar chart for each categorical column
for col in cat_cols:
    # Create contingency table
    contingency_table = pd.crosstab(train_copy[col], train_copy[target], normalize='index')
    
    # Set style for chart
    sns.set(style="whitegrid")
    
    # Draw the Stacked Bar chart
    contingency_table.plot(kind="bar", stacked=True, color=custom_palette, figsize=(20, 4))
    
    # Set up title and axis labels
    plt.title(f"Percentage Distribution of {target} across {col}")
    plt.xlabel(col)
    plt.ylabel("Percentage")
    
    # Show legend
    plt.legend(title=target)
    
    # Show the chart
    plt.show()

# 4. Train model


In [None]:
def preprocess(df):

    df["Drug"] = df["Drug"].map({"Placebo": 1, "D-penicillamine": 0})
    df["Sex"] = df["Sex"].map({"M": 1, "F": 0})
    
    for col in ["Ascites", "Hepatomegaly", "Spiders"]:
        df[col] = df[col].map({"N": 0, "Y": 1})
        
    df["Edema"] = df["Edema"].map({"N": 0, "S": 1, "Y": 2})
        
    return df

In [None]:
train_copy[target] = train_copy[target].map({"D": 0, "CL": 1, "C": 2})

In [None]:
preprocess(train_copy)

In [None]:
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Create X and y from the 'train_copy' data
X = train_copy.drop(columns=[target])  
y = train_copy[target]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train the XGBoost model
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

# Train the LightGBM model
lgb_model = LGBMClassifier()
lgb_model.fit(X_train, y_train)

# Train the CatBoost model
cat_model = CatBoostClassifier(verbose=False)
cat_model.fit(X_train, y_train)

# Predict on the validation set to calculate log_loss
xgb_preds = xgb_model.predict_proba(X_val)
lgb_preds = lgb_model.predict_proba(X_val)
cat_preds = cat_model.predict_proba(X_val)

# Calculate log_loss for each model
xgb_score = log_loss(y_val, xgb_preds)
lgb_score = log_loss(y_val, lgb_preds)
cat_score = log_loss(y_val, cat_preds)

# Display the results
print(f"XGBoost log_loss: {xgb_score:.5f}")
print(f"LightGBM log_loss: {lgb_score:.5f}")
print(f"CatBoost log_loss: {cat_score:.5f}")
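
For reference, the score printed above is the multiclass log loss: the mean negative log-probability that a model assigns to the true class. The sketch below computes that quantity with NumPy on made-up numbers; it assumes integer labels that match the probability column indices, and `multiclass_log_loss` is a hypothetical helper, not the scikit-learn implementation.

```python
import numpy as np

def multiclass_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    # Clip to avoid log(0), then renormalize each row to sum to 1
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    proba = proba / proba.sum(axis=1, keepdims=True)
    rows = np.arange(len(y_true))
    return float(-np.log(proba[rows, np.asarray(y_true)]).mean())

# Made-up labels and predicted probabilities for three samples
y_true = [0, 2, 1]
proba = [[0.7, 0.1, 0.2],
         [0.2, 0.2, 0.6],
         [0.1, 0.8, 0.1]]
loss = multiclass_log_loss(y_true, proba)
print(round(loss, 5))
```

Lower is better, and a confident wrong prediction is penalized much more heavily than a hesitant one.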


In [None]:
# Create a DataFrame to store feature importance
importance_df = pd.DataFrame(index=X_train.columns)

# Calculate feature importance for the XGBoost model
xgb_importance = xgb_model.feature_importances_
importance_df['XGBoost'] = xgb_importance

# Calculate feature importance for the LightGBM model
lgb_importance = lgb_model.feature_importances_
importance_df['LightGBM'] = lgb_importance

# Calculate feature importance for the CatBoost model
cat_importance = cat_model.feature_importances_
importance_df['CatBoost'] = cat_importance

# Sort in descending order by XGBoost model importance
importance_df_sorted_xgb = importance_df.sort_values(by='XGBoost', ascending=False)

# Print XGBoost model feature importance
print("XGBoost Feature Importance:")
print(importance_df_sorted_xgb['XGBoost'])

# Sort in descending order by LightGBM model importance
importance_df_sorted_lgb = importance_df.sort_values(by='LightGBM', ascending=False)

# Print LightGBM model feature importance
print("\nLightGBM Feature Importance:")
print(importance_df_sorted_lgb['LightGBM'])

# Sort in descending order by CatBoost model importance
importance_df_sorted_cat = importance_df.sort_values(by='CatBoost', ascending=False)

# Print CatBoost model feature importance
print("\nCatBoost Feature Importance:")
print(importance_df_sorted_cat['CatBoost'])


In [None]:
# Draw a clustermap for the XGBoost model feature importance
sns.clustermap(importance_df_sorted_xgb[['XGBoost']], cmap='coolwarm', method='average', col_cluster=False)
plt.title('XGBoost Feature Importance Clustermap')
plt.show()

# Draw a clustermap for the LightGBM model feature importance
sns.clustermap(importance_df_sorted_lgb[['LightGBM']], cmap='coolwarm', method='average', col_cluster=False)
plt.title('LightGBM Feature Importance Clustermap')
plt.show()

# Draw a clustermap for the CatBoost model feature importance
sns.clustermap(importance_df_sorted_cat[['CatBoost']], cmap='coolwarm', method='average', col_cluster=False)
plt.title('CatBoost Feature Importance Clustermap')
plt.show()


In the charts, the columns Sex, Ascites, Hepatomegaly, Spiders, Edema, and Stage show clear differences between outcomes, but in training they carry little importance. Apart from Bilirubin, the columns I expected to be informative do not help the algorithms as much as the other features, and most of those columns belong to the categorical group.
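
When comparing importances across the three libraries, keep in mind that each reports them on a different scale by default (to my understanding: XGBoost as fractions summing to 1, LightGBM as split counts, CatBoost as percentages summing to 100). A minimal sketch of putting them on a common scale, using made-up numbers rather than the values from this notebook:

```python
import pandas as pd

# Made-up importances on each library's typical scale
importance_df = pd.DataFrame(
    {
        "XGBoost": [0.50, 0.30, 0.20],
        "LightGBM": [900, 600, 500],
        "CatBoost": [40.0, 35.0, 25.0],
    },
    index=["Bilirubin", "N_Days", "Age"],
)

# Rescale every column so that it sums to 1
normalized = importance_df / importance_df.sum(axis=0)

# Average rank across models (1 = most important)
mean_rank = normalized.rank(ascending=False).mean(axis=1).sort_values()
print(normalized.round(3))
print(mean_rank)
```

The average rank is only a rough summary, since the libraries measure importance in different ways and not just on different scales.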

# 5. Feature Engineering



To make the existing features more useful, we combine them into newly created features.
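
As a small illustration of the idea, a ratio of two existing columns can be added without modifying the input frame. This is only a sketch: `add_ratio_feature` is a hypothetical helper, the toy values are made up, and the small `eps` guards against division by zero.

```python
import pandas as pd

def add_ratio_feature(df, numerator, denominator, name, eps=1e-7):
    """Return a copy of df with an extra column numerator / denominator."""
    out = df.copy()
    out[name] = out[numerator] / (out[denominator] + eps)
    return out

# Made-up values for two columns that exist in this dataset
toy = pd.DataFrame({"Bilirubin": [1.0, 3.0], "Albumin": [4.0, 2.0]})
with_ratio = add_ratio_feature(toy, "Bilirubin", "Albumin", "Bilirubin_Albumin")
print(with_ratio)
```

Working on a copy keeps the original frame unchanged, which makes it easier to rerun a cell without stacking up columns.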

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
class DiagnosisDateTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['Diagnosis_Date'] = X['Age'] - X['N_Days']
        return X
class AlbuminLevelTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Albumin_Level'] = np.where((X['Albumin'] < 3.4), 1, np.where((X['Albumin'] > 5.4), 2, 0))
        return X
class CopperRiskTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Copper_Risk'] = np.where((X['Copper'] >= 62) & (X['Copper'] <= 140), 1, 0)
        return X
class BilirubinNormalTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Bilirubin_Normal'] = np.where((X['Bilirubin'] >= 0.1) & (X['Bilirubin'] <= 1.2), 1, 0)
        return X
class NormalSGOTTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Normal_SGOT'] = np.where((X['SGOT'] >= 8) & (X['SGOT'] <= 45), 1, 0)
        return X
class NormalProthrombinTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Normal_Prothrombin'] = np.where((X['Prothrombin'] >= 9.4) & (X['Prothrombin'] <= 12.5), 1, 0)
        return X
class NormalAlkPhosTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['Normal_Alk_Phos'] = np.where((X['Alk_Phos'] >= 44) & (X['Alk_Phos'] <= 147), 1, 0)
        return X
class PlateletsTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Label as 1 (Low) when Platelets are below 150 (150 to 400 is the normal range), otherwise 0
        X['Low_Platelets'] = np.where(X['Platelets'] < 150, 1, 0)
        return X
class CholesterolTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Label as 0 (Normal) or 1 (High)
        X['Cholesterol_Level'] = np.where(X['Cholesterol'] > 200, 1, 0)
        return X

class TrygliceridesTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Label as 0 (Normal) or 1 (High)
        X['Tryglicerides_Level'] = np.where(X['Tryglicerides'] > 150, 1, 0)
        return X

class PredictStatusTransformer(BaseEstimator, TransformerMixin):
    '''If the treatment time is short relative to the patient's stage,
    it may be because the outcome could already be predicted.
    '''
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['N_Days_per_Stage'] = np.where(X['N_Days'] / X['Stage'] > 300, 1, 0)
        return X
class AgeGroupsTransformer(BaseEstimator, TransformerMixin):
    """Older people might be hit harder (interaction) 
        by health issues. Also can cover lifestyle influences, i.e.
        alcohol consumption etc."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Use years from above, min=26, max=78
        X['Age_Group'] = pd.cut(round(X['Age'] / 365.25).astype("int16"), 
                                bins=[19, 29, 49, 64, 99], 
                                labels = [0, 1, 2, 3]
                               ).astype('int16')
        return X
class BilirubinAlbuminTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['Bilirubin_Albumin'] = X['Bilirubin'] / X['Albumin']
        return X
class SymptomScoreTransformer(BaseEstimator, TransformerMixin):
    # From data set explanations above let's add all the "bad" symptoms
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        symptom_columns = ["Drug",'Ascites', 'Hepatomegaly', 
                           'Spiders', 'Edema']
        X['Symptom_Score'] = X[symptom_columns].sum(axis=1)
        return X
class APRITransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X['APRI'] = X['SGOT'] / (X['Platelets'] * 100 + 0.0000001)
        return X
class DrugEffectivenessTransformer(BaseEstimator, TransformerMixin):
    # Placeholder concept, assuming 'Bilirubin' improvement is a measure of effectiveness
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['Drug_Effectiveness'] = X['Drug'] * X['Bilirubin']
        return X
class LiverFunctionIndexTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        liver_columns = ['Bilirubin', 'Albumin', 'Alk_Phos', 'SGOT']
        X['Liver_Function_Index'] = X[liver_columns].mean(axis=1)
        return X
class BilirubinAlkRatioTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Compute the Bilirubin / Alk_Phos ratio (the small constant avoids division by zero)
        X['Bilirubin_Alk_Ratio'] = X['Bilirubin'] / (X['Alk_Phos'] + 0.0000001)
        return X
class CholesterolTriglyceridesRatioTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        # Compute the cholesterol / triglycerides ratio
        X['Cholesterol_Tryglicerides_Ratio'] = X['Cholesterol'] / X['Tryglicerides']
        return X
class TimeFeaturesTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X['Diag_Year'] = (X['N_Days'] / 365).astype(int)
        X['Diag_Month'] = ((X['N_Days'] % 365) / 30).astype(int)
        return X
    

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Add all transformers to the pipeline
pipeline = Pipeline([
    #('predict_status', PredictStatusTransformer()),
    ('age_groups', AgeGroupsTransformer()),
    ('time_features', TimeFeaturesTransformer()),
    #('diagnosis_date', DiagnosisDateTransformer()),  
    #('bilirubin_albumin', BilirubinAlbuminTransformer()),
    #('symptom_score', SymptomScoreTransformer()),
    #('apri', APRITransformer()),
    #('drug_effectiveness', DrugEffectivenessTransformer()),
    ('albumin_level', AlbuminLevelTransformer()),    
    ('copper_risk', CopperRiskTransformer()),  
    ('bilirubin_normal', BilirubinNormalTransformer()),
    #('liver_function_index', LiverFunctionIndexTransformer()),
    #('cholesterol_triglycerides_ratio', CholesterolTriglyceridesRatioTransformer()),
    ('normal_sgot', NormalSGOTTransformer()),
    ('normal_prothrombin', NormalProthrombinTransformer()),
    ('normal_alk_phos', NormalAlkPhosTransformer()),
    #('bilirubin_alk_ratio', BilirubinAlkRatioTransformer()),
    #('cholesterol_level', CholesterolTransformer()),
    #('tryglicerides_level', TrygliceridesTransformer()),
    #('platelets_level', PlateletsTransformer()),
    
])
# Apply the pipeline to train_copy; the transformers add columns in place,
# so train_copy itself also gains the new features
train_copy_transformed = pipeline.fit_transform(train_copy)


In [None]:
cont_cols = [f for f in train_copy.columns if train_copy[f].dtype != 'O' and train_copy[f].nunique() > 2 and f != target]
def high_corr_drop(df, cols):
    df = df.copy()
    # Select only numeric columns for correlation calculation
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    correlation = df[cols].corr().abs()
    mask = np.triu(np.ones_like(correlation).astype(bool), k=1)
    drop_list = correlation[(correlation.where(mask) > 0.91).sum() > 0].index.tolist()
    df = df.drop(drop_list, axis=1)
        
    return df
#train_copy = high_corr_drop(train_copy,cont_cols)
#test_copy = high_corr_drop(test_copy,cont_cols)

In [None]:

cont_cols = [f for f in train_copy.columns if train_copy[f].dtype != 'O' and train_copy[f].nunique() > 2]

In [None]:
plt.figure(figsize=(14, 10))
mask = np.triu(np.ones_like(train_copy[cont_cols].corr(), dtype=bool))
sns.heatmap(train_copy[cont_cols].corr(), annot=True, mask=mask)
plt.show()

# 6. Refine the Model

In [None]:
import optuna
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import log_loss

# Create X and y from the train_copy data
X = train_copy.drop(columns=[target])
y = train_copy[target]

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define the objective function for XGBoost
def objective_xgb(trial):
    params = {
        'booster': trial.suggest_categorical('booster', ['gbtree']),
        'max_depth': trial.suggest_int('max_depth', 3, 7),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 1e-8, 1.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.3, 0.9, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
        "seed": trial.suggest_categorical('seed', [42]),
        'objective': trial.suggest_categorical('objective', ['multi:softmax']),
        'num_class': 3  # Replace 3 with the number of classes in the target variable
    }

    model = XGBClassifier(**params)
    model.fit(X_train, y_train)

    # Predict on the validation set to calculate log_loss
    preds = model.predict_proba(X_val)
    score = log_loss(y_val, preds)
    return score

# Define the objective function for LightGBM
def objective_lgb(trial):
    params = {
        'objective': 'multiclass',
        'num_class': 3,  # Replace 3 with the number of classes in the target variable
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 20, 50),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 500, 1000),
        'max_depth': trial.suggest_int('max_depth', 5, 15),
        'subsample': trial.suggest_float('subsample', 0.1, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1.0),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'min_child_weight': trial.suggest_float('min_child_weight', 1e-5, 1e2, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 1.0),
        'random_state': 42
    }

    model = LGBMClassifier(**params)
    model.fit(X_train, y_train)

    # Predict on the validation set to calculate log_loss
    preds = model.predict_proba(X_val)
    score = log_loss(y_val, preds)
    return score

# Define the objective function for CatBoost
def objective_cat(trial):
    params = {
        'iterations': trial.suggest_int('iterations', 100, 1000),
        'depth': trial.suggest_int('depth', 4, 8),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.1, log=True),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-5, 1.5),
        'random_strength': trial.suggest_float('random_strength', 0.01, 3),
        'max_bin': trial.suggest_int('max_bin', 2, 255),
        'od_wait': trial.suggest_int('od_wait', 1, 50),
        'od_type': trial.suggest_categorical('od_type', ['Iter']),
        'one_hot_max_size': trial.suggest_int('one_hot_max_size', 40, 70),
        'custom_metric': 'MultiClass',  # Change here
        'random_state': 42,
        'verbose': False
    }

    model = CatBoostClassifier(**params)
    model.fit(X_train, y_train)

    # Predict on the validation set to calculate log_loss
    preds = model.predict_proba(X_val)
    score = log_loss(y_val, preds)
    return score


In [None]:
study_xgb = optuna.create_study(direction='minimize')
optuna.logging.set_verbosity(optuna.logging.WARNING)
study_xgb.optimize(objective_xgb, n_trials=50,show_progress_bar=True)

In [None]:
print('Best parameters', study_xgb.best_params)

In [None]:
# Get the best-performing model
best_params_xgb = study_xgb.best_params
best_model_xgb = XGBClassifier(**best_params_xgb)
best_model_xgb.fit(X_train, y_train)

# Create a DataFrame to store feature importances and print
importances_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_model_xgb.feature_importances_
})

# Sort and print feature importances
importances_df = importances_df.sort_values(by='Importance', ascending=False)
print(importances_df)


In [None]:
study_lgb = optuna.create_study(direction='minimize')
optuna.logging.set_verbosity(optuna.logging.WARNING)
study_lgb.optimize(objective_lgb, n_trials=50,show_progress_bar=True)

In [None]:
print('Best parameters', study_lgb.best_params)

In [None]:
# Get the best-performing model from the LGB study
best_params_lgb = study_lgb.best_params
best_model_lgb = LGBMClassifier(**best_params_lgb)
best_model_lgb.fit(X_train, y_train)

# Create a DataFrame to store feature importances and print
importances_df_lgb = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_model_lgb.feature_importances_
})

# Sort and print feature importances
importances_df_lgb = importances_df_lgb.sort_values(by='Importance', ascending=False)
print(importances_df_lgb)


In [None]:
#study_cat = optuna.create_study(direction='minimize')
#optuna.logging.set_verbosity(optuna.logging.WARNING)
#study_cat.optimize(objective_cat, n_trials=50,show_progress_bar=True)

In [None]:
#print('Best parameters', study_cat.best_params)

In [None]:
# Get the best-performing model from the CatBoost study

#best_params_cat = study_cat.best_params
#best_model_cat = CatBoostClassifier(**best_params_cat, verbose=False)
#best_model_cat.fit(X_train, y_train)

# Create a DataFrame to store feature importances and print
#importances_df_cat = pd.DataFrame({
#    'Feature': X_train.columns,
#    'Importance': best_model_cat.feature_importances_
#})

# Sort and print feature importances
#importances_df_cat = importances_df_cat.sort_values(by='Importance', ascending=False)
#print(importances_df_cat)


# 7. Predict and submit

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import log_loss

# Use the best model from each algorithm
voting_model = VotingClassifier(estimators=[
    #('catboost', best_model_cat),
    ('lgbm', best_model_lgb),
    ('xgboost', best_model_xgb)
], voting='soft')  # 'soft' to use probability estimates

# Train the ensemble model
voting_model.fit(X_train, y_train)

# Predict on the validation set
y_pred = voting_model.predict_proba(X_val)

# Evaluate log loss
logloss = log_loss(y_val, y_pred)
print(f'Log Loss of the ensemble model: {logloss}')
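
With `voting='soft'` and no weights, the ensemble probability is the plain average of the member models' class probabilities. A minimal NumPy sketch of that averaging, using made-up probabilities instead of real model outputs:

```python
import numpy as np

# Made-up class probabilities from two models for three samples
proba_a = np.array([[0.6, 0.1, 0.3],
                    [0.2, 0.2, 0.6],
                    [0.1, 0.7, 0.2]])
proba_b = np.array([[0.8, 0.1, 0.1],
                    [0.4, 0.2, 0.4],
                    [0.3, 0.5, 0.2]])

# Soft voting with equal weights: average the probabilities per class
ensemble_proba = np.mean([proba_a, proba_b], axis=0)

# The predicted label is the class with the highest averaged probability
ensemble_label = ensemble_proba.argmax(axis=1)
print(ensemble_proba)
print(ensemble_label)
```

Averaging tends to soften overconfident predictions from a single model, which usually helps a metric like log loss.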


In [None]:
preprocess(test_copy)
# The transformers are stateless, so refitting on the test data does not change the result
pipeline.fit_transform(test_copy)

In [None]:
test_predictions = best_model_lgb.predict_proba(test_copy)

In [None]:
submission_df = pd.DataFrame({
    'id': submission['id'],
    'Status_C': test_predictions[:, 2],
    'Status_CL': test_predictions[:, 1],
    'Status_D': test_predictions[:, 0]
})


submission_df.to_csv('submission.csv', index=False)

In [None]:
submission_df
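
Before uploading, it can be worth checking that each probability column is taken from the right class index and that every row sums to 1. The sketch below repeats the column mapping on made-up numbers; the ids and probabilities are hypothetical, and `label_to_index` mirrors the target encoding used earlier in the notebook.

```python
import numpy as np
import pandas as pd

# Encoding used for the target earlier in the notebook
label_to_index = {"D": 0, "CL": 1, "C": 2}

# Made-up probabilities standing in for test_predictions
proba = np.array([[0.2, 0.1, 0.7],
                  [0.6, 0.1, 0.3]])
ids = [7905, 7906]  # hypothetical ids

check_df = pd.DataFrame({"id": ids})
for label in ["C", "CL", "D"]:  # same column order as submission_df
    check_df[f"Status_{label}"] = proba[:, label_to_index[label]]

# Each row of class probabilities should sum to 1
prob_cols = [c for c in check_df.columns if c.startswith("Status_")]
rows_ok = np.allclose(check_df[prob_cols].sum(axis=1), 1.0)
print(check_df)
print(rows_ok)
```

Deriving the column positions from the encoding dictionary avoids hard-coding indices that could silently drift if the mapping changes.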