# **FOREWORD**

This is my second attempt at the [Playground season 4 episode 8 competition](https://www.kaggle.com/competitions/playground-series-s4e8/overview). The task is to identify edible mushrooms among a mix of edible and poisonous mushrooms. It is a binary classification problem evaluated with the Matthews Correlation Coefficient (MCC). <br> 
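
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of MCC computed from confusion-matrix counts. It assumes labels are already encoded as 0/1; the helper name `mcc` and the toy labels are illustrative and not part of this kernel, which relies on `matthews_corrcoef` instead.

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: return 0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy labels, made up for illustration only
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
print(mcc(y_true, y_pred))
```

The score ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it uses all four cells of the confusion matrix.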

This kernel is heavily influenced by the following public notebooks: <br>
1. https://www.kaggle.com/code/stealthtechnologies/lb-0-98513-multiple-lightgbm-models <br>
2. https://www.kaggle.com/code/carlmcbrideellis/mushrooms-single-lightgbm-model-20-minutes <br> 

**My contribution** <br>
1. Appended the [data store](https://www.kaggle.com/code/ravi20076/playgrounds4e08-datastore-v1) dataset to the competition data for the model <br>
2. Carried out very minimal feature engineering (FE) <br> 
3. Ran inference on the competition data only <br> 
4. Used different random state values (seeds) for the models <br>
5. Reported 3 metrics: MCC, AUC and LogLoss <br>

Wishing you all the best for the assignment and best regards!

**Key note** <br>
1. I have run this kernel in **test mode** with the parameter **test_req = True** <br>
2. Please set it to False to run it on the complete dataset <br>
3. In test mode, I import the actual submission predictions from my [auxiliary dataset](https://www.kaggle.com/datasets/ravi20076/playgrounds4e08baselinesubmission) and use them for submission <br>

# **IMPORTS**

In [1]:
%%time 

!cp /kaggle/usr/lib/playgrounds4e08_regularimports/playgrounds4e08_regularimports.py myimports.py
from myimports import *



---> Importing commonly used libraries and packages in my model pipelines

Collecting lightgbm==4.5.0
  Downloading lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl.metadata (17 kB)
Downloading lightgbm-4.5.0-py3-none-manylinux_2_28_x86_64.whl (3.6 MB)
Installing collected packages: lightgbm
  Attempting uninstall: lightgbm
    Found existing installation: lightgbm 4.2.0
    Uninstalling lightgbm-4.2.0:
      Successfully uninstalled lightgbm-4.2.0
Successfully installed lightgbm-4.5.0
Collecting polars==1.2.1
  Downloading polars-1.2.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Downloading polars-1.2.1-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.9 MB)
Installing collected packages: polars
  A

# **CONFIGURATION**

In [2]:
%%time

target      = "class"

# Please set this to False to run it on the complete dataset
test_req    = True

model_label = "LGBM"
episode     = 8
version_nb  = 2
model_group = 3
device      = "cpu"

op_path    = f"/kaggle/working"
ip_path    = f"/kaggle/input/playgrounds4e08-datastore-v1"

orig_req   = True
nsamples   = 1.0

n_splits     = 5
state        = 42
ftre_imp_req = True
cutoff       = 0.50

if test_req:
    all_states  = [0, 5, 7,]

else:
    all_states  = list(range(100, 125))
    



CPU times: user 6 µs, sys: 2 µs, total: 8 µs
Wall time: 11.9 µs


## **CONFIGURATION PARAMETERS**

|Parameter|Explanation|Possible value options|
|---------------------| -------------------------------| :-:|
|target | Target column name | class|
|test_req | Do we need a syntax check (quick run on a small sample)? | True/ False| 
|model_label | Model option used here | LGBM| 
|episode | Playground episode number | 8| 
|version_nb | Version number - used for experiment tracking | int value| 
|model_group | Model group number - used for experiment tracking | int value| 
|device | Device label | cpu/ gpu| 
|op_path/ ip_path | I-O paths  | | 
|orig_req | Do I need extra original data | True/ False| 
|nsamples | Original data samples required per class <br> Integer > 1: partial data required <br> 1.0: entire original data required | 1.0 or int > 1| 
|n_splits | Number of CV splits | int value| 
|state | Random state  | int value| 
|ftre_imp_req | Do I need feature importances | True/ False|
|cutoff | Prediction cutoff for labelling as 1/0 | Float between 0-1| 
|all_states | all random states list | list of chosen values|

# **PREPROCESSING**

In [3]:
%%time

X      = pd.read_parquet(os.path.join(ip_path, "train.parquet"))
test   = pd.read_parquet(os.path.join(ip_path, "test.parquet"))
sub_fl = pd.read_parquet(os.path.join(ip_path, "sample_submission.parquet"))

cat_cols = \
['capshape', 'capsurface', 'capcolor', 'doesbruiseorbleed',
 'gillattachment', 'gillspacing', 'gillcolor', 'stemroot', 'stemsurface',
 'stemcolor', 'veiltype', 'veilcolor', 'hasring', 'ringtype',
 'sporeprintcolor', 'habitat', 'season'
 ]

X[cat_cols]     = X[cat_cols].astype("category")
test[cat_cols]  = test[cat_cols].astype("category")

PrintColor(f"---> Shapes = {X.shape} | {test.shape}")

if orig_req:
    PrintColor(f"\n---> We need the original data for model training")

    if nsamples > 1:
        PrintColor(f"---> Partial original data is used = {nsamples * 2:,.0f} sample",
                   color = Fore.CYAN
                  )

        original = X.loc[X.Source == 'Original'].groupby(target).sample(n = nsamples)
        X = X.loc[X.Source == 'Competition']
        X = pd.concat([X, original], axis=0, ignore_index = True)
        X.index = range(len(X))
        del original

    elif nsamples == 1.0:
        PrintColor(f"---> Full original data is used")
else:
    X = X.loc[X.Source == 'Competition']
    PrintColor(f"---> Shapes = {X.shape} | {test.shape} | without original data",
               color = Fore.RED
              )

# Sampling for testing purposes
if test_req:
    X       = X.groupby([target, "Source"]).head(1000)
    X.index = range(len(X))
    test    = test.iloc[0:100]
    sub_fl  = sub_fl.iloc[0:100]

    PrintColor(f"---> Shapes = {X.shape} | {test.shape}")
else:
    PrintColor(f"---> Syntax check is not needed", color = Fore.RED)

y = X[target]
X = X.drop(target, axis=1)

PrintColor(f"---> Shapes = {X.shape} | {y.shape} | {test.shape}")

print();
collect();

---> Shapes = (4201981, 22) | (2077964, 21)

---> We need the original data for model training
---> Full original data is used
---> Shapes = (4000, 22) | (100, 21)
---> Shapes = (4000, 21) | (4000,) | (100, 21)

CPU times: user 18.9 s, sys: 3.34 s, total: 22.2 s
Wall time: 19.3 s


# **MODEL TRAINING**

In [4]:
%%time

drop_cols  = ["Source", "id", target]
ftre_imp   = 0
sel_cols   = X.drop(columns = drop_cols, errors = "ignore").columns
test_preds = 0
OOF_Preds  = 0
scores     = pd.DataFrame(columns = ["LogLoss", "AUC", "MCC"],
                          index = all_states,
                          dtype = np.float32
                          )
len_train  = len(X.loc[X.Source == "Competition"])

PrintColor(f"\n-------- {model_label} MODEL TRAINING --------\n")
for state in tqdm(all_states):
    model = LGBMC(objective     = "binary",
                  device        = device,
                  n_estimators  = 3000,
                  max_bin       = 256,
                  colsample_bytree = 0.6,
                  reg_lambda    = 80,
                  verbosity     = -1,
                  random_state  = state,
                  )

    model.fit(X[sel_cols], y, callbacks = [log_evaluation(0)])
    print(f"---> Model fitted  - state = {state}")

    if ftre_imp_req:
        ftre_imp  = ftre_imp + (model.feature_importances_ / len(all_states))

    test_preds = test_preds + (model.predict_proba(test[sel_cols])[:,1] / len(all_states))
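    # NOTE: the model was fitted on these same rows, so these are in-sample
    # predictions rather than true out-of-fold predictions; the scores below
    # are therefore optimistic. The slice assumes competition rows come first in X.
    oof_preds  = model.predict_proba(X.iloc[0: len_train][sel_cols])[:,1]
    OOF_Preds  = OOF_Preds + (oof_preds / len(all_states))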

    scores.loc[state] = \
    (log_loss(y.values[0: len_train], oof_preds[0: len_train]),
     roc_auc_score(y.values[0: len_train], oof_preds[0: len_train]),
     matthews_corrcoef(y.values[0: len_train], np.where(oof_preds[0: len_train] > cutoff, 1, 0))
     )

    del model, oof_preds
    collect();
    print(f"---> Model trained - state = {state}")

print("\n\n\n")
display(scores.\
        style.\
        set_caption("\nOOF scores across seeds\n").\
        format(precision = 6).\
        highlight_min(subset = ["LogLoss"], axis=0, color = "#c0eff7").\
        highlight_max(subset = ["MCC", "AUC"], axis = 0, color = "#f9d1b4").\
        set_properties(**{"text-align": "center",
                          "border" : "dashed 1px maroon",
                          }
                       )
        )

if ftre_imp_req:
    print("\n\n")
    display(pd.DataFrame(ftre_imp, index = sel_cols, columns = ["FtreImp"]).\
            sort_values(["FtreImp"], ascending = False).\
            transpose().\
            style.format(formatter = "{:,.2f}").\
            set_caption(f"Feature Importances").\
            set_properties(**{"text-align": "center"}).\
            background_gradient(subset = sel_cols,
                                cmap = "rocket",
                                axis=1
                               )
            )

print()
collect()

-------- LGBM MODEL TRAINING --------


  0%|          | 0/3 [00:00<?, ?it/s]

---> Model fitted  - state = 0
---> Model trained - state = 0
---> Model fitted  - state = 5
---> Model trained - state = 5
---> Model fitted  - state = 7
---> Model trained - state = 7






|State|LogLoss|AUC|MCC|
|:-:|:-:|:-:|:-:|
|0|0.014347|1.0|1.0|
|5|0.014317|1.0|1.0|
|7|0.014231|1.0|1.0|







| |stemwidth|stemheight|capdiameter|capsurface|gillattachment|gillcolor|capcolor|stemcolor|stemsurface|capshape|ringtype|gillspacing|stemroot|habitat|hasring|season|veiltype|doesbruiseorbleed|veilcolor|sporeprintcolor|
|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|FtreImp|17164.33|16132.67|14812.33|4146.33|4129.33|3644.67|2655.67|2219.0|1891.33|1718.67|1403.33|1156.33|1016.0|926.67|580.67|538.0|208.33|158.0|119.67|1.33|



CPU times: user 16.9 s, sys: 3.69 s, total: 20.6 s
Wall time: 18.8 s


67
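
The scores above are computed on rows the models were fitted on, which is consistent with the perfect AUC and MCC values. Below is a minimal sketch of how true out-of-fold (OOF) predictions could be produced with `n_splits` folds. It is an illustration under stated assumptions and not this kernel's method: the synthetic data from `make_classification`, the `LogisticRegression` stand-in for LightGBM, and the names `X_demo`, `y_demo` and `oof` are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

n_splits = 5
cutoff = 0.50
oof = np.full(len(y_demo), np.nan)  # one slot per training row

cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
for train_idx, valid_idx in cv.split(X_demo, y_demo):
    # Fit on the training folds only; the stand-in model replaces LightGBM here
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    # Each row is predicted by a model that never saw it during fitting
    oof[valid_idx] = clf.predict_proba(X_demo[valid_idx])[:, 1]

oof_mcc = matthews_corrcoef(y_demo, (oof > cutoff).astype(int))
print(f"OOF MCC = {oof_mcc:.4f}")
```

Scores obtained this way are usually lower than in-sample scores, but they give a more honest estimate of how the model behaves on unseen data.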

# **SUBMISSION AND CLOSURE**

In [5]:
%%time

def PostProcessPreds(sub_fl: pd.DataFrame, target: str = target):
    "This function post-processes the predictions by overriding the labels of a few hard-coded ids"

    # Use the id column as the index, unless it is already the index
    try:
        sub_fl = sub_fl.set_index("id")
    except KeyError:
        print(f"---> Submission file index is intact")

    # Manual label overrides for specific test ids
    sub_fl.loc[3640058, target] = "e"
    sub_fl.loc[sub_fl.index.isin([3600675, 4057201, 4729429, 4929268, 4985595]), target] = "p"
    return sub_fl

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 11.7 µs
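
One detail worth knowing about label overrides of this kind: assigning through `.loc` with a single label that is missing from the index appends a new row instead of raising an error. The sketch below shows one way to guard against that. The toy frame, the ids, and the names `sub`, `overrides` and `present` are made up for illustration.

```python
import pandas as pd

# Toy submission frame; ids and labels are made up for illustration
sub = pd.DataFrame({"id": [10, 11, 12, 13],
                    "class": ["p", "e", "e", "p"]}).set_index("id")

# Hypothetical overrides; id 99 is deliberately absent from the frame
overrides = {10: "e", 12: "p", 99: "p"}

# Keep only overrides whose id exists, so .loc cannot append new rows
present = {k: v for k, v in overrides.items() if k in sub.index}
for row_id, label in present.items():
    sub.loc[row_id, "class"] = label

print(sub)
```

Filtering with `index.isin(...)`, as the function above does for its second override, has the same protective effect because it only selects rows that already exist.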


In [6]:
%%time

print("\n\n")

if not test_req:
    PrintColor(f"---> Creating the post-processed submission file\n")
    
    sub_fl[target] = np.where(test_preds >= cutoff, "p", "e")
    sub_fl = PostProcessPreds(sub_fl)

    test_preds = \
    pd.DataFrame(test_preds,index = range(len(test)),
                 columns = [f"{model_label}V{version_nb}_{model_group}"],
                 dtype = np.float32,
                )

    print("\n\n")
    display(test_preds.head(10).style.set_caption(f"Submission file predictions"))
    print("\n\n")
    display(sub_fl.head(10).style.set_caption(f"Submission file labels"))
    print("\n\n")

    OOF_Preds = pd.DataFrame(OOF_Preds,
                             index = range(len(OOF_Preds)),
                             columns = [f'{model_label}V{version_nb}_{model_group}'],
                             dtype = np.float32,
                             )

    OOF_Preds.index.name = "id"
    OOF_Preds.sort_index().reset_index().\
    to_parquet(os.path.join(op_path, f'OOF_Preds_{model_label}V{version_nb}_{model_group}.parquet'))

    test_preds.\
    to_parquet(os.path.join(op_path, f'Mdl_Preds_{model_label}V{version_nb}_{model_group}.parquet'))

    sub_fl.\
    reset_index().\
    rename(columns = {"index": "id"}).\
    to_parquet(os.path.join(op_path, f'Submission_{model_label}V{version_nb}_{model_group}.parquet'))

else:
    PrintColor(f"---> Loading the submission file from my imported dataset")
    
    sub_fl = pd.read_parquet(os.path.join(ip_path, "sample_submission.parquet"))
    
    sub_fl[target] = \
    np.mean(
        np.c_[(pd.read_parquet(f"/kaggle/input/playgrounds4e08baselinesubmission/PublicV2_1/Mdl_Preds_LGBMV2_1.parquet").\
               iloc[:,-1].values,
               pd.read_parquet(f"/kaggle/input/playgrounds4e08baselinesubmission/PublicV2_2/Mdl_Preds_LGBMV2_2.parquet").\
               iloc[:,-1].values,
               pd.read_parquet(f"/kaggle/input/playgrounds4e08baselinesubmission/PublicV2_3/Mdl_Preds_LGBMV2_3.parquet").\
               iloc[:,-1].values, 
              )
             ], 
        axis=1
    )
    
    sub_fl[target] = np.where(sub_fl[target] > cutoff, 1, 0)
    sub_fl[target] = sub_fl[target].map({0: "e", 1: "p"})
    sub_fl = PostProcessPreds(sub_fl)
    
    # PostProcessPreds normally sets "id" as the index already; this is a safeguard only
    try:
        sub_fl = sub_fl.set_index("id")
    except KeyError:
        pass
    
    sub_fl.to_csv("submission.csv")
    
    print()
    display(sub_fl.head(10).style.set_caption(f"Submission file"))
    
print()
collect();

---> Loading the submission file from my imported dataset
---> Submission file index is intact



|id|class|
|:-:|:-:|
|3116945|e|
|3116946|p|
|3116947|p|
|3116948|p|
|3116949|e|
|3116950|e|
|3116951|e|
|3116952|p|
|3116953|p|
|3116954|e|



CPU times: user 3.26 s, sys: 142 ms, total: 3.4 s
Wall time: 3.5 s
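
The blending step in the test-mode branch can be reduced to a small NumPy sketch: stack the probability vectors column-wise, average across columns, then apply the cutoff. The three arrays below are made-up numbers standing in for the saved prediction files, and the names `preds_a`, `preds_b`, `preds_c` and `blend` are hypothetical.

```python
import numpy as np

cutoff = 0.50

# Made-up probabilities standing in for three saved prediction files
preds_a = np.array([0.10, 0.80, 0.55, 0.40])
preds_b = np.array([0.20, 0.90, 0.45, 0.30])
preds_c = np.array([0.15, 0.70, 0.65, 0.60])

# One column per model, one row per test sample
stacked = np.column_stack([preds_a, preds_b, preds_c])

# Simple unweighted mean across models
blend = stacked.mean(axis=1)

# Probabilities above the cutoff are labelled poisonous ("p"), the rest edible ("e")
labels = np.where(blend > cutoff, "p", "e")
print(blend.round(3), labels)
```

An unweighted mean treats every model equally. Weighted blends are possible too, but they need a validation signal to set the weights.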
