### **BayesianGMMClassifier**

Public LB Score : 0.77025

#### References:

* https://scikit-lego.netlify.app/api/mixture.html
* https://scikit-lego.netlify.app/_modules/sklego/mixture/bayesian_gmm_classifier.html#BayesianGMMClassifier

#### Credits:

* https://www.kaggle.com/code/pourchot/simple-soft-voting
* https://www.kaggle.com/code/ricopue/tps-jul22-clusters-and-lgb
* https://www.kaggle.com/code/hiro5299834/tps-jul-2022-unsupervised-and-supervised-learning


The BayesianGMMClassifier trains one Bayesian Gaussian mixture model per class in y on a dataset X. Once a density has been fitted for each class, the likelihood scores of a sample under each density show which class is most likely. The model's parameters mirror those of scikit-learn's `BayesianGaussianMixture`.
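
To make the idea concrete, here is a minimal sketch of "one density per class, pick the most likely one" built directly on scikit-learn. It is an illustration only, not scikit-lego's actual implementation; the helper names `fit_per_class_mixtures` and `predict_per_class_mixtures` and the toy data are assumptions made for this example.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def fit_per_class_mixtures(X, y, n_components=2, random_state=0):
    """Fit one BayesianGaussianMixture per class label (hypothetical helper)."""
    classes = np.unique(y)
    models = []
    for c in classes:
        gm = BayesianGaussianMixture(
            n_components=n_components, random_state=random_state
        )
        gm.fit(X[y == c])  # density estimated only from rows of class c
        models.append(gm)
    return classes, models


def predict_per_class_mixtures(X, classes, models):
    """Assign each row to the class whose density gives the highest log-likelihood."""
    # score_samples returns the log-likelihood of each row under one mixture
    log_lik = np.column_stack([m.score_samples(X) for m in models])
    return classes[np.argmax(log_lik, axis=1)]


# Toy data (assumption): two well-separated 2-D blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

classes_demo, models_demo = fit_per_class_mixtures(X_demo, y_demo)
pred_demo = predict_per_class_mixtures(X_demo, classes_demo, models_demo)
```

With scikit-lego installed, `BayesianGMMClassifier(...).fit(X, y).predict(X)` wraps the same kind of workflow behind the usual scikit-learn estimator interface.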

After following the productive discussions and reviewing the notebooks shared in this competition, a model with a decent public LB score was produced. Its predictions are used here as the labels for training the classifier from scikit-lego. This approach is quite likely overfitting the public LB sample, but it is a good starting point.
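
The hand-off described above (soft predictions from an earlier model becoming hard labels for the classifier) can be sketched as follows. The probability matrix is made up for illustration; the only assumption is that each row is a sample and each column is a cluster.

```python
import numpy as np

# Hypothetical soft predictions: one row per sample, one column per cluster
proba_demo = np.array([
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
])

# The most probable cluster per row becomes the pseudo-label used as y
pseudo_labels = np.argmax(proba_demo, axis=1)
```

Because the labels come from another model rather than from ground truth, any mistakes in them are passed on to the classifier trained on top.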

Also note that using `k-means++` as `init_params` in the BGMM requires scikit-learn v1.1 or later.
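
If the notebook may run on an older environment, a small guard like the one below can pick a safe value. This is a sketch: the helper names `supports_kmeans_pp_init` and `choose_init_params` are hypothetical, and the fallback `'kmeans'` is scikit-learn's default for `init_params`.

```python
import re


def supports_kmeans_pp_init(version):
    """Return True if a scikit-learn version string is 1.1 or newer."""
    major_minor = []
    for piece in version.split(".")[:2]:
        match = re.match(r"\d+", piece)  # leading digits only, so "1rc1" -> 1
        major_minor.append(int(match.group()) if match else 0)
    while len(major_minor) < 2:
        major_minor.append(0)
    return tuple(major_minor) >= (1, 1)


def choose_init_params(version):
    """Use 'k-means++' when available, otherwise fall back to 'kmeans'."""
    return "k-means++" if supports_kmeans_pp_init(version) else "kmeans"
```

In the notebook this could be called as `choose_init_params(sklearn.__version__)` after `import sklearn`, with the result passed to `init_params`.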

#### *Libraries*

In [None]:
import pandas as pd
import numpy as np


# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.mixture import BayesianGaussianMixture
from sklego.mixture import BayesianGMMClassifier
from sklearn.preprocessing import PowerTransformer

# Progress bar used in the training loop
from tqdm import tqdm

In [None]:
# Reading the dataset

data = pd.read_csv("C:/Users/karlc/.jupyter/lab/Notebooks/Tabular_Jul22/data.csv", index_col='id')
submission = pd.read_csv("C:/Users/karlc/.jupyter/lab/Notebooks/Tabular_Jul22/sample_submission.csv")

In [None]:
# Subset of features used for clustering
best_data = ['f_07', 'f_08', 'f_09', 'f_10', 'f_11', 'f_12', 'f_13',
             'f_22', 'f_23', 'f_24', 'f_25', 'f_26', 'f_27', 'f_28']

In [None]:
data_scaled = pd.DataFrame(PowerTransformer().fit_transform(data), columns=data.columns)

In [None]:
# Loading decent score submission
pred_test = pd.read_csv("C:/Users/karlc/.jupyter/lab/Notebooks/Tabular_Jul22/pred_test_bestdata_kmeasnplus200.csv", index_col=[0])
predict_soft = np.argmax(np.array(pred_test), axis=1)

In [None]:
X = np.array(data_scaled[best_data])
y = np.array(predict_soft)

In [None]:
# Single seed for now; widen the range to train with several seeds
for seed in tqdm(range(0, 1)):
    bgm = BayesianGMMClassifier(
        n_components=7,
        random_state=seed,
        tol=1e-3,
        covariance_type='full',
        max_iter=200,
        n_init=3,
        init_params='k-means++',  # requires scikit-learn >= 1.1
    )

    # Fit on the pseudo-labels, then predict hard labels and class probabilities
    bgm.fit(X, y)
    predict = bgm.predict(X)
    pred_seed = bgm.predict_proba(X)

In [None]:
sns.set(rc={'figure.figsize':(15,10)})
pl = sns.countplot(x=predict)
pl.set_title("Distribution of clusters - Best features")
plt.show()

In [None]:
submission['Predicted'] = predict
submission.to_csv('TabJul22_sub3_bgmc_220722.csv', index=False)
submission.head(20)

**Other improvements:**

* Possible integration and/or ensembling with other classifiers
* Tuning BGMM hyperparameters
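
As a starting point for the ensembling idea, here is a minimal soft-voting sketch that averages class probabilities from several models (or several seeds of the same model). The function name `soft_vote` and the toy arrays are assumptions for illustration; it also assumes that column k refers to the same class in every probability matrix.

```python
import numpy as np


def soft_vote(proba_list, weights=None):
    """Average class probabilities across models and return (avg, labels)."""
    # Shape after stacking: (n_models, n_samples, n_classes)
    stacked = np.stack(proba_list, axis=0)
    avg = np.average(stacked, axis=0, weights=weights)
    return avg, np.argmax(avg, axis=1)


# Toy probabilities from two hypothetical models
proba_a = np.array([[0.6, 0.4], [0.4, 0.6]])
proba_b = np.array([[0.3, 0.7], [0.2, 0.8]])

avg_proba, voted_labels = soft_vote([proba_a, proba_b])
```

In this notebook the inputs would be the `predict_proba` outputs collected per seed. Class columns stay aligned across seeds because every fit uses the same label vector `y`.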

Happy Kaggling !
