In [24]:
!pip install scikit-lego


# Intro
I am testing out scikit-lego's Bayesian Gaussian Mixture Model classifier. From [their website](https://scikit-lego.netlify.app/_modules/sklego/mixture/bayesian_gmm_classifier.html#BayesianGMMClassifier):

"The BayesianGMMClassifier trains a Gaussian Mixture Model for each class in y on a dataset X. Once a density is trained for each class we can evaluate the likelihood scores to see which class is more likely. All parameters of the model are an exact copy of the parameters in scikit-learn."

The classifier needs labels to train on. For these I use the predictions from my highest-scoring submission on the public Leaderboard, which comes from the notebook 'optimized soft voting'. The dataset is the reduced set as per the notebook '1.3-EDA-Target_Distribution'.
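To make the quoted description concrete, here is a minimal sketch of the per-class density idea using only scikit-learn. It is an illustration under my own assumptions, not scikit-lego's actual implementation: the toy data, the names `X_toy`, `y_toy`, `models`, and the choice of `n_components=2` are all made up for the example.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Two well-separated synthetic classes (toy data, not the competition set)
X_toy = np.vstack([
    rng.normal(-3.0, 1.0, size=(200, 2)),
    rng.normal(3.0, 1.0, size=(200, 2)),
])
y_toy = np.repeat([0, 1], 200)

classes = np.unique(y_toy)

# One density model per class, fitted only on that class's rows
models = {
    c: BayesianGaussianMixture(n_components=2, random_state=0).fit(X_toy[y_toy == c])
    for c in classes
}

# score_samples returns the log-likelihood of each row under a fitted mixture
log_lik = np.column_stack([models[c].score_samples(X_toy) for c in classes])

# Pick, for every row, the class whose density gives the highest likelihood
toy_pred = classes[np.argmax(log_lik, axis=1)]
toy_acc = (toy_pred == y_toy).mean()
```

Each row is assigned to the class whose mixture scores it highest, which is the "see which class is more likely" step in the quote.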

# Imports

In [31]:
import pandas as pd
import numpy as np
import pickle

from tqdm import tqdm

# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.mixture import BayesianGaussianMixture
from sklego.mixture import BayesianGMMClassifier
from sklearn.preprocessing import PowerTransformer, RobustScaler

# Definitions and Loading Data

In [26]:
PATH = 'drive/MyDrive/Kaggle/Clustering_072022/'

data = pd.read_csv(PATH + 'src/data_removed.csv', index_col='id')
predictions = pd.read_csv(PATH + 'submissions/optimized_soft_voting/submission_iter-75.csv')

cat_feats = data.columns[data.dtypes == 'int']
num_feats = data.columns[data.dtypes == 'float']

In [27]:
predictions.rename({'Id':'id'}, axis=1, inplace=True)
predictions.set_index('id', inplace=True)

# Scaling Data
- Apply Robust Scaler to the numeric features, then Power Transformer to all features afterwards, as per my scaling and transform notebook

In [28]:
data[num_feats] = RobustScaler().fit_transform(data[num_feats])
df = pd.DataFrame(PowerTransformer().fit_transform(data), columns=data.columns)
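As a quick sanity check of what this two-step scaling does, the sketch below repeats the same steps on a small synthetic frame. The column names `count_feat` and `skewed_feat` and the data are hypothetical, chosen only to have one integer and one skewed float column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(0)

# Hypothetical toy frame: one integer column, one right-skewed float column
toy = pd.DataFrame({
    'count_feat': rng.poisson(5, size=500),
    'skewed_feat': rng.lognormal(0.0, 1.0, size=500),
})
toy_num = toy.columns[toy.dtypes == 'float']

# Step 1: robust-scale only the float columns (median / IQR based)
scaled = toy.copy()
scaled[toy_num] = RobustScaler().fit_transform(scaled[toy_num])

# Step 2: power-transform every column
toy_df = pd.DataFrame(PowerTransformer().fit_transform(scaled), columns=toy.columns)

# With default settings the output columns are standardized
col_means = toy_df.mean().to_numpy()
col_stds = toy_df.std(ddof=0).to_numpy()
```

With its defaults, `PowerTransformer` applies the Yeo-Johnson transform and standardizes the result, so every output column ends up with zero mean and unit variance regardless of the earlier robust scaling.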

# Creating Arrays
- scikit-lego requires the data as NumPy arrays (not pandas DataFrames)


In [29]:
X = np.array(df)
y = np.array(predictions['Predicted'])

# Initializing Model
- I used a for loop in case I want to perform soft voting using different seeds

In [30]:
for seed in tqdm(range(0, 1)):
    bgm = BayesianGMMClassifier(
        n_components=7,
        random_state=seed,
        tol=1e-3,
        covariance_type='full',
        max_iter=200,
        n_init=3,
        init_params='kmeans',
    )

    # Fitting and probability prediction
    bgm.fit(X, y)
    predict = bgm.predict(X)
    pred_seed = bgm.predict_proba(X)

100%|██████████| 1/1 [03:44<00:00, 224.00s/it]
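If the loop were extended to several seeds, soft voting could look roughly like the sketch below: average the class probabilities over seeds, then take the most probable class. This is not what the cell above does (it runs a single seed). The helper `soft_vote`, the toy data, and the `RandomForestClassifier` stand-in are my own assumptions to keep the example fast and self-contained. Any classifier with `fit`, `predict_proba` and a `classes_` attribute could be passed in through the factory instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def soft_vote(make_model, X, y, seeds):
    """Average predict_proba over models built from different seeds."""
    prob_sum = None
    classes = None
    for seed in seeds:
        model = make_model(seed).fit(X, y)
        probs = model.predict_proba(X)
        prob_sum = probs if prob_sum is None else prob_sum + probs
        classes = model.classes_  # same label order for every seed
    avg_probs = prob_sum / len(seeds)
    return classes[np.argmax(avg_probs, axis=1)], avg_probs


rng = np.random.default_rng(0)

# Toy two-class data (not the competition set)
X_demo = np.vstack([rng.normal(-2, 1, (100, 3)), rng.normal(2, 1, (100, 3))])
y_demo = np.repeat([0, 1], 100)

# Stand-in classifier; a factory returning another model would work the same way
labels, avg_probs = soft_vote(
    lambda seed: RandomForestClassifier(n_estimators=10, random_state=seed),
    X_demo, y_demo, seeds=range(3),
)
```

Averaging probabilities directly is safe here because a classifier keeps the same class order across seeds, unlike an unsupervised mixture whose cluster indices can be permuted between runs.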


In [33]:
# Use a context manager so the file handle is closed after writing
with open(PATH + 'submissions/BGMM_probs.pkl', 'wb') as f:
    pickle.dump(pred_seed, f)

In [34]:
predictions['Predicted'] = predict
# `index` takes a boolean; the index is already named 'id', so it is written under that header
predictions.to_csv(PATH + 'submissions/BGMM_tol0.001_mi200_ni3_optimizedSViter75.csv', index=True)

# Results:
Massive increase in the public Leaderboard score, from 0.60319 to 0.73996. My position moved from ~50 to 28.