Hello everyone! This is my solution for this month's Tabular Playground. I learned a lot while working on this dataset and from the notebooks of other participants. These two kernels were the most helpful and informative. Don't forget to check them out too :D <br>
[TPS-May Categorical EDA](https://www.kaggle.com/subinium/tps-may-categorical-eda) <br>
[TPS May: RAPIDS](https://www.kaggle.com/ruchi798/tps-may-rapids)


In [None]:
import pandas as pd
import umap.umap_ as umap
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px

from plotly.subplots import make_subplots
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold 
from sklearn.metrics import log_loss

# ML
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

In [None]:
data_train = pd.read_csv('data/train.csv').drop('id', axis=1)
data_test = pd.read_csv('data/test.csv').drop('id', axis=1)

In [None]:
all_data = pd.concat([data_train, data_test], axis=0)

In [None]:
data_train

<h2>EDA</h2>

<h3>Missing values</h3>

In [None]:
all_data.isnull().sum()

As we can see, there aren't any missing values in this dataset.

<h3>Feature Description</h3>

In [None]:
data_test.describe().T.style.bar(subset=['mean', 'std'], color='#d65f5f')

In [None]:
data_train.describe().T.style.bar(subset=['mean', 'std'], color='#d65f5f')

<h3>Target Distribution</h3>

In [None]:
fig = go.Figure()

to_plot = data_train.value_counts('target')

fig.add_trace(go.Pie(
    labels = to_plot.index,
    values = to_plot.values,
    textinfo='label+percent'
))

fig.update_layout(
    template='plotly_dark',
    title_text = 'Target Distribution'
)

Unfortunately, the target variable is heavily imbalanced. We will deal with this later if it turns out to be a problem.
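If the imbalance does become a problem, one common option is to weight each class inversely to its frequency. The snippet below is only a minimal sketch of that idea on a made-up toy target; the helper name `balanced_class_weights` and the toy counts are assumptions for illustration and are not part of this solution.

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count).

    Hypothetical helper: rare classes get weights above 1,
    frequent classes get weights below 1.
    """
    counts = labels.value_counts()
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: float(n_samples / (n_classes * cnt)) for cls, cnt in counts.items()}


# Toy target with an invented class balance (not the competition data)
toy_target = pd.Series(
    ['Class_1'] * 1 + ['Class_2'] * 6 + ['Class_3'] * 2 + ['Class_4'] * 1
)
weights = balanced_class_weights(toy_target)
print(weights)
```

On the real data the same function could be called as `balanced_class_weights(data_train['target'])`. Whether and how the resulting dictionary is passed to a model depends on the library, so check its documentation before relying on it.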

<h3>Features Distribution</h3>

In [None]:
# Plot only the feature columns; 'target' is excluded so the titles match the 10x5 grid
columns = data_train.drop('target', axis=1).columns.tolist()

fig = make_subplots(
    rows=10,
    cols=5,
    subplot_titles=columns,
)

# Add one trace per feature, filling the grid row by row
for i, column in enumerate(columns):
    # Sort by feature value so the line is drawn from left to right
    to_plot = data_train[column].value_counts().sort_index()

    fig.add_trace(go.Scatter(
        x = to_plot.index,
        y = to_plot.values,
        name = column,
        mode='lines'
    ), row=i // 5 + 1, col=i % 5 + 1)

# Hide the y axes on all subplots at once
fig.update_yaxes(title='y', visible=False, showticklabels=False)

fig.update_layout(
    height=1000,
    width=700,
    showlegend=False,
    template='plotly_dark',
)
fig.update_annotations(font_size=12)

There are a lot of zero values in every feature. I'm curious how much of the dataset is filled with them.

In [None]:
to_plot = data_train.drop('target', axis=1).isin([0]).sum(axis=0)
# Share of zeros per column, as a percentage of the training rows
percent = to_plot.values / len(data_train) * 100

fig = go.Figure()

fig.add_trace(go.Bar(
    x = to_plot.values,
    y = to_plot.index,
    orientation='h',
    text = np.round(percent, 2),
    textposition='outside',
    marker={
        'color': to_plot.values,
        'colorscale': 'Purples',

    }
))

fig.update_layout(
    height=1000,
    width=700,
    template='plotly_dark',
    title_text='Percent of zeros in every column'
)

<h3>Correlation</h3>

In [None]:
data_train_target_num = data_train.replace({'target': {'Class_1': 1, 'Class_2': 2, 'Class_3': 3, 'Class_4': 4}})

plt.figure(figsize=(8, 12))

heatmap = sns.heatmap(data_train_target_num.corr()[['target']].sort_values(by='target', ascending=False),
                     vmin=-1, vmax=1, annot=True, cmap='Purples')

heatmap.set_title('Linear correlation of features with target variable', fontdict={'fontsize': 18}, pad=16);

<h3>Conclusion</h3>

After some visualization, a couple of things have come to light:
<ul>
    <li>There aren't any missing values</li>
    <li>The mean and standard deviation are roughly the same for the train and test datasets</li>
    <li>The target variable is imbalanced, which can be a problem</li>
    <li>Features are right-skewed (most values sit at zero, with a long tail of larger values), and nearly 60% of the values in every feature are zeros</li>
    <li>Features show only a weak linear correlation with the target variable</li>
</ul>
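A quick numeric check of the zero share and the direction of the skew can back up the visual impression. The snippet below is a small sketch on invented toy data; the column names `feature_a` and `feature_b` and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy zero-heavy features (invented values, not the competition data)
toy = pd.DataFrame({
    'feature_a': [0, 0, 0, 0, 0, 0, 1, 2, 5, 10],
    'feature_b': [0, 0, 0, 0, 0, 1, 1, 2, 3, 20],
})

# Fraction of zeros per column
zero_share = (toy == 0).mean()

# Sample skewness per column; a positive value means a longer right tail
skewness = toy.skew()

print(zero_share)
print(skewness)
```

The same two lines can be applied to `data_train.drop('target', axis=1)` to get per-feature numbers for the real dataset.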

<h2>Dimensionality Reduction</h2>
There are 50 features in our dataset. That makes it a good candidate for dimensionality reduction, but first we are going to check whether it is worth doing.

<h3>Dimensionality reduction using PCA </h3>

In [None]:
pca = PCA().fit(data_train.drop('target', axis=1))

fig = go.Figure()

fig.add_trace(go.Scatter(
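    # Component counts start at 1, so the x axis matches its label
    x = list(range(1, len(pca.explained_variance_ratio_) + 1)),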
    y = np.cumsum(pca.explained_variance_ratio_)
))

fig.update_layout(
    template = 'plotly_dark',
    title_text = 'PCA Performance',
    xaxis_title = 'Number of components',
    yaxis_title = 'Cumulative explained variance'
)

As the plot above shows, the retained variance drops quite quickly as components are removed: keeping only 30 of the 50 components already loses almost 10% of the variance. Reducing the dimensionality of this dataset is therefore not worth it for building a prediction model, but we can still use dimensionality reduction to visualize the dataset.
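To read a number like this off the curve instead of eyeballing it, the cumulative explained variance can be searched for the first point that reaches a chosen threshold. The snippet below is a minimal sketch; the helper name `components_for_variance` and the toy ratios are assumptions for illustration, not values from this dataset.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative explained
    variance reaches `threshold` (a fraction between 0 and 1).

    Hypothetical helper for illustration.
    """
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    first_index = int(np.searchsorted(cumulative, threshold))
    # Clip in case rounding keeps the total slightly below the threshold
    return min(first_index + 1, len(cumulative))


# Toy ratios (invented): the cumulative sums are roughly 0.5, 0.8, 0.95, 1.0
toy_ratio = np.array([0.5, 0.3, 0.15, 0.05])
n_components_90 = components_for_variance(toy_ratio, 0.90)
print(n_components_90)
```

With the fitted `pca` object from the cell above, the call would be `components_for_variance(pca.explained_variance_ratio_, 0.90)`.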

In [None]:
pca_vis = PCA(3)
projected = pca_vis.fit_transform(data_train.drop('target', axis=1))

In [None]:
df_vis = pd.DataFrame(projected, columns=['x', 'y', 'z'])
df_vis['target'] = data_train['target']

In [None]:
fig = px.scatter_3d(df_vis, x='x', y='y', z='z', color='target')

# Apply the dark template
fig.update_layout(
    template='plotly_dark'
)

PCA doesn't separate the classes very well, but that doesn't mean visualization is impossible. We are going to use another method instead.

<h3> Dimensionality reduction using UMAP </h3>

In [None]:
sample_data_train = data_train.sample(1000, random_state=42)
scaled_sample_train = pd.DataFrame(StandardScaler().fit_transform(sample_data_train.drop('target', axis=1)))
scaled_sample_target = sample_data_train.replace({'target': {'Class_1': 1, 'Class_2': 2, 'Class_3': 3, 'Class_4': 4}})['target'].reset_index(drop=True)

In [None]:
reducer_2d = umap.UMAP(random_state=1)
embedding_2d = reducer_2d.fit_transform(scaled_sample_train, scaled_sample_target)

In [None]:
df_test_2d = pd.DataFrame(embedding_2d, columns=['x', 'y'])
df_test_2d['target'] = scaled_sample_target

In [None]:
fig = go.Figure()

# One trace per class; the targets were encoded as 1..4 above
for class_id in range(1, 5):
    class_points = df_test_2d[df_test_2d['target'] == class_id]
    fig.add_trace(go.Scatter(
        x = class_points['x'],
        y = class_points['y'],
        mode='markers',
        name=f'Class_{class_id}'
    ))

fig.update_layout(
    title_text = '2d dataset visualization using UMAP',
    template = 'plotly_dark'
)

In [None]:
reducer_3d = umap.UMAP(random_state=42, n_components=3)
embedding_3d = reducer_3d.fit_transform(scaled_sample_train, scaled_sample_target)

In [None]:
df_test_3d = pd.DataFrame(embedding_3d, columns=['x', 'y', 'z'])
df_test_3d['target'] = scaled_sample_target

In [None]:
fig = go.Figure()

# One trace per class; the targets were encoded as 1..4 above
for class_id in range(1, 5):
    class_points = df_test_3d[df_test_3d['target'] == class_id]
    fig.add_trace(go.Scatter3d(
        x = class_points['x'],
        y = class_points['y'],
        z = class_points['z'],
        mode = 'markers',
        name = f'Class_{class_id}',
        marker = dict(
            size=4
        )
    ))

fig.update_layout(
    title_text = '3d dataset visualization using UMAP',
    template = 'plotly_dark'
)

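Now our visualization looks much better: we can clearly see clouds of different classes. Keep in mind that the target was passed to UMAP, so the embedding is supervised and part of the separation comes from the labels themselves.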

<h2> Prediction model creation </h2>

In [None]:
data_train_num = data_train

X = data_train_num.drop('target', axis=1)
y = data_train_num['target']

<h3> Scaling </h3>

In [None]:
scaler = StandardScaler()
# Keep the column names and index so X_scaled lines up with X
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)

<h3> XGBoost </h3>

In [None]:
log_pred = np.zeros((len(X), 4))
test_pred = np.zeros((len(data_test), 4))

In [None]:
xgb_model = XGBClassifier()

In [None]:
%%time

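skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold_, (train_index, val_index) in enumerate(skf.split(X, y)):
    print('Fold: ', fold_)
    # fit() retrains from scratch on every call, so xgb_model can be reused across folds.
    # The validation fold is last in eval_set, so early stopping monitors it.
    # Note: recent XGBoost releases expect eval_metric and early_stopping_rounds
    # in the constructor rather than in fit().
    model = xgb_model.fit(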
        X.iloc[train_index],
        y.iloc[train_index],
        eval_set = [(X.iloc[train_index], y.iloc[train_index]), (X.iloc[val_index], y.iloc[val_index])],
        eval_metric = 'mlogloss',
        early_stopping_rounds = 50, 
        verbose = 0
    )

    temp_pred = model.predict_proba(X.iloc[val_index])
    log_pred[val_index] = temp_pred

    print(f'Log Loss: {log_loss(y.iloc[val_index], temp_pred)}')

    temp_test = model.predict_proba(data_test)
    test_pred += temp_test

test_pred1 = test_pred/5

print(f'Overall Log Loss: {log_loss(y, log_pred)}')

<h3> LightGBM </h3>

In [None]:
log_pred = np.zeros((len(X), 4))
test_pred = np.zeros((len(data_test), 4))

In [None]:
lg_model = LGBMClassifier()

In [None]:
%%time

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold_, (train_index, val_index) in enumerate(skf.split(X, y)):
    print('Fold: ', fold_)
    model = lg_model.fit(
        X.iloc[train_index],
        y.iloc[train_index],
        eval_set = [(X.iloc[train_index], y.iloc[train_index]), (X.iloc[val_index], y.iloc[val_index])],
        eval_metric = 'multi_logloss',
        early_stopping_rounds = 50,
        verbose = 0
    )

    temp_pred = model.predict_proba(X.iloc[val_index])
    log_pred[val_index] = temp_pred

    print(f'Log Loss: {log_loss(y.iloc[val_index], temp_pred)}')

    temp_test = model.predict_proba(data_test)
    test_pred += temp_test

test_pred2 = test_pred/5

print(f'Overall Log Loss: {log_loss(y, log_pred)}')

<h3> CatBoost </h3>

In [None]:
log_pred = np.zeros((len(X), 4))
test_pred = np.zeros((len(data_test), 4))

In [None]:
cat_model = CatBoostClassifier()

In [None]:
%%time

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for fold_, (train_index, val_index) in enumerate(skf.split(X, y)):
    print('Fold: ', fold_)
    model = cat_model.fit(
        X.iloc[train_index],
        y.iloc[train_index],
        eval_set = [(X.iloc[train_index], y.iloc[train_index]), (X.iloc[val_index], y.iloc[val_index])],
        early_stopping_rounds = 50,
        verbose=0

    )

    temp_pred = model.predict_proba(X.iloc[val_index])
    log_pred[val_index] = temp_pred

    print(f'Log Loss: {log_loss(y.iloc[val_index], temp_pred)}')

    temp_test = model.predict_proba(data_test)
    test_pred += temp_test

test_pred3 = test_pred/10

print(f'Overall Log Loss: {log_loss(y, log_pred)}')

In [None]:
df_pred1 = pd.DataFrame(test_pred1)
df_pred2 = pd.DataFrame(test_pred2)
df_pred3 = pd.DataFrame(test_pred3)

In [None]:
data_test1 = pd.read_csv('data/sample_submission.csv').drop(['Class_1', 'Class_2', 'Class_3', 'Class_4'], axis=1)

data_test1['Class_1'] = df_pred1[0]
data_test1['Class_2'] = df_pred1[1]
data_test1['Class_3'] = df_pred1[2]
data_test1['Class_4'] = df_pred1[3]

In [None]:
data_test2 = pd.read_csv('data/sample_submission.csv').drop(['Class_1', 'Class_2', 'Class_3', 'Class_4'], axis=1)

data_test2['Class_1'] = df_pred2[0]
data_test2['Class_2'] = df_pred2[1]
data_test2['Class_3'] = df_pred2[2]
data_test2['Class_4'] = df_pred2[3]

In [None]:
data_test3 = pd.read_csv('data/sample_submission.csv').drop(['Class_1', 'Class_2', 'Class_3', 'Class_4'], axis=1)

data_test3['Class_1'] = df_pred3[0]
data_test3['Class_2'] = df_pred3[1]
data_test3['Class_3'] = df_pred3[2]
data_test3['Class_4'] = df_pred3[3]

In [None]:
df_pred1

In [None]:
df_pred2

In [None]:
df_pred3

In [None]:
data_test3.to_csv('submission.csv', index=False)