# Classifying 10 different bacteria species

<img src= "https://www.nutraingredients.com/var/wrbm_gb_food_pharma/storage/images/publications/food-beverage-nutrition/nutraingredients.com/article/2019/10/04/gut-health-affected-by-teams-of-bacteria-not-individual-species/10214854-1-eng-GB/Gut-health-affected-by-teams-of-bacteria-not-individual-species.jpg" alt ="Bacteria" style='width: 1600px; height: 500px'>

This kernel walks through a simple exploratory data analysis (EDA) of the Tabular Playground Series (February 2022) data. It reuses code, with some modifications, from the following kernels:

* [[TPS-FEB-22] 📊EDA + Modelling📈](https://www.kaggle.com/odins0n/tps-feb-22-eda-modelling)
* [Quick and minimalistic EDA (clusters) + XGBoost](https://www.kaggle.com/remekkinas/quick-and-minimalistic-eda-clusters-xgboost)
* [SUPER LEARNER ENSEMBLE - eXTree (TUNED) - EDA+DIM](https://www.kaggle.com/remekkinas/super-learner-ensemble-extree-tuned-eda-dim?scriptVersionId=86830028)

Please upvote the kernels above because they contain great ideas. 

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno as msno
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

sns.set(style="ticks", color_codes=True)

In [None]:
train = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv", index_col='row_id')
test = pd.read_csv("../input/tabular-playground-series-feb-2022/test.csv", index_col='row_id')

First, we analyze the training data to understand how the variables are distributed and which transformations might help when building a model.

## Summary of the train data

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.head()

In [None]:
train.describe()

It was mentioned in [this discussion](https://www.kaggle.com/c/tabular-playground-series-feb-2022/discussion/304483) that the dataset contains 8 categorical variables. A simple heuristic is to treat a column as categorical when it has fewer than 25 unique values. Let's check the claim with that rule:

In [None]:
# Obtain the list of variables and remove the target
cols = train.columns.to_list()
cols.remove('target')

cat_cols = [col for col in cols if train[col].nunique() < 25]
num_cols = [col for col in cols if train[col].nunique() >= 25]

In [None]:
print("Number of categorical columns: ", len(cat_cols))
print("Number of numerical columns: ", len(num_cols))
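
The cardinality rule above can be tried on a small synthetic frame before trusting it on the real data. The sketch below is only an illustration: the helper `split_by_cardinality`, the toy column names, and the default threshold of 25 are assumptions made for the example, not part of the original kernel.

```python
import numpy as np
import pandas as pd

def split_by_cardinality(df, target="target", threshold=25):
    """Split feature names into low- and high-cardinality groups (hypothetical helper)."""
    features = [c for c in df.columns if c != target]
    n_unique = df[features].nunique()
    low = n_unique[n_unique < threshold].index.tolist()
    high = n_unique[n_unique >= threshold].index.tolist()
    return low, high

# Toy data: one column with a handful of levels, one continuous column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "few_levels": rng.integers(0, 5, size=200),
    "continuous": rng.normal(size=200),
    "target": rng.choice(["a", "b"], size=200),
})

low_card, high_card = split_by_cardinality(toy)
print("Low cardinality: ", low_card)
print("High cardinality:", high_card)
```

Note that a low unique count only suggests a categorical variable. A discrete numeric feature with few levels would be flagged in the same way, so the threshold is a judgment call.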

In [None]:
print("Number of missing values in the data:", train.isna().sum().sum())

In summary, the train data has the following characteristics:

* Shape of the data: *200000* rows and *287* columns.
* Variable types: *8* categorical and *278* numerical feature columns (286 features in total, plus the target).
* Missing values: *0*.
* Scale: the describe table above shows that the columns have different scales.
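
Since the columns sit on different scales, distance-based methods usually benefit from standardization. The following is a minimal sketch on made-up data (the column names `small` and `large` and their scales are assumptions for illustration) showing what `StandardScaler` does to each column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two toy columns whose spreads differ by several orders of magnitude
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small": rng.normal(loc=0.0, scale=1e-3, size=500),
    "large": rng.normal(loc=10.0, scale=100.0, size=500),
})

# StandardScaler subtracts each column's mean and divides by its standard deviation
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print("Standard deviation before scaling:")
print(toy.std(ddof=0))
print("Standard deviation after scaling:")
print(scaled.std(ddof=0))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates a Euclidean distance.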


## Distribution of the target variable
The goal of this competition is to predict the bacteria species. Thus, we need to understand the distribution of the target variable.

In [None]:
sns.countplot(x="target", data=train);
plt.title('Bacteria species', fontsize=18);
plt.xticks(rotation='vertical');

The plot above shows that this is a balanced dataset.
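
A count plot gives a visual impression; the balance can also be checked numerically. The sketch below uses a hypothetical helper, `class_balance`, on a toy label series. With the real data one would pass `train["target"]` instead.

```python
import pandas as pd

def class_balance(labels):
    """Return per-class shares and the ratio of the largest to the smallest class."""
    shares = labels.value_counts(normalize=True)
    return shares, shares.max() / shares.min()

# Toy labels: classes "a" and "b" are twice as frequent as "c"
toy_target = pd.Series(["a"] * 50 + ["b"] * 50 + ["c"] * 25)

shares, imbalance_ratio = class_balance(toy_target)
print(shares)
print("Largest / smallest class:", imbalance_ratio)
```

A ratio close to 1 indicates a balanced target, while the toy example is deliberately imbalanced to show what a larger ratio looks like.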

## Distribution of the features

Now, let's study the features in more detail.

In [None]:
train.iloc[:, :-1].describe().T.sort_values(by='std' , ascending = False)\
                     .style.background_gradient(cmap='GnBu')\
                     .bar(subset=["max"], color='#F8766D')\
                     .bar(subset=["mean",], color='#00BFC4')

In [None]:
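# 286 features need 72 rows of 4 plots, which leaves two unused axes
fig, axs = plt.subplots(72, 4, figsize=(16, 300))
for i, col in enumerate(cols):
    current_ax = axs.flat[i]
    current_ax.hist(train[col], bins=100)
    current_ax.set_title(col)
    current_ax.grid()
# Hide the axes that did not receive a histogram
for unused_ax in axs.flat[len(cols):]:
    unused_ax.set_visible(False)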

It is hard to make sense of so many variables at once. We therefore use t-SNE to embed a standardized random sample of 10,000 rows in two dimensions, which lets us visualize all of the variables together.

In [None]:
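# t-SNE is slow, so embed a reproducible random sample of 10,000 rows
train_subset = train.sample(10000, random_state=42)

# Standardize first so that features on larger scales do not dominate the distances.
# Note: recent scikit-learn releases rename n_iter to max_iter.
tsne = TSNE(n_components=2, random_state=0, perplexity=50, n_iter=3000)
transformed_data = tsne.fit_transform(StandardScaler().fit_transform(train_subset[cols].values))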
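
Because t-SNE takes a while on 10,000 rows, it can help to rehearse the same scale-then-embed steps on a tiny synthetic sample first. This is a sketch under stated assumptions: the helper `embed_2d`, the blob data, and the perplexity of 10 are illustrative choices and not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def embed_2d(X, perplexity=10, random_state=0):
    """Standardize the features, then compute a 2-D t-SNE embedding (hypothetical helper)."""
    X_scaled = StandardScaler().fit_transform(X)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X_scaled)

# 90 synthetic points in 10 dimensions, with columns rescaled to very different ranges
X_toy, y_toy = make_blobs(n_samples=90, n_features=10, centers=3, random_state=0)
X_toy = X_toy * np.logspace(-2, 2, 10)

embedding = embed_2d(X_toy)
print(embedding.shape)
```

The perplexity should stay well below the number of samples, so a value that suits 10,000 rows is not appropriate for a toy set of this size.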

In [None]:
# Build the plotting frame directly so that X and Y stay numeric
# (stacking them with the string labels would coerce the array to a non-numeric dtype)
tsne_df = pd.DataFrame(transformed_data, columns=["X", "Y"])
tsne_df["target"] = train_subset["target"].to_numpy()

sns.FacetGrid(tsne_df, hue="target", height=6).map(plt.scatter, "X", "Y").add_legend()
plt.title("Perplexity = 50, n_iter = 3000")
plt.show()

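The plot above suggests that a linear model might not be the best approach. This is only a qualitative hint, because t-SNE preserves local neighborhoods rather than global distances.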
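
A more direct way to test such a claim than reading it off an embedding is to compare a linear baseline with a nonlinear one under cross-validation. The sketch below does this on a synthetic two-circles dataset where a linear boundary is known to struggle. The helper `mean_cv_accuracy`, the choice of logistic regression and k-nearest neighbors, and the toy data are all assumptions for illustration, and the numbers say nothing about the competition data itself.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mean_cv_accuracy(model, X, y, folds=5):
    """Mean accuracy over stratified cross-validation folds (hypothetical helper)."""
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()

# Two concentric circles: the classes cannot be separated by a straight line
X_toy, y_toy = make_circles(n_samples=400, noise=0.05, factor=0.5, random_state=0)

linear_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
nonlinear_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

linear_acc = mean_cv_accuracy(linear_model, X_toy, y_toy)
knn_acc = mean_cv_accuracy(nonlinear_model, X_toy, y_toy)
print(f"Logistic regression accuracy: {linear_acc:.3f}")
print(f"k-nearest neighbors accuracy: {knn_acc:.3f}")
```

Running the same comparison on `train[cols]` and `train["target"]` would show whether the gap between linear and nonlinear models is real for this dataset.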