# RandomForestClassifier to Predict Bacteria Type

This notebook uses a simple machine learning algorithm (*RandomForestClassifier*) to predict the type of bacteria from the contribution of each ATGC combination, with one feature column per combination. This is very basic work. If you are very new to the field (like me), this notebook may help you learn something new.

This notebook is my first solution to the Kaggle competition *"Tabular Playground Series - Feb 2022"*. I have used a supervised learning algorithm called RandomForestClassifier from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). This is a very basic technique, but good enough to get a score of 0.95. I hope this will be useful for someone.

If you have any questions, please comment. I am happy to answer.
If you can suggest any improvement to this work, please comment as well. Thank you.

**Importing Libraries**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
%matplotlib inline

**Importing data**

In [None]:
train = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv', index_col=0)
test = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv', index_col=0)
submission = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')

**Let's see what the data looks like**

In [None]:
display(train.head()) # display() is needed because only the last expression in a cell is shown automatically
# train.dtypes
# train.describe()
# train.describe(include='object')
train.info()
# train.isnull().sum() #check for null data
# train.isnull().sum().sum()

**Let's arrange our data**

In [None]:
y = train['target'] 
X = train.drop(columns=['target'])

**Exploring target data**

In [None]:
y.value_counts()

In [None]:
y.value_counts().transpose().plot(kind='bar')
plt.title('Target data')
plt.ylabel('Frequency')
plt.xlabel('Bacteria type')
plt.show()

**Let's see the contribution to the targets from each column**

In [None]:
avg = train.groupby(['target']).mean() # getting the average contribution for each target
# avg.head()
avg.transpose().plot(kind='line',figsize=(25, 10))
plt.title('Contribution to the target')
plt.ylabel('Average contribution')
plt.xlabel('ATGC combination')
plt.show()

# Using Model: RandomForestClassifier 

In [None]:
# using the train_test_split function from sklearn to split the training data set
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
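
The split above is purely random. If the classes were unbalanced, passing `stratify` would keep the class proportions the same in the training and validation parts. The sketch below shows this on made-up data; `X_toy`, `y_toy` and the 80/20 class ratio are assumptions for illustration only, not values from the competition data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, unbalanced labels (80 "a", 20 "b")
rng = np.random.default_rng(1)
X_toy = rng.random((100, 3))
y_toy = np.array(["a"] * 80 + ["b"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both parts of the split
tr_X, va_X, tr_y, va_y = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

# Count how many rows of each class ended up in the validation part
labels, counts = np.unique(va_y, return_counts=True)
val_counts = {str(k): int(v) for k, v in zip(labels, counts)}
print(val_counts)
```

In this notebook that would mean adding `stratify=y` to the existing `train_test_split` call. Whether it changes the score noticeably depends on how balanced the target is.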

In [None]:
rf_model = RandomForestClassifier(random_state=1)
rf_model.fit(train_X, train_y) # training our model

In [None]:
rf_val_predictions = rf_model.predict(val_X) # predicting for validation
rf_val_predictions[:25]

**Validation**

In [None]:
# using accuracy_score function from sklearn.metrics
a_s = accuracy_score(val_y, rf_val_predictions)
print('Accuracy score:',a_s)
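
A single train/validation split gives one accuracy number that depends on which rows landed in the validation part. A possible way to get a more stable estimate is k-fold cross-validation. This is only a sketch on synthetic data: `make_classification`, the data sizes and `n_estimators=50` are assumptions chosen to keep the example small and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for the real training set
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=1
)

model = RandomForestClassifier(n_estimators=50, random_state=1)

# 5-fold cross-validation: the model is trained 5 times, each time validated on a different fold
scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean(), "Std:", scores.std())
```

On the full competition data this is several times slower than one split, since the forest is trained once per fold.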

**Visualizing accuracy**

In [None]:
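labl = y.unique().tolist() # list of names of bacteria
# using confusion_matrix from sklearn.metrics (rows: true labels, columns: predicted labels)
cm = confusion_matrix(val_y, rf_val_predictions, labels=labl)
print(cm)
# Normalise each row by its own total so that every row sums to 1
cmn = cm.astype('float') / cm.sum(axis=1, keepdims=True)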
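
When normalising a confusion matrix by row, the shape of the divisor matters because of NumPy broadcasting. The small worked example below uses a made-up 2-class matrix (`cm_toy` is hypothetical) to show the difference between dividing by a flat vector of row sums and dividing by a column vector of row sums.

```python
import numpy as np

# Hypothetical 2-class confusion matrix: rows are true labels, columns are predictions
cm_toy = np.array([[90, 10],
                   [5, 45]])

row_sums = cm_toy.sum(axis=1)  # shape (2,): [100, 50]

# A flat vector broadcasts across columns: column j is divided by row_sums[j]
wrong = cm_toy / row_sums

# A column vector divides each row by its own total
right = cm_toy / row_sums[:, np.newaxis]

print("rows of 'wrong' sum to:", wrong.sum(axis=1))
print("rows of 'right' sum to:", right.sum(axis=1))
```

`confusion_matrix` also accepts `normalize='true'`, which returns the row-normalised matrix directly.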

In [None]:
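fig, ax = plt.subplots(figsize=(12,10))
# vmax is kept small so the colour scale resolves the off-diagonal (misclassified) cells; larger values saturate
ax = sns.heatmap(cmn, annot=True, fmt='.5f', cmap="Blues", vmin=0.0, vmax=0.005)
ax.set_xticklabels(labl, rotation=90)
ax.set_yticklabels(labl, rotation=0)
ax.set_xlabel('Predicted bacteria type')
ax.set_ylabel('True bacteria type')
plt.show()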

**Retrain the model on the whole training data set**

In [None]:
rf_model.fit(X, y) # final training on all the training data

**Let's predict bacteria types for test data**

In [None]:
# test.head()

In [None]:
rf_val_predictions_test = rf_model.predict(test) #predictions for test data set
rf_val_predictions_test[:25]

**Preparing submission**

In [None]:
col_nam = test.columns # all column names of the test data set
test_pred = test.copy(deep=False) # make a shallow copy of the test data set
test_pred['target'] = rf_val_predictions_test # adding a new column to the copy
output = test_pred.drop(columns=col_nam) # dropping the feature columns, keeping only 'target'
output.reset_index(inplace=True) # reset the index so the row id becomes a regular column
output.head()
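
The same two-column table can also be built directly from the test index and the predictions, without copying the test frame and dropping columns. The sketch below uses small made-up stand-ins: `test_toy`, `preds_toy`, the feature name, the row ids and the index name `row_id` are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real test frame and the model predictions
test_toy = pd.DataFrame(
    {"feature_0": [0.1, 0.2, 0.3]},
    index=pd.Index([200000, 200001, 200002], name="row_id"),
)
preds_toy = np.array(["class_a", "class_b", "class_a"])

# One column from the index, one column from the predictions
output_toy = pd.DataFrame({"row_id": test_toy.index, "target": preds_toy})
print(output_toy)
```

With the real data this would use `test.index` and `rf_val_predictions_test`. It is worth checking that the column names match those in `sample_submission.csv` before writing the file.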

**Writing to a CSV file**

In [None]:
output.to_csv('submission.csv', index=False)
print("submission was successful")

**Thank you for reading.**
Please comment if you have any questions. I look forward to any suggestions. 🙂