# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Introduction</p></div>

Bacterial genetics is the study of the structure and distribution of hereditary information in bacteria. Although many bacterial species are considered harmful, others are helpful, and we could not exist without them. The study of bacteria has produced benefits across many industries, from new antibiotics and medications to enzymes that can break down organic compounds such as plastic. In this analysis, I will examine the genetic information of 10 different species of bacteria using Principal Components Analysis and build a predictive model that can classify bacterial species with nearly 96% accuracy.

<img src="https://media.istockphoto.com/vectors/icon-vector-template-illustration-design-vector-id1194702268?k=20&m=1194702268&s=612x612&w=0&h=BTdNZSjkEFNxqVAB5bF0Cdfkil7oiL70PISp7235FbM=" width=600>

# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Data Overview</p></div>

The [data](https://www.kaggle.com/c/tabular-playground-series-feb-2022/data) used in this study is based on compressed measurements of DNA snippets. Snippets of length 10 were analyzed using Raman spectroscopy, which yields a histogram of the bases in each snippet. With this technique, the DNA segment $ATATGGCCTT$ is represented as $A_2T_4G_2C_2$.  

Each row of data contains the spectrum of histograms generated by repeated measurements of the sample. The output of all 286 histogram possibilities ($A_0T_0G_0C_{10}$ to $A_{10}T_0G_0C_0$) then has a bias spectrum (of totally random $ATGC$) subtracted from the results.  
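
As a small illustration of the encoding described above, the sketch below counts the bases in a snippet and checks why there are 286 possible histograms for snippets of length 10. The helper name `base_histogram` is made up for this example and is not part of the competition data or code.

```python
from collections import Counter
from math import comb

def base_histogram(snippet: str) -> str:
    """Summarize a DNA snippet as base counts in A, T, G, C order (illustrative helper)."""
    counts = Counter(snippet)
    return "".join(f"{base}{counts.get(base, 0)}" for base in "ATGC")

# Number of ways to distribute 10 positions among 4 bases ("stars and bars"): C(13, 3)
n_histograms = comb(10 + 4 - 1, 4 - 1)

print(base_histogram("ATATGGCCTT"))  # A2T4G2C2
print(n_histograms)                  # 286
```

The order of the bases within the snippet is discarded, which is why this representation is described as compressed.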

The training set consists of 200,000 samples across 10 different species: Bacteroides fragilis, Campylobacter jejuni, Enterococcus hirae, Escherichia coli, Escherichia fergusonii, Klebsiella pneumoniae, Salmonella enterica, Staphylococcus aureus, Streptococcus pneumoniae, and Streptococcus pyogenes. The data set also contains several duplicated rows, which will be removed before beginning the analysis.

In [None]:
# Load libraries
import os, warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mpl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, auc, classification_report

# Load data
train = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv', index_col=0)
test = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv', index_col=0)
sub = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')

print('Train Shape: {}\nMissing Data: {}\nDuplicates: {}\n'\
      .format(train.shape, train.isna().sum().sum(), train.duplicated().sum()))
print('Test Shape: {}\nMissing Data: {}\nDuplicates: {}\n'\
      .format(test.shape, test.isna().sum().sum(), test.duplicated().sum()))
train_d=train.drop_duplicates() 
print('Dropping Duplicates\nNew Train Shape: {}'.format(train_d.shape))

## <b><span style='color:#53A0AA'>Summary Statistics Grouped by Species</span> </b>

In [None]:
train_d.groupby('target').describe()

# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Exploratory Data Analysis</p></div>

In [None]:
init_notebook_mode(connected=True)
pal = sns.color_palette("mako_r", 12).as_hex()[:10]
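# Percentage of each species; value_counts already sorts in descending order.
# Naming the columns explicitly keeps this working across pandas versions.
bact=(train_d.target.value_counts(normalize=True).mul(100)
      .rename_axis('index').reset_index(name='target'))
bact['index']=bact['index'].str.replace('_', ' ')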

temp = dict(layout=go.Layout(font=dict(family="Franklin Gothic", size=12)))
fig = px.bar(bact, x='index', y='target', text='target', color='index', 
             color_discrete_sequence=pal, opacity=0.8)
fig.update_traces(texttemplate='%{text:,.2f}%', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(visible=False, showticklabels=False)
fig.update_layout(template=temp, title_text='Distribution of Bacteria Species', 
                  xaxis=dict(title='', tickangle=25, showline=True), 
                  height=450, width=700, showlegend=False)
fig.show()

The class distribution in our target variable is evenly balanced, with each species making up approximately 10% of the data.

## <b><span style='color:#53A0AA'>Correlations Between Genes</span> </b>

In [None]:
# Exclude the non-numeric target column before computing correlations
cor=train_d.drop('target', axis=1).corr()
cor.style.background_gradient(cmap='viridis')

## <b><span style='color:#53A0AA'>Most Correlated Gene Pairs</span> </b>
Below is a list of the gene pairs that have an absolute correlation of at least 0.75.

In [None]:
c = cor.abs().unstack().drop_duplicates().reset_index()
c = c.rename(columns={'level_0': 'Gene 1', 'level_1': 'Gene 2', 0: 'Correlation'})
c = c.query('.75 <= Correlation < 1').sort_values(by = 'Correlation', ascending = False).reset_index(drop=True)
c.style.background_gradient(cmap='flare_r')
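
The cell above removes mirrored pairs by dropping duplicated correlation values, which would also discard two distinct pairs that happen to share the same value. A possible alternative, sketched here on a tiny made-up frame, is to keep only the upper triangle of the correlation matrix. The function name `correlated_pairs` and the demo columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.75):
    """List each feature pair once, keeping pairs with |correlation| >= threshold."""
    corr = frame.corr().abs()
    # Boolean mask of the strict upper triangle: each pair once, no diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ['Gene 1', 'Gene 2', 'Correlation']
    return (pairs[pairs['Correlation'] >= threshold]
            .sort_values('Correlation', ascending=False)
            .reset_index(drop=True))

# Made-up data: 'a' and 'b' move together, 'c' does not
demo = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8.5], 'c': [4, 1, 3, 2]})
print(correlated_pairs(demo))
```

Because the diagonal is masked out, no separate `Correlation < 1` filter is needed in this variant.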

## <b><span style='color:#53A0AA'>Most Correlated Genes in Each Species</span></b>

In [None]:
for i in train_d.target.unique():
    cor_df=train_d[train_d.target==i]
    cor=cor_df.drop('target', axis=1).corr()  
    c = cor.abs().unstack().drop_duplicates().reset_index()
    c = c.rename(columns={'level_0': 'Gene 1', 'level_1': 'Gene 2', 0: 'Correlation'})
    c = c.query('Correlation < 1').sort_values(by = 'Correlation', ascending = False).reset_index(drop=True)
    display(c.iloc[:1,:].style.background_gradient(cmap='flare').set_caption('Most correlated genes in {}'.format(i.replace('_', ' '))))

# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Principal Components Analysis</p></div>

To further explore the genetic composition of the species, I will use Principal Components Analysis (PCA). PCA is a dimensionality reduction technique that constructs new variables, called Principal Components, as linear combinations of the original features. The first component points in the direction of greatest variance in the data, and each subsequent component captures the greatest remaining variance while staying orthogonal to all previous components. The result is a new set of uncorrelated features that collectively explain the variability within the data. Before fitting the PCA model, I will first apply the Box-Cox transformation to reduce the skewness of the variables and then standardize them so that they share the same scale.
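
The properties described above can be checked on a small synthetic data set. This is only a sketch: the random data and the names `X_demo`, `pca_demo`, and `scores` are stand-ins and have nothing to do with the bacteria measurements.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 500 samples, 5 correlated features built from 2 latent factors
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 5))

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_demo_scaled)
scores = pca_demo.transform(X_demo_scaled)

# Loading vectors are orthonormal: their Gram matrix is the identity
gram = pca_demo.components_ @ pca_demo.components_.T
print(np.allclose(gram, np.eye(5)))

# Component scores are uncorrelated: their covariance matrix is diagonal
cov = np.cov(scores, rowvar=False)
print(np.allclose(cov, np.diag(np.diag(cov)), atol=1e-8))

# Each component explains no more variance than the one before it
print(pca_demo.explained_variance_ratio_.round(3))
```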

## <b><span style='color:#53A0AA'>Transform Skewed Variables</span> </b>

In [None]:
skew_cols = train_d.select_dtypes(exclude='object').skew().sort_values(ascending=False)
skew_cols = pd.DataFrame(skew_cols.loc[skew_cols > 0.75]).rename(columns={0:'Skew before'})

# Box-cox transformation
t=train_d.copy()
for i in skew_cols.index.tolist():
    t[i] = boxcox1p(t[i], boxcox_normmax(t[i] + 1))
    
skew_df=pd.concat([skew_cols, t[skew_cols.index].skew()], axis=1).rename(columns={0:'After'})
skew_df.head()

## <b><span style='color:#53A0AA'>Standardized Data</span> </b>

In [None]:
X=t.drop('target', axis=1)
X_scaled=pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
X_scaled.head()

## <b><span style='color:#53A0AA'>PCA Model</span> </b>

In [None]:
pca = PCA(n_components=286).fit(X_scaled)
pca_sum=pd.Series(np.cumsum(pca.explained_variance_ratio_)).mul(100)
pca_sum.index = np.arange(1, len(pca_sum)+1)
ind=pd.Series(pca.explained_variance_ratio_).mul(100)
ind.index = np.arange(1, len(ind)+1)
print(pca)

fig = go.Figure(
    layout=go.Layout(
        updatemenus=[dict(type="buttons", direction="left", x=0.15, y=1.2, showactive=False)],
        xaxis=dict(range=[1, 287],
                   autorange=False, tickwidth=2),
        yaxis=dict(range=[0, 100],
                   autorange=False)))

fig.add_trace(go.Scatter(x=ind.index[:1], y=ind[:1], line=dict(color='#5758A3', width=3), 
                         visible=True, fill='tozeroy', opacity=0.8,
                         hovertemplate = 'Variance Explained = %{y:.2f}%<br>Principal Component %{x:.0f}',
                         name='Individual'))
fig.add_trace(go.Scatter(x=pca_sum.index[:1], y=pca_sum[:1], line=dict(color='#57A0A3', width=3), 
                         visible=True, fill='tonexty', opacity=0.7,
                         hovertemplate = 'Variance Explained = %{y:.1f}%<br>Number of Principal Components = %{x:.0f}',
                         name='Cumulative'))

fig.update(frames=[go.Frame(data=[
    go.Scatter(x=ind.index[:i], y=ind[:i]),
    go.Scatter(x=pca_sum.index[:i], y=pca_sum[:i])])
                   for i in range(1, 287)])

fig.update_yaxes(title = 'Variance Explained', showline=True, ticksuffix='%', range=[0,105])

fig.update_layout(template=temp,
                  title='Principal Components Explained Variances', 
                  xaxis_title="Number of Principal Components",
                  hovermode="x unified", width=700,
                  legend=dict(orientation="v", yanchor="bottom", y=1.08, xanchor="right", x=.99, title=""),
                  updatemenus=[dict(buttons=list(
                      [dict(label="Play", method="animate", 
                            args=[None, {"frame": {"duration":15, "redraw": False}},{"fromcurrent": True}]),
                       dict(label="Pause", method="animate", 
                            args=[{"frame": {"duration": 0, "redraw": False}},{"mode": "immediate"},
                                  {"transition": {"duration": 0}}])]))])
fig.show()

This graph shows the amount of variance explained cumulatively and individually by each Principal Component. The curve bends quite early, at about Principal Component 10, where the cumulative variance explained is nearly 70% and each subsequent component captures less than 1% of the variability in the data.
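
Reading a component count off the cumulative curve can also be done numerically. The sketch below uses made-up variance ratios, and the helper name `n_components_for` is an assumption for illustration. With a fitted model, `pca.explained_variance_ratio_` would take the place of `ratios`.

```python
import numpy as np

def n_components_for(explained_variance_ratio, threshold):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical variance ratios for a 5-component model (not the notebook's values)
ratios = np.array([0.4, 0.25, 0.15, 0.1, 0.1])

print(n_components_for(ratios, 0.70))  # 3
print(n_components_for(ratios, 0.95))  # 5
```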

In [None]:
df=pd.DataFrame(abs(pca.components_.T), columns=['PC'+str(i+1) for i in range(286)], index=X_scaled.columns)
pca_ind=pd.Series(pca.explained_variance_ratio_)
var_pca=[]
for i,j in zip(df.columns.tolist(), pca_ind):
    k=df[i].nlargest(1)*j
    var_pca.append(pd.DataFrame({'Principal Component':str(i[2:]),'Gene':k.index,'Var':k.iloc[0]}))
var_pca=pd.concat(var_pca).reset_index(drop=True)
plot_df=var_pca.iloc[:10,:]

pal = sns.color_palette("viridis", 14).as_hex()[1:11]
fig = px.bar(plot_df, x='Gene', y='Var', text='Var', color='Principal Component', 
             color_discrete_sequence=pal, opacity=0.7)
fig.update_traces(texttemplate='%{text:,.3f}', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_layout(template=temp, title_text='Gene Importance in the Top 10 Principal Components',
                  xaxis_title='Gene Segment', xaxis_tickangle=28, 
                  yaxis_title='Weighted Importance', legend_title='Principal<br>Component',
                  yaxis_tickvals = [0,0.01,0.02],
                  height=500, width=700)
fig.show()

Using the Principal Components' loading vectors, this graph shows the gene that contributes most to each of the top 10 Principal Components. The loading magnitudes indicate the strength of the relationship between a variable and a component, with a higher absolute magnitude indicating a stronger relationship. The gene with the largest absolute loading in each component is shown along the $x$-axis, and that loading is multiplied by the proportion of variance the component explains to give an overall weighted importance.
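
The weighting described above amounts to one multiplication per component. Here is a minimal sketch on a hypothetical two-component loading matrix. The feature names `f1` to `f3` and all numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical loadings: rows are components, columns are features
components = np.array([[0.8, -0.6, 0.0],
                       [0.0,  0.0, 1.0]])
variance_ratio = np.array([0.7, 0.3])  # share of variance explained by each component
features = ['f1', 'f2', 'f3']

# Absolute loadings, one column per component
loadings = pd.DataFrame(np.abs(components.T), index=features, columns=['PC1', 'PC2'])

# Feature with the largest absolute loading in each component
top_feature = loadings.idxmax()
# Weighted importance: largest absolute loading times the component's variance share
weighted = loadings.max() * variance_ratio

print(pd.DataFrame({'Gene': top_feature, 'Weighted importance': weighted}))
```

The sign of a loading is dropped on purpose here, since only the strength of the relationship is being ranked.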

In [None]:
fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Principal Component 1","Principal Component 2",
                                    "Principal Component 3","Principal Component 4"))

pc1=df.PC1.sort_values(ascending=False)[:5]
fig.add_trace(go.Bar(x=pc1.index, y=pc1, name='PC1', showlegend=False, 
                     marker_color=pal[0], opacity=0.7,
                     hovertemplate='Gene %{x} Loading on Principal Component 1 = %{y:.3f}<extra></extra>'), 
              row=1,col=1) 

pc2=df.PC2.sort_values(ascending=False)[:5]
fig.add_trace(go.Bar(x=pc2.index, y=pc2, name='PC2', showlegend=False, 
                     marker_color=pal[1],opacity=0.8,
                     hovertemplate='Gene %{x} Loading on Principal Component 2 = %{y:.3f}<extra></extra>'), 
              row=1,col=2) 

pc3=df.PC3.sort_values(ascending=False)[:5]
fig.add_trace(go.Bar(x=pc3.index, y=pc3, name='PC3', showlegend=False, 
                     marker_color=pal[2],opacity=0.8,
                     hovertemplate='Gene %{x} Loading on Principal Component 3 = %{y:.3f}<extra></extra>'),
              row=2,col=1) 

pc4=df.PC4.sort_values(ascending=False)[:5]
fig.add_trace(go.Bar(x=pc4.index, y=pc4, name='PC4', showlegend=False, 
                     marker_color=pal[3],opacity=0.7,
                     hovertemplate='Gene %{x} Loading on Principal Component 4 = %{y:.3f}<extra></extra>'),
              row=2,col=2) 

fig.update_traces(marker_line=dict(width=1, color='#28221D'))
fig.update_layout(template=temp, title_text='Genetic Decomposition of Principal Components',
                  xaxis1_title='Gene', xaxis2_title='Gene',
                  xaxis3_title='Gene', xaxis4_title='Gene', xaxis_tickangle=28,
                  height=1000, width=700)

To further explore the composition of the principal components, these plots show the five genes with the highest absolute magnitude in each component's loading vector.

In [None]:
pca = PCA(n_components=10).fit_transform(X_scaled)
pca_df=pd.DataFrame(data=pca, columns=['PC'+str(i+1) for i in range(0,10)]).reset_index(drop=True)
species=train_d.target.reset_index(drop=True).str.replace('_', ' ') 
pca_df=pd.concat([species, pca_df], axis=1)
pca_df['map'] = pca_df['target'].map(pca_df['target'].value_counts())
pca_df = pca_df.sort_values(by='map', ascending=False).drop('map', axis=1)

pal = sns.color_palette("mako_r", 12).as_hex()[:10]
fig = px.scatter(pca_df, x='PC1', y='PC2', color='target', color_discrete_sequence=pal, opacity=0.4)
fig.update_traces(marker_size=7,
                  hovertemplate="Principal Component 1 = %{x}<br>Principal Component 2 = %{y}")
fig.update_layout(template=temp, title='Bacteria Species Projected onto Components 1 and 2', legend_title='', 
                  xaxis_title='Component 1 (variance explained = 31.9%)', 
                  yaxis_title='Component 2 (variance explained = 20.4%)',
                  width=700, height=600)
fig.show()

Using the Principal Component score vectors, this graph shows the coordinates of the samples projected onto the axes of the first two components. Combined, these two components account for over 52% of the variation in the data. Here, we see how the species are distributed across the two dimensions. For some species, like Klebsiella pneumoniae and Staphylococcus aureus, the components clearly separate the classes, while other species overlap. Below are the projections onto the first three principal components for each species.
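
For reference, the score vectors used in these projections are the centered data multiplied by the loading vectors. The sketch below checks this on random stand-in data. The names `X_demo`, `pca_demo`, and `manual` are assumptions for this example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Random stand-in data with correlated columns
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA(n_components=2).fit(X_demo)
scores = pca_demo.transform(X_demo)

# Manual projection: subtract the feature means, then project onto the loading vectors
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(np.allclose(scores, manual))  # True
```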

In [None]:
s=pca_df.target.unique()
rgb=[]
for i in pal:
    rgb.append('rgb' + str(mpl.to_rgb(i)))

fig = make_subplots(rows=5, cols=2,
                    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}],
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}], 
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}],
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}],
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}]],
                    horizontal_spacing = 0.1, vertical_spacing = 0.05,
                    subplot_titles=(s[0],s[1],s[2],s[3],s[4],
                                    s[5],s[6],s[7],s[8],s[9]))

# Add one 3D scatter per species, in the same order as the subplot titles in s
opacities=[0.4, 0.3, 0.3, 0.3, 0.3, 0.35, 0.35, 0.35, 0.35, 0.35]
for k, name in enumerate(s):
    p=pca_df[pca_df.target==name]
    fig.add_trace(go.Scatter3d(x=p.PC1, y=p.PC2, z=p.PC3, mode='markers', showlegend=False,
                               marker=dict(size=3, color=pal[k], opacity=opacities[k],
                                           line_width=1, line_color=rgb[k])),
                  row=k//2+1, col=k%2+1)
fig.update_traces(hovertemplate='Component 1 = %{x}<br>Component 2 = %{y}<br>Component 3 = %{z}<extra></extra>')
# Apply the same axis titles and aspect mode to all ten 3D scenes
scenes={'scene{}'.format(k): dict(aspectmode='cube', xaxis_title='PC 1',
                                  yaxis_title='PC 2', zaxis_title='PC 3')
        for k in range(1, 11)}
fig.update_layout(template=temp, title='Bacteria Species Projected onto the first 3 Principal Components',
                  height=2500, width=800, **scenes)
fig.show()

# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Predicting Bacteria Species with Principal Components</p></div>

To predict the species of bacteria, I will fit two models: one with the full set of features in the dataset and one with a reduced set of features consisting of the first 100 Principal Components. These components cumulatively explain over 85% of the variation in the data.
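
The comparison described above can be sketched with scikit-learn pipelines, which fit the scaler and PCA on the training split only. This is an illustration on synthetic data from `make_classification`, not the notebook's actual setup. The sample sizes, the 5 components, and the names `full_model` and `pca_model` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bacteria data: 20 features, 3 classes
X_demo, y_demo = make_classification(n_samples=600, n_features=20, n_informative=8,
                                     n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=0)

# Model 1: all features; Model 2: a reduced set of principal components
full_model = make_pipeline(StandardScaler(),
                           ExtraTreesClassifier(n_estimators=100, random_state=0))
pca_model = make_pipeline(StandardScaler(), PCA(n_components=5),
                          ExtraTreesClassifier(n_estimators=100, random_state=0))

for name, model in [('all features', full_model), ('5 components', pca_model)]:
    model.fit(X_tr, y_tr)
    print('{}: validation accuracy = {:.3f}'.format(name, model.score(X_va, y_va)))
```

Wrapping the steps in a pipeline keeps the validation data out of the scaling and PCA fits, which mirrors the separate `fit_transform` and `transform` calls used in the cells that follow.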

## <b><span style='color:#53A0AA'>Model Performance with 100 Principal Components</span> </b>

In [None]:
s = StandardScaler()
enc=LabelEncoder()
y=enc.fit_transform(train_d.target)
# Keep train_d intact so earlier cells can be re-run; build a separate feature frame
features=train_d.drop('target', axis=1)

X_train, X_val, y_train, y_val = train_test_split(features, y, test_size=0.2, shuffle=True, 
                                                  stratify=y, random_state=21)

X_train_scaled = s.fit_transform(X_train)
X_val_scaled = s.transform(X_val)
X_test_scaled = s.transform(test)

pca = PCA(n_components=100)
X_train_pca=pca.fit_transform(X_train_scaled)
X_val_pca=pca.transform(X_val_scaled)
X_test_pca=pca.transform(X_test_scaled)

print("Train Shape: {} {}".format(X_train_pca.shape, y_train.shape))
print("Validation Shape: {} {}".format(X_val_pca.shape, y_val.shape))
print("Test Shape: {}\n".format(X_test_pca.shape))

et_pca=ExtraTreesClassifier(n_estimators=500,
                            class_weight='balanced',
                            random_state=92).fit(X_train_pca, y_train)
print(et_pca)

y_preds=et_pca.predict(X_val_pca)
y_probs=et_pca.predict_proba(X_val_pca)
val_acc=accuracy_score(y_true=y_val, y_pred=y_preds)
val_auc=roc_auc_score(y_true=y_val, y_score=y_probs, average='weighted', multi_class='ovr')
c=classification_report(y_val, y_preds, target_names=enc.classes_, output_dict=True)
c=pd.DataFrame(c).T.iloc[:10,][['f1-score', 'precision', 'recall', 'support']]
val_f1=c['f1-score'].mean()

print('\nModel Accuracy = {:.2f}%\nF1-Score = {:.2f}%\nArea Under the Curve = {:.3f}\n'\
      .format(val_acc*100, val_f1*100, val_auc))

c[['f1-score', 'precision', 'recall']]=c[['f1-score', 'precision', 'recall']].mul(100)
c.sort_values('f1-score', ascending=False).style\
.background_gradient(cmap='flare_r', subset=['f1-score'])\
.format({'f1-score':'{:,.1f}%', 'precision':'{:,.1f}%', 'recall':'{:,.1f}%', "support": "{:,.0f}"})

In [None]:
fpr = {}
tpr = {}
roc_auc = {}
thresh = {}

# Keep species in the encoder's class order so each ROC curve gets the matching label
species=c.index.str.replace('_', ' ')
for i in range(len(species)):    
    fpr[i], tpr[i], thresh[i] = roc_curve(y_val, y_probs[:,i], pos_label=i)
    roc_auc[i] = auc(fpr[i], tpr[i])

fig = go.Figure()
for i,j in zip(enumerate(species), pal):
    fig.add_trace(go.Scatter(x=fpr[i[0]], y=tpr[i[0]], line=dict(color=j, width=3), opacity=0.7,
                             hovertemplate = 'True positive rate = %{y:.3f}, False positive rate = %{x:.3f}',
                             name='{} AUC = {:.3f}'.format(i[1],roc_auc[i[0]])))
fig.update_layout(template=temp, title="Multiclass ROC Curves<br>of Bacteria Species", hovermode="x unified", 
                  hoverlabel = dict(bgcolor="white",font_size=12), xaxis=dict(zeroline=False, hoverformat=".2f"),
                  xaxis_title='False Positive Rate (1 - Specificity)', yaxis_title='True Positive Rate (Sensitivity)',
                  legend=dict(y=.1, x=.98, xanchor="right",bordercolor="black", borderwidth=.5, font=dict(size=12)),
                  height=550, width=700)
fig.show()

In [None]:
test_preds=et_pca.predict(X_test_pca)
target=enc.inverse_transform(test_preds)
sub_pca=pd.DataFrame({'row_id':[i for i in range(int(2e5),int(3e5))], 'target':target})
# Percentage of each predicted species, with explicit column names
bact=(sub_pca.target.value_counts(normalize=True).mul(100)
      .rename_axis('index').reset_index(name='target'))

fig = px.bar(bact, x='index', y='target', text='target', color='index',
             color_discrete_sequence=pal, opacity=0.8)
fig.update_traces(texttemplate='%{text:,.2f}%', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(visible=False, showticklabels=False)
fig.update_layout(template=temp, title_text='Predicted Bacteria Species using Principal Components', 
                  xaxis=dict(title='', tickangle=25, showline=True), 
                  height=450, width=700, showlegend=False)
fig.show()

### <b><span style='color:#6E6E6E'>PCA Predictions</span> </b>

In [None]:
sub_pca.to_csv("submission_pca.csv", index=False)
sub_pca.head()

#### <span style='color:#5E5E5E'>Leaderboard Score = 87.69%</span>

## <b><span style='color:#53A0AA'>Model Performance with All Variables</span> </b>

In [None]:
print("Train Shape: {} {}".format(X_train_scaled.shape, y_train.shape))
print("Validation Shape: {} {}".format(X_val_scaled.shape, y_val.shape))
print("Test Shape: {}\n".format(X_test_scaled.shape))

et_all=ExtraTreesClassifier(n_estimators=500,  
                            class_weight='balanced', 
                            random_state=21).fit(X_train_scaled, y_train)
print(et_all)

y_preds=et_all.predict(X_val_scaled)
y_probs=et_all.predict_proba(X_val_scaled)
val_acc=accuracy_score(y_true=y_val, y_pred=y_preds)
val_auc=roc_auc_score(y_true=y_val, y_score=y_probs, average='weighted', multi_class='ovr')
c=classification_report(y_val, y_preds, target_names=enc.classes_, output_dict=True)
c=pd.DataFrame(c).T.iloc[:10,][['f1-score', 'precision', 'recall', 'support']]
val_f1=c['f1-score'].mean()

print('Model Accuracy = {:.2f}%\nF1-Score = {:.2f}%\nArea Under the Curve = {:.3f}\n'\
      .format(val_acc*100, val_f1*100, val_auc))

c[['f1-score', 'precision', 'recall']]=c[['f1-score', 'precision', 'recall']].mul(100)
c.sort_values('f1-score', ascending=False).style\
.background_gradient(cmap='flare_r', subset=['f1-score'])\
.format({'f1-score':'{:,.1f}%', 'precision':'{:,.1f}%', 'recall':'{:,.1f}%', "support": "{:,.0f}"})

In [None]:
test_preds=et_all.predict(X_test_scaled)
target=enc.inverse_transform(test_preds)
sub=pd.DataFrame({'row_id':[i for i in range(int(2e5),int(3e5))], 'target':target})
# Percentage of each predicted species, with explicit column names
bact=(sub.target.value_counts(normalize=True).mul(100)
      .rename_axis('index').reset_index(name='target'))

fig = px.bar(bact, x='index', y='target', text='target', color='index',
             color_discrete_sequence=pal, opacity=0.8)
fig.update_traces(texttemplate='%{text:,.2f}%', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.update_yaxes(visible=False, showticklabels=False)
fig.update_layout(template=temp, title_text='Bacteria Species Predictions with all Variables in the Model', 
                  xaxis=dict(title='', tickangle=25, showline=True), 
                  height=450, width=700, showlegend=False)
fig.show()

### <b><span style='color:#6E6E6E'>Model Predictions</span> </b>

In [None]:
sub.to_csv("submission.csv", index=False)
sub.head()

#### <span style='color:#5E5E5E'>Leaderboard Score = 95.87%</span>

# <div style="color:white;display:fill;border-radius:5px;background-color:#75B7BF;letter-spacing:0.1px;overflow:hidden"><p style="padding:20px;color:white;overflow:hidden;margin:0;font-size:100%;text-align:center">Conclusion</p></div>

To predict bacteria species from genetic sequences, two models were developed: one using all 286 genetic segments and a simpler one using the first 100 Principal Components. Principal Components Analysis not only reduced the number of variables in the model, but also helped to visualize the data and identify the gene composition of the most important components. Of the two, the model with all variables achieved the higher accuracy, with a leaderboard score of nearly 96%. The 100 Principal Components account for over 85% of the variance in the data, and although performance dropped by about eight percentage points, that model still predicted the species of bacteria with an accuracy of over 87%.


#### <b><p style="padding:20px;color:#53A0AA;overflow:hidden;margin:0;text-align:center">Thank you for reading!<br>Let me know if you have any questions, and I look forward to any suggestions. 🙂</p></b>

### <span style='color:#707575'>References </span>

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). *An Introduction to Statistical Learning with Applications in R.* New York, NY: Springer.

<img src="https://i0.wp.com/boingboing.net/wp-content/uploads/2019/10/giphy-1.gif?fit=1&resize=620%2C4000&ssl=1" class="center"  width="650">