## **<span style="color:#023e8a;font-size:200%"><center> 🔥🔥EDA FEB22 TPS + Classifier submission🔥🔥</center></span>**
## **<center><span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 5px">If you find this notebook useful or interesting, please, support with an upvote :)</span></center>**

## **<span style="color:#023e8a;font-size:1000%"><center>EDA</center></span><span style="color:#023e8a;font-size:200%"><center>Exploratory Data Analysis. FEB22</center></span>**

# **<a id="Content" style="color:#023e8a;">Table of Content</a>**
* [**<span style="color:#023e8a;">1. First steps</span>**](#First)  
* [**<span style="color:#023e8a;">2. Segment analysis</span>**](#Segment)  
* [**<span style="color:#023e8a;">3. Heatmap corr</span>**](#Heatmap)  
* [**<span style="color:#023e8a;">4. Histplot of target</span>**](#Histplot)  
* [**<span style="color:#023e8a;">5. Feature distributions</span>**](#Feature)  
* [**<span style="color:#023e8a;">6. PCA analysis</span>**](#PCA)  
* [**<span style="color:#023e8a;">7. DNA segments by bacteria (mean)</span>**](#DNAmean)  
* [**<span style="color:#023e8a;">8. DNA segments by bacteria (median)</span>**](#DNAmedian)  
* [**<span style="color:#023e8a;">9. DNA segments by bacteria (min)</span>**](#DNAmin)  
* [**<span style="color:#023e8a;">10. DNA segments by bacteria (max)</span>**](#DNAmax)  
* [**<span style="color:#023e8a;">11. Classifier</span>**](#Classifier)  
* [**<span style="color:#023e8a;">12. Submissions</span>**](#Subs)  

## **<span style="color:#023e8a;">Intro</span>**

**<span style="color:#023e8a;">We need to classify 10 kinds of bacteria, using data obtained from a genomic analysis technique:</span>**

🦠 `Streptococcus_pyogenes`  

🦠 `Salmonella_enterica`  

🦠 `Enterococcus_hirae`  

🦠 `Escherichia_coli`  

🦠 `Campylobacter_jejuni`  

🦠 `Streptococcus_pneumoniae`   

🦠 `Staphylococcus_aureus`  

🦠 `Escherichia_fergusonii`   

🦠 `Bacteroides_fragilis`  

🦠 `Klebsiella_pneumoniae`  

`Metrics`:  [categorization accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)  
`ML`:  **classification**

## **<span style="color:#023e8a;">About ACGT</span>**

<span style="color:#023e8a;">`ACGT` is an acronym for the four types of bases found in a DNA molecule:</span>   
* `adenine (A)`
* `cytosine (C)`
* `guanine (G)`
* `thymine (T)`  
    
<span style="color:#023e8a;">A DNA molecule consists of two strands wound around each other, with each strand held together by bonds between the bases. Adenine pairs with thymine, and cytosine pairs with guanine. The sequence of bases in a portion of a DNA molecule, called a gene, carries the instructions needed to assemble a protein.</span>
  
[Learn more](https://www.genome.gov/about-genomics/fact-sheets/DNA-Sequencing-Fact-Sheet)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from tqdm.notebook import tqdm

## **<span id="First" style="color:#023e8a;">1. First steps</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
df = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv',index_col=0)
test = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv',index_col=0)

In [None]:
df.head()

**<span style="color:#023e8a;">There are no missing values in the dataset</span>**

In [None]:
df.isnull().sum().sum()

**<span style="color:#023e8a;">The counts in every feature name always sum to 10</span>**

In [None]:
import re
list_seq = df.columns.tolist()
check = []
for name in list_seq[:-1]:  # skip the target column
    # replace the base letters with separators, then sum the remaining counts
    counts = [int(part) for part in re.sub(r'[A-Z]', '_', name).split('_') if part != '']
    check.append(sum(counts))
print(f'max sum: {max(check)}, min sum: {min(check)}')
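
To make the naming scheme concrete, here is a minimal sketch that parses a single column name into per-base counts. It assumes names follow the pattern `A<n>T<n>G<n>C<n>`; the helper `parse_base_counts` is a hypothetical name introduced only for this illustration.

```python
import re

def parse_base_counts(feature_name):
    """Split a name like 'A2T3G4C1' into a {base: count} dict (hypothetical helper)."""
    pairs = re.findall(r'([ATGC])(\d+)', feature_name)
    return {base: int(count) for base, count in pairs}

counts = parse_base_counts('A2T3G4C1')
print(counts, sum(counts.values()))
```

Applied to every feature column, the sum of the parsed counts should match the check above.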

**<span style="color:#023e8a;">Thanks to: https://www.kaggle.com/c/tabular-playground-series-feb-2022/discussion/304483</span>**

**<span style="color:#023e8a;">It makes sense to treat 8 or even 10 features as categorical.</span>**

In [None]:
pd.DataFrame(df.nunique()).sort_values(0)[1:15]

In [None]:
CATEG_FEATURES = pd.DataFrame(df.nunique()).sort_values(0)[1:11].index.tolist()
TARGET = ['target']
NUM_FEATURES = [feat for feat in df.columns if feat not in CATEG_FEATURES + TARGET]

**<span style="color:#023e8a;">Descriptive statistics</span>**

In [None]:
df[NUM_FEATURES].describe().T.style.background_gradient(cmap='RdYlGn', subset=['mean', 'std', '25%', '50%', '75%'])\
                                   .bar(subset=['min'], color='tomato')\
                                   .bar(subset=['max'], color='lightgreen')\
                                   .format('{:.6f}')

## **<span id="Segment" style="color:#023e8a;">2. Segment analysis</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
import re
seq_letters = pd.DataFrame(columns=['A','T','G','C'])  # ordered list: a set has no fixed column order
for col in CATEG_FEATURES + NUM_FEATURES:
    seq_letters.loc[col]=(re.split('A|T|G|C',col)[1:])
seq_letters.head(5)

**<span style="color:#023e8a;">Each of the 4 letters takes every count from 0 to 10</span>**

In [None]:
neg_ans = 0
for col in seq_letters.columns:
    for i in range(11):
        if str(i) not in seq_letters[col].unique():
            print(f'letter {col} has no number in sequence = {i}')
            neg_ans += 1
if neg_ans == 0:
    print('All 4 letters here have sequence from 0 to 10')

**<span style="color:#023e8a;">The features cover all combinations of A, T, G and C counts</span>**

In [None]:
seq_counts = seq_letters['A'].value_counts().to_frame('A')  # name the column explicitly
seq_counts['C'] = seq_letters['C'].value_counts()
seq_counts['T'] = seq_letters['T'].value_counts()
seq_counts['G'] = seq_letters['G'].value_counts()
seq_counts
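
As a cross-check on the claim that every combination is present, the sketch below enumerates all ways to split 10 into four non-negative counts and compares the total with the closed-form "stars and bars" value. The generated names assume the `A<n>T<n>G<n>C<n>` pattern; comparing `len(names)` with `len(df.columns) - 1` is then a quick sanity check on the real data.

```python
from itertools import product
from math import comb

# every (A, T, G, C) count tuple whose entries sum to 10
combos = [(a, t, g, c)
          for a, t, g, c in product(range(11), repeat=4)
          if a + t + g + c == 10]
names = [f'A{a}T{t}G{g}C{c}' for a, t, g, c in combos]

# stars and bars: C(10 + 4 - 1, 4 - 1) compositions of 10 into 4 parts
print(len(names), comb(13, 3))  # 286 286
```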

## **<span id="Heatmap" style="color:#023e8a;">3. Heatmap corr</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
f, ax = plt.subplots(figsize=(20,20))
ax = sns.heatmap(df.iloc[:,:-1].corr(), vmin=-1, vmax=+1)  # exclude the string target column
plt.show()

## **<span id="Histplot" style="color:#023e8a;">4. Histplot of target</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

**<span style="color:#023e8a;">The classes are balanced</span>**

In [None]:
df_count = df.groupby('target').count()

f, ax = plt.subplots(figsize=(12,8))
list_names = [elem.replace('_', ' ') for elem in df_count.index.tolist()]
ax = sns.barplot(data=df_count, x=df_count.iloc[:,0], y=list_names, palette=sns.color_palette("hls", 10))  # one color per class
plt.xlabel('count of target', fontsize=16)
plt.show()

del df_count, list_names

## **<span id="Feature" style="color:#023e8a;">5. Feature distributions</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
rows, cols = 56, 5
f, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(20, 200))
f.set_facecolor("#fff")
n_feat = 0
for row in tqdm(range(rows)):
    for col in range(cols):
        try:
            sns.kdeplot(x=NUM_FEATURES[n_feat], fill=True, alpha=1, linewidth=3, 
                                        edgecolor="#264653", data=df, ax=axs[row, col], color='w')
            axs[row, col].patch.set_facecolor("#619b8a")
            axs[row, col].patch.set_alpha(0.8)
            axs[row, col].grid(color="#264653", alpha=1, axis="both")
        except IndexError: # hide last empty graphs
            axs[row, col].set_visible(False)
        n_feat += 1

f.show()

## **<span id="PCA" style="color:#023e8a;">6. PCA analysis</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

**<span style="color:#023e8a;">First we need to standardize the data. PCA depends dramatically on data scaling.</span>**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(df.iloc[:,:-1])
X_train_std = sc.transform(df.iloc[:,:-1])

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=len(df.columns)-1)
X_train_pca = pca.fit_transform(X_train_std)

**<span style="color:#023e8a;">There is no reason to reduce the dimension with PCA. The cumulative explained variance grows almost linearly, so dropping components may lose important information.</span>**

In [None]:
exp_var_pca = pca.explained_variance_ratio_
cum_sum_eigenvalues = np.cumsum(exp_var_pca)

plt.step(range(0,len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid',label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
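
One way to put a number on the shape of this curve is to count how many components are needed to reach a chosen share of the variance. The sketch below uses toy variance ratios rather than the real `exp_var_pca`; the helper name `n_components_for` and the 0.85 threshold are assumptions made for illustration.

```python
import numpy as np

def n_components_for(explained_ratio, threshold=0.85):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    cum_var = np.cumsum(explained_ratio)
    idx = int(np.searchsorted(cum_var, threshold)) + 1
    return min(idx, len(cum_var))

flat = np.full(10, 0.1)                            # near-linear cumulative curve
steep = np.array([0.7, 0.2, 0.05, 0.03, 0.02])     # a few dominant components

print(n_components_for(flat), 'of', len(flat))
print(n_components_for(steep), 'of', len(steep))
```

With a near-linear curve almost all components are needed, which is the situation described above; with a steep curve a small prefix is enough.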

**<span style="color:#023e8a;">Look at the descriptive statistics of segments:</span>**
* `Mean`  
* `Median`  
* `Min`  
* `Max`  

## **<span id="DNAmean" style="color:#023e8a;">7. DNA segments by bacteria (mean)</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
df_aggr = df.groupby('target').mean().T
df_aggr.style.background_gradient(cmap='RdYlGn')\
       .format('{:.6f}')

## **<span id="DNAmedian" style="color:#023e8a;">8. DNA segments by bacteria (median)</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
df_aggr = df.groupby('target').median().T
df_aggr.style.background_gradient(cmap='RdYlGn')\
       .format('{:.6f}')

## **<span id="DNAmin" style="color:#023e8a;">9. DNA segments by bacteria (min)</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
df_aggr = df.groupby('target').min().T
df_aggr.style.background_gradient(cmap='RdYlGn')\
       .format('{:.6f}')

## **<span id="DNAmax" style="color:#023e8a;">10. DNA segments by bacteria (max)</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

In [None]:
df_aggr = df.groupby('target').max().T
df_aggr.style.background_gradient(cmap='RdYlGn')\
       .format('{:.6f}')

## **<span id="Classifier" style="color:#023e8a;">11. Classifier</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

**<span style="color:#023e8a;">There are duplicates in both train and test data. We drop them from the train data. Thanks to: </span>** 
[link](https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees#%E2%9C%85-Cross-validation-method)

In [None]:
train_nodup = pd.DataFrame(
    [list(tup) for tup in df.value_counts().index.values], 
    columns=df.columns
)

**<span style="color:#023e8a;"> After dropping duplicates, keep the number of occurrences of every unique row as its sample weight, so the classifier still sees the original row frequencies. </span>** 

In [None]:
train_nodup['sample_weight'] = df.value_counts().values
sample_weight = train_nodup['sample_weight']
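
The same deduplicate-and-count idea can be sketched on a toy frame. This version uses `groupby(...).size()` instead of `value_counts()`, which is a swapped-in alternative and not the approach used in the cells above; the toy column names are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'f1': [0.1, 0.1, 0.2, 0.1],
    'f2': [1.0, 1.0, 2.0, 1.0],
    'target': ['a', 'a', 'b', 'a'],
})

# one row per unique (features, target) combination, with its multiplicity
dedup = (toy.groupby(list(toy.columns), sort=False)
            .size()
            .reset_index(name='sample_weight'))
print(dedup)
```

The weights sum to the original number of rows, so no information about row frequency is lost.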

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.ensemble import ExtraTreesClassifier
X = train_nodup[df.columns[:-1]]
y = train_nodup[df.columns[-1]]

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y = pd.DataFrame(y)

In [None]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

In [None]:
y_preds, y_probas = [], []

for i, (i_train, i_test) in enumerate(skf.split(X, y)):
    X_train, y_train, sample_weight_train = X.iloc[i_train], y.iloc[i_train], sample_weight.iloc[i_train]
    X_val, y_val, sample_weight_val = X.iloc[i_test], y.iloc[i_test], sample_weight.iloc[i_test]
    
    model = ExtraTreesClassifier(n_estimators=300, random_state=42, n_jobs=-1)
    model.fit(X_train,  np.ravel(y_train), sample_weight_train)
        
    y_pred = model.predict(X_val)
    acc = accuracy_score(y_val, y_pred, sample_weight=sample_weight_val)
    print(f'Acc at fold {i}: {acc:.2%}\n')
    y_preds.append(model.predict(test))
    y_probas.append(model.predict_proba(test))
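
The fold accuracy above is weighted by the duplicate counts. The toy sketch below illustrates why: a weighted mean of per-row correctness on deduplicated rows gives the same value as plain accuracy on the rows expanded back to their original multiplicity. It uses NumPy only as a stand-in for `accuracy_score(..., sample_weight=...)`, and the arrays are invented for the example.

```python
import numpy as np

y_true = np.array([0, 1, 2])
y_pred = np.array([0, 1, 1])
weights = np.array([3, 1, 2])   # how many times each unique row occurred

# weighted accuracy on the deduplicated rows
weighted_acc = np.average(y_true == y_pred, weights=weights)

# plain accuracy after repeating each row by its weight
expanded_acc = np.mean(np.repeat(y_true, weights) == np.repeat(y_pred, weights))

print(weighted_acc, expanded_acc)
```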

## **<span id="Subs" style="color:#023e8a;">12. Submissions</span>**

[**<span style="color:#FEF1FE;background-color:#023e8a;border-radius: 5px;padding: 2px">Go to Table of Content</span>**](#Content)

**<span style="color:#023e8a;"> Submit the mode of the fold predictions and, as a second file, the argmax of the mean fold probabilities. </span>** 

In [None]:
from scipy.stats import mode
submission = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')
res = mode(y_preds, axis=0)[0]
res_mean = np.array(y_probas)
res_mean = res_mean.mean(axis=0).argmax(axis=1)
submission['target'] = le.inverse_transform(res.ravel())
submission.to_csv('submission.csv', index=False)
submission['target'] = le.inverse_transform(res_mean.ravel())
submission.to_csv('submission_mean.csv', index=False)
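
The two submissions can disagree, which a small toy example makes visible. The probabilities below are invented, and `np.bincount` is used here as a simple stand-in for `scipy.stats.mode`.

```python
import numpy as np

# probas[fold, sample, class]: 3 folds, 2 samples, 2 classes
probas = np.array([
    [[0.9, 0.1], [0.45, 0.55]],
    [[0.8, 0.2], [0.45, 0.55]],
    [[0.7, 0.3], [0.90, 0.10]],
])

fold_preds = probas.argmax(axis=2)   # hard label per fold and sample

# mode of the fold predictions (majority vote per sample)
hard = np.array([np.bincount(col).argmax() for col in fold_preds.T])

# argmax of the mean probabilities across folds
soft = probas.mean(axis=0).argmax(axis=1)

print(hard, soft)
```

For the second sample two folds narrowly prefer class 1 while one fold strongly prefers class 0, so the majority vote and the averaged probabilities pick different classes.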
