# Tabular Playground Series Feb. 2022
* Introduction to the Competition
* Loading Libraries
* Exploratory Data Analysis (EDA)
* Model Training and Inference

# 1. Introduction to the Competition
This month's competition uses genetic data: the task is to classify 10 bacteria species from the results of a genomic analysis.
In this technique, 10-mer snippets of DNA are sampled and analyzed to give a histogram of base counts.
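
To make the idea of a base-count histogram concrete, here is a minimal sketch of how such features could be computed from a DNA string. The function name `kmer_composition_histogram` and the key format `A1T1G1C1` are assumptions made for illustration; the competition's actual preprocessing may differ.

```python
from collections import Counter


def kmer_composition_histogram(sequence, k=10):
    """Count how often each base composition occurs among all k-mers.

    Hypothetical helper: every window of length k is reduced to the number
    of A, T, G and C bases it contains, and the counts are normalised so
    that they sum to 1.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        key = "A{}T{}G{}C{}".format(
            kmer.count("A"), kmer.count("T"), kmer.count("G"), kmer.count("C")
        )
        counts[key] += 1
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


# Toy example with 4-mers: every window holds exactly one of each base
hist = kmer_composition_histogram("ATGCATGCATGCAT", k=4)
print(hist)
```

With k = 10 and four bases there are 286 possible compositions, so a histogram of this kind stays compact no matter how long the genome is.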
## 1.1. 🦠 Bacteria species (classes)
* [Streptococcus pyogenes](https://en.wikipedia.org/wiki/Streptococcus_pyogenes)
* [Salmonella enterica](https://ru.wikipedia.org/wiki/Salmonella_enterica)
* [Escherichia coli](https://en.wikipedia.org/wiki/Escherichia_coli)
* [Enterococcus hirae](https://en.wikipedia.org/wiki/Enterococcus_hirae)
* [Campylobacter jejuni](https://en.wikipedia.org/wiki/Campylobacter_jejuni)
* [Streptococcus pneumoniae](https://en.wikipedia.org/wiki/Streptococcus_pneumoniae)
* [Staphylococcus aureus](https://en.wikipedia.org/wiki/Staphylococcus_aureus)
* [Escherichia fergusonii](https://en.wikipedia.org/wiki/Escherichia_fergusonii)
* [Bacteroides fragilis](https://en.wikipedia.org/wiki/Bacteroides_fragilis)
* [Klebsiella pneumoniae](https://en.wikipedia.org/wiki/Klebsiella_pneumoniae)

TODO: The next version of this notebook will investigate each bacteria species in more detail.

# 2. 📚 Import Libraries & Reduce Memory

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import mode
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
import warnings
import gc
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
warnings.simplefilter('ignore')
KAGGLE_DIR = r'../input/tabular-playground-series-feb-2022/'
LOCAL_DIR = r''
KAGGLE = True
RS = 69420

## 2.1. Reduce Memory Usage

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
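
A word of caution on the float branch above: `float16` keeps only about three significant decimal digits and cannot represent very small magnitudes. The sketch below uses made-up values (not taken from the competition data) to show what can be lost; whether this matters here depends on the scale of the features, which this notebook does not check.

```python
import numpy as np

# Hypothetical values chosen only to illustrate float16 limits
tiny_value = 1e-8
close_to_one = 1.0001

tiny_as_half = np.float16(tiny_value)      # underflows to zero
tiny_as_single = np.float32(tiny_value)    # still representable
one_as_half = np.float16(close_to_one)     # rounds to exactly 1.0

print("float16:", tiny_as_half, "float32:", tiny_as_single)
print("float16 machine epsilon:", np.finfo(np.float16).eps)
```

If the features turn out to be very small or to differ only in late digits, skipping the `float16` branch and stopping at `float32` is the safer choice.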

In [None]:
%%time
# Pick the data directory once instead of duplicating the loading code
DATA_DIR = KAGGLE_DIR if KAGGLE else LOCAL_DIR

print(f"{'*'*10} Loading Training Data... {'*'*10}")
df = pd.read_csv(DATA_DIR+'train.csv', index_col=0).pipe(reduce_mem_usage)
print(f"{'*'*10} Loading Testing Data... {'*'*10}")
test = pd.read_csv(DATA_DIR+'test.csv', index_col=0).pipe(reduce_mem_usage)
sub = pd.read_csv(DATA_DIR+'sample_submission.csv').pipe(reduce_mem_usage)

In [None]:
#del df 
#gc.collect()

# 3. 🔍 Exploratory Data Analysis

In [None]:
df.head()

In [None]:
df.info()

In [None]:
print('Train set - dimensions:\t', df.shape)
print('Test set - dimensions:\t', test.shape)

In [None]:
print(f"Missing value of Train set:{df.isnull().sum().sum()} and Missing value of Test set: {test.isnull().sum().sum()}")

In [None]:
target_dist = df["target"].value_counts()

In [None]:
fig = px.pie(df,values=target_dist,names=target_dist.index,
             color_discrete_sequence=px.colors.sequential.RdBu,
            hole=0.1)
fig.show()

In [None]:
sub25 = df.nunique()[df.nunique() < 25][:-1]
sub25

In [None]:
cat_feat = sub25.index.tolist()
all_feat = df.columns.difference(cat_feat)[:-1] # drop the last (sorted) entry, which is the target and not a feature
df.columns.difference(cat_feat)[-1]

In [None]:
fig = px.pie(df,names = ["Categorical Features","Continuous Features"],
             values = [len(cat_feat),len(all_feat)],
             hole = 0.4,
            color_discrete_sequence=px.colors.sequential.RdBu)
fig.show()

# 4. Preprocessing & Cross Validation

In [None]:
from sklearn.preprocessing import LabelEncoder
feature = df.columns[df.columns != "target"]
le = LabelEncoder()
X = df[feature]
y = pd.Series(le.fit_transform(df["target"]), name="target")  # 1-D target, as scikit-learn expects
print(f"X shape:{X.shape} & y Shape:{y.shape}")

# 5. Modeling & Feature Importance
We use ExtraTreesClassifier as the model for this case. So what is ExtraTreesClassifier, and how does it differ from RandomForest?

![](https://miro.medium.com/max/640/0*4VpGqWJUJnmD2mm0.jpg)

ExtraTreesClassifier is an ensemble learning method built on decision trees. Like RandomForest, it randomizes certain decisions and subsets of the data to reduce overfitting.
Let’s look at some tree-based methods ordered from high to low variance, ending with ExtraTreesClassifier.
## 5.1. Trees:
### 5.1.1. Decision Tree (High Variance)
A single decision tree usually overfits the data it learns from because it learns only one pathway of decisions. As a result, its predictions on new data are often inaccurate.
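
The overfitting described above is easy to reproduce. The following is a minimal sketch on synthetic data (generated with `make_classification`, not the competition data): an unconstrained tree memorises the noisy training labels but does worse on held-out rows. All `*_demo` names are made up for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise (flip_y), so perfect training
# accuracy can only come from memorising the noise
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Default settings: the tree grows until every leaf is pure
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree_demo.score(X_tr, y_tr)
valid_acc = tree_demo.score(X_va, y_va)
print(f"train acc: {train_acc:.3f}  validation acc: {valid_acc:.3f}")
```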
### 5.1.2. Random Forest (Medium Variance)
Random forest models reduce the risk of overfitting by introducing randomness in three ways:
* building multiple trees (n_estimators)
* drawing observations with replacement (i.e., a bootstrapped sample)
* splitting nodes on the best split among a random subset of the features selected at every node

![](https://1.bp.blogspot.com/-Ax59WK4DE8w/YK6o9bt_9jI/AAAAAAAAEQA/9KbBf9cdL6kOFkJnU39aUn4m8ydThPenwCLcBGAsYHQ/s0/Random%2BForest%2B03.gif)
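
To see what drawing observations with replacement means in practice, here is a small NumPy sketch. It is only an illustration of bootstrapping, not scikit-learn's internal code; `n_rows` is an arbitrary number chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10_000  # arbitrary size for the demo

# A bootstrapped sample: n_rows row indices drawn with replacement
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Some rows are drawn several times, others not at all
unique_fraction = np.unique(bootstrap_idx).size / n_rows
print(f"share of distinct rows in the sample: {unique_fraction:.3f}")
```

In expectation roughly 63% of the rows end up in each bootstrapped sample, so every tree in the forest is trained on a slightly different view of the data.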
### 5.1.3. Extra Trees (Low Variance)
Extra Trees is like Random Forest in that it builds multiple trees and splits nodes using random subsets of features, but with two key differences: it does not bootstrap observations by default (each tree is trained on the whole training set), and nodes are split on randomly drawn split points rather than on the best possible ones. In summary, ExtraTrees:
* builds multiple trees with bootstrap = False by default, so every tree sees all observations instead of a bootstrapped sample
* splits nodes on random thresholds drawn for a random subset of the features selected at every node, keeping the best of those random candidates

In Extra Trees, randomness comes not from bootstrapping the data but from the random splits applied to all observations.
The name ExtraTrees stands for Extremely Randomized Trees.
![](https://www.researchgate.net/publication/346995264/figure/fig1/AS:969705405812741@1608207193473/The-structure-of-ExtraTree.png)
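
The difference in defaults can be checked directly in scikit-learn. The sketch below compares the two estimators on synthetic data; the dataset, the `*_demo` names and the small `n_estimators` are assumptions chosen to keep the demo fast, and the scores say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic 3-class problem, only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=0
)

models_demo = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

results_demo = {}
for name, model in models_demo.items():
    # 3-fold cross-validated accuracy
    results_demo[name] = cross_val_score(model, X_demo, y_demo, cv=3).mean()
    print(f"{name}: bootstrap={model.bootstrap}, mean CV acc={results_demo[name]:.3f}")
```

Which of the two scores higher depends on the dataset, so it is worth trying both before settling on one.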


In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
import time

n_splits = 10
# Fix the seed (RS is defined in the imports cell) so the folds are reproducible
folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RS)

y_pred = []  # test-set predictions, one array per fold
score = []   # validation accuracy per fold
for fold, (train_id, valid_id) in enumerate(folds.split(X, y)):
    print(f"Fold {fold}")
    start_time = time.time()
    # Split the data for this fold
    X_train, y_train = X.iloc[train_id], y.iloc[train_id]
    X_valid, y_valid = X.iloc[valid_id], y.iloc[valid_id]

    # Create the model; grid search or Optuna tuning is planned for later
    etc = ExtraTreesClassifier(n_estimators=1000, n_jobs=-1, random_state=RS)
    # Train on this fold; np.ravel passes the target as a 1-D array
    etc.fit(X_train, np.ravel(y_train))

    # Evaluate on the held-out part of this fold
    val_pred = etc.predict(X_valid)
    valid_score_acc = accuracy_score(y_valid, val_pred)
    score.append(valid_score_acc)
    run_time = time.time() - start_time
    print(f"fold acc: {valid_score_acc:.5f} run time: {run_time:.1f}s mean acc so far: {np.mean(score):.5f}")
    # Predict the test set with this fold's model; the per-fold predictions are ensembled later
    y_pred.append(etc.predict(test))
print(f"Mean acc score: {np.mean(score)}")

## 5.2. Feature Importance

In [None]:
df_feature_imp = pd.DataFrame({
    'feature': X.columns, 
    'importance': etc.feature_importances_
})

feature_imp_25 = df_feature_imp.sort_values(
    by='importance', ascending=False
).iloc[:25].reset_index(drop=True)

fig = go.Figure(
    go.Bar(
        x=feature_imp_25.importance,
        y=feature_imp_25.feature,
        orientation='h',
        marker=dict(color=feature_imp_25.importance)
    )
)

fig.update_layout(
    width=1000, height=1000,
    yaxis=dict(autorange='reversed')
)
fig.show()

# Submission

In [None]:
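# Majority vote across folds; np.ravel gives a 1-D result on both old and new SciPy versions
y_pred = np.ravel(mode(np.array(y_pred), axis=0).mode)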
y_pred = le.inverse_transform(y_pred)
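
The ensembling step above is a majority vote: for every test row, the class predicted by most fold models wins. The sketch below shows the same idea with plain NumPy on made-up predictions; `majority_vote` and `fold_preds_demo` are hypothetical names used only for this illustration.

```python
import numpy as np


def majority_vote(fold_preds, n_classes):
    """Return the most frequent label per column (one column per sample)."""
    # counts has shape (n_classes, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, fold_preds, minlength=n_classes)
    # On ties, argmax keeps the smallest label
    return counts.argmax(axis=0)


# Three folds (rows) predicting three samples (columns)
fold_preds_demo = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [2, 1, 1],
])
voted_demo = majority_vote(fold_preds_demo, n_classes=3)
print(voted_demo)
```

Averaging the outputs of `predict_proba` across folds is a common alternative that also uses how confident each fold model is.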

In [None]:
submission = pd.read_csv("../input/tabular-playground-series-feb-2022/sample_submission.csv")
submission["target"] = y_pred
submission

In [None]:
submission.to_csv("submission.csv",index=False)

# 6. References
[Trees-namanbhandari.medium](https://medium.com/@namanbhandari/extratreesclassifier-8e7fc0502c7)

[Quantdare](https://quantdare.com/what-is-the-difference-between-extra-trees-and-random-forest/)

### I hope you had a great time looking through my notebook. Have a good day!
### Don't forget to point out my shortcomings and give an upvote!