# **<span style='color:#8A0808'>Contents</span>**

* **Introduction**
* **Exploratory Data Analysis**
* **Model**

# **<span style='color:#8A0808'>Introduction</span>**

**Goal**: Classify 10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss ([Wood et al. 2020](https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full)). 

**Metric**: [categorization accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)

# **<span style='color:#8A0808'>Exploratory Data Analysis</span>**

For this challenge, you will be predicting bacteria species based on repeated lossy measurements of DNA snippets. Snippets of length 10 are analyzed using Raman spectroscopy that calculates the histogram of bases in the snippet. In other words, the DNA segment $ATATGGCCTT$ becomes $A_2T_4G_2C_2$.
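
As a minimal sketch of this compression, the label of a snippet can be obtained by counting each base. The helper name `snippet_to_label` is hypothetical and is not part of the competition code.

```python
from collections import Counter

def snippet_to_label(snippet: str) -> str:
    """Collapse a DNA snippet into its base-count label, e.g. 'A2T4G2C2'."""
    counts = Counter(snippet)
    # Bases are listed in the A, T, G, C order used by the feature names;
    # a base that never occurs gets a count of 0.
    return ''.join(f'{base}{counts[base]}' for base in 'ATGC')

label = snippet_to_label('ATATGGCCTT')
print(label)  # A2T4G2C2
```

Any two snippets with the same base counts map to the same label, which is where the information loss comes from.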

Each row of data contains a spectrum of histograms generated by repeated measurements of a sample, each row containing the output of all 286 histogram possibilities (e.g., $A_0T_0G_0C_{10}$ to $A_{10}T_0G_0C_0$), which then has a bias spectrum (of totally random ATGC) subtracted from the results.
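
The count of 286 and the shape of the bias spectrum can be checked with a short sketch. It assumes that "totally random ATGC" means each base is drawn uniformly and independently, so the bias of a histogram is its multinomial probability; the name `random_bias` is illustrative only.

```python
from math import comb, factorial

# Number of ways to split 10 bases among 4 letters (stars and bars): C(10 + 3, 3).
n_histograms = comb(13, 3)

def random_bias(a: int, t: int, g: int, c: int) -> float:
    """Probability of the counts (a, t, g, c) in a uniformly random 10-mer."""
    ways = factorial(10) // (factorial(a) * factorial(t) * factorial(g) * factorial(c))
    return ways / 4**10

# Enumerate every composition whose counts sum to 10.
compositions = [(a, t, g, 10 - a - t - g)
                for a in range(11)
                for t in range(11 - a)
                for g in range(11 - a - t)]

# The bias values form a probability distribution, so they should sum to 1.
total = sum(random_bias(*comp) for comp in compositions)
print(n_histograms, len(compositions), round(total, 12))
```

Under this assumption, balanced histograms have a much larger bias than extreme ones such as $A_{10}T_0G_0C_0$, whose bias is only $4^{-10}$.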

The data (both train and test) also contains simulated measurement errors (of varying rates) for many of the samples, which makes the problem more challenging.

**Files**
* train.csv (1.25 GB) - the training set, which contains the spectrum of 10-mer histograms for each sample
* test.csv (621.05 MB) - the test set; your task is to predict the bacteria species (target) for each row_id
* sample_submission.csv (3.2 MB) - a sample submission file in the correct format

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import re

import warnings
warnings.simplefilter('ignore')

In [None]:
train = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv', index_col=0)

## **<span style='color:#8A0808'>Target: 10 bacterial species</span>**

Credit: thanks to [Remek Kinas](https://www.kaggle.com/remekkinas) for this collection.

Source (description and pictures): Wikipedia

**Bacteroides fragilis** - https://en.wikipedia.org/wiki/Bacteroides_fragilis
Bacteroides fragilis is an anaerobic, Gram-negative, pleomorphic to rod-shaped bacterium.

![Bacteroides fragilis](https://i.ibb.co/vx37m7N/Bacteroides-Fragilis-Gram.jpg)

It is part of the normal microbiota of the human colon and is generally commensal, but it can cause infection if displaced into the bloodstream or surrounding tissue following surgery, disease, or trauma.


**Streptococcus pyogenes** - https://en.wikipedia.org/wiki/Streptococcus_pyogenes
Streptococcus pyogenes is a species of Gram-positive, aerotolerant bacteria in the genus Streptococcus. These bacteria are extracellular, and made up of non-motile and non-sporing cocci (round cells) that tend to link in chains. They are clinically important for humans, as they are an infrequent, but usually pathogenic, part of the skin microbiota that can cause Group A streptococcal infection. S. pyogenes is the predominant species harboring the Lancefield group A antigen, and is often called group A Streptococcus (GAS). However, both Streptococcus dysgalactiae and the Streptococcus anginosus group can possess group A antigen as well. Group A streptococci, when grown on blood agar, typically produce small (2–3 mm) zones of beta-hemolysis, a complete destruction of red blood cells. The name group A (beta-hemolytic) Streptococcus (GABHS) is thus also used.

![Streptococcus pyogenes](https://i.ibb.co/GRhzXHd/Streptococcus-pyogenes.jpg)

The species name is derived from Greek words meaning 'a chain' (streptos) of berries (coccus [Latinized from kokkos]) and pus (pyo)-forming (genes), since a number of infections caused by the bacterium produce pus. The main criterion for differentiation between Staphylococcus spp. and Streptococcus spp. is the catalase test. Staphylococci are catalase positive whereas streptococci are catalase-negative. S. pyogenes can be cultured on fresh blood agar plates. Under ideal conditions, it has an incubation period of 1 to 3 days.

An estimated 700 million GAS infections occur worldwide each year. While the overall mortality rate for these infections is 0.1%, over 650,000 of the cases are severe and invasive, with these cases having a mortality rate of 25%. Early recognition and treatment are critical; diagnostic failure can result in sepsis and death.


**Streptococcus pneumoniae** - https://en.wikipedia.org/wiki/Streptococcus_pneumoniae
Streptococcus pneumoniae, or pneumococcus, is a Gram-positive, spherical, alpha-hemolytic (under aerobic conditions) or beta-hemolytic (under anaerobic conditions), aerotolerant anaerobic member of the genus Streptococcus. The cells are usually found in pairs (diplococci), do not form spores, and are non-motile. As a significant human pathogenic bacterium, S. pneumoniae was recognized as a major cause of pneumonia in the late 19th century, and is the subject of many humoral immunity studies.

![Streptococcus ](https://i.ibb.co/zPKBrJB/Pneumococcus-CDC-PHIL-ID1003.jpg)

Streptococcus pneumoniae resides asymptomatically in healthy carriers, typically colonizing the respiratory tract, sinuses, and nasal cavity. However, in susceptible individuals with weaker immune systems, such as the elderly and young children, the bacterium may become pathogenic and spread to other locations to cause disease. It spreads by direct person-to-person contact via respiratory droplets and by auto-inoculation in persons carrying the bacteria in their upper respiratory tracts. It can be a cause of neonatal infections.

**Campylobacter jejuni** - https://en.wikipedia.org/wiki/Campylobacter_jejuni
Campylobacter jejuni (/ˈkæmpɪloʊˌbæktər dʒəˈdʒuːni/) is one of the most common causes of food poisoning in Europe and in the United States. The vast majority of cases occur as isolated events, not as part of recognized outbreaks. Active surveillance through the Foodborne Diseases Active Surveillance Network (FoodNet) indicates that about 20 cases are diagnosed each year for each 100,000 people in the US, while many more cases are undiagnosed or unreported; the CDC estimates a total of 1.5 million infections every year. The European Food Safety Authority reported 246,571 cases in 2018, and estimated approximately nine million cases of human campylobacteriosis per year in the European Union.

![Campylobacter jejuni](https://i.ibb.co/7KDgWym/ARS-Campylobacter-jejuni.jpg)

Campylobacter jejuni is in a genus of bacteria that is among the most common causes of bacterial infections in humans worldwide. Campylobacter means "curved rod", deriving from the Greek kampylos (curved) and baktron (rod). Of its many species, C. jejuni is considered one of the most important from both a microbiological and public health perspective.

**Salmonella enterica** - https://en.wikipedia.org/wiki/Salmonella_enterica
Salmonella enterica (formerly Salmonella choleraesuis) is a rod-shaped, flagellate, facultative anaerobic, Gram-negative bacterium and a species of the genus Salmonella. A number of its serovars are serious human pathogens.

![Salmonella ](https://i.ibb.co/xMbfx62/1280px-Salmonella-enterica-serovar-typhimurium-01.jpg)

**Escherichia coli** - https://en.wikipedia.org/wiki/Escherichia_coli
Escherichia coli (/ˌɛʃəˈrɪkiə ˈkoʊlaɪ/), also known as E. coli (/ˌiː ˈkoʊlaɪ/), is a Gram-negative, facultative anaerobic, rod-shaped, coliform bacterium of the genus Escherichia that is commonly found in the lower intestine of warm-blooded organisms. Most E. coli strains are harmless, but some serotypes (EPEC, ETEC etc.) can cause serious food poisoning in their hosts, and are occasionally responsible for food contamination incidents that prompt product recalls. The harmless strains are part of the normal microbiota of the gut, and can benefit their hosts by producing vitamin K2, and preventing colonisation of the intestine with pathogenic bacteria, having a mutualistic relationship. E. coli is expelled into the environment within fecal matter. The bacterium grows massively in fresh fecal matter under aerobic conditions for 3 days, but its numbers decline slowly afterwards.

![Escherichia ](https://i.ibb.co/XzM8NrX/1280px-E-coli-at-10000x-original.jpg)

**Enterococcus hirae** - https://en.wikipedia.org/wiki/Enterococcus
Enterococcus is a large genus of lactic acid bacteria of the phylum Firmicutes. Enterococci are gram-positive cocci that often occur in pairs (diplococci) or short chains, and are difficult to distinguish from streptococci on physical characteristics alone. Two species are common commensal organisms in the intestines of humans: E. faecalis (90–95%) and E. faecium (5–10%). Rare clusters of infections occur with other species, including E. casseliflavus, E. gallinarum, and E. raffinosus.

![Enterococcus_hirae](https://i.ibb.co/Dzs7FvB/Enterococcus-histological-pneumonia-01.png)

**Escherichia fergusonii** - https://en.wikipedia.org/wiki/Escherichia_fergusonii
Escherichia fergusonii is a Gram-negative, rod-shaped species of bacterium. Closely related to the well-known species Escherichia coli, E. fergusonii was first isolated from samples of human blood. The species is named for American microbiologist William W. Ferguson.
Some strains of E. fergusonii are pathogenic. It is known to infect open wounds in humans and may also cause bacteraemia or urinary tract infections. Strains causing these infections have been found to be highly resistant to the antibiotic ampicillin, though some are also resistant to gentamicin and chloramphenicol. An antibiotic-resistant strain of the species was found to be associated with an incidence of cystitis in a 52-year-old woman in 2008.

**Staphylococcus aureus** - https://en.wikipedia.org/wiki/Staphylococcus_aureus
Staphylococcus aureus is a Gram-positive round-shaped bacterium, a member of the Firmicutes, and is a usual member of the microbiota of the body, frequently found in the upper respiratory tract and on the skin. It is often positive for catalase and nitrate reduction and is a facultative anaerobe that can grow without the need for oxygen. Although S. aureus usually acts as a commensal of the human microbiota it can also become an opportunistic pathogen, being a common cause of skin infections including abscesses, respiratory infections such as sinusitis, and food poisoning. Pathogenic strains often promote infections by producing virulence factors such as potent protein toxins, and the expression of a cell-surface protein that binds and inactivates antibodies. S. aureus is one of the leading pathogens for deaths associated with Antimicrobial resistance and the emergence of antibiotic-resistant strains such as methicillin-resistant S. aureus (MRSA) is a worldwide problem in clinical medicine. Despite much research and development, no vaccine for S. aureus has been approved.

![Staphylococcus ](https://i.ibb.co/Kzm0DvQ/1280px-Staphylococcus-aureus-VISA-2.jpg)

**Klebsiella pneumoniae** - https://en.wikipedia.org/wiki/Klebsiella_pneumoniae
Klebsiella pneumoniae is a Gram-negative, non-motile, encapsulated, lactose-fermenting, facultative anaerobic, rod-shaped bacterium. It appears as a mucoid lactose fermenter on MacConkey agar.

![Klebsiella ](https://i.ibb.co/GTPQmYP/Klebsiella-pneumoniae-01.png)

Although found in the normal flora of the mouth, skin, and intestines, it can cause destructive changes to human and animal lungs if aspirated, specifically to the alveoli resulting in bloody, brownish or yellow colored jelly like sputum. In the clinical setting, it is the most significant member of the genus Klebsiella of the Enterobacteriaceae. K. oxytoca and K. rhinoscleromatis have also been demonstrated in human clinical specimens. In recent years, Klebsiella species have become important pathogens in nosocomial infections.
It naturally occurs in the soil, and about 30% of strains can fix nitrogen in anaerobic conditions. As a free-living diazotroph, its nitrogen-fixation system has been much-studied, and is of agricultural interest, as K. pneumoniae has been demonstrated to increase crop yields in agricultural conditions.

In [None]:
target = train.target

The 10 bacterial species are almost uniformly distributed within the training set. Indeed, the share of each class deviates from 10% by less than 0.1 percentage points.
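
The deviation from a uniform split can be computed directly from the class counts. The sketch below uses a made-up three-class series in place of `train.target`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy labels standing in for train.target (an assumption for illustration only).
toy_target = pd.Series(['a'] * 34 + ['b'] * 33 + ['c'] * 33)

# Share of each class in percent, sorted by label so the order is reproducible.
share = toy_target.value_counts(normalize=True).sort_index() * 100

# Deviation from a perfectly uniform split, in percentage points.
deviation = share - 100 / toy_target.nunique()
print(deviation.round(2))
```

Sorting by label keeps the values aligned with their class names, which matters when they are passed to a bar plot.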

In [None]:
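# Class shares in percent, sorted by label so that bars and tick labels line up
# (target.unique() returns labels in order of appearance, not in sorted order)
class_share = target.value_counts().sort_index() / len(train) * 100
plt.figure(figsize=(10,7))
plt.bar(class_share.index, height=class_share - 10, color='#8A8A08')
plt.xticks(rotation=90, fontsize=16)
plt.ylabel('Deviation from 10% (percentage points)', fontsize=16)
plt.title('Distribution of the bacterial species', fontsize=16)
plt.ylim(-0.2,0.2)
plt.show()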

In [None]:
lb = LabelEncoder()
train.target = lb.fit_transform(target)

## **<span style='color:#8A0808'>Feature names</span>**

There are 286 features, named from $A_0T_0G_0C_{10}$ to $A_{10}T_0G_0C_0$, which can be generated by the following function:

In [None]:
def generate_feature_names():
    feature_names = []
    for i in range(11):
        for j in range(11-i):
            for k in range(11-i-j):
                feature_names.append(f'A{i}T{j}G{k}C{10-i-j-k}')
    return feature_names

feature_names = generate_feature_names()

# Verify that the generated feature names coincide with the feature columns of the train dataset
# (the last column is the target).
assert feature_names == train.columns[:-1].tolist()

The counts of A, T, G and C encoded in each feature name, in column order, are shown in the figures below:

In [None]:
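# Parse the four base counts out of each feature name, e.g. 'A2T4G2C2' -> 2, 4, 2, 2.
# A list (not a set) fixes the column order so each count lands in the right column,
# and the counts are cast to int so that they are plotted on a numeric axis.
df = pd.DataFrame([re.split('A|T|G|C', col)[1:] for col in feature_names],
                  columns=['A', 'T', 'G', 'C']).astype(int)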

plt.figure(figsize=(20,10))

plt.subplot(2,2,1)
plt.plot(df['A'], 'r.')
plt.title('A', fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

plt.subplot(2,2,2)
plt.plot(df['T'], 'b.')
plt.title('T', fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

plt.subplot(2,2,3)
plt.plot(df['G'], 'g.')
plt.title('G', fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

plt.subplot(2,2,4)
plt.plot(df['C'], 'y.')
plt.title('C', fontsize=16)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

plt.show()

## **<span style='color:#8A0808'>Features</span>**

In [None]:
train[feature_names] = train[feature_names].astype('float32')

In [None]:
print(f'Feature values range from {train[feature_names].min().min()} to {train[feature_names].max().max()}')

In [None]:
plt.figure(figsize=(10,7))
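# Per-feature statistics; the DataFrame methods always reduce column by column,
# whereas np.max(df) and friends reduce over the whole frame in recent pandas versions
plt.plot(train[feature_names].max(), 'b', label='max')
plt.plot(train[feature_names].min(), 'r', label='min')
plt.plot(train[feature_names].mean(), 'g', label='mean')
plt.plot(train[feature_names].std(ddof=0), 'k', label='std')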
plt.xticks([])
plt.ylabel('Feature min, max, mean and std', fontsize=16)
plt.legend()
plt.show()

We can observe that the feature data is discrete. Indeed, each feature has very few unique values compared to the 200k rows in the training set (127k without duplicates).
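
Both quantities mentioned here (distinct rows and unique values per feature) can be obtained with standard pandas calls. The sketch below runs on a tiny made-up frame standing in for `train[feature_names]`.

```python
import pandas as pd

# Toy frame standing in for train[feature_names] (values are made up).
toy = pd.DataFrame({
    'f1': [0.1, 0.1, 0.2, 0.1],
    'f2': [0.0, 0.0, 0.5, 0.0],
})

n_rows = len(toy)
n_distinct_rows = len(toy.drop_duplicates())  # rows left after removing exact duplicates
unique_per_feature = toy.nunique()            # number of distinct values in each column
print(n_rows, n_distinct_rows, unique_per_feature.to_dict())
```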

In [None]:
plt.figure(figsize=(10,7))
train[feature_names].nunique().hist(bins=50, color='#A8A808')
plt.xlabel('Number of unique values', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.show()

Correlation among the 10 features with the most unique values and among the 10 features with the fewest:

In [None]:
def feature_corr(df):
    # Rank the features by their number of unique values within df
    n_unique = df[feature_names].nunique().sort_values(ascending=True)
    low_unique_features = n_unique[:10].index
    high_unique_features = n_unique[-10:].index

    plt.figure(figsize=(15,6))

    # Use df (not the global train) so that a subset is analysed on its own rows
    plt.subplot(1,2,1)
    sns.heatmap(df[high_unique_features].corr(), annot=True, cmap='hot', vmin=-1, vmax=1)
    plt.title('Correlation between the 10 features with the most unique values')
    plt.tight_layout()

    plt.subplot(1,2,2)
    sns.heatmap(df[low_unique_features].corr(), annot=True, cmap='hot', vmin=-1, vmax=1)
    plt.title('Correlation between the 10 features with the fewest unique values')
    plt.tight_layout()

    plt.show()

In [None]:
feature_corr(train)

The number of unique feature values within each class is shown in the figures below:

In [None]:
plt.figure(figsize=(15,20))
for i in range(10):
    features_i = train[feature_names][train.target==i]
    
    plt.subplot(5,2,i+1)
    features_i.nunique().hist(bins=50, color='#A8A808')
    plt.title(f'Class {i}', fontsize=16)
    plt.tight_layout()
plt.show()
    

Correlation among the features with the most and the fewest unique values within each class:

In [None]:
for i in range(10):
    print(f'Class {i}\n')
    feature_corr(train[train.target==i])    

The discrete nature of the features has been explained by [AmbrosM](https://www.kaggle.com/ambrosm), who converted the train dataset back to the original integer format (see also the original paper of [Wood et al. 2020](https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full)). In this format, each row sums to 1e6 and has a greatest common divisor (GCD) of 1, 10, 1000 or 10000.
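
The two steps (undoing the bias subtraction and taking a row-wise GCD) can be illustrated on made-up numbers. The rows below are invented for the example and do not come from the dataset; the bias value assumes uniformly random bases.

```python
import numpy as np

# Step 1: undo the bias subtraction for a single feature.
bias = 1 / 4**10                       # bias of A10T0G0C0 under uniformly random bases
stored = 20_000 / 1_000_000 - bias     # value as it would appear after bias subtraction
restored = round((stored + bias) * 1_000_000)

# Step 2: row-wise GCD of made-up integer rows (each row sums to 1,000,000).
rows = np.array([
    [10_000, 490_000, 500_000],    # every entry is a multiple of 10,000
    [333_330, 333_330, 333_340],   # every entry is a multiple of 10
    [123_457, 376_543, 500_000],   # no common divisor larger than 1
])

# np.gcd.reduce folds the pairwise GCD across each row.
row_gcd = np.gcd.reduce(rows, axis=1)
print(restored, rows.sum(axis=1), row_gcd)
```

A large GCD means that the row only takes values on a coarse grid, which is consistent with the small number of unique values observed per feature.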

In [None]:
# The code below is adapted from https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense
from math import factorial  # np.math is no longer available in recent NumPy releases

def bias_of(s):
    # Parse the base counts from a feature name such as 'A2T4G2C2'
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    # Multinomial probability of this histogram for uniformly random bases
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

def gcd_of_all(df_i, feature_names):
    gcd = df_i[feature_names[0]]
    for col in feature_names[1:]:
        gcd = np.gcd(gcd, df_i[col])
    return gcd

# Convert to original integer format
train_i = pd.DataFrame({col: ((train[col] + bias_of(col)) * 1000000).round().astype(int) for col in feature_names})

# Compute the greatest common divisor
train['GCD'] = gcd_of_all(train_i, feature_names)

print('Original integer format of the train dataset:\n')
display(train_i.head(3))

The train dataset is evenly distributed across the four GCD groups:

In [None]:
plt.figure(figsize=(10,10))
train['GCD'].value_counts().plot(kind='pie',autopct='%.2f', cmap = "Accent", fontsize=16)
plt.show()

The 10 classes are less balanced within each GCD group than in the whole dataset.
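
One compact way to inspect the balance per group is a row-normalised contingency table. The sketch below uses made-up values standing in for `train.GCD` and `train.target`.

```python
import pandas as pd

# Made-up labels and GCD groups standing in for train.target and train.GCD.
toy = pd.DataFrame({
    'GCD':    [1, 1, 1, 1, 10, 10, 10, 10],
    'target': [0, 0, 1, 1, 0, 1, 1, 1],
})

# Row-normalised contingency table: each row gives the class shares (in %)
# within one GCD group.
share = pd.crosstab(toy['GCD'], toy['target'], normalize='index') * 100
print(share)
```

In this toy table the group with GCD 1 is perfectly balanced, while the group with GCD 10 is not, which is the kind of difference the bar plots are meant to reveal.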

In [None]:
plt.figure(figsize=(15,10))

for idx, gcd in enumerate([1,10,1000,10000]):
    train_sub = train[train.GCD==gcd]
    # Sort by class label so that bars and tick labels line up
    class_share = train_sub.target.value_counts().sort_index() / len(train_sub) * 100
    plt.subplot(2,2,idx+1)
    plt.bar(class_share.index, height=class_share - 10, color='#8A8A08')
    plt.xticks(rotation=90, fontsize=16)
    plt.ylabel('Deviation from 10% (percentage points)', fontsize=16)
    plt.title(f'GCD={gcd}', fontsize=16)
    plt.ylim(-0.2,0.2)
plt.tight_layout()
plt.show()

In [None]:
for i in [1,10,1000,10000]:
    print(f'GCD = {i}\n')
    feature_corr(train[train.GCD==i])

# 🏗 This notebook is under construction