# Initial Data Exploration

Here I unpack the provided data and take a first look at it.

[Competition description](https://www.drivendata.org/competitions/63/genetic-engineering-attribution/page/165/)

In [3]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import permutations
import sys
import os
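# Make the project root importable so the helper module can be found
sys.path.append('../')
# Project helpers; get_ngram_features and top10_accuracy_scorer are used below
from src.functions import *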

The training data is provided as `train_values.csv` and `train_labels.csv`, both located in the data folder.

The site also provides `test_values.csv` from which competitors can generate their submissions.

In [4]:
X = pd.read_csv('../data/train_values.csv').set_index('sequence_id')
y = pd.read_csv('../data/train_labels.csv').set_index('sequence_id')
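Before going further it's worth checking that the two frames line up. This is a minimal sketch on made-up rows (the toy CSV contents, `lab_1`/`lab_2`, and the `*_toy` names are my own stand-ins, not the real files): it confirms that both frames share the same `sequence_id` index and that every row has exactly one positive label.

```python
import io
import pandas as pd

# Toy stand-ins for train_values.csv and train_labels.csv (made-up rows)
values_csv = io.StringIO(
    "sequence_id,sequence,copy_number_high_copy\n"
    "AAA11,ACGT,1.0\n"
    "BBB22,GGCA,0.0\n"
)
labels_csv = io.StringIO(
    "sequence_id,lab_1,lab_2\n"
    "AAA11,1.0,0.0\n"
    "BBB22,0.0,1.0\n"
)

X_toy = pd.read_csv(values_csv).set_index('sequence_id')
y_toy = pd.read_csv(labels_csv).set_index('sequence_id')

# Same ids in the same order, and exactly one lab flagged per sequence
same_ids = X_toy.index.equals(y_toy.index)
one_label_each = bool((y_toy.sum(axis=1) == 1).all())
print(same_ids, one_label_each)
```

The same two checks should carry over to the real `X` and `y`, assuming the labels file is one-hot encoded as the data readme describes.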

A full overview of the labels and values is in the [data readme](../data/README.md).

In [5]:
X.head()

sequence_id,sequence,bacterial_resistance_ampicillin,bacterial_resistance_chloramphenicol,bacterial_resistance_kanamycin,bacterial_resistance_other,bacterial_resistance_spectinomycin,copy_number_high_copy,copy_number_low_copy,copy_number_unknown,growth_strain_ccdb_survival,...,species_budding_yeast,species_fly,species_human,species_mouse,species_mustard_weed,species_nematode,species_other,species_rat,species_synthetic,species_zebrafish
9ZIMC,CATGCATTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCA...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5SAQC,GCTGGATGGTTTGGGACATGTGCAGCCCCGTCTCTGTATGGAGTGA...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
E7QRO,NNCCGGGCTGTAGCTACACAGGGCGGAGATGAGAGCCCTACGAAAG...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
CT5FP,GCGGAGATGAAGAGCCCTACGAAAGCTGAGCCTGCGACTCCCGCAG...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
7PTD8,CGCGCATTACTTCACATGGTCCTCAAGGGTAACATGAAAGTGATCC...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


First, I'll reproduce the steps that Khuyen Tran used in the official [benchmark](https://www.drivendata.co/blog/genetic-attribution-benchmark) for this competition.

In [6]:
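# Collect every distinct character that appears in any sequence
bases = set(''.join(X.sequence.values))
seq_length = 4
# Ordered arrangements of 4 distinct bases (no base repeats within a subsequence)
subsequences = [''.join(perm) for perm in permutations(bases, seq_length)]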

In [7]:
len(subsequences)

120
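The count of 120 follows from `permutations` picking 4 distinct characters out of 5, in order: 5 × 4 × 3 × 2 = 120. Here's a small sketch of that arithmetic, assuming the five characters are A, C, G, T and N (my inference from the 120 above and the `N` visible in the sample rows). It also shows what `permutations` leaves out compared with `itertools.product`, which allows repeated bases.

```python
from itertools import permutations, product

# Assumption: the alphabet is the four nucleotides plus N
bases = sorted('ACGTN')
seq_length = 4

# Ordered picks of 4 *distinct* bases: 5 * 4 * 3 * 2
without_repeats = [''.join(p) for p in permutations(bases, seq_length)]
# Every length-4 string over the alphabet, repeats allowed: 5 ** 4
with_repeats = [''.join(p) for p in product(bases, repeat=seq_length)]

print(len(without_repeats))  # 120
print(len(with_repeats))     # 625
print('AAAA' in without_repeats, 'AAAA' in with_repeats)  # False True
```

So 4-mers with a repeated base, such as `AAAA` or `ACGA`, are not among the features here. Sorting the bases, as the sketch does, also keeps the feature order the same from run to run, which a plain `set` does not guarantee.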

I'll use a function from the benchmark to count the non-overlapping occurrences of each of these subsequences in every gene sequence.

In [8]:
subs = get_ngram_features(X, subsequences)
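I'm not reproducing `src.functions` here, but the idea behind the counting step can be sketched as follows. `count_subsequences` is a hypothetical stand-in that I'm assuming behaves like the benchmark helper: one column per subsequence, filled with `Series.str.count`, which counts non-overlapping matches.

```python
import pandas as pd

def count_subsequences(data, subsequences):
    """Hypothetical stand-in: one count column per subsequence."""
    features = pd.DataFrame(index=data.index)
    for subseq in subsequences:
        # str.count treats the pattern as a regex and counts non-overlapping matches
        features[subseq] = data['sequence'].str.count(subseq)
    return features

# Made-up sequences for illustration
toy = pd.DataFrame({'sequence': ['ACGTACGT', 'ACACAC']}, index=['s1', 's2'])
toy_counts = count_subsequences(toy, ['ACGT', 'ACAC'])
print(toy_counts)
```

In `ACACAC` the pattern `ACAC` starts at positions 0 and 2, but the two matches overlap, so the count is 1 rather than 2.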

In [9]:
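# Combine the subsequence counts with the remaining feature columns (everything except the raw sequence)
X_subs = subs.join(X.drop('sequence', axis=1))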

In [10]:
# Collapse the one-hot label columns into a single lab_id per sequence
lab_ids = pd.DataFrame(y.idxmax(axis=1), columns=['lab_id'])
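To show what `idxmax(axis=1)` does to a one-hot label frame, here's a tiny sketch with made-up lab names (`lab_a`, `lab_b`, `lab_c` are placeholders, not real lab ids). For each row it returns the name of the column holding the largest value, which for one-hot rows is the single column set to 1.

```python
import pandas as pd

# Made-up one-hot labels: each row has exactly one 1.0
y_toy = pd.DataFrame(
    {'lab_a': [1.0, 0.0, 0.0],
     'lab_b': [0.0, 0.0, 1.0],
     'lab_c': [0.0, 1.0, 0.0]},
    index=pd.Index(['s1', 's2', 's3'], name='sequence_id'),
)

# Column name of the maximum in each row, i.e. the lab flagged for that sequence
labels_toy = y_toy.idxmax(axis=1)
print(labels_toy.tolist())  # ['lab_a', 'lab_c', 'lab_b']
```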


In [11]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.mixture import BayesianGaussianMixture
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

In [12]:
# Note: BayesianGaussianMixture is an unsupervised model; its fit() ignores the labels
estimators = [BernoulliNB(), GaussianNB(), MultinomialNB(),
              RandomForestClassifier(), AdaBoostClassifier(), ExtraTreesClassifier(),
              BayesianGaussianMixture()]

In [13]:
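# Flatten the label column once instead of on every call
labels = lab_ids.values.ravel()
# Fit each estimator and report its top-10 accuracy (scored on the same data it was trained on)
for estimator in estimators:
    name = estimator.__class__.__name__
    estimator.fit(X_subs, labels)
    print(f'{name}: {top10_accuracy_scorer(estimator, X_subs, labels)}')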

BernoulliNB: 0.6856086452861926
GaussianNB: 0.4582731643842138
MultinomialNB: 0.7007315486297349


MemoryError: could not allocate 688914432 bytes
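For context on the scores printed above: top-10 accuracy counts a prediction as correct when the true lab is among the ten labs the model rates most probable. The sketch below is my own NumPy version of that idea, not the code in `src.functions`. `top_k_accuracy` and the toy probabilities are hypothetical, and I use `k=2` so the example stays readable.

```python
import numpy as np

def top_k_accuracy(probas, classes, y_true, k=10):
    """Fraction of rows whose true label is among the k most probable classes."""
    # Indices of the k largest probabilities in each row (order within the k is arbitrary)
    top_k_idx = np.argpartition(probas, -k, axis=1)[:, -k:]
    top_k_labels = classes[top_k_idx]
    # A row is a hit if any of its k candidate labels equals the true label
    hits = (top_k_labels == y_true.reshape(-1, 1)).any(axis=1)
    return hits.mean()

# Made-up class names and predicted probabilities
classes = np.array(['lab_a', 'lab_b', 'lab_c', 'lab_d'])
probas = np.array([
    [0.70, 0.20, 0.05, 0.05],  # top 2: lab_a, lab_b
    [0.10, 0.15, 0.35, 0.40],  # top 2: lab_c, lab_d
    [0.40, 0.30, 0.20, 0.10],  # top 2: lab_a, lab_b
])
y_true = np.array(['lab_b', 'lab_a', 'lab_d'])

score = top_k_accuracy(probas, classes, y_true, k=2)
print(score)
```

Only the first row has its true lab in the top two, so the score is one in three. With a fitted scikit-learn classifier, `probas` would come from `predict_proba` and `classes` from the `classes_` attribute. Because the loop above scores each model on the data it was fitted to, the printed numbers are likely optimistic compared with a held-out split.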