## Goals
Find out what distinguishes the contaminated plate from the other plates.
Is it possible to predict, based on influencing factors, which plate will go wrong?
The task can be framed as a classification problem (contaminated plate vs. non-contaminated plate).
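
To make the classification framing concrete, one option is to derive a binary label per sample. The sketch below is only an illustration: the column name `plate`, the plate identifiers, and the choice of which plate counts as contaminated are hypothetical and not taken from the real metadata.

```python
import pandas as pd

# Hypothetical toy metadata: the column name "plate" and its values are
# assumptions for illustration, not the real metaData.csv layout.
toy_meta = pd.DataFrame(
    {"plate": ["P1", "P2", "P2", "P3"]},
    index=["s1", "s2", "s3", "s4"],
)

# Assumed identifier of the contaminated plate.
contaminated_plate = "P2"

# Binary target: 1 = sample from the contaminated plate, 0 = any other plate.
toy_meta["contaminated"] = (toy_meta["plate"] == contaminated_plate).astype(int)
print(toy_meta["contaminated"].value_counts())
```

A 0/1 column like this could later serve as the target for a classifier, with the remaining metadata and sequence counts as features.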

## Preprocessing goals
### General
1. Get an overview of the data
2. Match the metadata to the seq data

### Specific
1. Preselect the variables that are interesting for our question
2. Descriptive statistics and data cleaning of the relevant data

In [1]:
# import packages

import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import matplotlib
plt.style.use('ggplot')
from matplotlib.pyplot import figure

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)

pd.options.mode.chained_assignment = None


In [2]:
# read data
df =  pd.read_table('./data/seqtab_nobimera_idtaxa.tsv')
meta_data = pd.read_csv('./data/metaData.csv')

# look at data
print("Shape:")
print(df.shape)
print(df.head())

print(meta_data.shape)

FileNotFoundError: [Errno 2] No such file or directory: './data/seqtab_nobimera_idtaxa.tsv'

In [None]:
#look at meta data
print("Shape")
print(meta_data.shape)
meta_data.head()

## Bring Data to matching Format

In this step we bring the metadata into the same format as the seq data.
To do so, we remove the suffix from the sample IDs in the seq data to recover the raw sample ID.
Furthermore, we use the sample ID as the row labels (index) of the metadata.

### Sorting and checking
Finally, we bring the sample IDs into the same order in both tables and check whether all of them match.
This was not the case, so in the next step we try to find out why they do not match.
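
The check described above can also be done with set operations on the labels, which reports which IDs are missing on either side instead of comparing them position by position. This is a minimal sketch on made-up sample IDs; only the `_FGCZ` suffix is taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the real labels; the sample ids are made up.
seq_ids = pd.Index(["A1_FGCZ", "A2_FGCZ", "B1_FGCZ"])
meta_ids = pd.Index(["A1", "A2", "C7"])

# Strip the suffix to recover the raw sample id.
raw_seq_ids = seq_ids.str.replace("_FGCZ", "", regex=False)

# Ids that are present in one table but not in the other.
only_in_seq = raw_seq_ids.difference(meta_ids)
only_in_meta = meta_ids.difference(raw_seq_ids)
print("only in seq data:", list(only_in_seq))
print("only in metadata:", list(only_in_meta))
```

Two empty results would mean that both tables describe the same set of samples, and any remaining mismatch is then only a matter of ordering.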

In [None]:
# transform meta data to match format of seq data:
# use the sample id (second column) as row labels, then drop that column
meta_data.rename(index = meta_data.iloc[:,1], inplace = True)
meta_data = meta_data.drop(meta_data.columns[[1]], axis=1) # drop sample id
meta_data.head()
meta_data.dtypes

In [None]:
meta_data.head()

In [None]:
# removing non numeric columns and save to extra df
df_otu_seq_tax = df[["otu", "seq", "tax"]]
df = df.drop(columns=["otu", "seq", "tax"])
df.head()
df_otu_seq_tax.head()

In [None]:
df.shape

In [None]:
# rename sample names to match metadata by removing the '_FGCZ' part
df.columns = [col.replace('_FGCZ', '') for col in df.columns]
# bring the sample ids into the same order in both frames:
# samples are columns in df (axis=1) and rows in meta_data (axis=0)
df = df.sort_index(axis=1)
meta_data = meta_data.sort_index(axis=0)


In [None]:

meta_data.head()

In [None]:
df.head()

## Format modification
The first steps showed that it is more convenient to have the samples as rows and the features as columns. Because of that, we will transpose the data frames.
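
As a small illustration of why this layout is convenient: once samples are rows in both tables, they can be aligned and joined on the row labels. The toy names below (`counts`, `toy_meta`, the OTU and sample IDs) are made up for this sketch.

```python
import pandas as pd

# Toy count table: rows are OTUs, columns are samples (names are made up).
counts = pd.DataFrame(
    {"s2": [5, 0], "s1": [1, 3]},
    index=["otu_a", "otu_b"],
)
toy_meta = pd.DataFrame({"plate": ["P1", "P2"]}, index=["s1", "s2"])

# Transpose so that each row is a sample, then put the rows in metadata order.
samples = counts.T.reindex(toy_meta.index)

# With identical row labels the two frames can be joined on the index.
combined = toy_meta.join(samples)
print(combined)
```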

In [None]:
# transpose 
df = df.transpose()
df.head()
df.dtypes



In [None]:
meta_data.shape

In [None]:
# compare shapes and sample ids of both data frames
print(df.shape, meta_data.shape)
matching = df.index == meta_data.index
print(matching)
# print every pair of sample ids that differs at the same position
for meta_id, seq_id in zip(meta_data.index, df.index):
    if meta_id != seq_id:
        print(meta_id, seq_id)

In [None]:
meta_data.head()

## Exploratory Data Analysis
Now that we have a first, very rough overview of the data, we look at the scale of each variable (numeric, ordinal, etc.) and at missing values.
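
Both aspects can be collected in one small overview table per column. The following is a sketch on a toy frame with made-up column names, not on the real metadata.

```python
import numpy as np
import pandas as pd

# Toy frame; the column names are made up for illustration.
toy = pd.DataFrame(
    {
        "weight": [1.2, np.nan, 3.4, 2.0],
        "species": ["cat", "dog", None, "cat"],
        "age": [3, 5, 2, 7],
    }
)

# One row per column: dtype, number of distinct values, share of missing values.
overview = pd.DataFrame(
    {
        "dtype": toy.dtypes.astype(str),
        "n_unique": toy.nunique(),
        "pct_missing": toy.isnull().mean() * 100,
    }
)
print(overview)
```

A column with a non-numeric dtype and only a few distinct values is a likely candidate for a categorical or ordinal variable.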



In [None]:
# data types metadata
print("Data Types MetaData")
meta_data.dtypes


In [None]:
# Check taxonomy
df_otu_seq_tax.tax[1].split(';')

In [None]:
# Check for missing values visually in seq data
colours = ['#000099', '#ffff00']
sns.heatmap(df.transpose().isnull(),cmap = sns.color_palette(colours))

In [None]:
# Check for missing values visually
colours = ['#000099', '#ffff00']
sns.heatmap(meta_data.transpose().isnull(),cmap = sns.color_palette(colours))

In [None]:
def missing_values(data_frame):
    """Print the percentage of missing values for every column that has any."""
    for col in data_frame.columns:
        pct_missing = np.mean(data_frame[col].isnull())
        if pct_missing > 0:
            print('{} - {}%'.format(col, round(pct_missing * 100)))

## TODO
Decide in advance which features are useful for our case and remove the useless ones.
Then deal with the missing values of the useful features and bring them into a good format.
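
A minimal sketch of the removal step, assuming the irrelevant features are available as a list of column names. All names below are hypothetical, and `not_there` is deliberately absent from the frame.

```python
import pandas as pd

# Toy metadata with made-up columns.
toy_meta = pd.DataFrame(
    {"plate": ["P1", "P2"], "operator": ["x", "y"], "ph": [6.8, 7.1]}
)

# Hypothetical list of columns judged irrelevant.
irrelevant = ["operator", "not_there"]

# errors="ignore" skips names that are not present instead of raising a KeyError.
reduced = toy_meta.drop(columns=irrelevant, errors="ignore")
print(list(reduced.columns))
```

Dropping all columns in one call avoids a failure halfway through a loop when a single name is misspelled or already removed.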

In [None]:
# Drop irrelevant columns (their names are read from the header of irrelevant_cols.csv)
ir_cols = pd.read_csv("./data/irrelevant_cols.csv")
ir_cols.head()
for col in ir_cols.columns:
    meta_data = meta_data.drop(columns = [col])
    print(col)

In [None]:
meta_data.shape

## Data Formatting and Cleaning
Now that I have dropped the non-relevant features and brought the seq data and the metadata into the same format, I will deal with missing values, outliers and non-numeric features. The result of this process should be a dataframe that is suitable for machine learning libraries.
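
One possible way to get from mixed metadata to a purely numeric table is sketched below. The column names are made up, and filling with the median and one-hot encoding are assumptions chosen for illustration, not necessarily the strategy this notebook ends up using. Outlier handling is not covered here.

```python
import numpy as np
import pandas as pd

# Toy metadata with one numeric and one non-numeric column (names are made up).
toy_meta = pd.DataFrame(
    {
        "temperature": [20.5, np.nan, 22.0],
        "room": ["a", "b", None],
    },
    index=["s1", "s2", "s3"],
)

# Numeric columns: fill gaps with the column median (one possible choice).
num_cols = toy_meta.select_dtypes(include="number").columns
toy_meta[num_cols] = toy_meta[num_cols].fillna(toy_meta[num_cols].median())

# Non-numeric columns: one-hot encode, with an extra indicator for missing values.
cat_cols = [c for c in toy_meta.columns if c not in num_cols]
features = pd.get_dummies(toy_meta, columns=cat_cols, dummy_na=True)
print(features)
```

The resulting frame contains no missing values and no text columns, which is what most machine learning libraries expect as input.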

In [None]:
# Check again for missing values in new meta data
missing_values(meta_data)

In [None]:
meta_data.head()
meta_data.other_animals_present_kind

In [None]:
# assign the result back instead of an inplace fill on an attribute-accessed column
meta_data["Notes"] = meta_data["Notes"].fillna(0)

In [None]:
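# assign the result back instead of an inplace fill on an attribute-accessed column
meta_data["other_animals_present_kind"] = meta_data["other_animals_present_kind"].fillna(0)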