# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [4]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob

In [5]:
DATA_FOLDER = '../../ADA2017-Tutorials/02 - Intro to pandas/Data/' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

In [269]:
files_name = os.listdir(DATA_FOLDER + 'microbiome/')

In [318]:
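rna_data = pd.DataFrame()

# Sort the file names so the columns follow MID1..MID9, the order used in the
# metadata (os.listdir gives no ordering guarantee).
for f in sorted(n for n in files_name if 'MID' in n):
    # The first column of each sheet holds the taxonomy and becomes the index.
    tmp = pd.read_excel(DATA_FOLDER + 'microbiome/' + f, 'Sheet 1', index_col=0, header=None)
    rna_data = pd.concat([rna_data, tmp], axis=1)

# Taxa that are absent from a sample appear as NaN after the concatenation.
rna_data.fillna("unknown", inplace=True)
rna_data.index.name = 'Taxonomy'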
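
To illustrate why `fillna` is needed after the concatenation, here is a minimal sketch on toy data. The taxon names, counts, and column names are made up for illustration and are not taken from the spreadsheets. `pd.concat(..., axis=1)` aligns the frames on their index, so a taxon that is absent from one sample yields `NaN` in that sample's column:

```python
import pandas as pd

# Hypothetical stand-ins for two of the sample spreadsheets.
sample_a = pd.DataFrame({"MID1": [7, 2]}, index=["taxon A", "taxon B"])
sample_b = pd.DataFrame({"MID2": [5, 3]}, index=["taxon B", "taxon C"])

# Column-wise concatenation aligns on the index (outer join by default).
combined = pd.concat([sample_a, sample_b], axis=1)

# Count the gaps: "taxon C" is missing in MID1, "taxon A" in MID2.
n_missing = int(combined.isna().sum().sum())

# Replace the gaps with the tag required by the task.
filled = combined.fillna("unknown")
```

After `fillna("unknown")` the affected columns hold a mix of numbers and strings, so they have `object` dtype; any later arithmetic on the counts would need the tag filtered out first.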

Let's look at the metadata:

In [320]:
pd.read_excel(DATA_FOLDER+"microbiome/metadata.xls")

Unnamed: 0,BARCODE,GROUP,SAMPLE
0,MID1,EXTRACTION CONTROL,
1,MID2,NEC 1,tissue
2,MID3,Control 1,tissue
3,MID4,NEC 2,tissue
4,MID5,Control 2,tissue
5,MID6,NEC 1,stool
6,MID7,Control 1,stool
7,MID8,NEC 2,stool
8,MID9,Control 2,stool
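
Because the metadata rows are listed as MID1 to MID9, the data columns have to be in the same order before the two are combined. A small sanity check might look like the sketch below. The directory listing is hypothetical, and the `.xls` extension of the sample files is an assumption:

```python
import os

# Hypothetical directory listing, deliberately out of order.
files_name = ["MID3.xls", "metadata.xls", "MID1.xls", "MID2.xls"]
# Hypothetical BARCODE column of the metadata.
barcodes = ["MID1", "MID2", "MID3"]

# Derive one barcode per data file from the sorted file names.
file_barcodes = [os.path.splitext(n)[0] for n in sorted(files_name) if "MID" in n]

# True only if the files, once sorted, line up with the metadata rows.
order_matches = file_barcodes == barcodes
```

Plain lexicographic sorting is enough here because the barcodes have a single digit; with ten or more samples, `MID10` would sort before `MID2` and a numeric sort key would be needed.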


Each metadata row describes one sample, so we attach it to the matching data column as a multi-level column index: <br>


Metadata's values $\rightarrow$ data's column labels <br>
Metadata's column names $\rightarrow$ names of the column levels

In [321]:
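# Load the metadata, tagging the missing SAMPLE of MID1 as "unknown".
meta = pd.read_excel(DATA_FOLDER + "microbiome/metadata.xls").fillna("unknown")
# One column level per metadata column (BARCODE, GROUP, SAMPLE).
rna_data.columns = pd.MultiIndex.from_arrays(meta.transpose().values.tolist(), names=meta.columns.tolist())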
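
The same idea on a miniature example may make the resulting structure easier to see. This is only a sketch: the two-sample metadata and the counts below are illustrative, and `pd.MultiIndex.from_arrays` is used here to make the construction explicit:

```python
import pandas as pd

# Hypothetical miniature metadata and data (values are illustrative).
meta = pd.DataFrame({
    "BARCODE": ["MID1", "MID2"],
    "GROUP": ["EXTRACTION CONTROL", "NEC 1"],
    "SAMPLE": [None, "tissue"],
}).fillna("unknown")
data = pd.DataFrame([[7, 23], [0, 1]], index=["taxon A", "taxon B"])

# One array per metadata column; each becomes a level of the column index.
data.columns = pd.MultiIndex.from_arrays(
    meta.T.values.tolist(), names=meta.columns.tolist()
)

# The levels can then be used for selection, e.g. all tissue samples.
tissue = data.xs("tissue", axis=1, level="SAMPLE")
```

Keeping the metadata in the column index means that every count stays attached to its barcode, group, and sample type without duplicating that information in each row.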

In [322]:
rna_data.head()

BARCODE,MID1,MID2,MID3,MID4,MID5,MID6,MID7,MID8,MID9
GROUP,EXTRACTION CONTROL,NEC 1,Control 1,NEC 2,Control 2,NEC 1,Control 1,NEC 2,Control 2
SAMPLE,unknown,tissue,tissue,tissue,tissue,stool,stool,stool,stool
Taxonomy,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",unknown,2,1,unknown,5,unknown,unknown,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",unknown,14,15,unknown,26,unknown,1,unknown,1
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",7,23,14,2,28,7,8,unknown,16
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Stetteria",unknown,unknown,unknown,unknown,1,unknown,unknown,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus",unknown,1,4,unknown,5,1,2,unknown,2


Finally, let's check that the index is unique:

In [315]:
rna_data.index.is_unique

True
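
Had the check returned `False`, the repeated labels could be listed with `Index.duplicated`. The sketch below uses a hypothetical frame with one repeated taxon, and also shows one way to confirm that no `NaN` values remain:

```python
import pandas as pd

# Hypothetical frame with a repeated index label.
df = pd.DataFrame({"MID1": [1, 2, 3]}, index=["taxon A", "taxon B", "taxon A"])

# False here, because "taxon A" appears twice.
unique = df.index.is_unique

# Labels that occur again after their first appearance.
repeated = df.index[df.index.duplicated()].tolist()

# True when no cell of the frame is NaN.
no_gaps = not df.isna().any().any()
```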