# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import glob

In [4]:
DATA_FOLDER = '../../ADA2017-Tutorials/02 - Intro to pandas/Data/' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
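
Before touching the real spreadsheets, the merge-and-fill pattern this task asks for can be tried on small in-memory frames. The sketch below is only an illustration: the taxon names, the counts, and the `MID_A`/`MID_B` column labels are made up, and nothing is read from `DATA_FOLDER`.

```python
import pandas as pd

# Two toy "spreadsheets": a taxon name plus one count column (made-up values).
first = pd.DataFrame({"name": ["taxon_a", "taxon_b"], "MID_A": [7, 2]})
second = pd.DataFrame({"name": ["taxon_b", "taxon_c"], "MID_B": [5, 1]})

# An outer merge keeps every taxon seen in either frame; a count that is
# missing on one side becomes NaN.
combined = first.merge(second, on="name", how="outer").set_index("name")
combined = combined.sort_index()

# Cast to object so numbers and the string tag can share a column,
# then replace the NaN values with the tag.
combined = combined.astype(object).fillna("unknown")

print(combined)
print(combined.index.is_unique)
```

Because each input frame lists a taxon at most once and the frames are merged on `name`, the resulting index should stay unique, which `combined.index.is_unique` confirms.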

In [5]:
from os.path import basename, splitext

data = pd.DataFrame(columns=['name'])

# Merge every MID spreadsheet (every .xls file except the metadata file).
for excel in glob.glob(DATA_FOLDER + "microbiome/*.xls"):
    if "meta" not in basename(excel):
        # The file name without its extension (e.g. MID1) names the count column.
        mid = splitext(basename(excel))[0]
        # The spreadsheets have no header row.
        subdata = pd.read_excel(excel, header=None)
        subdata.columns = ["name", mid]
        # Integrity check: each bacterium name appears only once per file.
        assert not subdata['name'].duplicated().any()
        # Merge on the name; "outer" adds bacteria that are new in this file.
        data = data.merge(subdata, on='name', how='outer')

# Replace the NaN values with the tag required by the task.
data = data.fillna("unknown")

# Use the name as index; it is unique because every file has unique names
# and the files were merged on that key.
data = data.set_index('name')
assert data.index.is_unique
# Sort the columns by barcode (MID1 ... MID9).
data = data[sorted(data.columns)]

meta = pd.read_excel(DATA_FOLDER + "microbiome/metadata.xls")
meta = meta.fillna("unknown")
# Use the metadata columns, in reverse order (SAMPLE, GROUP, BARCODE), as column levels.
# This relies on the metadata rows being in the same order as the sorted columns.
levels = list(reversed(meta.columns))
data.columns = pd.MultiIndex.from_arrays([meta[level].values for level in levels], names=levels)

data.head()

SAMPLE,unknown,tissue,tissue,tissue,tissue,stool,stool,stool,stool
GROUP,EXTRACTION CONTROL,NEC 1,Control 1,NEC 2,Control 2,NEC 1,Control 1,NEC 2,Control 2
BARCODE,MID1,MID2,MID3,MID4,MID5,MID6,MID7,MID8,MID9
name,,,,,,,,,
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",7,23,14,2,28,7,8,unknown,16
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus",2,2,unknown,unknown,3,2,1,unknown,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus",3,10,4,unknown,14,5,5,1,6
"Archaea ""Crenarchaeota"" Thermoprotei Thermoproteales Thermofilaceae Thermofilum",3,9,5,unknown,10,4,5,unknown,5
"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Methanocellales Methanocellaceae Methanocella",7,9,7,1,17,12,18,unknown,14

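The cell above attaches the metadata to the columns by position, so it depends on the metadata rows and the sorted barcode columns being in the same order. A possible alternative is to look each barcode up explicitly. The sketch below runs on made-up toy data; `attach_metadata` is a hypothetical helper, not part of pandas or of the tutorial, and the column names `BARCODE`, `GROUP` and `SAMPLE` are taken from the output shown above.

```python
import pandas as pd

# Toy count table: one column per barcode, deliberately out of order.
counts = pd.DataFrame(
    {"MID2": [23, 2], "MID1": [7, 2]},
    index=pd.Index(["taxon_a", "taxon_b"], name="name"),
)

# Toy metadata, one row per barcode; the missing SAMPLE mimics a control.
meta = pd.DataFrame(
    {
        "BARCODE": ["MID1", "MID2"],
        "GROUP": ["EXTRACTION CONTROL", "NEC 1"],
        "SAMPLE": [None, "tissue"],
    }
).fillna("unknown")


def attach_metadata(counts, meta, levels=("SAMPLE", "GROUP", "BARCODE")):
    """Return a copy of `counts` whose columns carry the metadata levels."""
    # Look up each column's barcode in the metadata instead of trusting row order.
    aligned = meta.set_index("BARCODE", drop=False).loc[counts.columns]
    labelled = counts.copy()
    labelled.columns = pd.MultiIndex.from_arrays(
        [aligned[level].to_numpy() for level in levels], names=list(levels)
    )
    return labelled


labelled = attach_metadata(counts, meta)
print(labelled.columns.tolist())
```

With this lookup, a barcode that is missing from the metadata raises a `KeyError` instead of silently shifting the labels onto the wrong columns.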