# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import glob  # finds all the pathnames matching a specified pattern
pd.options.mode.chained_assignment = None  # default='warn', Mutes warnings when copying a slice from a DataFrame.

# Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

### Importing the files

The files MID1.xls to MID9.xls were first opened in Excel to inspect the data directly and to check how many sheets each file contains, so that no data would be missed on import. After verifying that every .xls file contains a single sheet, the first 9 files are imported into 9 separate dataframes.
The name 'mb1' stands for 'microbiome 1', 'mb2' for 'microbiome 2', and so on.

In [2]:
mb1 = pd.read_excel('Data/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)
mb2 = pd.read_excel('Data/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)
mb3 = pd.read_excel('Data/microbiome/MID3.xls', 'Sheet 1', index_col=0, header=None)
mb4 = pd.read_excel('Data/microbiome/MID4.xls', 'Sheet 1', index_col=0, header=None)
mb5 = pd.read_excel('Data/microbiome/MID5.xls', 'Sheet 1', index_col=0, header=None)
mb6 = pd.read_excel('Data/microbiome/MID6.xls', 'Sheet 1', index_col=0, header=None)
mb7 = pd.read_excel('Data/microbiome/MID7.xls', 'Sheet 1', index_col=0, header=None)
mb8 = pd.read_excel('Data/microbiome/MID8.xls', 'Sheet 1', index_col=0, header=None)
mb9 = pd.read_excel('Data/microbiome/MID9.xls', 'Sheet 1', index_col=0, header=None)
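The nine nearly identical calls above could also be written as a loop. The following is only a sketch: the helper names `microbiome_paths` and `load_microbiome` are hypothetical and not part of the original notebook, and the folder layout and sheet name are assumed to match the calls above.

```python
import re

import pandas as pd


def microbiome_paths(folder, n_files=9):
    """Build the expected paths MID1.xls ... MID<n_files>.xls inside `folder`."""
    return [f"{folder}/MID{i}.xls" for i in range(1, n_files + 1)]


def load_microbiome(folder, n_files=9):
    """Read every MID file into a dict keyed by its barcode ('MID1', 'MID2', ...)."""
    frames = {}
    for path in microbiome_paths(folder, n_files):
        # The barcode is the file name without its extension.
        barcode = re.search(r"(MID\d+)\.xls$", path).group(1)
        frames[barcode] = pd.read_excel(
            path, sheet_name="Sheet 1", index_col=0, header=None
        )
    return frames


# Usage (requires the data files on disk):
# frames = load_microbiome('Data/microbiome')
# mb1 = frames['MID1']
```

Keying the dataframes by barcode would make it possible to look up each one in the metadata file by the same key, instead of relying on variable names.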

# Take a look at the datasets

In Excel, all the files looked very simple: one column with the name of the bacterium and one column with the corresponding value, presumably a count recorded in the experiment. The head of the first dataframe is displayed anyway in order to:
* check that the data have been imported successfully
* get a visual picture of the dataset, which helps to handle it correctly

In [3]:
mb1.head()

Unnamed: 0_level_0,1
0,Unnamed: 1_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrolobus",2
"Archaea ""Crenarchaeota"" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus",3
"Archaea ""Crenarchaeota"" Thermoprotei Thermoproteales Thermofilaceae Thermofilum",3
"Archaea ""Euryarchaeota"" ""Methanomicrobia"" Methanocellales Methanocellaceae Methanocella",7


Because `index_col=0` was passed, pandas uses the first column as the index and keeps the recorded value as the only column of the dataframe.
Now the 10th file, 'metadata.xls', is loaded to tell us what the values in the dataframes mb1-mb9 represent:

In [4]:
names = pd.read_excel('Data/microbiome/metadata.xls', 'Sheet1', header = None, na_values=['NA'])
names

Unnamed: 0,0,1,2
0,BARCODE,GROUP,SAMPLE
1,MID1,EXTRACTION CONTROL,
2,MID2,NEC 1,tissue
3,MID3,Control 1,tissue
4,MID4,NEC 2,tissue
5,MID5,Control 2,tissue
6,MID6,NEC 1,stool
7,MID7,Control 1,stool
8,MID8,NEC 2,stool
9,MID9,Control 2,stool


The first row of this file clearly contains the column names, so let's import it again
* using the first row as the header
* setting the 'BARCODE' column as the index

In [5]:
names = pd.read_excel('Data/microbiome/metadata.xls', 'Sheet1', na_values=['NA'])
names.index=names['BARCODE']
names.drop('BARCODE', axis=1, inplace=True)
names

Unnamed: 0_level_0,GROUP,SAMPLE
BARCODE,Unnamed: 1_level_1,Unnamed: 2_level_1
MID1,EXTRACTION CONTROL,
MID2,NEC 1,tissue
MID3,Control 1,tissue
MID4,NEC 2,tissue
MID5,Control 2,tissue
MID6,NEC 1,stool
MID7,Control 1,stool
MID8,NEC 2,stool
MID9,Control 2,stool


Let's now look in detail at the information contained in the different spreadsheets:

In [6]:
print('EXTRACTION CONTROL\n', mb1.describe(), '\n\n')
print('tissue, NEC 1\n', mb2.describe(), '\n\n')
print('tissue, Control 1\n', mb3.describe(), '\n\n')
print('tissue, NEC 2\n', mb4.describe(), '\n\n')
print('tissue, Control 2\n', mb5.describe(), '\n\n')
print('stool, NEC 1\n', mb6.describe(), '\n\n')
print('stool, Control 1\n', mb7.describe(), '\n\n')
print('stool, NEC 2\n', mb8.describe(), '\n\n')
print('stool, Control 2\n', mb9.describe(), '\n\n')

EXTRACTION CONTROL
                  1
count   272.000000
mean     44.801471
std     267.473758
min       1.000000
25%       1.000000
50%       2.000000
75%       7.000000
max    3732.000000 


tissue, NEC 1
                  1
count   288.000000
mean     58.378472
std     467.044618
min       1.000000
25%       1.000000
50%       2.000000
75%       7.250000
max    6176.000000 


tissue, Control 1
                  1
count   367.000000
mean     33.446866
std     154.765809
min       1.000000
25%       1.000000
50%       3.000000
75%      11.000000
max    1968.000000 


tissue, NEC 2
                  1
count   134.000000
mean     76.388060
std     685.826464
min       1.000000
25%       1.000000
50%       2.000000
75%       4.750000
max    7910.000000 


tissue, Control 2
                  1
count   379.000000
mean     47.625330
std     364.842217
min       1.000000
25%       1.000000
50%       3.000000
75%       8.500000
max    5503.000000 


stool, NEC 1
                  1
count   1

## Comment on metadata.xls and on what we understood of the dataset
From the metadata.xls file, the content of the dataframes mb1-mb9 has been understood as follows.
* For each bacterium, a single 'extraction control' value has been recorded (mb1).
* For each bacterium, the 'NEC' and 'Control' values have been recorded four times each, once for every combination of the two 'samples'={'tissue', 'stool'} and the two contexts={1, 2} (the number that follows the group name, as in 'NEC 1' or 'Control 2').
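The reading of the 'GROUP' field described above can be sketched with a regular expression. This is only an illustration on a hand-typed subset of the metadata table shown earlier; the column names `condition` and `context` are our own labels, not fields of the file.

```python
import pandas as pd

# A few rows copied by hand from the metadata table displayed above.
meta = pd.DataFrame(
    {
        'GROUP': ['EXTRACTION CONTROL', 'NEC 1', 'Control 1', 'NEC 2'],
        'SAMPLE': [None, 'tissue', 'tissue', 'tissue'],
    },
    index=pd.Index(['MID1', 'MID2', 'MID3', 'MID4'], name='BARCODE'),
)

# Split 'NEC 1' into condition='NEC' and context='1'.
# Rows that do not match the pattern (the extraction control) become NaN.
parsed = meta['GROUP'].str.extract(r'^(?P<condition>NEC|Control) (?P<context>\d)$')
print(parsed)
```

Note that the extracted context is a string; it would have to be converted explicitly if an integer level is wanted.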

## Final goal

To keep track of the 'NEC' and 'Control' values, it seems logical to merge the _eight dataframes mb2-mb9_ into one single dataset with the following characteristics:
* a three-level index with the levels ['bacteria name', 'sample', 'context']
* the columns ['NEC', 'Control']

The 'extraction control' value will then be added as a further column of the combined dataframe, since it doesn't follow this hierarchy.

# Let's start

Let's proceed by renaming the index and the value column of each dataframe in a meaningful way:

In [7]:
for mb in [mb1,mb2, mb3, mb4, mb5, mb6, mb7, mb8, mb9]:
    mb.columns=['count']
    mb.index.name='bacteria'

Now, let's do _4 merges_ of the following pairs of dataframes:

* (tissue, 1, NEC) with (tissue, 1, Control)
* (tissue, 2, NEC) with (tissue, 2, Control)
* (stool, 1, NEC) with (stool, 1, Control)
* (stool, 2, NEC) with (stool, 2, Control)

The following cell does so in a short, if slightly cumbersome, way. The following aspects of the code should be noted:
* _pd.concat()_ is used to concatenate the columns of 'NEC' and 'Control' values, since it creates a dataframe with the _union_ of the indices of the two datasets, filling missing values with _NaN_
* two columns are added to record the _sample_ and the _context_ of the data
* a three-level index ['bacteria', 'sample', 'context'] is created

In [8]:
samples = ['tissue']*2+['stool']*2
_4df=[1,2,3,4]

for i,[M1, M2] in enumerate(zip([mb2, mb4, mb6, mb8],[mb3, mb5, mb7, mb9])):
    _4df[i] = pd.concat( [ M1['count'].rename('NEC'),
                           M2['count'].rename('Control')], axis=1)
    _4df[i] ['sample'] = samples[i]
    _4df[i] ['context']=(i%2)+1
    _4df[i].index.name='bacteria'
    _4df[i]=_4df[i][['sample','context','NEC','Control']]
    _4df[i].set_index(['sample','context'], append=True, inplace=True)
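To see in isolation what one iteration of the loop above does, here is a minimal sketch on toy data. The bacteria labels 'A', 'B', 'C' and their values are made up for illustration.

```python
import pandas as pd

# Two toy 'count' series whose indices only partly overlap.
nec = pd.Series([2, 14], index=['A', 'B'], name='count')
control = pd.Series([15, 1], index=['B', 'C'], name='count')

# Column-wise concat keeps the union of the indices and fills the gaps with NaN.
merged = pd.concat([nec.rename('NEC'), control.rename('Control')], axis=1)
merged.index.name = 'bacteria'

# Record where the data come from, then move that information into the index.
merged['sample'] = 'tissue'
merged['context'] = 1
merged = merged.set_index(['sample', 'context'], append=True)
print(merged)
```

Bacterium 'A' has no 'Control' value and 'C' has no 'NEC' value, so both end up with a _NaN_ in the corresponding column.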

Let's display the four resulting dataframes to get a better understanding of what has been done:

In [9]:
_4df[0].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,1,2.0,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,1,14.0,15.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,1,23.0,14.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus",tissue,1,1.0,4.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Thermodiscus",tissue,1,,1.0


In [10]:
_4df[1].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,2,,5.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,2,,26.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,2,2.0,28.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Stetteria",tissue,2,,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus",tissue,2,,5.0


In [11]:
_4df[2].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,1,,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,1,7.0,8.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus",stool,1,1.0,2.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Thermosphaera",stool,1,,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrodictium",stool,1,1.0,1.0


In [12]:
_4df[3].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,2,,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,2,,16.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Sulfophobococcus",stool,2,,2.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Pyrodictiaceae Pyrodictium",stool,2,,5.0
"Archaea ""Crenarchaeota"" Thermoprotei Sulfolobales Sulfolobaceae Stygiolobus",stool,2,1.0,6.0


### Now, combine the dataframes into one
It is now sufficient to _combine_ these four dataframes to obtain the desired one.
Note the use of __combine_first()__ so as not to lose data when values are missing in one of the dataframes: the result contains the union of the rows, and missing values in the calling dataframe are filled from the other one.
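The behaviour of `combine_first()` can be checked on a small example. This is a sketch on made-up values, with a plain single-level index to keep it short.

```python
import numpy as np
import pandas as pd

first = pd.DataFrame(
    {'NEC': [1.0, np.nan], 'Control': [np.nan, 4.0]},
    index=['x', 'y'],
)
second = pd.DataFrame(
    {'NEC': [10.0, 20.0, 30.0], 'Control': [40.0, 50.0, 60.0]},
    index=['x', 'y', 'z'],
)

# Values of `first` win; its NaN cells and its missing rows are taken from `second`.
combined = first.combine_first(second)
print(combined)
```

Row 'z' exists only in `second` and is carried over unchanged, while the two _NaN_ cells of `first` are filled with the matching values of `second`.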


In [13]:
tissue = _4df[0].combine_first(_4df[1])
stool = _4df[2].combine_first(_4df[3])
tissue_and_stool = stool.combine_first(tissue)
print('the index of the following dataframe is unique?', tissue_and_stool.index.is_unique)
tissue_and_stool.head(20)


the index of the following dataframe is unique? True


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,1,2.0,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,2,,5.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,1,,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,2,,1.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,1,14.0,15.0
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,2,,26.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,1,7.0,8.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,2,,16.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,1,23.0,14.0
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,2,2.0,28.0


### Add the 'extraction control' value to the dataframe
The newly added column is also renamed from _'count'_ to _'extraction control'_.

In [14]:
tissue_and_stool = tissue_and_stool.join(mb1, how='outer')
tissue_and_stool.rename(columns={'count': 'extraction control'}, inplace=True)
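The join above works because the single index of `mb1` shares its name, 'bacteria', with one level of the three-level index. A minimal sketch on toy data follows; the labels and values are made up, and it uses the default left join rather than the outer join of the cell above.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 'stool', 1), ('A', 'tissue', 1), ('B', 'tissue', 1)],
    names=['bacteria', 'sample', 'context'],
)
measurements = pd.DataFrame({'NEC': [7.0, 23.0, 2.0]}, index=index)

# One value per bacterium, indexed by a single level called 'bacteria'.
extraction = pd.DataFrame({'count': [7]}, index=pd.Index(['A'], name='bacteria'))

# The join aligns on the shared level name and repeats the value
# for every (sample, context) combination of the same bacterium.
joined = measurements.join(extraction).rename(columns={'count': 'extraction control'})
print(joined)
```

Both rows of bacterium 'A' receive the same 'extraction control' value, and 'B', which is absent from the single-index dataframe, gets _NaN_.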

### Fill NaN values
Let's now fill the _NaN_ values with the tag _unknown_.
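One side effect of this step is worth keeping in mind: putting a string into a numeric column turns it into an `object` column, so numeric operations no longer apply directly. The sketch below, on made-up values, shows the effect and one possible way back to numbers.

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame({'NEC': [2.0, np.nan], 'Control': [1.0, 5.0]})

# Filling a float column with a string yields a column of mixed Python objects.
filled = counts.fillna('unknown')
print(filled.dtypes)

# If numbers are needed again, non-numeric entries can be coerced back to NaN.
restored = pd.to_numeric(filled['NEC'], errors='coerce')
print(restored)
```

For this reason it may be convenient to apply the tag only at the very end, after any numeric processing.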

In [15]:
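# Assign the result instead of using inplace=True: filling numeric columns
# with a string changes their dtype to object.
tissue_and_stool = tissue_and_stool.fillna('unknown')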
tissue_and_stool.head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control,extraction control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,1,2,1,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,2,unknown,5,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,1,unknown,1,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,2,unknown,1,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,1,14,15,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,2,unknown,26,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,1,7,8,7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,2,unknown,16,7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,1,23,14,7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,2,2,28,7


# Final Comments
Another option would have been to create more index levels, one for every category in the hierarchy of the bacteria name ('Life', 'Domain', 'Kingdom', etc.), to give the user easy access to subsets of the bacteria (for example, all the organisms in the _Archaea_ domain).
However, this would have needlessly complicated the handling of the dataset, since for some bacteria the description of the hierarchy is not even complete. Moreover, the dataframe that has been created still lets the user perform such a query in the following way:


In [16]:
bacterias = tissue_and_stool.index.get_level_values(0)
mask = ['Archaea' in name for name in bacterias]
tissue_and_stool.loc[mask]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NEC,Control,extraction control
bacteria,sample,context,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,1,2,1,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Acidilobaceae Acidilobus",tissue,2,unknown,5,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,1,unknown,1,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",stool,2,unknown,1,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,1,14,15,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Acidilobales Caldisphaeraceae Caldisphaera",tissue,2,unknown,26,unknown
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,1,7,8,7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",stool,2,unknown,16,7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,1,23,14,7
"Archaea ""Crenarchaeota"" Thermoprotei Desulfurococcales Desulfurococcaceae Ignisphaera",tissue,2,2,28,7

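The same kind of query can also be written with the vectorized string methods of the index. This is a sketch on a toy dataframe with made-up labels; note that `startswith` is stricter than the `in` test used above, since it only matches names that begin with the given word.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Archaea X', 'tissue', 1), ('Bacteria Y', 'tissue', 1), ('Archaea Z', 'stool', 2)],
    names=['bacteria', 'sample', 'context'],
)
df = pd.DataFrame({'NEC': [2, 5, 1]}, index=index)

# Boolean mask computed on the 'bacteria' level only.
names = df.index.get_level_values('bacteria')
mask = names.str.startswith('Archaea')
archaea = df.loc[mask]
print(archaea)
```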