# Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

## Task 2.1 Importing the files

The files MID1.xls to MID9.xls were first opened in Excel to inspect the data directly, to check how many sheets each file contains, and to make sure no data would be missed during the import. After verifying that every .xls file has only one sheet, the first 9 .xls files were imported into the following 9 dataframes.
The name 'mb1' stands for 'microbiome 1', 'mb2' stands for 'microbiome 2', and so on. 

In [1]:
# Importing all the files from Excel
# (DATA_FOLDER is expected to be defined earlier in the notebook)
import pandas as pd

mb1 = pd.read_excel(DATA_FOLDER+'/microbiome/MID1.xls', 'Sheet 1', index_col=0, header=None)
mb2 = pd.read_excel(DATA_FOLDER+'/microbiome/MID2.xls', 'Sheet 1', index_col=0, header=None)
mb3 = pd.read_excel(DATA_FOLDER+'/microbiome/MID3.xls', 'Sheet 1', index_col=0, header=None)
mb4 = pd.read_excel(DATA_FOLDER+'/microbiome/MID4.xls', 'Sheet 1', index_col=0, header=None)
mb5 = pd.read_excel(DATA_FOLDER+'/microbiome/MID5.xls', 'Sheet 1', index_col=0, header=None)
mb6 = pd.read_excel(DATA_FOLDER+'/microbiome/MID6.xls', 'Sheet 1', index_col=0, header=None)
mb7 = pd.read_excel(DATA_FOLDER+'/microbiome/MID7.xls', 'Sheet 1', index_col=0, header=None)
mb8 = pd.read_excel(DATA_FOLDER+'/microbiome/MID8.xls', 'Sheet 1', index_col=0, header=None)
mb9 = pd.read_excel(DATA_FOLDER+'/microbiome/MID9.xls', 'Sheet 1', index_col=0, header=None)
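The nine near-identical calls can also be written as a loop. The sketch below is one possible way to do it, not the approach used in this notebook: the function name `load_microbiome` and its `reader` parameter are hypothetical, introduced only so the path-building logic can be tried without the real files. The sheet name `'Sheet 1'` and the file layout are taken from the cell above.

```python
import os
import pandas as pd

def load_microbiome(data_folder, n_files=9, reader=pd.read_excel):
    """Read MID1.xls ... MID<n_files>.xls into a dict keyed 'mb1', 'mb2', ...

    `reader` is a hypothetical hook (defaulting to pd.read_excel) so that
    the loop can be exercised with a stand-in function.
    """
    frames = {}
    for i in range(1, n_files + 1):
        path = os.path.join(data_folder, 'microbiome', f'MID{i}.xls')
        # Same options as the explicit calls: first column as index, no header row
        frames[f'mb{i}'] = reader(path, sheet_name='Sheet 1',
                                  index_col=0, header=None)
    return frames
```

With the real data this would be called as `frames = load_microbiome(DATA_FOLDER)`, and `frames['mb1']` would play the role of `mb1`.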

### Taking a look at the datasets

From the earlier inspection in Excel, all the dataframes looked very simple, with one column for the name of the bacterium and one column for the corresponding value, probably recorded in some experiment. The head of the first dataframe is displayed anyway in order to:
* check that the data have been imported successfully
* get a visual picture of the dataset, which helps in handling it correctly

In [None]:
mb1.head()

It can be observed that pandas used the first column as the index (as requested with `index_col=0`) and kept the recorded value as the only column of the dataframe.
Now the 10th file, 'metadata.xls', is loaded to provide information on the values in the dataframes mb1-mb9:

In [None]:
names = pd.read_excel(DATA_FOLDER+'/microbiome/metadata.xls', 'Sheet1', na_values=['NA'])
names

Setting the 'BARCODE' column as index we obtain:

In [None]:
names.set_index('BARCODE', inplace=True)
names

Let's now look in detail at the information provided in the different spreadsheets:

In [None]:
print('EXTRACTION CONTROL\n', mb1.describe(), '\n\n')
print('tissue, NEC 1\n', mb2.describe(), '\n\n')
print('tissue, Control 1\n', mb3.describe(), '\n\n')
print('tissue, NEC 2\n', mb4.describe(), '\n\n')
print('tissue, Control 2\n', mb5.describe(), '\n\n')
print('stool, NEC 1\n', mb6.describe(), '\n\n')
print('stool, Control 1\n', mb7.describe(), '\n\n')
print('stool, NEC 2\n', mb8.describe(), '\n\n')
print('stool, Control 2\n', mb9.describe(), '\n\n')

The following things can be noticed:
* the dataframes have very different sizes (from 99 to 385 rows),
* all the dataframes have a very high standard deviation, with many low values and a few very high ones
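A quick way to see this kind of skew is to compare the mean with the median and the standard deviation. The sketch below uses invented counts, not values from the spreadsheets.

```python
import pandas as pd

# Invented counts: many small values and one very large one
counts = pd.Series([1, 2, 1, 3, 2, 500], name='count')

summary = counts.describe()
print(summary[['mean', 'std', '50%', 'max']])

# A standard deviation above the mean and a median far below the mean
# are both signs of a heavily right-skewed distribution
std_exceeds_mean = counts.std() > counts.mean()
median_below_mean = counts.median() < counts.mean()
print(std_exceeds_mean, median_below_mean)
```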

### Comments on metadata.xls and on our understanding of the dataset
From the metadata.xls file, the content of the dataframes `mb1-mb9` has been understood as follows.
* For each bacterium, only one value for the field `'extraction control'` has been recorded.
* For each bacterium, the values of the fields 'NEC' and 'Control' have been recorded four times each (`'NEC 1'`, `'NEC 2'`, `'Control 1'`, `'Control 2'`) on two different `'samples'={'tissue', 'stool'}` and in two different `contexts={1, 2}`. 

### Final goal

To keep track of the `'NEC'` and `'Control'` values, it then seems logical to merge the _eight dataframes mb2-mb9_ into a single dataset with the following characteristics:
* a three-level index with the following levels: `['bacteria name', 'sample', 'context']`
* with the following columns: `['NEC', 'Control','extraction control']`
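To make the target layout concrete, here is a minimal sketch of a dataframe with that three-level index and those three columns. The bacteria names and all values are invented placeholders.

```python
import numpy as np
import pandas as pd

# Invented rows, only to illustrate the intended shape
index = pd.MultiIndex.from_tuples(
    [('Bacteria A', 'stool', 1),
     ('Bacteria A', 'tissue', 1),
     ('Bacteria B', 'tissue', 2)],
    names=['bacteria', 'sample', 'context'])

target = pd.DataFrame(
    {'NEC': [7.0, 2.0, np.nan],
     'Control': [1.0, np.nan, 5.0],
     'extraction control': [3.0, 3.0, np.nan]},
    index=index)

# All rows for one bacterium, whatever the sample and context
print(target.loc['Bacteria A'])

# One specific (bacteria, sample, context) combination
print(target.loc[('Bacteria A', 'tissue', 1)])
```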


## Task 2.2 First four merges

Let's then proceed by renaming the index and the value column of each dataframe in a consistent way:

In [None]:
for mb in [mb1,mb2, mb3, mb4, mb5, mb6, mb7, mb8, mb9]:
    mb.columns=['count']
    mb.index.name='bacteria'

Now, let's do _4 merges_ of the following pairs of dataframes:

* `mb2` with `mb3`
* `mb4` with `mb5`
* `mb6` with `mb7`
* `mb8` with `mb9`

and store the results in a list of dataframes called `_4df`.

Recall that:
* `mb2, mb3` contain, respectively, the values of `'NEC', 'Control'` for `sample=tissue, context=1`
* `mb4, mb5` contain, respectively, the values of `'NEC', 'Control'` for `sample=tissue, context=2`
* `mb6, mb7` contain, respectively, the values of `'NEC', 'Control'` for `sample=stool, context=1`
* `mb8, mb9` contain, respectively, the values of `'NEC', 'Control'` for `sample=stool, context=2`

The following cell does this in a short, if slightly cumbersome, way. The following aspects of the code should be noted:
* `pd.concat()` has been used to concatenate the columns of `'NEC'` and `'Control'` values, since it creates a dataframe with the _union_ of the indices of the two datasets, filling missing values with `NaN`
* two columns have been added to record the `sample` and the `context` of the data in the new dataframes
* a three-level index `['bacteria', 'sample', 'context']` has been created

In [None]:
samples = ['tissue'] * 2 + ['stool'] * 2
_4df = []

for i, (nec, control) in enumerate(zip([mb2, mb4, mb6, mb8], [mb3, mb5, mb7, mb9])):
    # Column-wise concat aligns on the union of the two indices (gaps become NaN)
    df = pd.concat([nec['count'].rename('NEC'),
                    control['count'].rename('Control')], axis=1)
    df['sample'] = samples[i]
    df['context'] = (i % 2) + 1
    df.index.name = 'bacteria'
    # Move 'sample' and 'context' into the index: ['bacteria', 'sample', 'context']
    df = df.set_index(['sample', 'context'], append=True)
    _4df.append(df)
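The union behaviour of `pd.concat(..., axis=1)` mentioned in the notes above can be checked on a small example. The bacteria names and counts are invented.

```python
import pandas as pd

# Two invented series that share only 'Bacteria B'
nec = pd.Series({'Bacteria A': 7, 'Bacteria B': 2}, name='NEC')
control = pd.Series({'Bacteria B': 5, 'Bacteria C': 1}, name='Control')

# axis=1 puts the series side by side, aligned on the union of their indices
pair = pd.concat([nec, control], axis=1)
print(pair)

# 'Bacteria A' has no Control value and 'Bacteria C' has no NEC value,
# so those cells are NaN
print(pair.isna())
```

Because `NaN` is a floating-point value, integer counts become floats in any column that ends up with a gap.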

Let's show the results to have a better understanding of what has been done:

In [None]:
_4df[0].head()

In [None]:
_4df[1].head()

In [None]:
_4df[2].head()

In [None]:
_4df[3].head()

## Task 2.3 Combine the four dataframes into one
Now it is sufficient to _combine_ these datasets to obtain the desired one. 
Note the use of `combine_first()`, so that the index and the columns of the result are the union of the respective indexes and columns of the two dataframes involved.

In [None]:
tissue = _4df[0].combine_first(_4df[1])
stool = _4df[2].combine_first(_4df[3])
tissue_and_stool = stool.combine_first(tissue)
print('the index of the following dataframe is unique?', tissue_and_stool.index.is_unique)
print('\n\n', tissue_and_stool.describe())
tissue_and_stool.head(20)
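The effect of `combine_first()` on dataframes with this kind of three-level index can be seen on a small example with invented values. Since the two inputs differ in `context`, their index entries never coincide, so the result simply contains the rows of both and the index stays unique.

```python
import pandas as pd

names = ['bacteria', 'sample', 'context']

# Invented values: same bacterium and sample, different context
first = pd.DataFrame(
    {'NEC': [7.0], 'Control': [1.0]},
    index=pd.MultiIndex.from_tuples([('Bacteria A', 'tissue', 1)], names=names))
second = pd.DataFrame(
    {'NEC': [4.0], 'Control': [9.0]},
    index=pd.MultiIndex.from_tuples([('Bacteria A', 'tissue', 2)], names=names))

# The result is indexed by the union of both indexes;
# values from `first` take priority wherever the two overlap
combined = first.combine_first(second)
print(combined)
print(combined.index.is_unique)
```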


### Finally, add the 'extraction control' value to the dataframe
The newly added column is also renamed from _'count'_ to _'extraction control'_.

In [None]:
tissue_and_stool = tissue_and_stool.join(mb1, how='outer')
tissue_and_stool.rename(columns={'count': 'extraction control'}, inplace=True)
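The join above relies on pandas matching the single-level index of `mb1` against the index level of the same name (`'bacteria'`) in the three-level index. The sketch below illustrates this on invented data. For simplicity it uses the default left join rather than `how='outer'`, so it does not show what happens to bacteria that appear only in the extraction control.

```python
import pandas as pd

# Invented three-level dataframe, sorted by index
idx = pd.MultiIndex.from_tuples(
    [('Bacteria A', 'stool', 1),
     ('Bacteria A', 'tissue', 1),
     ('Bacteria B', 'tissue', 1)],
    names=['bacteria', 'sample', 'context'])
counts = pd.DataFrame({'NEC': [1.0, 2.0, 3.0]}, index=idx)

# Invented single-level dataframe, indexed by 'bacteria' only
control = pd.DataFrame({'count': [10, 20]},
                       index=pd.Index(['Bacteria A', 'Bacteria B'], name='bacteria'))

# The shared level name 'bacteria' is used as the join key, so each
# control value is repeated for every (sample, context) of that bacterium
joined = counts.join(control).rename(columns={'count': 'extraction control'})
print(joined)
```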

### Fill NaN values
Let's now fill the `NaN` values with the tag `unknown`, as requested:

In [None]:
tissue_and_stool.fillna('unknown', inplace=True )
tissue_and_stool.head(20)
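One side effect worth keeping in mind is that a column containing both numbers and the string `'unknown'` can no longer be numeric. The sketch below, on invented data, makes the conversion to `object` explicit before filling. The explicit `astype(object)` is an addition for illustration and is not part of the cell above.

```python
import numpy as np
import pandas as pd

# Invented dataframe with one gap per column
df = pd.DataFrame(
    {'NEC': [7.0, np.nan], 'Control': [np.nan, 5.0]},
    index=pd.Index(['Bacteria A', 'Bacteria B'], name='bacteria'))

# Cast to object first, then replace every NaN with the requested tag
filled = df.astype(object).fillna('unknown')
print(filled)

# The columns now hold a mix of numbers and strings
print(filled.dtypes)
```

Numerical summaries such as `describe()` are therefore best computed before this step.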

## Task 2.4 Final Comments
Another option would have been to create more index levels, one for every category in the taxonomic hierarchy of the bacteria names ('Life', 'Domain', 'Kingdom', etc.), to let the user access subsets of the bacteria easily (for example, all the organisms in the _Archaea_ domain).
However, this was judged to be an unnecessary complication in handling the dataset, since for several bacteria the complete description of the hierarchy is not even available. Moreover, the dataframe that has been created still lets a user familiar with the hierarchy perform such a query in the following way:

In [None]:
bacterias = tissue_and_stool.index.get_level_values(0)
mask = ['Archaea' in name for name in bacterias]
tissue_and_stool.loc[mask]
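The same kind of mask can also be built with the vectorised string methods of the index. This is an alternative sketch on invented names, not the query used above.

```python
import pandas as pd

# Invented three-level index; only the first level matters for the query
idx = pd.MultiIndex.from_tuples(
    [('Archaea Example1', 'stool', 1),
     ('Archaea Example1', 'tissue', 1),
     ('Bacteria Example2', 'tissue', 1)],
    names=['bacteria', 'sample', 'context'])
df = pd.DataFrame({'NEC': [1.0, 2.0, 3.0]}, index=idx)

# Plain substring match (regex=False) on the 'bacteria' level
names = df.index.get_level_values('bacteria')
mask = names.str.contains('Archaea', regex=False)

subset = df.loc[mask]
print(subset)
```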