# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [2292]:
import pandas as pd
import numpy as np
import glob, os
import random
pd.options.mode.chained_assignment = None
idx = pd.IndexSlice

In [2293]:
!cat ../Data/ebola/sl_data/2014-10-04.csv 

cat: ../Data/ebola/sl_data/2014-10-04.csv: No such file or directory


In [2294]:
DATA_FOLDER = '' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

### Assumptions

In order to clean the data, we made the following assumptions:
- A person counted in one of the three categories (i.e. confirmed, suspect, probable) for deaths and cases can't be counted in another category. This reasonable assumption guarantees that we are not counting people multiple times when computing the means.
- As told during the lab session, we consider the Totals/National field for each country to be consistent with the reports of the individual regions/cities.
- Cumulative descriptions/variables (e.g. Total deaths of confirmed) are assumed to carry all items from previous days that have not been registered. More precisely, let d1 and d2 be two consecutive report dates such that d1 < d2 and d2 - d1 > 1 day. Then the report at d2 covers the corresponding description from d1 + 1 to d2.
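
The last assumption can be illustrated with a minimal sketch. The series below is hypothetical (it is not taken from the reports) and simply shows that differencing a cumulative series with a missing day assigns the whole increment to the next available report, so the sum over the period is unchanged.

```python
import pandas as pd

# Hypothetical cumulative death counts; the report for 2014-08-03 is missing.
cumulative = pd.Series(
    [10, 12, 19],
    index=pd.to_datetime(["2014-08-01", "2014-08-02", "2014-08-04"]),
)

# Differences between consecutive *available* reports.
# The value on 2014-08-04 covers both 2014-08-03 and 2014-08-04.
increments = cumulative.diff().dropna()

# The total over the period does not depend on the gap.
period_total = increments.sum()
```

Under this assumption the monthly sums stay correct even when daily reports are missing, although a gap that spans a month boundary would still be attributed entirely to the later month.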

### Loading the data for each country

In [2295]:
def load_country(country_name):
    return pd.concat(map(pd.read_csv, glob.glob(r'../Data/ebola/' + country_name + '_data/*.csv')))

In [2296]:
#Load the data into a dataframe for each country
guinea_dataframe = load_country('guinea')
liberia_dataframe = load_country('liberia')
sierraleone_dataframe = load_country('sl')

#Print columns label
print('Guinea:', guinea_dataframe.columns.values)
print('Liberia:', liberia_dataframe.columns.values)
print('Sierra Leone:', sierraleone_dataframe.columns.values)

Guinea: ['Beyla' 'Boffa' 'Conakry' 'Coyah' 'Dabola' 'Dalaba' 'Date' 'Description'
 'Dinguiraye' 'Dubreka' 'Forecariah' 'Gueckedou' 'Kerouane' 'Kindia'
 'Kissidougou' 'Kouroussa' 'Lola' 'Macenta' 'Mzerekore' 'Nzerekore' 'Pita'
 'Siguiri' 'Telimele' 'Totals' 'Yomou']
Liberia: ['Bomi County' 'Bong County' 'Date' 'Gbarpolu County' 'Grand Bassa'
 'Grand Cape Mount' 'Grand Gedeh' 'Grand Kru' 'Lofa County'
 'Margibi County' 'Maryland County' 'Montserrado County' 'National'
 'Nimba County' 'River Gee County' 'RiverCess County' 'Sinoe County'
 'Unnamed: 18' 'Variable']
Sierra Leone: ['34 Military Hospital' 'Bo' 'Bo EMC' 'Bombali' 'Bonthe' 'Hastings-F/Town'
 'Kailahun' 'Kambia' 'Kenema' 'Kenema (IFRC)' 'Kenema (KGH)' 'Koinadugu'
 'Kono' 'Moyamba' 'National' 'Police training School'
 'Police traning School' 'Port Loko' 'Pujehun' 'Tonkolili' 'Unnamed: 18'
 'Western area' 'Western area combined' 'Western area rural'
 'Western area urban' 'date' 'variable']


### Removing duplicates

In [2297]:
print('Guinea has duplicates ?', True in guinea_dataframe.duplicated(subset=['Date', 'Description']).values)
print('Liberia has duplicates ?', True in liberia_dataframe.duplicated(subset=['Date', 'Variable']).values)
print('Sierra Leone has duplicates ?', True in sierraleone_dataframe.duplicated(subset=['date', 'variable']).values)

Guinea has duplicates ? False
Liberia has duplicates ? True
Sierra Leone has duplicates ? True


In [2298]:
liberia_dataframe = liberia_dataframe.drop_duplicates(['Date', 'Variable'])
sierraleone_dataframe = sierraleone_dataframe.drop_duplicates(['date', 'variable'])
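
As a small illustration of what this step does, the sketch below uses a hypothetical frame (the values are made up) with two rows sharing the same date and variable. It is worth noting that `drop_duplicates` keeps the first occurrence by default, so a conflicting second value is silently discarded.

```python
import pandas as pd

# Hypothetical reports: the first two rows share the same (Date, Variable) key.
reports = pd.DataFrame({
    "Date": ["2014-10-04", "2014-10-04", "2014-10-05"],
    "Variable": ["new_confirmed", "new_confirmed", "new_confirmed"],
    "National": [3, 5, 2],
})

# True if at least one (Date, Variable) pair appears more than once.
has_duplicates = reports.duplicated(subset=["Date", "Variable"]).any()

# keep='first' is the default: the row with National == 5 is dropped.
deduplicated = reports.drop_duplicates(["Date", "Variable"])
```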

### Selecting the interesting descriptions/variables 

In [2299]:
guinea_descriptions = set(guinea_dataframe.Description.values.tolist())
liberia_descriptions = set(liberia_dataframe.Variable.values.tolist())
sierraleone_descriptions = set(sierraleone_dataframe.variable.values.tolist())

print(len(guinea_descriptions))
print(len(liberia_descriptions))
print(len(sierraleone_descriptions))

60
45
37


In [2300]:
# All possible descriptions for guinea
print(guinea_descriptions)

# From which we decided to keep only the following descriptions
filtered_guinea_descriptions = ['New cases of confirmed', 
                                'New cases of suspects', 
                                'New cases of probables',
                                'Total deaths of confirmed',
                                'Total deaths of suspects',
                                'Total deaths of probables']

# Will be useful later
guinea_rename_dict = {'New cases of confirmed': 'guinea_new_confirmed',
                      'New cases of suspects': 'guinea_new_suspects',
                      'New cases of probables': 'guinea_new_probables',
                      'Total deaths (confirmed + probables + suspects)': 'guinea_new_deaths',
                      'Total deaths of confirmed': 'guinea_new_death_confirmed',
                      'Total deaths of suspects': 'guinea_new_death_suspects',
                      'Total deaths of probables': 'guinea_new_death_probables',
                      'New deaths registered among health workers': 'guinea_new_deaths_health_workers',
                      'New deaths registered today': 'guinea_new_deaths',
                      'New deaths registered': 'guinea_new_deaths'}

{'Number of suspects cases among health workers', 'Total cases of confirmed', 'New deaths registered today (suspects)', 'Total deaths of confirmed', 'Total PEC center today', 'Total number of male cases', 'Total deaths registered among health workers', 'Total suspected non-class cases', 'Total cases of probables', 'Number of male suspects cases', 'New cases of confirmed among health workers', 'New deaths registered today', 'Total deaths of suspects', 'Number of contacts out of track', 'New cases of probables', 'Number of death of confirmed cases among health workers', 'Number of contacts out of the track 21 days', 'Total number of exits from CTE', 'Total samples tested', 'Number of female probables cases', 'Number of contacts to follow today', 'Number of patients tested', 'Total PEC center today (suspects)', 'Number of deaths of confirmed cases among health workers', 'Total deaths of probables', 'Number of contacts followed yesterday', 'Total cases of suspects', 'New deaths registered 

For Guinea, the descriptions 'New deaths registered today (probables)' and 'New deaths registered today (suspects)' are not really interesting, as they are only available for one day and are equal to 0. Moreover, the description 'New deaths registered today (confirmed)' is also available for only one day and duplicates the description 'New deaths registered today'. Thus we decided to drop those descriptions.
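
One way to check how many days a description is available is sketched below. The description names come from the list above, but the dates and values are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical excerpt: one description appears on a single day only.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2014-08-04", "2014-08-04", "2014-08-26", "2014-08-27"]),
    "Description": [
        "New deaths registered today (suspects)",
        "New deaths registered today",
        "New deaths registered today",
        "New deaths registered today",
    ],
    "Totals": [0, 2, 5, 1],
})

# Number of distinct days on which each description has a value.
days_available = (
    sample.dropna(subset=["Totals"])
          .groupby("Description")["Date"]
          .nunique()
)
```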

In [2301]:
#All possible descriptions for liberia
print(liberia_descriptions)

#From which we decided to keep only the following descriptions
filtered_liberia_descriptions = ['New case/s (confirmed)', 
                                 'New Case/s (Suspected)', 
                                 'New Case/s (Probable)',
                                 'Total death/s in confirmed cases',
                                 'Total death/s in suspected cases',
                                 'Total death/s in probable cases']

liberia_rename_dict = {'New case/s (confirmed)': 'liberia_new_confirmed',
                       'New Case/s (Suspected)': 'liberia_new_suspects',
                       'New Case/s (Probable)': 'liberia_new_probables',
                       'Total death/s in confirmed cases': 'liberia_new_deaths_confirmed',
                       'Total death/s in suspected cases': 'liberia_new_deaths_suspected',
                       'Total death/s in probable cases': 'liberia_new_deaths_probable',
                       'Newly reported deaths': 'liberia_new_deaths',
                       'Newly Reported deaths in HCW': 'liberia_new_deaths_hcw'}

{'Specimens pending for testing', 'Total Number of Confirmed Cases of Sierra Leonean Nationality', 'New case/s (confirmed)', 'Total specimens tested', 'Total Number of Confirmed Cases \n of Guinean Nationality', 'Cumulative cases among HCW', 'Contacts who completed 21 day follow-up', 'Newly Reported deaths in HCW', 'Total death/s in suspected cases', 'New Case/s (Suspected)', 'Total Number of Confirmed Cases \n of Sierra Leonean Nationality', 'Total confirmed cases', 'Newly Reported Cases in HCW', 'Total Case/s (Probable)', 'New Case/s (Probable)', 'Case Fatality Rate (CFR) - Confirmed & Probable Cases', 'Total case/s (confirmed)', 'Contacts lost to follow-up', 'Total no. currently in Treatment \n Units', 'Newly reported deaths', 'Currently under follow-up', 'Total discharges', 'Total suspected cases', 'Total Number of Confirmed Cases of Guinean Nationality', 'Cumulative CFR', 'Total death/s in confirmed, \n probable, suspected cases', 'Total death/s in confirmed, probable, suspected c

In [2302]:
#All possible descriptions for Sierra Leone
print(sierraleone_descriptions)

#From which we decided to keep only the following descriptions
filtered_sierraleone_descriptions = ['death_confirmed', 'death_probable', 'death_suspected',
                                     'new_confirmed', 'new_probable', 'new_suspected']

sierraleone_rename_dict = {'death_confirmed': 'sierraleone_death_confirmed',
                           'death_suspected': 'sierraleone_death_suspected',
                           'death_probable': 'sierraleone_death_probable',
                           'new_confirmed': 'sierraleone_new_confirmed',
                           'new_probable': 'sierraleone_new_probable',
                           'new_suspected': 'sierraleone_new_suspected'}

{'cum_confirmed', 'new_contacts', 'new_probable', 'new_positive', 'death_probable', 'total_lab_samples', 'etc_cum_discharges', 'new_noncase', 'etc_cum_deaths', 'negative_corpse', 'new_suspected', 'etc_new_discharges', 'positive_corpse', 'death_suspected', 'repeat_samples', 'etc_cum_admission', 'cum_contacts', 'cum_suspected', 'cum_probable', 'pending', 'cum_noncase', 'death_confirmed', 'new_confirmed', 'cfr', 'new_samples', 'contacts_followed', 'etc_new_admission', 'percent_seen', 'contacts_healthy', 'cum_completed_contacts', 'new_negative', 'etc_new_deaths', 'contacts_not_seen', 'etc_currently_admitted', 'contacts_ill', 'population', 'new_completed_contacts'}


### Cleaning the dataframes

#### General cleaning

In [2303]:
# Keep only interesting columns
guinea_dataframe = guinea_dataframe[['Date', 'Description', 'Totals']]
liberia_dataframe = liberia_dataframe[['Date', 'Variable', 'National']]
sierraleone_dataframe = sierraleone_dataframe[['date', 'variable', 'National']]

# Standardize the date field for each dataframe
guinea_dataframe.Date = pd.to_datetime(guinea_dataframe.Date)
liberia_dataframe.Date = pd.to_datetime(liberia_dataframe.Date)
sierraleone_dataframe.date = pd.to_datetime(sierraleone_dataframe.date)

# Keep only the interesting variables/descriptions
guinea_dataframe = guinea_dataframe[[des in filtered_guinea_descriptions for des in guinea_dataframe.Description]]
liberia_dataframe = liberia_dataframe[[var in filtered_liberia_descriptions for var in liberia_dataframe.Variable]]
sierraleone_dataframe = sierraleone_dataframe[[var in filtered_sierraleone_descriptions for var in sierraleone_dataframe.variable]]

# Remove rows with missing value
guinea_dataframe = guinea_dataframe.dropna()
liberia_dataframe = liberia_dataframe.dropna()
sierraleone_dataframe = sierraleone_dataframe.dropna()

# Cast all values to int
guinea_dataframe.Totals = guinea_dataframe.Totals.astype(int)
liberia_dataframe.National = liberia_dataframe.National.astype(int)
sierraleone_dataframe.National = sierraleone_dataframe.National.astype(int)

# Rename all reports 
guinea_dataframe.Description = guinea_dataframe.Description.apply(lambda des: guinea_rename_dict[des]) 
liberia_dataframe.Variable = liberia_dataframe.Variable.apply(lambda var: liberia_rename_dict[var])
sierraleone_dataframe.variable = sierraleone_dataframe.variable.apply(lambda var: sierraleone_rename_dict[var])
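
The filter, cast and rename steps of this cell can also be written with `isin` and `map`. The following is only a sketch on a hypothetical three-row frame, not a replacement for the cell above.

```python
import pandas as pd

# Two of the Sierra Leone renames used in this notebook.
rename_dict = {
    "new_confirmed": "sierraleone_new_confirmed",
    "death_confirmed": "sierraleone_death_confirmed",
}

# Hypothetical rows; 'cfr' is a variable we do not want to keep.
frame = pd.DataFrame({
    "variable": ["new_confirmed", "cfr", "death_confirmed"],
    "National": ["4", "12.5", "1"],
})

# Keep only the variables that have a rename entry, then rename and cast.
kept = frame[frame["variable"].isin(list(rename_dict))].copy()
kept["variable"] = kept["variable"].map(rename_dict)
kept["National"] = kept["National"].astype(int)
```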


#### Cleaning Guinea dataframe

After looking at the Guinea dataframe, we noticed that the original 'Total deaths...' descriptions were cumulative values. We therefore had to derive the daily new deaths for each day, which is what the cell below does.

In [2304]:
def clean_cum_guinea(des_name, df):
    r_df = df.copy()
    guinea_parts_to_clean = r_df[r_df.Description == des_name]
    guinea_parts_to_clean = guinea_parts_to_clean.set_index(['Date', 'Description']).sort_index()
    
    cleaned_guinea_parts = (guinea_parts_to_clean - guinea_parts_to_clean.shift(1))
    
    idx = pd.IndexSlice
    r_df = r_df.set_index(['Date', 'Description']).sort_index()
    r_df.loc[idx[:, [des_name]], :] = cleaned_guinea_parts
    r_df = r_df.dropna().reset_index()
    return r_df
    
guinea_dataframe = clean_cum_guinea('guinea_new_death_confirmed', guinea_dataframe)
guinea_dataframe = clean_cum_guinea('guinea_new_death_suspects', guinea_dataframe)
guinea_dataframe = clean_cum_guinea('guinea_new_death_probables', guinea_dataframe)

Notice from the Guinea dataframe below that **some reports yield negative values**. We generally decided to interpret them as corrections of previous reports and thus keep them.
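
A short sketch of why keeping the negative values is reasonable: in the hypothetical cumulative series below, a downward revision produces a negative difference, and only the unclipped differences add up to the true change over the period.

```python
import pandas as pd

# Hypothetical cumulative deaths; the third report revises the count downwards.
cumulative_deaths = pd.Series(
    [64, 66, 65, 70],
    index=pd.to_datetime(["2014-08-26", "2014-08-27", "2014-08-28", "2014-08-29"]),
)

daily = cumulative_deaths.diff().dropna()

# Keeping the correction: matches the overall change 70 - 64.
total_with_corrections = daily.sum()

# Clipping negatives to zero would overcount by the size of the correction.
total_clipped = daily.clip(lower=0).sum()
```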

In [2305]:
guinea_dataframe

Unnamed: 0,Date,Description,Totals
0,2014-08-04,guinea_new_confirmed,4.0
1,2014-08-04,guinea_new_probables,0.0
2,2014-08-04,guinea_new_suspects,5.0
3,2014-08-26,guinea_new_confirmed,10.0
4,2014-08-26,guinea_new_death_confirmed,64.0
5,2014-08-26,guinea_new_death_probables,8.0
6,2014-08-26,guinea_new_death_suspects,0.0
7,2014-08-26,guinea_new_suspects,18.0
8,2014-08-27,guinea_new_confirmed,10.0
9,2014-08-27,guinea_new_death_confirmed,2.0


#### Cleaning Liberia dataframe

In [2306]:
# The last record of each liberia_new_deaths_... series is 0, which is a dirty value:
# being cumulative, it should be at least as large as the previous record.
# '...' here means {'confirmed', 'probable', 'suspected'}

liberia_dataframe = liberia_dataframe.set_index(['Date', 'Variable']).sort_index()

index0 = liberia_dataframe.loc[idx[:, 'liberia_new_deaths_confirmed'], :][-1:].index
index1 = liberia_dataframe.loc[idx[:, 'liberia_new_deaths_probable'], :][-1:].index
index2 = liberia_dataframe.loc[idx[:, 'liberia_new_deaths_suspected'], :][-1:].index

liberia_dataframe = liberia_dataframe.drop(index0)
liberia_dataframe = liberia_dataframe.drop(index1)
liberia_dataframe = liberia_dataframe.drop(index2)

liberia_dataframe = liberia_dataframe.reset_index()

We can see from the Liberia dataframe below that some rows at the end are dirty, as they switch to cumulative values.

In [2307]:
liberia_dataframe[-30:]

Unnamed: 0,Date,Variable,National
415,2014-11-27,liberia_new_suspects,25
416,2014-11-28,liberia_new_confirmed,7
417,2014-11-29,liberia_new_confirmed,10
418,2014-11-29,liberia_new_probables,4
419,2014-11-29,liberia_new_suspects,7
420,2014-11-30,liberia_new_confirmed,10
421,2014-12-01,liberia_new_confirmed,1
422,2014-12-01,liberia_new_probables,9
423,2014-12-01,liberia_new_suspects,25
424,2014-12-02,liberia_new_confirmed,9


In [2308]:
# This cell deals with the cumulative values found at the end of the dataframe for certain reports

dirty_rows_liberia = liberia_dataframe[liberia_dataframe.Date > pd.Timestamp(2014, 12, 2)]
dirty_rows_liberia = dirty_rows_liberia.set_index(['Date', 'Variable']).sort_index()

cleaned_rows_liberia = dirty_rows_liberia - dirty_rows_liberia.shift(3)

liberia_dataframe = liberia_dataframe.set_index(['Date', 'Variable']).sort_index()
liberia_dataframe.loc[pd.Timestamp(2014, 12, 4):, :] = cleaned_rows_liberia
liberia_dataframe = liberia_dataframe.dropna().reset_index()
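
The `shift(3)` above relies on exactly three variables being present for every date. A sketch of an alternative that does not depend on this, assuming hypothetical cumulative values, is to difference within each variable:

```python
import pandas as pd

# Hypothetical cumulative values for two consecutive days.
rows = pd.DataFrame({
    "Date": pd.to_datetime(["2014-12-03"] * 3 + ["2014-12-04"] * 3),
    "Variable": ["liberia_new_confirmed", "liberia_new_probables", "liberia_new_suspects"] * 2,
    "National": [2867, 1805, 3024, 2869, 1807, 3030],
})

# Difference each variable against its own previous report.
rows = rows.sort_values(["Variable", "Date"])
rows["daily"] = rows.groupby("Variable")["National"].diff()
```

The first report of each variable has no predecessor and therefore becomes `NaN`, which mirrors the `dropna()` in the cell above.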

In [2309]:
# This cell deals with cumulative records (similar to the guinea cleaning case)

def clean_cum_liberia(des_name, df):
    r_df = df.copy()
    liberia_parts_to_clean = r_df[r_df.Variable == des_name]
    liberia_parts_to_clean = liberia_parts_to_clean.set_index(['Date', 'Variable']).sort_index()
    
    cleaned_liberia_parts = (liberia_parts_to_clean - liberia_parts_to_clean.shift(1))
    
    idx = pd.IndexSlice
    r_df = r_df.set_index(['Date', 'Variable']).sort_index()
    r_df.loc[idx[:, [des_name]], :] = cleaned_liberia_parts
    r_df = r_df.dropna().reset_index()
    return r_df

liberia_dataframe = clean_cum_liberia('liberia_new_deaths_confirmed', liberia_dataframe)
liberia_dataframe = clean_cum_liberia('liberia_new_deaths_suspected', liberia_dataframe)
liberia_dataframe = clean_cum_liberia('liberia_new_deaths_probable', liberia_dataframe)

#### Cleaning Sierra Leone dataframe

In [2310]:
sierraleone_dataframe_copy = sierraleone_dataframe.copy().set_index(['date', 'variable']).unstack()
tmp = sierraleone_dataframe.set_index(['date', 'variable']).unstack()
tmp = tmp.National[['sierraleone_death_confirmed', 'sierraleone_death_probable', 'sierraleone_death_suspected']]
cleaned = (tmp - tmp.shift(1)).fillna(value=0)

sierraleone_dataframe_copy.National.loc[1:, ['sierraleone_death_confirmed', 'sierraleone_death_probable', 'sierraleone_death_suspected']] = cleaned[1:]
sierraleone_dataframe = sierraleone_dataframe_copy.stack().reset_index()[3:]


In [2311]:
# Remove first three lines 
sierraleone_dataframe = sierraleone_dataframe[3:]

In [2312]:
# Keeps only the year and month of each report
guinea_dataframe['Period'] = guinea_dataframe.Date.dt.to_period('M')
liberia_dataframe['Period'] = liberia_dataframe.Date.dt.to_period('M')
sierraleone_dataframe['Period'] = sierraleone_dataframe.date.dt.to_period('M')

# Keep only the day number of the month. It will be necessary when computing the means
guinea_dataframe.Date = guinea_dataframe.Date.apply(lambda x: x.day)
liberia_dataframe.Date = liberia_dataframe.Date.apply(lambda x: x.day)
sierraleone_dataframe.date = sierraleone_dataframe.date.apply(lambda x: x.day)
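
In this notebook the daily average per month is the monthly sum divided by the largest day number reported in that month. The sketch below, on a small sample of three reports, compares this with dividing by the number of calendar days. The column name `Day` is introduced only for this illustration.

```python
import pandas as pd

# Three sample reports of new confirmed cases in August 2014.
reports = pd.DataFrame({
    "Date": pd.to_datetime(["2014-08-04", "2014-08-26", "2014-08-27"]),
    "Totals": [4, 10, 10],
})
reports["Period"] = reports["Date"].dt.to_period("M")
reports["Day"] = reports["Date"].dt.day

grouped = reports.groupby("Period")
monthly_sum = grouped["Totals"].sum()

# Convention used here: divide by the last reported day of the month (27).
avg_until_last_report = monthly_sum / grouped["Day"].max()

# Alternative: divide by the number of calendar days in the month (31).
avg_full_month = monthly_sum / grouped["Date"].max().dt.days_in_month
```

The two conventions agree whenever a report exists for the last day of the month. Otherwise the first one assumes that the reporting simply stopped early.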

### Aggregation and merging

In [2313]:
guinea_dataframe['Country'] = 'Guinea'
liberia_dataframe['Country'] = 'Liberia'
sierraleone_dataframe['Country'] = 'Sierraleone'

In [2314]:
averages_guinea_dataframe = pd.DataFrame(guinea_dataframe.groupby(['Country', 'Period', 'Description']).apply(lambda group: group['Totals'].sum() / group['Date'].max()))
averages_guinea_dataframe

Country,Period,Description,0
Guinea,2014-08,guinea_new_confirmed,2.0
Guinea,2014-08,guinea_new_death_confirmed,3.516129
Guinea,2014-08,guinea_new_death_probables,0.548387
Guinea,2014-08,guinea_new_death_suspects,0.0
Guinea,2014-08,guinea_new_probables,0.258065
Guinea,2014-08,guinea_new_suspects,1.903226
Guinea,2014-09,guinea_new_confirmed,6.933333
Guinea,2014-09,guinea_new_death_confirmed,6.733333
Guinea,2014-09,guinea_new_death_probables,0.833333
Guinea,2014-09,guinea_new_death_suspects,6.423077


In [2315]:
averages_guinea_dataframe.unstack()

Country,Period,guinea_new_confirmed,guinea_new_death_confirmed,guinea_new_death_probables,guinea_new_death_suspects,guinea_new_probables,guinea_new_suspects
Guinea,2014-08,2.0,3.516129,0.548387,0.0,0.258065,1.903226
Guinea,2014-09,6.933333,6.733333,0.833333,6.423077,0.633333,2.9
Guinea,2014-10,6.0,23.0,2.0,-169.0,0.0,28.0


In [2238]:
# Computes average for each month
averages_guinea_dataframe = pd.DataFrame(guinea_dataframe.groupby(['Period', 'Description']).apply(lambda group: group['Totals'].sum() / group['Date'].max()))
averages_liberia_dataframe = pd.DataFrame(liberia_dataframe.groupby(['Period', 'Variable']).apply(lambda group: group['National'].sum() / group['Date'].max()))
averages_sierraleone_dataframe = pd.DataFrame(sierraleone_dataframe.groupby(['Period', 'variable']).apply(lambda group: group['National'].sum() / group['date'].max()))

# Remove index
averages_guinea_dataframe = averages_guinea_dataframe.reset_index()
averages_liberia_dataframe = averages_liberia_dataframe.reset_index()
averages_sierraleone_dataframe = averages_sierraleone_dataframe.reset_index()

# Rename columns
averages_guinea_dataframe.columns = ['date', 'report', 'average']
averages_liberia_dataframe.columns = ['date', 'report', 'average']
averages_sierraleone_dataframe.columns = ['date', 'report', 'average']

# Stack all averages into one dataframe
all_averages_dataframe = pd.concat([averages_guinea_dataframe, averages_liberia_dataframe, averages_sierraleone_dataframe])

# Sort final dataframe, set index and unstack result (for readability reasons)
all_averages_dataframe = all_averages_dataframe.sort_values(['date']).set_index(['date', 'report']).unstack()
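
Since the task asks for results per country, an alternative layout keeps the country as an index level instead of encoding it in the report name. The sketch below assumes generic report names (`new_confirmed`) and uses two of the August 2014 averages as sample values. It is one possible arrangement, not the one used in this notebook.

```python
import pandas as pd

# One row per (date, report) for each country; report names are generic here.
guinea_avg = pd.DataFrame({"date": ["2014-08"], "report": ["new_confirmed"], "average": [2.0]})
liberia_avg = pd.DataFrame({"date": ["2014-08"], "report": ["new_confirmed"], "average": [1.75]})

# The dict keys become a 'country' index level, which is then turned into a column.
by_country = pd.concat(
    {"Guinea": guinea_avg, "Liberia": liberia_avg}, names=["country"]
).reset_index(level="country")

# One row per (country, month), one column per report.
table = by_country.pivot_table(index=["country", "date"], columns="report", values="average")
```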


In [2239]:
all_averages_dataframe

date,guinea_new_confirmed,guinea_new_death_confirmed,guinea_new_death_probables,guinea_new_death_suspects,guinea_new_probables,guinea_new_suspects,liberia_new_confirmed,liberia_new_deaths_confirmed,liberia_new_deaths_probable,liberia_new_deaths_suspected,liberia_new_probables,liberia_new_suspects,sierraleone_death_confirmed,sierraleone_death_probable,sierraleone_death_suspected,sierraleone_new_confirmed,sierraleone_new_probable,sierraleone_new_suspected
2014-06,,,,,,,0.517241,0.62069,0.137931,0.37931,0.275862,0.586207,,,,,,
2014-07,,,,,,,0.769231,1.538462,1.923077,0.230769,1.576923,1.269231,,,,,,
2014-08,2.0,3.516129,0.548387,0.0,0.258065,1.903226,1.75,9.107143,10.357143,6.964286,6.357143,3.857143,3.967742,0.096774,0.193548,11.612903,0.709677,3.193548
2014-09,6.933333,6.733333,0.833333,6.423077,0.633333,2.9,4.933333,20.233333,10.833333,8.033333,23.466667,22.666667,5.433333,0.0,0.0,33.266667,0.0,6.275862
2014-10,6.0,23.0,2.0,-169.0,0.0,28.0,1.36,20.090909,6.857143,8.714286,14.322581,21.322581,16.774194,5.516129,4.709677,51.225806,0.774194,12.064516
2014-11,,,,,,,1.3,,,,3.62069,8.724138,14.689655,0.0,0.034483,43.37931,0.0,11.103448
2014-12,,,,,,,9.888889,,,,2.222222,4.333333,35.333333,0.0,0.0,32.6,0.0,8.4


## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
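
The general shape of such a combination can be sketched on in-memory data. All names below (`sheet_1`, `Taxon`, `Count`, `Group`) are hypothetical stand-ins, since the actual file and column names are not given here.

```python
import pandas as pd

# Two hypothetical sheets standing in for the 9 spreadsheets.
sheets = {
    "sheet_1": pd.DataFrame({"Taxon": ["Bacteria A", "Bacteria B"], "Count": [7, 2]}),
    "sheet_2": pd.DataFrame({"Taxon": ["Bacteria A", "Bacteria C"], "Count": [1, 5]}),
}

# Hypothetical metadata: one row per sheet, with one missing attribute.
metadata = pd.DataFrame({"Sheet": ["sheet_1", "sheet_2"], "Group": ["control", None]})

# Tag each sheet with its name, then stack them into one frame.
combined = pd.concat(
    [frame.assign(Sheet=name) for name, frame in sheets.items()],
    ignore_index=True,
)

# Attach the metadata, replace missing values, and build a unique index.
combined = combined.merge(metadata, on="Sheet", how="left").fillna("unknown")
combined = combined.set_index(["Sheet", "Taxon"])
```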

In [2215]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [2216]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

FileNotFoundError: [Errno 2] No such file or directory: '/titanic.html'

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
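
For the binning asked for in questions 2 and 6, `pd.cut` and `pd.qcut` are natural candidates. The sketch below uses a hypothetical list of ages without missing values. The real data may contain missing ages, which would have to be handled first.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([2, 19, 24, 31, 38, 45, 58, 71], name="age")

# Discrete decade intervals [0, 10), [10, 20), ..., [70, 80).
decades = pd.cut(ages, bins=range(0, 90, 10), right=False)

# Two equally populated categories, split at the median age.
halves = pd.qcut(ages, q=2, labels=["younger", "older"])
```

Here `pd.cut` uses fixed bin edges, whereas `pd.qcut` chooses the edges from the data so that each category holds the same number of passengers.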

In [None]:
# Write your answer here