# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [None]:
DATA_FOLDER = 'data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.
EBOLA = DATA_FOLDER + "/ebola"
GUINEA = EBOLA + "/guinea_data"
LIBERIA = EBOLA + "/liberia_data"
SIERRA_LEONE = EBOLA + "/sl_data"

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average* per year of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

In [None]:
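# Inspect the distinct description labels used in each country's reports
# (run this after the preprocessing cell in section 1.1)
sierra_descriptions = set([b for a,b,c, in list(preprocessed_sierra.index.values)])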
print("Sierra: {}\n".format(sierra_descriptions))

guinea_descriptions = set([b for a,b,c, in list(preprocessed_guinea.index.values)])
print("Guinea: {}\n".format(guinea_descriptions))

liberia_descriptions = set([b for a,b,c, in list(preprocessed_liberia.index.values)])
print("Liberia: {}\n".format(liberia_descriptions))
#sierra[~sierra['34 Military Hospital'].isnull()]
#preprocessed_sierra.iloc[0]


### 1.1 Importing the data

We split the cleaning into three parts, one per country, because the file format is consistent within each country. We therefore start by loading all of the data from each folder into three separate DataFrames.

In [None]:
from IPython.display import display, HTML
import numpy as np
import pandas as pd
import glob

# A few helper functions

def create_folder_data_frame(FOLDER):
    """
    Returns a pandas dataframe from a folder full of 
    csv files.
    """
    list_ = []
    # sorted() makes the load order deterministic across platforms
    for file_ in sorted(glob.glob(FOLDER + "/*.csv")):
        df = pd.read_csv(file_, index_col=None)
        list_.append(df)
    return pd.concat(list_)

def index_and_country(data_frame, country):
    """
    Sets a new column "Country" that becomes part of a multi index
    afterwards, and returns the indexed frame.
    """
    data_frame["Country"] = country
    data_frame.set_index(['Date', 'Description', 'Country'], inplace=True)
    return data_frame

def preprocess(FOLDER, name, columns=None):
    """
    We preprocess each Country one after the other and set their indices
    to a MultiIndex of Date, Description and Country, with each column
    mapped to one of these indices.
    """
    frame = create_folder_data_frame(FOLDER)
    if columns:
        frame.rename(columns=columns, inplace=True)
    frame.Date = pd.to_datetime(frame.Date)
    frame = index_and_country(frame, name)
    return frame
    
preprocessed_guinea = preprocess(GUINEA, "Guinea")

liberia_column_mapping = {"Variable": "Description", "National": "Totals", 'Unnamed: 18': "Unknown"}
preprocessed_liberia = preprocess(LIBERIA, "Liberia", liberia_column_mapping)

sierra_column_mapping = {"variable": "Description", "National": "Totals", 'Unnamed: 18': "Unknown", "date": "Date"}
preprocessed_sierra = preprocess(SIERRA_LEONE, "Sierra Leone", sierra_column_mapping)

Dropped because: Only one date given, not related to the question

After inspecting the data, we decided to drop everything except the columns of interest. The dropped columns were quite consistent with the ones we kept (for example, taking new cases of suspects versus total cases of suspects gives roughly the same data).

We drop all rows in which the sum over all cities diverges from the total column by more than 10%, which we consider a reasonable margin for deciding whether a data point is reliable. In an epidemic, it is highly improbable that one of these measurements skyrockets on one day and returns to normal on the next.

Of course, the sum of cities sometimes diverges from the "Totals" column. We assumed that some cities are missing from the reports, which is why we treat the "Totals" column as the authoritative source for the number of persons.
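As a rough illustration of the 10% consistency rule, here is a minimal sketch on an invented toy report. The column names `CityA` and `CityB` and all the values are assumptions made up for the example; only the `Totals` column name comes from our preprocessing.

```python
import pandas as pd

# Toy report: two city columns plus the reported national total (all values invented).
toy = pd.DataFrame(
    {
        "CityA": [5, 3, 40],
        "CityB": [5, 2, 1],
        "Totals": [10, 5, 20],
    },
    index=["day1", "day2", "day3"],
)

# Sum over the city columns only, then compare with the reported total.
city_sum = toy.drop(columns="Totals").sum(axis=1)
gap = (city_sum - toy["Totals"]).abs()

# Keep the rows whose gap is at most 10% of the reported total.
consistent = toy[gap <= 0.1 * toy["Totals"]]
print(consistent.index.tolist())
```

In this toy data the third row is rejected because its cities add up to 41 while the reported total is 20.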

We treat NaN as 0.

Assuming that there are no fractional people, we convert every value to an integer.
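The two assumptions above can be sketched on a small invented column. The raw strings below are made up for the example; the real reports may contain other kinds of unparseable entries.

```python
import pandas as pd

# Invented raw values: numeric strings, a missing value and an unparseable entry.
raw = pd.DataFrame({"Totals": ["12", None, "3.0", "n/a"]})

# Unparseable entries become NaN, NaN becomes 0, and everything is cast to int.
clean = raw.apply(pd.to_numeric, errors="coerce").fillna(0).astype(int)
print(clean["Totals"].tolist())
```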

Cases: 
    'Total new cases registered so far'
    'Total cases of suspects'
    'Total cases of probables'
    'Total cases of confirmed'
    'New cases of suspects'
    'New cases of probables'
    'New cases of confirmed'
    'Cumulative (confirmed + probable + suspects)'
    
Deaths:
    'Total deaths of suspects'
    'Total deaths of probables'
    'Total deaths of confirmed'
    'Total deaths (confirmed + probables + suspects)'
    'New deaths registered'

### 1.2 Cleaning up

Now that we have our three DataFrames, we need to merge them into a single one that we can query for the daily average of deaths and cases. Before merging, we clean each DataFrame, choose which columns to use, and make the column names consistent between countries.

After taking a look at the data, we decided to report the daily average of new cases and new deaths in three categories: suspected, probable, and confirmed. Hence, we only kept the corresponding columns in each DataFrame.
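To make the goal of the cleaning concrete, here is a minimal sketch of the final merge and averaging step. It assumes each cleaned frame is indexed by `Date` and `Country`; the helper `toy_frame` and all numbers are hypothetical, and only two of the six columns are shown.

```python
import pandas as pd

def toy_frame(country, dates, new_confirmed, death_confirmed):
    """Build a small frame shaped like a cleaned per-country frame (values invented)."""
    frame = pd.DataFrame(
        {
            "Date": pd.to_datetime(dates),
            "Country": country,
            "new_confirmed": new_confirmed,
            "death_confirmed": death_confirmed,
        }
    )
    return frame.set_index(["Date", "Country"])

guinea_toy = toy_frame("Guinea", ["2014-08-04", "2014-08-05"], [4, 6], [2, 0])
liberia_toy = toy_frame("Liberia", ["2014-08-04", "2014-08-05"], [10, 20], [5, 7])

# Stack the per-country frames, then average the daily values per country and year.
merged = pd.concat([guinea_toy, liberia_toy]).sort_index()
year = merged.index.get_level_values("Date").year.rename("Year")
country = merged.index.get_level_values("Country")
daily_average = merged.groupby([country, year]).mean()
print(daily_average)
```

Grouping by `.month` instead of `.year` would give a monthly breakdown with the same code.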

#### 1.2.1 Guinea

In [None]:
#preprocessed_sierra = preprocess_sierra(SIERRA_LEONE)

#Set all values to integer and replace NaN by 0.
preprocessed_sierra = preprocessed_sierra.apply(pd.to_numeric,errors='coerce')
preprocessed_sierra = preprocessed_sierra.fillna(0.0).astype(int)

#Keep descriptions with interest for us
description = ['death_suspected', 'new_probable', 'new_suspected', 'death_confirmed', 'new_confirmed', 'death_probable']
ix=preprocessed_sierra.index.get_level_values('Description').isin(description)
preprocessed_sierra = preprocessed_sierra[ix]

#Create new DataFrame with descriptions as columns
sierra = pd.DataFrame()
for cat in description:
    sierra[cat] = preprocessed_sierra.xs((cat, 'Sierra Leone'), level=('Description', 'Country'))['Totals'].tolist()

#The data give us only the cumulative totals for the deaths, so we compute the daily number of deaths
deaths = ['death_suspected', 'death_probable', 'death_confirmed']
for c in deaths:
    # diff() gives the day-to-day change; the first day has no predecessor and is set to 0
    sierra[c] = sierra[c].diff().fillna(0).astype(int)

#Set the minimum value to 0    
sierra = sierra.clip(lower=0)

#Set the country and the date as index
sierra['Country'] = 'Sierra Leone'
date = sorted(list(set(preprocessed_sierra.index.get_level_values('Date'))))
sierra['Date'] = date
sierra.set_index(['Date', 'Country'],inplace=True)
sierra


In [None]:
#Compute the mean by month and country
#sierra = sierra.groupby([sierra.index.get_level_values('Date').month,sierra.index.get_level_values('Country')]).mean()




In [None]:
# The columns that are of interest to us, mapped
# to their centralized names for the final DataFrame
interest = { 'New cases of suspects': 'new_suspected',
        'New cases of probables': 'new_probable',
        'New cases of confirmed': 'new_confirmed',
        'Total deaths of suspects': 'death_suspected',
        'Total deaths of probables': 'death_probable',
        'Total deaths of confirmed': 'death_confirmed' }
    
# We start by keeping only the interesting columns for the task
interest_indices = preprocessed_guinea.index.get_level_values('Description').isin(interest.keys())
guinea_df = preprocessed_guinea[interest_indices]

# Replacing all NaN values by 0 since we assumed that no value
# meant no new cases/new deaths on this day
# Transforming every dtype to integer
guinea_df = guinea_df.fillna(0).astype(int)

# Sorting on the date
guinea_df = guinea_df.sort_index()

# Keeping only the interesting columns, namely one for the Total
# and one for the total of cities, to filter out unreliable data points
guinea_df['Cities_total'] = guinea_df.sum(axis=1) - guinea_df.Totals
guinea_df = guinea_df[['Cities_total', 'Totals']]
print(guinea_df.columns)

# Now we want to remove the rows for which the total of cities 
# diverges from the Totals column by more than 10%
guinea_df = guinea_df[np.abs(guinea_df.Cities_total - guinea_df.Totals) <= (0.1 * guinea_df.Totals)]
guinea_df = guinea_df.Totals

# Converting the indices to columns, and readjusting the indices
guinea_df = guinea_df.unstack('Description', fill_value=0)
guinea_df.columns.rename('', inplace=True)

# We saw the missing values in the Total columns, let's fill them by
# the mean of the previous and next columns. Considering that there is
# only one row with such a problem, we directly do it on the row
# We also remove the cumulative sum, in order to get the daily data
for c in guinea_df.columns:
    if "Total" in c:
        # Fill the 2014-09-26 value with the mean of the neighbouring reports
        before = guinea_df.loc[(pd.Timestamp('2014-09-24'), 'Guinea'), c]
        after = guinea_df.loc[(pd.Timestamp('2014-09-30'), 'Guinea'), c]
        guinea_df.loc[(pd.Timestamp('2014-09-26'), 'Guinea'), c] = int((before + after) / 2)
        # Turn the cumulative totals into daily counts; the first day is set to 0
        guinea_df[c] = guinea_df[c].diff().fillna(0).astype(int)
        
# And finally, since the data can get negative because of the previous
# computations we made, we can set them back to 0
guinea_df[guinea_df < 0] = 0

# We finalize by renaming the columns to match the other countries' DataFrames
guinea_df = guinea_df.rename(columns=interest)

# We are done now, and we can merge the frames
display(HTML(guinea_df.to_html()))

Because the reports only give cumulative death totals, we recreate the daily deaths from them. To do so, we set the value for the first date to 0, which lets us compute the difference from one day to the next.
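The conversion from cumulative totals to daily counts can be sketched as follows. The series values are invented, and the dip from 103 to 101 stands in for a hypothetical downward correction in the reports.

```python
import pandas as pd

# Invented cumulative death totals over five consecutive reports.
cumulative = pd.Series([100, 103, 103, 101, 110], name="death_confirmed")

# Day-to-day difference; the first report has no predecessor, so it becomes 0.
# Negative differences (corrections in the reports) are clipped to 0.
daily = cumulative.diff().fillna(0).clip(lower=0).astype(int)
print(daily.tolist())
```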

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
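One possible shape for this task is sketched below on in-memory stand-ins rather than the real files. The column names (`taxon`, `count`, `barcode`, `group`, `sample`) and the keys `MID1` and `MID2` are assumptions made for illustration and may not match the actual spreadsheets.

```python
import pandas as pd

# Hypothetical stand-ins for two of the data spreadsheets and the metadata sheet.
sheets = {
    "MID1": pd.DataFrame({"taxon": ["A", "B"], "count": [7, 2]}),
    "MID2": pd.DataFrame({"taxon": ["A", "C"], "count": [1, 5]}),
}
metadata = pd.DataFrame(
    {
        "barcode": ["MID1", "MID2"],
        "group": ["control", "treated"],
        "sample": [None, "tissue"],
    }
)

# Stack the sheets, remembering which sheet each row came from.
combined = pd.concat(sheets, names=["barcode"])
combined = combined.reset_index(level="barcode").reset_index(drop=True)

# Attach the metadata columns, tag missing values, and build a unique index.
combined = combined.merge(metadata, on="barcode", how="left")
combined = combined.fillna("unknown").set_index(["barcode", "taxon"])
print(combined)
```

With the real data, the dictionary of frames would presumably be built with `pd.read_excel` on each file; the (sheet, taxon) pair is only one candidate for a unique index.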

In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
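For the binning steps in questions 2 and 6, a minimal sketch on invented ages is given below. It assumes `pd.cut` for the decade intervals and `pd.qcut` for the two equally populated categories; the labels `younger` and `older` are made up for the example.

```python
import pandas as pd

# Invented ages standing in for the passengers' age column.
ages = pd.Series([2, 15, 22, 28, 35, 41, 58, 63], name="age")

# Discrete decade intervals: [0, 10), [10, 20), ..., [60, 70).
decades = pd.cut(ages, bins=range(0, 71, 10), right=False)

# Two equally populated categories, split at the median age.
age_halves = pd.qcut(ages, q=2, labels=["younger", "older"])

print(decades.value_counts(sort=False))
print(age_halves.value_counts())
```

Survival proportions per category could then be obtained by grouping on the binned column together with travel class and sex and taking the mean of the survival indicator.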

In [None]:
# Write your answer here