# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [1]:
DATA_FOLDER = 'Data' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

# Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

Our goal in this first task is to obtain a single `DataFrame` with the daily average per month of new cases and deaths for the three countries. Our approach is to analyse the three countries separately and then merge the results into one `DataFrame`.

For each country, there are several reports in `.csv` format. We use the `glob` library to collect all the daily reports and merge them into one `DataFrame` per country. The `pandas` library is then used to structure and analyse the data. Other libraries such as `numpy` and `datetime` are also used. Let's first import the main libraries:
  

In [2]:
# Import libraries
import pandas as pd 
import numpy as np 
import glob 
pd.options.mode.chained_assignment = None 
from datetime import datetime, date, time
from dateutil.parser import parse
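
The loading step that is repeated for each country below can be sketched as one reusable helper. This is only an illustration: `load_reports` is a hypothetical name that does not appear in the notebook, and unlike the cells below it resets the index with `ignore_index=True` instead of keeping the per-file row numbers.

```python
import glob
import os

import pandas as pd


def load_reports(folder):
    """Concatenate every .csv file found in `folder` into one DataFrame."""
    # sorted() makes the file order (and so the row order) reproducible
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not paths:
        raise FileNotFoundError(f"no .csv files found in {folder!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)
```

Under these assumptions, a call such as `load_reports('Data/ebola/guinea_data')` would stand in for the glob loop of each country.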

## Task 1.1. Guinea

We first merge all the `.csv` files in the `guinea_data` folder.

In [3]:
# Import Guinea data in one file
data_folder ='Data/ebola/guinea_data/'
allFiles = glob.glob(data_folder + "/*.csv")
print('Number of csv files:',len(allFiles))
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
guinea_df = pd.concat(list_)

print('Shape guinea_df:',guinea_df.shape)
print('Columns:', guinea_df.columns)

Number of csv files: 22
Shape guinea_df: (714, 25)
Columns: Index(['Beyla', 'Boffa', 'Conakry', 'Coyah', 'Dabola', 'Dalaba', 'Date',
       'Description', 'Dinguiraye', 'Dubreka', 'Forecariah', 'Gueckedou',
       'Kerouane', 'Kindia', 'Kissidougou', 'Kouroussa', 'Lola', 'Macenta',
       'Mzerekore', 'Nzerekore', 'Pita', 'Siguiri', 'Telimele', 'Totals',
       'Yomou'],
      dtype='object')


The `guinea_data` folder contains 22 reports. Concatenating them gives a `DataFrame` with 25 columns and 714 rows. The columns of interest are `Date`, `Description` and `Totals`, since we only need the nationwide daily new cases and deaths. We assume that the `Totals` column correctly sums the regional columns; verifying this assumption is left as further analysis.
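
One way the `Totals` assumption could be checked is sketched below on a toy frame. The numbers are invented and only two regional columns are used; the real reports would need the same treatment over all regional columns.

```python
import pandas as pd

# Toy stand-in for the Guinea reports; the values are made up for illustration.
toy = pd.DataFrame({
    "Date": ["2014-08-04", "2014-08-04"],
    "Description": ["New cases of suspects", "New cases of confirmed"],
    "Conakry": ["2", "1"],
    "Gueckedou": ["3", None],
    "Totals": ["5", "4"],
})

meta_cols = ["Date", "Description", "Totals"]
region_cols = [c for c in toy.columns if c not in meta_cols]

# Values arrive as strings (dtype object), so convert before summing.
regional_sum = toy[region_cols].apply(pd.to_numeric, errors="coerce").sum(axis=1)
totals = pd.to_numeric(toy["Totals"], errors="coerce")

# Rows where the reported total disagrees with the sum of the regions.
mismatches = toy[regional_sum != totals]
print(mismatches)
```

In this toy example only the second row is flagged, because its regional values sum to 1 while `Totals` reports 4.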

Let's select the columns of our interest:

In [4]:
# Keep only the columns of interest: 'Date', 'Description' and 'Totals'
guinea_df = guinea_df[['Date','Description', 'Totals']]

guinea_df.head(5)

,Date,Description,Totals
0,2014-08-04,New cases of suspects,5
1,2014-08-04,New cases of probables,0
2,2014-08-04,New cases of confirmed,4
3,2014-08-04,Total new cases registered so far,9
4,2014-08-04,Total cases of suspects,11


From the `DataFrame` above, we identify several categories for new cases and new deaths in the `Description` column. Let's select the rows of interest:
 - `Total new cases registered so far`
 - `New deaths registered today`
 - `New deaths registered`

The last two categories have the same meaning, so we rename one of them to match the other. We also check the shape of the resulting `DataFrame` to verify that no information is lost in the renaming.

In [5]:
# Concatenate the Description categories of our interest
guinea_df_new = pd.concat([guinea_df[guinea_df.Description == 'Total new cases registered so far'],
                           guinea_df[guinea_df.Description  == 'New deaths registered'],
                           guinea_df[guinea_df.Description  == 'New deaths registered today']])

# Aggregate New deaths in only one category
guinea_df_new.Description = guinea_df_new.Description.replace('New deaths registered today', 'New deaths registered')

print('Shape guinea_df_new:', guinea_df_new.shape)

Shape guinea_df_new: (44, 3)
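
The selection and renaming above can also be written with a single mapping from every spelling to one label. The sketch below uses invented rows, and the canonical labels `new_cases` and `new_deaths` are hypothetical names chosen for illustration.

```python
import pandas as pd

# Toy rows mimicking the `Description` labels discussed above (values invented).
toy = pd.DataFrame({
    "Date": ["2014-08-04", "2014-08-04", "2014-09-02", "2014-09-02"],
    "Description": [
        "Total new cases registered so far",
        "New deaths registered today",
        "Total new cases registered so far",
        "New deaths registered",
    ],
    "Totals": [9, 2, 14, 5],
})

# Map every spelling found in the reports to one canonical label.
canonical = {
    "Total new cases registered so far": "new_cases",
    "New deaths registered today": "new_deaths",
    "New deaths registered": "new_deaths",
}

# Keep only the rows whose description is a known spelling, then relabel them.
selected = toy[toy["Description"].isin(list(canonical))].copy()
selected["Description"] = selected["Description"].map(canonical)
counts = selected["Description"].value_counts()
print(counts)
```

Keeping the spellings in one dictionary means a newly discovered variant only needs one extra entry.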


This confirms that there are 22 data points for each category (new cases and new deaths), matching the 22 Guinea reports. Now that the rows relevant to our analysis have been concatenated, we start the data cleaning:

In [6]:
# Fill nan values with a 0 if there is any value null
if guinea_df_new.Totals.isnull().any():
    guinea_df_new = guinea_df_new.fillna(value=0)

# Get the data type of each column
guinea_df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 3 to 8
Data columns (total 3 columns):
Date           44 non-null object
Description    44 non-null object
Totals         44 non-null object
dtypes: object(3)
memory usage: 1.4+ KB


The datatype of every column is `object`. Let's convert the `Date` column to datetime format and the `Totals` column to numeric (`int`). In addition, we add a `Month` column to make it easier to analyse the new cases and deaths per month.

In [7]:
# Change to date format the Date column
guinea_df_new.Date = guinea_df_new.Date.apply(lambda d: pd.to_datetime(d))
guinea_df_new['Month'] = [date.month for date in guinea_df_new.Date]
guinea_df_new = guinea_df_new[['Date','Month', 'Description', 'Totals']]

# Change to numeric the Totals
guinea_df_new['Totals'] = guinea_df_new['Totals'].apply(pd.to_numeric).astype(int)

guinea_df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 3 to 8
Data columns (total 4 columns):
Date           44 non-null datetime64[ns]
Month          44 non-null int64
Description    44 non-null object
Totals         44 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 1.7+ KB
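
The same conversions can be sketched with vectorised calls on a toy frame (invented values). Here `errors="coerce"` turns unparseable entries into `NaN` before they are filled with 0, which is an assumption of this sketch rather than what the cell above does.

```python
import pandas as pd

# Toy frame with string dates and string counts, one of them missing.
toy = pd.DataFrame({
    "Date": ["2014-08-04", "2014-09-30", "2014-10-01"],
    "Totals": ["9", "14", None],
})

# Parse the whole column at once instead of applying a lambda per element.
toy["Date"] = pd.to_datetime(toy["Date"])
# The .dt accessor exposes datetime components such as the month number.
toy["Month"] = toy["Date"].dt.month
# Convert to numbers, treat missing entries as 0, then cast to int.
toy["Totals"] = pd.to_numeric(toy["Totals"], errors="coerce").fillna(0).astype(int)
print(toy.dtypes)
```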


In order to compute the monthly daily average, we need to identify how many days there are per month.

In [9]:
print(guinea_df_new.Month.value_counts())
print('First date of August:',np.min(guinea_df_new.Date))
print('Last date of October:', np.max(guinea_df_new.Date))

9     32
8     10
10     2
Name: Month, dtype: int64
First date of August: 2014-08-04 00:00:00
Last date of October: 2014-10-01 00:00:00


We assume:
- The single report from October accounts for the new cases and deaths of September.
- The first report of August accounts for the new cases and deaths of the first three days of the month.

For this reason, we change the month of the October report to September (10 to 9):

In [45]:
guinea_df_new.Month = guinea_df_new.Month.replace(10, 9)

# New cases: monthly total divided by 31 (the same divisor is applied to both months)
new_cases = guinea_df_new[guinea_df_new.Description == 'Total new cases registered so far']
new_cases_grouped = new_cases.groupby('Month')[['Totals']].sum()/31
new_cases_grouped.rename(columns={'Totals': 'avg_new_cases'}, inplace=True)

# New deaths: monthly total divided by 30 (the same divisor is applied to both months)
new_deaths = guinea_df_new[guinea_df_new.Description == 'New deaths registered']
new_deaths_grouped = new_deaths.groupby('Month')[['Totals']].sum()/30
new_deaths_grouped.rename(columns={'Totals': 'avg_new_deaths'}, inplace=True)

# Concatenate
guinea_conc = pd.concat([new_cases_grouped, new_deaths_grouped], axis=1)
guinea_conc['Country'] = 'Guinea'

pd.set_option('display.precision', 2)
guinea_conc

Month,avg_new_cases,avg_new_deaths,Country
8,4.16,0.57,Guinea
9,11.23,2.4,Guinea
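
As a variation on the fixed divisors used above, the divisor can be taken from the calendar length of each month. The sketch below uses invented counts and assumes the reports cover each month in full, which is not true of partial months in the real data.

```python
import pandas as pd

# Invented daily counts; only the mechanics matter here.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-08-04", "2014-08-26", "2014-09-02", "2014-09-30"]),
    "Totals": [31, 31, 30, 60],
})
toy["Month"] = toy["Date"].dt.month

# Total count per month.
monthly = toy.groupby("Month")["Totals"].sum()

# Calendar length of each month present in the data (31 for August, 30 for September).
days_in_month = toy.groupby("Month")["Date"].first().dt.days_in_month

# Both Series are indexed by Month, so the division aligns month by month.
daily_avg = monthly / days_in_month
print(daily_avg)
```

Because both Series share the `Month` index, each month is divided by its own length rather than by one constant.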


## Task 1.2. Liberia 

We follow the same procedure as for Guinea, this time for Liberia:

In [12]:
# Import Liberia data in one file
data_folder ='Data/ebola/liberia_data/'
allFiles = glob.glob(data_folder + "/*.csv")
print('Number of csv files:',len(allFiles))
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
liberia_df = pd.concat(list_)

print('Shape liberia_df:',liberia_df.shape)
print('Columns:', liberia_df.columns)

Number of csv files: 100
Columns: Index(['Bomi County', 'Bong County', 'Date', 'Gbarpolu County', 'Grand Bassa',
       'Grand Cape Mount', 'Grand Gedeh', 'Grand Kru', 'Lofa County',
       'Margibi County', 'Maryland County', 'Montserrado County', 'National',
       'Nimba County', 'River Gee County', 'RiverCess County', 'Sinoe County',
       'Unnamed: 18', 'Variable'],
      dtype='object')


We identify the columns of interest: `Date`, `Variable` and `National`.

We keep the assumption that the `National` column correctly accounts for the new cases and deaths of all regions.

In [25]:
liberia_df = liberia_df[['Date','Variable', 'National']]
liberia_df.head(5)

,Date,Variable,National
0,6/16/2014,Specimens collected,1.0
1,6/16/2014,Specimens pending for testing,0.0
2,6/16/2014,Total specimens tested,28.0
3,6/16/2014,Newly reported deaths,2.0
4,6/16/2014,Total death/s in confirmed cases,8.0


In order to classify the new cases and new deaths, we need to know the different categories in `Variable`:

In [31]:
liberia_df.Variable.value_counts()

Cumulative deaths among HCW                                         101
Cumulative cases among HCW                                          101
Total death/s in probable cases                                     101
Total death/s in suspected cases                                    101
Total death/s in confirmed cases                                    101
Total confirmed cases                                               100
Contacts seen                                                       100
Newly Reported deaths in HCW                                        100
New case/s (confirmed)                                              100
New admissions                                                      100
Total contacts listed                                               100
New Case/s (Probable)                                               100
Total suspected cases                                               100
Total probable cases                                            

We are going to select the following:

- Newly reported deaths
- New case/s (confirmed)
- New Case/s (Suspected)
- New Case/s (Probable)

Let's concatenate these categories:

In [33]:
liberia_df_new = pd.concat([liberia_df[liberia_df.Variable == 'New Case/s (Probable)'],
                            liberia_df[liberia_df.Variable  == 'New Case/s (Suspected)'],
                            liberia_df[liberia_df.Variable  == 'New case/s (confirmed)'],
                            liberia_df[liberia_df.Variable  == 'Newly reported deaths']])

print('Shape liberia_df_new:', liberia_df_new.shape)

Shape liberia_df_new: (400, 3)


We check the shape of the concatenated `DataFrame` to verify that there are 100 entries per category, matching the 100 Ebola reports. We can now start the data cleaning and analysis:

In [41]:
# Fill nan values with a 0 if there is any value null
if liberia_df_new.National.isnull().any():
    liberia_df_new = liberia_df_new.fillna(value=0)

# Get the data type of each column
liberia_df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 24 to 3
Data columns (total 4 columns):
Date        400 non-null datetime64[ns]
Month       400 non-null int64
Variable    400 non-null object
National    400 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 15.6+ KB


As we did before, we modify the datatypes for each category and we create the `Month` column.

In [42]:
# Change to date format the Date column
liberia_df_new.Date = liberia_df_new.Date.apply(lambda d: pd.to_datetime(d))
liberia_df_new['Month'] = [date.month for date in liberia_df_new.Date] # Create a new column 'Month'
liberia_df_new = liberia_df_new[['Date','Month', 'Variable', 'National']]

# Convert the National column to numeric
liberia_df_new['National'] = liberia_df_new['National'].apply(pd.to_numeric).astype(int)

liberia_df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 24 to 3
Data columns (total 4 columns):
Date        400 non-null datetime64[ns]
Month       400 non-null int64
Variable    400 non-null object
National    400 non-null int64
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 15.6+ KB


In [59]:
print(liberia_df_new.Month.value_counts()/4)
print('First date of June:',np.min(liberia_df_new.Date))
print('Last date of December:', np.max(liberia_df_new.Date))

10    25.0
9     24.0
11    15.0
7     11.0
12     9.0
8      9.0
6      7.0
Name: Month, dtype: float64
First date of June: 2014-06-16 00:00:00
Last date of December: 2014-12-09 00:00:00


In this case, we analyse the new cases and new deaths separately so that the three categories meaning `new case` can be aggregated.

In addition, the numbers of new cases and deaths reported for December appear to be erroneous, so we drop the December values. Identifying the exact error is left as further analysis (apparently the cumulative value was entered in the new-cases column).
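
A quick way to surface this kind of anomaly is to flag "new" counts that are far larger than the typical value. The sketch below uses invented numbers, and the factor of 10 is an arbitrary rule of thumb rather than something taken from the reports.

```python
import pandas as pd

# Invented counts: the December rows look like cumulative totals.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-11-28", "2014-11-29", "2014-11-30",
                            "2014-12-04", "2014-12-05"]),
    "National": [12, 9, 15, 2867, 2869],
})

# Hypothetical rule of thumb: a "new" count above 10x the median is suspicious.
threshold = 10 * toy["National"].median()
suspicious = toy[toy["National"] > threshold]
print(suspicious)
```

Rows flagged this way would still need to be compared against the cumulative columns before deciding to drop or correct them.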


In [46]:
# New cases
new_cases_l = pd.concat([liberia_df_new[liberia_df_new.Variable == 'New Case/s (Probable)'],
                         liberia_df_new[liberia_df_new.Variable  == 'New Case/s (Suspected)'],
                         liberia_df_new[liberia_df_new.Variable  == 'New case/s (confirmed)']])

new_cases_l_grouped = new_cases_l.groupby('Month')[['National']].sum() # Aggregate per month
new_cases_l_grouped = new_cases_l_grouped.rename(columns={'National': 'avg_new_cases'})

# New deaths
new_deaths_l = liberia_df_new[liberia_df_new.Variable == 'Newly reported deaths']
new_deaths_l_grouped = new_deaths_l.groupby('Month')[['National']].sum()
new_deaths_l_grouped = new_deaths_l_grouped.rename(columns={'National': 'avg_new_deaths'})
liberia_conc = pd.concat([new_cases_l_grouped, new_deaths_l_grouped], axis=1)
liberia_conc.drop(12, inplace=True)

days_per_month = [15, 31, 31, 30, 31, 30]  
pd.set_option('display.precision', 2)
liberia_conc.avg_new_cases = liberia_conc.avg_new_cases/days_per_month
liberia_conc.avg_new_deaths = liberia_conc.avg_new_deaths/days_per_month

liberia_conc['Country'] = 'Liberia'
liberia_conc

Month,avg_new_cases,avg_new_deaths,Country
6,2.67,0.93,Liberia
7,3.03,1.52,Liberia
8,10.81,6.74,Liberia
9,51.07,28.83,Liberia
10,36.74,22.61,Liberia
11,13.23,6.73,Liberia


## Task 1.3. Sierra Leone


Again, we follow the same procedure for Sierra Leone.

In [86]:
# Import Sierra Leone data in one file
data_folder ='Data/ebola/sl_data/'
allFiles = glob.glob(data_folder + "/*.csv")
print('Number of csv files:',len(allFiles))
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    list_.append(df)
sl_df = pd.concat(list_)

print('Shape sl_df:',sl_df.shape)
print('Columns:', sl_df.columns)

Number of csv files: 103
Shape sl_df: (3262, 27)
Columns: Index(['34 Military Hospital', 'Bo', 'Bo EMC', 'Bombali', 'Bonthe',
       'Hastings-F/Town', 'Kailahun', 'Kambia', 'Kenema', 'Kenema (IFRC)',
       'Kenema (KGH)', 'Koinadugu', 'Kono', 'Moyamba', 'National',
       'Police training School', 'Police traning School', 'Port Loko',
       'Pujehun', 'Tonkolili', 'Unnamed: 18', 'Western area',
       'Western area combined', 'Western area rural', 'Western area urban',
       'date', 'variable'],
      dtype='object')


We identify the columns of interest: `date`, `variable` and `National`.

To identify the categories in `variable`, we count its values:

In [87]:
# Keep only the columns of interest: 'date', 'variable' and 'National'
sl_df = sl_df[['date','variable', 'National']]

sl_df.variable.value_counts()

etc_cum_admission         103
cum_probable              103
contacts_not_seen         103
cum_completed_contacts    103
new_suspected             103
etc_currently_admitted    103
contacts_followed         103
etc_new_deaths            103
death_suspected           103
population                103
etc_new_admission         103
etc_cum_discharges        103
etc_cum_deaths            103
cum_contacts              103
new_noncase               103
new_contacts              103
cum_suspected             103
new_probable              103
contacts_healthy          103
death_probable            103
cum_confirmed             103
new_completed_contacts    103
death_confirmed           103
etc_new_discharges        103
cfr                       103
cum_noncase               103
percent_seen              103
new_confirmed             103
contacts_ill              103
negative_corpse            35
positive_corpse            35
pending                    35
repeat_samples             34
total_lab_

We identify the following categories of interest in `variable`:
- new_confirmed
- new_probable
- new_suspected
- death_confirmed

We assume that a missing value in `National` means 0. Let's first analyse the new-case categories.

In [88]:
# New cases 
sl_df_new = pd.concat([sl_df[sl_df.variable == 'new_confirmed'],
                         sl_df[sl_df.variable  == 'new_probable'],
                         sl_df[sl_df.variable  == 'new_suspected']])

# Fill the missing values in National with 0
sl_df_new = sl_df_new.fillna(value=0)

sl_df_new.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 309 entries, 4 to 2
Data columns (total 3 columns):
date        309 non-null object
variable    309 non-null object
National    309 non-null object
dtypes: object(3)
memory usage: 9.7+ KB


Let's change the format of the columns:

In [89]:
# Change to date format the Date column
sl_df_new.date = sl_df_new.date.apply(lambda d: pd.to_datetime(d))
sl_df_new['Month'] = [date.month for date in sl_df_new.date] 
sl_df_new = sl_df_new[['date','Month', 'variable', 'National']]
sl_df_new.set_index('date', inplace=True)

# Convert the National column to numeric
sl_df_new['National'] = sl_df_new['National'].apply(pd.to_numeric).astype(int)

In [90]:
print(sl_df_new.Month.value_counts()/3)
print('First date of August:',np.min(sl_df_new.index))
print('Last date of December:', np.max(sl_df_new.index))

# Aggregate per month to get the total new cases per month
sl_df_new = sl_df_new.groupby('Month')[['National']].sum()
sl_df_new.rename(columns={'National': 'avg_new_cases'}, inplace=True)
sl_df_new

9     29.0
10    28.0
11    21.0
8     20.0
12     5.0
Name: Month, dtype: float64
First date of August: 2014-08-12 00:00:00
Last date of December: 2014-12-13 00:00:00


Month,avg_new_cases
8,503
9,1180
10,1986
11,1580
12,205


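Let's analyse the new deaths. We identify that the `death_confirmed` category is a cumulative value, so the new deaths are obtained by differencing it. Note that the `avg_new_cases` column above holds monthly totals that have not been divided by the number of days.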

In [91]:
days = [31-12, 30, 31, 31, 13]
sl_temp = sl_df[sl_df.variable == 'death_confirmed'] 

# Backfill missing values so that the cumulative death count can be differenced to obtain the new deaths
sl_temp = sl_temp.bfill()
sl_temp = sl_temp.fillna('1708')

# Convert the National column to numeric
sl_temp['National'] = sl_temp['National'].apply(pd.to_numeric).astype(int)

# Compute the new deaths as the difference between consecutive cumulative values
sl_temp.National = sl_temp.National.diff().fillna(0)


# Change to date format the Date column
sl_temp.date = sl_temp.date.apply(lambda d: pd.to_datetime(d))
sl_temp['Month'] = [date.month for date in sl_temp.date] # Create a new column 'Month'
sl_temp = sl_temp[['date','Month', 'variable', 'National']]


sl_temp.set_index('date', inplace=True)
sl_temp = sl_temp.groupby('Month')[['National']].sum()
sl_temp.National = sl_temp.National/days
sl_temp.rename(columns={'National':'avg_new_deaths'}, inplace=True)

sl_conc = pd.concat([sl_df_new, sl_temp], axis=1)
sl_conc['Country'] = 'Sierra Leone'
sl_conc

Month,avg_new_cases,avg_new_deaths,Country
8,503,6.47,Sierra Leone
9,1180,5.43,Sierra Leone
10,1986,16.77,Sierra Leone
11,1580,13.74,Sierra Leone
12,205,16.31,Sierra Leone
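
The step from a cumulative count to daily new deaths can be isolated on a toy series (invented values). As in the cell above, gaps are backfilled before differencing.

```python
import pandas as pd

# Invented cumulative death counts with one missing report.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-08-12", "2014-08-13", "2014-08-14", "2014-08-15"]),
    "National": [264.0, None, 273.0, 280.0],
})

# Backfill: a missing day takes the next reported cumulative value.
cumulative = toy["National"].bfill()

# Differencing consecutive cumulative values gives the new deaths per report;
# the first row has no predecessor, so its difference is set to 0.
new_deaths = cumulative.diff().fillna(0)
print(new_deaths.tolist())
```

Backfilling assigns the increase to the missing day itself; forward filling with `ffill()` would instead assign it to the day of the next report. The total is the same either way.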


## Task 1.4. Conclusion

In [92]:
ebola = pd.concat([guinea_conc, liberia_conc, sl_conc])
ebola = ebola.reset_index()
ebola.set_index(['Month', 'Country'])

Month,Country,avg_new_cases,avg_new_deaths
8,Guinea,4.16,0.57
9,Guinea,11.23,2.4
6,Liberia,2.67,0.93
7,Liberia,3.03,1.52
8,Liberia,10.81,6.74
9,Liberia,51.07,28.83
10,Liberia,36.74,22.61
11,Liberia,13.23,6.73
8,Sierra Leone,503.0,6.47
9,Sierra Leone,1180.0,5.43
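
An alternative way to build the combined index is to pass the per-country frames to `pd.concat` as a dictionary, so the country name becomes an index level directly. The frames below are invented stand-ins for `guinea_conc` and `liberia_conc`, without the `Country` column.

```python
import pandas as pd

# Invented per-country results, indexed by Month as in the cells above.
guinea = pd.DataFrame({"avg_new_cases": [4.0], "avg_new_deaths": [0.5]},
                      index=pd.Index([8], name="Month"))
liberia = pd.DataFrame({"avg_new_cases": [10.0, 50.0], "avg_new_deaths": [6.0, 28.0]},
                       index=pd.Index([8, 9], name="Month"))

# The dictionary keys become the outer index level, named "Country".
combined = pd.concat({"Guinea": guinea, "Liberia": liberia},
                     names=["Country", "Month"])

# Put Month first and sort, giving a unique (Month, Country) index.
combined = combined.swaplevel().sort_index()
print(combined)
```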


## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
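
The mechanics asked for here (combine several tables, attach metadata, unique index, `NaN` replaced by `unknown`) can be sketched on invented data. The column names `taxon`, `count`, `sample_id`, `group` and `tissue` are hypothetical and do not describe the real spreadsheets.

```python
import pandas as pd

# Two invented sample tables standing in for the spreadsheets.
samples = {
    "S1": pd.DataFrame({"taxon": ["A", "B"], "count": [7, 2]}),
    "S2": pd.DataFrame({"taxon": ["A", "C"], "count": [1, 5]}),
}
# Invented metadata table, one row per sample, with one missing value.
metadata = pd.DataFrame({"sample_id": ["S1", "S2"],
                         "group": ["control", "treated"],
                         "tissue": ["stool", None]})

# Stack the tables; the dictionary key records which table each row came from.
combined = pd.concat(samples, names=["sample_id", "row"]).reset_index(level="sample_id")

# Attach the metadata columns; merging on a column yields a fresh 0..n-1 index.
combined = combined.merge(metadata, on="sample_id", how="left")

# Replace every remaining missing value with the tag 'unknown'.
combined = combined.fillna("unknown")
print(combined)
```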

In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.
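
Questions 2 and 6 both rely on binning ages. A minimal sketch on invented ages is given below: `pd.cut` for fixed decade intervals and `pd.qcut` for two equally populated groups. The labels `younger` and `older` are hypothetical.

```python
import pandas as pd

# Invented ages, not taken from the Titanic file.
ages = pd.Series([2, 15, 22, 29, 35, 41, 58, 63], name="age")

# Decade bins [0, 10), [10, 20), ...; right=False makes each bin include its lower edge.
decade = pd.cut(ages, bins=list(range(0, 81, 10)), right=False)

# Two groups split at the median, so each holds half of the values.
age_group = pd.qcut(ages, q=2, labels=["younger", "older"])

print(decade.value_counts().sort_index())
print(age_group.value_counts())
```

`pd.cut` keeps the bin edges fixed regardless of the data, while `pd.qcut` chooses the edges from the data so that the groups have equal size.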

In [None]:
# Write your answer here