# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [None]:
DATA_FOLDER = '' # Use the data folder provided in Tutorial 02 - Intro to Pandas.

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average per month* of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.
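
One possible shape for this computation is sketched below on toy data. The column names (`Date`, `Description`, `Total`) and the description labels are hypothetical stand-ins, not the ones used in the actual report files, which would have to be inspected first. The idea is to map every spelling of a measure onto one canonical label before grouping.

```python
import pandas as pd

# Toy stand-ins for the per-country reports; in practice each frame would come
# from pd.read_csv over the files of one country.
# All column names and description labels below are hypothetical.
guinea = pd.DataFrame({
    "Date": ["2014-08-04", "2014-08-05", "2014-09-01"],
    "Description": ["New cases", "New cases", "New cases"],
    "Total": [4, 6, 10],
})
liberia = pd.DataFrame({
    "Date": ["2014-08-04", "2014-08-06"],
    "Description": ["Newly reported cases", "Newly reported cases"],
    "Total": [2, 8],
})

# Map every spelling used in the reports onto one canonical label.
CANONICAL = {"New cases": "new_cases", "Newly reported cases": "new_cases"}

# Stack the reports, tagging each row with its country.
reports = pd.concat({"Guinea": guinea, "Liberia": liberia}, names=["Country"])
reports = reports.reset_index(level="Country").reset_index(drop=True)
reports["Date"] = pd.to_datetime(reports["Date"])
reports["Measure"] = reports["Description"].map(CANONICAL)
reports["Month"] = reports["Date"].dt.month

# Daily average per month = mean of the daily totals within each month.
daily_avg = reports.groupby(["Country", "Month", "Measure"])["Total"].mean()
print(daily_avg)
```

Grouping on a canonical `Measure` column keeps the country-specific wording out of the aggregation step, so a new spelling only requires a new dictionary entry.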

In [None]:
# Write your answer here

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.
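
The steps above could be sketched as follows, using two small made-up tables in place of the real spreadsheets. The sheet identifiers, taxon names, and metadata fields (`barcode`, `group`, `sample`) are assumptions for illustration only; the real files would be loaded with `pd.read_excel`.

```python
import pandas as pd

# Toy stand-ins for two of the spreadsheets and for the metadata file.
# Taxon names, counts, and metadata fields are made up.
sheets = {
    "MID1": pd.DataFrame({"taxon": ["Bacteria A", "Bacteria B"], "count": [7, 2]}),
    "MID2": pd.DataFrame({"taxon": ["Bacteria A"], "count": [5]}),
}
metadata = pd.DataFrame({
    "barcode": ["MID1", "MID2"],
    "group": ["control", "treated"],
    "sample": [None, "tissue"],
})

# Tag each row with the sheet it came from, then stack all sheets.
combined = pd.concat(sheets, names=["barcode"]).reset_index(level="barcode")

# Attach the metadata columns by matching on the sheet identifier.
combined = combined.merge(metadata, on="barcode", how="left")

# Replace missing values and use (barcode, taxon) as a unique index.
combined = combined.fillna("unknown").set_index(["barcode", "taxon"])
print(combined)
```

A taxon can appear in several sheets, so the taxon name alone is not unique; pairing it with the sheet identifier is one way to obtain a unique index.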

In [None]:
# Write your answer here

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.display import HTML

DATA_FOLDER = 'Data'
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

In [None]:
import pandas as pd
import string
import math

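#Load and clean up the data
frame = pd.read_excel(DATA_FOLDER+'/titanic.xls')
#Keep only the numeric part of each ticket: strip surrounding letters and
#punctuation, then take the last space-separated token
frame.ticket = frame.ticket.map(lambda tickets: str(tickets).strip('{}{}'.format(string.ascii_letters, string.punctuation)).split(' ')[-1])
#Tickets left empty after stripping become 0
frame.ticket = frame.ticket.map(lambda tickets: 0 if len(tickets)==0 else int(tickets))
frame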

In [None]:
# 1. Describe the *type* and the *value range* of each attribute.

print("\npclass \n range: {}\n type: {}".format(frame.pclass.unique(), frame.pclass.dtype))
print("\nsurvived \n range: {}\n type: {}".format(frame.survived.unique(), frame.survived.dtype))
print("\nsex \n range: {}\n type: {}".format(frame.sex.unique(), frame.sex.dtype))
print("\nembarked \n range: {}\n type: {}".format(frame.embarked.unique(), frame.embarked.dtype))
print("\nparch \n range: {}-{}\n type: {}".format(frame.parch.min(), frame.parch.max(), frame.parch.dtype))
print("\nsibsp \n range: {}-{}\n type: {}".format(frame.sibsp.min(), frame.sibsp.max(), frame.sibsp.dtype))
print("\nage \n range: {}-{}\n type: {}".format(frame.age.min(), frame.age.max(), frame.age.dtype))
print("\nticket \n range: {}-{}\n type: {}".format(frame.ticket.min(), frame.ticket.max(), frame.ticket.dtype))
print("\nfare \n range: {}-{}\n type: {}".format(frame.fare.min(), frame.fare.max(), frame.fare.dtype))
print("\nbody \n range: {}-{}\n type: {}".format(frame.body.min(), frame.body.max(), frame.body.dtype))
print("\nboat \n range: {}\n type: {}".format(frame.boat.unique(), frame.boat.dtype))
print("\ncabin \n range: {}\n type: {}".format(frame.cabin.unique(), frame.cabin.dtype))
print("\nname \n range: {}\n type: {}".format(frame.name.unique(), frame.name.dtype))
print("\nhome.dest \n range: {}\n type: {}".format(frame['home.dest'].unique(), frame['home.dest'].dtype))

<b>pclass</b>, <b>survived</b> and <b>sex</b> can be treated as categorical: each takes only a few distinct values, which divide the dataset into reasonably small subsets.
<b>embarked</b> can also be considered categorical if we decide to eliminate the samples that have a 'nan' value in this column.
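
As a side note, a quick check on a made-up Series (not the actual `embarked` column) suggests that pandas does not turn missing values into a category of their own during the conversion, so the rows with 'nan' may not need to be dropped first.

```python
import numpy as np
import pandas as pd

# Made-up port codes with one missing value.
ports = pd.Series(["S", "C", np.nan, "Q", "S"])
as_cat = ports.astype("category")

# NaN stays a missing value; it is not registered as a category.
print(list(as_cat.cat.categories))
print(as_cat.isna().sum())
print(as_cat.value_counts())
```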

In [None]:
#Transform the attributes that can be `Categorical`. 
frame['pclass'] = frame.pclass.astype('category')
frame['survived'] = frame.survived.astype('category')
frame['sex'] = frame.sex.astype('category')
frame['embarked'] = frame.embarked.astype('category')

In [None]:
#Drop columns that won't be used further
clean_frame = frame.drop(['sibsp', 'parch', 'ticket', 'fare', 'body', 'home.dest', 'boat', 'name'], axis=1)

In [None]:
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

#Plot histograms for the *travel class*, *embarkation port*,
#*sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 

clean_frame['sex'].value_counts().plot(kind='bar', color=['r','g'], title='Grouped by sex')
plt.show()

clean_frame['pclass'].value_counts().plot(kind='bar', color=['r','g','b'], title='Grouped by travel class')
plt.show()

clean_frame['embarked'].value_counts().plot(kind='bar', color=['r','g','b'], title='Grouped by embarkation port')
plt.show()

#Sort by interval so the bars appear in age order rather than by count
pd.cut(clean_frame.age, [0,10,20,30,40,50,60,70,80]).value_counts().sort_index().plot(kind='bar', title='Grouped by decade')
plt.show()

In [None]:
for c in clean_frame.cabin.values:
    if len(str(c).split(' '))>1:
        print(c) 

According to [this](https://www.encyclopedia-titanica.org/titanic-deckplans/g-deck.html) map, we assume that the starting letter of each value represents the floor on which the person stayed. For cases such as 'F G63', we assume that the person stayed on floor F, in part G, cabin 63. When a person has more than one cabin value, we noticed that these cabins are always on the same floor (shown above).
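
The same-floor assumption can also be checked programmatically instead of by reading the printed list. The sketch below uses made-up cabin strings in the dataset's format; `floors_of` is a hypothetical helper that encodes the reading of 'F G63' described above.

```python
import pandas as pd

# Made-up cabin values in the same format as the dataset's `cabin` column.
cabins = pd.Series(["C85", "F G63", "B57 B59 B63", None, "E12"])

def floors_of(cabin):
    """Return the set of floor letters for one cabin entry.

    A leading single-letter token (the 'F' in 'F G63') is read as the floor
    itself, so the tokens after it are treated as part plus cabin number.
    """
    tokens = cabin.split()
    if len(tokens[0]) == 1:
        return {tokens[0]}
    return {token[0] for token in tokens}

known = cabins.dropna()
# Entries whose cabins span more than one floor (expected to be empty).
mixed = known[known.map(lambda c: len(floors_of(c)) > 1)]
# Under the assumption above, the floor is the first character.
floor = known.str[0]
print(mixed.tolist())
print(floor.tolist())
```

If `mixed` comes back empty on the real column, taking the first character of each value is a safe way to derive the floor.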

In [None]:
#Calculate the proportion of passengers by cabin floor. Present your results in a pie chart

#Add new column named floor
clean_frame['floor'] = clean_frame.cabin.dropna().map(lambda f: f[0])
#Make it categorical 
clean_frame['floor'] = clean_frame.floor.astype('category')
#Show the piechart
clean_frame['floor'].value_counts().plot(kind='pie', title='Grouped by cabin floor')
plt.show()

In [None]:
#For each travel class, calculate the proportion of the passengers that survived.
#Present your results in pie charts.
clean_frame[clean_frame.survived==1]['pclass'].value_counts().plot(kind='pie', title='Survived proportions from each class')
plt.show()
#One slice per (class, survived) combination
clean_frame.groupby(['pclass', 'survived']).size().plot(kind='pie', title='Survived(1) vs. Dead(0) all together')
plt.show()
clean_frame[clean_frame.pclass==1]['survived'].value_counts().plot(kind='pie', title='Survived(1) from class 1')
plt.show()
clean_frame[clean_frame.pclass==2]['survived'].value_counts().plot(kind='pie', title='Survived(1) from class 2')
plt.show()
clean_frame[clean_frame.pclass==3]['survived'].value_counts().plot(kind='pie', title='Survived(1) from class 3')
plt.show()

In [None]:
#Calculate the proportion of the passengers that survived by travel class and sex.
#Present your results in a single histogram.
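#survived is categorical, so cast it back to int: the mean of a 0/1 column is the survival proportion
survival = clean_frame.assign(survived=clean_frame.survived.astype(int))
grouped = survival.groupby(['pclass', 'sex'])

grouped.survived.mean().plot(kind='bar', color=['r','g','b','m','y', 'c'], title='Proportion of survivors by travel class and sex')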
plt.show()

In [None]:
#Create 2 equally populated age categories and calculate survival proportions by age category, travel class and sex. 
by_ages = pd.qcut(clean_frame.age, 2)
clean_frame['by_ages'] = by_ages

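#Cast survived back to int so that the group mean is the survival proportion
survival = clean_frame.assign(survived=clean_frame.survived.astype(int))
grouped = survival.groupby(['pclass', 'sex', 'by_ages'])
grouped.survived.mean().plot(kind='bar', color=['r','g','b','m','y', 'c'], title='Proportion of survivors by travel class, sex and age category')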
plt.show()

In [None]:
#Present your results in a DataFrame with unique index.
clean_frame['id'] = range(len(clean_frame))
clean_frame = clean_frame.set_index(['id'])
clean_frame
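
For the last question, the survival proportions themselves can also be collected in a `DataFrame` with a unique index. The following is a minimal sketch on made-up passengers, not the real data: the column names mirror the Titanic file, while the `younger`/`older` labels and the `survival_proportion` column name are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up passengers; the column names mirror the Titanic data set.
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male"] * 2,
    "age":      [22, 48, 19, 60, 5, 35, 28, 41],
    "survived": [1, 1, 0, 1, 1, 0, 0, 0],
})

# Two equally populated age categories, split at the median age.
toy["age_category"] = pd.qcut(toy["age"], 2, labels=["younger", "older"])

# The mean of a 0/1 column is the proportion of survivors in each group.
table = (
    toy.groupby(["age_category", "pclass", "sex"], observed=True)["survived"]
       .mean()
       .rename("survival_proportion")
       .reset_index()
)
print(table)
```

Calling `reset_index()` turns the three grouping keys into ordinary columns and leaves a plain integer index, which is unique by construction.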