# Table of Contents
 <p><div class="lev1"><a href="#Task-1.-Compiling-Ebola-Data"><span class="toc-item-num">Task 1.&nbsp;&nbsp;</span>Compiling Ebola Data</a></div>
 <div class="lev1"><a href="#Task-2.-RNA-Sequences"><span class="toc-item-num">Task 2.&nbsp;&nbsp;</span>RNA Sequences</a></div>
 <div class="lev1"><a href="#Task-3.-Class-War-in-Titanic"><span class="toc-item-num">Task 3.&nbsp;&nbsp;</span>Class War in Titanic</a></div></p>

In [None]:
from os.path import join, isfile
from os import listdir
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

%pylab inline

sns.set_palette('hls')
sns.set_context("notebook")

DATA_FOLDER = join('..', '..', 'ADA2017-Tutorials', '02 - Intro to Pandas', 'Data')
DATA_EBOLA = join(DATA_FOLDER, 'ebola')
DATA_TITANIC = DATA_FOLDER

## Task 1. Compiling Ebola Data

The `DATA_FOLDER/ebola` folder contains summarized reports of Ebola cases from three countries (Guinea, Liberia and Sierra Leone) during the recent outbreak of the disease in West Africa. For each country, there are daily reports that contain various information about the outbreak in several cities in each country.

Use pandas to import these data files into a single `DataFrame`.
Using this `DataFrame`, calculate for *each country*, the *daily average* per year of *new cases* and *deaths*.
Make sure you handle all the different expressions for *new cases* and *deaths* that are used in the reports.

### 1.1 Look at the data 

The reporting format differs between countries, so we need a separate parser for each one. We are interested in the **daily average** of new **cases** and **deaths** per year. The dataset contains many fields that are not needed for this task, and we will drop them.
Note that:

* We assume that deaths among health workers are part of the total new deaths/cases (i.e. total number of deaths = patient deaths + health worker deaths).
* We assume that a missing value means that no change was recorded.

### 1.2 Guinea
Let's first read all CSV files and concatenate the data. We can directly parse `Date` as a date. We fill the missing values (NaN) with 0, since a missing value probably means that no change was reported. The column `Totals` contains the total value for each `Description` (sum over all cities). After parsing, we check for duplicates, i.e. multiple entries for the same (`Date`, `Description`) tuple.

In [None]:
# Read all files and concatenate them
guinea_path = join(DATA_EBOLA, 'guinea_data')
guinea_files = [join(guinea_path, f) for f in listdir(guinea_path) if isfile(join(guinea_path, f))]

r=[]
for i in range(len(guinea_files)):
    r.append(pd.read_csv(guinea_files[i], usecols=['Description', 'Totals', 'Date'], 
                         parse_dates=['Date']).fillna(0))
    
r = pd.concat(r)
print('Contains duplicates:', any(r.duplicated(subset=['Date', 'Description'])) )

Since no duplicates were found, we can pivot the table, using `Date` as the index and one column per `Description` holding the corresponding `Totals` value.

In [None]:
r = r.pivot_table(index='Date', columns='Description', values='Totals', aggfunc='max').fillna(0)
r.head()
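
To make the reshaping step concrete, here is a minimal sketch on hypothetical toy values (not taken from the reports). It shows how `pivot_table` turns the long (`Date`, `Description`, `Totals`) layout into one row per date and one column per description.

```python
import pandas as pd

# Toy long-format report; the numbers are made up for illustration
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2014-08-04', '2014-08-04', '2014-08-05', '2014-08-05']),
    'Description': ['New cases of confirmed', 'New deaths registered',
                    'New cases of confirmed', 'New deaths registered'],
    'Totals': [4, 2, 7, 1],
})

# One row per date, one column per description
wide = toy.pivot_table(index='Date', columns='Description', values='Totals', aggfunc='max')
print(wide)
```

Each distinct `Description` becomes a column, so any field can then be selected by name.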

As said above, most of the columns are not needed in our case. Let's display the column names to choose the ones that contain meaningful values.

In [None]:
r.columns

All fields containing the overall (total) cases/deaths can be dropped, as we are interested in newly declared cases/deaths. We will only keep `New cases of confirmed`, `New cases of probables`, `New cases of suspects`, `New deaths registered`, `New deaths registered today (confirmed)`, `New deaths registered today (probables)` and `New deaths registered today (suspects)`, since they are the most likely to contain meaningful information for our task.

In [None]:
r[['New cases of confirmed', 'New cases of probables', 'New cases of suspects', 'New deaths registered', 
   'New deaths registered today (confirmed)', 'New deaths registered today (probables)', 
   'New deaths registered today (suspects)']].head()

Note that some of the fields were not properly parsed (their type is object instead of int). We therefore apply the **to_numeric** function to cast them to numbers, which allows us to do arithmetic on them.

We create new fields that will be used to merge the data of all countries: `n_case` contains the new confirmed cases, `n_case_un` the probable/suspected cases, `n_death` the newly registered deaths and `n_death_un` the probable/suspected deaths.

In [None]:
r['n_case'] = pd.to_numeric(r['New cases of confirmed'])
r['n_case_un'] = pd.to_numeric(r['New cases of probables']) + pd.to_numeric(r['New cases of suspects'])
r['n_death'] = pd.to_numeric(r['New deaths registered']) + pd.to_numeric(r['New deaths registered today (confirmed)'])
r['n_death_un'] = pd.to_numeric(r['New deaths registered today (probables)']) + pd.to_numeric(r['New deaths registered today (suspects)'])
r['country'] = ['guinea']*len(r['n_case'])
guinea_res = r[['country', 'n_case', 'n_case_un', 'n_death', 'n_death_un']]
guinea_res.head()
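
The reason for casting before adding can be shown with a small sketch on hypothetical values: when two columns hold strings (dtype object), `+` concatenates them instead of summing them.

```python
import pandas as pd

# Hypothetical columns that were read as strings (dtype object)
toy = pd.DataFrame({'probable': ['3', '0'], 'suspected': ['12', '5']})

# Adding the raw columns concatenates the strings: '312', '05'
concatenated = toy['probable'] + toy['suspected']

# Converting each column first gives the arithmetic sum: 15, 5
summed = pd.to_numeric(toy['probable']) + pd.to_numeric(toy['suspected'])

print(concatenated.tolist(), summed.tolist())
```

Converting each column separately, before the addition, avoids this pitfall.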

### 1.3 Liberia
The procedure is almost the same as for the Guinea data. We directly parse `Date` as a date. We fill the missing values with 0, since a missing value probably means that no change was reported. The column `National` contains the total value for each `Variable` (sum over all cities). After parsing, we check for duplicates, i.e. multiple entries for the same (`Date`, `Variable`) tuple.

In [None]:
# Read all files and concatenate them
liberia_path = join(DATA_EBOLA, 'liberia_data')
liberia_files = [join(liberia_path, f) for f in listdir(liberia_path) if isfile(join(liberia_path, f))]

r_l=[]
for i in range(len(liberia_files)): 
    r_l.append(pd.read_csv(liberia_files[i], usecols=['Date', 'Variable', 'National'], 
                         parse_dates=['Date']).fillna(0))
    
r_l = pd.concat(r_l)
print('Contains duplicates:', any(r_l.duplicated(subset=['Date', 'Variable'])))

The data contain duplicates. We need to handle them. Let's take a look at the duplicates.

In [None]:
r_l[r_l.duplicated(subset=['Date', 'Variable'])]

We can see that the duplicated fields are not very relevant for our task. We can therefore either drop them or merge them. We chose to merge them with the `max` function (for each duplicated variable we keep the entry with the highest value) to avoid losing data. Then we can pivot the table as we did for the Guinea data.

In [None]:
r_l = r_l.pivot_table(index='Date', columns='Variable', values='National', aggfunc='max').fillna(0)
r_l.head()
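
As a minimal sketch of this merge, assuming hypothetical toy values: two rows share the same (`Date`, `Variable`) pair, and pivoting with `max` keeps the higher of the two.

```python
import pandas as pd

# Toy data with one duplicated (Date, Variable) pair; values are made up
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2014-10-01', '2014-10-01', '2014-10-02']),
    'Variable': ['Newly reported deaths'] * 3,
    'National': [3, 5, 2],
})

# Count the rows that repeat an earlier (Date, Variable) pair
n_dupes = int(toy.duplicated(subset=['Date', 'Variable']).sum())

# Pivot and resolve duplicates by keeping the maximum value
merged = toy.pivot_table(index='Date', columns='Variable', values='National', aggfunc='max')
print(n_dupes)
print(merged)
```

Other aggregations (for example `sum` or `first`) would also resolve the duplicates; `max` is simply the choice made here.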

Most of the columns are not needed in our case. Let's display the column names to choose the ones that contain meaningful values.

In [None]:
r_l.columns

As before, all fields containing the overall (total) cases/deaths will be dropped. We will only keep `New Case/s (Probable)`, `New Case/s (Suspected)`, `New case/s (confirmed)` and `Newly reported deaths`.

In [None]:
r_l[['New Case/s (Probable)', 'New Case/s (Suspected)', 'New case/s (confirmed)', 'Newly reported deaths']].head()

We create the same fields as for Guinea to match the data schema that we chose.

In [None]:
r_l['n_case'] = r_l['New case/s (confirmed)']
r_l['n_case_un'] = r_l['New Case/s (Suspected)'] + r_l['New Case/s (Probable)']
r_l['n_death'] = r_l['Newly reported deaths']
r_l['country'] = ['liberia']*len(r_l['n_case'])
liberia_res = r_l[['country', 'n_case', 'n_case_un', 'n_death']]
liberia_res.head()

### 1.4 Sierra Leone
The logic is the same as before. We directly parse `date` as a date. We fill the missing values with 0, since a missing value probably means that no change was reported. The column `National` contains the total value for each `variable` (sum over all cities). After parsing, we check for duplicates, i.e. multiple entries for the same (`date`, `variable`) tuple.

In [None]:
# Read all files and concatenate them
sl_path = join(DATA_EBOLA, 'sl_data')
sl_files = [join(sl_path, f) for f in listdir(sl_path) if isfile(join(sl_path, f))]

r_sl=[]
for i in range(len(sl_files)): 
    r_sl.append(pd.read_csv(sl_files[i], usecols=['date', 'variable', 'National'], 
                         parse_dates=['date']).fillna(0))
    
r_sl = pd.concat(r_sl)
print('Contains duplicates:', any(r_sl.duplicated(subset=['date', 'variable'])))

The data contain duplicates, so let's look at the duplicates

In [None]:
r_sl[r_sl.duplicated(subset=['date', 'variable'])]

As before, the duplicated fields are not relevant to our task. We can therefore either drop them or merge them, as we did for the previous data. We again choose to merge them with the `max` function to avoid data loss. Then we pivot the table with the same method as before.

In [None]:
r_sl = r_sl.pivot_table(index='date', columns='variable', values='National', aggfunc='max').fillna(0)
r_sl.head()

Most of the columns are not needed in our case. Let's display the column names to choose the ones that contain meaningful values.

In [None]:
r_sl.columns

All fields containing the overall (cumulative) number of cases/deaths will be dropped. We will only keep `new_confirmed`, `new_probable`, `new_suspected` and `death_confirmed`, since they are the most likely to contain the information we want.

Note that `death_confirmed` might contain the cumulative number of deaths. Let's take a look at it.

In [None]:
r_sl[['new_confirmed', 'new_probable', 'new_suspected', 'death_confirmed']].head(16)

Indeed, `death_confirmed` contains the cumulative number of deaths. Moreover, some rows are filled with 0, which we assume means that the data were missing. Let's take a closer look at the rows with missing entries.

In [None]:
r_sl.loc[r_sl['death_confirmed']==0, ['new_confirmed', 'new_probable', 'new_suspected', 'death_confirmed']]

Since all fields are empty, we can drop those entries. Afterwards, we can estimate the number of new deaths as the difference in total registered deaths between two consecutive reports. Note that for day 1 (the first entry in the table) we cannot estimate the number of new deaths. We therefore take it as our starting point and set its value to 0.

In [None]:
r_sl.drop(r_sl.loc[r_sl['death_confirmed']==0].index, inplace=True)
r_sl['new_death'] = pd.to_numeric(r_sl['death_confirmed']).diff().fillna(0)
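
The differencing step can be sketched on a hypothetical cumulative series (the numbers are made up): `diff()` subtracts the previous entry from each entry, and the first entry, which has no predecessor, is set to 0.

```python
import pandas as pd

# Hypothetical cumulative death counts on four consecutive days
cumulative = pd.Series(
    [10, 12, 12, 17],
    index=pd.to_datetime(['2014-08-12', '2014-08-13', '2014-08-14', '2014-08-15']),
)

# New deaths per day = difference with the previous day; first day set to 0
new_per_day = cumulative.diff().fillna(0)
print(new_per_day)
```

A negative value in such a series means that the cumulative count went down, which points to a reporting error.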

Finally, we create the new fields to match the data schema that we chose.

In [None]:
r_sl['n_case'] = pd.to_numeric(r_sl['new_confirmed'])
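# Convert each column before adding: adding object (string) columns would concatenate them
r_sl['n_case_un'] = pd.to_numeric(r_sl['new_probable']) + pd.to_numeric(r_sl['new_suspected'])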
r_sl['n_death'] = pd.to_numeric(r_sl['new_death'])
r_sl['country'] = ['sl']*len(r_sl['n_case'])
sl_res = r_sl[['country', 'n_case', 'n_case_un', 'n_death']]
sl_res.head()

### 1.5 Results


Now that all data have the same structure we can concatenate them

In [None]:
r = pd.concat([guinea_res, liberia_res, sl_res]).fillna(0)

We can also look at the evolution of the number of cases and deaths for each country.

In [None]:
fig, (ax, ax2) = plt.subplots(1, 2, figsize=(16,6))
for label, df in r.groupby('country'):
    df.plot(y='n_death', ax=ax, label=label)
for label, df in r.groupby('country'):
    df.plot(y='n_case', ax=ax2, label=label)
ax.grid(); ax2.grid()
ax.set_xlabel('Date'); ax2.set_xlabel('Date')
ax.set_ylabel('# Daily deaths'); ax2.set_ylabel('# Daily cases')
ax.set_title('Countries - Daily registered deaths'); ax2.set_title('Countries - Daily registered cases')
plt.legend()

### 1.5.1. Negative number of deaths
Note that for Sierra Leone (sl), the number of registered deaths is negative around early October. This is, of course, not because people came back to life. Below is a more detailed view of the problem.

In [None]:
r_sl.iloc[40:50][['death_confirmed', 'new_death']]

The problem appears between 2014-09-30 and 2014-10-01, where the cumulative number of deaths decreases. It looks like a typo (550 typed instead of 530) or a wrong estimate of the number of deaths in a specific city. In both cases we cannot determine the real value of this field, so we drop 2014-09-30 (no change) and set 2014-10-01 to **5** (532-527) to be consistent with the data.

In [None]:
r = r.loc[np.logical_or(r['country']!='sl', r.index != '2014-09-30')]
r.loc[r['n_death'] < 0, 'n_death'] = 5
r[r['country']=='sl'].loc['2014-09-28':'2014-10-04']

### 1.5.2. Liberia sudden cases peak
We can also see a huge peak at the end of the Liberia series. It could be a sudden increase in registered cases, but it is not matched by any change in the number of deaths.

In [None]:
r_l.loc['2014-12-01':'2014-12-09'][['Total probable cases', 'New Case/s (Probable)',  
                                    'Total suspected cases', 'New Case/s (Suspected)', 
                                    'Total confirmed cases', 'New case/s (confirmed)', 'Newly reported deaths']]

As we can see, there seems to have been an error when the data were reported, as if the values had been entered in the wrong columns, shifted to the right, starting with the 2014-12-04 entry. Looking closely, we also see that the totals are lower than those reported before, which should not happen since they are cumulative and can only grow. For those reasons, we decided to drop those last six entries.

In [None]:
# Use a boolean mask: dates repeat across countries, so dropping by index label would also remove other countries' rows
r = r.loc[~np.logical_and(r.index >= '2014-12-04', r['country']=='liberia')]
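
Because the combined frame is indexed by date and the same dates appear for several countries, removing rows by index label and removing them by boolean mask do not behave the same way. A minimal sketch on hypothetical toy data (the values and the name `toy` are assumptions for illustration):

```python
import pandas as pd

# Two countries sharing the same dates in the index; values are made up
idx = pd.to_datetime(['2014-12-03', '2014-12-04', '2014-12-03', '2014-12-04'])
toy = pd.DataFrame({'country': ['liberia', 'liberia', 'guinea', 'guinea'],
                    'n_case': [1, 900, 2, 3]}, index=idx)

# Rows to remove: Liberia entries from 2014-12-04 onwards
to_remove = (toy.index >= '2014-12-04') & (toy['country'] == 'liberia').to_numpy()

# Boolean mask: only the targeted Liberia row disappears
kept = toy.loc[~to_remove]

# Dropping by label removes every row with that date, Guinea included
dropped_by_label = toy.drop(toy.loc[to_remove].index)

print(len(kept), len(dropped_by_label))
```

With a non-unique index, the boolean mask keeps the rows of the other countries for the same dates.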

### 1.5.3. Final plot and results (cleaned)

In [None]:
fig, (ax, ax2) = plt.subplots(1, 2, figsize=(16,6))
for label, df in r.groupby('country'):
    df.plot(y='n_death', ax=ax, label=label)
for label, df in r.groupby('country'):
    df.plot(y='n_case', ax=ax2, label=label)
ax.grid(); ax2.grid()
ax.set_xlabel('Date'); ax2.set_xlabel('Date')
ax.set_ylabel('# Daily deaths'); ax2.set_ylabel('# Daily cases')
ax.set_title('Countries - Daily registered deaths'); ax2.set_title('Countries - Daily registered cases')
plt.legend()


We can finally compute the average number of deaths and new cases. We add to the table the columns `n_case_tot` and `n_death_tot`, which also take the probable cases/deaths into account. Each column is expressed as the **daily average** of cases/deaths **per year**.

We can see in the next table that the average number of deaths per day (`n_death`) is higher in Liberia and Sierra Leone. If we include the probable cases (`n_case_tot`), Sierra Leone has the most Ebola cases.

In [None]:
r['n_case_tot'] = r['n_case'] +  r['n_case_un']
r['n_death_tot'] = r['n_death'] +  r['n_death_un']

COUNTRIES = ['guinea', 'liberia', 'sl']
ds = [(r[r['country']==COUNTRY].index[-1]-r[r['country']==COUNTRY].index[0]).days for COUNTRY in COUNTRIES]
print('Days spans: {}'.format(ds))
r.groupby('country').sum().divide(ds, axis=0)
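
The averaging used above divides the sum of a column by the number of days between the first and the last report. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical reports for one country on three non-consecutive days
idx = pd.to_datetime(['2014-08-01', '2014-08-03', '2014-08-05'])
toy = pd.DataFrame({'n_case': [4, 6, 10]}, index=idx)

# Number of days between the first and the last report
span_days = (toy.index[-1] - toy.index[0]).days

# Daily average over the whole span, including days without a report
daily_avg = toy['n_case'].sum() / span_days

print(span_days, daily_avg)
```

Dividing by the span rather than by the number of reports treats days without a report as days with no new cases, in line with the assumption made at the start of this task.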

## Task 2. RNA Sequences

In the `DATA_FOLDER/microbiome` subdirectory, there are 9 spreadsheets of microbiome data that was acquired from high-throughput RNA sequencing procedures, along with a 10<sup>th</sup> file that describes the content of each. 

Use pandas to import the first 9 spreadsheets into a single `DataFrame`.
Then, add the metadata information from the 10<sup>th</sup> spreadsheet as columns in the combined `DataFrame`.
Make sure that the final `DataFrame` has a unique index and all the `NaN` values have been replaced by the tag `unknown`.

In [None]:
microbiome_path = join(DATA_FOLDER, 'microbiome')

# Read the first 9 spreadsheets and concatenate them
microbiome_files = [join(microbiome_path, f) for f in listdir(microbiome_path) if isfile(join(microbiome_path, f))]
microbiome_files.sort()

# The metadata file sorts last: remove it from the list and keep its path
metadata = microbiome_files.pop()

In [None]:
arr=[]
for i in range(len(microbiome_files)):
    df_temp = pd.read_excel(microbiome_files[i], header=None, names=['Name', 'Number'])
    df_temp['src'] = i
    arr.append(df_temp)

arr = pd.concat(arr)
arr.head()

In [None]:
col_header = pd.read_excel(metadata)
col_header

In [None]:
print('Number of duplicates: {}'.format(np.sum(arr.duplicated(subset=['Name', 'src']))))

In [None]:
arr_merged = arr.pivot(index='Name', columns='src', values='Number')
arr_merged.head()

In [None]:
t = col_header[['GROUP', 'SAMPLE']].fillna('unknown').values.T
index_new = pd.MultiIndex.from_tuples(list(zip(*t)), names=['GROUP', 'SAMPLE'])

In [None]:
arr_merged.columns = index_new
arr_merged.fillna('unknown', inplace=True)
arr_merged.head()
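
The column relabelling above can be sketched on a hypothetical two-sample table (the group names, sample names and counts below are made up): each metadata row becomes one (`GROUP`, `SAMPLE`) tuple, and the tuples replace the numeric source ids as column labels.

```python
import pandas as pd

# Hypothetical metadata: one row per source spreadsheet
meta = pd.DataFrame({'GROUP': ['group A', 'group B'], 'SAMPLE': [None, 'tissue']})

# Transpose to (2, n_files), then zip into one (GROUP, SAMPLE) tuple per file
pairs = meta[['GROUP', 'SAMPLE']].fillna('unknown').values.T
cols = pd.MultiIndex.from_tuples(list(zip(*pairs)), names=['GROUP', 'SAMPLE'])

# Hypothetical pivoted counts with source ids 0 and 1 as columns
counts = pd.DataFrame([[1, 2]], index=['taxon X'], columns=[0, 1])
counts.columns = cols
print(counts)
```

This relies on the columns of the pivoted table being in the same order as the rows of the metadata file.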

In [None]:
print('Is index unique: {}'.format(arr_merged.index.is_unique))

## Task 3. Class War in Titanic

Use pandas to import the data file `Data/titanic.xls`. It contains data on all the passengers that travelled on the Titanic.

In [None]:
from IPython.core.display import HTML
HTML(filename=DATA_FOLDER+'/titanic.html')

For each of the following questions state clearly your assumptions and discuss your findings:
1. Describe the *type* and the *value range* of each attribute. Indicate and transform the attributes that can be `Categorical`. 
2. Plot histograms for the *travel class*, *embarkation port*, *sex* and *age* attributes. For the latter one, use *discrete decade intervals*. 
3. Calculate the proportion of passengers by *cabin floor*. Present your results in a *pie chart*.
4. For each *travel class*, calculate the proportion of the passengers that survived. Present your results in *pie charts*.
5. Calculate the proportion of the passengers that survived by *travel class* and *sex*. Present your results in *a single histogram*.
6. Create 2 equally populated *age categories* and calculate survival proportions by *age category*, *travel class* and *sex*. Present your results in a `DataFrame` with unique index.

### 3.1 Read data

In [None]:
df = pd.read_excel(join(DATA_TITANIC, 'titanic.xls'),  
                   converters={'pclass': int, 'survived': int, 'age': float})  # np.int / np.float were removed from NumPy
df.head()

At first glance, the HTML file gives us the type and range of some attributes. Other attributes need further investigation. Let's look at how the data is organized, column by column:
* `pclass`: Numerical value, initially of type int64, which can take the values 1, 2 or 3. As it has only 3 fixed values, it can be categorical. <br>
We check that it indeed contains only 3 values (1, 2, 3). Then we set it as categorical.

In [None]:
print('Type of data: {}'.format(df['pclass'].dtype))
print('Unique categories: {}'.format(df['pclass'].unique()))
df['pclass'] = df['pclass'].astype(pd.CategoricalDtype(ordered=True))
df['pclass'].cat.categories
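
As a minimal sketch of what an ordered categorical provides, on hypothetical values: the categories are listed explicitly here (an assumption; they can also be inferred from the data), and the ordering makes comparisons between classes meaningful.

```python
import pandas as pd

# Hypothetical travel classes
pclass = pd.Series([3, 1, 2, 3, 1])

# Ordered categorical with explicit categories 1 < 2 < 3
pclass_cat = pclass.astype(pd.CategoricalDtype(categories=[1, 2, 3], ordered=True))

# Comparisons are allowed because the categories are ordered
n_above_third = int((pclass_cat < 3).sum())
print(pclass_cat.cat.categories.tolist(), n_above_third)
```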

* `survived`: Binary class, hence a numerical value that is either 0 or 1. Type int64. It could be set as categorical, but we decided to keep it numerical to make the survival ratio easier to compute and plot. <br>
We check that it indeed contains only 2 values (0, 1).

In [None]:
print('Type of data: {}'.format(df['survived'].dtype))
print('Unique categories: {}'.format(df['survived'].unique()))

* `name`: Passenger name. Simple string value, stored as an object type. Not categorical, since there are many possible values.

In [None]:
print('Type of data: {}'.format(df['name'].dtype))
print('Unique categories: {}'.format(df['name'].unique()))

* `sex`: String giving the sex of the person, with two possible values: female or male. Stored as an object type. Can be categorical. <br>
We check that it indeed contains only two values before setting it as categorical.

In [None]:
print('Type of data: {}'.format(df['sex'].dtype))
print('Unique categories: {}'.format(df['sex'].unique()))
df['sex'] = df['sex'].astype(pd.CategoricalDtype(ordered=True))
df['sex'].cat.categories

* `age`: Numerical float64 giving the age of the passenger. It should be either NaN or a positive number that is not too high. There are 263 passengers with no registered age, and the recorded ages lie in the range [0.1667, 80]. We leave the missing entries as NaN and discard them for plotting. Not categorical.

Note that we were sceptical when we saw that the minimum age was 0.1667, since it is not an integer. But as the next cell shows, all values between 0 and 1 are the ages of babies expressed in months.

In [None]:
print('Type of data: {}'.format(df['age'].dtype))
print('Amount of NaN: {}, max: {}, min: {}'.format(
    np.sum(pd.isnull(df['age'])), df['age'].max(), df['age'].min()))
print('Baby age in range 0-1 : {} month/s'.format( 
    np.round(pd.to_numeric(df[df['age'] < 1]['age'].values*12), decimals=1) ))
df.age = pd.to_numeric(df.age)

* `sibsp`: Family-related field (number of siblings and/or spouses on board). Stored as int64 with range [0, 8]. Not categorical.

In [None]:
print('Type of data: {}'.format(df['sibsp'].dtype))
print('Amount of NaN: {}, max: {}, min: {}'.format(
    np.sum(pd.isnull(df['sibsp'])), df['sibsp'].max(), df['sibsp'].min()))

* `parch`: Family related field (number of parents and/or children on boat). Stored as int64 and range in [0,9]. Not categorical

In [None]:
print('Type of data: {}'.format(df['parch'].dtype))
print('Amount of NaN: {}, max: {}, min: {}'.format(
    np.sum(pd.isnull(df['parch'])), df['parch'].max(), df['parch'].min()))

* `ticket`: String value giving the id of the ticket. It contains letters and numbers and has many possible values. Stored as an object type. Not categorical.

In [None]:
print('Type of data: {}'.format(df['ticket'].dtype))
print('Ticket id possibilities: {}'.format(df['ticket'].value_counts()))

* `fare`: Numerical value, float64, indicating ticket fare, expressed in British pound. Range 0 to 512.3292. Not categorical.

In [None]:
print('Type of data: {}'.format(df['fare'].dtype))
print('Amount of NaN: {}, max: {}, min: {}'.format(
    np.sum(pd.isnull(df['fare'])), df['fare'].max(), df['fare'].min()))

* `cabin`: String that contains the cabin number and the floor, stored as an object. It takes many possible values.
    * `floor`: The letter indicates the floor. We will set it as categorical.

In [None]:
print('Type of data: {}'.format(df['cabin'].dtype))

Here we can split the field and isolate the floor letter with a regex. Since some fields contain multiple letters, we only keep the first one, as they are the same, e.g. `B58 B60` -> `BB` -> `B`. Note that `n` is not a floor but the first letter of `nan`, so those are missing values. We set them to NaN rather than dropping the rows, and simply leave them out when plotting the results.

In [None]:
floors = [re.sub(r'[0-9 ]', '', str(item))[0] for item in df['cabin']]
print('Floors (unique): {}'.format(np.unique(floors)))
df['floor'] = floors
df.loc[df['floor'] == 'n', 'floor'] = np.nan
df['floor'] = df['floor'].astype(pd.CategoricalDtype(ordered=True))
df['floor'].cat.categories
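
The floor extraction can also be sketched as a small helper that handles missing cabins explicitly instead of going through the string `'nan'`. The function name `cabin_to_floor` and the example values are assumptions for illustration.

```python
import re
import pandas as pd

def cabin_to_floor(cabin):
    """Return the floor letter of a cabin string, or None when it is missing."""
    if pd.isna(cabin):
        return None
    # Remove digits and spaces, e.g. 'B58 B60' -> 'BB', then keep the first letter
    letters = re.sub(r'[0-9 ]', '', str(cabin))
    return letters[0] if letters else None

examples = ['B58 B60', 'C85', float('nan')]
floors = [cabin_to_floor(c) for c in examples]
print(floors)
```

Checking for missing values first means no placeholder letter has to be cleaned up afterwards.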

* `embarked`: String stored as an object type, giving the embarkation port, with 3 possible values: Cherbourg (C), Queenstown (Q), Southampton (S). Categorical. <br>
We check that it contains only these three values. However, it also contains unknown values (NaN). As there are only 2 such entries, we leave them as NaN and do not include them in the `embarked` plots.


In [None]:
print('Type of data: {}'.format(df['embarked'].dtype))
print('Unique categories: {}, Unknown: {}'.format(df['embarked'].unique(), df['embarked'].isnull().sum()))
df['embarked'] = df['embarked'].astype(pd.CategoricalDtype(ordered=True))
df['embarked'].cat.categories

* `boat`: String stored as an object type that indicates the rescue boat id (not sure about this one). Many possible outcomes, not categorical

In [None]:
print('Type of data: {}'.format(df['boat'].dtype))
print('Boat id possibilities: \n{}'.format(df['boat'].value_counts()))

* `body`: Numerical value that indicates the body id number, type float64 in range [1,328]. Many NaN. Not categorical

In [None]:
print('Type of data: {}'.format(df['body'].dtype))
print('Amount of NaN: {}, max: {}, min: {}'.format(
    np.sum(pd.isnull(df['body'])), df['body'].max(), df['body'].min()))

* `home.dest`: String giving the home and final destination of the passenger, stored as an object type. Many possible values. Not categorical.

In [None]:
print('Type of data: {}'.format(df['home.dest'].dtype))

Note: If no range is specified, the values are either free-form strings (no limit) or a mix of numbers and strings (`cabin` or `boat`, for example).

### 3.2 Histogram data
We plot the distribution of the passengers as a function of `pclass`, `sex`, `embarked` and `age`. Note that, as said before, NaN values are left out of the results since they carry no useful information.

Note that we split the `age` data into 10-year ranges, as requested for the histogram.

In [None]:
def nice_bar_plot(data, ax, title='', y_axis=''):
    ax.set_title(title , fontsize=12, fontweight='bold')
    ax.set_xlabel(data.name); ax.set_ylabel(y_axis)
    sns.barplot(x=data.value_counts().keys(), y=data.value_counts().values,  ax=ax)
    # Rotate the tick labels of this subplot (plt.xticks() would only act on the current axes)
    plt.setp(ax.get_xticklabels(), rotation=90)

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16,12))

nice_bar_plot(df['pclass'], axes[0, 0], 'Distribution by class', '#persons')
nice_bar_plot(df['sex'], axes[0, 1], 'Distribution by sex', '#persons')
nice_bar_plot(df['embarked'], axes[1, 0], 'Distribution by embarkation port', '#persons')
age_cut = pd.cut(df.age, [0, 10, 20, 30, 40, 50, 60, 70, df['age'].max() ])
nice_bar_plot(age_cut, axes[1, 1], 'Distribution by age', '#persons')
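
The decade binning can be sketched on a few hypothetical ages. `pd.cut` builds right-closed intervals by default, so an age of exactly 10 falls into `(0, 10]` and not into `(10, 20]`.

```python
import pandas as pd

# Hypothetical ages, including a baby and both ends of the range
ages = pd.Series([0.1667, 9, 10, 25, 80])

# Right-closed decade intervals: (0, 10], (10, 20], ..., (70, 80]
decades = pd.cut(ages, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80])

# Number of passengers per interval, in interval order
counts = decades.value_counts(sort=False)
print(counts)
```

An age of exactly 0 would fall outside the first interval and become NaN; passing `include_lowest=True` would cover that case.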

### 3.3 Proportion of passengers by cabin floor

We calculate the proportion of passengers by cabin floor and plot it in a pie chart. To do so, we use the variable `floor` that we computed in section 3.1.

It is important to keep in mind that there are many NaN entries: we only have a value for 295 passengers, which is not very representative of the total number of passengers (1309) in the dataset. From the available data, we can say that approximately 1/3 of these passengers were on floor C. We found it odd to have people located on floor T, as it seems to be the floor where the engines and turbines were located.

In [None]:
def calc_proportion(data):
    proportion = (data.value_counts().divide(data.value_counts().sum())*100)
    return proportion

print('Total of passengers that have a valid entry of floor : {}'.format(df['floor'].value_counts().sum() ))
print('Repartition by floor : \n{}'.format(calc_proportion(df['floor'])))
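
As a cross-check of `calc_proportion`, pandas can compute the same shares directly with `value_counts(normalize=True)`. A minimal sketch on hypothetical floor letters:

```python
import pandas as pd

# Hypothetical floor letters with one missing entry
floors = pd.Series(['C', 'C', 'B', None, 'A'])

# Shares in percent; missing values are excluded by default
share = floors.value_counts(normalize=True) * 100
print(share)
```

Both approaches leave out missing values, so the percentages refer to passengers with a known floor only.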

In [None]:
sns.set_palette('hls', 10)
def nice_pie_plot(data, ax, title=''):
    ax.set_title(title , fontsize=12, fontweight='bold')
    ax.set_xlabel(data.name)
    ax.pie(data.value_counts().values, labels=data.value_counts().keys(), autopct='%1.1f%%')

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(6,6))
nice_pie_plot(df['floor'], ax, 'Distribution of passengers by floor')

### 3.4 Proportion of survivors by travel class

Below we compute the proportions and plot them in pie charts, where 1 means that the person survived and 0 means that the person died.
Here the results are relevant, as there is an entry for each of the 1309 passengers listed in our dataset. We observe a much higher chance of survival in first class, 61.92%, whereas it drops to 42.96% in second class and is even lower in third class at 25.52%.



In [None]:
for label, df_sub in df.groupby('pclass'):
    counts = df_sub['survived'].value_counts(sort=False)
    print("Proportion of survivors in class", label, "(in percent):\n", counts.divide(counts.sum())*100)

In [None]:
sns.set_palette('hls', 2)
fig, axes = plt.subplots(1, 3, figsize=(16,5))
for i, (label, df_sub) in enumerate(df.groupby('pclass')):
    df_sub['survived'].value_counts(sort=False).plot.pie(
        legend=True, ax=axes[i], title='Survived ratio Class{}'.format(label))
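
Since `survived` is coded as 0/1, the survival proportion of a group is simply the mean of that column. A minimal sketch on hypothetical passengers:

```python
import pandas as pd

# Hypothetical passengers: class and 0/1 survival flag
toy = pd.DataFrame({'pclass': [1, 1, 2, 2, 3, 3, 3, 3],
                    'survived': [1, 1, 1, 0, 0, 0, 1, 0]})

# Mean of a 0/1 column = proportion of ones, here in percent
rate = toy.groupby('pclass')['survived'].mean() * 100
print(rate)
```

This is why keeping `survived` numerical, rather than categorical, is convenient.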

### 3.5 Proportion of passengers that survived by travel class and sex

Below we compute the survival proportion for each combination of travel class and sex and plot it as a single bar chart (the height of each bar is the mean of `survived`, i.e. the proportion of survivors).
The numbers are as relevant as before. We can see that, in every class, women had a much higher chance of survival than men.

In [None]:
sns.set(style="whitegrid")
g = sns.catplot(x="pclass", y="survived", hue="sex", data=df, height=6, kind="bar", palette="muted")
g.set_ylabels("survival proportion")
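
The numbers behind such a bar chart can be tabulated with a `groupby` on both attributes. A minimal sketch on hypothetical passengers (the values are made up and do not reflect the real dataset):

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    'pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'sex':      ['female', 'female', 'male', 'male', 'female', 'female', 'male', 'male'],
    'survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survival proportion per (class, sex), with one column per sex
table = toy.groupby(['pclass', 'sex'])['survived'].mean().unstack('sex')
print(table)
```

Each cell corresponds to the height of one bar in a plot of this kind.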

### 3.6. Several survival proportions

We split the population by age at the median in order to get two equally populated age categories. We then computed the survival proportion by age category, travel class and sex; the result is displayed in the `DataFrame` below.

This gives a much finer view of the survival rate than the one used in the previous section. As stated before, women had a higher chance of survival. We can also see that, overall, younger people had a better chance of survival. As observed before, the higher the class a person was in, the higher their chance of survival. These observations are consistent with the "Women and children first" rule having been applied on the Titanic.


In [None]:
df.age = pd.cut(df.age, [0, df['age'].median(), df['age'].max()])
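
An alternative way to get two equally populated categories is `pd.qcut`, which picks the bin edges from the quantiles of the data. This is a different call from the `pd.cut` used above, shown here as a sketch on hypothetical ages; the labels `younger` and `older` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ages with one missing entry
ages = pd.Series([2, 18, 24, 28, 35, 60, None])

# Two quantile-based bins split at the median; missing ages stay NaN
halves = pd.qcut(ages, q=2, labels=['younger', 'older'])
sizes = halves.value_counts()
print(sizes)
```

With many passengers sharing the median age, the two halves can still differ slightly in size, whichever of the two functions is used.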

In [None]:
total_per_id = df.groupby(['age', 'sex', 'pclass'])['survived'].agg(['count'])['count']
sur_rate = df.groupby(['age', 'sex', 'pclass'])['survived'].sum().divide(total_per_id)*100
sur_rate = pd.DataFrame(sur_rate)
sur_rate

In [None]:
sur_rate.index.is_unique