# Data Science for Business

## Spring 2020, module 4 @ HSE

---

## Home assignment 2


Author: **Miron Rogovets**

---

Analyze Titanic data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))
sns.set_style('darkgrid')

In [3]:
df = pd.read_csv('data/titanic_data.csv')
print(df.shape)
df.head(3)

(1309, 12)


| | Passenger Class | Name | Sex | Age | No of Siblings or Spouses on Board | No of Parents or Children on Board | Ticket Number | Passenger Fare | Cabin | Port of Embarkation | Life Boat | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | First | Allen, Miss. Elisabeth Walton | Female | 29.0 | 0 | 0 | 24160 | 211.338 | B5 | Southampton | 2.0 | Yes |
| 1 | First | Allison, Master. Hudson Trevor | Male | 0.917 | 1 | 2 | 113781 | 151.55 | C22 C26 | Southampton | 11.0 | Yes |
| 2 | First | Allison, Miss. Helen Loraine | Female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | Southampton | | No |


### I. Start with basic EDA (Exploratory data analysis):
1. Compute the average `Age` of passengers and the number of passengers who survived and who did not, grouped by `Sex` and `Passenger Class` (24 numbers);

2. What can you say about survivors based on the resulting table (open question)? For example, how does the survival rate for females in First class compare to that in Second and Third?
    *This answer is limited to 150 words.*


3. What is the average number of males and females per boat, taken over all boats (rounded to the nearest integer)?
    *Do not forget to filter out all `?` in `Life Boat` attribute.* 

In [4]:
df.groupby(['Sex', 'Passenger Class', 'Survived'])['Age'].agg(['mean', 'size'])

| Sex | Passenger Class | Survived | mean | size |
|---|---|---|---|---|
| Female | First | No | 35.2 | 5 |
| Female | First | Yes | 37.109 | 139 |
| Female | Second | No | 34.091 | 12 |
| Female | Second | Yes | 26.711 | 94 |
| Female | Third | No | 23.419 | 110 |
| Female | Third | Yes | 20.815 | 106 |
| Male | First | No | 43.658 | 118 |
| Male | First | Yes | 36.168 | 61 |
| Male | Second | No | 33.093 | 146 |
| Male | Second | Yes | 17.449 | 25 |


- The total number of survivors is highest in _First_ class and lowest in _Second_, with _Third_ in between
- In every class, fewer _Males_ than _Females_ survived
- About 97% of _Females_ in _First_ class survived (139 of 144), but only about 34% of _Males_ (61 of 179)
- _Males_ have a survival rate below 50% in all classes
- _Females_ have a survival rate above 50% in _First_ and _Second_ class, and about 50% in _Third_
- For _Females_, the lower the class, the lower the survival rate; for _Males_ this does not hold, as the rate in _Second_ class is slightly lower than in _Third_
- Passengers in _First_ class are older on average than passengers in the other classes
- Within each group, survivors are younger on average than non-survivors, except for _Females_ in _First_ class
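
The ratios in the list above can be recomputed from the `size` column of the grouped table. The following is a minimal sketch, assuming the counts are copied by hand from that table; the Male/Third rows are not shown there, so they are left out. The names `counts`, `wide` and `survival_rate` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the grouped table above (Male/Third is not shown there).
counts = pd.DataFrame(
    [
        ("Female", "First", "No", 5),
        ("Female", "First", "Yes", 139),
        ("Female", "Second", "No", 12),
        ("Female", "Second", "Yes", 94),
        ("Female", "Third", "No", 110),
        ("Female", "Third", "Yes", 106),
        ("Male", "First", "No", 118),
        ("Male", "First", "Yes", 61),
        ("Male", "Second", "No", 146),
        ("Male", "Second", "Yes", 25),
    ],
    columns=["Sex", "Passenger Class", "Survived", "size"],
)

# One row per (Sex, Passenger Class), one column per outcome.
wide = counts.pivot_table(
    index=["Sex", "Passenger Class"],
    columns="Survived",
    values="size",
    aggfunc="sum",
)

# Survival rate = survivors / all passengers in the group.
survival_rate = wide["Yes"] / (wide["Yes"] + wide["No"])
print(survival_rate.round(3))
```

On the full DataFrame, `pd.crosstab` with `normalize='index'` would give the same rates directly, without the intermediate counts.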

In [23]:
df[~df['Life Boat'].isna()].groupby(['Life Boat', 'Sex'])['Age'].agg('size').groupby('Sex').agg('mean')

Sex
Female   13.870
Male      6.958
Name: Age, dtype: float64

Rounded to the nearest integer, that is about 14 females and 7 males per boat on average.

### II. Proceed with feature generation.
1. Drop the column `Life Boat`.
2. Generate a new attribute `Family size`: sum `No of Parents or Children on Board` and `No of Siblings or Spouses on Board`, then add 1 (for the passenger themselves). What is the average family size? In which class did the biggest family travel?
    *Do not drop original attributes.*
    
    
3. It seems that `Passenger Fare` is the total for all passengers sharing the same `Ticket Number`: create a new attribute `Single passenger fare`. For every passenger, compute the number of passengers with the same `Ticket Number` and divide `Passenger Fare` by that number.
    *Do not drop the original attribute.*
    
    
4. Impute missing values: for numerical attributes, use the mean within groups defined by three attributes: `Passenger Class`, `Sex`, `Port of Embarkation`; for every numerical attribute, create a separate column that contains 1 for an imputed value and 0 for an originally present one.
    *This step is mainly for practicing your groupby/join skills. In real tasks this kind of imputation is relatively rare.*
    
5. Pre-process categorical attributes: for every categorical attribute, create a separate column that contains 1 for a missing value and 0 for an originally present one. One-hot encode categorical attributes with fewer than 20 unique values and drop the other categorical attributes; drop the original attributes.
6. Set the role of the `Survived` attribute to `label`.
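
The steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the notebook's actual solution: the `toy` frame and all its values are made up, and the helper names `add_features`, `impute_numeric` and `encode_categoricals` are hypothetical. The toy call uses `max_levels=3` only because the frame is tiny; the task asks for 20.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; every value here is made up.
toy = pd.DataFrame({
    "Passenger Class": ["First", "First", "First", "Third", "Third", "Third"],
    "Sex": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "Port of Embarkation": ["Southampton"] * 3 + ["Cherbourg"] * 3,
    "Age": [30.0, np.nan, 40.0, 20.0, np.nan, 22.0],
    "No of Siblings or Spouses on Board": [1, 1, 0, 0, 2, 0],
    "No of Parents or Children on Board": [0, 2, 0, 0, 1, 0],
    "Ticket Number": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "Passenger Fare": [100.0, 100.0, 50.0, 30.0, 30.0, 30.0],
    "Cabin": ["B5", None, "C22", None, None, "E1"],
    "Life Boat": ["2", None, "11", None, None, None],
    "Survived": ["Yes", "Yes", "No", "No", "No", "Yes"],
})


def add_features(frame):
    """Steps 1-3: drop `Life Boat`, add family size and per-passenger fare."""
    out = frame.drop(columns=["Life Boat"])
    out["Family size"] = (
        out["No of Siblings or Spouses on Board"]
        + out["No of Parents or Children on Board"]
        + 1  # the passenger
    )
    # Number of rows sharing each ticket, aligned back to the rows.
    per_ticket = out["Ticket Number"].map(out["Ticket Number"].value_counts())
    out["Single passenger fare"] = out["Passenger Fare"] / per_ticket
    return out


def impute_numeric(frame, group_cols, numeric_cols):
    """Step 4: fill gaps with the group mean and flag the filled rows."""
    out = frame.copy()
    for col in numeric_cols:
        out[col + " imputed"] = out[col].isna().astype(int)
        group_mean = out.groupby(group_cols)[col].transform("mean")
        out[col] = out[col].fillna(group_mean)
    return out


def encode_categoricals(frame, label, max_levels=20):
    """Step 5: missing flags, one-hot for small categoricals, drop the rest."""
    out = frame.copy()
    cat_cols = [c for c in out.select_dtypes(exclude="number").columns if c != label]
    for col in cat_cols:
        out[col + " missing"] = out[col].isna().astype(int)
        if out[col].nunique() < max_levels:
            dummies = pd.get_dummies(out[col], prefix=col, dtype=int)
            out = pd.concat([out, dummies], axis=1)
        out = out.drop(columns=[col])  # the original attribute is always dropped
    return out


features = add_features(toy)
features = impute_numeric(
    features, ["Passenger Class", "Sex", "Port of Embarkation"], ["Age"]
)
features = encode_categoricals(features, label="Survived", max_levels=3)

# Step 6: in pandas, "setting the label role" amounts to separating the target.
X = features.drop(columns=["Survived"])
y = (features["Survived"] == "Yes").astype(int)
print(X.columns.tolist())
```

The sketch assumes the grouping columns have no missing values; rows with a missing `Port of Embarkation` would need separate handling before the group mean is computed.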

### III. Finish by building a classification model using preprocessed data
1. Compute classification accuracy on a train-test setup:
    - Create a Cross Validation block and set the random_state parameter to 2020.
    - Use a decision tree with `maximal depth` = 7; uncheck the `apply pruning` box; leave all other parameters at their defaults.
    - Use accuracy as the performance metric.
2. Analyze the resulting confusion matrix: which error is larger, Type I or Type II?
3. Provide a short analysis of the results, based on your answers to III.1-III.2. For example, what are the splitting features in the first 3 levels of the best tree (up to 7 attributes)? Do these results match your intuition? You may include some misclassified examples, with explanations of why they were misclassified.
    *This answer is limited to 250 words.*
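
The task wording refers to a visual tool ("Cross Validation block", `apply pruning` box). A rough Python equivalent is sketched below, assuming scikit-learn, which this notebook does not import. The data is synthetic, the 10 folds are an assumption, and treating class 1 ("survived") as the positive class for Type I and Type II errors is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and the `Survived` label.
X, y = make_classification(n_samples=300, n_features=8, random_state=2020)

# Depth-limited tree; scikit-learn applies no pruning unless ccp_alpha > 0.
tree = DecisionTreeClassifier(max_depth=7, random_state=2020)

# Assumed fold count of 10; shuffling is seeded for reproducibility.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2020)

# Out-of-fold predictions: each row is predicted by a tree that never saw it.
pred = cross_val_predict(tree, X, y, cv=cv)

accuracy = accuracy_score(y, pred)
# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y, pred).ravel()
print(f"accuracy={accuracy:.3f}  Type I (FP)={fp}  Type II (FN)={fn}")
```

With class 1 as the positive class, a Type I error is a passenger predicted to survive who did not, and a Type II error is a survivor predicted not to survive; comparing `fp` and `fn` answers item 2.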
