# Statistical data analysis

In this exercise notebook, you'll be asked to perform several statistical tests to check data properties.

* Some of these exercises will be similar to what you've seen so far, including data exploration and data reshaping.

* Because this exercise requires a lot of statistics, feel free to use any external sources and materials that help you.

* You can use any statistical library you find suitable: scipy, sklearn, or statsmodels.

Fill parts of the notebook marked as **'??'** with your code to get the results.

Almost every cell comes with an assert statement that describes what **IS EXPECTED TO HAPPEN.**

If everything is fine - you will pass.

If something is wrong, you'll get an error.

You'll be working with the well-known Titanic dataset.

Info about the dataset:

https://www.kaggle.com/c/titanic/data

Data Dictionary

| Variable | Definition                                 | Key                                            | 
|----------|--------------------------------------------|------------------------------------------------| 
| survival | Survival                                   | 0 = No, 1 = Yes                                | 
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      | 
| sex      | Sex                                        |                                                | 
| Age      | Age in years                               |                                                | 
| sibsp    | # of siblings / spouses aboard the Titanic |                                                | 
| parch    | # of parents / children aboard the Titanic |                                                | 
| ticket   | Ticket number                              |                                                | 
| fare     | Passenger fare                             |                                                | 
| cabin    | Cabin number                               |                                                | 
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton | 


pclass: A proxy for socio-economic status (SES)

    1st = Upper
    2nd = Middle
    3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

In [71]:
import pandas as pd
import numpy as np
import sklearn as sk
import scipy.stats as st
import statsmodels as stm
import statsmodels.api as stmApi



In [83]:
%matplotlib inline

In [3]:
# pd.DataFrame.from_csv is deprecated and removed in recent pandas; pd.read_csv is the replacement
titanic_data = pd.read_csv("./data/titanic_train_data.csv", index_col=None)
titanic_data.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


# T-test for comparing groups

In [39]:
# Exercise 1: Get the mean, variance and standard deviation of age for males and females in the Titanic data

# Round your results to 1 decimal place!

expected_df = pd.DataFrame({
    'a': ['Age', 'mean', 27.9, 30.7],
    'b': ["Age", "var", 199.1, 215.4],
    'c': ["Age", "std", 14.1, 14.7]
})
arrays = [expected_df.iloc[0].tolist(), expected_df.iloc[1].tolist()]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
expected_df.columns = index
expected_df = expected_df.iloc[2:]
expected_df.index = ["female", "male"]
expected_df.index.name = "Sex"


age_mean_var_std = '??'

print(age_mean_var_std)


assert(age_mean_var_std.to_string() == expected_df.to_string())

         Age             
        mean    var   std
Sex                      
female  27.9  199.1  14.1
male    30.7  215.4  14.7
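
A table shaped like the one above (groups as rows, a two-level column header with the statistic names) can be produced with a grouped aggregation. The following is a minimal sketch on made-up data, not the reference solution; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [20.0, 30.0, None, 10.0, 20.0, 30.0],
})

# Selecting Age as a one-column frame gives two-level columns (Age, statistic).
# Missing ages are skipped, and var/std use the sample definition (ddof=1).
toy_stats = toy.groupby("Sex")[["Age"]].agg(["mean", "var", "std"]).round(1)
print(toy_stats)
```

Selecting `[["Age"]]` rather than `["Age"]` is what keeps `Age` as the top level of the column header.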


In [43]:
# Exercise 2: Perform a two-sided t-test to check whether the mean age is equal in the two groups.
# Answer the question: should you use pooled variance or not? 
# Do you consider the variances of the two groups equal?

# What is your null hypothesis? What is your alternative hypothesis?

# Round the statistic to 1 decimal place and the p-value to 2 decimal places,
# so that they match the expected values below

# WATCH OUT FOR MISSING VALUES! Find a way to deal with them

males_age_vec = titanic_data.loc[titanic_data.Sex == "male", "Age"].dropna()
females_age_vec = titanic_data.loc[titanic_data.Sex == "female", "Age"].dropna()

expected_stat = 2.5
expected_pval = 0.01

pval = '??'
stat_val = '??'

assert(pval == expected_pval)
assert(expected_stat == stat_val)

2.5
0.01
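
The pooled-versus-unpooled question can be explored by checking the variances first and then comparing both variants of the test. The sketch below uses synthetic samples (the group sizes, means and spreads are assumptions, not Titanic values) with `scipy.stats.levene` and `scipy.stats.ttest_ind`; it is one possible approach rather than the expected solution.

```python
import numpy as np
import scipy.stats as st

# Synthetic samples standing in for the two age vectors (made-up parameters).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30.0, scale=15.0, size=200)
group_b = rng.normal(loc=28.0, scale=14.0, size=150)

# Levene's test: the null hypothesis is that both groups have equal variances.
levene_stat, levene_p = st.levene(group_a, group_b)

# Welch's t-test does not pool the variances; the classic Student test does.
welch = st.ttest_ind(group_a, group_b, equal_var=False)
pooled = st.ttest_ind(group_a, group_b, equal_var=True)

print("Levene p-value:", round(levene_p, 3))
print("Welch: ", round(welch.statistic, 3), round(welch.pvalue, 3))
print("Pooled:", round(pooled.statistic, 3), round(pooled.pvalue, 3))
```

Both calls are two-sided by default. When the group variances and sizes are similar, the two variants usually give close results, so the choice matters most when they differ.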


Exercise 3: Perform a proportion test of males vs. females who survived.

You should RESHAPE the data first.

Your desired data shape should be:

 |**Survived**|**Not survived**
:-----:|:-----:|:-----:
Males|xx|yy
Females|zz|qq

Use an appropriate statistical test that answers the following questions:
 * Are Survival and Sex independent?
 * Is there any relationship between the two proportions?
 
 
There's no assertion in this exercise, because you might get different values depending on the statistic used. However, there's one clear answer to the question :) So you should arrive at one clear conclusion :)

In [57]:
# YOUR CODE GOES HERE!



(260.71702016732104,
 1.1973570627755645e-58,
 1,
 array([[ 193.47474747,  120.52525253],
        [ 355.52525253,  221.47474747]]))
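
The four-element result above (statistic, p-value, degrees of freedom, expected counts) has the same shape as the return value of `scipy.stats.chi2_contingency`. Assuming a chi-square test of independence is the intended approach, the reshape-then-test workflow might look like the sketch below; the frame `toy` and its counts are made up for illustration.

```python
import pandas as pd
import scipy.stats as st

# Made-up passenger records: 10 males (2 survived) and 10 females (7 survived).
toy = pd.DataFrame({
    "Sex": ["male"] * 10 + ["female"] * 10,
    "Survived": [1] * 2 + [0] * 8 + [1] * 7 + [0] * 3,
})

# Reshape to a 2x2 contingency table: rows are Sex, columns are Survived (0/1).
table = pd.crosstab(toy["Sex"], toy["Survived"])

# Chi-square test of independence; for 2x2 tables scipy applies
# Yates' continuity correction by default (correction=True).
chi2, p_value, dof, expected = st.chi2_contingency(table)

print(table)
print(chi2, p_value, dof)
print(expected)
```

A small p-value is evidence against the null hypothesis that the two variables are independent. A two-sample proportion z-test is an alternative and would give a different statistic.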

In [58]:
from sklearn.linear_model import LinearRegression

no_null_age = titanic_data.Age.notnull()
titanic_data.Age.isnull().sum()
#train_df = titanic_data

177
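
With 177 ages missing, how the missing values are dropped changes how many rows survive. The sketch below uses a made-up frame (the column names mirror the dataset, the values are assumptions) to contrast dropping rows that have any missing value with dropping only rows that are missing `Age`.

```python
import numpy as np
import pandas as pd

# Made-up rows: Age is missing once, Cabin is missing twice.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Survived": [0, 1, 1, 1],
})

# dropna() with no arguments removes a row if ANY column is missing.
all_complete = toy.dropna()

# Restricting to a subset keeps rows that are only missing unrelated columns.
age_complete = toy.dropna(subset=["Age"])

print(len(all_complete), len(age_complete))
```

Restricting `subset` to the columns a model actually uses avoids discarding rows because of gaps in columns that play no role in the analysis.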

In [107]:
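# NOTE: dropna() without arguments drops rows with a missing value in ANY column
# (Cabin included), which is why only 183 observations remain in the summary below.
clean_data = titanic_data.dropna()
lr = LinearRegression()
lr.fit(clean_data[["Pclass"]], clean_data["Age"])
# NOTE: get_dummies keeps both Sex_female and Sex_male; together they are collinear
# with the constant, hence the huge std err for const below (drop_first=True avoids this).
X = pd.get_dummies(clean_data[["Fare", "SibSp", "Age", "Parch", "Sex"]])
X = stmApi.add_constant(X)
fit = stmApi.Logit(clean_data["Survived"], X).fit()
print(fit.summary())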

Optimization terminated successfully.
         Current function value: 0.449664
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:               Survived   No. Observations:                  183
Model:                          Logit   Df Residuals:                      177
Method:                           MLE   Df Model:                            5
Date:                Mon, 14 Aug 2017   Pseudo R-squ.:                  0.2892
Time:                        14:31:09   Log-Likelihood:                -82.288
converged:                       True   LL-Null:                       -115.78
                                        LLR p-value:                 4.358e-13
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.5249   9.42e+06   1.62e-07      1.000   -1.85e+07    1.85e+07
Fare           0.0034      0.