In [1]:
import pandas as pd
import sklearn as skl
import numpy as np



trainDataFileName = "train.csv"
descriptionFileName = "Description"

In [2]:
with open(descriptionFileName) as f:
    read_data=f.read()
    print(read_data)

Data Description
Overview

The data has been split into two groups:

training set (train.csv)
test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

Data Dictionary

Variable	Definition	Key
survival	Survival	0 = No, 1 = Yes
pclass	Ticket class	1 = 1st, 2 = 2nd, 3 = 3rd
sex	Sex	
Age	Age in years	
sibsp	# of siblings / spouses aboard the Titanic	
parch	# of parents / children aboard the Titanic	
ticke

In [3]:
data = pd.read_csv(trainDataFileName).dropna()  # Drop every row that has at least one missing value
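
With default arguments, `dropna()` removes a row as soon as any single column is missing, so one sparsely filled column can discard most of the data. A minimal sketch of how this could be checked before dropping, using a small made-up frame (the values are illustrative and not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (all values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
    "Fare": [7.25, 71.28, 8.05, 51.86],
})

na_counts = toy.isna().sum()      # number of missing values per column
rows_before = len(toy)
rows_after = len(toy.dropna())    # rows without any missing value

print(na_counts)
print("Rows kept by dropna(): {} of {}".format(rows_after, rows_before))
```

If one column accounts for most of the missing values, dropping that column first (or using `dropna(subset=[...])`) may keep far more rows.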

In [4]:
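# Without call parentheses this evaluates to the bound method itself, whose repr
# embeds the frame (as shown below); call data.describe() for summary statistics.
data.describe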

<bound method NDFrame.describe of      PassengerId  Survived  Pclass  \
1              2         1       1   
3              4         1       1   
6              7         0       1   
10            11         1       3   
11            12         1       1   
21            22         1       2   
23            24         1       1   
27            28         0       1   
52            53         1       1   
54            55         0       1   
62            63         0       1   
66            67         1       2   
75            76         0       3   
88            89         1       1   
92            93         0       1   
96            97         0       1   
97            98         1       1   
102          103         0       1   
110          111         0       1   
118          119         0       1   
123          124         1       2   
124          125         0       1   
136          137         1       1   
137          138         0       1   
139          140

In [5]:
categoricalVariables = ["Pclass", "Sex", "Embarked"]
numericalVariables = ["SibSp", "Parch", "Fare", "Age"]
otherVariables = ["PassengerId", "Name", "Ticket", "Cabin"]

In [6]:
np.corrcoef(x=data['Survived'],y=data['Age'])

array([[ 1.        , -0.25408475],
       [-0.25408475,  1.        ]])
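
`np.corrcoef` returns only the coefficient. A sketch of how a p-value could be obtained alongside it, assuming SciPy is available (it is not imported in this notebook) and using synthetic data in place of the `Survived` and `Age` columns:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age = rng.uniform(1, 80, size=200)
# Synthetic 0/1 outcome whose probability decreases with age (made-up relationship)
survived = (rng.uniform(size=200) < 0.8 - 0.006 * age).astype(int)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(survived, age)
print("r = {:.3f}, p = {:.4f}".format(r, p_value))
print("significant at the 0.05 level:", p_value < 0.05)
```

Since `Survived` is binary, this Pearson coefficient is the point-biserial correlation; the p-value tests the null hypothesis of zero correlation.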

## To do:

1. Apply standard scikit-learn routines to the dataset and evaluate performance over multiple train-test splits.
2. Feature analysis: correlations (including a significance test at the 0.05 level), missing (NA) values, and some splits to check for nonlinear behavior (categorical variables with fewer than 5 values).
3. Feature engineering: is there anything useful in Name, Ticket number, or Cabin number? Merge SibSp and Parch.
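
As a rough sketch of what item 3 could look like, the block below derives three candidate features from a toy frame. The column names `Title`, `Relatives` and `Deck` are hypothetical, the rows are made up, and the name pattern "Surname, Title. Given names" is an assumption about how the Name column is formatted:

```python
import pandas as pd

# Toy rows in the assumed format of the training data (values are made up)
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane (Mary Smith)", "Poe, Miss. Ann"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 2, 0],
    "Cabin": ["C85", "E46", None],
})

# Title: the token between the comma and the following period
toy["Title"] = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
# Merge SibSp and Parch into a single count of relatives aboard
toy["Relatives"] = toy["SibSp"] + toy["Parch"]
# Deck: first character of the cabin code; stays missing where Cabin is missing
toy["Deck"] = toy["Cabin"].str[0]

print(toy[["Title", "Relatives", "Deck"]])
```

Whether any of these features helps prediction would still have to be checked in the feature analysis.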

## 1 Start with naive application of standard skl routines

To get a first overview, we apply standard prediction routines to see what is possible without spending time on feature analysis, feature engineering, or fine-tuning of algorithm parameters.

We delete "non-essential" columns (columns that cannot be used directly in a prediction model without feature engineering).

In [58]:
filteredData = data.drop(otherVariables, axis=1)  # Remove non-essential columns
filteredData = pd.get_dummies(filteredData, columns=categoricalVariables)  # Create dummy variables for categorical data

# Split into training and test sets. Note that frac=0.33 assigns 33% of the
# rows to the training set; the remaining 67% form the test set.
train = filteredData.sample(frac=0.33, random_state=42)
test = filteredData.drop(train.index)
print('Percentage of people survived in test set: {:2.1f}%'.format(100*sum(test["Survived"])/len(test["Survived"])))

# Apply logistic regression
from sklearn import linear_model

logReg = linear_model.LogisticRegression()
logReg.fit(X=train.drop("Survived", axis=1), y=train["Survived"])
# predict() returns a plain array, so the test index is reset to align rows by position
resultslogReg = pd.DataFrame(logReg.predict(test.drop("Survived", axis=1)), columns=["prediction"])
resultslogReg = pd.concat([test.reset_index(), resultslogReg], axis=1)
# Labels are 0/1, so |prediction - truth| equals 1 exactly for each misclassified passenger
diff = resultslogReg["prediction"] - resultslogReg["Survived"]
print('##### Logistic Regression:\nTotal test: {}\nSurvived test: {}\nCorrect Predictions: {}'.format(len(resultslogReg["Survived"]), sum(resultslogReg["Survived"]), (len(resultslogReg["Survived"]) - sum(abs(diff)))))


# Apply Random Forest Classifier

from sklearn import ensemble

prf = ensemble.RandomForestClassifier(random_state=42)
prf.fit(X=train.drop("Survived", axis=1), y=train["Survived"])
resultsprf = pd.DataFrame(prf.predict(test.drop("Survived", axis=1)), columns=["prediction"])
resultsprf = pd.concat([test.reset_index(), resultsprf], axis=1)
diff = resultsprf["prediction"] - resultsprf["Survived"]
print('###### Random Forest Classifier:\nTotal test: {}\nSurvived test: {}\nCorrect Predictions: {}'.format(len(resultsprf["Survived"]), sum(resultsprf["Survived"]), (len(resultsprf["Survived"]) - sum(abs(diff)))))


Percentage of people survived in test set: 69.1%
##### Logistic Regression:
Total test: 123
Survived test: 85
Correct Predictions: 94
###### Random Forest Classifier:
Total test: 123
Survived test: 85
Correct Predictions: 94
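
To put these counts into perspective, they can be converted into an accuracy and compared with a trivial baseline that always predicts the majority class. The sketch below is plain arithmetic on the numbers printed above; the variable names are illustrative:

```python
total, survived, correct = 123, 85, 94   # counts printed above (identical for both models)

accuracy = correct / total                            # share of correct predictions
baseline = max(survived, total - survived) / total    # always predicting the majority class

print("Accuracy: {:.1%}".format(accuracy))                  # Accuracy: 76.4%
print("Majority-class baseline: {:.1%}".format(baseline))   # Majority-class baseline: 69.1%
```

Both models therefore gain roughly seven percentage points over always predicting "survived" on this test set.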


Note that the performance of the Random Forest Classifier varies with its `random_state` (fixed to 42 above), so the result of a single run should be read with caution.
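
One way to quantify this variation, and to cover to-do item 1, is to repeat the split and the fit for several seeds and report the mean and spread of the accuracy. The following is only a sketch: it uses synthetic data from `make_classification` as a stand-in for `filteredData`, and the number of seeds and the `test_size` are illustrative choices rather than the settings used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix and the Survived labels
X, y = make_classification(n_samples=183, n_features=8, random_state=0)

scores = []
for seed in range(10):
    # A new split and a new forest for every seed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed, stratify=y)
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print("Accuracy over {} runs: mean {:.3f}, std {:.3f}".format(
    len(scores), np.mean(scores), np.std(scores)))
```

Varying the split seed and the model seed together mixes two sources of randomness; fixing one while varying the other would separate them.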