# Exercise 07

## Kaggle competition 

* Overview of how Kaggle works ([slides](https://github.com/justmarkham/DAT8/raw/master/slides/16_kaggle.pdf))
* Kaggle Titanic competition: [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

# Exercise 07.1 (20 points)

Create a submission using the different classification methods, preprocessing strategies, and cross-validation techniques discussed in class. The output must be detailed in this notebook.

# Exercise 07.2 (20 points)

The remaining points will be allocated based on the performance of each submission.

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)

In [3]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


## Create features

In [4]:
titanic = train_df
titanic.Age.fillna(titanic.Age.median(), inplace=True)
titanic['Sex_Female'] = (titanic['Sex'] == 'female').astype(int)
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
titanic = pd.concat([titanic, embarked_dummies], axis=1)

## Train Model

In [5]:
# define X and y
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived

# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.793721973094


# Apply model to Test set

In [6]:
test_df.Age.fillna(titanic.Age.median(), inplace=True)  # Note use median from training set
test_df['Sex_Female'] = (test_df['Sex'] == 'female').astype(int)
embarked_dummies = pd.get_dummies(test_df.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
test_df = pd.concat([test_df, embarked_dummies], axis=1)

In [7]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female,Embarked_Q,Embarked_S
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,0,1,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,1,0,1
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,0,1,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,0,0,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,1,0,1


In [47]:
predictions = logreg.predict(test_df[feature_cols])

In [48]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission.csv", index=False)

# Using cross-validation

In [8]:
# Create k-folds
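from sklearn.model_selection import KFold
# random_state is dropped because it has no effect without shuffling;
# the (train, test) index pairs are stored in a list so they can be reused later
kf = list(KFold(n_splits=10).split(X))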

results = []
models = []
for train_index, test_index in kf:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model
    models.append(LogisticRegression(C=1e9))
    
    models[-1].fit(X_train, y_train)

    # make predictions for testing set
    y_pred_class = models[-1].predict(X_test)

    # calculate testing accuracy
    results.append(metrics.accuracy_score(y_test, y_pred_class))

In [9]:
results

[0.76666666666666672,
 0.8202247191011236,
 0.7752808988764045,
 0.8202247191011236,
 0.7640449438202247,
 0.7865168539325843,
 0.7640449438202247,
 0.7752808988764045,
 0.84269662921348309,
 0.8314606741573034]

In [10]:
probas = pd.DataFrame(index=test_df[feature_cols].index, columns=['p'+ str(i) for i in range(10)])

In [11]:
for i in range(10):
    proba = models[i].predict_proba(test_df[feature_cols])[:, 1]
    probas.iloc[:, i] = proba

In [12]:
probas.head()

Unnamed: 0,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9
0,0.087517,0.098606,0.102348,0.081192,0.093196,0.106704,0.098966,0.108938,0.117309,0.095253
1,0.381119,0.409449,0.439191,0.360096,0.404643,0.393891,0.39987,0.380765,0.413614,0.432706
2,0.101229,0.126403,0.131574,0.09806,0.10524,0.123159,0.11138,0.116369,0.146649,0.116585
3,0.085612,0.085913,0.086607,0.079191,0.086348,0.086899,0.084662,0.082693,0.090954,0.088761
4,0.555966,0.575204,0.574304,0.548026,0.580051,0.56974,0.58187,0.589655,0.556178,0.586308


### From proba to classifier

In [13]:
predictions = ((probas.mean(axis=1) > 0.5) * 1.0).values
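The step from probabilities to class labels can be isolated in a small helper. The following is a minimal sketch, assuming a plain NumPy array of per-fold probabilities; the function name `average_and_threshold` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical per-fold survival probabilities:
# one row per passenger, one column per fold model.
fold_probas = np.array([
    [0.09, 0.10, 0.08],
    [0.38, 0.41, 0.44],
    [0.56, 0.58, 0.57],
])

def average_and_threshold(probas, threshold=0.5):
    """Average the fold probabilities per row and return 0/1 integer labels."""
    mean_proba = probas.mean(axis=1)
    return (mean_proba > threshold).astype(int)

labels = average_and_threshold(fold_probas)
strict_labels = average_and_threshold(fold_probas, threshold=0.6)
print(labels, strict_labels)
```

Returning integers rather than floats keeps the `Survived` column of the submission file as 0/1 instead of 0.0/1.0, and raising the threshold makes the ensemble more conservative about predicting survival.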

In [14]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission_CV.csv", index=False)

In [15]:
predictions

array([ 0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,
        0.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  0.,  1.,  1.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,
        1.,  1.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,
        1.,  1.,  0.,  1.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,
        0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,
        1.,  1.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,
        1.,  1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  1.,
        1.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  1.,  1.,  0

# Sending the best

In [16]:
np.argmax(results)

8

In [17]:
train_index, test_index = list(kf)[8]
X_train, X_test = X.iloc[train_index], X.iloc[test_index]
y_train, y_test = y.iloc[train_index], y.iloc[test_index]

logres = LogisticRegression(C=1e9)
logres.fit(X_train, y_train)
# predict with the model just fitted on fold 8 (logres), not the earlier logreg
predictions = logres.predict(test_df[feature_cols])

In [18]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission_k8.csv", index=False)

### FIRST LOGISTIC MODEL

In [383]:
import pandas as pd
import numpy as np

In [384]:
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)


In [385]:
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [386]:
titanic['Age2'] = titanic['Age']**2
titanic['Fare2'] = titanic['Fare'] **2
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Parch2'] = titanic['Parch'] **2


In [387]:
titanic.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female,Embarked_C,Embarked_Q,Embarked_S,Age2,Fare2,SibSp2,Parch2
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S,0,0,0,1,484,52.5625,1,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C,1,1,0,0,1444,5081.308859,1,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S,1,0,0,1,676,62.805625,0,0


In [388]:
feature_cols = ['Pclass', 'Age', 'SibSp','Parch','Fare', 'Age2','Fare2','Sex_Female', 'Embarked_C', 'SibSp2','Parch2',
                'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived

In [389]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [390]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.807174887892


In [300]:
test_df.Age.fillna(titanic.Age.mean(), inplace=True)
test_df.Fare.fillna(titanic.Fare.mean(), inplace=True) 
test_df.Embarked.fillna('S', inplace=True)
test_df['Sex_Female'] = test_df.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(test_df.Embarked, prefix='Embarked')
test_df = pd.concat([test_df, embarked_dummies], axis=1)
test_df['Age2'] = test_df['Age']**2
test_df['SibSp2'] = test_df['SibSp'] **2
test_df['Parch2'] = test_df['Parch'] **2
test_df['Fare2'] = test_df['Fare'] **2

In [301]:
predictions = logreg.predict(test_df[feature_cols])

In [302]:

# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission_M3.csv", index=False)

### SECOND LOGISTIC MODEL



In [242]:
import pandas as pd
import numpy as np
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [243]:
titanic["Si_Pa"] = (titanic["SibSp"]*titanic["Parch"])**2
titanic["Age_Fare"] = titanic["Fare"]*titanic["Age"]


In [244]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female,Embarked_C,Embarked_Q,Embarked_S,Si_Pa,Age_Fare
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S,0,0,0,1,0,159.500000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C,1,1,0,0,0,2708.765400
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S,1,0,0,1,0,206.050000
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S,1,0,0,1,0,1858.500000
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S,0,0,0,1,0,281.750000
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q,0,0,1,0,0,251.204047
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.000000,0,0,17463,51.8625,E46,S,0,0,0,1,0,2800.575000
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.000000,3,1,349909,21.0750,,S,0,0,0,1,9,42.150000
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.000000,0,2,347742,11.1333,,S,1,0,0,1,0,300.599100
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.000000,1,0,237736,30.0708,,C,1,1,0,0,0,420.991200


In [245]:
feature2_cols = ['Pclass', 'Sex_Female', 'Embarked_C','Embarked_Q', 'Embarked_S','Si_Pa','Age_Fare']
X_1 = titanic[feature2_cols]
y_1 = titanic.Survived

In [246]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, random_state=1)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [247]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.784753363229


## THIRD LOGISTIC MODEL


In [248]:
import pandas as pd
import numpy as np
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [249]:
titanic["Si_Par"] = (titanic["SibSp"]*titanic["Parch"])
titanic["Age_P"] = titanic["Pclass"]*titanic["Age"]

In [250]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_Female,Embarked_C,Embarked_Q,Embarked_S,Si_Par,Age_P
0,1,0,3,"Braund, Mr. Owen Harris",male,22.000000,1,0,A/5 21171,7.2500,,S,0,0,0,1,0,66.000000
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.000000,1,0,PC 17599,71.2833,C85,C,1,1,0,0,0,38.000000
2,3,1,3,"Heikkinen, Miss. Laina",female,26.000000,0,0,STON/O2. 3101282,7.9250,,S,1,0,0,1,0,78.000000
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.000000,1,0,113803,53.1000,C123,S,1,0,0,1,0,35.000000
4,5,0,3,"Allen, Mr. William Henry",male,35.000000,0,0,373450,8.0500,,S,0,0,0,1,0,105.000000
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q,0,0,1,0,0,89.097353
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.000000,0,0,17463,51.8625,E46,S,0,0,0,1,0,54.000000
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.000000,3,1,349909,21.0750,,S,0,0,0,1,3,6.000000
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.000000,0,2,347742,11.1333,,S,1,0,0,1,0,81.000000
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.000000,1,0,237736,30.0708,,C,1,1,0,0,0,28.000000


In [254]:
feature3_cols = ['Fare', 'Sex_Female', 'Embarked_C','Embarked_Q', 'Embarked_S','Si_Par','Age_P']
X_2 = titanic[feature3_cols]
y_2 = titanic.Survived

In [255]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_2, y_2, random_state=1)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [256]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.811659192825


## FOURTH LOGISTIC MODEL


In [314]:
import pandas as pd
import numpy as np
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [315]:
titanic["Age_P"] = titanic["Pclass"]*titanic["Age"]
titanic["Si_Pa"] = (titanic["SibSp"]*titanic["Parch"])**2

In [316]:
feature4_cols = [ 'Age', 'Fare', 'Sex_Female', 'Embarked_C', 'Embarked_S','Si_Pa','Age_P']
X_3 = titanic[feature4_cols]
y_3 = titanic.Survived

In [317]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_3, y_3, random_state=1)

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [318]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.811659192825


In [319]:
test_df.Age.fillna(titanic.Age.mean(), inplace=True)
test_df.Fare.fillna(titanic.Fare.mean(), inplace=True) 
test_df.Embarked.fillna('S', inplace=True)
test_df['Sex_Female'] = test_df.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(test_df.Embarked, prefix='Embarked')
test_df = pd.concat([test_df, embarked_dummies], axis=1)
test_df["Age_P"] = test_df["Pclass"]*test_df["Age"]
test_df["Si_Pa"] = (test_df["SibSp"]*test_df["Parch"])**2

In [279]:
predictions = logreg.predict(test_df[feature4_cols])

In [280]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission_M2.csv", index=False)

## FIRST CROSS-VALIDATION MODEL


In [369]:
import pandas as pd
import numpy as np
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [370]:
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Parch2'] = titanic['Parch'] **2
titanic['Age2'] = titanic['Age']**2
titanic['Fare2'] = titanic['Fare'] **2

In [371]:
feature_cols = ['Pclass', 'Age', 'SibSp','Parch','Fare', 'Age2', 'Fare2', 'Parch2', 'SibSp2', 'Sex_Female', 'Embarked_C',
                'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived

In [372]:
# Create k-folds
from sklearn.model_selection import KFold
# random_state is dropped because it has no effect without shuffling
kf = list(KFold(n_splits=10).split(X))

results = []
models = []
for train_index, test_index in kf:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model
    models.append(LogisticRegression(C=1e9))
    
    models[-1].fit(X_train, y_train)

    # make predictions for testing set
    y_pred_class = models[-1].predict(X_test)

    # calculate testing accuracy
    results.append(metrics.accuracy_score(y_test, y_pred_class))

In [373]:
results

[0.78888888888888886,
 0.8202247191011236,
 0.7640449438202247,
 0.84269662921348309,
 0.7865168539325843,
 0.7865168539325843,
 0.7865168539325843,
 0.7640449438202247,
 0.8539325842696629,
 0.84269662921348309]

In [361]:
test_df.Age.fillna(titanic.Age.mean(), inplace=True)
test_df.Fare.fillna(titanic.Fare.mean(), inplace=True) 
test_df.Embarked.fillna('S', inplace=True)
test_df['Sex_Female'] = test_df.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(test_df.Embarked, prefix='Embarked')
test_df = pd.concat([test_df, embarked_dummies], axis=1)
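test_df['Age2'] = test_df['Age']**2
test_df['Fare2'] = test_df['Fare'] **2
# feature_cols also includes these two squared terms
test_df['SibSp2'] = test_df['SibSp'] **2
test_df['Parch2'] = test_df['Parch'] **2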

In [362]:
probas = pd.DataFrame(index=test_df[feature_cols].index, columns=['p'+ str(i) for i in range(10)])

In [363]:
for i in range(10):
    proba = models[i].predict_proba(test_df[feature_cols])[:, 1]
    probas.iloc[:, i] = proba

In [366]:
predictions = ((probas.mean(axis=1) > 0.6) * 1.0).values

In [367]:
predictions = predictions.astype(int)

In [368]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission_M4.csv", index=False)

## SelectPercentile method

In [398]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [406]:
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Parch2'] = titanic['Parch'] **2
titanic['Age2'] = titanic['Age']**2
titanic['Fare2'] = titanic['Fare'] **2
titanic['Age_SibSp2'] = titanic['Age'] *titanic['SibSp2']



In [407]:
feature = ['Pclass', 'Age', 'SibSp','Parch','Fare', 'Age2', 'Fare2', 'Parch2', 'SibSp2', 'Sex_Female', 'Embarked_C',
                'Embarked_Q', 'Embarked_S','Age_SibSp2']
X = titanic[feature]  # use the 14-feature list defined above, including Age_SibSp2
y = titanic.Survived

In [408]:
from sklearn.feature_selection import SelectPercentile, f_classif

results = pd.DataFrame(index=range(99), columns=['mean_accuracy'])

for i in range(1,100):
    sel = SelectPercentile(f_classif, percentile=i)
    sel.fit(X, y)
    sel.get_support()
    X_sel = sel.transform(X)
    results.iloc[i-1] = pd.Series(cross_val_score(logreg, X_sel, y, cv=10, scoring='accuracy')).mean()


In [409]:
results.idxmax()

mean_accuracy    76
dtype: int64

In [413]:
results.iloc[76]

mean_accuracy    0.803603
Name: 76, dtype: object
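One detail worth checking when reading this result: `results` is filled at position `i-1` for percentile `i`, so a row label returned by `idxmax()` is one less than the percentile that produced it. A minimal sketch with made-up scores (the values below are illustrative, not taken from the notebook) shows the conversion:

```python
import pandas as pd

# Illustrative scores only: position i-1 holds the score obtained with percentile i
percentiles = range(1, 6)
scores = [0.70, 0.74, 0.81, 0.78, 0.79]

results = pd.DataFrame(index=range(len(scores)), columns=['mean_accuracy'], dtype=float)
for i, score in zip(percentiles, scores):
    results.iloc[i - 1] = score

best_row = results['mean_accuracy'].idxmax()  # zero-based row label
best_percentile = best_row + 1                # percentile that produced that row
print(best_row, best_percentile)
```

An alternative that avoids the offset altogether is to build the frame with `index=range(1, 100)` so that the row label is the percentile itself.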

In [415]:
from sklearn.feature_selection import SelectPercentile, f_classif

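# Note: row 76 of `results` was filled by percentile 77 (rows are indexed by i-1);
# 76 is kept here as in the original run
sel = SelectPercentile(f_classif, percentile=76)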
sel.fit(X, y)
sel.get_support()
print(np.array(feature)[~sel.get_support()])

['SibSp' 'Age2' 'Parch2' 'Embarked_Q']
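In the search above, the selector is fitted on all rows before cross-validation, so each validation fold has already influenced which features are kept. A possible way to avoid this is to put the selector and the classifier in one pipeline, so the selection is refit inside every fold. The sketch below assumes synthetic data from `make_classification` as a stand-in for the Titanic features; the helper name `cv_accuracy_for_percentile` and the candidate percentiles are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the feature matrix (assumption: 14 numeric features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=14, n_informative=5, random_state=0
)

def cv_accuracy_for_percentile(percentile, X, y):
    """Mean 10-fold accuracy with feature selection refit inside each fold."""
    pipe = make_pipeline(
        SelectPercentile(f_classif, percentile=percentile),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(pipe, X, y, cv=10, scoring='accuracy').mean()

scores = {p: cv_accuracy_for_percentile(p, X_demo, y_demo) for p in (25, 50, 75, 100)}
best_percentile = max(scores, key=scores.get)
print(scores, best_percentile)
```

Keying the scores by percentile also removes the need to translate a row position back into a percentile.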


### DROPPING THE VARIABLES 'SibSp' 'Age2' 'Parch2' 'Embarked_Q'

In [440]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [441]:
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Fare2'] = titanic['Fare'] **2
titanic['Age_SibSp2'] = titanic['Age'] *titanic['SibSp2']


In [442]:
feature_cols = ['Pclass', 'Age','Parch','Fare', 'Fare2',  'SibSp2', 'Sex_Female', 'Embarked_C',
                 'Embarked_S','Age_SibSp2']
X = titanic[feature_cols]
y = titanic.Survived

In [443]:
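from sklearn.model_selection import train_test_split
# split the reduced feature set defined above; the original cell split X_3, y_3
# from the fourth model, so the accuracy printed below may correspond to that model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)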

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)

In [444]:
# calculate classification accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

0.811659192825


## FINAL CROSS VALIDATION

In [457]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [458]:
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Fare2'] = titanic['Fare'] **2
titanic['Age_SibSp2'] = titanic['Age'] *titanic['SibSp2']

In [459]:
feature_cols = ['Pclass', 'Age','Parch','Fare', 'Fare2',  'SibSp2', 'Sex_Female', 'Embarked_C',
                 'Embarked_S','Age_SibSp2']
X = titanic[feature_cols]
y = titanic.Survived

In [460]:
# Create k-folds
from sklearn.model_selection import KFold
# random_state is dropped because it has no effect without shuffling
kf = list(KFold(n_splits=10).split(X))

results = []
models = []
for train_index, test_index in kf:
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # train a logistic regression model
    models.append(LogisticRegression(C=1e9))
    
    models[-1].fit(X_train, y_train)

    # make predictions for testing set
    y_pred_class = models[-1].predict(X_test)

    # calculate testing accuracy
    results.append(metrics.accuracy_score(y_test, y_pred_class))

In [461]:
results

[0.77777777777777779,
 0.8314606741573034,
 0.7752808988764045,
 0.8202247191011236,
 0.7752808988764045,
 0.7865168539325843,
 0.7752808988764045,
 0.7752808988764045,
 0.84269662921348309,
 0.8202247191011236]

In [462]:
test_df.Age.fillna(titanic.Age.mean(), inplace=True)
test_df.Fare.fillna(titanic.Fare.mean(), inplace=True) 
test_df.Embarked.fillna('S', inplace=True)
test_df['Sex_Female'] = test_df.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(test_df.Embarked, prefix='Embarked')
test_df = pd.concat([test_df, embarked_dummies], axis=1)
test_df['SibSp2'] = test_df['SibSp'] **2
test_df['Fare2'] = test_df['Fare'] **2
# use the test set's own SibSp2, not the training frame's column
test_df['Age_SibSp2'] = test_df['Age'] *test_df['SibSp2']

In [463]:
probas = pd.DataFrame(index=test_df[feature_cols].index, columns=['p'+ str(i) for i in range(10)])

In [464]:
for i in range(10):
    proba = models[i].predict_proba(test_df[feature_cols])[:, 1]
    probas.iloc[:, i] = proba

In [465]:
predictions = ((probas.mean(axis=1) > 0.6) * 1.0).values

In [466]:
predictions = predictions.astype(int)

In [467]:
# Kaggle needs the submission to have a certain format;
# see https://www.kaggle.com/c/titanic-gettingStarted/download/gendermodel.csv
# for an example of what it's supposed to look like.
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],
                            'Survived': predictions })
submission.to_csv("titanic_submission_M5.csv", index=False)

# Gaussian Naive Bayes

In [474]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
train_df = pd.read_csv('titanic_train.csv', header=0)
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)

In [475]:
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Fare2'] = titanic['Fare'] **2
titanic['Age_SibSp2'] = titanic['Age'] *titanic['SibSp2']

In [476]:
feature_cols = ['Pclass', 'Age','Parch','Fare', 'Fare2',  'SibSp2', 'Sex_Female', 'Embarked_C',
                 'Embarked_S','Age_SibSp2']
X = titanic[feature_cols]
y = titanic.Survived

In [477]:
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [478]:
# testing accuracy of Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_class = gnb.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)

0.70852017937219736

## Multinomial Naive Bayes model

In [480]:
import pandas as pd
import numpy as np
test_df = pd.read_csv('titanic_test.csv', header=0)
titanic = train_df
titanic.Age.fillna(titanic.Age.mean(), inplace=True)
titanic.Embarked.fillna('S', inplace=True)
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
titanic = pd.concat([titanic, embarked_dummies], axis=1)


In [483]:
# import both Multinomial and Gaussian Naive Bayes
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics

In [481]:
titanic['SibSp2'] = titanic['SibSp'] **2
titanic['Fare2'] = titanic['Fare'] **2
titanic['Age_SibSp2'] = titanic['Age'] *titanic['SibSp2']

In [482]:
feature_cols = ['Pclass', 'Age','Parch','Fare', 'Fare2',  'SibSp2', 'Sex_Female', 'Embarked_C',
                 'Embarked_S','Age_SibSp2']
X = titanic[feature_cols]
y = titanic.Survived

In [485]:
# train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [486]:
# testing accuracy of Multinomial Naive Bayes
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
y_pred_class = mnb.predict(X_test)
metrics.accuracy_score(y_test, y_pred_class)

0.65022421524663676

## CONCLUSIONS

The purpose of this workshop is to build several models that predict the probability of survival of the Titanic's passengers at the time of the shipwreck.

As the response variable I used Survived, which is coded as 0 and 1, where 1 means the person survived. The initial explanatory variables are Pclass (the class in which the passenger travelled: 1 = 1st; 2 = 2nd; 3 = 3rd); Sex, which arrives as a categorical variable and therefore had to be recoded as 0 for male and 1 for female; Age (the person's age); SibSp (number of siblings or spouses aboard); Parch (number of parents or children aboard); Fare (ticket fare); and Embarked (port of embarkation: C = Cherbourg; Q = Queenstown; S = Southampton). Because the initial model had a fit of 0.76, I created new variables in the hope of explaining survival better and thereby raising the model's accuracy: I squared several variables (Age, SibSp, Parch, Fare) and built interactions between some of them. The interactions contributed the least to the model's fit.
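The squared terms and interactions described above are repeated by hand for the training and test frames in several cells, which makes it easy for the two to drift apart. A minimal sketch of one way to keep them aligned is to wrap the transformations in a single function and call it on both frames. The function name `add_engineered_features` is hypothetical; the two toy rows reuse values shown in the notebook's training table.

```python
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the squared and interaction terms added."""
    out = df.copy()
    for col in ['Age', 'Fare', 'SibSp', 'Parch']:
        out[col + '2'] = out[col] ** 2                   # squared terms
    out['Si_Pa'] = (out['SibSp'] * out['Parch']) ** 2    # family-size interaction
    out['Age_P'] = out['Pclass'] * out['Age']            # class-by-age interaction
    return out

# Two rows taken from the training table displayed earlier
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
    'SibSp': [1, 1],
    'Parch': [0, 0],
})
engineered = add_engineered_features(toy)
print(engineered[['Age2', 'SibSp2', 'Si_Pa', 'Age_P']])
```

Because the function only reads columns of the frame it receives, a test-set feature can never be computed from a training-set column by accident.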

Missing values were also found in Age and Embarked, so they were imputed with the mean of Age and with S, respectively.
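The imputation rule can be written so that the fill values are learned once from the training data and then applied unchanged to the test data. This is a minimal sketch under that assumption; the helper names `fit_fill_values` and `impute` and the toy frames are illustrative, and the port 'S' is hard-coded as in the notebook rather than computed.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train):
    """Learn the fill values from the training data only."""
    return {'Age': train['Age'].mean(), 'Embarked': 'S'}

def impute(df, fill_values):
    """Return a copy of df with missing values replaced column by column."""
    return df.fillna(value=fill_values)

toy_train = pd.DataFrame({'Age': [20.0, 40.0, np.nan], 'Embarked': ['S', np.nan, 'C']})
toy_test = pd.DataFrame({'Age': [np.nan, 10.0], 'Embarked': [np.nan, 'Q']})

fill_values = fit_fill_values(toy_train)
train_filled = impute(toy_train, fill_values)
test_filled = impute(toy_test, fill_values)
print(test_filled)
```

Computing the mean before any filling also avoids a subtle effect of filling in place: once the training column has been filled, its mean no longer changes, but its median can.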

The models I built for this exercise were logistic regression, Naive Bayes (multinomial and Gaussian), and a cross-validated logistic regression that averages the models fitted on each fold. Each model was trained on the train dataset, and its predictions were made on the test dataset and then submitted to the Kaggle platform. Among these models, I found that the interactions I created between variables did not contribute significantly to a better fit, whereas the squared variables gave the predictions greater accuracy.

After analyzing the accuracy of each model and evaluating several methodologies, the best-fitting model was the cross-validated one (Final cross validation). It used as predictors the variables recommended by SelectPercentile, a method that selects variables according to a percentile of the highest scores. This model scored 0.7799 on the platform, an increase of approximately 0.02 over the logistic model, and it also showed greater predictive power than the Naive Bayes models.
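The comparison between model families discussed in these conclusions relies partly on single train/test splits. One possible way to put the candidates on equal footing is to score each of them on the same cross-validation folds. The sketch below assumes synthetic data from `make_classification` in place of the Titanic features, and the dictionary keys are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: 10 numeric features)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The same shuffled, stratified folds are reused for every candidate
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'gaussian_nb': GaussianNB(),
}

cv_means = {
    name: cross_val_score(model, X_demo, y_demo, cv=folds, scoring='accuracy').mean()
    for name, model in candidates.items()
}
print(cv_means)
```

The resulting means are estimates on held-out folds of the training data, so they complement rather than replace the score reported by the Kaggle platform.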
