# Titanic project
Version 3, including predictions for missing age values

In [1]:
import pandas as pd

### Read data

In [2]:
df = pd.read_csv('data/train.csv')
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

### Split into train and test

In [3]:
# define X
X = df.drop(['Survived'], axis=1)

# define y
y = df['Survived']

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=52)
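
With no `test_size` argument, `train_test_split` holds out 25% of the rows. The sketch below uses a synthetic stand-in frame (the 891-row count matches the Kaggle training file; the column values are made up, and the `demo_` names are illustrative only) to show the split sizes this produces:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic training data (values are made up).
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"Fare": rng.uniform(5, 100, size=891)})
demo_y = pd.Series(rng.integers(0, 2, size=891), name="Survived")

# Default test_size is 0.25; random_state makes the shuffle repeatable.
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=52
)

split_sizes = (len(demo_X_train), len(demo_X_test))
print(split_sizes)  # (668, 223)
```

Passing `stratify=demo_y` would additionally keep the class ratio the same in both parts, which may matter for a small, imbalanced dataset.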

### Feature engineering

In [6]:
import feature_engineering as fe
import estimate_age as ea

In [7]:
def feature_engineer(df):
    # Work on an explicit copy of the needed columns, so the helpers below
    # do not assign into a slice of the caller's DataFrame
    df = df[['Sex', 'Age', 'Fare', 'Pclass', 'PassengerId',
             'Name', 'SibSp', 'Parch']].copy()

    df = ea.estimate_age(df)  # add predictions for age NaNs to the df
    df = fe.female_class3(df)  # create column for women in 3rd class
    df = fe.male_class1(df)  # create column for men in 1st class
    df = fe.fill_fare_na(df)  # fill NaNs with median value
    df = fe.log_fare(df)  # take log of fare

    # drop columns that are no longer needed
    df = df.drop(columns=['PassengerId', 'Name', 'SibSp', 'Parch',
                          'Pclass', 'Age'])

    df = fe.one_hot(df)  # one-hot encoding
    return df

In [8]:
# feature-engineer training dataset
X_train_fe = feature_engineer(X_train)
X_train_fe.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['child'] = pd.cut(df['Age'], bins = bins, labels = labels)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['child'] = df['child'].astype('int')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['many_sibsp'] = (df['SibSp'] > cutoff).astype('int')
A value is trying to be set on a copy of a slic

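|   | child | female_class3 | male_class1 | log_fare | Sex_male |
|---|-------|---------------|-------------|----------|----------|
| 0 | 0 | 1 | 0 | 2.286202 | 0 |
| 1 | 0 | 1 | 0 | 2.05786 | 0 |
| 2 | 0 | 0 | 0 | 2.629187 | 1 |
| 3 | 0 | 0 | 0 | 2.994066 | 1 |
| 4 | 0 | 0 | 0 | 3.848018 | 1 |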
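
The `SettingWithCopyWarning` in the output above is raised when a column is assigned on a DataFrame that pandas regards as a slice of another frame. A minimal sketch of one way to avoid it, assuming a hypothetical helper `add_child_flag` (the age threshold is illustrative and is not the binning used in `feature_engineering`): take an explicit `.copy()` before assigning, so the source frame is left untouched.

```python
import pandas as pd

def add_child_flag(df, max_child_age=12):
    """Hypothetical helper: mark passengers up to max_child_age as children."""
    df = df.copy()  # independent frame, so the assignment below is unambiguous
    df["child"] = (df["Age"] <= max_child_age).astype(int)
    return df

raw = pd.DataFrame({
    "Age": [4.0, 30.0, 11.0],
    "Fare": [10.0, 7.25, 20.0],
    "Name": ["a", "b", "c"],
})

# Selecting columns and copying right away avoids handing a slice to the helper.
subset = raw[["Age", "Fare"]].copy()
flagged = add_child_flag(subset)
print(flagged["child"].tolist())  # [1, 0, 1]
```

Copying costs some memory, but on a dataset of this size that is unlikely to matter.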


### Define model

In [9]:
from sklearn.ensemble import RandomForestClassifier
m = RandomForestClassifier(max_depth=5, n_estimators=1000)

### Fit model and compute training score

In [10]:
# fit model
m.fit(X_train_fe, y_train)
# training score
m.score(X_train_fe, y_train)

0.8562874251497006

### Test score

In [11]:
# feature-engineer testing data
X_test_fe = feature_engineer(X_test)
# test score
m.score(X_test_fe, y_test)

0.8609865470852018

### Cross validation

In [12]:
from sklearn.model_selection import cross_val_score
cv_results = cross_val_score(m, X_train_fe, y_train, cv=5, scoring='accuracy')
print(cv_results, '\nMean:', cv_results.mean(), '\nstd:', cv_results.std())

[0.79104478 0.80597015 0.79104478 0.82706767 0.78947368] 
Mean: 0.800920210975199 
std: 0.01438932138063298
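
For a classifier and an integer `cv`, `cross_val_score` uses stratified folds, fits a fresh clone of the model on each training fold, and scores it on the held-out fold. A minimal sketch on synthetic data from `make_classification` (not the Titanic features, and with a smaller forest to keep it fast; all `demo_` names are illustrative) that reports the mean and standard deviation in the same way:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem standing in for the real features.
demo_X, demo_y = make_classification(n_samples=300, n_features=5, random_state=0)

demo_model = RandomForestClassifier(max_depth=5, n_estimators=50, random_state=0)

# One accuracy value per fold; the model passed in is cloned, not fitted in place.
demo_scores = cross_val_score(demo_model, demo_X, demo_y, cv=5, scoring="accuracy")
print(f"Mean: {demo_scores.mean():.3f}  std: {demo_scores.std():.3f}")
```

A cross-validated mean below the single test score, as seen above, is plausible with a test set of this size, since one split can land a few points either side of the average.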


## Deployment

### Train model with entire training dataset

In [13]:
# feature-engineer entire training dataset
X_fe = feature_engineer(X)
# fit model
m.fit(X_fe, y)
# training score on the entire dataset
m.score(X_fe, y)

0.8552188552188552

### Read and feature-engineer Kaggle dataset

In [14]:
X_kaggle = pd.read_csv('data/test.csv')
# feature-engineer Kaggle dataset
X_kaggle_fe = feature_engineer(X_kaggle)

### Compute predictions and save file

In [15]:
ypred = m.predict(X_kaggle_fe)
# copy the ID column so that adding 'Survived' does not write into a slice of X_kaggle
kaggle_submission = X_kaggle[['PassengerId']].copy()
kaggle_submission['Survived'] = ypred
kaggle_submission.to_csv('predict.csv', index=False)

|   | PassengerId | Survived |
|---|-------------|----------|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
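
Before uploading, it can help to check the shape and contents of the submission. The sketch below uses a hypothetical `check_submission` helper. The expected layout (exactly the two columns `PassengerId` and `Survived`, with binary labels) is inferred from the output above, not from an official specification.

```python
import pandas as pd

def check_submission(submission, expected_rows):
    """Hypothetical sanity check for a submission frame; returns True if all pass."""
    assert list(submission.columns) == ["PassengerId", "Survived"]
    assert len(submission) == expected_rows          # one prediction per passenger
    assert submission["PassengerId"].is_unique       # no duplicated passengers
    assert submission["Survived"].isin([0, 1]).all()  # binary labels only
    return True

# Small stand-in frame mirroring the first rows shown above.
demo_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})
submission_ok = check_submission(demo_submission, expected_rows=5)
print(submission_ok)  # True
```

In the notebook this could be called as `check_submission(kaggle_submission, expected_rows=len(X_kaggle))` just before `to_csv`.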
