## Prepare Features For Modeling: Write Out All Final Datasets

### Read In Data

This notebook uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

This dataset contains information about 891 people who were on board the Titanic, which sank on April 15th, 1912. As noted in the description on Kaggle's website, some people aboard the ship were more likely to survive the wreck than others. There were not enough lifeboats for everybody, so women, children, and the upper class were prioritized. Using the information about these 891 passengers, the challenge is to build a model that predicts which people would survive based on the following fields:

- **Name** (str) - Name of the passenger
- **Pclass** (int) - Ticket class (1st, 2nd, or 3rd)
- **Sex** (str) - Gender of the passenger
- **Age** (float) - Age in years
- **SibSp** (int) - Number of siblings and spouses aboard
- **Parch** (int) - Number of parents and children aboard
- **Ticket** (str) - Ticket number
- **Fare** (float) - Passenger fare
- **Cabin** (str) - Cabin number
- **Embarked** (str) - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

In [1]:
# Read in data
import pandas as pd

titanic_train = pd.read_csv('../../../data/split_data/train_features.csv')
titanic_val = pd.read_csv('../../../data/split_data/val_features.csv')
titanic_test = pd.read_csv('../../../data/split_data/test_features.csv')
titanic_train.head()

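| Unnamed: 0 | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Age_clean | Embarked_clean | Fare_clean | Fare_clean_tr | Title | Cabin_ind | Family_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 62.0 | 0 | 0 | 10.5 | 147 | 2 | 62.0 | 3 | 10.5 | 1.600434 | 11 | 0 | 0 |
| 1 | 3 | 1 | 8.0 | 4 | 1 | 29.125 | 147 | 1 | 8.0 | 2 | 29.125 | 1.962697 | 7 | 0 | 5 |
| 2 | 3 | 1 | 32.0 | 0 | 0 | 56.4958 | 147 | 2 | 32.0 | 3 | 56.4958 | 2.240801 | 11 | 0 | 0 |
| 3 | 3 | 0 | 20.0 | 1 | 0 | 9.825 | 147 | 2 | 20.0 | 3 | 9.825 | 1.579307 | 8 | 0 | 1 |
| 4 | 2 | 0 | 28.0 | 0 | 0 | 13.0 | 147 | 2 | 28.0 | 3 | 13.0 | 1.670278 | 8 | 0 | 0 |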
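
The `Unnamed: 0` column in the output above usually appears when a CSV was written with its index and then read back without telling pandas which column holds that index. The sketch below assumes that is what happened here and shows how `index_col=0` keeps the saved index out of the feature columns. The inline CSV text is made-up illustration data, not the contents of the real split files.

```python
import io

import pandas as pd

# Hypothetical two-row CSV that mimics a file written with the index included
csv_text = ",Pclass,Sex\n0,2,1\n1,3,1\n"

# Default read: the saved index comes back as a column named 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

In this notebook the feature lists select columns by name, so the stray column never reaches the final feature files. It matters more when a whole DataFrame is written back out.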


In [3]:
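# Define the list of features to be used for each dataset
# Note: the "raw" set uses Age_clean rather than Age, presumably because
# the raw Age column still contains missing values
raw_original_features = ['Pclass', 'Sex', 'Age_clean', 'SibSp', 'Parch', 'Fare',
                         'Cabin', 'Embarked']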

cleaned_original_features = ['Pclass', 'Sex', 'Age_clean', 'SibSp', 'Parch', 'Fare_clean',
                             'Cabin', 'Embarked_clean']

all_features = ['Pclass', 'Sex', 'Age_clean', 'SibSp', 'Parch', 'Fare_clean', 'Fare_clean_tr',
                'Cabin', 'Cabin_ind', 'Embarked_clean', 'Title', 'Family_cnt']

reduced_features = ['Pclass', 'Sex', 'Age_clean', 'Family_cnt', 'Fare_clean_tr',
                    'Cabin_ind', 'Title']
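
Because the feature lists above are typed by hand, a misspelled column name would only surface as a `KeyError` when the files are written. A minimal sketch of an early check follows; `missing_columns` is a hypothetical helper, not part of the original notebook, and `demo` is a small stand-in for `titanic_train`.

```python
import pandas as pd


def missing_columns(df, features):
    """Return the names in `features` that are not columns of `df`."""
    return [name for name in features if name not in df.columns]


# Hypothetical stand-in for titanic_train with only a few columns
demo = pd.DataFrame({'Pclass': [2, 3], 'Sex': [1, 1], 'Age_clean': [62.0, 8.0]})

ok = missing_columns(demo, ['Pclass', 'Sex', 'Age_clean'])
bad = missing_columns(demo, ['Pclass', 'Title'])

print(ok)
print(bad)
```

In the notebook, the same helper could be called once per feature list against `titanic_train` before any file is written.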

### Write Out All Data

In [4]:
# Write out final data for each feature set
titanic_train[raw_original_features].to_csv('../../../data/final_data/train_features_raw.csv', index=False)
titanic_val[raw_original_features].to_csv('../../../data/final_data/val_features_raw.csv', index=False)
titanic_test[raw_original_features].to_csv('../../../data/final_data/test_features_raw.csv', index=False)

titanic_train[cleaned_original_features].to_csv('../../../data/final_data/train_features_original.csv', index=False)
titanic_val[cleaned_original_features].to_csv('../../../data/final_data/val_features_original.csv', index=False)
titanic_test[cleaned_original_features].to_csv('../../../data/final_data/test_features_original.csv', index=False)

titanic_train[all_features].to_csv('../../../data/final_data/train_features_all.csv', index=False)
titanic_val[all_features].to_csv('../../../data/final_data/val_features_all.csv', index=False)
titanic_test[all_features].to_csv('../../../data/final_data/test_features_all.csv', index=False)

titanic_train[reduced_features].to_csv('../../../data/final_data/train_features_reduced.csv', index=False)
titanic_val[reduced_features].to_csv('../../../data/final_data/val_features_reduced.csv', index=False)
titanic_test[reduced_features].to_csv('../../../data/final_data/test_features_reduced.csv', index=False)
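
The twelve calls above follow one pattern: each split is written once per feature set. As a sketch of an alternative, the same pattern can be expressed with two nested loops. `write_feature_sets` is a hypothetical helper, and the demo data and temporary directory are assumptions so the example runs without the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_feature_sets(splits, feature_sets, out_dir):
    """Write one CSV per (split, feature set) pair and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for split_name, df in splits.items():
        for set_name, columns in feature_sets.items():
            path = out_dir / f'{split_name}_features_{set_name}.csv'
            # index=False keeps the row index out of the file, as in the cell above
            df[columns].to_csv(path, index=False)
            written.append(path)
    return written


# Hypothetical miniature data standing in for the real splits
demo = pd.DataFrame({'Pclass': [2, 3], 'Sex': [1, 1], 'Fare': [10.5, 29.125]})
splits = {'train': demo, 'val': demo}
feature_sets = {'raw': ['Pclass', 'Sex', 'Fare'], 'reduced': ['Pclass', 'Sex']}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_feature_sets(splits, feature_sets, tmp)
    names = sorted(p.name for p in paths)
    reduced_cols = list(pd.read_csv(Path(tmp) / 'train_features_reduced.csv').columns)

print(names)
print(reduced_cols)
```

With the real data, `splits` would map `'train'`, `'val'`, and `'test'` to the three DataFrames, and `feature_sets` would map `'raw'`, `'original'`, `'all'`, and `'reduced'` to the four lists defined earlier.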

### Move Labels To Proper Directory

In [5]:
# Read in all labels
titanic_train_labels = pd.read_csv('../../../data/split_data/train_labels.csv')
titanic_val_labels = pd.read_csv('../../../data/split_data/val_labels.csv')
titanic_test_labels = pd.read_csv('../../../data/split_data/test_labels.csv')

In [6]:
# Double-check the labels
titanic_train_labels

| Unnamed: 0 | Survived |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 529 | 1 |
| 530 | 0 |
| 531 | 0 |
| 532 | 1 |


In [7]:
# Write out labels to final directory
# (the 'Unnamed: 0' column shown above is an ordinary column, so it is written too)
titanic_train_labels.to_csv('../../../data/final_data/train_labels.csv', index=False)
titanic_val_labels.to_csv('../../../data/final_data/val_labels.csv', index=False)
titanic_test_labels.to_csv('../../../data/final_data/test_labels.csv', index=False)
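
Since features and labels are stored in separate files, it can be worth confirming that each split still lines up before modeling. The following is a minimal sketch of such a check; `check_alignment` is a hypothetical helper, and the small DataFrames are made-up stand-ins for one split.

```python
import pandas as pd


def check_alignment(features, labels, label_col='Survived'):
    """Raise ValueError if row counts differ or labels are not all 0/1."""
    if len(features) != len(labels):
        raise ValueError(
            f'row count mismatch: {len(features)} features vs {len(labels)} labels'
        )
    if not labels[label_col].isin([0, 1]).all():
        raise ValueError(f'{label_col} contains values other than 0 and 1')
    return True


# Hypothetical stand-ins for one split's features and labels
demo_features = pd.DataFrame({'Pclass': [2, 3, 3], 'Sex': [1, 1, 0]})
demo_labels = pd.DataFrame({'Survived': [1, 0, 1]})

aligned = check_alignment(demo_features, demo_labels)

# A labels frame with one row missing should be rejected
try:
    check_alignment(demo_features, demo_labels.iloc[:2])
    mismatch_caught = False
except ValueError:
    mismatch_caught = True

print(aligned, mismatch_caught)
```

This only compares row counts and label values. It does not prove that row *i* of the features matches row *i* of the labels, which depends on both files having been written in the same order.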