# CSE 627 Group Project

- By John Meyer & Jacob Hubbard

Since we are relatively new to Python data science, we decided to utilize data science tools that we already knew to minimize the learning curve required for this project.
We use Python code blocks to import data, extract features, visualize data, and fit machine learning models, and we use R for data visualization and exploration.
Since Python and R do not work together well in the same notebook, this submission is a series of Jupyter notebooks (with different Python and R kernels).

This is a Python notebook.

## Titanic Dataset

To download the dataset and read an explanation of each column, go to <https://www.kaggle.com/c/titanic>.

## Configuration

In [None]:
# Jupyter config
%matplotlib inline
%config InlineBackend.figure_format = 'svg'  # Or 'retina'

In [None]:
# Python imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import *
from sklearn.mixture import *

#plt.style.use('seaborn-whitegrid')  # Set the aesthetic style of the plots

## Data Import & Feature Extraction

In [None]:
training_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

Below, we extract features from existing columns in the dataset, using ideas from <https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/>.
The regular expressions are our own.

In [None]:
name_regex = r"""^(?P<LastName>[A-Za-z '-]+?), ((the )?(?P<Title>\w+)( of)?\.)?( (?P<FirstName>[A-Za-z'-]+?|[A-Za-z'-]+?\/[A-Za-z'-]+?))?( (?P<MiddleNames>[A-Za-z- ]+?))?( ".+| \(.+)?$"""

# Split each name into its named groups; names the regex cannot parse become NaN.
parsed_names = training_data['Name'].str.extract(name_regex)
training_data = training_data.assign(
    NameTitle = parsed_names['Title'],
    FirstName = parsed_names['FirstName'],
    MiddleNames = parsed_names['MiddleNames'],
    LastName = parsed_names['LastName'],
)
del training_data['Name']

parsed_names = test_data['Name'].str.extract(name_regex)
test_data = test_data.assign(
    NameTitle = parsed_names['Title'],
    FirstName = parsed_names['FirstName'],
    MiddleNames = parsed_names['MiddleNames'],
    LastName = parsed_names['LastName'],
)
del test_data['Name']
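
To make the behavior of `name_regex` concrete, here is a minimal sketch that applies the same pattern to three names in the style of the dataset's `Name` column. It uses the standard-library `re.match` instead of pandas' `str.extract` so that it runs on its own; since the pattern is anchored with `^`, the captured groups should be the same. The helper `parse_name` and the constant `NAMED_GROUPS` are hypothetical names introduced for this illustration and are not part of the notebook.

```python
import re

# Same pattern as name_regex in the cell above.
name_regex = r"""^(?P<LastName>[A-Za-z '-]+?), ((the )?(?P<Title>\w+)( of)?\.)?( (?P<FirstName>[A-Za-z'-]+?|[A-Za-z'-]+?\/[A-Za-z'-]+?))?( (?P<MiddleNames>[A-Za-z- ]+?))?( ".+| \(.+)?$"""

# Hypothetical constant: the named groups the notebook keeps as columns.
NAMED_GROUPS = ('Title', 'FirstName', 'MiddleNames', 'LastName')


def parse_name(name):
    """Return the named groups for one name, or None if it does not match."""
    match = re.match(name_regex, name)
    if match is None:
        return None
    # Optional groups that did not take part in the match come back as None.
    return {group: match.group(group) for group in NAMED_GROUPS}


samples = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
]
for name in samples:
    print(name, '->', parse_name(name))
```

The trailing parenthesized or quoted part of a name is matched but not captured by any named group, so it does not end up in `MiddleNames`.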

In [None]:
deck_regex = r"""^(?P<Deck>[A-Z])(?P<CabinNumber>\d+)"""

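# The regex is anchored at the start, so only the first cabin of a multi-cabin
# entry is captured; entries that do not start with a letter and a number yield NaN.
parsed_decks = training_data['Cabin'].str.extract(deck_regex)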
training_data = training_data.assign(
    Deck = parsed_decks['Deck'],
    CabinNumber = parsed_decks['CabinNumber'],
)
del training_data['Cabin']

parsed_decks = test_data['Cabin'].str.extract(deck_regex)
test_data = test_data.assign(
    Deck = parsed_decks['Deck'],
    CabinNumber = parsed_decks['CabinNumber'],
)
del test_data['Cabin']
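
The following sketch shows what `deck_regex` captures for a few values in the style of the `Cabin` column. As an assumption for illustration, it uses the standard-library `re.match` rather than `str.extract`, and the helper `parse_cabin` is a hypothetical name that does not appear in the notebook.

```python
import re

# Same pattern as deck_regex in the cell above.
deck_regex = r"""^(?P<Deck>[A-Z])(?P<CabinNumber>\d+)"""


def parse_cabin(cabin):
    """Return the Deck and CabinNumber groups, or None if there is no match."""
    match = re.match(deck_regex, cabin)
    return match.groupdict() if match else None


for cabin in ['C85', 'C23 C25 C27', 'F G73', 'T']:
    print(repr(cabin), '->', parse_cabin(cabin))
```

A single cabin such as `'C85'` splits into deck `'C'` and number `'85'` (kept as a string). A multi-cabin entry contributes only its first cabin, and a value that does not begin with a letter immediately followed by digits is not matched at all, which in the notebook shows up as missing `Deck` and `CabinNumber` values.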

In [None]:
# FamilySize counts relatives aboard: siblings/spouses plus parents/children.
training_data = training_data.assign(
    FamilySize = training_data['SibSp'] + training_data['Parch'],
)
# Divide the fare by the group size; the + 1 counts the passenger themselves.
training_data = training_data.assign(
    FarePerPerson = training_data['Fare'] / (training_data['FamilySize'] + 1),
)

# Same two features for the test set.
test_data = test_data.assign(
    FamilySize = test_data['SibSp'] + test_data['Parch'],
)
test_data = test_data.assign(
    FarePerPerson = test_data['Fare'] / (test_data['FamilySize'] + 1),
)

In [None]:
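# WithFamily flags passengers travelling with at least one relative.
training_data = training_data.assign(
    WithFamily = (training_data['SibSp'] + training_data['Parch']) > 0,
)

# Compute the test-set flag from the test set's own columns.
test_data = test_data.assign(
    WithFamily = (test_data['SibSp'] + test_data['Parch']) > 0,
)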
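
As a worked example of the arithmetic behind `FamilySize`, `FarePerPerson`, and `WithFamily`, the sketch below computes the three values for a single passenger in plain Python. The function `family_features` and the sample numbers are hypothetical and chosen only for illustration; the notebook itself computes these column-wise with pandas.

```python
def family_features(fare, sibsp, parch):
    """Compute the three family-related features for one passenger."""
    family_size = sibsp + parch  # relatives aboard, excluding the passenger
    return {
        'FamilySize': family_size,
        # The + 1 adds the passenger themselves to the group size.
        'FarePerPerson': fare / (family_size + 1),
        'WithFamily': family_size > 0,
    }


# A passenger with one sibling/spouse and one parent/child aboard.
print(family_features(fare=30.0, sibsp=1, parch=1))

# A passenger travelling alone keeps the full fare.
print(family_features(fare=8.05, sibsp=0, parch=0))
```

Because of the `+ 1`, a passenger with no relatives aboard has a divisor of 1, so `FarePerPerson` equals `Fare` and no division by zero can occur.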

In [None]:
training_data

In [None]:
print('Feature Count:', len(training_data.keys()))

In [None]:
training_data.to_csv('train_processed.csv', index=False)
test_data.to_csv('test_processed.csv', index=False)