### Data Cleaning | Jupyter Notebook
---
Textual data must be cleaned and/or converted into a numerical format before it can be used. Missing values in the dataset must also be handled appropriately so that they do not skew or otherwise negatively affect the model's performance.
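Before deciding how to handle missing values, it can help to count them per column. The following is a minimal sketch on a hypothetical toy frame (the rows are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names mirror the ones used in this notebook
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Sex": ["male", "female", "female"],
})

# Number of missing entries per column
missing_counts = toy.isna().sum()
# Fraction of missing entries per column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Columns with only a few gaps are candidates for imputation, while columns that are mostly empty may be better dropped.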

In [1]:
import re
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer, OrdinalEncoder
from sklearn.impute import SimpleImputer

In [2]:
# load training data
train_data = pd.read_csv('Data/train.csv')

#### Data Processing
1. Convert loaded data into a DataFrame
2. Find and replace all missing values with NaN
3. Encode textual data with numerical representations
    a. Binarize textual data (Sex)
    b. Fill empty age values with the mean (Age)
    c. Use regex to remove non-numerical characters (Ticket)
    d. Encode character data (Embarked)
4. Remove columns that were not encoded (PassengerId, Name, Cabin)

In [3]:
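train_data = pd.DataFrame(train_data)  # safeguard only: read_csv already returns a DataFrame
# replace empty or whitespace-only strings with NaN (a bare r'' pattern would match every string)
train_data = train_data.replace(r'^\s*$', np.nan, regex=True)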

LabelBinarizer (Sex):

In [4]:
lb = LabelBinarizer()
# fit_transform returns an (n, 1) array; flatten it before assigning to a single column
train_data['Sex'] = lb.fit_transform(train_data['Sex']).ravel()

SimpleImputer (Age):

In [5]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# impute missing ages with the mean, round, and flatten the (n, 1) result
train_data['Age'] = np.round(imp.fit_transform(train_data[['Age']])).ravel()

In [6]:
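# keep only the digits of each ticket string
train_data['Ticket'] = [re.sub("[^0-9]", "", i) for i in train_data['Ticket']]
# convert to float, mapping tickets that contain no digits to 0
ticket = []
for i in train_data['Ticket']:
    if i == '':
        ticket.append(float(0))
    else:
        ticket.append(float(i))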

In [7]:
train_data['Ticket'] = ticket

OrdinalEncoder (Embarked):

In [8]:
# encode each port as an ordinal code and map missing values to -1 in a single fit
# (encoded_missing_value requires scikit-learn >= 1.1)
enc = OrdinalEncoder(encoded_missing_value=-1)
train_data['Embarked'] = enc.fit_transform(train_data[['Embarked']]).ravel()

In [9]:
cleaned_data = train_data.drop(['PassengerId', 'Name', 'Cabin'], axis=1)

#### Data Shuffling & Split

In [10]:
cleaned_train_data, val_data = train_test_split(cleaned_data, test_size=0.2, random_state=42)
train_targets = cleaned_train_data['Survived'].values
train_features = cleaned_train_data.drop(['Survived'], axis=1).values
val_targets = val_data['Survived'].values
val_features = val_data.drop(['Survived'], axis=1).values
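
The split and the separation of targets from features can be illustrated on a small frame. This is a minimal sketch; the toy rows are hypothetical and only the column name Survived is taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: one target column and two feature columns
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Pclass": [3, 1, 3, 1, 2, 2, 3, 1, 3, 2],
    "Age": np.arange(20.0, 30.0),
})

# Rows are shuffled by default; random_state makes the shuffle reproducible
train_part, val_part = train_test_split(toy, test_size=0.2, random_state=42)

# Separate the target column from the remaining feature columns
train_targets = train_part["Survived"].values
train_features = train_part.drop(["Survived"], axis=1).values
val_targets = val_part["Survived"].values
val_features = val_part.drop(["Survived"], axis=1).values

print(train_features.shape, val_features.shape)
```

Each row ends up in exactly one of the two parts, so the validation rows are never seen during fitting.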

#### Train Logistic Regression Model

In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
titanic_logistic_reg = LogisticRegression(random_state=42).fit(train_features, train_targets)

In [13]:
val_features[0].reshape(1, -1)

array([[3.00000e+00, 1.00000e+00, 3.00000e+01, 1.00000e+00, 1.00000e+00,
        2.66100e+03, 1.52458e+01, 0.00000e+00]])

In [16]:
titanic_logistic_reg.predict(val_features[0:])

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

In [15]:
titanic_logistic_reg.predict_proba(val_features[0:])

array([[0.50001589, 0.49998411],
       [0.50011178, 0.49988822],
       [0.63452503, 0.36547497],
       [0.50148495, 0.49851505],
       [0.50001583, 0.49998417],
       [0.50011867, 0.49988133],
       [0.50005513, 0.49994487],
       [0.50206427, 0.49793573],
       [0.50219244, 0.49780756],
       [0.50007016, 0.49992984],
       [0.50067943, 0.49932057],
       [0.50217071, 0.49782929],
       [0.50002467, 0.49997533],
       [0.50001608, 0.49998392],
       [0.50136859, 0.49863141],
       [0.50010503, 0.49989497],
       [0.50007017, 0.49992983],
       [0.50197565, 0.49802435],
       [0.50126291, 0.49873709],
       [0.5006595 , 0.4993405 ],
       [0.50208499, 0.49791501],
       [0.50067937, 0.49932063],
       [0.50208984, 0.49791016],
       [0.50003918, 0.49996082],
       [0.50217612, 0.49782388],
       [0.50207218, 0.49792782],
       [0.50010508, 0.49989492],
       [0.50145834, 0.49854166],
       [0.50207218, 0.49792782],
       [0.50209199, 0.49790801],
       [0.

In [17]:
titanic_logistic_reg.score(val_features, val_targets)

0.5865921787709497
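
Every prediction shown above is class 0, so the score equals the share of class 0 among the validation targets. One way to make that comparison explicit is a majority-class baseline. The sketch below uses scikit-learn's DummyClassifier on hypothetical toy arrays, not on the notebook's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the features are ignored by the baseline
X_train = np.zeros((10, 2))
y_train = np.array([0] * 6 + [1] * 4)
X_val = np.zeros((5, 2))
y_val = np.array([0, 0, 0, 1, 1])

# Always predicts the most frequent class seen during fit (class 0 here)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_score = baseline.score(X_val, y_val)

# Share of class 0 in the validation targets
class0_share = float(np.mean(y_val == 0))

print(baseline_score, class0_share)
```

A model whose validation accuracy matches this baseline has not yet learned anything beyond the class balance.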