<h3>Initializing the model selection and Data variables</h3>
<h5><strong>model {string}:</strong> Indicates which classification model will be used in this notebook ('logistic', 'sgdc' or 'tree').</h5>
<h5><strong>x {Numpy.Array}:</strong> The Training Set, processed and ready to train the Classification Model in this notebook.</h5>
<h5><strong>y {Numpy.Array}:</strong> The "Survived" column that we want to predict given x.</h5>
<h5><strong>X {Numpy.Array}:</strong> The processed Input Data Set that we feed to the model to predict "Survived".</h5>

In [120]:
model = 'tree'
clf = None

x = None
y = None
X = None

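# Check each variable against None individually; `x and y and X` raises
# a ValueError once these hold NumPy arrays (ambiguous truth value).
is_data_processed = all(value is not None for value in (x, y, X))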

<h3>Importing and initializing Data preprocessing libraries</h3>

In [121]:
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
min_max_scaler = MinMaxScaler()
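
As a quick illustration of what the two objects above do, here is a minimal sketch on a made-up array (the values and the names `toy`, `toy_imputer` and `toy_scaler` are hypothetical and not part of the notebook data). Mean imputation replaces each NaN with its column mean, and MinMax scaling then maps each column linearly onto the range 0 to 1.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): two columns, one missing entry in each.
toy = np.array([[20.0, 10.0],
                [np.nan, 30.0],
                [40.0, np.nan]])

# Mean imputation: each NaN is replaced by the mean of its own column.
toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = toy_imputer.fit_transform(toy)

# MinMax scaling: each column is mapped linearly so min -> 0 and max -> 1.
toy_scaler = MinMaxScaler()
scaled = toy_scaler.fit_transform(filled)

print(filled)
print(scaled)
```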

<h3>Loads CSV files into DataFrames and removes unused Data Columns</h3>
<h5><strong>unused_columns {list}:</strong> columns to be removed from the Data Sets because the model does not use them.</h5>
<h5><strong>test_df {DataFrame}:</strong> Input Data Set.</h5>
<h5><strong>train_df {DataFrame}:</strong> Training Set.</h5>
<h5><strong>dfs {Dict}:</strong> dictionary containing both the Input Data Set (test_df) and the Training Data Set (train_df).</h5>

In [3]:
import pandas as pd
import os

unused_columns = ['Name', 'Ticket', 'Cabin', 'SibSp', 'Parch', 'Embarked']

test_df = pd.read_csv(os.getcwd() + '/test.csv').drop(columns=unused_columns).convert_dtypes()
train_df = pd.read_csv(os.getcwd() + '/train.csv').drop(columns=unused_columns).convert_dtypes()

dfs = { 'test_df' : test_df , 'train_df' : train_df }

<h3>Data Preprocessing</h3>
<h5><strong>1 -</strong> Fills the missing values (NaN) in the "Age" and "Fare" columns with the column average.</h5>
<h5><strong>2 -</strong> Converts the "Age" column from Float to Integer.</h5>
<h5><strong>3 -</strong> Uses one-hot encoding to deal with the categorical values in the "Sex" and "Pclass" columns.</h5>
<h5><strong>4 -</strong> Updates each DataFrame inside the Dictionary with the preprocessed Data.</h5>
<h5><strong>5 -</strong> Applies MinMax scaling to all numerical values in the Data Set so that they share the same scale, and separates the "Survived" column into the variable 'y'.</h5>
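
Step 3 can be illustrated on a tiny made-up frame. This is only a sketch: `toy_df` and its values are hypothetical, and it passes `columns=` to `pd.get_dummies` to encode both columns in one call, whereas the notebook encodes each column separately. The `.astype(int)` is there because recent pandas versions return boolean dummy columns.

```python
import pandas as pd

# Hypothetical three-passenger frame with the two categorical columns.
toy_df = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                       'Pclass': [3, 1, 3]})

# One indicator column per category value, prefixed with the source column name.
encoded = pd.get_dummies(toy_df, columns=['Sex', 'Pclass']).astype(int)

print(encoded)
```

Only the categories that actually occur get a column, so a class value missing from one Data Set would leave that set with fewer columns than the other.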

In [123]:
if not is_data_processed:
    
    for key in dfs:
        
        #1
        imputer.fit(dfs[key][['Age', 'Fare']])
        dfs[key][['Age', 'Fare']] = imputer.transform(dfs[key][['Age', 'Fare']])

        #2
        dfs[key]['Age'] = dfs[key]['Age'].astype(int)

        #3
        encoded_sex = pd.get_dummies(dfs[key]['Sex'], prefix='Sex')
        encoded_pclass = pd.get_dummies(dfs[key]['Pclass'], prefix='Pclass')

        #4
        # Keep 'Survived' only where it exists (the Training Set)
        kept_columns = ['PassengerId'] + (['Survived'] if 'Survived' in dfs[key].columns else [])
        dfs[key] = dfs[key][kept_columns].join(encoded_sex).join(encoded_pclass).join(dfs[key][['Age', 'Fare']])
        
    #5
    # Fit the scaler on the Training Set only, then reuse the same scaling for the Input Data Set
    x = min_max_scaler.fit_transform(dfs['train_df'].drop(columns=['Survived']))
    # 'Survived' is already 0/1, so it is used as integer class labels without scaling
    y = dfs['train_df']['Survived'].to_numpy(dtype=int)
    X = min_max_scaler.transform(dfs['test_df'])
        
    is_data_processed = True
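
To show why it matters which Data Set the scaler is fitted on, here is a small sketch with made-up fares (the arrays and the names `shared_scaler`, `test_shared` and `test_refit` are hypothetical). Fitting on the training values and reusing that scaler keeps both sets on the same scale, whereas refitting on the test values stretches them over the full 0 to 1 range regardless of their size.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: training fares span 0..100, test fares 25..75.
train_fares = np.array([[0.0], [50.0], [100.0]])
test_fares = np.array([[25.0], [75.0]])

# Scaler fitted on the training data, then reused for the test data.
shared_scaler = MinMaxScaler().fit(train_fares)
test_shared = shared_scaler.transform(test_fares)

# Scaler refitted on the test data alone: min and max come from the test set.
test_refit = MinMaxScaler().fit_transform(test_fares)

print(test_shared.ravel())
print(test_refit.ravel())
```

With the shared scaler a fare of 25 maps to 0.25, the same value it would get in the training data. With the refitted scaler it maps to 0, which a model trained on the training scale would read as the cheapest fare.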

In [124]:
dfs['train_df'].head()

Unnamed: 0,PassengerId,Survived,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Age,Fare
0,1,0,0,1,0,0,1,22,7.25
1,2,1,1,0,1,0,0,38,71.2833
2,3,1,1,0,0,0,1,26,7.925
3,4,1,1,0,1,0,0,35,53.1
4,5,0,0,1,0,0,1,35,8.05


<h3>Preview of the preprocessed Input Data Set (before MinMax scaling)</h3>

In [125]:
dfs['test_df'].head()

Unnamed: 0,PassengerId,Sex_female,Sex_male,Pclass_1,Pclass_2,Pclass_3,Age,Fare
0,892,0,1,0,0,1,34,7.8292
1,893,1,0,0,0,1,47,7.0
2,894,0,1,0,1,0,62,9.6875
3,895,0,1,0,0,1,27,8.6625
4,896,1,0,0,0,1,22,12.2875


<h3>Logistic Regression</h3>
<h5>Using 5-fold cross-validation (cv = 5)</h5>

In [126]:
if model == 'logistic':
    from sklearn.linear_model import LogisticRegressionCV
    
    clf = LogisticRegressionCV(cv=5)
    clf.fit(x, y)
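
For readers unfamiliar with 5-fold cross-validation, the sketch below shows the idea on synthetic data (generated with `make_classification`, not the Titanic set; `features`, `labels` and `fold_scores` are hypothetical names). The data is split into five parts, and each part is used once for validation while the other four are used for training. Note that `LogisticRegressionCV` uses its folds internally to select the regularization strength, while this sketch uses `cross_val_score` to report one accuracy per fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration only).
features, labels = make_classification(n_samples=200, n_features=8, random_state=0)

# cv=5 trains five models, each validated on a different fifth of the data.
fold_scores = cross_val_score(LogisticRegression(max_iter=1000), features, labels, cv=5)

print(fold_scores)
print(fold_scores.mean())
```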

<h3>Stochastic Gradient Descent</h3>
<h5>Loss = Huber</h5>
<h5>Penalty = L2</h5>
<h5>Maximum iterations = 1500</h5>
<h5>Expected accuracy = 0.7867564534231201</h5>

In [127]:
if model == 'sgdc':
    from sklearn.linear_model import SGDClassifier
    
    clf = SGDClassifier(loss='huber', penalty='l2', max_iter=1500)
    clf.fit(x, y)

<h3>Decision Tree Classification</h3>
<h4>Best-scoring model so far</h4>
<h5>random_state = 0</h5>
<h5>max_depth = 3 (avoid increasing max_depth beyond 3 because it causes overfitting)</h5>
<h5>Expected accuracy = 0.8237934904601572</h5>

In [128]:
if model == 'tree':
    from sklearn import tree
    
    clf = tree.DecisionTreeClassifier(random_state=0, max_depth=3)
    clf.fit(x, y)
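
The overfitting concern behind the max_depth limit can be checked by comparing accuracy on the training data with accuracy on held-out data. The sketch below does this on synthetic, deliberately noisy data (from `make_classification`, not the Titanic set; the split sizes and all variable names are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y randomly reassigns a share of the labels (noise).
features, labels = make_classification(n_samples=400, n_features=8,
                                       flip_y=0.2, random_state=0)
train_x, valid_x, train_y, valid_y = train_test_split(
    features, labels, test_size=0.25, random_state=0)

results = {}
for depth in (3, None):  # None lets the tree grow until every leaf is pure
    toy_tree = DecisionTreeClassifier(random_state=0, max_depth=depth)
    toy_tree.fit(train_x, train_y)
    # Store (training accuracy, held-out accuracy) for this depth.
    results[depth] = (toy_tree.score(train_x, train_y),
                      toy_tree.score(valid_x, valid_y))

for depth, (train_acc, valid_acc) in results.items():
    print(depth, train_acc, valid_acc)
```

The unrestricted tree memorizes the training rows, so its training accuracy says little about new passengers. The held-out accuracy is the number to compare between depths.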

<h3>Perform prediction</h3>

In [129]:
output = clf.predict(X)

<h3>Saving predictions into CSV file to be submitted in Kaggle</h3>

In [130]:
# Reuse the PassengerId column already loaded in dfs['test_df'] instead of re-reading the CSV from a machine-specific path
predictions_df = pd.DataFrame({'PassengerId': dfs['test_df']['PassengerId'].astype(int).to_numpy(), 'Survived': output.astype(int)})
predictions_df.to_csv('predictions.csv', index=False)

<h3>Current Model Accuracy on the Training Set</h3>

In [131]:
accuracy = clf.score(x, y.ravel())
print(accuracy)

0.8237934904601572
