### Machine Learning Predictions on the Titanic Dataset
In this notebook, we use an ensemble voting classifier (logistic regression and random forest) to predict the Survived column of the Titanic dataset.

### Import relevant libraries

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### Define a function to clean the data

In [2]:
def clean_data(df):
    # fill missing Embarked values with the most frequent port
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    # fill missing Age values with the mean age
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    # fill missing Fare values with the minimum fare
    df['Fare'] = df['Fare'].fillna(df['Fare'].min())
    # drop columns that are not used as features
    df = df.drop(columns=['Cabin', 'Ticket', 'PassengerId'])
    return df
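
To make the three fill rules concrete, here is a minimal sketch on a made-up four-row frame. The values are invented for illustration and are not taken from the competition data. Embarked is filled with the mode, Age with the mean, and Fare with the minimum.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up).
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S"],
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Same fill rules as in the cleaning step: mode, mean, minimum.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].min())

print(toy)
```

Assigning the result of fillna back to the column, rather than calling it with inplace=True on a selected column, avoids pandas' chained-assignment warnings.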

### Define a function to encode non-numerical data

In [3]:
def encode_data(df):
    # quantize the age column into six groups: (0, 10], (10, 20], ..., (50, inf)
    df['Age'] = pd.cut(df['Age'], [0, 10, 20, 30, 40, 50, np.inf], labels=[1, 2, 3, 4, 5, 6]).astype(int)
    # encode non-numerical values as integers
    df = df.replace({'Sex': {'male': 0, 'female': 1}, 'Embarked': {'S': 0, 'C': 1, 'Q': 2}})
    return df
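
As a quick illustration of how the age bins behave, the sketch below applies the same bin edges and labels to a handful of made-up ages. By default pd.cut uses right-closed intervals, so an age of exactly 10 lands in group 1 and an age of exactly 50 in group 5.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([0.42, 10.0, 10.5, 29.7, 50.0, 80.0])

bins = [0, 10, 20, 30, 40, 50, np.inf]
labels = [1, 2, 3, 4, 5, 6]

# Intervals are (0, 10], (10, 20], ..., (50, inf): the right edge is included.
age_group = pd.cut(ages, bins, labels=labels).astype(int)

print(age_group.tolist())  # [1, 1, 2, 3, 5, 6]
```

Because the left edge of the first bin is excluded, an age of exactly 0 would fall outside every bin and come back as a missing value, so the final integer conversion relies on all ages being strictly positive.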

### Add new columns by engineering the available features

In [4]:
regex = r"([A-Za-z]+)\."  # a run of letters followed by a period, e.g. "Mr."

title_status = {
    'Master.': 0,
    'Mrs.': 0,
    'Mr.': 0,
    'Ms.': 0,
    'Col.': 1,
    'Mme.': 0,
    'Countess.': 1,
    'Mlle.': 0,
    'Don.': 1,
    'Lady.': 0,
    'Miss.': 0,
    'Dr.': 1,
    'Sir.': 0,
    'Capt.': 1,
    'Rev.': 1,
    'Major.': 1,
    'Jonkheer.': 0,
    'Dona.': 1,
}

def get_title(name):
    # return the first word that is followed by a period, including the period (e.g. 'Mr.')
    match = re.search(regex, str(name))
    title = match.group(0)
    return title

def feature_engineer(df):
    df['FamilySize'] = df['SibSp'] + df['Parch']  # relatives aboard: siblings/spouses plus parents/children
    df['socialstatus'] = df['Name'].apply(get_title)  # extract the title from the Name column
    df = df.replace({'socialstatus': title_status})  # replace each title by an assumed social status (0 or 1)
    df = df.drop(columns=['Name'])  # once titles are extracted, drop the names
    return df
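
The title extraction can be tried in isolation. The sketch below uses the same pattern on two names in the dataset's "Surname, Title. Given names" format, plus one string without a title. The helper name extract_title and the None fallback are assumptions added for illustration; the notebook's get_title does not guard against a missing match.

```python
import re

# Same pattern as above, written as a raw string and precompiled.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")

def extract_title(name):
    """Return the first word followed by a period (period included), or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(0) if match else None

names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "No title here"]
titles = [extract_title(n) for n in names]

print(titles)  # ['Mr.', 'Miss.', None]
```

Since group(0) returns the whole match, the period is kept, which is why the keys of the title dictionary end in a period.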

### Import data and apply the data processing

In [5]:
# import the data as pandas dataframes
test  = pd.read_csv('../input/titanic/test.csv')
train = pd.read_csv('../input/titanic/train.csv')
PID = test['PassengerId']  # save the passenger IDs for the competition submission

# process the data

train = clean_data(train)
test  = clean_data(test)

train = encode_data(train)
test  = encode_data(test)

train = feature_engineer(train)
test  = feature_engineer(test)

### Split the training data into train and validation subsets to estimate model accuracy

In [6]:
X = train.drop(columns=['Survived'])
y = train['Survived']
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X, y, test_size=0.25, random_state=10)


## Ensemble Learning: Voting Classifier

Several classification models can be combined into an ensemble model. Here we use LogisticRegression and RandomForestClassifier
to create an ensemble voting classifier.

Each model within the ensemble (in this case LogisticRegression and RandomForestClassifier) learns a different decision rule from the
same training data, so the models can disagree on individual passengers. An ensemble often has lower error and overfits less than its
individual members: since each model has its own bias (or personality), these biases tend to partially average out in the ensemble.

One such ensemble method is the voting classifier, which combines the predictions of the different models. We will use soft voting,
which averages the class probabilities predicted by the models and picks the class with the highest average.
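
The averaging step can be worked through by hand. The sketch below uses made-up class probabilities for three passengers from two hypothetical models (the numbers are illustrative, not outputs of the models trained in this notebook) and assumes equal weights for both models.

```python
import numpy as np

# Made-up probabilities; columns are [P(not survived), P(survived)].
proba_model_a = np.array([[0.90, 0.10],
                          [0.40, 0.60],
                          [0.45, 0.55]])
proba_model_b = np.array([[0.60, 0.40],
                          [0.70, 0.30],
                          [0.35, 0.65]])

# Soft voting with equal weights: average the probabilities, then take the argmax.
avg_proba = np.mean([proba_model_a, proba_model_b], axis=0)
soft_vote = np.argmax(avg_proba, axis=1)

print(avg_proba)
print(soft_vote)  # [0 0 1]
```

For the second passenger the two models disagree: model A leans towards survival with probability 0.60, while model B predicts non-survival with probability 0.70. The averaged probabilities are 0.55 and 0.45, so the more confident model decides the vote.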

In [7]:
from sklearn import metrics

logistic_regression = LogisticRegression(max_iter=200)
random_forest = RandomForestClassifier(n_estimators=200)

# soft voting averages the class probabilities predicted by the two models
model = VotingClassifier(estimators=[('lr', logistic_regression), ('rf', random_forest)], voting='soft')
model.fit(X_train_m, y_train_m)
y_pred_m = model.predict(X_test_m)
print("ACCURACY OF THE MODEL: ", metrics.accuracy_score(y_test_m, y_pred_m))

ACCURACY OF THE MODEL:  0.8565022421524664


### Create the submission file for the competition

In [8]:
y_submission = model.predict(test)  # predict on the processed competition test set
output = pd.DataFrame({'PassengerId': PID, 'Survived': y_submission})
output.to_csv('submission.csv', index=False)