# Pip installing components required for the task

For this task, the following packages were installed in a Python environment:
- numpy
- pandas
- scikit-learn

In [4]:
!pip install numpy
!pip install pandas
!pip install -U scikit-learn



# Imports

In [6]:
# Imports
import numpy as np
import pandas as pd
import sklearn as sk

# Data Preparation

- The first thing I did was inspect what type of information is stored in each of the data frames: the columns, the number of non-null entries compared to the total number of entries for each column, and the data type of each column.

In [7]:
# Obtain CSV Files
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_test_comp = pd.read_csv('gender_submission.csv')

# Get info
df_train.info()
df_test.info()
df_test_comp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass  

- I noticed that the 'Cabin' column has significantly fewer non-null entries than the other columns (204, compared with 891 for every column except Age and Embarked), so I decided to drop it.
- I then noticed that 5 columns have the object data type (strings). Of those 5, Name and Ticket have too many unique values, so I dropped those columns as well.
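The two checks described above (how much of a column is missing, and how many distinct values a string column holds) can be sketched as follows. This is a minimal illustration on a made-up frame named `toy`, not the notebook's actual data; the values are invented purely to show the calls.

```python
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration).
toy = pd.DataFrame({
    "Cabin": ["C85", None, None, None],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 26.0, 35.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Number of distinct non-missing values in each string column.
unique_counts = toy[["Cabin", "Name", "Sex"]].nunique()

print(missing_share)
print(unique_counts)
```

A column with a large missing share, or a string column with nearly one distinct value per row, is a candidate for dropping under the reasoning used here.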

In [8]:
# Only 204 of the 891 rows have a Cabin value, so drop that column, along with Name and Ticket
df_train = df_train.drop(['Cabin', 'Name', 'Ticket'], axis=1)
df_test = df_test.drop(['Cabin', 'Name', 'Ticket'], axis=1)

# Show
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Fare         891 non-null    float64
 8   Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB


- With the remaining columns holding between 714 and 891 non-null entries, the next decision was whether to fill the NaN values or drop them completely.
- Two columns have fewer than 891 entries: Age (714 entries) and Embarked (889 entries).
  - For Embarked, only 2 entries are missing, so it is safer to drop those 2 rows.
  - For Age, the data is ordered by passenger ID and, realistically, passenger ages vary widely, so neighbouring rows give no hint of a missing age. I decided to drop those rows too in the code below.

In [9]:
# Drop every row that still contains a NaN value (in df_train: rows missing Age or Embarked)
df_train = df_train.dropna()
df_test = df_test.dropna()
# Show the result
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Sex          712 non-null    object 
 4   Age          712 non-null    float64
 5   SibSp        712 non-null    int64  
 6   Parch        712 non-null    int64  
 7   Fare         712 non-null    float64
 8   Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB
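For comparison with dropping rows, here is a hedged sketch of the other option mentioned above: filling missing ages instead. Median imputation is an assumption chosen for illustration, not something the notebook does, and the frame `toy` contains made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both columns (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Option used in the notebook: drop every row that has any missing value.
dropped = toy.dropna()

# Alternative: fill Age with the column median, then drop only rows missing Embarked.
filled = toy.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled = filled.dropna(subset=["Embarked"])

print(len(dropped), len(filled))
```

Dropping keeps only complete rows, while filling keeps more rows at the cost of inserting estimated ages. Which trade-off works better for this dataset is not tested here.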


- The data has already been split into training and test datasets, so the next step was to separate the 'Survived' column from the training data and keep it as the target values, 'train_label'.

In [10]:
# Split into label and feature for each
train_label = df_train['Survived']
df_train = df_train.drop('Survived', axis=1)

- The test dataset does not include the 'Survived' column, which I need as the test target for comparison. However, I noticed at the beginning that the df_test_comp dataframe has the required column. I therefore merged the 2 datasets on the 'PassengerId' column before splitting the result into the main dataset and the target.

In [12]:
# Merge the test features with their labels on PassengerId
df_test = pd.merge(df_test, df_test_comp, on='PassengerId', how='inner')
test_label = df_test['Survived']
df_test = df_test.drop(['Survived'], axis=1)
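Because rows with missing values were already dropped from df_test, the inner merge keeps labels only for the passengers that remain. The sketch below shows that behaviour on two small made-up frames (`features` and `labels` are hypothetical names). The `validate="one_to_one"` argument is an optional safeguard added here, not part of the notebook; it raises an error if a PassengerId appears more than once on either side.

```python
import pandas as pd

# Made-up stand-ins: features after dropna (894 is missing), labels for everyone.
features = pd.DataFrame({"PassengerId": [892, 893, 895], "Age": [34.5, 47.0, 27.0]})
labels = pd.DataFrame({"PassengerId": [892, 893, 894, 895], "Survived": [0, 1, 0, 0]})

# Inner join keeps only IDs present in both frames; validate guards against duplicates.
merged = pd.merge(features, labels, on="PassengerId", how="inner", validate="one_to_one")

# Split back into target and features, as in the cell above.
test_y = merged["Survived"]
test_X = merged.drop(columns=["Survived"])

print(merged)
```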

- Now, 2 columns of type object remain: Sex and Embarked.
    - Knowing that Sex has only 2 values (male and female), I replaced them with numerical values using a dictionary.
    - For Embarked, I printed the unique values stored in that column and found that there are 3. Knowing that, I replaced them with numerical values using a dictionary as well.

In [28]:
# Change Sex to numeric values: male = 0 and female = 1
df_train['Sex'] = df_train['Sex'].replace({'male':0, 'female':1})
df_test['Sex'] = df_test['Sex'].replace({'male':0, 'female':1})
df_train.info()

# Check which values are in the Embarked column
print(df_train['Embarked'].unique())

# Replace with numeric values
df_train['Embarked'] = df_train['Embarked'].replace({'S':0, 'C':1, 'Q':2})
df_test['Embarked'] = df_test['Embarked'].replace({'S':0, 'C':1, 'Q':2})

df_test.sample(20)

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 0 to 890
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Pclass       712 non-null    int64  
 2   Sex          712 non-null    int64  
 3   Age          712 non-null    float64
 4   SibSp        712 non-null    int64  
 5   Parch        712 non-null    int64  
 6   Fare         712 non-null    float64
 7   Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(1)
memory usage: 50.1+ KB
['S' 'C' 'Q']




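|     | PassengerId | Pclass | Sex | Age  | SibSp | Parch | Fare    | Embarked |
| --- | ----------- | ------ | --- | ---- | ----- | ----- | ------- | -------- |
| 108 | 1028        | 3      | 0   | 26.5 | 0     | 0     | 7.225   | 1        |
| 328 | 1304        | 3      | 1   | 28.0 | 0     | 0     | 7.775   | 0        |
| 223 | 1177        | 3      | 0   | 36.0 | 0     | 0     | 7.25    | 0        |
| 323 | 1297        | 2      | 0   | 20.0 | 0     | 0     | 13.8625 | 1        |
| 230 | 1190        | 1      | 0   | 30.0 | 0     | 0     | 45.5    | 0        |
| 267 | 1232        | 2      | 0   | 18.0 | 0     | 0     | 10.5    | 0        |
| 290 | 1261        | 2      | 0   | 29.0 | 0     | 0     | 13.8583 | 1        |
| 238 | 1200        | 1      | 0   | 55.0 | 1     | 1     | 93.5    | 0        |
| 15  | 908         | 2      | 0   | 35.0 | 0     | 0     | 12.35   | 2        |
| 117 | 1037        | 3      | 0   | 31.0 | 3     | 0     | 18.0    | 0        |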
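Coding Embarked as 0, 1 and 2 implies an order between the three ports that the data itself does not have. A common alternative, shown here only as a sketch and not as the approach used in this notebook, is one indicator column per port via `pd.get_dummies`. The frame `toy` is made up for illustration.

```python
import pandas as pd

# Toy frame with the two string columns discussed above (values are made up).
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]})

# Binary column: an explicit mapping to 0/1, same coding as in the notebook.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Nominal column with three ports: one 0/1 indicator column per value.
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded)
```

If this encoding were used, the same call would need to be applied to both the training and the test frame so that their columns match.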


# Using SVM as a Classifier

In [29]:
from sklearn import svm
from sklearn import metrics

In [30]:
svm_model = svm.SVC()
svm_model.fit(df_train, train_label)

In [31]:
# Test Data
y_pred = svm_model.predict(df_test)
svm_acc = metrics.accuracy_score(test_label, y_pred)
print(svm_acc)

0.6344410876132931
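`svm.SVC()` uses an RBF kernel by default, which is sensitive to feature scale, and here PassengerId and Fare span much larger ranges than Sex or Pclass. That may contribute to the lower score above, although this is not verified in this notebook. The sketch below shows how scaling could be added with a pipeline. It runs on synthetic data (a large-scale ID-like column plus one informative column), so the names `X`, `y` and `scaled_svm` are illustrative assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in: an ID-like column with a large range and no signal,
# plus a small-scale column that fully determines the label.
ids = np.arange(1, n + 1, dtype=float)
signal = rng.normal(size=n)
X = np.column_stack([ids, signal])
y = (signal > 0).astype(int)

# Standardise each feature to zero mean and unit variance before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X, y)
train_acc = scaled_svm.score(X, y)

print(train_acc)
```

With the real data, the pipeline would be fitted on df_train and train_label and then scored on df_test and test_label in the same way as the cells above.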


# Using Logistic Regression as a Classifier

In [32]:
# Import Logistic regression
from sklearn.linear_model import LogisticRegression

In [33]:
# Train model and obtain outputs
logistic_model = LogisticRegression(solver='saga', max_iter=100000)
logistic_model.fit(df_train, train_label)

In [34]:
# Test Data
y_pred = logistic_model.predict(df_test)
logistic_acc = metrics.accuracy_score(test_label, y_pred)
print(logistic_acc)

0.8580060422960725


# Decision trees using the Random Forest algorithm
This next part uses the Random Forest classifier, an ensemble of decision trees.

In [35]:
# Import the model
from sklearn.ensemble import RandomForestClassifier

In [37]:
# Train model
decision_model = RandomForestClassifier()
decision_model.fit(df_train, train_label)

In [39]:
# Test Model
y_pred = decision_model.predict(df_test)
decision_acc = metrics.accuracy_score(test_label, y_pred)
print(decision_acc)

0.7764350453172205
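The three fit, predict and score steps above follow the same pattern, so they could be written as one loop. The sketch below does this on synthetic data from `make_classification`, not on the notebook's data, so its scores say nothing about the results reported here. Setting `random_state` on the random forest is an added assumption: without it, the forest's accuracy can change from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the prepared features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "svm": SVC(),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```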
