# Synthetic Titanic Data Classification
## Predictive Theory

In this section, we look at possible relationships for predicting the survival of passengers on the Titanic.

<b>Some known facts about the Titanic are:</b>

1) Most of the women and kids were rescued

2) 1st class passengers were more likely to be rescued (both men and women)

3) Most of the Titanic’s engineers were rescued
          
        
Our predictive theory uses age, gender, passenger class, whether the passenger had parents or children on board (`Parch`), and whether they had siblings or a spouse on board (`SibSp`) to predict if the passenger survived.

Looking at the relationship between survival and gender, since most of the women were evacuated, a survivor is much more likely to be a woman.

Also, the survivor list of the real Titanic shows that most of the males who survived were from the first class. So, to model this we need to know the passenger class.
 
Women travelling with children survived at a much higher rate than other women, so having children on board could significantly affect the predicted probability of survival.
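
One way to sanity-check these relationships before modelling is to compare group-wise survival rates. The sketch below uses a tiny hand-made frame (the values are illustrative and not taken from the competition data) with the same column names as the training file.

```python
import pandas as pd

# Illustrative toy data only; column names mirror the training file.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_sex_class = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

print(rate_by_sex)
print(rate_by_sex_class)
```

On the real training data, the same two `groupby` calls would show whether gender and passenger class separate survivors as strongly as the theory assumes.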

## Problems

1) Missing data in almost all the columns

2) For some columns, the mean of the data is not an appropriate value to fill in.


## Possible Solution

1) Some of the columns can be filled using fare information, as first-class tickets cost more.

2) Use the information in the other columns to try to predict the missing data.

3) Replace missing values with the mean for the columns where this is possible.
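
As a rough sketch of filling a column from related information, missing values can be replaced with a statistic computed within each passenger class instead of a single global mean. The example assumes a `Fare` column and uses made-up values; the per-class median is one possible choice and not necessarily the one applied later in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; "Fare" is an assumed column name.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare":   [80.0, np.nan, 100.0, 8.0, 10.0, np.nan],
})

# Median fare within each passenger class, aligned back to every row.
# NaN entries are ignored when the median is computed.
class_median = toy.groupby("Pclass")["Fare"].transform("median")

# Fill only the missing entries; observed fares are left untouched.
toy["Fare"] = toy["Fare"].fillna(class_median)
print(toy)
```

Grouping by class keeps a missing first-class fare from being filled with a value dominated by cheaper third-class tickets.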



### Plan of Action

1) Using the predictive theory try to model a RandomForestClassifier to predict the survival of the passenger.

### Importing the required libraries

In [None]:
import pandas as pd # For data processing 
import numpy as np # For array operations
import matplotlib.pyplot as plt # For visualizing the data

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

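from sklearn.metrics import ConfusionMatrixDisplay # plot_confusion_matrix was removed in scikit-learn 1.2
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator # Drop-in replacement taking (estimator, X, y)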
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

### Reading the csv files

In [None]:
dataset = pd.read_csv("../input/tabular-playground-series-apr-2021/train.csv")
testset = pd.read_csv("../input/tabular-playground-series-apr-2021/test.csv")

### Viewing the top of the dataset

In [None]:
dataset.head() # Viewing the top of the dataset

In [None]:
dataset.isna().sum() # Counting the missing data points in each column

### Handling the missing datapoints

In [None]:
dataset.Age = dataset.Age.fillna(np.mean(dataset.Age)) #Filling the missing datapoints

In [None]:
dataset.head() # Viewing the data after filling the NaN values

In [None]:
trainlen = len(dataset)
print("Length of the training set is: ", trainlen) # Looking at the number of data points in the dataset

### Encoding the Categorical Variables

In [None]:
encoder = OrdinalEncoder() #Here we are using Ordinal Data encoding
gender = encoder.fit_transform(np.array(dataset.Sex).reshape(-1,1)) #Transforming the categorical data

In [None]:
gender[:3] # Viewing a sample of the encoded data
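
To check which integer each category received, and to reuse the same mapping on new data, the fitted encoder can be inspected as in the sketch below. The toy values are illustrative; by default `OrdinalEncoder` assigns codes in sorted category order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Illustrative values shaped as a single-column 2-D array.
train_sex = np.array(["male", "female", "female", "male"]).reshape(-1, 1)
new_sex = np.array(["female", "male"]).reshape(-1, 1)

enc = OrdinalEncoder()
train_codes = enc.fit_transform(train_sex)  # Learns the category -> code mapping

# categories_ lists the learned categories; the position is the code.
print(enc.categories_)

# transform (not fit_transform) applies the already-learned mapping.
new_codes = enc.transform(new_sex)
print(new_codes.ravel())
```

Calling `transform` on later data guarantees that the codes mean the same thing at training and prediction time.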

### Engineering the featurevector (Input for the model)

In [None]:
feature1 = np.array(dataset["PassengerId"]).reshape(-1,1)
feature2 = np.array(dataset["Pclass"]).reshape(-1,1)
feature3 = np.array(dataset["Age"]).reshape(-1,1)
feature4 = np.array(dataset["Parch"]).reshape(-1,1)
feature5 = np.array(dataset["SibSp"]).reshape(-1,1)
feature6 = gender

featurevector = np.concatenate((feature1,feature2,feature3,feature4,feature5,feature6) , axis = 1)

In [None]:
print("The shape of the final feature vector is: ", featurevector.shape) # Viewing the shape of the featurevector

In [None]:
featurevector[:3] # Viewing the final featurevector

In [None]:
target = np.array(dataset["Survived"]) # Forming the 1-D targetvector (avoids the column-vector warning in fit)

In [None]:
x_train , x_test , y_train , y_test = train_test_split(featurevector , target , test_size = 0.10 , random_state = 6) #Splitting the data

In [None]:
#Viewing the train data after splitting
print("The shape of the final target vector is: ", y_train.shape) 
print("The shape of the final feature vector is: ", x_train.shape)

In [None]:
classifier = RandomForestClassifier(n_estimators = 600 , ccp_alpha = 0.002) # Instantiating the classifier with 600 estimators and ccp_alpha as 0.002
classifier.fit(x_train , y_train) #Training the classifier

In [None]:
plt.plot(y_train[:50])
plt.plot(classifier.predict(x_train[:50]))
plt.grid(True)
plt.title("Plot to look at the data match")
plt.xlabel("Samples --->")
plt.ylabel("Prediction --->")

In [None]:
plot_confusion_matrix(classifier , x_test , y_test) # Evaluating the model
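
Since `accuracy_score` is imported above but never used, a single-number summary can accompany the confusion matrix. The following is a self-contained sketch on synthetic data from `make_classification` (an assumption made purely for illustration); in this notebook the same call would take `y_test` and `classifier.predict(x_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature and target vectors.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.10, random_state=6)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Accuracy is the share of held-out samples predicted correctly,
# i.e. the diagonal of the confusion matrix divided by its total.
acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)
print("Held-out accuracy:", acc)
print(cm)
```

Reporting accuracy on the held-out split, and not on the training rows, gives a less optimistic picture of how the model may behave on the test file.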

In [None]:
testset.Age = testset.Age.fillna(np.mean(testset.Age)) #Filling the NaN datapoints in the testset using the mean of the column

In [None]:
feature1 = np.array(testset["PassengerId"]).reshape(-1,1)
feature2 = np.array(testset["Pclass"]).reshape(-1,1)
feature3 = np.array(testset["Age"]).reshape(-1,1)
feature4 = np.array(testset["Parch"]).reshape(-1,1)
feature5 = np.array(testset["SibSp"]).reshape(-1,1)
feature6 = encoder.transform(np.array(testset.Sex).reshape(-1,1)) # Reusing the mapping fitted on the training data

featurevector = np.concatenate((feature1,feature2,feature3,feature4,feature5,feature6) , axis = 1) # Engineering the feature vector for testing the model

In [None]:
predictions = classifier.predict(featurevector) # Inference
id_pass = testset["PassengerId"]

In [None]:
print("Number of predictions is: ", len(predictions))
print("Number of passenger ids is: ", len(id_pass))

In [None]:
data = {'PassengerId': id_pass , 'Survived' : predictions} # Making a dictionary to store the data

In [None]:
submission = pd.DataFrame(data = data) # Creating the dataframe to store the data

In [None]:
submission.head() #Viewing the submission data

In [None]:
submission.to_csv("submission.csv" , index = False) #Saving the dataframe to a csv file without the index column
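
A quick way to confirm the layout of the saved file is to read it back and inspect its columns. The sketch below uses hypothetical ids and predictions and writes to a temporary directory so that it does not overwrite the real submission.

```python
import os
import tempfile

import pandas as pd

# Hypothetical ids and predictions, for illustration only.
submission_demo = pd.DataFrame({"PassengerId": [100000, 100001], "Survived": [0, 1]})

path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
submission_demo.to_csv(path, index=False)  # index=False avoids an extra unnamed column

# Reading the file back shows exactly what a grader would see.
check = pd.read_csv(path)
print(check.columns.tolist())
print(check)
```

If the index were written as well, the file would contain a third, unnamed column ahead of `PassengerId`.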

## Conclusion
Using a simple predictive theory we have structurally engineered a supervised learning based classifier and the performance of the final model is quite good. For improving the performance of the model we can try to adjust the 