## Decision Tree Classifier

A decision tree is a popular and simple machine learning algorithm. It reaches a prediction by testing one feature at each level of the tree. Every internal node represents a decision on a feature value, and each branch leads either to another decision node or to a leaf node, which holds the final prediction.
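
To make the idea of a node "choice" concrete, here is a minimal sketch of how one candidate split could be scored with Gini impurity, the default criterion of scikit-learn's `DecisionTreeClassifier`. The helper names (`gini`, `best_threshold`) and the toy data are hypothetical and are not part of this notebook's pipeline.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) for the best split `value <= threshold`."""
    best = (None, float("inf"))
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy, made-up data: one numeric feature and a binary label.
ages = [4, 8, 25, 30, 41, 60]
survived = [1, 1, 0, 0, 0, 0]
threshold, impurity = best_threshold(ages, survived)
print(threshold, impurity)  # 8 0.0
```

A real tree repeats this search over every feature at every node and stops when a node is pure or a depth or size limit is reached.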

In [75]:
import pandas as pd 
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

In [76]:
# Load the preprocessed train/test splits

xtrain = "data/X_train.csv"
X_train = pd.read_csv(xtrain)

xtest = "data/X_test.csv"
X_test = pd.read_csv(xtest)

ytrain = "data/y_train.csv"
y_train = pd.read_csv(ytrain)

ytest = "data/y_test.csv"
y_test = pd.read_csv(ytest)

In [77]:
# Fit the decision tree classifier on the training data
# Get predictions and evaluate the result

treeCls = DecisionTreeClassifier()
treeCls = treeCls.fit(X_train, y_train)

prediction = treeCls.predict(X_test)
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.80      0.83      0.81       105
           1       0.74      0.70      0.72        74

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



## Random Forest Classifier

A random forest is another machine learning algorithm that uses trees for classification and regression. It works by combining multiple decision trees, each trained on a random sample of the training data and considering only a random subset of features at each split, rather than all of the data and features at once. The final decision is made by a majority vote of the trees for classification, or by averaging the trees' outputs for regression.
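
The two ingredients described above, random resampling and vote aggregation, can be sketched in plain Python. This is an illustration only: the function names and the votes are hypothetical, and scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as each tree in a forest might see."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the most common class among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))  # stand-in for ten training rows
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    # Duplicates are likely, so some rows are usually left out of each sample.
    print(sorted(sample))

# Hypothetical votes from five trees for one passenger.
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

Because each tree sees a slightly different sample, the trees make different errors, and aggregating them tends to cancel some of those errors out.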

In [78]:
# Fit the random forest classifier on the training data
# Get predictions and evaluate the result

rdmFrst = RandomForestClassifier()
rdmFrst = rdmFrst.fit(X_train, y_train.to_numpy().ravel())
prediction = rdmFrst.predict(X_test)
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.87      0.88      0.87       105
           1       0.82      0.81      0.82        74

    accuracy                           0.85       179
   macro avg       0.84      0.84      0.84       179
weighted avg       0.85      0.85      0.85       179



The Random Forest model outperformed the Decision Tree, reaching 0.85 accuracy on the test set versus 0.78. This improvement can be attributed to Random Forest's use of multiple decision trees trained on different subsets of the data. By aggregating the results from many trees, Random Forest reduces the overfitting that is often seen in individual decision trees, leading to better generalization on the test data.

## Linear Support Vector Classification

The linear SVC algorithm classifies data by finding a hyperplane that best separates the classes. Each data instance can be represented as a point in space based on its features, and the hyperplane divides that space so that instances on either side receive different class labels. The hyperplane, i.e. the decision boundary, depends directly on the support vectors, which are the critical points closest to it. `LinearSVC` is an SVC variant restricted to a linear kernel and optimized for speed and scalability.
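
As a rough illustration of the hyperplane idea, the sketch below evaluates a fixed, hypothetical hyperplane on a few made-up points and computes the hinge loss, the standard SVM loss. It does not train anything, and note that scikit-learn's `LinearSVC` uses the squared hinge loss by default.

```python
def decision_value(weights, bias, point):
    """Signed score w.x + b; its sign tells which side of the hyperplane the point is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def hinge_loss(weights, bias, point, label):
    """Hinge loss for a label in {-1, +1}; zero when the point is on the correct side, outside the margin."""
    return max(0.0, 1.0 - label * decision_value(weights, bias, point))

# Hypothetical hyperplane x1 - x2 = 0 in two dimensions.
weights, bias = [1.0, -1.0], 0.0
examples = [((3.0, 1.0), 1), ((0.5, 0.0), 1), ((1.0, 2.0), 1)]
for point, label in examples:
    side = 1 if decision_value(weights, bias, point) >= 0 else -1
    print(point, side, hinge_loss(weights, bias, point, label))
```

The first point is classified correctly with a comfortable margin (loss 0.0), the second is correct but inside the margin (loss 0.5), and the third is on the wrong side (loss 2.0). Only points with non-zero loss or exactly on the margin influence the fitted boundary, which is why the support vectors matter.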

## Traditional SVC

In [79]:
tradSvc = SVC(class_weight='balanced')
tradSvc = tradSvc.fit(X_train, y_train.to_numpy().ravel())
prediction = tradSvc.predict(X_test)
print(classification_report(y_test, prediction, zero_division=1))

              precision    recall  f1-score   support

           0       0.58      0.47      0.52       105
           1       0.40      0.51      0.45        74

    accuracy                           0.49       179
   macro avg       0.49      0.49      0.48       179
weighted avg       0.51      0.49      0.49       179



## Linear SVC

In [80]:
linSvc = LinearSVC()
linSvc = linSvc.fit(X_train, y_train.to_numpy().ravel())
prediction = linSvc.predict(X_test)
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.80      0.86      0.83       105
           1       0.78      0.70      0.74        74

    accuracy                           0.79       179
   macro avg       0.79      0.78      0.78       179
weighted avg       0.79      0.79      0.79       179



The linear SVC model greatly outperforms the traditional SVC, with 0.79 accuracy versus 0.49. The traditional SVC, which uses an RBF kernel by default, generalizes poorly on this data. The linear model also handles the unbalanced classes better, even though only the traditional SVC was given balanced class weights.

## K-Nearest Neighbors Classifier

The KNN algorithm makes predictions based on the distance between data points. It stores all of the training data; when a new data point is introduced, it computes the distance from that point to every training point and assigns the class that is most common among the k nearest neighbors.
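
The procedure described above fits in a few lines of plain Python. This is a minimal sketch with made-up data and a hypothetical `knn_predict` helper; it uses Euclidean distance and `k=5` as the default to mirror the defaults of `KNeighborsClassifier`, but it is not how scikit-learn implements the search.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, made-up data: two clusters in two dimensions.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (2, 2), k=3))  # 0
print(knn_predict(points, labels, (7, 9), k=3))  # 1
```

Because the prediction depends entirely on distances, features with large numeric ranges can dominate the result unless the data is scaled.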

In [81]:
knn = KNeighborsClassifier()
knn = knn.fit(X_train, y_train.to_numpy().ravel())
prediction = knn.predict(X_test)
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.61      0.76      0.68       105
           1       0.47      0.30      0.36        74

    accuracy                           0.57       179
   macro avg       0.54      0.53      0.52       179
weighted avg       0.55      0.57      0.55       179



## Gaussian Naive Bayes

Gaussian Naive Bayes is a classification algorithm based on Bayes’ theorem. It assumes that the features are conditionally independent given the class and that each continuous feature follows a normal (Gaussian) distribution within each class. It calculates the probability of each class given the input features by combining the prior probability of the class with the likelihood of the features under that Gaussian model. The class with the highest posterior probability is chosen as the predicted class. It is fast and efficient, and it works well with continuous data, even on small datasets.
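
The prior-times-likelihood calculation can be worked through for a single continuous feature. The class statistics below are hypothetical and chosen only for illustration; with several features, the per-feature likelihoods would be multiplied together under the independence assumption.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, priors, means, variances):
    """Posterior P(class | x) for a single continuous feature."""
    joint = {c: priors[c] * gaussian_pdf(x, means[c], variances[c]) for c in priors}
    total = sum(joint.values())
    return {c: value / total for c, value in joint.items()}

# Hypothetical per-class statistics for one feature.
priors = {0: 0.6, 1: 0.4}
means = {0: 30.0, 1: 20.0}
variances = {0: 25.0, 1: 25.0}

probs = posterior(22.0, priors, means, variances)
predicted = max(probs, key=probs.get)
print(probs, predicted)
```

Here class 1 wins even though its prior is lower, because the observed value 22.0 is much closer to that class's mean.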

In [82]:
gaus = GaussianNB()
gaus = gaus.fit(X_train, y_train.to_numpy().ravel())
prediction = gaus.predict(X_test)
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.86      0.75      0.80       105
           1       0.70      0.82      0.76        74

    accuracy                           0.78       179
   macro avg       0.78      0.79      0.78       179
weighted avg       0.79      0.78      0.78       179



## CatBoost Classifier

CatBoost Classifier is a high-performance, open-source machine learning algorithm based on gradient boosting. Gradient boosting is a powerful machine learning technique used for both classification and regression problems. It builds an ensemble of decision trees sequentially, where each new tree tries to correct the errors made by the previous ones. Instead of simply averaging the outputs as in random forests, gradient boosting fits each new tree to the residuals (the differences between actual and predicted values) of the trees built so far.
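
The residual-fitting loop can be sketched with one-split trees (stumps) on a toy regression problem. This is an illustration under simplifying assumptions, not CatBoost's algorithm: it uses squared error, for which the negative gradient is the residual, whereas classifiers typically fit trees to the gradient of a loss such as log loss. All names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the one-feature threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return a predictor built by repeatedly fitting stumps to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predict = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# Toy, made-up data with a jump between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(xs, ys)
mse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline_mse = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
print(baseline_mse, mse)
```

The training error after boosting is lower than the error of the constant mean prediction, since each round removes part of what the previous rounds got wrong. The learning rate scales each correction so that no single tree dominates.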

In [83]:
catb = CatBoostClassifier(verbose=0)
catb = catb.fit(X_train, y_train)
prediction = catb.predict(X_test)
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85       105
           1       0.81      0.74      0.77        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179



## Best algorithm based on accuracy

In [84]:
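def best(models, X_train, y_train, X_test, y_test):
    """Fit each model and return (name, fitted model) with the highest test accuracy."""
    best_accuracy = float('-inf')
    best_model = None  # stays None if `models` is empty
    for name, model in models.items():
        model = model.fit(X_train, y_train)
        prediction = model.predict(X_test)
        accuracy = accuracy_score(y_test, prediction)
        # `>=` means that a later model wins a tie
        if accuracy >= best_accuracy:
            best_accuracy = accuracy
            best_model = name, model
    return best_model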

In [85]:
models = {"Random Forest" : RandomForestClassifier(),
          "Linear SVC" : LinearSVC(),
          "KNN" : KNeighborsClassifier(),
          "Gaussian Naive Bayes" : GaussianNB(),
          "CatBoost" : CatBoostClassifier(verbose=0)}

name, model = best(models, X_train, y_train.to_numpy().ravel(), X_test, y_test)

print(name)
print(model.score(X_test, y_test))


Random Forest
0.8491620111731844


## Conclusion

Based on the above comparison, the Random Forest algorithm achieved the highest test accuracy, approximately 85%, making it the most effective of the models compared for predicting Titanic survival on this test split.