<a href="https://colab.research.google.com/github/01-Projects-In-Python/Project-TitanicDataset-SurvivingPrediction/blob/main/Code-TitanicDataset-SurvivingPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PROJECT - TITANIC DATASET - SURVIVING PREDICTION

__a. Purpose:__

Using the Titanic dataset, create a new Logistic Regression model using K-fold cross-validation (several cases, with K from 2 to 5, and the ridge (L2) penalty), and compare its classification accuracy with that of the Logistic Regression model provided in class.

__b. Objectives:__

1. Read the data.
2. Build the Logistic Regression model using the hyperparameters specified and evaluate the model's performance.
3. Compare the changes in the classification accuracy between the two Logistic Regression models.

__c. Data:__

This dataset contains the following variables:

| Variable | Description |
| --- | --- |
| Survived | Survival: 0 = No; 1 = Yes |
| Pclass | Passenger class: 1 = 1st; 2 = 2nd; 3 = 3rd |
| Age | Age |
| SibSp | Number of siblings/spouses aboard |
| Parch | Number of parents/children aboard |
| Fare | Passenger fare |
| male | Dummy variable: male passenger |
| Q | Dummy variable: port of embarkation, Queenstown |
| S | Dummy variable: port of embarkation, Southampton |

[Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic)
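
The dummy variables in the table above (`male`, `Q`, `S`) are already present in the CSV used here, so the notebook does not show how they were built. As a minimal sketch, assuming the raw data has categorical `Sex` and `Embarked` columns (an assumption; the small `raw` frame below is hypothetical), they could be derived with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical raw rows; the real preprocessing step is not shown in this notebook.
raw = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# drop_first=True removes the first category of each variable (female, C),
# which avoids perfectly collinear dummy columns.
sex = pd.get_dummies(raw["Sex"], drop_first=True, dtype=int)            # column: male
embarked = pd.get_dummies(raw["Embarked"], drop_first=True, dtype=int)  # columns: Q, S

dummies = pd.concat([sex, embarked], axis=1)
print(dummies)
```

With this encoding, a passenger with `Q = 0` and `S = 0` embarked at the dropped category (Cherbourg in the raw Kaggle data).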


## Objective 1: Read the data.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

In [None]:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive

In [None]:
train = pd.read_csv('/gdrive/My Drive/train.csv')

In [None]:
train.head()

Unnamed: 0.1,Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,male,Q,S
0,0,0,3,22.0,1,0,7.25,1,0,1
1,1,1,1,38.0,1,0,71.2833,0,0,0
2,2,1,3,26.0,0,0,7.925,0,0,1
3,3,1,1,35.0,1,0,53.1,0,0,1
4,4,0,3,35.0,0,0,8.05,1,0,1


In [None]:
# Drop the leftover index column ('Unnamed: 0'); `columns=` already implies axis=1
train.drop(columns = train.columns[0], inplace = True)

In [None]:
train.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,male,Q,S
0,0,3,22.0,1,0,7.25,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.925,0,0,1
3,1,1,35.0,1,0,53.1,0,0,1
4,0,3,35.0,0,0,8.05,1,0,1


In [None]:
train.shape

(889, 9)

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  889 non-null    int64  
 1   Pclass    889 non-null    int64  
 2   Age       889 non-null    float64
 3   SibSp     889 non-null    int64  
 4   Parch     889 non-null    int64  
 5   Fare      889 non-null    float64
 6   male      889 non-null    int64  
 7   Q         889 non-null    int64  
 8   S         889 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 62.6 KB


##### **- Model provided:**

In [None]:
# Train Test split
X_train, X_test, y_train, y_test = train_test_split(train.drop('Survived',axis=1),
                                                    train['Survived'],
                                                    test_size=0.30,
                                                    random_state=101)

In [None]:
# Model building
# C is the inverse of the regularization strength, so C=0.001 regularizes heavily
logmodel = LogisticRegression(C=0.001)
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

In [None]:
# Evaluating the model performance using the classification report (precision, recall, f1-score)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.66      0.96      0.78       163
           1       0.79      0.22      0.35       104

    accuracy                           0.67       267
   macro avg       0.73      0.59      0.56       267
weighted avg       0.71      0.67      0.61       267



In [None]:
accuracy_score = logmodel.score(X_test, y_test)
print('Model accuracy {0}'.format(accuracy_score))

Model accuracy 0.6741573033707865


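The score of 67.42% is the mean accuracy on the test data and labels. Accuracy alone overstates the quality of the classification: recall for class "1" (survived) is only 0.22, so the model misses most of the survivors in the test set.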
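
To put this accuracy in context, it can help to compare it with a majority-class baseline, that is, always predicting "0". The sketch below only reuses the class supports (163 and 104) and the accuracy printed above; the variable names are illustrative.

```python
# Class supports taken from the classification report above.
n_not_survived = 163
n_survived = 104
n_total = n_not_survived + n_survived

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_not_survived, n_survived) / n_total

# Accuracy reported by logmodel.score(X_test, y_test).
model_accuracy = 0.6741573033707865
n_correct = round(model_accuracy * n_total)

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Model accuracy:    {model_accuracy:.4f} ({n_correct} of {n_total} correct)")
```

Under these numbers the provided model is only about six percentage points above the baseline.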

## Objective 2: Build the Logistic Regression model using the hyperparameters specified and evaluate the model's performance.

__Hyperparameters:__ K-fold cross-validation (several cases, with K from 2 to 5) and the ridge (L2) penalty.
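
As a minimal, self-contained sketch of this setup, the block below runs K-fold cross-validation for K from 2 to 5 on synthetic data. The data from `make_classification` and the names `X_demo`, `y_demo`, and `mean_scores` are assumptions for illustration only, not the Titanic data or the notebook's variables. It passes the `KFold` object itself to `cross_val_score`, so the shuffled folds are the ones being scored.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with 8 numeric features (assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# L2 (ridge) is LogisticRegression's default penalty, so it is not set explicitly.
model = LogisticRegression(max_iter=500)

mean_scores = {}
for n_splits in range(2, 6):
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=1)
    # One accuracy value per fold; the model is refit for each fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=folds)
    mean_scores[n_splits] = scores.mean()
    print(f"{n_splits}-fold scores: {scores.round(3)}, mean: {scores.mean():.3f}")
```

Passing an integer to `cv` instead of a `KFold` object makes scikit-learn use unshuffled stratified folds for a classifier, which is a different split from the shuffled `KFold`.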

In [None]:
# Perform splitting:
X2 = train.drop(['Survived'], axis = 1)
y2 = train['Survived']

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.2, random_state = 101)

In [None]:
print(X2_train.shape, y2_train.shape)
print(X2_test.shape, y2_test.shape)

(711, 8) (711,)
(178, 8) (178,)


In [None]:
# Model building
lg_model = LogisticRegression(penalty = 'l2', max_iter = 500)

In [None]:
k_folds = range(2, 6)
for kf in k_folds:
  k = KFold(n_splits = kf, random_state = 1, shuffle = True)
  # Note: cv = kf passes the integer, so cross_val_score builds its own
  # (stratified, unshuffled) folds instead of using the shuffled KFold `k`.
  cv_scores = cross_val_score(lg_model, X2_train, y2_train, cv = kf)

  # Refit on each shuffled fold. class_rep and conf_matrix are overwritten on
  # every pass, so the report printed below describes the last fold only.
  for train_idx, val_idx in k.split(X2_train):
    X2_train_fold = X2_train.iloc[train_idx]
    X2_test_fold = X2_train.iloc[val_idx]
    y2_train_fold = y2_train.iloc[train_idx]
    y2_test_fold = y2_train.iloc[val_idx]

    lg_model.fit(X2_train_fold, y2_train_fold)
    predictions = lg_model.predict(X2_test_fold)
    class_rep = classification_report(y2_test_fold, predictions)
    conf_matrix = confusion_matrix(y2_test_fold, predictions)

  print('**Logistic Regression model performance with {0}-folds and penalty L2:\n'.format(kf))
  print('Classification report:\n')
  print('{0}\n'.format(class_rep))
  print('Confusion matrices:\n')
  print('{0}\n'.format(conf_matrix))
  print('Scores with {0}-folds: {1}'.format(kf, cv_scores))
  print('Model accuracy {0}\n'.format(cv_scores.mean()))

**Logistic Regression model performance with 2-folds and penalty L2:

Classification report:

              precision    recall  f1-score   support

           0       0.83      0.86      0.85       222
           1       0.75      0.71      0.73       133

    accuracy                           0.80       355
   macro avg       0.79      0.78      0.79       355
weighted avg       0.80      0.80      0.80       355


Confusion matrices:

[[191  31]
 [ 39  94]]

Scores with 2-folds: [0.78370787 0.81690141]
Model accuracy 0.8003046368096218

**Logistic Regression model performance with 3-folds and penalty L2:

Classification report:

              precision    recall  f1-score   support

           0       0.85      0.85      0.85       150
           1       0.74      0.74      0.74        87

    accuracy                           0.81       237
   macro avg       0.80      0.79      0.80       237
weighted avg       0.81      0.81      0.81       237


Confusion matrices:

[[128  22]

## Objective 3: Compare the changes in the classification accuracy between the two Logistic Regression models.

- Comparison among the Logistic Regression models with the specified hyperparameters: among the runs with different numbers of folds (K from 2 to 5), the 3-fold run performs best. It reaches an accuracy of 81%, with precision, recall, and f1-score of 85% for class "0" (did not survive) of the `Survived` variable and 74% for class "1" (survived).

- Comparison with the Logistic Regression model provided in class: the provided model reaches an accuracy of 67%, while the model with the L2 penalty and 3 folds reaches 81%. For class "0" (did not survive) of the `Survived` variable, precision and f1-score are 19 and 7 percentage points higher, respectively (0.85 vs. 0.66 and 0.85 vs. 0.78). Because `LogisticRegression` applies the L2 penalty by default, both models are ridge-penalized; the provided model's much smaller `C` (0.001, versus the default of 1.0) likely accounts for much of the gap.
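
The precision, recall, and f1-score values compared above can be recomputed directly from a confusion matrix. As a small sketch, the block below uses the matrix printed for the 2-fold run (rows are actual classes, columns are predicted classes); the helper `per_class_metrics` is an illustrative name, not part of scikit-learn.

```python
import numpy as np

# Confusion matrix printed for the 2-fold run: rows = actual, columns = predicted.
conf = np.array([[191, 31],
                 [39, 94]])

def per_class_metrics(conf, cls):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    tp = conf[cls, cls]
    precision = tp / conf[:, cls].sum()  # correct among everything predicted as cls
    recall = tp / conf[cls, :].sum()     # correct among everything actually cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the share of correct predictions (the diagonal) over all samples.
accuracy = np.trace(conf) / conf.sum()

for cls in (0, 1):
    p, r, f = per_class_metrics(conf, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The printed values agree with the 2-fold classification report shown earlier.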
