**TITANIC SURVIVAL PREDICTION**

Introduction:

In this beginner-friendly project, I will delve into the Titanic dataset to build a predictive model that can determine whether a passenger survived the disaster or not. By exploring various attributes such as passenger class, age, gender, and embarkation point, I aim to uncover patterns and insights that might shed light on the factors that influenced survival rates.

I will follow a structured approach, starting with data loading and exploration, followed by preprocessing to handle missing values and convert categorical variables into numeric form. Then, I'll split the dataset into training and testing sets, build a logistic regression model, and evaluate its performance using accuracy, a confusion matrix, and a classification report.

Through this project, I hope not only to gain hands-on experience in data preprocessing and predictive modeling but also to pay homage to the individuals who were affected by this tragic event. Let's embark on this journey to uncover insights from the Titanic dataset and build a model that can predict survival outcomes with reasonable accuracy.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Load the dataset
titanic_data = pd.read_csv("/content/Titanic-Dataset.csv")

In [None]:
# Explore the data
print(titanic_data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [None]:
# Data preprocessing
titanic_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True) # Drop unnecessary columns
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1}) # Convert 'Sex' to numeric
titanic_data['Embarked'] = titanic_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}) # Convert 'Embarked' to numeric
titanic_data.fillna(titanic_data.mean(), inplace=True) # Fill missing values with mean
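
One caveat about the mean fill above: it is applied to every column, so any missing value in the encoded 'Embarked' column would receive a fractional code that matches no port. The sketch below shows one possible alternative, filling the numeric column with its median and the categorical column with its most frequent label before encoding. The small frame and its values are made up for illustration and are not rows from the Titanic CSV.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (hypothetical values, not rows from the Titanic CSV)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Count missing values per column before deciding how to fill them
missing_before = df.isnull().sum()

# Numeric column: fill with the median, which is less sensitive to outliers
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill with the most frequent label, then encode
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(missing_before)
print(df)
```

Switching the notebook to this kind of imputation would change the fitted model, so the metrics reported further down would not necessarily stay the same.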

In [None]:
# Split the data into features and target variable
X = titanic_data.drop('Survived', axis=1)
y = titanic_data['Survived']

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Build and train the model
model = LogisticRegression()
model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
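
The warning above suggests two remedies: scale the data or raise max_iter. A minimal sketch of both, assuming a scikit-learn pipeline with StandardScaler, is shown below. It runs on synthetic data from make_classification rather than the Titanic CSV, and the names X_demo, y_demo, and scaled_model are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Titanic CSV): 7 features, binary target
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=42
)
# Stretch one column so the features sit on very different scales,
# loosely mimicking a column like Fare next to a column like Pclass
X_demo[:, 0] *= 100.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Standardize each feature, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

# n_iter_ reports how many solver iterations were actually used
iterations_used = int(scaled_model.named_steps["logisticregression"].n_iter_[0])
demo_accuracy = scaled_model.score(X_te, y_te)
print("Iterations used:", iterations_used)
print("Held-out accuracy on the synthetic data:", demo_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into preprocessing.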


In [None]:
# Make predictions
y_pred = model.predict(X_test)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)

In [None]:
print("Accuracy:", accuracy)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_rep)

Accuracy: 0.7988826815642458
Confusion Matrix:
 [[89 16]
 [20 54]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.85      0.83       105
           1       0.77      0.73      0.75        74

    accuracy                           0.80       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179



The output shows the evaluation metrics of the logistic regression model applied to the Titanic dataset:

Accuracy: 0.7988826815642458 (approximately 79.89%)
  This metric is the proportion of correct predictions out of all predictions made. Here the model predicted the survival outcome correctly for 143 of the 179 passengers in the test dataset, or about 79.89%.

Confusion Matrix:
[[89 16]
[20 54]]
  The confusion matrix provides a breakdown of correct and incorrect predictions made by the model. In scikit-learn, rows are the actual classes and columns are the predicted classes, so with labels 0 (did not survive) and 1 (survived) the four terms are:
    - True Negative (TN): 89 (Predicted as not survived and actually did not survive)
    - False Positive (FP): 16 (Predicted as survived but actually did not survive)
    - False Negative (FN): 20 (Predicted as not survived but actually survived)
    - True Positive (TP): 54 (Predicted as survived and actually survived)
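
To read the four terms off the matrix without guessing the layout, it helps to remember that scikit-learn's confusion_matrix puts actual classes in rows and predicted classes in columns, so for labels 0 and 1 the flattened order is TN, FP, FN, TP. The sketch below rebuilds label lists that reproduce the reported counts (the ordering of passengers is invented; only the counts come from the output) and unpacks them with ravel().

```python
from sklearn.metrics import confusion_matrix

# Label lists that reproduce the reported counts
# (the passenger ordering is made up; only the counts come from the output)
y_true_demo = [0] * 105 + [1] * 74
y_pred_demo = [0] * 89 + [1] * 16 + [0] * 20 + [1] * 54

cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()  # row-major order: [[TN, FP], [FN, TP]]

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("Accuracy from counts:", (tn + tp) / cm.sum())
```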

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.85      0.83       105
           1       0.77      0.73      0.75        74

    accuracy                           0.80       179
   macro avg       0.79      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179

  The classification report provides precision, recall, and F1-score for each class (0 - not survived, 1 - survived), as well as their averages. Precision measures the proportion of true positive predictions out of all positive predictions made, recall measures the proportion of true positive predictions out of all actual positive instances, and F1-score is the harmonic mean of precision and recall. The report also includes support, which represents the number of instances of each class in the test set.
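
As a worked check of these definitions, the per-class figures in the report can be recomputed from the four confusion-matrix counts. This is a small sketch in plain Python; the helper name prf is made up for illustration.

```python
# Counts from the confusion matrix (scikit-learn layout: [[TN, FP], [FN, TP]])
tn, fp, fn, tp = 89, 16, 20, 54


def prf(true_hits, false_alarms, misses):
    """Return (precision, recall, f1) for one class from its raw counts."""
    precision = true_hits / (true_hits + false_alarms)
    recall = true_hits / (true_hits + misses)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Class 1 (survived): hits are TP, false alarms are FP, misses are FN
p1, r1, f1_1 = prf(tp, fp, fn)
# Class 0 (did not survive): roles swap, so hits are TN, false alarms are FN, misses are FP
p0, r0, f1_0 = prf(tn, fn, fp)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
```

The rounded values match the two class rows of the report: 0.82, 0.85, 0.83 for class 0 and 0.77, 0.73, 0.75 for class 1.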

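Overall, the logistic regression model performs reasonably well, with an accuracy of approximately 79.89% on the test set. Recall is lower for survivors (0.73) than for non-survivors (0.85), so the model misses roughly a quarter of the passengers who actually survived. The convergence warning raised during training also suggests that scaling the features or increasing max_iter is worth trying as a first improvement.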
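
One way to put the accuracy figure in context is to compare it with a baseline that always predicts the majority class. The sketch below uses only the support values and the accuracy printed above; the variable names are illustrative.

```python
# Support values from the classification report
support = {0: 105, 1: 74}
model_accuracy = 0.7988826815642458  # accuracy printed by the notebook

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = support[majority_class] / total

print("Majority class:", majority_class)
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Improvement over baseline: {model_accuracy - baseline_accuracy:.4f}")
```

Always predicting "did not survive" would be right for 105 of the 179 test passengers (about 58.66%), so the logistic regression model improves on that baseline by roughly 21 percentage points.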