# Titanic Survival Prediction
                                                       

**Introduction:**

Welcome to this Kaggle notebook on Titanic survival prediction! In this notebook, we'll explore the famous Titanic dataset and try to predict whether a passenger survived based on features such as age, sex, and ticket class. The sinking of the Titanic is one of the most infamous maritime disasters in history, and this dataset provides a glimpse into the passengers' demographics and the factors associated with their survival.

**Problem Statement:**

The primary goal of this analysis is to build a model that predicts whether a passenger survived. We'll use logistic regression, a standard machine learning technique for binary classification. The fitted model and its evaluation metrics also give some indication of which factors were associated with survival on the Titanic.



**Data Description:**

The dataset includes information about passengers on the Titanic, such as their age, sex, passenger class (Pclass), number of siblings/spouses aboard (sibsp), number of parents/children aboard (Parch), fare, and embarkation port (Embarked). The target variable indicates whether a passenger survived (1) or not (0). It is stored in the file as '2urvived' and renamed to 'Survived' later in this notebook.

This dataset provides an excellent opportunity to explore the relationships between various features and survival outcomes.

In [167]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [168]:
# Load the data
df = pd.read_csv('/kaggle/input/titanic/train_and_test2.csv')
df.head(n=10)

Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived
0,1,22.0,7.25,0,1,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,1
2,3,26.0,7.925,1,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,1
3,4,35.0,53.1,1,1,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
4,5,35.0,8.05,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
5,6,28.0,8.4583,0,0,0,0,0,0,0,...,0,0,0,3,0,0,1.0,0,0,0
6,7,54.0,51.8625,0,0,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,0
7,8,2.0,21.075,0,3,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
8,9,27.0,11.1333,1,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,1
9,10,14.0,30.0708,1,1,0,0,0,0,0,...,0,0,0,2,0,0,0.0,0,0,1


In [169]:
# Shape of the dataset (rows, columns)
df.shape

(1309, 28)

In [170]:
# Count null values in each column
df.isnull().sum()

Passengerid    0
Age            0
Fare           0
Sex            0
sibsp          0
zero           0
zero.1         0
zero.2         0
zero.3         0
zero.4         0
zero.5         0
zero.6         0
Parch          0
zero.7         0
zero.8         0
zero.9         0
zero.10        0
zero.11        0
zero.12        0
zero.13        0
zero.14        0
Pclass         0
zero.15        0
zero.16        0
Embarked       2
zero.17        0
zero.18        0
2urvived       0
dtype: int64

In [171]:
# Handle missing values
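# Fill with the most frequent value; assign back rather than using inplace=True
# on a column selection, which is deprecated in recent pandas versions
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])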
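
To make the mode-imputation step concrete, here is a minimal sketch on a made-up toy frame (the values are hypothetical, not taken from the Titanic file). It assigns the filled column back to the frame, which behaves the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: port codes with two missing entries
toy = pd.DataFrame({"Embarked": [2.0, 0.0, 2.0, np.nan, 1.0, np.nan]})

# mode() returns a Series of the most frequent value(s); [0] takes the first
most_common = toy["Embarked"].mode()[0]

# Assign the result back to the column instead of filling in place
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print("Most common port code:", most_common)
print("Remaining nulls:", toy["Embarked"].isnull().sum())
```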

In [172]:
# Drop unnecessary columns
columns_to_drop = ['zero', 'zero.1', 'zero.2', 'zero.3', 'zero.4', 'zero.5', 'zero.6',
                   'zero.7', 'zero.8', 'zero.9', 'zero.10', 'zero.11', 'zero.12',
                   'zero.13', 'zero.14', 'zero.15', 'zero.16', 'zero.17', 'zero.18']
df = df.drop(columns=columns_to_drop)
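
As an alternative to typing out every placeholder column, the names could be collected programmatically. The sketch below uses a small hypothetical frame (not the real data) and assumes the placeholder columns are named "zero" or "zero.N", as in the listing above. It also checks that those columns really contain only zeros before dropping them.

```python
import pandas as pd

# Hypothetical toy frame mimicking the placeholder columns
toy = pd.DataFrame({
    "Age": [22.0, 38.0],
    "zero": [0, 0],
    "zero.1": [0, 0],
    "Fare": [7.25, 71.2833],
})

# Collect every column named "zero" or starting with "zero."
placeholder_cols = [c for c in toy.columns if c == "zero" or c.startswith("zero.")]

# Safety check: confirm the columns hold nothing but zeros before dropping them
all_zero = bool((toy[placeholder_cols] == 0).all().all())

cleaned = toy.drop(columns=placeholder_cols) if all_zero else toy
print(list(cleaned.columns))
```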


In [173]:
# Rename the target variable
df.rename(columns={'2urvived': 'Survived'}, inplace=True)

In [174]:
# Select features and target variable
X = df[['Pclass', 'Sex', 'Age', 'sibsp', 'Parch', 'Fare', 'Embarked']]
y = df['Survived']

In [175]:

# Convert categorical variables into numerical using one-hot encoding
X = pd.get_dummies(X, columns=['Sex', 'Embarked'], drop_first=True)
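
For readers unfamiliar with `drop_first=True`, the following toy example (hypothetical values, not the Titanic data) shows its effect: each encoded column loses its first category, so a two-level column yields one indicator and a three-level column yields two. The dropped category becomes the implicit reference level.

```python
import pandas as pd

# Hypothetical toy data with numerically coded categories, as in the notebook
toy = pd.DataFrame({"Sex": [0, 1, 1], "Embarked": [2.0, 0.0, 1.0]})

# One-hot encode both columns, dropping the first level of each
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)

# Sex (2 levels) contributes 1 column; Embarked (3 levels) contributes 2
print(encoded.shape)
print(list(encoded.columns))
```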

In [176]:
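# Feature scaling
# Note: the scaler is fitted on the full dataset before the train/test split,
# so test-set statistics influence the scaling of the training data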
scaler = StandardScaler()
X[['Age', 'Fare']] = scaler.fit_transform(X[['Age', 'Fare']])

In [177]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
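
One way to keep test-set statistics out of the scaling step is to split first and let a scikit-learn `Pipeline` fit the scaler on the training split only. The sketch below is an assumption about how this could be arranged, not the approach used in this notebook. It runs on synthetic stand-in data whose column names merely mirror the ones above, and `stratify` is added so both splits keep the same class ratio.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Titanic file); names mirror the notebook
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
})
y_demo = (X_demo["Fare"] + rng.normal(0, 20, n) > 30).astype(int)

# Split first, so the scaler never sees the test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale only Age and Fare; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("scale", StandardScaler(), ["Age", "Fare"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaling parameters from the training split only
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print("Test accuracy on synthetic data:", test_accuracy)
```

Because the scaler lives inside the pipeline, calling `model.predict` on new rows applies the training-set mean and standard deviation automatically.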


In [178]:
# Create the logistic regression model
LR = LogisticRegression()


In [179]:
# Train the model on the training split
LR.fit(X_train, y_train)

In [180]:
# Make predictions on the test set
y_pred = LR.predict(X_test)


In [181]:
# Evaluate the model performance
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

In [182]:
# Display the results
print("Confusion Matrix:")
print(conf_matrix)
print("\nAccuracy: {:.2f}%".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 Score: {:.2f}".format(f1))

Confusion Matrix:
[[174  15]
 [ 46  27]]

Accuracy: 76.72%
Precision: 0.64
Recall: 0.37
F1 Score: 0.47
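
The reported metrics can be reproduced by hand from the confusion matrix, which helps in reading them. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and also computes the accuracy of always predicting the majority class, a baseline that is not part of the original output.

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted
conf = np.array([[174, 15],
                 [46, 27]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)   # share of predicted survivors who survived
recall = tp / (tp + fn)      # share of actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the majority class (did not survive)
baseline = (tn + fp) / conf.sum()

print("Accuracy: {:.2f}%".format(accuracy * 100))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 Score: {:.2f}".format(f1))
print("Majority-class baseline: {:.2f}%".format(baseline * 100))
```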


**Conclusion:**

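In summary, the logistic regression model achieved an accuracy of 76.72% on the held-out test set. Precision was moderate (0.64): 27 of the 42 passengers predicted to survive actually did. Recall was low (0.37): the model identified only 27 of the 73 actual survivors and missed the other 46. The F1 score of 0.47 reflects this gap between precision and recall rather than balanced performance, since it is pulled down by the low recall. For context, always predicting "did not survive" would score 72.14% on this test set (189 of 262), so the gain over that baseline is modest and improving recall is the main opportunity.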
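
One common way to trade precision for recall is to lower the decision threshold applied to the predicted probabilities instead of using the default 0.5. The sketch below illustrates the idea on synthetic, imbalanced data generated with `make_classification`. It is not a result for the Titanic data, and the 0.3 threshold is an arbitrary value chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 70% negatives); NOT the Titanic data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression().fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    # Predict positive whenever the probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    print(
        "threshold={:.1f}  precision={:.2f}  recall={:.2f}".format(
            threshold, precision_score(y_te, pred, zero_division=0), recalls[threshold]
        )
    )
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision typically falls. The threshold should be chosen on a validation split rather than on the test set.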