**TASK-1 : TITANIC PASSENGERS SURVIVAL PREDICTION**

> Author: Tushar Surja

> Batch: June 2025 batch B33

> Domain: Data Science


> Aim: To build a model that predicts whether a passenger on the Titanic survived or not.



## Load the dataset

### Subtask:
Load the Titanic dataset from the specified path into a pandas DataFrame.


**Reasoning**:
Import pandas and load the dataset into a DataFrame, then display the head and info of the DataFrame to understand its structure and content.



In [1]:
import pandas as pd

df = pd.read_csv('Titanic-Dataset.csv')
display(df.head())
# df.info() prints its summary and returns None, so display() also shows "None"
display(df.info())

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


None

## Explore and preprocess the data

### Subtask:
Analyze the dataset to understand its structure, identify missing values, handle categorical features, and perform any necessary transformations.


**Reasoning**:
Check for missing values, handle missing values in 'Age' and 'Embarked' columns, drop the 'Cabin' column, and convert categorical features into numerical representations.



In [2]:
# Check for missing values
print("Missing values before handling:")
print(df.isnull().sum())

# Impute missing 'Age' values with the median. Assign the result back to the
# column: fillna(..., inplace=True) on a column selection stops working in pandas 3.0.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Impute missing 'Embarked' values with the mode (the most frequent port)
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Drop the 'Cabin' column
df = df.drop('Cabin', axis=1)

# Convert categorical features into numerical representations
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)

# Display the first few rows and information of the preprocessed DataFrame
print("\nMissing values after handling:")
print(df.isnull().sum())
display(df.head())
display(df.info())

Missing values before handling:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Missing values after handling:
PassengerId    0
Survived       0
Pclass         0
Name           0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Sex_male       0
Embarked_Q     0
Embarked_S     0
dtype: int64


|   | PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 22.0 | 1 | 0 | A/5 21171 | 7.25 | True | False | True |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 1 | 0 | PC 17599 | 71.2833 | False | False | False |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | False | False | True |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 1 | 0 | 113803 | 53.1 | False | False | True |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 35.0 | 0 | 0 | 373450 | 8.05 | True | False | True |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Age          891 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Sex_male     891 non-null    bool   
 10  Embarked_Q   891 non-null    bool   
 11  Embarked_S   891 non-null    bool   
dtypes: bool(3), float64(2), int64(5), object(2)
memory usage: 65.4+ KB


None
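
To make the preprocessing step concrete, here is a minimal sketch on a tiny made-up frame (the `toy` data is hypothetical, not rows from the dataset). It imputes by assignment and shows why `drop_first=True` leaves only `Sex_male`, `Embarked_Q`, and `Embarked_S`: the alphabetically first level of each column (`female`, `C`) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical toy data, chosen only to illustrate the steps.
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0, 26.0],
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", None, "Q", "S"],
})

# Median of the known ages (22, 26, 30, 40) fills the missing age.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# The most frequent port ("S") fills the missing port.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# One-hot encode; drop_first removes 'Sex_female' and 'Embarked_C'.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded.columns.tolist())
```

A row with `Embarked_Q` and `Embarked_S` both false therefore represents a passenger who embarked at `C`.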

## Split the data

### Subtask:
Divide the dataset into training and testing sets.


**Reasoning**:
Separate the target column 'Survived' from the features, then hold out 25% of the rows as a test set, using a fixed `random_state` so the split is reproducible.



In [3]:
from sklearn.model_selection import train_test_split

X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (668, 11)
Shape of X_test: (223, 11)
Shape of y_train: (668,)
Shape of y_test: (223,)
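
The shapes above follow from how a fractional `test_size` is turned into row counts. As far as I understand scikit-learn's behaviour, the test count is rounded up and the remainder goes to training. A small sketch of that arithmetic (the helper name `split_sizes` is hypothetical and not part of scikit-learn):

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size."""
    # Round the test share up, give everything else to training.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 891 rows with test_size=0.25: 0.25 * 891 = 222.75, rounded up to 223.
print(split_sizes(891, 0.25))
```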


## Build and train the model

### Subtask:
Choose an appropriate machine learning model (e.g., a classification model like Logistic Regression or a Decision Tree) and train it on the training data.


**Reasoning**:
Import and train a Logistic Regression model on the training data.



In [4]:
from sklearn.linear_model import LogisticRegression

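# Use only the numeric and one-hot encoded columns; the identifier and
# free-text columns (PassengerId, Name, Ticket) are left out.
feature_cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']

model = LogisticRegression(max_iter=200)
model.fit(X_train[feature_cols], y_train)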

print("Model training complete.")

Model training complete.
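
For intuition about what the fitted model computes: logistic regression forms a weighted sum of the features plus an intercept, then passes it through the sigmoid function to obtain a survival probability, and predicts 1 when that probability exceeds 0.5. The sketch below uses made-up weights (`weights` and `intercept` are placeholders, not the coefficients learned in this notebook).

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, for illustration only.
weights = {"Pclass": -1.0, "Sex_male": -2.5, "Age": -0.03}
intercept = 4.0


def survival_probability(passenger: dict) -> float:
    """Weighted sum of features plus intercept, squashed by the sigmoid."""
    z = intercept + sum(weights[name] * passenger[name] for name in weights)
    return sigmoid(z)


p_low = survival_probability({"Pclass": 3, "Sex_male": 1, "Age": 22})
p_high = survival_probability({"Pclass": 1, "Sex_male": 0, "Age": 38})
print(f"{p_low:.3f} {p_high:.3f}")
```

The real values can be read from `model.coef_` and `model.intercept_` after fitting.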


## Evaluate the model

### Subtask:
Assess the performance of the trained model on the testing data using relevant metrics (e.g., accuracy, precision, recall).


**Reasoning**:
Make predictions on the test set and calculate the accuracy of the model.



In [5]:
from sklearn.metrics import accuracy_score

X_test_selected = X_test[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']]
predictions = model.predict(X_test_selected)

accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy}")

Model Accuracy: 0.8071748878923767
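
Accuracy alone does not show which kind of mistake the model makes. The subtask also mentions precision and recall, so here is a minimal sketch of how they could be computed with `sklearn.metrics`. The labels below are made up for illustration (`toy_true` and `toy_pred` are hypothetical). In this notebook the same calls would presumably take `y_test` and `predictions`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = survived, 0 = did not survive.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 0, 1, 0]

acc = accuracy_score(toy_true, toy_pred)
# Precision: of the passengers predicted to survive, how many did.
prec = precision_score(toy_true, toy_pred)
# Recall: of the passengers who survived, how many were found.
rec = recall_score(toy_true, toy_pred)
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(toy_true, toy_pred)

print(acc, prec, rec)
print(cm)
```

Here one survivor is missed and nobody is wrongly predicted to survive, so precision is perfect while recall is lower than accuracy.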


## Perform predictions

### Subtask:
Use the trained model to make predictions on a few new, unseen passenger data points to test its predictive capability.


**Reasoning**:
Create a DataFrame with hypothetical passenger data and use the trained model to predict their survival status.



In [6]:
import pandas as pd

# Create hypothetical passenger data
hypothetical_passengers = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Age': [30, 25, 40],
    'SibSp': [1, 0, 2],
    'Parch': [0, 1, 0],
    'Fare': [50.0, 10.0, 25.0],
    'Sex_male': [True, False, True],
    'Embarked_Q': [False, True, False],
    'Embarked_S': [True, False, True]
})

# Make predictions
hypothetical_predictions = model.predict(hypothetical_passengers)

# Print the predictions
print("Predictions for hypothetical passengers:")
for i, prediction in enumerate(hypothetical_predictions):
    survival_status = "Survived" if prediction == 1 else "Did not survive"
    print(f"Passenger {i+1}: {survival_status}")

Predictions for hypothetical passengers:
Passenger 1: Did not survive
Passenger 2: Survived
Passenger 3: Did not survive
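
Hard 0/1 predictions hide how confident the model is. A minimal sketch of getting probabilities with `predict_proba`, using a tiny made-up training set (`train`, `labels`, and `toy_model` are hypothetical and unrelated to the model fitted above). It also reorders the new data's columns to the training order, since scikit-learn expects feature names to match those seen during fitting.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training data.
train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3, 2, 1],
    "Sex_male": [0, 1, 0, 1, 1, 0, 1, 0],
})
labels = [1, 1, 1, 0, 0, 1, 0, 1]

toy_model = LogisticRegression().fit(train, labels)

# New passengers, deliberately built with the columns in a different order.
new_passengers = pd.DataFrame({"Sex_male": [1, 0], "Pclass": [3, 1]})
new_passengers = new_passengers[train.columns]  # match the training order

# One row per passenger; columns follow toy_model.classes_ (here 0, then 1).
proba = toy_model.predict_proba(new_passengers)
print(proba.round(3))
```

In the notebook, `model.predict_proba(hypothetical_passengers)` would give the corresponding probabilities for the three passengers above.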


## Summary:

### Data Analysis Key Findings

*   The initial dataset (891 rows, 12 columns) contained missing values in the 'Age' (177), 'Cabin' (687), and 'Embarked' (2) columns.
*   Missing values in 'Age' were imputed with the median, and missing values in 'Embarked' were imputed with the mode.
*   The 'Cabin' column was dropped because 687 of its 891 values (about 77%) were missing.
*   Categorical features 'Sex' and 'Embarked' were converted into numerical representations using one-hot encoding, resulting in 'Sex\_male', 'Embarked\_Q', and 'Embarked\_S' columns.
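*   The dataset was split into training and testing sets (668 and 223 rows), with the testing set comprising 25% of the data.
*   A Logistic Regression model was trained on the training data using eight features: 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex\_male', 'Embarked\_Q', and 'Embarked\_S'. The 'PassengerId', 'Name', and 'Ticket' columns were not used.
*   The trained model achieved an accuracy of approximately 80.7% on the 223 passengers in the testing set.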
*   Predictions for three hypothetical passengers were generated, indicating survival for one and non-survival for the other two.

### Insights or Next Steps

*   The Logistic Regression model shows reasonable performance on the Titanic dataset, achieving over 80% accuracy. Further evaluation with other metrics like precision and recall could provide a more comprehensive understanding of the model's performance, especially regarding false positives and false negatives.
*   Exploring other classification models such as Decision Trees, Random Forests, or Support Vector Machines could potentially lead to improved accuracy. Additionally, feature engineering or exploring interactions between existing features might enhance the model's predictive power.
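
As one possible illustration of the feature engineering mentioned above, the sketch below derives two candidate features from columns already in the dataset: a family size from 'SibSp' and 'Parch', and a title extracted from 'Name'. The column names `FamilySize` and `Title` are assumptions introduced here, and whether they improve the model was not tested in this notebook.

```python
import pandas as pd

# The first five rows of 'Name', 'SibSp', and 'Parch' as shown in df.head().
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Siblings/spouses plus parents/children plus the passenger themselves.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# The title sits between the comma and the first period in 'Name'.
sample["Title"] = (
    sample["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
)

print(sample[["FamilySize", "Title"]])
```

'Title' would still need encoding (for example with `pd.get_dummies`) before it could be passed to the model.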
