# ✨ **Experiment No. 01: Titanic Dataset - Data Preprocessing** 🚢

**Author-** *Prashant Kumar*

**Class-** *T.Y.B.Tech (CSE)*

**Roll No.** *B34*



---

## 📌 **Introduction**
On April 15, 1912, during her maiden voyage, the **RMS Titanic**, widely considered "unsinkable", sank after colliding with an iceberg. There weren’t enough lifeboats for everyone onboard, resulting in the death of **1,502 out of 2,224 passengers and crew**.

This experiment aims to analyze **how different factors (such as gender, age, and socio-economic status) influenced survival rates**. We will perform **data preprocessing** on the Titanic dataset to clean the data and prepare it for further analysis.

---

## 🔹 **Dataset Used**  
📂 **File Name:** `titanic-data.csv`  
📊 **Dataset Description:** Contains **891 passenger records with 12 columns**, including survival status, passenger class, name, sex, age, fare, cabin, and port of embarkation.

---

## ⚙ **Step 1: Load the Dataset**
```python
import pandas as pd
import numpy as np

# Load dataset
data = pd.read_csv("C:/Users/prash/Desktop/Data Analytics & Visualization/titanic-data.csv")
print("First 5 rows of the dataset:")
print(data.head())
```

```
First 5 rows of the dataset:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450
```

## 📊 **Step 2: Explore the Data**

```python
# Check data types and column names
print("\nData Types:")
print(data.dtypes)
print("\nColumn Names:", data.columns)
```

```
Data Types:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

Column Names: Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
```


```python
# Explore the target variable 'Survived'
print("\nSurvival Count:")
print(data['Survived'].value_counts())
```

```
Survival Count:
Survived
0    549
1    342
Name: count, dtype: int64
```

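The two counts are easier to compare as shares of the whole. A minimal sketch, assuming a stand-in Series rebuilt from the counts printed above (the name `survived` is hypothetical); on the loaded data the equivalent call would be `data['Survived'].value_counts(normalize=True)`:

```python
import pandas as pd

# Stand-in for data['Survived'], rebuilt from the printed counts
# (549 passengers with Survived = 0, 342 with Survived = 1).
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns proportions instead of raw counts
shares = survived.value_counts(normalize=True) * 100
print(shares.round(2))
```

Roughly 38% of the 891 passengers in this file survived, so the two classes are moderately imbalanced.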

## 📈 **Step 3: Analyze Survival Rate by Different Categories**

```python
# Total number of passengers
print("\nTotal number of passengers:", len(data))
print("Number of passengers who survived:", len(data[data['Survived'] == 1]))
print("Number of passengers who didn't survive:", len(data[data['Survived'] == 0]))
```

```
Total number of passengers: 891
Number of passengers who survived: 342
Number of passengers who didn't survive: 549
```


```python
# Gender-wise survival rate
print("\nSurvival Rate by Gender:")
print('% of male who survived:', 100 * np.mean(data['Survived'][data['Sex'] == 'male']))
print('% of female who survived:', 100 * np.mean(data['Survived'][data['Sex'] == 'female']))
```

```
Survival Rate by Gender:
% of male who survived: 18.890814558058924
% of female who survived: 74.20382165605095
```


```python
# Class-wise survival rate
print("\nSurvival Rate by Passenger Class:")
for pclass in [1, 2, 3]:
    print(f'% of passengers who survived in class {pclass}:',
          100 * np.mean(data['Survived'][data['Pclass'] == pclass]))
```

```
Survival Rate by Passenger Class:
% of passengers who survived in class 1: 62.96296296296296
% of passengers who survived in class 2: 47.28260869565217
% of passengers who survived in class 3: 24.236252545824847
```

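The per-gender and per-class filters can also be written with `groupby`, which computes every group in one call. This is a sketch on a tiny made-up frame (`toy` and its values are assumptions, not rows from `titanic-data.csv`); on the loaded data, `data` would take the place of `toy`:

```python
import pandas as pd

# Tiny made-up sample (NOT real rows), only to keep the sketch runnable
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Pclass":   [3, 1, 3, 1, 2, 3, 2, 3],
    "Survived": [0, 1, 1, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of 1s, so mean * 100 is a rate in percent
rate_by_sex = toy.groupby("Sex")["Survived"].mean() * 100
rate_by_class = toy.groupby("Pclass")["Survived"].mean() * 100

# Both factors at once: rows are Sex, columns are Pclass
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
) * 100

print(rate_by_sex)
print(rate_by_class)
print(rate_table)
```

The pivot table is useful because gender and class are related in this dataset, so a combined view shows whether one factor still matters within each level of the other.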

```python
# Data Summary
print("\nDataset Shape:", data.shape)
print("\nDataset Info:")
data.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nAge Distribution:")
print(data['Age'].value_counts())
```

```
Dataset Shape: (891, 12)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Age Distribution:
Age
24.00    30
22.00    27
18.00    26
28.00    25
30.00    25
         ..
24.50     1
0.67      1
0.42      1
34.50     1
74.00     1
Name: count, Length: 88, dtype: int64
```


## 🛠 **Step 4: Data Cleaning & Handling Missing Values**

```python
# Handling Missing Values
df2 = data.copy()
print("\nMissing Values Before Handling:")
print(df2.isnull().sum())
```

```
Missing Values Before Handling:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```

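Raw counts say little without the row total, so it helps to express them as percentages. The sketch below copies the three non-zero counts and the 891-row total from the outputs above into a hypothetical `missing` Series; on the real frame the same figures would come from `df2.isnull().mean() * 100`:

```python
import pandas as pd

# Counts copied from the printed output; 891 is the dataset's row count
n_rows = 891
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of rows with a missing value, in percent
missing_pct = (missing / n_rows * 100).round(2)
print(missing_pct.sort_values(ascending=False))
```

About 77% of `Cabin` values are absent, compared with about 20% of `Age` and well under 1% of `Embarked`, which is worth keeping in mind when choosing a fill strategy for each column.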

```python
# Fill missing Age values with the column mean.
# Assigning the result back avoids the pandas FutureWarning raised by
# calling fillna(..., inplace=True) on a column selection.
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())
```

```python
# Fill missing Embarked values with mode
df2['Embarked'] = df2['Embarked'].fillna(df2['Embarked'].mode()[0])
```

```python
# Fill missing Cabin values with mode
# (687 of 891 values are missing, so most rows receive this single value)
df2['Cabin'] = df2['Cabin'].fillna(df2['Cabin'].mode()[0])
```

## 🔄 **Step 5: Convert Categorical Data to Numeric**

```python
# Convert 'Sex' column to numeric (male = 1, female = 0)
df2['Sex'] = df2['Sex'].apply(lambda x: 1 if x == 'male' else 0)
```

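Only `Sex` is encoded in this step; text columns such as `Embarked` stay as strings and are therefore excluded from any numeric analysis. One common option for a column with more than two categories is one-hot encoding. The sketch below is an assumption about how that could look, not part of the original experiment, and uses a tiny made-up frame (`toy`) in place of `df2`:

```python
import pandas as pd

# Tiny made-up frame (NOT real rows from titanic-data.csv)
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Binary column: an explicit mapping leaves unexpected values as NaN
# instead of silently encoding them as 0
toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# Multi-valued column: one 0/1 indicator column per category
encoded = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked", dtype=int)
print(encoded)
```

One-hot encoding avoids implying an order between categories, which a single integer code (for example S = 0, C = 1, Q = 2) would do.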

```python
# Check for missing values after handling
print("\nMissing Values After Handling:")
print(df2.isnull().sum())
```

```
Missing Values After Handling:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
```

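Zero missing values does not mean the filled values are informative: filling `Cabin` with its mode gives one cabin label to every passenger whose cabin was not recorded. As an alternative, offered here as an assumption rather than the method used in this experiment, the missingness can be kept as its own feature and `Age` filled with the median. The frame `toy` and the column name `HasCabin` are hypothetical:

```python
import numpy as np
import pandas as pd

# Tiny made-up frame (NOT real rows from titanic-data.csv)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# Record whether a cabin was known, before any filling takes place
toy["HasCabin"] = toy["Cabin"].notna().astype(int)

# The median is less sensitive to extreme ages than the mean;
# the result is assigned back rather than using inplace=True
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```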

## 📊 **Step 6: Compute Correlation Matrix**

```python
# Keep only numeric columns before computing correlation
df_numeric = df2.select_dtypes(include=['number'])

# Display correlation matrix
print("\nCorrelation Matrix:")
print(df_numeric.corr())
```

```
Correlation Matrix:
             PassengerId  Survived    Pclass       Sex       Age     SibSp  \
PassengerId     1.000000 -0.005007 -0.035144  0.042939  0.033207 -0.057527   
Survived       -0.005007  1.000000 -0.338481 -0.543351 -0.069809 -0.035322   
Pclass         -0.035144 -0.338481  1.000000  0.131900 -0.331339  0.083081   
Sex             0.042939 -0.543351  0.131900  1.000000  0.084153 -0.114631   
Age             0.033207 -0.069809 -0.331339  0.084153  1.000000 -0.232625   
SibSp          -0.057527 -0.035322  0.083081 -0.114631 -0.232625  1.000000   
Parch          -0.001652  0.081629  0.018443 -0.245489 -0.179191  0.414838   
Fare            0.012658  0.257307 -0.549500 -0.182333  0.091566  0.159651   

                Parch      Fare  
PassengerId -0.001652  0.012658  
Survived     0.081629  0.257307  
Pclass       0.018443 -0.549500  
Sex         -0.245489 -0.182333  
Age         -0.179191  0.091566  
SibSp        0.414838  0.159651  
Parch        1.000000  0.216225  
Fare         0.216225  1.000000  
```

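The full matrix is hard to scan, so it helps to pull out the `Survived` column and rank it by strength. In this sketch the values are copied by hand from the matrix above into a hypothetical `corr_with_survived` Series; on the real frame the starting point would be `df_numeric.corr()['Survived'].drop('Survived')`:

```python
import pandas as pd

# Survived column of the correlation matrix, copied from the printed output
corr_with_survived = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Sex": -0.543351,
    "Age": -0.069809,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Rank by absolute value: the sign gives direction, the magnitude gives strength
order = corr_with_survived.abs().sort_values(ascending=False).index
ranked = corr_with_survived.reindex(order)
print(ranked)
```

`Sex` has the strongest linear relationship with survival, followed by `Pclass` and `Fare`. Because `Sex` was encoded as male = 1, its negative sign means men were less likely to survive, and the negative sign on `Pclass` means higher class numbers (lower classes) go with lower survival.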
## ✅ **Conclusion** 🏆

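🔹 Women (74.2% survived) and first-class passengers (63.0%) had far higher survival rates than men (18.9%) and third-class passengers (24.2%), consistent with historical records.

🔹 Missing values in `Age`, `Embarked`, and `Cabin` were filled and `Sex` was encoded numerically, leaving the dataset with no missing values.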

🔹 The cleaned dataset is now ready for further analysis and predictive modeling.