# 1. Project Setup
## 1.1 Define Project Objectives
For the Titanic dataset, the objectives include:

- Understanding the dataset's structure and content.
- Identifying key features that may influence survival rates.
- Handling missing data effectively.
- Generating visual insights into passenger demographics, ticket classes, and survival rates.

## 1.2 Feature Description

| Feature       | Description                                                                 | Data Type     | Example Value      |
|---------------|-----------------------------------------------------------------------------|---------------|--------------------|
| PassengerId   | Unique identifier for each passenger                                       | Integer       | 892                |
| Survived      | Survival status (0 = No, 1 = Yes)                                         | Integer       | 1                  |
| Pclass        | Passenger class (1 = First, 2 = Second, 3 = Third)                        | Integer       | 3                  |
| Name          | Full name of the passenger                                                | String        | Kelly, Mr. James   |
| Sex           | Gender of the passenger (male/female)                                     | String        | male               |
| Age           | Age of the passenger in years                                             | Float         | 34.5               |
| SibSp         | Number of siblings and/or spouses aboard the Titanic                      | Integer       | 0                  |
| Parch         | Number of parents and/or children aboard the Titanic                      | Integer       | 0                  |
| Ticket        | Ticket number                                                             | String        | 330911             |
| Fare          | Passenger fare paid                                                      | Float         | 7.8292             |
| Cabin         | Cabin number (if assigned)                                               | String/NaN    | NaN (missing value)|
| Embarked      | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)      | String        | Q                  |
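
The feature table can also be written down as a small schema check, so that a freshly loaded file is compared against the expected types before any analysis. This is only a sketch: `EXPECTED_KINDS`, `KIND_CHECKS`, and `check_schema` are hypothetical names introduced here, and the two-row sample is hand-typed from the dataset preview rather than read from disk.

```python
import pandas as pd
from pandas.api import types as ptypes

# Logical type per column, transcribed from the feature table.
# All names in this block are illustrative, not part of the notebook.
EXPECTED_KINDS = {
    "PassengerId": "integer", "Survived": "integer", "Pclass": "integer",
    "Name": "string", "Sex": "string", "Age": "float",
    "SibSp": "integer", "Parch": "integer", "Ticket": "string",
    "Fare": "float", "Cabin": "string", "Embarked": "string",
}

KIND_CHECKS = {
    "integer": ptypes.is_integer_dtype,
    "float": ptypes.is_float_dtype,
    # Text columns load as `object` in older pandas and as a string dtype in newer ones.
    "string": lambda s: ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s),
}


def check_schema(frame: pd.DataFrame) -> list:
    """Return a list of mismatches; an empty list means the frame conforms."""
    problems = []
    for column, kind in EXPECTED_KINDS.items():
        if column not in frame.columns:
            problems.append(f"{column}: missing")
        elif not KIND_CHECKS[kind](frame[column]):
            problems.append(f"{column}: expected {kind}, got {frame[column].dtype}")
    return problems


# Two rows typed in by hand to exercise the check.
sample = pd.DataFrame({
    "PassengerId": [892, 893], "Survived": [0, 1], "Pclass": [3, 3],
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)"],
    "Sex": ["male", "female"], "Age": [34.5, 47.0],
    "SibSp": [0, 1], "Parch": [0, 0],
    "Ticket": ["330911", "363272"], "Fare": [7.8292, 7.0],
    "Cabin": [None, None], "Embarked": ["Q", "S"],
})

print(check_schema(sample))

# A deliberately broken copy: integer ages and no Fare column.
broken = sample.drop(columns="Fare").assign(Age=[34, 47])
print(check_schema(broken))
```

Running a check like this right after loading the file makes it easier to notice when a column arrives with an unexpected type.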

In [14]:
# IMPORTING NECESSARY LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 2. Load and Inspect the Data
## 2.1 Load the Dataset

In [15]:
data_path = r"C:\Users\USER\Downloads\archive (22)\tested.csv"
df = pd.read_csv(data_path)

In [16]:
df.head()

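|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |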


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


## 2.2 Understand the Dataset
- Check for missing values: `df.isnull().sum()` (equivalent to `df.isna().sum()`)
- Understand feature types: `df.dtypes`
- Compute basic statistics: `df.describe(include='all')`

In [18]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [19]:
df.shape

(418, 12)

In [20]:
df.describe(include = 'all')

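|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| count | 418.0 | 418.0 | 418.0 | 418 | 418 | 332.0 | 418.0 | 418.0 | 418 | 417.0 | 91 | 418 |
| unique | NaN | NaN | NaN | 418 | 2 | NaN | NaN | NaN | 363 | NaN | 76 | 3 |
| top | NaN | NaN | NaN | Kelly, Mr. James | male | NaN | NaN | NaN | PC 17608 | NaN | B57 B59 B63 B66 | S |
| freq | NaN | NaN | NaN | 1 | 266 | NaN | NaN | NaN | 5 | NaN | 3 | 270 |
| mean | 1100.5 | 0.363636 | 2.26555 | NaN | NaN | 30.27259 | 0.447368 | 0.392344 | NaN | 35.627188 | NaN | NaN |
| std | 120.810458 | 0.481622 | 0.841838 | NaN | NaN | 14.181209 | 0.89676 | 0.981429 | NaN | 55.907576 | NaN | NaN |
| min | 892.0 | 0.0 | 1.0 | NaN | NaN | 0.17 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 996.25 | 0.0 | 1.0 | NaN | NaN | 21.0 | 0.0 | 0.0 | NaN | 7.8958 | NaN | NaN |
| 50% | 1100.5 | 0.0 | 3.0 | NaN | NaN | 27.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 1204.75 | 1.0 | 3.0 | NaN | NaN | 39.0 | 1.0 | 0.0 | NaN | 31.5 | NaN | NaN |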


# 3. Data Cleaning
## 3.1 Handle Missing Values
- Categorical Features: Replace with the mode.
- Numerical Features: Replace with the mean/median.
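
The two rules above can be sketched as one helper that picks the fill value from the column's dtype. The function name `impute_simple` and the toy frame are assumptions made for illustration; the helper always uses the median for numeric columns, which is one of the two options the rule allows.

```python
import numpy as np
import pandas as pd
from pandas.api import types as ptypes


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric gaps filled by the median and other gaps by the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if not filled[column].isna().any():
            continue  # nothing to fill in this column
        if ptypes.is_numeric_dtype(filled[column]):
            fill_value = filled[column].median()
        else:
            modes = filled[column].mode(dropna=True)
            if modes.empty:
                continue  # column is entirely missing; leave it untouched
            fill_value = modes.iloc[0]
        filled[column] = filled[column].fillna(fill_value)
    return filled


# Made-up values, only to exercise both branches.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "S", None, "Q"],
})
result = impute_simple(toy)
print(result)
```

Because the helper works on a copy, the original frame keeps its gaps, which makes it easy to compare the data before and after imputation.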

In [21]:
(df.isna().sum() / len(df)) * 100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            20.574163
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.239234
Cabin          78.229665
Embarked        0.000000
dtype: float64
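
Counts and percentages are easier to compare side by side. The following sketch combines them into one table and keeps only the columns that have gaps; `missing_summary` is a hypothetical helper name and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually contain missing values.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up values for illustration.
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, 27.0],
    "Cabin": [None, None, None, "B45"],
    "Sex": ["male", "female", "male", "male"],
})
summary = missing_summary(toy)
print(summary)
```

Applied to the Titanic frame, a table like this would put `Cabin` at the top, which helps when deciding between imputing a column and replacing its gaps with a placeholder.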

In [22]:
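# Fill missing values. Assign the result back instead of calling
# fillna(..., inplace=True) on a selected column: recent pandas versions warn
# about that pattern, and it does not update df under copy-on-write.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # no gaps in this file; kept as a safeguard
df['Fare'] = df['Fare'].fillna(df['Fare'].median())
df['Cabin'] = df['Cabin'].fillna('Unknown')  # about 78% missing, so use a placeholder label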

In [23]:
df.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

## 3.2 Remove Irrelevant Features
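Drop columns with no analytical value. PassengerId is only a row identifier, so it is dropped here. Ticket and Cabin are also candidates for removal, but they are kept in the DataFrame for now.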

In [24]:
df = df.drop(columns='PassengerId')
df.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')
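
If more than one column is to be removed, `DataFrame.drop` accepts a list, and `errors="ignore"` keeps the cell from failing when it is re-run after the columns are already gone. A minimal sketch on a made-up two-row frame; whether Ticket and Cabin should actually be dropped is an analysis decision, and `columns_to_drop` is a name chosen here for illustration.

```python
import pandas as pd

# Made-up frame with the same column names as the dataset.
toy = pd.DataFrame({
    "PassengerId": [892, 893],
    "Survived": [0, 1],
    "Ticket": ["330911", "363272"],
    "Cabin": ["Unknown", "Unknown"],
})

columns_to_drop = ["PassengerId", "Ticket", "Cabin"]

# drop() returns a new frame; the original `toy` is left unchanged.
trimmed = toy.drop(columns=columns_to_drop, errors="ignore")

# Dropping again is harmless because missing labels are ignored.
trimmed_again = trimmed.drop(columns=columns_to_drop, errors="ignore")

print(list(trimmed.columns))
```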

In [25]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  418 non-null    int64  
 1   Pclass    418 non-null    int64  
 2   Name      418 non-null    object 
 3   Sex       418 non-null    object 
 4   Age       418 non-null    float64
 5   SibSp     418 non-null    int64  
 6   Parch     418 non-null    int64  
 7   Ticket    418 non-null    object 
 8   Fare      418 non-null    float64
 9   Cabin     418 non-null    object 
 10  Embarked  418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


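|   | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | Unknown | Q |
| 1 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | Unknown | S |
| 2 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | Unknown | Q |
| 3 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | Unknown | S |
| 4 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | Unknown | S |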


## 3.3 Correct Data Types

In [26]:
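# Convert column types
df['Pclass'] = df['Pclass'].astype('category')      # class label, not a quantity
df['Cabin'] = df['Cabin'].astype('str')             # already strings after fillna('Unknown')
df['Embarked'] = df['Embarked'].astype('category')  # three ports: C, Q, S
df['Age'] = df['Age'].astype('int64')               # note: truncates fractional ages (34.5 -> 34, 0.17 -> 0)
df.info()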

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   Survived  418 non-null    int64   
 1   Pclass    418 non-null    category
 2   Name      418 non-null    object  
 3   Sex       418 non-null    object  
 4   Age       418 non-null    int64   
 5   SibSp     418 non-null    int64   
 6   Parch     418 non-null    int64   
 7   Ticket    418 non-null    object  
 8   Fare      418 non-null    float64 
 9   Cabin     418 non-null    object  
 10  Embarked  418 non-null    category
dtypes: category(2), float64(1), int64(4), object(4)
memory usage: 30.6+ KB
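
Casting a float column with `astype('int64')` discards the fractional part, which matters here because the raw ages include values such as 34.5 and 0.17. The sketch below contrasts truncation with rounding on a small hand-made series; it is an illustration of the behavior, not a recommendation to change the cell above.

```python
import pandas as pd

# Two ages taken from the dataset (34.5 and the minimum, 0.17) plus two made-up ones.
ages = pd.Series([34.5, 0.17, 62.0, 27.9], name="Age")

# astype drops everything after the decimal point.
truncated = ages.astype("int64")

# Rounding first moves each value to the nearest whole number
# (exact halves go to the nearest even integer).
rounded = ages.round().astype("int64")

print(pd.DataFrame({"raw": ages, "truncated": truncated, "rounded": rounded}))
```

Keeping `Age` as `float64` is also a reasonable choice if infant ages below one year need to stay distinguishable from zero.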
