# Task Introduction

Task 1: Understanding Dataset & Data

The objective of this task is to explore and understand the structure of the Titanic dataset using Python and Pandas. This includes examining the dataset's size, feature types, missing values, and statistical properties. The task helps in identifying the target variable, input features, and potential data quality issues to assess the dataset's suitability for machine learning models.

# Import Library

In [1]:
import pandas as pd
import numpy as np


# Load Dataset

In [2]:
df = pd.read_csv("data/Titanic_Machine_Learning_from_Disaster.csv")


# 1. Display First & Last Records

In [3]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
df.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


# 2. Identify Feature Types

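| Column Name | Type                 | Reason                           |
| ----------- | -------------------- | -------------------------------- |
| PassengerId | Numerical (ID)       | Unique identifier                |
| Survived    | Binary               | 0 = No, 1 = Yes                  |
| Pclass      | Ordinal              | 1st, 2nd, 3rd class              |
| Name        | Categorical (text)   | Unique per passenger             |
| Sex         | Categorical          | male / female                    |
| Age         | Numerical            | Continuous                       |
| SibSp       | Numerical (Discrete) | Count of siblings/spouses aboard |
| Parch       | Numerical (Discrete) | Count of parents/children aboard |
| Ticket      | Categorical (text)   | Ticket identifier                |
| Fare        | Numerical            | Continuous                       |
| Cabin       | Categorical (text)   | Cabin identifier                 |
| Embarked    | Categorical          | Port codes (S, C, Q)             |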
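The manual typing above can be cross-checked against what pandas reports. The sketch below is illustrative and not part of the original notebook: it rebuilds a few rows shown by `df.head()` and `df.tail()` as a small frame (the name `sample` is an assumption) and summarises each column's dtype and number of distinct values. Note that dtype alone cannot separate an ordinal column such as Pclass from a count such as SibSp, which is why the table is still assigned by hand.

```python
import pandas as pd

# Rows copied from the df.head() output plus the first row of df.tail()
# (a subset of columns, for illustration only).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 887],
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 27.0],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 13.0],
    "Embarked": ["S", "C", "S", "S", "S", "S"],
})

# One row per column: storage dtype, distinct values, numeric or not.
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "n_unique": sample.nunique(),
    "is_numeric": [pd.api.types.is_numeric_dtype(sample[c]) for c in sample.columns],
})
print(summary)
```

On the full dataset the same summary would be built from `df` instead of `sample`.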


# 3. Use info() and describe()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.353842,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.383838,0.486592,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.308642,0.836071,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.699118,14.526497,0.42,20.125,28.0,38.0,80.0
SibSp,891.0,0.523008,1.102743,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.381594,0.806057,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.204208,49.693429,0.0,7.9104,14.4542,31.0,512.3292


In [9]:
df.describe(include=object).T

Unnamed: 0,count,unique,top,freq
Name,891,891,"Braund, Mr. Owen Harris",1
Sex,891,2,male,577
Ticket,891,681,347082,7
Cabin,204,147,G6,4
Embarked,889,3,S,644


# 4. Unique Values in Categorical Columns

In [11]:
df['Sex'].unique()

array(['male', 'female'], dtype=object)

In [12]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [14]:
df['Sex'].value_counts(normalize=True)

Sex
male      0.647587
female    0.352413
Name: proportion, dtype: float64

In [15]:
df['Embarked'].value_counts(normalize=True)

Embarked
S    0.724409
C    0.188976
Q    0.086614
Name: proportion, dtype: float64

# 5. Identify Target Variable & Input Features

Titanic Dataset

🎯 Target Variable: Survived

📥 Input Features:

Pclass

Sex

Age

SibSp

Parch

Fare

Embarked

🚫 Exclude:

PassengerId (no predictive value)

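Name, Ticket (free-text identifiers with no direct predictive value unless feature-engineered)

Cabin (only 204 of 891 values are present)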
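The split listed above can be sketched in pandas as follows. This is a minimal illustration rather than the notebook's own code: `sample` holds the five rows from `df.head()`, and the names `feature_cols`, `X`, and `y` are assumptions introduced here.

```python
import pandas as pd

# Five rows copied from the df.head() output (Name and Cabin omitted for brevity).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Input features and target exactly as listed in this section.
feature_cols = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
X = sample[feature_cols]   # model inputs
y = sample["Survived"]     # target variable

print(X.shape, y.shape)
```

Selecting the feature columns by name, rather than dropping the excluded ones, keeps the split explicit if more columns are added later.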

# 6. Dataset Size & ML Suitability

In [16]:
df.shape

(891, 12)

Titanic: 891 rows × 12 columns

Suitable for:

Classification

Supervised ML

Learning & experimentation

# 7. Data Quality Issues

In [19]:
## Check missing values:

In [18]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
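Raw counts are easier to judge as a share of the 891 rows. The sketch below redoes that arithmetic from the counts printed above; the names `missing` and `missing_pct` are assumptions. On the loaded frame, `df.isnull().mean() * 100` should give the same percentages directly.

```python
import pandas as pd

n_rows = 891  # row count reported by df.info() and df.shape

# Non-zero missing counts copied from the df.isnull().sum() output above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Share of rows with a missing value, in percent.
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Roughly 77% of Cabin values and 20% of Age values are missing, while Embarked is missing in well under 1% of rows.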

In [None]:
## Check Class imbalance:

In [20]:
df['Survived'].value_counts(normalize=True)

Survived
0    0.616162
1    0.383838
Name: proportion, dtype: float64
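One practical consequence of these proportions is the majority-class baseline: a model that always predicts the most frequent class already reaches about 61.6% accuracy, so any classifier should be judged against that figure. The sketch below is illustrative; it rebuilds the target from the counts implied by the proportions above (549 and 342 out of 891), and the variable names are assumptions.

```python
import pandas as pd

# Counts implied by the output above: 0.616162 * 891 = 549 and 0.383838 * 891 = 342.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

proportions = survived.value_counts(normalize=True)
majority_class = proportions.idxmax()    # most frequent label
baseline_accuracy = proportions.max()    # accuracy of always predicting it

print(majority_class, round(baseline_accuracy, 4))
```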

# Analysis

Final Dataset Analysis Report (Titanic Dataset)

1. Dataset Overview

The Titanic dataset contains 891 records and 12 columns (the target plus 11 other columns), representing passenger information such as demographic details, ticket class, fare, and survival status. The dataset is commonly used for binary classification problems in machine learning.

2. Feature Types

Numerical Features: PassengerId, Age, SibSp, Parch, Fare

Categorical Features: Name, Sex, Ticket, Cabin, Embarked

Ordinal Feature: Pclass (1st, 2nd, 3rd class)

Binary Feature (Target): Survived (0 = Did not survive, 1 = Survived)

3. Target Variable & Input Features

Target Variable: Survived

Input Features: Pclass, Sex, Age, SibSp, Parch, Fare, Embarked

Columns such as PassengerId, Name, and Ticket do not directly contribute to prediction and may be excluded or feature-engineered.

4. Statistical Summary & Distribution

Average passenger age is ~29.7 years, with values ranging from 0.42 to 80 years.

The mean fare (32.20) is more than twice the median (14.45), and the maximum is 512.33, indicating a right-skewed distribution with high-fare outliers.

61.6% of passengers did not survive, while 38.4% survived, showing moderate class imbalance.

Male passengers make up 64.8% of the dataset, and most passengers embarked at Southampton (72.4% of the recorded Embarked values).
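The skew in Fare can be checked from the summary statistics alone. The sketch below applies Tukey's 1.5 × IQR rule to the quartiles printed by `df.describe().T`; the rule is a common convention chosen here for illustration, not something the notebook itself uses, and the variable names are assumptions.

```python
# Fare statistics copied from the df.describe().T output.
q1, median, q3 = 7.9104, 14.4542, 31.0
mean, maximum = 32.204208, 512.3292

iqr = q3 - q1                    # interquartile range
upper_fence = q3 + 1.5 * iqr     # Tukey's upper fence for outliers

print(f"IQR = {iqr:.4f}, upper fence = {upper_fence:.4f}")
print("mean above median:", mean > median)
print("max beyond fence:", maximum > upper_fence)
```

The upper fence works out to about 65.63, far below the maximum fare of 512.33, which is consistent with a small number of very expensive tickets pulling the mean above the median.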

5. Data Quality Issues

Missing Values:

Age: 177 missing values (about 20% of rows)

Cabin: 687 missing values (about 77% of rows)

Embarked: 2 missing values

Categorical features require encoding before model training.

Moderate class imbalance in the target variable (about 62% did not survive vs. 38% survived).
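A minimal sketch of how the missing values and categorical columns might be handled is given below. The choices are assumptions made for illustration, not steps taken in this notebook: median imputation for Age, most-frequent imputation for Embarked, and one-hot encoding via `pd.get_dummies`. The frame `sample` reuses the five rows shown by `df.tail()`, one of which has a missing Age.

```python
import numpy as np
import pandas as pd

# Sex, Age and Embarked for the five rows shown by df.tail().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Age": [27.0, 19.0, np.nan, 26.0, 32.0],
    "Embarked": ["S", "S", "S", "C", "Q"],
})

clean = sample.copy()

# Assumed strategy: fill numeric gaps with the median of observed values.
clean["Age"] = clean["Age"].fillna(clean["Age"].median())

# Assumed strategy: fill categorical gaps with the most frequent value
# (a no-op on these five rows, which have no missing Embarked).
clean["Embarked"] = clean["Embarked"].fillna(clean["Embarked"].mode()[0])

# One-hot encode the categorical columns as 0/1 integers.
encoded = pd.get_dummies(clean, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

Other strategies, such as group-wise imputation or ordinal encoding, may suit a given model better; the point is only that both steps must happen before training.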

6. Dataset Size & ML Suitability

With 891 rows, the dataset is suitable for traditional machine learning algorithms such as Logistic Regression, Decision Trees, and Random Forests. Deep learning models are a poor fit at this size: with so few examples they are prone to overfitting and are unlikely to outperform simpler methods.

7. Conclusion (ML Readiness)

The Titanic dataset is well-structured and appropriate for supervised machine learning tasks. Although preprocessing steps such as handling missing values, encoding categorical variables, and addressing imbalance are required, the dataset provides a strong foundation for building and evaluating classification models.