In [2]:
# ================================
# Task 1: Understanding Dataset & Data Types
# Dataset: Titanic
# ================================

import pandas as pd
import numpy as np
from IPython.display import display  # explicit import so the cell also runs outside a notebook

# 1. Load the dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# 2. Display first and last few records
print("First 5 rows of the dataset:")
display(df.head())

print("\nLast 5 rows of the dataset:")
display(df.tail())

# 3. Dataset structure and data types
print("\nDataset Information:")
df.info()

# 4. Statistical summary of numerical columns
print("\nStatistical Summary:")
display(df.describe())

# 5. Identify categorical columns and check unique values
categorical_columns = ['Sex', 'Embarked', 'Pclass']

for col in categorical_columns:
    print(f"\nUnique values in {col}:")
    print(df[col].unique())

# 6. Identify feature types manually
numerical_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Sex', 'Embarked']
ordinal_features = ['Pclass']
binary_features = ['Survived']

print("\nFeature Classification:")
print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)
print("Ordinal Features:", ordinal_features)
print("Binary Features:", binary_features)

# 7. Identify target variable and input features
target_variable = 'Survived'
input_features = df.drop(columns=[target_variable]).columns.tolist()

print("\nTarget Variable:", target_variable)
print("Input Features:", input_features)

# 8. Dataset size analysis
print("\nDataset Size:")
print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])

# 9. Data quality checks
print("\nMissing values in each column:")
print(df.isnull().sum())

# Check target imbalance
print("\nTarget Variable Distribution:")
print(df['Survived'].value_counts())


First 5 rows of the dataset:


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S



Last 5 rows of the dataset:


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q



Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Statistical Summary:


,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292



Unique values in Sex:
['male' 'female']

Unique values in Embarked:
['S' 'C' 'Q' nan]

Unique values in Pclass:
[3 1 2]

Feature Classification:
Numerical Features: ['Age', 'Fare', 'SibSp', 'Parch']
Categorical Features: ['Sex', 'Embarked']
Ordinal Features: ['Pclass']
Binary Features: ['Survived']

Target Variable: Survived
Input Features: ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Dataset Size:
Number of rows: 891
Number of columns: 12

Missing values in each column:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Target Variable Distribution:
Survived
0    549
1    342
Name: count, dtype: int64


Dataset Analysis Report – Titanic Dataset

Dataset Overview
The Titanic dataset contains information about 891 passengers with 12 attributes, including demographic details, ticket information, and survival status. Each row represents one passenger, and each column represents a specific feature.

Feature Types

Numerical Features: Age, Fare, SibSp, Parch

Categorical Features: Sex, Embarked

Ordinal Feature: Pclass (1st, 2nd, 3rd class with inherent order)

Binary Feature: Survived (Target Variable)
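
The manual classification above can be cross-checked with a quick per-column summary. The following is a minimal sketch, not part of the original notebook: the `sample` frame is a hypothetical four-row stand-in for a few Titanic columns, and `summarize_columns` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few Titanic columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
})

def summarize_columns(frame):
    """Report dtype, distinct non-null values, missing count, and numeric flag per column."""
    rows = []
    for name in frame.columns:
        column = frame[name]
        rows.append({
            "column": name,
            "dtype": str(column.dtype),
            "distinct": int(column.nunique(dropna=True)),
            "missing": int(column.isna().sum()),
            "numeric": pd.api.types.is_numeric_dtype(column),
        })
    return pd.DataFrame(rows).set_index("column")

summary = summarize_columns(sample)
print(summary)
```

A numeric column with very few distinct values, such as Pclass, is a hint that the dtype alone is misleading and that the column is better treated as categorical or ordinal. The summary cannot decide this by itself, which is why the feature types are still assigned manually.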

Target Variable
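The target variable is Survived, which indicates whether a passenger survived (1) or not (0). The remaining 11 columns are candidate input features, although PassengerId is only a row identifier and Name, Ticket, and Cabin are text fields that usually need feature engineering before a model can use them.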
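
Separating the target from the inputs might look like the sketch below. It assumes that identifier-like columns are dropped before modelling; the `non_predictive` list and the three-row `sample` frame are illustrative choices, not something the notebook prescribes.

```python
import pandas as pd

# Hypothetical three-row stand-in for the Titanic frame.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
})

target = "Survived"
# Assumption: these columns identify a passenger rather than describe one.
non_predictive = ["PassengerId", "Name"]

X = sample.drop(columns=[target] + non_predictive)  # input features
y = sample[target]                                  # target variable

print(list(X.columns))
print(y.tolist())
```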

Dataset Structure & Size

Number of rows: 891

Number of columns: 12

With 891 rows, the dataset is small by machine learning standards but sufficient for a binary classification exercise.

Data Quality & Missing Values
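Missing values are present in three columns (counts out of 891 rows):

Age: 177 missing (about 19.9%)

Cabin: 687 missing (about 77.1%)

Embarked: 2 missing (about 0.2%)

Because roughly three quarters of the Cabin values are missing, the column is a candidate for removal or for feature engineering (for example, a has-cabin indicator) during preprocessing. Age and Embarked have far fewer gaps, so imputation is usually a reasonable option for them.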
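
One possible way to handle these gaps is sketched below. It is an assumption-laden example rather than the notebook's method: median imputation for Age, most-frequent-value imputation for Embarked, and a hypothetical `HasCabin` indicator in place of Cabin are common choices, but not the only ones. The `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical four-row frame with the same kinds of gaps as the real data.
sample = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

cleaned = sample.copy()

# Numeric gap: fill with the median of the observed ages.
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Categorical gap: fill with the most frequent port.
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

# Mostly-missing column: keep only whether a cabin was recorded (assumed feature name).
cleaned["HasCabin"] = cleaned["Cabin"].notna().astype(int)
cleaned = cleaned.drop(columns=["Cabin"])

print(cleaned)
```

In a real pipeline the median and mode should be computed on the training split only and then applied to the test split, so that no information leaks from the held-out data.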

Data Distribution & Imbalance
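The target variable is moderately imbalanced: 549 passengers (about 61.6%) did not survive and 342 (about 38.4%) survived. A model that always predicts the majority class would therefore reach about 61.6% accuracy, so accuracy should be read against that baseline and complemented with metrics such as precision, recall, or F1.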
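
The class shares and the majority-class baseline follow directly from the `value_counts()` output. A small sketch using only the two counts printed above:

```python
# Counts copied from the "Target Variable Distribution" output.
counts = {0: 549, 1: 342}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```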

Conclusion
The dataset is suitable for machine learning after basic preprocessing such as handling missing values and encoding categorical features. Understanding the dataset structure and quality is essential before building predictive models.
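
As an illustration of the encoding step mentioned above, the sketch below one-hot encodes the nominal columns with `pd.get_dummies` and leaves Pclass as an integer so that its order is preserved. The three-row `sample` frame is hypothetical, and one-hot encoding is only one of several reasonable encodings.

```python
import pandas as pd

# Hypothetical three-row frame with the categorical columns from the report.
sample = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category for the nominal features;
# Pclass stays numeric because its values are ordered.
encoded = pd.get_dummies(sample, columns=["Sex", "Embarked"], dtype=int)

print(encoded)
```

For linear models, passing `drop_first=True` to `pd.get_dummies` removes one indicator per feature and avoids perfectly collinear columns.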