<a href="https://colab.research.google.com/github/Awino614/DATA-WRANGLING/blob/main/TITANIC_EXPLORATORY_DATA_ANALYSIS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

awino614_train_csv_file_path = kagglehub.dataset_download('awino614/train-csv-file')

print('Data source import complete.')


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# TITLE: EXPLORATORY DATA ANALYSIS
AUTHOR: DOROTHY AWINO ONGONGA
DATE: 13/6/2025
Exploratory Data Analysis (EDA) is the process of examining and visualizing data sets to summarize their main characteristics, often before applying any modeling techniques. It helps analysts and data scientists understand the structure, patterns, relationships, and anomalies in the data.

🔍 Key Purposes of EDA

* To gain insights into the data
* To check data quality (missing values, duplicates, data types)
* To identify patterns, trends, and outliers
* To form hypotheses or choose the right analytical models



In [None]:
# Import libraries
import pandas as pd  # Data manipulation
import numpy as np  # Numerical computations
import matplotlib.pyplot as plt  # Static plots
import seaborn as sns  # Statistical plots
import missingno as msno  # Missing data visualization

# Configuring Seaborn plot aesthetics
sns.set_theme(style='darkgrid', context='notebook')

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load Data
train = pd.read_csv("/kaggle/input/train-csv-file/train.csv")

In [None]:
# Preview the first 5 rows of the dataset
train.head()

In [None]:
# Preview the first 20 rows of the dataset
train.head(20)

### **Step 2: Checking the Dimensions of the Dataset**

The `.shape` attribute provides the dimensions of the dataset as a tuple, showing the number of rows and columns.

#### **Why Is This Important?**

* **Number of rows**: Reflects the volume of data available. Larger datasets may demand optimized memory management, whereas smaller datasets often require techniques like cross-validation due to limited samples.
* **Number of columns**: Represents the number of features (variables) available for analysis or model development.



In [None]:
# Get the number of rows and columns
print(f'The dataset has {train.shape[0]} rows and {train.shape[1]} columns.')

**🧭 Step 3: Overview of Columns and Data Types**
The .info() function provides a concise summary of the dataset's structure.
What Does the Output Reveal?
•	Column names: Displays the names of all features (columns) in the dataset.
•	Non-null counts: Shows how many values are present (i.e., not missing) in each column.
•	Data types: Indicates the data type of each column—such as int64, float64, or object.
💡 In the Titanic dataset, we observe:
•	5 columns are of type integer
•	2 columns are of type float
•	5 columns are of type object
This summary helps us determine whether each column’s data type aligns with its intended use. For instance, it's common for columns containing dates to be read as object types, even though they should be converted to datetime for accurate analysis.
Similarly, the Survived column, although stored as numeric values (0 and 1), actually represents categorical outcomes—0 meaning the passenger did not survive, and 1 meaning they did. Therefore, it would be more appropriate to treat it as a categorical variable rather than a numerical one, especially during analysis and modeling.
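
As a minimal sketch of the dtype fixes described above, the toy frame below shows a text column being converted to datetime and a 0/1 column being converted to a category. The values are made up, and the `boarding_date` column is hypothetical (the Titanic data has no date column).

```python
import pandas as pd

# Toy data for illustration only; 'boarding_date' is a hypothetical column.
toy = pd.DataFrame({
    "boarding_date": ["1912-04-10", "1912-04-10", "1912-04-11"],
    "Survived": [0, 1, 1],
})

# Dates read from text arrive as plain strings (object dtype).
toy["boarding_date"] = pd.to_datetime(toy["boarding_date"], format="%Y-%m-%d")

# A 0/1 outcome is a label, not a quantity, so store it as a category.
toy["Survived"] = toy["Survived"].astype("category")

print(toy.dtypes)
```

Once a column is a true datetime, accessors such as `.dt.day` become available, which is not the case for plain strings.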


In [None]:
# Get an overview of the dataset’s columns and their data types
train.info()

In [None]:
# Converting data types
train['Survived'] = train['Survived'].astype('category')
train['Pclass'] = train['Pclass'].astype('category')
train['Sex'] = train['Sex'].astype('category')
train['Cabin'] = train['Cabin'].astype('category')
train['Embarked'] = train['Embarked'].astype('category')

In [None]:
train.info()

Step 4: Statistical Summary of Numerical Features
The .describe() function provides a quick overview of summary statistics for all numerical columns in the dataset.
What Does .describe() Show?
•	Count: Total number of non-missing entries in the column
•	Mean: The average value
•	Standard Deviation (std): Measures how much the values deviate from the mean
•	Min/Max: The smallest and largest values in the column
•	25%, 50%, 75%: Percentiles (quartiles); the 50% value represents the median
How to Interpret These Statistics Effectively:
•	Check for unusual values: For example, the Age column should realistically fall between 0 and 100. Values outside this range (e.g., negative numbers or extremely high ages) may indicate data quality issues.
•	Assess skewness: If the mean and median differ significantly, the distribution might be skewed—this is often the case with columns like Fare.
•	Spot potential errors: Extremely high or low values could be outliers or data entry mistakes and should be investigated.
💡 In the Titanic dataset, the PassengerId column is simply an identifier and does not carry any analytical value. It can be excluded from most forms of analysis.
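
The skewness check mentioned above can be sketched on a handful of made-up fares (not taken from the dataset). A mean well above the median, together with a positive skew statistic, points to a long right tail.

```python
import pandas as pd

# Made-up fares: most are small, one is very large (right-skewed).
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 512.33])

mean_fare = fares.mean()
median_fare = fares.median()
skewness = fares.skew()  # a value above 0 suggests a long right tail

print(f"mean={mean_fare:.2f}, median={median_fare:.2f}, skew={skewness:.2f}")
```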


In [None]:
# Summary statistics for numerical columns
train.describe().T

In [None]:
# Drop the PassengerId column
train = train.drop(columns=['PassengerId'])

In [None]:
# Drop the name column
train = train.drop(columns=['Name'])

In [None]:
# Summary statistics for numerical columns
train.describe().T

In [None]:
# List column names
train.columns

🧭 Step 6: Checking Unique Values per Column
To examine how many distinct values each column contains, use the .nunique() function.

Why Is This Useful?
Helps identify categorical features with a limited number of unique values—such as Sex, which typically includes just "male" and "female".

Helps detect identifier columns like PassengerId, which often have a unique value for every row and are not useful for predictive modeling.

💡 In the Titanic dataset, the PassengerId column contained 891 unique values, which matches the number of rows and confirms it is an identifier. Similarly, each passenger has a unique Name, which supports its use for identification rather than analysis. Both columns were dropped earlier, so they no longer appear in the output of the cell below.
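
One possible way to automate the identifier check is to compare each column's unique count with the number of rows. The sketch below uses a small made-up frame, and the variable name `id_like` is only illustrative.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Pclass": [3, 1, 3, 2],
})

unique_counts = toy.nunique()

# Columns whose unique count equals the row count behave like identifiers.
id_like = unique_counts[unique_counts == len(toy)].index.tolist()
print(id_like)
```

Treat the result as a list of candidates only: a continuous measurement can also be unique in every row without being an identifier.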

In [None]:
# Count the unique values in each column
train.nunique()

In [None]:
# You can also use the following function to list the unique values in each column, which helps reveal inconsistencies.
# Function to display unique values for categorical variables
def show_unique_values(train):
    # Select columns with object or categorical data types
    # (several columns were converted to 'category' earlier, so both dtypes are needed)
    categorical_columns = train.select_dtypes(include=['object', 'category']).columns
    print(f'Categorical columns: {list(categorical_columns)}\n')

    # Iterate over each categorical column and print unique values
    for col in categorical_columns:
        print(f"Unique values in '{col}': {train[col].unique()}\n")

# Display unique values for all categorical columns in the dataset
show_unique_values(train)

Step 3: Handling Missing Values
🚩 Why Are Missing Values Significant?
Missing values are a common challenge in real-world datasets and can arise due to various reasons, such as:

Human errors during data entry or collection

Technical issues, including data corruption or transmission failures

Incomplete information, such as passengers on the Titanic not providing cabin details

Failing to address missing values can lead to:

Biased analysis – Missing data can distort statistical summaries and distributions

Computation errors – Some machine learning models cannot process missing values and may produce errors

Loss of information – Carelessly removing rows or columns may result in unnecessary data loss

🛠 Common Strategies for Handling Missing Data

| Approach | Description | Best Used When |
| --- | --- | --- |
| Drop Data | Remove rows or columns containing missing values | When the proportion of missing data is small and the information is non-essential |
| Impute Data | Replace missing values with estimates (e.g., mean, median, mode, etc.) | When there's enough data to generate reliable estimates |
| Flag Missing | Create a new column indicating whether data was missing | When the absence of data itself might carry important meaning |




🧭 Step 3.1: Visualizing Missing Data
Before choosing a strategy, it's helpful to visualize where missing values occur in the dataset. One way to do this is with the msno.bar(train) function, which draws one bar per column showing how many values are present.

What Does This Visualization Show?
The height of each bar reflects the number of non-missing values in that column

Shorter bars indicate columns with a higher proportion of missing data (the related msno.matrix plot shows individual missing entries as white gaps)

This allows for quick identification of problematic variables

💡 In the Titanic dataset, the visualization reveals missing values in the Cabin and Age columns, and a few missing entries in the Embarked column.


In [None]:
# Visualize missing data using missingno library
import missingno as msno
msno.bar(train)

🧭 Step 3.2: Identifying Missing Values
To determine the number of missing values in each column, you can use functions like .isnull().sum().

💡 Tip: Columns with over 50% missing values are typically candidates for removal, as they may lack sufficient data for reliable analysis.

In the Titanic dataset, for instance, the Cabin column has approximately 77% missing values, making it difficult to extract meaningful insights from this feature.
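
The 50% rule of thumb can be sketched as a simple filter on the share of missing values per column. The frame below is made up, and the `threshold` value is a judgment call rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_share = toy.isnull().mean()

threshold = 0.5  # rule-of-thumb cutoff, adjust to the analysis at hand
drop_candidates = missing_share[missing_share > threshold].index.tolist()
print(drop_candidates)
```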

In [None]:
# Count the number of missing values in each column
missing_values = train.isnull().sum().sort_values(ascending=False)
missing_percentage = (missing_values / len(train)) * 100
print(pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage}))

🧭 Step 3: Dropping Missing Data
If a column has a high percentage of missing values, you can drop it:

When to Use?
When the column or row isn’t critical to your analysis.
When removing the data won’t significantly reduce the size of your dataset.
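
To see how much data a row-wise drop would cost, it can help to compare `dropna()` with its `thresh` parameter, which keeps rows that have at least a given number of non-missing values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Fare": [7.25, 71.28, np.nan],
    "Embarked": ["S", "C", np.nan],
})

# Drops every row that has at least one missing value.
any_missing_dropped = toy.dropna()

# Keeps rows with at least 2 non-missing values, so only very sparse rows go.
sparse_rows_dropped = toy.dropna(thresh=2)

print(len(toy), len(any_missing_dropped), len(sparse_rows_dropped))
```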

In [None]:
# Drop a column with too many missing values
train = train.drop(columns=['Cabin'])

# Or, to drop every row that contains at least one missing value:
# train = train.dropna()

🧭 Step 4: Imputing Missing Values
Instead of removing rows or columns with missing data, you can fill in the gaps using various imputation techniques:

Mean Imputation: Ideal for numerical features with a normal (bell-shaped) distribution.

Median Imputation: More effective for skewed data or when outliers are present, as it is less sensitive to extreme values.

Mode Imputation: Best suited for categorical variables, where the most frequently occurring value is used to fill in the missing entries.
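
The difference between the three techniques is easiest to see on a small made-up example with one extreme value. The sketch below assigns the result of `fillna` back rather than filling in place.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; 500.0 plays the role of an extreme fare.
toy = pd.DataFrame({
    "Fare": [8.0, 10.0, 12.0, 500.0, np.nan],
    "Embarked": ["S", "S", "C", np.nan, "Q"],
})

mean_fill = toy["Fare"].fillna(toy["Fare"].mean())      # pulled up by 500.0
median_fill = toy["Fare"].fillna(toy["Fare"].median())  # stays near typical fares
mode_fill = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(mean_fill.iloc[-1], median_fill.iloc[-1], mode_fill.iloc[3])
```

Here the mean-imputed fare lands far above every typical fare, while the median-imputed one sits among them.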

In [None]:
# Fill missing values in the 'Age' column with the mean age
# (assigning the result back avoids the chained inplace call, which newer pandas versions warn about)
train['Age'] = train['Age'].fillna(train['Age'].mean())

# Fill missing values in the 'Fare' column with the median
train['Fare'] = train['Fare'].fillna(train['Fare'].median())

# Fill missing values in the 'Embarked' column with the most common value (mode)
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

🧭 Step 5: Flagging Missing Values
Rather than deleting or filling in missing values, you can create a new column to indicate where data is missing.

This approach is helpful when the absence of data could carry important information. For instance, in the Titanic dataset, passengers with missing Cabin values may have been assigned to a different part of the ship—potentially influencing survival outcomes.
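
Flagging can be combined with imputation, as long as the flag is created first. The sketch below uses made-up ages, and the column name `Age_missing_flag` is only an illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan]})

# Record missingness first; after imputation that information is gone.
toy["Age_missing_flag"] = toy["Age"].isnull().astype(int)
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

print(toy)
```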

In [None]:
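# Create a new column indicating missing values for 'Cabin'
# Note: 'Cabin' was dropped in an earlier cell, so this line only works if it runs before that drop
# train['Cabin_missing_flag'] = train['Cabin'].isnull().astype(int)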

Step 2: Univariate Analysis
🎯 What Is Univariate Analysis?
Univariate analysis focuses on examining a single variable at a time to understand its characteristics, including:

Distribution (e.g., normal, skewed)

Central tendency (mean, median, mode)

Spread or variability (range, variance, standard deviation)

This type of analysis helps answer questions such as:

What is the age distribution of passengers?

How are passengers distributed across embarkation points?

Are ticket prices evenly distributed or skewed?

🧭 Step 2.1: Analyzing Numerical Features
When working with numerical columns, the goal is to explore each one individually to understand its shape and patterns.

Common Plots for Numerical Data:
Histogram – Shows how frequently values occur across intervals.

KDE Plot (Kernel Density Estimate) – A smooth curve that estimates the probability distribution of the data.

Boxplot – Displays the range, median, and identifies outliers.

Using Histograms
Histograms are ideal for visualizing the distribution of numeric variables such as Age, Fare, etc.

What to Look For:

Peaks indicate the most common value ranges (e.g., age groups).

Gaps suggest missing or infrequent values.

Skewness helps determine whether the data is symmetrical or biased to one side.

💡 Tip: You can enhance a histogram by adding a density curve using sns.histplot(data, kde=True) to better visualize the distribution.



In [None]:
# Histogram for Age
plt.figure(figsize=(8, 5))
sns.histplot(train['Age'].dropna(), bins=30, kde=True)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

2. KDE Plot (Smoothed Distribution)
The KDE plot shows the probability density function of a numerical column:

When to Use KDE Plots?:
When you want to visualize the shape of the data distribution more smoothly than a histogram.
Great for detecting skewness and multimodal distributions (multiple peaks).

In [None]:
# KDE Plot for Fare
plt.figure(figsize=(8, 5))
sns.kdeplot(train['Fare'], fill=True)  # 'fill' replaces the deprecated 'shade' argument
plt.title('KDE Plot of Fare')
plt.xlabel('Fare')
plt.show()

3. Boxplot (Detecting Outliers)
Boxplots visualize the minimum, lower quartile (25%), median, upper quartile (75%), and maximum values:

Interpretation:
Line in the middle: Median (50% value).
Box edges: 25th percentile (Q1) and 75th percentile (Q3).
Whiskers: Minimum and maximum values (excluding outliers).
Dots outside the whiskers: Outliers.
💡 Tip: If the boxplot has a long "tail" or many outliers, the data might be skewed.
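
The numbers behind a boxplot can be computed directly. The sketch below uses made-up values and assumes the common convention in which whiskers reach the most extreme points within 1.5 * IQR of the box.

```python
import pandas as pd

# Made-up values for illustration only.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme points still inside the 1.5 * IQR fences.
inside = values[(values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything beyond the whiskers would be drawn as an individual dot.
outliers = values[(values < lower_whisker) | (values > upper_whisker)]
print(q1, median, q3, lower_whisker, upper_whisker, outliers.tolist())
```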

In [None]:
# Boxplot for Fare
plt.figure(figsize=(8, 5))
sns.boxplot(x=train['Fare'])
plt.title('Boxplot of Fare')
plt.show()

🧭 Step 2: Univariate Analysis for Categorical Columns
There are common visualizations used when we are dealing with categorical columns:

Countplot: Shows the frequency count of each category.
Pie Chart: Shows proportions of categories in a pie format (less commonly used in data science).
1. Countplot
Countplots are used to count the frequency of each category in a column:

Interpretation:
Bars represent categories: Higher bars indicate more frequent categories.
Detect class imbalance (e.g., if most passengers embarked from "S", your data is imbalanced).
💡 Tip: You can add hue='Survived' to compare survival counts across embarkation points.

In [None]:
# Countplot for Embarked
plt.figure(figsize=(8, 5))
sns.countplot(x='Embarked', data=train, palette='pastel')
plt.title('Countplot of Embarked')
plt.xlabel('Embarkation Port')
plt.ylabel('Count')
plt.show()

2. Pie Chart
While pie charts are visually appealing, they’re generally less informative than bar charts for categorical data.

Why Use a Pie Chart?
Useful for displaying proportions (e.g., percentage of males vs. females).
Avoid using them when you have more than 3-4 categories.

In [None]:
# Pie chart for Sex distribution
train['Sex'].value_counts().plot.pie(autopct='%1.1f%%', figsize=(6, 6), colors=['#ff9999', '#66b3ff'])
plt.title('Sex Distribution')
plt.show()

🧭 Step 3: Summary Statistics for Categorical Variables
In addition to plots, you can also view summary statistics for categorical columns using .value_counts():

Interpretation:
Class "3" had the most passengers, followed by "1" and "2".
value_counts() helps detect any rare categories.
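
Proportions are often easier to read than raw counts when looking for rare categories. A minimal sketch on made-up class labels, where the 15% cutoff is an arbitrary example:

```python
import pandas as pd

# Made-up class labels for illustration only.
pclass = pd.Series([3, 3, 3, 3, 3, 3, 1, 1, 1, 2])

counts = pclass.value_counts()                # absolute frequencies
shares = pclass.value_counts(normalize=True)  # proportions that sum to 1

# Flag categories below a chosen share; the cutoff is arbitrary.
rare = shares[shares < 0.15].index.tolist()
print(counts, shares, rare, sep="\n")
```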

In [None]:
# Frequency count of unique values in the 'Pclass' column
print(train['Pclass'].value_counts())

STEP 3: 🎯 UNDERSTANDING BIVARIATE ANALYSIS
Bivariate analysis examines the relationship between two variables at a time. It helps uncover patterns, associations, or differences that aren't visible when analyzing variables individually.
It allows you to explore questions like:
•	Does fare vary across different passenger classes (Pclass)?
•	Are younger passengers more likely to survive?
•	Does the port of embarkation influence survival rates?
By analyzing two variables together, you can detect correlations, group differences, and underlying trends.
________________________________________
🛠 Choosing the Right Visuals for Bivariate Analysis

| Type of Variables | Recommended Visualizations | Example |
| --- | --- | --- |
| Numerical vs Numerical | Scatter plots, Correlation heatmaps | Age vs Fare |
| Numerical vs Categorical | Boxplots, Violin plots, Bar plots | Fare vs Pclass, Age vs Survived |
| Categorical vs Categorical | Grouped bar plots, Mosaic plots, Countplots | Pclass vs Survived, Embarked vs Sex |
________________________________________
🧭 Step 3.1: Numerical vs Numerical Analysis
📊 Scatter Plot
Scatter plots are ideal for visualizing the relationship between two numerical features.
What to Look For:
•	Clusters may reveal subgroups (e.g., younger passengers typically paid lower fares).
•	Use hue or color to add a third variable, such as survival status, for deeper insights.
•	Linear patterns suggest a potential correlation between the two variables.


In [None]:
# Scatter plot for Age vs Fare
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Age', y='Fare', data=train, hue='Survived', palette='coolwarm')
plt.title('Scatter Plot of Age vs Fare (Colored by Survived)')
plt.show()

2. Correlation Heatmap
A correlation heatmap shows the strength and direction of relationships between numerical variables:

What the Heatmap Shows:
Positive correlations (closer to 1): Variables increase together (e.g., SibSp and Parch).
Negative correlations (closer to -1): One variable decreases as the other increases.
Values near 0: No clear relationship between variables.
💡 Tip: Correlation is only meaningful for numerical columns.
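
Each cell of the heatmap holds a Pearson correlation coefficient, which is the covariance of two columns divided by the product of their standard deviations. The sketch below computes it by hand on made-up values and compares it with pandas.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0],
                    "y": [2.0, 4.1, 5.9, 8.2, 9.8]})

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy["x"].corr(toy["y"])  # pandas uses Pearson by default
print(round(r_manual, 4), round(r_pandas, 4))
```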

In [None]:
# Correlation heatmap for numerical columns only
plt.figure(figsize=(8, 6))
numerical_columns = train.select_dtypes(include=['int64', 'float64']).columns  # Select only numerical columns
sns.heatmap(train[numerical_columns].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

🧭 Step 2: Numerical vs Categorical Analysis
1. Boxplot
Boxplots are great for visualizing the distribution of numerical values grouped by categories:

Interpretation:
Length of the box: Shows the interquartile range (IQR).
Line inside the box: Median fare for each Pclass.
Outliers may indicate passengers who paid abnormally high fares (luxury tickets).
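
The quartiles that each box is drawn from can also be tabulated per group. A minimal sketch on made-up fares:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Fare": [80.0, 60.0, 500.0, 13.0, 26.0, 7.25, 8.05, 7.9],
})

# Per-class quartiles: the 50% column is the median line inside each box.
summary = toy.groupby("Pclass")["Fare"].describe()[["25%", "50%", "75%"]]
print(summary)
```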

In [None]:
# Boxplot of Fare grouped by Pclass
plt.figure(figsize=(8, 5))
sns.boxplot(x='Pclass', y='Fare', data=train, palette='Set2')
plt.title('Boxplot of Fare by Pclass')
plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.show()

2. Violin Plot
A violin plot is similar to a boxplot but also shows the density of the data:

Interpretation:
Wider parts of the plot show where data points are concentrated.
Use violin plots when you want to see both distribution and summary statistics.

In [None]:
# Violin plot of Age grouped by Survived
plt.figure(figsize=(8, 5))
# split=True is intended for use with a hue variable, so it is omitted here
sns.violinplot(x='Survived', y='Age', data=train, palette='muted')
plt.title('Violin Plot of Age by Survival')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Age')
plt.show()

🧭 Step 3: Categorical vs Categorical Analysis
1. Grouped Bar Plot
Grouped bar plots show the count or proportion of one category for each level of another category:

Interpretation:
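This plot shows how survival counts differ across embarkation ports (Embarked).
Large differences in the ratio of survivors to non-survivors between ports suggest that embarkation location is associated with survival chances, although other factors such as passenger class may explain part of the gap.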
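
Because ports differ in passenger numbers, rates are easier to compare than raw counts. One way to get them is `pd.crosstab` with `normalize="index"`, sketched here on made-up data:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "Q", "Q"],
    "Survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

counts = pd.crosstab(toy["Embarked"], toy["Survived"])

# normalize="index" turns each row into proportions, i.e. a rate per port.
rates = pd.crosstab(toy["Embarked"], toy["Survived"], normalize="index")
print(counts, rates, sep="\n\n")
```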

In [None]:
# Grouped bar plot of Survived vs Embarked
plt.figure(figsize=(8, 5))
sns.countplot(x='Embarked', hue='Survived', data=train, palette='pastel')
plt.title('Survival Counts by Embarked Port')
plt.xlabel('Embarked Port')
plt.ylabel('Count')
plt.show()

2. Mosaic Plot
Mosaic plots show the proportion of categories across different groups:

Interpretation:
Larger blocks indicate more frequent category combinations (e.g., "Pclass 3 + Did Not Survive" may have a large block).
Helps detect imbalances between groups.

In [None]:
# Install required library for mosaic plot
# !pip install statsmodels
from statsmodels.graphics.mosaicplot import mosaic

# Mosaic plot of Pclass vs Survived
# mosaic() draws on a new figure unless an Axes is passed, so create one explicitly
fig, ax = plt.subplots(figsize=(10, 10))
mosaic(train, ['Pclass', 'Survived'], ax=ax, title='Mosaic Plot of Pclass vs Survived')
plt.show()

STEP 3: MULTIVARIATE ANALYSIS
________________________________________
🎯 What is Multivariate Analysis?
Multivariate analysis examines three or more variables at the same time to uncover deeper insights and complex relationships. It helps answer more sophisticated questions, such as:
•	How do passenger class (Pclass), age, and fare together influence survival?
•	Do survival rates vary by embarkation point when also considering passenger class?
By analyzing multiple variables simultaneously, we can identify interactions, combined effects, and hidden patterns that are often missed in simpler two-variable (bivariate) analyses.
________________________________________
🛠 Common Techniques for Multivariate Analysis

| Visualization Tool | Use Case | Example |
| --- | --- | --- |
| Pair Plot | Compare several numerical variables at once | Age, Fare, and Survived relationships |
| FacetGrid | Plot data by subgroups | Age vs Fare split by survival status |
| Heatmap | Show correlation across numeric features | Correlation matrix of all numeric columns |
| 3D Scatter Plot | Visualize interactions in three dimensions | Age vs Fare vs Survived |
________________________________________
🧭 Step 1: Using Pair Plots
Pair plots display relationships between pairs of numerical features while also showing the distribution of each individual feature.
How to Interpret:
•	Diagonal: KDE (density) plots that show the distribution of single variables.
•	Off-diagonal: Scatter plots showing pairwise relationships, e.g., Age vs Fare.
•	Hue (color): Can be used to distinguish between categories, such as survival status.
💡 Tip: Pair plots are excellent for spotting correlations, clusters, and trends across multiple features at once.


In [None]:
# Pair plot for numerical columns
# pairplot creates its own figure, so size it with `height` instead of plt.figure
sns.pairplot(train, hue='Survived', diag_kind='kde', palette='coolwarm', height=2.5)
plt.show()

🧭 Step 2: FacetGrid (Subplots for Subgroups)
FacetGrid creates multiple subplots for different subsets of data based on categorical variables:

Interpretation:
This plot shows how the age distribution differs based on Survived (columns) and Pclass (rows).
You can see patterns like younger passengers in Pclass 1 having higher survival rates.
💡 Tip: FacetGrid is useful for detecting interactions between multiple variables.

In [None]:
# FacetGrid for Age distribution by Survived and Pclass
g = sns.FacetGrid(train, col='Survived', row='Pclass', height=4, aspect=1.5)
g.map(sns.histplot, 'Age', kde=True)
plt.show()

🧭 Step 3: Correlation Heatmap for Numerical Columns
A heatmap shows the correlation between multiple numerical variables:

What the Heatmap Shows:
Strong positive correlations (closer to +1) indicate that variables increase together.
Strong negative correlations (closer to -1) indicate that as one variable increases, the other decreases.
Correlation values close to zero indicate no strong relationship.

In [None]:
# Heatmap of numerical features only
plt.figure(figsize=(8, 6))
numerical_columns = train.select_dtypes(include=['int64', 'float64']).columns  # Select only numerical columns
sns.heatmap(train[numerical_columns].corr(), annot=True, cmap='Blues', fmt='.2f')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

🧭 Step 4: 3D Scatter Plot
A 3D scatter plot helps visualize three numerical variables at once:

Interpretation:
The X, Y, and Z axes represent three numerical variables.
The color represents categories (e.g., passenger class).
The size of points can represent another variable (e.g., Fare).
💡 Tip: 3D scatter plots are useful for detecting interactions and clusters but can be hard to interpret for large datasets.

In [None]:
# 3D scatter plot for Age, Fare, and Survived
import plotly.express as px

fig = px.scatter_3d(train, x='Age', y='Fare', z='Survived', color='Pclass', size='Fare', opacity=0.7)
fig.update_traces(marker=dict(line=dict(width=0)))
fig.update_layout(title='3D Scatter Plot: Age vs Fare vs Survived')
fig.show()

Step 4: Outlier Detection and Handling
🎯 What Are Outliers?
Outliers are data points that deviate significantly from the rest of the dataset. These can arise due to:

Genuine anomalies (e.g., wealthy Titanic passengers who paid unusually high fares),

Data entry mistakes (e.g., recording someone's age as 500),

Unique cases that deserve further analysis (e.g., individuals who survived despite being in high-risk groups).

Outliers can negatively impact your analysis by:

Distorting summary statistics like the mean and standard deviation,

Affecting model accuracy, either by causing overfitting or masking true patterns.

🛠 Common Techniques for Outlier Detection

| Method | Description | Best Used When |
| --- | --- | --- |
| Boxplot | Graphical representation of data distribution showing outliers as individual points | Suitable for small to medium datasets |
| Z-score | Measures how many standard deviations a point is from the mean | Ideal for normally distributed data |
| IQR (Interquartile Range) | Flags outliers using the range between the 25th (Q1) and 75th (Q3) percentiles | Useful for skewed distributions |

🧭 Step 1: Identifying Outliers with Boxplots
A boxplot provides a summary of a feature’s distribution, including:

Minimum and maximum values

Lower (Q1) and upper (Q3) quartiles

Median

Outliers (plotted as dots beyond the whiskers)

How to Interpret:

Points outside the "whiskers" are considered potential outliers.

If the plot has a long tail or numerous dots, it suggests a feature with extreme values.

💡 Tip: For skewed data, boxplots may show many outliers. Consider transforming the data or using more robust methods if needed.



In [None]:
# Boxplot for Fare to detect outliers
plt.figure(figsize=(8, 5))
sns.boxplot(x=train['Fare'], palette='pastel')
plt.title('Boxplot of Fare')
plt.show()

🧭 Step 2: Detecting Outliers Using Z-Score
A Z-score measures how many standard deviations a data point lies from the mean of its column.

Interpretation:
Data points whose absolute Z-score exceeds the threshold (typically 3) are considered outliers.
A higher threshold (e.g., 4) detects fewer outliers, while a lower threshold (e.g., 2.5) detects more.
💡 Tip: Z-scores work best for normally distributed data. If the data is skewed, consider using IQR instead.
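
The calculation itself is short enough to write out by hand. The sketch below uses made-up ages and NumPy's default population standard deviation.

```python
import numpy as np

# Made-up ages for illustration only; 95 is the deliberately extreme value.
ages = np.array([22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 95], dtype=float)

# z = (x - mean) / standard deviation
z_scores = (ages - ages.mean()) / ages.std()

threshold = 3
flagged = ages[np.abs(z_scores) > threshold]
print(flagged)
```

Note that the extreme value also inflates the mean and standard deviation it is measured against, which is one reason the method is less reliable on skewed data.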

In [None]:
# Function to detect outliers using Z-score
from scipy.stats import zscore

def detect_outliers_zscore(data, threshold=3):
    clean = data.dropna()  # Drop NaN so the scores and the data stay aligned
    z_scores = zscore(clean)
    outliers = clean[abs(z_scores) > threshold]
    return outliers

# Detect outliers in the 'Age' column
outliers_age = detect_outliers_zscore(train['Age'])
print(f'Number of outliers in Age: {len(outliers_age)}')

🧭 Step 3: Detecting Outliers Using IQR (Interquartile Range)
The IQR method identifies outliers as data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1:

Interpretation:
Lower bound: Q1 - 1.5 * IQR (minimum expected value).
Upper bound: Q3 + 1.5 * IQR (maximum expected value).
Data points outside this range are considered outliers.

In [None]:
# Function to detect outliers using IQR
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    return outliers

# Detect outliers in the 'Fare' column using IQR
outliers_fare = detect_outliers_iqr(train['Fare'])
print(f'Number of outliers in Fare: {len(outliers_fare)}')

🧭 Step 4: Handling Outliers
Once outliers are detected, you can handle them using the following approaches:

Remove Outliers: Remove rows containing outliers
Cap Outliers: Cap values at the upper and lower bounds
Impute Outliers: Replace outliers with mean or median values
Leave Outliers: In some cases (e.g., fraud detection, rare event analysis), outliers contain meaningful information and should be kept.
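
The first two options can be sketched with the IQR bounds as follows. The fares are made up, and whether to remove or cap depends on the analysis.

```python
import pandas as pd

# Made-up fares for illustration only.
fares = pd.Series([7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 300.0])

q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = fares[fares.between(lower, upper)]   # option 1: drop outlier rows
capped = fares.clip(lower=lower, upper=upper)  # option 2: cap at the bounds
print(len(removed), capped.max())
```

Capping keeps the row count unchanged, which matters when other columns in the same rows are still needed.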

In [None]:
# Remove outliers in the 'Fare' column
#train = train[(train['Fare'] >= train['Fare'].quantile(0.25) - 1.5 * (train['Fare'].quantile(0.75) - train['Fare'].quantile(0.25))) &
        #(train['Fare'] <= train['Fare'].quantile(0.75) + 1.5 * (train['Fare'].quantile(0.75) - train['Fare'].quantile(0.25)))]

In [None]:
# Cap outliers in the 'Fare' column
#train['Fare'] = train['Fare'].clip(lower=train['Fare'].quantile(0.05), upper=train['Fare'].quantile(0.95))

In [None]:
# Impute outliers with the median
#train['Fare'] = train['Fare'].mask((train['Fare'] < train['Fare'].quantile(0.05)) | (train['Fare'] > train['Fare'].quantile(0.95)), train['Fare'].median())


STEP 5: EXPLORING THE TARGET VARIABLE – UNDERSTANDING ‘SURVIVED’

🎯 What Is Target Variable Exploration?
In the Titanic dataset, the target variable is Survived, which indicates whether a passenger lived (1) or died (0). Analyzing this variable is crucial as it helps:
•	Determine whether the dataset is balanced or imbalanced.
•	Identify patterns or factors (e.g., age, gender, class, embarkation point) that may influence survival.

🧭 Step 1: Visualizing the Distribution of ‘Survived’
1. Countplot
A countplot is effective for displaying the frequency of each class in a categorical variable—perfect for a binary target like Survived.
Interpretation:
•	The height of each bar represents the number of passengers who survived (1) or didn’t (0).
•	A significant difference in bar heights suggests a class imbalance, which could impact model training and evaluation.
💡 Tip: When dealing with imbalanced classes in machine learning, consider strategies such as resampling, or use metrics like F1-score, precision, and recall, instead of relying solely on accuracy.
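
A small made-up example shows why accuracy alone can mislead on imbalanced data: a "model" that always predicts the majority class scores well on accuracy while finding none of the positive cases.

```python
import pandas as pd

# Made-up labels for illustration only: 9 negatives, 1 positive.
y_true = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

class_share = y_true.value_counts(normalize=True)

# A baseline that always predicts the majority class.
y_pred = pd.Series([class_share.idxmax()] * len(y_true))

accuracy = (y_pred == y_true).mean()
true_pos = ((y_pred == 1) & (y_true == 1)).sum()
recall = true_pos / (y_true == 1).sum()
print(accuracy, recall)
```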




In [None]:
# Countplot for Survived
plt.figure(figsize=(8, 5))
sns.countplot(x='Survived', data=train, palette='Set2')
plt.title('Survival Count')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

🧭 Step 2: Survival Rate by numerical columns
Let’s compare the age distributions of survivors and non-survivors using KDE plots:

Interpretation:
The green curve shows the distribution of ages for survivors, while the red curve shows ages for non-survivors.
Peaks in the curves indicate common age ranges for each group.
Younger passengers may have had a higher survival rate, following the "women and children first" rule.
💡 Tip: If you see missing values for Age, consider imputing them before plotting.

In [None]:
# KDE Plot for Age by Survival Status
plt.figure(figsize=(8, 5))
# 'fill' replaces the deprecated 'shade' argument
sns.kdeplot(train[train['Survived'] == 1]['Age'], fill=True, label='Survived', color='green')
sns.kdeplot(train[train['Survived'] == 0]['Age'], fill=True, label='Did Not Survive', color='red')
plt.title('Age Distribution by Survival Status')
plt.xlabel('Age')
plt.legend()
plt.show()

🧭 Step 3: Survival Rate by categorical columns
Survival rates may differ significantly between males and females. Let’s visualize this relationship:

Interpretation:
The plot shows survival counts grouped by gender.
Titanic survival famously followed the "women and children first" protocol, so you may see higher survival rates for females.
💡 Calculation Tip: You can calculate the survival rate for each gender.

In [None]:
# Countplot for Survived grouped by Gender
plt.figure(figsize=(8, 5))
sns.countplot(x='Survived', hue='Sex', data=train, palette='muted')
plt.title('Survival Counts by Gender')
plt.xlabel('Survived (0 = No, 1 = Yes)')
plt.ylabel('Count')
plt.show()

In [None]:
# Calculate survival rate by gender
gender_survival_rate = train.groupby('Sex')['Survived'].apply(lambda x: (x == 1).mean() * 100)
print(gender_survival_rate)

🧭 Step 4: Combined Analysis (Gender, Class, and Survival)
To explore survival rates based on multiple variables at once (e.g., gender and passenger class):

Interpretation:
This plot shows survival counts broken down by gender and passenger class. Look for patterns like:

High survival counts for first-class females.
Low survival counts for third-class males.
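
Counts depend on how many passengers each group contained, so a table of rates can complement the plot. The sketch below uses made-up data with a numeric 0/1 `Survived` column.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Pclass": [1, 1, 3, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate_table = toy.pivot_table(index="Sex", columns="Pclass",
                             values="Survived", aggfunc="mean")
print(rate_table)
```

In this notebook `Survived` was converted to a category earlier, so it would need `.astype(int)` before taking a mean.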

In [None]:
# Grouped bar plot for survival by Gender and Class
plt.figure(figsize=(10, 6))
sns.countplot(x='Pclass', hue='Sex', data=train[train['Survived'] == 1], palette='Set1')
plt.title('Survivors by Gender and Class')
plt.xlabel('Passenger Class')
plt.ylabel('Survivor Count')
plt.show()