
# 🧭 Lecture 6: Data Analytics Concepts & Tasks
**Module:** 6 – Data Analytics  
**Duration:** 2 hours  
**Dataset:** Titanic Dataset  

---

## 🎯 Learning Objectives
- Understand **Descriptive vs Predictive Analytics**
- Identify and apply **common data analytics tasks**
- Perform **data preprocessing** and **integration**
- Conduct **data analysis** using Pandas
- Recognize **data ethics** and **bias reduction** practices

---

## 🧩 1. Introduction — The Analytics Mindset

Data analytics means **extracting knowledge and insight from data**.  
This lecture covers **8 common analytical tasks** (*adapted from Provost & Fawcett, Data Science for Business, which also lists causal modeling as a ninth task*):

| Concept | Description | Example |
|----------|--------------|----------|
| Classification | Predict class membership | Email spam, disease diagnosis |
| Regression | Predict a continuous value | Price prediction, stock price |
| Similarity Matching | Find similar entities | Recommenders, dating apps |
| Clustering | Group entities by similarity | Customer segmentation |
| Co-occurrence | Find associations | Market basket analysis |
| Profiling / Anomaly Detection | Identify trends & outliers | Fraud detection |
| Link Prediction | Predict connections | "People You May Know" |
| Data Reduction | Reduce dimensionality | PCA for visualization |

---



## 🧠 2. Descriptive vs Predictive Analytics

| Type | Description | Example |
|------|--------------|----------|
| **Descriptive Analytics** | What happened? | Average survival rate by gender |
| **Predictive Analytics** | What might happen next? | Predict survival probability |

This lecture focuses on **Descriptive Analytics**.
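
To make the distinction concrete, here is a minimal sketch on a tiny hand-made table. The rows and the `toy` name are hypothetical and are not Titanic records. The descriptive step summarizes observed outcomes; the naive predictive step simply reuses those group rates as probability estimates for an unseen passenger, which is only one of many possible approaches.

```python
import pandas as pd

# Hypothetical toy data, not drawn from the Titanic dataset
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# Descriptive: what happened? Observed survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)

# Predictive (naive): what might happen? Reuse the group rate as a
# probability estimate for a new, unseen passenger
new_passenger = {"Sex": "female"}
estimated_probability = rate_by_sex[new_passenger["Sex"]]
print(f"Estimated survival probability: {estimated_probability:.2f}")
```

A real predictive model would be evaluated on data it has not seen; this sketch only shows where description ends and prediction begins.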



## 🧮 3. Data Collection — Titanic Dataset


In [67]:

# Comprehensive Jupyter Notebook for Module 6: Data Analytics Concepts & Tasks
# Focus: Titanic Dataset - From Raw Data to Analytical Insights

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

print("=== Module 6: Data Analytics Concepts & Tasks ===")
print("=== Titanic Dataset Comprehensive Analytics ===")



=== Module 6: Data Analytics Concepts & Tasks ===
=== Titanic Dataset Comprehensive Analytics ===


In [68]:
# Load dataset
titanic = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")


# Display first few rows
print("\n🔍 FIRST 5 ROWS:")
print("=" * 30)
display(titanic.head())

print("\n📈 BASIC STATISTICS:")
print("=" * 30)
display(titanic.describe(include='all'))


🔍 FIRST 5 ROWS:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |



📈 BASIC STATISTICS:


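| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Dooley, Mr. Patrick | male | NaN | NaN | NaN | 347082 | NaN | G6 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |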


In [69]:
# Initial dataset overview
print("\n📊 INITIAL DATASET OVERVIEW:")
print("=" * 40)
titanic.info()


📊 INITIAL DATASET OVERVIEW:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB



## 4. Data Quality Assessment

In [71]:
print("\n2. DATA QUALITY ASSESSMENT")
print("=" * 50)

# Check for missing values
print("🔍 MISSING VALUES ANALYSIS:")
print("=" * 40)
missing_data = titanic.isnull().sum()
missing_percent = (missing_data / len(titanic)) * 100

missing_summary = pd.DataFrame({
    'Missing Count': missing_data,
    'Missing Percentage': missing_percent
})
missing_summary = missing_summary[missing_summary['Missing Count'] > 0]
display(missing_summary)

# Check for duplicates
print(f"\n🔄 DUPLICATE RECORDS: {titanic.duplicated().sum()}")


2. DATA QUALITY ASSESSMENT
🔍 MISSING VALUES ANALYSIS:


| | Missing Count | Missing Percentage |
|---|---|---|
| Age | 177 | 19.86532 |
| Cabin | 687 | 77.104377 |
| Embarked | 2 | 0.224467 |



🔄 DUPLICATE RECORDS: 0
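
The missing-value check above can be wrapped in a small reusable helper. This is a sketch only: the function name `missing_report` and the toy frame are hypothetical, introduced here for illustration.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "Missing Count": counts,
        "Missing Percentage": counts / len(df) * 100,
    })
    # Keep only columns with gaps, worst first
    return report[report["Missing Count"] > 0].sort_values("Missing Count", ascending=False)

# Hypothetical toy frame for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(titanic)` should reproduce the same three columns shown in the output above, sorted by severity.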


In [72]:
# Data types assessment
print("\n📋 DATA TYPES ANALYSIS:")
print("=" * 35)
data_types = pd.DataFrame({
    'Column': titanic.columns,
    'Data Type': titanic.dtypes,
    'Unique Values': [titanic[col].nunique() for col in titanic.columns],
    'Sample Values': [titanic[col].dropna().head(3).tolist() for col in titanic.columns]
})
display(data_types)


📋 DATA TYPES ANALYSIS:


| | Column | Data Type | Unique Values | Sample Values |
|---|---|---|---|---|
| PassengerId | PassengerId | int64 | 891 | [1, 2, 3] |
| Survived | Survived | int64 | 2 | [0, 1, 1] |
| Pclass | Pclass | int64 | 3 | [3, 1, 3] |
| Name | Name | object | 891 | [Braund, Mr. Owen Harris, Cumings, Mrs. John B... |
| Sex | Sex | object | 2 | [male, female, female] |
| Age | Age | float64 | 88 | [22.0, 38.0, 26.0] |
| SibSp | SibSp | int64 | 7 | [1, 1, 0] |
| Parch | Parch | int64 | 7 | [0, 0, 0] |
| Ticket | Ticket | object | 681 | [A/5 21171, PC 17599, STON/O2. 3101282] |
| Fare | Fare | float64 | 248 | [7.25, 71.2833, 7.925] |



## 5. Data Preprocessing (ETL) & Cleaning


In [73]:

# Check missing values
titanic.isnull().sum()


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0


In [74]:
# Data Pre-processing & Cleaning
print("\n3. DATA PRE-PROCESSING & CLEANING")
print("=" * 55)

# Create a copy for cleaning
df_clean = titanic.copy()
print(" Created working copy of dataset")

# Handle missing values systematically
print("\n HANDLING MISSING VALUES:")
print("=" * 35)

# Age - Fill with median based on Pclass and Sex
age_median_by_group = df_clean.groupby(['Pclass', 'Sex'])['Age'].median()
print("Age median by class and gender:")
display(age_median_by_group)

def fill_age(row):
    if pd.isnull(row['Age']):
        return age_median_by_group[row['Pclass'], row['Sex']]
    return row['Age']

df_clean['Age'] = df_clean.apply(fill_age, axis=1)
print(f" Filled {titanic['Age'].isnull().sum()} missing age values")

# Embarked - Fill with mode
embarked_mode = df_clean['Embarked'].mode()[0]
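# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df_clean['Embarked'] = df_clean['Embarked'].fillna(embarked_mode)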
print(f" Filled {titanic['Embarked'].isnull().sum()} missing embarked values")

# Fare - Fill with median by Pclass (though there are no missing fares in this dataset)
fare_median_by_class = df_clean.groupby('Pclass')['Fare'].median()
df_clean['Fare'] = df_clean.apply(
    lambda row: fare_median_by_class[row['Pclass']] if pd.isnull(row['Fare']) else row['Fare'],
    axis=1
)
print(" Checked and handled any missing fare values")

# Cabin - Create new feature indicating cabin availability
df_clean['Has_Cabin'] = df_clean['Cabin'].notnull().astype(int)
df_clean.drop('Cabin', axis=1, inplace=True)
print(" Created 'Has_Cabin' feature and dropped original cabin column")

# Verify cleaning
print(f"\n MISSING VALUES AFTER CLEANING:")
print("=" * 35)
print(df_clean.isnull().sum())


3. DATA PRE-PROCESSING & CLEANING
 Created working copy of dataset

 HANDLING MISSING VALUES:
Age median by class and gender:


| Pclass | Sex | Age |
|---|---|---|
| 1 | female | 35.0 |
| 1 | male | 40.0 |
| 2 | female | 28.0 |
| 2 | male | 30.0 |
| 3 | female | 21.5 |
| 3 | male | 25.0 |


 Filled 177 missing age values
 Filled 2 missing embarked values
 Checked and handled any missing fare values
 Created 'Has_Cabin' feature and dropped original cabin column

 MISSING VALUES AFTER CLEANING:
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Has_Cabin      0
dtype: int64
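
The row-wise `fill_age` function works, but the same group-median imputation can also be written without `apply`. The sketch below uses `groupby(...).transform("median")` on a hypothetical toy frame; the values are invented so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age of each (Pclass, Sex) group, aligned row by row with the frame
group_median = toy.groupby(["Pclass", "Sex"])["Age"].transform("median")

# Fill only the missing ages; existing values are left untouched
toy["Age"] = toy["Age"].fillna(group_median)
print(toy)
```

Because `transform` returns a Series with the same index as the original frame, it can be passed straight to `fillna`, which tends to be faster than a Python-level function call per row.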



### Step 2: Handle Duplicates


In [75]:

# Work on the cleaned copy so the raw `titanic` frame stays untouched
print("Duplicates before:", df_clean.duplicated().sum())
df_clean = df_clean.drop_duplicates()
print("Duplicates after:", df_clean.duplicated().sum())


Duplicates before: 0
Duplicates after: 0
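
A full-row check reports 0 whenever an identifier column is unique on every row, as `PassengerId` is in this dataset. The sketch below, on hypothetical toy rows, shows how the `subset` and `keep` parameters of `duplicated` and `drop_duplicates` restrict the comparison to chosen columns. Which columns define "the same record" is an assumption that depends on the data; matching name and ticket is used here purely for illustration.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 3 describe the same passenger
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "B", "C"],
    "Ticket": ["T1", "T2", "T2", "T3"],
})

# Full-row check finds nothing because PassengerId differs on every row
full_row_dupes = toy.duplicated().sum()

# Comparing only the descriptive columns flags the repeated passenger
subset_dupes = toy.duplicated(subset=["Name", "Ticket"], keep="first").sum()

# Keep the first occurrence of each (Name, Ticket) pair
deduped = toy.drop_duplicates(subset=["Name", "Ticket"], keep="first")
print(full_row_dupes, subset_dupes, len(deduped))
```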



### Step 3: Feature Engineering


In [76]:
print("\n4. FEATURE ENGINEERING")
print("=" * 30)

# Create age groups
print(" CREATING AGE GROUPS:")
df_clean['Age_Group'] = pd.cut(df_clean['Age'],
                              bins=[0, 12, 18, 35, 60, 100],
                              labels=['Child', 'Teen', 'Adult', 'Middle', 'Senior'])

# Create fare categories
print(" CREATING FARE CATEGORIES:")
df_clean['Fare_Category'] = pd.qcut(df_clean['Fare'], 4,
                                   labels=['Low', 'Medium', 'High', 'Very High'])

# Family size features
print(" CREATING FAMILY FEATURES:")
df_clean['Family_Size'] = df_clean['SibSp'] + df_clean['Parch'] + 1
df_clean['Is_Alone'] = (df_clean['Family_Size'] == 1).astype(int)

# Title extraction from name
print(" EXTRACTING TITLES FROM NAMES:")
# Raw string so that `\.` is passed to the regex engine as a literal dot
df_clean['Title'] = df_clean['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
title_mapping = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
    'Dr': 'Professional', 'Rev': 'Professional', 'Col': 'Military',
    'Major': 'Military', 'Mlle': 'Miss', 'Countess': 'Royalty',
    'Ms': 'Miss', 'Lady': 'Royalty', 'Jonkheer': 'Royalty',
    'Don': 'Royalty', 'Dona': 'Royalty', 'Mme': 'Mrs', 'Capt': 'Military',
    'Sir': 'Royalty'
}
# Map raw titles to groups; anything not in the mapping becomes 'Other'
df_clean['Title'] = df_clean['Title'].map(title_mapping).fillna('Other')

print(" Feature engineering completed")
print("\n NEW FEATURES CREATED:")
new_features = ['Age_Group', 'Fare_Category', 'Family_Size', 'Is_Alone', 'Title', 'Has_Cabin']
for feature in new_features:
    print(f"  - {feature}: {df_clean[feature].nunique()} unique values")


4. FEATURE ENGINEERING
 CREATING AGE GROUPS:
 CREATING FARE CATEGORIES:
 CREATING FAMILY FEATURES:
 EXTRACTING TITLES FROM NAMES:
 Feature engineering completed

 NEW FEATURES CREATED:
  - Age_Group: 5 unique values
  - Fare_Category: 4 unique values
  - Family_Size: 9 unique values
  - Is_Alone: 2 unique values
  - Title: 7 unique values
  - Has_Cabin: 2 unique values
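
The two binning functions used above behave differently: `pd.cut` takes fixed edges, while `pd.qcut` derives edges from the data so that bins hold roughly equal counts. The sketch below uses hypothetical values chosen to sit on the bin boundaries.

```python
import pandas as pd

# Hypothetical ages placed on and around the bin edges
ages = pd.Series([5, 12, 13, 18, 19, 35, 36, 60, 61])

# pd.cut: fixed edges; intervals are right-inclusive by default, so 12 -> Child
age_group = pd.cut(ages, bins=[0, 12, 18, 35, 60, 100],
                   labels=["Child", "Teen", "Adult", "Middle", "Senior"])

# Hypothetical fares with eight distinct values
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 15.0, 20.0, 30.0, 80.0])

# pd.qcut: edges come from the quartiles, so each bin holds about a quarter of the rows
fare_category = pd.qcut(fares, 4, labels=["Low", "Medium", "High", "Very High"])

print(age_group.tolist())
print(fare_category.value_counts().sort_index())
```

Note that with `bins=[0, 12, ...]` a value of exactly 0 would fall outside the first interval unless `include_lowest=True` is passed. The minimum age in this dataset is 0.42, so no passenger is affected here.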



## 6. Data Ethics & Reducing Bias Analysis



In [77]:
print("\n DATA ETHICS & REDUCING BIAS ANALYSIS")
print("=" * 50)

# Gender bias in survival outcomes
print("1. GENDER BIAS IN SURVIVAL:")
print("=" * 30)
gender_survival = df_clean.groupby('Sex')['Survived'].agg(['mean', 'count'])
gender_survival['survival_rate'] = gender_survival['mean'] * 100
gender_survival = gender_survival.rename(columns={'mean': 'survival_probability', 'count': 'passenger_count'})
display(gender_survival)

# Class bias analysis
print("\n2. SOCIOECONOMIC BIAS (BY PASSENGER CLASS):")
print("=" * 45)
class_survival = df_clean.groupby('Pclass')['Survived'].agg(['mean', 'count'])
class_survival['survival_rate'] = class_survival['mean'] * 100
class_survival = class_survival.rename(columns={'mean': 'survival_probability', 'count': 'passenger_count'})
display(class_survival)

# Age discrimination analysis
print("\n3. AGE DISCRIMINATION ANALYSIS:")
print("=" * 35)
age_survival = df_clean.groupby('Age_Group')['Survived'].agg(['mean', 'count'])
age_survival['survival_rate'] = age_survival['mean'] * 100
age_survival = age_survival.rename(columns={'mean': 'survival_probability', 'count': 'passenger_count'})
display(age_survival)

# Fare-based privilege analysis
print("\n4. FARE-BASED PRIVILEGE ANALYSIS:")
print("=" * 38)
fare_survival = df_clean.groupby('Fare_Category')['Survived'].agg(['mean', 'count'])
fare_survival['survival_rate'] = fare_survival['mean'] * 100
fare_survival = fare_survival.rename(columns={'mean': 'survival_probability', 'count': 'passenger_count'})
display(fare_survival)



 DATA ETHICS & REDUCING BIAS ANALYSIS
1. GENDER BIAS IN SURVIVAL:


| Sex | survival_probability | passenger_count | survival_rate |
|---|---|---|---|
| female | 0.742038 | 314 | 74.203822 |
| male | 0.188908 | 577 | 18.890815 |



2. SOCIOECONOMIC BIAS (BY PASSENGER CLASS):


| Pclass | survival_probability | passenger_count | survival_rate |
|---|---|---|---|
| 1 | 0.62963 | 216 | 62.962963 |
| 2 | 0.472826 | 184 | 47.282609 |
| 3 | 0.242363 | 491 | 24.236253 |



3. AGE DISCRIMINATION ANALYSIS:


| Age_Group | survival_probability | passenger_count | survival_rate |
|---|---|---|---|
| Child | 0.57971 | 69 | 57.971014 |
| Teen | 0.428571 | 70 | 42.857143 |
| Adult | 0.357977 | 514 | 35.797665 |
| Middle | 0.384259 | 216 | 38.425926 |
| Senior | 0.227273 | 22 | 22.727273 |



4. FARE-BASED PRIVILEGE ANALYSIS:


| Fare_Category | survival_probability | passenger_count | survival_rate |
|---|---|---|---|
| Low | 0.197309 | 223 | 19.730942 |
| Medium | 0.303571 | 224 | 30.357143 |
| High | 0.454955 | 222 | 45.495495 |
| Very High | 0.581081 | 222 | 58.108108 |
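
When comparing groups, sample size matters: the Senior rate above rests on 22 passengers, the Adult rate on 514. The sketch below attaches a rough 95% interval to a few of the rates from the Age_Group table. It assumes the normal approximation for a proportion, which is crude for small groups (a Wilson interval would be a better choice there), and the helper name `rate_with_interval` is hypothetical.

```python
import math

def rate_with_interval(rate: float, n: int, z: float = 1.96):
    """Normal-approximation interval for a proportion (rough for small n)."""
    se = math.sqrt(rate * (1 - rate) / n)
    # Clip to the valid probability range [0, 1]
    return max(0.0, rate - z * se), min(1.0, rate + z * se)

# Rates and counts copied from the Age_Group table above
groups = {"Child": (0.57971, 69), "Adult": (0.357977, 514), "Senior": (0.227273, 22)}

for name, (rate, n) in groups.items():
    low, high = rate_with_interval(rate, n)
    print(f"{name:<7} n={n:<4} rate={rate:.3f}  approx. 95% interval=({low:.3f}, {high:.3f})")
```

The interval for the Senior group is several times wider than the one for Adults, so differences involving small groups should be read with caution before calling them evidence of bias.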


## 7. Descriptive Analytics & Summary Statistics

In [78]:

print("\n COMPREHENSIVE DESCRIPTIVE ANALYTICS")
print("=" * 50)

# Overall dataset summary
print("1. DATASET OVERVIEW:")
print("=" * 20)
print(f"Total Passengers: {len(df_clean)}")
print(f"Overall Survival Rate: {df_clean['Survived'].mean():.2%}")
print(f"Male Passengers: {(df_clean['Sex'] == 'male').sum()} ({(df_clean['Sex'] == 'male').mean():.2%})")
print(f"Female Passengers: {(df_clean['Sex'] == 'female').sum()} ({(df_clean['Sex'] == 'female').mean():.2%})")

# Numerical features summary
print("\n2. NUMERICAL FEATURES SUMMARY:")
print("=" * 35)
numerical_summary = df_clean[['Age', 'Fare', 'SibSp', 'Parch', 'Family_Size']].describe()
display(numerical_summary)

# Categorical features summary
print("\n3. CATEGORICAL FEATURES DISTRIBUTION:")
print("=" * 45)
categorical_cols = ['Pclass', 'Sex', 'Embarked', 'Age_Group', 'Fare_Category', 'Title', 'Is_Alone']
for col in categorical_cols:
    print(f"\n{col.upper()}:")
    value_counts = df_clean[col].value_counts(normalize=True).round(3)
    display(pd.DataFrame({'Count': df_clean[col].value_counts(), 'Percentage': value_counts * 100}))


 COMPREHENSIVE DESCRIPTIVE ANALYTICS
1. DATASET OVERVIEW:
Total Passengers: 891
Overall Survival Rate: 38.38%
Male Passengers: 577 (64.76%)
Female Passengers: 314 (35.24%)

2. NUMERICAL FEATURES SUMMARY:


| | Age | Fare | SibSp | Parch | Family_Size |
|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 29.112424 | 32.204208 | 0.523008 | 0.381594 | 1.904602 |
| std | 13.304424 | 49.693429 | 1.102743 | 0.806057 | 1.613459 |
| min | 0.42 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 21.5 | 7.9104 | 0.0 | 0.0 | 1.0 |
| 50% | 26.0 | 14.4542 | 0.0 | 0.0 | 1.0 |
| 75% | 36.0 | 31.0 | 1.0 | 0.0 | 2.0 |
| max | 80.0 | 512.3292 | 8.0 | 6.0 | 11.0 |



3. CATEGORICAL FEATURES DISTRIBUTION:

PCLASS:


| Pclass | Count | Percentage |
|---|---|---|
| 3 | 491 | 55.1 |
| 1 | 216 | 24.2 |
| 2 | 184 | 20.7 |



SEX:


| Sex | Count | Percentage |
|---|---|---|
| male | 577 | 64.8 |
| female | 314 | 35.2 |



EMBARKED:


| Embarked | Count | Percentage |
|---|---|---|
| S | 646 | 72.5 |
| C | 168 | 18.9 |
| Q | 77 | 8.6 |



AGE_GROUP:


| Age_Group | Count | Percentage |
|---|---|---|
| Adult | 514 | 57.7 |
| Middle | 216 | 24.2 |
| Teen | 70 | 7.9 |
| Child | 69 | 7.7 |
| Senior | 22 | 2.5 |



FARE_CATEGORY:


| Fare_Category | Count | Percentage |
|---|---|---|
| Medium | 224 | 25.1 |
| Low | 223 | 25.0 |
| High | 222 | 24.9 |
| Very High | 222 | 24.9 |



TITLE:


| Title | Count | Percentage |
|---|---|---|
| Mr | 517 | 58.0 |
| Miss | 185 | 20.8 |
| Mrs | 126 | 14.1 |
| Master | 40 | 4.5 |
| Professional | 13 | 1.5 |
| Royalty | 5 | 0.6 |
| Military | 5 | 0.6 |



IS_ALONE:


| Is_Alone | Count | Percentage |
|---|---|---|
| 1 | 537 | 60.3 |
| 0 | 354 | 39.7 |



### Data Analytics Concepts & Tasks Application


In [79]:
print("\n APPLYING DATA ANALYTICS CONCEPTS & TASKS")
print("=" * 55)

# 1. CLASSIFICATION Analysis
print("1. CLASSIFICATION PATTERNS:")
print("=" * 25)
print("Predicting survival based on features:")
classification_features = ['Pclass', 'Sex', 'Age_Group', 'Fare_Category', 'Title']
for feature in classification_features:
    survival_by_feature = df_clean.groupby(feature)['Survived'].mean().sort_values(ascending=False)
    print(f"\n{feature}:")
    for category, rate in survival_by_feature.items():
        print(f"  {category}: {rate:.2%} survival")

# 2. REGRESSION Analysis
print("\n2. REGRESSION RELATIONSHIPS:")
print("=" * 30)
print("Relationship between Fare and Survival:")
fare_survival_corr = df_clean[['Fare', 'Survived']].corr().iloc[0,1]
print(f"Correlation between Fare and Survival: {fare_survival_corr:.3f}")

print("\nRelationship between Age and Survival:")
age_survival_corr = df_clean[['Age', 'Survived']].corr().iloc[0,1]
print(f"Correlation between Age and Survival: {age_survival_corr:.3f}")

# 3. SIMILARITY MATCHING
print("\n3. SIMILARITY MATCHING PATTERNS:")
print("=" * 35)
print("Finding similar passenger profiles with different outcomes:")
similar_profiles = df_clean[
    (df_clean['Pclass'] == 1) &
    (df_clean['Sex'] == 'female') &
    (df_clean['Age'].between(20, 40))
][['Name', 'Age', 'Fare', 'Survived']].head(10)
display(similar_profiles)

# 4. CLUSTERING Analysis
print("\n4. CLUSTERING INSIGHTS:")
print("=" * 25)
print("Natural groupings in the data:")
cluster_summary = df_clean.groupby(['Pclass', 'Sex']).agg({
    'Survived': 'mean',
    'Fare': 'median',
    'Age': 'median',
    'PassengerId': 'count'
}).round(3)
cluster_summary = cluster_summary.rename(columns={'PassengerId': 'Count'})
display(cluster_summary)

# 5. CO-OCCURRENCE GROUPING
print("\n5. CO-OCCURRENCE PATTERNS:")
print("=" * 30)
print("Common feature combinations among survivors:")
high_survival_groups = df_clean[df_clean['Survived'] == 1].groupby(['Pclass', 'Sex']).size().sort_values(ascending=False)
print("Most common survivor profiles:")
for (pclass, sex), count in high_survival_groups.head(5).items():
    total_group = len(df_clean[(df_clean['Pclass'] == pclass) & (df_clean['Sex'] == sex)])
    survival_rate = count / total_group
    print(f"  Pclass {pclass}, {sex}: {count} survivors ({survival_rate:.1%} of group)")

# 6. PROFILING & PATTERN MINING
print("\n6. PASSENGER PROFILING:")
print("=" * 25)
print("Typical survivor profile vs non-survivor profile:")

survivor_profile = df_clean[df_clean['Survived'] == 1][['Age', 'Fare', 'Pclass']].median()
non_survivor_profile = df_clean[df_clean['Survived'] == 0][['Age', 'Fare', 'Pclass']].median()

profile_comparison = pd.DataFrame({
    'Survivors': survivor_profile,
    'Non-Survivors': non_survivor_profile,
    'Difference': survivor_profile - non_survivor_profile
})
display(profile_comparison)

# 7. ANOMALY DETECTION
print("\n7. ANOMALY DETECTION:")
print("=" * 25)
print("Identifying unusual patterns:")

# High fare but low class anomalies
anomalies = df_clean[(df_clean['Fare'] > 100) & (df_clean['Pclass'] == 3)]
if len(anomalies) > 0:
    print(f"Found {len(anomalies)} passengers with high fare but traveling in 3rd class")
    display(anomalies[['Name', 'Pclass', 'Fare', 'Survived']])

# Age anomalies
age_anomalies = df_clean[df_clean['Age'] > 70]
if len(age_anomalies) > 0:
    print(f"\nElderly passengers (70+ years): {len(age_anomalies)}")
    display(age_anomalies[['Name', 'Age', 'Pclass', 'Survived']])

# 8. DATA REDUCTION Insights
print("\n8. DATA REDUCTION INSIGHTS:")
print("=" * 30)
print("Key features that explain survival variance:")

# Calculate feature importance through correlation
feature_correlations = {}
for col in ['Pclass', 'Fare', 'Age', 'Sex', 'Family_Size', 'Is_Alone']:
    if col == 'Sex':
        # Convert Sex to numerical for correlation
        correlation = df_clean['Survived'].corr(df_clean['Sex'].map({'male': 0, 'female': 1}))
    else:
        correlation = df_clean['Survived'].corr(df_clean[col])
    feature_correlations[col] = abs(correlation)

# Sort by absolute correlation
important_features = pd.Series(feature_correlations).sort_values(ascending=False)
print("Feature importance (absolute correlation with survival):")
for feature, importance in important_features.items():
    print(f"  {feature}: {importance:.3f}")



 APPLYING DATA ANALYTICS CONCEPTS & TASKS
1. CLASSIFICATION PATTERNS:
Predicting survival based on features:

Pclass:
  1: 62.96% survival
  2: 47.28% survival
  3: 24.24% survival

Sex:
  female: 74.20% survival
  male: 18.89% survival

Age_Group:
  Child: 57.97% survival
  Teen: 42.86% survival
  Middle: 38.43% survival
  Adult: 35.80% survival
  Senior: 22.73% survival

Fare_Category:
  Very High: 58.11% survival
  High: 45.50% survival
  Medium: 30.36% survival
  Low: 19.73% survival

Title:
  Mrs: 79.37% survival
  Miss: 70.27% survival
  Royalty: 60.00% survival
  Master: 57.50% survival
  Military: 40.00% survival
  Professional: 23.08% survival
  Mr: 15.67% survival

2. REGRESSION RELATIONSHIPS:
Relationship between Fare and Survival:
Correlation between Fare and Survival: 0.257

Relationship between Age and Survival:
Correlation between Age and Survival: -0.060

3. SIMILARITY MATCHING PATTERNS:
Finding similar passenger profiles with different outcomes:


| | Name | Age | Fare | Survived |
|---|---|---|---|---|
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | 71.2833 | 1 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 53.1 | 1 |
| 31 | Spencer, Mrs. William Augustus (Marie Eugenie) | 35.0 | 146.5208 | 1 |
| 61 | Icard, Miss. Amelie | 38.0 | 80.0 | 1 |
| 88 | Fortune, Miss. Mabel Helen | 23.0 | 263.0 | 1 |
| 151 | Pears, Mrs. Thomas (Edith Wearne) | 22.0 | 66.6 | 1 |
| 166 | Chibnall, Mrs. (Edith Martha Bowerman) | 35.0 | 55.0 | 1 |
| 215 | Newell, Miss. Madeleine | 31.0 | 113.275 | 1 |
| 218 | Bazzani, Miss. Albina | 32.0 | 76.2917 | 1 |
| 230 | Harris, Mrs. Henry Birkhardt (Irene Wallach) | 35.0 | 83.475 | 1 |



4. CLUSTERING INSIGHTS:
Natural groupings in the data:


| Pclass | Sex | Survived | Fare | Age | Count |
|---|---|---|---|---|---|
| 1 | female | 0.968 | 82.665 | 35.0 | 94 |
| 1 | male | 0.369 | 41.262 | 40.0 | 122 |
| 2 | female | 0.921 | 22.0 | 28.0 | 76 |
| 2 | male | 0.157 | 13.0 | 30.0 | 108 |
| 3 | female | 0.5 | 12.475 | 21.5 | 144 |
| 3 | male | 0.135 | 7.925 | 25.0 | 347 |



5. CO-OCCURRENCE PATTERNS:
Common feature combinations among survivors:
Most common survivor profiles:
  Pclass 1, female: 91 survivors (96.8% of group)
  Pclass 3, female: 72 survivors (50.0% of group)
  Pclass 2, female: 70 survivors (92.1% of group)
  Pclass 3, male: 47 survivors (13.5% of group)
  Pclass 1, male: 45 survivors (36.9% of group)

6. PASSENGER PROFILING:
Typical survivor profile vs non-survivor profile:


| | Survivors | Non-Survivors | Difference |
|---|---|---|---|
| Age | 27.0 | 25.0 | 2.0 |
| Fare | 26.0 | 10.5 | 15.5 |
| Pclass | 2.0 | 3.0 | -1.0 |



7. ANOMALY DETECTION:
Identifying unusual patterns:

Elderly passengers (70+ years): 5


| | Name | Age | Pclass | Survived |
|---|---|---|---|---|
| 96 | Goldschmidt, Mr. George B | 71.0 | 1 | 0 |
| 116 | Connors, Mr. Patrick | 70.5 | 3 | 0 |
| 493 | Artagaveytia, Mr. Ramon | 71.0 | 1 | 0 |
| 630 | Barkworth, Mr. Algernon Henry Wilson | 80.0 | 1 | 1 |
| 851 | Svensson, Mr. Johan | 74.0 | 3 | 0 |



8. DATA REDUCTION INSIGHTS:
Key features that explain survival variance:
Feature importance (absolute correlation with survival):
  Sex: 0.543
  Pclass: 0.338
  Fare: 0.257
  Is_Alone: 0.203
  Age: 0.060
  Family_Size: 0.017
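
The similarity-matching step above selects passengers with a hand-written filter. A distance-based version is sketched below as one possible alternative: features are standardized, then the nearest neighbors are found by Euclidean distance. The toy passengers, their labels `P1` to `P5`, and the helper `most_similar` are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers (values are illustrative only)
toy = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4", "P5"],
    "Age": [22.0, 38.0, 26.0, 35.0, 24.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
}).set_index("Name")

# Standardize so Age and Fare contribute on comparable scales
z = (toy - toy.mean()) / toy.std()

def most_similar(name: str, k: int = 2) -> pd.Series:
    """Return the k nearest passengers to `name` by Euclidean distance."""
    diffs = z.drop(index=name) - z.loc[name]
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    return distances.nsmallest(k)

print(most_similar("P1"))
```

Without the standardization step, `Fare` would dominate the distance simply because its values span a much wider range than `Age`.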


### Advanced GroupBy Analysis

In [80]:
print("\n ADVANCED GROUPBY ANALYSIS")
print("=" * 40)

# Multi-dimensional analysis
print("1. SURVIVAL BY CLASS AND GENDER:")
print("=" * 35)
class_gender_survival = df_clean.groupby(['Pclass', 'Sex'])['Survived'].agg(['mean', 'count'])
class_gender_survival['survival_rate'] = (class_gender_survival['mean'] * 100).round(1)
class_gender_survival = class_gender_survival.rename(columns={'mean': 'survival_prob', 'count': 'passenger_count'})
display(class_gender_survival)

print("\n2. SURVIVAL BY AGE GROUP AND CLASS:")
print("=" * 38)
age_class_survival = df_clean.groupby(['Age_Group', 'Pclass'])['Survived'].mean().unstack().round(3)
display(age_class_survival)

print("\n3. FAMILY SIZE IMPACT ON SURVIVAL:")
print("=" * 35)
family_survival = df_clean.groupby('Family_Size')['Survived'].agg(['mean', 'count'])
family_survival['survival_rate'] = (family_survival['mean'] * 100).round(1)
family_survival = family_survival.rename(columns={'mean': 'survival_prob', 'count': 'passenger_count'})
display(family_survival)

print("\n4. TITLE-BASED SURVIVAL ANALYSIS:")
print("=" * 35)
title_survival = df_clean.groupby('Title')['Survived'].agg(['mean', 'count']).sort_values('mean', ascending=False)
title_survival['survival_rate'] = (title_survival['mean'] * 100).round(1)
title_survival = title_survival.rename(columns={'mean': 'survival_prob', 'count': 'passenger_count'})
display(title_survival)



 ADVANCED GROUPBY ANALYSIS
1. SURVIVAL BY CLASS AND GENDER:


| Pclass | Sex | survival_prob | passenger_count | survival_rate |
|---|---|---|---|---|
| 1 | female | 0.968085 | 94 | 96.8 |
| 1 | male | 0.368852 | 122 | 36.9 |
| 2 | female | 0.921053 | 76 | 92.1 |
| 2 | male | 0.157407 | 108 | 15.7 |
| 3 | female | 0.5 | 144 | 50.0 |
| 3 | male | 0.135447 | 347 | 13.5 |



2. SURVIVAL BY AGE GROUP AND CLASS:


| Age_Group | Pclass 1 | Pclass 2 | Pclass 3 |
|---|---|---|---|
| Child | 0.75 | 1.0 | 0.417 |
| Teen | 0.917 | 0.5 | 0.283 |
| Adult | 0.787 | 0.429 | 0.24 |
| Middle | 0.541 | 0.383 | 0.086 |
| Senior | 0.214 | 0.333 | 0.2 |



3. FAMILY SIZE IMPACT ON SURVIVAL:


| Family_Size | survival_prob | passenger_count | survival_rate |
|---|---|---|---|
| 1 | 0.303538 | 537 | 30.4 |
| 2 | 0.552795 | 161 | 55.3 |
| 3 | 0.578431 | 102 | 57.8 |
| 4 | 0.724138 | 29 | 72.4 |
| 5 | 0.2 | 15 | 20.0 |
| 6 | 0.136364 | 22 | 13.6 |
| 7 | 0.333333 | 12 | 33.3 |
| 8 | 0.0 | 6 | 0.0 |
| 11 | 0.0 | 7 | 0.0 |



4. TITLE-BASED SURVIVAL ANALYSIS:


| Title | survival_prob | passenger_count | survival_rate |
|---|---|---|---|
| Mrs | 0.793651 | 126 | 79.4 |
| Miss | 0.702703 | 185 | 70.3 |
| Royalty | 0.6 | 5 | 60.0 |
| Master | 0.575 | 40 | 57.5 |
| Military | 0.4 | 5 | 40.0 |
| Professional | 0.230769 | 13 | 23.1 |
| Mr | 0.156673 | 517 | 15.7 |
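
Cross-tabulated rates such as the age-group-by-class table hide how many passengers stand behind each cell, and some groups are small (Royalty and Military have 5 passengers each). One way to guard against over-reading them, sketched below on hypothetical toy rows, is to build a matching count table with `pivot_table` and mask rates that rest on too few observations. The threshold `min_count` is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Age_Group": ["Child", "Child", "Adult", "Adult", "Adult", "Adult"],
    "Pclass": [2, 3, 1, 1, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 0],
})

# Survival rate per (Age_Group, Pclass) cell
rates = toy.pivot_table(index="Age_Group", columns="Pclass",
                        values="Survived", aggfunc="mean")

# Number of passengers behind each rate
counts = toy.pivot_table(index="Age_Group", columns="Pclass",
                         values="Survived", aggfunc="count", fill_value=0)

# Mask rates that rest on fewer than `min_count` passengers (threshold is arbitrary)
min_count = 2
reliable_rates = rates.where(counts >= min_count)
print(reliable_rates)
```

In this toy example the 100% survival rate for second-class children comes from a single passenger, so it is masked rather than reported.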


### Key Insights Summary

In [81]:
print("\n KEY ANALYTICAL INSIGHTS SUMMARY")
print("=" * 45)

insights = [
    f"• Overall survival rate: {df_clean['Survived'].mean():.1%}",
    f"• Female survival rate: {df_clean[df_clean['Sex']=='female']['Survived'].mean():.1%}",
    f"• Male survival rate: {df_clean[df_clean['Sex']=='male']['Survived'].mean():.1%}",
    f"• 1st Class survival rate: {df_clean[df_clean['Pclass']==1]['Survived'].mean():.1%}",
    f"• 3rd Class survival rate: {df_clean[df_clean['Pclass']==3]['Survived'].mean():.1%}",
    f"• Children (0-12) survival rate: {df_clean[df_clean['Age_Group']=='Child']['Survived'].mean():.1%}",
    f"• Most predictive feature: {important_features.index[0]} (correlation: {important_features.iloc[0]:.3f})",
    # profile_comparison holds medians, so the label says "Median" rather than "Average"
    f"• Median fare difference survivors vs non-survivors: ${profile_comparison.loc['Fare', 'Difference']:.2f}"
]

for insight in insights:
    print(insight)

# The f prefix is required for the {...} placeholders to be evaluated
print("\n DATASET READY FOR NEXT STEPS:")
print(f"  • Cleaned records: {len(df_clean)} passengers")
print(f"  • Features available: {len(df_clean.columns)}")
print(f"  • Missing values: {df_clean.isnull().sum().sum()}")
print("  • Data types optimized for analysis")



 KEY ANALYTICAL INSIGHTS SUMMARY
• Overall survival rate: 38.4%
• Female survival rate: 74.2%
• Male survival rate: 18.9%
• 1st Class survival rate: 63.0%
• 3rd Class survival rate: 24.2%
• Children (0-12) survival rate: 58.0%
• Most predictive feature: Sex (correlation: 0.543)
• Median fare difference survivors vs non-survivors: $15.50

 DATASET READY FOR NEXT STEPS:
  • Cleaned records: 891 passengers
  • Features available: 17
  • Missing values: 0
  • Data types optimized for analysis



## 8. Analytical Tasks Mapped to Titanic Examples

| Concept | Description | Example |
|----------|--------------|----------|
| Classification | Predict class label | Predict “Survived” |
| Regression | Continuous output | Predict Fare |
| Similarity Matching | Find similar entities | Passengers of the same class, sex, and age range |
| Clustering | Discover hidden patterns | Group by demographics |
| Co-occurrence | Association rules | Embarkation + Survival |
| Profiling | Identify common patterns | Typical survivor traits |
| Link Prediction | Relationship modeling | Family/companions |
| Data Reduction | Simplify dataset | Drop unneeded columns |
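
As an illustration of the co-occurrence row, the sketch below computes a simple lift value for each embarkation port: how much more (or less) often survival co-occurs with that port than the overall survival rate would suggest. The toy rows are hypothetical, and lift is only one of several association measures.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "S", "C", "C", "C", "Q"],
    "Survived": [0, 0, 1, 0, 1, 1, 0, 0],
})

# Joint counts of each (Embarked, Survived) combination
counts = pd.crosstab(toy["Embarked"], toy["Survived"])

# P(Survived=1 | Embarked) compared with the overall P(Survived=1)
survival_given_port = counts[1] / counts.sum(axis=1)
overall_survival = toy["Survived"].mean()

# Lift > 1: the pair co-occurs more often than independence would predict
lift = survival_given_port / overall_survival
print(lift.round(2))
```

A lift near 1 suggests the two attributes are roughly independent; values well above or below 1 point to an association worth investigating, not to a cause.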

---



## 9. NVIDIA DLI Quiz

1. Predicting an email's label (“spam” or “not spam”) is a:  
> Classification problem  

2. Finding items that are frequently bought together is called:  
> Market-basket analysis  

3. Breaking complex problems into smaller subproblems helps us apply existing methods.  
> True  



## 10. Wrap-Up Summary
- Data analytics converts data → insight.  
- Descriptive analytics helps summarize “what happened.”  
- Predictive analytics forecasts “what will happen.”  
- Always address **data quality**, **integration**, and **ethics**.

**Next:** Visualization 101: Matplotlib & Seaborn.
