
**Scenario**

1. You're working as a **data analyst** for a maritime safety research organization.
2. Your team has been given access to the famous Titanic passenger dataset to analyze survival patterns.
3. Your task is to clean the data, perform statistical analysis, engineer new features that might predict survival, and export the processed data to JSON format for use in other systems.

**Learning Objectives**

 1. Download and import CSV data from a public source
 2. Calculate descriptive statistics (mean, median, standard deviation) for numeric columns
 3. Identify and count missing values in the dataset
 4. Perform feature engineering by creating new variables from existing ones
 5. Analyze how engineered features differentiate between groups (survived vs. not survived)
 6. Export data to JSON format using Python classes
 7. Validate JSON output structure and content

**Introduction**

Welcome to the Data Manipulation and JSON Handling lab! In this exercise, you'll work with real-world data from the Titanic disaster to practice essential data analysis skills.

This lab will help you practice the skills listed under the learning objectives above.

**What you'll build:**

1. A data analysis script that processes the Titanic dataset
2. Feature engineering functions that create new predictive variables
3. A class-based system to structure and export data to JSON
4. Statistical analysis comparing survival groups

**Why this matters:** Data manipulation and feature engineering are fundamental skills in AI and machine learning. Most real-world datasets require cleaning, transformation, and the creation of new features before they can be used effectively. JSON is a universal data format used in APIs, web applications, and data pipelines.

**Success criteria:**

 1. Successfully download and import the Titanic dataset
 2. Calculate all required statistics correctly
 3. Identify missing values accurately
 4. Create at least 2 engineered features
 5. Export data to JSON using classes
 6. Validate JSON structure and content


**Background Story**

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean in 1912 after striking an iceberg. The disaster resulted in the deaths of over 1,500 passengers and crew. The dataset contains information about 891 passengers, including whether they survived or not.

**Your research team wants to understand:**

 1. What statistical patterns exist in the data?
 2. Can we create new features that better predict survival?
 3. How can we structure this data for use in other systems?


**Step-by-Step Instructions**

**Step 1:** Setting Up the Project

**Objective:** Create your project structure and download the dataset.

**What to do:**

 1. Create a new notebook named titanic_analysis.ipynb or a Python script named titanic_analysis.py
 2. Create a folder named data in the same directory as your script
 3. Download the Titanic dataset from Kaggle

**Dataset Download:** The Titanic dataset is available on Kaggle. You can download it from:

**Direct Link:** Titanic Dataset on Kaggle

**Alternative:** If you don't have Kaggle access, you can use this direct download link: Titanic train.csv

**Instructions:**

 1. Download the train.csv file
 2. Save it in your data folder as titanic.csv
 3. Make sure titanic.csv sits inside the data folder, next to your script or notebook

Expected outcome: You should have a project structure with the data folder and your Python script ready.

Checkpoint: Run your script to verify the setup works. It should print the directory paths without errors.
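
The setup cell in this lab hard-codes Colab's /content directory. If you are running a local script instead, a minimal sketch like the one below may help. It resolves the data folder relative to a base directory and fails with a readable message when the CSV is missing. The helper name `find_dataset` and its error text are hypothetical and not part of the lab.

```python
from pathlib import Path


def find_dataset(base_dir: Path, filename: str = "titanic.csv") -> Path:
    """Return the path to the CSV inside base_dir/data, creating the folder if needed.

    Hypothetical helper: raises FileNotFoundError with a hint when the
    file has not been downloaded yet.
    """
    data_dir = base_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    csv_path = data_dir / filename
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path} not found. Download train.csv and save it as "
            f"{filename} in {data_dir}."
        )
    return csv_path
```

A local script could then call `find_dataset(Path.cwd())` in place of the fixed /content paths.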



In [None]:
import pandas as pd
import numpy as np
import json
from pathlib import Path
import shutil  # used to move the uploaded CSV into the data folder

# Set up paths (Colab stores uploaded files under /content)
DATA_DIR = Path("/content/data")
CSV_FILE = DATA_DIR / "titanic.csv"
JSON_FILE = DATA_DIR / "titanic_data.json"

# Create data directory if it doesn't exist
DATA_DIR.mkdir(parents=True, exist_ok=True)

# If the CSV was uploaded to /content, move it into the data folder
uploaded_csv = Path("/content/titanic.csv")
if uploaded_csv.exists() and not CSV_FILE.exists():
    shutil.move(str(uploaded_csv), str(CSV_FILE))
    print(f"Moved {uploaded_csv} to {CSV_FILE}")

print("Project setup complete!")
print(f"Data directory: {DATA_DIR}")
print(f"CSV file location: {CSV_FILE}")

Moved /content/titanic.csv to /content/data/titanic.csv
Project setup complete!
Data directory: /content/data
CSV file location: /content/data/titanic.csv


**Step 2:** Importing and Exploring the Data

**Objective:** Load the CSV file into a pandas DataFrame and get an initial understanding of the data.

**What to do:**

 1. Use pandas to read the CSV file
 2. Display basic information about the dataset
 3. View the first few rows


In [None]:
df = pd.read_csv(CSV_FILE)
print(f"Dataset loaded successfully! Shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print("\nFirst few rows:")
print(df.head())

Dataset loaded successfully! Shape: (891, 12)

Columns: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

First few rows:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 175

**Step 3:** Calculating Descriptive Statistics

**Objective:** Calculate mean, median, and standard deviation for numeric columns.

**What to do:**

 1. Identify all numeric columns in the dataset
 2. Calculate mean, median, and standard deviation for each numeric column
 3. Display the results in a clear format

In [None]:
# Select numeric columns only
numeric_df = df.select_dtypes(include=np.number)

# Calculate statistics (mean, median, std)
descriptive_stats = numeric_df.agg(['mean', 'median', 'std'])

print("Descriptive Statistics for Numeric Columns:")
print(descriptive_stats)

Descriptive Statistics for Numeric Columns:
        PassengerId  Survived    Pclass        Age     SibSp     Parch  \
mean     446.000000  0.383838  2.308642  29.699118  0.523008  0.381594   
median   446.000000  0.000000  3.000000  28.000000  0.000000  0.000000   
std      257.353842  0.486592  0.836071  14.526497  1.102743  0.806057   

             Fare  
mean    32.204208  
median  14.454200  
std     49.693429  
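
One detail worth knowing when reading these numbers: pandas skips missing values by default, so the Age statistics are computed over the non-missing ages only. In addition, pandas `std` defaults to the sample standard deviation (`ddof=1`), while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below illustrates the difference on toy values that are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only, with one missing entry
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan])

sample_std = ages.std()                        # pandas default: ddof=1, NaN skipped
population_std = ages.std(ddof=0)              # divide by n instead of n - 1
numpy_std = np.std(ages.dropna().to_numpy())   # NumPy default: ddof=0

print(f"sample std (ddof=1):     {sample_std:.4f}")
print(f"population std (ddof=0): {population_std:.4f}")
print(f"np.std on non-missing:   {numpy_std:.4f}")
```

If you compare results from pandas and NumPy and they differ slightly, the `ddof` default is a likely cause.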


**Step 4:** Identifying Missing Values

**Objective:** Count and analyze missing values in the dataset.

**What to do:**

 1. Count missing values for each column
 2. Calculate the percentage of missing values
 3. Identify which columns have the most missing data

In [None]:
# Count missing values
print("\n" + "="*50)
print("MISSING VALUES ANALYSIS")
print("="*50)

missing_data = {}

for col in df.columns:
    missing_count = df[col].isnull().sum()
    missing_percent = (missing_count / len(df)) * 100
    missing_data[col] = {'count': missing_count, 'percent': missing_percent}
    print(f"Column '{col}': Missing = {missing_count} ({missing_percent:.2f}%)")

print("\nSummary of Missing Values:")
for col, data in missing_data.items():
    if data['count'] > 0:
        print(f"- {col}: {data['count']} missing ({data['percent']:.2f}%)")


MISSING VALUES ANALYSIS
Column 'PassengerId': Missing = 0 (0.00%)
Column 'Survived': Missing = 0 (0.00%)
Column 'Pclass': Missing = 0 (0.00%)
Column 'Name': Missing = 0 (0.00%)
Column 'Sex': Missing = 0 (0.00%)
Column 'Age': Missing = 177 (19.87%)
Column 'SibSp': Missing = 0 (0.00%)
Column 'Parch': Missing = 0 (0.00%)
Column 'Ticket': Missing = 0 (0.00%)
Column 'Fare': Missing = 0 (0.00%)
Column 'Cabin': Missing = 687 (77.10%)
Column 'Embarked': Missing = 2 (0.22%)

Summary of Missing Values:
- Age: 177 missing (19.87%)
- Cabin: 687 missing (77.10%)
- Embarked: 2 missing (0.22%)
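
The same counts and percentages can be sketched without an explicit loop. The example below is one possible vectorized version, shown on a toy frame. The function name `missing_summary` is hypothetical, and sorting by count is a choice made here to surface the columns with the most missing data first.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percent of missing values per column, largest first."""
    summary = pd.DataFrame({
        "count": frame.isnull().sum(),
        "percent": frame.isnull().mean() * 100,
    })
    # Keep only columns that actually have gaps
    summary = summary[summary["count"] > 0]
    return summary.sort_values("count", ascending=False)


# Toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
print(missing_summary(toy))
```

Calling `missing_summary(df)` on the loaded dataset should list the same three columns as the summary above.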


**Step 5:** Feature Engineering

**Objective:** Create new features that might help differentiate between survivors and non-survivors.

**What to do:**

 1. Create a "FamilySize" feature (SibSp + Parch + 1)
 2. Create an "IsAlone" feature (FamilySize == 1)
 3. Create an "AgeGroup" feature (categorize age into groups)
 4. Analyze how these features differ between survivors and non-survivors
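
As a rough illustration of the last two items, the sketch below bins ages with `pd.cut` and then computes the survival rate per group. The bin edges (18, 30, 50) are assumed to match the `categorize_age` function in the cell that follows, and the rows are toy data rather than real passengers.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration, not real passengers
toy = pd.DataFrame({
    "Age": [2.0, 17.0, 18.0, 29.0, 30.0, 49.0, 50.0, np.nan],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Left-closed bins: [0, 18), [18, 30), [30, 50), [50, inf)
age_group = pd.cut(
    toy["Age"],
    bins=[0, 18, 30, 50, np.inf],
    labels=["Child", "Young Adult", "Adult", "Senior"],
    right=False,
)
# Missing ages stay NaN after pd.cut, so add an explicit category for them
toy["AgeGroup"] = age_group.cat.add_categories("Unknown").fillna("Unknown")

# The mean of a 0/1 column is the share of survivors in each group
survival_rate = toy.groupby("AgeGroup", observed=True)["Survived"].mean()
print(survival_rate)
```

Grouping by the feature and averaging `Survived` answers "what share of each group survived", which complements grouping by `Survived` and looking at how the feature is distributed.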

In [None]:
# Create a copy of the dataframe for feature engineering
df_features = df.copy()

# Feature 1: Family Size
df_features['FamilySize'] = df_features['SibSp'] + df_features['Parch'] + 1
print("\nFamily Size (SibSp + Parch + 1) preview:")
print(df_features[['SibSp', 'Parch', 'FamilySize']].head(10))

# Feature 2: Is Alone
df_features['IsAlone'] = (df_features['FamilySize'] == 1).astype(int) # Convert boolean to int (0 or 1)
print("\nIsAlone (FamilySize == 1) preview:")
print(df_features[['FamilySize', 'IsAlone']].head(10))

# Feature 3: Age Groups
def categorize_age(age):
    """Categorize age into groups"""
    if pd.isna(age):
        return 'Unknown'
    elif age < 18:
        return 'Child'
    elif age < 30:
        return 'Young Adult'
    elif age < 50:
        return 'Adult'
    else:
        return 'Senior'

df_features['AgeGroup'] = df_features['Age'].apply(categorize_age)
print("\nAgeGroup preview:")
print(df_features[['Age', 'AgeGroup']].head(10))

# Analyze feature differences between survivors and non-survivors
print("\n" + "="*50)
print("FEATURE ANALYSIS: SURVIVED vs NOT SURVIVED")
print("="*50)

# Analysis for FamilySize
print("\nFamily Size by Survival:")
family_survival = df_features.groupby('Survived')['FamilySize'].agg(['mean', 'median', 'std'])
print(family_survival)

# Analysis for IsAlone
print("\nIs Alone by Survival (0=Not Alone, 1=Alone):")
isalone_survival = df_features.groupby('Survived')['IsAlone'].value_counts(normalize=True).unstack()
print(isalone_survival)

# Analysis for AgeGroup
print("\nAge Group by Survival:")
agegroup_survival = df_features.groupby('Survived')['AgeGroup'].value_counts(normalize=True).unstack()
print(agegroup_survival)

# Compare group means (a descriptive comparison, not a formal statistical test)
print("\n" + "="*50)
print("FEATURE DIFFERENTIATION ANALYSIS")
print("="*50)

survived = df_features[df_features['Survived'] == 1]
not_survived = df_features[df_features['Survived'] == 0]

print("\nFamily Size:")
print(f"  Survived mean: {survived['FamilySize'].mean():.2f}")
print(f"  Not Survived mean: {not_survived['FamilySize'].mean():.2f}")
print(f"  Difference: {abs(survived['FamilySize'].mean() - not_survived['FamilySize'].mean()):.2f}")

# Insights:
print("\nInsights from engineered features:")
print("- \"FamilySize\": The mean family size for survivors vs. non-survivors can reveal if traveling with a small to medium-sized family was beneficial for survival compared to traveling alone or with a very large family. Often, very large families had lower survival rates as it was harder for all members to evacuate.")
print("- \"IsAlone\": This feature directly tests if traveling alone had an impact. We can observe the proportion of 'alone' passengers who survived versus those who didn't. Historically, those traveling alone might have had fewer responsibilities but also less support during an emergency.")
print("- \"AgeGroup\": By categorizing age, we can see if certain age demographics (e.g., children, elderly) had higher or lower survival rates, which could be due to 'women and children first' policies or physical capabilities.")

# Thinking about another new feature:
print("\nAnother potential new feature could be 'Title' extracted from the 'Name' column. Titles like 'Mr.', 'Mrs.', 'Miss', 'Master', etc., often correlate with marital status, gender, and sometimes social standing or age, which are all factors that could influence survival.")


Family Size (SibSp + Parch + 1) preview:
   SibSp  Parch  FamilySize
0      1      0           2
1      1      0           2
2      0      0           1
3      1      0           2
4      0      0           1
5      0      0           1
6      0      0           1
7      3      1           5
8      0      2           3
9      1      0           2

IsAlone (FamilySize == 1) preview:
   FamilySize  IsAlone
0           2        0
1           2        0
2           1        1
3           2        0
4           1        1
5           1        1
6           1        1
7           5        0
8           3        0
9           2        0

AgeGroup preview:
    Age     AgeGroup
0  22.0  Young Adult
1  38.0        Adult
2  26.0  Young Adult
3  35.0        Adult
4  35.0        Adult
5   NaN      Unknown
6  54.0       Senior
7   2.0        Child
8  27.0  Young Adult
9  14.0        Child

FEATURE ANALYSIS: SURVIVED vs NOT SURVIVED

Family Size by Survival:
              mean  median       std
Surv
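
The Step 5 cell ends by suggesting a 'Title' feature extracted from the Name column. A minimal sketch of that idea is shown below, assuming every name follows the "Surname, Title. Given names" pattern seen in the preview rows. The regular expression is one possible choice, not the lab's reference solution.

```python
import pandas as pd

# Names copied from the preview rows shown in Step 2
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Capture the text between the comma and the first period, e.g. "Mr" or "Miss"
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.value_counts())
```

On the full dataset this would be applied as `df_features['Name'].str.extract(...)`. Names that do not fit the assumed pattern would come back as missing values and would need separate handling.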

**Step 6:** Creating a Data Export Class

**Objective:** Create a Python class to structure and export data to JSON format.

**What to do:**

 1. Create classes that encapsulate the data processing logic. We will structure two different classes: one for Passenger and another for TitanicDataset.
 2. Include the statistics, missing-value counts, and engineered features that you produced in the previous steps.
 3. Add a method to export everything to JSON

**Tip:** It is good practice to encapsulate related functionality within a class and to take your time to structure it well. It will make your code cleaner and easier to maintain in the long run. If you want to take this one step further, you can also add new methods inside TitanicDataset for handling statistics, missing values, and feature engineering.
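
One possible shape for these two classes is sketched below using dataclasses. The field names, the `from_dataframe` constructor, and the layout of the JSON are all assumptions made for illustration, and the second toy row is a placeholder rather than a real passenger. The sketch also shows a simple validation step: parse the JSON back and check its structure and content.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

import pandas as pd


@dataclass
class Passenger:
    """One passenger record; the field selection here is illustrative."""
    passenger_id: int
    survived: int
    name: str
    age: Optional[float]
    family_size: int


@dataclass
class TitanicDataset:
    """A collection of passengers that can be exported to JSON."""
    passengers: list = field(default_factory=list)

    @classmethod
    def from_dataframe(cls, frame: pd.DataFrame) -> "TitanicDataset":
        passengers = []
        for row in frame.itertuples(index=False):
            # NaN is not valid JSON, so missing ages become None (JSON null)
            age = None if pd.isna(row.Age) else float(row.Age)
            passengers.append(Passenger(
                passenger_id=int(row.PassengerId),
                survived=int(row.Survived),
                name=str(row.Name),
                age=age,
                family_size=int(row.SibSp) + int(row.Parch) + 1,
            ))
        return cls(passengers)

    def to_dict(self) -> dict:
        return {
            "passenger_count": len(self.passengers),
            "passengers": [asdict(p) for p in self.passengers],
        }

    def to_json(self) -> str:
        # allow_nan=False makes json.dumps fail loudly if a NaN slips through
        return json.dumps(self.to_dict(), indent=2, allow_nan=False)


# Toy frame: the first row mirrors the preview, the second is a placeholder
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Name": ["Braund, Mr. Owen Harris", "Example, Miss. Placeholder"],
    "Age": [22.0, float("nan")],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})
dataset = TitanicDataset.from_dataframe(toy)
text = dataset.to_json()

# Validation: parse the JSON back and check structure and content
parsed = json.loads(text)
assert parsed["passenger_count"] == len(parsed["passengers"])
assert parsed["passengers"][1]["age"] is None
```

To write the result to disk, something like `JSON_FILE.write_text(dataset.to_json())` would work with the `JSON_FILE` path defined in Step 1. Converting NumPy values to plain `int` and `float` before export matters because `json.dumps` does not accept NumPy integer types.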