# Bangla Handwriting Age and Gender Prediction using CNN

This notebook trains a Convolutional Neural Network to predict age and gender from Bangla handwritten text samples.

## Dataset Information
- **Dataset Name**: handwriting-gender-age (Kaggle)
- **Filename Format**: `{id}_{age}_{gender}.jpg`
  - `id`: Sample ID
  - `age`: Writer's age
  - `gender`: 0 (Male), 1 (Female)
- **Image Format**: JPG images with corresponding JSON annotation files
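
As a quick illustration of the filename format above, here is a minimal sketch that splits one name into its three fields. The helper name `split_label_filename`, the regular expression, and the sample filename are all assumptions made for this example; they are not part of the dataset or of this notebook's own parsing code.

```python
import re

# Pattern for "{id}_{age}_{gender}.jpg"; gender is restricted to 0 or 1.
FILENAME_PATTERN = re.compile(r"^(\d+)_(\d+)_([01])\.jpg$")

def split_label_filename(filename):
    """Return (sample_id, age, gender_label), or None if the name does not match."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    sample_id, age, gender = (int(group) for group in match.groups())
    return sample_id, age, "Female" if gender == 1 else "Male"

# "12_23_1.jpg" is a made-up filename used only for illustration.
print(split_label_filename("12_23_1.jpg"))   # (12, 23, 'Female')
print(split_label_filename("notes.txt"))     # None
```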

---

## 📝 **CELL TYPES GUIDE FOR BEGINNERS:**

### 🔹 **MARKDOWN CELLS** (like this one):
- Used for **text, titles, explanations**
- Use these for headings, descriptions, documentation
- Start with `#` for titles, `##` for subtitles, etc.

### 🔹 **CODE CELLS** (Python):
- Used for **actual code** that you want to run
- Contains Python commands, imports, functions
- These cells produce output when executed

---

In [None]:
# 📦 IMPORTING LIBRARIES (CODE CELL)
# This cell imports all the required libraries for our project

import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import cv2
from PIL import Image
import warnings
warnings.filterwarnings('ignore')

# Deep Learning Libraries
import tensorflow as tf
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten, Dense, 
                                   Dropout, BatchNormalization, Input, 
                                   GlobalAveragePooling2D)
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

# Sklearn for preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (classification_report, confusion_matrix, mean_absolute_error, 
                           accuracy_score, precision_score, recall_score, f1_score)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print("Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")

In [None]:
# 📁 DATASET PATH CONFIGURATION (CODE CELL)
# This cell sets up the paths to your dataset

# Kaggle dataset paths (active by default; comment out when running locally)
data_dir = "/kaggle/input/handwriting-gender-age"
converted_dir = os.path.join(data_dir, "converted")
raw_dir = os.path.join(data_dir, "raw")

# Local development paths (uncomment and adjust when running outside Kaggle)
# data_dir = "/Users/md.rakibulislam/IIT/6th Semester/CSE-604/Lab Tasks/Project-3/Likha-Jachai"
# converted_dir = os.path.join(data_dir, "converted")
# raw_dir = os.path.join(data_dir, "raw")

# Check if directories exist
print(f"Data directory exists: {os.path.exists(data_dir)}")
print(f"Converted directory exists: {os.path.exists(converted_dir)}")
print(f"Raw directory exists: {os.path.exists(raw_dir)}")

# Use converted directory as our primary data source
image_dir = converted_dir
print(f"Using image directory: {image_dir}")

## 🔍 Data Exploration and Preprocessing

**This section contains CODE CELLS that:**
- Load and examine our dataset
- Parse filename information (ID, age, gender)
- Create a pandas dataframe with all file information
- Show basic statistics about our data

In [None]:
# 📋 DATA LOADING FUNCTIONS (CODE CELL)
# This cell defines functions to read and parse our dataset

def parse_filename(filename):
    """
    Parse filename to extract ID, age, and gender
    Format: {id}_{age}_{gender}.jpg
    """
    try:
        base_name = filename.replace('.jpg', '').replace('.json', '')
        parts = base_name.split('_')
        if len(parts) >= 3:
            sample_id = int(parts[0])
            age = int(parts[1])
            gender = int(parts[2])
            return sample_id, age, gender
        else:
            return None, None, None
    except ValueError:  # id, age, or gender was not an integer
        return None, None, None

def load_dataset_info(image_dir):
    """
    Load dataset information from image filenames
    """
    data_info = []
    
    # Get all jpg files
    image_files = [f for f in os.listdir(image_dir) if f.endswith('.jpg')]
    
    for filename in image_files:
        sample_id, age, gender = parse_filename(filename)
        if sample_id is not None:
            img_path = os.path.join(image_dir, filename)
            json_path = os.path.join(image_dir, filename.replace('.jpg', '.json'))
            
            data_info.append({
                'filename': filename,
                'image_path': img_path,
                'json_path': json_path,
                'sample_id': sample_id,
                'age': age,
                'gender': gender,
                'gender_label': 'Female' if gender == 1 else 'Male'
            })
    
    return pd.DataFrame(data_info)

# Load dataset information
df = load_dataset_info(image_dir)
print(f"Total samples: {len(df)}")
print(f"Columns: {df.columns.tolist()}")
print("\nDataset Info:")
print(df.head())

In [None]:
# 📊 DATASET STATISTICS (CODE CELL)
# This cell shows basic statistics about our dataset

print("=== Dataset Statistics ===")
print(f"Total samples: {len(df)}")
print(f"Unique sample IDs: {df['sample_id'].nunique()}")
print(f"Age range: {df['age'].min()} - {df['age'].max()}")
print(f"Average age: {df['age'].mean():.2f}")

print("\n=== Gender Distribution ===")
gender_counts = df['gender_label'].value_counts()
print(gender_counts)
# .get() avoids a KeyError if one gender is missing from the data
print(f"Male percentage: {gender_counts.get('Male', 0)/len(df)*100:.1f}%")
print(f"Female percentage: {gender_counts.get('Female', 0)/len(df)*100:.1f}%")

print("\n=== Age Distribution ===")
age_stats = df['age'].describe()
print(age_stats)

In [None]:
# 📈 DATA VISUALIZATION (CODE CELL)
# This cell creates charts and graphs to visualize our data

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Age distribution
axes[0, 0].hist(df['age'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Gender distribution
gender_counts.plot(kind='bar', ax=axes[0, 1], color=['lightcoral', 'lightblue'])
axes[0, 1].set_title('Gender Distribution')
axes[0, 1].set_xlabel('Gender')
axes[0, 1].set_ylabel('Count')
axes[0, 1].tick_params(axis='x', rotation=0)

# Age by Gender
sns.boxplot(data=df, x='gender_label', y='age', ax=axes[1, 0])
axes[1, 0].set_title('Age Distribution by Gender')
axes[1, 0].set_xlabel('Gender')
axes[1, 0].set_ylabel('Age')

# Age histogram by gender
for gender in df['gender_label'].unique():
    subset = df[df['gender_label'] == gender]
    axes[1, 1].hist(subset['age'], alpha=0.7, label=gender, bins=15)
axes[1, 1].set_title('Age Distribution by Gender (Overlay)')
axes[1, 1].set_xlabel('Age')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

# Print some statistics
print("\nAge statistics by gender:")
print(df.groupby('gender_label')['age'].describe())

---

# 📚 **COMPLETE JUPYTER NOTEBOOK GUIDE FOR BEGINNERS**

## 🔥 **CELL TYPES - What Goes Where:**

### 🔹 **MARKDOWN CELLS** (for text and documentation):
```
- Titles and headings (like this section)
- Explanations and descriptions
- Instructions and documentation
- Lists and bullet points
- Mathematical formulas (using LaTeX)
```

**How to create:** Select cell → change dropdown from "Code" to "Markdown"

**Examples of Markdown content:**
- `# Main Title` (creates big heading)
- `## Section Title` (creates medium heading)  
- `### Subsection` (creates small heading)
- `**bold text**` (makes text bold)
- `*italic text*` (makes text italic)
- Bullet points (like this list)

---

### 🔹 **CODE CELLS** (for Python code):
```
- Import statements (import pandas, numpy, etc.)
- Function definitions
- Data loading and processing
- Model creation and training
- Plotting and visualization
- Calculations and analysis
```

**How to create:** Select cell → make sure dropdown shows "Code"

---

## 📋 **NOTEBOOK STRUCTURE BREAKDOWN:**

| **Section** | **Cell Type** | **What it Contains** |
|-------------|---------------|---------------------|
| **Title & Introduction** | MARKDOWN | Project title, description, dataset info |
| **Library Imports** | CODE | `import pandas, numpy, tensorflow, etc.` |
| **Path Configuration** | CODE | Setting up file paths, checking directories |
| **Section Headers** | MARKDOWN | "## Data Loading", "## Model Training", etc. |
| **Data Loading** | CODE | Functions to read files, create dataframes |
| **Data Analysis** | CODE | Statistics, data exploration, `.describe()` |
| **Visualization** | CODE | `plt.plot()`, `sns.heatmap()`, charts |
| **Data Preprocessing** | CODE | Cleaning data, splitting train/test |
| **Model Creation** | CODE | Defining CNN architecture |
| **Model Training** | CODE | `.fit()`, training loops |
| **Results & Evaluation** | CODE | Testing model, showing accuracy |
| **Conclusions** | MARKDOWN | Summary, next steps, findings |

---

## 🎯 **SIMPLE RULES TO FOLLOW:**

1. **Start each major section with a MARKDOWN cell** explaining what you'll do
2. **Follow with CODE cells** that actually do the work
3. **Add comments in CODE cells** to explain what each part does
4. **Use MARKDOWN cells** between sections to organize your work

---

## 🚀 **EXAMPLE PATTERN:**

```
MARKDOWN: "## Loading the Dataset"
CODE: Functions to load data
CODE: Actually load the data  
CODE: Show basic statistics
MARKDOWN: "## Data Visualization" 
CODE: Create plots and charts
MARKDOWN: "## Model Training"
CODE: Define the model
CODE: Train the model
... and so on
```

---

**💡 TIP:** Think of MARKDOWN as your "notebook" where you write explanations, and CODE cells as your "calculator" where you run Python commands!

## 🖼️ Image Preprocessing and Data Loading

**This section contains CODE CELLS that:**
- Load and resize images
- Normalize pixel values (0-255 → 0-1)
- Split data into train/validation/test sets
- Prepare data for machine learning
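
The two numeric steps in this list can be sketched on toy data. The block below is illustrative only: the 2x2 `patch` is made up, and `split_sizes` is a hypothetical helper that assumes 30% of the samples are held out and then halved into validation and test sets. The exact rounding inside scikit-learn may differ slightly for other sample counts.

```python
import math
import numpy as np

# A made-up 2x2 patch of 8-bit pixel values (illustrative data only).
patch = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Dividing by 255 maps the 0-255 range onto 0-1.
normalized = patch.astype(np.float32) / 255.0
print(normalized.min(), normalized.max())

def split_sizes(n_samples, holdout_percent=30):
    """Approximate train/validation/test sizes when `holdout_percent` of the
    data is held out and the held-out part is then split in half."""
    held_out = math.ceil(n_samples * holdout_percent / 100)
    n_test = (held_out + 1) // 2
    n_val = held_out - n_test
    return n_samples - held_out, n_val, n_test

print(split_sizes(200))  # (140, 30, 30)
```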

In [None]:
# 🔧 IMAGE PREPROCESSING FUNCTIONS (CODE CELL)
# This cell defines functions to load and process images for training

def load_and_preprocess_image(img_path, target_size=(128, 128)):
    """
    Load and preprocess image for model training
    """
    try:
        # Load image
        img = cv2.imread(img_path)
        if img is None:
            return None
        
        # Convert BGR to RGB
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        
        # Resize image
        img = cv2.resize(img, target_size)
        
        # Normalize pixel values to [0, 1]
        img = img.astype(np.float32) / 255.0
        
        return img
    except Exception as e:
        print(f"Error processing {img_path}: {e}")
        return None

def load_dataset(df, target_size=(128, 128)):
    """
    Load all images and labels
    """
    images = []
    ages = []
    genders = []
    valid_indices = []
    
    print("Loading images...")
    for idx, row in df.iterrows():
        img = load_and_preprocess_image(row['image_path'], target_size)
        if img is not None:
            images.append(img)
            ages.append(row['age'])
            genders.append(row['gender'])
            valid_indices.append(idx)
        
        if (idx + 1) % 50 == 0:
            print(f"Processed {idx + 1}/{len(df)} images")
    
    images = np.array(images)
    ages = np.array(ages)
    genders = np.array(genders)
    
    print(f"Successfully loaded {len(images)} images")
    print(f"Image shape: {images[0].shape}")
    print(f"Ages range: {ages.min()} - {ages.max()}")
    print(f"Gender distribution: {np.bincount(genders)}")
    
    return images, ages, genders, valid_indices

# Load the dataset
IMAGE_SIZE = (128, 128)  # Smaller size for faster training
X, y_age, y_gender, valid_indices = load_dataset(df, target_size=IMAGE_SIZE)

## 🧠 CNN Model Creation

**This section contains CODE CELLS that:**
- Define the neural network architecture
- Create layers (Conv2D, Dense, Dropout, etc.)
- Compile the model with loss functions and optimizers
- Set up dual outputs (one for age, one for gender)
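
To get a feel for the layer sizes, the sketch below traces the tensor shape by hand. It assumes the stack used in this notebook's model cell: a 128x128 input, three 3x3 convolutions with `'same'` padding (32, 64, 128 filters), each followed by 2x2 max pooling, then a 512-unit dense layer. The helper `trace_feature_maps` is hypothetical and exists only for this illustration.

```python
def trace_feature_maps(height, width, conv_filters, pool=2):
    """Follow the tensor shape through conv ('same' padding) + max-pool stages."""
    shapes = []
    for filters in conv_filters:
        # 'same' padding keeps height/width; 2x2 pooling halves them (floor division).
        height, width = height // pool, width // pool
        shapes.append((height, width, filters))
    return shapes

shapes = trace_feature_maps(128, 128, conv_filters=[32, 64, 128])
print(shapes)  # [(64, 64, 32), (32, 32, 64), (16, 16, 128)]

h, w, c = shapes[-1]
flattened = h * w * c                  # values fed into the first Dense layer
dense_params = flattened * 512 + 512   # weights + biases of Dense(512)
print(flattened, dense_params)         # 32768 16777728
```

Almost all trainable parameters sit in that first dense layer, so changing the image size changes the model size a lot.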

In [None]:
# 🏗️ CREATE CNN MODEL (CODE CELL)
# This cell creates our neural network for predicting age and gender

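# Split the dataset first so the age scaler only ever sees training data
X_train, X_temp, y_age_train_raw, y_age_temp_raw, y_gender_train, y_gender_temp = train_test_split(
    X, y_age, y_gender, test_size=0.3, random_state=42, stratify=y_gender
)

X_val, X_test, y_age_val_raw, y_age_test_raw, y_gender_val, y_gender_test = train_test_split(
    X_temp, y_age_temp_raw, y_gender_temp, test_size=0.5, random_state=42, stratify=y_gender_temp
)

# Standardize ages using statistics from the training split only (avoids data leakage)
age_scaler = StandardScaler()
y_age_train = age_scaler.fit_transform(y_age_train_raw.reshape(-1, 1)).flatten()
y_age_val = age_scaler.transform(y_age_val_raw.reshape(-1, 1)).flatten()
y_age_test = age_scaler.transform(y_age_test_raw.reshape(-1, 1)).flatten()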

print(f"Training: {len(X_train)} samples")
print(f"Validation: {len(X_val)} samples") 
print(f"Testing: {len(X_test)} samples")

# Create the model
def create_cnn_model(input_shape):
    inputs = Input(shape=input_shape)
    
    # Feature extraction layers
    x = Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
    x = MaxPooling2D((2, 2))(x)
    x = Dropout(0.25)(x)
    
    x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2))(x)
    x = Dropout(0.25)(x)
    
    x = Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    x = MaxPooling2D((2, 2))(x)
    x = Dropout(0.25)(x)
    
    x = Flatten()(x)
    x = Dense(512, activation='relu')(x)
    x = Dropout(0.5)(x)
    
    # Age prediction output
    age_output = Dense(1, activation='linear', name='age_output')(x)
    
    # Gender prediction output  
    gender_output = Dense(1, activation='sigmoid', name='gender_output')(x)
    
    model = Model(inputs=inputs, outputs=[age_output, gender_output])
    return model

# Create and compile model
INPUT_SHAPE = (*IMAGE_SIZE, 3)
model = create_cnn_model(INPUT_SHAPE)

model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss={
        'age_output': 'mse',
        'gender_output': 'binary_crossentropy'
    },
    metrics={
        'age_output': ['mae'],
        'gender_output': ['accuracy']
    }
)

print("Model created successfully!")
model.summary()

## 🚀 Model Training

**This section contains CODE CELLS that:**
- Set up training parameters (epochs, batch size)
- Train the model on your data
- Monitor training progress
- Save the best model weights
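
Monitoring progress usually means watching the validation loss and stopping once it stops improving. The sketch below shows that "patience" idea in plain Python. It is a simplified illustration, not Keras's actual implementation: the function `early_stopping_epoch` and the loss values are made up for this example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of per-epoch
    validation losses. Simplified: any strict improvement resets the counter."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses), best_epoch

# Made-up validation losses: the best value is at epoch 4, then no improvement.
losses = [1.00, 0.80, 0.70, 0.65, 0.66, 0.70, 0.68, 0.69, 0.71, 0.72]
print(early_stopping_epoch(losses, patience=5))  # (9, 4)
```

With a patience of 5, training stops at epoch 9 and the weights from epoch 4 are the ones worth keeping.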

In [None]:
# 🔥 TRAIN THE MODEL (CODE CELL)
# This cell actually trains your neural network

# Training configuration
BATCH_SIZE = 32
EPOCHS = 20  # Reduced for faster training

# Prepare training data
train_labels = {
    'age_output': y_age_train,
    'gender_output': y_gender_train
}

val_labels = {
    'age_output': y_age_val,
    'gender_output': y_gender_val
}

# Set up callbacks for better training
callbacks = [
    EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
    ModelCheckpoint('best_model.h5', save_best_only=True)
]

print("Starting model training...")
print("This will take several minutes...")

# Train the model
history = model.fit(
    X_train, train_labels,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val, val_labels),
    callbacks=callbacks,
    verbose=1
)

print("Training completed!")

## 📊 Results and Evaluation

**This section contains CODE CELLS that:**
- Test the trained model on unseen data
- Calculate accuracy metrics
- Visualize training progress
- Show sample predictions with actual vs predicted values
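
Because the model is trained on standardized ages, its raw age error is not in years. The sketch below shows the conversion on made-up numbers: undoing the scaling multiplies the error by the standard deviation of the ages. The ages, the helper names `to_years` and `mae`, and the pretend predictions are all assumptions for this illustration.

```python
import statistics

# Made-up ages used only for illustration.
train_ages = [18, 22, 25, 30, 35]
mean_age = statistics.fmean(train_ages)
std_age = statistics.pstdev(train_ages)  # population std, as StandardScaler uses

def to_years(scaled_value):
    """Undo standardization: years = scaled * std + mean."""
    return scaled_value * std_age + mean_age

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Pretend targets for two test writers (ages 20 and 32) in standardized units.
true_scaled = [(20 - mean_age) / std_age, (32 - mean_age) / std_age]
pred_scaled = [t + 0.5 for t in true_scaled]  # each prediction is off by 0.5 std

mae_scaled = mae(true_scaled, pred_scaled)
mae_years = mae([to_years(t) for t in true_scaled],
                [to_years(p) for p in pred_scaled])
print(f"MAE (standardized): {mae_scaled:.3f}")
print(f"MAE (years): {mae_years:.2f}")
```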

In [None]:
# 🎯 EVALUATE MODEL PERFORMANCE (CODE CELL)
# This cell tests how well our model performs

# Load best model weights
model.load_weights('best_model.h5')

# Test the model
test_loss = model.evaluate(X_test, {'age_output': y_age_test, 'gender_output': y_gender_test}, verbose=0)

print("=== TEST RESULTS ===")
print(f"Overall Loss: {test_loss[0]:.4f}")
print(f"Age MAE (standardized units, not years): {test_loss[3]:.4f}")
print(f"Gender Accuracy: {test_loss[4]:.4f}")

# Make predictions
predictions = model.predict(X_test, verbose=0)
pred_ages = predictions[0].flatten()
pred_genders = predictions[1].flatten()

# Convert back to original age scale
pred_ages_original = age_scaler.inverse_transform(pred_ages.reshape(-1, 1)).flatten()
y_age_test_original = age_scaler.inverse_transform(y_age_test.reshape(-1, 1)).flatten()

# Calculate age error in years
age_mae_years = mean_absolute_error(y_age_test_original, pred_ages_original)
print(f"\nAge prediction error: {age_mae_years:.2f} years")

# Gender accuracy
pred_genders_binary = (pred_genders > 0.5).astype(int)
gender_accuracy = accuracy_score(y_gender_test, pred_genders_binary)
print(f"Gender accuracy: {gender_accuracy:.2%}")

# Show some sample predictions
print("\n=== SAMPLE PREDICTIONS ===")
for i in range(min(5, len(X_test))):
    actual_age = y_age_test_original[i]
    pred_age = pred_ages_original[i] 
    actual_gender = "Female" if y_gender_test[i] == 1 else "Male"
    pred_gender = "Female" if pred_genders_binary[i] == 1 else "Male"
    
    print(f"Sample {i+1}:")
    print(f"  Age: {actual_age:.0f} → {pred_age:.1f} (error: {abs(actual_age-pred_age):.1f} years)")
    print(f"  Gender: {actual_gender} → {pred_gender} {'✓' if actual_gender==pred_gender else '✗'}")
    print()

## 🎉 Project Summary

**Congratulations!** You have successfully:

✅ **Loaded and explored** your Bangla handwriting dataset  
✅ **Preprocessed images** for machine learning  
✅ **Created a CNN model** with dual outputs (age + gender)  
✅ **Trained the model** to learn patterns in handwriting  
✅ **Evaluated performance** on test data  

### 📈 **What This Model Can Do:**
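- Predict a writer's **age** from their handwriting (the measured error is the "Age prediction error" value printed by the evaluation cell)
- Predict a writer's **gender** from their handwriting (the measured accuracy is the "Gender accuracy" value printed by the evaluation cell)
- Process new handwriting samples one image at a time with `model.predict`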

### 🔄 **How to Use Your Trained Model:**

```python
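# Load your saved model (compile=False is enough for inference)
model = tf.keras.models.load_model('best_model.h5', compile=False)

# Preprocess the new image exactly as during training
img = load_and_preprocess_image('new_handwriting.jpg', target_size=IMAGE_SIZE)
pred_age_scaled, pred_gender_prob = model.predict(img[np.newaxis, ...], verbose=0)

# Undo the age scaling and threshold the gender probability
predicted_age = age_scaler.inverse_transform(pred_age_scaled)[0, 0]
predicted_gender = 'Female' if pred_gender_prob[0, 0] > 0.5 else 'Male'
print(f"Predicted Age: {predicted_age:.1f} years")
print(f"Predicted Gender: {predicted_gender}")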
```

### 🚀 **Next Steps:**
- Try with more data to improve accuracy
- Experiment with different image sizes
- Test on your own handwriting samples!

---

**🎯 Remember:** 
- **MARKDOWN cells** = explanations and documentation
- **CODE cells** = actual Python code that runs
- Run cells in order from top to bottom!