## 1. Preamble

### 1.1 Problem Statement

The startup **Fruits!** wants to develop a mobile application for fruit recognition. The goal is to build a scalable image classification pipeline that can:

- Process large volumes of fruit images (90,000+ images)
- Extract meaningful features using transfer learning (MobileNetV2)
- Reduce dimensionality for efficient storage and processing
- Scale horizontally in a Big Data cloud environment

### 1.2 Objectives

1. **Feature Extraction**: Use MobileNetV2 pretrained on ImageNet to extract 1280-dimensional features
2. **Dimensionality Reduction**: Apply PCA to reduce features while preserving variance
3. **Distributed Processing**: Leverage PySpark for scalable, distributed computation
4. **Cloud-Ready**: Design the pipeline to run on AWS EMR or Databricks

### 1.3 Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                   Local Development (Docker)                    │
├─────────────────────────────────────────────────────────────────┤
│  ┌────────────┐    ┌──────────────┐    ┌─────────────────┐      │
│  │   Images   │───▶│  PySpark     │───▶│  PyTorch GPU    │      │
│  │ (S3/Local) │    │  DataFrames  │    │  MobileNetV2    │      │
│  └────────────┘    └──────────────┘    └────────┬────────┘      │
│                                                 │               │
│                                                 ▼               │
│  ┌────────────┐    ┌──────────────┐    ┌─────────────────┐      │
│  │   Output   │◀───│  PCA         │◀───│  Features       │      │
│  │ (Parquet)  │    │  Reduction   │    │  (1280-dim)     │      │
│  └────────────┘    └──────────────┘    └─────────────────┘      │
├─────────────────────────────────────────────────────────────────┤
│  MLflow Tracking: http://localhost:5009                         │
│  Spark UI: http://localhost:4049                                │
│  Jupyter: http://localhost:8889                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## 2. Environment Setup

### 2.1 Imports & Configuration

In [None]:
# Standard library
import os
import sys
import time
import warnings
from pathlib import Path
from datetime import datetime

# Data processing
import numpy as np
import pandas as pd

# PyTorch
import torch
import torchvision
from PIL import Image

# PySpark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import PCA, StandardScaler
from pyspark.ml.linalg import Vectors

# MLflow
import mlflow

# Progress bar
from tqdm.auto import tqdm

# Suppress warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.insert(0, '/app/src')

print(f"Python: {sys.version}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Configuration
CONFIG = {
    # Paths
    'DATASET_PATH': '/app/dataset/fruits-360_dataset/fruits-360',
    'OUTPUT_PATH': '/app/data/Results',
    
    # Processing
    'BATCH_SIZE': 32,
    'PCA_COMPONENTS': 50,
    
    # Spark
    'SPARK_DRIVER_MEMORY': '4g',
    'SPARK_EXECUTOR_MEMORY': '4g',
    
    # MLflow
    'MLFLOW_TRACKING_URI': 'http://mlflow:5000',
    'MLFLOW_EXPERIMENT': 'mission9_experiments',
}

# Create output directory
os.makedirs(CONFIG['OUTPUT_PATH'], exist_ok=True)

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
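
Before starting Spark, it can help to sanity-check these values. The helper below is only a sketch: `validate_config` and `FEATURE_DIM` are hypothetical names that are not part of the project, and the memory pattern covers the common Spark suffix forms (`4g`, `512m`, `4gb`) rather than every string Spark accepts.

```python
import re

FEATURE_DIM = 1280  # MobileNetV2 feature size used throughout this notebook


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # PCA cannot keep more components than there are input features
    if not 1 <= config['PCA_COMPONENTS'] <= FEATURE_DIM:
        problems.append(f"PCA_COMPONENTS must be in [1, {FEATURE_DIM}]")
    if config['BATCH_SIZE'] < 1:
        problems.append("BATCH_SIZE must be positive")
    # Spark memory strings look like '512m', '4g' or '4gb'
    for key in ('SPARK_DRIVER_MEMORY', 'SPARK_EXECUTOR_MEMORY'):
        if not re.fullmatch(r'\d+(b|[kmgtp]b?)', str(config[key]).lower()):
            problems.append(f"{key} should look like '4g' or '512m'")
    return problems


example = {
    'PCA_COMPONENTS': 50,
    'BATCH_SIZE': 32,
    'SPARK_DRIVER_MEMORY': '4g',
    'SPARK_EXECUTOR_MEMORY': '4g',
}
print(validate_config(example))  # []
```

Returning a list instead of raising on the first problem lets a notebook cell report every misconfiguration at once.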

### 2.2 Spark Session

In [None]:
# Create Spark session
spark = SparkSession.builder \
    .appName('Mission9_Fruits360') \
    .master('local[*]') \
    .config('spark.driver.memory', CONFIG['SPARK_DRIVER_MEMORY']) \
    .config('spark.executor.memory', CONFIG['SPARK_EXECUTOR_MEMORY']) \
    .config('spark.sql.execution.arrow.pyspark.enabled', 'true') \
    .config('spark.sql.execution.arrow.maxRecordsPerBatch', '1000') \
    .config('spark.sql.parquet.writeLegacyFormat', 'true') \
    .config('spark.executor.resource.gpu.amount', '1') \
    .config('spark.task.resource.gpu.amount', '0.25') \
    .getOrCreate()

sc = spark.sparkContext

print(f"Spark version: {spark.version}")
print(f"Spark UI: http://localhost:4049")
print(f"Default parallelism: {sc.defaultParallelism}")
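
The two GPU settings above interact with the per-process memory cap used later in Section 4.2. The arithmetic can be sketched as follows; `gpu_sharing` is a hypothetical helper, the 24 GB figure is the assumption stated in the Section 4.2 comment, and whether GPU-aware scheduling is actually honored under `local[*]` depends on the Spark version and deployment.

```python
import math


def gpu_sharing(executor_gpus, task_gpu_amount, memory_fraction, gpu_memory_gb):
    """Estimate how many tasks share a GPU and how much memory each may use."""
    # Tasks Spark may schedule at once against the executor's GPU allocation
    concurrent_tasks = math.floor(executor_gpus / task_gpu_amount + 1e-9)
    # Cap applied by torch.cuda.set_per_process_memory_fraction in each process
    per_process_gb = memory_fraction * gpu_memory_gb
    return {
        'concurrent_tasks': concurrent_tasks,
        'per_process_gb': per_process_gb,
        'worst_case_total_gb': concurrent_tasks * per_process_gb,
    }


# Values from this notebook: 1 GPU per executor, 0.25 GPU per task,
# memory fraction 0.25, and an assumed 24 GB card
plan = gpu_sharing(1, 0.25, 0.25, 24)
print(plan)
# {'concurrent_tasks': 4, 'per_process_gb': 6.0, 'worst_case_total_gb': 24.0}
```

With these numbers, four concurrent workers at 6 GB each can claim the whole card, so lowering the memory fraction or raising the per-task GPU amount leaves headroom.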

### 2.3 MLflow Tracking

In [None]:
# Setup MLflow
mlflow.set_tracking_uri(CONFIG['MLFLOW_TRACKING_URI'])
mlflow.set_experiment(CONFIG['MLFLOW_EXPERIMENT'])

print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")
print(f"MLflow UI: http://localhost:5009")

---

## 3. Data Loading

### 3.1 Dataset Overview

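The **Fruits-360** dataset contains:
- **90,483 images** of fruits and vegetables in total
- **131 classes** (apple varieties, bananas, oranges, etc.)
- **100x100 pixel** RGB images
- **Training set**: 67,692 images
- **Test set**: 22,688 images

The training and test sets sum to 90,380 images; the remaining 103 images belong to the dataset's separate multi-fruit set, which this pipeline does not load.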

In [None]:
# Check dataset structure
dataset_path = Path(CONFIG['DATASET_PATH'])

if dataset_path.exists():
    print(f"Dataset found at: {dataset_path}")
    
    for split in ['Training', 'Test']:
        split_path = dataset_path / split
        if split_path.exists():
            # Count only class directories, not stray files in the split folder
            classes = [c for c in split_path.iterdir() if c.is_dir()]
            n_images = sum(1 for c in classes for _ in c.glob('*.jpg'))
            print(f"  {split}: {len(classes)} classes, {n_images:,} images")
else:
    print(f"Dataset not found at: {dataset_path}")
    print("\nTo download the dataset:")
    print("1. Visit: https://www.kaggle.com/datasets/moltean/fruits")
    print("2. Download and extract to: dataset/fruits-360_dataset/")
    print("\nOr generate a subset for testing:")
    print("  python scripts/subset_data.py --percentage 10")

### 3.2 Load Images

In [None]:
def load_images(path: str, split: str = 'Training') -> 'DataFrame':
    """
    Load images from a directory using Spark's binaryFile reader.
    
    Args:
        path: Base path to dataset
        split: 'Training' or 'Test'
        
    Returns:
        Spark DataFrame with the binaryFile columns (path, modificationTime,
        length, content) plus the derived label and filename columns
    """
    full_path = f"{path}/{split}"
    
    # Load binary files
    df = spark.read.format('binaryFile') \
        .option('pathGlobFilter', '*.jpg') \
        .option('recursiveFileLookup', 'true') \
        .load(full_path)
    
    # Extract label from path
    df = df.withColumn(
        'label',
        F.element_at(F.split(F.col('path'), '/'), -2)
    )
    
    # Extract filename
    df = df.withColumn(
        'filename',
        F.element_at(F.split(F.col('path'), '/'), -1)
    )
    
    return df
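
The label and filename come from the last two `/`-separated path segments, which is what `F.element_at(..., -2)` and `F.element_at(..., -1)` select. A plain-Python equivalent makes the indexing explicit; the example path is illustrative and `label_and_filename` is a hypothetical helper.

```python
def label_and_filename(path: str) -> tuple:
    """Mirror element_at(split(path, '/'), -2) and element_at(..., -1)."""
    parts = path.split('/')
    # Second-to-last segment is the class folder, last segment is the file
    return parts[-2], parts[-1]


# Illustrative path in the layout <dataset>/<split>/<class>/<file>.jpg
example = 'file:/app/dataset/fruits-360_dataset/fruits-360/Training/Banana/0_100.jpg'
print(label_and_filename(example))  # ('Banana', '0_100.jpg')
```

This relies on every image sitting exactly one level below its class folder; a nested layout would need a different index.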

In [None]:
# Load training data
print("Loading training images...")
start_time = time.time()

df_train = load_images(CONFIG['DATASET_PATH'], 'Training')
train_count = df_train.count()

print(f"Loaded {train_count:,} training images in {time.time() - start_time:.1f}s")

In [None]:
# Load test data
print("Loading test images...")
start_time = time.time()

df_test = load_images(CONFIG['DATASET_PATH'], 'Test')
test_count = df_test.count()

print(f"Loaded {test_count:,} test images in {time.time() - start_time:.1f}s")

### 3.3 Data Exploration

In [None]:
# Show sample data
print("Sample training data:")
df_train.select('path', 'label', 'filename', 'length').show(10, truncate=50)

In [None]:
# Class distribution
print("Class distribution (top 20):")
df_train.groupBy('label') \
    .count() \
    .orderBy(F.desc('count')) \
    .show(20)

In [None]:
# Number of unique classes
n_classes = df_train.select('label').distinct().count()
print(f"Total unique classes: {n_classes}")

---

## 4. Feature Extraction

### 4.1 MobileNetV2 Setup

We use **MobileNetV2** pretrained on ImageNet for transfer learning:
- Efficient architecture designed for mobile devices
- 1280-dimensional feature vectors from the penultimate layer
- GPU-accelerated inference with PyTorch
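
The feature length is 1280 regardless of input resolution: the last convolution has 1280 channels, and global average pooling collapses the spatial grid to one value per channel. The sketch below traces only the spatial size. It assumes the standard MobileNetV2 layout (five stride-2, 3x3, padding-1 convolutions, overall stride 32); whether `MobileNetV2Extractor` resizes images before inference is not shown in this notebook, so both 100x100 and 224x224 are traced.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output side length of a square convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def final_feature_map(side: int, n_stride2_layers: int = 5) -> int:
    """Side length of the last feature map after all downsampling layers."""
    for _ in range(n_stride2_layers):
        side = conv_out(side)
    return side


for side in (100, 224):
    s = final_feature_map(side)
    print(f"{side}x{side} input -> 1280 x {s} x {s} map -> pooled to 1280 values")
# 100x100 input -> 1280 x 4 x 4 map -> pooled to 1280 values
# 224x224 input -> 1280 x 7 x 7 map -> pooled to 1280 values
```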

In [None]:
# Import feature extractor
from feature_extractor import MobileNetV2Extractor

# Initialize extractor
extractor = MobileNetV2Extractor(
    device='cuda' if torch.cuda.is_available() else 'cpu',
    batch_size=CONFIG['BATCH_SIZE']
)

# Display model info
print("\nModel Information:")
for key, value in extractor.get_model_info().items():
    print(f"  {key}: {value}")

In [None]:
# Test feature extraction on a sample image
sample_row = df_train.select('content', 'label').first()
sample_features = extractor.extract(bytes(sample_row['content']))

print(f"Sample label: {sample_row['label']}")
print(f"Feature shape: {sample_features.shape}")
print(f"Feature range: [{sample_features.min():.4f}, {sample_features.max():.4f}]")

### 4.2 Distributed Feature Extraction

In [None]:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

# Create Pandas UDF for distributed feature extraction
@pandas_udf(ArrayType(FloatType()))
def extract_features_udf(content_series: pd.Series) -> pd.Series:
    """
    Pandas UDF for batch feature extraction.
    Uses GPU-accelerated MobileNetV2.
    """
    # Initialize extractor on worker (cached)
    if not hasattr(extract_features_udf, '_extractor'):
        # Limit GPU memory usage for PyTorch
        if torch.cuda.is_available():
            # Set memory fraction to limit usage (e.g., 6GB out of total)
            # Assuming ~24GB total, 0.25 is approx 6GB
            torch.cuda.set_per_process_memory_fraction(0.25)
            
        extract_features_udf._extractor = MobileNetV2Extractor(
            device='cuda' if torch.cuda.is_available() else 'cpu',
            batch_size=CONFIG['BATCH_SIZE']  # match the driver-side extractor
        )
    
    extractor = extract_features_udf._extractor
    
    # Process batch
    images = [bytes(img) for img in content_series]
    features = extractor.extract_batch(images)
    
    return pd.Series([f.tolist() for f in features])
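
Each call to the UDF receives one Arrow batch of up to `spark.sql.execution.arrow.maxRecordsPerBatch` rows (1000 in this session). The body of `extract_batch` is not shown here; one plausible approach, assumed for illustration only, is to split that batch into mini-batches of `BATCH_SIZE` before each forward pass. `chunked` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Stand-in for one Arrow batch of 1000 image payloads
arrow_batch = list(range(1000))
mini_batches = list(chunked(arrow_batch, 32))
print(len(mini_batches), len(mini_batches[-1]))  # 32 8
```

With these settings one UDF call means 32 forward passes, the last holding only 8 images, so raising `maxRecordsPerBatch` mainly reduces per-call overhead rather than GPU batch size.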

In [None]:
# Start MLflow run
run_name = f"feature_extraction_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

with mlflow.start_run(run_name=run_name):
    # Log parameters
    mlflow.log_param('dataset_path', CONFIG['DATASET_PATH'])
    mlflow.log_param('batch_size', CONFIG['BATCH_SIZE'])
    mlflow.log_param('pca_components', CONFIG['PCA_COMPONENTS'])
    mlflow.log_param('train_count', train_count)
    mlflow.log_param('test_count', test_count)
    mlflow.log_param('n_classes', n_classes)
    
    # Log model info
    for key, value in extractor.get_model_info().items():
        mlflow.log_param(f'model_{key}', value)
    
    # Extract features from training set
    print("\nExtracting training features...")
    start_time = time.time()
    
    df_train_features = df_train.withColumn(
        'features',
        extract_features_udf(F.col('content'))
    )
    
    # Trigger computation and cache
    df_train_features = df_train_features.cache()
    _ = df_train_features.count()
    
    train_extraction_time = time.time() - start_time
    print(f"Training features extracted in {train_extraction_time:.1f}s")
    mlflow.log_metric('train_extraction_time', train_extraction_time)
    
    # Extract features from test set
    print("\nExtracting test features...")
    start_time = time.time()
    
    df_test_features = df_test.withColumn(
        'features',
        extract_features_udf(F.col('content'))
    )
    
    df_test_features = df_test_features.cache()
    _ = df_test_features.count()
    
    test_extraction_time = time.time() - start_time
    print(f"Test features extracted in {test_extraction_time:.1f}s")
    mlflow.log_metric('test_extraction_time', test_extraction_time)
    
    print(f"\n✓ Feature extraction complete!")

### 4.3 Feature Analysis

In [None]:
# Verify feature extraction
sample = df_train_features.select('label', 'features').first()
print(f"Label: {sample['label']}")
print(f"Feature dimension: {len(sample['features'])}")
print(f"First 10 features: {sample['features'][:10]}")

---

## 5. PCA Dimensionality Reduction

### 5.1 Apply PCA

We reduce the 1280-dimensional features to a lower dimension using PCA:
- Reduces storage and computation costs
- Removes noise and redundant features
- Prepares data for downstream classification

In [None]:
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.ml.feature import PCA, StandardScaler

# Convert features array to Vector type
@F.udf(VectorUDT())
def to_vector(features):
    return Vectors.dense(features)

# Apply to training data
df_train_vec = df_train_features.withColumn(
    'features_vector',
    to_vector(F.col('features'))
)

# Apply to test data
df_test_vec = df_test_features.withColumn(
    'features_vector',
    to_vector(F.col('features'))
)

In [None]:
# Standardize features before PCA
scaler = StandardScaler(
    inputCol='features_vector',
    outputCol='features_scaled',
    withStd=True,
    withMean=True
)

# Fit scaler on training data
scaler_model = scaler.fit(df_train_vec)

# Transform both datasets
df_train_scaled = scaler_model.transform(df_train_vec)
df_test_scaled = scaler_model.transform(df_test_vec)

print("Features standardized")
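
With `withMean=True` and `withStd=True`, each feature is centered and divided by its standard deviation, and Spark's `StandardScaler` uses the corrected (n-1) sample standard deviation. A single-column sketch in plain Python follows; `standardize` is a hypothetical helper, and returning zeros for a constant column is a choice made here for the sketch.

```python
import statistics


def standardize(column):
    """Center a column and scale it by its sample standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.stdev(column)  # corrected (n-1) sample standard deviation
    if std == 0:
        # A constant column carries no variance; map it to zeros
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


print(standardize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```

Fitting the scaler on the training set only, as the cell above does, keeps test-set statistics out of the transformation.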

In [None]:
# Apply PCA
n_components = CONFIG['PCA_COMPONENTS']

print(f"Fitting PCA with {n_components} components...")
start_time = time.time()

pca = PCA(
    k=n_components,
    inputCol='features_scaled',
    outputCol='pca_features'
)

# Fit PCA on training data
pca_model = pca.fit(df_train_scaled)

pca_fit_time = time.time() - start_time
print(f"PCA fitted in {pca_fit_time:.1f}s")

In [None]:
# Transform datasets
df_train_pca = pca_model.transform(df_train_scaled)
df_test_pca = pca_model.transform(df_test_scaled)

print("PCA transformation applied")

### 5.2 Variance Analysis

In [None]:
# Analyze explained variance
explained_variance = pca_model.explainedVariance.toArray()
cumulative_variance = np.cumsum(explained_variance)

print(f"\nPCA Explained Variance Analysis:")
print(f"  Components: {n_components}")
print(f"  Total explained variance: {cumulative_variance[-1]*100:.2f}%")
print(f"\nVariance by component (first 10):")
for i in range(min(10, len(explained_variance))):
    print(f"  PC{i+1}: {explained_variance[i]*100:.2f}% (cumulative: {cumulative_variance[i]*100:.2f}%)")
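
The cumulative curve can also answer the reverse question: how many components are needed to reach a target share of variance? The sketch below uses toy ratios, not results from this pipeline, and `components_for` is a hypothetical helper. Because the model was fitted with `k` components, only those `k` ratios are available; a `None` result means PCA would have to be refitted with a larger `k`.

```python
from itertools import accumulate


def components_for(ratios, threshold):
    """Smallest number of leading components whose ratios reach threshold."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return None  # threshold not reachable with the fitted components


# Toy explained-variance ratios, chosen only to illustrate the logic
toy_ratios = [0.5, 0.25, 0.125, 0.0625]
print(components_for(toy_ratios, 0.80))  # 3
print(components_for(toy_ratios, 0.95))  # None
```

In the notebook, the same call could be applied to `explained_variance.tolist()`.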

In [None]:
# Log PCA results to MLflow
with mlflow.start_run(run_name=f"pca_{n_components}comp"):
    mlflow.log_param('n_components', n_components)
    mlflow.log_metric('explained_variance_total', cumulative_variance[-1])
    mlflow.log_metric('pca_fit_time', pca_fit_time)
    
    # Log variance per component
    for i, var in enumerate(explained_variance[:10]):
        mlflow.log_metric(f'var_pc{i+1}', var)

### 5.3 Visualization

In [None]:
import matplotlib.pyplot as plt

# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
axes[0].bar(range(1, len(explained_variance) + 1), explained_variance * 100)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance (%)')
axes[0].set_title('Explained Variance per Component')

# Cumulative variance
axes[1].plot(range(1, len(cumulative_variance) + 1), cumulative_variance * 100, 'b-o')
axes[1].axhline(y=95, color='r', linestyle='--', label='95% threshold')
axes[1].axhline(y=99, color='g', linestyle='--', label='99% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance (%)')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()

plt.tight_layout()
plt.savefig(f"{CONFIG['OUTPUT_PATH']}/pca_variance.png", dpi=150)
plt.show()

print(f"Plot saved to: {CONFIG['OUTPUT_PATH']}/pca_variance.png")

---

## 6. Results Export

### 6.1 Save to Parquet

In [None]:
from pyspark.sql.types import ArrayType, FloatType

# Convert PCA vector to array for easier storage
@F.udf(ArrayType(FloatType()))
def vector_to_array(v):
    return v.toArray().tolist()

# Prepare training output
df_train_output = df_train_pca.select(
    'path',
    'label',
    'filename',
    vector_to_array('pca_features').alias('pca_features')
)

# Prepare test output
df_test_output = df_test_pca.select(
    'path',
    'label',
    'filename',
    vector_to_array('pca_features').alias('pca_features')
)

In [None]:
# Save training results to Parquet
train_output_path = f"{CONFIG['OUTPUT_PATH']}/training_pca"
print(f"Saving training results to: {train_output_path}")

df_train_output.write \
    .mode('overwrite') \
    .partitionBy('label') \
    .parquet(train_output_path)

print("✓ Training results saved")

In [None]:
# Save test results to Parquet
test_output_path = f"{CONFIG['OUTPUT_PATH']}/test_pca"
print(f"Saving test results to: {test_output_path}")

df_test_output.write \
    .mode('overwrite') \
    .partitionBy('label') \
    .parquet(test_output_path)

print("✓ Test results saved")

### 6.2 Export to CSV

Export to CSV format for compatibility with other tools and cloud storage.

In [None]:
def export_to_csv(df, output_path: str, n_components: int):
    """
    Export PCA results to CSV with individual feature columns.
    """
    # Expand the array into one column per component in a single select,
    # which keeps the query plan flat compared with repeated withColumn calls
    feature_cols = [
        F.col('pca_features')[i].alias(f'f_{i}')
        for i in range(n_components)
    ]
    df_csv = df.select('label', 'filename', *feature_cols)
    
    # Write to CSV
    df_csv.coalesce(1).write \
        .mode('overwrite') \
        .option('header', 'true') \
        .csv(output_path)
    
    return df_csv
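
The resulting file has one row per image and one `f_i` column per PCA component. A standard-library sketch of the same layout is shown below; the row is made up for illustration and `rows_to_csv` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io


def rows_to_csv(rows, n_components):
    """Write (label, filename, features) rows with one column per feature."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    header = ['label', 'filename'] + [f'f_{i}' for i in range(n_components)]
    writer.writerow(header)
    for label, filename, features in rows:
        writer.writerow([label, filename, *features[:n_components]])
    return buffer.getvalue()


# Made-up row with three components instead of 50
toy_rows = [('Banana', '0_100.jpg', [0.5, -1.25, 2.0])]
print(rows_to_csv(toy_rows, 3))
# label,filename,f_0,f_1,f_2
# Banana,0_100.jpg,0.5,-1.25,2.0
```

In the Spark version, `coalesce(1)` funnels all rows through a single task to produce one file, which is convenient at this scale but becomes a bottleneck on much larger outputs.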

In [None]:
# Export training to CSV
train_csv_path = f"{CONFIG['OUTPUT_PATH']}/training_pca_csv"
print(f"Exporting training CSV to: {train_csv_path}")

export_to_csv(df_train_output, train_csv_path, n_components)
print("✓ Training CSV exported")

In [None]:
# Export test to CSV
test_csv_path = f"{CONFIG['OUTPUT_PATH']}/test_pca_csv"
print(f"Exporting test CSV to: {test_csv_path}")

export_to_csv(df_test_output, test_csv_path, n_components)
print("✓ Test CSV exported")

---

## 7. Scaling Analysis

### 7.1 Performance Metrics

In [None]:
# Summary metrics
print("="*60)
print("Pipeline Performance Summary")
print("="*60)
print(f"\nDataset:")
print(f"  Training images: {train_count:,}")
print(f"  Test images: {test_count:,}")
print(f"  Total classes: {n_classes}")

print(f"\nFeature Extraction:")
print(f"  Model: MobileNetV2 (ImageNet)")
print(f"  Original features: 1,280")
print(f"  Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

print(f"\nPCA Reduction:")
print(f"  Components: {n_components}")
print(f"  Explained variance: {cumulative_variance[-1]*100:.2f}%")
print(f"  Compression ratio: {1280/n_components:.1f}x")

print(f"\nOutput:")
print(f"  Training Parquet: {train_output_path}")
print(f"  Test Parquet: {test_output_path}")
print(f"  Training CSV: {train_csv_path}")
print(f"  Test CSV: {test_csv_path}")
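
The compression ratio translates directly into storage for the feature payload. As a rough estimate, assuming 4-byte floats (the export uses `FloatType`) and ignoring Parquet encoding, compression, and the other columns:

```python
def feature_bytes(n_images: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of an n_images x dim matrix of fixed-width floats."""
    return n_images * dim * bytes_per_value


n_images = 67_692 + 22_688  # training + test images
raw = feature_bytes(n_images, 1280)
reduced = feature_bytes(n_images, 50)
print(f"{raw / 1e6:.1f} MB -> {reduced / 1e6:.1f} MB ({raw / reduced:.1f}x)")
# 462.7 MB -> 18.1 MB (25.6x)
```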

### 7.2 Scalability Tests

For scalability testing, use the subset generator to create progressively larger datasets:

```bash
# Generate subsets
python scripts/subset_data.py --percentage 1   # ~900 images
python scripts/subset_data.py --percentage 5   # ~4,500 images
python scripts/subset_data.py --percentage 10  # ~9,000 images
python scripts/subset_data.py --percentage 25  # ~22,500 images
python scripts/subset_data.py --percentage 50  # ~45,000 images
```
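
Once each subset has been timed, throughput relative to the smallest run shows whether the pipeline scales roughly linearly. The sketch below is one way to tabulate this. The timings are illustrative placeholders, not measurements from this pipeline, and `throughput_table` is a hypothetical helper.

```python
def throughput_table(timings: dict) -> list:
    """Return (n_images, images_per_second, relative_throughput) rows.

    `timings` maps a subset size to its wall-clock seconds; relative
    throughput is measured against the smallest subset.
    """
    base_n = min(timings)
    base_rate = base_n / timings[base_n]
    rows = []
    for n in sorted(timings):
        rate = n / timings[n]
        rows.append((n, rate, rate / base_rate))
    return rows


# Placeholder timings in seconds; replace with values logged to MLflow
placeholder = {900: 30.0, 4500: 120.0, 9000: 225.0}
for n, rate, rel in throughput_table(placeholder):
    print(f"{n:>6,} images: {rate:6.1f} img/s ({rel:.2f}x baseline)")
```

A relative throughput that stays near or above 1.0 as the subset grows suggests fixed startup costs are being amortized; a falling value points to a bottleneck such as shuffle or GPU contention.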

In [None]:
# Scaling recommendations
print("\n" + "="*60)
print("Scaling Recommendations")
print("="*60)

print("""
For production deployment on AWS EMR or Databricks:

1. STORAGE:
   - Upload images to an S3 bucket (EU region for GDPR)
   - Use s3a:// protocol for Spark access
   - Store results in Parquet format

2. COMPUTE:
   - EMR cluster with GPU instances (p3.2xlarge)
   - Or Databricks with GPU runtime
   - Scale executors based on dataset size

3. OPTIMIZATION:
   - Increase spark.sql.execution.arrow.maxRecordsPerBatch
   - Use spark.sql.shuffle.partitions = 2x num_cores
   - Enable adaptive query execution

4. GDPR COMPLIANCE:
   - Use the eu-west-1 or eu-central-1 regions
   - Enable S3 bucket encryption
   - Configure VPC for network isolation
""")

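The optimization and storage recommendations above can be collected into a small configuration builder. This is a sketch under stated assumptions: `production_spark_conf` and `s3a_uri` are hypothetical helpers, the bucket name is a placeholder, and the Arrow batch size of 5000 is an arbitrary illustrative increase over the 1000 used locally. The three configuration keys are standard Spark SQL settings.

```python
def production_spark_conf(total_cores: int, arrow_batch: int = 5000) -> dict:
    """Build Spark settings following the recommendations above."""
    if total_cores < 1:
        raise ValueError("total_cores must be positive")
    return {
        # Rule of thumb from the recommendations: 2x the number of cores
        'spark.sql.shuffle.partitions': str(2 * total_cores),
        # Adaptive query execution
        'spark.sql.adaptive.enabled': 'true',
        # Larger Arrow batches than the local setting of 1000
        'spark.sql.execution.arrow.maxRecordsPerBatch': str(arrow_batch),
    }


def s3a_uri(bucket: str, *parts: str) -> str:
    """Compose an s3a:// path for Spark from a bucket and key segments."""
    return 's3a://' + '/'.join([bucket, *parts])


conf = production_spark_conf(total_cores=16)
print(conf['spark.sql.shuffle.partitions'])  # 32
# 'my-fruits-bucket' is a placeholder bucket name
print(s3a_uri('my-fruits-bucket', 'fruits-360', 'Training'))
# s3a://my-fruits-bucket/fruits-360/Training
```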
---

## 8. Conclusion

### Summary

This notebook demonstrates a complete Big Data image processing pipeline:

1. **Data Loading**: PySpark binary file reader for distributed image loading
2. **Feature Extraction**: GPU-accelerated MobileNetV2 transfer learning
3. **Dimensionality Reduction**: PCA from 1280 to 50 components
4. **Results Export**: Parquet and CSV formats for cloud storage

### Key Metrics

| Metric | Value |
|--------|-------|
| Total Images | ~90,000 |
| Original Features | 1,280 |
| PCA Components | 50 |
| Explained Variance | ~95% |
| Compression Ratio | 25.6x |

### Next Steps

1. Deploy to AWS EMR or Databricks for cloud execution
2. Train classification model on PCA features
3. Implement real-time inference API
4. Set up continuous training pipeline

In [None]:
# Cleanup
spark.stop()
print("\n✓ Spark session stopped")
print("\n" + "="*60)
print("Pipeline execution complete!")
print("="*60)