# Application for AI Development Intern - Arpnik Singh (s.arpnik1997@gmail.com)

### This notebook addresses the two main components of the assessment:
1. Processing Video Pipeline for Carrot Detection
   
2. Demo: Toy Classification Implementation


## 1. Processing Video Pipeline
### Problem Statement
Given a working ML model that can process individual images and identify carrots, how can we adapt it to process live video in a grocery store and create records of detected carrots?


### Solution Overview
Transform the single-image carrot classifier into a real-time video processing pipeline that can track and record carrot detections across video frames in a grocery store environment.

#### Architectural Overview
Video Stream → Frame Extractor + Preprocessing → Inference → Tracker → Logger → DB

#### Implementation Strategy
##### Phase 1: Core Pipeline
- Set up video ingestion and frame processing
- Integrate existing model with batch inference
- Implement basic detection recording

##### Phase 2: Tracking & Intelligence
- Add temporal tracking to prevent duplicate counting
- Implement confidence calibration for grocery environment
- Add detection filtering and validation logic

##### Phase 3: Production Deployment
- Optimize for real-time performance
- Add monitoring and alerting systems
- Implement data analytics and reporting features

#### Success Metrics
- **Detection Accuracy:** Precision/recall of carrot identification
- **Processing Latency:** End-to-end time from video frame to recorded detection
- **System Uptime:** Reliability of continuous operation
- **False Positive Rate:** Minimize incorrect carrot identifications in busy environment
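
As a rough illustration of the accuracy metrics above, the sketch below computes precision and recall from counts of matched and unmatched detections. The `DetectionCounts` type and function names are hypothetical. Because true negatives are not well defined for object detection, the sketch reports the share of detections that were wrong (1 - precision) as a stand-in for the false positive rate; that substitution is an assumption, not part of the original plan.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Hypothetical tally of detections compared against labelled ground truth."""
    true_positives: int
    false_positives: int
    false_negatives: int


def precision(counts: DetectionCounts) -> float:
    """Fraction of reported carrots that really were carrots."""
    reported = counts.true_positives + counts.false_positives
    return counts.true_positives / reported if reported else 0.0


def recall(counts: DetectionCounts) -> float:
    """Fraction of real carrots that the system reported."""
    actual = counts.true_positives + counts.false_negatives
    return counts.true_positives / actual if actual else 0.0


def false_detection_share(counts: DetectionCounts) -> float:
    """Share of reported detections that were wrong (1 - precision)."""
    reported = counts.true_positives + counts.false_positives
    return counts.false_positives / reported if reported else 0.0
```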

### Core Architectural Components

#### 1. Video Processing Pipeline
**Frame Extraction:** Capture frames from live video stream at optimal intervals (e.g., 2-5 FPS to balance accuracy vs. performance)

**Preprocessing:** Resize, normalize, and batch frames for efficient model inference

**Buffer Management:** Implement sliding window buffer to handle continuous video stream
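
One way the frame extraction and buffering steps could look is sketched below. It uses only the standard library and treats frames as opaque objects. In a real deployment they would likely come from something like OpenCV's `cv2.VideoCapture`. The names `sampling_stride`, `sample_frames`, and `SlidingBuffer` are hypothetical.

```python
from collections import deque
from typing import Generic, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def sampling_stride(source_fps: float, target_fps: float) -> int:
    """How many source frames to advance per processed frame."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    return max(1, round(source_fps / target_fps))


def sample_frames(frames: Iterable[Frame], stride: int) -> Iterator[Tuple[int, Frame]]:
    """Yield (frame_index, frame) for every `stride`-th frame of the stream."""
    for index, frame in enumerate(frames):
        if index % stride == 0:
            yield index, frame


class SlidingBuffer(Generic[Frame]):
    """Fixed-capacity buffer that silently drops the oldest frame when full."""

    def __init__(self, capacity: int) -> None:
        self._items: deque = deque(maxlen=capacity)

    def push(self, frame: Frame) -> None:
        self._items.append(frame)

    def snapshot(self) -> List[Frame]:
        """Return the buffered frames, oldest first."""
        return list(self._items)
```

For a 30 FPS camera and a 5 FPS target, `sampling_stride(30, 5)` gives a stride of 6, so only every sixth frame reaches the model.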

#### 2. Model Adaptation Layer
**Batch Inference:** Process multiple frames simultaneously to improve throughput

**Confidence Thresholding:** Set appropriate confidence levels to reduce false positives in cluttered grocery environment
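
The batching and thresholding steps can be sketched without committing to a particular model API. In the minimal example below, `Detection`, `batched`, and `keep_confident` are hypothetical names, and the default threshold of 0.6 is only a placeholder that would need tuning on footage from the actual store.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple, TypeVar

T = TypeVar("T")


class Detection(NamedTuple):
    """Hypothetical model output: a box as (x1, y1, x2, y2) plus a confidence."""
    box: Tuple[float, float, float, float]
    confidence: float


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of frames into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def keep_confident(detections: Iterable[Detection], threshold: float = 0.6) -> List[Detection]:
    """Drop detections whose confidence falls below the threshold."""
    return [d for d in detections if d.confidence >= threshold]
```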


#### 3. Temporal Tracking System
**Object Tracking:** Implement tracking algorithm (DeepSORT, ByteTrack) to associate carrot detections across frames

**Duplicate Filtering:** Prevent counting the same carrot multiple times as it moves through frame

**Persistence Logic:** Maintain carrot identity even during brief occlusions
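
To illustrate the idea behind association, duplicate filtering, and persistence, here is a deliberately simplified greedy IoU tracker. It is a stand-in for illustration only, not DeepSORT or ByteTrack, which add motion models and appearance or score-based matching. The class name `IouTracker` and its parameters `iou_threshold` and `max_missed` are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    track_id: int
    box: Box
    missed: int = 0  # consecutive frames without a matching detection


class IouTracker:
    """Greedy IoU matcher that keeps a track alive through brief occlusions."""

    def __init__(self, iou_threshold: float = 0.3, max_missed: int = 5) -> None:
        self.iou_threshold = iou_threshold
        self.max_missed = max_missed
        self.tracks: Dict[int, Track] = {}
        self._next_id = 1

    def update(self, boxes: List[Box]) -> List[int]:
        """Return one track ID per input box, in the same order."""
        unmatched = set(self.tracks)
        assigned: List[int] = []
        for box in boxes:
            best_id: Optional[int] = None
            best_score = self.iou_threshold
            for track_id in unmatched:
                score = iou(self.tracks[track_id].box, box)
                if score >= best_score:
                    best_id, best_score = track_id, score
            if best_id is None:  # no overlap with a known carrot: start a new track
                best_id = self._next_id
                self._next_id += 1
                self.tracks[best_id] = Track(best_id, box)
            else:  # same carrot seen again: refresh its position
                unmatched.discard(best_id)
                self.tracks[best_id].box = box
                self.tracks[best_id].missed = 0
            assigned.append(best_id)
        for track_id in unmatched:  # tracks with no detection this frame
            self.tracks[track_id].missed += 1
            if self.tracks[track_id].missed > self.max_missed:
                del self.tracks[track_id]
        return assigned
```

Counting unique track IDs rather than raw detections is what prevents the same carrot from being recorded once per frame.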

#### 4. Record Generation & Storage
**Detection Logging:** Capture timestamp, bounding box coordinates, confidence score, and unique carrot ID

**Image Snapshots:** Store cropped carrot images for verification and audit trail

**Database Integration:** Use time-series database for efficient querying of detection records
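
A minimal sketch of the logging step follows. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere. A production system would more likely use the time-series or PostgreSQL store mentioned in this notebook. The table layout, `DetectionRecord`, and the helper names are all hypothetical.

```python
import sqlite3
from dataclasses import astuple, dataclass


@dataclass
class DetectionRecord:
    """One logged sighting of a tracked carrot (field order matches the table)."""
    timestamp: float   # seconds since the epoch
    track_id: int      # unique carrot ID from the tracker
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the table and time index if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS detections ("
        "timestamp REAL NOT NULL, track_id INTEGER NOT NULL, "
        "x1 REAL, y1 REAL, x2 REAL, y2 REAL, confidence REAL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_detections_time ON detections (timestamp)"
    )
    return conn


def log_detection(conn: sqlite3.Connection, record: DetectionRecord) -> None:
    """Insert one record; the `with` block commits or rolls back the transaction."""
    with conn:
        conn.execute("INSERT INTO detections VALUES (?, ?, ?, ?, ?, ?, ?)", astuple(record))


def count_unique_carrots(conn: sqlite3.Connection, start: float, end: float) -> int:
    """Number of distinct tracked carrots seen in the closed interval [start, end]."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT track_id) FROM detections WHERE timestamp BETWEEN ? AND ?",
        (start, end),
    ).fetchone()
    return row[0]
```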

### Key Architectural Considerations
#### Performance Optimization

**Edge Computing:** Deploy inference at edge to reduce latency and bandwidth requirements

**GPU Acceleration:** Utilize GPU for real-time processing, with CPU fallback

**Model Quantization:** Reduce model size while maintaining accuracy for deployment constraints

#### Scalability & Reliability
- **Multi-Camera Support:** Architecture should handle multiple grocery store cameras simultaneously
- **Load Balancing:** Distribute processing across multiple inference servers
- **Failover Mechanisms:** Ensure system continues operating if individual components fail

#### Environmental Challenges
- **Lighting Variations:** Handle different lighting conditions throughout the day
- **Occlusion Handling:** Manage partial carrot visibility due to customers/shopping carts
- **Scale Variations:** Detect carrots at different distances from camera
- **Background Clutter:** Distinguish carrots from other orange/similar colored objects

#### Data Management

**Storage Optimization:** Implement data retention policies to manage storage costs

**Privacy Compliance:** Ensure customer privacy while recording detection data

**Real-time Analytics:** Provide dashboards for inventory monitoring and insights

#### Suggested list of technologies to be used
- OpenCV for video capture
- YOLO/Detectron2 as base model
- DeepSORT/ByteTrack for object tracking
- Redis/PostgreSQL for detection logs

This architecture transforms a static image classifier into a robust, real-time video analytics system capable of operating in the challenging environment of a grocery store while maintaining accuracy and performance.

## 2. Demo: Parkinson's Disease Detection via Audio Analysis
### **🎙️ Voice → 🧠 Parkinson's Detection**

## Problem Statement
Build a binary classifier to detect Parkinson's disease from short audio recordings.
## Why Voice Analysis?
Parkinson's disease often affects speech before other symptoms become visible. Voice changes include tremor, reduced volume, and altered rhythm.
Detecting these changes early can support better treatment outcomes.

## Classification Task:
**Input:** Short audio recordings (≤45 seconds)

**Output:** Binary prediction (Healthy vs. Diseased)

**Challenge:** Audio data is high-dimensional, requiring deep learning approaches


### Key Achievements:
- **99.4% accuracy** using deep neural networks on voice recordings
- Systematic comparison of multiple speech processing architectures
- Transfer learning from pre-trained models (WavLM, Wav2Vec2)
- Real-world medical AI application with early detection potential

#### Classification Fundamentals Demonstrated:
- **Binary Classification**: Parkinson's vs Healthy patients
- **Feature Engineering**: Audio signal processing (MFCCs, spectrograms, filter banks)
- **Transfer Learning**: Fine-tuning pre-trained speech models
- **Model Evaluation**: Cross-validation, confusion matrices, precision/recall
- **Class Imbalance Handling**: Techniques for medical dataset challenges
- **Hyperparameter Optimization**: Using Orion framework for systematic tuning
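
This summary does not say which class imbalance technique the project used. One common option, shown here only as an assumed example, is to weight each class inversely to its frequency so that the rarer class contributes as much to the loss as the common one. The function name `inverse_frequency_weights` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def inverse_frequency_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by total / (n_classes * class_count)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}
```

With a perfectly balanced dataset every weight is 1.0. The weights could then be passed to a weighted loss function in the training framework.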

#### Tech Stack:
- PyTorch for deep learning implementation
- SpeechBrain as the speech processing toolkit
- HuggingFace for pre-trained models and datasets
- Orion framework for hyperparameter optimization

#### Medical Relevance:
Speech changes often appear **before** other Parkinson's symptoms, making this approach valuable for early detection and better patient outcomes.

**📁 Complete Implementation Available**: The full project code, notebooks, and documentation are available in the `speech-pd-detection/` folder of this repository.

This project demonstrates my ability to:
1. Work with complex real-world classification problems
2. Apply advanced deep learning techniques to audio data
3. Handle medical dataset challenges (ethics, class imbalance, validation)
4. Achieve state-of-the-art performance through systematic experimentation
5. Build practical AI solutions for healthcare applications

### Model Comparison Results

The complete implementation and results are available in the `speech-pd-detection/` folder. Here's a summary of the key findings:

| Model | Accuracy | Key Features |
|-------|----------|-------------|
| 🥇 WavLM | **99.4%** | Fine-tuned pre-trained self-supervised model (2 epochs) |
| 🥈 Wav2Vec2 | **99.2%** | Fine-tuned self-supervised model |
| 🥉 Xvector+FBanks | **98.0%** | Memory-efficient model trained from scratch |
| ⭐ Xvector+MFCCs | **94-97%** | Comparable performance with traditional features |
| 🤔 ECAPA-TDNN | **85-90%** | Pre-trained model with added noise robustness |

### Classification Techniques Demonstrated:

#### 1. **Feature Engineering**
Audio feature extraction pipeline:
- MFCCs (Mel-Frequency Cepstral Coefficients)
- Filter Banks (FBanks)
- Raw waveform processing
- Spectral features and transformations

#### 2. **Transfer Learning**
Leveraging pre-trained speech models:
- WavLM: Self-supervised learning on large speech corpora
- Wav2Vec2: Contrastive learning for speech representations
- Fine-tuning with medical domain data

#### 3. **Model Architecture**
Deep learning approaches:
- X-vector embeddings, originally developed for speaker verification, adapted to PD detection
- ECAPA-TDNN for robust speaker representations
- Attention mechanisms for temporal modeling

#### 4. **Evaluation Strategy**
- Negative log-likelihood (NLL) loss as the training objective
- Cross-validation with patient-level splits
- Classification error rate as the evaluation metric
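
Patient-level splitting matters because recordings from the same person placed in both train and test sets would leak speaker identity and inflate accuracy. The sketch below shows one way to build such folds with the standard library. It is an illustration under assumed names (`patient_level_folds`, `patient_ids`), not the project's actual splitting code.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def patient_level_folds(
    patient_ids: Sequence[Hashable], n_folds: int
) -> List[Tuple[List[int], List[int]]]:
    """Return (train_indices, test_indices) pairs with no patient in both sets.

    `patient_ids[i]` is the patient who produced recording `i`.
    """
    by_patient: Dict[Hashable, List[int]] = defaultdict(list)
    for index, patient in enumerate(patient_ids):
        by_patient[patient].append(index)
    if n_folds < 2 or n_folds > len(by_patient):
        raise ValueError("n_folds must be between 2 and the number of patients")

    # Assign patients with the most recordings first, each to the currently
    # smallest fold, so fold sizes stay roughly balanced.
    test_folds: List[List[int]] = [[] for _ in range(n_folds)]
    for patient in sorted(by_patient, key=lambda p: len(by_patient[p]), reverse=True):
        smallest = min(test_folds, key=len)
        smallest.extend(by_patient[patient])

    all_indices = set(range(len(patient_ids)))
    return [
        (sorted(all_indices - set(test)), sorted(test))
        for test in test_folds
    ]
```

Libraries such as scikit-learn offer a similar grouped splitter (`GroupKFold`), which would be the more usual choice in practice.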

### Key Insights from the Project:

1. **Transfer Learning Effectiveness**: Pre-trained models (WavLM, Wav2Vec2) achieved superior performance with minimal fine-tuning
2. **Feature Importance**: Raw waveform processing outperformed traditional acoustic features
3. **Data Efficiency**: High accuracy achieved with limited medical data through transfer learning
4. **Robustness**: Models maintained performance across different recording conditions

### Real-World Impact:
This work demonstrates practical AI application in healthcare, addressing:
- **Early Detection**: Speech changes appear before motor symptoms
- **Accessibility**: Smartphone-based screening potential
- **Scalability**: Automated analysis for large-scale screening programs
- **Cost-Effectiveness**: Reduces need for expensive clinical assessments
