# FrontierML: Machine Learning with Real-World Data

Welcome to **FrontierML**, an interactive course that teaches machine learning through hands-on implementation with real-world data collection and analysis.

## Course Overview

This course combines **theoretical understanding** with **practical implementation** across the 21 chapters outlined below, with an emphasis on:

- **Real-world data collection** through ethical web scraping and APIs
- **Mathematical foundations** with step-by-step derivations and citations
- **Implementation from scratch** to understand core concepts
- **Industry-standard tools** (scikit-learn, pandas, matplotlib)
- **Best practices** for reproducible data science

## Course Structure

### **Foundation & Data (Chapters 1-2)**

#### Chapter 1: Data Collection and Web Scraping
**File:** `01_data_collection.ipynb`

Learn to collect and preprocess real-world data:
- Ethical web scraping principles and techniques
- API interactions for data collection
- Data quality assessment and cleaning
- Feature engineering for machine learning

**Key Skills:** Web scraping, data preprocessing, feature engineering
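
As a rough illustration of the ethical scraping item above, the sketch below checks a `robots.txt` policy with Python's standard `urllib.robotparser` before any request is made. The policy text, the `FrontierMLBot` user agent, and the `example.com` URLs are hypothetical placeholders and are not taken from the notebook.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed = parser.can_fetch("FrontierMLBot", "https://example.com/listings/1")
blocked = parser.can_fetch("FrontierMLBot", "https://example.com/private/data")

# Seconds to wait between requests, if the site declares a crawl delay.
delay = parser.crawl_delay("FrontierMLBot")
```

A scraper built on this check would skip any URL for which `can_fetch` returns `False` and sleep for `delay` seconds between requests.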

#### Chapter 2: Linear Regression
**File:** `02_linear_regression.ipynb`

Master the foundation of predictive modeling:
- Mathematical derivation of least squares estimation
- Implementation from scratch using NumPy
- Real estate price prediction with scraped data
- Model evaluation and interpretation

**Key Skills:** Regression analysis, mathematical implementation, model evaluation
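
The least squares idea listed above can be sketched for the single-feature case, where the closed-form solution is slope = cov(x, y) / var(x). This version uses plain Python rather than NumPy so that it runs without extra dependencies; the function name `fit_simple_ols` and the area/price figures are made up for illustration.

```python
def fit_simple_ols(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up data that follows price = 2 * area + 50 exactly.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 210.0, 250.0, 290.0]

slope, intercept = fit_simple_ols(areas, prices)
predicted = slope * 90.0 + intercept  # prediction for an unseen area
```

With several features the same criterion leads to the normal equations, which is where a matrix library such as NumPy becomes the natural tool.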

---

### **Supervised Learning - Classification (Chapters 3-7, 11-12, 18)**

#### Chapter 3: Logistic Regression
**File:** `03_logistic_regression.ipynb`

Understand probabilistic classification:
- Sigmoid function and maximum likelihood estimation
- Binary and multi-class classification
- Feature scaling and regularization

**Key Skills:** Classification, probability estimation, performance evaluation
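
A minimal sketch of the two building blocks named above: the sigmoid function and the average negative log-likelihood (log loss) that maximum likelihood estimation minimizes for binary labels. The function names and the clipping constant `eps` are illustrative choices, not the notebook's code.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_loss(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood for binary labels in {0, 1}."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Probabilities for two raw scores, and the loss of an uninformative model.
probabilities = [sigmoid(-2.0), sigmoid(2.0)]
baseline_loss = log_loss([1, 0], [0.5, 0.5])
```

Predicting 0.5 for every example gives a loss of ln 2, a convenient baseline when judging whether a trained model has learned anything.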

#### Chapter 4: Decision Trees
**File:** `04_decision_trees.ipynb`

Learn tree-based learning algorithms:
- Information theory and entropy calculations
- Tree construction and splitting criteria
- Overfitting prevention and pruning

**Key Skills:** Tree algorithms, information theory, interpretability
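
To make the entropy and splitting-criteria items concrete, here is a small sketch of Shannon entropy and the information gain of a binary split. The helper names and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no"]
perfect_gain = information_gain(parent, ["yes", "yes"], ["no", "no"])
useless_gain = information_gain(parent, ["yes", "no"], ["yes", "no"])
```

A tree builder evaluates candidate splits this way and keeps the one with the highest gain; a split that leaves both children as mixed as the parent has zero gain.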

#### Chapter 5: Random Forest
**File:** `05_random_forest.ipynb`

Master ensemble learning techniques:
- Bootstrap aggregating (bagging) principles
- Random feature selection strategies
- Out-of-bag error estimation

**Key Skills:** Ensemble methods, hyperparameter tuning, model selection
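
The bagging and out-of-bag ideas above rest on one simple operation: sampling row indices with replacement and noting which rows were never drawn. The sketch below shows only that step, under the assumption that each tree would then be trained on `in_bag` and evaluated on `oob`; the function name is hypothetical.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the sketch is reproducible
in_bag, oob = bootstrap_indices(10, rng)
```

For large datasets roughly 37% of rows (about 1/e) end up out of bag for any given tree, which is what makes out-of-bag error a usable validation estimate without a separate holdout set.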

#### Chapter 6: Support Vector Machines
**File:** `06_support_vector_machines.ipynb`

Understand margin-based classification:
- Geometric interpretation and margin maximization
- Kernel methods and the kernel trick
- Handling non-linearly separable data

**Key Skills:** Geometric thinking, kernel methods, optimization

#### Chapter 7: Neural Networks
**File:** `07_neural_networks.ipynb`

Introduction to deep learning fundamentals:
- Perceptron and multi-layer networks
- Backpropagation algorithm implementation
- Activation functions and optimization

**Key Skills:** Neural networks, gradient-based optimization, deep learning

#### Chapter 11: Naive Bayes Classification
**File:** `11_naive_bayes.ipynb`

Learn probabilistic classification with independence assumptions:
- Bayes' theorem and conditional independence
- Gaussian, Multinomial, and Bernoulli variants
- Text classification applications

**Key Skills:** Bayesian inference, probabilistic models, text classification

#### Chapter 12: K-Nearest Neighbors (KNN)
**File:** `12_k_nearest_neighbors.ipynb`

Understand instance-based learning:
- Distance metrics and similarity measures
- Optimal k selection using cross-validation
- Curse of dimensionality considerations

**Key Skills:** Instance-based learning, distance metrics, lazy learning
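
Instance-based learning can be sketched in a few lines: store the training points, then classify a query by majority vote among its `k` nearest neighbors. This version assumes Euclidean distance and uses made-up points and labels; the notebook may use other metrics.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy two-cluster data for illustration only.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
train_labels = ["low", "low", "low", "high", "high", "high"]

near_low = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
near_high = knn_predict(train_points, train_labels, (8.0, 9.0), k=3)
```

There is no training step beyond storing the data, which is why KNN is described as a lazy learner: all the work happens at prediction time.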

#### Chapter 18: AdaBoost (Adaptive Boosting)
**File:** `18_adaboost.ipynb`

Master adaptive ensemble methods:
- Weight update mechanisms and error-based learning
- Weak learner combination strategies
- Theoretical guarantees and convergence

**Key Skills:** Boosting algorithms, adaptive learning, ensemble theory

---

### **Supervised Learning - Advanced (Chapter 13)**

#### Chapter 13: Gradient Boosting Machines
**File:** `13_gradient_boosting.ipynb`

Learn state-of-the-art boosting techniques:
- XGBoost, LightGBM, and CatBoost implementations
- Regularization techniques to prevent overfitting
- Hyperparameter tuning for optimal performance

**Key Skills:** Advanced boosting, hyperparameter optimization, production ML

---

### **Unsupervised Learning - Clustering (Chapters 8-9, 15, 17)**

#### Chapter 8: K-Means Clustering
**File:** `08_k_means_clustering.ipynb`

Understand centroid-based clustering:
- Lloyd's algorithm and convergence properties
- Cluster evaluation metrics and validation
- Applications in data segmentation

**Key Skills:** Clustering algorithms, unsupervised evaluation, data exploration
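
Lloyd's algorithm alternates two steps until the centroids stop moving: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below is a bare-bones version with caller-supplied initial centroids; the function name, the toy points, and the choice to keep an empty cluster's centroid unchanged are assumptions for illustration.

```python
import math


def lloyd_kmeans(points, centroids, max_iter=100):
    """Run Lloyd's algorithm from the given initial centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = lloyd_kmeans(points, centroids=[(0, 0), (10, 10)])
```

Each iteration can only lower (or keep) the within-cluster sum of squared distances, so the loop terminates, although the result depends on the initial centroids and may be a local optimum.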

#### Chapter 9: Hierarchical Clustering
**File:** `09_hierarchical_clustering.ipynb`

Learn tree-based clustering methods:
- Agglomerative and divisive clustering approaches
- Linkage criteria and distance metrics
- Dendrogram interpretation and cluster selection

**Key Skills:** Hierarchical methods, tree structures, cluster analysis

#### Chapter 15: DBSCAN Clustering
**File:** `15_dbscan_clustering.ipynb`

Master density-based clustering:
- Core points, border points, and noise identification
- Parameter selection for epsilon and minimum points
- Comparison with other clustering methods

**Key Skills:** Density-based methods, outlier detection, parameter tuning

#### Chapter 17: Gaussian Mixture Models (GMM)
**File:** `17_gaussian_mixture_models.ipynb`

Understand probabilistic clustering:
- Expectation-Maximization algorithm
- Model selection using information criteria
- Applications in density estimation

**Key Skills:** Probabilistic models, EM algorithm, density estimation

---

### **Dimensionality Reduction (Chapters 10, 16)**

#### Chapter 10: Principal Component Analysis (PCA)
**File:** `10_principal_component_analysis.ipynb`

Learn linear dimensionality reduction:
- Eigenvalue decomposition and covariance matrices
- Variance explanation and component interpretation
- Visualization of high-dimensional data

**Key Skills:** Linear algebra, dimensionality reduction, data visualization

#### Chapter 16: Linear Discriminant Analysis (LDA)
**File:** `16_linear_discriminant_analysis.ipynb`

Understand supervised dimensionality reduction:
- Fisher's linear discriminant
- Between-class and within-class scatter matrices
- Comparison with PCA

**Key Skills:** Supervised reduction, discriminant analysis, classification preprocessing

---

### **Pattern Mining (Chapter 14)**

#### Chapter 14: Association Rule Mining
**File:** `14_association_rule_mining.ipynb`

Discover relationships in transactional data:
- Apriori and FP-Growth algorithms
- Support, confidence, and lift metrics
- Market basket analysis applications

**Key Skills:** Pattern discovery, rule mining, recommendation systems
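
The three rule metrics above can be computed directly from a list of transactions. The sketch below uses a made-up four-basket dataset and hypothetical helper names; it evaluates a single rule rather than running Apriori or FP-Growth.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Made-up market baskets for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
```

A lift below 1, as in this toy data, means the antecedent makes the consequent slightly less likely than its baseline rate; values above 1 indicate a positive association.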

---

### **Reinforcement Learning (Chapter 19)**

#### Chapter 19: Q-Learning (Reinforcement Learning)
**File:** `19_q_learning.ipynb`

Introduction to reinforcement learning:
- Markov Decision Processes fundamentals
- Q-learning algorithm and value iteration
- Exploration vs exploitation strategies

**Key Skills:** Reinforcement learning, dynamic programming, agent-based modeling
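
The core of tabular Q-learning is a single update rule, Q(s, a) ← Q(s, a) + α [r + γ max Q(s', ·) − Q(s, a)], paired with an exploration policy such as epsilon-greedy. The sketch below stores the table as nested dictionaries; the state and action names, the hyperparameter defaults, and the function names are illustrative assumptions.

```python
import random


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Value of the best action in the next state (0.0 if it is terminal or unseen).
    best_next = max(q[next_state].values()) if q.get(next_state) else 0.0
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


def epsilon_greedy(q, state, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best-known action."""
    if rng.random() < epsilon:
        return rng.choice(list(q[state]))
    return max(q[state], key=q[state].get)


# Hypothetical two-state table.
q_table = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}

updated = q_update(q_table, "s0", "right", reward=0.5, next_state="s1")
greedy_action = epsilon_greedy(q_table, "s0", epsilon=0.0)
```

Setting `epsilon` to 0 makes the policy purely exploitative; training loops typically start with a larger value and decay it over time.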

---

### **Deep Learning (Chapters 20-21)**

#### Chapter 20: Autoencoders (Deep Learning)
**File:** `20_autoencoders.ipynb`

Learn representation learning with neural networks:
- Encoder-decoder architectures
- Variational autoencoders and regularization
- Applications in anomaly detection

**Key Skills:** Deep learning, representation learning, generative models

#### Chapter 21: Convolutional Neural Networks (CNN)
**File:** `21_convolutional_neural_networks.ipynb`

Master computer vision with deep learning:
- Convolution and pooling operations
- CNN architectures (LeNet, AlexNet, ResNet concepts)
- Applications in image classification

**Key Skills:** Computer vision, convolutional operations, image processing
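
The convolution and pooling operations listed above can be sketched on nested lists without any deep learning framework. This version assumes stride 1 and no padding for the convolution, and non-overlapping windows for pooling. Like most deep learning libraries, it does not flip the kernel, so it technically computes a cross-correlation. Function names and inputs are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding, kernel not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size=2):
    """Take the maximum over non-overlapping size x size windows."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [
        [max(feature_map[i + di][j + dj] for di in range(size) for dj in range(size)) for j in cols]
        for i in rows
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
diagonal_kernel = [[1, 0], [0, -1]]  # responds to change along the main diagonal
feature_map = conv2d_valid(image, diagonal_kernel)

pooled = max_pool2d([[1, 3, 2, 4], [5, 6, 7, 8], [3, 2, 1, 0], [1, 2, 3, 4]], size=2)
```

A 3x3 input with a 2x2 kernel yields a 2x2 feature map, and 2x2 pooling halves each spatial dimension, which is the shape arithmetic that CNN architectures build on.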

---

## Prerequisites

- **Python Programming:** Basic to intermediate Python skills
- **Mathematics:** Linear algebra, calculus basics, probability
- **Statistics:** Descriptive statistics, hypothesis testing concepts

## Required Libraries

All dependencies are listed in `requirements.txt`:
```bash
pip install -r requirements.txt
```

Key libraries:
- **Data Processing:** pandas, numpy, scipy
- **Machine Learning:** scikit-learn
- **Deep Learning:** tensorflow, keras
- **Visualization:** matplotlib, seaborn, plotly
- **Web Scraping:** requests, beautifulsoup4, selenium

## Getting Started

### Option 1: Jupyter Book (Recommended)
```bash
# Build the interactive book
make book

# Serve locally
make serve
```

### Option 2: Individual Notebooks
```bash
# Start Jupyter Lab
make jupyter

# Navigate to notebooks/ directory
```

## Learning Philosophy

This course follows several key principles:

### 1. **Theory + Practice**
Every algorithm is presented with:
- Mathematical foundations and derivations
- Step-by-step implementation from scratch
- Real-world applications with actual data

### 2. **Real Data Focus**
Instead of toy datasets:
- Scrape real-world data from websites and APIs
- Handle messy, incomplete data
- Address real preprocessing challenges

### 3. **Reproducible Science**
All work follows scientific principles:
- Proper citations for all concepts
- Documented methodology and assumptions
- Version-controlled code and data

### 4. **Progressive Complexity**
Start simple, build complexity:
- Begin with linear models and fundamentals
- Progress through supervised and unsupervised learning
- Advance to ensemble methods and deep learning

## Learning Outcomes

By completing this course, you will:

1. **Master essential ML/AI techniques** across 21 chapters, from basic regression to deep learning
2. **Collect real-world data** from various sources ethically and efficiently
3. **Understand mathematical foundations** through step-by-step derivations backed by citations
4. **Implement algorithms from scratch** using NumPy and Python
5. **Apply production tools** like scikit-learn and TensorFlow effectively
6. **Evaluate and interpret models** using appropriate metrics and visualizations
7. **Follow best practices** for reproducible data science workflows

## Course Categories Coverage

**Supervised Learning:** Linear/Logistic Regression, Trees, Forests, SVM, Neural Networks, Naive Bayes, KNN, Boosting, LDA  
**Unsupervised Learning:** K-Means, Hierarchical, DBSCAN, GMM, PCA  
**Pattern Mining:** Association Rules  
**Reinforcement Learning:** Q-Learning  
**Deep Learning:** Autoencoders, CNNs  

## Assessment Approach

Each chapter includes:
- **Interactive exercises** embedded in notebooks
- **Real-world projects** with actual data
- **Mathematical problems** to test understanding
- **Implementation challenges** to build coding skills

---

**Ready to begin your comprehensive machine learning journey?** Start with [Chapter 1: Data Collection](01_data_collection.ipynb) and learn to gather real-world data for analysis!