# The Machine Learning Landscape

## What is Machine Learning?
- Machine Learning is the science (and art) of programming computers so they can learn from data. Instead of being explicitly programmed to perform a task, a machine learning system is trained using data and algorithms to identify patterns and make predictions or decisions. The goal is to create models that can generalize well to new, unseen data.

## Why use Machine Learning?
1. **No Known Algorithmic Solution**:  
   - For complex tasks like speech recognition, image classification, or natural language processing, writing explicit rules is impractical or impossible.  
   - ML learns patterns directly from data.  

2. **Dynamic Environments**:  
   - ML systems can adapt to changing data (e.g., spam filters adapting to new types of spam).  

3. **Large-Scale Data**:  
   - ML can process and extract insights from massive datasets that are too large or complex for manual analysis.  

4. **Personalization**:  
   - ML enables tailored solutions for individual users (e.g., Netflix recommendations, personalized ads).  

5. **Automation**:  
   - ML automates decision-making, reducing the need for human intervention in repetitive or complex tasks.
 
## Examples of Applications:

1. **Image and Video Recognition**:  
   - Facial recognition (e.g., unlocking smartphones).  
   - Object detection in self-driving cars.  
   - Medical imaging (e.g., detecting tumors in X-rays or MRIs).  

2. **Natural Language Processing (NLP)**:  
   - Language translation (e.g., Google Translate).  
   - Sentiment analysis (e.g., analyzing customer reviews).  
   - Chatbots and virtual assistants (e.g., Siri, Alexa).  

3. **Recommendation Systems**:  
   - Personalized product recommendations (e.g., Amazon, Netflix).  
   - Music and content recommendations (e.g., Spotify, YouTube).  

4. **Healthcare**:  
   - Predicting disease outbreaks.  
   - Personalized treatment plans.  
   - Drug discovery and development.  

5. **Finance**:  
   - Fraud detection (e.g., identifying suspicious transactions).  
   - Algorithmic trading.  
   - Credit scoring and risk assessment.  

6. **Retail and E-commerce**:  
   - Demand forecasting.  
   - Inventory management.  
   - Dynamic pricing (e.g., adjusting prices based on demand).  

7. **Autonomous Vehicles**:  
   - Self-driving cars (e.g., Tesla, Waymo).  
   - Drone navigation.  

8. **Gaming**:  
   - AI opponents in games (e.g., chess, Go).  
   - Procedural content generation.  

9. **Marketing**:  
   - Customer segmentation.  
   - Targeted advertising.  
   - Churn prediction (e.g., identifying customers likely to leave).  

10. **Manufacturing**:  
    - Predictive maintenance (e.g., predicting equipment failures).  
    - Quality control and defect detection.

## **Types of Machine Learning Systems**  
Machine learning algorithms can be classified according to the amount of supervision they get during training. There are **4 major types** of ML algorithms:  


#### 1. **Supervised Learning**  
With supervised learning, the training set we feed into the algorithm contains the targets/labels/desired predictions. Most supervised learning tasks fall under two umbrellas: **Classification** and **Regression**.  

- **Classification**: Predicting discrete values (e.g., is the email spam or not spam).  
- **Regression**: Predicting continuous target values (e.g., predicting the price of houses in dollars).  

Some regression-based models are used for classification as well, such as **Logistic Regression**, which outputs a probability.  
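
To make the Logistic Regression remark concrete, here is a minimal sketch of how a linear score can be squashed into a probability and then thresholded into a class. The weight, bias, and 0.5 threshold are made-up values for illustration; a real model would learn the parameters from labeled data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x: float, weight: float, bias: float) -> float:
    """Estimated probability of the positive class for a single feature x."""
    return sigmoid(weight * x + bias)

def predict_class(x: float, weight: float, bias: float, threshold: float = 0.5) -> int:
    """Turn the probability into a discrete label (1 = positive, 0 = negative)."""
    return int(predict_proba(x, weight, bias) >= threshold)

# Hypothetical parameters, chosen by hand for this example.
WEIGHT, BIAS = 2.0, -4.0
for x in (0.0, 2.0, 4.0):
    print(x, round(predict_proba(x, WEIGHT, BIAS), 3), predict_class(x, WEIGHT, BIAS))
```

The regression part produces a continuous value; the threshold is what turns it into a classification.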

**Popular Supervised Learning Algorithms**:  
- K-Nearest Neighbors  
- Linear Regression  
- Logistic Regression  
- Decision Trees and Random Forests  
- Artificial Neural Networks  
- Naive Bayes  


#### 2. **Unsupervised Learning**  
In unsupervised learning, the data is unlabeled, and the system tries to learn without a teacher by finding internal structure within the dataset.  

**Unsupervised Learning Tasks**:  
- **Clustering**:  
  - K-Means  
  - DBSCAN  
  - Hierarchical Cluster Analysis  
- **Anomaly Detection**:  
  - One-Class SVM  
  - Isolation Forest  
  - Autoencoders  
- **Dimensionality Reduction**: Compressing data without losing too much information (e.g., merging highly correlated features).  
  - Principal Component Analysis (PCA)  
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)  
  - Kernel PCA  
  - Locally Linear Embedding (LLE)  
- **Association Rule Learning**: Finding interesting relations between attributes.  
  - Apriori  
  - Eclat  
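
As an illustration of finding structure without labels, the following is a minimal sketch of K-Means on one-dimensional data. The data points and the initial centroids are invented for the example, and real implementations handle initialization and convergence checks more carefully.

```python
def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        for p in points
    ]

def kmeans_1d(points, initial_centroids, n_iters=10):
    """Alternate between assigning points and moving centroids."""
    centroids = list(initial_centroids)
    for _ in range(n_iters):
        labels = assign(points, centroids)
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if its cluster is empty
                centroids[i] = sum(members) / len(members)
    return centroids, assign(points, centroids)

# Toy data with two obvious groups; no labels are given to the algorithm.
points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, labels = kmeans_1d(points, initial_centroids=[1.0, 2.0])
print(centroids, labels)
```

Even with both starting centroids placed in the left group, the updates pull one centroid toward each group.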


#### 3. **Semi-Supervised Learning**  
In semi-supervised learning, the training data is only partially labeled, typically a few labeled instances and many unlabeled ones. The goal is to use the structure of the unlabeled data to support the task that the labeled data defines. Most semi-supervised learning algorithms combine unsupervised and supervised learning algorithms.  


#### 4. **Reinforcement Learning**  
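An agent observes the environment, selects an action, receives a reward (or a penalty, in the form of a negative reward), and updates its policy, the strategy that decides which action to take in a given situation, so as to collect the most reward over time.  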
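
The observe, act, reward, update loop can be sketched with a toy two-action problem. Everything here is an assumption made for illustration: the `reward` function stands in for the environment, the agent explores by cycling through actions for a few steps, and the "policy" is simply to pick the action with the highest estimated value.

```python
def reward(action: int) -> float:
    """Hypothetical environment: action 1 pays off, action 0 does not."""
    return 1.0 if action == 1 else 0.0

def run_agent(n_actions=2, n_steps=20, explore_steps=4, learning_rate=0.5):
    values = [0.0] * n_actions  # the agent's running estimate of each action's reward
    history = []
    for step in range(n_steps):
        if step < explore_steps:
            action = step % n_actions  # explore: try every action in turn
        else:
            # exploit: follow the current policy (best estimated action)
            action = max(range(n_actions), key=lambda a: values[a])
        r = reward(action)
        # Move the estimate for the chosen action toward the observed reward.
        values[action] += learning_rate * (r - values[action])
        history.append(action)
    return values, history

values, history = run_agent()
print(values, history)
```

After the short exploration phase the agent settles on the rewarding action, and its value estimate for that action approaches 1.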


### **Batch vs. Online ML Algorithms**  
We can also categorize ML systems into **batch** or **online algorithms**, depending on whether the algorithm can learn incrementally from an incoming stream of data.  

- **Batch Learning**:  
  - The model is incapable of incremental learning.  
  - It learns from all available data offline and is then deployed to produce predictions without feeding it new data points.  
  - Also called **Offline Learning**.  

- **Online Learning**:  
  - The model is trained incrementally by continuously feeding it data instances as they come, either individually or in small groups called mini-batches.  
  - Each learning step is fast and cheap, so the system can learn on the fly.  
  - **Challenges**:  
    - Bad incoming data can damage the model.  
    - Requires monitoring through performance metrics and anomaly detection to mitigate risks.  
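
A minimal sketch of online learning, assuming a one-feature linear model updated by stochastic gradient descent: each instance from the stream triggers one small, cheap update. The `stream` generator, the learning rate, and the target relationship `y = 3x + 1` are all invented for the example.

```python
def sgd_step(w, b, x, y, lr=0.05):
    """One incremental update of the model y = w*x + b on a single instance."""
    error = (w * x + b) - y
    return w - lr * error * x, b - lr * error

def stream(n_instances=2000):
    """Hypothetical data stream whose instances follow y = 3x + 1 exactly."""
    xs = [0.0, 1.0, 2.0, 3.0]
    for i in range(n_instances):
        x = xs[i % len(xs)]
        yield x, 3.0 * x + 1.0

w, b = 0.0, 0.0
for x, y in stream():
    # The instance is used once and can then be discarded.
    w, b = sgd_step(w, b, x, y)

print(round(w, 3), round(b, 3))
```

The same update rule also shows the risk listed above: a run of corrupted instances would pull `w` and `b` away from good values just as readily.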


### **Instance-Based vs. Model-Based Learning**  
Another way to categorize machine learning algorithms is by how they generalize. There are two approaches:  

- **Instance-Based Learning**:  
  - The system learns the training examples by heart and generalizes by comparing new data points to them.  
  - A new data point is classified according to the training examples it is most similar to.  
  - Requires a measure of similarity.  

- **Model-Based Learning**:  
  - Builds a model from the training examples and uses that model to make predictions on new data points.  
  - **Example**: Linear Regression.
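
The two approaches can be contrasted on the same toy dataset. This is a sketch under simple assumptions: one feature, absolute difference as the similarity measure, a single nearest neighbor for the instance-based side, and an ordinary least-squares line for the model-based side.

```python
def nearest_neighbor_predict(train, x_new):
    """Instance-based: return the target of the most similar stored example."""
    _, closest_y = min(train, key=lambda pair: abs(pair[0] - x_new))
    return closest_y

def fit_line(train):
    """Model-based: least-squares fit of y = slope * x + intercept."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in train)
        / sum((x - mean_x) ** 2 for x, _ in train)
    )
    return slope, mean_y - slope * mean_x

train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
slope, intercept = fit_line(train)

x_new = 2.4
print(nearest_neighbor_predict(train, x_new))  # copies the target of x = 2.0
print(slope * x_new + intercept)               # evaluates the fitted line
```

The instance-based predictor needs the training examples at prediction time, while the model-based one only needs the two learned parameters.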

## **Main Challenges of Machine Learning**  
The two main things that can go wrong with a machine learning project are:  
1. Collecting bad data.  
2. Picking a bad learning algorithm.  


#### **Data Challenges**  
1. **Data Quantity**:  
   - Even for very simple problems, thousands of examples are typically required; complex problems such as image or speech recognition may need millions.  
   - A famous 2001 paper by Banko and Brill showed that very different algorithms performed almost equally well on a natural-language disambiguation task once given enough data, suggesting it can be worth investing in **corpus development** rather than only in algorithm development.  

2. **Non-Representative Data**:  
   - The training sample must be representative of the production data to generalize well.  
   - **Sampling Noise**: Occurs when the training set is too small, so it is non-representative purely by chance.  
   - **Sampling Bias**: Occurs even with large samples if the sampling method is flawed (e.g., non-response bias).  

3. **Poor Quality Data**:  
   - Outliers, errors, and noise in the data make it harder for the algorithm to detect patterns.  
   - **Data Cleaning Steps**:  
     - Outlier detection and cleaning (remove or replace outliers).  
     - Handling missing features (discard instances, fill with median/average, or predict missing values using an auxiliary model).  

4. **Irrelevant Features**:  
   - Datasets often contain irrelevant features, which hinder learning.  
   - **Feature Engineering** is critical:  
     - **Feature Selection**: Choosing the most useful features.  
     - **Feature Extraction**: Combining existing features to produce a more useful one (e.g., with dimensionality reduction).  
     - Creating new features by gathering new data.  
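
One of the data cleaning steps above, filling missing values with the median, can be sketched as follows. The `ages` list and the use of `None` to mark a missing value are assumptions made for this example.

```python
from statistics import median

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical feature column with two missing entries.
ages = [22, None, 35, 41, None, 29]
print(fill_missing_with_median(ages))
```

In practice the median would be computed on the training set only and then reused to fill gaps in the validation, test, and production data.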


#### **Overfitting**  
- **Definition**: The model performs well on training data but fails to generalize to new data.  
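- **Causes**: The model is too complex relative to the amount and noisiness of the training data (e.g., a deep neural network trained on a small or noisy dataset), so it ends up memorizing noise.  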
- **Solutions**:  
  - Use a simpler model with fewer parameters.  
  - Gather more training data.  
  - Reduce noise in the training data (fix errors, remove outliers).  
  - Apply **regularization** to constrain the model and prevent overfitting.  
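    - Example: A linear model (`f(x) = ax + b`) has two degrees of freedom, `a` and `b`. Constraining the range of values `a` may take leaves the model with somewhere between one and two degrees of freedom, which makes it simpler.  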
    - Regularization is controlled via **hyperparameters**.  
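
As one concrete form of regularization, the sketch below uses an L2 (ridge) penalty on the slope of a one-feature linear model. This is a swapped-in technique chosen for illustration, since the notes only describe limiting parameter ranges; the hyperparameter name `alpha` and the toy data are assumptions.

```python
def ridge_fit(data, alpha):
    """Fit y = slope * x + intercept, penalizing alpha * slope**2.

    alpha = 0 gives ordinary least squares; larger values shrink the slope
    toward zero. The intercept is not penalized.
    """
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    s_xx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = s_xy / (s_xx + alpha)
    return slope, mean_y - slope * mean_x

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
for alpha in (0.0, 5.0, 50.0):
    print(alpha, ridge_fit(data, alpha))
```

Raising `alpha` flattens the fitted line, trading some fit on the training data for a more constrained model.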


#### **Underfitting**  
- **Definition**: The model is too simple to capture the underlying structure of the training data.  
- **Solutions**:  
  - Use a more powerful model with more parameters.  
  - Improve feature engineering (feed better features to the algorithm).  
  - Reduce regularization constraints on the model.  

## **Testing & Validating**  
To evaluate a model, we split the data into two sets: **training** and **testing**. We care about the **generalization error** (also called the out-of-sample error), estimated on the test set, as it reflects the model's performance in a production environment.  
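
A minimal sketch of such a split, assuming the data fits in a list and that a simple random shuffle is appropriate. The function name, the `test_ratio` default, and the fixed seed are choices made for this example.

```python
import random

def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = train_test_split(range(100))
print(len(train_set), len(test_set))
```

Fixing the seed matters: if the split changed on every run, the model would eventually be exposed to the whole dataset across runs.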


#### **Key Concepts**  
1. **Overfitting Detection**:  
   - If the **training error** is low but the **testing error** is high, the model is overfitting.  
   - Common practice: Use **80%** of the data for training and **20%** for testing (adjust based on dataset size).  

2. **Validation Set**:  
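   - Fine-tuning hyperparameters (e.g., regularization) on the test set leads to overfitting to that test set, so the measured error becomes an over-optimistic estimate of the generalization error.  
   - A separate **validation set**, held out from the training set, is used for hyperparameter tuning instead.  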
   - After tuning, the model is trained on the full training set (including validation) and evaluated on the test set.  

3. **Cross-Validation**:  
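   - Uses many small validation sets instead of one large one: each candidate model is evaluated once per validation set after being trained on the rest of the data, and the scores are averaged.  
   - More robust than a single validation set, but training time is multiplied by the number of validation sets (e.g., **k** training runs for k-fold cross-validation).  
   - Independently of cross-validation, the validation and test sets should be as representative as possible of production data.  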

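4. **Train-Dev Set**:  
   - Used when the training data may not match the production data; it is a part of the training data that is held out from training.  
   - Helps diagnose whether poor validation performance is due to overfitting or to data mismatch.  
   - **Steps**:  
     - Train the model on the rest of the training set, then evaluate it on both the **train-dev** and **validation** sets.  
     - **Good on train-dev but bad on validation**: Data mismatch; the training data does not represent the validation (production-like) data.  
     - **Bad on train-dev**: The model has overfit the training set; simplify or regularize it, get more training data, or clean the training data.  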
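
The cross-validation idea from the list above can be sketched as an index splitter plus an averaging loop. The `fit_and_score` callback is hypothetical: it stands for whatever code trains a model on the training indices and returns a score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score over k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train_idx, val_idx in k_fold_indices(n_samples=10, k=3):
    print(train_idx, val_idx)
```

Every sample lands in exactly one validation fold, which is why the model has to be trained k times. The indices are taken in order here, so the data should be shuffled beforehand if its order carries meaning.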


#### **Model Assumptions**  
- A model is a **simplified version** of observations, designed to discard noise and capture generalizable patterns.  
- **Assumptions** guide what information to keep or discard.  
  - Example: A linear model assumes the relationship between input and output is linear, with deviations being noise. 

# Exercises

---

### 1) **How would you define Machine Learning?**

Machine Learning (ML) is a subfield of artificial intelligence that focuses on developing algorithms and techniques enabling computers to learn patterns from data without being explicitly programmed for specific tasks. Instead of following detailed instructions, ML systems learn from examples and past experiences to make decisions or predictions. ML involves creating models that identify patterns in data, generalize from them, and apply this knowledge to new data. These models are trained using historical datasets and can automate tasks, classify information, predict outcomes, or detect anomalies.

---

### 2) **Can you name four types of problems where Machine Learning shines?**

Machine Learning excels in solving problems such as:

- **Complex problems without algorithmic solutions**
- **Building systems that adapt to unstable environments**
- **Replacing long lists of hand-tuned rules**
- **Helping humans learn** (e.g., data mining to gain insights from large amounts of data)

---

### 3) **What is a labeled training set?**

A labeled training set is a collection of data used to train machine learning models, where each example (or data point) is paired with the correct output (or label). These labels serve as the "ground truth" that the model learns to predict during training.

---

### 4) **What are the two most common supervised tasks?**

The two most common supervised tasks are:

1. **Regression**
2. **Classification**

---

### 5) **Can you name four common unsupervised tasks?**

Four common unsupervised tasks are:

1. **Clustering**
2. **Visualization**
3. **Dimensionality Reduction**
4. **Anomaly Detection**

---

### 6) **What type of machine learning algorithm would you use to allow a robot to walk in various unknown terrains?**

**Reinforcement Learning** is the most suitable type of learning for this task, as it allows the robot to learn through trial and error by receiving rewards or penalties from its environment.

---

### 7) **What type of algorithm would you use to segment your customers into multiple groups?**

- **Clustering**: If you don’t know how to define groups, clustering helps separate clients into similar groups based on patterns or similarities.
- **Classification**: If you already know how to define the groups, classification assigns data points to predefined labels.

---

### 8) **Would you frame the problem of spam detection as a supervised learning or an unsupervised learning problem?**

**Supervised Learning**: The algorithm is trained on a dataset of emails alongside their labels (spam or not spam).

---

### 9) **What is an online learning system?**

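An online learning system learns incrementally, from instances fed to it one at a time or in mini-batches, so it can keep adapting to new data after being deployed in production. This is in contrast to a batch learning system, which must be retrained from scratch on the full dataset to take new data into account.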

---

### 10) **What is out-of-core learning?**

Out-of-core learning refers to algorithms that can handle large datasets that cannot fit into a computer's RAM. These algorithms divide the data into mini-batches and use online learning techniques to learn from them incrementally.

---

### 11) **What type of learning algorithm relies on a similarity measure to make predictions?**

**Instance-based models**, such as **K-Nearest Neighbors (KNN)**, rely on similarity measures to make predictions.

---

### 12) **What is the difference between a model's parameters and a learning algorithm's hyperparameters?**

- **Model Parameters**: These are the internal variables of the model that are learned during training (e.g., weights in a neural network).
- **Hyperparameters**: These are external configurations set before training (e.g., learning rate, number of layers in a neural network). They control the learning process and are tuned to optimize model performance.

---

### 13) **What do model-based algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?**

Model-based algorithms search for the optimal parameters of a model that best represent the underlying patterns in the training data. The most common strategy is **optimization**, where a cost function is minimized to reduce the difference between the model's predictions and the actual target values. Predictions are made by applying the learned parameters to new input data (e.g., using learned coefficients in linear regression or weights in neural networks).
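
A minimal sketch of this search, assuming a one-feature linear model, mean squared error as the cost function, and plain batch gradient descent as the optimizer. The toy data, learning rate, and iteration count are invented for the example.

```python
def mse(w, b, data):
    """Cost function: mean squared error of the model y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_descent(data, lr=0.05, n_iters=2000):
    """Search for the parameters (w, b) that minimize the MSE."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(n_iters):
        # Gradient of the MSE with respect to each parameter.
        grad_w = 2 / n * sum((w * x + b - y) * x for x, y in data)
        grad_b = 2 / n * sum((w * x + b - y) for x, y in data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # follows y = 2x + 1
w, b = gradient_descent(data)

print(mse(0.0, 0.0, data), mse(w, b, data))  # cost before and after training
print(w * 10.0 + b)                          # prediction for a new input x = 10
```

The last line is the prediction step: the learned parameters are simply applied to a new input.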

---

### 14) **Can you name four of the main challenges in Machine Learning?**

Four main challenges in Machine Learning are:

1. **Model Overfitting**
2. **Model Underfitting**
3. **Data Mismatch**
4. **Noisy Data**

---

### 15) **If your model performs great on the training data but fails on the test data, what is happening? Can you name three possible solutions?**

This is a case of **overfitting**, where the model memorizes noise or specific details in the training data instead of learning generalizable patterns. Three possible solutions are:

1. **Regularization**: Add penalties to the model to reduce complexity.
2. **Add More Data**: Increase the size of the training dataset.
3. **Simplify the Model**: Reduce the complexity of the model (e.g., fewer layers in a neural network).

---

### 16) **What is a test set, and why would you want to use it?**

A test set is a portion of the dataset kept separate from the training data. It is used to evaluate the performance of a machine learning model after training. The test set assesses the model's generalizability to unseen data, providing an estimate of how well the model will perform in real-world scenarios.

---

### 17) **What is the purpose of a validation set?**

A validation set is used to fine-tune the model's hyperparameters during training. It helps in selecting the best model configuration without touching the test set, ensuring an unbiased evaluation of the final model.

---

### 18) **What is the train-dev set? When do you use it, and how do you use it?**

The **train-dev set** is a subset of the training data that is held out from training. It is used when the training data may come from a different distribution than the validation/test data. The model is trained on the rest of the training set and evaluated on both the train-dev set and the validation set: if it performs well on the train-dev set but poorly on the validation set, the problem is a data distribution mismatch; if it already performs poorly on the train-dev set, it has overfit the training data.

---

### 19) **What can go wrong if you tune hyperparameters using the test set?**

Tuning hyperparameters using the test set can lead to **overfitting to the test set**. This means the model may perform well on the test set but fail to generalize to new, unseen data in production, as the test set is no longer a reliable indicator of real-world performance.