# Supervised Learning

- Task-driven learning.
- Making predictions using data.
- Consider the *email spam detection* problem -- predicting whether an email is spam or not.

![image.png](attachment:image.png)

- Given a dataset with *'right answers'*, an algorithm learns to produce predictions on never-before-seen data.

- Examples: spam detection, fraud detection, housing price prediction, sentiment prediction, etc.

## Terminology

- *Label*: The variable we’re predicting – usually represented by the variable **y**
- *Features*: Input variables describing the data – usually represented by variables **$\{x_1, x_2, \ldots, x_n\}$**
- *Example*: A particular instance of data, **x**
- *Labeled Example*: has **\{features, label\}**: **(x, y)** – used to train the model
    - Labeled examples form the *training dataset*.
- *Unlabeled Example*: has **\{features, ?\}**: **(x, ?)** – used for making predictions on new data
    - A collection of examples whose labels are withheld from the model forms the *test dataset*, which is used to test the performance of the trained model.
- *Model*: maps examples to predicted labels **$y'$**
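To make the terminology concrete, here is a minimal Python sketch for the spam example. The feature names (`num_links`, `has_word_free`), the toy data, and the hand-written rule inside `model` are all hypothetical and chosen only for illustration; a real model would be learned from the training dataset rather than written by hand.

```python
# Hypothetical toy data for the spam-detection example.
# Each labeled example is a (features, label) pair, i.e. (x, y).
training_dataset = [
    ({"num_links": 7, "has_word_free": 1}, "spam"),
    ({"num_links": 0, "has_word_free": 0}, "not spam"),
    ({"num_links": 5, "has_word_free": 1}, "spam"),
    ({"num_links": 1, "has_word_free": 0}, "not spam"),
]

# An unlabeled example has features only: (x, ?).
unlabeled_example = {"num_links": 6, "has_word_free": 1}


def model(x):
    """Stand-in for a trained model: maps an example x to a predicted label y'.

    The rule below is hand-written for illustration, not learned from data.
    """
    return "spam" if x["num_links"] >= 3 else "not spam"


x, y = training_dataset[0]         # features and label of one labeled example
y_pred = model(unlabeled_example)  # predicted label y' for the unlabeled example
print(x, y, y_pred)
```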


## How does supervised learning work?



1. **Training** the machine learning algorithm using **labeled data**.
    - The model learns the relationship between the attributes of the input data and the outcome.
    - The goal is to approximate a mapping function that can predict the output variable **(y)** for new input data **(x)**, i.e., **$y = f(x)$**

In other words, the algorithm learns by comparing its output with the correct outputs to find errors and then modifies the model accordingly.

![image.png](attachment:image.png)

2. **Prediction** on **new (future) data** whose labels are unknown, using the trained model to predict **future outputs**.
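The two steps can be sketched with a deliberately tiny model. This is only an illustration under stated assumptions: the data, the one-parameter model `f(x) = w * x`, the learning rate, and the update rule (a simple error-driven correction) are choices made for this example, not a prescribed algorithm.

```python
# Hypothetical labeled data that follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # the model is f(x) = w * x; start from an arbitrary weight
learning_rate = 0.01  # assumed step size

# 1. Training: compare each prediction with the correct output and adjust w.
for epoch in range(100):
    for x, y in zip(xs, ys):
        error = y - w * x               # difference between correct and predicted output
        w += learning_rate * error * x  # modify the model to reduce that error


# 2. Prediction: apply the trained model to an input it has not seen.
def predict(x_new):
    return w * x_new


print(round(w, 3), round(predict(5.0), 3))
```

After training, `w` should be very close to 2, so the prediction for the unseen input 5.0 should be very close to 10.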

## Types of Supervised Learning

1. Regression: Given a labeled dataset, regression predicts real-valued continuous output (e.g., housing prices, monthly income).
2. Classification: Given a labeled dataset, classification predicts categorical discrete-valued output (e.g., spam/not spam, male/female).

## Regression

- Consider the problem of *predicting housing prices*.
- Popular algorithms:
    - Linear Regression
    - Support Vector Machines
    - Random Forest
    - Neural Network
    - Decision Trees
- Features: Input variables that can be used to predict housing prices, such as size (ft$^2$), number of bedrooms, number of floors, and age of the house (years)
    - Let's consider one input variable (size in sq. ft) --> **Univariate/Simple Linear Regression**
- **Simple Linear Regression**: Finds a linear function (a straight line) that predicts the target variable (y) as a function of the independent variable (x).

![image.png](attachment:image.png)
(Image credit: Andrew Ng)

![image.png](attachment:image.png)
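A straight-line fit of this kind can be sketched in a few lines of Python. The sizes and prices below are made-up numbers, and the closed-form least-squares formulas are one common way to fit the line; this is a sketch under those assumptions, not necessarily the method used to produce the figures.

```python
# Hypothetical training data: house size (sq. ft) and price (in $1000s).
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200.0, 300.0, 400.0, 500.0]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares estimates for the line y = intercept + slope * x.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / sum(
    (x - mean_x) ** 2 for x in sizes
)
intercept = mean_y - slope * mean_x


def predict_price(size_sqft):
    """Predict price (in $1000s) from size using the fitted line."""
    return intercept + slope * size_sqft


print(predict_price(1800.0))
```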

- **Multiple Linear Regression (MLR)**: With multiple input variables, MLR fits a linear function that predicts the target variable as a function of all the independent variables.
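As a rough illustration with two input variables, the sketch below fits a bias and one weight per variable by gradient descent on the mean squared error. The data are synthetic (generated from `1 + 2*x1 + 3*x2`), and the learning rate and iteration count are assumptions chosen so that this small example converges.

```python
# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (arbitrary units).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]

bias = 0.0
weights = [0.0, 0.0]  # one weight per input variable
learning_rate = 0.05  # assumed step size
n = len(X)


def predict(features):
    """Linear function of all input variables."""
    return bias + sum(w * x for w, x in zip(weights, features))


for _ in range(5000):
    errors = [predict(xi) - yi for xi, yi in zip(X, y)]
    # Gradient of the mean squared error with respect to each parameter.
    grad_bias = 2.0 / n * sum(errors)
    grad_weights = [
        2.0 / n * sum(e * xi[j] for e, xi in zip(errors, X))
        for j in range(len(weights))
    ]
    bias -= learning_rate * grad_bias
    weights = [w - learning_rate * g for w, g in zip(weights, grad_weights)]

print(round(bias, 3), [round(w, 3) for w in weights])
```

The fitted parameters should end up close to the values used to generate the data (bias 1, weights 2 and 3).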

## Classification

- *Binary Classification*: Spam/not spam, cancer/no cancer
    - Using one input variable

    ![image.png](attachment:image.png)

    - Using more than one input variable

    ![image.png](attachment:image.png)

- *Multi-class Classification*: Handwritten digit recognition (0 to 9), cancer stage (0, 1, 2, 3)

(Image credit: Andrew Ng)
  
- Popular algorithms:
    - Logistic Regression
    - Decision Trees
    - Naive Bayes
    - Support Vector Machine
    - K-Nearest Neighbors
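As one small example from this list, a k-nearest-neighbors classifier can be sketched in pure Python. The two features (number of links, number of exclamation marks), the toy emails, and the choice of `k=3` are all hypothetical, picked only to show binary classification with more than one input variable.

```python
import math
from collections import Counter

# Hypothetical labeled emails: (number of links, number of exclamation marks).
training_data = [
    ((8, 5), "spam"),
    ((7, 6), "spam"),
    ((9, 4), "spam"),
    ((1, 0), "not spam"),
    ((0, 1), "not spam"),
    ((2, 1), "not spam"),
]


def knn_predict(query, data, k=3):
    """Return the majority label among the k training examples nearest to query."""
    nearest = sorted(data, key=lambda example: math.dist(query, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_predict((6, 5), training_data))
print(knn_predict((1, 1), training_data))
```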

## Decision Trees for Classification

- A decision tree uses a tree-like model of decisions.
- Consider the example of the Titanic dataset, used to predict whether a passenger will survive or not (y).
- Features $(x_1, x_2, \ldots, x_n)$: gender, age, and number of spouses or children aboard

![image.png](attachment:image.png)

- **Condition/internal node**: a test on a feature, based on which the tree splits into branches/edges.
- **Decision/leaf**: the end of a branch that doesn’t split anymore; in this case, whether the passenger died or survived, shown in red and green text respectively.
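The node and leaf structure can be sketched as a nested dictionary that is walked from the root to a leaf. The conditions, thresholds, and the `family_aboard` field name below are hypothetical and purely illustrative; they are not learned from the Titanic data and may differ from the tree in the figure.

```python
# Hypothetical tree: conditions and thresholds are illustrative only,
# not learned from the actual Titanic data.
tree = {
    "condition": lambda p: p["gender"] == "male",  # internal node
    "no": "survived",                              # leaf
    "yes": {
        "condition": lambda p: p["age"] > 9.5,     # internal node
        "yes": "died",                             # leaf
        "no": {
            "condition": lambda p: p["family_aboard"] > 2,  # internal node
            "yes": "died",                                  # leaf
            "no": "survived",                               # leaf
        },
    },
}


def predict(node, passenger):
    """Follow branches from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        branch = "yes" if node["condition"](passenger) else "no"
        node = node[branch]
    return node


print(predict(tree, {"gender": "female", "age": 30, "family_aboard": 0}))
print(predict(tree, {"gender": "male", "age": 40, "family_aboard": 1}))
```

In practice the conditions are learned from labeled examples rather than written by hand; the sketch only shows how a prediction follows one path from the root to a leaf.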