# M3 - M4
Machine Learning & Linear Regression

 - Instructor: Sabya DG


## Agenda
1. Machine Learning
    - Supervised vs. Unsupervised Learning
2. Supervised Learning
    - `X` and `y`
    - Regression vs. Classification    
    - The golden rule: train/test split
3. Simple Linear Regression
4. Polynomial Regression

## Machine Learning


## Definition of ML / What is ML?

Machine learning is usually seen as a subset of AI. ML algorithms build a model from sample data (the training data) in order to make predictions without being explicitly programmed to do so.

A field of study that gives computers the ability to learn without being explicitly programmed.
– Arthur Samuel (1959)

# Types of Machine Learning: Supervised and Unsupervised

## Machine Learning: Supervised Learning
- In supervised learning, we have a set of observations (__*X*__) with an associated target (__*y*__)
- We wish to find a model function that relates __*X*__ to __*y*__
- Then use that model function to predict future observations



## Machine Learning: Unsupervised Learning
- We have __*X*__ (the data) but no __*y*__ (associated target)



# Types of Supervised Learning: Regression and Classification

## Classification vs. Regression

* Classification problems: predicting among two or more categories, also known as classes
    - Predict whether a patient has a liver disease or not
    - Predict the letter grade of a student (A, B, C, D or F)
* Regression problems: predicting a continuous value (in other words, a number)
    - Predict house prices
    - Predict someone's age from their photo

## The golden rule
- In supervised learning, once you've identified **X** and **y**:
- **You need to split your data into train and test**
- **You only work with the training data**

### Why?
- If you make decisions (which features to include, which to drop, etc.) while looking at all of the data, you are letting part of the test data influence your decision-making
- Your results will then not be truly representative of performance on "unseen data"

## The big picture
- We train using the **training data**
- We test what is learned by the model on the **test data**
- We have two scores: **training** vs. **test**

### Which matters more?
- A good **training score** on its own tells us little, because the **test score is what matters**
- Models that generalize well will have **similar training and test scores**

**We want to pick models that generalize well to unseen data**

## The fundamental tradeoff

| Model | Training Score relative to Test Score | Performance |
|:-|:-|:-|
| Too Complex|High training score compared to test score| Overfit |
|Too Simple|Low training score and low test score|Underfit|

- Models that are **highly complex** can learn very specific relationships in the training data and reach **extremely high training scores** (too good to be true); such models **can be overfit**
- On the other hand, models that have **low training scores** that are **very simple** may not have learned the necessary relationships in the training data needed to predict well on unseen data; they are **underfit**

![img](https://miro.medium.com/max/2250/1*_7OPgojau8hkiPUiHoGK_w.png)

## Minimizing approximation error ...
... means that our model generalizes well


$$E_{approx} = (E_{test} - E_{train})$$

- There is generally a "trade-off" between complexity and test error
- A more complex model will fit closer to the peculiarities of the training data
    - i.e., $E_{approx}\;$ tends to get bigger as our model becomes more complex
- This means it will likely not generalise well to new data!
- $E_{approx}\;$ tends to get smaller with more data
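The points above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data is a synthetic noisy sine curve, the error $E$ is taken to be the mean squared error, and model complexity is varied through the degree of a polynomial fitted with NumPy's `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): a noisy sine curve.
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=60)

# The points are already in random order, so hold out the last 20 as test data.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]


def mse(y_true, y_pred):
    """Mean squared error, used here as the error measure E."""
    return float(np.mean((y_true - y_pred) ** 2))


results = {}
for degree in (1, 3, 9):
    # Least-squares polynomial fit on the training data only.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    e_train = mse(y_train, np.polyval(coeffs, x_train))
    e_test = mse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (e_train, e_test, e_test - e_train)
    print(f"degree={degree}: E_train={e_train:.3f}, "
          f"E_test={e_test:.3f}, E_approx={e_test - e_train:.3f}")
```

The training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. Whether the gap $E_{approx}$ grows on a given run depends on the data and the random seed.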

## The fundamental tradeoff (part 2)
... In the "bias-variance" language

- **The bias error** is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
- **The variance** is an error from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data, rather than the intended outputs (overfitting).

## Linear Regression

- Linear regression is one of the most basic and popular ML/statistical techniques.
- Used as a predictive model
- Assumes a linear relationship between the dependent variable (which is the variable we are trying to predict/estimate, **y**) and the independent variable/s (input variable/s used in the prediction, **X**)

### Let's start with **simple** linear regression
- Only one independent/input variable is used to predict the dependent variable.

## Simple Linear Regression

$$\hat{y} = wx + b$$

$\hat{y}$ = Predicted value of the dependent variable

$b$ = Constant (intercept)

$w$ = Coefficient (slope)

$x$ = Independent variable

## Multiple Linear Regression
- Many $x$'s and $w$'s

$$\hat{y} = w_1x_1 + w_2x_2 + ... + b$$

- The larger the magnitude of $w_i$, the more influence $x_i$ has on the prediction $\hat{y}$ (assuming the features are on comparable scales)

## Matrix representation

- $\hat{y}$ is a linear function of the features $x$ and the weights $w$.

$$\hat{y} = w^Tx + b$$
        
- $\hat{y} \rightarrow$ prediction
- $w \rightarrow$ weight vector
- $b \rightarrow$ bias
- $x \rightarrow$ features

$$\hat{y} = \begin{bmatrix}w_1 & w_2 & \cdots & w_d\end{bmatrix}\begin{bmatrix}x_1 \\ x_2 \\ \vdots \\ x_d\end{bmatrix} + b$$

## Matrix representation for multiple predictions

$$\hat{y} = w^TX + b$$
        

$$\hat{y} = \begin{bmatrix}w_1 & w_2 & \cdots & w_d\end{bmatrix}\begin{bmatrix}x^{(1)}_1 & x^{(2)}_1 & \ldots & x^{(n)}_1\\ x^{(1)}_2 & x^{(2)}_2 & \ldots & x^{(n)}_2 \\ \vdots & \vdots & \ddots & \vdots \\ x^{(1)}_d & x^{(2)}_d & \ldots & x^{(n)}_d\end{bmatrix} + b$$
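As a small sketch of the two matrix forms, the NumPy snippet below computes $w^Tx + b$ for one example and $w^TX + b$ for several. The numbers are made up for illustration, and `X` stores one example per column (shape `(d, n)`) to match the formula above.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])  # weight vector, d = 3 (illustrative values)
b = 4.0                         # bias

# One example: x is a vector of d features.
x = np.array([1.0, 2.0, 3.0])
y_hat_single = w @ x + b        # w^T x + b
print(y_hat_single)             # 8.5

# n = 2 examples stored as columns, so X has shape (d, n).
X = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [3.0, 1.0]])
y_hat_many = w @ X + b          # one prediction per column
print(y_hat_many)               # [8.5 5. ]

# scikit-learn stores examples as rows, shape (n, d); the same
# predictions are then computed as X_rows @ w + b.
X_rows = X.T
print(X_rows @ w + b)           # [8.5 5. ]
```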

## Let's try it!


Let's start simple and imagine we have a dataset of Height and Weight. Let Height be our feature and Weight our Target.

In [8]:
# Import necessary libraries


In [None]:
# Import the pandas library for data manipulation

# Read the CSV file into a DataFrame

# Display the first 5 rows of the DataFrame to inspect the data structure


In [None]:
# Retrieve and display the column names of the DataFrame


In [None]:
# Import the LinearRegression model from scikit-learn's linear_model module

# Prepare the feature (X) and target (y) variables for training

# 'Weight' is used as the target variable (output)

# Create an instance of the LinearRegression model



In [None]:
# Train the LinearRegression model using the training data


In [None]:
# Using the trained LinearRegression model to make predictions on the training data (height)


In [None]:
# Retrieve the coefficient (slope) of the trained LinearRegression model


In [None]:
# Retrieve the intercept of the trained LinearRegression model


In [None]:
# Plot a scatter plot of the training data (height vs. weight)

# Plot the linear regression line using the learned coefficient and intercept

# Display the plot



In [None]:
# Evaluate the performance of the LinearRegression model using the R-squared score


## Coefficients and Intercept

The intuition behind Linear Regression is in the coefficients and intercept.

Some people refer to the coefficients as weights and the intercept as the bias. The 'Weights' and 'Bias' are what is being learned during `fit`.
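A minimal sketch of what `fit` learns is shown below. The course CSV is not reproduced here, so the heights and weights are synthetic stand-ins generated with made-up parameters; the variable names are assumptions as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Made-up heights (cm) and weights (kg); these stand in for the course CSV.
height = rng.uniform(150, 190, size=100)
weight = 0.9 * height - 90 + rng.normal(scale=2.0, size=100)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features).
X = height.reshape(-1, 1)
y = weight

model = LinearRegression()
model.fit(X, y)  # the weight (coef_) and bias (intercept_) are learned here

print("weight (coef_):", model.coef_)
print("bias (intercept_):", model.intercept_)
print("R^2 on the training data:", model.score(X, y))
```

Because the synthetic data was generated with a slope of 0.9, the learned `coef_` should land close to that value.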

## Let's load the California Housing Data Set

[Documentation](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset)

In [None]:
# Import the fetch_california_housing function from scikit-learn's datasets module

# Fetch the California housing dataset and store it in the variable 'california'


## The golden rule
- **You need to split your data into train and test**


## So... how do we split?
- The most common way is to use `train_test_split` in `sklearn`
- By default, it shuffles the data first and then splits it
- 80/20, 75/25, 70/30 are common splits
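A minimal sketch of an 80/20 split with `train_test_split` follows. The toy arrays and the `random_state` value are assumptions chosen for illustration; they are not the course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 rows, 3 features.
X = np.arange(300).reshape(100, 3)
y = np.arange(100)

# test_size=0.2 gives an 80/20 split; random_state fixes the shuffle
# so that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
print(len(X_train) / len(X))        # 0.8
```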

## Splitting out our X and y
- In this case, we are working with a regression problem. Could you say why?
- What are the features?
- What is the target?

In [None]:
# Create a DataFrame for the features (input variables) from the California housing dataset

# Create a DataFrame for the target variable (output) from the California housing dataset

# Display the first 5 rows of the features DataFrame to inspect the data structure


In [None]:
# Display the first 5 rows of the target variable DataFrame to inspect the data structure


In [None]:
# Import the train_test_split function from scikit-learn's model_selection module

# Split the data into training and testing sets



In [None]:
# Print the total number of records in the combined training and testing data

In [None]:
 # Retrieve the total number of records (rows) in the features DataFrame (X)


In [None]:
# Retrieving the shape of the training features DataFrame (X_train)


In [None]:
# Retrieve the shape of the testing features DataFrame (X_test)


In [None]:
# Calculating the proportion of training samples to the total number of samples


In [None]:
# Generating descriptive statistics for the features DataFrame (X)


## Scaling the data

In [None]:
# Importing the StandardScaler class from scikit-learn's preprocessing module

# Creating an instance of the StandardScaler

# Fitting the scaler to the training data and transforming it


In [None]:
# Creating a DataFrame for the scaled training features

# Generating descriptive statistics for the scaled training features DataFrame
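
The scaling steps described in the cells above might look like the sketch below. The two features are synthetic stand-ins on very different scales, not the California housing columns. The key design choice is that the scaler is fitted on the training data only and then reused on the test data, in line with the golden rule.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-in features on very different scales (illustrative values).
X_train = rng.normal(loc=[3.0, 1500.0], scale=[1.0, 400.0], size=(200, 2))
X_test = rng.normal(loc=[3.0, 1500.0], scale=[1.0, 400.0], size=(50, 2))

scaler = StandardScaler()

# Learn each column's mean and standard deviation from the training data,
# then transform the training data with them.
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training mean and standard deviation; never fit on the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))  # each column is centred on 0
print(X_train_scaled.std(axis=0).round(3))   # each column has unit spread
```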


In [None]:
#X_train_scaled_df.head()


In [None]:
#sn.distplot(X_train['Population'], kde=True)


## Training the model

In [None]:
# Importing the LinearRegression model from scikit-learn's linear_model module

# Creating an instance of the LinearRegression model

# Training the LinearRegression model using the scaled training features (X_train_scaled) and the target variable (y_train)


In [None]:
# Retrieve and display the coefficients (slopes) of the trained LinearRegression model


In [None]:
# Retrieve and display the intercept of the trained LinearRegression model


In [None]:
# Create and display a DataFrame to organize the coefficients of the trained LinearRegression model


Let’s try to make some sense of it here!

We can use these coefficients to interpret our model. They show us how much each of these features affects our model’s prediction.

**IMPORTANT**
In linear models:

* if the coefficient is positive (+), then as the feature value goes UP the predicted value goes UP
* if the coefficient is negative (-), then as the feature value goes UP the predicted value goes DOWN
* if the coefficient is 0, the feature is not used in making a prediction
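
These three cases can be sketched on synthetic data. The feature names (`feat_up`, `feat_down`, `feat_unused`) and the generating weights are hypothetical, chosen so that the target rises with the first feature, falls with the second, and ignores the third.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Hypothetical features; the target ignores feat_unused entirely.
X = pd.DataFrame(
    rng.normal(size=(500, 3)),
    columns=["feat_up", "feat_down", "feat_unused"],
)
y = 2.0 * X["feat_up"] - 3.0 * X["feat_down"] + rng.normal(scale=0.1, size=500)

lr = LinearRegression().fit(X, y)

# Organize the learned coefficients next to their feature names.
coef_table = pd.DataFrame({"feature": X.columns, "coefficient": lr.coef_})
print(coef_table)
```

In practice a learned coefficient is rarely exactly 0. For a feature that does not affect the target, it will usually be close to 0, as with `feat_unused` here.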

## Feature Importances

In [None]:
# Import the statsmodels library for statistical modeling

# Create an Ordinary Least Squares (OLS) regression model

# Fit the OLS model to the training data


In [None]:
# Retrieve the estimated parameters (coefficients) of the fitted OLS regression model


In [None]:
# Generate a summary of the fitted OLS regression model


In [None]:
# Install the rfpimp package, which is used for Random Forest Feature Importance

In [None]:
# Import all necessary functions and classes from the rfpimp package


In [None]:
# Create a DataFrame for the scaled training features using the original feature names (from X)

# Calculate and plot the feature importances using the importances function from rfpimp


In [None]:
# Define a function to display feature importances for a given model and dataset


## Predicting

In [None]:
# Scaling the testing features


In [None]:
# Use the trained LinearRegression model to make predictions


In [None]:
# Retrieve the actual target values from the test set


## Prediction By Hand

In [None]:
# Retrieving the coefficients (slopes) of the trained LinearRegression model


In [None]:
# Retrieve the intercept (bias term) of the trained LinearRegression model


In [None]:
# Manually calculate and display the predicted value for the first test sample (X_test_scaled[0])
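
As a sketch of the manual calculation, the snippet below fits a model on synthetic data (an assumption; the notebook uses the scaled California housing features) and checks that multiplying each feature by its weight, summing, and adding the bias gives the same number as `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)

# Synthetic training data with 4 features (illustrative values).
X_train = rng.normal(size=(100, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y_train = X_train @ true_w + 10.0 + rng.normal(scale=0.2, size=100)

# A few new samples to predict on.
X_new = rng.normal(size=(5, 4))

lr = LinearRegression().fit(X_train, y_train)

# "By hand": multiply each feature by its weight, sum, and add the bias.
by_hand = np.dot(lr.coef_, X_new[0]) + lr.intercept_

# The same prediction from the model (predict expects a 2-D array).
by_model = lr.predict(X_new[:1])[0]

print(by_hand, by_model)
```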

## Results interpretation

- Weights (coef_)
- Bias (intercept_)

- **R-squared** measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model
- **Adjusted R-squared** adjusts the statistic based on the number of independent variables in the model

**What does that mean?**

* $R^2$ is a measure of fit.  
  
* It indicates how much variation of a dependent variable is explained by the independent variables.
  
* An R-squared of 100% means that $y$ is completely explained by the independent variables.


$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}}$

$R^2 = 1 - \frac{RSS}{TSS}$

$R^2$ = coefficient of determination

$RSS$ = sum of squares of residuals

$$RSS = \sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}$$

$TSS$ = total sum of squares

$$TSS = \sum_{i=1}^{n}(y_{i}-\bar{y})^{2}$$

$\hat{y}_i$ = predicted value for observation $i$

$\bar{y}$ = mean of the observed values

Thus,

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}
$$
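
The formula can be checked numerically with a short sketch. The four observed and predicted values are made up for illustration, and the number of features `p` used for the adjusted $R^2$ is an assumed value. The adjusted statistic uses the standard formula $1 - (1 - R^2)\frac{n-1}{n-p-1}$.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 0.5
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 20.0
r2 = 1 - rss / tss                           # about 0.975

# scikit-learn's r2_score computes the same quantity.
print(r2, r2_score(y_true, y_pred))

# Adjusted R^2 for n observations and p features (p = 1 is assumed here).
n, p = len(y_true), 1
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adjusted_r2)                           # about 0.9625
```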

In [None]:
# Evaluate the performance of the LinearRegression model on the scaled test data


0.6032930801870925

## Understanding $R^2$

In [None]:
# Generating a sequence of n points and target values with a linear relationship to x.

#plot the data points



In [None]:
# Define a function to create a plot that visualizes the data, the fitted model, and the residuals



In [None]:
#Apply linear regression


## Polynomial regression

### Non-linear regression motivation
- Linear regression might seem rather limited.
- What if the true relationship between the target and the features is non-linear?


**We still use the linear regression framework, but create quadratic, cubic etc. features**

## Let's see an example

In [None]:
# Initialize x and y

# Transform the data to include another axis


In [None]:
#plot scatter graph

## Fitting a linear regression line

In [None]:
#Apply linear Regression


## Using polynomial features

In [None]:
#import libraries


In [None]:
#Generating Polynomial Features and Creating a DataFrame with Target Variable


In [None]:
# Displaying the first 5 rows of the DataFrame containing polynomial features and target variable


## Fitting polynomial features


Polynomial regression in sklearn works by substitution. If you treat $x^2$ as a new variable, say `m`, the equation becomes:

`y=w*m + b`  

The relationship between `y` and `m` is linear, even though the relationship between `x` and `y` is not.

So, technically, this is still linear regression; the regression is simply between `m` (that is, $x^2$) and `y` rather than between `x` and `y`.
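
A minimal sketch of this idea is shown below, using synthetic quadratic data (an assumption; the notebook's own `x` and `y` are not shown). `PolynomialFeatures` creates the extra $x^2$ column, and an ordinary `LinearRegression` is then fitted on the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Synthetic data with a quadratic relationship (illustrative values).
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 + x + 2 + rng.normal(scale=0.3, size=100)
X = x.reshape(-1, 1)  # add a second axis: shape (100, 1)

# A straight line fitted to the raw feature.
linear = LinearRegression().fit(X, y)

# Expand x into the columns [x, x^2]; include_bias=False because
# LinearRegression already learns the intercept itself.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Still linear regression, just on the expanded features.
quadratic = LinearRegression().fit(X_poly, y)

print("linear R^2:   ", linear.score(X, y))
print("quadratic R^2:", quadratic.score(X_poly, y))
print("coefficients:", quadratic.coef_, "intercept:", quadratic.intercept_)
```

With degree 2 the fitted equation has one weight for `x` and one for $x^2$, so it is slightly more general than the single-variable `y=w*m + b` form used in the explanation above.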

In [None]:
# Fitting a Linear Regression model using the polynomial features


In [None]:
# Displaying the coefficients of the polynomial regression model


In [None]:
# Displaying the intercept of the polynomial regression model
