# Lab 10 - Linear Regression

## Purpose

Application of linear regression as a simple approach for supervised learning. Understanding its use for predicting a quantitative response as well as its assumptions and limitations.

## Methodology  

Step-by-step guided implementation of a linear regression model. From the creation of the dataset to making predictions and assessing the quality of the model.

## Results

A working model fitted to real-world data that can be used to make predictions and to better understand the relationship between the response and the predictors.


## Steps

- [ ] Create a Dataset
- [ ] Exploratory Data Analysis
- [ ] Finding the "Best" Line
- [ ] Fitting a Linear Regression Model
- [ ] Using the Model for Predictions
- [ ] Assessing the Accuracy of the Model
- [ ] Multiple Linear Regression

---

## Setup

### Libraries we will use

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

Instructions for installing statsmodels if needed: https://www.statsmodels.org/dev/install.html

## First, some notes on Machine Learning

Where does Linear Regression fit in?

### Supervised vs Unsupervised learning

Most machine learning problems fall into one of two categories: supervised or unsupervised learning.

In the case of **supervised learning**, we wish to fit a model that relates the response to the predictors, with one of the following aims in mind:
* accurately predicting the response for future observations - **prediction**
* better understanding the relationship between the response and the predictors - **inference**.

On the other hand, **unsupervised learning** is applied to unlabeled datasets. We are not interested in prediction, because we do not have an associated response variable. Instead, we wish to answer questions such as:
* Is there an informative way to visualize the data?
* Can we discover subgroups among the variables or among the observations?

We can further distinguish machine learning algorithms by the output they produce:

### Regression vs Classification

In terms of output, two main types of machine learning models exist: those for regression and those for classification.

We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems.

Can you think of some examples of regression and classification problems?

### Linear Regression

Linear regression is one of the most widely used machine learning models for regression. It is often underrated because of its relative simplicity. More complex models may fit the data better, but at the cost of simplicity and interpretability.


## Create a dataset

Create two lists: one for the predictor and one for the response. Then create a dataframe with two columns, 'calçado' (shoe size, the predictor) and 'altura' (height in cm, the response).

In [None]:
# your code here

predictor = [44, 41, 36, 37, 37, 43, 43, 41, 42, 42, 43.5, 43, 41, 44.5, 43, 43, 41] # shoe size
response = [184, 168, 160, 163, 160, 173, 180, 170, 172, 182, 185, 175, 175, 183, 185, 184, 182] # height (cm)

students_df = pd.DataFrame({'calçado': predictor, 'altura': response}) # your code here
students_df

### Exploratory Data Analysis

The first step before fitting a linear regression model is exploratory data analysis and data visualization: is there a relationship that we can model?

hint: one option is the convenient plt.plot function. You can also add a title and label the axes accordingly.

In [None]:
students_df['calçado']

In [None]:
# your code here

plt.plot(students_df['calçado'], students_df['altura'], 'o')

plt.title("AI students")
plt.xlabel("Shoe size")
plt.ylabel("Height (cm)")

### How do we find the "Best" Line?

Linear Regression assumes there is approximately a linear relationship between X and Y. Mathematically, we can write this linear relationship as:

$$Y \approx \beta_{0} + \beta_{1}{X}$$

$\beta_{0}$ and $\beta_{1}$ are unknown and represent the intercept and slope of the fit line, respectively. They are known as the model coefficients or parameters.

We can try to eye-ball what the best-fit line might look like. However, to actually choose a line, we need to come up with some criteria for what “best” actually means. So, how can we estimate $\beta_{0}$ and $\beta_{1}$?

The most common choice for linear regression is ordinary least squares (OLS). OLS chooses the $\beta_{0}$ and $\beta_{1}$ that minimize the residual sum of squares (RSS), where $\hat{y}_{i} = \beta_{0} + \beta_{1} x_{i}$ is the value the line predicts for the i-th observation:

$$RSS = \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2$$
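To make the criterion concrete, here is a minimal sketch that computes the RSS for an eyeballed line and for the least-squares line. It assumes the shoe-size and height values from the dataset created earlier in this lab. The guessed coefficients (60.0 and 2.7) are hypothetical, and the closed-form expressions are the standard least-squares estimates for a single predictor, not something specific to this lab.

```python
import numpy as np

# Shoe sizes (x) and heights in cm (y), copied from the dataset created earlier
x = np.array([44, 41, 36, 37, 37, 43, 43, 41, 42, 42, 43.5, 43, 41, 44.5, 43, 43, 41], dtype=float)
y = np.array([184, 168, 160, 163, 160, 173, 180, 170, 172, 182, 185, 175, 175, 183, 185, 184, 182], dtype=float)

def rss(beta0, beta1):
    """Residual sum of squares for the line y_hat = beta0 + beta1 * x."""
    y_hat = beta0 + beta1 * x
    return float(np.sum((y - y_hat) ** 2))

# An eyeballed candidate line (hypothetical values, for illustration only)
guess_rss = rss(60.0, 2.7)

# Closed-form least-squares estimates for simple linear regression
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
ols_rss = rss(beta0_hat, beta1_hat)

print(f"guess: RSS = {guess_rss:.1f}")
print(f"OLS:   beta0 = {beta0_hat:.2f}, beta1 = {beta1_hat:.2f}, RSS = {ols_rss:.1f}")
```

No other straight line can achieve a smaller RSS on this data than the least-squares line, so any guess will score the same or worse.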

Interactive line-fitting demo: https://content.codecademy.com/programs/data-science-path/line-fitter/line-fitter.html

### Fitting a Linear Regression Model in Python

We will use the statsmodels library (imported above as sm), but several other Python libraries can also fit a linear regression.

hint 1: use 'altura ~ calçado' for the formula parameter. it means that we are predicting the response *altura* from the predictor *calçado*

hint 2: to print the model coefficients, look through the attributes and methods available on the fitted model object

In [None]:
model.fit

In [None]:
# Create the linear regression model 
model = sm.OLS.from_formula('altura ~ calçado', data = students_df) # (formula, data)

# Fit the model
results = model.fit() # Your code here

# Print the model coefficients
results.params

Now, plot the best fit line by completing the code below:

In [None]:
np.linspace(30, 50, 10)

In [None]:
predictions

In [None]:
# Get predictions from the linear model
sample_x = np.linspace(0, 50, 10)
predictions = results.predict({'calçado': sample_x}) # Your code here

# Plot the dataset
plt.scatter(students_df['calçado'], students_df['altura'])

# Plot the best fit line
plt.plot(sample_x, predictions, 'red')

plt.title("AI students")
plt.xlabel("Shoe size")
plt.ylabel("Height (cm)")

plt.xlim([30, 50])
plt.ylim([150, 200])

plt.show()

### Using the Model for Predictions

This is the power of machine learning! We can use the estimated model coefficients to make predictions on new data.

In [None]:
results.params

In [None]:
# Your code here

calcado = 39 # Your code here
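# Use .iloc for positional access: params is a pandas Series indexed by name
pred = results.params.iloc[0] + results.params.iloc[1] * calcado  # intercept + slope * shoe size

print('Estimated height for a student who wears shoe size {}: {} cm'.format(calcado, round(pred)))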
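One limitation worth keeping in mind when predicting: the fitted line is only supported by the shoe sizes that were actually observed. The sketch below is an illustration under assumptions, not part of the lab's statsmodels workflow. It refits the same line with np.polyfit so that it is self-contained, and the helper predict_height is a hypothetical name that also reports whether the input falls inside the observed range.

```python
import numpy as np

# Shoe sizes (x) and heights in cm (y), copied from the dataset created earlier
x = np.array([44, 41, 36, 37, 37, 43, 43, 41, 42, 42, 43.5, 43, 41, 44.5, 43, 43, 41], dtype=float)
y = np.array([184, 168, 160, 163, 160, 173, 180, 170, 172, 182, 185, 175, 175, 183, 185, 184, 182], dtype=float)

# Least-squares line; np.polyfit returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, deg=1)

def predict_height(shoe_size):
    """Return (predicted height in cm, True if shoe_size is within the observed range)."""
    inside = bool(x.min() <= shoe_size <= x.max())
    return intercept + slope * shoe_size, inside

for size in (39, 48):
    height, inside = predict_height(size)
    note = "" if inside else " (extrapolation: outside the observed shoe sizes)"
    print(f"size {size}: {height:.0f} cm{note}")
```

Predictions flagged as extrapolation rely entirely on the linearity assumption holding beyond the data, so they deserve extra caution.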

### Assessing the Accuracy of the Model

The quality of a linear regression fit is typically assessed with summary metrics. One of the most common is the $R^2$ statistic.

Also called the proportion of variance explained, $R^2$ takes a value between 0 and 1 for a least-squares fit that includes an intercept, and it is independent of the scale of the response. The formula to calculate $R^2$ is the following:

$$R^2 = \frac{TSS - RSS}{TSS}$$

where $TSS = \sum_{i=1}^n (y_{i} - \bar{y})^2$ is the total sum of squares and $\bar{y}$ is the mean of the response. Recall that RSS is the residual sum of squares defined above.

$TSS - RSS$ measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in Y that can be explained using X. An $R^2$ statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. A number near 0 indicates that the regression does not explain much of the variability in the response; this may occur because the linear model is wrong, which may lead to poor predictions.
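The formula can be checked by hand. The following sketch assumes the same shoe-size and height data as before and refits the line with np.polyfit so that it runs on its own. It computes RSS, TSS and $R^2$ directly, then compares the result with the squared correlation between the two variables, which coincides with $R^2$ in simple linear regression.

```python
import numpy as np

# Shoe sizes (x) and heights in cm (y), copied from the dataset created earlier
x = np.array([44, 41, 36, 37, 37, 43, 43, 41, 42, 42, 43.5, 43, 41, 44.5, 43, 43, 41], dtype=float)
y = np.array([184, 168, 160, 163, 160, 173, 180, 170, 172, 182, 185, 175, 175, 183, 185, 184, 182], dtype=float)

# Least-squares line; np.polyfit returns the slope first, then the intercept
beta1, beta0 = np.polyfit(x, y, deg=1)
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)     # variability left unexplained by the line
tss = np.sum((y - y.mean()) ** 2)  # total variability in the response
r_squared = (tss - rss) / tss

# With a single predictor, R^2 equals the squared correlation between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r_squared:.3f}, corr(x, y)^2 = {r ** 2:.3f}")
```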

Check which attributes and methods are available on the results object. Can you find the $R^2$ statistic of your model? Is it a good result?

In [None]:
# Your code here
results.rsquared

In [None]:
# Run for a complete overview of the fitted model
print(results.summary())

Another way to check the accuracy of the model is to plot the residuals. It may help identify outliers or other limitations of the trained model.

Complete the following code to further inspect the residuals.

hint: the method *predict* can be used to compute the fitted values

In [None]:
# Calculate fitted_values
fitted_values = results.predict(students_df) # your code here

# Calculate residuals (subtracting the fitted values from the actual values)
residuals = students_df['altura'] - fitted_values # your code here

# Plot a histogram of the residuals
plt.hist(residuals) 
plt.show()

# Plot the residuals against the fitted values
plt.scatter(fitted_values, residuals)
plt.show()
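The plots can be complemented with a few numeric checks. This is a sketch under assumptions: it refits the line with np.polyfit to stay self-contained, and the "more than two residual standard errors" threshold is a rough rule of thumb chosen for illustration, not a criterion prescribed by this lab.

```python
import numpy as np

# Shoe sizes (x) and heights in cm (y), copied from the dataset created earlier
x = np.array([44, 41, 36, 37, 37, 43, 43, 41, 42, 42, 43.5, 43, 41, 44.5, 43, 43, 41], dtype=float)
y = np.array([184, 168, 160, 163, 160, 173, 180, 170, 172, 182, 185, 175, 175, 183, 185, 184, 182], dtype=float)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# With an intercept in the model, least-squares residuals sum to (numerically) zero
print("sum of residuals:", round(float(residuals.sum()), 6))

# They are also uncorrelated with the fitted values, so any visible trend in the
# residual plot points to structure that a straight line does not capture
print("corr(fitted, residuals):", round(float(np.corrcoef(fitted, residuals)[0, 1]), 6))

# Residual standard error: n - 2 degrees of freedom (two estimated coefficients)
sigma = np.sqrt(np.sum(residuals ** 2) / (len(x) - 2))

# Flag observations whose residual exceeds two residual standard errors
flagged = np.flatnonzero(np.abs(residuals) > 2 * sigma)
print("residual standard error:", round(float(sigma), 2))
print("flagged observations (row indices):", flagged)
```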

### Multiple Linear Regression

Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor.

Gather additional data to perform the same prediction as before. Fit the model and make a new prediction. How do the results compare?

hint: use 'altura ~ calçado + genero' for the formula parameter. it means that we are predicting the response *altura* from the predictors *calçado* and *genero*

In [None]:
genero = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0] # {0: male, 1: female}

# Build the dataframe with the additional predictor
students_df2 = pd.DataFrame({'calçado': predictor, 'altura': response, 'genero': genero})

In [None]:
# Create the model here:
model = sm.OLS.from_formula('altura ~ calçado + genero', data = students_df2) # (formula, data)

# Fit the model here:
results = model.fit() # Your code here

# Print model information:
print(results.summary())

In [None]:
results.params

In [None]:
# make a new prediction

calcado = 43
genero = 0
pred = results.params.iloc[0] + results.params.iloc[1]*calcado + results.params.iloc[2]*genero # positional access via .iloc; Your code here

genero_dict = {1: 'female', 0: 'male'}

print('Estimated height for a {} student who wears shoe size {}: {} cm'.format(genero_dict[genero], calcado, round(pred)))
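As a cross-check that does not depend on statsmodels, the same two-predictor model can be fitted with plain NumPy. This is a minimal sketch, assuming the data lists defined earlier in this lab; the variable names (shoe, height, gender) are illustrative. It builds the design matrix explicitly and solves the least-squares problem with np.linalg.lstsq.

```python
import numpy as np

# Data copied from the lists defined earlier in this lab
shoe = np.array([44, 41, 36, 37, 37, 43, 43, 41, 42, 42, 43.5, 43, 41, 44.5, 43, 43, 41], dtype=float)
height = np.array([184, 168, 160, 163, 160, 173, 180, 170, 172, 182, 185, 175, 175, 183, 185, 184, 182], dtype=float)
gender = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], dtype=float)  # {0: male, 1: female}

# Design matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones_like(shoe), shoe, gender])

# Solve min ||X @ coef - height||^2; coef holds [intercept, shoe slope, gender effect]
coef, *_ = np.linalg.lstsq(X, height, rcond=None)
intercept, b_shoe, b_gender = coef
print(f"intercept = {intercept:.2f}, shoe = {b_shoe:.2f}, gender = {b_gender:.2f}")

# Predict for a new observation: shoe size 43, gender 0
new_obs = np.array([1.0, 43.0, 0.0])
print(f"predicted height: {new_obs @ coef:.0f} cm")
```

The coefficients should agree with the ones reported by statsmodels for the same formula, up to floating-point rounding.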


## Suggested next steps

* Apply linear regression without using a library
* Use another library to apply linear regression