<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/labs/08%20-%20Regression%201/Exercises/exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Supervised Learning, Linear Regression

This exercise applies what you learned in the walkthrough. The following cell gathers the modules you need for this exercise (take a look at the sklearn library).

Some exercises only ask you to fill in part of the code rather than write all of it. Replace each `"YOUR CODE HERE"` with your own code.

In [1]:
# Useful starting lines
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set_style("darkgrid")

import warnings
warnings.filterwarnings('ignore')

# Sklearn import
from sklearn.preprocessing import MinMaxScaler # Normalization
from sklearn.linear_model import LinearRegression # Regression linear model
from sklearn.model_selection import train_test_split # Splitting the data set
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error # Metrics for errors
from sklearn.model_selection import KFold # Cross validation

## 1. Load data

In this exercise, we use data on advertising expenses across various platforms and product sales. The goal is to understand how different advertising platforms impact sales.

Load the dataset from the given URL. Then display the first 5 rows.


In [2]:
url = 'https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/Advertising.csv'
# Load the data
Advertising = pd.read_csv(url)
# Display the first 5 rows
print(Advertising.head(5))

# Number of observations and columns
# YOUR CODE HERE

   id     TV  Radio  Newspaper  Sales
0   1  230.1   37.8       69.2   22.1
1   2   44.5   39.3       45.1   10.4
2   3   17.2   45.9       69.3    9.3
3   4  151.5   41.3       58.5   18.5
4   5  180.8   10.8       58.4   12.9


Next, we will try a simple linear regression using only one feature (univariate regression): we want to predict `Sales` using only the `TV` feature (the money spent on TV advertising).

To get a first sense of the relationship between different variables, display the correlation table.

In [3]:
# Display the correlation table
print(Advertising.corr())

                 id        TV     Radio  Newspaper     Sales
id         1.000000  0.017715 -0.110680  -0.154944 -0.051616
TV         0.017715  1.000000  0.054809   0.056648  0.782224
Radio     -0.110680  0.054809  1.000000   0.354104  0.576223
Newspaper -0.154944  0.056648  0.354104   1.000000  0.228299
Sales     -0.051616  0.782224  0.576223   0.228299  1.000000
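To read such a table faster, one option is to keep only the `Sales` column and sort it. The sketch below is a minimal illustration on a hand-built frame containing only the five rows printed earlier, so its numbers will differ from the full table; the name `sample` is an assumption, and in the notebook you would use `Advertising` instead.

```python
import pandas as pd

# Only the five rows displayed by head(5); the real dataset has more rows.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

corr_with_sales = (
    sample.drop(columns="id")        # the row id is not an advertising feature
          .corr()["Sales"]           # correlations of every column with Sales
          .drop("Sales")             # remove the trivial self-correlation
          .sort_values(ascending=False)
)
print(corr_with_sales)
```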


## 2. Using Sklearn

When using sklearn, we don't need to add a column of ones to the data to obtain the constant term (the intercept). Sklearn takes care of it for you: you just need to set the `fit_intercept` argument to `True` (which is also the default value for this argument).
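To make the role of `fit_intercept` concrete, the following sketch compares sklearn's fit with a manual least-squares fit on a matrix that has an explicit column of ones. The data are synthetic and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 3 + 2x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 + 2.0 * x[:, 0] + rng.normal(0, 0.1, size=50)

# sklearn estimates the intercept automatically
model = LinearRegression(fit_intercept=True).fit(x, y)

# Manual equivalent: prepend a column of ones, then solve least squares
x_with_ones = np.column_stack([np.ones(len(x)), x])
beta, *_ = np.linalg.lstsq(x_with_ones, y, rcond=None)

print("sklearn:", model.intercept_, model.coef_[0])
print("manual: ", beta[0], beta[1])
```

Both lines should print the same intercept and slope up to floating-point error.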

1. From the dataset, save the feature `TV` and the target `Sales` in two different variables, `X` and `y` respectively, as pandas DataFrames rather than Series (`data[['sth']]` instead of `data['sth']`).
2. Split the data into a train and a test set. The test set size should be 20% of the original data. Additionally, set the `random_state` to 0 and `shuffle` to `True`.
3. Create a linear regression model using the `LinearRegression` class from sklearn. Make sure it includes an intercept. Then, fit the model with the corresponding data.
4. Print the values of the intercept and coefficients.
5. Compute the R2, MAE, and MSE.
6. Plot the regression.
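For reference, the three metrics in step 5 can be written out by hand and checked against sklearn. This is a small sketch with made-up numbers, not values from the Advertising data.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred
mae_manual = np.mean(np.abs(residuals))           # mean absolute error
mse_manual = np.mean(residuals ** 2)              # mean squared error
ss_res = np.sum(residuals ** 2)                   # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot                   # coefficient of determination

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Note that a higher R2 is better, while lower MAE and MSE are better.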


In [4]:
# 1. Create X, y
X = Advertising[['TV']]
y = Advertising[['Sales']]

In [5]:
# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, shuffle=True)
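
To check what such a split produces, here is a minimal sketch on a toy 10-row frame (hypothetical data and variable names). With `test_size=0.2` it yields 8 training rows and 2 test rows, and fixing `random_state` makes the shuffle reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frames with 10 rows (hypothetical data)
toy_X = pd.DataFrame({"TV": np.arange(10, dtype=float)})
toy_y = pd.DataFrame({"Sales": np.arange(10, dtype=float) * 2})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, shuffle=True
)

print(X_tr.shape, X_te.shape)           # 8 training rows, 2 test rows
print(X_tr.index.equals(y_tr.index))    # features and target stay aligned
```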

For this exercise, we don't require you to normalize the data, but this is how it can be done:

```python
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = pd.DataFrame(scaler.transform(X_train))
X_test = pd.DataFrame(scaler.transform(X_test))
```
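
As a rough illustration of what `MinMaxScaler` computes, the sketch below reproduces its output by hand with `(x - min) / (max - min)`, using the minimum and maximum of the training data only. The numbers are made up; note that a test value outside the training range ends up outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0]])   # made-up training values
test = np.array([[15.0], [40.0]])            # 40 lies outside the training range

scaler = MinMaxScaler().fit(train)           # learns min and max from train only
scaled = scaler.transform(test)

# Same computation written out by hand
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(scaled.ravel())
print(manual.ravel())
```

Fitting the scaler on the training set alone avoids leaking information from the test set into the model.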

In [6]:
# 3. Create the linear regression model
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)


In [22]:
# 4. Model parameters
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)


In [20]:
# 5. Model performance
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print("R2:", r2)
print("MAE:", mae)
print("MSE:", mse)


In [None]:
# 6. Plot the regression
# YOUR CODE HERE
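
One possible way to draw such a plot is sketched below: a scatter of the observations with the fitted line on top. It runs on synthetic stand-in data (the values and the names `x_demo`, `y_demo`, `demo_model` are assumptions); in the notebook you would use your own `X_test`, `y_test`, and `model`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the TV/Sales data (hypothetical values)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 300, size=(40, 1))
y_demo = 7.0 + 0.05 * x_demo[:, 0] + rng.normal(0, 1.5, size=40)

demo_model = LinearRegression().fit(x_demo, y_demo)

# Evaluate the fitted line on a sorted grid so it draws as one clean segment
x_line = np.linspace(x_demo.min(), x_demo.max(), 100).reshape(-1, 1)
y_line = demo_model.predict(x_line)

fig, ax = plt.subplots()
ax.scatter(x_demo[:, 0], y_demo, alpha=0.6, label="observations")
ax.plot(x_line[:, 0], y_line, color="red", label="fitted line")
ax.set_xlabel("TV")
ax.set_ylabel("Sales")
ax.legend()
# In a notebook with %matplotlib inline the figure is shown when the cell ends
```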

Using this univariate model, you can simply switch the feature (`TV`, `Radio`, `Newspaper`) to see which one predicts the target variable (`Sales`) best.

**Hint:** Simply change the feature variable and re-run the cells above. Then compare the evaluation metrics (r2, MAE and MSE).
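
Instead of re-running the cells by hand, the comparison could also be done in a loop. The sketch below is one way to do it, run on synthetic stand-in data (the frame `demo` and its coefficients are assumptions, so the printed numbers say nothing about the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Synthetic stand-in data (hypothetical); use the Advertising frame in the notebook
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(0, 100, size=(200, 3)),
                    columns=["TV", "Radio", "Newspaper"])
demo["Sales"] = 5 + 0.1 * demo["TV"] + 0.03 * demo["Radio"] + rng.normal(0, 1, size=200)

rows = []
for feature in ["TV", "Radio", "Newspaper"]:
    X_f, y_f = demo[[feature]], demo["Sales"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_f, y_f, test_size=0.2, random_state=0, shuffle=True
    )
    pred = LinearRegression().fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "feature": feature,
        "R2": r2_score(y_te, pred),
        "MAE": mean_absolute_error(y_te, pred),
        "MSE": mean_squared_error(y_te, pred),
    })

# One row per feature, best R2 first
results = pd.DataFrame(rows).set_index("feature")
print(results.sort_values("R2", ascending=False))
```

Because the same `random_state` is used for every feature, each model is evaluated on the same test rows, which keeps the comparison fair.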

<h2> IMPORTANT: This exercise answers the Moodle quiz question 1. </h2>

In [None]:
# YOUR CODE HERE

## 3. Using more features for prediction

Let's try to use all features to predict the sales.

1. From the dataset, save all the features to `X` and the target `Sales` to `y` as pandas DataFrames.
2. Split the data into a train and a test set. The test set size should be 20% of the original data. Additionally, set the `random_state` to 0 and `shuffle` to `True`.
3. Create a linear regression model using the `LinearRegression` class from sklearn. Make sure it includes an intercept. Then, fit the model with the corresponding data.
4. Print the values of the intercept and coefficients.
5. Compute the R2, MAE, and MSE.
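
When the target is passed as a DataFrame, as in this notebook, sklearn returns `coef_` with shape `(1, n_features)` and `intercept_` with shape `(1,)`. The sketch below shows one way to label the coefficients by column name; it uses synthetic data and hypothetical names (`X_demo`, `y_demo`, `multi`), and it assumes the `id` column is left out of the features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical coefficients: 0.1 for TV, 0.03 for Radio)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.uniform(0, 100, size=(100, 3)),
                      columns=["TV", "Radio", "Newspaper"])
y_demo = pd.DataFrame({
    "Sales": 5 + 0.1 * X_demo["TV"] + 0.03 * X_demo["Radio"]
             + rng.normal(0, 1, size=100)
})

multi = LinearRegression(fit_intercept=True).fit(X_demo, y_demo)

# coef_ is 2-D here because y_demo is a DataFrame (one row per target column)
coefs = pd.Series(multi.coef_[0], index=X_demo.columns)
print("intercept:", multi.intercept_[0])
print(coefs)
```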

In [None]:
# 1. Use all features
# YOUR CODE HERE

In [None]:
# 2. Split the data
# YOUR CODE HERE

In [None]:
# 3. Create the linear regression model
# YOUR CODE HERE

In [None]:
# 4. Print the model parameters
# YOUR CODE HERE

In [None]:
# 5. Model performance
# YOUR CODE HERE

Does the model performance (evaluated using R2, MAE, and MSE) improve significantly when using all features? Does the model become "better"?

<h2> IMPORTANT: This exercise answers the Moodle quiz question 2. </h2>

In [None]:
# Compute how much each of the metrics has improved
# YOUR CODE HERE
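
One simple way to quantify the improvement is the relative change of each metric. The sketch below uses hypothetical metric values, not results from the Advertising data, and the helper `relative_change` is an illustrative name rather than a library function.

```python
# Hypothetical metric values, NOT results from the Advertising data
single = {"R2": 0.60, "MAE": 2.5, "MSE": 10.0}   # one-feature model
multi = {"R2": 0.90, "MAE": 1.5, "MSE": 4.0}     # all-features model


def relative_change(old, new):
    """Percentage change when going from `old` to `new`."""
    return (new - old) / abs(old) * 100


changes = {name: relative_change(single[name], multi[name]) for name in single}
for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Keep the direction in mind when reading the result: an increase is an improvement for R2, whereas a decrease is an improvement for MAE and MSE.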