# Linear regression

In [3]:
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import numpy as np 

df = pd.read_csv("../data/Advertising.csv", index_col=0)
df.head()

| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 7.8 KB


In [6]:
# 200 samples: each row is one sample, i.e. one point in feature space
df.shape

(200, 4)

In [8]:
X, y = df.drop("Sales", axis = "columns"), df["Sales"]

# X matrix - feature matrix
# each column is a feature
# TV -> x1, Radio -> x2, Newspaper -> x3
X

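| | TV | Radio | Newspaper |
|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 |
| 2 | 44.5 | 39.3 | 45.1 |
| 3 | 17.2 | 45.9 | 69.3 |
| 4 | 151.5 | 41.3 | 58.5 |
| 5 | 180.8 | 10.8 | 58.4 |
| ... | ... | ... | ... |
| 196 | 38.2 | 3.7 | 13.8 |
| 197 | 94.2 | 4.9 | 8.1 |
| 198 | 177.0 | 9.3 | 6.4 |
| 199 | 283.6 | 42.0 | 66.2 |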


In [9]:
# vector of labels - the variable that we want to predict, the answers
y

1      22.1
2      10.4
3       9.3
4      18.5
5      12.9
       ... 
196     7.6
197     9.7
198    12.8
199    25.5
200    13.4
Name: Sales, Length: 200, dtype: float64

## Multiple linear regression 

$y = w_0 + w_1x_1 + w_2x_2 + w_3x_3$

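- the goal is to estimate the weights $w_i$, $i\in\{0,1,2,3\}$, where $w_0$ is the intercept and $w_1, w_2, w_3$ are the coefficients for TV, Radio and Newspaper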
- we use scikit-learn to do this 
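
As a minimal sketch of what "estimating $w_i$" means, the cell below generates synthetic data from known weights and checks that scikit-learn's `LinearRegression` recovers them. The data, the values in `true_w0` and `true_w`, and all `*_demo` names are assumptions made for illustration; they are not taken from the Advertising dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic feature matrix (assumption): 200 samples, 3 features x1, x2, x3
X_demo = rng.uniform(0, 100, size=(200, 3))

# Hypothetical "true" parameters: intercept w0 and weights w1, w2, w3
true_w0 = 3.0
true_w = np.array([0.05, 0.2, 0.0])

# y = w0 + w1*x1 + w2*x2 + w3*x3 + noise
y_demo = true_w0 + X_demo @ true_w + rng.normal(0, 0.5, size=200)

model_demo = LinearRegression()
model_demo.fit(X_demo, y_demo)

print(f"{model_demo.intercept_ = }")  # estimate of w0
print(f"{model_demo.coef_ = }")       # estimates of w1, w2, w3
```

After fitting, `intercept_` holds the estimate of $w_0$ and `coef_` holds one estimate per feature column, in column order.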

## Scikit-learn steps

Steps: 
1. Train|test split (in some cases a train|validation|test split)
2. Scale the dataset 
    - many algorithms require scaling, some don't
    - which type of scaling to use?
    - fit the scaler on the training data only, then use it to transform both the training data and the test data, to avoid data leakage
3. Fit the algorithm to the training data
4. Transform the training data, transform the test data
5. Calculate evaluation metrics
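
One possible reading of these steps as code is sketched below. It assumes a `LinearRegression` model, MAE and RMSE as evaluation metrics, and a prediction step on the test data before the metrics are computed; none of these choices is fixed by the list above. The data is synthetic and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 200 samples, 3 features
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 300, size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.0, size=200)

# Split first, so the test data stays unseen by every later fit
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit the scaler on the training data only, then transform both sets
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Fit the model to the scaled training data
model_demo = LinearRegression().fit(X_tr_scaled, y_tr)

# Predict on the scaled test data and compare with the true labels
y_pred = model_demo.predict(X_te_scaled)
mae = mean_absolute_error(y_te, y_pred)
rmse = np.sqrt(mean_squared_error(y_te, y_pred))

print(f"{mae = :.3f}")
print(f"{rmse = :.3f}")
```

The scaler never sees the test data during `fit`, which is the point of the data-leakage remark in step 2.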

### 1. Train|test split

In [13]:
from sklearn.model_selection import train_test_split

# help(train_test_split)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# check that it adds up to 200
print(f"{X_train.shape = }")
print(f"{y_train.shape = }")
print(f"{X_test.shape = }")
print(f"{y_test.shape = }")

X_train.shape = (140, 3)
y_train.shape = (140,)
X_test.shape = (60, 3)
y_test.shape = (60,)
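
Two properties of `train_test_split` are worth checking alongside the shapes: the same `random_state` reproduces the same split, and rows of `X` and `y` stay paired. The small frame `demo` below is a hypothetical example, not the Advertising data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small hypothetical dataset: one feature x and a label y = 2x
demo = pd.DataFrame({"x": range(10), "y": [2 * v for v in range(10)]})
X_d, y_d = demo[["x"]], demo["y"]

# Each call returns [X_train, X_test, y_train, y_test]
split_a = train_test_split(X_d, y_d, test_size=0.3, random_state=42)
split_b = train_test_split(X_d, y_d, test_size=0.3, random_state=42)

# Same seed, same split
same_split = all(p.equals(q) for p, q in zip(split_a, split_b))

# Features and labels keep the same row index after shuffling
aligned = split_a[0].index.equals(split_a[2].index)

print(f"{same_split = }")
print(f"{aligned = }")
```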


In [17]:
X_test.head(10)

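| | TV | Radio | Newspaper |
|---|---|---|---|
| 96 | 163.3 | 31.6 | 52.9 |
| 16 | 195.4 | 47.7 | 52.9 |
| 31 | 292.9 | 28.3 | 43.2 |
| 159 | 11.7 | 36.9 | 45.2 |
| 129 | 220.3 | 49.0 | 3.2 |
| 116 | 75.1 | 35.0 | 52.7 |
| 70 | 216.8 | 43.9 | 27.2 |
| 171 | 50.0 | 11.6 | 18.4 |
| 175 | 222.4 | 3.4 | 13.1 |
| 46 | 175.1 | 22.5 | 31.5 |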


### 2. Feature scaling

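Many algorithms require the data to be scaled. Two common options:
- normalization (min-max), which maps each feature to the range $[0, 1]$
  - $X' = \frac{X-X_{min}}{X_{max}-X_{min}}$
- feature standardization, which gives each feature zero mean and unit variance (it does not make the data normally distributed)
  - $X' = \frac{X - \mu}{\sigma}$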
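
To connect the two formulas to scikit-learn, the sketch below applies them by hand with NumPy, column by column, and compares the results with `MinMaxScaler` and `StandardScaler`. The array `X_demo` is a made-up example. Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny hypothetical feature matrix: 3 samples, 2 features
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

# Min-max normalization, computed per column
X_min, X_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_minmax = (X_demo - X_min) / (X_max - X_min)

# Standardization, computed per column (population std, ddof=0)
mu, sigma = X_demo.mean(axis=0), X_demo.std(axis=0)
X_standard = (X_demo - mu) / sigma

# Both should agree with the scikit-learn scalers
print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_demo)))
print(np.allclose(X_standard, StandardScaler().fit_transform(X_demo)))
```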

In [18]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
scaler

In [23]:
scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)

print(f"{scaled_X_train.shape = }")
print(f"{scaled_X_train.min() = }")
print(f"{scaled_X_train.max() = }")


scaled_X_train.shape = (140, 3)
scaled_X_train.min() = 0.0
scaled_X_train.max() = 1.0


In [24]:
print(f"{scaled_X_test.shape = }")
print(f"{scaled_X_test.min() = }")
print(f"{scaled_X_test.max() = }")

scaled_X_test.shape = (60, 3)
scaled_X_test.min() = 0.005964214711729622
scaled_X_test.max() = 1.1302186878727631
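
The scaled test set does not span exactly $[0, 1]$ because the scaler learned its minimum and maximum from `X_train` only. A test value outside the training range therefore maps outside $[0, 1]$, which is expected and not an error. A minimal sketch with made-up one-column arrays (`train_demo` and `test_demo` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data: the training range is [0, 10]
train_demo = np.array([[0.0], [5.0], [10.0]])
test_demo = np.array([[2.0], [12.0]])  # 12 lies above the training maximum

scaler_demo = MinMaxScaler().fit(train_demo)
scaled_test_demo = scaler_demo.transform(test_demo)

# (2 - 0) / (10 - 0) = 0.2 and (12 - 0) / (10 - 0) = 1.2
print(scaled_test_demo.ravel())
```

Fitting the scaler on the test data as well would hide this effect, but it would also leak information about the test set into preprocessing.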
