Resource for learning how statistical learning works:

https://www.statlearning.com/resources-second-edition

# Linear Regression Code Alongs

- we have labels -> supervised learning
    - The model is trained on data that includes both inputs and correct outputs (labels), and learns to map inputs to the right answers.
    - Example: predicting someone's salary from their years of experience, using known salary data.

- try to predict a real number -> regression
    - The goal is to predict a continuous value, like price, temperature, or age.
    - Example: "What will the house cost based on its size and location?"

- predict discrete values -> classification
    - Here, the model predicts categories or classes, like "spam vs. not spam" or "dog vs. cat".
    - Example: "Is this email spam?"


In [84]:
import pandas as pd

df = pd.read_csv('../../data/Advertising.csv', index_col=0)

df.head()


Unnamed: 0,TV,radio,newspaper,sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


## EDA left for the reader ...(me)

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 1 to 200
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   radio      200 non-null    float64
 2   newspaper  200 non-null    float64
 3   sales      200 non-null    float64
dtypes: float64(4)
memory usage: 7.8 KB


In [86]:
df.describe()

Unnamed: 0,TV,radio,newspaper,sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


In [87]:
df.shape

(200, 4)

In [88]:
print(f"{df.shape[0]} samples")
print(f"{df.shape[1] - 1} features")
print("sales column is our label/target variable")

200 samples
3 features
sales column is our label/target variable


## Divide data into X and y

In [89]:


X = df.drop('sales',axis='columns')
X

Unnamed: 0,TV,radio,newspaper
1,230.1,37.8,69.2
2,44.5,39.3,45.1
3,17.2,45.9,69.3
4,151.5,41.3,58.5
5,180.8,10.8,58.4
...,...,...,...
196,38.2,3.7,13.8
197,94.2,4.9,8.1
198,177.0,9.3,6.4
199,283.6,42.0,66.2


In [90]:
# tuple unpacking 
# X = design matrix / feature matrix / feature / independent variable
# y = target variable / label / dependent variable


X, y = df.drop('sales',axis='columns'), df['sales']
X.head()

Unnamed: 0,TV,radio,newspaper
1,230.1,37.8,69.2
2,44.5,39.3,45.1
3,17.2,45.9,69.3
4,151.5,41.3,58.5
5,180.8,10.8,58.4


In [91]:
y.head()

1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: sales, dtype: float64

In [92]:
type(y), type(X)

(pandas.core.series.Series, pandas.core.frame.DataFrame)

## Scikit-learn steps

1. train | test split or train | val | test split
2. scale dataset
    - many algorithms require scaling, some don't
    - there are different types of scaling (feature standardization, min-max scaling)
    - scale both the training data and the test data with parameters computed from the training data only, to avoid data leakage
3. fit algorithm to training data
4. predict on test data
5. evaluate metrics


- 0. Notes on the steps above:
- 1. Split your dataset so that training and evaluation use separate data.
Evaluating on held-out data exposes overfitting and gives a fair performance estimate.


- 2. Scale feature values to improve the performance of algorithms that are sensitive to feature magnitude.
Use statistics from the training data to scale both the train and test sets → avoids data leakage.

- 3. Train the model by letting it learn patterns from the training set.

- 4. Use the trained model to make predictions on unseen data.

- 5. Measure how well the model performs using metrics such as MAE, MSE, or RMSE for regression (accuracy, precision, and recall apply to classification).
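
The five steps can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the Advertising dataset: the names `X_demo`, `y_demo`, `scaler_demo`, and `model_demo`, as well as the coefficients used to generate the target, are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 200 samples, 3 features,
# and a target that is a noisy linear function of the features.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1, size=200)

# 1. train | test split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# 2. scale: fit on the training data only, then transform both sets
scaler_demo = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# 3. fit the algorithm to the training data
model_demo = LinearRegression().fit(X_tr_scaled, y_tr)

# 4. predict on the test data
y_hat = model_demo.predict(X_te_scaled)

# 5. evaluate
mae_demo = mean_absolute_error(y_te, y_hat)
rmse_demo = np.sqrt(mean_squared_error(y_te, y_hat))
print(f"MAE: {mae_demo:.3f}, RMSE: {rmse_demo:.3f}")
```

The scaler only ever sees `X_tr` during fitting, which is the data-leakage point from step 2.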


# 1. train|test split

In [93]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

X_train.shape

(134, 3)

In [94]:
print(f"{X_train.shape = }")
print(f"{y_train.shape = }")
print(f"{X_test.shape = }")
print(f"{y_test.shape = }")

X_train.shape = (134, 3)
y_train.shape = (134,)
X_test.shape = (66, 3)
y_test.shape = (66,)
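
To see where 134 and 66 come from, here is a rough sketch of the idea behind a shuffled split: shuffle the row positions, then slice off the test share. `simple_train_test_split` is a hypothetical helper written for illustration, not scikit-learn's implementation, and it will not reproduce the same row order as `train_test_split`. Rounding the test size up is an assumption chosen because it matches the shapes above.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.33, seed=42):
    """Shuffle row positions, then slice off the test share (hypothetical helper)."""
    n_test = math.ceil(test_size * len(rows))  # assumption: round the test size up
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)  # seeded, so the split is reproducible
    test_pos, train_pos = positions[:n_test], positions[n_test:]
    return [rows[i] for i in train_pos], [rows[i] for i in test_pos]


rows = list(range(1, 201))  # stand-in for the 200 row labels 1..200
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))  # → 134 66
```

Every row ends up in exactly one of the two sets, so the test rows are never seen during training.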


In [95]:
X_train.head()

Unnamed: 0,TV,radio,newspaper
43,293.6,27.7,1.8
190,18.7,12.1,23.4
91,134.3,4.9,9.3
137,25.6,39.0,9.3
52,100.4,9.6,3.6


In [96]:
y_train.head()

43     20.7
190     6.7
91     11.2
137     9.5
52     10.7
Name: sales, dtype: float64

# 2. feature scaling

- min-max scaling
- each feature is rescaled as (x - min) / (max - min), so the values it was fitted on land in the range 0 to 1

In [97]:
from sklearn.preprocessing import MinMaxScaler

# instantiate a scaler object from the MinMaxScaler class
# by default it scales each feature to the range 0 to 1
# fit() learns each feature's min and max from the data it is given
# transform() applies the learned scaling to the data
scaler = MinMaxScaler()
type(scaler)

sklearn.preprocessing._data.MinMaxScaler

In [98]:
scaler

0,1,2
,feature_range,"(0, ...)"
,copy,True
,clip,False


In [99]:
scaler.fit(X_train)

scaled_X_train = scaler.transform(X_train)
scaled_X_test = scaler.transform(X_test)

print(f"{scaled_X_train.min() = }")
print(f"{scaled_X_train.max() = }")
print(f"{scaled_X_test.min() = }")
print(f"{scaled_X_test.max() = }")
# scaler.transform vs. scaler.fit? ask Copilot

scaled_X_train.min() = np.float64(0.0)
scaled_X_train.max() = np.float64(1.0)
scaled_X_test.min() = np.float64(0.005964214711729622)
scaled_X_test.max() = np.float64(1.1302186878727631)
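
The test maximum above 1 is expected: the scaler uses the min and max learned from the training data, so a test value larger than the training maximum maps above 1. A minimal pure-Python sketch of the same idea for a single feature follows. The helpers `fit_min_max` and `transform_min_max` and the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def fit_min_max(column):
    """Learn the scaling parameters (min and max) from the training values only."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) using previously learned parameters."""
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical single-feature values, not taken from the dataset.
train_col = [10.0, 20.0, 40.0, 50.0]
test_col = [5.0, 30.0, 60.0]

lo, hi = fit_min_max(train_col)  # parameters come from the training data only
scaled_train = transform_min_max(train_col, lo, hi)
scaled_test = transform_min_max(test_col, lo, hi)

print(scaled_train)  # → [0.0, 0.25, 0.75, 1.0]
print(scaled_test)   # → [-0.125, 0.5, 1.25]
```

The training values span exactly 0 to 1, while the test values 5.0 and 60.0 lie outside the training range and therefore fall below 0 and above 1.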


In [100]:
scaled_X_train[:5]

array([[0.99053094, 0.55846774, 0.01491054],
       [0.06087251, 0.24395161, 0.22962227],
       [0.45180927, 0.09879032, 0.08946322],
       [0.08420697, 0.78629032, 0.08946322],
       [0.33716605, 0.19354839, 0.03280318]])

In [101]:
type(scaled_X_train)

numpy.ndarray

In [102]:
type(scaled_X_test)

numpy.ndarray

# 3. Linear regression

In [103]:
from sklearn.linear_model import LinearRegression

# instantiate an instance of the LinearRegression class
model = LinearRegression()
model

0,1,2
,fit_intercept,True
,copy_X,True
,tol,1e-06
,n_jobs,
,positive,False


In [104]:
model.fit(scaled_X_train, y_train)
print(f"Parameters: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Parameters: [13.20747617  9.75285112  0.61108329]
Intercept: 2.7911595196243653


# 4. Prediction

In [105]:
sample_features = scaled_X_test[0].reshape(1, -1)  # Reshape to 2D array for prediction
sample_features

array([[0.54988164, 0.63709677, 0.52286282]])

In [106]:
model.predict(sample_features)

array([16.58673085])

In [107]:
y_test.iloc[0]

np.float64(16.9)
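
As a sanity check, the prediction for this sample can be reproduced by hand: a linear regression prediction is the intercept plus the sum of each coefficient times its feature value. The numbers below are copied from the outputs above as printed (so they are rounded), and the variable names are made up for this sketch.

```python
# Values copied from the notebook outputs above (rounded as printed).
coef = [13.20747617, 9.75285112, 0.61108329]
intercept = 2.7911595196243653
sample = [0.54988164, 0.63709677, 0.52286282]

# Prediction = intercept + sum of coefficient * feature value.
manual_pred = intercept + sum(w * x for w, x in zip(coef, sample))
print(round(manual_pred, 4))  # → 16.5867
```

This agrees with the value returned by `model.predict(sample_features)` up to the rounding of the printed inputs.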

### predict on whole test data

In [108]:
y_pred = model.predict(scaled_X_test)
y_pred[:5]

array([16.58673085, 21.18622524, 21.66752973, 10.81086512, 22.25210881])

In [109]:
y_test.iloc[:5]

96     16.9
16     22.4
31     21.4
159     7.3
129    24.7
Name: sales, dtype: float64

# 5. evaluate

Common metrics for regression:
- mae - mean absolute error
- mse - mean squared error
- rmse - root mean squared error

In [110]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"{mae =}")
print(f"{mse =}")
print(f"{rmse =}")

mae =1.4937750024728977
mse =3.72792833068152
rmse =np.float64(1.9307843822347228)
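
To make the three metrics concrete, here is a small sketch that computes them by hand with the standard library only. It uses just the first five true values and predictions shown earlier, so the results differ from the full test-set numbers above; the variable names are made up for this example.

```python
import math

# First five true values and predictions, copied from the outputs above.
y_true_5 = [16.9, 22.4, 21.4, 7.3, 24.7]
y_pred_5 = [16.58673085, 21.18622524, 21.66752973, 10.81086512, 22.25210881]

errors = [t - p for t, p in zip(y_true_5, y_pred_5)]

mae_5 = sum(abs(e) for e in errors) / len(errors)  # mean of absolute errors
mse_5 = sum(e**2 for e in errors) / len(errors)    # mean of squared errors
rmse_5 = math.sqrt(mse_5)                          # back in the units of sales

print(f"MAE: {mae_5:.3f}, MSE: {mse_5:.3f}, RMSE: {rmse_5:.3f}")
```

Because squaring weights large errors more heavily, RMSE is never smaller than MAE; the one large miss in this sample (7.3 vs. about 10.8) pulls RMSE noticeably above MAE.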
