# Polynomial Regression

Polynomial regression is a supervised machine learning algorithm used to make predictions on continuous values.

It is essentially an extension of linear regression, where we apply linear regression to the original features plus newly created polynomial features.

If there is a more complex (non-linear) relationship between the input and output, polynomial regression can be applied. This method adds higher-degree powers of each feature as new features, as well as their combinations.

For example, with degree $d=3$ and two features $a$ and $b$, the expanded feature set contains the original features $a$ and $b$ plus the new terms $a^2$, $ab$, $b^2$, $a^3$, $a^2b$, $ab^2$, and $b^3$. Fitting the model is then ordinary linear regression on this expanded set, as if each term were its own input column (for instance $c=a^2$, $d=b^2$, $e=a^2b$, $f=ab^2$, and so on).

The higher the degree $d$, the more likely the model is to overfit the training data, because every added term gives it more **degrees of freedom**.
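The effect of the degree can be illustrated on synthetic data. This is only a sketch: the sine-shaped target, the noise level, the sample size, and the degrees 1, 3, and 12 are all assumptions chosen for illustration. The training error cannot increase as the degree grows, while the test error typically starts to rise once the model begins fitting noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D data: a smooth non-linear signal plus noise (assumed setup).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

train_mse, test_mse = {}, {}
for degree in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse[degree]:.4f}  "
          f"test MSE={test_mse[degree]:.4f}")
```

Comparing the two columns of the printout for each degree is a simple way to spot the point where extra flexibility stops helping.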

In general, a regression model is evaluated with the following metrics:
- **MSE :**
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- **RMSE :**
$$
RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }
$$

- **MAE :**
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

- **MSPE :**
$$
MSPE = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{y_i} \right)^2
$$

- **MAPE :**
$$
MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$

- **R-squared :**
$$
R^2 = 1 - \frac{ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{n} (y_i - \bar{y})^2 }
$$

- **Adjusted R-squared :**
$$
R^2_{adj} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)
$$

p: Number of features.

n: Number of samples.
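The formulas above translate almost directly into NumPy. The helper `regression_metrics` below is a hypothetical name introduced for this sketch, not a scikit-learn function, and the four sample values are made up. It also returns MAPE, the absolute counterpart of MSPE, since that is the percentage metric computed in the evaluation cell later on.

```python
import numpy as np

def regression_metrics(y_true, y_pred, p):
    """Compute the metrics defined above; p is the number of features."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)
    r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        # Percentage errors divide by y_true, so they need non-zero targets.
        "MSPE": np.mean((residuals / y_true) ** 2),
        "MAPE": np.mean(np.abs(residuals / y_true)),
        "R2": r2,
        "R2_adj": 1 - (1 - r2) * (n - 1) / (n - p - 1),
    }

# Made-up targets and predictions, with a single feature (p=1).
metrics = regression_metrics([3.0, 5.0, 2.0, 7.0], [2.5, 5.0, 4.0, 8.0], p=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because $n - p - 1$ appears in a denominator, the adjusted $R^2$ is only defined when $n > p + 1$.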

# Step 1: Import libraries

In [42]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import r2_score

# Step 2: Get the data

In [45]:
df = pd.read_csv("data/housing.csv")

# Step 3: Dataset Overview

In [48]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [50]:
df.shape

(20640, 10)

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [54]:
df.describe(include='number')

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [56]:
df.describe(include='object')

| | ocean_proximity |
|---|---|
| count | 20640 |
| unique | 5 |
| top | <1H OCEAN |
| freq | 9136 |


# Step 4: Data transformation

In [59]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [61]:
df['ocean_proximity'].unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [63]:
encoder = OneHotEncoder(sparse_output=False)
encoded_ocean_proximity = encoder.fit_transform(df[['ocean_proximity']])
encoded_df = pd.DataFrame(encoded_ocean_proximity, columns=encoder.categories_[0])
df = df.drop(columns=['ocean_proximity'])
df = pd.concat([df, encoded_df], axis=1)

In [65]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | <1H OCEAN | INLAND | ISLAND | NEAR BAY | NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


# Step 5: Split data in train/test sets

In [68]:
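# Target and predictors
y = df['median_house_value']
X = df.drop(columns=['median_house_value'])
# Fill missing values (total_bedrooms) with the column mean.
# Note: the mean is computed on the full dataset, before the split.
X = X.fillna(X.mean())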

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
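The cell above fills missing values before splitting, so the column means also include rows that end up in the test set. A common alternative, sketched here on a made-up eight-row frame, is to split first and learn the means from the training rows only. `SimpleImputer` is swapped in for `fillna`; it is not what the notebook uses, and all values below are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up data with two missing total_bedrooms values.
X = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0, np.nan, 700.0, 900.0, 1100.0],
    "households": [90.0, 120.0, 280.0, 450.0, 60.0, 640.0, 800.0, 1000.0],
})
y = pd.Series([1.0, 1.2, 2.9, 4.4, 0.7, 6.5, 8.1, 9.8])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Learn the column means on the training rows only ...
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
# ... and reuse those same means to fill the test rows.
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)

print(imputer.statistics_)  # per-column means learned from the training set
```

With only 207 missing values out of 20640 rows in this dataset, the practical difference is likely small, but the split-first order keeps the test set untouched during preprocessing.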

In [70]:
degree = 3
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Retrieve the fitted LinearRegression model (the final step of the pipeline)
linear_model = model.named_steps['linearregression']

# Print the coefficients and the intercept
print("Coefficients :", linear_model.coef_)
print("Intercept :", linear_model.intercept_)


Coefficients : [-9.77899555e-01  1.00295533e+04 -2.09254082e+04  2.16356534e+05
  4.09918565e+04 -2.44891481e+05  3.62662230e+04 -4.09642464e+04
 -2.13728787e+04 -1.31803978e+04  2.25939229e+04 -3.91159822e+03
 -3.70556500e+03 -9.17780708e+03 -8.49262719e+03 -4.23286408e+04
  1.19096984e+04  9.17538508e+02 -5.54952997e+03  1.06133496e+03
 -1.38486005e+03  5.98488460e+04  8.84467329e+04 -1.97106217e+05
 -4.55166565e+00  1.37024855e+04  1.20537439e+05 -4.69346911e+04
  1.74025008e+04  4.74299750e+02 -2.95118979e+03  8.86166083e+02
 -1.90237069e+03  1.64017740e+05  3.04247652e+05 -7.76191260e+05
  1.11143716e-01 -1.21142577e+04  4.16882515e+05  5.51155890e+02
  4.06994219e+00  2.27675848e+02 -1.87690714e+01 -1.87555608e+02
 -1.02637338e+04  1.71072441e+05  1.71386645e+05 -1.00415247e-01
 -3.20966037e+05  1.77261080e+05 -7.40003893e-01  1.40879409e+00
 -8.10586127e-01  7.18232052e+00 -5.70077554e+01  8.82120905e+03
  8.34088762e+03 -4.79667351e-01  1.12639003e+04  8.32615727e+03
  1.179547

In [39]:
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'MSE : {mse}')
print(f'RMSE : {rmse}')
print(f'MAE : {mae}')
print(f'MAPE : {mape * 100:.2f}%')
print(f"R² : {r2}")

n = len(y_test)       # number of test samples
p = X_test.shape[1]   # number of input columns, before the polynomial expansion

r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("Adjusted R²:", r2_adj)

MSE : 22092748432.730465
RMSE : 148636.29581206088
MAE : 44534.16917012241
MAPE : 25.65%
R² : -0.6859429994434174
Adjusted R²: -0.6912704809681536
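The adjusted $R^2$ above uses `p = X_test.shape[1]`, that is, the 13 input columns, whereas the linear model is actually fitted on the much larger set of degree-3 polynomial terms. As a rough sketch, the snippet below counts those terms and recomputes the adjustment with that count, reusing the test $R^2$ printed above. Treating the number of expanded terms as $p$ is an interpretation made here, not something the notebook does, and the test size of 4128 assumes the 20% split of 20640 rows.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 13              # 8 numeric predictors + 5 one-hot columns
n_test = 4128              # 20% of the 20640 rows
r2 = -0.6859429994434174   # test R² printed above

# Fit on a dummy row just to count the generated degree-3 terms.
poly = PolynomialFeatures(degree=3).fit(np.zeros((1, n_inputs)))
n_terms = poly.n_output_features_   # includes the constant column
p_expanded = n_terms - 1            # predictors excluding the constant

r2_adj_expanded = 1 - (1 - r2) * (n_test - 1) / (n_test - p_expanded - 1)

print("Polynomial terms (with constant):", n_terms)
print("Adjusted R² with expanded p:", r2_adj_expanded)
```

Either way, a negative test $R^2$ means the model predicts worse than simply returning the mean of `y_test`, which is consistent with the overfitting risk of high-degree expansions described in the introduction.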
