## Summary

The goal of this project is to build and compare several linear regression models that predict the median housing value in California from socio-economic and geographic features.

## Import libraries

In [1]:
from pathlib import Path

import pandas as pd
from loguru import logger

## Settings

In [2]:
# Project root: the parent of the current working directory
HOME_DIR = Path.cwd().parent

# URL of the raw housing CSV file on GitHub
DATA_DOWNLOAD_ROOT = "https://raw.githubusercontent.com/MouslyDiaw/handson-machine-learning/refs/heads/master/tp2_regression/data/housing.csv"

REPORT_DIR = Path(HOME_DIR, "reports")
REPORT_DIR.mkdir(parents=True, exist_ok=True)

MODEL_DIR = Path(HOME_DIR, "models")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

logger.info(f"\nWork directory: {HOME_DIR} \nData root: {DATA_DOWNLOAD_ROOT}\nModels directory: {MODEL_DIR}")

2025-09-30 06:58:24.061 | INFO     | __main__:<module>:13 - 
Work directory: /Users/mouslydiaw/Documents/SenIA/handson-machine-learning/tp2_regression 
Data root: https://raw.githubusercontent.com/MouslyDiaw/handson-machine-learning/refs/heads/master/tp2_regression/data/housing.csv
Models directory: /Users/mouslydiaw/Documents/SenIA/handson-machine-learning/tp2_regression/models


## Data collection

In [3]:
# DATA_DOWNLOAD_ROOT already points at housing.csv; keep it as a plain string rather than a Path
DATA_DOWNLOAD_ROOT

'https://raw.githubusercontent.com/MouslyDiaw/handson-machine-learning/refs/heads/master/tp2_regression/data/housing.csv'
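
`pathlib.Path` is meant for filesystem paths: applied to a URL, it collapses the `//` after the scheme, so the result is no longer a valid address. A minimal sketch of one alternative, assuming the aim is to recover the file name and build a local cache location: parse the URL with the standard library and apply path operations only to its path component. The helper `filename_from_url` and the `data` folder name are hypothetical and not part of the notebook.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit

DATA_URL = (
    "https://raw.githubusercontent.com/MouslyDiaw/handson-machine-learning/"
    "refs/heads/master/tp2_regression/data/housing.csv"
)


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL (hypothetical helper)."""
    # urlsplit separates scheme, host and path, so only the path part is handled as a path
    return PurePosixPath(urlsplit(url).path).name


# Hypothetical local cache location; the "data" folder name is an assumption
local_copy = Path("data") / filename_from_url(DATA_URL)
print(filename_from_url(DATA_URL))
```

The URL itself stays an untouched string, which is what `pd.read_csv` expects when reading over HTTP.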

In [4]:
data_housing = pd.read_csv(DATA_DOWNLOAD_ROOT)
logger.info(f"Data shape: {data_housing.shape}")

2025-09-30 06:58:24.201 | INFO     | __main__:<module>:2 - Data shape: (20640, 10)
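
Beyond the shape, a quick per-column overview can catch type or missing-value surprises early. The following is a minimal sketch, assuming a hypothetical helper `quick_overview`; it is shown on two rows of the dataset (three columns only) so that it runs without a download, and it would be called as `quick_overview(data_housing)` in the notebook.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and distinct-value count per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            "n_unique": df.nunique(),
        }
    )


# Two rows of the dataset, standing in for data_housing
sample = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22],
        "median_income": [8.3252, 8.3014],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY"],
    }
)
overview = quick_overview(sample)
print(overview)
```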


In [11]:
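from sklearn.datasets import fetch_california_housing  # assumed source of `housing`, which is not defined earlier in this notebook
logger.info(f"Features/Target description: \n{fetch_california_housing().DESCR}")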

2025-09-30 06:42:57.591 | INFO     | __main__:<module>:1 - Features/Target description: 
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California distr
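
The description above refers to scikit-learn's version of the dataset, whose attributes are per-household averages, whereas the CSV loaded in this notebook stores block-group totals (`total_rooms`, `total_bedrooms`, `population`, `households`). A minimal sketch of how comparable features could be derived from the CSV columns follows. The function name and the new column names are hypothetical, and the mapping is an assumption based on the attribute descriptions, not a step taken in the notebook.

```python
import pandas as pd


def add_per_household_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with per-household averages derived from block-group totals."""
    out = df.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_household"] = out["total_bedrooms"] / out["households"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


# First two rows of the CSV, restricted to the columns needed here
sample = pd.DataFrame(
    {
        "total_rooms": [880.0, 7099.0],
        "total_bedrooms": [129.0, 1106.0],
        "population": [322.0, 2401.0],
        "households": [126.0, 1138.0],
    }
)
features = add_per_household_features(sample)
print(features.round(3))
```

Working on a copy leaves the original frame unchanged, so the raw totals remain available for other transformations.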

In [5]:
data_housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
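
`ocean_proximity` is a text column, so it has to be encoded numerically before a linear regression can use it. The sketch below shows one common option, one-hot encoding with `pd.get_dummies`, together with the split between predictors and the `median_house_value` target. It is an illustration under assumptions, not the notebook's method: the second row of the toy frame, including the `INLAND` label, is invented purely to exercise two categories.

```python
import pandas as pd

TARGET = "median_house_value"

# Toy frame: the first row comes from the dataset, the second row is invented for illustration
toy = pd.DataFrame(
    {
        "median_income": [8.3252, 3.0],
        "ocean_proximity": ["NEAR BAY", "INLAND"],
        "median_house_value": [452600.0, 100000.0],
    }
)

# Separate the target from the predictors
X = toy.drop(columns=[TARGET])
y = toy[TARGET]

# One-hot encode the text column; dtype=float keeps the design matrix purely numeric
X_encoded = pd.get_dummies(X, columns=["ocean_proximity"], dtype=float)
print(X_encoded.columns.tolist())
```

When the model also fits an intercept, the full set of dummy columns is linearly dependent, so passing `drop_first=True` to `pd.get_dummies` is a frequent choice for unregularized linear regression.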
