<div style="text-align: center;">
    <a href="https://www.hi-paris.fr/">
        <img border="0" src="https://www.hi-paris.fr/wp-content/uploads/2020/09/logo-hi-paris-retina.png" width="25%"></a>
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="70%"></a>
</div>

# DPE Label Prediction Challenge — Starting Kit

<i> Datacamp 2025 — Codabench Challenge </i>

## Introduction

### Context

In France, every building put up for sale or rent must undergo a **Diagnostic de Performance Énergétique (DPE)** — an energy performance assessment that assigns a label from **A** (most efficient) to **G** (least efficient). This label directly impacts property value and informs public policy on building renovation ([see official documentation](https://www.service-public.fr/particuliers/vosdroits/F16096)).

### The data

The dataset used in this challenge comes from officially published DPE diagnostics for **existing housing** in France (`dpe03existant_2025.csv`). The raw dataset contains a large number of columns, many of which have significant missing values.

#### Data selection process

The following preprocessing steps were applied to produce the challenge dataset:

1. **Column filtering** — All columns with **any missing values** were removed, ensuring a clean dataset with no imputation needed.
2. **Geographic filtering** — Overseas territories (departments 971, 972, 973, 974, 988) were excluded. Only **metropolitan France** is kept.
3. **Building type** — The data contains only **apartments** (`type_batiment = "appartement"`), as they represent a larger diversity of buildings and are present throughout the country.
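
The three steps above can be sketched in pandas as follows. This is only an illustration, not the organizers' actual script: the function name `select_challenge_rows` is hypothetical, and it assumes the raw file exposes the `code_departement_ban` and `type_batiment` columns mentioned in this notebook.

```python
import pandas as pd

# Department codes excluded by the geographic filter (taken from the list above).
OVERSEAS_DEPARTMENTS = {"971", "972", "973", "974", "988"}


def select_challenge_rows(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical re-implementation of the three selection steps."""
    # 1. Column filtering: drop every column with at least one missing value.
    df = raw.dropna(axis="columns", how="any")
    # 2. Geographic filtering: keep metropolitan France only.
    dept = df["code_departement_ban"].astype(str)
    df = df[~dept.isin(OVERSEAS_DEPARTMENTS)]
    # 3. Building type: keep apartments only.
    df = df[df["type_batiment"] == "appartement"]
    return df.reset_index(drop=True)
```

Casting the department code to a string keeps the comparison valid whether the column is stored as text or as integers.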

#### Features

After filtering, the dataset contains a mix of:

| Type | Examples |
|------|----------|
| **Categorical** | `type_installation_chauffage` (heating system type), `code_region_ban` (region code), `code_departement_ban` (department code) |
| **Numerical** | Energy consumption metrics, surface area, and other physical characteristics of the dwelling |

Key properties of the data:
- **Outliers**: The numerical features exhibit significant outliers (visible in boxplots), requiring careful preprocessing decisions (normalization, clipping, exclusion, etc.).
- **Correlations**: Some numerical features are correlated, opening the door to dimensionality reduction or feature selection strategies.
- **Class imbalance**: The DPE labels (A–G) are not uniformly distributed — labels D and E dominate, while A and B are rare.
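
Two of these properties can be handled with a few lines of pandas. The sketch below shows one possible treatment of outliers (quantile clipping) and a quick way to quantify class imbalance. The helper names and the 1% / 99% quantile bounds are illustrative assumptions, not choices made by the organizers.

```python
import numpy as np
import pandas as pd


def clip_outliers(df: pd.DataFrame, lower: float = 0.01, upper: float = 0.99) -> pd.DataFrame:
    """Clip each numerical column to its [lower, upper] quantile range."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    lo = out[num_cols].quantile(lower)
    hi = out[num_cols].quantile(upper)
    # axis=1 aligns the per-column bounds with the DataFrame columns.
    out[num_cols] = out[num_cols].clip(lower=lo, upper=hi, axis=1)
    return out


def label_shares(y) -> pd.Series:
    """Relative frequency of each label, sorted alphabetically (A to G)."""
    return pd.Series(np.ravel(y)).value_counts(normalize=True).sort_index()
```

In a modelling pipeline the clipping bounds should be computed on the training set only and then reused on the test set.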

#### Train / test split

The data is split into **train (70%)** and **test (30%)** using a **stratified split** on both `etiquette_dpe` and `code_region_ban`, ensuring that the geographic and label distributions are preserved across splits. A further split separates the test set into a **public test** and a **private test** set.
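
A split stratified on two columns can be obtained by combining them into a single key. The following is a minimal sketch of that idea with scikit-learn, not the exact code used to build the challenge data; the function name `stratified_split` and the random seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split(df: pd.DataFrame, test_size: float = 0.3, seed: int = 0):
    """Split df while preserving the joint (label, region) distribution."""
    # Combine both columns into a single stratification key.
    strata = df["etiquette_dpe"].astype(str) + "_" + df["code_region_ban"].astype(str)
    return train_test_split(df, test_size=test_size, stratify=strata, random_state=seed)
```

Note that `train_test_split` raises an error if a (label, region) combination occurs only once, so very rare combinations would need to be merged or handled separately.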

### The task

**Given the characteristics of a dwelling, predict its DPE energy label (A–G).**

This is a **multiclass classification** problem with 7 classes. Your model will be evaluated on its ability to correctly assign the right energy label to unseen dwellings.

### Why does it matter?

Accurate DPE prediction helps:
- Identify buildings most in need of energy renovation
- Support public policy on climate and housing
- Enable real-estate professionals to estimate energy performance before a full diagnostic

# Exploratory data analysis

This section shows what the data contains and how to work with it.
Exploring the data is the first step in any data science project.

We first load the training data; the properties described in the introduction (outliers, correlations, class imbalance) can then be checked directly on `X_df` and `y`.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

# Load the data
from ingestion_program.ingestion import get_train_data
X_df, y = get_train_data("dev_phase/input_data")
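
Once the data is loaded, the correlation between numerical features mentioned in the introduction can be inspected with a small helper. This is a sketch: `correlated_pairs` and the 0.9 threshold are hypothetical choices, not part of the starting kit.

```python
import numpy as np
import pandas as pd


def correlated_pairs(X: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Pairs of numerical columns whose absolute Pearson correlation exceeds threshold."""
    corr = X.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so that each pair appears only once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)
```

Calling `correlated_pairs(X_df)` returns a Series indexed by column pairs, which can guide feature selection or dimensionality reduction.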

# Challenge evaluation

Submissions are scored with classification **accuracy**: the fraction of dwellings in the test set whose predicted DPE label matches the true label. The score is computed by `compute_accuracy` from `scoring_program.scoring`. The model is fitted on the training set and scored on the held-out test set, which is itself divided into a public part and a private part.
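
Because the classes are imbalanced, it can help to look at a confusion matrix alongside the overall accuracy when testing locally. The sketch below uses scikit-learn metrics; the `evaluate` helper is hypothetical, it is not the scoring program itself, and it assumes the labels are stored as the strings `"A"` to `"G"`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed label encoding: one uppercase letter per class.
LABELS = list("ABCDEFG")


def evaluate(y_true, y_pred):
    """Return (accuracy, 7x7 confusion matrix) for DPE label predictions."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    acc = accuracy_score(y_true, y_pred)
    # Rows are true labels, columns are predicted labels, both ordered A to G.
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    return acc, cm
```

Off-diagonal entries far from the diagonal (for example a true B predicted as F) indicate more serious errors than confusions between neighbouring labels.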

# Submission format

A submission on the Codabench platform is a Python file, `submission.py`, that defines a `get_model()` function returning an estimator compatible with the scikit-learn API.

## The submission file

The input data are stored in a pandas DataFrame. A scikit-learn column transformer can convert this DataFrame into a NumPy array, for instance by selecting the subset of columns to work with. The baseline submission skips this step and simply returns a random forest classifier with default settings.

In [2]:
# %load solution/submission.py
from sklearn.ensemble import RandomForestClassifier


# The submission here should simply be a function that returns a model
# compatible with scikit-learn API
def get_model():
    return RandomForestClassifier()
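
A submission that does use a column transformer might look like the sketch below. It is an assumption-laden example rather than a reference solution: columns are selected by dtype instead of by name, categorical columns are ordinal-encoded, and the random forest settings are arbitrary.

```python
import numpy as np
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder


def get_model():
    preprocessor = ColumnTransformer(
        transformers=[
            # Numerical columns are passed through unchanged.
            ("num", "passthrough", make_column_selector(dtype_include=np.number)),
            # Non-numerical columns are mapped to integer codes;
            # categories unseen during fit are encoded as -1.
            (
                "cat",
                OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
                make_column_selector(dtype_exclude=np.number),
            ),
        ]
    )
    return make_pipeline(
        preprocessor,
        RandomForestClassifier(n_estimators=100, random_state=0),
    )
```

If region or department codes are stored as numbers, the dtype-based selector treats them as numerical features; listing the column names explicitly would avoid that.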


## Local testing pipeline

The cell below shows how the model is used to generate predictions on the test set and how those predictions are scored.

In [3]:
from ingestion_program.ingestion import evaluate_model
from scoring_program.scoring import compute_accuracy

# Fit the submitted model on the training set
model = get_model()
X_train, y_train = get_train_data("dev_phase/input_data")
model.fit(X_train, y_train)

# Predict on the test features (evaluate_model is assumed to return the predictions)
X_test = pd.read_csv("dev_phase/input_data/test/test_features.csv")
y_pred = evaluate_model(model, X_test)

# Compare the predictions with the held-out test labels
y_test = pd.read_csv("dev_phase/input_data/test/test_labels.csv")
print("Accuracy on test set:", compute_accuracy(y_pred, y_test))

Accuracy on test set: 0.95
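
A single train/test evaluation gives one number. To get a sense of its variability before submitting, the training set can also be cross-validated locally. The helper below is a hypothetical sketch using stratified folds; the number of folds and the seed are arbitrary, and it is not part of the ingestion or scoring programs.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validate_accuracy(model, X, y, n_splits: int = 5, seed: int = 0):
    """Return the mean and standard deviation of accuracy over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # np.ravel turns a single-column DataFrame of labels into a 1-D array.
    scores = cross_val_score(model, X, np.ravel(y), cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```

For example, `cross_validate_accuracy(get_model(), X_train, y_train)` returns the mean and spread of accuracy across the folds.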




## Submission

To submit your code, follow the submission instructions on the challenge page on Codabench.