# Real-Time Pollution Data in France

An attempt to gather real-time measurements of regulated air pollutant concentrations in France.

Source : https://www.data.gouv.fr/fr/datasets/donnees-temps-reel-de-mesure-des-concentrations-de-polluants-atmospheriques-reglementes-1/

Real-time data : https://files.data.gouv.fr/lcsqa/concentrations-de-polluants-atmospheriques-reglementes/temps-reel/

Data used for initial exploration : https://files.data.gouv.fr/lcsqa/concentrations-de-polluants-atmospheriques-reglementes/temps-reel/2023/FR_E2_2023-11-05.csv
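
The daily file above appears to follow a `<year>/FR_E2_<YYYY-MM-DD>.csv` naming pattern under the real-time directory. A minimal sketch that builds such a URL for a given day, assuming the pattern holds for other dates (the helper name `daily_file_url` is hypothetical):

```python
from datetime import date

# Base directory of the real-time files (taken from the links above).
BASE_URL = (
    "https://files.data.gouv.fr/lcsqa/"
    "concentrations-de-polluants-atmospheriques-reglementes/temps-reel/"
)


def daily_file_url(day: date) -> str:
    """Build the URL of a daily CSV, assuming the <year>/FR_E2_<date>.csv pattern."""
    return f"{BASE_URL}{day.year}/FR_E2_{day.isoformat()}.csv"


print(daily_file_url(date(2023, 11, 5)))
```

The resulting URL can be passed to `pd.read_csv(url, delimiter=";")` instead of a local path.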

# Documentation

ZAG = Zone atmosphérique de gestion (atmospheric management zone)

PM2.5 = Particulate matter suspended in the air with an aerodynamic diameter of 2.5 µm or less, also called fine particles

# Exploration

## Imports

In [1]:
import pandas as pd
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go

pd.options.plotting.backend = "plotly"
pd.set_option('expand_frame_repr', False)

## Opening the dataset

In [18]:
df = pd.read_csv("../data/raw/FR_E2_2023-11-05.csv", delimiter=";")
df = df.rename(str.lower, axis='columns')
print("dtypes :")
display(df.dtypes)
print("\nshape :")
display(df.shape)
print("\nhead :")
display(df.head(15))

dtypes :


date de début             object
date de fin               object
organisme                 object
code zas                  object
zas                       object
code site                 object
nom site                  object
type d'implantation       object
polluant                  object
type d'influence          object
discriminant              object
réglementaire             object
type d'évaluation         object
procédure de mesure       object
type de valeur            object
valeur                   float64
valeur brute             float64
unité de mesure           object
taux de saisie           float64
couverture temporelle    float64
couverture de données    float64
code qualité              object
validité                   int64
dtype: object


shape :


(29346, 23)


head :


Unnamed: 0,date de début,date de fin,organisme,code zas,zas,code site,nom site,type d'implantation,polluant,type d'influence,...,procédure de mesure,type de valeur,valeur,valeur brute,unité de mesure,taux de saisie,couverture temporelle,couverture de données,code qualité,validité
0,2023/11/05 00:00:00,2023/11/05 01:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.3,µg-m3,,,,A,1
1,2023/11/05 01:00:00,2023/11/05 02:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.2,0.225,µg-m3,,,,A,1
2,2023/11/05 02:00:00,2023/11/05 03:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,,,µg-m3,,,,N,-1
3,2023/11/05 03:00:00,2023/11/05 04:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.325,µg-m3,,,,A,1
4,2023/11/05 04:00:00,2023/11/05 05:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.3,µg-m3,,,,A,1
5,2023/11/05 05:00:00,2023/11/05 06:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.2,0.2,µg-m3,,,,A,1
6,2023/11/05 06:00:00,2023/11/05 07:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.325,µg-m3,,,,A,1
7,2023/11/05 07:00:00,2023/11/05 08:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.3,µg-m3,,,,A,1
8,2023/11/05 08:00:00,2023/11/05 09:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.4,0.4,µg-m3,,,,A,1
9,2023/11/05 09:00:00,2023/11/05 10:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.3,µg-m3,,,,A,1
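
Both date columns are loaded as `object` (plain strings). A minimal sketch of how they could be parsed into timestamps, assuming every row uses the `YYYY/MM/DD HH:MM:SS` layout visible in the head above (the `sample` frame and the `durée` column are illustrative, not part of the dataset):

```python
import pandas as pd

# Two rows copied from the head above, reduced to the relevant columns.
sample = pd.DataFrame({
    "date de début": ["2023/11/05 00:00:00", "2023/11/05 01:00:00"],
    "date de fin": ["2023/11/05 01:00:00", "2023/11/05 02:00:00"],
    "valeur": [0.3, 0.2],
})

# Parse both timestamp columns with an explicit format.
for col in ["date de début", "date de fin"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y/%m/%d %H:%M:%S")

# Each row is expected to cover exactly one hour.
sample["durée"] = sample["date de fin"] - sample["date de début"]
print(sample.dtypes)
```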


## Checking values

In [11]:
print("Unique values :")
cols_to_remove = []  # columns holding a single distinct value
df_dict = {}  # value counts per column, NaN included
for col in df.columns:
    # Skip the measurement columns (there is no "emissions" column in this dataset)
    if col not in ["emissions", "valeur", "valeur brute"]:
        n_unique = df[col].nunique(dropna=False)
        print(f'{col} : unique values count = {n_unique}')
        df_dict[col] = df[col].value_counts(dropna=False)
        if n_unique == 1:
            cols_to_remove.append(col)

print("\nsingle value cols :", cols_to_remove)
for col in cols_to_remove:
    print(df_dict[col], "\n")


Unique values :
date de début : unique values count = 19
date de fin : unique values count = 19
organisme : unique values count = 17
code zas : unique values count = 67
zas : unique values count = 67
code site : unique values count = 502
nom site : unique values count = 500
type d'implantation : unique values count = 5
polluant : unique values count = 9
type d'influence : unique values count = 3
discriminant : unique values count = 29
réglementaire : unique values count = 1
type d'évaluation : unique values count = 3
procédure de mesure : unique values count = 54
type de valeur : unique values count = 1
unité de mesure : unique values count = 3
taux de saisie : unique values count = 1
couverture temporelle : unique values count = 1
couverture de données : unique values count = 1
code qualité : unique values count = 3
validité : unique values count = 2

single value cols : ['réglementaire', 'type de valeur', 'taux de saisie', 'couverture temporelle', 'couverture de données']
réglementai
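
Columns holding a single value carry no information for the exploration. A minimal sketch of how they could be dropped, using a small stand-in frame with illustrative values (NaN is counted as a value, as in the cell above):

```python
import numpy as np
import pandas as pd

# Small stand-in for the real dataframe (values are illustrative).
sample = pd.DataFrame({
    "polluant": ["NO", "NO2", "O3"],
    "type de valeur": ["moyenne horaire brute"] * 3,
    "taux de saisie": [np.nan] * 3,
    "valeur": [0.3, 0.4, 12.0],
})

# Count distinct values per column, counting NaN as a value like unique() does.
n_unique = sample.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

# Drop the constant columns and keep the rest untouched.
reduced = sample.drop(columns=constant_cols)
print(constant_cols)
print(reduced.columns.tolist())
```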

Let's list the unique values of each column:

In [16]:
for key, counts in df_dict.items():
    if key not in ["emissions", "valeur", "valeur brute"]:
        # The index of each value_counts result holds the distinct values of the column
        print(f"all unique {key}", *counts.index.sort_values(), sep=" | ")

all unique date de début | 2023/11/05 00:00:00 | 2023/11/05 01:00:00 | 2023/11/05 02:00:00 | 2023/11/05 03:00:00 | 2023/11/05 04:00:00 | 2023/11/05 05:00:00 | 2023/11/05 06:00:00 | 2023/11/05 07:00:00 | 2023/11/05 08:00:00 | 2023/11/05 09:00:00 | 2023/11/05 10:00:00 | 2023/11/05 11:00:00 | 2023/11/05 12:00:00 | 2023/11/05 13:00:00 | 2023/11/05 14:00:00 | 2023/11/05 15:00:00 | 2023/11/05 16:00:00 | 2023/11/05 17:00:00 | 2023/11/05 18:00:00
all unique date de fin | 2023/11/05 01:00:00 | 2023/11/05 02:00:00 | 2023/11/05 03:00:00 | 2023/11/05 04:00:00 | 2023/11/05 05:00:00 | 2023/11/05 06:00:00 | 2023/11/05 07:00:00 | 2023/11/05 08:00:00 | 2023/11/05 09:00:00 | 2023/11/05 10:00:00 | 2023/11/05 11:00:00 | 2023/11/05 12:00:00 | 2023/11/05 13:00:00 | 2023/11/05 14:00:00 | 2023/11/05 15:00:00 | 2023/11/05 16:00:00 | 2023/11/05 17:00:00 | 2023/11/05 18:00:00 | 2023/11/05 19:00:00
all unique organisme | AIR BREIZH | AIR PAYS DE LA LOIRE | AIRPARIF | ATMO AUVERGNE-RHÔNE-ALPES | ATMO BOURGOGNE-FRA

The data spans from 00:00 to 19:00 on 2023/11/05, in hourly intervals (start times range from 00:00 to 18:00).

In [17]:
# size() counts the rows (hourly measurements) for each zone and pollutant
sectors_df = df.groupby(["zas", "polluant"]).size()
print("emissions per zas")
display(sectors_df)

emissions per zas


zas                            polluant  
ZAG AVIGNON                    NO             45
                               NO2            45
                               NOX as NO2     45
                               O3             30
                               PM10           30
                                            ... 
ZR PROVENCE-ALPES-COTE-D-AZUR  NO2            75
                               NOX as NO2     75
                               O3            165
                               PM10          105
                               PM2.5          90
Length: 428, dtype: int64
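
The long two-level result is hard to scan. As a sketch, the same counts could be pivoted into one row per zone and one column per pollutant (the rows below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative rows only; the real dataframe has one row per hourly measurement.
sample = pd.DataFrame({
    "zas": ["ZAG AVIGNON", "ZAG AVIGNON", "ZAG AVIGNON", "ZAG METZ"],
    "polluant": ["NO", "NO", "O3", "NO"],
})

counts = sample.groupby(["zas", "polluant"]).size()
# Move the pollutant level to columns; missing combinations become 0.
wide = counts.unstack("polluant", fill_value=0)
print(wide)
```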

All concentrations are reported in µg/m3, µg-m3 or mg/m3 (µg-m3 looks like a spelling variant of µg/m3). They can therefore all be converted to a single scale, µg/m3, with 1 mg/m3 = 1000 µg/m3.
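
A minimal sketch of that conversion, assuming that `µg-m3` means the same as `µg/m3` and that no other unit appears (the sample values and the `valeur_ug_m3` column name are illustrative):

```python
import pandas as pd

# Conversion factors to µg/m3; "µg-m3" is assumed to be a spelling variant of "µg/m3".
TO_UG_PER_M3 = {"µg/m3": 1.0, "µg-m3": 1.0, "mg/m3": 1000.0}

# Illustrative values, not taken from the dataset.
sample = pd.DataFrame({
    "polluant": ["NO", "CO", "O3"],
    "valeur": [0.3, 0.2, 45.0],
    "unité de mesure": ["µg-m3", "mg/m3", "µg/m3"],
})

factor = sample["unité de mesure"].map(TO_UG_PER_M3)
# An unknown unit maps to NaN, so fail loudly instead of silently losing values.
if factor.isna().any():
    raise ValueError("unexpected unit of measure")

sample["valeur_ug_m3"] = sample["valeur"] * factor
print(sample)
```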

In [21]:
df.sort_values("date de début")

Unnamed: 0,date de début,date de fin,organisme,code zas,zas,code site,nom site,type d'implantation,polluant,type d'influence,...,procédure de mesure,type de valeur,valeur,valeur brute,unité de mesure,taux de saisie,couverture temporelle,couverture de données,code qualité,validité
0,2023/11/05 00:00:00,2023/11/05 01:00:00,ATMO GRAND EST,FR44ZAG02,ZAG METZ,FR01011,Metz-Centre,Urbaine,NO,Fond,...,Auto NO Conf meth CHIMILU,moyenne horaire brute,0.3,0.30000,µg-m3,,,,A,1
17232,2023/11/05 00:00:00,2023/11/05 01:00:00,AIR PAYS DE LA LOIRE,FR52ZAG01,ZAG NANTES-SAINT-NAZAIRE,FR23249,CAME,Périurbaine,NO2,Industrielle,...,Auto NO2_NOx Conf app API 200E,moyenne horaire brute,0.4,0.42500,µg-m3,,,,A,1
17218,2023/11/05 00:00:00,2023/11/05 01:00:00,AIR PAYS DE LA LOIRE,FR52ZAG01,ZAG NANTES-SAINT-NAZAIRE,FR23249,CAME,Périurbaine,NO,Industrielle,...,Auto NO Conf app API 200E,moyenne horaire brute,0.1,0.10000,µg-m3,,,,A,1
17204,2023/11/05 00:00:00,2023/11/05 01:00:00,AIR PAYS DE LA LOIRE,FR52ZAG01,ZAG NANTES-SAINT-NAZAIRE,FR23249,CAME,Périurbaine,SO2,Industrielle,...,Auto SO2 Conf app AF22M,moyenne horaire brute,-0.5,-0.46667,µg-m3,,,,R,1
2805,2023/11/05 00:00:00,2023/11/05 01:00:00,AIRPARIF,FR11ZAG01,ZAG PARIS,FR04037,PARIS 13eme,Urbaine,NOX as NO2,Fond,...,Auto NO2_NOx app AC32M,moyenne horaire brute,10.3,10.27500,µg-m3,,,,R,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27244,2023/11/05 18:00:00,2023/11/05 19:00:00,ATMO REUNION,FR04ZAR01,ZAR SAINT-DENIS,FR38022,Chaussée Royale,Urbaine,PM10,Trafic,...,Auto PM_Conf_app TEOM-FDMS 8500bc,moyenne horaire brute,,,µg-m3,,,,N,-1
26427,2023/11/05 18:00:00,2023/11/05 19:00:00,ATMO REUNION,FR04ZAR01,ZAR SAINT-DENIS,FR38008,Ecole JOINVILLE,Urbaine,PM10,Fond,...,Auto PM_Conf_app FIDAS 200,moyenne horaire brute,8.1,8.07500,µg-m3,,,,A,1
27054,2023/11/05 18:00:00,2023/11/05 19:00:00,ATMO REUNION,FR04ZAR01,ZAR SAINT-DENIS,FR38020,Plateau Caillou,Urbaine,NO2,Fond,...,Auto NO2_NOx Conf app API T200,moyenne horaire brute,7.0,7.00000,µg-m3,,,,A,1
27491,2023/11/05 18:00:00,2023/11/05 19:00:00,ATMO REUNION,FR04ZAR02,ZAR VOLCAN,FR38098,Sarda Garriga,Urbaine,NO2,Trafic,...,Auto NO2_NOx Conf app API T200,moyenne horaire brute,,,µg-m3,,,,N,-1


# (TODO, taken from NEC Analysis)

# Plotting

## Missing data visuals

In [11]:
missing_df = df.groupby(["year", "country"])["parent_sector_code"].value_counts(dropna=False)

year = 2020
# Keep only the selected year (index levels: year, country, parent_sector_code)
year_df = missing_df[missing_df.index.get_level_values("year") == year]
fig = px.bar(
    x=year_df.index.get_level_values(1),
    y=year_df.values,
    color=year_df.index.get_level_values(2).isnull(),
    title=f"parent_sector_code value counts per country in {year}",
    labels={"x": "Country", "y": "Count", "color": "parent_sector_code missing"},
    text_auto=".2s",
)
fig.update_traces(width=1)
fig.show()

In [12]:
missing_df_total = df.groupby(["year"])["parent_sector_code"].value_counts(dropna=False)
fig = px.bar(
    x=missing_df_total.index.get_level_values(0),
    y=missing_df_total.values, 
    color=missing_df_total.index.get_level_values(1).isnull(), 
    title="parent_sector_code value counts per year",
    labels={"x": "Year", "y": "Count", "color": "parent_sector_code missing"},
    text_auto=".2s",
)
fig.update_traces(width=1)
fig.show()


## Visuals

### Per pollutant by year in a specific country

In [13]:
pollutant_test = "PM2.5"
country = "Ireland"
data = df[(df["pollutant_name"] == pollutant_test) & (df["country"] == country)]

top_avg = data.sort_values(["emissions", "year"], ascending=[False, False], na_position="last")
# display(top_avg)

(top_avg
    .iloc[:10]
    .plot(kind="bar", barmode="group", x="year", y=["emissions"], color="sector_name", 
          title=f"{pollutant_test} emissions ({pollutants_df.loc[pollutant_test].values[0]}) per sector in {country}",
          labels={"year": "Year", "value": f"{pollutant_test} emissions ({pollutants_df.loc[pollutant_test].values[0]})", "sector_name": "Sector"})
)

In [14]:
for pollutant in pollutants_df.index[:3]:
    (
        df[(df["pollutant_name"] == pollutant) & (df["country"] == country)]
        .sort_values(["emissions", "year"], ascending=[False, False], na_position="last")
        .iloc[:10]
        .plot(kind="bar", barmode="group", x="year", y=["emissions"], color="sector_name",
              title=f"{pollutant} emissions ({pollutants_df.loc[pollutant].values[0]}) per sector in {country}",
              labels={"year": "Year", "value": f"{pollutant} emissions ({pollutants_df.loc[pollutant].values[0]})", "sector_name": "Sector"})
        .show()
    )

### Per pollutant unit by year

In [15]:
selected_year_test = 2000
unit_test = "kg"

test_df = (
    df[(df["unit"] == unit_test) & (df["year"] == selected_year_test)]
    .groupby(["country", "year"])["emissions"]
    .sum()
    .sort_values(ascending=False, na_position="last")
    .drop("EU27", level=0)
)
# display(test_df)
# display(test_df.describe())

In [16]:
fig = px.bar(barmode="group",
       x=test_df.index.get_level_values(0),
       y=test_df.values, 
       color=test_df.index.get_level_values(0),
       title=f"emissions in {unit_test} per country in {selected_year_test}",
       labels={"x": "Country", "y": f"Emissions ({unit_test})", "color": "Country"},
       text_auto=".2s",
)
fig.update_traces(width=1)

In [20]:
print(f"pollutant units : {pollutants_df['unit'].unique()}")
selected_year = 2000
for unit in pollutants_df["unit"].unique():
    test_df = (
        df[(df["unit"] == unit) & (df["year"] == selected_year)]
        .groupby(["country", "year"])["emissions"]
        .sum()
        .sort_values(ascending=False, na_position="last")
        .drop("EU27", level=0)
    )
    if len(test_df) > 0:
        fig = px.bar(barmode="group",
            x=test_df.index.get_level_values(0),
            y=test_df.values, 
            color=test_df.index.get_level_values(0), 
            title=f"emissions in {unit} per country in {selected_year}",
            labels={"x": "Country", "y": f"Emissions ({unit})", "color": "Country"},
            text_auto=".2s",
        )
        fig.update_traces(width=1)
        fig.show()
    else:
        print(f"no emissions in {unit} per country in {selected_year}")

pollutant units : ['Gg (1000 tonnes)' 'TJNCV' 't' 'kg' 'g I-TEQ']


# Remarks

## Dataset anomalies :

- Some **notation key** values seem to be wrong:

    | Key | Count |
    |---|---|
    | NO | 802444 |
    | NE | 340031 |
    | NR | 200782 |
    | IE | 197660 |
    | ?? | 563 |
    | C | 451 |
    | N. | 98 |
    | N/ | 64 |
    | N? | 64 |
    | ?. | 61 |
    | Na | 1 |

    ["??", "N.", "N/", "N?", "?.", "Na"] seem to be input errors.
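
As a sketch, such suspect keys could be flagged automatically. This assumes that only the five frequent, well-formed keys in the table are legitimate, and the column name `notation_key` is hypothetical:

```python
import pandas as pd

# Keys that look legitimate in the table above; everything else is treated as suspect.
EXPECTED_KEYS = {"NO", "NE", "NR", "IE", "C"}

# Illustrative rows; the column name "notation_key" is hypothetical.
sample = pd.DataFrame({
    "notation_key": ["NO", "NE", "??", "N.", "C", "Na", None],
})

# Flag non-null keys outside the expected set (missing keys are not flagged).
suspect = sample["notation_key"].notna() & ~sample["notation_key"].isin(EXPECTED_KEYS)
print(sample.loc[suspect, "notation_key"].value_counts())
```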


## Remarks : 

- **Pollutants** are reported in different **units**, which makes comparisons between **pollutants** with different **units** difficult or impossible.
- EU27 seems to represent the sum of the **emissions** of all EU countries (without the UK).
- A **sector** called "NATIONAL TOTAL FOR COMPLIANCE" often appears among the **emission sectors**. It seems to represent the sum of all **emissions** for a **sector**, but it is very unreliable: sometimes it is present without any other **emission**, sometimes the other way around.
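
The EU27 remark could be checked by comparing the EU27 rows with the sum over the other countries. A minimal sketch on made-up numbers, reusing the `country`, `year` and `emissions` column names from the plotting cells; on the real data the comparison would presumably also have to be done per pollutant and per sector:

```python
import pandas as pd

# Illustrative numbers only; two countries stand in for the member states.
sample = pd.DataFrame({
    "country": ["France", "Ireland", "EU27", "France", "Ireland", "EU27"],
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "emissions": [10.0, 2.0, 12.0, 9.0, 2.5, 11.0],
})

is_eu = sample["country"] == "EU27"
reported = sample[is_eu].groupby("year")["emissions"].sum()
summed = sample[~is_eu].groupby("year")["emissions"].sum()

# A positive gap means the EU27 row is larger than the sum of the listed countries.
comparison = pd.DataFrame({"EU27": reported, "sum of countries": summed})
comparison["gap"] = comparison["EU27"] - comparison["sum of countries"]
print(comparison)
```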

## Notes :
- observe trends
- look at the impact of the European measures that were taken
- look at regulations, explain emissions that dropped to 0
- cross-reference datasets, generate a dataset with features for ML by selecting the most important columns
- study the characteristics of the pollutants; is publication mandatory?
- missingno library to explore missing values
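
For the last note, a tabular first pass at missing values is possible with pandas alone; missingno would add graphical views of the same information. A minimal sketch on illustrative rows shaped like the air-quality dataframe above:

```python
import numpy as np
import pandas as pd

# Illustrative rows shaped like the air-quality dataframe above.
sample = pd.DataFrame({
    "polluant": ["NO", "NO", "PM10", "PM10"],
    "valeur": [0.3, np.nan, 8.1, np.nan],
    "taux de saisie": [np.nan] * 4,
})

# Number and share of missing values per column, most affected first.
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean(),
}).sort_values("missing", ascending=False)
print(missing)
```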
