Skip to content

Erin-Anzallo/Project_Datascience

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

86 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Forecasting Europe 2030: A Machine Learning Analysis of Growth, Equality and Climate SDGs

Project Overview

The 2030 SDG Monitor is a data science project designed to analyze, forecast and visualize the progress of European countries towards the United Nations Sustainable Development Goals (SDGs) for the year 2030.

Focusing on SDG 8 (Decent Work), SDG 10 (Reduced Inequalities) and SDG 13 (Climate Action), this project utilizes a predictive pipeline based on historical data (2005–2019) to project future trends and assess whether countries are on track to meet EU targets.

Key Features

  • Predictive Modeling: Forecasts indicators up to 2030 using Linear Regression models.
  • Hybrid Methodology: Combines time-series trends with autoregressive dynamics (lagged socio-economic variables).
  • Model Validation: Includes a backtesting module to evaluate model accuracy (MAE) using a 2019 cutoff (pre-pandemic).
  • Interactive Dashboard: A web-based interface (Plotly Dash) featuring:
    • Choropleth maps for European overview.
    • Trend lines comparing historical data, forecasts and 2030 targets.
    • A color system (Green/Orange/Red) to visualize distance to targets.

Project Structure

Project_Datascience/
├── data/
│   └── Final_Cleaned_Database.csv    # Historical dataset (2005-2022)
├── results/
│   ├── descriptive_analysis/         # Exploratory data analysis charts
│   ├── forecast_2030/                # Generated forecast data and static plots
│   └── model_validation_plot/        # Backtesting performance charts
├── src/
│   ├── dashboard.py                  # Interactive Dash application
│   ├── descriptive_analysis.py       # Descriptive analysis script
│   ├── forecast_to_2030.py           # Main forecasting script
│   ├── model_validation.py           # Backtesting and error analysis script
│   └── preprocessing_data.py         # Data preprocessing script
└── README.md

Installation & Requirements

This project requires Python 3.9+. Install the necessary dependencies:

pip install -r requirements.txt

Usage

1. Data Preprocessing

Clean and prepare the raw dataset for analysis.

python src/preprocessing_data.py

Output: Generates the cleaned dataset data/Final_cleaned_database.csv.

2. Descriptive Analysis

Perform exploratory data analysis to visualize historical trends.

python src/descriptive_analysis.py

Output: Generates distribution and correlation plots in results/descriptive_analysis/.

3. Model Validation (Optional)

Run the backtesting script to evaluate the reliability of the models. It trains on data up to 2019 and tests on 2020-2022.

python src/model_validation.py

Output: Generates validation charts and an error summary table in results/model_validation_plot/.

4. Generate Forecasts

Run the main forecasting script to generate predictions for 2023-2030.

python src/forecast_to_2030.py

Output: Creates graph_forecast_data.csv and static trend images in results/forecast_2030/.

5. Launch the Dashboard

Start the interactive web application to explore the results.

python src/dashboard.py

Output: The app will run locally. Open your browser at http://127.0.0.1:8050/.

Methodology

To address the complexity of socio-economic and environmental indicators, we developed a Linear Regression pipeline combining temporal trends and autoregressive dynamics:

  1. Socio-Economic Indicators (e.g., GDP, Unemployment, Inequality):

    • Modeled using Lagged Features (e.g., $X_{t-1}$) to capture system dynamics.
    • Example: Unemployment Rate is predicted using the previous year's NEET Rate and Income Distribution.
  2. Environmental Indicators (GHG Emissions, Renewable Share):

    • Modeled using Time-Series Trends (Year as the sole feature).
    • This approach captures the structural, often policy-driven trajectory of green transition metrics.

Note on Training: The models are trained on data from 2005 to 2019 to avoid biasing the long-term trend with the specific anomalies of the COVID-19 pandemic years.

Indicators & Targets

Indicator SDG 2030 Target / Goal
Real GDP Per Capita SDG 8 Growth Trend
Unemployment Rate SDG 8 ≤ 5.0%
NEET Rate SDG 8 ≤ 9.0%
Income Distribution (S80/S20) SDG 10 Reduction Trend
Income Share Bottom 40% SDG 10 Increase Trend
Renewable Energy Share SDG 13 ≥ 42.5%
GHG Emissions SDG 13 Reduction Trend

Visualization Logic (Color System)

The dashboard uses a color-coded system to evaluate progress:

  • For Numeric Targets: The color is determined by comparing the 2030 Forecast directly to the fixed threshold.
  • For Trend Targets: The color is determined by comparing the 2030 Forecast to the Last Historical Value (2022).
    • Green: The forecast shows an improvement relative to the 2022 baseline.
    • Red: The forecast shows a deterioration or stagnation relative to the 2022 baseline.
    • Note: Even if the trend line points in the right direction, the status remains Red if the final 2030 prediction is not better than the actual 2022 value.

Author

  • Erin Anzallo - M1 Data Science Project

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages