# README.md

## Project Structure
- Data Preparation and Cleaning
- Individual Dashboard Components
- Dashboard

### Components
- Descriptive Overviews of the Datasets
- Correlation Map
- Choropleth Map with Pie Chart
- Sankey Diagram (assessment of budget flows)
- PCA based on BERT and NLP embeddings (ML pipeline implementation)
- Animated Hans Rosling graph
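
For the PCA component listed above, a minimal sketch of the projection step, assuming the text embeddings are already available as a NumPy array with one row per project. The `pca_project` helper is hypothetical and the random matrix only stands in for real BERT embeddings:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)                      # centre each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates in PC space

# Random stand-in for an (n_projects x embedding_dim) matrix; not real data.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 16))
coords = pca_project(embeddings)
print(coords.shape)
```

The two columns of `coords` could then be used as the x and y axes of a scatter plot.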

---

Each main element of the project structure may need its own notebook (`.ipynb` file); keep all notebooks in the same folder.
For data cleaning and preparation, start from the relevant steps outlined in the research proposal document. Remember to set aside 20% of the dataset as test data.
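
A minimal sketch of the 20% hold-out, assuming the cleaned data sits in one pandas DataFrame with a unique index. The `split_train_test` helper, the toy frame, and the fixed seed are illustrative and not part of the project:

```python
import pandas as pd

def split_train_test(df, test_frac=0.2, seed=42):
    """Return (train, test) with `test_frac` of the rows held out at random.

    Assumes df has a unique index, so dropping by index removes exactly
    the sampled rows.
    """
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(index=test.index)
    return train, test

# Toy stand-in for the cleaned table (hypothetical values).
toy = pd.DataFrame({"projectID": range(10), "budget": range(100, 110)})
train, test = split_train_test(toy)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across notebooks, which matters if several notebooks read the same cleaned file.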

#### Sankey Diagram

Total Horizon Europe Funding $→$ Funding Structures & Mechanisms $→$ Countries $→$ Areas/Industries (euroSciVocTitle) $→$ Regions $→$ SME
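
The chain above could be turned into Sankey node and link lists by aggregating funding between each pair of consecutive stages. This is a sketch under assumptions: the `sankey_links` helper, the column names, and the toy values are hypothetical, and the real tables would first need to be joined into one frame:

```python
import pandas as pd

def sankey_links(df, stages, value_col):
    """Aggregate flows between consecutive stage columns into Sankey lists."""
    labels, index = [], {}
    source, target, value = [], [], []
    for left, right in zip(stages, stages[1:]):
        # Total value flowing from each `left` category to each `right` one.
        flows = df.groupby([left, right], sort=True)[value_col].sum()
        for (a, b), v in flows.items():
            for name in (a, b):
                if name not in index:        # register each node label once
                    index[name] = len(labels)
                    labels.append(name)
            source.append(index[a])
            target.append(index[b])
            value.append(float(v))
    return labels, source, target, value

# Toy data with hypothetical column names and values.
toy = pd.DataFrame({
    "mechanism": ["Grant A", "Grant A", "Grant B"],
    "country": ["FR", "DE", "DE"],
    "field": ["law", "optics", "optics"],
    "funding": [10.0, 5.0, 7.0],
})
labels, source, target, value = sankey_links(
    toy, ["mechanism", "country", "field"], "funding"
)
print(labels)
```

The four lists map directly onto the `label`, `source`, `target`, and `value` arguments of `go.Sankey`. Note that this sketch treats identical labels in different stages as the same node, so stage names may need a prefix if they can collide.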

Challenges:
There may be too many categories, so they will need to be condensed by folding them back onto the higher levels of the euroSciVoc hierarchy.
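
One possible way to fold categories back is to truncate each `euroSciVocPath` to its first few levels. The `fold_path` helper and the chosen depth are assumptions for illustration, and the example paths are shortened forms of the kind of values found in the dataset:

```python
def fold_path(path, depth=1):
    """Keep only the first `depth` levels of a euroSciVoc-style path."""
    parts = [p for p in path.split("/") if p]   # drop the empty leading part
    return "/" + "/".join(parts[:depth])

# Illustrative paths in the euroSciVocPath format.
paths = [
    "/social sciences/law",
    "/natural sciences/physical sciences/optics",
    "/natural sciences/physical sciences/astronomy",
]
folded = sorted({fold_path(p, depth=2) for p in paths})
print(folded)
```

With `depth=2`, the two physical sciences paths collapse into a single category while the shorter law path is unchanged.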



# Horizon Europe: analytics and pipelines


In [7]:
!pip install Dash

Collecting Dash
  Downloading dash-3.0.4-py3-none-any.whl.metadata (10 kB)
Collecting Flask<3.1,>=1.0.4 (from Dash)
  Downloading flask-3.0.3-py3-none-any.whl.metadata (3.2 kB)
Collecting Werkzeug<3.1 (from Dash)
  Downloading werkzeug-3.0.6-py3-none-any.whl.metadata (3.7 kB)
Collecting retrying (from Dash)
  Downloading retrying-1.3.4-py3-none-any.whl.metadata (6.9 kB)
Downloading dash-3.0.4-py3-none-any.whl (7.9 MB)
Downloading flask-3.0.3-py3-none-any.whl (101 kB)
Downloading werkzeug-3.0.6-py3-none-any.whl (227 kB)
Downloading retrying-1.3.4-py3-none-any.whl (11 kB)
Installing collected packages: Werkzeug, retryi

In [24]:
import numpy as np
import pandas as pd
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from dash import Dash, dcc, html, Input, Output

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Folder on Google Drive that holds the Excel exports
DATA_DIR = '/content/drive/MyDrive/Colab Notebooks/datasets'

euroSciVoc = pd.read_excel(f'{DATA_DIR}/euroSciVoc.xlsx')
legalBasis = pd.read_excel(f'{DATA_DIR}/legalBasis.xlsx')
organization = pd.read_excel(f'{DATA_DIR}/organization.xlsx')
project = pd.read_excel(f'{DATA_DIR}/project.xlsx')
topics = pd.read_excel(f'{DATA_DIR}/topics.xlsx')
webItem = pd.read_excel(f'{DATA_DIR}/webItem.xlsx')
webLink = pd.read_excel(f'{DATA_DIR}/webLink.xlsx')

## Description/Exploration

In [None]:
data_all = [euroSciVoc, legalBasis, organization, project, topics, webItem, webLink]
# DataFrame.info() prints its summary and returns None, so call it directly
for frame in data_all:
    frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38789 entries, 0 to 38788
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   projectID              38789 non-null  int64  
 1   euroSciVocCode         38789 non-null  object 
 2   euroSciVocPath         38789 non-null  object 
 3   euroSciVocTitle        38789 non-null  object 
 4   euroSciVocDescription  0 non-null      float64
dtypes: float64(1), int64(1), object(3)
memory usage: 1.5+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20512 entries, 0 to 20511
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   projectID            20512 non-null  int64  
 1   legalBasis           20512 non-null  object 
 2   title                20512 non-null  object 
 3   uniqueProgrammePart  15341 non-null  object 
 4   Unnamed: 4           0 non-null      float64
 5
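
The `info()` summaries above list columns with zero non-null values (`euroSciVocDescription`, `Unnamed: 4`). A minimal sketch of dropping such columns during cleaning, assuming they carry no information. The `drop_empty_columns` helper and the toy frame are illustrative:

```python
import numpy as np
import pandas as pd

def drop_empty_columns(df):
    """Return a copy of df without columns that contain only missing values."""
    return df.dropna(axis=1, how="all")

# Toy frame mimicking the structure of euroSciVoc (values are illustrative).
toy = pd.DataFrame({
    "projectID": [1, 2, 3],
    "euroSciVocTitle": ["law", "microscopy", "astrochemistry"],
    "euroSciVocDescription": [np.nan, np.nan, np.nan],
})
cleaned = drop_empty_columns(toy)
print(list(cleaned.columns))
```

Applying the helper to every frame in `data_all` would remove the all-null columns in one pass.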

### Variables of Interest

Outcome variables:

Covariates:

In [None]:
euroSciVoc.head()

|   | projectID | euroSciVocCode | euroSciVocPath | euroSciVocTitle | euroSciVocDescription |
|---|---|---|---|---|---|
| 0 | 101116741 | /29/97/543 | /social sciences/political sciences/government... | government systems | NaN |
| 1 | 101163161 | /27/81/30021/30833628 | /agricultural sciences/agriculture, forestry, ... | grains and oilseeds | NaN |
| 2 | 101163161 | /23/43/251/48354418 | /natural sciences/physical sciences/optics/mic... | microscopy | NaN |
| 3 | 101163161 | /23/43/257/761 | /natural sciences/physical sciences/astronomy/... | astrochemistry | NaN |
| 4 | 101163161 | /29/89 | /social sciences/law | law | NaN |


## EU map

## Pie Chart

## Sankey Diagrams

In [None]:
# Define the nodes
labels = ["Start", "Step 1", "Step 2", "End"]

# Define the links
source = [0, 1, 1, 2]  # indices in 'labels'
target = [1, 2, 3, 3]
value =  [10, 5, 5, 10]

# Create Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=labels
    ),
    link=dict(
        source=source,
        target=target,
        value=value
    )
)])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=12)
fig.show()

In [25]:
# Example data
df = pd.DataFrame({
    "iso_alpha": ["FRA", "DEU", "ESP", "ITA"],
    "country": ["France", "Germany", "Spain", "Italy"],
    "gdp": [2700000000000, 3800000000000, 1400000000000, 2000000000000],
})

fig_map = px.choropleth(
    df,
    locations="iso_alpha",
    color="gdp",
    hover_name="country",
    hover_data={"gdp": ":,.0f"},
    color_continuous_scale="Greens",
    scope="europe"
)

fig_map.update_layout(title="EU Countries GDP Map", geo=dict(lakecolor='lightblue'))
fig_map.show()


# Empty Pie Chart (initially)
fig_pie = go.Figure()
fig_pie.update_layout(title_text="Hover over a country to see GDP pie", showlegend=True)
fig_pie.show()

In [16]:
# Example Sankey diagram
fig_sankey = go.Figure(data=[go.Sankey(
    node=dict(
      pad=15,
      thickness=20,
      line=dict(color="black", width=0.5),
      label=["Start", "Middle A", "Middle B", "End"],
      color=["blue", "lightblue", "lightgreen", "green"]
    ),
    link=dict(
      source=[0, 1, 0, 2],
      target=[1, 3, 2, 3],
      value=[8, 4, 2, 8]
    )
)])

fig_sankey.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig_sankey.show()

## Implementing the Dashboard


In [11]:

app = Dash(__name__)

app.layout = html.Div([
    html.H1("EU Dashboard"),
    dcc.Graph(figure=fig)
])

if __name__ == '__main__':
    app.run(debug=True)


In [32]:


app = Dash(__name__)

# Placeholder pie chart shown before any country is hovered
fig_pie = go.Figure()
fig_pie.update_layout(title_text="Hover over a country to see GDP pie", showlegend=True)

# App layout: one layout holds every component
# (map and pie chart side by side, Sankey diagram below)
app.layout = html.Div([
    html.H1("EU Dashboard: GDP Map with Pie Chart on Hover and Sankey Diagram"),

    html.Div([
        html.H2("EU Choropleth Map"),
        # Dash style dictionaries use camelCase keys (verticalAlign, not vertical-align)
        dcc.Graph(id="choropleth-map", figure=fig_map, style={"width": "60%", "display": "inline-block"}),
        dcc.Graph(id="pie-chart", figure=fig_pie, style={"width": "38%", "display": "inline-block", "verticalAlign": "top"}),
    ]),

    html.Div([
        html.H2("Simple Sankey Diagram"),
        dcc.Graph(figure=fig_sankey)
    ])
])

# Callback to update pie chart based on hover
@app.callback(
    Output('pie-chart', 'figure'),
    Input('choropleth-map', 'hoverData')
)
def update_pie(hoverData):
    if hoverData is None:
        return go.Figure()

    try:
        # Careful: hoverData returns 'location' inside 'points'
        country_hovered = hoverData['points'][0]['location']
    except (KeyError, IndexError, TypeError):
        # If anything goes wrong, just return empty
        return go.Figure()

    selected_row = df[df['iso_alpha'] == country_hovered]

    if selected_row.empty:
        return go.Figure()

    country_name = selected_row.iloc[0]['country']
    gdp_value = selected_row.iloc[0]['gdp']

    # Pie chart: hovered country vs rest
    fig = go.Figure(data=[go.Pie(
        labels=[country_name, 'Rest of EU'],
        values=[gdp_value, df['gdp'].sum() - gdp_value],
        hole=0.4,
    )])

    fig.update_layout(title_text=f"{country_name} GDP Share")

    return fig

# Run app
if __name__ == '__main__':
    app.run(debug=True)
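
The "hovered country vs rest" arithmetic in the callback can be checked without starting the Dash server by moving it into a plain function. This is a sketch only: the `share_vs_rest` helper is hypothetical, and the two-row frame reuses the example GDP figures from the notebook:

```python
import pandas as pd

def share_vs_rest(df, iso_code, key="iso_alpha", value="gdp"):
    """Return (labels, values) for one country against the rest, or None."""
    row = df[df[key] == iso_code]
    if row.empty:
        return None                       # unknown code: nothing to plot
    own = float(row.iloc[0][value])
    rest = float(df[value].sum()) - own   # everything except the hovered country
    return [row.iloc[0]["country"], "Rest"], [own, rest]

sample = pd.DataFrame({
    "iso_alpha": ["FRA", "DEU"],
    "country": ["France", "Germany"],
    "gdp": [2.7e12, 3.8e12],
})
result = share_vs_rest(sample, "FRA")
print(result)
```

The callback could then pass the returned lists to `go.Pie` and fall back to the placeholder figure when the helper returns `None`.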
