<img src='https://www.icos-cp.eu/sites/default/files/2017-11/ICOS_CP_logo.png' width="400" align="right"/> <br clear="all" />
# Analysis of city characteristics
For questions and feedback contact jupyter-info@icos-ri.eu.

This notebook and its associated notebooks are designed to produce the type of analyses presented in the paper "Monitoring CO₂ in diverse European cities: Highlighting needs and challenges through characterization" (Storm et al., 2025, in preparation).


## Brief overview of the notebook

Note that these steps should be followed in sequence, as each often depends on the previous one.

1. Select characteristics

    18 characteristics are used in the study, with additional characteristics available here.
    Selection files used in the study can be uploaded, and new custom selection files can be downloaded.

2. Select cities

    Choose a subset from the available 308 cities based on the percentile range of a selected characteristic or by country.

3. Prepare selected characteristics for integration and analysis
    - Apply min-max normalization within the 10th to 90th percentile range.
    - Invert values for relevant characteristics so that larger values indicate a greater monitoring challenge.
    - Weight characteristics according to their relative importance.<br><br>

4. Analyze the integrated characteristics
    - Calculate the challenge score.
    - Generate a similarity matrix, used for similarity searches between cities.
    - Conduct dendrogram cluster analysis.
    - Create a map showing city cluster associations.

## Select characteristics and cities 

### Available characteristics (sect. 2.2.1 through sect. 2.2.6)

Below is a more extensive overview of the different characteristics to choose from in the tool below. Relevant sections in the study by Storm et al. (2025) are referenced, and a summary of all datasets and their sources is found in Table 1. Characteristics included in the study are in bold, and references to data sources for characteristics not included in the study are provided below.

<details>
<summary>  
    <b>General characteristics</b>
</summary>
    
Total population 2018. Data downloaded from <a href = "https://ec.europa.eu/eurostat/web/gisco/geodata/population-distribution/geostat" target ="_blank">GEOSTAT</a> (last access 2024-10-26)
   
     
- <b>Total_pop</b>

Share of the city area containing 50% of the population (same method as "non-point-source aggregation" (Eq. 1))
    
- area_50percent_pop

Area in km²
- area

Distance to closest city in km
    
Calculated based on the minimum distance between city borders using ArcGIS Pro v. 3.3 tool "Close".
- dist_closest_city

Share of times the wind is >2m/s Jan/Feb 2018 hours 9:00 - 18:00UTC (sect. 2.2.1).
- <b>share_over_2m_s</b>

Share of times the wind originates from the dominant 30-degree wind direction (sect. 2.2.1)
- dominant_wind_share
    
Share of days (fraction between 0 and 1) with cloud cover less than 30% at 11:00UTC in 2018 (sect. 2.2.6).
- share_cloud_over30percent

Share of days (fraction between 0 and 1) with cloud cover less than 30% at 11:00UTC in summer (J+J) 2018 (sect. 2.2.6).
- <b>share_cloud_over30percent_summer</b>
    
Share of days (fraction between 0 and 1) with cloud cover less than 30% at 11:00UTC in winter (J+F) 2018 (sect. 2.2.6).
- <b>share_cloud_over30percent_winter</b>
    
    
</details>

<details>
<summary>  
    <b>Land cover</b>
</summary>
    
Share of vegetation (%)
- vegetation
    
The percent of edge cells for vegetated areas (sect. 2.2.3).
- <b>veg_share_edge_area</b>
    
Share of cropland in the 20 km buffer zone (sect. 2.2.3).
- <b>40_buffer</b>

Share of cropland in the 20 km buffer zone in the dominant wind direction. Only when wind is >2m/s (sect. 2.2.1; sect. 2.2.3).
- <b>40_buffer_dom_wind</b>

Share of water in the 20 km buffer zone around the city
- share_water_buffer

Share of vegetation in the 20 km buffer zone around the city
- share_vegetation_buffer

Share of different land cover classes (used to calculate the total share of vegetation):
- 10: Trees (ESA Worldcover class 10)
- 20: Shrubland (ESA Worldcover class 20)
- 30: Grassland (ESA Worldcover class 30)
- 40: Cropland (ESA Worldcover class 40)
- 50: Built-up (ESA Worldcover class 50)
- 60: Bare / sparse vegetation (ESA Worldcover class 60)
- 70: Snow and ice (ESA Worldcover class 70)
- 80: Permanent water bodies (ESA Worldcover class 80)
- 90: Herbaceous wetland (ESA Worldcover class 90)
- 95: Mangroves (ESA Worldcover class 95)
- 100: Moss and lichen (ESA Worldcover class 100)


</details>

<details>
<summary>  
    <b>CO<sub>2</sub> emissions: from TNO</b>
</summary>
 
Total emissions 2018 (kg)
- co2_ff_total

Total emission per km²
- co2_ff_total_km2

Total emission per person
- co2_ff_total_pop

Share of point source emissions (sect. 2.2.2)
- <b>co2_ff_share_point_sources_total</b>

Total non-point source emissions 2018
- co2_ff_total_no_point

Total non-point emission per km²
- co2_ff_total_no_point_km2

Total non-point emission per person
- co2_ff_total_no_point_pop

Share of city area containing 50% of the emissions (sect. 2.2.2; Eq. 1)
- <b>area_percentage</b>

Emission intensity (kg CO<sub>2</sub> / km²) within 20km buffer zone (sect. 2.2.2)
- <b>co2_ff_total_20km_buffer_km2</b>
    
Emission intensity (kg CO<sub>2</sub> / km²) within 20km buffer in the dominant wind direction (Jan/Feb 2018 hours 9:00 - 18:00UTC). Only when wind is >2m/s. (sect. 2.2.1; sect. 2.2.2)
- <b>emiss_intensity_buff_dom_wind_jan_feb_9_18</b>

Share of emissions from point sources in the buffer zone
- co2_ff_total_20km_buffer_share_point_sources

Count of point sources in the buffer zone
- co2_ff_total_20km_buffer_count_point_sources


Emission shares per sector (for absolute emissions co2_ff_[letter]). These sectors account for 96% of the emissions in the available cities. Shipping is also included, although it is negligible for many cities. 

Public power
- co2_ff_A_share
- co2_ff_A_share_point_sources
- co2_ff_count_point_sources_A

Industry
- co2_ff_B_share
- co2_ff_B_share_point_sources
- co2_ff_count_point_sources_B

Other stationary combustions
- co2_ff_C_share
- No point sources

Road transport
- co2_ff_F_share
- No point sources

Shipping 
- co2_ff_G_share
- No point sources


</details>

<details>
<summary>  
    <b>CO<sub>2</sub> emissions: from ODIAC</b>
</summary>

"The Open-Data Inventory for Anthropogenic Carbon dioxide (ODIAC) is a high-spatial resolution global emission data product of CO<sub>2</sub> emissions from fossil fuel combustion (Oda and Maksyutov, 2011)"

Downloaded from <a href = "https://db.cger.nies.go.jp/dataset/ODIAC/" target = "blank">here</a> 2024-06-03. Total CO<sub>2</sub> emissions from fossil fuel combustion in year 2018 (monthly files summed). Emissions from international aviation and marine bunker are not included.
    
Monthly emission files are available (temporal profiles can also be applied: "Weekly/diurnal emissions can be modeled by applying the TIMES (Temporal Improvements for Modeling Emissions by Scaling, Nassar et al. 2013) temporal scaling factors to the ODIAC monthly emission fields."). 

Total emissions in year 2018

- co2_ff_total_ODIAC    
    
</details>

<details>
<summary>  
    <b>Biogenic activity</b>
</summary>
    
Average NEE in µmol/m²/s for the whole city during January and February 2018. Average per hour in different columns. (sect. 2.2.3)
- NEE_average_[hour between 0 and 23] (<b>NEE_average_15</b>)

Average NEE to calculate the average net added or removed CO<sub>2</sub> for each hour of the day for the entire city. Values during January and February 2018. Average per hour in different columns. 
- tCO2_offset_per_hour_[hour between 0 and 23]

The above estimates are compared to the hourly fossil fuel emissions from TNO (scaled using standard temporal profiles from Ingrid Super, updated from Denier van der Gon et al., 2011) for each hour of the day for the entire city. (sect. 2.2.3; sect. 2.2.2)
- tCO2_rel_emission_per_hour_[hour between 0 and 23]  (<b>tCO2_rel_emission_per_hour_15</b>)

</details>

<details>
<summary>  
    <b>Urban topography</b>
</summary>

Mean building height in meters (sect. 2.2.4)
- <b>mean_built_up</b>

</details>

<details>
<summary>  
    <b>Natural topography</b>
</summary>

Share of flat areas 
- share_flatness

Mean Terrain Ruggedness Index (sect. 2.2.4). 
- <b>Mean_TRI</b> 

</details>

<details>
<summary>  
    <b>Nuclear contamination</b>
</summary>

Average nuclear contamination winter (J+F) afternoons (12 and 15 UTC)
- average_nuclear_contamination_permil_winter
    
Share of days >0.5 permil nuclear contamination winter
    
- share_daily_average_over_0_5_permil_winter

Share of days >0.3 permil nuclear contamination winter
    
- share_daily_average_over_0_3_permil_winter
    
Average nuclear masking potential winter (%)
    
- nuclear_masking_potential_winter
    
Average ffCO<sub>2</sub> STILT signal winter

- average_modelled_ffco2_signal_winter
    
Average ffCO<sub>2</sub> STILT signal on days <0.5 permil nuclear contamination winter
    
- average_modelled_ffco2_signal_days_below_0_5permil_winter
    
Average representation bias due to sample selection (>0.5 permil) winter in percent (sect. 2.2.5)

- <b>representation_bias_sample_selection_winter </b> 
    
Average nuclear masking potential winter in percent (sect. 2.2.5)
- <b>nuclear_masking_potential_winter</b>

</details>

### Make selection of variables

Use the checkboxes next to each characteristic description to make your selection. Selected variables will appear directly in the dataframe below the selection cell, shown by their short names. Full descriptions are available in the "Available characteristics" section as well as below the dataframe.

Variables that are of interest but should not be included in the analysis can also be selected. These can be excluded by assigning them a weight of zero in a later step.

- E.g. in the study, the "Population year 2018" variable is used to subset the 308 cities to the 96 cities with over 200,000 inhabitants.

You can also upload selection files ("Upload (0)"), such as those available from the study (see below). The selections specified in an uploaded file can still be adjusted by toggling checkboxes on or off.

Additionally, it is possible to download a custom selection file ("Save Selection").


#### Variables used in the study

To reproduce the variable selection in Storm et al. (2025), you can upload one of the following selection files in the tool below. Each file represents a specific challenge analyzed in the study.

<details>
    <summary><b>Available challenge selection files</b></summary>

The tool in the study was used to produce results for each individual challenge:

- **Background challenge**  
  [Open selection file](./saved_selections_and_weights/selection_background_challenge.json)

- **Biogenic challenge**  
  [Open selection file](./saved_selections_and_weights/selection_biogenic_challenge.json)

- **Modelling challenge**  
  [Open selection file](./saved_selections_and_weights/selection_modelling_challenge.json)

- **Observational challenge**  
  [Open selection file](./saved_selections_and_weights/selection_observational_challenge.json)

For the **Overall challenge**, all 18 metrics from the individual challenges are combined. This selection was used to produce the cluster analysis in the study:

- **Overall challenge**  
  [Open selection file](./saved_selections_and_weights/selection_overall_challenge.json)

Once the selection file is opened, click on "File" > "Download" in the top left corner. 

</details>

#### Save Selection
The selection can be downloaded to a file, which can then be uploaded in the same way as the selection files listed above. To do this, click on the "Save Selection" button, which saves the file to an output folder located in the home directory. To locate the output folder, follow the links to the output generated by other sections of this notebook.

In [None]:
import city_characteristic_analysis_functions as functions

# Callback function to capture the subset dataframe for use in the next step.
def capture_subset_df(subset_df):
    global subset_df_global
    subset_df_global = subset_df

# Call the function with the callback
functions.column_selection(capture_subset_df)

### (Optionally) create a subset of the cities (sect. 2.1)

<mark><b>Note that this cell needs to be run</b></mark> even if you intend to keep all cities (simply run the cell and proceed to the next cell).

By default, cities from all available countries are included (Fig. 1). By toggling off a country’s checkbox, its cities will be excluded.

Only selected characteristics can be used to subset the dataframe based on a specified percentile range. The percentile range is based on all 308 cities, even if some cities have already been excluded by unchecking their associated checkboxes. Press enter after setting the percentile range and see how the number of cities has changed in the resulting dataframe.

- To replicate the analyses in the study, the characteristic "Total_pop" (total population in 2018) must be selected. We subset the data to include cities between the 69th and 100th percentiles, which corresponds to a population threshold of 200,000, selecting 96 of the possible 308 cities.
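
The percentile-based subsetting described above can be sketched as follows. This is a minimal illustration with made-up data, not the implementation behind `functions.subset_cities`; the function name `subset_by_percentile` and the toy table are assumptions. As in the tool, the thresholds are taken from the full city table, even when some cities have already been excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real table holds 308 cities.
all_cities = pd.DataFrame(
    {"Total_pop": [50_000, 120_000, 210_000, 450_000, 2_100_000]},
    index=["A", "B", "C", "D", "E"],
)

def subset_by_percentile(df, reference, column, lower_pct, upper_pct):
    """Keep rows of `df` whose `column` value lies within the percentile
    range computed on `reference` (the full, unfiltered city table)."""
    lower, upper = np.percentile(reference[column], [lower_pct, upper_pct])
    return df[df[column].between(lower, upper)]

# City "E" has been excluded (e.g. by unchecking its country), but the
# percentile thresholds are still computed on all cities.
remaining = all_cities.drop(index="E")
large_cities = subset_by_percentile(remaining, all_cities, "Total_pop", 50, 100)
print(large_cities)
```

With the study's settings, the call would use the 69th and 100th percentiles of "Total_pop".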

In [None]:
def capture_filtered_df(subset_df):
    global subset_df_filtered
    subset_df_filtered = subset_df
    
functions.subset_cities(subset_df_global, capture_filtered_df)

## Prepare the selected data for further analysis 

### Min-max normalization to the 10th to 90th percentile range (sect. 2.4)

The resulting dataframe will have values between 0 (at or below the 10th percentile) and 1 (at or above the 90th percentile).
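
A minimal sketch of this kind of scaling is shown below. It assumes that values outside the percentile range are clipped to 0 and 1, which matches the description above but is not taken from the source of `functions.scale_df`; the function name `scale_to_percentile_range` is hypothetical.

```python
import numpy as np
import pandas as pd

def scale_to_percentile_range(series, lower_pct=10, upper_pct=90):
    """Min-max scale a series using its percentiles as the min and max,
    clipping values outside that range to 0 and 1."""
    lower, upper = np.percentile(series, [lower_pct, upper_pct])
    if upper == lower:
        # Degenerate case: no spread between the two percentiles.
        return pd.Series(0.0, index=series.index)
    return ((series - lower) / (upper - lower)).clip(0, 1)

# Toy values for one characteristic across eleven hypothetical cities.
values = pd.Series([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
scaled = scale_to_percentile_range(values)
print(scaled.round(3).tolist())
```

Clipping at the percentiles limits the influence of outliers: a city with an extreme value is treated the same as a city at the 10th or 90th percentile.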

In [None]:
def capture_scaled_df(subset_df):
    global subset_df_scaled
    subset_df_scaled = subset_df
    
functions.scale_df(subset_df_filtered, capture_scaled_df)

### Invert values for relevant characteristics (sect. 2.4)

Applied to the following variables in the study to ensure that a higher value always indicates a greater challenge:

- share_over_2m_s_scaled (share wind >2 m s−1): The more often wind speeds reach at least 2 m/s, the less challenging it is expected to be.
- share_flatness_scaled (share flatness): The flatter a city is, the less challenging it is expected to be.
- dominant_wind_share_scaled (share wind from dominant wind direction): The more frequently the wind comes from its dominant direction, the less challenging it is expected to be.

If these variables are selected, their checkboxes will be checked automatically when the cell below is run. All checkboxes can be toggled on and off.
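
For values already scaled to the range 0 to 1, an inversion can be sketched as `1 - value`. This is an assumption about how the inversion is done, not a copy of `functions.invert_values`; the helper `invert_columns`, the toy data, and the column name `Mean_TRI_scaled` are hypothetical.

```python
import pandas as pd

# Toy scaled values (0 to 1) for three hypothetical cities.
scaled = pd.DataFrame(
    {
        "share_flatness_scaled": [0.0, 0.25, 1.0],
        "Mean_TRI_scaled": [0.1, 0.5, 0.9],
    },
    index=["City1", "City2", "City3"],
)

def invert_columns(df, columns):
    """Return a copy of `df` where the given columns are replaced by 1 - value."""
    inverted = df.copy()
    inverted[columns] = 1 - inverted[columns]
    return inverted

# Only flatness is inverted: a flatter city is expected to be less challenging.
inverted = invert_columns(scaled, ["share_flatness_scaled"])
print(inverted)
```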


In [None]:
def capture_inverted_df(subset_df):
    global subset_df_scaled_inverted
    subset_df_scaled_inverted = subset_df
    
functions.invert_values(subset_df_scaled, capture_inverted_df)

#### Optionally save resulting dataframe

- The resulting dataframe for the "overall challenge" was saved and used to create Figure 3.

Un-comment (remove #) to download the data as a csv-file. It will be located in the same folder as this notebook. 

In [None]:
#subset_df_scaled_inverted.to_csv('data_for_figure3_creation.csv', encoding = 'utf-8-sig')

### Weigh the variables (Sect. 2.4.1)

Weighing the variables is optional, but the cell <mark><b>must be run</b></mark> (by clicking the 'Run' button) for the subsequent cells to function properly. It is only possible to run the cell if the weights add up to 100.

The default weights have no effect, meaning all characteristics influence the subsequent analysis equally.

#### Weights used in the study (Table 1)

To reproduce the weights used in Storm et al. (2025), you can upload one of the following weight files in the tool below. Each file represents a specific challenge analyzed in the study.

<details>
    <summary><b>Available challenge weight files</b></summary>

The tool in the study was used to produce results for each individual challenge:

- **Background challenge**  
  [Open weight file](./saved_selections_and_weights/weights_background_challenge.json)

- **Biogenic challenge**  
  [Open weight file](./saved_selections_and_weights/weights_biogenic_challenge.json)

- **Modelling challenge**  
  [Open weight file](./saved_selections_and_weights/weights_modelling_challenge.json)

- **Observational challenge**  
  [Open weight file](./saved_selections_and_weights/weights_observational_challenge.json)

For the **Overall challenge**, all 18 metrics from the individual challenges are combined. These weights were used to produce the cluster analysis in the study:

- **Overall challenge**  
  [Open weight file](./saved_selections_and_weights/weights_overall_challenge.json)

Once the weight file is opened, click on "File" > "Download" in the top left corner. 

</details>

In [None]:
def capture_weighted_df(subset_df):
    global subset_df_scaled_inverted_weighted
    subset_df_scaled_inverted_weighted = subset_df
    
functions.weigh_variables_df(subset_df_scaled_inverted, capture_weighted_df)

## Challenge score (sect. 2.4.1)

The relative combined challenge scores are calculated based on all selected variables. These scores for the overall challenge are used to scale the size of the points representing cities in Figure 1.

The score ranges from 0 to 1 (0% to 100%), from minimum to maximum relative challenge. A city reaches the minimum if it falls at or below the 10th percentile for all metrics, and the maximum if it falls at or above the 90th percentile for all metrics.

In the output dataframe, which can optionally be downloaded as "challenge_score.csv" (by checking "Save as CSV"), the columns "challenge_rank" and "challenge_quartile" are included. A higher challenge score, rank, and quartile indicate a higher relative challenge.
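
One way to obtain a score with these properties is a weighted average of the scaled (and, where relevant, inverted) values, using weights that sum to 100. The sketch below illustrates that idea with made-up data. It is an assumption, not the code behind `functions.calculate_challenge_score`, and the rank and quartile columns are derived with standard pandas calls rather than the tool's own logic.

```python
import pandas as pd

# Toy prepared values (0 to 1, higher = more challenging) for four cities.
prepared = pd.DataFrame(
    {"metric_a": [0.0, 0.5, 1.0, 1.0], "metric_b": [0.0, 1.0, 0.5, 1.0]},
    index=["City1", "City2", "City3", "City4"],
)
# Hypothetical weights in percent; they must add up to 100.
weights = pd.Series({"metric_a": 75, "metric_b": 25})

def challenge_score(df, weights):
    """Weighted average of the columns of `df`, in the range 0 to 1."""
    if weights.sum() != 100:
        raise ValueError("weights must add up to 100")
    return (df * (weights / 100)).sum(axis=1)

result = pd.DataFrame({"challenge_score": challenge_score(prepared, weights)})
# Higher rank and quartile indicate a higher relative challenge.
result["challenge_rank"] = result["challenge_score"].rank(method="min").astype(int)
result["challenge_quartile"] = pd.qcut(
    result["challenge_score"], 4, labels=[1, 2, 3, 4]
)
print(result)
```

With equal weights, every characteristic contributes equally to the score.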


In [None]:
functions.calculate_challenge_score(subset_df_scaled_inverted_weighted)

## Similarity matrix (sect. 2.4.2; sect. 3.3)

Similarity matrices show the similarity between each city and all other cities given distances across all selected (and weighted) variables. In the study, Munich's similarity matrix scores for the different challenges are displayed in Table 4.

The default distance calculation method in the tool below is Euclidean distance, which is also the method used in the study (Sect. 2.4.2). Read more about the available options in the SciPy manual <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html" target="_blank">here</a>.

The shorter the distances across the metrics, the more similar a city is. If a metric is given more weight, its distance in that aspect will have a greater impact on the integrated distance score. For easier interpretation, when Euclidean distance is selected, values are inverted in the tool as follows:

`similarity_matrix = (1 - similarity_matrix) * 100`

Multiplying the inverted similarity matrix by 100 provides a similarity score in percentage.

Note that this only applies when <mark>Euclidean distance is selected</mark>. For other distance calculation methods, a lower value still indicates shorter distances and greater similarity.

The full similarity matrix resulting from the tool will be used in subsequent steps. However, for ease of interpretation, the tool allows the selection of specific cities to be displayed in a subset similarity matrix. The cities listed under "Selected Cities" will appear in the subset similarity matrix after running the tool ("Run"). Initially, Paris, Munich, and Zurich are included in this list—if they are available in the dataframe. Use the buttons "Add City" and "Remove City" to change the list of selected cities. 
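
The Euclidean case can be sketched with SciPy's `pdist` and `squareform`, followed by the inversion shown above. The sketch assumes that all pairwise distances are at most 1, so that the resulting percentages stay between 0 and 100; the toy data and the helper name `similarity_percent` are hypothetical and not part of `functions.create_similarity_matrix`.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Toy prepared and weighted values for three hypothetical cities.
features = pd.DataFrame(
    {"metric_a": [0.0, 0.0, 0.3], "metric_b": [0.0, 0.0, 0.4]},
    index=["City1", "City2", "City3"],
)

def similarity_percent(df):
    """Pairwise Euclidean distances converted to a similarity in percent.
    Assumes all distances are <= 1."""
    distances = squareform(pdist(df.to_numpy(), metric="euclidean"))
    similarity = (1 - distances) * 100
    return pd.DataFrame(similarity, index=df.index, columns=df.index)

similarity = similarity_percent(features)
print(similarity.round(1))
```

City1 and City2 have identical values and are therefore 100% similar, while City3 lies at a distance of 0.5 from both.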

In [None]:
def capture_similarity_matrix(similarity_matrix_result):
    global similarity_matrix
    similarity_matrix = similarity_matrix_result
    
functions.create_similarity_matrix(subset_df_scaled_inverted_weighted, capture_similarity_matrix)

## Cluster analysis: dendrogram (sect. 2.4.3; sect. 3.4)

The cluster analysis in this study is used to identify cities that differ—i.e., fall into different clusters—from Paris, Munich, and Zurich.

The linkage method used for the dendrogram is "ward". You can read about the different options in the SciPy manual <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html" target="_blank">here</a>. 

Note that clusters are created by cutting the dendrogram with a horizontal line at a given height on the y-axis. Therefore, reaching the exact target number of clusters ("Target clusters") may not be possible. In such cases, the closest achievable number of clusters to the target will be used.
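
A minimal sketch of Ward clustering with a target number of clusters is given below, using SciPy's `linkage` and `fcluster`. It works directly on a small made-up feature matrix, whereas the tool takes the similarity matrix as input, so it illustrates the technique rather than `functions.create_dendrogram` itself.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy feature matrix: five hypothetical cities, two characteristics each.
features = np.array(
    [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0], [0.5, 0.0]]
)

# Build the hierarchical tree with Ward linkage (Euclidean distances).
tree = linkage(features, method="ward")

# Cut the tree so that at most `target_clusters` flat clusters are formed.
target_clusters = 2
labels = fcluster(tree, t=target_clusters, criterion="maxclust")
print(labels)
```

With `criterion="maxclust"`, SciPy returns no more than the requested number of clusters, so tied merge heights can yield fewer clusters than the target.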


In [None]:
def capture_clusters_dendrogram(clusters_df):
    global clusters_dendrogram
    clusters_dendrogram = clusters_df
    
functions.create_dendrogram(similarity_matrix, capture_clusters_dendrogram)

### Spatial representation of dendrogram clusters 

Table and map showing cities and their associated dendrogram clusters.

In [None]:
display(clusters_dendrogram)

functions.cluster_map(clusters_dendrogram)