# Visual Analytics

## Assignment 1

**Instructor:** Dr. Marco D'Ambros  
**TAs:** Carmen Armenti, Mattia Giannaccari

**Contacts:** marco.dambros@usi.ch, carmen.armenti@usi.ch, mattia.giannaccari@usi.ch

**Due Date:** 10 April, 2025 @ 23:55

---

### Goal

The goal of this assignment is to use Python and Jupyter notebook to explore, analyze and visualize the datasets provided. 

The assignment is divided into four sections, each requiring you to apply the knowledge gained from both the theoretical and practical lectures to solve the exercises. Specifically, when creating tabular or graphical representations, you should apply the principles learned in the theoretical lectures and use the technologies introduced in the practical sessions. The datasets you need to use are detailed in the **Datasets Description** section and can be found in the following folder [Assignment1_Data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/EqjXB7uSEoVAujKPSZY1hvIBMhAXJv5y6Z-UwaO6bCtOjg?e=kxcaai).

### Submission Guidelines
- **Format:** Please submit a Jupyter Notebook containing your solutions along with a clear explanation of the **steps** taken to arrive at each solution. Each solution must be introduced by a Markdown cell indicating the exercise number. If you prefer, you may use the uploaded assignment file and develop your solution by adding cells below each exercise's instructions. It is essential that every choice is justified, and the solution is thoroughly commented to explain each step. Exercises without explanations will be evaluated negatively.

- **Filename:** Please name the Jupyter notebook as follows: `SurenameName_Assignment1.ipynb`.

- **Submission:** Please submit your solution (the jupyter notebook and any other script you may have used to support your solution) to iCorsi.


## Preparatory Phase

Installing the needed modules

In [175]:
%pip install pandas
%pip install bokeh
%pip install matplotlib
%pip install chardet
%pip install geopandas
%pip install seaborn

Note: you may need to restart the kernel to use updated packages.


Importing the modules

In [176]:
import pandas as pd
import numpy as np
import chardet

---
## Section 1 - Data quality (10 points)

**Data Source:** `used_cars.csv`.

In the `used_cars.csv` dataset, please perform the following data cleaning steps: 
- Identify any missing or invalid values in the following columns: `vehicle type`, `price`, `brand`, and `month of registration`. If needed, standardize the data. For the `price` column specifically, the prices are recorded in euros, please consider valid only values within the range of €1,000 and €500,000. 
- For each of the previous columns, report the number of missing or invalid entries.
- After identifying missing or invalid values in the columns above, remove **any** rows where at least one of these columns contains such data.

Please clearly outline the steps you take to clean the dataset and document your approach. You may use any preferred tool or technology, such as Python (vanilla or Pandas) or OpenRefine.

In [177]:
with open('datasets/used_cars.csv', 'rb') as f:
    data = f.read()

encoding_result = chardet.detect(data)
encoding = encoding_result['encoding']

df_usedcars = pd.read_csv('datasets/used_cars.csv', encoding=encoding)
df_usedcars.rename(columns=lambda x : x.strip(), inplace=True)
columns = {'vehicleType', 'price', 'brand', 'monthOfRegistration'}
df_usedcars

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,privat,Angebot,2200,test,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


### Step 1

Before filtering, I inspect the values in the `DataFrame` to check whether `NaN` values are present and whether any standardization is needed.

In [178]:
for col in columns:
    print(f"{col} has NaN? {df_usedcars[col].isnull().any()}")

monthOfRegistration has NaN? False
vehicleType has NaN? True
price has NaN? False
brand has NaN? False


We can see from the previous cell that only `vehicleType` has `NaN` values.

In [179]:
for col in columns:
    print(df_usedcars[col].value_counts().sort_index())
    print('=' * 50)

monthOfRegistration
0     37675
1     24561
2     22403
3     36170
4     30918
5     30631
6     33167
7     28958
8     23765
9     25074
10    27337
11    25489
12    25380
Name: count, dtype: int64
vehicleType
andere         3357
bus           30201
cabrio        22898
coupe         19015
kleinwagen    80023
kombi         67564
limousine     95894
suv           14707
Name: count, dtype: int64
price
0             10778
1              1189
2                12
3                 8
4                 1
              ...  
32545461          1
74185296          1
99000000          1
99999999         15
2147483647        1
Name: count, Length: 5597, dtype: int64
brand
BMW                   3
alfa_romeo         2345
audi              32873
bmw               40265
bmw                   6
chevrolet          1845
chrysler           1452
citroen            5182
dacia               900
daewoo              542
daihatsu            806
fiat               9676
ford              25573
honda           

Since the `brand` column contains multiple occurrences of `bmw` written in different ways, I am going to standardize the entries.

In [180]:
df_usedcars['brand'] = df_usedcars['brand'].apply(lambda x : x.strip().lower())
df_usedcars['brand'].value_counts().sort_index()

brand
alfa_romeo         2345
audi              32873
bmw               40274
chevrolet          1845
chrysler           1452
citroen            5182
dacia               900
daewoo              542
daihatsu            806
fiat               9676
ford              25573
honda              2836
hyundai            3646
jaguar              621
jeep                807
kia                2555
lada                225
lancia              484
land_rover          770
mazda              5695
mercedes_benz     35309
mini               3394
mitsubishi         3061
nissan             5037
opel              40136
peugeot           11027
porsche            2215
renault           17969
rover               490
saab                530
seat               7022
skoda              5641
smart              5249
sonstige_autos     3982
subaru              779
suzuki             2328
toyota             4694
trabant             591
volkswagen        79640
volvo              3327
Name: count, dtype: int64
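The same standardization can also be written with pandas' vectorized string methods. The following is a minimal sketch on a hypothetical sample (the values are made up to mimic the spellings seen above); unlike `x.strip()` inside `apply`, the `.str` accessor leaves missing entries as missing instead of raising an error.

```python
import pandas as pd

# Hypothetical sample mimicking the inconsistent spellings seen above.
brands = pd.Series(["BMW", "bmw", "bmw ", " audi", None])

# The .str accessor propagates missing values, so None stays missing
# instead of raising an AttributeError as x.strip() would.
normalized = brands.str.strip().str.lower()

# value_counts() ignores missing values by default.
counts = normalized.value_counts().sort_index()
print(counts)
```

For the `brand` column of this dataset the two approaches give the same result, because the column has no `NaN` values.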

Using boolean filters, I flag the rows where at least one condition is satisfied.
I then combine the filters with the `|` operator.

Since I noticed earlier that `monthOfRegistration` takes values between `0` and `12`, I treat it as a categorical column whose acceptable values are those between `1` and `12`; `0` is therefore considered invalid.

`NaN` values are present only in the `vehicleType` column, so I do not need to handle them in the numerical columns.
Since `brand` has no `NaN` values and its entries are already standardized, its filter flags no rows; I keep it only so that the report covers all four columns.

In [181]:
filter_vehicle_type = df_usedcars['vehicleType'].isna()
filter_brand = df_usedcars['brand'].isna()
filter_price = (df_usedcars['price'] < 1_000) | (df_usedcars['price'] > 500_000)
filter_month = (df_usedcars['monthOfRegistration'] < 1) | (df_usedcars['monthOfRegistration'] > 12)

filter = filter_vehicle_type | filter_brand | filter_price | filter_month
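As an illustration of how such masks combine, here is a small sketch on hypothetical rows (only the column names come from the dataset). It uses `Series.between`, which is inclusive on both ends by default, as an alternative way of writing the range checks.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
cars = pd.DataFrame({
    "vehicleType": ["suv", None, "kombi", "bus"],
    "price": [9800, 18300, 480, 600000],
    "monthOfRegistration": [8, 5, 0, 3],
})

missing_type = cars["vehicleType"].isna()
# between(a, b) includes both bounds, so its negation flags
# exactly the values below a or above b.
bad_price = ~cars["price"].between(1_000, 500_000)
bad_month = ~cars["monthOfRegistration"].between(1, 12)

# A row is invalid if at least one of the checks fails.
invalid = missing_type | bad_price | bad_month
clean = cars[~invalid]
print(clean)
```

Only the first row survives: the second has no vehicle type, the third fails both the price and the month check, and the fourth exceeds the price limit.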

### Step 2

Report the number of missing or invalid values by counting the `True` values in each filter series.

In [182]:
print('=' * 15 + ' Missing/Invalid Values ' + '=' * 15)
print(f"  Missing values for vehicleType: \t  {filter_vehicle_type.sum()}")
print(f"  Missing values for brand: \t\t  {filter_brand.sum()}")
print(f"  Invalid values for price: \t\t  {filter_price.sum()}")
print(f"  Invalid values for monthOfRegistration: {filter_month.sum()}")
print(f"  Total removed values: \t\t  {filter.sum()}")
print('=' * 54)

  Missing values for vehicleType: 	  37869
  Missing values for brand: 		  0
  Invalid values for price: 		  83435
  Invalid values for monthOfRegistration: 37675
  Total removed values: 		  115942
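Note that the total is smaller than the sum of the per-column counts, because a row that fails several checks is counted once per check but removed only once. A quick arithmetic check with the numbers reported above:

```python
# Counts taken from the report above.
per_column = {
    "vehicleType": 37_869,
    "brand": 0,
    "price": 83_435,
    "monthOfRegistration": 37_675,
}
total_removed = 115_942

# Summing the columns counts a row once for every check it fails.
naive_sum = sum(per_column.values())
# The excess is the number of "extra" counts caused by rows failing
# more than one check (a row failing k checks contributes k - 1).
excess = naive_sum - total_removed
print(naive_sum, excess)
```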


### Step 3

Removing the rows where at least one condition holds, i.e. the rows whose entry in `filter` is `True`.

In [183]:
df_usedcars_filtered = df_usedcars[~filter]
df_usedcars_filtered

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
6,2016-04-01 20:48:51,Peugeot_206_CC_110_Platinum,privat,Angebot,2200,test,cabrio,2004,manuell,109,2_reihe,150000,8,benzin,peugeot,nein,2016-04-01 00:00:00,0,67112,2016-04-05 18:18:39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371521,2016-03-27 20:36:20,Opel_Zafira_1.6_Elegance_TÜV_12/16,privat,Angebot,1150,control,bus,2000,manuell,0,zafira,150000,3,benzin,opel,nein,2016-03-27 00:00:00,0,26624,2016-03-29 10:17:23
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


## Section 2 - Data Analysis, Visualization, and Exploration (60 points) 📊
In this section, you will need to use two different datasets: `us_accidents.csv` for the first three exercises and `eu_energy.csv` for the next three. Each exercise is worth 10 points.

In [184]:
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.models import (
    ColumnDataSource, NumeralTickFormatter, TableColumn, DataTable, HTMLTemplateFormatter, 
    RadioButtonGroup, CustomJS, Row, InlineStyleSheet, ColumnDataSource, Span, HoverTool, 
    CustomJSHover, CustomJSTickFormatter
)
from bokeh.models.widgets import DataTable, TableColumn, GroupingInfo, SumAggregator, DataCube
from bokeh.layouts import column, row

import calendar

reset_output()
output_notebook()

### Section 2.1 
**Data Source**: `us_accidents.csv`

1. In the US Accidents dataset please remove all rows where one or more columns have missing data and explicitly identify the number of rows with null values. Consider the years 2020 and 2022.

    - What are the cities with the highest number of accidents in 2020 and 2022? Report them with the number of accidents.
    - Please provide the yearly total number of car accidents in 2020 and 2022 for each `County` and `City` combination.
    - Please retrieve the 10 cities with the highest total number of accidents in 2020 and 2022, and create a visualization that:
    
        - As a **primary goal** shows the increase in accident numbers for each city that allows the comparison of the increase per city. Which is the city with the most significant increase?
        - As a **secondary goal** presents the absolute number of accidents in both 2020 and 2022 for each selected city.
    
    Please explain the insights gained from the visualization and justify the choice of the representation.


In [185]:
df_accidents = pd.read_csv('datasets/us_accidents.csv')
df_accidents.rename(columns=lambda x : x.strip(), inplace=True)

In [239]:
length = len(df_accidents.index)
df_accidents_cleaned = df_accidents.dropna()
length_clean = len(df_accidents_cleaned.index)

nans = length - length_clean
print(f'The number of rows with nans are: {nans}')

The number of rows with nans are: 4173845
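The count can also be obtained directly from a row-wise null mask, which makes it easy to cross-check against the length difference. A minimal sketch on a hypothetical frame (the column names are illustrative only):

```python
import pandas as pd

# Hypothetical frame; column names are illustrative only.
df = pd.DataFrame({
    "City": ["Miami", None, "Dallas", "Orlando"],
    "Temperature": [80.0, 75.0, None, None],
    "Severity": [2, 3, 2, 1],
})

# True for every row that has a null in at least one column.
rows_with_null = df.isna().any(axis=1)
n_null_rows = int(rows_with_null.sum())

cleaned = df.dropna()
# Both ways of counting must agree.
assert n_null_rows == len(df) - len(cleaned)

# Per-column null counts show where the missing data comes from.
print(df.isna().sum())
print(n_null_rows)
```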


Considering only the accidents in `2020` and `2022`, I exclude any accident whose `Start_Time` or `End_Time` does not fall in `2020` or `2022`.

In [240]:
years = {2020, 2022}
colors = {2020 : "steelblue", 2022 : "indianred"}

df_accidents_cleaned['Start_Time'] = pd.to_datetime(df_accidents_cleaned['Start_Time'], format='mixed')
df_accidents_cleaned['End_Time'] = pd.to_datetime(df_accidents_cleaned['End_Time'], format='mixed')

df_accidents_20_22 = df_accidents_cleaned.query('Start_Time.dt.year in @years and End_Time.dt.year in @years')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_cleaned['Start_Time'] = pd.to_datetime(df_accidents_cleaned['Start_Time'], format='mixed')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_cleaned['End_Time'] = pd.to_datetime(df_accidents_cleaned['End_Time'], format='mixed')
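The warnings above are raised because columns are assigned on a frame that pandas considers derived from another one. One common way to avoid them is to take an explicit `.copy()` before assigning. The sketch below shows this on hypothetical timestamps, with the year filter written as plain boolean masks instead of `query`; it assumes pandas 2.0 or later for `format="mixed"`.

```python
import pandas as pd

# Hypothetical timestamps in slightly different formats.
raw = pd.DataFrame({
    "Start_Time": ["2020-03-01 08:00:00", "2021-12-31 23:50:00",
                   "2022-06-30 23:30:00.000000", None],
    "End_Time": ["2020-03-01 09:15:00", "2022-01-01 00:20:00",
                 "2022-07-01 01:00:00", "2022-07-01 02:00:00"],
})
years = [2020, 2022]

# An explicit copy makes the frame independent, so later column
# assignments do not trigger SettingWithCopyWarning.
cleaned = raw.dropna().copy()
for col in ["Start_Time", "End_Time"]:
    cleaned[col] = pd.to_datetime(cleaned[col], format="mixed")

# Keep rows whose start and end years are both in the selected years.
in_years = (cleaned["Start_Time"].dt.year.isin(years)
            & cleaned["End_Time"].dt.year.isin(years))
selected = cleaned[in_years].copy()
selected["Year"] = selected["Start_Time"].dt.year
print(selected)
```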


#### Step 1

After filtering the data, I show the city with the most accidents in each of the two years.

In [241]:
df_accidents_20_22['Year'] = df_accidents_20_22['Start_Time'].dt.year
accidents_per_year = df_accidents_20_22.groupby(['Year', 'City']).size().sort_values(ascending=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_20_22['Year'] = df_accidents_20_22['Start_Time'].dt.year


In [242]:
for year in years:
    print(f"The city with the most accidents in {year} is {accidents_per_year.loc[year].idxmax()} with {accidents_per_year.loc[year].max()} accidents.")

The city with the most accidents in 2020 is Miami with 20938 accidents.
The city with the most accidents in 2022 is Miami with 61311 accidents.
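An alternative way to get the top city of every year in one step is to look up the row with the largest count inside each year group. A minimal sketch on hypothetical events:

```python
import pandas as pd

# Hypothetical events: one row per accident.
events = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2022, 2022, 2022, 2022],
    "City": ["Miami", "Miami", "Dallas", "Orlando", "Orlando", "Miami", "Orlando"],
})

counts = events.groupby(["Year", "City"]).size().reset_index(name="Accidents")

# idxmax() returns, for each year, the row label of the largest count;
# .loc then selects those rows.
top = counts.loc[counts.groupby("Year")["Accidents"].idxmax()]
print(top)
```

This avoids iterating over the years explicitly and keeps the city and its count together in one table.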


#### Step 2
To show the number of accidents per `Year`, `County` and `City`, I need to group the data by these three columns.

Since there are too many values to plot in a chart, I show the results in a table.

In [243]:
df_accidents_grouped = df_accidents_20_22.groupby(['Year', 'County', 'City']).size().reset_index(name='Accidents').sort_values(by='Accidents', ascending=False)

source = ColumnDataSource(df_accidents_grouped)

template = """
    <div style="color:dimgray;">
        <%= value %>
    </div>
"""

template_numbers = """
    <div style="color:dimgray;">
        <%= value.toLocaleString() %> 
    </div>
"""

columns = [
    TableColumn(field="Year", title="Year", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="County", title="County", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="City", title="City", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="Accidents", title="Number of Accidents", formatter=HTMLTemplateFormatter(template=template_numbers))
]

grouping = [
    GroupingInfo(getter='Year', aggregators=[SumAggregator(field_="Accidents")])
]

target = ColumnDataSource(data=dict(row_indices=[], labels=[]))

css = """
.slick-group {
    color: dimgray;
    border-bottom: 2px solid #dee2e6 !important;
}
"""

data_cube = DataCube(
    source=source,
    columns=columns,
    grouping=grouping,
    target=target,
    stylesheets=[css]
)

show(data_cube)


#### Step 3

First, I retrieve the 10 cities with the highest total number of accidents.

In [244]:
df_accidents_top10 = df_accidents_20_22.groupby('City').size().sort_values(ascending=False)[:10].index

df_accidents_top10 = df_accidents_20_22[
    (df_accidents_20_22['City'].isin(df_accidents_top10)) & 
    (df_accidents_20_22['Year'].isin(years))
]

accidents_counts = df_accidents_top10.groupby(['City', 'Year']).size().unstack()

accidents_counts

City,2020,2022
Charlotte,8922,16554
Dallas,7697,18815
Houston,6831,19491
Los Angeles,17077,26544
Miami,20938,61311
Nashville,4866,12248
Orlando,8393,34308
Raleigh,5858,13509
Sacramento,4340,12025
San Diego,5540,12118
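The pattern used to build this table (select the top cities by total count, then pivot the yearly counts into columns) can be sketched on a small hypothetical sample, here with the top 2 cities instead of the top 10:

```python
import pandas as pd

# Hypothetical events: one row per accident.
events = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "C", "A", "B", "C", "D"],
    "Year": [2020, 2020, 2022, 2020, 2022, 2020, 2022, 2022, 2022, 2022],
})

# Cities ranked by their total number of accidents over both years.
top_cities = events["City"].value_counts().nlargest(2).index

# One row per city, one column per year.
table = (
    events[events["City"].isin(top_cities)]
    .groupby(["City", "Year"])
    .size()
    .unstack(fill_value=0)
)
print(table)
```

`fill_value=0` is an optional safeguard for cities that have no accidents in one of the years.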


To show the increase in the number of accidents I use a horizontal bar chart (`hbar`) from `Bokeh`, and I add the secondary information (the absolute counts for 2020 and 2022) to the `tooltip`.

Since it is not specified whether the increase to show should be `absolute` or `relative`, I show both.

In [245]:
accidents_counts['Increment'] = accidents_counts[2022] - accidents_counts[2020]
accidents_counts['Percentage'] = ((accidents_counts['Increment'] / accidents_counts[2020]) * 100).replace([float('inf'), -float('inf')], 0).fillna(0)
accidents_counts.sort_values(by=['Increment'], ascending=True, inplace=True)

# Absolute

source = ColumnDataSource(data={
    'city': accidents_counts.index.tolist(),
    'increment': accidents_counts['Increment'].tolist(),
    'percentage': accidents_counts['Percentage'].tolist(),
    'accidents_2020': accidents_counts[2020].tolist(),
    'accidents_2022': accidents_counts[2022].tolist()
})

TOOLTIPS = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px;">
        <span style="font-size: 12px; color: steelblue;">City:</span> @city<br>
        <span style="font-size: 12px; color: steelblue;">Increment:</span> @increment{0,0}<br>
        <span style="font-size: 12px; color: steelblue;">Percentage:</span> @percentage{0.0}%<br>
        <span style="font-size: 12px; color: steelblue;">2020 Accidents:</span> @accidents_2020{0,0}<br>
        <span style="font-size: 12px; color: steelblue;">2022 Accidents:</span> @accidents_2022{0,0}
    </div>
"""

plot_increment = figure(
    y_range=accidents_counts.index.tolist(),
    height=400,
    width=800,
    tooltips=TOOLTIPS,
    title="Accident Increment Absolute from 2020 to 2022 in Top 10 Cities",
    x_axis_label="Absolute Increment",
    y_axis_label="City"
)

plot_increment.hbar(
    y='city',
    right='increment',
    source=source,
    height=0.85,
)

plot_increment.toolbar.logo = None
plot_increment.toolbar_location = None
plot_increment.xgrid.grid_line_color = None
plot_increment.xaxis[0].formatter = NumeralTickFormatter(format="0,0")
plot_increment.yaxis.minor_tick_line_color = None
plot_increment.xaxis.minor_tick_line_color = None
plot_increment.x_range.start = 0

show(plot_increment)

# Percentage

accidents_counts.sort_values(by=['Percentage'], ascending=True, inplace=True)

plot_percentage = figure(
    y_range=accidents_counts.index.tolist(),
    height=400,
    width=800,
    tooltips=TOOLTIPS,
    title="Accident Increment Percentage from 2020 to 2022 in Top 10 Cities",
    x_axis_label="Relative Increment (%)",
    y_axis_label="City",
)

plot_percentage.hbar(
    y='city',
    right='percentage',
    source=source,
    height=0.85
)

plot_percentage.toolbar.logo = None
plot_percentage.toolbar_location = None
plot_percentage.xgrid.grid_line_color = None
plot_percentage.xaxis[0].formatter = NumeralTickFormatter(format="0,0")
plot_percentage.yaxis.minor_tick_line_color = None
plot_percentage.xaxis.minor_tick_line_color = None
plot_percentage.x_range.start = 0

show(plot_percentage)



As we can see from the plots, `Miami` has the largest **absolute** increase from `2020` to `2022`, while `Orlando` has the largest **relative** increase.
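This conclusion can be checked numerically from the counts in the table above. The sketch below recomputes the absolute increase and the relative increase, defined here as `(2022 - 2020) / 2020 * 100`:

```python
import pandas as pd

# Counts copied from the table of the top 10 cities above.
counts = pd.DataFrame(
    {
        2020: [8922, 7697, 6831, 17077, 20938, 4866, 8393, 5858, 4340, 5540],
        2022: [16554, 18815, 19491, 26544, 61311, 12248, 34308, 13509, 12025, 12118],
    },
    index=["Charlotte", "Dallas", "Houston", "Los Angeles", "Miami",
           "Nashville", "Orlando", "Raleigh", "Sacramento", "San Diego"],
)

increment = counts[2022] - counts[2020]
percentage = increment / counts[2020] * 100

print(increment.idxmax(), increment.max())
print(percentage.idxmax(), round(percentage.max(), 1))
```

Miami gains 40,373 accidents (about 193% of its 2020 value), while Orlando gains 25,915, which is about 309% of its 2020 value.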

2. We define the **accident duration** as the time elapsed from the start of the accident until its impact on traffic flow is resolved.

    Please provide a table that shows the minimum and maximum accident duration for each combination of `State`, `County`, `City`, `Year`, `Month`, ensuring that only combinations with data for all 12 months is available. Then, filter the data to include only **Los Angeles**, **Dallas**, and **New York** cities and plot the behavior of the minimum and maximum durations for accidents that occurred in 2022. Choose a visualization that highlights how the average values of both minimum and maximum durations relate to the minimum-maximum range.

    - Which city shows the least pronounced variation? 
    - What insights can you draw from the plot?

    Please explain what the plot reveals and justify the choice of visualization.
    

Here I assign each accident to the month of its `Start_Time`: an accident that starts in `June` and ends in `July` is counted as a `June` accident, since that is when it happened.

In [246]:
df_accidents_20_22['Month'] = df_accidents_20_22['Start_Time'].dt.month
df_accidents_20_22['Duration'] = (df_accidents_20_22['End_Time'] - df_accidents_20_22['Start_Time']).dt.total_seconds() / 60
df_accidents_20_22['Duration'] = df_accidents_20_22.Duration.apply(lambda x : int(x))

df_accidents_sccym = df_accidents_20_22.groupby(['State', 'County', 'City', 'Year', 'Month'])['Duration'].agg(['min', 'max']).reset_index()
df_accidents_sccym.rename(columns={'min' : 'Min', 'max': 'Max'}, inplace=True)

valid_entries = df_accidents_sccym.groupby(['State', 'County', 'City', 'Year'])['Month'].nunique() == 12
df_accidents_sccym = df_accidents_sccym.set_index(['State', 'County', 'City', 'Year']).loc[valid_entries].reset_index()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_20_22['Month'] = df_accidents_20_22['Start_Time'].dt.month
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_20_22['Duration'] = (df_accidents_20_22['End_Time'] - df_accidents_20_22['Start_Time']).dt.total_seconds() / 60
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accid
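The two ideas of this step, assigning each accident to its starting month and keeping only the combinations that cover all 12 months, can be sketched on hypothetical data. The completeness check is written here with `transform("nunique")`, an alternative to indexing with a boolean series; the city names `A` and `B` are made up.

```python
import pandas as pd

# Hypothetical data: city "A" has an accident in every month of 2022,
# city "B" only in two months.
starts_a = [f"2022-{m:02d}-15 10:00:00" for m in range(1, 13)]
starts_b = ["2022-06-30 23:30:00", "2022-09-01 08:00:00"]
df = pd.DataFrame({
    "City": ["A"] * 12 + ["B"] * 2,
    "Start_Time": pd.to_datetime(starts_a + starts_b),
})
df["End_Time"] = df["Start_Time"] + pd.Timedelta(minutes=45)

df["Year"] = df["Start_Time"].dt.year
# The month comes from the start, even when the end falls in the next month.
df["Month"] = df["Start_Time"].dt.month
# Duration in whole minutes.
df["Duration"] = ((df["End_Time"] - df["Start_Time"]).dt.total_seconds() / 60).astype(int)

monthly = (
    df.groupby(["City", "Year", "Month"])["Duration"]
    .agg(Min="min", Max="max")
    .reset_index()
)

# Number of distinct months per (City, Year), aligned with every row.
n_months = monthly.groupby(["City", "Year"])["Month"].transform("nunique")
complete = monthly[n_months == 12]
print(complete)
```

The accident of city `B` that starts on 30 June at 23:30 ends in July but is still counted in June, and city `B` is dropped because it covers only two months.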

After creating the needed `DataFrame`, I can proceed to visualize the data in a table.

In [247]:
duration_template = """
<div style="color:dimgray;">
    <% if (value >= 1440) { %>
        <%= Math.floor(value / 1440) %>d <%= Math.floor((value % 1440)/60) %>h <%= (value % 1440) % 60 %>min
    <% } else if (value >= 60) { %>
        <%= Math.floor(value / 60) %>h <%= value % 60 %>min
    <% } else { %>
        <%= value %>min
    <% } %>
</div>
"""

month_template = """
<div style="color:dimgray;">
    <% 
    var monthNames = ["January", "February", "March", "April", "May", "June", 
                     "July", "August", "September", "October", "November", "December"];
    var monthName = monthNames[value - 1]; 
    %>
    <%= monthName %>
</div>
"""

header_hover_css = """
    .slick-header-column:hover {
        background-color: #eeeeee !important;
    }
    .slick-header-column:hover .slick-column-name {
        color: #111111 !important;
    }
"""

source = ColumnDataSource(df_accidents_sccym)
original_data = df_accidents_sccym.to_dict('list') 

columns = [
    TableColumn(field="State", title="State", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="County", title="County", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="City", title="City", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="Year", title="Year", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="Month", title="Month", formatter=HTMLTemplateFormatter(template=month_template)),
    TableColumn(field="Min", title="Min Duration", formatter=HTMLTemplateFormatter(template=duration_template)),
    TableColumn(field="Max", title="Max Duration", formatter=HTMLTemplateFormatter(template=duration_template))
]

data_table = DataTable(
    source=source, 
    columns=columns, 
    width=800, 
    height=400, 
    index_position=None,
    scroll_to_selection=False,
    stylesheets=[InlineStyleSheet(css=header_hover_css)]
)

# Create radio button group with labels
year_selector = RadioButtonGroup(
    labels=["Year 2020", "Year 2022", "All Years"],
    active=2,
    width=400
)

# JavaScript callback for filtering
filter_code = """
    var year_filter = null;
    switch (this.origin.active) {
        case 0: year_filter = 2020; break;
        case 1: year_filter = 2022; break;
        case 2: year_filter = null; break;
    }
    
    var new_data = {};
    var indices = [];
    
    // Find matching indices
    for (var i = 0; i < original_data.Year.length; i++) {
        if (year_filter === null || original_data.Year[i] === year_filter) {
            indices.push(i);
        }
    }
    
    // Create filtered dataset
    for (var key in original_data) {
        new_data[key] = [];
        for (var idx of indices) {
            new_data[key].push(original_data[key][idx]);
        }
    }
    
    source.data = new_data;
    source.change.emit();
"""

# Add callback to radio buttons
year_selector.js_on_event("button_click", CustomJS(
    args=dict(source=source, original_data=original_data),
    code=filter_code
))

centered_row = Row(
    children=[year_selector],
    align="center",          # Horizontal centering
)

# Show the components
show(column(data_table, centered_row))

After showing the information for both years, I can concentrate on `Los Angeles`, `Dallas` and `New York`.

In [248]:
cities = ['Los Angeles', 'Dallas', 'New York']
df_accidents_cities = df_accidents_sccym[
    (df_accidents_sccym['City'].isin(cities)) & 
    ((df_accidents_sccym['City'] != 'Dallas') | (df_accidents_sccym['County'] == 'Dallas')) & 
    (df_accidents_sccym['Year'] == 2022)
]
df_accidents_cities


Unnamed: 0,State,County,City,Year,Month,Min,Max
3036,CA,Los Angeles,Los Angeles,2022,1,7,871
3037,CA,Los Angeles,Los Angeles,2022,2,6,1003
3038,CA,Los Angeles,Los Angeles,2022,3,7,967
3039,CA,Los Angeles,Los Angeles,2022,4,7,976
3040,CA,Los Angeles,Los Angeles,2022,5,7,1555
3041,CA,Los Angeles,Los Angeles,2022,6,9,7077
3042,CA,Los Angeles,Los Angeles,2022,7,6,1052
3043,CA,Los Angeles,Los Angeles,2022,8,6,1950
3044,CA,Los Angeles,Los Angeles,2022,9,7,1490
3045,CA,Los Angeles,Los Angeles,2022,10,7,10710


In [249]:
cities = ['Los Angeles', 'Dallas', 'New York']
mins = {city: float(df_accidents_cities[df_accidents_cities['City'] == city]['Min'].mean()) for city in cities}
maxs = {city: float(df_accidents_cities[df_accidents_cities['City'] == city]['Max'].mean()) for city in cities}

global_min_y = (df_accidents_cities['Min'].min(), int(df_accidents_cities['Min'].max()*1.1))
global_max_y = (df_accidents_cities['Max'].min(), int(df_accidents_cities['Max'].max()*1.1))

In [250]:
def format_duration(minutes):
    days = minutes // (24 * 60)
    hours = (minutes % (24 * 60)) // 60
    remaining = minutes % 60
    if days > 0:
        return f"{days}d {hours}h" if hours > 0 else f"{days}d"
    elif hours > 0:
        return f"{hours}h {remaining}m" if remaining > 0 else f"{hours}h"
    else:
        return f"{remaining}m"

df_accidents_cities['MinTimeStamp'] = df_accidents_cities['Min'].apply(format_duration)
df_accidents_cities['MaxTimeStamp'] = df_accidents_cities['Max'].apply(format_duration)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_cities['MinTimeStamp'] = df_accidents_cities['Min'].apply(format_duration)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_accidents_cities['MaxTimeStamp'] = df_accidents_cities['Max'].apply(format_duration)


In [251]:
code = """
        const minutes = tick;
        const days = Math.floor(minutes / (24*60));
        const remaining = minutes % 60;
        const hours = Math.floor(minutes/60) % (24);
        if (days === 0) {                                   
            if (hours === 0) {
                if (remaining === 0) {
                    return `${hours}`;
                }
                else {
                    return `${remaining}m`;
                }
            }
            else {
                 return `${hours}h`;
            }
        }
        return `${days}d`;
"""

# Create plot function
def create_city_plot(df, city, year, duration_type='Min'):
    """
    Creates a Bokeh plot for accident durations by month with consistent y-axis.
    """
    # Filter data
    city_data = df[(df['City'] == city) & (df['Year'] == year)].copy()
    city_data['Month Name'] = city_data['Month'].apply(lambda x: calendar.month_abbr[x])
    
    source = ColumnDataSource(city_data)
    
    duration_label = 'minimum' if duration_type == 'Min' else 'maximum'
    title = f"{city} {duration_label} accident duration in {year}"
    
    x_range = sorted(city_data['Month Name'].unique(), key=lambda m: list(calendar.month_abbr).index(m))
    
    # Y-axis range based on global min/max values
    y_range = global_min_y if duration_type == 'Min' else global_max_y
    
    # Get mean value based on duration type
    mean_value = mins[city] if duration_type == 'Min' else maxs[city]
    
    # Create Bokeh plot
    p = figure(
        title=title,
        x_range=x_range,
        x_axis_label=None,
        y_axis_label="Accident Duration",
        width=500,
        height=300
    )
    
    # Custom formatter for hover and axis
    time_formatter = CustomJSHover(code=code)
    
    # Add HoverTool with custom formatting
    hover = HoverTool(
        tooltips=[
            ("Month", "@{Month Name}"),
            ("Accident Duration", "@{" + duration_type + "TimeStamp" + "}")
        ]
    )
    p.add_tools(hover)
    
    # Line and scatter plot
    p.line(x='Month Name', y=duration_type, source=source, line_width=2, color='steelblue')
    p.scatter(x='Month Name', y=duration_type, source=source, size=1, alpha=1, color='steelblue')
    
    # Add horizontal mean line
    mean_line = Span(
        location=mean_value, 
        dimension='width',
        line_color='orange', 
        line_dash='dashed', 
        line_width=2)
    p.add_layout(mean_line)
    
    # Custom y-axis formatter
    p.yaxis.formatter = CustomJSTickFormatter(code=code)
    
    # Style adjustments
    p.toolbar.logo = None
    p.toolbar_location = None
    p.ygrid.grid_line_color = None
    p.yaxis.minor_tick_line_color = None
    p.xaxis.minor_tick_line_color = None
    p.y_range.start = 0
    
    return p
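
The tick-formatting rule in the JavaScript snippet above (show only the largest non-zero unit) can be checked outside the browser with a small Python mirror. This is only a sketch: `format_tick` is a hypothetical helper, not part of the notebook, and it assumes non-negative whole minutes.

```python
def format_tick(minutes: int) -> str:
    """Mirror of the JavaScript formatter: keep only the largest non-zero unit."""
    days = minutes // (24 * 60)
    hours = (minutes // 60) % 24
    remaining = minutes % 60
    if days > 0:
        return f"{days}d"
    if hours > 0:
        return f"{hours}h"
    if remaining > 0:
        return f"{remaining}m"
    return "0"


# A few sample tick values, in minutes
for value in (0, 45, 90, 1500):
    print(value, "->", format_tick(value))
```

Because smaller units are dropped, 90 minutes is labelled `1h` and 1500 minutes `1d`, which keeps the axis compact at the cost of precision.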

In [252]:
# Create plots with shared y-axis ranges
columns = []

for city in cities:
    columns.append(column(
        create_city_plot(df_accidents_cities, city, 2022, 'Min'),
        create_city_plot(df_accidents_cities, city, 2022, 'Max')
    ))

# Show the grid layout
show(row(*columns))

3. Please filter the data for the years 2019 to 2023 and divide it into two bins based on the `Year` value. Then, calculate the duration ranges for each bin, grouped by `County` and `City`. Classify accidents by congestion level:

    - Accidents affecting a road length greater than the median of `Distance(mi)` across the dataset are considered **severe**.
    - Those below the median are categorized as **not severe**.

    The resulting dataframe should have `County` and `City` as row indices, with year bins and severity (severe/not severe) as hierarchical columns. The values in the dataframe should represent the range of distances, with severe accidents placed under the "Severe" column and non-severe accidents under the "Not Severe" column. Each cell should display the range of distances for a specific city, county, and year interval. For this exercise, you have to use `groupby()` and __cannot__ rely on `pivot_table()`.
    
    What is the combination of county-city-year-range with the widest range of accidents duration?
    
    
    The following table shows how the dataframe should look:

<br>
YB = Year bin range
<br>
DB = Range of minimum and maximum durations
<br>

<table>
    <tr>
        <th rowspan="2">County</th>
        <th rowspan="2">City</th>
        <th colspan="2">Not Severe</th> 
        <th colspan="2">Severe</th>
    </tr>
    <tr>
        <th>YB</th>
        <th>YB</th>
        <th>YB</th>
        <th>YB</th>
    </tr>
    <tr>
        <th>Abbeville</th>
        <th>Bradley</th>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
    </tr>
    <tr>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
    </tr>
    <tr>
        <th>Yuma</th>
        <th>Dateland</th>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
    </tr>
    <tr>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
    </tr>
</table>

#### Step 4

Here I prepare the DataFrame to show all the data requested. 

First let's add `Year`, `Month` and `Duration` as new columns.

In [253]:
df_accidents_cleaned = df_accidents_cleaned.dropna(subset=['Start_Time', 'End_Time'])
df_accidents_cleaned['Year'] = df_accidents_cleaned['Start_Time'].dt.year
df_accidents_cleaned['Month'] = df_accidents_cleaned['Start_Time'].dt.month
# Duration in whole minutes (the fractional part is truncated)
df_accidents_cleaned['Duration'] = ((df_accidents_cleaned['End_Time'] - df_accidents_cleaned['Start_Time']).dt.total_seconds() / 60).astype(int)

In [254]:
df_accidents_cleaned['Year'].value_counts()

Year
2022    1421077
2021    1035468
2020     649322
2023     229087
2019     198640
2018       9549
2017       7754
2016       3652
Name: count, dtype: int64

Then I filter the accidents between `2019` and `2023`, split them into two year bins and calculate the `median` of `Distance(mi)`.

In [255]:
years_2 = [2019, 2023]
metric = 'Distance(mi)'

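# .copy() gives an independent DataFrame, so adding columns below does not raise SettingWithCopyWarning
df_accidents_19_23 = df_accidents_cleaned.query(f'Year >= {years_2[0]} and Year <= {years_2[1]}').copy()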

accidents_bins = pd.cut(df_accidents_19_23['Year'], bins=2)
df_accidents_19_23['Year Bin'] = accidents_bins.apply(lambda x: f"[{int(x.left) + 1}, {int(x.right)}]")


median = float(df_accidents_19_23[metric].median())



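As a possible alternative to `bins=2` plus string formatting, the bin edges and labels can be passed to `pd.cut` explicitly. The sketch below uses a toy series of years, not the accident data, and the edges are chosen here to reproduce the same two intervals.

```python
import pandas as pd

years = pd.Series([2019, 2020, 2021, 2022, 2023])

# Explicit right-closed edges: (2018, 2021] and (2021, 2023]
year_bin = pd.cut(
    years,
    bins=[2018, 2021, 2023],
    labels=["[2019, 2021]", "[2022, 2023]"],
)
print(year_bin.tolist())
```

With explicit edges the labels do not depend on how `pd.cut` pads the lowest edge of automatically computed bins.
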
Now I flag each accident as severe when its `Distance(mi)` is greater than the median, then group the accidents by `County`, `City`, `Year Bin` and `Severe`:

In [256]:
# Vectorized comparison: True when the affected road length exceeds the median
df_accidents_19_23['Severe'] = df_accidents_19_23[metric] > median
# observed=False is spelled out to keep the existing grouping behaviour for categorical keys
df_accidents_19_23_grouped = df_accidents_19_23.groupby(['County', 'City', 'Year Bin', 'Severe'], observed=False)[metric].agg(lambda x: f"[{round(x.min(), 3)}, {round(x.max(), 3)}]")


Here we have a first representation with `NaN`s

In [257]:
final_table = df_accidents_19_23_grouped.unstack(level=[2, 3])

final_table.columns = pd.MultiIndex.from_tuples(
    [('Not Severe', str(col[0])) if col[1] == False else ('Severe', str(col[0])) for col in final_table.columns],
    names=[None, None]
)

final_table = final_table.sort_index(axis=1, level=0)

final_table

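<table>
    <tr>
        <th rowspan="2">County</th>
        <th rowspan="2">City</th>
        <th colspan="2">Not Severe</th>
        <th colspan="2">Severe</th>
    </tr>
    <tr>
        <th>[2019, 2021]</th>
        <th>[2022, 2023]</th>
        <th>[2019, 2021]</th>
        <th>[2022, 2023]</th>
    </tr>
    <tr><th>Abbeville</th><th>Aaronsburg</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><th>Abbeville</th><th>Abbeville</th><td>[0.008, 0.263]</td><td>[0.009, 0.259]</td><td>[0.281, 1.962]</td><td>[0.265, 0.956]</td></tr>
    <tr><th>Abbeville</th><th>Abbotsford</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><th>Abbeville</th><th>Abbottstown</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><th>Abbeville</th><th>Aberdeen</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
    <tr><th>Yuma</th><th>Zortman</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><th>Yuma</th><th>Zumbro Falls</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><th>Yuma</th><th>Zumbrota</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
    <tr><th>Yuma</th><th>Zuni</th><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>
</table>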

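The all-`NaN` rows above are what a `groupby` produces when the grouping keys are categorical and unobserved combinations are kept. The sketch below illustrates this on toy data; it assumes that `County` and `City` are stored as categorical columns, which the table suggests but which is not shown in this part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "County": pd.Categorical(["A", "A", "B"]),
    "City": pd.Categorical(["x", "x", "y"]),
    "Distance": [0.1, 0.9, 0.2],
})

# observed=False keeps every County x City combination, even those with no rows
all_pairs = toy.groupby(["County", "City"], observed=False)["Distance"].max()
# observed=True keeps only the combinations that actually occur in the data
seen_pairs = toy.groupby(["County", "City"], observed=True)["Distance"].max()

print(len(all_pairs), int(all_pairs.isna().sum()))
print(len(seen_pairs))
```

Under that assumption, passing `observed=True` would be another way to avoid the empty rows before reshaping.
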

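To see only complete rows, I drop every row that contains a `NaN` and display the table. Note that `dropna()` keeps only the county-city pairs that have a range in all four columns: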

In [258]:
final_table.dropna()

<table>
    <tr>
        <th rowspan="2">County</th>
        <th rowspan="2">City</th>
        <th colspan="2">Not Severe</th>
        <th colspan="2">Severe</th>
    </tr>
    <tr>
        <th>[2019, 2021]</th>
        <th>[2022, 2023]</th>
        <th>[2019, 2021]</th>
        <th>[2022, 2023]</th>
    </tr>
    <tr><th>Abbeville</th><th>Abbeville</th><td>[0.008, 0.263]</td><td>[0.009, 0.259]</td><td>[0.281, 1.962]</td><td>[0.265, 0.956]</td></tr>
    <tr><th>Abbeville</th><th>Calhoun Falls</th><td>[0.04, 0.247]</td><td>[0.043, 0.207]</td><td>[0.406, 0.406]</td><td>[0.282, 0.492]</td></tr>
    <tr><th>Abbeville</th><th>Donalds</th><td>[0.013, 0.253]</td><td>[0.016, 0.233]</td><td>[0.265, 0.771]</td><td>[0.332, 0.748]</td></tr>
    <tr><th>Abbeville</th><th>Due West</th><td>[0.097, 0.117]</td><td>[0.056, 0.181]</td><td>[0.281, 1.382]</td><td>[0.272, 0.839]</td></tr>
    <tr><th>Abbeville</th><th>Honea Path</th><td>[0.027, 0.252]</td><td>[0.012, 0.254]</td><td>[0.284, 0.583]</td><td>[0.288, 0.655]</td></tr>
    <tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
    <tr><th>Yuba</th><th>Oregon House</th><td>[0.01, 0.255]</td><td>[0.026, 0.255]</td><td>[0.308, 1.873]</td><td>[0.287, 0.645]</td></tr>
    <tr><th>Yuba</th><th>Smartsville</th><td>[0.0, 0.239]</td><td>[0.009, 0.261]</td><td>[0.289, 7.606]</td><td>[0.272, 0.683]</td></tr>
    <tr><th>Yuba</th><th>Wheatland</th><td>[0.0, 0.254]</td><td>[0.01, 0.261]</td><td>[0.265, 2.576]</td><td>[0.265, 2.873]</td></tr>
    <tr><th>Yuma</th><th>Roll</th><td>[0.012, 0.012]</td><td>[0.097, 0.26]</td><td>[0.551, 13.724]</td><td>[0.461, 11.915]</td></tr>
</table>


Now I compute the range of accident `Duration` (in minutes) for each county-city-year bin-severity combination and show the widest one:

In [None]:
df_accidents_19_23_ranged = df_accidents_19_23.groupby(['County', 'City', 'Year Bin', 'Severe'])['Duration'].agg(["min", "max"])
df_accidents_19_23_ranged['range'] = df_accidents_19_23_ranged['max'] - df_accidents_19_23_ranged['min']
df_accidents_19_23_ranged.sort_values(by='range', ascending=False, inplace=True)
df_accidents_19_23_ranged = df_accidents_19_23_ranged.reset_index()

max_range_row = df_accidents_19_23_ranged.iloc[0]

print(f"The widest range {int(max_range_row['range'])} is in {max_range_row['City']}, {max_range_row['County']} "
      f"(Year Bin: {max_range_row['Year Bin']}, Severe: {max_range_row['Severe']}): "
      f"[{int(max_range_row['min'])}, {int(max_range_row['max'])}]")



The widest range 1579244 is in Norristown, Montgomery (Year Bin: [2019, 2021], Severe: True): [15, 1579259]

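The same lookup can also be done without sorting the whole table, by asking for the index label of the largest range. This is a minimal sketch on toy data (the column names mirror the ones used above, the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "City": ["x", "x", "y", "y", "y"],
    "Severe": [True, True, False, False, False],
    "Duration": [10, 70, 5, 30, 400],
})

ranges = toy.groupby(["City", "Severe"])["Duration"].agg(["min", "max"])
ranges["range"] = ranges["max"] - ranges["min"]

# idxmax returns the (City, Severe) label of the widest range
widest_key = ranges["range"].idxmax()
widest = ranges.loc[widest_key]
print(widest_key, int(widest["min"]), int(widest["max"]), int(widest["range"]))
```
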

### Section 2.2 
**Data Source:** `eu_energy.csv`

Please note that:

- EU countries are the following: Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden
- Renewable energy sources: Hydroelectric power, solar power, wind power, biofuel
- Non-renewable energy sources: Coal, fossil fuels, gas, oil, nuclear
- Clean energy sources: Hydroelectric power, solar power, wind power, nuclear
- Non-clean energy sources: Biofuel, coal, fossil fuels, gas, oil

4. Please provide a visualization that highlights the relationship between:
    - Population size;
    - CO2 emissions per capita;
    - Renewable energy production.

    in 2017. Describe the visualization identifying groups and outliers.

In [259]:
from bokeh.models import Select, CheckboxGroup, RadioGroup
from bokeh.palettes import Turbo256

In [None]:
df_energy = pd.read_csv('datasets/eu_energy.csv')
# rename() returns a new DataFrame, so the result must be assigned back
df_energy = df_energy.rename(columns=lambda x: x.strip())
df_energy['population'] = df_energy['population'].apply(int)

#### Step 1
I prepare the data with the metrics needed: I filter the EU countries, sum the `renewable energy` sources into one column and calculate the `per capita` values.

To make the population readable as a marker size, I rescale it with the square-root formula in the cell below.

In [261]:
year = 2017
renewable = ['hydro', 'solar', 'wind', 'biofuel']
eu_countries = [
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czechia", 
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", 
    "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", 
    "Malta", "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", 
    "Slovenia", "Spain", "Sweden"
]

df_energy = df_energy[df_energy['country'].isin(eu_countries)]
df_energy['renewable'] = sum(df_energy[x + '_electricity'].fillna(0) for x in renewable)
df_energy['greenhouse_gas_emissions_rate'] = (df_energy['greenhouse_gas_emissions'].fillna(0) * 1e6) / df_energy['population'].fillna(1)

# .copy() gives an independent DataFrame, so adding columns does not raise SettingWithCopyWarning
df_eu_2017 = df_energy.query(f'year == {year}').copy()
df_eu_2017['emissions_per_capita'] = df_eu_2017['greenhouse_gas_emissions'] * 1e6 / df_eu_2017['population']
# Marker size: square root of the population in millions, rescaled and shifted
df_eu_2017['population_size'] = (df_eu_2017['population'] / 1e6) ** 0.5 * 5 + 5


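The `population_size` formula in the cell above can be read as follows: the square root makes the marker's diameter grow slowly with population, so that its area stays roughly proportional to the number of inhabitants. A small sketch of the same formula, where `marker_size` is a hypothetical helper and the two populations are invented:

```python
import math


def marker_size(population: int, scale: float = 5.0, offset: float = 5.0) -> float:
    """Square root of the population in millions, rescaled and shifted."""
    return math.sqrt(population / 1e6) * scale + offset


small = marker_size(1_000_000)   # 1 million inhabitants
large = marker_size(4_000_000)   # 4 times as many inhabitants

# The square-root part doubles (5 -> 10) when the population quadruples
print(small, large)
```

The constant offset keeps the smallest countries visible, at the price of making the area only approximately proportional to the population.
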
I chose a `bubble chart` because it shows three metrics at once, and the tooltips compensate for how poorly humans quantify differences between areas.

In [298]:
source = ColumnDataSource(df_eu_2017)

p = figure(title="EU Countries: CO₂ vs Renewable Energy (2017)", 
           x_axis_label='CO₂ Emissions per Capita (tonnes)',
           y_axis_label='Renewable Energy (TWh)',
           tools="pan,wheel_zoom,box_zoom,reset,save",
           width=800, height=600)

scatter = p.scatter(
    x='emissions_per_capita',
    y='renewable',
    size='population_size',
    source=source,
    fill_color='steelblue',
    fill_alpha=0.9
)

hover = HoverTool(tooltips=[
    ("Country", "@country"),
    ("CO₂ per capita", "@emissions_per_capita{0.00} tonnes"),
    ("Renewables", "@renewable{0.00} TWh"),
    ("Population", "@population{0,0}")
])
p.add_tools(hover)

p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.xaxis.minor_tick_line_color = None
p.yaxis.minor_tick_line_color = None
p.x_range.start = 0
p.y_range.start = 0

show(p)

5. Please compute the renewable energy production percentage (one datapoint per country, per year). Then, create a visualization to investigate how the distribution of these values evolves over the years, from 2010 to 2017.

#### Step 2
I start by calculating the `total_electricity` produced by each country, so that I can then compute the share that comes from `renewable` sources.

In [263]:
year_range = list(range(2010, 2018))
# .copy() gives an independent DataFrame, so adding columns does not raise SettingWithCopyWarning
df_energy_10_17 = df_energy.query('year in @year_range').copy()
# Total production: sum of every column whose name contains 'electricity'
electricity_cols = [col for col in df_energy_10_17.columns if 'electricity' in col.lower()]
df_energy_10_17['total_electricity'] = df_energy_10_17[electricity_cols].sum(axis=1)
df_energy_10_17['renewable_rate'] = (df_energy_10_17['renewable'] / df_energy_10_17['total_electricity']) * 100


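Selecting columns by name is convenient, but it is worth checking that none of the matched columns is itself an aggregate of the others. The sketch below uses invented numbers and assumes a hypothetical `fossil_electricity` column equal to coal plus gas; whether `eu_energy.csv` contains such a column is not verified here.

```python
import pandas as pd

# Invented values in TWh; fossil_electricity is an assumed aggregate (coal + gas)
toy = pd.DataFrame({
    "country": ["A", "B"],
    "hydro_electricity": [30.0, 5.0],
    "wind_electricity": [10.0, 5.0],
    "coal_electricity": [40.0, 60.0],
    "gas_electricity": [20.0, 30.0],
    "fossil_electricity": [60.0, 90.0],
})

renewable_cols = ["hydro_electricity", "wind_electricity"]
source_cols = renewable_cols + ["coal_electricity", "gas_electricity"]

toy["renewable"] = toy[renewable_cols].sum(axis=1)
# Explicit list of individual sources
toy["total_explicit"] = toy[source_cols].sum(axis=1)
# Name-based selection also picks up the aggregate column
toy["total_by_name"] = toy[[c for c in toy.columns if "electricity" in c]].sum(axis=1)

toy["share_explicit"] = toy["renewable"] / toy["total_explicit"] * 100
toy["share_by_name"] = toy["renewable"] / toy["total_by_name"] * 100
print(toy[["country", "share_explicit", "share_by_name"]])
```

In this toy case the name-based total counts coal and gas twice, so the renewable share comes out lower than with the explicit list.
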
Before plotting the evolution over the years for each country, I need to derive a few values from the dataset.

Since it is impossible to show all the countries at once in a readable way, I add some controls for the viewer: checkboxes to choose which countries to display, plus filters for the `top 5` and `bottom 5` countries by average renewable rate.

In [264]:
countries = sorted(df_energy_10_17['country'].unique())
avg_rate = df_energy_10_17.groupby('country')['renewable_rate'].mean()
top5 = list(avg_rate.sort_values(ascending=False).head(5).index)
bottom5 = list(avg_rate.sort_values(ascending=True).head(5).index)

In [265]:
p = figure(
    title="Renewable Energy % by Country (2010-2017)",
    x_axis_label="Year",
    y_axis_label="Renewable Energy (%)",
    tools="pan,wheel_zoom,box_zoom,reset,save",
    width=1000,
    height=600,
    x_range=(2010, 2017)
)

palette = Turbo256[::max(1, len(Turbo256)//len(countries))]
country_colors = {country: palette[i % len(palette)] for i, country in enumerate(countries)}

renderers_dict = {}
for country in countries:
    country_df = df_energy_10_17[df_energy_10_17['country'] == country].sort_values('year')
    source = ColumnDataSource(country_df)
    
    line_r = p.line(
        x='year',
        y='renewable_rate',
        source=source,
        line_width=2,
        color=country_colors[country]
    )
    
    renderers_dict[country] = [line_r]

hover = HoverTool(tooltips=[
    ("Country", "@country"),
    ("Year", "@year"),
    ("Renewable Rate", "@renewable_rate{0.0}%"),
    ("Renewable Production", "@renewable{0.00} TWh")
])
p.add_tools(hover)


checkbox = CheckboxGroup(labels=countries, active=list(range(len(countries))))
checkbox.tags = []

select_filter = Select(title="Filter countries:", value="All", options=["All", "None", "Top 5", "Bottom 5"])

select_callback = CustomJS(args=dict(checkbox=checkbox,
                                     select_filter=select_filter,
                                     countries=countries,
                                     top5=top5,
                                     bottom5=bottom5,
                                     renderers_dict=renderers_dict),
code="""
    var newActive = [];
    if (select_filter.value === "Top 5") {
        for (var i = 0; i < countries.length; i++){
            if (top5.indexOf(countries[i]) >= 0) {
                newActive.push(i);
            }
        }
    } else if (select_filter.value === "Bottom 5") {
        for (var i = 0; i < countries.length; i++){
            if (bottom5.indexOf(countries[i]) >= 0) {
                newActive.push(i);
            }
        }
    } else if (select_filter.value === "None") {
        // "None": no countries selected.
        newActive = [];
    } else {
        // "All": every index.
        for (var i = 0; i < countries.length; i++){
            newActive.push(i);
        }
    }
    // Use the tags field to indicate we're suppressing the checkbox callback.
    checkbox.tags = ["suppress"];
    checkbox.active = newActive;
    
    // Update glyphs visibility based on allowed countries.
    var allowed = newActive.map(i => countries[i]);
    for (var i = 0; i < countries.length; i++) {
        var country = countries[i];
        var vis = allowed.indexOf(country) >= 0;
        var renderers = renderers_dict[country];
        for (var j = 0; j < renderers.length; j++) {
            renderers[j].visible = vis;
        }
    }
""")
select_filter.js_on_change('value', select_callback)

checkbox_callback = CustomJS(args=dict(checkbox=checkbox,
                                        select_filter=select_filter,
                                        countries=countries,
                                        renderers_dict=renderers_dict),
code="""
    // If the selection was changed by the select widget, check for the suppression flag.
    if (checkbox.tags.indexOf("suppress") >= 0) {
        // Remove the flag and skip further processing.
        var idx = checkbox.tags.indexOf("suppress");
        checkbox.tags.splice(idx, 1);
        return;
    }
    // Otherwise, update glyph visibility based on manually toggled checkboxes.
    var active = checkbox.active;
    var activeCountries = active.map(i => countries[i]);
    for (var i = 0; i < countries.length; i++){
        var country = countries[i];
        var vis = activeCountries.indexOf(country) >= 0;
        var renderers = renderers_dict[country];
        for (var j = 0; j < renderers.length; j++){
            renderers[j].visible = vis;
        }
    }
""")
checkbox.js_on_change('active', checkbox_callback)


p.ygrid.grid_line_color = None
p.xaxis.minor_tick_line_color = None
p.yaxis.minor_tick_line_color = None
p.y_range.start = 0

layout = row(p, checkbox, select_filter)
show(layout)

6. Please provide visualizations that show the evolution over the years (from 1990 to 2020) of:
    - Renewable energy production per capita for each country
    - Clean energy production per capita for each country
    - Net import per capita for each country

    Are there countries that behave differently from the others?

    *Please note that the goal of the visualization is not to compare all the countries with each other but to identify which ones present different trends compared to all the others.*

#### Step 3
Let's begin by calculating the data needed.

In [266]:
clean = ['hydro', 'solar', 'wind', 'nuclear']
year_range = [x for x in range (1990, 2021)]

df_energy['renewable_per_capita'] = (df_energy['renewable']*10e6)/df_energy['population']
df_energy['clean'] = sum(df_energy[x + '_electricity'].fillna(0) for x in clean)
df_energy['clean_per_capita'] = (df_energy['clean']*10e6)/df_energy['population']
df_energy['net_import'] = (df_energy['net_elec_imports']*10e6)/df_energy['population']
df_energy_90_20 = df_energy.query('year in @year_range')
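
A quick way to sanity-check the per capita conversion is to work through one number by hand. The sketch below assumes the electricity columns are expressed in TWh (the unit used on the bubble chart axis) and uses an invented country; `per_capita_kwh` is a hypothetical helper, not part of the notebook.

```python
KWH_PER_TWH = 1e9  # 1 TWh = 1,000,000,000 kWh


def per_capita_kwh(energy_twh: float, population: int) -> float:
    """Convert a national total in TWh into kWh per person."""
    return energy_twh * KWH_PER_TWH / population


# Invented country: 50 TWh produced, 10 million inhabitants
example = per_capita_kwh(50.0, 10_000_000)
print(example)

# Python reads the literal 10e6 as 10 * 10**6, i.e. 1e7
print(10e6 == 1e7)
```

If the source unit is different, only the constant changes; the point is that the multiplier and the `kWh/person` label shown in the tooltips should agree.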

For the graph, I decided that the best way to show everything is to plot one `Country` and one `metric` at a time, while keeping the `global trend` (the yearly average across all countries) visible, so that the difference from the rest of the countries is easier to see.

In [267]:
def get_filtered_data(df, country, metric):
    df_sel = df[df['country'] == country].sort_values('year')
    return dict(year=df_sel['year'].tolist(), y=df_sel[metric].tolist())

default_metric = "renewable_per_capita"
default_country = "Italy"

filtered_dict = get_filtered_data(df_energy_90_20, default_country, default_metric)
filtered_dict["unit"] = ["kWh/person"] * len(filtered_dict["year"])

sorted_countries = sorted(df_energy_90_20['country'].unique())
italy_index = sorted_countries.index("Italy")

In [268]:
def get_global_trend(df, metric):
    grouped = df.groupby("year")[metric].mean().reset_index()
    return dict(year=grouped["year"].tolist(), y=grouped[metric].tolist())

default_global_dict = get_global_trend(df_energy_90_20, default_metric)
global_source = ColumnDataSource(data=default_global_dict)

In [304]:
from bokeh.io import show
from bokeh.layouts import row, column
from bokeh.models import (
    ColumnDataSource, HoverTool, CustomJS, RadioButtonGroup, RadioGroup,
    Span
)
from bokeh.plotting import figure
import pandas as pd
from collections import defaultdict

# ------------------------------
# Prepare the Data
# ------------------------------
# Assume df_energy_90_20 is your main DataFrame with columns:
# 'year', 'country', 'renewable_per_capita', 'clean_per_capita', 'net_import'
default_metric = "renewable_per_capita"
default_country = "Italy"

sorted_countries = sorted(df_energy_90_20['country'].unique())
italy_index = sorted_countries.index(default_country)

# Full flat source for all data
source_full = ColumnDataSource(df_energy_90_20.to_dict(orient='list'))

# Filtered source for selected country
def get_filtered_data(df, country, metric):
    df_sel = df[df['country'] == country].sort_values('year')
    return dict(year=df_sel['year'].tolist(), y=df_sel[metric].tolist(), unit=["kWh/person"] * len(df_sel))

filtered_dict = get_filtered_data(df_energy_90_20, default_country, default_metric)
source_filtered = ColumnDataSource(data=filtered_dict)

# Global trend for default metric
global_data = df_energy_90_20.groupby('year')[default_metric].mean().reset_index()
global_source = ColumnDataSource(data={
    'year': global_data['year'].tolist(),
    'y': global_data[default_metric].tolist()
})

# Precompute xs, ys, alpha for all country lines
grouped = df_energy_90_20.groupby("country")
xs = []
ys = []
alpha = []
for name, group in grouped:
    sorted_group = group.sort_values("year")
    xs.append(sorted_group["year"].tolist())
    ys.append(sorted_group[default_metric].tolist())
    alpha.append(1.0 if name == default_country else 0.6)

all_lines_source = ColumnDataSource(data=dict(xs=xs, ys=ys, alpha=alpha))

# ------------------------------
# Create the Plot
# ------------------------------
p = figure(
    title="Per Capita Energy Metrics by Year",
    x_axis_label="Year", 
    y_axis_label="Value",
    width=800, 
    height=400, 
    x_range=(1990, 2020)
)

# Plot all country lines (grey)
p.multi_line(xs='xs', ys='ys', line_alpha='alpha', line_color="lightgray", line_width=1.5, source=all_lines_source)

# Plot selected country
p.line('year', 'y', source=source_filtered, line_width=2, line_color="navy")
p.scatter('year', 'y', source=source_filtered, size=2, color="navy")

# Global trend
p.line('year', 'y', source=global_source, line_width=2, line_dash="dashed", color="indianred", legend_label="Global Trend")

# Static x-axis baseline
x_axis_span = Span(location=0, dimension='width', line_color='black', line_width=1)
p.add_layout(x_axis_span)

# Hover
hover = HoverTool(tooltips=[
    ("Year", "@year"),
    ("Renewable per capita", "@y{0.0} kWh/person")
])
p.add_tools(hover)

# ------------------------------
# UI Controls
# ------------------------------
metric_labels = ["Renewable per capita", "Clean per capita", "Energy import per capita"]
metric_radio = RadioButtonGroup(labels=metric_labels, active=0, align="center")
country_radio = RadioGroup(labels=sorted_countries, active=italy_index)

# ------------------------------
# Callback
# ------------------------------
callback = CustomJS(args=dict(
    source_full=source_full,
    source_filtered=source_filtered,
    global_source=global_source,
    metric_radio=metric_radio,
    country_radio=country_radio,
    hover=hover,
    p=p,
    all_lines=all_lines_source
), code="""
    const metrics = {
        0: "renewable_per_capita",
        1: "clean_per_capita",
        2: "net_import"
    };

    const units = {
        0: "kWh/person",
        1: "kWh/person",
        2: "kWh/person"
    };

    const metric_labels = {
        0: "Renewable per capita",
        1: "Clean per capita",
        2: "Energy import per capita"
    };

    const metric_col = metrics[metric_radio.active];
    const unit = units[metric_radio.active];
    const metric_label = metric_labels[metric_radio.active];
    const sel_country = country_radio.labels[country_radio.active];

    const full = source_full.data;
    const years = full['year'];
    const countries = full['country'];
    const metric_vals = full[metric_col];

    let filt_years = [];
    let filt_vals = [];
    let filt_units = [];

    for (let i = 0; i < years.length; i++) {
        if (countries[i] === sel_country) {
            filt_years.push(years[i]);
            filt_vals.push(metric_vals[i]);
            filt_units.push(unit);
        }
    }

    source_filtered.data = { "year": filt_years, "y": filt_vals, "unit": filt_units };
    source_filtered.change.emit();

    // --- Update Global Trend ---
    const global_obj = {};
    for (let i = 0; i < years.length; i++) {
        const yr = years[i];
        if (!(yr in global_obj)) {
            global_obj[yr] = [];
        }
        global_obj[yr].push(metric_vals[i]);
    }

    let global_years = [];
    let global_avgs = [];
    for (let key in global_obj) {
        global_years.push(Number(key));
        const arr = global_obj[key];
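        // Skip missing values so the average matches pandas' mean(), which ignores NaN
        let sum = 0;
        let count = 0;
        for (let j = 0; j < arr.length; j++) {
            if (Number.isFinite(arr[j])) { sum += arr[j]; count += 1; }
        }
        global_avgs.push(count > 0 ? sum / count : NaN);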
    }

    // Sort
    const global_data = global_years.map((y, i) => ({ year: y, y: global_avgs[i] }));
    global_data.sort((a, b) => a.year - b.year);
    global_source.data = {
        year: global_data.map(d => d.year),
        y: global_data.map(d => d.y)
    };
    global_source.change.emit();

    // --- Update Hover ---
    hover.tooltips = [
        ["Year", "@year"],
        [metric_label, "@y{0.0} " + unit]
    ];

    // --- Update y_range.start if needed ---
    const min_val = Math.min(...filt_vals);
    p.y_range.start = (min_val < 0) ? min_val : 0;
    p.y_range.change.emit();

    // --- Update All Country Lines ---
    const grouped = {};
    for (let i = 0; i < years.length; i++) {
        const ctry = countries[i];
        const yr = years[i];
        const val = full[metric_col][i];
        if (!(ctry in grouped)) {
            grouped[ctry] = { x: [], y: [] };
        }
        grouped[ctry].x.push(yr);
        grouped[ctry].y.push(val);
    }

    const xs = [];
    const ys = [];
    const alphas = [];

    for (const c in grouped) {
        xs.push(grouped[c].x);
        ys.push(grouped[c].y);
        alphas.push(c === sel_country ? 1.0 : 0.6);
    }

    all_lines.data = { xs: xs, ys: ys, alpha: alphas };
    all_lines.change.emit();

    p.change.emit();
""")

metric_radio.js_on_change('active', callback)
country_radio.js_on_change('active', callback)

# ------------------------------
# Layout and Show
# ------------------------------
p.ygrid.grid_line_color = None
p.yaxis.minor_tick_line_color = None
p.xaxis.minor_tick_line_color = None

layout = row(column(p, metric_radio), country_radio)
show(layout)


## Section 3 - Graph Redesign and Analysis (15 points) 📊

**Data Source:** `market_value_decline.csv`

The 2008 financial crisis had a significant impact on banks worldwide, leading to substantial losses in market value. The following graph compares the market value of major banks in 2007 (pre-crisis) and 2009 (post-crisis), using blue to represent their value before the meltdown and green to represent their value after.

The **primary** goal of this visualization is to highlight the extent of losses suffered by each bank, while also drawing attention to J.P. Morgan’s relatively minor decline compared to its peers. The **secondary** goal is to illustrate the overall ranking of banks by market value, showing their relative sizes before and after the crisis.

Does this visualization effectively convey both the absolute losses and the percentage changes in market value? Does it allow for an easy comparison of which banks retained the most value relative to their original size?

1. Evaluate the effectiveness of the graph in communicating the market value losses and the relative sizes of the banks. What improvements can be made?
2. Propose a visualization that better captures both the absolute and relative losses per bank. Should we emphasize the percentage decline more? Should we use a different chart type?
3. Implement your proposed visualization using the *market_value_decline* dataset.

**Exercise Submission Requirements:**
1. `Written analysis` of the original graph's shortcomings: Please examine the existing graph and identify any issues that hinder its ability to clearly convey the intended quantitative message.
2. `Justifications` for the proposed improvements: For each issue you identify, please discuss potential improvements or alternative visualization techniques that might resolve these issues.
3. `Redesigned graph` that better communicates the data. Be sure to explain how your redesign enhances data interpretation and achieves the intended objectives more effectively.

![exercise3.png](exercise3.png)




### 1 Written analysis

The objective of the chart is to show the difference between the banks' market values before and after the 2008 crisis.

1. The main problem is that humans are not good at comparing areas. This forces the designer to print the value inside each circle, which decreases the `data-ink ratio` of the chart.

2. Another problem is in the legend: the comparison is made between a quarter of 2007 and a single month of 2009. Comparing mismatched periods is a way to manipulate the data so that they emphasize the crisis. The title makes this harder to notice, because it does not mention the periods at all.

3. The color choice could be a problem for blue-blind people, who see the chart as almost monochromatic, as you can see in the image below.

4. A less obvious problem is that the `2007` values are integers while the `2009` values are floating point, so the two are not reported with the same precision and the reader can compare them reliably only through the circles. Moreover, the unit of measurement is not defined.

5. Last but not least, the bank names are hard to read: they are rotated by 45 degrees and the center of each name is aligned with the center of the chart itself.

![blueblind-exercise3.png](blueblind-exercise3.png)

### 2 Justifications
A horizontal bar plot addresses both points `1.` and `5.`: bar lengths are easier to compare than areas, which improves the `data-ink ratio`, and the bank names can be written horizontally.

To solve issue `2.`, I would need the original data with values for matching periods (only months or only quarters) in both years, which, unfortunately, the given dataset does not provide.

To solve issue `4.`, I can round the `2009` values to integers. There is no clue about the unit of measurement, however, so that part cannot be solved.

For the remaining issue (`3.`), a more suitable pair of colors (such as blue and red) can be chosen.

### 3 Redesigned Graph

In [270]:
from bokeh.transform import dodge

In [271]:
df_bank = pd.read_csv('datasets/market_value_decline.csv')
df_bank

Unnamed: 0,;market_value_2007;market_value_2009
0,Morgan Stanley;49;16.0
1,RBS;120;4.6
2,Deutsche Bank;76;10.3
3,Credit Agricole;67;17.0
4,Societe Generale;80;26.0
5,Barclays;91;7.4
6,BNP Paribas;108;32.5
7,Unicredit;93;26.0
8,UBS;100;35.0
9,Credit Suisse;100;27.0


First, since the `csv` is not well formatted, I am going to reformat it.

In [272]:
with open('datasets/market_value_decline.csv', 'r') as file:
    lines = file.readlines()

lines[0] = "bank" + lines[0]

for i in range(len(lines)):
    lines[i] = lines[i].replace(';', ',')

with open('datasets/cleaned_market_data.csv', 'w') as file:
    file.writelines(lines)
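
As an alternative to rewriting the file, the semicolon-separated data could probably be read directly by passing `sep=';'` to `pd.read_csv`. The sketch below is only an illustration under assumptions: it uses an in-memory sample that mimics the first rows of the file (the real file is not reproduced here), and `raw` and `df_sample` are names introduced for this example.

```python
import io

import pandas as pd

# Hypothetical in-memory sample mimicking the layout of market_value_decline.csv:
# semicolon-separated, with an empty header cell for the first (bank) column.
raw = (
    ";market_value_2007;market_value_2009\n"
    "Morgan Stanley;49;16.0\n"
    "RBS;120;4.6\n"
)

# sep=';' tells pandas to split on semicolons instead of commas
df_sample = pd.read_csv(io.StringIO(raw), sep=";")

# The empty header cell gets an auto-generated name; rename it to 'bank'
df_sample = df_sample.rename(columns={df_sample.columns[0]: "bank"})
print(df_sample)
```

This avoids writing a second copy of the dataset to disk, at the cost of one extra rename step.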

In [273]:
df_bank = pd.read_csv('datasets/cleaned_market_data.csv')
df_bank.rename(columns=lambda x : x.strip(), inplace=True)
df_bank

Unnamed: 0,bank,market_value_2007,market_value_2009
0,Morgan Stanley,49,16.0
1,RBS,120,4.6
2,Deutsche Bank,76,10.3
3,Credit Agricole,67,17.0
4,Societe Generale,80,26.0
5,Barclays,91,7.4
6,BNP Paribas,108,32.5
7,Unicredit,93,26.0
8,UBS,100,35.0
9,Credit Suisse,100,27.0


In [301]:
df_bank['market_value_2009'] = round(df_bank['market_value_2009'])
df_bank['absolute_increment'] = df_bank['market_value_2009'] - df_bank['market_value_2007']
df_bank['relative_increment'] = (df_bank['absolute_increment'] / df_bank['market_value_2007']) * 100
df_bank.sort_values(by=['relative_increment'], ascending=True, inplace=True)

source = ColumnDataSource(data={
    "bank": df_bank.bank,
    "2007" : df_bank.market_value_2007,
    "2009": df_bank.market_value_2009,
    "absolute" : df_bank.absolute_increment,
    "relative" : df_bank.relative_increment
})

TOOLTIPS = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px; font-size: 12px;">
        <strong>Bank:</strong> @bank<br>
        <span style="color: lightcoral;"><strong>Market Cap 2007:</strong></span> @2007{0}<br>
        <span style="color: steelblue;"><strong>Market Cap 2009:</strong></span> @2009{0}<br>
        <span style="color: black;"><strong>Increment:</strong></span> @absolute{0} (@relative{0.00}%)<br>
    </div>
"""

# Create figure
p = figure(
    y_range=source.data['bank'],
    height=600,
    width=1000,
    tooltips=TOOLTIPS,
    title="Banks: Market Cap Comparison (2007 vs 2009)",
    x_axis_label=None,
    y_axis_label=None,
    toolbar_location=None
)

# Add bars with dodge offset
p.hbar(y=dodge('bank', 0.15, range=p.y_range), 
       right='2007', 
       height=0.3, 
       source=source,
       color='lightcoral', 
       legend_label="Market Cap 2007")

p.hbar(y=dodge('bank', -0.15, range=p.y_range), 
       right='2009', 
       height=0.3, 
       source=source,
       color='steelblue', 
       legend_label="Market Cap 2009")

# Styling
p.x_range.start = 0
p.xaxis.formatter = NumeralTickFormatter(format="0")
p.xaxis.minor_tick_line_color = None
p.ygrid.grid_line_color = None
p.legend.location = "bottom_right"
p.legend.orientation = "vertical"
p.legend.label_text_font_size = '10pt'

show(p)
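
As a sanity check on the derived columns, the short sketch below recomputes the absolute and relative change for three banks taken from the table above. The `sample` frame is introduced here for illustration only; it follows the same steps as the cell above (round, subtract, divide).

```python
import pandas as pd

# Three rows copied from the dataset shown above (illustrative subset)
sample = pd.DataFrame({
    "bank": ["Morgan Stanley", "RBS", "UBS"],
    "market_value_2007": [49, 120, 100],
    "market_value_2009": [16.0, 4.6, 35.0],
})

# Same steps as the notebook cell: round, then absolute and relative change
sample["market_value_2009"] = sample["market_value_2009"].round()
sample["absolute_increment"] = sample["market_value_2009"] - sample["market_value_2007"]
sample["relative_increment"] = sample["absolute_increment"] / sample["market_value_2007"] * 100

# Largest relative loss first
sample = sample.sort_values("relative_increment").reset_index(drop=True)
print(sample[["bank", "absolute_increment", "relative_increment"]])
```

With these rows, RBS shows the steepest relative decline (roughly 96%), even though its 2007 value was the largest of the three.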

## Section 4 - Geospatial Analysis (35 points) 🌍

**Data Source:** `airports.csv`, `countries.csv`, `routes.csv`, `europe.geojson`.

Please create an interactive map representation—focused on European countries—such that, when a country is selected, the map displays the flight balance (number of incoming flights - number of outgoing flights) between that country and all other European countries. The map should dynamically update based on the selected country, visually representing the extent to which each country is a net sender or receiver of flights.

**Hints**:
1. If `A` is a GeoDataFrame and `B` a DataFrame, the result of `A.merge(B,..)` is a GeoDataFrame, whereas the result of `B.merge(A,..)` is a DataFrame. The function `to_json()` on a DataFrame with a geometry column does **not** work.
2. When updating the map, to access the color mapper you can use the following method: `color_mapper = p.select_one(LinearColorMapper)`, where `p` is the figure.
3. You can discard Guernsey and Gibraltar that are not present in the geojson.

In [275]:
import geopandas as gpd

from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import RdYlGn11

In [276]:
df_airports = pd.read_csv('datasets/airports.csv', delimiter=",")
df_airports.rename(columns=lambda x : x.strip(), inplace=True)

df_countries = pd.read_csv('datasets/countries.csv', delimiter=';')
df_countries.rename(columns=lambda x : x.strip(), inplace=True)

df_routes = pd.read_csv('datasets/routes.csv', delimiter=';')
df_routes.rename(columns=lambda x : x.strip(), inplace=True)

gpd_europe = gpd.read_file('datasets/europe.geojson')



First of all, I am going to schematize the information needed per country:
- Name
- Number of incoming flights
- Number of outgoing flights
- Flight balance

Then, in the visual representation, I can add more information based on the data.

To count incoming and outgoing flights, I can simply `groupby` the destination and source airports, then compute the flight balance by subtracting the two values.

In [277]:
df_incoming = df_routes.groupby('destination_airport').size().reset_index(name='incoming')
df_incoming.rename(columns={'destination_airport': 'airport_id'}, inplace=True)
df_outgoing = df_routes.groupby('source_airport').size().reset_index(name='outgoing')
df_outgoing.rename(columns={'source_airport': 'airport_id'}, inplace=True)

Now I need to join the two generated DataFrames on the airport ID. Then I left-join the result with `df_airports` to get the country of each airport.

In [308]:
df_airport_traffic = pd.merge(df_incoming, df_outgoing, on='airport_id', how='outer')

df_airport_traffic.fillna(0, inplace=True)
df_airport_traffic[['incoming', 'outgoing']] = df_airport_traffic[['incoming', 'outgoing']].astype(int)
df_airport_traffic['flight_balance'] = df_airport_traffic['incoming'] - df_airport_traffic['outgoing']

df_airport_traffic = pd.merge(
    df_airport_traffic, 
    df_airports[['IATA', 'country']],
    left_on='airport_id', 
    right_on='IATA', 
    how='left'
).drop('IATA', axis=1)

df_airport_traffic = df_airport_traffic.query("country in @df_countries.name")
df_airport_traffic

Unnamed: 0,airport_id,incoming,outgoing,flight_balance,country
1,AAL,21,20,1,Denmark
4,AAR,8,8,0,Denmark
23,ABZ,41,41,0,United Kingdom
26,ACE,117,116,1,Spain
27,ACH,2,2,0,Switzerland
...,...,...,...,...,...
3379,ZAG,43,42,1,Croatia
3384,ZAZ,8,8,0,Spain
3411,ZQW,6,5,1,Germany
3413,ZRH,247,247,0,Switzerland
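
To make the outer join concrete, here is a minimal sketch on a made-up route list (the airport codes and counts are invented for illustration and are not taken from `routes.csv`). It suggests why `how='outer'` combined with `fillna(0)` matters: an airport that appears only as a destination would otherwise be dropped or left with a missing value.

```python
import pandas as pd

# Invented toy routes: 'CCC' only ever appears as a destination
toy_routes = pd.DataFrame({
    "source_airport":      ["AAA", "AAA", "BBB"],
    "destination_airport": ["BBB", "CCC", "AAA"],
})

incoming = (toy_routes.groupby("destination_airport").size()
            .reset_index(name="incoming")
            .rename(columns={"destination_airport": "airport_id"}))
outgoing = (toy_routes.groupby("source_airport").size()
            .reset_index(name="outgoing")
            .rename(columns={"source_airport": "airport_id"}))

# The outer join keeps airports present in only one of the two tables
traffic = incoming.merge(outgoing, on="airport_id", how="outer")
traffic[["incoming", "outgoing"]] = traffic[["incoming", "outgoing"]].fillna(0).astype(int)
traffic["flight_balance"] = traffic["incoming"] - traffic["outgoing"]
print(traffic)
```

Because every route adds one outgoing and one incoming flight, the balances over all airports in this closed toy network sum to zero.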


Since a country can of course have several airports, I need to group them by country.

In [309]:
df_country_traffic = df_airport_traffic.groupby('country')[['incoming', 'outgoing', 'flight_balance']].sum()
df_country_traffic


Unnamed: 0_level_0,incoming,outgoing,flight_balance
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Albania,36,36,0
Austria,388,388,0
Belarus,54,54,0
Belgium,410,412,-2
Bosnia and Herzegovina,23,23,0
Bulgaria,89,89,0
Croatia,205,204,1
Cyprus,154,149,5
Czech Republic,193,192,1
Denmark,309,309,0


Since I only need European data, I inner-join `gpd_europe` with the country traffic data, adding the incoming and outgoing flight counts.

In [316]:
gpd_europe_traffic = gpd_europe.merge(
    df_country_traffic, 
    'inner', 
    left_on='NAME',
    right_on='country'
)

gpd_europe_traffic[['incoming', 'outgoing', 'flight_balance']] = gpd_europe_traffic[['incoming', 'outgoing', 'flight_balance']].fillna(0)
gpd_europe_traffic[['incoming', 'outgoing', 'flight_balance']] = gpd_europe_traffic[['incoming', 'outgoing', 'flight_balance']].astype(int)

gpd_europe_traffic


Unnamed: 0,FID,FIPS,ISO2,ISO3,UN,NAME,AREA,POP2005,REGION,SUBREGION,LON,LAT,geometry,incoming,outgoing,flight_balance
0,0.0,AL,AL,ALB,8,Albania,2740,3153731,150,39,20.068,41.143,"POLYGON ((19.43621 41.02106, 19.45055 41.06, 1...",36,36,0
1,0.0,BK,BA,BIH,70,Bosnia and Herzegovina,5120,3915238,150,39,17.786,44.169,"POLYGON ((17.64984 42.88908, 17.57853 42.94382...",23,23,0
2,0.0,BU,BG,BGR,100,Bulgaria,11063,7744591,150,151,25.231,42.761,"POLYGON ((27.87917 42.8411, 27.895 42.8025, 27...",89,89,0
3,0.0,CY,CY,CYP,196,Cyprus,924,836321,142,145,33.219,35.043,"POLYGON ((33.65262 35.3541, 33.71305 35.38194,...",154,149,5
4,0.0,DA,DK,DNK,208,Denmark,4243,5416945,150,154,9.264,56.058,"MULTIPOLYGON (((11.51389 54.82972, 11.56444 54...",309,309,0
5,0.0,EI,IE,IRL,372,Ireland,6889,4143294,150,154,-8.152,53.177,"MULTIPOLYGON (((-9.65639 53.22222, -9.66333 53...",295,295,0
6,0.0,EN,EE,EST,233,Estonia,4239,1344312,150,154,25.793,58.674,"MULTIPOLYGON (((23.99083 58.1, 23.97805 58.097...",41,41,0
7,0.0,AU,AT,AUT,40,Austria,8245,8291979,150,155,14.912,47.683,"POLYGON ((13.83361 48.7736, 13.85806 48.77055,...",388,388,0
8,0.0,EZ,CZ,CZE,203,Czech Republic,7727,10191762,150,151,15.338,49.743,"POLYGON ((14.70028 48.58138, 14.65639 48.6075,...",193,192,1
9,0.0,FI,FI,FIN,246,Finland,30459,5246004,150,154,26.272,64.504,"MULTIPOLYGON (((23.70583 59.92722, 23.64944 59...",220,219,1


The last step before visualizing the data is to create a DataFrame with the number of flights for each ordered pair of countries (source, destination).

In [379]:
# Get the list of valid countries from df_countries
valid_countries = df_countries['name'].unique().tolist()

# Merge the data to obtain the source and destination countries
# Merge on the source airport
routes_with_countries = df_routes.merge(
    df_airports[['IATA', 'country']],
    left_on='source_airport',
    right_on='IATA',
    how='inner'
)
routes_with_countries = routes_with_countries.rename(columns={'country': 'source_country'})

# Merge on the destination airport
routes_with_countries = routes_with_countries.merge(
    df_airports[['IATA', 'country']],
    left_on='destination_airport',
    right_on='IATA',
    how='inner'
)
routes_with_countries = routes_with_countries.rename(columns={'country': 'destination_country'})

# Remove the duplicate IATA columns produced by the merges
routes_with_countries = routes_with_countries.drop(['IATA_x', 'IATA_y'], axis=1, errors='ignore')

# Keep only the rows where both the source and the destination country are in the list of valid countries
filtered_routes = routes_with_countries[
    routes_with_countries['source_country'].isin(valid_countries) & 
    routes_with_countries['destination_country'].isin(valid_countries)
]

# Build a country-by-country matrix with the flight counts
country_flight_matrix = filtered_routes.groupby(['source_country', 'destination_country']).size().unstack(fill_value=0)

# Display the resulting matrix
display(country_flight_matrix)

destination_country,Albania,Austria,Belarus,Belgium,Bosnia and Herzegovina,Bulgaria,Croatia,Cyprus,Czech Republic,Denmark,...,Portugal,Romania,Serbia,Slovakia,Slovenia,Spain,Sweden,Switzerland,Ukraine,United Kingdom
source_country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Albania,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
Austria,1,15,2,2,1,3,6,3,1,5,...,6,8,3,1,2,40,4,12,10,11
Belarus,0,2,0,0,0,0,0,1,2,0,...,0,0,0,0,0,1,1,1,2,1
Belgium,0,2,0,1,0,4,5,2,4,5,...,18,4,2,1,3,57,7,6,2,17
Bosnia and Herzegovina,0,1,0,0,2,0,1,0,0,1,...,0,0,3,0,1,0,3,1,0,0
Bulgaria,0,3,0,4,0,6,0,4,2,1,...,0,3,2,0,0,8,0,1,0,10
Croatia,0,6,0,5,1,0,22,0,2,6,...,0,0,2,0,0,5,9,8,0,35
Cyprus,0,3,1,2,0,4,0,0,0,0,...,0,3,2,0,0,0,2,3,5,50
Czech Republic,0,1,2,4,0,2,1,0,4,4,...,2,5,1,4,1,16,5,4,3,16
Denmark,0,5,0,5,1,1,6,0,4,23,...,5,1,2,0,2,31,9,7,0,22
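
Given such a matrix, the pairwise balance for a selected country can be read off directly: the column of the selected country holds its incoming flights and the row holds its outgoing flights. The sketch below illustrates this on an invented three-country matrix (names and counts are placeholders, not values from the dataset); `pairwise_balance` is a hypothetical helper, not part of the solution above.

```python
import pandas as pd

countries = ["A", "B", "C"]
# Rows = source country, columns = destination country (invented counts)
toy_matrix = pd.DataFrame(
    [[0, 3, 1],
     [2, 0, 0],
     [4, 1, 0]],
    index=countries, columns=countries,
)

def pairwise_balance(matrix: pd.DataFrame, selected: str) -> pd.Series:
    """Incoming minus outgoing flights between `selected` and every other country."""
    # Reindex so the matrix is square even if a country appears on one axis only
    names = matrix.index.union(matrix.columns)
    square = matrix.reindex(index=names, columns=names, fill_value=0)
    incoming = square[selected]       # flights from each country into `selected`
    outgoing = square.loc[selected]   # flights from `selected` to each country
    return incoming - outgoing

print(pairwise_balance(toy_matrix, "A"))
```

The reindexing step is a precaution: `unstack` only creates rows and columns for countries that actually occur as a source or a destination, so the real matrix is not guaranteed to be square.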


Now I just have to visualize the data.

In [345]:
from bokeh.io import show
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar, HoverTool
from bokeh.plotting import figure
from bokeh.palettes import RdYlGn11
import pandas as pd

# Convert to lat/lon (EPSG:4326) for web plotting
gpd_europe_traffic = gpd_europe_traffic.to_crs("EPSG:4326")

# Add new column to determine color based on flight balance
def determine_color(flight_balance):
    if flight_balance < 0:
        return 'indianred'
    elif flight_balance > 0:
        return 'seagreen'
    else:
        return 'black'

# Apply the color logic to the dataframe
gpd_europe_traffic['flight_balance_color'] = gpd_europe_traffic['flight_balance'].apply(determine_color)

# Convert to GeoJSON for Bokeh
geo_source = GeoJSONDataSource(geojson=gpd_europe_traffic.to_json())

# Set up color mapper for flight_balance (still for color palette)
color_mapper = LinearColorMapper(
    palette=RdYlGn11[::-1],
    low=gpd_europe_traffic['flight_balance'].min(),
    high=gpd_europe_traffic['flight_balance'].max(),
    nan_color="lightgray"
)

# Create Bokeh figure
p = figure(
    title="Flight Balance by Country",
    height=600,
    width=800,
    toolbar_location='right',
    tools="pan,wheel_zoom,reset,save",
)

p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.axis.visible = False

# Add patches for countries
p.patches('xs', 'ys', source=geo_source,
          fill_color={'field': 'flight_balance', 'transform': color_mapper},
          line_color='dimgray',
          line_width=0.5,
          fill_alpha=0.8)

# Add color bar
color_bar = ColorBar(
    color_mapper=color_mapper,
    label_standoff=12,
    border_line_color=None,
    location=(0, 0)
)
p.add_layout(color_bar, 'right')

# Custom tooltip with bold text and colored values
tooltip_template = """
    <div style="font-size: 12px;">
        <b>Country:</b> @NAME <br>
        <b>Incoming:</b> <span style="color: seagreen;">+@incoming</span> <br>
        <b>Outgoing:</b> <span style="color: indianred;">-@outgoing</span> <br>
        <b>Flight Balance:</b> <span style="color: @flight_balance_color;">@flight_balance{+0}</span>
    </div>
"""

# Add custom HoverTool
hover = HoverTool(tooltips=tooltip_template)
p.add_tools(hover)

# Show the map
show(p)
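
One possible refinement, offered as a suggestion rather than as part of the original solution: with a diverging palette such as `RdYlGn11`, the neutral middle color only corresponds to a balance of zero when the mapper's `low` and `high` are symmetric around zero. A minimal sketch of computing such limits, using invented balance values and a hypothetical helper `symmetric_limits`:

```python
def symmetric_limits(values):
    """Return (low, high) centered on zero and wide enough to cover all values."""
    limit = max(abs(min(values)), abs(max(values)))
    # Avoid a zero-width range when every value is 0
    limit = limit if limit > 0 else 1
    return -limit, limit

# Invented flight balances for illustration
balances = [-2, 0, 1, 5]
low, high = symmetric_limits(balances)
print(low, high)
```

The resulting `low` and `high` could then be passed to `LinearColorMapper` in place of the raw minimum and maximum.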


In [447]:
from bokeh.io import output_notebook, reset_output, show
from bokeh.models import (
    GeoJSONDataSource, LinearColorMapper, ColorBar, HoverTool,
    CustomJS, TapTool
)
from bokeh.plotting import figure
from bokeh.palettes import RdYlGn11
import pandas as pd

reset_output()
output_notebook()

# Convert to lat/lon (EPSG:4326) for web plotting
gpd_europe_traffic = gpd_europe_traffic.to_crs("EPSG:4326")

# Determine color for flight balance
def determine_color(flight_balance):
    if flight_balance < 0:
        return 'indianred'
    elif flight_balance > 0:
        return 'seagreen'
    else:
        return 'black'

gpd_europe_traffic['flight_balance_color'] = gpd_europe_traffic['flight_balance'].apply(determine_color)

# Prepare data source
geo_source = GeoJSONDataSource(geojson=gpd_europe_traffic.to_json())

# Color mapper
color_mapper = LinearColorMapper(
    palette=RdYlGn11[::-1],
    low=gpd_europe_traffic['flight_balance'].min(),
    high=gpd_europe_traffic['flight_balance'].max(),
    nan_color="lightgray"
)

# Bokeh figure
p = figure(
    title="Flight Balance by Country",
    height=600,
    width=800,
    toolbar_location='right',
    tools="pan,wheel_zoom,reset,save,tap",
)

p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
p.axis.visible = False

# Add patches
p.patches('xs', 'ys', source=geo_source,
          fill_color={'field': 'flight_balance', 'transform': color_mapper},
          line_color='dimgray',
          line_width=0.5,
          fill_alpha=0.8)

# Add color bar
color_bar = ColorBar(
    color_mapper=color_mapper,
    label_standoff=12,
    border_line_color=None,
    location=(0, 0)
)
p.add_layout(color_bar, 'right')

# Default tooltip template
default_tooltip = """
    <div style="font-size: 12px;">
        <b>Country:</b> @NAME <br>
        <b>Incoming:</b> <span style="color: seagreen;">+@incoming</span> <br>
        <b>Outgoing:</b> <span style="color: indianred;">-@outgoing</span> <br>
        <b>Flight Balance:</b> <span style="color: @flight_balance_color;">@flight_balance{+0}</span>
    </div>
"""

# Dynamic tooltip (JS will set this later)
dynamic_tooltip = """
    <div style="font-size: 12px;">
        <b>Country:</b> @NAME <br>
        <b>Incoming from @reference_country:</b> <span style="color: seagreen;">+@relative_incoming</span> <br>
        <b>Outgoing to @reference_country:</b> <span style="color: indianred;">-@relative_outgoing</span> <br>
        <b>Flight Balance:</b> <span style="color: @relative_color;">@relative_balance{+0}</span> <br>
    </div>
"""

# Add initial HoverTool
hover = HoverTool(tooltips=default_tooltip)
p.add_tools(hover)

p.js_on_event("reset", CustomJS(args=dict(
    source=geo_source,
    hover_tool=hover,
    original_tooltip=default_tooltip,
), code="""
    hover_tool.tooltips = original_tooltip;
    source.change.emit();
"""))

# Add TapTool callback
p.js_on_event("tap", CustomJS(args=dict(
    source=geo_source,
    hover_tool=hover,
    original_tooltip=default_tooltip,
    dynamic_tooltip=dynamic_tooltip,
    country_matrix=country_flight_matrix.to_dict(),
), code="""
    const indices = source.selected.indices;

    if (indices.length === 0) {
        hover_tool.tooltips = original_tooltip;

        // Clean relative fields
        delete source.data['relative_balance'];
        delete source.data['relative_color'];
        delete source.data['relative_incoming'];
        delete source.data['relative_outgoing'];
        delete source.data['reference_country'];
        source.change.emit();
        return;
    }

    const clicked_index = indices[0];
    const clicked_country = source.data['NAME'][clicked_index];

    hover_tool.tooltips = dynamic_tooltip;

    // Fill relative_balance and color fields based on the matrix
    const matrix = country_matrix;
    const names = source.data['NAME'];
    const rel_bal = [];
    const rel_col = [];
    const rel_incoming = [];
    const rel_outgoing = [];
    const reference_country = [];

    for (let i = 0; i < names.length; i++) {
        const other = names[i];
        // Declare incoming and outgoing as local variables
        let incoming = 0;
        let outgoing = 0;

        // country_matrix comes from DataFrame.to_dict(), so it is keyed as
        // matrix[destination][source].
        // Incoming (for the hovered country): flights from the clicked country to `other`
        if (matrix[other] && matrix[other][clicked_country]) {
            incoming = matrix[other][clicked_country];
        }
        
        // Outgoing (for the hovered country): flights from `other` to the clicked country
        if (matrix[clicked_country] && matrix[clicked_country][other]) {
            outgoing = matrix[clicked_country][other];
        }
        
        // Calculate balance (incoming minus outgoing)
        let balance = incoming - outgoing;

        // Convert values to string for tooltip display
        rel_incoming.push(incoming.toString());
        rel_outgoing.push(outgoing.toString());
        rel_bal.push((balance > 0 ? "+" : "") + balance.toString());
        rel_col.push(balance > 0 ? "seagreen" : (balance < 0 ? "indianred" : "black"));
        reference_country.push(clicked_country);
    }

    // Assign the arrays
    source.data['relative_balance'] = rel_bal;
    source.data['relative_color'] = rel_col;
    source.data['relative_incoming'] = rel_incoming;
    source.data['relative_outgoing'] = rel_outgoing;
    source.data['reference_country'] = reference_country;
    source.change.emit();
"""))

# Show the plot
show(p)
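
Because the callback indexes the dictionary produced by `country_flight_matrix.to_dict()`, it is worth checking which axis becomes the outer key. With the default orientation pandas builds `{column: {index: value}}`, so here the outer key is the destination country. A small sketch with an invented two-country matrix (the names `toy` and `as_dict` are introduced for this example):

```python
import pandas as pd

# Rows = source country, columns = destination country (invented counts)
toy = pd.DataFrame(
    [[0, 7],
     [2, 0]],
    index=pd.Index(["X", "Y"], name="source_country"),
    columns=pd.Index(["X", "Y"], name="destination_country"),
)

as_dict = toy.to_dict()  # default orient='dict': {destination: {source: count}}

# The destination country is the first key, the source country the second
print(as_dict["Y"]["X"])  # flights X -> Y
print(as_dict["X"]["Y"])  # flights Y -> X
```

If the opposite layout is preferred, `to_dict(orient='index')` yields `{source: {destination: count}}` instead.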


## Datasets Description

You can find the dataset in the `datasets` folder. The descriptions of the datasets are provided below.

### Used Cars

The content of the dataset is in German, but it should not pose critical issues in understanding the data. Each entry contains the following information.

| **Field**                    | **Description** |
|------------------------------|---------------|
| **dateCrawled**               | When this ad was first crawled, all field values are taken from this date. |
| **name**                      | The name of the car. |
| **seller**                    | Seller type (private or dealer). |
| **offerType**                 | Type of offer. |
| **price**                     | The price in euros for the car on the ad. |
| **abtest**                    | Type of test. |
| **vehicleType**               | Type of vehicle. |
| **yearOfRegistration**        | The year the car was first registered. |
| **gearbox**                   | Type of gearbox. |
| **powerPS**                   | Power of the car in PS (horsepower). |
| **model**                     | Model of the car. |
| **kilometer**                 | How many kilometers the car has driven. |
| **monthOfRegistration**       | The month the car was first registered. |
| **fuelType**                  | Vehicle fuel type. |
| **brand**                     | Vehicle brand. |
| **notRepairedDamage**         | If the car has any damage that has not been repaired yet. |
| **dateCreated**               | The date the ad was created on eBay. |
| **nrOfPictures**              | Number of pictures in the ad. |
| **postalCode**                | Postal code associated with the ad. |
| **lastSeenOnline**            | When the crawler last saw this ad online. |


### US Accidents

| **Field**              | **Description** |
|------------------------|---------------|
| **ID** | Unique identifier of the accident record. |
| **Severity** | Severity of the accident (1-4), where 1 indicates the least impact on traffic and 4 indicates significant impact. |
| **Start_Time** | Start time of the accident in local time zone. |
| **End_Time** | End time of the accident in local time zone (when the impact on traffic flow was dismissed). |
| **Start_Lat** | Latitude in GPS coordinate of the start point. |
| **Start_Lng** | Longitude in GPS coordinate of the start point. |
| **End_Lat** | Latitude in GPS coordinate of the end point. |
| **End_Lng** | Longitude in GPS coordinate of the end point. |
| **Distance(mi)** | Length of the road extent affected by the accident. |
| **Description** | Natural language description of the accident. |
| **Number** | Street number in address field. |
| **Street** | Street name in address field. |
| **Side** | Relative side of the street (Right/Left) in address field. |
| **City** | City in address field. |
| **County** | County in address field. |
| **State** | State in address field. |
| **Zipcode** | Zipcode in address field. |
| **Country** | Country in address field. |
| **Timezone** | Timezone based on the location of the accident (eastern, central, etc.). |
| **Airport_Code** | Closest airport-based weather station to the accident location. |
| **Weather_Timestamp** | Timestamp of weather observation record (in local time). |
| **Temperature(F)** | Temperature (in Fahrenheit). |
| **Wind_Chill(F)** | Wind chill (in Fahrenheit). |
| **Humidity(%)** | Humidity (in percentage). |
| **Pressure(in)** | Air pressure (in inches). |
| **Visibility(mi)** | Visibility (in miles). |
| **Wind_Direction** | Wind direction. |
| **Wind_Speed(mph)** | Wind speed (in miles per hour). |
| **Precipitation(in)** | Precipitation amount in inches, if any. |
| **Weather_Condition** | Weather condition (rain, snow, thunderstorm, fog, etc.). |
| **Amenity** | POI annotation indicating presence of an amenity nearby. |
| **Bump** | POI annotation indicating presence of a speed bump or hump nearby. |
| **Crossing** | POI annotation indicating presence of a crossing nearby. |
| **Give_Way** | POI annotation indicating presence of a give-way sign nearby. |
| **Junction** | POI annotation indicating presence of a junction nearby. |
| **No_Exit** | POI annotation indicating presence of a no-exit nearby. |
| **Railway** | POI annotation indicating presence of a railway nearby. |
| **Roundabout** | POI annotation indicating presence of a roundabout nearby. |
| **Station** | POI annotation indicating presence of a station nearby. |
| **Stop** | POI annotation indicating presence of a stop sign nearby. |
| **Traffic_Calming** | POI annotation indicating presence of traffic calming measures nearby. |
| **Traffic_Signal** | POI annotation indicating presence of a traffic signal nearby. |
| **Turning_Loop** | POI annotation indicating presence of a turning loop nearby. |
| **Sunrise_Sunset** | Period of day (day or night) based on sunrise/sunset. |
| **Civil_Twilight** | Period of day (day or night) based on civil twilight. |
| **Nautical_Twilight** | Period of day (day or night) based on nautical twilight. |
| **Astronomical_Twilight** | Period of day (day or night) based on astronomical twilight. |


### Energy Data

| **Field**                | **Description** |
|---------------------------|-----------------|
| **country**               | Geographic location. |
| **year**                  | Year of observation. |
| **gdp**                   | (Gross Domestic Product) This data is adjusted for inflation and differences in the cost of living between countries. |
| **population**            | Population by country, based on data and estimates from different sources. |
| **greenhouse_gas_emissions** | Emissions from electricity generation. Measured in megatonnes of CO₂ equivalents. |
| **net_elec_imports**      | Net electricity imports. Electricity imports minus exports, measured in TWh. |
| **biofuel_consumption**   | Primary energy consumption from biofuels. Measured in terawatt-hours. |
| **coal_consumption**      | Primary energy consumption from coal. Measured in terawatt-hours. |
| **fossil_fuel_consumption** | Primary energy consumption from fossil fuels. Measured in terawatt-hours. |
| **gas_consumption**       | Primary energy consumption from gas. Measured in terawatt-hours. |
| **oil_consumption**       | Primary energy consumption from oil. Measured in terawatt-hours. |
| **nuclear_consumption**   | Primary energy consumption from nuclear power. Measured in terawatt-hours, using the substitution method. |
| **hydro_consumption**     | Primary energy consumption from hydropower. Measured in terawatt-hours, using the substitution method. |
| **solar_consumption**     | Primary energy consumption from solar power. Measured in terawatt-hours, using the substitution method. |
| **wind_consumption**      | Primary energy consumption from wind power. Measured in terawatt-hours, using the substitution method. |
| **biofuel_electricity**   | Electricity generation from bioenergy. Measured in terawatt-hours. |
| **coal_electricity**      | Electricity generation from coal. Measured in terawatt-hours. |
| **fossil_electricity**    | Electricity generation from fossil fuels. Measured in terawatt-hours. |
| **gas_electricity**       | Electricity generation from gas. Measured in terawatt-hours. |
| **oil_electricity**       | Electricity generation from oil. Measured in terawatt-hours. |
| **nuclear_electricity**   | Electricity generation from nuclear. Measured in terawatt-hours. |
| **hydro_electricity**     | Electricity generation from hydropower. Measured in terawatt-hours. |
| **solar_electricity**     | Electricity generation from solar power. Measured in terawatt-hours. |
| **wind_electricity**      | Electricity generation from wind power. Measured in terawatt-hours. |



### Airports

As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe. Each entry contains the following information:

| **Field**                 | **Description** |
|---------------------------|---------------|
| **Airport ID** | Unique OpenFlights identifier for this airport. |
| **Name** | Name of the airport. May or may not contain the city name. |
| **City** | Main city served by the airport. May be spelled differently from the name. |
| **Country** | Country or territory where the airport is located. Can be cross-referenced with ISO 3166-1 codes. |
| **IATA** | 3-letter IATA code. Null if not assigned/unknown. |
| **ICAO** | 4-letter ICAO code. Null if not assigned/unknown. |
| **Latitude** | Decimal degrees, usually to six significant digits. Negative is South, positive is North. |
| **Longitude** | Decimal degrees, usually to six significant digits. Negative is West, positive is East. |
| **Altitude** | Altitude in feet. |
| **Timezone** | Hours offset from UTC. Fractional hours are expressed as decimals (e.g., India is 5.5). |
| **DST** | Daylight savings time classification: E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None), or U (Unknown). |
| **Tz database time zone** | Timezone in "tz" (Olson) format (e.g., "America/Los_Angeles"). |
| **Type** | Type of the airport. Value is "airport" for air terminals. |
| **Source** | Source of the data. "OurAirports" for data sourced from OurAirports. |


### Routes

As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67663 routes between 3321 airports on 548 airlines spanning the globe. \
Each entry contains the following information.

| **Field**                | **Description** |
|--------------------------|---------------|
| **Airline** | 2-letter (IATA) or 3-letter (ICAO) code of the airline. |
| **Airline ID** | Unique OpenFlights identifier for the airline. |
| **Source airport** | 3-letter (IATA) or 4-letter (ICAO) code of the source airport. |
| **Source airport ID** | Unique OpenFlights identifier for the source airport. |
| **Destination airport** | 3-letter (IATA) or 4-letter (ICAO) code of the destination airport. |
| **Destination airport ID** | Unique OpenFlights identifier for the destination airport. |
| **Codeshare** | "Y" if the flight is a codeshare (operated by another carrier), empty otherwise. |
| **Stops** | Number of stops on the flight ("0" for direct). |
| **Equipment** | 3-letter codes for plane type(s) generally used on this flight, separated by spaces. |


The data is UTF-8 encoded. The special value `\N` is used for "NULL" to indicate that no value is available, and is understood automatically by MySQL if imported.
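
When loading these files with pandas, the `\N` marker is not among the strings treated as missing by default, so it may be worth declaring it explicitly. A minimal sketch, assuming a semicolon-separated layout like the one used for `routes.csv` in this notebook; the column names and the two sample rows are invented for illustration:

```python
import io

import pandas as pd

# Invented sample; '\N' marks a missing airline ID as in the OpenFlights files
raw = "airline;airline_id;source_airport\nAB;10;AAA\nCD;\\N;BBB\n"

# na_values adds '\N' to the strings pandas interprets as missing
df_sample = pd.read_csv(io.StringIO(raw), sep=";", na_values=["\\N"])
print(df_sample["airline_id"].isna().tolist())
```

Without `na_values`, the column would be read as text and the marker would survive as the literal string `\N`.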


<aside>
💡 Notes:

- Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately.
- Routes where one carrier operates both its own and codeshare flights are listed only once.
</aside>


### Countries

This dataset contains the information related to European countries. 