# TO ASK LIST
- Section 2.2 - there are 3 Dallas Cities

# Visual Analytics

## Assignment 1

**Instructor:** Dr. Marco D'Ambros  
**TAs:** Carmen Armenti, Mattia Giannaccari

**Contacts:** marco.dambros@usi.ch, carmen.armenti@usi.ch, mattia.giannaccari@usi.ch

**Due Date:** 10 April, 2025 @ 23:55

---

### Goal

The goal of this assignment is to use Python and Jupyter notebook to explore, analyze and visualize the datasets provided. 

The assignment is divided into four sections, each requiring you to apply the knowledge gained from both the theoretical and practical lectures to solve the exercises. Specifically, when creating tabular or graphical representations, you should apply the principles learned in the theoretical lectures and use the technologies introduced in the practical sessions. The datasets you need to use are detailed in the **Datasets Description** section and can be found in the following folder [Assignment1_Data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/EqjXB7uSEoVAujKPSZY1hvIBMhAXJv5y6Z-UwaO6bCtOjg?e=kxcaai).

### Submission Guidelines
- **Format:** Please submit a Jupyter Notebook containing your solutions along with a clear explanation of the **steps** taken to arrive at each solution. Each solution must be introduced by a Markdown cell indicating the exercise number. If you prefer, you may use the uploaded assignment file and develop your solution by adding cells below each exercise instructions. It is essential that every choice is justified, and the solution is thoroughly commented to explain each step. Exercises without explanations will be evaluated negatively.

- **Filename:** Please name the Jupyter notebook as follows: `SurnameName_Assignment1.ipynb`.

- **Submission:** Please submit your solution (the jupyter notebook and any other script you may have used to support your solution) to iCorsi.


## Preparatory Phase

Installing the needed modules

In [1]:
%pip install pandas
%pip install bokeh
%pip install matplotlib
%pip install chardet

Note: you may need to restart the kernel to use updated packages.


Importing the modules

In [2]:
import pandas as pd
import numpy as np
import chardet

---
## Section 1 - Data quality (10 points)

**Data Source:** `used_cars.csv`.

In the `used_cars.csv` dataset, please perform the following data cleaning steps: 
- Identify any missing or invalid values in the following columns: `vehicle type`, `price`, `brand`, and `month of registration`. If needed, standardize the data. For the `price` column specifically, the prices are recorded in euros, please consider valid only values within the range of €1,000 and €500,000. 
- For each of the previous columns, report the number of missing or invalid entries.
- After identifying missing or invalid values in the columns above, remove **any** rows where at least one of these columns contains such data.

Please clearly outline the steps you take to clean the dataset and document your approach. You may use any preferred tool or technology, such as Python (vanilla or Pandas) or OpenRefine.

In [3]:
with open('datasets/used_cars.csv', 'rb') as f:
    data = f.read()

encoding_result = chardet.detect(data)
encoding = encoding_result['encoding']

df_usedcars = pd.read_csv('datasets/used_cars.csv', encoding=encoding)
# rename() returns a new DataFrame, so the result must be assigned back
df_usedcars = df_usedcars.rename(columns=lambda x: x.rstrip())
columns = {'vehicleType', 'price', 'brand', 'monthOfRegistration'}
df_usedcars

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
0,2016-03-24 11:52:17,Golf_3_1.6,privat,Angebot,480,test,,1993,manuell,0,golf,150000,0,benzin,volkswagen,,2016-03-24 00:00:00,0,70435,2016-04-07 03:16:57
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371523,2016-03-14 17:48:27,Suche_t4___vito_ab_6_sitze,privat,Angebot,2200,test,,2005,,0,,20000,1,,sonstige_autos,,2016-03-14 00:00:00,0,39576,2016-04-06 00:46:52
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


### Step 1

Before filtering the data I am going to see which values are inside the `DataFrame` to check if `NaN` values are present and if standardization is needed.

In [4]:
for col in columns:
    print(f"{col} has NaN? {df_usedcars[col].isnull().any()}")

price has NaN? False
monthOfRegistration has NaN? False
brand has NaN? False
vehicleType has NaN? True


We can see from the previous cell that only `vehicleType` has `NaN` values.

In [5]:
for col in columns:
    print(df_usedcars[col].value_counts().sort_index())
    print('=' * 50)

price
0             10778
1              1189
2                12
3                 8
4                 1
              ...  
32545461          1
74185296          1
99000000          1
99999999         15
2147483647        1
Name: count, Length: 5597, dtype: int64
monthOfRegistration
0     37675
1     24561
2     22403
3     36170
4     30918
5     30631
6     33167
7     28958
8     23765
9     25074
10    27337
11    25489
12    25380
Name: count, dtype: int64
brand
BMW                   3
alfa_romeo         2345
audi              32873
bmw               40265
bmw                   6
chevrolet          1845
chrysler           1452
citroen            5182
dacia               900
daewoo              542
daihatsu            806
fiat               9676
ford              25573
honda              2836
hyundai            3646
jaguar              621
jeep                807
kia                2555
lada                225
lancia              484
land_rover          770
mazda              569

Since the `brand` column contains multiple occurrences of `bmw` written in different ways, I am going to standardize the entries.

In [6]:
df_usedcars['brand'] = df_usedcars['brand'].apply(lambda x : x.rstrip().lower())
df_usedcars['brand'].value_counts().sort_index()

brand
alfa_romeo         2345
audi              32873
bmw               40274
chevrolet          1845
chrysler           1452
citroen            5182
dacia               900
daewoo              542
daihatsu            806
fiat               9676
ford              25573
honda              2836
hyundai            3646
jaguar              621
jeep                807
kia                2555
lada                225
lancia              484
land_rover          770
mazda              5695
mercedes_benz     35309
mini               3394
mitsubishi         3061
nissan             5037
opel              40136
peugeot           11027
porsche            2215
renault           17969
rover               490
saab                530
seat               7022
skoda              5641
smart              5249
sonstige_autos     3982
subaru              779
suzuki             2328
toyota             4694
trabant             591
volkswagen        79640
volvo              3327
Name: count, dtype: int64
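
The same normalization can also be written with pandas' vectorized string methods. The following is a minimal sketch on a made-up Series (the values are illustrative and not taken from `used_cars.csv`); unlike the cell above, it also strips leading whitespace.

```python
import pandas as pd

# Toy brand labels with inconsistent case and stray whitespace (illustrative only).
brands = pd.Series(["BMW", "bmw ", "bmw", " Audi", "audi"])

# Vectorized counterpart of apply(lambda x: x.rstrip().lower()),
# using strip() so that leading spaces are removed as well.
brands_clean = brands.str.strip().str.lower()

print(brands_clean.value_counts().sort_index())
```

After this step the five toy labels collapse into two distinct brands.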

Using boolean filters, I extract the rows where at least one condition is satisfied, then I combine them through the `|` operator.

Since `monthOfRegistration` takes values between `0` and `12`, I treat it as categorical: the acceptable values are the ones between `1` and `12`, and `0` is considered invalid.

`NaN` values are present only in the `vehicleType` column, so I don't need to handle them in the numerical columns.
Since `brand` has no `NaN` and its values are already standardized, I can leave its filter out of the combined one.
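
As a sanity check of this filtering logic, here is a minimal sketch on a made-up DataFrame (the rows are hypothetical). It assumes that valid months are `1` to `12` and that valid prices lie between 1,000 and 500,000 inclusive, and it uses `Series.between` to express each valid range before negating it.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing vehicle type, one invalid price, one invalid month.
toy = pd.DataFrame({
    "vehicleType": ["suv", np.nan, "bus", "kombi"],
    "price": [9800, 3600, 480, 1500],
    "monthOfRegistration": [8, 7, 3, 0],
})

# A value is invalid when it is missing or outside the accepted range.
bad_type = toy["vehicleType"].isna()
bad_price = ~toy["price"].between(1_000, 500_000)        # inclusive bounds
bad_month = ~toy["monthOfRegistration"].between(1, 12)   # 0 is treated as invalid

# Combine the masks with |: True when at least one column is invalid.
bad_row = bad_type | bad_price | bad_month

print(bad_type.sum(), bad_price.sum(), bad_month.sum())
print(toy[~bad_row])
```

Writing the valid range first and negating it keeps the boundary values (`1` and `12`, 1,000 and 500,000) on the valid side, which is easy to get wrong with hand-written `<` and `>` comparisons.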

In [7]:
filter_vehicle_type = df_usedcars['vehicleType'].isna()
filter_brand = df_usedcars['brand'].isna()
filter_price = (df_usedcars['price'] < 1_000) | (df_usedcars['price'] > 500_000)
# Months 1..12 are valid; only values outside that range (here, 0) are flagged.
filter_month = (df_usedcars['monthOfRegistration'] < 1) | (df_usedcars['monthOfRegistration'] > 12)

#filter = filter_vehicle_type | filter_brand | filter_price | filter_month
filter = filter_vehicle_type | filter_price | filter_month

### Step 2

Report the number of missing or invalid values by counting the `True` values inside each filter series.

In [8]:
print('=' * 15 + ' Missing/Invalid Values ' + '=' * 15)
print(f"  Missing values for vehicleType: \t  {filter_vehicle_type.sum()}")
print(f"  Missing values for brand: \t\t  {filter_brand.sum()}")
print(f"  Invalid values for price: \t\t  {filter_price.sum()}")
print(f"  Invalid values for monthOfRegistration: {filter_month.sum()}")
print('=' * 54)

  Missing values for vehicleType: 	  37869
  Missing values for brand: 		  0
  Invalid values for price: 		  83435
  Invalid values for monthOfRegistration: 37675


### Step 3

Remove the rows where at least one condition holds, that is, the rows whose index has value `True` in `filter`.

In [9]:
df_usedcars_filtered = df_usedcars[~filter]
df_usedcars_filtered

Unnamed: 0,dateCrawled,name,seller,offerType,price,abtest,vehicleType,yearOfRegistration,gearbox,powerPS,model,kilometer,monthOfRegistration,fuelType,brand,notRepairedDamage,dateCreated,nrOfPictures,postalCode,lastSeen
1,2016-03-24 10:58:45,A5_Sportback_2.7_Tdi,privat,Angebot,18300,test,coupe,2011,manuell,190,,125000,5,diesel,audi,ja,2016-03-24 00:00:00,0,66954,2016-04-07 01:46:50
2,2016-03-14 12:52:21,"Jeep_Grand_Cherokee_""Overland""",privat,Angebot,9800,test,suv,2004,automatik,163,grand,125000,8,diesel,jeep,,2016-03-14 00:00:00,0,90480,2016-04-05 12:47:46
3,2016-03-17 16:54:04,GOLF_4_1_4__3TÜRER,privat,Angebot,1500,test,kleinwagen,2001,manuell,75,golf,150000,6,benzin,volkswagen,nein,2016-03-17 00:00:00,0,91074,2016-03-17 17:40:17
4,2016-03-31 17:25:20,Skoda_Fabia_1.4_TDI_PD_Classic,privat,Angebot,3600,test,kleinwagen,2008,manuell,69,fabia,90000,7,diesel,skoda,nein,2016-03-31 00:00:00,0,60437,2016-04-06 10:17:21
6,2016-04-01 20:48:51,Peugeot_206_CC_110_Platinum,privat,Angebot,2200,test,cabrio,2004,manuell,109,2_reihe,150000,8,benzin,peugeot,nein,2016-04-01 00:00:00,0,67112,2016-04-05 18:18:39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
371521,2016-03-27 20:36:20,Opel_Zafira_1.6_Elegance_TÜV_12/16,privat,Angebot,1150,control,bus,2000,manuell,0,zafira,150000,3,benzin,opel,nein,2016-03-27 00:00:00,0,26624,2016-03-29 10:17:23
371524,2016-03-05 19:56:21,Smart_smart_leistungssteigerung_100ps,privat,Angebot,1199,test,cabrio,2000,automatik,101,fortwo,125000,3,benzin,smart,nein,2016-03-05 00:00:00,0,26135,2016-03-11 18:17:12
371525,2016-03-19 18:57:12,Volkswagen_Multivan_T4_TDI_7DC_UY2,privat,Angebot,9200,test,bus,1996,manuell,102,transporter,150000,3,diesel,volkswagen,nein,2016-03-19 00:00:00,0,87439,2016-04-07 07:15:26
371526,2016-03-20 19:41:08,VW_Golf_Kombi_1_9l_TDI,privat,Angebot,3400,test,kombi,2002,manuell,100,golf,150000,6,diesel,volkswagen,,2016-03-20 00:00:00,0,40764,2016-03-24 12:45:21


## Section 2 - Data Analysis, Visualization, and Exploration (60 points) 📊
In this section, you will need to use two different datasets: `us_accidents.csv` for the first three exercises and `eu_energy.csv` for the next three. Each exercise is worth 10 points.

In [10]:
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.models import ColumnDataSource, NumeralTickFormatter, TableColumn, DataTable, HTMLTemplateFormatter, RadioButtonGroup, CustomJS, Row, InlineStyleSheet, Span, FixedTicker
from bokeh.models.widgets import DataTable, TableColumn, GroupingInfo, SumAggregator, DataCube
from bokeh.layouts import column, gridplot

import math

reset_output()
output_notebook()

### Section 2.1 
**Data Source**: `us_accidents.csv`

1. In the US Accidents dataset please remove all rows where one or more columns have missing data and explicitly identify the number of rows with null values. Consider the years 2020 and 2022.

    - What are the cities with the highest number of accidents in 2020 and 2022? Report them with the number of accidents.
    - Please provide the yearly total number of car accidents in 2020 and 2022 for each `County` and `City` combination.
    - Please retrieve the 10 cities with the highest total number of accidents in 2020 and 2022, and create a visualization that:
    
        - As a **primary goal** shows the increase in accident numbers for each city that allows the comparison of the increase per city. Which is the city with the most significant increase?
        - As a **secondary goal** presents the absolute number of accidents in both 2020 and 2022 for each selected city.
    
    Please explain the insights gained from the visualization and justify the choice of the representation.


In [11]:
df_accidents = pd.read_csv('datasets/us_accidents.csv')
# rename() returns a new DataFrame, so the result must be assigned back
df_accidents = df_accidents.rename(columns=lambda x: x.rstrip())
df_accidents.dropna(inplace=True)
df_accidents

Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
3402762,A-3412645,Source1,3,2016-02-08 00:37:08,2016-02-08 06:37:08,40.108910,-83.092860,40.112060,-83.031870,3.230,...,False,False,False,False,False,False,Night,Night,Night,Night
3402767,A-3412650,Source1,3,2016-02-08 07:53:43,2016-02-08 13:53:43,39.172393,-84.492792,39.170476,-84.501798,0.500,...,False,False,False,False,False,False,Day,Day,Day,Day
3402771,A-3412654,Source1,2,2016-02-08 11:51:46,2016-02-08 17:51:46,41.375310,-81.820170,41.367860,-81.821740,0.521,...,False,False,False,False,False,False,Day,Day,Day,Day
3402773,A-3412656,Source1,2,2016-02-08 15:16:43,2016-02-08 21:16:43,40.109310,-82.968490,40.110780,-82.984000,0.826,...,False,False,False,False,False,False,Day,Day,Day,Day
3402774,A-3412657,Source1,2,2016-02-08 15:43:50,2016-02-08 21:43:50,39.192880,-84.477230,39.196150,-84.473350,0.307,...,False,False,False,False,False,False,Day,Day,Day,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7728389,A-7777757,Source1,2,2019-08-23 18:03:25,2019-08-23 18:32:01,34.002480,-117.379360,33.998880,-117.370940,0.543,...,False,False,False,False,False,False,Day,Day,Day,Day
7728390,A-7777758,Source1,2,2019-08-23 19:11:30,2019-08-23 19:38:23,32.766960,-117.148060,32.765550,-117.153630,0.338,...,False,False,False,False,False,False,Day,Day,Day,Day
7728391,A-7777759,Source1,2,2019-08-23 19:00:21,2019-08-23 19:28:49,33.775450,-117.847790,33.777400,-117.857270,0.561,...,False,False,False,False,False,False,Day,Day,Day,Day
7728392,A-7777760,Source1,2,2019-08-23 19:00:21,2019-08-23 19:29:42,33.992460,-118.403020,33.983110,-118.395650,0.772,...,False,False,False,False,False,False,Day,Day,Day,Day


To consider only the accidents in `2020` and `2022`, I exclude every accident whose `Start_Time` or `End_Time` does not fall in one of these two years.

In [12]:
years = {2020, 2022}
colors = {2020 : "steelblue", 2022 : "indianred"}

df_accidents['Start_Time'] = pd.to_datetime(df_accidents['Start_Time'], errors='coerce')
df_accidents['End_Time'] = pd.to_datetime(df_accidents['End_Time'], errors='coerce')

print("NaT after the conversion in Start_Time column: ", df_accidents['Start_Time'].isna().sum())
print("NaT after the conversion in End_Time column: ", df_accidents['End_Time'].isna().sum())

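# .copy() makes the result an independent DataFrame, so new columns can be added later without a SettingWithCopyWarning
df_accidents_20_22 = df_accidents.query('Start_Time.dt.year in @years and End_Time.dt.year in @years').copy()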
df_accidents_20_22

NaT after the conversion in Start_Time column:  690284
NaT after the conversion in End_Time column:  690284
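
The large number of `NaT` values reported above may come from timestamps that carry fractional seconds and therefore do not match the format inferred from the first rows. This is an assumption and has not been verified against `us_accidents.csv`. A minimal sketch on made-up strings, which keeps only the first 19 characters (`YYYY-MM-DD HH:MM:SS`) before parsing with an explicit format:

```python
import pandas as pd

# Made-up timestamps: the last one has fractional seconds (hypothetical example).
raw = pd.Series([
    "2022-05-18 22:10:00",
    "2020-01-06 20:54:00",
    "2022-11-12 01:03:52.000000000",
])

# Truncate to second precision, then parse with an explicit format so that
# every row is interpreted in the same way.
parsed = pd.to_datetime(raw.str.slice(0, 19), format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed.isna().sum())      # number of values that still failed to parse
print(parsed.dt.year.tolist())
```

If the same approach brings the `NaT` count on the real columns down to zero, the rows lost in the conversion above are recovered; otherwise the cause lies elsewhere.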


Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
3677739,A-3705247,Source1,2,2022-05-18 22:10:00,2022-05-19 00:10:57,38.904107,-77.018215,38.905693,-77.013704,0.266,...,False,False,False,False,False,False,Night,Night,Night,Night
3677741,A-3705249,Source1,2,2022-11-12 01:03:52,2022-11-12 02:29:16,46.483201,-114.126399,46.485270,-114.125932,0.145,...,False,False,False,False,False,False,Night,Night,Night,Night
3677742,A-3705250,Source1,2,2022-09-03 09:40:03,2022-09-03 11:16:03,29.725044,-95.298193,29.722916,-95.298129,0.147,...,False,False,True,False,False,False,Day,Day,Day,Day
3677743,A-3705251,Source1,2,2022-02-11 17:33:06,2022-02-11 19:40:56,30.698144,-86.571374,30.700222,-86.572708,0.164,...,False,False,False,False,False,False,Night,Day,Day,Day
3677745,A-3705254,Source1,2,2022-12-01 12:54:25,2022-12-01 14:36:39,33.861092,-81.414454,33.861431,-81.414371,0.024,...,False,False,False,False,False,False,Day,Day,Day,Day
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7235933,A-7285296,Source1,2,2020-01-06 20:54:00,2020-01-06 22:08:36,33.696102,-117.084844,33.696102,-117.084844,0.000,...,False,False,False,False,True,False,Night,Night,Night,Night
7235935,A-7285298,Source1,2,2020-01-06 21:14:00,2020-01-06 23:16:38,35.736667,-119.742500,35.736667,-119.742500,0.000,...,False,False,False,False,False,False,Night,Night,Night,Night
7235936,A-7285299,Source1,2,2020-01-06 21:19:00,2020-01-06 22:19:37,34.075263,-118.281157,34.075263,-118.281157,0.000,...,False,False,True,True,False,False,Night,Night,Night,Night
7246305,A-7295668,Source1,2,2020-01-01 00:08:02,2020-01-01 00:37:03,42.315690,-83.085920,42.312500,-83.094120,0.473,...,False,False,False,False,False,False,Night,Night,Night,Night


### Step 1

After filtering the data, I show the city with the most accidents in each of the two years.

In [13]:
df_accidents_20_22['Year'] = df_accidents_20_22['Start_Time'].dt.year
accidents_per_year = df_accidents_20_22.groupby(['Year', 'City']).size().sort_values(ascending=False)
accidents_per_year


Year  City          
2022  Miami             41376
      Orlando           23297
2020  Miami             20059
2022  Los Angeles       17961
2020  Los Angeles       16820
                        ...  
2022  Kinard                1
      Kindred               1
2020  Franklin Grove        1
      Franconia             1
2022  Zwingle               1
Length: 16130, dtype: int64

In [14]:
for year in years:
    print(f"The city with the most accidents in {year} is {accidents_per_year.loc[year].idxmax()} with {accidents_per_year.loc[year].max()} accidents.")

The city with the most accidents in 2020 is Miami with 20059 accidents.
The city with the most accidents in 2022 is Miami with 41376 accidents.
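
The per-year lookup can also be done without a loop. A minimal sketch on made-up counts (the city names and numbers are illustrative), assuming a Series indexed by `Year` and `City` like `accidents_per_year`:

```python
import pandas as pd

# Illustrative accident counts indexed by (Year, City).
counts = pd.Series(
    [5, 9, 7, 3],
    index=pd.MultiIndex.from_tuples(
        [(2020, "A"), (2020, "B"), (2022, "A"), (2022, "B")],
        names=["Year", "City"],
    ),
)

# For each year, idxmax returns the (Year, City) label of the largest count.
top_labels = counts.groupby(level="Year").idxmax()

for year, (_, city) in top_labels.items():
    print(f"{year}: {city} with {counts.loc[(year, city)]} accidents")
```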


### Step 2
To show the number of accidents per year for each `County` and `City` combination, I group the data by these columns.

Since there are too many combinations to plot, I show the results in a table.

In [15]:
df_accidents_grouped = df_accidents_20_22.groupby(['Year', 'County', 'City']).size().reset_index(name='Accidents').sort_values(by='Accidents', ascending=False)

source = ColumnDataSource(df_accidents_grouped)

template = """
    <div style="color:dimgray;">
        <%= value %>
    </div>
"""

template_numbers = """
    <div style="color:dimgray;">
        <%= value.toLocaleString() %> 
    </div>
"""

columns = [
    TableColumn(field="Year", title="Year", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="County", title="County", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="City", title="City", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="Accidents", title="Number of Accidents", formatter=HTMLTemplateFormatter(template=template_numbers))
]

grouping = [
    GroupingInfo(getter='Year', aggregators=[SumAggregator(field_="Accidents")])
]

target = ColumnDataSource(data=dict(row_indices=[], labels=[]))

css = """
.slick-group {
    color: dimgray;
    border-bottom: 2px solid #dee2e6 !important;
}
"""

data_cube = DataCube(
    source=source,
    columns=columns,
    grouping=grouping,
    target=target,
    stylesheets=[css]
)

show(data_cube)


### Step 3

First of all, I retrieve the 10 cities with the highest total number of accidents over the two years.

In [16]:
df_accidents_top10 = df_accidents_20_22.groupby('City').size().sort_values(ascending=False)[:10].index

df_accidents_top10 = df_accidents_20_22[
    (df_accidents_20_22['City'].isin(df_accidents_top10)) & 
    (df_accidents_20_22['Year'].isin(years))
]

accidents_counts = df_accidents_top10.groupby(['City', 'Year']).size().unstack(fill_value=0)

accidents_counts

Year,2020,2022
City,Unnamed: 1_level_1,Unnamed: 2_level_1
Baton Rouge,4758,7778
Charlotte,8743,11133
Dallas,7439,12845
Houston,6576,13171
Los Angeles,16820,17961
Miami,20059,41376
Nashville,4723,8292
Orlando,8095,23297
Raleigh,5704,9119
San Diego,5449,8202


To show the increase in the number of accidents, I use an `hbar` from `Bokeh`, and I put the secondary information (the absolute counts for 2020 and 2022) in the `tooltip`.

In [17]:
accidents_counts['Increment'] = accidents_counts[2022] - accidents_counts[2020]
accidents_counts['Percentage'] = ((accidents_counts['Increment'] / accidents_counts[2020]) * 100).replace([float('inf'), -float('inf')], 0).fillna(0)
accidents_counts.sort_values(by=['Increment'], ascending=True, inplace=True)

# Absolute

source = ColumnDataSource(data={
    'city': accidents_counts.index.tolist(),
    'increment': accidents_counts['Increment'].tolist(),
    'percentage': accidents_counts['Percentage'].tolist(),
    'accidents_2020': accidents_counts[2020].tolist(),
    'accidents_2022': accidents_counts[2022].tolist()
})

TOOLTIPS = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px;">
        <span style="font-size: 12px; color: steelblue;">City:</span> @city<br>
        <span style="font-size: 12px; color: steelblue;">Increment:</span> @increment{0,0}<br>
        <span style="font-size: 12px; color: steelblue;">Percentage:</span> @percentage{0.0}%<br>
        <span style="font-size: 12px; color: steelblue;">2020 Accidents:</span> @accidents_2020{0,0}<br>
        <span style="font-size: 12px; color: steelblue;">2022 Accidents:</span> @accidents_2022{0,0}
    </div>
"""

plot_increment = figure(
    y_range=accidents_counts.index.tolist(),
    height=400,
    width=800,
    tooltips=TOOLTIPS,
    title="Absolute Increase in Accidents from 2020 to 2022 in the Top 10 Cities",
    x_axis_label="Increase in Number of Accidents",
    y_axis_label="City"
)

plot_increment.hbar(
    y='city',
    right='increment',
    source=source,
    height=0.85,
)

plot_increment.toolbar.logo = None
plot_increment.toolbar_location = None
plot_increment.xgrid.grid_line_color = None
plot_increment.xaxis[0].formatter = NumeralTickFormatter(format="0,0")
plot_increment.yaxis.minor_tick_line_color = None
plot_increment.xaxis.minor_tick_line_color = None
plot_increment.x_range.start = 0

show(plot_increment)

# Percentage

accidents_counts.sort_values(by=['Percentage'], ascending=True, inplace=True)

plot_percentage = figure(
    y_range=accidents_counts.index.tolist(),
    height=400,
    width=800,
    tooltips=TOOLTIPS,
    title="Percentage Increase in Accidents from 2020 to 2022 in the Top 10 Cities",
    x_axis_label="Increase in Accidents (%)",
    y_axis_label="City",
)

plot_percentage.hbar(
    y='city',
    right='percentage',
    source=source,
    height=0.85
)

plot_percentage.toolbar.logo = None
plot_percentage.toolbar_location = None
plot_percentage.xgrid.grid_line_color = None
plot_percentage.xaxis[0].formatter = NumeralTickFormatter(format="0,0")
plot_percentage.yaxis.minor_tick_line_color = None
plot_percentage.xaxis.minor_tick_line_color = None
plot_percentage.x_range.start = 0

show(plot_percentage)



As we can see from the two plots, `Miami` has the biggest **absolute** increase from `2020` to `2022` (+21,317 accidents), while `Orlando` has the biggest **percentage** increase (about 188%).
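
The two rankings can be checked by hand from the table of counts above. A short sketch that recomputes the absolute and relative increase for three of the ten cities, using the 2020 and 2022 counts reported in that table:

```python
import pandas as pd

# Counts copied from the top-10 table above (accidents in 2020 and 2022).
counts = pd.DataFrame(
    {2020: [20059, 8095, 6576], 2022: [41376, 23297, 13171]},
    index=["Miami", "Orlando", "Houston"],
)

# Absolute increase and increase relative to the 2020 baseline, in percent.
counts["Increment"] = counts[2022] - counts[2020]
counts["Percentage"] = counts["Increment"] / counts[2020] * 100

print(counts.round(1))
print("Largest absolute increase:", counts["Increment"].idxmax())
print("Largest relative increase:", counts["Percentage"].idxmax())
```

The relative measure favors cities with a small 2020 baseline, which is why the two rankings do not agree.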

2. We define the **accident duration** as the time elapsed from the start of the accident until its impact on traffic flow is resolved.

    Please provide a table that shows the minimum and maximum accident duration for each combination of `State`, `County`, `City`, `Year`, `Month`, ensuring that only combinations with data for all 12 months is available. Then, filter the data to include only **Los Angeles**, **Dallas**, and **New York** cities and plot the behavior of the minimum and maximum durations for accidents that occurred in 2022. Choose a visualization that highlights how the average values of both minimum and maximum durations relate to the minimum-maximum range.

    - Which city shows the least pronounced variation? 
    - What insights can you draw from the plot?

    Please explain what the plot reveals and justify the choice of visualization.
    

Here I assume that an accident with a `Start_Time` in `June` and an `End_Time` in `July` counts as an accident of `June`, since that is the month in which it happened.
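
A minimal sketch of this convention on a single made-up accident (the timestamps are hypothetical): the month is read from `Start_Time`, and the duration is the difference between `End_Time` and `Start_Time` in whole minutes.

```python
import pandas as pd

# Made-up accident that starts in June and ends in July (hypothetical values).
acc = pd.DataFrame({
    "Start_Time": pd.to_datetime(["2022-06-30 23:30:00"]),
    "End_Time": pd.to_datetime(["2022-07-01 01:00:30"]),
})

# The month is taken from the start of the accident.
acc["Month"] = acc["Start_Time"].dt.month

# Duration in whole minutes; astype(int) truncates the fractional part.
minutes = (acc["End_Time"] - acc["Start_Time"]).dt.total_seconds() / 60
acc["Duration"] = minutes.astype(int)

print(acc[["Month", "Duration"]])
```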

In [18]:
df_accidents_20_22['Month'] = df_accidents_20_22['Start_Time'].dt.month
df_accidents_20_22['Duration'] = (df_accidents_20_22['End_Time'] - df_accidents_20_22['Start_Time']).dt.total_seconds() / 60
df_accidents_20_22['Duration'] = df_accidents_20_22['Duration'].astype(int)

df_accidents_sccym = df_accidents_20_22.groupby(['State', 'County', 'City', 'Year', 'Month'])['Duration'].agg(['min', 'max']).reset_index()
df_accidents_sccym.rename(columns={'min' : 'Min', 'max': 'Max'}, inplace=True)

valid_entries = df_accidents_sccym.groupby(['State', 'County', 'City', 'Year'])['Month'].nunique() == 12
df_accidents_sccym = df_accidents_sccym.set_index(['State', 'County', 'City', 'Year']).loc[valid_entries].reset_index()

df_accidents_sccym


Unnamed: 0,State,County,City,Year,Month,Min,Max
0,AL,Baldwin,Daphne,2022,1,21,176
1,AL,Baldwin,Daphne,2022,2,13,276
2,AL,Baldwin,Daphne,2022,3,11,1545
3,AL,Baldwin,Daphne,2022,4,13,245
4,AL,Baldwin,Daphne,2022,5,23,207
...,...,...,...,...,...,...,...
24727,WY,Sweetwater,Rock Springs,2022,8,33,1725
24728,WY,Sweetwater,Rock Springs,2022,9,78,78
24729,WY,Sweetwater,Rock Springs,2022,10,20,1489
24730,WY,Sweetwater,Rock Springs,2022,11,14,1494
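
The requirement that a combination has data for all 12 months can also be expressed with `transform("nunique")`. This is a sketch on made-up data, with `transform` swapped in for the index-based selection used in the cell above:

```python
import pandas as pd

# Toy monthly table: city "X" covers all 12 months, city "Y" only 3 (illustrative).
monthly = pd.DataFrame({
    "City": ["X"] * 12 + ["Y"] * 3,
    "Year": [2022] * 15,
    "Month": list(range(1, 13)) + [1, 2, 3],
})

# Count the distinct months per (City, Year) and broadcast the count to every row.
n_months = monthly.groupby(["City", "Year"])["Month"].transform("nunique")

# Keep only the groups that cover the whole year.
complete = monthly[n_months == 12]

print(complete["City"].unique())
```

Because `transform` returns a Series aligned with the original rows, the mask can be applied directly without setting and resetting the index.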


After creating the needed `DataFrame`, I can proceed to visualize the data in a table.

In [19]:
duration_template = """
<div style="color:dimgray;">
    <% if (value >= 1440) { %>
        <%= Math.floor(value / 1440) %>d <%= Math.floor((value % 1440)/60) %>h <%= (value % 1440) % 60 %>min
    <% } else if (value >= 60) { %>
        <%= Math.floor(value / 60) %>h <%= value % 60 %>min
    <% } else { %>
        <%= value %>min
    <% } %>
</div>
"""

month_template = """
<div style="color:dimgray;">
    <% 
    var monthNames = ["January", "February", "March", "April", "May", "June", 
                     "July", "August", "September", "October", "November", "December"];
    var monthName = monthNames[value - 1]; 
    %>
    <%= monthName %>
</div>
"""

header_hover_css = """
    .slick-header-column:hover {
        background-color: #eeeeee !important;
    }
    .slick-header-column:hover .slick-column-name {
        color: #111111 !important;
    }
"""

source = ColumnDataSource(df_accidents_sccym)
original_data = df_accidents_sccym.to_dict('list') 

columns = [
    TableColumn(field="State", title="State", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="County", title="County", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="City", title="City", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="Year", title="Year", formatter=HTMLTemplateFormatter(template=template)),
    TableColumn(field="Month", title="Month", formatter=HTMLTemplateFormatter(template=month_template)),
    TableColumn(field="Min", title="Min Duration", formatter=HTMLTemplateFormatter(template=duration_template)),
    TableColumn(field="Max", title="Max Duration", formatter=HTMLTemplateFormatter(template=duration_template))
]

data_table = DataTable(
    source=source, 
    columns=columns, 
    width=800, 
    height=400, 
    index_position=None,
    scroll_to_selection=False,
    stylesheets=[InlineStyleSheet(css=header_hover_css)]
)

# Create radio button group with labels
year_selector = RadioButtonGroup(
    labels=["Year 2020", "Year 2022", "All Years"],
    active=2,
    width=400
)

# JavaScript callback for filtering
filter_code = """
    var year_filter = null;
    switch (this.origin.active) {
        case 0: year_filter = 2020; break;
        case 1: year_filter = 2022; break;
        case 2: year_filter = null; break;
    }
    
    var new_data = {};
    var indices = [];
    
    // Find matching indices
    for (var i = 0; i < original_data.Year.length; i++) {
        if (year_filter === null || original_data.Year[i] === year_filter) {
            indices.push(i);
        }
    }
    
    // Create filtered dataset
    for (var key in original_data) {
        new_data[key] = [];
        for (var idx of indices) {
            new_data[key].push(original_data[key][idx]);
        }
    }
    
    source.data = new_data;
    source.change.emit();
"""

# Add callback to radio buttons
year_selector.js_on_event("button_click", CustomJS(
    args=dict(source=source, original_data=original_data),
    code=filter_code
))

centered_row = Row(
    children=[year_selector],
    align="center",          # Horizontal centering
)

# Show the components
show(column(data_table, centered_row))

After showing the information for both years, I can concentrate on `Los Angeles`, `Dallas` and `New York`.

In [20]:
cities = ['Los Angeles', 'Dallas', 'New York']
df_accidents_cities = df_accidents_sccym[
    (df_accidents_sccym['City'].isin(cities)) & 
    ((df_accidents_sccym['City'] != 'Dallas') | (df_accidents_sccym['County'] == 'Dallas')) & 
    (df_accidents_sccym['Year'] == 2022)
]
df_accidents_cities


Unnamed: 0,State,County,City,Year,Month,Min,Max
2616,CA,Los Angeles,Los Angeles,2022,1,7,835
2617,CA,Los Angeles,Los Angeles,2022,2,8,1003
2618,CA,Los Angeles,Los Angeles,2022,3,7,929
2619,CA,Los Angeles,Los Angeles,2022,4,11,924
2620,CA,Los Angeles,Los Angeles,2022,5,7,1555
2621,CA,Los Angeles,Los Angeles,2022,6,10,7077
2622,CA,Los Angeles,Los Angeles,2022,7,7,1052
2623,CA,Los Angeles,Los Angeles,2022,8,6,1003
2624,CA,Los Angeles,Los Angeles,2022,9,7,1490
2625,CA,Los Angeles,Los Angeles,2022,10,7,10710
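
The filter used in the cell above can be read as "the city is one of the three, and if it is Dallas then the county must also be Dallas". A minimal sketch of that mask on a toy DataFrame (the rows are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy rows: three cities named Dallas in different counties, plus the other two cities
toy = pd.DataFrame({
    "State":  ["TX", "GA", "OR", "CA", "NY"],
    "County": ["Dallas", "Paulding", "Polk", "Los Angeles", "New York"],
    "City":   ["Dallas", "Dallas", "Dallas", "Los Angeles", "New York"],
})

cities = ["Los Angeles", "Dallas", "New York"]
in_cities = toy["City"].isin(cities)
# Either the row is not Dallas at all, or it is the Dallas in Dallas County
right_dallas = (toy["City"] != "Dallas") | (toy["County"] == "Dallas")

kept = toy[in_cities & right_dallas]
print(kept)
```

Only the `TX`, `CA` and `NY` rows survive, so the homonymous cities do not contaminate the Dallas statistics.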


In [21]:
from bokeh.models import ColumnDataSource, Span, NumeralTickFormatter, HoverTool
from bokeh.plotting import figure, show
from bokeh.layouts import row, column
import calendar

# Calculate means
cities = ['Los Angeles', 'Dallas', 'New York']
mins = {city: float(df_accidents_cities[df_accidents_cities['City'] == city]['Min'].mean()) for city in cities}
maxs = {city: float(df_accidents_cities[df_accidents_cities['City'] == city]['Max'].mean()) for city in cities}

# Global min/max for y_range
global_min_y = (df_accidents_cities['Min'].min(), int(df_accidents_cities['Min'].max()*1.1))
global_max_y = (df_accidents_cities['Max'].min(), int(df_accidents_cities['Max'].max()*1.1))

# Create plot function
def create_city_plot(df, city, year, duration_type='Min'):
    """
    Creates a Bokeh plot for accident durations by month with consistent y-axis.
    
    Parameters:
    - df (DataFrame): The input DataFrame with accident data.
    - city (str): The city to filter.
    - year (int): The year to filter.
    - duration_type (str): 'Min' or 'Max' for duration type.
    """
    # Filter data
    city_data = df[(df['City'] == city) & (df['Year'] == year)].copy()
    city_data['Month Name'] = city_data['Month'].apply(lambda x: calendar.month_abbr[x])
    
    source = ColumnDataSource(city_data)
    
    duration_label = 'minimum' if duration_type == 'Min' else 'maximum'
    title = f"{city} {duration_label} accident duration in {year}"
    
    x_range = sorted(city_data['Month Name'].unique(), key=lambda m: list(calendar.month_abbr).index(m))
    
    # Y-axis range based on global min/max values
    y_range = global_min_y if duration_type == 'Min' else global_max_y
    
    # Get mean value based on duration type
    mean_value = mins[city] if duration_type == 'Min' else maxs[city]
    
    # Create Bokeh plot
    p = figure(
        title=title,
        x_range=x_range,
        y_range=y_range,
        x_axis_label=None,
        y_axis_label="Duration (minutes)",
        width=500,
        height=300
    )
    
    # Add HoverTool with custom duration formatting
    hover = HoverTool(tooltips=[("Month", "@{Month Name}"),
                                ("Duration", "@{" + duration_type + "} min")])
    p.add_tools(hover)
    
    # Line and scatter plot
    p.line(x='Month Name', y=duration_type, source=source, line_width=2, color='steelblue')
    p.scatter(x='Month Name', y=duration_type, source=source, size=5, alpha=1, color='steelblue')
    
    # Add horizontal mean line
    mean_line = Span(location=mean_value, dimension='width', 
                     line_color='orange', line_dash='dashed', line_width=2)
    p.add_layout(mean_line)
    
    # Style adjustments
    p.toolbar.logo = None
    p.toolbar_location = None
    p.ygrid.grid_line_color = None
    p.yaxis[0].formatter = NumeralTickFormatter(format="0")
    p.yaxis.minor_tick_line_color = None
    p.xaxis.minor_tick_line_color = None
    p.y_range.start = 0
    
    return p

# Create plots with shared y-axis ranges
columns = []

for city in cities:
    columns.append(column(
        create_city_plot(df_accidents_cities, city, 2022, 'Min'),
        create_city_plot(df_accidents_cities, city, 2022, 'Max')
    ))

# Show the grid layout
show(row(*columns))


--- TO COMPLETE ---

3. Please filter the data for the years 2019 to 2023 and divide it into two bins based on the `Year` value. Then, calculate the duration ranges for each bin, grouped by `County` and `City`. Classify accidents by congestion level:

    - Accidents affecting a road length greater than the median of `Distance(mi)` across the dataset are considered **severe**.
    - Those below the median are categorized as **not severe**.

    The resulting dataframe should have `County` and `City` as row indices, with year bins and severity (severe/not severe) as hierarchical columns. The values in the dataframe should represent the range of distances, with severe accidents placed under the "Severe" column and non-severe accidents under the "Not Severe" column. Each cell should display the range of distances for a specific city, county, and year interval. For this exercise, you have to use `groupby()` and __cannot__ rely on `pivot_table()`.
    
    What is the combination of county-city-year-range with the widest range of accidents duration?
    
    
    The following table shows how the dataframe should look:

<br>
YB = Year bin range
<br>
DB = Range of minimum and maximum durations
<br>

<table>
    <tr>
        <th rowspan="2">County</th>
        <th rowspan="2">City</th>
        <th colspan="2">Not Severe</th> 
        <th colspan="2">Severe</th>
    </tr>
    <tr>
        <th>YB</th>
        <th>YB</th>
        <th>YB</th>
        <th>YB</th>
    </tr>
    <tr>
        <th>Abbeville</th>
        <th>Bradley</th>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
    </tr>
    <tr>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
    </tr>
    <tr>
        <th>Yuma</th>
        <th>Dateland</th>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
        <td>DB</td>
    </tr>
    <tr>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
        <td colspan="2">...</td>
    </tr>
</table>

### Step 4

Here I prepare the DataFrame to show all the data requested. 

First let's add `Year`, `Month` and `Duration` as new columns.

In [22]:
# Explicit copy, so the new columns are not written to a slice of another DataFrame
df_accidents = df_accidents.dropna(subset=['Start_Time', 'End_Time']).copy()
df_accidents['Year'] = df_accidents['Start_Time'].dt.year
df_accidents['Month'] = df_accidents['Start_Time'].dt.month
# Duration in whole minutes (truncated)
df_accidents['Duration'] = ((df_accidents['End_Time'] - df_accidents['Start_Time']).dt.total_seconds() / 60).astype(int)

In [23]:
df_accidents['Year'].value_counts()

Year
2022    961657
2021    894957
2020    633402
2019    198639
2023    154669
2018      9549
2017      7740
2016      3652
Name: count, dtype: int64

Then I split the years between `2019` and `2023` into two bins and compute the `median` of `Distance(mi)` over the accidents in that period.

In [24]:
from pandas.api.types import CategoricalDtype

In [25]:
years_2 = [2019, 2023]
metric = 'Distance(mi)'

# .copy() so that the new columns are added to an independent DataFrame
df_accidents_19_23 = df_accidents.query(f'Year >= {years_2[0]} and Year <= {years_2[1]}').copy()

# Two equal-width bins over the years, relabelled as closed integer ranges
accidents_bins = pd.cut(df_accidents_19_23['Year'], bins=2)
df_accidents_19_23['Year Bin'] = accidents_bins.apply(lambda x: f"[{int(x.left) + 1}, {int(x.right)}]")

# Median distance over the 2019-2023 subset, used as the severity threshold
median = float(df_accidents_19_23[metric].median())
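
With `bins=2`, `pd.cut` splits the observed range of years into two equal-width intervals, so the bins depend on which years are present in the data. A minimal sketch of an alternative, assuming the two bins should always be 2019-2021 and 2022-2023: pass explicit edges and labels. The `years` series is toy data.

```python
import pandas as pd

years = pd.Series([2019, 2020, 2021, 2022, 2023])  # toy data

# Right-closed intervals (2018, 2021] and (2021, 2023], labelled as integer ranges
year_bin = pd.cut(
    years,
    bins=[2018, 2021, 2023],
    labels=["[2019, 2021]", "[2022, 2023]"],
)

labels = year_bin.astype(str).tolist()
print(labels)
```

Explicit edges also make the `int(x.left) + 1` relabelling step unnecessary.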


Now I flag each accident as severe or not severe and compute the range of `Distance(mi)` for each county, city, year bin and severity class:

In [26]:
# Severe = affected road length above the median; assign() returns a new DataFrame
df_accidents_19_23 = df_accidents_19_23.assign(Severe=df_accidents_19_23[metric] > median)
# observed=False makes the previous default explicit for the categorical 'Year Bin' key
df_accidents_19_23_grouped = df_accidents_19_23.groupby(['County', 'City', 'Year Bin', 'Severe'], observed=False)[metric].agg(lambda x: f"[{round(x.min(), 3)}, {round(x.max(), 3)}]")


Here is a first version of the table, which still contains `NaN`s:

In [27]:
final_table = df_accidents_19_23_grouped.unstack(level=[2, 3])

final_table.columns = pd.MultiIndex.from_tuples(
    [('Not Severe', str(col[0])) if col[1] == False else ('Severe', str(col[0])) for col in final_table.columns],
    names=[None, None]
)

final_table = final_table.sort_index(axis=1, level=0)

final_table

,,Not Severe,Not Severe,Severe,Severe
County,City,"[2019, 2021]","[2022, 2023]","[2019, 2021]","[2022, 2023]"
Abbeville,Aaronsburg,,,,
Abbeville,Abbeville,"[0.015, 0.248]","[0.009, 0.247]","[0.25, 1.962]","[0.259, 0.74]"
Abbeville,Abbotsford,,,,
Abbeville,Abbottstown,,,,
Abbeville,Aberdeen,,,,
...,...,...,...,...,...
Yuma,Zortman,,,,
Yuma,Zumbro Falls,,,,
Yuma,Zumbrota,,,,
Yuma,Zuni,,,,


To see only the rows that have a value in every column, we can drop the rows containing `NaN` and display the table:

In [28]:
final_table.dropna()

,,Not Severe,Not Severe,Severe,Severe
County,City,"[2019, 2021]","[2022, 2023]","[2019, 2021]","[2022, 2023]"
Abbeville,Abbeville,"[0.015, 0.248]","[0.009, 0.247]","[0.25, 1.962]","[0.259, 0.74]"
Abbeville,Calhoun Falls,"[0.068, 0.247]","[0.043, 0.207]","[0.406, 0.406]","[0.282, 0.492]"
Abbeville,Donalds,"[0.013, 0.238]","[0.016, 0.233]","[0.253, 0.771]","[0.332, 0.748]"
Abbeville,Due West,"[0.097, 0.117]","[0.056, 0.161]","[0.281, 1.382]","[0.272, 0.839]"
Abbeville,Honea Path,"[0.027, 0.245]","[0.012, 0.247]","[0.252, 0.583]","[0.254, 0.593]"
...,...,...,...,...,...
Yuba,Oregon House,"[0.01, 0.245]","[0.026, 0.245]","[0.255, 1.873]","[0.255, 0.645]"
Yuba,Smartsville,"[0.0, 0.239]","[0.009, 0.229]","[0.289, 0.628]","[0.261, 0.47]"
Yuba,Wheatland,"[0.0, 0.242]","[0.01, 0.214]","[0.254, 2.576]","[0.249, 2.873]"
Yuma,Roll,"[0.012, 0.012]","[0.097, 0.191]","[0.551, 13.724]","[0.461, 11.915]"
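
The table above stores each range as a string, which is convenient to read but not to compare. To answer the question about the widest range, one option is to keep the minimum and maximum as numbers and compare their difference. The following is a minimal sketch on toy data (the values and the county/city names are made up); it uses `Distance(mi)` as in the cells above, on the assumption that the same pattern applies to `Duration`.

```python
import pandas as pd

toy = pd.DataFrame({
    "County": ["A", "A", "A", "B", "B", "B"],
    "City": ["a1", "a1", "a1", "b1", "b1", "b1"],
    "Year Bin": ["[2019, 2021]", "[2019, 2021]", "[2022, 2023]",
                 "[2019, 2021]", "[2022, 2023]", "[2022, 2023]"],
    "Distance(mi)": [0.1, 0.9, 0.5, 0.2, 0.3, 4.3],
})

# Numeric min and max per county, city and year bin
ranges = toy.groupby(["County", "City", "Year Bin"])["Distance(mi)"].agg(["min", "max"])
ranges["width"] = ranges["max"] - ranges["min"]

# Index label (a tuple) of the group with the widest range
widest = ranges["width"].idxmax()
print(widest, ranges.loc[widest, "width"])
```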


### Section 2.2 
**Data Source:** `eu_energy.csv`

Please note that:

- EU countries are the following: Austria, Belgium, Bulgaria, Croatia, Cyprus, Czechia, Denmark, Estonia, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Latvia, Lithuania, Luxembourg, Malta, Netherlands, Poland, Portugal, Romania, Slovakia, Slovenia, Spain, Sweden
- Renewable energy sources: Hydroelectric power, solar power, wind power, biofuel
- Non-renewable energy sources: Coal, fossil fuels, gas, oil, nuclear
- Clean energy sources: Hydroelectric power, solar power, wind power, nuclear
- Non-clean energy sources: Biofuel, coal, fossil fuels, gas, oil

4. Please provide a visualization that highlights the relationship between:
    - Population size;
    - CO2 emissions per capita;
    - Renewable energy production.

    in 2017. Describe the visualization identifying groups and outliers.

In [51]:
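df_energy = pd.read_csv('datasets/eu_energy.csv')
df_energy = df_energy.rename(columns=lambda x: x.rstrip())  # rename() returns a new DataFrame
df_energy['population'] = df_energy['population'].astype(int)

# NOTE: this list matches the assignment's "clean" sources (nuclear instead of biofuel);
# the assignment's "renewable" sources are hydro, solar, wind and biofuel
renewable = ['hydro', 'solar', 'wind', 'nuclear']

df_energy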

Unnamed: 0,country,year,gdp,population,greenhouse_gas_emissions,net_elec_imports,biofuel_consumption,coal_consumption,fossil_fuel_consumption,gas_consumption,...,wind_consumption,biofuel_electricity,coal_electricity,fossil_electricity,gas_electricity,oil_electricity,nuclear_electricity,hydro_electricity,solar_electricity,wind_electricity
0,Austria,1900,2.743996e+10,5979177,,,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1,Austria,1901,2.754978e+10,6040558,,,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
2,Austria,1902,2.862871e+10,6102566,,,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
3,Austria,1903,2.889683e+10,6165209,,,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
4,Austria,1904,2.934634e+10,6228494,,,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2476,Sweden,2018,4.634074e+11,10162300,7.78,-17.22,16.489,23.032,186.873,10.169,...,43.980,11.91,0.34,3.66,0.38,2.94,68.55,62.21,0.41,16.62
2477,Sweden,2019,,10267922,7.92,-26.16,17.033,22.853,197.686,10.202,...,52.316,13.04,0.21,3.35,0.29,2.85,66.13,65.37,0.68,19.85
2478,Sweden,2020,,10368968,6.94,-25.00,15.520,18.894,178.590,12.717,...,72.291,11.18,0.00,2.44,0.10,2.34,49.20,72.39,1.05,27.53
2479,Sweden,2021,,10467095,7.87,-25.57,18.291,15.749,178.053,13.047,...,71.452,13.08,0.01,3.06,0.29,2.76,52.97,73.89,1.53,27.24


In [52]:
df_energy.columns

Index(['country', 'year', 'gdp', 'population', 'greenhouse_gas_emissions',
       'net_elec_imports', 'biofuel_consumption', 'coal_consumption',
       'fossil_fuel_consumption', 'gas_consumption', 'oil_consumption',
       'nuclear_consumption', 'hydro_consumption', 'solar_consumption',
       'wind_consumption', 'biofuel_electricity', 'coal_electricity',
       'fossil_electricity', 'gas_electricity', 'oil_electricity',
       'nuclear_electricity', 'hydro_electricity', 'solar_electricity',
       'wind_electricity'],
      dtype='object')

5. Please compute the renewable energy production percentage (one datapoint per country, per year). Then, create a visualization to investigate how the distribution of these values evolves over the years, from 2010 to 2017.

In [103]:
import math

renewable_column = 'renewable_electricity'
year = 2017
mega = math.pow(10, 6)
tera = math.pow(10, 12)

df_energy[renewable_column] = sum(df_energy[x + '_electricity'].fillna(0) for x in renewable)
# Emissions are in megatonnes, so multiplying by 10^6 gives tonnes per capita
df_energy['greenhouse_gas_emissions_rate'] = (df_energy['greenhouse_gas_emissions'] * mega)/df_energy['population']
# Electricity is in TWh (1 TWh = 10^6 MWh), so this is MWh per capita
df_energy[renewable_column + '_rate'] = (df_energy[renewable_column] * mega)/df_energy['population']
df_energy_2017 = df_energy.query(f'year == {year}').copy()  # independent copy of the 2017 rows
df_energy_2017

Unnamed: 0,country,year,gdp,population,greenhouse_gas_emissions,net_elec_imports,biofuel_consumption,coal_consumption,fossil_fuel_consumption,gas_consumption,...,fossil_electricity,gas_electricity,oil_electricity,nuclear_electricity,hydro_electricity,solar_electricity,wind_electricity,renewable_electricity,greenhouse_gas_emissions_rate,renewable_electricity_rate
117,Austria,2017,373238000000.0,8797497,11.56,6.55,5.569,36.49,273.535,90.699,...,16.37,10.91,3.7,0.0,38.29,1.27,6.57,46.13,1.31401,5.243537
240,Belgium,2017,447723800000.0,11384491,16.06,6.02,5.454,35.915,573.158,164.115,...,26.9,23.02,3.79,42.23,0.27,3.31,6.52,52.33,1.410691,4.596604
363,Bulgaria,2017,128228600000.0,7182430,18.82,-5.48,0.0,71.0,158.779,32.137,...,23.25,1.93,0.4,15.55,2.83,1.4,1.5,21.28,2.620283,2.962786
396,Croatia,2017,82748780000.0,4192468,3.04,6.95,0.0,4.563,75.561,28.997,...,4.67,3.09,0.21,0.0,5.31,0.08,1.2,6.59,0.72511,1.571866
454,Cyprus,2017,26706790000.0,1208527,3.22,0.0,0.0,0.035,31.31,0.0,...,4.57,0.0,4.57,0.0,0.0,0.17,0.21,0.38,2.664401,0.314432
512,Czechia,2017,317076100000.0,10531315,39.3,-13.04,0.0,182.183,381.227,83.748,...,47.81,3.7,2.67,28.34,1.87,2.2,0.59,33.0,3.731728,3.133512
635,Denmark,2017,262051100000.0,5737286,8.42,4.56,0.0,18.22,138.447,32.162,...,9.23,2.02,1.0,0.0,0.02,0.75,14.78,15.55,1.467593,2.710341
673,Estonia,2017,33800140000.0,1317550,8.16,-2.73,0.0,50.302,72.885,4.889,...,11.3,0.06,11.22,0.0,0.03,0.01,0.72,0.76,6.193313,0.576828
731,Finland,2017,211331200000.0,5508146,12.64,20.43,4.036,46.646,175.713,18.324,...,13.17,3.3,4.01,22.48,14.77,0.05,4.8,42.1,2.294783,7.643225
854,France,2017,2536203000000.0,64144092,47.12,-40.13,33.204,106.46,1437.183,447.657,...,65.09,40.5,11.78,398.36,50.0,9.59,24.61,482.56,0.734596,7.523062
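
As a check on the units, the per-capita columns can be reproduced by hand for one row. The sketch below uses the Austria 2017 values from the table above, assuming (as the tooltips in the next cell state) that emissions are in megatonnes and electricity in TWh.

```python
MEGA = 10**6

# Austria, 2017 (values copied from the table above)
population = 8_797_497
emissions_mt = 11.56      # megatonnes
renewable_twh = 46.13     # TWh, as computed by the cell above

co2_per_capita = emissions_mt * MEGA / population         # tonnes per capita
renewable_per_capita = renewable_twh * MEGA / population   # MWh per capita (1 TWh = 10^6 MWh)

print(round(co2_per_capita, 3), round(renewable_per_capita, 3))
```

Both results agree with the `greenhouse_gas_emissions_rate` and `renewable_electricity_rate` columns shown for Austria.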


In [65]:
df_energy_2017.sort_values(by=renewable_column, ascending=False)

source = ColumnDataSource(data={
    'country': df_energy_2017['country'].tolist(),
    'population': df_energy_2017['population'].tolist(),
    'co2': df_energy_2017['greenhouse_gas_emissions'].tolist(),
    renewable_column : df_energy_2017[renewable_column].tolist(),
    'co2_rate' : df_energy_2017['greenhouse_gas_emissions_rate'].tolist(),
    renewable_column + '_rate' : df_energy_2017[renewable_column + '_rate'].tolist()
})

TOOLTIPS_CO2 = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px;">
        <span style="font-size: 12px; color: steelblue;">Country:</span> @country<br>
        <span style="font-size: 12px; color: steelblue;">CO2 Total:</span> @co2{0.00} Megatonnes<br>
        <span style="font-size: 12px; color: steelblue;">CO2 per Capita:</span> @co2_rate{0.00} tonnes per capita<br>
    </div>
"""

TOOLTIPS_RENEW = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px;">
        <span style="font-size: 12px; color: steelblue;">Country:</span> @country<br>
        <span style="font-size: 12px; color: steelblue;">Renewable Production Total:</span> @renewable_electricity{0.00} TWh<br>
        <span style="font-size: 12px; color: steelblue;">Renewable per Capita:</span> @renewable_electricity_rate{0.00} MWh per capita<br>
    </div>
"""

df_energy_2017 = df_energy_2017.sort_values(by='greenhouse_gas_emissions_rate', ascending=True)

plot_co2 = figure(
    y_range=df_energy_2017.country.tolist(),
    height=400,
    width=800,
    tooltips=TOOLTIPS_CO2,
    title="CO2 emissions per capita in 2017",
    x_axis_label="CO2 emissions per capita [tonnes per capita]",
    y_axis_label="Country"
)

plot_co2.hbar(
    y='country',
    right='co2_rate',
    source=source,
    height=0.85,
)

plot_co2.toolbar.logo = None
plot_co2.toolbar_location = None
plot_co2.xgrid.grid_line_color = None
plot_co2.xaxis[0].formatter = NumeralTickFormatter(format="0,0")
plot_co2.yaxis.minor_tick_line_color = None
plot_co2.xaxis.minor_tick_line_color = None
plot_co2.x_range.start = 0

show(plot_co2)

df_energy_2017 = df_energy_2017.sort_values(by=renewable_column + '_rate', ascending=True)

plot_renew = figure(
    y_range=df_energy_2017.country.tolist(),
    height=400,
    width=800,
    tooltips=TOOLTIPS_RENEW,
    title="Renewable energy production per capita in 2017",
    x_axis_label="Renewable energy production per capita [MWh per capita]",
    y_axis_label="Country"
)

plot_renew.hbar(
    y='country',
    right=renewable_column + '_rate',
    source=source,
    height=0.85,
)

plot_renew.toolbar.logo = None
plot_renew.toolbar_location = None
plot_renew.xgrid.grid_line_color = None
plot_renew.xaxis[0].formatter = NumeralTickFormatter(format="0,0")
plot_renew.yaxis.minor_tick_line_color = None
plot_renew.xaxis.minor_tick_line_color = None
plot_renew.x_range.start = 0

show(plot_renew)



6. Please provide visualizations that show the evolution over the years (from 1990 to 2020) of:
    - Renewable energy production per capita for each country
    - Clean energy production per capita for each country
    - Net import per capita for each country

    Are there countries that behave differently from the others?

    *Please note that the goal of the visualization is not to compare all the countries with each other but to identify which ones present different trends compared to all the others.*

In [70]:
from bokeh.transform import dodge

# Combined tooltip
TOOLTIPS = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px; font-size: 12px;">
        <strong>Country:</strong> @country<br>
        <span style="color: indianred;">CO₂ per capita:</span> @co2_rate{0.00} tonnes<br>
        <span style="color: seagreen;">Renewable per capita:</span> @renewable_electricity_rate{0.00} MWh<br>
    </div>
"""

# Create figure
p = figure(
    y_range=source.data['country'],
    height=600,
    width=1000,
    tooltips=TOOLTIPS,
    title="2017 Energy Metrics Comparison",
    x_axis_label="Per Capita Values",
    y_axis_label="Country",
    toolbar_location=None
)

# Add bars with dodge offset
p.hbar(y=dodge('country', -0.15, range=p.y_range), 
       right='co2_rate', 
       height=0.3, 
       source=source,
       color='indianred', 
       legend_label="CO₂ Emissions")

p.hbar(y=dodge('country', 0.15, range=p.y_range), 
       right=renewable_column+'_rate', 
       height=0.3, 
       source=source,
       color='seagreen', 
       legend_label="Renewable Energy")

# Styling
p.x_range.start = 0
p.xaxis.formatter = NumeralTickFormatter(format="0.00")
p.ygrid.grid_line_color = None
p.legend.location = "top_right"
p.legend.orientation = "horizontal"
p.legend.label_text_font_size = '10pt'

show(p)



In [112]:
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure
import pandas as pd

def box_plot(df: pd.DataFrame, metrics: list, main_metric: str, main_value: int, 
             title: str, y_label: str) -> figure:
    # Filter data for the specified year
    df = df[df[main_metric] == main_value].copy()
    
    # Convert metrics to list if single string
    metrics = [metrics] if isinstance(metrics, str) else metrics
    
    # Create figure with metrics as x-axis categories
    p = figure(x_range=metrics, tools="", toolbar_location=None,
               title=title, background_fill_color="#eaefef",
               y_axis_label=y_label, x_axis_label="Metrics")

    # Define consistent width for boxes
    width = 0.4

    for idx, metric in enumerate(metrics):
        # Calculate statistics
        q1 = df[metric].quantile(0.25)
        q2 = df[metric].quantile(0.5)
        q3 = df[metric].quantile(0.75)
        iqr = q3 - q1
        upper = q3 + 1.5 * iqr
        lower = q1 - 1.5 * iqr

        # Create source for boxes and whiskers
        stats_source = ColumnDataSource(data={
            'metric': [metric],
            'q1': [q1],
            'q2': [q2],
            'q3': [q3],
            'upper': [upper],
            'lower': [lower]
        })

        # Create whiskers
        whisker = Whisker(base='metric', upper='upper', lower='lower', 
                         source=stats_source, level="annotation")
        whisker.upper_head.size = whisker.lower_head.size = 15
        p.add_layout(whisker)

        # Create box plot bars
        p.vbar(x='metric', width=width, bottom='q1', top='q2', 
              source=stats_source, fill_color="seagreen", line_color="black")
        p.vbar(x='metric', width=width, bottom='q2', top='q3', 
              source=stats_source, fill_color="seagreen", line_color="black")

        # Plot outliers
        outliers = df[~df[metric].between(lower, upper)][metric]
        if not outliers.empty:
            p.scatter(x=[metric]*len(outliers), y=outliers, 
                    size=6, color="black", alpha=0.3)

    # Visual styling
    p.xgrid.grid_line_color = None
    p.axis.major_label_text_font_size = "14px"
    p.axis.axis_label_text_font_size = "12px"
    
    return p
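
The statistics behind `box_plot` can be checked on a small list without Bokeh. This is a minimal sketch on toy values; it uses `statistics.quantiles` with `method="inclusive"`, which for this example gives the same quartiles as linear interpolation between data points.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 100]  # toy data

q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences, as in box_plot

# Points outside the fences are drawn as outliers
outliers = [v for v in values if v < lower or v > upper]

# A common alternative: end the whiskers at the most extreme points inside the fences
whisker_low = min(v for v in values if v >= lower)
whisker_high = max(v for v in values if v <= upper)

print(q1, q2, q3, lower, upper, outliers, whisker_low, whisker_high)
```

`box_plot` draws the whiskers at the fences themselves. Ending them at the most extreme data points inside the fences is a common alternative convention and keeps the whiskers within the observed values.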

In [114]:
metrics_energy = ['greenhouse_gas_emissions_rate', renewable_column + '_rate']
metrics_energy_titles = {
    metrics_energy[0]: "CO2 emissions per capita in 2017",
    metrics_energy[1]: "Production per capita of Renewable Energy in 2017"
}
metrics_energy_y = {
    metrics_energy[0]: "CO2 emissions per capita [tonnes per capita]",
    metrics_energy[1]: "Renewable Energy per capita production [MWh per capita]"
}

energy_bar_plot = box_plot(df_energy, metrics_energy, 'year', year, "Values per capita in 2017", None)

show(energy_bar_plot)

## Section 3 - Graph Redesign and Analysis (15 points) 📊

**Data Source:** `market_value_decline.csv`

The 2008 financial crisis had a significant impact on banks worldwide, leading to substantial losses in market value. The following graph compares the market value of major banks in 2007 (pre-crisis) and 2009 (post-crisis), using blue to represent their value before the meltdown and green to represent their value after.

The **primary** goal of this visualization is to highlight the extent of losses suffered by each bank, while also drawing attention to J.P. Morgan’s relatively minor decline compared to its peers. The **secondary** goal is to illustrate the overall ranking of banks by market value, showing their relative sizes before and after the crisis.

Does this visualization effectively convey both the absolute losses and the percentage changes in market value? Does it allow for an easy comparison of which banks retained the most value relative to their original size?

1. Evaluate the effectiveness of the graph in communicating the market value losses and the relative sizes of the banks. What improvements can be made?
2. Propose a visualization that better captures both the absolute and relative losses per bank. Should we emphasize the percentage decline more? Should we use a different chart type?
3. Implement your proposed visualization using the *market_value_decline* dataset.

**Exercise Submission Requirements:**
1. `Written analysis` of the original graph's shortcomings: Please examine the existing graph and identify any issues that hinder its ability to clearly convey the intended quantitative message.
2. `Justifications` for the proposed improvements: For each issue you identify, please discuss potential improvements or alternative visualization techniques that might resolve these issues.
3. `Redesigned graph` that better communicates the data. Be sure to explain how your redesign enhances data interpretation and achieves the intended objectives more effectively.

![exercise3.png](exercise3.png)




### 1 Written analysis

The objective of the chart is to show how the market value of each bank changed between the period before and the period after the 2008 crisis.

1. The main problem is that humans are not good at comparing areas. This forces the chart to print the value inside each circle, which lowers its `data-ink ratio`.

2. Another problem is that, according to the legend, the comparison is between a quarter of 2007 and a single month of 2009. Comparing different kinds of period is a way of manipulating the data so that the crisis stands out. The title does not mention the periods at all, which hides this difference even further.

3. The color choice can be a problem for blue-blind readers, who see the chart as nearly monochromatic, as you can see in the image below:

4. A less obvious problem is that the `2007` values are integers while the `2009` values are floating point numbers, so the printed numbers are not directly comparable and the reader has to rely on the graph.

5. Last but not least, there is the way the bank names are drawn: they are rotated by 45 degrees, and the center of each name is aligned with the center of the chart itself.

![blueblind-exercise3.png](blueblind-exercise3.png)
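
Point `1.` can be made concrete with a little arithmetic. Assuming the circles encode market value by area (an assumption about the original chart), the diameters shrink only with the square root of the values, so large losses look smaller than they are. The sketch uses the RBS values from the dataset.

```python
import math

value_2007, value_2009 = 120, 4.6   # RBS market values from the dataset

area_ratio = value_2007 / value_2009      # how many times larger the 2007 area is
diameter_ratio = math.sqrt(area_ratio)    # how many times larger the 2007 diameter is

print(f"area ratio: {area_ratio:.1f}x, diameter ratio: {diameter_ratio:.1f}x")
```

A value that fell to roughly one twenty-sixth is drawn as a circle whose diameter is only about five times smaller, whereas in a bar chart the bar length shrinks by the full factor.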

### 2 Justifications
A horizontal bar plot solves both points `1.` and `5.`: bar lengths are easier to compare than areas, which makes the market values easier to read and increases the `data-ink ratio`, and the bank names can be written horizontally next to the bars.

To solve issue `2.`, I would need the original data with the same kind of period (only months or only quarters) for both years, which unfortunately the given dataset does not provide.

To solve issue `4.`, I can convert all the values to integers by rounding the `2009` values.

For the remaining issue (`3.`), a more suitable pair of colors (such as blue and red) can be chosen.

### 3 Redesigned Graph

In [121]:
df_bank = pd.read_csv('datasets/market_value_decline.csv')
df_bank

Unnamed: 0,;market_value_2007;market_value_2009
0,Morgan Stanley;49;16.0
1,RBS;120;4.6
2,Deutsche Bank;76;10.3
3,Credit Agricole;67;17.0
4,Societe Generale;80;26.0
5,Barclays;91;7.4
6,BNP Paribas;108;32.5
7,Unicredit;93;26.0
8,UBS;100;35.0
9,Credit Suisse;100;27.0


First, since the `csv` file is not well formatted (it uses `;` as separator and has no name for the first column), I am going to reformat it.

In [129]:
with open('datasets/market_value_decline.csv', 'r') as file:
    lines = file.readlines()

lines[0] = "bank" + lines[0]

for i in range(len(lines)):
    lines[i] = lines[i].replace(';', ',')

with open('datasets/cleaned_market_data.csv', 'w') as file:
    file.writelines(lines)
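
An alternative that avoids writing a second file is to let pandas parse the `;` separator directly and then name the first column. This is a minimal sketch on an in-memory copy of the first rows; `df_bank_alt` is a name introduced only for this example.

```python
from io import StringIO

import pandas as pd

# First rows of the file, with the same header layout (empty first column name)
raw = (
    ";market_value_2007;market_value_2009\n"
    "Morgan Stanley;49;16.0\n"
    "RBS;120;4.6\n"
)

df_bank_alt = pd.read_csv(StringIO(raw), sep=";")
# The unnamed first column holds the bank names
df_bank_alt.columns = ["bank", *df_bank_alt.columns[1:]]
print(df_bank_alt)
```

With the real file, the `StringIO(raw)` argument would be replaced by the path `'datasets/market_value_decline.csv'`.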

In [131]:
df_bank = pd.read_csv('datasets/cleaned_market_data.csv')
df_bank = df_bank.rename(columns=lambda x: x.rstrip())  # rename() returns a new DataFrame
df_bank

Unnamed: 0,bank,market_value_2007,market_value_2009
0,Morgan Stanley,49,16.0
1,RBS,120,4.6
2,Deutsche Bank,76,10.3
3,Credit Agricole,67,17.0
4,Societe Generale,80,26.0
5,Barclays,91,7.4
6,BNP Paribas,108,32.5
7,Unicredit,93,26.0
8,UBS,100,35.0
9,Credit Suisse,100,27.0


In [144]:
df_bank['market_value_2009'] = round(df_bank['market_value_2009'])
df_bank['absolute_increment'] = df_bank['market_value_2009'] - df_bank['market_value_2007']
df_bank['relative_increment'] = (df_bank['absolute_increment'] / df_bank['market_value_2007']) * 100
df_bank.sort_values(by=['bank'], ascending=False, inplace=True)

source = ColumnDataSource(data={
    "bank": df_bank.bank,
    "2007" : df_bank.market_value_2007,
    "2009": df_bank.market_value_2009,
    "absolute" : df_bank.absolute_increment,
    "relative" : df_bank.relative_increment
})

TOOLTIPS = """
    <div style="background-color: #f9f9f9; padding: 8px; border-radius: 8px; font-size: 12px;">
        <strong>Bank:</strong> @bank<br>
        <span style="color: lightcoral;"><strong>Market Cap 2007:</strong></span> @2007{0.00}<br>
        <span style="color: steelblue;"><strong>Market Cap 2009:</strong></span> @2009{0.00}<br>
        <span style="color: black;"><strong>Increment:</strong></span> @absolute{0} (@relative{0.00}%)<br>
    </div>
"""

# Create figure
p = figure(
    y_range=source.data['bank'],
    height=600,
    width=1000,
    tooltips=TOOLTIPS,
    title="Banks: Market Cap Comparison (2007 vs 2009)",
    x_axis_label=None,
    y_axis_label=None,
)

# Add bars with dodge offset
p.hbar(y=dodge('bank', 0.15, range=p.y_range), 
       right='2007', 
       height=0.3, 
       source=source,
       color='lightcoral', 
       legend_label="Market Cap 2007")

p.hbar(y=dodge('bank', -0.15, range=p.y_range), 
       right='2009', 
       height=0.3, 
       source=source,
       color='steelblue', 
       legend_label="Market Cap 2009")

# Styling
p.x_range.start = 0
p.xaxis.formatter = NumeralTickFormatter(format="0")
p.xaxis.minor_tick_line_color = None
p.ygrid.grid_line_color = None
p.legend.location = "bottom_right"
p.legend.orientation = "vertical"
p.legend.label_text_font_size = '10pt'

show(p)

## Section 4 - Geospatial Analysis (35 points) 🌍

**Data Source:** `airports.csv`, `countries.csv`, `routes.csv`, `europe.geojson`.

Please create an interactive map representation—focused on European countries—such that, when a country is selected, the map displays the flight balance (number of incoming flights - number of outgoing flights) between that country and all other European countries. The map should dynamically update based on the selected country, visually representing the extent to which each country is a net sender or receiver of flights.

**Hints**:
1. If `A` is a GeoDataFrame and `B` a DataFrame, the result of `A.merge(B,..)` is a GeoDataFrame, whereas the result of `B.merge(A,..)` is a DataFrame. The function `to_json()` on a DataFrame with a geometry column does **not** work.
2. When updating the map, to access the color mapper you can use the following method: `color_mapper = p.select_one(LinearColorMapper)`, where `p` is the figure.
3. You can discard Guernsey and Gibraltar, which are not present in the GeoJSON file.
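
The flight balance defined above (incoming minus outgoing) can be sketched with plain pandas before any map is drawn. This is a minimal illustration, not a reference solution: the toy airports and routes are invented, the column names are assumed to match the field names in the **Datasets Description** section (the actual CSV headers may differ), the helper `flight_balance` is hypothetical, and each route is counted as one flight.

```python
import pandas as pd

# Toy data (invented for illustration, not taken from the assignment files).
airports = pd.DataFrame({
    "IATA": ["ZRH", "GVA", "FCO", "MXP", "CDG"],
    "Country": ["Switzerland", "Switzerland", "Italy", "Italy", "France"],
})
routes = pd.DataFrame({
    "Source airport": ["ZRH", "GVA", "FCO", "MXP", "CDG", "ZRH", "FCO"],
    "Destination airport": ["FCO", "MXP", "ZRH", "CDG", "ZRH", "CDG", "MXP"],
})


def flight_balance(routes, airports, selected):
    """Incoming minus outgoing route counts between `selected` and each other country."""
    country_of = airports.set_index("IATA")["Country"]
    # Translate every route into a (source country, destination country) pair;
    # routes whose airport code is unknown are dropped.
    pairs = pd.DataFrame({
        "src": routes["Source airport"].map(country_of),
        "dst": routes["Destination airport"].map(country_of),
    }).dropna()
    # Domestic routes do not change the balance with other countries.
    pairs = pairs[pairs["src"] != pairs["dst"]]
    incoming = pairs.loc[pairs["dst"] == selected, "src"].value_counts()
    outgoing = pairs.loc[pairs["src"] == selected, "dst"].value_counts()
    # Countries missing on one side count as zero on that side.
    return incoming.sub(outgoing, fill_value=0).astype(int).rename("balance")


balance = flight_balance(routes, airports, "Switzerland")
print(balance)
```

A negative value means the selected country sends more routes to that country than it receives from it. The resulting Series can then be merged into the GeoDataFrame (keeping the GeoDataFrame on the left, as in hint 1) to drive the color mapper.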

## Datasets Description

You can find the datasets in the `datasets` folder. Their descriptions are provided below.

### Used Cars

The content of the dataset is in German, but this should not pose critical issues in understanding the data. Each entry contains the following information.

| **Field**                    | **Description** |
|------------------------------|---------------|
| **dateCrawled**               | When this ad was first crawled. All field values are taken from this date. |
| **name**                      | The name of the car. |
| **seller**                    | Seller type (private or dealer). |
| **offerType**                 | Type of offer. |
| **price**                     | The price in euros for the car on the ad. |
| **abtest**                    | Type of test. |
| **vehicleType**               | Type of vehicle. |
| **yearOfRegistration**        | The year the car was first registered. |
| **gearbox**                   | Type of gearbox. |
| **powerPS**                   | Power of the car in PS (metric horsepower). |
| **model**                     | Model of the car. |
| **kilometer**                 | How many kilometers the car has driven. |
| **monthOfRegistration**       | The month the car was first registered. |
| **fuelType**                  | Vehicle fuel type. |
| **brand**                     | Vehicle brand. |
| **notRepairedDamage**         | If the car has any damage that has not been repaired yet. |
| **dateCreated**               | The date the ad was created on eBay. |
| **nrOfPictures**              | Number of pictures in the ad. |
| **postalCode**                | Postal code associated with the ad. |
| **lastSeenOnline**            | When the crawler last saw this ad online. |


### US Accidents

| **Field**              | **Description** |
|------------------------|---------------|
| **ID** | Unique identifier of the accident record. |
| **Severity** | Severity of the accident (1-4), where 1 indicates the least impact on traffic and 4 indicates significant impact. |
| **Start_Time** | Start time of the accident in local time zone. |
| **End_Time** | End time of the accident in local time zone (when the impact on traffic flow was dismissed). |
| **Start_Lat** | Latitude in GPS coordinate of the start point. |
| **Start_Lng** | Longitude in GPS coordinate of the start point. |
| **End_Lat** | Latitude in GPS coordinate of the end point. |
| **End_Lng** | Longitude in GPS coordinate of the end point. |
| **Distance(mi)** | Length of the road extent affected by the accident. |
| **Description** | Natural language description of the accident. |
| **Number** | Street number in address field. |
| **Street** | Street name in address field. |
| **Side** | Relative side of the street (Right/Left) in address field. |
| **City** | City in address field. |
| **County** | County in address field. |
| **State** | State in address field. |
| **Zipcode** | Zipcode in address field. |
| **Country** | Country in address field. |
| **Timezone** | Timezone based on the location of the accident (eastern, central, etc.). |
| **Airport_Code** | Closest airport-based weather station to the accident location. |
| **Weather_Timestamp** | Timestamp of weather observation record (in local time). |
| **Temperature(F)** | Temperature (in Fahrenheit). |
| **Wind_Chill(F)** | Wind chill (in Fahrenheit). |
| **Humidity(%)** | Humidity (in percentage). |
| **Pressure(in)** | Air pressure (in inches). |
| **Visibility(mi)** | Visibility (in miles). |
| **Wind_Direction** | Wind direction. |
| **Wind_Speed(mph)** | Wind speed (in miles per hour). |
| **Precipitation(in)** | Precipitation amount in inches, if any. |
| **Weather_Condition** | Weather condition (rain, snow, thunderstorm, fog, etc.). |
| **Amenity** | POI annotation indicating presence of an amenity nearby. |
| **Bump** | POI annotation indicating presence of a speed bump or hump nearby. |
| **Crossing** | POI annotation indicating presence of a crossing nearby. |
| **Give_Way** | POI annotation indicating presence of a give-way sign nearby. |
| **Junction** | POI annotation indicating presence of a junction nearby. |
| **No_Exit** | POI annotation indicating presence of a no-exit nearby. |
| **Railway** | POI annotation indicating presence of a railway nearby. |
| **Roundabout** | POI annotation indicating presence of a roundabout nearby. |
| **Station** | POI annotation indicating presence of a station nearby. |
| **Stop** | POI annotation indicating presence of a stop sign nearby. |
| **Traffic_Calming** | POI annotation indicating presence of traffic calming measures nearby. |
| **Traffic_Signal** | POI annotation indicating presence of a traffic signal nearby. |
| **Turning_Loop** | POI annotation indicating presence of a turning loop nearby. |
| **Sunrise_Sunset** | Period of day (day or night) based on sunrise/sunset. |
| **Civil_Twilight** | Period of day (day or night) based on civil twilight. |
| **Nautical_Twilight** | Period of day (day or night) based on nautical twilight. |
| **Astronomical_Twilight** | Period of day (day or night) based on astronomical twilight. |


### Energy Data

| **Field**                | **Description** |
|---------------------------|-----------------|
| **country**               | Geographic location. |
| **year**                  | Year of observation. |
| **gdp**                   | Gross domestic product, adjusted for inflation and for differences in the cost of living between countries. |
| **population**            | Population by country, based on data and estimates from different sources. |
| **greenhouse_gas_emissions** | Emissions from electricity generation. Measured in megatonnes of CO₂ equivalents. |
| **net_elec_imports**      | Net electricity imports. Electricity imports minus exports, measured in TWh. |
| **biofuel_consumption**   | Primary energy consumption from biofuels. Measured in terawatt-hours. |
| **coal_consumption**      | Primary energy consumption from coal. Measured in terawatt-hours. |
| **fossil_fuel_consumption** | Primary energy consumption from fossil fuels. Measured in terawatt-hours. |
| **gas_consumption**       | Primary energy consumption from gas. Measured in terawatt-hours. |
| **oil_consumption**       | Primary energy consumption from oil. Measured in terawatt-hours. |
| **nuclear_consumption**   | Primary energy consumption from nuclear power. Measured in terawatt-hours, using the substitution method. |
| **hydro_consumption**     | Primary energy consumption from hydropower. Measured in terawatt-hours, using the substitution method. |
| **solar_consumption**     | Primary energy consumption from solar power. Measured in terawatt-hours, using the substitution method. |
| **wind_consumption**      | Primary energy consumption from wind power. Measured in terawatt-hours, using the substitution method. |
| **biofuel_electricity**   | Electricity generation from bioenergy. Measured in terawatt-hours. |
| **coal_electricity**      | Electricity generation from coal. Measured in terawatt-hours. |
| **fossil_electricity**    | Electricity generation from fossil fuels. Measured in terawatt-hours. |
| **gas_electricity**       | Electricity generation from gas. Measured in terawatt-hours. |
| **oil_electricity**       | Electricity generation from oil. Measured in terawatt-hours. |
| **nuclear_electricity**   | Electricity generation from nuclear. Measured in terawatt-hours. |
| **hydro_electricity**     | Electricity generation from hydropower. Measured in terawatt-hours. |
| **solar_electricity**     | Electricity generation from solar power. Measured in terawatt-hours. |
| **wind_electricity**      | Electricity generation from wind power. Measured in terawatt-hours. |



### Airports

As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe. Each entry contains the following information:

| **Field**                 | **Description** |
|---------------------------|---------------|
| **Airport ID** | Unique OpenFlights identifier for this airport. |
| **Name** | Name of the airport. May or may not contain the city name. |
| **City** | Main city served by the airport. May be spelled differently from the name. |
| **Country** | Country or territory where the airport is located. Can be cross-referenced with ISO 3166-1 codes. |
| **IATA** | 3-letter IATA code. Null if not assigned/unknown. |
| **ICAO** | 4-letter ICAO code. Null if not assigned/unknown. |
| **Latitude** | Decimal degrees, usually to six significant digits. Negative is South, positive is North. |
| **Longitude** | Decimal degrees, usually to six significant digits. Negative is West, positive is East. |
| **Altitude** | Altitude in feet. |
| **Timezone** | Hours offset from UTC. Fractional hours are expressed as decimals (e.g., India is 5.5). |
| **DST** | Daylight savings time classification: E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None), or U (Unknown). |
| **Tz database time zone** | Timezone in "tz" (Olson) format (e.g., "America/Los_Angeles"). |
| **Type** | Type of the airport. Value is "airport" for air terminals. |
| **Source** | Source of the data. "OurAirports" for data sourced from OurAirports. |


### Routes

As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67663 routes between 3321 airports on 548 airlines spanning the globe. \
Each entry contains the following information.

| **Field**                | **Description** |
|--------------------------|---------------|
| **Airline** | 2-letter (IATA) or 3-letter (ICAO) code of the airline. |
| **Airline ID** | Unique OpenFlights identifier for the airline. |
| **Source airport** | 3-letter (IATA) or 4-letter (ICAO) code of the source airport. |
| **Source airport ID** | Unique OpenFlights identifier for the source airport. |
| **Destination airport** | 3-letter (IATA) or 4-letter (ICAO) code of the destination airport. |
| **Destination airport ID** | Unique OpenFlights identifier for the destination airport. |
| **Codeshare** | "Y" if the flight is a codeshare (operated by another carrier), empty otherwise. |
| **Stops** | Number of stops on the flight ("0" for direct). |
| **Equipment** | 3-letter codes for plane type(s) generally used on this flight, separated by spaces. |


The data is UTF-8 encoded. The special value `\N` stands for "NULL", indicating that no value is available, and is understood automatically by MySQL on import.


<aside>
💡 Notes:

- Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately.
- Routes where one carrier operates both its own and codeshare flights are listed only once.
</aside>
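
pandas does not treat `\N` as missing by default, so the marker has to be declared when the file is loaded. The sketch below shows one way to do that with the `na_values` parameter of `pd.read_csv`. The two data rows are an invented excerpt, and the header is assumed to follow the field names in the table above, so the real `routes.csv` may need different column names.

```python
import io
import pandas as pd

# Invented two-row excerpt in the routes layout; "\\N" in the Python source
# is the two-character marker \N in the data.
raw = io.StringIO(
    "Airline,Airline ID,Source airport,Source airport ID,"
    "Destination airport,Destination airport ID,Codeshare,Stops,Equipment\n"
    "2B,410,AER,2965,KZN,2990,,0,CR2\n"
    "ZZ,\\N,XXX,\\N,KZN,2990,Y,0,\\N\n"
)

# Declare \N as an additional missing-value marker (a raw string avoids
# Python interpreting the backslash as an escape sequence).
routes = pd.read_csv(raw, na_values=[r"\N"])

# Count the missing entries per column to check that the marker was recognised.
print(routes.isna().sum())
```

Without `na_values`, the ID columns would be read as strings because of the `\N` entries, which makes later joins on airport IDs fail silently.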


### Countries

This dataset contains the information related to European countries. 