# Create Combined Dataframe

#### Advantages of Having a Combined Dataset for EDA

Building a combined dataframe during the data cleaning phase offers several advantages for Exploratory Data Analysis (EDA):

1. **Comprehensive Analysis:** Combining multiple datasets allows for a more comprehensive analysis of the variables and their relationships. It provides a holistic view of the data, enabling the identification of patterns and trends that may not be apparent when analyzing individual datasets separately. This comprehensive analysis can lead to more accurate and meaningful insights.

2. **Enhanced Data Integrity:** By merging multiple datasets into a combined dataframe, it becomes easier to address data integrity issues. Inconsistencies, missing values, or outliers present in separate datasets can be better identified and resolved when they are all integrated into a single dataset. This enhances the overall data quality and reliability, enabling more robust analysis during the EDA process.

3. **Efficient Data Exploration:** Having a combined dataset simplifies the exploration process. Instead of switching between different datasets, analysts can focus on a single consolidated dataset, reducing the need for repetitive data manipulation and merging operations. This efficiency allows for more time to be spent on actual analysis, hypothesis testing, and uncovering meaningful insights.

4. **Increased Statistical Power:** Where the datasets contribute additional observations, combining them increases the sample size and, with it, the statistical power of the analysis. A larger sample makes it easier to detect genuine relationships, correlations, or patterns within the data, which supports more reliable conclusions and more confident decision-making.

5. **Contextual Understanding:** A combined dataset enables the incorporation of relevant contextual information. For example, merging crime data with COVID-19 statistics in Italy would provide a broader context for understanding potential relationships between the two variables. This contextual understanding is crucial for generating insights that go beyond isolated data points and help explain the interdependencies between different factors.

In summary, having a combined dataset for EDA offers advantages such as comprehensive analysis, enhanced data integrity, efficient exploration, increased statistical power, and contextual understanding. It facilitates a more robust and meaningful analysis, enabling researchers to gain valuable insights into the impact of COVID-19 on crime in Italy.

## Regional Data

Import dependencies.

In [125]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import matplotlib.pyplot as plt
# import matplotlib.colors as colors
# import matplotlib.cm as cm
# from matplotlib.animation import FuncAnimation
import cartopy.crs as ccrs
import seaborn as sns
import plotly.express as px
# import plotly.subplots as sp
# import plotly.graph_objects as go
# import plotly.io as pio

# Set pandas options
pd.set_option('display.max_columns', None)

## Data Sets
For the purpose of this project, we will be looking at the following datasets:  
- Covid-19 data from [Dipartimento della Protezione Civile (DPC)](https://www.protezionecivile.gov.it/en/) - External Link
    - [dpc-covid19-ita-regioni-latest.json](../../data/Covid/dpc-covid19-ita-regioni-latest.json) - Internal Link
    - [dati-regioni](../../data/Covid/dati-regioni/) - Internal Link
- Crime data from [Istat](https://www.istat.it/en/) - External Link
    - [crime_type_by_year_cleaned.csv](../../data/crime_type_by_year_cleaned.csv) - Internal Link
- Unemployment data from [Istat](https://www.istat.it/en/) - External Link
    - [Unemployment_by_Region_clean.csv](../../data/Unemployment_by_Region_clean.csv) - Internal Link
- Population data from [Istat](https://www.istat.it/en/) - External Link
    - [Unemployment_by_Region_clean.csv](../../data/Unemployment_by_Region_clean.csv) - Internal Link

## Data Specifications
- Years: 2017 - 2021
- Regions: 
    1. Piemonte
    2. Valle d'Aosta/Vallée d'Aoste
    3. Lombardia
    4. Trentino-Alto Adige/Südtirol
    5. Veneto
    6. Friuli-Venezia Giulia
    7. Liguria
    8. Emilia-Romagna
    9. Toscana
    10. Umbria
    11. Marche
    12. Lazio
    13. Abruzzo
    14. Molise
    15. Campania
    16. Puglia
    17. Basilicata
    18. Calabria
    19. Sicilia
    20. Sardegna
- Column Names: All lowercase with underscores

## Covid-19 Data
Covid-19 data refers to information related to the spread, impact, and management of the novel coronavirus disease. It includes data on confirmed cases, testing, demographics, geographical distribution, time series trends, and public health measures. This data is crucial for monitoring the pandemic, informing decision-making, and evaluating interventions.  
  
For the purpose of this notebook, we will be looking at the total number of confirmed cases across the Regions of Italy.

## Covid-19 Data from Dipartimento della Protezione Civile (DPC)

Let's take a look at the **Regional** data.  
  
The data is in JSON format, so we will need to convert it to a dataframe.  

In [126]:
# Load the json file
with open('../../data/Covid/dpc-covid19-ita-regioni-latest.json') as response:
    regions = json.load(response)

regions

[{'data': '2023-05-04T17:00:00',
  'stato': 'ITA',
  'codice_regione': 13,
  'denominazione_regione': 'Abruzzo',
  'lat': 42.35122196,
  'long': 13.39843823,
  'ricoverati_con_sintomi': 92,
  'terapia_intensiva': 4,
  'totale_ospedalizzati': 96,
  'isolamento_domiciliare': 3323,
  'totale_positivi': 3419,
  'variazione_totale_positivi': -47,
  'nuovi_positivi': 131,
  'dimessi_guariti': 650744,
  'deceduti': 3960,
  'casi_da_sospetto_diagnostico': None,
  'casi_da_screening': None,
  'totale_casi': 658123,
  'tamponi': 7483582,
  'casi_testati': 1372234,
  'note': "Il dato ''incremento casi confermati'' è composto da 131 , cioè 76 ''nuovi positivi'' e 55 ''reinfezioni''.",
  'ingressi_terapia_intensiva': 0,
  'note_test': None,
  'note_casi': None,
  'totale_positivi_test_molecolare': 250615,
  'totale_positivi_test_antigenico_rapido': 407508,
  'tamponi_test_molecolare': 2605598,
  'tamponi_test_antigenico_rapido': 4877984,
  'codice_nuts_1': 'ITF',
  'codice_nuts_2': 'ITF1'},
 {'data

Create a dataframe from the `json` file.

In [127]:
# Convert the json file to a dataframe
regions_df = pd.DataFrame(regions)

Let's take a look at the dataframe.

In [128]:
regions_df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,casi_da_sospetto_diagnostico,casi_da_screening,totale_casi,tamponi,casi_testati,note,ingressi_terapia_intensiva,note_test,note_casi,totale_positivi_test_molecolare,totale_positivi_test_antigenico_rapido,tamponi_test_molecolare,tamponi_test_antigenico_rapido,codice_nuts_1,codice_nuts_2
0,2023-05-04T17:00:00,ITA,13,Abruzzo,42.351222,13.398438,92,4,96,3323,3419,-47,131,650744,3960,,,658123,7483582,1372234,Il dato ''incremento casi confermati'' è compo...,0,,,250615,407508,2605598,4877984,ITF,ITF1
1,2023-05-04T17:00:00,ITA,17,Basilicata,40.639471,15.805148,26,2,28,8390,8418,1,11,191039,1027,,,200484,1341696,404577,Il dato relativo al numero dei “Casi in isolam...,0,,,71210,129274,702845,638851,ITF,ITF5
2,2023-05-04T17:00:00,ITA,18,Calabria,38.905976,16.594402,108,5,113,718,831,-106,89,632556,3412,,,636799,4337111,3371574,,1,,,202620,434179,1916246,2420865,ITF,ITF6
3,2023-05-04T17:00:00,ITA,15,Campania,40.839566,14.25085,174,8,182,19737,19919,0,347,2431492,11889,,,2463300,20827201,5408549,,0,,,955713,1507587,9602412,11224789,ITF,ITF3
4,2023-05-04T17:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,506,18,524,3181,3705,-57,305,2128147,19436,,,2151288,19545251,2982360,,3,,,1092316,1058972,10726333,8818918,ITH,ITH5


Translate the column names to English.

In [129]:
# Use a dictionary to rename the columns
regions_df = regions_df.rename(columns={
    'data': 'date', 'stato': 'state', 'codice_regione': 'reg_code', 'denominazione_regione': 'reg_name',
    'ricoverati_con_sintomi': 'symptons_hospitalised', 'terapia_intensiva': 'in_intensive_care',
    'totale_ospedalizzati': 'hospitalised', 'isolamento_domiciliare': 'home_isolation',
    'totale_positivi': 'positive', 'variazione_totale_positivi': 'variance',
    'nuovi_positivi': 'new_cases', 'dimessi_guariti': 'discharged', 'deceduti': 'deaths',
    'casi_da_sospetto_diagnostico': 'suspected', 'casi_da_screening': 'screened',
    'totale_casi': 'total_cases', 'tamponi': 'swabs', 'casi_testati': 'tested_cases', 'note': 'notes',
    'ingressi_terapia_intensiva': 'intensive_care_entrances', 'note_test': 'test_notes',
    'note_casi': 'cases_notes', 'totale_positivi_test_molecolare': 'molecular_positive',
    'totale_positivi_test_antigenico_rapido': 'antigen_positive', 'tamponi_test_molecolare': 'molecular_swabs',
    'tamponi_test_antigenico_rapido': 'antigen_swabs', 'codice_nuts_1': 'nuts_1_code',
    'codice_nuts_2': 'nuts_2_code', 'codice_nuts_3': 'nuts_3_code'
    })
regions_df.head()

Unnamed: 0,date,state,reg_code,reg_name,lat,long,symptons_hospitalised,in_intensive_care,hospitalised,home_isolation,positive,variance,new_cases,discharged,deaths,suspected,screened,total_cases,swabs,tested_cases,notes,intensive_care_entrances,test_notes,cases_notes,molecular_positive,antigen_positive,molecular_swabs,antigen_swabs,nuts_1_code,nuts_2_code
0,2023-05-04T17:00:00,ITA,13,Abruzzo,42.351222,13.398438,92,4,96,3323,3419,-47,131,650744,3960,,,658123,7483582,1372234,Il dato ''incremento casi confermati'' è compo...,0,,,250615,407508,2605598,4877984,ITF,ITF1
1,2023-05-04T17:00:00,ITA,17,Basilicata,40.639471,15.805148,26,2,28,8390,8418,1,11,191039,1027,,,200484,1341696,404577,Il dato relativo al numero dei “Casi in isolam...,0,,,71210,129274,702845,638851,ITF,ITF5
2,2023-05-04T17:00:00,ITA,18,Calabria,38.905976,16.594402,108,5,113,718,831,-106,89,632556,3412,,,636799,4337111,3371574,,1,,,202620,434179,1916246,2420865,ITF,ITF6
3,2023-05-04T17:00:00,ITA,15,Campania,40.839566,14.25085,174,8,182,19737,19919,0,347,2431492,11889,,,2463300,20827201,5408549,,0,,,955713,1507587,9602412,11224789,ITF,ITF3
4,2023-05-04T17:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,506,18,524,3181,3705,-57,305,2128147,19436,,,2151288,19545251,2982360,,3,,,1092316,1058972,10726333,8818918,ITH,ITH5


Create a new dataframe with only the columns required:  
  
- date
- state
- reg_code
- reg_name
- lat
- long
- total_cases
- nuts_1_code
- nuts_2_code

In [130]:
# Drop the columns that are not needed
regions_df2 = regions_df.drop(columns=[
    'symptons_hospitalised', 'in_intensive_care',
    'hospitalised', 'home_isolation', 'positive', 'variance',
    'new_cases', 'discharged', 'deaths', 'suspected', 'screened', 'swabs',
    'tested_cases', 'notes', 'intensive_care_entrances', 'test_notes',
    'cases_notes', 'molecular_positive', 'antigen_positive',
    'molecular_swabs', 'antigen_swabs'
    ])
regions_df2.head()

Unnamed: 0,date,state,reg_code,reg_name,lat,long,total_cases,nuts_1_code,nuts_2_code
0,2023-05-04T17:00:00,ITA,13,Abruzzo,42.351222,13.398438,658123,ITF,ITF1
1,2023-05-04T17:00:00,ITA,17,Basilicata,40.639471,15.805148,200484,ITF,ITF5
2,2023-05-04T17:00:00,ITA,18,Calabria,38.905976,16.594402,636799,ITF,ITF6
3,2023-05-04T17:00:00,ITA,15,Campania,40.839566,14.25085,2463300,ITF,ITF3
4,2023-05-04T17:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,2151288,ITH,ITH5
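An equivalent way to reach the same result is to list the columns to keep rather than the columns to drop, which tends to be shorter when only a few columns are needed. The following is a minimal sketch on a one-row stand-in dataframe; `demo_df`, `keep_cols`, and `demo_subset` are illustrative names, not part of the notebook.

```python
import pandas as pd

# One-row stand-in for the renamed regions_df (illustrative subset of columns)
demo_df = pd.DataFrame({
    'date': ['2023-05-04T17:00:00'],
    'state': ['ITA'],
    'reg_code': [13],
    'reg_name': ['Abruzzo'],
    'lat': [42.351222],
    'long': [13.398438],
    'deaths': [3960],
    'total_cases': [658123],
    'nuts_1_code': ['ITF'],
    'nuts_2_code': ['ITF1'],
})

# Columns to keep, in the order they should appear
keep_cols = ['date', 'state', 'reg_code', 'reg_name', 'lat', 'long',
             'total_cases', 'nuts_1_code', 'nuts_2_code']

# Selecting with a list returns only those columns; .copy() avoids a view
demo_subset = demo_df[keep_cols].copy()
print(demo_subset.columns.to_list())
```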


Let's explore the regions data.  
  
Using `.info()` will help us identify the data types, size of the data, and any `Null` values.

In [131]:
regions_df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         21 non-null     object 
 1   state        21 non-null     object 
 2   reg_code     21 non-null     int64  
 3   reg_name     21 non-null     object 
 4   lat          21 non-null     float64
 5   long         21 non-null     float64
 6   total_cases  21 non-null     int64  
 7   nuts_1_code  21 non-null     object 
 8   nuts_2_code  21 non-null     object 
dtypes: float64(2), int64(2), object(5)
memory usage: 1.6+ KB


Using `.describe()` will help us understand the values in the data.

In [132]:
regions_df2.describe()

Unnamed: 0,reg_code,lat,long,total_cases
count,21.0,21.0,21.0,21.0
mean,11.857143,43.046293,12.225955,1229010.0
std,6.42873,2.550241,2.724611,1091577.0
min,1.0,38.115697,7.320149,50802.0
25%,7.0,41.125596,11.121231,442756.0
50%,12.0,43.61676,12.388247,666306.0
75%,17.0,45.434905,13.768136,1825465.0
max,22.0,46.499335,16.867367,4154840.0


### Conclusion

The `json` file holds a single snapshot per region, dated 4 May 2023, in which `total_cases` is the **cumulative** number of Covid-19 cases up to that date.  
  
For our analysis we need more granular data.

## Covid-19 Data from GitHub

### Aggregated Covid-19 Data from GitHub

The **csv** data is spread across separate files by month.  
  
We can use **os** to list the files and **pandas** to read and concatenate them into a single dataframe.

In [133]:
# specify the path where the csv files are located
path = '../../data/Covid/dati-regioni/'

In [134]:
# get a list of all the csv files in the folder
files = os.listdir(path)

In [135]:
# create an empty list to store the dataframes
dataframes = []

# loop through each csv file and append its contents to the list of dataframes
for file in files:
    if file.endswith('.csv'):
        filepath = os.path.join(path, file)
        df = pd.read_csv(filepath)
        dataframes.append(df)

# concatenate all the dataframes together
covid_df = pd.concat(dataframes, ignore_index=True)
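Note that `os.listdir` returns file names in arbitrary order, so the row order of the combined dataframe can vary between machines. As a hedged alternative, the sketch below uses `pathlib` with a sorted glob; `load_csv_folder` is a hypothetical helper, and the two tiny files are made up purely to demonstrate it.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` (sorted by name) into one DataFrame."""
    csv_paths = sorted(Path(folder).glob('*.csv'))
    if not csv_paths:
        raise FileNotFoundError(f'No csv files found in {folder}')
    frames = [pd.read_csv(p) for p in csv_paths]
    return pd.concat(frames, ignore_index=True)


# Demonstration on a temporary folder with two made-up files
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    pd.DataFrame({'data': ['2020-02-24'], 'totale_casi': [0]}).to_csv(
        tmp_path / 'a.csv', index=False)
    pd.DataFrame({'data': ['2020-02-25'], 'totale_casi': [3]}).to_csv(
        tmp_path / 'b.csv', index=False)
    (tmp_path / 'README.md').write_text('not a csv')  # ignored by the glob
    demo_covid_df = load_csv_folder(tmp_path)

print(demo_covid_df)
```

With the notebook's data, the call would be `load_csv_folder('../../data/Covid/dati-regioni/')`.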

Let's take a look at the dataframe.

In [136]:
covid_df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,casi_da_sospetto_diagnostico,casi_da_screening,totale_casi,tamponi,casi_testati,note,ingressi_terapia_intensiva,note_test,note_casi,totale_positivi_test_molecolare,totale_positivi_test_antigenico_rapido,tamponi_test_molecolare,tamponi_test_antigenico_rapido,codice_nuts_1,codice_nuts_2
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,0,0,0,0,0,0,0,0,,,0,5,,,,,,,,,,,
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,0,0,0,0,0,0,0,0,,,0,0,,,,,,,,,,,
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0,0,0,0,0,0,0,0,0,,,0,1,,,,,,,,,,,
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0,0,0,0,0,0,0,0,0,,,0,10,,,,,,,,,,,
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,10,2,12,6,18,0,18,0,0,,,18,148,,,,,,,,,,,


In [137]:
regions1 = pd.DataFrame(covid_df['denominazione_regione'].unique())
regions1.to_csv('../../data/regions1.csv')

Let's explore the data.  
  
First we can check the shape of the dataframe.

#### Shape

In [138]:
print(f'We have {covid_df.shape[0]} rows and {covid_df.shape[1]} columns')

We have 24486 rows and 30 columns


#### Columns
Let's have a look at the column names.

In [139]:
covid_cols = covid_df.columns.to_list()
cols_len = len(covid_cols)
print(f'The columns are:\n\n {covid_cols}\n\n There are {cols_len} columns')

The columns are:

 ['data', 'stato', 'codice_regione', 'denominazione_regione', 'lat', 'long', 'ricoverati_con_sintomi', 'terapia_intensiva', 'totale_ospedalizzati', 'isolamento_domiciliare', 'totale_positivi', 'variazione_totale_positivi', 'nuovi_positivi', 'dimessi_guariti', 'deceduti', 'casi_da_sospetto_diagnostico', 'casi_da_screening', 'totale_casi', 'tamponi', 'casi_testati', 'note', 'ingressi_terapia_intensiva', 'note_test', 'note_casi', 'totale_positivi_test_molecolare', 'totale_positivi_test_antigenico_rapido', 'tamponi_test_molecolare', 'tamponi_test_antigenico_rapido', 'codice_nuts_1', 'codice_nuts_2']

 There are 30 columns


#### Info()

Using `.info()` will help us identify the data types, size of the data, and any `Null` values.

In [140]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24486 entries, 0 to 24485
Data columns (total 30 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   data                                    24486 non-null  object 
 1   stato                                   24486 non-null  object 
 2   codice_regione                          24486 non-null  int64  
 3   denominazione_regione                   24486 non-null  object 
 4   lat                                     24486 non-null  float64
 5   long                                    24486 non-null  float64
 6   ricoverati_con_sintomi                  24486 non-null  int64  
 7   terapia_intensiva                       24486 non-null  int64  
 8   totale_ospedalizzati                    24486 non-null  int64  
 9   isolamento_domiciliare                  24486 non-null  int64  
 10  totale_positivi                         24486 non-null  in

There are some `Null` values in the data. We will deal with them as we go along.
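Since the `.info()` output is long, a per-column count of missing values can be easier to scan. The sketch below shows one way to do this with `isna()`; the stand-in dataframe and its values are made up, and `summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Made-up stand-in with a few missing values
demo_df = pd.DataFrame({
    'totale_casi': [0, 18, 145],
    'casi_testati': [np.nan, np.nan, 3482.0],
    'codice_nuts_2': [None, None, 'ITH5'],
})

null_counts = demo_df.isna().sum()            # missing values per column
null_share = demo_df.isna().mean().round(2)   # fraction of rows missing

summary = pd.DataFrame({'null_count': null_counts, 'null_share': null_share})

# Show only the columns that actually contain missing values
print(summary[summary['null_count'] > 0])
```

On the notebook's data, the same two calls would be applied to `covid_df`.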

#### Describe()

Using `describe()` will help us understand the numerical values in the data. 
  
Let's look only at the numerical data, excluding any categorical data as well as `dates`, `codes` and `geo` data.

In [141]:
covid_df[[
    'ricoverati_con_sintomi', 'terapia_intensiva',
    'totale_ospedalizzati', 'isolamento_domiciliare',
    'totale_positivi', 'variazione_totale_positivi',
    'nuovi_positivi', 'dimessi_guariti', 'deceduti',
    'casi_da_sospetto_diagnostico', 'casi_da_screening',
    'totale_casi', 'tamponi', 'casi_testati',
    'ingressi_terapia_intensiva', 'totale_positivi_test_molecolare',
    'totale_positivi_test_antigenico_rapido',
    'tamponi_test_molecolare', 'tamponi_test_antigenico_rapido'
    ]].describe()

Unnamed: 0,ricoverati_con_sintomi,terapia_intensiva,totale_ospedalizzati,isolamento_domiciliare,totale_positivi,variazione_totale_positivi,nuovi_positivi,dimessi_guariti,deceduti,casi_da_sospetto_diagnostico,casi_da_screening,totale_casi,tamponi,casi_testati,ingressi_terapia_intensiva,totale_positivi_test_molecolare,totale_positivi_test_antigenico_rapido,tamponi_test_molecolare,tamponi_test_antigenico_rapido
count,24486.0,24486.0,24486.0,24486.0,24486.0,24486.0,24486.0,24486.0,24486.0,3402.0,3402.0,24486.0,24486.0,23331.0,18543.0,17640.0,17640.0,17640.0,17640.0
mean,463.969983,44.780528,508.75051,21598.353345,22107.10361,5.118517,1041.701299,443171.5,5659.167198,16472.772193,6319.39565,470937.8,5876708.0,1797052.0,2.969638,328642.8,314716.8,3636032.0,4344032.0
std,871.306552,96.567591,962.934311,41110.278674,41513.129374,2053.651905,2390.960314,709480.5,7844.665291,32188.41327,15054.35195,731984.0,8522851.0,2145636.0,5.626426,348250.2,486836.9,3669912.0,5881339.0
min,0.0,0.0,0.0,0.0,0.0,-50797.0,-229.0,0.0,0.0,0.0,0.0,0.0,0.0,3482.0,-2.0,7382.0,0.0,66152.0,0.0
25%,50.0,3.0,55.0,1117.25,1228.25,-94.0,51.0,16563.0,690.0,2052.25,117.0,26769.75,465194.8,296255.0,0.0,70122.75,1341.75,876449.2,417239.8
50%,180.0,13.0,195.0,6248.5,6505.5,2.0,255.0,116202.0,2568.0,5290.5,1493.0,147673.0,2369642.0,871568.0,1.0,202170.0,72221.5,2195741.0,2100065.0
75%,498.0,41.0,540.0,22905.0,23701.25,116.0,998.75,477592.2,8307.75,19285.75,5469.0,511802.8,6905588.0,2585904.0,3.0,492500.5,388619.8,5062436.0,5293302.0
max,12077.0,1381.0,13328.0,574548.0,578257.0,47483.0,52693.0,4106068.0,45898.0,305002.0,113150.0,4154840.0,45586970.0,11152910.0,86.0,1539511.0,2615329.0,17097820.0,28489150.0


#### Column Unique Values
Let's create a function to count the unique values in each column.

In [142]:
def col_unique_count(data):
    """
    Function to print the number of unique values in each column.

    Args:
        data (DataFrame): DataFrame containing the data.
    """
    for col in data.columns:
        values_length = len(data[col].unique())  # Get the length of unique values
    
        # Print the column name and length
        print('Column Name:', col)
        print('Length of Unique Values:', values_length)
        
        print('-' * 30)  # Separator between columns

In [143]:
col_unique_count(covid_df)

Column Name: data
Length of Unique Values: 1166
------------------------------
Column Name: stato
Length of Unique Values: 1
------------------------------
Column Name: codice_regione
Length of Unique Values: 21
------------------------------
Column Name: denominazione_regione
Length of Unique Values: 21
------------------------------
Column Name: lat
Length of Unique Values: 21
------------------------------
Column Name: long
Length of Unique Values: 22
------------------------------
Column Name: ricoverati_con_sintomi
Length of Unique Values: 2672
------------------------------
Column Name: terapia_intensiva
Length of Unique Values: 598
------------------------------
Column Name: totale_ospedalizzati
Length of Unique Values: 2871
------------------------------
Column Name: isolamento_domiciliare
Length of Unique Values: 15288
------------------------------
Column Name: totale_positivi
Length of Unique Values: 15618
------------------------------
Column Name: variazione_totale_positiv
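The same counts are also available in one call through `DataFrame.nunique`. A minimal sketch on made-up data follows; `dropna=False` is used so that `NaN` is counted as a value, which matches `len(Series.unique())` in the function above.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a few columns of covid_df
demo_df = pd.DataFrame({
    'stato': ['ITA', 'ITA', 'ITA'],
    'denominazione_regione': ['Abruzzo', 'Basilicata', 'Abruzzo'],
    'codice_nuts_1': [np.nan, 'ITF', 'ITF'],
})

# One unique-value count per column, returned as a Series
unique_counts = demo_df.nunique(dropna=False)
print(unique_counts)
```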

Let's create a function to take a look at the columns whose unique values are important to our analysis.

In [144]:
def col_unique_vals(data, col_names):
    """
    Function to print the unique values in each column of interest.

    Args:
        data (DataFrame): DataFrame containing the data.
        col_names (list of str): List of column names to check.
    """

    # Iterate through each column
    for col in data[col_names]:
        unique_values = data[col].unique()  # Get unique values
        
        # Print the column name and its unique values
        print('Column Name:', col)
        print('Unique Values:')
        
        for value in unique_values:
            print(value)
        
        print('-' * 30)  # Separator between columns

In [145]:
col_unique_vals(covid_df, ['denominazione_regione', 'codice_nuts_1', 'codice_nuts_2'])

Column Name: denominazione_regione
Unique Values:
Abruzzo
Basilicata
Calabria
Campania
Emilia-Romagna
Friuli Venezia Giulia
Lazio
Liguria
Lombardia
Marche
Molise
P.A. Bolzano
P.A. Trento
Piemonte
Puglia
Sardegna
Sicilia
Toscana
Umbria
Valle d'Aosta
Veneto
------------------------------
Column Name: codice_nuts_1
Unique Values:
nan
ITF
ITH
ITI
ITC
ITG
------------------------------
Column Name: codice_nuts_2
Unique Values:
nan
ITF1
ITF5
ITF6
ITF3
ITH5
ITH4
ITI4
ITC3
ITC4
ITI3
ITF2
ITH1
ITH2
ITC1
ITF4
ITG2
ITG1
ITI1
ITI2
ITC2
ITH3
------------------------------
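The Covid-19 data lists 21 names, whereas the Data Specifications list 20 regions: `P.A. Bolzano` and `P.A. Trento` are reported separately here, and a couple of names are spelled differently (for example `Friuli Venezia Giulia`). If these need to line up with the region names in the specifications, one possible approach is to map the names and sum the two provinces. This is a sketch of one option under that assumption; `name_map` is hypothetical and the case counts are made up.

```python
import pandas as pd

# Made-up cumulative totals for illustration only
demo_df = pd.DataFrame({
    'date': pd.to_datetime(['2021-12-24'] * 4),
    'reg_name': ['P.A. Bolzano', 'P.A. Trento', 'Veneto', 'Friuli Venezia Giulia'],
    'total_cases': [100, 150, 900, 300],
})

# Hypothetical mapping from the Covid-19 names to the names in the specifications
name_map = {
    'P.A. Bolzano': 'Trentino-Alto Adige/Südtirol',
    'P.A. Trento': 'Trentino-Alto Adige/Südtirol',
    'Friuli Venezia Giulia': 'Friuli-Venezia Giulia',
    "Valle d'Aosta": "Valle d'Aosta/Vallée d'Aoste",
}

# Names without an entry in the mapping are left unchanged
demo_df['region'] = demo_df['reg_name'].replace(name_map)

# Sum the two provinces into a single regional row per date
combined = demo_df.groupby(['date', 'region'], as_index=False)['total_cases'].sum()
print(combined)
```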


#### Column Names

Let's explore the column names with a view to dropping any columns that are not required.

In [146]:
covid_cols

['data',
 'stato',
 'codice_regione',
 'denominazione_regione',
 'lat',
 'long',
 'ricoverati_con_sintomi',
 'terapia_intensiva',
 'totale_ospedalizzati',
 'isolamento_domiciliare',
 'totale_positivi',
 'variazione_totale_positivi',
 'nuovi_positivi',
 'dimessi_guariti',
 'deceduti',
 'casi_da_sospetto_diagnostico',
 'casi_da_screening',
 'totale_casi',
 'tamponi',
 'casi_testati',
 'note',
 'ingressi_terapia_intensiva',
 'note_test',
 'note_casi',
 'totale_positivi_test_molecolare',
 'totale_positivi_test_antigenico_rapido',
 'tamponi_test_molecolare',
 'tamponi_test_antigenico_rapido',
 'codice_nuts_1',
 'codice_nuts_2']

Let's remove the columns that are not required.

In [147]:
# Drop the columns that are not needed
covid_df.drop([
    'ricoverati_con_sintomi',
    'terapia_intensiva',
    'totale_ospedalizzati',
    'isolamento_domiciliare',
    'totale_positivi',
    'variazione_totale_positivi',
    'nuovi_positivi',
    'dimessi_guariti',
    'deceduti',
    'casi_da_sospetto_diagnostico',
    'casi_da_screening',
    'tamponi',
    'casi_testati',
    'note',
    'ingressi_terapia_intensiva',
    'note_test',
    'note_casi',
    'totale_positivi_test_molecolare',
    'totale_positivi_test_antigenico_rapido',
    'tamponi_test_molecolare',
    'tamponi_test_antigenico_rapido',
    'codice_nuts_1',
    'codice_nuts_2'], axis=1, inplace=True)
covid_df.head()

Unnamed: 0,data,stato,codice_regione,denominazione_regione,lat,long,totale_casi
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,18


We can now translate the column names to English.

In [148]:
# Use a dictionary to rename the columns
covid_df = covid_df.rename(columns={
    'data': 'date', 'stato': 'state', 'codice_regione': 'reg_code', 'denominazione_regione': 'reg_name',
    'long': 'lon', 'totale_casi': 'total_cases'
    })
covid_df.head()

Unnamed: 0,date,state,reg_code,reg_name,lat,lon,total_cases
0,2020-02-24T18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0
1,2020-02-24T18:00:00,ITA,17,Basilicata,40.639471,15.805148,0
2,2020-02-24T18:00:00,ITA,18,Calabria,38.905976,16.594402,0
3,2020-02-24T18:00:00,ITA,15,Campania,40.839566,14.25085,0
4,2020-02-24T18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,18


Convert the `date` column to `datetime` format.

In [149]:
covid_df['date'] = pd.to_datetime(covid_df['date'])

The data is on a daily basis. To make the data more manageable, we can add `year`, `month`, `week`, and `day_of_week` columns to group and filter by.

In [235]:
# Create year, month, week, and day-of-week columns
covid_df['year'] = covid_df['date'].dt.year
covid_df['month'] = covid_df['date'].dt.month
covid_df['week'] = covid_df['date'].dt.isocalendar().week
covid_df['day_of_week'] = covid_df['date'].dt.dayofweek
covid_df

Unnamed: 0,date,state,reg_code,reg_name,lat,lon,total_cases,year,month,week,day_of_week
0,2020-02-24 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,2020,2,9,0
21,2020-02-25 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,2020,2,9,1
42,2020-02-26 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,0,2020,2,9,2
63,2020-02-27 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,1,2020,2,9,3
84,2020-02-28 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,1,2020,2,9,4
...,...,...,...,...,...,...,...,...,...,...,...
24401,2023-04-30 17:00:00,ITA,5,Veneto,45.434905,12.338452,2719946,2023,4,17,6
24422,2023-05-01 17:00:00,ITA,5,Veneto,45.434905,12.338452,2720023,2023,5,18,0
24443,2023-05-02 17:00:00,ITA,5,Veneto,45.434905,12.338452,2720126,2023,5,18,1
24464,2023-05-03 17:00:00,ITA,5,Veneto,45.434905,12.338452,2720780,2023,5,18,2
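When filtering on `day_of_week`, it helps to remember that pandas numbers the days from Monday=0 to Sunday=6, so a value of 4 is a Friday. The short check below illustrates this with three dates from the start of the dataset; `check` is an illustrative name.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2020-02-24', '2020-02-27', '2020-02-28']))

check = pd.DataFrame({
    'date': dates,
    'day_of_week': dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    'day_name': dates.dt.day_name(),    # readable name for the same day
})
print(check)
```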


Let's check the data types.

In [236]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24486 entries, 0 to 24485
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         24486 non-null  datetime64[ns]
 1   state        24486 non-null  object        
 2   reg_code     24486 non-null  int64         
 3   reg_name     24486 non-null  object        
 4   lat          24486 non-null  float64       
 5   lon          24486 non-null  float64       
 6   total_cases  24486 non-null  int64         
 7   year         24486 non-null  int32         
 8   month        24486 non-null  int32         
 9   week         24486 non-null  UInt32        
 10  day_of_week  24486 non-null  int32         
dtypes: UInt32(1), datetime64[ns](1), float64(2), int32(3), int64(2), object(2)
memory usage: 1.9+ MB


Let's check the total number of cases.  
  
Because `total_cases` is cumulative, we can do this by selecting the most recent `date` and summing `total_cases` across the regions.

In [237]:
# Use the most recent report date, regardless of the row order
latest_date = covid_df['date'].max()
end_date = latest_date.strftime('%d-%m-%Y')
covid_total = covid_df.loc[covid_df['date'] == latest_date, 'total_cases'].sum()
print(f'As of {end_date}, the total number of cases in Italy was {covid_total:,}.')

As of 04-05-2023, the total number of cases in Italy was 25,809,208.
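If the national total is needed for every date rather than only the latest one, grouping by `date` and summing gives a cumulative national series. A minimal sketch follows, using a handful of illustrative rows (the values are stand-ins, not taken from the full dataset); `national` is an illustrative name.

```python
import pandas as pd

# Illustrative rows: two regions reported on two consecutive days
demo_df = pd.DataFrame({
    'date': pd.to_datetime(['2020-02-27', '2020-02-27', '2020-02-28', '2020-02-28']),
    'reg_name': ['Abruzzo', 'Emilia-Romagna', 'Abruzzo', 'Emilia-Romagna'],
    'total_cases': [1, 97, 1, 145],
})

# Sum the regional cumulative totals for each report date
national = demo_df.groupby('date')['total_cases'].sum()

# The value at the latest date is the national total to date
latest_total = national.loc[national.index.max()]
print(national)
print(f'Latest national total: {latest_total:,}')
```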


##### Total Cases by Region
Let's plot the total number of cases by region based on a `day_of_week` filter.

In [153]:
def plot_covid_regions(data, day_of_week=4):
    """
    Function to plot the number of cases by region for a given day of the week.

    Args:
        data (DataFrame): DataFrame containing the data.
        day_of_week (int, optional): Day of the week as an integer, following the
            pandas `dt.dayofweek` convention (Monday=0, ..., Sunday=6). Defaults to 4 (Friday).
    """
    # Create a dictionary that maps pandas day-of-week numbers to day names
    day_mapping = {
        0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday',
        4: 'Friday', 5: 'Saturday', 6: 'Sunday'
    }

    # Filter the data based on the day of the week
    filtered_data = data[data['day_of_week'] == day_of_week]

    # Rename the columns for hover data
    filtered_data = filtered_data.rename(columns={'reg_name': 'Region', 'date': 'Date', 'total_cases': 'Total Cases'})

    # Plot the data using Plotly Express
    fig = px.line(filtered_data, x='Date', y='Total Cases', color='Region',
                hover_data={'Date': '|%B %d, %Y', 'Total Cases': ':,'})

    # Set plot layout
    fig.update_layout(
        title=f'Total Cases on each {day_mapping[day_of_week]} by Month',
        xaxis_title='Date',
        yaxis_title='Total Cases',
        width=1200,
        height=800,
        font=dict(
            family='Arial',
            size=18,
            color='Purple'
        )
    )

    # Show the plot
    fig.show()

In [238]:
plot_covid_regions(covid_df)

In [155]:
# Filter the data for Fridays (pandas dayofweek uses Monday=0, so 4 is Friday)
covid_df_thur = covid_df[covid_df['day_of_week'] == 4]
covid_df_thur

Unnamed: 0,date,state,reg_code,reg_name,lat,lon,total_cases,year,month,week,day_of_week
84,2020-02-28 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,1,2020,2,9,4
85,2020-02-28 18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,2020,2,9,4
86,2020-02-28 18:00:00,ITA,18,Calabria,38.905976,16.594402,1,2020,2,9,4
87,2020-02-28 18:00:00,ITA,15,Campania,40.839566,14.250850,4,2020,2,9,4
88,2020-02-28 18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,145,2020,2,9,4
...,...,...,...,...,...,...,...,...,...,...,...
24355,2023-04-28 17:00:00,ITA,19,Sicilia,38.115697,13.362357,1824867,2023,4,17,4
24356,2023-04-28 17:00:00,ITA,9,Toscana,43.769231,11.255889,1600536,2023,4,17,4
24357,2023-04-28 17:00:00,ITA,10,Umbria,43.106758,12.388247,442349,2023,4,17,4
24358,2023-04-28 17:00:00,ITA,2,Valle d'Aosta,45.737503,7.320149,50778,2023,4,17,4


We need a time series of Covid-19 data that covers the same period as our other datasets, which end in 2021.

In [239]:
# Create a date range
date1 = datetime(2020,2,24)
date2 = datetime(2021,12,31)
# Filter the data based on the date range
covid_df_dates = covid_df_thur[(covid_df_thur['date'] >= date1) & (covid_df_thur['date'] <= date2)].reset_index(drop=True)

In [240]:
covid_df_dates

Unnamed: 0,date,state,reg_code,reg_name,lat,lon,total_cases,year,month,week,day_of_week
0,2020-02-28 18:00:00,ITA,13,Abruzzo,42.351222,13.398438,1,2020,2,9,4
1,2020-02-28 18:00:00,ITA,17,Basilicata,40.639471,15.805148,0,2020,2,9,4
2,2020-02-28 18:00:00,ITA,18,Calabria,38.905976,16.594402,1,2020,2,9,4
3,2020-02-28 18:00:00,ITA,15,Campania,40.839566,14.250850,4,2020,2,9,4
4,2020-02-28 18:00:00,ITA,8,Emilia-Romagna,44.494367,11.341721,145,2020,2,9,4
...,...,...,...,...,...,...,...,...,...,...,...
2011,2021-12-24 17:00:00,ITA,19,Sicilia,38.115697,13.362357,350236,2021,12,51,4
2012,2021-12-24 17:00:00,ITA,9,Toscana,43.769231,11.255889,327770,2021,12,51,4
2013,2021-12-24 17:00:00,ITA,10,Umbria,43.106758,12.388247,74632,2021,12,51,4
2014,2021-12-24 17:00:00,ITA,2,Valle d'Aosta,45.737503,7.320149,14725,2021,12,51,4
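One detail worth checking in this filter: `datetime(2021, 12, 31)` represents midnight at the start of that day, while the reports are timestamped 17:00 or 18:00. Reports from 31 December 2021 (a Friday) therefore fall outside the `<=` bound, which is consistent with the output above ending on 24 December 2021. Whether that is intended is not stated. If the last day should be included, a half-open upper bound is one option, sketched below on made-up rows.

```python
from datetime import datetime

import pandas as pd

# Made-up rows around the end of 2021
demo_df = pd.DataFrame({
    'date': pd.to_datetime([
        '2021-12-24 17:00:00', '2021-12-31 17:00:00', '2022-01-07 17:00:00'
    ]),
    'total_cases': [10, 20, 30],
})

start = datetime(2020, 2, 24)

# Inclusive upper bound at midnight: the 31 December 17:00 report is excluded
closed = demo_df[(demo_df['date'] >= start) & (demo_df['date'] <= datetime(2021, 12, 31))]

# Exclusive upper bound at the start of the next year keeps it
half_open = demo_df[(demo_df['date'] >= start) & (demo_df['date'] < datetime(2022, 1, 1))]

print(len(closed), len(half_open))
```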


Let's plot our new dataframe.

In [158]:
plot_covid_regions(covid_df_dates)

Let's group the data by region (`reg_name`, `reg_code`, `lat`, `lon`) and `year`, and return the `max` value of `total_cases` for **2020** and **2021** in a new dataframe.

In [196]:
len(covid_df_dates[covid_df_dates['reg_name'] == 'Sicilia']), len(covid_df_dates[covid_df_dates['reg_name'] == 'Friuli Venezia Giulia'])

(96, 96)

In [197]:
len(covid_totals[covid_totals['reg_name'] == 'Sicilia']), len(covid_totals[covid_totals['reg_name'] == 'Friuli Venezia Giulia'])

(4, 2)

In [249]:
covid_df[covid_df['reg_name'] == 'Friuli Venezia Giulia'].to_csv('covid_Friuli_Venezia_Giuliaa.csv', index=False)

In [250]:
test = pd.read_csv('covid_sicilia.csv')
test2 = pd.read_csv('covid_Friuli_Venezia_Giuliaa.csv')

In [248]:
test[test['reg_name'] == 'Sicilia'].groupby(['reg_name', 'lat', 'lon', 'reg_code', 'year'])['total_cases'].max().reset_index()

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
0,Sicilia,38.115697,13.362357,19,2020,93644
1,Sicilia,38.115697,13.362357,19,2021,216416
2,Sicilia,38.115697,13.362357,19,2020,3080
3,Sicilia,38.115697,13.362357,19,2021,372604
4,Sicilia,38.115697,13.362357,19,2022,1780388
5,Sicilia,38.115697,13.362357,19,2023,1825465


In [252]:
test2[test2['reg_name'] == 'Friuli Venezia Giulia'].groupby(['reg_name', 'lat', 'lon', 'reg_code', 'year'])['total_cases'].max().reset_index()

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
0,Friuli Venezia Giulia,45.649435,13.768136,6,2020,50027
1,Friuli Venezia Giulia,45.649435,13.768136,6,2021,156092
2,Friuli Venezia Giulia,45.649435,13.768136,6,2022,566149
3,Friuli Venezia Giulia,45.649435,13.768136,6,2023,579908


In [210]:
covid_totals[covid_totals['reg_name'] == 'Sicilia']

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
32,Sicilia,38.115697,13.362357,19,2020,93644
33,Sicilia,38.115697,13.362357,19,2021,216416
34,Sicilia,38.115697,13.362357,19,2020,3080
35,Sicilia,38.115697,13.362357,19,2021,372604


In [199]:
sic = covid_df_dates[covid_df_dates['reg_name'] == 'Sicilia'].groupby(['reg_name', 'lat', 'lon', 'reg_code', 'year'])['total_cases'].max().reset_index()
sic

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
0,Sicilia,38.115697,13.362357,19,2020,88597
1,Sicilia,38.115697,13.362357,19,2021,214482
2,Sicilia,38.115697,13.362357,19,2020,3076
3,Sicilia,38.115697,13.362357,19,2021,350236


In [209]:
covid_df[covid_df['reg_name'] == 'Sicilia'].groupby(['reg_name', 'lat', 'lon', 'reg_code', 'year'])['total_cases'].max()

reg_name  lat        lon        reg_code  year
Sicilia   38.115697  13.362357  19        2020      93644
                                          2021     216416
                                          2020       3080
                                          2021     372604
                                          2022    1780388
                                          2023    1825465
Name: total_cases, dtype: int64
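The Sicilia output above lists 2020 and 2021 twice, which suggests that at least one grouping key differs between otherwise identical rows. One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is a `lat` or `lon` value that differs beyond the displayed precision. The sketch below reproduces that situation with invented coordinates and shows one way around it: group on the stable keys only and attach one coordinate pair per region afterwards.

```python
import pandas as pd

# Toy data: the same region recorded with two slightly different latitudes (invented values)
toy = pd.DataFrame({
    "reg_name": ["Sicilia"] * 4,
    "lat": [38.115697, 38.115697, 38.11569725, 38.11569725],
    "lon": [13.362357] * 4,
    "reg_code": [19] * 4,
    "year": [2020, 2021, 2020, 2021],
    "total_cases": [3080, 216416, 93644, 372604],
})

# Grouping on the raw floats yields four groups, because the two lat values differ
raw = (
    toy.groupby(["reg_name", "lat", "lon", "reg_code", "year"])["total_cases"]
    .max()
    .reset_index()
)

# Group on the stable keys only, then carry a single lat/lon per region
coords = toy.groupby("reg_name")[["lat", "lon"]].first().reset_index()
clean = (
    toy.groupby(["reg_name", "reg_code", "year"])["total_cases"]
    .max()
    .reset_index()
    .merge(coords, on="reg_name")
    .sort_values("year")
    .reset_index(drop=True)
)
print(len(raw), len(clean))
```

Rounding `lat` and `lon` before grouping would be another option; which one suits the real data depends on why the keys differ.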

In [206]:
# Create a dataframe with the total cases by region
covid_totals = covid_df[covid_df['year'].isin([2020, 2021])].groupby(['reg_name', 'lat', 'lon', 'reg_code', 'year'])['total_cases'].max().reset_index()
covid_totals

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
0,Abruzzo,42.351222,13.398438,13,2020,35314
1,Abruzzo,42.351222,13.398438,13,2021,106573
2,Basilicata,40.639471,15.805148,17,2020,10826
3,Basilicata,40.639471,15.805148,17,2021,36295
4,Calabria,38.905976,16.594402,18,2020,23920
5,Calabria,38.905976,16.594402,18,2021,111746
6,Campania,40.839566,14.25085,15,2020,189673
7,Campania,40.839566,14.25085,15,2021,583262
8,Emilia-Romagna,44.494367,11.341721,8,2020,171512
9,Emilia-Romagna,44.494367,11.341721,8,2021,536922


Let's now add the extra rows for the missing years.  
  
We can set `total_cases` to `0` for 2017, 2018, and 2019, as the dataset records no Covid-19 cases in Italy before 2020.

In [160]:
covid_totals.head()

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
0,Abruzzo,42.351222,13.398438,13,2020,34437
1,Abruzzo,42.351222,13.398438,13,2021,95670
2,Basilicata,40.639471,15.805148,17,2020,10447
3,Basilicata,40.639471,15.805148,17,2021,33882
4,Calabria,38.905976,16.594402,18,2020,22278


In [163]:
# Create a list of dictionaries to add the missing years
new_rows = []
for _, row in covid_totals.drop_duplicates('reg_name').iterrows():
    for year in [2017, 2018, 2019]:
        new_row = {'reg_name': row['reg_name'], 'lat': row['lat'], 'lon': row['lon'],'reg_code': row['reg_code'], 'year': year, 'total_cases': 0}
        new_rows.append(new_row)

# Create a dataframe from the list of dictionaries
new_rows_covid_totals = pd.DataFrame(new_rows)
# Append the new rows to the covid_totals dataframe
covid_totals = pd.concat([covid_totals, new_rows_covid_totals])
# Sort the dataframe by region name and year
covid_totals.sort_values(by=['reg_name', 'year'], inplace=True)
# Reset the index
covid_totals.reset_index(drop=True, inplace=True)
covid_totals

Unnamed: 0,reg_name,lat,lon,reg_code,year,total_cases
0,Abruzzo,42.351222,13.398438,13,2017,0
1,Abruzzo,42.351222,13.398438,13,2018,0
2,Abruzzo,42.351222,13.398438,13,2019,0
3,Abruzzo,42.351222,13.398438,13,2020,34437
4,Abruzzo,42.351222,13.398438,13,2021,95670
...,...,...,...,...,...,...
102,Veneto,45.434905,12.338452,5,2017,0
103,Veneto,45.434905,12.338452,5,2018,0
104,Veneto,45.434905,12.338452,5,2019,0
105,Veneto,45.434905,12.338452,5,2020,234792
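As an alternative to looping with `iterrows`, the same padding can be sketched with a cross join between the regions and the full list of years, followed by a left merge. This is a minimal sketch on invented numbers, not the notebook's method; `toy_totals`, `grid`, and `filled` are hypothetical names, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd

# Toy stand-in for covid_totals (values are invented)
toy_totals = pd.DataFrame({
    "reg_name": ["Abruzzo", "Abruzzo", "Veneto", "Veneto"],
    "reg_code": [13, 13, 5, 5],
    "year": [2020, 2021, 2020, 2021],
    "total_cases": [100, 250, 400, 900],
})

regions = toy_totals[["reg_name", "reg_code"]].drop_duplicates()
years = pd.DataFrame({"year": [2017, 2018, 2019, 2020, 2021]})

# Cross join: every region paired with every year
grid = regions.merge(years, how="cross")

# Left merge keeps the full grid; years without data become NaN, then 0
filled = grid.merge(toy_totals, on=["reg_name", "reg_code", "year"], how="left")
filled["total_cases"] = filled["total_cases"].fillna(0).astype(int)
filled = filled.sort_values(["reg_name", "year"]).reset_index(drop=True)
print(filled)
```

A side benefit of the grid approach is that any region missing a year for another reason also shows up as a zero-filled row, which may or may not be desirable.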


## Crime Data

### Crime Data across Italy by Region

Crime data refers to information collected and recorded regarding criminal activities that occur within a specific jurisdiction or region.  
It includes various types of data related to criminal incidents, offenders, victims, and law enforcement activities.  
Crime data is essential for understanding patterns, trends, and the overall nature of criminal behavior.  
  
The data is in a **csv** file. Let's create a new **dataframe** from the **csv** file.

In [38]:
crime_url = '../../data/crime_type_by_year_cleaned.csv'

In [39]:
crime_df = pd.read_csv(crime_url, index_col=0)

In [40]:
crime_df.head()

Unnamed: 0,Territory_Code,Territory_Name,REATIPS_VICES,Crime_Type,Year,Number_of_Crime
0,IT,Italy,MASSMURD,mass murder,2017,17
1,IT,Italy,MASSMURD,mass murder,2018,20
2,IT,Italy,MASSMURD,mass murder,2019,14
3,IT,Italy,MASSMURD,mass murder,2020,29
4,IT,Italy,MASSMURD,mass murder,2021,29


#### Shape

In [41]:
print(f'We have {crime_df.shape[0]} rows and {crime_df.shape[1]} columns')

We have 36984 rows and 6 columns


#### Columns
Let's have a look at the column names.

In [42]:
crime_cols = crime_df.columns.to_list()
crime_cols_len = len(crime_cols)
print(f'The columns are:\n\n {crime_cols}\n\n There are {crime_cols_len} columns')

The columns are:

 ['Territory_Code', 'Territory_Name', 'REATIPS_VICES', 'Crime_Type', 'Year', 'Number_of_Crime']

 There are 6 columns


#### Info()

Using `.info()` will help us identify the data types, size of the data, and any `Null` values.

In [43]:
crime_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36984 entries, 0 to 36983
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Territory_Code   36984 non-null  object
 1   Territory_Name   36984 non-null  object
 2   REATIPS_VICES    36984 non-null  object
 3   Crime_Type       36984 non-null  object
 4   Year             36984 non-null  int64 
 5   Number_of_Crime  36984 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 2.0+ MB


There are no `Null` values in the data.

#### Describe()

Using `describe()` will help us understand the numerical values in the data. 

In [44]:
crime_df.describe()

Unnamed: 0,Year,Number_of_Crime
count,36984.0,36984.0
mean,2018.992754,2756.114
std,1.416774,34206.23
min,2017.0,0.0
25%,2018.0,2.0
50%,2019.0,20.0
75%,2020.0,237.0
max,2021.0,2429795.0


#### Column Names

Let's explore the column names with a view to dropping any columns that are not required.

In [45]:
crime_cols

['Territory_Code',
 'Territory_Name',
 'REATIPS_VICES',
 'Crime_Type',
 'Year',
 'Number_of_Crime']

Let's remove the columns that are not required.

In [46]:
crime_df.drop(['REATIPS_VICES'], axis=1, inplace=True)
crime_df.head()

Unnamed: 0,Territory_Code,Territory_Name,Crime_Type,Year,Number_of_Crime
0,IT,Italy,mass murder,2017,17
1,IT,Italy,mass murder,2018,20
2,IT,Italy,mass murder,2019,14
3,IT,Italy,mass murder,2020,29
4,IT,Italy,mass murder,2021,29


Let's tidy up the column names.

In [47]:
crime_df.columns = crime_df.columns.str.strip(
    ).str.lower(
        ).str.replace(
            ' ', '_', regex=False)
crime_df.head()

Unnamed: 0,territory_code,territory_name,crime_type,year,number_of_crime
0,IT,Italy,mass murder,2017,17
1,IT,Italy,mass murder,2018,20
2,IT,Italy,mass murder,2019,14
3,IT,Italy,mass murder,2020,29
4,IT,Italy,mass murder,2021,29


#### Column Unique Values

In [48]:
col_unique_count(crime_df)

Column Name: territory_code
Length of Unique Values: 134
------------------------------
Column Name: territory_name
Length of Unique Values: 133
------------------------------
Column Name: crime_type
Length of Unique Values: 56
------------------------------
Column Name: year
Length of Unique Values: 5
------------------------------
Column Name: number_of_crime
Length of Unique Values: 4959
------------------------------


Let's take a look at the columns whose unique values are important to our analysis.

In [49]:
col_unique_vals(crime_df, ["territory_code", "territory_name", "crime_type", "year"])

Column Name: territory_code
Unique Values:
IT
ITC
ITC1
ITC11
ITC12
ITC13
ITC14
ITC15
ITC16
ITC17
ITC18
ITC2
ITC20
ITC3
ITC31
ITC32
ITC33
ITC34
ITC4
ITC41
ITC42
ITC43
ITC44
ITC45
ITC46
ITC47
ITC48
ITC49
ITC4A
ITC4B
ITD
ITDA
ITD1
ITD10
ITD2
ITD20
ITD3
ITD31
ITD32
ITD33
ITD34
ITD35
ITD36
ITD37
ITD4
ITD41
ITD42
ITD43
ITD44
ITD5
ITD51
ITD52
ITD53
ITD54
ITD55
ITD56
ITD57
ITD58
ITD59
ITE
ITE1
ITE11
ITE12
ITE13
ITE14
ITE15
ITE16
ITE17
ITE18
ITE19
ITE1A
ITE2
ITE21
ITE22
ITE3
ITE31
ITE32
ITE33
ITE34
ITE4
ITE41
ITE42
ITE43
ITE44
ITE45
ITF
ITF1
ITF11
ITF12
ITF13
ITF14
ITF2
ITF21
ITF22
ITF3
ITF31
ITF32
ITF33
ITF34
ITF35
ITF4
ITF41
ITF42
ITF43
ITF44
ITF45
ITF5
ITF51
ITF52
ITF6
ITF61
ITF62
ITF63
ITF64
ITF65
ITG
ITG1
ITG11
ITG12
ITG13
ITG14
ITG15
ITG16
ITG17
ITG18
ITG19
ITG2
ITG25
ITG26
ITG27
ITG28
IT108
IT109
IT110
------------------------------
Column Name: territory_name
Unique Values:
Italy
Nord-ovest
Piemonte
Torino
Vercelli
Biella
Verbano-Cusio-Ossola
Novara
Cuneo
Asti
Alessandria
Valle d'Aosta 

We can see that the data contains the country, macro-areas (such as Nord-ovest), regions, and provinces.  
  
We will need to remove all but the regions from the data.  
  
We know from the `territory_code` column that region codes are exactly 4 alphanumeric characters long (for example `ITC1`).  
We can use this pattern to filter the data.

In [50]:
# We will use regex to find the rows that match the pattern
crime_by_region = crime_df[crime_df['territory_code'].str.match(r'^\w{4}$')]
crime_by_region.head()

Unnamed: 0,territory_code,territory_name,crime_type,year,number_of_crime
552,ITC1,Piemonte,mass murder,2017,1
553,ITC1,Piemonte,mass murder,2018,0
554,ITC1,Piemonte,mass murder,2019,1
555,ITC1,Piemonte,mass murder,2020,0
556,ITC1,Piemonte,mass murder,2021,2
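The four-character rule can be checked on a handful of codes taken from the list printed earlier. This is a small sketch using `str.fullmatch`, which anchors the pattern at both ends so `^` and `$` are not needed; the variable names are hypothetical. Note that `ITDA` passes the filter alongside `ITD1` and `ITD2`, so any territory that is coded both as a whole and as its parts will appear more than once.

```python
import pandas as pd

# A few codes copied from the unique values shown above
codes = pd.Series(["IT", "ITC", "ITC1", "ITC11", "ITDA", "ITD10", "ITG2", "IT108"])

# fullmatch requires the whole string to match the pattern
is_region = codes.str.fullmatch(r"\w{4}")
region_codes = codes[is_region].tolist()
print(region_codes)

# For purely alphanumeric codes, a plain length check gives the same mask
same_as_length_check = bool(((codes.str.len() == 4) == is_region).all())
print(same_as_length_check)
```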


Let's check the results.

In [51]:
crime_terr_len = len(crime_by_region['territory_name'].unique())
crime_terr_reg = crime_by_region['territory_name'].unique()
print(f'There are now {crime_terr_len} regions in the dataset:\n\n {crime_terr_reg}')

There are now 22 regions in the dataset:

 ['Piemonte' "Valle d'Aosta / Vallée d'Aoste" 'Liguria' 'Lombardia'
 'Trentino Alto Adige / Südtirol' 'Provincia Autonoma Bolzano / Bozen'
 'Provincia Autonoma Trento' 'Veneto' 'Friuli-Venezia Giulia'
 'Emilia-Romagna' 'Toscana' 'Umbria' 'Marche' 'Lazio' 'Abruzzo' 'Molise'
 'Campania' 'Puglia' 'Basilicata' 'Calabria' 'Sicilia' 'Sardegna']


Let's filter the dataframe to keep only the rows where `crime_type` is `total`.

In [52]:
crime_totals = crime_by_region[crime_by_region['crime_type'] == 'total'].copy()
crime_totals.sort_values(by=['territory_name', 'year'], inplace=True)
crime_totals.reset_index(drop=True, inplace=True)
crime_totals.head()

Unnamed: 0,territory_code,territory_name,crime_type,year,number_of_crime
0,ITF1,Abruzzo,total,2017,42847
1,ITF1,Abruzzo,total,2018,40038
2,ITF1,Abruzzo,total,2019,38381
3,ITF1,Abruzzo,total,2020,34250
4,ITF1,Abruzzo,total,2021,35324


Let's check data types.

In [53]:
crime_totals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   territory_code   110 non-null    object
 1   territory_name   110 non-null    object
 2   crime_type       110 non-null    object
 3   year             110 non-null    int64 
 4   number_of_crime  110 non-null    int64 
dtypes: int64(2), object(3)
memory usage: 4.4+ KB


Let's plot the data.

In [54]:
def plot_crime_regions(data):
    """
    Plot the total number of crimes per year for each region.

    Args:
        data (DataFrame): DataFrame containing the 'territory_name', 'year',
            and 'number_of_crime' columns.
    """
    # Rename the columns for hover data
    data = data.rename(columns={'territory_name': 'Region', 'year': 'Date', 'number_of_crime': 'Total Crimes'})

    # Plot the data using Plotly Express
    fig = px.line(data, x='Date', y='Total Crimes', color='Region',
                  hover_data={'Date': '|Dec %Y', 'Total Crimes': ':,'})

    # Set plot layout
    fig.update_layout(
        title='Total Crimes by Year by Region',
        xaxis_title='Year',
        yaxis_title='Total Crimes',
        width=1200,
        height=800,
        # Set the font family, size, and color
        font=dict(
            family='Arial',
            size=18,
            color='Dark Blue'
        ),
        # Use linear ticks on the x-axis, one per year
        xaxis=dict(
            tickmode='linear',
            tick0=1,  # Set the first value
            dtick=1   # Set the difference between ticks
        )
    )

    # Show the plot
    fig.show()

In [55]:
plot_crime_regions(crime_totals)

Fix some of the place name values.

In [56]:
# crime_totals.loc[:, 'territory_name'] = crime_totals['territory_name'].str.replace(' / ', '/', regex=False)

In [57]:
# crime_totals[crime_totals['territory_name'] == "Valle d'Aosta/Vallée d'Aoste"]

In [58]:
# crime_totals[(crime_totals['crime_type'] == 'total') & (crime_totals['year'] == 2021)]

## Unemployment Data

### Unemployment Data across Italy by Region

Unemployment data refers to information and statistics that capture the state of employment within a specific population or region.  
It provides insights into the number of individuals who are actively seeking employment but are currently without a job.  
Unemployment data is collected and analyzed to measure and understand the level of joblessness in an economy.
  
The data is in a **csv** file. Let's create a new **dataframe** from the **csv** file.

In [59]:
unemp_url = '../../data/Unemployment_by_Region_clean.csv'

In [60]:
# Read the data into a DataFrame
unemp_df = pd.read_csv(unemp_url)

In [61]:
unemp_df.head()

Unnamed: 0,Territory,Gender,Age Class,Duration of Unemployment,Quarter,Year,Unemployment Rate
0,Italy,males,15-64,total,1,2011,8.18978
1,Italy,males,15-64,total,2,2011,7.161944
2,Italy,males,15-64,total,4,2011,9.033599
3,Italy,males,15-64,total,3,2011,6.966123
4,Italy,males,15-64,total,2,2012,10.209171


#### Shape

In [62]:
print(f'We have {unemp_df.shape[0]} rows and {unemp_df.shape[1]} columns')

We have 7632 rows and 7 columns


#### Columns
Let's have a look at the column names.

In [63]:
unemp_cols = unemp_df.columns.to_list()
unemp_cols_len = len(unemp_cols)
print(f'The columns are:\n\n {unemp_cols}\n\n There are {unemp_cols_len} columns')

The columns are:

 ['Territory', 'Gender', 'Age Class', 'Duration of Unemployment', 'Quarter', 'Year', 'Unemployment Rate']

 There are 7 columns


#### Info()

Using `.info()` will help us identify the data types, size of the data, and any `Null` values.

In [64]:
unemp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7632 entries, 0 to 7631
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Territory                 7632 non-null   object 
 1   Gender                    7632 non-null   object 
 2   Age Class                 7632 non-null   object 
 3   Duration of Unemployment  7632 non-null   object 
 4   Quarter                   7632 non-null   int64  
 5   Year                      7632 non-null   int64  
 6   Unemployment Rate         7632 non-null   float64
dtypes: float64(1), int64(2), object(4)
memory usage: 417.5+ KB


There are no `Null` values in the data.

#### Describe()

Using `describe()` will help us understand the numerical values in the data. 

In [65]:
unemp_df.describe()

Unnamed: 0,Quarter,Year,Unemployment Rate
count,7632.0,7632.0,7632.0
mean,2.5,2018.811321,8.933278
std,1.118107,2.848931,5.056835
min,1.0,2011.0,0.795578
25%,1.75,2018.0,5.373629
50%,2.5,2019.0,7.695213
75%,3.25,2021.0,11.159385
max,4.0,2022.0,28.83325


#### Column Names

Let's explore the column names with a view to dropping any columns that are not required.

In [66]:
unemp_cols

['Territory',
 'Gender',
 'Age Class',
 'Duration of Unemployment',
 'Quarter',
 'Year',
 'Unemployment Rate']

Let's remove the columns that are not required.

In [67]:
unemp_df.drop(['Duration of Unemployment'], axis=1, inplace=True)
unemp_df.head()

Unnamed: 0,Territory,Gender,Age Class,Quarter,Year,Unemployment Rate
0,Italy,males,15-64,1,2011,8.18978
1,Italy,males,15-64,2,2011,7.161944
2,Italy,males,15-64,4,2011,9.033599
3,Italy,males,15-64,3,2011,6.966123
4,Italy,males,15-64,2,2012,10.209171


Let's tidy up the column names.

In [68]:
unemp_df.columns = unemp_df.columns.str.strip(
    ).str.lower(
        ).str.replace(
            ' ', '_', regex=False)
unemp_df.head()

Unnamed: 0,territory,gender,age_class,quarter,year,unemployment_rate
0,Italy,males,15-64,1,2011,8.18978
1,Italy,males,15-64,2,2011,7.161944
2,Italy,males,15-64,4,2011,9.033599
3,Italy,males,15-64,3,2011,6.966123
4,Italy,males,15-64,2,2012,10.209171


#### Column Unique Values

In [69]:
col_unique_count(unemp_df)

Column Name: territory
Length of Unique Values: 28
------------------------------
Column Name: gender
Length of Unique Values: 3
------------------------------
Column Name: age_class
Length of Unique Values: 3
------------------------------
Column Name: quarter
Length of Unique Values: 4
------------------------------
Column Name: year
Length of Unique Values: 12
------------------------------
Column Name: unemployment_rate
Length of Unique Values: 7612
------------------------------


In [70]:
col_unique_vals(unemp_df, ['territory', 'gender', 'age_class', 'year'])

Column Name: territory
Unique Values:
Italy
Nord
Nord-ovest
Piemonte
Valle d'Aosta / Vallée d'Aoste
Liguria
Lombardia
Nord-est
Trentino Alto Adige / Südtirol
Provincia Autonoma Bolzano / Bozen
Provincia Autonoma Trento
Veneto
Friuli-Venezia Giulia
Emilia-Romagna
Centro (I)
Toscana
Sardegna
Marche
Umbria
Lazio
Basilicata
Calabria
Mezzogiorno
Molise
Abruzzo
Sicilia
Puglia
Campania
------------------------------
Column Name: gender
Unique Values:
males
females
total
------------------------------
Column Name: age_class
Unique Values:
15-64 
15-74 
20-64 
------------------------------
Column Name: year
Unique Values:
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
------------------------------


#### Filtering Data

We can see that the data contains the country, macro-areas (such as Nord and Mezzogiorno), and regions.  
  
We will need to remove all but the regions from the data.  
  
As we have only the `territory` column to filter on, we will need to filter the data manually.  
  
Let's create a list of the `territory` column values to remove.

In [71]:
unemp_vals_to_rem = ['Italy', 'Nord', 'Nord-ovest', 'Nord-est', 'Centro (I)', 'Mezzogiorno']

We can now remove the rows whose `territory` value appears in `unemp_vals_to_rem`.

In [72]:
unemp_by_region = unemp_df[~unemp_df['territory'].isin(unemp_vals_to_rem)].reset_index(drop=True)
unemp_by_region.head()

Unnamed: 0,territory,gender,age_class,quarter,year,unemployment_rate
0,Piemonte,males,15-64,2,2018,8.087172
1,Piemonte,males,15-64,1,2018,7.785743
2,Piemonte,males,15-64,4,2018,7.990208
3,Piemonte,males,15-64,3,2018,7.292484
4,Piemonte,males,15-64,1,2019,7.311297


##### Age Groups - `age_class`
We have three age groups: '15-64', '15-74', and '20-64'.  
  
We will only be looking at the '15-64' age group.  
##### Time Period - `year`
We have twelve years in the data.  
  
We will only keep years **2017-2021**.
##### Time Period - `quarter`
We have four quarters in the data.  
  
We will only be looking at the fourth quarter (`quarter == 4`).  
##### Gender - `gender`
We have three gender values: `males`, `females`, and `total`.  
  
We will keep only the `total` values.

In [73]:
# Modify the dataFrame to only include the values we want
years = [2017, 2018, 2019, 2020, 2021]
unemp_totals = unemp_by_region[
    (unemp_by_region['age_class'] == '15-64 ') &
    (unemp_by_region['quarter'] == 4) &
    (unemp_by_region['gender'] == 'total') &
    (unemp_by_region['year'].isin(years))
].reset_index(drop=True)  # Reset on a new frame rather than in place on a filtered slice
unemp_totals.head()

Unnamed: 0,territory,gender,age_class,quarter,year,unemployment_rate
0,Piemonte,total,15-64,4,2018,8.361784
1,Piemonte,total,15-64,4,2019,7.244067
2,Piemonte,total,15-64,4,2020,7.74387
3,Piemonte,total,15-64,4,2021,6.95121
4,Valle d'Aosta / Vallée d'Aoste,total,15-64,4,2018,8.491601


We can see that the `year` column now begins at `2018`.  
Let's take a look to see why.

In [74]:
unemp_df[(unemp_df['year'] == 2017) & (unemp_df['territory'] == 'Piemonte')]

Unnamed: 0,territory,gender,age_class,quarter,year,unemployment_rate


It appears there is no region-level data for `2017`.  
  
We will fill in the `2017` row for each region with the `mean` of its 2018-2021 values.

In [75]:
# Compute each territory's average unemployment rate over the available years (2018-2021)
avg_unemp_rate = unemp_totals[unemp_totals['year'] > 2017].groupby('territory')['unemployment_rate'].mean()
# Get unique territory, gender, age_class, and quarter combinations
unique_values = unemp_totals[['territory', 'gender', 'age_class', 'quarter']].drop_duplicates()
# Set the year for the new rows
unique_values['year'] = 2017
# Map each territory's average rate onto its new 2017 row
unique_values['unemployment_rate'] = unique_values['territory'].map(avg_unemp_rate)
# Append the new rows to the unemp_totals dataframe
unemp_totals = pd.concat([unemp_totals, unique_values], ignore_index=True)
# Sort the dataframe
unemp_totals.sort_values(by=['territory', 'year'], inplace=True)
# Reset the index
unemp_totals.reset_index(drop=True, inplace=True)
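The mean-based fill can be illustrated on a toy frame with two territories. The rates below are invented, and `toy`, `avg`, `fill`, and `toy_filled` are hypothetical names. The closing check, that every territory ends up with five distinct years, is a cheap way to confirm that the fill did what was intended.

```python
import pandas as pd

# Toy stand-in for unemp_totals: four years per territory, 2017 missing (invented rates)
toy = pd.DataFrame({
    "territory": ["Piemonte"] * 4 + ["Liguria"] * 4,
    "year": [2018, 2019, 2020, 2021] * 2,
    "unemployment_rate": [8.0, 7.0, 8.0, 7.0, 10.0, 9.0, 9.0, 8.0],
})

# Mean of the available years per territory
avg = toy.groupby("territory")["unemployment_rate"].mean()

# One placeholder row per territory for the missing year
fill = toy[["territory"]].drop_duplicates().assign(year=2017)
fill["unemployment_rate"] = fill["territory"].map(avg)

toy_filled = (
    pd.concat([toy, fill], ignore_index=True)
    .sort_values(["territory", "year"])
    .reset_index(drop=True)
)

# Sanity check: every territory should now have five distinct years
assert (toy_filled.groupby("territory")["year"].nunique() == 5).all()
print(toy_filled[toy_filled["year"] == 2017])
```

A mean of later years is only a placeholder for the missing observation, so any trend that appears between 2017 and 2018 in the combined data should be read with that in mind.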


Let's check data types.

In [76]:
unemp_totals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   territory          110 non-null    object 
 1   gender             110 non-null    object 
 2   age_class          110 non-null    object 
 3   quarter            110 non-null    int64  
 4   year               110 non-null    int64  
 5   unemployment_rate  110 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 5.3+ KB


## Population Data

Population data refers to information and statistics that provide insights into the demographic composition and characteristics of a specific group of individuals within a defined geographic area.  
It encompasses various data points related to the size, distribution, structure, and dynamics of a population.  
  
For now we will use the collective population data by region.  
  
The data is in a **csv** file. Let's create a new **dataframe** from the **csv** file.

In [77]:
pop_url = '../../data/population_data_2019_23.csv'

In [78]:
# Read the data into a DataFrame
pop_df = pd.read_csv(pop_url)
pop_df

Unnamed: 0,territory,2017,2018,2019,2020,2021,2022,2023
0,Piemonte,4370348,4349911,4328565,4311217,4274945,4256350,4240736
1,Valle d'Aosta / Vallée d'Aoste,126677,126213,125653,125034,124089,123360,122955
2,Liguria,1551379,1541541,1532980,1524826,1518495,1509227,1502624
3,Lombardia,9970419,9986962,10010833,10027602,9981554,9943004,9950742
4,Trentino Alto Adige / Südtirol,1063734,1068738,1074034,1078069,1077078,1073574,1075317
5,Provincia Autonoma Bolzano / Bozen,523454,526772,530313,532644,534912,532616,533267
6,Provincia Autonoma Trento,540280,541966,543721,545425,542166,540958,542050
7,Veneto,4883373,4880936,4884590,4879133,4869830,4847745,4838253
8,Friuli-Venezia Giulia,1212809,1211155,1210414,1206216,1201510,1194647,1192191
9,Emilia-Romagna,4439768,4445920,4459453,4464119,4438937,4425366,4426929


#### Shape

In [79]:
print(f'We have {pop_df.shape[0]} rows and {pop_df.shape[1]} columns')

We have 22 rows and 8 columns


#### Columns
Let's have a look at the column names.

In [80]:
pop_cols = pop_df.columns.to_list()
pop_cols_len = len(pop_cols)
print(f'The columns are:\n\n {pop_cols}\n\n There are {pop_cols_len} columns')

The columns are:

 ['territory  ', '2017', '2018', '2019', '2020', '2021', '2022', '2023']

 There are 8 columns


#### Info()

Using `.info()` will help us identify the data types, size of the data, and any `Null` values.

In [81]:
pop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   territory    22 non-null     object
 1   2017         22 non-null     object
 2   2018         22 non-null     object
 3   2019         22 non-null     object
 4   2020         22 non-null     object
 5   2021         22 non-null     object
 6   2022         22 non-null     object
 7   2023         22 non-null     object
dtypes: object(8)
memory usage: 1.5+ KB


There are no `Null` values in the data, but every column is stored as `object`, so the numeric columns will need converting later.

#### Describe()

Using `describe()` will help us understand the numerical values in the data. 

In [82]:
pop_df.describe()

Unnamed: 0,territory,2017,2018,2019,2020,2021,2022,2023
count,22,22,22,22,22,22,22,22
unique,22,22,22,22,22,22,22,22
top,Piemonte,4370348,4349911,4328565,4311217,4274945,4256350,4240736
freq,1,1,1,1,1,1,1,1


#### Column Names

Let's explore the column names with a view to dropping any columns that are not required.

In [83]:
pop_cols

['territory  ', '2017', '2018', '2019', '2020', '2021', '2022', '2023']

We have some white space in the column names. Let's remove it.

In [84]:
pop_df.columns = pop_df.columns.str.strip(
    ).str.lower(
        ).str.replace(
            ' ', '', regex=False)
pop_df.columns

Index(['territory', '2017', '2018', '2019', '2020', '2021', '2022', '2023'], dtype='object')

In [85]:
pop_df.drop(['2022', '2023'], axis=1, inplace=True)
pop_df.head()

Unnamed: 0,territory,2017,2018,2019,2020,2021
0,Piemonte,4370348,4349911,4328565,4311217,4274945
1,Valle d'Aosta / Vallée d'Aoste,126677,126213,125653,125034,124089
2,Liguria,1551379,1541541,1532980,1524826,1518495
3,Lombardia,9970419,9986962,10010833,10027602,9981554
4,Trentino Alto Adige / Südtirol,1063734,1068738,1074034,1078069,1077078


##### Pivot Dataframe

Let's reshape the dataframe from wide to long format (one row per territory and year) to match our data so far.

In [86]:
# Melt the DataFrame so that the year columns become a single 'year' column
pop_totals = pd.melt(pop_df, id_vars=['territory'], var_name='year', value_name='population')
pop_totals.sort_values(['territory', 'year']).reset_index(drop=True).head()

Unnamed: 0,territory,year,population
0,Abruzzo,2017,1313930
1,Abruzzo,2018,1306059
2,Abruzzo,2019,1300645
3,Abruzzo,2020,1293941
4,Abruzzo,2021,1281012


#### Column Unique Values

In [87]:
col_unique_count(pop_totals)

Column Name: territory
Length of Unique Values: 22
------------------------------
Column Name: year
Length of Unique Values: 5
------------------------------
Column Name: population
Length of Unique Values: 110
------------------------------


In [88]:
col_unique_vals(pop_totals, ['territory', 'year'])

Column Name: territory
Unique Values:
Piemonte  
Valle d'Aosta / Vallée d'Aoste  
Liguria  
Lombardia  
Trentino Alto Adige / Südtirol  
Provincia Autonoma Bolzano / Bozen  
Provincia Autonoma Trento  
Veneto  
Friuli-Venezia Giulia  
Emilia-Romagna  
Toscana  
Umbria  
Marche  
Lazio  
Abruzzo  
Molise  
Campania  
Puglia  
Basilicata  
Calabria  
Sicilia  
Sardegna  
------------------------------
Column Name: year
Unique Values:
2017
2018
2019
2020
2021
------------------------------


Let's check data types.

In [89]:
pop_totals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   territory   110 non-null    object
 1   year        110 non-null    object
 2   population  110 non-null    object
dtypes: object(3)
memory usage: 2.7+ KB


We need to convert the `year` and `population` columns to the `int` data type.

In [90]:
pop_totals['year'] = pop_totals['year'].astype(int)
pop_totals['population'] = pop_totals['population'].replace(",", "", regex=True).astype(int)
pop_totals.sort_values(by=['territory', 'year'], inplace=True)
pop_totals.info()

<class 'pandas.core.frame.DataFrame'>
Index: 110 entries, 14 to 95
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   territory   110 non-null    object
 1   year        110 non-null    int32 
 2   population  110 non-null    int32 
dtypes: int32(2), object(1)
memory usage: 2.6+ KB
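The conversion step can be sketched on a toy frame. Two assumptions are built in and should be checked against the real file: that some population strings contain thousands separators, and that the `territory` values carry trailing spaces. The unique values printed earlier appear to have them, though this may be an artifact of how the output was rendered. Requesting `int64` explicitly avoids the platform-dependent `int32` seen in the output above.

```python
import pandas as pd

# Toy stand-in for pop_totals before conversion (strings throughout, invented layout)
toy = pd.DataFrame({
    "territory": ["Piemonte  ", "Liguria  "],
    "year": ["2017", "2017"],
    "population": ["4,370,348", "1551379"],
})

# Strip stray whitespace so the names can later be matched against other datasets
toy["territory"] = toy["territory"].str.strip()

# Convert the year labels to integers
toy["year"] = toy["year"].astype("int64")

# Remove thousands separators, then convert; to_numeric raises on anything non-numeric
toy["population"] = pd.to_numeric(
    toy["population"].str.replace(",", "", regex=False)
).astype("int64")

print(toy.dtypes)
```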


## Geographical Data

### Geographical Data of Italian Regions

GeoJSON is a format for encoding geographic data structures, such as the boundary of an area.  
It is a widely used standard for representing geographic features, such as points, lines, and polygons, along with their associated properties.  
GeoJSON data can represent a variety of geographical entities, including countries, states, cities, landmarks, and more.  
It stores geographic coordinates as well as additional attributes that provide information about the features.

The data is in a **geojson** file. Let's create a new **dataframe** from the **geojson** file.

In [91]:
reg_sim_url ='../../data/geo_data/regions_simplified.geojson'

In [92]:
with open(reg_sim_url) as f:
    geojson_reg = json.load(f)

In [93]:
# Convert the geojson data to a pandas dataframe using json_normalize
geo_df = pd.json_normalize(geojson_reg['features'])
geo_df.head()

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.reg_name,properties.reg_istat_code_num,properties.reg_istat_code
0,Feature,Polygon,"[[[7.104329571840417, 45.46695875661557], [7.1...",Piemonte,1,1
1,Feature,Polygon,"[[[7.864048284549529, 45.91643936466997], [7.8...",Valle d'Aosta/Vallée d'Aoste,2,2
2,Feature,Polygon,"[[[8.714815174881677, 46.098042790817374], [8....",Lombardia,3,3
3,Feature,Polygon,"[[[10.840150465662777, 45.83275599772702], [10...",Trentino-Alto Adige/Südtirol,4,4
4,Feature,Polygon,"[[[10.840150465662777, 45.83275599772702], [10...",Veneto,5,5


#### Shape

In [94]:
print(f'We have {geo_df.shape[0]} rows and {geo_df.shape[1]} columns')

We have 20 rows and 6 columns


#### Columns
Let's have a look at the column names.

In [95]:
geo_cols = geo_df.columns.to_list()
geo_cols_len = len(geo_cols)
print(f'The columns are:\n\n {geo_cols}\n\n There are {geo_cols_len} columns')

The columns are:

 ['type', 'geometry.type', 'geometry.coordinates', 'properties.reg_name', 'properties.reg_istat_code_num', 'properties.reg_istat_code']

 There are 6 columns


#### Info()

Using `.info()` will help us identify the data types, size of the data, and any `Null` values.

In [96]:
geo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 6 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   type                           20 non-null     object
 1   geometry.type                  20 non-null     object
 2   geometry.coordinates           20 non-null     object
 3   properties.reg_name            20 non-null     object
 4   properties.reg_istat_code_num  20 non-null     int64 
 5   properties.reg_istat_code      20 non-null     object
dtypes: int64(1), object(5)
memory usage: 1.1+ KB


There are no `Null` values in the data.
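
As a cross-check on what `.info()` reports, null values can also be counted directly. The following is a small sketch on a hypothetical frame (`demo` and its values are made up for illustration); the same two calls could be applied to `geo_df`.

```python
import pandas as pd

# Hypothetical frame with one missing value in 'reg_name'
demo = pd.DataFrame({"reg_name": ["Region A", None, "Region C"], "code": [1, 2, 3]})

# Count missing values per column
null_counts = demo.isna().sum()
print(null_counts)

# True if any cell in the whole frame is missing
has_nulls = bool(demo.isna().any().any())
print(has_nulls)
```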

#### Describe()

Using `describe()` will help us understand the numerical values in the data. 

In [97]:
geo_df.describe()

Unnamed: 0,properties.reg_istat_code_num
count,20.0
mean,10.5
std,5.91608
min,1.0
25%,5.75
50%,10.5
75%,15.25
max,20.0


#### Column Names

Let's explore the column names with a view to dropping any columns that are not required.

In [98]:
geo_cols

['type',
 'geometry.type',
 'geometry.coordinates',
 'properties.reg_name',
 'properties.reg_istat_code_num',
 'properties.reg_istat_code']

In [99]:
geo_df.drop(['type', 'geometry.type'], axis=1, inplace=True)
geo_df.head()

Unnamed: 0,geometry.coordinates,properties.reg_name,properties.reg_istat_code_num,properties.reg_istat_code
0,"[[[7.104329571840417, 45.46695875661557], [7.1...",Piemonte,1,1
1,"[[[7.864048284549529, 45.91643936466997], [7.8...",Valle d'Aosta/Vallée d'Aoste,2,2
2,"[[[8.714815174881677, 46.098042790817374], [8....",Lombardia,3,3
3,"[[[10.840150465662777, 45.83275599772702], [10...",Trentino-Alto Adige/Südtirol,4,4
4,"[[[10.840150465662777, 45.83275599772702], [10...",Veneto,5,5


##### Sort by properties.reg_name

In [100]:
geo_df.sort_values(by=['properties.reg_name'], inplace=True)

#### Column Unique Values

In [101]:
geo_df['properties.reg_name'].unique(), len(geo_df['properties.reg_name'].unique())

(array(['Abruzzo', 'Basilicata', 'Calabria', 'Campania', 'Emilia-Romagna',
        'Friuli-Venezia Giulia', 'Lazio', 'Liguria', 'Lombardia', 'Marche',
        'Molise', 'Piemonte', 'Puglia', 'Sardegna', 'Sicilia', 'Toscana',
        'Trentino-Alto Adige/Südtirol', 'Umbria',
        "Valle d'Aosta/Vallée d'Aoste", 'Veneto'], dtype=object),
 20)

Let's check the data types.

In [102]:
geo_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, 12 to 4
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   geometry.coordinates           20 non-null     object
 1   properties.reg_name            20 non-null     object
 2   properties.reg_istat_code_num  20 non-null     int64 
 3   properties.reg_istat_code      20 non-null     object
dtypes: int64(1), object(3)
memory usage: 800.0+ bytes


## Data Summary: Current Snapshot

From the data explored so far, we have some changes to make.  
- To help with the merge, we will rename the **place name** column to `reg_name` in all dataframes
- We need to strip leading and trailing whitespace from the dataframe values
- We need to check and sync the place names across the dataframes
- We need to create a new column holding a count of unemployed people, derived from the `unemp_totals` and `pop_totals` dataframes

Let's create a summary of the data we have so far.  
##### Column Names
First we will create a list of the dataframes we have.  
We can then iterate over them and return their column names.

In [103]:
print('-' * 30)
# Create a list of custom names and dataframes
data_frames = [('Covid 19', covid_totals), ('Crime', crime_totals), ('Unemployment', unemp_totals), ('Population', pop_totals), ('Geo', geo_df)]

# Iterate over the list of dataframes
for name, df in data_frames:
    # Print the column names
    print(name + " column names:\n", df.columns.to_list())
    print('-' * 30)

------------------------------
Covid 19 column names:
 ['reg_name', 'lat', 'lon', 'reg_code', 'year', 'total_cases']
------------------------------
Crime column names:
 ['territory_code', 'territory_name', 'crime_type', 'year', 'number_of_crime']
------------------------------
Unemployment column names:
 ['territory', 'gender', 'age_class', 'quarter', 'year', 'unemployment_rate']
------------------------------
Population column names:
 ['territory', 'year', 'population']
------------------------------
Geo column names:
 ['geometry.coordinates', 'properties.reg_name', 'properties.reg_istat_code_num', 'properties.reg_istat_code']
------------------------------


##### Change Column Names

In [104]:
# Rename the columns to match
crime_totals = crime_totals.copy()
crime_totals.rename(columns={'territory_name': 'reg_name'}, inplace=True)
unemp_totals = unemp_totals.copy()
unemp_totals.rename(columns={'territory': 'reg_name'}, inplace=True)
pop_totals = pop_totals.copy()
pop_totals.rename(columns={'territory': 'reg_name'}, inplace=True)
geo_df = geo_df.copy()
geo_df.rename(columns={'geometry.coordinates': 'geometry', 'properties.reg_name': 'reg_name'}, inplace=True)

Let's check the dataframes.

In [105]:
# Create a list of custom names and dataframes
data_frames = [('Covid 19', covid_totals), ('Crime', crime_totals), ('Unemployment', unemp_totals), ('Population', pop_totals), ('Geo', geo_df)]

# Iterate over the list of dataframes
for name, df in data_frames:
    # Print the column names
    print(name + " column names:\n", df.columns.to_list())
    # Print the shape of the dataframe
    print(f'We have {df.shape[0]} rows and {df.shape[1]} columns')
    # Print the first row of the dataframe
    print(f'Sample row:\n', df.head(1))
    print('-' * 30)

Covid 19 column names:
 ['reg_name', 'lat', 'lon', 'reg_code', 'year', 'total_cases']
We have 107 rows and 6 columns
Sample row:
   reg_name        lat        lon  reg_code  year  total_cases
0  Abruzzo  42.351222  13.398438        13  2017            0
------------------------------
Crime column names:
 ['territory_code', 'reg_name', 'crime_type', 'year', 'number_of_crime']
We have 110 rows and 5 columns
Sample row:
   territory_code reg_name crime_type  year  number_of_crime
0           ITF1  Abruzzo      total  2017            42847
------------------------------
Unemployment column names:
 ['reg_name', 'gender', 'age_class', 'quarter', 'year', 'unemployment_rate']
We have 110 rows and 6 columns
Sample row:
   reg_name gender age_class  quarter  year  unemployment_rate
0  Abruzzo  total    15-64         4  2017          10.309211
------------------------------
Population column names:
 ['reg_name', 'year', 'population']
We have 110 rows and 3 columns
Sample row:
      reg_name  year

##### Remove Leading and Trailing Whitespace

In [106]:

dfs = [covid_totals, crime_totals, unemp_totals, pop_totals]

for df in dfs:
    for col in df.columns:
        if df[col].dtype == 'object':  # If column is of object/string type
            df.loc[:, col] = df[col].str.strip()  # Strip leading and trailing spaces
            df.loc[:, col] = df[col].str.replace(' / ', '/')  # Replace ' / ' with '/'
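
The same clean-up can be wrapped in a small helper that returns a new dataframe instead of modifying the original. This is only a sketch: the function name `normalize_text_columns` and the `demo` frame are hypothetical, and `regex=False` is added so that `' / '` is treated as a literal string.

```python
import pandas as pd

def normalize_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with string columns stripped and ' / ' collapsed to '/'."""
    out = df.copy()
    for col in out.columns:
        # Only touch columns that hold strings
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip().str.replace(" / ", "/", regex=False)
    return out

# Hypothetical input with padding and a spaced slash
demo = pd.DataFrame({"reg_name": ["  Abruzzo ", "Trentino Alto Adige / Südtirol"], "year": [2017, 2017]})
clean = normalize_text_columns(demo)
print(clean["reg_name"].tolist())
```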

##### Check Place Names

There are some inconsistencies in the place names.  
- `'Provincia Autonoma Trento'` and `'Provincia Autonoma Bolzano / Bozen'` are subdivisions of the `'Trentino Alto Adige / Südtirol'` region
- `'Trentino-Alto Adige/Südtirol'` has a hyphen in the `geo_df` dataframe 
- In the `covid_totals` dataframe, `'Friuli Venezia Giulia'` is missing a hyphen and `"Valle d'Aosta"` lacks the `"/Vallée d'Aoste"` suffix used in the other dataframes
- The `covid_totals` dataframe has `'P.A. Bolzano'` and `'P.A. Trento'` instead of `'Provincia Autonoma Bolzano / Bozen'` and `'Provincia Autonoma Trento'`
- The `covid_totals` dataframe is missing `'Trentino Alto Adige / Südtirol'`
  
We need to make the following changes to the dataframes.
- Remove the hyphen from `'Trentino-Alto Adige/Südtirol'` in the `geo_df` dataframe
- Add the missing hyphen to `'Friuli Venezia Giulia'` in the `covid_totals` dataframe
- Rename `"Valle d'Aosta"` to `"Valle d'Aosta/Vallée d'Aoste"` in the `covid_totals` dataframe
- Merge the `'P.A. Bolzano'` and `'P.A. Trento'` rows and rename them to `'Trentino Alto Adige / Südtirol'` in the `covid_totals` dataframe
- Remove the `'Provincia Autonoma Trento'` and `'Provincia Autonoma Bolzano / Bozen'` rows from the `crime_totals`, `unemp_totals`, and `pop_totals` dataframes
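
One way to find mismatches like these is to compare the sets of names from two dataframes. The sketch below uses two short hypothetical name lists (`names_a`, `names_b`) in place of the real `reg_name` columns.

```python
import pandas as pd

# Hypothetical name lists standing in for the 'reg_name' columns of two dataframes
names_a = pd.Series(["Abruzzo", "Friuli Venezia Giulia", "Veneto"])
names_b = pd.Series(["Abruzzo", "Friuli-Venezia Giulia", "Veneto"])

# Names that appear in one source but not the other
only_in_a = sorted(set(names_a) - set(names_b))
only_in_b = sorted(set(names_b) - set(names_a))

print("Only in A:", only_in_a)
print("Only in B:", only_in_b)
```

Any name that appears in either difference would fail to match in a merge on `reg_name`, so both lists should be empty before merging.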

##### Correct the Place Names

In [107]:
geo_df.loc[geo_df['reg_name'] == 'Trentino-Alto Adige/Südtirol', 'reg_name'] = 'Trentino Alto Adige/Südtirol'
geo_df[geo_df['reg_name'] == 'Trentino Alto Adige/Südtirol']

Unnamed: 0,geometry,reg_name,properties.reg_istat_code_num,properties.reg_istat_code
3,"[[[10.840150465662777, 45.83275599772702], [10...",Trentino Alto Adige/Südtirol,4,4


In [108]:
# Add the missing hyphen to `'Friuli Venezia Giulia'` in the `covid_totals` dataframe
covid_totals.loc[covid_totals['reg_name'] == 'Friuli Venezia Giulia', 'reg_name'] = 'Friuli-Venezia Giulia'
# Correct the name of `'Valle d'Aosta'` in the `covid_totals` dataframe
covid_totals.loc[covid_totals['reg_name'] == "Valle d'Aosta", 'reg_name'] = "Valle d'Aosta/Vallée d'Aoste"

##### Merge and Rename

Let's do the merge and rename the values.

In [109]:
# Replace the 'reg_name' values
covid_totals['reg_name'] = covid_totals['reg_name'].replace(['P.A. Bolzano', 'P.A. Trento'], 'Trentino Alto Adige/Südtirol')

# Replace the 'reg_code' values
covid_totals.loc[covid_totals['reg_name'] == 'Trentino Alto Adige/Südtirol', 'reg_code'] = 'ITDA'

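# Group by the identifying columns and sum 'total_cases'.
# Because 'lat' and 'lon' are part of the key, the two provinces keep separate rows (see the output below).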
covid_totals = covid_totals.groupby(['reg_name', 'lat', 'lon', 'year', 'reg_code'], as_index=False)['total_cases'].sum()

covid_totals[covid_totals['reg_name'] == 'Trentino Alto Adige/Südtirol']

Unnamed: 0,reg_name,lat,lon,year,reg_code,total_cases
82,Trentino Alto Adige/Südtirol,46.068935,11.121231,2017,ITDA,0
83,Trentino Alto Adige/Südtirol,46.068935,11.121231,2018,ITDA,0
84,Trentino Alto Adige/Südtirol,46.068935,11.121231,2019,ITDA,0
85,Trentino Alto Adige/Südtirol,46.068935,11.121231,2020,ITDA,20839
86,Trentino Alto Adige/Südtirol,46.068935,11.121231,2021,ITDA,57706
87,Trentino Alto Adige/Südtirol,46.499335,11.356624,2017,ITDA,0
88,Trentino Alto Adige/Südtirol,46.499335,11.356624,2018,ITDA,0
89,Trentino Alto Adige/Südtirol,46.499335,11.356624,2019,ITDA,0
90,Trentino Alto Adige/Südtirol,46.499335,11.356624,2020,ITDA,28722
91,Trentino Alto Adige/Südtirol,46.499335,11.356624,2021,ITDA,97864
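
The output above still has two rows per year because the two provinces have different coordinates. If a single row per region and year is wanted, one option is to leave `lat` and `lon` out of the grouping key and aggregate them separately. The sketch below uses the 2020 and 2021 rows from the output above; averaging the coordinates is an assumption made here for illustration, not something the notebook does.

```python
import pandas as pd

# The 2020 and 2021 rows copied from the output above
provinces = pd.DataFrame({
    "reg_name": ["Trentino Alto Adige/Südtirol"] * 4,
    "lat": [46.068935, 46.068935, 46.499335, 46.499335],
    "lon": [11.121231, 11.121231, 11.356624, 11.356624],
    "year": [2020, 2021, 2020, 2021],
    "reg_code": ["ITDA"] * 4,
    "total_cases": [20839, 57706, 28722, 97864],
})

# Group on region, year and code only; average the coordinates (assumption) and sum the cases
region = provinces.groupby(["reg_name", "year", "reg_code"], as_index=False).agg(
    lat=("lat", "mean"),
    lon=("lon", "mean"),
    total_cases=("total_cases", "sum"),
)
print(region)
```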


##### Remove Rows

Let's check the values first.

In [110]:
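# Define the list of region names of interest.
# Note: the whitespace step above rewrote ' / ' as '/', so names spelled with ' / ' no longer match.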
regions_of_interest = ['Provincia Autonoma Trento', 'Provincia Autonoma Bolzano / Bozen', 'Trentino Alto Adige / Südtirol']

# Filter rows that have a 'reg_name' contained in regions_of_interest and group by 'reg_name' to get the sum of 'number_of_crime'
sub_cri = crime_totals[crime_totals['reg_name'].isin(regions_of_interest)].groupby('reg_name')['number_of_crime'].sum().reset_index()

sub_cri

Unnamed: 0,reg_name,number_of_crime
0,Provincia Autonoma Trento,70790


In [111]:
# Define the list of region names of interest
regions_of_interest = ['Provincia Autonoma Trento', 'Provincia Autonoma Bolzano / Bozen', 'Trentino Alto Adige / Südtirol']

# Filter rows that have a 'reg_name' contained in regions_of_interest and group by 'reg_name' to get the sum of 'population'
sub_pop = pop_totals[pop_totals['reg_name'].isin(regions_of_interest)].groupby('reg_name')['population'].sum().reset_index()

sub_pop

Unnamed: 0,reg_name,population
0,Provincia Autonoma Trento,2713558


In [112]:
# Define the list of region names of interest
regions_of_interest = ['Provincia Autonoma Trento', 'Provincia Autonoma Bolzano / Bozen', 'Trentino Alto Adige / Südtirol']

# Filter rows that have a 'reg_name' contained in regions_of_interest and group by 'reg_name' to get the sum of 'unemployment_rate'
sub_une = unemp_totals[unemp_totals['reg_name'].isin(regions_of_interest)].groupby('reg_name')['unemployment_rate'].sum().reset_index()

sub_une

Unnamed: 0,reg_name,unemployment_rate
0,Provincia Autonoma Trento,22.815205


Let's remove the unwanted rows.

In [113]:
# Remove the rows
rows_to_remove = ['Provincia Autonoma Trento', 'Provincia Autonoma Bolzano / Bozen']
crime_totals = crime_totals[~crime_totals['reg_name'].isin(rows_to_remove)].reset_index(drop=True)
unemp_totals = unemp_totals[~unemp_totals['reg_name'].isin(rows_to_remove)].reset_index(drop=True)
pop_totals = pop_totals[~pop_totals['reg_name'].isin(rows_to_remove)].reset_index(drop=True)

##### Quick explanation of the code
- **`pop_totals['reg_name'].isin(['Provincia Autonoma Trento', 'Provincia Autonoma Bolzano / Bozen'])`:**  
This line is checking if the `'reg_name'` in each row of `pop_totals` is either `'Provincia Autonoma Trento'` or `'Provincia Autonoma Bolzano / Bozen'`.  
It will return a Boolean Series (a series of True and False values), where True indicates that the `'reg_name'` is one of those two specified values.
- **`~`:**  
The `tilde` operator in front of the expression negates the Boolean Series. So, **True** becomes **False** and **False** becomes **True**.
- **`pop_totals[...]`:**  
This is standard DataFrame indexing. By putting a Boolean Series inside the square brackets, we're telling pandas to only keep the rows where the Series is **True**.
  
We can also use a list comprehension to achieve the same result:

In [114]:
# # Define the rows to remove
# rows_to_remove = ['Provincia Autonoma Trento', 'Provincia Autonoma Bolzano / Bozen']

# # Use list comprehension to filter and reindex all the DataFrames
# crime_totals, unemp_totals, pop_totals = [
#     df[~df['reg_name'].isin(rows_to_remove)].reset_index(drop=True)
#     for df in [crime_totals, unemp_totals, pop_totals]]
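
The boolean-mask filtering explained above can be tried on a small frame to see each step. The `demo` dataframe and its values are hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame
demo = pd.DataFrame({
    "reg_name": ["Abruzzo", "Provincia Autonoma Trento", "Veneto"],
    "population": [1, 2, 3],
})

# Step 1: True where the name is in the list
mask = demo["reg_name"].isin(["Provincia Autonoma Trento"])
print(mask.tolist())

# Step 2 and 3: negate the mask and keep only the rows where it is True
kept = demo[~mask].reset_index(drop=True)
print(kept)
```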

Now we can create a new column for the unemployed people count.  
  
We will do this in the `pop_totals` dataframe.

First we merge the `pop_totals` and `unemp_totals` dataframes on the `'reg_name'` and `'year'` columns.

In [115]:
# Left-merge the population figures onto the unemployment rows using the shared 'year' and 'reg_name' columns
pop_totals = unemp_totals.merge(pop_totals[['year', 'reg_name', 'population']], on=['year', 'reg_name'], how='left')
pop_totals

Unnamed: 0,reg_name,gender,age_class,quarter,year,unemployment_rate,population
0,Abruzzo,total,15-64,4,2017,10.309211,1313930
1,Abruzzo,total,15-64,4,2018,9.390300,1306059
2,Abruzzo,total,15-64,4,2019,13.247103,1300645
3,Abruzzo,total,15-64,4,2020,10.345357,1293941
4,Abruzzo,total,15-64,4,2021,8.254084,1281012
...,...,...,...,...,...,...,...
100,Veneto,total,15-64,4,2017,6.449217,4883373
101,Veneto,total,15-64,4,2018,7.578436,4880936
102,Veneto,total,15-64,4,2019,5.766357,4884590
103,Veneto,total,15-64,4,2020,7.068285,4879133


We can now calculate the count of unemployed people.

In [116]:
pop_totals['unemp_pop'] = (pop_totals['population'] * pop_totals['unemployment_rate'] / 100).astype(int)
pop_totals

Unnamed: 0,reg_name,gender,age_class,quarter,year,unemployment_rate,population,unemp_pop
0,Abruzzo,total,15-64,4,2017,10.309211,1313930,135455
1,Abruzzo,total,15-64,4,2018,9.390300,1306059,122642
2,Abruzzo,total,15-64,4,2019,13.247103,1300645,172297
3,Abruzzo,total,15-64,4,2020,10.345357,1293941,133862
4,Abruzzo,total,15-64,4,2021,8.254084,1281012,105735
...,...,...,...,...,...,...,...,...
100,Veneto,total,15-64,4,2017,6.449217,4883373,314939
101,Veneto,total,15-64,4,2018,7.578436,4880936,369898
102,Veneto,total,15-64,4,2019,5.766357,4884590,281662
103,Veneto,total,15-64,4,2020,7.068285,4879133,344871
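
As a worked check of the calculation, the first row of the table (Abruzzo, 2017) can be reproduced on its own. Note that `.astype(int)` truncates the fractional part rather than rounding it.

```python
import pandas as pd

# Population and unemployment rate taken from the Abruzzo 2017 row above
row = pd.DataFrame({"population": [1313930], "unemployment_rate": [10.309211]})

# 1313930 * 10.309211 / 100 is roughly 135455.8, which truncates to 135455
row["unemp_pop"] = (row["population"] * row["unemployment_rate"] / 100).astype(int)
print(row)
```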


In [164]:
# regions1 = pd.DataFrame(covid_totals['reg_name'].unique())
# regions1.to_csv('../../data/regions1.csv')
# regions2 = pd.DataFrame(crime_totals['reg_name'].unique())
# regions2.to_csv('../../data/regions2.csv')
# regions3 = pd.DataFrame(pop_totals['reg_name'].unique())
# regions3.to_csv('../../data/regions3.csv')
# regions4 = pd.DataFrame(unemp_totals['reg_name'].unique())
# regions4.to_csv('../../data/regions4.csv')
# regions5 = pd.DataFrame(geo['reg_name'].unique())
# regions5.to_csv('../../data/regions5.csv')

## Merge Dataframes

Now that we have all our dataframes in the correct format, we can merge them into a single dataframe.  
  
We will merge the dataframes on the `reg_name` and `year` columns.
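
Before merging the real data, it can help to check that the keys line up. The sketch below uses two small hypothetical frames (`left`, `right`) and the `validate` and `indicator` options of `DataFrame.merge` to flag duplicate keys and unmatched rows.

```python
import pandas as pd

# Hypothetical frames keyed on 'reg_name' and 'year' (values are illustrative)
left = pd.DataFrame({
    "reg_name": ["Abruzzo", "Abruzzo", "Veneto"],
    "year": [2020, 2021, 2020],
    "total_cases": [10, 20, 30],
})
right = pd.DataFrame({
    "reg_name": ["Abruzzo", "Abruzzo"],
    "year": [2020, 2021],
    "number_of_crime": [5, 6],
})

# validate raises an error if the keys are not unique on both sides;
# indicator adds a '_merge' column showing where each row came from
merged = left.merge(right, on=["year", "reg_name"], how="left", validate="one_to_one", indicator=True)

# Rows from the left frame that found no partner on the right
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["reg_name", "year"]])
```

With a left merge, unmatched rows keep their left-hand values and get `NaN` in the right-hand columns, so an empty `unmatched` frame is a quick sign that every region and year found a partner.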

In [118]:
dataframes = [('Covid 19', covid_totals), ('Crime', crime_totals), ('Unemployment', unemp_totals)]
for name, df in dataframes:
    print(name)
    print(f'We have {df.shape[0]} rows and {df.shape[1]} columns')
    print(f'Sample row:\n', df.head(1))
    print('-' * 30)

Covid 19
We have 107 rows and 6 columns
Sample row:
   reg_name        lat        lon  year reg_code  total_cases
0  Abruzzo  42.351222  13.398438  2017       13            0
------------------------------
Crime
We have 105 rows and 5 columns
Sample row:
   territory_code reg_name crime_type  year  number_of_crime
0           ITF1  Abruzzo      total  2017            42847
------------------------------
Unemployment
We have 105 rows and 6 columns
Sample row:
   reg_name gender age_class  quarter  year  unemployment_rate
0  Abruzzo  total     15-64        4  2017          10.309211
------------------------------


In [119]:
all_data_1 = covid_totals.merge(crime_totals, on=['year', 'reg_name'], how='left')
all_data_1.head()

Unnamed: 0,reg_name,lat,lon,year,reg_code,total_cases,territory_code,crime_type,number_of_crime
0,Abruzzo,42.351222,13.398438,2017,13,0,ITF1,total,42847
1,Abruzzo,42.351222,13.398438,2018,13,0,ITF1,total,40038
2,Abruzzo,42.351222,13.398438,2019,13,0,ITF1,total,38381
3,Abruzzo,42.351222,13.398438,2020,13,34437,ITF1,total,34250
4,Abruzzo,42.351222,13.398438,2021,13,95670,ITF1,total,35324


In [123]:
all_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   reg_name         107 non-null    object 
 1   lat              107 non-null    float64
 2   lon              107 non-null    float64
 3   year             107 non-null    int64  
 4   reg_code         107 non-null    object 
 5   total_cases      107 non-null    int64  
 6   territory_code   107 non-null    object 
 7   crime_type       107 non-null    object 
 8   number_of_crime  107 non-null    int64  
dtypes: float64(2), int64(3), object(4)
memory usage: 7.7+ KB


In [121]:
all_data_2 = all_data_1.merge(pop_totals, on=['year', 'reg_name'], how='left')
all_data_2.iloc[60 : 100]

Unnamed: 0,reg_name,lat,lon,year,reg_code,total_cases,territory_code,crime_type,number_of_crime,gender,age_class,quarter,unemployment_rate,population,unemp_pop
60,Puglia,41.125596,16.867367,2017,16,0,ITF4,total,146543,total,15-64,4,15.140318,4024067,609256
61,Puglia,41.125596,16.867367,2018,16,0,ITF4,total,143374,total,15-64,4,16.093097,4000966,643879
62,Puglia,41.125596,16.867367,2019,16,0,ITF4,total,134618,total,15-64,4,14.805607,3975528,588601
63,Puglia,41.125596,16.867367,2020,16,85674,ITF4,total,119851,total,15-64,4,15.647998,3953305,618613
64,Puglia,41.125596,16.867367,2021,16,290992,ITF4,total,125146,total,15-64,4,14.014571,3933777,551301
65,Sardegna,39.215312,9.110616,2017,20,0,ITG2,total,46371,total,15-64,4,15.32641,1636839,250868
66,Sardegna,39.215312,9.110616,2018,20,0,ITG2,total,44703,total,15-64,4,17.443104,1631040,284504
67,Sardegna,39.215312,9.110616,2019,20,0,ITG2,total,45032,total,15-64,4,15.807805,1622257,256443
68,Sardegna,39.215312,9.110616,2020,20,29876,ITG2,total,40258,total,15-64,4,15.952509,1611621,257093
69,Sardegna,39.215312,9.110616,2021,20,83886,ITG2,total,42919,total,15-64,4,12.102223,1590044,192430


In [122]:
all_data_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 107 entries, 0 to 106
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   reg_name           107 non-null    object 
 1   lat                107 non-null    float64
 2   lon                107 non-null    float64
 3   year               107 non-null    int64  
 4   reg_code           107 non-null    object 
 5   total_cases        107 non-null    int64  
 6   territory_code     107 non-null    object 
 7   crime_type         107 non-null    object 
 8   number_of_crime    107 non-null    int64  
 9   gender             107 non-null    object 
 10  age_class          107 non-null    object 
 11  quarter            107 non-null    int64  
 12  unemployment_rate  107 non-null    float64
 13  population         107 non-null    int32  
 14  unemp_pop          107 non-null    int32  
dtypes: float64(3), int32(2), int64(4), object(6)
memory usage: 11.8+ KB


In [246]:
# Filter the data to the date range used for the Plotly animation
date1 = datetime(2020,2,24)
date2 = datetime(2021,12,31)
covid_reg_dates = covid_reg_mon[(covid_reg_mon['date'] >= date1) & (covid_reg_mon['date'] <= date2)]

NameError: name 'covid_reg_mon' is not defined
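
Since `covid_reg_mon` is not defined at this point, the date-range filter can be illustrated on a small stand-in frame. The `demo` dataframe below is hypothetical; `Series.between` is used here as an equivalent of the two combined comparisons and includes both end dates by default.

```python
from datetime import datetime

import pandas as pd

# Hypothetical frame with a datetime 'date' column
demo = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-15", "2020-02-24", "2021-12-31", "2022-01-01"]),
    "total_cases": [0, 1, 2, 3],
})

date1 = datetime(2020, 2, 24)
date2 = datetime(2021, 12, 31)

# Keep rows whose date falls within [date1, date2]
in_range = demo[demo["date"].between(date1, date2)]
print(in_range)
```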

In [None]:
fig = px.scatter_mapbox(
    covid_reg_dates, lat='lat', lon='lon', color='total_cases',
    hover_name='reg_name', hover_data=['month', 'year', 'total_cases'],
    zoom=5, height=850, width=1000,
    animation_frame='date', size='total_cases', size_max=100,
    animation_group='reg_name',
    labels={'total_cases': 'Covid-19 Cases',  'date': 'Covid-19 Cases by Week - 2020 to 2021 '},
)

# You can adjust the map view
fig.update_layout(
    mapbox_style='open-street-map',
    margin={'r':0, 't':35, 'l':15, 'b':10}, 
    mapbox=dict(pitch=60, bearing=30), # Cool map view, set if required
    title='Covid-19 Cases in Italy by Region',
)

# Set Map Centre
fig.update_mapboxes(
    center_lat=42.000,
    center_lon=12.706
)

fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 100  # Set the frame duration (adjust as needed)
fig.layout.updatemenus[0].buttons[0].args[1]['transition']['duration'] = 400 # Set the transition duration (adjust as needed)

fig.write_html('scatter_mapbox.html')

fig.show()

In [None]:
df_geo.head(1)

In [None]:
fig = px.choropleth_mapbox(df_geo_cases, geojson=geojson_reg, featureidkey='properties.reg_istat_code',
                        locations='properties.reg_istat_code', color='total_cases',
                        color_discrete_sequence=px.colors.qualitative.Dark24, center={'lon': 12.5, 'lat': 41.9},
                        mapbox_style='carto-positron', zoom=4.5, opacity=0.8)
fig.update_layout(margin={'r':0,'t':0,'l':0,'b':0})
fig.show()

In [None]:
# Create the choropleth map
fig = px.choropleth_mapbox(df_geo_cases, geojson=geojson_reg, featureidkey='properties.reg_istat_code',
                        locations='properties.reg_istat_code', color='total_cases',
                        color_discrete_sequence=px.colors.qualitative.Dark24, center={'lon': 12.5, 'lat': 41.9},
                        mapbox_style='open-street-map', zoom=4.5, opacity=0.8)

# Create the scatter map
scatter_trace = px.scatter_mapbox(covid_reg, lat='lat', lon='lon', 
                                  hover_name='reg_name', hover_data=['month', 'year', 'total_cases'],
                                  color_discrete_sequence=['red'], zoom=5, height=750, width=1000,
                                  animation_frame='date', size='size', size_max=100,
                                  animation_group='total_cases').data[0]

# Add scatter map trace to the choropleth map
fig.add_trace(scatter_trace)

# Adjust the map view
fig.update_layout(
    mapbox_style='open-street-map',
    margin={'r':0, 't':0, 'l':0, 'b':0}, 
    mapbox=dict(pitch=60, bearing=30)
    )

fig.update_traces(marker=dict(sizemin=10))  # Set the minimum marker size

fig.show()


In [None]:
# Modify the size column in the DataFrame
covid_reg['size'] = covid_reg['total_cases']

fig = px.scatter_mapbox(covid_reg, lat='lat', lon='lon', 
                        hover_name='reg_name', hover_data=['month', 'year', 'total_cases'],
                        color_discrete_sequence=['red'], zoom=5, height=750, width=1000,
                        animation_frame='date', size='size', size_max=100,
                        animation_group='total_cases')

# You can adjust the map view
fig.update_layout(
    mapbox_style='open-street-map',
    margin={'r':0, 't':0, 'l':0, 'b':0}, 
    mapbox=dict(pitch=60, bearing=30)
    )

fig.update_traces(marker=dict(sizemin=10))  # Set the minimum marker size

fig.update_geos(projection_type='equirectangular', visible=True, resolution=110)

# fig.layout.updatemenus[0].buttons[0].args[1]['frame']['duration'] = 100  # Set the frame duration (adjust as needed)

fig.write_html('scatter_mapbox.html')

fig.show()

print(fig.layout)
print(fig.data)
print(fig.frames)

In [None]:
start_size = 5  # Define the start size for the markers

# Modify the size column in the DataFrame
covid_reg['size'] = covid_reg['total_cases'] + start_size

fig = px.scatter_mapbox(covid_reg, lat="lat", lon="lon", 
                        hover_name="date", hover_data=["month", "year", "total_cases"],
                        color_discrete_sequence=["red"], zoom=5, height=750, width=700,
                        animation_frame="day_of_week", size="size", size_max=10,
                        animation_group="total_cases", range_color=[covid_reg['total_cases'].min(), covid_reg['total_cases'].max()])

fig.update_layout(mapbox_style="open-street-map")
# You can adjust the map view
fig.update_layout(margin={"r":0, "t":0, "l":0, "b":0}, 
                  mapbox=dict(pitch=60, bearing=30))

fig.update_traces(marker=dict(sizemin=1))  # Set the minimum marker size

fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 100  # Set the frame duration (adjust as needed)

fig.show()


In [None]:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import numpy as np
from matplotlib.animation import FuncAnimation

fig = plt.figure(figsize=(10, 6))
ax = plt.axes(projection=ccrs.PlateCarree())

# Set up map features
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(cfeature.BORDERS)
ax.add_feature(cfeature.STATES)
ax.add_feature(cfeature.LAND, color='lightgray')

# Retrieve latitude and longitude coordinates from the dataframe
latitudes = covid_reg['lat'].values
longitudes = covid_reg['lon'].values

# Calculate the range of latitudes and longitudes
min_lat, max_lat = latitudes.min(), latitudes.max()
min_lon, max_lon = longitudes.min(), longitudes.max()

# Create a grid for the map
x = np.linspace(min_lon, max_lon, 360)
y = np.linspace(min_lat, max_lat, 180)
X, Y = np.meshgrid(x, y)

def update_plot(index):
    ax.clear()

    # Get the unique combinations of reg_name and year
    unique_combinations = covid_reg[['reg_name', 'year']].drop_duplicates()

    # Filter data for the specific combination of reg_name and year
    reg_name = unique_combinations['reg_name'].iloc[index]
    year = unique_combinations['year'].iloc[index]
    filtered_df = covid_reg[(covid_reg['reg_name'] == reg_name) & (covid_reg['year'] == year)]

    # Calculate total cases sum
    total_cases_sum = filtered_df['total_cases'].sum()

    # Create a grid of total cases for each point on the map
    cases_grid = np.zeros_like(X)
    latitudes = filtered_df['lat'].values
    longitudes = filtered_df['lon'].values
    for lat, lon in zip(latitudes, longitudes):
        lat_index = int((lat - min_lat) / (max_lat - min_lat) * 179)
        lon_index = int((lon - min_lon) / (max_lon - min_lon) * 359)
        cases_grid[lat_index, lon_index] = total_cases_sum

    # Plotting the map with total cases data
    ax.imshow(cases_grid, extent=[min_lon, max_lon, min_lat, max_lat], origin='lower', cmap='Reds', vmin=0, vmax=cases_grid.max())

    # Setting plot title
    ax.set_title(f'Total Cases - Year {year}')

animation = FuncAnimation(fig, update_plot, frames=len(covid_reg['year'].unique()), interval=1000)

# Assign the animation to a variable
anim = animation

# Display the animation
plt.show()



In [None]:
plot_time_series(covid_reg)

In [None]:
def update_plot(frame):
    ax.clear()
    
    # Get the unique combinations of reg_name and year
    unique_combinations = covid_reg[['reg_name', 'year']].drop_duplicates()

    # Filter data for the specific combination of reg_name and year
    reg_name = unique_combinations['reg_name'].iloc[frame]
    year = unique_combinations['year'].iloc[frame]
    filtered_df = covid_reg[(covid_reg['reg_name'] == reg_name) & (covid_reg['year'] == year)]

    # Calculate total cases sum
    total_cases_sum = filtered_df['total_cases'].sum()

    # Plotting
    plt.bar(reg_name, total_cases_sum)
    plt.xlabel('Region')
    plt.ylabel('Total Cases')
    plt.title(f'Total Cases - Year {year}')
    plt.xticks(rotation=45)

fig, ax = plt.subplots()
animation = FuncAnimation(fig, update_plot, frames=len(covid_reg['year'].unique()), interval=1000)

plt.show()


In [None]:
covid_reg.describe()

In [None]:
covid_reg.groupby('reg_name', as_index=False).agg({'total_cases': 'max'})

In [None]:
# set style and figure size
sns.set(style='whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# plot bar chart and set title and axis labels
sns.barplot(data=covid_reg, x='reg_name', y='total_cases', ax=ax) # Change colors using palette - 'palette='crest''
plt.title('Covid 19 Cases by Region', fontsize=16)
ax.set_xlabel('Italian Regions')
ax.set_ylabel('Covid Cases')
plt.xticks(rotation=45, ha='right') # ha = horizontal alignment

plt.show()

In [None]:
# set style and figure size
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# plot bar chart and set title and axis labels
sns.barplot(data=covid_reg, x='reg_name', y='total_cases', ax=ax)
plt.title('Covid 19 Cases by Region', fontsize=16)
ax.set_xlabel('Italian Regions')
ax.set_ylabel('Covid Cases')
plt.xticks(rotation=45, ha='right') # ha = horizontal alignment

# set y-axis ticks every 500,000 cases and format them with thousands separators
y_ticks = np.arange(0, covid_reg['total_cases'].max()+1, 500000)
ax.set_yticks(y_ticks)
ax.set_yticklabels(['{:,.0f}'.format(y) for y in y_ticks])

# set color of each bar based on its height
max_cases = covid_reg['total_cases'].max()
start_color = '#9FC5E8' # light shade of blue
end_color = '#0B5394' # dark shade of blue
color_map = colors.LinearSegmentedColormap.from_list('custom', [start_color, end_color], N=max_cases)

# loop through each bar
for patch in ax.patches:
    value = patch.get_height()
    # set color based on the normalized height (between 0 and 1)
    color = color_map(value / max_cases)
    patch.set_facecolor(color)

# add color bar
cbar_ax = fig.add_axes([0.92, 0.2, 0.02, 0.6])
cbar = fig.colorbar(cm.ScalarMappable(norm=plt.Normalize(vmin=0, vmax=max_cases), cmap=color_map), cax=cbar_ax)
cbar.set_label('Covid Cases')

plt.show()

In [None]:
# Create a subplot with 1 row and 2 columns
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=("Covid 19 Cases by Region", "Region with Highest Cases"))

# Add the bar chart to the first subplot
fig.add_trace(go.Bar(x=df2['reg_name'], y=df2['total_cases'], name='Covid Cases'), row=1, col=1)
fig.update_xaxes(title_text='Italian Regions', tickangle=-45, row=1, col=1)
fig.update_yaxes(title_text='Covid Cases', tickformat=',.0f', row=1, col=1)

# Find the region with the highest total_cases
region_highest_cases = df2.loc[df2['total_cases'].idxmax(), 'reg_name']

# Add the choropleth map to the second subplot
fig.add_trace(px.choropleth_mapbox(df2, geojson=regions, locations='reg_name', color='total_cases',
                                    color_continuous_scale='Blues', mapbox_style='carto-positron',
                                    hover_name='reg_name', hover_data={'total_cases': ':,'},
                                    title=f'Region with Highest Cases: {region_highest_cases}').data[0], row=1, col=2)

fig.update_layout(height=600, showlegend=False)

fig.show()

In [None]:
# add data, set style, and figure size
fig = px.bar(df2, x='reg_name', y='total_cases', title='Covid 19 Cases by Region',
            # set labels names
            labels={'reg_name': 'Italian Regions', 'total_cases': 'Covid Cases'})
fig.update_layout(
    xaxis_tickangle=-45,
    yaxis_tickformat=',.0f',
    ) # set x-axis tick angle and y-axis tick format
fig.show()

In [None]:
regions_df.head(1)

In [None]:
df2.head(1)

In [None]:
df3 = df2.rename(columns={'reg_name': 'denominazione_regione'})
df3

In [None]:
# Find the region with the highest total_cases
region_highest_cases = df3.loc[df3['total_cases'].idxmax(), 'denominazione_regione']

# Create the bar chart
fig = px.bar(df3, x='denominazione_regione', y='total_cases', title='Covid 19 Cases by Region',
            labels={'denominazione_regione': 'Italian Regions', 'total_cases': 'Covid Cases'})
fig.update_layout(xaxis_tickangle=-45, yaxis_tickformat=',.0f')

# Create the choropleth map trace
map_trace = go.Choroplethmapbox(
    geojson=regions,  # Replace with the correct GeoJSON data for Italian regions
    locations=df3['denominazione_regione'],
    z=df3['total_cases'],
    colorscale='Blues',
    zmin=0,
    zmax=df3['total_cases'].max(),
    featureidkey='properties.denominazione_regione',  # Specify the property key in the GeoJSON data
    marker_opacity=0.7,
    hovertemplate='<b>%{location}</b><br>Total Cases: %{z:,.0f}',
    colorbar=dict(title='Covid Cases')
)

# Set the layout for the figure
layout = go.Layout(
    title=f'Region with Highest Cases: {region_highest_cases}',
    mapbox=dict(
        center=dict(lat=42.5, lon=12.5),
        zoom=4.5,
        style='carto-positron'
    ),
    height=500
)

# Create the figure and add the map trace
fig_map = go.Figure(data=map_trace, layout=layout)

# Display the figure
fig_map.show()
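
A choropleth trace only colors a feature when a value in `locations` matches the GeoJSON property named by `featureidkey`; names that do not match are typically just left undrawn rather than raising an error. The sketch below checks for such mismatches before plotting. The tiny GeoJSON and dataframe are hypothetical stand-ins for `regions` and `df3`:

```python
import pandas as pd

# Hypothetical stand-ins for the real GeoJSON (`regions`) and dataframe (`df3`)
demo_geojson = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'denominazione_regione': 'Lazio'}, 'geometry': None},
        {'type': 'Feature', 'properties': {'denominazione_regione': 'Lombardia'}, 'geometry': None},
    ],
}
demo_df = pd.DataFrame({
    'denominazione_regione': ['Lazio', 'Lombardia', 'Veneto'],
    'total_cases': [10, 30, 20],
})

# Names available in the GeoJSON under the property that featureidkey points to
geo_names = {f['properties']['denominazione_regione'] for f in demo_geojson['features']}

# Region names in the dataframe that have no matching GeoJSON feature
unmatched = sorted(set(demo_df['denominazione_regione']) - geo_names)
print(unmatched)
```

An empty list means every row can be matched to a shape; anything else points to spelling or naming differences between the two sources.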

In [None]:
# create the bar chart from df2
fig = px.bar(df2, x='reg_name', y='total_cases', title='Covid 19 Cases by Region',
            # color the bars by value using a continuous scale (light to dark blue)
            color='total_cases', color_continuous_scale=['#9FC5E8', '#0b5394'],
            # set labels names
            labels={'reg_name': 'Italian Regions', 'total_cases': 'Covid Cases'},
            # format total_cases in the hover text
            hover_data={'total_cases': ':.2f'})

# set marker line width and color
fig.update_traces(marker=dict(line=dict(width=1, color='Gray')))
# set x-axis tick angle and y-axis tick format
fig.update_layout(
    xaxis_tickangle=-45,
    yaxis_tickformat=',.0f',
    )
fig.show()

In [None]:
sns.boxplot(data=df2, x='total_cases')
plt.title('Boxplot of Total Cases')

#### Convert date column to `datetime` type

In [None]:
regions_df2['date'] = pd.to_datetime(regions_df2['date']).dt.date

Let's create two new columns, **month** and **year**.

In [None]:
regions_df2['year'] = pd.to_datetime(regions_df2['date']).dt.year
regions_df2['month'] = pd.to_datetime(regions_df2['date']).dt.month
regions_df2
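
It may help to see what the date handling does to the column types. In the sketch below (sample rows and timestamp format are assumptions), `pd.to_datetime` yields a datetime column from which `year` and `month` can be read, while `.dt.date` turns the values into plain Python `date` objects stored with `object` dtype:

```python
import pandas as pd

# Hypothetical sample rows; the timestamp format is an assumption
sample = pd.DataFrame({
    'date': ['2023-01-05T17:00:00', '2023-02-10T17:00:00'],
    'total_cases': [100, 250],
})

parsed = pd.to_datetime(sample['date'])   # datetime Series with the .dt accessor
sample['year'] = parsed.dt.year
sample['month'] = parsed.dt.month

# .dt.date drops the time of day but stores Python date objects (object dtype)
sample['date'] = parsed.dt.date

print(sample.dtypes)
```

Because the `date` column is no longer a datetime dtype after `.dt.date`, later code has to call `pd.to_datetime` on it again before using `.dt`.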

In [None]:
# set style and figure size
sns.set(style='whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# plot bar chart and set title and axis labels
sns.barplot(data=regions_df2, x='reg_name', y='total_cases', ax=ax) # to change colors, pass palette='crest'
plt.title('Covid 19 Cases by Region', fontsize=16)
ax.set_xlabel('Italian Regions')
ax.set_ylabel('Covid Cases')
plt.xticks(rotation=45, ha='right') # ha = horizontal alignment

plt.show()

In [None]:
# set style and figure size
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# plot bar chart and set title and axis labels
sns.barplot(data=regions_df2, x='reg_name', y='total_cases', ax=ax)
plt.title('Covid 19 Cases by Region', fontsize=16)
ax.set_xlabel('Italian Regions')
ax.set_ylabel('Covid Cases')
plt.xticks(rotation=45, ha='right') # ha = horizontal alignment

# set y-axis ticks every 500,000 cases and format them with thousands separators
y_ticks = np.arange(0, regions_df2['total_cases'].max()+1, 500000)
ax.set_yticks(y_ticks)
ax.set_yticklabels(['{:,.0f}'.format(y) for y in y_ticks])

# set color of each bar based on its height
max_cases = regions_df2['total_cases'].max()
start_color = '#9FC5E8' # light shade of blue
end_color = '#0B5394' # dark shade of blue
color_map = colors.LinearSegmentedColormap.from_list('custom', [start_color, end_color], N=256)  # N = number of color levels, not the data maximum

# loop through each bar
for patch in ax.patches:
    value = patch.get_height()
    # set color based on the normalized height (between 0 and 1)
    color = color_map(value / max_cases)
    patch.set_facecolor(color)

# add color bar
cbar_ax = fig.add_axes([0.92, 0.2, 0.02, 0.6])
cbar = fig.colorbar(cm.ScalarMappable(norm=plt.Normalize(vmin=0, vmax=max_cases), cmap=color_map), cax=cbar_ax)
cbar.set_label('Covid Cases')

plt.show()
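
The bar coloring above boils down to two steps: normalize a value to the range 0 to 1, then ask the colormap for the color at that position. A minimal sketch of that mapping, assuming a maximum of 1000 cases and 256 color levels (both values are illustrative):

```python
from matplotlib import colors

start_color = '#9FC5E8'  # light shade of blue
end_color = '#0B5394'    # dark shade of blue

# N is the number of discrete color levels in the map
cmap = colors.LinearSegmentedColormap.from_list('custom', [start_color, end_color], N=256)

# Normalize maps data values linearly onto [0, 1]
norm = colors.Normalize(vmin=0, vmax=1000)

lowest = cmap(norm(0))      # RGBA tuple at the light end
highest = cmap(norm(1000))  # RGBA tuple at the dark end
print(colors.to_hex(lowest), colors.to_hex(highest))
```

Passing the same `norm` and `cmap` to `ScalarMappable` keeps the color bar consistent with the bar colors.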

In [None]:
fig = px.bar(regions_df2, x='reg_name', y='total_cases', title='Covid 19 Cases by Region',
            labels={'reg_name': 'Italian Regions', 'total_cases': 'Covid Cases'}) # set labels
fig.update_layout(
    xaxis_tickangle=-45,
    yaxis_tickformat=',.0f',
    yaxis_range=[0, regions_df2['total_cases'].max()+500000]
    ) # set x-axis tick angle and y-axis tick format
fig.show()

In [None]:
import plotly.express as px

fig = px.bar(regions_df2, x='reg_name', y='total_cases', title='Covid 19 Cases by Region',
            color='total_cases', color_continuous_scale=['#9FC5E8', '#0b5394'], labels={'reg_name': 'Italian Regions', 'total_cases': 'Covid Cases'},
            hover_data={'total_cases': ':.2f'}) # set labels and hover data

fig.update_traces(marker=dict(line=dict(width=1, color='Gray'))) # set marker line width and color
fig.update_layout(
    xaxis_tickangle=-45,
    yaxis_tickformat=',.0f',
    yaxis_range=[0, regions_df2['total_cases'].max()+500000]
    ) # set x-axis tick angle and y-axis tick format
fig.show()

In [None]:
sns.boxplot(data=regions_df2, x='total_cases')
plt.title('Boxplot of Total Cases')

Let's take a look at the **Province** data.

In [None]:
with open('../../data/Covid/dpc-covid19-ita-province-latest.json') as response:
    provinces = json.load(response)

provinces

Create a dataframe from the `json` file.

In [None]:
provs_df = pd.DataFrame(provinces)

In [None]:
provs_df.head()

Translate the column names to English.

In [None]:
provs_df = provs_df.rename(columns={
    'data': 'date', 'stato': 'state', 'codice_regione': 'reg_code', 'denominazione_regione': 'reg_name',
    'codice_provincia': 'prov_code', 'denominazione_provincia': 'prov_name', 'sigla_provincia': 'prov_abr',
    'totale_casi': 'total_cases', 'note': 'notes', 'codice_nuts_1': 'nuts_1_code',
    'codice_nuts_2': 'nuts_2_code', 'codice_nuts_3': 'nuts_3_code'
    })
provs_df.head()
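
The load, convert, and rename steps can be tried on a small sample first. The sketch below uses a hypothetical two-record JSON string in place of the real file. One thing worth knowing is that `DataFrame.rename` silently ignores mapping keys that are not present in the frame, so a quick check for unmatched keys can catch typos in the Italian column names:

```python
import json
import pandas as pd

# Hypothetical records standing in for the contents of the JSON file
raw = '''[
  {"data": "2023-01-05T17:00:00", "denominazione_provincia": "Roma", "totale_casi": 100},
  {"data": "2023-01-05T17:00:00", "denominazione_provincia": "Milano", "totale_casi": 90}
]'''
records = json.loads(raw)          # list of dicts, one per row
demo_df = pd.DataFrame(records)    # dict keys become column names

mapping = {'data': 'date', 'denominazione_provincia': 'prov_name',
           'totale_casi': 'total_cases', 'sigla_provincia': 'prov_abr'}
renamed = demo_df.rename(columns=mapping)

# Keys in the mapping that matched no column (rename skips these without error)
not_found = [old for old in mapping if old not in demo_df.columns]
print(list(renamed.columns), not_found)
```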

Create a new dataframe with only the required columns (everything except `prov_abr` and `notes`):

- date
- state
- reg_code
- reg_name
- prov_code
- prov_name
- lat
- lon
- total_cases
- nuts_1_code
- nuts_2_code
- nuts_3_code

In [None]:
provs_df2 = provs_df.drop(columns=[
    'prov_abr', 'notes',
    ])
provs_df2.head()

In [None]:
provs_covid = provs_df2.drop_duplicates(subset=['prov_code','prov_name'])
provs_covid.to_csv('../../data/Covid/covid-provinces.csv', index=False)
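
`drop_duplicates` with a `subset` compares only the listed columns and, by default, keeps the first row it encounters for each key. A small sketch with made-up rows shows the effect, along with the `keep='last'` variant:

```python
import pandas as pd

# Hypothetical rows: the same province appears twice with different totals
demo = pd.DataFrame({
    'prov_code': [58, 58, 15],
    'prov_name': ['Roma', 'Roma', 'Milano'],
    'total_cases': [100, 120, 90],
})

# Default keep='first': the earliest row per (prov_code, prov_name) survives
first_rows = demo.drop_duplicates(subset=['prov_code', 'prov_name'])

# keep='last': the latest row per key survives instead
last_rows = demo.drop_duplicates(subset=['prov_code', 'prov_name'], keep='last')

print(first_rows['total_cases'].tolist(), last_rows['total_cases'].tolist())
```

If the source ever holds several dates per province, sorting by date before deduplicating makes it explicit which snapshot ends up in the CSV.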

In [None]:
provs_df2.info()

In [None]:
provs_df2.describe()

In [None]:
provs_df2['date'] = pd.to_datetime(provs_df2['date']).dt.date

In [None]:
# set style and figure size
sns.set(style='whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# select top 20 provinces by total cases and plot bar chart with updated color palette
top_provs = provs_df2.nlargest(20, 'total_cases')
sns.barplot(data=top_provs, x='prov_name', y='total_cases', ax=ax, palette='crest')

# set title and axis labels
plt.title('Covid 19 Cases by Province', fontsize=16)
ax.set_xlabel('Italian Provinces')
ax.set_ylabel('Covid Cases')
plt.xticks(rotation=45, ha='right') # ha = horizontal alignment

plt.show()
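
The top-20 selection uses `nlargest`, which returns the rows with the largest values in descending order and, with the default `keep='first'`, resolves ties by order of appearance. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical data with a tie at the top
demo = pd.DataFrame({
    'prov_name': ['A', 'B', 'C', 'D'],
    'total_cases': [5, 50, 20, 50],
})

# Two largest rows by total_cases; ties keep the earlier row first
top2 = demo.nlargest(2, 'total_cases')
print(top2['prov_name'].tolist())
```

Note that `nlargest` works on rows, not on unique provinces: if a frame contained several rows per province, the same name could appear more than once in the result.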

In [None]:
# set style and figure size
sns.set_style('whitegrid')
fig, ax = plt.subplots(figsize=(10, 6))

# plot bar chart and set title and axis labels
sns.barplot(data=top_provs, x='prov_name', y='total_cases', ax=ax)
plt.title('Covid 19 Cases by Province', fontsize=16)
ax.set_xlabel('Italian Provinces')
ax.set_ylabel('Covid Cases')
plt.xticks(rotation=45, ha='right') # ha = horizontal alignment

# set y-axis ticks every 500,000 cases and format them with thousands separators
y_ticks = np.arange(0, top_provs['total_cases'].max()+1, 500000)
ax.set_yticks(y_ticks)
ax.set_yticklabels(['{:,.0f}'.format(y) for y in y_ticks])

# set color of each bar based on its height
max_cases = top_provs['total_cases'].max()
start_color = '#9FC5E8' # light shade of blue
end_color = '#0B5394' # dark shade of blue
color_map = colors.LinearSegmentedColormap.from_list('custom', [start_color, end_color], N=256)  # N = number of color levels, not the data maximum

# loop through each bar
for patch in ax.patches:
    value = patch.get_height()
    # set color based on the normalized height (between 0 and 1)
    color = color_map(value / max_cases)
    patch.set_facecolor(color)

# add color bar
cbar_ax = fig.add_axes([0.92, 0.2, 0.02, 0.6])
cbar = fig.colorbar(cm.ScalarMappable(norm=plt.Normalize(vmin=0, vmax=max_cases), cmap=color_map), cax=cbar_ax)
cbar.set_label('Covid Cases')

plt.show()

In [None]:
fig = px.bar(top_provs, x='prov_name', y='total_cases', title='Covid 19 Cases by Province',
            labels={'prov_name': 'Italian Provinces', 'total_cases': 'Covid Cases'}) # set labels
fig.update_layout(
    xaxis_tickangle=-45,
    yaxis_tickformat=',.0f',
    yaxis_range=[0, top_provs['total_cases'].max()+500000]
    ) # set x-axis tick angle and y-axis tick format
fig.show()

In [None]:
import plotly.express as px

fig = px.bar(top_provs, x='prov_name', y='total_cases', title='Covid 19 Cases by Province',
            color='total_cases', color_continuous_scale=['#9FC5E8', '#0b5394'], labels={'prov_name': 'Italian Provinces', 'total_cases': 'Covid Cases'},
            hover_data={'total_cases': ':.2f'}) # set labels and hover data

fig.update_traces(marker=dict(line=dict(width=1, color='Gray'))) # set marker line width and color
fig.update_layout(
    xaxis_tickangle=-45,
    yaxis_tickformat=',.0f',
    yaxis_range=[0, top_provs['total_cases'].max()+500000]
    ) # set x-axis tick angle and y-axis tick format
fig.show()

In [None]:
sns.boxplot(data=provs_df2, x='total_cases')
plt.title('Boxplot of Total Cases')