<a href="https://colab.research.google.com/github/EphiWalker/italy_bes/blob/main/italy_bes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Comparing BES score of Tuscany with the rest of Italy

I love Florence, Italy and its surrounding region Tuscany. I've always wanted to go there to visit and perhaps live a year or two. But, I wonder, could there be other parts of Italy I should visit or live in rather than Florence?

##Italy's BES System
BES (Benessere Equo e Sostenibile, "Equitable and Sustainable Well-being") is a system Italy uses to measure the well-being and sustainability of its society across genders, age groups and regions, using 100+ indicators grouped under 12 fundamental domains such as Social Relationships or Work-Life Balance.  
  
The dataset I'll be using is compiled by the Istituto Nazionale di Statistica (Istat, the Italian National Institute of Statistics) and can be found [here](https://www.istat.it/en/well-being-and-sustainability/the-measurement-of-well-being/indicators).

##Table of Contents
* Questions
* Data Cleaning & Manipulation
* Exploratory Data Analysis (EDA)
* Conclusions

##Questions I'd Like Answered
1. How safe is Tuscany compared to other regions in Italy?
2. What are some of the best historically significant regions in Italy?
3. Is Tuscany an environmentally safe place to live?
4. What region is best suited to my field of work?
5. In case of emergencies, which parts of Italy have better healthcare systems?
6. Which part of Italy has people with the strongest reading habits? I'd love to know this because I believe well-read people have very interesting ideas to share and I'd love to befriend some well-read people in Italy. Also, it's nice to have a friend that shares my reading habits.
7. How socially active are the people of Tuscany compared to other parts of Italy? I'd love to live among a socially vibrant community.
8. How is social trust distributed across Italy's regions?
9. How trustworthy are the Police in Tuscany compared to other parts of Italy?
10. Based on my investigations, what are the top 3 regions I'd like to visit? What about the top 3 regions I'd like to live in for a few years?

##Data Cleaning & Manipulation
In this section, I'll go through several steps, from preparing the environment to cleaning and reshaping the data for use.

###Preparing the Environment

I'll start by loading the data from my own GitHub repo, where I keep the datasets I use in my projects.

In [259]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

italy_bes = pd.read_excel('https://github.com/EphiWalker/datasets/raw/main/Indicators_region_gender.xlsx',
                          decimal = ',') # Italian number formatting uses a comma, not a period, as the decimal separator

###Assessing Data

Next, I'll sample the loaded dataset at several locations.

In [260]:
italy_bes.head(10) #look at the first 10 cases/rows

Unnamed: 0,DOMINIO,CODICE,INDICATORE,SESSO,TERRITORIO,UNITA_MISURA,FONTE,2004,2005,2006,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,NOTA
0,Health,01SAL001,Life expectancy at birth,Males,Piemonte,Average number of years,Istat - Life tables of Italian population,77.6,77.9,78.2,...,79.9,80.6,80.4,80.5,80.7,79.1,80.2,80.3,,2022 data are provisional
1,Health,01SAL001,Life expectancy at birth,Males,Valle d'Aosta/Vallée d'Aoste,Average number of years,Istat - Life tables of Italian population,76.9,76.4,77.7,...,78.7,79.4,79.8,79.1,79.9,78.4,80.3,80.3,,2022 data are provisional
2,Health,01SAL001,Life expectancy at birth,Males,Liguria,Average number of years,Istat - Life tables of Italian population,78.0,77.9,78.5,...,79.9,80.6,80.5,80.4,80.8,79.3,80.6,80.4,,2022 data are provisional
3,Health,01SAL001,Life expectancy at birth,Males,Lombardia,Average number of years,Istat - Life tables of Italian population,77.9,78.1,78.4,...,80.6,81.1,81.2,81.3,81.5,79.0,80.9,81.1,,2022 data are provisional
4,Health,01SAL001,Life expectancy at birth,Males,Trentino-Alto Adige/Südtirol,Average number of years,Istat - Life tables of Italian population,78.0,78.5,78.8,...,81.0,81.2,81.5,82.0,81.9,80.5,81.4,81.5,,2022 data are provisional
5,Health,01SAL001,Life expectancy at birth,Males,Provincia Autonoma di Bolzano/Bozen,Average number of years,Istat - Life tables of Italian population,78.0,78.5,78.9,...,80.8,81.2,81.4,81.8,81.8,80.7,81.3,81.1,,2022 data are provisional
6,Health,01SAL001,Life expectancy at birth,Males,Provincia Autonoma di Trento,Average number of years,Istat - Life tables of Italian population,78.0,78.5,78.6,...,81.2,81.4,81.7,82.2,82.1,80.5,81.5,81.9,,2022 data are provisional
7,Health,01SAL001,Life expectancy at birth,Males,Veneto,Average number of years,Istat - Life tables of Italian population,78.2,78.4,78.8,...,80.6,81.0,81.2,81.4,81.7,80.7,81.1,81.2,,2022 data are provisional
8,Health,01SAL001,Life expectancy at birth,Males,Friuli-Venezia Giulia,Average number of years,Istat - Life tables of Italian population,77.8,77.8,78.1,...,79.9,80.4,80.6,80.8,81.3,80.3,79.9,80.3,,2022 data are provisional
9,Health,01SAL001,Life expectancy at birth,Males,Emilia-Romagna,Average number of years,Istat - Life tables of Italian population,78.5,78.8,79.1,...,80.9,81.2,81.2,81.5,81.7,80.3,80.9,81.2,,2022 data are provisional


In [261]:
italy_bes.sample(10, random_state = 888) #randomly sample 10 cases

Unnamed: 0,DOMINIO,CODICE,INDICATORE,SESSO,TERRITORIO,UNITA_MISURA,FONTE,2004,2005,2006,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,NOTA
1429,Education and training,02IST002-N22,People with at least upper secondary education...,Total,Puglia,Percentage values,Istat - Labour force survey,,,,...,,,,50.3,51.4,51.4,51.7,52.5,,
5075,Politics and Institutions,06POL005,Trust in police and fire brigade,Females,Centre,Mean score,Istat - Survey on Aspects of daily life,,,,...,7.0,7.3,7.4,7.3,7.5,7.5,7.5,7.5,,
4827,Politics and Institutions,06POL002,Trust in the parliament,Total,Puglia,Mean score,Istat - Survey on Aspects of daily life,,,,...,3.7,3.8,3.4,3.9,4.9,4.7,4.7,4.8,,
5731,Safety,07SIC021,Social decay (or incivilities),Females,Puglia,Percentage values,Istat - Survey on Aspects of daily life,,,,...,,,16.2,9.3,8.9,7.4,5.4,6.3,,Starting from 2020 for the whole time series t...
87,Health,01SAL001,Life expectancy at birth,Total,South,Average number of years,Istat - Life tables of Italian population,80.2,80.2,80.6,...,81.5,82.1,81.8,82.3,82.5,81.8,81.5,81.7,,2022 data are provisional
6626,Environment,10AMB020,Air quality – PM2.5,Total,Marche,Percentage values,Istat - Processing of data from Ispra,,,,...,76.9,61.5,83.3,83.3,76.5,66.7,53.3,,,
3097,Work and life balance,03LAV008,Share of employed persons not in regular occup...,Total,Calabria,Percentage values,Istat - National Accounts,19.5,20.5,21.3,...,23.1,22.2,21.6,22.0,21.5,20.9,,,,
6866,Environment,10AMB014,Protected natural areas,Total,Umbria,Percentage values,Istat - Processing of data from Ministry of th...,,,,...,,17.5,17.5,,,,17.5,,,
1518,Education and training,02IST003-N22,People having completed tertiary education (30...,Total,Campania,Percentage values,Istat - Labour force survey,,,,...,,,,20.4,21.1,21.2,21.2,23.4,,
8436,Quality of services,12SER027,General practitioners with a number of patient...,Total,Islands,Percentage values,Istat - Processing of data from Ministry of He...,12.8,12.7,12.8,...,16.2,16.5,16.2,17.9,18.7,20.1,,,,


In [262]:
italy_bes.tail(8) #sample the last 8 rows

Unnamed: 0,DOMINIO,CODICE,INDICATORE,SESSO,TERRITORIO,UNITA_MISURA,FONTE,2004,2005,2006,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,NOTA
8492,Quality of services,SDG-4,Nurses and midwives,Total,North,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,6.1,6.1,6.3,6.3,6.4,6.6,6.4,,,
8493,Quality of services,SDG-4,Nurses and midwives,Total,North-west,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,5.7,5.7,6.0,5.9,6.1,6.3,6.1,,,
8494,Quality of services,SDG-4,Nurses and midwives,Total,North-east,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,6.6,6.6,6.7,6.8,6.8,7.0,6.8,,,
8495,Quality of services,SDG-4,Nurses and midwives,Total,Centre,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,5.9,6.1,6.3,6.4,6.8,7.0,7.1,,,
8496,Quality of services,SDG-4,Nurses and midwives,Total,South and islands,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,5.2,5.5,5.8,5.6,6.2,6.3,6.3,,,
8497,Quality of services,SDG-4,Nurses and midwives,Total,South,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,5.2,5.5,5.9,5.6,6.3,6.3,6.5,,,
8498,Quality of services,SDG-4,Nurses and midwives,Total,Islands,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,5.3,5.5,5.7,5.7,6.1,6.2,6.0,,,
8499,Quality of services,SDG-4,Nurses and midwives,Total,Italy,"Per 1,000 inhabitants",Co.Ge.A.P.S. (Consorzio Gestione Anagrafica Pr...,,,,...,5.7,5.9,6.1,6.1,6.4,6.6,6.5,,,


In [263]:
italy_bes.shape #looking at the number of rows and columns

(8500, 28)

In [264]:
italy_bes.columns #the name of all 28 columns

Index([     'DOMINIO',       'CODICE',   'INDICATORE',        'SESSO',
         'TERRITORIO', 'UNITA_MISURA',        'FONTE',           2004,
                 2005,           2006,           2007,           2008,
                 2009,           2010,           2011,           2012,
                 2013,           2014,           2015,           2016,
                 2017,           2018,           2019,           2020,
                 2021,           2022,           2023,         'NOTA'],
      dtype='object')

###Renaming Columns
Though the data is in English, the column labels are in Italian. For ease of use, I'll translate them into English and use snake_case.

In [265]:
italy_bes = italy_bes.rename(columns={'DOMINIO':'domain', 'CODICE': 'code', 'INDICATORE': 'indicator',
                            'SESSO': 'sex', 'TERRITORIO' : 'territory', 'UNITA_MISURA' : 'unit',
                            'FONTE' : 'source', 'NOTA': 'note'})

In [266]:
italy_bes.columns

Index([   'domain',      'code', 'indicator',       'sex', 'territory',
            'unit',    'source',        2004,        2005,        2006,
              2007,        2008,        2009,        2010,        2011,
              2012,        2013,        2014,        2015,        2016,
              2017,        2018,        2019,        2020,        2021,
              2022,        2023,      'note'],
      dtype='object')

###Changing Data Types

In [267]:
italy_bes.dtypes

domain       object
code         object
indicator    object
sex          object
territory    object
unit         object
source       object
2004         object
2005         object
2006         object
2007         object
2008         object
2009         object
2010         object
2011         object
2012         object
2013         object
2014         object
2015         object
2016         object
2017         object
2018         object
2019         object
2020         object
2021         object
2022         object
2023         object
note         object
dtype: object

pandas has read every column as 'object' (string) dtype. The year columns 2004-2023 hold measurements, so they need to be converted to numeric types.

In [268]:
cols_to_convert = list(range(2004, 2024)) #the year columns, 2004 through 2023
#set errors as 'coerce' so that values that can't be parsed become NaN
italy_bes[cols_to_convert] = italy_bes[cols_to_convert].apply(pd.to_numeric, errors = 'coerce')
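
To see what `errors='coerce'` does in isolation, here is a small sketch on made-up values (the toy series is an assumption for illustration, not taken from the BES file). Anything that can't be parsed becomes NaN, and the result is a float column:

```python
import pandas as pd

# Toy column mixing numeric strings, a placeholder and a missing value (made-up data)
raw = pd.Series(["77.6", "80.2", "n.a.", None], dtype="object")

# errors='coerce' replaces anything unparseable with NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 2
```

Without `errors='coerce'`, the same call would raise a `ValueError` on the "n.a." entry.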

In [269]:
italy_bes.dtypes #check data type

domain        object
code          object
indicator     object
sex           object
territory     object
unit          object
source        object
2004         float64
2005         float64
2006         float64
2007         float64
2008         float64
2009         float64
2010         float64
2011         float64
2012         float64
2013         float64
2014         float64
2015         float64
2016         float64
2017         float64
2018         float64
2019         float64
2020         float64
2021         float64
2022         float64
2023         float64
note          object
dtype: object

###Drop Indicators
Our data doesn't seem quite right. I'd love the values under the 'indicator' column to be actual columns instead of being listed as values. But there are 152 indicators in total; that'd mean 152 columns!
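
For reference, turning indicator values into columns is a long-to-wide reshape. Below is a minimal sketch on a toy frame, assuming `DataFrame.pivot` is an acceptable way to do it. The numbers are made up, and `toy`, `wide` and the single `value` column are illustrative names standing in for one year of data:

```python
import pandas as pd

# Toy long-format data: one row per (territory, indicator) pair; values are made up
toy = pd.DataFrame({
    "territory": ["Toscana", "Toscana", "Lazio", "Lazio"],
    "indicator": ["Life expectancy at birth", "Infant mortality rate"] * 2,
    "value": [83.0, 2.1, 82.5, 2.4],
})

# Each distinct indicator becomes its own column, indexed by territory
wide = toy.pivot(index="territory", columns="indicator", values="value")

print(wide.shape)  # (2, 2)
```

With every indicator kept, the same reshape on the full data would produce one column per indicator, which is why trimming the list first makes sense.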

In [270]:
italy_bes['indicator'].value_counts() #looking at how many unique indicators our data has

Life expectancy at birth                                                                                             90
Share of employed people aged 15-64 years working over 60 hours per week (including paid work and household work)    90
Trust in political parties                                                                                           90
Trust in judicial system                                                                                             90
Trust in the parliament                                                                                              90
                                                                                                                     ..
Per capita net wealth                                                                                                 6
Emissions of CO2 and other greenhouse gases                                                                           1
Women in decision-making bodies         

Clearly, I'll need to choose a few indicators out of the 152. Luckily, the indicators are grouped into categories (domains).

In [271]:
italy_bes['domain'].value_counts()

Health                                 1292
Education and training                 1156
Work and life balance                  1148
Environment                             771
Social relationships                    750
Politics and Institutions               658
Quality of services                     574
Innovation, research and creativity     541
Safety                                  486
Landscape and cultural heritage         450
Subjective well-being                   360
Economic well-being                     314
Name: domain, dtype: int64

The dataframe has 12 domains. I can also see how many indicators a given domain contains.
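
Note that `value_counts()` on 'domain' counts rows, not indicators. To count distinct indicators for every domain at once, `groupby` with `nunique` is one option. Here is a sketch on a toy frame (made-up rows; the column names mirror the renamed ones above):

```python
import pandas as pd

# Toy data: the Health domain has three rows but only two distinct indicators
toy = pd.DataFrame({
    "domain": ["Health", "Health", "Health", "Safety"],
    "indicator": [
        "Life expectancy at birth",
        "Life expectancy at birth",
        "Infant mortality rate",
        "Social decay (or incivilities)",
    ],
})

# Number of distinct indicator names within each domain
indicators_per_domain = toy.groupby("domain")["indicator"].nunique()

print(indicators_per_domain.to_dict())  # {'Health': 2, 'Safety': 1}
```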

In [272]:
italy_bes[italy_bes['domain'] == 'Health'].value_counts('indicator') #Look at the unique indicators in the Health domain

indicator
Adequate nutrition (standardised rates)                                                         90
Age-standardised cancer mortality rate (20-64 years old)                                        90
Age-standardised mortality rate for dementia and nervous system diseases (65 years and over)    90
Alcohol consumption (standardised rates)                                                        90
Avoidable mortality (age 0-74)                                                                  90
Healthy life expectancy at birth                                                                90
Infant mortality rate                                                                           90
Life expectancy at birth                                                                        90
Life expectancy without activity limitations at 65 years of age                                 90
Mental health index (SF36)                                                                      90


There are 15 indicators in the Health domain. Of these, the only one that matches my research topic is 'Road accidents mortality rate (15-34 years old)', as the rest are more of a concern for people who live there long term or their whole lives. I'll drop all rows in the 'Health' domain that don't match this indicator.

In [273]:
condition = (italy_bes['domain']=='Health') & (italy_bes['indicator'] != 'Road accidents mortality rate (15-34 years old)') #setting the condition for the rows I want dropped

italy_bes = italy_bes.drop(italy_bes[condition].index) #drop rows based on a set condition

In [274]:
italy_bes[italy_bes['domain']=='Health']

Unnamed: 0,domain,code,indicator,sex,territory,unit,source,2004,2005,2006,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,note
450,Health,01SAL005,Road accidents mortality rate (15-34 years old),Males,Piemonte,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,3.1,2.8,2.9,...,1.3,1.1,1.1,1.0,1.2,0.8,0.7,,,In 2018 deaths include victims from the Bridge...
451,Health,01SAL005,Road accidents mortality rate (15-34 years old),Males,Valle d'Aosta/Vallée d'Aoste,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,2.3,3.4,3.5,...,1.7,0.8,3.9,0.8,,,,,,In 2018 deaths include victims from the Bridge...
452,Health,01SAL005,Road accidents mortality rate (15-34 years old),Males,Liguria,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,2.9,1.9,2.3,...,1.3,1.1,1.3,1.9,1.4,0.7,0.7,,,In 2018 deaths include victims from the Bridge...
453,Health,01SAL005,Road accidents mortality rate (15-34 years old),Males,Lombardia,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,2.7,2.8,2.6,...,1.1,0.9,0.7,0.8,0.9,0.6,0.8,,,In 2018 deaths include victims from the Bridge...
454,Health,01SAL005,Road accidents mortality rate (15-34 years old),Males,Trentino-Alto Adige/Südtirol,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,3.6,3.3,2.4,...,1.5,1.3,0.6,0.8,1.2,0.9,0.8,,,In 2018 deaths include victims from the Bridge...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
535,Health,01SAL005,Road accidents mortality rate (15-34 years old),Total,Centre,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,2.0,1.9,1.8,...,0.8,0.7,0.8,0.7,0.6,0.5,0.6,,,In 2018 deaths include victims from the Bridge...
536,Health,01SAL005,Road accidents mortality rate (15-34 years old),Total,South and islands,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,1.4,1.4,1.4,...,0.6,0.7,0.6,0.7,0.7,0.5,0.7,,,In 2018 deaths include victims from the Bridge...
537,Health,01SAL005,Road accidents mortality rate (15-34 years old),Total,South,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,1.4,1.4,1.3,...,0.6,0.7,0.6,0.7,0.7,0.5,0.7,,,In 2018 deaths include victims from the Bridge...
538,Health,01SAL005,Road accidents mortality rate (15-34 years old),Total,Islands,"Standardised rates per 10,000 residents",Istat - For deaths: Survey on road accidents r...,1.5,1.5,1.6,...,0.7,0.6,0.7,0.7,0.8,0.5,0.8,,,In 2018 deaths include victims from the Bridge...


Great! Now, I'm down 14 indicators. I'll do this for the other 11 domains, choosing a few indicators and getting rid of the rest.
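
Repeating the drop by hand for each domain gets tedious. One way to express the whole selection in a single step is a mapping from each domain to the indicators worth keeping. The sketch below runs on a toy frame; the `keep` mapping and the `filter_indicators` helper are hypothetical names, not part of the original notebook, and the chosen indicators are only examples:

```python
import pandas as pd

def filter_indicators(df, keep):
    """Keep only rows whose (domain, indicator) pair is listed in `keep`.

    Unlike a per-domain drop, rows from domains missing in `keep` are removed too.
    """
    mask = pd.Series(False, index=df.index)
    for domain, indicators in keep.items():
        mask |= (df["domain"] == domain) & df["indicator"].isin(indicators)
    return df[mask]

# Toy data with indicator names borrowed from the outputs above
toy = pd.DataFrame({
    "domain": ["Health", "Health", "Safety", "Environment"],
    "indicator": [
        "Life expectancy at birth",
        "Road accidents mortality rate (15-34 years old)",
        "Social decay (or incivilities)",
        "Protected natural areas",
    ],
})

# Example selection only; the real choice depends on the research questions
keep = {
    "Health": ["Road accidents mortality rate (15-34 years old)"],
    "Safety": ["Social decay (or incivilities)"],
}

filtered = filter_indicators(toy, keep)
print(filtered["indicator"].tolist())
```

Because the helper drops any domain that isn't listed, every domain of interest needs an entry in `keep` before applying it to the full dataframe.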