
### DISCLAIMER: for the code to work properly, first "clear outputs of all cells", then "run all"


# Source datasets

This Jupyter Notebook analyses the source datasets and their clean-up process for ["Blessed Be the Fruit"](https://github.com/OrsolaMBorrini/blessedfruit), an open data project analysing the factors that might influence pregnancy rates in young women in Italy.


## Selection phase
Overall, **7 source datasets** have been selected, all coming from various [Istat](https://www.istat.it/) databases and platforms, namely [demo - demography in figures](https://demo.istat.it/?l=en), [IstatData](https://esploradati.istat.it/databrowser/#/) and [I.Stat](http://dati.istat.it/?lang=en).

When possible, IstatData has been preferred to I.Stat, as all the content of the latter will gradually be migrated to the former.

Id | Dataset | Description (factor of interest) | Provenience | Link / Path
--- | --- | --- | --- | --- 
D1 | Population estimates 2002-2019 by age and sex at Jan 1st | POPULATION | demo | [Link](https://demo.istat.it/app/?i=RIC&l=en)
D2 | Resident population by age, sex and marital status on 1st January 2022 | POPULATION | demo | [Link](https://demo.istat.it/app/?i=POS&l=en)
D3 | Aspects of daily life: Religious observances - regions and type of municipality | RELIGION | I.Stat | Daily life and citizen opinions > Social Activities and religious observances > Religious observances - regions and type of municipality
D4 | Mother - Age and citizenship | PREGNANCY | IstatData | [Link](https://esploradati.istat.it/databrowser/#/en/dw/categories/IT1,POP,1.0/POP_BIRTHFERT/DCIS_NATI1/DCIS_NATI1_PARENT_CHARACT/IT1,25_74_DF_DCIS_NATI1_8,1.0)
D5 | Spontaneous abortions - resignation from the place of the event: Age of women - prov. | PREGNANCY | I.Stat | Health Statistics > Women Reproductive Health > Spontaneous abortions - resignation from the place of the event > Provincial data > Age of women - prov.
D6 | Induced abortions - Migration: Events by region of residence of the woman and region of intervention | PREGNANCY | I.Stat | Health Statistics > Women Reproductive Health > Voluntary interruptions of pregnancy - characteristics of the woman > Provincial data > Age - prov. of event
D7 | Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020) | EDUCATION | I.Stat | Education and Training > Population 15 years and over by highest level of education - previous regulation (until 2020) > Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020) > Data summary

## Analysis phase

### D1/2 - Population
After closing the 2010-2011 census, Istat decided to move to a "permanent census", aiming at yearly rather than decennial surveys.
Specifically, in [2018](https://www.istat.it/it/archivio/199573) Istat launched the "Censimento permanente della popolazione e delle abitazioni" (the permanent census of population and housing).

These ["permanent censuses"](https://www.istat.it/en/censuses) involve, at each round, only representative samples of the population, businesses, and institutions. However, the published data can still be considered of "census" status, and can thus be referred to the entire field of observation. This is possible thanks to the integration of administrative sources with sample surveys, which ensures exhaustive coverage, higher quality, and a richer information offering.

All demographic data, detailed down to single municipalities, can be found on both I.Stat and **demo**. As the latter is a more specialised database (and also covers the years 2017-2019), we decided to use it instead of I.Stat for this aspect of our research.

Since the census refers to the population "up to the 1st of January" of the selected year, we did not select 2017, 2018 and 2019 (our time span of interest) directly when choosing the year variable, but 2018, 2019 and 2020, i.e. each year of interest shifted forward by one.
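The one-year shift described above can be written down as a small lookup. This is only an illustrative sketch: the helper name `year_to_select` is hypothetical and is not used elsewhere in the notebook.

```python
# Years we want to analyse (our time span of interest).
years_of_interest = [2017, 2018, 2019]


def year_to_select(year_of_interest: int) -> int:
    """Return the year to pick on the platform for a given year of interest.

    Hypothetical helper: figures are dated 1st January, so the population
    for a given year is read from the following year's entry.
    """
    return year_of_interest + 1


# Map each year of interest to the year actually selected on the platform.
selection = {year: year_to_select(year) for year in years_of_interest}
print(selection)
```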

For all the reasons mentioned above, we have used two different datasets for the population:
1. D1 - "Population estimates 2002-2019 by age and sex at Jan 1st" (for the year 2017)
2. D2 - "Resident population by age, sex and marital status on 1st January 2022" (for the years 2018 and 2019)

For the mashup with the "PREGNANCY datasets" (D4, D5, D6) we only need the population in our age range and gender of interest (15-25, F) for each region, whereas the mashup with the "RELIGION dataset" (D3) needs the whole population of each region.
For this reason, every population dataset is extracted twice: once restricted to the selected age range and gender ("Selected") and once for the general population ("General").
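As a rough sketch of the two population figures described above, the snippet below builds a toy dataframe with the same column layout as the D1 export and derives both totals per region. The population values are made up for illustration, and the variable names are assumptions rather than the notebook's own.

```python
import pandas as pd

# Toy rows in the same layout as the D1 export (population values are made up).
sample = pd.DataFrame({
    "Region": ["Piemonte"] * 6 + ["Sardegna"] * 6,
    "Age": [14, 14, 20, 20, 30, 30] * 2,
    "Sex": ["Females", "Total"] * 6,
    "Population": [10, 21, 12, 25, 15, 31, 4, 9, 5, 11, 6, 13],
})

# Population of interest for the PREGNANCY mashup: females aged 15-25.
in_range = sample["Age"].between(15, 25) & (sample["Sex"] == "Females")
selected_pop = sample[in_range].groupby("Region")["Population"].sum()

# Whole population for the RELIGION mashup: all ages, both sexes ("Total" rows).
general_pop = sample[sample["Sex"] == "Total"].groupby("Region")["Population"].sum()

print(selected_pop)
print(general_pop)
```

Note that in the real "General" export the `Age` column also contains the label "100 and over", so it is not purely numeric and a numeric filter like the one above would need extra handling there.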

#### D1 - "Population estimates 2002-2019 by age and sex at Jan 1st" (2017)
**Link** to the dataset on the demo platform: [here](https://demo.istat.it/app/?i=RIC&l=it)

##### D1.1 - Selected age range and gender (2017)
Select variables:
* Citizenship: All
* From age - to age: 15 - 25
* Year: 2018
* All regions

**Export in CSV** and get this dataframe:

In [29]:
import csv
import pandas as pd 

sel_2017 = pd.read_csv('../data/srcDS/Population/Selected/2017Selected.csv', keep_default_na=False)
sel_2017

Unnamed: 0,Region code,Region,Age,Sex,Population
0,1,Piemonte,15,Total,37597
1,1,Piemonte,15,Males,19171
2,1,Piemonte,15,Females,18426
3,1,Piemonte,16,Total,37662
4,1,Piemonte,16,Males,19387
...,...,...,...,...,...
655,20,Sardegna,24,Males,8217
656,20,Sardegna,24,Females,7625
657,20,Sardegna,25,Total,16453
658,20,Sardegna,25,Males,8504


As we are only interested in our selected gender ('Females'), we will get rid of the rows regarding male population (all the rows containing, under the column 'Sex', the value 'Males').
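A minimal sketch of this filtering step, using a few rows copied from the output above instead of the full CSV; the names `sel_sample` and `no_males` are illustrative only.

```python
import pandas as pd

# A few rows copied from the 2017 "Selected" output above.
sel_sample = pd.DataFrame({
    "Region code": [1, 1, 1, 1, 1],
    "Region": ["Piemonte"] * 5,
    "Age": [15, 15, 15, 16, 16],
    "Sex": ["Total", "Males", "Females", "Total", "Males"],
    "Population": [37597, 19171, 18426, 37662, 19387],
})

# Keep every row whose 'Sex' value is not 'Males', then renumber the index.
no_males = sel_sample[sel_sample["Sex"] != "Males"].reset_index(drop=True)
print(no_males)
```

As written, this removes only the 'Males' rows, so the 'Total' rows are still present next to the 'Females' ones.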

##### D1.2 - General population (2017)
Select variables:
* Citizenship: All
* From age - to age: 0 - 100 and over
* Year: 2018
* All regions

**Export in CSV** and get this dataframe (some rows are shown as example):

In [32]:
gen_2017 = pd.read_csv('../data/srcDS/Population/General/2017General.csv', keep_default_na=False)
gen_2017

Unnamed: 0,Region code,Region,Age,Sex,Population
0,1,Piemonte,0,Total,30680
1,1,Piemonte,0,Males,15597
2,1,Piemonte,0,Females,15083
3,1,Piemonte,1,Total,31895
4,1,Piemonte,1,Males,16306
...,...,...,...,...,...
6055,20,Sardegna,99,Males,43
6056,20,Sardegna,99,Females,151
6057,20,Sardegna,100 and over,Total,424
6058,20,Sardegna,100 and over,Males,93


#### D2 - "Resident population by age, sex and marital status on 1st January 2022" (2018, 2019)

Link to the dataset on the demo platform: [here](https://demo.istat.it/app/?i=POS&l=en).

##### D2.1 - Selected age and gender (2018, 2019)
Select variables:
* Year: 2019 (or 2020)
* From age - to age: 15 - 25 and over
* All regions (one dataset for each region)

**Export in CSV** and get this dataframe for each region (only one will be showcased here):

In [41]:
# Example for 2018, only one region
ex2018 = pd.read_csv('../data/srcDS/Population/Selected/D1-D2Population2018/2018Abruzzo.csv', keep_default_na=False)
ex2018 

Unnamed: 0,Age,Never married males,Married males,Divorced males,Widowed males,Same sex civil partner males,Divorced same-sex civil partner males,Widow of same-sex civil partner males,Total males,Never married females,Married females,Divorced females,Widowed females,Same sex civil partner females,Divorced same-sex civil partner females,Widow of same-sex civil partner females,Total females,Total
0,15,6035,0,0,0,0,0,0,6035,5559,0,0,0,0,0,0,5559,11594
1,16,5970,0,0,0,0,0,0,5970,5543,0,0,0,0,0,0,5543,11513
2,17,5930,0,0,0,0,0,0,5930,5455,0,1,0,0,0,0,5456,11386
3,18,6089,1,0,0,0,0,0,6090,5701,2,0,0,0,0,0,5703,11793
4,19,6409,3,0,0,0,0,0,6412,5615,32,0,0,0,0,0,5647,12059
5,20,6295,10,0,4,0,0,0,6309,5656,60,0,0,0,0,0,5716,12025
6,21,6451,19,0,5,1,0,0,6476,5822,119,1,0,1,0,0,5943,12419
7,22,6505,29,1,7,0,0,0,6542,5657,161,0,1,0,0,0,5819,12361
8,23,6422,56,0,2,1,0,0,6481,5888,227,3,0,0,1,0,6119,12600
9,24,6451,96,0,1,0,0,0,6548,5580,346,3,1,1,0,0,5931,12479


In [44]:
# Example for 2019, only one region
ex2019 = pd.read_csv('../data/srcDS/Population/Selected/D1-D2Population2019/2019Abruzzo.csv', keep_default_na=False)
ex2019

Unnamed: 0,Age,Never married males,Married males,Divorced males,Widowed males,Same sex civil partner males,Divorced same-sex civil partner males,Widow of same-sex civil partner males,Total males,Never married females,Married females,Divorced females,Widowed females,Same sex civil partner females,Divorced same-sex civil partner females,Widow of same-sex civil partner females,Total females,Total
0,15,6035,0,0,0,0,0,0,6035,5559,0,0,0,0,0,0,5559,11594
1,16,5970,0,0,0,0,0,0,5970,5543,0,0,0,0,0,0,5543,11513
2,17,5930,0,0,0,0,0,0,5930,5455,0,1,0,0,0,0,5456,11386
3,18,6089,1,0,0,0,0,0,6090,5701,2,0,0,0,0,0,5703,11793
4,19,6409,3,0,0,0,0,0,6412,5615,32,0,0,0,0,0,5647,12059
5,20,6295,10,0,4,0,0,0,6309,5656,60,0,0,0,0,0,5716,12025
6,21,6451,19,0,5,1,0,0,6476,5822,119,1,0,1,0,0,5943,12419
7,22,6505,29,1,7,0,0,0,6542,5657,161,0,1,0,0,0,5819,12361
8,23,6422,56,0,2,1,0,0,6481,5888,227,3,0,0,1,0,6119,12600
9,24,6451,96,0,1,0,0,0,6548,5580,346,3,1,1,0,0,5931,12479
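Because D2 is exported as one file per region, the regional dataframes presumably have to be combined at some point. The sketch below shows one possible way to do it with `pd.concat`; it is an assumption, not necessarily the approach used later in this notebook. The Abruzzo rows are copied from the example above, while the second region and its values are invented placeholders.

```python
import pandas as pd

# Two ages copied from the Abruzzo example above (only some columns kept).
abruzzo = pd.DataFrame({
    "Age": [15, 16],
    "Total females": [5559, 5543],
    "Total": [11594, 11513],
})

# Placeholder values for a second region, invented for illustration only.
other = pd.DataFrame({
    "Age": [15, 16],
    "Total females": [100, 110],
    "Total": [210, 220],
})

regional = {"Abruzzo": abruzzo, "Hypothetical region": other}

# Stack the regional frames and record which region each row came from.
combined = (
    pd.concat(regional, names=["Region"])
    .reset_index(level="Region")
    .reset_index(drop=True)
)

# Female population per region, summed over the ages present in the frames.
females_by_region = combined.groupby("Region")["Total females"].sum()
print(females_by_region)
```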


##### D2.2 - General population (2018, 2019)

### D3 - "Aspects of daily life: Religious observance - regions and type of municipality"
**Path** to the dataset on the I.stat platform: `Daily life and citizen opinions > Social Activities and religious observances > Religious observances - regions and type of municipality`

Select variables:
* Measure: thousands value
* Territory: Piemonte, Valle d'Aosta / Vallée d'Aoste, Liguria, Lombardia, Trentino Alto Adige / Sudtirol, Veneto, Friuli-Venezia Giulia, Emilia-Romagna, Toscana, Umbria, Marche, Lazio, Abruzzo, Molise, Campania, Puglia, Basilicata, Calabria, Sicilia, Sardegna
* Select time: 2017, 2018, 2019
* Data type: "at least once a week", "never"


**Export in CSV** and get this dataframe (some rows are shown as example):

In [45]:
d3 = pd.read_csv('../data/srcDS/D3.csv', keep_default_na=False)
d3

Unnamed: 0,ITTER107,Territory,TIPO_DATO_AVQ,Data type,MISURA_AVQ,Measure,TIME,Select time,Value,Flag Codes,Flags
0,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2017,2017,1302,,
1,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2018,2018,1155,,
2,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2019,2019,1129,,
3,ITC4,Lombardia,6_WEEK_RELIG,at least once a week,THV,thousands value,2017,2017,2539,,
4,ITC4,Lombardia,6_WEEK_RELIG,at least once a week,THV,thousands value,2018,2018,2371,,
...,...,...,...,...,...,...,...,...,...,...,...
115,ITE1,Toscana,6_NEVER_RELIG,never,THV,thousands value,2018,2018,1308,,
116,ITE1,Toscana,6_NEVER_RELIG,never,THV,thousands value,2019,2019,1368,,
117,ITF2,Molise,6_NEVER_RELIG,never,THV,thousands value,2017,2017,51,,
118,ITF2,Molise,6_NEVER_RELIG,never,THV,thousands value,2018,2018,56,,



We can see that several columns carry the same information twice, once as a code and once as a label:
* `ITTER107` corresponds to `Territory`
* `TIPO_DATO_AVQ` corresponds to `Data type`
* `MISURA_AVQ` corresponds to `Measure`
* `TIME` corresponds to `Select time`

Therefore, we can drop these duplicated columns (and other unnecessary ones) to keep our dataset cleaner (<u>see</u>: section "Clean up phase" > "D3 - Aspects of daily life: Religious observance - regions and type of municipality").
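Before dropping one column of each pair, it can be worth checking that the code and the label really are in one-to-one correspondence. The following is a minimal sketch on a few rows copied from the output above; the helper `is_one_to_one` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the D3 output above (only the paired columns).
d3_sample = pd.DataFrame({
    "ITTER107": ["ITD3", "ITD3", "ITC4", "ITE1", "ITF2"],
    "Territory": ["Veneto", "Veneto", "Lombardia", "Toscana", "Molise"],
    "TIPO_DATO_AVQ": ["6_WEEK_RELIG"] * 3 + ["6_NEVER_RELIG"] * 2,
    "Data type": ["at least once a week"] * 3 + ["never"] * 2,
    "MISURA_AVQ": ["THV"] * 5,
    "Measure": ["thousands value"] * 5,
    "TIME": [2017, 2018, 2017, 2018, 2017],
    "Select time": [2017, 2018, 2017, 2018, 2017],
})

pairs = [
    ("ITTER107", "Territory"),
    ("TIPO_DATO_AVQ", "Data type"),
    ("MISURA_AVQ", "Measure"),
    ("TIME", "Select time"),
]


def is_one_to_one(frame, code_col, label_col):
    """Return True if each code maps to exactly one label and vice versa."""
    unique_pairs = frame[[code_col, label_col]].drop_duplicates()
    return unique_pairs[code_col].is_unique and unique_pairs[label_col].is_unique


# For every (code, label) pair, record whether the two columns are redundant.
redundant = {pair: is_one_to_one(d3_sample, *pair) for pair in pairs}
print(redundant)
```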

### D4 - "Mother - Age and citizenship"
### D5 - "Spontaneous abortions - resignation from the place of the event: Age of women - prov."
### D6 - "Induced abortions - Migration: Events by region of residence of the woman and region of intervention"
### D7 - "Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020)"

## Clean up phase

### D1 - "Population estimates 2002-2019 by age and sex at Jan 1st"
### D2 - "Resident population by age, sex and marital status on 1st January 2022"

### D3 - "Aspects of daily life: Religious observance - regions and type of municipality"

As D3 is a fairly simple dataset, cleaning it only required **removing the duplicate and unnecessary columns** (namely "Territory", "Data type", "MISURA_AVQ", "Measure", "Select time", "Flag Codes", "Flags"), while keeping the corresponding coded ones ("ITTER107", "TIPO_DATO_AVQ", "TIME").
The column "MISURA_AVQ" was dropped together with its label "Measure" because every row is expressed as a "thousands value", so the column carries no additional information.


In [46]:
''' DOUBLE COLUMNS
    ITTER107 <-> Territory
        ITC1 <-> Piemonte
        ITC2 <-> Valle d'Aosta / Vallée d'Aoste
        ITC3 <-> Liguria
        ITC4 <-> Lombardia
        ITD3 <-> Veneto
        ITD4 <-> Friuli-Venezia Giulia
        ITD5 <-> Emilia-Romagna
        ITDA <-> Trentino Alto Adige / Sudtirol
        ITE1 <-> Toscana
        ITE2 <-> Umbria
        ITE3 <-> Marche
        ITE4 <-> Lazio
        ITF1 <-> Abruzzo
        ITF2 <-> Molise
        ITF3 <-> Campania
        ITF4 <-> Puglia
        ITF5 <-> Basilicata
        ITF6 <-> Calabria
        ITG1 <-> Sicilia
        ITG2 <-> Sardegna
    TIPO_DATO_AVQ <-> Data type
        6_NEVER_RELIG <-> never
        6_WEEK_RELIG <-> at least once a week
    MISURA_AVQ <-> Measure
        THV <-> thousands value
    TIME <-> Select time
        Nothing changes in the kind of data in the cells
'''
# Drop the columns holding natural-language labels and keep the coded ones for clarity
# ('MISURA_AVQ' can be dropped too, since every value is a thousands value)
d3.drop(["Territory","Data type","MISURA_AVQ","Measure","Select time","Flag Codes","Flags"], axis=1, inplace=True)
print(d3)

# Create new clean CSV for D3 (do not un-comment)
# d3.to_csv("../data/cleanDS/D3_clean.csv", index=False)

    ITTER107  TIPO_DATO_AVQ  TIME  Value
0       ITD3   6_WEEK_RELIG  2017   1302
1       ITD3   6_WEEK_RELIG  2018   1155
2       ITD3   6_WEEK_RELIG  2019   1129
3       ITC4   6_WEEK_RELIG  2017   2539
4       ITC4   6_WEEK_RELIG  2018   2371
..       ...            ...   ...    ...
115     ITE1  6_NEVER_RELIG  2018   1308
116     ITE1  6_NEVER_RELIG  2019   1368
117     ITF2  6_NEVER_RELIG  2017     51
118     ITF2  6_NEVER_RELIG  2018     56
119     ITF2  6_NEVER_RELIG  2019     60

[120 rows x 4 columns]
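Dropping `MISURA_AVQ` loses no information only as long as the column holds a single value. A defensive variant of the drop could check this first; the sketch below uses three rows copied from the output above, and the helper `drop_if_constant` is a hypothetical name, not something the notebook defines.

```python
import pandas as pd

# Three rows copied from the D3 output above (only some columns kept).
d3_sample = pd.DataFrame({
    "ITTER107": ["ITD3", "ITC4", "ITF2"],
    "MISURA_AVQ": ["THV", "THV", "THV"],
    "Value": [1302, 2539, 51],
})


def drop_if_constant(frame, column):
    """Drop `column` only if it holds a single distinct value, else raise."""
    if frame[column].nunique(dropna=False) != 1:
        raise ValueError(f"Column {column!r} is not constant; dropping it would lose data")
    return frame.drop(columns=[column])


cleaned = drop_if_constant(d3_sample, "MISURA_AVQ")
print(cleaned)
```

Unlike the `inplace=True` call above, this variant returns a new dataframe and leaves the original untouched.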



### D4 - "Mother - Age and citizenship"
### D5 - "Spontaneous abortions - resignation from the place of the event: Age of women - prov."
### D6 - "Induced abortions - Migration: Events by region of residence of the woman and region of intervention"
### D7 - "Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020)"