# Source datasets

This Jupyter Notebook documents the source datasets and their clean-up process for ["Blessed Be the Fruit"](https://github.com/OrsolaMBorrini/blessedfruit), an open data project that analyses factors that might influence pregnancy rates among young women in Italy.

## Selection phase
Overall, **7 source datasets** were selected, all coming from various [Istat](https://www.istat.it/) databases and platforms, such as [demo - demography in figures](https://demo.istat.it/?l=en), [IstatData](https://esploradati.istat.it/databrowser/#/) and [I.Stat](http://dati.istat.it/?lang=en).

When possible, IstatData was preferred to I.Stat, as all the content of the latter will gradually be migrated to the former.

Id | Dataset | Description | Provenance | Link / Path
--- | --- | --- | --- | --- 
D1 | Population estimates 2002-2019 by age and sex at Jan 1st | POP | demo | [Link](https://demo.istat.it/app/?i=RIC&l=en)
D2 | Resident population by age, sex and marital status on 1st January 2022 | POP | demo | [Link](https://demo.istat.it/app/?i=POS&l=en)
D3 | Aspects of daily life: Religious observances - regions and type of municipality | REL | I.Stat | Daily life and citizen opinions > Social Activities and religious observances > Religious observances - regions and type of municipality
D4 | Mother - Age and citizenship | PREG | IstatData | [Link](https://esploradati.istat.it/databrowser/#/en/dw/categories/IT1,POP,1.0/POP_BIRTHFERT/DCIS_NATI1/DCIS_NATI1_PARENT_CHARACT/IT1,25_74_DF_DCIS_NATI1_8,1.0)
D5 | Spontaneous abortions - resignation from the place of the event: Age of women - prov. | PREG | I.Stat | Health Statistics > Women Reproductive Health > Spontaneous abortions - resignation from the place of the event > Provincial data > Age of women - prov.
D6 | Induced abortions - Migration: Events by region of residence of the woman and region of intervention | PREG | I.Stat | Health Statistics > Women Reproductive Health > Voluntary interruptions of pregnancy - characteristics of the woman > Provincial data > Age - prov. of event
D7 | Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020) | ED | I.Stat | Education and Training > Population 15 years and over by highest level of education - previous regulation (until 2020) > Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020) > Data summary

## Analysis phase

### D1 - "Population estimates 2002-2019 by age and sex at Jan 1st"
### D2 - "Resident population by age, sex and marital status on 1st January 2022"

### D3 - "Aspects of daily life: Religious observance - regions and type of municipality"
**Path** to the dataset on the I.Stat platform: `Daily life and citizen opinions > Social Activities and religious observances > Religious observances - regions and type of municipality`

Select variables:
* Measure: thousands value
* Territory: Piemonte, Valle d'Aosta / Vallée d'Aoste, Liguria, Lombardia, Trentino Alto Adige / Sudtirol, Veneto, Friuli-Venezia Giulia, Emilia-Romagna, Toscana, Umbria, Marche, Lazio, Abruzzo, Molise, Campania, Puglia, Basilicata, Calabria, Sicilia, Sardegna
* Select time: 2017, 2018, 2019
* Data type: "at least once a week", "never"
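
Once the file is exported, it can be useful to confirm that it only contains the values selected above. The following is a minimal sketch of such a check, not part of the original workflow: the helper `unexpected_values` and the `EXPECTED` mapping are hypothetical names, and the inline sample reuses a few rows from the D3 preview shown in this notebook instead of reading the real file.

```python
import io
import pandas as pd

# Inline sample shaped like the I.Stat export (rows taken from the D3 preview in this notebook).
sample_csv = """ITTER107,Territory,TIPO_DATO_AVQ,Data type,MISURA_AVQ,Measure,TIME,Select time,Value
ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2017,2017,1302.0
ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2018,2018,1155.0
ITC1,Piemonte,6_NEVER_RELIG,never,HSC,per 100 people with the same characteristics,2017,2017,26.6
"""
sample = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)

# Values chosen in the selection list above (hypothetical mapping, keyed by label column).
EXPECTED = {
    "Measure": {"thousands value"},
    "Select time": {2017, 2018, 2019},
    "Data type": {"at least once a week", "never"},
}

def unexpected_values(df, expected):
    """Return, per column, the values found in df that fall outside the expected selection."""
    report = {}
    for column, allowed in expected.items():
        extra = set(df[column].unique().tolist()) - allowed
        if extra:
            report[column] = extra
    return report

flagged = unexpected_values(sample, EXPECTED)
print(flagged)
```

On these sample rows the check flags the `per 100 people with the same characteristics` measure, which appears in the preview even though only `thousands value` is listed in the selection.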


**Export to CSV** and load the file to get this dataframe (the first five rows are shown as an example):

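```python
import pandas as pd

path = '../data/srcDS/D3.csv'
# Preview only the first 5 rows; keep_default_na=False keeps empty cells as empty strings instead of NaN
d3 = pd.read_csv(path, keep_default_na=False, nrows=5)
d3
```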

Unnamed: 0 | ITTER107 | Territory | TIPO_DATO_AVQ | Data type | MISURA_AVQ | Measure | TIME | Select time | Value | Flag Codes | Flags
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | ITD3 | Veneto | 6_WEEK_RELIG | at least once a week | THV | thousands value | 2017 | 2017 | 1302.0 |  | 
1 | ITD3 | Veneto | 6_WEEK_RELIG | at least once a week | THV | thousands value | 2018 | 2018 | 1155.0 |  | 
2 | ITD3 | Veneto | 6_WEEK_RELIG | at least once a week | THV | thousands value | 2019 | 2019 | 1129.0 |  | 
3 | ITC1 | Piemonte | 6_NEVER_RELIG | never | HSC | per 100 people with the same characteristics | 2017 | 2017 | 26.6 |  | 
4 | ITC1 | Piemonte | 6_NEVER_RELIG | never | HSC | per 100 people with the same characteristics | 2018 | 2018 | 28.5 |  | 



We can see that the export carries the same information twice, with each coded column paired to a human-readable counterpart:
* `ITTER107` corresponds to `Territory`
* `TIPO_DATO_AVQ` corresponds to `Data type`
* `MISURA_AVQ` corresponds to `Measure`
* `TIME` corresponds to `Select time`

Therefore, we can drop these duplicated columns (and other unnecessary ones) to keep the dataset cleaner (<u>see</u>: section "Clean up phase" > "D3 - Aspects of daily life: Religious observance - regions and type of municipality").
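
Before dropping anything, the correspondence between each coded column and its label can be checked programmatically. This is a minimal sketch under stated assumptions: `is_one_to_one` and `PAIRS` are hypothetical names introduced for illustration, and the inline sample reproduces the five preview rows rather than loading `D3.csv`.

```python
import io
import pandas as pd

# The five preview rows of D3, inlined so the check runs without the source file.
sample_csv = """,ITTER107,Territory,TIPO_DATO_AVQ,Data type,MISURA_AVQ,Measure,TIME,Select time,Value,Flag Codes,Flags
0,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2017,2017,1302.0,,
1,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2018,2018,1155.0,,
2,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2019,2019,1129.0,,
3,ITC1,Piemonte,6_NEVER_RELIG,never,HSC,per 100 people with the same characteristics,2017,2017,26.6,,
4,ITC1,Piemonte,6_NEVER_RELIG,never,HSC,per 100 people with the same characteristics,2018,2018,28.5,,
"""
sample = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)

# (code column, label column) pairs listed above.
PAIRS = [
    ("ITTER107", "Territory"),
    ("TIPO_DATO_AVQ", "Data type"),
    ("MISURA_AVQ", "Measure"),
    ("TIME", "Select time"),
]

def is_one_to_one(df, code_col, label_col):
    """True if every code maps to exactly one label and every label to exactly one code."""
    pairs = df[[code_col, label_col]].drop_duplicates()
    return pairs[code_col].is_unique and pairs[label_col].is_unique

redundancy = {pair: is_one_to_one(sample, *pair) for pair in PAIRS}
print(redundancy)
```

If any pair came back `False` on the full file, the coded column would carry information that the label does not, and dropping it would lose data.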

### D4 - "Mother - Age and citizenship"
### D5 - "Spontaneous abortions - resignation from the place of the event: Age of women - prov."
### D6 - "Induced abortions - Migration: Events by region of residence of the woman and region of intervention"
### D7 - "Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020)"

## Clean up phase

### D1 - "Population estimates 2002-2019 by age and sex at Jan 1st"
### D2 - "Resident population by age, sex and marital status on 1st January 2022"

### D3 - "Aspects of daily life: Religious observance - regions and type of municipality"
1. Remove duplicate and unnecessary columns
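
This step could be sketched with `DataFrame.drop`. The choice of columns below is an assumption made for illustration: it keeps the human-readable labels, drops their coded counterparts, and treats the exported index (`Unnamed: 0`) and the empty flag columns as unnecessary. The project's actual selection may differ. The inline sample stands in for `D3.csv`.

```python
import io
import pandas as pd

# A few rows in the shape of the D3 export, inlined so the example is self-contained.
sample_csv = """,ITTER107,Territory,TIPO_DATO_AVQ,Data type,MISURA_AVQ,Measure,TIME,Select time,Value,Flag Codes,Flags
0,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2017,2017,1302.0,,
1,ITD3,Veneto,6_WEEK_RELIG,at least once a week,THV,thousands value,2018,2018,1155.0,,
3,ITC1,Piemonte,6_NEVER_RELIG,never,HSC,per 100 people with the same characteristics,2017,2017,26.6,,
"""
d3_sample = pd.read_csv(io.StringIO(sample_csv), keep_default_na=False)

# Assumed drop list: coded duplicates, the exported index column, and the empty flag columns.
COLUMNS_TO_DROP = [
    "Unnamed: 0",
    "ITTER107",
    "TIPO_DATO_AVQ",
    "MISURA_AVQ",
    "TIME",
    "Flag Codes",
    "Flags",
]

d3_clean = d3_sample.drop(columns=COLUMNS_TO_DROP)
print(list(d3_clean.columns))
```

Keeping the labels rather than the codes favours readability. If the data later needs to be joined with other Istat datasets, retaining `ITTER107` instead of `Territory` may be the safer option.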

### D4 - "Mother - Age and citizenship"
### D5 - "Spontaneous abortions - resignation from the place of the event: Age of women - prov."
### D6 - "Induced abortions - Migration: Events by region of residence of the woman and region of intervention"
### D7 - "Early leavers from education and training - aged 18 to 24 - previous regulation (until 2020)"