# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2
import plotly.graph_objects as go
import glob
# autoreload modules when code is run
%load_ext autoreload
%autoreload 2
# user written modules
import dataproject




# Read and clean data

Import your data, either through an API or manually, and load it. 

In [11]:
# Load one CSV file per year (2023 and 2024), tag each with its year, and stack them
df_list = []
for year in range(2023, 2025):
    data = pd.read_csv(f'Popolazione_residente_{year}.csv')
    data['year'] = year
    df_list.append(data)

pop_italy = pd.concat(df_list, axis=0, ignore_index=True)


In [9]:
pop_italy.head()

   Età  Totale maschi  Totale femmine  Totale  year
0    0         203086          190834  393920  2023
1    1         207218          197034  404252  2023
2    2         211696          199551  411247  2023
3    3         219719          208732  428451  2023
4    4         230081          217293  447374  2023


In [4]:
# The 'Età' (age) column is stored as object rather than int. Before converting it,
# we have to handle the '100 e oltre' (100 and over) and 'Totale' (total) rows.
pop_italy.info()

# Drop the row holding the population total; .copy() avoids a SettingWithCopyWarning below
pop_italy = pop_italy[pop_italy['Età'] != 'Totale'].copy()

# Ages of 100 and over are treated as a single group, so replace '100 e oltre' with '100'
pop_italy['Età'] = pop_italy['Età'].replace('100 e oltre', '100')

# Now the column can be converted to int
pop_italy['Età'] = pop_italy['Età'].astype(int)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Età             204 non-null    object
 1   Totale maschi   204 non-null    int64 
 2   Totale femmine  204 non-null    int64 
 3   Totale          204 non-null    int64 
 4   year            204 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 8.1+ KB


In [5]:
# Create a new column with the age group of each observation
pop_italy['age_group'] = ''

for i in range(0, 101, 5):
    if i == 100:
        pop_italy.loc[pop_italy['Età'].between(i, i+5), 'age_group'] = '100-100+'
    else:
        pop_italy.loc[pop_italy['Età'].between(i, i+4), 'age_group'] = f'{i}-{i+4}'

# Summing up all the observations for each age group
pop_italy_agg = pop_italy.groupby(['age_group', 'year'])[['Totale maschi', 'Totale femmine', 'Totale']].sum()

# Resetting index
pop_italy_agg.reset_index(inplace=True)

# Splitting age_group into lower_bound and upper_bound in order to sort the age groups correctly
pop_italy_agg[['lower_bound', 'upper_bound']] = pop_italy_agg['age_group'].str.split('-', expand=True)

# Sorting the age groups
pop_italy_agg['lower_bound'] = pop_italy_agg['lower_bound'].astype(int)
pop_italy_agg.sort_values(['year','lower_bound'], inplace=True)
pop_italy_agg.reset_index(drop=True, inplace=True)
pop_italy_agg.drop(['lower_bound', 'upper_bound'], axis=1, inplace=True)
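
As an alternative to the loop above, the same five-year age groups could be built with `pd.cut`. This is only a minimal sketch on hypothetical toy data (the `ages` frame is an assumption standing in for the cleaned `Età` column), not the method used in this notebook. Because `pd.cut` returns an ordered categorical, the groups sort correctly without splitting the label into bounds.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned 'Età' column
ages = pd.DataFrame({'Età': [0, 4, 5, 37, 99, 100]})

# Bin edges 0, 5, ..., 100 plus an open-ended top edge for the '100-100+' group
edges = list(range(0, 101, 5)) + [float('inf')]
labels = [f'{i}-{i+4}' for i in range(0, 100, 5)] + ['100-100+']

# right=False makes each bin closed on the left: [0, 5), [5, 10), ...
ages['age_group'] = pd.cut(ages['Età'], bins=edges, labels=labels, right=False)

print(ages)
```

When grouping by a categorical column like this, passing `observed=True` to `groupby` keeps empty age groups out of the result.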



In [6]:
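# Note: pop_italy_agg holds one row per age group and year, so this figure draws
# 2023 and 2024 together; plot_func below filters to a single year.
fig = go.Figure()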

# Adding male population data as positive values
fig.add_trace(go.Bar(
    y=pop_italy_agg['age_group'],
    x=pop_italy_agg['Totale maschi'],
    name='Male',
    orientation='h'
))

# Adding female population data as negative values to plot in opposite direction
fig.add_trace(go.Bar(
    y=pop_italy_agg['age_group'],
    x=pop_italy_agg['Totale femmine'] * -1,  # Multiplying by -1 to plot in opposite direction
    name='Female',
    orientation='h'
))

fig.update_layout(
    template='plotly_white', 
    title='Population in Italy',
    title_font_size=24,
    barmode='relative',
    bargap=0.1,  # Adjust as needed
    bargroupgap=0.2,  # Adjust as needed
    xaxis_title='Population',
    xaxis=dict(
        tickvals=[-2000000, -1000000, 0, 1000000, 2000000],
        ticktext=['2M', '1M', '0', '1M', '2M']
    ),
    width=800,  
    height=600  
)


In [73]:
def plot_func(df, year):
    """Plot a population pyramid for a single year."""

    # Boolean mask selecting the rows of the chosen year
    I = df['year'] == year

    fig = go.Figure()

    # Adding male population data as positive values
    fig.add_trace(go.Bar(
        y=df.loc[I, 'age_group'],
        x=df.loc[I, 'Totale maschi'],
        name='Male',
        orientation='h'
    ))

    # Adding female population data as negative values to plot in opposite direction
    fig.add_trace(go.Bar(
        y=df.loc[I, 'age_group'],
        x=df.loc[I, 'Totale femmine'] * -1,  # Multiplying by -1 to plot in opposite direction
        name='Female',
        orientation='h'
    ))

    fig.update_layout(
        template='plotly_white',
        title=f'Population in Italy {year}',
        title_font_size=24,
        barmode='relative',
        bargap=0.1,  # Adjust as needed
        bargroupgap=0.2,  # Adjust as needed
        xaxis_title='Population',
        xaxis=dict(
            tickvals=[-2000000, -1000000, 0, 1000000, 2000000],
            ticktext=['2M', '1M', '0', '1M', '2M']
        ),
        width=800,
        height=600
    )
    fig.show()

# Dropdown to switch between the years available in the aggregated data
widgets.interact(plot_func,
    df = widgets.fixed(pop_italy_agg),
    year = widgets.Dropdown(description='Year',
                            options=sorted(pop_italy_agg['year'].unique().tolist()),
                            value=2023)
);


## Explore each data set

To **explore the raw data**, you may provide **static** and **interactive plots** that show important developments.

**Interactive plot** :

Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped')
v.get_label_by_id('110').set_text('included')
plt.show()

In an inner merge, observations that appear in only one of the two data sets are dropped from both data X and data Y. A left join would instead keep all observations in data X and drop only the unmatched observations from Y.
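
The difference between the two joins can be sketched on toy data. The frames `data_x` and `data_y` and the key column `id` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data sets; the key column name 'id' is an assumption
data_x = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
data_y = pd.DataFrame({'id': [2, 3, 4], 'y': [20, 30, 40]})

# Inner merge: only keys present in both data sets survive (ids 2 and 3)
inner = pd.merge(data_x, data_y, on='id', how='inner')

# Left merge: every row of data_x survives; id 1 gets NaN in column 'y'
left = pd.merge(data_x, data_y, on='id', how='left')

print(inner)
print(left)
```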

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
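
One possible way to check which observations are thrown away is the `indicator` option of `pd.merge`, which adds a `_merge` column recording where each row came from. The sketch below uses hypothetical toy frames (`data_x`, `data_y`, key `id`) rather than the project data.

```python
import pandas as pd

# Hypothetical toy data sets; the key column name 'id' is an assumption
data_x = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
data_y = pd.DataFrame({'id': [2, 3, 4], 'y': [20, 30, 40]})

# An outer merge keeps everything; '_merge' is 'left_only', 'right_only' or 'both'
outer = pd.merge(data_x, data_y, on='id', how='outer', indicator=True)
counts = outer['_merge'].value_counts()

# The rows an inner merge would keep are exactly those marked 'both'
inner = outer[outer['_merge'] == 'both'].drop(columns='_merge')

print(counts)
print(inner.shape)
```

Comparing `counts` with the shapes of the input data sets makes it explicit how many rows each side loses.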

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
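
A minimal sketch of such summary statistics, assuming a frame with the same column layout as `pop_italy_agg`; the `toy` data below is made up for illustration and is not the actual population data.

```python
import pandas as pd

# Hypothetical toy data with the same column layout as pop_italy_agg
toy = pd.DataFrame({
    'age_group': ['0-4', '5-9', '0-4', '5-9'],
    'year': [2023, 2023, 2024, 2024],
    'Totale maschi': [100, 120, 90, 118],
    'Totale femmine': [95, 115, 88, 112],
})
toy['Totale'] = toy['Totale maschi'] + toy['Totale femmine']

# Sum and mean across age groups, computed separately for each year
summary = toy.groupby('year')[['Totale maschi', 'Totale femmine', 'Totale']].agg(['sum', 'mean'])

print(summary)
```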

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.