## Assignment 5
This assignment analyses the differences between the sexes by age in Ireland.
>
Author: Loic Soares Bagnoud


### Part 1
#### Preparing the Data
>
The first thing we need to do is prepare the data for analysis.

In [3]:
# First, import the libraries we need
import pandas as pd
import numpy as np

In [4]:
# Then we load the specific dataset, checking the last rows to confirm it loaded correctly.
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


With the data loaded, we can start cleaning it up and keeping only the columns we need to work with.

In [5]:
# This gets us the dataframe header
headers = df.columns.tolist()
headers

['STATISTIC',
 'Statistic Label',
 'TLIST(A1)',
 'CensusYear',
 'C02199V02655',
 'Sex',
 'C02076V03371',
 'Single Year of Age',
 'C03789V04537',
 'Administrative Counties',
 'UNIT',
 'VALUE']

In [6]:
# Collect the columns we don't need in a list.
drop_col_list = ['STATISTIC', 'Statistic Label','TLIST(A1)','CensusYear','C02199V02655','C02076V03371','C03789V04537','UNIT']

# Drop them with the .drop method from Pandas
df.drop(columns=drop_col_list, inplace=True)

# Convert the age labels into proper integers. This will make them easier to work with down the line.
# We remove the "All ages" totals, map "Under 1 year" to 0, and strip every non-digit character.
# The .copy() avoids a SettingWithCopyWarning when we assign to the filtered dataframe.
df = df[df["Single Year of Age"] != "All ages"].copy()
df['Single Year of Age'] = df['Single Year of Age'].str.replace('Under 1 year', '0')
df['Single Year of Age'] = df['Single Year of Age'].str.replace(r'\D', '', regex=True)

df['Single Year of Age'] = df['Single Year of Age'].astype('int64')
df['VALUE'] = df['VALUE'].astype('int64')

# Finally, we drop the "Both sexes" rows, since this assignment evaluates each sex individually.
# We follow the same logic as we did for the "All ages" rows above.
df = df[df["Sex"] != "Both sexes"]
df.info()

print(df.head(3))


<class 'pandas.core.frame.DataFrame'>
Index: 6464 entries, 3296 to 9791
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Sex                      6464 non-null   object
 1   Single Year of Age       6464 non-null   int64 
 2   Administrative Counties  6464 non-null   object
 3   VALUE                    6464 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 252.5+ KB
       Sex  Single Year of Age Administrative Counties  VALUE
3296  Male                   0                 Ireland  29610
3297  Male                   0   Carlow County Council    346
3298  Male                   0     Dublin City Council   3188
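
To see what the age cleaning does in isolation, here is a minimal sketch on a few hand-written labels. The sample labels are illustrative and written in the style of the CSO column, not copied from the dataset:

```python
import pandas as pd

# Hand-written sample labels (illustrative only, not taken from the dataset).
ages = pd.Series(["Under 1 year", "1 year", "35 years", "100 years and over"])

# Map "Under 1 year" to "0" first; otherwise stripping non-digits would turn it into 1.
cleaned = (
    ages.str.replace("Under 1 year", "0", regex=False)
        .str.replace(r"\D", "", regex=True)   # remove every non-digit character
        .astype("int64")
)

print(cleaned.tolist())  # [0, 1, 35, 100]
```

Note that "100 years and over" becomes 100, so the last age is an open-ended group rather than a single year.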


In [7]:
# We create a pivot table of the cleaned dataframe: one row per sex and age, one column per administrative county.
df_anal = pd.pivot_table(df, values='VALUE', index=['Sex', 'Single Year of Age'], columns='Administrative Counties')
print(df_anal.head(3))

# And finally, we write the whole table to a CSV file on the local machine that we can then look into.
df_anal.to_csv("population_for_analysis.csv")

Administrative Counties    Carlow County Council  Cavan County Council  \
Sex    Single Year of Age                                                
Female 0                                   353.0                 501.0   
       1                                   302.0                 477.0   
       2                                   334.0                 520.0   

Administrative Counties    Clare County Council  Cork City Council  \
Sex    Single Year of Age                                            
Female 0                                  691.0             1124.0   
       1                                  704.0             1136.0   
       2                                  744.0             1162.0   

Administrative Counties    Cork County Council  Donegal County Council  \
Sex    Single Year of Age                                                
Female 0                                2055.0                   881.0   
       1                                2045.0          

__Reference__: 
I ran into an error here. Luckily, ChatGPT helped me understand what the problem was:

 - _https://chatgpt.com/share/68fb8980-d6bc-800b-93f3-d0702e0e6ee1_

#### Weighted descriptive statistics

After data cleaning, we will apply weighted descriptive statistics to account for differences in population sizes across locations. This ensures that our analysis accurately reflects each location's contribution.
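
For reference, the weighted mean that np.average computes when given weights can be written as follows. The symbols are notation introduced here: $a_i$ is the age in row $i$ and $w_i$ is the population count (VALUE) in that row.

$$
\bar{a}_w = \frac{\sum_i w_i \, a_i}{\sum_i w_i}
$$

Each age therefore counts in proportion to the number of people who have it, rather than every row counting equally.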

In [None]:
# The first step is to keep only the columns we need from the cleaned dataframe in a new variable.
df_sex_only = df[['Sex', 'Single Year of Age', 'VALUE']]
df_sex_only

Unnamed: 0,Sex,Single Year of Age,VALUE
3296,Male,0,29610
3297,Male,0,346
3298,Male,0,3188
3299,Male,0,1269
3300,Male,0,2059
...,...,...,...
9787,Female,100,7
9788,Female,100,9
9789,Female,100,12
9790,Female,100,31


There were two ways that I went about this:
>
1. Initially, I wasn't sure how to calculate this and get the number right. My research led me to the groupby function, which splits the current dataframe into groups, in this case one for men and one for women. Afterwards, we can apply the np.average function from NumPy, with its weights argument, to get the weighted mean of each group. The issue I found with this version is that I struggled to bring the grouped results back together for further analysis. This ended up being extremely cumbersome, so I abandoned this path. I left it below for insight into my thought process.
>
2. The second solution was a simpler one. I went ahead and used a basic for loop. First, I created an empty list to store the results in. Then, I looped over the unique values in the Sex column (in this case, Male and Female) and, for each one, created a smaller dataframe that contains only the rows where the Sex column matches that value. Afterwards, we apply the NumPy function the same way to the age column, weighted by VALUE, and finally append the result to the list as a dictionary.
>
Finally, we convert that list of dictionaries into a dataframe.
>
I left both solutions below:

In [None]:
# The first solution, with the groupby function
'''
sex_grouped = (df_sex_only.groupby("Sex"))

weighted_mean_result = sex_grouped.apply(lambda g: np.average(g['Single Year of Age'], weights=g['VALUE']))

weighted_mean_by_sex = weighted_mean_result.reset_index(name='Weighted Mean Age')
weighted_mean_by_sex
'''

# The Second solution with the for loop.
results = []

for sex in df_sex_only['Sex'].unique():
    subset = df_sex_only[df_sex_only['Sex'] == sex]
    weighted_mean = np.average(subset['Single Year of Age'], weights=subset['VALUE'])
    results.append({'Sex': sex, 'Weighted Mean Age': weighted_mean})

weighted_mean_by_sex_loop = pd.DataFrame(results)
weighted_mean_by_sex_loop

Unnamed: 0,Sex,Weighted Mean Age
0,Male,37.739448
1,Female,38.939796
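
The groupby route does not have to end in separate dataframes. A minimal sketch, assuming a small made-up frame in place of df_sex_only, that iterates over the groups and collects one row per sex in a single dataframe:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_sex_only (illustrative numbers only).
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female"],
    "Single Year of Age": [10, 30, 20, 40],
    "VALUE": [1, 3, 1, 1],
})

# groupby yields (key, sub-dataframe) pairs, so each group can be summarised in turn,
# just like the for loop, without filtering the dataframe by hand.
rows = [
    {"Sex": sex, "Weighted Mean Age": np.average(g["Single Year of Age"], weights=g["VALUE"])}
    for sex, g in toy.groupby("Sex")
]
weighted = pd.DataFrame(rows)
print(weighted)
```

Here the male mean is (10·1 + 30·3) / 4 = 25 and the female mean is (20 + 40) / 2 = 30. groupby sorts its keys, so Female comes before Male in the result.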


__References__:

- For the first solution
>
_https://www.geeksforgeeks.org/pandas/python-pandas-dataframe-groupby/_

_https://realpython.com/pandas-reset-index/_

- For the second solution
>
_https://chatgpt.com/share/68fc9c34-e5c8-800b-90d2-2a16bfc33ee8_

#### Calculating the Difference between the sexes by age
>
Finally, to cap off Part 1, we're going to calculate the difference between the sexes.

In [9]:
# For this, we create a pivot table from the dataframe above, summing VALUE for each age and sex, and then subtract one column from the other.
sex_difference = df_sex_only.pivot_table(index='Single Year of Age', columns='Sex', values='VALUE',aggfunc='sum')
sex_difference['Difference (Male - Female)'] = sex_difference['Male'] - sex_difference['Female']
sex_difference

Single Year of Age,Female,Male,Difference (Male - Female)
0,56372,59220,2848
1,55090,57750,2660
2,57948,60472,2524
3,58966,62002,3036
4,59638,63372,3734
...,...,...,...
96,1912,654,-1258
97,1464,434,-1030
98,984,260,-724
99,672,210,-462
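
The cleaned data still contains rows where Administrative Counties is "Ireland" (the first row of the cleaned dataframe shows 29610 males aged 0 for Ireland). These appear to be national totals: the 59220 males aged 0 in the table above is exactly twice that figure, so summing over every row counts the national total on top of the councils. A minimal sketch of excluding those rows before pivoting, on made-up numbers with hypothetical council names:

```python
import pandas as pd

# Made-up stand-in for the cleaned df: two hypothetical councils plus an "Ireland" total per sex.
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "Single Year of Age": [0, 0, 0, 0, 0, 0],
    "Administrative Counties": ["Ireland", "Council A", "Council B"] * 2,
    "VALUE": [50, 30, 20, 45, 25, 20],
})

# Keep only the individual councils so the national total is not added on top of them.
councils_only = toy[toy["Administrative Counties"] != "Ireland"]

by_age = councils_only.pivot_table(
    index="Single Year of Age", columns="Sex", values="VALUE", aggfunc="sum"
)
by_age["Difference (Male - Female)"] = by_age["Male"] - by_age["Female"]
print(by_age)
```

With the total row removed, the sums match the national figures (50 and 45 here) instead of doubling them. The weighted mean is not affected by this, because doubling every weight leaves a weighted average unchanged.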


__References__:

_https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html_

### Part 2
#### Creating the variable and age groups
>
For the second part of the exercise, we choose an age, build an age range around it, and calculate the difference between the sexes within that range.

In [11]:
# We first choose our age. I went with the proposed 35.
chosen_age = 35

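# Then, we create a boolean mask over the Single Year of Age column that keeps every age from the chosen age minus 5 to the chosen age plus 5, inclusive.
# This gives us 5 years either side of the chosen age (ages 30 to 40, which is 11 single years).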
age_group = df_sex_only[
    (df_sex_only['Single Year of Age'] >= chosen_age - 5) &
    (df_sex_only['Single Year of Age'] <= chosen_age + 5)
]

age_group

Unnamed: 0,Sex,Single Year of Age,VALUE
4256,Male,30,30858
4257,Male,30,367
4258,Male,30,6163
4259,Male,30,1511
4260,Male,30,1888
...,...,...,...
7867,Female,40,556
7868,Female,40,538
7869,Female,40,630
7870,Female,40,1293


I had some issues here with the "and" keyword, as it was giving me an error. This led me to find out that in Pandas, to combine two conditions element by element, I need to use "&" instead, with each condition in its own parentheses.
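
A minimal sketch of the difference, on a small made-up Series of ages rather than the real column:

```python
import pandas as pd

# Made-up ages (illustrative only).
ages = pd.Series([28, 30, 35, 40, 41])

# `and` asks Python for a single True/False from each whole Series, which pandas refuses to give.
try:
    (ages >= 30) and (ages <= 40)
except ValueError as err:
    print("and failed:", type(err).__name__)

# `&` compares element by element. The parentheses are required because
# & binds more tightly than >= and <=.
mask = (ages >= 30) & (ages <= 40)
print(ages[mask].tolist())  # [30, 35, 40]

# Series.between is an equivalent shorthand, inclusive on both ends by default.
print(ages.between(30, 40).equals(mask))  # True
```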

__References:__

- _https://www.statology.org/and-operator-in-pandas/_

#### Sum up the results and calculate the difference

In [12]:
# We're going to follow the same logic as we did above for the weighted mean.

results_specific_age = []

# The one difference here is the .sum method instead of np.average, since we need each sex's total population before we calculate the difference.
for sex in age_group['Sex'].unique():
    subset = age_group[age_group['Sex'] == sex]
    total_pop = subset['VALUE'].sum()
    results_specific_age.append({'Sex': sex, 'Total Population': total_pop})

# We go ahead and store it in a dataframe.
sex_ages = pd.DataFrame(results_specific_age)

# Same logic as before, but since this is a new dataframe, we create a new pivot table so that each sex becomes its own column.
sex_difference_age = sex_ages.pivot_table(columns='Sex', values='Total Population')

# We calculate the difference. Indexing the columns the same way as in Part 1 kept giving me an error here,
# so I went ahead and used the get() function instead. get() returns the same column as square brackets would,
# but it returns None instead of raising a KeyError when the column does not exist.
sex_difference_age['Difference (Male - Female)'] = (
    sex_difference_age.get('Male') - sex_difference_age.get('Female')
)

# Convert the values to integers for better clarity. We were getting floats because pivot_table aggregates with the mean by default.
sex_difference_age = sex_difference_age.astype(int)

# Display the result
sex_difference_age

Sex,Female,Male,Difference (Male - Female)
Total Population,829012,768060,-60952
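
The same totals can also be reached without the loop and the second pivot table. A minimal sketch, assuming a small made-up frame in place of age_group: groupby followed by sum returns a Series indexed by sex, so the two totals can be subtracted directly and stay integers.

```python
import pandas as pd

# Made-up stand-in for age_group (illustrative numbers only).
toy_age_group = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female"],
    "Single Year of Age": [30, 31, 30, 31],
    "VALUE": [100, 120, 110, 130],
})

# One total per sex, indexed by the Sex label.
totals = toy_age_group.groupby("Sex")["VALUE"].sum()

# Plain label lookup and subtraction; no pivot table or get() needed.
difference = totals["Male"] - totals["Female"]
print(totals.to_dict(), difference)
```

In this toy data the totals are 220 for Male and 240 for Female, giving a difference of -20.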


__References__: 
>
- _https://www.w3schools.com/python/pandas/ref_df_get.asp_
- _https://how.dev/answers/what-is-the-pandas-dataframeget-method-in-python_