## Weighted stats
- This notebook was made for the Programming for Data Analytics lecture on weighted descriptive statistics.
- We are looking at the population by age and sex in each county of Ireland.


In [85]:
import pandas as pd

In [86]:
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


I keep only the rows for the individual sexes and the individual counties, dropping the "Both sexes" and "Ireland" totals.

In [87]:
df = df[df["Sex"] != "Both sexes"]
df = df[df["Administrative Counties"] != "Ireland"]
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


Now I can write the code that prepares the data for analysis.py.  
The dataset has more columns than I expected, so I first need to list their names.

In [88]:
headers = df.columns.tolist()
headers

['STATISTIC',
 'Statistic Label',
 'TLIST(A1)',
 'CensusYear',
 'C02199V02655',
 'Sex',
 'C02076V03371',
 'Single Year of Age',
 'C03789V04537',
 'Administrative Counties',
 'UNIT',
 'VALUE']

In [89]:
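# Columns that only carry codes or constants and are not needed for the analysis
drop_col_list = ['STATISTIC', 'Statistic Label','TLIST(A1)','CensusYear','C02199V02655','C02076V03371','C03789V04537','UNIT']
df = df.drop(columns=drop_col_list)
# Take an explicit copy after filtering so the assignments below do not write into a slice
df = df[df["Single Year of Age"] != "All ages"].copy()
# "Under 1 year" contains the digit 1, so map it to 0 before stripping non-digits
df['Single Year of Age'] = df['Single Year of Age'].str.replace('Under 1 year', '0')
# Raw string avoids the invalid escape sequence warning for \D
df['Single Year of Age'] = df['Single Year of Age'].str.replace(r'\D', '', regex=True)

df['Single Year of Age']=df['Single Year of Age'].astype('int64')
df['VALUE']=df['VALUE'].astype('int64')
df.info()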


<class 'pandas.core.frame.DataFrame'>
Index: 6262 entries, 3297 to 9791
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Sex                      6262 non-null   object
 1   Single Year of Age       6262 non-null   int64 
 2   Administrative Counties  6262 non-null   object
 3   VALUE                    6262 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 244.6+ KB
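
The age-label cleaning above can be checked on a few sample labels. This is a minimal sketch: the `labels` Series is a hypothetical sample and is not read from the CSO file.

```python
import pandas as pd

# Hypothetical sample of age labels, for illustration only
labels = pd.Series(["Under 1 year", "1 year", "25 years", "100 years and over"])

# "Under 1 year" contains the digit 1, so it is mapped to "0" first;
# afterwards every non-digit character is stripped and the result cast to int
ages = (
    labels.str.replace("Under 1 year", "0", regex=False)
          .str.replace(r"\D", "", regex=True)
          .astype("int64")
)
print(ages.tolist())  # [0, 1, 25, 100]
```

Note that "100 years and over" becomes 100, so everyone aged 100 or more is counted as exactly 100 in the statistics that follow.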




In [90]:
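# Pivot to one row per age and one column per sex.
# Note: pivot_table defaults to aggfunc="mean", so each cell is the average
# across counties, not a national total; pass aggfunc="sum" to get totals.
df_anal = pd.pivot_table(df, 'VALUE',"Single Year of Age","Sex")
print (df_anal.head(3))
# write out the entire table to a CSV file on the local machine
df_anal.to_csv("population_for_analysis.csv")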

Sex                     Female        Male
Single Year of Age                        
0                   909.225806  955.161290
1                   888.548387  931.451613
2                   934.645161  975.354839
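
The decimal values in this table come from `pivot_table` averaging over counties, since its default `aggfunc` is `"mean"`. The sketch below uses a small hypothetical DataFrame (`toy`, two made-up counties) to show the difference between the default and `aggfunc="sum"`.

```python
import pandas as pd

# Toy data (hypothetical): two counties, two ages, one sex
toy = pd.DataFrame({
    "Sex": ["Female"] * 4,
    "Single Year of Age": [0, 0, 1, 1],
    "Administrative Counties": ["A", "B", "A", "B"],
    "VALUE": [10, 30, 20, 40],
})

# Default aggregation is the mean over the rows that share an (age, sex) pair
mean_table = pd.pivot_table(toy, "VALUE", "Single Year of Age", "Sex")
# Explicit sum gives the total population per (age, sex) pair
sum_table = pd.pivot_table(toy, "VALUE", "Single Year of Age", "Sex", aggfunc="sum")

print(mean_table["Female"].tolist())  # [20.0, 30.0]
print(sum_table["Female"].tolist())   # [40, 60]
```

If every county reports every age, the mean table is just the sum table divided by the number of counties, so the weighted mean, median, standard deviation and mode are the same for both. Only the head counts differ.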


## Now we can do weighted descriptive statistics

In [91]:
sexes = list(df_anal.columns)
sexes

['Female', 'Male']

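The weighted mean is the sum of (age × population at that age) divided by the sum of the populations at each age: $\bar{a}_w = \sum_i a_i p_i / \sum_i p_i$, where $p_i$ is the population at age $a_i$.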

In [92]:
number_people = df_anal[sexes].sum()
number_people_df = number_people.to_frame(name='Number of People')
number_people_df


Sex,Number of People
Female,84019.032258
Male,82082.225806


In [93]:
df_anal

Single Year of Age,Female,Male
0,909.225806,955.161290
1,888.548387,931.451613
2,934.645161,975.354839
3,951.064516,1000.032258
4,961.903226,1022.129032
...,...,...
96,30.838710,10.548387
97,23.612903,7.000000
98,15.870968,4.193548
99,10.838710,3.387097


In [94]:
cumages = df_anal[sexes].mul(df_anal.index, axis=0).sum()
cumages_df = cumages.to_frame(name='Cumulative Age')
cumages_df


Sex,Cumulative Age
Female,3271684.0
Male,3097738.0


In [95]:
weighted_mean = cumages/number_people
weighted_mean_df = weighted_mean.to_frame(name='Weighted Mean')
weighted_mean_df

Sex,Weighted Mean
Female,38.939796
Male,37.739448


#### Or you can use numpy

In [96]:
import numpy as np
weighted_means = {}
for sex in sexes:
    weighted_means[sex] = np.average(df_anal.index, weights=df_anal[sex])

print(f"The weighted means by sex are: {weighted_means}")

The weighted means by sex are: {'Female': 38.93979589877869, 'Male': 37.73944773710391}
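
As a sanity check, the manual formula, `np.average`, and the plain mean of a one-row-per-person expansion should all agree. The sketch below uses made-up `ages` and `counts`, not the census data.

```python
import numpy as np

# Hypothetical data: two people aged 0, one aged 1, one aged 2
ages = np.array([0, 1, 2])
counts = np.array([2, 1, 1])

by_formula = (ages * counts).sum() / counts.sum()   # sum(age * pop) / sum(pop)
by_numpy = np.average(ages, weights=counts)          # same calculation done by numpy
by_expansion = np.repeat(ages, counts).mean()        # mean of [0, 0, 1, 2]

print(by_formula, by_numpy, by_expansion)  # 0.75 0.75 0.75
```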


### Weighted median
Create a series of the cumulative population by age, then take the first age at which the cumulative sum reaches half of the total population.

In [97]:
cumsum = df_anal[sexes].cumsum()
cumsum

Single Year of Age,Female,Male
0,909.225806,955.161290
1,1797.774194,1886.612903
2,2732.419355,2861.967742
3,3683.483871,3862.000000
4,4645.387097,4884.129032
...,...,...
96,83949.870968,82062.677419
97,83973.483871,82069.677419
98,83989.354839,82073.870968
99,84000.193548,82077.258065


In [98]:
weighted_medians = {}
for sex in sexes:
    cumsum = df_anal[sex].cumsum()
    cutoff = df_anal[sex].sum() / 2
    median_age = cumsum.index[cumsum >= cutoff][0]
    weighted_medians[sex] = median_age

print(f"The weighted medians by sex are: {weighted_medians}")

The weighted medians by sex are: {'Female': 39, 'Male': 38}
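
The same cutoff logic can be wrapped in a small helper. This is a sketch only: `weighted_median` is a hypothetical function name, and the input values are made up.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    values = np.asarray(values)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)              # cumulative sums need sorted values
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    cutoff = cum[-1] / 2
    # searchsorted returns the first position where cum >= cutoff
    return values[np.searchsorted(cum, cutoff)]

# Hypothetical data: expands to [0, 0, 0, 1, 1, 2, 3], whose median is 1
print(weighted_median([2, 0, 1, 3], [1, 3, 2, 1]))  # 1
```

When the total weight is even and the cutoff falls exactly on a boundary, this rule returns the lower of the two middle values rather than their average.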


### Weighted standard deviation
The weighted standard deviation uses the same formula as the regular standard deviation, but applies weights to the squared differences from the weighted mean.


In [99]:
w_variances = {}
for sex in sexes:
    mean = np.average(df_anal.index, weights=df_anal[sex])
    variance = np.average((df_anal.index - mean) ** 2, weights=df_anal[sex])
    w_variances[sex] = variance

print(f"The weighted variances for each sex are: {w_variances}")

The weighted variances for each sex are: {'Female': 528.953520736661, 'Male': 513.9835070876321}


In [100]:
w_stds = {sex: np.sqrt(var) for sex, var in w_variances.items()}
print(f"The weighted standard deviations for each sex are: {w_stds}")

The weighted standard deviations for each sex are: {'Female': 22.998989559036303, 'Male': 22.67120435900202}


### Mode
The mode is the age at which the population is the highest.
I can use the `idxmax` function to find the index label of the highest value:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html

In [101]:
modes = {sex: df_anal[sex].idxmax() for sex in sexes}
print(f"The modes for each sex are: {modes}")

The modes for each sex are: {'Female': 41, 'Male': 42}
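
`idxmax` can also be called on the whole DataFrame, which returns one index label per column without a loop. A small sketch on hypothetical data:

```python
import pandas as pd

# Toy table (hypothetical): ages as the index, one column per sex
toy = pd.DataFrame(
    {"Female": [5, 9, 7], "Male": [6, 4, 8]},
    index=pd.Index([10, 11, 12], name="Single Year of Age"),
)

# DataFrame.idxmax works column by column and returns a Series of index labels
modes = toy.idxmax().to_dict()
print(modes)
```

If two ages share the maximum, `idxmax` returns the first one, so a tie would go unnoticed.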


### Statistical Summary by Sex

- **Weighted Mean:** Females have an average age of **38.94**, while males average **37.74**, showing that females are slightly older on average.
- **Weighted Median:** The median age is **39 for females** and **38 for males**, meaning at least half of the females are aged 39 or younger and at least half of the males are aged 38 or younger.
- **Weighted Standard Deviation:** The spread of ages is **23.00 for females** and **22.67 for males**, so female ages are slightly more dispersed around their mean.
- **Mode:** The most common ages are **41 for females** and **42 for males**, so both groups peak around the early forties.


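## Part 2: Grouping people within 5 years of a given age
Take everyone within 5 years of age 35 (i.e., ages 30 to 40 inclusive): select those ages from `df_anal`, sum the populations by sex, and calculate the difference. Because `df_anal` was built with the default `mean` aggregation of `pivot_table`, the figures are averages per county rather than national totals.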

In [102]:
age_of_interest = 35
age_range = 5
lower_range = age_of_interest - age_range
upper_range = age_of_interest + age_range

# Select ages within the range using loc function
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
age_group = df_anal.loc[lower_range:upper_range]
pop_by_sex = age_group.sum()

# Calculate the population difference between sexes
# Applying the absolute value function to the difference to avoid negative numbers
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.abs.html
pop_diff = abs(pop_by_sex[sexes[0]] - pop_by_sex[sexes[1]])

print(f"Population by sex (ages {lower_range}-{upper_range}):\n{pop_by_sex}")
print(f"Population difference between sexes in this age group: {pop_diff}")

Population by sex (ages 30-40):
Sex
Female    13371.161290
Male      12388.064516
dtype: float64
Population difference between sexes in this age group: 983.0967741935456
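
The selection relies on `.loc` slicing by label, which includes both end points, unlike positional slicing with `.iloc`. A short sketch on a hypothetical table makes the difference visible:

```python
import pandas as pd

# Toy table (hypothetical) indexed by age
toy = pd.DataFrame(
    {"Female": [1, 2, 3, 4, 5], "Male": [5, 4, 3, 2, 1]},
    index=pd.Index([29, 30, 31, 32, 33], name="Single Year of Age"),
)

by_label = toy.loc[30:32]      # label slice: includes ages 30, 31 and 32
by_position = toy.iloc[1:3]    # positional slice: stops before position 3

print(by_label.index.tolist())     # [30, 31, 32]
print(by_position.index.tolist())  # [30, 31]
```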


## Part 3: Finding the region with the largest population difference
Find which region in Ireland has the largest population difference between the sexes in that age group.

In [107]:
# Filter for ages in the ranges defined before in Part 2
# Using the original DataFrame to group by the counties and sexes in the age range
age_group_df = df[(df['Single Year of Age'] >= lower_range) & (df['Single Year of Age'] <= upper_range)]

# Group by county and sex, sum population
# Using groupby to aggregate data and sum the population
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
region_sex_pop_grouped = age_group_df.groupby(['Administrative Counties', 'Sex'])['VALUE'].sum()

# Unstack to get sexes as columns
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html
region_sex_pop = region_sex_pop_grouped.unstack()

# Calculate the absolute difference between sexes and store in a new column 'Pop_Diff'
region_sex_pop['Pop_Diff'] = (region_sex_pop[sexes[0]] - region_sex_pop[sexes[1]]).abs()

# Find the region with the highest difference in Pop_Diff
max_diff_region = region_sex_pop['Pop_Diff'].idxmax()
# Find the value of the highest difference
max_diff_value = region_sex_pop['Pop_Diff'].max()

print(f"The region with the highest population difference is: {max_diff_region} ({max_diff_value})")

The region with the highest population difference is: Fingal County Council (2942)
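
Taking the absolute value hides which sex is more numerous in the winning region. One possible variation, sketched here on made-up data with hypothetical county names "A" and "B", keeps the signed difference and only uses the absolute value to rank the regions.

```python
import pandas as pd

# Toy data (hypothetical): two counties, two sexes, two ages
toy = pd.DataFrame({
    "Administrative Counties": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Sex": ["Female", "Male"] * 4,
    "Single Year of Age": [30, 30, 31, 31, 30, 30, 31, 31],
    "VALUE": [10, 8, 12, 9, 20, 21, 19, 20],
})

# Sum per county and sex, then move the sexes into columns
wide = toy.groupby(["Administrative Counties", "Sex"])["VALUE"].sum().unstack()

# Signed difference: positive means more females, negative means more males
wide["Diff"] = wide["Female"] - wide["Male"]

# Rank by absolute size, but report the signed value
top = wide["Diff"].abs().idxmax()
print(top, wide.loc[top, "Diff"])  # A 5
```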
