## Weighted stats
- This notebook was created during the Programming for Data Analytics lecture on weighted descriptive statistics.
- We are looking at the population of each county of Ireland, broken down by sex and single year of age.


In [40]:
import pandas as pd

In [41]:
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


I keep only the rows for the individual sexes and drop the combined "Both sexes" rows.

In [42]:
df = df[df["Sex"] != "Both sexes"]
df.tail(3)

Unnamed: 0,STATISTIC,Statistic Label,TLIST(A1),CensusYear,C02199V02655,Sex,C02076V03371,Single Year of Age,C03789V04537,Administrative Counties,UNIT,VALUE
9789,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-149d-13a3-e055-000000000001,Cavan County Council,Number,12
9790,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-14a4-13a3-e055-000000000001,Donegal County Council,Number,31
9791,FY006AC01,Population,2022,2022,2,Female,650,100 years and over,2ae19629-1495-13a3-e055-000000000001,Monaghan County Council,Number,7


Now I can write the code that prepares the data for analysis.py.  
I noticed that there are more columns than I need, so I first get their names.

In [43]:
headers = df.columns.tolist()
headers

['STATISTIC',
 'Statistic Label',
 'TLIST(A1)',
 'CensusYear',
 'C02199V02655',
 'Sex',
 'C02076V03371',
 'Single Year of Age',
 'C03789V04537',
 'Administrative Counties',
 'UNIT',
 'VALUE']

In [44]:
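# Columns that are not needed for the age analysis
drop_col_list = ['STATISTIC', 'Statistic Label','TLIST(A1)','CensusYear','C02199V02655','Administrative Counties','C02076V03371','C03789V04537','UNIT']
df = df.drop(columns=drop_col_list)
# .copy() avoids a SettingWithCopyWarning on the assignments below
df = df[df["Single Year of Age"] != "All ages"].copy()
# Handle 'Under 1 year' first, otherwise stripping non-digits would turn it into age 1
df['Single Year of Age'] = df['Single Year of Age'].str.replace('Under 1 year', '0')
# Raw string so that \D (any non-digit) is not read as an invalid escape sequence
df['Single Year of Age'] = df['Single Year of Age'].str.replace(r'\D', '', regex=True)

df['Single Year of Age'] = df['Single Year of Age'].astype('int64')
df['VALUE'] = df['VALUE'].astype('int64')
df.info()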


<class 'pandas.core.frame.DataFrame'>
Index: 6464 entries, 3296 to 9791
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Sex                 6464 non-null   object
 1   Single Year of Age  6464 non-null   int64 
 2   VALUE               6464 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 202.0+ KB
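
To see what the two `str.replace` calls do, here is a minimal sketch on a few made-up labels. The labels "1 year" and "2 years" are my assumption about how the dataset writes the ages in between; only "Under 1 year" and "100 years and over" appear in the cells above.

```python
import pandas as pd

# Toy labels in the style of the 'Single Year of Age' column (illustrative only)
ages = pd.Series(["Under 1 year", "1 year", "2 years", "100 years and over"])

# Map 'Under 1 year' to "0" first, then strip every non-digit character
cleaned = ages.str.replace("Under 1 year", "0").str.replace(r"\D", "", regex=True)
cleaned = cleaned.astype("int64")

print(cleaned.tolist())
```

Note that "100 years and over" becomes the single age 100, so everyone aged 100 or more is counted as exactly 100 in the statistics that follow.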




In [45]:
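# pivot_table defaults to aggfunc="mean", so each cell is the average VALUE
# over all rows for that age and sex (one row per area), not a population total
df_anal = pd.pivot_table(df, values='VALUE', index="Single Year of Age", columns="Sex")
print(df_anal.head(3))
# Write the whole table to a CSV file on the local machine
df_anal.to_csv("population_for_analysis.csv")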

Sex                    Female       Male
Single Year of Age                      
0                   1761.6250  1850.6250
1                   1721.5625  1804.6875
2                   1810.8750  1889.7500
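
The fractional values above come from the default aggregation in `pd.pivot_table`, which is the mean. A small sketch with made-up counts (the `Area` column and all numbers are hypothetical) shows the difference between the default and an explicit `aggfunc="sum"`:

```python
import pandas as pd

# Made-up counts: two areas, two ages, two sexes
toy = pd.DataFrame({
    "Area": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Sex": ["Female", "Male", "Female", "Male"] * 2,
    "Single Year of Age": [0, 0, 1, 1, 0, 0, 1, 1],
    "VALUE": [10, 12, 8, 9, 30, 28, 20, 21],
})

# Default aggregation: the mean of the rows that fall in each cell
mean_table = pd.pivot_table(toy, values="VALUE", index="Single Year of Age", columns="Sex")

# Explicit sum: the total count in each cell
sum_table = pd.pivot_table(toy, values="VALUE", index="Single Year of Age",
                           columns="Sex", aggfunc="sum")

print(mean_table)
print(sum_table)
```

When every cell has the same number of rows, the two tables differ only by a constant factor, so ratio-based statistics such as the weighted mean come out the same either way. Absolute population figures do not.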


## Now we can do weighted descriptive statistics

In [46]:
sexes = list(df_anal.columns)
sexes

['Female', 'Male']

The weighted mean is `sum(age * population at age) / sum(population at age)`: each age is weighted by the number of people of that age.

In [47]:
number_people = df_anal[sexes].sum()
number_people_df = number_people.to_frame(name='Number of People')
number_people_df


Unnamed: 0_level_0,Number of People
Sex,Unnamed: 1_level_1
Female,162786.875
Male,159034.3125


In [48]:
df_anal

Sex,Female,Male
Single Year of Age,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1761.6250,1850.6250
1,1721.5625,1804.6875
2,1810.8750,1889.7500
3,1842.6875,1937.5625
4,1863.6875,1980.3750
...,...,...
96,59.7500,20.4375
97,45.7500,13.5625
98,30.7500,8.1250
99,21.0000,6.5625


In [49]:
cumages = df_anal[sexes].mul(df_anal.index, axis=0).sum()
cumages_df = cumages.to_frame(name='Cumulative Age')
cumages_df


Unnamed: 0_level_0,Cumulative Age
Sex,Unnamed: 1_level_1
Female,6338888.0
Male,6001867.0


In [50]:
weighted_mean = cumages/number_people
weighted_mean_df = weighted_mean.to_frame(name='Weighted Mean')
weighted_mean_df

Unnamed: 0_level_0,Weighted Mean
Sex,Unnamed: 1_level_1
Female,38.939796
Male,37.739448


#### Or you can use numpy

In [51]:
import numpy as np
weighted_means = {}
for sex in sexes:
    weighted_means[sex] = np.average(df_anal.index, weights=df_anal[sex])

print(f"The weighted means by sex are: {weighted_means}")

The weighted means by sex are: {'Female': 38.9397958987787, 'Male': 37.7394477371039}
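
As a quick sanity check, both routes can be compared on made-up numbers that are small enough to work out by hand. This is only a sketch; the ages and counts are invented.

```python
import numpy as np
import pandas as pd

# Made-up population counts indexed by age
pop = pd.Series([100, 300, 600], index=[10, 20, 30])

# Manual route: sum(age * population) / sum(population)
manual = (pop.index.to_numpy() * pop.to_numpy()).sum() / pop.sum()

# NumPy route: the ages are the values, the counts are the weights
via_numpy = np.average(pop.index, weights=pop)

print(manual, via_numpy)
```

By hand: (10*100 + 20*300 + 30*600) / 1000 = 25000 / 1000 = 25.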


### Weighted median
Create a series of cumulative sums of the population by age. The weighted median is the first age at which the cumulative sum reaches half of the total population.

In [52]:
cumsum = df_anal[sexes].cumsum()
cumsum

Sex,Female,Male
Single Year of Age,Unnamed: 1_level_1,Unnamed: 2_level_1
0,1761.6250,1850.6250
1,3483.1875,3655.3125
2,5294.0625,5545.0625
3,7136.7500,7482.6250
4,9000.4375,9463.0000
...,...,...
96,162652.8750,158996.4375
97,162698.6250,159010.0000
98,162729.3750,159018.1250
99,162750.3750,159024.6875


In [53]:
weighted_medians = {}
for sex in sexes:
    cumsum = df_anal[sex].cumsum()
    cutoff = df_anal[sex].sum() / 2
    median_age = cumsum.index[cumsum >= cutoff][0]
    weighted_medians[sex] = median_age

print(f"The weighted medians by sex are: {weighted_medians}")

The weighted medians by sex are: {'Female': 39, 'Male': 38}
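
The loop above could be wrapped in a small helper. The sketch below uses a hypothetical function name, `weighted_median`, and made-up counts; it assumes the index is sorted in ascending order.

```python
import pandas as pd

def weighted_median(counts: pd.Series):
    """Return the first index label whose cumulative count reaches half the total.

    Hypothetical helper; assumes the index is sorted in ascending order.
    """
    cumulative = counts.cumsum()
    cutoff = counts.sum() / 2
    return cumulative.index[cumulative >= cutoff][0]

# Made-up population counts indexed by age
pop = pd.Series([100, 300, 600], index=[10, 20, 30])

# Cumulative sums are 100, 400, 1000 and the cutoff is 500
print(weighted_median(pop))
```

When the cumulative sum lands exactly on the cutoff, this returns the lower of the two middle ages rather than averaging them, which is a simplification compared with the usual definition of the median.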


## Weighted standard deviation
The weighted standard deviation uses the same formula as the regular (population) standard deviation, but applies weights to the squared differences from the weighted mean. I first compute the weighted variance and then take its square root.


In [54]:
w_variances = {}
for sex in sexes:
    mean = np.average(df_anal.index, weights=df_anal[sex])
    variance = np.average((df_anal.index - mean) ** 2, weights=df_anal[sex])
    w_variances[sex] = variance

print(f"The weighted variances for each sex are: {w_variances}")

The weighted variances for each sex are: {'Female': 528.953520736661, 'Male': 513.9835070876321}


In [55]:
w_stds = {sex: np.sqrt(var) for sex, var in w_variances.items()}
print(f"The weighted standard deviations for each sex are: {w_stds}")

The weighted standard deviations for each sex are: {'Female': 22.998989559036303, 'Male': 22.67120435900202}


### Finding the mode for each sex
The mode is the age at which the population is the highest.
I can use the `idxmax` function to find the index of the highest value:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html

In [56]:
modes = {sex: df_anal[sex].idxmax() for sex in sexes}
print(f"The modes for each sex are: {modes}")

The modes for each sex are: {'Female': 41, 'Male': 42}
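
One thing to keep in mind is that `idxmax` returns only the first label when several ages share the maximum. A minimal sketch with made-up counts shows this, along with one way to list every tied age:

```python
import pandas as pd

# Made-up counts where ages 20 and 30 are tied for the maximum
pop = pd.Series([100, 600, 600], index=[10, 20, 30])

# idxmax returns the first label holding the maximum value
first_mode = pop.idxmax()

# Comparing against the maximum lists every tied label
all_modes = pop[pop == pop.max()].index.tolist()

print(first_mode, all_modes)
```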


### Statistical Summary by Sex

- **Weighted Mean:** Females have an average age of **38.94**, while males average **37.74**, showing that females are slightly older on average.
- **Weighted Median:** The median age is **39 for females** and **38 for males**, so at least half of the female population is aged 39 or younger and at least half of the male population is aged 38 or younger.
- **Weighted Standard Deviation:** The spread of ages is **23.00 for females** and **22.67 for males**, so female ages are slightly more spread out around their mean.
- **Mode:** The most common ages are **41 for females** and **42 for males**, so both groups peak around the early forties.


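## Part 2: Grouping people within 5 years of a given age
This part groups people within 5 years of age 35 (ages 30 to 40 inclusive, so 11 single years of age): select those ages from the table, sum the populations by sex, and calculate the difference between the sexes. Because `pd.pivot_table` was called with its default aggregation (the mean), the figures below are averages over the rows for each age and sex rather than national totals.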

In [59]:
age_of_interest = 35
age_range = 5
lower_range = age_of_interest - age_range
upper_range = age_of_interest + age_range

# Select ages within the range using loc function
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
age_group = df_anal.loc[lower_range:upper_range]
pop_by_sex = age_group.sum()

# Calculate the population difference between the sexes
# Python's built-in abs() keeps the difference non-negative whichever sex is larger
# (pandas offers the equivalent DataFrame.abs:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.abs.html)
pop_diff = abs(pop_by_sex[sexes[0]] - pop_by_sex[sexes[1]])

print(f"Population by sex (ages {lower_range}-{upper_range}):\n{pop_by_sex}")
print(f"Population difference between sexes in this age group: {pop_diff}")

Population by sex (ages 30-40):
Sex
Female    25906.625
Male      24001.875
dtype: float64
Population difference between sexes in this age group: 1904.75
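
To try other ages or window sizes, the selection can be wrapped in a function. The sketch below uses a hypothetical helper name, `population_within`, and a made-up table; it assumes a sorted integer age index, for which `.loc` slicing includes both endpoints.

```python
import pandas as pd

def population_within(table: pd.DataFrame, age: int, window: int) -> pd.Series:
    """Sum each column over ages from age - window to age + window inclusive.

    Hypothetical helper; assumes a sorted integer age index.
    """
    return table.loc[age - window : age + window].sum()

# Made-up table: ages 30 to 39, with invented counts for each sex
toy = pd.DataFrame(
    {"Female": list(range(10, 20)), "Male": list(range(20, 30))},
    index=list(range(30, 40)),
)

# Ages 33 to 37 inclusive
by_sex = population_within(toy, age=35, window=2)
difference = abs(by_sex["Female"] - by_sex["Male"])

print(by_sex)
print(difference)
```

With these toy numbers the female total is 13 + 14 + 15 + 16 + 17 = 75 and the male total is 23 + 24 + 25 + 26 + 27 = 125, a difference of 50.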
