# Assignment 5 - Population

For the PFDA Module

Author: Kyra Menai Hamilton

# DELETE LATER
## ASSIGNMENT REQUIREMENTS

Part 1 70%
Write a jupyter notebook that analyses the differences between the sexes by age in Ireland.

Weighted mean age (by sex)
The difference between the sexes by age
This part does not need to look at the regions.

i.e. you can take the notebook I used in the lectures and substitute the sexes for the regions.

Part 2 20%
In the same notebook, make a variable that stores an age (say 35).

Write the code that would group the people within 5 years of that age together into one age group.

Calculate the population difference between the sexes in that age group.

Part 3 10%
In the same notebook.

Write the code that would work out which region in Ireland has the biggest population difference between the sexes in that age group

## Import Packages

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Getting the data

Download the data as a CSV file from the CSO website: API Data Query > Format (CSV) > RESTful (URL).

In [None]:
url = "https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en"
df = pd.read_csv(url)
print(df.head(5))
print(df.tail(5))

## Cleaning the data

A sanity check shows that the imported data includes a "Both sexes" category alongside the individual sexes. This analysis compares male and female only, so the "Both sexes" rows need to be removed.

In [None]:
print(df['Sex'].unique())
print(df['Sex'].value_counts())

In [None]:
# Keep only the male and female rows; .copy() avoids a SettingWithCopyWarning in later edits
df = df[df['Sex'].str.strip().str.lower().isin(['male', 'female'])].copy()

# df = df[df['Sex'] != 'Both sexes'] would also work; the normalised comparison guards against stray spacing or casing in the labels.
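
To illustrate why the filter normalises the labels first, here is a minimal sketch on a toy table. The messy labels and counts are hypothetical and are not taken from the CSO file.

```python
import pandas as pd

# Hypothetical labels with inconsistent spacing and casing (not from the CSO file)
toy = pd.DataFrame({
    "Sex": ["Both sexes", " Male", "female ", "MALE"],
    "VALUE": [10, 4, 6, 3],
})

# Normalise the labels, then keep only the two individual sexes
normalised = toy["Sex"].str.strip().str.lower()
kept = toy[normalised.isin(["male", "female"])]
print(kept)
```

Note that the filter only decides which rows to keep. It does not rewrite the labels, so later lookups such as `['Male']` still rely on the labels in the real data being spelled exactly that way.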

There are several columns in the dataset. This analysis focuses on sex, age, and region, so first check which columns are present.

In [None]:
headers = df.columns.tolist()
headers

In [None]:
# Drop the metadata and code columns that are not needed for the analysis
drop_col_list = ['STATISTIC', 'Statistic Label','TLIST(A1)','CensusYear','C02199V02655','C02076V03371','C03789V04537','UNIT']
df = df.drop(columns=drop_col_list)

# Remove the "All ages" totals, then reduce each age label to its number
df = df[df['Single Year of Age'] != 'All ages'].copy()
df['Single Year of Age'] = df['Single Year of Age'].str.replace('Under 1 year', '0')
df['Single Year of Age'] = df['Single Year of Age'].str.replace(r'\D', '', regex=True)

df['Single Year of Age'] = df['Single Year of Age'].astype('int64')
df['VALUE'] = df['VALUE'].astype('int64')
print(df.head(5))
print(df.tail(5))
df.info()
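
As a quick check of the age-cleaning logic, the sketch below applies the same two replacements to a handful of labels. The label strings are assumed examples of the style found in the dataset, not values read from it.

```python
import pandas as pd

# Assumed example labels in the style the cleaning step handles
ages = pd.Series(["Under 1 year", "1 year", "35 years", "100 years and over"])

# Map the "Under 1 year" label to 0, then strip every non-digit character
cleaned = (
    ages.str.replace("Under 1 year", "0", regex=False)
        .str.replace(r"\D", "", regex=True)
        .astype("int64")
)
print(cleaned.tolist())  # [0, 1, 35, 100]
```

The order matters: "Under 1 year" has to be replaced before the digits are extracted, otherwise it would be read as age 1.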

Sex must be kept as a dimension of the pivot table so that the sexes can still be compared within each region in Part 3.


In [None]:
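# Pivot with age as the index and one column per (sex, county) pair so that Sex is kept.
# Keyword arguments are used because the fifth positional argument of pivot_table is aggfunc, not a column.
df_analysis = pd.pivot_table(df, values='VALUE', index='Single Year of Age', columns=['Sex', 'Administrative Counties'], aggfunc='sum')
print(df_analysis.head(5))
# Write the pivoted table to a CSV file on the local machine
df_analysis.to_csv("population_for_analysis.csv")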


## Analysing the data

### Part 1 - Differences between the sexes by age in Ireland
- Weighted mean age (by sex); regions are not needed for this part

In [None]:
# Part 1: Analyse differences between the sexes by age in Ireland
# Use only the national 'Ireland' rows so that county figures are not counted a second time
df_national = df[df['Administrative Counties'] == 'Ireland']

# Weighted mean age (by sex), with the population counts as weights
weighted_mean_age = df_national.groupby('Sex').apply(lambda x: np.average(x['Single Year of Age'], weights=x['VALUE']))
print('Weighted mean age by sex:')
print(weighted_mean_age)

# Difference between the sexes by age (total population by age and sex)
age_sex_diff = df_national.pivot_table(index='Single Year of Age', columns='Sex', values='VALUE', aggfunc='sum')
age_sex_diff['Difference'] = age_sex_diff['Male'] - age_sex_diff['Female']
print('Population difference (Male - Female) by age:')
print(age_sex_diff[['Difference']].head(10))  # Show first 10 ages as example
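
To make the weighted mean concrete, the sketch below compares `np.average` with the formula written out by hand, sum(age × count) / sum(count). The three ages and their counts are made-up numbers.

```python
import numpy as np

ages = np.array([0, 1, 2])
counts = np.array([10, 20, 70])  # hypothetical head counts for each age

# Weighted mean: each age contributes in proportion to its population
weighted = np.average(ages, weights=counts)

# The same calculation written out explicitly
manual = (ages * counts).sum() / counts.sum()

# The plain mean ignores how many people are at each age
unweighted = ages.mean()

print(weighted, manual, unweighted)  # 1.6 1.6 1.0
```

Because most of the toy population is aged 2, the weighted mean (1.6) sits well above the unweighted mean of the age labels (1.0).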

### Part 2 - Make a variable that stores an age
- Write the code that would group the people within 5 years of that age together into one age group
- Calculate the population difference between the sexes in that age group

In [None]:
# Part 2: Group people within 5 years of a given age and calculate population difference
age_of_interest = 32
age_group = df[(df['Single Year of Age'] >= age_of_interest - 5) & (df['Single Year of Age'] <= age_of_interest + 5)]

# Calculate total population by sex in this age group from the national 'Ireland' rows,
# so that the county rows are not added on top of the national total
age_group_national = age_group[age_group['Administrative Counties'] == 'Ireland']
pop_by_sex = age_group_national.groupby('Sex')['VALUE'].sum()
pop_diff = pop_by_sex['Male'] - pop_by_sex['Female']
print(f"Population by sex for ages {age_of_interest-5} to {age_of_interest+5}:")
print(pop_by_sex)
print(f"Population difference (Male - Female) in this age group: {pop_diff}")
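
The age window can also be expressed with `Series.between`, which is inclusive at both ends by default. The following is a small sketch on invented counts; the `window` variable is a hypothetical name introduced here so that the width is not hard-coded.

```python
import pandas as pd

# Invented counts; ages 26 and 38 fall just outside a 32 +/- 5 window
toy = pd.DataFrame({
    "Single Year of Age": [26, 27, 32, 37, 38, 27, 32, 37],
    "Sex": ["Male", "Male", "Male", "Male", "Male", "Female", "Female", "Female"],
    "VALUE": [100, 10, 20, 30, 100, 12, 25, 28],
})

age_of_interest = 32
window = 5  # hypothetical parameter: years either side of the chosen age

# True for ages 27 to 37 inclusive
in_window = toy["Single Year of Age"].between(age_of_interest - window, age_of_interest + window)

totals = toy[in_window].groupby("Sex")["VALUE"].sum()
difference = totals["Male"] - totals["Female"]
print(totals)
print(difference)  # -5
```

With both ends included, a window of 5 years either side covers 11 single years of age.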

### Part 3 - Regions vs sexes
- Write the code that would work out which region in Ireland has the biggest population difference between the sexes in that age group.

In [None]:
# Part 3: Region with biggest population difference between sexes in the age group
# Reuse the age_group DataFrame from Part 2

# Exclude the national 'Ireland' total so that only the counties are compared
age_group_counties = age_group[age_group['Administrative Counties'] != 'Ireland']

region_diff = age_group_counties.pivot_table(index='Administrative Counties', columns='Sex', values='VALUE', aggfunc='sum')
region_diff['Difference'] = region_diff['Male'] - region_diff['Female']
max_diff_region = region_diff['Difference'].abs().idxmax()
max_diff_value = region_diff.loc[max_diff_region, 'Difference']
print(f"County with biggest population difference (by absolute value) between sexes in ages {age_of_interest-5} to {age_of_interest+5}:")
print(f"{max_diff_region}: {max_diff_value}")
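
Taking the absolute value before `idxmax` matters when the largest gap is a female surplus. The sketch below shows the two approaches disagreeing; the region names and counts are hypothetical.

```python
import pandas as pd

region_diff = pd.DataFrame(
    {"Male": [500, 300, 420], "Female": [480, 390, 410]},
    index=["County A", "County B", "County C"],  # hypothetical region names
)
region_diff["Difference"] = region_diff["Male"] - region_diff["Female"]

# Largest signed difference: only finds the biggest male surplus
largest_signed = region_diff["Difference"].idxmax()

# Largest absolute difference: finds the biggest gap in either direction
largest_absolute = region_diff["Difference"].abs().idxmax()

print(largest_signed, largest_absolute)  # County A County B
```

Looking up the signed value for the selected region afterwards, as the cell above does, shows which sex is in the majority there.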

## References

- Central Statistics Office (CSO) dataset: "Population by single year of age, administrative counties and sex" (FY006A). CSV download/API endpoint used:
  - https://ws.cso.ie/public/api.restful/PxStat.Data.Cube_API.ReadDataset/FY006A/CSV/1.0/en

- Methods and documentation:
  - Weighted mean / weighted average (used via numpy): https://numpy.org/doc/stable/reference/generated/numpy.average.html
  - pandas pivot_table and groupby (used for aggregations): https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
  - Converting and cleaning age strings: pandas string methods docs: https://pandas.pydata.org/docs/reference/series.html#string-handling

- Notes:
  - The code uses `np.average(..., weights=...)` to compute the weighted mean age by sex, with population counts in the `VALUE` column used as weights.
  - The age-group logic in Part 2 groups people whose single year of age falls within ±5 years of the chosen `age_of_interest`.



# END