assignment05-population.ipynb  
Author: Niamh Hogan

# Irish Census 2022 Population Analysis

This notebook uses population data from the 2022 Irish census to analyse age and sex differences. The dataset (FY006A) was downloaded as a CSV file from the [Central Statistics Office](https://data.cso.ie/#).

Part 1:

- Calculate the weighted mean age by sex

- Analyse the population difference between the sexes by age

Part 2:

- Create an age variable (age 25)

- Group ages within ±5 years of that value

- Calculate the population difference between the sexes in that age group

Part 3:

- Determine which Irish region has the largest sex difference in that same age group

A short Markdown explanation is included before each code cell. The code follows PEP8 style and avoids over-commenting.

<b>Step 1: Data Import & Cleaning</b>

In [1]:
# imports

import pandas as pd 
import numpy as np

<b>Data Import</b>  

I imported the CSO population dataset using *pd.read_csv()* and loaded it into a pandas DataFrame ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)). Displaying a random sample of 10 rows allowed me to confirm that the file was read correctly and that the structure of the data was as expected.

In [2]:
df = pd.read_csv('../data/Irish_population_cso.csv')
df.sample(10)

| | Statistic Label | CensusYear | Sex | Single Year of Age | Administrative Counties | UNIT | VALUE |
|---|---|---|---|---|---|---|---|
| 9658 | Population | 2022 | Female | 96 years | Mayo County Council | Number | 42 |
| 5240 | Population | 2022 | Male | 60 years | Galway County Council | Number | 1161 |
| 8114 | Population | 2022 | Female | 48 years | Cork County Council | Number | 2756 |
| 8756 | Population | 2022 | Female | 68 years | Limerick City & County Council | Number | 931 |
| 5476 | Population | 2022 | Male | 68 years | Fingal County Council | Number | 1077 |
| 8013 | Population | 2022 | Female | 45 years | Westmeath County Council | Number | 676 |
| 8848 | Population | 2022 | Female | 71 years | Clare County Council | Number | 595 |
| 129 | Population | 2022 | Both sexes | 3 years | Carlow County Council | Number | 754 |
| 7312 | Population | 2022 | Female | 23 years | Clare County Council | Number | 649 |
| 2519 | Population | 2022 | Both sexes | 77 years | Galway City Council | Number | 425 |


<b>Dropping Unnecessary Columns</b>  

I created a list of unnecessary columns and used *df.drop()* to remove them from the DataFrame ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)). This step keeps only the variables needed for the sex-by-age analysis. I set *inplace=True* so the DataFrame was updated directly ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html)). I printed the first 3 rows to check that the columns were removed correctly.

In [3]:
drop_col_list = ["Statistic Label", "CensusYear", "Administrative Counties", "UNIT"]

df.drop(columns=drop_col_list, inplace=True)

print(df.head(3))

          Sex Single Year of Age    VALUE
0  Both sexes           All ages  5149139
1  Both sexes           All ages    61968
2  Both sexes           All ages   592713


<b>Dropping Unnecessary Values</b>  

I filtered the dataset to remove rows where the age category was listed as *“All ages”*, since the analysis requires single-year age values.  
I also removed the rows where the sex category was *“Both sexes”*, keeping only *“Male”* and *“Female”* for comparison. Boolean indexing was used to apply these filters ([Official Pandas Documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)).

In [4]:
# Drop all ages
df = df[df["Single Year of Age"] != "All ages"] 

# Drop both sexes
df = df[df["Sex"] != "Both sexes"]
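
As a minimal sketch of the same filtering, the two conditions can also be combined into a single boolean mask with `&`. The DataFrame below is hypothetical (made-up rows that only mimic the structure of the CSO file), and the names `toy`, `mask` and `filtered` are illustrative.

```python
import pandas as pd

# Hypothetical rows mimicking the structure of the CSO data
toy = pd.DataFrame({
    "Sex": ["Both sexes", "Male", "Female", "Male"],
    "Single Year of Age": ["All ages", "All ages", "3 years", "4 years"],
    "VALUE": [100, 48, 25, 27],
})

# Keep rows that are neither "All ages" nor "Both sexes"
mask = (toy["Single Year of Age"] != "All ages") & (toy["Sex"] != "Both sexes")
filtered = toy[mask]

print(filtered)
```

Each comparison needs its own parentheses because `&` binds more tightly than `!=` in Python.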

<b>Replacing Value Name</b>  

I replaced the string value *“Under 1 year”* with *“0”* so that this age group can be treated as a numeric age in later calculations.  
I used the pandas string *.str.replace()* method to update the values in the Single Year of Age column ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)).

In [5]:
df["Single Year of Age"] = df["Single Year of Age"].str.replace("Under 1 year", "0")

<b>Removing Non-digit Characters</b>

I removed all non-digit characters from the *Single Year of Age* column using *str.replace()* ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)) with a regular expression ([Official Python Documentation](https://docs.python.org/3/library/re.html)).  
This ensures that age values are numeric only, which is necessary for calculations.

In [6]:
df["Single Year of Age"] = df["Single Year of Age"].str.replace(r"\D", "", regex=True)

<b>Checking Data Types</b>  

I checked the data types of all columns to confirm that the *Single Year of Age* column is still a string and that the other columns are in the expected format.  
This helps ensure that later conversions and calculations will work correctly.

In [7]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 6464 entries, 3296 to 9791
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Sex                 6464 non-null   object
 1   Single Year of Age  6464 non-null   object
 2   VALUE               6464 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 202.0+ KB
None


<b>Converting Age to Integer</b>  

I converted the *Single Year of Age* column to integer type using *astype("int64")* ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html)) to allow numeric calculations such as weighted mean and grouping by age.

In [8]:
# Convert single year of age to int
df["Single Year of Age"] = df["Single Year of Age"].astype("int64")

print(df.head(3))

       Sex  Single Year of Age  VALUE
3296  Male                   0  29610
3297  Male                   0    346
3298  Male                   0   3188
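
The three cleaning steps above (replacing "Under 1 year", stripping non-digit characters, converting to integer) can be sketched as one chained expression. The labels in `ages` are hypothetical examples written in the style of the CSO age column, not rows taken from the file.

```python
import pandas as pd

# Hypothetical age labels in the style of the "Single Year of Age" column
ages = pd.Series(["Under 1 year", "1 year", "25 years", "100 years and over"])

cleaned = (
    ages.str.replace("Under 1 year", "0", regex=False)  # literal replacement
    .str.replace(r"\D", "", regex=True)                 # drop every non-digit
    .astype("int64")                                    # strings -> integers
)

print(cleaned.tolist())
```

The order matters: the literal replacement has to run before the regex, otherwise "Under 1 year" would be reduced to "1" instead of "0".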


<b>Step 2: Converting to Pivot Table</b>  

I created a pivot table with *Single Year of Age* as the index and *Sex* as the columns, which organises the population data for analysis ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)).  
Because *pd.pivot_table()* aggregates with the mean by default, each cell is the mean of the rows for that age and sex (the national row and each administrative county), not a national total. These means were rounded and converted to integers to simplify calculations ([nkmk](https://note.nkmk.me/en/python-numpy-round/)).  
Finally, I saved the cleaned pivot table to a CSV file for use in further analysis ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)).

In [9]:
df_anal = pd.pivot_table(
    df,
    values="VALUE",
    index="Single Year of Age",
    columns=["Sex"],
)

df_anal['Female'] = df_anal['Female'].round().astype(int)
df_anal['Male'] = df_anal['Male'].round().astype(int)

print(df_anal.head(10))

df_anal.to_csv("../data/population_for_analysis.csv")

Sex                 Female  Male
Single Year of Age              
0                     1762  1851
1                     1722  1805
2                     1811  1890
3                     1843  1938
4                     1864  1980
5                     1959  2043
6                     2039  2131
7                     2098  2214
8                     2152  2268
9                     2202  2311
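
A minimal sketch of how the aggregation choice affects a pivot table: `pd.pivot_table()` uses the mean unless `aggfunc` is given, so several rows per age and sex are averaged rather than added. The data below is hypothetical (two made-up areas and one age), and the `Area` column name is an assumption for illustration.

```python
import pandas as pd

# Hypothetical data: two made-up areas, one single year of age
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female"],
    "Single Year of Age": [0, 0, 0, 0],
    "Area": ["County A", "County B", "County A", "County B"],
    "VALUE": [10, 30, 20, 20],
})

# Default aggregation: mean of the rows sharing an age and sex
mean_table = pd.pivot_table(
    toy, values="VALUE", index="Single Year of Age", columns="Sex"
)

# Explicit aggregation: total of the rows sharing an age and sex
sum_table = pd.pivot_table(
    toy, values="VALUE", index="Single Year of Age", columns="Sex",
    aggfunc="sum",
)

print(mean_table)
print(sum_table)
```

Weighted means are unaffected by the choice as long as every cell averages the same number of rows, but absolute differences are scaled by it.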


# <b>Part 1: Descriptive Statistics</b>

## Task 1: Weighted mean age by sex

<b>Reading in Data</b>  

I loaded the cleaned pivot table CSV, with ages as the index and sexes as columns, into a pandas DataFrame ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)).

In [10]:
file_path = "../data/population_for_analysis.csv"
df_anal = pd.read_csv(file_path, index_col=0)
print(df_anal.head(3))

                    Female  Male
Single Year of Age              
0                     1762  1851
1                     1722  1805
2                     1811  1890


<b>Define Columns for Each Sex</b>  

I created a list of the DataFrame column names and assigned the first column to *sex1* (Female) and the second to *sex2* (Male) ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html)).

In [11]:
# Column names from the pivot table
headers = list(df_anal)
sex1 = headers[0]  # Female
sex2 = headers[1]  # Male

sex1, sex2

('Female', 'Male')

<b>Calculating weighted mean age of Males & Females</b>  

I calculated the weighted mean age for each sex using *numpy.average()* ([Official NumPy Documentation](https://numpy.org/doc/stable/reference/generated/numpy.average.html)), with ages as the values and population counts as the weights. This gives the average age considering the number of people at each age.

In [12]:
w_fmean = np.average(df_anal.index, weights=df_anal[sex1])
print(f"Weighted mean age of female: {w_fmean}")

w_mmean = np.average(df_anal.index, weights=df_anal[sex2])
print(f"Weighted mean age of male: {w_mmean}")

Weighted mean age of female: 38.93960931261134
Weighted mean age of male: 37.74036581301912
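
As a small worked check of what `np.average` computes with `weights`, the sketch below compares it with the explicit formula sum(age × count) / sum(count). The ages and counts are hypothetical values chosen for easy arithmetic.

```python
import numpy as np

ages = np.array([0, 1, 2])
counts = np.array([10, 20, 30])  # hypothetical population at each age

# Weighted mean via NumPy
weighted = np.average(ages, weights=counts)

# Same quantity written out: sum(age * count) / sum(count)
manual = (ages * counts).sum() / counts.sum()

print(weighted, manual)
```

Here the numerator is 0·10 + 1·20 + 2·30 = 80 and the denominator is 60, so both expressions give 80/60.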


## Task 2: The difference between the sexes by age


I calculated the differences between the sexes by age using a copy of the pivoted population DataFrame to ensure that the original data remained unchanged:

- Absolute difference (abs_diff): the number of males minus the number of females at each age ([Official Pandas documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#selection-by-label)).

- Relative difference (relative_diff): the percentage difference relative to the total population at each age, calculated as (Male - Female) / (Male + Female) * 100 ([Official Pandas documentation](https://pandas.pydata.org/docs/user_guide/basics.html#arithmetic-operations)), ([splashlearn](https://www.splashlearn.com/math-vocabulary/percent-difference)).

In [13]:
df_diff = df_anal.copy()

# Calculate absolute difference (Male - Female)
df_diff['abs_diff'] = df_diff[sex2] - df_diff[sex1]

# Calculate relative difference (%) = (Male - Female) / (Male + Female) * 100
df_diff['relative_diff'] = (
    (df_diff[sex2] - df_diff[sex1])
    / (df_diff[sex1] + df_diff[sex2])
    * 100
)

# Display the first 10 rows
print(df_diff.head(10))


                    Female  Male  abs_diff  relative_diff
Single Year of Age                                       
0                     1762  1851        89       2.463327
1                     1722  1805        83       2.353275
2                     1811  1890        79       2.134558
3                     1843  1938        95       2.512563
4                     1864  1980       116       3.017690
5                     1959  2043        84       2.098951
6                     2039  2131        92       2.206235
7                     2098  2214       116       2.690167
8                     2152  2268       116       2.624434
9                     2202  2311       109       2.415245
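
The two measures can be checked by hand on a small hypothetical table. The counts below are made up, and the sketch uses the Male minus Female convention for both columns so that their signs agree.

```python
import pandas as pd

# Hypothetical counts for two ages
toy = pd.DataFrame(
    {"Female": [100, 90], "Male": [110, 90]},
    index=pd.Index([0, 1], name="Single Year of Age"),
)

# Absolute difference: Male - Female
toy["abs_diff"] = toy["Male"] - toy["Female"]

# Relative difference (%): (Male - Female) / (Male + Female) * 100
toy["relative_diff"] = toy["abs_diff"] / (toy["Male"] + toy["Female"]) * 100

print(toy)
```

For age 0 the absolute difference is 10 and the relative difference is 10 / 210 × 100, roughly 4.76%; for age 1 both are zero.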


# Part 2: Grouping

In this section, I will create a variable that stores age 25.

I will then write code that groups everyone within 5 years of 25 into one age group.

Finally, I will calculate the population difference between the sexes within ±5 years of age 25.

<b>Create Variable & Grouping</b>  

In this section, I created a variable *target_age* to specify the age of interest. I then:

- Added a new column *'AgeGroup'* that assigns all ages within ±5 years of the target age to the same group, and all other ages to *"Other"* ([Official Pandas Documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-with-series)), ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html)). I used a lambda function with a conditional expression to assign ages to an age group ([Official Python Documentation](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions)), ([Official Python Documentation](https://docs.python.org/3/reference/expressions.html#conditional-expressions)). 

- Used *groupby()* to sum the population for each sex within each age group ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)).

In [14]:
target_age = 25

df_group = df_anal.copy()

female_col = df_group.columns[0]
male_col = df_group.columns[1]

df_group['AgeGroup'] = df_group.index.to_series().apply(
    lambda x: f"{target_age-5}-{target_age+5}"
    if target_age-5 <= x <= target_age+5
    else "Other"
)

grouped = df_group.groupby('AgeGroup')[[female_col, male_col]].sum()

print(grouped)

          Female    Male
AgeGroup                
20-30      20810   20857
Other     141980  138186
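
As an alternative sketch of the same grouping, the window test can be written with `Series.between` and `np.where` instead of a lambda. This is only one possible formulation; the `window` variable and the toy counts (1 female and 2 males at every age from 18 to 32) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

target_age = 25
window = 5  # hypothetical name for the ±5-year half-width

# Toy data: ages 18..32 with constant made-up counts
ages = pd.RangeIndex(18, 33, name="Single Year of Age")
toy = pd.DataFrame({"Female": 1, "Male": 2}, index=ages)

label = f"{target_age - window}-{target_age + window}"

# between() is inclusive at both ends by default
in_window = toy.index.to_series().between(
    target_age - window, target_age + window
)
toy["AgeGroup"] = np.where(in_window, label, "Other")

grouped = toy.groupby("AgeGroup")[["Female", "Male"]].sum()
print(grouped)
```

Ages 20 to 30 inclusive cover 11 single years, so the "20-30" row sums 11 rows and "Other" sums the remaining 4.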


<b>Calculating Population Difference</b>

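I calculated the population difference between females and males within the ±5-year age group around the target age. Because the pivot table from Step 2 holds values averaged by *pd.pivot_table()* (its default aggregation is the mean), the difference is on that averaged scale rather than a head count.  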
I used *.loc[]* ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)) to select the row corresponding to the age group and the columns for each sex, then subtracted male population from female population.

In [15]:
age_label = f"{target_age-5}-{target_age+5}"
pop_diff = (
    grouped.loc[age_label, female_col]
    - grouped.loc[age_label, male_col]
)

print(f"Population difference between the sexes within 5 years of 25: {pop_diff}")

Population difference between the sexes within 5 years of 25: -47


# Part 3: Regional Differences

For this section, I will investigate which region in Ireland has the biggest population difference between the sexes within ±5 years of age 25.

## <b>Section 1: Sorting data for analysis</b>

<b>Read in Dataset</b>

In [16]:
df2 = pd.read_csv("../data/Irish_pop_cso_2.csv")
print(df2.head(3))

  Statistic Label  CensusYear         Sex Single Year of Age  \
0      Population        2022  Both sexes           All ages   
1      Population        2022  Both sexes           All ages   
2      Population        2022  Both sexes           All ages   

  Administrative Counties    UNIT    VALUE  
0                 Ireland  Number  5149139  
1   Carlow County Council  Number    61968  
2     Dublin City Council  Number   592713  


<b>Dropping Unnecessary Columns</b>  

I removed columns that are not needed for regional analysis, keeping only the columns for sex, age, administrative county, and population value.

In [17]:
drop_col_list = ["Statistic Label", "CensusYear", "UNIT"]
df2.drop(columns=drop_col_list, inplace=True)
print(df2.head(3))

          Sex Single Year of Age Administrative Counties    VALUE
0  Both sexes           All ages                 Ireland  5149139
1  Both sexes           All ages   Carlow County Council    61968
2  Both sexes           All ages     Dublin City Council   592713


<b>Dropping Unnecessary Rows</b>  

I filtered the dataset using boolean indexing ([Official Pandas Documentation](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing)) to remove:

- Rows representing all ages

- Rows where sex is “Both sexes”

- Rows representing the whole of Ireland

In [18]:
df2 = df2[df2["Single Year of Age"] != "All ages"]
df2 = df2[df2["Sex"] != "Both sexes"]
df2 = df2[df2["Administrative Counties"] != "Ireland"]

<b>Cleaning Age Column</b>  

- Replaced “Under 1 year” with 0 so it can be treated numerically ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html)).

- Removed all non-digit characters from the age column using a regex ([Official Python Documentation](https://docs.python.org/3/library/re.html)). 

In [19]:
df2["Single Year of Age"] = df2["Single Year of Age"].str.replace("Under 1 year", "0")
df2["Single Year of Age"] = df2["Single Year of Age"].str.replace(r"\D", "", regex=True)
print(df2.head(3))

       Sex Single Year of Age                Administrative Counties  VALUE
3297  Male                  0                  Carlow County Council    346
3298  Male                  0                    Dublin City Council   3188
3299  Male                  0  Dún Laoghaire Rathdown County Council   1269


<b>Checked Data Types</b>

In [20]:
print(df2.info())

<class 'pandas.core.frame.DataFrame'>
Index: 6262 entries, 3297 to 9791
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Sex                      6262 non-null   object
 1   Single Year of Age       6262 non-null   object
 2   Administrative Counties  6262 non-null   object
 3   VALUE                    6262 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 244.6+ KB
None


<b>Converted Age to Integer</b>

In [21]:
df2["Single Year of Age"] = df2["Single Year of Age"].astype("int64")
print(df2.head(3))

       Sex  Single Year of Age                Administrative Counties  VALUE
3297  Male                   0                  Carlow County Council    346
3298  Male                   0                    Dublin City Council   3188
3299  Male                   0  Dún Laoghaire Rathdown County Council   1269


<b>Creating Pivot Table</b>  

I created a pivot table with *Single Year of Age* as the index and a multi-level column structure of *Administrative Counties* and *Sex* ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)). The pivot table was saved to CSV for further analysis.

In [22]:
df2 = pd.pivot_table(
    df2,
    values="VALUE",
    index="Single Year of Age",
    columns=["Administrative Counties", "Sex"]
)

df2.to_csv("../data/population_for_analysis_2.csv")

## <b>Section 2: Analysing Data</b>

<b>Select ages 20-30</b>  

I selected rows where the index (age) is between 20 and 30 using *.loc[]*. This isolates the age group for which I want to analyse sex differences by county.

In [23]:
df_age_20_30 = df2.loc[20:30]
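
A short sketch of how label-based slicing behaves: with `.loc`, both endpoints of the slice are included, unlike positional slicing with `.iloc`. The toy frame below (ages 18 to 32 with a placeholder column) is hypothetical.

```python
import pandas as pd

# Toy frame indexed by age 18..32
toy = pd.DataFrame(
    {"count": range(18, 33)},
    index=pd.RangeIndex(18, 33, name="Single Year of Age"),
)

# Label-based: both 20 and 30 are included
subset = toy.loc[20:30]

# Position-based: the stop position is excluded (rows for ages 20..29)
positional = toy.iloc[2:12]

print(subset.index.min(), subset.index.max(), len(subset))
print(positional.index.min(), positional.index.max(), len(positional))
```

So `.loc[20:30]` on an age index returns 11 rows, one for each age from 20 to 30.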

<b>Prepare Containers for Results</b>  

I created empty dictionaries ([GeeksforGeeks](https://www.geeksforgeeks.org/python/initialize-an-empty-dictionary-in-python/)) to store:

- The absolute male–female population difference for each county

- The total male and female population for each county

In [24]:
differences = {}
totals = {}

<b>Calculate Differences per County</b>  

I looped through each county in the multi-level column index:

- Summed the male and female populations for each county

- Calculated the absolute difference between males and females

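- Stored total male and female populations in a dictionary

The loop reads from *df2*, which still holds every single year of age, so the sums and the reported result cover all ages; looping over *df_age_20_30* instead restricts them to the 20-30 bracket.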

I used the following references to create the code below: ([Official Pandas Documentation](https://pandas.pydata.org/docs/user_guide/advanced.html#multiindex)), ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sum.html)), ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.get_level_values.html)), ([Official Python Documentation](https://docs.python.org/3/reference/expressions.html#membership-testings)), ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html)).

In [25]:
for county in df2.columns.get_level_values(0).unique():
    if (county, 'Male') in df2.columns and (county, 'Female') in df2.columns:
        male_sum = df2[(county, 'Male')].sum()
        female_sum = df2[(county, 'Female')].sum()
        differences[county] = abs(male_sum - female_sum)
        totals[county] = {'Male': male_sum, 'Female': female_sum}
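
As a minimal sketch of the per-county comparison on an age slice, the example below sums a sliced pivot table and reshapes the result with `unstack`, as a vectorised alternative to an explicit loop. The counties, ages and counts are hypothetical, chosen so that the all-ages answer and the age-slice answer differ.

```python
import pandas as pd

# Hypothetical long-format data: two made-up counties, ages 19-21
rows = [
    ("County A", "Male", 19, 500), ("County A", "Female", 19, 400),
    ("County A", "Male", 20, 10), ("County A", "Female", 20, 12),
    ("County A", "Male", 21, 11), ("County A", "Female", 21, 13),
    ("County B", "Male", 19, 100), ("County B", "Female", 19, 100),
    ("County B", "Male", 20, 30), ("County B", "Female", 20, 20),
    ("County B", "Male", 21, 30), ("County B", "Female", 21, 25),
]
long_df = pd.DataFrame(rows, columns=["County", "Sex", "Age", "VALUE"])

wide = pd.pivot_table(
    long_df, values="VALUE", index="Age",
    columns=["County", "Sex"], aggfunc="sum",
)


def county_differences(frame):
    """Sum each (County, Sex) column, then compare the sexes per county."""
    by_county = frame.sum().unstack("Sex")
    by_county["Difference"] = (by_county["Male"] - by_county["Female"]).abs()
    return by_county


sliced = county_differences(wide.loc[20:21])  # ages 20-21 only
all_ages = county_differences(wide)           # every age in the table

print(sliced)
print(all_ages)
```

In this toy data the slice picks County B (difference 15) while the full table picks County A (difference 96), which shows why the frame passed to the summation determines the age range of the result.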

<b>Convert Differences to DataFrame</b>  

I converted the dictionary of differences into a pandas DataFrame for easier manipulation and analysis, with columns for County and Difference ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)).

In [26]:
diff_df = pd.DataFrame(list(differences.items()), columns=['County', 'Difference'])

<b>Identify County with Maximum Difference</b>  

I identified the county with the largest absolute difference between male and female populations using *idxmax()* ([Official Pandas Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html)) to find the row with the highest value.

In [27]:
max_row = diff_df.loc[diff_df['Difference'].idxmax()]
county_name = max_row['County']
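
A brief sketch of what `idxmax()` returns: the index label of the largest value, which `.loc` then turns back into a row. The counties and differences below are hypothetical.

```python
import pandas as pd

# Hypothetical per-county differences
diff_toy = pd.DataFrame({
    "County": ["County A", "County B", "County C"],
    "Difference": [4, 15, 9],
})

top_label = diff_toy["Difference"].idxmax()  # index label of the maximum
top_row = diff_toy.loc[top_label]            # full row at that label

# With the county as the index, idxmax() yields the name directly
top_county = diff_toy.set_index("County")["Difference"].idxmax()

print(top_label, top_row["County"], top_county)
```

If several rows tie for the maximum, `idxmax()` returns the first of them.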

<b>Determine Higher Sex in that County</b>  

I compared total male and female populations in the county with the largest difference to determine which sex is higher using conditional expressions ([Official Python Documentation](https://docs.python.org/3/reference/expressions.html#conditional-expressions)).

In [28]:
male_total = totals[county_name]['Male']
female_total = totals[county_name]['Female']
higher_sex = 'Male' if male_total > female_total else 'Female'

<b>Display Results</b>  

I printed the county name, the absolute difference, and which sex has the higher population to summarise the results clearly.  
The following references were used to write the code: ([Official Python Documentation](https://docs.python.org/3/reference/lexical_analysis.html#f-strings)), ([Official Python Documentation](https://docs.python.org/3/library/string.html#format-specification-mini-language)).

In [29]:
print(
    f"{county_name} is the county with the biggest male–female population difference "
    f"in the 20–30 age bracket."
)
print(f"Difference: {int(max_row['Difference']):,}")
print(f"{higher_sex} is the higher sex in this county.")

Dún Laoghaire Rathdown County Council is the county with the biggest male–female population difference in the 20–30 age bracket.
Difference: 9,796
Female is the higher sex in this county.


# End