# Assignment 5
Author: Anna Lozenko

In this Jupyter notebook, I analyze Ireland's population data from a CSV file, focusing on the differences between the sexes across age groups.

The CSV file has been downloaded from the [Central Statistics Office Ireland](https://data.cso.ie/#) website, for census year 2022.

In [247]:
import pandas as pd

## Part 1: Data Cleaning

In [248]:
# import the population data from the CSV file as a Pandas DataFrame and display the first 5 rows
df = pd.read_csv("population.csv")
print(df.head(5))

  Statistic Label  CensusYear         Sex Single Year of Age  \
0      Population        2022  Both sexes           All ages   
1      Population        2022  Both sexes           All ages   
2      Population        2022  Both sexes           All ages   
3      Population        2022  Both sexes           All ages   
4      Population        2022  Both sexes           All ages   

                 Administrative Counties    UNIT    VALUE  
0                                Ireland  Number  5149139  
1                  Carlow County Council  Number    61968  
2                    Dublin City Council  Number   592713  
3  Dún Laoghaire Rathdown County Council  Number   233860  
4                  Fingal County Council  Number   330506  


Inspect the column names, then drop the rows and columns that are not needed.

In [249]:
# Get the list of column names
df.columns.values.tolist()

['Statistic Label',
 'CensusYear',
 'Sex',
 'Single Year of Age',
 'Administrative Counties',
 'UNIT',
 'VALUE']

In [250]:
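# drop the aggregate rows so that totals are not mixed with the per-county, per-age, per-sex rows
df = df[df['Administrative Counties'] != 'Ireland']
df = df[df['Single Year of Age'] != 'All ages']
df = df[df['Sex'] != 'Both sexes']

# keep only the columns needed for the analysis; copy so that later assignments do not act on a slice
df = df[['Sex', 'Single Year of Age', 'VALUE']].copy()
df.head(5)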

       Sex Single Year of Age  VALUE
3297  Male       Under 1 year    346
3298  Male       Under 1 year   3188
3299  Male       Under 1 year   1269
3300  Male       Under 1 year   2059
3301  Male       Under 1 year   1855
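
The row and column filtering in the cell above can also be written as a single boolean mask. The following is a minimal sketch on a tiny hand-made frame; the `toy` rows and their values are made up for illustration and are not taken from the CSO file.

```python
import pandas as pd

# Toy data (hypothetical values, not from the CSO file)
toy = pd.DataFrame({
    "Sex": ["Both sexes", "Male", "Female", "Male"],
    "Single Year of Age": ["All ages", "Under 1 year", "Under 1 year", "All ages"],
    "Administrative Counties": [
        "Ireland",
        "Carlow County Council",
        "Carlow County Council",
        "Carlow County Council",
    ],
    "VALUE": [100, 10, 12, 50],
})

# A row is kept only if it is not an aggregate in any of the three columns
mask = (
    (toy["Administrative Counties"] != "Ireland")
    & (toy["Single Year of Age"] != "All ages")
    & (toy["Sex"] != "Both sexes")
)

# Select the rows and the three columns in one step, as an independent copy
filtered = toy.loc[mask, ["Sex", "Single Year of Age", "VALUE"]].copy()
print(filtered)
```

Combining the conditions with `&` gives the same rows as three successive filters, and `.copy()` makes the result safe to modify afterwards.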


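The cell above deletes the aggregate rows ("Ireland" in the Administrative Counties column, "All ages" in the Single Year of Age column, and "Both sexes" in the Sex column) and keeps only the Sex, Single Year of Age, and VALUE columns.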

Convert the string values in the 'Single Year of Age' column to integers using regular expressions.

In [251]:
# replace "Under 1 year" with "0" first; stripping non-digits directly would turn it into "1"
df["Single Year of Age"] = df["Single Year of Age"].str.replace("Under 1 year", "0", regex=False)
df.head()

       Sex Single Year of Age  VALUE
3297  Male                  0    346
3298  Male                  0   3188
3299  Male                  0   1269
3300  Male                  0   2059
3301  Male                  0   1855


In [252]:
# replace "100 years and over" with "100" (a literal match, so no regular expression is needed)
df["Single Year of Age"] = df["Single Year of Age"].str.replace("100 years and over", "100", regex=False)
df.tail()

         Sex Single Year of Age  VALUE
9787  Female                100      7
9788  Female                100      9
9789  Female                100     12
9790  Female                100     31
9791  Female                100      7


In [253]:
# remove any remaining non-numeric characters (like "years") from the 'Single Year of Age' column, then convert to int
df["Single Year of Age"] = df["Single Year of Age"].str.replace(r'\D', '', regex=True).astype(int)
df.dtypes

Sex                   object
Single Year of Age     int32
VALUE                  int64
dtype: object
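
The three conversion steps can be checked on a handful of labels. This is a minimal sketch; "Under 1 year" and "100 years and over" appear in the data above, while "1 year" and "25 years" are assumed label formats used only for illustration.

```python
import pandas as pd

# Sample labels; "1 year" and "25 years" are assumed formats
labels = pd.Series(["Under 1 year", "1 year", "25 years", "100 years and over"])

ages = (
    labels
    # map the special label first, otherwise stripping non-digits would leave "1"
    .replace({"Under 1 year": "0"})
    # drop every non-digit character, e.g. "25 years" becomes "25"
    .str.replace(r"\D", "", regex=True)
    .astype(int)
)
print(ages.tolist())
```

The order matters: "Under 1 year" has to be handled before the digits are extracted, or it would collide with age 1.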

Create a pivot table from the cleaned data and save it to a new CSV file. The pivot table has 'Single Year of Age' as the index, 'Sex' as the columns, and 'VALUE' as the values.

In [254]:
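# create the pivot table and display the first 5 rows
# note: with the default aggfunc ("mean"), each cell is the average population per administrative county
ready_df = pd.pivot_table(df, values="VALUE", index="Single Year of Age", columns="Sex")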
ready_df.head()

Sex                     Female         Male
Single Year of Age
0                   909.225806    955.16129
1                   888.548387   931.451613
2                   934.645161   975.354839
3                   951.064516  1000.032258
4                   961.903226  1022.129032
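
`pd.pivot_table` aggregates with the mean by default, which is why the values above are fractional: each cell is averaged over the administrative counties. If national totals per age and sex are wanted instead, `aggfunc="sum"` can be passed. The sketch below compares the two on a toy frame with made-up numbers.

```python
import pandas as pd

# Toy data (hypothetical values): two counties per sex, all for age 0
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Female", "Female"],
    "Single Year of Age": [0, 0, 0, 0],
    "VALUE": [10, 30, 12, 20],
})

# Default aggregation: mean of VALUE within each (age, sex) group
mean_table = pd.pivot_table(
    toy, values="VALUE", index="Single Year of Age", columns="Sex"
)

# Explicit sum: total of VALUE within each (age, sex) group
sum_table = pd.pivot_table(
    toy, values="VALUE", index="Single Year of Age", columns="Sex", aggfunc="sum"
)

print(mean_table)
print(sum_table)
```

Because every age and sex combination here has the same number of county rows, the mean and the sum differ only by a constant factor, so ratios between the sexes are unaffected by the choice.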


In [255]:
# save the pivot table to a new CSV file
ready_df.to_csv("cleaned_population.csv")
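
When the saved file is loaded again, the age index has to be restored explicitly. The following is a minimal sketch of the round trip; it uses an in-memory buffer instead of a file, and the small `table` frame with its values is made up for illustration.

```python
import io

import pandas as pd

# Toy pivot-style table (hypothetical values) indexed by age
table = pd.DataFrame(
    {"Female": [16.0], "Male": [20.0]},
    index=pd.Index([0], name="Single Year of Age"),
)

# Write to an in-memory buffer; with a file path the call is the same
buffer = io.StringIO()
table.to_csv(buffer)
buffer.seek(0)

# index_col turns the first CSV column back into the index
restored = pd.read_csv(buffer, index_col="Single Year of Age")
print(restored)
```

Without `index_col`, the ages would come back as an ordinary column next to a fresh 0-based index.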

## Part 2: Data Analysis