# COMP40610 Information Visualisation Assignment Data Processing

## 1. Data Selection

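The datasets for this assignment come from the Labour Market section of the CSO High Value Datasets. As an international student in Ireland who is about to graduate, I am interested in the Irish job market, so I want to analyse how the Irish labour market has changed in recent years under the influence of different factors.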

I downloaded the relevant datasets from the Labour Market and Earnings section of [CSO High Value Datasets](https://www.cso.ie/en/statistics/highvaluedatasetshvd). The raw data comprise:

 - Annual employment rate
 - Annual percentage of part-time work
 - Annual unemployment rate
 - Annual long term unemployment rate
 - Annual percentage of potential additional labour force
 - Quarterly employment rate
 - Quarterly unemployment and long-term unemployment rate

## 2. Tasks and Question Identification

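Based on the datasets provided by the CSO, I want to explore the following five questions:
1. How do the employment and unemployment rates differ by sex, and does this gap follow different trends across regions, education levels, or age groups?
2. Because of their roles in childbearing and the family, some women may give up full-time work to take on family responsibilities and move to part-time work instead. I therefore also want to analyse how the part-time rate differs by sex.
3. In Ireland, what are the employment rate, part-time rate, unemployment rate, long-term unemployment rate, and potential additional labour force at each education level?
	- Add a breakdown by sex
4. How have Ireland's overall employment and unemployment rates changed over time?
	- This can be shown by quarter
	- Or by year
	- Note: this can show the impact of COVID-19 on the employment rate
5. Employment patterns and rate differences across age groups
	- Compare the youth unemployment rate with Ireland's overall unemployment rate
	- Add the change over time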
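
Questions 1 and 2 both come down to comparing male and female rates within the same group. The following is a minimal sketch of that comparison, assuming long-format rows with `Year`, `Sex` and `VALUE` columns as in the CSO files. The numbers and the names `toy`, `by_sex` and `Gap (pp)` are made up for illustration only.

```python
import pandas as pd

# Toy long-format rows mimicking the CSO layout (values are invented)
toy = pd.DataFrame({
    "Year": [2019, 2019, 2020, 2020],
    "Sex": ["Male", "Female", "Male", "Female"],
    "VALUE": [80.0, 70.0, 78.0, 69.0],
})

# One row per year, one column per sex
by_sex = toy.pivot(index="Year", columns="Sex", values="VALUE")

# Gender gap in percentage points (male minus female)
by_sex["Gap (pp)"] = by_sex["Male"] - by_sex["Female"]
print(by_sex)
```

With the real data, the same pivot could be applied after filtering to a single statistic, region, education level and age group, so that each (`Year`, `Sex`) pair appears only once.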

## 3. Data Cleaning

To analyse these tasks and questions, I need to combine all five annual datasets into one.

In [12]:
# Import the required packages
# Import package pandas for data analysis
import pandas as pd

# Import package numpy for numeric computing
import numpy as np

import os



- **Import datasets**

In [29]:
files = {
    '../raw_datasets/ALF01_Annual Employment Rate.csv': 'Employment Rate',
    '../raw_datasets/ALF02_Annual Percentage of Part-time Work.csv': 'Part-time Work Rate',
    '../raw_datasets/ALF03_Annual Unemployment Rate.csv': 'Unemployment Rate',
    '../raw_datasets/ALF04_Annual Long Term Unemployment Rate.csv': 'Long-term Unemployment Rate',
    '../raw_datasets/ALF05_Annual Percentage of Potential Additional Labour Force.csv': 'Potential Labour Force'
}

# Collect one DataFrame per raw file
dataframes = []

for filename, label in files.items():
    df = pd.read_csv(filename)
    # Tag every row with a short, consistent statistic name
    df['Statistic Label'] = label
    dataframes.append(df)
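
Before stacking the files it can help to check whether they all share the same columns. `pd.concat` aligns frames on column names and fills any column a file lacks with NaN, which may explain part of the missing 'Education Attainment Level' and 'NUTS 2 Region' entries reported later. The sketch below uses two tiny in-memory CSVs with invented contents. The names `toy_files`, `frames` and `missing_by_label` are illustrative and not part of the notebook.

```python
import io
import pandas as pd

# Two tiny in-memory CSVs standing in for the raw files (contents are invented)
toy_files = {
    "Employment Rate": io.StringIO(
        "Year,Sex,NUTS 2 Region,VALUE\n2019,Male,Ireland,80.0\n"
    ),
    "Potential Labour Force": io.StringIO("Year,Sex,VALUE\n2019,Male,5.0\n"),
}

frames = {label: pd.read_csv(handle) for label, handle in toy_files.items()}

# Union of every column name seen in any file
all_columns = set().union(*(frame.columns for frame in frames.values()))

# For each file, list the columns it does not provide
missing_by_label = {
    label: sorted(all_columns - set(frame.columns))
    for label, frame in frames.items()
}
print(missing_by_label)
```

A non-empty list for a file means that every row from that file will carry NaN in those columns after concatenation.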

- **Combine all datasets**

In [30]:
combined_df = pd.concat(dataframes, ignore_index=True)
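
`pd.concat` stacks the frames row-wise, and `ignore_index=True` replaces the per-file indices with one continuous index. A minimal sketch with invented frames (the names `parts` and `stacked` are illustrative) that checks no rows are lost in the process:

```python
import pandas as pd

# Two invented frames standing in for the per-file DataFrames
parts = [
    pd.DataFrame({"VALUE": [1.0, 2.0]}),
    pd.DataFrame({"VALUE": [3.0]}),
]

stacked = pd.concat(parts, ignore_index=True)

# The combined frame should hold exactly the sum of the input rows
expected_rows = sum(len(part) for part in parts)
print(len(stacked) == expected_rows)

# ignore_index=True gives a fresh 0..n-1 index instead of repeating 0, 1, 0
print(list(stacked.index))
```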

In [31]:
combined_df.dtypes

print(combined_df.shape)

(7704, 8)


- **Delete unnecessary column**

    Inspecting the raw dataset shows that the 'UNIT' column has only one unique value, the '%' sign that gives the unit of 'VALUE'.
    
    I therefore decided to drop this column.

In [32]:
count_unique = combined_df.nunique()

print("Number of unique values in each column:\n", count_unique)

Number of unique values in each column:
 Statistic Label                 5
Year                            6
Age Group                      13
Sex                             3
Education Attainment Level      6
NUTS 2 Region                   4
UNIT                            1
VALUE                         876
dtype: int64
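
The same inspection can be turned into a small rule: any column with a single distinct value carries no information for the charts. This is a minimal sketch on an invented frame. The names `toy` and `constant_columns` are assumptions for illustration.

```python
import pandas as pd

# Invented frame with one constant column, mirroring the 'UNIT' situation
toy = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "UNIT": ["%", "%", "%"],
    "VALUE": [74.9, 70.1, 72.3],
})

# Columns with exactly one distinct value (NaN counted as a value)
constant_columns = [
    col for col in toy.columns if toy[col].nunique(dropna=False) == 1
]
print(constant_columns)

# drop() returns a new DataFrame, so the result has to be assigned back
toy = toy.drop(columns=constant_columns)
print(list(toy.columns))
```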


In [34]:
# drop() returns a new DataFrame, so assign the result back
combined_df = combined_df.drop(columns=['UNIT'])
combined_df

Unnamed: 0,Statistic Label,Year,Age Group,Sex,Education Attainment Level,NUTS 2 Region,VALUE
0,Employment Rate,2019,All ages,Both sexes,Levels of Education (Levels 0-8),Ireland,74.9
1,Employment Rate,2019,All ages,Both sexes,Levels of Education (Levels 0-8),Northern and Western,74.0
2,Employment Rate,2019,All ages,Both sexes,Levels of Education (Levels 0-8),Southern,72.8
3,Employment Rate,2019,All ages,Both sexes,Levels of Education (Levels 0-8),Eastern and Midland,76.6
4,Employment Rate,2019,All ages,Both sexes,Less than primary (Level 0),Ireland,
...,...,...,...,...,...,...,...
7699,Potential Labour Force,2024,25 - 54 years,Male,,,76.4
7700,Potential Labour Force,2024,25 - 54 years,Female,,,74.7
7701,Potential Labour Force,2024,55 - 74 years,Both sexes,,,92.0
7702,Potential Labour Force,2024,55 - 74 years,Male,,,93.0


- **Eliminate duplicated rows**

In [35]:
combined_df = combined_df.drop_duplicates()

- **Deal with missing values**

In [37]:
print("Missing values:")
print(combined_df.isnull().sum())

Missing values:
Statistic Label                  0
Year                             0
Age Group                        0
Sex                              0
Education Attainment Level     144
NUTS 2 Region                 1656
VALUE                         3558
dtype: int64
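
Before dropping rows it can be useful to see where the missing 'VALUE' entries are concentrated. This is a minimal sketch on invented data that computes the share of missing values per statistic. The names `toy` and `missing_share` are illustrative, while 'Statistic Label' and 'VALUE' are the column names used in the notebook.

```python
import pandas as pd

# Invented rows: one statistic is mostly complete, the other mostly empty
toy = pd.DataFrame({
    "Statistic Label": ["Employment Rate"] * 4 + ["Unemployment Rate"] * 4,
    "VALUE": [74.9, None, 72.8, 76.6, 5.0, None, None, None],
})

# Share of missing VALUE entries within each statistic, as a percentage
missing_share = (
    toy["VALUE"].isna()
    .groupby(toy["Statistic Label"])
    .mean()
    .mul(100)
)
print(missing_share)
```

If the missing values turned out to sit almost entirely in one statistic or one breakdown, that would be worth knowing before deciding which charts the cleaned data can support.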


The 'VALUE' column has 3558 missing entries out of 7704 rows (about 46%). This many gaps would strongly affect the visualisations, so I decided to drop every row with a missing 'VALUE'.

In [38]:
combined_df = combined_df.dropna(subset=['VALUE'])

In [39]:
print(f"Rows after dropping: {len(combined_df)}")
print(f"Remaining missing values: {combined_df['VALUE'].isnull().sum()}")

Rows after dropping: 4146
Remaining missing values: 0


## 4. Save the cleaned dataset

In [41]:
combined_df.to_csv('combined_clean_data.csv', index=False)
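
A quick way to confirm the export is to read the file back and compare it with the frame in memory. The sketch below does this for an invented frame in a temporary directory, so it does not touch `combined_clean_data.csv`. The names `toy`, `path` and `reloaded` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Invented frame standing in for the cleaned data
toy = pd.DataFrame({"Year": [2019, 2020], "VALUE": [74.9, 70.1]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_clean_data.csv")
    # index=False keeps the row index out of the file
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same shape and no missing VALUE entries after the round trip
print(reloaded.shape == toy.shape)
print(reloaded["VALUE"].isna().sum())
```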