# COMP40610 Information Visualisation Assignment Data Processing

## 1. Data Selection

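For this assignment, I chose datasets from the Labour Market section of the CSO High Value Datasets. As an international student in Ireland who is about to graduate, I am interested in the Irish job market. I want to analyse how the Irish labour market has changed in recent years under the influence of different factors.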

I downloaded the relevant datasets from the Labour Market and Earnings section of [CSO High Value Datasets](https://www.cso.ie/en/statistics/highvaluedatasetshvd). The raw data consists of:

 - Annual employment rate (ALF01)
 - Quarterly employment rate (QLF50)

## 2. Data Cleaning

In [None]:
# Import the required packages
import pandas as pd
import numpy as np
import os

### 2.1 Processing Annual Employment Rate Data

Processing steps:
- Read ALF01_Annual Employment Rate.csv
- Drop: UNIT, Statistic Label, Education Attainment Level
- Rename VALUE to employment_rate

In [None]:
# Read annual employment rate data
annual_df = pd.read_csv('../raw_datasets/ALF01_Annual Employment Rate.csv')

print(f"Original shape: {annual_df.shape}")
print(f"\nOriginal columns: {list(annual_df.columns)}")
print(f"\nFirst few rows:")
print(annual_df.head())

In [None]:
# Drop unnecessary columns
columns_to_drop = ['UNIT', 'Statistic Label', 'Education Attainment Level']
annual_df = annual_df.drop(columns=columns_to_drop)

# Rename VALUE to employment_rate
annual_df = annual_df.rename(columns={'VALUE': 'employment_rate'})

# Convert employment_rate to numeric
annual_df['employment_rate'] = pd.to_numeric(annual_df['employment_rate'], errors='coerce')

print(f"Cleaned shape: {annual_df.shape}")
print(f"\nCleaned columns: {list(annual_df.columns)}")
print(f"\nFirst few rows:")
print(annual_df.head())
print(f"\nData types:")
print(annual_df.dtypes)
print(f"\nMissing values:")
print(annual_df.isnull().sum())
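
Because `errors='coerce'` silently turns any non-numeric entry into `NaN`, it can help to see which raw values were affected before trusting the missing-value counts. The following is a minimal sketch on a made-up sample. The placeholder strings `"n/a"` and `".."` are assumptions for illustration, not values confirmed to appear in the CSO files.

```python
import pandas as pd

# Hypothetical sample standing in for the raw VALUE column (not real CSO data)
raw = pd.DataFrame({"VALUE": ["71.2", "n/a", "68.9", ".."]})

# Same conversion as in the cleaning cell above
converted = pd.to_numeric(raw["VALUE"], errors="coerce")

# Rows that held a value in the raw data but became NaN after coercion
coerced_mask = converted.isna() & raw["VALUE"].notna()

# Count each raw string that could not be parsed as a number
print(raw.loc[coerced_mask, "VALUE"].value_counts())
```

Running the same mask against the real `VALUE` column before it is overwritten would show whether the missing values come from genuinely empty cells or from text placeholders.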

### 2.2 Processing Quarterly Employment Rate Data

Processing steps:
- Read QLF50-Quarterly Employment Rate.csv
- Drop: UNIT, Statistic Label, Education Attainment Level
- Rename VALUE to employment_rate

In [None]:
# Read quarterly employment rate data
quarterly_df = pd.read_csv('../raw_datasets/QLF50-Quarterly Employment Rate.csv')

print(f"Original shape: {quarterly_df.shape}")
print(f"\nOriginal columns: {list(quarterly_df.columns)}")
print(f"\nFirst few rows:")
print(quarterly_df.head())

In [None]:
# Drop unnecessary columns
columns_to_drop = ['UNIT', 'Statistic Label', 'Education Attainment Level']
quarterly_df = quarterly_df.drop(columns=columns_to_drop)

# Rename VALUE to employment_rate
quarterly_df = quarterly_df.rename(columns={'VALUE': 'employment_rate'})

# Convert employment_rate to numeric
quarterly_df['employment_rate'] = pd.to_numeric(quarterly_df['employment_rate'], errors='coerce')

print(f"Cleaned shape: {quarterly_df.shape}")
print(f"\nCleaned columns: {list(quarterly_df.columns)}")
print(f"\nFirst few rows:")
print(quarterly_df.head())
print(f"\nData types:")
print(quarterly_df.dtypes)
print(f"\nMissing values:")
print(quarterly_df.isnull().sum())
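
The annual and quarterly cells apply the same three steps (drop, rename, convert), so they could be wrapped in one function. This is only a sketch: the helper name `clean_employment_rate` and the miniature sample frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def clean_employment_rate(df, columns_to_drop):
    """Drop metadata columns, rename VALUE, and coerce it to numeric."""
    cleaned = df.drop(columns=columns_to_drop)
    cleaned = cleaned.rename(columns={"VALUE": "employment_rate"})
    # Non-numeric entries become NaN rather than raising an error
    cleaned["employment_rate"] = pd.to_numeric(
        cleaned["employment_rate"], errors="coerce"
    )
    return cleaned

# Hypothetical miniature frame with a similar column layout (values are made up)
sample = pd.DataFrame({
    "Statistic Label": ["Employment rate", "Employment rate"],
    "Year": [2022, 2023],
    "Sex": ["Both sexes", "Both sexes"],
    "UNIT": ["%", "%"],
    "VALUE": ["70.1", "n/a"],
})

cleaned = clean_employment_rate(sample, ["UNIT", "Statistic Label"])
print(cleaned)
```

With the real data, the same call would take `annual_df` or `quarterly_df` and the `columns_to_drop` list defined above.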

## 3. Export Cleaned Datasets

In [None]:
# Export annual employment rate data
annual_output_path = 'annual_employment_rate_cleaned.csv'
annual_df.to_csv(annual_output_path, index=False)
print(f"Annual employment rate data exported to: {annual_output_path}")

In [None]:
# Export quarterly employment rate data
quarterly_output_path = 'quarterly_employment_rate_cleaned.csv'
quarterly_df.to_csv(quarterly_output_path, index=False)
print(f"Quarterly employment rate data exported to: {quarterly_output_path}")
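
As a quick sanity check on the export, the written CSV can be read back and compared with the frame in memory. The sketch below uses a made-up frame and a temporary directory so that it runs on its own. In the notebook the same comparison would use `annual_df` and `annual_output_path`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame (made-up values) standing in for annual_df
cleaned = pd.DataFrame({
    "Year": [2022, 2023],
    "Sex": ["Male", "Female"],
    "employment_rate": [70.5, None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "annual_employment_rate_cleaned.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The reloaded file should keep the shape, column order, and missing-value count
same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
same_missing = (
    reloaded["employment_rate"].isna().sum()
    == cleaned["employment_rate"].isna().sum()
)
print(same_shape, same_columns, same_missing)
```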

## 4. Data Summary

### Annual Employment Rate Dataset
- **Remaining columns**: Year, Age Group, Sex, NUTS 2 Region, employment_rate
- **Dropped columns**: UNIT, Statistic Label, Education Attainment Level

### Quarterly Employment Rate Dataset
- **Remaining columns**: Quarter, Sex, Age Group, employment_rate
- **Dropped columns**: UNIT, Statistic Label, Education Attainment Level

In [None]:
# Final check - display unique values for key columns
print("=" * 50)
print("ANNUAL EMPLOYMENT RATE DATASET")
print("=" * 50)
print(f"\nUnique Years: {sorted(annual_df['Year'].unique())}")
print(f"\nUnique Age Groups: {annual_df['Age Group'].unique()}")
print(f"\nUnique Sex categories: {annual_df['Sex'].unique()}")
print(f"\nUnique Regions: {annual_df['NUTS 2 Region'].unique()}")
print(f"\nEmployment rate range: {annual_df['employment_rate'].min():.1f}% - {annual_df['employment_rate'].max():.1f}%")

print("\n" + "=" * 50)
print("QUARTERLY EMPLOYMENT RATE DATASET")
print("=" * 50)
print(f"\nUnique Quarters: {len(quarterly_df['Quarter'].unique())} quarters")
print(f"Quarter range: {quarterly_df['Quarter'].min()} to {quarterly_df['Quarter'].max()}")
print(f"\nUnique Age Groups: {quarterly_df['Age Group'].unique()}")
print(f"\nUnique Sex categories: {quarterly_df['Sex'].unique()}")
print(f"\nEmployment rate range: {quarterly_df['employment_rate'].min():.1f}% - {quarterly_df['employment_rate'].max():.1f}%")
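
The quarter range above compares the `Quarter` labels as plain strings. If the labels follow a pattern such as `2019Q1` (an assumption about the file, not something verified here), they can be converted to pandas periods so that sorting and any later time axis are chronological by construction. A minimal sketch with made-up rows:

```python
import pandas as pd

# Hypothetical rows; the "2019Q1" label format is an assumption
sample = pd.DataFrame({
    "Quarter": ["2019Q4", "2020Q1", "2019Q1"],
    "employment_rate": [70.2, 69.8, 69.5],
})

# Parse the labels into quarterly periods
sample["quarter_period"] = pd.PeriodIndex(sample["Quarter"], freq="Q")

# Sort chronologically using the parsed periods
ordered = sample.sort_values("quarter_period")
print(ordered[["Quarter", "employment_rate"]])
print(f"Quarter range: {ordered['quarter_period'].min()} to {ordered['quarter_period'].max()}")
```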