# AI Job Dataset Cleaning and Merging

## Project Objectives

This notebook aims to:
1. Clean the `ai_job_dataset.csv` and `ai_job_dataset1.csv` datasets
2. Transform them to match the format of `ai_job_market_cleaned.csv`
3. Merge all three datasets into a single unified dataset

## Key Tasks

- **Standardize job titles**
- **Convert salaries to USD**
- **Standardize data formats**
- **Handle missing values**
- **Map and derive fields**

## 1. Import Required Libraries

In [35]:
import pandas as pd
import numpy as np
import warnings
from datetime import datetime

warnings.filterwarnings('ignore')

# 设置pandas显示选项 / Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)

print("库导入成功! / Libraries imported successfully!")

库导入成功! / Libraries imported successfully!


## 2. Load Datasets

In [None]:
# 加载三个数据集 / Load three datasets
df_dataset = pd.read_csv('../dataset/ai_job_dataset.csv')
df_dataset1 = pd.read_csv('../dataset/ai_job_dataset1.csv')
df_cleaned = pd.read_csv('../dataset/ai_job_market_cleaned.csv')

print("数据集加载完成! / Datasets loaded successfully!")
print(f"\nai_job_dataset: {df_dataset.shape}")
print(f"ai_job_dataset1: {df_dataset1.shape}")
print(f"ai_job_market_cleaned: {df_cleaned.shape}")

数据集加载完成! / Datasets loaded successfully!

ai_job_dataset: (15000, 19)
ai_job_dataset1: (15000, 20)
ai_job_market_cleaned: (2000, 36)


## 3. Data Exploration

First, examine the structure and columns of each dataset.

In [37]:
print("=" * 80)
print("ai_job_dataset 字段 / Columns:")
print("=" * 80)
print(df_dataset.columns.tolist())
print(f"\n前5行 / First 5 rows:")
print(df_dataset.head())

print("\n" + "=" * 80)
print("ai_job_dataset1 字段 / Columns:")
print("=" * 80)
print(df_dataset1.columns.tolist())
print(f"\n前5行 / First 5 rows:")
print(df_dataset1.head())

print("\n" + "=" * 80)
print("ai_job_market_cleaned 字段 / Columns (目标格式 / Target format):")
print("=" * 80)
print(df_cleaned.columns.tolist())

ai_job_dataset 字段 / Columns:
['job_id', 'job_title', 'salary_usd', 'salary_currency', 'experience_level', 'employment_type', 'company_location', 'company_size', 'employee_residence', 'remote_ratio', 'required_skills', 'education_required', 'years_experience', 'industry', 'posting_date', 'application_deadline', 'job_description_length', 'benefits_score', 'company_name']

前5行 / First 5 rows:
    job_id              job_title  salary_usd salary_currency  \
0  AI00001  AI Research Scientist       90376             USD   
1  AI00002   AI Software Engineer       61895             USD   
2  AI00003          AI Specialist      152626             USD   
3  AI00004           NLP Engineer       80215             USD   
4  AI00005          AI Consultant       54624             EUR   

  experience_level employment_type company_location company_size  \
0               SE              CT            China            M   
1               EN              CT           Canada            M   
2           

## 4. Define Helper Functions

### 4.1 Currency Conversion Function

In [38]:
# Currency conversion rates to USD (2024 average rates)
CURRENCY_TO_USD = {
    'USD': 1.0,
    'EUR': 1.09,      # Euro
    'GBP': 1.26,      # British Pound
    'JPY': 0.0069,    # Japanese Yen
    'CAD': 0.74,      # Canadian Dollar
    'AUD': 0.67,      # Australian Dollar
    'CHF': 1.13,      # Swiss Franc
    'SGD': 0.75,      # Singapore Dollar
    'INR': 0.012,     # Indian Rupee
    'CNY': 0.14,      # Chinese Yuan
}

def convert_to_usd(salary, currency):
    """
    Convert a salary to USD.

    Parameters:
    -----------
    salary : float
        Salary amount
    currency : str
        Currency code

    Returns:
    --------
    float : Converted USD amount, or NaN if an input is missing
            or the currency code is not in CURRENCY_TO_USD
    """
    if pd.isna(salary) or pd.isna(currency):
        return np.nan

    # Unknown codes return NaN instead of being silently treated as USD
    rate = CURRENCY_TO_USD.get(currency)
    if rate is None:
        return np.nan
    return round(salary * rate, 2)

print("薪资转换函数定义完成! / Currency conversion function defined!")

薪资转换函数定义完成! / Currency conversion function defined!
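
Applying `convert_to_usd` row by row works, but the same lookup can also be written column-wise. The following is a minimal sketch, assuming a small subset of the rate table. `convert_series_to_usd`, `RATES`, and the sample frame are hypothetical names introduced for illustration and are not part of the notebook. Currency codes missing from the table produce `NaN`, which makes them easy to list before any salary statistics are computed.

```python
import pandas as pd

# Subset of CURRENCY_TO_USD, repeated here so the sketch is self-contained
RATES = {"USD": 1.0, "EUR": 1.09, "GBP": 1.26}

def convert_series_to_usd(amounts, currencies, rates):
    """Column-wise conversion; unknown or missing currency codes give NaN."""
    rate = currencies.map(rates)      # Series.map leaves unmapped codes as NaN
    return (amounts * rate).round(2)

# Hypothetical sample rows, not taken from the real datasets
sample = pd.DataFrame({
    "salary_local": [50000, 40000, 30000],
    "salary_currency": ["EUR", "USD", "XYZ"],
})
sample["salary_usd_converted"] = convert_series_to_usd(
    sample["salary_local"], sample["salary_currency"], RATES
)

# Currency codes that the rate table does not cover
unknown = sorted(set(sample["salary_currency"].dropna()) - set(RATES))
print(sample)
print("Unknown currency codes:", unknown)
```

Listing the unknown codes first is a cheap way to decide whether the rate table needs more entries.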


### 4.2 Job Title Standardization Function

In [39]:
# 岗位名称映射字典 / Job title mapping dictionary
JOB_TITLE_MAPPING = {
    'Data Scientist': 'Data Scientist',
    'Machine Learning Engineer': 'ML Engineer',
    'ML Engineer': 'ML Engineer',
    'Data Engineer': 'Data Engineer',
    'Data Analyst': 'Data Analyst',
    'AI Research Scientist': 'Data Scientist',
    'Research Scientist': 'Data Scientist',
    'Machine Learning Researcher': 'Data Scientist',
    'NLP Engineer': 'NLP Engineer',
    'Nlp Engineer': 'NLP Engineer',
    'Computer Vision Engineer': 'Computer Vision Engineer',
    'AI Architect': 'AI Architect',
    'AI Specialist': 'AI Specialist',
    'AI Consultant': 'AI Consultant',
    'AI Software Engineer': 'AI Software Engineer',
    'AI Product Manager': 'AI Product Manager',
    'Ai Product Manager': 'AI Product Manager',
    'Deep Learning Engineer': 'ML Engineer',
    'ML Ops Engineer': 'ML Ops Engineer',
    'MLOps Engineer': 'ML Ops Engineer',
    'Robotics Engineer': 'Robotics Engineer',
    'Autonomous Systems Engineer': 'Autonomous Systems Engineer',
    'Head of AI': 'Head Of AI',
    'Principal Data Scientist': 'Principal Data Scientist',
    'Quant Researcher': 'Quant Researcher',
}

def standardize_job_title(title):
    """
    标准化岗位名称 / Standardize job title
    
    Parameters:
    -----------
    title : str
        原始岗位名称 / Original job title
        
    Returns:
    --------
    str : 标准化后的岗位名称 / Standardized job title
    """
    if pd.isna(title):
        return np.nan
    
    # 先尝试直接映射 / Try direct mapping first
    if title in JOB_TITLE_MAPPING:
        return JOB_TITLE_MAPPING[title]
    
    # 如果没有映射，返回原值 / Return original if no mapping
    return title

print("岗位名称标准化函数定义完成! / Job title standardization function defined!")

岗位名称标准化函数定义完成! / Job title standardization function defined!
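
The mapping above lists casing variants such as `'Nlp Engineer'` and `'Ai Product Manager'` explicitly. An alternative is to normalize case and whitespace before the lookup, so each title needs only one entry. This is a sketch under that assumption, using a subset of the mapping. `standardize_job_title_ci` and `LOOKUP` are hypothetical names, not functions defined in the notebook.

```python
import numpy as np
import pandas as pd

# Subset of JOB_TITLE_MAPPING, repeated so the sketch is self-contained
MAPPING = {
    "Machine Learning Engineer": "ML Engineer",
    "NLP Engineer": "NLP Engineer",
    "AI Product Manager": "AI Product Manager",
}

# Case-insensitive lookup table built once from the mapping keys
LOOKUP = {key.casefold(): value for key, value in MAPPING.items()}

def standardize_job_title_ci(title, lookup=LOOKUP):
    """Look up a title after collapsing whitespace and ignoring case."""
    if pd.isna(title):
        return np.nan
    key = " ".join(str(title).split()).casefold()
    # Titles without a mapping are returned unchanged
    return lookup.get(key, title)

titles = pd.Series(["nlp  engineer ", "Ai Product Manager", "Quant Developer", np.nan])
print(titles.apply(standardize_job_title_ci))
```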


### 4.3 Other Standardization Functions

In [40]:
# 经验等级映射 / Experience level mapping
EXPERIENCE_MAPPING = {
    'EN': 'Entry',
    'MI': 'Mid',
    'SE': 'Senior',
    'EX': 'Senior',   # Executive is folded into Senior
    'Entry': 'Entry',
    'Mid': 'Mid',
    'Senior': 'Senior',
}

# 雇佣类型映射 / Employment type mapping
EMPLOYMENT_MAPPING = {
    'FT': 'Full-time',
    'PT': 'Part-time',
    'CT': 'Contract',
    'FL': 'Freelance',
    'Full-time': 'Full-time',
    'Part-time': 'Part-time',
    'Contract': 'Contract',
    'Freelance': 'Freelance',
    'Remote': 'Remote',
    'Internship': 'Internship',
}

# 公司规模映射 / Company size mapping
COMPANY_SIZE_MAPPING = {
    'S': 'Startup',
    'M': 'Mid',
    'L': 'Large',
    'Startup': 'Startup',
    'Mid': 'Mid',
    'Large': 'Large',
}

# 行业标准化 / Industry standardization
INDUSTRY_MAPPING = {
    'Technology': 'Tech',
    'Tech': 'Tech',
    'Finance': 'Finance',
    'Healthcare': 'Healthcare',
    'E-commerce': 'E-commerce',
    'Retail': 'Retail',
    'Education': 'Education',
    'Automotive': 'Automotive',
    'Manufacturing': 'Manufacturing',
    'Consulting': 'Consulting',
    'Media': 'Media',
    'Gaming': 'Gaming',
    'Transportation': 'Transportation',
    'Telecommunications': 'Telecommunications',
    'Energy': 'Energy',
    'Government': 'Government',
    'Real Estate': 'Real Estate',
}

def standardize_field(value, mapping_dict):
    """
    通用字段标准化函数 / Generic field standardization function
    
    Parameters:
    -----------
    value : str
        字段值 / Field value
    mapping_dict : dict
        映射字典 / Mapping dictionary
        
    Returns:
    --------
    str : 标准化后的值 / Standardized value
    """
    if pd.isna(value):
        return np.nan
    return mapping_dict.get(value, value)

print("其他标准化函数定义完成! / Other standardization functions defined!")

其他标准化函数定义完成! / Other standardization functions defined!
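
Because `standardize_field` passes unmapped values through unchanged, a typo or an unexpected code stays in the output without any warning. The sketch below shows one way to list such values before applying a mapping. `unmapped_values` is a hypothetical helper, and the sample series is made up for illustration.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Count non-null values in `series` that are not keys of `mapping`."""
    values = series.dropna()
    return values[~values.isin(list(mapping))].value_counts()

# Subset of EMPLOYMENT_MAPPING and a hypothetical column with one unexpected code
employment_mapping = {"FT": "Full-time", "PT": "Part-time"}
employment_type = pd.Series(["FT", "PT", "XX", "XX", None])

print(unmapped_values(employment_type, employment_mapping))
```

An empty result means every non-null value is covered by the mapping.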


### 4.4 Country Code and Region Mapping

In [41]:
# 国家名称到国家代码的映射 / Country name to country code mapping
COUNTRY_TO_CODE = {
    # North America / 北美洲
    'United States': 'US', 'Canada': 'CA', 'Mexico': 'MX',
    
    # Europe / 欧洲
    'United Kingdom': 'GB', 'Germany': 'DE', 'France': 'FR', 'Italy': 'IT',
    'Spain': 'ES', 'Netherlands': 'NL', 'Sweden': 'SE', 'Norway': 'NO',
    'Poland': 'PL', 'Turkey': 'TR', 'Portugal': 'PT', 'Belgium': 'BE',
    'Romania': 'RO', 'Bulgaria': 'BG', 'Czech Republic': 'CZ', 'Hungary': 'HU',
    'Lithuania': 'LT', 'Latvia': 'LV', 'Luxembourg': 'LU', 'Finland': 'FI',
    'Austria': 'AT', 'Switzerland': 'CH', 'Denmark': 'DK', 'Ireland': 'IE',
    'Croatia': 'HR', 'Ukraine': 'UA', 'Russia': 'RU', 'Greece': 'GR',
    'Bosnia and Herzegovina': 'BA',
    
    # Asia / 亚洲
    'China': 'CN', 'Japan': 'JP', 'India': 'IN', 'South Korea': 'KR',
    'Singapore': 'SG', 'Thailand': 'TH', 'Malaysia': 'MY', 'Vietnam': 'VN',
    'Philippines': 'PH', 'Taiwan': 'TW', 'Hong Kong': 'HK', 'Pakistan': 'PK',
    'Bangladesh': 'BD', 'Indonesia': 'ID', 'Israel': 'IL',
    
    # Oceania / 大洋洲
    'Australia': 'AU', 'New Zealand': 'NZ', 'Fiji': 'FJ',
    
    # South America / 南美洲
    'Brazil': 'BR', 'Argentina': 'AR', 'Chile': 'CL', 'Colombia': 'CO',
    'Peru': 'PE', 'Venezuela': 'VE', 'Uruguay': 'UY',
    
    # Africa / 非洲
    'South Africa': 'ZA', 'Egypt': 'EG', 'Nigeria': 'NG', 'Kenya': 'KE',
    'Ghana': 'GH', 'Tanzania': 'TZ',
    
    # Middle East / 中东
    'Saudi Arabia': 'SA', 'UAE': 'AE', 'Qatar': 'QA',
}

# 国家代码到区域的映射 / Country code to region mapping
COUNTRY_CODE_TO_REGION = {
    # North America / 北美洲
    'US': 'North America', 'CA': 'North America', 'MX': 'North America',
    
    # Europe / 欧洲
    'GB': 'Europe', 'DE': 'Europe', 'FR': 'Europe', 'IT': 'Europe',
    'ES': 'Europe', 'NL': 'Europe', 'SE': 'Europe', 'NO': 'Europe',
    'PL': 'Europe', 'TR': 'Europe', 'PT': 'Europe', 'BE': 'Europe',
    'RO': 'Europe', 'BG': 'Europe', 'CZ': 'Europe', 'HU': 'Europe',
    'LT': 'Europe', 'LV': 'Europe', 'LU': 'Europe', 'FI': 'Europe',
    'AT': 'Europe', 'CH': 'Europe', 'DK': 'Europe', 'IE': 'Europe',
    'HR': 'Europe', 'UA': 'Europe', 'RU': 'Europe', 'GR': 'Europe',
    'BA': 'Europe',
    
    # Asia / 亚洲
    'CN': 'Asia', 'JP': 'Asia', 'IN': 'Asia', 'KR': 'Asia',
    'SG': 'Asia', 'TH': 'Asia', 'MY': 'Asia', 'VN': 'Asia',
    'PH': 'Asia', 'TW': 'Asia', 'HK': 'Asia', 'PK': 'Asia',
    'BD': 'Asia', 'ID': 'Asia', 'IL': 'Asia',
    
    # Oceania / 大洋洲
    'AU': 'Oceania', 'NZ': 'Oceania', 'FJ': 'Oceania',
    
    # South America / 南美洲
    'BR': 'South America', 'AR': 'South America', 'CL': 'South America',
    'CO': 'South America', 'PE': 'South America', 'VE': 'South America',
    'UY': 'South America',
    
    # Africa / 非洲
    'ZA': 'Africa', 'EG': 'Africa', 'NG': 'Africa', 'KE': 'Africa',
    'GH': 'Africa', 'TZ': 'Africa',
    
    # Middle East / 中东
    'SA': 'Middle East', 'AE': 'Middle East', 'QA': 'Middle East',
}

def get_country_code(country_name):
    """
    从国家名称获取国家代码 / Get country code from country name
    
    Parameters:
    -----------
    country_name : str
        国家名称 / Country name
        
    Returns:
    --------
    str : 国家代码 / Country code
    """
    if pd.isna(country_name):
        return np.nan
    return COUNTRY_TO_CODE.get(country_name, np.nan)

def get_region(country_code):
    """
    从国家代码获取区域 / Get region from country code
    
    Parameters:
    -----------
    country_code : str
        国家代码 / Country code
        
    Returns:
    --------
    str : 区域名称 / Region name
    """
    if pd.isna(country_code):
        return np.nan
    return COUNTRY_CODE_TO_REGION.get(country_code, np.nan)

print("国家代码和区域映射函数定义完成! / Country code and region mapping functions defined!")

国家代码和区域映射函数定义完成! / Country code and region mapping functions defined!
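
The two lookup tables have to stay in sync, and `get_country_code` returns `NaN` for any name it does not know. The following sketch shows two quick checks, assuming small subsets of the tables. The sample location names, including the alternative spelling, are hypothetical and not taken from the datasets.

```python
import pandas as pd

# Small subsets of COUNTRY_TO_CODE and COUNTRY_CODE_TO_REGION, for illustration only
country_to_code = {"United States": "US", "Germany": "DE", "UAE": "AE"}
code_to_region = {"US": "North America", "DE": "Europe", "AE": "Middle East"}

# 1. Every code produced by the first table should have a region in the second
codes_without_region = sorted(set(country_to_code.values()) - set(code_to_region))

# 2. List location names that the first table does not cover
locations = pd.Series(["Germany", "United Arab Emirates", "UAE", None])
unmapped_names = sorted(set(locations.dropna()) - set(country_to_code))

print("Codes without a region:", codes_without_region)
print("Unmapped location names:", unmapped_names)
```

Running the second check on the real `company_location` column would show which names, if any, end up with a missing `country_code`.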


## 5. Clean ai_job_dataset

Transform `ai_job_dataset` into the target format.

In [42]:
# 创建副本以避免修改原数据 / Create a copy to avoid modifying original data
df1 = df_dataset.copy()

print("开始清洗 ai_job_dataset... / Starting to clean ai_job_dataset...")
print(f"原始数据形状 / Original shape: {df1.shape}")

# 1. salary_usd is already denominated in USD, so only cast and round it
df1['salary_usd_converted'] = df1['salary_usd'].astype(float).round(2)

# 2. 标准化岗位名称 / Standardize job title
df1['job_title_standardized'] = df1['job_title'].apply(standardize_job_title)

# 3. 标准化其他字段 / Standardize other fields
df1['experience_level_standardized'] = df1['experience_level'].apply(
    lambda x: standardize_field(x, EXPERIENCE_MAPPING)
)
df1['employment_type_standardized'] = df1['employment_type'].apply(
    lambda x: standardize_field(x, EMPLOYMENT_MAPPING)
)
df1['company_size_standardized'] = df1['company_size'].apply(
    lambda x: standardize_field(x, COMPANY_SIZE_MAPPING)
)
df1['industry_standardized'] = df1['industry'].apply(
    lambda x: standardize_field(x, INDUSTRY_MAPPING)
)

# 4. 处理技能字段 / Process skills field
# 将逗号分隔的字符串转换为列表格式 / Convert comma-separated string to list format
df1['skills_list'] = df1['required_skills'].apply(
    lambda x: str(x).split(', ') if pd.notna(x) else []
)
df1['num_skills_required'] = df1['skills_list'].apply(len)

# 5. 计算薪资范围 / Calculate salary range
# 由于原数据只有单一薪资值,我们创建一个±10%的范围 / Since original has single salary, create ±10% range
df1['salary_min'] = (df1['salary_usd_converted'] * 0.9).round(0).astype('Int64')
df1['salary_max'] = (df1['salary_usd_converted'] * 1.1).round(0).astype('Int64')
df1['salary_avg'] = df1['salary_usd_converted'].round(0).astype('Int64')

# 6. 创建薪资范围字符串 / Create salary range string
df1['salary_range_usd'] = df1.apply(
    lambda row: f"{row['salary_min']}-{row['salary_max']}" if pd.notna(row['salary_min']) else np.nan,
    axis=1
)

# 7. 薪资分类 / Salary category
def categorize_salary(salary):
    """
    薪资分类函数 / Salary categorization function
    """
    if pd.isna(salary):
        return np.nan
    if salary < 70000:
        return 'Low (<70K)'
    elif salary < 100000:
        return 'Mid (70K-100K)'
    elif salary < 150000:
        return 'High (100K-150K)'
    else:
        return 'Very High (>150K)'

df1['salary_category'] = df1['salary_avg'].apply(categorize_salary)

# 8. 处理日期字段 / Process date fields
df1['posted_date_parsed'] = pd.to_datetime(df1['posting_date'], format='%Y/%m/%d', errors='coerce')
df1['posted_year'] = df1['posted_date_parsed'].dt.year
df1['posted_month'] = df1['posted_date_parsed'].dt.month
df1['posted_quarter'] = df1['posted_date_parsed'].dt.quarter
df1['posted_month_name'] = df1['posted_date_parsed'].dt.strftime('%B')

# 9. 推导国家代码和区域 / Derive country code and region
df1['country_code'] = df1['company_location'].apply(get_country_code)
df1['region'] = df1['country_code'].apply(get_region)

print(f"清洗后数据形状 / Cleaned shape: {df1.shape}")
print("ai_job_dataset 清洗完成! / ai_job_dataset cleaning completed!")

开始清洗 ai_job_dataset... / Starting to clean ai_job_dataset...
原始数据形状 / Original shape: (15000, 19)
清洗后数据形状 / Cleaned shape: (15000, 39)
ai_job_dataset 清洗完成! / ai_job_dataset cleaning completed!
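
`categorize_salary` is applied one value at a time. The same thresholds can be expressed with `pd.cut`, shown here as a sketch on made-up salaries. With `right=False` each bin includes its lower edge and excludes its upper edge, which matches the `<` comparisons in the function.

```python
import numpy as np
import pandas as pd

# Same thresholds and labels as categorize_salary
bins = [-np.inf, 70_000, 100_000, 150_000, np.inf]
labels = ["Low (<70K)", "Mid (70K-100K)", "High (100K-150K)", "Very High (>150K)"]

# Hypothetical salaries, including boundary values and a missing value
salaries = pd.Series([65_000, 70_000, 99_999, 150_000, np.nan])

# right=False makes the intervals [lower, upper), so 70_000 falls in "Mid"
categories = pd.cut(salaries, bins=bins, labels=labels, right=False)
print(categories)
```

The result is a categorical column with an explicit order, which is convenient for sorting and plotting. A missing salary stays missing.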


## 6. Clean ai_job_dataset1

Transform `ai_job_dataset1` into the target format.

In [43]:
# 创建副本 / Create a copy
df2 = df_dataset1.copy()

print("开始清洗 ai_job_dataset1... / Starting to clean ai_job_dataset1...")
print(f"原始数据形状 / Original shape: {df2.shape}")

# 1. Convert salary to USD (note: in dataset1 the local-currency amount is in the salary_local column)
df2['salary_usd_converted'] = df2.apply(
    lambda row: convert_to_usd(row['salary_local'], row['salary_currency']), 
    axis=1
)

# 2. 标准化岗位名称 / Standardize job title
df2['job_title_standardized'] = df2['job_title'].apply(standardize_job_title)

# 3. 标准化其他字段 / Standardize other fields
df2['experience_level_standardized'] = df2['experience_level'].apply(
    lambda x: standardize_field(x, EXPERIENCE_MAPPING)
)
df2['employment_type_standardized'] = df2['employment_type'].apply(
    lambda x: standardize_field(x, EMPLOYMENT_MAPPING)
)
df2['company_size_standardized'] = df2['company_size'].apply(
    lambda x: standardize_field(x, COMPANY_SIZE_MAPPING)
)
df2['industry_standardized'] = df2['industry'].apply(
    lambda x: standardize_field(x, INDUSTRY_MAPPING)
)

# 4. 处理技能字段 / Process skills field
df2['skills_list'] = df2['required_skills'].apply(
    lambda x: str(x).split(', ') if pd.notna(x) else []
)
df2['num_skills_required'] = df2['skills_list'].apply(len)

# 5. 计算薪资范围 / Calculate salary range (±10%)
df2['salary_min'] = (df2['salary_usd_converted'] * 0.9).round(0).astype('Int64')
df2['salary_max'] = (df2['salary_usd_converted'] * 1.1).round(0).astype('Int64')
df2['salary_avg'] = df2['salary_usd_converted'].round(0).astype('Int64')

# 6. 创建薪资范围字符串 / Create salary range string
df2['salary_range_usd'] = df2.apply(
    lambda row: f"{row['salary_min']}-{row['salary_max']}" if pd.notna(row['salary_min']) else np.nan,
    axis=1
)

# 7. 薪资分类 / Salary category
df2['salary_category'] = df2['salary_avg'].apply(categorize_salary)

# 8. 处理日期字段 / Process date fields
df2['posted_date_parsed'] = pd.to_datetime(df2['posting_date'], format='%Y/%m/%d', errors='coerce')
df2['posted_year'] = df2['posted_date_parsed'].dt.year
df2['posted_month'] = df2['posted_date_parsed'].dt.month
df2['posted_quarter'] = df2['posted_date_parsed'].dt.quarter
df2['posted_month_name'] = df2['posted_date_parsed'].dt.strftime('%B')

# 9. 推导国家代码和区域 / Derive country code and region
df2['country_code'] = df2['company_location'].apply(get_country_code)
df2['region'] = df2['country_code'].apply(get_region)

print(f"清洗后数据形状 / Cleaned shape: {df2.shape}")
print("ai_job_dataset1 清洗完成! / ai_job_dataset1 cleaning completed!")

开始清洗 ai_job_dataset1... / Starting to clean ai_job_dataset1...
原始数据形状 / Original shape: (15000, 20)
清洗后数据形状 / Cleaned shape: (15000, 40)
ai_job_dataset1 清洗完成! / ai_job_dataset1 cleaning completed!
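
Both cleaning cells parse `posting_date` with a fixed `format` and `errors='coerce'`, so any date written in a different layout silently becomes `NaT`. The sketch below, on hypothetical date strings, counts values that were present in the raw column but failed to parse.

```python
import pandas as pd

# Hypothetical raw values: one matching the format, one not, one missing
raw_dates = pd.Series(["2024/10/18", "2024-10-18", None])

parsed = pd.to_datetime(raw_dates, format="%Y/%m/%d", errors="coerce")

# Present in the raw data but NaT after parsing means a format mismatch
n_unparsed = int((parsed.isna() & raw_dates.notna()).sum())
print(parsed)
print("Values that failed to parse:", n_unparsed)
```

A non-zero count on the real columns would indicate that the `format` string does not match every row.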


## 7. Map to Target Format

Reorganize the data according to the column structure of `ai_job_market_cleaned`.

In [44]:
# 查看目标格式的所有列 / Check all columns in target format
target_columns = df_cleaned.columns.tolist()
print("目标数据集的列 / Target dataset columns:")
print(target_columns)
print(f"\n总共 {len(target_columns)} 列 / Total {len(target_columns)} columns")

目标数据集的列 / Target dataset columns:
['job_id', 'company_name', 'industry', 'job_title', 'skills_required', 'experience_level', 'employment_type', 'location', 'salary_range_usd', 'posted_date', 'company_size', 'tools_preferred', 'job_title_standardized', 'job_category', 'skills_list', 'tools_list', 'num_skills_required', 'num_tools_preferred', 'salary_min', 'salary_max', 'salary_avg', 'salary_category', 'city', 'country_code', 'country', 'region', 'industry_standardized', 'industry_category', 'experience_level_standardized', 'employment_type_standardized', 'company_size_standardized', 'posted_date_parsed', 'posted_year', 'posted_month', 'posted_quarter', 'posted_month_name']

总共 36 列 / Total 36 columns


### 7.1 Map columns for ai_job_dataset

In [45]:
"""
Create DataFrames matching the target format.

Note:
- Columns that cannot be derived from the original data are set to NaN or a default value
- Removed columns: city, tools_preferred, tools_list, num_tools_preferred
- Newly derived columns: country_code, region
"""

# 为 df1 (ai_job_dataset) 创建目标格式的DataFrame / Create target format DataFrame for df1
df1_formatted = pd.DataFrame()

# 直接映射或已处理的列 / Directly mapped or processed columns
df1_formatted['job_id'] = df1['job_id']
df1_formatted['company_name'] = df1['company_name']
df1_formatted['industry'] = df1['industry_standardized']
df1_formatted['job_title'] = df1['job_title_standardized']
df1_formatted['skills_required'] = df1['required_skills']  # 保持逗号分隔格式 / Keep comma-separated format
df1_formatted['experience_level'] = df1['experience_level_standardized']
df1_formatted['employment_type'] = df1['employment_type_standardized']
df1_formatted['location'] = df1['company_location']  # 使用公司位置 / Use company location
df1_formatted['salary_range_usd'] = df1['salary_range_usd']
df1_formatted['posted_date'] = df1['posting_date']
df1_formatted['company_size'] = df1['company_size_standardized']

# 岗位标准化和分类 / Job title standardization and categorization
df1_formatted['job_title_standardized'] = df1['job_title_standardized']
df1_formatted['job_category'] = df1['job_title_standardized']  # 使用标准化岗位名称作为类别 / Use standardized title as category

# 技能列表相关 / Skills list related
df1_formatted['skills_list'] = df1['skills_list'].apply(str)  # 转为字符串表示 / Convert to string representation
df1_formatted['num_skills_required'] = df1['num_skills_required']

# 薪资相关 / Salary related
df1_formatted['salary_min'] = df1['salary_min']
df1_formatted['salary_max'] = df1['salary_max']
df1_formatted['salary_avg'] = df1['salary_avg']
df1_formatted['salary_category'] = df1['salary_category']

# 地理信息 - 从company_location推导 / Geographic info - derived from company_location
df1_formatted['country_code'] = df1['country_code']
df1_formatted['country'] = df1['company_location']
df1_formatted['region'] = df1['region']

# 行业分类 / Industry category
industry_category_map = {
    'Tech': 'Technology',
    'Finance': 'Finance & Banking',
    'Healthcare': 'Healthcare',
    'E-commerce': 'Retail & E-commerce',
    'Retail': 'Retail & E-commerce',
    'Education': 'Education',
    'Automotive': 'Automotive & Transportation',
    'Manufacturing': 'Manufacturing',
    'Consulting': 'Consulting',
    'Media': 'Media & Entertainment',
    'Gaming': 'Gaming',
    'Transportation': 'Automotive & Transportation',
    'Telecommunications': 'Telecommunications',
    'Energy': 'Energy',
    'Government': 'Government',
    'Real Estate': 'Real Estate',
}
df1_formatted['industry_standardized'] = df1['industry_standardized']
df1_formatted['industry_category'] = df1['industry_standardized'].apply(
    lambda x: industry_category_map.get(x, x) if pd.notna(x) else np.nan
)

# 经验等级和雇佣类型标准化 / Experience level and employment type standardized
df1_formatted['experience_level_standardized'] = df1['experience_level_standardized']
employment_type_std_map = {
    'Full-time': 'Full-Time',
    'Part-time': 'Part-Time',
    'Contract': 'Contract',
    'Freelance': 'Freelance',
    'Remote': 'Remote',
}
df1_formatted['employment_type_standardized'] = df1['employment_type_standardized'].apply(
    lambda x: employment_type_std_map.get(x, x) if pd.notna(x) else np.nan
)
df1_formatted['company_size_standardized'] = df1['company_size_standardized']

# 日期相关 / Date related
df1_formatted['posted_date_parsed'] = df1['posted_date_parsed']
df1_formatted['posted_year'] = df1['posted_year']
df1_formatted['posted_month'] = df1['posted_month']
df1_formatted['posted_quarter'] = df1['posted_quarter']
df1_formatted['posted_month_name'] = df1['posted_month_name']

print(f"df1_formatted 形状 / Shape: {df1_formatted.shape}")
print(f"列数 / Number of columns: {len(df1_formatted.columns)}")
print("\ndf1格式化完成! / df1 formatting completed!")

df1_formatted 形状 / Shape: (15000, 32)
列数 / Number of columns: 32

df1格式化完成! / df1 formatting completed!


### 7.2 Map columns for ai_job_dataset1

In [46]:
# 为 df2 (ai_job_dataset1) 创建目标格式的DataFrame / Create target format DataFrame for df2
df2_formatted = pd.DataFrame()

# 直接映射或已处理的列 / Directly mapped or processed columns
df2_formatted['job_id'] = df2['job_id']
df2_formatted['company_name'] = df2['company_name']
df2_formatted['industry'] = df2['industry_standardized']
df2_formatted['job_title'] = df2['job_title_standardized']
df2_formatted['skills_required'] = df2['required_skills']
df2_formatted['experience_level'] = df2['experience_level_standardized']
df2_formatted['employment_type'] = df2['employment_type_standardized']
df2_formatted['location'] = df2['company_location']
df2_formatted['salary_range_usd'] = df2['salary_range_usd']
df2_formatted['posted_date'] = df2['posting_date']
df2_formatted['company_size'] = df2['company_size_standardized']

# 岗位标准化和分类 / Job title standardization and categorization
df2_formatted['job_title_standardized'] = df2['job_title_standardized']
df2_formatted['job_category'] = df2['job_title_standardized']

# 技能列表相关 / Skills list related
df2_formatted['skills_list'] = df2['skills_list'].apply(str)
df2_formatted['num_skills_required'] = df2['num_skills_required']

# 薪资相关 / Salary related
df2_formatted['salary_min'] = df2['salary_min']
df2_formatted['salary_max'] = df2['salary_max']
df2_formatted['salary_avg'] = df2['salary_avg']
df2_formatted['salary_category'] = df2['salary_category']

# 地理信息 - 从company_location推导 / Geographic info - derived from company_location
df2_formatted['country_code'] = df2['country_code']
df2_formatted['country'] = df2['company_location']
df2_formatted['region'] = df2['region']

# 行业分类 / Industry category
df2_formatted['industry_standardized'] = df2['industry_standardized']
df2_formatted['industry_category'] = df2['industry_standardized'].apply(
    lambda x: industry_category_map.get(x, x) if pd.notna(x) else np.nan
)

# 经验等级和雇佣类型标准化 / Experience and employment standardized
df2_formatted['experience_level_standardized'] = df2['experience_level_standardized']
df2_formatted['employment_type_standardized'] = df2['employment_type_standardized'].apply(
    lambda x: employment_type_std_map.get(x, x) if pd.notna(x) else np.nan
)
df2_formatted['company_size_standardized'] = df2['company_size_standardized']

# 日期相关 / Date related
df2_formatted['posted_date_parsed'] = df2['posted_date_parsed']
df2_formatted['posted_year'] = df2['posted_year']
df2_formatted['posted_month'] = df2['posted_month']
df2_formatted['posted_quarter'] = df2['posted_quarter']
df2_formatted['posted_month_name'] = df2['posted_month_name']

print(f"df2_formatted 形状 / Shape: {df2_formatted.shape}")
print(f"列数 / Number of columns: {len(df2_formatted.columns)}")
print("\ndf2格式化完成! / df2 formatting completed!")

df2_formatted 形状 / Shape: (15000, 32)
列数 / Number of columns: 32

df2格式化完成! / df2 formatting completed!


## 8. Remove Unnecessary Columns and Prepare for Merging

Remove the `city`, `tools_preferred`, `tools_list`, and `num_tools_preferred` columns from `df_cleaned`.

In [47]:
# 定义要移除的列 / Define columns to remove
columns_to_remove = ['city', 'tools_preferred', 'tools_list', 'num_tools_preferred']

# 从 df_cleaned 中移除这些列 / Remove these columns from df_cleaned
df_cleaned_updated = df_cleaned.drop(columns=columns_to_remove, errors='ignore')

print(f"从 df_cleaned 移除了以下列 / Removed following columns from df_cleaned:")
print(columns_to_remove)
print(f"\n更新后的 df_cleaned 形状 / Updated df_cleaned shape: {df_cleaned_updated.shape}")
print(f"列数 / Number of columns: {len(df_cleaned_updated.columns)}")

从 df_cleaned 移除了以下列 / Removed following columns from df_cleaned:
['city', 'tools_preferred', 'tools_list', 'num_tools_preferred']

更新后的 df_cleaned 形状 / Updated df_cleaned shape: (2000, 32)
列数 / Number of columns: 32
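
Matching column counts (32 in each frame) do not by themselves guarantee matching column names. A small comparison, sketched below with made-up frames, reports which target columns a frame lacks and which extra columns it carries. `column_diff` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def column_diff(frame, target_columns):
    """Return (missing, extra) column names relative to the target layout."""
    missing = [c for c in target_columns if c not in frame.columns]
    extra = [c for c in frame.columns if c not in target_columns]
    return missing, extra

# Hypothetical frame and target layout, for illustration only
frame = pd.DataFrame(columns=["job_id", "job_title", "city"])
target = ["job_id", "job_title", "region"]

missing, extra = column_diff(frame, target)
print("Missing from frame:", missing)
print("Not in target:", extra)
```

Both lists being empty for each formatted frame would confirm that selecting the target columns cannot fail.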


## 9. Merge Three Datasets

Now merge the three formatted datasets into one unified dataset.

In [48]:
# 确保所有数据集的列顺序一致 / Ensure all datasets have the same column order
# 使用更新后的 df_cleaned 的列作为标准 / Use updated df_cleaned columns as standard
target_cols = df_cleaned_updated.columns.tolist()

print("目标列 / Target columns:")
print(target_cols)
print(f"\n总共 {len(target_cols)} 列 / Total {len(target_cols)} columns")

# Reorder the columns of df1_formatted and df2_formatted to match the target
# Note: this selection raises a KeyError if any target column is missing from a frame
df1_final = df1_formatted[target_cols].copy()
df2_final = df2_formatted[target_cols].copy()
df_cleaned_final = df_cleaned_updated[target_cols].copy()

print("\n数据集列对齐完成! / Dataset columns aligned!")
print(f"df1_final: {df1_final.shape}")
print(f"df2_final: {df2_final.shape}")
print(f"df_cleaned_final: {df_cleaned_final.shape}")

# 合并数据集 / Merge datasets
# 使用 ignore_index=True 重置索引 / Use ignore_index=True to reset index
df_merged = pd.concat([df_cleaned_final, df1_final, df2_final], ignore_index=True)

print(f"\n合并后的数据集形状 / Merged dataset shape: {df_merged.shape}")
print(f"总记录数 / Total records: {len(df_merged)}")

目标列 / Target columns:
['job_id', 'company_name', 'industry', 'job_title', 'skills_required', 'experience_level', 'employment_type', 'location', 'salary_range_usd', 'posted_date', 'company_size', 'job_title_standardized', 'job_category', 'skills_list', 'num_skills_required', 'salary_min', 'salary_max', 'salary_avg', 'salary_category', 'country_code', 'country', 'region', 'industry_standardized', 'industry_category', 'experience_level_standardized', 'employment_type_standardized', 'company_size_standardized', 'posted_date_parsed', 'posted_year', 'posted_month', 'posted_quarter', 'posted_month_name']

总共 32 列 / Total 32 columns

数据集列对齐完成! / Dataset columns aligned!
df1_final: (15000, 32)
df2_final: (15000, 32)
df_cleaned_final: (2000, 32)

合并后的数据集形状 / Merged dataset shape: (32000, 32)
总记录数 / Total records: 32000
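
After `pd.concat` with `ignore_index=True`, the merged frame no longer records which file a row came from, and `job_id` values from different files may collide. The sketch below, on made-up rows, tags each frame with a `source` column before concatenating and then lists identifiers that occur more than once. The `source` column is an assumption introduced for illustration and is not part of the target format.

```python
import pandas as pd

# Hypothetical rows standing in for two of the formatted frames
a = pd.DataFrame({"job_id": ["AI00001", "AI00002"], "salary_avg": [90000, 60000]})
b = pd.DataFrame({"job_id": ["AI00001", "AI00003"], "salary_avg": [85000, 70000]})

# Tag each frame with its origin before concatenating
merged = pd.concat(
    [a.assign(source="ai_job_dataset"), b.assign(source="ai_job_dataset1")],
    ignore_index=True,
)

# Identifiers that occur more than once across the merged data
dup_ids = merged.loc[merged["job_id"].duplicated(keep=False), "job_id"].unique()
print(merged)
print("Duplicated job_id values:", list(dup_ids))
```

If identifiers do overlap, combining `source` and `job_id` gives a key that is unique across files, provided each file is unique on its own.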


## 10. Data Quality Check

Check the quality of the merged data.

In [49]:
# 检查缺失值 / Check missing values
print("=" * 80)
print("缺失值统计 / Missing Values Statistics")
print("=" * 80)
missing_stats = pd.DataFrame({
    '列名 / Column': df_merged.columns,
    '缺失数 / Missing Count': df_merged.isnull().sum().values,
    '缺失率 / Missing Rate': (df_merged.isnull().sum() / len(df_merged) * 100).round(2).values
})
missing_stats = missing_stats[missing_stats['缺失数 / Missing Count'] > 0].sort_values(
    '缺失率 / Missing Rate', ascending=False
)
print(missing_stats.to_string(index=False))

# 检查数据类型 / Check data types
print("\n" + "=" * 80)
print("数据类型分布 / Data Type Distribution")
print("=" * 80)
print(df_merged.dtypes.value_counts())

# 查看基本统计信息 / View basic statistics
print("\n" + "=" * 80)
print("数值列统计信息 / Numerical Columns Statistics")
print("=" * 80)
print(df_merged.describe())

缺失值统计 / Missing Values Statistics
 列名 / Column  缺失数 / Missing Count  缺失率 / Missing Rate
country_code                   13                0.04

数据类型分布 / Data Type Distribution
object     25
int64       4
Int64       2
Float64     1
Name: count, dtype: int64

数值列统计信息 / Numerical Columns Statistics
       num_skills_required    salary_min    salary_max     salary_avg  \
count         32000.000000       32000.0       32000.0        32000.0   
mean              4.023750   104098.9085  129365.54825  116732.218031   
std               0.845848  54147.430807  66456.407874   60110.133429   
min               3.000000       14959.0       18283.0        16621.0   
25%               3.000000       63934.5      79625.75       72125.75   
50%               4.000000       91349.5      114342.0       102918.5   
75%               5.000000     131638.25     164688.75      148095.25   
max               6.000000      375523.0      458973.0       417248.0   

        posted_year  posted_month  posted_qua
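
Beyond missing values and summary statistics, the three salary columns should be internally consistent: `salary_min <= salary_avg <= salary_max` in every row. A minimal sketch of that check on made-up rows follows. The column names match the merged frame, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the second one violates the expected ordering
sample = pd.DataFrame({
    "salary_min": [81000, 90000],
    "salary_avg": [90000, 80000],
    "salary_max": [99000, 110000],
})

# True where salary_min <= salary_avg <= salary_max holds
ordered = (sample["salary_min"] <= sample["salary_avg"]) & (
    sample["salary_avg"] <= sample["salary_max"]
)
violations = sample[~ordered]
print("Rows violating the salary ordering:", len(violations))
print(violations)
```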

## 11. Data Analysis and Visualization

Perform some basic analysis on the merged data.

In [50]:
# 岗位分布统计 / Job title distribution
print("=" * 80)
print("Top 10 岗位分布 / Top 10 Job Title Distribution")
print("=" * 80)
job_dist = df_merged['job_title'].value_counts().head(10)
print(job_dist)

# 行业分布 / Industry distribution
print("\n" + "=" * 80)
print("行业分布 / Industry Distribution")
print("=" * 80)
industry_dist = df_merged['industry'].value_counts()
print(industry_dist)

# 经验等级分布 / Experience level distribution
print("\n" + "=" * 80)
print("经验等级分布 / Experience Level Distribution")
print("=" * 80)
exp_dist = df_merged['experience_level'].value_counts()
print(exp_dist)

# 雇佣类型分布 / Employment type distribution
print("\n" + "=" * 80)
print("雇佣类型分布 / Employment Type Distribution")
print("=" * 80)
emp_type_dist = df_merged['employment_type'].value_counts()
print(emp_type_dist)

# 薪资类别分布 / Salary category distribution
print("\n" + "=" * 80)
print("薪资类别分布 / Salary Category Distribution")
print("=" * 80)
salary_cat_dist = df_merged['salary_category'].value_counts()
print(salary_cat_dist)

Top 10 岗位分布 / Top 10 Job Title Distribution
job_title
Data Scientist                 6216
ML Engineer                    3350
NLP Engineer                   1768
AI Product Manager             1765
Data Analyst                   1764
Computer Vision Engineer       1734
Autonomous Systems Engineer    1532
AI Architect                   1529
Robotics Engineer              1521
Data Engineer                  1518
Name: count, dtype: int64

行业分布 / Industry Distribution
industry
Automotive            2335
Retail                2334
Tech                  2296
Finance               2281
Healthcare            2250
Education             2242
Consulting            2041
Government            2033
Media                 2017
Real Estate           2012
Manufacturing         1997
Telecommunications    1992
Gaming                1966
Energy                1965
Transportation        1948
E-commerce             291
Name: count, dtype: int64

经验等级分布 / Experience Level Distribution
experience_level
Senior

In [51]:
# 按岗位查看平均薪资 / Average salary by job title
print("=" * 80)
print("Top 10 岗位平均薪资 / Top 10 Average Salary by Job Title")
print("=" * 80)
avg_salary_by_job = df_merged.groupby('job_title')['salary_avg'].agg(['mean', 'count']).sort_values(
    'mean', ascending=False
).head(10)
avg_salary_by_job.columns = ['平均薪资 / Avg Salary (USD)', '职位数 / Job Count']
print(avg_salary_by_job)

# 按行业查看平均薪资 / Average salary by industry
print("\n" + "=" * 80)
print("各行业平均薪资 / Average Salary by Industry")
print("=" * 80)
avg_salary_by_industry = df_merged.groupby('industry')['salary_avg'].agg(['mean', 'count']).sort_values(
    'mean', ascending=False
)
avg_salary_by_industry.columns = ['平均薪资 / Avg Salary (USD)', '职位数 / Job Count']
print(avg_salary_by_industry)

# 按经验等级查看平均薪资 / Average salary by experience level
print("\n" + "=" * 80)
print("各经验等级平均薪资 / Average Salary by Experience Level")
print("=" * 80)
avg_salary_by_exp = df_merged.groupby('experience_level')['salary_avg'].agg(['mean', 'count']).sort_values(
    'mean', ascending=False
)
avg_salary_by_exp.columns = ['平均薪资 / Avg Salary (USD)', '职位数 / Job Count']
print(avg_salary_by_exp)

Top 10 岗位平均薪资 / Top 10 Average Salary by Job Title
                    平均薪资 / Avg Salary (USD)  职位数 / Job Count
job_title                                                   
AI Researcher                 123230.565401              237
Quant Researcher              120505.292829              251
Data Engineer                 119463.261528             1518
AI Specialist                 118057.819574             1502
ML Engineer                   118057.518806             3350
AI Product Manager             117979.62238             1765
AI Architect                  117683.017659             1529
Head Of AI                    117235.120055             1466
ML Ops Engineer               116984.950495             1414
Data Scientist                116821.185972             6216

各行业平均薪资 / Average Salary by Industry
                    平均薪资 / Avg Salary (USD)  职位数 / Job Count
industry                                                    
E-commerce                    124745.302405              

## 12. 处理重复的job_id / Handle Duplicate job_id

检查并处理可能重复的job_id / Check for and handle potentially duplicated job_id values

In [52]:
# 检查重复的job_id / Check duplicate job_id
print("检查重复的job_id... / Checking duplicate job_id...")
duplicate_ids = df_merged[df_merged.duplicated(subset=['job_id'], keep=False)]
print(f"重复的job_id数量 / Number of duplicate job_id: {len(duplicate_ids)}")

if len(duplicate_ids) > 0:
    print("\n样例重复记录 / Sample duplicate records:")
    print(duplicate_ids[['job_id', 'job_title', 'company_name', 'salary_avg']].head(10))
    
    # 为重复的job_id添加后缀 / Add suffix to duplicate job_id
    print("\n正在为重复的job_id添加后缀... / Adding suffix to duplicate job_id...")
    
    # 创建一个新的job_id列 / Create a new job_id column
    df_merged['original_job_id'] = df_merged['job_id'].astype(str)
    
    # 为每个重复的job_id组添加序号 / Add sequence number to each duplicate job_id group
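    # 按字符串形式分组,避免 1 与 "1" 被视为不同的ID / Group on the string form so that 1 and "1" count as the same id
    df_merged['suffix'] = df_merged.groupby('original_job_id').cumcount().astype(str)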
    df_merged['job_id'] = df_merged['original_job_id'] + '_' + df_merged['suffix']
    
    # 对于原始的第一个记录,去掉_0后缀 / For the first original record, remove _0 suffix
    df_merged['job_id'] = df_merged['job_id'].str.replace('_0$', '', regex=True)
    
    # 删除临时列 / Drop temporary columns
    df_merged.drop(['original_job_id', 'suffix'], axis=1, inplace=True)
    
    print("job_id处理完成! / job_id processing completed!")
else:
    print("没有发现重复的job_id / No duplicate job_id found")

# 验证唯一性 / Verify uniqueness
print(f"\n最终唯一job_id数量 / Final unique job_id count: {df_merged['job_id'].nunique()}")
print(f"总记录数 / Total records: {len(df_merged)}")

检查重复的job_id... / Checking duplicate job_id...
重复的job_id数量 / Number of duplicate job_id: 30000

样例重复记录 / Sample duplicate records:
       job_id                 job_title           company_name  salary_avg
2000  AI00001            Data Scientist        Smart Analytics     90376.0
2001  AI00002      AI Software Engineer           TechCorp Inc     61895.0
2002  AI00003             AI Specialist        Autonomous Tech    152626.0
2003  AI00004              NLP Engineer         Future Systems     80215.0
2004  AI00005             AI Consultant      Advanced Robotics     54624.0
2005  AI00006              AI Architect     Neural Networks Co    123574.0
2006  AI00007  Principal Data Scientist         DataVision Ltd     79670.0
2007  AI00008              NLP Engineer     Cloud AI Solutions     70640.0
2008  AI00009              Data Analyst  Quantum Computing Inc    160710.0
2009  AI00010      AI Software Engineer     Cloud AI Solutions    102557.0

正在为重复的job_id添加后缀... / Adding suffix to dupli
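
The suffixing step can be checked on a small example. The sketch below is a variant of the cell above, not a replacement for it: it uses `Series.where` to leave the first occurrence untouched, so no `_0` suffix has to be stripped afterwards. The toy ids are made up for illustration.

```python
import pandas as pd

# Toy ids mixing the two formats seen in the merged data (integers and "AI…" strings).
toy = pd.DataFrame({"job_id": ["AI00001", "AI00002", "AI00001", 1, "AI00001"]})

ids = toy["job_id"].astype(str)
# 0 for the first occurrence of each id, 1 for the second, and so on.
seq = toy.groupby(ids).cumcount()
# Keep first occurrences as they are; append _1, _2, ... to later ones.
toy["job_id"] = ids.where(seq == 0, ids + "_" + seq.astype(str))

assert toy["job_id"].is_unique
print(toy["job_id"].tolist())
```

Either approach could still collide with an id that already ends in `_1`, so the final uniqueness check remains worth keeping.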

## 13. 查看最终数据集样例 / View Final Dataset Sample

In [53]:
# 查看前10行 / View first 10 rows
print("=" * 80)
print("最终数据集前10行 / Final Dataset First 10 Rows")
print("=" * 80)
print(df_merged.head(10))

# 查看列信息 / View column info
print("\n" + "=" * 80)
print("最终数据集信息 / Final Dataset Info")
print("=" * 80)
print(df_merged.info())

# 随机查看一些记录 / View some random records
print("\n" + "=" * 80)
print("随机样例 / Random Samples")
print("=" * 80)
print(df_merged.sample(5))

最终数据集前10行 / Final Dataset First 10 Rows
  job_id              company_name    industry                 job_title  \
0      1           Foster and Sons  Healthcare              Data Analyst   
1      2   Boyd, Myers and Ramirez        Tech  Computer Vision Engineer   
2      3                  King Inc        Tech          Quant Researcher   
3      4  Cooper, Archer and Lynch        Tech        AI Product Manager   
4      5                  Hall LLC     Finance            Data Scientist   
5      6                 Ellis PLC  E-commerce        AI Product Manager   
6      7            Matthews-Moses  Automotive              Data Analyst   
7      8               Mullins Ltd   Education            Data Scientist   
8      9               Aguilar PLC  Healthcare               ML Engineer   
9     10                 Parks LLC  Automotive  Computer Vision Engineer   

                                     skills_required experience_level  \
0  NumPy, Reinforcement Learning, PyTorch, Scikit.

## 14. 保存合并后的数据集 / Save Merged Dataset

In [None]:
# 保存为CSV文件 / Save as CSV file
output_filename = '../dataset/ai_job_market_unified.csv'
df_merged.to_csv(output_filename, index=False, encoding='utf-8')

print(f"✓ 数据集已保存至: {output_filename}")
print(f"✓ Dataset saved to: {output_filename}")
print(f"\n数据集规模 / Dataset shape: {len(df_merged)} 行 x {len(df_merged.columns)} 列")
print(f"Dataset shape: {len(df_merged)} rows x {len(df_merged.columns)} columns")

✓ 数据集已保存至: ai_job_market_unified.csv
✓ Dataset saved to: ai_job_market_unified.csv

数据集规模 / Dataset shape: 32000 行 x 32 列
Dataset shape: 32000 rows x 32 columns


## 15. 总结 / Summary

### 完成的任务 / Completed Tasks

✅ **数据加载** / Data Loading
- 成功加载三个数据集 / Successfully loaded three datasets:
  - `ai_job_dataset.csv`
  - `ai_job_dataset1.csv`
  - `ai_job_market_cleaned.csv`

✅ **数据清洗** / Data Cleaning
- 薪资单位统一转换为USD / Unified salary conversion to USD
- 岗位名称标准化 / Job title standardization (如 / e.g. "Machine Learning Engineer" → "ML Engineer")
- 经验等级标准化 / Experience level standardization (EN/MI/SE/EX → Entry/Mid/Senior)
- 雇佣类型标准化 / Employment type standardization (FT/PT/CT/FL → Full-time/Part-time/Contract/Freelance)
- 公司规模标准化 / Company size standardization (S/M/L → Startup/Mid/Large)
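
The code-to-label standardization listed above can be sketched with plain dictionaries and `Series.map`. The mappings follow the summary. The one assumption is that `EX` folds into `Senior`, because the summary lists four source codes but only three target levels.

```python
import pandas as pd

# Assumption: "EX" is folded into "Senior" (only three target levels are listed).
EXPERIENCE_MAP = {"EN": "Entry", "MI": "Mid", "SE": "Senior", "EX": "Senior"}
EMPLOYMENT_MAP = {"FT": "Full-time", "PT": "Part-time", "CT": "Contract", "FL": "Freelance"}
COMPANY_SIZE_MAP = {"S": "Startup", "M": "Mid", "L": "Large"}

# Toy rows; the third row already has a clean experience label.
toy = pd.DataFrame({
    "experience_level": ["EN", "EX", "Senior"],
    "employment_type": ["FT", "FL", "CT"],
    "company_size": ["S", "L", "M"],
})

column_maps = {
    "experience_level": EXPERIENCE_MAP,
    "employment_type": EMPLOYMENT_MAP,
    "company_size": COMPANY_SIZE_MAP,
}
for column, mapping in column_maps.items():
    # Series.map returns NaN for values missing from the dict,
    # so fall back to the original value for already-clean labels.
    toy[column] = toy[column].map(mapping).fillna(toy[column])

print(toy)
```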

✅ **字段推导** / Field Derivation
- 从单一薪资值创建薪资范围(±10%) / Create salary range from single value (±10%)
- 计算薪资分类 / Calculate salary category (Low/Mid/High/Very High)
- 提取日期相关字段 / Extract date-related fields (年/year、月/month、季度/quarter、月份名称/month name)
- 处理技能列表和统计技能数量 / Process skills list and count number of skills
- **推导国家代码和区域** / **Derive country code and region**
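
A minimal sketch of the ±10% range and the salary category derivation. The `salary_usd` column name and the bin edges are hypothetical, since the summary does not state the actual thresholds.

```python
import pandas as pd

# Toy single-value salaries; `salary_usd` is a hypothetical column name.
toy = pd.DataFrame({"salary_usd": [50_000, 90_000, 140_000, 250_000]})

# Build a ±10% range around the single value, rounded to whole dollars.
toy["salary_min"] = (toy["salary_usd"] * 0.9).round().astype("int64")
toy["salary_max"] = (toy["salary_usd"] * 1.1).round().astype("int64")
toy["salary_avg"] = (toy["salary_min"] + toy["salary_max"]) / 2

# Hypothetical bin edges; each bin includes its right edge (pd.cut default).
bins = [0, 60_000, 100_000, 150_000, float("inf")]
labels = ["Low", "Mid", "High", "Very High"]
toy["salary_category"] = pd.cut(toy["salary_avg"], bins=bins, labels=labels)

print(toy)
```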

✅ **数据合并** / Data Merging
- 统一三个数据集的列格式 / Unified column format for three datasets
- 处理重复的job_id / Handle duplicate job_id
- 成功合并为一个统一的数据集 / Successfully merged into one unified dataset
- **移除了不需要的列** / **Removed unnecessary columns**: city, tools_preferred, tools_list, num_tools_preferred

✅ **数据质量保证** / Data Quality Assurance
- 缺失值检查和统计 / Missing value check and statistics
- 数据类型验证 / Data type validation
- 基本统计分析 / Basic statistical analysis

### 关键转换说明 / Key Transformation Notes

1. **薪资转换** / Currency Conversion:
   - 使用2024年平均汇率 / Using 2024 average exchange rates
   - EUR: 1.09, GBP: 1.26, JPY: 0.0069 等 / e.g. EUR 1.09, GBP 1.26, JPY 0.0069 (USD per unit of foreign currency), among others

2. **新增推导字段** / Newly Derived Fields:
   - `country_code`: 从国家名称推导国家代码 / Derive country code from country name
   - `region`: 从国家代码推导地理区域 / Derive geographic region from country code

3. **移除的字段** / Removed Fields:
   - `city`: 原数据中没有城市信息 / No city information in original data
   - `tools_preferred`, `tools_list`, `num_tools_preferred`: 原数据中不存在 / Not present in original data

4. **岗位名称统一** / Job Title Standardization:
   - 合并相似岗位 / Merge similar positions (如 / e.g. Deep Learning Engineer → ML Engineer)
   - 保持一致的命名规范 / Maintain consistent naming conventions
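
The currency conversion in note 1 amounts to a lookup and a multiplication. A minimal sketch using the three rates quoted above plus USD at 1.0; the `salary` and `salary_currency` column names are assumptions, and currencies without a listed rate are left as missing rather than guessed:

```python
import pandas as pd

# USD per unit of foreign currency; only the rates quoted in the notes are included.
USD_PER_UNIT = {"USD": 1.0, "EUR": 1.09, "GBP": 1.26, "JPY": 0.0069}

# Toy rows; `salary` and `salary_currency` are hypothetical column names.
toy = pd.DataFrame({
    "salary": [100_000, 80_000, 10_000_000, 50_000],
    "salary_currency": ["USD", "EUR", "JPY", "CHF"],
})

# Currencies missing from the table map to NaN, which propagates to salary_usd.
rate = toy["salary_currency"].map(USD_PER_UNIT)
toy["salary_usd"] = (toy["salary"] * rate).round(2)

print(toy)
```

Leaving unknown currencies as NaN makes them show up in the missing-value check instead of silently passing through unconverted.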

### 数据集规模 / Dataset Scale

- 原始数据总量 / Total original data: 32,000 记录 / records (15,000 + 15,000 + 2,000)
- 最终合并数据集包含所有三个数据源 / Final merged dataset contains all three data sources
- 统一格式,便于后续分析 / Unified format for further analysis

### 国家代码和区域映射 / Country Code and Region Mapping

- 支持的区域 / Supported regions: 
  - North America / 北美洲
  - Europe / 欧洲
  - Asia / 亚洲
  - Oceania / 大洋洲
  - South America / 南美洲
  - Africa / 非洲
  - Middle East / 中东
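
The two-step derivation (country name → country code → region) can be sketched as two dictionary lookups. The dictionaries below are small illustrative samples with one country per region, not the notebook's actual mapping tables, and the `country` column name is an assumption.

```python
import pandas as pd

# Illustrative samples only: ISO 3166-1 alpha-2 codes, one country per region.
COUNTRY_TO_CODE = {
    "United States": "US", "Germany": "DE", "Japan": "JP", "Australia": "AU",
    "Brazil": "BR", "South Africa": "ZA", "Israel": "IL",
}
CODE_TO_REGION = {
    "US": "North America", "DE": "Europe", "JP": "Asia", "AU": "Oceania",
    "BR": "South America", "ZA": "Africa", "IL": "Middle East",
}

# Toy rows; "Atlantis" stands in for a name with no mapping entry.
toy = pd.DataFrame({"country": ["Germany", "Japan", "Brazil", "Atlantis"]})

# Unknown names become NaN in country_code and stay NaN in region.
toy["country_code"] = toy["country"].map(COUNTRY_TO_CODE)
toy["region"] = toy["country_code"].map(CODE_TO_REGION)

print(toy)
```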