# Part 1: Introduction to the Data Domain and the Data Exploration Report
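## 1.1 Dataset Overview
### 1.1.1 Data Source and Background

The `Adult Income Dataset` examines how adult income relates to a range of demographic and employment factors. It can be downloaded from [Adult Income](https://www.kaggle.com/datasets/wenruliu/adult-income-dataset).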

As the dataset's introduction page puts it: "An individual's annual income results from various factors. Intuitively, it is influenced by the individual's education level, age, gender, occupation, and etc.This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practices."

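One of the classic datasets in machine learning, it was originally derived from 1994 US census data. It contains 48,842 samples (the CSV file has one additional header row), and each sample has 15 attributes.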

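### 1.1.2 Dataset Details
**Dataset size:**
- Total samples: 48,842 records
- Number of features: 14 predictive features + 1 target variable
- Data types: both numerical and categorical features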

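Continuous features:

| Feature | Column name | Data type | Description | Value range |
| --- | --- | --- | --- | --- |
| Age | age | Numerical | Age of the individual | 17-90 years |
| Years of education | educational-num | Numerical | Number of years of education | 1-16 years |
| Capital gain | capital-gain | Numerical | Capital gains | 0-99,999 USD |
| Capital loss | capital-loss | Numerical | Capital losses | 0-4,356 USD |
| Hours per week | hours-per-week | Numerical | Hours worked per week | 1-99 hours |
| Final weight | fnlwgt | Numerical | Census sampling weight | 12,285-1,484,705 |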

Categorical features:

| Feature | Column name | Categories | Main values |
|---|---|---|---|
|Work class|workclass|9|Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked|
|Education level|education|16|Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|Marital status|marital-status|7|Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|Occupation|occupation|15|Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces|
|Family relationship|relationship|6|Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried|
|Race|race|5|White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|Gender|gender|2|Female, Male|
|Native country|native-country|42|United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands|

Note: for `workclass`, `occupation`, and `native-country`, the category count is one higher than the number of values listed; the extra category is most likely the `?` placeholder for missing values that the loading code in section 1.2 checks for.
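
The category counts in the table above can be cross-checked against the data. The sketch below is only an illustration on a tiny hand-made frame, not the real dataset: `summarize_categories` is a hypothetical helper, the toy values are made up, and the code assumes that missing values are encoded as the string `?`.

```python
import pandas as pd

def summarize_categories(frame, missing_marker="?"):
    """For each non-numeric column, count distinct values and flag the missing marker."""
    rows = []
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            continue  # only categorical (non-numeric) columns are summarized
        rows.append({
            "feature": col,
            "n_categories": frame[col].nunique(),
            "includes_marker": bool((frame[col] == missing_marker).any()),
        })
    return pd.DataFrame(rows).set_index("feature")

# Toy stand-in for the real frame (illustrative values only)
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "gender": ["Male", "Female", "Female", "Male"],
    "age": [39, 50, 38, 53],
})
summary = summarize_categories(toy)
print(summary)
```

On the real data, the same helper could be called as `summarize_categories(df)` once `df` has been loaded. A column whose count includes the marker has one fewer genuine category than the count suggests.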

## 1.2 Exploratory Data Analysis

In [None]:
# Import required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.model_selection as ms

# Set font for Chinese display
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

# Load CSV data
df = pd.read_csv('adult.csv')

# Basic dataset information
print("Dataset Basic Information:")
print(f"Dataset shape: {df.shape}")
print(f"Column names: {df.columns.tolist()}")
print("\nFirst 5 rows:")
print(df.head())

print("\nDataset info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\nDataset statistics:")
print(df.describe())

print("\nMissing values:")
print(df.isnull().sum())

# Check for missing values marked with "?"
print("\nCheck for missing values marked with '?':")
for col in df.columns:
    if df[col].dtype == 'object':
        missing_count = (df[col] == '?').sum()
        if missing_count > 0:
            print(f"{col}: {missing_count} missing values")
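
The loading cell above counts `?` entries separately because `isnull()` does not treat them as missing. One possible follow-up, sketched here on a made-up toy frame rather than the real data, is to convert the placeholder to `NaN` so that the standard pandas missing-value tools apply. Whether the notebook later does this is an assumption; the names `toy`, `cleaned`, `missing_counts` and `missing_share` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame (illustrative values only)
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Local-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Craft-repair"],
    "age": [25, 38, 28, 44],
})

# Replace the '?' placeholder with NaN so that isnull() can detect it
cleaned = toy.replace("?", np.nan)

missing_counts = cleaned.isnull().sum()   # missing entries per column
missing_share = cleaned.isnull().mean()   # fraction of rows missing per column
print(missing_counts)
print(missing_share.round(2))
```

An alternative with the same effect would be to pass `na_values='?'` to `pd.read_csv`, at the cost of making the explicit `?` check above return nothing.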


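### 1.2.1 Univariate Analysis
In this part, we will:
1. Plot **boxplots** and histograms for all continuous variables to examine the distribution of each feature
2. Plot bar charts of value counts for all categorical variables to examine how the values of each categorical feature are distributed

The plots show that:
1. Among the continuous features: `age` and `educational-num` are fairly evenly distributed, with a small number of outliers; `fnlwgt` (final weight) spans a wide range and has many outliers; `capital-gain` and `capital-loss` are heavily right-skewed, with most values equal to 0 and only a few samples reporting any capital gain or loss; `hours-per-week` is concentrated around 40 hours, which matches a standard working week.
2. Among the categorical features: `workclass` is dominated by the private sector (Private), at about 75%; in `education`, high-school graduates (HS-grad) and college education (Some-college, Bachelors) are the most common; `marital-status` is dominated by married individuals (Married-civ-spouse); `occupation` is relatively evenly distributed, with professional specialties (Prof-specialty) and executive or managerial roles (Exec-managerial) the most frequent; in `relationship`, Husband and Not-in-family are the most common; `race` is predominantly White, at about 85%; in `gender`, Male accounts for about 67%; `native-country` is predominantly United-States, at about 90%.
3. The target variable `income` is imbalanced: about 76% of samples have income ≤50K and about 24% have income >50K.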
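
The statements about outliers and skew in observation 1 can also be expressed as numbers. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule of matplotlib boxplots, together with the sample skewness. `iqr_outlier_share` is a hypothetical helper, and the two toy series are made up to mimic a mostly-zero feature and a feature concentrated at 40; they are not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_share(series, k=1.5):
    """Fraction of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return outside.mean()

# Toy series (illustrative values only)
toy_gain = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 5000])
toy_hours = pd.Series([38, 40, 40, 40, 40, 40, 40, 40, 42, 45])

gain_share = iqr_outlier_share(toy_gain)
hours_share = iqr_outlier_share(toy_hours)
print(gain_share, toy_gain.skew())
print(hours_share)
```

When most values are identical, the IQR collapses to zero and every other value is flagged. That is worth keeping in mind when reading the boxplots of `capital-gain` and `capital-loss`.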

In [None]:
# Define continuous and categorical variables
continuous_features = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']
categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country']

# 1. Plot boxplots for continuous variables
plt.figure(figsize=(20, 12))
for i, feature in enumerate(continuous_features):
    plt.subplot(2, 3, i+1)
    plt.boxplot(df[feature].dropna())
    plt.title(f'{feature} Boxplot')
    plt.ylabel('Value')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 2. Plot histograms for continuous variables
plt.figure(figsize=(20, 12))
for i, feature in enumerate(continuous_features):
    plt.subplot(2, 3, i+1)
    plt.hist(df[feature].dropna(), bins=50, alpha=0.7, edgecolor='black')
    plt.title(f'{feature} Distribution Histogram')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 3. Plot bar charts for categorical variables
# Split categorical variables into groups for better visualization
fig, axes = plt.subplots(2, 2, figsize=(20, 12))
fig.suptitle('Categorical Variables Distribution (Group 1)', fontsize=16)

for i, feature in enumerate(categorical_features[:4]):
    row = i // 2
    col = i % 2

    # Calculate frequency for each category
    value_counts = df[feature].value_counts()

    # Plot bar chart
    axes[row, col].bar(range(len(value_counts)), value_counts.values)
    axes[row, col].set_title(f'{feature} Distribution')
    axes[row, col].set_xlabel('Category')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].set_xticks(range(len(value_counts)))
    axes[row, col].set_xticklabels(value_counts.index, rotation=45, ha='right')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Group 2 categorical variables
fig, axes = plt.subplots(2, 2, figsize=(20, 12))
fig.suptitle('Categorical Variables Distribution (Group 2)', fontsize=16)

for i, feature in enumerate(categorical_features[4:]):
    row = i // 2
    col = i % 2

    # Calculate frequency for each category
    value_counts = df[feature].value_counts()

    # Plot bar chart
    axes[row, col].bar(range(len(value_counts)), value_counts.values)
    axes[row, col].set_title(f'{feature} Distribution')
    axes[row, col].set_xlabel('Category')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].set_xticks(range(len(value_counts)))
    axes[row, col].set_xticklabels(value_counts.index, rotation=45, ha='right')
    axes[row, col].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 4. Target variable distribution
plt.figure(figsize=(10, 6))
income_counts = df['income'].value_counts()
plt.bar(income_counts.index, income_counts.values, color=['skyblue', 'lightcoral'])
plt.title('Target Variable (income) Distribution')
plt.xlabel('Income Level')
plt.ylabel('Frequency')
plt.grid(True, alpha=0.3)

# Add value labels
for i, v in enumerate(income_counts.values):
    plt.text(i, v + 100, str(v), ha='center', va='bottom')

plt.show()

# 5. Statistical summary
print("\n=== Univariate Analysis Summary ===")
print("\nContinuous variables statistics:")
print(df[continuous_features].describe())

print("\nCategorical variables statistics:")
for feature in categorical_features:
    print(f"\n{feature}:")
    print(df[feature].value_counts())

print("\nTarget variable distribution:")
print(df['income'].value_counts())
print("Target variable proportion:")
print(df['income'].value_counts(normalize=True))

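### 1.2.2 Multivariate Analysis
In this part, we first examine the relationships among the **continuous** variables by plotting a **correlation heatmap** and a **scatter plot matrix**. The heatmap shows that the features are only weakly correlated: no pairwise correlation coefficient exceeds 0.3.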
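
The claim about the correlation threshold can be checked numerically instead of being read off the heatmap. The sketch below finds the off-diagonal entry of a correlation matrix with the largest absolute value. `strongest_pair` is a hypothetical helper, comparing absolute values is an assumption about how the threshold is meant, and the toy frame is made up, so its strong correlation says nothing about the real data.

```python
import numpy as np
import pandas as pd

def strongest_pair(corr):
    """Return (row_feature, col_feature, r) for the off-diagonal entry with the largest |r|."""
    abs_vals = np.abs(corr.to_numpy(dtype=float))
    # Keep only the strict upper triangle: this skips the diagonal (always 1)
    # and the mirrored duplicates below it.
    upper = np.triu(np.ones(abs_vals.shape, dtype=bool), k=1)
    abs_vals[~upper] = -1.0
    i, j = np.unravel_index(np.argmax(abs_vals), abs_vals.shape)
    return corr.index[i], corr.columns[j], corr.iloc[i, j]

# Toy stand-in for df[continuous_features] (illustrative values only)
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 60],
    "educational-num": [9, 13, 10, 14, 9],
    "hours-per-week": [40, 45, 40, 50, 38],
})
a, b, r = strongest_pair(toy.corr())
print(a, b, round(r, 2))
```

On the real data, `strongest_pair(df[continuous_features].corr())` would return the pair to compare against the 0.3 threshold.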

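In addition, the scatter plot matrix colored by `income` shows that:
1. For `age` and `educational-num`, the high-income group (`income>50k`) generally has more years of education and is older; the high-income group also works longer `hours-per-week`, with a wider spread
2. `capital-gain` is a very strong predictor. Almost all individuals with income >50K fall in the region where `capital-gain` is greater than zero, while the capital gains of individuals with income <=50K are almost all zero. This is consistent with the boxplots in section 1.2.1
3. `fnlwgt` shows almost no association with `income` and has limited predictive value for it, so it could be excluded from the model later on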
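
Because the pairplot is drawn from a 2,000-row sample, the observation about `capital-gain` is worth confirming with a direct group-wise count. The sketch below computes, for each income group, the share of individuals with a non-zero capital gain. The toy frame and its values are made up for illustration; the result on the real data may look quite different.

```python
import pandas as pd

# Toy stand-in for the real frame (illustrative values only)
toy = pd.DataFrame({
    "capital-gain": [0, 0, 7688, 0, 0, 15024, 0, 0],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
})

# Share of individuals with a non-zero capital gain within each income group
nonzero_share = (toy["capital-gain"] > 0).groupby(toy["income"]).mean()
print(nonzero_share)
```

Running the same expression on `df` would show how much of the >50K group actually reports a capital gain, which indicates how far this feature alone can separate the two classes.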

In [None]:
# Set visualization style
sns.set(style='whitegrid', palette='muted', font_scale=1.2)

# --- Correlation Heatmap ---
# Calculate the correlation matrix for continuous features
correlation_matrix = df[continuous_features].corr()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix of Continuous Features', fontsize=16)
plt.show()

# --- Pairplot ---
# Create a pairplot to visualize relationships between continuous variables,
# colored by the target variable 'income'.
# To avoid long computation time and cluttered plot, we use a sample of the data.
df_sample = df.sample(n=2000, random_state=42)

# sns.pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(df_sample, vars=continuous_features, hue='income', plot_kws={'alpha': 0.6}, diag_kind='kde')
plt.suptitle('Pairplot of Continuous Features by Income Level', y=1.02, fontsize=16)
plt.show()
