# Data cleaning

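First, generate the variables amount, num_goods, and price_goods in Stata, then load the data and check its basic properties. Government (public) procurement in China falls into three main categories: goods (货物), services (服务), and construction works (工程), each with its own monetary threshold for open tendering.
Count word frequencies in the main item name (主要标的名称) and the project name (项目名称), select the top 100 keywords, and classify them by hand, excluding ambiguous ones. For example, 油 (oil/fuel) could mean "fuel purchase" (goods) or "refueling service" (services). Procurement records that contain a classified keyword are labeled with its category, and the remaining records are then classified with machine learning.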
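
The machine-learning step mentioned above is not shown in this notebook. As a minimal sketch of what it could look like, the block below trains a multinomial naive Bayes model on character bigrams. The choice of model, the class name `BigramNaiveBayes`, and the labels attached to the six training names are all assumptions made for illustration; the names themselves appear in the frequency listings later in this notebook.

```python
import math
from collections import Counter, defaultdict

def char_bigrams(text):
    """Split a string into overlapping character bigrams (no word segmentation needed)."""
    text = text.strip()
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

class BigramNaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over character bigrams (illustrative)."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)          # number of training names per class
        self.feature_counts = defaultdict(Counter)   # bigram counts per class
        self.vocab = set()
        for name, label in zip(names, labels):
            grams = char_bigrams(name)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for gram in char_bigrams(name):
                score += math.log((counts[gram] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Assumed labels, for illustration only
train_names = ["复印纸", "台式计算机", "空调机", "印刷服务", "物业管理服务", "车辆加油服务"]
train_labels = ["货物", "货物", "货物", "服务", "服务", "服务"]

model = BigramNaiveBayes().fit(train_names, train_labels)
print(model.predict("物业服务"))
print(model.predict("便携式计算机"))
```

In practice the keyword-labeled rows would serve as the training set and the unlabeled rows as the prediction set; a held-out sample of labeled rows would be needed to judge accuracy before trusting the predictions.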

In [79]:
import pandas as pd

pd.set_option("display.max_columns", None)   
pd.set_option("display.max_rows", 100)
pd.set_option("display.width", None)     

In [61]:

csv_file = "/Users/yxy/UChi/Summer2025/Procurement/dta/china_procurement_clean1.csv"

df = pd.read_csv(csv_file, low_memory=False)


## Generate training set
### top 100 items
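Manually label the category of each of the 100 most frequent item names, using ChatGPT followed by a manual check. Placeholder values such as "无" (none) and "详情见合同" (see contract for details) are labeled notsure.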

In [68]:
import pandas as pd

top100_items = df["主要标的名称"].value_counts().head(100).reset_index()
top100_items.columns = ["keyword", "count"]

top100_items["category"] = ""

top100_items.to_csv("/Users/yxy/UChi/Summer2025/Procurement/dta/keywords.csv", index=False, encoding="utf-8-sig")

print("Top 100 items exported to keywords.csv for manual categorization.")


Top 100 items exported to keywords.csv for manual categorization.


In [None]:

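# Reload keywords.csv after its category column has been filled in by hand
classified = pd.read_csv("/Users/yxy/UChi/Summer2025/Procurement/dta/keywords.csv")

# Attach a category to every row whose item name exactly equals a labeled keyword
df = df.merge(classified[["keyword", "category"]], 
              left_on="主要标的名称", 
              right_on="keyword", 
              how="left")

# Keep the category as cat and drop the helper merge key
df.rename(columns={"category": "cat"}, inplace=True)

df.drop(columns=["keyword"], inplace=True)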



In [70]:
df['cat'].value_counts(dropna=False)

cat
NaN        1643944
货物          869663
服务          465043
工程          161255
notsure      20021
Name: count, dtype: int64
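
To read these counts as shares, a short calculation helps. This is only arithmetic on the numbers printed above; the variable names are illustrative.

```python
# Counts copied from the value_counts output above
counts = {"NaN": 1643944, "货物": 869663, "服务": 465043, "工程": 161255, "notsure": 20021}

total = sum(counts.values())
labeled = counts["货物"] + counts["服务"] + counts["工程"]

print(f"total rows: {total}")
print(f"labeled share: {labeled / total:.1%}")
print(f"unmatched share: {counts['NaN'] / total:.1%}")
print(f"notsure share: {counts['notsure'] / total:.1%}")
```

Exact matching on the 100 most frequent item names therefore labels a little under half of the rows, which is why the remaining steps focus on the unmatched names.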

In [81]:
df_uncat = df[df['cat'].isna() | (df['cat'] == 'notsure')]
print(df_uncat['主要标的名称'].value_counts().head(100))

主要标的名称
无                             6589
详见合同文本。                       3432
无无                            2938
1                             2753
见附件                           2308
详见合同附件                        2001
办公设备                          1921
车辆加油服务,采购数量1;                 1858
信息化工程监理服务                     1777
燃油费                           1776
打印设备                          1775
印刷品,采购数量1;                    1768
测试评估认证服务                      1759
商务车                           1746
桌前椅                           1725
汽油                            1714
A4黑白打印机                       1706
车辆维修,采购数量1.0000;              1698
其他不另分类的物品                     1698
车辆加油                          1690
木质架类                          1622
-                             1619
物业服务                          1610
其他信息技术服务                      1584
印刷                            1583
基础软件                          1546
不间断电源                         1490
车辆加油,采购数量1;                   1462
车辆维修保养,采购数量1;
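
The listing suggests two reasons why names stay unmatched: placeholder values (无, 见附件, -) and a trailing quantity suffix such as `,采购数量1;` that prevents an exact match with an otherwise known name. A minimal sketch of a normalization step, assuming these two patterns are the main ones; `normalize_name`, `QTY_SUFFIX`, and `PLACEHOLDERS` are hypothetical names, and the placeholder set is taken from the listing above and is not exhaustive.

```python
import re

# Trailing quantity suffix seen in the listing, e.g. ",采购数量1;" or ",采购数量1.0000;"
QTY_SUFFIX = re.compile(r"[,，]\s*采购数量\s*[\d.]+\s*[;；]?\s*$")

# Placeholder values copied from the listing above (not an exhaustive list)
PLACEHOLDERS = {"无", "无无", "1", "-", "见附件", "详见合同附件", "详见合同文本"}

def normalize_name(name):
    """Return a cleaned item name, or None if the value is only a placeholder."""
    if not isinstance(name, str):
        return None
    cleaned = QTY_SUFFIX.sub("", name.strip())
    cleaned = cleaned.rstrip("。.;； ")   # drop trailing punctuation
    if cleaned == "" or cleaned in PLACEHOLDERS:
        return None
    return cleaned

for raw in ["车辆加油服务,采购数量1;", "车辆维修,采购数量1.0000;", "详见合同文本。", "办公设备"]:
    print(repr(raw), "->", repr(normalize_name(raw)))
```

Applied with `df["主要标的名称"].map(normalize_name)` before the merge, this would let names that differ only by the suffix share one keyword entry, though whether that is appropriate for every row would need checking against the data.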

# test code

In [None]:
df.head()

In [67]:
df["主要标的名称"].value_counts()

主要标的名称
复印纸                                   186539
印刷服务                                  110900
台式计算机                                  93411
空调机                                    90424
物业管理服务                                 76511
                                       ...  
肥胖症中西医保健宣传册,采购数量350;                       1
车辆保险（新购置）,采购数量1;                           1
社区共治共建项目                                   1
北京经济技术开发区综合执法局2023年市场、环境秩序巡查防控辅助服务         1
区县级咨询热线呼叫服务                                1
Name: count, Length: 952612, dtype: int64

In [None]:
df['项目名称'].unique()

In [None]:
df['项目名称'].nunique()

In [None]:
import re
from collections import Counter

# Split each item name on common delimiters; skip missing names and empty tokens
all_words = []
for name in df["主要标的名称"].dropna().astype(str):
    words = re.split(r"[、，,（）() ]", name)
    all_words.extend(w for w in words if w.strip() != "")

# Take the 100 most frequent tokens
counter = Counter(all_words)
top100 = [w for w, _ in counter.most_common(100)]

# Count rows whose item name contains at least one top-100 token (substring match)
mask = df["主要标的名称"].fillna("").astype(str).apply(
    lambda x: any(w in x for w in top100)
)
covered_rows = mask.sum()
total_rows = mask.shape[0]

print("Rows covered by the top 100 tokens:", covered_rows)
print("Coverage: {:.2f}%".format(covered_rows / total_rows * 100))


In [None]:
import re
from collections import Counter
import pandas as pd

# Collect all tokens, skipping missing names and empty tokens
all_words = []
for name in df["主要标的名称"].dropna().astype(str):
    words = re.split(r"[、，,（）() ]", name)
    all_words.extend(w for w in words if w.strip() != "")

# Take the 100 most frequent tokens with their counts
counter = Counter(all_words)
top100 = counter.most_common(100)

# Convert to a DataFrame
kw_df = pd.DataFrame(top100, columns=["keyword", "count"])

# Add an empty category column to be filled in by hand
kw_df["category"] = ""

# Export to CSV
kw_df.to_csv("keywords.csv", index=False, encoding="utf-8-sig")

print("Exported keywords.csv; edit the category column by hand")


In [None]:
import re
from collections import Counter
import pandas as pd
import os

def export_top_keywords(df, col_name, csv_path="/Users/yxy/UChi/Summer2025/Procurement/dta/keywords_classified.csv", topn=100):
    """
    Extract the most frequent tokens from df[col_name] and create the CSV file,
    or append to it if it already exists.
    """
    # Tokenize on common delimiters, skipping empty tokens
    all_words = []
    for name in df[col_name].dropna().astype(str):
        words = re.split(r"[、，,（）() ]", name)
        all_words.extend(w for w in words if w.strip() != "")

    # Most frequent tokens
    counter = Counter(all_words)
    topn_words = counter.most_common(topn)
    new_df = pd.DataFrame(topn_words, columns=["keyword", "count"])
    new_df["category"] = ""

    # If the file already exists, read it and merge the new keywords in
    if os.path.exists(csv_path):
        old_df = pd.read_csv(csv_path)
        combined = pd.concat([old_df, new_df], ignore_index=True)
        # Deduplicate: keep the first occurrence of each keyword, so existing labels survive
        combined = combined.drop_duplicates(subset=["keyword"], keep="first")
    else:
        combined = new_df

    # Save
    combined.to_csv(csv_path, index=False, encoding="utf-8-sig")
    print(f"Updated {csv_path}, total keywords: {len(combined)}")




In [None]:
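import pandas as pd
import re

# Read the keyword classification file
kw_file = "/Users/yxy/UChi/Summer2025/Procurement/dta/keywords_classified.csv"
kw_df = pd.read_csv(kw_file)

# Skip keywords that have not been labeled yet; otherwise astype(str) would turn
# a blank category into the literal string "nan"
kw_df = kw_df.dropna(subset=["keyword", "category"])

# Build the keyword -> category mapping
mapping = dict(zip(kw_df["keyword"].astype(str), kw_df["category"].astype(str)))

def classify_name(name):
    """Classify one item name by exact token lookup in mapping."""
    if pd.isna(name) or not isinstance(name, str) or name.strip() == "":
        return pd.NA
    
    words = re.split(r"[、，,（）() ]", name)
    categories = set()

    for w in words:
        if w in mapping:
            categories.add(mapping[w])
    
    # One category: use it; conflicting categories: notsure; no match: missing
    if len(categories) == 1:
        return categories.pop()
    elif len(categories) > 1:
        return "notsure"
    else:
        return pd.NA

# Apply the classification
df["cat"] = df["主要标的名称"].apply(classify_name)

# Check the result
print(df["cat"].value_counts(dropna=False))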
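
The rule in `classify_name` can be illustrated on a few names. The sketch below reimplements the same lookup without pandas; the three-entry `mapping` is hypothetical (the real one comes from keywords_classified.csv), and `classify_tokens` is an illustrative name.

```python
import re

# Hypothetical keyword table for illustration only
mapping = {"汽油": "货物", "加油服务": "服务", "打印机": "货物"}

def classify_tokens(name, mapping):
    """Exact-token lookup: one category -> that category, several -> 'notsure', none -> None."""
    tokens = re.split(r"[、，,（）() ]", name)
    categories = {mapping[t] for t in tokens if t in mapping}
    if len(categories) == 1:
        return categories.pop()
    return "notsure" if categories else None

print(classify_tokens("汽油、加油服务", mapping))      # tokens fall in two categories
print(classify_tokens("打印机,采购数量1;", mapping))   # one token matches
print(classify_tokens("A4黑白打印机", mapping))        # no delimiter, so no token equals a keyword
```

The last case shows a limit of this rule: a keyword is found only when it appears as a whole delimiter-separated token, whereas the coverage check earlier counts substring matches, so the two figures are not directly comparable.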


In [None]:
df.loc[df['cat'] == 'notsure', 'cat'] = '服务'


In [None]:
df["cat"].value_counts(dropna=False)

In [None]:
df_miss1 = df[df["cat"].isna()]
df_miss1.shape

In [None]:
export_top_keywords(df_miss1, "主要标的名称", "/Users/yxy/UChi/Summer2025/Procurement/dta/keywords_classified.csv", topn=100)

In [None]:
mask = df["主要标的名称"].astype(str).str.contains("汽油", na=False)
result = df.loc[mask]

print("匹配行数:", len(result))
result.head()


In [None]:
import pandas as pd
import re

kw_file = "/Users/yxy/UChi/Summer2025/Procurement/dta/keywords_classified.csv"
kw_df = pd.read_csv(kw_file)

# Skip keywords that have not been labeled yet (blank categories are read as NaN)
kw_df = kw_df.dropna(subset=["keyword", "category"])

mapping = dict(zip(kw_df["keyword"].astype(str), kw_df["category"].astype(str)))

def classify_name(name, current_cat):
    """Keep an existing category; otherwise classify by exact token lookup."""
    if isinstance(current_cat, str) and current_cat.strip() != "":
        return current_cat
    
    if pd.isna(name) or not isinstance(name, str) or name.strip() == "":
        return pd.NA
    
    words = re.split(r"[、，,（）() ]", name)
    categories = set()

    for w in words:
        if w in mapping:
            categories.add(mapping[w])
    
    if len(categories) == 1:
        return categories.pop()
    elif len(categories) > 1:
        return "notsure"
    else:
        return pd.NA

# Apply the classification, filling in only rows whose cat is still missing
df["cat"] = df.apply(lambda row: classify_name(row["主要标的名称"], row.get("cat", "")), axis=1)

# Check the result
print(df["cat"].value_counts(dropna=False))


In [64]:
pip install jieba


Collecting jieba
  Downloading jieba-0.42.1.tar.gz (19.2 MB)
Successfully built jieba
Installing collected packages: jieba
Successfully installed jieba-0.42.1


In [65]:
import jieba

text = "吉林省救援总队2023年全省抢险救灾装备项目（三）"
words = jieba.lcut(text)
print(words)


Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/4g/6_8lhyp147394q93651p5kyw0000gn/T/jieba.cache
Loading model cost 0.316 seconds.
Prefix dict has been built successfully.


['吉林省', '救援', '总队', '2023', '年', '全省', '抢险救灾', '装备', '项目', '（', '三', '）']
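
jieba returns punctuation, digits, and single characters alongside content words, so some filtering is needed before counting keyword frequencies in 项目名称. A minimal sketch, assuming that tokens of at least two Chinese characters are the useful ones (an assumption, not a rule stated in this notebook); the token list is copied from the output above so the block runs without jieba installed, and `keep_token` is a hypothetical helper.

```python
import re

# Token list copied from the jieba output above
tokens = ['吉林省', '救援', '总队', '2023', '年', '全省', '抢险救灾', '装备', '项目', '（', '三', '）']

def keep_token(tok):
    """Keep tokens made of at least two CJK characters; drops digits, punctuation, single characters."""
    return re.fullmatch(r"[\u4e00-\u9fff]{2,}", tok) is not None

content_tokens = [t for t in tokens if keep_token(t)]
print(content_tokens)
```

Place names such as 吉林省 pass this filter but carry no information about the procurement category, so a stop list of regions and agency words would probably be needed as well.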
