## 4.1 Lexical features (all texts)
### 4.1.1 Basic features

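Step 2: Count word frequencies over the annotated corpus to obtain how often each feature word occurs in the texts of each class, then filter out the feature words whose frequency is below 5 in every class.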

In [1]:
import pandas as pd 

df = pd.read_excel('../data/最终版标注数据0110.xlsx', sheet_name='正反一起')

In [2]:
print(df.count())
df.head()

医生信息       74888
科室         74888
页面网址_编号    74888
文本编号       74888
身份         74888
标注         74888
对话内容       74888
dtype: int64


Unnamed: 0,医生信息,科室,页面网址_编号,文本编号,身份,标注,对话内容
0,肿瘤及防治科 医师 淄博市中心医院 陈映亮,肿瘤内科,https://www.chunyuyisheng.com/pc/qa/OOOLy5PcCo...,2,陈映亮医生,N,化疗/有没有/定期/复查
1,肿瘤及防治科 医师 淄博市中心医院 陈映亮,肿瘤内科,https://www.chunyuyisheng.com/pc/qa/OOOLy5PcCo...,3,陈映亮医生,N,检查/单子/发过来
2,肿瘤及防治科 医师 淄博市中心医院 陈映亮,肿瘤内科,https://www.chunyuyisheng.com/pc/qa/OOOLy5PcCo...,5,陈映亮医生,N,最近/一次/检查
3,肿瘤及防治科 医师 淄博市中心医院 陈映亮,肿瘤内科,https://www.chunyuyisheng.com/pc/qa/OOOLy5PcCo...,9,陈映亮医生,N,骨髓/穿刺/做/检查
4,肿瘤及防治科 医师 淄博市中心医院 陈映亮,肿瘤内科,https://www.chunyuyisheng.com/pc/qa/OOOLy5PcCo...,11,陈映亮医生,N,现在/需要/等待/病理


In [2]:
from collections import Counter

Ycounter = Counter()
Ncounter = Counter()

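# Count word frequencies separately for each class
counters = {'Y': Ycounter, 'N': Ncounter}
for row in df.itertuples(index=False):
    label = getattr(row, '标注')
    seglst = str(getattr(row, '对话内容')).split('/')
    # Look the counter up by label instead of building code for exec()
    counters[label].update(seglst)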

In [4]:
df_Ytf = pd.DataFrame.from_dict(Ycounter, orient='index', columns=['TF'])
df_Ytf.to_excel('../data/Ytf.xlsx')
df_Ntf = pd.DataFrame.from_dict(Ncounter, orient='index', columns=['TF'])
df_Ntf.to_excel('../data/Ntf.xlsx')

In [5]:
sum_counter_all = Ycounter + Ncounter

# Drop the feature words whose frequency is below 5 in every class
for word in sum_counter_all:
    if Ycounter[word] < 5 and Ncounter[word] < 5:
        # pop() with a default does nothing when the word is absent
        Ycounter.pop(word, None)
        Ncounter.pop(word, None)
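
The counting and filtering in Step 2 can be tried on a small scale. The following is a minimal sketch, not the notebook's own code: the toy corpus, the English tokens, and the threshold `min_freq = 2` are all assumptions chosen for illustration (the notebook uses a threshold of 5).

```python
from collections import Counter

# Hypothetical toy corpus of (label, '/'-segmented text) pairs
samples = [
    ('Y', 'chemo/effect/good'),
    ('Y', 'chemo/recheck'),
    ('N', 'exam/recheck'),
    ('N', 'exam/report'),
]
min_freq = 2  # illustrative threshold; the notebook uses 5

# One counter per class, filled from the segmented texts
counters = {'Y': Counter(), 'N': Counter()}
for label, text in samples:
    counters[label].update(text.split('/'))

# Keep a word if it reaches min_freq in at least one class
vocab = set(counters['Y']) | set(counters['N'])
kept = {w for w in vocab if any(c[w] >= min_freq for c in counters.values())}
print(sorted(kept))
```

Note that `recheck` occurs twice in total but only once per class, so it is dropped: the rule tests each class separately rather than the combined count.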

Step 3: Normalize the frequency of each feature word across the classes.

In [6]:
group = df.groupby('标注')
group.size()

标注
N    69920
Y     4968
dtype: int64

In [7]:
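# Number of texts in each class
numCY = group.size().Y
numCN = group.size().N
sum_counter_filter = Ycounter + Ncounter

# For each word, take its mean frequency per text in each class,
# then rescale the two means so that they sum to 1
norm = dict()
for word in sum_counter_filter:
    meanY = Ycounter[word] / numCY
    meanN = Ncounter[word] / numCN
    sum_mean = meanY + meanN
    norm[word] = [meanY/sum_mean, meanN/sum_mean]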
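
Because class N is roughly 14 times larger than class Y, dividing by the class size before comparing matters. The sketch below restates the normalization as a small function and applies it to one word. The class sizes are taken from the `groupby` output above; the word counts (50 and 500) and the function name `normalize` are hypothetical.

```python
def normalize(tf_y, tf_n, num_y, num_n):
    """Return [share_Y, share_N] of a word's mean frequency per text."""
    mean_y = tf_y / num_y
    mean_n = tf_n / num_n
    total = mean_y + mean_n
    return [mean_y / total, mean_n / total]

# Class sizes from the notebook; the word counts are hypothetical
share_y, share_n = normalize(tf_y=50, tf_n=500, num_y=4968, num_n=69920)
print(round(share_y, 3), round(share_n, 3))
```

Although this hypothetical word is ten times more frequent in N by raw count, its normalized share is larger in Y, because Y contains far fewer texts.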

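Step 4: For each class, following a given threshold rule, extract high-frequency feature words in descending order of normalized frequency, in a number corresponding to the number of texts.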

In [8]:
cnt_df = pd.DataFrame(norm, index=['Y', 'N']).T
cnt_df.head()

Unnamed: 0,Y,N
肿瘤,0.412084,0.587916
生存期,0.484718,0.515282
主要,0.498325,0.501675
治疗,0.584649,0.415351
效果,0.397391,0.602609


In [9]:
n = 330

# Take the top-n words of the positive and the negative class respectively
Ywords = cnt_df['Y'].sort_values(ascending=False).head(n).index
Nwords = cnt_df['N'].sort_values(ascending=False).head(n).index

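Step 5: Merge the high-frequency feature words extracted for all classes in the previous step into the high-frequency feature word set wordset.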

In [10]:
wordset = set(Ywords) | set(Nwords)

with pd.ExcelWriter('../data/word_norm.xlsx') as writer:
    cnt_df.to_excel(writer, sheet_name='Sheet1')
    # Recent pandas versions reject a set as an indexer, so pass a list
    cnt_df.loc[list(wordset)].to_excel(writer, sheet_name='基本特征词')
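
Steps 4 and 5 amount to ranking the words once per class and taking the union of the two top-n lists. A minimal sketch with plain dictionaries, assuming a hypothetical `norm` table of five placeholder words and `n = 2`:

```python
# Hypothetical normalized frequencies: word -> [share_Y, share_N]
norm = {
    'w1': [0.90, 0.10],
    'w2': [0.60, 0.40],
    'w3': [0.50, 0.50],
    'w4': [0.20, 0.80],
    'w5': [0.05, 0.95],
}
n = 2  # illustrative; the notebook uses n = 330

# Rank by the share in each class and keep the n highest
top_y = sorted(norm, key=lambda w: norm[w][0], reverse=True)[:n]
top_n = sorted(norm, key=lambda w: norm[w][1], reverse=True)[:n]

# The basic feature word set is the union of both lists
wordset = set(top_y) | set(top_n)
print(sorted(wordset))
```

Words with a balanced share, such as `w3` here, end up in neither list. In the notebook the TF-IDF matrix has 660 columns for n = 330, so the two lists did not overlap there.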

Step 7: Compute TF-IDF

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

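# Build the vocabulary index table: the key is the word, the value is its index.
# Sorting the set makes the column order reproducible across runs.
vocabulary = {word: i for i, word in enumerate(sorted(wordset))}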

In [12]:
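# Replace the '/' separators in each dialogue with ' '
corpus = df['对话内容'].apply(lambda sent: str(sent).replace('/', ' '))
# The default token_pattern drops single-character tokens and lowercase=True
# lowercases the text, so single-character words or uppercase abbreviations
# in the vocabulary would never match; keep every whitespace-separated token as-is
tfidfvector = TfidfVectorizer(vocabulary=vocabulary, token_pattern=r'(?u)\S+', lowercase=False)
X = tfidfvector.fit_transform(corpus)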
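
To make the weights concrete, the sketch below recomputes TF-IDF by hand for a fixed vocabulary, using the smoothed formula that scikit-learn documents for its default settings: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf_rows`, the toy documents, and the English tokens are assumptions for illustration; it is not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_rows(docs, vocabulary):
    """docs: whitespace-separated texts; vocabulary: ordered list of words."""
    n = len(docs)
    vocab_set = set(vocabulary)
    # Term counts per document, restricted to the fixed vocabulary
    counts = [Counter(t for t in doc.split() if t in vocab_set) for doc in docs]
    # Document frequency: number of documents containing each word
    df = Counter(w for c in counts for w in c)
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocabulary}
    rows = []
    for c in counts:
        row = [c[w] * idf[w] for w in vocabulary]
        length = math.sqrt(sum(v * v for v in row))
        # L2-normalize; a row without vocabulary words stays all zeros
        rows.append([v / length if length else 0.0 for v in row])
    return rows

docs = ['chemo exam', 'exam report', 'hello']  # hypothetical texts
rows = tfidf_rows(docs, ['chemo', 'exam'])
print(rows)
```

In the first row `chemo` receives the larger weight because it appears in fewer documents than `exam`. The third row contains no vocabulary word and is therefore all zeros, which is also what happens to texts in the corpus that contain none of the selected feature words.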

In [13]:
X.shape

(74888, 660)

In [14]:
# Build the column index: vocabulary words ordered by their index
columns = [x[0] for x in sorted(vocabulary.items(), key=lambda d:d[1])]
# Basic features
df_base = pd.DataFrame(X.toarray(), columns=columns)

In [15]:
df_base.to_csv('../data/tfidf.csv')

### 4.1.2 Extended features (normalized)
Steps 1-5

In [20]:
import itertools

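# Set of association rules: the key is the rule tuple, the value is its count
rule_dict = Counter()

for sent in df['对话内容']:
    seg = set(str(sent).split('/'))
    # Words of the text that are basic features, and those that are not
    baseset = seg & wordset
    exdset = seg - wordset
    # update() counts in place instead of building a new Counter per text
    rule_dict.update(itertools.product(baseset, exdset))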

In [26]:
# Compute the confidence of each rule and store it as the value
for rule in rule_dict:
    # The first word of the rule tuple is the base word, the second is the extended word
    rule_dict[rule] /= sum_counter_filter[rule[0]]

df_rule = pd.DataFrame.from_dict(rule_dict, orient='index')
df_rule.to_excel('../data/base_rule.xlsx')
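
The rule mining above pairs every basic feature word in a text with every other word in the same text, then turns the pair counts into confidences. The sketch below shows the idea on three hypothetical texts. One deliberate difference, stated as an assumption: it divides by the number of texts containing the base word, whereas the notebook cell divides by the base word's total term frequency in `sum_counter_filter`.

```python
from collections import Counter
from itertools import product

texts = ['chemo/nausea/exam', 'chemo/nausea', 'exam/report']  # hypothetical
wordset = {'chemo', 'exam'}  # hypothetical basic feature words

pair_count = Counter()  # (base, extend) -> number of texts with both words
base_count = Counter()  # base -> number of texts containing it
for text in texts:
    seg = set(text.split('/'))
    baseset = seg & wordset
    exdset = seg - wordset
    base_count.update(baseset)
    pair_count.update(product(baseset, exdset))

# Confidence of base -> extend: share of the base word's texts that also contain extend
confidence = {rule: n / base_count[rule[0]] for rule, n in pair_count.items()}
print(confidence)
```

Building a new dictionary instead of dividing in place also means the cell can be re-run without dividing the same values twice.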

Steps 6-7

In [15]:
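from collections import defaultdict
from scipy import sparse

def calc_exd_V(data, base_matrix, base_word2id, rule_dict, conf):
    """
    data: Series of '/'-segmented texts
    base_matrix: basic feature matrix
    base_word2id: word index table of the basic features
    rule_dict: rule table storing the confidence of each rule
    conf: confidence threshold; rules below it are ignored

    return:
        exd_word2id: word index table of the extended features
        exd_matrix: extended feature matrix
    """

    # Filter the rules by the confidence threshold
    rule_dict = {k: v for k, v in rule_dict.items() if v >= conf}
    exd_word2id = {word: i for i, word in enumerate(sorted({k[1] for k in rule_dict}))}
    # Group the rules by base word so a lookup does not scan every rule
    rules_by_base = defaultdict(list)
    for (base, exd), c in rule_dict.items():
        rules_by_base[base].append((exd_word2id[exd], c))
    # Create the sparse extended feature matrix
    exd_matrix = sparse.lil_matrix((len(data), len(exd_word2id)))

    # enumerate() iterates by position, so a non-default index on data is safe
    for row_idx, sent in enumerate(data):
        for word in set(str(sent).split('/')):
            if word in base_word2id:
                # Get the weight from the tfidf matrix
                w = base_matrix[row_idx, base_word2id[word]]
                for col_idx, c in rules_by_base.get(word, ()):
                    # Extended feature weight = basic feature weight * confidence
                    exd_matrix[row_idx, col_idx] += w * c

    return exd_word2id, exd_matrix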
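
The weighting rule inside `calc_exd_V` (extended weight = basic weight × confidence, summed over all rules that lead to the same extended word) can be checked by hand for a single text. The following sketch uses plain dictionaries; the helper name `extend_row`, the weights, and the rules are hypothetical.

```python
def extend_row(base_weights, rules, conf):
    """base_weights: {base_word: tfidf weight} for one text;
    rules: {(base, extend): confidence}; conf: confidence threshold."""
    exd = {}
    for (base, extend), c in rules.items():
        # Apply a rule only if it passes the threshold and its base word is present
        if c >= conf and base in base_weights:
            exd[extend] = exd.get(extend, 0.0) + base_weights[base] * c
    return exd

base_weights = {'chemo': 0.8, 'exam': 0.6}  # hypothetical TF-IDF weights
rules = {
    ('chemo', 'nausea'): 0.9,
    ('exam', 'nausea'): 0.5,   # below the threshold, ignored
    ('exam', 'report'): 0.7,
}
exd = extend_row(base_weights, rules, conf=0.6)
print(exd)
```

Here `nausea` receives 0.8 × 0.9 from `chemo` only, and `report` receives 0.6 × 0.7 from `exam`. Raising `conf` shrinks both the number of rules and the number of extended feature columns.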

### 4.1.3 Lexical features fusing the basic and extended features

In [19]:
from sklearn import preprocessing

# Basic features
df_base = pd.DataFrame(X.toarray())

for conf in [0.5, 0.6, 0.7, 0.8, 0.9]:
    exd_word2id, exd_matrix = calc_exd_V(df['对话内容'], X, vocabulary, rule_dict, conf)
    df_exd = pd.DataFrame(exd_matrix.toarray())

    # Concatenate the basic and extended features
    df_all = pd.concat([df_base, df_exd], axis=1)
    # Build the column index
    columns = [x[0] for x in sorted(vocabulary.items(), key=lambda d:d[1])] + [x[0] for x in sorted(exd_word2id.items(), key=lambda d:d[1])]
    # Normalize
    scaler = preprocessing.MinMaxScaler()
    df_all = pd.DataFrame(scaler.fit_transform(df_all), columns=columns)
    
    df_all.to_csv('../data/词汇特征%s%%.csv' % (conf * 100))
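
The fusion step places the extended columns next to the basic ones and then rescales every column to the range [0, 1]. A minimal pure-Python sketch of that column-wise min-max scaling, assuming hypothetical two-row matrices; mapping constant columns to 0 is a choice made here to avoid division by zero.

```python
def min_max_columns(rows):
    """Rescale each column of a list-of-lists matrix to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        # A constant column has span 0; map it to 0.0 instead of dividing
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]

base = [[0.2, 0.0], [0.8, 0.5]]  # hypothetical basic feature rows
exd = [[0.1], [0.1]]             # hypothetical extended feature rows

# Concatenate column-wise, then scale
fused = [b + e for b, e in zip(base, exd)]
scaled = min_max_columns(fused)
print(scaled)
```

Because the scaling is per column, a rarely activated extended feature with small raw weights is stretched to the same [0, 1] range as the basic TF-IDF columns.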