
# NLP-Based User Profiling for a Search Engine Company: Modeling (Age, Query List) with Machine Learning (Naive Bayes)


# Datasets
|Data file|Notes|
|----  | ----  |
|train.csv|Labeled training set|
|test.csv|Test set|

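# Data Description
The data comes from real search engine logs. IDs are encrypted, and some demographic attributes in the training set are unknown. All fields are listed in the table below: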

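|Field|Description|
|----  | ----  |
|ID|Encrypted user ID|
|Age|0: unknown; 1: 0-18; 2: 19-23; 3: 24-30; 4: 31-40; 5: 41-50; 6: 51-999 (years)|
|Gender|0: unknown; 1: male; 2: female|
|Education|0: unknown; 1: doctorate; 2: master's; 3: university; 4: senior high school; 5: junior high school; 6: primary school|
|Query List|List of search queries|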
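
The integer codes in the field table can be mapped back to readable brackets with a small lookup. The sketch below is only an illustration: `AGE_LABELS` and `decode_age` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical lookup for the Age codes listed in the field table.
AGE_LABELS = {
    0: "unknown",
    1: "0-18",
    2: "19-23",
    3: "24-30",
    4: "31-40",
    5: "41-50",
    6: "51-999",
}

def decode_age(code):
    """Return the age bracket for a code, or raise on an unexpected value."""
    if code not in AGE_LABELS:
        raise ValueError(f"unexpected Age code: {code!r}")
    return AGE_LABELS[code]

print(decode_age(3))
```

Raising on an unexpected code, rather than returning a default, makes malformed rows visible early.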

## Data Overview

In [52]:
import warnings
warnings.filterwarnings("ignore")
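import jieba    # Chinese word segmentation
import numpy    # numerical computing
import codecs   # codecs.open reads a file with an explicit encoding and decodes it to unicode
import pandas as pd
# A multi-character separator is treated as a regex and needs the Python parsing engine
train = pd.read_csv("./data/train.csv", sep="###__###", header=None, encoding='utf-8', engine='python')
test = pd.read_csv("./data/test.csv", sep="###__###", header=None, encoding='utf-8', engine='python')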

In [53]:
train.columns = ['ID', 'Age', 'Gender', 'Education', 'Query_List']
train.head()

Unnamed: 0,ID,Age,Gender,Education,Query_List
0,22DD920316420BE2DF8D6EE651BA174B,1,1,4,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...
1,43CC3AF5A8D6430A3B572337A889AFE4,2,1,3,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只..."
2,E97654BFF5570E2CCD433EA6128EAC19,4,1,0,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...
3,6931EFC26D229CCFCEA125D3F3C21E57,4,2,3,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...
4,E780470C3BB0D340334BD08CDCC3C71A,2,2,4,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...


In [54]:
test.columns = ['ID', 'Query_List']
test.head()

Unnamed: 0,ID,Query_List
0,ED89D43B9F602F96D96C25255F7C228C,陈学冬将出的作品\t刘昊然与谭松韵\t211学校的分数线\t谁唱的味道好听\t吻戏是真吻还是...
1,83C3B7B4AAF8074655A8079F561A76D6,e的0.0052次方\tqq怎么快速提现\t绝色倾城飞烟\t马克思主义基本原理概论\t康世恩...
2,CA9F675A024FB2353849350A35CF8B0F,黑暗文\tlpl夏季赛\t大富豪电玩城\t英雄联盟之电竞称王\t手机怎么扫描手机上的二维码\...
3,DE45B5C4E57AAEBCF3FDFA2A774093BF,中秋水库钓鱼\t鱼竿\t用蚯蚓钓鱼怎样调漂\t传统钓\t3号鱼钩\t鲫鱼汤的做法大全\t鱼饵...
4,406A681FB3DF81EC0E561796AE50AE50,号码吉凶\t退休干部死后配偶\t郫县有哪些大学\t胜利油田属于中石化还是中石油\t苏珊米勒狮...


In [55]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   ID          100000 non-null  object
 1   Age         100000 non-null  int64 
 2   Gender      100000 non-null  int64 
 3   Education   100000 non-null  int64 
 4   Query_List  100000 non-null  object
dtypes: int64(3), object(2)
memory usage: 3.8+ MB


In [56]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   ID          100000 non-null  object
 1   Query_List  100000 non-null  object
dtypes: object(2)
memory usage: 1.5+ MB


## Data Cleaning

In [57]:
# Keep only rows with a known age label (Age != 0)
train_age = train.loc[(train['Age']!=0),['Age','Query_List']]
train_age.head()

Unnamed: 0,Age,Query_List
0,1,柔和双沟\t女生\t中财网首页 财经\thttp://pan.baidu.com/s/1pl...
1,2,"广州厨宝烤箱\t世情薄,人情恶,雨送黄昏花易落,晓风干,泪痕\t厦门酒店用品批发市场\t我只..."
2,4,钻石之泪耳机\t盘锦到沈阳\t旅顺公交\t辽宁阜新车牌\tbaidu\tk715\tk716...
3,4,最受欢迎狗狗排行榜\t舶怎么读\t场景描 写范例\t三维绘图软件\t枣和酸奶能一起吃吗\t好...
4,2,干槽症能自愈吗\t太太万岁叶舒心去没去美国\t干槽症\t右眼皮下面一直跳是怎么回事\t麦当劳...
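
The cleaning step drops rows whose age is unknown (code 0). As a plain-Python illustration of the same idea, the sketch below filters a handful of made-up `(Age, Query_List)` rows and counts the remaining classes. The rows and the helper name `drop_unknown` are assumptions for the example only.

```python
from collections import Counter

# Toy (Age, Query_List) rows; the values are made up for illustration.
rows = [
    (1, "query a\tquery b"),
    (0, "query c"),
    (2, "query d\tquery e"),
    (2, "query f"),
    (0, "query g"),
]

def drop_unknown(rows, unknown=0):
    """Keep only rows whose label is known (label != unknown)."""
    return [(age, queries) for age, queries in rows if age != unknown]

labelled = drop_unknown(rows)
class_counts = Counter(age for age, _ in labelled)
print(len(labelled), dict(class_counts))
```

On the real data, `train_age['Age'].value_counts()` gives the equivalent class distribution, which is worth checking before training because the age brackets may be imbalanced.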


## Word Segmentation with jieba

In [58]:
# Collect the query lists of each age group as Python lists
train_age1 = train_age.loc[(train_age['Age']==1),['Age','Query_List']]
train_age1 = train_age1.Query_List.values.tolist()
train_age2 = train_age.loc[(train_age['Age']==2),['Age','Query_List']]
train_age2 = train_age2.Query_List.values.tolist()
train_age3 = train_age.loc[(train_age['Age']==3),['Age','Query_List']]
train_age3 = train_age3.Query_List.values.tolist()
train_age4 = train_age.loc[(train_age['Age']==4),['Age','Query_List']]
train_age4 = train_age4.Query_List.values.tolist()
train_age5 = train_age.loc[(train_age['Age']==5),['Age','Query_List']]
train_age5 = train_age5.Query_List.values.tolist()
train_age6 = train_age.loc[(train_age['Age']==6),['Age','Query_List']]
train_age6 = train_age6.Query_List.values.tolist()

In [59]:
# Stopwords
stopwords=pd.read_csv("./stopwords-master/cn_stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords=set(stopwords['stopword'].values)  # a set gives fast membership tests

In [60]:
# Segment each line with jieba and remove stopwords (target_path is unused)
def preprocess_text(content_lines, sentences, category, target_path):
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            segs = list(filter(lambda x:len(x)>1, segs)) # drop tokens of length 1 (e.g. the tab separators)
            segs = list(filter(lambda x:x not in stopwords, segs)) # drop stopwords
            sentences.append((" ".join(segs), category))
        except Exception as e:
            print(line)
            continue

# Build the training data
sentences = []
preprocess_text(train_age1, sentences, 1, '../data/data')
preprocess_text(train_age2, sentences, 2, '../data/data')
preprocess_text(train_age3, sentences, 3, '../data/data')
preprocess_text(train_age4, sentences, 4, '../data/data')
preprocess_text(train_age5, sentences, 5, '../data/data')
preprocess_text(train_age6, sentences, 6, '../data/data')

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\15243\AppData\Local\Temp\jieba.cache
Loading model cost 0.952 seconds.
Prefix dict has been built successfully.
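
The filtering inside `preprocess_text` can be seen in isolation on tokens that are already segmented. The sketch below does not call jieba; the token list, the stopword set, and the helper name `filter_tokens` are made up for illustration.

```python
def filter_tokens(tokens, stopwords, min_len=2):
    """Drop tokens shorter than min_len and tokens found in stopwords."""
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

# Hypothetical pre-segmented tokens; in the notebook these come from jieba.lcut.
tokens = ["the", "weather", "in", "a", "Beijing", "\t", "today"]
toy_stopwords = {"the", "in"}

kept = filter_tokens(tokens, toy_stopwords)
# (space-joined text, age label), the same shape preprocess_text appends
sample = (" ".join(kept), 3)
print(sample)
```

Joining the surviving tokens with spaces matters later: a word-level vectorizer splits on whitespace, so this step is what turns segmented Chinese text into something it can tokenize.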


In [63]:
# Shuffle the samples
import random
random.shuffle(sentences)
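
`random.shuffle` without a seed gives a different order on every run, so downstream scores are not exactly repeatable. One possible way to make the shuffle reproducible is a dedicated seeded generator, sketched below. The helper name `shuffled_copy` and the seed value are arbitrary assumptions.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result

samples = [("text a", 1), ("text b", 2), ("text c", 3), ("text d", 4)]
first = shuffled_copy(samples, seed=42)
second = shuffled_copy(samples, seed=42)
print(first == second)
```

Using a private `random.Random` instance avoids changing the global random state that other cells may rely on.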

## Building the Model

In [65]:
from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
print(len(y))

98334


In [None]:
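# Extract n-gram count features
from sklearn.feature_extraction.text import CountVectorizer

# x_train is not defined in the cells shown; a hold-out split is assumed here (random_state is arbitrary)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1234)

vec = CountVectorizer(
    analyzer='word', # tokenise by word n-grams
    ngram_range=(1,4),  # use n-grams of size 1, 2, 3, 4
    max_features=20000,  # keep the 20000 most common n-grams
)
vec.fit(x_train)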
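
To make the `ngram_range` and `max_features` settings concrete, here is a plain-Python approximation of word n-gram counting. It is a sketch, not scikit-learn's implementation: it splits on whitespace only, whereas `CountVectorizer` applies its own token pattern and tie-breaking. The helper names `word_ngrams` and `top_ngrams` are hypothetical.

```python
from collections import Counter

def word_ngrams(text, min_n=1, max_n=4):
    """Return all word n-grams of text for n in [min_n, max_n]."""
    words = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

def top_ngrams(texts, max_features, min_n=1, max_n=4):
    """Keep the max_features most frequent n-grams across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(word_ngrams(text, min_n, max_n))
    return [gram for gram, _ in counts.most_common(max_features)]

docs = ["fishing rod bait", "fishing rod"]
print(word_ngrams(docs[0], 1, 2))
print(top_ngrams(docs, max_features=3, max_n=2))
```

With n-grams up to length 4 the candidate vocabulary grows quickly, which is why a cap such as `max_features` is useful on 100,000 query lists.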

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score
import numpy as np

def stratifiedkfold_cv(x, y, clf_class, shuffle=True, n_folds=5, **kwargs):
    """Return out-of-fold predictions from stratified k-fold cross-validation."""
    stratifiedk_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    y_pred = y.copy()  # copy so that writing predictions does not overwrite the labels
    for train_index, test_index in stratifiedk_fold.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

NB = MultinomialNB
# Macro-averaged precision of the out-of-fold predictions
print(precision_score(y, stratifiedkfold_cv(vec.transform(x),np.array(y),NB), average='macro'))
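
For readers who want to see what a multinomial Naive Bayes classifier computes, the sketch below implements the idea in plain Python: each class is scored by its log prior plus the sum of smoothed log word likelihoods. This is an illustration under simplifying assumptions (whitespace tokens, add-alpha smoothing, unseen words skipped), not scikit-learn's `MultinomialNB` code. The toy texts and the names `train_nb` and `predict_nb` are made up.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha=1.0):
    """Fit multinomial Naive Bayes on (text, label) pairs with add-alpha smoothing."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        words = text.split()
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_docs.values())
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / total_docs),
            "log_likelihood": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model

def predict_nb(model, text):
    """Return the label with the highest log posterior; unseen words are skipped."""
    best_label, best_score = None, -math.inf
    for label, params in model.items():
        score = params["log_prior"]
        for w in text.split():
            if w in params["log_likelihood"]:
                score += params["log_likelihood"][w]
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up (text, age code) pairs, shaped like the notebook's sentences list.
toy = [
    ("homework exam cartoon", 1),
    ("exam cartoon game", 1),
    ("mortgage pension health", 6),
    ("pension health insurance", 6),
]
model = train_nb(toy)
print(predict_nb(model, "cartoon exam"))
print(predict_nb(model, "pension insurance"))
```

Working in log space avoids numerical underflow when a query list contains many tokens, and the smoothing term keeps a word that never occurred in a class from forcing that class's probability to zero.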