# 模型训练与保存

准备工作：导入训练模型和导出模型所需要的包

In [1]:
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import pickle

import pandas,numpy

## Ⅰ 连接数据库，读入数据。

说明：由于五类文本中【词】的数量最少，不足20000条，为保证输入类型的平衡性，因此读入阈值受【词】的数量限制。

In [2]:
import sqlite3
conn = sqlite3.connect("../data/database/texts.db")

print ('Opened database successfully')
c = conn.cursor()

ci_data=[]
cursor = c.execute("SELECT * from ci order by id asc limit 20000")
for row in cursor:
    ci_data.append(row)
print("**读取的词数据样例：")
print(ci_data[-2:])

poet_data = []
cursor=c.execute("SELECT * FROM poet order by id asc limit 20000")
for row in cursor:
    poet_data.append(row)
print("\n**读取的诗数据样例：")
print(poet_data[-2:])

classical_data = []
cursor=c.execute("SELECT * FROM classical order by id asc limit 20000")
for row in cursor:
    classical_data.append(row)
print("\n**读取的文言文数据样例：")
print(classical_data[-2:])

journal_data = []
cursor=c.execute("SELECT * FROM journal order by id asc limit 20000")
for row in cursor:
    journal_data.append(row)
print("\n**读取的期刊数据样例：")
print(journal_data[:2])
    
news_data = []
cursor=c.execute("SELECT * FROM news order by id asc limit 20000")
for row in cursor:
    news_data.append(row)
print("\n**读取的新闻数据样例：")
print(news_data[:2])

print("所有类型数据读取成功，各20000条。")

Opened database successfully
**读取的词数据样例：
[('50021047', '楼台里，春风淡荡。  ', 'ci', 21047), ('50021048', '乍卷珠帘新燕入。  ', 'ci', 21048)]

**读取的诗数据样例：
[(40020639, '白团扇子白纱衣，怕见萤光作火吹。 屈指西风明日是，会将束缊乞怜时。 ', 'poet', 'c69b266a-958f-4a6f-ba1a-ee239d76ecf1'), (40020640, '一叶梧桐一片云，秋风江上日纷纷。 鸦来鸦去吾何预，不向声尘著意闻。 ', 'poet', 'cd2f6d52-e6ce-4fd3-875a-981056a4d579')]

**读取的文言文数据样例：
[('10023324', '\u3000\u3000帝曰：願聞其異狀也。岐伯曰：陽者天氣也，主外；陰者地氣也，主內。故陽道實，陰道虛。故犯賊風虛邪者陽受之，食飲不節，起居不時者，陰受之。陽受之則入六腑，陰受之則入五臟。入六腑則身熱不時臥，上為喘呼；入五臟則瞋滿閉塞，下為飧泄，久為腸澼。故喉主天氣，咽主地氣。故陽受風氣，陰受濕氣。\u3000\u3000故陰氣從足上行至頭，而下行循臂至指端；陽氣從手上行至頭，而下行至足。故曰陽病者上行極而下，陰病者下行極而上。故傷於風者上先受之，傷於濕者，下先受之。\n', 'classical', 'c_list_111.txt'), ('10023325', '\u3000\u3000帝曰：脾病而四肢不用何也？岐伯曰：四肢皆稟氣於胃而不得至經，必因於脾乃得稟也。今脾病不能為胃行其津液，四肢不得稟水谷氣，氣日以衰，脈道不利，筋骨肌內，皆無氣以生，故不用焉。\n', 'classical', 'c_list_111.txt')]

**读取的期刊数据样例：
[(20000001, '作者程苏东, 北京大学中文系讲师 (北京100871) 。\n', 'modern', 'j_list_1.txt'), (20000002, '“失控的文本”这一概念是我在研究《汉书·五行志》体例问题时产生的想法。班固以刘向《洪范五行传论》为基础, 试图纂集董仲舒《春秋》灾异说、许商《五行传记》、刘歆《洪范五行传论》等不同系统的灾异学论著, 整合成具有集大成性质的西

受平台性能限制，将各类数据先做分割，之后对模型进行增量学习。

In [3]:
data = ci_data[:4000] + poet_data[:4000] + classical_data[:4000] + journal_data[:4000] + news_data[:4000]
data2 = ci_data[4000:8000] + poet_data[4000:8000] + classical_data[4000:8000] + journal_data[4000:8000] + news_data[4000:8000]
data3 = ci_data[8000:12000] + poet_data[8000:12000] + classical_data[8000:12000] + journal_data[8000:12000] + news_data[8000:12000]
data4 = ci_data[12000:16000] + poet_data[12000:16000] + classical_data[12000:16000] + journal_data[12000:16000] + news_data[12000:16000]
data5 = ci_data[16000:] + poet_data[16000:] + classical_data[16000:] + journal_data[16000:] + news_data[16000:]

## Ⅱ 数据预处理

### 1. 将语篇/句子拆分为单字

说明：一般而言，文本分类中提取文本特征会将文本化为词向量，因此分词是必要步骤。但是本例中，文本类型涵盖五种，其中三种为非现代文。使用目前所了解到的分词工具对于诗、词、文言文的分词效果不佳，而测试时输入的文本是未知类型的，不能够“对症下药”而只能“一视同仁”。因此本例将字作为粒度对于句子进行分割。

In [4]:
def split_spaces(data): ## 输入未处理的数据，第二列为文本，返回分割好第二列的列表。
    data_list = []
    for line in data:
        data_list.append(list(line))
        
    for item in data_list:
        item[1] = ' '.join(list(item[1]))
    return data_list

### 2. 去除符号

说明：降低句子中符号的干扰。

In [5]:
def desymbol(data):
    import re
    data_desymbol = []
    for line in data:
        data_desymbol.append(list(line))
    for row in data_desymbol:
        row[1] = ' '.join(list(re.sub("[\s+\.\!\/_,$%^*)(+\"\']+|[+——！，。？：、~@#￥%……&*（）“”]", "",row[1])))
    return data_desymbol

### 3. 句子头尾增加`<start>` `<end>` 标记

对于部分文本，其头尾的文字是其一大特征，加入标记可以增强这一特征。

In [6]:
def add_start_end(data):
    data_with_se = []
    for line in data:
        data_with_se.append(list(line))
    for row in data_with_se:
        row[1] = "<start> "+row[1]+" <end>"
    return data_with_se

## Ⅲ 特征工程

### 1. 计数向量作为特征

CountVectorizer是属于常见的特征数值计算类，是一个文本特征提取方法。对于每一个训练文本，它**只考虑**每种词汇在该训练文本中出现的**频率**。

CountVectorizer会将文本中的词语转换为词频矩阵，它通过fit_transform函数计算各个词语出现的次数。

这里，我们认为普通的`CountVectorizer`只是对字频进行统计，没有考虑字之间的关系，因此我们修改了参数ngram_range为(1,2)*（默认为1，1）*，这样就同时用到了unigram模型和bigram模型。

> I like eating apple.
>
> ngram_range 1,1 : (`I`) (`like`) (`eating`) (`apple`)
> 
> ngram_range 1,2 : (`I`) (`I`,`like`) (`like`) (`like`,`eating`) (`eating`) (`eating`, `apple`) (`apple`)

In [7]:
def countV(x_train,x_test):
    vectorizer = CountVectorizer(analyzer='char', max_features=5000,ngram_range=(1,2))
    vectorizer.fit(x_train)

    xtrain_count = vectorizer.transform(x_train)
    xtest_count = vectorizer.transform(x_test)
    return xtrain_count, xtest_count

### 2. TF-IDF作为特征

TF-IDF(term frequency-inverse document frequency)词频-逆向文件频率。该模型认为，字词的重要性与其在文本中出现的频率成正比(TF)，与其在语料库中出现的频率成反比(IDF)。

TF为某个词在文章中出现的总次数。为了消除不同文章大小之间的差异，便于不同文章之间的比较，我们在此标准化词频：$TF = \dfrac{某个词在文章中出现的总次数}{文章的总词数}。$

IDF为逆文档频率。逆文档频率 $IDF = log \dfrac{词料库的文档总数}{包含该词的文档数+1} $。

为了避免分母为0，所以在分母上加1。

$TF-IDF值 = TF * IDF。$

这里我们同时也加入了ngram参数，原理同上。

In [8]:
def tfIdfV(x_train,x_test):
    tfidf_vect_ngram = TfidfVectorizer(analyzer='char', ngram_range=(1,2), max_features=5000, lowercase = False)
    tfidf_vect_ngram.fit(x_train)

    x_train_tfidf = tfidf_vect_ngram.transform(x_train)
    x_test_tfidf = tfidf_vect_ngram.transform(x_test)
    
    return x_train_tfidf, x_test_tfidf

## Ⅳ 模型训练

### 1. 朴素贝叶斯模型

In [9]:
def train_NB(x_train,x_test,y_train,y_test): 
    model_NB = MultinomialNB()
    model_NB.fit(x_train, y_train)
    score = model_NB.score(x_test, y_test)
    return model_NB,score

def partial_train_NB(x_train,x_test,y_train,y_test,model_NB): # 增量训练，重复调用可用多批数据训练同一个模型
    model_NB.partial_fit(x_train,y_train,classes=['modern','ci','classical','poet'],sample_weight=None)
    score = score = model_NB.score(x_test, y_test)
    return model_NB, score

### 2. 线性回归模型

In [10]:
def train_LR(x_train,x_test,y_train,y_test):
    model_LR = LogisticRegression()
    model_LR.fit(x_train, y_train)
    score = model_LR.score(x_test, y_test)
    return model_LR, score
    
def partial_train_LR(x_train,x_test,y_train,y_test,model_LR):
    model_LR.fit(x_train, y_train)
    score = model_LR.score(x_test, y_test)
    return model_LR, score


## Ⅴ 数据实操与模型保存

### 建立拆分训练、测试集


In [11]:
def datasplit(data_list):
    train_data=[]
    train_target=[]
    for row in data_list:
        train_data.append(row[1])
        train_target.append(row[2])
    x_train, x_test, y_train, y_test = train_test_split(train_data, train_target,test_size=0.3,random_state=0) 
    return x_train, x_test, y_train, y_test

### 计数向量+朴素贝叶斯

In [12]:
x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data))))
x_train_cv, x_test_cv = countV(x_train,x_test)

model_NB = MultinomialNB()
model_NB,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB)
print("第1次NB-CV综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data2))))
x_train_cv, x_test_cv = countV(x_train,x_test)
model_NB,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB)
print("第2次NB-CV综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data3))))
x_train_cv, x_test_cv = countV(x_train,x_test)
model_NB,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB)
print("第3次NB-CV综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data4))))
x_train_cv, x_test_cv = countV(x_train,x_test)
model_NB,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB)
print("第4次NB-CV综合训练得分："+str(score))

第1次NB-CV综合训练得分：0.9001666666666667
第2次NB-CV综合训练得分：0.9291666666666667
第3次NB-CV综合训练得分：0.9223333333333333
第4次NB-CV综合训练得分：0.8621666666666666


### TF-IDF + 朴素贝叶斯 

In [13]:
x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data))))
x_train_ti , x_test_ti = tfIdfV(x_train,x_test)

model_NB2 = MultinomialNB()
model_NB2,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB2)
print("第1次NB-TFIDF综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data2))))
x_train_ti , x_test_ti = tfIdfV(x_train,x_test)
model_NB2,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB2)
print("第2次NB-TFIDF综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data3))))
x_train_ti , x_test_ti = tfIdfV(x_train,x_test)
model_NB2,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB2)
print("第3次NB-TFIDF综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data4))))
x_train_ti , x_test_ti = tfIdfV(x_train,x_test)
model_NB2,score = partial_train_NB(x_train_cv, x_test_cv, y_train, y_test, model_NB2)
print("第4次NB-TFIDF综合训练得分："+str(score))

第1次NB-TFIDF综合训练得分：0.9246666666666666
第2次NB-TFIDF综合训练得分：0.9253333333333333
第3次NB-TFIDF综合训练得分：0.925
第4次NB-TFIDF综合训练得分：0.9251666666666667


### 计数向量 + 线性回归

In [14]:
x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data))))
x_train_cv, x_test_cv = countV(x_train,x_test)

model_LR = LogisticRegression()
model_LR,score = partial_train_LR(x_train_cv, x_test_cv, y_train, y_test, model_LR)
print("第1次LR-CV综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data2))))
x_train_cv, x_test_cv = countV(x_train,x_test)
model_LR,score = partial_train_LR(x_train_cv, x_test_cv, y_train, y_test, model_LR)
print("第2次LR-CV综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data3))))
x_train_cv, x_test_cv = countV(x_train,x_test)
model_LR,score = partial_train_LR(x_train_cv, x_test_cv, y_train, y_test, model_LR)
print("第3次LR-CV综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data4))))
x_train_cv, x_test_cv = countV(x_train,x_test)
model_LR,score = partial_train_LR(x_train_cv, x_test_cv, y_train, y_test, model_LR)
print("第4次LR-CV综合训练得分："+str(score))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


第1次LR-CV综合训练得分：0.9223333333333333


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


第2次LR-CV综合训练得分：0.9493333333333334


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


第3次LR-CV综合训练得分：0.9578333333333333
第4次LR-CV综合训练得分：0.9426666666666667


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [15]:
x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data))))
x_train_ti, x_test_ti = tfIdfV(x_train,x_test)

model_LR2 = LogisticRegression()
model_LR2,score = partial_train_LR(x_train_ti, x_test_ti, y_train, y_test, model_LR2)
print("第1次LR-TFIDF综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data2))))
x_train_ti, x_test_ti = tfIdfV(x_train,x_test)
model_LR2,score = partial_train_LR(x_train_ti, x_test_ti, y_train, y_test, model_LR2)
print("第2次LR-TFIDF综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data3))))
x_train_ti, x_test_ti = tfIdfV(x_train,x_test)
model_LR2,score = partial_train_LR(x_train_ti, x_test_ti, y_train, y_test, model_LR2)
print("第3次LR-TFIDF综合训练得分："+str(score))

x_train, x_test, y_train, y_test = datasplit(add_start_end(desymbol(split_spaces(data4))))
x_train_ti, x_test_ti = tfIdfV(x_train,x_test)
model_LR2,score = partial_train_LR(x_train_ti, x_test_ti, y_train, y_test, model_LR2)
print("第4次LR-TFIDF综合训练得分："+str(score))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


第1次LR-TFIDF综合训练得分：0.9271666666666667


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


第2次LR-TFIDF综合训练得分：0.949


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


第3次LR-TFIDF综合训练得分：0.9515
第4次LR-TFIDF综合训练得分：0.9396666666666667


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


## Ⅵ 保存模型文件

In [16]:
file = open("../model/NB-CV.pickle", "wb")
pickle.dump(model_NB, file)
file.close()

file = open("../model/NB-TFIDF.pickle", "wb")
pickle.dump(model_NB2, file)
file.close()

file = open("../model/LR-CV.pickle", "wb")
pickle.dump(model_LR, file)
file.close()

file = open("../model/LR-TFIDF.pickle", "wb")
pickle.dump(model_LR2, file)
file.close()

print("4 models saved succesfully. see folder /model")

4 models saved succesfully. see folder /model
