**Baseline: AUC score of roughly 0.93+**

In [2]:
import sklearn
import numpy as np
import pandas as pd
import csv
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.utils.extmath import density
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.model_selection import train_test_split

### Loading the data

The raw data has been lightly cleaned up and saved as CSV files.

Kaggle competition data usually comes as train, test, and sample_submission files.

In [3]:
train = pd.read_csv('./data/Comments/train.csv')
test = pd.read_csv('./data/Comments/test.csv')
sample_submission = pd.read_csv('./data/Comments/sample_submission.csv')

labels = []

In [4]:
train.head()

Unnamed: 0,label,text
0,0,酸菜鱼不错
1,0,轻食素食都是友善的饮食方式
2,0,完爆中午吃的农家乐
3,1,烤鱼很入味
4,0,有种入口即化的感觉


In [5]:
test.head()

Unnamed: 0,text
0,理由很简单
1,蘸着花生酱吃非常美味
2,味道奶香味恰到好处
3,面包片烤的恰到好处
4,属于简单经济型


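### Cleaning the data

A saying that circulates widely among machine learning practitioners: "**Data sets the upper bound on what machine learning can achieve; algorithms only help us approach that bound.**"

A clean dataset is key to applying machine learning algorithms successfully, so preprocessing the text appropriately is a critical step.

Main tasks:
- Word segmentation
- Stop-word removal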

In [6]:
import jieba

# Load the stop-word list into a set for fast membership tests.
with open('./data/Comments/stop_words.txt', 'r', encoding='utf-8') as f:
    stop_words = {line.strip() for line in f}

def segment(text):
    """Segment text with jieba, drop stop words, and join the tokens with spaces."""
    tokens = jieba.cut(text, cut_all=False, HMM=True)
    return ' '.join(tok for tok in tokens if tok not in stop_words)

# Apply the same preprocessing to the training and test text.
corpus_train = [segment(one) for one in train['text']]
corpus_test = [segment(one) for one in test['text']]

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/ml/gyc9l97n0cq3pfrr93x8xmy80000gn/T/jieba.cache
Loading model cost 0.817 seconds.
Prefix dict has been built successfully.


# TfidfTransformer


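TfidfTransformer computes the TF-IDF value of each term from the term counts produced by a vectorizer. The TfidfVectorizer used below combines CountVectorizer and TfidfTransformer in a single step.

Two points to note about how sklearn computes it:
- sklearn uses the natural logarithm (base e), not base 10.
- The smooth_idf parameter defaults to True; setting it to False changes the IDF formula slightly, so the results differ as well.


The resulting vectors are also L2-normalized by default; see the documentation for details: [https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)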
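The points above can be checked by hand on a tiny example. The following is a minimal sketch in plain Python, assuming the formulas given in the scikit-learn documentation: idf(t) = ln((1 + n) / (1 + df(t))) + 1 when smooth_idf=True, idf(t) = ln(n / df(t)) + 1 when smooth_idf=False, followed by L2 normalization of each row. The helper name `tfidf_rows` and the toy count matrix are made up for illustration, and the sketch does not cover options such as sublinear_tf.

```python
import math

def tfidf_rows(counts, smooth_idf=True):
    """Hypothetical helper: TF-IDF with natural log and L2-normalized rows.

    Assumes every term occurs in at least one document.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    if smooth_idf:
        idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    else:
        idf = [math.log(n_docs / d) + 1 for d in df]
    rows = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # L2 normalization: divide each row by its Euclidean length.
        norm = math.sqrt(sum(v * v for v in weighted))
        rows.append([v / norm if norm else 0.0 for v in weighted])
    return rows

# Toy term-count matrix: 3 documents x 3 terms (illustrative data only).
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 2, 0],
]
smooth = tfidf_rows(counts, smooth_idf=True)
plain = tfidf_rows(counts, smooth_idf=False)
print([round(v, 4) for v in smooth[0]])
print([round(v, 4) for v in plain[0]])
```

The two printed rows differ only because of the IDF formula; in both cases each row has unit length after normalization.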

In [7]:
# Extract TF-IDF features

text_vector = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode',token_pattern=r'\w{1,}',
                         max_features=5000, ngram_range=(1, 1), analyzer='word')
# Fit the vocabulary and IDF weights on the train and test text combined.
text_vector.fit(corpus_train + corpus_test)
train_vec = text_vector.transform(corpus_train)
test_vec = text_vector.transform(corpus_test)

In [8]:
# Hold out 10% of the training set for validation; keep the full test matrix as x_test.
x_train, x_valid, y_train, y_valid = train_test_split(train_vec, train['label'], test_size=0.1, random_state=2020)
x_test = test_vec

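# K-fold cross-validation

`class sklearn.model_selection.KFold(n_splits=5, shuffle=False, random_state=None)`

- n_splits is the number of folds to split the data into; the default is 5.
- shuffle controls whether the data is shuffled before being split into folds.
- random_state sets the random seed; it only takes effect when shuffle=True.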
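To make the three parameters concrete, here is a minimal sketch in plain Python of how a K-fold splitter partitions sample indices. `kfold_indices` is a hypothetical helper written for illustration, not the scikit-learn implementation; it mimics the documented behavior that the first `n_samples % n_splits` folds receive one extra sample, and that without shuffling the folds are contiguous blocks in the original row order. The exact order produced by scikit-learn when shuffle=True will differ.

```python
import random

def kfold_indices(n_samples, n_splits=5, shuffle=False, random_state=None):
    """Hypothetical helper: yield (train, valid) index lists for each fold."""
    indices = list(range(n_samples))
    if shuffle:
        # The seed only matters when shuffling is enabled.
        random.Random(random_state).shuffle(indices)
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size

# Without shuffling, each validation fold is a contiguous block of rows.
for train_idx, valid_idx in kfold_indices(10, n_splits=5):
    print("train:", train_idx, "valid:", valid_idx)
```

Because unshuffled folds follow the row order of the file, a dataset that happens to be sorted (for example by label) may be better served by shuffle=True together with a fixed random_state.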

In [11]:
# 5-fold cross-validation
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

kf = KFold(n_splits=5)

accuracy = []
cnt = 0

clf = SVC(probability=True)
for train_index, test_index in kf.split(X=train_vec, y=train['label']):
    cnt += 1
    print('*'*10, 'k =', cnt)
    X_train, X_valid = train_vec[train_index], train_vec[test_index]
    # Use positional indexing so the split does not depend on the DataFrame index.
    y_train, y_valid = train['label'].iloc[train_index], train['label'].iloc[test_index]
    
    clf.fit(X_train, y_train)
    # Hard class predictions for this fold's validation split.
    y_pre = clf.predict(X_valid)

    train_scores = clf.score(X_train, y_train)
    valid_scores = accuracy_score(y_valid, y_pre)
    print("train score is {}, valid score is {}".format(train_scores, valid_scores))
    accuracy.append(valid_scores)

print("Total cv accuracy is {}".format(np.mean(accuracy)))

********** k = 1
train score is 0.956015625, valid score is 0.8671875
********** k = 2
train score is 0.952890625, valid score is 0.8684375
********** k = 3
train score is 0.953984375, valid score is 0.8734375
********** k = 4
train score is 0.95421875, valid score is 0.8709375
********** k = 5
train score is 0.953359375, valid score is 0.8709375
Total cv accuracy is 0.8701874999999999
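
The headline figure for this baseline is an AUC, while the loop above reports accuracy on hard labels. AUC is computed from scores, such as the positive-class column of `predict_proba`, and in scikit-learn it is available as `metrics.roc_auc_score(y_true, y_score)`. As a rough illustration of what the number means, the sketch below computes AUC in plain Python as the fraction of (positive, negative) pairs that are ranked correctly, counting ties as half. `pairwise_auc` is a hypothetical helper and the labels and scores are made-up toy values, not results from this notebook; the pairwise loop is quadratic and meant for illustration only.

```python
def pairwise_auc(y_true, y_score):
    """Hypothetical helper: AUC as the share of correctly ordered pos/neg pairs."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive ranked above negative
            elif p == n:
                correct += 0.5   # ties count as half
    return correct / (len(pos) * len(neg))

# Toy labels and predicted positive-class scores (illustrative only).
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]
print(pairwise_auc(y_true, y_score))
```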


In [12]:
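# Predict positive-class probabilities for the test set.
# Note: clf here is the model fitted on the last fold's training split only.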
pred_proba = clf.predict_proba(x_test)[:, 1]
sample_submission['Prediction'] = pred_proba

In [13]:
res_file = './data/Comments/SVC_submission_kfold=5.csv'
sample_submission.to_csv(res_file, encoding='utf-8', index=False)

# Other classifiers that can be used for text classification

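```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, Perceptron, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier

# benchmark(clf) is assumed to be a user-defined helper (not shown here) that
# fits clf, evaluates it on held-out data, and returns the result.
results = []
for clf, name in (
        (RidgeClassifier(tol=1e-2, solver="lsqr"), "Ridge Classifier"),
        # Newer scikit-learn releases use max_iter in place of n_iter.
        (Perceptron(max_iter=50), "Perceptron"),
        (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive"),
        (KNeighborsClassifier(n_neighbors=10), "kNN"),
        (RandomForestClassifier(n_estimators=100), "Random forest")
):
    print('=' * 80)
    print(name)
    results.append(benchmark(clf))
```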