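# Logistic Regression

## Experiment tasks
1. Use logistic regression to solve the spam classification problem and the Dota2 match-outcome prediction problem.
2. Compute accuracy, precision, recall, and F1 under 10-fold cross-validation.

## Evaluation metrics
1. Accuracy
2. Precision
3. Recall
4. F1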
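
For reference, the four metrics can be written in terms of true positives, false positives, false negatives, and true negatives. The following is a minimal sketch in plain Python; the helper name `binary_metrics` and the toy labels are made up for illustration and are not part of the assignment.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators by falling back to 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Toy labels (made up for illustration): 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Here there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75.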

# 1. Load the data

In [13]:
import numpy as np

In [14]:
# Both files are comma-separated numeric tables with one sample per row.
spambase = np.loadtxt('data/spambase/spambase.data', delimiter=',')
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the model

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

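# 3. Extract the data

Here `spamx` and `dota2x` contain all of the features in their datasets. In spambase the label is the last column (index 57); in the Dota2 data it is the first column.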
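
The two slicing patterns can be checked on small arrays before applying them to the real files. This is only a sketch: the arrays below are hypothetical stand-ins, not rows from spambase or the Dota2 data.

```python
import numpy as np

# Hypothetical table with 4 feature columns followed by a label column
# (the same layout as spambase, where the label comes last).
label_last = np.array([
    [0.1, 0.2, 0.3, 0.4, 1.0],
    [0.5, 0.6, 0.7, 0.8, 0.0],
    [0.9, 1.0, 1.1, 1.2, 1.0],
])
X_last = label_last[:, :-1]   # every column except the last
y_last = label_last[:, -1]    # the last column is the label

# Hypothetical table with the label in the first column
# (the same layout as the Dota2 data, with labels 1 and -1).
label_first = np.array([
    [1.0, 0.1, 0.2],
    [-1.0, 0.3, 0.4],
])
X_first = label_first[:, 1:]  # every column except the first
y_first = label_first[:, 0]   # the first column is the label

print(X_last.shape, y_last.shape)
print(X_first.shape, y_first.shape)
```

Checking the shapes this way is a cheap guard against accidentally leaving the label column inside the feature matrix.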

In [16]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]

# 4. Train and predict

Train both models on all features, predict with them, and store the predictions.

In [17]:
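# cross_val_predict returns, for every sample, the prediction made by the
# model that was trained on the folds excluding that sample.
spam_model = LogisticRegression()
spam_prediction = cross_val_predict(spam_model, spamx, spamy, cv=10)
print(spam_prediction)

dota2_model = LogisticRegression()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)
print(dota2_prediction)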


[ 1.  1.  1. ...,  0.  0.  0.]
[ 1.  1.  1. ...,  1. -1.  1.]
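
To make the behaviour of `cross_val_predict` concrete, the sketch below reproduces it with an explicit loop over folds. It assumes synthetic data from `make_classification` as a stand-in for the real datasets, and it sets `max_iter=1000`, which the notebook does not do. For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, so the manual loop uses `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data; not the spambase or Dota2 datasets.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)  # max_iter is an assumption
manual = np.empty_like(y)

# Train on nine folds, predict the held-out fold, and repeat ten times.
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    fold_model = clone(model)              # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual[test_idx] = fold_model.predict(X[test_idx])

auto = cross_val_predict(model, X, y, cv=10)
agreement = np.mean(manual == auto)
print(agreement)
```

Every sample receives exactly one prediction, and that prediction comes from a model that never saw the sample during training. This is why the metrics in the next section can be computed over the full label vector.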


# 5. Compute the evaluation metrics

Compute the four metrics for both models.

In [18]:
# spambase metrics
spam_accuracy_score = accuracy_score(spamy, spam_prediction)
spam_precision_score = precision_score(spamy, spam_prediction)
spam_recall_score = recall_score(spamy, spam_prediction)
spam_f1_score = f1_score(spamy, spam_prediction)

print(spam_accuracy_score, spam_precision_score, spam_recall_score, spam_f1_score)

# Dota2 metrics (labels are 1 and -1; the positive class defaults to 1)
dota2_accuracy_score = accuracy_score(dota2y, dota2_prediction)
dota2_precision_score = precision_score(dota2y, dota2_prediction)
dota2_recall_score = recall_score(dota2y, dota2_prediction)
dota2_f1_score = f1_score(dota2y, dota2_prediction)

print(dota2_accuracy_score, dota2_precision_score, dota2_recall_score, dota2_f1_score)



0.917409258857 0.903207653348 0.88527302813 0.894150417827
0.598737182947 0.606657721082 0.676561026608 0.639705383534
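
Because the Dota2 labels are 1 and -1, precision, recall, and F1 depend on which label is treated as the positive class. The sketch below uses toy labels (made up for illustration) to show the `pos_label` parameter of `precision_score` and `recall_score`, which defaults to 1.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels in the same 1 / -1 encoding as the Dota2 predictions.
y_true = [1, -1, 1, 1, -1]
y_pred = [1, 1, 1, -1, -1]

# Default: label 1 is the positive class.
p_pos = precision_score(y_true, y_pred)
r_pos = recall_score(y_true, y_pred)

# Treat label -1 as the positive class instead.
p_neg = precision_score(y_true, y_pred, pos_label=-1)
r_neg = recall_score(y_true, y_pred, pos_label=-1)

print(p_pos, r_pos, p_neg, r_neg)
```

With label 1 as positive, two of the three predicted positives are correct and two of the three actual positives are found, so precision and recall are both 2/3. With label -1 as positive, both are 1/2.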



Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.917409258857 | 0.903207653348 | 0.88527302813 | 0.894150417827
dota2Results | 0.598737182947 | 0.606657721082 | 0.676561026608 | 0.639705383534

# 6. Optional: transform, select, or combine features, then train a model and compute the four metrics under 10-fold cross-validation

In [19]:
# Baseline: keep all 57 features.
_spamx = spambase[:, :57]
_spamy = spambase[:, 57]

lr_1 = LogisticRegression()
prediction_1 = cross_val_predict(lr_1, _spamx, _spamy, cv = 10)

acc_1 = accuracy_score(_spamy, prediction_1)
precision_1 = precision_score(_spamy, prediction_1)
recall_1 = recall_score(_spamy, prediction_1)
f1_1 = f1_score(_spamy, prediction_1)

print(acc_1, precision_1, recall_1, f1_1)

0.918278635079 0.903878583474 0.886927744071 0.895322939866


In [20]:
# Drop the last ten features (keep columns 0 to 46).
_spamx = spambase[:, :47]
_spamy = spambase[:, 57]

lr_2 = LogisticRegression()
prediction_2 = cross_val_predict(lr_2, _spamx, _spamy, cv = 10)

acc_2 = accuracy_score(_spamy, prediction_2)
precision_2 = precision_score(_spamy, prediction_2)
recall_2 = recall_score(_spamy, prediction_2)
f1_2 = f1_score(_spamy, prediction_2)

print(acc_2, precision_2, recall_2, f1_2)

0.896109541404 0.871866295265 0.863210148924 0.867516629712


In [21]:
# Drop the first two features (keep columns 2 to 56).
_spamx = spambase[:, 2:57]
_spamy = spambase[:, 57]

lr_3 = LogisticRegression()
prediction_3 = cross_val_predict(lr_3, _spamx, _spamy, cv = 10)

acc_3 = accuracy_score(_spamy, prediction_3)
precision_3 = precision_score(_spamy, prediction_3)
recall_3 = recall_score(_spamy, prediction_3)
f1_3 = f1_score(_spamy, prediction_3)

print(acc_3, precision_3, recall_3, f1_3)

0.916974570745 0.905382436261 0.881412024269 0.893236444941


In [22]:
# Drop the first two and the last ten features (keep columns 2 to 46).
_spamx = spambase[:, 2:47]
_spamy = spambase[:, 57]

lr_4 = LogisticRegression()
prediction_4 = cross_val_predict(lr_4, _spamx, _spamy, cv = 10)

acc_4 = accuracy_score(_spamy, prediction_4)
precision_4 = precision_score(_spamy, prediction_4)
recall_4 = recall_score(_spamy, prediction_4)
f1_4 = f1_score(_spamy, prediction_4)

print(acc_4, precision_4, recall_4, f1_4)

0.904151271463 0.889772727273 0.863761720905 0.876574307305
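
The four cells above repeat the same steps with a different column slice each time. One way to reduce the repetition is a small helper, sketched below. The name `evaluate_columns` is hypothetical, the data is synthetic rather than spambase, and `max_iter=1000` is an assumption that the notebook does not make.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict


def evaluate_columns(X, y, columns, cv=10):
    """Hypothetical helper: cross-validated metrics for a column subset."""
    model = LogisticRegression(max_iter=1000)  # max_iter is an assumption
    prediction = cross_val_predict(model, X[:, columns], y, cv=cv)
    return {
        "accuracy": accuracy_score(y, prediction),
        "precision": precision_score(y, prediction),
        "recall": recall_score(y, prediction),
        "f1": f1_score(y, prediction),
    }


# Synthetic stand-in: only the first two of six features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

subsets = {
    "all": slice(None),
    "drop_last_two": slice(0, 4),
    "drop_first_two": slice(2, 6),
}
results = {name: evaluate_columns(X, y, cols) for name, cols in subsets.items()}
for name, metrics in results.items():
    print(name, metrics)
```

On this synthetic data, dropping the two informative columns should hurt accuracy far more than dropping two uninformative ones. The same kind of comparison applies to the spambase column slices.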


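1. Model 1 (`lr_2`): drop the last ten feature columns (`spambase[:, :47]`).
2. Model 2 (`lr_3`): drop the first two feature columns (`spambase[:, 2:57]`).
3. Model 3 (`lr_4`): drop the first two and the last ten feature columns (`spambase[:, 2:47]`).

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | spambase | 0.896109541404 | 0.871866295265 | 0.863210148924 | 0.867516629712
Model 2 | spambase | 0.916974570745 | 0.905382436261 | 0.881412024269 | 0.893236444941
Model 3 | spambase | 0.904151271463 | 0.889772727273 | 0.863761720905 | 0.876574307305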
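
Dropping columns is only one kind of feature processing. A feature transformation such as standardization can be tried as well. The sketch below is not something the notebook ran: it uses synthetic data, and it wraps `StandardScaler` and `LogisticRegression` in a pipeline so that the scaler is fitted only on the training folds inside `cross_val_predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with features on very different scales.
rng = np.random.default_rng(0)
scales = np.array([1.0, 10.0, 100.0, 1000.0, 0.01])
X = rng.normal(size=(300, 5)) * scales
y = (X[:, 0] + 100 * X[:, 4] > 0).astype(int)

# The pipeline is refitted per fold, so scaling statistics never
# include the held-out samples.
scaled_model = make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))
scaled_prediction = cross_val_predict(scaled_model, X, y, cv=10)
scaled_accuracy = accuracy_score(y, scaled_prediction)
print(scaled_accuracy)
```

Scaling the whole matrix once before cross-validation would let information from the held-out fold leak into training, which is why the scaler lives inside the pipeline here.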