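# Logistic Regression

## Tasks
1. Use logistic regression to solve the spam classification problem and the Dota2 match outcome prediction problem.
2. Compute accuracy, precision, recall, and F1 score under 10-fold cross-validation.

## Evaluation Metrics
1. Accuracy
2. Precision
3. Recall
4. F1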
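
As a reference for how the four metrics relate, here is a minimal sketch that derives them from the confusion counts of a single positive class. The helper name `binary_metrics` and the toy labels are illustrative assumptions, not part of the assignment; the notebook itself relies on the `sklearn.metrics` functions. Returning 0.0 when a denominator is zero is a choice made here for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    # Confusion counts with respect to the chosen positive label.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical toy labels: 3 true positives, 1 false positive, 1 false negative.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]
acc, prec, rec, f1 = binary_metrics(toy_true, toy_pred)
print(acc, prec, rec, f1)
```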

# 1. Load the Data

In [1]:
import numpy as np

In [2]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the Model

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

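# 3. Extract the Data

Here `spamx` and `dota2x` contain all the features of their respective datasets; `spamy` and `dota2y` hold the labels.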

In [4]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]
print(spamx)
print(spamy)

print()

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]
print(dota2x)
print(dota2y)

[[0.000e+00 6.400e-01 6.400e-01 ... 3.756e+00 6.100e+01 2.780e+02]
 [2.100e-01 2.800e-01 5.000e-01 ... 5.114e+00 1.010e+02 1.028e+03]
 [6.000e-02 0.000e+00 7.100e-01 ... 9.821e+00 4.850e+02 2.259e+03]
 ...
 [3.000e-01 0.000e+00 3.000e-01 ... 1.404e+00 6.000e+00 1.180e+02]
 [9.600e-01 0.000e+00 0.000e+00 ... 1.147e+00 5.000e+00 7.800e+01]
 [0.000e+00 0.000e+00 6.500e-01 ... 1.250e+00 5.000e+00 4.000e+01]]
[1. 1. 1. ... 0. 0. 0.]

[[223.   2.   2. ...   0.   0.   0.]
 [152.   2.   2. ...   0.   0.   0.]
 [131.   2.   2. ...   0.   0.   0.]
 ...
 [111.   2.   3. ...   0.   0.   0.]
 [185.   2.   2. ...   0.   0.   0.]
 [204.   2.   2. ...   0.   0.   0.]]
[-1.  1.  1. ...  1. -1. -1.]


# 4. Train and Predict

Train both models on all features and store the predictions.

In [5]:
# YOUR CODE HERE
print("spam:")
spam_model = LogisticRegression()
spam_prediction = cross_val_predict(spam_model, spamx, spamy, cv = 10)
print(spam_prediction)

spam:
[1. 1. 1. ... 0. 0. 0.]


In [6]:
print("dota2:")
dota2_model = LogisticRegression()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[ 1.  1.  1. ...  1. -1.  1.]
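
To make explicit what `cross_val_predict` does in the cells above, the sketch below reproduces it by hand: each sample is predicted by a model fitted on the other nine folds. It assumes that an integer `cv` with a classifier resolves to an unshuffled `StratifiedKFold`, which is scikit-learn's documented default. The synthetic data and the names `X_toy`, `y_toy`, `manual_pred`, and `library_pred` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Hypothetical synthetic data standing in for a real dataset.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

model = LogisticRegression()

# Manual out-of-fold prediction: fit on 9 folds, predict the held-out fold.
manual_pred = np.empty_like(y_toy)
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_pred[test_idx] = fold_model.predict(X_toy[test_idx])

# The library call used in the notebook.
library_pred = cross_val_predict(model, X_toy, y_toy, cv=10)
print(np.array_equal(manual_pred, library_pred))
```

Because every prediction comes from a model that never saw that sample, metrics computed on the pooled predictions estimate out-of-sample performance.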


# 5. Compute the Evaluation Metrics

Compute the four metrics for both models.

In [7]:
# YOUR CODE HERE
spam_accuracy = accuracy_score(spamy,spam_prediction)
spam_precision = precision_score(spamy,spam_prediction)
spam_recall = recall_score(spamy,spam_prediction)
spam_f1 = f1_score(spamy,spam_prediction)
print("spam：")
print(spam_accuracy)
print(spam_precision)
print(spam_recall)
print(spam_f1)

print()

dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)

spam：
0.9184959791349706
0.9043869516310461
0.8869277440706013
0.8955722639933167

dota2：
0.598758769562871
0.6066682595989487
0.6766225247017342
0.6397387318415723


###### Double-click here to fill in

Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.9184959791349706 | 0.9043869516310461 | 0.8869277440706013 | 0.8955722639933167
dota2Results | 0.598758769562871 | 0.6066682595989487 | 0.6766225247017342 | 0.6397387318415723
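
One detail worth keeping in mind when reading the Dota2 row: the labels are -1 and 1, and `precision_score`, `recall_score`, and `f1_score` treat the label 1 as the positive class by default (`pos_label=1`). The sketch below, on hypothetical toy labels, shows how the reported value depends on which class is treated as positive.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical toy labels using the same -1 / 1 encoding as dota2y.
toy_true = [1, -1, 1, 1, -1, -1, 1, 1]
toy_pred = [1, 1, 1, -1, -1, -1, 1, 1]

# Default behaviour: label 1 is the positive class.
prec_pos = precision_score(toy_true, toy_pred)
rec_pos = recall_score(toy_true, toy_pred)

# Treating label -1 as the positive class instead.
prec_neg = precision_score(toy_true, toy_pred, pos_label=-1)
rec_neg = recall_score(toy_true, toy_pred, pos_label=-1)

print(prec_pos, rec_pos)
print(prec_neg, rec_neg)
```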

# 6. Optional: Transform, Select, or Combine Features, Then Train the Model and Compute the Four Metrics Under 10-Fold Cross-Validation

In [8]:
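# YOUR CODE HERE
# The first three columns are kept as-is; the remaining columns are hero
# indicators, positive for one team and negative for the other.
heroes = dota2x[:, 3:]
n_matches = dota2x.shape[0]

# np.where scans row by row, so reshaping to (n_matches, 5) assumes that
# every match has exactly five heroes per team.
p1 = np.where(heroes > 0)[1].reshape(n_matches, 5)
p2 = np.where(heroes < 0)[1].reshape(n_matches, 5)
new_dota2x = np.concatenate((dota2x[:, 0:3], p1, p2), axis=1)
print(new_dota2x)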

[[223.   2.   2. ...  37.  73.  87.]
 [152.   2.   2. ...  34.  92.  97.]
 [131.   2.   2. ...  45.  71.  92.]
 ...
 [111.   2.   3. ...  61.  67.  73.]
 [185.   2.   2. ...  14.  31.  48.]
 [204.   2.   2. ...  31.  55.  78.]]


In [9]:
print("dota2:")
dota2_model = LogisticRegression()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[1. 1. 1. ... 1. 1. 1.]


In [10]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5292066918510524
0.5334881367805208
0.843036365872658
0.6534571657834733
92650
77087
15563


In [11]:
new_dota2x1 = new_dota2x[:,1:]

In [12]:
print("dota2:")
dota2_model = LogisticRegression()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x1, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[1. 1. 1. ... 1. 1. 1.]


In [13]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5289908256880734
0.5333497607221134
0.843036365872658
0.6533533509679161
92650
77107
15543


In [14]:
new_dota2x2 = dota2x[:,1:]

In [15]:
print("dota2:")
dota2_model = LogisticRegression()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x2, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[ 1.  1.  1. ...  1. -1.  1.]


In [16]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5987263896384242
0.606607379097457
0.6767660202533722
0.6397690056779645
92650
54424
38226


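###### Double-click here to fill in
1. Model 1 pipeline: for each match, extract the hero IDs of both teams, append them after the first three attributes, and predict with the same method as before.
2. Model 2 pipeline: same as Model 1, but with the first attribute dropped.
3. Model 3 pipeline: same as the original model, but with the first attribute dropped.

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | Dota2 outcome prediction | 0.5292066918510524 | 0.5334881367805208 | 0.843036365872658 | 0.6534571657834733
Model 2 | Dota2 outcome prediction | 0.5289908256880734 | 0.5333497607221134 | 0.843036365872658 | 0.6533533509679161
Model 3 | Dota2 outcome prediction | 0.5987263896384242 | 0.606607379097457 | 0.6767660202533722 | 0.6397690056779645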
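
The Model 1 transformation can be illustrated on a small hypothetical array. The sketch below uses two matches with three leading attributes, six hero slots, and two picks per team (the real data has five picks per team); the names `toy`, `team_pos`, `team_neg`, and `compact` are assumptions made for this example.

```python
import numpy as np

# Hypothetical toy matches: 3 leading attributes, then 6 hero indicator slots.
# 1 marks a hero picked by one team, -1 a hero picked by the other.
toy = np.array([
    [10, 2, 2,  1,  0, -1,  1, -1,  0],
    [20, 2, 3,  0, -1,  1, -1,  0,  1],
])
heroes = toy[:, 3:]
n_matches, picks_per_team = toy.shape[0], 2

# Column indices of the non-zero entries, listed row by row.
team_pos = np.where(heroes > 0)[1].reshape(n_matches, picks_per_team)
team_neg = np.where(heroes < 0)[1].reshape(n_matches, picks_per_team)

# Leading attributes followed by the hero indices of each team.
compact = np.concatenate((toy[:, :3], team_pos, team_neg), axis=1)
print(compact)
```

In this encoding a linear model sees hero indices as numeric magnitudes, so hero 40 counts as "twice" hero 20 even though the IDs are arbitrary. That is one plausible reason Models 1 and 2 score lower on accuracy than the indicator-based features, although the notebook does not test this directly.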