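# Linear Discriminant Analysis

## Experiment Tasks
1. Use linear discriminant analysis to solve the spam classification problem and the Dota2 result prediction problem.
2. Compute the accuracy, precision, recall, and F1 score under 10-fold cross-validation.

## Evaluation Metrics
1. Accuracy
2. Precision
3. Recall
4. F1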
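
As a minimal sketch of how the four metrics relate to each other, the snippet below computes them from confusion-matrix counts for a binary problem. The function name `binary_metrics` and the toy labels are illustrative assumptions and are not part of the assignment.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Toy labels, made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, prec, rec, f1)
```

Precision looks only at the samples predicted positive, recall only at the samples that are actually positive, and F1 is their harmonic mean.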

# 1. Load the data

In [1]:
import numpy as np

In [2]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the model

In [3]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

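# 3. Extract the data

Here `spamx` and `dota2x` contain all features of their datasets. The label is the last column of spambase and the first column of the dota2 data.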

In [4]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]

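# 4. Training

Train and predict with both models using all features, and store the predictions.

**Note: on the dota2 dataset, the linear discriminant analysis model emits warnings during training. They do not affect program execution.**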
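
If the warnings clutter the notebook output, one option is to silence them locally with the standard-library `warnings` module. The sketch below uses a hypothetical `noisy_fit` function as a stand-in for the model fit. The category of the warning that scikit-learn actually raises may differ by version, so the `UserWarning` filter here is an assumption.

```python
import warnings


def noisy_fit():
    """Hypothetical stand-in for a call that emits a warning while it runs."""
    warnings.warn("variables are collinear", UserWarning)
    return "fitted"


# Ignore UserWarning only inside this block; the previous filters are restored on exit.
# record=True collects any warning that still gets through, so the list stays empty here.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_fit()

print(result, len(caught))
```

Scoping the filter to a `with` block keeps warnings from unrelated code visible.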

In [5]:
# YOUR CODE HERE
print("spam:")
spam_model = LinearDiscriminantAnalysis()
spam_prediction = cross_val_predict(spam_model, spamx, spamy, cv = 10)
print(spam_prediction)

print()

print("dota2:")
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv = 10)
print(dota2_prediction)

spam:
[0. 1. 1. ... 0. 0. 0.]

dota2:
[ 1.  1.  1. ...  1. -1.  1.]

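
To make clear what `cross_val_predict` returns, the following sketch rebuilds the out-of-fold predictions by hand and compares them with the library call. The data comes from `make_classification` and is a synthetic stand-in, not spambase or dota2. The variable names `manual` and `auto` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data (an assumption; not spambase or dota2).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# For a classifier, an integer cv is resolved to an unshuffled StratifiedKFold.
splitter = StratifiedKFold(n_splits=10)
manual = np.empty_like(y)
for train_idx, test_idx in splitter.split(X, y):
    model = LinearDiscriminantAnalysis()
    model.fit(X[train_idx], y[train_idx])
    # Each sample is predicted by the one model that never saw it in training.
    manual[test_idx] = model.predict(X[test_idx])

auto = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)
print(np.array_equal(manual, auto))
```

Every entry of the prediction array therefore comes from a model fitted on the other nine folds, so scoring it against the true labels gives a cross-validated estimate.
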
# 5. Computing the evaluation metrics

Compute the four metrics for both models.

In [6]:
# YOUR CODE HERE
spam_accuracy = accuracy_score(spamy,spam_prediction)
spam_precision = precision_score(spamy,spam_prediction)
spam_recall = recall_score(spamy,spam_prediction)
spam_f1 = f1_score(spamy,spam_prediction)
print("spam：")
print(spam_accuracy)
print(spam_precision)
print(spam_recall)
print(spam_f1)

print()

dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)

spam：
0.8830688980656379
0.90891597177678
0.7815774958632101
0.8404507710557533

dota2：
0.59876956287102
0.6067107924250782
0.6764585297855766
0.6396890629240491
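
`precision_score`, `recall_score`, and `f1_score` treat label `1` as the positive class by default (`pos_label=1`). For the dota2 labels in {-1, 1}, the scores above therefore describe the class labeled `1`. A small sketch on made-up labels shows how the numbers change when the other class is taken as positive.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels in {-1, 1}, mirroring the dota2 label encoding.
y_true = np.array([1, 1, 1, -1, -1, 1, -1, 1])
y_pred = np.array([1, 1, -1, 1, -1, 1, 1, 1])

# Default behaviour: label 1 is the positive class.
prec_pos = precision_score(y_true, y_pred)
rec_pos = recall_score(y_true, y_pred)

# Same predictions, scored with -1 as the positive class.
prec_neg = precision_score(y_true, y_pred, pos_label=-1)
rec_neg = recall_score(y_true, y_pred, pos_label=-1)

print(prec_pos, rec_pos, prec_neg, rec_neg)
```

Accuracy is unaffected by this choice, while precision, recall, and F1 can differ considerably between the two classes.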


###### Double-click here to fill in

Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.8830688980656379 | 0.90891597177678 | 0.7815774958632101 | 0.8404507710557533
dota2Results | 0.59876956287102 | 0.6067107924250782 | 0.6764585297855766 | 0.6396890629240491
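
The table reports metrics computed once over the pooled out-of-fold predictions. An alternative reading of "10-fold cross-validated metrics" is to score each fold separately and average, which can give slightly different numbers. A hedged sketch with `cross_validate` on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; replace with spamx/spamy or dota2x/dota2y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

scores = cross_validate(
    LinearDiscriminantAnalysis(),
    X,
    y,
    cv=10,
    scoring=["accuracy", "precision", "recall", "f1"],
)

# Each entry holds one score per fold; report mean and spread across folds.
for name in ["accuracy", "precision", "recall", "f1"]:
    fold_scores = scores[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

The per-fold standard deviation also gives a rough sense of how stable each metric is across splits.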

# 6. Optional: transform, select, or combine features, then train models and compute the four metrics under 10-fold cross-validation

In [7]:
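# YOUR CODE HERE
# Columns 3 onward are hero indicators: positive entries mark one team's
# heroes, negative entries the other team's.
temp = dota2x[:, 3:]
new_dota2x = dota2x[:, 0:3]

# np.where scans row by row, so reshaping to (n, 5) recovers the five hero
# indices per team. This assumes every match has exactly five picks per team.
n_matches = temp.shape[0]
p1 = np.where(temp > 0)[1].reshape(n_matches, 5)
p2 = np.where(temp < 0)[1].reshape(n_matches, 5)
hero_ids = np.concatenate((p1, p2), axis=1)
new_dota2x = np.concatenate((new_dota2x, hero_ids), axis=1)
print(new_dota2x)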



[[223.   2.   2. ...  37.  73.  87.]
 [152.   2.   2. ...  34.  92.  97.]
 [131.   2.   2. ...  45.  71.  92.]
 ...
 [111.   2.   3. ...  61.  67.  73.]
 [185.   2.   2. ...  14.  31.  48.]
 [204.   2.   2. ...  31.  55.  78.]]
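
The index extraction is easier to follow on a tiny example. The sketch below applies the same `np.where(...)[1].reshape(...)` step to a made-up indicator matrix with 6 heroes and 2 picks per team; the sizes and the names `team_pos` and `team_neg` are illustrative assumptions.

```python
import numpy as np

# Toy indicator matrix: 2 matches, 6 heroes, 2 picks per team
# (the real data has 5 picks per team; sizes here are made up).
temp = np.array([
    [1, 0, -1, 1, -1, 0],
    [0, -1, 1, 0, 1, -1],
])
picks_per_team = 2

# np.where scans row by row, so the column indices come out grouped by match.
team_pos = np.where(temp > 0)[1].reshape(temp.shape[0], picks_per_team)
team_neg = np.where(temp < 0)[1].reshape(temp.shape[0], picks_per_team)

hero_ids = np.concatenate((team_pos, team_neg), axis=1)
print(hero_ids)
```

The reshape only works when every row has the same number of positive and of negative entries. Note also that a linear model treats these integer IDs as ordered quantities, which discards the categorical nature of the original indicators.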


In [8]:
print("dota2:")
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[1. 1. 1. ... 1. 1. 1.]


In [9]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5293470048569887
0.533565925656922
0.8433233569759337
0.653601728575514
92650
77102
15548


In [10]:
new_dota2x1 = new_dota2x[:,1:]

In [11]:
print("dota2:")
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x1, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[1. 1. 1. ... 1. 1. 1.]


In [12]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5289152725310308
0.5333013460590813
0.843036365872658
0.6533170235750142
92650
77114
15536


In [13]:
new_dota2x2 = dota2x[:,1:]

In [14]:
print("dota2:")
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x2, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[ 1.  1.  1. ...  1. -1.  1.]

In [15]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5988559093362116
0.6067882621166434
0.6765200278791358
0.6397596200445865
92650
54388
38262


In [16]:
from sklearn import preprocessing

# Standardize each feature to zero mean and unit variance.
new_dota2x2 = preprocessing.scale(dota2x[:, 1:])
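
Scaling the whole dataset before cross-validation lets statistics from the held-out folds influence preprocessing. A common alternative, sketched here on synthetic stand-in data, is to put the scaler in a pipeline so that it is refit inside every fold. The use of `StandardScaler` and `make_pipeline` is a suggestion and not the notebook's original method.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; replace with dota2x[:, 1:] and dota2y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is refit on the training part of every fold, so no statistics
# from the held-out fold leak into preprocessing.
pipeline = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
prediction = cross_val_predict(pipeline, X, y, cv=10)
print(prediction.shape)
```

The same pattern carries over to any other preprocessing step that learns parameters from the data.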

In [17]:
print("dota2:")
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, new_dota2x2, dota2y, cv = 10)
print(dota2_prediction)

dota2:
[ 1.  1.  1. ...  1. -1.  1.]

In [18]:
dota2_accuracy = accuracy_score(dota2y,dota2_prediction)
dota2_precision = precision_score(dota2y,dota2_prediction)
dota2_recall = recall_score(dota2y,dota2_prediction)
dota2_f1 = f1_score(dota2y,dota2_prediction)
print("dota2：")
print(dota2_accuracy)
print(dota2_precision)
print(dota2_recall)
print(dota2_f1)
print(dota2_prediction.size)
print(np.where(dota2_prediction > 0.5)[0].size)
print(np.where(dota2_prediction < 0.5)[0].size)

dota2：
0.5988559093362116
0.6067882621166434
0.6765200278791358
0.6397596200445865
92650
54388
38262


In [19]:
print(new_dota2x2)

[[-5.00396257e-01 -7.89982073e-01  4.05418341e-03 ...  1.34351311e-03
   2.27386176e-04 -6.44342241e-03]
 [-5.00396257e-01 -7.89982073e-01  4.05418341e-03 ...  1.34351311e-03
   2.27386176e-04 -6.44342241e-03]
 [-5.00396257e-01 -7.89982073e-01  4.05418341e-03 ...  1.34351311e-03
   2.27386176e-04 -6.44342241e-03]
 ...
 [-5.00396257e-01  1.26412208e+00  4.05418341e-03 ...  1.34351311e-03
   2.27386176e-04 -6.44342241e-03]
 [-5.00396257e-01 -7.89982073e-01  4.05418341e-03 ...  1.34351311e-03
   2.27386176e-04 -6.44342241e-03]
 [-5.00396257e-01 -7.89982073e-01  4.05418341e-03 ...  1.34351311e-03
   2.27386176e-04 -6.44342241e-03]]


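###### Double-click here to fill in
1. Model 1 pipeline: extract the hero IDs of both teams in each match, append them after the first 3 attributes, and predict with the original method.
2. Model 2 pipeline: Model 1 with the first attribute dropped.
3. Model 3 pipeline: the original model with the first attribute dropped.
4. Model 4 pipeline: Model 3 with the data standardized.

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | Dota2 result prediction | 0.5293470048569887 | 0.533565925656922 | 0.8433233569759337 | 0.653601728575514
Model 2 | Dota2 result prediction | 0.5289152725310308 | 0.5333013460590813 | 0.843036365872658 | 0.6533170235750142
Model 3 | Dota2 result prediction | 0.5988559093362116 | 0.6067882621166434 | 0.6765200278791358 | 0.6397596200445865
Model 4 | Dota2 result prediction | 0.5988559093362116 | 0.6067882621166434 | 0.6765200278791358 | 0.6397596200445865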
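
Models 3 and 4 produce identical metrics. This is consistent with the LDA decision rule being unchanged when each feature is shifted and rescaled, since the class means and the shared covariance estimate transform along with the data. The sketch below checks this on synthetic data with deliberately mismatched feature scales; the data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import scale

# Synthetic stand-in data without redundant features (an assumption; not dota2).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_redundant=0, random_state=0
)

# Give the columns very different scales and offsets.
rng = np.random.default_rng(0)
X = X * rng.uniform(0.1, 100.0, size=X.shape[1]) + rng.uniform(-50.0, 50.0, size=X.shape[1])

X_scaled = scale(X)
pred_raw = LinearDiscriminantAnalysis().fit(X, y).predict(X)
pred_scaled = LinearDiscriminantAnalysis().fit(X_scaled, y).predict(X_scaled)

# Fraction of samples on which the two models agree.
agreement = np.mean(pred_raw == pred_scaled)
print(agreement)
```

Standardization is therefore unlikely to help plain LDA; it matters more for models that are sensitive to feature scale, such as regularized or distance-based methods.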