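# Linear Discriminant Analysis

## Experiment
1. Use linear discriminant analysis (LDA) for spam classification and for Dota 2 match outcome prediction.
2. Compute accuracy, precision, recall, and F1 under 10-fold cross-validation.

## Evaluation metrics
1. Accuracy
2. Precision
3. Recall
4. F1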
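
For reference, the four metrics can be written out from the confusion-matrix counts. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the function name `binary_metrics`, the `positive` parameter, and the toy labels are illustrative assumptions, and returning 0.0 for an undefined ratio is a choice made here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: of everything predicted positive, how much really is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything really positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1


# Toy labels: 3 true positives, 1 false negative, 2 false positives, 4 true negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, prec, rec, round(f1, 4))
```

For this toy input the accuracy is 7/10, the precision 3/5, the recall 3/4, and F1 is 2/3.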

# 1. Load the data

In [2]:
import numpy as np

In [3]:
# Both files are comma-separated and read into plain float arrays
spambase = np.loadtxt('data/spambase/spambase.data', delimiter=',')
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the model

In [4]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# 3. Extract the data

Here `spamx` and `dota2x` contain all of the features in their respective datasets.

In [5]:
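# spambase: the first 57 columns are features, column 57 is the label
spamx = spambase[:, :57]
spamy = spambase[:, 57]

# dota2: column 0 is the label, all remaining columns are features
dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]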

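# 4. Training

Train both models on all features, run prediction, and store the predictions.

**Note: on the dota2 dataset, the linear discriminant analysis model emits a warning during training; it does not affect execution.**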

In [6]:
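# Fit on the full spambase data, then predict on that same data
LDA1 = LinearDiscriminantAnalysis()
LDA1.fit(spamx, spamy)
prediction1 = LDA1.predict(spamx)

# Same procedure for the dota2 data
LDA2 = LinearDiscriminantAnalysis()
LDA2.fit(dota2x, dota2y)
prediction2 = LDA2.predict(dota2x)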
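
The cell above fits and predicts on the same array, which gives training-set predictions, while the experiment statement asks for 10-fold cross-validation. A minimal sketch of the cross-validated variant follows, using `cross_val_predict` (already imported in section 2). The synthetic data from `make_classification` is a stand-in assumed here so the snippet runs without the data files; with the real data, `X, y` would be `spamx, spamy` or `dota2x, dota2y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption, not the notebook's datasets)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

# Training-set predictions: the model has already seen every sample it predicts
train_pred = LinearDiscriminantAnalysis().fit(X, y).predict(X)

# 10-fold cross-validated predictions: each sample is predicted by a model
# that was fitted on the other nine folds
cv_pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=10)

print("training-set accuracy:", accuracy_score(y, train_pred))
print("10-fold CV accuracy:  ", accuracy_score(y, cv_pred))
```

The returned array has one prediction per sample, so it can be passed to the same metric functions as `prediction1` and `prediction2`.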








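# 5. Computing the evaluation metrics

Compute the four metrics for both models. Because the predictions from section 4 were made on the same data the models were fitted on, the values below are training-set metrics, not 10-fold cross-validated ones.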

In [7]:
# Accuracy
acc1 = accuracy_score(spamy, prediction1)
acc2 = accuracy_score(dota2y, prediction2)
# Precision
precision1 = precision_score(spamy, prediction1)
precision2 = precision_score(dota2y, prediction2)
# Recall
recall1 = recall_score(spamy, prediction1)
recall2 = recall_score(dota2y, prediction2)
# F1
f11 = f1_score(spamy, prediction1)
f12 = f1_score(dota2y, prediction2)
print(acc1)
print(acc2)
print(precision1)
print(precision2)
print(recall1)
print(recall2)
print(f11)
print(f12)






0.888719843512
0.600971397733
0.9194068343
0.608498364992
0.786541643685
0.679000450986
0.847800237812
0.641819097814



Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.88871984351228 | 0.9194068343004513 | 0.7865416436845009 | 0.8478002378121284
dota2Results | 0.6009713977334052 | 0.6084983649924679 | 0.6790004509860195 | 0.6418190978142924

# 6. Optional: transform, filter, or combine the features, then train the model and compute the four metrics under 10-fold cross-validation

In [12]:
# Baseline: all features
dota2x = dota2results[:, 1:]
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)

print(accuracy_score(dota2y, dota2_prediction))
print(precision_score(dota2y, dota2_prediction))
print(recall_score(dota2y, dota2_prediction))
print(f1_score(dota2y, dota2_prediction))







0.598769562871
0.606710792425
0.676458529786
0.639689062924




In [13]:
# Model 1: drop Cluster ID and Game mode (columns 1 and 2)
dota2x = dota2results[:, 3:]
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)

print(accuracy_score(dota2y, dota2_prediction))
print(precision_score(dota2y, dota2_prediction))
print(recall_score(dota2y, dota2_prediction))
print(f1_score(dota2y, dota2_prediction))



0.598823529412
0.606727322856
0.676663523431
0.639789896014




In [14]:
# Model 2: drop Cluster ID (column 1)
dota2x = dota2results[:, 2:]
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)

print(accuracy_score(dota2y, dota2_prediction))
print(precision_score(dota2y, dota2_prediction))
print(recall_score(dota2y, dota2_prediction))
print(f1_score(dota2y, dota2_prediction))



0.598855909336
0.606788262117
0.676520027879
0.639759620045




In [15]:
# Model 3: keep only Cluster ID and Game mode (columns 1 and 2)
dota2x = dota2results[:, 1:3]
dota2_model = LinearDiscriminantAnalysis()
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)

print(accuracy_score(dota2y, dota2_prediction))
print(precision_score(dota2y, dota2_prediction))
print(recall_score(dota2y, dota2_prediction))
print(f1_score(dota2y, dota2_prediction))

0.526519158122
0.526519158122
1.0
0.689829741501


1. Model 1 processing: drop Cluster ID and Game mode
2. Model 2 processing: drop Cluster ID
3. Model 3 processing: keep only Cluster ID and Game mode
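
The three variants differ only in which columns are kept. The sketch below shows the same slices on a toy array; the array contents are made up, and the column layout (0 = label, 1 = Cluster ID, 2 = Game mode, 3 onward = remaining features) is inferred from the slices and descriptions in this notebook.

```python
import numpy as np

# Toy stand-in for dota2results: 2 rows, 6 columns
# column 0 = label, 1 = Cluster ID, 2 = Game mode, 3+ = remaining features
data = np.arange(12).reshape(2, 6)

model1_x = data[:, 3:]    # drop Cluster ID and Game mode
model2_x = data[:, 2:]    # drop Cluster ID only
model3_x = data[:, 1:3]   # keep only Cluster ID and Game mode

print(model1_x.shape, model2_x.shape, model3_x.shape)
```

In every variant the label column 0 is excluded from the features, and the end index of a slice is exclusive, so `1:3` selects columns 1 and 2.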

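Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1 | dota2Results | 0.5988235294117648 | 0.6067273228563551 | 0.6766635234307736 | 0.6397898960140328
Model 2 | dota2Results | 0.5988559093362116 | 0.6067882621166434 | 0.6765200278791358 | 0.6397596200445865
Model 3 | dota2Results | 0.5265191581219644 | 0.5265191581219644 | 1.0 | 0.6898297415012161

Model 3 has a recall of 1.0 and a precision equal to its accuracy, which means it predicts the positive class for every sample. Its higher F1 therefore reflects the share of positive samples (about 52.7%), not better discrimination.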
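
One further filtering step that could be tried is dropping zero-variance columns before LDA. The sketch below is an illustration under assumptions, not a step taken in this notebook. It runs on synthetic data with one constant column appended, and whether the Dota 2 data actually contains constant columns is not confirmed here. Wrapping the filter and the model in a pipeline means the filter is refitted inside each of the 10 folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption) with one all-zero column appended
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# VarianceThreshold() with its default threshold removes zero-variance columns
pipeline = make_pipeline(VarianceThreshold(), LinearDiscriminantAnalysis())
pred = cross_val_predict(pipeline, X, y, cv=10)

print(accuracy_score(y, pred))
print(precision_score(y, pred))
print(recall_score(y, pred))
print(f1_score(y, pred))

# Fit once on all data only to inspect which columns the filter keeps
kept = pipeline.fit(X, y).named_steps['variancethreshold'].get_support()
print("columns kept:", int(kept.sum()), "of", X.shape[1])
```

With the real data, `X, y` would be replaced by `dota2x, dota2y`, and the four printed values could be added as another row of the table above.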