# Predicting University Overall Scores with Regression Analysis
---


Using [data](https://www.kaggle.com/mylesoneill/world-university-rankings?select=cwurData.csv) from Kaggle, build a linear regression model that predicts a university's overall score from its ranks on the individual indicators.

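**Basic requirements:**
* Randomly split the data into training and test sets at a ratio of 8:2, use RMSE as the evaluation metric, and report the linear regression model's RMSE on the test set;
* There are 8 basic input features: `quality_of_education`, `alumni_employment`, `quality_of_faculty`, `publications`, `influence`, `citations`, `broad_impact`, `patents`;
* The prediction target is `score`.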
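
The evaluation metric named above is $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$. A minimal sketch of it in NumPy follows; the helper name `rmse` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Toy values, made up for illustration (not taken from the dataset).
# Errors are 2, -1, -4, so the mean squared error is (4 + 1 + 16) / 3 = 7.
example = rmse([50.0, 60.0, 70.0], [52.0, 59.0, 66.0])
```

The same helper could be applied to `y_test` and a model's predictions instead of summing squared errors in a Python loop.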


## 3. Data Overview

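Assuming the data file is in the current folder, use pandas' `read_csv()` function, which reads standard CSV files, to load the data into a `DataFrame`. Then inspect the first few records: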

In [372]:
import numpy as np
import pandas as pd
import math
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm


csv_data = "./cwurData.csv"
raw_data = pd.read_csv(csv_data,sep=",")


Drop the rows that contain NaN values, which leaves 2000 valid records.

In [373]:
raw_data = raw_data.dropna()
len(raw_data)

2000

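Extract the columns for the independent variables and the dependent variable; the training and test sets can then be split from them for model building and analysis.
A dictionary is set up to hold the data column by column.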

In [374]:
coefficient = {}
coefficient['education'] = []; coefficient['employment'] = [];coefficient['faculty'] = [];coefficient['publications'] = []
coefficient['influence'] = [];coefficient['citations'] = [];coefficient['broad_impact'] = []; coefficient['patents'] = [];coefficient['school_name'] = []
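
As an alternative to positional column indices, the feature and target columns can be selected by name. The sketch below assumes the column names listed in the task description; the helper `split_features_target` and the toy frame are hypothetical. Unlike the notebook's `dropna()`, it drops rows only when one of the selected columns is missing.

```python
import pandas as pd

FEATURE_COLS = [
    "quality_of_education", "alumni_employment", "quality_of_faculty",
    "publications", "influence", "citations", "broad_impact", "patents",
]
TARGET_COL = "score"

def split_features_target(df, feature_cols=FEATURE_COLS, target_col=TARGET_COL):
    """Return (X, y) as float arrays, dropping rows with NaN in the used columns."""
    clean = df.dropna(subset=list(feature_cols) + [target_col])
    X = clean[list(feature_cols)].to_numpy(dtype=float)
    y = clean[target_col].to_numpy(dtype=float)
    return X, y

# Toy frame with made-up values; the second row has a missing score.
toy = pd.DataFrame({c: [1.0, 2.0, 3.0] for c in FEATURE_COLS})
toy[TARGET_COL] = [90.0, float("nan"), 70.0]
toy["institution"] = ["A", "B", "C"]

X, y = split_features_target(toy)
```

Selecting by name keeps the code correct even if the column order in the CSV changes.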

## 4. Model Construction

In [375]:
data = np.array(raw_data)
np.random.shuffle(data)  # shuffle rows in place (unseeded, so the order differs between runs)
print("Number of columns in the dataset:",len(raw_data.columns))

# Columns 4-11 hold the 8 input features; column 12 holds `score`.
attitude_colums  = ['education','employment','faculty','publications','influence','citations','broad_impact','patents']

# Store each feature column as a 1-D array under its short name.
for j in range(5,13,1):
    coefficient["{}".format(attitude_colums[j-5])] = np.concatenate(np.array(data[:,j-1:j:1]))
score = np.concatenate(np.array(data[:,12:13]))


# Split into training and test sets (8:2)
all_y = data[:,12]
all_x = data[:,4:12]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(all_x,all_y,test_size=0.2,random_state=2021)


# Fit ordinary least squares on the training set and predict on the test set
md = LinearRegression().fit(x_train,y_train)
y_predict = md.predict(x_test)
b0 = md.intercept_
b1_8 = md.coef_
R2 = md.score(x_test,y_test)  # R^2 on the test set


# Test-set RMSE
test_bias = y_predict-y_test
test_rmse = math.sqrt(sum([i**2 for i in test_bias])/len(test_bias))


Number of columns in the dataset: 14


Print the regression coefficients $\beta$, the goodness of fit ($R^2$), and the RMSE.

In [376]:
print("Regression coefficients beta:",b0,",".join(str(i) for i in b1_8))
print("Goodness of fit (R^2) =",R2)
print("RMSE =",test_rmse)

Regression coefficients beta: 66.72914930313155 -0.006361263592363662,-0.0071732440644781655,-0.06810079393504376,0.0002659589665198179,0.0007902853420936684,-0.00026562310607122503,-0.002263208674985792,-0.002517410471768085
Goodness of fit (R^2) = 0.37897527855240964
RMSE = 2.8259342982745057


## Ridge Regression

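The goodness of fit and the RMSE above show that the linear model fits only moderately well (test-set $R^2 \approx 0.38$). One plausible cause is collinearity among the independent variables, so the next step is to standardize the variables and fit a ridge regression instead.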
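
The collinearity assumption can be checked before switching models. One common diagnostic is the variance inflation factor (VIF), which equals the diagonal of the inverse of the predictors' correlation matrix. The sketch below uses synthetic data; the helper name `variance_inflation_factors` is an assumption and does not appear in the original notebook.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example: b is almost a copy of a, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)

vif = variance_inflation_factors(np.column_stack([a, b, c]))
```

In the notebook this could be called as `variance_inflation_factors(x_train)`. As a rule of thumb, values well above 10 are often read as a sign of strong collinearity.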

Import the libraries needed for ridge regression and create a list to collect the coefficients.

In [377]:
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge,RidgeCV
from scipy.stats import zscore

b_ridge = []



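Fit ridge regression over a grid of `alpha` values, select `alpha` by cross-validation with `RidgeCV`, and then convert the standardized regression coefficients back to coefficients on the original scale. Note that the cell below passes the raw `x_train` and `x_test` arrays to `Ridge`; the imported `zscore` is never applied, so the coefficients labelled as standardized are actually fitted on unstandardized data.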

In [378]:
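# Collect ridge coefficients over a grid of alpha values (for a ridge trace)
kk = np.logspace(-4,10,100)
for k in kk:
    md_ridge = Ridge(alpha=k).fit(x_train,y_train)
    b_ridge.append(md_ridge.coef_)

# Choose alpha by cross-validation on the training set
md_ridge_cv = RidgeCV(alphas=np.logspace(-4,10,100)).fit(x_train,y_train)
print("Best alpha =",md_ridge_cv.alpha_)

# NOTE: this model uses alpha=0.4 rather than the cross-validated alpha,
# and it is fitted on the test set (x_test, y_test), not the training set.
md_ridge_0 = Ridge(0.4).fit(x_test,y_test)
cs0 = md_ridge_0.coef_
print("Regression coefficients for the standardized data:",cs0)
mu=np.mean(data[:,4:13],axis=0)  # means of the 8 features and of score
print(mu)
s=np.std(data[:,4:13],dtype=np.float64, axis=0,ddof=1)  # sample standard deviations of the same columns
# Convert standardized coefficients back to the original scale
# (valid only if cs0 was fitted on z-scored data)
params=[mu[-1]-s[-1]*sum(cs0*mu[:-1]/s[:-1]),s[-1]*cs0/s[:-1]]
print("Regression coefficients for the original data:",params)

Best alpha = 15922.82793341094
Regression coefficients for the standardized data: [-5.31102023e-03 -5.58457985e-03 -2.82183546e-02  7.85114160e-05
 -8.26608025e-04  2.55747936e-04 -2.52304652e-03 -1.55538986e-03]
[296.0015 385.2635 191.1275 500.415 500.219 449.3415 496.6995 470.321
 47.06762999999992]
Regression coefficients for the original data: [47.978253574862144, array([-3.27523757e-04, -2.14138187e-04, -3.54890850e-03,  1.79241957e-06,
       -1.88957068e-05,  6.73818041e-06, -5.79536218e-05, -3.94827912e-05])]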
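
The back-transformation used for `params` rests on the identity $b_j = s_y \beta_j / s_j$ and $b_0 = \mu_y - \sum_j b_j \mu_j$, where $\beta_j$ are coefficients fitted on z-scored data. The following sketch checks that identity on synthetic data with a closed-form ridge solution in NumPy; the function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def ridge_standardized(X, y, alpha):
    """Closed-form ridge on z-scored X and y; returns coefficients and the scaling stats."""
    mu_x, s_x = X.mean(axis=0), X.std(axis=0, ddof=1)
    mu_y, s_y = y.mean(), y.std(ddof=1)
    Zx = (X - mu_x) / s_x
    zy = (y - mu_y) / s_y
    # Solve (Z'Z + alpha * I) beta = Z'z_y; no intercept is needed after centering.
    beta = np.linalg.solve(Zx.T @ Zx + alpha * np.eye(X.shape[1]), Zx.T @ zy)
    return beta, (mu_x, s_x, mu_y, s_y)

def to_original_scale(beta, mu_x, s_x, mu_y, s_y):
    """Map standardized coefficients to an intercept and slopes on the raw scale."""
    coef = s_y * beta / s_x
    intercept = mu_y - np.sum(coef * mu_x)
    return intercept, coef

# Synthetic rank-like features and a score that depends on two of them.
rng = np.random.default_rng(0)
X = rng.uniform(1, 1000, size=(200, 3))
y = 60.0 - 0.01 * X[:, 0] - 0.005 * X[:, 1] + rng.normal(scale=0.5, size=200)

beta, (mu_x, s_x, mu_y, s_y) = ridge_standardized(X, y, alpha=1.0)
intercept, coef = to_original_scale(beta, mu_x, s_x, mu_y, s_y)

# Both parameterizations should give identical predictions.
pred_std = mu_y + s_y * (((X - mu_x) / s_x) @ beta)
pred_raw = intercept + X @ coef
```

The two prediction vectors agree only because `beta` was fitted on standardized inputs; applying the same conversion to coefficients fitted on raw data does not yield meaningful values.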


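Evaluate the ridge regression model `md_ridge_0` on the test set. Because `md_ridge_0` was fitted on `x_test` and `y_test` in the previous cell, the metrics below are in-sample rather than held-out.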

In [379]:
y_predict_ridge = md_ridge_0.predict(x_test)
test_bias_ridge = y_predict_ridge-y_test
test_rmse_ridge = math.sqrt(sum([i**2 for i in test_bias_ridge])/len(test_bias_ridge))

print("Goodness of fit after processing:",md_ridge_0.score(x_test,y_test),"> goodness of fit before processing",R2)
print("Regression coefficients beta:",",".join(str(i) for i in params))
print("RMSE after processing =",test_rmse_ridge,"< RMSE before processing =",test_rmse)


Goodness of fit after processing: 0.6516792384631696 > goodness of fit before processing 0.37897527855240964
Regression coefficients beta: 47.978253574862144,[-3.27523757e-04 -2.14138187e-04 -3.54890850e-03  1.79241957e-06
 -1.88957068e-05  6.73818041e-06 -5.79536218e-05 -3.94827912e-05]
RMSE after processing = 2.1163977771613025 < RMSE before processing = 2.8259342982745057


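With ridge regression, the reported fit is better than that of plain linear regression ($R^2$ of 0.652 versus 0.379, RMSE of 2.116 versus 2.826). The ridge model was fitted on the test set, however, while the linear model was fitted on the training set, so the two sets of figures are not directly comparable.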
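
A like-for-like comparison would fit both models on the training set only and score them on the same held-out test set. The sketch below shows one way to do this with scikit-learn, standardizing the features inside a pipeline. The function name `compare_on_holdout`, the `alpha` grid, and the synthetic data are assumptions, and its numbers say nothing about the results on the CWUR data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_on_holdout(X, y, seed=2021):
    """Fit OLS and standardized ridge on the training split; score both on the test split."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "ols": LinearRegression(),
        # The scaler is fitted on the training data only, then reused for the test data.
        "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-4, 4, 50))),
    }
    results = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        pred = model.predict(x_test)
        results[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
            "r2": float(model.score(x_test, y_test)),
        }
    return results

# Synthetic stand-in for the rank features and the score.
rng = np.random.default_rng(0)
X = rng.uniform(1, 1000, size=(300, 3))
y = 60.0 - 0.01 * X[:, 0] + rng.normal(scale=1.0, size=300)

results = compare_on_holdout(X, y)
```

In the notebook, `compare_on_holdout(all_x.astype(float), all_y.astype(float))` would apply the same procedure to the real features and scores.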