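This dataset contains medical expenses for patients in the United States, compiled from demographic statistics of the U.S. Census Bureau. It has 1,338 cases, each a beneficiary currently enrolled in the insurance plan, with features describing the patient's characteristics and the total medical expenses charged to the plan for the calendar year.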

| Feature  | Description |
|----------|-------------|
| age      | Integer; age of the primary beneficiary (people over 64 are excluded, since their costs are generally paid by the government) |
| sex      | Sex of the policyholder, either male or female |
| bmi      | Body Mass Index (BMI), a measure of whether a person is overweight or underweight relative to their height; BMI equals weight (kg) divided by the square of height (m) |
| children | Integer; number of children/dependents covered by the insurance plan |
| smoker   | Whether the insured person smokes: yes for smokers, no for non-smokers |
| region   | Place of residence of the beneficiary in the U.S., divided into 4 geographic regions: northeast, southeast, southwest and northwest |
| charges  | Medical expenses |
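As a quick illustration of the bmi definition in the table, here is a minimal sketch. The function name `bmi` and the sample weight and height are made up for this example and do not come from the dataset.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m tall.
example_bmi = bmi(70, 1.75)
print(round(example_bmi, 2))
```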

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
insurance = pd.read_csv("./insurance.csv")
insurance.head(10)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216
6,46,female,33.44,1,no,southeast,8240.5896
7,37,female,27.74,3,no,northwest,7281.5056
8,37,male,29.83,2,no,northeast,6406.4107
9,60,female,25.84,0,no,northwest,28923.13692


The three variables sex, smoker and region are categorical rather than numeric.

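We encode these variables with one-hot encoding and then drop one of the encoded columns to serve as the reference group; keeping every column would make the dummy variables perfectly collinear with the intercept. For example, the feature region, which takes 4 values, can be converted into 4 variables: region_northwest, region_southeast, region_southwest and region_northeast. We then drop region_northeast as the reference group.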

In [3]:
# One-hot encode with pandas' get_dummies
sex_onehot_df = pd.get_dummies(insurance.sex, prefix = "sex")
smoker_onehot_df = pd.get_dummies(insurance.smoker, prefix="smoker")
region_onehot_df = pd.get_dummies(insurance.region, prefix="region")

pd.set_option("display.max_columns",15)
insurance_merged = pd.concat([insurance, sex_onehot_df, smoker_onehot_df, region_onehot_df], axis = 1)
insurance_merged.head(5)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,female,27.9,0,yes,southwest,16884.924,1,0,0,1,0,0,0,1
1,18,male,33.77,1,no,southeast,1725.5523,0,1,1,0,0,0,1,0
2,28,male,33.0,3,no,southeast,4449.462,0,1,1,0,0,0,1,0
3,33,male,22.705,0,no,northwest,21984.47061,0,1,1,0,0,1,0,0
4,32,male,28.88,0,no,northwest,3866.8552,0,1,1,0,0,1,0,0


Next, sex and smoker are handled the same way, with sex_female and smoker_no as the reference groups.

In [4]:
Y = insurance_merged["charges"]
X = insurance_merged.drop(['charges', 'sex','sex_female', 'smoker','smoker_no',"region", 'region_northeast'], axis = 1)
X.head(5)

Unnamed: 0,age,bmi,children,sex_male,smoker_yes,region_northwest,region_southeast,region_southwest
0,19,27.9,0,0,1,0,0,1
1,18,33.77,1,1,0,0,1,0
2,28,33.0,3,1,0,0,1,0
3,33,22.705,0,1,0,1,0,0
4,32,28.88,0,1,0,1,0,0
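The encode-then-drop steps can also be done in a single call. The sketch below is an alternative to the notebook's approach, not a replacement for it: it uses `pd.get_dummies` with `drop_first=True`, which drops the alphabetically first level of each variable. For this data that happens to be female, no and northeast, the same reference groups chosen above. The four-row frame `toy` is a hand-picked subset of the rows displayed earlier, chosen so that every region appears.

```python
import pandas as pd

# Four rows taken from the head() output above (ages 19, 18, 33 and 37).
toy = pd.DataFrame({
    "age": [19, 18, 33, 37],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# drop_first=True removes the first level of each categorical column,
# leaving it as the reference group; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], drop_first=True, dtype=int
)
print(encoded)
```

Dropping by alphabetical order only works when the desired reference group sorts first; otherwise, dropping the column explicitly as in the cell above is clearer.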


In [5]:
# Split the dataset into a training set and a test set
from sklearn import model_selection
train_x, test_x, train_y, test_y  = model_selection.train_test_split(X, Y, test_size = 0.3, random_state = 14)
print('Training set size:', len(train_x), 'Test set size:', len(test_x))

Training set size: 936 Test set size: 402
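The split sizes follow from simple arithmetic. As far as I know, when `test_size` is a fraction, scikit-learn rounds the test count up and gives the remaining rows to the training set; the sketch below reproduces that rule by hand rather than calling the library.

```python
import math

n_samples = 1338   # number of cases in the dataset
test_size = 0.3    # fraction passed to train_test_split

# Assumed rule: test count is rounded up, training set gets the rest.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```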


Linear regression model

In [6]:
# Build the linear regression model
from sklearn.linear_model import LinearRegression
from sklearn import metrics
lr = LinearRegression()
lr.fit(train_x, train_y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [7]:
# Inspect the model
pd.Series(data = lr.coef_, index = X.columns)  # regression coefficients

age                   258.792069
bmi                   370.965723
children              461.798357
sex_male              -82.561986
smoker_yes          23779.801649
region_northwest     -564.059256
region_southeast    -1243.795208
region_southwest    -1011.004159
dtype: float64
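Each coefficient is the change in predicted charges when that feature increases by one unit and the others stay fixed. The sketch below illustrates this with the coefficients printed above. The intercept is not shown in the notebook output, so it is left out and only differences between predictions are computed. The helper `linear_part` and the sample person are hypothetical.

```python
# Coefficients copied from the output above.
coef = {
    "age": 258.792069, "bmi": 370.965723, "children": 461.798357,
    "sex_male": -82.561986, "smoker_yes": 23779.801649,
    "region_northwest": -564.059256, "region_southeast": -1243.795208,
    "region_southwest": -1011.004159,
}

def linear_part(features):
    """Sum of coefficient * value; the intercept is deliberately omitted."""
    return sum(coef[name] * value for name, value in features.items())

# Hypothetical person: 40-year-old male non-smoker from the southeast.
non_smoker = {"age": 40, "bmi": 30.0, "children": 2, "sex_male": 1,
              "smoker_yes": 0, "region_northwest": 0,
              "region_southeast": 1, "region_southwest": 0}
smoker = dict(non_smoker, smoker_yes=1)      # same person, but smoking
one_year_older = dict(non_smoker, age=41)    # same person, one year older

smoker_gap = linear_part(smoker) - linear_part(non_smoker)
age_gap = linear_part(one_year_older) - linear_part(non_smoker)
print(round(smoker_gap, 2), round(age_gap, 2))
```

The gaps equal the smoker_yes and age coefficients, so under this model smoking is by far the largest single contributor to predicted charges.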

In [8]:
# Evaluate the regression
pred_y_test = lr.predict(test_x)
pred_y_train = lr.predict(train_x)

print("Training set R^2: ", round(metrics.r2_score(train_y, pred_y_train),3))
print("Test set R^2: ", round(metrics.r2_score(test_y, pred_y_test),3))

Training set R^2:  0.751
Test set R^2:  0.75
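The coefficient of determination reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers; `r2_score_manual` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up values, not taken from the insurance data.
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score_manual(y_true, y_true)        # predictions match exactly
baseline = r2_score_manual(y_true, [2.5] * 4)    # always predict the mean
print(perfect, baseline)
```

A value near 0.75 on both sets therefore means the model explains roughly three quarters of the variance in charges, and the close match between training and test scores suggests little overfitting.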
