# Multi-level regression using K-means clustering

In [1]:
import random
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

Read in the dataset:

In [12]:
df = pd.read_csv('data-volunteer.csv')

The data has the following features, some of which you will use for clustering and some of which you will use for the regression model:

* `GEDIV` (geographical region of the US where the respondent lives, ordinal encoded. You will use this for clustering only.)
* `GTMETSTA` (whether or not the respondent lives in a metropolitan area. You will use this for clustering only.)
* `GTCBSASZ` (size of the metro area where the respondent lives. You will use this for clustering only.)
* `PESEX` (sex of the respondent. You will use this for the regression only.)
* `PRTAGE` (age of the respondent, ordinal encoded. You will use this for the regression only.)
* `PEEDUCA` (education level of the respondent, ordinal encoded. You will use this for the regression only.)
* `PUWK` (whether the respondent worked in the last week (1), did not work in the last week (2), or is retired (3). You will use this for the regression only.)
* `PTS16E` (number of hours spent volunteering in the last 12 months. You will use this as the target variable for the regression.)

Split the data into training and test sets. Use 2,500 samples for the test set and the remaining samples for the training set. Use `random_state = 42`.

 * `ytr` and `yts` should each be a 1d `numpy` array with only the target variable.
 * `Xtr` and `Xts` should be `pandas` data frames with all of the remaining variables (excluding the target variable).
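
As a rough illustration of the split described above, the following sketch uses a small synthetic data frame (the values and the `toy` names are made up for the example and are not the real survey file). It shows that passing an integer as `test_size` requests an absolute number of test samples rather than a fraction, and that the targets stay 1d `numpy` arrays while the features stay data frames.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (hypothetical values, not the real file)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "GEDIV": rng.integers(1, 10, size=100),
    "PRTAGE": rng.integers(0, 8, size=100),
    "PTS16E": rng.integers(0, 200, size=100),
})

y_toy = toy["PTS16E"].to_numpy()      # 1d numpy array holding only the target
X_toy = toy.drop(columns="PTS16E")    # data frame with every other column

# An integer test_size is an absolute sample count, not a fraction
Xtr_toy, Xts_toy, ytr_toy, yts_toy = train_test_split(
    X_toy, y_toy, test_size=25, random_state=42
)

print(Xtr_toy.shape, Xts_toy.shape, ytr_toy.shape, yts_toy.shape)
```

With 100 rows and `test_size=25`, the training set keeps the remaining 75 rows; the same pattern applies with 2,500 test samples on the full dataset.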



In [13]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

# Target variable
y = np.array(df['PTS16E'])

# Clustering features
clustering_features = df[['GEDIV', 'GTMETSTA', 'GTCBSASZ']]

# Regression features
regression_features = df[['PESEX', 'PRTAGE', 'PEEDUCA', 'PUWK']]

# Combine the clustering and regression features
X = pd.concat([clustering_features, regression_features], axis=1)

# Split the data into training and test sets
Xtr, Xts, ytr, yts = train_test_split(X, y, test_size=2500, random_state=42)

In the next cells, you will use `sklearn` to perform K-means clustering using `Xtr`. First, set `n_cluster` as specified on the question page.

In [14]:
n_cluster = 4

Then, assign a cluster label to each data point, using only the geographical features marked "You will use this for clustering only".

(Use the specific random state shown below so that your clustering will match the auto-grader's.) Save the assigned cluster labels in `Xtr_cid` and `Xts_cid` for the training and test data, respectively.

In [15]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

kmeans = KMeans(n_clusters=n_cluster, random_state=42)

# Fit the clustering on the geographical features of the training data only
kmeans.fit(Xtr[['GEDIV', 'GTMETSTA', 'GTCBSASZ']])

# Assign cluster labels to the training and test data
Xtr_cid = kmeans.predict(Xtr[['GEDIV', 'GTMETSTA', 'GTCBSASZ']])
Xts_cid = kmeans.predict(Xts[['GEDIV', 'GTMETSTA', 'GTCBSASZ']])
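
To make the fit-on-train, predict-on-test pattern more concrete, here is a minimal sketch on synthetic points (two made-up blobs standing in for the geographical features; none of these names or values come from the survey data). It checks that `predict` gives each unseen point the label of its nearest training centroid. Setting `n_init=10` explicitly is a choice made for this example so that the result does not depend on the library version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs in three dimensions (hypothetical data)
train_pts = np.vstack([rng.normal(0, 0.5, (50, 3)), rng.normal(5, 0.5, (50, 3))])
test_pts = np.vstack([rng.normal(0, 0.5, (10, 3)), rng.normal(5, 0.5, (10, 3))])

# Centroids are learned from the training points only
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(train_pts)
test_labels = km.predict(test_pts)

# The same assignment by hand: index of the closest centroid for each test point
dists = np.linalg.norm(
    test_pts[:, None, :] - km.cluster_centers_[None, :, :], axis=2
)
manual_labels = dists.argmin(axis=1)

print(np.array_equal(test_labels, manual_labels))
```

Because the centroids never see the test points, the test labels cannot leak information from the test set into the clustering.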


Finally, fit regression coefficients using the training data in each cluster, and then use the fitted regression models to create `yhat_ts`, the predicted values on the test set.

In [16]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

# This just allocates an array of the correct shape; yhat_ts should not stay all zeros
yhat_ts = np.zeros(yts.shape)

reg_cols = ['PESEX', 'PRTAGE', 'PEEDUCA', 'PUWK']

# Train one linear regression model per cluster and predict on that cluster's test rows
for i in range(n_cluster):
    tr_mask = Xtr_cid == i
    ts_mask = Xts_cid == i

    # Skip clusters with no test rows, since predict() fails on an empty input
    if not ts_mask.any():
        continue

    # Fit the model on the training rows that belong to the current cluster
    model = LinearRegression()
    model.fit(Xtr.loc[tr_mask, reg_cols], ytr[tr_mask])

    # Predict for the test rows in the current cluster and store the results in yhat_ts
    yhat_ts[ts_mask] = model.predict(Xts.loc[ts_mask, reg_cols])
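
The motivation for fitting one model per cluster can be sketched on synthetic data. In the example below (hypothetical labels and values, evaluated in-sample for brevity), the slope of the relationship flips sign between two clusters, so a single global line fits poorly while per-cluster lines fit well. Whether the survey data behaves this way is not something this sketch shows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
cid = rng.integers(0, 2, size=n)     # hypothetical cluster labels (0 or 1)
x = rng.uniform(0, 10, size=(n, 1))  # a single made-up regression feature

# Slope is +3 in cluster 0 and -3 in cluster 1, plus a little noise
y = np.where(cid == 0, 3.0, -3.0) * x[:, 0] + rng.normal(0, 0.1, size=n)

# One model for all points
global_pred = LinearRegression().fit(x, y).predict(x)

# One model per cluster, with predictions written back through a boolean mask
cluster_pred = np.zeros(n)
for k in np.unique(cid):
    mask = cid == k
    cluster_pred[mask] = LinearRegression().fit(x[mask], y[mask]).predict(x[mask])

mse_global = np.mean((y - global_pred) ** 2)
mse_cluster = np.mean((y - cluster_pred) ** 2)
print(mse_global, mse_cluster)
```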


Then, compute the mean squared error of your model on the test data.

In [17]:
#grade (write your code in this cell and DO NOT DELETE THIS LINE)

mse_ts = mean_squared_error(yts, yhat_ts)

In [18]:
mse_ts

13895.812411048222
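
For reference, the mean squared error is the average of the squared residuals, so it is expressed in squared hours. The sketch below computes it by hand on a few made-up values (not survey data) and then takes the square root of the value reported above, which is easier to read because it is back in hours.

```python
import numpy as np

# Hypothetical toy values, not taken from the survey data
y_true = np.array([10.0, 0.0, 50.0, 120.0])
y_pred = np.array([20.0, 5.0, 40.0, 100.0])

# MSE is the mean of the squared residuals
mse_manual = np.mean((y_true - y_pred) ** 2)
rmse_manual = np.sqrt(mse_manual)

# Root of the test MSE printed above, in hours
rmse_reported = np.sqrt(13895.812411048222)

print(mse_manual, rmse_manual, round(rmse_reported, 2))
```

The reported MSE corresponds to a root mean squared error of roughly 118 volunteer hours per year.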