# Bike Sharing
`Dataset Characteristics:`
Multivariate

`Subject Area:`
Social Science

`Associated Tasks:`
Regression

`Feature Type:`
Integer, Real

`# Instances:`
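17389 (as listed in the dataset metadata; the split in section 5 totals 14772 + 2607 = 17379 rows of `hour.csv`)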

`# Features:`
13

## Variable Table
| Variable Name | Role    | Type      | Description                                                                 | Units | Missing Values |
|---------------|---------|-----------|-----------------------------------------------------------------------------|-------|----------------|
| instant       | ID      | Integer   | record index                                                               |       | no             |
| dteday        | Feature | Date      | date                                                                       |       | no             |
| season        | Feature | Categorical | 1:winter, 2:spring, 3:summer, 4:fall                                    |       | no             |
| yr            | Feature | Categorical | year (0: 2011, 1: 2012)                                                   |       | no             |
| mnth          | Feature | Categorical | month (1 to 12)                                                           |       | no             |
| hr            | Feature | Categorical | hour (0 to 23)                                                            |       | no             |
| holiday       | Feature | Binary     | whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule) |       | no             |
| weekday       | Feature | Categorical | day of the week                                                           |       | no             |
| workingday    | Feature | Binary     | 1 if the day is neither a weekend nor a holiday, otherwise 0              |       | no             |
| weathersit    | Feature | Categorical | 1: Clear, Few clouds, Partly cloudy                                       |       | no             |
| temp          | Feature | Continuous | Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale) | °C    | no             |
| atemp         | Feature | Continuous | Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale) | °C    | no             |
| hum           | Feature | Continuous | Normalized humidity. The values are divided by 100 (max)                  |       | no             |
| windspeed     | Feature | Continuous | Normalized wind speed. The values are divided by 67 (max)                 |       | no             |
| casual        | Other   | Integer    | count of casual users                                                     |       | no             |
| registered    | Other   | Integer    | count of registered users                                                 |       | no             |
| cnt           | Target  | Integer    | count of total rental bikes including both casual and registered          |       | no             |
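
The normalized weather columns can be mapped back to physical units by inverting the formulas in the table. The following is a minimal sketch that assumes the formulas apply exactly as stated; the helper name `denormalize` is hypothetical and not part of the dataset or the notebook. The input values are those of the first row of `hour.csv` as displayed in section 2.

```python
def denormalize(value, v_min, v_max):
    """Invert min-max scaling: value = (x - v_min) / (v_max - v_min)."""
    return value * (v_max - v_min) + v_min

# Constants taken from the variable table above.
temp_c = denormalize(0.24, v_min=-8, v_max=39)       # air temperature in °C
atemp_c = denormalize(0.2879, v_min=-16, v_max=50)   # feeling temperature in °C
hum_pct = 0.81 * 100                                 # humidity was divided by 100
wind = 0.0 * 67                                      # wind speed was divided by 67

print(round(temp_c, 2), round(atemp_c, 2), round(hum_pct, 1), wind)
```

Under these assumptions, a normalized `temp` of 0.24 corresponds to roughly 3.3 °C.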

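### 1. Import the required libraries

First, we import the libraries commonly used for data processing, visualization, and model training:
- `numpy` and `pandas` for data processing.
- `matplotlib.pyplot` for data visualization.
- `train_test_split` from `sklearn` for splitting the dataset, `mean_squared_error` and `r2_score` for model evaluation, and `StandardScaler` for standardization.
- `torch` for building and training the neural network.
- `pygwalker` for interactive data exploration and visualization.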

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import torch
from torch import nn
import pygwalker

# Prefer CUDA, then Apple's Metal backend (MPS), and fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available()
                      else 'mps' if torch.backends.mps.is_available()
                      else 'cpu')

print(torch.__version__)
print(device)



2.6.0
mps


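### 2. Load and inspect the data

We load the bike rental dataset and look at the first few rows. The dataset contains features such as season, year, month, and weather situation, along with the target variable `cnt` (rental count). We separate the features from the target, leaving out the record index `instant`, the date `dteday`, and the partial counts `casual` and `registered`, and display the first rows of each.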

In [10]:
data = pd.read_csv('data/hour.csv')
# Drop the record index, the raw date, the two partial counts, and the target itself.
X = data.drop(['instant', 'dteday', 'casual', 'registered', 'cnt'], axis=1)
y = data['cnt']
X.head(), y.head()

(   season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp  \
 0       1   0     1   0        0        6           0           1  0.24   
 1       1   0     1   1        0        6           0           1  0.22   
 2       1   0     1   2        0        6           0           1  0.22   
 3       1   0     1   3        0        6           0           1  0.24   
 4       1   0     1   4        0        6           0           1  0.24   
 
     atemp   hum  windspeed  
 0  0.2879  0.81        0.0  
 1  0.2727  0.80        0.0  
 2  0.2727  0.80        0.0  
 3  0.2879  0.75        0.0  
 4  0.2879  0.75        0.0  ,
 0    16
 1    40
 2    32
 3    13
 4     1
 Name: cnt, dtype: int64)
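
`casual` and `registered` are left out of the features because, per the variable table, `cnt` counts both casual and registered rentals, so keeping them would leak the target into the inputs. A small sketch of that check follows; it uses a hypothetical three-row frame with illustrative values rather than the real `hour.csv`.

```python
import pandas as pd

# Hypothetical toy rows shaped like hour.csv; the values are illustrative only.
toy = pd.DataFrame({
    "hr": [0, 1, 2],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "cnt": [16, 40, 32],
})

# If cnt is always casual + registered, the two columns determine the target exactly.
is_sum = (toy["casual"] + toy["registered"] == toy["cnt"]).all()
print(is_sum)

# Same idea as the notebook: leaky columns out of the features, target kept separately.
features = toy.drop(columns=["casual", "registered", "cnt"])
target = toy["cnt"]
print(list(features.columns))
```

Running the same comparison on the loaded `data` frame is a cheap way to confirm the relationship before training.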

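### 3. Data exploration and visualization

We use `pygwalker` to explore and visualize the data. `pygwalker` is an interactive data exploration tool that helps us understand the distributions of the variables and the relationships between them.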

In [3]:
pygwalker.walk(data)


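### 4. Standardize the data

To keep training stable, we standardize both the features and the target. Standardization rescales each column to zero mean and unit standard deviation.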

In [11]:
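# Fit separate scalers so the feature scaler is not overwritten by the target's fit.
scaler_X = StandardScaler()
scaler = StandardScaler()  # target scaler; keeps the original variable name
X = scaler_X.fit_transform(X)
y = scaler.fit_transform(pd.DataFrame(y))
X[:3], y[:3]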

(array([[-1.3566343 , -1.0051343 , -1.61043792, -1.67000398, -0.1721122 ,
          1.49389084, -1.46689994, -0.66519285, -1.33464759, -1.0932806 ,
          0.9473725 , -1.55388851],
        [-1.3566343 , -1.0051343 , -1.61043792, -1.52537422, -0.1721122 ,
          1.49389084, -1.46689994, -0.66519285, -1.4385164 , -1.18173227,
          0.89553869, -1.55388851],
        [-1.3566343 , -1.0051343 , -1.61043792, -1.38074446, -0.1721122 ,
          1.49389084, -1.46689994, -0.66519285, -1.4385164 , -1.18173227,
          0.89553869, -1.55388851]]),
 array([[-0.95633924],
        [-0.82402209],
        [-0.8681278 ]]))
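
Standardization as performed by `StandardScaler` subtracts the column mean and divides by the population standard deviation, and the transform can be inverted to bring values back to rental counts. Below is a minimal NumPy sketch on the five `cnt` values displayed in section 2; the variable names are illustrative, and the resulting z-scores differ from the notebook's because the mean and standard deviation here come from only five rows.

```python
import numpy as np

# First five targets from the notebook output.
cnt = np.array([16, 40, 32, 13, 1], dtype=np.float64)

mean = cnt.mean()
std = cnt.std()            # ddof=0, the same convention StandardScaler uses

z = (cnt - mean) / std     # standardized values: mean 0, standard deviation 1
restored = z * std + mean  # inverse transform back to counts

print(z.round(3))
print(restored)
```

With scikit-learn, the same round trip is `inverse_transform` on the scaler that was fitted to `y`, which is useful for reporting predictions as actual rental counts.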

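### 5. Split the dataset

We split the dataset into a training set and a test set, with 15% held out for testing, so that the model is evaluated on data it has not seen during training.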

In [12]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.15)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((14772, 12), (2607, 12), (14772, 1), (2607, 1))
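
The shapes above follow from how `train_test_split` handles a fractional `test_size`: the test count is rounded up and the remainder goes to training. A quick arithmetic check (the variable names are illustrative):

```python
import math

n_samples = 14772 + 2607               # rows in the split above
n_test = math.ceil(0.15 * n_samples)   # test_size=0.15, rounded up
n_train = n_samples - n_test

print(n_samples, n_train, n_test)
```

Because the call passes no `random_state`, the rows that land in each subset change from run to run; passing a fixed integer such as `random_state=42` (an arbitrary choice) makes the split reproducible.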

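### 6. Define the neural network

We define a simple fully connected network, `BikeRent`, with three linear layers and ReLU activations. Its structure is:
- Input layer: 12 features
- Hidden layer 1: 20 neurons
- Hidden layer 2: 10 neurons
- Output layer: 1 neuron (regression)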

In [13]:
class BikeRent(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(12, 20)
        self.fc2 = nn.Linear(20, 10)
        self.fc3 = nn.Linear(10, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

net = BikeRent().to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(params=net.parameters(), lr=0.001)
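
For a sense of model size, the number of trainable parameters can be counted by hand from the layer shapes in `BikeRent`: each `nn.Linear(in, out)` holds `in * out` weights plus `out` biases. A small sketch in plain Python (the helper name `linear_params` is hypothetical):

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

layers = [(12, 20), (20, 10), (10, 1)]  # fc1, fc2, fc3 as defined above
total = sum(linear_params(i, o) for i, o in layers)
print(total)  # 481
```

The same figure should come out of `sum(p.numel() for p in net.parameters())`.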

### 7. Convert the data to tensors

We convert the NumPy arrays to `float32` PyTorch tensors and move them to the selected device (GPU or CPU) so they can be used for training.

In [14]:
X_train = torch.from_numpy(np.array(X_train).astype(np.float32)).to(device)
X_test = torch.from_numpy(np.array(X_test).astype(np.float32)).to(device)
y_train = torch.from_numpy(np.array(y_train).astype(np.float32)).to(device)
y_test = torch.from_numpy(np.array(y_test).astype(np.float32)).to(device)
X_train.shape

torch.Size([14772, 12])
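
An equivalent, slightly shorter conversion is possible with `torch.as_tensor`, which accepts the target dtype directly. The sketch below uses a small random array in place of the real splits and a hypothetical helper `to_tensor`; it runs on the CPU so that it works on any machine.

```python
import numpy as np
import torch

def to_tensor(array, device):
    """Convert a NumPy array to a float32 tensor on the given device."""
    return torch.as_tensor(array, dtype=torch.float32).to(device)

demo_device = torch.device("cpu")  # stand-in for the device selected earlier
dummy = np.random.rand(4, 12)      # float64 array with the notebook's 12 features

t = to_tensor(dummy, demo_device)
print(t.shape, t.dtype)
```

Casting to `float32` matters because `nn.Linear` parameters are `float32` by default, while NumPy arrays produced by scikit-learn are `float64`.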

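### 8. Train the model

We train for 2000 epochs. In each epoch the model runs a forward pass on the full training set, computes the loss, backpropagates, and updates its parameters. We then evaluate on the test set, and every 200 epochs we print the train and test loss, MSE, and R² score. Because the loss function is itself MSE, the loss and MSE columns agree up to floating-point rounding.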
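
For reference, the R² score reported during training is one minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal NumPy sketch with made-up numbers; the function name `r2` is illustrative, and the notebook itself relies on `sklearn.metrics.r2_score`.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r2(y_true, y_true))                 # perfect prediction
print(r2(y_true, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
```

Since the target was standardized to unit variance, R² stays close to one minus the MSE, which is consistent with the training log: at epoch 200 the train MSE is about 0.535 and the train R² about 0.458.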

In [15]:
epochs = 2000
for epoch in range(epochs):
    # One full-batch gradient step on the whole training set.
    net.train()
    y_pred_train = net(X_train)
    loss = loss_fn(y_pred_train, y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Evaluate on the held-out test set without tracking gradients.
    net.eval()
    with torch.inference_mode():
        y_pred_test = net(X_test)
        loss_test = loss_fn(y_pred_test, y_test)
        # detach() lets the training predictions be converted to NumPy on any device.
        mse_train = mean_squared_error(y_train.cpu().numpy(), y_pred_train.detach().cpu().numpy())
        mse_test = mean_squared_error(y_test.cpu().numpy(), y_pred_test.cpu().numpy())
        r2_train = r2_score(y_train.cpu().numpy(), y_pred_train.detach().cpu().numpy())
        r2_test = r2_score(y_test.cpu().numpy(), y_pred_test.cpu().numpy())
    if epoch % 200 == 0:
        print(f"Epoch:{epoch}  Train-Loss:{loss}  Test-Loss:{loss_test}  Train-MSE:{mse_train}  Test-MSE:{mse_test}  Train-r^2:{r2_train}  Test-r^2:{r2_test}")

Epoch:0  Train-Loss:1.0085928440093994  Test-Loss:1.0832626819610596  Train-MSE:1.0085928440093994  Test-MSE:1.0832626819610596  Train-r^2:-0.02196979522705078  Test-r^2:-0.008888483047485352
Epoch:200  Train-Loss:0.5348677039146423  Test-Loss:0.5671183466911316  Train-MSE:0.5348677039146423  Test-MSE:0.5671182870864868  Train-r^2:0.4580383896827698  Test-r^2:0.47181862592697144
Epoch:400  Train-Loss:0.37587541341781616  Test-Loss:0.40790462493896484  Train-MSE:0.37587541341781616  Test-MSE:0.40790456533432007  Train-r^2:0.6191393733024597  Test-r^2:0.6201010942459106
Epoch:600  Train-Loss:0.3063720762729645  Test-Loss:0.3401021957397461  Train-MSE:0.3063720762729645  Test-MSE:0.3401021957397461  Train-r^2:0.6895645260810852  Test-r^2:0.6832484006881714
Epoch:800  Train-Loss:0.20762963593006134  Test-Loss:0.2372168004512787  Train-MSE:0.20762963593006134  Test-MSE:0.2372168004512787  Train-r^2:0.789616584777832  Test-r^2:0.7790699005126953
Epoch:1000  Train-Loss:0.1467042565345764