# Random Forest Regression

A [random forest](https://en.wikipedia.org/wiki/Random_forest) is a meta-estimator that fits a number of [decision trees](https://en.wikipedia.org/wiki/Decision_tree_learning) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. By default, each sub-sample is the same size as the original input sample, but the samples are drawn with replacement (the user can change this).
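To make "same size, drawn with replacement" concrete, here is a minimal standard-library sketch of a bootstrap draw. The helper name `bootstrap_indices` and the sample size of 1000 are illustrative assumptions, not part of scikit-learn or of this notebook's data.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
n_samples = 1000         # hypothetical dataset size
sample = bootstrap_indices(n_samples, rng)

# The draw has as many rows as the input, but some rows repeat
# and others are left out entirely.
unique_fraction = len(set(sample)) / n_samples
print(len(sample), round(unique_fraction, 3))
```

In expectation, a bootstrap draw contains about 63% of the distinct rows (1 - 1/e for large samples), so each tree sees a slightly different view of the data.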

Decision tree and random forest models are often used for classification tasks. However, the idea of a random forest as a regularizing meta-estimator over a single decision tree is best demonstrated by applying both to regression problems. This shows that, **in the presence of random noise, a single decision tree is prone to overfitting and learning spurious correlations, while a properly constructed random forest model is more resistant to such overfitting.**
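The variance-reduction intuition behind averaging can be sketched without any model at all. The example below is an idealized illustration: it assumes each predictor's error is independent Gaussian noise, which real trees in a forest only approximate because they are correlated. All constants are hypothetical.

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 2.0   # hypothetical quantity being predicted
NOISE_STD = 1.0    # hypothetical noise level of a single predictor
N_PREDICTORS = 25  # stands in for the number of trees
N_TRIALS = 2000    # repeated experiments used to estimate the variance


def noisy_prediction():
    """One predictor: the true value plus independent Gaussian noise."""
    return TRUE_VALUE + rng.gauss(0.0, NOISE_STD)


# Predictions from a single noisy predictor.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Predictions from the average of N_PREDICTORS noisy predictors.
averaged = [
    statistics.fmean([noisy_prediction() for _ in range(N_PREDICTORS)])
    for _ in range(N_TRIALS)
]

var_single = statistics.pvariance(single)
var_averaged = statistics.pvariance(averaged)
print(f"single: {var_single:.3f}, averaged: {var_averaged:.3f}")
```

Under the independence assumption the variance shrinks by roughly a factor of `N_PREDICTORS`. With correlated trees the gain is smaller, which is one reason forests also randomize the rows each tree sees.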

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder

### Create the dataset

In [2]:
df = pd.read_excel('/Users/jiangguilin/Desktop/VSCODE/Arrow_Experiment/experiment2_data.xlsx', sheet_name='Sheet1')

In [3]:
encoder = OneHotEncoder()
encoded_vars = encoder.fit_transform(df[['experimental_variable']]).toarray()
encoded_df = pd.DataFrame(encoded_vars, 
                         columns=encoder.get_feature_names_out(['experimental_variable']))
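As a rough picture of what the encoding step produces, here is a plain-Python sketch of one-hot encoding. The `one_hot` helper is hypothetical and only mimics the column layout: one column per category, in sorted order, with the original column name as a prefix. The three category labels are the ones that appear in this notebook's output.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    categories = sorted(set(values))  # one column per distinct category
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return columns, rows


columns, rows = one_hot(
    ["rao_max", "rao_average", "rao_min"],  # illustrative input values
    prefix="experimental_variable",
)
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one 1.0, marking the category that row belongs to.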

In [4]:
# Combine the feature data
X = pd.concat([
    encoded_df,
    df[['angle', 'velocity', 'mass']]
], axis=1)

# Target variables
y = df[['point_x', 'point_y']]

# Check for anomalous data (example: velocity outliers)
print("Outlier check:")
print(df[df['velocity'] < 50])

Outlier check:
    times\inform experimental_variable     angle  velocity      mass  \
19            20               rao_min  4.427523  8.083535  22.05625   

     point_x   point_y  
19 -0.263209  1.187343  
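The fixed threshold of 50 works here because the remaining velocities cluster near 56. As an alternative that does not need a hand-picked cutoff, the sketch below applies the conventional 1.5 × IQR rule. The helper `iqr_outliers` is hypothetical, and the velocity list holds only the six values visible in this notebook's printed rows, not the full dataset.

```python
import statistics

# Velocities copied from rows printed in this notebook (a subset of the data).
velocities = [55.811199, 56.098069, 55.754166, 56.190382, 56.032121, 8.083535]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(velocities)
print(outliers)
```

With so few points the quartiles are coarse, so this is a sanity check, not a replacement for inspecting the flagged row.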


In [5]:
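# Assuming the row is confirmed to be an outlier, drop it.
# Note: X and y were built in the previous cell, so they still contain this row;
# rebuild them from the cleaned df if it should be excluded from training.
df = df.drop(19).reset_index(drop=True)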

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
X_train.head()

| | experimental_variable_rao_average | experimental_variable_rao_max | experimental_variable_rao_min | angle | velocity | mass |
|---|---|---|---|---|---|---|
| 9 | 0.0 | 1.0 | 0.0 | 4.330619 | 55.811199 | 22.05625 |
| 13 | 0.0 | 1.0 | 0.0 | 4.282615 | 56.098069 | 22.05625 |
| 1 | 1.0 | 0.0 | 0.0 | 4.237551 | 55.754166 | 22.05625 |
| 22 | 0.0 | 0.0 | 1.0 | 4.030105 | 56.190382 | 22.05625 |
| 5 | 0.0 | 1.0 | 0.0 | 4.340802 | 56.032121 | 22.05625 |


In [8]:
y_train.head()

| | point_x | point_y |
|---|---|---|
| 9 | 2.930167 | 0.0322 |
| 13 | 3.366472 | 0.476555 |
| 1 | 2.038572 | 3.913043 |
| 22 | 0.419328 | -0.788659 |
| 5 | 0.328705 | 5.131817 |


### Create the model

In [9]:
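# Create the random forest model (multi-core parallelism enabled)
model = RandomForestRegressor(
    n_estimators=5000,  # number of decision trees in the forest
    max_depth=10,       # maximum depth of each tree
    n_jobs=-1,          # use all available CPU cores
    random_state=42     # seed for the random number generator
)

# Train the model
model.fit(X_train, y_train)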

In [10]:
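# Predict and evaluate
y_pred = model.predict(X_test)

print("\nModel evaluation:")
print(f"Mean squared error (MSE): {mean_squared_error(y_test, y_pred):.4f}")
print(f"Coefficient of determination (R²): {r2_score(y_test, y_pred):.4f}")

# Feature importance analysis
importance = pd.Series(model.feature_importances_, index=X.columns)
print("\nFeature importance ranking:")
print(importance.sort_values(ascending=False))

# MSE is the mean of the squared differences between predicted and actual values; closer to 0 is better.
# R² is at most 1 (closer to 1 is better) and can be negative when the model does worse than predicting the mean.
# With two targets, both metrics are averaged uniformly across targets by default.


Model evaluation:
Mean squared error (MSE): 8.1658
Coefficient of determination (R²): 0.3185

Feature importance ranking: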
velocity                             0.355312
angle                                0.327083
experimental_variable_rao_max        0.245996
experimental_variable_rao_average    0.036465
experimental_variable_rao_min        0.035145
mass                                 0.000000
dtype: float64


### Prediction examples

In [11]:
y_pred_df = pd.DataFrame(y_pred, columns=['pred_point_x', 'pred_point_y'])

results = pd.concat([y_test.reset_index(drop=True), y_pred_df], axis=1)
print("\nTest set predictions vs. actual values:")
print(results.round(3))


Test set predictions vs. actual values:
   point_x  point_y  pred_point_x  pred_point_y
0    3.844    5.244         3.055         0.086
1   -0.770    2.372        -0.110         3.118
2    2.526    9.058         1.384         3.192
3    0.028   -1.746         0.158         2.265
4    3.872    1.508         2.816         0.792
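Because the reported MSE and R² are averaged over both targets, they can hide how differently `point_x` and `point_y` behave. The sketch below recomputes both metrics per target from the five rows printed above, using the textbook definitions in plain Python. The values are the rounded ones from the table, so the results only approximate the reported scores.

```python
# Rounded values copied from the printed comparison table.
y_true = {
    "point_x": [3.844, -0.770, 2.526, 0.028, 3.872],
    "point_y": [5.244, 2.372, 9.058, -1.746, 1.508],
}
y_hat = {
    "point_x": [3.055, -0.110, 1.384, 0.158, 2.816],
    "point_y": [0.086, 3.118, 3.192, 2.265, 0.792],
}


def mse(true, pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def r2(true, pred):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    mean = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot


per_target = {
    name: (mse(y_true[name], y_hat[name]), r2(y_true[name], y_hat[name]))
    for name in y_true
}
for name, (m, r) in per_target.items():
    print(f"{name}: MSE={m:.4f}, R2={r:.4f}")

# Uniform average across the two targets, mirroring the aggregate scores.
mse_avg = sum(m for m, _ in per_target.values()) / len(per_target)
r2_avg = sum(r for _, r in per_target.values()) / len(per_target)
print(f"average: MSE={mse_avg:.4f}, R2={r2_avg:.4f}")
```

On these rounded values the averages land close to the reported 8.1658 and 0.3185, but the split is uneven: `point_x` scores an R² of roughly 0.81, while `point_y` comes out negative (roughly -0.18), meaning the predictions for `point_y` are worse than simply predicting its test-set mean. With only five test rows, these figures are themselves quite noisy.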
