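### Build one or more investment-theory models to study the relationship between investment and gross domestic product (GDP) in each industry, and evaluate the models.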

#### Import the required modules

In [18]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler

#### Load the initial data

In [28]:
df=pd.read_excel('../../data/近十年各行业生产总值数据-en.xlsx',sheet_name='Sheet1')
df2=pd.read_excel('../../data/相关系数.xlsx',sheet_name='Sheet1')

df


| Unnamed: 0 | Years | GDP | Farming_Forestry_Livestock_and_Fishing | Industry | Construction | Wholesale_and_Retail_Trade | Transportation_Warehousing_and_Postal_Services | Accommodation_and_Catering | Finance | Real_Estate | Others |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 643563.1 | 57472.2 | 233197.4 | 45401.7 | 63170.4 | 28534.4 | 11228.7 | 46853.4 | 38086.4 | 119618.5 |
| 1 | 2015 | 688858.2 | 59852.6 | 234968.9 | 47761.3 | 67719.6 | 30519.5 | 12306.1 | 56299.8 | 42573.8 | 136856.5 |
| 2 | 2016 | 746395.1 | 62451.0 | 245406.4 | 51498.9 | 73724.5 | 33028.7 | 13607.8 | 59964.0 | 49969.4 | 156744.3 |
| 3 | 2017 | 832035.9 | 64660.0 | 275119.3 | 57905.6 | 81156.6 | 37121.9 | 15056.0 | 64844.3 | 57086.0 | 179086.3 |
| 4 | 2018 | 919281.1 | 67558.7 | 301089.3 | 65493.0 | 88903.7 | 40337.2 | 16520.6 | 70610.3 | 64623.0 | 204145.2 |
| 5 | 2019 | 986515.2 | 73576.9 | 311858.7 | 70648.1 | 95650.9 | 42466.3 | 17903.1 | 76250.6 | 70444.8 | 227715.8 |
| 6 | 2020 | 1013567.0 | 81396.5 | 312902.9 | 72444.7 | 96086.1 | 40582.9 | 15285.4 | 83617.7 | 73425.3 | 237825.3 |
| 7 | 2021 | 1149237.0 | 86994.8 | 374545.6 | 78741.2 | 110147.0 | 48423.9 | 18026.9 | 90308.7 | 77215.9 | 264833.0 |
| 8 | 2022 | 1204724.0 | 92576.8 | 395043.7 | 80766.0 | 116294.1 | 51076.9 | 17755.0 | 93285.3 | 73766.1 | 284160.2 |
| 9 | 2023 | 1260582.1 | 94462.6 | 399103.1 | 85691.1 | 123072.4 | 57819.8 | 21023.6 | 100676.6 | 73722.7 | 305010.2 |
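Before modelling, it may help to check how the columns of this table relate to one another. The sketch below copies the ten rows by hand and compares GDP with the sum of the nine sector columns. The names `rows` and `sector_gap` are illustrative and not part of the notebook. If the gap is only rounding noise, GDP is an accounting identity of the sectors, so a regression of GDP on all nine sectors at once would fit exactly and say nothing about investment.

```python
# Rows copied from the table above: (year, GDP, nine sector values in column order).
rows = [
    (2014, 643563.1, [57472.2, 233197.4, 45401.7, 63170.4, 28534.4, 11228.7, 46853.4, 38086.4, 119618.5]),
    (2015, 688858.2, [59852.6, 234968.9, 47761.3, 67719.6, 30519.5, 12306.1, 56299.8, 42573.8, 136856.5]),
    (2016, 746395.1, [62451.0, 245406.4, 51498.9, 73724.5, 33028.7, 13607.8, 59964.0, 49969.4, 156744.3]),
    (2017, 832035.9, [64660.0, 275119.3, 57905.6, 81156.6, 37121.9, 15056.0, 64844.3, 57086.0, 179086.3]),
    (2018, 919281.1, [67558.7, 301089.3, 65493.0, 88903.7, 40337.2, 16520.6, 70610.3, 64623.0, 204145.2]),
    (2019, 986515.2, [73576.9, 311858.7, 70648.1, 95650.9, 42466.3, 17903.1, 76250.6, 70444.8, 227715.8]),
    (2020, 1013567.0, [81396.5, 312902.9, 72444.7, 96086.1, 40582.9, 15285.4, 83617.7, 73425.3, 237825.3]),
    (2021, 1149237.0, [86994.8, 374545.6, 78741.2, 110147.0, 48423.9, 18026.9, 90308.7, 77215.9, 264833.0]),
    (2022, 1204724.0, [92576.8, 395043.7, 80766.0, 116294.1, 51076.9, 17755.0, 93285.3, 73766.1, 284160.2]),
    (2023, 1260582.1, [94462.6, 399103.1, 85691.1, 123072.4, 57819.8, 21023.6, 100676.6, 73722.7, 305010.2]),
]


def sector_gap(gdp, sectors):
    """Return GDP minus the sum of the sector values for one year."""
    return gdp - sum(sectors)


# Gap per year, and the largest absolute gap over the whole sample.
gaps = {year: sector_gap(gdp, sectors) for year, gdp, sectors in rows}
max_gap = max(abs(gap) for gap in gaps.values())

for year, gap in gaps.items():
    print(f"{year}: GDP - sum(sectors) = {gap:+.1f}")
print(f"largest absolute gap: {max_gap:.1f}")
```

The table also contains no investment series, only output by sector, so any investment variable would have to come from a separate source.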


#### Regression model function

In [26]:
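def create_gdp_investment_model(data, industry_column, investment_column, correlation_matrix, correlation_threshold=0.4):
    """
    Fit an OLS regression of one industry's GDP on its investment, using
    highly correlated industries as control variables.

    Parameters:
    - data: DataFrame containing the GDP and investment data
    - industry_column: name of the industry GDP column
    - investment_column: name of the industry investment column
    - correlation_matrix: matrix of correlation coefficients between industries
    - correlation_threshold: correlation threshold for selecting control variables (default 0.4)

    Returns:
    - the summary of the regression results
    """
    # Select as control variables the industries whose absolute correlation
    # with the target industry exceeds correlation_threshold
    industry_corr = correlation_matrix[industry_column]
    control_variables = industry_corr[industry_corr.abs() > correlation_threshold].index.tolist()

    # Remove the target industry itself from the control variables
    if industry_column in control_variables:
        control_variables.remove(industry_column)

    print("Control variables:", control_variables)

    # Select the columns used in the regression
    X = data[[investment_column] + control_variables]  # investment plus control variables
    Y = data[industry_column]  # GDP column of the industry

    # Add a constant term (intercept)
    X = sm.add_constant(X)

    # Fit the regression model
    model = sm.OLS(Y, X).fit()

    # Return the summary of the regression results
    return model.summary()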

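# Load the data
df = pd.read_excel('../../data/近十年各行业生产总值数据-en.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('../../data/相关系数.xlsx', sheet_name='Sheet1')

# Extract the correlation matrix: use 'Unnamed: 0' as the row index
# and drop any columns that contain NaN
correlation_matrix = df2.set_index('Unnamed: 0').dropna(axis=1)

# Example: regression analysis using the "Industry" data
result_industry = create_gdp_investment_model(
    df, 
    'Industry',  # GDP column of the industry
    # Investment column (assumed here to have the same name as the GDP column).
    # NOTE: df has no separate investment column, so Industry is regressed on
    # itself and the fit is trivially perfect.
    'Industry',
    correlation_matrix,  # correlation matrix
    correlation_threshold=0.4  # set the threshold to 0.4
)

# Print the regression results
print(result_industry)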

Control variables: ['Construction', 'Wholesale_and_Retail_Trade', 'Transportation_Warehousing_and_Postal_Services', 'Accommodation_and_Catering']
                            OLS Regression Results                            
Dep. Variable:               Industry   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.050e+26
Date:                Sat, 11 Jan 2025   Prob (F-statistic):           1.10e-53
Time:                        22:54:54   Log-Likelihood:                 184.29
No. Observations:                  10   AIC:                            -356.6
Df Residuals:                       4   BIC:                            -354.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                                                     coef    std err         

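The R-squared of 1.000 and the F-statistic of about 5e+26 in the summary above do not reflect economic explanatory power. The call passes `'Industry'` as both the dependent column and the investment column, so the series is regressed on itself. The sketch below reproduces the effect on a smaller scale, assuming only the `Industry` and `Construction` values from the table and using `numpy.linalg.lstsq` in place of statsmodels to keep the example dependency-light. It is an illustration of the mechanism, not a re-run of the notebook's model.

```python
import numpy as np

# Industry and Construction columns copied from the data table in this notebook.
industry = np.array([233197.4, 234968.9, 245406.4, 275119.3, 301089.3,
                     311858.7, 312902.9, 374545.6, 395043.7, 399103.1])
construction = np.array([45401.7, 47761.3, 51498.9, 57905.6, 65493.0,
                         70648.1, 72444.7, 78741.2, 80766.0, 85691.1])

# Design matrix: intercept, the "investment" column (here Industry itself),
# and one control variable.
X = np.column_stack([np.ones_like(industry), industry, construction])
y = industry  # dependent variable is the same series as the second regressor

# Ordinary least squares via the least-squares solver.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# R-squared = 1 - SSR / SST.
residuals = y - X @ beta
r_squared = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)

print("coefficients (const, Industry, Construction):", np.round(beta, 6))
print("R-squared:", r_squared)
```

The coefficient on `Industry` comes out at essentially 1 and the others at essentially 0, which is the identity y = y. A meaningful version of the model needs an investment series that is distinct from the output series. With 10 observations and 6 estimated parameters the summary also reports only 4 residual degrees of freedom, so even then the estimates would rest on very little data.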
