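## Welcome to ModelWhale Notebook

Here you can write code and documentation.

### About the file directories

**project**: the project directory is this project's workspace. Put every file the project needs to run here; files added, deleted, or modified in this directory are preserved.

**input**: the input directory is where datasets are mounted. Every dataset mounted into the project appears here; when no dataset is mounted, the input directory is hidden.

**temp**: the temp directory is temporary disk space. Files produced during training or analysis that do not need to be kept can go here; files in this directory are not saved.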


In [1]:
# List the files in the persistent personal workspace
!ls /home/mw/project/

house_price_predictions_2023200584.csv	Ridge_submission_2023200584
OLS_submission_2023200584


In [2]:
# List the currently mounted dataset directories
!ls /home/mw/input/

quant4533


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn 
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import scipy
from scipy import stats
from scipy.stats import pearsonr
from scipy.stats import zscore
from sklearn.linear_model import LinearRegression

In [4]:
plt.rcParams['figure.figsize']=[10,5]

# Get the Data

## Download the Data

In [5]:
df=pd.read_csv('/home/mw/input/quant4533/ruc_Class25Q1_train.csv')
df.head(1)

Unnamed: 0,城市,区域,板块,环线,小区名称,价格,房屋户型,所在楼层,建筑面积,套内面积,...,产权所属,抵押信息,房屋优势,核心卖点,户型介绍,周边配套,交通出行,lon,lat,年份
0,0,79.0,111.0,二至三环,人定湖西里,6564200,2室1厅1厨1卫,中楼层 (共5层),52.3㎡,,...,非共有,,装修、房本满五年,此房是南北通透小板楼，户型方正，格局合理,房子是南北通透户型方正采光好，前后没有遮挡视野好，通风效果好,医院、公园、超市，生活便利，火箭军医院、积水潭医院，双秀公园，人定湖公园，物美超市、世纪华联等。,,116.389326,39.963727,2018.0


## brief look at the data structure

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84133 entries, 0 to 84132
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   城市      84133 non-null  int64  
 1   区域      84133 non-null  float64
 2   板块      84133 non-null  float64
 3   环线      42726 non-null  object 
 4   小区名称    84133 non-null  object 
 5   价格      84133 non-null  int64  
 6   房屋户型    83528 non-null  object 
 7   所在楼层    84133 non-null  object 
 8   建筑面积    84133 non-null  object 
 9   套内面积    25146 non-null  object 
 10  房屋朝向    84133 non-null  object 
 11  建筑结构    83528 non-null  object 
 12  装修情况    83528 non-null  object 
 13  梯户比例    82438 non-null  object 
 14  配备电梯    75818 non-null  object 
 15  别墅类型    749 non-null    object 
 16  交易时间    84133 non-null  object 
 17  交易权属    84133 non-null  object 
 18  上次交易    55180 non-null  object 
 19  房屋用途    84131 non-null  object 
 20  房屋年限    54351 non-null  object 
 21  产权所属    84133 non-null  object 
 22

In [7]:
df.hist(bins=50)

array([[<AxesSubplot:title={'center':'城市'}>,
        <AxesSubplot:title={'center':'区域'}>,
        <AxesSubplot:title={'center':'板块'}>],
       [<AxesSubplot:title={'center':'价格'}>,
        <AxesSubplot:title={'center':'抵押信息'}>,
        <AxesSubplot:title={'center':'lon'}>],
       [<AxesSubplot:title={'center':'lat'}>,
        <AxesSubplot:title={'center':'年份'}>, <AxesSubplot:>]],
      dtype=object)

In [8]:
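# Sample a few random rows each time to get familiar with the data
# Set display options so that all columns and rows are shown in full
pd.set_option('display.max_columns', None)  # show every column
pd.set_option('display.max_colwidth', None)  # do not truncate cell contents
pd.set_option('display.width',None)
pd.set_option('display.expand_frame_repr', False)  # do not wrap wide frames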

df.sample(5)[['建筑面积','装修情况','环线','房屋优势','核心卖点','户型介绍','交通出行']]

Unnamed: 0,建筑面积,装修情况,环线,房屋优势,核心卖点,户型介绍,交通出行
45720,171.98㎡,其他,,,,,
33896,59.33㎡,精装,,地铁、装修、房本满两年、,旭辉轻轨正两房，楼层好，采光好，户型方正，近轻轨，邻万达,,距离3号线学堂湾轻轨站大约600米；轻轨站下方有公交车198.898.171.306.822.393.311等多路公交车民生、工商、中国、农商、邮政、农业、等多家银行进驻龙洲湾商圈；龙洲公园、巴文化公园、滨江公园三大公园环抱；海洋公园、巴南图书馆、区政府.
75569,165.24㎡,其他,二至三环,、房本满两年,万国花园交通便利小区环境安静业主诚心出售房子户型,,210路，312路，5路，205路，905路，908路，716路至电子二路或博爱医院站下车
67426,39.24㎡,其他,,,公路大桥河曲街交通便利采光充足一室一厅,,"公交线路：12路,、14路,、61路,、65路、65路区间、66路、66路区间、77路、84路、116路、125路、201路、215路"
68548,273.97㎡,毛坯,,地铁、房本满五年、,本房户型很好，南北通透，小区中间位置，看房方便,,小区地处昆明路和团结南路交汇处向西昆明路2号！距离地铁三号线丈八北路站1.6公里。！距离公交站3511厂站仅有50米，包括有15路、201路、219路、219路区间、703路、800路出行方便。


## create a test set  
+ stratified sampling based on the city category  


In [9]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder

In [10]:
# Convert the city feature into discrete integer labels that stratified sampling can use
city_encoder = LabelEncoder()
city_labels = city_encoder.fit_transform(df['城市'])

In [11]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=111)

for train_index, test_index in split.split(df, city_labels):
    df_train = df.iloc[train_index]
    df_test = df.iloc[test_index]
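
A quick way to confirm that a stratified split keeps the city mix is to compare class shares in the two parts. The following is a minimal sketch on synthetic labels; the class probabilities and the helper name `class_shares` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the city column: 7 imbalanced classes (assumption).
rng = np.random.default_rng(0)
labels = rng.choice(7, size=5000, p=[0.3, 0.2, 0.15, 0.15, 0.1, 0.05, 0.05])
X_dummy = np.zeros((len(labels), 1))  # features play no role in the split

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=111)
train_idx, test_idx = next(splitter.split(X_dummy, labels))

def class_shares(y, n_classes=7):
    """Return the fraction of samples that fall in each class."""
    return np.bincount(y, minlength=n_classes) / len(y)

# Largest difference between train and test class shares.
max_gap = np.abs(class_shares(labels[train_idx]) - class_shares(labels[test_idx])).max()
print(f"largest share gap between train and test: {max_gap:.4f}")
```

With stratification the gap stays at rounding level; a plain random split would typically show larger gaps for the rare classes.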

# Discover and Visualize the Data to Find Correlations

In [12]:
demo=df_train.copy()

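## numeric attributes  
+ 建筑面积 (gross floor area): includes the shared (公摊) area; the shared proportion is higher in some buildings, such as high-rise blocks with elevators  
+ 套内面积 (interior area): the space the buyer actually uses, usually strongly positively correlated with price  
+ Relationship: 建筑面积 = 套内面积 + 公摊面积  
+ Interpretability: in principle 套内面积 should explain price better, while 建筑面积 may carry noise (differences in shared area) that weakens its explanatory power  
+ Collinearity: the two are highly correlated; reduce dimensionality or keep only one if necessary  


#### 套内面积 or 建筑面积: which is the better predictive feature?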

In [13]:
# Strip the unit from 套内面积 and 建筑面积 and convert both to floats
demo['套内面积']=demo['套内面积'].str.replace('㎡','',regex=False).astype(float)
demo['建筑面积']=demo['建筑面积'].str.replace('㎡','',regex=False).astype(float)

In [14]:
demo['得房率']=demo['套内面积']/demo['建筑面积']
demo['得房率'].describe()
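# The ratio is fairly stable: its standard deviation is about 7 percentage points around a mean of 0.81,
# so 套内面积 and 建筑面积 track each other closely; the minimum of about 0.012 points to a few bad records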

count    20014.000000
mean         0.808447
std          0.070697
min          0.011690
25%          0.773981
50%          0.805422
75%          0.840610
max          1.000000
Name: 得房率, dtype: float64
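
The minimum of roughly 0.012 in the summary above is far outside what a real floor-area ratio could be, so a simple plausibility filter may help to locate such rows. This is a minimal sketch on made-up numbers; the band limits `LOW` and `HIGH` are assumptions, not values used in the notebook.

```python
import pandas as pd

# Toy rows mimicking the two area columns after unit stripping (values are made up).
areas = pd.DataFrame({
    "建筑面积": [100.0, 80.0, 855.0, 60.0],
    "套内面积": [81.0, 64.0, 10.0, 60.0],
})
areas["得房率"] = areas["套内面积"] / areas["建筑面积"]

# Assumed plausibility band: ratios below 0.5 or above 1.0 are treated as suspect.
LOW, HIGH = 0.5, 1.0
suspect = areas[(areas["得房率"] < LOW) | (areas["得房率"] > HIGH)]
print(suspect)
```

Only the third toy row is flagged; whether such rows are dropped or corrected is a separate decision.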

In [15]:
# Keep the rows where 套内面积 is present and compare how well 套内面积 and 建筑面积 explain price
df_non_missing=demo.dropna(subset=['套内面积']).copy()
# Spearman rank correlations with price
corr_construction=df_non_missing['建筑面积'].corr(df_non_missing['价格'],method='spearman')
corr_interior=df_non_missing['套内面积'].corr(df_non_missing['价格'],method="spearman")
corr_construction, corr_interior


(0.5420481636803345, 0.524559682902003)

In [16]:
# Scatter-plot matrix
sns.pairplot(df_non_missing,vars=['建筑面积','套内面积','价格'],diag_kind='kde')

<seaborn.axisgrid.PairGrid at 0x7f0fbade5190>

In [17]:
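# The Spearman correlation of 建筑面积 with price is very close to that of 套内面积 (even slightly higher, possibly because the Chinese market quotes prices on gross area and buyers attach some value to the shared area)
# To avoid collinearity, drop 套内面积 and keep only 建筑面积 as a predictive feature
demo.drop(columns=['套内面积'],inplace=True)    # inplace=True modifies demo directly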

#### Processing the 建筑面积 feature

In [18]:
sns.scatterplot(x='建筑面积',y='价格',data=demo,alpha=0.6)

<AxesSubplot:xlabel='建筑面积', ylabel='价格'>

In [19]:
lower_thresh=demo['建筑面积'].quantile(0.0001)
upper_thresh=demo['建筑面积'].quantile(0.9999)
outlier_percentile=demo[(demo['建筑面积']<lower_thresh)|(demo['建筑面积']>upper_thresh)]
outlier_percentile
# Some records look wrong: very large area combined with a very low price

Unnamed: 0,城市,区域,板块,环线,小区名称,价格,房屋户型,所在楼层,建筑面积,房屋朝向,建筑结构,装修情况,梯户比例,配备电梯,别墅类型,交易时间,交易权属,上次交易,房屋用途,房屋年限,产权所属,抵押信息,房屋优势,核心卖点,户型介绍,周边配套,交通出行,lon,lat,年份,得房率
77254,6,46.0,597.0,二至三环,银池道拉斯财富中心,13015900,1室0厅0厨0卫,低楼层 (共22层),1143.0,南,钢混结构,其他,一梯四户,,,2024-02-15,商品房,,写字楼,,非共有,,地铁,,,,,108.954349,34.340257,2021.0,
77265,6,93.0,247.0,二至三环,天地源悦熙广场,3580000,1室0厅0厨0卫,高楼层 (共29层),1200.0,东,钢混结构,其他,五梯六户,有,,2024-02-15,商品房,,写字楼,,非共有,,,,,,,108.913078,34.215412,2021.0,
78096,6,46.0,238.0,三环外,华远辰悦,108100,,地下室 (共0层),10.0,南 北,,,,,,2024-03-05,使用权,,车库,,非共有,,,,,,,109.005708,34.393416,2021.0,
82776,6,66.0,307.0,,沁水新城,1095000,2室2厅1厨1卫,高楼层 (共24层),10337.0,南,钢混结构,其他,两梯六户,有,,2024-06-08,商品房,,普通住宅,,非共有,,地铁,,,,,109.045126,34.275498,2022.0,
69628,6,66.0,596.0,,高科绿水东城二期,86800,,地下室 (共33层),10.0,南,,,,,,2022-09-11,商品房,,车库,,非共有,,,车位的位置很好，方便停车，业主诚心出售。,,公园：灞桥湿地公园交通：233/301/236/508/907，地铁一号线，纺织城站、医疗：唐都医院购物：华东购物广场，华阳城,门口有多路公交车，275，269，230，301，233，195，237，直达全城，离地铁站三站路，离高速路口1公里,109.070168,34.298546,2020.0,
78228,6,93.0,776.0,,陕商投资大厦,11390000,1室0厅0厨0卫,中楼层 (共30层),1250.0,南,钢混结构,其他,一梯一户,,,2024-03-09,商品房,,写字楼,,非共有,,,,,,,108.881256,34.231108,2021.0,
75899,6,66.0,596.0,,高科绿水东城二期,86800,,地下室 (共33层),10.0,东 南 西 北,,,,,,2023-12-25,商品房,,车库,,共有,,,业主诚心出售，随时可看，位置好,,3公里范围内有，唐都医院，华阳城商场，浐河湿地公园，滨河铁一中等,小区距离纺织城地铁一号线一公里，门前有多条公交线路，出行方便,109.070168,34.298546,2021.0,
57713,5,48.0,438.0,,香树湾,332400,1室1厅1厨1卫,高楼层 (共12层),5108.0,南,钢混结构,简装,一梯三户,有,,2020-07-12,商品房,2019-09-26,普通住宅,,共有,,,采光充足，交通便利，香树湾小区户型板采光好,,封闭式小区，环境优雅，市政府商圈，万达茂购物方便，隔道医大四院，门口公交站出行方便,交通方便，去江南群利哈西万达都有公交，距离万达茂3公里,126.521095,45.833217,2018.0,
76417,6,66.0,596.0,,高科绿水东城二期,86800,,地下室 (共33层),10.0,南,,,,,,2024-01-19,商品房,,车库,,非共有,,,业主诚心出售，随时可看，位置好,,3公里范围内有，唐都医院，华阳城商场，浐河湿地公园，滨河铁一中等,小区距离纺织城地铁一号线一公里，门前有多条公交线路，出行方便,109.070168,34.298546,2021.0,
82576,6,93.0,250.0,二至三环,华洲城天峰,166320,,地下室 (共0层),10.0,南,,,,,,2024-06-06,商品房,,车库,,非共有,,地铁,,,,,108.85482,34.250415,2022.0,


In [20]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin

In [21]:
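def remove_outliers(X, y, threshold=27):    # start from threshold=3 and check whether the flagged points are reasonable
    """Return a boolean mask marking rows whose regression residual has |z-score| above threshold."""
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    z_scores = np.abs(zscore(residuals))
    is_outlier = z_scores>threshold
    return (is_outlier)
is_outlier=remove_outliers(demo[['建筑面积']],demo['价格'])  
demo[is_outlier]
# At threshold=27 the function isolates the two clearly erroneous points seen in the scatter plot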

Unnamed: 0,城市,区域,板块,环线,小区名称,价格,房屋户型,所在楼层,建筑面积,房屋朝向,建筑结构,装修情况,梯户比例,配备电梯,别墅类型,交易时间,交易权属,上次交易,房屋用途,房屋年限,产权所属,抵押信息,房屋优势,核心卖点,户型介绍,周边配套,交通出行,lon,lat,年份,得房率
82776,6,66.0,307.0,,沁水新城,1095000,2室2厅1厨1卫,高楼层 (共24层),10337.0,南,钢混结构,其他,两梯六户,有,,2024-06-08,商品房,,普通住宅,,非共有,,地铁,,,,,109.045126,34.275498,2022.0,
57713,5,48.0,438.0,,香树湾,332400,1室1厅1厨1卫,高楼层 (共12层),5108.0,南,钢混结构,简装,一梯三户,有,,2020-07-12,商品房,2019-09-26,普通住宅,,共有,,,采光充足，交通便利，香树湾小区户型板采光好,,封闭式小区，环境优雅，市政府商圈，万达茂购物方便，隔道医大四院，门口公交站出行方便,交通方便，去江南群利哈西万达都有公交，距离万达茂3公里,126.521095,45.833217,2018.0,
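
The residual z-score idea used above can be reproduced on synthetic data with NumPy alone. This is a minimal sketch: the price model, noise level, injected record, and the threshold of 3 are all made-up assumptions, and a least-squares line from `np.polyfit` stands in for `LinearRegression`.

```python
import numpy as np

rng = np.random.default_rng(42)
area = rng.uniform(30, 200, size=2000)
price = 20_000 * area + rng.normal(0, 200_000, size=2000)

# Inject one obviously wrong record: large area, tiny price (made-up values).
area = np.append(area, 1_000.0)
price = np.append(price, 100_000.0)

# Least-squares line price ~ slope * area + intercept.
slope, intercept = np.polyfit(area, price, deg=1)
residuals = price - (slope * area + intercept)

# Standardise the residuals; the population std (ddof=0) matches scipy.stats.zscore's default.
z = np.abs((residuals - residuals.mean()) / residuals.std())

THRESHOLD = 3.0  # illustrative only; the notebook settles on 27 for its own data
flagged = np.flatnonzero(z > THRESHOLD)
print(flagged)
```

One caveat worth keeping in mind: an extreme point also pulls the fitted line toward itself and inflates the residual standard deviation, so the threshold that works depends heavily on the data, which is consistent with the manual tuning described above.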


#### numeric pipeline

In [22]:
class OutlierRemover(BaseEstimator, TransformerMixin):
    """
    Regression-based outlier filter.
    Drops training rows whose residual from a price ~ area line has |z-score| >= z_threshold.
    """
    def __init__(self, z_threshold=27.0):
        self.z_threshold = z_threshold
    
    def fit(self, X, y):
        '''Training: fit price on floor area and record which rows are inliers.'''
        X_2d=np.asarray(X).reshape(-1,1)
        model = LinearRegression().fit(X_2d, y)
        residuals = np.asarray(y) - model.predict(X_2d)
        z_scores = np.abs(zscore(residuals))
        self.inliers_ = z_scores < self.z_threshold
        return self
    
    def transform(self, X, y=None):
        '''Prediction: return X unchanged (no price is available to compute residuals).'''
        return X
    
    def fit_transform(self, X, y):
        '''Training: return only the inlier rows of X.'''
        self.fit(X,y)
        return X[self.inliers_]

outlier_remover= OutlierRemover(z_threshold=27)

In [23]:
class Convert_m2(BaseEstimator, TransformerMixin):
    """
    Extracts the numeric floor area by removing unit characters with str.replace.
    """ 
    def __init__(self, to_replace=('㎡',), replace_with=''):
        """
        Parameters:
            to_replace: sequence of substrings to remove
            replace_with: replacement string (default: empty string)
        """
        self.to_replace = to_replace
        self.replace_with = replace_with
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        if isinstance(X,pd.Series):
            x_clean=X.astype(str)
        else:
            x_clean = pd.Series(X).astype(str)
            
        for char in self.to_replace:
            # literal replacement; regex=False avoids treating the substring as a pattern
            x_clean = x_clean.str.replace(char, self.replace_with, regex=False)
            
        return pd.to_numeric(x_clean).values.reshape(-1, 1)

convert_m2= Convert_m2(to_replace=['㎡'])

In [24]:
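# fit_transform returns only the filtered 建筑面积 column; demo itself keeps every row,
# which is why the maximum reported by describe() below is still 10337
demo_cleaned=outlier_remover.fit_transform(demo['建筑面积'],demo['价格'])
demo['建筑面积']=convert_m2.transform(demo['建筑面积'])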

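#### Possible directions to try next: 1. predict the price per square metre and multiply by 建筑面积, or 2. predict the total price directly with 建筑面积 as the main feature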
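
The two directions can be compared mechanically as follows. This is a minimal sketch on synthetic data: the feature `renovated`, the coefficients, and the noise level are made up, and because the data are generated multiplicatively the setup favours option 1 by construction, so only the mechanics carry over.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 2000
area = rng.uniform(40, 180, size=n)
renovated = rng.integers(0, 2, size=n)            # hypothetical 0/1 feature
unit_price = 15_000 + 5_000 * renovated + rng.normal(0, 1_500, size=n)
total_price = unit_price * area                   # synthetic ground truth

train, test = slice(0, 1500), slice(1500, n)

# Option 1: model the price per square metre, then multiply by the area.
X_unit = renovated.reshape(-1, 1)
unit_model = LinearRegression().fit(X_unit[train], unit_price[train])
pred_via_unit = unit_model.predict(X_unit[test]) * area[test]

# Option 2: model the total price directly from area and the same feature.
X_total = np.column_stack([area, renovated])
total_model = LinearRegression().fit(X_total[train], total_price[train])
pred_direct = total_model.predict(X_total[test])

mae_via_unit = mean_absolute_error(total_price[test], pred_via_unit)
mae_direct = mean_absolute_error(total_price[test], pred_direct)
print(f"MAE via unit price: {mae_via_unit:,.0f}")
print(f"MAE direct:         {mae_direct:,.0f}")
```

On real data the ranking is an empirical question; both variants should be scored on the same held-out rows and on the same target (total price).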

In [25]:
# Does floor area vary too widely?
demo['建筑面积'].describe()

count    67306.000000
mean        96.660516
std         66.274249
min         10.000000
25%         66.000000
50%         88.860000
75%        115.700000
max      10337.000000
Name: 建筑面积, dtype: float64

In [26]:
# Price per square metre
demo['price/m2']=demo['价格']/demo['建筑面积']

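## categorical attributes  
**The categorical variables fall into roughly six groups**  
+ Yes/no type: 配备电梯; 别墅类型; 产权所属  
+ Low-cardinality ordinal: 房屋年限; 装修情况; 环线  
+ Low-cardinality nominal: 房屋用途; 建筑结构   
+ High-cardinality nominal: 小区名称; 房屋户型  
+ High-cardinality ordinal: 所在楼层  
+ Geographic information: 城市; 区域; 板块; 小区名称  

**Because of time constraints, only two low-cardinality ordinal variables were studied, and the steps for handling such variables were summarised as a pipeline**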
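
Of the groups above, 所在楼层 is left unprocessed in this notebook. As a hedged illustration of how such a field could be split into usable parts, the sketch below parses strings of the form seen in the data preview (for example `中楼层 (共5层)`). The function name `parse_floor` and the regular expression are assumptions for illustration only.

```python
import re

# A level label followed by the total floor count in parentheses,
# accepting both ASCII and full-width parentheses.
FLOOR_PATTERN = re.compile(r"^\s*(?P<level>[^\s(（]+)\s*[(（]共(?P<total>\d+)层[)）]\s*$")

def parse_floor(text):
    """Split a 所在楼层 string into (level label, total floors); (None, None) if it does not match."""
    match = FLOOR_PATTERN.match(text) if isinstance(text, str) else None
    if match is None:
        return None, None
    return match.group("level"), int(match.group("total"))

samples = ["中楼层 (共5层)", "高楼层 (共29层)", "地下室 (共0层)", None]
parsed = [parse_floor(s) for s in samples]
print(parsed)
```

The level label could then be mapped to an ordered code and the total floor count used as a numeric feature.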

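### Low-cardinality ordinal

**房屋年限** (years held)  
+ Homes held "满五年" (five years or more) usually sell for 3%-5% more than homes held "满两年" (two years or more), because part of the income tax the buyer saves turns into a premium  
+ The higher the base price, the larger the effect of the tax difference on the total price → should the effect of 房屋年限 be assessed only among homes at a similar price level?  
+ missing: 29782 (in the full data set); inference: a blank entry means the home has no advantage in terms of years held  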


In [27]:
demo['房屋年限'].value_counts(dropna=False)

NaN     23829
满五年     23812
满两年     12873
未满两年     6792
Name: 房屋年限, dtype: int64

In [28]:
demo['房屋年限']=demo['房屋年限'].fillna('空白')

In [29]:
# Mean (not median) price per square metre for each 房屋年限 level
mean_price1=demo.groupby('房屋年限')['price/m2'].mean().reset_index()
mean_price1

Unnamed: 0,房屋年限,price/m2
0,未满两年,20974.278316
1,满两年,19414.097768
2,满五年,30628.155975
3,空白,10274.207995


In [30]:
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=demo,
    x='房屋年限',
    y='price/m2',
    )
plt.xticks()

(array([0, 1, 2, 3]),
 [Text(0, 0, '空白'), Text(1, 0, '满五年'), Text(2, 0, '满两年'), Text(3, 0, '未满两年')])

In [31]:
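# Homes held 满五年 have a clearly higher mean unit price than the other groups; 满两年 and 未满两年 do not differ noticeably
# The 空白 (blank) group appears to sit below 满两年 and 未满两年, but the gap was not judged clear-cut, so to be cautious 空白 is not given its own code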
order_map = {
    '满五年': 1,  
    '满两年': 0,
    '未满两年': 0,
    '空白': 0  
}

demo['holding_year_01'] = demo['房屋年限'].map(order_map)

**装修情况** (decoration)  
+ Final treatment: 1. fill missing values with '空白' (blank)  2. apply 0/1 encoding with the mapping {'精装': 1, '简装': 1, '毛坯': 0, '其他': 0, '空白': 0}
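
The fill-then-encode treatment described above can be wrapped in one small helper so that the same rule is applied to every split. This is a minimal sketch; the helper name `binary_encode` and its parameters are hypothetical, not part of the notebook.

```python
import pandas as pd

def binary_encode(series, positive_levels, missing_label="空白"):
    """Fill missing values with a placeholder, then return 1 where the level is in positive_levels, else 0."""
    filled = series.fillna(missing_label)
    return filled.isin(positive_levels).astype(int)

# Toy column with the levels listed above plus a missing entry.
decoration = pd.Series(["精装", "简装", "毛坯", "其他", None])
decoration_01 = binary_encode(decoration, positive_levels={"精装", "简装"})
print(decoration_01.tolist())  # → [1, 1, 0, 0, 0]
```

Unlike a dictionary `.map`, this variant sends any level that was not anticipated to 0 instead of NaN, which may or may not be the desired behaviour.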

In [32]:
demo['装修情况'].value_counts()

精装    24502
其他    20940
简装    13798
毛坯     7568
Name: 装修情况, dtype: int64

In [33]:
demo['装修情况']=demo['装修情况'].fillna('空白')

In [34]:
# Mean (not median) price per square metre for each 装修情况 level
mean_price2=demo.groupby('装修情况')['price/m2'].mean().reset_index()
mean_price2

Unnamed: 0,装修情况,price/m2
0,其他,15333.990778
1,毛坯,13054.614125
2,空白,7740.895437
3,简装,25129.4051
4,精装,24325.979397


In [35]:
plt.figure(figsize=(10, 6))
sns.boxplot(
    data=demo,
    x='装修情况',  
    y='price/m2',  
    order=['毛坯', '简装', '精装', '其他','空白']
)

plt.xticks()

(array([0, 1, 2, 3, 4]),
 [Text(0, 0, '毛坯'),
  Text(1, 0, '简装'),
  Text(2, 0, '精装'),
  Text(3, 0, '其他'),
  Text(4, 0, '空白')])

In [36]:
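# 简装 and 精装 homes generally have higher prices per square metre, while 毛坯, 其他, and 空白 cluster at low prices,
# so 装修情况 is 0/1 encoded (精装 and 简装 → 1; 毛坯, 其他, and 空白 → 0) to capture this threshold effect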
order_map = {'毛坯':0, '其他':0, '空白':0, '简装':1, '精装':1}
demo['decoration_01'] = demo['装修情况'].map(order_map)
demo['decoration_01'].value_counts()

1    38300
0    29006
Name: decoration_01, dtype: int64

#### Geographic information  
+ 城市 (city)  
+ 区域 (district)  
+ 板块 (block)  
+ lon  
+ lat  
+ 小区名称 (community name)  


In [37]:
demo1=df_train.copy()

In [38]:
demo['城市']=demo['城市'].astype('category')
demo['区域']=demo['区域'].astype('category')
demo['板块']=demo['板块'].astype('category')

In [39]:
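# Intended check: do different cities share the same 区域 (district) codes?
# Set of district codes observed in each city
城市区域 = demo.groupby('城市')['区域'].apply(set)

# Union of the district sets of all cities
所有区域 = set().union(*城市区域.values)
区域计数器 = demo['区域'].value_counts()

# NOTE: value_counts counts rows, so this lists district codes that occur in more than one row,
# not codes that are shared by more than one city
重复区域 = 区域计数器[区域计数器 > 1].index.tolist()
重复区域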

[93.0,
 65.0,
 45.0,
 59.0,
 53.0,
 46.0,
 67.0,
 13.0,
 7.0,
 62.0,
 30.0,
 12.0,
 5.0,
 98.0,
 82.0,
 81.0,
 22.0,
 87.0,
 43.0,
 66.0,
 11.0,
 48.0,
 71.0,
 20.0,
 80.0,
 74.0,
 17.0,
 40.0,
 97.0,
 64.0,
 68.0,
 60.0,
 79.0,
 61.0,
 99.0,
 41.0,
 55.0,
 52.0,
 2.0,
 4.0,
 88.0,
 34.0,
 25.0,
 50.0,
 3.0,
 90.0,
 54.0,
 44.0,
 102.0,
 19.0,
 76.0,
 38.0,
 70.0,
 24.0,
 94.0,
 100.0,
 56.0,
 92.0,
 89.0,
 49.0,
 18.0,
 27.0,
 47.0,
 69.0,
 58.0,
 6.0,
 86.0,
 9.0,
 21.0,
 77.0,
 23.0,
 95.0,
 96.0,
 39.0,
 42.0,
 84.0,
 29.0,
 51.0,
 57.0,
 35.0,
 101.0,
 37.0,
 15.0,
 78.0,
 32.0,
 91.0,
 33.0,
 83.0]
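
To answer the question of whether a district code is shared by several cities, the count has to be taken over distinct cities per district rather than over rows. A minimal sketch on toy data follows; the values are made up and the variable names are illustrative.

```python
import pandas as pd

# Toy data: district 46 appears in cities 5 and 6, the others belong to one city each.
toy = pd.DataFrame({
    "城市": [5, 5, 6, 6, 6, 0],
    "区域": [46.0, 48.0, 46.0, 66.0, 66.0, 79.0],
})

# Number of distinct cities in which each district code occurs.
cities_per_district = toy.groupby("区域")["城市"].nunique()

# District codes shared by more than one city.
shared_districts = cities_per_district[cities_per_district > 1].index.tolist()
print(shared_districts)  # → [46.0]
```

If the columns have been cast to `category` dtype, as in the cell above, passing `observed=True` to `groupby` keeps unused category levels out of the result.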

#### 城市 (city)  


In [40]:
city_data=demo.groupby('城市').agg({'lon':['min','max'],'lat':['min','max'],'价格':'mean'}).reset_index()
city_data.columns=['城市','min_lon','max_lon','min_lat','max_lat','平均房价']
city_data

Unnamed: 0,城市,min_lon,max_lon,min_lat,max_lat,平均房价
0,0,115.977233,117.250582,39.633938,40.490921,4507695.0
1,1,116.268541,117.038247,39.227426,40.002642,1099841.0
2,2,106.19742,107.243745,29.258156,30.024372,1116991.0
3,3,121.040343,121.922277,30.726271,31.64421,4618345.0
4,4,114.015751,114.838876,30.316828,30.887541,1249928.0
5,5,126.500799,127.169979,44.924544,45.920575,628423.6
6,6,108.597306,109.324056,34.113086,34.675726,1346583.0


In [41]:
city_data.plot(kind='bar',x='城市',y='平均房价')

<AxesSubplot:xlabel='城市'>

In [42]:
city_data.plot(kind='scatter',x='min_lon',y='min_lat',c='平均房价',s=100,cmap='jet',colorbar=True,legend=True,grid=True)

<AxesSubplot:xlabel='min_lon', ylabel='min_lat'>

In [43]:
demo_city=demo[['城市']]
demo_city.head(5)

Unnamed: 0,城市
46490,2
37957,2
74936,6
31653,2
62183,5


In [44]:
demo_city_1hot=OneHotEncoder().fit_transform(demo_city)
demo_city_1hot

<67306x7 sparse matrix of type '<class 'numpy.float64'>'
	with 67306 stored elements in Compressed Sparse Row format>

#### 区域 (district)   
+ Only preliminary exploration was done; no final predictive feature was derived  


In [45]:
fig,(ax1,ax2)=plt.subplots(2,1,figsize=(12,32))
pivot1=demo.pivot_table(values='价格',index='区域',columns='城市',aggfunc='mean')
sns.heatmap(pivot1,ax=ax1,annot=True,cmap='jet')
# pivot2=demo.pivot_table(values='价格',index='板块',columns='区域',aggfunc='mean')
# sns.heatmap(pivot2,ax=ax2,annot=True,cmap='jet')

<AxesSubplot:xlabel='城市', ylabel='区域'>

In [46]:
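# One panel per city; a 2x4 grid leaves room for all 7 cities
fig, axes = plt.subplots(2, 4, figsize=(24, 12))
cities = demo["城市"].unique()

for ax, city in zip(axes.flatten(), cities):
    # Rows for the current city (a separate name avoids overwriting city_data from above)
    city_rows = demo[demo["城市"] == city]
    # District-by-price table (heat-map data)
    pivot_table = city_rows.pivot_table(index="区域", values="价格", aggfunc="mean")
    sns.heatmap(pivot_table, annot=True, fmt=".1f", cmap="jet", ax=ax)
    ax.set_title(f"City {city}: mean price by district (yuan)")

# Hide any unused panel
for ax in axes.flatten()[len(cities):]:
    ax.set_visible(False)

plt.tight_layout()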

##### 板块 (block)  
+ Only preliminary exploration was done; no final predictive feature was derived

In [47]:
# Median price per (city, district) pair
district_median_price=demo1.groupby(['城市','区域'])['价格'].median().reset_index()
district_median_price['价格'].describe()

count    9.000000e+01
mean     1.752085e+06
std      1.514831e+06
min      2.949600e+05
25%      7.463500e+05
50%      1.121175e+06
75%      2.507188e+06
max      6.870000e+06
Name: 价格, dtype: float64

In [48]:
# Median price per city
city_median_price=demo1.groupby(['城市'])['价格'].median().reset_index()
city_median_price['价格'].describe()

count    7.000000e+00
mean     1.591443e+06
std      1.213042e+06
min      4.980000e+05
25%      9.320000e+05
50%      1.038000e+06
75%      2.116050e+06
max      3.508000e+06
Name: 价格, dtype: float64

In [49]:
block_median_price=demo1.groupby(['城市','区域','板块'])['价格'].median().reset_index()
block_median_price['价格'].describe()

count    7.230000e+02
mean     2.998599e+06
std      3.098812e+06
min      2.244000e+05
25%      8.770000e+05
50%      2.020600e+06
75%      4.064450e+06
max      2.148630e+07
Name: 价格, dtype: float64

In [50]:
block_median_price.head()

Unnamed: 0,城市,区域,板块,价格
0,0,2.0,35.0,5979600.0
1,0,2.0,40.0,5580000.0
2,0,2.0,41.0,9938600.0
3,0,2.0,86.0,8540000.0
4,0,2.0,126.0,4840000.0


In [51]:
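# Build the pivot matrix
pivot_data = block_median_price.pivot_table(
    index=["城市", "区域"],
    columns="板块",
    values="价格"
)

# Draw the heat map (values are median total prices in yuan, not prices per square metre)
plt.figure(figsize=(12, 6))
sns.heatmap(pivot_data, annot=False, fmt=".1f", cmap="jet")
plt.title("Median price by city, district, and block (yuan)")

Text(0.5, 1.0, 'Median price by city, district, and block (yuan)')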

In [52]:
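# Replace 小区名称 with the mean price of that community
# Reasoning: if the community matters for price → mean prices differ widely across communities → the generated community mean price correlates strongly with the actual price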
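
The idea in the comments above is a form of target (mean) encoding. A minimal sketch on toy data, assuming the means are learned from training rows only and that unseen communities fall back to the global mean; the new column name `小区均价` and all values are made up.

```python
import pandas as pd

train = pd.DataFrame({
    "小区名称": ["A", "A", "B", "B", "B", "C"],
    "价格": [100.0, 120.0, 300.0, 310.0, 290.0, 50.0],
})
test = pd.DataFrame({"小区名称": ["A", "B", "D"]})  # "D" never appears in train

# Mean price per community, learned from the training rows only.
community_mean = train.groupby("小区名称")["价格"].mean()
global_mean = train["价格"].mean()

# Map the learned means onto new rows; unseen communities get the global mean.
test["小区均价"] = test["小区名称"].map(community_mean).fillna(global_mean)
print(test)
```

Encoding the training rows with means that include their own price leaks the target into the feature; computing the means out of fold is a common way to limit this.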

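## text attributes  
+ 房屋优势: 16064 missing, 27 unique values; the wording is fairly uniform and describes decoration, years held, and subway access  
The remaining text attributes all have many missing values and free-form content:  
+ 核心卖点: 16366 missing  
+ 户型介绍: 63671 missing  
+ 周边配套: 34027 missing  
+ 交通出行: 32437 missing  

Final decision: under 0/1 encoding the text attributes showed little correlation with price, so they were not used as predictors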

In [53]:
layout_intro=demo['户型介绍'].value_counts()
layout_intro.head(10)
# Most 户型介绍 entries are positive, promotional descriptions rather than neutral, objective ones

此房户型不错，其格局合理，宽敞的大明厅，日光充足。属于大众最喜欢的户型                                                                  52
本房属于商品房，楼层结构属于框架结构，房子户型方正，紧凑，格局布置的很好，没有浪费的空间面积。                                                      40
精装2室，宽敞客厅，户型方正，采光充足。卧室客厅朝南，大阳台                                                                       26
房屋户型没有任何改动，始终保持原始户型结构。板塔结合，抗震能力好                                                                     23
精装三室，宽敞客厅，户型方正，采光充足，卧室客厅朝南，大阳台                                                                       20
南北向，两室格局，没有变动，横向空间房主自己居住，精装修，独立厨房                                                                    19
大阳台，主客分离，私密性好！洋房低密社区，采光好，小区外围环境优美                                                                    19
本房属于商品房，楼层结构属于框架结构，房子户型方正，紧凑，格局布置的很好，没有浪费的空间面积                                                       18
此房户型不错，其格局合理，宽敞明厅，日光充足，属于大众最喜欢的户型。                                                                   18
1.此房设施完善，2.舒适，采光好，房子整天有光，位置好，包烧好。真正的好房子.3.屋内格局好，一眼即中！可以拎包即住！无需在装

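#### Variable encoding  
+ All of the free-text descriptions are positive assessments of the home, and apart from 房屋优势 the content is too varied to classify by keyword easily. Each variable is therefore encoded as 0 when missing and 1 when present  
**Limitations and possible improvements**  
+ Differences in text content are ignored: keyword flag encoding could mark key phrases (such as 地铁 or 学区) with binary indicators; TF-IDF plus dimensionality reduction could handle more complex text  
+ Strength cannot be quantified: text length encoding (a more detailed description may signal more important information); sentiment intensity encoding (how strongly positive a description is may affect price)  
+ Interaction effects are hard to capture: several text features being present together may have a cumulative effect on price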
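
Two of the improvements listed above, keyword flags and text length, can be sketched with pandas string methods. The keyword list and the toy rows below are illustrative assumptions, not taken from the notebook's results.

```python
import pandas as pd

# Toy rows; the keyword list is an illustrative assumption.
texts = pd.DataFrame({
    "房屋优势": ["地铁、装修、房本满五年", "房本满两年", None, "地铁"],
})
KEYWORDS = ["地铁", "装修", "满五年"]

filled = texts["房屋优势"].fillna("")
for kw in KEYWORDS:
    # 1 if the keyword occurs in the text, else 0 (plain substring match, no regex).
    texts[f"has_{kw}"] = filled.str.contains(kw, regex=False).astype(int)

# Character count as a crude proxy for how detailed the description is.
texts["text_len"] = filled.str.len()
print(texts.drop(columns="房屋优势"))
```

Missing descriptions end up with all flags at 0 and a length of 0, so the original present/absent signal is still recoverable from `text_len > 0`.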

In [54]:
# 0/1 encoding: 1 if the text field is present, 0 if it is missing
text_columns=['房屋优势','核心卖点','户型介绍','周边配套','交通出行']
for col in text_columns:
    demo[f'{col}_01']=demo[col].notnull().astype(int)
print(demo[['price/m2']+[f'{col}_01' for col in text_columns]])
print(demo.isnull().sum())    # check the remaining missing-value counts per column

           price/m2  房屋优势_01  核心卖点_01  户型介绍_01  周边配套_01  交通出行_01
46490   7593.750000        1        0        0        0        0
37957  15126.279863        1        1        1        0        1
74936  16047.837264        1        1        1        1        1
31653  16108.000000        1        0        0        0        0
62183   5629.707942        0        1        0        1        1
...             ...      ...      ...      ...      ...      ...
59232  10302.222222        1        1        0        0        1
16795  11934.212315        1        1        0        0        1
49274  42163.790108        1        1        0        1        1
22738   8395.324123        1        1        0        1        1
18048   9186.025891        1        1        0        1        1

[67306 rows x 6 columns]
城市                     0
区域                     0
板块                     0
环线                 33076
小区名称                   0
价格                     0
房屋户型                 498
所在楼层               

In [55]:
# Correlation matrix
corr_cols=['price/m2']+[f'{col}_01' for col in text_columns]
corr_matrix = demo[corr_cols].corr(method='spearman')
corr_matrix

Unnamed: 0,price/m2,房屋优势_01,核心卖点_01,户型介绍_01,周边配套_01,交通出行_01
price/m2,1.0,0.294093,0.009204,0.178985,-0.032874,-0.028101
房屋优势_01,0.294093,1.0,0.244798,0.101342,0.148947,0.155292
核心卖点_01,0.009204,0.244798,1.0,0.279219,0.596492,0.619626
户型介绍_01,0.178985,0.101342,0.279219,1.0,-0.039636,-0.033119
周边配套_01,-0.032874,0.148947,0.596492,-0.039636,1.0,0.452468
交通出行_01,-0.028101,0.155292,0.619626,-0.033119,0.452468,1.0


In [56]:
# Visualise the correlation matrix
sns.heatmap(corr_matrix,annot=True,cmap='jet',vmin=-1,vmax=1)

<AxesSubplot:>

In [57]:
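# The correlation matrix does not match my prior expectation: 周边配套 and 交通出行 even show a weak negative correlation with price
# Group by price to check whether higher-priced homes are more likely to carry free-text descriptions (my hypothesis)
demo['price_group']=pd.qcut(demo['price/m2'],q=3,labels=['low','median','high'])
print(demo.groupby('price_group')[['交通出行_01','周边配套_01','房屋优势_01','核心卖点_01','户型介绍_01']].mean())
# Homes with a high unit price fill in 交通出行 and 周边配套 less often, possibly because the main selling points of expensive homes lie elsewhere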

              交通出行_01   周边配套_01   房屋优势_01   核心卖点_01   户型介绍_01
price_group                                                  
low          0.618158  0.604787  0.674051  0.802193  0.185283
median       0.647069  0.633385  0.831424  0.815467  0.187475
high         0.576688  0.549187  0.922710  0.798975  0.359617



# Prepare the Data for Machine Learning

In [58]:
from sklearn.compose import ColumnTransformer

In [59]:
X_train=df_train.copy()

In [60]:
X_train['装修情况']=X_train['装修情况'].fillna('未填')
renovation_order_map = {'毛坯':0, '其他':0, '未填':0, '简装':1, '精装':1}
X_train['装修情况_01'] = X_train['装修情况'].map(renovation_order_map)
X_train['装修情况_01'].value_counts()

1    38300
0    29006
Name: 装修情况_01, dtype: int64

In [61]:
X_train['房屋年限']=X_train['房屋年限'].fillna('未填')
houseyear_order_map={'满五年':1,"未满两年":0, "满两年":0, "未填":0}
X_train['房屋年限_01'] = X_train['房屋年限'].map(houseyear_order_map)
X_train['房屋年限_01'].value_counts()

0    43494
1    23812
Name: 房屋年限_01, dtype: int64

In [62]:
X_train['建筑面积']=convert_m2.transform(X_train['建筑面积'])
X_train['建筑面积'].describe()

count    67306.000000
mean        96.660516
std         66.274249
min         10.000000
25%         66.000000
50%         88.860000
75%        115.700000
max      10337.000000
Name: 建筑面积, dtype: float64

In [63]:
# In scikit-learn 1.2 and later this argument is named sparse_output=False
encoder=OneHotEncoder(sparse=False)
X_city= encoder.fit_transform(X_train[["城市"]])
feature_names = encoder.get_feature_names_out(["城市"])
city_encoded=pd.DataFrame(X_city,columns=feature_names,index=X_train.index)

In [64]:
X_train=X_train[['装修情况_01','房屋年限_01','建筑面积']]
X_train_prepared=pd.concat([X_train,city_encoded],axis=1)
X_train_prepared.head(5)

Unnamed: 0,装修情况_01,房屋年限_01,建筑面积,城市_0,城市_1,城市_2,城市_3,城市_4,城市_5,城市_6
46490,1,0,80.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
37957,1,1,73.25,0.0,0.0,1.0,0.0,0.0,0.0,0.0
74936,1,0,89.47,0.0,0.0,0.0,0.0,0.0,0.0,1.0
31653,1,0,75.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
62183,1,0,75.67,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [65]:
X_test=df_test.copy()
X_test['装修情况']=X_test['装修情况'].fillna('未填')
X_test['装修情况_01'] = X_test['装修情况'].map(renovation_order_map)

X_test['房屋年限']=X_test['房屋年限'].fillna('未填')
X_test['房屋年限_01'] = X_test['房屋年限'].map(houseyear_order_map)

X_test['建筑面积']=convert_m2.transform(X_test['建筑面积'])

X_city2= encoder.transform(X_test[["城市"]])  # reuse the encoder fitted on the training split; do not refit
feature_names = encoder.get_feature_names_out(["城市"])
city_encoded2=pd.DataFrame(X_city2,columns=feature_names,index=X_test.index)

X_test=X_test[['装修情况_01','房屋年限_01','建筑面积']]
X_test_prepared=pd.concat([X_test,city_encoded2],axis=1)
X_test_prepared.head(5)

Unnamed: 0,装修情况_01,房屋年限_01,建筑面积,城市_0,城市_1,城市_2,城市_3,城市_4,城市_5,城市_6
1112,1,1,63.08,1.0,0.0,0.0,0.0,0.0,0.0,0.0
77012,0,0,54.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
70578,0,0,23.87,0.0,0.0,0.0,0.0,0.0,0.0,1.0
73698,0,0,195.74,0.0,0.0,0.0,0.0,0.0,0.0,1.0
22957,1,0,89.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [66]:
y_train =df_train['价格'].copy()
y_test=df_test['价格'].copy()
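
The manual steps above are repeated for each split, which makes it easy to refit something on the wrong data. One way to bundle them is a `ColumnTransformer` inside a `Pipeline`, so that fitting happens once on the training rows and other splits are only transformed. The following is a minimal sketch on made-up rows; the helper functions `to_binary_flags` and `strip_area_unit` are hypothetical names, and the mappings mirror the ones used in this section.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

def to_binary_flags(frame):
    """Map 装修情况 and 房屋年限 to 0/1 codes; missing or other levels become 0."""
    out = pd.DataFrame(index=frame.index)
    out["装修情况_01"] = frame["装修情况"].isin(["简装", "精装"]).astype(int)
    out["房屋年限_01"] = (frame["房屋年限"] == "满五年").astype(int)
    return out

def strip_area_unit(frame):
    """Turn strings such as '52.3㎡' into floats."""
    return frame["建筑面积"].astype(str).str.replace("㎡", "", regex=False).astype(float).to_frame()

preprocess = ColumnTransformer([
    ("flags", FunctionTransformer(to_binary_flags), ["装修情况", "房屋年限"]),
    ("area", FunctionTransformer(strip_area_unit), ["建筑面积"]),
    ("city", OneHotEncoder(handle_unknown="ignore"), ["城市"]),
])
model = Pipeline([("prep", preprocess), ("ols", LinearRegression())])

# Tiny made-up frames standing in for df_train / df_test.
toy_train = pd.DataFrame({
    "装修情况": ["精装", None, "毛坯", "简装"],
    "房屋年限": ["满五年", "满两年", None, "满五年"],
    "建筑面积": ["52.3㎡", "80㎡", "100.5㎡", "60㎡"],
    "城市": [0, 1, 1, 0],
    "价格": [6_000_000, 1_000_000, 1_200_000, 5_500_000],
})
toy_test = toy_train.drop(columns="价格").iloc[:2]

model.fit(toy_train.drop(columns="价格"), toy_train["价格"])  # all fitting happens on train
toy_pred = model.predict(toy_test)                            # new rows are only transformed
print(toy_pred)
```

With `handle_unknown="ignore"`, a city code that was not seen during fitting is encoded as all zeros instead of raising an error.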

# Select and Train the model

## Training and Evaluating on the Training Set

In [67]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.compose import ColumnTransformer

In [68]:
from sklearn.model_selection import GridSearchCV

In [69]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

### OLS

In [70]:
# Training set
OLS= LinearRegression()
OLS.fit(X_train_prepared,y_train)

# evaluate the error
X_train_pred_OLS=OLS.predict(X_train_prepared)
OLS_mae=mean_absolute_error(y_train,X_train_pred_OLS)
OLS_rmse=np.sqrt(mean_squared_error(y_train,X_train_pred_OLS))
print("OLS_MAE:",OLS_mae)
print("OLS_RMSE:",OLS_rmse)

OLS_MAE: 798633.2612461738
OLS_RMSE: 1919423.851782382


In [71]:
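# Held-out split
# NOTE: this call refits OLS on the held-out split, so the scores printed below are in-sample for that split;
# a true out-of-sample score reuses the model fitted on X_train_prepared and only calls predict here
OLS.fit(X_test_prepared,y_test)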

# evaluate the error
X_test_pred_OLS=OLS.predict(X_test_prepared)
OLS_mae=mean_absolute_error(y_test,X_test_pred_OLS)
OLS_rmse=np.sqrt(mean_squared_error(y_test,X_test_pred_OLS))
print("OLS_MAE:",OLS_mae)
print("OLS_RMSE:",OLS_rmse)

OLS_MAE: 906378.9385810771
OLS_RMSE: 1644504.910388741
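
The difference between a held-out score and a score obtained after refitting on the evaluation rows can be sketched on synthetic data. Everything below (data, coefficients, split sizes) is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=1.0, size=1000)
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

ols = LinearRegression().fit(X_tr, y_tr)   # fit once, on the training rows only

# Held-out score: predict the test rows with the model fitted on train; no refit.
test_pred = ols.predict(X_te)
holdout_mae = mean_absolute_error(y_te, test_pred)
holdout_rmse = np.sqrt(mean_squared_error(y_te, test_pred))

# For contrast: refitting on the test rows yields an in-sample score for those rows.
refit_pred = LinearRegression().fit(X_te, y_te).predict(X_te)
refit_rmse = np.sqrt(mean_squared_error(y_te, refit_pred))

print(f"held-out RMSE:      {holdout_rmse:.3f}")
print(f"refit-on-test RMSE: {refit_rmse:.3f}")
```

Because least squares minimises squared error on the rows it is fitted to, the refit RMSE can never exceed the held-out RMSE on the same rows, which is why it cannot serve as an estimate of generalisation error.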


In [72]:
mae_scores = -cross_val_score(OLS, X_train_prepared, y_train, cv=6, scoring='neg_mean_absolute_error')
rmse_scores = -cross_val_score(OLS, X_train_prepared, y_train, cv=6, scoring='neg_root_mean_squared_error')

print("MAE scores:", np.mean(mae_scores))
print("RMSE scores:", np.mean(rmse_scores))

MAE scores: 809117.5197419837
RMSE scores: 1998603.6695784016


### Ridge  


In [73]:
ridge=Ridge()

# Hyperparameter tuning
params = {'alpha': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(ridge, params, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train_prepared, y_train)
print("Best alpha:", grid.best_params_)

Best alpha: {'alpha': 10}


In [74]:
best_ridge = grid.best_estimator_
y_pred = best_ridge.predict(X_train_prepared)

mae=mean_absolute_error(y_train,y_pred)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
print(f"Training MAE: {mae:.2f}")
print(f"Training RMSE: {rmse:.2f}")

Training MAE: 798391.07
Training RMSE: 1919425.22


In [75]:
y_test_pred = best_ridge.predict(X_test_prepared)

mae=mean_absolute_error(y_test,y_test_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Test MAE: {mae:.2f}")
print(f"Test RMSE: {rmse:.2f}")

Test MAE: 788886.15
Test RMSE: 1765295.78


In [76]:
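# NOTE: cross_val_score clones best_ridge and refits it on folds of the held-out split, so this is
# cross-validation within that split rather than a score for the model trained on the training set
mae_scores = -cross_val_score(best_ridge, X_test_prepared, y_test, cv=6, scoring='neg_mean_absolute_error')
rmse_scores = -cross_val_score(best_ridge, X_test_prepared, y_test, cv=6, scoring='neg_root_mean_squared_error')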

print("MAE scores:", np.mean(mae_scores))
print("RMSE scores:", np.mean(rmse_scores))

MAE scores: 907409.1340165349
RMSE scores: 1649860.8118270664


### LASSO

In [86]:
from sklearn.linear_model import LassoCV

In [87]:
lasso_cv = LassoCV(alphas=np.logspace(-4, 0, 100), cv=5)  # candidate alpha (lambda) values
lasso_cv.fit(X_train_prepared, y_train)
best_alpha = lasso_cv.alpha_


In [88]:
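# NOTE: y_pred still holds the Ridge predictions from the Ridge section, so these numbers repeat the Ridge
# training scores; Lasso training scores would need y_pred = lasso_cv.predict(X_train_prepared) first
mae=mean_absolute_error(y_train,y_pred)
rmse = np.sqrt(mean_squared_error(y_train, y_pred))
print(f"Training MAE: {mae:.2f}")
print(f"Training RMSE: {rmse:.2f}")

Training MAE: 798391.07
Training RMSE: 1919425.22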


In [89]:
y_test_pred = lasso_cv.predict(X_test_prepared)

mae=mean_absolute_error(y_test,y_test_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
print(f"Test MAE: {mae:.2f}")
print(f"Test RMSE: {rmse:.2f}")

Test MAE: 789128.71
Test RMSE: 1765294.24


In [90]:
mae_scores = -cross_val_score(lasso_cv, X_test_prepared, y_test, cv=6, scoring='neg_mean_absolute_error')
rmse_scores = -cross_val_score(lasso_cv, X_test_prepared, y_test, cv=6, scoring='neg_root_mean_squared_error')



In [91]:
print("MAE scores:", np.mean(mae_scores))
print("RMSE scores:", np.mean(rmse_scores))

MAE scores: 908854.2287195707
RMSE scores: 1649849.2586776689


## Final test

In [77]:
# Load the CSV used for the final test
df_pred=pd.read_csv('/home/mw/input/quant4533/ruc_Class25Q1_test.csv')

In [78]:
X_pred=df_pred.copy()
X_pred['城市'].sample(5)

10301    4
10717    4
9349     3
8774     3
7842     3
Name: 城市, dtype: int64

In [79]:
X_pred['装修情况']=X_pred['装修情况'].fillna('未填')
X_pred['装修情况_01'] = X_pred['装修情况'].map(renovation_order_map)

X_pred['房屋年限']=X_pred['房屋年限'].fillna('未填')
X_pred['房屋年限_01'] = X_pred['房屋年限'].map(houseyear_order_map)

X_pred['建筑面积']=convert_m2.transform(X_pred['建筑面积'])

X_city3= encoder.transform(X_pred[["城市"]])  # reuse the encoder fitted on the training split; do not refit
feature_names = encoder.get_feature_names_out(["城市"])
city_encoded3=pd.DataFrame(X_city3,columns=feature_names,index=X_pred.index)

X_pred=X_pred[['装修情况_01','房屋年限_01','建筑面积']]
X_pred_prepared=pd.concat([X_pred,city_encoded3],axis=1)
X_pred_prepared

Unnamed: 0,装修情况_01,房屋年限_01,建筑面积,城市_0,城市_1,城市_2,城市_3,城市_4,城市_5,城市_6
0,1,0,209.20,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,1,163.69,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,1,102.92,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,1,109.66,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,1,57.20,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...
14781,0,0,65.78,0.0,0.0,0.0,0.0,0.0,0.0,1.0
14782,0,0,138.01,0.0,0.0,0.0,0.0,0.0,0.0,1.0
14783,0,0,110.00,0.0,0.0,0.0,0.0,0.0,0.0,1.0
14784,0,0,136.00,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [80]:
import os

In [81]:
# OLS (this estimator was last fitted on the held-out split in the evaluation cell above)
final_predictions=OLS.predict(X_pred_prepared)
final_predictions_df=pd.DataFrame({'ID':range(len(final_predictions)),
                                     'Price':final_predictions})
final_predictions_df.to_csv('OLS_submission_2023200584',index=False) 
final_predictions_df.head(5)                                  

Unnamed: 0,ID,Price
0,0,7909798.0
1,1,6583801.0
2,2,4859052.0
3,3,4799383.0
4,4,3561446.0


In [82]:
# Ridge
final_predictions2=best_ridge.predict(X_pred_prepared)
final_predictions2_df=pd.DataFrame({'ID':range(len(final_predictions2)),
                                     'Price':final_predictions2})
final_predictions2_df.to_csv('Ridge_submission_2023200584',index=False) 
final_predictions2_df.head(5) 

Unnamed: 0,ID,Price
0,0,6311511.0
1,1,5601714.0
2,2,4663115.0
3,3,4622722.0
4,4,3956966.0


In [92]:
# LASSO
final_predictions3=lasso_cv.predict(X_pred_prepared)
final_predictions3_df=pd.DataFrame({'ID':range(len(final_predictions3)),
                                     'Price':final_predictions3})
final_predictions3_df.to_csv('LASSO_submission_2023200584',index=False) 
final_predictions3_df.head(5) 

Unnamed: 0,ID,Price
0,0,6314612.0
1,1,5602824.0
2,2,4664206.0
3,3,4624439.0
4,4,3958041.0
