"Housing Price Prediction Model for a Real Estate Agency"

Diploma project

Data Science

Data description:

'status': sale status; 'private pool' and 'PrivatePool': whether the property has a private pool; 'propertyType': property type; 'street': property address; 'baths': number of bathrooms; 'homeFacts': construction details of the property (several kinds of information that affect its valuation);

'fireplace': whether the property has a fireplace; 'city': city; 'schools': information about schools in the area; 'sqft': area in square feet; 'zipcode': postal code; 'beds': number of bedrooms; 'state': state; 'stories': number of floors; 'mls-id' and 'MlsId': MLS (Multiple Listing Service) identifier;

'target': property price (the target variable to be predicted).

In [2]:
import random
import numpy as np 
import pandas as pd 
import sys
import optuna

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import SGDRegressor
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import KFold
from sklearn import metrics
from tqdm.notebook import tqdm
from category_encoders import TargetEncoder, CatBoostEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures



import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = 10, 5
%config InlineBackend.figure_format = 'svg' 
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

RANDOM_SEED = 42
TEST_SIZE = 0.2



Building the model

In [3]:
df = pd.read_csv('data/data2.csv')
display(df.head())
df.info()

Unnamed: 0,status,baths,city,sqft,zipcode,state,target,pool_encoded,Year built,Type,Cooling_encoded,Heating_encoded,fireplace_encoded
0,Active,4.0,Southern Pines,2900,28387,NC,418000,False,'2019',single_family_home,False,True,True
1,For Sale,3.0,Spokane Valley,1947,99216,WA,310000,False,'2019',single_family_home,False,False,False
2,For Sale,0.0,Palm Bay,0,32908,FL,5000,False,'',land,False,False,False
3,For Sale,0.0,Philadelphia,897,19145,PA,209000,False,'1920',townhouse,True,True,False
4,Active,0.0,Poinciana,1507,34759,FL,181500,False,'2006',other,True,True,False


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308147 entries, 0 to 308146
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   status             308147 non-null  object 
 1   baths              308147 non-null  float64
 2   city               308147 non-null  object 
 3   sqft               308147 non-null  int64  
 4   zipcode            308147 non-null  int64  
 5   state              308147 non-null  object 
 6   target             308147 non-null  int64  
 7   pool_encoded       308147 non-null  bool   
 8   Year built         308147 non-null  object 
 9   Type               308147 non-null  object 
 10  Cooling_encoded    308147 non-null  bool   
 11  Heating_encoded    308147 non-null  bool   
 12  fireplace_encoded  308147 non-null  bool   
dtypes: bool(4), float64(1), int64(3), object(5)
memory usage: 22.3+ MB


In [4]:
df['zipcode'] = df['zipcode'].astype(str)

In [6]:
# Boolean features:
bin_features = ['pool_encoded','Heating_encoded','Cooling_encoded','fireplace_encoded']

# Categorical features:
cat_features = ['status','city','zipcode','state','Type','Year built']

# Numeric features:
num_features = ['baths', 'sqft', 'target']

In [7]:
for col in cat_features:
    unique_values = df[col].nunique()
    print(f"Number of unique values in {col}: {unique_values}")

Number of unique values in status: 14
Number of unique values in city: 1763
Number of unique values in zipcode: 4139
Number of unique values in state: 37
Number of unique values in Type: 12
Number of unique values in Year built: 217


In [8]:
def preproc_data(df_input):
    '''Log-transform the numeric columns and encode the categorical ones.'''

    df_output = df_input.copy()

    df_output['zipcode'] = df_output['zipcode'].astype(str)

    df_output['Year built'] = df_output['Year built'].astype(str)

    # Log-transform the numeric columns (including the target).
    # The small constant avoids log(0) for zero values.
    constant = 1e-6
    for column in ['baths', 'sqft', 'target']:
        df_output[column] = np.log(df_output[column].abs() + constant)

    # One-hot encode the low-cardinality features.
    # Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`.
    ohe_status = OneHotEncoder(sparse=False)
    ohe_state = OneHotEncoder(sparse=False)
    ohe_Type = OneHotEncoder(sparse=False)

    status_ohe = ohe_status.fit_transform(df_output['status'].values.reshape(-1,1))
    state_ohe = ohe_state.fit_transform(df_output['state'].values.reshape(-1,1))
    Type_ohe = ohe_Type.fit_transform(df_output['Type'].values.reshape(-1,1))

    le = LabelEncoder()
    state_label = le.fit_transform(df_output['state'])

    year_le = LabelEncoder()
    year_ord = year_le.fit_transform(df_output['Year built'])

    city_le = LabelEncoder()
    city_label = city_le.fit_transform(df_output['city'])

    zip_le = LabelEncoder()
    zip_label = zip_le.fit_transform(df_output['zipcode'])

    # Adding encoded categorical features to the output dataframe
    df_output = df_output.join(pd.DataFrame(status_ohe, columns=['status_' + str(cat) for cat in ohe_status.categories_[0]]))
    df_output = df_output.join(pd.DataFrame(state_ohe, columns=['state_' + str(cat) for cat in ohe_state.categories_[0]]))
    df_output = df_output.join(pd.DataFrame(Type_ohe, columns=['Type_' + str(cat) for cat in ohe_Type.categories_[0]]))
    df_output['state_label'] = state_label
    df_output['year_ord'] = year_ord
    df_output['city_label'] = city_label
    df_output['zip_label'] = zip_label

    # Dropping original categorical columns
    df_output.drop(['status', 'state', 'Type', 'city', 'zipcode','Year built'], axis=1, inplace=True)
    
    return df_output

In [9]:
df_encoded = preproc_data(df)
df_encoded.sample(10)

Unnamed: 0,baths,sqft,target,pool_encoded,Cooling_encoded,Heating_encoded,fireplace_encoded,status_Active,status_Auction,status_Back on Market,...,Type_modern,Type_multi_family_home,Type_other,Type_ranch,Type_single_family_home,Type_townhouse,state_label,year_ord,city_label,zip_label
294656,0.6931477,7.40001,11.967815,False,True,True,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,31,193,457,2965
156839,1.386295,7.640123,13.038764,False,True,True,True,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,8,204,1390,854
138076,1.386295,7.903966,12.611371,True,True,True,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,6,196,800,1375
142515,0.6931477,7.937375,12.498238,False,False,False,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,20,0,209,667
303423,0.6931477,7.189168,12.575064,True,False,False,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,20,0,1269,720
34134,0.6931477,7.311218,12.807653,False,True,True,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,6,148,970,1152
197886,9.999995e-07,6.831954,12.971424,False,False,True,False,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,3,137,388,3008
115283,-13.81551,-13.815511,11.695247,False,False,False,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,30,0,802,1500
190921,-13.81551,-13.815511,11.81303,False,False,True,False,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,31,0,705,2560
81975,1.098613,7.916807,12.765403,False,True,True,True,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,31,210,537,2508
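
The values of about -13.8 in the sample above are log(0 + 1e-6): rows with zero 'baths' or 'sqft' end up far from all other rows on the log scale. One possible alternative, shown here only as a sketch and not as the transformation used in this notebook, is np.log1p, which maps 0 to 0 and is inverted by np.expm1. The array `values` is a made-up example; switching transforms would change every metric reported below.

```python
import numpy as np

# Hypothetical sample: a zero, a small value, and a typical sqft value
values = np.array([0.0, 1.0, 1947.0])

# Approach used in preproc_data: log(x + 1e-6), which sends 0 to about -13.8
log_eps = np.log(values + 1e-6)

# Alternative: log(1 + x), which sends 0 to exactly 0
log_1p = np.log1p(values)

# expm1 inverts log1p, so the original values can be recovered
restored = np.expm1(log_1p)

print(log_eps)
print(log_1p)
```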


In [10]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308147 entries, 0 to 308146
Data columns (total 74 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   baths                    308147 non-null  float64
 1   sqft                     308147 non-null  float64
 2   target                   308147 non-null  float64
 3   pool_encoded             308147 non-null  bool   
 4   Cooling_encoded          308147 non-null  bool   
 5   Heating_encoded          308147 non-null  bool   
 6   fireplace_encoded        308147 non-null  bool   
 7   status_Active            308147 non-null  float64
 8   status_Auction           308147 non-null  float64
 9   status_Back on Market    308147 non-null  float64
 10  status_Closed            308147 non-null  float64
 11  status_Coming Soon       308147 non-null  float64
 12  status_Contingent        308147 non-null  float64
 13  status_For Rent          308147 non-null  float64
 14  stat

In [11]:
y = df_encoded.target.values
X = df_encoded.drop(['target'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, shuffle=True, random_state=RANDOM_SEED)

MSE (Mean Squared Error) is the mean of the squared differences between actual and predicted values. Lower MSE means more accurate predictions; because the errors are squared, large errors are penalized more heavily.

MAE (Mean Absolute Error) is the mean of the absolute values of the errors. Lower MAE means more accurate predictions.

R^2 (coefficient of determination) is a statistical measure of how well the model explains the variation in the dependent variable. R^2 ranges from -∞ to 1; the closer it is to 1, the better the model explains the relationship between the variables, and a value of 0 corresponds to always predicting the mean of the target. Because 'target' is log-transformed in preproc_data, all MSE and MAE values below are in log units.
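
To make the three definitions concrete, here is a minimal sketch that computes them directly with NumPy. The function names and the arrays `y_true` and `y_pred` are illustrative and not part of the notebook; the notebook itself uses the equivalent functions from sklearn.metrics.

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of squared errors
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    # Mean of absolute errors
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```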

'Naive' model

In [12]:
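class NaiveModel:
    '''Baseline: predicts the mean target of the object's city.'''

    def __init__(self):
        self.means = None

    def fit(self, X, y):
        X_df = pd.DataFrame(X, columns=['city_label'])
        y_df = pd.DataFrame(y, columns=['target'])
        # Caution: X_df keeps the shuffled index of X, while y_df gets a fresh
        # RangeIndex, so concat pairs rows by index label, not by position.
        df = pd.concat([X_df, y_df], axis=1)
        self.means = df.groupby(['city_label'])['target'].mean().reset_index()

    def predict(self, X):
        X = pd.DataFrame(X, columns=['city_label']).copy()
        X['mean'] = np.nan
        # Assign each city the mean target learned in fit
        for idx, row in self.means.iterrows():
            X.loc[(X['city_label'] == row['city_label']), 'mean'] = row['target']

        # Cities not seen in fit fall back to the average prediction
        X['mean'] = X['mean'].fillna(X['mean'].mean())
        return X['mean'].to_numpy()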

naive_model = NaiveModel()
naive_model.fit(X_train, y_train)
y_pred_train = naive_model.predict(X_train)
y_pred_test = naive_model.predict(X_test)

mse_train = mean_squared_error(y_train, y_pred_train)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_train = mean_absolute_error(y_train, y_pred_train)
mae_test = mean_absolute_error(y_test, y_pred_test)
r2_train = r2_score(y_train, y_pred_train)
r2_test = r2_score(y_test, y_pred_test)



In [13]:
# metrics
print(f"Train MSE: {mse_train:.2f}")
print(f"Test MSE: {mse_test:.2f}")
print(f"Train MAE: {mae_train:.2f}")
print(f"Test MAE: {mae_test:.2f}")
print(f"Train R2: {r2_train:.2f}")
print(f"Test R2: {r2_test:.2f}")

Train MSE: 0.96
Test MSE: 0.98
Train MAE: 0.69
Test MAE: 0.69
Train R2: -0.01
Test R2: -0.00


MSE and MAE are almost identical on the training and test sets, so there is no sign of overfitting.
R^2 is approximately zero: the model explains essentially none of the variance in the target and does no better than predicting the overall mean. The mean absolute error is 0.69 in log units, which corresponds to prices being off by roughly a factor of 2 (exp(0.69) ≈ 1.99). The model is not suitable.
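
A per-city mean would normally be expected to do at least somewhat better than the global mean, so an R^2 of about zero is worth a second look. One likely contributor is that NaiveModel.fit concatenates a DataFrame that keeps the shuffled index from train_test_split with a target frame that has a fresh RangeIndex, so cities and prices may be paired by index label rather than by row position. The class below is a hypothetical variant (the name `CityMeanBaseline` and the toy data are assumptions, not part of the notebook) that keeps the rows paired by building the target on X's own index. Its metrics on the real data are not reported here.

```python
import numpy as np
import pandas as pd

class CityMeanBaseline:
    """Hypothetical variant of NaiveModel: mean target per city, rows paired by position."""

    def __init__(self, key='city_label'):
        self.key = key
        self.city_means_ = None
        self.global_mean_ = None

    def fit(self, X, y):
        # Put y on X's own index so each target stays with its row
        y_s = pd.Series(np.asarray(y), index=X.index)
        self.city_means_ = y_s.groupby(X[self.key]).mean()
        self.global_mean_ = float(y_s.mean())
        return self

    def predict(self, X):
        # Cities not seen in fit fall back to the global training mean
        return X[self.key].map(self.city_means_).fillna(self.global_mean_).to_numpy()

# Toy data with a shuffled index, mimicking the output of train_test_split
X_toy = pd.DataFrame({'city_label': [0, 0, 1, 1]}, index=[7, 2, 9, 4])
y_toy = np.array([10.0, 12.0, 20.0, 22.0])

baseline = CityMeanBaseline().fit(X_toy, y_toy)
preds = baseline.predict(pd.DataFrame({'city_label': [0, 1, 5]}))
print(preds)
```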

ElasticNetCV model

In [14]:

model_el = ElasticNetCV(cv=5, random_state=RANDOM_SEED)
model_el.fit(X_train, y_train)

y_train_pred = model_el.predict(X_train)
y_test_pred = model_el.predict(X_test)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)

mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)

r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)


In [15]:
# metrics
print(f"Train MSE: {mse_train:.2f}")
print(f"Test MSE: {mse_test:.2f}")
print(f"Train MAE: {mae_train:.2f}")
print(f"Test MAE: {mae_test:.2f}")
print(f"Train R2: {r2_train:.2f}")
print(f"Test R2: {r2_test:.2f}")

Train MSE: 0.72
Test MSE: 0.74
Train MAE: 0.60
Test MAE: 0.61
Train R2: 0.25
Test R2: 0.25


MSE and MAE are almost identical on the training and test sets, so there is no sign of overfitting.
R^2 is low (0.25): the model explains only about a quarter of the variance in the target, which points to underfitting. The test mean absolute error is 0.61 in log units (a factor of about exp(0.61) ≈ 1.84 in price). The model is not suitable.
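
The elastic net penalty acts on the raw coefficients, so it is sensitive to feature scale, and the features here mix log values, 0/1 indicators, and label codes in the thousands. One thing that may be worth trying is standardizing the features inside a pipeline. The sketch below shows the mechanics on synthetic data only; the arrays `X_demo` and `y_demo` are invented for illustration, and whether scaling helps on the real dataset is not verified here.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
small = rng.normal(size=n)                 # feature on a unit scale
large = rng.normal(scale=1000.0, size=n)   # feature on a much larger scale
X_demo = np.column_stack([small, large])
y_demo = 2.0 * small + 0.001 * large + rng.normal(scale=0.1, size=n)

# The scaler is fitted inside the pipeline, so each CV fold is scaled
# using statistics from the data passed to fit only
scaled_model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(cv=5, random_state=42),
)
scaled_model.fit(X_demo, y_demo)
r2_demo = scaled_model.score(X_demo, y_demo)
print(f"R2 on synthetic data: {r2_demo:.2f}")
```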

LinearRegression model

In [16]:

model = LinearRegression(fit_intercept=False)

model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

mse_train = metrics.mean_squared_error(y_train, y_train_pred)
mse_test = metrics.mean_squared_error(y_test, y_test_pred)
mae_train = metrics.mean_absolute_error(y_train, y_train_pred)
mae_test = metrics.mean_absolute_error(y_test, y_test_pred)
r2_train = metrics.r2_score(y_train, y_train_pred)
r2_test = metrics.r2_score(y_test, y_test_pred)


In [17]:
# metrics
print(f"Train MSE: {mse_train:.2f}")
print(f"Test MSE: {mse_test:.2f}")
print(f"Train MAE: {mae_train:.2f}")
print(f"Test MAE: {mae_test:.2f}")
print(f"Train R2: {r2_train:.2f}")
print(f"Test R2: {r2_test:.2f}")

Train MSE: 0.53
Test MSE: 0.55
Train MAE: 0.53
Test MAE: 0.54
Train R2: 0.44
Test R2: 0.44


MSE and MAE are almost identical on the training and test sets, so there is no sign of overfitting.
R^2 is moderate (0.44): the model explains less than half of the variance in the target. The test mean absolute error is 0.54 in log units (a factor of about exp(0.54) ≈ 1.72 in price). The model is not suitable.

RandomForestRegressor model

In [18]:

rf_regressor = RandomForestRegressor(random_state=RANDOM_SEED)

rf_regressor.fit(X_train, y_train)

y_train_pred = rf_regressor.predict(X_train)
y_test_pred = rf_regressor.predict(X_test)

mse_train = mean_squared_error(y_train, y_train_pred)
mse_test = mean_squared_error(y_test, y_test_pred)
mae_train = mean_absolute_error(y_train, y_train_pred)
mae_test = mean_absolute_error(y_test, y_test_pred)
r2_train = r2_score(y_train, y_train_pred)
r2_test = r2_score(y_test, y_test_pred)

In [19]:
# metrics
print(f"Train MSE: {mse_train:.2f}")
print(f"Test MSE: {mse_test:.2f}")
print(f"Train MAE: {mae_train:.2f}")
print(f"Test MAE: {mae_test:.2f}")
print(f"Train R2: {r2_train:.2f}")
print(f"Test R2: {r2_test:.2f}")

Train MSE: 0.08
Test MSE: 0.18
Train MAE: 0.13
Test MAE: 0.24
Train R2: 0.91
Test R2: 0.82


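Train MSE and MAE (0.08 and 0.13) are noticeably lower than test MSE and MAE (0.18 and 0.24), which indicates some overfitting.
R^2 is high: 0.91 on the training set and 0.82 on the test set, so the model explains the relationship between the variables well and still generalizes acceptably. The test mean absolute error is 0.24 in log units (a factor of about exp(0.24) ≈ 1.27 in price). Overall, the model is acceptable.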
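
The gap between training and test scores comes from fully grown trees memorizing noise. A common way to narrow it is to constrain tree growth, for example with `min_samples_leaf`. The sketch below illustrates the effect on synthetic data; the data, the helper `train_test_r2`, and the parameter values are assumptions chosen for the demonstration, not settings tuned for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic noisy regression problem (illustration only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(600, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.5, size=600)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def train_test_r2(model):
    # Fit on the training part and return (train R2, test R2)
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

default_r2 = train_test_r2(
    RandomForestRegressor(n_estimators=50, random_state=42)
)
limited_r2 = train_test_r2(
    RandomForestRegressor(n_estimators=50, min_samples_leaf=20, random_state=42)
)

print("default:", default_r2)
print("min_samples_leaf=20:", limited_r2)
```

On the real data, such parameters would normally be chosen by cross-validation on the training set rather than fixed by hand.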

## Conclusion

Of the models considered, RandomForestRegressor performs best: it has the highest test R^2 (0.82) and the lowest test MAE (0.24).
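
The comparison can be collected into a single table. The sketch below uses only the test metrics printed above; the column `price_error_factor` is an added illustration that converts the log-scale MAE into an approximate multiplicative error in price via exp(MAE).

```python
import numpy as np
import pandas as pd

# Test-set metrics as reported in the cells above
results = pd.DataFrame(
    {
        'model': ['NaiveModel', 'ElasticNetCV', 'LinearRegression', 'RandomForestRegressor'],
        'test_mse': [0.98, 0.74, 0.55, 0.18],
        'test_mae': [0.69, 0.61, 0.54, 0.24],
        'test_r2': [-0.00, 0.25, 0.44, 0.82],
    }
).set_index('model')

# The target is log-transformed, so exp(MAE) is a typical multiplicative error
results['price_error_factor'] = np.exp(results['test_mae']).round(2)

results = results.sort_values('test_r2', ascending=False)
best_model = results.index[0]
print(results)
```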