# Real Estate Rental Market in Berlin. Part 2. Analyzing

I was inspired by original ideas and some useful approaches that were taken from [Dmitrii Eliuseev](https://towardsdatascience.com/housing-rental-market-in-germany-exploratory-data-analysis-with-python-3975428d07d2).

This notebook is an attempt to experiment with approaches that I found very useful and interesting; they originate in the TDS article 'Housing Rental Market in Germany: Exploratory Data Analysis with Python'.  
The scope and processing have been widened greatly in order to collect as much data as possible.

I will try to find some trends and insights in the data collected from https://www.immobilienscout24.de, one of the largest online residential rental aggregators in Germany.  

This is the second part of the data analysis.

The main stages of the forthcoming work are:  

* Analyze: analyzing the data and building up a simple regression model for predicting the prices
* Share: preparing some visualizations

Loading the environment.  
You need to uncomment some lines of code if these libraries are not installed on your system. 

In [2]:
import pandas as pd
import numpy as np

import plotly.express as px

import json
import re #regular expression


import folium
from geopy.geocoders import Nominatim
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Defining some variables to configure the process.

In [1]:
path_to_csv = "/Users/velo1/SynologyDrive/GIT_syno/data/immobilienscout24.de/"

pd.set_option('display.max_colwidth', 100) # to display full text in columns
pd.set_option('display.max_columns', None) # display all columns



|Instance|Used for storing|
|:---|:---|
|base_url|https://www.immobilienscout24.de|
|Berlin_housing_proccessed.csv|processed data|
|df|cleaned data|
|temp|temporary dataframes|
|X|processed train set|
|y (Series)|target labels|

## Ask

1. What are the most popular residential rental objects in Berlin?  
1. What are the main factors that define the rental price?  
1. Are there any trends and hidden patterns?
1. What are the main segments of that rental market?

### Loading processed data

In [168]:
df_r = pd.read_csv(path_to_csv + 'Berlin_housing_proccessed.csv', sep=';')
df = df_r.copy()   # assumption: the cells below that use `df` refer to this same processed data

## Analyze

First, we'll explore feature by feature and
then answer the questions.

### What areas of Berlin are the most popular for rental housing?

In [169]:
region_counts = df_r.groupby('region')[['address']].count().sort_values(by=['address'], ascending=False)
region_top = region_counts.head(12).to_dict()     # 12 regions with the most listings
region_minor = region_counts.tail(12).to_dict()   # 12 regions with the fewest listings
temp = df_r[df_r.region.isin([*region_top['address'].keys()])] # df with listings within the top-12 regions
fig = px.histogram(temp.region, title='Top Berlin districts by representation',  text_auto=True, height= 600)
fig.update_layout(xaxis_title="", yaxis_title="Number of listings")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_traces(showlegend=False)
fig.update_xaxes(tickangle=60)
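
The counting step above can also be sketched with `value_counts`, which returns the regions already sorted by frequency. This is a minimal illustration on a toy frame: the `listings` data and region names are made up, and unlike `groupby(...).count()` on `address` it counts rows rather than non-null addresses.

```python
import pandas as pd

# Toy stand-in for df_r; the rows are illustrative only
listings = pd.DataFrame({
    "region": ["Mitte", "Mitte", "Mitte", "Spandau", "Spandau", "Köpenick"],
    "address": ["a", "b", "c", "d", "e", "f"],
})

counts = listings["region"].value_counts()      # sorted by frequency, descending
top_regions = counts.head(2).index.tolist()     # most represented regions
minor_regions = counts.tail(2).index.tolist()   # least represented regions
top_listings = listings[listings["region"].isin(top_regions)]
```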

This data is not easy to digest for a foreigner.  
Let's plot the Berlin regions with the highest representation on a map.

In [170]:
def plotDot(row, color, from_df=True, radius=10, weight=10, this_map=None):
    # `row` is a dataframe row when from_df is True, otherwise a plain address string
    if from_df:
        query = ', '.join(str(part) for part in [row.address, row.region, row.city, row.zip])
    else:
        query = row
    loc = geolocator.geocode(query)   # returns None when the address is not found
    if loc:
        folium.CircleMarker(
            location=[loc.latitude, loc.longitude],
            radius=radius,
            weight=weight,
            color=color,
            opacity=0.6,
            popup=('Agency:' + row.publisher if from_df else 'Region:' + row)
        ).add_to(this_map)

In [171]:
geolocator = Nominatim(user_agent='geopy/2.2.0') 
my_map = folium.Map(prefer_canvas=True)
# folium.Marker([lat, lon], popup="Googleplex").add_to(this_map)
for k in region_top['address'].keys():      # regions with the most listings
    plotDot(k + ', Berlin', color='#FF00AA', from_df=False, radius=20, weight=10, this_map=my_map)

for k in region_minor['address'].keys():    # regions with the fewest listings
    plotDot(k + ', Berlin', color='#02bfe7', from_df=False, radius=20, weight=5, this_map=my_map)
# df.iloc[:3].apply(plotDot, color='#FF00AA', axis=1) # rgba(255, 0, 170, 0.4)

my_map.fit_bounds(my_map.get_bounds())

my_map

Looking at the map, you can get more information for comparison.  
It is clear that the majority of rental objects are offered in the central part of the city. 

### How tall are buildings in Berlin?

In [None]:
temp = df.floors_in_building.value_counts()   # get the number of floors in the building frequencies

fig = px.bar(temp, x=temp.index, y=temp.values,title='Number of floors in the building')
fig.update_layout(xaxis_title="", yaxis_title="Count of properties for rent")
fig.update_xaxes(type='category')
fig.update_xaxes(categoryorder='category ascending')
# fig.update_layout(xaxis={'categoryorder':'total ascending'})
fig.show()

The category (x axis) order is alphabetical, which is not intuitive here because the building heights appear out of order.  
We can fix this by changing the index. 

Get 'floors_in_building' distribution values.

In [None]:
temp.sort_index(key=lambda x:('000'+x).str.replace('?','000', regex=False).str[-2:], ascending=True, inplace=True)
# we prepend 000 to the string so that the last two characters are zero-padded and sort numerically
# we also replace ? with 000 (as a literal, not a regex) to put unspecified floors first
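
An alternative to the string padding trick is a sort key that converts the labels to numbers. The sketch below assumes the index holds floor counts as strings with `?` for unspecified values; `floor_sort_key` and the sample counts are hypothetical.

```python
import pandas as pd

def floor_sort_key(labels: pd.Index) -> pd.Index:
    """Map floor labels to integers; '?' (unspecified) becomes -1 and sorts first."""
    numeric = pd.to_numeric(pd.Series(labels), errors="coerce")  # '?' -> NaN
    return pd.Index(numeric.fillna(-1).astype(int).to_numpy())

# Illustrative frequencies indexed by the number of floors
counts = pd.Series([5, 40, 7, 12, 3], index=["10", "5", "?", "3", "26"])
ordered = counts.sort_index(key=floor_sort_key)
```

The key function receives the whole index and must return an index of the same length, which is why the conversion is vectorized rather than applied label by label.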

In [None]:
fig = px.bar(temp, y=temp.index, x=temp.values,title='Number of floors in the building', color = temp.values,
             height= 800, orientation='h', text_auto= True, color_continuous_scale= ['LightBlue','Blue','lightgrey']) 
fig.update_layout(xaxis_title="Count of properties for rent (log scale)", yaxis_title="Floors in the building")
fig.update_layout(xaxis_type = 'log') # log scale
fig.update_yaxes(type='category')     # sort the y axis by the number of floors
fig.update_coloraxes(showscale=False) # hide the color scale
fig.show()

This plot is much easier to understand.  
Unspecified values are placed at the bottom, and the y axis corresponds to the building height.  
Well done!

The tallest building in the database has 26 storeys.  

We have 8 offers on the 18th and 13th floors and one offering on the 16th floor.  

The median among specified values is the 5th floor.  
The overall median value, however, is the ground floor.  
Partly this may be because most owners do not indicate the floor number.  
There is nothing suspicious here.  
These data ranges are normal.
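
How unspecified values are treated can shift a median considerably. The following sketch, on made-up values, compares the median over specified values only with the median when unspecified entries are counted as 0 (ground floor). Whether the notebook's data encodes missing floors this way is an assumption.

```python
import pandas as pd

floors = pd.Series(["5", "?", "3", "?", "26", "7", "?"])   # illustrative values

specified = pd.to_numeric(floors, errors="coerce")   # '?' becomes NaN
median_specified = specified.median()                 # NaN is skipped by default
median_all = specified.fillna(0).median()             # unspecified counted as floor 0
share_unspecified = specified.isna().mean()           # fraction of missing entries
```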

### What is the floor number distribution?

In [None]:
fig = px.histogram(df, x='floor', title='Floor', color_discrete_sequence=['#1f77b4'], opacity=.7, text_auto= True)
fig.update_layout(xaxis_title="", yaxis_title="Count of properties for rent")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

### Does the presence of a garage increase the price?

#### Garage

In [None]:
fig = px.histogram(df_r[['garage']], x = df_r['garage'], title='Distribution of ads by garage availability', color= 'garage',  
                   text_auto=True, height= 600)
fig.update_layout(xaxis_title="", yaxis_title="Count")
fig.update_layout(xaxis={'categoryorder':'total descending'})
# fig.update_yaxes(type="log")

In [None]:
print(f'Only {df_r[df_r.garage != "No garage"].shape[0]/df_r.shape[0]:.2%} of the properties have a garage or a parking spot')

Only 9.26% of the properties have a garage or a parking spot


Most of the properties do not mention garage availability.

In [None]:
garage_bins = df_r.garage.apply(lambda x: 'Yes' if x != 'No garage' else 'No') # create a Series with binary values
garage_bins.rename("garage_presence", inplace=True)                                # rename the column
garage_bins.value_counts()

No     3694
Yes     377
Name: garage_presence, dtype: int64

Does garage availability change the picture on a scatter plot?

In [None]:
fig = px.scatter(pd.concat([df_r, garage_bins],axis=1), x="cold_price", y="property_area", 
                 color= 'garage_presence', height= 800, facet_col = 'garage_presence')   # ,  trendline="ols", trendline_options=dict(log_x=True)
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
# fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2))   # change marker size and line width
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

Interesting results.  
Do you, like me, notice "clustering" only among the "no garage" ads?  
We still have no idea what it means.

In [None]:
# area_bins = pd.qcut(df_r.property_area, 2)
garage_bins = df_r.garage.apply(lambda x: 'No' if x == 'No garage' else 'Yes')
garage_bins.rename("garage_presence", inplace=True)

df_r.pivot_table('cold_price', [garage_bins], aggfunc=['mean'])\
  .style.bar(align='mid', color='coral').format(precision=1, thousands=",")

|garage_presence|mean cold_price|
|:---|---:|
|No|1466.9|
|Yes|2093.2|


Ads mentioning the presence of a garage are priced about 600 euros higher on average.  
However, if you choose to use the garage, you will be charged an additional cost.
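
A mean difference between two groups of very different sizes is easier to judge next to the median and the group counts. Here is a small sketch of such a summary; the `ads` frame and the garage labels other than `No garage` are invented for illustration.

```python
import pandas as pd

ads = pd.DataFrame({
    "garage": ["No garage", "No garage", "Underground", "No garage", "Outdoor"],
    "cold_price": [1000.0, 1400.0, 2200.0, 1200.0, 1800.0],
})

# Binary flag: anything other than 'No garage' counts as a garage
has_garage = ads["garage"].ne("No garage").map({True: "Yes", False: "No"})
summary = ads.groupby(has_garage)["cold_price"].agg(["mean", "median", "count"])
gap = summary.loc["Yes", "mean"] - summary.loc["No", "mean"]
```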

#### energy_eff

In [None]:
fig = px.histogram(df_r[['energy_eff']].sort_values(by='energy_eff'), x = 'energy_eff', #color = 'energy_eff',
                   title='Distribution of offerings by energy efficiency class', text_auto=True)
fig.update_layout(xaxis_title="")
fig.update_layout(xaxis={'categoryorder':'total descending'})

In [None]:
print(f'Only {df_r[df_r.energy_eff != "Unknown"].shape[0]/df_r.shape[0]:.2%} of the properties have a specified energy efficiency class')

Only 10.49% of the properties have a specified energy efficiency class


Most of the properties do not have a designated energy efficiency rating.  

In [None]:
eff_piv = df_r.pivot_table('rel_heat_costs', ['energy_eff'], aggfunc=['mean','count'])\
                .sort_values(by=('mean', 'rel_heat_costs'), ascending= True)                
eff_piv.columns = ['Relative costs (EUR/m2), mean', 'Number of offerings']  # rename columns
eff_piv.reset_index(inplace=True) # reset index to reduce the number of levels in the column names
eff_piv.style.bar(align='left', color='coral').format(precision=2, thousands=",") 

| |energy_eff|Relative costs (EUR/m2), mean|Number of offerings|
|:---|:---|---:|---:|
|0|Unknown|0.49|727|
|1|A+|0.71|13|
|2|H|0.79|2|
|3|A|0.82|28|
|4|B|0.95|107|
|5|C|1.17|127|
|6|E|1.17|31|
|7|D|1.27|65|
|8|F|1.28|19|
|9|G|1.91|3|


In [None]:
fig = px.bar(eff_piv, x='energy_eff', y='Relative costs (EUR/m2), mean', 
             color='Relative costs (EUR/m2), mean', hover_data=['energy_eff'],
             color_continuous_scale=['Green','Blue','Red'], text_auto='.3',
             title='Relative costs (EUR/m2), mean', height= 600, opacity= .6)
fig.update_layout(xaxis_title="Energy efficiency class", yaxis_title="Relative costs (EUR/m2), mean")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()


Here we can notice that the stated energy efficiency class correlates with relative costs.  
But the `H` class, with its 2 listings, confuses the picture a bit.

However, listings with a specified energy efficiency class are in the minority.  
Moreover, actual heat costs are lower among listings with energy efficiency class 'Unknown'.  

Usually the costs include heating and possibly some other extra services.

A tip from here: `Do not pay too much attention to the indicated energy efficiency class`.
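
To read the relationship between class and cost more easily, the classes can be put in their natural order (A+ best, H worst) and the very small groups set aside. The sketch below reuses the figures from the pivot table above; the minimum group size of 10 is an arbitrary assumption, and the column names `mean_cost` and `n` are shortened for the example.

```python
import pandas as pd

class_order = ["A+", "A", "B", "C", "D", "E", "F", "G", "H", "Unknown"]
eff = pd.DataFrame({
    "energy_eff": ["Unknown", "A+", "H", "A", "B", "C", "E", "D", "F", "G"],
    "mean_cost": [0.49, 0.71, 0.79, 0.82, 0.95, 1.17, 1.17, 1.27, 1.28, 1.91],
    "n": [727, 13, 2, 28, 107, 127, 31, 65, 19, 3],
})

# An ordered categorical makes sort_values follow the class scale, not the alphabet
eff["energy_eff"] = pd.Categorical(eff["energy_eff"], categories=class_order, ordered=True)
by_class = eff.sort_values("energy_eff").reset_index(drop=True)
reliable = by_class[by_class["n"] >= 10]   # drop classes with too few listings
```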

In [None]:
df_r[df_r.energy_eff == 'H']

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,heat_costs_calc,add_costs_calc,rel_heat_costs,cold_price_rel,costs,deposit_calc,year_group,criteria_clean
536,141381114,Lots of space to live ..,01.05.2023,85.4,4.0,,,Balcony/ terrace balcony/ terrace fitted kitchen built-in kitchen guest toilet guest toilet,No garage,0.0,,1997.0,H,145,135,814.08,1094.08,3 mal Netto Miete (kalt),Ground floor apartment,Adler Group,Frau M. Giese,Berlin,"Salvador-Allende-Str. 76 M -,",The residential complex is located in the beautiful water -rich Köpenick directly on the Müggels...,Köpenick,12559,https://www.immobilienscout24.de/expose/141381114,135.0,145.0,1.581259,9.535344,279.99994,2442.240051,Late XX cent,balcony terrace fitted kitchen guest toilet
2452,136466424,Furnished 3 rooms apartment in Charlottenburg (Berlin),,78.0,3.0,2.0,2.0,Fitted kitchen fitted kitchen Guest toilet guest toilet,No garage,4.0,,,H,395,included in additional costs,2850.0,3245.0,1000 + Admin. Fee,Flat,Ukio Germany Gmbh,Frau Julia Morgan,Berlin,"Krumme Str. 54,",Spectacular apartment in Krumme Strasse with 2 bedrooms and 2 bathrooms. The kitchen is fully eq...,Charlottenburg,10627,https://www.immobilienscout24.de/expose/136466424,0.0,395.0,0.0,36.53846,395.0,1000.0,2000-2014,fitted kitchen guest toilet


In [None]:
temp = df_r[~df_r.warm_price.isna()]

s = temp.isna().sum()         # count missing values in each column among rows where warm_price is present
cols = s[ s == 0 ].index.to_list()      # list of columns with no missing values
[f'{i:>20}{s[i]:8}' for i in s.index if s[i] > 0] # list of columns with missing values

['        logging_date    2319',
 '        num_bedrooms    2539',
 '       num_bathrooms    2416',
 '            criteria       3',
 '               floor    1520',
 '  floors_in_building    2498',
 '         constr_year    2463',
 '             address     989',
 '     heat_costs_calc    2337',
 '      add_costs_calc    1545',
 '      rel_heat_costs    2337',
 '      criteria_clean    1492']

### Let's predict missing warm prices

In [None]:
ser = df['property_type'].value_counts()

In [None]:
check_na(temp)

| |dtype|nans|nans%|
|:---|:---|---:|---:|
|logging_date|object|2319|67.1|
|criteria_clean|object|1492|43.1|
|address|object|989|28.6|
|criteria|object|3|0.1|
|title|object|0|0.0|
|garage|object|0|0.0|
|energy_eff|object|0|0.0|
|add_costs|object|0|0.0|
|heat_costs|object|0|0.0|
|deposit|object|0|0.0|
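
The `check_na` helper is not defined in this part of the notebook. Judging by the columns of its output, it could look roughly like the sketch below. This is a hypothetical reconstruction, not the original function, and the original may filter or sort differently.

```python
import pandas as pd

def check_na(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical reconstruction: dtype, NaN count and NaN share (%) per column."""
    report = pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "nans": frame.isna().sum(),
        "nans%": (frame.isna().mean() * 100).round(1),
    })
    return report.sort_values("nans", ascending=False)

# Small demo frame with invented values
demo = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "address": ["x", None, None, "y"],
    "floor": [1.0, None, 2.0, 3.0],
})
report = check_na(demo)
```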


In [None]:
temp[temp[cols_lr].isna().any(axis=1)]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,heat_costs_calc,add_costs_calc,rel_heat_costs,cold_price_rel,costs,deposit_calc,year_group,criteria_clean



In [None]:
# cols_lr = ['property_area', 'num_rooms', 'heat_costs_calc', 'cold_price', 'cold_price_rel','costs', 'deposit_calc']
cols_lr = ['property_area', 'cold_price']

In [None]:
cols_lr = ['property_area', 'cold_price',  'costs']
train = temp.dropna(subset=cols_lr + ['warm_price'])   # LinearRegression cannot handle NaN values
lr = LinearRegression()
lr.fit(train[cols_lr], train.warm_price)
print(f'Intercept: {lr.intercept_:.2f}')
print(f'Coefficients: {lr.coef_}')
lr.score(train[cols_lr], train.warm_price)             # R^2 on the training rows

In [None]:
warm_price_pred = lr.predict(train[cols_lr])

Fit the model on the rows where warm_price > 0,  
then predict the missing warm prices.
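
The two-step plan above can be sketched on synthetic data as follows. The pricing rule (warm price = cold price + 2.5 EUR per m2) is invented purely so that the result is checkable; it is not a finding about the Berlin data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic listings following an invented linear rule
data = pd.DataFrame({
    "property_area": [30.0, 50.0, 70.0, 90.0, 60.0, 40.0],
    "cold_price": [500.0, 800.0, 1100.0, 1500.0, 900.0, 700.0],
})
data["warm_price"] = data["cold_price"] + 2.5 * data["property_area"]
data.loc[[4, 5], "warm_price"] = np.nan   # pretend two listings lack a warm price

features = ["property_area", "cold_price"]
complete = data[features].notna().all(axis=1)      # rows usable as model input
known = complete & data["warm_price"].notna()      # rows to fit on
missing = complete & data["warm_price"].isna()     # rows to fill in

model = LinearRegression().fit(data.loc[known, features], data.loc[known, "warm_price"])
data.loc[missing, "warm_price"] = model.predict(data.loc[missing, features])
```

Keeping the boolean masks on the original index means the predictions land in the right rows without any re-indexing.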

#### property_type

In [None]:
fig = px.histogram(df_r[['property_type']].sort_values(by='property_type'), title='property_type',  
                   text_auto=True, x = 'property_type', color = 'property_type', height= 600)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={'categoryorder':'total descending'})

Among the types that were specified, Flats are the most common offering.

In [None]:
property_bins = df_r.property_type.apply(lambda x: x if x == 'Unknown' else 'specified') # create a Series with binary values
property_bins.rename("property_bins", inplace=True)                                # rename the column
property_bins.value_counts()


In [None]:
fig = px.box(df_r, x = df_r['property_type'], y = df_r['cold_price'], height= 800,
             notched=True,  title='Prices for different property types', color='property_type')
# fig.update_yaxes(matches= None)
fig.update_layout(xaxis={'categoryorder':'total ascending'})
fig.update_layout(xaxis_title="", yaxis_title="Cold price (EUR)")
fig.show()

The most expensive are Penthouses and Maisonettes (small houses), and the cheapest are Basements. 

In [None]:
fig = px.scatter(pd.concat([df_r, property_bins],axis=1), x="cold_price", y="property_area", 
                 facet_col='property_bins', color= 'property_bins')
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
# fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2))   # change marker size and line width
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

As we noticed earlier in the garage section,  
listings with an Unknown property type (actually NaNs) form a distribution with 2 clusters.

#### Bedrooms and bathrooms

In [None]:
fig = px.histogram(df_r[['num_bedrooms']].sort_values(by='num_bedrooms'), title='Number of bedrooms',  
                   text_auto=True, color_discrete_sequence=['green'], opacity= .6)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
fig = px.histogram(df_r[['num_bathrooms']].sort_values(by='num_bathrooms'), title='Number of bathrooms',  
                   text_auto=True, color_discrete_sequence=['blue'], opacity= .4)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
sp_rooms_bins = df_r.apply(lambda x: 'No' if (x['num_bathrooms'] == 0) and (x['num_bedrooms'] == 0) else 'specified', axis=1) # create a Series with binary values
sp_rooms_bins.rename("sp_rooms_bins", inplace=True)                                # rename the column
sp_rooms_bins.value_counts()


In [None]:
fig = px.scatter(pd.concat([df_r, sp_rooms_bins],axis=1), x="cold_price", y="property_area", 
                 facet_col='sp_rooms_bins', color= 'sp_rooms_bins')
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_traces(marker_size=4 , line=dict(width=2)) 
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

As with garage and property type, we can notice a definite segmentation among listings without a specified number of bedrooms and bathrooms.

And finally, let's combine all features that lead to clustering:

In [None]:
cluster_bin = df_r.apply(lambda x: 'clusterized' if (x['num_bathrooms'] == 0) and (x['num_bedrooms'] == 0) 
                         and (x['garage'] == 'No garage') and  (x['property_type'] == 'Unknown') 
                         and (x['energy_eff'] == 'Unknown') 
                         else 'normal', axis=1) # create a Serie with binary values
cluster_bin.rename("cluster_bin", inplace=True)                                # rename the column
cluster_bin.value_counts()

In [None]:
fig = px.scatter(pd.concat([df_r, cluster_bin],axis=1), x="cold_price", y="property_area", 
                 facet_col='cluster_bin', color= 'cluster_bin')
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_traces(marker_size=3 , line=dict(width=2)) 
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

Listings
* without a garage
* with no specification of property type, energy efficiency class, number of bedrooms and bathrooms  

form 2 visible clusters.
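
The same flag can be built without a row-wise `apply` by combining boolean masks, which is usually faster on larger frames. A minimal sketch on an invented four-row frame; the column names follow the notebook, the values do not come from the data.

```python
import pandas as pd

ads = pd.DataFrame({
    "num_bathrooms": [0, 1, 0, 0],
    "num_bedrooms": [0, 2, 0, 0],
    "garage": ["No garage", "No garage", "No garage", "Outdoor"],
    "property_type": ["Unknown", "Flat", "Unknown", "Unknown"],
    "energy_eff": ["Unknown", "B", "Unknown", "Unknown"],
})

# A listing is flagged only if every one of the five fields is unspecified
sparse = (
    ads["num_bathrooms"].eq(0)
    & ads["num_bedrooms"].eq(0)
    & ads["garage"].eq("No garage")
    & ads["property_type"].eq("Unknown")
    & ads["energy_eff"].eq("Unknown")
)
cluster_bin = sparse.map({True: "clusterized", False: "normal"}).rename("cluster_bin")
```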

Later we'll try to use geo data to plot the listings on a map.

In [None]:
fig = px.scatter(temp, x="cold_price", y="property_area",
                 color="publisher", hover_name="publisher")
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_traces(marker_size=4 , line=dict(width=2))
fig.update_yaxes(range=[0, 120])
fig.update_xaxes(range=[0, 5000])
fig.show()

#### publisher

In [None]:
fig = px.histogram(df[['publisher']].sort_values(by='publisher'), title='publisher',  text_auto=True, height= 800)
fig.update_layout(xaxis_title="", yaxis_title="Count (log scale)")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_yaxes(type="log")
fig.update_xaxes(tickangle=60)

Let's print the top-15 agencies (all private owners are united in one group).

In [None]:
def custom_aggregation(data):
    '''
    Calculate per-publisher statistics: mean area, number of listings,
    mean cold price, total area (volume) and share of the total area in %
    '''
    d = {} # create an empty dictionary

    d['mean_sqm'] = data['property_area'].mean()           
    d['count'] = round(data['property_area'].count())
    d['mean_price'] =  data['cold_price'].mean()     
    d['volume']= d['count']*d['mean_sqm']
    d['share'] = d['volume'] /(df['property_area'].sum())*100
    return pd.Series(d)

grouped = df.groupby(['publisher'])[['property_area', 'cold_price']].apply(custom_aggregation)
grouped.sort_values(by='volume', ascending= False).head(15).\
    style.bar(align='mid', color='coral').format(precision=1, thousands=",")
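
The same per-publisher statistics can be sketched with named aggregation, which avoids the Python-level function call per group. The publishers and numbers below are invented; `volume` is taken as the summed area, which equals `count * mean_sqm` when areas are not missing.

```python
import pandas as pd

ads = pd.DataFrame({
    "publisher": ["Agency A", "Agency A", "Agency B", "Private", "Private", "Private"],
    "property_area": [50.0, 70.0, 100.0, 40.0, 60.0, 80.0],
    "cold_price": [900.0, 1300.0, 2500.0, 600.0, 800.0, 1000.0],
})

stats = ads.groupby("publisher").agg(
    mean_sqm=("property_area", "mean"),
    count=("property_area", "count"),
    mean_price=("cold_price", "mean"),
    volume=("property_area", "sum"),
)
stats["share"] = stats["volume"] / ads["property_area"].sum() * 100   # % of total area
stats = stats.sort_values("volume", ascending=False)
```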

In [None]:
# df.groupby(['publisher']).agg(mean_property_area=("property_area", 'mean'),
#                                    Count=('property_area','count'),
#                                    mean_price= ("cold_price",'mean'),
#                                    volume = ("cold_price",lambda x: x.sum())).sort_values(by='volume', ascending= False)\
#                                     .style.bar(align='mid', color='coral').format(precision=0, thousands=",")

### What are the most popular residential rental objects in Berlin? 

In [None]:
fig = px.scatter(df, x="cold_price", y="property_area", color= 'property_type',
                 height= 800,  trendline="ols", trendline_scope="overall")   # , trendline_options=dict(log_x=True)
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2))   # change marker size and line width
fig.show()

# results = px.get_trendline_results(fig)
# print(results)
# results.px_fit_results.iloc[0].summary()
# results.query("property_type == 'Flat' or property_type == 'Unknown'").px_fit_results.iloc[0].summary()

We observe an interesting result here.  
Two big clusters are formed: 
* upper left, centered at about 600 EUR for 60 sqm
* lower right, centered at about 1800 EUR for 50 sqm.

These are two segments of the market.
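
To compare the two cluster centres on a common footing, it helps to convert them to a price per square metre. The figures are the approximate centres read off the plot above.

```python
# (cold price in EUR, area in sqm) for the two approximate cluster centres
segments = {"upper left": (600, 60), "lower right": (1800, 50)}

price_per_sqm = {name: price / area for name, (price, area) in segments.items()}
ratio = price_per_sqm["lower right"] / price_per_sqm["upper left"]
```

That is roughly 10 EUR/sqm against 36 EUR/sqm, so the second segment is about 3.6 times more expensive per square metre.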

In [None]:
fig = px.scatter(df, x="property_area", y="costs", color= 'property_type',
                 height= 800,  trendline="ols", trendline_scope="overall" ) #, trendline_options=dict(log_x=True) )
fig.update_layout(xaxis_title="Property area (m2)", yaxis_title="Costs (EUR)")
fig.update_layout(xaxis_type = 'log')#, yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2)) 

In [None]:
# define a function to fill warm price on the basis of cold price and energy efficiency
# def fill_warm_price(xdf, cold_price, energy_eff, warm_price, property_type, property_area):

xdf = df_r.copy()            # make a copy of the dataframe
xdf['costs'] = xdf.warm_price - xdf.cold_price # calculate costs

In [None]:
xdf[xdf.costs < 50] # inspect listings with suspiciously low (below 50 EUR) or negative costs

In [None]:
px.histogram(xdf,  y='costs', color='property_type', title='Distribution of costs (EUR)')

In [None]:

model = LinearRegression()  # define a linear regression model

X = xdf[xdf['warm_price'].notna()][['cold_price', 'property_area']] # select only rows with warm price not null
y = xdf[xdf['warm_price'].notna()]['warm_price']                     # select only rows with warm price not null

# X = pd.get_dummies(X, columns=[ 'energy_eff'], drop_first=True) # convert categorical columns to dummy variables

model.fit(X, y)
ind = X.index
# return X, _
# # xdf.loc[ind, warm_price] = xdf.loc[ind, cold_price] * (1 + xdf.loc[ind, energy_eff])
print(model.score(X,y), len(ind))
# return model.predict(X[[cold_price, energy_eff, property_type, property_area]])

#exclude columns

In [None]:
# temp_df = model.predict(pd.get_dummies(xdf[['cold_price', 'energy_eff',  'property_area']], columns=[ 'energy_eff'], drop_first=True))
temp_df

In [None]:
temp_df = model.predict(xdf[['cold_price', 'property_area']])
temp_df

In [None]:
# check_na(df)

In [None]:
temp_df = pd.DataFrame(temp_df, columns=['warm_price2'], index=xdf.index)  # keep the original index so a later concat aligns rows
temp_df.head()

In [None]:
# temp['diff'] = (temp.warm_price - temp.cold_price) #/ df.property_area

In [None]:
temp_df.describe()

In [None]:
temp_df.shape, df.shape

In [None]:
t = pd.concat([df, temp_df], axis= 1, join='inner')
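
One thing to watch with `pd.concat(..., axis=1)` is that rows are matched by index label, not by position. A frame built from a raw prediction array gets a fresh 0..n-1 index, so an inner join can silently drop rows if the original frame is indexed differently. A small sketch with invented numbers:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"cold_price": [500.0, 800.0, 1100.0]}, index=[10, 11, 12])
predicted = np.array([600.0, 950.0, 1300.0])   # e.g. the output of model.predict

# Default RangeIndex (0, 1, 2) shares no labels with frame's index (10, 11, 12)
misaligned = pd.concat(
    [frame, pd.DataFrame(predicted, columns=["warm_price2"])],
    axis=1, join="inner",
)

# Reusing the original index keeps every row
aligned = pd.concat(
    [frame, pd.DataFrame(predicted, columns=["warm_price2"], index=frame.index)],
    axis=1, join="inner",
)
```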

In [None]:
t['diff'] = (t.warm_price2 - t.cold_price) #/ df.property_area

In [None]:
pd.set_option('display.max_columns', None) # display all columns
t[(t['diff'] < 0) & (t.warm_price.isna())]

In [None]:
df[df.cold_price.notna() & df.warm_price.notna()]['energy_eff'].unique()

In [None]:
df['add_costs'] = df.warm_price - df.cold_price

In [None]:
check_na(df)