# Real Estate Rental Market in Berlin. Part 2: Analyzing
![Pixabay](https://cdn.pixabay.com/photo/2019/10/25/18/20/berlin-4577624_1280.jpg)

I was inspired by original ideas and some useful approaches that were taken from [Dmitrii Eliuseev](https://towardsdatascience.com/housing-rental-market-in-germany-exploratory-data-analysis-with-python-3975428d07d2).

This notebook is an attempt to experiment with approaches that I found very useful and interesting; they originate in the TDS article 'Housing Rental Market in Germany: Exploratory Data Analysis with Python'.  
The scope and the processing have been widened considerably in order to collect as much data as possible. I also managed to answer the questions that were left unanswered in the original article.

Let's try to find some trends and insights in the data collected from https://www.immobilienscout24.de, one of the largest online residential rental aggregators in Germany.  

This is the second part of the data analysis.  
Part 1 is about parsing, cleaning and processing data.

The main stages of the forthcoming work are:  

* Analyze: analyzing the data and building a simple regression model for predicting prices
* Share: preparing some visualizations

Loading the environment.  
You need to uncomment some lines of code if these libraries are not installed on your system. 

In [1]:
import pandas as pd
import numpy as np

import plotly.express as px

import re

import folium
from geopy.geocoders import Nominatim
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest, RandomForestRegressor, GradientBoostingRegressor
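
If you are not sure which of these libraries are installed, a quick check like the one below may help. It is only a sketch: the `REQUIRED` list and the `missing_packages` helper are names made up for this illustration and are not part of the original notebook.

```python
import importlib.util

# Import names used in this notebook. The pip package name can differ,
# e.g. the import name "sklearn" is installed as "scikit-learn".
REQUIRED = ["pandas", "numpy", "plotly", "folium", "geopy", "nltk", "sklearn"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can then be installed with `pip` before running the import cell.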

Defining some variables to configure the process.

In [2]:
path_to_csv = "/Users/velo1/SynologyDrive/GIT_syno/data/immobilienscout24.de/"

pd.set_option("display.max_colwidth", 100)  # to display full text in columns
pd.set_option("display.max_columns", None)  # display all columns

colab = False



|instance| used for storing:|
|:---|:---|
|base_url |https://www.immobilienscout24.de|
|||
|Berlin_housing_proccessed.csv|processed data|
|||
|df, dff |cleaned and filtered data|
|temp |temporary dataframes|
| X |  Train set|
|y (Series) | target labels|

## Ask

1. What does a typical property for rent in Berlin look like?
1. What are the main segments of that rental market?
1. What are the most popular residential rental objects in Berlin?  
1. What are the main factors that define the rental price?  
1. Are there any trends and hidden patterns?


### Loading processed data

In [3]:
if colab: 
    file_link = 'https://drive.google.com/uc?id=1gEJyij2XSHTVMMuJC7o-20c61trYwYiz'
else:
    file_link = path_to_csv + "Berlin_housing_proccessed.csv"

df = pd.read_csv(file_link, sep=";")  

df.head(3)

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,cold_price_rel,pr_diff,heat_costs_calc,add_costs_calc,add_costs_rel,year_group,deposit_calc,criteria_clean
0,141131393,Nassauische Straße!Bright 6-room alcohol apartment with balcony on the 1st floor,immediately or after agreement,220.5,7.0,3,?,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal compound kitchen...,No garage,1,5,1900.0,Unknown,800,included in additional costs,3500.0,4300.0,3 Nettokaltmieten,Flat,Kupsch Wohnimmobilien GmbH,Frau Sabine Woide Immobilien,Berlin,,"Berlin-Wilmersdorf Wohnquartier Güntzelkiez (Trautenaustraße, Hohenzollernplatz, Nassauische Str...",Wilmersdorf,10717,https://www.immobilienscout24.de/expose/141131393,15.873015,800.0,0.0,800.0,3.628118,Historic,10500.0,balcony terrace basement passenger elevator personal kitchen fitted guest toilet
1,141131071,Exchange apartment: beautiful 2-room in Gräfekiez against 3-4 room (Kreuzb/Neuk),,60.0,2.0,?,?,Fitted kitchen fitted kitchen,No garage,3,?,,Unknown,170,not specified,410.0,580.0,0,Unknown,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,Quiet and beautiful apartment in the Gräfekiez.Ideal for couples because one of the two rooms is...,Kreuzberg,10967,https://www.immobilienscout24.de/expose/141131071,6.833334,170.0,,170.0,2.833333,2000-2014,0.0,fitted kitchen
2,141159056,"Exchange apartment: beautiful 2-room WHG in PB, 3-4 Zi apartment in PB, Wed, KR, FH wanted",,54.0,2.0,?,?,Basement basement,No garage,1,?,,Unknown,127,not specified,456.0,583.0,0,Flat,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,"Hello, our small family (2 adults and 1 child) lives in a beautiful 2-room apartment in the Bötz...",Prenzlauer Berg,10407,https://www.immobilienscout24.de/expose/141159056,8.444445,127.0,,127.0,2.351852,2000-2014,0.0,basement


In [4]:
def check_na(df, sort="dtype"):
    """
    Summarize missing values per column of a dataframe.
    df - dataframe
    sort - if truthy, sort by dtype first and then by nans%; otherwise sort by nans% only
    """
    sort = ["dtype", "nans%"] if sort else ["nans%"]
    dict_ = {}
    for col in df.columns:
        dict_[col] = {
            "dtype": str(df[col].dtype),  # stored as a string so that sorting by dtype is well defined
            "nans": df[col].isna().sum(),
            "nans%": df[col].isna().sum() / df.shape[0] * 100,
        }
    return (
        pd.DataFrame(dict_)
        .T.sort_values(by=sort, ascending=False)
        .style.bar(subset=["nans%"], color="#faebd7")
        .format(precision=1, thousands=",")
    )


## Analyze

First, we'll explore feature by feature and
then answer the questions.

We'll start with simple descriptive questions and answers.  
And the main question as always:

### What about prices?  

Well, it's time to answer the most interesting question.  
Initially, I put this question in the middle.  
But the answer to this question radically changes all subsequent research.  
So we need to answer it first.  

Let's dive in!

In [5]:
print(f"We have {df.shape[0]} rows and {df.shape[1]} columns in our processed dataset")


We have 4071 rows and 35 columns in our processed dataset


First, we use the standard pandas `describe()` method.

In [6]:
df[~df.warm_price.isna()][["warm_price", "cold_price"]].describe().T.style.format(
    precision=1, thousands=","
)


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
warm_price,3458.0,1823.0,1284.9,250.0,1000.0,1600.0,2290.0,19850.0
cold_price,3458.0,1659.3,1219.8,180.0,827.2,1500.0,2100.0,17000.0


The median warm price carries a premium of 100 EUR over the median cold price.

Let's take a closer look at the distribution of relative prices.

In [7]:
fig = px.box(
    df[["cold_price_rel"]],
    x="cold_price_rel",
    notched=True,
    title="Cold RELATIVE prices <br><sup>€ for sq.m per month</sup>",
    color_discrete_sequence=["#808080"],
)
fig.update_layout(xaxis_title="€ for sq.m per month", yaxis_title="Value range")


|Relative cold price|
|:---|
|The median is 22.15 EUR per sq.m per month.|

But the range is wide: from 4 up to 185 euros per sq.m.
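
The `cold_price_rel` column was prepared in Part 1. Judging by the rows printed by `df.head(3)` above, it appears to be the cold price divided by the property area. The sketch below checks that assumption on those three rows; `sample` is an illustrative name.

```python
import pandas as pd

# Area and cold price copied from the three rows shown by df.head(3)
sample = pd.DataFrame(
    {
        "property_area": [220.5, 60.0, 54.0],
        "cold_price": [3500.0, 410.0, 456.0],
    }
)

# Assumption: cold_price_rel = cold_price / property_area (EUR per sq.m per month)
sample["cold_price_rel"] = sample["cold_price"] / sample["property_area"]

print(sample.round(2))
print("median:", round(sample["cold_price_rel"].median(), 2))
```

The computed values agree with the `cold_price_rel` column of those rows (about 15.87, 6.83 and 8.44).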


#### Let's explore the right tail of the distribution.

In [8]:
print( f"We have {df[df.cold_price_rel > 100].shape[0]} listings with cold price > 100€/sq.m per month." )
df[df.cold_price_rel > 100].sort_values(by="cold_price", ascending=True).head(5)


We have 27 listings with cold price > 100€/sq.m per month.


Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,cold_price_rel,pr_diff,heat_costs_calc,add_costs_calc,add_costs_rel,year_group,deposit_calc,criteria_clean
1746,139064872,"Wielandstrasse, Berlin",,23.0,1.0,?,?,Stepless access continuously access,No garage,?,?,,Unknown,not specified,not specified,2390.0,2390.0,1000,Unknown,HousingAnywhere B.V.,,Berlin,"Wielandstraße 0,",Alone or for two in Berlin?It doesn't matter at the apartment!In this space miracle there is not...,Charlottenburg,10707,https://www.immobilienscout24.de/expose/139064872,103.91304,0.0,,0.0,0.0,2000-2014,1000.0,stepless access
3094,138329459,"English street, Berlin",,24.0,1.0,?,?,Into,No garage,?,?,,Unknown,not specified,not specified,2460.0,2460.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Englische Straße 0,",The essential in perfection. Clear design and maximum hospitality: comfort and cosiness can be f...,Charlottenburg,10587,https://www.immobilienscout24.de/expose/138329459,102.5,0.0,,0.0,0.0,2000-2014,0.0,
411,140104529,"Heidestraße, Berlin",,18.0,1.0,?,?,Into,No garage,?,?,,Unknown,not specified,not specified,2500.0,2500.0,500,Unknown,HousingAnywhere B.V.,,Berlin,"Heidestraße 0,","More space for you - from 18m².Look Forward to Super Comfortable, Large Beds, USB Ports and a Sm...",Moabit,10557,https://www.immobilienscout24.de/expose/140104529,138.88889,0.0,,0.0,0.0,2000-2014,500.0,
2717,138335786,"Friedrichstrasse, Berlin",,20.0,1.0,?,?,Into,No garage,?,?,,Unknown,not specified,not specified,2500.0,2500.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Friedrichstraße 0,",Advantages: - Registration possible - no advance payment required - no deposit required - weekly...,Kreuzberg,10969,https://www.immobilienscout24.de/expose/138335786,125.0,0.0,,0.0,0.0,2000-2014,0.0,
2718,138335782,"Friedrichstrasse, Berlin",,18.0,1.0,?,?,Into,No garage,?,?,,Unknown,not specified,not specified,2500.0,2500.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Friedrichstraße 0,",Advantages: - Registration possible - no advance payment required - no deposit required - weekly...,Kreuzberg,10969,https://www.immobilienscout24.de/expose/138335782,138.88889,0.0,,0.0,0.0,2000-2014,0.0,


Here we see very niche offers.  
 
Some of them do not require a deposit and include weekly cleaning in the price.  

For example:

|`Our 19-23 sqm suites for stays over 28 nights are the ideal choice if you are looking for a suitable apartment for two and have therefore been furnished to our highest modern standards. The suites have a fully equipped kitchen, a comfortable box spring bed (1.60 m) with a modern smart TV and a private bathroom with a shower so you can feel at home. If there is dirty laundry, you have the opportunity to wash your clothes in the communal laundry room (opening hours: 6 a.m. to 10 p.m.). Your apartment offers everything you need for a longer stay with us in just one room.` |
|:---| 

Small but very comfortable rooms with good furniture.  

|These offers form a unique market segment and might be *an alternative to staying at a hotel.*| 
|:---|
|The most affordable offer starts at 2,390 euros for 23 sq.m.  |
|A typical feature of offers in this segment is a **minimal deposit or no deposit at all.**|
|But **relative prices** here are **over 100** euros per sq.m per month!|

Relative cold prices in this segment are about 6 times higher than for listings of 300 sq.m and larger.  
Very high.  

#### And what about low-end listings?

In [9]:
print( f"We have {df[df.cold_price_rel < 15].shape[0]} listings with cold price < 15€/sq.m" )
df[df.cold_price_rel < 15].sort_values(by="cold_price_rel", ascending=True).head(3)


We have 1516 listings with cold price < 15€/sq.m


Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,cold_price_rel,pr_diff,heat_costs_calc,add_costs_calc,add_costs_rel,year_group,deposit_calc,criteria_clean
1101,141136604,Exchange apartment: 2 room apartment in Schöneberg for exchange against 1-2 rooms,,49.0,2.0,?,?,Balcony/ terrace balcony/ terrace basement basement passenger elevator personal elevator,No garage,3,?,,Unknown,150,not specified,210.0,360.0,0,Flat,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,I am looking for an equivalent apartment of 45-60 sqm warm rent up to around € 550 in Berlin for...,Schöneberg,10783,https://www.immobilienscout24.de/expose/141136604,4.285714,150.0,,150.0,3.061224,2000-2014,0.0,balcony terrace basement passenger elevator personal
1161,141136243,Exchange apartment: bright 2.5 apartment for larger WHG,,65.0,3.0,?,?,Balcony/ terrace balcony/ terrace basement basement fitted kitchen fitted kitchen,No garage,1,?,,Unknown,260,not specified,280.0,540.0,0,Flat,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,Our bright and well -cut apartment is located in the quiet Tyrolean neighborhood in Pankow.It is...,Pankow (Ortsteil),13187,https://www.immobilienscout24.de/expose/141136243,4.307692,260.0,,260.0,4.0,2000-2014,0.0,balcony terrace basement fitted kitchen
2541,136653959,Apartment exchange: Schönleinstrasse 7,,105.0,2.0,?,?,Into,No garage,?,?,,Unknown,not specified,not specified,510.0,,0,Unknown,Wohnungsswap.de - Lägenhetsbyte Sverige AB -,Wohnungsswap .de,Berlin,"Schönleinstraße 7,","Cozy old apartment with an open kitchen (Berlin room), spacious hallway and 2 very large pretty ...",Kreuzberg,10000,https://www.immobilienscout24.de/expose/136653959,4.857143,,,,,2000-2014,0.0,


Well, we have a whole new segment of listings with a cold price < 15€/sq.m.  
Let's take a closer look at them.

In [10]:
df[df.cold_price_rel < 15].groupby(by="publisher").agg(
    {
        "region": "count",
        "property_area": "mean",
        "cold_price": "mean",
        "cold_price_rel": "mean",
    }
).sort_values(by="region", ascending=False).head(7).style.format(precision=1).bar(
    subset=["cold_price_rel", "region"], color="#e1f3f8"
)


publisher,region,property_area,cold_price,cold_price_rel
Tauschwohnung GmbH,661,66.9,641.7,9.7
Wohnungsswap.de - Lägenhetsbyte Sverige AB -,523,64.5,620.0,9.6
HOWOGE Wohnungsbaugesellschaft mbH,85,51.2,438.5,8.6
Private,64,90.9,1104.2,12.1
Immonexxt GmbH,33,56.7,666.2,11.9
degewo,18,73.6,635.5,8.3
ambelin GmbH,10,77.5,824.2,10.6


Very interesting!  
I would have expected the cheapest listings to come from private owners, but they come from very special publishers.  


`A great number of low-end listings are actually Tauschwohnung and Wohnungsswap.de offers.`  
 
These listings are intended to be exchanged for other properties.  
This seriously changes our preliminary price estimates.  
We should identify such listings and **filter them out**, as we cannot directly compare them with other segments.  
You need to have a property of your own to make use of these offerings.

In [11]:
def filter_exchange(row):
    """Return True if the listing title marks it as an exchange offer."""
    return "exchange" in row.title.lower()
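
The row-wise check above works with `apply`, but the same test can probably be written in vectorized form with the pandas string accessor. The sketch below uses a few hypothetical titles modelled on the listings shown earlier; `toy` and `is_exchange` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical titles, modelled on the listings shown above
toy = pd.DataFrame(
    {
        "title": [
            "Exchange apartment: bright 2.5 apartment for larger WHG",
            "Apartment exchange: Schönleinstrasse 7",
            "Furnished luxury apartment in the heart of Grunewald",
            None,  # a missing title
        ]
    }
)

# Case-insensitive substring test; na=False treats missing titles as "not an exchange offer"
is_exchange = toy["title"].str.contains("exchange", case=False, na=False, regex=False)

print(is_exchange.tolist())
```

Unlike the row-wise version, this variant does not fail on a missing title, which may or may not matter for the actual dataset.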


In [12]:
print(f"Filtered {df[df.apply(filter_exchange, axis=1)].shape[0]} listings for exchange.")


Filtered 1325 listings for exchange.


Let's take a closer look at this segment.

In [13]:
df[df.apply(filter_exchange, axis=1)].groupby(by="publisher").agg(
    {
        "region": "count",
        "property_area": "mean",
        "cold_price": "mean",
        "cold_price_rel": "mean",
    }
).sort_values(by="region", ascending=False).style.format(precision=1).bar(
    subset=["cold_price_rel"], color="#e1f3f8"
)


publisher,region,property_area,cold_price,cold_price_rel
Tauschwohnung GmbH,747,66.7,700.4,10.6
Wohnungsswap.de - Lägenhetsbyte Sverige AB -,578,64.2,663.8,10.4


We can see two main players here.

Let's try to filter out all low-end offers by price only.

In [14]:
df[df.cold_price_rel < 10].groupby(by="publisher").agg(
    {
        "region": "count",
        "property_area": "mean",
        "cold_price": "mean",
        "cold_price_rel": "mean",
    }
).sort_values(by="cold_price_rel", ascending=False).style.format(precision=1).bar(
    subset=["region", "cold_price_rel"], color="#e1f3f8"
)


publisher,region,property_area,cold_price,cold_price_rel
Westminster Unternehmensgruppe,1,55.2,552.0,10.0
HousingAnywhere B.V.,1,62.0,600.0,9.7
Inseriert auf ohne-makler.net,1,83.0,798.0,9.6
Deutsche Wohnen SE,2,60.1,574.7,9.6
ambelin GmbH,4,83.6,775.1,9.3
Adler Group,7,64.3,592.0,9.3
Vonovia SE,1,33.0,299.8,9.1
Private,9,96.2,828.6,8.8
Immonexxt GmbH,5,57.2,486.8,8.7
Dr. Lenhardt Roling GbR,1,62.0,530.0,8.5


We see offers below 10€/sq.m, and most of them come from those two publishers.  
But filtering by low price also removes low-end offerings that are not intended for exchange.  
These are genuinely low-priced offers.  
So filtering by price doesn't work well.
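
One way to see why a price threshold is a poor substitute for the title check is to cross-tabulate the two filters. The sketch below does this on a handful of hypothetical listings (the titles and prices are invented for illustration); on the real data you would build the same two boolean masks from the dataframe.

```python
import pandas as pd

# Hypothetical listings: two exchange offers and three regular ones
toy = pd.DataFrame(
    {
        "title": [
            "Exchange apartment: 2 rooms in Schöneberg",
            "Apartment exchange: Schönleinstrasse 7",
            "Bright 2-room flat near the park",
            "Affordable 1-room flat",
            "Furnished studio in Mitte",
        ],
        "cold_price_rel": [4.3, 12.5, 9.1, 8.7, 45.0],
    }
)

by_title = toy["title"].str.contains("exchange", case=False, na=False)
by_price = toy["cold_price_rel"] < 10

# Rows: flagged by title; columns: flagged by price
print(pd.crosstab(by_title.rename("exchange_in_title"), by_price.rename("below_10_eur")))

# Regular listings that a price-only filter would wrongly drop
wrongly_dropped = int((by_price & ~by_title).sum())
# Exchange listings that a price-only filter would miss
missed = int((~by_price & by_title).sum())
print(wrongly_dropped, missed)
```

In this toy example the price filter drops two regular listings and misses one exchange offer, which mirrors the problem described above.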

Finally, let's define a new dataframe with the exchange offerings filtered out.  
All subsequent conclusions are based on this dataframe unless otherwise specified.

In [15]:
dff = df[~df.apply(filter_exchange, axis=1)].copy()


We should now re-estimate the relative cold price:

In [16]:
fig = px.box(
    dff[["cold_price_rel"]],
    x="cold_price_rel",
    notched=True,
    title="Cold RELATIVE prices <br><sup>€ for sq.m per month</sup>",
    color_discrete_sequence=["green"],
)
fig.update_layout(xaxis_title="€ for sq.m per month", yaxis_title="Value range")


As we filtered out the exchange offerings, the median relative cold price increased from 22 to 30 EUR per sq.m per month (+37%).  
If we continued to include that irrelevant data, our statistics would be heavily skewed.
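
The size of the shift can be checked with simple arithmetic. The two medians below are taken from this notebook (22.15 before filtering, and the 50% value of `cold_price_rel` in the `describe()` table of the filtered data); the variable names are illustrative.

```python
median_before = 22.15  # median relative cold price with exchange listings (EUR per sq.m)
median_after = 30.4    # median after filtering them out

# Relative change of the median
change = (median_after - median_before) / median_before
print(f"{change:+.1%}")
```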

In [17]:
dff[
    ["property_area", "cold_price", "warm_price", "cold_price_rel"]
].describe().T.style.format(precision=1, thousands=",")


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
property_area,2746.0,67.5,43.2,11.8,40.0,58.0,80.0,706.0
cold_price,2746.0,1930.4,1252.6,195.0,1300.0,1734.0,2300.0,17000.0
warm_price,2711.0,2080.8,1325.4,300.0,1390.0,1800.0,2490.0,19850.0
cold_price_rel,2746.0,32.7,18.4,5.8,21.8,30.4,39.4,185.7


The plot and the table speak for themselves:


### A typical rental property in Berlin looks like this:

| |Property area|Cold price|Warm price|Relative cold price|
|:---|:---|:---|:---|:---|
|median|58 sq.m|1,734 EUR|1,800 EUR|30 EUR per sq.m per month|

3/4 of all listings have a cold price below 2,300 euros and a warm price below 2,490 euros.  

But the range is wide: from 300 up to 19,850 euros (warm price).  


Before we go any further let's slow down a bit and answer some easy questions.

### How tall are buildings for rent in Berlin?

In [18]:
temp = (
    dff.floors_in_building.value_counts()
)  # get frequencies of the number of floors in the building 

fig = px.bar(
    temp, x=temp.index, y=temp.values, title="Number of floors in the building"
)
fig.update_layout(xaxis_title="", yaxis_title="Count of properties for rent")
fig.update_xaxes(type="category")
fig.update_xaxes(categoryorder="category ascending")
# fig.update_layout(xaxis={'categoryorder':'total ascending'})
fig.show()


Well... Do you, like me, find this plot boring?  
The category (x axis) order is alphabetical, which is not intuitive here because the building heights appear out of order.  
We can fix this by changing the order of the index. 

Get 'floors_in_building' distribution values:

In [19]:
temp = dff.floors_in_building.value_counts()

# left-pad each label with "000" and keep the last two characters ("5" -> "05")
# so that the string sort matches the numerical order;
# "?" is replaced with "000" to put unspecified floors first
# (regex=False because "?" is a regex quantifier and must be matched literally)
temp.sort_index(
    key=lambda x: ("000" + x).str.replace("?", "000", regex=False).str[-2:],
    ascending=True,
    inplace=True,
)


fig = px.bar(
    temp,
    y=temp.index,
    x=temp.values,
    title="Height of buildings for rental housing in Berlin",
    color=temp.values,
    height=800,
    orientation="h",
    text_auto=True,
    color_continuous_scale=["LightBlue", "Blue", "lightgrey"],
)
fig.update_layout(
    xaxis_title="Count of properties for rent (log scale)",
    yaxis_title="Floors in the building (height)",
)
fig.update_layout(xaxis_type="log")  # log scale
fig.update_yaxes(type="category")  # sort the y axis by the number of floors
fig.update_coloraxes(showscale=False)  # hide the color scale
fig.show()


This plot is much easier to understand.  
Unspecified values are placed at the bottom, and the y axis follows the building height.  
This is much better.  
I use a log scale on the x axis to accommodate the wide range of counts.
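
The zero-padding trick relies on every label having at most two digits. An alternative that does not depend on the label length is to convert the labels to numbers before sorting. The sketch below shows the idea on hypothetical counts; `counts`, `order` and `sorted_counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical value counts with the same kind of string index ("?" = not specified)
counts = pd.Series([120, 5, 40, 8, 300], index=["5", "13", "?", "26", "4"])

# Convert labels to numbers; "?" becomes NaN and is mapped to -1 so that it sorts first
order = pd.to_numeric(pd.Series(counts.index), errors="coerce").fillna(-1)

# Reorder the counts by the numeric value of their labels
sorted_counts = counts.iloc[np.argsort(order.to_numpy(), kind="stable")]

print(sorted_counts.index.tolist())
```

The labels stay strings, so the plot can still treat the axis as categorical.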

|The tallest building in the database has 26 storeys.|
|:---|
|We have 8 offers in 13- and 18-storey buildings.|
|But the median is a 5-storey building.|  

### What floors are most often offered for rent?

In [20]:
temp = dff.floor.value_counts()

# same zero-padding sort key as for the building heights; "?" is matched literally
temp.sort_index(
    key=lambda x: ("000" + x).str.replace("?", "000", regex=False).str[-2:],
    ascending=True,
    inplace=True,
)

fig = px.bar(
    temp,
    y=temp.index,
    x=temp.values,
    title="Property floor number offered for rent",
    color=temp.values,
    height=800,
    orientation="h",
    text_auto=True,
    color_continuous_scale=["LightGreen", "Green", "lightgrey"],
)
fig.update_layout(
    xaxis_title="Count of properties for rent (log scale)", yaxis_title="Floor number"
)
fig.update_layout(xaxis_type="log")  # log scale
fig.update_yaxes(type="category")  # sort the y axis by the number of floors
fig.update_coloraxes(showscale=False)  # hide the color scale
fig.show()


|We have offers up to the 16th floor.|
|:---|
|But most offers are from floor 0 (basement or ground floor) to the 4th floor.|

### What districts of Berlin are the most popular for rental housing?

In [21]:
region_top = (
    dff.groupby("region")[["address"]]
    .agg({"address": "count"})
    .sort_values(by=["address"], ascending=False)
    .head(12)
    .to_dict()
)
region_minor = (
    dff.groupby("region")[["address"]]
    .count()
    .sort_values(by=["address"], ascending=False)
    .tail(12)
    .to_dict()
)
temp = dff[
    dff.region.isin([*region_top["address"].keys()])
]  # df with listings within top districts

fig = px.histogram(
    temp.region,
    title="Top Berlin districts by representation",
    text_auto=True,
    height=600,
)
fig.update_layout(xaxis_title="", yaxis_title="Number of listings")
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.update_traces(showlegend=False)
fig.update_xaxes(tickangle=60)


This data is not easy to digest for a foreigner :)  
Let's visualize the Berlin districts with the highest representation on a map.

In [22]:
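def plotDot(row, color, from_df=True, radius=10, weight=10, this_map=None, cnt=None):
    """Geocode a listing (or a district name) and add it to `this_map` as a circle marker."""
    if from_df:
        # row is a dataframe row: build one query string, skipping missing parts
        parts = [row.address, row.region, row.city, row.zip]
        loc = geolocator.geocode(", ".join(str(p) for p in parts if pd.notna(p)))
        popup = f"Agency: {row.publisher}"
    else:
        # row is a plain string such as "Mitte, Berlin"
        loc = geolocator.geocode(row)
        popup = f"Region: {row}.  {cnt} listings."
    if loc:
        folium.CircleMarker(
            location=[loc.latitude, loc.longitude],
            radius=radius,
            weight=weight,
            color=color,
            opacity=0.6,
            popup=popup,
        ).add_to(this_map)  # this_map must be a folium.Map passed by the caller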


Berlin districts with the highest and lowest quantity of listings for rent.

In [23]:
# geolocator = Nominatim(user_agent='geopy/2.2.0')
# my_map = folium.Map(prefer_canvas=True)
# # folium.Marker([lat, lon], popup="Googleplex").add_to(this_map)
# for k, v in region_top['address'].items():
#     plotDot(k + ', Berlin', color='#FF00AA', from_df=False, radius=20, weight=10, this_map=my_map, cnt=str(v))

# for k,v in region_minor['address'].items():
#     plotDot(k + ', Berlin', color='#02bfe7', from_df=False, radius=20, weight=5, this_map=my_map, cnt=str(v))
# # df.iloc[:3].apply(plotDot, color='#FF00AA', axis=1) # rgba(255, 0, 170, 0.4)

# my_map.fit_bounds(my_map.get_bounds())

# my_map


Looking at the map, you can get more information for comparison.  
The map is interactive and shows the district and the number of offers when you click on a circle.  

It is clear that most of the properties offered for rent are located **in the central part of the city** (pink circles).  

The farther a district is from the center, the fewer rental offers it has (light blue circles).

### What is the size of the premises offered for rent?

In [24]:
fig = px.box(
    dff[["property_area"]], x="property_area", notched=True, title="property_area"
)
fig.update_layout(xaxis_title="Property size (sq.meters)", yaxis_title="")


In [25]:
dff[["property_area"]].describe().T.style.format(precision=1, thousands=",")


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
property_area,2746.0,67.5,43.2,11.8,40.0,58.0,80.0,706.0


|The median property size is 58 sq.meters.|
|:---|
|75 % of offerings are below 80 sq.meters.|
|Half of the properties are between 40 and 80 sq.meters.|
|The minimum and maximum property sizes are 12 and 706 sq.meters respectively.|
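
The box plot also flags individual points as outliers. Assuming the common convention of whiskers at 1.5 times the interquartile range, the fences can be worked out from the quartiles in the table above. This is a small sketch of that arithmetic, not a reproduction of the exact values the plotting library computes.

```python
# Quartiles taken from the describe() output above
q1, q3 = 40.0, 80.0

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this are conventionally treated as outliers
lower_fence = q1 - 1.5 * iqr   # negative here, so no listing is a low-end outlier by this rule

print(iqr, lower_fence, upper_fence)
```

By this rule, any property larger than 140 sq.m already counts as unusually large.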

Let's look at the right tail of the distribution.

In [26]:
dff[dff.property_area > 300]


Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,cold_price_rel,pr_diff,heat_costs_calc,add_costs_calc,add_costs_rel,year_group,deposit_calc,criteria_clean
305,140099183,Life in the Monbijou residence - stately penthouse on the World Heritage Site!,,375.0,8.0,4,4,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal compound kitchen...,1 parking space,4,?,1906.0,D,1.724,included in additional costs,9850.0,11574.0,"29.403,00 EUR",Other,Engel & Völkers Berlin Mitte GmbH,Engel & Völkers Berlin Mitte,Berlin,"Monbijoustraße 3/5,","This stately maisonett apartment has eight rooms, is located on the fourth floor and attic (gall...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/140099183,26.266666,1724.0,0.0,1724.0,4.597333,First half XX cent,29403.0,balcony terrace basement passenger elevator personal kitchen fitted guest toilet
311,139142000,Berlin in view - unique townhouse in the heart of Berlin!,from immediately,456.0,4.0,3,3,Balcony/ Terrace Balcony/ Terrace Passenger Rifle People's Rifle Insperation Kitchen Interpretin...,1 garage,?,?,2012.0,Unknown,1.284,not included in additional costs,15000.0,16284.0,"45.000,00 EUR",Other,Engel & Völkers Berlin Mitte GmbH,Engel & Völkers Berlin Mitte,Berlin,"Oberwallstraße 13,","Characteristic of the town houses are the long, rather narrow parcels that challenge both origin...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/139142000,32.894737,1284.0,,1284.0,2.815789,modern,45000.0,balcony terrace passenger kitchen guest toilet
315,141306965,Exclusive townhouse in Mitte near the Gendarmenmarkt,immediately,456.0,4.0,3,3,Balcony/ terrace balcony/ terrace basement basement fitted kitchen built-in kitchen guest toilet...,1 Underground parking space,?,?,2012.0,Unknown,1.284,not specified,15000.0,16284.0,3 NKM,Small house,FAMOZA Immobilien,Frau Josipa Kovačević,Berlin,,An exclusive townhouse in the popular Mitte district is rented out.The four-storey Maisonette to...,Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/141306965,32.894737,1284.0,,1284.0,2.815789,modern,45000.0,balcony terrace basement fitted kitchen built-in guest toilet
1886,140875809,First cover: spectacular penthouse in city location,01.04.2023,321.0,5.0,3,2,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal compound kitchen...,2 Underground parking spaces,6,6,2022.0,Unknown,950,included in additional costs,11500.0,12450.0,0,Penthouse,Engel & Völkers Immobilien Deutschland GmbH,Engel & Völkers Immobilien Deutschland GmbH,Berlin,,"The penthouse offered here is located on the Kant garages, a historic architectural monument fro...",Charlottenburg,10625,https://www.immobilienscout24.de/expose/140875809,35.825546,950.0,0.0,950.0,2.959502,new,0.0,balcony terrace basement passenger elevator personal kitchen fitted guest toilet
1908,138488871,Exceptional Living in Jägerstraße on Friedrichswerder - exclusive penthouse with 360 degrees Blick,Nach Absprache,706.0,7.0,4,3,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal compound kitchen...,Underground parking space,6,6,2007.0,C,2.850,included in additional costs,17000.0,19850.0,0,Penthouse,CITY-CONCEPT Gesellschaft für Immobilienmanagement mbH,Herr Stefan Schepers,Berlin,"Jägerstraße 34,",The Jägerstraße 34/35 residential and commercial building is in close proximity to the Federal F...,Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/138488871,24.07932,2850.0,0.0,2850.0,4.036827,2000-2014,0.0,balcony terrace basement passenger elevator personal kitchen fitted guest toilet
4047,105850244,5-room apartment with a large terrace in the heart of Berlin,from Juli 2023,343.5,5.0,4,4,Balcony/ terrace balcony/ terrace Passenger elevator Personal compounding kitchen Garden/ use of...,1 Underground parking space,3,7,2015.0,Unknown,"1.545,35",included in additional costs,7081.87,8627.22,3 Nettokaltmieten,Terrace apartment,HGHI Immobilien Verwaltung GmbH,Frau Marie-Josephine Wahn,Berlin,"Leipziger Str. 12,","As a unique residential area over the roofs of the city, the Leipziger Platz Quartier sets new s...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/105850244,20.616798,1545.3496,0.0,1545.35,4.498836,modern,21245.610352,balcony terrace passenger elevator personal kitchen garden use guest toilet stepless access


These are very special offers, with a crazy price of up to 20,000 euros for a 706 sq.m penthouse on Jägerstraße.  
All of them have plenty of additional features, including a garage.
You have to be very wealthy to afford that.  

How can one not recall the film Scent of a Woman, with Al Pacino in the leading role?  
Although it seems to me that the colonel would not allow himself such luxury ;)  

On the other hand, the relative prices per sq.m are decent (20-35 EUR) and are
a lot lower than those of the one-room offerings we observed earlier.

### Warm prices

#### What are the costs of renting a property in Berlin?


In [27]:
print(f"{dff[dff.pr_diff < 5].pr_diff.count() / dff.shape[0]:.1%}")


53.6%


Many listings (54 %) have practically equal cold and warm prices (a difference below 5 euros), which means that heating costs are included in the cold price.  
Therefore, when talking about heating costs, it is better to use the total maintenance costs (`add_costs_calc` in our case).
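
The `pr_diff` column comes from Part 1. In the rows shown in this notebook it matches the warm price minus the cold price, so the sketch below assumes that definition and recomputes the share on four of those rows. `toy` and `share_equal` are illustrative names.

```python
import pandas as pd

# Cold and warm prices copied from listings shown in this notebook
toy = pd.DataFrame(
    {
        "cold_price": [3500.0, 410.0, 2390.0, 2500.0],
        "warm_price": [4300.0, 580.0, 2390.0, 2500.0],
    }
)

# Assumption: pr_diff = warm_price - cold_price
toy["pr_diff"] = toy["warm_price"] - toy["cold_price"]

# Share of listings where the two prices differ by less than 5 EUR
share_equal = (toy["pr_diff"] < 5).mean()
print(f"{share_equal:.1%}")
```

Taking the mean of the boolean mask divides by the total number of rows, as the cell above does, so rows with a missing difference count as "not equal".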

In [28]:
dff[["heat_costs_calc", "add_costs_calc", "pr_diff"]].describe().T.style.format( "{:.0f}" )


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
heat_costs_calc,1122,48,88,0,0,0,78,1200
add_costs_calc,2737,138,217,0,0,0,231,3000
pr_diff,2711,157,234,0,0,0,277,3000


Only a minority of listings (1,122) specify a value for heating costs.  
We can estimate the real costs from the difference between warm and cold prices.  


Main takeaway from that table: **the mean cost is 157 euros per month.**


Let's look at costs distribution:

In [29]:
fig = px.box(
    dff,
    x=["add_costs_calc", "heat_costs_calc"],
    height=600,
    title="Costs per month",
    notched=True,
    color="variable",
)
fig.update_layout(
    xaxis_title="Costs per month (€)",
    yaxis_title="",
    legend_title="Costs",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
)
fig.update_traces(boxpoints="all", jitter=0.7, pointpos=-1.8, marker_size=2)
fig.show()


Dots represent real observations.  
Many heating costs are actually zero (the red vertical line on the plot).  
  
And we can clearly see outliers.

#### Let's look closer at offerings with extreme costs.

In [30]:
dff[dff.add_costs_calc > 900].sort_values(by="add_costs_calc", ascending=False).head(5)


Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,cold_price_rel,pr_diff,heat_costs_calc,add_costs_calc,add_costs_rel,year_group,deposit_calc,criteria_clean
1883,133480215,Furnished luxury apartment in the heart of Grunewald,01.06.22,197.0,4.0,2,3,Balcony/ terrace balcony/ terrace basement basement Passenger elevator Personal kitchen built -i...,No garage,1,3,2009.0,Unknown,3.000,not specified,8500.0,11500.0,0,Flat,Engel & Völkers Immobilien Deutschland GmbH,Engel & Völkers Immobilien Deutschland GmbH,Berlin,,"The villa ""Grunewaldherz"" is a modern building that was stylized neoclassical.There is the smoot...",Schmargendorf,14195,https://www.immobilienscout24.de/expose/133480215,43.14721,3000.0,,3000.0,15.228426,2000-2014,0.0,balcony terrace basement passenger elevator personal kitchen built -in garden use guest toilet
1908,138488871,Exceptional Living in Jägerstraße on Friedrichswerder - exclusive penthouse with 360 degrees Blick,Nach Absprache,706.0,7.0,4,3,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal compound kitchen...,Underground parking space,6,6,2007.0,C,2.850,included in additional costs,17000.0,19850.0,0,Penthouse,CITY-CONCEPT Gesellschaft für Immobilienmanagement mbH,Herr Stefan Schepers,Berlin,"Jägerstraße 34,",The Jägerstraße 34/35 residential and commercial building is in close proximity to the Federal F...,Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/138488871,24.07932,2850.0,0.0,2850.0,4.036827,2000-2014,0.0,balcony terrace basement passenger elevator personal kitchen fitted guest toilet
305,140099183,Life in the Monbijou residence - stately penthouse on the World Heritage Site!,,375.0,8.0,4,4,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal compound kitchen...,1 parking space,4,?,1906.0,D,1.724,included in additional costs,9850.0,11574.0,"29.403,00 EUR",Other,Engel & Völkers Berlin Mitte GmbH,Engel & Völkers Berlin Mitte,Berlin,"Monbijoustraße 3/5,","This stately maisonett apartment has eight rooms, is located on the fourth floor and attic (gall...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/140099183,26.266666,1724.0,0.0,1724.0,4.597333,First half XX cent,29403.0,balcony terrace basement passenger elevator personal kitchen fitted guest toilet
4047,105850244,5-room apartment with a large terrace in the heart of Berlin,from Juli 2023,343.5,5.0,4,4,Balcony/ terrace balcony/ terrace Passenger elevator Personal compounding kitchen Garden/ use of...,1 Underground parking space,3,7,2015.0,Unknown,"1.545,35",included in additional costs,7081.87,8627.22,3 Nettokaltmieten,Terrace apartment,HGHI Immobilien Verwaltung GmbH,Frau Marie-Josephine Wahn,Berlin,"Leipziger Str. 12,","As a unique residential area over the roofs of the city, the Leipziger Platz Quartier sets new s...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/105850244,20.616798,1545.3496,0.0,1545.35,4.498836,modern,21245.610352,balcony terrace passenger elevator personal kitchen garden use guest toilet stepless access
3846,141147504,Fantastic view of the zoo,Nach Absprache,152.5,3.0,2,2,Balcony/ terrace balcony/ terrace basement basement passenger elevator Personal kitchen built-in...,1 Underground parking space,5,9,2003.0,C,1.350,included in additional costs,2898.0,4248.0,8694,Flat,FRASSEK Private Real Estate GmbH,Herr Michael Frassek,Berlin,,As special as the location on Potsdamer Platz and at the Tiergarten is the feeling of living in ...,Tiergarten,10117,https://www.immobilienscout24.de/expose/141147504,19.003279,1350.0,0.0,1350.0,8.852459,2000-2014,8694.0,balcony terrace basement passenger elevator personal kitchen built-in guest toilet stepless access


These are exclusive offerings, most of them with parking places and with heating costs included in the additional costs.  
Most of the buildings on this list are modern, but there is one penthouse on Monbijoustraße built in 1906.

### How does energy efficiency class correlate with costs?

Let's start with distribution by EEC (energy efficiency class).

In [31]:
fig = px.histogram(
    df[["energy_eff"]].sort_values(by="energy_eff"),
    x="energy_eff",  # color = 'energy_eff',
    title="Distribution of offerings by energy efficiency class",
    text_auto=True,
)
fig.update_layout(xaxis_title="")
fig.update_layout(xaxis={"categoryorder": "total descending"})


In [32]:
print(
    f'Only {dff[dff.energy_eff != "Unknown"].shape[0]/dff.shape[0]:.2%} of the properties have a specified energy efficiency class.'
)


Only 15.55% of the properties have a specified energy efficiency class.


Most of the properties do not have a designated energy efficiency rating.  

In [33]:
eff_piv = dff.pivot_table(
    "add_costs_rel", ["energy_eff"], aggfunc=["mean", "count"]
).sort_values(by=("mean", "add_costs_rel"), ascending=True)

# rename columns
eff_piv.columns = ["Relative costs (EUR/m2), mean", "Number of offerings"]

# reset index to turn energy_eff into a regular column
eff_piv.reset_index(inplace=True)

eff_piv.style.bar(align="left", color="coral").format(precision=2, thousands=",")


Unnamed: 0,energy_eff,"Relative costs (EUR/m2), mean",Number of offerings
0,Unknown,1.66,2310
1,A+,2.5,14
2,D,2.52,75
3,C,2.73,130
4,G,2.86,4
5,F,2.87,23
6,B,3.3,113
7,H,3.38,2
8,A,3.51,33
9,E,3.77,33


It seems that only the A+ class really affects relative costs.  
Let's visualize these figures more vividly.

In [34]:
fig = px.bar(
    eff_piv,
    x="energy_eff",
    y="Relative costs (EUR/m2), mean",
    color="Relative costs (EUR/m2), mean",
    hover_data=["energy_eff"],
    color_continuous_scale=["Green", "Blue", "Red"],
    text_auto=".3",
    title="Relative costs (EUR/m2), mean",
    height=600,
    opacity=0.6,
)
fig.update_layout(
    xaxis_title="Energy efficiency class", yaxis_title="Relative costs (EUR/m2), mean"
)
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()


Here we can see that the stated energy efficiency class correlates only weakly with relative costs.  
The classes are mixed almost randomly.

However, listings with a specified energy efficiency class are in the minority.  
Moreover, `actual heat costs are lower among listings with an 'Unknown' energy efficiency class.`  

These costs usually include heating and possibly some other extra services, so  

the tip from here is: `Do not pay too much attention to the indicated energy efficiency class`.
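
To put a number on "almost randomly mixed", here is a rough sketch that computes Spearman's rank correlation between the class order (A+ best, H worst) and the mean relative costs from the pivot table. It is an unweighted check on nine group means, so classes with only a few listings (G and H) count as much as the large ones, and it should be read as an illustration rather than a proper test.

```python
import pandas as pd

# mean relative costs per class, copied from the pivot table above
# (the "Unknown" group is left out because it has no rank)
classes = ["A+", "A", "B", "C", "D", "E", "F", "G", "H"]  # best to worst
mean_costs = pd.Series(
    [2.50, 3.51, 3.30, 2.73, 2.52, 3.77, 2.87, 2.86, 3.38], index=classes
)

class_rank = pd.Series(range(1, len(classes) + 1), index=classes, dtype="float64")
cost_rank = mean_costs.rank()

# Spearman's rho is the Pearson correlation of the two rank vectors
rho = class_rank.corr(cost_rank)
print(f"Spearman rho between class order and mean relative costs: {rho:.2f}")
```

A value close to 0 means the class order says little about the costs, while a value close to 1 would mean that worse classes consistently cost more.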

### How many rooms are there in the properties?

In [35]:
cols = ["num_rooms"]
fig = px.box( dff[cols], x = cols, notched=True, title="Number of rooms", color="variable", width=700, height=300 )
fig.update_yaxes(matches=None)
fig.update_traces(showlegend=False)
fig.update_layout(xaxis_title="", yaxis_title="Value range")


In [36]:
dff[cols].describe().T.style.format(precision=1, thousands=",")

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_rooms,2746.0,1.9,1.1,1.0,1.0,2.0,2.0,11.0


The median is 2 rooms, but there are some outliers with 7 rooms and more.  


In [37]:
print(f'{(dff[cols][dff.num_rooms>=4].count()/dff.shape[0]).values[0]:.2%} of listings have 4 or more rooms for rent.')

8.19% of listings have 4 or more rooms for rent.


In [38]:
# temp = dff[~dff.warm_price.isna()]

# s = temp.isna().sum()  # count missing values in each column among rows where warm_price is present
# cols = s[s == 0].index.to_list()  # list of columns with no missing values
# [f"{i:>20}{s[i]:8}" for i in s.index if s[i] > 0]  # list of columns with missing values


### Does the presence of a garage increase the price?

#### Garage

In [39]:
fig = px.histogram(
    dff[["garage"]],
    x=dff["garage"],
    title="Distribution of ads by garage availability",
    color="garage",
    text_auto=True,
    height=600,
)
fig.update_layout(xaxis_title="", yaxis_title="Count", showlegend=False)
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.update_yaxes(type="log")


In [40]:
print( f'Only {dff[dff.garage != "No garage"].shape[0]/dff.shape[0]:.2%} of the properties have a garage or a parking spot.')


Only 11.58% of the properties have a garage or a parking spot.


Most of the properties do not mention garage availability.

In [41]:
garage_bins = dff.garage.apply(
    lambda x: "No" if x == "No garage" else "Yes"
)  
garage_bins.rename("garage_presence", inplace=True)  # rename the column
garage_bins.value_counts()


No     2428
Yes     318
Name: garage_presence, dtype: int64

Does a garage change the picture on a scatter plot?

In [42]:
# fig = px.scatter(
#     pd.concat([dff, garage_bins], axis=1),
#     x="cold_price",
#     y="property_area",
#     color="garage_presence",
#     height=800,
#     facet_col="garage_presence",
# )  # ,  trendline="ols", trendline_options=dict(log_x=True)
# fig.update_layout(
#     xaxis_title="Price (EUR)", yaxis_title="Property area (m2) (log scale)"
# )
# # fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
# fig.update_traces(
#     marker_size=4, line=dict(width=2)
# )  # change marker size and line width
# fig.update_yaxes(range=[0, 250])
# fig.update_xaxes(range=[0, 5000])
# fig.show()


Interesting results.  
Do you also notice "clustering" only among the "no garage" ads, as I do?  
We still have no idea what it means.

In [43]:
garage_bins = dff.garage.apply(lambda x: "No" if x == "No garage" else "Yes")
garage_bins.rename("garage_presence", inplace=True)

dff.pivot_table(["cold_price","property_area"], [garage_bins], aggfunc=["median"]).style.bar(
    align="mid", color="coral"
).format(precision=1, thousands=",")


Unnamed: 0_level_0,median,median
Unnamed: 0_level_1,cold_price,property_area
garage_presence,Unnamed: 1_level_2,Unnamed: 2_level_2
No,1720.0,55.0
Yes,1840.2,85.5


A typical (median) ad with a garage has a larger property area and therefore a higher price as well.  
But its relative price (per square meter) is significantly lower.  
However, if you choose to use the garage, you will be charged an additional cost.  
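
The statement about the relative price can be checked directly from the medians in the table above. This is only a back-of-the-envelope sketch: a ratio of medians is not the same as the median of the per-listing relative prices (the `cold_price_rel` column), so the exact figures may differ.

```python
# medians taken from the pivot table above
medians = {
    "No": {"cold_price": 1720.0, "property_area": 55.0},
    "Yes": {"cold_price": 1840.2, "property_area": 85.5},
}

rel_price = {
    group: values["cold_price"] / values["property_area"]
    for group, values in medians.items()
}
for group, value in rel_price.items():
    print(f"garage: {group:>3} -> about {value:.1f} EUR/m2")
```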


### Let's predict missing warm prices

In [44]:
temp = dff[dff.warm_price.notna() & (dff.pr_diff > 10)]

In [45]:
px.histogram(temp.pr_diff)

In [46]:
temp[(temp.add_costs_calc <5)]


Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,add_costs,heat_costs,cold_price,warm_price,deposit,property_type,publisher,contact,city,address,description,region,zip,link,cold_price_rel,pr_diff,heat_costs_calc,add_costs_calc,add_costs_rel,year_group,deposit_calc,criteria_clean
571,124073549,Modern and bright 2-room apartment with a huge balcony,from immediately,55.0,2.0,1,1,Online tour possible online tour online tour of the providers enables visits via live video.So y...,No garage,3,4,2015.0,Unknown,0,included in additional costs,1850.0,2050.0,5550,Flat,Dominart Real Estate GmbH,Frau Marina Schäfer,Berlin,"Tieckstraße 22,","It is a beautiful apartment in a top location!The apartment is very bright, cozy, and has the 2 ...",Mitte (Ortsteil),10115,https://www.immobilienscout24.de/expose/124073549,33.636364,200.0,0.0,0.0,0.0,modern,5550.0,
916,138325340,Luxurious 3 room apartment in the new building,März 2023,105.0,3.0,3,2,Balcony/ terrace balcony/ terrace basement basement parent.,1 Outdoor parking space,1,?,2023.0,Unknown,150,1,2100.0,2362.5,6300,Unknown,Robeson SA,Herr Thomas Schneider,Berlin,"Assmannstrasse 50,","The new building project is located in Berlin-Treptow-Köpenick, Friedrichshagen district.The pop...",Friedrichshagen,12587,https://www.immobilienscout24.de/expose/138325340,20.0,262.5,1.0,1.5,0.014286,new,6300.0,balcony terrace basement


In [47]:
cols_lr = ["cold_price", "property_area", "add_costs_calc"]  # features for the model
# 'garage', 'energy_eff',
temp[temp[cols_lr].isna().any(axis=1)]  # rows with a missing value in any feature

In [None]:
temp[cols_lr].describe()

In [None]:
cols_lr = ["cold_price", "property_area", "add_costs_calc"]
rfc = RandomForestRegressor()
rfc.fit(temp[cols_lr], temp.warm_price)
# a random forest has no intercept_ or coef_, so only the score is reported;
# this is R^2 on the training data itself, which is an optimistic estimate
rfc.score(temp[cols_lr], temp.warm_price)


In [None]:
warm_price_pred = rfc.predict(temp[cols_lr])


Fit the model on the listings where `warm_price` > 0,  
then use it to predict the missing warm prices.
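
The "fit on known, predict the missing" idea can be sketched on synthetic data. Everything here is an assumption made for illustration: the `toy` frame, the relation used to generate warm prices, and the use of plain least squares via NumPy as a stand-in for the scikit-learn models in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "cold_price": rng.uniform(400, 3000, n),
        "property_area": rng.uniform(20, 150, n),
    }
)
# hypothetical relation, used only to generate the toy warm prices
true_warm = toy["cold_price"] + 2.5 * toy["property_area"] + rng.normal(0, 20, n)
toy["warm_price"] = true_warm
toy.loc[toy.index[::4], "warm_price"] = np.nan  # hide every 4th warm price

features = ["cold_price", "property_area"]
known = toy["warm_price"].notna()


def design(frame):
    """Feature matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(len(frame)), frame[features].to_numpy()])


# fit only on the rows where the warm price is known
coef, *_ = np.linalg.lstsq(
    design(toy[known]), toy.loc[known, "warm_price"].to_numpy(), rcond=None
)

# keep the stated warm prices and fill only the gaps with predictions
toy["warm_price_filled"] = toy["warm_price"]
toy.loc[~known, "warm_price_filled"] = design(toy[~known]) @ coef

mae = (toy.loc[~known, "warm_price_filled"] - true_warm[~known]).abs().mean()
print(f"filled {(~known).sum()} rows, mean abs error on them: {mae:.1f} EUR")
```

On real listings the true values of the missing prices are unknown, so the error would have to be estimated on a held-out part of the known rows instead.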

#### property_type

In [None]:
fig = px.histogram(
    dff[["property_type"]].sort_values(by="property_type"),
    title="property_type",
    text_auto=True,
    x="property_type",
    color="property_type",
    height=600,
)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={"categoryorder": "total descending"})


Among the listings with a designated type, flats are the most common offering.

In [None]:
property_bins = df.property_type.apply(
    lambda x: x if x == "Unknown" else "specified"
)  # create a Series with binary values
property_bins.rename("property_bins", inplace=True)  # rename the column
property_bins.value_counts()


In [None]:
fig = px.box(
    dff,
    x="property_type",
    y="cold_price",
    height=800,
    notched=True,
    title="Prices for different property types",
    color="property_type",
)
# fig.update_yaxes(matches= None)
fig.update_layout(xaxis={"categoryorder": "total ascending"})
fig.update_layout(xaxis_title="", yaxis_title="Cold price (EUR)")
fig.show()


The most expensive are penthouses and maisonettes (two-level apartments), and the cheapest is the Basement type. 

In [None]:
fig = px.scatter(
    pd.concat([df, property_bins], axis=1),
    x="cold_price",
    y="property_area",
    facet_col="property_bins",
    color="property_bins",
)
fig.update_layout(
    xaxis_title="Price (EUR)", yaxis_title="Property area (m2)"
)
# fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(
    marker_size=4, line=dict(width=2)
)  # change marker size and line width
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()


As we noticed earlier (in the garage section),  
listings with an Unknown property type (actually NaNs) form a distribution with 2 clusters.

#### Bedrooms and bathrooms

In [None]:
fig = px.histogram(
    dff[["num_bedrooms"]].sort_values(by="num_bedrooms"),
    title="Number of bedrooms",
    text_auto=True,
    color_discrete_sequence=["green"],
    opacity=0.6,
)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()


In [None]:
fig = px.histogram(
    df[["num_bathrooms"]].sort_values(by="num_bathrooms"),
    title="Number of bathrooms",
    text_auto=True,
    color_discrete_sequence=["blue"],
    opacity=0.4,
)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.show()


In [None]:
sp_rooms_bins = dff.apply(
    lambda x: "No"
    if (x["num_bathrooms"] == 0) and (x["num_bedrooms"] == 0)
    else "specified",
    axis=1,
)  # create a Series with binary values
sp_rooms_bins.rename("sp_rooms_bins", inplace=True)  # rename the column
sp_rooms_bins.value_counts()


In [None]:
fig = px.histogram(
    dff[["num_bedrooms", "num_bathrooms"]],
    title="Number of bedrooms and bathrooms",
    # color_discrete_sequence=['#1f77b4', '#ff7f0e'], labels={'value': 'Number of rooms', 'variable': 'Room type'},
    barmode="group",
    opacity=0.7,
    text_auto=True,
)
fig.update_layout(xaxis_title="", yaxis_title="Count of properties for rent")
fig.update_layout(xaxis={"categoryorder": "total descending"})


In [None]:
fig = px.scatter(
    pd.concat([dff, sp_rooms_bins], axis=1),
    x="cold_price",
    y="property_area",
    facet_col="sp_rooms_bins",
    color="sp_rooms_bins",
)
fig.update_layout(
    xaxis_title="Price (EUR)", yaxis_title="Property area (m2)"
)
fig.update_traces(marker_size=4, line=dict(width=2))
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()


As with the garage and the property type, we can see a definite segmentation among listings without a specified number of bedrooms and bathrooms.

And finally, let's combine all the features that lead to clustering:

In [None]:
cluster_bin = df.apply(
    lambda x: "clusterized"
    if (x["num_bathrooms"] == 0)
    and (x["num_bedrooms"] == 0)
    and (x["garage"] == "No garage")
    and (x["property_type"] == "Unknown")
    and (x["energy_eff"] == "Unknown")
    else "normal",
    axis=1,
)  # create a Series with binary values
cluster_bin.rename("cluster_bin", inplace=True)  # rename the column
cluster_bin.value_counts()
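
The same flag can also be built without a row-wise `apply`, by combining column-wise comparisons. This is a small sketch on a hypothetical `toy` frame that only borrows the column names from the notebook. It should give the same labels as the lambda above, and is usually faster on larger frames.

```python
import numpy as np
import pandas as pd

# hypothetical listings; column names follow the notebook
toy = pd.DataFrame(
    {
        "num_bathrooms": [0, 1, 0, 0],
        "num_bedrooms": [0, 2, 0, 1],
        "garage": ["No garage", "No garage", "No garage", "1 parking space"],
        "property_type": ["Unknown", "Flat", "Unknown", "Unknown"],
        "energy_eff": ["Unknown", "C", "Unknown", "Unknown"],
    }
)

# combine the five conditions column-wise instead of row by row
unspecified = (
    toy["num_bathrooms"].eq(0)
    & toy["num_bedrooms"].eq(0)
    & toy["garage"].eq("No garage")
    & toy["property_type"].eq("Unknown")
    & toy["energy_eff"].eq("Unknown")
)
cluster_bin = pd.Series(
    np.where(unspecified, "clusterized", "normal"),
    index=toy.index,
    name="cluster_bin",
)
print(cluster_bin.value_counts())
```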


In [None]:
fig = px.scatter(
    pd.concat([df, cluster_bin], axis=1),
    x="cold_price",
    y="property_area",
    facet_col="cluster_bin",
    color="cluster_bin",
)
fig.update_layout(
    xaxis_title="Price (EUR)", yaxis_title="Property area (m2)"
)
fig.update_traces(marker_size=3, line=dict(width=2))
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()


Listings
* without a garage
* with no specified property type, energy efficiency class, number of bedrooms and bathrooms  

form 2 visible clusters.

Later we'll try to use geo data to plot the listings on a map.

In [None]:
fig = px.scatter(
    temp, x="cold_price", y="property_area", color="publisher", hover_name="publisher"
)
fig.update_layout(
    xaxis_title="Price (EUR)", yaxis_title="Property area (m2)"
)
fig.update_traces(marker_size=4, line=dict(width=2))
fig.update_yaxes(range=[0, 120])
fig.update_xaxes(range=[0, 5000])
fig.show()


#### publisher

In [None]:
fig = px.histogram(
    df[["publisher"]].sort_values(by="publisher"),
    title="publisher",
    text_auto=True,
    height=800,
)
fig.update_layout(xaxis_title="", yaxis_title="Count (log scale)")
fig.update_layout(xaxis={"categoryorder": "total descending"})
fig.update_yaxes(type="log")
fig.update_xaxes(tickangle=60)


Let's print the top 15 agencies (all private owners are united in one group).

In [None]:
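def custom_aggregation(data):
    """
    Summarize the listings of one publisher: mean area, number of listings,
    mean cold price, total area (volume) and its share of all listed area in %.
    """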
    d = {}  # create an empty dictionary

    d["mean_sqm"] = data["property_area"].mean()
    d["count"] = round(data["property_area"].count())
    d["mean_price"] = data["cold_price"].mean()
    d["volume"] = d["count"] * d["mean_sqm"]
    d["share"] = d["volume"] / (dff["property_area"].sum()) * 100
    return pd.Series(d)


grouped = dff.groupby(["publisher"])[["property_area", "cold_price"]].apply(
    custom_aggregation
)
grouped.sort_values(by="volume", ascending=False).head(15).style.bar(
    align="mid", color="coral"
).format(precision=1, thousands=",")
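
The same summary can probably be obtained with named aggregation instead of a custom function. The sketch below uses a hypothetical `toy` frame with three publishers. It relies on the fact that `count * mean` of the area is simply the total area, so `volume` is computed as a plain sum.

```python
import pandas as pd

# hypothetical listings from three publishers
toy = pd.DataFrame(
    {
        "publisher": ["A", "A", "B", "B", "B", "C"],
        "property_area": [50.0, 70.0, 30.0, 40.0, 50.0, 60.0],
        "cold_price": [900.0, 1300.0, 600.0, 800.0, 1000.0, 1500.0],
    }
)

summary = toy.groupby("publisher").agg(
    mean_sqm=("property_area", "mean"),
    count=("property_area", "count"),
    mean_price=("cold_price", "mean"),
    volume=("property_area", "sum"),  # count * mean_sqm is just the total area
)
# share of each publisher in the total listed area, in percent
summary["share"] = summary["volume"] / toy["property_area"].sum() * 100
print(summary.sort_values(by="volume", ascending=False))
```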


In [None]:
# df.groupby(['publisher']).agg(mean_property_area=("property_area", 'mean'),
#                                    Count=('property_area','count'),
#                                    mean_price= ("cold_price",'mean'),
#                                    volume = ("cold_price",lambda x: x.sum())).sort_values(by='volume', ascending= False)\
#                                     .style.bar(align='mid', color='coral').format(precision=0, thousands=",")


### What are the most popular residential rental objects in Berlin? 

In [None]:
fig = px.scatter(
    df,
    x="cold_price",
    y="property_area",
    color="property_type",
    height=800,
    trendline="ols",
    trendline_scope="overall",
)  # , trendline_options=dict(log_x=True)
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_layout(xaxis_type="log", yaxis_type="log")
fig.update_traces(
    marker_size=4, line=dict(width=2)
)  # change marker size and line width
fig.show()

# results = px.get_trendline_results(fig)
# print(results)
# results.px_fit_results.iloc[0].summary()
# results.query("property_type == 'Flat' or property_type == 'Unknown'").px_fit_results.iloc[0].summary()


We observe an interesting result here.  
Two big clusters are formed: 
* upper left, centered at about 600 EUR for 60 sqm
* lower right, centered at about 1800 EUR for 50 sqm.

So there are two segments.
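
In terms of price per square meter, the two centres are far apart, which a few lines of arithmetic make explicit. The centre values are the rough ones read off the plot, and the cut-off is a hypothetical choice (the midpoint of the two centres on a log scale), not something derived from the data.

```python
import math

# approximate cluster centres read off the scatter plot above
centres = {
    "upper left": {"cold_price": 600, "property_area": 60},
    "lower right": {"cold_price": 1800, "property_area": 50},
}
rel = {name: c["cold_price"] / c["property_area"] for name, c in centres.items()}

# hypothetical cut-off: the midpoint of the two centres on a log scale
threshold = math.sqrt(rel["upper left"] * rel["lower right"])

for name, value in rel.items():
    print(f"{name}: {value:.0f} EUR/m2")
print(f"possible split at about {threshold:.0f} EUR/m2")
```

A threshold like this on `cold_price_rel` could be one simple way to label the two segments before looking for what distinguishes them.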

In [None]:
# define a function to fill warm price on the basis of cold price and energy efficiency
# def fill_warm_price(xdf, cold_price, energy_eff, warm_price, property_type, property_area):

xdf = dff.copy()  # make a copy of the dataframe
xdf["costs"] = xdf.warm_price - xdf.cold_price  # calculate costs


In [None]:
fig = px.scatter(
    xdf,
    x="property_area",
    y="costs",
    color="property_type",
    height=800,
    trendline="ols",
    trendline_scope="overall",
)  # , trendline_options=dict(log_x=True) )
fig.update_layout(xaxis_title="Property area (m2)", yaxis_title="Costs (EUR)")
fig.update_layout(xaxis_type="log")  # , yaxis_type = 'log')
fig.update_traces(marker_size=4, line=dict(width=2))


In [None]:
xdf[xdf.costs < 50]  # listings with costs below 50 EUR (including any negative values)


In [None]:
px.histogram(xdf, y="costs", color="property_type", title="Costs (warm price minus cold price), EUR")


In [None]:
model = LinearRegression()  # define a linear regression model

known = xdf[xdf["warm_price"].notna()]  # select only rows with a known warm price
X = known[["cold_price", "property_area"]]
y = known["warm_price"]

# X = pd.get_dummies(X, columns=[ 'energy_eff'], drop_first=True) # convert categorical columns to dummy variables

model.fit(X, y)
ind = X.index
print(model.score(X, y), len(ind))  # R^2 on the training data and the number of rows used


#exclude columns

In [None]:
# temp_df = model.predict(pd.get_dummies(xdf[['cold_price', 'energy_eff',  'property_area']], columns=[ 'energy_eff'], drop_first=True))
temp_df = model.predict(xdf[["cold_price", "property_area"]])  # predicted warm prices for all listings
temp_df


In [None]:
# check_na(df)


In [None]:
temp_df = pd.DataFrame(temp_df, columns=["warm_price2"], index=xdf.index)  # keep the listing index for the later concat
temp_df.head()


In [None]:
# temp['diff'] = (temp.warm_price - temp.cold_price) #/ df.property_area


In [None]:
temp_df.describe()


In [None]:
temp_df.shape, df.shape


In [None]:
t = pd.concat([df, temp_df], axis=1, join="inner")
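
A note on this step: `pd.concat(..., axis=1)` matches rows by index label, not by position, so a frame built from a bare NumPy array of predictions only lines up if it carries the same index as the listings. The sketch below shows the difference on a hypothetical three-row frame whose index has gaps, as happens after filtering.

```python
import numpy as np
import pandas as pd

# hypothetical frame whose index has gaps, as happens after filtering rows
toy = pd.DataFrame({"cold_price": [500.0, 800.0, 1200.0]}, index=[10, 20, 30])
pred = np.array([650.0, 980.0, 1400.0])  # stand-in for model.predict(...)

no_index = pd.DataFrame(pred, columns=["warm_price2"])  # default index 0, 1, 2
with_index = pd.DataFrame(pred, columns=["warm_price2"], index=toy.index)

# the inner join keeps only the index labels present in both frames
lost = pd.concat([toy, no_index], axis=1, join="inner")
kept = pd.concat([toy, with_index], axis=1, join="inner")
print(len(lost), len(kept))
```

With partially overlapping labels the result is worse than an empty frame, because predictions are silently attached to the wrong listings.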


In [None]:
t["diff"] = t.warm_price2 - t.cold_price  # / df.property_area


In [None]:
pd.set_option("display.max_columns", None)  # display all columns
t[(t["diff"] < 0) & (t.warm_price.isna())]


In [None]:
dff[dff.cold_price.notna() & dff.warm_price.notna()]["energy_eff"].unique()


In [None]:
df["add_costs"] = df.warm_price - df.cold_price


In [None]:
check_na(df)
