# Real Estate Rental Market in Berlin. Parsing. Cleaning. Analyzing. 

I was inspired by the original ideas and useful approaches of [Dmitrii Eliuseev](https://towardsdatascience.com/housing-rental-market-in-germany-exploratory-data-analysis-with-python-3975428d07d2).

This notebook experiments with approaches that I found useful and interesting; they originate in the TDS article 'Housing Rental Market in Germany: Exploratory Data Analysis with Python'. 

I will try to find trends and insights in the data collected from https://www.immobilienscout24.de, one of the largest online residential rental aggregators in Germany.  

The main stages of the work:  

* Ask: define the goals of the research
* Prepare: parse the site and collect the data
* Process: clean and transform the data, engineer features
* Analyze: analyze the data and build a simple regression model for predicting prices
* Share: prepare some visualizations

Loading the environment.  
Uncomment the `pip install` lines if these libraries are not installed on your system. 

In [52]:
import os
import pandas as pd
import numpy as np
#pip install selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import plotly.express as px

import json
import re #regular expression
# pip install googletrans==4.0.0-rc1
from googletrans import Translator

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


Defining some variables to configure the process.

In [53]:
to_parce, to_translate = False, False
base_url = "https://www.immobilienscout24.de"
path_to_csv = "/Users/velo1/SynologyDrive/GIT_syno/data/immobilienscout24.de/"
cols =  'property_id, title, logging_date, property_area, num_rooms, num_bedrooms, num_bathrooms, criteria, garage, floor, floors_in_building, constr_year, energy_eff, extra_costs, heat_costs, price_cold_eur, price_warm_eur, deposit_eur, property_type, publisher, contact, city, address, description, region, zip, link'.split(', ')

pd.set_option('display.max_colwidth', 100) # to display full text in columns
pd.set_option('display.max_columns', None) # display all columns

## Ask

What are the most popular residential rental objects in Berlin?  
What are the main factors that define the rental price?  
Are there any trends and hidden patterns?

## Prepare

|instance|used for storing|
|:---|:---|
|base_url|https://www.immobilienscout24.de|
|to_parce, to_translate|boolean flags that switch on parsing the site and translating some fields to English|
|||
|Berlin_housing.csv|raw data with basic processing|
|Berlin_housing_eng.csv|partially processed and translated data|
|||
|df_raw|input data with basic processing|
|df|cleaned data|
|df_r|data ready for the regression model|
|X|processed training set|
|y (Series)|target labels|


### Data collecting. Parsing 

For this particular notebook I use [immobilienscout24.de](https://www.immobilienscout24.de), one of the most popular sites on the local German market.  



#### Parsing with the `Requests` library.

In [54]:
import requests

base_url = "https://www.immobilienscout24.de"
url_berlin = base_url + "/Suche/de/berlin/berlin/wohnung-mieten" 
print(requests.get(url_berlin))

<Response [403]>


The server returns `<Response [403]>`.  
It seems that the site rejects GET requests that do not identify a User-Agent.  
This approach does not work for this particular site, but it was worth a try.
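
If the 403 really is caused by the missing User-Agent, one thing to try is sending the header explicitly. The sketch below only builds the request and does not send it. It uses the standard-library `urllib.request` to stay dependency-free, and the User-Agent string is a made-up placeholder. Whether immobilienscout24.de accepts such a request was not tested in this notebook, and its bot protection may still reject it.

```python
import urllib.request

# Assumption: a browser-like User-Agent string; this value is only a placeholder.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"
url_berlin = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten"

# Build (but do not send) a GET request that identifies a User-Agent.
request = urllib.request.Request(url_berlin, headers={"User-Agent": USER_AGENT})

# urllib stores header names in capitalized form, i.e. "User-agent".
print(request.get_header("User-agent"))
```

The same dictionary can be passed to `requests.get(url_berlin, headers={...})`.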


Let's try Selenium, which takes control of a Chrome browser and emulates a real user browsing. 

#### Parsing with the `Selenium` Python library 
Selenium allows using a real Chrome browser to retrieve the data and automate reading pages.  
Parsing this site was a real challenge for me.  
I blocked image loading, experimented with the delay time and eventually got results.  
Here are some functions to control the parsing process:

In [55]:
# contact = soup.find_all("dd", "is24qa-nebenkosten")
# if len(contact) > 0:
#     str_contact = contact[0].get_text().replace('+','').replace('€','').strip()

# str_contact

In [56]:
str_contact=""

In [57]:
def page_has_loaded(driver: webdriver.Chrome): 
    """ Check if the page is ready """
    page_state = driver.execute_script('return document.readyState;') 
    return page_state == 'complete'


def page_get(url: str, driver: webdriver.Chrome, delay_sec: int):
    """ Get the page content """
    driver.get(url)                     # load the page
    time.sleep(delay_sec)               # wait for the page to load
    while not page_has_loaded(driver):  # wait until the page is loaded (page_state == 'complete')
        time.sleep(0.1)
    return driver.page_source           # return the page content

def get_links(html: str, pp= 0):   
    ''' Retrieve the links to the subpages from the search results page '''
    soup = BeautifulSoup(html, "lxml")          # parse the html using beautiful soup and store in variable `soup`
    li = soup.find(id="resultListItems")        # where the sublinks are stored
    links_all = []                              # list of links

    children = li.find_all("li", {"class": "result-list__listing"}) # this instance stores the links to the subpages
    for child in children:
        for link in child.find_all("a"):
            if 'data-go-to-expose-id' in link.attrs:                # check if the link has the required attributes
                links_all.append(base_url + link['href'])
                break


    print(f'Got {len(links_all)} links, page:{pp} ')# print the number of links found and the page number
                                                    # in case of an error, the page number can be used to restart the parsing loop
                                                    # from the last page that was successfully parsed
    os.system(f'say Got {len(links_all)} links, page:{pp} ')
    return links_all

def get_attributes(soup, link = None):  
    """ 
    Get the attributes of the property from the soup object
    """

    # initialize the empty variables
    str_property_id, str_logging_date, str_property_area, str_num_rooms, str_num_bedrooms, str_num_bathrooms, str_criteria, str_garage, str_floor, str_floors_in_building, str_year, str_energy_efficiency, str_extra_costs, str_energy_costs, str_price_cold_eur, str_price_warm_eur, str_deposit_eur, str_property_type, str_publisher, str_contact, str_city, str_title, str_address, str_desciption, str_region, str_zip = \
    ('',)*26

    # get the attributes from the soup object
    property_id = soup.find_all("div", "is24-scoutid__content") 
    if len(property_id) > 0:
        str_property_id = property_id[0].get_text().strip().split("Scout-ID: ")[1]

    logging_date = soup.find_all("dd", "is24qa-bezugsfrei-ab grid-item three-fifths")
    if len(logging_date) > 0:
        str_logging_date = logging_date[0].get_text().strip()
        
    property_area = soup.find_all("div", "is24qa-flaeche-main is24-value font-semibold")
    if len(property_area) > 0:
        str_property_area = property_area[0].get_text().strip()
        str_property_area = str_property_area.replace("m²", "").replace(".", "").strip()

    num_rooms = soup.find_all('dd', "is24qa-zimmer")
    if len(num_rooms) > 0:
        str_num_rooms = num_rooms[0].get_text().strip()
    
    num_bedrooms = soup.find_all("dd", "is24qa-schlafzimmer")
    if len(num_bedrooms) > 0:
        str_num_bedrooms = num_bedrooms[0].get_text().strip()
    
    num_bathrooms = soup.find_all("dd", "is24qa-badezimmer")
    if len(num_bathrooms) > 0:
        str_num_bathrooms = num_bathrooms[0].get_text().strip()

    criteria = soup.find_all("div", "criteriagroup boolean-listing padding-top-l")
    if len(criteria) > 0:
        str_criteria = criteria[0].get_text().replace('\n',' ').strip()
    
    garage = soup.find_all("dd", "is24qa-garage-stellplatz")
    if len(garage) > 0:
        str_garage = garage[0].get_text().strip()

    floor = soup.find_all("dd", "is24qa-etage")
    if len(floor) > 0:              # check if the floor is available
        temp_floor = floor[0].get_text().strip().split("von")
        str_floor = temp_floor[0].strip()
        if len(temp_floor) > 1:     # check if the number of floors is available
            str_floors_in_building = temp_floor[1].strip()

    year =soup.find_all("dd", "is24qa-baujahr")
    if len(year) > 0:
        str_year = year[0].get_text().strip()

    energy_efficiency = soup.find_all("dd", "is24qa-energieeffizienzklasse")
    if len(energy_efficiency) > 0:
        str_energy_efficiency = energy_efficiency[0].get_text().strip()

    energy_costs = soup.find_all("dd", "is24qa-heizkosten grid-item three-fifths")
    if len(energy_costs) > 0:
        str_energy_costs = energy_costs[0].get_text().replace('+','').replace('€','').strip()

    extra_costs = soup.find_all("dd", "is24qa-nebenkosten")
    if len(extra_costs) > 0:
        str_extra_costs = extra_costs[0].get_text().replace('+','').replace('€','').strip()        
    

    price_cold_eur = soup.find_all("div", "is24qa-kaltmiete-main")
    if len(price_cold_eur) > 0:
        str_price_cold_eur= price_cold_eur[0].get_text().strip()

        # Your locale may differ from the one used by immobilienscout24.
        # In that case you should slightly adjust the regex patterns used here.

        # Site locale    `2.000,00 €`,  my system locale  `2000.00`        
        str_price_cold_eur = re.search(r'(\d+[\.]?\d+[\,]?\d+)', str_price_cold_eur).group(0).replace(".", "").strip()

    price_warm_eur = soup.find_all("div", "is24qa-warmmiete-main")
    if len(price_warm_eur) > 0:
        str_price_warm_eur = price_warm_eur[0].get_text().strip()  
                                    # r'(\d+[\.|\,]?\d+[\,|\.]?\d+)'
        str_price_warm_eur = re.search(r'(\d+\.?\d*\,?\d+)', str_price_warm_eur).group(0).replace(".", "").strip()

    deposit_eur = soup.find_all("div", "is24qa-kaution-o-genossenschaftsanteile")
    if len(deposit_eur) > 0:
        str_deposit_eur = deposit_eur[0].get_text().strip()

    property_type = soup.find_all("dd", "is24qa-typ grid-item three-fifths")
    if len(property_type) > 0:
        str_property_type = property_type[0].get_text().strip()
    
    publisher = soup.find_all(attrs={"data-qa": "company-name"})  #  , "companyName"
    if len(publisher) > 0:
        str_publisher = publisher[0].get_text().strip()
    else:
        item = soup.find("div", {"class": "brandLogoPrivate_dnns4"})
        if item is not None:
            str_publisher = "Private"      

    contact = soup.find_all(attrs={"data-qa": "contactName"})
    if len(contact) > 0:
        str_contact = contact[0].get_text()


    title = soup.find_all("h1", id="expose-title")
    if len(title) > 0:
        str_title = title[0].get_text().strip()

    desciption = soup.find_all("pre", "is24qa-objektbeschreibung")
    if len(desciption) > 0:
        str_desciption = desciption[0].get_text().replace('\n', ' ').replace(';', ',').strip()

    region = soup.find_all("span", "zip-region-and-country")
    if len(region) > 0:
        str_region = region[0].get_text().strip().split(",")[0].strip()
        str_city = region[0].get_text().strip().split(",")[1].strip()
        str_city = str_city.split(" ")
        str_zip, str_city = str_city[0], str_city[1]

    address = soup.find_all("span", "block font-nowrap print-hide")
    if len(address) > 0:
        str_address = address[0].get_text().strip()

    return [str_property_id, str_title, str_logging_date, str_property_area, str_num_rooms, str_num_bedrooms, str_num_bathrooms, str_criteria, str_garage, str_floor, str_floors_in_building, str_year, str_energy_efficiency, str_extra_costs, str_energy_costs, str_price_cold_eur, str_price_warm_eur, str_deposit_eur, str_property_type, str_publisher, str_contact, str_city,  str_address, str_desciption, str_region, str_zip, link]
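
`page_get` above polls `document.readyState` in a loop that has no upper bound, so a page that never reaches `complete` would block the run. A minimal sketch of a bounded polling helper follows. The name `wait_until` and its default values are assumptions for illustration, and the lambda at the end merely stands in for `lambda: page_has_loaded(driver)`.

```python
import time

def wait_until(predicate, timeout_sec=10.0, poll_sec=0.1):
    """Poll `predicate` until it returns True or `timeout_sec` elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_sec)
    return predicate()  # one last check at the deadline

# Stand-in for the real readiness check: becomes ready on the third poll.
checks = iter([False, False, True])
print(wait_until(lambda: next(checks), timeout_sec=2.0, poll_sec=0.01))
```

The caller can then decide what to do on `False`, for example skip the listing or retry the page.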


The following chunk of code was intended to block image loading automatically and thus speed up parsing.  
However, the site has sophisticated anti-bot checks that require images in order to pass a test, so this approach did not work.  
Image loading still has to be turned off manually.

In [58]:
# Block images via ChromeOptions object
# chrome_options = webdriver.ChromeOptions()
# prefs = {"profile.managed_default_content_settings.images": 2}
# chrome_options.add_experimental_option("prefs", prefs)

##### The Parsing.

In [59]:
if to_parce:
    
    # To continue scraping after an error, 
    # set the start_page to the page you want to start scraping
    start_page = int(input(f'What page in search pages do you want to start scraping?'))
    depth = int(input(f'How many pages do you want to scrape?'))

    chrome_options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(options= chrome_options)
    print("Driver is ready. \nYou have 60s to DISABLE images loading ...\nPrivacy and Security -> Site Settings -> Images -> Don't allow site to show images\n")

 
    cnt = 0
    for pp in range(start_page, start_page + depth):  # scrape exactly `depth` pages

        if pp == 1:

            # open the file in write mode and write the header row with the column names (OVERWRITES THE FILE)
            with open(path_to_csv + 'Berlin_housing_p2.csv', 'w') as f:  # write header row
                f.write(";".join(cols)+'\n')

        if cnt == 0:        # first page
            delay_sec = 60  # wait 60 sec to have time to login , accept cookies and block images from loading
            cnt += 1

        else:
            delay_sec = np.random.random()*0.5 # wait random time to avoid bot detection

        if pp == 1: # first page
            url_page = base_url + "/Suche/de/berlin/berlin/wohnung-mieten"

        else:       # other pages
            url_page = base_url +  "/Suche/de/berlin/berlin/wohnung-mieten?pagenumber=" + str(pp)


        html = page_get(url_page, driver, delay_sec= delay_sec) # go to search page
        links_all = get_links(html, pp)                         # get links from search page

        for link in links_all:                                  # go to each link

            s_html = page_get(link, driver, delay_sec= np.random.random()*0.5)
            soup = BeautifulSoup(s_html, "lxml")
            row = get_attributes(soup, link)                          # get attributes from each link
            with open(path_to_csv + 'Berlin_housing_p2.csv', 'a') as f:  # write to csv file
                f.write(";".join(row)+'\n')
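
The rows above are joined by hand with `;`, which only works while no field contains that character (this is why `get_attributes` replaces `;` in the description). As a hedged alternative, the standard `csv` module can quote such fields instead. The helper `write_row` and the sample title are made up for illustration, and a `StringIO` buffer stands in for the opened file.

```python
import csv
import io

def write_row(handle, row):
    """Append one record using ';' as the separator, quoting fields when needed."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    writer.writerow(row)  # csv.writer writes None as an empty field

buffer = io.StringIO()  # stands in for open(path, 'a', newline='')
write_row(buffer, ["141131393", "Bright flat; with balcony", None])
print(buffer.getvalue())
```

Files written this way should still load with `pd.read_csv(..., sep=';')`, since the default quote character is `"`.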

##### The results of parsing the site are stored in 'Berlin_housing.csv'

In [60]:
# df_raw.to_csv(path_to_csv + 'Berlin_housing.csv', sep=';', index=False)

##### Loading the data we've already parsed.  
This is useful if you have finished parsing and want to continue with the next stages of the research later.

In [61]:
df_raw = pd.read_csv(path_to_csv + 'Berlin_housing_p2.csv',  names= cols, header=0,  sep=';', on_bad_lines='skip') 
df_raw.head()

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link
0,141131393,Nassauische Straße! Helle 6-Zimmer-Altbau-Wohnung mit Balkon im 1. Obergeschoss,sofort bzw. nach Vereinbarung,2205,7,3.0,,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Personenaufzug Personenaufzug Einbauküche Einbau...,,1,5.0,1900.0,,800,in Nebenkosten enthalten,3500,4300,3 Nettokaltmieten,Etagenwohnung,Kupsch Wohnimmobilien GmbH,Frau Sabine Woide Immobilien,Berlin,,"Berlin- Wilmersdorf Wohnquartier Güntzelkiez (Trautenaustraße, Hohenzollernplatz, Nassauische St...",Wilmersdorf,10717,https://www.immobilienscout24.de/expose/141131393
1,141131071,Tauschwohnung: Schöne 2-Zi im Gräfekiez gegen 3-4 Zi. (kreuzb/neuk),,60,2,,,Einbauküche Einbauküche,,3,,,,170,keine Angabe,410,580,,,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,"Ruhige und schöne Wohnung im Gräfekiez. Ideal für Paare, weil eines der beiden Zimmer ein Durchg...",Kreuzberg,10967,https://www.immobilienscout24.de/expose/141131071
2,141159056,"Tauschwohnung: Schöne 2-Zi Whg in PB, 3-4 Zi-Whg in PB, MI, KR, FH gesucht",,54,2,,,Keller Keller,,1,,,,127,keine Angabe,456,583,,Etagenwohnung,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,"Hallo, unsere kleine Familie (2 Erwachsene und 1 Kind) wohnt in einer schönen 2-Zimmer-Wohnung i...",Prenzlauer Berg,10407,https://www.immobilienscout24.de/expose/141159056
3,141132344,Tauschwohnung: Gemütliche 2 Zimmer Wohnung im Samariterkiez,,60,2,,,Keller Keller,,3,,,,150,keine Angabe,500,650,,Etagenwohnung,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,"Hallo zusammen, mein Freund und ich sind auf der Suche nach einer etwas größeren Wohnung ebenfa...",Friedrichshain,10247,https://www.immobilienscout24.de/expose/141132344
4,140910270,Tauschwohnung: Suche 2-3 Zi. gegen 2-Zi.-Maisonette in Friedrichshain,,64,2,,,Keller Keller Einbauküche Einbauküche,,4,,,,180,keine Angabe,435,615,,Maisonette,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,Hallo! Wir sind Jonah und Benita. Wir haben Lust auf einen Tapetenwechsel und suchen deshalb: - ...,Friedrichshain,10247,https://www.immobilienscout24.de/expose/140910270


### Basic data cleaning  
#### Duplicates

In [62]:
df_raw.duplicated().sum()#.any()

294

In [63]:
df_raw.drop_duplicates(inplace=True)

#### NaN values
Some NaNs are dropped right away;  
others may be dropped or filled later, taking the context into account.

In [64]:
def check_na(df):
  '''
  Check for missing values in a dataframe
  df - dataframe
  '''
  for col in df.columns:
    print(f'{col.ljust(20)} {df[col].isna().sum():<8}{df[col].isna().sum()/df.shape[0]:>6.2%}  {str(df[col].dtype).ljust(10)}')

In [65]:
check_na(df_raw)

property_id          57       1.32%  object    
title                57       1.32%  object    
logging_date         2986    69.09%  object    
property_area        81       1.87%  object    
num_rooms            75       1.74%  object    
num_bedrooms         3233    74.80%  object    
num_bathrooms        3099    71.70%  object    
criteria             1913    44.26%  object    
garage               3844    88.94%  object    
floor                2220    51.37%  object    
floors_in_building   3201    74.06%  object    
constr_year          3139    72.63%  object    
energy_eff           3773    87.30%  object    
extra_costs          153      3.54%  object    
heat_costs           153      3.54%  object    
price_cold_eur       246      5.69%  object    
price_warm_eur       859     19.88%  object    
deposit_eur          1688    39.06%  object    
property_type        2985    69.07%  object    
publisher            246      5.69%  object    
contact              246      5.69%  obj

Now let's drop rows without essential attributes such as  
`property_id`, `num_rooms` or `link`.   
The absence of this information is the result of parsing errors.

In [66]:
ind = df_raw[df_raw['property_id'].isna() | df_raw['num_rooms'].isna() | df_raw.link.isna()].index
df_raw.drop(ind, inplace=True)

### Copying partially cleaned data to a new instance.
The most obvious preparations have been done.  
Now we are copying the data to a new instance for processing.

In [67]:
df = df_raw.copy()

## Process


### Let's translate some attributes to English.

In [68]:
# service code demonstrating how to translate German words to English
translator = Translator()
translator.translate("kurzfristig", dest='en', src='german').text
# df.title = df.title.replace({'Wohnungstausch':'Apartment'}, regex=True)

'short term'

#### garage

In [69]:
df.garage.unique()

array([nan, '1 Tiefgaragen-Stellplatz', 'Tiefgaragen-Stellplatz',
       '98 Tiefgaragen-Stellplätze', '1 Außenstellplatz',
       'Außenstellplatz', '1 Stellplatz', '1 Duplex-Stellplatz',
       'Parkhaus-Stellplatz', '3 Tiefgaragen-Stellplätze', '1 Garage',
       'Garage', '2 Tiefgaragen-Stellplätze', '2 Außenstellplätze',
       '2 Stellplätze', '1 Carport', '16 Tiefgaragen-Stellplätze',
       '4 Außenstellplätze', '2 Garagen'], dtype=object)

In [70]:
dict_ = {'Außenstellplatz':'Outdoor parking space', 'Tiefgaragen-Stellplatz':'Underground parking space',
'Tiefgaragen-Stellplätze':'Underground parking spaces', 'Tiefgarage':'Underground garage', 'Außenstellplätze':'Outdoor parking spaces',
'Garage':'garage', 'Stellplatz':'parking space','Parkhaus-Stellplatz':'Parking garage parking space',
'garagen':'garages', 'Parkhaus':'Parking garage','Stellplätze':'parking spaces', 'Garagen':'garages',
'Carport':'Carport', 'Duplex-Stellplatz':'Duplex parking space', 'Parkplatz':'Parking space'}

df.garage = df.garage.replace(dict_, regex=True)
df.garage.unique()

array([nan, '1 Underground parking space', 'Underground parking space',
       '98 Underground parking spaces', '1 Outdoor parking space',
       'Outdoor parking space', '1 parking space',
       '1 Duplex-parking space', 'Parking garage-parking space',
       '3 Underground parking spaces', '1 garage', 'garage',
       '2 Underground parking spaces', '2 Outdoor parking spaces',
       '2 parking spaces', '1 Carport', '16 Underground parking spaces',
       '4 Outdoor parking spaces', '2 garagen'], dtype=object)
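
The output above still contains mixed strings such as `'1 Duplex-parking space'` and `'2 garagen'`, because shorter keys like `Stellplatz` and `Garage` also match inside longer words. One possible way to avoid this, sketched here with plain `re` rather than `Series.replace`, is to compile all keys into a single alternation ordered longest first, so each word is matched once by its most specific key. The helper name `translate_terms` and the reduced dictionary are illustrative assumptions, not part of the notebook.

```python
import re

# Reduced, illustrative subset of the garage dictionary.
terms = {
    "Tiefgaragen-Stellplätze": "Underground parking spaces",
    "Tiefgaragen-Stellplatz": "Underground parking space",
    "Duplex-Stellplatz": "Duplex parking space",
    "Stellplatz": "parking space",
    "Garagen": "garages",
    "Garage": "garage",
}

# Longest keys first, so "Garagen" is tried before "Garage".
pattern = re.compile("|".join(re.escape(key) for key in sorted(terms, key=len, reverse=True)))

def translate_terms(text: str) -> str:
    """Replace every known German term in `text` with its English counterpart."""
    return pattern.sub(lambda match: terms[match.group(0)], text)

for sample in ["2 Garagen", "1 Duplex-Stellplatz", "98 Tiefgaragen-Stellplätze"]:
    print(translate_terms(sample))
```

On a column this could be applied with `df.garage.map(translate_terms, na_action='ignore')`, which leaves missing values untouched.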

#### property_type

In [71]:
df.property_type.unique()

array(['Etagenwohnung', nan, 'Maisonette', 'Erdgeschosswohnung',
       'Dachgeschoss', 'Penthouse', 'Souterrain', 'Terrassenwohnung',
       'Loft', 'Hochparterre', 'Sonstige'], dtype=object)

In [72]:
dict_ = {'Dachgeschoss':'Attic', 'Erdgeschosswohnung':'Ground floor apartment',
'Hochparterre':'High parterre', 'Etagenwohnung':'Flat', 'Souterrain':'Basement',
'Terrassenwohnung':'Terrace apartment', 'Sonstige':'Other', 'Maisonette':'Small house',}

df.property_type = df.property_type.replace(dict_, regex=True)
df.property_type.unique()

array(['Flat', nan, 'Small house', 'Ground floor apartment', 'Attic',
       'Penthouse', 'Basement', 'Terrace apartment', 'Loft',
       'High parterre', 'Other'], dtype=object)

#### logging_date

In [73]:
dict_ = {'nach Absprache':'according to the arrangement', 'sofort':'immediately','Sofort':'Immediately','verfügbar':'accessible',
'Mietbeginn':'Start of rental','Nach Vereinbarung':'By appointment','bzw.': 'or','nach':'after','Fertigstellung':'completion',
'bezugsfrei':'free of charge','Vereinbarung':'agreement','ab':'from','bis':'to','ab sofort':'immediately',
'voraussichtlich':'probably','Voraussichtlich':'Probably',
'Sommer':'summer','Winter':'winter','Frühjahr':'spring','Herbst':'autumn','Ende':'end','Anfang':'beginning',
'mitte': 'middle', 'Mitte':'Middle','kurzfristig':'short term','Kurzfristig':'Short term',}
df.logging_date = df.logging_date.replace(dict_, regex=True)

#### constr_year

In [74]:
df.constr_year = df.constr_year.replace({'unbekannt': '0'}, regex=True)
# df.constr_year = df.constr_year.replace({'0': 'unbekannt'})
# df.constr_year = df.constr_year.replace({'nan': np.nan})

In [75]:
df.constr_year.unique()

array(['1900', nan, '2019', '2022', '1983', '2023', '1980', '2021',
       '1910', '1930', '1896', '1977', '2012', '1023', '1992', '1890',
       '1984', '2016', '1936', '2014', '1997', '0', '1969', '1987',
       '2015', '2006', '1913', '1938', '1892', '2020', '1920', '1926',
       '1989', '2000', '2018', '1895', '1894', '1999', '2017', '1952',
       '1907', '1915', '2011', '1991', '1966', '1888', '2005', '1912',
       '1968', '1990', '1911', '1982', '1935', '1958', '1906', '2001',
       '1956', '1902', '2004', '1905', '1996', '2007', '2013', '1908',
       '2003', '1998', '1964', '1986', '1960', '1965', '1955', '1976',
       '1959', '1953', '1918', '1970', '1909', '1954', '1978', '1993',
       '1973', '1889', '1929', '1995', '1963', '1901', '1903', '1922',
       '2009', '1914', '1862', '2090', '1994', '1979', '1972', '1891',
       '1988', '1950', '1925', '1852', '1904', '1967', '1961', '1975',
       '1860', '1974', '1928', '1940', '1898', '1750'], dtype=object)

In [76]:
# df.constr_year = df.constr_year.astype(str)

In [77]:
# df.constr_year = df.constr_year.replace({r'\.0': ''}, regex=True)

#### title
Here we face a challenge, as there are no API keys for batch translation.  
The following process is executed row by row with online requests to Google.  
There were many timeouts and other issues, so I divided the translation into chunks.  

In [78]:
def translate_col(df, columns, chunk_size=300, start_chunk_num=1):
  '''
  Translate column in dataframe
  df - dataframe
  columns - list of columns to translate
  chunk_size - number of rows to translate at once
  start_chunk_num - number of chunk to start from
  '''

  error_chunk = 0

  for ch in range(start_chunk_num, df.shape[0]//chunk_size + 2):
    print(f'Chunk {ch} of {df.shape[0]//chunk_size }')
    os.system(f'say Chunk {str(ch)} started.')

    ind1 = ch * chunk_size - chunk_size
    ind2 = ch * chunk_size if ch * chunk_size < df.shape[0] else df.shape[0]

    print(f'ind1 {ind1}, ind2 {ind2}', end=' ')

    for col in columns:
      try:
        df.loc[ind1:ind2, col] = df.loc[ind1:ind2, col].apply(lambda x: translator.translate(x, dest='en', src='auto').text)
      except Exception:  # e.g. timeouts or rate limiting; KeyboardInterrupt still propagates
        print(f'Error in column {col} at index {ind1} - {ind2}')
        os.system(f'say Error in column {col} at index {ind1} - {ind2}')
        error_chunk = ch 
        return error_chunk  # error
      
      time.sleep(1) 
      print(translator.translate("Everything's under control", dest='german', src='auto').text +':' , end=' '   )
      print(f'Column: {col} translated.', end=' ')
      os.system(f'say Column {col} translated')
      

    time.sleep(16) 
    print()
  return  0   # no error
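
A note on the slicing in `translate_col`: `df.loc[ind1:ind2]` selects by index label and includes both endpoints, so neighbouring chunks share a boundary row, and once rows have been dropped the labels no longer equal row positions. A minimal sketch of position-based chunk boundaries follows. The helper `chunk_bounds` is hypothetical and the pairs it yields are meant for half-open slicing such as `df.iloc[start:stop]`.

```python
def chunk_bounds(n_rows: int, chunk_size: int):
    """Yield half-open (start, stop) row positions that cover n_rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# 250 rows in chunks of 100: the last chunk is shorter and no row is repeated.
print(list(chunk_bounds(250, 100)))
```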

`The next chunk of code may run for a long time.`  

The loop repeats until all chunks have been translated without errors.  
You can skip this stage and load the intermediate results.

In [79]:
if to_translate:
  error_chunk = 1
  # this loop continues until all chunks are translated without errors
  while True:
    error_chunk = translate_col(df, ['title'], chunk_size=100, start_chunk_num= error_chunk)
    if error_chunk == 0:  # 0 - no error; a value > 0 is the chunk to restart from
      break

  os.system('say "Beer time"')

### Saving the intermediate results of translation from German.

In [80]:
# df.to_csv(path_to_csv + 'Berlin_housing_parteng.csv', sep=';', index=False)

### Loading intermediate results in which some columns have already been translated.

In [81]:
df= pd.read_csv(path_to_csv + 'Berlin_housing_parteng.csv',   header=0,  sep=';')
df.head(2)

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link
0,141131393,Nassauische Straße! Helle 6-Zimmer-Altbau-Wohnung mit Balkon im 1. Obergeschoss,immediately or after agreement,2205,7,3.0,,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Personenaufzug Personenaufzug Einbauküche Einbau...,,1.0,5.0,1900.0,,800,in Nebenkosten enthalten,3500,4300,3 Nettokaltmieten,Flat,Kupsch Wohnimmobilien GmbH,Frau Sabine Woide Immobilien,Berlin,,"Berlin- Wilmersdorf Wohnquartier Güntzelkiez (Trautenaustraße, Hohenzollernplatz, Nassauische St...",Wilmersdorf,10717,https://www.immobilienscout24.de/expose/141131393
1,141131071,Tauschwohnung: Schöne 2-Zi im Gräfekiez gegen 3-4 Zi. (kreuzb/neuk),,60,2,,,Einbauküche Einbauküche,,3.0,,,,170,keine Angabe,410,580,,,Tauschwohnung GmbH,Tauschwohnung Wohnungstausch,Berlin,,"Ruhige und schöne Wohnung im Gräfekiez. Ideal für Paare, weil eines der beiden Zimmer ein Durchg...",Kreuzberg,10967,https://www.immobilienscout24.de/expose/141131071


### NaN values

At this stage we'll fill some features based on the context. 

#### `logging_date`  
is not a necessary parameter.  
Some rows contain the date from which the property is available, others contain additional notes.  
I replace NaNs here with `""`.

In [82]:
df.logging_date.fillna("", inplace=True)

#### `garage`, `energy_eff` and others
The same applies to these columns.

In [83]:
df.logging_date.fillna('Unknown', inplace=True)
df.garage.fillna("No garage", inplace=True)
df.energy_eff.fillna("Unknown", inplace=True)
df.floor.fillna(0, inplace=True)
df.floors_in_building.fillna(0, inplace=True)
df.property_type.fillna("Unknown", inplace=True)
df.publisher.fillna("Private", inplace=True)  # Private or Agency
df.address.fillna("Unknown", inplace=True)
df.region.fillna("Unknown", inplace=True)
df.zip.fillna("Unknown", inplace=True)
df.constr_year.fillna('0', inplace=True)
df.deposit_eur.fillna("0", inplace=True) # 0 means the owner does not require a deposit, although a deposit of 3 months' rent is standard in Germany

### Data types  
Defining the proper data types.

In [84]:
check_na(df)

property_id          0        0.00%  int64     
title                0        0.00%  object    
logging_date         0        0.00%  object    
property_area        0        0.00%  object    
num_rooms            0        0.00%  object    
num_bedrooms         3140    77.09%  float64   
num_bathrooms        3016    74.05%  float64   
criteria             1846    45.32%  object    
garage               0        0.00%  object    
floor                0        0.00%  float64   
floors_in_building   0        0.00%  float64   
constr_year          0        0.00%  object    
energy_eff           0        0.00%  object    
extra_costs          0        0.00%  object    
heat_costs           0        0.00%  object    
price_cold_eur       0        0.00%  object    
price_warm_eur       613     15.05%  object    
deposit_eur          0        0.00%  object    
property_type        0        0.00%  object    
publisher            0        0.00%  object    
contact              0        0.00%  obj

#### num_rooms


In [85]:
df.num_rooms.unique()

array(['7', '2', '1', '4', '3', '1,5', '3,5', '5', '2,5', '8', '5,5', '6',
       '4,5', '7,5', '11'], dtype=object)

Values such as 1,5, 4,5, 5,5 or 7,5 rooms are not a mistake.  
These are the numbers given in real advertisements.

In [86]:
# replace comma with dot,
# convert to float type as some values are float (e.g. 4.5)
# if df.num_rooms.dtype == 'object':
df.num_rooms = df.num_rooms.str.replace(',', '.').astype('float16') 
df.num_rooms.unique()

array([ 7. ,  2. ,  1. ,  4. ,  3. ,  1.5,  3.5,  5. ,  2.5,  8. ,  5.5,
        6. ,  4.5,  7.5, 11. ], dtype=float16)

#### property_area

In [87]:
# if df.num_rooms.dtype == 'object':
df.property_area = df.property_area.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype('float16')
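
The pattern `'.'` deserves a note: as a regular expression, an unescaped dot matches any character, and whether pandas treats a single-character pattern literally when `regex=True` depends on the pandas version. The sketch below uses only `re` and `str.replace` on a made-up value to show the difference between the two interpretations.

```python
import re

raw_area = "1.234,5"  # made-up value in German notation

# Interpreted as a regular expression, "." matches every character.
as_regex = re.sub(".", "", raw_area)

# Interpreted literally, only the thousands separator is removed.
as_literal = raw_area.replace(".", "")

# Full conversion: drop the thousands separator, then use "." as the decimal mark.
converted = float(raw_area.replace(".", "").replace(",", "."))

print(repr(as_regex), repr(as_literal), converted)
```

Passing `regex=False` (or escaping the dot as `r'\.'`) makes the literal interpretation explicit.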

#### price_cold_eur

Prices use a different locale: the site writes `2.000,00`, while Python expects `2000.00`.

In [88]:
# remove the dot (thousands separator), replace the comma with a dot, convert to float type
# if df.num_rooms.dtype == 'object':
df.price_cold_eur = df.price_cold_eur.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype('float32')

#### price_warm_eur

In [89]:
# if df.num_rooms.dtype == 'object':
df.price_warm_eur = df.price_warm_eur.str.replace('.', '', regex=False).str.replace(',', '.', regex=False).astype('float32')

#### num_bedrooms

In [90]:
# if df.num_rooms.dtype == 'object':
df.num_bedrooms.fillna(0, inplace=True)
df.num_bedrooms = df.num_bedrooms.astype('int16')

#### num_bathrooms

In [91]:
df.num_bathrooms.unique()

array([nan,  1.,  2.,  3.,  4.,  0.])

In [92]:
df.num_bathrooms.fillna(0, inplace=True)
df.num_bathrooms = df.num_bathrooms.astype('int16')

#### floor, floors_in_building

In [93]:
df.floor.unique()

array([ 1.,  3.,  4.,  0.,  5.,  2.,  6.,  7., 11.,  8., 14.,  9., 16.,
       10., 13., 12.])

In [94]:
df.floor = df.floor.astype('int8')
df.floors_in_building = df.floors_in_building.astype('int16')

#### constr_year

In [95]:
df.constr_year = df.constr_year.astype('int16')

#### heat costs, extra costs

In [167]:
df.heat_costs = df.heat_costs.astype('float32')
df.extra_costs = df.extra_costs.astype('float32')

ValueError: could not convert string to float: 'in Nebenkosten enthalten'
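
The error above comes from placeholder texts such as `in Nebenkosten enthalten` and `keine Angabe` that are mixed with German-formatted numbers in the cost columns. A minimal sketch of one possible way to handle this, assuming that placeholders should become NaN and that a dot is always a thousands separator; `parse_german_number` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

def parse_german_number(series: pd.Series) -> pd.Series:
    """Hypothetical helper: '1.545,35' -> 1545.35, free text -> NaN."""
    cleaned = (
        series.astype(str)
        .str.replace(".", "", regex=False)   # drop thousands separators
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    )
    # errors="coerce" turns everything that is not a number into NaN
    return pd.to_numeric(cleaned, errors="coerce").astype("float32")

# Illustrative values in the style of the extra_costs / heat_costs columns
sample = pd.Series(["1.545,35", "950", "keine Angabe", "in Nebenkosten enthalten"])
parsed = parse_german_number(sample)
print(parsed)
```

With such a helper, `df.heat_costs = parse_german_number(df.heat_costs)` would replace the failing `astype` call, at the price of losing the distinction between "not given" and "included in the extra costs".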

### Numeric features Outliers  
Now we are ready to make some visualizations of data distribution


In [97]:
# select all categorical columns
cat_col = df.drop(['link'], axis=1).select_dtypes(include=['object']).columns
# select all numeric columns
num_col = df.drop('property_id', axis=1).select_dtypes(include=['number']).columns

cat_col, num_col

(Index(['title', 'logging_date', 'criteria', 'garage', 'energy_eff',
        'extra_costs', 'heat_costs', 'deposit_eur', 'property_type',
        'publisher', 'contact', 'city', 'address', 'description', 'region',
        'zip'],
       dtype='object'),
 Index(['property_area', 'num_rooms', 'num_bedrooms', 'num_bathrooms', 'floor',
        'floors_in_building', 'constr_year', 'price_cold_eur',
        'price_warm_eur'],
       dtype='object'))

In [102]:
fig = px.box(df[num_col], notched=True,  boxmode="overlay",
             title='Outliers', height=700, color='variable')
# fig.update_yaxes(matches=None)
fig.update_xaxes(tickangle=20)
fig.update_yaxes(type="log")
fig.update_layout(xaxis_title="", yaxis_title="Value range (log scale)")

Prices definitely have outliers.  
Property area also needs to be checked. 

#### Cold price

In [103]:
px.box(df, x='price_cold_eur', height= 300)

We have an outlier in the dataset.  
Let's add a relative cold price column to explore the prices more intuitively.

In [104]:
df['cold_rel_price'] = df.price_cold_eur / df.property_area
df[df.cold_rel_price> 350]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price
2305,140100741,"Wilhelminenhofstraße, Berlin",,68.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,28000.0,28000.0,1000,Unknown,HousingAnywhere B.V.,,Berlin,"Wilhelminenhofstraße 0,","Dieses Apartment bietet die Privatsphäre einer eigenen Wohnung, aber den Service eines Hotels. E...",Oberschöneweide,12459,https://www.immobilienscout24.de/expose/140100741,411.764709
3575,114866641,"Stilvolle 1-Zimmer-Wohnung in Friedrichshain, Berlin",,1.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,in Nebenkosten enthalten,300000.0,300000.0,0,Attic,DevCom Deutschland,Herr Test-Vorname Test-Nachname,Berlin,Unknown,Zu der Wohnung zählt ein hübsches Zimmer.,Friedrichshain,10243,https://www.immobilienscout24.de/expose/114866641,300000.0


These ads are most likely a mistake as the prices are unreasonably high.  
300,000 for 1 sq.m and 28,000 for 68 sq.m  
Let's drop them.

In [105]:
df.drop(index = df[df.cold_rel_price> 350].index, inplace=True) 
fig = px.box(df[['cold_rel_price']], x='cold_rel_price', notched=True, title='Cold RELATIVE prices <br><sup>€ for sq.m per month (outliers removed)</sup>')
fig.update_layout(xaxis_title="€ for sq.m per month", yaxis_title="Value range")

These results seem more realistic, with a median of 22.2 EUR per sq.m per month.  
But prices over 100 EUR per sq.m per month still seem very high.  
Let's explore.

In [108]:
df[df.cold_rel_price > 100].sort_values(by='cold_rel_price', ascending=False).head(5)

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price
2994,138332662,"Luise-Henriette-Straße, Berlin",,19.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,3528.0,3528.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Luise-Henriette-Straße 0,",Our 19-23 sqm Suites for stays over 28 nights are the ideal choice if you are looking for a suit...,Tempelhof,12103,https://www.immobilienscout24.de/expose/138332662,185.684204
3288,138325908,"Englische Straße, Berlin",,30.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,5550.0,5550.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Englische Straße 0,",The essential in perfection. Clear design and maximum hospitality: comfort and cosiness can be f...,Charlottenburg,10587,https://www.immobilienscout24.de/expose/138325908,185.0
3521,138320888,"Winterfeldtstraße, Berlin",,30.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,5490.0,5490.0,1200,Unknown,HousingAnywhere B.V.,,Berlin,"Winterfeldtstraße 0,","Comfortable and elegantly furnished, our one bedroom apartments offer you all you need for your ...",Schöneberg,10781,https://www.immobilienscout24.de/expose/138320888,183.0
2717,138335791,"Konstanzer Straße, Berlin",,18.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,2700.0,2700.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Konstanzer Straße 0,",Vorteile: - Anmeldung möglich - Keine Vorauszahlung nötig - Keine Kaution benötigt - Wöchentlich...,Wilmersdorf,10707,https://www.immobilienscout24.de/expose/138335791,150.0
3065,138330006,"Müllerstraße, Berlin",,24.0,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,3528.0,3528.0,0,Unknown,HousingAnywhere B.V.,,Berlin,"Müllerstraße 0,",Our 24-27 sqm Suites M for stays over 28 nights have been furnished to our highest modern standa...,Wedding,13353,https://www.immobilienscout24.de/expose/138330006,147.0


Here we see very niche offers.
Small but very comfortable rooms with good furniture.  
For example:
`Our 19-23 sqm suites for stays over 28 nights are the ideal choice if you are looking for a suitable apartment for two and have therefore been furnished to our highest modern standards. The suites have a fully equipped kitchen, a comfortable box spring bed (1.60 m) with a modern smart TV and a private bathroom with a shower so you can feel at home. If there is dirty laundry, you have the opportunity to wash your clothes in the communal laundry room (opening hours: 6 a.m. to 10 p.m.). Your apartment offers everything you need for a longer stay with us in just one room.`  

It might be an alternative to staying at a hotel.  
But prices here are over 100 EUR per sq.m per month, which is very high.
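
The thresholds used above (350 and 100 EUR per sq.m) were picked by eye. As a cross-check, the same box-plot logic can be expressed in code with Tukey's IQR fences. This is a different, generic technique rather than the method used in this notebook; `iqr_fences` and the sample values are hypothetical:

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5):
    """Return the lower and upper Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up relative prices (EUR per sq.m) with one extreme value
rel_price = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 185.0])
low, high = iqr_fences(rel_price)

# Flag the values outside the fences instead of dropping them right away
flagged = rel_price[(rel_price < low) | (rel_price > high)]
print(low, high)
print(flagged)
```

Flagging first and dropping later keeps niche but genuine offers, like the serviced apartments above, available for inspection.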

In [112]:
fig = px.box(df[num_col].drop(['price_warm_eur','price_cold_eur','property_area', 'constr_year'], axis=1), 
             notched=True,  title='Outliers (continuation)',color='variable')
fig.update_yaxes(matches= None)
fig.update_layout(xaxis_title="", yaxis_title="Value range")

There is nothing suspicious here.
The value ranges look plausible.

Let's take a closer look at property sizes.

In [114]:
fig = px.box(df[['property_area']], x= 'property_area', notched=True, title='property_area')
fig.update_layout(xaxis_title="Property size (sq.meters)", yaxis_title="")

In [115]:
df[df.property_area>300]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price
305,140099183,Leben in der Residenz Monbijou - herrschaftliches Penthouse am Weltkulturerbe!,,375.0,8.0,4,4,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Personenaufzug Personenaufzug Einbauküche Einbau...,1 parking space,0,0,1906,D,1.724,in Nebenkosten enthalten,9850.0,11574.0,"29.403,00 EUR",Other,Engel & Völkers Berlin Mitte GmbH,Engel & Völkers Berlin Mitte,Berlin,"Monbijoustraße 3/5,","Dieses herrschaftliche Maisonett-Apartment verfügt über acht Zimmer, befindet sich im vierten Ob...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/140099183,26.266666
311,139142000,Berlin im Blick - einzigartiges Townhouse im Herzen Berlins!,from immediately,456.0,4.0,3,3,Balkon/ Terrasse Balkon/ Terrasse Personenaufzug Personenaufzug Einbauküche Einbauküche Gäste-WC...,1 garage,0,0,2012,Unknown,1.284,nicht in Nebenkosten enthalten,15000.0,16284.0,"45.000,00 EUR",Other,Engel & Völkers Berlin Mitte GmbH,Engel & Völkers Berlin Mitte,Berlin,"Oberwallstraße 13,","Charakteristisch für die Townhäuser sind die langen, eher schmalen Parzellen, die sowohl zu orig...",Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/139142000,32.894737
315,141306965,Exklusives Townhouse in Mitte nahe dem Gendarmenmarkt,immediately,456.0,4.0,3,3,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Einbauküche Einbauküche Gäste-WC Gäste-WC,1 Underground parking space,0,0,2012,Unknown,1.284,keine Angabe,15000.0,16284.0,3 NKM,Small house,FAMOZA Immobilien,Frau Josipa Kovačević,Berlin,Unknown,Vermietet wird ein exklusives Townhouse im beliebten Stadtteil Mitte. Das viergeschossige Maison...,Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/141306965,32.894737
1886,140875809,Erstbezug: Spektakuläres Penthouse in City-Lage,01.04.2023,321.0,5.0,3,2,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Personenaufzug Personenaufzug Einbauküche Einbau...,2 Underground parking spaces,6,6,2022,Unknown,950,in Nebenkosten enthalten,11500.0,12450.0,0,Penthouse,Engel & Völkers Immobilien Deutschland GmbH,Engel & Völkers Immobilien Deutschland GmbH,Berlin,Unknown,"Das hier angebotene Penthouse befindet sich auf den Kant-Garagen, einem historischen Architektur...",Charlottenburg,10625,https://www.immobilienscout24.de/expose/140875809,35.825546
1908,138488871,Exceptional Living in der Jägerstraße am Friedrichswerder - Exklusives Penthouse mit 360 Grad Blick,Nach Absprache,706.0,7.0,4,3,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Personenaufzug Personenaufzug Einbauküche Einbau...,Underground parking space,6,6,2007,C,2.850,in Nebenkosten enthalten,17000.0,19850.0,0,Penthouse,CITY-CONCEPT Gesellschaft für Immobilienmanagement mbH,Herr Stefan Schepers,Berlin,"Jägerstraße 34,",Das Wohn- und Geschäftshaus Jägerstraße 34/35 liegt in direkter Nähe zum Auswärtigem Amt. Im 5. ...,Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/138488871,24.079321
4049,105850244,5-Zimmerwohnung mit großer Terrasse im Herzen von Berlin,from Juli 2023,343.5,5.0,4,4,Balkon/ Terrasse Balkon/ Terrasse Personenaufzug Personenaufzug Einbauküche Einbauküche Garten/ ...,1 Underground parking space,3,7,2015,Unknown,"1.545,35",in Nebenkosten enthalten,7081.870117,8627.219727,3 Nettokaltmieten,Terrace apartment,HGHI Immobilien Verwaltung GmbH,Frau Marie-Josephine Wahn,Berlin,"Leipziger Str. 12,",Das Leipziger Platz Quartier setzt als einzigartiges Wohnquartier über den Dächern der Stadt neu...,Mitte (Ortsteil),10117,https://www.immobilienscout24.de/expose/105850244,20.616798


These are very special offers, with prices of up to almost 20,000 EUR (warm) for a 706 sq.m penthouse.  

But the relative price per sq.m is moderate (about 20 to 36 EUR) and a lot lower than what we saw earlier for some of the one-room offerings.

#### Construction year

In [116]:
fig = px.histogram(df.sort_values(["constr_year"]), x='constr_year',  title='Construction year') #color='property_type', 
fig.update_layout(xaxis_type = 'category')
# fig.update_layout(xaxis={'categoryorder':'total ascending'})
fig.update_yaxes(type="log")
fig.update_layout(xaxis_title="", yaxis_title="Count (log scale)")
fig.show()

There are many typos in `constr_year`, and  
most listings do not state the year of construction at all.

I plan to divide the data into 3-4 categories such as '<=1950', '1951-1990', '1991-2010', '>2010'.
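
The planned categories could be built with `pd.cut`. The following is only a sketch under assumptions: the bin edges are illustrative, the first label is written as `<=1950` so that the year 1950 itself is not lost, and implausible years (including the 0 used for missing values) are sent to a separate `Unknown` category:

```python
import pandas as pd

# Illustrative years, including 0 as the "not specified" value
years = pd.Series([0, 1906, 1950, 1975, 2007, 2012, 2023])

bins = [1800, 1950, 1990, 2010, 2023]           # assumed edges, right-inclusive
labels = ["<=1950", "1951-1990", "1991-2010", ">2010"]

# Years outside a plausible range become NaN before binning
plausible = years.where(years.between(1800, 2023))
year_cat = pd.cut(plausible, bins=bins, labels=labels)

# Give the missing years their own category instead of leaving NaN
year_cat = year_cat.cat.add_categories("Unknown").fillna("Unknown")
print(year_cat.value_counts())
```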

In [117]:
# years in the future are typos; replace them with 1990 as a placeholder value
df.loc[df.constr_year > 2023, 'constr_year'] = 1990

### Category features

In [118]:
cat_col

Index(['title', 'logging_date', 'criteria', 'garage', 'energy_eff',
       'extra_costs', 'heat_costs', 'deposit_eur', 'property_type',
       'publisher', 'contact', 'city', 'address', 'description', 'region',
       'zip'],
      dtype='object')

#### region

In [119]:
fig = px.histogram(df[['region']], title='region',  text_auto=True)
fig.update_layout(xaxis_title="", yaxis_title="Count (log scale)")
fig.update_layout(xaxis_type = 'category')
fig.update_layout(xaxis={'categoryorder':'total ascending'})
fig.update_yaxes(type="log")
fig.update_xaxes(tickangle=60)

In [120]:
df[df.floors_in_building.isna()  & (df.property_type == "None") ]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price


In [121]:
ind = df[df.floor.isna()  & (df.property_type == "Ground floor apartment") ].index
df.loc[ind, 'floor'] = 0

#### Deposit_eur

`deposit_eur` contains information about the deposit.  
Some ads give a specific sum, others the number of monthly cold rents.   
Let's clean this data:
1. retrieve only the digits
2. if the value is < 13 (3 is common practice), multiply it by the cold price

In [122]:
def clean_deposit(row):
  '''
  Convert the free-text deposit_eur value of a row into an amount in EUR
  '''
  # if deposit is given in words, convert it to n- months amount of price_cold_eur
  if 'drei' in str.lower(row['deposit_eur']):
    return 3 * row['price_cold_eur']
  elif 'zwei' in str.lower(row['deposit_eur']):
    return 2 * row['price_cold_eur']

  res = re.search(r'(\d+\.?\d*\,?\d+)', row['deposit_eur']) # matching object
  if  res is None:
    return 0
    # return 3 * row['price_cold_eur'] # 3 months deposit is a standard in Germany
  else:
    res = float(res.group(0).replace(".", "").replace(",", ".")) # extract group 0, remove thousands dots, turn the decimal comma into a dot, convert to float

  if res == 0:
    return 0
  elif res < 13: 
    return res * row['price_cold_eur'] # if deposit is given in months, convert it to EUR
  else:
    return res
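
One limitation of the pattern above: it needs at least two digits to match, so a value such as `2 Kaltmieten` or `3 NKM` produces no match and falls through to 0. A hedged sketch of a more tolerant variant follows; `parse_deposit` and `NUMBER` are hypothetical names, and the rule "fewer than 13 means months" is taken over from the function above:

```python
import re

# One number in German notation: digits, optional thousands groups, optional decimals
NUMBER = re.compile(r"\d+(?:\.\d{3})*(?:,\d+)?")

def parse_deposit(text, cold_rent):
    """Hypothetical variant of clean_deposit that also accepts single digits."""
    text = str(text).lower()
    if "drei" in text:
        return 3 * cold_rent
    if "zwei" in text:
        return 2 * cold_rent
    match = NUMBER.search(text)
    if match is None:
        return 0.0
    value = float(match.group(0).replace(".", "").replace(",", "."))
    # small numbers are read as a count of monthly cold rents
    return value * cold_rent if 0 < value < 13 else value

for raw in ["2 Kaltmieten", "3 NKM", "29.403,00 EUR", "1000 + Admin. Fee", "keine Angabe"]:
    print(raw, "->", parse_deposit(raw, 1000.0))
```

Whether a bare `2` really means two monthly rents is an assumption that should be checked against the ads before using this on the full data set.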

In [123]:
# print rows to check if attribute contains 'drei'
ind = df[df.deposit_eur.apply(lambda x: 'drei' in str.lower(x))].index
df.loc[ind,].head(3)

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price
458,140224410,Neubau im Erstbezug - Musterwohnung,01.05.2023,111.9375,5.0,0,0,Balkon/ Terrasse Balkon/ Terrasse Personenaufzug Personenaufzug Gäste-WC Gäste-WC,No garage,1,5,2023,B,22946,25408,1287.199951,1770.73999,drei Nettokaltmieten,Unknown,degewo,degewo Köpenicker Wohnungsgesellschaft mbH,Berlin,"Igelsteig 7B,",Alle Wohnungen in diesem Neubau sind mit einer Fußbodenheizung ausgestattet und per Aufzug errei...,Köpenick,12557,https://www.immobilienscout24.de/expose/140224410,11.499273
518,141383819,* Erstbezug im Neubau nahe der Müggelspree*,from immediately,107.75,5.0,0,0,Personenaufzug Personenaufzug Gäste-WC Gäste-WC,No garage,1,3,2023,A,21987,28669,1023.909973,1530.469971,drei Nettokaltmieten,Unknown,degewo,degewo AG,Berlin,"Fürstenwalder Allee 324,",Im Neubauprojekt im Stadtteil Hessenwinkel entstehen insgesamt 386 Wohnungen in 34 Gebäuden für ...,Rahnsdorf,12589,https://www.immobilienscout24.de/expose/141383819,9.502645
519,141383671,* Erstbezug im Neubau nahe der Müggelspree*,from immediately,85.8125,3.0,0,0,Personenaufzug Personenaufzug,No garage,3,3,2023,A,17501,22820,986.590027,1389.800049,drei Nettokaltmieten,Unknown,degewo,degewo AG,Berlin,"Fürstenwalder Allee 326,",Im Neubauprojekt im Stadtteil Hessenwinkel entstehen insgesamt 386 Wohnungen in 34 Gebäuden für ...,Rahnsdorf,12589,https://www.immobilienscout24.de/expose/141383671,11.497044


In [124]:
df['calc_deposit_eur'] = df.apply(lambda row: clean_deposit(row), axis=1)
df.loc[ind,].head(3)

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price,calc_deposit_eur
458,140224410,Neubau im Erstbezug - Musterwohnung,01.05.2023,111.9375,5.0,0,0,Balkon/ Terrasse Balkon/ Terrasse Personenaufzug Personenaufzug Gäste-WC Gäste-WC,No garage,1,5,2023,B,22946,25408,1287.199951,1770.73999,drei Nettokaltmieten,Unknown,degewo,degewo Köpenicker Wohnungsgesellschaft mbH,Berlin,"Igelsteig 7B,",Alle Wohnungen in diesem Neubau sind mit einer Fußbodenheizung ausgestattet und per Aufzug errei...,Köpenick,12557,https://www.immobilienscout24.de/expose/140224410,11.499273,3861.599854
518,141383819,* Erstbezug im Neubau nahe der Müggelspree*,from immediately,107.75,5.0,0,0,Personenaufzug Personenaufzug Gäste-WC Gäste-WC,No garage,1,3,2023,A,21987,28669,1023.909973,1530.469971,drei Nettokaltmieten,Unknown,degewo,degewo AG,Berlin,"Fürstenwalder Allee 324,",Im Neubauprojekt im Stadtteil Hessenwinkel entstehen insgesamt 386 Wohnungen in 34 Gebäuden für ...,Rahnsdorf,12589,https://www.immobilienscout24.de/expose/141383819,9.502645,3071.729919
519,141383671,* Erstbezug im Neubau nahe der Müggelspree*,from immediately,85.8125,3.0,0,0,Personenaufzug Personenaufzug,No garage,3,3,2023,A,17501,22820,986.590027,1389.800049,drei Nettokaltmieten,Unknown,degewo,degewo AG,Berlin,"Fürstenwalder Allee 326,",Im Neubauprojekt im Stadtteil Hessenwinkel entstehen insgesamt 386 Wohnungen in 34 Gebäuden für ...,Rahnsdorf,12589,https://www.immobilienscout24.de/expose/141383671,11.497044,2959.770081


Finally we've managed to calculate `calc_deposit_eur`.

#### price_warm_eur

Let's add a new feature named 'costs' as the difference between the warm and cold prices.

In [125]:
df['costs'] = df.price_warm_eur - df.price_cold_eur
ind = df[df.costs < 0].index
df.loc[ind]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price,calc_deposit_eur,costs
484,132716077,Furnished 2 rooms apartment in Mitte (Berlin),,65.0,2.0,1,1,Personenaufzug Personenaufzug Einbauküche Einbauküche,No garage,5,6,2020,Unknown,355,in Nebenkosten enthalten,3091.0,2736.0,1000 + Admin. Fee,Flat,Ukio Germany Gmbh,Frau Julia Morgan,Berlin,"Am Köllnischen Park 17,","Where Berlin conception meets Saharan design, Djenné captures the magic and magnificence of its ...",Mitte (Ortsteil),10179,https://www.immobilienscout24.de/expose/132716077,47.553844,1000.0,-355.0
2690,138189345,Co-Living - THE HOUSE OF CO - Erstbezug Apartment,immediately,27.0,1.0,1,1,Keller Keller Personenaufzug Personenaufzug Einbauküche Einbauküche WG-geeignet WG-geeignet,1 Underground parking space,1,5,2019,Unknown,100,in Nebenkosten enthalten,1149.0,1049.0,2 Kaltmieten,Flat,FU.Life Service GmbH,Booking House of Co,Berlin,"Heidestraße 20,",Wir verbinden das Beste aus zwei Wohnkonzepten auf höchstem Niveau. Alle wollen heute „Co“! „...,Moabit,10557,https://www.immobilienscout24.de/expose/138189345,42.555557,0.0,-100.0


The warm and cold prices are swapped in these two listings.
Let's fix it.

In [126]:
df.loc[ind, 'price_warm_eur'], df.loc[ind, 'price_cold_eur'] = df.loc[ind, 'price_cold_eur'], df.loc[ind, 'price_warm_eur']
df.loc[ind, 'costs'] = df.loc[ind, 'price_warm_eur'] - df.loc[ind, 'price_cold_eur']
df.loc[ind]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price,calc_deposit_eur,costs
484,132716077,Furnished 2 rooms apartment in Mitte (Berlin),,65.0,2.0,1,1,Personenaufzug Personenaufzug Einbauküche Einbauküche,No garage,5,6,2020,Unknown,355,in Nebenkosten enthalten,2736.0,3091.0,1000 + Admin. Fee,Flat,Ukio Germany Gmbh,Frau Julia Morgan,Berlin,"Am Köllnischen Park 17,","Where Berlin conception meets Saharan design, Djenné captures the magic and magnificence of its ...",Mitte (Ortsteil),10179,https://www.immobilienscout24.de/expose/132716077,47.553844,1000.0,355.0
2690,138189345,Co-Living - THE HOUSE OF CO - Erstbezug Apartment,immediately,27.0,1.0,1,1,Keller Keller Personenaufzug Personenaufzug Einbauküche Einbauküche WG-geeignet WG-geeignet,1 Underground parking space,1,5,2019,Unknown,100,in Nebenkosten enthalten,1049.0,1149.0,2 Kaltmieten,Flat,FU.Life Service GmbH,Booking House of Co,Berlin,"Heidestraße 20,",Wir verbinden das Beste aus zwei Wohnkonzepten auf höchstem Niveau. Alle wollen heute „Co“! „...,Moabit,10557,https://www.immobilienscout24.de/expose/138189345,42.555557,0.0,100.0


In [128]:
df.publisher.nunique()

270

In [129]:
check_na(df)

property_id          0        0.00%  int64     
title                0        0.00%  object    
logging_date         0        0.00%  object    
property_area        0        0.00%  float16   
num_rooms            0        0.00%  float16   
num_bedrooms         0        0.00%  int16     
num_bathrooms        0        0.00%  int16     
criteria             1844    45.30%  object    
garage               0        0.00%  object    
floor                0        0.00%  int8      
floors_in_building   0        0.00%  int16     
constr_year          0        0.00%  int16     
energy_eff           0        0.00%  object    
extra_costs          0        0.00%  object    
heat_costs           0        0.00%  object    
price_cold_eur       0        0.00%  float32   
price_warm_eur       613     15.06%  float32   
deposit_eur          0        0.00%  object    
property_type        0        0.00%  object    
publisher            0        0.00%  object    
contact              0        0.00%  obj

And finally we have a clean data set, except for `criteria`, `price_warm_eur` and the `costs` derived from it.

### Save data ready for analysis

In [130]:
# df.to_csv(path_to_csv + 'Berlin_housing_cleaned2.csv', sep=';', index=False)

### Loading cleaned data

In [132]:
df_r = pd.read_csv(path_to_csv + 'Berlin_housing_cleaned2.csv', sep=';')

## Analyze

First, we'll explore feature by feature and
then answer the questions.

Does the presence of a garage increase the price?

#### Garage

In [133]:
fig = px.histogram(df_r[['garage']], x = df_r['garage'], title='Garage', color= 'garage',  
                   text_auto=True, height= 600)
fig.update_layout(xaxis_title="", yaxis_title="Count")
fig.update_layout(xaxis={'categoryorder':'total descending'})
# fig.update_yaxes(type="log")

In [165]:
print(f'Only {df_r[df_r.garage != "No garage"].shape[0]/df_r.shape[0]:.2%} of the properties have a garage')

Only 9.26% of the properties have a garage


Most of the properties do not have a garage

In [135]:
garage_bins = df_r.garage.apply(lambda x: 'Yes' if x != 'No garage' else 'No') # create a Series with binary values
garage_bins.rename("garage_bins", inplace=True)                                # rename the column
garage_bins.value_counts()

No     3694
Yes     377
Name: garage_bins, dtype: int64

In [136]:
fig = px.scatter(pd.concat([df_r, garage_bins],axis=1), x="price_cold_eur", y="property_area", 
                 color= 'garage_bins', height= 800, facet_col = 'garage_bins')   # ,  trendline="ols", trendline_options=dict(log_x=True)
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
# fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2))   # change marker size and line width
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

Interesting results.  
Do you also notice that the "clustering" appears only among the "no garage" ads?

In [137]:
area_bins = pd.qcut(df_r.property_area, 2)
garage_bins = df_r.garage.apply(lambda x: 'No' if x == 'No garage' else 'Yes')
garage_bins.rename("garage_bins", inplace=True)

df_r.pivot_table('price_cold_eur', [garage_bins], aggfunc=['mean'])\
  .style.bar(align='mid', color='coral').format(precision=1, thousands=",")

Unnamed: 0_level_0,mean
Unnamed: 0_level_1,price_cold_eur
garage_bins,Unnamed: 1_level_2
No,1466.9
Yes,2093.2


Listings with a garage are on average about 630 EUR more expensive (2,093 vs. 1,467 EUR cold rent). This is a raw difference that does not account for property size.
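
Because larger properties are more likely to come with a garage, comparing the price per sq.m can be more informative than comparing total rents. A small sketch on made-up numbers (the column names follow the data set, the values do not):

```python
import pandas as pd

# Made-up listings for illustration only
listings = pd.DataFrame({
    "garage": ["No garage", "No garage", "1 garage", "1 Underground parking space"],
    "price_cold_eur": [900.0, 1500.0, 2400.0, 3000.0],
    "property_area": [45.0, 75.0, 120.0, 100.0],
})

# Binary flag, built the same way as garage_bins
has_garage = listings["garage"].ne("No garage").map({True: "Yes", False: "No"})

# Median cold rent per sq.m in each group
per_sqm = (listings["price_cold_eur"] / listings["property_area"]).groupby(has_garage).median()
print(per_sqm)
```

On the real data the same two lines could be applied to `df_r`; the median is used here because it is less sensitive to the luxury listings seen earlier.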

#### energy_eff

Let's add a new column to explore the energy efficiency more intuitively.

In [166]:
df_r['rel_heat_costs'] = df_r.heat_costs / df_r.property_area # relative costs  (EUR/m2)

TypeError: unsupported operand type(s) for /: 'str' and 'float'

In [139]:
eff_piv = df_r.pivot_table('rel_costs', ['energy_eff'], aggfunc=['mean','count'])\
                .sort_values(by=('mean', 'rel_costs'), ascending= True)                
eff_piv.columns = ['Relative costs (EUR/m2), mean', 'Number of offerings']  # rename columns
eff_piv.reset_index(inplace=True) # reset index to reduce the number of levels in the column names
eff_piv.style.bar(align='left', color='coral').format(precision=2, thousands=",") 

Unnamed: 0,energy_eff,"Relative costs (EUR/m2), mean",Number of offerings
0,Unknown,2.05,3036
1,A+,3.39,13
2,D,3.63,75
3,F,3.93,23
4,C,3.93,129
5,H,4.17,2
6,B,4.23,113
7,G,4.29,4
8,A,4.32,31
9,E,5.05,32


In [140]:
fig = px.bar(eff_piv, x='energy_eff', y='Relative costs (EUR/m2), mean', 
             color='Relative costs (EUR/m2), mean', hover_data=['energy_eff'],
             color_continuous_scale=['Green','Blue','Red'], text_auto='.3',
             title='Relative costs (EUR/m2), mean', height= 600, opacity= .6)
fig.update_layout(xaxis_title="Energy efficiency class", yaxis_title="Relative costs (EUR/m2), mean")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()


Here we can see that the stated energy efficiency class does not actually correspond to the relative costs.  
The lowest costs are among the listings without any stated energy efficiency class.  
Usually the costs include heating and possibly some other extra services.  

As a tip from here: `Do not pay too much attention to the indicated energy efficiency class`.

In [141]:
# df_r[df_r.energy_eff == 'H']

In [142]:
fig = px.histogram(df_r[['energy_eff']].sort_values(by='energy_eff'), title='Distribution of offerings by energy efficiency class', text_auto=True)
fig.update_layout(xaxis_title="")
fig.update_layout(xaxis={'categoryorder':'total descending'})

Most of the properties do not have a designated energy efficiency rating.  
The rating is based on the energy consumption of the building.  
The higher the rating, the lower the energy consumption but  
`the smallest real relative costs are among listings with Unknown category.` 

#### property_type

In [143]:
fig = px.histogram(df_r[['property_type']].sort_values(by='property_type'), title='property_type',  text_auto=True)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={'categoryorder':'total descending'})

Among the listings with a designated type,  
flats are the most common offering.

In [144]:
property_bins = df_r.property_type.apply(lambda x: x if x == 'Unknown' else 'specified') # create a Series with binary values
property_bins.rename("property_bins", inplace=True)                                # rename the column
property_bins.value_counts()


Unknown      2738
specified    1333
Name: property_bins, dtype: int64

In [145]:
fig = px.scatter(pd.concat([df_r, property_bins],axis=1), x="price_cold_eur", y="property_area", 
                 facet_col='property_bins', color= 'property_bins')
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
# fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2))   # change marker size and line width
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

As we've noticed earlier (garage section)  
listings with Unknown property type form a distribution with 2 clusters.

#### Bedrooms and bathrooms

In [146]:
fig = px.histogram(df_r[['num_bedrooms']].sort_values(by='num_bedrooms'), title='Number of bedrooms',  
                   text_auto=True, color_discrete_sequence=['green'], opacity= .6)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [147]:
fig = px.histogram(df_r[['num_bathrooms']].sort_values(by='num_bathrooms'), title='Number of bathrooms',  
                   text_auto=True, color_discrete_sequence=['blue'], opacity= .4)
fig.update_layout(xaxis_title="", yaxis_title="Count")
# fig.update_yaxes(type="log")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [148]:
sp_rooms_bins = df_r.apply(lambda x: 'No' if (x['num_bathrooms'] == 0) and (x['num_bedrooms'] == 0) else 'specified', axis=1) # create a Series with binary values
sp_rooms_bins.rename("sp_rooms_bins", inplace=True)                                # rename the column
sp_rooms_bins.value_counts()


No           3008
specified    1063
Name: sp_rooms_bins, dtype: int64

In [149]:
fig = px.scatter(pd.concat([df_r, sp_rooms_bins],axis=1), x="price_cold_eur", y="property_area", 
                 facet_col='sp_rooms_bins', color= 'sp_rooms_bins')
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_traces(marker_size=4 , line=dict(width=2)) 
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

As with garage and property type, we can see a definite segmentation among listings without a specified number of bedrooms and bathrooms.

And finally, let's combine all the features that lead to clustering:

In [150]:
cluster_bin = df_r.apply(lambda x: 'clusterized' if (x['num_bathrooms'] == 0) and (x['num_bedrooms'] == 0) 
                         and (x['garage'] == 'No garage') and  (x['property_type'] == 'Unknown') 
                         and (x['energy_eff'] == 'Unknown') 
                         else 'normal', axis=1) # create a Series with binary values
cluster_bin.rename("cluster_bin", inplace=True)                                # rename the column
cluster_bin.value_counts()

clusterized    2368
normal         1703
Name: cluster_bin, dtype: int64
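
The row-wise `apply` above can also be written as a vectorised boolean mask, which is usually faster on larger frames. A sketch on a three-row example frame (the values are made up; the conditions are the same as in the cell above):

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "num_bathrooms": [0, 1, 0],
    "num_bedrooms": [0, 2, 0],
    "garage": ["No garage", "1 garage", "No garage"],
    "property_type": ["Unknown", "Flat", "Flat"],
    "energy_eff": ["Unknown", "B", "Unknown"],
})

# True only where every descriptive field is empty or unspecified
sparse = (
    sample["num_bathrooms"].eq(0)
    & sample["num_bedrooms"].eq(0)
    & sample["garage"].eq("No garage")
    & sample["property_type"].eq("Unknown")
    & sample["energy_eff"].eq("Unknown")
)

cluster_bin = pd.Series(np.where(sparse, "clusterized", "normal"),
                        index=sample.index, name="cluster_bin")
print(cluster_bin.value_counts())
```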

In [151]:
fig = px.scatter(pd.concat([df_r, cluster_bin],axis=1), x="price_cold_eur", y="property_area", 
                 facet_col='cluster_bin', color= 'cluster_bin')
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_traces(marker_size=3 , line=dict(width=2)) 
fig.update_yaxes(range=[0, 250])
fig.update_xaxes(range=[0, 5000])
fig.show()

Listings with
* no garage
* no specification of property type, energy efficiency class, number of bedrooms and bathrooms  

form 2 visible clusters.

Later we'll try to use geo data to plot the data on map.

In [154]:
fig = px.scatter(temp, x="price_cold_eur", y="property_area",
                 color="publisher", hover_name="publisher")
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_traces(marker_size=4 , line=dict(width=2))
fig.update_yaxes(range=[0, 120])
fig.update_xaxes(range=[0, 5000])
fig.show()

#### publisher

In [171]:
fig = px.histogram(df[['publisher']].sort_values(by='publisher'), title='publisher',  text_auto=True, height= 800)
fig.update_layout(xaxis_title="", yaxis_title="Count (log scale)")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_yaxes(type="log")
fig.update_xaxes(tickangle=60)

Let's print the top 15 agencies (all private owners are combined into one group).

In [173]:
def custom_aggregation(data):
    '''
    Calculate mean area, number of listings, mean cold price, total area (volume) and share of total area for each publisher
    '''
    d = {} # create an empty dictionary

    d['mean_sqm'] = data['property_area'].mean()           
    d['count'] = round(data['property_area'].count())
    d['mean_price'] =  data['price_cold_eur'].mean()     
    d['volume']= d['count']*d['mean_sqm']
    d['share'] = d['volume'] /(df['property_area'].sum())*100
    return pd.Series(d)

grouped = df.groupby(['publisher'])[['property_area', 'price_cold_eur']].apply(custom_aggregation)
grouped.sort_values(by='volume', ascending= False).head(15).\
    style.bar(align='mid', color='coral').format(precision=1, thousands=",")


invalid value encountered in double_scalars



publisher,mean_sqm,count,mean_price,volume,share
HousingAnywhere B.V.,inf,1445.0,2106.3,inf,
Tauschwohnung GmbH,66.6,751.0,700.9,50035.4,0.0
Wohnungsswap.de - Lägenhetsbyte Sverige AB -,64.2,579.0,663.8,37200.8,0.0
Private,78.5,266.0,1575.4,20881.0,0.0
Blueground Germany GmbH,70.4,101.0,2163.2,7107.9,0.0
HOWOGE Wohnungsbaugesellschaft mbH,51.2,85.0,438.5,4350.9,0.0
Ukio Germany Gmbh,71.1,61.0,2454.5,4338.6,0.0
Engel & Völkers Immobilien Deutschland GmbH,152.1,24.0,3861.2,3651.0,0.0
Engel & Völkers Berlin Mitte GmbH,164.2,20.0,4693.5,3285.0,0.0
Immonexxt GmbH,56.2,35.0,673.0,1966.6,0.0
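
The `inf` in `mean_sqm` and the zero or empty values in `share` are most likely a numeric overflow rather than a property of the data: `check_na` reports `property_area` as `float16`, whose largest finite value is 65504, and a sum over a few thousand listings easily exceeds that. The sketch below reproduces the effect on made-up numbers and shows one possible fix, widening the column to `float64` before aggregating (the toy data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# 2,000 made-up listings of 60 m2 each, stored as float16 like property_area.
area16 = np.full(2000, 60.0, dtype=np.float16)

print(np.finfo(np.float16).max)            # largest finite float16 value
total16 = area16.sum()                     # true total is 120,000: out of float16 range
total64 = area16.astype(np.float64).sum()  # same data, wider accumulator
print(total16, total64)

# A possible fix in pandas: widen the column before grouping and aggregating.
toy = pd.DataFrame({"publisher": ["A"] * 2000, "property_area": area16})
toy["property_area"] = toy["property_area"].astype("float64")
mean_sqm = toy.groupby("publisher")["property_area"].mean()
print(mean_sqm)
```

With an infinite total in the denominator, every finite `volume` divided by it becomes 0, and `inf / inf` is undefined, which would explain the warnings and the `share` column above.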


In [157]:
# df.groupby(['publisher']).agg(mean_property_area=("property_area", 'mean'),
#                                    Count=('property_area','count'),
#                                    mean_price= ("price_cold_eur",'mean'),
#                                    volume = ("price_cold_eur",lambda x: x.sum())).sort_values(by='volume', ascending= False)\
#                                     .style.bar(align='mid', color='coral').format(precision=0, thousands=",")

### What are the most popular residential rental objects in Berlin?

In [175]:
fig = px.scatter(df, x="price_cold_eur", y="property_area", color= 'property_type',
                 height= 800,  trendline="ols", trendline_scope="overall")   # , trendline_options=dict(log_x=True)
fig.update_layout(xaxis_title="Price (EUR)", yaxis_title="Property area (m2)")
fig.update_layout(xaxis_type = 'log', yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2))   # change marker size and line width
fig.show()

# results = px.get_trendline_results(fig)
# print(results)
# results.px_fit_results.iloc[0].summary()
# results.query("property_type == 'Flat' or property_type == 'Unknown'").px_fit_results.iloc[0].summary()


overflow encountered in square



We observe an interesting result here.  
Two big clusters are formed:
* upper left, centered around 600 EUR for 60 sqm
* lower right, centered around 1,800 EUR for 50 sqm.

These look like two separate market segments.

In [176]:
fig = px.scatter(df, x="property_area", y="costs", color= 'property_type',
                 height= 800,  trendline="ols", trendline_scope="overall" ) #, trendline_options=dict(log_x=True) )
fig.update_layout(xaxis_title="Property area (m2)", yaxis_title="Costs (EUR)")
fig.update_layout(xaxis_type = 'log')#, yaxis_type = 'log')
fig.update_traces(marker_size=4 , line=dict(width=2)) 

In [177]:
# define a function to fill warm price on the basis of cold price and energy efficiency
# def fill_price_warm_eur(xdf, price_cold_eur, energy_eff, price_warm_eur, property_type, property_area):

xdf = df_r.copy()            # make a copy of the dataframe
xdf['costs'] = xdf.price_warm_eur - xdf.price_cold_eur # calculate costs

In [178]:
xdf[xdf.costs < 50] # listings where the warm price exceeds the cold price by less than 50 EUR (including zero or negative costs)

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,criteria,garage,floor,floors_in_building,constr_year,energy_eff,extra_costs,heat_costs,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,contact,city,address,description,region,zip,link,cold_rel_price,calc_deposit_eur,costs,rel_costs
16,118279473,"*Teilmöblierte* 1,5-Zimmer-Wohnung direkt am Maybachufer",14.04.2023,51.88,1.5,1,1,Online-Besichtigung möglich Online-Besichtigung Online-Besichtigung D...,Underground parking space,1,5,1980,Unknown,keine Angabe,in Nebenkosten enthalten,1650.0,1650.0,3300,Flat,REK Berlin Home Service GmbH,WOONWOON Booking,Berlin,"Maybachufer 42,",Es ist eine Teilmöblierung vorhanden. There is partial furniture available.,Neukölln (Ortsteil),12047,https://www.immobilienscout24.de/expose/118279473,31.807228,3300.0,0.0,0.0
67,140798159,"***Helle, möblierte Wohnung in Kudamm Nähe***",01.04.2023,65.00,2.0,1,1,Balkon/ Terrasse Balkon/ Terrasse Keller Keller Einbauküche Einbauküche,1 Underground parking space,3,5,1984,Unknown,keine Angabe,in Nebenkosten enthalten,1950.0,1950.0,5850,Unknown,DKW Management GmbH,Frau Chantal Rütz,Berlin,"Bornstedter Straße 2,",Das im Jahr 1984 errichtete Mehrfamilienhaus verfügt über elf Wohneinheiten mit sehr attraktiven...,Halensee,10711,https://www.immobilienscout24.de/expose/140798159,30.000000,5850.0,0.0,0.0
79,141111430,2bedroom + 2bathrooms + huge living room in a prime location,01.04.23,85.00,3.0,2,2,Personenaufzug Personenaufzug Einbauküche Einbauküche,No garage,3,5,0,Unknown,keine Angabe,keine Angabe,2300.0,2300.0,4600,Flat,LRC Group GmbH,Frau Eni,Berlin,"Eislebener Strasse 1,",*** Luxury & Prime Location *** 2 bedroom apartment + 2 bathrooms + huge living room at the K...,Charlottenburg,10789,https://www.immobilienscout24.de/expose/141111430,27.058823,4600.0,0.0,0.0
80,140101292,"Brunnenstraße, Berlin",,41.00,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,1865.0,1865.0,1865,Unknown,HousingAnywhere B.V.,,Berlin,"Brunnenstraße 0,",Scroll down for the english version 😊 _______________________________________________________ D...,Gesundbrunnen,13355,https://www.immobilienscout24.de/expose/140101292,45.487804,1865.0,0.0,0.0
81,140103947,"Geusenstraße, Berlin",,80.00,1.0,0,0,,No garage,0,0,0,Unknown,keine Angabe,keine Angabe,1850.0,1850.0,2500,Unknown,HousingAnywhere B.V.,,Berlin,"Geusenstraße 0,",Furnished 2 rooms apartment all bills included Bright & Specious with a balcony in Kaskelkiez! T...,Rummelsburg,10317,https://www.immobilienscout24.de/expose/140103947,23.125000,2500.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3909,140832151,"Stilvolle, gepflegte 2-Zimmer-EG-Wohnung mit Einbauküche in Wilmersdorf, Berlin",7.4.2023,60.00,2.0,1,1,Keller Keller Einbauküche Einbauküche Garten/ -mitbenutzung Garten/ -mitbenutzung,No garage,0,4,0,Unknown,keine Angabe,in Nebenkosten enthalten,1800.0,1800.0,4000,High parterre,Private,Herr Jay,Berlin,"Zahringerstr 28,","Bei dieser ansprechenden Immobilie handelt es sich um eine gepflegte EG-Wohnung, die durch eine ...",Wilmersdorf,10707,https://www.immobilienscout24.de/expose/140832151,30.000000,4000.0,0.0,0.0
3940,140627844,Furnished apartments/Rooms for limited period ( 3-12 Month) in a cool Villa next to Kudamm,Immediately,20.00,1.0,1,0,WG-geeignet WG-geeignet,No garage,1,2,0,Unknown,keine Angabe,keine Angabe,800.0,800.0,1200,Flat,Kigel Investment,Herr Amit Samuel,Berlin,"Münsterschestr. 11,",!!! Anmeldung möglich!!!! Registration possible!!! Great hostel near Kurfürstendamm are offerin...,Wilmersdorf,10709,https://www.immobilienscout24.de/expose/140627844,40.000000,1200.0,0.0,0.0
3947,140469247,Vollmöbelierte helle 67 m2 + Balkon (nur für Frauen) WG geeignet,01.04.2023,67.00,2.5,0,1,Balkon/ Terrasse Balkon/ Terrasse Einbauküche Einbauküche,No garage,4,4,0,Unknown,keine Angabe,nicht in Nebenkosten enthalten,1600.0,1600.0,0,Flat,Private,,Berlin,Unknown,"1600 € - 67 m² - 2.5 Zi. Hallo, Wir vermieten unsere vollmöbelierte Wohnung. Es hat 2,5 Zimme...",Steglitz,12167,https://www.immobilienscout24.de/expose/140469247,23.880596,0.0,0.0,0.0
3948,140459421,"1 room furnished apartment in Mitte for rent - 1-6 Months, now available",Immediately,25.00,1.0,0,0,Personenaufzug Personenaufzug Einbauküche Einbauküche Stufenloser Zugang Stufenloser Zugang,No garage,0,0,2000,Unknown,keine Angabe,in Nebenkosten enthalten,975.0,975.0,1500,Unknown,Kigel Investment,Herr Ron Sameach,Berlin,"Invalidenstr. 100,",Beautiful 1-room apartment in an excellent location in the center of Berlin for rent. Wunderschö...,Mitte (Ortsteil),10115,https://www.immobilienscout24.de/expose/140459421,39.000000,1500.0,0.0,0.0


In [179]:
px.histogram(xdf,  y='costs', color='property_type', title='Costs (warm minus cold price), EUR')

In [180]:

model = LinearRegression()  # define a linear regression model

has_warm = xdf['price_warm_eur'].notna()                      # rows where the warm price is known
X = xdf.loc[has_warm, ['price_cold_eur', 'property_area']]   # features: cold price and area
y = xdf.loc[has_warm, 'price_warm_eur']                       # target: warm price

# X = pd.get_dummies(X, columns=[ 'energy_eff'], drop_first=True) # convert categorical columns to dummy variables

model.fit(X, y)
ind = X.index
# return X, _
# # xdf.loc[ind, price_warm_eur] = xdf.loc[ind, price_cold_eur] * (1 + xdf.loc[ind, energy_eff])
print(model.score(X,y), len(ind))  # R^2 on the training rows and the number of rows used
# return model.predict(X[[price_cold_eur, energy_eff, property_type, property_area]])

0.9832529295194079 3458
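
The score above is computed on the same rows the model was fitted on, so it says little about how well the model generalizes. A minimal sketch of a hold-out evaluation with `train_test_split` follows. It runs on synthetic data (not the scraped listings), and the surcharge formula, the noise level and the names `toy` and `model_toy` are assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: warm rent is cold rent plus an area-dependent
# surcharge and some noise. These numbers are invented for the example.
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "price_cold_eur": rng.uniform(300, 3000, n),
    "property_area": rng.uniform(20, 150, n),
})
toy["price_warm_eur"] = (toy["price_cold_eur"] + 2.5 * toy["property_area"]
                         + rng.normal(0, 30, n))

X = toy[["price_cold_eur", "property_area"]]
y = toy["price_warm_eur"]

# Keep 20% of the rows aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model_toy = LinearRegression().fit(X_train, y_train)
r2_train = model_toy.score(X_train, y_train)
r2_test = model_toy.score(X_test, y_test)
print(f"R2 train: {r2_train:.3f}, R2 test: {r2_test:.3f}")
```

If the test score is clearly lower than the training score on the real data, the model is overfitting or the split contains very different listings.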


#exclude columns

In [181]:
# temp_df = model.predict(pd.get_dummies(xdf[['price_cold_eur', 'energy_eff',  'property_area']], columns=[ 'energy_eff'], drop_first=True))
temp_df

NameError: name 'temp_df' is not defined

In [None]:
temp_df = model.predict(xdf[['price_cold_eur', 'property_area']])
temp_df

array([3935.31412239, 4370.96318457, 5195.325166  , ..., 1020.80754978,
        550.22381886, 1712.41099877])

In [None]:
# check_na(df)

In [None]:
temp_df = pd.DataFrame(temp_df, columns=['price_warm_eur2'])
temp_df.head()

Unnamed: 0,price_warm_eur2
0,3935.314122
1,4370.963185
2,5195.325166
3,594.980753
4,570.748106


In [None]:
# temp['diff'] = (temp.price_warm_eur - temp.price_cold_eur) #/ df.property_area

In [None]:
temp_df.describe()

Unnamed: 0,price_warm_eur2
count,4154.0
mean,1705.120945
std,1240.959197
min,239.030263
25%,868.491232
50%,1487.818441
75%,2148.163782
max,19087.292423


In [None]:
temp_df.shape, df.shape

((4154, 1), (4154, 25))

In [None]:
t = pd.concat([df, temp_df], axis= 1, join='inner')

In [None]:
t['diff'] = (t.price_warm_eur2 - t.price_cold_eur) #/ df.property_area

In [None]:
pd.set_option('display.max_columns', None) # display all columns
t[(t['diff'] < 0) & (t.price_warm_eur.isna())]

Unnamed: 0,property_id,title,logging_date,property_area,num_rooms,num_bedrooms,num_bathrooms,garage,floor,floors_in_building,constr_year,energy_eff,price_cold_eur,price_warm_eur,deposit_eur,property_type,publisher,city,address,region,zip,link,cold_rel_price,calc_deposit_eur,add_costs,price_warm_eur2,diff
1083,140093393,Apartment: Kanzowstraße 6,Unknown,82.0,3.0,0,0,No garage,0,0,0,Unknown,1080.0,,0,Unknown,© Copyright 2023,Berlin,"Kanzowstraße 6,",Prenzlauer Berg,10000,https://www.immobilienscout24.de/expose/140093393,13.170732,0.0,,522.111406,-557.888594
1085,140093345,Apartment: Seestraße 106,Unknown,117.0,3.0,0,0,No garage,0,0,0,Unknown,1143.0,,0,Unknown,© Copyright 2023,Berlin,"Seestraße 106,",Wedding,13000,https://www.immobilienscout24.de/expose/140093345,9.769231,0.0,,687.260593,-455.739407
1088,140093307,Apartment: Kollwitzstraße 93,Unknown,52.0,2.0,0,0,No garage,0,0,0,Unknown,720.0,,0,Unknown,© Copyright 2023,Berlin,"Kollwitzstraße 93,",Prenzlauer Berg,10000,https://www.immobilienscout24.de/expose/140093307,13.846154,0.0,,666.626150,-53.373850
1090,140093295,Apartment: Böttgerstraße 18,Unknown,103.0,3.0,0,0,No garage,0,0,0,Unknown,1000.0,,0,Unknown,© Copyright 2023,Berlin,"Böttgerstraße 18,",Gesundbrunnen,13000,https://www.immobilienscout24.de/expose/140093295,9.708738,0.0,,944.787379,-55.212621
1092,140093270,Apartment: Frankfurter Allee 80,Unknown,92.0,3.0,0,0,No garage,0,0,0,Unknown,700.0,,0,Unknown,© Copyright 2023,Berlin,"Frankfurter Allee 80,",Friedrichshain,10000,https://www.immobilienscout24.de/expose/140093270,7.608696,0.0,,552.659323,-147.340677
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3821,141296345,Design Apartment in Berlin Mitte/ fully furnished apartment,1.10.2023,130.4,4.0,0,0,No garage,0,0,0,Unknown,4900.0,,0,Unknown,4.700,Berlin,Unknown,Mitte (Ortsteil),10435,https://www.immobilienscout24.de/expose/141296345,37.576687,0.0,,1379.119135,-3520.880865
3822,141296240,Design Apartment in Berlin Mitte,01.12.2023,250.0,6.0,3,3,1 parking space,1,5,2018,Unknown,9000.0,,0,Small house,4.700,Berlin,Unknown,Mitte (Ortsteil),10435,https://www.immobilienscout24.de/expose/141296240,36.000000,0.0,,2663.198942,-6336.801058
3835,141219164,Immediate reference - apartment to fall in love with Pankow Floraviertel - sunny.Quiet.Furnished...,01.04.2023,73.0,2.0,1,1,No garage,3,5,2014,Unknown,1612.0,,"4.836,00",Unknown,3.500,Berlin,"In den Floragärten 00,",Pankow (Ortsteil),13187,https://www.immobilienscout24.de/expose/141219164,22.082192,4836.0,,1356.595091,-255.404909
3889,141125007,"Bright, modern apartment, partially furnished, open living space",01.05.2023,100.0,3.0,0,0,No garage,0,0,0,Unknown,2000.0,,0,Unknown,3.500,Berlin,Unknown,Kaulsdorf,12621,https://www.immobilienscout24.de/expose/141125007,20.000000,0.0,,1836.658224,-163.341776
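
A predicted warm price far below the cold price is implausible for a model that scored about 0.98, so it is worth checking how the predictions were joined back. `model.predict` returns a plain array without an index, and `pd.concat(..., axis=1)` matches rows by index label, not by position. If the frame used for prediction (`xdf`) and the frame used in the join (`df`) differ in rows or order, predictions can end up next to the wrong listings. Whether this is what happened here is an assumption; the sketch below only shows the mechanism on a toy frame and one way to keep labels aligned:

```python
import pandas as pd

# Toy frame with a non-default index, as left behind by filtering rows.
toy = pd.DataFrame({"price_cold_eur": [500.0, 900.0, 1500.0]}, index=[10, 20, 30])

# Stand-in for model.predict(...): a plain array carries no index.
pred = (toy["price_cold_eur"] * 1.2).to_numpy()

# Wrapping the array in a default DataFrame gives it the index 0, 1, 2,
# so an inner join on the index finds no common labels here.
by_position = pd.concat([toy, pd.DataFrame(pred, columns=["price_warm_eur2"])],
                        axis=1, join="inner")

# Reusing the index of the rows that were predicted keeps every row aligned.
by_label = toy.assign(price_warm_eur2=pd.Series(pred, index=toy.index))

print(len(by_position), len(by_label))
```

In the notebook this would mean building the prediction as a Series with the index of the frame passed to `predict`, and joining it to that same frame.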


In [None]:
df[df.price_cold_eur.notna() & df.price_warm_eur.notna()]['energy_eff'].unique()

array(['Unknown', 'B', 'E', 'A+', 'A', 'D', 'C', 'F', 'G', 'H'],
      dtype=object)

In [None]:
df['add_costs'] = df.price_warm_eur - df.price_cold_eur

In [182]:
check_na(df)

property_id          0        0.00%  int64     
title                0        0.00%  object    
logging_date         0        0.00%  object    
property_area        0        0.00%  float16   
num_rooms            0        0.00%  float16   
num_bedrooms         0        0.00%  int16     
num_bathrooms        0        0.00%  int16     
criteria             1844    45.30%  object    
garage               0        0.00%  object    
floor                0        0.00%  int8      
floors_in_building   0        0.00%  int16     
constr_year          0        0.00%  int16     
energy_eff           0        0.00%  object    
extra_costs          0        0.00%  object    
heat_costs           0        0.00%  object    
price_cold_eur       0        0.00%  float32   
price_warm_eur       613     15.06%  float32   
deposit_eur          0        0.00%  object    
property_type        0        0.00%  object    
publisher            0        0.00%  object    
contact              0        0.00%  obj