In [1]:
import pandas as pd
import numpy as np
import re
import json
from itertools import permutations
from itertools import combinations

pd.set_option('display.max_columns', None, 'display.max_rows', 100)

df = pd.read_csv('../data/immoscout_cleaned_lat_lon_fixed_v9.csv', low_memory=False)

col_names = df.columns.array
col_names[0:2] = ['Index1', 'Index2']
df.columns = col_names

numeric_columns = ['Living space', 'Plot area', 'Floor space', 'Floor', 'detail_responsive#surface_living', 'detail_responsive#floor', 'Wohnfläche', 'Stockwerk', 'Nutzfläche', 'Grundstücksfläche', 'detail_responsive#surface_property', 'detail_responsive#surface_usable', 'Surface habitable', 'Surface du terrain', 'Surface utile', 'Étage', 'Superficie abitabile', 'Piano', 'Superficie del terreno', 'Superficie utile', 'Floor_merged', 'Living_space_merged', 'Floor_space_merged', 'Plot_area_merged', 'price_cleaned', 'price', 'Space extracted']
df_numeric = df[numeric_columns]

df_numeric.shape

(13378, 27)

In this part of the data analysis we focus primarily on the variables containing information about 'Living Space', 'Plot Area', 'Floor Space' and 'Floor'. For each of them there are at least six columns with relevant data: one per language (German, French, Italian, English), one '..._merged' and one 'detail_responsive#...'.  

For the living space there's also another column called 'Space extracted'.  

It is unclear how the '_merged' column was built, so we cannot assume that it contains all the data.  

## Living Space

To investigate how we can get the full set of data for living space, let's first create a subset containing only the columns with data on the living space.

In [2]:
living_space = ['Living space', 'Wohnfläche', 'Surface habitable', 'Superficie abitabile', 'detail_responsive#surface_living', 'Living_space_merged', 'Space extracted']
df_living_space = df_numeric.loc[:, living_space]

The cumulative sum of the non-null counts per column can be checked against the count of the '_merged' column to see whether the counts of the leading columns add up to the same number.

In [3]:
df_living_space[living_space[0:5]].count().cumsum() == df_living_space[living_space[5]].count()

Living space                        False
Wohnfläche                          False
Surface habitable                   False
Superficie abitabile                 True
detail_responsive#surface_living    False
dtype: bool

Here we see that the non-null counts of the columns 'Living space', 'Wohnfläche', 'Surface habitable' and 'Superficie abitabile' add up to the same number as the count of 'Living_space_merged'.  

Let's see if the data contained in 'Living_space_merged' is actually the same data contained in columns 'Living space', 'Wohnfläche', 'Surface habitable' and 'Superficie abitabile'.  

To do so we first combine the respective columns into one so we can compare the columns.

In [4]:
df_living_space['living_space'] = df_living_space[living_space[0]].fillna('') + \
  (df_living_space[living_space[1]]).fillna('') + \
  (df_living_space[living_space[2]]).fillna('') + \
  (df_living_space[living_space[3]]).fillna('')

(df_living_space.loc[:, 'living_space'] == df_living_space.loc[:, living_space[5]].fillna('')).sum() == df_numeric.shape[0]

True

Counting the `True` values of the comparison yields the number of rows in the dataset, meaning that the two columns are identical in every row.

Therefore we can confirm that the '_merged' column contains all information from the columns 'Living space', 'Wohnfläche', 'Surface habitable' and 'Superficie abitabile'.  
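
The string concatenation in the cell above gives a clean result only when each row holds a value in at most one of the language columns. A minimal sketch, on made-up values, of a coalescing alternative that takes the first non-null value per row and makes that assumption checkable (the frame `toy` and its contents are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the language-specific columns (values are made up).
toy = pd.DataFrame({
    'Living space': ['100 m²', np.nan, np.nan, np.nan],
    'Wohnfläche': [np.nan, '85 m²', np.nan, np.nan],
    'Surface habitable': [np.nan, np.nan, '120 m²', np.nan],
})

# Take the first non-null value per row, scanning the columns left to right.
coalesced = toy.bfill(axis=1).iloc[:, 0]

# Number of populated source columns per row; a value above 1 would mean
# that string concatenation glues two entries together.
populated = toy.notna().sum(axis=1)

print(coalesced.tolist())
print(populated.max())
```

Unlike the concatenation, this keeps missing values as `NaN`, so no `replace('', np.nan)` step is needed afterwards.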

There are still two remaining columns with data about the living space: 'detail_responsive#surface_living' and 'Space extracted'.  

Let's check if there's any new data in 'detail_responsive#surface_living':

In [5]:
df_living_space['living_space'] = df_living_space['living_space'].replace('', np.nan)

df_living_space[df_living_space['living_space'].isna() & df_living_space["detail_responsive#surface_living"].notna()]

Unnamed: 0,Living space,Wohnfläche,Surface habitable,Superficie abitabile,detail_responsive#surface_living,Living_space_merged,Space extracted,living_space
2,,,,,93 m²,,93.0,
39,,,,,97 m²,,97.0,
44,,,,,216 m²,,216.0,
170,,,,,70 m²,,70.0,
178,,,,,127 m²,,127.0,
...,...,...,...,...,...,...,...,...
13351,,,,,93 m²,,93.0,
13356,,,,,157 m²,,157.0,
13357,,,,,121 m²,,121.0,
13359,,,,,83 m²,,83.0,


Interesting. This output shows that not only 'detail_responsive#surface_living' but also 'Space extracted' contains more information on living space. In a first step we can merge the 'detail_responsive#surface_living' into 'living_space' and compare the counts.

In [6]:
df_living_space['living_space'] = df_living_space[living_space[5]].fillna('') + \
  (df_living_space[living_space[4]]).fillna('')

df_living_space['living_space'] = df_living_space['living_space'].replace('', np.nan)

df_living_space['living_space'].count(), df_living_space['Space extracted'].count()

(12304, 12308)

Combining all columns with direct information about the living space, except 'Space extracted', gives 12304 values. 'Space extracted' has 12308, i.e. 4 more.

In [7]:
df_living_space.loc[df_numeric['Space extracted'].notna() & df_living_space['living_space'].isna(), :] 

Unnamed: 0,Living space,Wohnfläche,Surface habitable,Superficie abitabile,detail_responsive#surface_living,Living_space_merged,Space extracted,living_space
786,,,,,,,200.0,
3380,,,,,,,210.0,
3696,,,,,,,228.0,
6506,,,,,,,200.0,


Does the column 'Space extracted' contain the same information as the merged column plus 4 more rows? To check that, the data has to be parsed the same way. 

In [8]:
df_living_space['living_space'] = df_living_space.living_space.str.extract(r'(\d+)').astype(float)

df_living_space['living_space'].dtype == df_living_space['Space extracted'].dtype

True

Now with the same datatype, the data can be merged and compared once more.

In [9]:
# Fill the gaps in 'living_space' with the values from 'Space extracted'.
fill_mask = df_living_space['Space extracted'].notna() & df_living_space['living_space'].isna()
df_living_space.loc[fill_mask, 'living_space'] = df_living_space.loc[fill_mask, 'Space extracted']
# Caution: .count() counts non-null comparison results (always the row count); .sum() would count the matches.
(df_living_space['Space extracted'].fillna(0).astype('int') == df_living_space['living_space']).count() == df_living_space.shape[0]

True

This shows that 'Space extracted' contains the most complete data of the investigated living-space columns: 4 values more than the '..._merged' and 'detail_responsive#...' columns combined. 
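
One caveat about the check in cell 9: calling `.count()` on a boolean comparison counts all non-null entries, not the matches, so the result equals the row count even if some rows differ. A small sketch with made-up Series `a` and `b` (both hypothetical) shows the difference between `.count()`, `.sum()` and `.all()`:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([1.0, 2.0, 99.0])

matches = (a == b)

n_entries = matches.count()  # non-null entries of the boolean Series: always len(a)
n_equal = matches.sum()      # number of True values: the actual matches
all_equal = matches.all()    # True only if every row matches

print(n_entries, n_equal, all_equal)
```

Note also that `NaN == NaN` evaluates to `False`, so missing values on both sides need the same `fillna` treatment before a comparison like this is meaningful.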

In [10]:
df_numeric = df_numeric.drop(living_space, axis=1) 
df_numeric['living_space'] = df_living_space['Space extracted'] 
df_numeric.shape

(13378, 21)

Now the next question is: where did those 4 additional values come from?  
The dataset has many redundant columns, but we discovered a pattern for the columns described above. They were all extracted from different versions (mobile, desktop, different languages) of the website immoscout24.ch, and the column names 'description', 'detailed_description', 'table', 'details' and 'details_structured' suggest that these columns may contain the raw data. 

# Description

To get an idea of the information contained in the 'description' column, let's look at the `value_counts()`:

In [11]:
df['description'].value_counts()

4.5 rooms, 153 m²«Duplex dans les combles avec 2 terrasses !»CHF 686,700.—Favourite                                                                               4
3.5 rooms, 98 m²«Belle promotion Minergie de 22 appartements au calme ! Du 2.5 pces au 4.5 pces !»CHF 495,000.—Favourite                                          4
5.5 rooms, 153 m²«####Les Vergers d Ollon#### à Ollon VD Magnifique Villa Mitoyenne avec un grand jardin d environ 1 000 m2 à vendre»CHF 1,260,000.—Favourite     4
5.5 rooms, 170 m²«NOUVELLE PROMOTION»CHF 1,795,000.—Favourite                                                                                                     4
2.5 rooms, 82 m²«Quartier Saint-Michel Appartement 2.5 pces au 3e»CHF 492,000.—Favourite                                                                          3
                                                                                                                                                                 ..
3.5 rooms, 101 m

Two things immediately stand out:
- The same description (and maybe more features) has been recorded multiple times for some observations.
- There seems to be a distinctive pattern in the data contained in the 'description' column.  

Since we cannot inspect every row manually, we built a regular expression to check whether the structure of the data is consistent across all observations.

In [12]:
df['description'].count()

13378

In [13]:
description_pattern = r'\d+\.?\d? rooms, \d+ m²«.+»CHF [\d,]+\.'
df['description'].str.contains(description_pattern).sum()

11201

The column 'description' contains information for every observation in the dataset, and 11201 of them follow the defined structure. What about the rest?

In [14]:
is_structured = df['description'].str.contains(description_pattern)
not_structured = is_structured[~is_structured]
not_structured.count()

2177

In [15]:
df.loc[not_structured.index, 'description'].head(10)

7     4.5 rooms«Preishit! Grossräumige Wohnung mitte...
15    258 m²«Mehrgenerationenhaus mit grossem Garten...
21    4.5 rooms, 236 m²«Terrassenhaus mit malerische...
22    167 m²«EFH 6.5 davon 1 Zi-Studio (Büro / Praxi...
33    258 m²«Mehrgenerationenhaus mit grossem Garten...
35    4.5 rooms«Preishit! Grossräumige Wohnung mitte...
41    4.5 rooms, 150 m²«####Two in One#### mit Einli...
44    7.5 rooms, 216 m²«Verzauberndes Generationenha...
45    5 rooms, 104 m²«#### Top Einfamilienhaus einge...
52    5.5 rooms, 160 m²«FAMILIENGLÜCK MIT VIEL PLATZ...
Name: description, dtype: object

So not every row of 'description' contains complete information about the rooms, living space and price. Let's check how it compares to the information contained in other columns in the dataset. 
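
To handle descriptions where one of the parts is missing, each part of the pattern can be made optional and captured in a named group. The following is only a sketch under assumptions: the three example strings are made up after the formats visible in the output above, and the group names `rooms`, `living_space` and `price` are hypothetical.

```python
import pandas as pd

# Made-up examples following the formats seen in the output above.
desc = pd.Series([
    '4.5 rooms, 153 m²«Duplex»CHF 686,700.—Favourite',
    '4.5 rooms«Preishit!»CHF 500,000.—Favourite',
    '258 m²«Mehrgenerationenhaus»Favourite',
])

pattern = (
    r'^(?:(?P<rooms>\d+(?:\.\d+)?) rooms)?'   # optional room count
    r'(?:, )?'                                # separator when both parts exist
    r'(?:(?P<living_space>\d+) m²)?'          # optional living space
    r'«.*»'                                   # title in guillemets
    r'(?:CHF (?P<price>[\d,]+))?'             # optional price
)

# One column per named group; parts that are absent become NaN.
parts = desc.str.extract(pattern)
parts['rooms'] = parts['rooms'].astype(float)
parts['living_space'] = parts['living_space'].astype(float)
parts['price'] = parts['price'].str.replace(',', '', regex=False).astype(float)

print(parts)
```

Extracting all three fields in one pass makes it easy to compare each of them against the corresponding cleaned column.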

## Rooms

In [16]:
df['description'].str.contains(r'\d+\.?\d? rooms, ').sum()
#TODO

11998

11998 rows of room data can be extracted from 'description'.

## Living Space

In [17]:
df['description'].str.contains(r'\d+ m²«').sum()

12304

Interesting, here we see the same count as before, when we merged all columns except 'Space extracted'. Is it actually the same information?

In [18]:
df_living_space['extracted_ls'] = df['description'].str.extract(r'(\d+) m²«')
(df_living_space['extracted_ls'].astype(float) == df_living_space['living_space']).sum()

12304

In [19]:
df_living_space.loc[df_numeric['living_space'].notna() & df_living_space['extracted_ls'].isna(), :] 

Unnamed: 0,Living space,Wohnfläche,Surface habitable,Superficie abitabile,detail_responsive#surface_living,Living_space_merged,Space extracted,living_space,extracted_ls
786,,,,,,,200.0,200.0,
3380,,,,,,,210.0,210.0,
3696,,,,,,,228.0,228.0,
6506,,,,,,,200.0,200.0,


Indeed! This shows that the column 'description' does not contain any additional information about the living space. But the question of where those 4 rows come from remains unanswered for now.  

## Price

In [20]:
df['description'].str.contains(r'CHF [\d.,]+').sum()

12359

We can extract 12359 rows of data containing information about the price. We'll keep that in mind and move forward.  

# Detailed Description

In [21]:
df['detailed_description'].value_counts()

Extract from the debt collection registerIn a few days by e-mail and by post at your home. Per invoice, for CHF 29.–Order the extract                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

In this column we find, as the name suggests, the detailed description of the posting. The data does not follow any specific pattern and does not contain reliable information on the features investigated in this notebook, so we discard it.  
With NLP techniques applied, it may still become useful for fine-tuning predictions. 

# Table

In [22]:
df['table'].value_counts()

b <article class=####Box-cYFBPY hKrxoH####><h2 class=####Box-cYFBPY gZLPvm####>Main information</h2><table class=####DataTable__StyledTable-sc-1o2xig5-1 jbXaEC####><tbody><tr><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__Cell-sc-1o2xig5-4 edrNfG dGBatU####>Municipality</td><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__CellValue-sc-1o2xig5-3 edrNfG rJZBK####>Le Mouret</td></tr><tr><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__Cell-sc-1o2xig5-4 edrNfG dGBatU####>Availability</td><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__CellValue-sc-1o2xig5-3 edrNfG rJZBK####>On request</td></tr></tbody></table><hr class=####Divider-iprSaI bBhTLQ####/></article>                                                                                                                                                                                                                                                                                                 

In this column, we see information about the Municipality, Living space, Plot area, Availability, Floor and so on. Maybe this is where our information originates from?

## Living Space

In [23]:
df['table'].str.contains('Living space').sum()

11634

In terms of counts, this is well below the data we have gathered so far for the living space. 

In [24]:
df_living_space['table_ls'] = df['table'].str.extract(r'Living space.+####>(\d+) m').astype(float)
df_living_space.loc[df_living_space['living_space'].isna() & df_living_space['table_ls'].notna(), 'living_space'].count()

0

This confirms that we cannot extract any additional information on the living space from the 'table' column. We'll now extract relevant variables for further investigation.

## Municipality

In [44]:
df['table'].str.contains('Municipality').sum() == df['Municipality'].count()

True

In [41]:
df['table_municipality'] = df['table'].str.extract(r'Municipality</td><td .+?####.+?####>(.+?)<')
df.loc[df['table_municipality'].fillna('') != df['Municipality'].fillna(''), ['table_municipality', 'Municipality']]

Unnamed: 0,table_municipality,Municipality
4,K&#252;ttigen,Küttigen
43,K&#252;ttigen,Küttigen
47,K&#252;ttigen,Küttigen
51,K&#252;ttigen,Küttigen
55,K&#252;ttigen,Küttigen
...,...,...
13291,F&#228;llanden,Fällanden
13293,"D&#252;bendorf, Kreis 7 (Zurich)","Dübendorf, Kreis 7 (Zurich)"
13295,F&#228;llanden,Fällanden
13297,"Z&#252;rich, Kreis 6 (Zurich)","Zürich, Kreis 6 (Zurich)"
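
The mismatches listed above differ only in HTML character references such as `&#252;` (ü) and `&#228;` (ä). A minimal sketch, on a few made-up rows, of decoding them with the standard library before comparing (the Series names are hypothetical stand-ins for the two columns):

```python
import html

import pandas as pd

table_municipality = pd.Series(['K&#252;ttigen', 'F&#228;llanden', 'Le Mouret'])
municipality = pd.Series(['Küttigen', 'Fällanden', 'Le Mouret'])

# Decode HTML character references; na_action='ignore' leaves missing values untouched.
decoded = table_municipality.map(html.unescape, na_action='ignore')

mismatches = (decoded != municipality).sum()
print(decoded.tolist(), mismatches)
```

Whether every difference in the full dataset disappears after decoding would still need to be checked on the real columns.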


## Plot Area

In [43]:
df['table'].str.contains('Plot area').sum() == df['Plot area'].count()

True

# Details

In [471]:
df['details'].value_counts()

4.5 rooms, 120 m²,     135
4.5 rooms, 110 m²,     132
4.5 rooms,             120
3.5 rooms, 100 m²,     111
4.5 rooms, 100 m²,     102
                      ... 
121 m²,                  1
6 rooms, 132 m²,         1
5.5 rooms, 340 m²,       1
393 m²,                  1
7.5 rooms, 385 m²,       1
Name: details, Length: 2741, dtype: int64

It looks like we've got another column with information about the rooms and living space (or some other space). Let's extract and compare it.  

## Living Space

In [472]:
df_living_space['details_ls'] = df['details'].str.extract(r', (\d+) m').astype(float)
df_living_space['details_ls'].count()

12431

This count is greater than the 12308 values we have collected so far!

In [473]:
df_living_space[df_living_space['details_ls'].notna() & df_living_space['living_space'].isna()]

Unnamed: 0,Living space,Wohnfläche,Surface habitable,Superficie abitabile,detail_responsive#surface_living,Living_space_merged,Space extracted,living_space,extracted_ls,table_ls,details_ls
53,,,,,,,,,,,95.0
88,,,,,,,,,,,284.0
141,,,,,,,,,,,114.0
182,,,,,,,,,,,259.0
186,,,,,,,,,,,259.0
...,...,...,...,...,...,...,...,...,...,...,...
13204,,,,,,,,,,,803.0
13207,,,,,,,,,,,1436.0
13278,,,,,,,,,,,345.0
13337,,,,,,,,,,,544.0


So there are 378 rows with a value from 'details' where 'living_space' is missing. But is this data even about living space?

In [474]:
(df_living_space['details_ls'].fillna(0) == df_living_space['living_space'].fillna(0)).sum()

12744

In [475]:
not_equal = df_living_space['details_ls'].fillna(0) != df_living_space['living_space'].fillna(0)
df_living_space.loc[not_equal, ['details_ls', 'living_space']]

Unnamed: 0,details_ls,living_space
15,,258.0
22,,167.0
33,,258.0
53,95.0,
57,,167.0
...,...,...
13204,803.0,
13207,1436.0,
13278,345.0,
13337,544.0,


In [476]:
fill_mask = df_living_space['details_ls'].notna() & df_living_space['living_space'].isna()
df_living_space.loc[fill_mask, 'living_space'] = df_living_space.loc[fill_mask, 'details_ls']
df_living_space['living_space'].count()

12686

We were able to extract 378 more values for the living space from 'details'.  

## Rooms

In [477]:
df['details'].str.contains(r'\d+\.?\d? rooms, ').sum()
#TODO

12799

# Details Structured

In [478]:
df['details_structured'].value_counts()

{'Municipality': 'Biberstein', 'Living space': '100 m²', 'Floor': '4. floor', 'Availability': 'On request', 'location': '5023 Biberstein, AG', 'description': '3.5 rooms, 100 m²«Luxuriöse Attika-Wohnung mit herrlicher Aussicht»CHF 1,150,000.—Favourite', 'detailed_description': 'DescriptionLuxuriöse Attika-Wohnung direkt an der Aare und angrenzend an die Landwirtschaftszone, mit unverbaubarer Weitsicht, grosszügiger Garage und Option auf ein zusätzliches Zimmer.Einzigartige Lage, top Aussicht und hochwertige Innenausstattung? Das alles bietet diese charmante Eigentumswohnung auf 100m2 im steuergünstigen Biberstein. Stadtnah gelegen und mit direktem Naturzugang sorgt sie für ein rundum angenehmes Wohngefühl.In der ganzen Wohnung sind hochwertige Materialien mit einem südländischen Touch verbaut. Der Boden ist mit einem Jurastein und die beiden Zimmer mit Holz versehen (mit Bodenheizung).In die Wohnung gelangt man über einen separaten Eingang, ein halbes Stockwerk vom gewachsenen Boden erh

The entries in this column contain the following keys: Municipality, Living space, Plot area, Availability, location, description, detailed_description, url, table, Floor space, Floor.
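
Since the entries look like the text representation of Python dictionaries (single-quoted keys), they can probably be parsed with `ast.literal_eval` rather than searched as plain strings. This is a sketch under that assumption, using two shortened, made-up entries:

```python
import ast
from collections import Counter

import pandas as pd

# Shortened, made-up entries in the same style as the output above.
raw = pd.Series([
    "{'Municipality': 'Biberstein', 'Living space': '100 m²', 'Floor': '4. floor'}",
    "{'Municipality': 'Le Mouret', 'Availability': 'On request'}",
])

# Turn each string into a real dict, then count how often each key occurs.
records = raw.map(ast.literal_eval)
key_counts = Counter(key for record in records for key in record)

print(key_counts)
```

Counting parsed keys avoids false positives: a substring search such as `str.contains("'Floor'")` may also match text inside a nested value.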

In [45]:
df['details_structured'].str.contains("'Municipality'").sum() == df['Municipality'].count()

True

In [47]:
df['details_structured'].str.contains("'Living space'").sum() == df['Living space'].count()

True

In [482]:
df['details_structured'].str.contains("'Plot area'").sum()

4696

In [488]:
df['details_structured'].str.contains("'Floor space'").sum()

2780

In [483]:
df['details_structured'].str.contains("'Availability'").sum()

12663

In [484]:
df['details_structured'].str.contains("'location'").sum()

13378

In [485]:
df['details_structured'].str.contains("'description'").sum()

13378

In [486]:
df['details_structured'].str.contains("'detailed_description'").sum()

13378

In [487]:
df['details_structured'].str.contains("'url'").sum()

13378

In [489]:
df['details_structured'].str.contains("'table'").sum()

13378

In [490]:
df['details_structured'].str.contains("'Floor'").sum()

5315

## Plot Area

In [187]:
plot_area = ['Plot area', 'Grundstücksfläche', 'Surface du terrain', 'Superficie del terreno', 'detail_responsive#surface_property', 'Plot_area_merged']
df_plot_area = df_numeric[plot_area]

df_plot_area[plot_area[0:5]].count().cumsum() == df_plot_area[plot_area[5]].count()

Plot area                             False
Grundstücksfläche                     False
Surface du terrain                    False
Superficie del terreno                 True
detail_responsive#surface_property    False
dtype: bool

In [188]:
df_plot_area['plot_area'] = df_plot_area[plot_area[0]].fillna('') + \
  (df_plot_area[plot_area[1]]).fillna('') + \
  (df_plot_area[plot_area[2]]).fillna('') + \
  (df_plot_area[plot_area[3]]).fillna('')

(df_plot_area['plot_area'] == df_plot_area[plot_area[5]].fillna('')).sum()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_plot_area['plot_area'] = df_plot_area[plot_area[0]].fillna('') + \


13378
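
The message printed above is the text of pandas' `SettingWithCopyWarning`: the subset was created by selecting columns from another DataFrame, and a new column is then assigned to it. One common way to avoid the warning is to take an explicit copy when creating the subset. A minimal sketch with made-up values (the frame `source` is hypothetical):

```python
import pandas as pd

source = pd.DataFrame({
    'Plot area': ['222 m²', None],
    'Plot_area_merged': ['222 m²', None],
})

# An explicit copy makes the subset an independent DataFrame,
# so adding a column to it is unambiguous.
subset = source[['Plot area', 'Plot_area_merged']].copy()
subset['plot_area'] = subset['Plot area'].fillna('')

# The original frame is not affected by the new column.
leaked = 'plot_area' in source.columns
print(leaked)
```

The same `.copy()` call would apply to the other subsets created in the same way in the following sections.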

In [189]:
df_plot_area['plot_area'] = df_plot_area[plot_area[5]].fillna('') + \
  (df_plot_area[plot_area[4]]).fillna('')

df_plot_area['plot_area'] = df_plot_area['plot_area'].replace('', np.nan)

df_plot_area['plot_area'].count()

4953

In [190]:
df_numeric = df_numeric.drop(plot_area, axis=1) # drops 6 columns -> 13 remain
df_numeric['plot_area'] = df_plot_area['plot_area'] # adds 1 column -> 14
df_numeric.shape

(13378, 14)

In [191]:
df_numeric['plot_area'] = df_numeric.plot_area.str.extract(r'(\d+)').fillna(0).astype(int)

## Floor Space

In [192]:
floor_space = ['Floor space', 'Nutzfläche', 'Surface utile', 'Superficie utile', 'detail_responsive#surface_usable', 'Floor_space_merged']
df_floor_space = df_numeric[floor_space]

df_floor_space[floor_space[0:5]].count().cumsum() == df_floor_space[floor_space[5]].count()

Floor space                         False
Nutzfläche                          False
Surface utile                       False
Superficie utile                     True
detail_responsive#surface_usable    False
dtype: bool

In [193]:
df_floor_space['floor_space'] = df_floor_space[floor_space[0]].fillna('') + \
  (df_floor_space[floor_space[1]]).fillna('') + \
  (df_floor_space[floor_space[2]]).fillna('') + \
  (df_floor_space[floor_space[3]]).fillna('')

(df_floor_space['floor_space'] == df_floor_space[floor_space[5]].fillna('')).sum()

13378

In [194]:
df_floor_space['floor_space'] = df_floor_space[floor_space[0]].fillna('') + \
  (df_floor_space[floor_space[1]]).fillna('') + \
  (df_floor_space[floor_space[2]]).fillna('') + \
  (df_floor_space[floor_space[3]]).fillna('')

df_floor_space[df_floor_space['floor_space'] != '']['floor_space'].count()

2842

In [195]:
df_floor_space['floor_space'] = df_floor_space[floor_space[5]].fillna('') + \
  (df_floor_space[floor_space[4]]).fillna('')

df_floor_space['floor_space'] = df_floor_space['floor_space'].replace('', np.nan)

df_floor_space['floor_space'].count()

2953

In [196]:
df_numeric = df_numeric.drop(floor_space, axis=1) # drops 6 columns -> 8 remain
df_numeric['floor_space'] = df_floor_space['floor_space'] # adds 1 column -> 9
df_numeric.shape

(13378, 9)

In [197]:
df_numeric['floor_space'] = df_numeric.floor_space.str.extract(r'(\d+)').fillna(0).astype(int)

## Floor

In [198]:
floor = ['Floor', 'Stockwerk', 'Étage', 'Piano', 'detail_responsive#floor', 'Floor_merged']
df_floor = df_numeric[floor]

df_floor[floor[0:5]].count().cumsum() == df_floor[floor[5]].count()

Floor                      False
Stockwerk                  False
Étage                      False
Piano                       True
detail_responsive#floor    False
dtype: bool

In [199]:
df_floor['floor'] = df_floor[floor[0]].fillna('') + \
  (df_floor[floor[1]]).fillna('') + \
  (df_floor[floor[2]]).fillna('') + \
  (df_floor[floor[3]]).fillna('')

(df_floor['floor'] == df_floor[floor[5]].fillna('')).sum()

13378

In [200]:
df_floor['floor'] = df_floor[floor[0]].fillna('') + \
  (df_floor[floor[1]]).fillna('') + \
  (df_floor[floor[2]]).fillna('') + \
  (df_floor[floor[3]]).fillna('')

df_floor[df_floor['floor'] != '']['floor'].count()

5414

In [201]:
df_floor['floor'] = df_floor[floor[5]].fillna('') + \
  (df_floor[floor[4]]).fillna('')

df_floor['floor'] = df_floor['floor'].replace('', np.nan)

df_floor['floor'].count()

5620

In [202]:
df_numeric = df_numeric.drop(floor, axis=1) # drops 6 columns -> 3 remain
df_numeric['floor'] = df_floor['floor'] # adds 1 column -> 4
df_numeric.shape

(13378, 4)

In [203]:
df_numeric

Unnamed: 0,living_space,plot_area,floor_space,floor
0,100,0,0,4. floor
1,156,222,242,
2,93,0,0,2. floor
3,154,370,257,
4,142,0,0,Ground floor
...,...,...,...,...
13373,70,0,0,
13374,0,284,0,
13375,150,160,0,
13376,145,853,140,


# Cleaning and Parsing
## Floor

In [204]:
df_numeric['floor'].unique()

array(['4. floor', nan, '2. floor', 'Ground floor', '3. floor',
       '6. floor', '1. floor', '5. floor', '14. floor', '20. floor',
       '8. floor', '2. Basement', '7. floor', '15. floor', '10. floor',
       '11. floor', '4. Basement', '100. floor', '12. floor',
       '1. Basement', '21. floor', '9. floor', '3. Basement',
       '999. floor', '23. floor'], dtype=object)

In [205]:
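def parse_floor(x):
  """Map a floor label to a level: 'Ground floor' -> 0, 'N. floor' -> 'N', 'N. Basement' -> '-N'."""
  if x != x:  # NaN is the only value that is not equal to itself
    return np.nan
  elif x == 'Ground floor':
    return 0
  elif re.search(r'\. floor', x):
    return re.search(r'\d+', x).group()
  elif re.search('Basement', x):
    return '-' + re.search(r'\d+', x).group()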


df_numeric['floor'] = df_numeric['floor'].apply(parse_floor)
df_numeric['floor'].unique()

array(['4', nan, '2', 0, '3', '6', '1', '5', '14', '20', '8', '-2', '7',
       '15', '10', '11', '-4', '100', '12', '-1', '21', '9', '-3', '999',
       '23'], dtype=object)

In [206]:
df_numeric['floor'] = df_numeric['floor'].astype(float)
df_numeric['floor'].unique()

array([  4.,  nan,   2.,   0.,   3.,   6.,   1.,   5.,  14.,  20.,   8.,
        -2.,   7.,  15.,  10.,  11.,  -4., 100.,  12.,  -1.,  21.,   9.,
        -3., 999.,  23.])
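
The two steps above (parsing to strings, then casting to float) could also be folded into one function that returns numbers directly. The following is a hypothetical variant, not the function used in this notebook; the name `parse_floor_numeric` and the sample values are assumptions for illustration:

```python
import re

import numpy as np
import pandas as pd


def parse_floor_numeric(value):
    """Hypothetical variant of parse_floor that always returns a float."""
    if pd.isna(value):
        return np.nan
    if value == 'Ground floor':
        return 0.0
    match = re.search(r'\d+', value)
    if match is None:
        return np.nan  # unknown label: treat as missing
    level = float(match.group())
    # Basement levels are counted downwards from the ground floor.
    return -level if 'Basement' in value else level


floors = pd.Series(['4. floor', np.nan, 'Ground floor', '2. Basement'])
parsed = floors.map(parse_floor_numeric)
print(parsed.tolist())
```

Returning a single type avoids the mixed `object` column seen in the intermediate output and makes the separate `astype(float)` step unnecessary.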

In [207]:
df_numeric.std()

living_space    123.475275
plot_area       845.261158
floor_space     153.571314
floor            29.809057
dtype: float64

# Availability

In [208]:
availability = ['Availability', 'Verfügbarkeit', 'Disponibilité', 'Disponibilità', 'detail_responsive#available_from',  'Availability_merged']
df_availability = df[availability]

df_availability.iloc[:, 0:5].count().cumsum() == df_availability.iloc[:, 5].count()

Availability                        False
Verfügbarkeit                       False
Disponibilité                       False
Disponibilità                        True
detail_responsive#available_from    False
dtype: bool

In [209]:
df_availability['availability'] = df_availability[availability[0]].fillna('') + \
  (df_availability[availability[1]]).fillna('') + \
  (df_availability[availability[2]]).fillna('') + \
  (df_availability[availability[3]]).fillna('')

(df_availability['availability'] == df_availability[availability[5]].fillna('')).sum()

13378

In [210]:
df_availability['availability'] = df_availability[availability[5]].fillna('') + \
  (df_availability[availability[4]]).fillna('')

df_availability['availability'] = df_availability['availability'].replace('', np.nan)

df_availability['availability'].count()

13378

In [211]:
df_availability['availability'].unique()

array(['On request', 'Immediately', '30.12.2022', '01.12.2022',
       '01.04.2023', '01.08.2023', '01.10.2022', '01.11.2022',
       '01.09.2023', '01.07.2023', '07.07.2023', '22.10.2022',
       '01.02.2023', '01.06.2023', '31.10.2023', '01.01.2023',
       '01.05.2023', '01.12.2023', '01.10.2023', '30.11.2022',
       '31.12.2023', '20.03.2023', '19.01.2024', '01.03.2023',
       '01.05.2024', '15.08.2023', '31.12.2022', '31.03.2023',
       '30.06.2024', '01.02.2024', '31.07.2023', '02.01.2023',
       '15.10.2022', '11.11.2022', '30.11.2023', '01.04.2024',
       '01.12.2024', '30.09.2022', '01.04.2025', '01.10.2024',
       '01.07.2024', '01.11.2024', '15.12.2022', '01.06.2024',
       '01.01.2024', '01.11.2023', '25.01.2024', '24.06.2023',
       '26.10.2022', '28.02.2023', '15.09.2022', '30.09.2023',
       '30.01.2024', '03.04.2023', '15.02.2024', '01.04.2030',
       '30.04.2023', '05.09.2022', '03.10.2022', '31.05.2024',
       '31.05.2023', '31.03.2024', '30.12.2023', '16.1

In [212]:
df_numeric['availability'] = df_availability['availability']
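
The merged column mixes free-text values ('On request', 'Immediately') with dates in `dd.mm.yyyy` form. Should a date-typed feature be needed later, one possible approach is to parse the dates with `pd.to_datetime` and keep the text values as separate flags. The sketch below runs on a few values from the output above; the names `sample`, `available_date` and `available_immediately` are assumptions for illustration.

```python
import pandas as pd

# A few values copied from the unique() output above.
sample = pd.Series(['On request', 'Immediately', '30.12.2022', '01.04.2023'])

# Strings that do not match the date format become NaT (errors='coerce').
available_date = pd.to_datetime(sample, format='%d.%m.%Y', errors='coerce')

# Boolean flag for listings that are available right away.
available_immediately = sample.eq('Immediately')
```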

# Gross return

In [213]:
df['Gross return'].unique()

array([nan, '0.00 %', '4.5 %'], dtype=object)

In [214]:
df['Gross return'].count()

6

With only 6 non-null entries in 13,378 rows, this column carries almost no information, so we do not consider it in our analysis.
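
The share of filled rows makes this concrete. A small sketch on a made-up series with the same proportions (6 filled entries in 13,378 rows); on the real data the same expression would be applied to `df['Gross return']`. The name `gross_return_toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up series: 6 filled entries, the rest missing (13,378 rows in total).
gross_return_toy = pd.Series(['4.5 %'] * 6 + [np.nan] * (13378 - 6))

# Share of rows that carry a value.
fill_rate = gross_return_toy.notna().mean()
print(f'{fill_rate:.4%}')
```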

# Rooms

In [215]:
df['rooms'].value_counts()

5.0    10937
6.0      344
7.0      309
4.0      250
8.0      239
3.0      200
2.0      152
9.0      140
0.0      138
1.0       90
Name: rooms, dtype: int64

The existing `rooms` column appears unreliable: 10,937 rows carry the value 5.0 and no half rooms occur at all. We therefore parse the room count from the `details_structured` column instead.

In [216]:
def parse_rooms(x):
  # Extract a room count such as '4.5' from text like '4.5 rooms'
  pattern = r'(\d+\.\d) rooms'
  match = re.search(pattern, x)
  if match is not None:
    result = match.group(1)
  else:
    result = np.nan
  return result

df_numeric['rooms'] = df['details_structured'].apply(parse_rooms).astype('float')

df_numeric['rooms'].value_counts()

4.5     3481
3.5     2318
5.5     2188
6.5      918
2.5      850
7.5      362
8.5      195
1.5       86
9.5       64
10.5      42
12.5      23
11.5      21
13.5      12
14.5      11
15.5       3
17.5       3
16.5       2
21.5       1
Name: rooms, dtype: int64
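
The pattern only matches room counts written with one decimal place, such as `4.5 rooms`. Every parsed value above ends in .5, and the counts add up to 10,580 of the 13,378 rows, so a listing that states a whole number (`4 rooms`) would come back as NaN. Whether such strings actually occur in `details_structured` is not checked here. A hedged sketch of a vectorised variant with an optional decimal part, run on made-up strings (`details_toy` and `rooms_toy` are hypothetical names):

```python
import pandas as pd

# Made-up inputs in the style of the listing text; not taken from the dataset.
details_toy = pd.Series([
    "3.5 rooms, 100 m²",
    "4 rooms, 120 m²",
    "Living space: 80 m²",
])

# Optional decimal part: matches '3.5 rooms' as well as '4 rooms'.
rooms_toy = details_toy.str.extract(r'(\d+(?:\.\d)?) rooms', expand=False).astype(float)
```

Rows without a match stay NaN, as in `parse_rooms`.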

# Price

In [217]:
df_numeric['price'] = df['price_cleaned']

In [218]:
df_numeric

Unnamed: 0,living_space,plot_area,floor_space,floor,availability,rooms,price
0,100,0,0,4.0,On request,3.5,1150000.0
1,156,222,242,,On request,4.5,1420000.0
2,93,0,0,2.0,Immediately,2.5,720000.0
3,154,370,257,,On request,4.5,1430000.0
4,142,0,0,0.0,On request,4.5,995000.0
...,...,...,...,...,...,...,...
13373,70,0,0,,On request,2.5,1101000.0
13374,0,284,0,,On request,,1750000.0
13375,150,160,0,,On request,6.5,1415000.0
13376,145,853,140,,Immediately,,1465000.0


# Function

In [268]:
def clean_integers(df):
  '''Cleans the following columns in a dataframe and removes unnecessary columns:
  - Living Space
  - Plot Area
  - Floor Space
  - Floor
  - Rooms
  - Price
  - Availability

  Args:
      df (pandas DataFrame): The .csv as dataframe

  Returns:
      pandas DataFrame: The same dataframe, modified in place, with the cleaned columns
  '''
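  def parse_rooms(x):
    # Extract a room count such as '4.5' from text like '4.5 rooms'
    pattern = r'(\d+\.\d) rooms'
    match = re.search(pattern, x)
    if match is not None:
      result = match.group(1)
    else:
      result = np.nan
    return result

  def parse_floor(x):
    if x != x:  # NaN is the only value that is not equal to itself
      return np.nan
    elif x == 'Ground floor':
      return 0
    elif re.search(r'\. floor', x):
      return re.search(r'\d+', x).group()
    elif re.search('Basement', x):
      # Basement levels are stored as negative floor numbers
      return '-' + re.search(r'\d+', x).group()
    else:
      # Anything else cannot be parsed
      return np.nan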

  col_names = df.columns.array
  col_names[0:2] = ['Index1', 'Index2']
  df.columns = col_names  

  columns_to_drop = ['Index1', 'Index2', 'Living space', 'Plot area', 'Floor space', 'Floor', 'detail_responsive#surface_living', 'detail_responsive#floor', 'Wohnfläche', 'Stockwerk', 'Nutzfläche', 'Grundstücksfläche', 'detail_responsive#surface_property', 'detail_responsive#surface_usable', 'Surface habitable', 'Surface du terrain', 'Surface utile', 'Étage', 'Superficie abitabile', 'Piano', 'Superficie del terreno', 'Superficie utile', 'Floor_merged', 'Living_space_merged', 'Floor_space_merged', 'Plot_area_merged', 'Space extracted', 'Gross return', 'price_cleaned', 'Availability', 'Verfügbarkeit', 'Disponibilité', 'Disponibilità', 'detail_responsive#available_from',  'Availability_merged']
  
  # Merge columns
  df['living_space'] = df['Space extracted']
  df['plot_area'] = df['Plot_area_merged'].fillna('') + \
    df['detail_responsive#surface_property'].fillna('')
  df['floor_space'] = df['Floor_space_merged'].fillna('') + \
    df['detail_responsive#surface_usable'].fillna('')
  df['floor'] = df['Floor_merged'].fillna('') + \
    df['detail_responsive#floor'].fillna('')
  df['availability'] = df['Availability_merged'].fillna('') + \
    df['detail_responsive#available_from'].fillna('')
  df['price'] = df['price_cleaned']

  # Parsing
  df['plot_area'] = df['plot_area'].replace('', np.nan).str.extract(r'(\d+)').astype(float)
  df['floor_space'] = df['floor_space'].replace('', np.nan).str.extract(r'(\d+)').astype(float)
  df['floor'] = df['floor'].replace('', np.nan).apply(parse_floor).astype(float)
  df['availability'] = df['availability'].replace('', np.nan)
  df['rooms'] = df['details_structured'].apply(parse_rooms).astype(float)

  df.drop(columns_to_drop, axis=1, inplace=True)

  return df  
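
To illustrate what `parse_floor` maps, here is a standalone sketch of the same logic applied to a few inputs. The sample strings are assumptions about the raw format, inferred from the patterns in the function (`'Ground floor'`, `'<n>. floor'`, `'<n>. Basement'`); the actual values in the dataset may differ. Unlike the original helper, the sketch returns floats directly and yields NaN when no digit is found.

```python
import re

import numpy as np
import pandas as pd


def parse_floor_sketch(x):
    """Standalone variant of the floor logic that returns floats."""
    if pd.isna(x):
        return np.nan
    if x == 'Ground floor':
        return 0.0
    number = re.search(r'\d+', x)
    if number is None:
        # No digit to work with, so the value cannot be parsed.
        return np.nan
    if re.search(r'\. floor', x):
        return float(number.group())
    if 'Basement' in x:
        # Basement levels become negative floor numbers.
        return -float(number.group())
    return np.nan


# Hypothetical inputs, not taken from the dataset.
floors_toy = pd.Series(['Ground floor', '4. floor', '2. Basement', np.nan])
floors_parsed = floors_toy.apply(parse_floor_sketch)
```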

In [270]:
df_cleaned = clean_integers(pd.read_csv('../data/immoscout_cleaned_lat_lon_fixed_v9.csv', low_memory=False))
df_cleaned.head()

Unnamed: 0,Municipality,location,description,detailed_description,url,table,detail_responsive#municipality,Gemeinde,Commune,Comune,Municipality_merged,location_parsed,title,details,address,price,link,details_structured,lat,lon,index,ForestDensityL,ForestDensityM,ForestDensityS,Latitude,Locality,Longitude,NoisePollutionRailwayL,NoisePollutionRailwayM,NoisePollutionRailwayS,NoisePollutionRoadL,NoisePollutionRoadM,NoisePollutionRoadS,PopulationDensityL,PopulationDensityM,PopulationDensityS,RiversAndLakesL,RiversAndLakesM,RiversAndLakesS,WorkplaceDensityL,WorkplaceDensityM,WorkplaceDensityS,Zip,distanceToTrainStation,gde_area_agriculture_percentage,gde_area_forest_percentage,gde_area_nonproductive_percentage,gde_area_settlement_percentage,gde_average_house_hold,gde_empty_apartments,gde_foreigners_percentage,gde_new_homes_per_1000,gde_politics_bdp,gde_politics_cvp,gde_politics_evp,gde_politics_fdp,gde_politics_glp,gde_politics_gps,gde_politics_pda,gde_politics_rights,gde_politics_sp,gde_politics_svp,gde_pop_per_km2,gde_population,gde_private_apartments,gde_social_help_quota,gde_tax,gde_workers_sector1,gde_workers_sector2,gde_workers_sector3,gde_workers_total,type,rooms,living_space,plot_area,floor_space,floor,availability
0,Biberstein,"5023 Biberstein, AG","3.5 rooms, 100 m²«Luxuriöse Attika-Wohnung mit...",DescriptionLuxuriöse Attika-Wohnung direkt an ...,https://www.immoscout24.ch//en/d/penthouse-buy...,b <article class=####Box-cYFBPY hKrxoH####><h2...,,,,,Biberstein,Strasse: plz:5023 Stadt: Biberstein Kanton: AG,Luxuriöse Attika-Wohnung mit herrlicher Aussicht,"3.5 rooms, 100 m²,","5023 Biberstein, AG",1150000.0,/en/d/penthouse-buy-biberstein/7255200,"{'Municipality': 'Biberstein', 'Living space':...",47.4171,8.0856,16620,0.511176,0.286451,0.090908,47.415927,Biberstein,8.08584,0.0,0.0,0.0,0.058298,0.067048,0.10385,0.092914,0.20953,0.366674,0.08217,0.001811,0.011871,0.030169,0.05212,0.098951,5023,3.038467,30.676329,51.449275,4.589372,13.285024,2.23,1.994681,9.255663,4.739336,5.873715,4.579662,3.359031,18.35536,6.057269,7.066814,,0.220264,20.392805,30.809471,376.829268,1545.0,686.0,2.234259,5.89,14.0,9.0,308.0,331.0,penthouse,3.5,100.0,,,4.0,On request
1,Biberstein,"Buhldenstrasse 8d5023 Biberstein, AG","4.5 rooms, 156 m²«Stilvolle Liegenschaft - ruh...",DescriptionStilvolle Liegenschaft an ruhiger L...,https://www.immoscout24.ch//en/d/terrace-house...,b <article class=####Box-cYFBPY hKrxoH####><h2...,,,,,Biberstein,Strasse:Buhldenstrasse 8d plz:5023 Stadt: Bib...,"Stilvolle Liegenschaft - ruhige Lage, unverbau...","4.5 rooms, 156 m²,","Buhldenstrasse 8d, 5023 Biberstein, AG",1420000.0,/en/d/terrace-house-buy-biberstein/7266694,"{'Municipality': 'Biberstein', 'Living space':...",47.4195,8.0827,16620,0.511176,0.286451,0.090908,47.415927,Biberstein,8.08584,0.0,0.0,0.0,0.058298,0.067048,0.10385,0.092914,0.20953,0.366674,0.08217,0.001811,0.011871,0.030169,0.05212,0.098951,5023,3.038467,30.676329,51.449275,4.589372,13.285024,2.23,1.994681,9.255663,4.739336,5.873715,4.579662,3.359031,18.35536,6.057269,7.066814,,0.220264,20.392805,30.809471,376.829268,1545.0,686.0,2.234259,5.89,14.0,9.0,308.0,331.0,terrace-house,4.5,156.0,222.0,242.0,,On request
2,,"5022 Rombach, AG","2.5 rooms, 93 m²«Moderne, lichtdurchflutete At...","detail_responsive#description_title2,5 Zimmerw...",https://www.immoscout24.ch//en/d/penthouse-buy...,b <article class=####Box-cYFBPY hKrxoH####><h2...,Küttigen,,,,,Strasse: plz:5022 Stadt: Rombach Kanton: AG,"Moderne, lichtdurchflutete Attikawohnung mit E...","2.5 rooms, 93 m²,","5022 Rombach, AG",720000.0,/en/d/penthouse-buy-rombach/7261389,"{'detail_responsive#municipality': 'Küttigen',...",47.4033,8.033,17812,0.163362,0.095877,0.001911,47.397416,Aarau,8.04315,0.0,0.0,0.0,0.334957,0.381257,0.297575,0.325887,0.393783,0.635194,0.154274,0.188229,0.0,0.172646,0.16385,0.16583,5000,0.909587,11.35442,32.197891,7.137064,49.310624,2.01,2.023799,21.358623,3.814582,3.633134,5.324421,3.782202,18.089552,7.899807,8.851305,,0.735032,26.515854,22.66229,1704.700162,21036.0,10149.0,3.54901,6.05,37.0,3092.0,30364.0,33493.0,penthouse,2.5,93.0,,,2.0,Immediately
3,Biberstein,"Buhaldenstrasse 8A5023 Biberstein, AG","4.5 rooms, 154 m²«AgentSelly - Luxuriöses Eckh...",DescriptionDieses äusserst grosszügige Minergi...,https://www.immoscout24.ch//en/d/detached-hous...,b <article class=####Box-cYFBPY hKrxoH####><h2...,,,,,Biberstein,Strasse:Buhaldenstrasse 8A plz:5023 Stadt: Bi...,AgentSelly - Luxuriöses Eckhaus an toller Süd-...,"4.5 rooms, 154 m²,","Buhaldenstrasse 8A, 5023 Biberstein, AG",1430000.0,/en/d/detached-house-buy-biberstein/7047212,"{'Municipality': 'Biberstein', 'Living space':...",47.415643,8.085423,16620,0.511176,0.286451,0.090908,47.415927,Biberstein,8.08584,0.0,0.0,0.0,0.058298,0.067048,0.10385,0.092914,0.20953,0.366674,0.08217,0.001811,0.011871,0.030169,0.05212,0.098951,5023,3.038467,30.676329,51.449275,4.589372,13.285024,2.23,1.994681,9.255663,4.739336,5.873715,4.579662,3.359031,18.35536,6.057269,7.066814,,0.220264,20.392805,30.809471,376.829268,1545.0,686.0,2.234259,5.89,14.0,9.0,308.0,331.0,detached-house,4.5,154.0,370.0,257.0,,On request
4,Küttigen,"5022 Rombach, AG","4.5 rooms, 142 m²«MIT GARTENSITZPLATZ UND VIEL...",DescriptionAus ehemals zwei Wohnungen wurde ei...,https://www.immoscout24.ch//en/d/flat-buy-romb...,b <article class=####Box-cYFBPY hKrxoH####><h2...,,,,,Küttigen,Strasse: plz:5022 Stadt: Rombach Kanton: AG,MIT GARTENSITZPLATZ UND VIELEN EXTRAS,"4.5 rooms, 142 m²,","5022 Rombach, AG",995000.0,/en/d/flat-buy-rombach/7293107,"{'Municipality': 'Küttigen', 'Living space': '...",47.403824,8.048288,12716,0.333865,0.279276,0.145835,47.40487,Rombach,8.052781,0.0,0.0,0.0,0.133498,0.132933,0.235917,0.190986,0.136984,0.204549,0.109586,0.141473,0.091805,0.04695,0.038008,0.055509,5022,1.460245,33.13709,49.705635,1.17746,15.979815,2.28,0.691563,15.90199,1.160862,5.21774,5.728026,5.006679,19.158429,6.502805,7.477959,,0.892332,20.459524,27.590168,511.008403,6081.0,2638.0,1.708126,6.3,65.0,349.0,941.0,1355.0,flat,4.5,142.0,,,0.0,On request
