# Purpose
This notebook explores what information can be extracted from the following columns:
- `description`
- `detailed_description`
- `table`
- `details_structured`
- `details`

# Summary
| Column | Contains information on |
| ------ | ----------------------- |
| `description` | `Rooms`, `Living Space` and `Price` |
| `detailed_description` | Free text; no data can be extracted in a structured way |
| `table` | `Availability`, `Municipality`, `Floor`, `Floor Space`, `Gross Return`, `Plot Area` and `Living Space` |
| `details_structured` | `Availability`, `Municipality`, `Floor`, `Floor Space`, `Gross Return`, `Plot Area`, `Living Space`, `Price` and `Rooms` |
| `details` | `Living Space` and `Rooms` |

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re


In [2]:
# Set display options
pd.set_option(
    "display.max_columns", None, "display.max_rows", 100, "display.max_colwidth", None
)


In [3]:
# Read Data
df = pd.read_csv(
    "https://raw.githubusercontent.com/Immobilienrechner-Challenge/data/main/immo_data_202208.csv",
    low_memory=False,
)


# Description
Let's look at the `value_counts()` output to get an idea of the information contained in the `description` column.

In [4]:
df["description"].value_counts()


4.5 rooms, 153 m²«Duplex dans les combles avec 2 terrasses !»CHF 686,700.—Favourite                                                                               4
3.5 rooms, 98 m²«Belle promotion Minergie de 22 appartements au calme ! Du 2.5 pces au 4.5 pces !»CHF 495,000.—Favourite                                          4
5.5 rooms, 153 m²«####Les Vergers d Ollon#### à Ollon VD Magnifique Villa Mitoyenne avec un grand jardin d environ 1 000 m2 à vendre»CHF 1,260,000.—Favourite     4
5.5 rooms, 170 m²«NOUVELLE PROMOTION»CHF 1,795,000.—Favourite                                                                                                     4
2.5 rooms, 82 m²«Quartier Saint-Michel Appartement 2.5 pces au 3e»CHF 492,000.—Favourite                                                                          3
                                                                                                                                                                 ..
3.5 rooms, 101 m

Two things immediately stand out:
- The same description has been recorded multiple times for different objects.
- There seems to be a distinctive pattern in the data contained in the `description` column.

Since we cannot inspect every row manually, we built a regular expression to check whether the structure of the data is consistent across all observations.

In [5]:
df["description"].count()


13378

In [6]:
# Raw string so that \d and \. reach the regex engine unchanged
description_pattern = r"\d+\.?\d? *rooms,? *\d+ *m² *«.+» *CHF *[\d,]+\."
df["description"].str.contains(description_pattern).sum()


11201

The `description` column is populated for all 13378 observations in the dataset, 11201 of which follow the defined structure. What about the rest?

In [7]:
is_structured = df["description"].str.contains(description_pattern)
not_structured = is_structured[is_structured == False]
not_structured.count()


2177

In [8]:
# Select by index label (.loc) rather than by position (.iloc)
df.loc[not_structured.index, "description"].head(10)


7                                                       4.5 rooms«Preishit! Grossräumige Wohnung mitten in Aarau»CHF 590,000.—Favourite
15                        258 m²«Mehrgenerationenhaus mit grossem Garten oder Anlageimmobilie? Sie entscheiden»CHF 1,580,000.—Favourite
21                                                    4.5 rooms, 236 m²«Terrassenhaus mit malerischer Weitsicht»Price on requestFavorit
22    167 m²«EFH 6.5 davon 1 Zi-Studio (Büro / Praxis), Garten Terrasse mit Aussicht, Nähe Schule, Einkauf, ÖV»CHF 1,095,000.—Favourite
33                        258 m²«Mehrgenerationenhaus mit grossem Garten oder Anlageimmobilie? Sie entscheiden»CHF 1,580,000.—Favourite
35                                                      4.5 rooms«Preishit! Grossräumige Wohnung mitten in Aarau»CHF 590,000.—Favourite
41                                                  4.5 rooms, 150 m²«####Two in One#### mit Einliegerwohnung»Price on requestFavourite
44       7.5 rooms, 216 m²«Verzauberndes Generat

So not every row of `description` contains complete information about the rooms, living space and price.
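
Because a field can be missing, one pattern per feature is more forgiving than a single pattern for the whole string. The following is a minimal sketch of such an extraction, run on four rows copied from the outputs above. The names `sample` and `extracted` and the column names are assumptions made for illustration, and the patterns have only been checked against these rows, not against the full dataset.

```python
import pandas as pd

# Rows copied from the outputs above (illustrative sample, not the full column)
sample = pd.Series(
    [
        "4.5 rooms, 153 m²«Duplex dans les combles avec 2 terrasses !»CHF 686,700.—Favourite",
        "4.5 rooms«Preishit! Grossräumige Wohnung mitten in Aarau»CHF 590,000.—Favourite",
        "258 m²«Mehrgenerationenhaus mit grossem Garten oder Anlageimmobilie? Sie entscheiden»CHF 1,580,000.—Favourite",
        "4.5 rooms, 236 m²«Terrassenhaus mit malerischer Weitsicht»Price on requestFavorit",
    ]
)

# Rooms: a number at the very start of the string, followed by "rooms"
rooms = sample.str.extract(r"^(\d+(?:\.\d+)?) *rooms", expand=False).astype(float)

# Living space: a number followed by "m²" that appears before the opening «
living_space = sample.str.extract(r"^[^«]*?(\d+) *m²", expand=False).astype(float)

# Price: digits and thousands separators after "» CHF"; commas are dropped before casting
price = (
    sample.str.extract(r"» *CHF *([\d,]+)", expand=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

extracted = pd.DataFrame(
    {"rooms": rooms, "living_space_m2": living_space, "price_chf": price}
)
print(extracted)
```

Rows without a given field end up as `NaN` in that column instead of being dropped.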

## Living Space

In [9]:
df["description"].str.contains(r"\d *m²", flags=re.IGNORECASE).sum()


12308

## Price

In [10]:
df["description"].str.contains("CHF", flags=re.IGNORECASE).sum()


12363

## Rooms

In [11]:
df["description"].str.contains(r"\d *rooms", flags=re.IGNORECASE).sum()


12799

# Detailed Description

In [12]:
df["detailed_description"].value_counts()


Extract from the debt collection registerIn a few days by e-mail and by post at your home. Per invoice, for CHF 29.–Order the extract                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

As the name suggests, this column contains the detailed free-text description of the listing. The text does not follow any specific pattern, so it is not a reliable source for the features investigated in this notebook, and we discard it.  
With natural language processing techniques, however, this column could become useful for fine-tuning predictions.

# Table

In [13]:
df["table"].value_counts()


b <article class=####Box-cYFBPY hKrxoH####><h2 class=####Box-cYFBPY gZLPvm####>Main information</h2><table class=####DataTable__StyledTable-sc-1o2xig5-1 jbXaEC####><tbody><tr><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__Cell-sc-1o2xig5-4 edrNfG dGBatU####>Municipality</td><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__CellValue-sc-1o2xig5-3 edrNfG rJZBK####>Le Mouret</td></tr><tr><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__Cell-sc-1o2xig5-4 edrNfG dGBatU####>Availability</td><td class=####DataTable__SimpleCell-sc-1o2xig5-2 DataTable__CellValue-sc-1o2xig5-3 edrNfG rJZBK####>On request</td></tr></tbody></table><hr class=####Divider-iprSaI bBhTLQ####/></article>                                                                                                                                                                                                                                                                                                 

This output suggests that there's information in the column `table` on the Municipality, Plot Area, Availability, Floor, Floor Space and Living Space.   
Let's inspect how many observations we can find in `table` per feature.
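
The per-feature checks in this section all have the same shape, so they could also be run in a single pass. The sketch below shows the idea on a tiny stand-in for `df["table"]`. The names `toy_table`, `keywords` and `coverage` are hypothetical, and the toy cells are simplified compared with the real HTML.

```python
import re

import pandas as pd

# Hypothetical miniature stand-in for df["table"]; the real cells hold full HTML
toy_table = pd.Series(
    [
        "<td>Municipality</td><td>Le Mouret</td><td>Availability</td><td>On request</td>",
        "<td>Municipality</td><td>Biberstein</td><td>Living space</td><td>100 m²</td>",
        None,  # a listing without a table
    ]
)

keywords = ["Availability", "Municipality", "Living space", "Price"]

# re.escape makes each keyword match literally; na=False treats missing cells as "no match"
coverage = pd.Series(
    {
        keyword: toy_table.str.contains(
            re.escape(keyword), flags=re.IGNORECASE, na=False
        ).sum()
        for keyword in keywords
    }
)
print(coverage)
```

The result is one count per keyword, which makes the features easier to compare side by side.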

## Availability

In [14]:
df["table"].str.contains("availability", flags=re.IGNORECASE).sum()


12663

## Address
### Municipality

In [15]:
df["table"].str.contains("Municipality", flags=re.IGNORECASE).sum()


12897

### Zip Code

In [16]:
df["table"].str.contains("Zip", flags=re.IGNORECASE).sum()


0

### Canton

In [17]:
df["table"].str.contains("Canton", flags=re.IGNORECASE).sum()


23

### Street

In [18]:
df["table"].str.contains("Street", flags=re.IGNORECASE).sum()


0

In [19]:
df["table"].str.contains("location", flags=re.IGNORECASE).sum()


0

In [20]:
df["table"].str.contains("address", flags=re.IGNORECASE).sum()


0

## Floor

In [21]:
df["table"].str.contains("floor", flags=re.IGNORECASE).sum()


7173
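
Note that a case-insensitive search for `floor` also matches the label `Floor space`, so the count above may include listings that only report a floor space. The sketch below illustrates the difference on hypothetical cells that mimic the label/value layout of `table`. Matching the label cell via `>Floor</td>` is an assumption based on the HTML shown at the top of this section.

```python
import pandas as pd

# Hypothetical cells mimicking the label/value layout of df["table"]
toy_table = pd.Series(
    [
        "<td class=x>Floor</td><td class=y>4. floor</td>",
        "<td class=x>Floor space</td><td class=y>120 m²</td>",
        "<td class=x>Municipality</td><td class=y>Le Mouret</td>",
    ]
)

# Plain substring search: the "Floor space" row is counted as well
loose = toy_table.str.contains("floor", case=False).sum()

# Label cell only: ">Floor</td>" cannot match ">Floor space</td>"
strict = toy_table.str.contains(r">Floor</td>", case=False).sum()

print(loose, strict)
```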

## Floor Space

In [22]:
df["table"].str.contains("floor space", flags=re.IGNORECASE).sum()


2780

## Gross Return

In [23]:
df["table"].str.contains("gross return", flags=re.IGNORECASE).sum()


6

## Plot Area

In [24]:
df["table"].str.contains("Plot area", flags=re.IGNORECASE).sum()


4696

## Living Space

In [25]:
df["table"].str.contains("living space", flags=re.IGNORECASE).sum()


11634

## Price

In [26]:
df["table"].str.contains("price", flags=re.IGNORECASE).sum()


0

## Rooms

In [27]:
df["table"].str.contains("rooms", flags=re.IGNORECASE).sum()


0
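
Beyond counting keywords, the label/value pairs themselves could be pulled out of each cell. The following is a rough sketch that relies on the row layout visible in the output at the top of this section (`<tr><td ...>label</td><td ...>value</td></tr>`). `ROW_PATTERN` and `parse_table` are hypothetical names, the class names in the sample are shortened, and a regular expression is a shortcut rather than a full HTML parser.

```python
import re

# Shortened version of the cell shown at the top of this section;
# "####" appears in the data where quotes would normally be
html = (
    "<table class=####DataTable####><tbody>"
    "<tr><td class=####Cell####>Municipality</td><td class=####CellValue####>Le Mouret</td></tr>"
    "<tr><td class=####Cell####>Availability</td><td class=####CellValue####>On request</td></tr>"
    "</tbody></table>"
)

# One table row: a label cell followed by a value cell
ROW_PATTERN = re.compile(r"<tr><td[^>]*>([^<]*)</td><td[^>]*>([^<]*)</td></tr>")


def parse_table(cell):
    """Return a {label: value} dict for one cell of the `table` column."""
    if not isinstance(cell, str):  # missing values (NaN) yield an empty dict
        return {}
    return dict(ROW_PATTERN.findall(cell))


parsed = parse_table(html)
print(parsed)
```

Applying `parse_table` to every cell would give one dictionary per listing, from which individual features can be read by label.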

# Details Structured

In [28]:
df["details_structured"].value_counts()


{'Municipality': 'Biberstein', 'Living space': '100 m²', 'Floor': '4. floor', 'Availability': 'On request', 'location': '5023 Biberstein, AG', 'description': '3.5 rooms, 100 m²«Luxuriöse Attika-Wohnung mit herrlicher Aussicht»CHF 1,150,000.—Favourite', 'detailed_description': 'DescriptionLuxuriöse Attika-Wohnung direkt an der Aare und angrenzend an die Landwirtschaftszone, mit unverbaubarer Weitsicht, grosszügiger Garage und Option auf ein zusätzliches Zimmer.Einzigartige Lage, top Aussicht und hochwertige Innenausstattung? Das alles bietet diese charmante Eigentumswohnung auf 100m2 im steuergünstigen Biberstein. Stadtnah gelegen und mit direktem Naturzugang sorgt sie für ein rundum angenehmes Wohngefühl.In der ganzen Wohnung sind hochwertige Materialien mit einem südländischen Touch verbaut. Der Boden ist mit einem Jurastein und die beiden Zimmer mit Holz versehen (mit Bodenheizung).In die Wohnung gelangt man über einen separaten Eingang, ein halbes Stockwerk vom gewachsenen Boden erh

From this structure it is apparent that the column holds information on:
- Municipality
- Living Space
- Plot Area
- Availability
- Floor Space
- Location
- Description
- Detailed Description
- URL
- Table 
- Floor

Let's see how much information we can extract from it.
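
Since the values look like Python dictionary literals, one option is to parse them and count the keys directly. This matters because a substring search also matches text inside the embedded `description` value (for example `rooms` or `Price on request`). The sketch below assumes the cells are valid dict literals, which has not been verified for the full dataset. `toy`, `parse_details` and `key_counts` are hypothetical names and the two records are made-up, shortened examples.

```python
import ast
from collections import Counter

import pandas as pd

# Made-up, shortened stand-ins for df["details_structured"]
toy = pd.Series(
    [
        "{'Municipality': 'Biberstein', 'Living space': '100 m²', 'Floor': '4. floor', "
        "'description': '3.5 rooms, 100 m²«Example»CHF 1,150,000.—Favourite'}",
        "{'Municipality': 'Le Mouret', 'Floor space': '120 m²', "
        "'description': '258 m²«Example»Price on requestFavourite'}",
    ]
)


def parse_details(cell):
    """Parse one cell into a dict; return {} if it is not a valid dict literal."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return {}
    return value if isinstance(value, dict) else {}


parsed = toy.apply(parse_details)

# Count how often each key occurs across all records
key_counts = Counter(key for record in parsed for key in record)

# For comparison: substring search, which also looks inside the values
substring_rooms = toy.str.contains("rooms", case=False).sum()

print(key_counts)
print(substring_rooms)
```

In this toy data the substring search finds `rooms` once although no record has a `Rooms` key, so key counts and substring counts can differ.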

## Availability

In [29]:
df["details_structured"].str.contains("Availability", flags=re.IGNORECASE).sum()


12664

## Address
### Municipality

In [30]:
df["details_structured"].str.contains("Municipality", flags=re.IGNORECASE).sum()


12897

### Location

In [31]:
df["details_structured"].str.contains("location", flags=re.IGNORECASE).sum()


13378

## Floor

In [32]:
df["details_structured"].str.contains("Floor", flags=re.IGNORECASE).sum()


7380

## Floor Space

In [33]:
df["details_structured"].str.contains("Floor space", flags=re.IGNORECASE).sum()


2784

## Gross Return

In [34]:
df["details_structured"].str.contains("Gross Return", flags=re.IGNORECASE).sum()


6

## Plot Area

In [35]:
df["details_structured"].str.contains("Plot area", flags=re.IGNORECASE).sum()


4696

## Living Space

In [36]:
df["details_structured"].str.contains("Living space", flags=re.IGNORECASE).sum()


11647

## Price

In [37]:
df["details_structured"].str.contains("Price", flags=re.IGNORECASE).sum()


1120

## Rooms

In [38]:
df["details_structured"].str.contains("rooms", flags=re.IGNORECASE).sum()


12816

## Description

In [39]:
df["details_structured"].str.contains("description", flags=re.IGNORECASE).sum()


13378

## Detailed Description

In [40]:
df["details_structured"].str.contains("detailed_description", flags=re.IGNORECASE).sum()


13378

## URL

In [41]:
df["details_structured"].str.contains("url", flags=re.IGNORECASE).sum()


13378

## Table

In [42]:
df["details_structured"].str.contains("table", flags=re.IGNORECASE).sum()


13378

# Details

In [43]:
df["details"].value_counts()


4.5 rooms, 120 m²,     135
4.5 rooms, 110 m²,     132
4.5 rooms,             120
3.5 rooms, 100 m²,     111
4.5 rooms, 100 m²,     102
                      ... 
121 m²,                  1
6 rooms, 132 m²,         1
5.5 rooms, 340 m²,       1
393 m²,                  1
7.5 rooms, 385 m²,       1
Name: details, Length: 2741, dtype: int64

It looks like we've got another column with information about the rooms and living space (or some other space). Let's count how often each of them appears.

## Rooms

In [44]:
df["details"].str.contains("rooms", flags=re.IGNORECASE).sum()


12799

## Living Space

In [45]:
df["details"].str.contains("m²").sum()


12777
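
The counts only show how often each feature appears. A possible next step is to extract the values themselves. The sketch below does this for four values copied from the `value_counts()` output above. `toy_details`, `parts` and the group names `rooms` and `space_m2` are assumptions made for illustration, and the pattern has only been checked against these values.

```python
import pandas as pd

# Values copied from the value_counts() output above
toy_details = pd.Series(
    ["4.5 rooms, 120 m²,", "4.5 rooms,", "121 m²,", "6 rooms, 132 m²,"]
)

# Both parts are optional, so a value with only rooms or only a space still matches
parts = toy_details.str.extract(
    r"^(?:(?P<rooms>\d+(?:\.\d+)?) *rooms,? *)?(?:(?P<space_m2>\d+) *m²)?"
).astype(float)

print(parts)
```

Named groups become the column names of the resulting DataFrame, and a missing part shows up as `NaN`.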