# Purpose
This notebook explores how much more information can be extracted from the following columns compared with the first version of the dataset:
- `description` (+8450)
- `description_detailed` (new)
- `detailed_description` (+9126)
- `table` (=)
- `details_structured` (=)
- `details` (=)

# Summary
New data has been added to the columns `description` and `detailed_description`. The column `description_detailed` is entirely new, and the remaining columns still contain the same information as before. Because the new content is unstructured free text, it is not feasible to reliably extract common data from any of these columns.
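
The figures in parentheses above read like changes in non-null row counts between the two dataset versions. A minimal sketch of how such deltas could be computed is shown below. It uses two small toy frames (`df_v1`, `df_v2`) and a hypothetical helper `count_delta`; neither the frames nor the helper come from the original notebook.

```python
import pandas as pd

# Toy stand-ins for the first and second dataset versions (assumed data,
# not the real files).
df_v1 = pd.DataFrame({
    "description": ["a", None, None],
    "table": ["x", "y", None],
})
df_v2 = pd.DataFrame({
    "description": ["a", "b", "c", None],
    "description_detailed": [None, "d", "e", None],
    "table": ["x", "y", None, None],
})


def count_delta(old: pd.DataFrame, new: pd.DataFrame) -> pd.Series:
    """Return the change in non-null counts for every column of `new`.

    Columns that do not exist in `old` are treated as having zero
    non-null values, so a brand-new column shows its full count.
    """
    old_counts = old.count().reindex(new.columns, fill_value=0)
    return new.count() - old_counts


delta = count_delta(df_v1, df_v2)
print(delta)
```

A delta of zero corresponds to the `(=)` annotation, while a column that is absent from the older frame corresponds to `(new)`.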

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import sweetviz as sv
import re


In [2]:
# Set display options
pd.set_option("display.max_columns", None, "display.max_rows", 100)


In [3]:
# Read Data
df = pd.read_parquet(
    "https://github.com/Immobilienrechner-Challenge/data/blob/main/immo_data_202208_v2.parquet?raw=true"
)


# Description
The following statements help to get an idea of the data contained in the `description` column.

In [4]:
df["description"].count()


21805

In [5]:
df["description"].head()


0    3.5 rooms, 100 m²«Luxuriöse Attika-Wohnung mit...
1    4.5 rooms, 156 m²«Stilvolle Liegenschaft - ruh...
2    2.5 rooms, 93 m²«Moderne, lichtdurchflutete At...
3    4.5 rooms, 154 m²«AgentSelly - Luxuriöses Eckh...
4    4.5 rooms, 142 m²«MIT GARTENSITZPLATZ UND VIEL...
Name: description, dtype: object

In [6]:
df["description"].tail()


22476    None
22477    None
22478    None
22479    None
22480    None
Name: description, dtype: object

In [7]:
df["description"].value_counts()


CHANTIER OUVERT!FRAIS DE NOTAIRE ET DROITS DE MUTATION RÉDUITS !Idéalement située à quelques minutes de la gare CFF de Payerne et de toutes ses commodités, découvrez notre nouveau projet immobilier "Le Saule". Ce projet se compose de 18 appartements du 2,5 au 4,5 pièces aux surfaces habitables variant de 55 à 110 m2. Les appartements situés au rez disposent de belles terrasses et certains de jardins privatifs alors que les appartements aux étages bénéficient d'un balcon aux surfaces variables entre 11 et 16 m2.Le bâtiment est réparti sur 4 niveaux: sous-sol, rez, 1er étage et 2ème étage. Au sous-sol, vous trouverez les caves, le local vélo-poussettes ainsi que le garage souterrain qui compte 14 places de parc. Les ascenseurs desservent tous les appartements.                                                                                                                                                                                                                                        

The output consists of unstructured free text, which makes it hard to reliably extract values for known features.
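
To put a number on how unreliable a pattern-based extraction would be, one option is to measure the share of rows that match a given regular expression. The sketch below assumes a prefix of the form `<rooms> rooms, <area> m²`, as seen in the `head()` output, and runs on a few sample strings copied from the outputs above. The pattern `PREFIX` and the group names `rooms` and `area_m2` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Sample values taken from the outputs above (truncated as displayed there).
sample = pd.Series(
    [
        "3.5 rooms, 100 m²«Luxuriöse Attika-Wohnung mit...",
        "4.5 rooms, 156 m²«Stilvolle Liegenschaft - ruh...",
        "CHANTIER OUVERT!FRAIS DE NOTAIRE ET DROITS DE MUTATION RÉDUITS !",
        None,
    ],
    name="description",
)

# Assumed pattern: a room count and a living area at the start of the text.
PREFIX = r"^(?P<rooms>\d+(?:\.\d+)?) rooms, (?P<area_m2>\d+) m²"

# str.extract returns one column per named group; rows without a match
# (including missing values) are filled with NaN.
extracted = sample.str.extract(PREFIX)

# Share of rows in which the pattern was found.
match_share = extracted["rooms"].notna().mean()
print(extracted)
print(f"match share: {match_share:.0%}")
```

Running the same check on `df["description"]` would show how many rows follow the prefix at all; a low share would support the conclusion that the column cannot be parsed reliably.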

# Description Detailed

In [8]:
df["description_detailed"].count()


9126

The count suggests that the column `description_detailed` is only filled for rows from homegate.ch.

In [9]:
# Select rows where description_detailed is not null and display unique values of provider
df[df["description_detailed"].notna()]["provider"].unique()


array(['homegate.ch'], dtype=object)

This confirms it: every row with a non-null `description_detailed` comes from homegate.ch. Let's see what data is stored in this column.
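
An alternative view of the same check is to count the filled values per provider, which also shows the providers that have no values at all. The following is a minimal sketch on toy data; the provider name `other-provider` is a made-up placeholder, since only homegate.ch is named in this notebook.

```python
import pandas as pd

# Toy data: "other-provider" is a hypothetical placeholder name.
toy = pd.DataFrame({
    "provider": ["homegate.ch", "homegate.ch", "other-provider", "other-provider"],
    "description_detailed": ["Description\n...", None, None, None],
})

# count() ignores missing values, so providers without any filled rows
# appear with a count of 0 instead of being dropped.
filled_per_provider = toy.groupby("provider")["description_detailed"].count()
print(filled_per_provider)
```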

In [10]:
df["description_detailed"].value_counts()


Description\n"LE SAULE - 18 logements MINERGIE DU 2,5 au 4,5 pièces à vendre"\nCHANTIER OUVERT!\n\nFRAIS DE NOTAIRE ET DROITS DE MUTATION RÉDUITS !\n\nIdéalement située à quelques minutes de la gare CFF de Payerne et de toutes ses commodités, découvrez notre nouveau projet immobilier "Le Saule".\n\nCe projet se compose de 18 appartements du 2,5 au 4,5 pièces aux surfaces habitables variant de 55 à 110 m2.\n\nLes appartements situés au rez disposent de belles terrasses et certains de jardins privatifs alors que les appartements aux étages bénéficient d'un balcon aux surfaces variables entre 11 et 16 m2.\n\nLe bâtiment est réparti sur 4 niveaux: sous-sol, rez, 1er étage et 2ème étage.\n\nAu sous-sol, vous trouverez les caves, le local vélo-poussettes ainsi que le garage souterrain qui compte 14 places de parc.\n\nLes ascenseurs desservent tous les appartements.                                                                                                                                 

As with `detailed_description` in the first dataset, `description_detailed` contains unstructured text, which makes reliable extraction difficult.

# Detailed Description

In [11]:
df["detailed_description"].count()


22481

In [12]:
df["detailed_description"].value_counts()


Extract from the debt collection registerIn a few days by e-mail and by post at your home. Per invoice, for CHF 29.–Order the extract                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   

As before, this column is not suitable for extracting data with regular expressions.

# Table

In [13]:
df["table"].count()


13355

In [14]:
df[df["provider"] == "homegate.ch"]["table"].value_counts()


Series([], Name: table, dtype: int64)

No new data has been added to this column, so the findings from [this notebook](../../v1/exports/2-daw_raw.html) still hold.

# Details Structured

In [15]:
df["details_structured"].count()


13355

In [16]:
df[df["provider"] == "homegate.ch"]["details_structured"].value_counts()


Series([], Name: details_structured, dtype: int64)

The same goes for `details_structured`.

# Details

In [17]:
df["details"].count()


13179

In [18]:
df[df["provider"] == "homegate.ch"]["details"].value_counts()


Series([], Name: details, dtype: int64)

The same applies to `details`.
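
The three per-column checks for `table`, `details_structured` and `details` could also be combined into a single step. The sketch below does this on toy data; the frame `toy`, the list `unchanged_columns` and the provider name `other-provider` are assumptions made for illustration.

```python
import pandas as pd

# Toy data: homegate.ch rows carry no values in the three columns.
toy = pd.DataFrame({
    "provider": ["homegate.ch", "homegate.ch", "other-provider"],
    "table": [None, None, "t"],
    "details_structured": [None, None, "s"],
    "details": [None, None, None],
})

unchanged_columns = ["table", "details_structured", "details"]

# Select the homegate.ch rows and count non-null values per column.
is_homegate = toy["provider"] == "homegate.ch"
homegate_counts = toy.loc[is_homegate, unchanged_columns].count()
print(homegate_counts)
```

If every count is zero, the homegate.ch rows contribute nothing to these columns.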