# Yakeey Data

This file is for inspecting and exploring data extracted from Yakeey.

In [1]:
import pandas as pd

In [2]:
yakeey_df = pd.read_json("../data/raw/yakeey/yakeey_2024-11-14.json")

In [3]:
yakeey_df.head()

| | url | type | price | neighborhood | city | title | reference | attributes | equipments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://yakeey.com/fr-ma/acheter-appartement-c... | Appartement | 1 750 000 DH | Hopitaux | Casablanca | Appartement à vendre de 126 m² dont 101 m² hab... | CA001095 | {'Nb. de chambres': '2', 'Nb. de salles de bai... | [Ascenseur, Balcon, Place de parking en sous-s... |
| 1 | https://yakeey.com/fr-ma/acheter-appartement-c... | Appartement | 1 500 000 DH | Lekrimat | Casablanca | Appartement à vendre de 128 m² | CI067647 | {'Nb. de chambres': '3', 'Nb. de salles de bai... | [Balcon, Box titré, Cuisine équipée] |
| 2 | https://yakeey.com/fr-ma/acheter-appartement-c... | Appartement | 1 800 000 DH | Hopitaux | Casablanca | Appartement à vendre de 133 m² | CI068492 | {'Nb. de chambres': '3', 'Nb. de salles de bai... | [Ascenseur, Balcon, Cuisine équipée] |
| 3 | https://yakeey.com/fr-ma/acheter-appartement-s... | Appartement | 750 000 DH | Hay safae | Salé | Appartement à vendre de 94 m² | SI067739 | {'Nb. de chambres': '1', 'Nb. de salles de bai... | [Cuisine équipée] |
| 4 | https://yakeey.com/fr-ma/acheter-appartement-s... | Appartement | 690 000 DH | Hay chmaou | Salé | Appartement à vendre de 82 m² | SI067053 | {'Nb. de chambres': '2', 'Nb. de salles de bai... | [Terrasse, Place de parking en sous-sol] |


In [4]:
yakeey_df.columns

Index(['url', 'type', 'price', 'neighborhood', 'city', 'title', 'reference',
       'attributes', 'equipments'],
      dtype='object')

In [5]:
yakeey_df["type"].value_counts()

type
Appartement    175
Studio           6
Duplex           6
Triplex          3
Name: count, dtype: int64

All of these `type` values refer to apartments, since apartments are the main property type we are working with. We might distinguish between:
- "Studio": a single-room apartment
- "Appartement": a regular apartment with multiple rooms
- "Duplex/Triplex": an apartment spanning two or more stories
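
The distinction above could be encoded as a small lookup. This is only a sketch: the group names (`studio`, `apartment`, `multi_story`) and the helper `group_property_type` are hypothetical and not part of the project.

```python
# Hypothetical grouping of Yakeey's `type` values; the group names are assumptions.
TYPE_GROUPS = {
    "Studio": "studio",
    "Appartement": "apartment",
    "Duplex": "multi_story",
    "Triplex": "multi_story",
}


def group_property_type(raw_type):
    """Return the coarse group for a raw `type` value, or None if it is unknown."""
    if not isinstance(raw_type, str):
        return None  # covers None and NaN
    # Tolerate stray whitespace and inconsistent casing.
    return TYPE_GROUPS.get(raw_type.strip().capitalize())
```

If the grouping turns out to be useful, it could be applied with `yakeey_df["type"].map(group_property_type)`.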

In [6]:
yakeey_df["price"].head()

0    1 750 000 DH
1    1 500 000 DH
2    1 800 000 DH
3      750 000 DH
4      690 000 DH
Name: price, dtype: object

In [7]:
prices = yakeey_df["price"].str.replace("DH", "").str.replace(" ", "").astype(int)

In [8]:
prices.describe().astype(int)

count         190
mean      1805528
std       1224030
min        500000
25%       1054750
50%       1478000
75%       2100000
max      10000000
Name: price, dtype: int64
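
The conversion in cell 7 works because every price in this extract is present and formatted as `1 750 000 DH`. A slightly more defensive variant is sketched below; `parse_price` is a hypothetical helper, and it assumes prices are whole dirham amounts with no decimal part.

```python
import re


def parse_price(raw_price):
    """Convert a string such as '1 750 000 DH' to an int, or None if unparseable."""
    if not isinstance(raw_price, str):
        return None  # covers None and NaN
    # Drop everything that is not a digit: regular and non-breaking spaces, 'DH', etc.
    digits = re.sub(r"\D", "", raw_price)
    return int(digits) if digits else None
```

It could be applied with `yakeey_df["price"].map(parse_price)`; rows that fail to parse would then show up as missing values instead of raising an error.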

As mentioned in the scraping module, the data from Yakeey is of very good quality because every announcement is manually checked by real estate consultants. This means we should not need to filter for potential outliers. Furthermore, we can rely on this data to derive quality-control rules for the other sources.
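
One possible way to turn the Yakeey data into a quality rule for other sources is to derive bounds on the price per square metre. The sketch below uses the common interquartile-range rule, which is an assumption made here for illustration and not a rule defined by the project; `price_per_m2_bounds`, `is_plausible` and the factor `k` are hypothetical.

```python
from statistics import quantiles


def price_per_m2_bounds(prices, areas, k=1.5):
    """Return (low, high) bounds on price per m², using the IQR rule with factor k."""
    # Skip pairs with a missing or zero price or area.
    ratios = [p / a for p, a in zip(prices, areas) if p and a]
    q1, _, q3 = quantiles(ratios, n=4)
    iqr = q3 - q1
    return max(0.0, q1 - k * iqr), q3 + k * iqr


def is_plausible(price, area, bounds):
    """Check whether a listing's price per m² falls inside the reference bounds."""
    low, high = bounds
    return area > 0 and low <= price / area <= high


# Reference values taken from the first five Yakeey rows shown above.
reference_prices = [1_750_000, 1_500_000, 1_800_000, 750_000, 690_000]
reference_areas = [101, 128, 133, 94, 82]
bounds = price_per_m2_bounds(reference_prices, reference_areas)
```

A listing from another source could then be checked with `is_plausible(price, area, bounds)`. In practice the bounds would probably need to be computed per city, since prices differ widely between cities.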

In [9]:
yakeey_df["city"].value_counts()

city
Casablanca     145
Salé             7
Témara           7
Tanger           7
Dar bouazza      6
Rabat            4
Mohammédia       4
Marrakech        3
Bouznika         3
Sidi rahal       3
Kénitra          1
Name: count, dtype: int64

`city` contains 2 missing values. It may be useful to fill them with the most common value, since missing values are very rare and the mode is very frequent.
We also need to compare the city names for consistency.
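
Both steps can be sketched in plain Python. The helpers `normalize_city` and `fill_with_mode` are hypothetical, and the normalization rules (lowercasing, removing accents, collapsing whitespace) are assumptions about what "consistent" should mean here.

```python
import unicodedata
from collections import Counter


def normalize_city(name):
    """Lowercase, strip accents and collapse whitespace so spellings can be compared."""
    if not isinstance(name, str):
        return None  # covers None and NaN
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def fill_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return list(values)  # nothing to impute from
    mode = counts.most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

For example, `fill_with_mode([normalize_city(c) for c in yakeey_df["city"]])` would normalize the names first and then fill the gaps with the mode.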

In [10]:
yakeey_df["neighborhood"].value_counts()

neighborhood
Hopitaux              11
Derb omar             11
Maarif                 8
Val fleury             8
Maarif extension       7
                      ..
Ain borja              1
Racine                 1
Bettana                1
Quartier allaymoun     1
Benmsick               1
Name: count, Length: 80, dtype: int64

`title` and `reference` are specific to the platform. `reference` is used to determine the uniqueness of an announcement for the delta load, so we don't need to do anything with either column.
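
To illustrate how `reference` can drive a delta load, here is a minimal sketch. It is not the project's actual loading code: `select_new_listings` and its arguments are hypothetical, and listings are assumed to be plain dictionaries.

```python
def select_new_listings(listings, known_references):
    """Keep listings whose `reference` is neither already stored nor repeated in the batch."""
    seen = set(known_references)
    new_listings = []
    for listing in listings:
        reference = listing.get("reference")
        if reference is None or reference in seen:
            continue  # skip listings without a reference and duplicates
        seen.add(reference)
        new_listings.append(listing)
    return new_listings


# Example with references taken from the rows shown earlier.
batch = [
    {"reference": "CA001095"},
    {"reference": "CI067647"},
    {"reference": "CI067647"},
]
new_only = select_new_listings(batch, known_references={"CA001095"})
```

Here `new_only` contains a single listing, `CI067647`, because `CA001095` is already known and the second `CI067647` is a duplicate within the batch.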

In [12]:
yakeey_df["attributes"].head()

0    {'Nb. de chambres': '2', 'Nb. de salles de bai...
1    {'Nb. de chambres': '3', 'Nb. de salles de bai...
2    {'Nb. de chambres': '3', 'Nb. de salles de bai...
3    {'Nb. de chambres': '1', 'Nb. de salles de bai...
4    {'Nb. de chambres': '2', 'Nb. de salles de bai...
Name: attributes, dtype: object

In [14]:
attributes_keys = set()
for attribute in yakeey_df["attributes"]:
    if attribute is not None:
        attributes_keys.update(attribute.keys())

for attribute_key in attributes_keys:
    print(attribute_key)  # noqa: T201
    yakeey_df[attribute_key] = yakeey_df["attributes"].apply(
        # Guard against missing attribute dicts, as in the loop above.
        lambda x, key=attribute_key: x.get(key) if x is not None else None
    )

yakeey_df.head()

Nb. de chambres
Nb. de salles de bains
Places de parking en sous-sol
Vue
Nb. de façades
Frais de syndic (DH/an)
Places de parking extérieur
Nb. de salles d'eau
Orientation
Résidence fermée
Surface solarium
Surface totale
Surface box titré
Surface habitable
Surface terrasse
Surface balcon
Étage du bien
Nb. d'étages dans l'immeuble


| | url | type | price | neighborhood | city | title | reference | attributes | equipments | Nb. de chambres | ... | Orientation | Résidence fermée | Surface solarium | Surface totale | Surface box titré | Surface habitable | Surface terrasse | Surface balcon | Étage du bien | Nb. d'étages dans l'immeuble |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://yakeey.com/fr-ma/acheter-appartement-c... | Appartement | 1 750 000 DH | Hopitaux | Casablanca | Appartement à vendre de 126 m² dont 101 m² hab... | CA001095 | {'Nb. de chambres': '2', 'Nb. de salles de bai... | [Ascenseur, Balcon, Place de parking en sous-s... | 2 | ... | Ouest, Sud-Ouest | | | 126 m² | | 101 m² | | 8 m² | 5 | 5 |
| 1 | https://yakeey.com/fr-ma/acheter-appartement-c... | Appartement | 1 500 000 DH | Lekrimat | Casablanca | Appartement à vendre de 128 m² | CI067647 | {'Nb. de chambres': '3', 'Nb. de salles de bai... | [Balcon, Box titré, Cuisine équipée] | 3 | ... | | | | 128 m² | | 128 m² | | | 4 | 4 |
| 2 | https://yakeey.com/fr-ma/acheter-appartement-c... | Appartement | 1 800 000 DH | Hopitaux | Casablanca | Appartement à vendre de 133 m² | CI068492 | {'Nb. de chambres': '3', 'Nb. de salles de bai... | [Ascenseur, Balcon, Cuisine équipée] | 3 | ... | | | | 133 m² | | 133 m² | | | 4 | 6 |
| 3 | https://yakeey.com/fr-ma/acheter-appartement-s... | Appartement | 750 000 DH | Hay safae | Salé | Appartement à vendre de 94 m² | SI067739 | {'Nb. de chambres': '1', 'Nb. de salles de bai... | [Cuisine équipée] | 1 | ... | | | | 94 m² | | 94 m² | | | 1 | 1 |
| 4 | https://yakeey.com/fr-ma/acheter-appartement-s... | Appartement | 690 000 DH | Hay chmaou | Salé | Appartement à vendre de 82 m² | SI067053 | {'Nb. de chambres': '2', 'Nb. de salles de bai... | [Terrasse, Place de parking en sous-sol] | 2 | ... | | | | 82 m² | | 82 m² | | | Rez-de-chaussée | 1 |


We can already see that many of these attributes don't exist in the other sources. For simplicity, we will keep the following:
- `Nb. de salles de bains`
- `Surface habitable`
- `Étage du bien`
- `Nb. de chambres`
- `Nb. de salles d'eau`
- `Surface totale`
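
The selection above could be expressed as a mapping from the French attribute keys to shorter column names. This is a sketch only: the English names and the helper `extract_kept_attributes` are hypothetical.

```python
# French attribute keys to keep, mapped to hypothetical English column names.
KEPT_ATTRIBUTES = {
    "Nb. de salles de bains": "bathrooms",
    "Surface habitable": "living_area",
    "Étage du bien": "floor",
    "Nb. de chambres": "bedrooms",
    "Nb. de salles d'eau": "shower_rooms",
    "Surface totale": "total_area",
}


def extract_kept_attributes(attributes):
    """Return only the kept attributes, renamed; missing keys become None."""
    if not isinstance(attributes, dict):
        return {new_name: None for new_name in KEPT_ATTRIBUTES.values()}
    return {
        new_name: attributes.get(old_name)
        for old_name, new_name in KEPT_ATTRIBUTES.items()
    }
```

Applied to each entry of `yakeey_df["attributes"]`, this yields one small dictionary per listing that can be expanded into columns.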

In [15]:
yakeey_df["Nb. de salles d'eau"].value_counts()

Nb. de salles d'eau
1    48
2    10
3     3
Name: count, dtype: int64

In [16]:
yakeey_df["Nb. de salles de bains"].value_counts()

Nb. de salles de bains
2    104
1     64
3     18
4      2
6      1
Name: count, dtype: int64

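For simplicity, we can go further and keep only `Nb. de salles d'eau`, which corresponds to the number of bathrooms. Note, however, that it is filled in for only 61 of the 190 listings, whereas `Nb. de salles de bains` is filled in for 189.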

In [17]:
living_areas = (
    yakeey_df["Surface habitable"]
    .dropna()
    .str.replace(" m²", "")
    .astype(int)
)
living_areas.describe().astype(int)

count    190
mean     124
std       76
min       43
25%       87
50%      109
75%      138
max      705
Name: Surface habitable, dtype: int64

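Be careful with missing values: `astype(int)` fails on them, which is why rows with a null `Surface habitable` are excluded before the conversion. In this extract all 190 listings have a value.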
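
An alternative that keeps every row is to parse each value individually and return `None` when it is missing. The helper `parse_area` below is a hypothetical sketch and assumes surfaces are whole numbers of square metres.

```python
def parse_area(raw_area):
    """Convert a string such as '101 m²' to an int, or None if missing or unparseable."""
    if not isinstance(raw_area, str):
        return None  # covers None and NaN
    # Remove the unit and any whitespace, including thousands separators.
    cleaned = "".join(raw_area.replace("m²", "").split())
    return int(cleaned) if cleaned.isdecimal() else None
```

With `yakeey_df["Surface habitable"].map(parse_area)`, missing surfaces stay missing instead of being dropped, so the result remains aligned with the other columns.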

In [18]:
yakeey_df["Étage du bien"].value_counts()

Étage du bien
1                  36
3                  34
4                  32
2                  28
5                  26
Rez-de-chaussée    19
7                   6
6                   4
Rez-de-jardin       2
9                   1
8                   1
12                  1
Name: count, dtype: int64
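
The counts above mix numeric floors with the labels `Rez-de-chaussée` and `Rez-de-jardin`. If a numeric floor is needed, one option is sketched below. Mapping both labels to floor 0 is an assumption, and `parse_floor` is a hypothetical helper.

```python
# Labels treated as ground level; mapping both to 0 is an assumption.
GROUND_LEVEL_LABELS = {"rez-de-chaussée", "rez-de-jardin"}


def parse_floor(raw_floor):
    """Convert 'Étage du bien' values to an int floor number, or None if unknown."""
    if not isinstance(raw_floor, str):
        return None  # covers None and NaN
    value = raw_floor.strip()
    if value.lower() in GROUND_LEVEL_LABELS:
        return 0
    return int(value) if value.isdecimal() else None
```

It could be applied with `yakeey_df["Étage du bien"].map(parse_floor)`.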

In [19]:
yakeey_df["Nb. de chambres"].value_counts()

Nb. de chambres
3    88
2    67
4    21
1    11
6     1
7     1
5     1
Name: count, dtype: int64

In [22]:
equipments = set()
for equipment in yakeey_df["equipments"]:
    if equipment is not None:
        equipments.update(equipment)

equipment_list = sorted(equipments)
equipment_list

['Accès aux personnes à mobilité réduite',
 'Agent de sécurité',
 'Aire de jeux pour enfants',
 'Ascenseur',
 'Balcon',
 'Box titré',
 'Buanderie',
 'Chambre de service',
 'Chauffage centralisé',
 'Chauffage électrique',
 'Chauffe-eau à gaz',
 'Chauffe-eau électrique',
 'Cheminée',
 'Climatisation centralisée',
 'Climatisation split',
 'Concierge',
 'Cuisine américaine',
 'Cuisine équipée',
 'Digicode',
 'Espaces verts',
 'Interphone',
 'Jardin privatif',
 'Meublé',
 'Piscine commune',
 'Piscine privative',
 'Place de parking en extérieur',
 'Place de parking en sous-sol',
 'Résidence fermée',
 'Salle de fitness',
 'Solarium',
 'Terrasse']

## Conclusion

To summarize, for this specific platform we don't need to focus on data quality: it is already very good and we shouldn't expect outliers. We only have to be careful with how we handle missing values. We were also able to identify some issues with the scraping script.