# Yakeey Data

This file is for inspecting and exploring data extracted from Yakeey.

In [2]:
import pandas as pd

In [3]:
yakeey_df = pd.read_json("../data/yakeey/2024-10-12_yakeey.json")

In [4]:
yakeey_df.head()

Unnamed: 0,url,type,price,neighborhood,city,title,reference,attributes,equipements
0,https://yakeey.com/fr-ma/acheter-appartement-c...,Appartement,2 600 000 DH,Hopitaux,Casablanca,Appartement à vendre de 171 m² dont 159 m² hab...,CA001579,"{'Nb. de chambres': '3', 'Nb. de salles de bai...","[Ascenseur, Interphone, Résidence fermée, Espa..."
1,https://yakeey.com/fr-ma/acheter-appartement-c...,Appartement,3 900 000 DH,Maarif,Casablanca,Appartement à vendre de 230 m² dont 220 m² hab...,CA000311,"{'Nb. de chambres': '3', 'Nb. de salles de bai...","[Ascenseur, Interphone, Résidence fermée, Pisc..."
2,https://yakeey.com/fr-ma/acheter-appartement-m...,Appartement,650 000 DH,Hay el bahja,Marrakech,Appartement à vendre de 74 m²,MI063170,"{'Nb. de chambres': '2', 'Nb. de salles de bai...","[Ascenseur, Agent de sécurité, Climatisation c..."
3,https://yakeey.com/fr-ma/acheter-appartement-k...,Appartement,720 000 DH,La ville haute,Kénitra,Appartement à vendre de 77 m²,KI062829,"{'Nb. de chambres': '1', 'Nb. de salles de bai...","[Ascenseur, Balcon, Terrasse, Place de parking..."
4,https://yakeey.com/fr-ma/acheter-appartement-m...,Appartement,650 000 DH,Jardin de la koutoubia,Marrakech,Appartement meublé à vendre de 74 m²,MA063043,"{'Nb. de chambres': '2', 'Nb. de salles de bai...","[Ascenseur, Résidence fermée, Concierge, Clima..."


In [5]:
yakeey_df.columns

Index(['url', 'type', 'price', 'neighborhood', 'city', 'title', 'reference',
       'attributes', 'equipements'],
      dtype='object')

In [6]:
yakeey_df["type"].value_counts()

type
Appartement    177
Duplex           6
Studio           6
Triplex          3
Name: count, dtype: int64

All of these `type` values refer to apartments, since apartments are the main kind of property we are working with. We might distinguish between:
- "Studio": a single-room apartment
- "Appartement": a regular apartment with multiple rooms
- "Duplex/Triplex": an apartment spanning two or more stories
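If this distinction is needed downstream, it could be encoded as a simple mapping. The sketch below is only an illustration: the `TYPE_GROUPS` name and the group labels are assumptions, and the sample series stands in for `yakeey_df["type"]`.

```python
import pandas as pd

# Hypothetical grouping of the raw `type` values; the labels are illustrative only.
TYPE_GROUPS = {
    "Studio": "studio",
    "Appartement": "apartment",
    "Duplex": "multi_story",
    "Triplex": "multi_story",
}

# Toy sample standing in for yakeey_df["type"].
types = pd.Series(["Appartement", "Studio", "Duplex", "Triplex", "Appartement"])

# Values absent from the mapping would become NaN, which makes new types easy to spot.
type_group = types.map(TYPE_GROUPS)
print(type_group.value_counts())
```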

In [9]:
yakeey_df["price"].head()

0    2 600 000 DH
1    3 900 000 DH
2      650 000 DH
3      720 000 DH
4      650 000 DH
Name: price, dtype: object

In [10]:
prices = yakeey_df["price"].str.replace("DH", "").str.replace(" ", "").astype(int)

In [12]:
prices.describe().astype(int)

count         192
mean      1796000
std       1239900
min        500000
25%       1000000
50%       1478000
75%       2112500
max      10000000
Name: price, dtype: int64

As mentioned in the scraping module, the quality of the data from Yakeey is very good because every announcement is manually checked by real estate consultants. This means we should not need to filter for potential outliers. Furthermore, we can rely on this data to derive quality-control rules for the other sources.
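One possible way to turn the Yakeey data into a quality rule is to derive plausible price-per-m² bounds from it and flag listings from other sources that fall outside them. This is a minimal sketch under stated assumptions: the function name `price_per_m2_bounds`, the quantile choices, the toy values and the `other` DataFrame are all hypothetical, not part of the project.

```python
import pandas as pd

def price_per_m2_bounds(prices, areas, lower_q=0.01, upper_q=0.99):
    """Return (low, high) price-per-m² bounds computed from a trusted sample."""
    ratio = prices / areas
    return ratio.quantile(lower_q), ratio.quantile(upper_q)

# Toy values standing in for cleaned Yakeey prices (DH) and living areas (m²).
ref_prices = pd.Series([2_600_000, 3_900_000, 650_000, 720_000, 650_000])
ref_areas = pd.Series([159, 220, 74, 77, 74])
low, high = price_per_m2_bounds(ref_prices, ref_areas)

# Listings from another (hypothetical) source, flagged when outside the bounds.
other = pd.DataFrame({"price": [900_000, 50_000], "area": [80, 100]})
other_ratio = other["price"] / other["area"]
other["suspicious"] = ~other_ratio.between(low, high)
print(other)
```

In practice the bounds would probably need to be computed per city or per neighborhood, since prices vary widely between locations.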

In [21]:
yakeey_df["city"].value_counts()

city
Casablanca     146
Tanger           9
Témara           5
Dar bouazza      5
Kénitra          4
Marrakech        4
Salé             4
Sidi rahal       3
Bouznika         3
Rabat            3
Mohammédia       2
                 2
Harhoura         1
El jadida        1
Name: count, dtype: int64

`city` contains 2 missing values, stored as empty strings. Since they are very rare and the mode is very frequent (146 of 192 rows are in Casablanca), it may be useful to fill them with the most common value.
We also need to compare the city names with those from the other sources for consistency.
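The two ideas above, mode imputation and name comparison, could look roughly like the following. This is a sketch, not the project's cleaning code: `normalize_name` is a hypothetical helper, and the sample series stands in for `yakeey_df["city"]`.

```python
import unicodedata
import pandas as pd

def normalize_name(name: str) -> str:
    """Lowercase, strip accents and surrounding whitespace so names can be compared."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.strip().lower()

# Toy sample standing in for yakeey_df["city"]; missing values are empty strings.
cities = pd.Series(["Casablanca", "Casablanca", "", "Témara", "Kénitra"])

# Turn empty strings into NaN, then fill them with the most frequent city.
cities = cities.mask(cities == "")
cities_filled = cities.fillna(cities.mode().iloc[0])

# Normalized keys that can be matched against city names from other sources.
city_keys = cities_filled.map(normalize_name)
print(city_keys.tolist())
```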

In [24]:
yakeey_df["neighborhood"].value_counts()

neighborhood
Derb omar           12
Hopitaux            10
Val fleury           9
Maarif extension     8
Maarif               8
                    ..
Annasr               1
Koudiat laabid       1
Hay moumen           1
Lalla chafia         1
Hay charaf           1
Name: count, Length: 83, dtype: int64

In [25]:
yakeey_df[yakeey_df["neighborhood"] == ""]

Unnamed: 0,url,type,price,neighborhood,city,title,reference,attributes,equipements
88,https://yakeey.com/fr-ma/acheter-appartement-I...,Appartement,725 000 DH,,,Appartement à vendre de 64 m²,II063966,{},"[Prix de vente, 725 000 DH, Frais et charges ..."
133,https://yakeey.com/fr-ma/acheter-appartement-I...,Appartement,2 010 000 DH,,,Appartement à vendre de 144 m²,II063583,{},[]


Here again, we need a strategy for the missing values, and we need to compare neighborhood names across the different sources.

`title` and `reference` are specific to the platform. `reference` is used to identify each announcement uniquely for the delta load; beyond that, we don't need to do anything with either column.
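As an illustration of how `reference` could drive a delta load, the sketch below keeps only announcements whose reference has not been seen before. The function `select_new_listings` and the `already_loaded` set are assumptions made for this example; the actual load logic is not shown in this notebook.

```python
import pandas as pd

def select_new_listings(scraped: pd.DataFrame, known_references: set) -> pd.DataFrame:
    """Keep only rows whose `reference` has not been loaded before."""
    deduplicated = scraped.drop_duplicates(subset="reference")
    return deduplicated[~deduplicated["reference"].isin(known_references)]

# Toy scrape result; the last row repeats an earlier reference.
scraped = pd.DataFrame({
    "reference": ["CA001579", "CA000311", "MI063170", "CA000311"],
    "price": ["2 600 000 DH", "3 900 000 DH", "650 000 DH", "3 900 000 DH"],
})

# References assumed to be present already in the target store.
already_loaded = {"CA001579"}

new_rows = select_new_listings(scraped, already_loaded)
print(new_rows)
```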

In [26]:
yakeey_df["attributes"].head()

0    {'Nb. de chambres': '3', 'Nb. de salles de bai...
1    {'Nb. de chambres': '3', 'Nb. de salles de bai...
2    {'Nb. de chambres': '2', 'Nb. de salles de bai...
3    {'Nb. de chambres': '1', 'Nb. de salles de bai...
4    {'Nb. de chambres': '2', 'Nb. de salles de bai...
Name: attributes, dtype: object

In [28]:
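# Collect every key that appears in at least one `attributes` dict.
attributes_keys = set()
for attribute in yakeey_df["attributes"]:
    if attribute is not None:
        attributes_keys.update(attribute.keys())

# Expand each key into its own column; rows lacking the key (or the dict) get None.
for attribute_key in attributes_keys:
    yakeey_df[attribute_key] = yakeey_df["attributes"].apply(
        lambda x, key=attribute_key: x.get(key) if x is not None else None
    )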

yakeey_df.head()

Unnamed: 0,url,type,price,neighborhood,city,title,reference,attributes,equipements,Places de parking extérieur,...,Orientation,Surface balcon,Surface box titré,Étage du bien,Nb. de chambres,Nb. de salles d'eau,Nb. de façades,Vue,Nb. d'étages dans l'immeuble,Surface totale
0,https://yakeey.com/fr-ma/acheter-appartement-c...,Appartement,2 600 000 DH,Hopitaux,Casablanca,Appartement à vendre de 171 m² dont 159 m² hab...,CA001579,"{'Nb. de chambres': '3', 'Nb. de salles de bai...","[Ascenseur, Interphone, Résidence fermée, Espa...",,...,,,,3,3,1.0,,,7,171 m²
1,https://yakeey.com/fr-ma/acheter-appartement-c...,Appartement,3 900 000 DH,Maarif,Casablanca,Appartement à vendre de 230 m² dont 220 m² hab...,CA000311,"{'Nb. de chambres': '3', 'Nb. de salles de bai...","[Ascenseur, Interphone, Résidence fermée, Pisc...",,...,Sud-Ouest,,,7,3,2.0,2.0,Vue dégagée,8,230 m²
2,https://yakeey.com/fr-ma/acheter-appartement-m...,Appartement,650 000 DH,Hay el bahja,Marrakech,Appartement à vendre de 74 m²,MI063170,"{'Nb. de chambres': '2', 'Nb. de salles de bai...","[Ascenseur, Agent de sécurité, Climatisation c...",,...,,,,1,2,,,,3,74 m²
3,https://yakeey.com/fr-ma/acheter-appartement-k...,Appartement,720 000 DH,La ville haute,Kénitra,Appartement à vendre de 77 m²,KI062829,"{'Nb. de chambres': '1', 'Nb. de salles de bai...","[Ascenseur, Balcon, Terrasse, Place de parking...",,...,,,,8,1,,2.0,Vue dégagée,8,77 m²
4,https://yakeey.com/fr-ma/acheter-appartement-m...,Appartement,650 000 DH,Jardin de la koutoubia,Marrakech,Appartement meublé à vendre de 74 m²,MA063043,"{'Nb. de chambres': '2', 'Nb. de salles de bai...","[Ascenseur, Résidence fermée, Concierge, Clima...",,...,Nord,,,3,2,,1.0,,5,74 m²


In [30]:
yakeey_df.columns[-len(attributes_keys):]

Index(['Places de parking extérieur', 'Surface solarium',
       'Nb. de salles de bains', 'Surface habitable',
       'Places de parking en sous-sol', 'Résidence fermée', 'Surface terrasse',
       'Frais de syndic (DH/an)', 'Orientation', 'Surface balcon',
       'Surface box titré', 'Étage du bien', 'Nb. de chambres',
       'Nb. de salles d'eau', 'Nb. de façades', 'Vue',
       'Nb. d'étages dans l'immeuble', 'Surface totale'],
      dtype='object')

We can already see that many of these attributes don't exist in the other sources. For simplicity, we will keep the following:
- `Nb. de salles de bains`
- `Surface habitable`
- `Étage du bien`
- `Nb. de chambres`
- `Nb. de salles d'eau`
- `Surface totale`

In [31]:
yakeey_df["Nb. de salles d'eau"].value_counts()

Nb. de salles d'eau
1    49
2    10
3     3
Name: count, dtype: int64

In [32]:
yakeey_df["Nb. de salles de bains"].value_counts()

Nb. de salles de bains
2    106
1     63
3     17
4      2
6      1
Name: count, dtype: int64

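For simplicity, we can go one step further and keep only `Nb. de salles de bains`, which corresponds to the number of bathrooms and is filled for 189 of the 192 announcements. `Nb. de salles d'eau` (shower rooms) is filled for only 62.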

In [62]:
living_areas = (yakeey_df.loc[yakeey_df["Surface habitable"].notnull(), "Surface habitable"]
                .str.replace(" m²", "").astype(int))
living_areas.describe().astype(int)

count    190
mean     122
std       72
min       43
25%       84
50%      108
75%      138
max      705
Name: Surface habitable, dtype: int64

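Be careful with the missing values: only 190 of the 192 rows have a living area, so the conversion to integers only works once the null rows are excluded, as done above.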
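An alternative that keeps the missing rows in place is to convert to pandas' nullable `Int64` dtype. The helper `parse_area` below is a hypothetical sketch; it assumes every value is either missing or a string like `'171 m²'`.

```python
import pandas as pd

def parse_area(area: pd.Series) -> pd.Series:
    """Convert strings such as '171 m²' to nullable integers, keeping gaps as <NA>."""
    # Drop everything that is not an ASCII digit (spaces, 'm', '²').
    digits = area.str.replace(r"[^0-9]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")

# Toy sample standing in for yakeey_df["Surface habitable"].
raw = pd.Series(["171 m²", None, "74 m²"])
areas = parse_area(raw)
print(areas)
```

The result has the same index as the input, so it can be assigned back to the DataFrame without filtering first.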

In [63]:
yakeey_df["Étage du bien"].value_counts()

Étage du bien
1                  40
3                  34
2                  32
4                  25
5                  25
Rez-de-chaussée    17
7                   6
6                   5
8                   4
9                   1
12                  1
Name: count, dtype: int64

In [65]:
yakeey_df["Nb. de chambres"].value_counts()

Nb. de chambres
3    89
2    68
4    18
1    12
5     1
6     1
7     1
Name: count, dtype: int64

In [71]:
yakeey_df[yakeey_df["Surface totale"].isnull()]

Unnamed: 0,url,type,price,neighborhood,city,title,reference,attributes,equipements,Places de parking extérieur,...,Orientation,Surface balcon,Surface box titré,Étage du bien,Nb. de chambres,Nb. de salles d'eau,Nb. de façades,Vue,Nb. d'étages dans l'immeuble,Surface totale
88,https://yakeey.com/fr-ma/acheter-appartement-I...,Appartement,725 000 DH,,,Appartement à vendre de 64 m²,II063966,{},"[Prix de vente, 725 000 DH, Frais et charges ...",,...,,,,,,,,,,
133,https://yakeey.com/fr-ma/acheter-appartement-I...,Appartement,2 010 000 DH,,,Appartement à vendre de 144 m²,II063583,{},[],,...,,,,,,,,,,


Strangely enough, the rows with missing values for the city, the neighborhood and the areas are the same. After checking on the website, it seems that we are dealing with a new type of announcement, and we need to fix the scraping script to handle it.
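Until the scraper is fixed, such rows could be detected automatically. The following is a minimal sketch on a toy DataFrame that mirrors the pattern seen above (empty `attributes` together with a blank location); the variable names are illustrative.

```python
import pandas as pd

# Toy rows mirroring the observed pattern: two announcements scraped incompletely.
listings = pd.DataFrame({
    "reference": ["CA001579", "II063966", "II063583"],
    "city": ["Casablanca", "", ""],
    "neighborhood": ["Hopitaux", "", ""],
    "attributes": [{"Nb. de chambres": "3"}, {}, {}],
})

# A row is suspect when its attributes dict is empty (or None) and its location is blank.
empty_attributes = listings["attributes"].apply(lambda a: not a)
blank_location = listings["city"].eq("") & listings["neighborhood"].eq("")
incomplete = listings[empty_attributes & blank_location]
print(incomplete["reference"].tolist())
```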

In [72]:
equipments = set()
for equipment in yakeey_df["equipements"]:
    if equipment is not None:
        equipments.update(equipment)

equipment_list = sorted(equipments)
equipment_list

['1 400  DH',
 '1 450  DH',
 '1 500  DH',
 '1 950 000  DH',
 '10 500  DH',
 '10 875  DH',
 '12 600  DH',
 '13 050  DH',
 '132 350  DH',
 '19 500  DH',
 '2 117 450  DH',
 '200  DH',
 '28 000  DH',
 '29 000  DH',
 '29 250  DH',
 '3 900  DH',
 '35 100  DH',
 '48 600  DH',
 '50 275  DH',
 '7 000  DH',
 '7 250  DH',
 '700 000  DH',
 '725 000  DH',
 '761 200  DH',
 '78 000  DH',
 '788 325  DH',
 'Accès aux personnes à mobilité réduite',
 'Agent de sécurité',
 'Aire de jeux pour enfants',
 'Ascenseur',
 'Balcon',
 'Box titré',
 'Buanderie',
 'Certificat de propriété et droits fixes',
 'Chambre de service',
 'Chauffage centralisé',
 'Chauffage électrique',
 'Chauffe-eau à gaz',
 'Chauffe-eau électrique',
 'Cheminée',
 'Climatisation centralisée',
 'Climatisation split',
 'Concierge',
 'Cuisine américaine',
 'Cuisine équipée',
 'Digicode',
 'Droits de la Conservation Foncière',
 'Droits d’enregistrement',
 'Espaces verts',
 'Frais de dossiers divers',
 "Frais de service à la charge de l'acheteu

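We notice some invalid values, such as prices (for example `'725 000  DH'`) and fee labels (for example `'Frais de dossiers divers'`). This suggests that an error was made during scraping: for some announcements, such as row 88, the cost breakdown appears to have been captured in place of the equipment list.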
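As a stopgap, the price-like entries could be filtered out with a regular expression. This sketch is an assumption about what a cleanup might look like, not a fix for the scraper itself: `PRICE_PATTERN` and `drop_price_entries` are hypothetical names, and fee labels such as `'Frais de dossiers divers'` would still need separate handling.

```python
import re

# Matches strings made only of digits and whitespace, followed by "DH".
PRICE_PATTERN = re.compile(r"^[\d\s]+DH$")

def drop_price_entries(equipment):
    """Return the equipment list without entries that look like prices."""
    return [item for item in equipment if not PRICE_PATTERN.match(item.strip())]

# Toy sample mixing real equipment with scraped price strings.
sample = ["Ascenseur", "1 400  DH", "Balcon", "725 000  DH"]
cleaned = drop_price_entries(sample)
print(cleaned)
```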

## Conclusion

To summarize, the data from this platform is already of very good quality, so we don't need to focus on quality checks and shouldn't expect outliers. We only have to be careful with how we handle the missing values. We were also able to identify some issues with the scraping script.