# Avito Data

This file is for inspecting and exploring data extracted from Avito.

In [1]:
import pandas as pd

In [22]:
avito_df = pd.read_json("../data/avito/2024-10-12_avito.json")

In [13]:
avito_df.head()

Unnamed: 0,url,n_bedrooms,n_bathrooms,total_area,title,price,city,time,user,attributes,equipements,date_time,year,month
0,https://www.avito.ma/fr/hay_chrifa/appartement...,1,1,72 m²,61292-Vente Appt à Casablanca Lekrimat de 72 m²,830 000 DH,Casablanca,il y a 8 minutes,Yakeey,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Chauffage, Cuisine équipée...",2024-10-12 19:23:00,2024,10
1,https://www.avito.ma/fr/hay_mohammadi/appartem...,2,1,,apparemment neuf à vendre 63m²,Prix non spécifié,Agadir,il y a 36 minutes,Hemza IMMO,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Balcon, Climatisation, Cuisine équipée]",2024-10-12 18:55:00,2024,10
2,https://www.avito.ma/fr/sidi_bernoussi/apparte...,2,1,62 m²,شقة جميلة جدا بالطابق التاني اقامة مدينتي,340 000 DH,Casablanca,il y a 33 minutes,cimmo.ma,"{'Type': 'Appartements, à vendre', 'Secteur': ...",[],2024-10-12 18:58:00,2024,10
3,https://www.avito.ma/fr/ville_nouvelle/apparte...,3,2,128 m²,Appartement duplex 128 m² à Meknès,Prix non spécifié,Meknès,il y a 34 minutes,Alaoui immobilier,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Concierge, Cuisine équipée...",2024-10-12 18:57:00,2024,10
4,https://www.avito.ma/fr/ville_verte/appartemen...,3,2,114 m²,Appartement haut standing à vendre à Bouskoura,1 480 000 DH,Bouskoura,il y a 33 minutes,ABDES,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Concierge, Cuisine équipée...",2024-10-12 18:58:00,2024,10


In [23]:
avito_df.columns

Index(['url', 'n_bedrooms', 'n_bathrooms', 'total_area', 'title', 'price',
       'city', 'time', 'user', 'attributes', 'equipements', 'date_time',
       'year', 'month'],
      dtype='object')

In [21]:
avito_df.dtypes

url                    object
n_bedrooms              int64
n_bathrooms            object
total_area             object
title                  object
price                  object
city                   object
time                   object
user                   object
attributes             object
equipements            object
date_time      datetime64[ns]
year                    int64
month                   int64
dtype: object

In [8]:
avito_df["n_bedrooms"].value_counts()

n_bedrooms
2     112
3      72
1      19
4       7
5       1
10      1
6       1
Name: count, dtype: int64

The `n_bedrooms` column is an integer. We might need to validate it so that outliers, such as the listing with 10 bedrooms, do not skew the analysis.
One option is to pick a reasonable range for the number of bedrooms in an apartment and remove any rows that fall outside of that range.
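
A minimal sketch of such a range filter is shown here. The bounds `MIN_BEDROOMS` and `MAX_BEDROOMS` and the helper `filter_bedrooms` are hypothetical and only serve as an illustration; the actual limits would have to be chosen from domain knowledge.

```python
import pandas as pd

# Hypothetical bounds, chosen only for illustration.
MIN_BEDROOMS, MAX_BEDROOMS = 1, 6


def filter_bedrooms(df: pd.DataFrame, low: int = MIN_BEDROOMS, high: int = MAX_BEDROOMS) -> pd.DataFrame:
    """Keep rows whose n_bedrooms lies in [low, high], bounds included."""
    mask = df["n_bedrooms"].between(low, high)
    return df[mask].copy()


# Small sample mimicking the values seen in the column.
sample = pd.DataFrame({"n_bedrooms": [2, 3, 1, 10, 6]})
filtered = filter_bedrooms(sample)
```

`Series.between` is inclusive on both ends by default, so a listing with exactly 6 bedrooms is kept while the one with 10 is dropped.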

In [24]:
avito_df["n_bathrooms"].value_counts()

n_bathrooms
2     105
1      86
3      17
0       2
7+      1
4       1
Name: count, dtype: int64

`n_bathrooms` is an optional column and can be null. It also contains the value `7+`, while the rest are integers, which explains why its dtype is `object` rather than `int64`.
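
One possible way to bring such a column to a numeric type is sketched below. The helper name `parse_capped_count` is hypothetical, and mapping `7+` to 7 is an assumption that discards the "or more" information.

```python
import pandas as pd


def parse_capped_count(values: pd.Series) -> pd.Series:
    """Convert counts such as 2, "2" or "7+" to nullable integers.

    Assumption: "7+" is mapped to 7. Values without digits become <NA>.
    """
    # Work on string representations so that ints and strings are handled alike.
    digits = values.astype(str).str.extract(r"(\d+)", expand=False)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


sample = pd.Series([2, 1, "7+", None, 0], dtype="object")
parsed = parse_capped_count(sample)
```

The nullable `Int64` dtype keeps missing values as `<NA>` instead of forcing the whole column to floats.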

In [26]:
avito_df["total_area"].head()

0     72 m²
1      None
2     62 m²
3    128 m²
4    114 m²
Name: total_area, dtype: object

`total_area` is optional as well: it can be null, or it can contain the m² symbol alone. Let's remove the m² sign to inspect the numbers.

In [118]:
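# Keep rows with an actual area: drop nulls and values that are only the "m²" symbol.
total_areas = (avito_df[(avito_df["total_area"].notnull()
                         & (avito_df["total_area"]!="m²"))]["total_area"])
# Strip the trailing " m²" (3 characters) and convert to integers.
total_areas = total_areas.str[:-3].astype(int)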

In [84]:
total_areas.describe()

count    176.000000
mean     108.670455
std       72.240348
min       32.000000
25%       70.000000
50%       96.500000
75%      127.250000
max      800.000000
Name: total_area, dtype: float64

We might think of a reasonable range to remove outliers from the `total_area` column as well.
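
As an alternative to a hand-picked range, the bounds could be derived from the data with the interquartile range (IQR) rule. This is only a sketch: the helper `iqr_bounds`, the multiplier `k=1.5`, and the sample values are assumptions, not a choice made in this notebook.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (low, high) fences located k * IQR away from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative sample, not the real column.
areas = pd.Series([32, 70, 96, 97, 127, 128, 800], name="total_area")
low, high = iqr_bounds(areas)
outliers = areas[~areas.between(low, high)]
```

With the quartiles reported by `describe()` (70 and 127.25), the upper fence would be 127.25 + 1.5 × 57.25 = 213.125 m², so the 800 m² maximum would be flagged.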

In [72]:
avito_df["price"].value_counts()

price
Prix non spécifié    25
1 500 000 DH          7
1 550 000 DH          5
800 000 DH            5
1 100 000 DH          5
                     ..
1 030 000 DH          1
2 070 000 DH          1
845 500 DH            1
1 012 000 DH          1
660 000 DH            1
Name: count, Length: 134, dtype: int64

`price` is optional, but it is never null: when no price is given, it contains the value "Prix non spécifié" instead. We can reformat the values to look at the numbers.

In [86]:
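# Ignore listings without a price.
prices = avito_df[avito_df["price"]!="Prix non spécifié"]["price"]
# Remove the currency suffix and the narrow no-break spaces used as thousands separators.
prices = prices.str.replace("DH", "")
prices = prices.str.replace("\u202f", "")
prices = prices.astype(int)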

In [95]:
prices.describe().astype(int)

count         188
mean      1403872
std       2326640
min          6000
25%        613750
50%        931500
75%       1605000
max      29000000
Name: price, dtype: int64

The maximum of 29,000,000 DH is very high for an apartment, and the minimum of 6,000 DH looks implausibly low for a sale. We might consider removing outliers from the `price` column as well using a reasonable range.
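
Before a range filter can be applied to the whole column, the price has to be numeric while unspecified prices stay missing. The sketch below shows one way to do that; the helper `parse_price` is hypothetical, and it assumes every digit in the string belongs to the price.

```python
import pandas as pd


def parse_price(prices: pd.Series) -> pd.Series:
    """Turn strings like "830 000 DH" into nullable integers.

    Values without any digit (e.g. "Prix non spécifié") become <NA>.
    """
    # Drop every non-digit character: spaces, narrow no-break spaces, "DH".
    digits = prices.astype(str).str.replace(r"\D", "", regex=True)
    return pd.to_numeric(digits, errors="coerce").astype("Int64")


sample = pd.Series(["830\u202f000 DH", "Prix non spécifié", "1\u202f480\u202f000 DH"])
parsed = parse_price(sample)
```

Keeping missing prices as `<NA>` rather than 0 avoids pulling down statistics such as the mean or the minimum.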

In [96]:
avito_df["city"].value_counts()

city
Casablanca     74
Marrakech      25
Tanger         15
Rabat          12
Temara         11
Kénitra        11
Mohammedia      7
Fès             7
Salé            6
Agadir          6
Bouznika        5
Meknès          4
Saidia          4
Bouskoura       3
Tétouan         3
El Jadida       3
Asilah          2
Martil          2
Cabo Negro      2
Laâyoune        1
Taroudant       1
Zenata          1
Béni Mellal     1
Berrechid       1
Had Soualem     1
Oujda           1
Azemmour        1
Sidi Rahal      1
Deroua          1
Dar Bouazza     1
Name: count, dtype: int64

We will need to compare these values with the same column in the other datasets to make sure they are consistent.
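
Such a comparison is easier if spelling differences (accents, case, extra spaces) are removed first. The following sketch assumes a second dataset with a comparable city column; `other_cities` and the helper `normalize_city` are hypothetical.

```python
import unicodedata

import pandas as pd


def normalize_city(name: str) -> str:
    """Lowercase a city name, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


avito_cities = pd.Series(["Kénitra", "Fès", "Salé", "Dar Bouazza"])
other_cities = pd.Series(["kenitra", "FES", "Sale", "Rabat"])  # hypothetical second dataset

avito_keys = set(avito_cities.map(normalize_city))
other_keys = set(other_cities.map(normalize_city))

# Cities that appear in only one of the two datasets.
only_in_avito = sorted(avito_keys - other_keys)
only_in_other = sorted(other_keys - avito_keys)
```

The two difference sets give a quick list of names to review by hand, for instance transliteration variants that normalization alone cannot reconcile.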

In [99]:
avito_df["title"].value_counts()

title
Appartement à vendre 68 m² à Salé                     2
Appartement a Vendre à Temara                         2
apparemment neuf à vendre 63m²                        1
شقة جميلة جدا بالطابق التاني اقامة مدينتي             1
Appartement duplex 128 m² à Meknès                    1
                                                     ..
Appartement à vendre 85 m² à Tétouan                  1
Appartement à Marina Blanca                           1
Appartement à vendre 70 m² à florida                  1
26598-Vente Appt à Casablanca Bourgogne (Anfa) de     1
CMN-AS-1083 - Appartement à vendre à Roches Noires    1
Name: count, Length: 211, dtype: int64

In [100]:
avito_df[avito_df["title"]=="Appartement à vendre 68 m² à Salé"]

Unnamed: 0,url,n_bedrooms,n_bathrooms,total_area,title,price,city,time,user,attributes,equipements,date_time,year,month,total_area_cleaned,price_cleaned
7,https://www.avito.ma/fr/tabriquet/appartements...,2,1,90 m²,Appartement à vendre 68 m² à Salé,520 000 DH,Salé,il y a 31 minutes,mohamed,"{'Type': 'Appartements, à vendre', 'Secteur': ...",[],2024-10-12 19:00:00,2024,10,90,520000
120,https://www.avito.ma/fr/cherkaoui___marzouka/a...,2,1,68 m²,Appartement à vendre 68 m² à Salé,639 200 DH,Salé,il y a 2 heures,Les Oliviers,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Climatisation, Concierge, Cuisine équipée, Pa...",2024-10-12 17:31:00,2024,10,68,639200


We can't rely on the `title` to identify duplicates: the two listings above share a title but differ in URL, area, price, and seller. Deduplicating the rows on this column would therefore not be meaningful.
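
The same check can be made on several columns at once with `DataFrame.duplicated`. The sketch below uses a tiny illustrative frame; the URLs and the third row are made up, and only the first two titles and prices mirror the listings shown above.

```python
import pandas as pd

listings = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],  # placeholder URLs
    "title": [
        "Appartement à vendre 68 m² à Salé",
        "Appartement à vendre 68 m² à Salé",
        "Some other listing",  # made-up row
    ],
    "price": ["520 000 DH", "639 200 DH", "Prix non spécifié"],
})

# keep=False marks every member of a duplicate group, not only the repeats.
same_title = listings[listings.duplicated(subset="title", keep=False)]
same_title_and_price = listings[listings.duplicated(subset=["title", "price"], keep=False)]
```

Here two rows share a title, but none share both title and price, which suggests they are distinct listings.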

In [101]:
avito_df["user"].value_counts()

user
Yakeey                       24
AGENZ.MA                     13
promozin                      9
Benmoussa immobilier          6
Avito Immo Neuf               5
                             ..
OKTASI IMMOBOLIER             1
Hamza Alaoui                  1
SAKANI immobilier Kenitra     1
Si Hmed                       1
El Amrani                     1
Name: count, Length: 136, dtype: int64

We could analyze the `user` column to find the most active sellers, but that would be an analysis of the platform itself, which is out of the scope of this project. We would rather drop this column.

The `time` column was extracted to help us determine the publication time of each announcement, which is represented by the `date_time`, `year` and `month` columns.
Time is important and is a known factor for price changes in real estate, so we keep those derived columns. We would drop the `time` column itself, since it serves no other purpose than helping us extract the actual date.
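
For reference, the conversion from a relative `time` string to a timestamp can be sketched as follows. This is not the extraction code used for this dataset: the helper `relative_to_timestamp`, the supported units, and the scrape time are assumptions (19:31 is inferred from the sample rows, where "il y a 8 minutes" corresponds to 19:23).

```python
import re

import pandas as pd

# French unit -> pd.Timedelta keyword. "jour" is an assumption; only minutes
# and hours appear in the rows shown in this notebook.
UNIT_TO_KEYWORD = {"minute": "minutes", "heure": "hours", "jour": "days"}
PATTERN = re.compile(r"il y a (\d+) (minute|heure|jour)s?")


def relative_to_timestamp(text: str, scraped_at: pd.Timestamp):
    """Subtract the relative offset found in `text` from `scraped_at`.

    Returns pd.NaT when the text does not match the expected pattern.
    """
    match = PATTERN.search(text)
    if match is None:
        return pd.NaT
    amount, unit = int(match.group(1)), match.group(2)
    return scraped_at - pd.Timedelta(**{UNIT_TO_KEYWORD[unit]: amount})


scraped_at = pd.Timestamp("2024-10-12 19:31:00")  # assumed scrape time
posted = relative_to_timestamp("il y a 8 minutes", scraped_at)
```

Returning `pd.NaT` for unrecognized strings keeps unexpected formats visible instead of silently assigning them a wrong date.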

In [102]:
avito_df["attributes"].head()

0    {'Type': 'Appartements, à vendre', 'Secteur': ...
1    {'Type': 'Appartements, à vendre', 'Secteur': ...
2    {'Type': 'Appartements, à vendre', 'Secteur': ...
3    {'Type': 'Appartements, à vendre', 'Secteur': ...
4    {'Type': 'Appartements, à vendre', 'Secteur': ...
Name: attributes, dtype: object

`attributes` is an object column that contains a dictionary of attributes. We can extract the keys and values to inspect the data. This column shouldn't contain null values, since some attributes are required for each announcement.

In [110]:
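# Collect every key that appears in at least one attributes dictionary.
attributes_keys = set()
for attribute in avito_df["attributes"]:
    if attribute is not None:
        attributes_keys.update(attribute.keys())

# Create one column per key; rows without that key (or without attributes) get None.
for attribute_key in attributes_keys:
    avito_df[attribute_key] = avito_df["attributes"].apply(
        lambda x, key=attribute_key: x.get(key) if x is not None else None
    )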

avito_df.head()

Unnamed: 0,url,n_bedrooms,n_bathrooms,total_area,title,price,city,time,user,attributes,...,month,total_area_cleaned,price_cleaned,Type,Secteur,Âge du bien,Frais de syndic / mois,Étage,Surface habitable,Salons
0,https://www.avito.ma/fr/hay_chrifa/appartement...,1,1,72 m²,61292-Vente Appt à Casablanca Lekrimat de 72 m²,830 000 DH,Casablanca,il y a 8 minutes,Yakeey,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,10,72,830000,"Appartements, à vendre",Hay Chrifa,11-20 ans,2400.0,1,72,
1,https://www.avito.ma/fr/hay_mohammadi/appartem...,2,1,,apparemment neuf à vendre 63m²,Prix non spécifié,Agadir,il y a 36 minutes,Hemza IMMO,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,10,0,0,"Appartements, à vendre",Hay Mohammadi,Neuf,,2,63,1.0
2,https://www.avito.ma/fr/sidi_bernoussi/apparte...,2,1,62 m²,شقة جميلة جدا بالطابق التاني اقامة مدينتي,340 000 DH,Casablanca,il y a 33 minutes,cimmo.ma,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,10,62,340000,"Appartements, à vendre",Sidi Bernoussi,,,2,62,1.0
3,https://www.avito.ma/fr/ville_nouvelle/apparte...,3,2,128 m²,Appartement duplex 128 m² à Meknès,Prix non spécifié,Meknès,il y a 34 minutes,Alaoui immobilier,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,10,128,0,"Appartements, à vendre",Ville Nouvelle,Neuf,,3,128,1.0
4,https://www.avito.ma/fr/ville_verte/appartemen...,3,2,114 m²,Appartement haut standing à vendre à Bouskoura,1 480 000 DH,Bouskoura,il y a 33 minutes,ABDES,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,10,114,1480000,"Appartements, à vendre",Ville verte,Neuf,200.0,2,114,1.0


In [112]:
avito_df["Type"].value_counts()
# 1 value, indicating that the type of the property is an apartment

Type
Appartements, à vendre    213
Name: count, dtype: int64

In [114]:
avito_df["Secteur"].value_counts()
# we need to compare neighborhoods with the ones in the other datasets

Secteur
Toute la ville    22
Autre secteur     16
Bourgogne         15
Guéliz            14
Maarif             8
                  ..
Bachkou            1
Malabata           1
Narjis             1
Oulfa              1
Roches Noires      1
Name: count, Length: 95, dtype: int64

In [115]:
avito_df["Âge du bien"].value_counts()

Âge du bien
Neuf         77
6-10 ans     28
11-20 ans    27
21+ ans      20
1-5 ans      17
Name: count, dtype: int64

In [117]:
(avito_df["Frais de syndic / mois"].isnull().sum(),
avito_df["Frais de syndic / mois"].value_counts())
# column mostly contains null values

(np.int64(128),
 Frais de syndic / mois
 100     14
 300     14
 200     11
 150      8
 250      6
 50       4
 2400     2
 4200     2
 3600     2
 500      2
 3000     2
 1200     1
 5000     1
 7600     1
 5500     1
 4800     1
 700      1
 1        1
 6100     1
 120      1
 80       1
 130      1
 60       1
 550      1
 850      1
 49       1
 400      1
 30       1
 180      1
 Name: count, dtype: int64)

In [120]:
avito_df["Étage"].value_counts() # to be compared with the other datasets

Étage
1                  55
2                  49
3                  38
4                  32
Rez de chaussée    16
5                  11
7+                  6
6                   3
8                   3
Name: count, dtype: int64

In [133]:
(int(avito_df["Surface habitable"].isna().sum()),
 avito_df["Surface habitable"].value_counts())
# be careful with the outliers

(2,
 Surface habitable
 60     7
 110    6
 120    5
 140    5
 68     5
       ..
 57     1
 109    1
 166    1
 89     1
 141    1
 Name: count, Length: 102, dtype: int64)

In [134]:
avito_df["Salons"].value_counts()

Salons
1     130
2      44
7+      1
3       1
Name: count, dtype: int64

In [135]:
avito_df["equipements"].head()

0    [Ascenseur, Balcon, Chauffage, Cuisine équipée...
1             [Balcon, Climatisation, Cuisine équipée]
2                                                   []
3    [Ascenseur, Balcon, Concierge, Cuisine équipée...
4    [Ascenseur, Balcon, Concierge, Cuisine équipée...
Name: equipements, dtype: object

For the `equipements` column, we can one-hot encode each equipment item. At this step, however, we are only interested in the possible values, so that we know what to expect from the other datasets.

In [157]:
equipments = set()
for equipment in avito_df["equipements"]:
    if equipment is not None:
        equipments.update(equipment)

equipment_list = sorted(equipments)
equipment_list

['Afficher plus de détails',
 'Ascenseur',
 'Balcon',
 'Chauffage',
 'Climatisation',
 'Concierge',
 'Cuisine équipée',
 'Duplex',
 'Meublé',
 'Parking',
 'Sécurité',
 'Terrasse']

"Afficher plus de détails" ("Show more details") is not a valid equipment item. It is probably an edge case that was not caught during data extraction.

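Once the invalid item is filtered out, the one-hot encoding mentioned earlier could be sketched as below. The helper `clean_equipment`, the `INVALID_ITEMS` set, and the sample frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

INVALID_ITEMS = {"Afficher plus de détails"}  # scraping residue, not an equipment item


def clean_equipment(items):
    """Return the list without invalid items; None becomes an empty list."""
    if not items:
        return []
    return [item for item in items if item not in INVALID_ITEMS]


df = pd.DataFrame({
    "equipements": [
        ["Ascenseur", "Balcon", "Afficher plus de détails"],
        ["Balcon", "Climatisation"],
        [],
        None,
    ]
})

cleaned = df["equipements"].apply(clean_equipment)
# One row per (listing, item), then one indicator column per item,
# collapsed back to one row per listing.
one_hot = pd.get_dummies(cleaned.explode()).groupby(level=0).max().astype(int)
```

Listings with an empty or missing list end up as rows of zeros, so the encoded frame keeps the same index as the original one.
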
## Conclusion

Here are the main takeaways from this overview:
- Clean columns for type encoding: `total_area`, `Surface habitable`, `price`, etc.
- Filter columns for outliers: `n_bedrooms`, `total_area`, `price`, etc.
- Compare columns holding the same information from the other datasets, to make sure they are consistent.
- Drop columns that are not useful for the analysis: `user`, `time`, `Frais de syndic / mois`, etc.