# Avito Data

This file is for inspecting and exploring data extracted from Avito.

In [5]:
import pandas as pd

In [6]:
avito_df = pd.read_json("../data/raw/avito/avito_2024-11-14.json")

In [7]:
avito_df.head()

Unnamed: 0,url,n_bedrooms,n_bathrooms,total_area,title,price,city,time,user,attributes,equipments,date_time,year,month
0,https://www.avito.ma/fr/haut_founty/appartemen...,2,2.0,,Appartement à vendre 81 m² quartier Haut Founty,Prix non spécifié,Agadir,il y a 7 minutes,AM IMMOBILIER AGADIR,"{'Type': 'Appartements, à vendre', 'Secteur': ...",[],2024-11-14 15:53:00,2024,11
1,https://www.avito.ma/fr/ahlane/appartements/Ap...,3,2.0,101 m²,Appartement à vendre 101 m² à Tanger,135 000 DH,Tanger,il y a 19 minutes,Moha,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Climatisation, Cuisine équ...",2024-11-14 15:41:00,2024,11
2,https://www.avito.ma/fr/autre_secteur/appartem...,2,2.0,90 m²,Bel appartement,540 000 DH,Fès,il y a 19 minutes,Med,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Chauffage, Climatisation, ...",2024-11-14 15:41:00,2024,11
3,https://www.avito.ma/fr/riad_toulal/appartemen...,2,2.0,58 m²,Appartement à vendre 58 m² à Meknès,290 000 DH,Meknès,il y a 21 minutes,ORBIS Promotion,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Chauffage, Climatisation, ...",2024-11-14 15:39:00,2024,11
4,https://www.avito.ma/fr/al_boustane/appartemen...,2,1.0,100 m²,Appartement Rdc dans R plus 2,520 000 DH,El Jadida,il y a 17 minutes,Tic House,"{'Type': 'Appartements, à vendre', 'Secteur': ...",[Cuisine équipée],2024-11-14 15:43:00,2024,11


In [8]:
avito_df.columns

Index(['url', 'n_bedrooms', 'n_bathrooms', 'total_area', 'title', 'price',
       'city', 'time', 'user', 'attributes', 'equipments', 'date_time', 'year',
       'month'],
      dtype='object')

In [9]:
avito_df.dtypes

url                    object
n_bedrooms              int64
n_bathrooms           float64
total_area             object
title                  object
price                  object
city                   object
time                   object
user                   object
attributes             object
equipments             object
date_time      datetime64[ns]
year                    int64
month                   int64
dtype: object

In [37]:
avito_df["n_bedrooms"].value_counts()

n_bedrooms
2    164
3    113
1     30
4     10
Name: count, dtype: int64

The `n_bedrooms` column is an integer. We might need to control the input to prevent outliers from occurring.
We can pick a reasonable range for the number of bedrooms in an apartment and remove any rows that fall outside of that range.
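
As a minimal sketch of such a range filter, the snippet below keeps only the rows whose value lies between two bounds. The helper `filter_by_range` and the bounds `MIN_BEDROOMS` and `MAX_BEDROOMS` are hypothetical and chosen for illustration only; the right limits would have to be decided from the data.

```python
import pandas as pd

# Hypothetical bounds, chosen for illustration only.
MIN_BEDROOMS = 0
MAX_BEDROOMS = 10


def filter_by_range(df, column, lower, upper):
    """Keep rows whose `column` value lies in [lower, upper] (inclusive)."""
    # Series.between is inclusive on both ends by default; nulls are dropped.
    mask = df[column].between(lower, upper)
    return df[mask]


# Small made-up sample, not taken from the Avito extract.
sample = pd.DataFrame({"n_bedrooms": [1, 2, 3, 4, 25]})
filtered = filter_by_range(sample, "n_bedrooms", MIN_BEDROOMS, MAX_BEDROOMS)
print(filtered["n_bedrooms"].tolist())
```

The same helper could be reused for any other numeric column once its bounds are agreed on.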

In [11]:
avito_df["n_bathrooms"].value_counts()

n_bathrooms
2.0    160
1.0    129
3.0     19
0.0      2
4.0      1
5.0      1
Name: count, dtype: int64

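`n_bathrooms` is an optional column and can be null: the counts above sum to 312 of the 317 rows, which is also why the column is stored as float. All non-null values in this extract are whole numbers; a `7+` category, like the one in `Étage`, may also appear in other extracts.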
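
One possible way to keep the nulls while storing whole numbers is pandas' nullable `Int64` dtype. This is only a sketch on made-up values, and it assumes every non-null value is a whole number, as in the output above.

```python
import pandas as pd

# Made-up sample mimicking the float column with a missing value.
sample = pd.Series([2.0, 1.0, None, 3.0], name="n_bathrooms")

# "Int64" (capital I) is the nullable integer dtype; missing values become <NA>.
n_bathrooms = sample.astype("Int64")
print(n_bathrooms.dtype, int(n_bathrooms.isna().sum()))
```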

In [12]:
avito_df["total_area"].head()

0      None
1    101 m²
2     90 m²
3     58 m²
4    100 m²
Name: total_area, dtype: object

`total_area` is optional as well: it can be null or contain the m² symbol alone. Let's remove the m² suffix to inspect the numbers.

In [13]:
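# Keep only rows that hold an actual area (not null, not the bare "m²" symbol).
total_areas = (avito_df[(avito_df["total_area"].notnull()
                         & (avito_df["total_area"]!="m²"))]["total_area"])
# Drop the trailing " m²" (3 characters) and convert to integers.
total_areas = total_areas.str[:-3].astype(int)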

In [14]:
total_areas.describe()

count    246.000000
mean      99.841463
std       54.509419
min        8.000000
25%       69.250000
50%       87.000000
75%      115.750000
max      586.000000
Name: total_area, dtype: float64

The minimum of 8 m² and the maximum of 586 m² look suspicious for apartments, so we might think of a reasonable range to remove outliers from the `total_area` column as well.
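
If a hand-picked range is hard to justify, one alternative is to derive the bounds from the data with Tukey's interquartile-range rule. This is a swapped-in technique, not something the notebook uses; the helper `iqr_bounds` and the multiplier `k=1.5` are assumptions for illustration.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample with one obviously large value.
sample = pd.Series([58, 70, 80, 90, 100, 101, 586], name="total_area")
lower, upper = iqr_bounds(sample)
kept = sample[sample.between(lower, upper)]
print(lower, upper, kept.tolist())
```

Whether such data-driven fences are appropriate depends on how skewed the real distribution is, so the result should be checked against domain knowledge.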

In [15]:
avito_df["price"].value_counts()

price
Prix non spécifié    37
1 100 000 DH          8
650 000 DH            6
550 000 DH            6
700 000 DH            5
                     ..
1 130 000 DH          1
3 450 000 DH          1
4 200 000 DH          1
5 000 DH              1
2 140 000 DH          1
Name: count, Length: 151, dtype: int64

`price` is optional, but it is never null: a missing price is stored as the string "Prix non spécifié". We can reformat the remaining values to look at the numbers.

In [16]:
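# Keep only rows with an actual price, then strip the currency and the
# narrow no-break spaces (U+202F) used as thousands separators.
prices = avito_df[avito_df["price"]!="Prix non spécifié"]["price"]
prices = prices.str.replace("DH", "")
prices = prices.str.replace("\u202f", "")
prices = prices.astype(int)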

In [17]:
prices.describe().astype(int)

count        280
mean     1176787
std       849372
min         5000
25%       627500
50%       933500
75%      1450000
max      4750000
Name: price, dtype: int64

The max value of 4 750 000 DH is high for an apartment, and the min value of 5 000 DH is implausibly low for a sale. We might consider removing outliers from the `price` column as well, using a reasonable range.
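
As an alternative to filtering rows before converting, the price could be parsed in place so that unspecified prices become missing values and the rows stay aligned with the rest of the frame. The helper `parse_price` below is hypothetical; it assumes prices are whole numbers of DH with no decimal part.

```python
import re

import pandas as pd


def parse_price(raw):
    """Return the price in DH as an int, or None when no price is given."""
    if raw is None:
        return None
    # Drop everything that is not a digit: spaces, U+202F separators, "DH".
    digits = re.sub(r"\D", "", raw)
    return int(digits) if digits else None


# Made-up sample in the same format as the `price` column.
sample = pd.Series(["135\u202f000 DH", "Prix non spécifié", "1\u202f100\u202f000 DH"])
prices = sample.map(parse_price).astype("Int64")
print(prices.tolist())
```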

In [18]:
avito_df["city"].value_counts()

city
Casablanca       104
Marrakech         48
Kénitra           22
Rabat             16
Agadir            14
Tanger            12
Martil            11
Mohammedia         9
Temara             9
Salé               7
Meknès             7
Mehdia             5
El Jadida          5
Bouskoura          5
Tétouan            4
Urgent             4
Bouznika           4
Dar Bouazza        4
Fès                3
Saidia             3
Berrechid          3
Deroua             2
Mediouna           2
Béni Mellal        2
Sidi Rahal         2
Tamaris            1
Had Soualem        1
Drargua            1
Ifrane             1
Tiznit             1
Fnideq             1
Asilah             1
الدار البيضاء      1
Nouaceur           1
Cabo Negro         1
Name: count, dtype: int64

We will need to compare these values with the same column from the other datasets to make sure they are consistent.
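
Before comparing with the other datasets, the values could be normalized so that the same city is always spelled the same way. The sketch below is only an illustration: `CITY_ALIASES`, `INVALID_CITIES` and `normalize_city` are hypothetical names, and the two entries are taken from the counts above ("الدار البيضاء" is the Arabic name of Casablanca, and "Urgent" is not a city).

```python
import pandas as pd

# Hypothetical mapping, built by hand from the values seen above.
CITY_ALIASES = {"الدار البيضاء": "Casablanca"}
# Values that are clearly not city names in this extract.
INVALID_CITIES = {"Urgent"}


def normalize_city(city):
    """Map known aliases to one spelling and turn invalid values into None."""
    if city is None:
        return None
    city = city.strip()
    if city in INVALID_CITIES:
        return None
    return CITY_ALIASES.get(city, city)


# Made-up sample reusing values from the `city` column.
sample = pd.Series(["Casablanca", "الدار البيضاء", "Urgent", " Rabat "])
normalized = sample.map(normalize_city)
print(normalized.tolist())
```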

In [19]:
avito_df["title"].value_counts()

title
appartement à vendre                                 4
Appartement à vendre 76 m² à Casablanca              3
شقق للبيع في مدينة برشيد                             2
Appartement à vendre 72 m² à Casablanca              2
appartement                                          2
                                                    ..
Appartement à vendre 90 m² à El Menzeh               1
Appartement à vendre 82 m² à Casablanca              1
Appartement à vendre 92 m² à Kénitra                 1
Très bel apt 3chambres Résidence Sécurisée Busway    1
Appartement en Vente à Bouznika                      1
Name: count, Length: 306, dtype: int64

In [22]:
avito_df[avito_df["title"]=="Appartement à vendre 72 m² à Casablanca"]

Unnamed: 0,url,n_bedrooms,n_bathrooms,total_area,title,price,city,time,user,attributes,equipments,date_time,year,month
182,https://www.avito.ma/fr/aïn_borja/appartements...,2,2.0,81 m²,Appartement à vendre 72 m² à Casablanca,750 000 DH,Casablanca,il y a 3 heures,Benmoussa immobilier,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Concierge, Parking, Sécurité, Terr...",2024-11-14 13:02:00,2024,11
300,https://www.avito.ma/fr/almaz/appartements/App...,2,2.0,74 m²,Appartement à vendre 72 m² à Casablanca,1 050 000 DH,Casablanca,il y a 13 minutes,Jawad Machkour,"{'Type': 'Appartements, à vendre', 'Secteur': ...","[Ascenseur, Balcon, Chauffage, Climatisation, ...",2024-11-14 15:50:00,2024,11


The two rows above share the same `title` but differ in URL, area and price, so we can't rely on `title` to identify duplicates. In this case, deduplicating the rows on it would not be useful.

In [23]:
avito_df["user"].value_counts()

user
AGENZ.MA                  14
kenitra immo              11
ALPHA TRANSACTION          9
hmoz martil                7
Lhaj                       7
                          ..
Ibtissam ta                1
MMP IMMOBILIER             1
Groupe Ikamati             1
North Morocco IMMO         1
IMMO GLOBAL BUSINESS 1     1
Name: count, Length: 180, dtype: int64

We could analyze the `user` column to find the most active advertisers, but that would be an analysis of the platform itself, which is outside the scope of this project. We would rather drop this column.

The `time` column holds the relative age of each announcement (for example `il y a 7 minutes`). It was extracted only to derive the actual publication date, which is stored in the `date_time`, `year` and `month` columns.
Time is a known factor for price changes in real estate, so we keep the derived columns and drop `time`, which serves no other purpose.
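
For reference, the derivation of a date from the relative `time` text could look roughly like the sketch below. It is not the extraction code used for this dataset: the helper `relative_to_datetime`, the scrape timestamp, and the set of supported units are all assumptions (only minutes and hours are visible in this extract).

```python
import re
from datetime import datetime, timedelta

# Assumed French units mapped to timedelta keyword arguments.
UNIT_TO_KWARG = {"minute": "minutes", "heure": "hours", "jour": "days"}


def relative_to_datetime(text, scraped_at):
    """Convert text like 'il y a 7 minutes' to a datetime, or None if unknown."""
    match = re.match(r"il y a (\d+) (minute|heure|jour)s?$", text.strip())
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return scraped_at - timedelta(**{UNIT_TO_KWARG[unit]: amount})


# Hypothetical scrape time, picked for illustration.
scraped_at = datetime(2024, 11, 14, 16, 0)
print(relative_to_datetime("il y a 7 minutes", scraped_at))
```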

In [24]:
avito_df["attributes"].head()

0    {'Type': 'Appartements, à vendre', 'Secteur': ...
1    {'Type': 'Appartements, à vendre', 'Secteur': ...
2    {'Type': 'Appartements, à vendre', 'Secteur': ...
3    {'Type': 'Appartements, à vendre', 'Secteur': ...
4    {'Type': 'Appartements, à vendre', 'Secteur': ...
Name: attributes, dtype: object

`attributes` is an object column that contains a dictionary of attributes. We can extract the keys and values to inspect the data. This column shouldn't contain null values, since some attributes are required for each announcement.

In [25]:
attributes_keys = set()
for attribute in avito_df["attributes"]:
    if attribute is not None:
        attributes_keys.update(attribute.keys())

print(attributes_keys)  # noqa: T201
for attribute_key in attributes_keys:
    avito_df[attribute_key] = avito_df["attributes"].apply(
        lambda x, key=attribute_key: x.get(key) if x is not None else None
    )

avito_df.head()

{'Secteur', 'Surface habitable', 'Type', 'Âge du bien', 'Frais de syndic / mois', 'Étage', 'Salons'}


Unnamed: 0,url,n_bedrooms,n_bathrooms,total_area,title,price,city,time,user,attributes,...,date_time,year,month,Secteur,Surface habitable,Type,Âge du bien,Frais de syndic / mois,Étage,Salons
0,https://www.avito.ma/fr/haut_founty/appartemen...,2,2.0,,Appartement à vendre 81 m² quartier Haut Founty,Prix non spécifié,Agadir,il y a 7 minutes,AM IMMOBILIER AGADIR,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,2024-11-14 15:53:00,2024,11,Haut-Founty,81,"Appartements, à vendre",Neuf,,6,1
1,https://www.avito.ma/fr/ahlane/appartements/Ap...,3,2.0,101 m²,Appartement à vendre 101 m² à Tanger,135 000 DH,Tanger,il y a 19 minutes,Moha,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,2024-11-14 15:41:00,2024,11,Ahlane,101,"Appartements, à vendre",Neuf,,1,1
2,https://www.avito.ma/fr/autre_secteur/appartem...,2,2.0,90 m²,Bel appartement,540 000 DH,Fès,il y a 19 minutes,Med,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,2024-11-14 15:41:00,2024,11,Autre secteur,80,"Appartements, à vendre",1-5 ans,50.0,1,1
3,https://www.avito.ma/fr/riad_toulal/appartemen...,2,2.0,58 m²,Appartement à vendre 58 m² à Meknès,290 000 DH,Meknès,il y a 21 minutes,ORBIS Promotion,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,2024-11-14 15:39:00,2024,11,Riad Toulal,58,"Appartements, à vendre",Neuf,,2,1
4,https://www.avito.ma/fr/al_boustane/appartemen...,2,1.0,100 m²,Appartement Rdc dans R plus 2,520 000 DH,El Jadida,il y a 17 minutes,Tic House,"{'Type': 'Appartements, à vendre', 'Secteur': ...",...,2024-11-14 15:43:00,2024,11,Al Boustane,78,"Appartements, à vendre",Neuf,,Rez de chaussée,1


In [26]:
avito_df["Type"].value_counts()
# 1 value, indicating that the type of the property is an apartment

Type
Appartements, à vendre    317
Name: count, dtype: int64

In [27]:
avito_df["Secteur"].value_counts()
# we need to compare neighborhoods with the ones in the other datasets

Secteur
Toute la ville    45
Autre secteur     30
Guéliz            19
Sidi Maarouf       8
Ain Sebaa          7
                  ..
Aviation           1
Zemmouri           1
Sidi Moumen        1
Partie Est         1
Moujahidine        1
Name: count, Length: 112, dtype: int64

In [28]:
avito_df["Âge du bien"].value_counts()

Âge du bien
Neuf         132
11-20 ans     37
1-5 ans       35
6-10 ans      29
21+ ans        7
Name: count, dtype: int64

In [29]:
(avito_df["Frais de syndic / mois"].isnull().sum(),
avito_df["Frais de syndic / mois"].value_counts())
# column mostly contains null values

(np.int64(227),
 Frais de syndic / mois
 200       17
 100       14
 150       10
 50        10
 300        9
 1          6
 500        5
 250        5
 400        5
 320        1
 900        1
 800000     1
 449        1
 47         1
 20         1
 80         1
 226        1
 130        1
 Name: count, dtype: int64)

In [30]:
avito_df["Étage"].value_counts() # to be compared with the other datasets

Étage
1                  103
2                   67
3                   57
4                   30
Rez de chaussée     26
5                   22
6                    4
7+                   4
8                    2
Name: count, dtype: int64

In [31]:
(int(avito_df["Surface habitable"].isna().sum()),
 avito_df["Surface habitable"].value_counts())
# be careful with the outliers

(5,
 Surface habitable
 85     10
 80      9
 75      9
 72      8
 70      7
        ..
 141     1
 96      1
 182     1
 127     1
 37      1
 Name: count, Length: 118, dtype: int64)

In [32]:
avito_df["Salons"].value_counts()

Salons
1    227
2     54
3      7
0      4
4      3
Name: count, dtype: int64

In [34]:
avito_df["equipments"].head()

0                                                   []
1    [Ascenseur, Balcon, Climatisation, Cuisine équ...
2    [Ascenseur, Balcon, Chauffage, Climatisation, ...
3    [Ascenseur, Balcon, Chauffage, Climatisation, ...
4                                    [Cuisine équipée]
Name: equipments, dtype: object

For the `equipments` column, we can one-hot encode each equipment item. At this step, however, we only want to know the possible values, so that we know what to expect from the other datasets.

In [36]:
equipments = set()
for equipment in avito_df["equipments"]:
    if equipment is not None:
        equipments.update(equipment)

equipment_list = sorted(equipments)
equipment_list

['Ascenseur',
 'Balcon',
 'Chauffage',
 'Climatisation',
 'Concierge',
 'Cuisine équipée',
 'Duplex',
 'Meublé',
 'Parking',
 'Sécurité',
 'Terrasse']
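
The one-hot encoding mentioned above could be sketched as follows. The helper `one_hot_equipments` is hypothetical and runs here on a made-up sample; it treats a missing list as an empty one, which is an assumption about how nulls should be handled.

```python
import pandas as pd


def one_hot_equipments(df, column="equipments"):
    """Return one 0/1 column per equipment item found in `column`."""
    # Treat anything that is not a list (e.g. None) as "no equipment".
    lists = df[column].apply(lambda items: items if isinstance(items, list) else [])
    names = sorted({name for items in lists for name in items})
    return pd.DataFrame(
        {name: lists.apply(lambda items, n=name: int(n in items)) for name in names},
        index=df.index,
    )


# Made-up sample in the same shape as the `equipments` column.
sample = pd.DataFrame(
    {"equipments": [[], ["Ascenseur", "Balcon"], ["Cuisine équipée"], None]}
)
encoded = one_hot_equipments(sample)
print(encoded)
```

The resulting frame shares the index of the input, so it can be joined back to the original rows if needed.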

## Conclusion

Here are the main takeaways from this overview:
- Clean columns for type encoding: `total_area`, `living_area`, `price`, etc.
- Filter columns for outliers: `n_bedrooms`, `total_area`, `price`, etc.
- Compare columns holding the same information from the other datasets, to make sure they are consistent.
- Drop columns that are not useful for the analysis: `user`, `time`, `Frais de syndic / mois`, etc.