# Data Exploration

In [1]:
import numpy as np
import pandas as pd

## Setup

For this tutorial we are going to use scraped Airbnb data for the city of Bordeaux. The data is freely available at **Inside Airbnb**: http://insideairbnb.com/get-the-data.html.

A description of all variables in all datasets is available [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896).

We are going to use 2 datasets:

- listing dataset: contains listing-level information
- pricing dataset: contains pricing data, over time

## Importing Data

Pandas has a variety of functions to import data, including:

- `pd.read_csv()`
- `pd.read_html()`
- `pd.read_parquet()`

Importantly for our purpose, `pd.read_csv()` can directly import data from the web.

The first dataset that we are going to import is the dataset of Airbnb listings in Bordeaux. It contains listing-level information.

In [2]:
url_listings = "https://data.insideairbnb.com/france/nouvelle-aquitaine/bordeaux/2025-06-15/visualisations/listings.csv"
df_listings = pd.read_csv(url_listings)

The second dataset that we are going to use is the dataset of calendar prices. This time the dataset is compressed but we can use the `compression` option to import it directly.

In [3]:
url_prices = "https://data.insideairbnb.com/france/nouvelle-aquitaine/bordeaux/2025-06-15/data/calendar.csv.gz"
df_prices = pd.read_csv(url_prices, compression="gzip")
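
As a minimal offline sketch of how compressed files are handled, the example below writes a small gzip-compressed CSV to a temporary folder and reads it back. The toy DataFrame and the file name `calendar_toy.csv.gz` are hypothetical stand-ins, not part of the Inside Airbnb data. It relies on the pandas default `compression="infer"`, which picks the codec from the file extension, so passing `compression="gzip"` explicitly is optional for a `.gz` file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the calendar file.
df_toy = pd.DataFrame({"listing_id": [1, 1, 2], "price": [50.0, 55.0, 80.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "calendar_toy.csv.gz")
    # The .gz extension makes pandas write a gzip-compressed file.
    df_toy.to_csv(path, index=False)
    # compression="infer" is the default, so no extra argument is needed here.
    df_back = pd.read_csv(path)

print(df_back.equals(df_toy))
```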

## Inspecting Data

> Methods
>
>- `info()`
>- `head()`
>- `describe()`

The first way to have a quick look at the data is the `info()` method. By default it lists every column with its non-null count and dtype. If called with the option `verbose=False`, it only gives a quick overview of the dimensions of the data.

In [4]:
df_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12383 entries, 0 to 12382
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              12383 non-null  int64  
 1   name                            12383 non-null  object 
 2   host_id                         12383 non-null  int64  
 3   host_name                       12383 non-null  object 
 4   neighbourhood_group             12383 non-null  object 
 5   neighbourhood                   12383 non-null  object 
 6   latitude                        12383 non-null  float64
 7   longitude                       12383 non-null  float64
 8   room_type                       12383 non-null  object 
 9   price                           8345 non-null   float64
 10  minimum_nights                  12383 non-null  int64  
 11  number_of_reviews               12383 non-null  int64  
 12  last_review                     
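
To illustrate the `verbose=False` option mentioned above, here is a small sketch on a hypothetical three-column DataFrame (the data is made up for illustration). It captures the summary in a string buffer through the `buf` argument so the text can be inspected or logged.

```python
import io

import pandas as pd

# Hypothetical toy listings; the missing price mimics the real data.
df_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [50.0, None, 80.0],
})

buffer = io.StringIO()
# verbose=False skips the per-column table and prints only the overview.
df_toy.info(verbose=False, buf=buffer)
summary = buffer.getvalue()
print(summary)
```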

If we want to know what the data looks like, we can use the `head()` method. It prints the first 5 rows of the data by default.

In [5]:
df_listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,222887,"Spectacular view, full air-con, elevator",1156398,Suzanne,Bordeaux,Bordeaux Sud,44.836102,-0.566395,Entire home/apt,266.0,3,127,2025-06-22,0.77,4,248,31,3306300031048
1,247452,"Cosy apartment ,barbecue, pool",959918,Krista,Saint-Mdard-en-Jalles,Saint-Mdard-en-Jalles,44.8589,-0.72735,Entire home/apt,115.0,3,79,2025-04-03,0.49,1,147,9,
2,317273,"Luxury, spacious, patio, near public gardens",1156398,Suzanne,Bordeaux,Chartrons - Grand Parc - Jardin Public,44.847801,-0.581046,Entire home/apt,203.0,3,80,2025-04-29,0.6,4,273,20,33063001366CB
3,317658,"Key to Bordeaux · fairytale view, 2 bd + elevator",1156398,Suzanne,Bordeaux,Centre ville (Bordeaux),44.838799,-0.56887,Entire home/apt,222.0,3,161,2025-04-28,1.0,4,269,17,33063001225CF
4,333031,STUDIO BORDEAUX TRIANGLE D OR ***** Climatisé,1697156,Antony,Bordeaux,Centre ville (Bordeaux),44.84256,-0.57794,Entire home/apt,103.0,1,556,2025-06-09,3.45,2,353,53,3306300055979


We can print a description of the data using `describe()`. If we have many variables, it's best to print it transposed using the `.T` attribute.

In [6]:
df_listings.describe().T[:5]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,12383.0,5.339815e+17,5.518793e+17,222887.0,27744260.0,5.833414e+17,1.076652e+18,1.44285e+18
host_id,12383.0,160050800.0,178555100.0,30374.0,29376860.0,76366010.0,232556800.0,700293300.0
latitude,12383.0,44.84028,0.03101698,44.75053,44.8242,44.83893,44.85541,45.01855
longitude,12383.0,-0.588186,0.04622049,-0.831114,-0.6041521,-0.5775376,-0.56498,-0.46428
price,8345.0,125.3258,366.0109,12.0,52.0,79.0,130.0,10000.0


You can select which variables to display using the `include` option. `include='all'` also includes categorical variables.

In [7]:
df_listings.describe(include='all').T[:5]

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
id,12383.0,,,,5.339815005965522e+17,5.518793274702556e+17,222887.0,27744263.5,5.833414180058369e+17,1.0766524088336792e+18,1.4428498584012472e+18
name,12383.0,11759.0,Appartement,19.0,,,,,,,
host_id,12383.0,,,,160050802.918598,178555149.27013,30374.0,29376862.5,76366007.0,232556765.5,700293334.0
host_name,12383.0,2674.0,Guillaume,156.0,,,,,,,
neighbourhood_group,12383.0,28.0,Bordeaux,6817.0,,,,,,,


We can get the list of columns using the `.columns` attribute.

In [8]:
df_listings.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'number_of_reviews_ltm', 'license'],
      dtype='object')

We can get the index using the `.index` attribute.

In [9]:
df_listings.index

RangeIndex(start=0, stop=12383, step=1)

## Data Selection

We can access single columns as if the DataFrame were a dictionary.

In [10]:
df_listings['price']

0        266.0
1        115.0
2        203.0
3        222.0
4        103.0
         ...  
12378     65.0
12379    117.0
12380     63.0
12381     35.0
12382    147.0
Name: price, Length: 12383, dtype: float64

We can select rows and columns by integer position, using the `.iloc` attribute.

In [11]:
df_listings.iloc[:7, 5:9]

Unnamed: 0,neighbourhood,latitude,longitude,room_type
0,Bordeaux Sud,44.836102,-0.566395,Entire home/apt
1,Saint-Mdard-en-Jalles,44.8589,-0.72735,Entire home/apt
2,Chartrons - Grand Parc - Jardin Public,44.847801,-0.581046,Entire home/apt
3,Centre ville (Bordeaux),44.838799,-0.56887,Entire home/apt
4,Centre ville (Bordeaux),44.84256,-0.57794,Entire home/apt
5,Bgles,44.81149,-0.55825,Entire home/apt
6,Chartrons - Grand Parc - Jardin Public,44.85447,-0.582,Private room


If we want to restrict only the columns, we have to use `:` for the unrestricted row dimension. A single slice without it is applied to the rows, not the columns.

In [12]:
df_listings.iloc[:, 5:9].head()

Unnamed: 0,neighbourhood,latitude,longitude,room_type
0,Bordeaux Sud,44.836102,-0.566395,Entire home/apt
1,Saint-Mdard-en-Jalles,44.8589,-0.72735,Entire home/apt
2,Chartrons - Grand Parc - Jardin Public,44.847801,-0.581046,Entire home/apt
3,Centre ville (Bordeaux),44.838799,-0.56887,Entire home/apt
4,Centre ville (Bordeaux),44.84256,-0.57794,Entire home/apt
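
The behaviour of a single slice versus an explicit `:` can be checked on a small example. This is a sketch on a hypothetical 3x3 DataFrame, not on the listings data.

```python
import pandas as pd

# Hypothetical toy frame with three rows and three columns.
df_toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# Rows only: the column part can be omitted.
first_rows = df_toy.iloc[:2]

# Columns only: the rows need an explicit ':'.
last_cols = df_toy.iloc[:, 1:3]

# A single slice always applies to rows, so this returns rows 1-2, not columns b-c.
not_columns = df_toy.iloc[1:3]

print(first_rows.shape, last_cols.shape, not_columns.shape)
```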


Instead, the `.loc` attribute allows us to use row and column names.

In [13]:
df_listings.loc[:, ['neighbourhood', 'latitude', 'longitude']].head()

Unnamed: 0,neighbourhood,latitude,longitude
0,Bordeaux Sud,44.836102,-0.566395
1,Saint-Mdard-en-Jalles,44.8589,-0.72735
2,Chartrons - Grand Parc - Jardin Public,44.847801,-0.581046
3,Centre ville (Bordeaux),44.838799,-0.56887
4,Centre ville (Bordeaux),44.84256,-0.57794


We can also select ranges.

In [14]:
df_listings.loc[:, 'neighbourhood':'room_type'].head()

Unnamed: 0,neighbourhood,latitude,longitude,room_type
0,Bordeaux Sud,44.836102,-0.566395,Entire home/apt
1,Saint-Mdard-en-Jalles,44.8589,-0.72735,Entire home/apt
2,Chartrons - Grand Parc - Jardin Public,44.847801,-0.581046,Entire home/apt
3,Centre ville (Bordeaux),44.838799,-0.56887,Entire home/apt
4,Centre ville (Bordeaux),44.84256,-0.57794,Entire home/apt
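
One detail worth keeping in mind: label slices with `.loc` include both endpoints, while positional slices with `.iloc` exclude the stop position. A short sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy frame reusing some of the listing column names.
df_toy = pd.DataFrame({
    "neighbourhood": ["A", "B", "C"],
    "latitude": [44.83, 44.85, 44.84],
    "longitude": [-0.56, -0.72, -0.58],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
})

# Label slice: 'longitude' is included.
by_label = df_toy.loc[:, "latitude":"longitude"]

# Positional slice: position 2 is excluded, so only 'latitude' is returned.
by_position = df_toy.iloc[:, 1:2]

# The same rule applies to rows.
rows_by_label = df_toy.loc[0:1]      # rows 0 and 1
rows_by_position = df_toy.iloc[0:1]  # row 0 only

print(list(by_label.columns), list(by_position.columns))
```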


There is an easy way to **select numerical columns**: the `.select_dtypes()` method.

In [15]:
df_listings.select_dtypes(include=['number']).head()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm
0,222887,1156398,44.836102,-0.566395,266.0,3,127,0.77,4,248,31
1,247452,959918,44.8589,-0.72735,115.0,3,79,0.49,1,147,9
2,317273,1156398,44.847801,-0.581046,203.0,3,80,0.6,4,273,20
3,317658,1156398,44.838799,-0.56887,222.0,3,161,1.0,4,269,17
4,333031,1697156,44.84256,-0.57794,103.0,1,556,3.45,2,353,53


Other types include

- `object` for strings
- `bool` for booleans
- `int` for integers
- `float` for floats (numbers that are not integers)
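
As a small sketch of how these type selectors behave, the example below uses a hypothetical DataFrame (the `is_superhost` column is invented for illustration) and combines `include` with `exclude`. Note that `'number'` covers integers and floats but not booleans.

```python
import pandas as pd

# Hypothetical toy frame with one column of each kind.
df_toy = pd.DataFrame({
    "name": ["a", "b"],
    "is_superhost": [True, False],
    "minimum_nights": [1, 3],
    "price": [50.0, 80.5],
})

numeric = df_toy.select_dtypes(include="number")      # ints and floats
booleans = df_toy.select_dtypes(include="bool")       # booleans only
non_numeric = df_toy.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(booleans.columns))
print(list(non_numeric.columns))
```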

We can also use logical operators to select rows.

In [16]:
df_listings.loc[df_listings['number_of_reviews']>500, :].head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
4,333031,STUDIO BORDEAUX TRIANGLE D OR ***** Climatisé,1697156,Antony,Bordeaux,Centre ville (Bordeaux),44.84256,-0.57794,Entire home/apt,103.0,1,556,2025-06-09,3.45,2,353,53,3306300055979
9,482102,Beautiful city house full comfort,2387430,Frederic,Le Bouscat,Le Bouscat,44.8542,-0.59581,Entire home/apt,224.0,2,503,2025-06-19,3.19,1,307,34,
53,1117474,INCROYABLE T3 EN HYPERCENTRE,70229865,Thomas,Bordeaux,Centre ville (Bordeaux),44.83871,-0.56846,Entire home/apt,130.0,3,516,2025-06-11,3.5,22,190,41,330630009852A
54,1136724,"Spacious house, 4 guests with Garden & Garage",3774373,Stéphane & Virginie,Bordeaux,Saint Augustin - Tauzin - Alphonse Dupeux,44.82893,-0.58775,Entire home/apt,,2,581,2025-02-05,3.97,2,0,15,33063003349C9
57,1172803,PATIO STUDIO CENTRE,1029070,Louis,Bordeaux,Bordeaux Sud,44.83472,-0.57507,Entire home/apt,59.0,4,711,2025-06-13,4.86,5,104,25,330630002860D


We can combine several conditions as well, but remember to wrap each condition in parentheses.

**Note**: the `and` and `or` expressions do not work in this setting. We have to use `&` and `|` instead.

In [17]:
df_listings.loc[(df_listings['number_of_reviews']>300) &
                (df_listings['reviews_per_month']>7), 
                :].head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
394,6424164,Belle et grande chambre avec salle de bain privée,33338004,Thierry,Merignac,Arlac,44.82891,-0.63392,Private room,41.0,1,926,2025-06-12,7.54,2,208,92,
726,9591925,Chambre privée chez l'habitant face à la gare,49645813,Julien,Bordeaux,Bordeaux Sud,44.82543,-0.55688,Private room,47.0,1,1068,2025-06-25,9.16,4,362,93,330630000616c
804,10890090,Chambre privée chez l'habitant gare saintjean...,49645813,Julien,Bordeaux,Bordeaux Sud,44.82748,-0.5568,Private room,50.0,1,862,2025-06-25,7.54,4,363,69,330630000616c
1104,12926326,Charmante Maison à 5mn de Bordeaux,70937383,Alexandre,Floirac,Floirac,44.83349,-0.52772,Entire home/apt,46.0,1,893,2025-06-19,8.09,4,269,84,
1116,12956742,Bright and independent room without vis-a-vis.,62154643,Claude,Le Bouscat,Le Bouscat,44.86797,-0.58236,Private room,35.0,1,829,2025-06-22,7.55,2,344,80,
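
The sketch below, on hypothetical toy data, shows the element-wise operators at work and confirms that the Python keyword `and` raises a `ValueError` on Series. The parentheses matter because `&` and `|` bind more tightly than comparisons such as `>`; naming the masks first, as done here, is one way to avoid the issue.

```python
import pandas as pd

# Hypothetical toy data with the two columns used in the filter above.
df_toy = pd.DataFrame({
    "number_of_reviews": [10, 400, 900],
    "reviews_per_month": [0.5, 8.0, 2.0],
})

many = df_toy["number_of_reviews"] > 300
frequent = df_toy["reviews_per_month"] > 7

both = df_toy.loc[many & frequent]    # element-wise AND
either = df_toy.loc[many | frequent]  # element-wise OR

# The keyword 'and' asks for a single truth value of a whole Series and fails.
try:
    df_toy.loc[many and frequent]
    raised = False
except ValueError:
    raised = True

print(list(both.index), list(either.index), raised)
```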


For a single column (i.e. a Series), we can get the unique values using the `unique()` function.

In [18]:
df_listings['neighbourhood'].unique()

array(['Bordeaux Sud', 'Saint-Mdard-en-Jalles',
       'Chartrons - Grand Parc - Jardin Public',
       'Centre ville (Bordeaux)', 'Bgles', 'Talence', 'Le Bouscat',
       'Saint Augustin - Tauzin - Alphonse Dupeux', 'Caudran',
       'Nansouty - Saint Gens', 'Magonty', 'Palmer-Gravires-Cavailles',
       'Carbon-Blanc', 'La Glacire', 'France Alouette', 'Gradignan',
       'Le Bourg', 'Eysines', 'Bouliac', 'Lormont', 'Saint-Aubin-de-Mdoc',
       'Floirac', 'La Bastide', 'Verthamon', 'Chiquet-Fontaudin',
       'Bordeaux Maritime', 'Bruges', "Villenave-d'Ornon", 'Les Eyquems',
       'Bourran', 'Parempuyre', 'Gambetta-Mairie-Lissandre', 'Capeyron',
       'Arlac', 'Chemin Long', 'Ambars-et-Lagrave', 'Blanquefort',
       'Bassens', 'Le Haillan', 'Le Monteil', 'Saige', 'Beaudsert',
       'Sardine', 'Le Taillan-Mdoc', 'Centre ville (Merignac)',
       'Martignas-sur-Jalle', 'Cap de Bos', 'Casino', 'Beutre',
       '3M-Bourgailh', 'Plaisance-Loret-Maregue', 'Nos',
       'Artigues-Prs-Bo

For multiple columns, we can use the `drop_duplicates` function.

In [19]:
df_listings[['neighbourhood', 'room_type']].drop_duplicates()

Unnamed: 0,neighbourhood,room_type
0,Bordeaux Sud,Entire home/apt
1,Saint-Mdard-en-Jalles,Entire home/apt
2,Chartrons - Grand Parc - Jardin Public,Entire home/apt
3,Centre ville (Bordeaux),Entire home/apt
5,Bgles,Entire home/apt
...,...,...
11042,Le Bourg,Shared room
11207,Caudran,Shared room
11209,Bourran,Shared room
11275,Bgles,Shared room
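
A small sketch on hypothetical data shows what `drop_duplicates` keeps: the first occurrence of each distinct combination, with its original index label. Chaining `reset_index(drop=True)` is an optional step, added here for illustration, to renumber the result.

```python
import pandas as pd

# Hypothetical toy data with one repeated (neighbourhood, room_type) pair.
df_toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "A"],
    "room_type": ["Entire", "Entire", "Private", "Entire", "Private"],
})

pairs = df_toy[["neighbourhood", "room_type"]].drop_duplicates()
pairs_renumbered = pairs.reset_index(drop=True)

# nunique() counts distinct values in a single column.
n_neighbourhoods = df_toy["neighbourhood"].nunique()

print(pairs)
print(n_neighbourhoods)
```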


## Aggregation and Pivot Tables

We can compute statistics by group using `.groupby()`.

In [None]:
df_listings.groupby('neighbourhood')[['price', 'reviews_per_month']].mean()

Unnamed: 0_level_0,price,reviews_per_month
neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1
3M-Bourgailh,43.527769,1.271122
Ambars-et-Lagrave,122.678057,1.465372
Ambs,68.858551,0.851952
Arago-La Chataigneraie,63.807201,1.287130
Arlac,67.683729,1.403024
...,...,...
Sardine,68.684359,1.749963
Talence,74.721414,1.579673
Toctoucau,152.552261,1.832516
Verthamon,70.956368,1.879508
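
Since `price` has missing values in the listings data, it helps to know that group means skip them by default. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data; one price is missing on purpose.
df_toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B"],
    "price": [100.0, None, 60.0],
    "reviews_per_month": [1.0, 3.0, 2.0],
})

means = df_toy.groupby("neighbourhood")[["price", "reviews_per_month"]].mean()

# Group A: the missing price is ignored, so the mean is based on one value.
print(means)
```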


If you want to apply more than one function, possibly to different columns, you can use `.aggregate()`, which can be shortened to `.agg()`. It takes as argument a dictionary with variables as keys and lists of functions as values.

In [26]:
df_listings.groupby('neighbourhood').agg({"price": ["mean"],
                                          "minimum_nights": ["min", np.max]}).reset_index()

Unnamed: 0_level_0,neighbourhood,price,minimum_nights,minimum_nights
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,amax
0,3M-Bourgailh,68.000000,1,7
1,Ambars-et-Lagrave,124.272727,1,12
2,Ambs,59.500000,1,3
3,Arago-La Chataigneraie,74.581818,1,7
4,Arlac,93.851351,1,200
...,...,...,...,...
57,Sardine,79.727273,1,40
58,Talence,84.757437,1,364
59,Toctoucau,187.230769,1,14
60,Verthamon,88.384615,1,90


The problem with this syntax is that it generates a hierarchical structure for variable names, which might not be so easy to work with. In the example above, to access the mean price, you have to use `df.price["mean"]`.
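
As a sketch of how to work with the hierarchical names, the example below uses hypothetical toy data to select one column with a tuple key and then flattens the two levels into single strings. The `"_".join` naming scheme is just one possible convention, chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data.
df_toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B"],
    "price": [100.0, 50.0, 60.0],
    "minimum_nights": [1, 3, 2],
})

stats = df_toy.groupby("neighbourhood").agg({
    "price": ["mean"],
    "minimum_nights": ["min", "max"],
})

# A tuple key selects a single column from the two-level header.
mean_price = stats[("price", "mean")]

# Flatten ('price', 'mean') into 'price_mean', and so on.
stats_flat = stats.copy()
stats_flat.columns = ["_".join(col) for col in stats.columns]

print(list(stats_flat.columns))
```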

To perform variable naming and aggregation at the same time, you can use the following syntax: `agg(output_var = ("input_var", function))`.

In [None]:
# mean_reviews averages reviews_per_month; the price columns use price
df_listings.groupby('neighbourhood').agg(mean_reviews=("reviews_per_month", "mean"),
                                         min_price=("price", "min"),
                                         max_price=("price", "max")).reset_index()

Unnamed: 0,neighbourhood,mean_reviews,min_price,max_price
0,3M-Bourgailh,0.850000,22.0,150.0
1,Ambars-et-Lagrave,1.097692,29.0,622.0
2,Ambs,0.727273,23.0,290.0
3,Arago-La Chataigneraie,1.077302,26.0,342.0
4,Arlac,1.080115,20.0,394.0
...,...,...,...,...
57,Sardine,1.305000,28.0,300.0
58,Talence,1.208582,16.0,760.0
59,Toctoucau,1.184000,30.0,636.0
60,Verthamon,1.456786,30.0,280.0
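
The named-aggregation syntax can be tried on a small hypothetical DataFrame. The result has flat column names, so each statistic is directly accessible by the name given on the left-hand side.

```python
import pandas as pd

# Hypothetical toy data.
df_toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B"],
    "price": [100.0, 50.0, 60.0],
    "reviews_per_month": [1.0, 3.0, 2.0],
})

summary = df_toy.groupby("neighbourhood").agg(
    mean_reviews=("reviews_per_month", "mean"),
    min_price=("price", "min"),
    max_price=("price", "max"),
).reset_index()

print(summary)
```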


We can make pivot tables with the `.pivot_table()` method. It takes the following arguments:

- `index`: rows
- `columns`: columns
- `values`: values
- `aggfunc`: aggregation function

In [27]:
df_listings.pivot_table(index='neighbourhood', columns='room_type', values='price', aggfunc='mean')

room_type,Entire home/apt,Hotel room,Private room,Shared room
neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3M-Bourgailh,96.800000,,39.200000,
Ambars-et-Lagrave,148.761905,,45.153846,
Ambs,113.250000,,38.000000,
Arago-La Chataigneraie,88.268293,,34.500000,
Arlac,105.475410,,39.307692,
...,...,...,...,...
Sardine,100.785714,,42.875000,
Talence,98.724551,130.0,38.578431,
Toctoucau,219.400000,,80.000000,
Verthamon,105.473684,,42.000000,
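
To make the four arguments concrete, here is a sketch on hypothetical toy data. It also computes the same numbers with `groupby` followed by `unstack`, which is one way to think about what a pivot table with `aggfunc='mean'` does. Combinations that never occur in the data show up as `NaN`, as in the output above.

```python
import pandas as pd

# Hypothetical toy data; neighbourhood B has no private rooms.
df_toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B"],
    "room_type": ["Entire", "Entire", "Private", "Entire"],
    "price": [100.0, 80.0, 40.0, 60.0],
})

pivot = df_toy.pivot_table(index="neighbourhood", columns="room_type",
                           values="price", aggfunc="mean")

# Same numbers: group by both keys, then move room_type into the columns.
via_groupby = df_toy.groupby(["neighbourhood", "room_type"])["price"].mean().unstack()

print(pivot)
```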
