Exploratory Data Analysis: Airbnb Lisbon

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Summary: Key Questions for EDA

	•	What does the dataset look like (columns, types, missing values)?
	•	What’s the price distribution?
	•	Which neighbourhoods are most popular?
	•	How does price vary by neighbourhood?
	•	What are the room types and how do they differ?
	•	Are there any outliers?
	•	How does availability vary?
	•	Are reviews linked to price or popularity?
	•	Any patterns in minimum nights or extra fees?

In [7]:
import os
print(os.getcwd())

/Users/jamshedmaqsudov/Airbnb_Lisbon_project_2025


| EDA Finding | Real-World Insight |
| --- | --- |
| Some neighbourhoods are much more expensive | These may be luxury zones or tourist hubs |
| Entire homes cost 3x more than private rooms | Clear pricing tiers based on privacy |
| Outliers in price | Possible data errors or luxury listings |
| Listings with high reviews but low price | Possibly undervalued or high competition |
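Findings like these can be checked with a few grouped summaries. The following is a minimal sketch, assuming columns named `neighbourhood`, `room_type`, and `price` as in the listings file. The `toy` frame and every value in it are invented for illustration and stand in for the real data.

```python
import pandas as pd

# Toy stand-in for the real listings data; all values are invented.
toy = pd.DataFrame({
    "neighbourhood": ["Estrela", "Estrela", "Arroios", "Arroios", "Arroios"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room"],
    "price": [150.0, 50.0, 90.0, 30.0, 40.0],
})

# The median is less sensitive to extreme prices than the mean.
median_by_room = toy.groupby("room_type")["price"].median()
median_by_area = (
    toy.groupby("neighbourhood")["price"].median().sort_values(ascending=False)
)

# Ratio of the entire-home median to the private-room median.
ratio = median_by_room["Entire home/apt"] / median_by_room["Private room"]

print(median_by_room)
print(median_by_area)
print(f"Entire home / private room median ratio: {ratio:.2f}")
```

On the real data, the same two `groupby` calls on `listings` would show whether the "3x" gap and the neighbourhood differences actually hold.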

In [13]:
listings = pd.read_csv("/Users/jamshedmaqsudov/Airbnb_Lisbon_project_2025/listings.csv")

### Key Columns (clarified)

- **neighbourhood_group**: Broad area or region of the city (sometimes missing or unused).
- **room_type**: The kind of space offered — e.g., "Entire home/apt", "Private room", or "Shared room".
- **reviews_per_month**: Average number of reviews per month. Often missing (NaN) if there are no reviews.
- **calculated_host_listings_count**: Total number of listings a host has — helps detect professional hosts.
- **availability_365**: Number of days in a year the listing is marked as available.
- **number_of_reviews_ltm**: Total reviews received in the last 12 months ("ltm" = last twelve months).
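
The column notes above suggest a few simple derived features. This is a hedged sketch on invented rows: the threshold `PRO_HOST_MIN_LISTINGS` and the new column names `is_pro_host` and `availability_share` are assumptions made for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows with invented values, using the column names described above.
toy = pd.DataFrame({
    "number_of_reviews": [82, 0, 416],
    "reviews_per_month": [0.64, np.nan, 2.67],
    "calculated_host_listings_count": [1, 5, 2],
    "availability_365": [284, 0, 365],
})

# A listing with no reviews has no monthly rate, so 0 is a natural fill value.
no_reviews = toy["number_of_reviews"] == 0
toy.loc[no_reviews, "reviews_per_month"] = 0.0

# ASSUMPTION: hosts with 3 or more listings are treated as "professional".
PRO_HOST_MIN_LISTINGS = 3
toy["is_pro_host"] = toy["calculated_host_listings_count"] >= PRO_HOST_MIN_LISTINGS

# Share of the year the listing is marked as available.
toy["availability_share"] = toy["availability_365"] / 365

print(toy)
```

Filling only the rows where `number_of_reviews` is 0 keeps any other unexpected NaN visible instead of silently masking it.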

In [15]:
listings.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,6499,Belém 1 Bedroom Historical Apartment,14455,Bruno,Lisboa,Belém,38.6975,-9.19768,Entire home/apt,73.0,3,82,2025-03-01,0.64,1,284,18,
1,25659,Heart of Alfama - Le cœur d'Alfama (3 people),107347,Ellie,Lisboa,Santa Maria Maior,38.71241,-9.12706,Entire home/apt,106.0,2,215,2024-11-11,1.62,1,298,13,56539/AL.
2,29396,Alfama Hill - Boutique apartment,126415,Mónica,Lisboa,Santa Maria Maior,38.71156,-9.12987,Entire home/apt,75.0,3,416,2025-02-27,2.67,1,206,30,28737/AL
3,29720,TheHOUSE - Your luxury home,128075,Francisco,Lisboa,Estrela,38.71108,-9.15979,Entire home/apt,1065.0,2,142,2025-02-02,0.82,1,247,22,55695/AL
4,29915,Modern and Spacious Apartment in Lisboa,128890,Sara,Lisboa,Avenidas Novas,38.74571,-9.15264,Entire home/apt,95.0,6,61,2024-05-20,0.34,1,169,1,85851/AL.


Recommended Order (Mini Workflow):

	1.	📊 Understand the structure
		•	df.info() → See data types & missing values
		•	df.describe() → See numeric summary
		•	df.isna().sum() → Count missing values
		•	df.nunique() → Check unique values per column
	2.	🧹 Clean your data
		•	Handle missing values (NaN)
		•	Convert date columns if needed (pd.to_datetime())
		•	Remove or filter outliers (like listings with price = 0 or over 10,000)
		•	Drop unnecessary columns (e.g., license if it’s all null)
	3.	📈 Then start visualizing
		•	Now your plots will make sense and not be skewed by messy or invalid data.
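
The workflow above can be sketched in a few lines of pandas. This is only an illustration on an invented `toy` frame, not the cleaning actually applied in this notebook. The price bounds (above 0, at most 10,000) are taken from the example in the list, and the column names follow the listings file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for listings; all values are invented for illustration.
toy = pd.DataFrame({
    "price": [73.0, 0.0, 20000.0, np.nan, 95.0],
    "last_review": ["2025-03-01", "2024-11-11", None, "2025-02-02", "2024-05-20"],
    "license": [np.nan] * 5,
})

# 1. Structure: count missing values per column.
missing = toy.isna().sum()
print(missing)

# 2. Clean: parse dates; entries that cannot be parsed become NaT.
clean = toy.copy()
clean["last_review"] = pd.to_datetime(clean["last_review"], errors="coerce")

# Drop columns that contain no data at all (here: license).
clean = clean.dropna(axis="columns", how="all")

# Keep prices above 0 and at most 10,000.
# Rows with a missing price fail both comparisons and are dropped too.
clean = clean[(clean["price"] > 0) & (clean["price"] <= 10_000)]

print(clean)
```

Whether rows with a missing price should be dropped or kept depends on the question being asked; here they are dropped as a side effect of the price filter, so it is worth making that choice explicit on the real data.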

In [24]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24264 entries, 0 to 24263
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              24264 non-null  int64  
 1   name                            24264 non-null  object 
 2   host_id                         24264 non-null  int64  
 3   host_name                       24248 non-null  object 
 4   neighbourhood_group             24264 non-null  object 
 5   neighbourhood                   24264 non-null  object 
 6   latitude                        24264 non-null  float64
 7   longitude                       24264 non-null  float64
 8   room_type                       24264 non-null  object 
 9   price                           21079 non-null  float64
 10  minimum_nights                  24264 non-null  int64  
 11  number_of_reviews               24264 non-null  int64  
 12  last_review                     

In [30]:
listings['price'].describe()

count    21079.000000
mean       156.624982
std        576.065191
min          8.000000
25%         60.000000
50%         87.000000
75%        132.000000
max      20000.000000
Name: price, dtype: float64
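
The summary points to a strongly right-skewed distribution: the mean (about 157) sits well above the median (87), and the maximum (20,000) is far beyond the 75th percentile (132). One common way to flag such values is the 1.5 × IQR rule; with the quartiles above, the upper fence would be 132 + 1.5 × (132 − 60) = 240. Below is a minimal sketch of that rule on invented prices. It is one possible approach, not necessarily the one used for this dataset.

```python
import numpy as np
import pandas as pd

# Invented prices standing in for listings['price'].
price = pd.Series(
    [40.0, 60.0, 80.0, 90.0, 100.0, 130.0, 150.0, 20000.0], name="price"
)

# Quartiles and interquartile range.
q1, q3 = price.quantile([0.25, 0.75])
iqr = q3 - q1

# Upper fence of the 1.5 * IQR rule; values above it are flagged.
upper_fence = q3 + 1.5 * iqr
is_outlier = price > upper_fence

# A log transform compresses the long right tail for plotting.
log_price = np.log1p(price)

print(f"upper fence: {upper_fence}")
print(price[is_outlier])
```

Flagged rows are candidates for inspection rather than automatic removal, since a very high price may be a genuine luxury listing.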