# Exploratory Data Analysis (EDA)




## Objective

* The objective of this exploratory data analysis is to understand patterns, trends, and differences in Airbnb listings across Berlin and Bangkok.

* This includes analyzing neighbourhood distribution, property and room types, pricing behavior, and review-related metrics to support meaningful dashboard insights.


## Dataset Used
- Berlin (cleaned)
- Bangkok (cleaned)



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd

BASE_PATH = "/content/drive/MyDrive/AlmaBetter/Module_4/data"

# Load cleaned datasets
berlin_listings = pd.read_csv(f"{BASE_PATH}/berlin/listings_clean.csv")
berlin_reviews = pd.read_csv(f"{BASE_PATH}/berlin/reviews_clean.csv")
berlin_neighbourhoods = pd.read_csv(f"{BASE_PATH}/berlin/neighbourhoods_clean.csv")

bangkok_listings = pd.read_csv(f"{BASE_PATH}/bangkok/listings_clean.csv")
bangkok_reviews = pd.read_csv(f"{BASE_PATH}/bangkok/reviews_clean.csv")
bangkok_neighbourhoods = pd.read_csv(f"{BASE_PATH}/bangkok/neighbourhoods_clean.csv")


## Overview Metrics



In [4]:
# Basic overview metrics

overview = {
    "Berlin": {
        "total_listings": berlin_listings.shape[0],
        "unique_neighbourhoods": berlin_listings["neighbourhood"].nunique(),
        "unique_hosts": berlin_listings["host_id"].nunique()
    },
    "Bangkok": {
        "total_listings": bangkok_listings.shape[0],
        "unique_neighbourhoods": bangkok_listings["neighbourhood"].nunique(),
        "unique_hosts": bangkok_listings["host_id"].nunique()
    }
}

overview


{'Berlin': {'total_listings': 9264,
  'unique_neighbourhoods': 138,
  'unique_hosts': 5255},
 'Bangkok': {'total_listings': 23273,
  'unique_neighbourhoods': 50,
  'unique_hosts': 6669}}

## Neighbourhood Analysis



In [5]:
# Neighbourhood-wise listing counts (top 10 only)

berlin_neighbourhood_counts = (
    berlin_listings["neighbourhood"]
    .value_counts()
    .head(10)
)

bangkok_neighbourhood_counts = (
    bangkok_listings["neighbourhood"]
    .value_counts()
    .head(10)
)

berlin_neighbourhood_counts, bangkok_neighbourhood_counts


(neighbourhood
 alexanderplatz              725
 frankfurter allee süd fk    477
 tempelhofer vorstadt        420
 brunnenstr. süd             402
 prenzlauer berg südwest     290
 südliche luisenstadt        242
 schöneberg-nord             212
 prenzlauer berg süd         197
 prenzlauer berg nordwest    193
 karl-marx-allee-süd         183
 Name: count, dtype: int64,
 neighbourhood
 vadhana         3709
 khlong toei     3119
 huai khwang     3033
 ratchathewi     1243
 sathon          1092
 phra khanong    1001
 phra nakhon      924
 bang rak         756
 suanluang        728
 chatu chak       681
 Name: count, dtype: int64)

## Property and Room Type Analysis



In [19]:
# room type distribution in terms of counts and percentage
def room_type_distribution(df, col="room_type"):
    counts = df[col].value_counts()
    percentages = round((counts / counts.sum()) * 100, 2)
    return counts, percentages


city_datasets = {
    "Berlin": berlin_listings,
    "Bangkok": bangkok_listings
}

room_type_results = {}

for city, df in city_datasets.items():
    counts, pct = room_type_distribution(df)

    room_type_results[city] = {
        "counts": counts,
        "percentage": pct
    }

    print(f"\n{city} Room Type Distribution:")
    print(counts)
    print(pct)





Berlin Room Type Distribution:
room_type
entire home/apt    6806
private room       2260
hotel room          101
shared room          97
Name: count, dtype: int64
room_type
entire home/apt    73.47
private room       24.40
hotel room          1.09
shared room         1.05
Name: count, dtype: float64

Bangkok Room Type Distribution:
room_type
entire home/apt    16498
private room        6263
shared room          314
hotel room           198
Name: count, dtype: int64
room_type
entire home/apt    70.89
private room       26.91
shared room         1.35
hotel room          0.85
Name: count, dtype: float64


## Pricing Analysis



In [30]:
# Calculate pricing statistics
def pricing_stats(df, col="price"):
    """Return summary statistics for the price column."""
    return df[col].describe()


pricing_stats_results = {}

for city, df in city_datasets.items():
    stats = pricing_stats(df)
    pricing_stats_results[city] = stats
    print(f"\n{city} Pricing Statistics:")
    print(stats)



Berlin Pricing Statistics:
count     9264.000000
mean       201.240393
std       1656.989769
min          5.000000
25%         70.000000
50%        104.000000
75%        160.000000
max      50000.000000
Name: price, dtype: float64

Bangkok Pricing Statistics:
count      23273.000000
mean        2528.749151
std        16473.896035
min            4.000000
25%          923.000000
50%         1379.000000
75%         2207.000000
max      1000000.000000
Name: price, dtype: float64


## Reviews and Availability Analysis



In [32]:
# Review and availability statistics for listings
def review_availability_stats(df):
    """Return summary statistics for review counts and yearly availability."""
    review_stats = df["number_of_reviews"].describe()
    availability_stats = df["availability_365"].describe()
    return review_stats, availability_stats


review_availability_stats_results = {}

for city, df in city_datasets.items():
    review_stats, availability_stats = review_availability_stats(df)
    review_availability_stats_results[city] = {
        "review_stats": review_stats,
        "availability_stats": availability_stats
    }
    print(f"\n{city} Review and Availability Statistics:")
    print("Review Statistics:")
    print(review_stats)
    print("\nAvailability Statistics:")
    print(availability_stats)


Berlin Review and Availability Statistics:
Review Statistics:
count    9264.000000
mean       60.029901
std       118.268617
min         0.000000
25%         1.000000
50%        14.000000
75%        66.000000
max      2895.000000
Name: number_of_reviews, dtype: float64

Availability Statistics:
count    9264.000000
mean      216.195488
std       120.585205
min         0.000000
25%       104.000000
50%       246.000000
75%       329.000000
max       365.000000
Name: availability_365, dtype: float64

Bangkok Review and Availability Statistics:
Review Statistics:
count    23273.000000
mean        23.164611
std         59.582666
min          0.000000
25%          0.000000
50%          4.000000
75%         21.000000
max       2926.000000
Name: number_of_reviews, dtype: float64

Availability Statistics:
count    23273.000000
mean       278.135952
std         98.949843
min          0.000000
25%        226.000000
50%        318.000000
75%        364.000000
max        365.000000
Name: availabi

## Key Observations

### 🔹 Market Size and Host Activity

* Bangkok has about 2.5 times as many listings as Berlin (23,273 vs 9,264) but only about 27% more hosts (6,669 vs 5,255).

* Bangkok hosts therefore manage more listings on average (about 3.5 per host vs about 1.8 in Berlin), suggesting a more commercialized hosting market than Berlin's.
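
The listings-per-host comparison can be checked with a few lines of arithmetic. The sketch below hard-codes the counts reported in the Overview Metrics output. The helper name `listings_per_host` is introduced here for illustration and is not part of the notebook.

```python
# Counts copied from the Overview Metrics output.
overview_counts = {
    "Berlin": {"total_listings": 9264, "unique_hosts": 5255},
    "Bangkok": {"total_listings": 23273, "unique_hosts": 6669},
}


def listings_per_host(metrics):
    """Average number of listings per unique host (illustrative helper)."""
    return metrics["total_listings"] / metrics["unique_hosts"]


ratios = {
    city: round(listings_per_host(metrics), 2)
    for city, metrics in overview_counts.items()
}
print(ratios)  # {'Berlin': 1.76, 'Bangkok': 3.49}
```

An average hides how listings are spread across hosts. On the loaded data, `bangkok_listings["host_id"].value_counts()` would show whether a few hosts account for most of the difference.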

### 🔹 Neighbourhood Concentration of Listings

* In Berlin, listings are spread across many neighbourhoods (138). The three largest, Alexanderplatz, Frankfurter Allee Süd, and Tempelhofer Vorstadt, hold about 17.5% of all listings, which suggests that tourist and transit-connected zones attract more listings.

* In Bangkok, listings fall into fewer neighbourhoods (50), and Vadhana, Khlong Toei, and Huai Khwang alone account for about 42% of all listings.

* Bangkok's Airbnb market is therefore more geographically concentrated, while Berlin's listings are more evenly distributed across neighbourhoods.
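
How concentrated each market is can be quantified as the share of listings held by the three largest neighbourhoods. The sketch below uses the counts printed in the Neighbourhood Analysis and Overview Metrics outputs. The variable names are illustrative.

```python
# Top-3 neighbourhood counts and city totals copied from the outputs above.
top3_counts = {
    "Berlin": {
        "alexanderplatz": 725,
        "frankfurter allee süd fk": 477,
        "tempelhofer vorstadt": 420,
    },
    "Bangkok": {
        "vadhana": 3709,
        "khlong toei": 3119,
        "huai khwang": 3033,
    },
}
total_listings = {"Berlin": 9264, "Bangkok": 23273}

# Share (in percent) of all listings located in the three largest neighbourhoods.
top3_share = {
    city: round(sum(counts.values()) / total_listings[city] * 100, 1)
    for city, counts in top3_counts.items()
}
print(top3_share)  # {'Berlin': 17.5, 'Bangkok': 42.4}
```

On the loaded data, the same figure should be obtainable with `berlin_listings["neighbourhood"].value_counts(normalize=True).head(3).sum()`.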

### 🔹 Room Type Distribution

* In both cities, entire homes/apartments dominate the market:

  Berlin: 73.5% entire homes

  Bangkok: 70.9% entire homes

* Private rooms form about one-quarter of listings in both cities.

* Shared rooms and hotel rooms together make up less than 3% in both markets.

* This indicates that Airbnb in both cities is primarily used for full-property short-term rentals rather than shared accommodation.

### 🔹 Pricing Patterns and Outliers

* Median prices are much lower than mean prices in both cities:

  Berlin: median ≈ 104, mean ≈ 201

  Bangkok: median ≈ 1,379, mean ≈ 2,529

* Extremely high maximum prices (Berlin: 50,000, Bangkok: 1,000,000) and very large standard deviations show the presence of strong price outliers, likely luxury or incorrectly priced listings.

* This suggests that average price alone is misleading, and median price is a more reliable indicator of typical listing cost.
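
One common way to judge how extreme the maximum prices are is Tukey's 1.5 × IQR rule. The notebook itself does not apply this rule, so it is introduced here as an assumption. The sketch below computes the upper outlier fence from the quartiles printed in the Pricing Analysis output. `upper_fence` is an illustrative helper.

```python
# Quartiles and maxima copied from the describe() outputs in Pricing Analysis.
price_quartiles = {
    "Berlin": {"q1": 70.0, "q3": 160.0, "max": 50000.0},
    "Bangkok": {"q1": 923.0, "q3": 2207.0, "max": 1000000.0},
}


def upper_fence(q1, q3, k=1.5):
    """Upper outlier threshold under Tukey's rule: Q3 + k * (Q3 - Q1)."""
    return q3 + k * (q3 - q1)


fences = {
    city: upper_fence(q["q1"], q["q3"])
    for city, q in price_quartiles.items()
}

for city, fence in fences.items():
    max_price = price_quartiles[city]["max"]
    print(f"{city}: upper fence = {fence:.0f}, max price = {max_price:.0f}")
# Berlin: upper fence = 295, max price = 50000
# Bangkok: upper fence = 4133, max price = 1000000
```

Under this rule, both maxima lie far above the fence. On the loaded data, `(berlin_listings["price"] > fences["Berlin"]).mean()` would give the share of listings flagged as outliers.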

### 🔹 Review Activity (Demand Indicator)

* Berlin listings have significantly higher engagement:

  Median reviews: 14 (Berlin) vs 4 (Bangkok)

  Mean reviews: 60 (Berlin) vs 23 (Bangkok)

* In Bangkok, at least 25% of listings have zero reviews (the 25th percentile is 0), indicating many new, inactive, or low-demand listings.

* In both cities, a small number of listings have extremely high review counts (maximums of 2,895 in Berlin and 2,926 in Bangkok), showing that demand is highly concentrated in a few popular properties.
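
The `describe()` output only bounds the zero-review share through its percentiles. The exact share, and how much of the review volume the top listings hold, can be computed directly. The sketch below is a minimal example on a hypothetical toy DataFrame. The values are made up for illustration and are not taken from either dataset, and `review_concentration` is an illustrative helper.

```python
import pandas as pd


def review_concentration(df, col="number_of_reviews", top_n=1):
    """Return (% of listings with zero reviews, % of all reviews held by the top_n listings)."""
    reviews = df[col]
    zero_share = float((reviews == 0).mean() * 100)
    top_share = float(reviews.nlargest(top_n).sum() / reviews.sum() * 100)
    return round(zero_share, 1), round(top_share, 1)


# Hypothetical toy data, for illustration only.
toy_listings = pd.DataFrame({"number_of_reviews": [0, 0, 3, 12, 0, 250, 7, 1]})

zero_share, top_share = review_concentration(toy_listings)
print(zero_share, top_share)  # 37.5 91.6
```

Applied to the real data, the call would look like `review_concentration(bangkok_listings, top_n=100)`, where the choice of `top_n` is a judgment call.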

### 🔹 Availability and Market Saturation

* Average availability is high in both cities, especially in Bangkok:

  Berlin: 216 days available per year

  Bangkok: 278 days available per year

* Median availability in Bangkok is 318 days, meaning half of its listings are available for at least 318 days of the year, which suggests they are not frequently booked.

* This indicates strong oversupply, especially in Bangkok, where demand is concentrated in very few high-performing listings.

### 🔹 Combined Demand vs Supply Insight

- Berlin shows a better balance between demand and supply, with higher review counts and lower availability (mean of 216 vs 278 days per year).

- Bangkok shows higher supply with lower demand, reflected by low reviews and very high availability.

- This suggests that in Bangkok, only specific neighbourhoods and property types attract consistent bookings, while many listings struggle to receive guests.