# Exploratory Data Analysis (EDA)




## Objective

* This analysis explores patterns, trends, and differences in Airbnb listings across Berlin and Bangkok.

* It covers neighbourhood distribution, property and room types, pricing behavior, and review-related metrics, so that the findings can inform the dashboard.


## Datasets Used
- Berlin (cleaned): listings, reviews, and neighbourhoods
- Bangkok (cleaned): listings, reviews, and neighbourhoods



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd

BASE_PATH = "/content/drive/MyDrive/AlmaBetter/Module_4/data"

# Load cleaned datasets
berlin_listings = pd.read_csv(f"{BASE_PATH}/berlin/listings_clean.csv")
berlin_reviews = pd.read_csv(f"{BASE_PATH}/berlin/reviews_clean.csv")
berlin_neighbourhoods = pd.read_csv(f"{BASE_PATH}/berlin/neighbourhoods_clean.csv")

bangkok_listings = pd.read_csv(f"{BASE_PATH}/bangkok/listings_clean.csv")
bangkok_reviews = pd.read_csv(f"{BASE_PATH}/bangkok/reviews_clean.csv")
bangkok_neighbourhoods = pd.read_csv(f"{BASE_PATH}/bangkok/neighbourhoods_clean.csv")


## Overview Metrics



In [4]:
# Basic overview metrics

overview = {
    "Berlin": {
        "total_listings": berlin_listings.shape[0],
        "unique_neighbourhoods": berlin_listings["neighbourhood"].nunique(),
        "unique_hosts": berlin_listings["host_id"].nunique()
    },
    "Bangkok": {
        "total_listings": bangkok_listings.shape[0],
        "unique_neighbourhoods": bangkok_listings["neighbourhood"].nunique(),
        "unique_hosts": bangkok_listings["host_id"].nunique()
    }
}

overview


{'Berlin': {'total_listings': 9264,
  'unique_neighbourhoods': 138,
  'unique_hosts': 5255},
 'Bangkok': {'total_listings': 23273,
  'unique_neighbourhoods': 50,
  'unique_hosts': 6669}}
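
The overview figures above also imply how concentrated hosting is in each city. The following is a minimal sketch of that arithmetic; the numbers are copied from the output above, and `overview_figures` and `listings_per_host` are hypothetical names introduced only for this illustration.

```python
# Figures copied from the overview output above.
overview_figures = {
    "Berlin": {"total_listings": 9264, "unique_hosts": 5255},
    "Bangkok": {"total_listings": 23273, "unique_hosts": 6669},
}


def listings_per_host(metrics):
    """Average number of listings per unique host (hypothetical helper)."""
    return metrics["total_listings"] / metrics["unique_hosts"]


host_ratios = {
    city: round(listings_per_host(metrics), 2)
    for city, metrics in overview_figures.items()
}
print(host_ratios)  # {'Berlin': 1.76, 'Bangkok': 3.49}
```

On average, a Bangkok host has roughly twice as many listings as a Berlin host. An average hides the distribution, so a per-host count such as `bangkok_listings["host_id"].value_counts()` would be the natural next step.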

## Neighbourhood Analysis



In [5]:
# Neighbourhood-wise listing counts (top 10 only)

berlin_neighbourhood_counts = (
    berlin_listings["neighbourhood"]
    .value_counts()
    .head(10)
)

bangkok_neighbourhood_counts = (
    bangkok_listings["neighbourhood"]
    .value_counts()
    .head(10)
)

berlin_neighbourhood_counts, bangkok_neighbourhood_counts


(neighbourhood
 alexanderplatz              725
 frankfurter allee süd fk    477
 tempelhofer vorstadt        420
 brunnenstr. süd             402
 prenzlauer berg südwest     290
 südliche luisenstadt        242
 schöneberg-nord             212
 prenzlauer berg süd         197
 prenzlauer berg nordwest    193
 karl-marx-allee-süd         183
 Name: count, dtype: int64,
 neighbourhood
 vadhana         3709
 khlong toei     3119
 huai khwang     3033
 ratchathewi     1243
 sathon          1092
 phra khanong    1001
 phra nakhon      924
 bang rak         756
 suanluang        728
 chatu chak       681
 Name: count, dtype: int64)
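
To put the top-10 counts in context, it can help to express them as a share of all listings in each city. The sketch below does this with the counts and totals printed above; `top_share` is a hypothetical helper, not part of the original notebook.

```python
# Top-10 neighbourhood counts and city totals, copied from the outputs above.
berlin_top10 = [725, 477, 420, 402, 290, 242, 212, 197, 193, 183]
bangkok_top10 = [3709, 3119, 3033, 1243, 1092, 1001, 924, 756, 728, 681]
total_listings = {"Berlin": 9264, "Bangkok": 23273}


def top_share(top_counts, total):
    """Percentage of all listings that fall in the given neighbourhoods."""
    return round(sum(top_counts) / total * 100, 2)


top10_share = {
    "Berlin": top_share(berlin_top10, total_listings["Berlin"]),
    "Bangkok": top_share(bangkok_top10, total_listings["Bangkok"]),
}
print(top10_share)  # {'Berlin': 36.06, 'Bangkok': 69.98}
```

On the DataFrames themselves, `berlin_listings["neighbourhood"].value_counts(normalize=True).head(10).sum()` should give the same proportion. Part of the gap likely reflects granularity: Bangkok is split into 50 neighbourhoods and Berlin into 138, so ten neighbourhoods cover a much larger fraction of Bangkok's areas.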

## Property and Room Type Analysis



In [19]:
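# Room type distribution as counts and percentages
def room_type_distribution(df, col="room_type"):
    """Return (counts, percentages) for the categories in `col`."""
    counts = df[col].value_counts()
    # Share of each category in percent, rounded to two decimals
    percentages = (counts / counts.sum() * 100).round(2)
    return counts, percentages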


city_datasets = {
    "Berlin": berlin_listings,
    "Bangkok": bangkok_listings
}

room_type_results = {}

for city, df in city_datasets.items():
    counts, pct = room_type_distribution(df)

    room_type_results[city] = {
        "counts": counts,
        "percentage": pct
    }

    print(f"\n{city} Room Type Distribution:")
    print(counts)
    print(pct)





Berlin Room Type Distribution:
room_type
entire home/apt    6806
private room       2260
hotel room          101
shared room          97
Name: count, dtype: int64
room_type
entire home/apt    73.47
private room       24.40
hotel room          1.09
shared room         1.05
Name: count, dtype: float64

Bangkok Room Type Distribution:
room_type
entire home/apt    16498
private room        6263
shared room          314
hotel room           198
Name: count, dtype: int64
room_type
entire home/apt    70.89
private room       26.91
shared room         1.35
hotel room          0.85
Name: count, dtype: float64


## Pricing Analysis



In [30]:
# Calculate pricing statistics
def pricing_stats(df, col="price"):
    """Return descriptive statistics (count, mean, std, quartiles) for `col`."""
    return df[col].describe()


pricing_stats_results = {}

for city, df in city_datasets.items():
    stats = pricing_stats(df)
    pricing_stats_results[city] = stats
    print(f"\n{city} Pricing Statistics:")
    print(stats)



Berlin Pricing Statistics:
count     9264.000000
mean       201.240393
std       1656.989769
min          5.000000
25%         70.000000
50%        104.000000
75%        160.000000
max      50000.000000
Name: price, dtype: float64

Bangkok Pricing Statistics:
count      23273.000000
mean        2528.749151
std        16473.896035
min            4.000000
25%          923.000000
50%         1379.000000
75%         2207.000000
max      1000000.000000
Name: price, dtype: float64
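
In both cities the mean sits far above the median (201.24 vs 104 in Berlin, 2528.75 vs 1379 in Bangkok) and the maxima are extreme, which points to a heavily right-skewed distribution with outliers. One common way to quantify this is the 1.5 × IQR rule. The sketch below is one possible approach, not the notebook's method; `iqr_price_summary` and the toy prices are hypothetical and serve only as illustration.

```python
import pandas as pd


def iqr_price_summary(prices: pd.Series, k: float = 1.5) -> dict:
    """Summarise prices with and without values outside the k * IQR fences."""
    prices = prices.dropna()
    q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the prices that lie within the fences
    inside = prices[(prices >= lower) & (prices <= upper)]
    return {
        "lower_fence": float(lower),
        "upper_fence": float(upper),
        "n_outliers": int(len(prices) - len(inside)),
        "median_all": float(prices.median()),
        "mean_all": float(prices.mean()),
        "mean_within_fences": float(inside.mean()),
    }


# Toy data (made up): seven ordinary prices and one extreme value.
toy_prices = pd.Series([50, 70, 90, 100, 110, 130, 160, 50000], name="price")
summary = iqr_price_summary(toy_prices)
print(summary)
```

With the real data, the same helper could be called inside the existing loop, for example `iqr_price_summary(df["price"])` for each entry in `city_datasets`. The median is barely affected by the extreme value, whereas the mean changes substantially once values outside the fences are excluded.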


## Reviews and Availability Analysis
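
This section has no analysis yet. As a starting point, the sketch below follows the same per-city pattern used for room types and pricing. It assumes the cleaned listings contain `number_of_reviews` and `availability_365` columns; those names are assumptions that the notebook does not confirm, and the toy DataFrames are made up for illustration.

```python
import pandas as pd


def review_availability_summary(
    df: pd.DataFrame,
    reviews_col: str = "number_of_reviews",    # assumed column name
    availability_col: str = "availability_365",  # assumed column name
) -> dict:
    """Basic review and availability metrics for one city's listings."""
    return {
        "median_reviews": float(df[reviews_col].median()),
        "pct_without_reviews": round(float((df[reviews_col] == 0).mean()) * 100, 2),
        "median_available_days": float(df[availability_col].median()),
    }


# Toy stand-ins for berlin_listings and bangkok_listings (values are made up).
toy_city_datasets = {
    "Berlin": pd.DataFrame(
        {"number_of_reviews": [0, 5, 20, 100], "availability_365": [0, 90, 180, 365]}
    ),
    "Bangkok": pd.DataFrame(
        {"number_of_reviews": [0, 0, 3, 40], "availability_365": [30, 200, 300, 365]}
    ),
}

review_results = {
    city: review_availability_summary(df) for city, df in toy_city_datasets.items()
}

for city, metrics in review_results.items():
    print(f"\n{city} Reviews and Availability:")
    print(metrics)
```

If the columns exist under these names, passing `city_datasets` instead of `toy_city_datasets` would produce the real figures.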



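## Key Observations

* Bangkok has about 2.5 times as many listings as Berlin (23,273 vs 9,264) but is divided into far fewer neighbourhoods (50 vs 138).

* In both cities, entire homes/apartments dominate (73.47% in Berlin, 70.89% in Bangkok), followed by private rooms (24.40% and 26.91%). Hotel and shared rooms together account for only about 2% of listings in each city.

* Price distributions are strongly right-skewed in both cities: the mean is well above the median (201.24 vs 104 in Berlin, 2528.75 vs 1379 in Bangkok), and the maxima (50,000 and 1,000,000) suggest outliers that should be handled before averages are shown on a dashboard.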