# Data Cleaning




## Objective

The objective of this stage is to improve data quality and consistency across the Airbnb datasets for Berlin and Bangkok. This includes identifying and addressing missing values, duplicates, formatting inconsistencies, and invalid records to ensure the data is reliable and suitable for exploratory analysis and dashboard development.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



## Data Quality Issues Identified

* Missing values are present in both numerical fields (such as price and reviews per month) and categorical fields (such as host name and license). These can affect statistical calculations and category-based analysis, so each affected column must be handled carefully during data cleaning.

* Price-related fields require validation to ensure values are numeric and within realistic ranges. Extremely high, zero, or incorrectly formatted prices can distort average price calculations and lead to misleading comparisons between cities.

* There may be duplicate listings based on listing identifiers, which can inflate property counts and review statistics. Verifying and removing duplicates is necessary to ensure each property is represented only once.

* Text-based fields like neighbourhood and room type may contain inconsistent formatting (such as extra spaces or case differences), which can split the same category into multiple groups and affect grouping and visualization accuracy.

* Review-related fields contain missing or sparse values, often indicating new or inactive listings rather than data errors. These should be interpreted carefully to avoid removing meaningful information about market activity.
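
The issues listed above can be quantified before any cleaning decision is made. The following is a minimal sketch, not the notebook's actual profiling step: `profile_quality` is a hypothetical helper, the toy table and its values are made up, and the column names `id`, `price`, and `reviews_per_month` are assumed to match the listings files.

```python
import pandas as pd

def profile_quality(df: pd.DataFrame, id_col: str = "id", price_col: str = "price") -> dict:
    """Summarise missing values, duplicate ids, and non-positive prices for one table."""
    # Coerce to numeric so that badly formatted prices become NaN instead of raising
    price = pd.to_numeric(df[price_col], errors="coerce")
    return {
        "rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "non_positive_prices": int((price <= 0).sum()),
    }

# Toy data standing in for a listings table (hypothetical values)
toy_listings = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "price": [50.0, None, 80.0, 0.0],
    "reviews_per_month": [1.2, None, None, 0.4],
})

report = profile_quality(toy_listings)
print(report)
```

Running the same helper on the Berlin and Bangkok listings would give two directly comparable summaries.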


In [4]:
import pandas as pd
BASE_PATH = "/content/drive/MyDrive/AlmaBetter/Module_4/data"

berlin_listings = pd.read_csv(f"{BASE_PATH}/berlin/listings.csv")
berlin_reviews = pd.read_csv(f"{BASE_PATH}/berlin/reviews.csv")
berlin_neighbourhoods = pd.read_csv(f"{BASE_PATH}/berlin/neighbourhoods.csv")

bangkok_listings = pd.read_csv(f"{BASE_PATH}/bangkok/listings.csv")
bangkok_reviews = pd.read_csv(f"{BASE_PATH}/bangkok/reviews.csv")
bangkok_neighbourhoods = pd.read_csv(f"{BASE_PATH}/bangkok/neighbourhoods.csv")

# Create working copies to preserve raw data
berlin_listings_clean = berlin_listings.copy()
berlin_reviews_clean = berlin_reviews.copy()
berlin_neighbourhoods_clean = berlin_neighbourhoods.copy()

bangkok_listings_clean = bangkok_listings.copy()
bangkok_reviews_clean = bangkok_reviews.copy()
bangkok_neighbourhoods_clean = bangkok_neighbourhoods.copy()


## Cleaning Strategy

* Apply the same cleaning methods to both Berlin and Bangkok datasets to ensure fair and reliable comparison.

* Keep the original raw data unchanged and perform all cleaning on working copies to preserve data integrity.

* Treat missing values based on their business importance and impact on analysis rather than applying a single rule to all fields.

* Identify and remove duplicate listings using unique listing identifiers to avoid inflated counts and biased statistics.

* Standardize categorical text fields by correcting case and spacing to ensure consistent grouping and accurate visualizations.

* Validate numerical columns to confirm correct formats and realistic value ranges before performing statistical analysis.
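
As an illustration of applying one validation method to both cities, the sketch below flags suspect prices rather than deleting them. It rests on assumptions: `flag_price_issues` is a hypothetical helper, the upper bounds (5000 and 200000) are placeholder thresholds chosen only for the example, and the toy rows are invented. Because the two cities report prices in different currencies, the upper bound is passed per city while the logic stays identical.

```python
import pandas as pd

def flag_price_issues(df: pd.DataFrame, max_price: float, price_col: str = "price") -> pd.DataFrame:
    """Return a copy with a numeric price column and a boolean `price_suspect` flag."""
    out = df.copy()
    # Non-numeric entries become NaN and are flagged below
    out[price_col] = pd.to_numeric(out[price_col], errors="coerce")
    out["price_suspect"] = (
        out[price_col].isna()
        | (out[price_col] <= 0)
        | (out[price_col] > max_price)
    )
    return out

# Toy tables and placeholder upper bounds (both hypothetical)
cities = {
    "berlin": (pd.DataFrame({"id": [1, 2, 3], "price": [80, 0, 25000]}), 5000),
    "bangkok": (pd.DataFrame({"id": [4, 5, 6], "price": ["1200", "abc", 900]}), 200000),
}

# Same function for both cities; only the threshold differs
flagged = {name: flag_price_issues(df, cap) for name, (df, cap) in cities.items()}

for name, df in flagged.items():
    print(name, int(df["price_suspect"].sum()), "suspect price(s)")
```

Flagging keeps the decision about removal separate from detection, so the suspect rows can be inspected before anything is dropped.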


## Handling Missing Values

* Missing values will be handled based on the importance of each column for analysis and visualization.

* Critical fields such as price, room type, and neighbourhood will be reviewed carefully, as they directly impact insights and dashboards.

* For less critical fields (for example, review-related or optional host details), missing values may be retained or handled selectively.

* At this stage, only rows with a missing price are dropped, because price is critical for analysis; decisions for the remaining fields will be applied in subsequent cleaning steps.
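
To decide column by column, it may help to look at the share of missing values and at the rows that lack a critical field. This is a sketch under assumptions: the `CRITICAL` list and the column names `room_type`, `neighbourhood`, and `reviews_per_month` are taken from the prose above rather than confirmed against the files, and the toy rows are invented. Nothing is modified here; the code only inspects.

```python
import pandas as pd

# Assumed names of the critical columns (hypothetical, adjust to the real schema)
CRITICAL = ["price", "room_type", "neighbourhood"]

toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [50.0, None, 70.0, 90.0],
    "room_type": ["Entire home/apt", "Private room", None, "Private room"],
    "neighbourhood": ["Mitte", "Mitte", "Pankow", "Pankow"],
    "reviews_per_month": [None, 0.5, None, 1.0],
})

# Fraction of missing values per column, largest first
missing_share = toy_listings.isna().mean().sort_values(ascending=False)
print(missing_share)

# Rows missing at least one critical field, kept aside for review
critical_gaps = toy_listings[toy_listings[CRITICAL].isna().any(axis=1)]
print(critical_gaps)
```

A missing `reviews_per_month` is left alone in this sketch, since it often indicates a listing without reviews rather than an error.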



In [5]:
# Drop records where price is missing, as price is critical for analysis
berlin_listings_clean = berlin_listings_clean.dropna(subset=["price"])
bangkok_listings_clean = bangkok_listings_clean.dropna(subset=["price"])


## Handling Duplicates

Duplicate records are checked using the unique listing identifier (`id`).
If duplicate entries are found, only one record is retained (by default, `drop_duplicates` keeps the first occurrence) to avoid double counting in analysis.
The focus is on ensuring that each listing is represented only once in the cleaned dataset.
Duplicates are removed from the working copies in the cell below.
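
Before dropping anything, it can be useful to see which rows share an identifier and how many rows a removal would affect. The sketch below uses an invented toy table; only the `id` column name comes from the notebook.

```python
import pandas as pd

toy_listings = pd.DataFrame({
    "id": [10, 11, 11, 12],
    "name": ["A", "B", "B (copy)", "C"],
})

# keep=False marks every row that shares an id, so all copies are visible
dupes = toy_listings[toy_listings["id"].duplicated(keep=False)]
print(dupes)

# keep="first" retains the first occurrence of each id
deduped = toy_listings.drop_duplicates(subset=["id"], keep="first")
print(f"{len(toy_listings) - len(deduped)} duplicate row(s) removed")
```

Inspecting `dupes` first shows whether the copies are identical or differ in other columns, which matters when choosing which record to keep.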



In [6]:
# Remove duplicate listings based on listing id
berlin_listings_clean = berlin_listings_clean.drop_duplicates(subset=["id"])
bangkok_listings_clean = bangkok_listings_clean.drop_duplicates(subset=["id"])



## Standardizing Formats

* Text-based fields such as neighbourhood names and room types will be standardized to ensure consistency across records.

* This includes fixing letter casing, trimming extra spaces, and resolving minor naming variations.

* Price and date fields will be reviewed to ensure they follow a consistent format suitable for analysis and visualization.

* At this stage, this section only documents the standardization approach; no transformations are applied yet.
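
One possible shape for these transformations, once they are applied, is sketched below. It is an assumption-laden example: `standardize_text` is a hypothetical helper, title case is only one possible target convention, the column name `last_review` is assumed, and the sample values are made up.

```python
import pandas as pd

def standardize_text(s: pd.Series) -> pd.Series:
    """Trim, collapse repeated whitespace, and title-case a text column."""
    return (
        s.astype("string")
        .str.strip()
        .str.replace(r"\s+", " ", regex=True)
        .str.title()
    )

toy = pd.DataFrame({
    "neighbourhood": [" mitte", "Mitte ", "MITTE", "Prenzlauer  Berg", None],
    "last_review": ["2024-01-05", None, "not a date", "2023-11-30", "2024-02-01"],
})

toy["neighbourhood"] = standardize_text(toy["neighbourhood"])
print(toy["neighbourhood"].value_counts(dropna=False))

# Unparseable dates become NaT instead of raising an error
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")
print(toy["last_review"])
```

Missing values stay missing throughout, so standardization does not hide gaps that the missing-value step should handle.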




## Output Datasets

* The output of the data cleaning process will be cleaned and standardized datasets for both Berlin and Bangkok.

* These datasets will be used as inputs for exploratory data analysis and dashboard creation.

* Raw datasets will remain unchanged to preserve original data integrity.

* The cleaned datasets will be generated and validated in the execution phase of data cleaning.
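
A lightweight way to validate a cleaned table is to run a handful of boolean checks on it. The sketch below assumes a hypothetical `validate_clean` helper and invented toy rows; the two checks mirror the cleaning steps performed earlier on `id` and `price`.

```python
import pandas as pd

def validate_clean(df: pd.DataFrame) -> dict:
    """Return named pass/fail checks for a cleaned listings table."""
    return {
        "unique_ids": bool(df["id"].is_unique),
        "no_missing_price": bool(df["price"].notna().all()),
    }

# Hypothetical cleaned and uncleaned tables for comparison
clean_toy = pd.DataFrame({"id": [1, 2, 3], "price": [50.0, 80.0, 65.0]})
dirty_toy = pd.DataFrame({"id": [1, 1, 3], "price": [50.0, None, 65.0]})

clean_checks = validate_clean(clean_toy)
dirty_checks = validate_clean(dirty_toy)
print(clean_checks)
print(dirty_checks)
```

Running the same checks on both the Berlin and Bangkok outputs would confirm that the two cleaned datasets meet the same conditions.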
