# Pyclense Demo: Cleaning the Beach Dataset

This notebook demonstrates how to use the `pyclense` library to build a sequential data cleaning pipeline for the `dataset.csv` file.

In [1]:
import pandas as pd
import numpy as np
import os
import sys

# Add project root to path to import pyclense
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from pyclense.base import BaseCleaner
from pyclense.standardizer import FormatStandardizer
from pyclense.missing import MissingDataCleaner
from demo_cleaner import DemoCleaner

## 2. Loading the Messy Dataset

Let's load the `dataset.csv` file from the `data` folder. It contains several data quality issues that we will address with our pipeline.

In [2]:
# Load the dataset from the data folder
data_path = os.path.join('..', 'data', 'dataset.csv')
df_messy = pd.read_csv(data_path)

print('--- Original Messy Data ---')
print(f"Loaded {len(df_messy)} rows from {data_path}")
df_messy.head()

--- Original Messy Data ---
Loaded 150 rows from ..\data\dataset.csv


Unnamed: 0,Country,Beach / Location Name,Accommodation Name,Rating (Out of 5),Fee (USD/Night),Review Date,Written Review
0,Turks and Caicos,Grace Bay Beach,Amanyara,4.9,"$1,800",20-Nov-25,Impossibly white sand and water so blue it loo...
1,Portugal,Praia da Falésia (Algarve),"Pine Cliffs Hotel, a Luxury Collection",4.7,$410,10/15/2025,The dramatic golden cliffs meet the azure Atla...
2,Mexico,Tulum Beach,Be Tulum Hotel,4.5,$650,1-Dec-25,"Chic, bohemian luxury right on the white sands..."
3,Philippines,El Nido (Palawan),Amanpulo,4.8,"$2,000",11/12/2025,
4,South Africa,Boulders Beach (Cape Town),Tintswalo at Boulders Beach,4.8,$580,5-Oct-25,Sharing the sand with African Penguins makes t...


## 3. Building and Running the Pipeline

We will now apply our cleaners sequentially. Each cleaner takes a DataFrame and returns a cleaned one, making the process clear and easy to follow.

In [3]:
print('--- Starting Cleaning Process ---')

# Start with a copy of the messy data
df_clean = df_messy.copy()

# 1. Use the custom DemoCleaner to perform initial cleaning on the dataset.
# This handles column names, fees, duplicates, and fills missing reviews.
df_clean = DemoCleaner(df_clean).clean()

# 2. Use the generic FormatStandardizer to fix date formats and clean text fields.
# The 'written_review' column contains emojis and special characters to be removed.
df_clean = FormatStandardizer(df_clean, date_cols=['review_date'], text_cols=['written_review', 'beach_location_name']).clean()

# 3. Use the MissingDataCleaner to drop rows where critical data is still missing.
# We'll drop rows that don't have an accommodation name.
df_clean = MissingDataCleaner(df_clean, subset=['accommodation_name']).clean()

print('\n--- Final Cleaned Data ---')
if df_clean is not None:
	df_clean.head()
else:
	print("No data to display: cleaning pipeline returned None.")

--- Starting Cleaning Process ---

--- Running DemoCleaner ---
Cleaned column names: ['country', 'beach_location_name', 'accommodation_name', 'rating_out_of_5_', 'fee_usd_night_', 'review_date', 'written_review']
Removed 11 duplicate rows.
Filled missing 'written_review' values.
--- DemoCleaner Finished ---
[Standardizer] Processed 1 date column(s) and relevant text columns.
[MissingDataCleaner] Dropped 0 rows with missing values in ['accommodation_name'].

--- Final Cleaned Data ---


## 4. Saving the Result

Finally, we can save the cleaned data using the helper method from `BaseCleaner`. We can instantiate any concrete cleaner with the final DataFrame to access the method.

In [4]:
# Use a concrete cleaner class for saving
final_cleaner = DemoCleaner(df_clean)
final_cleaner.save_data('../data/cleaned_data.csv')

Saved to ../data/cleaned_data.csv
