## 🧹 Cleaning the Airbnb Dataset
This module standardises the dataset and prepares it for analysis.  
The goal is to produce a clean, reliable DataFrame ready for deeper visualisation and modelling.

### What this script handles
- Removes duplicate rows.  
- Normalises column names (`lowercase_snake_case`).  
- Cleans and standardises the `price` column (strips currency symbols and commas, converts to numeric, fills missing values with the median, removes zero or negative prices).  
- Fixes and normalises key categorical fields such as `room_type` and `neighbourhood_group`, including correcting common misspellings.  
- Drops rows missing essential fields required for analysis.  
- Adds an `is_expensive` flag (top 5% of prices) to support later insights.  
- Generates **three quick diagnostic charts**:
  - Room type distribution  
  - Price distribution  
  - Neighbourhood group distribution  
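
As a rough illustration of the naming, price, and flag steps listed above, here is a minimal sketch on a made-up four-row DataFrame. The sample values and the names `raw`, `price`, and `cleaned` are assumptions for demonstration only; they are not part of the script.

```python
import pandas as pd

# Toy data (hypothetical values) with the kinds of problems the script targets.
raw = pd.DataFrame({
    "Room Type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "price": ["$1,200", "£85", None, "$0"],
})

# Normalise column names to lowercase_snake_case.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)

# Strip currency symbols and thousands separators, then convert to numbers.
# Anything that still cannot be parsed becomes NaN.
price = pd.to_numeric(
    raw["price"].astype(str).str.replace(r"[£$,]", "", regex=True),
    errors="coerce",
)

# Fill gaps with the median, then keep only strictly positive prices.
price = price.fillna(price.median())
cleaned = raw.assign(price=price)
cleaned = cleaned[cleaned["price"] > 0].copy()

# Flag listings priced above the 95th percentile of the remaining rows.
cleaned["is_expensive"] = cleaned["price"] > cleaned["price"].quantile(0.95)

print(cleaned)
```

With these values the missing price is filled with the median (85), the zero-priced row is dropped, and only the 1200 listing ends up flagged as expensive.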

### Why this matters
Data cleaning isn't just about fixing errors; it ensures that your visualisations and findings reflect the real structure of the dataset rather than noise or inconsistent formatting.
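
To make the point concrete, here is a small sketch of how inconsistent formatting can distort a simple count. The sample values and the names `groups`, `fixes`, and `normalised` are hypothetical; only the two misspellings are taken from the script.

```python
import pandas as pd

# Hypothetical raw values: two boroughs written six different ways.
groups = pd.Series(
    ["Manhattan", " manhattan", "manhatan", "MANHATTAN ", "Brooklyn", "brookln"]
)

# Without cleaning, every spelling is counted as its own category.
print(groups.nunique())

# Normalise whitespace and case, then map known misspellings to one name.
fixes = {"manhatan": "manhattan", "brookln": "brooklyn"}
normalised = groups.str.strip().str.lower().replace(fixes)

# After cleaning, the counts collapse to the two real categories.
print(normalised.value_counts())
```

A bar chart built from the raw series would show six bars where only two boroughs exist, which is the kind of noise the cleaning step is meant to remove.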

### How to use
Import the functions and run them on your loaded DataFrame:

```python
from cleaning_data import clean_data, plot_basic_charts

clean_df = clean_data(df)
plot_basic_charts(clean_df)
```

The full script:

```python
import pandas as pd
import matplotlib.pyplot as plt
import os

# Path to the CSV (relative is safe for GitHub)
CSV_PATH = os.path.join("data", "Airbnb_Open_Data.csv")

def load_data(path):
    """Grab the CSV and get it into a DataFrame. Quick sanity check included."""
    df = pd.read_csv(path)
    print(f"Loaded: {path} | Rows: {len(df)}, Columns: {len(df.columns)}")
    return df

# ============================
# 2️⃣ Clean Data (Improved)
# ============================

def clean_data(df):
    """Drop duplicates, fix numeric and categorical columns, handle missing data cleanly."""

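    # 1️⃣ Drop duplicates (returns a new DataFrame, so the caller's frame is untouched)
    df = df.drop_duplicates()

    # 2️⃣ Standardise column names (literal space replacement, not a regex)
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)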

    # 3️⃣ Clean numeric columns
    if 'price' in df.columns:
        # Remove currency symbols/commas and convert to numeric
        df['price'] = df['price'].astype(str).str.replace('[£$,]', '', regex=True)
        df['price'] = pd.to_numeric(df['price'], errors='coerce')

        # Fill missing prices with the median. Assign the result back:
        # in-place fillna on a selected column is unreliable in recent pandas.
        median_price = df['price'].median()
        df['price'] = df['price'].fillna(median_price)

        # Remove zero or negative prices; copy so later assignments
        # do not trigger SettingWithCopyWarning
        df = df[df['price'] > 0].copy()

    # 4️⃣ Clean categorical columns
    for col in ['room_type', 'neighbourhood_group', 'name']:
        if col in df.columns:
            # Label missing values first, then normalise whitespace and case
            df[col] = df[col].fillna('missing').astype(str).str.strip().str.lower()
            # Map leftover placeholders and common misspellings (exact matches only)
            df[col] = df[col].replace(
                {
                    'nan': 'missing',
                    '': 'missing',
                    'manhatan': 'manhattan',
                    'brookln': 'brooklyn'
                }
            )

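    # 5️⃣ Drop rows missing critical info for visualisation
    # (missing categorical values were relabelled 'missing' in step 4,
    # so a notna() check alone would never remove them)
    df = df[df['price'].notna() & (df['neighbourhood_group'] != 'missing')].copy()

    # 6️⃣ Optional flags for portfolio analysis (price above the 95th percentile)
    df['is_expensive'] = df['price'] > df['price'].quantile(0.95)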

    print("Cleaning done. Data is now consistent and ready for analysis.")
    print(f"Rows remaining: {len(df)} | Columns: {len(df.columns)}")
    return df


def plot_basic_charts(df):
    """
    Produce three quick exploratory charts:
    1. Room type count
    2. Price distribution
    3. Neighbourhood group count
    """

    plt.figure(figsize=(12, 4))

    # Chart 1: Room type counts
    plt.subplot(1, 3, 1)
    df['room_type'].value_counts().plot(
        kind='bar',
        title='Room Type Count'
    )

    # Chart 2: Price distribution
    plt.subplot(1, 3, 2)
    df['price'].plot(
        kind='hist',
        bins=20,
        title='Room Price Distribution'
    )

    # Chart 3: Neighbourhood counts
    plt.subplot(1, 3, 3)
    df['neighbourhood_group'].value_counts().plot(
        kind='bar',
        title='Neighbourhood Count'
    )

    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    # 1️⃣ Load it
    df = load_data(CSV_PATH)

    # 2️⃣ Clean it
    clean_df = clean_data(df)

    # 3️⃣ Make some quick charts
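    plot_basic_charts(clean_df)
```

The script expects `Airbnb_Open_Data.csv` in a `data` folder relative to the working directory. If the file is not there, `load_data` fails with an error such as `FileNotFoundError: [Errno 2] No such file or directory: 'data\\Airbnb_Open_Data.csv'`.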