### Step 1: Setup and Data Loading
In this section, we imported the necessary libraries (Pandas, NLTK) and downloaded the required NLTK resources (like stopwords and wordnet).

Then, we loaded the raw Airbnb data for **San Diego** and **San Francisco**. We combined these two datasets into a single dataframe to process them together.
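
As a minimal sketch of the combining step, the snippet below concatenates two toy frames that stand in for the city files. The `source_city` column and all values are assumptions added for illustration; the notebook itself does not tag rows by city.

```python
import pandas as pd

# Toy stand-ins for the two city files (values are made up for illustration).
sd = pd.DataFrame({"id": [101, 102], "name": ["Beach Studio", "Gaslamp Loft"]})
sf = pd.DataFrame({"id": [201], "name": ["Mission Flat"]})

# Hypothetical helper column so each row stays traceable to its source file.
sd["source_city"] = "San Diego"
sf["source_city"] = "San Francisco"

# Stack the frames vertically and rebuild the index from 0.
combined = pd.concat([sd, sf], ignore_index=True)

print(combined.index.tolist())
print(combined["source_city"].value_counts().to_dict())
```

With `ignore_index=True` the combined frame gets a fresh index, so row labels from the two files cannot collide.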

In [None]:
# ==========================================
# 1. IMPORT LIBRARIES & SETUP
# ==========================================
import pandas as pd
import numpy as np
import nltk
import os

# Configure pandas to show full text content
pd.set_option('display.max_colwidth', None)

# ==========================================
# 2. DOWNLOAD NLTK RESOURCES
# ==========================================
print("Downloading NLTK resources...")
# nltk.download() returns False on failure instead of raising,
# so we check the return values rather than rely on try/except.
resources = [
    'punkt',      # Sentence/word splitter
    'stopwords',  # Noise words (the, is, at)
    'wordnet',    # Dictionary for lemmatization
    'omw-1.4',    # Open Multilingual Wordnet
]
failed = [name for name in resources if not nltk.download(name)]
if failed:
    print(f"Error downloading NLTK resources: {failed}")
else:
    print("NLTK resources downloaded successfully.")

# ==========================================
# 3. LOAD AND COMBINE RAW DATA
# ==========================================
# Define the paths for your two raw data files.
# PLEASE UPDATE THESE FILENAMES to match your actual files in data/raw/
base_path = "../../data/raw/"
FILE_1_NAME = "san diego.csv"      # e.g., listings_sd.csv
FILE_2_NAME = "san francisco.csv"   # e.g., listings_sf.csv

path1 = os.path.join(base_path, FILE_1_NAME)
path2 = os.path.join(base_path, FILE_2_NAME)

def load_and_combine_data(p1, p2):
    # Check if files exist
    if not os.path.exists(p1):
        print(f"File not found: {p1}")
        return None
    if not os.path.exists(p2):
        print(f"File not found: {p2}")
        return None

    print(f"Loading file 1: {p1}")
    df1 = pd.read_csv(p1)
    
    print(f"Loading file 2: {p2}")
    df2 = pd.read_csv(p2)
    
    # Combine (Concatenate) the dataframes vertically
    combined_df = pd.concat([df1, df2], ignore_index=True)
    
    print(f"Data combined successfully.")
    print(f"File 1 shape: {df1.shape}")
    print(f"File 2 shape: {df2.shape}")
    print(f"Total shape:  {combined_df.shape}")
    
    return combined_df

# Execute loading
df = load_and_combine_data(path1, path2)

# Check text columns
if df is not None:
    text_cols = ['name', 'description', 'neighborhood_overview', 'host_about']
    existing_cols = [c for c in text_cols if c in df.columns]
    print(f"Text columns found: {existing_cols}")
    display(df[existing_cols].head(3))

### Sanity Check: Duplicate IDs

In this step, we check whether there are **duplicate `id` values** in the dataset.

Because the data was concatenated from two files, duplicate IDs could cause problems in later steps such as merging or modeling.

- If **no duplicate IDs** are found, every `id` refers to exactly one row and no action is needed.
- If **duplicate IDs** exist, the repeated rows are dropped and only the first occurrence of each `id` is kept.

This check helps ensure that each listing in the dataset has a **unique identifier**.
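
The check can be tried on a toy frame first. This is a minimal sketch with made-up rows; the `keep=False` inspection step is an addition for illustration and is not part of the notebook's cell.

```python
import pandas as pd

# Toy listings with one repeated id (values are made up).
listings = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["A", "B", "B (copy)", "C"],
})

# duplicated() flags the second and later occurrences of each id.
n_dups = int(listings["id"].duplicated().sum())

# keep=False flags every row involved in a collision, useful for inspection.
conflicts = listings[listings["id"].duplicated(keep=False)]

# Drop repeats, keeping the first occurrence of each id.
deduped = listings.drop_duplicates(subset=["id"], keep="first").reset_index(drop=True)

print(n_dups, len(conflicts), deduped["id"].tolist())
```

Looking at `conflicts` before dropping shows whether the repeated rows are true copies or different listings that happen to share an `id`.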


In [None]:
# ==========================================
# CHECK: DUPLICATE IDS
# ==========================================
# Since we merged data from different files, we check for ID collisions.
duplicate_count = df['id'].duplicated().sum()

print(f"Checking for duplicate IDs...")
if duplicate_count == 0:
    print(f"Test Passed: No duplicate IDs found. Data integrity is safe.")
else:
    print(f"WARNING: Found {duplicate_count} duplicate IDs.")
    # If duplicates exist, we drop them to prevent merge issues later
    df = df.drop_duplicates(subset=['id'])
    print(f"Duplicates dropped. New shape: {df.shape}")

### Handling Missing Text Values (T1.7)
Before cleaning, we handle missing values in the text columns (`name`, `description`, `neighborhood_overview`, `host_about`).

Missing entries are replaced with an empty string (`""`) rather than a placeholder such as "No Description", so that no artificial words are introduced into the model. Each column is then cast to `str`, and the missing-value counts are printed before and after the fill to confirm that none remain.
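
The missing-value fill performed in the next cell can be tried on a toy frame first. This is a minimal sketch; the frame `sample` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (values are made up).
sample = pd.DataFrame({
    "name": ["Cozy Room", None],
    "host_about": [np.nan, "Local host"],
})

missing_before = sample.isnull().sum().to_dict()

# Fill missing text with an empty string, then make sure everything is a str.
for col in sample.columns:
    sample[col] = sample[col].fillna("").astype(str)

missing_after = sample.isnull().sum().to_dict()

# An empty string has length 0, so length-based features remain meaningful.
name_lengths = sample["name"].str.len().tolist()

print(missing_before, missing_after, name_lengths)
```

An empty string contributes no tokens and has length zero, which is why it is a safer fill value than a placeholder phrase.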

In [None]:
# ==========================================
# 4. T1.7: HANDLING MISSING TEXT VALUES
# ==========================================

# List of text columns to process
text_features = ['name', 'description', 'neighborhood_overview', 'host_about']

# Check for missing values before fixing
print("Missing values BEFORE fixing:")
print(df[text_features].isnull().sum())

# Fill NaN (empty) values with an empty string ""
# We do not use "No Description" to avoid adding artificial words to the model.
for col in text_features:
    df[col] = df[col].fillna("").astype(str)

print("-" * 30)

# Check for missing values after fixing (Should be 0)
print("Missing values AFTER fixing:")
print(df[text_features].isnull().sum())

# Verify that they are strings
print("\nSample check:")
display(df[text_features].head(3))

### Step 2: Text Preprocessing (T1.8)
In this step, we built a cleaning pipeline to prepare the text for analysis.
We defined a function `preprocess_text` that performs the following operations:

1.  **Lowercasing**: Converted all letters to lowercase to ensure consistency (e.g., "Home" becomes "home").
2.  **Noise Removal**: Removed HTML tags (like `<br>`), URLs, and special characters using Regex.
3.  **Tokenization**: Split the text into individual words using the NLTK library.
4.  **Stopword Removal**: Removed common words (like 'the', 'is', 'and') that add no specific meaning.
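5.  **Lemmatization**: Reduced words to their dictionary form with NLTK's `WordNetLemmatizer` to group similar concepts. Because the lemmatizer is called without a part-of-speech tag, every token is treated as a noun: plurals are reduced (e.g., "beaches" -> "beach"), while verb forms such as "running" are left unchanged.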

Finally, we applied this function to all text columns (`description`, `host_about`, etc.) and saved the clean versions with a `_clean` suffix.
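
The regex stages of the pipeline can be sketched without NLTK. In the snippet below, `demo_clean`, `DEMO_STOPWORDS`, and the sample text are hypothetical stand-ins: the tiny stopword set replaces NLTK's English list, `str.split` replaces `word_tokenize`, and lemmatization is left out.

```python
import re

# Tiny stand-in stopword set (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "is", "a", "and", "to", "from"}

def demo_clean(text):
    """Lowercase, strip HTML tags, URLs and non-letters, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)    # HTML tags such as <br />
    text = re.sub(r"http\S+", " ", text)  # URLs starting with http/https
    text = re.sub(r"[^a-z\s]", "", text)  # delete everything but letters and spaces
    return [tok for tok in text.split() if tok not in DEMO_STOPWORDS]

raw = "Sunny LOFT<br />2 blocks from the beach! Details: https://example.com/loft"
tokens = demo_clean(raw)
print(tokens)
print(demo_clean("ocean-view"))
```

Because the last substitution deletes characters rather than replacing them with a space, hyphenated or slash-separated words fuse into one token ("ocean-view" becomes "oceanview"). Substituting a space instead would keep them as separate tokens, if that is preferred.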

In [None]:
# ==========================================
# 5. T1.8: TEXT PREPROCESSING PIPELINE (UPDATED)
# ==========================================
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# --- PRE-CHECK: Download necessary NLTK resources ---
# We ensure 'punkt_tab' is available for the new NLTK version
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    print("Downloading missing resource: punkt_tab...")
    nltk.download('punkt_tab')

# Initialize Lemmatizer and Stopwords
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """
    Cleans and processes text data:
    1. Lowercase
    2. Remove HTML tags and URLs
    3. Remove non-alphabetic characters
    4. Tokenize (Split into words)
    5. Remove stopwords
    6. Lemmatize (Convert to root form)
    """
    if not isinstance(text, str):
        return ""
    
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove HTML tags (e.g., <br />) using regex
    text = re.sub(r'<.*?>', ' ', text)
    
    # 3. Remove URLs (http://...)
    text = re.sub(r'http\S+', ' ', text)
    
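    # 4. Remove special characters (keep only letters and spaces).
    # Characters are deleted rather than replaced with a space, so
    # "ocean-view" becomes "oceanview". Text is already lowercase here.
    text = re.sub(r'[^a-z\s]', '', text)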
    
    # 5. Tokenization (Split into words)
    # Uses the updated punkt_tab logic automatically
    tokens = word_tokenize(text)
    
    # 6. Remove Stopwords & Lemmatization
    # We keep words that are NOT in stop_words, and find their root (lemma)
    clean_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    
    # Join tokens back into a single string
    return " ".join(clean_tokens)

# Apply the function to our text columns
# We create NEW columns with '_clean' suffix to compare results later
print("Starting text preprocessing... This might take a minute.")

text_features = ['name', 'description', 'neighborhood_overview', 'host_about']

for col in text_features:
    new_col_name = col + "_clean"
    print(f"Processing column: {col} -> {new_col_name}")
    df[new_col_name] = df[col].apply(preprocess_text)

print("Preprocessing complete!")

# Compare Original vs Cleaned version
display(df[['description', 'description_clean']].head(3))

### Step 3: Saving the Optimized NLP Dataset
In this final step, we save the processed text data.

To maintain a modular project structure and avoid conflicts with other team members' work (who might be cleaning columns like 'price' or 'room_type'), we do **not** save the entire dataset.

Instead, we export a specialized **NLP-only dataset** containing:
1.  **id**: Essential for merging this data back with the main dataset later.
2.  **Original Text Columns**: Required for extracting structural features (e.g., text length, capital letters ratio).
3.  **Cleaned Text Columns**: Required for content-based tasks like Sentiment Analysis.
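
One thing to watch when this file is loaded again: by default `pd.read_csv` parses empty fields as NaN, which would undo the empty-string fill from T1.7. The sketch below shows the round trip on a toy frame and a merge back on `id`. The frames `df_nlp_demo` and `main_demo` and their values are hypothetical, and an in-memory buffer stands in for the real file.

```python
import io
import pandas as pd

# Toy NLP-only frame; the second listing has empty text after the fill step.
df_nlp_demo = pd.DataFrame({
    "id": [1, 2],
    "description": ["Sunny loft", ""],
    "description_clean": ["sunny loft", ""],
})

# Write to an in-memory CSV instead of a real file.
csv_text = df_nlp_demo.to_csv(index=False)

# Default parsing turns the empty fields back into NaN.
reloaded_default = pd.read_csv(io.StringIO(csv_text))
nan_count_default = int(reloaded_default["description"].isna().sum())

# keep_default_na=False leaves empty fields as empty strings.
reloaded = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
nan_count_kept = int(reloaded["description"].isna().sum())

# Hypothetical main dataset maintained by other team members.
main_demo = pd.DataFrame({"id": [1, 2], "price": [120, 95]})

# validate raises an error if id is not unique on both sides.
merged = main_demo.merge(reloaded, on="id", how="left", validate="one_to_one")

print(nan_count_default, nan_count_kept, merged.shape)
```

Calling `fillna("")` on the text columns after loading is an equivalent alternative to `keep_default_na=False`.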

In [None]:
# ==========================================
# 6. SAVE CHECKPOINT (NLP PREPROCESSING COMPLETE)
# ==========================================
import os

# Create 'data/processed' folder if it doesn't exist
output_folder = "../../data/processed"
os.makedirs(output_folder, exist_ok=True)

# Define the output path
output_path = os.path.join(output_folder, "listings_text_cleaned.csv")

# --- OPTIMIZATION: SAVE ONLY RELEVANT COLUMNS ---
# We keep 'id' to merge with the main dataset later.
# We keep original text columns (for length/structure features).
# We keep cleaned text columns (for sentiment/content analysis).

relevant_columns = ['id'] + text_features + [col + "_clean" for col in text_features]

# Create a smaller dataframe with only NLP data
df_nlp = df[relevant_columns].copy()

# Save only the NLP-related data
df_nlp.to_csv(output_path, index=False)

print(f"NLP Preprocessing Pipeline (T1.7 & T1.8) completed successfully!")
print(f"Optimized NLP dataset saved to: {output_path}")
print(f"Original shape: {df.shape} -> Optimized shape: {df_nlp.shape}")
print(f"Columns saved: {df_nlp.columns.tolist()}")