# NLP Feature Engineering
**Tasks:** T1.9 (Sentiment Analysis) & T1.10 (Text Features)
**Input:** `data/processed/listings_text_cleaned.csv` (Output of mfa_T1.7_T1.8_nlp_pipeline.ipynb)

### Plan
In this notebook, we extract numerical features from the text data. We use a **Hybrid Approach**:

1.  **Setup:** Load libraries and the prepared text dataset.
2.  **T1.9 Sentiment Analysis (VADER):**
    * We will calculate a sentiment score for each text using the **NLTK VADER** tool. VADER returns positive, negative, neutral, and compound scores; we keep only the compound score.
    * **Strategy:** We use the **Original (Raw) Text** (e.g., `description`) because VADER uses capitalization ("GREAT"), punctuation ("!!!"), and emoticons (":)") to measure emotion intensity.
3.  **T1.10 Text Feature Extraction:**
    * We will calculate structural features to understand the listing quality:
        * **Word Count:** How long is the description?
        * **Capital Letter Ratio:** Is the host "shouting" in the title?
4.  **Save:** Export the new features for the final merge.

### Step 1: Setup and Data Loading
In this section, we import the necessary libraries and download the **VADER lexicon**, a sentiment dictionary designed primarily for short, informal text such as social media posts.

We also load the processed dataset from `data/processed/listings_text_cleaned.csv`. We explicitly fill missing values with empty strings so that the sentiment and length calculations never receive `NaN` values.
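
The need for the fill can be seen in a small round trip: pandas writes an empty string as an empty CSV field and, with default settings, reads empty fields back as `NaN`. The sketch below uses a made-up two-row frame and an in-memory buffer instead of the real file, so the values are illustrative only.

```python
import io

import pandas as pd

# Toy data (made up for illustration): the second description is an empty string
original = pd.DataFrame({"id": [1, 2], "description": ["Cozy flat!", ""]})

# Write to an in-memory CSV and read it back, mimicking a save/reload cycle
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as NaN
nan_after_reload = int(reloaded["description"].isna().sum())

# Filling with "" restores a column that is safe for string operations
reloaded["description"] = reloaded["description"].fillna("")
nan_after_fill = int(reloaded["description"].isna().sum())

print(nan_after_reload, nan_after_fill)
```

An alternative is to pass `keep_default_na=False` to `pd.read_csv`, but that also changes how numeric columns are parsed, so filling the text columns explicitly is the narrower fix.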

In [None]:
# ==========================================
# 1. SETUP & LOAD DATA
# ==========================================
import pandas as pd
import numpy as np
import nltk
import os

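# Download VADER lexicon (Sentiment Dictionary)
print("Downloading VADER resources...")
try:
    # nltk.download() reports most failures by returning False instead of raising
    if nltk.download('vader_lexicon'):
        print("VADER lexicon is available.")
    else:
        print("VADER lexicon download failed. Check the network connection.")
except Exception as e:
    print(f"Error downloading VADER: {e}")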

# Load the processed text data produced by mfa_T1.7_T1.8_nlp_pipeline.ipynb
input_path = "../../data/processed/listings_text_cleaned.csv"

if os.path.exists(input_path):
    df = pd.read_csv(input_path)
    
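    # CRITICAL: Empty strings are read back as NaN when a CSV is reloaded.
    # We fill them with empty strings to avoid errors in VADER or length checks.
    # ID columns are matched by exact name or "_id" suffix; a substring test
    # ('id' in col) would also skip text columns whose names merely contain "id".
    text_cols = [col for col in df.columns if col != 'id' and not col.endswith('_id')]
    df[text_cols] = df[text_cols].fillna("")
    
    print("Data Loaded Successfully.")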
    print(f"Shape: {df.shape}")
    
    # Check for missing values to ensure safety
    missing_count = df.isnull().sum().sum()
    print(f"Total missing values after fix: {missing_count}")
    
    display(df.head(3))

else:
    print(f"File not found: {input_path}")
    print("Please check that mfa_T1.7_T1.8_nlp_pipeline.ipynb completed successfully.")

### Step 2: T1.9 - Sentiment Analysis (VADER)
In this step, we calculate the sentiment scores using **NLTK VADER**.

**Strategy:**
We apply VADER to the **Original Text Columns** (`description` and `host_about`) to capture the emotion intensity provided by capitalization and punctuation.
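
To illustrate why the raw text matters, the sketch below counts two surface cues that VADER treats as intensity signals (all-caps words and exclamation marks) before and after a cleaning step. The `toy_clean` function is a hypothetical stand-in for a typical cleaning step, not the actual one used in the T1.7/T1.8 notebook, and the sentence is made up.

```python
import string


def intensity_cues(text):
    """Count all-caps words and exclamation marks in a text."""
    words = text.split()
    return {
        "all_caps_words": sum(1 for w in words if w.isupper() and len(w) > 1),
        "exclamation_marks": text.count("!"),
    }


def toy_clean(text):
    """Hypothetical cleaning step: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


raw = "The view is GREAT!!! Totally worth it."
cues_raw = intensity_cues(raw)
cues_clean = intensity_cues(toy_clean(raw))

print("raw:    ", cues_raw)
print("cleaned:", cues_clean)
```

Both cues disappear after cleaning, which is why the sentiment step reads `description` and `host_about` rather than their cleaned counterparts.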

**Scope & Exclusions:**
* **Included:** `description`, `host_about` (Rich content).
* **Excluded `name`:** Too short for reliable sentiment analysis. We will use it for structural features in T1.10 instead.
* **Excluded `neighborhood_overview`:** Contains ~40% missing data. Filling these with a neutral score of 0.0 would make missing text indistinguishable from genuinely neutral text for a large share of listings and could bias the model.
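
A check like the following could back up the inclusion decision on the real data. It is a minimal sketch on a made-up frame, so the resulting shares do not reflect the ~40% figure above; `missing_share` is a helper name introduced here for illustration.

```python
import pandas as pd

# Toy data (made up): None and "" both count as missing text
toy = pd.DataFrame({
    "description": ["Bright loft", "Cozy room", "Garden flat", "Studio", "Big house"],
    "neighborhood_overview": ["Lively area", None, "", "Near the river", None],
})


def missing_share(series):
    """Share of rows that are NaN or contain only whitespace."""
    return float(series.fillna("").str.strip().eq("").mean())


shares = {col: missing_share(toy[col]) for col in toy.columns}
print(shares)
```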

**Output:**
We calculate the **Compound Score**, which summarizes the sentiment into a single number between -1 (Negative) and +1 (Positive).
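
For intuition on why the compound score stays between -1 and +1: in the reference VADER implementation, the rule-adjusted sum of word valences `x` is squashed with `x / sqrt(x^2 + alpha)`, where `alpha = 15`. The sketch below reproduces only that normalization step, not VADER's lexicon lookup or its heuristics; the function name `normalize_compound` and the input values are chosen here for illustration.

```python
import math


def normalize_compound(valence_sum, alpha=15.0):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)


# Illustrative valence sums, from strongly negative to strongly positive
examples = {x: round(normalize_compound(x), 3) for x in (-8.0, -2.0, 0.0, 2.0, 8.0)}
for valence_sum, compound in examples.items():
    print(f"{valence_sum:+.1f} -> {compound:+.3f}")
```

Because the mapping saturates, long and strongly worded descriptions tend to cluster near the extremes, which is worth keeping in mind when interpreting `description_sentiment`.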

In [None]:
# ==========================================
# 2. T1.9: SENTIMENT ANALYSIS (VADER)
# ==========================================
from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize the VADER analyzer
sia = SentimentIntensityAnalyzer()

def get_sentiment_score(text):
    """
    Calculates the compound sentiment score for a given text.
    Returns a float between -1.0 (Negative) and 1.0 (Positive).
    """
    # Safety check for non-string values
    if not isinstance(text, str) or not text.strip():
        return 0.0
    
    # Get sentiment scores
    scores = sia.polarity_scores(text)
    
    # Return only the 'compound' score
    return scores['compound']

print("Starting Sentiment Analysis using VADER...")

# 1. Analyze 'description' (Original text)
print("Processing: description -> description_sentiment")
df['description_sentiment'] = df['description'].apply(get_sentiment_score)

# 2. Analyze 'host_about' (Original text)
print("Processing: host_about -> host_about_sentiment")
df['host_about_sentiment'] = df['host_about'].apply(get_sentiment_score)

print("Sentiment Analysis complete.")

# Display results: Show text with its score
cols_to_check = ['description', 'description_sentiment']
display(df[cols_to_check].head(5))

### Step 3: T1.10 - Text Feature Extraction
In this step, we extract structural features from the text. These features help the model understand the "quality" and "style" of the listing.

**Features Created:**
1.  **`name_length`**: The number of characters in the listing title. (Short titles might be less informative).
2.  **`name_upper_ratio`**: The share (0 to 1) of characters in the title that are uppercase letters. Spaces and punctuation count toward the total length.
    * *Why?* Helps detect "clickbait" or aggressive marketing (e.g., "AMAZING VIEW!!!").
3.  **`desc_length`**: Total character count of the original description.
4.  **`desc_word_count`**: The count of meaningful words in the *cleaned* description.
    * *Why?* Longer descriptions may indicate a more professional host; the model can test whether this holds.
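
As a worked example of these definitions, the sketch below computes the features on three made-up titles and cleaned descriptions. It uses vectorized pandas string methods where possible, which is one option and not necessarily how the notebook cell computes them.

```python
import pandas as pd

# Toy inputs (made up); the empty strings mimic listings with missing text
titles = pd.Series(["AMAZING VIEW!!!", "Quiet studio near park", ""])
cleaned_desc = pd.Series(["bright loft balcony", "quiet studio", ""])


def upper_ratio(text):
    """Uppercase letters divided by total characters; 0.0 for empty text."""
    return sum(ch.isupper() for ch in text) / len(text) if text else 0.0


features = pd.DataFrame({
    "name_length": titles.str.len(),
    "name_upper_ratio": titles.apply(upper_ratio),
    "desc_word_count": cleaned_desc.str.split().str.len(),
})
print(features)
```

Note that the all-caps title scores 11/15 rather than 1.0, because the space and the three exclamation marks are part of the denominator.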

In [None]:
# ==========================================
# 3. T1.10: TEXT FEATURE EXTRACTION
# ==========================================

def calculate_upper_ratio(text):
    """Calculates the ratio of uppercase letters to total length."""
    if not isinstance(text, str) or len(text) == 0:
        return 0.0
    upper_count = sum(1 for char in text if char.isupper())
    return upper_count / len(text)

print("Starting Feature Extraction...")

# 1. Name Features (Title Analysis)
print("Processing: name -> name_length & name_upper_ratio")
df['name_length'] = df['name'].apply(lambda x: len(str(x)))
df['name_upper_ratio'] = df['name'].apply(calculate_upper_ratio)

# 2. Description Features (Structure Analysis)
print("Processing: description -> desc_length")
df['desc_length'] = df['description'].apply(lambda x: len(str(x)))

print("Processing: description_clean -> desc_word_count")
# We use the CLEANED version to count actual words (ignoring html tags/stopwords)
df['desc_word_count'] = df['description_clean'].apply(lambda x: len(str(x).split()))

print("Feature Extraction complete.")

# Display the new structural features
new_features = ['name', 'name_length', 'name_upper_ratio', 'desc_length', 'desc_word_count']
display(df[new_features].head(5))

### Step 4: Saving the Final NLP Feature Set
We save the final dataset containing the original IDs and the new NLP-derived features.

**Filename:** `listings_nlp_features.csv`

**Why this name?** It distinguishes our work (text and sentiment features) from the output of other team members, who might be generating physical features (price, room count, etc.).
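
Since the file keeps `id` as the merge key, the later merge could look roughly like this. The second table and its `room_count` column are hypothetical placeholders for another team member's features, and all values are made up.

```python
import pandas as pd

# Toy stand-ins: NLP features from this notebook and a hypothetical physical-feature table
nlp_features = pd.DataFrame({
    "id": [101, 102, 103],
    "description_sentiment": [0.82, -0.10, 0.0],
})
other_features = pd.DataFrame({
    "id": [101, 102, 103],
    "room_count": [2, 1, 3],  # hypothetical column
})

# validate="one_to_one" raises if either table has duplicate ids
merged = other_features.merge(nlp_features, on="id", how="left", validate="one_to_one")
print(merged)
```

The `validate` argument is a cheap guard against duplicated listing IDs silently multiplying rows during the merge.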

In [None]:
# ==========================================
# 4. SAVE FINAL FEATURE SET
# ==========================================
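output_folder = "../../data/processed"
# Create the folder if it does not exist yet, so to_csv does not fail
os.makedirs(output_folder, exist_ok=True)

output_path = os.path.join(output_folder, "listings_nlp_features.csv")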

# We select ONLY the numerical features we created + the ID to merge later.
# We DROP the text columns now, as the model only needs numbers.
final_columns = [
    'id', 
    'description_sentiment', 'host_about_sentiment',  # T1.9 Features
    'name_length', 'name_upper_ratio',                # T1.10 Features
    'desc_length', 'desc_word_count'                  # T1.10 Features
]

df_final = df[final_columns].copy()

df_final.to_csv(output_path, index=False)

print("NLP Feature Engineering Completed Successfully!")
print(f"NLP Features saved to: {output_path}")
print(f"Final Shape: {df_final.shape}")
print(f"Columns: {df_final.columns.tolist()}")