# NLP Feature Engineering (Part 2) & Merge
**Tasks:** T1.11 (TF-IDF Vectorization) & T1.12 (NLP Tables Merge)
**Inputs:** <br>1. `data/processed/listings_text_cleaned.csv` (Text Data from mfa_T1.7_T1.8_nlp_pipeline)
<br>2. `data/processed/listings_nlp_features.csv` (Sentiment/Structure Data from mfa_T1.9_T1.10_sentiment_features)

### Plan
This notebook completes the NLP pipeline by generating keyword features and assembling the final master dataset for the NLP part of the project.

1.  **Setup & Load:** Load the cleaned text data and the previously generated NLP features.
2.  **Sanity Check:** Verify that row counts match across datasets to ensure data integrity.
3.  **T1.11 TF-IDF Transformation:**
    * Convert `description_clean` into numerical vectors using TF-IDF.
    * Limit the vocabulary to the **top 100 keywords** by corpus frequency to focus on widely used terms (e.g., "luxury", "beach", "downtown").
4.  **T1.12 Feature Integration (Merge):**
    * Merge the new TF-IDF features with the Sentiment & Structural features from `data/processed/listings_nlp_features.csv` using `id`.
5.  **Final Save:** Export the complete NLP dataset (`nlp_master_features.csv`).

### Step 1: Setup and Data Loading
In this section, we prepare the environment and load the necessary datasets.

**Inputs Loaded:**
1.  **`listings_text_cleaned.csv` (from mfa_T1.7_T1.8_nlp_pipeline.ipynb):** Contains the cleaned text (`description_clean`) which is the input for the TF-IDF model.
2.  **`listings_nlp_features.csv` (from mfa_T1.9_T1.10_sentiment_features.ipynb):** Contains the previously generated Sentiment and Structural features.

**Critical Checks:**
* **NaN Handling:** We explicitly fill missing values in the text column to prevent the TF-IDF vectorizer from crashing.
* **Row Consistency:** We perform a sanity check to ensure both datasets have the exact same number of rows (`id` count) before proceeding to analysis.
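
Matching row counts do not by themselves show that both tables describe the same listings. The following is a minimal sketch of a stricter check that compares the `id` values directly. The toy frames and the `sentiment_score` column are hypothetical stand-ins for `df_text` and `df_features`, not the project's real data.

```python
import pandas as pd

# Toy stand-ins for df_text and df_features (hypothetical values).
toy_text = pd.DataFrame({
    "id": [101, 102, 103],
    "description_clean": ["sunny loft", None, "beach house"],
})
toy_features = pd.DataFrame({
    "id": [101, 102, 104],
    "sentiment_score": [0.4, 0.0, 0.7],  # hypothetical feature column
})

# Fill missing text before vectorizing, as the notebook does.
toy_text["description_clean"] = toy_text["description_clean"].fillna("")

# Row counts can match even when the ids differ.
same_row_count = len(toy_text) == len(toy_features)

# Duplicate ids would multiply rows in a later merge.
ids_unique = toy_text["id"].is_unique and toy_features["id"].is_unique

# Ids present in one table but not in the other.
only_in_text = sorted(set(toy_text["id"]) - set(toy_features["id"]))
only_in_features = sorted(set(toy_features["id"]) - set(toy_text["id"]))

print(f"Same row count: {same_row_count}")
print(f"Ids unique: {ids_unique}")
print(f"Only in text data: {only_in_text}")
print(f"Only in feature data: {only_in_features}")
```

In this toy case the row counts agree, yet one id is missing from each side, which a shape comparison alone would not reveal.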

In [None]:
# ==========================================
# 1. SETUP & LOAD DATA
# ==========================================
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer

# Define file paths
text_data_path = "../../data/processed/listings_text_cleaned.csv"
features_data_path = "../../data/processed/listings_nlp_features.csv"

# Check if input files exist
if os.path.exists(text_data_path) and os.path.exists(features_data_path):
    print("Input files found.")
    
    # 1. Load Text Data
    df_text = pd.read_csv(text_data_path)
    # Handle missing values immediately to prevent errors in TF-IDF
    df_text['description_clean'] = df_text['description_clean'].fillna("")
    print(f"Text Data Loaded. Shape: {df_text.shape}")
    
    # 2. Load NLP Features Data
    df_features = pd.read_csv(features_data_path)
    print(f"Sentiment & Structural Data Loaded. Shape: {df_features.shape}")
    
    # 3. Sanity Check: Row Count Verification
    if df_text.shape[0] == df_features.shape[0]:
        print("Row counts match. Ready for T1.11 and T1.12.")
        
        # Display samples to confirm correct loading
        print("Sample Text Data:")
        display(df_text[['id', 'description_clean']].head(3))
        print("Sample Feature Data:")
        display(df_features.head(3))
    else:
        print("WARNING: Row count mismatch between Text Data and Feature Data.")
        print(f"Text Rows: {df_text.shape[0]}")
        print(f"Feature Rows: {df_features.shape[0]}")
        
else:
    print("Missing input files. Please ensure mfa_T1.7_T1.8_nlp_pipeline.ipynb and mfa_T1.9_T1.10_sentiment_features.ipynb are completed successfully.")

### Step 2: T1.11 - TF-IDF Vectorization & T1.12 - Final Merge
In this section, we perform the final feature extraction and dataset consolidation.

**1. T1.11 TF-IDF (Term Frequency-Inverse Document Frequency):**
* We convert the `description_clean` text into numerical vectors.
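* **Settings:** We cap the vocabulary at 100 terms with `max_features=100`, which keeps the terms with the highest total frequency across all descriptions after English stop words are removed. This keeps the dataset lightweight and reduces the risk of overfitting.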
* **Output:** Columns like `tfidf_beach`, `tfidf_luxury`, etc.

**2. T1.12 Merging:**
* We combine the **Sentiment/Structure Features** (loaded from `mfa_T1.9_T1.10_sentiment_features`) with the new **TF-IDF Features**.
* **Join Key:** We merge strictly on `id` to ensure data integrity.
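
An inner join on `id` silently drops any listing that is missing from either table. One possible way to audit this, sketched below on hypothetical toy frames, is to use the `validate` and `indicator` options of `pd.merge`. The column names `sentiment_score` and `tfidf_beach` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for df_features and df_tfidf.
toy_features = pd.DataFrame({"id": [1, 2, 3], "sentiment_score": [0.2, -0.1, 0.5]})
toy_tfidf = pd.DataFrame({"id": [1, 2, 4], "tfidf_beach": [0.0, 0.7, 0.3]})

# Outer join with an indicator column shows which ids fail to match.
# validate="one_to_one" raises MergeError if an id is duplicated on either side.
audit = pd.merge(
    toy_features, toy_tfidf,
    on="id", how="outer", validate="one_to_one", indicator=True,
)
unmatched = audit[audit["_merge"] != "both"]

# The inner join used for the master table keeps only ids found in both.
merged = pd.merge(toy_features, toy_tfidf, on="id", how="inner", validate="one_to_one")
rows_dropped = len(toy_features) - len(merged)

print(unmatched[["id", "_merge"]])
print(f"Rows dropped by inner join: {rows_dropped}")
```

If `unmatched` is empty and `rows_dropped` is zero, the inner join has preserved every listing.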

**3. Final Save:**
* The consolidated dataset is saved as `nlp_master_features.csv`. This file contains all NLP features (sentiment, structural, and TF-IDF) and is ready for the project-wide merge.
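
As a hedged sketch of the save step, the snippet below writes a toy frame to a temporary folder and reads it back to confirm that the shape and columns survive the round trip. The temporary directory and the toy values are assumptions for illustration; the notebook itself writes to `data/processed`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for df_master.
toy_master = pd.DataFrame({
    "id": [1, 2],
    "sentiment_score": [0.2, -0.1],
    "tfidf_beach": [0.0, 0.7],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the output folder if it does not exist yet.
    output_folder = os.path.join(tmp_dir, "data", "processed")
    os.makedirs(output_folder, exist_ok=True)

    output_path = os.path.join(output_folder, "nlp_master_features.csv")
    toy_master.to_csv(output_path, index=False)

    # Read the file back to verify what was actually written.
    reloaded = pd.read_csv(output_path)

shape_matches = reloaded.shape == toy_master.shape
columns_match = list(reloaded.columns) == list(toy_master.columns)
print(f"Shape preserved: {shape_matches}, columns preserved: {columns_match}")
```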

In [None]:
# ==========================================
# 2. T1.11: TF-IDF VECTORIZATION
# ==========================================
# We limit to top 100 features to keep the dataset manageable.
print("Starting TF-IDF Transformation...")

# Initialize TF-IDF Vectorizer
tfidf = TfidfVectorizer(
    max_features=100,       # Keep only the 100 most frequent terms across the corpus
    stop_words='english',   # Remove common English words
    dtype=np.float32        # Use less memory
)

# Fit and transform the cleaned descriptions
# Note: df_text was loaded in the previous cell
tfidf_matrix = tfidf.fit_transform(df_text['description_clean'])

# Convert to DataFrame
feature_names = tfidf.get_feature_names_out()
tfidf_cols = [f"tfidf_{word}" for word in feature_names]

df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_cols)

# Add ID back to TF-IDF dataframe for merging
df_tfidf['id'] = df_text['id'].to_numpy()  # Assign by position so a non-default index cannot misalign ids

print(f"TF-IDF Complete. Created {len(tfidf_cols)} features.")
print(f"First 10 features (alphabetical): {tfidf_cols[:10]}")

# ==========================================
# 3. T1.12: MERGE ALL NLP FEATURES
# ==========================================
print("\nStarting Merge Process...")

# Merge TF-IDF features with Sentiment/Structure features (Block B)
# We use 'id' as the key.
# df_features was loaded in the previous cell
df_master = pd.merge(df_features, df_tfidf, on='id', how='inner')

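print("Merge Complete.")
print(f"Master Dataset Shape: {df_master.shape}")
# An inner join drops ids missing from either table, so flag any change in row count
if len(df_master) != len(df_features):
    print(f"WARNING: Row count changed from {len(df_features)} to {len(df_master)} during the merge.")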

# ==========================================
# 4. SAVE MASTER NLP DATASET
# ==========================================
output_folder = "../../data/processed"
output_path = os.path.join(output_folder, "nlp_master_features.csv")

df_master.to_csv(output_path, index=False)

print("\nSUCCESS: Pipeline Finished.")
print(f"Master NLP Dataset saved to: {output_path}")
print(f"Columns (First 10): {df_master.columns.tolist()[:10]}")