# Milestone 3: Advanced NLP & Feature Engineering

**Objective**
In this notebook, we will use advanced Deep Learning techniques to understand the listing descriptions and compare them with our earlier work.
1. **BERT Analysis:** Use pre-trained transformers to capture the context of sentences.
2. **Word Embeddings:** Train a custom Word2Vec model to learn Airbnb-specific vocabulary.
3. **Comparative Analysis:** Perform a comprehensive comparison between **BERT**, **Word2Vec**, and our **Baseline** (statistics from Milestone 1) to evaluate their performance relative to each other.

**Final Goal: Meta-Feature Generation (Stacking)**
We will not just choose one model. We will combine their intelligence.
* We will calculate a **"Confidence Score"** (from -1 to +1) for each model.
* We will save these scores into a new file (`nlp_scores.csv`).
* This file will be the input for our final Hybrid Deep Learning Model in the next stage.

## Step 1: Setup and Data Loading

**What will we do?**
In this step, we will prepare our environment and load the data.

1.  **Import Libraries:** We will import Pandas, NumPy, and the Transformers library for BERT.
2.  **Load Text Data (`listings_text_cleaned.csv`):**
    * We created this file in Milestone 1.
    * It contains **two versions** of the text for different future tasks:
        * **Raw Text:** The original text with punctuation. We need this for **BERT** to understand the context.
        * **Clean Text:** The processed text without stopwords. We used this for TF-IDF before.
3.  **Load Baseline Features (`nlp_master_features.csv`):**
    * This file contains our old VADER sentiment scores.
    * We will keep these scores to compare them with our new Deep Learning model later.

In [None]:
# 1. Import Libraries
import pandas as pd
import numpy as np
import os
import warnings

# We ignore warnings to keep the output clean
warnings.filterwarnings('ignore')

# Try to import Transformers (for BERT)
try:
    from transformers import DistilBertTokenizer, DistilBertModel
    print("Transformers library is ready.")
except ImportError:
    print("Transformers library is not found. Please install it.")

# 2. Define File Paths
# We assume the data is in the processed folder
DATA_DIR = "../../data/processed/"
TEXT_FILE = os.path.join(DATA_DIR, "listings_text_cleaned.csv")
FEATURES_FILE = os.path.join(DATA_DIR, "nlp_master_features.csv")

# 3. Load Data
if os.path.exists(TEXT_FILE) and os.path.exists(FEATURES_FILE):
    # Load text data (contains 'description' and 'description_clean')
    df_text = pd.read_csv(TEXT_FILE)
    
    # Load old features (contains VADER scores)
    df_features = pd.read_csv(FEATURES_FILE)
    
    print("Data loaded successfully.")
    print(f"Text Data Shape: {df_text.shape}")
    print(f"Features Data Shape: {df_features.shape}")
else:
    print("Error: Files not found. Please check your paths.")

# 4. Prepare Baseline Features
# Strategy: We keep VADER scores and Counts. We DROP old TF-IDF columns.
# We will use these to compare with our new Deep Learning model later.
keep_columns = [
    'id', 
    'description_sentiment', 
    'host_about_sentiment', 
    'desc_word_count', 
    'desc_length', 
    'name_length'
]

# Create the baseline dataframe
if 'df_features' in locals():
    df_baseline = df_features[keep_columns].copy()
    print("\nBaseline Features Selected (VADER + Structure):")
    print(df_baseline.head(3))

### Interpretation of Step 1

We successfully loaded the data and created three dataframes:

1.  **df_text**:
    * **Source:** `listings_text_cleaned.csv`
    * **Content:** Contains the raw `description` text.
    * **Why?** We will feed this raw text into the BERT model to understand the context.<br><br>

2.  **df_features**:
    * **Source:** `nlp_master_features.csv`
    * **Content:** Contains all 107 NLP features from Milestone 1 (including VADER scores and TF-IDF).
    * **Why?** We loaded this to extract the specific columns we need.<br><br>

3.  **df_baseline**:
    * **Source:** Selected columns from `df_features`.
    * **Content:** Contains only `id`, VADER sentiment scores, and word counts.
    * **Why?** These are our "Baseline" features. Later, we will compare these old scores with the new BERT scores to see if Deep Learning is better.<br><br>

## Step 2: Initialize BERT Tokenizer and Model

**What will we do?**
Computers cannot read words. They only understand numbers.
1.  **Tokenizer:** We will load a tool that converts our text into numbers (tokens).
2.  **Model:** We will download the **DistilBERT** model.
    * **Why DistilBERT?** It is a smaller, faster, and lighter version of BERT: about 40% smaller and 60% faster, while retaining roughly 97% of BERT's language-understanding performance.
    * **Pre-trained:** The model already "knows" English because it was trained on Wikipedia and Books.
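
To make the "text into numbers" idea concrete, here is a toy sketch. The tiny vocabulary and the whitespace splitting are assumptions for illustration only; the real DistilBERT tokenizer uses a WordPiece vocabulary of roughly 30,000 entries and splits unknown words into sub-word pieces. The sketch shows the three things the real tokenizer also does: map tokens to ids, truncate long inputs, and pad short ones while recording an attention mask.

```python
# Toy vocabulary (hypothetical): real tokenizers ship with their own vocabulary file.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "sunny": 4, "room": 5, "near": 6, "beach": 7}

def toy_encode(text, max_length=6):
    """Lowercase, split on spaces, map to ids, then truncate and pad to max_length."""
    words = text.lower().split()
    ids = [toy_vocab["[CLS]"]]
    ids += [toy_vocab.get(w, toy_vocab["[UNK]"]) for w in words]
    ids = ids[:max_length - 1] + [toy_vocab["[SEP]"]]   # truncation keeps room for [SEP]
    attention_mask = [1] * len(ids)                     # 1 = real token
    padding = max_length - len(ids)
    return ids + [toy_vocab["[PAD]"]] * padding, attention_mask + [0] * padding

ids, mask = toy_encode("Sunny room near metro")
print(ids)    # "metro" is not in the toy vocabulary, so it maps to [UNK]
print(mask)
```

The attention mask tells the model which positions are real tokens (1) and which are padding (0), so padded positions do not influence the result.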

In [None]:
# 1. Import PyTorch and Transformers
import torch
from transformers import DistilBertTokenizer, DistilBertModel

# Check device (Use GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Initialize Tokenizer
# We use 'distilbert-base-uncased' (Standard English model)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# 3. Initialize Model (PyTorch Version)
# We switched to PyTorch (DistilBertModel) to avoid Keras errors.
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Move the model to the active device (CPU or GPU)
model.to(device)

print("DistilBERT Model (PyTorch) loaded successfully.")

### Interpretation of Step 2

We successfully loaded the model.

**Model Status:** The DistilBERT model is ready to process our text.

## Step 3: Generating BERT Embeddings

**What will we do?**
We will now convert the listing descriptions into numbers (vectors).

**How will we do it?**
1.  **Batch Processing:** We cannot process all ~21,000 listings at once; doing so could exhaust the computer's memory (RAM).
2.  **Loop:** We will take small groups (e.g., 32 listings at a time).
3.  **Tokenize & Encode:**
    * First, we convert words to tokens.
    * Then, we feed them into the DistilBERT model.
    * The model gives us a **vector of size 768** for each listing. This vector represents the "meaning" of the description.

**Note:** This process might take some time (10-20 minutes on CPU).
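
The batching and `[CLS]` extraction can be sketched without downloading any model. In the sketch below, `fake_model` is a hypothetical stand-in that returns random hidden states with the same shape DistilBERT would produce (batch, sequence length, 768); only the loop and the slicing mirror the real procedure.

```python
import numpy as np

HIDDEN_SIZE = 768   # hidden size of distilbert-base
BATCH_SIZE = 32

def fake_model(batch_texts, seq_len=16):
    """Stand-in for DistilBERT (hypothetical): returns random hidden states
    of shape (batch, seq_len, hidden_size)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(batch_texts), seq_len, HIDDEN_SIZE))

texts = [f"listing {n}" for n in range(70)]      # 70 toy descriptions
chunks = []
for start in range(0, len(texts), BATCH_SIZE):
    batch = texts[start:start + BATCH_SIZE]      # batches of 32, 32, and 6
    hidden = fake_model(batch)
    chunks.append(hidden[:, 0, :])               # position 0 holds the [CLS] vector

embeddings = np.vstack(chunks)
print(embeddings.shape)  # (70, 768)
```

The last batch is simply smaller than the others; slicing past the end of a Python list does not raise an error, so no special handling is needed.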

In [None]:
# 1. Prepare Data
# We fill missing descriptions with an empty string "" to avoid errors.
descriptions = df_text['description'].fillna("").tolist()

# 2. Parameters
BATCH_SIZE = 32 # We process 32 listings at a time
embeddings_list = []

print(f"Starting BERT embedding generation for {len(descriptions)} listings...")

# 3. Loop through data in batches
# range(start, stop, step)
for i in range(0, len(descriptions), BATCH_SIZE):
    # Select the batch
    batch_texts = descriptions[i : i + BATCH_SIZE]
    
    # Tokenize
    # padding=True: pad to the longest sentence in the batch
    # truncation=True: cut texts longer than 128 tokens (saves memory)
    inputs = tokenizer(batch_texts, padding=True, truncation=True, 
                      max_length=128, return_tensors="pt")
    
    # Move inputs to the device (CPU or GPU)
    inputs = {key: val.to(device) for key, val in inputs.items()}
    
    # Generate Embeddings
    with torch.no_grad(): # We do not need gradients for inference (saves RAM)
        outputs = model(**inputs)
    
    # Extract the [CLS] token (The vector representing the whole sentence)
    # It is the first token (index 0) of the last hidden state
    batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    
    # Add to our list
    embeddings_list.extend(batch_embeddings)
    
    # Print progress every 100 batches (approx. every 3200 listings)
    if (i // BATCH_SIZE) % 100 == 0:
        print(f"Processed {i} / {len(descriptions)} listings...")

print("Embedding generation complete!")

# 4. Create DataFrame
# Convert the list of arrays into a Pandas DataFrame
df_bert = pd.DataFrame(embeddings_list)

# Rename columns to 'bert_0', 'bert_1', ... 'bert_767'
df_bert.columns = [f'bert_{i}' for i in range(df_bert.shape[1])]

# Add the ID column for merging later
df_bert['id'] = df_text['id'].values

print(f"BERT DataFrame Shape: {df_bert.shape}")
print(df_bert.head(3))

### Interpretation of Step 3 (BERT Embeddings)

We have successfully converted 20,942 listing descriptions into high-dimensional vectors.

**Understanding the Output (`df_bert`):**
* **Rows (20,942):** Each row represents one Airbnb listing.
* **Columns (769):**
    * **`id`:** The key to match these numbers back to the original house.
    * **`bert_0` ... `bert_767`:** These **768 numbers** are the "Deep Learning features."
    * Unlike VADER (which gave us a single sentiment score per text), BERT gives us **768 dimensions** of meaning. Individual dimensions are not directly interpretable; concepts such as "luxury," "location," or "coziness" are encoded across combinations of them.

**Next Step:**
Now that we have these valuable numbers, we must **save** them immediately so we don't have to wait a long time again. Then, we will merge them with our VADER scores to prepare for the comparison.

## Step 4: Save and Merge Data

**What will we do?**
1.  **Save to CSV:** We will save the new BERT features to a file (`bert_embeddings.csv`).
    * **Why?** Generating these numbers took a long time. We save them immediately to avoid doing it again if the computer crashes.
2.  **Merge:** We will combine the **BERT features** (768 columns) with our **Baseline Features** (VADER scores + Word Counts).
    * **Goal:** Create a single "Dataset" that allows us to compare the old method vs. the new method side-by-side.
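
A small sketch of the merge step, using toy frames with two `bert_*` columns instead of 768. All values are made up. The `validate="one_to_one"` argument is an optional pandas safeguard, not something the notebook uses, and is shown here as one way to catch duplicated ids.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for df_baseline and df_bert
toy_baseline = pd.DataFrame({"id": [1, 2, 3],
                             "description_sentiment": [0.8, -0.1, 0.4]})
toy_bert = pd.DataFrame({"id": [2, 3, 4],
                         "bert_0": [0.11, 0.25, -0.30],
                         "bert_1": [0.05, -0.42, 0.18]})

# Inner join keeps only ids present in both frames;
# validate raises MergeError if an id is duplicated on either side.
toy_merged = pd.merge(toy_baseline, toy_bert, on="id", how="inner",
                      validate="one_to_one")

print(toy_merged.shape)   # (2, 4): only ids 2 and 3 appear in both frames
lost = len(toy_baseline) - len(toy_merged)
print(f"Rows dropped from baseline by the inner join: {lost}")
```

Comparing row counts before and after an inner join is a quick way to notice listings that silently disappeared because their id was missing from one file.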

In [None]:
# 1. Define Output Paths
BERT_FILE = os.path.join(DATA_DIR, "bert_embeddings.csv")
FINAL_TASK_FILE = os.path.join(DATA_DIR, "bert_prepared.csv")

# 2. Save BERT Embeddings (Checkpoint)
# We save this immediately so we don't lose the calculated data.
if 'df_bert' in locals():
    df_bert.to_csv(BERT_FILE, index=False)
    print(f"Checkpoint saved: {BERT_FILE}")
else:
    print("Warning: df_bert not found. Make sure Step 3 ran successfully.")

# 3. Merge with Baseline Features
# We combine:
# - df_baseline: ID + VADER Scores + Word Counts (Old features)
# - df_bert: ID + 768 BERT Vectors (New Deep Learning features)
if 'df_baseline' in locals() and 'df_bert' in locals():
    # Merge on 'id'
    df_task3_1 = pd.merge(df_baseline, df_bert, on='id', how='inner')
    
    # 4. Save Final Dataset
    df_task3_1.to_csv(FINAL_TASK_FILE, index=False)
    
    print("\nMerge Complete!")
    print(f"Final Dataset Shape: {df_task3_1.shape}")
    print(f"Saved to: {FINAL_TASK_FILE}")
    
    # Display first few rows to confirm
    print(df_task3_1.head(3))
else:
    print("Error: Could not merge. Check if df_baseline and df_bert exist.")

## Step 5: Word Embeddings (Word2Vec)

**What will we do?**
We will train a **Word2Vec** model specifically on our Airbnb descriptions.
* **BERT vs Word2Vec:**
    * **BERT** is pre-trained on general English text (Wikipedia and books). It knows general English very well, but nothing specific to Airbnb.
    * **Word2Vec** will be trained *only* on our data. It will learn the specific jargon of Airbnb (e.g., that "Ocean" and "Beach" are mathematically very close).

**Methodology:**
1.  **Input:** We will use the **Cleaned Text** (`description_clean`) this time.
    * *Why?* Word2Vec doesn't need punctuation or stopwords. It just needs the core words.
2.  **Training:** We will create a model that represents every word as a vector.
3.  **Averaging:** Since a house has many words, we will take the **average** of all word vectors to get a single "House Vector".
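
The averaging step can be illustrated with hand-made vectors. The three-dimensional vectors below are hypothetical (the notebook uses 100 dimensions learned by Word2Vec); the point is only how known words are averaged and unknown words are skipped.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (the notebook uses 100 dimensions)
toy_vectors = {
    "sunny": np.array([1.0, 0.0, 2.0]),
    "room":  np.array([3.0, 2.0, 0.0]),
}

def toy_mean_vector(words, vectors, size=3):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

print(toy_mean_vector(["sunny", "room", "wifi"], toy_vectors))  # [2. 1. 1.]
print(toy_mean_vector(["wifi"], toy_vectors))                   # [0. 0. 0.]
```

Averaging discards word order, so "room near beach" and "beach near room" produce the same house vector. That is one reason this approach may miss sentence structure that BERT can capture.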

In [None]:
# 1. Import Library
try:
    from gensim.models import Word2Vec
    print("Gensim library is ready.")
except ImportError:
    print("Gensim not found. Please install it using: !pip install gensim")

# 2. Prepare Data
# Word2Vec expects a list of words, not a full sentence string.
# We use 'description_clean' because stopwords/punctuation are already removed.
# We convert "sunny room wifi" -> ['sunny', 'room', 'wifi']
sentences = df_text['description_clean'].fillna("").apply(lambda x: str(x).split()).tolist()

print(f"Training Word2Vec model on {len(sentences)} listings...")

# 3. Train Word2Vec Model
# vector_size=100: Each word/house will be represented by 100 numbers.
# window=5: The model looks at 5 words before and after the target word.
# min_count=5: We ignore rare words (words that appear less than 5 times).
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

print("Word2Vec Model trained successfully.")

# 4. Generate Document Vectors (Averaging)
# Since a house description has many words, we take the AVERAGE of all word vectors
# to get a single vector representing the house.

def get_mean_vector(word_list):
    # Filter words that exist in our trained model
    valid_words = [word for word in word_list if word in w2v_model.wv]
    
    if len(valid_words) > 0:
        # Calculate mean (average)
        return np.mean(w2v_model.wv[valid_words], axis=0)
    else:
        # If no valid words, return an array of zeros with the model's vector size
        return np.zeros(w2v_model.vector_size)

# Apply the function to all listings
w2v_vectors = [get_mean_vector(doc) for doc in sentences]

# 5. Create DataFrame
df_w2v = pd.DataFrame(w2v_vectors)

# Rename columns to w2v_0, w2v_1, ... w2v_99
df_w2v.columns = [f'w2v_{i}' for i in range(df_w2v.shape[1])]

# Add ID for merging
df_w2v['id'] = df_text['id'].values

print(f"Word2Vec DataFrame Shape: {df_w2v.shape}")
print(df_w2v.head(3))

### Interpretation of Step 5 (Word2Vec)

We successfully trained a custom Word2Vec model on our data.
* **Warning Note:** The `Exception ignored` message in the output is a harmless warning related to the library's internal threads. It did not stop the process.
* **Result:** We now have a dataframe (`df_w2v`) with **100 new columns** (`w2v_0` to `w2v_99`), plus the `id` column used for merging.
* **Meaning:** Each row represents the "average meaning" of a house description, distilled into 100 numbers.

## Step 6: Save Word2Vec Data

**What will we do?**
We will save these new features to a CSV file (`word2vec_embeddings.csv`).

**Why?**
1.  **Safety:** Just like BERT, we want to save our work so we don't have to retrain the model later.
2.  **Usage:** Later, we will use this file to train a Neural Network.

In [None]:
# 1. Define Output Path
W2V_FILE = os.path.join(DATA_DIR, "word2vec_embeddings.csv")

# 2. Save Word2Vec Features
if 'df_w2v' in locals():
    df_w2v.to_csv(W2V_FILE, index=False)
    
    print("Save Complete!")
    print(f"File saved to: {W2V_FILE}")
else:
    print("Error: df_w2v not found. Please check Step 5.")

## Step 7: Prepare Target Variable

**What will we do?**
Before training the model, we need to prepare the "Answer Key" (Target Variable).

1.  **Load Data:** We will load the file `listings_cleaned_with_target.csv` which contains our calculated value categories.
2.  **Filter and Encode:**
    * We will select only the `id` and `value_category` columns.
    * We will convert the text categories into numbers (Label Encoding) so the Neural Network can understand them:
        * `Poor_Value` -> **0**
        * `Fair_Value` -> **1**
        * `Excellent_Value` -> **2**
3.  **Save:** We will save this ready-to-use table (with columns: `id`, `value_category`, `target_encoded`) as `target_labels.csv`. This ensures we don't have to repeat this step later.
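
A minimal sketch of the encoding step on toy labels. The ids and the misspelled category `fair_value` are made up to show what happens when a label is missing from the mapping: `map` returns `NaN` for it, so it is worth listing such values before dropping them.

```python
import pandas as pd

label_mapping = {"Poor_Value": 0, "Fair_Value": 1, "Excellent_Value": 2}

# Toy labels (hypothetical), including one misspelled category
toy = pd.DataFrame({"id": [10, 11, 12, 13],
                    "value_category": ["Poor_Value", "Excellent_Value",
                                       "fair_value", "Fair_Value"]})

toy["target_encoded"] = toy["value_category"].map(label_mapping)

# Categories missing from the mapping become NaN; list them before dropping
unmapped = toy.loc[toy["target_encoded"].isna(), "value_category"].unique()
print("Unmapped categories:", list(unmapped))

toy = toy.dropna(subset=["target_encoded"]).copy()
toy["target_encoded"] = toy["target_encoded"].astype(int)
print(toy)
```

The order 0 < 1 < 2 matches the natural order Poor < Fair < Excellent, which matters later when the class probabilities are combined into a single score.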

In [None]:
# 1. Define File Paths
TARGET_SOURCE_FILE = os.path.join(DATA_DIR, "listings_cleaned_with_target.csv")
TARGET_OUTPUT_FILE = os.path.join(DATA_DIR, "target_labels.csv")

# 2. Load Source Data
if os.path.exists(TARGET_SOURCE_FILE):
    df_target_source = pd.read_csv(TARGET_SOURCE_FILE)
    
    # 3. Filter and Encode
    # We only need ID and the Category
    df_labels = df_target_source[['id', 'value_category']].copy()
    
    # Define the mapping (Encoding)
    # 0: Poor, 1: Fair, 2: Excellent
    label_mapping = {
        'Poor_Value': 0,
        'Fair_Value': 1,
        'Excellent_Value': 2
    }
    
    # Apply mapping
    df_labels['target_encoded'] = df_labels['value_category'].map(label_mapping)
    
    # Check for unmapped values (NaN)
    if df_labels['target_encoded'].isnull().sum() > 0:
        print("Warning: Some categories were not found in the map and are set to NaN.")
        # Drop NaNs if any (to be safe)
        df_labels = df_labels.dropna(subset=['target_encoded'])
        
    # Convert to integer
    df_labels['target_encoded'] = df_labels['target_encoded'].astype(int)

    # 4. Save to CSV
    df_labels.to_csv(TARGET_OUTPUT_FILE, index=False)
    
    print("Target Labels Processed and Saved!")
    print(f"File saved to: {TARGET_OUTPUT_FILE}")
    print(f"Shape: {df_labels.shape}")
    print(df_labels.head(5))

else:
    print(f"Error: Source file not found at {TARGET_SOURCE_FILE}")

## Step 8: Comparative Analysis (Cross-Validation)

**What will we do?**
We will conduct a rigorous experiment to decide which NLP technique is the best for predicting value.
Instead of a single test split, we will use **5-Fold Cross-Validation**.
* This means we will train and test the models 5 times on different parts of the data and average the results.

**The Three Contenders (Experiments):**
1.  **Experiment A (BERT):**
    * **Input:** 768 BERT embeddings (`bert_embeddings.csv`).
    * **Hypothesis:** The most complex model, should understand context best.
2.  **Experiment B (Word2Vec):**
    * **Input:** 100 Word2Vec vectors (`word2vec_embeddings.csv`).
    * **Hypothesis:** Faster and lighter, but might miss complex sentence structures.
3.  **Experiment C (Baseline):**
    * **Input:** Old VADER scores + TF-IDF (`nlp_master_features.csv`).
    * **Hypothesis:** Our reference point. Can Deep Learning beat these simple statistics?

**Method:**
We will use a standard **Neural Network (MLP Classifier)** for all three experiments to keep the comparison fair.
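
The cross-validation procedure can be sketched on synthetic data. This is not the notebook's experiment: the data come from `make_classification`, and `LogisticRegression` is swapped in for the MLP only to keep the example fast. The sketch wraps the scaler and the classifier in a pipeline, which is one common way to make sure each fold's scaling statistics come from its training part only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data (not the Airbnb features)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=42)

# The pipeline re-fits the scaler on each training fold only
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

toy_results = cross_validate(toy_pipeline, X_toy, y_toy, cv=toy_cv,
                             scoring=["accuracy", "f1_weighted"])

print(toy_results["test_accuracy"])          # one score per fold
print(toy_results["test_accuracy"].mean())   # the average across the 5 folds
```

`StratifiedKFold` keeps the class proportions roughly equal in every fold, which helps when the value categories are not equally frequent.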

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler

# 1. Setup Configurations
# We will use a standard Neural Network structure for all experiments to be fair.
# Hidden Layers: (64, 32) -> A standard architecture for this data size.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)

# Cross Validation Setup: 5 splits, shuffled
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# File Paths (Assuming all files are in DATA_DIR)
target_path = os.path.join(DATA_DIR, "target_labels.csv")

# Define Experiments
experiments = {
    "Baseline (TF-IDF + VADER)": "nlp_master_features.csv",
    "Word2Vec (100 Dim)": "word2vec_embeddings.csv",
    "BERT (768 Dim)": "bert_embeddings.csv"
}

# 2. Load Target Data
if os.path.exists(target_path):
    df_target = pd.read_csv(target_path)
    print(f"Targets loaded. Shape: {df_target.shape}")
else:
    raise FileNotFoundError("Target labels file not found!")

results = []

print("\nStarting 5-Fold Cross-Validation Analysis...\n" + "-"*50)

# 3. Execution Loop
for exp_name, file_name in experiments.items():
    file_path = os.path.join(DATA_DIR, file_name)
    
    if os.path.exists(file_path):
        print(f"Running Experiment: {exp_name}...")
        
        # Load Features
        df_features = pd.read_csv(file_path)
        
        # Merge with Targets
        # Use 'inner' to ensure we only have rows that exist in both files
        df_merged = pd.merge(df_target, df_features, on='id', how='inner')
        
        # Define X (Features) and y (Target)
        # Drop ID and target columns from X
        X = df_merged.drop(columns=['id', 'value_category', 'target_encoded'])
        # Keep only numeric columns (safety check, applied to every experiment)
        X = X.select_dtypes(include=[np.number])
        
        y = df_merged['target_encoded']
        
        # Scale Data
        # Neural Networks train better on standardized inputs (mean 0, std 1).
        # The scaler sits inside a pipeline so it is re-fit on each training fold;
        # fitting it on all rows first would leak test-fold statistics.
        from sklearn.pipeline import make_pipeline
        pipeline = make_pipeline(StandardScaler(), clf)
        
        # Run Cross-Validation
        # scoring: 'accuracy' and 'f1_weighted' (per-class F1, weighted by class size)
        cv_results = cross_validate(pipeline, X, y, cv=cv, scoring=['accuracy', 'f1_weighted'])
        
        # Store Results
        avg_acc = cv_results['test_accuracy'].mean()
        avg_f1 = cv_results['test_f1_weighted'].mean()
        
        print(f"   -> Accuracy: {avg_acc:.4f}")
        print(f"   -> F1 Score: {avg_f1:.4f}\n")
        
        results.append({
            "Experiment": exp_name,
            "Accuracy": avg_acc,
            "F1 Score": avg_f1
        })
        
    else:
        print(f"Skipping {exp_name}: File {file_name} not found.")

# 4. Final Comparison Table
print("-" * 50)
print("FINAL RESULTS SUMMARY")
df_results = pd.DataFrame(results).sort_values(by="Accuracy", ascending=False)
print(df_results)

### Analysis of Step 8 Results

We tested three models. Here is a fair comparison based on Accuracy and Computational Cost (Workload).

**1. The Difference Between Accuracy and F1 Score**

* **Accuracy (The Trap):** Accuracy can sometimes trick us.
    * *Imagine this:* There is a class with 100 students. 95 are healthy, and 5 are sick.
    * If the model says **"Everyone is healthy"**, the Accuracy is **95%**. It looks amazing on paper.
    * *But:* It completely missed the 5 sick students. This is a useless model.
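* **F1 Score (The Balance Check):**
    * F1 Score combines precision and recall, so a model that ignores a small group (the 5 sick students) scores poorly for that group. Note that we report the *weighted* F1, which averages the per-class F1 by class size and is therefore still influenced mostly by the larger classes.
    * **Our Result (~0.52):** This is a **moderate score**. It means our models are learning, but they are not perfect yet.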
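
The classroom example can be checked numerically with scikit-learn. The labels below are the made-up 95/5 split from the example, not project data. The sketch also prints macro and weighted F1 side by side, since the two averaging modes react very differently to a missed minority class.

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 95 + [1] * 5      # 0 = healthy, 1 = sick
y_pred = [0] * 100               # the model says "everyone is healthy"

acc = accuracy_score(y_true, y_pred)
f1_sick = f1_score(y_true, y_pred, pos_label=1, zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"Accuracy:    {acc:.3f}")          # 0.950
print(f"F1 (sick):   {f1_sick:.3f}")      # 0.000
print(f"Macro F1:    {f1_macro:.3f}")     # 0.487
print(f"Weighted F1: {f1_weighted:.3f}")  # 0.926
```

In this extreme case the weighted F1 stays high because the healthy class dominates, while the per-class and macro F1 expose the failure. Reporting macro F1 alongside weighted F1 may therefore be worth considering when classes are imbalanced.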

**2. Comparative Table: Performance vs. Workload**

We scored the "Workload" (CPU/RAM usage) on a scale of 1 to 5 (1=Very Light, 5=Very Heavy).

| Model | Accuracy | F1 Score | Workload (1-5) | Comment |
| :--- | :--- | :--- | :--- | :--- |
| **Baseline** (TF-IDF + VADER) | 49.68% | 0.49 | **1 / 5** | Instant results. The starting point. |
| **Word2Vec** | 51.70% | 0.51 | **2 / 5** | **+2.0 percentage points** over Baseline at low cost. Efficient. |
| **BERT** | 52.44% | 0.52 | **5 / 5** | **+0.7 percentage points** over Word2Vec but requires **max power**. |

**3. Interpretation of Trade-offs**

* **Word2Vec vs. Baseline:** By slightly increasing the workload (1 → 2), we gain about **2 percentage points** of accuracy. This is a "profitable" trade.
* **BERT vs. Word2Vec:** To gain an extra **0.7 percentage points** of accuracy, we must increase the workload significantly (2 → 5).
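
The gains quoted above are differences between two accuracy percentages. The short calculation below, using the accuracy values from the table, shows both the absolute gain and the relative gain so the two are not confused.

```python
# Accuracy values copied from the comparison table, in percent
accuracy = {"Baseline": 49.68, "Word2Vec": 51.70, "BERT": 52.44}

def gain(new, old):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = accuracy[new] - accuracy[old]
    relative = 100 * points / accuracy[old]
    return round(points, 2), round(relative, 2)

print(gain("Word2Vec", "Baseline"))  # (2.02, 4.07)
print(gain("BERT", "Word2Vec"))      # (0.74, 1.43)
```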

**Current Status:**
We have successfully extracted features using all three methods. We will keep these results in mind as we move forward to the next stages of our project.

## Step 9: Generating Continuous NLP Scores (Meta-Features)

**The Goal:**
We want to convert our complex NLP models into a simple, powerful "Score" that represents the value of a house based *only* on its text.

**The Strategy: From Classes to Continuum**
Instead of forcing the model to make a hard choice ("Is it Fair or Excellent?"), we will ask for its **confidence**.
* A rigid model says: "This is Fair."
* Our approach asks: "How close is it to Excellent? How close is it to Poor?"

**The Methodology:**
1.  **Input:** We will take the three models we tested:
    * **Baseline** (TF-IDF)
    * **Word2Vec** (Embeddings)
    * **BERT** (Deep Learning)<br><br>
2.  **Probability Calculation:** For every house, the models will calculate the probability of each category (Poor, Fair, Excellent).<br><br>
3.  **The Magic Formula:** We will convert these probabilities into a single number between **-1 and +1**.
    * $$Score = (1 \times P_{Excellent}) + (-1 \times P_{Poor}) + (0 \times P_{Fair})$$
    * **Result +1.0:** Perfectly Excellent.
    * **Result -1.0:** Perfectly Poor.
    * **Result 0.0:** Neutral. The model is either confidently Fair, or equally split between Poor and Excellent.
    * **Result +0.4:** Fair, but leaning towards Excellent.<br><br>
4.  **Cross-Validation:** We will generate these scores using "Cross-Validation" to ensure the model doesn't cheat by memorizing the answers.
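
The formula can be worked through on a few hypothetical probability rows. The numbers are made up for illustration; the weights (-1, 0, +1) are the ones in the formula above.

```python
import numpy as np

# Hypothetical class probabilities, columns = [P_Poor, P_Fair, P_Excellent]
toy_probs = np.array([
    [0.10, 0.20, 0.70],   # leaning Excellent
    [0.80, 0.10, 0.10],   # leaning Poor
    [0.00, 1.00, 0.00],   # confidently Fair
    [0.50, 0.00, 0.50],   # split between Poor and Excellent
])

weights = np.array([-1.0, 0.0, 1.0])   # Poor, Fair, Excellent
toy_scores = toy_probs @ weights       # same as toy_probs[:, 2] - toy_probs[:, 0]

print(toy_scores)
```

The last two rows both score 0.0 even though they describe very different situations, so the score alone cannot distinguish "confidently Fair" from "torn between the extremes".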

**Outcome:**
We will save a new file `nlp_scores.csv` containing just 4 columns: `id`, `baseline_score`, `w2v_score`, `bert_score`. This file will be the "Gold Standard" input for our final project phase.

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler

# 1. Setup Configurations
# We will use the same Neural Network structure to be consistent.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# File Paths
target_path = os.path.join(DATA_DIR, "target_labels.csv")
output_path = os.path.join(DATA_DIR, "nlp_scores.csv")

# Input Files (The 3 experiments)
experiments = {
    "baseline": "nlp_master_features.csv",
    "w2v": "word2vec_embeddings.csv",
    "bert": "bert_embeddings.csv"
}

# 2. Load Target Data
if os.path.exists(target_path):
    df_target = pd.read_csv(target_path)
    print(f"Targets loaded. Shape: {df_target.shape}")
else:
    raise FileNotFoundError("target_labels.csv not found! Please check Step 7.")

# Initialize final dataframe with IDs
# We will add columns to this dataframe
df_scores = df_target[['id']].copy()

print("\nStarting Meta-Feature Generation (Stacking)...")
print("-" * 50)

# 3. Execution Loop
for name, file_name in experiments.items():
    file_path = os.path.join(DATA_DIR, file_name)
    
    if os.path.exists(file_path):
        print(f"Processing Model: {name}...")
        
        # Load Features
        df_features = pd.read_csv(file_path)
        
        # Merge with Targets (Inner Join)
        df_merged = pd.merge(df_target, df_features, on='id', how='inner')
        
        # Prepare X (Features) and y (Target)
        X = df_merged.drop(columns=['id', 'value_category', 'target_encoded'])
        X = X.select_dtypes(include=[np.number]) # Use only numbers
        y = df_merged['target_encoded']
        
        # Scale Data (Important for Neural Networks)
        # The scaler sits inside a pipeline so it is re-fit on each training fold;
        # fitting it on all rows first would leak held-out statistics into training.
        from sklearn.pipeline import make_pipeline
        pipeline = make_pipeline(StandardScaler(), clf)
        
        # Generate Probabilities using Cross-Validation
        # Every house is scored by a model that never saw it during training.
        # method='predict_proba' returns 3 columns: [Prob_Poor, Prob_Fair, Prob_Excellent]
        print("   -> Calculating probabilities...")
        probs = cross_val_predict(pipeline, X, y, cv=cv, method='predict_proba')
        
        # Apply the Magic Formula: Score = P(Excellent) - P(Poor)
        # Column 0 = Poor, Column 1 = Fair, Column 2 = Excellent
        # P(Fair) has weight 0 in the formula, so it drops out of the calculation.
        continuous_scores = probs[:, 2] - probs[:, 0]
        
        # Create a temporary dataframe to merge back safely
        temp_df = pd.DataFrame({
            'id': df_merged['id'],
            f'{name}_score': continuous_scores
        })
        
        # Merge into our main scorecard
        df_scores = pd.merge(df_scores, temp_df, on='id', how='left')
        
        print(f"   -> {name}_score generated. (Mean: {continuous_scores.mean():.3f})\n")
        
    else:
        print(f"Warning: File {file_name} not found. Skipping.")

# 4. Save Final Scores
# Drop any rows that might have missing values (safety check)
df_scores.dropna(inplace=True)

df_scores.to_csv(output_path, index=False)

print("-" * 50)
print("Meta-Feature Generation Complete!")
print(f"Saved to: {output_path}")
print(f"Final Shape: {df_scores.shape}")
print(df_scores.head())

## Conclusion of NLP Feature Engineering

**What did we do in Step 9?**
We converted our three complex models (Baseline, Word2Vec, BERT) into simple "Score" values.
Instead of dealing with raw text or huge vectors, we calculated a specific mathematical score for each house.
* **-1.0:** Represents "Poor Value" confidence.
* **+1.0:** Represents "Excellent Value" confidence.

**Final Output:**
We saved these scores to `nlp_scores.csv`.
* This file contains the distilled knowledge of all our NLP experiments.
* It is clean, simple, and ready to be used as a high-quality input for any future analysis.

**Achievement:**
We successfully transformed unstructured text descriptions into structured, powerful numerical features. The NLP feature engineering part of the project is now complete.

## Step 10: Final Data Integration (Word2Vec Selection)

**Objective:**
We are now ready to create the final dataset for our Deep Learning Model.
Based on our Cost-Performance analysis, we have decided to use the **Word2Vec Score**.
* It offers the best balance: accuracy close to BERT's (~51.7% vs. ~52.4%) at a much lower computational cost.

**What will we do?**
1.  **Load Data:**
    * Load the processed numerical data (`numeric_final_data.csv` from `data/finalized`).
    * Load our NLP scores (`nlp_scores.csv`).<br><br>
2.  **Merge and Select:**
    * Merge the two datasets using `id`.
    * Select only the **`w2v_score`** (ignoring Baseline and BERT).<br><br>
3.  **Cleanup:**
    * Remove the `id` column (it is no longer needed for training).<br><br>
4.  **Save:**
    * Save the final, ready-to-train file as `final_data_with_nlp_score.csv` in the `data/finalized` folder.
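
The integration steps can be sketched on toy frames. The `price` column and all values are hypothetical stand-ins for the real numeric data; the sketch only illustrates selecting `w2v_score`, merging on `id`, dropping `id`, and moving the score to the front.

```python
import pandas as pd

# Toy stand-ins (hypothetical) for numeric_final_data.csv and nlp_scores.csv
toy_numeric = pd.DataFrame({"id": [1, 2, 3], "price": [80, 120, 95]})
toy_nlp = pd.DataFrame({"id": [1, 2, 3],
                        "baseline_score": [0.1, -0.2, 0.0],
                        "w2v_score": [0.3, -0.4, 0.1],
                        "bert_score": [0.2, -0.5, 0.2]})

toy_final = (
    toy_numeric
    .merge(toy_nlp[["id", "w2v_score"]], on="id", how="inner")
    .drop(columns=["id"])
)

# Put w2v_score first, keep the remaining columns in their original order
ordered = ["w2v_score"] + [c for c in toy_final.columns if c != "w2v_score"]
toy_final = toy_final[ordered]

print(toy_final.columns.tolist())  # ['w2v_score', 'price']
```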

In [None]:
import pandas as pd
import os

# 1. Define File Paths
# NLP scores are in 'processed', Numeric data is in 'finalized'
NLP_SCORES_PATH = "../../data/processed/nlp_scores.csv"
NUMERIC_DATA_PATH = "../../data/finalized/numeric_final_data.csv"
OUTPUT_PATH = "../../data/finalized/final_data_with_nlp_score.csv"

# 2. Check and Load Data
if os.path.exists(NUMERIC_DATA_PATH) and os.path.exists(NLP_SCORES_PATH):
    print("Loading datasets...")
    df_numeric = pd.read_csv(NUMERIC_DATA_PATH)
    df_nlp = pd.read_csv(NLP_SCORES_PATH)
    
    print(f"Numeric Data Shape: {df_numeric.shape}")
    print(f"NLP Scores Shape: {df_nlp.shape}")

    # 3. Merge Data (Numeric + Word2Vec Score)
    # We explicitly select only 'id' and 'w2v_score' from the NLP dataframe
    print("Merging Word2Vec scores...")
    df_final = pd.merge(df_numeric, df_nlp[['id', 'w2v_score']], on='id', how='inner')
    
    # 4. Drop 'id' (no longer needed for training)
    df_final = df_final.drop(columns=['id'])

    # 5. Reorder Columns
    # Move 'w2v_score' to the front so it is the first feature
    cols = ['w2v_score'] + [c for c in df_final.columns if c != 'w2v_score']
    df_final = df_final[cols]
    
    # 6. Save Final File
    df_final.to_csv(OUTPUT_PATH, index=False)
    
    print("-" * 50)
    print("Integration Complete!")
    print(f"Selected NLP Feature: Word2Vec Score")
    print(f"Dropped Column: id")
    print(f"Final Dataset Saved to: {OUTPUT_PATH}")
    print(f"Final Shape: {df_final.shape}")
    print(df_final.head())

else:
    print("Error: Input files not found.")
    print(f"Looking for: {NUMERIC_DATA_PATH}")
    print(f"Looking for: {NLP_SCORES_PATH}")