# Milestone 3: Advanced NLP and Deep Learning Models

**Objective**
In this notebook, we will improve our text analysis using Deep Learning.
1. **BERT Analysis:** Use pre-trained transformers to understand listing descriptions.
2. **Word Embeddings:** Create Word2Vec vectors to capture semantic meaning.
3. **Deep Learning Model:** Train a Neural Network to predict values.

**Final Goal: Integration Strategy (The "Meta-Feature")**
After building these complex models, we will not just stop at comparisons.
We will **condense** all our Deep Learning insights into **1 or 2 simple numeric features** (like a "BERT_Score").
We will save these new features so they can be easily used by other models (like XGBoost or Random Forest) to improve their performance.
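
The notebook has not yet fixed how this condensing will be done. As one possible approach (an assumption, not the method chosen later), the sketch below uses PCA via NumPy's SVD to compress wide embedding vectors into two numbers per listing. The function name `condense_embeddings` and the random stand-in data are hypothetical.

```python
import numpy as np

def condense_embeddings(embeddings, n_components=2):
    """Project row vectors onto their top principal components (plain PCA via SVD)."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)            # PCA needs centered columns
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T    # shape: (n_rows, n_components)

# Random stand-in data: 50 "listings" with 768-dimensional embeddings
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 768))
meta_features = condense_embeddings(fake_embeddings, n_components=2)
print(meta_features.shape)  # (50, 2)
```

The two resulting columns could be stored next to `id` and joined to the inputs of a tree-based model. A supervised alternative, such as using a model's prediction as the score, would fit the "BERT_Score" idea equally well.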

## Step 1: Setup and Data Loading

**What will we do?**
In this step, we will prepare our environment and load the data.

1.  **Import Libraries:** We will import Pandas, NumPy, and the Transformers library for BERT.
2.  **Load Text Data (`listings_text_cleaned.csv`):**
    * We created this file in Milestone 1.
    * It contains **two versions** of the text for different future tasks:
        * **Raw Text:** The original text with punctuation. We need this for **BERT** to understand the context.
        * **Clean Text:** The processed text without stopwords. We used this for TF-IDF before.
3.  **Load Baseline Features (`nlp_master_features.csv`):**
    * This file contains our old VADER sentiment scores.
    * We will keep these scores to compare them with our new Deep Learning model later.
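
The cell below selects the baseline columns by name, so a renamed or missing column would raise a `KeyError`. A small check like the following sketch can report the problem first. The helper `missing_columns` and the toy dataframe are hypothetical and only illustrate the idea.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for nlp_master_features.csv
toy_features = pd.DataFrame({"id": [1, 2], "description_sentiment": [0.5, -0.1]})
required = ["id", "description_sentiment", "desc_word_count"]
print(missing_columns(toy_features, required))  # ['desc_word_count']
```

An empty list means every required column is present and the selection is safe.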

In [None]:
# 1. Import Libraries
import pandas as pd
import numpy as np
import os
import warnings

# We ignore warnings to keep the output clean
warnings.filterwarnings('ignore')

# Try to import Transformers (for BERT)
# We import the PyTorch class (DistilBertModel) because Step 2 uses PyTorch.
try:
    from transformers import DistilBertTokenizer, DistilBertModel
    print("Transformers library is ready.")
except ImportError:
    print("Transformers library not found. Please install it using: !pip install transformers torch")

# 2. Define File Paths
# We assume the data is in the processed folder
DATA_DIR = "../../data/processed/"
TEXT_FILE = os.path.join(DATA_DIR, "listings_text_cleaned.csv")
FEATURES_FILE = os.path.join(DATA_DIR, "nlp_master_features.csv")

# 3. Load Data
if os.path.exists(TEXT_FILE) and os.path.exists(FEATURES_FILE):
    # Load text data (contains 'description' and 'description_clean')
    df_text = pd.read_csv(TEXT_FILE)
    
    # Load old features (contains VADER scores)
    df_features = pd.read_csv(FEATURES_FILE)
    
    print("Data loaded successfully.")
    print(f"Text Data Shape: {df_text.shape}")
    print(f"Features Data Shape: {df_features.shape}")
else:
    print("Error: Files not found. Please check your paths.")

# 4. Prepare Baseline Features
# Strategy: We keep VADER scores and Counts. We DROP old TF-IDF columns.
# We will use these to compare with our new Deep Learning model later.
keep_columns = [
    'id', 
    'description_sentiment', 
    'host_about_sentiment', 
    'desc_word_count', 
    'desc_length', 
    'name_length'
]

# Create the baseline dataframe
if 'df_features' in locals():
    df_baseline = df_features[keep_columns].copy()
    print("\nBaseline Features Selected (VADER + Structure):")
    print(df_baseline.head(3))

### Interpretation of Step 1

We successfully loaded the data and created three dataframes:

1.  **df_text**:
    * **Source:** `listings_text_cleaned.csv`
    * **Content:** Contains the raw `description` text.
    * **Why?** We will feed this raw text into the BERT model to understand the context.<br><br>

2.  **df_features**:
    * **Source:** `nlp_master_features.csv`
    * **Content:** Contains all 107 NLP features from Milestone 1 (including VADER scores and TF-IDF).
    * **Why?** We loaded this to extract the specific columns we need.<br><br>

3.  **df_baseline**:
    * **Source:** Selected columns from `df_features`.
    * **Content:** Contains only `id`, VADER sentiment scores, and word counts.
    * **Why?** These are our "Baseline" features. Later, we will compare these old scores with the new BERT scores to see if Deep Learning is better.<br><br>

## Step 2: Initialize BERT Tokenizer and Model

**What will we do?**
Computers cannot read words. They only understand numbers.
1.  **Tokenizer:** We will load a tool that converts our text into numbers (tokens).
2.  **Model:** We will download the **DistilBERT** model.
    * **Why DistilBERT?** It is a smaller, faster, and lighter version of BERT. According to its authors, it is 40% smaller and 60% faster than BERT while retaining about 97% of its language-understanding performance.
    * **Pre-trained:** The model already "knows" English because it was trained on English Wikipedia and a large corpus of books.
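
To make "text into numbers" concrete, here is a toy sketch of the idea. It is not DistilBERT's real WordPiece tokenizer: the tiny vocabulary, the `toy_encode` helper, and the whitespace splitting are all assumptions made for illustration. It shows the two things the real tokenizer also returns: token ids padded to equal length, and an attention mask that marks padding.

```python
def toy_encode(texts, vocab, pad_id=0, unk_id=1):
    """Map whitespace-split words to integer ids and pad to the longest text."""
    ids = [[vocab.get(word, unk_id) for word in text.lower().split()] for text in texts]
    max_len = max(len(seq) for seq in ids)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in ids]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
    return input_ids, attention_mask

vocab = {"[PAD]": 0, "[UNK]": 1, "sunny": 2, "room": 3, "near": 4, "beach": 5}
input_ids, attention_mask = toy_encode(["Sunny room near beach", "Cozy room"], vocab)
print(input_ids)       # [[2, 3, 4, 5], [1, 3, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The word "cozy" is not in the toy vocabulary, so it maps to the unknown id 1. The real tokenizer handles such words by splitting them into smaller known pieces instead.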

In [None]:
# 1. Import PyTorch and Transformers
import torch
from transformers import DistilBertTokenizer, DistilBertModel

# Check device (Use GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Initialize Tokenizer
# We use 'distilbert-base-uncased' (Standard English model)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# 3. Initialize Model (PyTorch Version)
# We switched to PyTorch (DistilBertModel) to avoid Keras errors.
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Move the model to the active device (CPU or GPU)
model.to(device)

print("DistilBERT Model (PyTorch) loaded successfully.")

### Interpretation of Step 2

We successfully loaded the model.

**Model Status:** The DistilBERT model is ready to process our text.

## Step 3: Generating BERT Embeddings

**What will we do?**
We will now convert the listing descriptions into numbers (vectors).

**How will we do it?**
1.  **Batch Processing:** We cannot process all of the roughly 21,000 listings at once. Doing so could exhaust the computer's memory (RAM).
2.  **Loop:** We will take small groups (e.g., 32 listings at a time).
3.  **Tokenize & Encode:**
    * First, we convert words to tokens.
    * Then, we feed them into the DistilBERT model.
    * The model returns a vector of size 768 for every token. We keep the vector of the first token (`[CLS]`) as a single **vector of size 768** for each listing. This vector represents the "meaning" of the description.

**Note:** This process might take some time (10-20 minutes on CPU).
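
For each text, the model returns one 768-dimensional vector per token, and these are then reduced to a single vector. The sketch below uses a random NumPy array as a stand-in for the model output (an assumption, so that it runs without downloading a model). It shows two common reductions: taking the `[CLS]` position, which this notebook does, and a mask-aware average, which is a frequently used alternative.

```python
import numpy as np

# Stand-in for outputs.last_hidden_state: 2 texts, 4 token positions, 768 dims
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(2, 4, 768))
# The second text has only 2 real tokens; the rest is padding
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])

# Option A (used in this notebook): the vector at position 0, the [CLS] token
cls_vectors = last_hidden_state[:, 0, :]

# Option B (alternative): average the real tokens and ignore padded positions
mask = attention_mask[:, :, None]                 # shape (2, 4, 1) for broadcasting
mean_vectors = (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_vectors.shape, mean_vectors.shape)  # (2, 768) (2, 768)
```

Both options give one row of 768 numbers per listing, so the rest of the pipeline would not need to change if the pooling method were swapped.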

In [None]:
# 1. Prepare Data
# We fill missing descriptions with an empty string to avoid errors.
descriptions = df_text['description'].fillna("").tolist()

# 2. Parameters
BATCH_SIZE = 32 # We process 32 listings at a time
embeddings_list = []

print(f"Starting BERT embedding generation for {len(descriptions)} listings...")

# 3. Loop through data in batches
# range(start, stop, step)
for i in range(0, len(descriptions), BATCH_SIZE):
    # Select the batch
    batch_texts = descriptions[i : i + BATCH_SIZE]
    
    # Tokenize
    # padding=True: pad to the longest sentence in the batch
    # truncation=True: cut texts longer than 128 tokens (saves memory)
    inputs = tokenizer(batch_texts, padding=True, truncation=True, 
                      max_length=128, return_tensors="pt")
    
    # Move inputs to the device (CPU or GPU)
    inputs = {key: val.to(device) for key, val in inputs.items()}
    
    # Generate Embeddings
    with torch.no_grad(): # We do not need gradients for inference (saves RAM)
        outputs = model(**inputs)
    
    # Extract the [CLS] token (The vector representing the whole sentence)
    # It is the first token (index 0) of the last hidden state
    batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    
    # Add to our list
    embeddings_list.extend(batch_embeddings)
    
    # Print progress every 100 batches (approx. every 3200 listings)
    if (i // BATCH_SIZE) % 100 == 0:
        print(f"Processed {i} / {len(descriptions)} listings...")

print("Embedding generation complete!")

# 4. Create DataFrame
# Convert the list of arrays into a Pandas DataFrame
df_bert = pd.DataFrame(embeddings_list)

# Rename columns to 'bert_0', 'bert_1', ... 'bert_767'
df_bert.columns = [f'bert_{i}' for i in range(df_bert.shape[1])]

# Add the ID column for merging later
df_bert['id'] = df_text['id'].values

print(f"BERT DataFrame Shape: {df_bert.shape}")
print(df_bert.head(3))

### Interpretation of Step 3 (BERT Embeddings)

We have successfully converted 20,942 listing descriptions into high-dimensional vectors.

**Understanding the Output (`df_bert`):**
* **Rows (20,942):** Each row represents one Airbnb listing.
* **Columns (769):**
    * **`id`:** The key to match these numbers back to the original house.
    * **`bert_0` ... `bert_767`:** These **768 numbers** are the "Deep Learning features."
    * Unlike VADER, which gave us a single sentiment score per text, BERT gives us **768 dimensions**. No single dimension has a clear label such as "luxury" or "location"; the meaning is spread across all 768 numbers and becomes useful when they are used together.
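
The 768 numbers are most useful together, for example when comparing two listings. A minimal sketch, using tiny made-up vectors in place of real embeddings, of how cosine similarity compares two vectors (the `cosine_similarity` helper is hypothetical):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1 mean same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Tiny 3-dimensional stand-ins for 768-dimensional embeddings
listing_a = [1.0, 2.0, 0.0]
listing_b = [2.0, 4.0, 0.0]   # same direction as listing_a
listing_c = [0.0, 0.0, 3.0]   # orthogonal to listing_a

print(cosine_similarity(listing_a, listing_b))  # close to 1.0
print(cosine_similarity(listing_a, listing_c))  # 0.0
```

With real embeddings, two descriptions with similar content would be expected to score higher than two unrelated ones.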

**Next Step:**
Now that we have these valuable numbers, we must **save** them immediately so we don't have to wait a long time again. Then, we will merge them with our VADER scores to prepare for the comparison.

## Step 4: Save and Merge Data

**What will we do?**
1.  **Save to CSV:** We will save the new BERT features to a file (`bert_embeddings.csv`).
    * **Why?** Generating these numbers took a long time. We save them immediately to avoid doing it again if the computer crashes.
2.  **Merge:** We will combine the **BERT features** (768 columns) with our **Baseline Features** (VADER scores + Word Counts).
    * **Goal:** Create a single "Dataset" that allows us to compare the old method vs. the new method side-by-side.
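
Before trusting the merged file, it can help to confirm that `id` is unique on both sides and to see how many rows an inner join would drop. The sketch below uses toy dataframes (hypothetical values) with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "description_sentiment": [0.8, -0.2, 0.1]})
right = pd.DataFrame({"id": [2, 3, 4], "bert_0": [0.11, 0.22, 0.33]})

# validate raises pandas.errors.MergeError if 'id' is duplicated on either side
# indicator adds a '_merge' column showing where each row came from
audit = pd.merge(left, right, on="id", how="outer",
                 validate="one_to_one", indicator=True)
print(audit["_merge"].value_counts())

# The inner join used in this step keeps only rows present in both frames
merged = pd.merge(left, right, on="id", how="inner", validate="one_to_one")
print(merged.shape)  # (2, 3)
```

Rows marked `left_only` or `right_only` in the audit are the ones an inner join silently removes.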

In [None]:
# 1. Define Output Paths
BERT_FILE = os.path.join(DATA_DIR, "bert_embeddings.csv")
FINAL_TASK_FILE = os.path.join(DATA_DIR, "bert_prepared.csv")

# 2. Save BERT Embeddings (Checkpoint)
# We save this immediately so we don't lose the calculated data.
if 'df_bert' in locals():
    df_bert.to_csv(BERT_FILE, index=False)
    print(f"Checkpoint saved: {BERT_FILE}")
else:
    print("Warning: df_bert not found. Make sure Step 3 ran successfully.")

# 3. Merge with Baseline Features
# We combine:
# - df_baseline: ID + VADER Scores + Word Counts (Old features)
# - df_bert: ID + 768 BERT Vectors (New Deep Learning features)
if 'df_baseline' in locals() and 'df_bert' in locals():
    # Merge on 'id'
    df_task3_1 = pd.merge(df_baseline, df_bert, on='id', how='inner')
    
    # 4. Save Final Dataset
    df_task3_1.to_csv(FINAL_TASK_FILE, index=False)
    
    print("\nMerge Complete!")
    print(f"Final Dataset Shape: {df_task3_1.shape}")
    print(f"Saved to: {FINAL_TASK_FILE}")
    
    # Display first few rows to confirm
    print(df_task3_1.head(3))
else:
    print("Error: Could not merge. Check if df_baseline and df_bert exist.")

## Step 5: Word Embeddings (Word2Vec)

**What will we do?**
We will train a **Word2Vec** model specifically on our Airbnb descriptions.
* **BERT vs Word2Vec:**
    * **BERT** is pre-trained on general text such as Wikipedia and books. It knows general English well.
    * **Word2Vec** will be trained *only* on our data. It will learn the specific jargon of Airbnb (e.g., that "Ocean" and "Beach" are mathematically very close).

**Methodology:**
1.  **Input:** We will use the **Cleaned Text** (`description_clean`) this time.
    * *Why?* Word2Vec doesn't need punctuation or stopwords. It just needs the core words.
2.  **Training:** We will create a model that represents every word as a vector.
3.  **Averaging:** Since a description contains many words, we will take the **average** of its word vectors to get a single "House Vector".
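
The averaging step can be illustrated without training anything. In this sketch the three word vectors are made up (a trained Word2Vec model would supply real ones), and the `mean_vector` helper is a hypothetical simplification of the function defined in the cell below.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; a trained model would supply these
toy_vectors = {
    "ocean": np.array([0.9, 0.1, 0.0]),
    "beach": np.array([0.7, 0.3, 0.0]),
    "wifi":  np.array([0.0, 0.2, 0.8]),
}

def mean_vector(words, vectors, size=3):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

# "jacuzzi" is not in the toy vocabulary, so it is skipped
house_vector = mean_vector(["ocean", "beach", "jacuzzi"], toy_vectors)
print(house_vector)  # approximately [0.8, 0.2, 0.0]
```

Averaging ignores word order, which is one reason the notebook also keeps the BERT features.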

In [None]:
# 1. Import Library
try:
    from gensim.models import Word2Vec
    print("Gensim library is ready.")
except ImportError:
    print("Gensim not found. Please install it using: !pip install gensim")

# 2. Prepare Data
# Word2Vec expects a list of words, not a full sentence string.
# We use 'description_clean' because stopwords/punctuation are already removed.
# We convert "sunny room wifi" -> ['sunny', 'room', 'wifi']
sentences = df_text['description_clean'].fillna("").apply(lambda x: str(x).split()).tolist()

print(f"Training Word2Vec model on {len(sentences)} listings...")

# 3. Train Word2Vec Model
# vector_size=100: Each word/house will be represented by 100 numbers.
# window=5: The model looks at 5 words before and after the target word.
# min_count=5: We ignore rare words (words that appear fewer than 5 times).
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

print("Word2Vec Model trained successfully.")

# 4. Generate Document Vectors (Averaging)
# Since a house description has many words, we take the AVERAGE of all word vectors
# to get a single vector representing the house.

def get_mean_vector(word_list):
    # Filter words that exist in our trained model
    valid_words = [word for word in word_list if word in w2v_model.wv]
    
    if len(valid_words) > 0:
        # Calculate mean (average)
        return np.mean(w2v_model.wv[valid_words], axis=0)
    else:
        # If no valid words, return an array of zeros with the model's vector size
        return np.zeros(w2v_model.wv.vector_size)

# Apply the function to all listings
w2v_vectors = [get_mean_vector(doc) for doc in sentences]

# 5. Create DataFrame
df_w2v = pd.DataFrame(w2v_vectors)

# Rename columns to w2v_0, w2v_1, ... w2v_99
df_w2v.columns = [f'w2v_{i}' for i in range(df_w2v.shape[1])]

# Add ID for merging
df_w2v['id'] = df_text['id'].values

print(f"Word2Vec DataFrame Shape: {df_w2v.shape}")
print(df_w2v.head(3))

### Interpretation of Step 5 (Word2Vec)

We successfully trained a custom Word2Vec model on our data.
* **Warning Note:** The `Exception ignored` message in the output is a harmless warning related to the library's internal threads. It did not stop the process.
* **Result:** We now have a dataframe (`df_w2v`) with **100 new columns** (`w2v_0` to `w2v_99`), plus one `id` column that identifies each listing.
* **Meaning:** Each row represents the "average meaning" of a house description, distilled into 100 numbers.

## Step 6: Save Word2Vec Data

**What will we do?**
We will save these new features to a CSV file (`word2vec_embeddings.csv`).

**Why?**
1.  **Safety:** Just like BERT, we want to save our work so we don't have to retrain the model later.
2.  **Usage:** Later, we will use this file to train a Neural Network.
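
A quick way to gain confidence in a saved file is to read it back and compare it with what was written. The sketch below does this with an in-memory buffer and a tiny made-up dataframe, so it does not touch the real `word2vec_embeddings.csv`.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"w2v_0": [0.12, -0.5], "w2v_1": [0.3, 0.0], "id": [101, 102]})

buffer = io.StringIO()           # in-memory stand-in for the CSV file
df_out.to_csv(buffer, index=False)
buffer.seek(0)                   # rewind before reading
df_back = pd.read_csv(buffer)

# Same columns and same number of rows after the round trip
print(list(df_back.columns) == list(df_out.columns), len(df_back) == len(df_out))
```

The same comparison would work on the real file by passing its path to `pd.read_csv` instead of the buffer.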

In [None]:
# 1. Define Output Path
W2V_FILE = os.path.join(DATA_DIR, "word2vec_embeddings.csv")

# 2. Save Word2Vec Features
if 'df_w2v' in locals():
    df_w2v.to_csv(W2V_FILE, index=False)
    
    print("Save Complete!")
    print(f"File saved to: {W2V_FILE}")
else:
    print("Error: df_w2v not found. Please check Step 5.")