# Milestone 3: Advanced NLP and Deep Learning Models

**Objective**
In this notebook, we will improve our text analysis using Deep Learning.
1. **BERT Analysis:** Use pre-trained transformers to understand listing descriptions.
2. **Word Embeddings:** Create Word2Vec vectors to capture semantic meaning.
3. **Deep Learning Model:** Train a Neural Network to predict values.

**Final Goal: Integration Strategy (The "Meta-Feature")**
After building these models, we will not stop at comparing them.
We will **condense** our Deep Learning insights into **1 or 2 simple numeric features** (such as a "BERT_Score").
We will save these new features so that other models (such as XGBoost or Random Forest) can use them to improve their performance.
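
The idea of condensing many embedding dimensions into one numeric feature can be sketched as follows. This is only an illustration under assumptions: the random matrix stands in for real embeddings, and projecting onto the first principal component is one possible way to build a score, not necessarily the method used later in this notebook. The name `bert_score` is hypothetical.

```python
import numpy as np

# Stand-in for real embeddings: 6 listings x 8 dimensions (assumption;
# real DistilBERT vectors have 768 dimensions).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))

# Center each column, then find the principal directions with an SVD.
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# Project every listing onto the first principal direction:
# the result is one number per listing.
bert_score = centered @ vt[0]

print(bert_score.shape)
```

A single column like this can be joined to the other features by `id` and passed to a tree-based model in the same way as any other numeric column.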

## Step 1: Setup and Data Loading

**What will we do?**
In this step, we will prepare our environment and load the data.

1.  **Import Libraries:** We will import Pandas, NumPy, and the Transformers library for BERT.
2.  **Load Text Data (`listings_text_cleaned.csv`):**
    * We created this file in Milestone 1.
    * It contains **two versions** of the text for different future tasks:
        * **Raw Text:** The original text with punctuation. We need this for **BERT** to understand the context.
        * **Clean Text:** The processed text without stopwords. We used this for TF-IDF before.
3.  **Load Baseline Features (`nlp_master_features.csv`):**
    * This file contains our old VADER sentiment scores.
    * We will keep these scores to compare them with our new Deep Learning model later.
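
To see why the raw text matters for BERT, consider what stopword removal can do to a sentence. The sketch below uses a tiny, hypothetical stopword list and a simplified `clean` function; the actual cleaning from Milestone 1 may differ.

```python
# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"the", "is", "not", "a", "to", "from"}

def clean(text):
    """Lowercase, strip basic punctuation, and drop stopwords."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return " ".join(w for w in words if w and w not in STOPWORDS)

raw_a = "The flat is close to the beach."
raw_b = "The flat is not close to the beach."

print(clean(raw_a))
print(clean(raw_b))
```

With this list, both sentences reduce to the same cleaned string even though they mean opposite things. A bag-of-words method like TF-IDF tolerates that loss, but a contextual model needs the original wording.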

In [None]:
# 1. Import Libraries
import pandas as pd
import numpy as np
import os
import warnings

# We ignore warnings to keep the output clean
warnings.filterwarnings('ignore')

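# Try to import Transformers (for BERT)
# TFDistilBertModel is the TensorFlow variant, so TensorFlow must be installed too.
try:
    from transformers import DistilBertTokenizer, TFDistilBertModel
    print("Transformers library is ready.")
except ImportError as exc:
    # Stop here: every later step depends on this import
    raise ImportError(
        "Transformers library is not found. Install it with `pip install transformers`."
    ) from exc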

# 2. Define File Paths
# We assume the data is in the processed folder
DATA_DIR = "../../data/processed/"
TEXT_FILE = os.path.join(DATA_DIR, "listings_text_cleaned.csv")
FEATURES_FILE = os.path.join(DATA_DIR, "nlp_master_features.csv")

# 3. Load Data
# Fail early if a file is missing, instead of continuing without data
missing_files = [path for path in (TEXT_FILE, FEATURES_FILE) if not os.path.exists(path)]
if missing_files:
    raise FileNotFoundError(f"Files not found. Please check your paths: {missing_files}")

# Load text data (contains 'description' and 'description_clean')
df_text = pd.read_csv(TEXT_FILE)

# Load old features (contains VADER scores)
df_features = pd.read_csv(FEATURES_FILE)

print("Data loaded successfully.")
print(f"Text Data Shape: {df_text.shape}")
print(f"Features Data Shape: {df_features.shape}")

# 4. Prepare Baseline Features
# Strategy: We keep VADER scores and Counts. We DROP old TF-IDF columns.
# We will use these to compare with our new Deep Learning model later.
keep_columns = [
    'id', 
    'description_sentiment', 
    'host_about_sentiment', 
    'desc_word_count', 
    'desc_length', 
    'name_length'
]

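# Create the baseline dataframe, failing early if an expected column is absent
missing_columns = [col for col in keep_columns if col not in df_features.columns]
if missing_columns:
    raise KeyError(f"Columns missing from {FEATURES_FILE}: {missing_columns}")

df_baseline = df_features[keep_columns].copy()
print("\nBaseline Features Selected (VADER + Structure):")
print(df_baseline.head(3))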

### Interpretation of Step 1

We successfully loaded the data and created three dataframes:

1.  **df_text**:
    * **Source:** `listings_text_cleaned.csv`
    * **Content:** Contains the raw `description` text.
    * **Why?** We will feed this raw text into the BERT model to understand the context.<br><br>

2.  **df_features**:
    * **Source:** `nlp_master_features.csv`
    * **Content:** Contains all 107 NLP features from Milestone 1 (including VADER scores and TF-IDF).
    * **Why?** We loaded this to extract the specific columns we need.<br><br>

3.  **df_baseline**:
    * **Source:** Selected columns from `df_features`.
    * **Content:** Contains only `id`, VADER sentiment scores, and word counts.
    * **Why?** These are our "Baseline" features. Later, we will compare these old scores with the new BERT scores to see if Deep Learning is better.<br><br>
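
Because the later comparison relies on matching rows between the text data and the baseline features, it may be worth checking that both share the same listings. The sketch below uses small stand-in dataframes and assumes that the text file also carries an `id` column, which this notebook does not confirm.

```python
import pandas as pd

# Stand-in data (assumption: both real dataframes have an 'id' column).
df_text_demo = pd.DataFrame({"id": [1, 2, 3], "description": ["a", "b", "c"]})
df_baseline_demo = pd.DataFrame(
    {"id": [2, 3, 4], "description_sentiment": [0.1, 0.5, -0.2]}
)

# An outer merge with indicator=True adds a '_merge' column that labels each
# row as present in both frames, only the left one, or only the right one.
check = df_text_demo.merge(df_baseline_demo, on="id", how="outer", indicator=True)
counts = check["_merge"].value_counts()

print(counts)
```

Any `left_only` or `right_only` rows point to listings that would be dropped or left without scores when the two sources are joined.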

## Step 2: Initialize BERT Tokenizer and Model

**What will we do?**
Models cannot read raw text directly. They only operate on numbers.
1.  **Tokenizer:** We will load a tool that splits our text into tokens and maps each token to an integer ID.
2.  **Model:** We will download the **DistilBERT** model.
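    * **Why DistilBERT?** It is a smaller, faster, and lighter version of BERT. According to its authors, it is 40% smaller and 60% faster while retaining about 97% of BERT's language-understanding performance.
    * **Pre-trained:** The model already "knows" English because it was pre-trained on English Wikipedia and a large corpus of books (BookCorpus).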
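
To make the tokenizer step more concrete, here is a toy version of the sub-word splitting that BERT-style tokenizers perform (greedy longest-match, in the style of WordPiece). The vocabulary below is invented for illustration; the real `distilbert-base-uncased` vocabulary has roughly 30,000 entries, and the real tokenizer also handles punctuation and special tokens.

```python
# Hypothetical toy vocabulary: token -> integer ID.
VOCAB = {"[UNK]": 0, "cozy": 1, "apart": 2, "##ment": 3,
         "near": 4, "down": 5, "##town": 6}

def wordpiece(word, vocab):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab):
    """Lowercase the text ('uncased'), split on spaces, and map tokens to IDs."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("Cozy apartment near downtown", VOCAB)
print(tokens)
print(ids)
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover text it has never seen as whole words.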

In [None]:
# 1. Initialize Tokenizer
# We use 'distilbert-base-uncased'.
# 'Uncased' means it treats "Hello" and "hello" as the same word.
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# 2. Initialize Model
# This loads the pre-trained neural network weights.
bert_model = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

print("DistilBERT Tokenizer and Model loaded successfully.")