# NLP Quality Assurance & Data Dictionary
**Task:** T1.Extra (Final Sanity Check)
**Input:** `data/processed/nlp_master_features.csv`

### 1. The Journey of Features (Data Dictionary)
Here is the breakdown of every column created during the NLP pipeline (Week 1):

| Feature Name | Source Block | Method / Logic | Why did we create this? (Business Value) |
| :--- | :--- | :--- | :--- |
| **id** | Original Data | Unique Identifier | To merge these features back with the main dataset. |
| **description_sentiment** | mfa_T1.9_T1.10_sentiment_features.ipynb | NLTK VADER | Captures the emotional tone of the listing. High score = Welcoming/Positive. |
| **host_about_sentiment** | mfa_T1.9_T1.10_sentiment_features.ipynb | NLTK VADER | Captures the "friendliness" of the host description. |
| **name_length** | mfa_T1.9_T1.10_sentiment_features.ipynb | `len()` count | Short titles might be uninformative; very long ones might be messy. |
| **name_upper_ratio** | mfa_T1.9_T1.10_sentiment_features.ipynb | `uppercase / total` | Detects "shouting" or clickbait titles (e.g., "AMAZING VIEW!!!"). |
| **desc_length** | mfa_T1.9_T1.10_sentiment_features.ipynb | `len()` count | Indicates how much effort the host put into the description. |
| **desc_word_count** | mfa_T1.9_T1.10_sentiment_features.ipynb | Cleaned text word count | Measures the actual information content (ignoring HTML/formatting). |
| **tfidf_[word]** (100 cols) | mfa_T1.11_T1.12_nlp_tfidf_NlpTablesMerge.ipynb | TF-IDF Vectorizer | Represents the importance of specific keywords (e.g., 'beach', 'wifi') in the text. |
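
The length and ratio features in the table are simple enough to sketch. The following is a minimal illustration, not the notebook's actual code. It assumes `uppercase / total` means uppercase characters divided by all characters in the title, and that an empty or missing title maps to a ratio of 0.0 to avoid division by zero. The function name `title_shape_features` is hypothetical.

```python
def title_shape_features(name):
    """Return illustrative length and upper-case ratio features for a listing title."""
    # Assumption: a missing title (None) is treated as an empty string.
    text = name or ""
    total = len(text)
    upper = sum(1 for ch in text if ch.isupper())
    # Assumption: the denominator counts every character, not only letters.
    ratio = upper / total if total else 0.0
    return {"name_length": total, "name_upper_ratio": ratio}


print(title_shape_features("AMAZING VIEW!!!"))
```

Under these assumptions the ratio can never leave the interval [0, 1], so an out-of-range value in `name_upper_ratio` would point to a bug upstream rather than to unusual data.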

### 2. Quality Assurance Goals
We will perform a programmatic "Sanity Check" to ensure:
1.  **Completeness:** No missing values (NaNs).
2.  **Ranges:**
    * Sentiments must be between **-1.0 and +1.0**.
    * Ratios (`name_upper_ratio`) must be between **0.0 and 1.0**.
    * Lengths/Counts must be **>= 0**.
3.  **Integrity:** Row count matches the original dataset (20,942).
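
The integrity goal can be checked in a few lines. This is a sketch under stated assumptions: `check_row_integrity` is a hypothetical helper, the expected count of 20,942 is taken from the goal above, and the uniqueness check on `id` is an extra assumption based on the data dictionary describing `id` as a unique identifier.

```python
import pandas as pd


def check_row_integrity(df, expected_rows=20942, id_col="id"):
    """Return a list of integrity problems; an empty list means the check passed."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"row count is {len(df)}, expected {expected_rows}")
    if id_col not in df.columns:
        problems.append(f"column '{id_col}' is missing")
    elif df[id_col].duplicated().any():
        # Assumption: duplicated ids would break the merge back to the main dataset.
        n_dupes = int(df[id_col].duplicated().sum())
        problems.append(f"{n_dupes} duplicate values in '{id_col}'")
    return problems


# Tiny illustrative frame with made-up values, not real data.
demo = pd.DataFrame({"id": [1, 2, 2], "name_length": [10, 12, 12]})
for problem in check_row_integrity(demo, expected_rows=3):
    print(f"[FAIL] Integrity: {problem}")
```

Returning a list rather than printing inside the helper keeps it easy to fold into an error counter: each returned problem can be reported and counted as one failure.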

In [None]:
# ==========================================
# QA TEST SUITE for NLP Features
# ==========================================
import pandas as pd
import numpy as np

# Load the Master NLP Dataset
file_path = "../../data/processed/nlp_master_features.csv"
print(f"Loading data from: {file_path}")
df = pd.read_csv(file_path)

print(f"Shape: {df.shape}")
print("-" * 30)

# Initialize Error Counter
errors = 0

# --- TEST 1: NULL VALUE CHECK ---
total_nans = df.isnull().sum().sum()
if total_nans == 0:
    print("[PASS] TEST 1 (Missing Values): 0 NaNs found")
else:
    print(f"[FAIL] TEST 1 (Missing Values): {total_nans} NaNs found!")
    display(df.columns[df.isnull().any()])
    errors += 1

# --- TEST 2: SENTIMENT RANGE CHECK (-1 to 1) ---
sent_cols = ['description_sentiment', 'host_about_sentiment']
# Count cells that are < -1 OR > 1 (NaN compares as False; TEST 1 covers NaNs)
out_of_range = ((df[sent_cols] < -1) | (df[sent_cols] > 1)).sum().sum()

if out_of_range == 0:
    print("[PASS] TEST 2 (Sentiment Ranges): All values between -1 and 1")
else:
    print(f"[FAIL] TEST 2 (Sentiment Ranges): {out_of_range} values out of bounds!")
    # Show summary stats (min/max) to help locate the culprits
    print(df[sent_cols].describe())
    errors += 1

# --- TEST 3: RATIO CHECK (0 to 1) ---
# name_upper_ratio must be a fraction in [0, 1], not a percentage
ratio_cols = ['name_upper_ratio']
invalid_ratios = ((df[ratio_cols] < 0) | (df[ratio_cols] > 1)).sum().sum()

if invalid_ratios == 0:
    print("[PASS] TEST 3 (Ratio Ranges): All values between 0 and 1")
else:
    print(f"[FAIL] TEST 3 (Ratio Ranges): {invalid_ratios} values out of bounds!")
    errors += 1

# --- TEST 4: NON-NEGATIVE COUNTS ---
count_cols = ['name_length', 'desc_length', 'desc_word_count']
negative_counts = (df[count_cols] < 0).sum().sum()

if negative_counts == 0:
    print("[PASS] TEST 4 (Physical Counts): All values >= 0")
else:
    print(f"[FAIL] TEST 4 (Physical Counts): {negative_counts} negative values found!")
    errors += 1

# --- TEST 5: TF-IDF INTEGRITY ---
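# Expect exactly 100 TF-IDF columns, none containing negative values
# Match on the prefix so unrelated columns that merely contain 'tfidf_' are excluded
tfidf_cols = [c for c in df.columns if c.startswith('tfidf_')]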
if len(tfidf_cols) == 100:
    print("[PASS] TEST 5.1 (Column Count): Found exactly 100 TF-IDF features")
else:
    print(f"[FAIL] TEST 5.1 (Column Count): Found {len(tfidf_cols)} features, expected 100")
    errors += 1

# Check for negative TF-IDF values
neg_tfidf = (df[tfidf_cols] < 0).sum().sum()
if neg_tfidf == 0:
    print("[PASS] TEST 5.2 (TF-IDF Values): No negative values")
else:
    print(f"[FAIL] TEST 5.2 (TF-IDF Values): {neg_tfidf} negative values found!")
    errors += 1

print("-" * 30)
if errors == 0:
    print("FINAL VERDICT: DATASET IS CLEAN AND READY!")
    display(df.describe().round(4)) # Show summary stats as final proof
else:
    print(f"FINAL VERDICT: FOUND {errors} ISSUES. DO NOT PROCEED.")