# 02 – NLP Enrichment (Sentiment & Complaint Flag)

This notebook implements the **NLP stage** of the Proactive Device Quality Insight Pipeline.

We start from the SQL-cleaned table `reviews_clean` (created in `00_sql_ingestion_and_cleaning.ipynb`)
and add NLP-derived features:

- Text sentiment score using **TextBlob**.
- Keyword-based complaint detection (device-quality issue phrases).
- A combined `complaint_flag` field that uses:
  - Star rating
  - Sentiment polarity
  - Complaint keywords

We then write an enriched table `reviews_enriched` back into the same SQLite database and
optionally export it to CSV for downstream aggregation and anomaly detection.


### 1. Imports & Paths

In [1]:
import pandas as pd
import sqlite3
from pathlib import Path

# Path to the SQLite database (must match previous notebook)
db_path = Path("/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/device_quality.db")
db_path.exists()


True

### 2. Load Cleaned Reviews from SQL

We connect to the existing SQLite database and load the `reviews_clean` table
created in the **SQL ingestion and cleaning** step.

This table should already have:
- `review_id`
- `device_brand`
- `device_model`
- `rating`
- `review_text`
- `review_date`
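
Before loading, it can help to confirm that the table really exposes these columns. The following is a minimal sketch using SQLite's `PRAGMA table_info`; the helper name `missing_columns` and the in-memory demo table are illustrative assumptions, not part of the original notebook.

```python
import sqlite3

EXPECTED_COLUMNS = [
    "review_id", "device_brand", "device_model",
    "rating", "review_text", "review_date",
]

def missing_columns(conn, table, expected):
    """Return the expected column names that are absent from `table`."""
    # PRAGMA table_info yields one row per column; index 1 holds the column name.
    # The table name is interpolated directly, so pass only trusted identifiers.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}
    return [col for col in expected if col not in present]

# Demo on an in-memory database (hypothetical stand-in for device_quality.db)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE reviews_clean ("
    "review_id INTEGER, device_brand TEXT, device_model TEXT, "
    "rating INTEGER, review_text TEXT)"  # review_date deliberately left out
)
missing = missing_columns(demo_conn, "reviews_clean", EXPECTED_COLUMNS)
print(missing)
demo_conn.close()
```

Against the real database, the same call on `conn` would be expected to return an empty list if the cleaning notebook ran as described.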


In [2]:
# Connect to the existing SQLite database
conn = sqlite3.connect(db_path)

# Load the cleaned reviews table into pandas
df_clean = pd.read_sql("SELECT * FROM reviews_clean;", conn)

df_clean.head()


Unnamed: 0,review_id,device_brand,device_model,rating,review_text,review_date
0,1,Realme,Realme 12 Pro,2,Not worth the money spent. Wouldn’t recommend.,2023-11-06
1,2,Realme,Realme 12 Pro,4,Absolutely love this phone! The camera is next...,2023-03-30
2,3,Google,Pixel 6,4,Loving the clean UI and fast updates. Loving i...,2022-12-07
3,4,Xiaomi,Redmi Note 13,3,Build quality feels solid and durable. No regr...,2025-03-11
4,5,Motorola,Edge 50,3,Not bad for daily use but could be optimized. ...,2023-09-29


In [3]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   review_id     50000 non-null  int64 
 1   device_brand  50000 non-null  object
 2   device_model  50000 non-null  object
 3   rating        50000 non-null  int64 
 4   review_text   50000 non-null  object
 5   review_date   50000 non-null  object
dtypes: int64(2), object(4)
memory usage: 2.3+ MB


### 3. NLP Sentiment Scoring with TextBlob

We will use **TextBlob** to compute a simple sentiment polarity score
for each review:

- `sentiment_score` ∈ [-1.0, 1.0]
  - Negative values → negative sentiment
  - Positive values → positive sentiment

If you get an `ImportError` for TextBlob, run this in a terminal or a notebook cell:

```bash
pip install textblob
python -m textblob.download_corpora
```


In [5]:
!pip install textblob

Collecting textblob
  Downloading textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Collecting nltk>=3.9 (from textblob)
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Downloading textblob-0.19.0-py3-none-any.whl (624 kB)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk, textblob
  Attempting uninstall: nltk
    Found existing installation: nltk 3.8.1
    Uninstalling nltk-3.8.1:
      Successfully uninstalled nltk-3.8.1
Successfully installed nltk-3.9.2 textblob-0.19.0


In [7]:
import textblob
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data]     /Users/amritaneogi/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/amritaneogi/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/amritaneogi/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/amritaneogi/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/amritaneogi/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/amritaneogi/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [9]:
# Import TextBlob & Define Helper

from textblob import TextBlob

def get_sentiment(text: str) -> float:
    """
    Returns sentiment polarity score in [-1.0, 1.0] using TextBlob.
    """
    if text is None:
        return 0.0
    text = str(text)
    if not text.strip():
        return 0.0
    return TextBlob(text).sentiment.polarity

### 4. Compute Sentiment Score

In [10]:
%%time
df_clean["sentiment_score"] = df_clean["review_text"].apply(get_sentiment)

df_clean[["device_brand", "device_model", "rating", "sentiment_score", "review_text"]].head()


CPU times: user 3.35 s, sys: 28.6 ms, total: 3.38 s
Wall time: 3.38 s


Unnamed: 0,device_brand,device_model,rating,sentiment_score,review_text
0,Realme,Realme 12 Pro,2,-0.125,Not worth the money spent. Wouldn’t recommend.
1,Realme,Realme 12 Pro,4,0.333333,Absolutely love this phone! The camera is next...
2,Google,Pixel 6,4,0.378333,Loving the clean UI and fast updates. Loving i...
3,Xiaomi,Redmi Note 13,3,0.025,Build quality feels solid and durable. No regr...
4,Motorola,Edge 50,3,0.05,Not bad for daily use but could be optimized. ...


In [11]:
# inspect basic sentiment distribution
df_clean["sentiment_score"].describe()

count    50000.000000
mean         0.184499
std          0.331104
min         -0.975000
25%          0.087500
50%          0.254167
75%          0.385417
max          0.750000
Name: sentiment_score, dtype: float64

### 5. Keyword-Based Complaint Indicator

Next, we define a set of common device-quality complaint keywords
and phrases, such as:

- "crash", "bug", "freeze", "overheat"
- "battery drain", "doesn't work", "stopped working", etc.

We then create a boolean flag `has_complaint_kw` indicating whether
the lowercased review text contains any of these patterns as a substring.


In [12]:
complaint_keywords = [
    "crash", "crashes", "crashing",
    "bug", "bugs", "glitch", "glitches",
    "lag", "laggy", "slow", "freezes", "freezing", "freeze",
    "overheat", "overheats", "overheating", "heats up",
    "battery drain", "battery dies", "poor battery", "bad battery",
    "random restart", "restarts", "reboot", "rebooting",
    "screen flicker", "screen issue", "display issue",
    "no signal", "network issue", "wifi issue", "wifi problem",
    "doesn't work", "doesnt work", "not working", "stopped working",
    "faulty", "defective", "problem with", "issue with"
]

def text_has_complaint_keywords(text: str) -> bool:
    if text is None:
        return False
    t = str(text).lower()
    return any(kw in t for kw in complaint_keywords)

df_clean["has_complaint_kw"] = df_clean["review_text"].apply(text_has_complaint_keywords)

df_clean[["rating", "sentiment_score", "has_complaint_kw", "review_text"]].head(10)


Unnamed: 0,rating,sentiment_score,has_complaint_kw,review_text
0,2,-0.125,False,Not worth the money spent. Wouldn’t recommend.
1,4,0.333333,False,Absolutely love this phone! The camera is next...
2,4,0.378333,False,Loving the clean UI and fast updates. Loving i...
3,3,0.025,False,Build quality feels solid and durable. No regr...
4,3,0.05,False,Not bad for daily use but could be optimized. ...
5,5,0.094444,False,Battery easily lasts a day with heavy use. No ...
6,3,0.385417,False,Loving the clean UI and fast updates. Absolute...
7,2,0.0,False,"Phone hangs often, regret buying it. Wouldn’t ..."
8,4,0.239583,False,Battery easily lasts a day with heavy use. Lov...
9,3,0.225,False,Smooth performance even after months of use. N...
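
Plain substring checks can over- and under-match: `"lag"` also occurs inside "flagship", and a keyword typed with a straight apostrophe (`"doesn't work"`) will not match review text that uses a curly one, as in "Wouldn’t" in the output above. One possible refinement, sketched under the assumption that whole-word matching is what is wanted, normalizes apostrophes and compiles the keywords into a single word-boundary regex. The names `COMPLAINT_RE` and `has_complaint_keyword` and the shortened keyword list are illustrative, not part of the original notebook.

```python
import re

# Shortened, illustrative subset of the keyword list
keywords = ["crash", "lag", "slow", "battery drain", "doesn't work", "stopped working"]

# Longest keywords first so multi-word phrases are tried before shorter ones;
# \b on both sides restricts matches to whole words or phrases.
COMPLAINT_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    + r")\b"
)

def has_complaint_keyword(text) -> bool:
    if text is None:
        return False
    # Lowercase and map the curly apostrophe (U+2019) to a straight one
    normalized = str(text).lower().replace("\u2019", "'")
    return COMPLAINT_RE.search(normalized) is not None

print(has_complaint_keyword("Great flagship phone"))           # "lag" inside "flagship" is ignored
print(has_complaint_keyword("Camera doesn\u2019t work at all"))  # curly apostrophe still matches
print(has_complaint_keyword("Charging is very slow"))
```

Whole-word matching means inflected forms must be listed explicitly, which the original list already does for most terms ("crashes", "crashing", and so on). Neither approach would catch row 7 above ("Phone hangs often"), because no "hang" variant is in the list.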


### 6. Combined Complaint Flag

We now define a single `complaint_flag` that integrates:

- **Star rating**: low ratings (e.g., ≤ 2 stars) are highly likely to be complaints.
- **Sentiment score**: `sentiment_score < -0.2` suggests negative language.
- **Keywords**: `has_complaint_kw == True` indicates explicit device issues.

A review is labeled as a complaint if **any** of these conditions holds.
Because the rules are OR-ed together, a single signal is enough: the flag combines
**NLP + business rules** and favors catching complaints over avoiding false positives.


In [13]:
# Ensure rating is numeric
df_clean["rating"] = pd.to_numeric(df_clean["rating"], errors="coerce")

df_clean["complaint_flag"] = (
    (df_clean["rating"] <= 2) |
    (df_clean["sentiment_score"] < -0.2) |
    (df_clean["has_complaint_kw"])
)

print("Total rows:", len(df_clean))
print("Complaint rows:", df_clean["complaint_flag"].sum())
print("Complaint rate: {:.1%}".format(df_clean["complaint_flag"].mean()))


Total rows: 50000
Complaint rows: 16989
Complaint rate: 34.0%


In [14]:
print("=== Sample complaints ===")
display(
    df_clean[df_clean["complaint_flag"]]
    .sample(5, random_state=42)[["rating", "sentiment_score", "has_complaint_kw", "review_text"]]
)

print("=== Sample non-complaints ===")
display(
    df_clean[~df_clean["complaint_flag"]]
    .sample(5, random_state=43)[["rating", "sentiment_score", "has_complaint_kw", "review_text"]]
)


=== Sample complaints ===


Unnamed: 0,rating,sentiment_score,has_complaint_kw,review_text
26615,1,-0.496667,True,Charging is very slow compared to other brands...
13890,1,-0.8375,False,Speaker quality is bad and muffled. Very disap...
43250,2,-0.2575,True,Charging is very slow compared to other brands...
24984,1,-0.8375,False,Speaker quality is bad and muffled. Very disap...
9008,2,0.142308,False,Sound quality is okay but not very loud. Avera...


=== Sample non-complaints ===


Unnamed: 0,rating,sentiment_score,has_complaint_kw,review_text
10197,5,0.4375,False,Design feels premium and stylish. Absolutely w...
4168,4,0.6,False,Fast charging is a lifesaver. Best purchase of...
37838,4,0.291667,False,Worth every penny. Highly recommended! Absolut...
27648,3,0.116667,False,"Design is okay, a bit bulky though. Average ex..."
1242,4,0.239583,False,Battery easily lasts a day with heavy use. Lov...
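
Because the three rules are OR-ed, `complaint_flag` alone does not show which signal fired. The sketch below lists the triggering rules for a single review, using the thresholds from the cell above; the function name `complaint_reasons` and the reason labels are hypothetical.

```python
def complaint_reasons(rating, sentiment_score, has_complaint_kw,
                      max_rating=2, sentiment_threshold=-0.2):
    """Return the names of the rules that would flag this review."""
    reasons = []
    if rating is not None and rating <= max_rating:
        reasons.append("low_rating")
    if sentiment_score < sentiment_threshold:
        reasons.append("negative_sentiment")
    if has_complaint_kw:
        reasons.append("keyword")
    return reasons

# Values taken from the sampled rows above
row_9008 = complaint_reasons(2, 0.142308, False)
row_26615 = complaint_reasons(1, -0.496667, True)
row_4168 = complaint_reasons(4, 0.6, False)
print(row_9008)
print(row_26615)
print(row_4168)
```

A review counts as a complaint whenever the returned list is non-empty. Row 9008 illustrates the recall-oriented design: its sentiment is mildly positive and no keyword matched, yet the 2-star rating alone flags it.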


### 7. Persist `reviews_enriched` to SQLite

We now write the NLP-enriched dataframe back into the same SQLite database
as a new table: `reviews_enriched`.

This table will be the **single source of truth** for
downstream steps:

- Weekly aggregations
- Anomaly detection
- Tableau dashboard


In [15]:
# Write enriched reviews back to SQLite
df_clean.to_sql("reviews_enriched", conn, if_exists="replace", index=False)

# Confirm row count
enriched_count = conn.execute("SELECT COUNT(*) FROM reviews_enriched;").fetchone()[0]
enriched_count


50000
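
As a sketch of how a downstream step might consume this table, the query below computes a weekly complaint rate per device model. It assumes `complaint_flag` is stored as 0/1 (which is how pandas typically writes boolean columns to SQLite) and that `review_date` is an ISO `YYYY-MM-DD` string, as in the outputs above. The in-memory rows are made up for the demo.

```python
import sqlite3

# In-memory stand-in for device_quality.db with only the columns the query needs
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE reviews_enriched ("
    "device_model TEXT, review_date TEXT, complaint_flag INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO reviews_enriched VALUES (?, ?, ?)",
    [
        ("Pixel 6", "2023-11-06", 1),
        ("Pixel 6", "2023-11-07", 0),
        ("Pixel 6", "2023-11-14", 1),
        ("Edge 50", "2023-11-06", 0),
    ],
)

# %W is the week of the year with Monday as the first day of the week;
# AVG over a 0/1 column gives the share of flagged reviews.
weekly_sql = """
SELECT device_model,
       strftime('%Y-%W', review_date) AS year_week,
       COUNT(*)                       AS n_reviews,
       AVG(complaint_flag)            AS complaint_rate
FROM reviews_enriched
GROUP BY device_model, year_week
ORDER BY device_model, year_week;
"""
weekly = demo_conn.execute(weekly_sql).fetchall()
for row in weekly:
    print(row)
demo_conn.close()
```

The first two Pixel 6 reviews fall in the same Monday-based week and share one row, while the third lands in the following week.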

### 8. Export Enriched Reviews to CSV

For convenience, we can also export `reviews_enriched` to a CSV file so that
other notebooks (e.g., weekly aggregation, anomaly detection) can read from it
without needing to connect to SQLite.

This step is optional but often useful for prototyping.


In [16]:
enriched_csv_path = Path("/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/mobile_reviews_enriched_nlp_sql.csv")

df_clean.to_csv(enriched_csv_path, index=False)
enriched_csv_path


PosixPath('/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/mobile_reviews_enriched_nlp_sql.csv')

### 9. Close Connection and Summary

We close the SQLite connection. The database now contains:

- `reviews_raw` — raw ingestion from CSV.
- `reviews_clean` — SQL-cleaned and standardized reviews.
- `reviews_enriched` — NLP-enhanced reviews with:
  - `sentiment_score`
  - `has_complaint_kw`
  - `complaint_flag`

This completes the **NLP stage** of the SQL + Python pipeline. Together with the downstream
aggregation and anomaly-detection steps, it supports the résumé statement:

> "Built a Python and SQL pipeline with NLP and anomaly detection to flag emerging issues
>  by device model and OS version."


In [17]:
conn.close()