# 03 – Data Loading & Basic Cleaning

This notebook loads the raw mobile reviews dataset, performs initial profiling and light cleaning, and saves a clean CSV that we will later use for:
- NLP-based complaint flagging
- Weekly aggregations
- Anomaly detection and Tableau dashboards


### 1. Load the CSV and inspect structure

We import pandas, point to the raw CSV, and inspect its shape, first rows, dtypes, and summary statistics.

#### 1.1 Imports and file path

In [2]:
import pandas as pd
from pathlib import Path

# Set path to your CSV (adjust if needed)
data_path = Path("/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/Mobile Reviews Sentiment.csv")

# If you want to be extra safe, check that the file exists
data_path, data_path.exists()


(PosixPath('/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/Mobile Reviews Sentiment.csv'),
 True)

#### 1.2 Read the CSV

In [3]:
# Read the raw CSV
df_raw = pd.read_csv(data_path)

# Show basic info
df_raw.shape


(50000, 25)

In [6]:
df_raw.head()

Unnamed: 0,review_id,customer_name,age,brand,model,price_usd,price_local,currency,exchange_rate_to_usd,rating,...,verified_purchase,battery_life_rating,camera_rating,performance_rating,design_rating,display_rating,review_length,word_count,helpful_votes,source
0,1,Aryan Maharaj,45,Realme,Realme 12 Pro,337.31,₹27996.73,INR,83.0,2,...,True,1,1,3,2,1,46,7,1,Amazon
1,2,Davi Miguel Sousa,18,Realme,Realme 12 Pro,307.78,R$1754.35,BRL,5.7,4,...,True,3,2,4,3,2,74,12,5,Flipkart
2,3,Pahal Balay,27,Google,Pixel 6,864.53,₹71755.99,INR,83.0,4,...,True,3,5,3,2,4,55,11,8,AliExpress
3,4,David Guzman,19,Xiaomi,Redmi Note 13,660.94,د.إ2425.65,AED,3.67,3,...,False,1,3,2,1,2,66,11,3,Amazon
4,5,Yago Leão,38,Motorola,Edge 50,792.13,R$4515.14,BRL,5.7,3,...,True,3,3,2,2,1,73,12,0,BestBuy


In [7]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   review_id             50000 non-null  int64  
 1   customer_name         50000 non-null  object 
 2   age                   50000 non-null  int64  
 3   brand                 50000 non-null  object 
 4   model                 50000 non-null  object 
 5   price_usd             50000 non-null  float64
 6   price_local           50000 non-null  object 
 7   currency              50000 non-null  object 
 8   exchange_rate_to_usd  50000 non-null  float64
 9   rating                50000 non-null  int64  
 10  review_text           50000 non-null  object 
 11  sentiment             50000 non-null  object 
 12  country               50000 non-null  object 
 13  language              50000 non-null  object 
 14  review_date           50000 non-null  object 
 15  verified_purchase  

In [8]:
df_raw.describe(include="all").T.head(20)

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
review_id,50000.0,,,,25000.5,14433.901067,1.0,12500.75,25000.5,37500.25,50000.0
customer_name,50000.0,45911.0,Michael Smith,16.0,,,,,,,
age,50000.0,,,,30.07522,8.931307,18.0,23.0,29.0,36.0,65.0
brand,50000.0,7.0,Xiaomi,7241.0,,,,,,,
model,50000.0,22.0,Realme Narzo 70,3597.0,,,,,,,
price_usd,50000.0,,,,689.693713,310.307331,180.02,450.7925,637.04,900.975,1499.89
price_local,50000.0,48418.0,R$2272.3,4.0,,,,,,,
currency,50000.0,8.0,USD,6435.0,,,,,,,
exchange_rate_to_usd,50000.0,,,,12.057946,26.553332,0.78,1.0,1.53,5.7,83.0
rating,50000.0,,,,3.12312,1.248612,1.0,2.0,3.0,4.0,5.0
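
`describe()` reports counts and distributions, but a direct look at missing values and duplicate keys is a useful complement at this stage. The sketch below uses a small made-up `sample` frame in place of `df_raw`, and `profile_frame` is a hypothetical helper that is not part of the notebook.

```python
import pandas as pd

def profile_frame(frame: pd.DataFrame, key: str) -> dict:
    """Summarize missing values and duplicate keys (hypothetical helper)."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }

# Toy stand-in for df_raw; these values are invented for illustration
sample = pd.DataFrame({
    "review_id": [1, 2, 2, 4],
    "rating": [5, None, 3, 1],
    "review_text": ["Great phone", "Battery drain", None, "Laggy"],
})

summary = profile_frame(sample, key="review_id")
print(summary)
```

On the real data this would be called as `profile_frame(df_raw, key="review_id")`, assuming `review_id` is meant to be unique per review.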


### 2. Standardize Key Columns & Basic Cleaning

We rename key columns, parse dates, drop empty texts, and create a compact analysis-ready dataset.


In [9]:
# Copy the raw dataframe so we don’t modify df_raw directly
df = df_raw.copy()

# Rename the columns whose names change;
# review_text, rating, and review_date already have the names we want
rename_map = {
    "brand": "device_brand",
    "model": "device_model",
}

df = df.rename(columns=rename_map)

# Keep only the core columns we will use in signal detection
cols_to_keep = ["review_date", "device_brand", "device_model", "rating", "review_text"]
df = df[cols_to_keep].copy()

df.head()


Unnamed: 0,review_date,device_brand,device_model,rating,review_text
0,2023-11-06,Realme,Realme 12 Pro,2,Not worth the money spent. Wouldn’t recommend.
1,2023-03-30,Realme,Realme 12 Pro,4,Absolutely love this phone! The camera is next...
2,2022-12-07,Google,Pixel 6,4,Loving the clean UI and fast updates. Loving i...
3,2025-03-11,Xiaomi,Redmi Note 13,3,Build quality feels solid and durable. No regr...
4,2023-09-29,Motorola,Edge 50,3,Not bad for daily use but could be optimized. ...


In [11]:
# Parse review_date into datetime
df["review_date"] = pd.to_datetime(df["review_date"], errors="coerce")

# Drop rows with missing date or missing/blank text
df = df.dropna(subset=["review_date", "review_text"])
df = df[df["review_text"].str.strip() != ""]

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   review_date   50000 non-null  datetime64[ns]
 1   device_brand  50000 non-null  object        
 2   device_model  50000 non-null  object        
 3   rating        50000 non-null  int64         
 4   review_text   50000 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 1.9+ MB
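
Because `errors="coerce"` silently turns unparseable dates into `NaT`, it can help to count what the cleaning step would remove before dropping anything. A minimal sketch on a made-up `toy` frame (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for df; these rows are invented for illustration
toy = pd.DataFrame({
    "review_date": ["2023-11-06", "not a date", "2023-03-30", "2022-12-07"],
    "review_text": ["Good phone", "Laggy", "   ", None],
})

# Unparseable strings become NaT instead of raising
toy["review_date"] = pd.to_datetime(toy["review_date"], errors="coerce")

# Masks for the two drop conditions used in the cleaning cell
bad_dates = toy["review_date"].isna()
blank_text = toy["review_text"].isna() | (toy["review_text"].str.strip() == "")

print("Unparseable dates:", int(bad_dates.sum()))
print("Missing or blank texts:", int(blank_text.sum()))

# Keep only rows that pass both checks
cleaned = toy[~bad_dates & ~blank_text]
print("Rows kept:", len(cleaned))
```

In the run above, `df.info()` still reports 50000 entries after cleaning, so no rows were dropped from this dataset.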


### 3. Complaint Detection (Rule-Based)

Here we create a simple `is_complaint` flag, set when either of the following holds:
- Star rating of 2 or lower
- Presence of common issue keywords in the review text (stored separately as `has_complaint_kw`)

This gives us a first-pass "quality complaint" indicator for downstream aggregation and anomaly detection.


#### 3.1 Define keywords + helper function

In [12]:
import numpy as np

# Common device-quality complaint keywords/phrases
complaint_keywords = [
    "crash", "crashes", "crashing",
    "bug", "bugs", "glitch", "glitches",
    "lag", "laggy", "slow", "freezes", "freezing", "freeze",
    "overheat", "overheats", "overheating", "heats up",
    "battery drain", "battery dies", "poor battery", "bad battery",
    "random restart", "restarts", "reboot", "rebooting",
    "screen flicker", "screen issue", "display issue",
    "no signal", "network issue", "wifi issue", "wifi problem",
    "doesn't work", "doesnt work", "not working", "stopped working",
    "faulty", "defective", "problem with", "issue with"
]

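def text_has_complaint_keywords(text: str) -> bool:
    """
    Returns True if the review text contains any of the complaint keywords.
    Matching is a case-insensitive substring test, so a short keyword such as
    "lag" also matches inside a longer word such as "flagship".
    """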
    if pd.isna(text):
        return False
    t = str(text).lower()
    return any(kw in t for kw in complaint_keywords)

# Apply keyword flag
df["has_complaint_kw"] = df["review_text"].apply(text_has_complaint_keywords)

df[["review_text", "has_complaint_kw"]].head(5)


Unnamed: 0,review_text,has_complaint_kw
0,Not worth the money spent. Wouldn’t recommend.,False
1,Absolutely love this phone! The camera is next...,False
2,Loving the clean UI and fast updates. Loving i...,False
3,Build quality feels solid and durable. No regr...,False
4,Not bad for daily use but could be optimized. ...,False
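
The keyword check is a plain substring test, which can produce false positives when a keyword appears inside an unrelated word. One possible alternative, sketched here with a shortened keyword list and a hypothetical helper name `has_keyword_whole_word`, is to compile the keywords into a single regular expression with word boundaries. This is an illustration of the idea, not the method used to produce the numbers in this notebook.

```python
import re

# Shortened keyword list for illustration only
keywords = ["lag", "bug", "crash", "battery drain", "heats up"]

# \b anchors each keyword at word boundaries; longer phrases are tried first
pattern = re.compile(
    r"\b(?:"
    + "|".join(re.escape(kw) for kw in sorted(keywords, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)

def has_keyword_whole_word(text) -> bool:
    """Return True if any keyword occurs as a whole word or phrase (hypothetical helper)."""
    if not isinstance(text, str):
        return False
    return pattern.search(text) is not None

print(has_keyword_whole_word("Best flagship I have owned"))   # "lag" inside "flagship" is not a match
print(has_keyword_whole_word("Noticeable lag when gaming"))   # standalone "lag" is a match
```

The trade-off is that whole-word matching no longer catches inflections such as "lagging" unless they are listed explicitly, which the original keyword list already does for most terms.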


#### 3.2 Combine rating + keywords into `is_complaint`

In [13]:
# Make sure rating is numeric
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

# Define complaint: low rating OR keyword hit
df["is_complaint"] = (
    (df["rating"] <= 2) | (df["has_complaint_kw"])
)

# Quick sanity checks
print("Total rows:", len(df))
print("Complaint rows:", df["is_complaint"].sum())
print("Complaint rate: {:.1%}".format(df["is_complaint"].mean()))

df[["rating", "has_complaint_kw", "is_complaint"]].head(10)


Total rows: 50000
Complaint rows: 16581
Complaint rate: 33.2%


Unnamed: 0,rating,has_complaint_kw,is_complaint
0,2,False,True
1,4,False,False
2,4,False,False
3,3,False,False
4,3,False,False
5,5,False,False
6,3,False,False
7,2,False,True
8,4,False,False
9,3,False,False
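
In the ten rows previewed above, every flagged review is flagged by its rating alone, so it is worth checking how much each signal contributes. A cross-tabulation of the two conditions is one way to do that; the sketch below runs on a made-up `toy` frame, and the `low_rating` column is an assumption introduced for illustration.

```python
import pandas as pd

# Toy stand-in for df; these values are invented for illustration
toy = pd.DataFrame({
    "rating": [2, 4, 1, 5, 3],
    "has_complaint_kw": [False, False, True, True, False],
})

# Same rule as above: low rating OR keyword hit
toy["low_rating"] = toy["rating"] <= 2
toy["is_complaint"] = toy["low_rating"] | toy["has_complaint_kw"]

# Rows: low_rating (False/True), columns: has_complaint_kw (False/True)
breakdown = pd.crosstab(toy["low_rating"], toy["has_complaint_kw"])
print(breakdown)
```

A nearly empty keyword column in this table would suggest that the keyword list rarely matches the review wording and that the flag is driven almost entirely by star ratings.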


#### 3.3 Peek at a few complaint vs non-complaint texts

In [14]:
print("=== Sample complaint reviews ===")
display(df[df["is_complaint"]].sample(5, random_state=42)[
    ["rating", "device_brand", "device_model", "review_text"]
])

print("=== Sample non-complaint reviews ===")
display(df[~df["is_complaint"]].sample(5, random_state=43)[
    ["rating", "device_brand", "device_model", "review_text"]
])


=== Sample complaint reviews ===


Unnamed: 0,rating,device_brand,device_model,review_text
40074,2,Apple,iPhone 15 Pro,Not bad for daily use but could be optimized. ...
45746,2,Apple,iPhone 14,"Overall decent, but expected a bit more for th..."
48840,2,Xiaomi,Redmi Note 13,"Design is okay, a bit bulky though. Average ex..."
9584,2,Xiaomi,Mi 13 Pro,Overheats quickly while gaming. Returning this...
47432,2,OnePlus,OnePlus 12,Sound quality is okay but not very loud. Fine ...


=== Sample non-complaint reviews ===


Unnamed: 0,rating,device_brand,device_model,review_text
18958,3,Xiaomi,Poco X6,Battery easily lasts a day with heavy use. No ...
31058,5,Apple,iPhone 14,Build quality feels solid and durable. Absolut...
25153,3,Realme,Realme Narzo 70,Loving the clean UI and fast updates. Loving i...
11460,3,Xiaomi,Redmi Note 13,Fast charging is a lifesaver. No regrets buyin...
41106,5,OnePlus,OnePlus Nord 3,Design feels premium and stylish. Best purchas...


#### 3.4 Save this as our “clean + flagged” dataset

In [15]:
from pathlib import Path

clean_path = Path("/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/mobile_reviews_clean_flagged.csv")
df.to_csv(clean_path, index=False)

clean_path


PosixPath('/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/mobile_reviews_clean_flagged.csv')

### 4. Weekly Aggregations for Quality Signal Detection

We aggregate reviews by week and by (device_brand, device_model) to build
a time-series dataset that captures complaint volume and complaint rate.
This will be used for anomaly detection and Tableau dashboards.


#### 4.1 Extract “week start” date

In [16]:
# Create a "week_start" column (Monday of that week)
df["week_start"] = df["review_date"] - pd.to_timedelta(df["review_date"].dt.weekday, unit="D")

df[["review_date", "week_start"]].head()


Unnamed: 0,review_date,week_start
0,2023-11-06,2023-11-06
1,2023-03-30,2023-03-27
2,2022-12-07,2022-12-05
3,2025-03-11,2025-03-10
4,2023-09-29,2023-09-25
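
As a cross-check on the subtraction approach, the same Monday-based week start can be derived from pandas periods. This is a sketch of an equivalent formulation, not a replacement for the cell above; the `dates` series reuses three of the dates shown in the output.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-11-06", "2023-03-30", "2022-12-07"]))

# Approach used in the notebook: subtract the weekday offset (Monday = 0)
week_start_a = dates - pd.to_timedelta(dates.dt.weekday, unit="D")

# Period-based alternative: "W-SUN" weeks end on Sunday, so they start on Monday
week_start_b = dates.dt.to_period("W-SUN").dt.start_time

# Both columns should agree element by element
print((week_start_a == week_start_b).all())
print(week_start_b)
```

One difference to keep in mind: if `review_date` carried a time of day, the subtraction approach would keep that time, while `start_time` always returns midnight.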


#### 4.2 Weekly aggregation

In [17]:
weekly = (
    df.groupby(["device_brand", "device_model", "week_start"])
      .agg(
          total_reviews = ("review_text", "count"),
          total_complaints = ("is_complaint", "sum")
      )
      .reset_index()
)

# Complaint rate
weekly["complaint_rate"] = weekly["total_complaints"] / weekly["total_reviews"]

weekly.head(10)


Unnamed: 0,device_brand,device_model,week_start,total_reviews,total_complaints,complaint_rate
0,Apple,iPhone 13,2022-10-17,1,0,0.0
1,Apple,iPhone 13,2022-10-24,17,2,0.117647
2,Apple,iPhone 13,2022-10-31,11,3,0.272727
3,Apple,iPhone 13,2022-11-07,16,5,0.3125
4,Apple,iPhone 13,2022-11-14,9,5,0.555556
5,Apple,iPhone 13,2022-11-21,7,3,0.428571
6,Apple,iPhone 13,2022-11-28,12,3,0.25
7,Apple,iPhone 13,2022-12-05,7,2,0.285714
8,Apple,iPhone 13,2022-12-12,12,8,0.666667
9,Apple,iPhone 13,2022-12-19,13,2,0.153846


#### 4.3 Sanity check: how many weeks of data does each model have?

In [18]:
weekly.groupby(["device_brand", "device_model"]).size().sort_values(ascending=False).head(10)

device_brand  device_model   
Apple         iPhone 13          158
              iPhone 14          158
Xiaomi        Poco X6            158
              Mi 13 Pro          158
Samsung       Galaxy Z Flip      158
              Galaxy S24         158
              Galaxy A55         158
Realme        Realme Narzo 70    158
              Realme 12 Pro      158
OnePlus       OnePlus Nord 3     158
dtype: int64

#### 4.4 Optional: filter out model-weeks that have too few reviews

In [19]:
weekly = weekly[weekly["total_reviews"] >= 5]
weekly.shape


(3436, 6)

#### 4.5 Rolling metrics (for anomaly detection later)

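We compute a rolling mean and standard deviation of `complaint_rate` per model over the 6 most recent rows, requiring at least 3, so that the anomaly detection step has a baseline to compare against. Because section 4.4 dropped weeks with fewer than 5 reviews, the window counts retained rows rather than consecutive calendar weeks.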

In [20]:
weekly = weekly.sort_values(["device_brand", "device_model", "week_start"])

# Rolling 6-week window (changeable)
weekly["roll_mean"] = (
    weekly.groupby(["device_brand", "device_model"])["complaint_rate"]
          .transform(lambda x: x.rolling(window=6, min_periods=3).mean())
)

weekly["roll_std"] = (
    weekly.groupby(["device_brand", "device_model"])["complaint_rate"]
          .transform(lambda x: x.rolling(window=6, min_periods=3).std())
)

weekly.head(10)


Unnamed: 0,device_brand,device_model,week_start,total_reviews,total_complaints,complaint_rate,roll_mean,roll_std
1,Apple,iPhone 13,2022-10-24,17,2,0.117647,,
2,Apple,iPhone 13,2022-10-31,11,3,0.272727,,
3,Apple,iPhone 13,2022-11-07,16,5,0.3125,0.234291,0.102956
4,Apple,iPhone 13,2022-11-14,9,5,0.555556,0.314607,0.181299
5,Apple,iPhone 13,2022-11-21,7,3,0.428571,0.3374,0.165074
6,Apple,iPhone 13,2022-11-28,12,3,0.25,0.322834,0.151897
7,Apple,iPhone 13,2022-12-05,7,2,0.285714,0.350845,0.118264
8,Apple,iPhone 13,2022-12-12,12,8,0.666667,0.416501,0.165957
9,Apple,iPhone 13,2022-12-19,13,2,0.153846,0.390059,0.195798
10,Apple,iPhone 13,2022-12-26,11,6,0.545455,0.388376,0.194127
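
One common way to turn a rolling mean and standard deviation into an anomaly signal is a z-score against a trailing baseline. The sketch below is an assumption about how the next step could look, not the method this project necessarily uses: the `toy` data, the `base_mean` / `base_std` columns, and the `Z_THRESHOLD` value are all hypothetical. It shifts the series by one week so that the current week is not part of its own baseline.

```python
import numpy as np
import pandas as pd

# Toy single-model series; values are invented for illustration
toy = pd.DataFrame({
    "device_model": ["Model A"] * 8,
    "week_start": pd.date_range("2023-01-02", periods=8, freq="W-MON"),
    "complaint_rate": [0.20, 0.25, 0.22, 0.24, 0.21, 0.23, 0.60, 0.22],
})

grouped = toy.groupby("device_model")["complaint_rate"]

# Baseline from previous weeks only: shift(1) excludes the current week
toy["base_mean"] = grouped.transform(
    lambda x: x.shift(1).rolling(window=6, min_periods=3).mean()
)
toy["base_std"] = grouped.transform(
    lambda x: x.shift(1).rolling(window=6, min_periods=3).std()
)

# z-score of the current week; a zero std is replaced by NaN to avoid dividing by zero
toy["z_score"] = (toy["complaint_rate"] - toy["base_mean"]) / toy["base_std"].replace(0, np.nan)

Z_THRESHOLD = 3.0  # hypothetical cut-off
toy["is_anomaly"] = toy["z_score"] > Z_THRESHOLD

print(toy[["week_start", "complaint_rate", "z_score", "is_anomaly"]])
```

Weeks without enough history get a `NaN` z-score and are therefore never flagged, which mirrors the `min_periods=3` behavior of the rolling columns above.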


#### 4.6 Save the weekly dataset

In [21]:
weekly_path = Path("/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/weekly_complaint_timeseries.csv")
weekly.to_csv(weekly_path, index=False)

weekly_path


PosixPath('/Volumes/Personal Drive/GitHub/Proactive-Device-Quality-Signal-Detection/Dataset/weekly_complaint_timeseries.csv')
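
When the saved CSV is read back in a later notebook, date columns come back as plain strings unless they are parsed explicitly. A minimal round-trip sketch, using an in-memory buffer and a two-row `toy` frame instead of the real file (both are assumptions for illustration):

```python
import io

import pandas as pd

# Toy stand-in for the weekly dataset
toy = pd.DataFrame({
    "week_start": pd.to_datetime(["2022-10-24", "2022-10-31"]),
    "complaint_rate": [0.117647, 0.272727],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on reload
reloaded = pd.read_csv(buffer, parse_dates=["week_start"])
print(reloaded.dtypes)
```

With the real file, the equivalent call would be `pd.read_csv(weekly_path, parse_dates=["week_start"])`.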