# Customer Dissatisfaction Analysis

This notebook analyzes customer dissatisfaction using the prepared
master fact table.

Business Questions:
1) Why do customers give low ratings?
2) What factors most strongly drive dissatisfaction?

Data Source:
- `fact_orders_master.csv`

## Step 1: Load Master Table

In [53]:
import pandas as pd
import numpy as np

fact_orders_master = pd.read_csv("../data/processed/fact_orders_master.csv")
fact_orders_master.shape

(110197, 22)

## Step 2: Evaluate Low Rating Frequency

Assess how many orders received low ratings (1 or 2)
to understand the overall scale of customer dissatisfaction.

In [54]:
fact_orders_master["is_low_rating"].value_counts(dropna=False)


is_low_rating
False    94053
True     16144
Name: count, dtype: int64

In [55]:
(
    fact_orders_master["is_low_rating"]
    .value_counts(normalize=True)
    * 100
)

is_low_rating
False    85.349873
True     14.650127
Name: proportion, dtype: float64

## Insight: Scale of Customer Dissatisfaction

Approximately **14.7%** of delivered orders (16,144 of 110,197) received low ratings (1 or 2).

A rate of this size is too large to dismiss as isolated incidents.
It points to candidate systematic causes, such as delivery delays or
category-specific fulfillment problems, which the following steps investigate.
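
As a rough check that the low-rating share is not a sampling artifact, the counts from the Step 2 output can be turned into a confidence interval. The sketch below uses a Wilson score interval, which is an addition here and not part of the original analysis. The helper name `wilson_interval` is hypothetical, and the calculation treats rows as independent observations, which may not hold if several rows belong to the same order.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = successes / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    )
    return centre - half_width, centre + half_width


# Counts copied from the value_counts() output in Step 2.
low = 16144
total = 16144 + 94053

lower, upper = wilson_interval(low, total)
print(f"low-rating share: {low / total:.4f}, 95% CI: ({lower:.4f}, {upper:.4f})")
```

With more than 110,000 rows the interval is narrow, so the uncertainty in the overall rate is small compared with the differences between groups examined later.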

## Step 3: Impact of Delivery Delay on Low Ratings
Compare the share of low ratings between on-time and delayed orders,
expressed as percentages for easier interpretation.

In [56]:
(
    fact_orders_master
    .groupby("is_delayed")["is_low_rating"]
    .mean()
    * 100
)

is_delayed
False    11.318152
True     53.454212
Name: is_low_rating, dtype: float64

## Step 4: Rating Distribution for Delayed Orders
Examine how review scores are distributed
for orders that were delivered late.

In [57]:
(
    fact_orders_master[fact_orders_master["is_delayed"] == True]
    ["review_score"]
    .value_counts(normalize=True)
    * 100
)

review_score
1.0    46.942129
5.0    22.173964
4.0    12.043667
3.0    11.104590
2.0     7.735650
Name: proportion, dtype: float64

## Insight: Delivery Delay as a Primary Driver of Customer Dissatisfaction

Delivery delay is strongly associated with customer dissatisfaction.

- Only **11.3%** of on-time orders received low ratings (1 or 2).
- In contrast, **53.5%** of delayed orders received low ratings.

This means delayed orders are **about 4.7 times as likely** (53.5% vs 11.3%)
to result in a low rating.

Among delayed orders that have a review score (Step 4):
- **~47%** received a 1-star rating.
- **~55%** received a low rating (1 or 2), i.e. 46.9% + 7.7%.

The ~55% figure is slightly higher than the 53.5% from Step 3, most likely
because `value_counts` excludes rows with a missing `review_score`, whereas the
`groupby` mean averages `is_low_rating` over all delayed rows.

These findings indicate that delivery delays are not a minor inconvenience,
but a critical failure point that significantly damages the customer experience.
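
The headline comparisons in this section are simple arithmetic on the Step 3 and Step 4 outputs. A minimal sketch, with variable names chosen here purely for illustration:

```python
# Percentages copied from the Step 3 output (low-rating rate by delay status).
low_rate_on_time = 11.318152
low_rate_delayed = 53.454212

# Percentages copied from the Step 4 output (score distribution, delayed orders).
delayed_score_share = {
    1.0: 46.942129,
    5.0: 22.173964,
    4.0: 12.043667,
    3.0: 11.104590,
    2.0: 7.735650,
}

# How many times more likely a delayed order is to get a low rating.
relative_risk = low_rate_delayed / low_rate_on_time

# Difference in percentage points between the two groups.
absolute_gap = low_rate_delayed - low_rate_on_time

# Share of 1-star and 2-star ratings among delayed orders that have a score.
low_share_scored = delayed_score_share[1.0] + delayed_score_share[2.0]

print(f"relative risk: {relative_risk:.2f}x")
print(f"absolute gap: {absolute_gap:.1f} percentage points")
print(f"low ratings among scored delayed orders: {low_share_scored:.1f}%")
```

The ratio works out to roughly 4.7, and the 1-star and 2-star shares sum to roughly 54.7%.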

## Step 5: Review Text Analysis for Low Ratings

Analyze written review comments for low-rated orders (1–2 stars)
to identify common dissatisfaction themes beyond delivery delays.

In [58]:
import re
from collections import Counter

reviews_full = pd.read_csv("../data/raw/olist_order_reviews_dataset.csv")

low_rating_reviews = reviews_full[
    (reviews_full["review_score"].isin([1, 2])) &
    (reviews_full["review_comment_message"].notna())
]

low_rating_reviews[["review_score", "review_comment_message"]].head()

|    | review_score | review_comment_message |
|----|--------------|------------------------|
| 16 | 2 | GOSTARIA DE SABER O QUE HOUVE, SEMPRE RECEBI E... |
| 19 | 1 | Péssimo |
| 29 | 1 | Não gostei ! Comprei gato por lebre |
| 32 | 1 | Sempre compro pela Internet e a entrega ocorre... |
| 39 | 1 | Nada de chegar o meu pedido. |


## Step 5a: Normalize Review Text

Clean and normalize review comments to allow
basic text pattern extraction.

In [59]:
# Copy the filtered slice first to avoid pandas' SettingWithCopyWarning.
low_rating_reviews = low_rating_reviews.copy()

# NOTE: this pattern also deletes accented letters ("não" -> "no",
# "até" -> "at"), which explains the "no" and "at" tokens counted below.
low_rating_reviews["clean_text"] = (
    low_rating_reviews["review_comment_message"]
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
)
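
The pattern `[^a-z\s]` in the cell above removes every accented character, so `não` ("not") collapses into `no` and `até` ("until") into `at`. One possible alternative, sketched below, is to strip the diacritics before filtering so the base letters survive. The function name `normalize_review` is hypothetical and this is not the notebook's method; the sample strings are taken from the review preview in Step 5.

```python
import re
import unicodedata


def normalize_review(text: str) -> str:
    """Lowercase, strip diacritics, and keep only a-z separated by single spaces."""
    lowered = text.lower()
    # NFKD splits "ã" into "a" plus a combining tilde; dropping the combining
    # marks keeps the base letter instead of deleting the whole character.
    decomposed = unicodedata.normalize("NFKD", lowered)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Replace anything that is not a-z or whitespace with a space so that
    # punctuation between words does not glue tokens together.
    letters_only = re.sub(r"[^a-z\s]", " ", without_marks)
    return re.sub(r"\s+", " ", letters_only).strip()


samples = [
    "Péssimo",
    "Não gostei ! Comprei gato por lebre",
    "Nada de chegar o meu pedido.",
]
cleaned = [normalize_review(s) for s in samples]
print(cleaned)
```

In the notebook this could be applied with `low_rating_reviews["review_comment_message"].map(normalize_review)`. Any stop-word list would then need the unaccented forms (`nao`, `ate`) rather than `no` and `at`.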


## Step 5b: Most Common Complaint Terms

Identify the most frequent words appearing in low-rating reviews.

In [60]:
all_words = " ".join(low_rating_reviews["clean_text"]).split()
common_words = Counter(all_words).most_common(25)

common_words

[('no', 9197),
 ('o', 8650),
 ('e', 6431),
 ('produto', 6128),
 ('a', 5112),
 ('de', 4971),
 ('que', 3967),
 ('recebi', 3256),
 ('do', 2493),
 ('um', 2371),
 ('com', 2286),
 ('foi', 1959),
 ('comprei', 1863),
 ('para', 1638),
 ('veio', 1535),
 ('uma', 1532),
 ('da', 1523),
 ('entrega', 1393),
 ('ainda', 1385),
 ('na', 1332),
 ('entregue', 1316),
 ('meu', 1315),
 ('em', 1282),
 ('at', 1186),
 ('chegou', 1167)]

## Step 5c: Remove Generic Stop Words

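Remove generic stop words and drop tokens of three characters or fewer.
The stop-word list in this step is English while the reviews are Portuguese,
so nearly all of the filtering comes from the length cutoff; Portuguese
stop words are handled in Step 5d.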

In [61]:
stop_words = {
    "the", "and", "to", "of", "a", "i", "it", "is", "for", "in",
    "this", "that", "was", "with", "on", "my", "not", "very"
}

filtered_words = [
    word for word in all_words
    if word not in stop_words and len(word) > 3
]

Counter(filtered_words).most_common(20)

[('produto', 6128),
 ('recebi', 3256),
 ('comprei', 1863),
 ('para', 1638),
 ('veio', 1535),
 ('entrega', 1393),
 ('ainda', 1385),
 ('entregue', 1316),
 ('chegou', 1167),
 ('estou', 931),
 ('compra', 896),
 ('prazo', 896),
 ('muito', 864),
 ('mais', 790),
 ('pedido', 779),
 ('loja', 764),
 ('agora', 684),
 ('como', 672),
 ('minha', 612),
 ('apenas', 538)]

## Step 5d: Remove Portuguese Stop Words

Remove common Portuguese stop words to focus on
meaningful complaint terms.

In [62]:
portuguese_stopwords = {
    "a", "o", "e", "de", "do", "da", "que", "com", "um", "uma",
    "para", "em", "na", "no", "meu", "minha", "foi", "at", "ainda"
}

filtered_words_pt = [
    word for word in filtered_words
    if word not in portuguese_stopwords
]

Counter(filtered_words_pt).most_common(20)

[('produto', 6128),
 ('recebi', 3256),
 ('comprei', 1863),
 ('veio', 1535),
 ('entrega', 1393),
 ('entregue', 1316),
 ('chegou', 1167),
 ('estou', 931),
 ('compra', 896),
 ('prazo', 896),
 ('muito', 864),
 ('mais', 790),
 ('pedido', 779),
 ('loja', 764),
 ('agora', 684),
 ('como', 672),
 ('apenas', 538),
 ('nada', 508),
 ('produtos', 508),
 ('pois', 504)]
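
Steps 5c and 5d can be combined into a single pass over the cleaned text. The sketch below is one possible arrangement, not the notebook's code: the function name `top_terms`, the toy reviews, and the small stop-word set are illustrative, and a fuller Portuguese stop-word list would be needed in practice.

```python
from collections import Counter

# Illustrative subset of Portuguese stop words (assumption: lowercase,
# unaccented tokens, as produced by the cleaning step).
PT_STOP_WORDS = frozenset({
    "a", "o", "e", "de", "do", "da", "que", "com", "um", "uma",
    "para", "em", "na", "no", "meu", "minha", "foi", "ainda",
})


def top_terms(texts, stop_words=PT_STOP_WORDS, min_length=4, n=20):
    """Count tokens across texts, skipping stop words and short tokens."""
    counts = Counter(
        token
        for text in texts
        for token in text.split()
        if len(token) >= min_length and token not in stop_words
    )
    return counts.most_common(n)


# Toy input standing in for low_rating_reviews["clean_text"].
toy_reviews = [
    "produto nao chegou ainda",
    "recebi o produto errado",
    "a entrega foi fora do prazo",
]
result = top_terms(toy_reviews, n=3)
print(result)
```

`min_length=4` matches the `len(word) > 3` condition used in Step 5c.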

## Insight: Root Causes of Customer Dissatisfaction from Review Text

Word frequencies in low-rating reviews point to recurring complaint themes.

Customers most frequently mention:
- Delivery issues, including delays and missed delivery deadlines (`entrega` "delivery", `prazo` "deadline").
- Product-related problems, where the received item did not meet expectations
  or differed from what was purchased (`produto` "product", `recebi` "I received", `veio` "it came").
- Fulfillment failures, such as orders that were incomplete or not delivered at all (`nada` "nothing", `pedido` "order").

Single-word counts do not show context (for example, whether `recebi` follows
a negation), so these themes are indicative rather than conclusive. They are
nonetheless consistent with dissatisfaction being driven by a combination
of logistics performance issues and product fulfillment quality,
not by random or isolated incidents.
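
Single words such as `recebi` are ambiguous on their own, since "I received" and "I did not receive" produce the same token. Counting adjacent word pairs is one way to recover some of that context. The following is a minimal sketch on toy data, with a hypothetical helper name `bigram_counts`; the toy strings use `no` where the cleaned text has reduced `não` to `no`.

```python
from collections import Counter


def bigram_counts(texts):
    """Count adjacent token pairs within each text (never across texts)."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip(tokens, tokens[1:]) yields (w1, w2), (w2, w3), ...
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy input standing in for low_rating_reviews["clean_text"].
toy_reviews = [
    "no recebi o produto",
    "ainda no recebi",
    "produto veio errado",
]
pairs = bigram_counts(toy_reviews)
print(pairs.most_common(3))
```

On the real data, a high count for a pair like `("no", "recebi")` would support the non-delivery theme more directly than the count for `recebi` alone.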

## Step 6: Complaint Themes by Delivery Status

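Compare review text for delayed vs on-time low-rated orders
to understand whether complaints differ by delivery performance.

Two caveats apply to the counts below. First, the merge appears to duplicate
review rows: `comprei` occurs 2,387 times in the on-time group alone but only
1,863 times across all low-rating reviews in Step 5b, most likely because
`fact_orders_master` holds more than one row per `order_id`. Second, stop words
are not removed in this step, so both lists are dominated by function words.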

In [63]:
low_rating_reviews = low_rating_reviews.merge(
    fact_orders_master[["order_id", "is_delayed"]],
    on="order_id",
    how="left"
)

delayed_text = " ".join(
    low_rating_reviews[low_rating_reviews["is_delayed"] == True]["clean_text"]
).split()

ontime_text = " ".join(
    low_rating_reviews[low_rating_reviews["is_delayed"] == False]["clean_text"]
).split()

Counter(delayed_text).most_common(15), Counter(ontime_text).most_common(15)

([('no', 3136),
  ('o', 2577),
  ('produto', 1832),
  ('e', 1781),
  ('de', 1306),
  ('a', 1295),
  ('recebi', 1158),
  ('que', 984),
  ('ainda', 766),
  ('entrega', 715),
  ('at', 628),
  ('do', 627),
  ('foi', 555),
  ('meu', 526),
  ('com', 523)],
 [('o', 6451),
  ('no', 6315),
  ('e', 6002),
  ('de', 4588),
  ('a', 4587),
  ('produto', 4146),
  ('que', 3418),
  ('recebi', 3112),
  ('um', 2578),
  ('comprei', 2387),
  ('do', 2125),
  ('com', 2065),
  ('veio', 1962),
  ('foi', 1787),
  ('uma', 1628)])
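
Some on-time counts above exceed the totals from Step 5b (`comprei`: 2,387 vs 1,863), which suggests the left merge repeats a review once for every matching row in `fact_orders_master`. A sketch of one way to guard against this follows, using toy tables with hypothetical values. It assumes `is_delayed` is constant within an order, so keeping the first row per `order_id` loses nothing.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values).
fact = pd.DataFrame({
    "order_id": ["A", "A", "B", "C"],  # order A has two rows
    "is_delayed": [True, True, False, False],
})
reviews = pd.DataFrame({
    "order_id": ["A", "B"],
    "clean_text": ["no recebi", "veio errado"],
})

# Naive left merge: the review for order A is repeated once per fact row.
naive = reviews.merge(fact[["order_id", "is_delayed"]], on="order_id", how="left")

# Collapse to one row per order first. validate="many_to_one" makes pandas
# raise a MergeError if the right-hand side still has duplicate keys.
order_delay = fact[["order_id", "is_delayed"]].drop_duplicates(subset="order_id")
safe = reviews.merge(order_delay, on="order_id", how="left", validate="many_to_one")

print(len(reviews), len(naive), len(safe))
```

Assigning the merged result to a new variable, rather than overwriting `low_rating_reviews`, also keeps the cell safe to re-run.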

## Step 7: Save Analysis Results (Customer Dissatisfaction)

Persist key derived outputs for reuse in dashboards and reports.

In [64]:
import os

os.makedirs("../data/processed/analysis_outputs", exist_ok=True)

# Delay vs low rating
delay_vs_rating = (
    fact_orders_master
    .groupby("is_delayed")["is_low_rating"]
    .mean()
    .reset_index()
)

delay_vs_rating["low_rating_percentage"] = delay_vs_rating["is_low_rating"] * 100

delay_vs_rating.to_csv(
    "../data/processed/analysis_outputs/delay_vs_low_rating.csv",
    index=False
)

# Rating distribution for delayed orders
delayed_rating_distribution = (
    fact_orders_master[fact_orders_master["is_delayed"] == True]
    ["review_score"]
    .value_counts(normalize=True)
    .reset_index()
)

delayed_rating_distribution.columns = ["review_score", "proportion"]
delayed_rating_distribution["percentage"] = delayed_rating_distribution["proportion"] * 100

delayed_rating_distribution.to_csv(
    "../data/processed/analysis_outputs/delayed_orders_rating_distribution.csv",
    index=False
)
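
The complaint-term counts from Step 5 are not among the saved outputs. If they are needed downstream, they could be persisted in the same way. The sketch below is an assumption about how that might look: the helper `save_term_counts` and the file name `low_rating_top_terms.csv` are hypothetical, and it writes to a temporary directory so it can run without the project's folder layout.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def save_term_counts(counts: Counter, path: Path, n: int = 20) -> None:
    """Write the n most common terms as a two-column CSV (term, count)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["term", "count"])
        writer.writerows(counts.most_common(n))


# Toy counts; in the notebook this would be Counter(filtered_words_pt).
toy_counts = Counter({"produto": 3, "entrega": 2, "prazo": 1})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "analysis_outputs" / "low_rating_top_terms.csv"
    save_term_counts(toy_counts, out_path)
    # Read the file back to confirm the round trip.
    with out_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.reader(handle))

print(rows)
```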