
## Reviews Mart – Implementation

Based on insights derived from the EDA, the following transformations and design decisions were implemented in the **Reviews Mart**:

* Treated `review_id` as the **nominal grain of the reviews table**, while accounting for duplicate entries caused by multiple reviews per product or order.

* Designed aggregations to handle **multiple reviews over time**, ensuring review signals are not double-counted when joined with order- or customer-level datasets.

* Derived **review-based behavioral features** (such as review frequency and sentiment-related proxies) to capture customer experience.
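
The grain issue described above can be checked directly before any aggregation. The following is a minimal sketch on a made-up toy frame (the `toy_reviews` data and the variable names are illustrative assumptions, not values from the Olist file); the same two checks would apply to the real `reviews` table once loaded.

```python
import pandas as pd

# Toy stand-in for the reviews table (values are invented for illustration).
toy_reviews = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r2", "r3"],
        "order_id": ["o1", "o1", "o2", "o3"],
        "review_score": [5, 3, 3, 4],
    }
)

# Rows whose review_id already appeared on an earlier row.
dup_review_ids = int(toy_reviews["review_id"].duplicated().sum())

# Orders that carry more than one review row.
reviews_per_order = toy_reviews.groupby("order_id")["review_id"].count()
multi_review_orders = int((reviews_per_order > 1).sum())

print("duplicated review_id rows:", dup_review_ids)
print("orders with more than one review:", multi_review_orders)
```

A non-zero result for either check means a direct join of the raw table onto orders would fan out rows, which is why the mart aggregates to `order_id` first.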



In [None]:
import pandas as pd
reviews = pd.read_csv("../Source Data/olist_order_reviews_dataset.csv")

In [4]:
reviews.head()

| | review_id | order_id | review_score | review_comment_title | review_comment_message | review_creation_date | review_answer_timestamp |
|---|---|---|---|---|---|---|---|
| 0 | 7bc2406110b926393aa56f80a40eba40 | 73fc7af87114b39712e6da79b0a377eb | 4 | | | 2018-01-18 00:00:00 | 2018-01-18 21:46:59 |
| 1 | 80e641a11e56f04c1ad469d5645fdfde | a548910a1c6147796b98fdf73dbeba33 | 5 | | | 2018-03-10 00:00:00 | 2018-03-11 03:05:13 |
| 2 | 228ce5500dc1d8e020d8d1322874b6f0 | f9e4b658b201a9f2ecdecbb34bed034b | 5 | | | 2018-02-17 00:00:00 | 2018-02-18 14:36:24 |
| 3 | e64fb393e7b32834bb789ff8bb30750e | 658677c97b385a9be170737859d3511b | 5 | | Recebi bem antes do prazo estipulado. | 2017-04-21 00:00:00 | 2017-04-21 22:02:06 |
| 4 | f7c4243c7fe1938f181bec41a392bdeb | 8e6bfb81e283fa7e4f11123a3fb894f1 | 5 | | Parabéns lojas lannister adorei comprar pela I... | 2018-03-01 00:00:00 | 2018-03-02 10:26:53 |


In [6]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99224 entries, 0 to 99223
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   review_id                99224 non-null  object
 1   order_id                 99224 non-null  object
 2   review_score             99224 non-null  int64 
 3   review_comment_title     99224 non-null  object
 4   review_comment_message   99224 non-null  object
 5   review_creation_date     99224 non-null  object
 6   review_answer_timestamp  99224 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.3+ MB



### Review-Level Aggregation to Order-Level

* Aggregated review data to the **order_id grain** to ensure a single record per order in the master table.
* Computed **review volume metrics** (review count) to capture customer engagement, i.e. the total number of reviews per `order_id`.
* Derived **review quality signals** using the average, minimum, and maximum review score per `order_id`.
* Extracted **text-based indicators** such as average review length and presence of review text to proxy review richness.
* Ensured all review-related features are **join-safe at the order level** for downstream modeling.


In [14]:
order_review_summary = (
    reviews.assign(
        # Character length of the comment; NaN when the comment is missing
        review_length=reviews["review_comment_message"].str.len(),
        # 1 if the review has comment text, else 0
        has_review_text=reviews["review_comment_message"].notna().astype(int)
    )
    # Collapse to one row per order
    .groupby("order_id", as_index=False)
    .agg(
        review_count=("review_id", "count"),
        avg_review_score=("review_score", "mean"),
        avg_review_length=("review_length", "mean"),
        # 1 if any review on the order has text
        has_review_text=("has_review_text", "max"),
        min_review_score=("review_score", "min"),
        max_review_score=("review_score", "max")
    )
)


In [16]:
order_review_summary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98673 entries, 0 to 98672
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   order_id           98673 non-null  object 
 1   review_count       98673 non-null  int64  
 2   avg_review_score   98673 non-null  float64
 3   avg_review_length  40836 non-null  float64
 4   has_review_text    98673 non-null  int64  
 5   min_review_score   98673 non-null  int64  
 6   max_review_score   98673 non-null  int64  
dtypes: float64(2), int64(4), object(1)
memory usage: 5.3+ MB
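
In the summary above, `avg_review_length` has far fewer non-null values than the other columns. A likely explanation is that `.str.len()` returns NaN for a missing comment and the group mean skips NaN, so an order whose reviews all lack text ends up with NaN. The sketch below reproduces that behaviour on a made-up toy frame (`toy_reviews` and its values are illustrative assumptions, not rows from the dataset).

```python
import pandas as pd

# Toy reviews: order o1 has one review with text and one without,
# o2 has no text at all, o3 has a single review with text.
toy_reviews = pd.DataFrame(
    {
        "review_id": ["r1", "r2", "r3", "r4"],
        "order_id": ["o1", "o1", "o2", "o3"],
        "review_score": [5, 3, 4, 1],
        "review_comment_message": ["Great", None, None, "Late delivery"],
    }
)

toy_summary = (
    toy_reviews.assign(
        # NaN where the comment is missing
        review_length=toy_reviews["review_comment_message"].str.len(),
        has_review_text=toy_reviews["review_comment_message"].notna().astype(int),
    )
    .groupby("order_id", as_index=False)
    .agg(
        review_count=("review_id", "count"),
        # mean ignores NaN; an all-NaN group yields NaN
        avg_review_length=("review_length", "mean"),
        has_review_text=("has_review_text", "max"),
    )
)

print(toy_summary)
```

Rows with `has_review_text == 0` are exactly the ones where `avg_review_length` is NaN. Whether to leave those as NaN or fill them with 0 is a modeling choice that this notebook does not make explicit.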


### Validation

In [22]:
order_review_summary["order_id"].duplicated().sum()

0
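
Beyond the duplicate check, join safety can also be enforced at merge time. The sketch below assumes a hypothetical order-level table (`toy_orders`) and a toy summary; neither comes from the source data. It uses the `validate` argument of `DataFrame.merge`, which raises `pandas.errors.MergeError` if the key is not unique on the side(s) the chosen relationship requires.

```python
import pandas as pd

# Hypothetical order-level table and review summary (toy values).
toy_orders = pd.DataFrame(
    {"order_id": ["o1", "o2", "o3"], "order_status": ["delivered"] * 3}
)
toy_summary = pd.DataFrame(
    {"order_id": ["o1", "o2"], "review_count": [2, 1], "avg_review_score": [4.0, 5.0]}
)

# A left join keeps every order; validate guards against row fan-out.
merged = toy_orders.merge(
    toy_summary, on="order_id", how="left", validate="one_to_one"
)
print(len(merged) == len(toy_orders))

# If the summary were not unique on order_id, the same merge would fail loudly.
bad_summary = pd.concat([toy_summary, toy_summary.iloc[[0]]], ignore_index=True)
try:
    toy_orders.merge(bad_summary, on="order_id", how="left", validate="one_to_one")
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True
print(duplicate_caught)
```

Orders without any review (here `o3`) come through the left join with NaN in the review columns, so downstream steps need to handle those missing values.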

In [23]:
order_review_summary.to_csv("../Processed Data/prd_reviews.csv", index=False)