# Olist Customer Segmentation

## **Introduction**

This project builds a customer segmentation from purchasing behavior in the public Olist e-commerce dataset, available on Kaggle: https://www.kaggle.com/code/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs/input. The primary goal is to move beyond a one-size-fits-all approach and understand the distinct consumption patterns of different user groups.

By leveraging unsupervised machine learning, specifically the K-Means clustering algorithm, this analysis identifies and characterizes different customer profiles. The final output is a set of well-defined customer personas that can be used to inform and direct targeted marketing campaigns, ultimately leading to more effective customer engagement and business growth.

## **Business Problem**

Olist is a prominent Brazilian e-commerce marketplace that connects thousands of sellers to millions of customers. In a marketplace this large and diverse, a "one-size-fits-all" strategy for marketing, sales, and customer service is inefficient and costly. Without a deep understanding of its customer base, the company faces significant challenges:

* **Ineffective Marketing Spend**: Generic campaigns often fail to resonate with the specific needs and motivations of different customers, leading to a low return on investment (ROI).

* **Lost Sales Opportunities**: Sales teams cannot tailor their approach to high-value or high-potential customers, missing chances for upselling and cross-selling.

* **Customer Churn**: A lack of personalized experience can lead to low customer loyalty, with users failing to become repeat buyers.

To overcome these obstacles and drive sustainable growth, Olist needs to answer critical strategic questions about its customers. This project directly addresses this need by using data-driven segmentation to answer the following:

* **Marketing**: Which customer groups should we target with specific promotions?

* **Sales**: How can we optimize sales strategies for different customer profiles?

* **Product**: What types of products appeal to which customers?

* **Customer Service**: Which customers require more attention or support?

* **Profitability**: Who are our most profitable customers?

  

## **EDA Summary (Exploratory Data Analysis)**
The exploratory analysis revealed several key characteristics of the Olist dataset, which directly informed the feature engineering and modeling strategy.

**Key Findings:**

**Data Distribution and Preprocessing:**

* Monetary values, such as price and payment value, were heavily skewed with significant outliers. A logarithmic transformation was identified as a necessary step to normalize their distribution for modeling.

* The original 73 product categories were consolidated into 13 macro-categories based on their similarities, simplifying the analysis and making patterns more interpretable.
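
The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the rows are toy values, `np.log1p` is one common choice of logarithmic transformation, and the `MACRO_CATEGORY` mapping is hypothetical (the real 73-to-13 mapping is not reproduced here).

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the Olist order data (values are illustrative only).
items = pd.DataFrame({
    "price": [12.0, 59.9, 129.0, 3500.0],
    "payment_value": [20.5, 75.1, 150.3, 3620.0],
    "product_category": ["perfumery", "bed_bath_table", "furniture_decor", "computers"],
})

# log1p = log(1 + x): compresses the long right tail and stays defined at zero.
for col in ["price", "payment_value"]:
    items[f"log_{col}"] = np.log1p(items[col])

# Hypothetical mapping from fine-grained categories to macro-categories.
MACRO_CATEGORY = {
    "perfumery": "health_beauty",
    "bed_bath_table": "home_construction",
    "furniture_decor": "furniture_decoration",
}

# Categories missing from the mapping fall back to a catch-all bucket.
items["macro_category"] = items["product_category"].map(MACRO_CATEGORY).fillna("other")
```

Using `log1p` rather than a plain logarithm avoids undefined values if any monetary field is zero; whether that matters depends on the cleaned data.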

**Geographical Concentration:**

* A significant portion of revenue and order volume is concentrated in Brazil's Southeast region.

* Interestingly, states with a lower order volume often exhibit a higher average payment value, which is also correlated with higher freight costs.

**Customer Satisfaction and Behavior:**

* While over 70% of reviews are positive (4 or 5 stars), there is notable polarization. Very negative reviews (1 star) outnumber bad and neutral reviews (2 and 3 stars) combined, suggesting that customers have either a good experience or a very poor one.

* Crucially, the highest-grossing and most frequently ordered product category is also the one with the worst average review score, pointing to a major potential problem area in customer experience.
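
The polarization check described above can be sketched with a simple share-per-score calculation. The scores below are toy values chosen for illustration, not the Olist distribution, and the variable names are assumptions.

```python
import pandas as pd

# Toy review scores (1 to 5 stars); illustrative only.
scores = pd.Series([5, 5, 4, 5, 1, 1, 1, 3, 2, 5, 4, 5])

# Share of reviews per star rating, with every rating present even if unseen.
share = scores.value_counts(normalize=True).reindex([1, 2, 3, 4, 5], fill_value=0.0)

positive = share.loc[[4, 5]].sum()   # 4 and 5 stars
very_negative = share.loc[1]         # 1 star
middle = share.loc[[2, 3]].sum()     # 2 and 3 stars

# Polarized in the sense used above: 1-star reviews outnumber 2- and 3-star combined.
polarized = very_negative > middle
```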

**Payment Patterns:**

* Credit cards are the dominant payment method and cover the widest range of transaction values.

* Most purchases are paid in a single installment, but when installments are used, it is common to have more than two. Higher purchase values are strongly correlated with a higher number of installments.

**Key Correlations:**

* A strong positive correlation between price and payment_value is expected, since the amount paid largely reflects item price, and serves as a sanity check on the data.

* A moderate positive correlation between price and freight_value suggests that more expensive items tend to cost more to ship.

* The strongest negative correlation was between payment_installments and the encoded payment_type, consistent with installments being available only for credit card payments.
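
As a rough sketch of how such correlations can be computed, the example below builds a Pearson correlation matrix on toy payment rows. The values are illustrative, and one-hot encoding `payment_type` is an assumption: the encoding used in the original analysis is not stated, and the sign of the correlation depends on it.

```python
import pandas as pd

# Toy payment rows; values are illustrative only.
payments = pd.DataFrame({
    "price": [20.0, 50.0, 120.0, 300.0, 80.0, 35.0],
    "freight_value": [8.0, 12.0, 18.0, 40.0, 15.0, 9.0],
    "payment_value": [28.0, 62.0, 138.0, 340.0, 95.0, 44.0],
    "payment_installments": [1, 2, 5, 10, 1, 1],
    "payment_type": ["credit_card", "credit_card", "credit_card",
                     "credit_card", "boleto", "voucher"],
})

# One-hot encode the categorical payment type so it can enter the matrix.
encoded = pd.get_dummies(payments, columns=["payment_type"], dtype=float)
corr = encoded.corr(method="pearson")

# Direct check: rows with more than one installment should all be credit card.
multi = payments[payments["payment_installments"] > 1]
only_credit_card = bool((multi["payment_type"] == "credit_card").all())
```

The direct check on `multi` is often easier to interpret than a correlation coefficient against an encoded category.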

  

## **Cluster Profiles & Results**

The K-Means algorithm, applied to features representing customers' dominant product category preference and total payment value, successfully identified 23 distinct customer personas.
The analysis revealed that customer behavior on Olist is primarily segmented by product category affinity, and secondarily by spending level within that category (e.g., Low, Mid, and Premium Budget). This provides a powerful, two-tiered framework for understanding the customer base.
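
A minimal sketch of this kind of clustering, using scikit-learn, is shown below. It assumes a per-customer table with a dominant macro-category and a total payment value; the column names, the one-hot encoding, the scaling step, and `k=3` are all choices made for this toy example rather than the project's confirmed pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy per-customer table (illustrative only).
customers = pd.DataFrame({
    "dominant_category": ["health_beauty"] * 4 + ["furniture_decoration"] * 4,
    "total_payment_value": [60.0, 75.0, 80.0, 90.0, 70.0, 85.0, 400.0, 450.0],
})

# One-hot encode the category and log-transform spend before scaling.
features = pd.get_dummies(customers["dominant_category"], dtype=float)
features["log_spend"] = np.log1p(customers["total_payment_value"])
X = StandardScaler().fit_transform(features)

# k=3 suits this toy data; the analysis above reports 23 clusters on the full dataset.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
customers["cluster"] = kmeans.fit_predict(X)
```

On this toy data the clusters separate first by category and then by spend within a category, which mirrors the two-tiered structure described above.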

**High-Level Insights Across All Segments:**

Before detailing individual personas, some patterns emerged across almost all clusters:

* **Geographic Dominance**: The vast majority of customers across all segments reside in the Southeast region of Brazil (typically 65-75% of any given cluster).

* **Payment Preference**: Credit card is the overwhelmingly preferred payment method.

* **The Loyalty Challenge**: A significant portion of the customer base, especially in low-budget segments, consists of "once in a lifetime" customers who buy only once, highlighting a major opportunity to improve customer retention.

## **Representative Customer Personas**

Below are a few examples of the 23 personas identified, showcasing the diversity of customer profiles discovered.

**Persona 1: The Occasional Shopper (Health & Beauty Low Budget)**

This group represents a common type of Olist customer: the single-purchase, low-spend user.

* Segment Size: 8% of total customers.

* Spending: Low average ticket of R$ 77.

* Behavior: They are "once in a lifetime" customers with virtually no loyalty. They use a single payment method (usually credit card) and pay in a few installments.

* Satisfaction: Generally positive, with a good average review score of 4.2.

* Business Implication: This is a high-volume acquisition group. The challenge is to convert them into repeat buyers.

**Persona 2: The Loyal High-Spender (Home & Construction Premium)**

This persona represents the ideal, high-value customer that is crucial for profitability.

* Segment Size: 2% of total customers.

* Spending: High average ticket of R$ 486.

* Behavior: They are loyal, repeat customers who use more installments (avg. 4.7) and sometimes multiple payment methods, indicating a deeper integration with the platform.

* Satisfaction: Their average review score (3.8) is slightly lower than that of the low-budget segment (4.2), suggesting their expectations might be higher or more complex.

* Business Implication: This is a core group to nurture and retain. Understanding their needs is key to protecting a major revenue source.

**Persona 3: The At-Risk High-Value Customer (Furniture & Decoration Premium)**

This is perhaps the most critical persona from a strategic standpoint, as it represents high value paired with high risk.

* Segment Size: 2% of total customers.

* Spending: High average ticket of R$ 379.

* Behavior: They are loyal customers and use many installments (avg. 4.8), indicating significant purchases.

* Satisfaction: Their experience is subpar, with a mediocre average review score of 3.5.

* Business Implication: This is an alarm bell. These high-spending, loyal customers are not satisfied. Immediate action is needed to investigate the cause (e.g., product quality, delivery issues in this category) and prevent churn.
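
The persona metrics listed above (segment size, average ticket, installments, review score, loyalty) can be derived from clustered customers with a grouped aggregation. The sketch below uses toy values and assumed column names; in particular, treating a customer with more than one order as a repeat buyer is an assumption about how loyalty is measured.

```python
import pandas as pd

# Toy clustered customers (illustrative only); column names are assumptions.
customers = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "total_payment_value": [60.0, 80.0, 100.0, 400.0, 500.0],
    "n_orders": [1, 1, 1, 2, 3],
    "avg_installments": [1.0, 2.0, 1.0, 5.0, 4.0],
    "avg_review_score": [4.0, 5.0, 4.0, 4.0, 3.0],
})

# One row per cluster with the metrics used in the persona descriptions.
profile = customers.groupby("cluster").agg(
    n_customers=("total_payment_value", "size"),
    avg_ticket=("total_payment_value", "mean"),
    avg_installments=("avg_installments", "mean"),
    avg_review_score=("avg_review_score", "mean"),
    repeat_rate=("n_orders", lambda s: (s > 1).mean()),
)

# Segment size as a share of all customers.
profile["segment_share"] = profile["n_customers"] / len(customers)
```

A table like `profile` makes it easy to spot clusters that combine a high `avg_ticket` with a low `avg_review_score`, the pattern flagged in Persona 3.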
