# Olist Customer Segmentation

## **Introduction**

This project segments customers by analyzing purchasing behavior in the public Olist e-commerce dataset, available on Kaggle: https://www.kaggle.com/code/anshumoudgil/olist-ecommerce-analytics-quasi-poisson-poly-regs/input. The primary goal is to move beyond a one-size-fits-all approach and understand how consumption patterns differ across user groups.

By leveraging unsupervised machine learning, specifically the K-Means clustering algorithm, this analysis identifies and characterizes different customer profiles. The final output is a set of well-defined customer personas that can be used to inform and direct targeted marketing campaigns, ultimately leading to more effective customer engagement and business growth.
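
As an illustration of the clustering step, here is a minimal sketch using scikit-learn's `KMeans`. The feature columns, the synthetic data, and `n_clusters=2` are all assumptions made for the example; they are not the features or cluster count used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for per-customer features. The three columns are
# hypothetical: log total spend, number of orders, average installments.
rng = np.random.default_rng(42)
low_spenders = rng.normal(loc=[3.5, 1.0, 1.0], scale=0.3, size=(100, 3))
high_spenders = rng.normal(loc=[6.0, 3.0, 5.0], scale=0.3, size=(100, 3))
features = np.vstack([low_spenders, high_spenders])

# K-Means is distance-based, so features are standardized first.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

# n_clusters=2 only matches the synthetic data above (illustrative choice).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(scaled)

# Map cluster centers back to the original feature scale for interpretation.
centers = scaler.inverse_transform(kmeans.cluster_centers_)
print(centers.round(2))
```

On real data the number of clusters would normally be chosen by comparing several values of `n_clusters`, for example with inertia or silhouette scores.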

## **Business Problem**

Olist is a prominent Brazilian e-commerce marketplace that connects thousands of sellers to millions of customers. In a marketplace this large and diverse, a "one-size-fits-all" strategy for marketing, sales, and customer service is inefficient and costly. Without a deep understanding of its customer base, the company faces significant challenges:

* **Ineffective Marketing Spend**: Generic campaigns often fail to resonate with the specific needs and motivations of different customers, leading to a low return on investment (ROI).

* **Lost Sales Opportunities**: Sales teams cannot tailor their approach to high-value or high-potential customers, missing chances for upselling and cross-selling.

* **Customer Churn**: A lack of personalized experience can lead to low customer loyalty, with users failing to become repeat buyers.

To overcome these obstacles and drive sustainable growth, Olist needs to answer critical strategic questions about its customers. This project directly addresses this need by using data-driven segmentation to answer the following:

* **Marketing**: Which customer groups should we target with specific promotions?

* **Sales**: How can we optimize sales strategies for different customer profiles?

* **Product**: What types of products appeal to which customers?

* **Customer Service**: Which customers require more attention or support?

* **Profitability**: Who are our most profitable customers?

  

## **Exploratory Data Analysis (EDA) Summary**

The exploratory analysis revealed several key characteristics of the Olist dataset, which directly informed the feature engineering and modeling strategy.

**Key Findings:**

**Data Distribution and Preprocessing:**

* Monetary values, such as price and payment value, were heavily right-skewed with significant outliers. A logarithmic transformation was identified as a necessary step to reduce this skew before modeling.

* The original 73 product categories were consolidated into 13 macro-categories based on their similarities, simplifying the analysis and making patterns more interpretable.
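
The two preprocessing steps above can be sketched as follows. This is an illustration only: the sample rows are made up, `np.log1p` is one possible choice of logarithmic transform, the `product_category_name` column is assumed, and the `macro_category` mapping is a hypothetical fragment, not the project's actual 13-group mapping.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; column names are assumed for illustration.
orders = pd.DataFrame({
    "price": [19.9, 45.0, 120.0, 2999.0],
    "payment_value": [27.5, 60.1, 141.3, 3100.0],
    "product_category_name": [
        "beleza_saude", "perfumaria", "informatica_acessorios", "pcs",
    ],
})

# log1p(x) = log(1 + x) compresses the long right tail and is defined at x = 0.
for col in ["price", "payment_value"]:
    orders[f"log_{col}"] = np.log1p(orders[col])

# Hypothetical fragment of a category -> macro-category mapping.
macro_category = {
    "beleza_saude": "health_beauty",
    "perfumaria": "health_beauty",
    "informatica_acessorios": "electronics",
    "pcs": "electronics",
}
# Categories missing from the mapping fall back to "other".
orders["macro_category"] = (
    orders["product_category_name"].map(macro_category).fillna("other")
)

print(orders[["log_price", "log_payment_value", "macro_category"]])
```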

**Geographical Concentration:**

* A significant portion of revenue and order volume is concentrated in Brazil's Southeast region.

* States with a lower order volume often show a higher average payment value, which coincides with higher freight costs.
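
A per-state summary of this kind can be computed with a single `groupby`. The sketch below uses made-up rows and assumes a `customer_state` column alongside `payment_value` and `freight_value`; it shows the shape of the calculation, not the project's actual figures.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
orders = pd.DataFrame({
    "customer_state": ["SP", "SP", "SP", "RJ", "RJ", "PB"],
    "payment_value": [100.0, 80.0, 120.0, 150.0, 130.0, 260.0],
    "freight_value": [12.0, 10.0, 15.0, 20.0, 18.0, 45.0],
})

# Order volume, revenue, and average payment/freight per state.
by_state = (
    orders.groupby("customer_state")
    .agg(
        order_count=("payment_value", "size"),
        revenue=("payment_value", "sum"),
        avg_payment=("payment_value", "mean"),
        avg_freight=("freight_value", "mean"),
    )
    .sort_values("order_count", ascending=False)
)

# Share of total revenue contributed by each state.
by_state["revenue_share"] = by_state["revenue"] / by_state["revenue"].sum()

print(by_state)
```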

**Customer Satisfaction and Behavior:**

* While over 70% of reviews are positive (4 or 5 stars), the distribution is notably polarized. Very negative reviews (1 star) outnumber bad and neutral reviews (2 and 3 stars) combined, suggesting that customers tend to have either a good experience or a very poor one.

* Crucially, the highest-grossing and most frequently ordered product category is also the one with the worst average review score, pointing to a major potential problem area in customer experience.
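
The polarization check described above reduces to comparing score shares. Here is a minimal sketch on made-up scores, assuming a `review_score` column with integer values from 1 to 5; the numbers do not reproduce the figures reported in this section.

```python
import pandas as pd

# Made-up review scores for illustration.
reviews = pd.Series([5, 5, 4, 5, 1, 1, 1, 2, 3, 5], name="review_score")

# Share of each score, with every score from 1 to 5 present in the index.
share = reviews.value_counts(normalize=True).reindex(range(1, 6), fill_value=0.0)

positive_share = share.loc[[4, 5]].sum()  # 4 or 5 stars
one_star_share = share.loc[1]             # very negative
mid_share = share.loc[[2, 3]].sum()       # bad and neutral

# Polarized in the sense used above: 1-star reviews outnumber 2-3 stars combined.
is_polarized = one_star_share > mid_share

print(share)
print(f"positive: {positive_share:.0%}, polarized: {is_polarized}")
```

The same shares can be computed per product category with `groupby` to locate the category with the lowest average score.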

**Payment Patterns:**

* Credit cards are the dominant payment method and cover the widest range of transaction values.

* Most purchases are paid in a single installment, but when installments are used, it is common to have more than two. Higher purchase values are strongly correlated with a higher number of installments.
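
The installment pattern can be checked with a few boolean aggregations. The rows below are made up, and the sketch uses the Pearson correlation that pandas computes by default; which correlation measure the original analysis used is not stated here.

```python
import pandas as pd

# Made-up payments for illustration.
payments = pd.DataFrame({
    "payment_installments": [1, 1, 1, 1, 3, 4, 8, 10],
    "payment_value": [30, 45, 60, 25, 180, 240, 700, 1200],
})

# Share of purchases paid in a single installment.
single_share = (payments["payment_installments"] == 1).mean()

# Among purchases that use installments, share with more than two.
multi = payments[payments["payment_installments"] > 1]
share_more_than_two = (multi["payment_installments"] > 2).mean()

# Pearson correlation between number of installments and purchase value.
value_corr = payments["payment_installments"].corr(payments["payment_value"])

print(single_share, share_more_than_two, round(value_corr, 2))
```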

**Key Correlations:**

* A strong positive correlation between `price` and `payment_value` is expected, since payments are driven by item prices, and serves as a basic consistency check on the data.

* A moderate positive correlation between `price` and `freight_value` suggests that more expensive items tend to cost more to ship.

* The strongest negative correlation was between `payment_installments` and `payment_type` (a categorical variable, so the sign depends on how it is encoded numerically). This is consistent with installments being available only for credit card payments.
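
Because `payment_type` is categorical, any correlation with it depends on the numeric encoding. The sketch below uses made-up rows and two assumed encodings (how the original analysis encoded the column is not stated) to show that the sign can flip, and adds a direct check that does not rely on correlation at all.

```python
import pandas as pd

# Made-up payments for illustration.
payments = pd.DataFrame({
    "payment_type": [
        "credit_card", "credit_card", "credit_card", "boleto",
        "voucher", "debit_card", "credit_card", "boleto",
    ],
    "payment_installments": [1, 3, 8, 1, 1, 1, 5, 1],
})

# Encoding A: binary indicator for credit card payments.
is_credit_card = (payments["payment_type"] == "credit_card").astype(int)
corr_indicator = payments["payment_installments"].corr(is_credit_card)

# Encoding B: integer codes assigned in alphabetical order of the category.
type_codes = payments["payment_type"].astype("category").cat.codes
corr_codes = payments["payment_installments"].corr(type_codes)

# Direct check: maximum number of installments observed per payment type.
max_installments = payments.groupby("payment_type")["payment_installments"].max()

print(round(corr_indicator, 2), round(corr_codes, 2))
print(max_installments)
```

On this toy data the indicator encoding gives a positive correlation and the alphabetical codes give a negative one, so the per-type maximum is the more direct way to confirm that only credit card payments use multiple installments.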
