### Report: Clustering for Customer Segmentation

#### 1. Main Objective of the Analysis
The main objective of this analysis is to segment customers using clustering techniques, with a focus on identifying distinct customer groups based on purchasing behavior and other attributes. The resulting customer profiles benefit the business by enabling targeted marketing strategies, better customer service, and higher customer retention.

#### 2. Brief Description of the Dataset
I used a UK-based online retail dataset obtained from the UCI Machine Learning Repository: [https://archive.ics.uci.edu/dataset/352/online+retail](https://archive.ics.uci.edu/dataset/352/online+retail). The dataset consists of 541,909 transaction records (one row per product line on an invoice, not one row per customer) and eight features:
- InvoiceNo: a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
- StockCode: a 5-digit integral number uniquely assigned to each distinct product. Some codes carry a letter suffix, such as 85123A in the preview below.
- Description: the product name.
- Quantity: the quantity of each product (item) per transaction.
- InvoiceDate: the day and time when each transaction was generated.
- UnitPrice: the product price per unit.
- CustomerID: a 5-digit integral number uniquely assigned to each customer.
- Country: the name of the country where each customer resides.

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>InvoiceNo</th>
      <th>StockCode</th>
      <th>Description</th>
      <th>Quantity</th>
      <th>InvoiceDate</th>
      <th>UnitPrice</th>
      <th>CustomerID</th>
      <th>Country</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>536365</td>
      <td>85123A</td>
      <td>WHITE HANGING HEART T-LIGHT HOLDER</td>
      <td>6</td>
      <td>12/1/10 8:26</td>
      <td>2.55</td>
      <td>17850.0</td>
      <td>United Kingdom</td>
    </tr>
    <tr>
      <th>1</th>
      <td>536365</td>
      <td>71053</td>
      <td>WHITE METAL LANTERN</td>
      <td>6</td>
      <td>12/1/10 8:26</td>
      <td>3.39</td>
      <td>17850.0</td>
      <td>United Kingdom</td>
    </tr>
    <tr>
      <th>2</th>
      <td>536365</td>
      <td>84406B</td>
      <td>CREAM CUPID HEARTS COAT HANGER</td>
      <td>8</td>
      <td>12/1/10 8:26</td>
      <td>2.75</td>
      <td>17850.0</td>
      <td>United Kingdom</td>
    </tr>
    <tr>
      <th>3</th>
      <td>536365</td>
      <td>84029G</td>
      <td>KNITTED UNION FLAG HOT WATER BOTTLE</td>
      <td>6</td>
      <td>12/1/10 8:26</td>
      <td>3.39</td>
      <td>17850.0</td>
      <td>United Kingdom</td>
    </tr>
    <tr>
      <th>4</th>
      <td>536365</td>
      <td>84029E</td>
      <td>RED WOOLLY HOTTIE WHITE HEART.</td>
      <td>6</td>
      <td>12/1/10 8:26</td>
      <td>3.39</td>
      <td>17850.0</td>
      <td>United Kingdom</td>
    </tr>
  </tbody>
</table>
</div>
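
As a rough sketch, the file could be loaded with pandas along the following lines. The helper name `load_retail`, the file name `Online Retail.csv`, and the month/day/two-digit-year date format are assumptions (the format is inferred from the preview above), and the small in-memory sample only stands in for the real file.

```python
import io

import pandas as pd


def load_retail(source):
    """Read the retail CSV and parse the columns used later in the report."""
    df = pd.read_csv(
        source,
        # Keep codes such as '85123A' as text instead of letting pandas guess.
        dtype={"InvoiceNo": str, "StockCode": str},
    )
    # ASSUMPTION: dates look like '12/1/10 8:26' (month/day/two-digit year).
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], format="%m/%d/%y %H:%M")
    return df


# Tiny in-memory stand-in for the real file, copied from the preview above.
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,6,12/1/10 8:26,3.39,17850,United Kingdom\n"
)
sample = load_retail(sample_csv)
print(sample.dtypes)

# With the real file (assumed name): df = load_retail("Online Retail.csv")
```

Reading the two code columns as strings avoids mixed-type columns, since some stock codes contain letters.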


#### 3. Data Exploration and Cleaning
**Data Pre-Processing:**
- During pre-processing, null values were identified in the "Description" and "CustomerID" fields. Rows with a null "CustomerID" were removed.
- Removed 8,905 rows with incorrect values in the "Quantity" column.
- Removed duplicate records, that is, rows with identical values in all columns.
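
The cleaning steps above could be sketched as follows. This is an illustration rather than the exact code used for the report: the helper name `clean_retail` is hypothetical, and treating "incorrect" quantities as non-positive ones is an assumption, since the report does not state the rule that was applied.

```python
import numpy as np
import pandas as pd


def clean_retail(df):
    """Apply the three cleaning steps to a raw transactions frame."""
    out = df.dropna(subset=["CustomerID"])  # drop rows without a customer
    out = out[out["Quantity"] > 0]          # ASSUMPTION: "incorrect" means non-positive
    out = out.drop_duplicates()             # drop rows identical in every column
    return out.reset_index(drop=True)


# Small made-up frame: one cancellation, one missing customer, one duplicate.
raw = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "C536379", "536380", "536365"],
    "Quantity": [6, 6, -1, 2, 6],
    "UnitPrice": [2.55, 3.39, 27.50, 1.25, 2.55],
    "CustomerID": [17850.0, 17850.0, 14527.0, np.nan, 17850.0],
})
cleaned = clean_retail(raw)
print(len(raw), "->", len(cleaned))
```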


#### 4. Training Variations of Unsupervised Models
**Model 1: K-Means Clustering**
- K-Means clustering was chosen for this project because it is simple, efficient, and effective at partitioning large datasets into distinct customer segments. As a widely used clustering algorithm, K-Means suits scenarios where the goal is to split the data into a predefined number of clusters, which makes it a good fit for identifying groups of customers based on purchasing behavior. Its iterative refinement converges to a locally optimal set of centroids (the result can depend on the initialization), which yields clear and interpretable segments. K-Means is also computationally efficient, so it can handle large datasets such as "Online Retail.csv" and provide insights for targeted marketing strategies and business decision-making.
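
A minimal sketch of this step with scikit-learn is shown below. The report does not list the features that were clustered, so the per-customer features here (`total_spend`, `n_invoices`) and the helper `customer_features` are hypothetical, and the transactions are synthetic stand-ins for the cleaned data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def customer_features(df):
    """Aggregate transaction lines into one row per customer (hypothetical features)."""
    df = df.assign(Amount=df["Quantity"] * df["UnitPrice"])
    return df.groupby("CustomerID").agg(
        total_spend=("Amount", "sum"),
        n_invoices=("InvoiceNo", "nunique"),
    )


# Synthetic transactions standing in for the cleaned dataset.
rng = np.random.default_rng(0)
n = 600
tx = pd.DataFrame({
    "CustomerID": rng.integers(10000, 10060, size=n),
    "InvoiceNo": rng.integers(500000, 500200, size=n).astype(str),
    "Quantity": rng.integers(1, 20, size=n),
    "UnitPrice": rng.uniform(0.5, 10.0, size=n).round(2),
})

features = customer_features(tx)
# K-Means is distance-based, so put the features on a common scale first.
X = StandardScaler().fit_transform(features)

# n_init=10 restarts the algorithm from several initializations and keeps the best run.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
features["cluster"] = kmeans.labels_
print(features["cluster"].value_counts().sort_index())
```

Fixing `random_state` makes the segmentation reproducible between runs, which matters because different initializations can lead to different local optima.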



![image.png](attachment:image.png)

* This approach indicates that k=3 is only a local optimum and that k=5 should be selected as the number of clusters. The method makes the choice of the number of clusters more transparent. Note, however, that it is computationally intensive, because the coefficient must be computed for every candidate value of k.
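
The scan over candidate values of k could look roughly like the sketch below. It assumes that the coefficient in question is the silhouette coefficient mentioned in Section 5; the helper `silhouette_by_k` is hypothetical, and the data are synthetic blobs rather than the retail features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic feature matrix standing in for the scaled customer features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)


def silhouette_by_k(X, k_values):
    """Fit K-Means once per candidate k and record the mean silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X, range(2, 9))
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("highest silhouette at k =", best_k)
```

The silhouette score relies on pairwise distances, so its cost grows quickly with the number of rows. On a large dataset, the `sample_size` argument of `silhouette_score` can be used to evaluate it on a random subset instead.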

#### Visualize the Clusters
The clusters were visualized to understand customer segments based on their purchasing behavior.

![image.png](attachment:image.png)

**Model 2: Hierarchical Clustering**
- Used agglomerative clustering with different linkage criteria (ward, complete, average).
- Determined optimal number of clusters by analyzing the dendrogram.
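
As an illustration, the three linkage criteria could be compared as in the sketch below. The data are synthetic, `n_clusters=3` is an arbitrary choice for the example, and the comparison by silhouette score is an assumption, since the report relies on the dendrogram.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic feature matrix standing in for the scaled customer features.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.7, random_state=1)

# Fit one agglomerative model per linkage criterion and score the result.
results = {}
for method in ("ward", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X)
    results[method] = silhouette_score(X, labels)
    print(f"{method:>8}: silhouette={results[method]:.3f}")

# Linkage matrix for the dendrogram: one row per merge (n_samples - 1 rows).
Z = linkage(X, method="ward")
# no_plot=True returns the dendrogram layout without drawing it;
# truncate_mode="lastp" keeps only the last p merged clusters as leaves.
tree = dendrogram(Z, no_plot=True, truncate_mode="lastp", p=12)
```

Dropping `no_plot=True` draws the dendrogram with matplotlib. Large vertical gaps between merges are the usual visual cue for where to cut the tree.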

**Model 3: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**
- Experimented with different values for epsilon (ε) and minimum samples.
- Evaluated cluster formation and noise points.
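
A small grid over ε and the minimum number of samples could be sketched as follows. The helper `dbscan_grid` and the grid values are hypothetical, and the data are synthetic; suitable values for the retail features would have to be found empirically.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix standing in for the customer features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=7)
X = StandardScaler().fit_transform(X)  # eps is a distance, so scale the features


def dbscan_grid(X, eps_values, min_samples_values):
    """Run DBSCAN for each parameter pair; report cluster and noise counts."""
    rows = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # DBSCAN labels noise points as -1; they do not form a cluster.
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = int(np.sum(labels == -1))
            rows.append((eps, min_samples, n_clusters, n_noise))
    return rows


grid = dbscan_grid(X, eps_values=(0.1, 0.3, 0.5), min_samples_values=(5, 10))
for eps, min_samples, n_clusters, n_noise in grid:
    print(f"eps={eps}, min_samples={min_samples}: "
          f"{n_clusters} clusters, {n_noise} noise points")
```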

#### 5. Recommended Model
Based on the evaluation metrics and visual inspection of the clustering results, I recommend using **K-Means Clustering** with k=5. This model provides well-defined clusters with clear customer segments that align with business goals. The silhouette score and inertia indicated good cluster cohesion and separation.

#### 6. Key Findings and Insights
- Identified five distinct customer segments based on spending behavior and income levels.
- High-income, high-spending customers can be targeted with premium offers.
- Moderate-income, high-spending customers are ideal for loyalty programs.
- Low-income, low-spending customers may require cost-effective promotions to increase engagement.

#### 7. Suggestions for Next Steps
- Consider incorporating additional features such as purchase frequency and product preferences to enhance the model.
- Explore dimensionality reduction techniques like PCA to simplify the feature space and improve clustering.
- Revisit the model periodically with updated data to ensure the segments remain relevant and accurate.
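
The PCA suggestion could be tried with a pipeline like the one sketched below. The six features are synthetic and deliberately include two redundant columns; the 95% variance threshold and k=5 are illustrative choices, not results from this analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features: columns 3 and 4 are near-copies of columns 0 and 1.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
X[:, 3] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X[:, 4] = -X[:, 1] + rng.normal(scale=0.1, size=200)

pipeline = make_pipeline(
    StandardScaler(),
    # Keep as many components as needed to explain 95% of the variance.
    PCA(n_components=0.95, svd_solver="full"),
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

pca = pipeline.named_steps["pca"]
print("components kept:", pca.n_components_, "of", X.shape[1])
```

Keeping the scaler, PCA, and K-Means in one pipeline means the same transformation is applied whenever the model is refit on updated data.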