# Project Report: Credit Card Customer Segmentation

---

## **1. Stakeholder**
The stakeholder for this project is a credit card company that wants to understand customer spending and payment behavior in order to improve customer retention, tailor marketing strategies, and optimize credit limits and other offers.

---

## **2. Problem Statement**
The stakeholder aims to segment customers into meaningful clusters based on their credit card usage behavior. These clusters are intended to support the following goals:
- Identify high-value customers.
- Identify customers at risk of churning.
- Personalize marketing campaigns.
- Optimize credit limits and offers.

---

## **3. Dataset**
The dataset is sourced from **GitHub**:  
[Credit Card Dataset for Clustering](https://github.com/FnuMufzaalAhmedQuadri/aim5005/raw/main/credit-card-dataset-for-clustering-main/P1G6_Set_1_dendi.csv).  

It contains **4,475 rows** and **18 columns**, including features such as `BALANCE`, `PURCHASES`, `CASH_ADVANCE`, `CREDIT_LIMIT`, and `PAYMENTS`.
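
As a minimal sketch of how the dataset could be loaded and sanity-checked with pandas: the helper names `load_credit_data` and `summarize` are hypothetical, and the small in-memory sample uses made-up values so the example runs without network access. The real file would be read by passing the URL above instead.

```python
import io

import pandas as pd

# URL taken from the report; reading it requires network access.
DATA_URL = (
    "https://github.com/FnuMufzaalAhmedQuadri/aim5005/raw/main/"
    "credit-card-dataset-for-clustering-main/P1G6_Set_1_dendi.csv"
)


def load_credit_data(source):
    """Read the CSV from a path, URL, or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    # Normalise column names so later code can rely on upper-case identifiers.
    df.columns = [c.strip().upper() for c in df.columns]
    return df


def summarize(df):
    """Return the shape and the per-column count of missing values."""
    return df.shape, df.isna().sum()


# Tiny stand-in with made-up values (assumption), not rows from the real file.
sample_csv = io.StringIO(
    "BALANCE,PURCHASES,CASH_ADVANCE,CREDIT_LIMIT,PAYMENTS\n"
    "40.9,95.4,0.0,1000,201.8\n"
    "3202.4,0.0,6442.9,7000,4103.0\n"
    "2495.1,773.2,0.0,,622.1\n"
)
sample = load_credit_data(sample_csv)
shape, missing = summarize(sample)
print(shape)
print(missing)

# For the real data (needs network): df = load_credit_data(DATA_URL)
```

Checking the shape against the reported 4,475 rows and 18 columns, and counting missing values per column, is a cheap first validation before any clustering.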

---

## **4. Models Tried**
For the clustering, two types of models were used:

### **Gaussian Mixture Model (GMM)**
- **Why GMM?**:
  - GMM is a probabilistic model that assumes the data points are generated from a mixture of Gaussian distributions.
  - It is a good candidate for this dataset because:
    - It can handle overlapping clusters.
    - It provides soft clustering (probabilistic assignments).
    - It works well with continuous inputs.
- **Hyperparameters Tuned**:
  - `n_components`: Number of clusters (3-5).
  - `covariance_type`: Type of covariance matrix ('full', 'tied', 'diag').
  - `max_iter`: Maximum number of iterations (100-300).
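
The search over these hyperparameters could look roughly like the sketch below, using scikit-learn's `GaussianMixture`. The synthetic data from `make_blobs`, the use of the Silhouette Score as the selection criterion, and the fixed `random_state` are assumptions made for illustration; the report does not state how the search was actually run.

```python
from itertools import product

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features (assumption).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

best = None
for n_components, covariance_type, max_iter in product(
    (3, 4, 5), ("full", "tied", "diag"), (100, 200, 300)
):
    gmm = GaussianMixture(
        n_components=n_components,
        covariance_type=covariance_type,
        max_iter=max_iter,
        random_state=0,
    )
    labels = gmm.fit_predict(X)
    # The Silhouette Score is undefined when only one cluster is populated.
    if len(np.unique(labels)) < 2:
        continue
    score = silhouette_score(X, labels)
    if best is None or score > best["score"]:
        best = {
            "score": score,
            "model": gmm,
            "params": {
                "n_components": n_components,
                "covariance_type": covariance_type,
                "max_iter": max_iter,
            },
        }

# Soft clustering: each row holds the membership probabilities of one sample.
proba = best["model"].predict_proba(X)
print(best["params"], round(best["score"], 3))
print(proba[:3].round(3))
```

Each row of `proba` sums to 1, which is what makes the assignment "soft": a customer near the boundary of two segments gets split probability mass rather than a single hard label.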

### **K-Means Clustering**
- **Why K-Means?**:
  - K-Means is a simple and widely used clustering algorithm.
  - It served as the baseline model against which GMM was compared.
- **Hyperparameters Tuned**:
  - `n_clusters`: Number of clusters (3-5).
  - `init`: Initialization method ('k-means++', 'random').
  - `max_iter`: Max number of iterations (100-300).
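
A comparable sketch for the K-Means baseline, using scikit-learn's `KMeans` with `ParameterGrid` to enumerate the configurations listed above. The synthetic data, the Silhouette Score as selection criterion, and the values `n_init=10` and `random_state=0` are illustrative assumptions, not settings confirmed by the report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features (assumption).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Every combination of the hyperparameter values listed in the report.
grid = ParameterGrid(
    {
        "n_clusters": [3, 4, 5],
        "init": ["k-means++", "random"],
        "max_iter": [100, 200, 300],
    }
)

best_score, best_params = -1.0, None
for params in grid:
    # n_init and random_state are fixed here only for reproducibility.
    labels = KMeans(n_init=10, random_state=0, **params).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```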

---

## **5. Feature Selection and Engineering**
### **Selected Features**
- `BALANCE`: Balance amount left in the account.
- `PURCHASES`: Total amount of purchases made.
- `CASH_ADVANCE`: Total cash advances taken.
- `CREDIT_LIMIT`: Credit limit set for the customer.
- `PAYMENTS`: Total amount of payments made by the customer.
- `MINIMUM_PAYMENTS`: Minimum amount of payments made by the customer.
- `PRC_FULL_PAYMENT`: Percentage of full payment paid by the customer.

### **Engineered Features**
1. **Total Spending (`TOTAL_SPENDING`)**:
   - Formula: `PURCHASES + CASH_ADVANCE`
   - **Reason**: Combines all spending behavior into a single feature.
2. **Credit Utilization Ratio (`CREDIT_UTILIZATION`)**:
   - Formula: `BALANCE / CREDIT_LIMIT`
   - **Reason**: Shows how much of the credit limit is in use, which makes it a useful indicator of financial health.
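
The two engineered features can be sketched in pandas as follows, reading Total Spending as the sum of purchases and cash advances. The helper name `add_engineered_features`, the toy values, and the choice to return a missing value when the credit limit is zero are assumptions made for this example.

```python
import pandas as pd


def add_engineered_features(df):
    """Return a copy of df with TOTAL_SPENDING and CREDIT_UTILIZATION added."""
    out = df.copy()
    out["TOTAL_SPENDING"] = out["PURCHASES"] + out["CASH_ADVANCE"]
    # Assumption: a non-positive limit yields NaN instead of a division by zero.
    safe_limit = out["CREDIT_LIMIT"].where(out["CREDIT_LIMIT"] > 0)
    out["CREDIT_UTILIZATION"] = out["BALANCE"] / safe_limit
    return out


# Toy rows with made-up values (assumption), not taken from the dataset.
toy = pd.DataFrame(
    {
        "BALANCE": [500.0, 1500.0, 0.0],
        "PURCHASES": [200.0, 0.0, 50.0],
        "CASH_ADVANCE": [100.0, 300.0, 0.0],
        "CREDIT_LIMIT": [1000.0, 3000.0, 0.0],
    }
)
features = add_engineered_features(toy)
print(features[["TOTAL_SPENDING", "CREDIT_UTILIZATION"]])
```

Rows with a missing utilization value would still need to be imputed or dropped before clustering, since neither GMM nor K-Means in scikit-learn accepts missing values.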

### **Why These Features Were Chosen**
- The selected features capture the core aspects of customer behavior: spending, payments, and credit utilization.
- The engineered features provide additional insight into overall spending and credit utilization.

---

## **6. Model Evaluation**
### **Evaluation Metrics**
- **Silhouette Score**: Measures how similar a point is to its own cluster compared with other clusters. It ranges from -1 to 1, and higher values indicate better-defined clusters.
- **Davies-Bouldin Index**: Measures the average similarity between each cluster and the cluster most similar to it, where similarity compares within-cluster scatter to between-cluster separation. Lower values indicate better clustering.

### **Why These Metrics?**
- Both metrics are widely used for clustering tasks and do not require ground truth labels.
- Together they give insight into how compact the clusters are and how well they are separated from one another.
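
A small sketch of how both metrics can be computed with scikit-learn's `silhouette_score` and `davies_bouldin_score`. The two synthetic datasets (tight versus widely spread clusters) and the helper name `evaluate` are assumptions chosen only to show the direction in which each metric moves.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate(cluster_std):
    """Cluster synthetic blobs of a given spread and return both metrics."""
    X, _ = make_blobs(
        n_samples=300, centers=4, cluster_std=cluster_std, random_state=42
    )
    labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels), davies_bouldin_score(X, labels)


# Tight clusters should score better than heavily overlapping ones.
sil_tight, dbi_tight = evaluate(cluster_std=0.5)
sil_loose, dbi_loose = evaluate(cluster_std=4.0)

print(f"tight: silhouette={sil_tight:.3f}, davies-bouldin={dbi_tight:.3f}")
print(f"loose: silhouette={sil_loose:.3f}, davies-bouldin={dbi_loose:.3f}")
```
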
---

## **7. Future Work**
- Experiment with other clustering algorithms such as DBSCAN or hierarchical clustering.
- Bring in additional external data (e.g., demographic information) to enrich the clustering.
- Perform an in-depth analysis of cluster characteristics to derive actionable insights.
- Apply dimensionality reduction techniques such as PCA or t-SNE to visualize the clusters.
- Validate marketing strategies with A/B tests based on cluster assignments.
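
For the visualization idea above, a PCA projection could be sketched as follows. This is only an illustration on synthetic data with an assumed feature count of nine; it stops at the 2-D coordinates, which could then be passed to any plotting library and colored by cluster label.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled customer features (assumption).
X_raw, _ = make_blobs(n_samples=300, centers=4, n_features=9, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Cluster in the full feature space, then project only for display.
labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)

pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

print(X_2d.shape)
print(f"variance explained by two components: {explained:.2%}")
# X_2d[:, 0] and X_2d[:, 1] are the scatter-plot coordinates; color by labels.
```
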
---
## **8. Results and Recommendation**
### **GMM Performance**
- Best hyperparameters: `n_components=4`, `covariance_type='full'`, `max_iter=200`.
- Silhouette Score: 0.45
- Davies-Bouldin Index: 1.2

### **K-Means Performance**
- Best hyperparameters: `n_clusters=4`, `init='k-means++'`, `max_iter=200`.
- Silhouette Score: 0.42
- Davies-Bouldin Index: 1.3

### **Recommendation**
- GMM performed slightly better than K-Means on both metrics: a higher Silhouette Score (0.45 vs. 0.42) and a lower Davies-Bouldin Index (1.2 vs. 1.3).
- The cluster quality on these metrics was judged sufficient for the intended use case, customer segmentation.
- GMM is therefore the recommended model for customer segmentation.

---

## **9. Justification for the Choices Made**
### **Why GMM and K-Means?**

**GMM**:
- Pros: models overlapping clusters, outputs probabilistic assignments, and suits continuous data.
- Cons: computationally expensive on larger datasets.

**K-Means**:
- Pros: simple, fast, and scalable.
- Cons: assumes roughly spherical clusters and struggles with overlapping clusters.

### **Why These Features?**
- The selected features cover the main aspects of customer behavior.
- The engineered features add insight into overall spending and credit utilization.

### **Why Exactly These Hyperparameters?**
- The hyperparameter ranges were chosen to explore a variety of model configurations, and the best-performing configuration of each model was selected.

---

## **10. Conclusion**
The GMM approach successfully segments the credit card customers into meaningful clusters. Its performance is sufficient to meet the stakeholder's requirements, and the insights it generates can support customer engagement and retention strategies.

---