<div style="position: relative; text-align: center; padding: 0; font-family: 'Segoe UI', sans-serif; color: white;">

  <!-- Large image used as the title -->
  <img src="https://i.postimg.cc/YqdHv8mv/0-Jv-IYDHNvri-ERR5-m.png" 
       alt="Title Image" 
       style="display: block; margin: 0 auto; border-radius: 0; box-shadow: 0 0 3px #666;">
  
  <!-- Small name at the top left -->
  <span style="
  color: #B8860B;
  font-family: 'Times New Roman', Times, serif;
  font-weight: bold;
  padding-bottom: 0;
">
  </span>
</div>

# 💡 Data Mining Final Project: Customer Behavioral Segmentation and Profiling

## 📚 Introduction

This project analyzes a dataset from a charity group. The dataset contains general information about the members and a record of their donations over a specific period, stored in two separate files: `BenefactorsData.csv` and `TransactionalData.csv`.

- **BenefactorsData.csv:** This file includes the membership ID, gender, state, date of birth, and how each member became acquainted with the charity group.  
- **TransactionalData.csv:** This file includes the unique transaction code, the date and amount of each donation, and the type of donation made by each group member.

---

## 🎯 Project Aim

The charity group wants to use data science tools to identify suitable target groups and to profile its members' behavioral patterns, so that it can design and run advertising campaigns. The expected outcomes of this project are:

- **🔸 Segmentation of Members:** Based on donation history and behavioral indicators, members are divided into a few manageable groups. This segmentation reveals the members' current behavioral patterns.

- **🔸 Target Market Identification:** With these behavioral patterns understood, the charity group can choose which patterns to target in a campaign.

- **🔸 Profiling Members:** Explore the relationship between basic member information (such as gender and age) and the identified behavioral patterns. This profiling highlights key characteristics of potential benefactors and helps the charity group target and engage them effectively.

---


In [None]:
# 1. Import libraries
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
#!pip install ydata-profiling
from ydata_profiling import ProfileReport
!pip install jdatetime
import jdatetime
import datetime
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# Modelling
from sklearn.cluster import KMeans

# Validation
from sklearn.metrics import silhouette_score, silhouette_samples

#pip install yellowbrick
from yellowbrick.cluster import KElbowVisualizer

# Ignore Warnings
import warnings

---
## 📥 Transactional Data Processing

## 🔬 Exploratory Data Analysis (EDA)

- Perform EDA to check data quality.

In [None]:
Transactional = pd.read_csv("/kaggle/input/transactional/TransactionalData.csv")
Transactional = Transactional.drop('Unnamed: 0', axis=1)

Transactional.head()

In [None]:
print("Shape:", Transactional.shape)

In [None]:
print(Transactional.info())

In [None]:
# Automated EDA report for the raw transactions (no column type overrides are needed here)
profile_Transactional = ProfileReport(Transactional, title="Transactional data EDA")
profile_Transactional
#profile_Transactional.to_file("transactional_dataset_profile_report.html")

---
## ✂️ Filter Transactions

- Select transactions with PaymentAmount greater than 1000.
- Select transactions whose SupportType is 'Membership Fee', and explain why this selection is important.

  **“This type of support is important because membership fees usually provide a more stable source of income, and supporters who use this method are more likely to continue their contributions.”**

In [None]:
Transactional['SupportType'].unique()

In [None]:
# Keep membership-fee payments above 1000
Filtered_Transactional = Transactional[(Transactional['PaymentAmount'] > 1000) &
                                       (Transactional['SupportType'] == "Membership Fee")]

# First stage: total amount donated per member per payment date
Aggregated_Transactional = Filtered_Transactional.groupby(['UserID', 'PaymentDate'])['PaymentAmount'].sum().reset_index()

Aggregated_Transactional.head()

---
### 🧮 Aggregate Transactional Data

- First Stage: Aggregate the data by UserID and PaymentDate.  
- Second Stage: Aggregate the results of the first stage by UserID to construct the R, F, M, and D fields:  
  - R (Recency): The number of days since the last donation.  
  - F (Frequency): The number of donations.  
  - M (Monetary): The total amount donated.  
  - D (Duration): The number of days between the first and last donation.
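
As a minimal sketch of the second aggregation stage, the toy example below computes R, F, M, and D for two made-up members. The data values and the names `toy`, `reference_date`, and `toy_rfmd` are illustrative assumptions, not part of the project data. Recency is measured against the latest donation date in the table.

```python
import pandas as pd

# Hypothetical data: one row per (UserID, PaymentDate), i.e. the output of the first stage.
toy = pd.DataFrame({
    "UserID": [1, 1, 1, 2],
    "PaymentDate": pd.to_datetime(["2023-01-01", "2023-01-31", "2023-03-02", "2023-02-10"]),
    "PaymentAmount": [2000, 3000, 5000, 4000],
})

# Reference date for Recency: the most recent donation in the whole table.
reference_date = toy["PaymentDate"].max()

toy_rfmd = toy.groupby("UserID").agg(
    R=("PaymentDate", lambda d: (reference_date - d.max()).days),  # days since last donation
    F=("PaymentDate", "nunique"),                                  # number of donation days
    M=("PaymentAmount", "sum"),                                    # total amount donated
    D=("PaymentDate", lambda d: (d.max() - d.min()).days),         # days from first to last donation
)
print(toy_rfmd)
```

In this toy table, member 1 gets R = 0, F = 3, M = 10000, and D = 60, while member 2, who donated only once, gets R = 20 and D = 0.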

In [None]:
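# Parse payment dates and use the most recent donation as the reference date for Recency
Aggregated_Transactional['PaymentDate'] = pd.to_datetime(Aggregated_Transactional['PaymentDate'])
last_date = Aggregated_Transactional['PaymentDate'].max()

# Second stage: one row per member with Recency, Frequency, Monetary, and Duration
RFMD = Aggregated_Transactional.groupby('UserID').agg(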
    R=('PaymentDate', lambda x: (last_date - x.max()).days),
    F=('PaymentDate','nunique'),
    M=('PaymentAmount','sum'),
    D=('PaymentDate', lambda x: (x.max() - x.min()).days)
).reset_index()

RFMD.head()

In [None]:
# Automated EDA report for the aggregated R, F, M, and D fields
profile_RFMD = ProfileReport(RFMD, title="RFMD data EDA")
profile_RFMD

#profile_RFMD.to_file("RFMD_dataset_profile_report.html")

---
### 📈 EDA on Aggregated Data

- Perform EDA to check the quality indicators of the aggregated data and explore the distributions of R, F, M, and D.
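
Besides the profiling report, a few plain pandas calls can give a quick view of these distributions. The sketch below uses a small made-up table (`rfmd_sample` is a hypothetical stand-in for the aggregated data) and is only one possible way to run such a check.

```python
import pandas as pd

# Hypothetical stand-in for the aggregated table with R, F, M, and D columns.
rfmd_sample = pd.DataFrame({
    "R": [3, 40, 200, 400, 700],
    "F": [12, 6, 2, 1, 1],
    "M": [9_000_000, 1_500_000, 600_000, 100_000, 50_000],
    "D": [600, 300, 90, 0, 0],
})

# Missing values per column (a basic quality check).
print(rfmd_sample.isna().sum())

# Location, spread, and selected quantiles of each field.
print(rfmd_sample.describe(percentiles=[0.25, 0.5, 0.75, 0.95]))

# Positive skewness indicates a long right tail, which is common for monetary totals.
print(rfmd_sample.skew())
```

Strongly skewed fields are one reason to score R, F, M, and D into a few ordered bins before clustering rather than using the raw values.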

### 🎛️ Categorize and Score R, F, M, and D

- R (Recency):  
  - 0 <= R < 60  
  - 60 <= R < 180  
  - 180 <= R < 365  
  - 365 <= R < 545  
  - R >= 545  

- F (Frequency):  
  - 1 <= F < 2  
  - 2 <= F < 5  
  - 5 <= F < 10  
  - 10 <= F < 20  
  - F >= 20  

- M (Monetary):  
  - 0 <= M < 500,000  
  - 500,000 <= M < 1,200,000  
  - 1,200,000 <= M < 2,500,000  
  - 2,500,000 <= M < 10,000,000  
  - M >= 10,000,000  

- D (Duration):  
  - 0 <= D < 1  
  - 1 <= D < 180  
  - 180 <= D < 365  
  - 365 <= D < 545  
  - D >= 545  
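
The intervals above are closed on the left and open on the right, which corresponds to `right=False` in `pd.cut`. The sketch below scores a few made-up Recency values; the sample values are assumptions for illustration. The labels run from 5 down to 1 for R so that more recent donors receive higher scores, while F, M, and D would use labels 1 to 5 in ascending order.

```python
import pandas as pd

# Hypothetical recency values (days since last donation), chosen to sit on bin edges.
recency_days = pd.Series([0, 59, 60, 364, 545, 900])
r_bins = [0, 60, 180, 365, 545, float("inf")]

# right=False gives intervals [0, 60), [60, 180), [180, 365), [365, 545), [545, inf).
# Labels are reversed so that the smallest recency maps to the highest score.
r_score = pd.cut(recency_days, bins=r_bins, labels=[5, 4, 3, 2, 1], right=False)

print(pd.DataFrame({"R": recency_days, "R_Score": r_score}))
```

A value exactly on an edge, such as 60, falls into the interval that starts at that edge. Any value below the first edge (for example a negative number) would be returned as missing rather than scored.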


In [None]:
bins_dict = {
    'R': [0, 60, 180, 365, 545, float('inf')],
    'F': [1, 2, 5, 10, 20, float('inf')],
    'M': [0, 500000, 1200000, 2500000, 10000000, float('inf')],
    'D': [0, 1, 180, 365, 545, float('inf')]
}

# Score each field from 1 to 5; right=False makes every interval left-closed, e.g. [0, 60)
for col in ['R', 'F', 'M', 'D']:
    # Recency is reversed: fewer days since the last donation earns a higher score
    score_labels = [5, 4, 3, 2, 1] if col == 'R' else [1, 2, 3, 4, 5]
    RFMD[f'{col}_Score'] = pd.cut(RFMD[col], bins=bins_dict[col], labels=score_labels, right=False)

# Keep only the scores for clustering
RFMD.drop(columns=['R', 'F', 'M', 'D'], inplace=True)

RFMD.head()

In [None]:
def group_stats(col):
    counts = RFMD[col].value_counts().sort_index()
    percentages = RFMD[col].value_counts(normalize=True).sort_index() * 100
    return pd.DataFrame({'Count': counts, 'Percentage': percentages})

print("Recency Groups:\n", group_stats('R_Score'))
print("Frequency Groups:\n", group_stats('F_Score'))
print("Monetary Groups:\n", group_stats('M_Score'))
print("Duration Groups:\n", group_stats('D_Score'))


---

## 🤖 Data Modeling

### 🗂️ Clustering Model

- Goal: identify customer behavioral patterns.  
- Fit k-means clustering models with 2 to 6 clusters on the R, F, M, and D scores.  
- Evaluate the fitted clustering models using KElbowVisualizer.  
- Explore the clusters and describe customer behavioral patterns.  
- Select the best model based on cluster descriptions and silhouette score.  
- Choose a pattern as a target group and construct a binary target field for each customer based on that pattern.
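
Two measures are used to compare the models: the within-cluster sum of squares (WCSS, exposed by scikit-learn as `inertia_`), which sums the squared distance from each point to the center of its cluster, and the silhouette score, which compares each point's distance to its own cluster with its distance to the nearest other cluster. The sketch below recomputes WCSS by hand on synthetic scores and compares it with `inertia_`; the data and the names `X_toy`, `model`, and `manual_wcss` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical score matrix: two well-separated groups of 1-5 scores (R, F, M, D).
rng = np.random.default_rng(0)
low = rng.integers(1, 3, size=(30, 4))    # scores in {1, 2}
high = rng.integers(4, 6, size=(30, 4))   # scores in {4, 5}
X_toy = np.vstack([low, high]).astype(float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# WCSS: squared distance of every point to the center of its assigned cluster.
assigned_centers = model.cluster_centers_[model.labels_]
manual_wcss = ((X_toy - assigned_centers) ** 2).sum()

print("manual WCSS:", manual_wcss)
print("inertia_:   ", model.inertia_)
print("silhouette: ", silhouette_score(X_toy, model.labels_))
```

WCSS always decreases as the number of clusters grows, so it is read through the elbow of the curve, whereas the silhouette score can be compared directly across values of k (higher is better).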

In [None]:
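# pd.cut returns categorical columns; cast to float so distances are computed on numeric scores
X = RFMD[['R_Score', 'F_Score', 'M_Score', 'D_Score']].astype(float)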

cluster_range = range(2, 7)  

# List to store WCSS & silhouette values
wss = []
silhouette_scores = []

# Calculate WCSS & silhouette for each cluster number
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, init='k-means++', n_init='auto', max_iter=100,
                    tol=0.0001, random_state=880, algorithm='lloyd')
    kmeans.fit(X)
    wss.append(kmeans.inertia_)
    labels = kmeans.predict(X)
    silhouette_avg = silhouette_score(X, labels)
    silhouette_scores.append(silhouette_avg)

In [None]:
cluster_metrics = pd.DataFrame({
    'Number of Clusters': list(cluster_range),
    'WCSS': wss,
    'Silhouette Score': silhouette_scores
})

cluster_metrics

In [None]:
plt.style.use('seaborn-v0_8-whitegrid')

k_opt = 5  

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(cluster_range, [w + 10 for w in wss], 
        marker='o', markersize=12, linewidth=4, color='lightgray', alpha=0.6, zorder=1)

ax.plot(cluster_range, wss, marker='o', markersize=10, linewidth=2.5, color='#0052cc', zorder=2, label='WSS')

ax.scatter(k_opt, wss[cluster_range.index(k_opt)], 
           color='crimson', s=180, edgecolor='black', zorder=3, label='Optimal k')

ax.set_title('Elbow Method for Optimal Number of Clusters', fontsize=18, fontweight='bold', pad=15)
ax.set_xlabel('Number of Clusters', fontsize=15, labelpad=10)
ax.set_ylabel('Within-Cluster Sum of Squares (WSS)', fontsize=15, labelpad=10)

for i, txt in enumerate(wss):
    ax.annotate(f"{txt:.0f}", (cluster_range[i], wss[i]), 
                textcoords="offset points", xytext=(0, 10), ha='center', fontsize=10, color='dimgray')

ax.set_xticks(cluster_range)
ax.tick_params(axis='both', which='major', labelsize=12)
ax.grid(True, linestyle='--', alpha=0.6)
ax.legend(fontsize=12, frameon=True, shadow=True)

plt.tight_layout()
plt.show()

In [None]:
plt.style.use('seaborn-v0_8-whitegrid')


k_opt = 5

fig, ax = plt.subplots(figsize=(10, 6))


ax.plot(cluster_range, silhouette_scores, marker='o', markersize=10, linewidth=2.5, 
        color='#0052cc', zorder=2, label='Silhouette Score')

ax.scatter(k_opt, silhouette_scores[cluster_range.index(k_opt)],
           color='crimson', s=180, edgecolor='black', zorder=3, label='Optimal k')

ax.set_title('Silhouette Score vs Number of Clusters', fontsize=18, fontweight='bold', pad=15)
ax.set_xlabel('Number of Clusters', fontsize=15, labelpad=10)
ax.set_ylabel('Silhouette Score', fontsize=15, labelpad=10)

for i, txt in enumerate(silhouette_scores):
    ax.annotate(f"{txt:.2f}", (cluster_range[i], silhouette_scores[i]),
                textcoords="offset points", xytext=(0, 10), ha='center',
                fontsize=10, color='dimgray')

ax.set_xticks(cluster_range)
ax.tick_params(axis='both', which='major', labelsize=12)
ax.grid(True, linestyle='--', alpha=0.6)
ax.legend(fontsize=12, frameon=True, shadow=True)

plt.tight_layout()
plt.show()

In [None]:
pip install yellowbrick

In [None]:
plt.style.use('seaborn-v0_8-whitegrid')

# Create the KElbowVisualizer with the model and the range of clusters to test
visualizer = KElbowVisualizer(kmeans, metric='silhouette', k=(2, 7))

# Fit the visualizer and show the plot
visualizer.fit(X)
visualizer.show()

In [None]:
# Initialize the KMeans model
km_model = KMeans(n_clusters=5, init='k-means++', n_init='auto', max_iter=300,
                  tol=0.0001, random_state=880, algorithm='lloyd')


# Fit the model
km_model.fit(X)

# Predict the cluster labels
labels = km_model.predict(X)

# Compute the silhouette score
silhouette_avg = silhouette_score(X, labels)
silhouette = silhouette_samples(X, labels)

print(f'Silhouette Score: {silhouette_avg}')

X_kmeans = X.copy()
X_kmeans['cluster'] = labels
X_kmeans['silhouette'] = silhouette
X_kmeans.index = RFMD['UserID']

X_kmeans.head()

In [None]:
# Use a separate name so the per-k list `wss` built earlier is not overwritten
final_wss = km_model.inertia_
print(f"WSS: {final_wss:.2f}")

print("-"*80)

centers = km_model.cluster_centers_
print("Cluster Centers:\n", centers)

In [None]:
# Suppress warnings
warnings.filterwarnings("ignore")

# Number the clusters from 1 instead of 0 (run this cell only once)
X_kmeans['cluster'] = X_kmeans['cluster'] + 1
clusters = sorted(X_kmeans['cluster'].unique())

# Define the score column names and ensure they are of float type
scores = ['R_Score', 'F_Score', 'M_Score', 'D_Score']
X_kmeans[scores] = X_kmeans[scores].astype(float)

In [None]:
# Loop through each cluster
for cluster in clusters:
    # Create subplots: one for each score
    fig, axes = plt.subplots(1, len(scores), figsize=(20, 4))
    
    for i, score in enumerate(scores):
        ax = axes[i]

        # Create DataFrame for all data and for the specific cluster
        all_data = pd.DataFrame({score: X_kmeans[score], 'Type': 'All Data'})
        data_cluster = pd.DataFrame({
            score: X_kmeans.loc[X_kmeans['cluster'] == cluster, score],
            'Type': f'Cluster {cluster}'
        })
        # Combine both DataFrames for plotting
        data_combined = pd.concat([all_data, data_cluster])

        # Add half-unit jitter to the score for better visualization
        data_combined[score+'_float'] = data_combined[score] + np.random.uniform(0, 0.5, len(data_combined))

        # Define bins with 0.5 spacing
        bins = np.arange(data_combined[score+'_float'].min(),
                         data_combined[score+'_float'].max() + 0.5,
                         0.5)

        # Plot histogram with hue for cluster vs all data
        sns.histplot(
            x=score+'_float',
            hue='Type',
            data=data_combined,
            ax=ax,
            palette=['pink', 'black'],
            bins=bins,
            stat='count',
            alpha=0.7
        )

        # Set title and axis labels
        ax.set_title(f'{score} Distribution\n(Cluster {cluster})')
        ax.set_xlabel(score)
        ax.set_ylabel('Count')

    # Adjust layout and display the plot
    plt.tight_layout()
    plt.show()

In [None]:
# Set up the matplotlib figure
fig, axes = plt.subplots(1, len(scores), figsize=(20, 4))

# Define the colors
pink_color = 'pink'  
dark_pink_color = 'brown'  

# Loop over each score and create a horizontal box plot
for i, score in enumerate(scores):
    ax = axes[i]    
    
    # Create a DataFrame for all data
    df_all_data = X_kmeans[[score]].copy()
    df_all_data['cluster'] = 'All Data'

    # Combine the all data DataFrame with the original DataFrame
    combined_df = pd.concat([X_kmeans[['cluster', score]], df_all_data], ignore_index=True)

    # Sort the clusters including 'All Data'
    combined_df['cluster'] = pd.Categorical(combined_df['cluster'], categories= clusters + ['All Data'], ordered=True)

    # Create a horizontal box plot
    sns.boxplot(y='cluster', x=score, data=combined_df, palette= [pink_color] * len(clusters) + [dark_pink_color], ax=ax, showfliers=False)

    # Set the title and labels
    ax.set_title(f'{score} Distribution')
    ax.set_xlabel(score)
    ax.set_ylabel('Cluster')

plt.tight_layout()
plt.show()

In [None]:
radar_data = pd.DataFrame({
    'cluster': [1, 2, 3, 4, 5],
    'R_Score': [4, 3.5, 1.5, 5, 3.5],      
    'F_Score': [3, 2.5, 1.5, 4.5, 1.5],      
    'M_Score': [1.5, 3.5, 3.5, 3.5, 1.5],      
    'D_Score': [3.5, 4, 1.5, 5, 1.5]       
})

cluster_profiles = radar_data.groupby('cluster').mean()

labels = cluster_profiles.columns
num_vars = len(labels)

angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(figsize=(7, 7), subplot_kw=dict(polar=True))

for cluster in cluster_profiles.index:
    values = cluster_profiles.loc[cluster].tolist()
    values += values[:1]  # close the loop
    ax.plot(angles, values, linewidth=2, label=f'Cluster {cluster}')
    ax.fill(angles, values, alpha=0.25)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_yticks([1, 2, 3, 4, 5])
ax.set_yticklabels(['1', '2', '3', '4', '5'])
plt.title("RFMD Cluster Profiles", size=15, y=1.1)
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.show()
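
The radar chart above is drawn from values typed in by hand. As a possible alternative, the per-cluster profile can be derived from the scored data with a group-by mean. The sketch below uses a small made-up table; `scored_toy` is a hypothetical stand-in for a DataFrame of scores with a `cluster` column.

```python
import pandas as pd

# Hypothetical scores with cluster labels already assigned.
scored_toy = pd.DataFrame({
    "cluster": [1, 1, 2, 2],
    "R_Score": [4, 5, 1, 2],
    "F_Score": [3, 3, 1, 1],
    "M_Score": [2, 1, 3, 4],
    "D_Score": [4, 3, 1, 2],
})

score_cols = ["R_Score", "F_Score", "M_Score", "D_Score"]

# One row per cluster, one column per score: the values a radar chart would plot.
profiles = scored_toy.groupby("cluster")[score_cols].mean()
print(profiles)
```

A table built this way stays in step with the fitted model if the clustering is rerun with different settings.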

In [None]:
# Mean scores and mean silhouette per cluster
cluster_summary = X_kmeans.groupby('cluster').mean().reset_index()

# Calculate the number of records and percentage for each cluster
cluster_counts = X_kmeans['cluster'].value_counts().reset_index()
cluster_counts.columns = ['cluster', 'count']
cluster_counts['percentage'] = (cluster_counts['count'] / len(X_kmeans)) * 100

# Calculate the overall mean for each score
overall_mean = X_kmeans[['R_Score', 'F_Score', 'M_Score', 'D_Score', 'silhouette']].mean().to_frame().T
overall_mean['cluster'] = 'All Data'
overall_mean['count'] = len(X_kmeans)
overall_mean['percentage'] = 100.0

# Rename columns to indicate they are mean values
cluster_summary.columns = ['cluster', 'mean_Recency_Score', 'mean_Frequency_Score', 'mean_Monetary_Score','mean_Duration_Score', 'mean_Silhouette']

# Merge cluster_summary with cluster_counts
cluster_summary = pd.merge(cluster_summary, cluster_counts, on='cluster')

# Append the overall mean to the cluster summary
overall_mean.columns = ['mean_Recency_Score', 'mean_Frequency_Score', 'mean_Monetary_Score', 'mean_Duration_Score', 'mean_Silhouette', 'cluster', 'count', 'percentage']
cluster_summary = pd.concat([cluster_summary, overall_mean], ignore_index=True)

cluster_name = {'1': 'Steady Supporters',
                '2': 'Engaging Supporters',
                '3': 'Churn Supporters',
                '4': 'VIP Supporters',
                '5': 'Fresh Supporters'}

cluster_summary['cluster'] = cluster_summary['cluster'].astype(str).replace(cluster_name)

cluster_summary


### Based on our analysis of the optimal clusters, we identified five distinct types of donor behavior:

🔄 **Steady Supporters:** These donors have high Recency, Frequency, and Duration scores but a low Monetary score, indicating strong and consistent engagement with the charity at a relatively low financial volume.

💬 **Engaging Supporters:** They show above-average scores on all metrics, reflecting positive and steady interaction with the organization over time.

⏳ **Churn Supporters:** These donors have low scores on all metrics, suggesting that they are transient and engage with the charity only occasionally.

🏆 **VIP (Top Loyal) Supporters:** With high scores on all metrics, these are the most committed and loyal donors.

🌱 **Fresh Supporters:** With low Frequency, Monetary, and Duration scores but a high Recency score (a recent last donation), these donors are newly active and have only recently begun engaging with the charity.


In [None]:
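# Binary target: 1 for members of cluster 4 (VIP Supporters), 0 otherwise
X_kmeans['VIP Supporter'] = (X_kmeans['cluster'] == 4).astype(int)

# Keep the scores and the target; drop the intermediate clustering columns
RFMD_Clustring = X_kmeans.drop(columns=['cluster','silhouette'])
RFMD_Clustring.head()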
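
Cluster numbers produced by k-means are arbitrary and can change with the random seed or the data, so selecting the target group by a fixed number is fragile. One possible alternative, sketched below on made-up data, picks the cluster with the highest average score across R, F, M, and D. Treating that cluster as the VIP group is an assumption of this sketch, and `scored`, `score_cols`, and `vip_cluster` are hypothetical names.

```python
import pandas as pd

# Hypothetical per-member scores with cluster labels already assigned.
scored = pd.DataFrame({
    "R_Score": [5, 5, 1, 2, 4],
    "F_Score": [5, 4, 1, 1, 1],
    "M_Score": [4, 5, 1, 2, 1],
    "D_Score": [5, 5, 1, 1, 1],
    "cluster": [4, 4, 3, 3, 5],
}, index=pd.Index([101, 102, 103, 104, 105], name="UserID"))

score_cols = ["R_Score", "F_Score", "M_Score", "D_Score"]

# Mean of each score per cluster, then the average of those means per cluster.
cluster_means = scored.groupby("cluster")[score_cols].mean()
vip_cluster = cluster_means.mean(axis=1).idxmax()

# Binary target: 1 if the member belongs to the selected cluster, 0 otherwise.
scored["VIP Supporter"] = (scored["cluster"] == vip_cluster).astype(int)

print("Selected cluster:", vip_cluster)
print(scored["VIP Supporter"].tolist())
```

On this toy table the selected cluster is 4 and the first two members are flagged. The selection rule should still be checked against the cluster descriptions, since the highest average score may not match the intended target pattern in every run.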