# Capstone Project

## 1. Business Understanding (Executive Pitch)

- **What problem are we solving?**

The main business challenge we are addressing is the **inefficient spending in customer retention** and the **critical need for proactive risk management** (responsible gambling).
Our solution is **Customer Segmentation based on Financial Value and Behavioral Risk** within our sports betting platform.

- **Who is the Customer?**

The customer is the **Digital Bettor**, in other words, the users betting on our app. This Digital Bettor interacts with the platform by placing different types of bets across multiple sports. Their continuous betting activity and financial outcomes (stakes, gains, losses) determine the company's core revenue stream, the **Gross Gaming Revenue (GGR)**.

- **Why does this problem matter for the business?**

Without customer segmentation, both marketing and risk management rely on inefficient, unsustainable one-size-fits-all approaches.

1. **Retention Cost Inefficiency:** We spend large amounts of resources like bonuses or free bets on users who would have stayed anyway, or on those who will churn no matter what. This generic strategy fails to drive growth among high-potential customers, leading to a **sub-optimal Return on Investment (ROI)** for our marketing budget.
2. **Regulatory and Ethical Risk:** Failure to automatically identify **high-risk betting patterns** exposes the company to large regulatory fines and reputational damage for non-compliance with responsible gambling standards.

This project is essential to transition from blanket spending to a **data-driven investment strategy**.

- **What decision will our analysis help improve?**

This segmentation analysis will directly inform and improve two critical executive decisions:

1. **Strategic Bonus Allocation:** The Marketing Department will be able to decide **precisely which customer segment receives which promotional offer**, directing retention spending exclusively toward the **High-Value and High-Potential clusters** to maximize their Customer Lifetime Value (CLV).
2. **Proactive Risk Intervention:** The Risk Management team will use the identified **High-Risk clusters** (e.g., bettors with high frequency and volatile stakes) to trigger mandatory interventions, such as setting deposit limits or temporary exclusion, ensuring both compliance and ethical business practice.

## 2. Data Understanding

- **Dataset Selection and Suitability**

We selected the **Sports Betting Profiling Dataset** from Kaggle to build our segmentation model.

| Criterion | Justification for Selection |
| :--- | :--- |
| **Problem Relevance** | The dataset is **transactional** (each row is a single bet). Bet-level data is what customer analytics in this sector needs, because segmentation is driven by **betting frequency**, **monetary value (stakes)**, and **risk behaviour (odds)**. |
| **Complexity and Feature Engineering** | The data is **raw and unaggregated**. This is ideal for a Capstone project as it requires us to perform substantial **Feature Engineering**, transforming data from the bet level to the customer level (one row per `user_id`). |
| **Business Context** | It contains key financial variables like **Stake**, **Gain**, and **Odds**, allowing us to create segments directly tied to **financial value (GGR)** and the client's **risk profile**, which is paramount for both profitability and compliance. |

- **Dataset Limitations and Scope**

It is important to explain the data limitations to set realistic expectations for the business.

1. **Synthetic Data**: The dataset contains simulated data. While built to be realistic, it might not capture all complexities of real user habits, like changes in betting habits over years or support ticket interactions.
2. **Lack of Demographics**: Crucial non-transactional information such as **Age**, **Gender**, **Location**, or **Customer Support History** is missing. This restricts our segmentation model to be purely **behavioral and transactional**.
3. **Missing Time-Series Context**: There is no explicit `date` or `timestamp` column for each bet. This is the **most significant limitation**, as we cannot calculate true **Recency (R)** and accurate **Customer Lifetime Value (CLV)**. We must instead use surrogate variables like `bet_id` (assuming it implies sequence) or focus on **Frequency and Monetary** value segmentation.
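
If `bet_id` does reflect the order in which bets were placed (an assumption the dataset does not confirm), a rough recency proxy could be sketched as below. The toy values and the `recency_proxy` name are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Toy transactional data; values are illustrative, not taken from the dataset.
bets = pd.DataFrame({
    "bet_id": [1, 2, 3, 4, 5, 6],
    "user_id": ["a", "b", "a", "c", "b", "a"],
})

# ASSUMPTION: a larger bet_id means a more recent bet.
last_bet = bets.groupby("user_id")["bet_id"].max()

# Recency proxy: distance (in bet_id units) between the latest bet on the
# platform and each user's own latest bet. 0 means the user bet most recently.
recency_proxy = (bets["bet_id"].max() - last_bet).rename("recency_proxy")
print(recency_proxy)
```

A proxy like this only ranks customers by relative recency. It cannot be converted into days or weeks, so it would not support a true CLV calculation.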

- **Quick Data Overview (Descriptive Statistics)**

A preliminary review highlights the structure and initial composition of the transactional data:

| Key Column | Data Type | Role in the Project |
| :--- | :--- | :--- |
| `user_id` | String | **Primary Key** for aggregation (transforming bets into customer profiles). |
| `bet_type` | Categorical | **Behavioral:** Distinguishes between **'single'** (simple) and **'multiple'** (higher-risk combined bets). |
| `sport` | Categorical | **Preference:** Shows the customer's sport choice (e.g., Football, Tennis). |
| `odds` | Float | **Risk Metric:** Indicates the magnitude of potential payout and the risk taken by the bettor. |
| `is_win` | Boolean | **Success Rate:** Whether the bet won; used to calculate each customer's win rate. |
| `stake` | Float | **Monetary Value:** The bet amount (key for calculating total volume). |
| `gain` | Float | **Profitability:** The money won/lost by the platform (key for calculating GGR). |


In [1]:
import pandas as pd

# Load the Dataset
try:
    df = pd.read_csv('bets.csv', sep=';')
    print('Dataset loaded successfully.\n')
except FileNotFoundError:
    print('ERROR: File not found. Please verify the CSV file name and path.')
    raise  # Stop here; calling exit() would shut down the notebook kernel

# Calculate Initial Stats

# 1. Total Records
total_records = len(df)

# 2. Total Unique Customers
unique_customers = df['user_id'].nunique()

# 3. Win/Loss Ratio
win_count = df['is_win'].sum()
loss_count = total_records - win_count
win_ratio = (win_count/total_records) * 100

# 4. Top Sport
sports_counts = df['sport'].value_counts()
top_sport = sports_counts.index[0]
top_sport_percentage = (sports_counts.iloc[0]/total_records) * 100

# Print results for the Initial Stats table
print("--- Results for the Initial Stats Table ---")
print(f"1. Total Records (Bets): {total_records}")
print(f"2. Total Unique Customers: {unique_customers}")
print(f"3. Winning Bet Ratio: {win_ratio:.2f}%")
print(f"4. Most Popular Sport: {top_sport} ({top_sport_percentage:.2f}%)")
print("---------------------------------------------")

Dataset loaded successfully.

--- Results for the Initial Stats Table ---
1. Total Records (Bets): 100000
2. Total Unique Customers: 5000
3. Winning Bet Ratio: 36.45%
4. Most Popular Sport: Football (49.22%)
---------------------------------------------


**Initial Stats**

| Metric | Value | Business Insight |
| :--- | :--- | :--- |
| **Total Records** | 100,000 | With 100,000 bets (about 20 per customer on average), there is enough data to build per-customer aggregates. |
| **Unique Customers** | 5,000 | This is the final number of entities (rows) for our clustering model. |
| **Win/Loss Ratio** | 36.45% Win / 63.55% Loss | Roughly two in three bets lose. The platform's margin also depends on stakes and odds, so the win rate alone does not establish profitability, but it should be factored into retention. |
| **Top Sport** | Football (49.22%) | Football accounts for almost half of all bets, so sport preference is a natural candidate feature for targeted marketing. |

## 3. Data Preparation & Exploratory Data Analysis (EDA)

The goal now is to create a new **customer-level Dataframe** (`df_customer`) where each row is a unique user (`user_id`), and the columns are the behavioral metrics needed for segmentation.

**Step 1: The Plan, Segmentation Metrics**

We will create **7 high-impact features** (8 columns once the `user_id` key is included) by combining standard RFM concepts with gambling-specific risk metrics:

| Category | Feature Name | Aggregation Method | Variable Type |
| :--- | :--- | :--- | :--- |
| **Frequency (F)** | `total_bets` | Count of total bets placed by the customer | Numerical |
| **Monetary (M)** | `total_staked` | Sum of all money wagered (`stake`). | Numerical |
| **Profitability (GGR)** | `net_gain_loss` | Sum of the `gain` column (Net Gain/Loss for the platform). | Numerical |
| **Risk Profile** | `avg_odds` | The average odds the customer bets on (risk taken). | Numerical |
| **Efficiency** | `win_rate` | Mean of the `is_win` column (customer's percentage of winning bets). | Numerical |
| **Behavioral Type** | `multiple_bet_ratio` | Percentage of bets that were 'multiple' (higher risk) | Numerical |
| **Preference** | `top_sport` | The sport the customer bet on most. | Categorical |
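
Gross Gaming Revenue is conventionally stakes minus payouts to customers, so how `net_gain_loss` maps to GGR depends on what the `gain` column actually records, which the dataset description used here does not state. The sketch below shows both readings on toy values; the column names `ggr_if_gain_is_payout` and `ggr_if_gain_is_net_profit` are hypothetical.

```python
import pandas as pd

# Toy customer-level rows; values are illustrative, not taken from the dataset.
customers = pd.DataFrame({
    "user_id": [1, 2],
    "total_staked": [100.0, 50.0],
    "net_gain_loss": [80.0, 60.0],
})

# Reading A (ASSUMPTION): `gain` is the gross payout returned to the customer.
# GGR is then what the platform keeps: stakes minus payouts.
customers["ggr_if_gain_is_payout"] = (
    customers["total_staked"] - customers["net_gain_loss"]
)

# Reading B (ASSUMPTION): `gain` is already the customer's net profit or loss.
# GGR is then simply the negative of the customer's net result.
customers["ggr_if_gain_is_net_profit"] = -customers["net_gain_loss"]

print(customers)
```

Checking a few raw rows (for example, whether `gain` is zero or negative on losing bets) would indicate which reading applies before any segment is labelled as profitable or costly.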

**Step 2: Feature Engineering**

The following code performs the aggregation. This is the core of our data preparation phase.

In [2]:
import numpy as np

# 1. Define Aggregation Functions (The Heart of Feature Engineering)
agg_funcs = {
    # A. Frequency and Monetary
    'bet_id': 'count',
    'stake': 'sum',
    'gain': 'sum',
    # B. Risk and Efficiency
    'odds': 'mean',
    'is_win': 'mean',
    # C. Categorical Behaviour
    'bet_type': lambda x: (x == 'multiple').mean(), # Calculates the percentage of bets that were 'multiple'
    'sport': lambda x: x.mode()[0] if not x.mode().empty else 'N/A' # Finds the most frequently bet sport 
}

# 2. Perform Aggregation
# Group by 'user_id' and apply all defined functions
df_customer = df.groupby('user_id').agg(agg_funcs).reset_index()

# 3. Rename Columns for Clarity
df_customer.columns = [
    'user_id',
    'total_bets',
    'total_staked',
    'net_gain_loss',
    'avg_odds',
    'win_rate',
    'multiple_bet_ratio',
    'top_sport'
]

# 4. Show the 'Before and After' for the Report
print("Data Before Preparation (Transactional):\n")
print(f"Rows: {len(df):,} | Columns: {df.shape[1]}\n")
print(df[['user_id', 'stake', 'bet_type', 'is_win']].head().to_markdown(index=False))

print("\nData After Preparation (Customer-level):\n")
print(f"Rows: {len(df_customer):,} | Columns: {df_customer.shape[1]}\n")
print(df_customer.head().to_markdown(index=False))

Data Before Preparation (Transactional):

Rows: 100,000 | Columns: 9

|   user_id |   stake | bet_type   | is_win   |
|----------:|--------:|:-----------|:---------|
|      3848 |   13.65 | multiple   | False    |
|       153 |  248.45 | single     | False    |
|      1527 |    3.5  | single     | True     |
|      3903 |  151.45 | single     | False    |
|      2290 |  319.05 | single     | True     |

Data After Preparation (Customer-level):

Rows: 5,000 | Columns: 8

|   user_id |   total_bets |   total_staked |   net_gain_loss |   avg_odds |   win_rate |   multiple_bet_ratio | top_sport   |
|----------:|-------------:|---------------:|----------------:|-----------:|-----------:|---------------------:|:------------|
|         1 |           22 |         776.3  |          688.01 |    5.79909 |   0.318182 |             0.272727 | Football    |
|         2 |           14 |          70.75 |           11.06 |    4.44714 |   0.142857 |             0.5      | Football    |
|         3 |    

**Step 3: Exploratory Data Analysis (EDA) & Customer Insights**

Now that we have a clean, customer-level dataset (`df_customer`), we can perform an exploratory analysis to uncover interesting patterns in customer behavior. We will focus on finding insights related to customer value, betting style, and sport preference, and use visualizations to make each insight easy to understand.

In [3]:
# Import the Plotly libraries used for the interactive visualizations
import plotly.express as px
import plotly.graph_objects as go


**Insight 1: High-Risk Bettors Have a Lower Win Rate**

Next, we'll explore the relationship between a customer's betting style (risk appetite) and their success. We can analyze this by plotting their average odds (`avg_odds`) against their win rate (`win_rate`). The interactive scatter plot is perfect for identifying individual customers and outliers.

In [4]:
# Scatter Plot of Average Odds vs. Win Rate

# Define Segmentation Thresholds & Create Segments
median_odds = df_customer['avg_odds'].median()
median_win_rate = df_customer['win_rate'].median()

def assign_segment(row):
    if row['avg_odds'] < median_odds and row['win_rate'] >= median_win_rate:
        return 'Cautious & Successful'
    elif row['avg_odds'] >= median_odds and row['win_rate'] < median_win_rate:
        return 'High-Risk Bettor'
    elif row['avg_odds'] < median_odds and row['win_rate'] < median_win_rate:
        return 'Cautious & Unsuccessful'
    else:
        return 'High-Risk & Successful'

df_customer['segment'] = df_customer.apply(assign_segment, axis=1)


# Calculate Segment Percentages and define the order for the legend
segment_percentages = df_customer['segment'].value_counts(normalize=True) * 100

df_customer['segment_legend'] = df_customer['segment'].apply(
    lambda x: f"{x}: {segment_percentages[x]:.1f}%"
)

desired_order_base = [
    'Cautious & Successful',
    'High-Risk Bettor',
    'High-Risk & Successful',
    'Cautious & Unsuccessful'
]

legend_order_list = [f"{name}: {segment_percentages[name]:.1f}%" for name in desired_order_base]


# Create the Scatter Plot
fig = px.scatter(
    df_customer,
    x='avg_odds',
    y='win_rate',
    color='segment_legend',
    size='total_staked',
    category_orders={'segment_legend': legend_order_list},
    hover_name='user_id',
    hover_data={'segment': True, 'total_staked': ':.2f', 'net_gain_loss': ':.2f'},
    title='<b>Customer Segmentation by Betting Style</b>',
    labels={
        'avg_odds': 'Average Odds (Risk)',
        'win_rate': 'Win Rate (share of bets won)',  # win_rate is a 0-1 fraction, not a percentage
        'segment_legend': 'Customer Segment'
    }
)

# Add quadrant lines to visually separate the segments
fig.add_hline(y=median_win_rate, line_dash="dash", line_color="gray", annotation_text="Median Win Rate")
fig.add_vline(x=median_odds, line_dash="dash", line_color="gray", annotation_text="Median Odds")


# Final Layout Update and Display
fig.update_layout(
    height=650,
    title_x=0.5,
    title_y=0.9,
    legend_title='<b>Segment (% of Customers)</b>',
    legend=dict(
        yanchor="top",
        y=0.98,
        xanchor="right",
        x=0.98
    )
)
fig.show()

**Key Finding**: We have successfully segmented our entire customer base into four behavioral groups. The largest is the **'High-Risk Bettor'** group, representing **30.2% of all customers**. Because the quadrants are split at the median of each axis, this group has a below-median win rate by definition. The informative part is its size: if odds and win rate were unrelated, each quadrant would hold about 25% of customers, so 30.2% suggests that players who prefer high-risk, high-reward bets tend to win less often. This makes them a crucial and potentially highly profitable segment for the business. The segmentation gives us a data-driven foundation to create targeted marketing campaigns and personalized offers for each specific group.
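
The link between risk appetite and win rate can also be checked without quadrant thresholds, using a rank correlation over all customers. The sketch below runs on synthetic data (the `toy` frame is hypothetical); in the notebook the same two lines would be applied to `df_customer['avg_odds']` and `df_customer['win_rate']`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic customers (hypothetical): win rate falls roughly as 1 / odds, plus noise.
avg_odds = rng.uniform(1.5, 8.0, size=500)
win_rate = np.clip(1.0 / avg_odds + rng.normal(0.0, 0.05, size=500), 0.0, 1.0)
toy = pd.DataFrame({"avg_odds": avg_odds, "win_rate": win_rate})

# Spearman correlation, computed as the Pearson correlation of the ranks.
spearman_rho = toy["avg_odds"].rank().corr(toy["win_rate"].rank())

# Share of customers in the high-odds / low-win-rate quadrant.
# With unrelated variables, each median-split quadrant would hold about 25%.
high_risk_share = (
    (toy["avg_odds"] >= toy["avg_odds"].median())
    & (toy["win_rate"] < toy["win_rate"].median())
).mean()

print(f"Spearman rho: {spearman_rho:.2f} | high-risk quadrant share: {high_risk_share:.1%}")
```

A clearly negative coefficient on the real data would support the insight more directly than the quadrant sizes alone.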

**Insight 2: Customer Value is Highly Concentrated**

To understand how our customers' financial value to the business is distributed, we look at each customer's `net_gain_loss`. A positive `net_gain_loss` from the customer's perspective is a loss for the company, and vice versa. The interactive histogram will allow us to explore this distribution in detail.

In [5]:
# Visualization of Net Gain/Loss Distribution

# Create the histogram
fig = px.histogram(
    df_customer,
    x="net_gain_loss",
    nbins=50,
    title="<b>Distribution of Customer Net Gain/Loss</b>",
    labels={'net_gain_loss': 'Net Gain/Loss ($)'}
)
fig.update_traces(marker_color='lightgray', selector=dict(type='histogram'))

# Calculate key statistics to overlay on the plot
mean_val = df_customer['net_gain_loss'].mean()
median_val = df_customer['net_gain_loss'].median()
p95 = df_customer['net_gain_loss'].quantile(0.95)
p05 = df_customer['net_gain_loss'].quantile(0.05)
q01 = df_customer['net_gain_loss'].quantile(0.01) 
q99 = df_customer['net_gain_loss'].quantile(0.99) 


# Add vertical lines for mean, median, and key percentiles
fig.add_vline(x=mean_val, line_dash="solid", line_color="blue")
fig.add_vline(x=median_val, line_dash="solid", line_color="red")
fig.add_vline(x=p95, line_dash="dot", line_color="green")
fig.add_vline(x=p05, line_dash="dot", line_color="purple")

# Add empty, legend-only traces so that each vertical line gets a legend entry
fig.add_trace(go.Scatter(
    x=[None], y=[None],
    mode='lines',
    line=dict(color='red', width=2, dash='solid'),
    name=f'Median: ${median_val:,.2f}'
))
fig.add_trace(go.Scatter(
    x=[None], y=[None],
    mode='lines',
    line=dict(color='blue', width=2, dash='solid'),
    name=f'Mean: ${mean_val:,.2f}'
))
fig.add_trace(go.Scatter(
    x=[None], y=[None],
    mode='lines',
    line=dict(color='purple', width=2, dash='dot'),
    name=f'5th Percentile: ${p05:,.2f}'
))
fig.add_trace(go.Scatter(
    x=[None], y=[None],
    mode='lines',
    line=dict(color='green', width=2, dash='dot'),
    name=f'95th Percentile: ${p95:,.2f}'
))

# Update layout for a cleaner look and centered title
fig.update_layout(
    height=650,
    yaxis_title='Number of Customers',
    title_x=0.5, 
    xaxis_range=[q01, q99], 
    legend_title_text='<b>Key Metrics</b>', 
    legend=dict(
        yanchor="top",
        y=0.98,
        xanchor="right",
        x=0.98
    )
)

fig.show()

**Key Finding**: The distribution is highly skewed. The majority of customers have a small net gain/loss, while a **small group of "high-value" customers (those with significant losses) appears to account for a disproportionate share of our revenue**. We need to identify this small group because losing one of them has a much bigger impact than losing a hundred casual players.
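
The degree of concentration can be quantified rather than read off the histogram. A minimal sketch, assuming concentration is measured on the absolute value of `net_gain_loss`; the helper `top_share` and the toy series are hypothetical.

```python
import pandas as pd

def top_share(values: pd.Series, top_fraction: float = 0.05) -> float:
    """Share of the total absolute value held by the top `top_fraction` of customers."""
    magnitudes = values.abs().sort_values(ascending=False)
    n_top = max(1, int(len(magnitudes) * top_fraction))
    return magnitudes.iloc[:n_top].sum() / magnitudes.sum()

# Hypothetical example: one heavy customer among twenty.
toy = pd.Series([1000.0] + [10.0] * 19)
print(f"Top 5% share: {top_share(toy):.1%}")
```

On the real data, `top_share(df_customer['net_gain_loss'])` would give a single number to report next to the histogram.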

**Insight 3: Football Bettors Stake More Money**

This analysis investigates whether a customer's favorite sport influences their betting volume. We'll compare the `total_staked` across the most popular sports in the dataset using an interactive box plot, which is excellent for comparing distributions.

In [6]:
# Box Plot for Sport Preferences vs. Total Staked

# Isolate the Top 4 Sports for Clarity
top_sports = df_customer['top_sport'].value_counts().nlargest(4).index
df_top_sports = df_customer[df_customer['top_sport'].isin(top_sports)]

# Create the Definitive Box Plot
fig = px.box(
    df_top_sports,
    x='top_sport',
    y='total_staked',
    log_y=True,
    category_orders={'top_sport': list(top_sports)},  # pass a plain list rather than a pandas Index
    points='outliers',
    title='<b>Total Amount Staked by Customer\'s Favorite Sport</b>',
    labels={
        'top_sport': 'Customer\'s Favorite Sport',
        'total_staked': 'Total Amount Staked ($)'
    },
    color='top_sport'
)

# Add an Annotation to Guide the Viewer
fig.add_annotation(
    x='Football', 
    yref='paper', 
    y=0.9,       
    text="Note the higher density and value<br>of outliers for Football",
    showarrow=True,
    arrowhead=1,
    arrowsize=2,
    ax=135,        
    ay=10       
)


# Final Layout Update
fig.update_layout(
    height=650,
    title_x=0.5,
    showlegend=False,
    xaxis_title=None
)

fig.show()

**Key Finding**: The box plot suggests that **Football is our most lucrative sport**. While we have fans across all sports, customers whose favorite sport is Football show more, and larger, high-stake outliers. Marketing budget for promotions, ads, or special offers will therefore likely have the **highest return on investment if we target our Football bettors first**.
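
Because the difference is driven largely by outliers, it helps to put the median and the mean side by side for each sport. The sketch below uses a hypothetical toy frame; in the notebook it would run on `df_top_sports`.

```python
import pandas as pd

# Toy customer-level rows; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "top_sport": ["Football", "Football", "Football", "Tennis", "Tennis", "Basketball"],
    "total_staked": [120.0, 300.0, 5000.0, 150.0, 250.0, 200.0],
})

# Median is robust to a few very large bettors; mean and sum are not.
summary = (
    toy.groupby("top_sport")["total_staked"]
       .agg(customers="count", median_staked="median",
            mean_staked="mean", sum_staked="sum")
       .sort_values("median_staked", ascending=False)
)
print(summary)
```

If the medians are close while the means differ sharply, the sport-level gap comes from a handful of heavy bettors rather than from the typical customer, which would change how a promotion should be targeted.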

## 4. Modelling

Now that we have a solid understanding of our customer data, we will build models to segment them. As defined in our project scope, the goal is customer segmentation, so the appropriate modelling approach is **Clustering**.

We will build two models as required:

1. **Baseline Model**: Standard K-Means Clustering on the scaled data.
2. **Improved Model**: K-Means Clustering on data pre-processed with **PCA (Principal Component Analysis)**.

First, we import the necessary libraries from `sklearn` for modelling.

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

**Step 1: Data Preparation for Modelling**

Our `df_customer` dataframe has raw numerical values with very different scales (e.g., `total_staked` is in thousands, while `avg_odds` is in single digits). Clustering algorithms are highly sensitive to this, so we must **scale** our data first.

We also need to select only the features that describe behavior.

In [8]:
# Select the features for clustering (numerical features that define a customer's behavior and value)
features = [
    'total_bets',
    'total_staked',
    'net_gain_loss',
    'avg_odds',
    'win_rate',
    'multiple_bet_ratio'
]

# Create a new dataframe for modelling
df_modelling = df_customer[features].copy()

# Scale the data with StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_modelling)

# Display the results
print("Data successfully scaled. Shape:", df_scaled.shape)

Data successfully scaled. Shape: (5000, 6)
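
As a quick illustration of what the scaling step does, the sketch below checks on a toy array (hypothetical values) that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: a column in the thousands next to a column in single digits.
X = np.array([[100.0, 2.0], [2000.0, 4.0], [30000.0, 6.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score using the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances that K-Means relies on.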


**Step 2: Baseline Model (K-Means Clustering)**

**Why this method?** --> K-Means is our baseline because it's fast, efficient, and the most common, easily understood clustering algorithm.

First, we must find the optimal number of clusters ("K") using the **Elbow Method**.

- **Finding the Optimal 'K' (Elbow Method)** 

We'll plot the "inertia" (a measure of cluster tightness) for different values of K. We look for the "elbow" point where the benefit of adding more clusters decreases.

In [9]:
# Calculate inertia (Within-Cluster Sum of Squares) for K=1 to 10
inertia = []
K_range = range(1,11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)

# Plot the Elbow Method
fig = go.Figure(data=go.Scatter(x=list(K_range), y=inertia, mode='lines+markers'))
fig.add_vline(x=4, line_dash='dash', line_color='red', annotation_text='Elbow Point')

fig.update_layout(
    height=650,
    title='<b>Elbow Method for Optimal K</b>',
    xaxis_title='Number of Clusters (K)',
    yaxis_title='Inertia (WCSS)',
    title_x=0.5
)
fig.show()

**Analysis**: The plot shows a clear and sharp "elbow" at **K=4**. After this point, the curve flattens significantly, meaning adding more clusters (like 5 or 6) provides very little additional improvement. Therefore, 4 is the optimal number of segments for our baseline model.
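
Reading an elbow is partly subjective, so a second criterion can be used as a cross-check. The sketch below picks K by the highest average silhouette score on synthetic blobs (hypothetical data with three well-separated groups); in the notebook the loop would run on `df_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=42,
)

# The silhouette score needs at least two clusters, so K starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette: {best_k}")
```

If both criteria point to the same K on the real data, the choice of 4 is better supported; if they disagree, the candidates can be compared on business interpretability.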

- **Building the Baseline K-Means Model**

Now we run the model with our chosen K=4:

In [10]:
# Build Baseline Model
k_baseline = 4
kmeans_baseline = KMeans(n_clusters=k_baseline, init='k-means++', n_init=10, random_state=42)
kmeans_baseline.fit(df_scaled)

# Add the cluster labels back to our main customer dataframe
df_customer['cluster_baseline'] = kmeans_baseline.labels_

# Display results
print(f"Baseline model built with K={k_baseline}.\n")
print("Baseline Cluster Distribution:")
print(df_customer['cluster_baseline'].value_counts())

Baseline model built with K=4.

Baseline Cluster Distribution:
cluster_baseline
1    1633
3    1611
0    1445
2     311
Name: count, dtype: int64


**Step 3: Improved Model (PCA + K-Means)**

**Why this method?** --> Our dataset has 6 features, and some are likely correlated (e.g., `total_staked` and `total_bets`). Because K-Means relies on Euclidean distance, correlated features effectively count the same information more than once.

Using **Principal Component Analysis (PCA)** first is our "improved model" because:

1. It combines our 6 correlated features into a few uncorrelated "Principal Components".
2. It can reduce noise, which may lead to more stable and meaningful clusters.
3. It allows us to easily visualize the segments in 2D.

- **Apply PCA**

We will transform our 6 scaled features into just 2 Principal Components.

In [11]:
# Apply PCA
pca = PCA(n_components=2, random_state=42)
df_pca = pca.fit_transform(df_scaled)

# Create a dataframe from the PCA results for analysis and modelling
df_pca = pd.DataFrame(df_pca, columns=['PC1', 'PC2'])

# Display results comparing datasets
print(f"Original data shape: {df_scaled.shape}")
print(f"PCA-transformed data shape: {df_pca.shape}\n")

# Display variance analysis results
explained_variance = pca.explained_variance_ratio_
print(f"Variance explained by PC1: {explained_variance[0]:.2%}")
print(f"Variance explained by PC2: {explained_variance[1]:.2%}")
print(f"Total variance explained by 2 components: {explained_variance.sum():.2%}")

Original data shape: (5000, 6)
PCA-transformed data shape: (5000, 2)

Variance explained by PC1: 32.42%
Variance explained by PC2: 24.39%
Total variance explained by 2 components: 56.81%
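
Two components retain 56.81% of the variance, so a sizeable share of the information is discarded before clustering. One way to choose the number of components more deliberately is a cumulative explained-variance threshold. The sketch below uses synthetic data, and the 80% target is a hypothetical value chosen only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic 6-feature data with two strongly correlated columns (illustrative only).
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the explained variance.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target.
target = 0.80  # hypothetical threshold
n_components = int(np.searchsorted(cumulative, target) + 1)

print(np.round(cumulative, 3), n_components)
```

Two components remain convenient for plotting, but clustering on more components and using the 2D projection only for visualization is a common alternative.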


- **Build the Improved K-Means Model (on PCA data)**

Now, we run K-Means on the new PCA components (PC1 and PC2) instead of the original 6 features. We can use the same K=4 for a fair comparison.

In [12]:
# Build Improved Model
k_improved = 4
kmeans_improved = KMeans(n_clusters=k_improved, init='k-means++', n_init=10, random_state=42)
kmeans_improved.fit(df_pca)

# Add the new cluster labels back to our main dataframe
df_customer['cluster_improved'] = kmeans_improved.labels_
df_pca['cluster_improved'] = kmeans_improved.labels_

# Display results
print(f"Improved Model (PCA + K-Means) built with K={k_improved}.\n")
print("Improved Cluster Distribution:")
print(df_customer['cluster_improved'].value_counts())

Improved Model (PCA + K-Means) built with K=4.

Improved Cluster Distribution:
cluster_improved
1    1949
2    1864
3    1012
0     175
Name: count, dtype: int64
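
Cluster numbers are arbitrary, so the baseline and improved distributions cannot be compared label by label. A cross-tabulation and the Adjusted Rand Index (ARI) measure how much two segmentations agree regardless of numbering. The label arrays below are hypothetical; in the notebook the inputs would be `df_customer['cluster_baseline']` and `df_customer['cluster_improved']`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical label vectors for six customers.
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([1, 1, 0, 0, 2, 2])  # same grouping as labels_a, different numbering
labels_c = np.array([0, 1, 0, 1, 0, 1])  # a genuinely different grouping

# ARI is 1.0 for identical groupings and close to 0 (or negative) for unrelated ones.
ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)
print(ari_same, ari_diff)

# The cross-tabulation shows how customers move between the two segmentations.
print(pd.crosstab(labels_a, labels_c, rownames=["a"], colnames=["c"]))
```

A high ARI between the two models would mean PCA changed little; a low ARI would mean the two models describe different customer groupings and should be interpreted separately.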


**Step 4: Cluster Evaluation and Visualization**

This is the most critical step. A cluster model is only useful if we can assign a **business meaning** to each segment.

- **Analyze and Name the Clusters**

We will group our main dataframe by the new clusters and find the average value of each feature. This will give us the "personality" of each segment.

In [13]:
# Group by the improved cluster and find the average for each original feature
cluster_analysis = df_customer.groupby('cluster_improved')[features].mean().reset_index()

# Display results
print("Cluster Analysis (Averages):\n")
print(cluster_analysis.to_string())

Cluster Analysis (Averages):

   cluster_improved  total_bets  total_staked  net_gain_loss  avg_odds  win_rate  multiple_bet_ratio
0                 0   22.937143  13663.178000   14779.912686  4.715876  0.404565            0.345295
1                 1   18.903027   1358.005977    1235.515505  4.000515  0.423192            0.300365
2                 2   19.496245   1579.179399    1055.593122  5.458924  0.288381            0.402137
3                 3   22.532609   5219.135326    4907.558429  4.645433  0.384665            0.348200


**Analysis of Customer Segments**

Based on the mean values from the `cluster_analysis` table, we can define four different customer segments:

- **Cluster 0: "High-Stakes Winners"**
  This is the **most dangerous segment** for the company. They have the **highest** `total_bets` **(22.9)** and **by far the highest** `total_staked` **($13,663)**. With a `win_rate` of **40.5%**, the second highest of the four segments, their winnings are massive, costing the company an average of **$14,780 per customer**.
- **Cluster 1: "Smart & Cautious Winners"**
  This segment has the **lowest** `total_bets` **(18.9)**, **lowest** `total_staked` **($1,358)**, and the **lowest** `avg_odds` **(4.00)**. Because they play safely, they achieve the **highest** `win_rate` **(42.3%)** and win an average of **$1,236**.
- **Cluster 2: "Lottery-Style Players"**
  This segment is the **least damaging** to the company. They are defined by the **highest** `avg_odds` **(5.46)** and the **highest** `multiple_bet_ratio` **(40.2%)**. This high-risk "long shot" strategy results in the **lowest** `win_rate` **(28.8%)**, and therefore the **lowest** `net_gain_loss` **($1,056)**.
- **Cluster 3: "Frequent Winners"**
  This is a high-volume, high-cost segment. They have the second-highest `total_bets` **(22.5)** and `total_staked` **($5,219)**, with a `win_rate` of **38.5%**. They represent a significant, consistent cost to the business, winning an average of **$4,908 per customer**.

- **Create the Final Segment Visualization**

Now we map these descriptive names to our clusters and create the final plot.

In [14]:
# --- Create the Final Plot ---

# Build dictionary for clusters
cluster_name_map = {
    0: 'High-Stakes Winners',
    1: 'Smart & Cautious Winners',
    2: 'Lottery-Style Players',
    3: 'Frequent Winners'
}

# Calculate cluster percentages
segment_percentages = df_customer['cluster_improved'].value_counts(normalize=True) * 100

# Create a legend label column with names and percentages
df_customer['segment_legend'] = df_customer['cluster_improved'].apply(
    lambda x: f"{cluster_name_map[x]} ({segment_percentages[x]:.1f}%)"
)
df_pca['segment_legend'] = df_pca['cluster_improved'].apply(
    lambda x: f"{cluster_name_map[x]} ({segment_percentages[x]:.1f}%)"
)

# Define a color palette
color_palette = px.colors.qualitative.Set1

# Create the final scatter plot
fig = px.scatter(
    df_pca,
    x='PC1',
    y='PC2',
    color='segment_legend',          
    color_discrete_sequence=color_palette, 
    opacity=0.7,
    title='<b>Customer Segments (PCA + K-Means)</b>',
    labels={'segment_legend': 'Customer Segment (% Total)'} 
)

fig.update_layout(
    height=650,
    title_x=0.5,
    title_y=0.9,
    legend_title='<b>Customer Segment (% Total)</b>',
    legend=dict(
        yanchor="top",
        y=0.98,
        xanchor="right",
        x=0.98
    ) 
)
fig.show()

## 5. Evaluation

In this phase, we evaluate the performance of our "Improved Model" (PCA + K-Means) from both a technical and a business perspective.

**Step 1: Technical Performance (Silhouette Score)**

**Why this metric?** Since this is an unsupervised clustering problem, we cannot use metrics like "accuracy". Instead, we will use the **Silhouette Score**.

- This score measures how well-separated and dense our clusters are.
- A score close to **+1** is excellent (dense, well-separated clusters).
- A score close to **0** means the clusters overlap.
- A score close to **-1** means many points are probably assigned to the wrong cluster.
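
For reference, the standard definition behind these values is given below, where $a(i)$ is the mean distance from customer $i$ to the other customers in its own cluster, $b(i)$ is the mean distance from customer $i$ to the customers in the nearest other cluster, and $S$ is the average that `silhouette_score` reports.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)
```

When a customer is much closer to its own cluster than to the next one, $a(i) \ll b(i)$ and $s(i)$ approaches $+1$; when the two distances are similar, $s(i)$ is near $0$.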

We will now calculate the Silhouette Score for our `k=4` model using the same PCA data we used to build it.

In [15]:
from sklearn.metrics import silhouette_score

# Define the data and the labels
X_data = df_pca[['PC1', 'PC2']]
labels = df_pca['cluster_improved']

# Calculate the score
silhouette_avg = silhouette_score(X_data, labels)

# Display results
print(f"Technical Performance:\n")
print(f"Silhouette Score for k=4 Model --> {silhouette_avg:.3f}")

# Analysis of the score
if silhouette_avg > 0.5:
    print("Analysis --> This score is excellent, indicating the clusters are dense and very well-separated.")
elif silhouette_avg > 0.3:
    print("Analysis --> This is a moderate score, indicating reasonable structure with some overlap between clusters.")
else:
    print("Analysis --> The score is low, suggesting the clusters are weak and have significant overlap.")

Technical Performance:

Silhouette Score for k=4 Model --> 0.384
Analysis --> This is a moderate score, indicating reasonable structure with some overlap between clusters.
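
The score above is computed in the 2D PCA space, so it is not directly comparable with a score for the baseline model computed on all 6 scaled features. A fairer comparison scores both label sets in the same space. The sketch below illustrates this on synthetic blobs (hypothetical data); in the notebook, `df_scaled`, `kmeans_baseline.labels_` and `kmeans_improved.labels_` would take the place of the toy variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 6-feature data (illustrative only).
X, _ = make_blobs(n_samples=300, n_features=6, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2, random_state=42).fit_transform(X_scaled)

# Baseline: K-Means on all scaled features. Improved: K-Means on 2 components.
labels_full = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_scaled)
labels_pca = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_pca)

# Score both label sets in the SAME space (the full scaled feature space).
score_full = silhouette_score(X_scaled, labels_full)
score_pca_same_space = silhouette_score(X_scaled, labels_pca)

# For contrast: the PCA model scored in its own reduced space.
score_pca_own_space = silhouette_score(X_pca, labels_pca)

print(f"{score_full:.3f} {score_pca_same_space:.3f} {score_pca_own_space:.3f}")
```

Scores in a reduced space are often higher simply because distances in the discarded dimensions are ignored, which is why the same-space comparison is the more conservative one.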


**Step 2: Business Impact Analysis**

This is the most critical part of our evaluation. In Section 4, we analyzed the **average customer** in each segment. Now, we will analyze the **total financial impact** of each segment.

We need to answer: "Of all the money the company loses, what percentage comes from each segment?"

We will create a new summary table by grouping our customers by their `segment_legend` and calculating the **total sum** of their gains/losses.

In [16]:
# Define Aggregation Metrics
agg_metrics = {
    'user_id': 'count',
    'total_staked': 'sum',
    'net_gain_loss': 'sum'
}

# Create the Business Impact Table
df_impact = df_customer.groupby('segment_legend').agg(agg_metrics)

# Calculate Key Business Percentages
# Only segments with a positive net_gain_loss (a cost to the company) enter the denominator
total_company_loss = df_impact[df_impact['net_gain_loss'] > 0]['net_gain_loss'].sum()
df_impact['% of Total Cost'] = (df_impact['net_gain_loss'] / total_company_loss) * 100

# Calculate the average (mean) loss per user for the report
df_impact['avg_gain_loss_per_user'] = df_customer.groupby('segment_legend')['net_gain_loss'].mean()

# Reset the index to make 'segment_legend' a plottable column
df_impact_for_plotting = df_impact.reset_index()

# Get all the unique segment legend names
all_segments = df_impact_for_plotting['segment_legend'].unique()

# Define the colors
color_blue_dark = '#0d47a1'  
color_blue_light = '#42a5f5' 
color_gray = '#cccccc'     

# Build the color map
color_map = {}
for segment_name in all_segments:
    if 'Frequent Winners' in segment_name:
        color_map[segment_name] = color_blue_dark
    elif 'High-Stakes Winners' in segment_name:
        color_map[segment_name] = color_blue_light
    else:
        color_map[segment_name] = color_gray

# Chart 1: Total Financial Impact
fig_total_impact = px.bar(
    df_impact_for_plotting,
    x="segment_legend",
    y="net_gain_loss",
    color="segment_legend",
    color_discrete_map=color_map,
    title='<b>Total Financial Impact by Segment (The "What")</b>',
    labels={
        'segment_legend': "Customer Segment",
        'net_gain_loss': "Total Net Loss for Company ($)"
    },
    text='net_gain_loss'
)
fig_total_impact.update_xaxes(categoryorder="total descending") 
fig_total_impact.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')
fig_total_impact.update_layout(width=1000, height=600, title_x=0.5, title_y=0.9, xaxis_title=None, showlegend=False)
fig_total_impact.show()

# Chart 2: Average Impact per User
fig_avg_impact = px.bar(
    df_impact_for_plotting,
    x='segment_legend',
    y='avg_gain_loss_per_user',
    color='segment_legend', 
    color_discrete_map=color_map, 
    title='<b>Average Financial Impact per User (The "Why")</b>',
    labels={
        'segment_legend': 'Customer Segment',
        'avg_gain_loss_per_user': 'Average Net Loss per User ($)'
    },
    text='avg_gain_loss_per_user'
)
fig_avg_impact.update_xaxes(categoryorder="total descending")
fig_avg_impact.update_traces(texttemplate='$%{text:,.0f}', textposition='outside')
fig_avg_impact.update_layout(width=1000, height=600, title_x=0.5, title_y=0.9, xaxis_title=None, showlegend=False)
fig_avg_impact.show()

**Step 3: Translating Results Into Business Impact**

**So what?** The financial impact charts, based on our validated model, reveal a critical and highly actionable insight. The company's financial losses are driven by two distinct types of risk: a **high-volume**, **moderate-cost** segment and a **low-volume**, **high-risk** segment.

Here is the business-level breakdown of each segment, based on the final data:

- **"Frequent Winners" (20.2% of Customers)**
    - **Finding**: This segment is the **#1 financial liability in terms of total volume**. The "Total Impact" chart clearly shows they are responsible for the largest total loss, at **$4.97 Million**.
    - **The "Why"**: They are a large group of customers (1,012) who cost the company a significant **$4,908 per user**.
    - **Business Impact**: This segment represents the largest, most consistent financial drain (41.6% of all costs). Managing the risk of this large, active group is a top priority.
- **"High-Stakes Winners" (3.5% of Customers)**
    - **Finding**: This segment is the **#1 financial liability in terms of individual risk**. While their total cost is second ($2.59 Million), their danger lies in the average cost.
    - **The "Why"**: Each user in this segment costs the company an average of **$14,780**, about **three times** the next-highest segment average ($4,908 for the "Frequent Winners").
    - **Business Impact**: This is the "time bomb" segment. Although very small (only 175 customers), each user is incredibly expensive. This group must be the top priority for the **Risk Management** team to set individual limits and prevent catastrophic losses.
- **"Smart & Cautious Winners" (39.0% of Customers)**
    - **Finding**: This is the largest segment of customers (1,949 users) but is only the 3rd most costly, at **$2.41 Million**.
    - **The "Why"**: Their cautious betting style results in a lower average cost of **$1,236** per user.
    - **Business Impact**: This group represents a "cost of doing business." They are skilled and consistent, but not a primary financial threat.
- **"Lottery-Style Players" (37.3% of Customers)**
    - **Finding**: This is the least costly segment, with a total loss of **$1.97 Million**.
    - **The "Why"**: Their high-risk, low-win-rate strategy results in the lowest average cost per user ($1,056).
    - **Business Impact**: This is the "healthiest" segment for the company. Their behavior is the most favorable for the business's margins and should be encouraged with aligned marketing (e.g., jackpots, parlays).
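
The shares and ratios quoted in this breakdown can be re-derived from the reported totals. The block below is a small arithmetic check that uses only the rounded figures stated above (totals in $ millions, averages in $). The variable names are illustrative and are not part of the notebook.

```python
# Figures as reported in the breakdown above (rounded)
total_cost_musd = {
    "Frequent Winners": 4.97,
    "High-Stakes Winners": 2.59,
    "Smart & Cautious Winners": 2.41,
    "Lottery-Style Players": 1.97,
}
avg_cost_usd = {
    "Frequent Winners": 4908,
    "High-Stakes Winners": 14780,
    "Smart & Cautious Winners": 1236,
    "Lottery-Style Players": 1056,
}

# Share of total cost per segment
grand_total = sum(total_cost_musd.values())
share_of_cost = {name: cost / grand_total for name, cost in total_cost_musd.items()}
for name, share in share_of_cost.items():
    print(f"{name}: {share:.1%} of total cost")

# Ratio of the highest average cost per user to the second highest
ranked_avgs = sorted(avg_cost_usd.values(), reverse=True)
ratio_to_next = ranked_avgs[0] / ranked_avgs[1]
print(f"Highest average vs. next highest: {ratio_to_next:.2f}x")
```

With these inputs the "Frequent Winners" share comes out at 41.6%, and the "High-Stakes Winners" average is roughly three times the next-highest average, matching the figures in the text.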

## 6. Summary & Recommendations

**Summary of Key Findings**

Our analysis set out to segment our customer base to improve efficiency and manage risk. The K-Means clustering model produced four distinct, actionable segments, revealing that **not all customers are created equal**.

The key finding is that our financial losses are **not** spread evenly. They are heavily concentrated and driven by two different types of high-cost segments:

1. **A "High-Volume" Problem**: The **"Frequent Winners"** (20.2% of customers) are our largest total financial drain, costing the company **$4.97 Million** (41.6% of all costs) through consistent, high-frequency play.
2. **A "High-Risk" Problem**: The **"High-Stakes Winners"** (only 3.5% of customers) are our most dangerous individual risk. Each user in this segment costs **$14,780** on average: about three times the next-highest segment average.

The remaining 76% of our customers ("Smart & Cautious" and "Lottery-Style") are a much lower, more manageable cost of doing business. This model allows us to move from a one-size-fits-all strategy to a precise, segment-based approach.

**Specific Recommendations**

Based on these findings, we recommend the following segment-specific strategies to manage risk and optimize marketing spend:

- **For the "High-Stakes Winners" (The 3.5% "Time Bombs")**
    - **Action: Immediate Risk Management Review.** This entire segment (175 users) should be manually reviewed by the risk team.
    - **Action: Apply Strict Betting Limits.** Implement individual stake (max bet) limits and daily loss limits to cap the extreme financial risk they pose.
    - **Action: Marketing Exclusion.** Immediately exclude this segment from all bonus, free bet, and promotional campaigns. We are currently paying our most expensive customers to bet against us.
- **For the "Frequent Winners" (The 20.2% "Volume Drain")**
    - **Action: Proactive Monitoring.** This segment (1,012 users) should be monitored by the risk team for any users migrating into the "High-Stakes" profile.
    - **Action: Shift Marketing Strategy.** Instead of generic bonuses, offer promotions that encourage higher-risk, higher-margin bets (e.g., parlay bonuses, odds boosts on non-favorites) to move them away from their current "safe" winning behavior.
- **For the "Smart & Cautious Winners" (The 39.0% "Skilled Base")**
    - **Action: "Bet-Get" Campaigns.** This is the largest group (1,949 users). The goal is to increase their engagement and volume. Offer "Bet $10, Get $5" promotions to encourage more activity without exposing the company to significant new risk.
    - **Action: Cross-Sell.** Introduce them to new sports or bet types to diversify their behavior.
- **For the "Lottery-Style Players" (The 37.3% "Healthy Segment")**
    - **Action: Full Marketing Engagement.** This is our "healthiest" segment. Target them with all promotions related to jackpots, "long-shot" bets, and multi-bet (parlay) insurance. Their behavior is the most favorable to our bottom line and should be encouraged.
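
One way to put these recommendations into practice is a lookup table from segment to policy that a campaign tool can query. The block below is a sketch only. The `SegmentPolicy` fields, the campaign identifiers, and the `eligible_for` helper are all hypothetical names chosen for illustration, and the dictionary keys would need to match the actual `segment_legend` values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPolicy:
    manual_risk_review: bool   # whole segment reviewed by the risk team
    betting_limits: bool       # max-stake and daily loss limits applied
    monitor_migration: bool    # watch for drift toward the High-Stakes profile
    promotions: tuple          # campaign types the segment may receive


# Hypothetical encoding of the recommendations listed above
POLICIES = {
    "High-Stakes Winners": SegmentPolicy(True, True, False, ()),
    "Frequent Winners": SegmentPolicy(
        False, False, True, ("parlay_bonus", "odds_boost_non_favorites")
    ),
    "Smart & Cautious Winners": SegmentPolicy(
        False, False, False, ("bet_get", "cross_sell")
    ),
    "Lottery-Style Players": SegmentPolicy(
        False, False, False, ("jackpot", "long_shot", "parlay_insurance")
    ),
}


def eligible_for(segment: str, campaign: str) -> bool:
    """Return True if the segment's policy allows the given campaign type."""
    return campaign in POLICIES[segment].promotions


print(eligible_for("High-Stakes Winners", "bet_get"))
print(eligible_for("Lottery-Style Players", "jackpot"))
```

Keeping the rules in one table makes the marketing exclusion for "High-Stakes Winners" explicit and easy to audit.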

**"What If" Section: Next Steps & Future Scope**

This project has successfully segmented our customers based on their betting behavior. The next logical steps would be to enrich this model and make it predictive.

- **If we had more data...**
    - **Demographic Data (Age, Location)**: We could answer questions like, "Are our 'High-Stakes Winners' all from a specific region?" or "Are they all between 25 and 30 years old?" This would allow for much sharper marketing and risk profiling.
    - **Bet-Level Data (Timing, Device)**: We could see when and how these segments bet. Do "High-Stakes Winners" only bet late at night on a mobile device? This behavioral data is crucial for real-time risk flagging.
    - **Customer Lifecycle Data (Join Date)**: Are the "High-Stakes Winners" new users or loyal customers? This completely changes the business problem (is it a bad acquisition problem or a long-term risk management failure?).
- **What could be the next step in this project?**
    1. **Build a Predictive Model**: The next step is to use these segments as a target. We would build a **classification model (like Logistic Regression or a Random Forest)** to predict, at the moment a new user signs up, which of the four segments they are most likely to join.
    2. **Create Real-Time Alerts**: This would allow the risk team to be **proactive**, not reactive. The model could flag a new user as a "Potential High-Stakes Winner", allowing the team to apply betting limits before that user costs the company the segment's average of $14,780.
    3. **A/B Test Recommendations**: Implement our recommendations on a test group (e.g., remove bonuses for 50% of the "High-Stakes" group) and measure the financial impact against a control group over 30 days. This would quantify the dollar value of our segmentation model.
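
The first two next steps can be sketched together: train a classifier on the segment labels, then flag users whose predicted probability of being a "High-Stakes Winner" crosses a threshold. Everything below is illustrative. The data is synthetic, the `early_*` feature names and the labelling rules are hypothetical, and the 0.5 alert threshold is an assumption the risk team would need to tune. The features actually available at sign-up may be far more limited.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_users = 1000

# Hypothetical early-activity features (synthetic values, NOT project data)
X = pd.DataFrame({
    "early_avg_stake": rng.gamma(2.0, 25.0, n_users),
    "early_bet_count": rng.poisson(12, n_users),
    "early_win_rate": rng.beta(2, 3, n_users),
})


def synthetic_segment(row):
    """Assign a made-up segment label so the model has a signal to learn."""
    if row["early_avg_stake"] > 100:
        return "High-Stakes Winners"
    if row["early_bet_count"] > 15:
        return "Frequent Winners"
    if row["early_win_rate"] > 0.4:
        return "Smart & Cautious Winners"
    return "Lottery-Style Players"


y = X.apply(synthetic_segment, axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" compensates for the small High-Stakes class
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Real-time alert sketch: flag users likely to be High-Stakes Winners
ALERT_THRESHOLD = 0.5  # hypothetical cut-off
high_stakes_col = list(model.classes_).index("High-Stakes Winners")
proba = model.predict_proba(X_test)
flagged = X_test[proba[:, high_stakes_col] >= ALERT_THRESHOLD]

print(f"Hold-out accuracy: {accuracy:.2f}")
print(f"Users flagged for risk review: {len(flagged)} of {len(X_test)}")
```

Because the synthetic labels are a deterministic function of the features, the accuracy here says nothing about how well real segments can be predicted. On project data, the target would be `segment_legend`, and the hold-out evaluation should report per-class recall for the "High-Stakes Winners", since missing those users is the costly error.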