### Import the required libraries

In [21]:
import pandas as pd

In [22]:
# Load datasets
daily_login = pd.read_csv(r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 5\6_1_bharatcash_overall_daily_login.csv")
referral_users = pd.read_csv(r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 5\6_2_from_referral_bharatcash.csv")
rev_referral = pd.read_csv(r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 5\6_3_rev_from_referral_bharatcash.csv")
rev_overall = pd.read_csv(r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 5\6_4_rev_overall_bharatcash.csv")
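The absolute paths above tie the notebook to one machine. As a minimal sketch, assuming the four CSVs sit together in one folder, the loading step could be parameterized on a base directory. `DATA_DIR`, `FILES`, and `load_datasets` are hypothetical names introduced here for illustration; only the file names come from the cell above.

```python
from pathlib import Path

import pandas as pd

# Hypothetical folder holding the CSVs; adjust to wherever the files actually live.
DATA_DIR = Path("data")

# File names copied from the loading cell above, keyed by the variable names used there.
FILES = {
    "daily_login": "6_1_bharatcash_overall_daily_login.csv",
    "referral_users": "6_2_from_referral_bharatcash.csv",
    "rev_referral": "6_3_rev_from_referral_bharatcash.csv",
    "rev_overall": "6_4_rev_overall_bharatcash.csv",
}


def load_datasets(data_dir=DATA_DIR):
    """Read every CSV listed in FILES from data_dir into a dict of DataFrames."""
    return {name: pd.read_csv(Path(data_dir) / fname) for name, fname in FILES.items()}
```

Calling `load_datasets()` would then return a dictionary, so `load_datasets()["rev_overall"]` plays the role of `rev_overall` in the cells that follow.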

### Exploratory Data Analysis

In [23]:
# Basic Overview of the dataset
print("-"*50)
print("Overview of dataset")
print("-"*50)
print(f"Daily Login Records: {daily_login.shape[0]:,} rows")
print(f"Referral Users: {referral_users.shape[0]:,} rows")
print(f"Referral Revenue Records: {rev_referral.shape[0]:,} rows")
print(f"Overall Revenue Records: {rev_overall.shape[0]:,} rows\n")

--------------------------------------------------
Overview of dataset
--------------------------------------------------
Daily Login Records: 286,240 rows
Referral Users: 41,259 rows
Referral Revenue Records: 41,469 rows
Overall Revenue Records: 220,612 rows



### Data Cleaning & Enrichment

In [24]:
# Clean data
dfs = [daily_login, referral_users, rev_referral, rev_overall]
df_names = ["Daily Login", "Referral Users", "Referral Revenue", "Overall Revenue"]

print("DATA CLEANING SUMMARY")
print("-"*50)
for name, df in zip(df_names, dfs):
    initial = len(df)
    df.drop_duplicates(inplace=True)
    df.columns = df.columns.str.strip()
    print(f"{name}: Removed {initial - len(df)} duplicates")

DATA CLEANING SUMMARY
--------------------------------------------------
Daily Login: Removed 15318 duplicates
Referral Users: Removed 0 duplicates
Referral Revenue: Removed 6324 duplicates
Overall Revenue: Removed 50208 duplicates
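The loop above relies on `drop_duplicates(inplace=True)` mutating each DataFrame, which is why the cleaned data is visible outside the loop. A non-mutating variant might look like the sketch below; `clean_frame` is a hypothetical helper and the toy frame is made up, not taken from the BharatCash files.

```python
import pandas as pd


def clean_frame(df):
    """Return a cleaned copy of df and the number of duplicate rows removed."""
    cleaned = df.copy()
    # Strip stray whitespace from column names, as the loop above does.
    cleaned.columns = cleaned.columns.str.strip()
    before = len(cleaned)
    cleaned = cleaned.drop_duplicates().reset_index(drop=True)
    return cleaned, before - len(cleaned)


# Toy data: one exact duplicate row and a column name padded with spaces.
toy = pd.DataFrame({" adv_id ": ["a", "a", "b"], "revenue_in_usd": [0.1, 0.1, 0.2]})
cleaned, removed = clean_frame(toy)
print(f"Removed {removed} duplicates; columns: {list(cleaned.columns)}")
```

Returning a new frame keeps the raw data available for comparison, at the cost of holding two copies in memory.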


Since our objective is a segment-wise classification and comparison, the first step is to classify users into two groups: referral users and non-referral users.

In [25]:
# User classification
referral_ids = set(referral_users['referre_adv_id'].unique())
all_ids = set(rev_overall['adv_id'].unique())
non_referral_ids = all_ids - referral_ids

print("\nUSER CLASSIFICATION")
print("-"*50)
print(f"Total users with revenue: {len(all_ids):,}")
print(f"Referral users: {len(referral_ids):,} ({len(referral_ids)/len(all_ids):.1%})")
print(f"Non-referral users: {len(non_referral_ids):,} ({len(non_referral_ids)/len(all_ids):.1%})")


USER CLASSIFICATION
--------------------------------------------------
Total users with revenue: 36,780
Referral users: 40,836 (111.0%)
Non-referral users: 25,879 (70.4%)
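The 111.0% figure appears because `referral_ids` comes from the referral table while the denominator `all_ids` comes from the revenue table, and not every referred ID has a revenue record. From the printed counts, 36,780 - 25,879 = 10,901 referral users actually appear in the revenue data, which is about 29.6% of users with revenue. A minimal sketch of the set logic, using made-up IDs rather than the real columns:

```python
# Toy IDs standing in for the real data; only the set logic matters here.
referral_ids = {"u1", "u2", "u3", "u4"}    # everyone listed in the referral table
all_ids = {"u1", "u2", "u5", "u6", "u7"}   # everyone with a revenue record

# Keep only referral users who also appear in the revenue data,
# so both groups are subsets of all_ids and their shares sum to 100%.
referral_with_rev = referral_ids & all_ids
non_referral_ids = all_ids - referral_ids

print(f"Total users with revenue: {len(all_ids):,}")
print(f"Referral users with revenue: {len(referral_with_rev):,} "
      f"({len(referral_with_rev) / len(all_ids):.1%})")
print(f"Non-referral users: {len(non_referral_ids):,} "
      f"({len(non_referral_ids) / len(all_ids):.1%})")
```

The later merge is unaffected by this, since it only tests membership in `referral_ids` for IDs that already exist in the revenue table.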


To proceed, I mark each user (identified by `adv_id`) as referral (1) or non-referral (0).
I then merge this flag with the revenue data on `adv_id`.

In [26]:
# Merge revenue with classification
revenue_class = rev_overall.merge(
    pd.DataFrame({
        'adv_id': list(all_ids),
        'is_referral': [1 if x in referral_ids else 0 for x in all_ids]
    }), 
    on='adv_id'
)
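Because every `adv_id` in the revenue table is also in `all_ids`, the merge above keeps all revenue rows and simply attaches the flag. The same column could be derived without a helper frame using `Series.isin`. The sketch below assumes toy data; `revenue_flagged` is a hypothetical name chosen to avoid clashing with names used elsewhere in the notebook.

```python
import pandas as pd

# Toy revenue records: u1 has two rows, u5 has one.
rev_overall = pd.DataFrame({
    "adv_id": ["u1", "u1", "u5"],
    "revenue_in_usd": [0.1, 0.2, 0.3],
})
referral_ids = {"u1", "u2"}

# Flag each revenue row as referral (1) or non-referral (0) by ID membership.
revenue_flagged = rev_overall.assign(
    is_referral=rev_overall["adv_id"].isin(referral_ids).astype(int)
)
print(revenue_flagged)
```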

To simplify the analysis, I split the revenue records into referral and non-referral groups, then summarize each group's revenue with statistics such as the mean and median.
It is also useful to count how many records in each group show zero revenue.

In [27]:
# Revenue analysis
referral_rev = revenue_class[revenue_class['is_referral'] == 1]
non_referral_rev = revenue_class[revenue_class['is_referral'] == 0]

print("The Revenue Summary of Referral Users")
print("-"*50)
print(referral_rev['revenue_in_usd'].describe().apply(lambda x: f"{x:,.2f}"))
print(f"\nZero-revenue users: {(referral_rev['revenue_in_usd'] == 0).sum():,}")

print("The Revenue Summary of Non-Referral Users")
print("-"*50)
print(non_referral_rev['revenue_in_usd'].describe().apply(lambda x: f"{x:,.2f}"))
print(f"\nZero-revenue users: {(non_referral_rev['revenue_in_usd'] == 0).sum():,}")

The Revenue Summary of Referral Users
--------------------------------------------------
count    39,389.00
mean          0.20
std           1.19
min           0.00
25%           0.03
50%           0.06
75%           0.13
max         200.73
Name: revenue_in_usd, dtype: object

Zero-revenue users: 120
The Revenue Summary of Non-Referral Users
--------------------------------------------------
count    131,015.00
mean           0.19
std            0.69
min            0.00
25%            0.02
50%            0.07
75%            0.13
max           39.19
Name: revenue_in_usd, dtype: object

Zero-revenue users: 473
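The two summaries can also be produced in one pass by grouping on the flag, which puts both groups side by side. This is a sketch on made-up values, assuming a frame with the `is_referral` and `revenue_in_usd` columns used above.

```python
import pandas as pd

# Toy records: two referral rows and three non-referral rows.
toy = pd.DataFrame({
    "is_referral": [1, 1, 0, 0, 0],
    "revenue_in_usd": [0.0, 0.5, 0.1, 0.0, 2.0],
})

# One describe() per group, returned as a single table indexed by is_referral.
summary = toy.groupby("is_referral")["revenue_in_usd"].describe()

# Count zero-revenue rows per group by summing a boolean mask within each group.
zero_counts = toy["revenue_in_usd"].eq(0).groupby(toy["is_referral"]).sum()

print(summary.round(2))
print(zero_counts)
```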


We’re first defining a function to classify each user’s revenue into four segments - Zero, Low, Medium, and High.
Then, for referral and non-referral users separately, we calculate and display what percentage of users fall into each revenue segment.

In [28]:
# Categorize a single revenue value into a segment.
# Named classify_revenue so it does not overwrite the revenue_class DataFrame from In [26].
def classify_revenue(value):
    if value == 0:
        return 'Zero'
    elif 0 < value < 1:
        return 'Low (<$1)'
    elif 1 <= value <= 5:
        return 'Medium ($1-5)'
    else:
        return 'High (>$5)'

# Segment labels in display order
segments = ['Zero', 'Low (<$1)', 'Medium ($1-5)', 'High (>$5)']

# Return the share of rows in each segment as formatted percentages
def format_segment(df):
    if df.empty or 'revenue_in_usd' not in df.columns:
        return pd.Series({s: "N/A" for s in segments})
    
    counts = df['revenue_in_usd'].apply(classify_revenue).value_counts()
    total = counts.sum()
    if total == 0:
        return pd.Series({s: "0.0%" for s in segments})
    return pd.Series({s: f"{counts.get(s, 0)/total:.1%}" for s in segments})

print("\nRevenue Segment Distribution (Sorted)")
print("-"*50)
print("Referral Users:")
print(format_segment(referral_rev).to_string())

print("\nNon-Referral Users:")
print(format_segment(non_referral_rev).to_string())


Revenue Segment Distribution (Sorted)
--------------------------------------------------
Referral Users:
Zero              0.3%
Low (<$1)        96.1%
Medium ($1-5)     3.3%
High (>$5)        0.2%

Non-Referral Users:
Zero              0.4%
Low (<$1)        97.1%
Medium ($1-5)     2.1%
High (>$5)        0.5%
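Applying a Python function row by row can be slow on a few hundred thousand rows. A vectorized sketch of the same four segments using `numpy.select` is shown below. `segment_shares` is a hypothetical helper, and one difference is an assumption worth noting: values matching no condition (negative or missing) are labeled "Other" and dropped from the result, whereas the row-wise function above would place them in the High segment.

```python
import numpy as np
import pandas as pd

segments = ['Zero', 'Low (<$1)', 'Medium ($1-5)', 'High (>$5)']


def segment_shares(revenue):
    """Return the fraction of values in each revenue segment, in display order."""
    conditions = [
        revenue == 0,
        (revenue > 0) & (revenue < 1),
        (revenue >= 1) & (revenue <= 5),
        revenue > 5,
    ]
    # Negative or missing values match no condition and fall into "Other".
    labels = pd.Series(np.select(conditions, segments, default="Other"),
                       index=revenue.index)
    return labels.value_counts(normalize=True).reindex(segments, fill_value=0.0)


# Toy values covering each boundary: 0, below 1, exactly 1, exactly 5, above 5.
toy = pd.Series([0.0, 0.5, 1.0, 5.0, 7.5])
shares = segment_shares(toy)
print(shares.map(lambda x: f"{x:.1%}").to_string())
```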


We can now calculate the Revenue Per User (RPU) for both referral and non-referral users.
It is also useful to find how much higher or lower the referral RPU is than the non-referral RPU, in both absolute and percentage terms.

In [29]:
# RPU calculation
referral_rpu = referral_rev['revenue_in_usd'].sum() / len(referral_rev)
non_referral_rpu = non_referral_rev['revenue_in_usd'].sum() / len(non_referral_rev)
rpu_diff = referral_rpu - non_referral_rpu
pct_diff = (rpu_diff / non_referral_rpu) * 100
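One caveat on the denominators: `len(referral_rev)` and `len(non_referral_rev)` count revenue records, not distinct users. The cleaned revenue table has 170,404 rows for 36,780 distinct `adv_id`s, so the ratios above are averages per record. If a per-user figure is wanted, revenue can be summed per `adv_id` first. The sketch below uses made-up rows and the hypothetical names `revenue_flagged`, `per_user`, and `rpu`. Taken at face value, the totals and user counts printed in this notebook would give roughly 7,937.75 / 10,901 ≈ \$0.73 per referral user and 24,630.98 / 25,879 ≈ \$0.95 per non-referral user, so the per-user view is worth checking on the real data.

```python
import pandas as pd

# Toy records: users can have several revenue rows each.
revenue_flagged = pd.DataFrame({
    "adv_id": ["u1", "u1", "u2", "u5", "u5", "u5", "u6"],
    "is_referral": [1, 1, 1, 0, 0, 0, 0],
    "revenue_in_usd": [0.10, 0.30, 0.20, 0.05, 0.05, 0.10, 0.40],
})

# Step 1: collapse to one row per user by summing that user's revenue.
per_user = (
    revenue_flagged
    .groupby(["is_referral", "adv_id"], as_index=False)["revenue_in_usd"]
    .sum()
)

# Step 2: summarize per group, so every user counts once in the denominator.
rpu = per_user.groupby("is_referral")["revenue_in_usd"].agg(
    users="count", total="sum", rpu="mean", median="median"
)
print(rpu.round(4))

# For comparison: the per-record average, which is what dividing by len() gives.
per_record = revenue_flagged.groupby("is_referral")["revenue_in_usd"].mean()
print(per_record.round(4))
```

In this toy example both groups have the same revenue per user, while the per-record averages differ simply because one group has more rows per user.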

We’re printing a side-by-side comparison of key revenue metrics (total users, total revenue, RPU, median, zero-revenue %) for referral vs non-referral users.
Then we clearly show the difference in RPU and highlight which group performs better in terms of revenue per user

In [30]:
# Final comparison
print("\n" + "-"*50)
print("FINAL RPU COMPARISON")
print("-"*50)
print(f"{'Metric':<25} | {'Referral':>12} | {'Non-Referral':>12}")
print("-"*50)
print(f"{'Total Users':<25} | {len(referral_rev):>12,} | {len(non_referral_rev):>12,}")
print(f"{'Total Revenue':<25} | ${referral_rev['revenue_in_usd'].sum():>12,.2f} | ${non_referral_rev['revenue_in_usd'].sum():>12,.2f}")
print(f"{'Avg Revenue (RPU)':<25} | ${referral_rpu:>12.4f} | ${non_referral_rpu:>12.4f}")
print(f"{'Median Revenue':<25} | ${referral_rev['revenue_in_usd'].median():>12.4f} | ${non_referral_rev['revenue_in_usd'].median():>12.4f}")
print(f"{'Zero Revenue Users':<25} | {((referral_rev['revenue_in_usd'] == 0).sum()/len(referral_rev)):>12.1%} | {((non_referral_rev['revenue_in_usd'] == 0).sum()/len(non_referral_rev)):>12.1%}")

print("\n" + "-"*50)
print(f"RPU DIFFERENCE: ${rpu_diff:.4f} ({pct_diff:+.2f}%)")
if rpu_diff > 0:
    print("Referral users generate more revenue per user")
else:
    print("Non-referral users generate more revenue per user")


--------------------------------------------------
FINAL RPU COMPARISON
--------------------------------------------------
Metric                    |     Referral | Non-Referral
--------------------------------------------------
Total Users               |       39,389 |      131,015
Total Revenue             | $    7,937.75 | $   24,630.98
Avg Revenue (RPU)         | $      0.2015 | $      0.1880
Median Revenue            | $      0.0600 | $      0.0734
Zero Revenue Users        |         0.3% |         0.4%

--------------------------------------------------
RPU DIFFERENCE: $0.0135 (+7.19%)
Referral users generate more revenue per user


### Main Takeaway

### 1. **Overview**

This analysis compares the **Revenue Per User (RPU)** between users acquired via **referral programs** and those acquired through **non-referral or organic channels**.


### 2. **Key Insights**

| Metric                 | Referral     | Non-Referral | Interpretation                                                                                                                                                      |
| ---------------------- | ------------ | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
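| **Total Users**        | 39,389       | 131,015      | These counts are revenue records (rows) after deduplication, not distinct users: 36,780 distinct `adv_id`s have revenue. Referral records are \~23% of the total; the non-referral share is significantly larger. |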
| **Total Revenue**      | \$7,937.75   | \$24,630.98  | Referral users contribute \~24.4% of the revenue, close to their proportion of users.                                                                               |
| **Avg Revenue (RPU)**  | **\$0.2015** | **\$0.1880** | Referral users generate **7.19% more revenue per user** than non-referrals.                                                                                         |
| **Median Revenue**     | \$0.0600     | \$0.0734     | Despite a higher average RPU, the **median is lower for referrals**, suggesting a **skewed distribution** (a few high-value referral users pulling up the average). |
| **Zero Revenue Users** | 0.3%         | 0.4%         | Negligible difference - both channels have near-complete monetization.                                                                                              |


### 3. **Strategic Interpretation**

* **Referral programs are incrementally more effective** in driving revenue per user on average.
* **High-value users** in the referral segment are likely responsible for the average lift, despite a lower median.
* The referral strategy appears to attract users who, while not consistently high spenders, include **a few power users**.
* Since total revenue and total users are roughly proportional between channels, **referrals are not diluting monetization efficiency**.
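The idea that a few high-value users lift the referral average can be checked directly by measuring how much of a group's revenue comes from its top 1% of values. The sketch below is an illustration on made-up numbers; `top_share` is a hypothetical helper, and the 99th-percentile cutoff is an arbitrary choice.

```python
import pandas as pd


def top_share(revenue, quantile=0.99):
    """Share of total revenue contributed by values at or above the given quantile."""
    total = revenue.sum()
    if total == 0:
        return 0.0
    cutoff = revenue.quantile(quantile)
    return revenue[revenue >= cutoff].sum() / total


# Toy distribution: 99 small values and one very large one.
toy = pd.Series([0.05] * 99 + [20.0])
print(f"Mean: {toy.mean():.4f}, median: {toy.median():.4f}")
print(f"Top 1% share of revenue: {top_share(toy):.1%}")
```

Running this separately on the referral and non-referral revenue columns would show whether the gap between mean and median is really driven by a handful of large values.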


### 4. **Recommendations**

* **Double down on referrals**, especially if cost per acquisition is lower than other channels.
* **Segment referral users** further - identify and profile high-RPU cohorts to refine targeting.
* Address the **long-tail of low or median spenders** through lifecycle nudges, upselling, or engagement-based incentives.


### 5. **Bottom Line**

> **Referral users generate 7.2% more revenue per user than non-referral users, with skew driven by a few high earners. The channel is performing well and merits further investment.**
