## Set Up the Data for Analysis

In [49]:
# Import necessary libraries
import pandas as pd
import requests

First, we fetch the country data from the API and build two dictionaries: one mapping numeric country IDs to country codes, and one mapping country codes to country names.

In [50]:
# Download and prepare country mapping data
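url = "https://dashboard.pubscale.com/json/country.json"
response = requests.get(url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
country_data = response.json()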

# Create mapping dictionaries
id_to_code = {}
for country in country_data:
    if isinstance(country.get('id'), int):
        id_to_code[country['id']] = country['code']

code_to_name = {}
for country in country_data:
    if 'code' in country:
        code_to_name[country['code']] = country['name']
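
The two loops above can also be written as dictionary comprehensions. The sketch below is one possible form, not the notebook's own code. It uses a small hand-made `sample_countries` list in place of the API response; the record layout (`id`, `code`, `name` keys) is inferred from the loops above, and the sample values are illustrative assumptions rather than data returned by the API.

```python
# Hypothetical sample standing in for the JSON returned by the country endpoint.
sample_countries = [
    {"id": 1, "code": "IN", "name": "India"},
    {"id": 2, "code": "ID", "name": "Indonesia"},
    {"id": "n/a", "code": "XX", "name": "Unknown"},  # non-integer id: left out of id_to_code
    {"id": 3, "name": "No code"},                    # no 'code' key: left out of both maps
]

# Numeric ID -> country code, keeping only records with an integer id and a code.
id_to_code = {
    c["id"]: c["code"]
    for c in sample_countries
    if isinstance(c.get("id"), int) and "code" in c
}

# Country code -> country name, keeping only records that carry both keys.
code_to_name = {
    c["code"]: c["name"]
    for c in sample_countries
    if "code" in c and "name" in c
}

print(id_to_code)
print(code_to_name)
```

Checking for both keys in each comprehension avoids a `KeyError` if a record is missing `code` or `name`, which the original loops would raise.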

Now we can load the user signup data, remove rows with a missing adv_id, drop duplicate records, and map country codes to country names for clarity.

In [51]:
# Load and clean user_signup_location data
signup_path = r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_4_user_signup_location.csv"
signup_df = pd.read_csv(signup_path)

# Clean data
signup_df = signup_df.dropna(subset=['adv_id'])
signup_df = signup_df.drop_duplicates()
signup_df['signup_country_name'] = signup_df['country_code'].map(code_to_name)

Next, we repeat the process for the click-country data: load it, remove incomplete rows, map numeric country IDs to ISO codes, and then map those codes to readable country names for analysis.

In [52]:
# Load and clean clicks_country data
clicks_country_path = r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_1_clicks_country.csv"
clicks_country_df = pd.read_csv(clicks_country_path)

The cleaning and mapping steps for the click-country data follow.

In [53]:
# Clean and map country codes
clicks_country_df = clicks_country_df.dropna()
clicks_country_df['country_code'] = clicks_country_df['country_code'].astype(int)
clicks_country_df['click_country_code'] = clicks_country_df['country_code'].map(id_to_code)
clicks_country_df = clicks_country_df.dropna(subset=['click_country_code'])
clicks_country_df['click_country_name'] = clicks_country_df['click_country_code'].map(code_to_name)
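
Because rows whose numeric ID is not in the lookup are dropped silently, it can help to count them first. The following is a minimal sketch of the same two-step mapping on made-up rows; the lookup dictionaries and the `clicks` frame are hypothetical stand-ins for `id_to_code`, `code_to_name`, and `clicks_country_df`.

```python
import pandas as pd

# Hypothetical lookup tables and click rows, for illustration only.
id_to_code = {1: "IN", 2: "ID"}
code_to_name = {"IN": "India", "ID": "Indonesia"}
clicks = pd.DataFrame(
    {"click_id": ["a", "b", "c"], "country_code": [1.0, 2.0, 99.0]}
)

# Step 1: numeric ID -> country code.
clicks["country_code"] = clicks["country_code"].astype(int)
clicks["click_country_code"] = clicks["country_code"].map(id_to_code)

# IDs absent from the lookup map to NaN; count them before dropping.
unmapped = clicks["click_country_code"].isna().sum()
print(f"Unmapped country IDs: {unmapped}")

# Step 2: drop unmapped rows, then country code -> country name.
clicks = clicks.dropna(subset=["click_country_code"]).copy()
clicks["click_country_name"] = clicks["click_country_code"].map(code_to_name)
print(clicks)
```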

Similarly, we load the click-level data, standardize the column names by stripping extra whitespace, and remove records without an adv_id.

In [54]:
# Load and clean clicks data
clicks_path = r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_2_clicks.csv"
clicks_df = pd.read_csv(clicks_path)

# Clean data: strip whitespace from column names first so 'adv_id' resolves reliably
clicks_df.columns = clicks_df.columns.str.strip()
clicks_df = clicks_df.dropna(subset=['adv_id'])

## Data Cleaning & Enrichment

Let us join click data with country info using click_id to enrich each click with its corresponding country details.

In [55]:
# Merge clicks and clicks_country data
merged_clicks = pd.merge(
    clicks_df,
    clicks_country_df[['click_id', 'click_country_code', 'click_country_name']],
    on='click_id',
    how='left'
)

Let us further enrich the dataset by merging in the signup location details on adv_id, so that click and signup geographies can be compared.

In [56]:
# Merge with signup location data
final_df = pd.merge(
    merged_clicks,
    signup_df[['adv_id', 'country_code', 'signup_country_name']],
    on='adv_id',
    how='left'
)
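
A left merge on adv_id repeats a click row once for every matching signup row, so the click count can grow if an adv_id has more than one signup record. The sketch below, on made-up frames, shows how the `validate` argument of `pd.merge` can surface that case. Keeping the first signup row per adv_id is only one possible resolution and is an assumption here, not something the notebook does.

```python
import pandas as pd

# Hypothetical data: user "u2" has two signup rows with different countries.
clicks = pd.DataFrame({"click_id": [1, 2, 3], "adv_id": ["u1", "u1", "u2"]})
signups = pd.DataFrame(
    {"adv_id": ["u1", "u2", "u2"], "country_code": ["IN", "IN", "US"]}
)

# validate="many_to_one" raises MergeError when the right-hand keys are not unique.
try:
    pd.merge(clicks, signups, on="adv_id", how="left", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False
print(f"Signup keys unique: {merge_ok}")

# Assumed resolution for this sketch: keep the first signup row per adv_id.
signups_unique = signups.drop_duplicates(subset="adv_id", keep="first")
merged = pd.merge(
    clicks, signups_unique, on="adv_id", how="left", validate="many_to_one"
)
print(len(merged))  # same number of rows as clicks
```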

While we are at it, let us rename the signup country code column for clearer distinction between click and signup locations.

In [57]:
# Rename columns for clarity
final_df = final_df.rename(columns={
    'country_code': 'signup_country_code'
})

We also need to flag rows where the signup and click country codes do not match, indicating potential location discrepancies.

In [58]:
# Identify discrepancies
final_df['is_discrepancy'] = final_df['signup_country_code'] != final_df['click_country_code']
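
One caveat with a plain `!=` comparison in pandas is that a missing value never compares equal, so rows left without a click or signup country after the left merges are flagged as well. The sketch below uses made-up rows and assumes that only rows with both codes present should count as location mismatches. That is an assumption; the figures reported in this notebook come from the plain comparison.

```python
import pandas as pd

# Hypothetical rows: one match, one true mismatch, two rows with a missing code.
df = pd.DataFrame({
    "signup_country_code": ["IN", "IN", None, "IN"],
    "click_country_code": ["IN", "ID", "ID", None],
})

# Plain inequality: a row with a missing value on either side compares as "not equal".
naive = df["signup_country_code"] != df["click_country_code"]

# Stricter variant: require both codes to be present before calling it a mismatch.
both_known = df["signup_country_code"].notna() & df["click_country_code"].notna()
strict = both_known & naive

print(naive.tolist())
print(strict.tolist())
```

Comparing `naive.sum()` with `strict.sum()` on the real data would show how many flagged rows are driven by missing values rather than by two different countries.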

Next, we keep only the mismatched records and export them to a CSV file for further investigation.

In [59]:
# Save results for analysis
discrepancies_df = final_df[final_df['is_discrepancy']]
discrepancies_df.to_csv("country_discrepancies.csv", index=False)

## Exploratory Data Analysis

Let us start with a quick summary that quantifies total clicks, mismatches, and the overall discrepancy rate.

In [60]:
# Basic analysis
print(f"Total clicks: {len(final_df)}")
print(f"Clicks with discrepancies: {len(discrepancies_df)}")
print(f"Discrepancy rate: {len(discrepancies_df)/len(final_df):.2%}")

Total clicks: 206855
Clicks with discrepancies: 24573
Discrepancy rate: 11.88%


Let's first identify the 10 most frequent mismatched signup-click country pairs to highlight common discrepancy patterns.

In [61]:
# Top discrepancy country pairs
top_pairs = discrepancies_df.groupby(['signup_country_name', 'click_country_name']).size().nlargest(10)
print("\nTop 10 country discrepancy pairs:")
print(top_pairs)


Top 10 country discrepancy pairs:
signup_country_name  click_country_name  
India                Indonesia               7788
                     Philippines             6011
                     United Arab Emirates    3000
                     United States           2994
                     Thailand                1197
                     Malaysia                 931
                     Vietnam                  931
                     France                   551
                     Mexico                   449
                     Russia                   136
dtype: int64


It is also useful to list the top 10 users (by adv_id) with the highest number of country mismatches to flag potential anomalies.

In [62]:
# Users with most discrepancies
top_users = discrepancies_df.groupby('adv_id').size().nlargest(10)
print("\nTop 10 users with most discrepancies:")
print(top_users)


Top 10 users with most discrepancies:
adv_id
0b255e12-79a0-496c-8be7-ef0753cf38d6    151
0d95cd91-7a7f-423d-a525-c988bfddb743    151
12201644-eea8-486d-b01c-31a42e2b6125    151
1270700e-7782-446e-bdf1-19388053dd37    151
13f9b106-72d9-475c-b31c-ef7fe89c62c0    151
1c997f04-63a6-49d9-94d5-8e22cc83e1e0    151
2d60ae46-5a96-414e-83a1-05990fa3f836    151
2ff218b8-3c68-4409-9501-b9a80d46c7f6    151
44aa26f3-9bb9-4f1d-8f56-d874af291b7a    151
4ae830e1-97c9-4ac4-bf8f-fa46f7c35755    151
dtype: int64


It also makes sense to analyze each user's click volume and mismatch rate, and to flag highly active users (more than 100 clicks) whose discrepancy rate exceeds 80% as potentially suspicious.

In [63]:
# Calculate click frequency
click_frequency = final_df.groupby('adv_id')['click_id'].count().reset_index(name='total_clicks')

# Merge with discrepancies
user_engagement = pd.merge(
    click_frequency,
    discrepancies_df.groupby('adv_id')['click_id'].count().reset_index(name='discrepant_clicks'),
    on='adv_id',
    how='left'
).fillna(0)

# Identify hyperactive users
hyperactive_users = user_engagement[
    (user_engagement['total_clicks'] > 100) & 
    (user_engagement['discrepant_clicks']/user_engagement['total_clicks'] > 0.8)
]
print(f"\nHyperactive suspicious users: {len(hyperactive_users)}")
print(f"Avg clicks: {hyperactive_users['total_clicks'].mean():.1f}")
# Note: this is the ratio of the two means (a click-weighted rate), not the mean of per-user rates
print(f"Avg discrepancy rate: {hyperactive_users['discrepant_clicks'].mean()/hyperactive_users['total_clicks'].mean():.2%}")


Hyperactive suspicious users: 37
Avg clicks: 170.0
Avg discrepancy rate: 88.82%
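
The "Avg discrepancy rate" printed above divides the mean of discrepant clicks by the mean of total clicks, which weights each user by click volume. That can differ from the average of the per-user rates. A small sketch with two made-up users (the numbers are illustrative only, not taken from the data) shows the difference:

```python
# Two hypothetical users with different click volumes.
users = [
    {"total_clicks": 200, "discrepant_clicks": 200},  # per-user rate: 100%
    {"total_clicks": 120, "discrepant_clicks": 100},  # per-user rate: about 83.3%
]

# Ratio of means (equivalently, ratio of sums): weighted by click volume.
pooled_rate = (
    sum(u["discrepant_clicks"] for u in users)
    / sum(u["total_clicks"] for u in users)
)

# Mean of per-user rates: every user counts equally.
mean_of_rates = sum(
    u["discrepant_clicks"] / u["total_clicks"] for u in users
) / len(users)

print(f"Pooled rate: {pooled_rate:.2%}")
print(f"Mean of per-user rates: {mean_of_rates:.2%}")
```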


Let us also detect users with click activity in more than one country and list the top mismatched signup-click patterns to uncover cross-border behavior trends.

In [64]:
# Identify users active in >1 country
user_countries = final_df.groupby('adv_id').agg(
    signup_country=('signup_country_code', 'first'),
    unique_click_countries=('click_country_code', 'nunique')
)

multi_country_users = user_countries[user_countries['unique_click_countries'] > 1]
print(f"\nUsers active in multiple countries: {len(multi_country_users)}")
print(f"Percentage of total users: {len(multi_country_users)/len(user_countries):.2%}")

# Top country-hopping patterns
country_hops = final_df.groupby(['signup_country_code', 'click_country_code']).size().reset_index(name='count')
country_hops = country_hops[country_hops['signup_country_code'] != country_hops['click_country_code']]
print("\nTop 10 country-hop patterns:")
print(country_hops.sort_values('count', ascending=False).head(10))


Users active in multiple countries: 489
Percentage of total users: 1.30%

Top 10 country-hop patterns:
   signup_country_code click_country_code  count
60                  IN                 ID   7788
68                  IN                 PH   6011
53                  IN                 AE   3000
73                  IN                 US   2994
72                  IN                 TH   1197
74                  IN                 VN    931
64                  IN                 MY    931
58                  IN                 FR    551
63                  IN                 MX    449
70                  IN                 RU    136


We can also examine which rewards are most often associated with location mismatches, which may indicate abuse or targeting anomalies.

In [65]:
# Analyze rewards in discrepant clicks
reward_analysis = discrepancies_df.groupby('reward_id').agg(
    total_discrepancies=('click_id', 'count'),
    unique_users=('adv_id', 'nunique')
).sort_values('total_discrepancies', ascending=False).head(10)

print("\nTop 10 rewards targeted in discrepancies:")
print(reward_analysis)


Top 10 rewards targeted in discrepancies:
           total_discrepancies  unique_users
reward_id                                   
513529                    2825           367
493787                    1652           182
519189                    1286           322
508087                    1184           249
513534                    1141           309
191251                    1030           144
519188                     980           310
519146                     948           232
493788                     935           136
493560                     934           136


## Simplified Fraud Scoring

In [66]:
# Calculate basic user metrics
user_stats = final_df.groupby('adv_id').agg(
    total_clicks=('click_id', 'count'),
    discrepant_clicks=('is_discrepancy', 'sum'),
    unique_click_countries=('click_country_code', 'nunique'),
    unique_rewards=('reward_id', 'nunique')
).reset_index()

# Calculate discrepancy rate
user_stats['discrepancy_rate'] = user_stats['discrepant_clicks'] / user_stats['total_clicks']

In [67]:
def simple_fraud_score(row):
    score = 0
    
    # 1. Discrepancy contribution (max 50 points)
    score += min(row['discrepancy_rate'] * 50, 50)
    
    # 2. Multi-country usage (max 20 points)
    if row['unique_click_countries'] > 1:
        score += min(row['unique_click_countries'] * 5, 20)
    
    # 3. High click volume (max 15 points)
    if row['total_clicks'] > 50:
        score += 15
    elif row['total_clicks'] > 20:
        score += 8
    
    # 4. Reward diversity (max 15 points)
    if row['unique_rewards'] > 5:
        score += min(row['unique_rewards'] * 2, 15)
    
    return min(score, 100)  # Cap at 100

# Apply scoring
user_stats['fraud_score'] = user_stats.apply(simple_fraud_score, axis=1)
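
To make the weighting concrete, here is a worked example of the scoring function on a single hypothetical user. The click counts (170 total, 151 discrepant) mirror figures that appear in this notebook's outputs; the values for `unique_click_countries` and `unique_rewards` are assumptions chosen so that both components reach their caps.

```python
def simple_fraud_score(row):
    """Same scoring rules as the cell above, repeated so the example runs on its own."""
    score = 0
    score += min(row["discrepancy_rate"] * 50, 50)           # 1. discrepancy (max 50)
    if row["unique_click_countries"] > 1:
        score += min(row["unique_click_countries"] * 5, 20)  # 2. multi-country (max 20)
    if row["total_clicks"] > 50:                             # 3. click volume (max 15)
        score += 15
    elif row["total_clicks"] > 20:
        score += 8
    if row["unique_rewards"] > 5:
        score += min(row["unique_rewards"] * 2, 15)          # 4. reward diversity (max 15)
    return min(score, 100)

# Hypothetical user; the last two values are assumed for illustration.
example_user = {
    "total_clicks": 170,
    "discrepant_clicks": 151,
    "unique_click_countries": 4,  # assumption: 4 * 5 = 20 points (the cap)
    "unique_rewards": 8,          # assumption: min(8 * 2, 15) = 15 points
}
example_user["discrepancy_rate"] = (
    example_user["discrepant_clicks"] / example_user["total_clicks"]
)

# 151 / 170 * 50 is about 44.41, plus 20 + 15 + 15 = 50 from the other components.
score = simple_fraud_score(example_user)
print(round(score, 6))  # 94.411765
```

Since the discrepancy term is the only one that is not capped at a round number, a user who maxes out the other three components scores 50 plus half of their discrepancy rate in percent.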

In [68]:
def classify_risk(score):
    if score > 75: return 'CRITICAL'
    elif score > 50: return 'HIGH'
    elif score > 25: return 'MEDIUM'
    elif score > 10: return 'LOW'
    else: return 'MINIMAL'

user_stats['risk_level'] = user_stats['fraud_score'].apply(classify_risk)
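
As an alternative to the if/elif chain, the same thresholds can be expressed with `pd.cut`, which also yields an ordered categorical so that counts sort by severity rather than alphabetically. This is a sketch on made-up scores, not a replacement the notebook uses; `classify_risk` is repeated here only to check that both approaches agree.

```python
import pandas as pd

def classify_risk(score):
    # Copied from the cell above for comparison.
    if score > 75: return 'CRITICAL'
    elif score > 50: return 'HIGH'
    elif score > 25: return 'MEDIUM'
    elif score > 10: return 'LOW'
    else: return 'MINIMAL'

# Hypothetical scores, including values exactly on the thresholds.
scores = pd.Series([0, 10, 10.5, 25, 50, 75, 75.1, 100])

# right=True makes each bin (low, high], matching the strict ">" comparisons above.
risk = pd.cut(
    scores,
    bins=[float("-inf"), 10, 25, 50, 75, float("inf")],
    labels=["MINIMAL", "LOW", "MEDIUM", "HIGH", "CRITICAL"],
    right=True,
)

print(risk.tolist())
print(risk.value_counts().sort_index())  # ordered MINIMAL -> CRITICAL
```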

In [69]:
# Top discrepancy countries
country_discrepancies = final_df[final_df['is_discrepancy']].groupby(
    ['signup_country_name', 'click_country_name']
).size().reset_index(name='count')

# Show top 10 suspicious routes
top_routes = country_discrepancies.sort_values('count', ascending=False).head(10)
print("TOP 10 COUNTRY DISCREPANCY ROUTES:")
print(top_routes)

# Calculate fraud density by country
fraud_by_country = user_stats.merge(
    signup_df[['adv_id', 'signup_country_name']],
    on='adv_id'
).groupby('signup_country_name').agg(
    avg_fraud_score=('fraud_score', 'mean'),
    critical_users=('risk_level', lambda x: (x == 'CRITICAL').sum())
).sort_values('avg_fraud_score', ascending=False)

print("\nFRAUD RISK BY SIGNUP COUNTRY:")
print(fraud_by_country.head(10))

TOP 10 COUNTRY DISCREPANCY ROUTES:
   signup_country_name    click_country_name  count
9                India             Indonesia   7788
16               India           Philippines   6011
22               India  United Arab Emirates   3000
24               India         United States   2994
21               India              Thailand   1197
25               India               Vietnam    931
11               India              Malaysia    931
7                India                France    551
12               India                Mexico    449
18               India                Russia    136

FRAUD RISK BY SIGNUP COUNTRY:
                     avg_fraud_score  critical_users
signup_country_name                                 
Peru                       30.294118               0
Netherlands                20.000000               0
Singapore                  18.750000               0
El Salvador                16.666667               0
Chile                      15.000000        

In [70]:
# Risk distribution
risk_dist = user_stats['risk_level'].value_counts().sort_index()
print("\nFRAUD RISK DISTRIBUTION:")
print(risk_dist)

# Top suspicious users
top_suspects = user_stats[user_stats['risk_level'].isin(['CRITICAL', 'HIGH'])].sort_values(
    'fraud_score', ascending=False
).head(10)
print("\nTOP 10 SUSPECTED USERS:")
print(top_suspects[['adv_id', 'fraud_score', 'risk_level']])


FRAUD RISK DISTRIBUTION:
risk_level
CRITICAL      234
HIGH          232
LOW          4279
MEDIUM        448
MINIMAL     32511
Name: count, dtype: int64

TOP 10 SUSPECTED USERS:
                                     adv_id  fraud_score risk_level
2612   1270700e-7782-446e-bdf1-19388053dd37    94.411765   CRITICAL
2567   12201644-eea8-486d-b01c-31a42e2b6125    94.411765   CRITICAL
1938   0d95cd91-7a7f-423d-a525-c988bfddb743    94.411765   CRITICAL
4075   1c997f04-63a6-49d9-94d5-8e22cc83e1e0    94.411765   CRITICAL
2844   13f9b106-72d9-475c-b31c-ef7fe89c62c0    94.411765   CRITICAL
24999  a9c17a3d-a5ea-4d20-9db7-3ad422750ad3    94.411765   CRITICAL
25917  b0102003-07a6-47f9-882d-0b708925e445    94.411765   CRITICAL
23913  a2a3ad87-a401-49ff-8ca7-527378bfb8b5    94.411765   CRITICAL
6928   2ff218b8-3c68-4409-9501-b9a80d46c7f6    94.411765   CRITICAL
6582   2d60ae46-5a96-414e-83a1-05990fa3f836    94.411765   CRITICAL
