# Detecting Discrepancies in User Signup and Click Locations

### Import Required Libraries
Here we import the necessary libraries and give a brief overview of the goal, which sets the stage for the analysis.

Goal: Identify users whose click country differs from their signup country.
A mismatch between signup location and click location is treated here as a signal of potential fraud, not as proof of it. Such discrepancies could indicate:
 <br> - Account takeover attempts
 <br> - VPN/proxy usage for fraudulent activities
 <br> - Bot networks operating from different geographic locations
 <br> - Legitimate users traveling or using VPNs

In [91]:
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

In [92]:
# Load country mapping from the provided API
def load_country_mapping():
    """
    Load country code to country name mapping from the API
    """
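    # A timeout keeps the notebook from hanging if the endpoint is unreachable
    response = requests.get('https://dashboard.pubscale.com/json/country.json', timeout=30)
    response.raise_for_status()  # Fail on HTTP errors instead of parsing an error page
    country_data = response.json()

    # Create a dictionary mapping country codes to names
    country_mapping = {country['code']: country['name'] for country in country_data}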

    print(f"Successfully loaded {len(country_mapping)} countries")
    return country_mapping


In [93]:
# Load the country mapping
country_mapping = load_country_mapping()


Successfully loaded 251 countries


In [94]:
# Display sample of country mappings
print("\nSample Country Mappings:")
sample_countries = list(country_mapping.items())[:10]
for code, name in sample_countries:
    print(f"   {code} → {name}")



Sample Country Mappings:
   Unresolved → Unresolved
   RW → Rwanda
   SO → Somalia
   YE → Yemen
   IQ → Iraq
   SA → Saudi Arabia
   IR → Iran
   CY → Cyprus
   TZ → Tanzania
   SY → Syria


In [95]:
# Load all CSV files
datasets = {
    'clicks_country': r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_1_clicks_country.csv",
    'clicks': r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_2_clicks.csv",
    'login_log': r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_3_login_log.csv",
    'user_signup': r"C:\Users\ASUS\OneDrive\Desktop\Greedy Game\Task 6\7_4_user_signup_location.csv"
}

data = {}
for name, filename in datasets.items():
    data[name] = pd.read_csv(filename)
    print(f"Loaded {name}: {data[name].shape[0]} rows, {data[name].shape[1]} columns")


Loaded clicks_country: 111904 rows, 2 columns
Loaded clicks: 206712 rows, 4 columns
Loaded login_log: 107712 rows, 3 columns
Loaded user_signup: 92562 rows, 2 columns


### Initial Data Exploration

In [96]:
# Explore each dataset structure
for name, df in data.items():
    print(f"\n{name.upper()} Dataset:")
    print(f"   Shape: {df.shape}")
    print(f"   Columns: {list(df.columns)}")
    print(f"   Sample data:")
    print(df.head(2).to_string(index=False))

    # Check for missing values
    missing = df.isnull().sum()
    if missing.any():
        print(f"   Missing values: {missing[missing > 0].to_dict()}")
    else:
        print("   No missing values detected")



CLICKS_COUNTRY Dataset:
   Shape: (111904, 2)
   Columns: ['click_id', 'country_code']
   Sample data:
                            click_id  country_code
775e6727-5c19-4a19-a5e0-f88806ee5e01     1269750.0
cc3c8a21-3364-442a-9d82-c2f34ff9dcdc     1269750.0
   Missing values: {'country_code': 9}

CLICKS Dataset:
   Shape: (206712, 4)
   Columns: ['adv_id', 'click_id', 'reward_id', 'ip']
   Sample data:
                              adv_id                             click_id  reward_id             ip
5fc3f485-7293-400f-b738-8c74f7df93d2 2970c66f-6d5c-45bf-96e1-c11816149083     516629 106.213.84.220
5fc3f485-7293-400f-b738-8c74f7df93d2 4e447654-5ec7-4417-b9f0-33e78b2c82ee     494202 106.213.84.220
   Missing values: {'adv_id': 19}

LOGIN_LOG Dataset:
   Shape: (107712, 3)
   Columns: ['adv_id', 'app_id', 'day']
   Sample data:
                              adv_id app_id        day
5ba99349-0e56-4a15-96d8-6708f21ae548 rupiyo 10/20/2024
1a893459-2252-4270-be92-ec9c5999698c  sikka 10/20/202

Dataset Relationships:<br>
clicks_country → Contains click_id and country_code (where click originated)<br>
clicks → Contains click_id, adv_id, reward_id, and IP address<br>
user_signup → Contains adv_id and signup country_code<br>
login_log → Contains adv_id, app_id, and login dates (DAU)<br>

In [97]:
# Merging clicks with clicks_country to get click locations
print("\nMerging clicks with click countries...")
clicks_with_location = pd.merge(
    data['clicks'],
    data['clicks_country'],
    on='click_id',
    how='inner'
)

print(f"Merged dataset shape: {clicks_with_location.shape}")
print(f"Original clicks: {len(data['clicks'])}")
print(f"Clicks with location: {len(clicks_with_location)}")



Merging clicks with click countries...
Merged dataset shape: (206712, 5)
Original clicks: 206712
Clicks with location: 206712


In [98]:
# Adding country names for click locations
clicks_with_location['click_country_name'] = clicks_with_location['country_code'].map(country_mapping)
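The exploration output shows `country_code` in `clicks_country` as a number (for example `1269750.0`), while the keys of `country_mapping` are two-letter codes such as `RW` or `SA`. A direct `.map(country_mapping)` therefore finds no matches and returns NaN, which is what the later "Click Locations: nan" output reflects. The sketch below measures lookup coverage and then translates numeric IDs to two-letter codes first. The `id_to_iso` table is hypothetical: the value looks like a GeoNames identifier (1269750 is India there), but the source does not confirm the encoding, so a real translation table would have to come from the data provider.

```python
import pandas as pd

# Tiny stand-ins for the notebook objects (illustrative values only)
country_mapping = {"IN": "India", "US": "United States"}
id_to_iso = {1269750: "IN"}  # hypothetical numeric-ID -> two-letter-code table

raw_codes = pd.Series([1269750.0, 1269750.0, None])

# Share of click codes that exist as keys in country_mapping before translation
covered_before = raw_codes.map(lambda v: v in country_mapping).mean()

def to_iso(value):
    """Translate a numeric country ID to a two-letter code, keeping gaps as None."""
    if pd.isna(value):
        return None
    return id_to_iso.get(int(value))

iso_codes = raw_codes.map(to_iso)
names = iso_codes.map(country_mapping)
covered_after = names.notna().mean()

print(f"Coverage before translation: {covered_before:.0%}")
print(f"Coverage after translation:  {covered_after:.0%}")
```

Checking coverage right after the lookup makes an encoding mismatch visible before it propagates into the discrepancy flag.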

In [99]:
# Merging with user signup data to get signup locations
print("\nAdding user signup locations...")
user_activity = pd.merge(
    clicks_with_location,
    data['user_signup'],
    on='adv_id',
    how='inner',
    suffixes=('_click', '_signup')
)

print(f"Final merged dataset shape: {user_activity.shape}")



Adding user signup locations...
Final merged dataset shape: (651852, 7)
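The merged frame has 651,852 rows, more than three times the 206,712 click rows that went in. An inner merge can only grow like this when some `adv_id` value appears more than once in `user_signup`, so each click is repeated once per matching signup row and every downstream click count is inflated. A minimal sketch of how to detect and guard against this, using toy data; keeping the first signup row per user is an assumption, and users with conflicting signup countries would need an explicit rule.

```python
import pandas as pd

# Toy frames: user "a" appears twice in the signup table
clicks = pd.DataFrame({"adv_id": ["a", "a", "b"], "click_id": [1, 2, 3]})
signup = pd.DataFrame({"adv_id": ["a", "a", "b"],
                       "country_code": ["IN", "IN", "US"]})

# Unchecked merge: each click of "a" is duplicated (2 clicks x 2 signup rows + 1)
fanned = clicks.merge(signup, on="adv_id", how="inner")

# Count repeated keys on the lookup side before merging
dup_keys = signup["adv_id"].duplicated().sum()

# Assumption: keep the first signup row per user
signup_unique = signup.drop_duplicates(subset="adv_id", keep="first")

# validate="many_to_one" makes pandas raise MergeError if the right keys repeat
safe = clicks.merge(signup_unique, on="adv_id", how="inner",
                    validate="many_to_one")

print(len(clicks), len(fanned), len(safe), dup_keys)
```

With `validate` in place, a merge against a table with repeated keys fails immediately instead of silently multiplying rows.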


In [100]:
# Adding signup country names
user_activity['signup_country_name'] = user_activity['country_code_signup'].map(country_mapping)

In [101]:
# Creating a flag for location discrepancies
user_activity['location_discrepancy'] = (
    user_activity['country_code_click'] != user_activity['country_code_signup']
)

In [102]:
# Calculating summary statistics
total_users = user_activity['adv_id'].nunique()
total_clicks = len(user_activity)
discrepant_clicks = user_activity['location_discrepancy'].sum()
discrepant_users = user_activity[user_activity['location_discrepancy']]['adv_id'].nunique()

print(f"DISCREPANCY ANALYSIS RESULTS:")
print(f"Total unique users: {total_users:,}")
print(f"Total clicks analyzed: {total_clicks:,}")
print(f"Clicks with location discrepancy: {discrepant_clicks:,} ({discrepant_clicks/total_clicks*100:.1f}%)")
print(f"Users with discrepant behavior: {discrepant_users:,} ({discrepant_users/total_users*100:.1f}%)")

DISCREPANCY ANALYSIS RESULTS:
Total unique users: 37,704
Total clicks analyzed: 651,852
Clicks with location discrepancy: 651,852 (100.0%)
Users with discrepant behavior: 37,704 (100.0%)
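A 100% discrepancy rate across all 37,704 users is more plausibly an artefact of the comparison than a real finding. The `!=` operator also returns True when either side is missing or when the two columns use different encodings. The sketch below compares only rows where both codes are present and reports how many rows were comparable at all. It assumes both columns already hold two-letter codes as strings, which is not the case for the click column as loaded; the function name and sample values are illustrative.

```python
import pandas as pd

def compare_countries(click_codes, signup_codes):
    """Return (comparable, mismatch) boolean Series for two code columns."""
    click = click_codes.str.strip().str.upper()
    signup = signup_codes.str.strip().str.upper()
    # A row is comparable only when both codes are known
    comparable = click.notna() & signup.notna()
    # Missing values never count as a discrepancy
    mismatch = comparable & (click != signup)
    return comparable, mismatch

click = pd.Series(["IN", "US", None, "in "])
signup = pd.Series(["IN", "IN", "IN", "IN"])

comparable, mismatch = compare_countries(click, signup)
rate = mismatch.sum() / comparable.sum()

print(f"Comparable rows: {comparable.sum()} of {len(click)}")
print(f"Discrepancy rate among comparable rows: {rate:.1%}")
```

Reporting the comparable count next to the rate shows how much of the data the rate actually describes.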


In [103]:
# Analyzing users with discrepancies
discrepant_users_df = user_activity[user_activity['location_discrepancy']].copy()

In [104]:
# Counting discrepancies by user
user_discrepancy_counts = discrepant_users_df.groupby('adv_id').agg({
    'click_id': 'count',
    'country_code_signup': 'first',
    'signup_country_name': 'first',
    'country_code_click': lambda x: list(x.unique()),
    'click_country_name': lambda x: list(x.unique())
}).reset_index()

user_discrepancy_counts.columns = [
    'adv_id', 'discrepant_clicks', 'signup_country_code', 'signup_country_name',
    'click_countries', 'click_country_names'
]
# Sorting by number of discrepant clicks
user_discrepancy_counts = user_discrepancy_counts.sort_values('discrepant_clicks', ascending=False)

In [105]:
print("TOP 10 USERS WITH MOST DISCREPANT CLICKS:")
print("-" * 60)
for idx, row in user_discrepancy_counts.head(10).iterrows():
    print(f"User {row['adv_id']}:")
    print(f"Signup Location: {row['signup_country_name']} ({row['signup_country_code']})")

    # Convert each item in the list to a string
    click_locations = [str(location) for location in row['click_country_names']]
    print(f"Click Locations: {', '.join(click_locations)}")

    print(f"Discrepant Clicks: {row['discrepant_clicks']}")
    print("-" * 60)


TOP 10 USERS WITH MOST DISCREPANT CLICKS:
------------------------------------------------------------
User ff33ae73-67e4-481d-b10c-88281077caaa:
Signup Location: India (IN)
Click Locations: nan
Discrepant Clicks: 3740
------------------------------------------------------------
User 96eca446-b90b-4a91-94de-34c1e834608f:
Signup Location: India (IN)
Click Locations: nan
Discrepant Clicks: 3740
------------------------------------------------------------
User 4ae830e1-97c9-4ac4-bf8f-fa46f7c35755:
Signup Location: India (IN)
Click Locations: nan
Discrepant Clicks: 3740
------------------------------------------------------------
User 13f9b106-72d9-475c-b31c-ef7fe89c62c0:
Signup Location: India (IN)
Click Locations: nan
Discrepant Clicks: 3740
------------------------------------------------------------
User a2a3ad87-a401-49ff-8ca7-527378bfb8b5:
Signup Location: India (IN)
Click Locations: nan
Discrepant Clicks: 3740
------------------------------------------------------------
User 2d60ae4

In [106]:
# Defining suspicious behavior criteria
def classify_risk_level(row):
    """
    Classify users based on risk level of their location discrepancies
    """
    discrepant_clicks = row['discrepant_clicks']
    num_different_countries = len(row['click_countries'])

    if discrepant_clicks >= 10 and num_different_countries >= 3:
        return 'HIGH_RISK'
    elif discrepant_clicks >= 5 and num_different_countries >= 2:
        return 'MEDIUM_RISK'
    elif discrepant_clicks >= 1:
        return 'LOW_RISK'
    else:
        return 'NO_RISK'
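The same thresholds can also be applied column-wise instead of row by row. The following is a sketch with `numpy.select` on made-up data, not the notebook's own implementation; conditions are evaluated in order, so the first match wins, as in the `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Made-up users for illustration
users = pd.DataFrame({
    "discrepant_clicks": [12, 6, 1, 0, 15],
    "click_countries": [["IN", "US", "AE"], ["IN", "US"], ["IN"], [], ["IN"]],
})

n_countries = users["click_countries"].map(len)

# Ordered from most to least severe; np.select picks the first True condition
conditions = [
    (users["discrepant_clicks"] >= 10) & (n_countries >= 3),
    (users["discrepant_clicks"] >= 5) & (n_countries >= 2),
    users["discrepant_clicks"] >= 1,
]
labels = ["HIGH_RISK", "MEDIUM_RISK", "LOW_RISK"]

users["risk_level"] = np.select(conditions, labels, default="NO_RISK")
print(users[["discrepant_clicks", "risk_level"]])
```

The last row shows why both thresholds matter: 15 discrepant clicks from a single country still lands in `LOW_RISK`.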

In [107]:
# Applying risk classification
user_discrepancy_counts['risk_level'] = user_discrepancy_counts.apply(classify_risk_level, axis=1)

In [108]:
# Counting users by risk level
risk_distribution = user_discrepancy_counts['risk_level'].value_counts()
print("RISK LEVEL DISTRIBUTION:")
for risk, count in risk_distribution.items():
    print(f"{risk}: {count} users")

RISK LEVEL DISTRIBUTION:
LOW_RISK: 37250 users
HIGH_RISK: 250 users
MEDIUM_RISK: 204 users


In [109]:
# High-risk users detailed analysis
high_risk_users = user_discrepancy_counts[user_discrepancy_counts['risk_level'] == 'HIGH_RISK']
print(f"\nHIGH-RISK USERS ANALYSIS ({len(high_risk_users)} users):")
print("-" * 60)

if len(high_risk_users) > 0:
    for idx, row in high_risk_users.iterrows():
        print(f"User {row['adv_id']} - HIGH RISK")
        print(f"Signup: {row['signup_country_name']} ({row['signup_country_code']})")

        # Convert each item in the list to a string
        click_locations = [str(location) for location in row['click_country_names']]
        print(f"Active in: {', '.join(click_locations)}")

        print(f"Discrepant clicks: {row['discrepant_clicks']}")
        print(f"Countries involved: {len(row['click_countries'])}")
        print("-" * 60)
else:
    print("No high-risk users detected based on current criteria")


HIGH-RISK USERS ANALYSIS (250 users):
------------------------------------------------------------
User ff33ae73-67e4-481d-b10c-88281077caaa - HIGH RISK
Signup: India (IN)
Active in: nan
Discrepant clicks: 3740
Countries involved: 7
------------------------------------------------------------
User 96eca446-b90b-4a91-94de-34c1e834608f - HIGH RISK
Signup: India (IN)
Active in: nan
Discrepant clicks: 3740
Countries involved: 7
------------------------------------------------------------
User 4ae830e1-97c9-4ac4-bf8f-fa46f7c35755 - HIGH RISK
Signup: India (IN)
Active in: nan
Discrepant clicks: 3740
Countries involved: 7
------------------------------------------------------------
User 13f9b106-72d9-475c-b31c-ef7fe89c62c0 - HIGH RISK
Signup: India (IN)
Active in: nan
Discrepant clicks: 3740
Countries involved: 7
------------------------------------------------------------
User a2a3ad87-a401-49ff-8ca7-527378bfb8b5 - HIGH RISK
Signup: India (IN)
Active in: nan
Discrepant clicks: 3740
Countrie

## Identifying Common Patterns

In [110]:
# Creating patterns for analysis
pattern_analysis = discrepant_users_df.groupby(['signup_country_name', 'click_country_name']).agg({
    'adv_id': 'nunique',
    'click_id': 'count'
}).reset_index()

pattern_analysis.columns = ['signup_country', 'click_country', 'unique_users', 'total_clicks']
pattern_analysis = pattern_analysis.sort_values('unique_users', ascending=False)

In [111]:
print("Top 15 Signup → Click Location Patterns:")
for idx, row in pattern_analysis.head(15).iterrows():
    print(f"{row['signup_country']} → {row['click_country']}: {row['unique_users']} users, {row['total_clicks']} clicks")


Top 15 Signup → Click Location Patterns:
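The empty listing is consistent with how `groupby` treats missing keys. By default (`dropna=True`) it drops every row whose group key is NaN, and `click_country_name` is NaN throughout, as the earlier "Click Locations: nan" output shows. A small sketch on toy data showing the default behaviour next to `dropna=False`:

```python
import numpy as np
import pandas as pd

# Toy data: every click country name is missing
df = pd.DataFrame({
    "signup_country_name": ["India", "India", "India"],
    "click_country_name": [np.nan, np.nan, np.nan],
    "adv_id": ["u1", "u1", "u2"],
    "click_id": ["c1", "c2", "c3"],
})
keys = ["signup_country_name", "click_country_name"]

# Default: rows with NaN in a key are dropped, so no groups remain
default = df.groupby(keys).agg(
    unique_users=("adv_id", "nunique"),
    total_clicks=("click_id", "count"),
).reset_index()

# dropna=False keeps NaN as its own group, which exposes the missing names
kept = df.groupby(keys, dropna=False).agg(
    unique_users=("adv_id", "nunique"),
    total_clicks=("click_id", "count"),
).reset_index()

print(len(default), len(kept))
```

Keeping the NaN group does not fix the missing names, but it makes the gap visible instead of producing an empty table.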


Analyzing user activity patterns using login_log data

In [112]:
# Calculating activity metrics per user
user_activity_stats = user_activity.groupby('adv_id').agg({
    'click_id': 'count',
    'reward_id': 'nunique',
    'location_discrepancy': 'sum',
    'country_code_click': lambda x: x.nunique(),
    'country_code_signup': 'first'
}).reset_index()

user_activity_stats.columns = [
    'adv_id', 'total_clicks', 'unique_rewards', 'discrepant_clicks',
    'countries_clicked_from', 'signup_country'
]

In [113]:
# Calculating discrepancy rate per user
user_activity_stats['discrepancy_rate'] = (
    user_activity_stats['discrepant_clicks'] / user_activity_stats['total_clicks']
)

In [114]:
# Merging with login data for more insights
if 'login_log' in data:
    login_stats = data['login_log'].groupby('adv_id').agg({
        'day': ['count', 'nunique'],
        'app_id': 'nunique'
    }).reset_index()

    login_stats.columns = ['adv_id', 'total_logins', 'active_days', 'unique_apps']

    # Merging with user activity stats
    comprehensive_stats = pd.merge(user_activity_stats, login_stats, on='adv_id', how='left')

    print(f"Enhanced analysis with login data for {len(comprehensive_stats)} users")
else:
    comprehensive_stats = user_activity_stats
    print("Login data not available for temporal analysis")

Enhanced analysis with login data for 37704 users


In [115]:
def calculate_fraud_score(row):
    """
    Calculate a comprehensive fraud score based on multiple factors
    Score range: 0-100 (higher = more suspicious)
    """
    score = 0

    # Base score from discrepancy rate
    score += row['discrepancy_rate'] * 40

    # Add points for multiple countries
    if row['countries_clicked_from'] > 1:
        score += min(row['countries_clicked_from'] * 10, 30)

    # Add points for high click volume
    if row['total_clicks'] > 50:
        score += 10
    elif row['total_clicks'] > 20:
        score += 5

    # Add points for multiple rewards (potential reward farming)
    if row['unique_rewards'] > 10:
        score += 15
    elif row['unique_rewards'] > 5:
        score += 10

    # If login data available, factor in app usage patterns
    if 'unique_apps' in row and pd.notna(row['unique_apps']):
        if row['unique_apps'] > 5:
            score += 10

        # Suspicious if many clicks but few login days
        if row['total_clicks'] > 0 and row['active_days'] > 0:
            click_to_day_ratio = row['total_clicks'] / row['active_days']
            if click_to_day_ratio > 10:
                score += 15

    return min(score, 100)  # Cap at 100
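To see how the components add up, the sketch below recomputes the score piece by piece for two cases. The first uses the values reported further down for user `ff33ae73-...` (3,740 clicks, 7 countries, 41 rewards, 3 apps, 1 active day); the second is an invented quiet user. The helper `score_breakdown` is illustrative and assumes login data is present, unlike the function above, which also handles missing login fields.

```python
def score_breakdown(discrepancy_rate, countries, total_clicks,
                    unique_rewards, unique_apps, active_days):
    """Return (parts, raw_total, capped_total) using the notebook's weights."""
    parts = {}
    parts["discrepancy"] = discrepancy_rate * 40
    parts["countries"] = min(countries * 10, 30) if countries > 1 else 0
    parts["volume"] = 10 if total_clicks > 50 else 5 if total_clicks > 20 else 0
    parts["rewards"] = 15 if unique_rewards > 10 else 10 if unique_rewards > 5 else 0
    parts["apps"] = 10 if unique_apps > 5 else 0
    ratio = total_clicks / active_days if total_clicks > 0 and active_days > 0 else 0
    parts["clicks_per_day"] = 15 if ratio > 10 else 0
    raw = sum(parts.values())
    return parts, raw, min(raw, 100)

# Values taken from the notebook output for one top-scored user
parts, raw, capped = score_breakdown(1.0, 7, 3740, 41, 3, 1)
print(parts, raw, capped)

# Invented low-activity user: only the discrepancy term contributes
_, quiet_raw, quiet_capped = score_breakdown(1.0, 1, 3, 2, 1, 1)
print(quiet_raw, quiet_capped)
```

The first case sums to 110 before the cap, so the cap hides differences among the heaviest users. The second case scores 40 from the discrepancy term alone. Since every user in this run has a discrepancy rate of 100%, 40 is the floor, which explains why the distribution contains no `LOW` or `MINIMAL` users and why the median score is exactly 40.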


In [116]:
# Calculating fraud scores
comprehensive_stats['fraud_score'] = comprehensive_stats.apply(calculate_fraud_score, axis=1)

In [117]:
# Classifying based on fraud score
def classify_fraud_risk(score):
    if score >= 70:
        return 'CRITICAL'
    elif score >= 50:
        return 'HIGH'
    elif score >= 30:
        return 'MEDIUM'
    elif score >= 10:
        return 'LOW'
    else:
        return 'MINIMAL'

In [118]:
comprehensive_stats['fraud_risk'] = comprehensive_stats['fraud_score'].apply(classify_fraud_risk)

In [119]:
# Displaying fraud score distribution
fraud_risk_dist = comprehensive_stats['fraud_risk'].value_counts()
print("FRAUD RISK DISTRIBUTION:")
for risk, count in fraud_risk_dist.items():
    print(f"{risk}: {count} users")

FRAUD RISK DISTRIBUTION:
MEDIUM: 31014 users
HIGH: 3985 users
CRITICAL: 2705 users


In [120]:
# Focusing on critical and high-risk users
critical_users = comprehensive_stats[comprehensive_stats['fraud_risk'].isin(['CRITICAL', 'HIGH'])].copy()
critical_users = critical_users.sort_values('fraud_score', ascending=False)

In [121]:
for idx, row in critical_users.head(10).iterrows():
    print(f"User {row['adv_id']} - FRAUD SCORE: {row['fraud_score']:.1f}")
    print(f"Risk Level: {row['fraud_risk']}")
    print(f"Total Clicks: {row['total_clicks']}")
    print(f"Discrepant Clicks: {row['discrepant_clicks']} ({row['discrepancy_rate']*100:.1f}%)")
    print(f"Countries Active: {row['countries_clicked_from']}")
    print(f"Unique Rewards: {row['unique_rewards']}")

    if 'active_days' in row and pd.notna(row['active_days']):
        print(f"Active Days: {row['active_days']}")
        print(f"Unique Apps: {row['unique_apps']}")
    print()


User 0000-0000 - FRAUD SCORE: 100.0
Risk Level: CRITICAL
Total Clicks: 882
Discrepant Clicks: 882 (100.0%)
Countries Active: 2
Unique Rewards: 13
Active Days: 1
Unique Apps: 5

User ff33ae73-67e4-481d-b10c-88281077caaa - FRAUD SCORE: 100.0
Risk Level: CRITICAL
Total Clicks: 3740
Discrepant Clicks: 3740 (100.0%)
Countries Active: 7
Unique Rewards: 41
Active Days: 1
Unique Apps: 3

User fe008056-e0db-4b34-907f-bd0ab3476ce7 - FRAUD SCORE: 100.0
Risk Level: CRITICAL
Total Clicks: 1617
Discrepant Clicks: 1617 (100.0%)
Countries Active: 12
Unique Rewards: 53
Active Days: 1
Unique Apps: 4

User 0bde5746-f4d0-40f6-9da9-47fa20751154 - FRAUD SCORE: 100.0
Risk Level: CRITICAL
Total Clicks: 1617
Discrepant Clicks: 1617 (100.0%)
Countries Active: 12
Unique Rewards: 53
Active Days: 1
Unique Apps: 4

User 15ae8c7c-ddb1-4186-9876-30193f977527 - FRAUD SCORE: 100.0
Risk Level: CRITICAL
Total Clicks: 1617
Discrepant Clicks: 1617 (100.0%)
Countries Active: 12
Unique Rewards: 53
Active Days: 1
Unique Apps:

In [122]:
# Key metrics summary
total_analyzed_users = len(comprehensive_stats)
users_with_discrepancies = len(comprehensive_stats[comprehensive_stats['discrepant_clicks'] > 0])
high_risk_users = len(comprehensive_stats[comprehensive_stats['fraud_risk'].isin(['HIGH', 'CRITICAL'])])

In [123]:
print("KEY METRICS:")
print(f"Total Users Analyzed: {total_analyzed_users:,}")
print(f"Users with Location Discrepancies: {users_with_discrepancies:,} ({users_with_discrepancies/total_analyzed_users*100:.1f}%)")
print(f"High-Risk Users: {high_risk_users:,} ({high_risk_users/total_analyzed_users*100:.1f}%)")
print(f"Average Fraud Score: {comprehensive_stats['fraud_score'].mean():.1f}")
print(f"Median Fraud Score: {comprehensive_stats['fraud_score'].median():.1f}")

KEY METRICS:
Total Users Analyzed: 37,704
Users with Location Discrepancies: 37,704 (100.0%)
High-Risk Users: 6,690 (17.7%)
Average Fraud Score: 44.5
Median Fraud Score: 40.0


In [124]:
# Most problematic signup countries
print("\nCOUNTRIES WITH HIGHEST FRAUD RATES:")
country_fraud_rates = comprehensive_stats.groupby('signup_country').agg({
    'fraud_score': 'mean',
    'adv_id': 'count',
    'fraud_risk': lambda x: (x.isin(['HIGH', 'CRITICAL'])).sum()
}).reset_index()

country_fraud_rates.columns = ['signup_country', 'avg_fraud_score', 'total_users', 'high_risk_users']
country_fraud_rates['high_risk_rate'] = country_fraud_rates['high_risk_users'] / country_fraud_rates['total_users']
country_fraud_rates = country_fraud_rates[country_fraud_rates['total_users'] >= 5]  # Drop countries with fewer than 5 users (small-sample guard, not a significance test)
country_fraud_rates = country_fraud_rates.sort_values('avg_fraud_score', ascending=False)



COUNTRIES WITH HIGHEST FRAUD RATES:


In [125]:
print("Countries with highest average fraud scores (min 5 users):")
for idx, row in country_fraud_rates.head(10).iterrows():
    country_name = country_mapping.get(row['signup_country'], row['signup_country'])
    print(f"{country_name}: Avg Score {row['avg_fraud_score']:.1f}, {row['high_risk_users']}/{row['total_users']} high-risk")

Countries with highest average fraud scores (min 5 users):
Italy: Avg Score 47.9, 15/48 high-risk
Czech Republic: Avg Score 45.7, 2/7 high-risk
Morocco: Avg Score 45.5, 5/31 high-risk
India: Avg Score 44.9, 6369/33663 high-risk
Kuwait: Avg Score 44.2, 1/6 high-risk
Oman: Avg Score 44.2, 1/6 high-risk
Spain: Avg Score 42.7, 3/28 high-risk
Egypt: Avg Score 42.6, 2/17 high-risk
Bangladesh: Avg Score 42.5, 15/120 high-risk
Philippines: Avg Score 42.5, 52/543 high-risk
