# 🎯 Fraud Detection in Ride Payments

This notebook explores patterns in failed transactions to identify potential fraud signals among first-time users and prepare data for predictive modeling.

## Project Goals

- **[Data Analyst]** Identify behavioral and contextual patterns to detect first-time fraudulent users
- **[Product Manager]** Based on the findings, propose the top two product/development actions to reduce failed payments (scoped for a small dev team)
- **[Data Analyst]** Prepare a clean, feature-rich dataset for predictive modeling (success vs. fail)

## Table of contents

- **Exploratory analysis**
- **Data check**
- **Exploring Relationships**
- **Scatterplots**
- **Pairplots**
- **Categorical Plots**

### 1. Exploratory analysis

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import os

In [None]:
# This option ensures the charts you create are displayed in the notebook without the need to "call" them specifically.

%matplotlib inline

In [None]:
# Importing Scheduled rides
path = "/Users/Glebazzz/Jupiter/Taxi" 
df = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_taxi.csv'))

# Check column types
df.dtypes

In [None]:
df.shape

### 2. Data check

#### Missing Values Check:

In [None]:
# Check for missing values

df.isna().sum().sort_values(ascending=False)

### 3. Exploring Relationships

In [None]:
df["country_original"] = df["country"]  # <- keeps string version

from sklearn.preprocessing import LabelEncoder

df_encoded = df.copy()

# Automatically detect all object and category columns
label_cols = df_encoded.select_dtypes(include=['object', 'category']).columns

# Encode all of them
for col in label_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col].astype(str))  # make sure all values are strings

# Now this will work:
df_encoded.corr()

In [None]:
correlation_matrix = df_encoded.corr()

fig, ax = plt.subplots(figsize=(12, 10))  
cax = ax.matshow(correlation_matrix, cmap='coolwarm')
fig.colorbar(cax)

ax.set_xticks(range(len(correlation_matrix.columns)))
ax.set_yticks(range(len(correlation_matrix.columns)))
ax.set_xticklabels(correlation_matrix.columns, rotation=90)
ax.set_yticklabels(correlation_matrix.columns)

plt.title("Correlation Matrix (Label-Encoded Columns)", pad=20)
plt.tight_layout()
plt.show()

In [None]:
# Create boolean flags marking rows where the value is missing.
# They are computed from the original df: label encoding turns NaN in object columns into a regular category.
df_encoded["missing_card_bin"] = df["card_bin"].isna()
df_encoded["missing_city_id"] = df["city_id"].isna()
df_encoded["missing_destination"] = df["real_destination_lat"].isna() | df["real_destination_lng"].isna()

# Build a subset for correlation analysis. Excluded: "order_id", "order_try_id", "user_id", "price_review_reason",
# "has_price_review", "Name", lat, lng, real_destination_lat, real_destination_lng, "Unnamed: 0", "time", "date"

sub = df_encoded[['city_id', 'card_bin', 'failed_attempts', 'payment_status','hour', 'price_per_km', 'missing_card_bin', 'missing_city_id', 'is_successful_payment', 
          'ride_price', 'price_review_status', 'device_name','ride_distance','distance','country','device_os_version',
         'missing_destination']]

In [None]:
correlation_matrix = sub.corr().round(2) 

fig, ax = plt.subplots(figsize=(12, 10))  
cax = ax.matshow(correlation_matrix, cmap='coolwarm')
fig.colorbar(cax)

ax.set_xticks(range(len(correlation_matrix.columns)))
ax.set_yticks(range(len(correlation_matrix.columns)))
ax.set_xticklabels(correlation_matrix.columns, rotation=90)
ax.set_yticklabels(correlation_matrix.columns)

plt.title("Correlation Matrix", pad=20)
plt.tight_layout()
plt.show()

## ✅ Key Takeaways: Payment Failures & Fraud Patterns

---

### 📊 Correlation Insights

- **`failed_attempts` ↘ `payment_status`**  
  Strong negative correlation — users with repeated failures are less likely to succeed.  
  ➤ Action: Use as a key signal for fraud detection.

- **`missing_card_bin` ↘ `payment_status`**  
  Moderate negative correlation — missing payment info often aligns with risk.  
  ➤ Action: Add soft-block or verification for users without card data.

- **`missing_city_id` & `missing_destination`**  
  These missing values also show some correlation with failed payments.  
  ➤ Action: Missing geo signals = potential red flag.

- **`price_review_status` & `ride_price`**  
  Slight correlation — higher-priced rides more often trigger price review.  
  ➤ Action: Consider price-based limits or review thresholds for first-time users.

- **`is_successful_payment` vs `payment_status`**  
  Perfect inverse correlation — expected, redundant.  
  ➤ Action: Drop one to reduce multicollinearity in models.

- **`device_name` & `device_os_version`**  
  Weak correlation with payment outcomes, but useful for segmentation.  
  ➤ Action: Can be used in risk profiling or clustering.
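
One caveat for the encoded nominal columns: `LabelEncoder` assigns arbitrary integer codes, so a Pearson correlation on a column such as `device_name` depends on the code order rather than on any real ordering. A minimal sketch of an alternative view, the success rate per category, is shown below. The column names follow this notebook, the values are invented, and `payment_status == 1` is assumed to mean success.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented for illustration.
toy = pd.DataFrame({
    "device_name": ["A", "A", "B", "B", "B", "C"],
    "payment_status": [1, 0, 0, 0, 1, 1],  # assumed: 1 = success, 0 = failure
})

# Success rate and volume per category; no integer encoding is involved.
by_device = (
    toy.groupby("device_name")["payment_status"]
       .agg(success_rate="mean", n="count")
       .sort_values("success_rate")
)
print(by_device)
```

Keeping the row count `n` next to the rate helps avoid over-reading categories with only a handful of transactions.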

---

### 📌 Recommendations for Product or Modeling:

- Focus on **`failed_attempts`**, **missing fields**, and **pricing behavior** for real-time detection.
- Remove highly correlated or redundant features before modeling (e.g., `is_successful_payment`).
- Use `device`, `country`, and `hour` features more for **segmentation**, not as strong predictors.
- Missing-value flags (`missing_*`) provide strong explanatory signals.
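
The recommendation to remove redundant features can be sketched as a small helper. The function name `redundant_columns` and the 0.95 threshold are illustrative assumptions, not part of the original analysis, and the data is invented.

```python
import numpy as np
import pandas as pd

def redundant_columns(frame: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy data: values are invented for illustration.
toy = pd.DataFrame({
    "payment_status": [1, 0, 1, 0, 1, 0],
    "is_successful_payment": [1, 0, 1, 0, 1, 0],  # duplicates payment_status in this toy data
    "failed_attempts": [0, 3, 1, 5, 0, 2],
})

to_drop = redundant_columns(toy)
print(to_drop)
```

Because the helper works on absolute values, it treats a perfect inverse correlation the same way as a perfect positive one.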

In [None]:
df_encoded

In [None]:
# Round the correlation matrix
correlation_matrix = sub.corr().round(2)

# Create the figure and axes
f, ax = plt.subplots(figsize=(10, 10))

# Build a heat map
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', ax=ax)

# Show graph
plt.title("Correlation Heatmap (Rounded to 2 decimal places)")
plt.tight_layout()
plt.show()

### 4. Scatterplots

In [None]:
%matplotlib inline
sns.lmplot(x='ride_price', y='distance', data=sub)
plt.show()

### 5. Pairplots:

In [None]:
# Proxy flag: a row counts as "first-time" when failed_attempts == 0 (swap in order-history logic if available)
sub = sub.copy()
sub["is_first_time_user"] = sub["failed_attempts"] == 0

# Select relevant numeric features for pairwise analysis (the boolean flag is left out of the axes)
pairplot_vars = [
    "price_per_km",
    "ride_price",
    "ride_distance",
    "failed_attempts",
    "hour",
    "distance",
]


# Build the pairplot with payment_status as hue
sns.pairplot(
    data=sub,
    vars=pairplot_vars,
    hue="payment_status",      # color points by payment status (success/failure)
    palette="Set1",            # color scheme
    corner=True,               # show only the lower triangle of the matrix
    plot_kws={"alpha": 0.5}    # set point transparency for better overlap visibility
)

# Add a title and adjust layout
plt.suptitle("Pair Plot of Key Numeric Features by Payment Status", y=1.02)
plt.tight_layout()
# Save the image 
plt.savefig(os.path.join(path, '04 Analysis', 'Visualisations', 'pairplot_payment_status.png'))
# Show plot
plt.show()

## 🔍 Pair Plot Insights: Key Features by Payment Status

---

### 📌 Key Observations:

- **Ride Price vs. Distance**:
  - Most rides with `payment_status = 0` (failures) cluster at low distances and extremely high prices.
  - Some failed payments occur for **very high-priced, short rides**, suggesting price manipulation or fraud.
 

- **Price per KM**:
  - Heavily skewed — extreme outliers with very high values often lead to failed payments.
  - Successful payments tend to fall within a lower, more stable price/km range.
    

- **Ride Distance & Total Distance**:
  - Strong overlap, but a few anomalies (long distances with failed payments) may indicate edge cases or spoofed GPS.
    

- **Failed Attempts**:
  - Most successful payments have 0–2 attempts.
  - Failed payments are often preceded by **10+ attempts**, indicating automated abuse or retry fraud.
    

- **Hour of Day**:
  - Successful payments are more common in typical user hours (7 AM – 11 PM).
  - Failed payments are more evenly spread — especially **spikes at night (0–6 AM)** could hint at scripted fraud.
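
The failed-attempts observation can be checked numerically rather than by eye. The sketch below buckets `failed_attempts` and computes the success rate per bucket. The bucket edges (0-2, 3-9, 10+) and the data are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data: values are invented for illustration.
toy = pd.DataFrame({
    "failed_attempts": [0, 1, 2, 4, 10, 15, 0, 12],
    "payment_status":  [1, 1, 0, 1, 0, 0, 1, 0],  # assumed: 1 = success, 0 = failure
})

# Right-closed intervals: (-1, 2], (2, 9], (9, inf)
toy["attempt_bucket"] = pd.cut(
    toy["failed_attempts"],
    bins=[-1, 2, 9, np.inf],
    labels=["0-2", "3-9", "10+"],
)

bucket_stats = (
    toy.groupby("attempt_bucket", observed=False)["payment_status"]
       .agg(success_rate="mean", n="count")
)
print(bucket_stats)
```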


In [None]:
# Select the features to plot; is_first_time_user is used only as the hue
pairplot_vars = [
    "price_per_km",
    "ride_price",
    "ride_distance",
    "failed_attempts",
    "hour",
    "distance",
]

# Optional: sample to improve speed
# sub_sample = sub[pairplot_vars].sample(n=1000, random_state=42)

# Build the pair plot
pairplot = sns.pairplot(
    data=sub,
    vars=pairplot_vars,  # exclude is_first_time_user from axes
    hue="is_first_time_user",  # use it as color
    palette="Set1",
    corner=True,
    plot_kws={"alpha": 0.5}
)

# Add title
plt.suptitle("Pair Plot of Key Features Colored by is_first_time_user", y=1.02)


# Save the image 
plt.savefig(os.path.join(path, '04 Analysis', 'Visualisations', 'pairplot_is_first_time_user.png'))
# Show plot
plt.show()


## 🔍 Pair Plot Insights: First-Time Users vs Returning Users

---

### 📌 Key Observations:

- **Failed Attempts**:
  - As expected, `is_first_time_user = True` users have `failed_attempts = 0` by definition.
  - Returning users (`False`) show a wider distribution of failed attempts — some with 5+, suggesting behavioral differences.

- **Ride Price**:
  - First-time users are **overrepresented in high-price rides**, which may indicate risk.
  - Returning users cluster around lower ride prices, likely due to more consistent usage.

- **Price per KM**:
  - Several first-time users show **very high price/km values**, possibly related to short, overpriced rides.
  - Returning users stay in a tighter, more expected range.

- **Ride & Total Distance**:
  - First-time users often take shorter rides, but some extreme outliers exist — both in `ride_distance` and `distance`.

- **Hour of the Day**:
  - Returning users tend to ride more during the day (peaks around business hours).
  - First-time users are more evenly spread, including off-peak/night hours — may indicate automated or one-off usage.
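
The `is_first_time_user` flag used here is a proxy built from `failed_attempts == 0`. If per-user order history is available, a stricter flag could be derived from it. The sketch below assumes a `user_id` column and a hypothetical timestamp column named `order_ts`; both the column name and the data are illustrative.

```python
import pandas as pd

# Toy data: values are invented; "order_ts" is a hypothetical timestamp column.
toy = pd.DataFrame({
    "user_id": [1, 2, 1, 3, 2, 1],
    "order_ts": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 09:00", "2024-01-02 10:00",
        "2024-01-03 11:00", "2024-01-04 12:00", "2024-01-05 13:00",
    ]),
    "failed_attempts": [2, 0, 0, 0, 1, 0],
})

# Order each user's rows chronologically; the first row per user is the first order.
toy = toy.sort_values(["user_id", "order_ts"])
toy["is_first_order"] = toy.groupby("user_id").cumcount() == 0

# The proxy used in this notebook, for comparison.
toy["proxy_first_time"] = toy["failed_attempts"] == 0

# Share of rows where the two definitions disagree.
disagreement = (toy["is_first_order"] != toy["proxy_first_time"]).mean()
print(disagreement)
```

A high disagreement rate on the real data would suggest that the observations above describe "orders without failed attempts" rather than new users.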


### 6. Categorical Plots:

In [None]:
plt.figure(figsize=(12, 6))
sns.barplot(
    data=df,
    x="country_original",
    y="ride_price",
    hue="country_original",
    estimator="mean",
    errorbar="sd",
    palette="viridis",
    legend=False 
)

plt.title("Average Ride Price by Country (with Names)")
plt.xlabel("Country")
plt.ylabel("Average Ride Price")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### 📊 Key Takeaways: Average Ride Price by Country

- 🇨🇦 **Canada** has the **highest average ride price** among all countries (>6000), which may indicate pricing anomalies or extreme fare variability.
- 🇳🇬 **Nigeria** and 🇭🇺 **Hungary** also show **very high average prices** (>2500) with **large standard deviations**, suggesting wide variation in ride costs.
- 🇰🇪 **Kenya** and 🇷🇸 **Serbia** show **moderately high average prices** (~600–1000) with less variability compared to Canada and Nigeria.
- Most **European countries** (e.g., 🇱🇻 Latvia, 🇱🇹 Lithuania, 🇵🇱 Poland, 🇪🇪 Estonia) have **consistently low average ride prices**, often below 100.
- 🇿🇦 **South Africa**, 🇲🇽 **Mexico**, and 🇺🇦 **Ukraine** have **low to medium average ride prices**, but with relatively large variability.
- Some countries (e.g., 🇸🇦 **Saudi Arabia**, 🇪🇬 **Egypt**) show **extremely low or near-zero average prices**, which could indicate:
  - Widespread use of **free rides** (e.g., promotions)
  - Or **pricing data issues** such as incorrect currency mapping
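
If the price gaps between countries come from local currencies (an assumption, not something the data confirms here), raw averages are not comparable. One hedged way to compare rides across countries is to express each price relative to its own country's median. The country codes and values below are invented.

```python
import pandas as pd

# Toy data: values are invented for illustration.
toy = pd.DataFrame({
    "country": ["CA", "CA", "CA", "EE", "EE", "EE"],
    "ride_price": [6000.0, 5000.0, 30000.0, 5.0, 6.0, 4.0],
})

# Median price per country, broadcast back to every row.
country_median = toy.groupby("country")["ride_price"].transform("median")

# The ratio to the local median is unit-free, so rows from different countries become comparable.
toy["price_vs_country_median"] = toy["ride_price"] / country_median
print(toy)
```

In this toy example the 30000 ride stands out as five times its country's median, while the raw value alone would only show that the country has large numbers.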

In [None]:
sub["price_per_km"].describe()

In [None]:
# Drop zero and extreme outliers
filtered = sub[
    (sub["price_per_km"] > 0.0001) & 
    (sub["price_per_km"] < sub["price_per_km"].quantile(0.99))
]

plt.figure(figsize=(10, 6))
sns.histplot(
    data=filtered,
    x="price_per_km",
    hue="is_first_time_user",
    bins=100,
    stat="density",
    common_norm=False,
    palette="Set1",
    element="step"
)

plt.title("Filtered Distribution of Price per KM by First-Time User Status")
plt.xlabel("Price per KM")
plt.ylabel("Density")
plt.tight_layout()
plt.show()

### 📊 Key Takeaways: Price per KM Distribution by First-Time User Status

- The **distribution of price per KM is heavily right-skewed**, with most values concentrated below **0.05** for both first-time and returning users.
- **First-time users** (blue line) and **returning users** (red line) show **very similar pricing patterns**, suggesting no strong difference in per-KM pricing behavior.
- There is **a slightly sharper peak** for first-time users around **0.01–0.02**, possibly indicating exposure to promotional or standard base pricing.
- Extreme values beyond **0.2 price/km** are rare and shared across both groups, suggesting they are either anomalies or premium rides.
- Overall, the **price-per-km range is narrow**, and any model that uses this variable should consider its skewed nature (e.g., using a log transformation).
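
A minimal sketch of the suggested log transformation, on invented values in roughly the range described above. A plain `np.log` is used on strictly positive values (the filter above already drops values at or near zero), because for values this small `np.log1p(x)` is almost equal to `x` and would barely change the shape.

```python
import numpy as np
import pandas as pd

# Invented, strictly positive values with a long right tail.
price_per_km = pd.Series([0.005, 0.01, 0.01, 0.015, 0.02, 0.03, 0.05, 0.2])

log_price_per_km = np.log(price_per_km)

print("skew before:", price_per_km.skew())
print("skew after: ", log_price_per_km.skew())
```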

In [None]:
plt.figure(figsize=(12, 5))

sns.pointplot(
    data=sub,
    x="hour",
    y="ride_price",
    hue="is_first_time_user",
    errorbar="sd",
    palette="Set1"
)

plt.title("Trend of Ride Price by Hour and First-Time User")
plt.xlabel("Hour of Day")
plt.ylabel("Average Ride Price")
plt.tight_layout()
plt.show()

### 📊 Key Takeaways: Trend of Ride Price by Hour and First-Time User

- **First-time users** (blue line) consistently have **higher average ride prices** than returning users (red line) across most hours of the day.
- The **difference in pricing is most pronounced during late-night to early-morning hours (00:00–03:00)**, where average prices for first-time users peak.
- Both user groups show relatively **stable price trends from 06:00 to 20:00**, suggesting more consistent pricing during the day.
- **Price variance is significantly higher for first-time users** (longer error bars), especially at night, indicating more volatility in their pricing.
- A notable **outlier or data quality issue appears at hour 13** for returning users (sharp drop below -1000), which should be investigated further.
- The data suggests that **first-time users may be targeted with higher fares** or take more expensive rides on average — possibly due to:
  - Less price sensitivity
  - Fewer promotions
  - Riskier behavior patterns

In [None]:
# Filter failed users
failed_users = sub[sub["payment_status"] == 0]

# Get top 10 most common card_bins among failed users
top_bins = failed_users["card_bin"].value_counts().head(10)

# Plot
plt.figure(figsize=(10, 6))
sns.countplot(
    y="card_bin",
    data=failed_users[failed_users["card_bin"].isin(top_bins.index)],
    order=top_bins.index,
    
)

plt.title("Top 10 Card Bins Among Failed Transactions")
plt.xlabel("Count")
plt.ylabel("Card BIN")
plt.tight_layout()
plt.show()

### 📊 Key Takeaways: Top 10 Card BINs Among Failed Transactions

- The **most frequent card BIN** associated with failed transactions is **528497**, with over 3,500 occurrences — significantly more than any other BIN.
- Several other BINs (e.g., **484162, 557368, 412752**) are also associated with 1,500–2,000 failures, suggesting they may be tied to specific issuers or high-risk regions.
- There are **multiple BINs from the same leading digits (e.g., 516XXX and 557XXX)** — possibly indicating clusters of risky BINs from the same issuing bank or card scheme.
- The **distribution is long-tailed**, with the top BIN accounting for a disproportionate share of failed transactions, which could be indicative of:
  - A compromised issuer
  - Prepaid or virtual cards with low success rates
  - BINs frequently used in test or fraudulent behavior
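
Raw failure counts favor BINs that simply have many transactions. A sketch of ranking BINs by failure rate with a minimum-volume floor follows. The `MIN_TRANSACTIONS` value and the BIN numbers are illustrative assumptions.

```python
import pandas as pd

# Toy data: BINs and outcomes are invented for illustration.
toy = pd.DataFrame({
    "card_bin": [111111] * 8 + [222222] * 4 + [333333] * 1,
    "payment_status": [0, 0, 0, 0, 1, 1, 1, 1,   0, 0, 0, 1,   0],
})

MIN_TRANSACTIONS = 3  # illustrative volume floor, not a tuned value

bin_stats = (
    toy.groupby("card_bin")["payment_status"]
       .agg(n="count", failures=lambda s: (s == 0).sum())
)
bin_stats["failure_rate"] = bin_stats["failures"] / bin_stats["n"]

# Drop low-volume BINs, then rank by rate instead of by count.
risky_bins = (
    bin_stats[bin_stats["n"] >= MIN_TRANSACTIONS]
    .sort_values("failure_rate", ascending=False)
)
print(risky_bins)
```

Here BIN 111111 has the most failures by count, but BIN 222222 ranks first by rate, and the single-transaction BIN is excluded.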

In [None]:
# Create categories of trip types
df["ride_type"] = pd.cut(
    df["price"],
    bins=[-1, 0, 50, 200, np.inf],
    labels=["free", "low", "mid", "high"]
)

plt.figure(figsize=(8, 5))

# Building a plot
sns.barplot(
    x="ride_type",
    y="payment_status",
    data=df,
    order=["free", "low", "mid", "high"],
    
)

plt.title("Success Rate by Ride Type (Free vs Paid Tiers)")
plt.xlabel("Ride Type (by Price)")
plt.ylabel("Payment Success Rate")
plt.ylim(0, 1)
plt.tight_layout()
plt.show()

### 📊 Key Takeaways: Success Rate by Ride Type (Free vs Paid Tiers)

- **Free rides have an overwhelmingly high failure rate**, dominating the chart — indicating that most of these transactions do not complete successfully.
- Among **paid ride tiers** (low, mid, high), there is a **gradual increase in success rate** as price increases:
  - **Low-priced rides** have the lowest success among paid tiers.
  - **Mid- and high-priced rides** show relatively better performance and similar success rates.
- This pattern suggests that **free rides may be highly targeted for fraud, abuse, or test activity**, which could be:
  - Exploiting promotional codes
  - Generated by bots or synthetic users
- The chart reinforces the need to **treat free rides as a distinct risk category** in fraud detection logic.
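
The tier assignment depends on how `pd.cut` treats the bin edges. With the edges used above, intervals are right-closed, so a price of exactly 0 lands in "free", and any price at or below -1 falls outside all bins and becomes `NaN`. A quick check on invented prices:

```python
import numpy as np
import pandas as pd

# Invented prices chosen to sit on or near the bin edges.
prices = pd.Series([-1500.0, 0.0, 0.5, 50.0, 50.01, 200.0, 999.0])

ride_type = pd.cut(
    prices,
    bins=[-1, 0, 50, 200, np.inf],
    labels=["free", "low", "mid", "high"],
)
print(pd.DataFrame({"price": prices, "ride_type": ride_type}))
```

Rows with a `NaN` tier do not appear in any bar of the chart, so strongly negative prices would be silently left out of it.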

In [None]:
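# The first-time flag was created on `sub` only, so derive it on df as well
df["is_first_time_user"] = df["failed_attempts"] == 0

# 1. Payment success rate by first-time user flag
plt.figure(figsize=(6, 4))
sns.barplot(
    x="is_first_time_user",
    y="payment_status",
    data=df,
    hue="is_first_time_user",
    palette="coolwarm",
    legend=False
)
plt.title("Payment Success Rate by First-Time User Flag")
plt.xlabel("Is First-Time User")
plt.ylabel("Payment Success Rate")
plt.tight_layout()
plt.show()

# 2. Failed attempts distribution by payment status (rows with at least one failed attempt)
plt.figure(figsize=(10, 4))
sns.histplot(
    data=df[df["failed_attempts"] > 0],
    x="failed_attempts",
    hue="payment_status",
    multiple="stack",
    bins=40,
    palette="Set2"
)
plt.title("Failed Attempts Distribution by Payment Status")
plt.xlabel("Failed Attempts")
plt.ylabel("Count")
plt.tight_layout()
plt.show()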



### 📊 Key Takeaways: First-Time User Share by Payment Status

- The proportion of **first-time users** is extremely high (>95%) across **both successful and failed transactions**.
- This suggests that the current dataset may be **heavily skewed toward first-time users**, which can:
  - Limit the ability to distinguish behavioral patterns between new and returning users
  - Inflate signals that may seem predictive but are actually due to sampling bias
- Since both groups are equally dominated by first-time users, **"is_first_time_user" alone is not a strong discriminator** of payment outcome in this context.
- It’s possible the system or campaign is currently onboarding a large number of new users, which should be accounted for in interpretation.
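
Two different quantities are easy to mix up here: the share of first-time users within each payment outcome (what this takeaway describes) and the success rate within each user type (what a bar plot of `payment_status` by `is_first_time_user` shows). A sketch computing both on invented data, with an illustrative `user_type` label column:

```python
import pandas as pd

# Toy data: values are invented for illustration.
toy = pd.DataFrame({
    "payment_status":     [1, 1, 1, 1, 0, 0, 0, 0],
    "is_first_time_user": [True, True, True, False, True, True, False, False],
})
toy["user_type"] = toy["is_first_time_user"].map({True: "first_time", False: "returning"})

# Share of each user type within each payment outcome (every row sums to 1).
share_by_status = pd.crosstab(toy["payment_status"], toy["user_type"], normalize="index")

# Success rate within each user type.
success_by_user_type = toy.groupby("user_type")["payment_status"].mean()

print(share_by_status)
print(success_by_user_type)
```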

### 📊 Key Takeaways: Failed Attempts Distribution by Payment Status

- The **vast majority of transactions** — both successful and failed — occur with **zero or one failed attempt**, confirming that retrying is relatively uncommon.
- However, **failed payments are more prominent at lower retry counts**, especially at 0–1 attempts.
- As the number of failed attempts increases, the total count drops significantly, but **the share of successful payments among high retry counts is still visible**.
- This suggests that **some users do succeed after multiple failed attempts**, highlighting a **potential conversion opportunity** if retries are allowed.
- Very long retry chains (10+ failures) are rare, but they **could signal abuse patterns or bot-like behavior**.

In [None]:
failed_users = df[df["payment_status"] == 0]

# Check how many rows have a non-null card_bin
print("Failed users with card_bin:", failed_users["card_bin"].notna().sum())

# Show the top 10 BINs
top_bins = failed_users["card_bin"].value_counts().head(10)
print("Top card bins:\n", top_bins)

# Check that the data is not empty after filtering
filtered = failed_users[failed_users["card_bin"].isin(top_bins.index)]
print("Filtered rows count:", len(filtered))

In [None]:
df["card_bin_missing"] = df["card_bin"].isna()

sns.barplot(
    x="card_bin_missing",
    y="payment_status",
    hue="card_bin_missing",     
    data=df,
    palette="Set2",
    legend=False                
)
plt.title("Payment Success Rate by Card BIN Availability")
plt.xlabel("Is Card BIN Missing")
plt.ylabel("Success Rate")
plt.ylim(0, 1)
plt.xticks([0, 1], ["No", "Yes"])
plt.tight_layout()
plt.show()

### 🔍 Insight: Card BIN as a Fraud Indicator

A bar plot of payment success rate by `card_bin` availability reveals a **strong fraud signal**:

- ✅ When the `card_bin` is **available**, success rates range between 20–30%.
- ❌ When `card_bin` is **missing**, **100% of transactions fail**.

📌 **Implication**: `card_bin_missing == True` is highly predictive of payment failure, which makes it a candidate fraud signal. It can be used as:
- A direct input for a predictive model
- A real-time soft-blocking condition
- A trigger for additional user verification (e.g., phone/email validation)
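
Before adopting a soft-block, its effect can be estimated on historical data. The sketch below simulates a hypothetical rule, "block when the BIN is missing", and reports how many failures it would catch and how many successful payments it would block. The data is invented.

```python
import pandas as pd

# Toy data: BINs and outcomes are invented for illustration.
toy = pd.DataFrame({
    "card_bin": [111111, None, 222222, None, 333333, 444444, 111111, 222222],
    "payment_status": [0, 0, 1, 0, 0, 1, 0, 1],  # assumed: 1 = success, 0 = failure
})

toy["card_bin_missing"] = toy["card_bin"].isna()
failed = toy["payment_status"] == 0
blocked = toy["card_bin_missing"]  # hypothetical rule: block when the BIN is missing

failures_caught = (blocked & failed).sum()
successes_blocked = (blocked & ~failed).sum()

precision = failures_caught / blocked.sum()  # share of blocked transactions that would have failed
recall = failures_caught / failed.sum()      # share of all failures the rule catches
print(precision, recall)
```

In this toy example the rule never blocks a successful payment but catches only part of the failures, which is the kind of trade-off worth measuring on the real data.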

In [None]:
# Build a cross table (number of transactions)
heatmap_data = df.pivot_table(
    index="is_first_time_user",
    columns="card_bin_missing",
    values="payment_status",   # "user_id" could be used here as well
    aggfunc="count",
    fill_value=0
)

# Visualize as a heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(
    heatmap_data,
    annot=True,
    fmt="d",
    cmap="Reds",
    cbar=False,
    xticklabels=["BIN Present", "BIN Missing"],
    yticklabels=["Returning", "First-Time"]
)

plt.title("Transactions by Card BIN Availability and User Type")
plt.xlabel("Card BIN Missing")
plt.ylabel("User Type")
plt.tight_layout()
plt.show()

### 📊 Key Takeaways: Card BIN Availability by User Type

- The vast majority of transactions have a **valid Card BIN present**, regardless of user type:
  - **165,332** for first-time users
  - **45,637** for returning users
- **Missing BINs are extremely rare**, accounting for:
  - Only **96** first-time user transactions
  - Only **69** returning user transactions
- **First-time users dominate transaction volume overall**, and this holds true even when filtering by BIN presence.
- The negligible amount of missing BIN values suggests that **missing BIN is not a widespread data issue**, but could still be worth flagging when it occurs — particularly among new users.


# ✅ Fraud Detection – Question Summary Table

| **Category**                    | **Question**                                                                                         | **Answer / Insight**                                                                                       | **Status**         | **Action / Next Step**                                               |
|--------------------------------|------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------|--------------------|-----------------------------------------------------------------------|
| 🧍 User Behavior                | Are first-time users more likely to fail?                                                            | Explored only with a proxy flag (`failed_attempts == 0`); a true flag needs grouping by `user_id`.          | ☑️ Partial         | Feature engineer "first-time user" flag from `user_id`               |
|                                | Failure rate vs. `failed_attempts`?                                                                  | Strong negative correlation – more failures → lower success rate.                                          | ✅ Confirmed       | Use in model and flag logic                                          |
|                                | Common traits among repeat failed users?                                                             | Weak correlation with `device_name`, `os_version`, `card_bin`.                                              | ☑️ Partial         | Use for clustering/segmentation                                      |
|                                | Can we cluster users by behavior?                                                                    | Not yet done.                                                                                                | 🔍 To Do           | Try KMeans or DBSCAN with engineered features                        |
|                                | Which device models + OS versions fail more often?                                                   | Not analyzed.                                                                                                | 🔍 To Do           | Group by device + OS + payment_status                                |
| 🌍 Geography & Device Risk     | Countries with highest fail rate?                                                                    | Not yet calculated.                                                                                          | 🔍 To Do           | Group `country` by payment_status                                    |
|                                | Risky device–country combos?                                                                         | Not yet explored.                                                                                           | 🔍 To Do           | Pivot analysis on device × country × status                          |
|                                | Missing `card_bin` or `city_id` correlation?                                                         | Moderate to strong negative correlation → increased failure.                                                | ✅ Confirmed       | Add to rules/heuristics                                               |
| 💳 Payment & Pricing           | Are free or high-priced rides more likely to fail?                                                   | Yes – free rides and high-priced short rides often fail.                                                    | ✅ Confirmed       | Consider value thresholds for new users                              |
|                                | Price-per-km threshold and failures?                                                                 | Extreme values → high failure rate.                                                                         | ✅ Confirmed       | Use for rule-based blocking or alerts                                |
|                                | Do failed users retry or abandon?                                                                    | Not yet studied.                                                                                             | 🔍 To Do           | Analyze retry sequences per user                                     |
| 🕐 Time Patterns               | Time-of-day failure patterns?                                                                        | Not analyzed.                                                                                                | ⏳ Pending         | Plot failure rate by hour                                            |
|                                | Weekend or holiday spikes?                                                                          | Not yet checked.                                                                                             | ⏳ Pending         | Add temporal features                                                |
| 🔧 Feature Evaluation          | Most predictive features?                                                                            | `failed_attempts`, `missing_card_bin`, `price`, `price_per_km`, geo-missing fields.                         | ✅ Confirmed       | Use as model input                                                   |
|                                | Contextual failure rate (device + country + user type)?                                              | Not done yet.                                                                                                | 🔍 To Do           | Group by context and compute conversion rate                         |
|                                | Effect of soft-block simulation?                                                                     | Not tested yet.                                                                                              | 🔍 To Do           | Simulate filter + compare metrics                                    |
| 🔁 Product Strategy Support    | Would card validation before ride reduce fails?                                                      | To be simulated.                                                                                             | 🔍 To Do           | Calculate based on card_bin presence                                 |
|                                | Should we limit high-value rides for new users?                                                      | Needs trade-off simulation.                                                                                  | 🔍 To Do           | Segment users and model impact                                       |
|                                | Should we cap retries?                                                                               | Retry behavior not yet analyzed.                                                                            | 🔍 To Do           | Analyze per-user retry patterns                                      |
|                                | Would soft-gate (verify phone/email) help?                                                           | Potentially useful — needs testing.                                                                         | 💡 Hypothesis      | Simulate with historical risky profiles                              |

---

## 🧪 Hypotheses

| **#** | **Hypothesis**                                                                                  |
|------|--------------------------------------------------------------------------------------------------|
| 1    | Users with 2+ failed attempts are significantly less likely to succeed.                         |
| 2    | Missing `card_bin` or `city_id` strongly correlates with failure.                               |
| 3    | High-priced, short-distance rides are often fraudulent.                                         |
| 4    | Outliers in `price_per_km` are highly predictive of failure.                                    |
| 5    | Device–country combinations may help reveal fraud clusters.                                     |
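
As one possible way to test hypothesis 1, the sketch below runs a pooled two-proportion z-test (normal approximation) comparing success rates for 0-1 versus 2+ failed attempts. The data is invented, and the test treats rows as independent, which may not hold when one user contributes many rows.

```python
import math
import pandas as pd

# Toy data: 100 invented rows.
toy = pd.DataFrame({
    "failed_attempts": [0] * 40 + [1] * 20 + [2] * 20 + [5] * 20,
    "payment_status":  [1] * 30 + [0] * 10 + [1] * 12 + [0] * 8
                       + [1] * 4 + [0] * 16 + [1] * 2 + [0] * 18,
})

many = toy["failed_attempts"] >= 2
n1, n2 = int((~many).sum()), int(many.sum())
p1 = toy.loc[~many, "payment_status"].mean()  # success rate with 0-1 failed attempts
p2 = toy.loc[many, "payment_status"].mean()   # success rate with 2+ failed attempts

# Pooled standard error under the null hypothesis of equal success rates.
p_pool = toy["payment_status"].mean()
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# One-sided p-value for the alternative p1 > p2 (standard normal survival function).
p_value = 0.5 * math.erfc(z / math.sqrt(2))
print(p1, p2, z, p_value)
```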