
# Day 11 - Stripe
You are a data analyst in **Stripe**'s risk management team investigating transaction patterns to identify potential fraud. The team needs to develop a systematic approach to screen transactions for financial risks. Your goal is to create an initial risk assessment methodology using transaction characteristics.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

# Import data files
url_1 = "https://raw.githubusercontent.com/AnamHJ24/datascience-python-challenges/refs/heads/main/Data/Day_11_1.txt"
url_2 = "https://raw.githubusercontent.com/AnamHJ24/datascience-python-challenges/refs/heads/main/Data/Day_11_2.txt"
fct_transactions = pd.read_csv(url_1)
fct_transactions.head()

|   | customer_email | transaction_id | transaction_date | transaction_amount | fraud_detection_score |
|---|---|---|---|---|---|
| 0 | alice@gmail.com | 1 | 2024-10-05 | 120.0 | 10 |
| 1 | bob@customdomain.com | 2 | 2024-10-15 | 250.5 | 20 |
| 2 | charlie@yahoo.com | 3 | 2024-10-20 | 75.25 | 15 |
| 3 | dana@hotmail.com | 4 | 2024-10-25 | 100.0 | 30 |
| 4 | eve@biz.org | 5 | 2024-10-30 | 300.0 | 40 |
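
The `transaction_date` column arrives as plain text and is converted to datetime later in the notebook. As a minimal sketch of an alternative, the dates can be parsed at load time with the `parse_dates` argument of `pd.read_csv`. The inline `sample_csv` below is an assumption for illustration: it reuses two rows from the preview above rather than downloading the real file.

```python
import io

import pandas as pd

# Hypothetical inline sample built from two preview rows (not the real file).
sample_csv = io.StringIO(
    "customer_email,transaction_id,transaction_date,transaction_amount,fraud_detection_score\n"
    "alice@gmail.com,1,2024-10-05,120.0,10\n"
    "bob@customdomain.com,2,2024-10-15,250.5,20\n"
)

# parse_dates converts the listed columns to datetime while reading.
sample = pd.read_csv(sample_csv, parse_dates=["transaction_date"])
print(sample.dtypes)
```

With the dates parsed up front, the `.dt` accessor is available immediately and no separate conversion step is needed.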


In [2]:
dim_risk_flags = pd.read_csv(url_2)
dim_risk_flags.head()

|   | risk_level | risk_flag_id | transaction_id |
|---|---|---|---|
| 0 | Low | 1 | 2 |
| 1 | Medium | 2 | 7 |
| 2 | High | 3 | 11 |
| 3 | High | 4 | 12 |
| 4 | High | 5 | 13 |


## Question 1
How many transactions in October 2024 have a customer email ending with a domain other than 'gmail.com', 'yahoo.com', or 'hotmail.com'? This metric will help us identify transactions associated with less common email providers that may indicate emerging risk patterns.

## Solution

In [3]:
# Convert required columns to datetime
fct_transactions['transaction_date'] = pd.to_datetime(fct_transactions['transaction_date'])

# Filter October 2024 data
oct_2024 = fct_transactions[
  (fct_transactions['transaction_date'].dt.year == 2024) &
  (fct_transactions['transaction_date'].dt.month == 10)]

# Find uncommon email domains
common_domains = {'gmail.com', 'yahoo.com', 'hotmail.com'}
def has_uncommon_domain(email):
    if pd.isna(email):
        return False
    domain = email.lower().split('@')[-1]
    return domain not in common_domains

less_common_email = oct_2024[oct_2024['customer_email'].apply(has_uncommon_domain)]

print("Number of October 2024 transactions with an uncommon email domain:", len(less_common_email))


Number of October 2024 transactions with an uncommon email domain: 2
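
The row-wise `apply` above is easy to read; the same check can also be expressed with pandas' vectorized string methods. The sketch below is one possible variant, not the notebook's method. The `emails` series is made-up sample data (including a missing value and mixed case) used only to show the behavior.

```python
import pandas as pd

# Hypothetical sample data, including a missing email and mixed-case domains.
emails = pd.Series(
    ["alice@gmail.com", "bob@customdomain.com", "Charlie@Yahoo.com", None, "eve@biz.org"]
)
common_domains = ["gmail.com", "yahoo.com", "hotmail.com"]

# Lowercase, split once from the right on '@', and keep the last piece (the domain).
domains = emails.str.lower().str.rsplit("@", n=1).str[-1]

# Missing emails are treated as "not uncommon", matching the apply-based helper.
is_uncommon = emails.notna() & ~domains.isin(common_domains)
uncommon_count = int(is_uncommon.sum())
print(uncommon_count)
```

Here only `bob@customdomain.com` and `eve@biz.org` are counted, so the sample yields 2.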


## Question 2
For transactions occurring in November 2024, what is the average transaction amount, using 0 as a default for any missing values? This calculation will help us detect abnormal transaction amounts that could be related to fraudulent activity.

## Solution

In [4]:
# Filter November 2024 Data
nov_2024 = fct_transactions[
  (fct_transactions['transaction_date'].dt.year == 2024) &
  (fct_transactions['transaction_date'].dt.month == 11)]

# Calculate the average transaction amount
avg_transaction = nov_2024['transaction_amount'].fillna(0).mean()
print("Average transaction amount for November 2024:",avg_transaction)

Average transaction amount for November 2024: 180.15
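
The `fillna(0)` step matters because `Series.mean()` skips missing values by default, so the two approaches divide by different row counts. A small sketch with made-up amounts (not taken from the dataset) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical amounts with one missing value.
amounts = pd.Series([100.0, np.nan, 200.0])

# Default behavior: the NaN is skipped, so the mean is over 2 rows.
mean_skipping_missing = amounts.mean()

# Question 2's rule: missing counts as 0, so the mean is over all 3 rows.
mean_missing_as_zero = amounts.fillna(0).mean()

print(mean_skipping_missing, mean_missing_as_zero)
```

The first value is 150.0 and the second is 100.0, which is why the default of 0 has to be applied before averaging.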


## Question 3
Among transactions flagged as 'High' risk in December 2024, which day of the week recorded the highest number of such transactions? This analysis is intended to pinpoint specific days with concentrated high-risk activity and support the development of our preliminary fraud detection score.

## Solution

In [9]:
# Filter December 2024 Data
dec_2024 = fct_transactions[
  (fct_transactions['transaction_date'].dt.year == 2024) &
  (fct_transactions['transaction_date'].dt.month == 12)]

# Filter 'High' risk flags
high_risk_flags = dim_risk_flags[dim_risk_flags['risk_level'] == 'High']

# Find high risk transactions in December 2024
high_risk_transactions = dec_2024.merge(
  high_risk_flags[['transaction_id']],
  on = "transaction_id",
  how = "inner")
print("High risk transactions in December 2024:\n")
print(high_risk_transactions)

High risk transactions in December 2024:

       customer_email  transaction_id transaction_date  transaction_amount  \
0  laura@riskmail.com              11       2024-12-02               100.0   
1  mike@securepay.com              12       2024-12-03               180.0   
2   nina@trusthub.com              13       2024-12-09               220.0   
3  paula@alertsys.com              15       2024-12-23               260.0   

   fraud_detection_score  
0                     80  
1                     85  
2                     90  
3                     95  
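
An inner merge repeats a transaction row once per matching flag, so a transaction carrying more than one 'High' flag would appear several times. Whether the real `dim_risk_flags` table contains such duplicates is not known from the preview. The sketch below uses hypothetical frames (`transactions`, `flags`) to show the effect and one way to guard against it with `drop_duplicates` and the `validate` argument of `merge`.

```python
import pandas as pd

# Hypothetical data: transaction 11 carries two 'High' flags.
transactions = pd.DataFrame(
    {"transaction_id": [11, 12], "transaction_amount": [100.0, 180.0]}
)
flags = pd.DataFrame(
    {
        "risk_flag_id": [3, 4, 9],
        "risk_level": ["High", "High", "High"],
        "transaction_id": [11, 12, 11],
    }
)

# Naive merge: transaction 11 appears twice in the result.
naive = transactions.merge(flags[["transaction_id"]], on="transaction_id", how="inner")

# Deduplicate the keys first; validate raises if either side still has duplicate keys.
deduped_ids = flags[["transaction_id"]].drop_duplicates()
safe = transactions.merge(
    deduped_ids, on="transaction_id", how="inner", validate="one_to_one"
)

print(len(naive), len(safe))
```

Counting with `nunique()` on `transaction_id`, as the next cell does, also protects the final count from duplicated rows.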


In [11]:
# Find day of the week for the transactions
high_risk_transactions['day_of_week'] = high_risk_transactions['transaction_date'].dt.dayofweek

# Map numeric day to name
day_names = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}
high_risk_transactions['day_name'] = high_risk_transactions['day_of_week'].map(day_names)

# Count high-risk transactions by day of the week
risk_by_day = high_risk_transactions.groupby('day_name')['transaction_id'].nunique()

# Find the day with the highest number of high-risk transactions
highest_risk_day = risk_by_day.idxmax()
highest_risk_count = risk_by_day.max()

print(f"Day with the most high-risk transactions: {highest_risk_day} ({highest_risk_count} transactions)")

Day with the most high-risk transactions: Monday (3 transactions)
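
As a more compact variant of the mapping above, pandas can return weekday names directly through `dt.day_name()`. The sketch below also reports every day that shares the top count, since `idxmax()` returns only the first one when there is a tie. The dates are the four high-risk transaction dates printed earlier.

```python
import pandas as pd

# Transaction dates from the high-risk December 2024 output above.
dates = pd.to_datetime(
    pd.Series(["2024-12-02", "2024-12-03", "2024-12-09", "2024-12-23"])
)

# day_name() returns English weekday names by default.
day_counts = dates.dt.day_name().value_counts()

# Keep all days that share the maximum count, in case of a tie.
top_count = int(day_counts.max())
top_days = day_counts[day_counts == top_count].index.tolist()

print(top_days, top_count)
```

For these dates the result is a single day, Monday, with 3 transactions, which agrees with the output above.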
