Day 11 of Python Summer Party

by Interview Master

Stripe

Payment Fraud Risk Detection in Online Transactions

You are a data analyst in Stripe's risk management team investigating transaction patterns to identify potential fraud. The team needs to develop a systematic approach to screen transactions for financial risks. Your goal is to create an initial risk assessment methodology using transaction characteristics.

In [1]:
import pandas as pd
import numpy as np


In [2]:
# Load both CSV files into DataFrames and display them
fct_transactions = pd.read_csv('fct_transactions.csv')
dim_risk_flags = pd.read_csv('dim_risk_flags.csv')

fct_transactions_df = fct_transactions.copy()
dim_risk_flags_df = dim_risk_flags.copy()

print(fct_transactions_df.info())
print()
print(fct_transactions_df)
print()
print(dim_risk_flags_df.info())
print()
print(dim_risk_flags_df)
print()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   customer_email         15 non-null     object 
 1   transaction_id         15 non-null     int64  
 2   transaction_date       15 non-null     object 
 3   transaction_amount     14 non-null     float64
 4   fraud_detection_score  15 non-null     int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 732.0+ bytes
None

          customer_email  transaction_id transaction_date  transaction_amount  \
0        alice@gmail.com               1       2024-10-05              120.00   
1   bob@customdomain.com               2       2024-10-15              250.50   
2      charlie@yahoo.com               3       2024-10-20               75.25   
3       dana@hotmail.com               4       2024-10-25              100.00   
4            eve@biz.org               5       2024-10-3

Question 1 of 3

How many transactions in October 2024 have a customer email ending with a domain other than 'gmail.com', 'yahoo.com', or 'hotmail.com'? This metric will help us identify transactions associated with less common email providers that may indicate emerging risk patterns.

In [3]:
# First we need to normalize and transform the 'transaction_date' column to datetime format
fct_transactions_df['transaction_date'] = pd.to_datetime(fct_transactions_df['transaction_date'], format='%Y-%m-%d', errors='coerce')
print("'transaction_date' after converting to datetime format:")
print(fct_transactions_df.info())
print()

# Let's look for inconsistencies in 'customer_email'
print(fct_transactions_df['customer_email'].unique())


'transaction_date' after converting to datetime format:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         15 non-null     object        
 1   transaction_id         15 non-null     int64         
 2   transaction_date       15 non-null     datetime64[ns]
 3   transaction_amount     14 non-null     float64       
 4   fraud_detection_score  15 non-null     int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 732.0+ bytes
None

['alice@gmail.com' 'bob@customdomain.com' 'charlie@yahoo.com'
 'dana@hotmail.com' 'eve@biz.org' 'frank@gmail.com' 'grace@outlook.com'
 'ivan@yahoo.com' 'judy@hotmail.com' 'ken@domain.net' 'laura@riskmail.com'
 'mike@securepay.com' 'nina@trusthub.com' 'oscar@fintech.com'
 'paula@alertsys.com']
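
Because the conversion above uses errors='coerce', any date string that does not match the format silently becomes NaT. Here all 15 rows parsed, but a quick check makes that explicit. The following is a minimal sketch on hypothetical values (raw_dates is illustrative, not the real column):

```python
import pandas as pd

# Hypothetical raw values; the last one is deliberately malformed.
raw_dates = pd.Series(['2024-10-05', '2024-11-03', 'not-a-date'])

# Same call as in the notebook: bad values become NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Keep the original strings that were present but failed to parse.
failed = raw_dates[parsed.isna() & raw_dates.notna()]
print("Unparseable dates:", failed.tolist())
```

If the printed list is empty, the coercion did not drop any dates.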


In [4]:
# 'customer_email' is clean, no inconsistencies
# Now we filter the data to keep only transactions dated in October 2024
fct_oct_transactions_df = fct_transactions_df[(fct_transactions_df['transaction_date'] >= '2024-10-01') & (fct_transactions_df['transaction_date'] < '2024-11-01')]
print(fct_oct_transactions_df.info()) 
print(fct_oct_transactions_df.head())


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         5 non-null      object        
 1   transaction_id         5 non-null      int64         
 2   transaction_date       5 non-null      datetime64[ns]
 3   transaction_amount     5 non-null      float64       
 4   fraud_detection_score  5 non-null      int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 240.0+ bytes
None
         customer_email  transaction_id transaction_date  transaction_amount  \
0       alice@gmail.com               1       2024-10-05              120.00   
1  bob@customdomain.com               2       2024-10-15              250.50   
2     charlie@yahoo.com               3       2024-10-20               75.25   
3      dana@hotmail.com               4       2024-10-25              100.00   
4

In [5]:
# We will now find transactions with a domain other than 'gmail.com', 'yahoo.com', or 'hotmail.com'
# To do that we need to define a tuple of valid domains
valid_domains = ('@gmail.com', '@yahoo.com', '@hotmail.com')

# Then we will find the transactions using '~' as negation and the 'str.endswith' method
fct_oct_trans_val_email_df = fct_oct_transactions_df[~fct_oct_transactions_df['customer_email'].str.endswith(valid_domains, na=False)]
print(fct_oct_trans_val_email_df)


         customer_email  transaction_id transaction_date  transaction_amount  \
1  bob@customdomain.com               2       2024-10-15               250.5   
4           eve@biz.org               5       2024-10-30               300.0   

   fraud_detection_score  
1                     20  
4                     40  


In [6]:
# Now we display the number of October 2024 transactions from other email domains
print("There are only", fct_oct_trans_val_email_df.shape[0], "transactions with a domain other than 'gmail.com', 'yahoo.com', or 'hotmail.com'")


There are only 2 transactions with a domain other than 'gmail.com', 'yahoo.com', or 'hotmail.com'
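
An alternative way to get the same kind of count is to extract the domain explicitly and compare it against a set. This is only a sketch on hypothetical rows: the sample DataFrame, common_domains, and the choice to exclude missing emails are assumptions made for illustration, not part of the original solution.

```python
import pandas as pd

# Hypothetical sample rows, not the real fct_transactions data.
sample = pd.DataFrame({
    'customer_email': ['a@gmail.com', 'b@example.org', 'C@Yahoo.com ',
                       'd@notgmail.com', None],
    'transaction_date': pd.to_datetime(['2024-10-02', '2024-10-09', '2024-10-11',
                                        '2024-10-20', '2024-10-21']),
})
common_domains = {'gmail.com', 'yahoo.com', 'hotmail.com'}

# Normalize case and whitespace, then take the text after the last '@'.
domain = (sample['customer_email'].str.strip().str.lower()
          .str.rsplit('@', n=1).str[-1])

# Month filter expressed as a period comparison instead of two date bounds.
in_october = sample['transaction_date'].dt.to_period('M') == pd.Period('2024-10', freq='M')

# Assumption: a missing email is not counted as an "other" domain.
is_uncommon = domain.notna() & ~domain.isin(common_domains)

uncommon_count = int((in_october & is_uncommon).sum())
print(uncommon_count)
```

Note that the notebook's `~str.endswith(..., na=False)` filter would count a missing email as an "other" domain, whereas this sketch excludes it; which behavior is right depends on how the risk team wants to treat missing emails.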


Question 2 of 3

For transactions occurring in November 2024, what is the average transaction amount, using 0 as a default for any missing values? This calculation will help us detect abnormal transaction amounts that could be related to fraudulent activity.

In [7]:
# We need to filter the dates again, this time for November 2024
fct_nov_transactions_df = fct_transactions_df[(fct_transactions_df['transaction_date'] >= '2024-11-01') & (fct_transactions_df['transaction_date'] < '2024-12-01')]
print(fct_nov_transactions_df.info()) 
print(fct_nov_transactions_df.head())


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 5 to 9
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         5 non-null      object        
 1   transaction_id         5 non-null      int64         
 2   transaction_date       5 non-null      datetime64[ns]
 3   transaction_amount     4 non-null      float64       
 4   fraud_detection_score  5 non-null      int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 240.0+ bytes
None
      customer_email  transaction_id transaction_date  transaction_amount  \
5    frank@gmail.com               6       2024-11-03              150.75   
6  grace@outlook.com               7       2024-11-10                 NaN   
7     ivan@yahoo.com               8       2024-11-15              200.00   
8   judy@hotmail.com               9       2024-11-21              250.00   
9     ken@domain

In [8]:
# We can see that there is one null value in 'transaction_amount', so we replace it with 0
fct_nov_transactions_df = fct_nov_transactions_df.copy()
fct_nov_transactions_df['transaction_amount'] = pd.to_numeric(fct_nov_transactions_df['transaction_amount'], errors='coerce').fillna(0)
print(fct_nov_transactions_df.info())


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 5 to 9
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         5 non-null      object        
 1   transaction_id         5 non-null      int64         
 2   transaction_date       5 non-null      datetime64[ns]
 3   transaction_amount     5 non-null      float64       
 4   fraud_detection_score  5 non-null      int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 240.0+ bytes
None


In [9]:
# Now that we got rid of the null values we can proceed and calculate the average transaction amount for the whole month
fct_nov_avg_transaction_df = fct_nov_transactions_df['transaction_amount'].mean()
print("The average transaction amount for the whole month of November 2024 is:", fct_nov_avg_transaction_df)


The average transaction amount for the whole month of November 2024 is: 180.15
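
The fillna(0) step matters because pandas skips NaN by default when computing a mean, so the denominator changes. A small sketch with hypothetical amounts (not the November data) shows the difference:

```python
import pandas as pd

# Hypothetical amounts with one missing value (illustrative only).
amounts = pd.Series([100.0, None, 200.0, 300.0])

# Default behavior: NaN is ignored, so the mean is 600 / 3.
mean_skipping_missing = amounts.mean()

# With 0 as the default: the missing row is counted, so the mean is 600 / 4.
mean_with_zero_default = amounts.fillna(0).mean()

print(mean_skipping_missing, mean_with_zero_default)
```

The question asks for 0 as the default, so the second form is the one that matches the requirement.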


Question 3 of 3

Among transactions flagged as 'High' risk in December 2024, which day of the week recorded the highest number of such transactions? This analysis is intended to pinpoint specific days with concentrated high-risk activity and support the development of our preliminary fraud detection score.

In [10]:
# We start again by filtering for transactions in December 2024
fct_dec_transactions_df = fct_transactions_df[(fct_transactions_df['transaction_date'] >= '2024-12-01') & (fct_transactions_df['transaction_date'] < '2025-01-01')]
print(fct_dec_transactions_df.info()) 
print(fct_dec_transactions_df.head())


<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, 10 to 14
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         5 non-null      object        
 1   transaction_id         5 non-null      int64         
 2   transaction_date       5 non-null      datetime64[ns]
 3   transaction_amount     5 non-null      float64       
 4   fraud_detection_score  5 non-null      int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(1)
memory usage: 240.0+ bytes
None
        customer_email  transaction_id transaction_date  transaction_amount  \
10  laura@riskmail.com              11       2024-12-02               100.0   
11  mike@securepay.com              12       2024-12-03               180.0   
12   nina@trusthub.com              13       2024-12-09               220.0   
13   oscar@fintech.com              14       2024-12-16               140.0   
14  

In [11]:
# Then we left-join the 'dim_risk_flags' DataFrame onto the December transactions on 'transaction_id'
fct_dec_tran_risk_df = pd.merge(fct_dec_transactions_df, dim_risk_flags_df, how='left', on='transaction_id')
print(fct_dec_tran_risk_df.info())
print(fct_dec_tran_risk_df)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         5 non-null      object        
 1   transaction_id         5 non-null      int64         
 2   transaction_date       5 non-null      datetime64[ns]
 3   transaction_amount     5 non-null      float64       
 4   fraud_detection_score  5 non-null      int64         
 5   risk_level             5 non-null      object        
 6   risk_flag_id           5 non-null      int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(2)
memory usage: 412.0+ bytes
None
       customer_email  transaction_id transaction_date  transaction_amount  \
0  laura@riskmail.com              11       2024-12-02               100.0   
1  mike@securepay.com              12       2024-12-03               180.0   
2   nina@trusthub.com              13       
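
A left join can silently duplicate rows if a transaction has more than one risk flag, or leave risk_level empty if it has none. The sketch below uses the validate and indicator parameters of pd.merge to surface both cases. The miniature tables are hypothetical, and the one-flag-per-transaction relationship is an assumption rather than something the source states.

```python
import pandas as pd

# Hypothetical miniature tables; column names follow the notebook.
transactions = pd.DataFrame({
    'transaction_id': [11, 12, 13],
    'transaction_amount': [100.0, 180.0, 220.0],
})
risk_flags = pd.DataFrame({
    'risk_flag_id': [1, 2],
    'transaction_id': [11, 12],
    'risk_level': ['High', 'Low'],
})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a '_merge' column showing where each row came from.
merged = pd.merge(transactions, risk_flags, how='left', on='transaction_id',
                  validate='one_to_one', indicator=True)

# Transactions with no matching risk flag.
unflagged = merged[merged['_merge'] == 'left_only']
print(unflagged['transaction_id'].tolist())
```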

In [12]:
# Filter for high risk transactions
dec_high_risk = fct_dec_tran_risk_df[(fct_dec_tran_risk_df['risk_level'] == 'High')].copy()
print(dec_high_risk.info())



<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 4
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         4 non-null      object        
 1   transaction_id         4 non-null      int64         
 2   transaction_date       4 non-null      datetime64[ns]
 3   transaction_amount     4 non-null      float64       
 4   fraud_detection_score  4 non-null      int64         
 5   risk_level             4 non-null      object        
 6   risk_flag_id           4 non-null      int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(2)
memory usage: 256.0+ bytes
None


In [13]:
# Add weekday column
dec_high_risk['day_of_week'] = dec_high_risk['transaction_date'].dt.day_name()
print(dec_high_risk.info())
print(dec_high_risk.head())



<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 0 to 4
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   customer_email         4 non-null      object        
 1   transaction_id         4 non-null      int64         
 2   transaction_date       4 non-null      datetime64[ns]
 3   transaction_amount     4 non-null      float64       
 4   fraud_detection_score  4 non-null      int64         
 5   risk_level             4 non-null      object        
 6   risk_flag_id           4 non-null      int64         
 7   day_of_week            4 non-null      object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(3)
memory usage: 288.0+ bytes
None
       customer_email  transaction_id transaction_date  transaction_amount  \
0  laura@riskmail.com              11       2024-12-02               100.0   
1  mike@securepay.com              12       2024-12-03              

In [14]:
# Count by weekday
weekday_counts = dec_high_risk.groupby('day_of_week').size().reset_index(name='transaction_count')


# Find the max
max_day = weekday_counts.sort_values('transaction_count', ascending=False).head(1)

# Answer to question 3 
print("\nWeekdays with high-risk activity in Dec 2024:")
print(weekday_counts)
print("\nDay with highest high-risk activity in Dec 2024:")
print(max_day)



Weekdays with high-risk activity in Dec 2024:
  day_of_week  transaction_count
0      Monday                  3
1     Tuesday                  1

Day with highest high-risk activity in Dec 2024:
  day_of_week  transaction_count
0      Monday                  3
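
Sorting and taking head(1) returns a single row even when two weekdays share the top count. In this data Monday is the clear maximum, but a tie-aware variant could look like the sketch below, which uses hypothetical dates chosen to produce a tie:

```python
import pandas as pd

# Hypothetical high-risk dates: two Mondays, two Tuesdays, one Wednesday.
high_risk_dates = pd.Series(pd.to_datetime(
    ['2024-12-02', '2024-12-09', '2024-12-03', '2024-12-10', '2024-12-04']
))

# Count transactions per weekday name.
day_counts = high_risk_dates.dt.day_name().value_counts()

# Keep every weekday whose count equals the maximum, not just the first.
top_days = sorted(day_counts[day_counts == day_counts.max()].index)
print(top_days)
```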
