#### **PART 1** : DATA CLEANING
Authored by : Nitin Jangir

<br>
Owing to the long preprocessing pipeline required to clean the dataset, the entire project is split into 2 sequential notebooks. This notebook covers all the data wrangling steps needed to produce a clean dataset for high-quality insights.

##### **TABLE OF CONTENTS**

[STEP 1 : DUPLICATE HANDLING](#step-1--duplicate-handling)
<br>

[STEP 2 : NULL HANDLING](#step-2--null-handling)
- [IMPUTATION STRATEGY - id, Amt Refunded, ... Fee](#imputation-strategy---id-amt-refunded--fee)
- [IMPUTATION STRATEGY - Decline Reason](#imputation-strategy---decline-reason)
- [IMPUTATION STRATEGY - Seller Message](#imputation-strategy---seller-message)
- [IMPUTATION STRATEGY - Description](#imputation-strategy---description)
- [IMPUTATION STRATEGY - Checkout Line Item Summary](#imputation-strategy---checkout-line-item-summary)
- [IMPUTATION STRATEGY - Customer ID](#imputation-strategy---customer-id)
- [IMPUTATION STRATEGY - Refunded date, Dispute Date, Dispute Evidence Due (UTC)](#imputation-strategy---refunded-date-dispute-date-dispute-evidence-due-utc)
- [IMPUTATION STRATEGY - Card detail columns](#imputation-strategy---card-detail-columns)
- [IMPUTATION STRATEGY - Dispute columns](#imputation-strategy---dispute-columns)
<br>

[STEP 3 : DROPPING UNNECESSARY COLUMNS](#step-3--dropping-unnecessary-columns)
<br>

[STEP 4 : PARSING CORRECT DATATYPES](#step-4--parsing-correct-datatypes)
<BR>

[STEP 5 : DATA CONSISTENCY](#step-5--data-consistency)
<BR>

[STEP 6 : EXPORTING THE CLEANED DATASET](#step-6--exporting-the-cleaned-dataset)

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)

In [3]:
df = pd.read_csv('web3_jobs_stripe_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13359 entries, 0 to 13358
Data columns (total 38 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          7883 non-null   object 
 1   Created date (UTC)          13359 non-null  object 
 2   Amount                      13359 non-null  float64
 3   Amount Refunded             7883 non-null   float64
 4   Currency                    13359 non-null  object 
 5   Captured                    7883 non-null   object 
 6   Converted Amount            7883 non-null   float64
 7   Converted Amount Refunded   7883 non-null   float64
 8   Converted Currency          7883 non-null   object 
 9   Decline Reason              2203 non-null   object 
 10  Description                 5959 non-null   object 
 11  Fee                         7883 non-null   float64
 12  Is Link                     7883 non-null   object 
 13  Link Funding                293

In [4]:
df.head()

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
0,,2025-03-24 9:52:19,695.0,,usd,,,,,,,,,,Live,pi_3R67g7BnoxG9tBdf25MdiEfg,,,,requires_payment_method,,,,,,,,,,,,,,,,,,Job Post: Customer support (1)
1,,2025-03-24 9:52:17,447.0,,usd,,,,,,,,,,Live,pi_3R67g5BnoxG9tBdf2NIdMDTC,,,,requires_payment_method,,,,,,,,,,,,,,,,,,Job Post: looking for people for web3 game (1)
2,ch_3R673dBnoxG9tBdf32etciCT,2025-03-24 9:12:34,9.99,0.0,usd,True,13.09,0.0,sgd,,Subscription creation,1.01,False,,Live,pi_3R673dBnoxG9tBdf3rxWTzqv,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R673bBnoxG9tBdfti57d2nb,af18a1ba9061638a680a82884fd06d10,AE,MasterCard,debit,AE,apple_pay,cus_S07HDzTaLrjxqX,,7bd87277167663b41a5731ef602abf51,,,,,,Web3 Jobs Premium (1) (weekly)
3,ch_3R6526BnoxG9tBdf3nNx3crf,2025-03-24 7:02:52,9.99,0.0,usd,True,13.1,0.0,sgd,,Subscription creation,1.01,True,card,Live,pi_3R6526BnoxG9tBdf38qfllhs,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R6525BnoxG9tBdfkL1rJoDp,aeead298037c3c7c2dde5b5a69be48d3,TH,Visa,debit,GB,,cus_S05BvcU7dKNET8,,c18f08efa41c661b9e5905f4be60e816,,,,,,Web3 Jobs Premium (1) (weekly)
4,ch_3R64oZBnoxG9tBdf30EjMkLq,2025-03-24 6:48:52,39.99,0.0,usd,True,52.46,0.0,sgd,,Subscription creation,2.55,True,card,Live,pi_3R64oZBnoxG9tBdf3gckUYIo,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R64oYBnoxG9tBdfdk68ARH1,53b29e8bf70d1271bf82d96958d1513b,US,Visa,debit,US,,cus_R1ds09rQca8AMq,,6cafb52515bfce6ab74eff8e638e5af8,,,,,,Web3 Jobs Premium (1) (monthly)


## **DATA CLEANING**

##### `STEP 1 : DUPLICATE HANDLING`

In [5]:
#No duplicates
df[df.duplicated()]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
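An empty result confirms that no row is an exact copy of another. A count is often easier to read than an empty frame, and it can also be worth checking a key column on its own. The following is a minimal sketch on a made-up frame (`toy`); treating `PaymentIntent ID` as a key is an assumption for illustration, not something verified against this dataset.

```python
import pandas as pd

# Toy data for illustration only (not taken from the Stripe export)
toy = pd.DataFrame({
    'PaymentIntent ID': ['pi_1', 'pi_2', 'pi_2', 'pi_3'],
    'Amount': [9.99, 39.99, 39.99, 9.99],
    'Status': ['Paid', 'Paid', 'Failed', 'Paid'],
})

# Number of rows that are exact copies of an earlier row
full_dupes = int(toy.duplicated().sum())

# Number of rows that repeat an earlier value of the assumed key column
key_dupes = int(toy.duplicated(subset=['PaymentIntent ID']).sum())

print(full_dupes, key_dupes)
```

Repeated keys are not necessarily errors (several payment attempts may share one identifier), so a non-zero key count is a prompt to inspect the rows rather than to drop them.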


##### `STEP 2 : NULL HANDLING`

In [6]:
pd.DataFrame(df.isna().sum(), columns=['Null Count'])

Unnamed: 0,Null Count
id,5476
Created date (UTC),0
Amount,0
Amount Refunded,5476
Currency,0
Captured,5476
Converted Amount,5476
Converted Amount Refunded,5476
Converted Currency,5476
Decline Reason,11156


##### <b> IMPUTATION STRATEGY - id, Amt Refunded, ... Fee </b>

Note that the same number of NaNs exists in the following columns, so let's check whether all these NaNs occur in the same rows.
 - id, Amount Refunded, Captured, Converted Amount, Converted Amount Refunded, Converted Currency, Fee, Is Link, Payment Source Type & Taxes On Fee

In [7]:
df[(df['id'].isna()) &
   (df['Amount Refunded'].isna()) & 
   (df['Captured'].isna()) & 
   (df['Converted Amount'].isna()) & 
   (df['Converted Amount Refunded'].isna()) & 
   (df['Converted Currency'].isna()) & 
   (df['Fee'].isna()) & 
   (df['Is Link'].isna()) & 
   (df['Payment Source Type'].isna()) & 
   (df['Taxes On Fee'].isna())]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
0,,2025-03-24 9:52:19,695.0,,usd,,,,,,,,,,Live,pi_3R67g7BnoxG9tBdf25MdiEfg,,,,requires_payment_method,,,,,,,,,,,,,,,,,,Job Post: Customer support (1)
1,,2025-03-24 9:52:17,447.0,,usd,,,,,,,,,,Live,pi_3R67g5BnoxG9tBdf2NIdMDTC,,,,requires_payment_method,,,,,,,,,,,,,,,,,,Job Post: looking for people for web3 game (1)
6,,2025-03-23 16:41:12,447.0,,usd,,,,,,,,,,Live,pi_3R5raGBnoxG9tBdf3mSoxCYV,,,,requires_payment_method,,,,,,,,,,,,,,,,,,Job Post: Blockchain Dev (1)
11,,2025-03-23 8:57:13,348.0,,usd,,,,,,,,,,Live,pi_3R5kLFBnoxG9tBdf2y87k78Q,,,,canceled,,,,,,,,,,,,,,,,,,Job Post: Community Moderator (1)
31,,2025-03-21 21:00:23,10906.0,,usd,,,,,,,,,,Live,pi_3R5CfzBnoxG9tBdf1KBhNubv,,,,canceled,,,,,,,,,,,,,,,,,,Bundle: 24 Jobs (1)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13354,,2022-02-04 11:53:08,326.0,,usd,,,,,,,,,,Live,pi_3KPQLgBnoxG9tBdf2iVzJcS5,,,,canceled,,,,,,,,,,,,,,,,,,AAAA AAA (1)
13355,,2022-02-04 11:52:59,326.0,,usd,,,,,,,,,,Live,pi_3KPQLXBnoxG9tBdf1iT3a329,,,,canceled,,,,,,,,,,,,,,,,,,AAAA AAA (1)
13356,,2022-02-04 11:21:25,326.0,,usd,,,,,,,,,,Live,pi_3KPPqzBnoxG9tBdf2dmAzYPL,,,,canceled,,,,,,,,,,,,,,,,,,sd (1)
13357,,2022-02-04 10:40:19,326.0,,usd,,,,,,,,,,Live,pi_3KPPDDBnoxG9tBdf2r4JpzvG,,,,canceled,,,,,,,,,,,,,,,,,,sdffsdsdf (1)


Since all 5476 NaNs in the aforementioned columns appear in the same rows, we shall impute the values for them together using Status as the reference column to guide the imputation.
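The same co-occurrence check can be written more compactly by comparing how many rows are null in all of the selected columns with how many are null in any of them. This is a sketch on toy data; the three columns in `cols` are a shortened stand-in for the full list of ten.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'id': ['ch_1', np.nan, np.nan, 'ch_4'],
    'Fee': [1.01, np.nan, np.nan, 2.55],
    'Captured': [True, np.nan, np.nan, True],
    'Status': ['Paid', 'canceled', 'requires_action', 'Paid'],
})
cols = ['id', 'Fee', 'Captured']  # shortened stand-in for the real column list

null_mask = toy[cols].isna()
rows_all_null = int(null_mask.all(axis=1).sum())  # null in every selected column
rows_any_null = int(null_mask.any(axis=1).sum())  # null in at least one of them

# If the two counts match, the NaNs never occur in isolation
same_rows = rows_all_null == rows_any_null
print(rows_all_null, rows_any_null, same_rows)
```

When the two counts differ, the difference is the number of rows with a partial set of NaNs, which would need a separate imputation rule.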

In [8]:
#values of Status column where NaNs exist in the other selected columns
print(df[(df['Amount Refunded'].isna()) & 
   (df['Captured'].isna()) & 
   (df['Converted Amount'].isna()) & 
   (df['Converted Amount Refunded'].isna()) & 
   (df['Converted Currency'].isna()) & 
   (df['Fee'].isna()) & 
   (df['Is Link'].isna()) & 
   (df['Payment Source Type'].isna()) & 
   (df['Taxes On Fee'].isna())]['Status'].unique())

['requires_payment_method' 'canceled' 'requires_action']


Since the Status values in the rows where the NaNs exist all correspond to some form of transaction failure, the following code is used to impute the missing values. Note that 0 is used for numerical columns, even though 'Not Applicable' would be a more meaningful replacement for some of them, so as to keep manipulation & aggregation simple for business analysis.
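One consequence of zero-imputation is worth keeping in mind: sums are unaffected, but averages are pulled down because the failed rows now count as observations. The sketch below uses a toy `fee` Series (an assumption for illustration) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy fees: NaN stands for a failed transaction with no fee recorded
fee = pd.Series([1.0, 3.0, np.nan, np.nan])
filled = fee.fillna(0)

total_before = fee.sum()     # NaNs are skipped
total_after = filled.sum()   # zeros add nothing, so the total is unchanged

mean_before = fee.mean()     # averaged over the 2 rows that have a fee
mean_after = filled.mean()   # averaged over all 4 rows

print(total_before, total_after, mean_before, mean_after)
```

For averages over successful payments only, filtering on Status before aggregating avoids this dilution.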

In [9]:
#assign the result back instead of calling fillna(inplace=True) on a column selection, which pandas warns will stop working in 3.0
df['id'] = df['id'].fillna('Not Generated') #assuming that no payment id is generated if the transaction failed
df['Amount Refunded'] = df['Amount Refunded'].fillna(0) #no transaction requires no refund
df['Captured'] = df['Captured'].fillna(False)
df['Converted Amount'] = df['Converted Amount'].fillna(0)
df['Converted Currency'] = df['Converted Currency'].fillna('N/A')
df['Converted Amount Refunded'] = df['Converted Amount Refunded'].fillna(0)
df['Fee'] = df['Fee'].fillna(0) #no fee is charged on failed transactions
df['Is Link'] = df['Is Link'].fillna(False)
df['Payment Source Type'] = df['Payment Source Type'].fillna('N/A') #no information on the source can be assumed without a successful transfer
df['Taxes On Fee'] = df['Taxes On Fee'].fillna(0) #intuitively, zero taxes on zero fees

df.isna().sum()

id                                0
Created date (UTC)                0
Amount                            0
Amount Refunded                   0
Currency                          0
Captured                          0
Converted Amount                  0
Converted Amount Refunded         0
Converted Currency                0
Decline Reason                11156
Description                    7400
Fee                               0
Is Link                           0
Link Funding                  10427
Mode                              0
PaymentIntent ID                  1
Payment Source Type               0
Refunded date (UTC)           13192
Statement Descriptor           5993
Status                            0
Seller Message                 5474
Taxes On Fee                      0
Card ID                        5923
Card Name MD5                  5986
Card Address Country           5493
Card Brand                     5923
Card Funding                   5923
Card Issue Country          

As illustrated below, a large number of the NaNs in the remaining columns occur in rows where Status is requires_payment_method, canceled, or requires_action. So we will again use the Status column as the reference point to guide the remaining imputations.

In [10]:
df[df['Status'].isin(['requires_payment_method', 'canceled', 'requires_action'])]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
0,Not Generated,2025-03-24 9:52:19,695.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3R67g7BnoxG9tBdf25MdiEfg,,,,requires_payment_method,,0.0,,,,,,,,,,,,,,,,Job Post: Customer support (1)
1,Not Generated,2025-03-24 9:52:17,447.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3R67g5BnoxG9tBdf2NIdMDTC,,,,requires_payment_method,,0.0,,,,,,,,,,,,,,,,Job Post: looking for people for web3 game (1)
6,Not Generated,2025-03-23 16:41:12,447.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3R5raGBnoxG9tBdf3mSoxCYV,,,,requires_payment_method,,0.0,,,,,,,,,,,,,,,,Job Post: Blockchain Dev (1)
11,Not Generated,2025-03-23 8:57:13,348.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3R5kLFBnoxG9tBdf2y87k78Q,,,,canceled,,0.0,,,,,,,,,,,,,,,,Job Post: Community Moderator (1)
31,Not Generated,2025-03-21 21:00:23,10906.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3R5CfzBnoxG9tBdf1KBhNubv,,,,canceled,,0.0,,,,,,,,,,,,,,,,Bundle: 24 Jobs (1)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13354,Not Generated,2022-02-04 11:53:08,326.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3KPQLgBnoxG9tBdf2iVzJcS5,,,,canceled,,0.0,,,,,,,,,,,,,,,,AAAA AAA (1)
13355,Not Generated,2022-02-04 11:52:59,326.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3KPQLXBnoxG9tBdf1iT3a329,,,,canceled,,0.0,,,,,,,,,,,,,,,,AAAA AAA (1)
13356,Not Generated,2022-02-04 11:21:25,326.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3KPPqzBnoxG9tBdf2dmAzYPL,,,,canceled,,0.0,,,,,,,,,,,,,,,,sd (1)
13357,Not Generated,2022-02-04 10:40:19,326.0,0.0,usd,False,0.0,0.0,,,,0.0,False,,Live,pi_3KPPDDBnoxG9tBdf2r4JpzvG,,,,canceled,,0.0,,,,,,,,,,,,,,,,sdffsdsdf (1)


##### <b> IMPUTATION STRATEGY - Decline Reason </b>

In [11]:
df[df['Decline Reason'].isna()]['Status'].unique()

array(['requires_payment_method', 'Paid', 'canceled', 'requires_action',
       'Refunded', 'Failed'], dtype=object)

Notice that NaNs occur in this column for all values of transaction Status. Consequently, the following rules are used:
- Paid -> N/A (since the payment was successful)
- Refunded -> N/A (since the payment was initially successful & then returned)
- Failed -> generic_decline
- Canceled -> generic_decline
- Requires Action -> authentication_required (since Stripe's official documentation mentions this status code is displayed when the customer didn’t complete the checkout)
- Requires Payment Method -> try_again_later (according to documentation, this status is raised when the payment fails on the business's checkout page)

<br>
Additionally, since a transaction can fail for multiple reasons, and Failed, Canceled & Requires Payment Method can therefore map to multiple possible decline codes, the most generic failure codes are chosen so as not to introduce an unintended bias.

<br>

Source: [Status Codes Documentation](https://docs.stripe.com/payments/payment-intents/verifying-status) ,
[Decline Codes Documentation](https://docs.stripe.com/declines/codes)

In [12]:
decline_code_dict = {'Paid': 'N/A',
                     'Refunded': 'N/A',
                     'Failed': 'generic_decline',
                     'canceled': 'generic_decline',
                     'requires_action': 'authentication_required',
                     'requires_payment_method': 'try_again_later'}


df['Decline Reason'] = df.apply(lambda x: decline_code_dict[x['Status']] if pd.isna(x['Decline Reason']) 
                                else x['Decline Reason'], axis=1)
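A vectorized alternative to the row-wise `apply` is to map Status through the rule dictionary and use the result only where Decline Reason is missing. This is a sketch on toy data with a shortened rule dictionary (`rules`), not a drop-in replacement: unlike the dictionary lookup inside `apply`, `map` leaves a NaN for any Status that has no rule instead of raising a `KeyError`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Status': ['Paid', 'Failed', 'requires_action', 'Failed'],
    'Decline Reason': [np.nan, np.nan, np.nan, 'insufficient_funds'],
})

# Shortened version of the rule dictionary
rules = {'Paid': 'N/A',
         'Failed': 'generic_decline',
         'requires_action': 'authentication_required'}

# Build the fallback value for every row, then use it only where the original is NaN
fallback = toy['Status'].map(rules)
toy['Decline Reason'] = toy['Decline Reason'].fillna(fallback)

result = toy['Decline Reason'].tolist()
print(result)
```

Existing decline codes, such as `insufficient_funds` in the last row, are left untouched.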

In [13]:
df['Decline Reason'].unique()

array(['try_again_later', 'N/A', 'previously_declined_do_not_retry',
       'generic_decline', 'insufficient_funds', 'transaction_not_allowed',
       'incorrect_number', 'do_not_honor', 'highest_risk_level',
       'invalid_account', 'authentication_required', 'incorrect_cvc',
       'call_issuer', 'card_velocity_exceeded', 'card_not_supported',
       'partner_insufficient_funds',
       'link_additional_verification_required', 'stolen_card',
       'requested_block_on_incorrect_cvc', 'lost_card', 'invalid_cvc',
       'expired_card', 'pickup_card', 'blocklist', 'elevated_risk_level',
       'rule', 'link_connection_closed', 'invalid_amount',
       'processing_error'], dtype=object)

##### <b> IMPUTATION STRATEGY - Seller Message </b>

In [14]:
df[df['Seller Message'].isna()]['Decline Reason'].unique()

array(['try_again_later', 'generic_decline', 'authentication_required'],
      dtype=object)

Note that all the NaNs in Seller Message occur in rows that correspond to transaction failure. Hence, the following imputation rules, based on the documentation, are used:

- try_again_later -> Ask the customer to attempt the payment again. If subsequent payments are declined, the customer needs to contact their card issuer for more information.
- authentication_required -> The card was declined because the transaction requires authentication
- generic_decline -> The card was declined for an unknown reason.

In [15]:
seller_message_dict = {'try_again_later' : 'Ask the customer to attempt the payment again. If subsequent payments are declined, the customer needs to contact their card issuer for more information.',
                       'authentication_required' : 'The card was declined because the transaction requires authentication',
                       'generic_decline' : 'The card was declined for an unknown reason.'}


df['Seller Message'] = df.apply(lambda x: seller_message_dict[x['Decline Reason']] if pd.isna(x['Seller Message']) 
                                else x['Seller Message'], axis=1)

In [16]:
#No more NaNs
df['Seller Message'].unique()

array(['Ask the customer to attempt the payment again. If subsequent payments are declined, the customer needs to contact their card issuer for more information.',
       'Payment complete.',
       "You previously attempted to charge this card. When the customer's bank declined that payment, it directed Stripe to block future attempts.",
       'The card was declined for an unknown reason.',
       'The bank returned the decline code `insufficient_funds`.',
       'The bank returned the decline code `transaction_not_allowed`.',
       'The bank returned the decline code `try_again_later`.',
       'The bank returned the decline code `incorrect_number`.',
       'The bank returned the decline code `do_not_honor`.',
       'Stripe blocked this payment as too risky.',
       'Your card number is incorrect.',
       'The bank did not return any further details with this decline.',
       'The bank returned the decline code `invalid_account`.',
       'The card was declined because the tra

##### <b> IMPUTATION STRATEGY - Description </b>

<br>
Since the values in the Checkout Line Item Summary column are, in a way, a derivative of the Description (as shown below), we will use them to impute the missing data in the Description column.

In [17]:
df[(~df['Description'].isna()) & (~df['Checkout Line Item Summary'].isna())].head()

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
2,ch_3R673dBnoxG9tBdf32etciCT,2025-03-24 9:12:34,9.99,0.0,usd,True,13.09,0.0,sgd,,Subscription creation,1.01,False,,Live,pi_3R673dBnoxG9tBdf3rxWTzqv,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R673bBnoxG9tBdfti57d2nb,af18a1ba9061638a680a82884fd06d10,AE,MasterCard,debit,AE,apple_pay,cus_S07HDzTaLrjxqX,,7bd87277167663b41a5731ef602abf51,,,,,,Web3 Jobs Premium (1) (weekly)
3,ch_3R6526BnoxG9tBdf3nNx3crf,2025-03-24 7:02:52,9.99,0.0,usd,True,13.1,0.0,sgd,,Subscription creation,1.01,True,card,Live,pi_3R6526BnoxG9tBdf38qfllhs,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R6525BnoxG9tBdfkL1rJoDp,aeead298037c3c7c2dde5b5a69be48d3,TH,Visa,debit,GB,,cus_S05BvcU7dKNET8,,c18f08efa41c661b9e5905f4be60e816,,,,,,Web3 Jobs Premium (1) (weekly)
4,ch_3R64oZBnoxG9tBdf30EjMkLq,2025-03-24 6:48:52,39.99,0.0,usd,True,52.46,0.0,sgd,,Subscription creation,2.55,True,card,Live,pi_3R64oZBnoxG9tBdf3gckUYIo,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R64oYBnoxG9tBdfdk68ARH1,53b29e8bf70d1271bf82d96958d1513b,US,Visa,debit,US,,cus_R1ds09rQca8AMq,,6cafb52515bfce6ab74eff8e638e5af8,,,,,,Web3 Jobs Premium (1) (monthly)
12,ch_3R5gViBnoxG9tBdf1vcYPGgn,2025-03-23 4:51:52,39.99,0.0,usd,True,52.36,0.0,sgd,,Subscription creation,2.54,False,,Live,pi_3R5gViBnoxG9tBdf1tHvJId6,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R5gVhBnoxG9tBdfkXZ8OjNi,cb2795677b2c33e3fd2a46f1daa14dc8,CA,MasterCard,credit,CA,,cus_Ry50i0H7S9VykA,,5f96efaf6ffcc28f20a6aaf0807470c0,,,,,,Web3 Jobs Premium (1) (monthly)
32,ch_3R5A7TBnoxG9tBdf1BaMwPew,2025-03-21 18:16:37,9.99,0.0,usd,True,13.07,0.0,sgd,,Subscription creation,1.01,True,card,Live,pi_3R5A7TBnoxG9tBdf1ZLb1Tbi,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R5A7SBnoxG9tBdfxxTHXel5,c05cdcce14fa2e59839661170308e8a8,US,Visa,debit,US,,cus_Rz8MaNkftT2hnc,,a2930c124bf77c738a73f308030771f2,,,,,,Web3 Jobs Premium (1) (weekly)


In [18]:
df['Description'] = df.apply(lambda x: x['Checkout Line Item Summary'] if pd.isna(x['Description']) 
                             else x['Description'], axis=1)

In [19]:
df['Description'].isna().sum()

np.int64(3)

Note that 3 NaNs still remain in Description. This is because the Checkout Line Item Summary values for these rows are also missing (as shown below).

In [20]:
#only 3 rows where both columns contain NaNs
df[(df['Description'].isna()) & (df['Checkout Line Item Summary'].isna())]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
12369,ch_3KxZSPBnoxG9tBdf3Cwsoluz,2022-05-09 16:29:13,298.0,0.0,usd,True,406.89,0.0,sgd,,,14.33,False,,Live,pi_3KxZSPBnoxG9tBdf37kpXTC8,card,,Web3 Jobs,Paid,Payment complete.,0.0,pm_1KmPd0BnoxG9tBdfYHza3TBy,b17bce8901cda47bd3a240b450d8dac6,US,MasterCard,credit,US,,cus_LTMLPF8E2dDQuE,,cf6bb90e201ab1d304c39d57834242e4,,,,,,
12637,Not Generated,2022-04-07 14:00:44,1.0,0.0,sgd,False,0.0,0.0,,try_again_later,,0.0,False,,Live,pi_3KlvtABnoxG9tBdf3cE1unZ8,,,,requires_payment_method,Ask the customer to attempt the payment again....,0.0,,,,,,,,cus_LSrEJJA1A6yKkL,,eea87a56d2cc6558d20c2419287072c5,,,,,,
12638,Not Generated,2022-04-07 14:00:06,1.0,0.0,usd,False,0.0,0.0,,try_again_later,,0.0,False,,Live,pi_3KlvsYBnoxG9tBdf191CjJow,,,,requires_payment_method,Ask the customer to attempt the payment again....,0.0,,,,,,,,cus_LQ8I4cJy0uPiXG,,bbb6563dbe7d2c57bb590b17d98e35b5,,,,,,


So, we will individually look into all 3 Customer IDs to identify purchase patterns & impute with the most likely purchase as per history.
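With only 3 rows, manual inspection is practical. For a larger number of gaps, the same idea could be approximated programmatically by suggesting each customer's most frequent purchase category. The sketch below runs on toy data; reducing a description to the text before the first colon, and the helper `most_common`, are assumptions for illustration, and the suggested labels would still need a manual review.

```python
import numpy as np
import pandas as pd

# Toy purchase history for illustration only
toy = pd.DataFrame({
    'Customer ID': ['cus_A', 'cus_A', 'cus_A', 'cus_B', 'cus_B'],
    'Description': ['Job Post: Writer (1)', 'Job Post: Editor (1)', np.nan,
                    'Bundle: 6 Jobs (1)', np.nan],
})

# Reduce each description to its leading category (text before the first colon)
category = toy['Description'].str.split(':').str[0]

def most_common(values):
    """Return the most frequent non-null value, or NaN if there is none."""
    modes = values.dropna().mode()
    return modes.iloc[0] if not modes.empty else np.nan

# Most frequent category per customer, aligned with the original rows
suggested = category.groupby(toy['Customer ID']).transform(most_common)

# Fill only the missing descriptions with the suggestion
toy['Description'] = toy['Description'].fillna(suggested)
print(toy)
```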

In [21]:
df[df['Customer ID']=='cus_LTMLPF8E2dDQuE']

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
11829,ch_3LJR6aBnoxG9tBdf2PgcQKjr,2022-07-09 0:01:04,298.0,0.0,usd,True,409.21,0.0,sgd,,Payment for Invoice,14.41,False,,Live,pi_3LJR6aBnoxG9tBdf2RNnrRBW,card,,Job renewal 25552,Paid,Payment complete.,0.0,pm_1KmPd0BnoxG9tBdfYHza3TBy,b17bce8901cda47bd3a240b450d8dac6,US,MasterCard,credit,US,,cus_LTMLPF8E2dDQuE,,cf6bb90e201ab1d304c39d57834242e4,,,,,,
12096,ch_3L8UyUBnoxG9tBdf2JSqpPM4,2022-06-08 19:55:30,298.0,0.0,usd,True,401.64,0.0,sgd,,Job Renewal 25552,14.16,False,,Live,pi_3L8UyUBnoxG9tBdf2UykS77s,card,,Web3 Jobs,Paid,Payment complete.,0.0,pm_1KmPd0BnoxG9tBdfYHza3TBy,b17bce8901cda47bd3a240b450d8dac6,US,MasterCard,credit,US,,cus_LTMLPF8E2dDQuE,,cf6bb90e201ab1d304c39d57834242e4,,,,,,
12369,ch_3KxZSPBnoxG9tBdf3Cwsoluz,2022-05-09 16:29:13,298.0,0.0,usd,True,406.89,0.0,sgd,,,14.33,False,,Live,pi_3KxZSPBnoxG9tBdf37kpXTC8,card,,Web3 Jobs,Paid,Payment complete.,0.0,pm_1KmPd0BnoxG9tBdfYHza3TBy,b17bce8901cda47bd3a240b450d8dac6,US,MasterCard,credit,US,,cus_LTMLPF8E2dDQuE,,cf6bb90e201ab1d304c39d57834242e4,,,,,,
12622,ch_3KmPcXBnoxG9tBdf068KV1Bg,2022-04-08 21:46:03,298.0,0.0,usd,True,398.0,0.0,sgd,,Job Post: Web3 Marketing Manager - Brand Exper...,14.03,False,,Live,pi_3KmPcXBnoxG9tBdf0VIM8vHr,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1KmPd0BnoxG9tBdfYHza3TBy,b17bce8901cda47bd3a240b450d8dac6,US,MasterCard,credit,US,,cus_LTMLPF8E2dDQuE,,cf6bb90e201ab1d304c39d57834242e4,,,,,,Job Post: Web3 Marketing Manager - Brand Exper...


In [22]:
df.loc[12369, 'Description'] = 'Job Post'

In [23]:
df[df['Customer ID']=='cus_LSrEJJA1A6yKkL']

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
12637,Not Generated,2022-04-07 14:00:44,1.0,0.0,sgd,False,0.0,0.0,,try_again_later,,0.0,False,,Live,pi_3KlvtABnoxG9tBdf3cE1unZ8,,,,requires_payment_method,Ask the customer to attempt the payment again....,0.0,,,,,,,,cus_LSrEJJA1A6yKkL,,eea87a56d2cc6558d20c2419287072c5,,,,,,
12639,ch_3KlvQxBnoxG9tBdf0n5V16oD,2022-04-07 13:36:39,646.0,0.0,usd,True,861.12,0.0,sgd,,"Job Post: Senior Content Writer, Copywriter (1)",29.78,False,,Live,pi_3KlvQxBnoxG9tBdf0pdqUVcc,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1KlvVqBnoxG9tBdfpvr93cpu,46aab770d7f720c5a36e06c5762503be,AM,Visa,debit,SG,,cus_LSrEJJA1A6yKkL,,eea87a56d2cc6558d20c2419287072c5,,,,,,"Job Post: Senior Content Writer, Copywriter (1)"


In [24]:
df.loc[12637, 'Description'] = 'Job Post'

In [25]:
df[df['Customer ID']=='cus_LQ8I4cJy0uPiXG']

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
12638,Not Generated,2022-04-07 14:00:06,1.0,0.0,usd,False,0.0,0.0,,try_again_later,,0.0,False,,Live,pi_3KlvsYBnoxG9tBdf191CjJow,,,,requires_payment_method,Ask the customer to attempt the payment again....,0.0,,,,,,,,cus_LQ8I4cJy0uPiXG,,bbb6563dbe7d2c57bb590b17d98e35b5,,,,,,
12706,ch_3KjHCQBnoxG9tBdf21mPKToy,2022-03-31 7:02:50,1026.0,0.0,usd,True,1359.87,0.0,sgd,,Bundle: 6 Jobs (1),46.74,False,,Live,pi_3KjHCQBnoxG9tBdf2koBrUBm,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1KjI1tBnoxG9tBdfwzy1ZQ56,3dca4e51970452035c1a5d19f8c01346,SG,Visa,debit,SG,,cus_LQ8I4cJy0uPiXG,,bbb6563dbe7d2c57bb590b17d98e35b5,,,,,,Bundle: 6 Jobs (1)


In [26]:
df.loc[12638, 'Description'] = 'Job Bundle'

In [27]:
#No NaN entries anymore
df['Description'].isna().sum()

np.int64(0)

##### <b> IMPUTATION STRATEGY - Checkout Line Item Summary </b>

<br>
In the same manner, since the values of Checkout Line Item Summary mirror those of Description, we will use the Description column to impute its missing values.

In [28]:
df['Checkout Line Item Summary'] = df.apply(lambda x: x['Description'] if pd.isna(x['Checkout Line Item Summary']) 
                             else x['Checkout Line Item Summary'], axis=1)

In [29]:
#No NaNs anymore
df['Checkout Line Item Summary'].isna().sum()

np.int64(0)

##### <b> IMPUTATION STRATEGY - Customer ID </b>

In [30]:
df[df['Customer ID'].isna()]['Status'].unique()

array(['requires_payment_method', 'canceled'], dtype=object)

Note that Customer IDs are missing only in rows that correspond to transaction failure. This suggests that Stripe does not associate an internal customer ID until the payment is successfully completed. Hence, the best strategy in this case is to not fabricate any values and impute 'Not Generated' instead.

In [31]:
df['Customer ID'] = df['Customer ID'].fillna('Not Generated') #assign back instead of inplace=True on a column selection
df['Customer ID'].isna().sum() #0 nulls now


np.int64(0)

##### <b> IMPUTATION STRATEGY - Refunded date, Dispute Date, Dispute Evidence Due (UTC) </b>

<br>
Note that even though the Refunded Date column has several NaNs in rows where no actual refund was processed, we will not fill these NaNs with something more comprehensible like 'N/A', as that would interfere with DateTime operations & complicate analysis. As a solution, we will convert the NaNs to NaTs in a later step.

<br>
The same logic has been applied to the other 2 date columns.
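As a rough illustration of why the NaNs are left alone, the sketch below parses a toy date column with `pd.to_datetime`. Missing entries become `NaT`, so the column keeps a datetime dtype and the `.dt` accessor still works. The toy values and the use of `errors='coerce'` are assumptions for illustration and may differ from the actual parsing step used later.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Refunded date (UTC)': ['2025-03-24 09:52:19', np.nan, '2022-02-04 11:53:08'],
})

# NaN becomes NaT; errors='coerce' also turns unparseable strings into NaT
toy['Refunded date (UTC)'] = pd.to_datetime(toy['Refunded date (UTC)'], errors='coerce')

n_missing = int(toy['Refunded date (UTC)'].isna().sum())

# Datetime operations keep working and simply skip the missing entries
refund_years = toy['Refunded date (UTC)'].dt.year.dropna().astype(int).tolist()
print(n_missing, refund_years)
```

Had the gaps been filled with a string such as 'N/A', the column would stay an object column and the `.dt` accessor would not be available without parsing it again.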

In [32]:
#No refunds processed across all transaction status types
df[df['Refunded date (UTC)'].isna()]['Status'].unique()

array(['requires_payment_method', 'Paid', 'Failed', 'canceled',
       'requires_action'], dtype=object)

In [33]:
#checking to see if there are any rows where the refund was processed but date was not captured. Thankfully, no such rows.
df[(df['Refunded date (UTC)'].isna()) & (df['Amount Refunded']!=0)]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary


In [34]:
df[(df['Dispute Date (UTC)'].isna()) & (~df['Disputed Amount'].isin([0, np.nan]))]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary


In [35]:
df[(df['Dispute Evidence Due (UTC)'].isna()) & (~df['Disputed Amount'].isin([0, np.nan]))]

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary


##### <b> IMPUTATION STRATEGY - Card detail columns </b>

<br>
Note that Card details are also missing for several rows corresponding to successful transactions (shown below).

In [36]:
df[df['Card Brand'].isna()]['Status'].value_counts()

Status
canceled                   5447
Paid                        313
Failed                      126
requires_action              18
requires_payment_method      11
Refunded                      8
Name: count, dtype: int64

In [37]:
df[df['Card Funding'].isna()]['Status'].value_counts()

Status
canceled                   5447
Paid                        313
Failed                      126
requires_action              18
requires_payment_method      11
Refunded                      8
Name: count, dtype: int64

In [38]:
df[df['Card Address Country'].isna()]['Status'].value_counts()

Status
canceled                   5447
requires_action              18
Paid                         13
requires_payment_method      11
Failed                        4
Name: count, dtype: int64

As shown below, there are more than 1,000 rows where Card Issue Country is not equal to Card Address Country, so using the issue country to impute the address country would be misleading. Therefore, Card Issue Country is not used as a proxy, and missing values in Card Address Country that cannot be recovered otherwise are imputed with 'Not captured'.

In [39]:
df[(~df['Card Address Country'].isna()) & (~df['Card Issue Country'].isna()) & (df['Card Address Country']!=df['Card Issue Country'])].head()

Unnamed: 0,id,Created date (UTC),Amount,Amount Refunded,Currency,Captured,Converted Amount,Converted Amount Refunded,Converted Currency,Decline Reason,Description,Fee,Is Link,Link Funding,Mode,PaymentIntent ID,Payment Source Type,Refunded date (UTC),Statement Descriptor,Status,Seller Message,Taxes On Fee,Card ID,Card Name MD5,Card Address Country,Card Brand,Card Funding,Card Issue Country,Card Tokenization Method,Customer ID,Customer Description,Customer Email MD5,Disputed Amount,Dispute Date (UTC),Dispute Evidence Due (UTC),Dispute Reason,Dispute Status,Checkout Line Item Summary
3,ch_3R6526BnoxG9tBdf3nNx3crf,2025-03-24 7:02:52,9.99,0.0,usd,True,13.1,0.0,sgd,,Subscription creation,1.01,True,card,Live,pi_3R6526BnoxG9tBdf38qfllhs,card,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R6525BnoxG9tBdfkL1rJoDp,aeead298037c3c7c2dde5b5a69be48d3,TH,Visa,debit,GB,,cus_S05BvcU7dKNET8,,c18f08efa41c661b9e5905f4be60e816,,,,,,Web3 Jobs Premium (1) (weekly)
5,ch_3QyQN0BnoxG9tBdf1A0rqRVk,2025-03-24 4:12:56,9.99,0.0,usd,False,9.99,0.0,usd,previously_declined_do_not_retry,Subscription update,0.0,False,,Live,pi_3QyQN0BnoxG9tBdf1T0O9SIe,card,,,Failed,You previously attempted to charge this card. ...,0.0,pm_1Qe6kKBnoxG9tBdfIXhipWvz,91002a1dc81b6a3a0d262dc2e258ff0a,CA,Visa,debit,DE,,cus_RXAy7hQZGbokx5,,145fa277704a6b11ee4649e88839c7ae,,,,,,Subscription update
30,ch_3QxbszBnoxG9tBdf24Yei1M3,2025-03-21 22:18:35,39.99,0.0,usd,False,39.99,0.0,usd,previously_declined_do_not_retry,Subscription update,0.0,False,,Live,pi_3QxbszBnoxG9tBdf2Y3tzNcB,card,,,Failed,You previously attempted to charge this card. ...,0.0,pm_1QQx9UBnoxG9tBdfPaxNWi05,4f724940cced3abde587da57400f2844,KY,MasterCard,debit,GB,,cus_RJIjNdglRtOJvl,,a376f899e05f8f31a664c94c0adb0948,,,,,,Subscription update
51,ch_3R4jnXBnoxG9tBdf3sK0W2Jb,2025-03-20 14:17:25,695.0,0.0,usd,True,910.54,0.0,sgd,,Job Post: KOL Manager for Web3/Crypto Project (1),36.01,False,,Live,pi_3R4jnXBnoxG9tBdf3ogi0hyT,three_d_secure_2,,WEB3 JOBS,Paid,Payment complete.,0.0,pm_1R4ju2BnoxG9tBdfgoF3nW0z,3bd7368bb4742d9404033e914ee20fd4,IN,Visa,credit,HK,,cus_RyhINpr3uxkKy5,,615dd822d1a7c777b4e2cef0ff2ae45a,,,,,,Job Post: KOL Manager for Web3/Crypto Project (1)
63,ch_3R3V3iBnoxG9tBdf0veEMqyI,2025-03-20 2:13:53,9.99,0.0,usd,False,9.99,0.0,usd,previously_declined_do_not_retry,Subscription update,0.0,False,,Live,pi_3R3V3iBnoxG9tBdf0AHnmyaJ,card,,,Failed,You previously attempted to charge this card. ...,0.0,pm_1Qe6kKBnoxG9tBdfIXhipWvz,91002a1dc81b6a3a0d262dc2e258ff0a,CA,Visa,debit,DE,,cus_RXAy7hQZGbokx5,,145fa277704a6b11ee4649e88839c7ae,,,,,,Subscription update
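
The cell above displays only the first five mismatching rows. A count of such rows can be sketched as follows; the toy frame and its values are hypothetical, and only the two column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, for illustration only
toy = pd.DataFrame({
    'Card Address Country': ['TH', 'CA', np.nan, 'US'],
    'Card Issue Country': ['GB', 'DE', 'US', 'US'],
})

# Only compare rows where both countries are known
both_known = toy['Card Address Country'].notna() & toy['Card Issue Country'].notna()
mismatch = both_known & (toy['Card Address Country'] != toy['Card Issue Country'])

# Summing a boolean mask counts the True entries
print(mismatch.sum())
```

Applied to `df`, the same mask would give the number of rows behind the "more than 1,000" figure quoted above.
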


As noted above, we don't have any useful reference columns to guide the imputation, and a large share of the card data is missing (about 44% of rows for Card Brand and Card Funding, and about 41% for Card Address Country). Hence, just imputing 'Not captured' in these rows might be too simplistic and may rob the analysis of its full potential. Therefore, we will use the following imputation strategy based on each customer's transaction history. The assumption is that a customer would have used the same card for at least 2 years, so for any transaction date, we look for transactions from the same customer within a window of -1 year to +1 year and impute the card details of the transaction closest in time within this window, if any are present. If not, we finally impute the remaining rows with 'Not captured'.

<br>

- Step 1: Divide the dataset into 2 parts - with Customer ID & without Customer ID
- Step 2: Group by Customer ID & select the transactions for that customer in the 2-year window described above
- Step 3: Identify the transaction closest in time with non-null data for the concerned columns and use it to fill the missing row
- Step 4: If no transaction in the window has non-null data, or a customer has only 1 transaction, fill such rows with 'Not captured'
- Step 5: Fill the nulls in the subset of data with no Customer ID with 'Not captured'
- Step 6: Concatenate both subsets of data

In [40]:
card_columns = ['Card Brand', 'Card Funding', 'Card Address Country']

# Ensure Created date is datetime
df['Created date (UTC)'] = pd.to_datetime(df['Created date (UTC)'], errors='coerce')

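# Split the dataset. Missing Customer IDs were already replaced with
# 'Not Generated' above, so test for that placeholder as well as for NaN.
has_customer = df['Customer ID'].notna() & (df['Customer ID'] != 'Not Generated')
df_with_customer = df[has_customer].copy()
df_without_customer = df[~has_customer].copy()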


In [41]:
# Define a function to fill missing card details for a customer's group
def fill_customer_group(group):
    group = group.sort_values('Created date (UTC)').reset_index(drop=True)
    
    for idx, row in group.iterrows():
        if row[card_columns].isna().any():
            window_start = row['Created date (UTC)'] - pd.DateOffset(years=1)
            window_end = row['Created date (UTC)'] + pd.DateOffset(years=1)
            
            # Get transactions within 1 year before and after (excluding itself)
            neighbors = group[
                (group['Created date (UTC)'] >= window_start) &
                (group['Created date (UTC)'] <= window_end) &
                (group.index != idx)
            ].copy()
            
            if neighbors.empty:
                # No neighbors: fill only the missing fields, keep known values
                group.loc[idx, card_columns] = row[card_columns].fillna('Not captured')
                continue
            
            # Calculate time difference
            neighbors['time_diff'] = (neighbors['Created date (UTC)'] - row['Created date (UTC)']).abs()
            neighbors = neighbors.sort_values('time_diff')
            
            # Find the closest neighbor with available card info
            for _, neighbor in neighbors.iterrows():
                if not neighbor[card_columns].isna().all():
                    # Fill missing fields using combine_first()
                    current = row[card_columns].combine_first(neighbor[card_columns])
                    group.loc[idx, card_columns] = current
                    break
            else:
                # No neighbor had card info: fill only the missing fields
                group.loc[idx, card_columns] = row[card_columns].fillna('Not captured')
                
    return group

In [42]:
# Apply the function group-by-group
df_with_customer_filled = df_with_customer.groupby('Customer ID', group_keys=False).apply(fill_customer_group)

# For transactions without Customer ID, fill 'Not captured' directly
for col in card_columns:
    df_without_customer[col] = df_without_customer[col].fillna('Not captured')

# Combine everything back
df_final = pd.concat([df_with_customer_filled, df_without_customer]).sort_index()

# Final sweep to replace any remaining NaN values with 'Not captured'
for col in card_columns:
    df_final[col] = df_final[col].fillna('Not captured')


In [43]:
#No NaNs anymore
df_final[['Card Brand', 'Card Funding', 'Card Address Country']].isna().sum()

Card Brand              0
Card Funding            0
Card Address Country    0
dtype: int64
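
As an alternative to the row-wise loop above, the nearest-in-time lookup can be sketched with `pd.merge_asof`. This is a simplified illustration under stated assumptions, not the method used in this notebook: it handles a single column, uses a fixed 365-day tolerance instead of `pd.DateOffset(years=1)`, and runs on a toy frame with hypothetical customers.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical customers and values, for illustration only)
toy = pd.DataFrame({
    'Customer ID': ['cus_A', 'cus_A', 'cus_A', 'cus_B'],
    'Created date (UTC)': pd.to_datetime(
        ['2024-01-10', '2024-06-01', '2026-01-01', '2024-03-05']),
    'Card Brand': ['Visa', np.nan, np.nan, np.nan],
})

# merge_asof requires both sides to be sorted by the "on" column
known = toy.dropna(subset=['Card Brand']).sort_values('Created date (UTC)')
missing = toy[toy['Card Brand'].isna()].sort_values('Created date (UTC)')

# For each row with a missing brand, find the same customer's transaction
# that is closest in time, but no further than 365 days away
matched = pd.merge_asof(
    missing.drop(columns='Card Brand'),
    known[['Customer ID', 'Created date (UTC)', 'Card Brand']],
    on='Created date (UTC)',
    by='Customer ID',
    direction='nearest',
    tolerance=pd.Timedelta(days=365),
)

# Rows with no usable neighbour fall back to the placeholder
matched['Card Brand'] = matched['Card Brand'].fillna('Not captured')
print(matched)
```

Because the lookup table contains only rows with a known brand, a row can never match itself, which mirrors the `group.index != idx` condition in the loop.
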

##### <b> IMPUTATION STRATEGY - Dispute columns </b>

<br>
There is a very low incidence of disputes: just 9 cases out of 13,359 transactions. Hence, the most logical null imputation for the non-date columns is 'No Dispute' (or 0 for the disputed amount), as implemented below.

In [44]:
df_final['Disputed Amount'] = df_final['Disputed Amount'].fillna(0)
df_final['Dispute Status'] = df_final['Dispute Status'].fillna('No Dispute')
df_final['Dispute Reason'] = df_final['Dispute Reason'].fillna('No Dispute')

In [45]:
#no Nulls anymore
df_final[['Dispute Reason', 'Disputed Amount', 'Dispute Status']].isna().sum()

Dispute Reason     0
Disputed Amount    0
Dispute Status     0
dtype: int64

##### `STEP 3 : DROPPING UNNECESSARY COLUMNS`

<br>
We drop the following columns:

- Link Funding - can't be used to infer card payment types, because values are also missing for successful transactions
- Customer Description - no values at all
- Statement Descriptor - not much use for Bondex
- Converted Amount - not much use for business insights
- Converted Amount Refunded - not much use for business insights
- Converted Currency - not much use for business insights
- Is Link - not much use for business insights
- Mode - only 1 mode of payment throughout (Live)
- PaymentIntent ID - not much use for business insights
- Card ID - not much use for business insights
- Card Tokenization Method - over 90% of values missing, no useful insights
- Card Name MD5, Customer Email MD5 - no need for masked data
- Card Issue Country - not much use to Bondex
- Payment Source Type - also dropped in the cell below
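
Reasons such as "no values at all" or "only 1 mode" can be checked with a small column profile before dropping anything. The sketch below is illustrative only: `profile_columns` is a hypothetical helper and the toy frame holds made-up values.

```python
import numpy as np
import pandas as pd

def profile_columns(frame):
    """Return the share of missing values and the number of distinct values per column."""
    return pd.DataFrame({
        'missing_share': frame.isna().mean(),
        'distinct_values': frame.nunique(dropna=True),
    })

# Toy frame with hypothetical values, for illustration only
toy = pd.DataFrame({
    'Mode': ['Live', 'Live', 'Live', 'Live'],
    'Customer Description': [np.nan, np.nan, np.nan, np.nan],
    'Amount': [9.99, 39.99, 695.0, 9.99],
})

profile = profile_columns(toy)
print(profile)

# Candidates for dropping: entirely empty, or holding a single constant value
candidates = profile[(profile['missing_share'] == 1.0) | (profile['distinct_values'] <= 1)]
print(candidates.index.tolist())
```

Columns dropped for business reasons (for example the ID and hash columns) would not be flagged by such a profile and still need a manual decision.
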

In [46]:
df_final.drop(columns=['Converted Amount', 'Converted Amount Refunded','Converted Currency', 'Is Link',
       'Link Funding', 'Mode', 'PaymentIntent ID', 'Payment Source Type', 'Statement Descriptor', 
       'Card ID', 'Card Name MD5', 'Card Issue Country', 'Card Tokenization Method','Customer Description', 'Customer Email MD5'],
       inplace=True)

In [47]:
#final list of columns. Note that nulls remain in the date columns because they will be converted to NaT later.
df_final.isna().sum()

id                                0
Created date (UTC)                0
Amount                            0
Amount Refunded                   0
Currency                          0
Captured                          0
Decline Reason                    0
Description                       0
Fee                               0
Refunded date (UTC)           13192
Status                            0
Seller Message                    0
Taxes On Fee                      0
Card Address Country              0
Card Brand                        0
Card Funding                      0
Customer ID                       0
Disputed Amount                   0
Dispute Date (UTC)            13350
Dispute Evidence Due (UTC)    13350
Dispute Reason                    0
Dispute Status                    0
Checkout Line Item Summary        0
dtype: int64

##### `STEP 4 : PARSING CORRECT DATATYPES`

In [48]:
df_final['Refunded date (UTC)'] = pd.to_datetime(df_final['Refunded date (UTC)'])
df_final['Dispute Date (UTC)'] = pd.to_datetime(df_final['Dispute Date (UTC)'])
df_final['Dispute Evidence Due (UTC)'] = pd.to_datetime(df_final['Dispute Evidence Due (UTC)'])

In [49]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13359 entries, 0 to 5337
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   id                          13359 non-null  object        
 1   Created date (UTC)          13359 non-null  datetime64[ns]
 2   Amount                      13359 non-null  float64       
 3   Amount Refunded             13359 non-null  float64       
 4   Currency                    13359 non-null  object        
 5   Captured                    13359 non-null  bool          
 6   Decline Reason              13359 non-null  object        
 7   Description                 13359 non-null  object        
 8   Fee                         13359 non-null  float64       
 9   Refunded date (UTC)         167 non-null    datetime64[ns]
 10  Status                      13359 non-null  object        
 11  Seller Message              13359 non-null  object        
 

##### `STEP 5 : DATA CONSISTENCY`

Since we need to analyze the Revenue Streams, we need the product descriptions to have consistent or predictable values for easier aggregation. However, that is not the case in our dataset. Not only are the entries made in an irregular manner, but there are several entries, like 'Payment Invoice', which can't be traced back to a single category. 

<br>
Hence, we will create a new column that describes the Product Category of each transaction to the maximum extent possible. Note that the categories were devised by manually analysing more than 2000 unique product descriptions/summaries, so the categorisation is limited.

In [75]:
def categorize_purchase(row):
    """Assign a product category based on keywords in the description fields."""
    checkout_summary = str(row.get('Checkout Line Item Summary', '')).lower()
    description = str(row.get('Description', '')).lower()

    #Combining both fields into a single text block for easier checking
    combined_text = f"{checkout_summary} {description}"

    #RULE 1: Job Postings (any mention of a bundle counts as a bundled posting)
    if 'job post' in combined_text:
        if 'bundle' in combined_text:
            return 'Job Posting - Bundled'
        return 'Job Posting - Single'

    if 'bundle' in combined_text:
        return 'Job Posting - Bundled'

    #RULE 2: Subscriptions (weekly, monthly, or unspecified billing period)
    if 'subscription' in combined_text:
        if 'weekly' in combined_text:
            return 'Subscription - weekly'
        elif 'monthly' in combined_text:
            return 'Subscription - monthly'
        else:
            return 'Subscription'

    #RULE 3: Ad Revenue
    ad_keywords = ('ad revenue', 'advertisement', 'ad spend', 'sticky')
    if any(keyword in combined_text for keyword in ad_keywords):
        return 'Ad Revenue'

    #Fallback when no rule matches
    return 'Not Specified'


In [87]:
df_final['Category'] = df_final.apply(categorize_purchase, axis=1)

In [88]:
df_final['Category'].value_counts()

Category
Job Posting - Single      6173
Subscription              3750
Job Posting - Bundled     1201
Not Specified             1019
Subscription - weekly      916
Subscription - monthly     278
Ad Revenue                  22
Name: count, dtype: int64
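
To see which descriptions end up in the 'Not Specified' bucket, and so where a new rule might help, the unmatched rows can be tallied. A minimal sketch on a toy frame: 'Payment Invoice' comes from the text above, while 'Featured listing' is a hypothetical value used only for illustration.

```python
import pandas as pd

# Toy frame, for illustration only
toy = pd.DataFrame({
    'Description': ['Payment Invoice', 'Payment Invoice',
                    'Featured listing', 'Subscription update'],
    'Category': ['Not Specified', 'Not Specified',
                 'Not Specified', 'Subscription'],
})

# Most frequent descriptions among rows that no rule matched
unmatched = (
    toy.loc[toy['Category'] == 'Not Specified', 'Description']
    .str.lower()
    .value_counts()
)
print(unmatched.head(10))
```
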

##### `STEP 6 : EXPORTING THE CLEANED DATASET`

In [82]:
df_final.sort_values(by='Created date (UTC)', inplace=True)

In [89]:
df_final.reset_index(drop=True, inplace=True)

In [94]:
df_final.to_csv('Cleaned_Dataset.csv', index=False)
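
A CSV file does not store dtypes, so the datetime columns parsed in Step 4 come back as plain text unless they are parsed again on load. The sketch below shows one way to reload them, using an in-memory buffer and a toy frame with hypothetical rows in place of the exported file.

```python
import io
import pandas as pd

date_columns = ['Created date (UTC)', 'Refunded date (UTC)']

# Toy stand-in for the exported file (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'id': ['ch_1', 'ch_2'],
    'Created date (UTC)': pd.to_datetime(['2025-03-20 14:17:25', '2025-03-24 07:02:52']),
    'Refunded date (UTC)': pd.to_datetime([None, '2025-03-25 10:00:00']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype; empty cells come back as NaT
reloaded = pd.read_csv(buffer, parse_dates=date_columns)
print(reloaded.dtypes)
```

For the real export, the list would presumably also include 'Dispute Date (UTC)' and 'Dispute Evidence Due (UTC)'.
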