# Credit Card Transaction Data Cleaning

**Date:** 2025-08-06

In this notebook, we clean a large dataset of credit card transactions and prepare it for further analysis.
We'll extract useful time features, calculate customer age, categorize transaction amounts, and flag nighttime transactions.


In [1]:
import pandas as pd
import numpy as np

# Load cleaned dataset
df = pd.read_csv("../data/cleansed/FraudTest_clean.csv")

# Preview dataset
df.head()

| | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | ... | merch_lat | merch_long | is_fraud | hour | day | weekday | month | age | is_night | amount_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-06-21 12:14:25 | 2291163933867244 | fraud_Kirlin and Sons | personal_care | 2.86 | Jeff | Elliott | M | 351 Darlene Green | Columbia | ... | 33.986391 | -81.200714 | 0 | 12 | 21 | 6 | 6 | 52 | False | very_low |
| 1 | 2020-06-21 12:14:33 | 3573030041201292 | fraud_Sporer-Keebler | personal_care | 29.84 | Joanne | Williams | F | 3638 Marsh Union | Altonah | ... | 39.450498 | -109.960431 | 0 | 12 | 21 | 6 | 6 | 30 | False | low |
| 2 | 2020-06-21 12:14:53 | 3598215285024754 | fraud_Swaniawski, Nitzsche and Welch | health_fitness | 41.28 | Ashley | Lopez | F | 9333 Valentine Point | Bellmore | ... | 40.49581 | -74.196111 | 0 | 12 | 21 | 6 | 6 | 50 | False | low |
| 3 | 2020-06-21 12:15:15 | 3591919803438423 | fraud_Haley Group | misc_pos | 60.05 | Brian | Williams | M | 32941 Krystal Mill Apt. 552 | Titusville | ... | 28.812398 | -80.883061 | 0 | 12 | 21 | 6 | 6 | 33 | False | low |
| 4 | 2020-06-21 12:15:17 | 3526826139003047 | fraud_Johnston-Casper | travel | 3.19 | Nathan | Massey | M | 5783 Evan Roads Apt. 465 | Falmouth | ... | 44.959148 | -85.884734 | 0 | 12 | 21 | 6 | 6 | 65 | False | very_low |


In [2]:
# Check structure
df.info()

# Summary statistics
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 28 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   trans_date_trans_time  555719 non-null  object 
 1   cc_num                 555719 non-null  int64  
 2   merchant               555719 non-null  object 
 3   category               555719 non-null  object 
 4   amt                    555719 non-null  float64
 5   first                  555719 non-null  object 
 6   last                   555719 non-null  object 
 7   gender                 555719 non-null  object 
 8   street                 555719 non-null  object 
 9   city                   555719 non-null  object 
 10  state                  555719 non-null  object 
 11  zip                    555719 non-null  int64  
 12  lat                    555719 non-null  float64
 13  long                   555719 non-null  float64
 14  city_pop               555719 non-nu

| | cc_num | amt | zip | lat | long | city_pop | merch_lat | merch_long | is_fraud | hour | day | weekday | month | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 | 555719.0 |
| mean | 4.178387e+17 | 69.39281 | 48842.628015 | 38.543253 | -90.231325 | 88221.89 | 38.542798 | -90.23138 | 0.00386 | 12.809062 | 16.463904 | 2.726779 | 9.508536 | 46.636237 |
| std | 1.309837e+18 | 156.745941 | 26855.283328 | 5.061336 | 13.72178 | 300390.9 | 5.095829 | 13.733071 | 0.062008 | 6.810924 | 8.955311 | 2.178681 | 1.978205 | 17.418528 |
| min | 60416210000.0 | 1.0 | 1257.0 | 20.0271 | -165.6723 | 23.0 | 19.027422 | -166.671575 | 0.0 | 0.0 | 1.0 | 0.0 | 6.0 | 15.0 |
| 25% | 180042900000000.0 | 9.63 | 26292.0 | 34.6689 | -96.798 | 741.0 | 34.755302 | -96.905129 | 0.0 | 7.0 | 9.0 | 1.0 | 8.0 | 33.0 |
| 50% | 3521417000000000.0 | 47.29 | 48174.0 | 39.3716 | -87.4769 | 2408.0 | 39.376593 | -87.445204 | 0.0 | 14.0 | 17.0 | 2.0 | 10.0 | 45.0 |
| 75% | 4635331000000000.0 | 83.01 | 72011.0 | 41.8948 | -80.1752 | 19685.0 | 41.954163 | -80.264637 | 0.0 | 19.0 | 24.0 | 5.0 | 12.0 | 58.0 |
| max | 4.992346e+18 | 22768.11 | 99921.0 | 65.6899 | -67.9503 | 2906700.0 | 66.679297 | -66.952026 | 1.0 | 23.0 | 31.0 | 6.0 | 12.0 | 96.0 |


### Feature Engineering

The dataset includes several derived features:
- `hour`, `day`, `weekday`, `month`, extracted from `trans_date_trans_time`
- `age`, calculated from the cardholder's date of birth
- `is_night`, which flags transactions between 10 PM and 6 AM
- `amount_bin`, which groups transaction amounts into four ranges
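
The loaded file already contains these columns, so the code that produced them is not shown here. As a rough illustration, the features listed above could be derived along the following lines. This is a minimal sketch under stated assumptions, not the method actually used: the function name `add_derived_features` is hypothetical, the date-of-birth column is assumed to be called `dob`, age is assumed to be measured in whole years at the transaction date, and the bin edges and the labels `high` and `very_high` are invented for illustration (only `very_low` and `low` appear in the preview above).

```python
import pandas as pd


def add_derived_features(df):
    """Sketch: derive time, age, night-flag and amount-bin features."""
    out = df.copy()
    ts = pd.to_datetime(out["trans_date_trans_time"])

    # Calendar features from the transaction timestamp
    out["hour"] = ts.dt.hour
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.weekday  # Monday=0 ... Sunday=6
    out["month"] = ts.dt.month

    # Age in whole years at the transaction date.
    # ASSUMPTION: the date of birth lives in a column named `dob`.
    dob = pd.to_datetime(out["dob"])
    had_birthday = (ts.dt.month > dob.dt.month) | (
        (ts.dt.month == dob.dt.month) & (ts.dt.day >= dob.dt.day)
    )
    out["age"] = ts.dt.year - dob.dt.year - (~had_birthday).astype(int)

    # Night flag: 10 PM (22:00) up to, but not including, 6 AM
    out["is_night"] = (out["hour"] >= 22) | (out["hour"] < 6)

    # ASSUMPTION: illustrative bin edges and labels, not the notebook's own
    out["amount_bin"] = pd.cut(
        out["amt"],
        bins=[0, 10, 100, 1000, float("inf")],
        labels=["very_low", "low", "high", "very_high"],
    )
    return out


# Tiny made-up sample to exercise the function
sample = pd.DataFrame(
    {
        "trans_date_trans_time": [
            "2020-06-21 12:14:25",
            "2020-06-21 23:05:00",
            "2020-06-22 03:00:00",
        ],
        "dob": ["1968-03-19", "1990-06-21", "2000-12-31"],
        "amt": [2.86, 29.84, 1500.0],
    }
)
result = add_derived_features(sample)
```

The weekday convention here (Monday is 0, Sunday is 6) matches the preview, where Sunday 2020-06-21 has `weekday` 6. Whether the real pipeline treats the 6 AM boundary as inclusive or exclusive is not stated, so the comparison in `is_night` may need adjusting.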


In [3]:
df[['hour', 'weekday', 'age', 'is_night', 'amount_bin']].describe(include='all')


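| | hour | weekday | age | is_night | amount_bin |
|---|---|---|---|---|---|
| count | 555719.0 | 555719.0 | 555719.0 | 555719 | 555719 |
| unique | | | | 2 | 4 |
| top | | | | False | low |
| freq | | | | 389588 | 311484 |
| mean | 12.809062 | 2.726779 | 46.636237 | | |
| std | 6.810924 | 2.178681 | 17.418528 | | |
| min | 0.0 | 0.0 | 15.0 | | |
| 25% | 7.0 | 1.0 | 33.0 | | |
| 50% | 14.0 | 2.0 | 45.0 | | |
| 75% | 19.0 | 5.0 | 58.0 | | |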
