# Credit Card Fraud Detection Using Machine Learning

## Objective:
- Detect fraudulent credit card transactions using machine learning models while handling class imbalance effectively.

## Dataset:
This project uses the Kaggle Credit Card Transactions dataset, in which each transaction is labeled as legitimate (0) or fraudulent (1). The dataset is highly imbalanced: fraud cases make up only a small fraction of all transactions.

## Approach:

1. Data Preprocessing:

- Feature-target separation and optional standardization.

- Handling missing values if any.

2. Handling Imbalance:

- Oversampling the minority class using SMOTE to generate synthetic fraud examples.

- This gives the models enough fraud examples to learn patterns from both classes.

3. Modeling:

- Random Forest (RF)

- XGBoost (XGB)

- LightGBM (LGBM)

- Models are trained on the oversampled dataset and evaluated using classification metrics such as precision, recall, F1-score, and confusion matrix.

4. Experiment Tracking:

- MLflow is used to log parameters, metrics, and trained models for reproducibility and monitoring.

5. Inference:

- Trained models are saved and can be loaded later to predict fraud on new/unseen transactions.
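
The oversampling step in item 2 can be illustrated with a minimal sketch of the interpolation idea behind SMOTE. This is not the implementation from any library, and the notebook's actual SMOTE call is not shown in this section; the function name `smote_like_samples`, the default `k`, and the toy points are assumptions for illustration. In practice, oversampling is normally applied only to the training split, so that evaluation still reflects the real class ratio.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy 2-D "fraud" points (made-up values)
fraud_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_samples(fraud_points, n_new=3)
```

Because every synthetic point is a convex combination of two real minority points, it always stays inside the region spanned by the existing fraud examples.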

## Goal:
Develop robust, reproducible models for fraud detection, ensuring high recall on fraud cases while maintaining acceptable performance on legitimate transactions.
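
To make the recall and precision trade-off in this goal concrete, the sketch below computes the metrics from confusion-matrix counts. The counts are made-up numbers for illustration, not results from this project, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 for the positive (fraud) class, plus accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts: 100 fraud cases among 10,000 transactions
metrics = classification_metrics(tp=80, fp=40, fn=20, tn=9860)
```

With these counts accuracy is above 99% even though one in five fraud cases is missed, which is why recall and precision on the fraud class are the more informative metrics for an imbalanced dataset.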

# Model

## Importing Libraries 

In [5]:
!pip install polars

In [6]:
import polars as pl

## Basic Observations

In [19]:
# Load CSV
df = pl.read_csv(r"C:\Users\DELL\Desktop\AMNIL Intern\credit_card_transactions.csv")

# View a random sample of 5 rows
df.sample(5)

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode
i64,str,i64,str,str,f64,str,str,str,str,str,str,i64,f64,f64,i64,str,str,str,i64,f64,f64,i64,i64
910955,"""2019-12-29 06:51:05""",3554818239968984,"""fraud_Wisozk and Sons""","""misc_pos""",30.63,"""Cory""","""Thomas""","""M""","""6458 Roberson Alley""","""Williamsburg""","""MO""",63388,38.8874,-91.7689,710,"""Glass blower/designer""","""1970-09-27""","""ef018a61d8962535b4c9b7f25ad391…",1356763865,39.100058,-91.38375,0,63359
332081,"""2019-06-07 09:16:43""",340951438290556,"""fraud_Block-Parisian""","""misc_net""",14.62,"""Maria""","""Garcia""","""F""","""865 Thomas Village""","""Orangeburg""","""NY""",10962,41.0442,-73.9609,5950,"""Records manager""","""1971-07-02""","""c8a19ee871094e8056f022cfa83278…",1339060603,40.956776,-74.28356,0,7444
914541,"""2019-12-29 19:58:00""",2296006538441789,"""fraud_Turcotte, McKenzie and K…","""entertainment""",34.32,"""Judy""","""Hogan""","""F""","""4970 Michelle Burgs""","""Brooklyn""","""NY""",11217,40.6816,-73.9798,2504700,"""Medical sales representative""","""1999-09-01""","""fcb02b4fc038952250f5a489efa807…",1356811080,41.673355,-74.453447,0,12763
1193436,"""2020-05-13 08:36:41""",4997733566924489,"""fraud_Block-Parisian""","""misc_net""",93.23,"""Stephanie""","""Taylor""","""F""","""598 Martin Pine Suite 365""","""Saint Paul""","""MN""",55128,44.9913,-92.9487,753116,"""Fisheries officer""","""1971-08-06""","""71a2e8d48ce083983504cff79cc8ea…",1368434201,44.186338,-92.256772,0,55932
33017,"""2019-01-20 17:18:46""",346273234529002,"""fraud_Dibbert and Sons""","""entertainment""",72.57,"""Donna""","""Moreno""","""F""","""32301 Albert River Suite 364""","""Ronceverte""","""WV""",24970,37.7418,-80.4626,4575,"""Statistician""","""1991-10-22""","""3dd7eb628e93a3f7afce44a65146a7…",1327079926,38.430593,-79.603407,0,24465


In [22]:
df.columns

['Unnamed: 0',
 'trans_date_trans_time',
 'cc_num',
 'merchant',
 'category',
 'amt',
 'first',
 'last',
 'gender',
 'street',
 'city',
 'state',
 'zip',
 'lat',
 'long',
 'city_pop',
 'job',
 'dob',
 'trans_num',
 'unix_time',
 'merch_lat',
 'merch_long',
 'is_fraud',
 'merch_zipcode']

- Unnamed: 0 – This is just an index column created when exporting the CSV; it doesn’t carry any information about the transaction.

- trans_date_trans_time – The exact date and time when the transaction occurred. This can be used to derive features like the hour of the day, day of the week, or month.

- cc_num – The credit card number used for the transaction. This is sensitive information and not used directly for modeling.

- merchant – The name of the merchant where the transaction took place. Different merchants may have different fraud risks.

- category – The type of transaction, such as entertainment, miscellaneous, or online purchases. This can help identify patterns in spending behavior.

- amt – The amount of money involved in the transaction. Larger amounts may have higher fraud risk.

- first – First name of the cardholder. Not relevant for modeling.

- last – Last name of the cardholder. Not relevant for modeling.

- gender – Gender of the cardholder (male or female). Can be used as a categorical feature.

- street – Street address of the cardholder. Usually not used directly for modeling.

- city – City of the cardholder. May help identify geographic patterns.

- state – State of the cardholder. Can be encoded to capture regional differences.

- zip – ZIP code of the cardholder. Best treated as a categorical feature, since ZIP codes have no meaningful numeric order.

- lat – Latitude of the cardholder’s location. Useful for detecting unusual transaction locations.

- long – Longitude of the cardholder’s location. Useful for detecting unusual transaction locations.

- city_pop – Population of the cardholder’s city. Could be correlated with fraud risk depending on urban vs rural behavior.

- job – Occupation of the cardholder. Certain job categories may be associated with different spending patterns.

- dob – Date of birth of the cardholder. Can be used to calculate age for modeling.

- trans_num – Unique identifier for each transaction. Not used as a feature.

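- unix_time – The transaction time as a UNIX timestamp. In the sample rows these values decode to the same day and time as trans_date_trans_time but seven years earlier (for example, 1356763865 is 2012-12-29 06:51:05 UTC, while the row shows 2019-12-29 06:51:05), so the two columns are not directly interchangeable.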

- merch_lat – Latitude of the merchant location. Helps detect transactions occurring far from the cardholder’s usual area.

- merch_long – Longitude of the merchant location. Helps detect unusual transaction locations.

- is_fraud – Target variable: 1 indicates a fraudulent transaction, 0 indicates legitimate.

- merch_zipcode – ZIP code of the merchant. Can help capture geographic patterns in fraud.
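
Two of the derived features suggested in the list above, the cardholder-to-merchant distance and the cardholder's age, can be sketched in plain Python. The helper names `haversine_km` and `age_at` are hypothetical, and this section of the notebook does not show whether these features are actually built. The sample values come from the first row of the `df.sample(5)` output shown earlier.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def age_at(dob, on):
    """Age in whole years on the date `on`."""
    # Subtract one year if the birthday has not yet occurred in that year
    return on.year - dob.year - ((on.month, on.day) < (dob.month, dob.day))

# Cardholder (lat, long) vs merchant (merch_lat, merch_long)
distance = haversine_km(38.8874, -91.7689, 39.100058, -91.38375)

# dob vs the date part of trans_date_trans_time
age = age_at(date(1970, 9, 27), date(2019, 12, 29))
```

A large distance between the cardholder's home location and the merchant is one simple signal of an unusual transaction location.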

In [24]:
df.shape

(1296675, 24)

In [28]:
df.null_count()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,195973


Here, we can see that no column has null values except "merch_zipcode", which has 195,973 nulls (about 15% of the 1,296,675 rows), so this column needs to be handled.

## Feature Engineering

The "trans_date_trans_time" column is converted from a string to the Datetime type.

In [42]:
df = df.with_columns([
    pl.col("trans_date_trans_time")
      .str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S")
      .alias("trans_date_trans_time")
])

Next, a new feature named "trans_day_cycle" is created to indicate the part of the day in which the transaction occurred. Its possible values are "morning", "afternoon", "evening", and "night". This makes it possible to see at which time of day fraud cases are most frequent.

In [48]:
# Extract hour
df = df.with_columns([
    df["trans_date_trans_time"].dt.hour().alias("hour")
])

# Map to day cycle
def day_cycle(hour: int) -> str:
    if 5 <= hour < 12:
        return "morning"
    elif 12 <= hour < 17:
        return "afternoon"
    elif 17 <= hour < 21:
        return "evening"
    else:
        return "night"

df = df.with_columns([
    df["hour"].map_elements(day_cycle, return_dtype=pl.Utf8).alias("trans_day_cycle")
])


A new feature named "trans_season" is created from the transaction month to compare how often fraud occurs in each season.

In [61]:
df = df.with_columns([
    df["trans_date_trans_time"].dt.month().alias("month")
])

# Define mapping function
def get_season(month: int) -> str:
    if month in [12, 1, 2]:
        return "winter"
    elif month in [3, 4, 5]:
        return "spring"
    elif month in [6, 7, 8]:
        return "summer"
    else:
        return "autumn"

# Apply mapping
df = df.with_columns([
    df["month"].map_elements(get_season, return_dtype=pl.Utf8).alias("trans_season")
])

### Binning amount
The raw transaction amount (amt) is converted into a categorical feature named trans_amt_range with three bins: small (below 100), medium (100 to just under 1,000), and large (1,000 and above). Binning the continuous amounts makes it easier to check whether fraudulent transactions are more common among larger amounts than smaller ones.

In [67]:
def amt_range(amt: float) -> str:
    if amt < 100:
        return "small"
    elif amt < 1000:
        return "medium"
    else:
        return "large"

df = df.with_columns([
    df["amt"].map_elements(amt_range, return_dtype=pl.Utf8).alias("trans_amt_range")
])

In [69]:
df.sample(5)

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,state,zip,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,merch_zipcode,hour,trans_day_cycle,month,trans_season,trans_amt_range
i64,datetime[μs],i64,str,str,f64,str,str,str,str,str,str,i64,f64,f64,i64,str,str,str,i64,f64,f64,i64,i64,i8,str,i8,str,str
539967,2019-08-19 12:28:18,30427035050508,"""fraud_Gaylord-Powlowski""","""home""",5.74,"""John""","""Chandler""","""M""","""88325 Brandon Greens Apt. 477""","""Detroit""","""MI""",48202,42.377,-83.0796,673342,"""Broadcast presenter""","""1969-11-20""","""bc41cc0e58637cde247fa5fcfbf71a…",1345379298,42.088198,-82.930233,0,,12,"""afternoon""",8,"""summer""","""small"""
1008604,2020-02-18 21:42:57,38295635583927,"""fraud_Windler, Goodwin and Kov…","""home""",21.67,"""Candice""","""Brown""","""F""","""9412 Harris Mews""","""O Brien""","""TX""",79539,33.3749,-99.8473,178,"""Warden/ranger""","""1983-06-14""","""b73dd49435d9f960ddb290353d1920…",1361223777,33.664349,-99.66593,0,79505.0,21,"""night""",2,"""winter""","""small"""
549746,2019-08-23 15:57:07,180031190491743,"""fraud_Schmidt-Larkin""","""home""",40.01,"""Becky""","""Mckinney""","""F""","""250 Benjamin Hill Apt. 026""","""Mobile""","""AL""",36617,30.7145,-88.0918,270712,"""Surveyor, land/geomatics""","""1972-01-05""","""53f8a07afc55e13dae74da89567224…",1345737427,31.294862,-88.672809,0,,15,"""afternoon""",8,"""summer""","""small"""
158414,2019-03-25 08:29:15,3582754887089201,"""fraud_Tillman, Fritsch and Sch…","""misc_net""",9.61,"""Terrance""","""Mckinney""","""M""","""42965 Christopher Fords Suite …","""Norman""","""AR""",71960,34.4596,-93.6743,1383,"""Magazine features editor""","""1966-08-08""","""40be0cdc0c3e292228dd1a08a6c628…",1332664155,34.890761,-93.211122,0,72857.0,8,"""morning""",3,"""spring""","""small"""
830335,2019-12-10 20:48:51,36153880429415,"""fraud_Wilkinson PLC""","""kids_pets""",120.34,"""Erik""","""Stevens""","""M""","""84033 Pitts Overpass""","""Lakeland""","""FL""",33809,28.1762,-81.9591,237282,"""Plant breeder/geneticist""","""1949-10-13""","""669052a7b573e204340b4d17c73c79…",1355172531,27.56857,-82.224107,0,,20,"""evening""",12,"""winter""","""medium"""
