# 📊 ***Credit Card Fraud Detection***


## 🎯 Objective

The goal of this project is to:

✔️ Build a fraud detection model using machine learning  
✔️ Analyze transaction data to distinguish **fraudulent** from **legitimate** transactions  
✔️ Handle **imbalanced data** to improve fraud detection recall  
✔️ Use a **Random Forest** model for classification  

## 📂 **Dataset Description**

### **📌 Train Data Set for Credit Card Transactions**

| **Column Name** | **Description** |  
|------------|-------------|  
| `Unnamed: 0` | Row index saved without a header; unique identifier for each row |  
| `trans_date_trans_time` | Transaction DateTime |  
| `cc_num` | Credit Card Number of Customer |  
| `merchant` | Merchant Name |  
| `category` | Category of Merchant |  
| `amt` | Amount of Transaction |  
| `first` | First Name of Credit Card Holder |  
| `last` | Last Name of Credit Card Holder |  
| `gender` | Gender of Credit Card Holder |  
| `street` | Street Address of Credit Card Holder |  
| `city` | City of Credit Card Holder |  
| `state` | State of Credit Card Holder |  
| `zip` | Zip Code of Credit Card Holder |  
| `lat` | Latitude Location of Credit Card Holder |  
| `long` | Longitude Location of Credit Card Holder |  
| `city_pop` | Population of the Cardholder's City |  
| `job` | Job Title of Credit Card Holder |  
| `dob` | Date of Birth of Credit Card Holder |  
| `trans_num` | Unique Transaction Identifier |  
| `unix_time` | UNIX Timestamp of Transaction |  
| `merch_lat` | Latitude Location of Merchant |  
| `merch_long` | Longitude Location of Merchant |  
| `is_fraud` | **Target Variable (1 = Fraud, 0 = Legitimate)** |  


### **📌 Dataset Imbalance**  
⚠️ The dataset is **highly imbalanced**: fraud cases (`1`) are far rarer than legitimate transactions (`0`).  

- **Legitimate Transactions (`0`)**: `1,289,169`  
- **Fraudulent Transactions (`1`)**: `7,506`  

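Because fraud makes up only about 0.58% of the training rows (7,506 of 1,296,675), accuracy alone is misleading: a model that labels every transaction as legitimate would still exceed 99% accuracy. Precision and recall on the fraud class are the more informative metrics.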
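
As a quick check on the counts listed above, the fraud share and the accuracy of a trivial always-legitimate model can be computed directly. This is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
# Class counts as reported in the list above.
legit_count = 1_289_169
fraud_count = 7_506

total = legit_count + fraud_count
fraud_share = fraud_count / total

# Accuracy of a trivial model that labels every transaction as legitimate.
always_legit_accuracy = legit_count / total

print(f"Total transactions: {total:,}")
print(f"Fraud share: {fraud_share:.2%}")
print(f"Always-legitimate accuracy: {always_legit_accuracy:.2%}")
```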

### **📌 Test Set for Credit Card Transactions**  
The test dataset follows the same structure as the training set and includes all features required for fraud detection.

| **Column Name** | **Description** |  
|------------|-------------|  
| `Unnamed: 0` | Row index saved without a header; unique identifier for each row |  
| `trans_date_trans_time` | Transaction DateTime |  
| `cc_num` | Credit Card Number of Customer |  
| `merchant` | Merchant Name |  
| `category` | Category of Merchant |  
| `amt` | Amount of Transaction |  
| `first` | First Name of Credit Card Holder |  
| `last` | Last Name of Credit Card Holder |  
| `gender` | Gender of Credit Card Holder |  
| `street` | Street Address of Credit Card Holder |  
| `city` | City of Credit Card Holder |  
| `state` | State of Credit Card Holder |  
| `zip` | Zip Code of Credit Card Holder |  
| `lat` | Latitude Location of Credit Card Holder |  
| `long` | Longitude Location of Credit Card Holder |  
| `city_pop` | Population of the Cardholder's City |  
| `job` | Job Title of Credit Card Holder |  
| `dob` | Date of Birth of Credit Card Holder |  
| `trans_num` | Unique Transaction Identifier |  
| `unix_time` | UNIX Timestamp of Transaction |  
| `merch_lat` | Latitude Location of Merchant |  
| `merch_long` | Longitude Location of Merchant |  
| `is_fraud` | **Target Variable (1 = Fraud, 0 = Legitimate)** |  

## 🔧 **Installing & Importing Required Libraries**

We install and import necessary Python libraries for data processing, modeling, and evaluation.

In [87]:
# Install the required libraries

!pip install pandas numpy scikit-learn joblib

In [89]:
# Import the libraries used in this project

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib as jb

## 🔄 **Data Preprocessing Steps**  
Before training our model, we need to **clean and preprocess** the dataset.  
This involves:  
✔ Dropping unnecessary columns  
✔ Converting timestamps into numerical features  
✔ Encoding categorical variables  

### 📥 Loading the Dataset  
We load the **training** and **testing** datasets to explore their structure.

In [94]:
# Load datasets
training_data = pd.read_csv('/content/fraudTrain.csv')
testing_data = pd.read_csv('/content/fraudTest.csv')

# Display first 5 rows
print(training_data.head())
print(testing_data.head())

   Unnamed: 0 trans_date_trans_time            cc_num  \
0           0   2019-01-01 00:00:18  2703186189652095   
1           1   2019-01-01 00:00:44      630423337322   
2           2   2019-01-01 00:00:51    38859492057661   
3           3   2019-01-01 00:01:16  3534093764340240   
4           4   2019-01-01 00:03:06   375534208663984   

                             merchant       category     amt      first  \
0          fraud_Rippin, Kub and Mann       misc_net    4.97   Jennifer   
1     fraud_Heller, Gutmann and Zieme    grocery_pos  107.23  Stephanie   
2                fraud_Lind-Buckridge  entertainment  220.11     Edward   
3  fraud_Kutch, Hermiston and Farrell  gas_transport   45.00     Jeremy   
4                 fraud_Keeling-Crist       misc_pos   41.96      Tyler   

      last gender                        street  ...      lat      long  \
0    Banks      F                561 Perry Cove  ...  36.0788  -81.1781   
1     Gill      F  43039 Riley Greens Suite 393  ...  48

### 🔍 Checking for Missing Values  
Before proceeding, let's check if any columns contain **null values**.

In [95]:
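# Check missing values: info() prints non-null counts itself and returns None,
# so wrapping it in print() would only add a stray "None" line
training_data.info()
testing_data.info()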

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

✅ **Observation:** No missing values were found in the dataset.
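
The `info()` output above reports non-null counts per column. A more direct check, sketched here on a made-up three-row frame rather than the real data, is to count nulls per column with `isna().sum()`. The frame `toy` and its deliberately missing value are assumptions for illustration only.

```python
import pandas as pd

# Made-up three-row frame with one missing amount (illustration only).
toy = pd.DataFrame(
    {
        "amt": [4.97, None, 220.11],
        "category": ["misc_net", "grocery_pos", "entertainment"],
    }
)

# Number of missing values in each column.
null_counts = toy.isna().sum()
print(null_counts)

# True only when no column contains a missing value.
has_no_nulls = not toy.isna().any().any()
print(has_no_nulls)
```

Applied to `training_data` and `testing_data`, the same two calls would give a per-column null count and a single yes/no answer.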

### 🗑️ Dropping Unnecessary Columns  
Identifiers and personal details (row index, names, street address, ZIP code, transaction ID, date of birth) are treated here as non-predictive and removed.  
We drop:  
- `Unnamed: 0`, `first`, `last`, `street`, `zip`, `trans_num`, `dob`


In [57]:
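# Columns treated as non-predictive: row index, names, address details, transaction ID, date of birth
drop_columns = ["Unnamed: 0", "first", "last", "street", "zip", "trans_num", "dob"]
training_data.drop(columns=drop_columns, inplace=True)
testing_data.drop(columns=drop_columns, inplace=True)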
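
In a notebook, re-running an in-place drop raises a `KeyError` because the columns are already gone. One possible workaround, sketched below on a hypothetical two-column frame, is to pass `errors="ignore"` and reassign the result instead of mutating in place. The frame `toy` is an assumption; only the `drop` parameters are real pandas API.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the real data.
toy = pd.DataFrame({"first": ["Jennifer"], "amt": [4.97]})
drop_columns = ["first", "zip"]  # "zip" is absent from this toy frame

# errors="ignore" skips labels that are not present instead of raising KeyError.
once = toy.drop(columns=drop_columns, errors="ignore")
twice = once.drop(columns=drop_columns, errors="ignore")

print(list(twice.columns))
```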

### ⏳ Converting Date/Time into Numerical Features  
The `trans_date_trans_time` column is converted into separate numerical features:  
✔ **Hour**  
✔ **Day**  
✔ **Month**  
✔ **Year**  

In [58]:
def process_datetime(df):
    df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
    df["hour"] = df["trans_date_trans_time"].dt.hour
    df["day"] = df["trans_date_trans_time"].dt.day
    df["month"] = df["trans_date_trans_time"].dt.month
    df["year"] = df["trans_date_trans_time"].dt.year
    return df.drop(columns=["trans_date_trans_time"])

training_data = process_datetime(training_data)
testing_data = process_datetime(testing_data)

### 🔢 Encoding Categorical Variables  
Machine learning models require numerical input.  
We use **Label Encoding** for categorical columns:  
✔ `merchant`  
✔ `category`  
✔ `gender`  
✔ `state`  
✔ `job`  
✔ `city`  

In [59]:
categorical_cols = ["merchant", "category", "gender", "state", "job", "city"]
label_encoders = {}
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    # Fit on the union of train and test values (as strings) so that a label
    # appearing only in the test set does not raise an error in transform()
    all_values = pd.concat([training_data[col], testing_data[col]]).astype(str).unique()
    label_encoders[col].fit(all_values)
    training_data[col] = label_encoders[col].transform(training_data[col].astype(str))
    testing_data[col] = label_encoders[col].transform(testing_data[col].astype(str))
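
To make the encoding step concrete, the sketch below runs `LabelEncoder` on a few category values taken from the sample rows shown earlier and then maps the codes back to strings. It is an illustration only; the list `categories` is a hand-picked sample, not the full column.

```python
from sklearn.preprocessing import LabelEncoder

# A few category values taken from the sample rows shown earlier.
categories = ["misc_net", "grocery_pos", "entertainment", "misc_net"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

# classes_ holds the sorted unique labels; a label's code is its position there.
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps integer codes back to the original strings.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The integer codes follow alphabetical order and carry no real ranking. Tree-based models can usually work with them, but a linear model would read them as ordered quantities.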

### ✅ Data Preprocessing Completed!  
Now that our data is **cleaned, structured, and encoded**, we can proceed to **model training**! 🚀

## 🌲 **Training a Random Forest Classifier**
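Random Forest is chosen as a strong baseline for tabular data: it needs no feature scaling and works directly with label-encoded categorical columns. It does not correct for class imbalance by default, so fraud recall should be judged separately from overall accuracy.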

In [60]:
X_train = training_data.drop("is_fraud", axis=1)
y_train = training_data["is_fraud"]
X_test = testing_data.drop("is_fraud", axis=1)
y_test = testing_data["is_fraud"]

In [61]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [81]:
y_pred = model.predict(X_test)

In [64]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Random Forest Accuracy: 0.9982
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.91      0.60      0.72      2145

    accuracy                           1.00    555719
   macro avg       0.95      0.80      0.86    555719
weighted avg       1.00      1.00      1.00    555719



## 📈 **Model Performance**
### **Random Forest Results**
- **Accuracy**: `99.82%`
- **Precision (Fraud Cases)**: `0.91`
- **Recall (Fraud Cases)**: `0.60`
- **F1-Score (Fraud Cases)**: `0.72`
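
A fraud recall of `0.60` leaves room for improvement. Two common levers are class weighting and a lower decision threshold. The sketch below is not what this notebook ran: it uses synthetic data from `make_classification` as a stand-in for the fraud dataset, and the `threshold` of 0.3 is an arbitrary assumption that would need tuning on a validation set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 5% positives); not the fraud dataset.
X, y = make_classification(
    n_samples=4000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" weights each class inversely to its frequency.
weighted = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)
weighted.fit(X_tr, y_tr)

# Lowering the decision threshold below 0.5 flags more rows as positive.
threshold = 0.3  # hypothetical value; tune on a validation set
positive_probability = weighted.predict_proba(X_te)[:, 1]
flagged = (positive_probability >= threshold).astype(int)

default_recall = recall_score(y_te, weighted.predict(X_te))
lowered_recall = recall_score(y_te, flagged)
print(f"recall at 0.5: {default_recall:.2f}, recall at {threshold}: {lowered_recall:.2f}")
```

A lower threshold can only keep or raise recall, typically at the cost of precision, so the right setting depends on how expensive a missed fraud is compared with a false alarm.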

## 💾 **Saving the Model for Future Use**
The trained model and encoders are saved using `joblib`.

In [86]:
jb.dump(model, 'Credit Card Fraud Detection.pkl')
jb.dump(label_encoders, 'label_encoders.pkl')

['label_encoders.pkl']
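
Loading works the same way in reverse with `joblib.load`. The round trip below is a minimal sketch on a small synthetic model written to a temporary directory. The file name `toy_model.pkl` and the synthetic data are assumptions, not artifacts of this project.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained fraud classifier.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

When scoring new transactions, the saved label encoders need to be loaded and applied as well, so that categorical columns receive the same integer codes the model was trained on.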

## 🔍 Model Summary
- The **Random Forest** classifier achieved **99.82% accuracy**, but with fraud at under 0.4% of the test set this figure mostly reflects the majority class.
- **Precision for fraud cases:** `0.91` (91% of transactions flagged as fraud were actually fraudulent).
- **Recall for fraud cases:** `0.60` (60% of actual fraud cases were detected, so about 40% were missed).
- **F1-Score for fraud cases:** `0.72` (the harmonic mean of precision and recall).
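
The summary figures can be cross-checked with a few lines of arithmetic. The sketch below takes the rounded precision, recall, and class supports from the classification report above, so its results are approximate; the variable names are illustrative.

```python
# Fraud-class precision and recall as printed in the classification report.
precision = 0.91
recall = 0.60

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Test-set class supports from the same report.
legit_support = 553_574
fraud_support = 2_145

# Accuracy of a trivial model that predicts "legitimate" for every row.
baseline_accuracy = legit_support / (legit_support + fraud_support)

# Approximate number of frauds missed, given the rounded recall.
missed_frauds = round(fraud_support * (1 - recall))

print(f"F1: {f1:.2f}")
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.4f}")
print(f"Approximate missed frauds: {missed_frauds}")
```

Comparing the model's accuracy with this baseline shows how little of the headline figure comes from actually detecting fraud.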

✅ The saved model is a working baseline. Improving fraud recall is the main area for **further optimization** before **deployment**.
