# CREDIT CARD FRAUD DETECTION PROJECT

## Fraud Detection System – Project Approach

### Problem Definition

 
Many existing fraud detection systems fail to detect subtle fraudulent activity, which leads to financial losses. Your goal is to develop a fraud detection model that integrates with existing systems and identifies even subtle credit card fraud with high precision.


### Objectives

1. Develop an AI-driven fraud detection system that detects even small fraudulent activities in credit transactions.
2. Ensure real-time fraud detection with minimal false positives to avoid disrupting genuine users.
3. Create a system that seamlessly integrates with existing financial systems.
4. Utilize advanced techniques like anomaly detection, behavioral biometrics, and contextual analysis.
5. Implement a self-learning mechanism that adapts to new fraud patterns.

### Data Understanding

1. Time: The number of seconds elapsed between the first transaction in the dataset and the current transaction.
2. Amount: The transaction amount in currency units (e.g., USD or EUR).
3. Class: The target variable: 0 for non-fraudulent transactions, 1 for fraudulent transactions.
4. V1 to V28: Principal components obtained through PCA (Principal Component Analysis). They are anonymized features derived from the original transaction details to protect sensitive information.
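
As a quick sanity check on the column layout described above, the sketch below builds the expected list of 31 column names and reports any that are missing from a DataFrame. The helper name `check_schema` and the tiny synthetic frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Column layout described above: Time, V1..V28, Amount, Class (31 columns).
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def check_schema(df: pd.DataFrame) -> list:
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Tiny synthetic frame standing in for creditcard.csv (assumption for the demo)
demo = pd.DataFrame({col: [0.0, 1.0] for col in EXPECTED_COLUMNS})
missing = check_schema(demo)
missing_after_drop = check_schema(demo.drop(columns=["V28"]))
print(missing, missing_after_drop)
```

Running a check like this right after loading the CSV would surface a renamed or dropped column before any later step relies on it.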

In [5]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

class DataUnderstanding:
    def __init__(self, df: pd.DataFrame):
        self.df = df

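    def dataset_info(self):
        # DataFrame.info() prints its report directly and returns None,
        # so call it on its own instead of passing it to print().
        print("Dataset Info:")
        self.df.info()
        print("=" * 8)

        # shape is (rows, columns)
        print(f"Number of Rows in the Dataset: {self.df.shape[0]}")
        print(f"Number of Columns in the Dataset: {self.df.shape[1]}")

    # Summary statistics of the dataset
    def summary_statistics(self):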
        print("="*4)
        print("Numerical Summary Statistics:")
        print(self.df.describe())

        # print("Categorical Summary Statistics:")
        # print(self.df.describe(include =["object"]))

    def data_Quality_Checks(self):
        print("=" * 8)
        # duplicated() flags whole rows, so this is a row count, not per column
        print("Number of duplicate rows")
        print(self.df.duplicated().sum())

        print("=" * 8)
        print("Number of missing values per column")
        print(self.df.isna().sum())

        print("=" * 8)
        print("Total number of missing values in the dataset")
        # Sum per column, then across columns, to get a single total
        print(self.df.isna().sum().sum())

        print("=" * 8)
        print("Number of unique values per column")
        print(self.df.nunique())


    def run_all_checks(self):
        """Runs all overview checks."""
        self.dataset_info()
        self.summary_statistics()
        self.data_Quality_Checks()
        


 
df = pd.read_csv("creditcard.csv")
data = DataUnderstanding(df)
data.run_all_checks()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

### Overall Observation
**Classes:**
- 0 (legitimate transactions): 99.83% (284,315 instances)
- 1 (fraudulent transactions): 0.17% (492 instances)
- The dataset is highly imbalanced: fraudulent transactions are extremely rare.

- PCA-transformed features: V1 to V28 have no direct real-world meaning but capture variance in the data.
- Transaction amount: values range widely, from $0 to $25,691, with most concentrated below $100.
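
The class shares quoted above can be reproduced with a short helper. This is a minimal sketch: the function name `class_balance` is an assumption, and the labels are synthetic, built only from the two class counts reported in this section rather than read from `creditcard.csv`.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, target: str = "Class") -> pd.DataFrame:
    """Return the count and percentage share of each class label."""
    counts = df[target].value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": share})


# Synthetic labels using the class counts reported above (assumption for the demo)
labels = pd.DataFrame({"Class": [0] * 284315 + [1] * 492})
balance = class_balance(labels)
print(balance)
```

On the real data, the same call would be `class_balance(df)` after loading the CSV.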

### Data Preprocessing


#### Data Cleaning

**Observations**

- No missing values were found, so the dataset is complete and neither imputation nor row deletion was needed.
- Duplicate entries were present: 1081 duplicate rows were found and removed so that repeated records do not bias model training.
- Outliers were mainly in the transaction amount: the "Amount" feature showed extreme values (some transactions exceeded $25,000). A log transformation was applied to reduce skewness and improve model performance.
- Class imbalance in fraudulent transactions: the dataset was highly imbalanced (fraud cases = 0.17%). SMOTE (Synthetic Minority Oversampling Technique) was used to generate synthetic fraud cases and balance the dataset.
 
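The log transformation of "Amount" mentioned in the observations could look like the sketch below. The notebook does not state which variant was used; `np.log1p` is assumed here because it computes log(1 + x) and therefore stays finite for zero-valued transactions. The sample amounts are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic amounts spanning the range reported above (assumption for the demo)
amounts = pd.Series([0.0, 1.0, 9.99, 88.0, 2500.0, 25691.0], name="Amount")

# log1p computes log(1 + x), so a zero amount maps to 0 instead of -inf
log_amount = np.log1p(amounts)

# expm1 is the inverse, useful for reporting results in original currency units
recovered = np.expm1(log_amount)

print(log_amount.round(3).tolist())
```

The transform compresses the long right tail of the amounts while preserving their order.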

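To illustrate how SMOTE generates synthetic fraud cases, here is a simplified NumPy sketch of its core interpolation step: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point on the line segment between them. This is not the implementation used in the notebook, which is not shown; in practice a library implementation such as the `SMOTE` class in imbalanced-learn would normally be used. The function name `smote_sketch` and the random demo data are assumptions.

```python
import numpy as np


def smote_sketch(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic minority samples by SMOTE-style interpolation."""
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples (O(n^2) memory,
    # fine for a demo but not for very large minority classes)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    base = rng.integers(0, n, size=n_new)  # random minority sample per new point
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one of its neighbors
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Random stand-in for the minority (fraud) feature matrix (assumption for the demo)
X_fraud = np.random.default_rng(42).normal(size=(20, 3))
X_synth = smote_sketch(X_fraud, n_new=50)
print(X_synth.shape)
```

Oversampling of this kind is typically applied only to the training split, so that the evaluation data keeps the original class distribution.
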
#### Drop Duplicates

In [6]:
# Remove exact duplicate rows, modifying df in place
df.drop_duplicates(inplace=True)