# Mpesa Fraud-Detection

## 1. Business Understanding

### 1.1 Problem Statement

Financial fraud is a major threat to Africa’s 
economic stability. Traditional detection methods 
struggle with evolving fraudulent activities. Our 
system enhances transaction security, reducing 
false positives while maintaining a seamless user 
experience.

### 1.2 Proposed Solutions

By integrating advanced technology with 
proactive fraud prevention, we offer a 
robust defense against credit card fraud 
and anomalous transactions.

### 1.3 Objectives

#### 1.3.1 Main Objective

#### 1.3.2 Specific Objectives

- Decrease financial losses due to fraudulent activities.
- Achieve a high detection rate for unauthorized transactions.
- Reduce the rate of false positives in the financial sector.
- Detect suspicious activities and alert stakeholders within a few minutes.
- Retrain the model with new data and techniques at regular intervals each year.
- Comply with relevant data protection regulations and implement robust security measures.
- Establish partnerships with financial institutions and regulatory bodies.
- Track and report key performance indicators (KPIs) related to fraud detection and prevention on a monthly basis.
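
The detection-rate and false-positive objectives above can be made measurable. The following is a minimal sketch, assuming "detection rate" is taken to mean recall on fraudulent transactions and "false positive rate" the share of legitimate transactions that get flagged. The function name `fraud_metrics` and the example labels are hypothetical and are not part of this project's code.

```python
def fraud_metrics(y_true, y_pred):
    """Compute detection rate (recall) and false positive rate from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud, missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate, not flagged
    return {
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical labels: 1 = fraud, 0 = legitimate
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
metrics = fraud_metrics(y_true, y_pred)
print(metrics)
```

Reporting both numbers together matters because a model can raise its detection rate simply by flagging more transactions, which also raises the false positive rate.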

### 1.4 Metrics of Success

### 1.5 Stakeholders

### 1.6 Constraints

## 2. Data Understanding

### 2.1 Data Collection

**Data sources**: 

We identified and collected relevant 
data from various sources, such as historical fraud 
records, customer information, and transaction 
data. Our major sources were Kaggle and 
GitHub repositories.

#### Column Description of the Data

- step: The time step of the transaction.
- type: The type of transaction (e.g., PAYMENT, TRANSFER, CASH_OUT).
- amount: The transaction amount.
- nameOrig: The originator's account ID.
- oldbalanceOrg: The old balance of the originator before the transaction.
- newbalanceOrig: The new balance of the originator after the transaction.
- nameDest: The destination account ID.
- oldbalanceDest: The old balance of the destination before the transaction.
- newbalanceDest: The new balance of the destination after the transaction.
- isFraud: A binary indicator (0 or 1) showing whether the transaction is fraudulent.
- isFlaggedFraud: A binary indicator for flagged fraudulent transactions.

In [2]:
# Import necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [3]:
# A small helper class to load the data
class LoadData():
    def load_data(self, path):
        # Read the CSV file at `path` into a DataFrame
        self.df = pd.read_csv(path)
        return self.df

data = LoadData()
data_path = "Fraud-Data.csv"
df = data.load_data(data_path)
# Display the first five rows in the dataset
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
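
The sample rows above also show how the balance columns relate to `amount`: in each of them, `oldbalanceOrg - amount` equals `newbalanceOrig`. The sketch below checks that relationship on those five rows, rebuilt by hand so the block runs on its own. The column name `orig_balance_error` is an assumption introduced for illustration, and whether the relationship holds across the full dataset is not established here.

```python
import pandas as pd

# The five rows displayed by df.head(), restricted to the originator columns
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
})

# Gap between the expected and the recorded new balance of the originator
sample["orig_balance_error"] = (
    sample["oldbalanceOrg"] - sample["amount"] - sample["newbalanceOrig"]
)

# True where the recorded balance matches the expected one, up to float rounding
consistent = sample["orig_balance_error"].abs() < 1e-6
print(consistent.all())
```

A derived column of this kind could later serve as a candidate feature, since rows where the balances do not reconcile may deserve a closer look.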


In [4]:
class DataUnderstanding():

    def load_data(self, path):
        # Read the CSV file at `path` into a DataFrame
        self.df = pd.read_csv(path)
        return self.df

    def data_understanding(self, df):
        """Print a summary of the dataset passed in as `df`."""
        # Dataset info
        print("INFO")
        print("-" * 4)
        df.info()

        # Shape of the dataset
        print("\n\nSHAPE")
        print("-" * 4)
        print(f"Records in the dataset:{df.shape[0]}, and columns in the dataset:{df.shape[1]}")

        # Columns in the dataset
        print("COLUMNS")
        print("-" * 4)
        print(f"Number of columns in the dataset: {len(df.columns)}")

        # Unique values in the dataset
        print("\n\nUNIQUE VALUES")
        print("-" * 12)
        for col in df.columns:
            n_unique = df[col].nunique()
            print(f"Column {col} has {n_unique} unique values")
            # List the values themselves only for low-cardinality columns
            if n_unique < 12:
                print(f"Top unique values in {col} include:")
                for idx in df[col].value_counts().index:
                    print(f"- {idx}")
            print("")

        # Missing values in the dataset
        print("\nMISSING")
        print("-" * 10)
        print(df.isna().sum())

        # Duplicate records in the dataset
        print("\nDuplicate values in the Dataset")
        print("- " * 4)
        print(f"The dataset has {df.duplicated().sum()} duplicated records.")

# Initialize the DataUnderstanding class
data = DataUnderstanding()
path = "Fraud-Data.csv"
df = data.load_data(path)
df.head()

| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |


In [7]:
# Call the data understanding method
data.data_understanding(df)

INFO
----
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


SHAPE
----
Records in the dataset:6362620, and columns in the dataset:11
COLUMNS
----
Number of columns in the dataset: 11


UNIQUE VALUES
------------
Column step has 743 unique values

Column type has 5 unique values
Top unique values in type include:
- CASH_OUT
- PAYMENT
- CASH_IN
- TRANSFER
- DEBIT

Column amount has 5316900 unique values

Column nameOrig has 6353307 unique values

Column oldbalanceOrg has 1845844 unique values



- The dataset has a total of `6,362,620` entries (rows) and 11 columns.
- `int64`: Integer type (for columns like step, isFraud, and isFlaggedFraud).
- `float64`: Floating-point type (for monetary values and balances).
- `object`: Object type, typically for string or mixed data (for columns like type, nameOrig, and nameDest).
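
`df.info()` reports more than 534 MB of memory for this dataset. One possible way to reduce that, shown here as a sketch rather than the approach used in this notebook, is to pass explicit dtypes to `pd.read_csv`. The dtype choices are assumptions: `category` for `type` because it has only five unique values, and `int8` for the two binary indicator columns. The inline CSV stands in for `Fraud-Data.csv` so the block is self-contained.

```python
import io
import pandas as pd

# Three rows copied from the dataset preview, used in place of the real file
csv_text = """step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
1,PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,0,0
1,TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,1,0
1,CASH_OUT,181.0,C840083671,181.0,0.0,C38997010,21182.0,0.0,1,0
"""

# Assumed dtype overrides; columns not listed keep the dtype pandas infers
dtypes = {"type": "category", "isFraud": "int8", "isFlaggedFraud": "int8"}

small = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(small.dtypes)
```

With the real file, `pd.read_csv("Fraud-Data.csv", dtype=dtypes)` would apply the same overrides, and `small.memory_usage(deep=True)` can be used to compare the footprint before and after.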

## 3. Data Preparation

### 3.1 Data Cleaning

#### Identifying missing values in the dataset

In [8]:
# Count missing values in each column
print(df.isna().sum())

step              0
type              0
amount            0
nameOrig          0
oldbalanceOrg     0
newbalanceOrig    0
nameDest          0
oldbalanceDest    0
newbalanceDest    0
isFraud           0
isFlaggedFraud    0
dtype: int64


The dataset has no missing values.

#### Checking for duplicates in the Dataset

In [9]:
print(df.duplicated().sum())

0


The dataset has no duplicate records.

In [10]:
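# Outlier detection and removal
class OutlierDetection():
    def detect_outliers(self, df):
        """Return the numeric columns of the rows whose z-scores are all below 3."""
        # Keep numeric columns only; type, nameOrig and nameDest are dropped here
        self.df = df.select_dtypes(include=[np.number])
        # Absolute z-score of every value, computed column by column.
        # The binary isFraud and isFlaggedFraud labels are numeric, so they are
        # scored too, and rows from a rare positive class can exceed the threshold.
        self.z_scores = ((self.df - self.df.mean()) / self.df.std()).abs()
        # Keep rows where every column lies within 3 standard deviations of its mean
        self.df = self.df[(self.z_scores < 3).all(axis=1)]
        return self.df

outlier = OutlierDetection()
df = outlier.detect_outliers(df)
df.head()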

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9839.64 | 170136.0 | 160296.36 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | 1864.28 | 21249.0 | 19384.72 | 0.0 | 0.0 | 0 | 0 |
| 4 | 1 | 11668.14 | 41554.0 | 29885.86 | 0.0 | 0.0 | 0 | 0 |
| 5 | 1 | 7817.71 | 53860.0 | 46042.29 | 0.0 | 0.0 | 0 | 0 |
| 6 | 1 | 7107.77 | 183195.0 | 176087.23 | 0.0 | 0.0 | 0 | 0 |
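
In the output above, rows 2 and 3, both labeled `isFraud = 1` in the earlier preview, are no longer present, and the `type`, `nameOrig` and `nameDest` columns have been dropped. The sketch below shows one possible variant that computes z-scores on a chosen list of feature columns only and returns every original column. It is an illustration under assumptions, not the method used in this notebook: the function name `filter_outliers`, its parameters, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def filter_outliers(df, feature_cols, threshold=3.0):
    """Drop rows whose absolute z-score reaches `threshold` in any of `feature_cols`."""
    features = df[feature_cols]
    # Z-scores are computed on the selected columns only, so label columns
    # such as isFraud do not influence which rows are kept
    z_scores = ((features - features.mean()) / features.std()).abs()
    mask = (z_scores < threshold).all(axis=1)
    # Index the original frame so non-numeric and label columns are preserved
    return df[mask]


# Toy data: 50 transactions, one of them labeled as fraud
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "type": ["PAYMENT"] * 50,
    "amount": rng.normal(100.0, 10.0, size=50),
    "isFraud": [0] * 49 + [1],
})
toy.loc[0, "amount"] = 10_000.0  # inject one extreme amount

filtered = filter_outliers(toy, feature_cols=["amount"])
print(filtered.shape)
```

Here the row with the injected extreme amount is removed, while the single fraud-labeled row is kept because its `amount` is unremarkable. Whether extreme amounts should be removed at all in a fraud dataset is a separate modelling decision.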


### 3.2 Exploratory Data Analysis

## 4. Modelling

## 5. Conclusion

## 6. Recommendations

## 7. Next Steps