# **Problem Statement**

This project aims to build a machine learning model to detect fraudulent transactions using historical data. The goal is to reduce false positives and false negatives, improving accuracy. The model will also support real-time fraud detection to enhance security and protect financial institutions from losses.

# **Goal**

The goal of this project is to build a model that accurately identifies fraudulent transactions, helping financial institutions reduce fraud and prevent losses.

# **Features**

Transaction_ID: Unique identifier for each transaction.

User_ID: Unique identifier for each user.

Transaction_Amount: The amount of money involved in the transaction.

Transaction_Type: The type of transaction (e.g., ATM Withdrawal, Bill Payment).

Time_of_Transaction: The time when the transaction occurred.

Device_Used: The device used for the transaction (e.g., Mobile, Tablet, Desktop).

Location: The location where the transaction took place.

Previous_Fraudulent_Transactions: Number of previous fraud cases associated with the user.

Account_Age: The age of the account (in days).

Number_of_Transactions_Last_24H: Number of transactions made by the user in the last 24 hours.

Payment_Method: Method of payment used (e.g., Credit Card, Debit Card).

# **Target Variable**

Fraudulent: Indicates whether the transaction is fraudulent (1 = Fraudulent, 0 = Not Fraudulent).

In [None]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# **Understand the data**

In [None]:
# load csv dataset
df=pd.read_csv('/content/Fraud Detection Dataset.csv')
df

In [None]:
df.head(10)

In [None]:
df.sample(10)

In [None]:
df.shape

In [None]:
# Checking the data types and non-null values
df.info()

# **Data cleaning**

In [None]:
df.isnull().sum()

In [None]:
df['Transaction_Amount'].value_counts()

In [None]:
df['Transaction_Amount'].unique()

In [None]:
sns.histplot(df['Transaction_Amount'])

In [None]:
df['Transaction_Amount'] = df['Transaction_Amount'].fillna(df['Transaction_Amount'].median())

In [None]:
sns.histplot(df['Time_of_Transaction'])

In [None]:
df['Time_of_Transaction'] = df['Time_of_Transaction'].fillna(df['Time_of_Transaction'].mean())

In [None]:
df.isnull().sum()

In [None]:
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])
df['Device_Used'] = df['Device_Used'].fillna(df['Device_Used'].mode()[0])
df['Payment_Method'] = df['Payment_Method'].fillna(df['Payment_Method'].mode()[0])

In [None]:
df.info()
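
The cleaning cells above fill numeric gaps with a central value and categorical gaps with the mode. A minimal sketch of the same idea on a made-up frame follows; `toy` and `impute_simple` are hypothetical names used only for illustration, and the sketch uses the median for every numeric column, whereas the notebook uses the mean for `Time_of_Transaction`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Transaction_Amount": [100.0, np.nan, 250.0, 40.0, 5000.0],
    "Device_Used": ["Mobile", None, "Mobile", "Desktop", "Tablet"],
})

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the median and all other columns with the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

filled = impute_simple(toy)
print(filled.isnull().sum())
```

The median is less sensitive than the mean to extreme amounts, which is one reason it is a common choice for skewed columns such as transaction amounts.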

# **Descriptive Statistics**

In [None]:
# Summary statistics for the numerical columns
df.describe()

# **Data visualization**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
df.hist(figsize=(20,10))
plt.show()

* User_ID: The distribution of User_ID appears uniform, meaning the user IDs are evenly distributed across the dataset.

* Transaction_Amount: This column has a right-skewed distribution, with most transaction amounts clustered at lower values, but there is a significant outlier around the 50,000 mark.

* Transaction_Type: This feature seems binary or categorical with a heavy imbalance. One category (perhaps 0) dominates, while the other has far fewer occurrences.

* Time_of_Transaction: The transaction times are relatively evenly distributed throughout the day, suggesting transactions occur consistently at different times.

* Location: There appear to be a few distinct locations with similar distributions, though some regions show fewer transactions.

* Previous_Fraudulent_Transactions: This seems categorical, with a fairly balanced distribution among the four categories.

* Account_Age: Account ages are somewhat uniformly distributed, with more transactions coming from accounts with ages clustered between 0 and 120.

* Number_of_Transactions_Last_24H: The distribution of transactions in the last 24 hours shows balanced activity across different counts.

* Fraudulent: This is binary (fraudulent or not), with most transactions being non-fraudulent.

* Device_Used_Mobile, Device_Used_Tablet, Device_Used_Unknown: These binary features show that most transactions occur via mobile, fewer via tablet, and some via unknown devices.

* Payment_Method (Debit Card, Invalid Method, Net Banking, UPI): These binary columns show that most transactions are not done via the respective methods, but there are some clusters for each payment method.
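
The observation that most transactions are non-fraudulent can be quantified directly. The sketch below uses a hypothetical `labels` series in place of `df['Fraudulent']`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be df['Fraudulent'].
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Fraudulent")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()  # majority / minority

print(shares)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

A large ratio suggests that plain accuracy will look good even for a model that never predicts fraud, so class-aware metrics are worth checking later.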

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df[df['Fraudulent'] == 0]['Transaction_Amount'], color='blue', label='Non-Fraud', kde=True)
sns.histplot(df[df['Fraudulent'] == 1]['Transaction_Amount'], color='red', label='Fraud', kde=True)
plt.title('Transaction Amount Distribution: Fraud vs Non-Fraud')
plt.legend()
plt.show()

* This graph compares the distribution of transaction amounts for fraudulent and non-fraudulent transactions.

In [None]:
plt.figure(figsize=(10, 6))
df.groupby('Time_of_Transaction')['Fraudulent'].sum().plot(kind='line', color='blue')
plt.title('Fraudulent Transactions Over Time')
plt.ylabel('Number of Fraudulent Transactions')
plt.xlabel('Time of Transaction')
plt.show()


* This plot shows how fraud occurs at different times of the day or across specific time intervals.

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Transaction_Amount', y='Number_of_Transactions_Last_24H', hue='Fraudulent', data=df)
plt.title('Transaction Amount vs. Number of Transactions (Colored by Fraud)')
plt.show()

* This scatter plot can reveal whether higher transaction amounts or a higher number of recent transactions are associated with fraud.

# **Outlier Detection**

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df)
plt.xticks(rotation=45, ha='right')
plt.title('Boxplot for Numerical Features')
plt.show()

In [None]:
df

In [None]:
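# Box plots are not meaningful for categorical columns; count plots show category frequencies instead
sns.countplot(x=df['Device_Used'])

In [None]:
sns.countplot(x=df['Payment_Method'])

In [None]:
sns.countplot(x=df['Location'])

In [None]:
sns.countplot(x=df['Transaction_Type'])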

In [None]:
# Calculate the IQR for Transaction_Amount
Q1 = df['Transaction_Amount'].quantile(0.25)
Q3 = df['Transaction_Amount'].quantile(0.75)
IQR = Q3 - Q1

# Define lower and upper bounds for outlier detection
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_cleaned = df[(df['Transaction_Amount'] >= lower_bound) & (df['Transaction_Amount'] <= upper_bound)]

In [None]:
# Cap outliers
df['Transaction_Amount'] = np.where(df['Transaction_Amount'] < lower_bound, lower_bound, df['Transaction_Amount'])
df['Transaction_Amount'] = np.where(df['Transaction_Amount'] > upper_bound, upper_bound, df['Transaction_Amount'])
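
The IQR capping above can also be written with `Series.clip`, which should give the same result as the two `np.where` calls. This is a sketch on made-up amounts; `cap_iqr` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up amounts with one extreme value.
amounts = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
capped = cap_iqr(amounts)
print(capped)
```

Capping keeps every row, whereas filtering (as in `df_cleaned`) drops the extreme rows. For fraud data the extreme rows may be exactly the interesting ones, so keeping them is often the safer default.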

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df)
plt.xticks(rotation=45, ha='right')
plt.title('Boxplot for Numerical Features')
plt.show()

In [None]:
df = df.drop('Transaction_ID', axis=1)

Transaction_ID is dropped because it is only a unique identifier and does not contribute to fraud prediction.

# **Encoding**

## **Label encoding**

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Location'] = le.fit_transform(df['Location'])

## **Mapping**

In [None]:
mapping = {
    'Online Purchase': 0,
    'POS Purchase': 1,
    'ATM Withdrawal': 2
}
# Note: any value not listed in `mapping` becomes NaN after .map()
df['Transaction_Type'] = df['Transaction_Type'].map(mapping)
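
Because `Series.map` returns NaN for any key missing from the dictionary, it can help to list unmapped categories before relying on the encoded column. The sketch below uses a made-up `types` series; whether the real data contains other categories (the feature list mentions Bill Payment as an example) is an assumption to verify.

```python
import pandas as pd

# Made-up sample; 'Bill Payment' is deliberately absent from the mapping.
types = pd.Series(["Online Purchase", "ATM Withdrawal", "Bill Payment"])

mapping = {
    "Online Purchase": 0,
    "POS Purchase": 1,
    "ATM Withdrawal": 2,
}

encoded = types.map(mapping)
unmapped = types[encoded.isna()].unique()  # categories the mapping does not cover
print("Unmapped categories:", list(unmapped))
```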

## **One-Hot Encoding**

In [None]:
df = pd.get_dummies(df, columns=['Device_Used'], drop_first=True, dtype=int)

In [None]:
df.info()

In [None]:
df = pd.get_dummies(df, columns=['Payment_Method'], drop_first=True, dtype=int)
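
With `drop_first=True`, the first category in sorted order gets no column of its own and is represented by all zeros. A small sketch on a made-up `Device_Used` column illustrates this:

```python
import pandas as pd

# Made-up device values for illustration.
toy = pd.DataFrame({"Device_Used": ["Desktop", "Mobile", "Tablet", "Mobile"]})

encoded = pd.get_dummies(toy, columns=["Device_Used"], drop_first=True, dtype=int)
print(encoded)
# 'Desktop' sorts first, so it has no column; its rows are all zeros.
```

Dropping one level avoids perfectly collinear dummy columns, which matters mainly for linear models; tree-based models are usually indifferent to it.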

# **Correlation**

In [None]:
plt.figure(figsize=(12, 8))
corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()

In [None]:
corr_matrix
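
A full heatmap can be hard to read when only the relationship with the target matters. One possible follow-up, sketched here on synthetic data (the `signal` and `noise` columns are made up), is to rank features by the absolute value of their correlation with `Fraudulent`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic frame: 'signal' depends on the target, 'noise' does not.
toy = pd.DataFrame({"Fraudulent": rng.integers(0, 2, size=n)})
toy["signal"] = toy["Fraudulent"] * 2.0 + rng.normal(size=n)
toy["noise"] = rng.normal(size=n)

target_corr = (
    toy.corr()["Fraudulent"]
    .drop("Fraudulent")                       # remove the trivial self-correlation
    .sort_values(key=abs, ascending=False)    # strongest relationship first
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here does not by itself mean a feature is useless to a non-linear model.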

# **Dimensionality reduction**

# **Scaling**

## **StandardScaler**

In [None]:
from sklearn.preprocessing import StandardScaler
X = df.drop('Fraudulent', axis=1)
y = df['Fraudulent']

In [None]:
standard_scaler=StandardScaler()
X_standardized=standard_scaler.fit_transform(X)
X_standardized

In [None]:
X_standardized=pd.DataFrame(X_standardized,columns=X.columns)
X_standardized
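
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The sketch below checks this on a made-up array `X_toy`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales.
X_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Manual z-score using the population standard deviation (ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
```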

# **Training the Model**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.svm import SVR
from sklearn.impute import SimpleImputer
X = df.drop(columns=['Fraudulent'])  # Features
y = df['Fraudulent']  # Target variable

In [None]:
imputer = SimpleImputer(strategy='mean')  # You can choose other strategies like 'median'
X = imputer.fit_transform(X)

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
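
The cells above fit the imputer and scaler on the full data before splitting, which lets test-set statistics leak into preprocessing. One common alternative, sketched here on synthetic data, is to split first (stratified on the label) and wrap the preprocessing and model in a scikit-learn `Pipeline`. The arrays `X_toy` and `y_toy` are made up; this is not the notebook's actual workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic imbalanced data with a few missing values.
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=300) > 1.0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# Split first; stratify keeps the fraud share similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Imputer and scaler are fitted on the training part only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Tree-based models such as random forests do not need scaled inputs, so the scaling step is kept here only to mirror the notebook's preprocessing.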

In [None]:
model = RandomForestClassifier(random_state=42)  # You can choose other models like LogisticRegression, XGBoost, etc.
model.fit(X_train, y_train)

In [None]:
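from sklearn.svm import SVC

# Fraud detection is a classification task, so use the classifier SVC rather than the regressor SVR
svc = SVC(kernel='rbf')  # You can try different kernels like 'linear', 'poly', etc.
svc.fit(X_train, y_train)

In [None]:
y_pred = svc.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

* Note: fitting `SVR` here instead of `SVC` produces continuous predictions, and the metric calls then fail with `ValueError: Classification metrics can't handle a mix of binary and continuous targets`. `SVC` returns class labels, so the metrics can be computed.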

In [None]:
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)
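
Since the problem statement targets both false positives and false negatives, accuracy alone may be misleading on imbalanced data. The sketch below, on synthetic data with a hypothetical logistic regression model, reports precision, recall, and ROC AUC instead. None of these numbers come from the fraud dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data in which positives are a small minority.
X_toy = rng.normal(size=(1000, 3))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=1000) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# class_weight='balanced' reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

y_hat = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

metrics = {
    "precision": precision_score(y_te, y_hat, zero_division=0),  # penalizes false positives
    "recall": recall_score(y_te, y_hat, zero_division=0),        # penalizes false negatives
    "roc_auc": roc_auc_score(y_te, y_score),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision falls as false positives rise and recall falls as false negatives rise, so tracking both maps directly onto the two error types named in the problem statement.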