<a href="https://colab.research.google.com/github/Bhuvnesh4996/Bhuvnesh4996/blob/main/Fraud_Detection_System_for_Financial_Security.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing Required Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

Loading the dataset into a pandas DataFrame

In [2]:
from google.colab import files
uploaded = files.upload()

Saving Project-2.csv to Project-2.csv


In [3]:
df = pd.read_csv('Project-2.csv')

In [4]:
df.head()

Unnamed: 0,Transaction_ID,Customer_ID,Transaction_Amount,Location,Time_of_Day,Is_Fraudulent
0,1.0,101.0,1000.0,Mumbai,Morning,0.0
1,2.0,102.0,500.0,Delhi,Afternoon,1.0
2,3.0,103.0,2000.0,Chennai,Evening,0.0
3,4.0,104.0,300.0,Gujarat,Morning,1.0
4,5.0,105.0,800.0,Bengalore,Afternoon,0.0


In [5]:
df.tail()

Unnamed: 0,Transaction_ID,Customer_ID,Transaction_Amount,Location,Time_of_Day,Is_Fraudulent
16,17.0,117.0,1300.0,Delhi,Afternoon,0.0
17,18.0,118.0,1700.0,Chennai,Evening,0.0
18,19.0,119.0,500.0,Gujarat,Morning,1.0
19,20.0,120.0,900.0,Bengalore,Afternoon,0.0
20,,,,,,


Dataset Information

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Transaction_ID      20 non-null     float64
 1   Customer_ID         20 non-null     float64
 2   Transaction_Amount  20 non-null     float64
 3   Location            20 non-null     object 
 4   Time_of_Day         20 non-null     object 
 5   Is_Fraudulent       20 non-null     float64
dtypes: float64(4), object(2)
memory usage: 1.1+ KB


Checking for missing values in the dataset

In [8]:
df.isnull().sum()

Transaction_ID        1
Customer_ID           1
Transaction_Amount    1
Location              1
Time_of_Day           1
Is_Fraudulent         1
dtype: int64
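
The single missing value in every column corresponds to the empty final row (index 20) shown by df.tail(). A minimal sketch of removing such a row, using a small hypothetical stand-in frame rather than Project-2.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df: two real rows plus one empty row
demo = pd.DataFrame({
    "Transaction_Amount": [1000.0, 500.0, np.nan],
    "Location": ["Mumbai", "Delhi", np.nan],
    "Is_Fraudulent": [0.0, 1.0, np.nan],
})

# Drop only rows in which every column is missing
cleaned = demo.dropna(how="all").reset_index(drop=True)

# The label can be stored as an integer once no NaN remains
cleaned["Is_Fraudulent"] = cleaned["Is_Fraudulent"].astype(int)

print(cleaned.isnull().sum())
```

Using how="all" is a cautious choice: a row with only some values missing would be kept for separate inspection rather than silently dropped.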


Number of legitimate and fraudulent transactions



In [9]:
df['Is_Fraudulent'].value_counts()

Is_Fraudulent
0.0    12
1.0     8
Name: count, dtype: int64

This dataset is mildly imbalanced: 12 legitimate transactions versus 8 fraudulent ones.

0 --> Legitimate transaction

1 --> Fraudulent transaction

In [10]:
#Separating the data for analysis
legit = df[df.Is_Fraudulent == 0]
fraud = df[df.Is_Fraudulent == 1]

In [11]:
#statistical measures
legit.Transaction_Amount.describe()

count      12.000000
mean     1333.333333
std       754.782730
min       400.000000
25%       800.000000
50%      1100.000000
75%      1775.000000
max      3000.000000
Name: Transaction_Amount, dtype: float64

In [12]:
fraud.Transaction_Amount.describe()

count       8.000000
mean      981.250000
std       837.486674
min       300.000000
25%       387.500000
50%       500.000000
75%      1575.000000
max      2500.000000
Name: Transaction_Amount, dtype: float64

Fraction of fraudulent transactions in the full dataset

In [13]:
df['Is_Fraudulent'].mean()

0.4
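
The mean of a 0/1 label is the share of fraudulent rows. To compare the two classes on transaction amount in a single call, a grouped summary can be used. A minimal sketch on hypothetical data (the amounts below are made up, not taken from Project-2.csv):

```python
import pandas as pd

# Hypothetical stand-in: three legitimate rows and two fraudulent rows
demo = pd.DataFrame({
    "Transaction_Amount": [1000.0, 2000.0, 800.0, 500.0, 300.0],
    "Is_Fraudulent": [0.0, 0.0, 0.0, 1.0, 1.0],
})

# One row per class, with the count, mean and median of the amount
summary = demo.groupby("Is_Fraudulent")["Transaction_Amount"].agg(["count", "mean", "median"])
print(summary)
```

Selecting the numeric column before aggregating avoids errors from the string columns (Location, Time_of_Day) in the real data.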

Random under-sampling of the legitimate transactions to match the 8 fraudulent ones



In [14]:
legit_sample = legit.sample(n=8)

Concatenating the two DataFrames

In [15]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)

In [16]:
new_dataset.head()

Unnamed: 0,Transaction_ID,Customer_ID,Transaction_Amount,Location,Time_of_Day,Is_Fraudulent
15,16.0,116.0,800.0,Mumbai,Morning,0.0
8,9.0,109.0,700.0,Gujarat,Evening,0.0
0,1.0,101.0,1000.0,Mumbai,Morning,0.0
17,18.0,118.0,1700.0,Chennai,Evening,0.0
11,12.0,112.0,2200.0,Delhi,Evening,0.0


In [17]:
new_dataset['Is_Fraudulent'].value_counts()

Is_Fraudulent
0.0    8
1.0    8
Name: count, dtype: int64

In [18]:
new_dataset['Is_Fraudulent'].mean()

0.5
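
Because legit.sample(n=8) draws a different subset on every run, the rows in the balanced dataset change between executions. A minimal sketch of a repeatable under-sampling step, on a hypothetical stand-in frame with the same class counts (the random_state value is arbitrary):

```python
import pandas as pd

# Hypothetical stand-in with the notebook's class counts: 12 legitimate, 8 fraudulent
demo = pd.DataFrame({
    "Transaction_Amount": [float(100 * i) for i in range(1, 21)],
    "Is_Fraudulent": [0.0] * 12 + [1.0] * 8,
})

legit = demo[demo["Is_Fraudulent"] == 0]
fraud = demo[demo["Is_Fraudulent"] == 1]

# Sample as many legitimate rows as there are fraudulent ones;
# fixing random_state makes the draw repeatable
legit_sample = legit.sample(n=len(fraud), random_state=42)

balanced = pd.concat([legit_sample, fraud], axis=0)
print(balanced["Is_Fraudulent"].value_counts())
```

Using n=len(fraud) instead of a hard-coded 8 keeps the classes balanced if the number of fraudulent rows changes.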

In [19]:
new_dataset.dtypes

Transaction_ID        float64
Customer_ID           float64
Transaction_Amount    float64
Location               object
Time_of_Day            object
Is_Fraudulent         float64
dtype: object

Splitting the data into features and target

In [20]:
X = new_dataset.drop(columns='Is_Fraudulent')
Y = new_dataset['Is_Fraudulent']

In [21]:
print(X)

    Transaction_ID  Customer_ID  Transaction_Amount   Location Time_of_Day
15            16.0        116.0               800.0     Mumbai     Morning
8              9.0        109.0               700.0    Gujarat     Evening
0              1.0        101.0              1000.0     Mumbai     Morning
17            18.0        118.0              1700.0    Chennai     Evening
11            12.0        112.0              2200.0      Delhi     Evening
13            14.0        114.0              3000.0    Gujarat   Afternoon
16            17.0        117.0              1300.0      Delhi   Afternoon
6              7.0        107.0               400.0      Delhi     Morning
1              2.0        102.0               500.0      Delhi   Afternoon
3              4.0        104.0               300.0    Gujarat     Morning
5              6.0        106.0              1500.0     Mumbai     Evening
7              8.0        108.0              2500.0    Chennai   Afternoon
10            11.0       

In [22]:
print(Y)

15    0.0
8     0.0
0     0.0
17    0.0
11    0.0
13    0.0
16    0.0
6     0.0
1     1.0
3     1.0
5     1.0
7     1.0
10    1.0
12    1.0
14    1.0
18    1.0
Name: Is_Fraudulent, dtype: float64


Splitting the data into training and testing sets

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)


In [24]:
print(X.shape, X_train.shape, X_test.shape)

(16, 5) (12, 5) (4, 5)
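
With stratify=Y, both the training and the testing set keep the 50/50 class ratio of the balanced data. A small sketch on hypothetical data of the same size (16 rows, 8 per class) that checks this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: one numeric feature and a balanced binary label
X_demo = pd.DataFrame({"Transaction_Amount": [float(100 * i) for i in range(1, 17)]})
y_demo = pd.Series([0] * 8 + [1] * 8, name="Is_Fraudulent")

# Same arguments as the notebook's split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Class counts in each part of the split
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

A 4-row test set is very small, so any score computed on it moves in steps of 0.25 and should be read with caution.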


Logistic Regression

In [25]:
model = LogisticRegression()

Training the Logistic Regression model

In [27]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression

# Assuming 'Location' is a categorical variable in your dataset
# Define the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('Location', OneHotEncoder(), ['Location'])  # Apply one-hot encoding to the 'Location' column
    ],
    remainder='passthrough'
)

In [39]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# 'Location' and 'Time_of_Day' are both categorical (dtype object), so both
# must be encoded; a string column left to pass through raises
# "ValueError: could not convert string to float" when the model is fitted.
preprocessor = ColumnTransformer(
    transformers=[
        # handle_unknown='ignore' avoids errors on categories not seen during training
        ('categorical', OneHotEncoder(handle_unknown='ignore'), ['Location', 'Time_of_Day'])
    ],
    remainder='passthrough'  # numeric columns are passed through unchanged
)

# The target is binary, so use logistic regression (a classifier), not linear regression
model = LogisticRegression(max_iter=1000)


In [40]:
# Create a pipeline combining preprocessing and modeling steps
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

# Fit the pipeline with training data
pipe.fit(X_train, Y_train)

In [41]:
model.fit(X_train, Y_train)

ValueError: could not convert string to float: 'Gujarat'
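
The ValueError in In [41] is expected: calling model.fit directly skips the preprocessing step, so the estimator receives the raw strings in Location and Time_of_Day. A minimal sketch of fitting and scoring through a pipeline instead, on made-up stand-in data with the notebook's column names. The StandardScaler step and the use of accuracy_score (imported at the top of the notebook) are assumptions about the intended next step, not something the notebook itself does:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data (values are made up for illustration)
X_demo = pd.DataFrame({
    "Transaction_Amount": [1000.0, 500.0, 2000.0, 300.0, 800.0, 400.0, 1500.0, 350.0],
    "Location": ["Mumbai", "Delhi", "Chennai", "Gujarat",
                 "Mumbai", "Delhi", "Chennai", "Gujarat"],
    "Time_of_Day": ["Morning", "Afternoon", "Evening", "Morning",
                    "Afternoon", "Evening", "Morning", "Afternoon"],
})
y_demo = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name="Is_Fraudulent")

# Encode every string column and scale the numeric one
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Location", "Time_of_Day"]),
        ("numeric", StandardScaler(), ["Transaction_Amount"]),
    ]
)

pipe = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Always call fit/predict on the pipeline, never on the bare estimator,
# so the same preprocessing is applied at training and prediction time
pipe.fit(X_demo, y_demo)
predictions = pipe.predict(X_demo)

accuracy = accuracy_score(y_demo, predictions)
print("Training accuracy:", accuracy)
```

In the notebook itself the same calls would be pipe.fit(X_train, Y_train) followed by accuracy_score on pipe.predict(X_test); identifier columns such as Transaction_ID and Customer_ID are left out of this sketch because they are unlikely to carry predictive signal.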

In [43]:
print(Y_train.dtypes)

float64
