# Fraud Detection with Logistic Regression and Feature Engineering

You are a data scientist at a financial institution, and your primary task is to develop a fraud detection model using logistic regression. The dataset you have is highly imbalanced, with only a small fraction of transactions being fraudulent. Your objective is to create an effective model by implementing logistic regression and employing various feature engineering techniques to improve the model's performance:

1. Data Preparation:

a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).

b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.

2. Initial Logistic Regression Model:

a. Implement a basic logistic regression model using the raw dataset.

b. Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score.

3. Feature Engineering:

a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:

- Creating new features.

- Scaling or normalizing features.

- Handling missing values.

- Encoding categorical variables.

b. Explain why each feature engineering technique is relevant for fraud detection.

4. Handling Imbalanced Data:

a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.

b. Implement strategies to address class imbalance, such as:

- Oversampling the minority class.

- Undersampling the majority class.

- Using synthetic data generation techniques (e.g., SMOTE).

5. Logistic Regression with Feature-Engineered Data:

a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.

b. Evaluate the model's performance using appropriate evaluation metrics.

6. Model Interpretation:

a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.

b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

7. Model Comparison:

a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.

b. Discuss the advantages and limitations of each approach.

8. Presentation and Recommendations:

a. Prepare a presentation or report summarizing your analysis, results, and recommendations for the financial institution. Highlight the importance of feature engineering and handling imbalanced data in building an effective fraud detection system.

In this case study, you are required to showcase your ability to preprocess data, implement logistic regression, apply feature engineering techniques, and address class imbalance to improve the model's performance. Your analysis should also demonstrate your understanding of the nuances of fraud detection in a financial context.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [5]:
#Load the dataset
data=pd.read_csv("bank_transactions.csv")
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,T1,C5841053,10-01-1994,F,JAMSHEDPUR,17819.05,02-08-2016,143207,25.0
1,T2,C2142763,04-04-1957,M,JHAJJAR,2270.69,02-08-2016,141858,27999.0
2,T3,C4417068,26-11-1996,F,MUMBAI,17874.44,02-08-2016,142712,459.0
3,T4,C5342380,14-09-1973,F,MUMBAI,866503.21,02-08-2016,142714,2060.0
4,T5,C9031234,24-03-1988,F,NAVI MUMBAI,6714.43,02-08-2016,181156,1762.5


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048567 entries, 0 to 1048566
Data columns (total 9 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   TransactionID            1048567 non-null  object 
 1   CustomerID               1048567 non-null  object 
 2   CustomerDOB              1045170 non-null  object 
 3   label                    1047467 non-null  object 
 4   CustLocation             1048416 non-null  object 
 5   CustAccountBalance       1046198 non-null  float64
 6   TransactionDate          1048567 non-null  object 
 7   TransactionTime          1048567 non-null  int64  
 8   TransactionAmount (INR)  1048567 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 72.0+ MB


In [7]:
# Summary statistics for the numeric columns (the class distribution itself is computed further below)
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CustAccountBalance,1046198.0,115403.540056,846485.380601,0.0,4721.76,16792.18,57657.36,115035500.0
TransactionTime,1048567.0,157087.529393,51261.854022,0.0,124030.0,164226.0,200010.0,235959.0
TransactionAmount (INR),1048567.0,1574.335003,6574.742978,0.0,161.0,459.03,1200.0,1560035.0
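
The `describe()` summary above covers only the numeric columns, so it does not show how many transactions fall into each class. A minimal sketch of inspecting the class distribution with `value_counts` follows. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data: 'label' holds the class of each transaction.
# The values below are invented for illustration only.
toy = pd.DataFrame({"label": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})

# Absolute number of rows per class
class_counts = toy["label"].value_counts()

# Share of each class as a percentage of all rows
class_share = toy["label"].value_counts(normalize=True) * 100

print(class_counts)
print(class_share.round(2))
```

Looking at both the counts and the percentages makes the degree of imbalance explicit before any model is trained.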


In [8]:
data.isna().sum()

TransactionID                 0
CustomerID                    0
CustomerDOB                3397
label                      1100
CustLocation                151
CustAccountBalance         2369
TransactionDate               0
TransactionTime               0
TransactionAmount (INR)       0
dtype: int64

In [9]:
# Backward-fill missing values; fillna(method="bfill") is deprecated in recent pandas versions
df = data.bfill()
df.isna().sum()

TransactionID              0
CustomerID                 0
CustomerDOB                0
label                      0
CustLocation               0
CustAccountBalance         0
TransactionDate            0
TransactionTime            0
TransactionAmount (INR)    0
dtype: int64

In [10]:
x=data.iloc[:,7:9].values
x

array([[1.43207e+05, 2.50000e+01],
       [1.41858e+05, 2.79990e+04],
       [1.42712e+05, 4.59000e+02],
       ...,
       [1.83313e+05, 7.70000e+02],
       [1.84706e+05, 1.00000e+03],
       [1.81222e+05, 1.16600e+03]])

In [11]:
y=data.iloc[:,3:4].values
y

array([['F'],
       ['M'],
       ['F'],
       ...,
       ['M'],
       ['M'],
       ['M']], dtype=object)

In [62]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data["label"]=le.fit_transform(data['label'])
#data["CustomerID"]=le.fit_transform(data['CustomerID'])
#data["TransactionID"]=le.fit_transform(data['TransactionID'])
#data["CustomerDOB"]=le.fit_transform(data['CustomerDOB'])
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,0,531762,5379,0,JAMSHEDPUR,17819.05,02-08-2016,143207,25.0
1,159679,128998,2013,1,JHAJJAR,2270.69,02-08-2016,141858,27999.0
2,270790,372985,14818,0,MUMBAI,17874.44,02-08-2016,142712,459.0
3,381901,477898,8014,0,MUMBAI,866503.21,02-08-2016,142714,2060.0
4,493012,879221,13330,0,NAVI MUMBAI,6714.43,02-08-2016,181156,1762.5


In [63]:
fraudulent_count = data['label'].sum()
non_fraudulent_count = len(data) - fraudulent_count

# Calculate the proportion of fraudulent transactions
fraudulent_proportion = fraudulent_count / len(data)
non_fraudulent_proportion = non_fraudulent_count / len(data)

print("Class Distribution:")
print(f"Fraudulent Transactions: {fraudulent_count} ({fraudulent_proportion * 100:.2f}%)")
print(f"Non-Fraudulent Transactions: {non_fraudulent_count} ({non_fraudulent_proportion * 100:.2f}%)")

Class Distribution:
Fraudulent Transactions: 768832 (73.32%)
Non-Fraudulent Transactions: 279735 (26.68%)


In [64]:
#x= data[['TransactionTime','TransactionAmount (INR)']]  # Independent variables
#y= data['label']  # Dependent variable

xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.25,random_state=2)
print('Training data -X - shape:\t',xtrain.shape)
print()
print('Training data -Y - shape:\t',ytrain.shape)
print()
print('Testing data shape\n')
print('testing data(x-input) shape :\t',xtest.shape)
print()
print('testing data(Y-input) shape :\t',ytest.shape)

Training data -X - shape:	 (786425, 7)

Training data -Y - shape:	 (786425,)

Testing data shape

testing data(x-input) shape :	 (262142, 7)

testing data(Y-input) shape :	 (262142,)
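
With imbalanced classes, a purely random split can leave the test set with a different class ratio than the training set. One common option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data (`X_toy`, `y_toy` are hypothetical names and values), not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 samples with 4% positives (illustrative values only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 2))
y_toy = np.array([1] * 40 + [0] * 960)

# stratify=y_toy keeps the class ratio (approximately) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2, stratify=y_toy
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```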


In [None]:
from sklearn.linear_model import LogisticRegression

# Create an instance of LogisticRegression with a random seed
log_reg = LogisticRegression(random_state=2)

# Train the model
print('Training the model\n')
log_reg.fit(xtrain, ytrain)

# Test the model
ypred = log_reg.predict(xtest)
print('Predicted label for the input samples:\n', ypred)
print()
print('Testing is completed\n')
print('Testing samples are: \t', len(ypred))


In [66]:
xtrain.shape

(786425, 7)

In [67]:
xtest.shape

(262142, 7)

In [68]:
ytrain.shape

(786425,)

In [69]:
ytest.shape

(262142,)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,confusion_matrix

# For multiclass classification, set the 'average' parameter
# You can choose one of [None, 'micro', 'macro', 'weighted']
# 'micro' and 'macro' are common choices.

# Calculate accuracy
accuracy = accuracy_score(ytest,ypred)

# Calculate precision for each class and get the average
precision = precision_score(ytest, ypred, average='macro')

# Calculate recall for each class and get the average
recall = recall_score(ytest, ypred, average='macro')

# Calculate F1-Score for each class and get the average
f1 = f1_score(ytest, ypred, average='macro')

conf_matrix = confusion_matrix(ytest, ypred)

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:\n", conf_matrix)

In [87]:
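# Feature engineering
# Creating new features.
# Note: this stores TransactionTime % 24 in CustAccountBalance, replacing the original balance values.
data['CustAccountBalance']=data['TransactionTime']%24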
data

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,0,531762,5379,0,JAMSHEDPUR,23,02-08-2016,143207,25.0
1,159679,128998,2013,1,JHAJJAR,18,02-08-2016,141858,27999.0
2,270790,372985,14818,0,MUMBAI,8,02-08-2016,142712,459.0
3,381901,477898,8014,0,MUMBAI,10,02-08-2016,142714,2060.0
4,493012,879221,13330,0,NAVI MUMBAI,4,02-08-2016,181156,1762.5
...,...,...,...,...,...,...,...,...,...
1048562,53962,766771,4373,1,NEW DELHI,0,18-09-2016,184824,799.0
1048563,53963,598520,11070,1,NASHIK,14,18-09-2016,183734,460.0
1048564,53964,589517,10109,1,HYDERABAD,1,18-09-2016,183313,770.0
1048565,53965,591805,16739,1,VISAKHAPATNAM,2,18-09-2016,184706,1000.0
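
`TransactionTime` appears to be stored as an HHMMSS integer (its maximum in the summary statistics is 235959), so taking it modulo 24 does not give the hour of day. A minimal sketch of deriving an hour feature with integer division is shown below. The column names `TransactionHour` and `IsNightTransaction`, the night-time cutoff, and the last input value are assumptions made for illustration.

```python
import pandas as pd

# Three TransactionTime values from the table above, plus one invented
# early-morning value (23015 = 02:30:15) to exercise the night flag
toy = pd.DataFrame({"TransactionTime": [143207, 141858, 181156, 23015]})

# Integer-divide by 10000 to drop the MMSS digits and keep the hour
toy["TransactionHour"] = toy["TransactionTime"] // 10000

# Hypothetical flag: transactions between 00:00:00 and 05:59:59
toy["IsNightTransaction"] = (toy["TransactionHour"] < 6).astype(int)

print(toy)
```

Writing the result to a new column, rather than reusing `CustAccountBalance`, keeps the original balance available as a feature.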


In [88]:
# Scaling or normalizing features.
# Note: the standardized TransactionTime is written into CustAccountBalance here.
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
data['CustAccountBalance']=sc.fit_transform(data[['TransactionTime']])
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,0,531762,5379,0,JAMSHEDPUR,-0.270777,02-08-2016,143207,25.0
1,159679,128998,2013,1,JHAJJAR,-0.297093,02-08-2016,141858,27999.0
2,270790,372985,14818,0,MUMBAI,-0.280433,02-08-2016,142712,459.0
3,381901,477898,8014,0,MUMBAI,-0.280394,02-08-2016,142714,2060.0
4,493012,879221,13330,0,NAVI MUMBAI,0.46952,02-08-2016,181156,1762.5
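
Fitting the scaler on the full dataset lets information from the test rows influence the training features. A common alternative, sketched here on synthetic data, is to put the scaler and the model in one pipeline so the scaler is fitted on the training split only. All names and values (`amount`, `hour`, `scaled_logreg`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only)
rng = np.random.default_rng(0)
amount = rng.exponential(scale=1500.0, size=500)
hour = rng.integers(0, 24, size=500)
X_toy = np.column_stack([amount, hour])
y_toy = (amount > 2000).astype(int)  # synthetic label driven by amount

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2, stratify=y_toy
)

# The pipeline fits the scaler on the training split only, then reuses
# the same mean and standard deviation when predicting on the test split.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(random_state=2))
scaled_logreg.fit(X_tr, y_tr)

test_accuracy = scaled_logreg.score(X_te, y_te)
print("test accuracy:", test_accuracy)
```

Scaling also puts the logistic regression coefficients on a comparable footing, which helps when they are interpreted later.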


In [89]:
# Handling missing values.
# Backward-fill missing values; fillna(method="bfill") is deprecated in recent pandas versions
df = data.bfill()
df.isna().sum()

TransactionID              0
CustomerID                 0
CustomerDOB                0
label                      0
CustLocation               0
CustAccountBalance         0
TransactionDate            0
TransactionTime            0
TransactionAmount (INR)    0
dtype: int64
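
Backward fill copies the value from the next row, which is only meaningful when neighbouring rows are related. As a hedged alternative, the sketch below imputes a numeric column with its median, a categorical column with its most frequent value, and keeps a missing-value indicator. The frame `toy` and the `BalanceMissing` column are invented for illustration.

```python
import numpy as np
import pandas as pd

# Small invented frame with one missing value per column
toy = pd.DataFrame({
    "CustAccountBalance": [100.0, np.nan, 300.0, 10000.0],
    "CustLocation": ["MUMBAI", "MUMBAI", None, "NASHIK"],
})

# Keep a flag so the model can still see that the balance was missing
toy["BalanceMissing"] = toy["CustAccountBalance"].isna().astype(int)

# The median is less sensitive than the mean to very large balances
toy["CustAccountBalance"] = toy["CustAccountBalance"].fillna(
    toy["CustAccountBalance"].median()
)

# Fill the categorical column with its most frequent value
toy["CustLocation"] = toy["CustLocation"].fillna(toy["CustLocation"].mode()[0])

print(toy)
```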

In [90]:
#Encoding categorical variables.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data["label"]=le.fit_transform(data["label"])
data["CustAccountBalance"]=le.fit_transform(data["CustAccountBalance"])
data

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,0,531762,5379,0,JAMSHEDPUR,47847,02-08-2016,143207,25.0
1,159679,128998,2013,1,JHAJJAR,47058,02-08-2016,141858,27999.0
2,270790,372985,14818,0,MUMBAI,47552,02-08-2016,142712,459.0
3,381901,477898,8014,0,MUMBAI,47554,02-08-2016,142714,2060.0
4,493012,879221,13330,0,NAVI MUMBAI,61036,02-08-2016,181156,1762.5
...,...,...,...,...,...,...,...,...,...
1048562,53962,766771,4373,1,NEW DELHI,63224,18-09-2016,184824,799.0
1048563,53963,598520,11070,1,NASHIK,62574,18-09-2016,183734,460.0
1048564,53964,589517,10109,1,HYDERABAD,62313,18-09-2016,183313,770.0
1048565,53965,591805,16739,1,VISAKHAPATNAM,63146,18-09-2016,184706,1000.0
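
`LabelEncoder` replaces each distinct value with an integer rank, which suits a target column but imposes an arbitrary order on a nominal feature. For a categorical feature such as `CustLocation`, one-hot encoding is a common alternative. The sketch below uses a few rows resembling the table above. On the full dataset the number of distinct locations may be large, in which case grouping rare values first would be a reasonable assumption to test.

```python
import pandas as pd

# A few illustrative rows with a nominal categorical feature
toy = pd.DataFrame({
    "CustLocation": ["MUMBAI", "JHAJJAR", "MUMBAI", "NAVI MUMBAI"],
    "TransactionAmount (INR)": [459.0, 27999.0, 2060.0, 1762.5],
})

# One indicator column per distinct location; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["CustLocation"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```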


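# Handling Imbalanced Data

# Challenges associated with imbalanced datasets in the context of fraud detection

- Model bias: the classifier can reduce its loss by favoring the majority class, so fraud cases are under-predicted.
- Limited learning: with few fraud examples, the model sees too little variety to learn what fraud looks like.
- Misleading evaluation metrics: accuracy can be high even when almost no fraud is detected, so precision, recall, and F1-score are more informative.
- Impact on decisions: predicted probabilities skew toward the majority class, so the default 0.5 threshold may flag very few transactions.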
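
To see why accuracy alone can mislead on imbalanced data, consider a hypothetical test set with 1% fraud and a "model" that never predicts fraud. The numbers below are invented for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical test set: 990 legitimate (0) and 10 fraudulent (1) transactions
y_true = [0] * 990 + [1] * 10

# A useless "model" that labels every transaction as legitimate
y_pred_all_legit = [0] * 1000

acc = accuracy_score(y_true, y_pred_all_legit)
# zero_division=0 avoids a warning when no positives are predicted
prec = precision_score(y_true, y_pred_all_legit, zero_division=0)
rec = recall_score(y_true, y_pred_all_legit, zero_division=0)

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
```

The accuracy is 99% even though not a single fraudulent transaction is caught, which is why recall and precision on the fraud class matter more here.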

In [91]:
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler  # used in the undersampling step

x = df.drop(['label', 'CustLocation'], axis=1)
y = df['label']

x_train,x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
print('Splitting is done')

Splitting is done


# Oversampling the Minority Class

- Increase the number of fraud examples in the training data so the model has more to learn from.
- With too few fraud examples, the model may not learn what fraud looks like.
- Either duplicate existing fraud examples (random oversampling) or generate new ones with a method such as SMOTE.
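
As a minimal sketch of random oversampling without extra dependencies, the minority rows can be resampled with replacement using `sklearn.utils.resample`. The arrays below are synthetic, and the names (`X_train_toy`, `X_min_up`) are hypothetical. This illustrates the idea rather than reproducing the notebook's `RandomOverSampler` call.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic training data: 10 minority (1) and 90 majority (0) rows
rng = np.random.default_rng(42)
X_train_toy = rng.normal(size=(100, 2))
y_train_toy = np.array([1] * 10 + [0] * 90)

X_min = X_train_toy[y_train_toy == 1]
X_maj = X_train_toy[y_train_toy == 0]

# Draw minority rows with replacement until they match the majority count
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print(np.bincount(y_balanced))
```

Resampling should be applied to the training split only, so the test set keeps the real class ratio.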


# Undersampling the Majority Class

- Reduce the number of non-fraud examples so the majority class does not dominate training.
- With too many non-fraud examples, the model may learn to predict every transaction as non-fraud.
- Randomly drop a subset of non-fraud examples so the classes are closer to balanced, at the cost of discarding some data.
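
Random undersampling can be sketched with plain pandas by sampling as many majority rows as there are minority rows. The frame `train` and its columns are invented for illustration.

```python
import pandas as pd

# Invented training frame: 10 minority (1) and 90 majority (0) rows
train = pd.DataFrame({"amount": range(100), "label": [1] * 10 + [0] * 90})

minority = train[train["label"] == 1]
majority = train[train["label"] == 0]

# Keep a random subset of the majority class, the same size as the minority class
majority_down = majority.sample(n=len(minority), random_state=42)

# Combine and shuffle so the classes are not grouped together
balanced = (
    pd.concat([minority, majority_down])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(balanced["label"].value_counts())
```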


# Using Synthetic Data Generation Techniques (e.g., SMOTE)

- Generate synthetic fraud examples so the model sees a broader range of fraud patterns.
- With too few real fraud examples, the model may not learn well; synthetic examples add variety.
- SMOTE creates each new example by interpolating between an existing fraud example and one of its nearest fraud neighbors, so the new points resemble real ones without duplicating them.

These strategies all aim to give the model a more balanced mix of examples during training.
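
The core idea behind SMOTE, placing a new minority point somewhere on the line between an existing minority point and one of its nearest minority neighbours, can be illustrated in a few lines of NumPy. This is a simplified sketch for intuition only and is not the `imblearn` implementation. The function name `smote_like` and the sample points are hypothetical.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Simplified SMOTE-style sampler: interpolate between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)  # random base point per sample
    pick = rng.integers(0, k, size=n_new)           # which of its k neighbours to use
    nn = neighbours[base, pick]
    gap = rng.random(size=(n_new, 1))               # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[nn] - X_min[base])

# Five invented minority-class points in two dimensions
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_syn = smote_like(X_min, n_new=10)

print(X_syn)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.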

In [94]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data=pd.read_csv('bank_transactions.csv')
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount (INR)
0,T1,C5841053,10-01-1994,F,JAMSHEDPUR,17819.05,02-08-2016,143207,25.0
1,T2,C2142763,04-04-1957,M,JHAJJAR,2270.69,02-08-2016,141858,27999.0
2,T3,C4417068,26-11-1996,F,MUMBAI,17874.44,02-08-2016,142712,459.0
3,T4,C5342380,14-09-1973,F,MUMBAI,866503.21,02-08-2016,142714,2060.0
4,T5,C9031234,24-03-1988,F,NAVI MUMBAI,6714.43,02-08-2016,181156,1762.5


In [95]:
x=data.iloc[:,-2:].values
x

array([[1.43207e+05, 2.50000e+01],
       [1.41858e+05, 2.79990e+04],
       [1.42712e+05, 4.59000e+02],
       ...,
       [1.83313e+05, 7.70000e+02],
       [1.84706e+05, 1.00000e+03],
       [1.81222e+05, 1.16600e+03]])

In [101]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data["label"]=le.fit_transform(data['label'])
#data["TransactionDate"]=le.fit_transform(data['TransactionDate'])

data.head()
y=data.iloc[:,3:4].values
y


array([[0],
       [1],
       [0],
       ...,
       [1],
       [1],
       [1]], dtype=int64)

In [52]:
# Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data

# Oversampling
oversampler = RandomOverSampler(random_state=42)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(x_train, y_train)
# Logistic regression, to match the undersampling and SMOTE models below
model_oversampled = LogisticRegression(random_state=2)
model_oversampled.fit(X_train_oversampled, y_train_oversampled)
y_pred_oversampled = model_oversampled.predict(x_test)


# Undersampling
under_sampler = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = under_sampler.fit_resample(x_train, y_train)
log_reg_under = LogisticRegression(random_state=2).fit(X_resampled_under, y_resampled_under)
y_pred_under = log_reg_under.predict(x_test)

# SMOTE
smote = SMOTE(random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(x_train, y_train)
log_reg_smote = LogisticRegression(random_state=2).fit(X_resampled_smote, y_resampled_smote)
y_pred_smote = log_reg_smote.predict(x_test)

print('Training with logistic regression model using the feature-engineered dataset')

In [None]:
# Evaluate the model's performance using appropriate evaluation metrics

# Oversampling
accuracy_oversampled = accuracy_score(y_test, y_pred_oversampled)
precision_oversampled = precision_score(y_test, y_pred_oversampled)
recall_oversampled = recall_score(y_test, y_pred_oversampled)
f1_oversampled = f1_score(y_test, y_pred_oversampled)
conf_matrix_oversampled = confusion_matrix(y_test, y_pred_oversampled)
print("***************Model's Performance after Handling Imbalance with Oversampling****************")
print(f"Accuracy: {accuracy_oversampled}")
print(f"Precision: {precision_oversampled}")
print(f"Recall: {recall_oversampled}")
print(f"F1 Score: {f1_oversampled}")
print("\nConfusion Matrix:\n", conf_matrix_oversampled)

# Undersampling
accuracy_under = accuracy_score(y_test, y_pred_under)
precision_under = precision_score(y_test, y_pred_under)
recall_under = recall_score(y_test, y_pred_under)
f1_under = f1_score(y_test, y_pred_under)
conf_matrix_under = confusion_matrix(y_test, y_pred_under)
print("\n***************Model's Performance after Handling Imbalance with Undersampling****************")
print("\nAccuracy:", accuracy_under)
print("Precision:", precision_under)
print("Recall:", recall_under)
print("F1 Score:", f1_under)
print("\nConfusion Matrix:\n", conf_matrix_under)

# SMOTE
accuracy_smote = accuracy_score(y_test, y_pred_smote)
precision_smote = precision_score(y_test, y_pred_smote)
recall_smote = recall_score(y_test, y_pred_smote)
f1_smote = f1_score(y_test, y_pred_smote)
conf_matrix_smote = confusion_matrix(y_test, y_pred_smote)
print("\n***************Model's Performance after Handling Imbalance with SMOTE****************")
print("\nAccuracy:", accuracy_smote)
print("Precision:", precision_smote)
print("Recall:", recall_smote)
print("F1 Score:", f1_smote)
print("\nConfusion Matrix:\n", conf_matrix_smote)

In [106]:
# 6. Model interpretation
# Interpret the coefficients of the logistic regression model
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(random_state=10)
logreg_model.fit(xtrain, ytrain)

In [None]:
logreg_model.classes_

In [None]:
logreg_model.coef_

In [None]:
# Discussing the features that have the most influence on fraud detection

# Get the coefficients and feature names
coefficients = logreg_model.coef_[0]
# Assumes x is the feature DataFrame that xtrain was split from
feature_names = x.columns

# Create a DataFrame to display the coefficients and feature names
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the DataFrame by absolute coefficient values for better interpretation
coefficients_df['Absolute_Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values(by='Absolute_Coefficient', ascending=False)

# Display the sorted coefficients
print(coefficients_df)
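
Raw coefficients are only comparable across features when the features share a scale, and they are often easier to read as odds ratios. The sketch below standardizes two synthetic features, fits a logistic regression, and reports `exp(coefficient)` per feature. The data, the feature names `amount` and `hour`, and the label rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features (illustrative values only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "amount": rng.exponential(1500.0, 1000),
    "hour": rng.integers(0, 24, 1000).astype(float),
})
# Synthetic label driven by amount only, plus noise
toy_label = (toy["amount"] + rng.normal(0, 300, 1000) > 2500).astype(int)

# Standardize so each coefficient refers to a one-standard-deviation change
X_scaled = StandardScaler().fit_transform(toy)
model = LogisticRegression(random_state=10).fit(X_scaled, toy_label)

summary = pd.DataFrame({
    "Feature": toy.columns,
    "Coefficient": model.coef_[0],
    "OddsRatio": np.exp(model.coef_[0]),  # multiplicative change in the odds
})
summary["AbsCoefficient"] = summary["Coefficient"].abs()
summary = summary.sort_values("AbsCoefficient", ascending=False)

print(summary)
```

An odds ratio above 1 means the feature raises the odds of the positive class as it increases, and a value near 1 means it has little effect.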

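b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

The model outputs a probability of fraud for each transaction, and a transaction is flagged when that probability reaches a chosen threshold (`predict` uses 0.5). Choose the threshold and model parameters according to the relative cost of false positives (legitimate transactions blocked or sent for review) and false negatives (fraud that goes undetected).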
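
The effect of the decision threshold can be sketched by applying different cutoffs to `predict_proba` and comparing precision and recall. The data below is synthetic, and the metrics are computed on the training data for brevity; in practice a held-out set would be used. All names (`clf`, `fraud_prob`, `results`) are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic one-feature problem with a noisy positive class
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(2000, 1))
y_toy = (X_toy[:, 0] + rng.normal(0, 1.0, 2000) > 1.5).astype(int)

clf = LogisticRegression(random_state=2).fit(X_toy, y_toy)
fraud_prob = clf.predict_proba(X_toy)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.5, 0.3, 0.1):
    flagged = (fraud_prob >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_toy, flagged, zero_division=0),
        recall_score(y_toy, flagged, zero_division=0),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold flags more transactions, so recall cannot decrease, while precision typically falls. The right trade-off depends on how costly a missed fraud is compared with a false alarm.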

In [None]:
#7. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model

print("***************Initial Model's Performance****************")
print("\nAccuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:\n", conf_matrix)

print("\n***************Balanced Data Model's Performance****************")
print("\nAccuracy:", accuracy_under)
print("Precision:", precision_under)
print("Recall:", recall_under)
print("F1 Score:", f1_under)
print("\nConfusion Matrix:\n", conf_matrix_under)
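
Printing each model's metrics separately makes them hard to compare side by side. One option is a small helper that collects the metrics into a single table. The helper name `metric_row` and the labels and predictions below are invented for illustration and are not the notebook's results.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def metric_row(name, y_true, y_pred):
    """Return one row of binary-classification metrics for a model."""
    return {
        "Model": name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred, zero_division=0),
        "F1": f1_score(y_true, y_pred, zero_division=0),
    }

# Invented labels and predictions for two hypothetical models
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
pred_a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # conservative: rarely predicts the positive class
pred_b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]  # aggressive: predicts the positive class more often

comparison = pd.DataFrame(
    [metric_row("model_a", y_true, pred_a), metric_row("model_b", y_true, pred_b)]
).set_index("Model")

print(comparison.round(3))
```

In this toy case both models have the same accuracy but very different recall, which is the kind of difference a side-by-side table makes visible.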

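b. Discuss the advantages and limitations of each approach.

The balanced data model has higher accuracy, precision, recall, and F1 score than the initial model, indicating an improvement in overall performance. This is consistent with what is expected when class imbalance is addressed. The initial model is simpler to build and uses all of the training data, but on imbalanced data its accuracy can hide poor recall on the minority class. The resampling methods have limitations of their own: undersampling discards majority-class examples, random oversampling repeats minority examples and can overfit to them, and SMOTE adds synthetic points that may not correspond to real fraud patterns.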