# Fraud Detection with Logistic Regression and Feature Engineering

You are a data scientist at a financial institution, tasked with developing a fraud detection model using logistic regression. The dataset is highly imbalanced: only a small fraction of transactions are fraudulent. Your objective is to build an effective model by combining logistic regression with feature engineering techniques, following the steps below:

# 1. Data Preparation:

   a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).
   
   b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.





In [3]:
import pandas as pd

# Load the dataset
data = pd.read_csv("fraud.csv")
data

Unnamed: 0,TransactionID,Amount,Time,Type,Location,CardHolder,IsFraud
0,1,120.75,1,Debit,Local,John Doe,0.0
1,2,50.00,2,Credit,International,Jane Smith,0.0
2,3,200.00,3,Debit,Local,Bob Johnson,0.0
3,4,30.25,4,Debit,Local,Alice Williams,0.0
4,5,500.50,5,Credit,International,Charlie Brown,1.0
...,...,...,...,...,...,...,...
100,101,120.75,101,Debit,Local,Aiden Wilson,0.0
101,102,50.00,102,Credit,International,Mia Turner,0.0
102,103,200.00,103,Debit,Local,Ella Harris,0.0
103,104,30.25,104,Debit,Local,Lucas Davis,0.0


In [4]:
# Display the first few rows to get an overview of the available features, including
# transaction details, customer information, and labels (fraudulent or non-fraudulent)

data.head()

Unnamed: 0,TransactionID,Amount,Time,Type,Location,CardHolder,IsFraud
0,1,120.75,1,Debit,Local,John Doe,0.0
1,2,50.0,2,Credit,International,Jane Smith,0.0
2,3,200.0,3,Debit,Local,Bob Johnson,0.0
3,4,30.25,4,Debit,Local,Alice Williams,0.0
4,5,500.5,5,Credit,International,Charlie Brown,1.0


In [5]:
# Get information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105 entries, 0 to 104
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   TransactionID  105 non-null    int64  
 1   Amount         105 non-null    float64
 2   Time           105 non-null    int64  
 3   Type           105 non-null    object 
 4   Location       105 non-null    object 
 5   CardHolder     105 non-null    object 
 6   IsFraud        104 non-null    float64
dtypes: float64(2), int64(2), object(3)
memory usage: 5.9+ KB


In [6]:
data.describe()

Unnamed: 0,TransactionID,Amount,Time,IsFraud
count,105.0,105.0,105.0,104.0
mean,53.0,156.766667,53.0,0.144231
std,30.454885,133.783326,30.454885,0.353025
min,1.0,25.75,1.0,0.0
25%,27.0,55.2,27.0,0.0
50%,53.0,120.75,53.0,0.0
75%,79.0,200.0,79.0,0.0
max,105.0,500.5,105.0,1.0


In [7]:
# Describe the class distribution of fraudulent and non-fraudulent transactions

# Count the labelled transactions in each class (value_counts() skips the missing label)
class_distribution = data['IsFraud'].value_counts()

# Calculate class percentages over labelled rows only, so the two shares sum to 100%
labelled_total = class_distribution.sum()
fraudulent_percentage = (class_distribution[1] / labelled_total) * 100
non_fraudulent_percentage = (class_distribution[0] / labelled_total) * 100

# Display the class distribution
print('Class Distribution:\n',class_distribution)
print(f"Percentage of Fraudulent Transactions: {fraudulent_percentage:.2f}%")
print(f"Percentage of Non-Fraudulent Transactions: {non_fraudulent_percentage:.2f}%")

Class Distribution:
 0.0    89
1.0    15
Name: IsFraud, dtype: int64
Percentage of Fraudulent Transactions: 14.42%
Percentage of Non-Fraudulent Transactions: 85.58%
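
Why this imbalance matters can be illustrated with a trivial baseline. The sketch below uses only the class counts printed above (89 non-fraudulent, 15 fraudulent) and a hypothetical classifier that labels every transaction as "not fraud". It is an illustration and not part of the original analysis.

```python
# Class counts taken from the value_counts() output above.
n_legit, n_fraud = 89, 15

# Hypothetical baseline: predict "not fraud" (0) for every transaction.
y_true = [0] * n_legit + [1] * n_fraud
y_pred = [0] * len(y_true)

# Accuracy: share of transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: frauds caught / all frauds.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = caught / n_fraud

print(f"Baseline accuracy: {accuracy:.2%}")
print(f"Baseline fraud recall: {fraud_recall:.2%}")
```

A model that never flags fraud still scores about 86% accuracy here, which is why precision, recall, and F1 on the fraud class are more informative than accuracy alone.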


# 2. Initial Logistic Regression Model:

   a. Implement a basic logistic regression model using the raw dataset. 
   
   
   b. Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score.

In [8]:
data.isna().sum()

TransactionID    0
Amount           0
Time             0
Type             0
Location         0
CardHolder       0
IsFraud          1
dtype: int64

In [9]:
# Backfill the single missing IsFraud label (fillna(method='bfill') is deprecated in pandas 2.x)
df = data.bfill()
df.isna().sum()

TransactionID    0
Amount           0
Time             0
Type             0
Location         0
CardHolder       0
IsFraud          0
dtype: int64
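
Backfilling copies the next row's label into the gap, which can assign a class that was never observed for that transaction. A possible alternative, sketched below on a small stand-in frame (`toy` and its values are hypothetical, not taken from `fraud.csv`), is to drop rows whose target is missing.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; the notebook itself would use `data` instead.
toy = pd.DataFrame({
    "Amount": [120.75, 50.00, 200.00, 30.25],
    "IsFraud": [0.0, np.nan, 1.0, 0.0],
})

# Backfill: the missing label inherits the value of the following row.
backfilled = toy.bfill()

# Drop: keep only rows whose label was actually observed.
dropped = toy.dropna(subset=["IsFraud"]).reset_index(drop=True)

print(backfilled["IsFraud"].tolist())
print(dropped["IsFraud"].tolist())
```

In this toy case the backfilled row is marked as fraud purely because its neighbour was. With only 15 fraud cases in the real data, one such relabelling is not negligible.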

In [10]:
df.head()

Unnamed: 0,TransactionID,Amount,Time,Type,Location,CardHolder,IsFraud
0,1,120.75,1,Debit,Local,John Doe,0.0
1,2,50.0,2,Credit,International,Jane Smith,0.0
2,3,200.0,3,Debit,Local,Bob Johnson,0.0
3,4,30.25,4,Debit,Local,Alice Williams,0.0
4,5,500.5,5,Credit,International,Charlie Brown,1.0


In [11]:
# Use the first three (numeric) columns as the raw feature set
x = df[['TransactionID', 'Amount', 'Time']]
x

Unnamed: 0,TransactionID,Amount,Time
0,1,120.75,1
1,2,50.00,2
2,3,200.00,3
3,4,30.25,4
4,5,500.50,5
...,...,...,...
100,101,120.75,101
101,102,50.00,102
102,103,200.00,103
103,104,30.25,104


In [12]:
y = df.iloc[:,-1:]
y

Unnamed: 0,IsFraud
0,0.0
1,0.0
2,0.0
3,0.0
4,1.0
...,...
100,0.0
101,0.0
102,0.0
103,0.0


In [13]:
# Split the data into training and testing sets

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2, random_state = 2)
print('Training data-X-shape:\t',xtrain.shape)
print()
print('Training data-Y-shape:\t',ytrain.shape)
print()
print('Testing data(X-input) shape :\t',xtest.shape)
print()
print('Testing data(Y-input) shape :\t',ytest.shape)

Training data-X-shape:	 (84, 3)

Training data-Y-shape:	 (84, 1)

Testing data(X-input) shape :	 (21, 3)

Testing data(Y-input) shape :	 (21, 1)


In [56]:
# Implement a basic logistic regression model using the raw dataset

from sklearn.linear_model import LogisticRegression

# train the model
log_reg = LogisticRegression(solver = 'liblinear', verbose = 2)
print('Training the model\n')
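# ytrain is a single-column DataFrame; flatten it to the 1-D shape scikit-learn expects
log_reg.fit(xtrain, ytrain.values.ravel())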

# test the model
ypred = log_reg.predict(xtest)
print('Predicted label for the input samples:\n', ypred)
print()
print('Testing is completed\n')
print('Testing samples are: \t', len(ypred))

Training the model

[LibLinear]Predicted label for the input samples:
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0.]

Testing is completed

Testing samples are: 	 21



In [15]:
# Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

accuracy = accuracy_score(ytest, ypred)
precision = precision_score(ytest, ypred)
recall = recall_score(ytest, ypred)
f1 = f1_score(ytest, ypred)
conf_matrix = confusion_matrix(ytest, ypred)

# Display the evaluation metrics
print("***************Model's Performance****************")
print("\nAccuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:\n", conf_matrix)

***************Model's Performance****************

Accuracy: 0.8571428571428571
Precision: 0.6666666666666666
Recall: 0.5
F1 Score: 0.5714285714285715

Confusion Matrix:
 [[16  1]
 [ 2  2]]
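
The reported metrics follow directly from the confusion matrix. As a worked check, the sketch below recomputes them from the four counts above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
conf_matrix = [[16, 1],
               [2, 2]]

(tn, fp), (fn, tp) = conf_matrix

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With only four fraud cases in the test set, each missed or caught fraud moves recall by 25 percentage points, so these figures should be read as rough indications.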


# 3. Feature Engineering:

   a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:

      - Creating new features.
      - Scaling or normalizing features.
      - Handling missing values.
      - Encoding categorical variables.


   b. Explain why each feature engineering technique is relevant for fraud detection.


In [16]:
# Creating new features
# Derive the hour of day, assuming Time counts hours elapsed since the first transaction
data['HourOfDay'] = data['Time'] % 24
data

Unnamed: 0,TransactionID,Amount,Time,Type,Location,CardHolder,IsFraud,HourOfDay
0,1,120.75,1,Debit,Local,John Doe,0.0,1
1,2,50.00,2,Credit,International,Jane Smith,0.0,2
2,3,200.00,3,Debit,Local,Bob Johnson,0.0,3
3,4,30.25,4,Debit,Local,Alice Williams,0.0,4
4,5,500.50,5,Credit,International,Charlie Brown,1.0,5
...,...,...,...,...,...,...,...,...
100,101,120.75,101,Debit,Local,Aiden Wilson,0.0,5
101,102,50.00,102,Credit,International,Mia Turner,0.0,6
102,103,200.00,103,Debit,Local,Ella Harris,0.0,7
103,104,30.25,104,Debit,Local,Lucas Davis,0.0,8
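
One limitation of a raw hour value is that hour 23 and hour 0 look far apart to a linear model even though they are adjacent on the clock. A common remedy, sketched here as an optional extension rather than something the notebook does, is to encode the hour as a point on the unit circle. The helper names `encode_hour` and `distance` are hypothetical.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded hours."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

late_night = encode_hour(23)
midnight = encode_hour(0)
noon = encode_hour(12)

# Adjacent hours end up close together; opposite hours end up far apart.
print(f"23h vs 0h:  {distance(late_night, midnight):.3f}")
print(f"0h  vs 12h: {distance(midnight, noon):.3f}")
```

In the notebook this would amount to adding two columns (for example `HourSin` and `HourCos`, both hypothetical names) alongside or instead of `HourOfDay`.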


In [17]:
# Scaling or normalizing features

from sklearn.preprocessing import StandardScaler

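sc = StandardScaler()
# Note: the scaler is fit on the full dataset, so test-set statistics leak into training;
# fitting it on the training split only would avoid this
data['AmountScaled'] = sc.fit_transform(data[['Amount']])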
data

Unnamed: 0,TransactionID,Amount,Time,Type,Location,CardHolder,IsFraud,HourOfDay,AmountScaled
0,1,120.75,1,Debit,Local,John Doe,0.0,1,-0.270508
1,2,50.00,2,Credit,International,Jane Smith,0.0,2,-0.801884
2,3,200.00,3,Debit,Local,Bob Johnson,0.0,3,0.324709
3,4,30.25,4,Debit,Local,Alice Williams,0.0,4,-0.950219
4,5,500.50,5,Credit,International,Charlie Brown,1.0,5,2.581652
...,...,...,...,...,...,...,...,...,...
100,101,120.75,101,Debit,Local,Aiden Wilson,0.0,5,-0.270508
101,102,50.00,102,Credit,International,Mia Turner,0.0,6,-0.801884
102,103,200.00,103,Debit,Local,Ella Harris,0.0,7,0.324709
103,104,30.25,104,Debit,Local,Lucas Davis,0.0,8,-0.950219


In [18]:
# Handling missing values
# fillna(method='bfill') is deprecated in pandas 2.x; bfill() is the equivalent call
df = data.bfill()
df.isna().sum()

TransactionID    0
Amount           0
Time             0
Type             0
Location         0
CardHolder       0
IsFraud          0
HourOfDay        0
AmountScaled     0
dtype: int64

In [19]:
# Encoding categorical variables

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Type'] = le.fit_transform(df['Type'])
df['Location'] = le.fit_transform(df['Location'])
df

Unnamed: 0,TransactionID,Amount,Time,Type,Location,CardHolder,IsFraud,HourOfDay,AmountScaled
0,1,120.75,1,1,1,John Doe,0.0,1,-0.270508
1,2,50.00,2,0,0,Jane Smith,0.0,2,-0.801884
2,3,200.00,3,1,1,Bob Johnson,0.0,3,0.324709
3,4,30.25,4,1,1,Alice Williams,0.0,4,-0.950219
4,5,500.50,5,0,0,Charlie Brown,1.0,5,2.581652
...,...,...,...,...,...,...,...,...,...
100,101,120.75,101,1,1,Aiden Wilson,0.0,5,-0.270508
101,102,50.00,102,0,0,Mia Turner,0.0,6,-0.801884
102,103,200.00,103,1,1,Ella Harris,0.0,7,0.324709
103,104,30.25,104,1,1,Lucas Davis,0.0,8,-0.950219
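
The scaling and encoding steps above can also be bundled with the model so that they are learned from the training data only. The following is a minimal sketch under stated assumptions: the `toy` frame and its values are made up for illustration, and one-hot encoding is swapped in for the label encoding used above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small stand-in for the notebook's data (illustrative values only).
toy = pd.DataFrame({
    "Amount": [120.75, 50.0, 200.0, 30.25, 500.5, 75.0, 410.0, 60.0],
    "Type": ["Debit", "Credit", "Debit", "Debit",
             "Credit", "Debit", "Credit", "Debit"],
    "Location": ["Local", "International", "Local", "Local",
                 "International", "Local", "International", "Local"],
    "IsFraud": [0, 0, 0, 0, 1, 0, 1, 0],
})
X_toy = toy.drop(columns="IsFraud")
y_toy = toy["IsFraud"]

# Scale the numeric column and one-hot encode the categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type", "Location"]),
])

# Chaining preprocessing and model means fit() learns both from the same rows.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

pipe.fit(X_toy, y_toy)
proba = pipe.predict_proba(X_toy)[:, 1]  # estimated fraud probability per row
print(proba.round(3))
```

In practice `pipe.fit` would be called on the training split and `pipe.predict` on the test split, so the scaler's mean and variance never see test rows.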


# 4. Handling Imbalanced Data:

   a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.

   b. Implement strategies to address class imbalance, such as:
   
      - Oversampling the minority class.
      - Undersampling the majority class.
      - Using synthetic data generation techniques (e.g., SMOTE).
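
To make the first strategy concrete, here is a minimal NumPy sketch of random oversampling. It is a simplified stand-in for what a resampling library does, and the function name `random_oversample` and the toy arrays are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate rows of the smaller classes at random until all classes match the largest."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # Sample (with replacement) as many extra rows as this class is missing.
        extra = rng.choice(idx, size=target - count, replace=True)
        selected.append(np.concatenate([idx, extra]))

    selected = np.concatenate(selected)
    return X[selected], y[selected]

# Toy training set: 8 legitimate rows, 2 fraudulent rows.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Resampling should be applied to the training split only. Resampling before the split would place copies of the same fraud rows in both training and test data and inflate the test metrics.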

In [41]:
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler, SMOTE
# RandomUnderSampler is used in the undersampling step below
from imblearn.under_sampling import RandomUnderSampler

X = df.drop(['IsFraud', 'CardHolder'], axis=1)
y = df['IsFraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Splitting is done')

Splitting is done


# 5. Logistic Regression with Feature-Engineered Data:

a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.

b. Evaluate the model's performance using appropriate evaluation metrics.

In [42]:
# Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data

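# Oversampling
# Note: this branch fits a RandomForestClassifier, not logistic regression, so its
# metrics are not directly comparable with the two logistic regression models below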
oversampler = RandomOverSampler(random_state=42)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(X_train, y_train)
model_oversampled = RandomForestClassifier(random_state=42)
model_oversampled.fit(X_train_oversampled, y_train_oversampled)
y_pred_oversampled = model_oversampled.predict(X_test)


# Undersampling
under_sampler = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = under_sampler.fit_resample(X_train, y_train)
log_reg_under = LogisticRegression(random_state=2).fit(X_resampled_under, y_resampled_under)
y_pred_under = log_reg_under.predict(X_test)

# SMOTE
smote = SMOTE(random_state=42)
X_resampled_smote, y_resampled_smote = smote.fit_resample(X_train, y_train)
log_reg_smote = LogisticRegression(random_state=2).fit(X_resampled_smote, y_resampled_smote)
y_pred_smote = log_reg_smote.predict(X_test)

print('Training with logistic regression model using the feature-engineered dataset')

Training with logistic regression model using the feature-engineered dataset


In [43]:
# Evaluate the model's performance using appropriate evaluation metrics

# Oversampling
accuracy_oversampled = accuracy_score(y_test, y_pred_oversampled)
precision_oversampled = precision_score(y_test, y_pred_oversampled)
recall_oversampled = recall_score(y_test, y_pred_oversampled)
f1_oversampled = f1_score(y_test, y_pred_oversampled)
conf_matrix_oversampled = confusion_matrix(y_test, y_pred_oversampled)
print("***************Model's Performance after Handling Imbalance with Oversampling****************")
print(f"Accuracy: {accuracy_oversampled}")
print(f"Precision: {precision_oversampled}")
print(f"Recall: {recall_oversampled}")
print(f"F1 Score: {f1_oversampled}")
print("\nConfusion Matrix:\n", conf_matrix_oversampled)

# Undersampling
accuracy_under = accuracy_score(y_test, y_pred_under)
precision_under = precision_score(y_test, y_pred_under)
recall_under = recall_score(y_test, y_pred_under)
f1_under = f1_score(y_test, y_pred_under)
conf_matrix_under = confusion_matrix(y_test, y_pred_under)
print("\n***************Model's Performance after Handling Imbalance with Undersampling****************")
print("\nAccuracy:", accuracy_under)
print("Precision:", precision_under)
print("Recall:", recall_under)
print("F1 Score:", f1_under)
print("\nConfusion Matrix:\n", conf_matrix_under)

# SMOTE
accuracy_smote = accuracy_score(y_test, y_pred_smote)
precision_smote = precision_score(y_test, y_pred_smote)
recall_smote = recall_score(y_test, y_pred_smote)
f1_smote = f1_score(y_test, y_pred_smote)
conf_matrix_smote = confusion_matrix(y_test, y_pred_smote)
print("\n***************Model's Performance after Handling Imbalance with SMOTE****************")
print("\nAccuracy:", accuracy_smote)
print("Precision:", precision_smote)
print("Recall:", recall_smote)
print("F1 Score:", f1_smote)
print("\nConfusion Matrix:\n", conf_matrix_smote)

***************Model's Performance after Handling Imbalance with Oversampling****************
Accuracy: 0.8571428571428571
Precision: 1.0
Recall: 0.4
F1 Score: 0.5714285714285715

Confusion Matrix:
 [[16  0]
 [ 3  2]]

***************Model's Performance after Handling Imbalance with Undersampling****************

Accuracy: 0.9047619047619048
Precision: 0.8
Recall: 0.8
F1 Score: 0.8000000000000002

Confusion Matrix:
 [[15  1]
 [ 1  4]]

***************Model's Performance after Handling Imbalance with SMOTE****************

Accuracy: 0.7619047619047619
Precision: 0.5
Recall: 0.6
F1 Score: 0.5454545454545454

Confusion Matrix:
 [[13  3]
 [ 2  3]]
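
The three runs above trade precision against recall in different ways. Another lever for the same trade-off is the decision threshold applied to the predicted fraud probability. The sketch below uses made-up labels and scores (they are not outputs of the models above), and the helper `precision_recall_at` is hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are flagged as fraud."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative labels and predicted fraud probabilities (hypothetical values).
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.15, 0.90]

for threshold in (0.5, 0.3):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Lowering the threshold catches more fraud at the cost of more false alarms. With a fitted scikit-learn model, the scores would come from `predict_proba(X_test)[:, 1]`.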


# 6. Model Interpretation:

   a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.

   b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

In [53]:
# Interpret the coefficients of the logistic regression model
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(random_state=10)
logreg_model.fit(X_train, y_train)

In [50]:
logreg_model.classes_

array([0, 1], dtype=int64)

In [51]:
logreg_model.coef_

array([[-1.51220792e-02,  9.53253155e-03, -1.51220792e-02,
        -1.06399583e+00, -1.06399583e+00, -4.78122962e-02,
         9.41109899e-05]])

In [52]:
# Discussing the features that have the most influence on fraud detection

# Get the coefficients and feature names
coefficients = logreg_model.coef_[0]
feature_names = X.columns

# Create a DataFrame to display the coefficients and feature names
coefficients_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the DataFrame by absolute coefficient values for better interpretation
coefficients_df['Absolute_Coefficient'] = coefficients_df['Coefficient'].abs()
coefficients_df = coefficients_df.sort_values(by='Absolute_Coefficient', ascending=False)

# Display the sorted coefficients
print(coefficients_df)

         Feature  Coefficient  Absolute_Coefficient
3           Type    -1.063996              1.063996
4       Location    -1.063996              1.063996
5      HourOfDay    -0.047812              0.047812
0  TransactionID    -0.015122              0.015122
2           Time    -0.015122              0.015122
1         Amount     0.009533              0.009533
6   AmountScaled     0.000094              0.000094
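
A logistic regression coefficient is easier to read as an odds ratio: exp(coefficient) is the factor by which the odds of fraud change for a one-unit increase in that feature, holding the others fixed. The sketch below converts the coefficients from the table above. Note that the features are on different scales, so raw magnitudes are not directly comparable. Type and Location receive identical coefficients, as do TransactionID and Time, which is consistent with each pair carrying the same information in the rows shown.

```python
import math

# Coefficients copied from the sorted table above.
coefficients = {
    "Type": -1.063996,
    "Location": -1.063996,
    "HourOfDay": -0.047812,
    "TransactionID": -0.015122,
    "Time": -0.015122,
    "Amount": 0.009533,
    "AmountScaled": 0.000094,
}

# exp(coefficient) = multiplicative change in the odds of fraud per unit increase.
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:>14}: {ratio:.3f}")
```

An odds ratio below 1 lowers the odds of fraud and one above 1 raises them. Under the encoding shown earlier (Debit and Local map to 1), the model associates Debit/Local transactions with lower fraud odds than Credit/International ones.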


# 7. Model Comparison:

   a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.

   b. Discuss the advantages and limitations of each approach.

In [68]:
# Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model

print("***************Initial Model's Performance****************")
print("\nAccuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("\nConfusion Matrix:\n", conf_matrix)

print("\n***************Balanced Data Model's Performance****************")
print("\nAccuracy:", accuracy_under)
print("Precision:", precision_under)
print("Recall:", recall_under)
print("F1 Score:", f1_under)
print("\nConfusion Matrix:\n", conf_matrix_under)

***************Initial Model's Performance****************

Accuracy: 0.8571428571428571
Precision: 0.6666666666666666
Recall: 0.5
F1 Score: 0.5714285714285715

Confusion Matrix:
 [[16  1]
 [ 2  2]]

***************Balanced Data Model's Performance****************

Accuracy: 0.9047619047619048
Precision: 0.8
Recall: 0.8
F1 Score: 0.8000000000000002

Confusion Matrix:
 [[15  1]
 [ 1  4]]
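
One caveat when reading this comparison: the two models were evaluated on different test splits (`random_state=2` versus `random_state=42`), and the confusion matrices show 4 and 5 fraud cases respectively. A like-for-like comparison would reuse one split, ideally a stratified one. The sketch below shows stratification on stand-in labels with the class counts reported earlier; `X_all` is a placeholder feature, not real data.

```python
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts reported earlier (89 vs 15).
y_all = [0] * 89 + [1] * 15
X_all = [[i] for i in range(len(y_all))]  # placeholder feature column

# stratify=y_all keeps the fraud share roughly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

overall_rate = sum(y_all) / len(y_all)
train_rate = sum(y_tr) / len(y_tr)
test_rate = sum(y_te) / len(y_te)

print(f"overall fraud rate: {overall_rate:.3f}")
print(f"train fraud rate:   {train_rate:.3f}")
print(f"test fraud rate:    {test_rate:.3f}")
```

Fitting every candidate model on the same `X_tr` and scoring it on the same `X_te` would make differences in the metrics attributable to the modelling choices and not to the split.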


# 8. Presentation and Recommendations:

   a. Prepare a presentation or report summarizing your analysis, results, and recommendations for the financial institution. Highlight the importance of feature engineering and handling imbalanced data in building an effective fraud detection system.
   
   
   In this case study, you are required to showcase your ability to preprocess data, implement logistic regression, apply feature engineering techniques, and address class imbalance to improve the model's performance. Your analysis should also demonstrate your understanding of the nuances of fraud detection in a financial context.