# Fraud Detection with Logistic Regression and Feature Engineering

Day 3

You are a data scientist at a financial institution, and your primary task is to develop a fraud detection model using logistic regression. The dataset is highly imbalanced: only a small fraction of transactions are fraudulent. Your objective is to build an effective model with logistic regression and to apply feature engineering techniques that improve its performance:

1. Data Preparation:

a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).

b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.

2. Initial Logistic Regression Model:

a. Implement a basic logistic regression model using the raw dataset.

b. Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score.

3. Feature Engineering:

a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:

- Creating new features.

- Scaling or normalizing features.

- Handling missing values.

- Encoding categorical variables.

b. Explain why each feature engineering technique is relevant for fraud detection.

4. Handling Imbalanced Data:

a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.

b. Implement strategies to address class imbalance, such as:

- Oversampling the minority class.

- Undersampling the majority class.

- Using synthetic data generation techniques (e.g., SMOTE).

5. Logistic Regression with Feature-Engineered Data:

a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.

b. Evaluate the model's performance using appropriate evaluation metrics.

6. Model Interpretation:

a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.

b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

7. Model Comparison:

a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.

b. Discuss the advantages and limitations of each approach.

8. Presentation and Recommendations:

a. Prepare a presentation or report summarizing your analysis, results, and recommendations for the financial institution. Highlight the importance of feature engineering and handling imbalanced data in building an effective fraud detection system.

In this case study, you are required to showcase your ability to preprocess data, implement logistic regression, apply feature engineering techniques, and address class imbalance to improve the model's performance. Your analysis should also demonstrate your understanding of the nuances of fraud detection in a financial context.

# Q1

In [34]:
import pandas as pd

# Load the dataset (replace the path below with the location of creditcard.csv on your machine)
data = pd.read_csv("C://Users//TmC//Downloads//archive//creditcard.csv")

# Display the first few rows of the dataset to get an overview of the features
print(data.head())

# Check the column names and data types
print(data.info())

   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.137458  0.141267 -0.206010   

        V26       V27       V28 
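
Task 1b asks for the class distribution, which `head()` and `info()` do not show. A minimal sketch of how it could be summarised follows; `class_distribution` and `toy_labels` are hypothetical names, and the toy labels stand in for `data['Class']` (assuming 0 marks a legitimate transaction and 1 a fraudulent one).

```python
import pandas as pd


def class_distribution(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and percentage share for a label column."""
    counts = labels.value_counts().sort_index()
    share = (counts / counts.sum() * 100).round(3)
    return pd.DataFrame({"count": counts, "percent": share})


# Toy stand-in for data['Class']; the values are illustrative, not from the real file.
toy_labels = pd.Series([0] * 998 + [1] * 2, name="Class")
summary = class_distribution(toy_labels)
print(summary)
```

On the real data the call would be `class_distribution(data['Class'])`. A very small percentage for class 1 is the imbalance that the later steps have to deal with.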

# Q2

In [35]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset (replace 'creditcard.csv' with the actual file path)
data = pd.read_csv('creditcard.csv')

# Define your features (independent variables) and target (dependent variable)
X = data.drop('Class', axis=1)  # Features
y = data['Class']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print("Model Performance Metrics:")
print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 Score: {:.2f}".format(f1))

print("\nConfusion Matrix:")
print(conf_matrix)

Model Performance Metrics:
Accuracy: 1.00
Precision: 0.62
Recall: 0.55
F1 Score: 0.58

Confusion Matrix:
[[56831    33]
 [   44    54]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
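
The warning above suggests two remedies: scale the data or raise `max_iter`. The sketch below applies both in a scikit-learn pipeline and adds a stratified split so the fraud rate is the same in train and test. It runs on synthetic data from `make_classification` rather than `creditcard.csv`, so the iteration count it prints says nothing about the real dataset; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit card data (a few percent positives).
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=42,
)
# Give one column a much larger scale, similar to an unscaled amount or timestamp.
X_demo[:, 0] = X_demo[:, 0] * 10_000

# stratify=y_demo keeps the positive rate the same in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so the scaler is fitted on training data only.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)

n_iter = clf.named_steps["logisticregression"].n_iter_[0]
print("solver iterations used:", n_iter)
```

Keeping the scaler inside the pipeline also means `clf.predict(X_te)` applies the training-set mean and variance automatically, which avoids leaking test statistics into the model.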


# Q3

In [36]:
from sklearn.preprocessing import StandardScaler

# Standardize numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [37]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

In [52]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the scaler on the training data
X_train['Amount'] = scaler.fit_transform(X_train[['Amount']])

# Transform the test data using the fitted scaler
X_test['Amount'] = scaler.transform(X_test[['Amount']])
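
The cells above cover scaling and imputation. Task 3 also lists creating new features and encoding categorical variables, which could be sketched as follows. Several things here are assumptions: `Time` is treated as seconds elapsed since the first transaction, `Channel` is a hypothetical categorical column that does not exist in `creditcard.csv`, and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add two illustrative features without modifying the input frame."""
    out = df.copy()
    # Assumes 'Time' is seconds elapsed since the first transaction.
    out["Hour"] = (out["Time"] // 3600) % 24
    # log1p compresses the long right tail typical of transaction amounts.
    out["LogAmount"] = np.log1p(out["Amount"])
    return out


# Toy rows; 'Channel' is a hypothetical column used only to show encoding.
toy = pd.DataFrame({
    "Time": [0.0, 3600.0, 90000.0, 7200.0],
    "Amount": [149.62, np.nan, 2.69, 378.66],
    "Channel": ["web", "pos", "web", np.nan],
})
toy = add_derived_features(toy)

numeric_cols = ["Amount", "LogAmount", "Hour"]
categorical_cols = ["Channel"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Categorical columns: fill gaps with the mode, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

In a real workflow `preprocess` would be fitted on the training split only and then placed in front of `LogisticRegression` in a single pipeline.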


# Q5

In [41]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# Reload the dataset (replace with your actual file path)
# Note: this is the raw file, so the scaling and imputation from Q3 are not applied to the model below
data = pd.read_csv("C://Users//TmC//Downloads//archive//creditcard.csv")

# Define your features (independent variables) and target (dependent variable)
X = data.drop('Class', axis=1)  # Features
y = data['Class']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Classification report
class_report = classification_report(y_test, y_pred)

# Print the results
print("Model Performance Metrics:")
print("Accuracy: {:.2f}".format(accuracy))
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 Score: {:.2f}".format(f1))

print("\nConfusion Matrix:")
print(conf_matrix)

print("\nClassification Report:")
print(class_report)

Model Performance Metrics:
Accuracy: 1.00
Precision: 0.62
Recall: 0.55
F1 Score: 0.58

Confusion Matrix:
[[56831    33]
 [   44    54]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.62      0.55      0.58        98

    accuracy                           1.00     56962
   macro avg       0.81      0.78      0.79     56962
weighted avg       1.00      1.00      1.00     56962



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
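
The reported accuracy of 1.00 is a rounding effect, and it is worth recomputing the metrics by hand from the confusion matrix printed above. The short sketch below does that, assuming the usual scikit-learn layout (rows are actual classes, columns are predicted classes), and compares the result with a trivial rule that labels every transaction as non-fraudulent.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted.
tn, fp = 56831, 33
fn, tp = 44, 54

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a rule that predicts "not fraud" for every transaction.
baseline_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"baseline accuracy = {baseline_accuracy:.4f}")
print(f"precision         = {precision:.4f}")
print(f"recall            = {recall:.4f}")
print(f"f1                = {f1:.4f}")
```

At four decimals the model's accuracy is about 0.9986 and the always-negative baseline is about 0.9983, so accuracy barely separates the two. Recall is the more telling number here: 44 of the 98 fraudulent transactions in the test set are missed.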


# Q6

In [42]:
# Assuming you have already fitted a logistic regression model (model) and have the feature names
coefficients = model.coef_[0]
feature_names = X.columns

# Create a dataframe to display the feature names and their corresponding coefficients
coeff_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the dataframe by the absolute magnitude of the coefficients
coeff_df['Abs_Coefficient'] = abs(coeff_df['Coefficient'])
sorted_coeff_df = coeff_df.sort_values(by='Abs_Coefficient', ascending=False)

# Display the sorted coefficients
print(sorted_coeff_df)

   Feature  Coefficient  Abs_Coefficient
14     V14    -1.081815         1.081815
2       V2    -0.880399         0.880399
17     V17    -0.746935         0.746935
3       V3    -0.733691         0.733691
9       V9    -0.558533         0.558533
1       V1     0.533949         0.533949
8       V8    -0.509440         0.509440
7       V7     0.431350         0.431350
15     V15    -0.430041         0.430041
13     V13    -0.428819         0.428819
16     V16    -0.419183         0.419183
22     V22     0.391334         0.391334
25     V25    -0.386550         0.386550
10     V10    -0.336703         0.336703
5       V5     0.286533         0.286533
21     V21     0.285225         0.285225
11     V11    -0.231455         0.231455
4       V4     0.200136         0.200136
6       V6    -0.103810         0.103810
27     V27    -0.094108         0.094108
28     V28     0.071549         0.071549
20     V20     0.066653         0.066653
19     V19     0.056360         0.056360
26     V26     0
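
One common way to read logistic regression coefficients is as odds ratios: exponentiating a coefficient gives the multiplicative change in the odds of fraud for a one-unit increase in that feature, holding the others fixed. The sketch below applies this to three coefficients copied from the table above; `odds_ratio` and `sigmoid` are illustrative helper names. Because this model was fitted on unscaled features and the solver stopped at its iteration limit, the magnitudes are best treated as a rough ranking rather than a precise measure of importance.

```python
import math

# Three coefficients copied from the sorted table above.
coefficients = {"V14": -1.081815, "V2": -0.880399, "V1": 0.533949}


def odds_ratio(coef: float) -> float:
    """Multiplicative change in the odds of fraud per one-unit feature increase."""
    return math.exp(coef)


def sigmoid(log_odds: float) -> float:
    """Convert a log-odds score (intercept + sum of coef * value) to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


odds_ratios = {name: odds_ratio(c) for name, c in coefficients.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of fraud)")
```

For decision-making, the model's log-odds score is passed through `sigmoid` to get a fraud probability, and a transaction is flagged when that probability exceeds a chosen threshold. The default of 0.5 used by `predict` is only one option; a lower threshold trades precision for recall.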

# Q7

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Define and train the initial logistic regression model
model_initial = LogisticRegression()
model_initial.fit(X_train, y_train)

# Define and train the second logistic regression model
# Note: it is fitted on the same X_train and y_train as the initial model, so both sets of metrics will match
model_feature_engineered = LogisticRegression()
model_feature_engineered.fit(X_train, y_train)

# Initial Logistic Regression Model
y_pred_initial = model_initial.predict(X_test)

# Performance metrics for the initial model
accuracy_initial = accuracy_score(y_test, y_pred_initial)
precision_initial = precision_score(y_test, y_pred_initial)
recall_initial = recall_score(y_test, y_pred_initial)
f1_initial = f1_score(y_test, y_pred_initial)
roc_auc_initial = roc_auc_score(y_test, model_initial.predict_proba(X_test)[:,1])

# Feature-Engineered and Balanced Data Model
y_pred_feature_engineered = model_feature_engineered.predict(X_test)

# Performance metrics for the feature-engineered model
accuracy_feature_engineered = accuracy_score(y_test, y_pred_feature_engineered)
precision_feature_engineered = precision_score(y_test, y_pred_feature_engineered)
recall_feature_engineered = recall_score(y_test, y_pred_feature_engineered)
f1_feature_engineered = f1_score(y_test, y_pred_feature_engineered)
roc_auc_feature_engineered = roc_auc_score(y_test, model_feature_engineered.predict_proba(X_test)[:,1])

# Compare the performance metrics
print("Initial Model Performance:")
print("Accuracy: {:.2f}".format(accuracy_initial))
print("Precision: {:.2f}".format(precision_initial))
print("Recall: {:.2f}".format(recall_initial))
print("F1 Score: {:.2f}".format(f1_initial))
print("AUC-ROC: {:.2f}".format(roc_auc_initial))

print("\nFeature-Engineered Model Performance:")
print("Accuracy: {:.2f}".format(accuracy_feature_engineered))
print("Precision: {:.2f}".format(precision_feature_engineered))
print("Recall: {:.2f}".format(recall_feature_engineered))
print("F1 Score: {:.2f}".format(f1_feature_engineered))
print("AUC-ROC: {:.2f}".format(roc_auc_feature_engineered))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Initial Model Performance:
Accuracy: 1.00
Precision: 0.66
Recall: 0.56
F1 Score: 0.61
AUC-ROC: 0.90

Feature-Engineered Model Performance:
Accuracy: 1.00
Precision: 0.66
Recall: 0.56
F1 Score: 0.61
AUC-ROC: 0.90
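
The two result blocks above are identical because both models were fitted on the same training data. A comparison only becomes informative when the second model differs in its preprocessing or its handling of imbalance. The sketch below shows one way such a comparison could be set up, using synthetic data from `make_classification` in place of the real transactions; the candidate names and the choice of `class_weight="balanced"` are assumptions, not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real transactions.
X_demo, y_demo = make_classification(
    n_samples=6000, n_features=12, n_informative=6,
    weights=[0.97, 0.03], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Two genuinely different candidates, evaluated on the same held-out split.
candidates = {
    "initial": LogisticRegression(max_iter=1000),
    "scaled + balanced": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    ),
}

results = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    fraud_scores = estimator.predict_proba(X_te)[:, 1]
    results[name] = {
        "recall": recall_score(y_te, estimator.predict(X_te)),
        "avg_precision": average_precision_score(y_te, fraud_scores),
    }

for name, metrics in results.items():
    print(f"{name:>18}: recall={metrics['recall']:.2f}  "
          f"avg_precision={metrics['avg_precision']:.2f}")
```

Average precision (the area under the precision-recall curve) is used here alongside recall because, with so few positives, it tends to be more sensitive to changes in fraud ranking than accuracy or AUC-ROC.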


# Q4
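
Of the strategies listed in task 4, random oversampling of the minority class can be sketched with `sklearn.utils.resample` alone. The helper `oversample_minority` below is a hypothetical name, and the toy arrays are illustrative. SMOTE, which the task also mentions, is provided by the separate imbalanced-learn package and is not shown here.

```python
import numpy as np
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_needed = int(counts.max() - counts.min())
    if n_needed == 0:
        return X, y  # already balanced
    minority = classes[np.argmin(counts)]
    X_minority = X[y == minority]
    # Sample with replacement from the minority rows to make up the difference.
    X_extra = resample(
        X_minority, replace=True, n_samples=n_needed, random_state=random_state
    )
    X_balanced = np.vstack([X, X_extra])
    y_balanced = np.concatenate([y, np.full(n_needed, minority)])
    return X_balanced, y_balanced


# Toy training split: 8 legitimate rows and 2 fraudulent rows.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(np.bincount(y_bal))
```

Resampling should be applied to the training split only. Oversampling before the train/test split would place copies of the same fraudulent rows on both sides and inflate the test metrics.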