# Fraud Detection Model Improvement

## 2. Use Resampling Techniques

Fraud data is highly imbalanced: fraud cases make up well under 1% of the training transactions used here. Resampling the training set can help address this without switching to a different model.

### Method: Apply SMOTE
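- **Synthetic Minority Over-sampling Technique (SMOTE)**: This method increases the representation of fraud cases in the training data by generating synthetic minority samples, each interpolated between an existing fraud case and one of its nearest fraud-class neighbours. It does not undersample the majority class.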
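
To make the interpolation idea concrete, here is a minimal NumPy sketch of SMOTE-style sample generation. It is an illustration only, not the `imblearn` implementation; the function name `smote_like_samples` and the toy points are assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Hypothetical helper for illustration; not part of imblearn.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    n = len(X)
    k = min(k, n - 1)  # cannot have more neighbours than other points
    # Pairwise Euclidean distances between minority rows
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base row and one of its k nearest neighbours for each new sample
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    # Place the new point at a random position on the segment between them
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])

minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic.shape)
```

Because every synthetic row lies on a line segment between two real minority rows, the new points stay inside the region already covered by the fraud cases.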

### Tools
- `imblearn` (the `imbalanced-learn` package) provides a ready-made `SMOTE` implementation.

### Goal
- Improve recall without excessively sacrificing precision.

In [1]:
import pandas as pd
# Load the cleaned data
df = pd.read_csv(r'C:\Users\Zana\Desktop\portfolio_projects\project_8\fraudData_cleaned.csv')

In [2]:
# Create a copy of the dataframe for processing
df_copy = df.copy()

# Display the first few rows of the dataframe
print(df_copy.head())

   Unnamed: 0 trans_date_trans_time            cc_num  \
0           0   2020-06-21 12:14:25  2291163933867244   
1           1   2020-06-21 12:14:33  3573030041201292   
2           2   2020-06-21 12:14:53  3598215285024754   
3           3   2020-06-21 12:15:15  3591919803438423   
4           4   2020-06-21 12:15:17  3526826139003047   

                               merchant        category    amt   first  \
0                 fraud_Kirlin and Sons   personal_care   2.86    Jeff   
1                  fraud_Sporer-Keebler   personal_care  29.84  Joanne   
2  fraud_Swaniawski, Nitzsche and Welch  health_fitness  41.28  Ashley   
3                     fraud_Haley Group        misc_pos  60.05   Brian   
4                 fraud_Johnston-Casper          travel   3.19  Nathan   

       last gender                       street  ...  \
0   Elliott      M            351 Darlene Green  ...   
1  Williams      F             3638 Marsh Union  ...   
2     Lopez      F         9333 Valentine Po

Step 1: Prepare the Data for SMOTE

In [4]:
# Drop unnecessary columns from the copy
columns_to_drop = ['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'first', 'last', 'street', 'city', 'zip', 'job', 'trans_num']
df_copy = df_copy.drop(columns=columns_to_drop)

# Display the updated dataframe's first few rows and columns
print(df_copy.head())
print(df_copy.columns)

         category    amt gender state      lat      long  city_pop  \
0   personal_care   2.86      M    SC  33.9659  -80.9355    333497   
1   personal_care  29.84      F    UT  40.3207 -110.4360       302   
2  health_fitness  41.28      F    NY  40.6729  -73.5365     34496   
3        misc_pos  60.05      M    FL  28.5697  -80.8191     54767   
4          travel   3.19      M    MI  44.2529  -85.0170      1126   

          dob   unix_time  merch_lat  merch_long  is_fraud  transaction_hour  \
0  1968-03-19  1371816865  33.986391  -81.200714         0                12   
1  1990-01-17  1371816873  39.450498 -109.960431         0                12   
2  1970-10-21  1371816893  40.495810  -74.196111         0                12   
3  1987-07-25  1371816915  28.812398  -80.883061         0                12   
4  1955-07-06  1371816917  44.959148  -85.884734         0                12   

   transaction_dayofweek  transaction_day  transaction_month  age  
0                      6      

Step 2: Encode Categorical Variables

In [5]:
# Encode 'gender' column
df_copy['gender'] = df_copy['gender'].apply(lambda x: 1 if x == 'M' else 0)

# One-hot encoding for 'category' and 'state' columns
df_copy = pd.get_dummies(df_copy, columns=['category', 'state'], drop_first=True)

# Display the updated dataframe's first few rows
print(df_copy.head())
print(df_copy.columns)

     amt  gender      lat      long  city_pop         dob   unix_time  \
0   2.86       1  33.9659  -80.9355    333497  1968-03-19  1371816865   
1  29.84       0  40.3207 -110.4360       302  1990-01-17  1371816873   
2  41.28       0  40.6729  -73.5365     34496  1970-10-21  1371816893   
3  60.05       1  28.5697  -80.8191     54767  1987-07-25  1371816915   
4   3.19       1  44.2529  -85.0170      1126  1955-07-06  1371816917   

   merch_lat  merch_long  is_fraud  ...  state_SD  state_TN  state_TX  \
0  33.986391  -81.200714         0  ...     False     False     False   
1  39.450498 -109.960431         0  ...     False     False     False   
2  40.495810  -74.196111         0  ...     False     False     False   
3  28.812398  -80.883061         0  ...     False     False     False   
4  44.959148  -85.884734         0  ...     False     False     False   

   state_UT  state_VA  state_VT  state_WA  state_WI  state_WV  state_WY  
0     False     False     False     False     Fa
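
One thing to be aware of in the encoding cell: the `lambda` maps every value other than `'M'` to 0, so a missing or unexpected gender value would silently be treated as female. A hedged alternative, shown on a toy dataframe invented for this example, uses an explicit mapping and asks `get_dummies` for integer columns:

```python
import pandas as pd

# Toy data (an assumption for illustration, not the fraud dataset)
toy = pd.DataFrame({
    "gender": ["M", "F", None],
    "category": ["travel", "home", "travel"],
})

# Explicit mapping: unknown or missing values become NaN instead of 0
toy["gender"] = toy["gender"].map({"M": 1, "F": 0})

# dtype=int yields 0/1 columns rather than booleans
encoded = pd.get_dummies(toy, columns=["category"], drop_first=True, dtype=int)
print(encoded)
```

With this variant, unexpected values surface as `NaN` and can be handled deliberately before resampling.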

In [7]:
# Step 2: Data Preparation

# Drop unnecessary columns
columns_to_drop = ['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'first', 'last', 'street', 'city', 'zip', 'job', 'trans_num', 'dob']
df_copy = df_copy.drop(columns=columns_to_drop)

# Ensure all columns are numeric
# Convert 'gender' to binary (M=1, F=0)
df_copy['gender'] = df_copy['gender'].apply(lambda x: 1 if x == 'M' else 0)

# One-hot encoding for categorical features like 'category' and 'state'
df_copy = pd.get_dummies(df_copy, columns=['category', 'state'], drop_first=True)

# Display the updated dataframe's first few rows and columns
print(df_copy.head())
print(df_copy.columns)

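KeyError: "['Unnamed: 0', 'trans_date_trans_time', 'cc_num', 'merchant', 'first', 'last', 'street', 'city', 'zip', 'job', 'trans_num'] not found in axis"

Note: this cell fails because In [4] already dropped these columns from `df_copy`. `drop` raises before modifying anything, so the dataframe is unchanged. Of the listed columns only `dob` is still present, and In [11] removes it.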
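
A preparation step can be written so that re-running it is harmless. The sketch below is one way to do that, assuming a hypothetical helper `prepare` and a small toy dataframe; the column list is shortened for the example.

```python
import pandas as pd

def prepare(df):
    """Return a prepared copy; safe to call on raw or already-prepared data.

    Hypothetical helper for illustration.
    """
    # errors="ignore" skips columns that were already dropped
    out = df.drop(columns=["Unnamed: 0", "cc_num", "dob"], errors="ignore")
    # Encode gender only while it still holds the original 'M'/'F' strings
    if not pd.api.types.is_numeric_dtype(out["gender"]):
        out["gender"] = out["gender"].map({"M": 1, "F": 0}).astype(int)
    # One-hot encode only the categorical columns that are still present
    to_encode = [c for c in ["category", "state"] if c in out.columns]
    if to_encode:
        out = pd.get_dummies(out, columns=to_encode, drop_first=True)
    return out

raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "gender": ["M", "F", "M"],
    "category": ["travel", "home", "travel"],
    "dob": ["1968-03-19", "1990-01-17", "1970-10-21"],
    "amt": [2.86, 29.84, 41.28],
})

once = prepare(raw)
twice = prepare(once)  # second call changes nothing
print(once.equals(twice))
```

The guard on `gender` matters: applying the original `lambda` a second time would turn every value into 0, because the integer 1 is not equal to `'M'`.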

In [8]:
# Display the current columns in df_copy
print(df_copy.columns)

Index(['amt', 'gender', 'lat', 'long', 'city_pop', 'dob', 'unix_time',
       'merch_lat', 'merch_long', 'is_fraud', 'transaction_hour',
       'transaction_dayofweek', 'transaction_day', 'transaction_month', 'age',
       'category_food_dining', 'category_gas_transport',
       'category_grocery_net', 'category_grocery_pos',
       'category_health_fitness', 'category_home', 'category_kids_pets',
       'category_misc_net', 'category_misc_pos', 'category_personal_care',
       'category_shopping_net', 'category_shopping_pos', 'category_travel',
       'state_AL', 'state_AR', 'state_AZ', 'state_CA', 'state_CO', 'state_CT',
       'state_DC', 'state_FL', 'state_GA', 'state_HI', 'state_IA', 'state_ID',
       'state_IL', 'state_IN', 'state_KS', 'state_KY', 'state_LA', 'state_MA',
       'state_MD', 'state_ME', 'state_MI', 'state_MN', 'state_MO', 'state_MS',
       'state_MT', 'state_NC', 'state_ND', 'state_NE', 'state_NH', 'state_NJ',
       'state_NM', 'state_NV', 'state_NY', 'state_OH'

In [10]:
# Check the data types of the columns in df_copy
print(df_copy.dtypes)

amt         float64
gender        int64
lat         float64
long        float64
city_pop      int64
             ...   
state_VT       bool
state_WA       bool
state_WI       bool
state_WV       bool
state_WY       bool
Length: 77, dtype: object
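
The truncated `dtypes` printout hides the middle columns, which is where a leftover string column such as `dob` would sit. A quicker check, sketched here on a toy dataframe that is an assumption for this example, lists only the columns that are neither numeric nor boolean:

```python
import pandas as pd

# Toy data standing in for df_copy
toy = pd.DataFrame({
    "amt": [2.86, 29.84],
    "dob": ["1968-03-19", "1990-01-17"],
    "state_NY": [False, True],
})

# SMOTE needs numeric input, so any column listed here has to be
# encoded or dropped before resampling
non_numeric = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()
print(non_numeric)  # ['dob']
```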


In [11]:
# Drop the 'dob' column since we already have 'age'
df_copy = df_copy.drop(columns=['dob'])

# Check the updated dataframe
print(df_copy.head())
print(df_copy.dtypes)  # Verify the data types again

     amt  gender      lat      long  city_pop   unix_time  merch_lat  \
0   2.86       1  33.9659  -80.9355    333497  1371816865  33.986391   
1  29.84       0  40.3207 -110.4360       302  1371816873  39.450498   
2  41.28       0  40.6729  -73.5365     34496  1371816893  40.495810   
3  60.05       1  28.5697  -80.8191     54767  1371816915  28.812398   
4   3.19       1  44.2529  -85.0170      1126  1371816917  44.959148   

   merch_long  is_fraud  transaction_hour  ...  state_SD  state_TN  state_TX  \
0  -81.200714         0                12  ...     False     False     False   
1 -109.960431         0                12  ...     False     False     False   
2  -74.196111         0                12  ...     False     False     False   
3  -80.883061         0                12  ...     False     False     False   
4  -85.884734         0                12  ...     False     False     False   

   state_UT  state_VA  state_VT  state_WA  state_WI  state_WV  state_WY  
0     False 

Step 3: Apply SMOTE

In [12]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Separate features (X) and target (y)
X = df_copy.drop(columns=['is_fraud'])  # Features
y = df_copy['is_fraud']                  # Target variable

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Check the class distribution after SMOTE
print("Original training set class counts:", y_train.value_counts())
print("Resampled training set class counts:", y_train_resampled.value_counts())

Original training set class counts: is_fraud
0    387502
1      1501
Name: count, dtype: int64
Resampled training set class counts: is_fraud
0    387502
1    387502
Name: count, dtype: int64


Step 4: Model Training

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train the model with the resampled training data
log_reg.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Confusion Matrix:
 [[158409   7663]
 [   151    493]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98    166072
           1       0.06      0.77      0.11       644

    accuracy                           0.95    166716
   macro avg       0.53      0.86      0.54    166716
weighted avg       1.00      0.95      0.97    166716
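
The fraud-class figures in the report can be recomputed from the confusion matrix, which makes the trade-off against the stated goal easier to see. This sketch uses only the four counts printed above, read in scikit-learn's `[[TN, FP], [FN, TP]]` layout:

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 158409, 7663, 151, 493

precision = tp / (tp + fp)   # share of flagged transactions that are fraud
recall = tp / (tp + fn)      # share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)
false_alarms_per_hit = fp / tp

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.06 recall=0.77 f1=0.11
print(f"false alarms per detected fraud: {false_alarms_per_hit:.1f}")
# false alarms per detected fraud: 15.5
```

So the model catches about 77% of fraud cases, but roughly 15 to 16 legitimate transactions are flagged for every fraud case caught.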



Step 5: Save SMOTE Results

In [14]:
# Save SMOTE results
smote_results = pd.DataFrame({
    'Original Class Distribution': y_train.value_counts(),
    'Resampled Class Distribution': y_train_resampled.value_counts()
})

# Save the results to a CSV file
smote_results.to_csv(r'C:\Users\Zana\Desktop\portfolio_projects\project_8\smote_results.csv', index=True)


## SMOTE Results

### Original Class Distribution
- **Number of Non-Fraud Cases (0)**: 387,502
- **Number of Fraud Cases (1)**: 1,501

### Resampled Class Distribution
- **Number of Non-Fraud Cases (0)**: 387,502
- **Number of Fraud Cases (1)**: 387,502
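
As a quick worked check on these counts, the following sketch derives the original fraud share and the number of synthetic rows SMOTE must have added to balance the classes. It uses only the figures listed above.

```python
n_legit, n_fraud = 387_502, 1_501

fraud_share = n_fraud / (n_legit + n_fraud)
n_synthetic = n_legit - n_fraud  # rows added to bring fraud up to parity

print(f"fraud share before SMOTE: {fraud_share:.2%}")
# fraud share before SMOTE: 0.39%
print(f"synthetic fraud rows added: {n_synthetic:,}")
# synthetic fraud rows added: 386,001
```

In the resampled training set, synthetic fraud rows therefore outnumber the real ones by about 257 to 1.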


-----------------