# Credit Card Transactions Fraud Detection

## **Part V:** Final Pipeline
**Table of contents:**

1. Build the full prediction pipeline
2. Check that the pipeline works correctly
3. Save the complete prediction pipeline
4. Run predictions on new data (the unseen fraudTest dataset)

It is now time to assemble the final pipeline. To recap the data:

The original dataset has 21 features:

['trans_date_trans_time', 'cc_num', 'merchant', 'category', 'amt',
       'first', 'last', 'gender', 'street', 'city', 'state', 'zip', 'lat',
       'long', 'city_pop', 'job', 'dob', 'trans_num', 'unix_time', 'merch_lat',
       'merch_long']

plus the target variable 'is_fraud'.
       
---------------------------------
A complete pipeline for predicting on new, unseen data with the saved tuned model takes these steps:
- Step 1. Load the saved best model
- Step 2. Data preprocessing
    1. Feature engineering: create new features:
       >- 'age' from the original 'dob';
       >- 'distance' from the original 'lat', 'long', 'merch_lat', 'merch_long';
       >- 'transaction_hour', 'transaction_day_of_week', 'transaction_day_of_month', 'transaction_month' from the original 'trans_date_trans_time'.
    2. Remove redundant features: 'first', 'last', 'trans_num', 'street', 'trans_date_trans_time', 'unix_time', 'cc_num', 'dob', 'lat', 'long', 'merch_lat', 'merch_long', 'merchant', 'city'
    3. Data transformation: scale numerical variables and encode categorical variables.
    4. Handle imbalanced data (SMOTE).

- Step 3. Model
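
The time-based features and 'age' from step 2.1 can be illustrated for a single transaction with the standard library alone. This is a minimal sketch, not the notebook's transformer: the function name `engineer_time_features` and the `age_exact` field are assumptions added for illustration. The pipeline itself uses the plain year difference, which can be one year higher than the exact age when the birthday has not yet occurred in the transaction year.

```python
from datetime import datetime


def engineer_time_features(trans_time: str, dob: str) -> dict:
    """Derive the time-based features and age for one transaction (illustrative helper)."""
    ts = datetime.strptime(trans_time, "%Y-%m-%d %H:%M:%S")
    born = datetime.strptime(dob, "%Y-%m-%d")
    # True if the birthday has already occurred in the transaction year
    had_birthday = (ts.month, ts.day) >= (born.month, born.day)
    return {
        "transaction_hour": ts.hour,
        # Monday = 0, the same convention as pandas' .dt.dayofweek
        "transaction_day_of_week": ts.weekday(),
        "transaction_day_of_month": ts.day,
        "transaction_month": ts.month,
        # Year difference, as used in the pipeline
        "age_year_difference": ts.year - born.year,
        # Exact age (hypothetical alternative, not used in the pipeline)
        "age_exact": ts.year - born.year - (0 if had_birthday else 1),
    }


features = engineer_time_features("2022-09-01 00:00:00", "1980-12-31")
print(features)
```

Switching the pipeline to the exact age would change the feature values, so the pipeline would have to be refitted.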

# 1. Build the full prediction pipeline

- The pipeline combines all necessary preprocessing steps with the tuned model loaded from disk.

In [11]:
# Import libraries

import pandas as pd
from geopy.distance import geodesic
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import category_encoders as ce
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
import joblib
import warnings
warnings.filterwarnings("ignore")

### Custom Transformer for Feature Engineering

In [3]:
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer for feature engineering
class FeatureEngineeringTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        data = X.copy()
        # Extract time-based features from trans_date_trans_time
        data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'])
        data['transaction_hour'] = data['trans_date_trans_time'].dt.hour
        data['transaction_day_of_week'] = data['trans_date_trans_time'].dt.dayofweek
        data['transaction_day_of_month'] = data['trans_date_trans_time'].dt.day
        data['transaction_month'] = data['trans_date_trans_time'].dt.month

        # Approximate age as the difference in calendar years (month and day are ignored)
        data['dob'] = pd.to_datetime(data['dob'])
        data['age'] = data['trans_date_trans_time'].dt.year - data['dob'].dt.year

        # Geodesic distance in miles between cardholder and merchant (computed row by row)
        data['distance'] = data.apply(
            lambda row: geodesic(
                (row['lat'], row['long']),
                (row['merch_lat'], row['merch_long'])
            ).miles,
            axis=1
        )

        # Remove redundant features
        data = data.drop(columns=['first', 'last', 'trans_num', 'street', 'trans_date_trans_time', 'unix_time', 'cc_num', 'dob', 'lat', 'long', 'merch_lat', 'merch_long', 'merchant', 'city'])
        
        return data
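
The row-wise `geodesic` call above is accurate but can be slow on more than a million rows. As a hedged alternative, the haversine formula gives a closed-form great-circle distance that is easy to vectorize. The sketch below is an assumption for illustration, not what the fitted pipeline uses: `haversine_miles` and `EARTH_RADIUS_MILES` are hypothetical names, and the spherical-Earth result differs slightly from the ellipsoidal geodesic distance.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; a spherical approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Cardholder and merchant coordinates taken from the example record used later in this notebook
distance = haversine_miles(40.712776, -74.005974, 40.730610, -73.935242)
print(f"{distance:.2f} miles")
```

Because the values are not identical to the geodesic ones, swapping this in would require refitting the pipeline.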

### Create Transformation Pipeline

In [4]:
# Define numerical and categorical columns
numerical_features = ['amt', 'city_pop', 'transaction_hour', 'transaction_day_of_week', 'transaction_day_of_month', 'transaction_month', 'age', 'distance']
categorical_features = ['category', 'gender', 'state', 'zip', 'job']

# Preprocessing for numerical data
numerical_transformer = StandardScaler()

# Preprocessing for categorical data using Target Encoding
categorical_transformer = ce.TargetEncoder(cols=categorical_features)

# Combining numerical and categorical transformers
preprocessor_pipeline = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ], remainder='passthrough')  # Keep other columns unchanged
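
To make the categorical step more concrete, the sketch below shows the core idea of target encoding: each category is replaced by the mean of the target observed for that category in the training data. It is a simplified illustration under stated assumptions. The helper names and toy values are hypothetical, and `ce.TargetEncoder` additionally smooths each category mean toward the global mean, which this sketch omits.

```python
from collections import defaultdict


def fit_mean_target_encoding(categories, targets):
    """Learn the per-category target mean and the global mean (no smoothing)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return mapping, global_mean


def apply_mean_target_encoding(categories, mapping, global_mean):
    """Encode categories; unseen categories fall back to the global mean (sketch assumption)."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy training data: category values and fraud labels are made up for illustration
train_categories = ["grocery_pos", "grocery_pos", "travel", "travel", "travel"]
train_targets = [1, 0, 0, 0, 0]

mapping, global_mean = fit_mean_target_encoding(train_categories, train_targets)
encoded = apply_mean_target_encoding(
    ["grocery_pos", "travel", "unseen_category"], mapping, global_mean
)
print(encoded)  # [0.5, 0.0, 0.2]
```

The encoding is learned from the training labels only, which is why the encoder sits inside the pipeline and is fitted together with it.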

### Load tuned model that was saved 

In [6]:
# Load tuned model that was saved
final_model = joblib.load('../models/best_model.pkl')
# final_model = joblib.load('../models/best_model_w_customthreshold_098_highest_precision.pkl')

### Create the Full Pipeline with Feature Engineering and SMOTE

In [12]:
# Create the final prediction pipeline with feature engineering
prediction_pipeline = ImbPipeline(steps=[
    ('feature_engineering', FeatureEngineeringTransformer()),
    ('preprocessor', preprocessor_pipeline),
    ('smote', SMOTE(random_state=42)),
    ('model', final_model)
])


# Prepare the training data
data_train = pd.read_csv('../data/fraudTrain.csv', index_col=0)
X_train = data_train.drop(columns='is_fraud')
y_train = data_train['is_fraud']

# Fit the pipeline; this also refits the loaded model on the SMOTE-resampled training data
prediction_pipeline.fit(X_train, y_train)

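
In an imbalanced-learn pipeline, the SMOTE step resamples only while fitting; `predict` passes new data straight through the other steps. The sketch below illustrates the interpolation idea behind SMOTE in pure Python. It is a simplified assumption for illustration, not the library's implementation: `smote_sample` is a hypothetical helper that creates one synthetic point between a minority sample and one of its nearest minority neighbours.

```python
import math
import random


def smote_sample(minority, k=2, seed=42):
    """Create one synthetic minority point by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # The k nearest minority neighbours of the chosen point (excluding the point itself)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Random position on the segment between the point and its neighbour
    lam = rng.random()
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]


# Toy minority-class points in a two-dimensional feature space
minority_points = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
synthetic = smote_sample(minority_points)
print(synthetic)
```

The synthetic point always lies on the line segment between two real minority samples, so it stays inside the region the minority class already occupies.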


# 2. Check that the pipeline works correctly

In [13]:
# Example new data
new_data = pd.DataFrame({
    'trans_date_trans_time': ['2022-09-01 00:00:00'],
    'cc_num': [1234567890123456],
    'merchant': ['merchant1'],
    'category': ['grocery_pos'],
    'amt': [100.0],
    'first': ['John'],
    'last': ['Doe'],
    'gender': ['M'],
    'street': ['123 Main St'],
    'city': ['Anytown'],
    'state': ['NY'],
    'zip': [12345],
    'lat': [40.712776],
    'long': [-74.005974],
    'city_pop': [8000000],
    'job': ['Engineer'],
    'dob': ['1980-01-01'],
    'trans_num': ['abc123'],
    'unix_time': [1661990400],
    'merch_lat': [40.730610],
    'merch_long': [-73.935242]
})

# Predict using the pipeline
predictions = prediction_pipeline.predict(new_data)

# Output the predictions
print(predictions)

[0]




# 3. Save the complete prediction pipeline

In [14]:
# Save the complete prediction pipeline to a file
pipeline_path = '../models/complete_prediction_pipeline.pkl'
joblib.dump(prediction_pipeline, pipeline_path)
print(f"Pipeline saved to {pipeline_path}")

Pipeline saved to ../models/complete_prediction_pipeline.pkl


# 4. Run predictions on new data (unseen fraudTest dataset)
To use the saved pipeline later, first load the data to score, then load the pipeline:

In [15]:
# Unseen fraudTest dataset
data_test = pd.read_csv('../data/fraudTest.csv', index_col=0)
X_test = data_test.drop(columns='is_fraud')
y_test = data_test['is_fraud']

In [17]:
# Load the saved Pipeline
pipeline = joblib.load('../models/complete_prediction_pipeline.pkl')
print("Pipeline loaded successfully")

# Use the loaded pipeline to make predictions on new data
predictions = pipeline.predict(X_test)

# Output the predictions
print(predictions)

Pipeline loaded successfully




[0 0 0 ... 0 0 0]
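
`predict` applies the classifier's default decision rule. The commented-out file name in the model-loading cell appears to refer to a variant with a custom threshold, so the sketch below shows how a threshold could be applied to predicted probabilities. It is an assumption for illustration: `predict_with_threshold` and `_StubModel` are hypothetical, and the stub stands in for any fitted object that exposes `predict_proba`, as the pipeline does when its final estimator supports it.

```python
def predict_with_threshold(model, X, threshold=0.5):
    """Label a row as fraud (1) when the predicted fraud probability reaches the threshold."""
    probabilities = model.predict_proba(X)
    # Column 1 holds the probability of the positive class (fraud)
    return [int(row[1] >= threshold) for row in probabilities]


class _StubModel:
    """Hypothetical stand-in that returns fixed class probabilities."""

    def predict_proba(self, X):
        return [[0.9, 0.1], [0.4, 0.6], [0.01, 0.99]]


stub = _StubModel()
default_labels = predict_with_threshold(stub, None, threshold=0.5)
strict_labels = predict_with_threshold(stub, None, threshold=0.98)
print(default_labels)  # [0, 1, 1]
print(strict_labels)   # [0, 0, 1]
```

Raising the threshold generally trades recall for precision, and lowering it does the opposite.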


In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
conf_matrix = confusion_matrix(y_test, predictions)
class_report = classification_report(y_test, predictions)

# Print evaluation metrics
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')

Accuracy: 0.9968
Precision: 0.8657
Recall: 0.2163
F1 Score: 0.3461
Confusion Matrix:
[[553502     72]
 [  1681    464]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.87      0.22      0.35      2145

    accuracy                           1.00    555719
   macro avg       0.93      0.61      0.67    555719
weighted avg       1.00      1.00      1.00    555719
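
The headline metrics follow directly from the confusion matrix above, which scikit-learn lays out as [[TN, FP], [FN, TP]]. The short worked example below recomputes them from the four counts. The variable names are chosen here for illustration.

```python
# Counts read from the test-set confusion matrix: [[TN, FP], [FN, TP]]
tn, fp = 553502, 72
fn, tp = 1681, 464

accuracy_check = (tp + tn) / (tp + tn + fp + fn)
precision_check = tp / (tp + fp)  # share of flagged transactions that are fraud
recall_check = tp / (tp + fn)     # share of fraud cases that were flagged
f1_check = 2 * precision_check * recall_check / (precision_check + recall_check)

print(f"Accuracy: {accuracy_check:.4f}")
print(f"Precision: {precision_check:.4f}")
print(f"Recall: {recall_check:.4f}")
print(f"F1 Score: {f1_check:.4f}")
```

Accuracy is dominated by the 553,574 legitimate transactions, while recall shows that only 464 of the 2,145 fraud cases in the test set were caught.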



In [20]:
# Use the loaded pipeline to check predictions on training data
predictions = pipeline.predict(X_train)
# Calculate evaluation metrics
accuracy = accuracy_score(y_train, predictions)
precision = precision_score(y_train, predictions)
recall = recall_score(y_train, predictions)
f1 = f1_score(y_train, predictions)
conf_matrix = confusion_matrix(y_train, predictions)
class_report = classification_report(y_train, predictions)

# Print evaluation metrics
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')



Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
Confusion Matrix:
[[1289169       0]
 [      0    7506]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1289169
           1       1.00      1.00      1.00      7506

    accuracy                           1.00   1296675
   macro avg       1.00      1.00      1.00   1296675
weighted avg       1.00      1.00      1.00   1296675

