<a href="https://colab.research.google.com/github/Btere/btereml/blob/main/creditcard_fraud_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
DATASET_PATH = Path("/content/drive/MyDrive/Colab Notebooks")

In [None]:
def read_csv_files(dataset_path: Path) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Load the train and test splits stored in the same directory
    train_dataset = pd.read_csv(dataset_path / 'fraudTrain.csv', index_col=False)
    test_dataset = pd.read_csv(dataset_path / 'fraudTest.csv', index_col=False)
    return train_dataset, test_dataset

In [None]:
train_dataset, test_dataset = read_csv_files(DATASET_PATH)


In [None]:
display(train_dataset)

In [None]:
# Dataset overview

train_dataset.shape

In [None]:
test_dataset.shape

In [None]:
train_dataset.columns

In [None]:
train_dataset.info()

In [None]:
train_dataset.isnull().sum()

In [None]:
train_dataset.isna().sum()

In [None]:
display(test_dataset)

In [None]:
test_dataset.nunique()

In [None]:
train_dataset["is_fraud"].value_counts()

In [None]:
test_dataset["is_fraud"].value_counts()

Data cleaning

In [None]:
train_data = train_dataset.drop(columns='Unnamed: 0')
test_data = test_dataset.drop(columns='Unnamed: 0')

In [None]:
train_data["dob"] = pd.to_datetime(train_data["dob"])
train_data['trans_date_trans_time'] = pd.to_datetime(train_data['trans_date_trans_time'])

In [None]:
train_data

In [None]:
test_data["dob"] = pd.to_datetime(test_data["dob"])
test_data['trans_date_trans_time'] = pd.to_datetime(test_data['trans_date_trans_time'])

In [None]:
test_data

In [None]:
train_data.describe()

In [None]:
train_data.shape , test_data.shape

In [None]:
train_data.columns , test_data.columns

First, we apply some transformations to the dataset to normalize the feature values before encoding the categorical labels.



The next cell counts the number of occurrences of each job title among the rows of the test_data DataFrame where the is_fraud column is 1. This shows how job titles are distributed among the fraudulent cases in the dataset. The cell after it does the same for merchants in train_data.

In [None]:
test_data[test_data["is_fraud"] == 1]["job"].value_counts()

In [None]:
train_data[train_data["is_fraud"] == 1]["merchant"].value_counts()

In [None]:
# encoding test data
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Columns to label-encode; the encoder is refit for each column
columns_to_encode = [
    'merchant', 'category', 'street', 'job', 'trans_num', 'first',
    'city', 'state', 'last', 'gender', 'trans_date_trans_time', 'dob',
]

for column in columns_to_encode:
    test_data[column] = encoder.fit_transform(test_data[column])

In [None]:
test_data.head()
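Because the encoder above is fit on the test set alone, the same merchant or job can receive a different integer here than it does in the training set. The following is a minimal sketch of one possible alternative, not the notebook's method: it fits scikit-learn's OrdinalEncoder on the training frame and reuses the learned mapping on the test frame. The toy frames, their values, and the names toy_train, toy_test, and cat_cols are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-ins for train_data / test_data (illustrative values only)
toy_train = pd.DataFrame({"category": ["grocery", "travel", "grocery"],
                          "gender": ["F", "M", "F"]})
toy_test = pd.DataFrame({"category": ["travel", "health"],
                         "gender": ["M", "F"]})

cat_cols = ["category", "gender"]

# Categories that appear only in the test set are mapped to -1 instead of raising
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the category-to-integer mapping on the training data only
encoded_train = pd.DataFrame(ordinal.fit_transform(toy_train[cat_cols]), columns=cat_cols)
# Reuse the same mapping on the test data
encoded_test = pd.DataFrame(ordinal.transform(toy_test[cat_cols]), columns=cat_cols)

print(encoded_train)
print(encoded_test)
```

With this approach, "travel" gets the same code in both frames, and "health", which never appears in the toy training data, is flagged with -1.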

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder

def preprocess_data(df: pd.DataFrame, target_col: str = 'is_fraud') -> pd.DataFrame:
    # Note: df is modified in place and also returned
    le = LabelEncoder()

    standard_scaler = StandardScaler()
    minmax_scaler = MinMaxScaler()

    # Separate numerical and categorical columns.
    # The target label is excluded so it stays 0/1 for the classifiers;
    # datetime columns match neither selection and are left untouched.
    numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.drop(target_col, errors='ignore')
    categorical_cols = df.select_dtypes(include=['object']).columns

    # Scale and normalize numerical columns
    if len(numerical_cols) > 0:
        # First, apply MinMaxScaler for normalization
        df[numerical_cols] = minmax_scaler.fit_transform(df[numerical_cols])
        # Then, apply StandardScaler for scaling
        df[numerical_cols] = standard_scaler.fit_transform(df[numerical_cols])

    # Encode categorical columns (the encoder is refit for each column)
    for column in categorical_cols:
        df[column] = le.fit_transform(df[column])

    return df
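A note on the two scaling steps in preprocess_data: min-max scaling and standardization are both affine rescalings of each column, so standardizing an already min-max-scaled column should give the same values as standardizing the raw column. The sketch below checks this on a small made-up matrix. The array X and the names chained, direct, and same_result are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy numeric feature matrix (illustrative values only)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [4.0, 1000.0],
              [8.0, 600.0]])

# Path 1: min-max scaling followed by standardization, as in preprocess_data
chained = StandardScaler().fit_transform(MinMaxScaler().fit_transform(X))

# Path 2: standardization alone
direct = StandardScaler().fit_transform(X)

# Compare the two results element by element
same_result = np.allclose(chained, direct)
print(same_result)
```

If the comparison holds on the real data as well, the MinMaxScaler step could likely be dropped without changing the features the models see.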



In [None]:
train_df = preprocess_data(train_data)
display(train_df)

In [None]:
train_df.describe()

In [None]:
test_data.describe()

In [None]:
test_data.head()

In [None]:
train_df = train_data.copy()

In [None]:
test_df = test_data.copy()

In [None]:
# Split the dataset into features and target, converting each to a NumPy array.
X_train = train_df.loc[:, train_df.columns != 'is_fraud'].values
y_train = train_df.loc[:, 'is_fraud'].values

print(X_train.shape)
print(y_train.shape)

In [None]:
X_test = test_df.loc[:, test_df.columns != 'is_fraud'].values
y_test = test_df.loc[:, 'is_fraud'].values

print(X_test.shape)
print(y_test.shape)

In [None]:
#model building and training

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score , classification_report , confusion_matrix

modelLR = LogisticRegression(random_state = 42)
modelRF = RandomForestClassifier(random_state = 42,n_estimators = 10, n_jobs = -1, max_depth = 20)
modelDT = DecisionTreeClassifier()
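As a rough sketch of how three such models could be trained and compared with the metrics imported above, the example below uses a synthetic, imbalanced dataset from make_classification rather than the fraud data. The names X_demo, y_demo, demo_models, and results, the 5% positive rate, and the max_iter=1000 setting are assumptions for illustration, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the fraud data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

demo_models = {
    "LogisticRegression": LogisticRegression(random_state=42, max_iter=1000),
    "RandomForest": RandomForestClassifier(
        random_state=42, n_estimators=10, n_jobs=-1, max_depth=20
    ),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)           # train on the training split
    y_pred = model.predict(X_te)    # predict on the held-out split
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "confusion": confusion_matrix(y_te, y_pred),
    }
    print(name, results[name]["accuracy"])
    print(results[name]["confusion"])
    print(classification_report(y_te, y_pred, zero_division=0))
```

On heavily imbalanced data such as fraud labels, accuracy alone can look high even when few frauds are caught, so the per-class precision and recall in the classification report and the confusion matrix are usually the more informative outputs.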



In NumPy, the reshape function changes the shape of an array without changing its data. The arguments (-1, 1) and (1, -1) specify the target shape. Here’s a detailed explanation of each, using a 1D array of 6 elements such as np.array([1, 2, 3, 4, 5, 6]):

reshape(-1, 1)
-1: A special placeholder in NumPy’s reshape method. It tells NumPy to infer the size of this dimension from the total number of elements and the remaining dimensions.
1: The resulting shape has a single column.

Explanation: The -1 tells NumPy to infer the number of rows from the total number of elements (6) and the specified number of columns (1). So, the resulting shape is (6, 1).

reshape(1, -1)
1: The resulting shape has a single row.
-1: NumPy infers the size of this dimension from the total number of elements and the remaining dimensions.

Explanation: The -1 tells NumPy to infer the number of columns from the total number of elements (6) and the specified number of rows (1). So, the resulting shape is (1, 6).

Summary
reshape(-1, 1) converts a 1D array into a 2D array with one column and as many rows as needed.
reshape(1, -1) converts a 1D array into a 2D array with one row and as many columns as needed.
At most one dimension can be -1. It is useful when you want to fix every dimension except one and let NumPy calculate the remaining size, so you can reshape arrays without working out the sizes by hand.

In [None]:
arr = np.array([1, 2, 3, 4, 5, 6])

# Reshape to a 2D array with 1 column
reshaped_arr = arr.reshape(-1, 1)
print(reshaped_arr)

In [None]:
reshaped_arr = arr.reshape(1, -1)
print(reshaped_arr)
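One place these two shapes come up is scikit-learn, whose estimators expect 2D input of shape (n_samples, n_features). The sketch below is a toy illustration with made-up data (y = 2 * x) and a LinearRegression model, neither of which comes from the notebook: reshape(-1, 1) turns a single feature into a column, and reshape(1, -1) turns a single sample into a row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: one feature, six samples, with y = 2 * x exactly
x = np.array([1, 2, 3, 4, 5, 6])
y = 2 * x

# fit expects shape (n_samples, n_features): six rows, one column
feature_column = x.reshape(-1, 1)
model = LinearRegression().fit(feature_column, y)

# A single new sample must also be 2D: one row, one column here
new_sample = np.array([7]).reshape(1, -1)
prediction = model.predict(new_sample)

print(feature_column.shape)
print(new_sample.shape)
print(prediction)
```

Passing the 1D array x directly to fit would raise an error asking for a 2D array, which is why the reshape step is needed.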