
In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from geopy.distance import geodesic


In [None]:
# Assumes fraudTrain.csv has been uploaded to the Colab session's /content directory
df = pd.read_csv('/content/fraudTrain.csv')



In [None]:
df.shape

## **Exploratory Data Analysis (EDA)**

In [None]:
print(df.info())


In [None]:
df.head()

In [None]:
df.isnull().sum()


In [None]:
df_fraudulent = df[df['is_fraud'] == 1]
df_fraudulent.head(20)

In [None]:
df = df.drop(columns=['Unnamed: 0'])


The `Unnamed: 0` column holds only the row number of each credit card transaction entry, so it carries no predictive information and is dropped.

In [None]:
numeric_df = df.select_dtypes(include=['float64', 'int64'])

correlation_matrix = numeric_df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Features')
plt.show()

In [None]:
df['log_amt'] = np.log1p(df['amt'])
sns.boxplot(x='is_fraud', y='log_amt', data=df)
plt.title('Log Transaction Amount Distribution by Fraud Status')
plt.show()


Fraudulent transactions tend to involve higher amounts than non-fraudulent ones. This is consistent with the correlation matrix above, which shows a moderate positive correlation between transaction amount (`amt`) and fraud (`is_fraud`).
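
One way to put numbers on this comparison is to look at the median amount per class alongside the correlation coefficient. The sketch below uses a small made-up DataFrame (`toy`, not taken from fraudTrain.csv) purely to show the calls; the same two lines could be run on `df`.

```python
import pandas as pd

# Made-up illustrative transactions; these values are NOT from fraudTrain.csv.
toy = pd.DataFrame({
    "amt": [12.0, 35.5, 8.2, 60.0, 410.0, 950.0],
    "is_fraud": [0, 0, 0, 0, 1, 1],
})

# Median amount per class: less sensitive to the long right tail of `amt` than the mean.
median_by_class = toy.groupby("is_fraud")["amt"].median()

# Pearson correlation between the amount and the 0/1 fraud label
# (for a binary label this equals the point-biserial correlation).
amt_fraud_corr = toy["amt"].corr(toy["is_fraud"])

print(median_by_class)
print(round(amt_fraud_corr, 3))
```

On heavily imbalanced data the correlation coefficient can look small even when the per-class medians differ a lot, so it is worth reading both.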

In [None]:
'''
features = ['amt', 'city_pop', 'lat', 'long', 'merch_lat', 'merch_long']
for feature in features:
    plt.figure(figsize=(10, 8))
    sns.histplot(data=df, x=feature, hue='is_fraud', multiple='stack', kde=True)
    plt.title(f'Distribution of {feature} by Fraud Status')
    plt.show()
'''

## **Data Cleaning**


In [None]:
# 'Unnamed: 0' was already dropped above, so only cc_num and gender are removed here
df = df.drop(columns=['cc_num', 'gender'], errors='ignore')

In [None]:
print(df.isnull().sum())


There are no missing values.

In [None]:
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])
df['hour'] = df['trans_date_trans_time'].dt.hour
df['day_of_week'] = df['trans_date_trans_time'].dt.dayofweek
df['month'] = df['trans_date_trans_time'].dt.month


In [None]:
df['unique_id'] = df['first'] + '_' + df['last'] + '_' + df['street']


unique_id: Combines 'first', 'last', and 'street' to create a unique identifier for individuals.


In [None]:
df['transaction_count'] = df.groupby('unique_id')['unique_id'].transform('count')


transaction_count: Counts how many transactions are associated with each individual.


In [None]:
df['fraudulent_transaction_count'] = df.groupby('unique_id')['is_fraud'].transform('sum')


fraudulent_transaction_count: Counts how many fraudulent transactions are associated with each individual.

In [None]:
df['multiple_fraud_flag'] = df['fraudulent_transaction_count'] > 1


multiple_fraud_flag: A binary feature to flag individuals with more than one fraudulent transaction.
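
The three per-customer features above all rely on `groupby(...).transform(...)`, which returns one value per input row rather than one per group. A minimal sketch on made-up rows (the `toy` frame and its two customers are hypothetical):

```python
import pandas as pd

# Made-up rows for two hypothetical customers "a" and "b".
toy = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b", "b"],
    "is_fraud": [0, 1, 1, 0, 0],
})

grouped = toy.groupby("unique_id")

# transform() broadcasts the group-level result back onto every row of the group,
# so the output can be assigned straight to a new column.
toy["transaction_count"] = grouped["unique_id"].transform("count")
toy["fraudulent_transaction_count"] = grouped["is_fraud"].transform("sum")
toy["multiple_fraud_flag"] = toy["fraudulent_transaction_count"] > 1

print(toy)
```

Because `fraudulent_transaction_count` and `multiple_fraud_flag` are computed from `is_fraud` over each customer's full history, they include the label of the row they sit on. If they are later used as model inputs, that is a likely source of target leakage.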

In [None]:
df['dob'] = pd.to_datetime(df['dob'])
df['age'] = df['trans_date_trans_time'].dt.year - df['dob'].dt.year


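We create an `age` column from the `dob` feature by subtracting the birth year from the transaction year. This gives an approximate age in years, since it ignores whether the birthday has already passed in the transaction year.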
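
If an exact age in whole years is preferred, the year difference can be reduced by one whenever the birthday has not yet occurred in the transaction year. This is a sketch on made-up dates, not the notebook's method; the helper name `birthday_pending` is an assumption introduced for illustration.

```python
import pandas as pd

# Made-up example: the same date of birth, one day before and on the 30th birthday.
toy = pd.DataFrame({
    "dob": pd.to_datetime(["1990-06-15", "1990-06-15"]),
    "trans_date_trans_time": pd.to_datetime(["2020-06-14 10:00", "2020-06-15 10:00"]),
})

t, b = toy["trans_date_trans_time"], toy["dob"]

# True where the birthday has not yet been reached in the transaction year.
birthday_pending = (t.dt.month < b.dt.month) | (
    (t.dt.month == b.dt.month) & (t.dt.day < b.dt.day)
)

# Year difference, minus one year for rows where the birthday is still ahead.
toy["age"] = t.dt.year - b.dt.year - birthday_pending.astype(int)

print(toy)
```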

In [None]:
df = pd.get_dummies(df, columns=['city', 'state', 'job'], drop_first=True)


In [None]:
df = df.drop(columns=['first', 'last', 'street', 'trans_num', 'dob'])


In [None]:
fraud_distribution = df['is_fraud'].value_counts(normalize=True) * 100
print(fraud_distribution)


In [None]:
df = df.sort_values(by=['unique_id', 'trans_date_trans_time'])
df['cumsum_amt'] = df.groupby('unique_id')['amt'].cumsum()
df['prev_cumsum_amt'] = df.groupby('unique_id')['cumsum_amt'].shift(1)


Sorts the data by customer and transaction time to ensure calculations occur in the correct order.
Calculates the running total of how much each customer has spent over time.
Shifts this cumulative total to capture how much the customer had spent before the current transaction.
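
The effect of `cumsum()` followed by `shift(1)` is easiest to see on a few rows. The sketch below uses made-up amounts for two hypothetical customers, already in time order:

```python
import pandas as pd

# Made-up amounts for two hypothetical customers, already sorted by time.
toy = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b", "b"],
    "amt": [10.0, 20.0, 5.0, 7.0, 3.0],
})

# Running total per customer, including the current transaction.
toy["cumsum_amt"] = toy.groupby("unique_id")["amt"].cumsum()

# Shift within each customer so the value reflects spending BEFORE the current row.
# The first transaction of each customer has no history and becomes NaN.
toy["prev_cumsum_amt"] = toy.groupby("unique_id")["cumsum_amt"].shift(1)

print(toy)
```

The `NaN` on each customer's first row is what later forces the notebook to drop or fill those rows.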

In [None]:
df['trans_7d_count'] = df.groupby('unique_id').cumcount() + 1
df['prev_trans_count'] = df.groupby('unique_id')['trans_7d_count'].shift(1)
df[['trans_date_trans_time', 'amt', 'trans_7d_count', 'prev_trans_count']].head()


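`trans_7d_count` is a running transaction count for each customer. Despite its name, it counts every transaction in the customer's history up to and including the current one, not only those from the last 7 days.
`prev_trans_count` shifts that count by one to capture how many transactions occurred before the current transaction.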

In [None]:
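# Assumption: `amt_7d_sum` is never created earlier in this notebook. It is taken here to be each
# customer's 7-day rolling sum of `amt`; this relies on df still being sorted by unique_id and time.
df['amt_7d_sum'] = (df.set_index('trans_date_trans_time').groupby('unique_id')['amt']
                      .rolling('7D').sum().to_numpy())
df['spending_velocity'] = (df['amt_7d_sum'] / df['prev_trans_count']).fillna(0)
df[['amt_7d_sum', 'prev_trans_count', 'spending_velocity']].head()


Spending velocity (`spending_velocity`) divides the amount a customer spent over a recent time window (7 days in this case, `amt_7d_sum`) by the number of transactions made before the current one (`prev_trans_count`). A customer's first transaction has no prior count, so its velocity is set to 0. Because `prev_trans_count` covers the customer's whole history rather than the 7-day window, the result is a rough ratio, not a true per-transaction average. It is a useful feature for detecting abnormal spending behavior, which might indicate potential fraud.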
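
A version of this feature in which both the amount and the count come from the same 7-day window could be sketched as follows. This is one possible construction under stated assumptions, not the notebook's own code: the `toy` data is made up, and the positional assignment assumes the rows are sorted by customer and then by time.

```python
import pandas as pd

# Made-up transactions for two hypothetical customers.
toy = pd.DataFrame({
    "unique_id": ["a", "a", "a", "b"],
    "trans_date_trans_time": pd.to_datetime(
        ["2020-01-01", "2020-01-03", "2020-01-20", "2020-01-02"]
    ),
    "amt": [10.0, 30.0, 50.0, 8.0],
}).sort_values(["unique_id", "trans_date_trans_time"])

# Time-based rolling window per customer; the window ends at (and includes) the current row.
rolling_7d = (
    toy.set_index("trans_date_trans_time")
       .groupby("unique_id")["amt"]
       .rolling("7D")
)

# .to_numpy() assigns by position, which is valid only because `toy` is
# already sorted by customer and then by time.
toy["amt_7d_sum"] = rolling_7d.sum().to_numpy()
toy["trans_7d_count"] = rolling_7d.count().to_numpy()

# Average amount per transaction within the trailing 7 days.
toy["spending_velocity"] = toy["amt_7d_sum"] / toy["trans_7d_count"]

print(toy)
```

Since the window always contains the current transaction, the count is at least 1 and no division by zero or `NaN` fill is needed.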

In [None]:
def calculate_distance(row):
    customer_location = (row['lat'], row['long'])
    merchant_location = (row['merch_lat'], row['merch_long'])
    return geodesic(customer_location, merchant_location).kilometers
df['distance'] = df.apply(calculate_distance, axis=1)
df[['lat', 'long', 'merch_lat', 'merch_long', 'distance']].head()


We created a new feature that calculates the distance between the customer and the merchant for each transaction. Using the `geodesic()` function from the `geopy` library, we computed the distance in kilometers based on the latitude and longitude of both the customer and merchant. This was applied row-by-row to the DataFrame, and the resulting distances were stored in a new column, `distance`. This feature helps detect potentially fraudulent transactions that occur far from the customer's typical location.
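
Applying `geodesic()` row by row can be slow on a large DataFrame. A vectorized alternative is the haversine formula, sketched below with NumPy. Note that this swaps in a different technique: haversine treats the Earth as a sphere, whereas `geodesic()` uses an ellipsoid, so the two can differ by a fraction of a percent. The function name `haversine_km` and the radius constant are assumptions introduced for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km. Accepts scalars, NumPy arrays, or pandas Series."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# One degree of longitude along the equator, roughly 111 km.
print(haversine_km(0.0, 0.0, 0.0, 1.0))

# Usage on the notebook's columns (commented out because `df` is not defined here):
# df['distance'] = haversine_km(df['lat'], df['long'], df['merch_lat'], df['merch_long'])
```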

In [None]:
# Time-based window: sum each customer's fraud labels over the trailing 7 days (current row included).
# Assigning via .to_numpy() relies on df still being sorted by unique_id and time.
fraud_7d_sum = (df.set_index('trans_date_trans_time').groupby('unique_id')['is_fraud']
                  .rolling('7D').sum().to_numpy())
df['fraud_7d_flag'] = (fraud_7d_sum > 0).astype(int)
df[['trans_date_trans_time', 'is_fraud', 'fraud_7d_flag']].head()


In this code, we create a new feature, `fraud_7d_flag`, to track whether a customer has had any fraudulent transactions in the past 7 days. First, we use a rolling window of 7 days within each customer group to sum the `is_fraud` labels. Then, we convert this sum into a binary flag, where `1` indicates at least one fraud within the past 7 days and `0` indicates none. Because the window includes the current transaction, the flag is always `1` on fraudulent rows, so it leaks the label if it is used as a model input. The resulting feature helps to identify customers with recent fraudulent activity, which can be useful for detecting suspicious patterns.
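
If the flag is meant to be a model input, one option is to look only at transactions strictly before the current one, so the row's own label is never part of its feature. The sketch below does this with `closed="left"` on a time-based window. The column name `prior_fraud_7d_flag` and the `toy` data are assumptions made for illustration.

```python
import pandas as pd

# Made-up history for one hypothetical customer; the second transaction is fraudulent.
toy = pd.DataFrame({
    "unique_id": ["a", "a", "a", "a"],
    "trans_date_trans_time": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-20"]
    ),
    "is_fraud": [0, 1, 0, 0],
}).sort_values(["unique_id", "trans_date_trans_time"])

prior_fraud = (
    toy.set_index("trans_date_trans_time")
       .groupby("unique_id")["is_fraud"]
       .rolling("7D", closed="left")  # window [t - 7 days, t): excludes the current row
       .sum()
)

# Empty windows produce NaN, and NaN > 0 is False, so they map to 0.
# Positional assignment is valid because `toy` is sorted by customer, then time.
toy["prior_fraud_7d_flag"] = (prior_fraud.to_numpy() > 0).astype(int)

print(toy)
```

Here the fraudulent row itself gets `0`, the transaction three days later gets `1`, and the one 18 days later falls back to `0`.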

In [None]:
df = df.drop(columns=['zip'])


In [None]:
print(df.isnull().sum())
df.dropna(subset=['prev_trans_count'], inplace=True)
print(df.isnull().sum())

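In [None]:
# Assumption: the notebook never defines X and y. Here the target is taken to be
# `is_fraud` and every remaining column is treated as a feature.
X = df.drop(columns=['is_fraud'])
y = df['is_fraud'].dropna()


In [None]:
print(X.isnull().sum())
X = X.dropna()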


In [None]:
X, y = X.align(y, join='inner', axis=0)

print(X.shape, y.shape)


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print("Training set:", X_train.shape, y_train.shape)
print("Test set:", X_test.shape, y_test.shape)


In [None]:
# Check for datetime columns
datetime_cols = X_train.select_dtypes(include=['datetime', 'datetimetz']).columns
print(f"Datetime columns: {datetime_cols}")


In [None]:
# Convert datetime columns to Unix timestamp
for col in datetime_cols:
    X_train[col] = X_train[col].astype('int64') // 10**9  # Convert to Unix timestamp
    X_test[col] = X_test[col].astype('int64') // 10**9  # Ensure the test set is also converted
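
The integer division by `10**9` in the cell above assumes the datetime values are stored at nanosecond resolution. A small sketch that makes this assumption explicit by casting to `datetime64[ns]` first (the `ts` series is made up):

```python
import pandas as pd

# Two made-up timestamps one minute apart.
ts = pd.Series(pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:01:00"]))

# Force nanosecond resolution so that the integer view is nanoseconds since
# 1970-01-01, then integer-divide by 10**9 to get whole seconds.
unix_seconds = ts.astype("datetime64[ns]").astype("int64") // 10**9

print(unix_seconds.tolist())
```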


In [None]:
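# Alternative to the conversion above: drop the datetime columns if they are not needed.
# Note: run after the previous cell, this also removes the converted timestamp columns,
# because `datetime_cols` still holds their names.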
X_train = X_train.drop(columns=datetime_cols, errors='ignore')
X_test = X_test.drop(columns=datetime_cols, errors='ignore')
