In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

In [None]:
df = pd.read_csv("../data/creditcard.csv")

In [None]:
df.head()

In [None]:
df.info()

Key observations:

Size of the DataFrame:
    The DataFrame contains 284,807 entries (rows).
    There are 31 columns in total.

Non-null Counts:
    All columns have 284,807 non-null values, indicating there are no missing values in the dataset.

Data Types:
    The majority of the columns (30 out of 31) are of type float64, which suggests that they hold continuous numerical data.
    Only one column, Class, is of type int64, typically indicating a categorical variable or a label (often used in classification tasks).

Memory Usage:
    The DataFrame uses approximately 67.4 MB of memory, which matches 284,807 rows × 31 columns × 8 bytes per value.

Column Breakdown:
    The 28 columns V1 to V28 are anonymized numerical features; in the standard release of this dataset they are principal components produced by a PCA transformation.
    The remaining columns are Time, Amount (the transaction's monetary value), and Class (the label separating normal from fraudulent transactions).

Summary:
    The dataset is complete, with no missing values and a mix of float and integer data types. The structure suggests it may be used for predictive modeling in a financial context.

In [None]:
df.isnull().sum()

There are no missing values in this dataset, which confirms what df.info() showed. It is still worth verifying explicitly.

In [None]:
# Check for duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")

In [None]:
# Removing duplicates
df = df.drop_duplicates()
df.reset_index(drop=True, inplace=True)
print(f"Duplicate rows: {df.duplicated().sum()}")

In [None]:
# Since fraud detection is a classification problem, we need to check if the dataset is imbalanced
sns.countplot(x='Class', data=df)
plt.title("Class Distribution")
plt.show()

# Percentage of rows in each class (not raw counts)
class_pct = df['Class'].value_counts(normalize=True) * 100
print(class_pct)

Key Observations:
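The dataset is highly imbalanced: fraud cases make up about 0.17% of all transactions.

Because of this, we will need oversampling (for example SMOTE) or undersampling later, applied to the training split only so that the test set keeps the real class ratio.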

In [None]:
df.describe()

Descriptive Statistics:
Time: The mean is 94,811.08 with a standard deviation of 47,481.05. Values run from 0 to 172,792; if Time is measured in seconds since the first transaction, as in the standard release of this dataset, that is a span of about 48 hours.

V1 to V28: These variables have means close to 0, which is consistent with centered, PCA-derived components. Their standard deviations differ from one another, so they are centered but not scaled to unit variance.

Amount: The mean amount is 88.47, with a relatively high standard deviation of 250.40, indicating significant variability in transaction amounts. The minimum amount is 0 and the maximum is 25,691.16.

Class: This column is binary, with a maximum value of 1 and a minimum of 0, indicating it is used for classification (e.g., normal vs. fraudulent).

Distribution Insights:
The 25th percentile (Q1) and 75th percentile (Q3) for most variables suggest that many features have values that are tightly clustered around the median (50th percentile), with some outliers, particularly in the V columns.
The standard deviations for the V columns vary significantly, indicating some features are more volatile than others.

Potential Issues:
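Amount is strongly right-skewed: its standard deviation (250.40) is almost three times its mean (88.47), and its maximum (25,691.16) is far above both. A few very large transactions dominate the spread, so the column may need a log transform or outlier-robust scaling before modeling.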

Overall Structure:
All 31 columns are numeric, so no categorical encoding is needed. Because the target (Class) is binary, this is a classification task.

Summary
The dataset is complete and entirely numeric, with features on different scales and a heavily skewed Amount column. It is well suited to fraud (anomaly) detection, but the outliers and the differing scales mean that careful preprocessing will be required for effective modeling.
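
To put numbers on the skew and outliers described above, one option is Tukey's IQR rule together with a log transform. The sketch below uses synthetic log-normal amounts as a stand-in for the Amount column, so the printed values are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts standing in for the Amount column (assumption).
rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=3.0, sigma=1.5, size=10_000), name="Amount")

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as high outliers.
q1, q3 = amount.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outlier_share = (amount > upper_fence).mean()

# log1p compresses the long right tail and is defined at zero.
log_amount = np.log1p(amount)

print(f"skewness before: {amount.skew():.2f}, after log1p: {log_amount.skew():.2f}")
print(f"share above upper fence: {outlier_share:.2%}")
```

On the real data, the same lines would run against df['Amount'] before that column is scaled and dropped.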

In [None]:
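# Amount and Time are still on their original scales, unlike V1-V28, so we standardize them.
# Use one scaler per column so each keeps its own fitted mean and standard deviation
# (reusing a single scaler would overwrite the Amount parameters when fitting Time).
amount_scaler = StandardScaler()
time_scaler = StandardScaler()
df['Scaled_Amount'] = amount_scaler.fit_transform(df[['Amount']])
df['Scaled_Time'] = time_scaler.fit_transform(df[['Time']])

# Drop original columns
df.drop(columns=['Amount', 'Time'], inplace=True)
print(df.head())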

Great! Now, we have Scaled_Amount and Scaled_Time.
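
One caveat: fitting the scaler on the full dataset lets test-set statistics influence the training features. A leakage-free variant, sketched below on a small synthetic frame (the column names match this notebook, the values are made up), fits on the training rows only and reuses those parameters for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Small synthetic frame with the notebook's column names (values are made up).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Amount": rng.lognormal(3.0, 1.0, size=1000),
    "Time": rng.uniform(0, 172_792, size=1000),
    "Class": rng.integers(0, 2, size=1000),
})

train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["Class"], random_state=42
)

cols = ["Amount", "Time"]
scaled_names = ["Scaled_Amount", "Scaled_Time"]

# Fit on the training rows only, then reuse the fitted parameters on the test rows.
scaler = StandardScaler().fit(train[cols])
train_scaled = pd.DataFrame(scaler.transform(train[cols]), columns=scaled_names, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test[cols]), columns=scaled_names, index=test.index)

print(train_scaled.mean().round(3))  # approximately 0 by construction
print(test_scaled.mean().round(3))   # close to 0, but not exactly
```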

In [None]:
# To see how features are related:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Matrix")
plt.show()

Overall Structure:
The heatmap displays the correlation values between all pairs of features, with colors indicating the strength and direction of the correlations.

Strong Positive Correlations:
Look for dark red areas, which indicate strong positive correlations (close to 1). Features that are highly correlated may provide redundant information for modeling.

Strong Negative Correlations:
Dark blue areas represent strong negative correlations (close to -1). These features move in opposite directions, which can be insightful for understanding relationships in the data.

Weak Correlations:
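Lighter colors (pale blue or white) indicate weak correlations (close to 0), meaning the two features have little linear relationship. A weak correlation between two features says nothing about how useful either one is for predicting Class; only a feature's relationship with Class itself bears on that.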

Class Correlation:
Investigate how the Class feature correlates with other features. A strong correlation (positive or negative) with Class is particularly important, as it indicates potential predictors for classification.

Scaled Features:
Standardizing Amount and Time is a linear rescaling, so Scaled_Amount and Scaled_Time have the same Pearson correlations with every other column as the original Amount and Time did. Check how they correlate with the V features and with Class.

Multicollinearity:
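If several V features show high correlations with each other, it suggests multicollinearity, which might complicate model training and interpretation. Because V1 to V28 are PCA components in the standard release of this dataset, they are expected to be nearly uncorrelated with one another, and the heatmap is a quick way to confirm this.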

Feature Selection:
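The heatmap can guide feature selection and engineering. Features that are highly redundant with others, or that show almost no relationship with the target variable (Class), are candidates for removal. Pearson correlation only captures linear relationships, though, so confirm with a model-based importance measure before dropping anything.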

Summary
The correlation matrix heatmap provides a visual representation of the relationships among features. Key observations include identifying strong positive or negative correlations, assessing the relevance of features to the target variable, and recognizing multicollinearity. This analysis will inform our decisions regarding feature selection and model building.
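
Since the heatmap has no annotations, it can help to list each feature's correlation with Class directly. The sketch below does this on a small synthetic frame (V1 and V3 are built to depend on the label, V2 is pure noise); on the real data, df would replace demo.

```python
import numpy as np
import pandas as pd

# Synthetic frame: V1 and V3 depend on the label, V2 is noise (not real data).
rng = np.random.default_rng(7)
label = rng.integers(0, 2, size=5000)
demo = pd.DataFrame({
    "V1": -2.0 * label + rng.normal(size=5000),
    "V2": rng.normal(size=5000),
    "V3": 0.5 * label + rng.normal(size=5000),
    "Class": label,
})

# Pearson correlation of each feature with Class, ordered by absolute strength.
class_corr = (
    demo.corr()["Class"]
    .drop("Class")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(class_corr)
```

Sorting by absolute value keeps strongly negative predictors at the top alongside strongly positive ones.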

In [None]:
# Outlier Detection
# Fraud transactions often have outliers
plt.figure(figsize=(12,6))
sns.boxplot(x='Class', y='Scaled_Amount', data=df)
plt.title("Transaction Amount Distribution by Class")
plt.show()

Key Observations:

Class Distribution:
The plot compares the two classes: Class 0 (normal transactions) and Class 1 (fraudulent transactions).

Transaction Amounts:
Class 0: Most transaction amounts are concentrated at low values, with outliers extending far higher. Legitimate transactions are mostly small, with occasional large ones.
Class 1: The fraudulent transactions span a much tighter range. Most cluster around very low amounts, with only a few outliers.

Outliers:
There are significant outliers in Class 0, indicating that while most transactions are small, there are some high-value transactions that might warrant further investigation.
The outliers in Class 1 are less pronounced, reflecting a different pattern of fraudulent transactions, which may indicate that fraudsters are often making lower-value transactions.

Potential Fraud Patterns:
The data suggests that fraudulent transactions (Class 1) are not typically associated with high amounts, which could indicate a strategy of making low-value transactions to avoid detection.

Implications for Model Training:
The stark difference in transaction amount distributions between the two classes can inform feature selection and engineering. This feature (transaction amount) may be a significant predictor for distinguishing between the two classes.

Class Imbalance:
As the count plot showed, Class 1 makes up only about 0.17% of transactions, so the Class 1 box summarizes far fewer points than the Class 0 box. Keep this in mind when comparing the two distributions and during model training.

Summary
The boxplot shows different amount distributions for the two classes: Class 0 spans a wide range with many high-value outliers, while Class 1 clusters around lower values. This difference can inform modeling strategies, points to possible patterns in fraudulent behavior, and suggests that transaction amount may be a useful feature for classification.
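
A boxplot is hard to read precisely when one class is rare, so a per-class numeric summary is a useful companion. The sketch below uses synthetic amounts with an assumed 1% minority class; on the real data, df would replace demo.

```python
import numpy as np
import pandas as pd

# Synthetic scaled amounts with a rare positive class (illustrative values only).
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "Scaled_Amount": np.concatenate([
        rng.lognormal(0.0, 1.0, size=990),  # majority class, long right tail
        rng.lognormal(0.0, 0.5, size=10),   # minority class, tighter spread
    ]),
    "Class": [0] * 990 + [1] * 10,
})

# Count, quartiles, and extremes per class: the same numbers the boxplot draws.
summary = demo.groupby("Class")["Scaled_Amount"].describe()
print(summary[["count", "25%", "50%", "75%", "max"]])
```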

In [None]:
# After cleaning and scaling, save the dataset
df.to_csv("../data/processed_creditcard.csv", index=False)