## Credit Card Fraud Detection with PySpark

Goal: Build a logistic regression model (with and without class weighting) to detect fraudulent credit card transactions using PySpark's MLlib.

Why PySpark? Spark scales to large datasets through distributed computing, and its MLlib library covers feature assembly, model training, and evaluation in a single workflow.

# 1. Setup and Imports

In [None]:
# PySpark Setup
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline

# Evaluation with sklearn
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import matplotlib.pyplot as plt


# 2. Create Spark Session

In [None]:
spark = (
    SparkSession.builder
    .appName("Credit Card Fraud Detection")
    .getOrCreate()
)


# 3. Load and Inspect the Dataset

In [None]:
# Load CSV dataset
df = spark.read.csv("/kaggle/input/creditcardfraud/creditcard.csv", header=True, inferSchema=True)

# Show schema and sample data
df.printSchema()
df.show(5)


# 4. Preprocessing and Train-Test Split

In [None]:
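# Rename the target column to "label" and cast it to integer
df = df.withColumn("label", col("Class").cast("integer")).drop("Class")
# Drop rows that contain missing values
df = df.na.drop()

# 80/20 random split with a fixed seed for reproducibility
train_data, test_data = df.randomSplit([0.8, 0.2], seed=42)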


# 5. Address Class Imbalance with Weighting

In [None]:
# Assign higher weight to minority class (fraud)
class_weight = 20.0
train_weighted = train_data.withColumn("weight", when(col("label") == 1, class_weight).otherwise(1.0))
test_weighted = test_data.withColumn("weight", when(col("label") == 1, class_weight).otherwise(1.0))
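
The fixed weight of 20.0 is a manual choice. One common alternative is to derive the weights from the class counts in the training split, so that both classes carry the same total weight. The sketch below shows only the arithmetic in plain Python. `balanced_class_weights` is a hypothetical helper (not part of PySpark), and the counts are made-up illustrative numbers, not values from this dataset.

```python
def balanced_class_weights(n_negative: int, n_positive: int) -> dict:
    """Return per-class weights so that both classes carry equal total weight.

    Uses the heuristic weight_c = n_samples / (n_classes * n_c).
    """
    if n_negative <= 0 or n_positive <= 0:
        raise ValueError("both classes need at least one example")
    total = n_negative + n_positive
    return {0: total / (2 * n_negative), 1: total / (2 * n_positive)}


# Illustrative counts only (assumption, not taken from the dataset).
weights = balanced_class_weights(n_negative=990, n_positive=10)
print(weights[1])  # 50.0
```

In the notebook, the two counts could be read from `train_data.groupBy("label").count()`, and the resulting values could replace `class_weight` and `1.0` in the `when(...).otherwise(...)` expression.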


# 6. Feature Engineering with VectorAssembler

In [None]:
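# Select features (every column except the target and the weight).
# The loop variable is named `c` so it does not shadow the imported `col` function.
feature_cols = [c for c in df.columns if c not in ("label", "weight")]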

# Assemble feature vector
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
train_data_assembled = assembler.transform(train_data).select("features", "label")
test_data_assembled = assembler.transform(test_data).select("features", "label")


# 7. Train Logistic Regression Models

- Unweighted logistic regression

- Weighted logistic regression using a Spark Pipeline

In [None]:
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_weighted = LogisticRegression(featuresCol="features", labelCol="label", weightCol="weight")

# Train models
lr_model = lr.fit(train_data_assembled)

pipeline = Pipeline(stages=[assembler, lr_weighted])
lr_weighted_model = pipeline.fit(train_weighted)
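
To make the effect of `weightCol` more concrete: instance weights scale each row's contribution to the training loss, so a missed fraud row with weight 20 costs about as much as 20 equally wrong legitimate rows. The following plain-Python sketch computes a weighted log-loss on toy data. It assumes this mirrors the data term of the objective, omits regularization, and is not Spark's actual implementation. `weighted_log_loss` is a hypothetical name.

```python
import math


def weighted_log_loss(labels, probs, weights):
    """Weighted average of the per-row binary cross-entropy."""
    total = 0.0
    for y, p, w in zip(labels, probs, weights):
        p = min(max(p, 1e-15), 1 - 1e-15)  # clip to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)


labels = [0, 0, 0, 1]
probs = [0.1, 0.1, 0.1, 0.1]  # the model scores every row as "probably not fraud"

unweighted = weighted_log_loss(labels, probs, [1.0, 1.0, 1.0, 1.0])
weighted = weighted_log_loss(labels, probs, [1.0, 1.0, 1.0, 20.0])
print(unweighted < weighted)  # True: the missed fraud row now dominates the loss
```

With the larger weight, ignoring the minority class becomes much more expensive for the optimizer, which is the motivation for training the second model.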


# 8. Evaluate Model Performance

In [None]:
# Evaluate with AUC (area under the ROC curve, the evaluator's default metric)
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")

predictions = lr_model.transform(test_data_assembled)
predictions_weighted = lr_weighted_model.transform(test_weighted)

auc = evaluator.evaluate(predictions)
auc_weighted = evaluator.evaluate(predictions_weighted)

print(f"AUC (Unweighted): {auc:.4f}")
print(f"AUC (Weighted):   {auc_weighted:.4f}")
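
As a reminder of what the printed AUC measures: the area under the ROC curve equals the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen legitimate one. The plain-Python sketch below computes this by brute-force pairwise comparison on toy scores. `pairwise_roc_auc` is a hypothetical helper for illustration only; it is quadratic in the number of rows and is not how Spark computes the metric.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 (positive, negative) pairs are ordered correctly.
print(pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

On heavily imbalanced data, the area under the precision-recall curve is often reported alongside ROC AUC; `BinaryClassificationEvaluator` accepts `metricName="areaUnderPR"` for that.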


# 9. Classification Report

In [None]:
# Convert predictions to pandas, collecting label and prediction together so rows stay aligned
pdf = predictions.select("label", "prediction").toPandas()
y_true, y_pred = pdf["label"], pdf["prediction"]

pdf_w = predictions_weighted.select("label", "prediction").toPandas()
y_true_w, y_pred_w = pdf_w["label"], pdf_w["prediction"]

# Print classification reports
print("\nUnweighted Model Report:\n", classification_report(y_true, y_pred))
print("\nWeighted Model Report:\n", classification_report(y_true_w, y_pred_w))


# 10. Confusion Matrix Visualization

In [None]:
# Plot confusion matrices
cm = confusion_matrix(y_true, y_pred)
cm_w = confusion_matrix(y_true_w, y_pred_w)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

ConfusionMatrixDisplay(confusion_matrix=cm).plot(ax=axes[0])
axes[0].set_title("Unweighted Model")

ConfusionMatrixDisplay(confusion_matrix=cm_w).plot(ax=axes[1])
axes[1].set_title("Weighted Model")

plt.tight_layout()
plt.show()
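
When reading the plots, note that scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so a binary matrix is laid out as `[[TN, FP], [FN, TP]]`. The sketch below derives precision, recall, and F1 for the fraud class from such a matrix. `fraud_metrics` is a hypothetical helper, and the counts are toy numbers, not results from this notebook.

```python
def fraud_metrics(cm):
    """Precision, recall, and F1 for the positive class of a 2x2 matrix.

    Expects the scikit-learn layout [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy counts only (assumption): 20 actual frauds, 15 of them caught.
metrics = fraud_metrics([[90, 10], [5, 15]])
print(metrics["recall"])  # 0.75
```

The same function could be applied to `cm` and `cm_w` to compare the two models on fraud recall, the quantity class weighting is meant to raise.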


## Summary

- PySpark was used to process a real-world, imbalanced dataset

- Logistic Regression was trained using both unweighted and class-weighted approaches

- Evaluation included AUC, classification reports, and confusion matrices

- The weighted model improved detection of fraudulent cases, which corresponds to higher recall on the fraud class in the classification reports and confusion matrices