# Support Vector Machine Classifier
---

### Objectives:

- Use an SVM classifier for classification

- Preprocess data for modeling

- Implement a classifier for credit card fraud detection

### Installs:

In [0]:
%%capture
%pip install numpy==2.4.0
%pip install pandas==2.3.3
%pip install scikit-learn==1.8.0
%pip install matplotlib==3.10.8
%pip install seaborn==0.13.0

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analysis and Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Data Modeling / Linear Model / Metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay

# Warnings
import warnings
warnings.filterwarnings('ignore')

### Load the data

In [0]:
df = pd.read_csv('./data/creditcard.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)


### Understand the data

---

#### Scenario

Financial institutions and credit card companies are constantly working to recognize fraudulent credit card transactions. The main goal is to protect customers so they are not charged for items they did not purchase. The challenge lies in accurately identifying fraud patterns in real-time without blocking legitimate transactions (false positives).

#### Loading Credit Card Fraud Data

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents a snapshot of transactions that occurred over a period of two days.

#### About the Dataset

This is a classic imbalanced classification dataset, where each row represents a single transaction. The positive class (frauds) accounts for only 0.172% of all transactions (492 frauds out of 284,807). Due to confidentiality issues, most of the original features have been transformed using PCA (Principal Component Analysis).

The dataset consists of numerical variables only:

* **V1, V2, ... V28**: Principal components obtained with PCA. The original background information about these features is not available.

* **Time**: Contains the seconds elapsed between each transaction and the first transaction in the dataset.

* **Amount**: The transaction amount. This feature can be used for example-dependent cost-sensitive learning.

* **Class**: The target variable. It takes the value **1 in case of fraud** and **0 otherwise**.
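
The imbalance figures quoted above imply a useful baseline: a classifier that always predicts "no fraud" is already correct for almost every row. A minimal sketch of that arithmetic, using only the counts stated in this section (the variable names are illustrative and not part of the notebook):

```python
# Counts quoted in the dataset description above
n_total = 284_807
n_fraud = 492

fraud_rate = n_fraud / n_total        # share of positive (fraud) rows
baseline_accuracy = 1 - fraud_rate    # accuracy of always predicting class 0

print(f'Fraud rate: {fraud_rate:.3%}')
print(f'Always-"no fraud" accuracy: {baseline_accuracy:.2%}')
```

This is why accuracy alone is a weak yardstick for this dataset and why recall, precision, and AUC-ROC are reported later.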

### Explore the data
First, consider a statistical summary of the data.

In [0]:
df.describe()

In [0]:
df.info()

### Check distribution of target variable

In [0]:
# Create Figure
plt.figure(figsize = (8, 6))

# Count Plot
ax = sns.countplot(
    x = 'Class',
    data = df,
    edgecolor = 'white',
    palette = 'pastel'
)

plt.title('Class Distribution (No Fraud vs. Fraud)', fontsize = 14)
plt.xlabel('Class (0: No Fraud, 1: Fraud)', fontsize = 12)
plt.ylabel('Count (Log Scale)', fontsize = 12)

# Adjusting the y-axis to the logarithmic scale to visualize the minority class.
ax.set_yscale('log')

# Add the exact numbers above the bars 
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}',
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha = 'center',
                va = 'center',
                xytext = (0, 5),
                textcoords = 'offset points', 
                fontsize = 12
    )

plt.tight_layout()
plt.show()

### Checking the correlations between the variables

In [0]:
df.corr()['Class'].abs().sort_values(ascending = False)

In [0]:
plt.rc('font', size = 6)
fig, ax = plt.subplots(figsize = (20, 12))
sns.heatmap(df.corr(), annot = True, cmap = 'coolwarm', ax = ax)
ax.set_title('Correlation Features Matrix')
plt.tight_layout()
plt.show()

In [0]:
# Collecting data
correlation_values = df.corr()['Class']

if 'Class' in correlation_values.index:
    plot_data = correlation_values.drop('Class').sort_values()

else:
    plot_data = correlation_values.sort_values()


# Colors 
colors = ['#f44336' if x > 0 else '#2196f3' for x in plot_data]

# Figure
plt.figure(figsize = (10, 6))

# Plot
plot_data.plot(
    kind = 'barh', 
    color = colors, 
    edgecolor = 'white'
)

plt.title('Correlation of Features with Fraud (Class)', fontsize = 10)
plt.xlabel('Correlation Coefficient (Pearson)', fontsize = 10)
plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.grid(axis = 'x', linestyle = '--', alpha = 0.7)

sns.despine(left = True, bottom = True) 

plt.tight_layout()
plt.show()

### Extract the input features and labels from the data set
Separate the input features from the target column. `X` remains a DataFrame and `y` a Series; the scaling step later converts the features to NumPy arrays.


In [0]:
# Shape Dataset
df.shape

In [0]:
X = df.drop(columns = ['Class']).copy()
y = df['Class'].astype('int').copy()

print(f'Shape of X: {X.shape}')
print(f'Shape of y: {y.shape}')


### Preprocess selected features and train/test split

Create train and test datasets

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 33)

The next step is to standardize the input variables so that the model does not inadvertently favor any variable due to its magnitude. The typical way to do this is to subtract the mean and divide by the standard deviation.

In [0]:
std_scaler = StandardScaler()
X_train = std_scaler.fit_transform(X_train)
X_test = std_scaler.transform(X_test)


In [0]:
pd.DataFrame(X_train).head()

In [0]:
pd.DataFrame(X_train).describe().round(2)

A standardized variable has zero mean and a standard deviation of one.
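
The standardization step can be reproduced by hand to see what it does. The sketch below uses made-up values for a single feature and assumes the same convention as `StandardScaler`: the mean and the population standard deviation (`ddof=0`) are estimated on the training split only and then reused for the test split.

```python
import numpy as np

# Toy data standing in for one feature column (values are made up)
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([25.0, 50.0])

# Statistics are estimated on the training split only
mean = train.mean()
std = train.std()   # population std (ddof=0), as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuse the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

Only the training split is guaranteed to end up with zero mean and unit standard deviation. The test split is shifted and scaled by the same constants, which keeps information about the test data out of the preprocessing.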

### Support Vector Machine model

In [0]:
# Create a model object
SVM = LinearSVC(class_weight = 'balanced', loss = 'hinge', fit_intercept = False, random_state = 33)

# Train the model in the training data
SVM.fit(X_train, y_train)
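
The `class_weight = 'balanced'` option reweights each class inversely to its frequency, using the heuristic `n_samples / (n_classes * class_count)` described in the scikit-learn documentation. A rough sketch of the resulting weights, assuming the class counts of the full dataset quoted earlier (the model itself derives them from `y_train`, so the actual values differ slightly):

```python
# Class counts quoted in the dataset description (full dataset, not the training split)
counts = {0: 284_807 - 492, 1: 492}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f'class {label}: weight = {weight:.2f}')
```

Under these assumptions, misclassifying a fraud costs the optimizer several hundred times more than misclassifying a legitimate transaction, which pushes the model towards higher recall on the minority class.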

In [0]:
# Predict the target variable in the test data
y_pred = SVM.predict(X_test)
y_pred[: 10]

In [0]:
# Collect the decision scores (signed distances to the hyperplane, not probabilities)
y_prob = SVM.decision_function(X_test)
y_prob[ : 10]
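
`decision_function` returns a signed score per sample rather than a probability, and `predict` assigns class 1 when that score is positive. The sketch below illustrates the relationship with hypothetical weights and samples (none of these numbers come from the fitted model). It assumes no intercept term, matching `fit_intercept = False` above.

```python
import numpy as np

# Hypothetical weights and standardized samples (not taken from the fitted model)
w = np.array([2.0, -1.0])
X_demo = np.array([
    [1.0, 0.5],    # score = 2.0 - 0.5 = 1.5
    [0.2, 1.0],    # score = 0.4 - 1.0 = -0.6
    [-1.0, -1.0],  # score = -2.0 + 1.0 = -1.0
])

# Without an intercept, the score is the dot product alone
scores = X_demo @ w

# A positive score is assigned to class 1, otherwise class 0
labels = (scores > 0).astype(int)

print(scores)
print(labels)
```

Because the scores preserve the ranking of samples, they can be passed to `roc_auc_score` directly even though they are not probabilities.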

### Model Evaluation

In [0]:
coefficients = pd.Series(SVM.coef_[0], index = X.columns)
coefficients

In [0]:
# Data collect
data_ax = coefficients.sort_values().reset_index()
data_ax.columns = ['Feature', 'Coefficient']

# Create a category used to color the bars automatically
# If > 0 the feature pushes towards Fraud (High Risk/red), if < 0 towards Normal (Low Risk/teal)
data_ax['Impact'] = data_ax['Coefficient'].apply(lambda x: 'High Risk' if x > 0 else 'Low Risk')

# Figure
plt.figure(figsize = (12, 8))
sns.set_style('whitegrid')

# Barplot
sns.barplot(
    data = data_ax,
    y = 'Feature',
    x = 'Coefficient',
    edgecolor = 'black',
    hue = 'Impact',
    dodge = False,
    palette = {'High Risk': 'firebrick', 'Low Risk': 'teal'}
)

plt.axvline(x = 0, color = 'black', linestyle = '--', linewidth = 1)
plt.title('Feature Importance (Linear SVM Coefficients)', fontsize = 15)
plt.xlabel('Coefficients', fontsize = 12)
plt.ylabel('Variables', fontsize = 12)
plt.legend(title = 'Type of Impact', loc = 'upper right', fontsize = 12)

plt.tight_layout()
plt.show()

In [0]:
print(classification_report(y_test, y_pred))

In [0]:
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, 
    display_labels = ['No-Fraud', 'Fraud'], 
    cmap = 'Blues', 
    values_format = 'd'
)

plt.grid(False) 
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

In [0]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(f'AUC-ROC: {roc_auc_score(y_test, y_prob):.2f}')
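
AUC-ROC can be read as the probability that a randomly chosen fraud receives a higher score than a randomly chosen legitimate transaction. The sketch below computes it by brute force over all such pairs on made-up labels and scores. The helper `pairwise_auc` is a hypothetical illustration, not the routine `roc_auc_score` uses internally.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_demo = [0, 0, 0, 0, 1, 1]
scores_demo = [-2.0, -1.0, 0.5, -0.3, 1.2, -0.5]

auc_demo = pairwise_auc(y_demo, scores_demo)
print(f'AUC: {auc_demo:.3f}')
```

Unlike accuracy, this quantity does not depend on how many negatives there are relative to positives, which makes it a more informative summary for an imbalanced problem.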


### Conclusion

---

- The **Linear SVM** model separates the two classes well. Overall **Accuracy is ~100%**, but that figure is inflated by class imbalance, since a model that never predicts fraud would score almost the same. The more informative metric is the **Area Under the ROC Curve (AUC-ROC) of 0.94**, which is far better than chance (0.5). For the minority class (Fraud), the model reaches a **Recall of 0.80**, meaning it flags 80% of the fraudulent transactions in the test set while maintaining a reasonable precision.

- The analysis of the coefficients reveals clear vectors of influence, divided into risk factors (pushing towards Fraud) and protective factors (indicating Normal behavior):

  - The variables `V11` (+6.00), `V4` (+4.73), and `V2` (+4.20) show the highest positive coefficients. This indicates that high values in these specific Principal Components are the main drivers for classifying a transaction as **Fraudulent**.
  
  - The variables `V17` (-15.38), `V12` (-10.60), and `V14` (-10.57) act as the strongest protective factors. They possess the largest absolute weights in the model, meaning they are the most critical features for confirming the legitimacy of a transaction. A high value in `V17` is the strongest mathematical indicator of a **Normal** transaction.
  
  - Original features such as `Time` (-0.04) and `Amount` (+0.15) showed coefficients relatively close to zero compared to the PCA components. This indicates that, once standardized, the transaction amount and timestamp carry far less weight in this linear model than the structural patterns captured by the PCA-transformed features (V1-V28).

- The **F1-Score of 0.77** for the Fraud class suggests a balanced trade-off between precision and recall, meaning the model is aggressive enough to catch fraud without generating an unmanageable number of false positives for the bank to investigate.
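
The precision behind these figures can be estimated from the reported recall and F1-score by rearranging F1 = 2PR / (P + R). This is a back-of-the-envelope sketch: both inputs are rounded to two decimals, so the result is approximate and the classification report remains the authoritative source.

```python
recall = 0.80   # reported for the Fraud class
f1 = 0.77       # reported for the Fraud class

# Rearranging F1 = 2PR / (P + R) for P gives P = F1 * R / (2R - F1)
precision = f1 * recall / (2 * recall - f1)

print(f'Implied precision: {precision:.2f}')
```

A precision in this range would mean that roughly one in four flagged transactions is a false alarm.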