### **Part-1: Logistic Regression** | *Detecting card fraud*

**1) Import Data from Drive.**

In [None]:
import pandas as pd
import numpy as np
from pandas.api.types import is_bool_dtype
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

#upload to google drive, read file path into pandas df
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/card_fraud.csv'
df = pd.read_csv(file_path)

# Check df.
print(df.dtypes)
df.head()
print(f"Number of observations in dataset: {len(df)}")

Mounted at /content/drive
distance_from_home                float64
distance_from_last_transaction    float64
ratio_to_median_purchase_price    float64
repeat_retailer                      bool
used_chip                            bool
used_pin_number                      bool
online_order                         bool
fraud                              object
dtype: object
Number of observations in dataset: 1000000


**2) Train-Test Split**

- Target variable: Y := [fraud] , where $Y \in \{0,1 \} $. We encode "fraud" as 1 and "not fraud" as 0.
- Predictors: X
- There are 1,000,000 entries. We use only 10% of them (100,000 entries) to train the model; the remaining 900,000 are held out for testing.
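To illustrate what a stratified split achieves, here is a minimal hand-rolled sketch on hypothetical toy labels. It is not scikit-learn's implementation; the function name `stratified_split` and the toy data are assumptions made purely for illustration.

```python
import random

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) so that each class is split in the same ratio."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    # Shuffle within each class, then cut each class at the same fraction.
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Hypothetical toy labels: 9% positives (1), 91% negatives (0).
labels = [1] * 90 + [0] * 910
train_idx, test_idx = stratified_split(labels, test_size=0.9)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train size = {len(train_idx)}, positive rate = {train_rate:.4f}")
print(f"test size  = {len(test_idx)}, positive rate = {test_rate:.4f}")
```

Because each class is cut at the same fraction, both subsets inherit the class balance of the full data, which is the property checked with `value_counts(normalize=True)` in this notebook.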

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Y = [fraud]; map 'Not Fraud' --> 0 , 'Fraud' --> 1.
y = df['fraud'].map({'Not Fraud': 0, 'Fraud': 1})

# X = all columns in df apart from 'fraud'(target column)
X = df.drop('fraud', axis=1)

#1) Split;
# stratify y-values across train and test datasets, so that the train and test data sets
# have the same proportion of "fraud":"not fraud" label counts.
# random seed for reproducibility.
# 10% for train, 90% for test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, random_state=42, stratify=y)

#2) Check dimensions of feature, output dfs across train,test splits:
print(f"Shape of X_train (features for training): {X_train.shape}")
print(f"Shape of X_test (features for testing): {X_test.shape}")
print(f"Shape of y_train (target for training): {y_train.shape}")
print(f"Shape of y_test (target for testing): {y_test.shape}")


#3) Check stratification: look at y-label counts across test, train sets:
#(i) Train proportions
print("\ny_train value counts:")
display(y_train.value_counts(normalize=True))#normalize=True returns the proportions, rather than the raw count of labels.

#(ii) Test proportions
print("\ny_test value counts:")
display(y_test.value_counts(normalize=True))

#4) Check features in training data
print("\nX_train head:")
display(X_train.head())


Shape of X_train (features for training): (100000, 7)
Shape of X_test (features for testing): (900000, 7)
Shape of y_train (target for training): (100000,)
Shape of y_test (target for testing): (900000,)

y_train value counts:


Unnamed: 0_level_0,proportion
fraud,Unnamed: 1_level_1
0,0.9126
1,0.0874



y_test value counts:


Unnamed: 0_level_0,proportion
fraud,Unnamed: 1_level_1
0,0.912597
1,0.087403



X_train head:


Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order
516187,19.179396,0.178758,2.979353,True,False,False,True
419611,47.192898,1.224832,0.293538,True,True,False,True
955905,54.389043,5.29091,4.492304,True,True,False,False
739350,3.129745,0.607212,0.357527,True,False,False,True
54077,0.925275,2.238057,0.684942,False,False,False,False


***Question:*** Imagine we always predict "Not Fraud". What accuracy score (i.e., **proportion correctly classified**) do we get on the training set? On the test set? Why can there not be any overfitting here?

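**Ans.** "Not Fraud" labels make up 91.26% of both the train and the test set (the split is stratified), so always predicting "Not Fraud" gives an accuracy of about 91.26% on each. There can be no overfitting because this rule fits nothing to the features: it ignores them entirely, so its accuracy depends only on the class proportions, which are the same in both sets.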
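A minimal sketch of this baseline, using hypothetical toy labels with roughly the same class balance as our data (the list `y_toy` is an assumption for illustration, not the real dataset):

```python
# Toy labels: 874 fraud (1) and 9126 not fraud (0), i.e. about 8.74% fraud.
y_toy = [1] * 874 + [0] * 9126

# "Always predict Not Fraud" means every prediction is 0.
y_pred = [0] * len(y_toy)

# Accuracy: share of labels predicted correctly.
accuracy = sum(t == p for t, p in zip(y_toy, y_pred)) / len(y_toy)

# Recall for fraud: share of actual fraud cases that are flagged.
recall_fraud = sum(t == 1 and p == 1 for t, p in zip(y_toy, y_pred)) / sum(y_toy)

print(f"accuracy     = {accuracy:.4f}")
print(f"recall fraud = {recall_fraud:.4f}")
```

The baseline reaches a high accuracy while detecting no fraud at all, which is why accuracy alone is a weak yardstick on imbalanced data.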

**3) Fit the Logistic Regression Model.**

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=0.1, solver='saga', max_iter=5000, random_state=42, tol=1e-4)

print("Fitting Logistic Regression model...")
model.fit(X_train, y_train)

print("Model fitting complete!")
# Try 1 with max_iter=2000: hit the iteration limit without converging.
# Try 2 with max_iter=5000: converged.

Fitting Logistic Regression model...
Model fitting complete!


**4) Predict labels on test data using the model.**

In [None]:
##The .predict() method uses the trained model to predict the class labels

print("Generating predictions ...")
y_test_pred = model.predict(X_test) #predict on test
y_train_pred = model.predict(X_train) #predict on train
print("Predictions generated!")

##Display a snapshot of the first 10 predictions alongside the actual test labels
print("\nSnapshot of predictions vs. actual labels (first 10):")
predictions_df = pd.DataFrame({'Actual': y_test.head(10), 'Predicted': y_test_pred[:10]})
display(predictions_df)

## View the model coefficients
print("\nModel Coefficients:")
##Create a DataFrame to easily view feature names alongside their coefficients
coefficients_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_[0]})
display(coefficients_df.sort_values(by='Coefficient', ascending=False))

## Calculate the accuracy score on the test and train data
from sklearn.metrics import accuracy_score

## compares true labels from y_test vs. predicted labels from y_test_pred
on_test_accuracy = accuracy_score(y_test, y_test_pred)

on_train_accuracy = accuracy_score(y_train, y_train_pred)

print(f"\nAccuracy Score on Test Data: {on_test_accuracy:.4f}")
print(f"\nAccuracy Score on Train Data: {on_train_accuracy:.4f}")



Generating predictions ...
Predictions generated!

Snapshot of predictions vs. actual labels (first 10):


Unnamed: 0,Actual,Predicted
865896,0,0
804784,0,0
533096,0,0
814787,0,0
858482,0,0
791565,0,0
814352,0,0
822115,0,0
721250,0,0
893796,0,0



Model Coefficients:


Unnamed: 0,Feature,Coefficient
2,ratio_to_median_purchase_price,0.300568
1,distance_from_last_transaction,0.007472
0,distance_from_home,0.005832
5,used_pin_number,-0.272783
6,online_order,-0.363424
4,used_chip,-0.518938
3,repeat_retailer,-1.14849



Accuracy Score on Test Data: 0.9196

Accuracy Score on Train Data: 0.9188


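Notice that the accuracy on the train data (0.9188) is slightly lower than the accuracy on the test data (0.9196).

Since the test accuracy is no worse than the train accuracy, the model is generalizing well, and I am not worried about overfitting. Both scores, however, are only marginally above the 91.26% obtained by always predicting "Not Fraud".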

**5) Analyze the confusion matrix.**

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_test_pred)

print("\nConfusion Matrix:")
## Display the confusion matrix. Using a DataFrame for better readability.
cm_df = pd.DataFrame(cm, index=['Actual Not Fraud', 'Actual Fraud'], columns=['Predicted Not Fraud', 'Predicted Fraud'])
display(cm_df)


Confusion Matrix:


Unnamed: 0,Predicted Not Fraud,Predicted Fraud
Actual Not Fraud,811871,9466
Actual Fraud,62933,15730


- The most important entry is the **false negatives** (actual fraud that our model fails to detect). Here, 62,933 of the 78,663 actual fraud cases (about 80%) are misclassified as "Not Fraud", which is concerning.
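As a quick check, the rates can be recomputed directly from the four counts in the confusion matrix above. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# Counts copied from the test-set confusion matrix above.
tn, fp = 811_871, 9_466    # actual Not Fraud row
fn, tp = 62_933, 15_730    # actual Fraud row

recall = tp / (tp + fn)                # share of actual fraud that is caught
false_negative_rate = fn / (tp + fn)   # share of actual fraud that is missed
precision = tp / (tp + fp)             # share of fraud predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall              = {recall:.4f}")
print(f"false negative rate = {false_negative_rate:.4f}")
print(f"precision           = {precision:.4f}")
print(f"accuracy            = {accuracy:.4f}")
```

The accuracy recomputed this way agrees with the 0.9196 reported by `accuracy_score`, while the recall shows how little of the fraud is actually caught.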

### **Part-2) Naive Bayes by hand**

We have 7 features: $X_1,...,X_7$ are random variables.
Assume $(X_i|Y=k) \sim f_{ki}$, where $f_{ki}$ is the $N(\mu_{ki}, \sigma_{ki}^2)$ density when $X_i$ is continuous, and a discrete conditional probability mass function when $X_i$ is discrete.

This is a binary classification problem, so $Y \in \{0,1\}$.

We use the notation: $X=x$ to denote the event: $X_i = x_i$ for each $i=1,2,...,7$.

We use the Naive Bayes assumption: conditional on $Y=k$, the features $X_1,...,X_7$ are independent, so:
$$P(X=x|Y=k) = \prod_{i=1}^{7}P(X_i=x_i|Y=k)$$
that is:
$$ P(X=x|Y=k) = \prod_{i=1}^{7} f_{ki}(x_i)  $$

1. Bayes' theorem (posteriors):
$$P(Y=k|X=x) = \frac{P(X=x|Y=k)\cdot P(Y=k)}{P(X=x)}$$

2. We denote $\pi_{k} := P(Y=k)$; these are the prior probabilities.

3. The denominator $P(X=x)$ does not need to be calculated for these predictions: it is the same for both classes, so we only need to compare the numerators (the "scores") of the two labels (1/0).
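Putting these three points together, the prediction rule can be sketched in the notation above as follows, where $\hat{y}(x)$ is a symbol introduced here for the predicted label:

$$\hat{y}(x) = \arg\max_{k \in \{0,1\}} \; \pi_k \prod_{i=1}^{7} f_{ki}(x_i)$$

If an actual posterior probability is wanted, the denominator can be recovered from the two numerators by the law of total probability:

$$P(X=x) = \sum_{l \in \{0,1\}} \pi_l \prod_{i=1}^{7} f_{li}(x_i)$$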

### 1) **Compute the priors ($P(Y=k)$)**

1. $\pi_{Fraud}$ : count(Y=1)/#Y entries
2. $\pi_{Not Fraud}$ : count(Y=0)/#Y entries

In [None]:
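# Total number of observations.
# Note: the priors here use the full label vector y (train + test). Because the split
# is stratified, y_train gives essentially the same proportions (0.0874 vs. 0.087403).
n_total = len(y)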

# Number of fraud transactions (y = 1)
n_fraud = (y == 1).sum()
# Number of non-fraud transactions (y = 0)
n_not_fraud = (y == 0).sum()

# Compute prior probabilities
pi_fraud = n_fraud / n_total
pi_not_fraud = n_not_fraud / n_total

# store the two priors in a dictionary
prior_dct = {
    "Fraud": pi_fraud,
    "Not Fraud": pi_not_fraud
}

#Check prior_dct
print(f"Prior Probability of Fraud: {prior_dct['Fraud']}")
print(f"Prior Probability of Not Fraud: {prior_dct['Not Fraud']}")

# Sanity check: the two priors should sum to 1
print(prior_dct["Fraud"] + prior_dct["Not Fraud"])

Prior Probability of Fraud: 0.087403
Prior Probability of Not Fraud: 0.912597
1.0


**We now concatenate the X_train and y_train sets into one dataframe.**

In [None]:
df_train = pd.concat([X_train, y_train], axis=1)

# Display the first few rows of the new DataFrame
print("First 5 rows of df_train:")
display(df_train.head())

# Check the shape of the new DataFrame
print(f"\nShape of df_train: {df_train.shape}")

# Verify the columns of df_train
print("\nColumns in df_train:")
print(df_train.columns)

First 5 rows of df_train:


Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
516187,19.179396,0.178758,2.979353,True,False,False,True,0
419611,47.192898,1.224832,0.293538,True,True,False,True,0
955905,54.389043,5.29091,4.492304,True,True,False,False,0
739350,3.129745,0.607212,0.357527,True,False,False,True,0
54077,0.925275,2.238057,0.684942,False,False,False,False,0



Shape of df_train: (100000, 8)

Columns in df_train:
Index(['distance_from_home', 'distance_from_last_transaction',
       'ratio_to_median_purchase_price', 'repeat_retailer', 'used_chip',
       'used_pin_number', 'online_order', 'fraud'],
      dtype='object')


### **2) Define Gaussian Helper Function**

Recall, we assumed that $(X_i|Y=k) \sim N(\mu_{ki}, \sigma_{ki}^2)$ when $X_i$ is a **quantitative** (continuous) random variable.

We define a `Gaussian_helper` function that accepts a dataframe, a class `k` (Y-value) and a predictor $X_i$ (that is, a column `col`), and estimates the parameters $\mu_{ki}, \sigma_{ki}$ from the rows belonging to that class.
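To make the role of the two parameters concrete, here is a minimal standalone sketch using only the standard library. The sample values and the function name `gaussian_pdf` are hypothetical; the sample standard deviation uses the $n-1$ denominator, which is also the default of pandas' `.std()`.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical values of one continuous feature within one class.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu_hat = statistics.mean(sample)      # estimate of the class mean
sigma_hat = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

# Likelihood of a new value x = 3.0 under the fitted class-conditional Gaussian.
density = gaussian_pdf(3.0, mu_hat, sigma_hat)
print(f"mu_hat = {mu_hat:.4f}, sigma_hat = {sigma_hat:.4f}, f(3.0) = {density:.4f}")
```

The estimated mean and standard deviation are all that is needed to evaluate the class-conditional density at any new value of the feature.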



In [None]:
def Gaussian_helper(df, k, col):
    """
    Calculates the mean and standard deviation of a specified column
    for a given class ('Fraud' or 'Not Fraud') within a DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame (e.g., df_train).
        k (str): The class, either 'Fraud' or 'Not Fraud'.
        col (str): The name of the continuous column for which to calculate statistics.

    Returns:
        dict: A dictionary containing the mean and standard deviation for the column
              within the specified class.
    """
    # 1) Map the string class 'k' to its numerical representation
    class_value = 1 if k == 'Fraud' else 0

    # 2) Filter the DataFrame for the specified class
    df_filtered = df[df['fraud'] == class_value]

    # 3) Calculate mean and standard deviation for the specified column
    mean_val = df_filtered[col].mean()
    std_val = df_filtered[col].std()

    return {'mean': mean_val, 'std': std_val}

### **Test Gaussian Helper Function**

After defining the function, let's check its usage with one of the continuous columns ('distance_from_home') for both 'Fraud' and 'Not Fraud' classes using the `df_train` DataFrame, and print the results to verify its correctness.


In [None]:
print("Testing Gaussian_helper function...")

# 1. Call for 'Fraud' class and 'distance_from_home' column
fraud_stats = Gaussian_helper(df_train, 'Fraud', 'distance_from_home')

# 2. Print fraud_stats
print("\nStatistics for 'Fraud' transactions (distance_from_home):")
print(fraud_stats)

# 3. Call for 'Not Fraud' class and 'distance_from_home' column
not_fraud_stats = Gaussian_helper(df_train, 'Not Fraud', 'distance_from_home')

# 4. Print not_fraud_stats
print("\nStatistics for 'Not Fraud' transactions (distance_from_home):")
print(not_fraud_stats)

print("Gaussian_helper function tested successfully!")

Testing Gaussian_helper function...

Statistics for 'Fraud' transactions (distance_from_home):
{'mean': np.float64(66.4612134352921), 'std': 137.8794574299837}

Statistics for 'Not Fraud' transactions (distance_from_home):
{'mean': np.float64(22.871919462848016), 'std': 55.12907441598514}
Gaussian_helper function tested successfully!


**Compare calculated values of the Gaussian helper function with column means:**

In [None]:
df_train.groupby("fraud").mean()

Unnamed: 0_level_0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order
fraud,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,22.871919,4.268299,1.435052,0.880835,0.35778,0.110684,0.622891
1,66.461213,13.193887,5.986008,0.880206,0.252403,0.003204,0.947368


### **3) Define Boolean helper function (Discrete Case)**

When $X_i$ is a categorical variable, we estimate $f_{ki}$ by the proportion of training observations in class $k$ for which $X_i$ takes each value.

For instance, suppose $X_i \in \{a,b,c\}$ and there are 100 observations in the $(Y=0)$ class. Suppose $X_i$ takes on the values a, b, c in 20, 30 and 50 of these 100 observations respectively.

Then we estimate the conditional probability mass function:
$$f_{0i}(x_i) = 0.2 \text{ if } x_i = a$$
$$f_{0i}(x_i) = 0.3 \text{ if } x_i = b$$
$$f_{0i}(x_i) = 0.5 \text{ if } x_i = c$$

In our case, the predictors *repeat_retailer, used_chip, used_pin_number, online_order* are all boolean; thus we define a boolean helper function to estimate $f_{ki}$ for each of these predictors.
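The a/b/c example above can be reproduced in a few lines. This is a minimal sketch with the standard library; the list `values_in_class_0` simply encodes the 20/30/50 counts from the example.

```python
from collections import Counter

# The 100 observations of the categorical feature within class Y = 0, as in the example above.
values_in_class_0 = ["a"] * 20 + ["b"] * 30 + ["c"] * 50

# Estimated conditional pmf: count of each value divided by the class size.
counts = Counter(values_in_class_0)
pmf = {value: count / len(values_in_class_0) for value, count in counts.items()}
print(pmf)

# A value never seen in this class gets estimated probability 0.
print(pmf.get("d", 0.0))
```

The zero probability for an unseen value is the same fallback that a `.get(..., 0.0)` lookup provides; note that a single zero factor makes the whole Naive Bayes product zero for that class.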

In [None]:
def Boolean_helper(df, k, col):
    """
    Calculates the proportion of True and False values for a specified boolean column
    within a given class ('Fraud' or 'Not Fraud').

    Args:
        df (pd.DataFrame): The input DataFrame (e.g., df_train).
        k (str): The class, either 'Fraud' or 'Not Fraud'.
        col (str): The _name_ of the boolean column for which to calculate proportions.

    Returns:
        dict: A dictionary with boolean keys (True, False) representing the proportions
              of these values within the specified class and column.
    """
    # 1) Map the string class 'k' to its numerical representation
    class_value = 1 if k == 'Fraud' else 0

    # 2) Filter the DataFrame for the specified class
    df_filtered = df[df['fraud'] == class_value]

    # 3) Calculate value counts and normalize to get proportions
    # Ensure the column is treated as boolean explicitly to get True/False keys
    proportions = df_filtered[col].astype(bool).value_counts(normalize=True)

    # 4) Store proportions in a dictionary with boolean keys
    output_dct = {
        # If True/False does not exist in proportions, we return 0.
        True: proportions.get(True, 0.0),  # Use .get() to handle cases where a value might not exist
        False: proportions.get(False, 0.0)
    }
    return output_dct

### **Test Boolean helper**

In [None]:
Boolean_helper(df_train, 'Fraud', 'repeat_retailer')

{True: np.float64(0.8802059496567506), False: np.float64(0.11979405034324943)}

***Interpretation*** : Among the fraudulent transactions in the training set, 88.02% were at repeat retailers, while 11.98% were not. This matches the `repeat_retailer` mean for class 1 (0.880206) in the `groupby` output above.

### 4) **Computing the Numerator in Naive Bayes**

We set up a new data frame `X_temp` to compute the value:
$$ P(X=x|Y=k) = \prod_{i=1}^{7} f_{ki}(x_i)  $$

`X_temp` will store the values $f_{ki}(x_i)$ for $k=1$ and for each $i$.

Recall, $f_{ki}(x_i)$ estimates $P(X_i = x_i|Y=k)$ (a density value when $X_i$ is continuous).

**Steps:**
1. Fix $k =$ "Fraud"
2. Copy the original dataframe `X_train` into `X_temp`.
3. For each column of `X_temp`, replace each value $x_i$ with $f_{1i}(x_i)$: when $X_i$ is quantitative, evaluate the Gaussian density with the parameters from `Gaussian_helper`; when $X_i$ is Boolean, look up the proportion from `Boolean_helper`.


In [None]:
# Assign k = "Fraud"
k = "Fraud"

# Copy X_train into a new DataFrame called X_temp
X_temp = X_train.copy()

from pandas.api.types import is_bool_dtype, is_float_dtype
import numpy as np

print(f"Transforming features for class: {k}\n")

# Loop over the columns of X_temp
for col in X_temp.columns:
    # Determine if the column is boolean
    if is_bool_dtype(X_temp[col]):
        # Use Boolean_helper to get the proportions for the current column and class
        boolean_props = Boolean_helper(df_train, k, col)
        # Map the original boolean values to their corresponding proportions
        X_temp[col] = X_temp[col].map(boolean_props)

    # Determine if the column is a float (continuous)
    elif is_float_dtype(X_temp[col]):
        # Use Gaussian_helper to get the mean and standard deviation for the current column and class
        gaussian_stats = Gaussian_helper(df_train, k, col)
        mean = gaussian_stats['mean']
        std = gaussian_stats['std']

        # Handle the degenerate case std == 0 to avoid division by zero in the PDF.
        # All class-k training values then equal the mean, so the fitted distribution is a
        # point mass. As a numerical approximation, assign likelihood 1.0 to values equal
        # to the mean and a tiny epsilon to all other values.
        if std == 0:
            epsilon = 1e-9 # A very small number
            pdf_values = np.where(X_temp[col] == mean, 1.0, epsilon) # Assign 1 if matches mean, else epsilon
        else:
            # Calculate the Gaussian Probability Density Function (PDF) for each value
            # PDF formula: (1 / (sqrt(2 * pi) * sigma)) * exp(-((x - mu)^2 / (2 * sigma^2)))
            exponent = -((X_temp[col] - mean)**2) / (2 * std**2)
            pdf_values = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(exponent)

        # Replace the original values with their PDF likelihoods
        X_temp[col] = pdf_values

# Display the first few rows of the transformed X_temp DataFrame
print("First 5 rows of X_temp (transformed probabilities/likelihoods for 'Fraud' class):\n")
display(X_temp.head())

#The 0-th column contains the index of the corresponding observations (rows) lifted from X_train.

Transforming features for class: Fraud

First 5 rows of X_temp (transformed probabilities/likelihoods for 'Fraud' class):



Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order
516187,0.002728,0.007244,0.058531,0.880206,0.747597,0.996796,0.947368
419611,0.002865,0.007278,0.042382,0.880206,0.252403,0.996796,0.947368
955905,0.002882,0.007381,0.064305,0.880206,0.252403,0.996796,0.052632
739350,0.002604,0.007258,0.042809,0.880206,0.747597,0.996796,0.947368
54077,0.002584,0.007307,0.044979,0.119794,0.747597,0.996796,0.052632


Each entry is the estimated likelihood of the observed feature value given that the class is "Fraud": a conditional probability for the Boolean columns, and a Gaussian density value (not a probability) for the continuous columns.
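For the continuous columns it helps to remember that a density is not bounded by 1. The sketch below is a minimal illustration; `gaussian_pdf` is a helper name chosen here, and the standard deviation 137.88 is the rounded `std` printed earlier for `distance_from_home` in the Fraud class.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# A narrow Gaussian has a peak density well above 1.
narrow_peak = gaussian_pdf(0.0, 0.0, 0.1)

# A wide Gaussian (sigma close to the Fraud-class std of distance_from_home) has a tiny peak.
wide_peak = gaussian_pdf(0.0, 0.0, 137.88)

print(f"peak density, sigma = 0.1:    {narrow_peak:.4f}")
print(f"peak density, sigma = 137.88: {wide_peak:.6f}")
```

This explains why the `distance_from_home` entries are all so small: with a standard deviation near 138, the density cannot exceed roughly 0.0029 anywhere.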

### **Calculating the Numerator**

Steps:
1) Compute the product $ \prod_{i=1}^{7} f_{ki}(x_i)  $ for each of the classes { "Fraud", "Not Fraud" }; then multiply by the prior probability $P(Y=k)$ for that class, which was previously computed and stored in `prior_dct`. The result is the numerator of Bayes' theorem.

2) Once the code is working, wrap the whole thing in another for loop over k = "Fraud" and k = "Not Fraud", putting the two resulting pandas Series into a length-2 dictionary with keys "Fraud" and "Not Fraud". Call this dictionary `num_dct`, because it represents the numerators of (4.30).

In [None]:
from pandas.api.types import is_bool_dtype, is_float_dtype
import numpy as np

# We are going to loop over X_train and compute the probabilities f_ki(x_i) for each i,k
# Initialize an empty dictionary to store numerators for each class
num_dct = {}

# Loop over both classes: 'Fraud' and 'Not Fraud'
for k in ['Fraud', 'Not Fraud']:
    print(f"\nCalculating numerators for class: {k}")

    # 1. Copy X_train into a new DataFrame called X_temp for this class
    X_temp = X_train.copy()

    # 2. For each column of X_temp, replace values with f_ki(x_i) (likelihoods/proportions)
    for col in X_temp.columns:
        # If the column is boolean, use Boolean_helper
        if is_bool_dtype(X_temp[col]):
            boolean_props = Boolean_helper(df_train, k, col)
            X_temp[col] = X_temp[col].map(boolean_props)

        # If the column is float (continuous), use Gaussian_helper
        elif is_float_dtype(X_temp[col]):
            gaussian_stats = Gaussian_helper(df_train, k, col)
            mean = gaussian_stats['mean']
            std = gaussian_stats['std']

            # Handle case where standard deviation is 0 to avoid division by zero
            if std == 0:
                epsilon = 1e-9 # Small number for numerical stability
                pdf_values = np.where(X_temp[col] == mean, 1.0, epsilon)
            else:
                # Calculate the Gaussian Probability Density Function (PDF) for each value
                exponent = -((X_temp[col] - mean)**2) / (2 * std**2)  # std != 0 in this branch
                pdf_values = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(exponent)
            X_temp[col] = pdf_values

    # 3. Multiply all entries in each row of X_temp
    # This computes the product of f_ki(x_i) for all i for each observation
    row_products = X_temp.prod(axis=1)

    # 4. Multiply by the prior probability of the current class (k)
    # The prior_dct was computed earlier and holds P(Y=k) values
    numerator_series = row_products * prior_dct[k]

    # 5. Store the resulting pandas Series in num_dct
    num_dct[k] = numerator_series
    print(f"Finished calculating numerators for {k} class.")

# Create a new two-column pandas DataFrame from num_dct
df_num = pd.DataFrame(num_dct)

# Display the first few rows of the df_num DataFrame
print("\nFirst 5 rows of df_num (Numerators for 'Fraud' and 'Not Fraud' classes):\n")
display(df_num.head())


Calculating numerators for class: Fraud
Finished calculating numerators for Fraud class.

Calculating numerators for class: Not Fraud
Finished calculating numerators for Not Fraud class.

First 5 rows of df_num (Numerators for 'Fraud' and 'Not Fraud' classes):



Unnamed: 0,Fraud,Not Fraud
516187,6.28303e-08,4.993619e-06
419611,1.620588e-08,2.931482e-06
955905,1.393723e-09,5.85897e-07
739350,4.394082e-08,5.521896e-06
54077,3.488128e-10,4.854527e-07


In the table above, the 'Fraud' and 'Not Fraud' entries of `df_num` in each row are the unnormalized scores (numerators) for that observation belonging to the respective class.
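These scores are products of many small numbers. With seven features that is harmless, but with many more features such a product could underflow to zero in floating point. A common remedy, sketched here under that assumption, is to add logarithms instead of multiplying. The likelihoods below are the rounded values for the first training row in the Fraud-class `X_temp` table shown earlier.

```python
import math

# f_1i(x_i) for the first training row (rounded values from the X_temp table).
likelihoods = [0.002728, 0.007244, 0.058531, 0.880206, 0.747597, 0.996796, 0.947368]
prior_fraud = 0.087403  # P(Y = Fraud), as stored in prior_dct

# Direct product, as computed in the notebook.
score = prior_fraud * math.prod(likelihoods)

# The same quantity in log space: sums replace products.
log_score = math.log(prior_fraud) + sum(math.log(f) for f in likelihoods)

print(f"score          = {score:.4e}")
print(f"exp(log_score) = {math.exp(log_score):.4e}")
```

Since the logarithm is increasing, comparing log scores across the two classes yields the same predicted label as comparing the raw numerators.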

### 5. **Predict Class Labels using the NB model**

We predict "Fraud" or "Not Fraud" for a given observation $x$ by comparing the scores of $x$ for each of these two classes (the scores are the **numerators** in the NB model).

For example, if Score$(Y= \text{"Fraud"}|X=x) >$ Score $(Y= \text{"Not Fraud"}|X=x)$, we predict the label "Fraud" for the observation $x$.
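Dividing each numerator by the sum of the two numerators turns the scores into posterior probabilities, and predicting "Fraud" when its score is larger is equivalent to predicting "Fraud" when its posterior exceeds 0.5. The sketch below uses the five numerator pairs displayed in `df_num` above; the variable names are chosen here for illustration.

```python
# (Fraud numerator, Not Fraud numerator) for the first five training rows of df_num.
numerators = [
    (6.28303e-08, 4.993619e-06),
    (1.620588e-08, 2.931482e-06),
    (1.393723e-09, 5.85897e-07),
    (4.394082e-08, 5.521896e-06),
    (3.488128e-10, 4.854527e-07),
]

# Posterior P(Fraud | x): the Fraud numerator divided by the sum of both numerators.
posteriors = [fraud / (fraud + not_fraud) for fraud, not_fraud in numerators]

# Predict Fraud (1) exactly when the Fraud numerator is the larger of the two.
predictions = [int(fraud > not_fraud) for fraud, not_fraud in numerators]

for p, label in zip(posteriors, predictions):
    print(f"P(Fraud | x) = {p:.4f} -> predicted label {label}")
```

All five posteriors are far below 0.5, so these rows are predicted as "Not Fraud".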

In [None]:
# Predict the class based on which numerator is higher
# If 'Fraud' numerator > 'Not Fraud' numerator, predict 1 (Fraud), else 0 (Not Fraud)
# df_num is where we stored the scores for each entry and class.
y_train_pred_naive_bayes = (df_num['Fraud'] > df_num['Not Fraud']).astype(int) # Series of 0/1 predictions (boolean comparison cast to int)

# Display the first few predictions alongside actual y_train
print("Naive Bayes Predictions vs Actuals (first 10 from train set):")
predictions_comparison = pd.DataFrame({
    'Actual': y_train.head(10),
    'Predicted': y_train_pred_naive_bayes.head(10)
})
display(predictions_comparison)


Naive Bayes Predictions vs Actuals (first 10 from train set):


Unnamed: 0,Actual,Predicted
516187,0,0
419611,0,0
955905,0,0
739350,0,0
54077,0,0
212971,0,0
405061,0,0
653674,1,0
862170,0,0
659540,0,0


**Question:** **What proportion of the actual Fraud cases in X_train are correctly identified as Fraud using this procedure?**

To answer this, we compute the `accuracy_score` and `confusion_matrix` of our model's predictions using scikit-learn's built-in functions.

In [None]:
# Calculate overall accuracy on the training set
overall_accuracy_train = accuracy_score(y_train, y_train_pred_naive_bayes)
print(f"\nOverall Accuracy on Training Set (Naive Bayes): {overall_accuracy_train:.4f}")

# Calculate the confusion matrix
cm_naive_bayes_train = confusion_matrix(y_train, y_train_pred_naive_bayes)
print("\nConfusion Matrix for Training Set (Naive Bayes):")
cm_df_naive_bayes = pd.DataFrame(
    cm_naive_bayes_train,
    index=['Actual Not Fraud', 'Actual Fraud'],
    columns=['Predicted Not Fraud', 'Predicted Fraud']
)
display(cm_df_naive_bayes)

# Extract values from the confusion matrix
# True Negative (TN), False Positive (FP)
# False Negative (FN), True Positive (TP)
TN, FP, FN, TP = cm_naive_bayes_train.ravel()

# Proportion of actual Fraud values correctly identified (Recall for Fraud)
# This is TP / (TP + FN)
if (TP + FN) > 0:
    recall_fraud = TP / (TP + FN)
    print(f"\nProportion of actual Fraud correctly identified (Recall for Fraud): {recall_fraud:.4f}")
else:
    print("\nNo actual fraud cases in the training set to calculate recall.")


Overall Accuracy on Training Set (Naive Bayes): 0.9258

Confusion Matrix for Training Set (Naive Bayes):


Unnamed: 0,Predicted Not Fraud,Predicted Fraud
Actual Not Fraud,89498,1762
Actual Fraud,5661,3079



Proportion of actual Fraud correctly identified (Recall for Fraud): 0.3523


**Thus, the proportion of actual Fraud correctly identified (Recall for Fraud) is 35.23%.**

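This is higher than the Recall for Fraud of our logistic regression model, which was approximately 20%. The two figures are not strictly comparable, however: the Naive Bayes recall is measured on the training set, while the logistic regression recall was measured on the test set.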