# Lab 2 - Math 178, Spring 2024

You are encouraged to work in groups of up to 3 total students, but each student should submit their own file. (It's fine for everyone in the group to submit the same link.)

Put the full names of everyone in your group (even if you're working alone) here. This makes grading easier.

**Names**: Katie Kim, Shun Iwata

The attached data is a very slightly altered form of this [Kaggle dataset](https://www.kaggle.com/datasets/dhanushnarayananr/credit-card-fraud/data).

* Read in the attached credit card fraud data and look at its contents.  Pay particular attention to the column data types.  In this lab, we are interested in predicting the contents of the "fraud" column.

## Preparing the data

Divide the data into a training set and a test set.  Specify a `random_state` when you call `train_test_split`, so that you get consistent results.  I had trouble in the logistic regression section if my training set was too big, so I recommend using only 10% of the data (still a lot, 100,000 rows) as the training size.  It's possible that using even a smaller training size is appropriate.

* Imagine we always predict "Not Fraud".  What accuracy score (i.e., proportion correctly classified) do we get on the training set?  On the test set?  Why can there not be any overfitting here?

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

df = pd.read_csv("card_fraud.csv")

# predictor columns: everything except the target column "fraud"
var = df.columns.to_list()
var.remove('fraud')

X_train, X_test, y_train, y_test = train_test_split(df[var], df[['fraud']], train_size = 0.1, random_state = 11)
y_train

Unnamed: 0,fraud
444737,Not Fraud
957507,Fraud
277798,Not Fraud
817894,Not Fraud
911671,Not Fraud
...,...
359761,Not Fraud
728155,Not Fraud
808016,Not Fraud
822975,Not Fraud


Always predicting "Not Fraud" gives a training accuracy of about 0.912, which is the proportion of "Not Fraud" rows in `y_train`.

In [None]:
len(y_test[y_test['fraud'] == 'Not Fraud'])/len(y_test)  #test data accuracy

0.91266

* Imagine we always predict "Not Fraud".  What accuracy score (i.e., proportion correctly classified) do we get on the training set?  On the test set?  Why can there not be any overfitting here?

Because the training set and the test set are random samples of the same data, the proportion of "Not Fraud" rows is nearly the same in both (about 0.912 and 0.913). Always predicting "Not Fraud" involves no fitting to the training data, so the accuracy on either set is simply the "Not Fraud" rate of that set. Since nothing is learned from the training set, there is nothing to overfit.
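
As a small illustration of this baseline, the sketch below uses a made-up label Series (`y_toy` is hypothetical, not the lab data) to show that the accuracy of a constant "Not Fraud" prediction is just the class proportion.

```python
import pandas as pd

# Hypothetical toy labels; the real lab uses the "fraud" column of the dataset.
y_toy = pd.Series(["Not Fraud"] * 9 + ["Fraud"] * 1)

# Always predicting "Not Fraud" is correct exactly on the "Not Fraud" rows,
# so the accuracy equals the proportion of that class.
baseline_acc = (y_toy == "Not Fraud").mean()
print(baseline_acc)
```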
 

## Logistic regression - using scikit-learn

Fit a scikit-learn `LogisticRegression` classification model to the training data.  Because it is such a large dataset, I ran into errors/warnings during the `fit` stage if I had instantiated the `LogisticRegression` object using the default parameters.   To combat this, I used only 10% of the data in my training set, I increased the default number of iterations, and I changed the solver.  You can see the options in the `LogisticRegression` class [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).  Originally I also increased the default tolerance, but it seems like this makes the model less accurate, so try to avoid increasing the tolerance if possible.  Don't be surprised if fitting the model takes up to 5 minutes.  If you're having issues, try increasing the tolerance very slightly.
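
A minimal sketch of instantiating the model with a non-default solver and a larger iteration cap is shown below. The synthetic data, the choice of `"liblinear"`, and the value `1000` are illustrative assumptions, not the settings used for this lab.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical), not the credit card dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = np.where(X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0, "Fraud", "Not Fraud")

# Non-default solver and a larger iteration cap; both values are illustrative.
demo_model = LogisticRegression(solver="liblinear", max_iter=1000)
demo_model.fit(X_demo, y_demo)

demo_acc = demo_model.score(X_demo, y_demo)
print(demo_acc)
```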

* What is the accuracy score on the training set?  On the test set?  Are you concerned about overfitting?

In [None]:
# model = LogisticRegression(max_iter = 10**5)
model = LogisticRegression(max_iter = 10**3)
model.fit(X_train, np.ravel(y_train))

In [None]:
pred = model.predict(X_train)
score = model.score(X_train, y_train)
score  #training data set

0.95843

About 91.2% of the training transactions are "Not Fraud", so 91.2% is the baseline accuracy.
Accuracy score on the training set: 95.84%. This is higher than the 91.2% baseline, so the model improves on always predicting "Not Fraud".

In [None]:
pred = model.predict(X_test)
score = model.score(X_test, y_test)
score  #test data set

0.9582366666666666

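About 91.3% of the test transactions are "Not Fraud", so that is the baseline accuracy on the test set.
Accuracy score on the test set: 95.82%. This is higher than the baseline, and it differs from the training score (95.84%) by only about 0.02 percentage points, which gives no sign of overfitting.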

* Evaluate the scikit-learn `confusion_matrix` function on the test data ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)).  Which entry in this confusion matrix would you focus the most on, if you were a bank?  Why? 

In [None]:
cm = confusion_matrix(y_test, pred)
cm

array([[ 47056,  31550],
       [  6037, 815357]])

The top-right entry of the confusion matrix (31,550). scikit-learn sorts the labels, so the rows and columns are ordered "Fraud", "Not Fraud", and this entry counts false negatives: transactions that are actually fraud but are predicted as "Not Fraud". A bank would care most about this entry, because each of these cases is a fraudulent transaction that goes undetected.
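
The sketch below recomputes a few summary numbers from the matrix reported above. It assumes the rows are the actual classes and the columns the predicted classes, in the sorted order "Fraud", "Not Fraud"; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above. Assumption: index 0 is "Fraud" and
# index 1 is "Not Fraud" (rows = actual class, columns = predicted class).
cm_reported = np.array([[47056, 31550],
                        [6037, 815357]])

caught_fraud, missed_fraud = cm_reported[0]

accuracy = np.trace(cm_reported) / cm_reported.sum()
fraud_recall = caught_fraud / (caught_fraud + missed_fraud)
miss_rate = missed_fraud / (caught_fraud + missed_fraud)
print(accuracy, fraud_recall, miss_rate)
```

Under that assumption, roughly 40% of the fraudulent test transactions are labeled "Not Fraud", even though the overall accuracy is about 95.8%.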

## Naive Bayes - by hand

Our goal in this section is to perform Naive Bayes "by hand" (or at least without using a scikit-learn model).  Recall that Naive Bayes is based on the following formula, taken from Section 4.4 of ISLP:

![Formula 4.30](naiveBayes.png)

In our case, $k$ will represent either "Fraud" or "Not Fraud".  The function $f_{ki}(x_i)$ represents the probability (or probability density) of the i-th predictor being $x_i$ in class $k$.  To estimate these functions $f_{ki}$, we will use the first and third bullet points beneath Equation (4.30) in ISLP, according to whether the variable is a float type or a Boolean type.  The term $\pi_k$ represents *prior* probability of class $k$ (*prior* meaning without dependence on the predictors $x_i$).
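
For reference, and assuming the image shows Equation (4.30) as printed in ISLP, the formula can be written in the notation of this section as:

$$
\Pr(Y = k \mid X = x) = \frac{\pi_k \, f_{k1}(x_1) \, f_{k2}(x_2) \cdots f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \, f_{l1}(x_1) \, f_{l2}(x_2) \cdots f_{lp}(x_p)}
$$

Here $p$ is the number of predictors and $K$ is the number of classes ($K = 2$ in this lab).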

Strategy:
* We first compute the values $\pi_k$.
* We then (prepare to) compute the functions $f_{ki}$ when $i$ represents a float column.
* We then (prepare to) compute the functions $f_{ki}$ when $i$ represents a Boolean column.

In [None]:
pi_fraud = len(y_train[y_train['fraud'] == 'Fraud'])/len(y_train)
pi_not_fraud = len(y_train[y_train['fraud'] == 'Not Fraud'])/len(y_train)

In [None]:
print(pi_fraud)
print(pi_not_fraud)

0.08797
0.91203


In [None]:
def bool_func(df_train, col, fraud_bool):

    mydf = df_train[df_train['fraud'] == fraud_bool]

    true = len(mydf[mydf[col] == True])/len(mydf)
    false = 1-true

    mylist = []
    for val in mydf[col]:
        if val == True:
            mylist.append(true)
        else:
            mylist.append(false)

    return mylist

In [None]:
X_train

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order
444737,8.202548,0.273947,0.399124,True,False,False,False
957507,202.486736,1.486194,6.215277,True,False,False,False
277798,23.974382,4.302344,5.113708,True,True,True,False
817894,1.050778,0.563827,0.384282,False,False,False,False
911671,2.426653,6.746158,0.267412,True,False,False,False
...,...,...,...,...,...,...,...
359761,15.735257,0.031037,4.406235,True,False,False,False
728155,12.747647,4.710128,2.218541,True,False,True,True
808016,3.012602,1.074755,1.939998,True,False,False,False
822975,1.097913,1.730297,0.406524,False,True,False,True


In [None]:
import scipy.stats as st

In [None]:
def float_func(df_train, col, fraud_bool):

    mydf = df_train[df_train['fraud'] == fraud_bool]

    mu = mydf[col].mean()
    sig = mydf[col].std()

    mylist = []
    for x in mydf[col]:
        p = st.norm.pdf(x, loc = mu, scale = sig)
        mylist.append(p)

    return mylist

* Define a dictionary `prior_dct` representing the two prior values $\pi_{Fraud}$ and $\pi_{Not Fraud}$, as in the following template.
```
prior_dct = {
    "Fraud": ???,
    "Not Fraud": ???
}
```
Reality check: the two values should sum to (approximately) 1.
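
One possible way to build such a dictionary is with `value_counts(normalize=True)`. The sketch below uses a made-up label Series (`labels` and `toy_prior_dct` are hypothetical names), not `y_train`.

```python
import pandas as pd

# Hypothetical toy labels standing in for y_train["fraud"].
labels = pd.Series(["Fraud", "Not Fraud", "Not Fraud", "Not Fraud"])

# normalize=True turns counts into proportions, which are the estimated priors.
toy_prior_dct = labels.value_counts(normalize=True).to_dict()
print(toy_prior_dct)
```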

In [None]:
pi_fraud = len(y_train[y_train['fraud'] == 'Fraud'])/len(y_train)
pi_not_fraud = len(y_train[y_train['fraud'] == 'Not Fraud'])/len(y_train)

prior_dct = {
    "Fraud": pi_fraud,
    "Not Fraud": pi_not_fraud
}

* It's temporarily convenient here to have `X_train` and `y_train` together in the same DataFrame.  Concatenate these together along the columns axis and name the result `df_train`.

In [None]:
df_train = pd.concat([X_train, y_train], axis=1)
df_train

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud
444737,8.202548,0.273947,0.399124,True,False,False,False,Not Fraud
957507,202.486736,1.486194,6.215277,True,False,False,False,Fraud
277798,23.974382,4.302344,5.113708,True,True,True,False,Not Fraud
817894,1.050778,0.563827,0.384282,False,False,False,False,Not Fraud
911671,2.426653,6.746158,0.267412,True,False,False,False,Not Fraud
...,...,...,...,...,...,...,...,...
359761,15.735257,0.031037,4.406235,True,False,False,False,Not Fraud
728155,12.747647,4.710128,2.218541,True,False,True,True,Not Fraud
808016,3.012602,1.074755,1.939998,True,False,False,False,Not Fraud
822975,1.097913,1.730297,0.406524,False,True,False,True,Not Fraud


* Write a function `Gaussian_helper` which takes a DataFrame input `df` and two string inputs, a class `k` (which will be "Not Fraud" or "Fraud" in our case) and a column name `col` of one of the float columns.  The output should be a dictionary with keys `"mean"` and `"std"`, representing the mean and the standard deviation for the given column within the given class, as in the first bullet point after (4.30).  

Comment: To find the mean and standard deviation, you can use the formulas in (4.20) (take the square root of the variance to get the standard deviation), but I think it's easier to just let pandas compute these for you, using the `mean` and `std` methods of a pandas Series.  It's possible pandas will use $n$ instead of the $n-K$ in Equation (4.20), but that shouldn't be significant here because $n$ is so big and $K=2$. 

Here is a possible template:
```
def Gaussian_helper(df, k, col):
    output_dct = {}
    ... # one or more lines here
    output_dct["mean"] = ...
    output_dct["std"] = ...
    return output_dct
```
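
On the question of which divisor pandas uses: `Series.std` defaults to `ddof=1`, so it divides by one less than the number of rows it is given. The toy Series below is hypothetical and only illustrates the difference from `ddof=0`.

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical toy column

std_sample = values.std()            # pandas default: ddof=1, divides by n - 1
std_population = values.std(ddof=0)  # divides by n
print(std_sample, std_population)
```

With tens of thousands of rows per class, the two estimates are nearly identical.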

In [None]:
def Gaussian_helper(df, k, col):

    mydf = df[df['fraud'] == k]

    output_dct = {}
    output_dct["mean"] = mydf[col].mean()
    output_dct["std"] = mydf[col].std()

    return output_dct

* Similarly, write a function `Boolean_helper` which takes a DataFrame input `df` and two string inputs, a class `k` (which will be "Not Fraud" or "Fraud" in our case) and a column name `col` of one of the Boolean columns.  The output should be a dictionary with keys `True` and `False`, representing the proportion of these values within the given class.  For an example, see the third bullet point after (4.30) in the textbook.

Comment: Make sure your keys are `bool` values, not strings.
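
A minimal sketch of one way to get `bool` keys is shown below. The toy DataFrame and the function name `boolean_proportions` are hypothetical; the sketch relies on the fact that the mean of a Boolean column is its proportion of `True` values.

```python
import pandas as pd

# Hypothetical toy frame with one Boolean predictor.
toy = pd.DataFrame({
    "used_chip": [True, False, False, True, False],
    "fraud": ["Fraud", "Fraud", "Fraud", "Not Fraud", "Not Fraud"],
})

def boolean_proportions(df, k, col):
    """Proportion of True/False values of `col` within class `k`."""
    p_true = float(df.loc[df["fraud"] == k, col].mean())
    # Keys are the bool values True and False, not the strings "True"/"False".
    return {True: p_true, False: 1 - p_true}

props = boolean_proportions(toy, "Fraud", "used_chip")
print(props[True], props[False])
```

Building the dictionary explicitly keeps both keys present even if one of the two values never occurs within a class.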

In [None]:
def Boolean_helper(df_train, k, col):

    mydf = df_train[df_train['fraud'] == k]

    true = len(mydf[mydf[col] == True])/len(mydf)
    false = 1-true

    # keys are bool values, not strings
    output_dct = {}
    output_dct[True] = true
    output_dct[False] = false
    
    return output_dct

* Check your helper functions by comparing a few of their outputs to the following.  (I feel like there is probably a nice way to use the following DataFrame directly and never define the helper functions, but I did not succeed in doing that.)

```
df_train.groupby("fraud").mean()
```
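
One possible way to read the per-class statistics directly from a grouped DataFrame is sketched below, on made-up data. The names `toy` and `stats` are hypothetical, and this is not necessarily the approach intended by the lab.

```python
import pandas as pd

# Hypothetical toy data with a single float predictor.
toy = pd.DataFrame({
    "distance_from_home": [10.0, 30.0, 2.0, 4.0],
    "fraud": ["Fraud", "Fraud", "Not Fraud", "Not Fraud"],
})

# One row per class; the columns are (column name, statistic) pairs.
stats = toy.groupby("fraud").agg(["mean", "std"])

mu = stats.loc["Fraud", ("distance_from_home", "mean")]
sigma = stats.loc["Fraud", ("distance_from_home", "std")]
print(mu, sigma)
```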

In [None]:
df_train.groupby("fraud").mean()

| fraud | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order |
|---|---|---|---|---|---|---|---|
| Fraud | 67.112515 | 11.811646 | 6.024489 | 0.880414 | 0.253382 | 0.002842 | 0.945777 |
| Not Fraud | 22.831175 | 4.255813 | 1.414602 | 0.884532 | 0.357291 | 0.110588 | 0.621328 |


In [None]:
df_train.groupby("fraud").std()

| fraud | distance_from_home | distance_from_last_transaction | ratio_to_median_purchase_price | repeat_retailer | used_chip | used_pin_number | online_order |
|---|---|---|---|---|---|---|---|
| Fraud | 131.712889 | 42.096718 | 5.716638 | 0.324495 | 0.434972 | 0.053236 | 0.22647 |
| Not Fraud | 50.905992 | 16.166079 | 1.934448 | 0.319587 | 0.479204 | 0.313624 | 0.485059 |


In [None]:
Gaussian_helper(df_train, 'Fraud', 'distance_from_home')

{'mean': 67.11251533233086, 'std': 131.71288930935555}

In [None]:
Gaussian_helper(df_train, 'Not Fraud', 'distance_from_home')

{'mean': 22.831175153939462, 'std': 50.90599207082154}

In [None]:
Gaussian_helper(df_train, 'Fraud', 'distance_from_last_transaction')

{'mean': 11.811646222217306, 'std': 42.09671841181567}

In [None]:
Gaussian_helper(df_train, 'Not Fraud', 'distance_from_last_transaction')

{'mean': 4.2558134678412065, 'std': 16.166079350060176}

Here is an example of using a dictionary to replace every value in a column.  Think of the values in this dictionary as our estimated probabilities.

```
temp_dct = {True: 0.71, False: 0.29}

X_train["used_chip"].map(temp_dct)
```

Momentarily fix the class $k$ to be "Fraud".  We are going to compute the numerator of Equation (4.30) for every row of `X_train`.  (Here we switch back to using `X_train` rather than `df_train`.)

Do all of the following in a single code cell.  (The reason for not separating the cells is so that the entire cell can be run again easily.)

* Assign `k = "Fraud"`.
* Copy the `X_train` DataFrame into a new DataFrame called `X_temp`.  Use the `copy` method.
* For each column of `X_temp`, use `Gaussian_helper` or `Boolean_helper`, as appropriate, to replace each value $x_i$ with $f(x_i)$, where $f$ is as in (4.30).  You can use a for loop to loop over the columns, but within a fixed column, you should not need to use a for loop (in other words, you should not need to loop over the rows, only over the columns).  The following imports might be helpful for determining the data types (make the imports outside of any for loop).
```
from pandas.api.types import is_bool_dtype, is_float_dtype
```

Comment: Your code should be changing the `X_temp` entries but not the `X_train` entries.  When you are finished, `X_temp` will be a DataFrame containing probabilities, all corresponding to the "Fraud" class.

* For each row, multiply all entries in that row.  (Hint.  DataFrames have a `prod` method.)  Also multiply by the prior probability of "Fraud".  (Use `k`, do not type `"Fraud"`.)  The end result should be a pandas Series corresponding to the numerator of (4.30), for each row of `X_train`.  Don't be surprised if the numbers are very small, like around $10^{-10}$.
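
The steps above can be sketched as follows, under stated assumptions: the toy data and the names `X_toy`, `y_toy`, `toy_priors`, and `numerator` are hypothetical, and the per-class statistics are computed inline rather than through the helper functions. The prior used is that of the class being scored, i.e. `priors[k]`.

```python
import pandas as pd
import scipy.stats as st
from pandas.api.types import is_bool_dtype

# Hypothetical toy training data with one float and one Boolean predictor.
X_toy = pd.DataFrame({
    "distance_from_home": [1.0, 2.0, 3.0, 50.0, 70.0, 60.0],
    "online_order": [False, True, False, True, True, False],
})
y_toy = pd.Series(["Not Fraud"] * 3 + ["Fraud"] * 3)
toy_priors = y_toy.value_counts(normalize=True).to_dict()

def numerator(X, y, k, priors):
    """Numerator of the naive Bayes posterior for class k, one value per row of X."""
    in_class = X[y == k]   # rows used to estimate the per-class densities
    X_temp = X.copy()      # leave X itself untouched
    for col in X.columns:
        if is_bool_dtype(X[col]):
            p_true = in_class[col].mean()
            X_temp[col] = X[col].map({True: p_true, False: 1 - p_true})
        else:
            # norm.pdf is vectorized, so no loop over the rows is needed
            X_temp[col] = st.norm.pdf(X[col], loc=in_class[col].mean(),
                                      scale=in_class[col].std())
    # multiply the row products by the prior of the class being scored
    return X_temp.prod(axis=1) * priors[k]

num_fraud = numerator(X_toy, y_toy, "Fraud", toy_priors)
print(num_fraud)
```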

In [None]:
def bool_func(df_train, col,k):
    mydf = df_train[df_train['fraud'] == k]

    true = len(mydf[mydf[col] == True])/len(mydf)
    false = 1-true

    temp_dct = {True: true, False: false}

    myser = df_train[col].map(temp_dct)

    return myser    

In [None]:
def float_func(df_train, col,k):
    mydf = df_train[df_train['fraud'] == k]

    mu = mydf[col].mean()
    sig = mydf[col].std()

    # st.norm.pdf is vectorized, so the whole column is evaluated at once
    # (returns a NumPy array with one density value per row)
    return st.norm.pdf(df_train[col], loc = mu, scale = sig)

In [None]:
df_sub = df_train.copy()
df_sub['fraud'].map(prior_dct)

444737    0.91203
957507    0.08797
277798    0.91203
817894    0.91203
911671    0.91203
           ...   
359761    0.91203
728155    0.91203
808016    0.91203
822975    0.91203
403353    0.91203
Name: fraud, Length: 100000, dtype: float64

In [None]:
from pandas.api.types import is_bool_dtype, is_float_dtype
k = "Fraud"

df_temp = df_train.copy()  # has fraud col
X_temp = X_train.copy()  # no fraud col


for col in X_temp.columns:

    if is_bool_dtype(X_temp[col]) == True:
        df_temp[col] = bool_func(df_temp, col,k)

    else:
        df_temp[col] = float_func(df_temp, col,k)


df_temp["fraud"] = df_temp["fraud"].map(prior_dct)
df_temp.prod(axis=1,numeric_only=True)

444737    3.486877e-08
957507    3.581686e-09
277798    5.784414e-11
817894    4.612560e-09
911671    3.441176e-08
              ...     
359761    5.559158e-08
728155    2.341414e-09
808016    4.325602e-08
822975    2.760669e-08
403353    6.529805e-07
Length: 100000, dtype: float64

* Once the code is working, wrap the whole thing into another for loop, corresponding to `k = "Fraud"` and `k = "Not Fraud"`, putting the two resulting pandas Series into a length 2 dictionary with keys `"Fraud"` and `"Not Fraud"`.  Call this dictionary `num_dct`, because it represents the numerators of (4.30).

In [None]:
num_dct = {}

for k in ['Fraud', 'Not Fraud']:

    df_temp = df_train.copy()  # has fraud col
    X_temp = X_train.copy()  # no fraud col

    for col in X_temp.columns:

        if is_bool_dtype(X_temp[col]) == True:
            df_temp[col] = bool_func(df_temp, col,k)

        else:
            df_temp[col] = float_func(df_temp, col,k)
    df_temp["fraud"] = df_temp["fraud"].map(prior_dct)
    num_dct[k] = df_temp.prod(axis=1,numeric_only=True)

num_dct

{'Fraud': 444737    3.486877e-08
 957507    3.581686e-09
 277798    5.784414e-11
 817894    4.612560e-09
 911671    3.441176e-08
               ...     
 359761    5.559158e-08
 728155    2.341414e-09
 808016    4.325602e-08
 822975    2.760669e-08
 403353    6.529805e-07
 Length: 100000, dtype: float64,
 'Not Fraud': 444737    5.648824e-06
 957507    6.010885e-11
 277798    7.733600e-08
 817894    7.014100e-07
 911671    5.327139e-06
               ...     
 359761    2.016076e-06
 728155    1.277531e-06
 808016    6.103287e-06
 822975    6.529298e-07
 403353    5.692318e-06
 Length: 100000, dtype: float64}

* Create a new two-column pandas DataFrame with the results using the following code:
```
df_num = pd.DataFrame(num_dct)
```

In [None]:
df_num = pd.DataFrame(num_dct)

In [None]:
df_num

Unnamed: 0,Fraud,Not Fraud
444737,3.486877e-08,5.648824e-06
957507,3.581686e-09,6.010885e-11
277798,5.784414e-11,7.733600e-08
817894,4.612560e-09,7.014100e-07
911671,3.441176e-08,5.327139e-06
...,...,...
359761,5.559158e-08,2.016076e-06
728155,2.341414e-09,1.277531e-06
808016,4.325602e-08,6.103287e-06
822975,2.760669e-08,6.529298e-07


In [None]:
# pick the class with the larger numerator in each row
clf = np.where(df_num['Fraud'] > df_num['Not Fraud'], "Fraud", "Not Fraud")
df_num['type'] = clf

In [None]:
df_train['predict'] = clf
df_train

Unnamed: 0,distance_from_home,distance_from_last_transaction,ratio_to_median_purchase_price,repeat_retailer,used_chip,used_pin_number,online_order,fraud,predict
444737,8.202548,0.273947,0.399124,True,False,False,False,Not Fraud,Not Fraud
957507,202.486736,1.486194,6.215277,True,False,False,False,Fraud,Fraud
277798,23.974382,4.302344,5.113708,True,True,True,False,Not Fraud,Not Fraud
817894,1.050778,0.563827,0.384282,False,False,False,False,Not Fraud,Not Fraud
911671,2.426653,6.746158,0.267412,True,False,False,False,Not Fraud,Not Fraud
...,...,...,...,...,...,...,...,...,...
359761,15.735257,0.031037,4.406235,True,False,False,False,Not Fraud,Not Fraud
728155,12.747647,4.710128,2.218541,True,False,True,True,Not Fraud,Not Fraud
808016,3.012602,1.074755,1.939998,True,False,False,False,Not Fraud,Not Fraud
822975,1.097913,1.730297,0.406524,False,True,False,True,Not Fraud,Not Fraud


* What proportion of the values in `X_train` are correctly identified as Fraud using this procedure?  (Note.  We never actually need to compute the denominator in (4.30), since all we care about here is which entry is bigger.)
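
To illustrate why the denominator can be skipped, the sketch below takes the first three rows of the numerators shown above and compares the class chosen before and after dividing by the row sum. The names `df_num_demo`, `pred_demo`, and `posterior_demo` are hypothetical.

```python
import pandas as pd

# First three rows of the numerators shown above.
df_num_demo = pd.DataFrame(
    {"Fraud": [3.486877e-08, 3.581686e-09, 5.784414e-11],
     "Not Fraud": [5.648824e-06, 6.010885e-11, 7.733600e-08]},
    index=[444737, 957507, 277798],
)

# The column with the larger numerator is the predicted class.
pred_demo = df_num_demo.idxmax(axis=1)

# Dividing by the row sum (the denominator) rescales each row to sum to 1
# but cannot change which column is larger.
posterior_demo = df_num_demo.div(df_num_demo.sum(axis=1), axis=0)
print(pred_demo.tolist())
```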

In [None]:
# count rows where the prediction differs from the true label
wrong = int((df_train['fraud'] != df_train['predict']).sum())

1 - wrong/len(df_train.index)

0.9359500000000001

## Submission

* Using the `Share` button at the top right, enable public sharing, and enable Comment privileges. Then submit the created link on Canvas.

## Possible extensions

* I originally wanted us to consider log loss as our error metric, but I decided the lab was already getting rather long, so I removed that.  But in general, log loss is a more refined measure for detecting overfitting than accuracy score.  It should be relatively straightforward to evaluate log loss for the Logistic Regression model.  Compare this to the log loss of a baseline prediction, where we predict the same probability for every row.  I got some errors when I tried to evaluate log loss for the Naive Bayes model and I haven't thought carefully about how to avoid these.
* How do our values compare to using the scikit-learn Naive Bayes model?  (I don't think this will be easy, because you will have to treat the Gaussian and the Boolean portions separately.  There might also be some discrepancy due to our method of estimating standard deviation, but I don't think that is crucial.  I have not tried this myself, so there could also be other discrepancies I'm not anticipating.)
* How does KNN compare in performance?  (What's the optimal number of neighbors?)  I haven't tried this, and I think the training size might be too large, so be prepared to reduce the size of the training set further.
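
For the log loss extension, a minimal hand-written sketch of binary log loss with a constant baseline is given below. The function `binary_log_loss`, the toy labels, and the clipping value `eps` are all hypothetical. Clipping is included on the assumption that predicted probabilities of exactly 0 or 1 are one possible source of errors; the source does not say what caused the errors for Naive Bayes.

```python
import numpy as np

def binary_log_loss(y_is_fraud, p_fraud, eps=1e-15):
    """Mean negative log-likelihood; p_fraud is the predicted probability of fraud."""
    p = np.clip(p_fraud, eps, 1 - eps)  # avoid log(0) for predictions of exactly 0 or 1
    y = np.asarray(y_is_fraud, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Hypothetical toy labels: 1 = Fraud, 0 = Not Fraud.
y_demo = np.array([0, 0, 0, 1])

# Baseline: predict the observed fraud rate for every row.
baseline = binary_log_loss(y_demo, np.full(4, y_demo.mean()))

# A prediction of exactly 0 for a fraud row would give an infinite loss without clipping.
overconfident = binary_log_loss(y_demo, np.zeros(4))
print(baseline, overconfident)
```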
