<a href="https://colab.research.google.com/github/Amazn1234/CS470-Final-Project/blob/main/AI470_Final_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset

1. load csv file (panda, numpy)
2. split dataset. Example code:()
   ```
   random.shuffle(data) # change if you are using pandas dataframe
   training = data[:int(len(data)*0.8)]
   test = data[int(len(data)*0.8):]

   fold5 = KFold(5) # https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
   for train_idx, val_idx in fold5.split(training):
      sub_val = training[val_idx]
      sub_train = training[train_idx]
      clf = model(sub_train, sub_val, ...) # training the model, and evaluate it on validation dataset
      performance(clf, test) # test the model on test dataset
   ```

In [34]:
import numpy as np
import pandas as pd
import os
from google.colab import files
from sklearn.model_selection import train_test_split

#upload dataset
fileName = 'spambase.csv'

#Avoid having to upload file at every runtime
if not os.path.exists(fileName):
    #Upload the file if it doesn't exist
    uploaded = files.upload()
    #Save the uploaded file to avoid future uploads
    for name, data in uploaded.items():
        with open(fileName, 'wb') as f:
            f.write(data)

#Load the CSV file into Pandas
data = pd.read_csv(fileName)

#Remove attributes that are not related to word (the last four attributes)
columnsToDrop = ['capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total', 'char_freq_#']

if set(columnsToDrop).issubset(data.columns):
    data.drop(columnsToDrop, axis=1, inplace=True)
    print("Columns dropped successfully.")
else:
    print("Columns not found in the DataFrame.")

#Shuffle the data
dataShuffled = data.sample(frac=1).reset_index(drop=True)

#Set feature/target variables
#featuresVec=(dataShuffled.iloc[:,:-1])
#resultVar=(dataShuffled.iloc[:,-1])

#Split the dataset and set feature/target variables (80/20 split)
features_train, features_test, result_train, result_test = train_test_split(dataShuffled.iloc[:, :-1], dataShuffled.iloc[:, -1], test_size=0.2)

#print("features_train:", features_train)
#print("features_test:", features_test)
#print("result_train:", result_train)
#print("result_test:", result_test)


Columns dropped successfully.


#Naive bayes

1. model learning:

   Note:

   features: remove attributes that is not related to word (the last four attributes)

   labels: the last column

   count P(c) -> how many samples are positive, and how many are negtive

   if freq_word>0, then this word exists. You could use this to calculate P(a|c) -> for each class, what is the prob of each word

   remember to use laplace smoothing.

2. model evaluation (on val dataset -> performance(model, val)):
   
   for each new sample, $\prod{P(a|c)}P(c)$ if word is in the email(freq_word > 0); and find the maximum class
   

   

In [35]:
from sklearn.metrics import roc_auc_score, roc_curve

#Find chance of being spam
priorSpam = len(dataShuffled[dataShuffled['spam'] == 1]) / len(dataShuffled)
priorNotSpam = 1 - priorSpam

#print("percentage of emails that are spam in the dataset: ", priorSpam)

#Calculate likelihood of spam
likelihoodSpamArr = {}
likelihoodNotSpamArr = {}

for col in features_train.columns:
    likelihoodSpamArr[col] = {}
    likelihoodNotSpamArr[col] = {}

    for val in dataShuffled[col].unique():
        likelihoodSpamArr[col][val] = len(dataShuffled[(dataShuffled['spam'] == 1) & (features_train[col] == val)]) / len(dataShuffled[dataShuffled['spam'] == 1])
        likelihoodNotSpamArr[col][val] = len(dataShuffled[(dataShuffled['spam'] == 0) & (features_train[col] == val)]) / len(dataShuffled[dataShuffled['spam'] == 0])


#Predict class of a sample
def predict(sample, likelihoodSpamArr, likelihoodNotSpamArr, priorSpam, priorNotSpam, threshold=0.5):

    #Calculate class probabilities for each feature
    for col, val in sample.items():
        if col != 'spam':  #Skip the target column
            #Calculate likelihood using Laplace smoothing
            priorSpam *= likelihoodSpamArr[col].get(val, 0.05)
            priorNotSpam *= likelihoodNotSpamArr[col].get(val, 0.05)

    #Make prediction based on the class probability and threshold
    if priorSpam > threshold:
        return 1
    else:
        return 0

#Test the model
predictions = []
for index, sample in features_test.iterrows():
    #Predict the class
    prediction = predict(sample, likelihoodSpamArr, likelihoodNotSpamArr, priorSpam, priorNotSpam)
    predictions.append(prediction)


############################
#Performance Testing
############################

#Calculate accuracy
def CalculateAccuracy(predictions, trueLabelArr):
  correctPredictions = sum(predictions == trueLabelArr)

  totalSamples = len(trueLabelArr)

  accuracy = correctPredictions / totalSamples

  return accuracy

#Print Accuracy
accuracy = CalculateAccuracy(result_test, predictions)
print("Accuracy:", accuracy)

#find true positives and false positives
def FindTPAndFP(guessArr, trueLabel):
    truePositives, falsePositives = 0, 0

    #loop through each prediction and true result
    for pred, trueLabel in zip(guessArr, trueLabel):
        if pred == 1 and trueLabel == 1:
            truePositives += 1
        elif pred == 1 and trueLabel == 0:
            falsePositives += 1

    return truePositives, falsePositives

def performance(predictions, trueLabels):
    #Calculate AUC score
    aucScore = roc_auc_score(trueLabels, predictions)

    return aucScore

truePositives, falsePositives = FindTPAndFP(predictions, result_test)
aucScore = performance(predictions, result_test)
print("False Positives:", falsePositives)
print("True Positives:", truePositives)
print("Area Under the Curve (AUC):", aucScore)

percentage of emails that are spam in the dataset:  0.39404477287546186
Accuracy: 0.6156351791530945
False Positives: 0
True Positives: 0
Area Under the Curve (AUC): 0.5


# KNN
1. model learning: None

2. model evaluation(on val dataset): You could use each row(exclude the last column) as the feature of the email. You do not have to recalcuate the freqency.

   ```
   Note:
   parallel programing
   numpy.cos() to calcuate the similarity
   ```

# LR

1. model learning: You could use each row(exclude the last column) as the feature of the email. You do not have to recalcuate the freqency.
    
    $y = sigmoid(MX)$

step 1: add one more column (all value is 1) in X -> X' = np.c_[np.ones((len(X), 1)), X]

step 2:vector M = np.random.randn(len(X[0])+1, 1);

key formula for step 3 (Note: n is the size of the TRAINING dataset; $cdot$ is dot production ):

1. $pred_y = sigmoid(M\cdot X')$

2. $loss = -\sum(y\cdot log(pred_y)+(1-y)\cdot log(1-pred_y))/n$

3. $gm=X'\cdot (pred_y - y)*2/n$

Step 3 example code:
   ```
   #Step 3: performing gradient descent on whole dataset:
   best_model = M
   best_performace = 0
   for i in range(epoch):
     pred_y = ...
     gm = ...
     _p = performace(model, val)
     if _p > best_performance:
        best_model = M
        best_performance = _p
     M = M - learning_rate*gm
   ```

2. model evaluation(on val dataset):
  
   calculate pred_y, if more than 0.5, then the predicted label is 1.

# Model Evaluation

https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
def performance(model, data):
  print("result:")
  return result