## EDA

### BASIC LIBRARIES FOR EDA

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

### OPENING FILES AND SAVING THE FIRST ROWS

In [None]:
sms = pd.read_csv("sms.csv")
head = sms.head()


### Check Data Integrity

In [None]:
null_col = sms.isnull().sum()
print("Null values in each column: \n", null_col)
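Null counts are only one integrity check. A minimal sketch of two further checks, duplicated rows and class balance, is shown below on a small made-up DataFrame. The sample rows are invented for illustration; only the column names 'SMS test' and 'Fraudolent' come from the dataset used in this notebook.

```python
import pandas as pd

# Toy data standing in for sms.csv (rows are invented for illustration)
toy = pd.DataFrame({
    "SMS test": ["Win a prize now", "See you at 5", "Win a prize now", "Call me later"],
    "Fraudolent": [1, 0, 1, 0],
})

# Number of fully duplicated rows (every column identical to an earlier row)
n_duplicates = int(toy.duplicated().sum())

# Share of each class; a strongly skewed split signals class imbalance
class_share = toy["Fraudolent"].value_counts(normalize=True)

print("Duplicated rows:", n_duplicates)
print(class_share)
```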

### Distributing and Correlating Data
**This script is useful for analyzing trends in SMS data, in particular the distribution and frequency of fraudulent messages over different time periods. However, no correlation emerges from the analyzed data.**

In [None]:
sms['Date and Time'] = pd.to_datetime(sms['Date and Time'])

# Extracting different time components
sms['year'] = sms['Date and Time'].dt.year
sms['month'] = sms['Date and Time'].dt.month
sms['day'] = sms['Date and Time'].dt.day


# Filtering for fraudulent messages
fraudulent_sms = sms[sms['Fraudolent'] == 1]

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.delaxes(axes[1][1])
# Plot 1: Message counts by year, split by fraud label
sns.countplot(x='year', hue='Fraudolent', data=sms, ax=axes[0, 0])
axes[0, 0].set_title('Fraudolent messages by year')

# Plot 2: Message counts by month, split by fraud label
sns.countplot(x='month', hue='Fraudolent', data=sms, ax=axes[0, 1])
axes[0, 1].set_title('Fraudolent messages by month')

# Plot 3: Message counts by day of the month, split by fraud label
sns.countplot(x='day', hue='Fraudolent', data=sms, ax=axes[1, 0])
axes[1, 0].set_title('Fraudolent messages by day')

# Adjust spacing between subplots for better readability
plt.tight_layout()

# Show the plots
plt.show()
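The count plots above show absolute numbers per class. If the share of fraudulent messages per period is wanted instead, one option is to average the 0/1 label within each group. A minimal sketch on invented rows, assuming 'Fraudolent' is coded as 0/1 as in the filter above:

```python
import pandas as pd

# Invented rows for illustration; column names follow the notebook
toy = pd.DataFrame({
    "Date and Time": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-14"]
    ),
    "Fraudolent": [1, 0, 0, 0],
})
toy["month"] = toy["Date and Time"].dt.month

# The mean of a 0/1 column is the fraction of ones, i.e. the fraud share
fraud_share_by_month = toy.groupby("month")["Fraudolent"].mean() * 100
print(fraud_share_by_month)
```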


## SIMPLE MODEL APPROACH (Logistic Regression)
**We implemented a basic model to compare its performance against the language model used later.**
### Data Loading and Preparation

**-The script starts by importing necessary libraries from pandas and sklearn.**

**-It reads the SMS data from a CSV file into a pandas DataFrame.**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


df = pd.read_csv("sms.csv")

## Text Vectorization

**-A TfidfVectorizer is initialized, excluding English stop words.**

**-The 'SMS test' column (the name is probably a typo in the dataset for 'SMS text') is transformed into TF-IDF features, a numerical representation suitable for machine learning models.**

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['SMS test'])


## Target Variable

**The 'Fraudolent' column is set as the target variable y.**

In [None]:
y = df['Fraudolent']

### Train-Test Split

**The dataset is split into training and testing sets, with 20% of the data reserved for testing.**

**The random state is set for reproducibility, and the split is stratified on y so that both sets keep the original class proportions.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

## Model Training

**A Logistic Regression model is instantiated and trained using the training data.**

In [None]:
model = LogisticRegression()
model.fit(X_train, y_train)

## Model Prediction and Evaluation

**The model makes predictions on the test set.**

**A classification report is printed, showing key metrics like precision, recall, and F1-score for evaluating the model's performance on detecting fraudulent SMS messages.**

In [None]:
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))
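In the cells above the vectorizer is fitted on all messages before the split, so the vocabulary and IDF weights also reflect the test messages. One possible variant, sketched here on invented messages, wraps both steps in a scikit-learn `Pipeline` so that the vectorizer is fitted on the training split only. The toy texts and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented messages: 1 = fraudulent, 0 = legitimate
texts = [
    "win a free prize now", "claim your free cash prize",
    "urgent prize waiting claim now", "free cash winner claim today",
    "see you at lunch", "are we meeting tomorrow",
    "call me when you get home", "lunch tomorrow sounds good",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw texts first, so the test messages stay unseen
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# fit() learns the TF-IDF vocabulary and the classifier from training data only
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression()),
])
pipeline.fit(train_texts, y_train)

predictions = pipeline.predict(test_texts)
print(predictions)
```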

## SMOTE Resampling

**-The SMOTE method from imblearn.over_sampling is used to handle class imbalance.**

**-smote = SMOTE(random_state=42) creates a SMOTE object with a fixed random state for reproducibility.**

**-X_resampled, y_resampled = smote.fit_resample(X, df['Fraudolent']) applies SMOTE to the feature matrix X and target vector df['Fraudolent']. This results in a balanced dataset by creating synthetic samples for the minority class.**

In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, df['Fraudolent'])
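To make the idea of "synthetic samples" concrete: SMOTE places a new minority point on the line segment between a minority sample and one of its nearest minority neighbours. The following is a simplified NumPy sketch of that interpolation step, not the imblearn implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))   # pick a minority sample
        j = rng.choice(neighbours[i])     # pick one of its neighbours
        lam = rng.random()                # position along the segment, in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)

minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like_samples(minority_points, n_new=4)
print(new_points)
```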

## Train-Test Split

**train_test_split from sklearn.model_selection is used to split the data into training and testing sets.**

**-X_resampled and y_resampled are split into training (X_train, y_train) and testing (X_test, y_test) sets, with 20% of the data held out for testing.**

**-stratify=y_resampled ensures that the proportion of classes in both the training and testing sets reflects the proportion in the resampled dataset, which is important for maintaining balance in both sets.**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

## Model Training

**-A LogisticRegression model from sklearn.linear_model is instantiated.**

**-model.fit(X_train, y_train) trains the logistic regression model on the training data. This step involves the model learning to differentiate between classes based on the features provided.**

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

## Model Prediction and Evaluation

**predictions = model.predict(X_test) uses the trained model to predict the class labels for the test set.**

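**-print(classification_report(y_test, predictions)) prints a classification report that includes key metrics like precision, recall, and F1-score for each class. Such per-class metrics matter most when classes are imbalanced, where accuracy alone is not a sufficient measure. Here the test set is drawn from the SMOTE-resampled data, so it is balanced and contains synthetic samples; the scores are therefore not directly comparable with those of the first model, which was evaluated on original messages only.**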

In [None]:
from sklearn.metrics import classification_report
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
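The figures in the classification report all derive from four counts: true positives, false positives, false negatives and true negatives. A minimal pure-Python sketch with invented labels (the helper `binary_counts` is hypothetical):

```python
def binary_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Invented labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = binary_counts(y_true, y_pred)
precision = tp / (tp + fp)   # how many flagged messages were really fraudulent
recall = tp / (tp + fn)      # how many fraudulent messages were caught
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(precision, recall, f1)
```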

## LLM (Language Model) Approach

- We started by loading and testing the base model, using LLaMA-Factory and LM-Studio, to ensure easy reproducibility.
  - Evaluated the model on the SMS data (test set).
  - Extracted Metrics (Accuracy, Precision, Recall, F1-Score) and Confusion Matrix, in addition to "JSON output" evaluation.


- We fine-tuned the model using the training set, prompt "x" and same parameters as the base model.
  - Evaluated the model on the SMS data (test set).
  - Extracted Metrics (Accuracy, Precision, Recall, F1-Score) and Confusion Matrix, in addition to "JSON output" evaluation.
  - Assessed improvements etc.


- Converted the model to GGUF for easier CPU inference and to make the model more "computationally friendly", so it can also be run on 3rd party services like "LM Studio".


- Created two versions of the model:
  - F16 (16-bit precision) which is the default fine-tuned version.
  - Q8 (8-bit precision) which is a quantized version of the fine-tuned model, to make it even more "computationally friendly" but a little less reliable.
  - All uploaded to "Hugging Face" for easy access and reproducibility.

### Parameters & Choices

In [None]:
# Chosen prompt:

BOOL_SYSTEM_MESSAGE = """You are excellent message moderator, expert in detecting fraudulent messages.

You will be given "Messages" and your job is to predict if a message is fraudulent or not.

You ONLY respond FOLLOWING this json schema:

{
    "is_fraudulent": {
        "type": "boolean",
        "description": "Whether the message is predicted to be fraudulent."
    }
}

You MUST ONLY RESPOND with a boolean value in JSON. Either true or false in JSON. NO EXPLANATIONS OR COMMENTS.

Example of valid responses:
{
    "is_fraudulent": true
}
or 
{
    "is_fraudulent": false
}
"""

# After a small test, this was the only prompt that did not immediately yield poor results for both JSON validity and task performance

In [None]:
# Parameters used consistently during test, fine-tuning and final run.
# Could have played more with it, but considering time, we decided to stick with these.

temperature = 0.3
max_tokens = 128
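These values only take effect if they reach the inference call. With an OpenAI-compatible endpoint they are normally passed as the `temperature` and `max_tokens` arguments of the chat completion request. The sketch below assembles those arguments as a plain dictionary; the helper `build_request` and the model name `"local-model"` are hypothetical placeholders.

```python
temperature = 0.3
max_tokens = 128

def build_request(system_message, user_query, model="local-model"):
    """Assemble keyword arguments for a chat completion request."""
    return {
        "model": model,  # placeholder name, an assumption for this sketch
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_query},
        ],
        "temperature": temperature,  # lower values give more deterministic output
        "max_tokens": max_tokens,    # upper bound on the length of the reply
    }

request_kwargs = build_request("You are a message moderator.", "Hello!")
print(request_kwargs)
```

With the `openai` client, such a dictionary could then be unpacked into the call, for example `client.chat.completions.create(**request_kwargs)`.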

### Base Model Evaluation

## LM Studio - easy inference

### Requirements
- LM Studio
- LM Studio's server running on "http://localhost:1234/v1" or any other easily accessible address

We aim to obtain the result in JSON format, which is a common data format with diverse uses in data handling. The expected output should look like this:

```json
{
    "is_fraudulent": true
}
```

```json
{
    "is_fraudulent": false
}
```

The language model categorizes each SMS as fraudulent or not.

This is done by a combination of:
- Prompt engineering
- Fine-tuning
- Hyperparameter optimization
- Function-calling or JSON outputs
- Running inference through an API - LM Studio (mimics the OpenAI API)

### Base Implementation:

**Prompts:**

In [None]:

INSTRUCTIONS = """You are excellent message moderator, expert in detecting fraudulent messages.

You will be given "Messages" and your job is to predict if a message is fraudulent or not.

You ONLY respond FOLLOWING this json schema:

{
    "is_fraudulent": {
        "type": "boolean",
        "description": "Whether the message is predicted to be fraudulent."
    }
}

You MUST ONLY RESPOND with a boolean value in JSON. Either true or false in JSON. NO EXPLANATIONS OR COMMENTS.

Example of valid responses:
{
    "is_fraudulent": true
}
or 
{
    "is_fraudulent": false
}
"""


### Full process as a function

**-Run inference, get the JSON output, and return the result (with error handling).**

**-user_query is passed to the predict_fraudulence function, which uses the LLM to predict whether the message is fraudulent or not.**



In [None]:
import json
from typing import Optional

from openai import OpenAI

# LM Studio mimics the OpenAI API; the API key here is a dummy placeholder
local_server = "http://localhost:1234/v1"
client = OpenAI(base_url=local_server, api_key="sk_1234567890")

# System message sent with every request
system_message = INSTRUCTIONS


def predict_fraudulence(user_query: str) -> Optional[dict]:
    """Return the validated prediction dict, or None if the call or validation fails."""
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_query},
            ]
        )
        prediction_content = response.choices[0].message.content
        prediction_json = json.loads(prediction_content)

        # Validate the prediction JSON: it must hold a boolean under "is_fraudulent"
        if isinstance(prediction_json, dict) and \
                isinstance(prediction_json.get("is_fraudulent"), bool):
            return prediction_json
        else:
            raise ValueError(f"Invalid JSON structure from prediction: {prediction_content}")
    except Exception as e:
        print(f"An error occurred: {e}")
        return None
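Models sometimes wrap the JSON in extra text, in which case `json.loads` on the raw content fails. One possible, more tolerant parser is sketched below; `parse_prediction` is a hypothetical helper, and the regular expression assumes the object of interest contains no nested braces.

```python
import json
import re
from typing import Optional

def parse_prediction(raw: str) -> Optional[bool]:
    """Return the boolean under "is_fraudulent", or None if it cannot be found."""
    match = re.search(r"\{[^{}]*\}", raw)  # first flat {...} object in the text
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    value = data.get("is_fraudulent")
    # Accept real booleans only; strings such as "true" are rejected
    return value if isinstance(value, bool) else None

print(parse_prediction('{"is_fraudulent": true}'))
print(parse_prediction('Sure! Here is the answer: {"is_fraudulent": false} Hope that helps.'))
print(parse_prediction("I cannot answer that."))
```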


**user_query is a variable set to a specific text message**


In [None]:
# Run prediction
user_query = "Hi Chris! Can you send me $1,000,000? I need it for a school project. Thanks!"
prediction_json = predict_fraudulence(user_query)

if prediction_json:
    print(prediction_json)
else:
    print("Failed to get a valid prediction.")

**prediction_json holds the value returned by the function for the given user query; its "is_fraudulent" field contains the predicted label.**

In [None]:
prediction_json["is_fraudulent"]

## Evaluating Results from LLM

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import json
import pandas as pd

# Load the generated predictions file (one JSON object per line)
predictions_file_path = 'final_model_eval_2023-12-02-21-34-10/generated_predictions.jsonl'
with open(predictions_file_path, 'r') as file:
    predictions = [json.loads(line) for line in file]

# Display the first few entries to understand the structure
print(predictions[:5])

# Extracting labels and predictions (both are stored as JSON strings)
labels = [json.loads(entry['label'])['is_fraudulent'] for entry in predictions]
predicts = [json.loads(entry['predict'])['is_fraudulent'] for entry in predictions]

# Converting boolean to integer for metrics calculation
labels = [int(label) for label in labels]
predicts = [int(predict) for predict in predicts]

# Calculating metrics
accuracy = accuracy_score(labels, predicts)
precision = precision_score(labels, predicts)
recall = recall_score(labels, predicts)
f1 = f1_score(labels, predicts)

metrics_df = pd.DataFrame({
    "Metric": ["Accuracy", "Precision", "Recall", "F1"],
    "Value" : [accuracy, precision, recall, f1]
})
print(metrics_df.to_string(index=False))
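The list comprehensions above raise an exception on the first prediction that is not valid JSON. For the "JSON output" evaluation mentioned earlier, one option is to count how many outputs parse into the expected structure. A sketch on invented entries, assuming the same `predict` field as in the file above; `is_valid_output` is a hypothetical helper.

```python
import json

def is_valid_output(raw):
    """True if raw is a JSON object with a boolean "is_fraudulent" field."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and isinstance(data.get("is_fraudulent"), bool)

# Invented entries for illustration
toy_predictions = [
    {"predict": '{"is_fraudulent": true}'},
    {"predict": '{"is_fraudulent": "yes"}'},        # wrong type
    {"predict": "The message looks fraudulent."},   # not JSON at all
    {"predict": '{"is_fraudulent": false}'},
]

valid_flags = [is_valid_output(entry["predict"]) for entry in toy_predictions]
valid_rate = sum(valid_flags) / len(valid_flags)
print(f"Valid JSON outputs: {valid_rate:.0%}")
```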

## FROM PYTORCH TO GGUF
**Python is often a poor choice for production inference stacks.**

**In production systems, we would want to remove the reliance on PyTorch and Python.**

**Together with tools such as llama.cpp, GGUF enables efficient inference without Python.**

**GGUF also makes CPU inference practical.**

**After cloning the repository with the following commands in the terminal:**

`git clone https://github.com/ggerganov/llama.cpp`

`cd llama.cpp`

**we then run the `convert.py` script to convert the PyTorch model to GGUF, simply by passing the directory that contains the PyTorch files. The resulting GGUF file is a full 16-bit floating-point model; it is not yet quantized.**

`python convert.py <path-to-model-directory>`


**Then, with the help of the mkDev64build application, we built the quantizer, gave it the fine-tuned model, and selected the type of quantization we wanted (8-bit).**

**All these models are available in our Hugging Face repository (https://huggingface.co/SimplyLeo/Zephyr-Fraudulence-Detector)**
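To illustrate why the 8-bit version is smaller but slightly less exact than the 16-bit one, here is a simplified sketch of symmetric 8-bit quantization: a block of weights is stored as small integers plus one shared scale factor. This illustrates the general idea only and is not llama.cpp's exact file format; the function names and sample weights are hypothetical.

```python
def quantize_block(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_block(q_values, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in q_values]

# Invented weights for illustration
weights = [0.12, -0.5, 0.33, 0.0, 0.49]

q_values, scale = quantize_block(weights)
restored = dequantize_block(q_values, scale)

# Rounding to the nearest integer step is the source of the precision loss
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_values)
print(max_error)
```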