# Report on Implementation of Deep Learning Algorithms on DiFraud Dataset

# 1. Introduction
The purpose of this task is to evaluate the performance of different deep learning algorithms on the DiFraud dataset, which is available on Hugging Face. DiFraud is designed for detecting fraudulent activities, and we aim to compare the effectiveness of various models in identifying them.

# 2. Dataset

- ### 2.1 Description
The DiFraud dataset contains text records labeled as either fraudulent or legitimate, and the features are the attributes of each record that are relevant for detecting fraud. As a first step, I read the dataset description on Hugging Face to learn more about it. Each subdirectory of the dataset contains an individual dataset split into three files: train.jsonl, test.jsonl, and validation.jsonl. The splits are train=80%, test=10%, and validation=10%.
First I tried to load the dataset with **"dataset = load_dataset('difraud')"**, but due to various restrictions and policies on Iran's Internet, I couldn't download the data through this API. So I decided to download the dataset manually and read it in the code. Unfortunately, the same issue exists with ChatGPT and other AI tools, since our IP address and location are blocked by these services even when we use a VPN or a proxy.

- ### 2.2 Preprocessing
After downloading the dataset, I converted the **.jsonl** files to **.csv** with a Python script so that I could inspect and work with the files more easily. The conversion code is in the first cell of the Jupyter notebook for each subdirectory. I then wrote code to check that each file can be read and to see what it contains.
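
A minimal sketch of such a conversion is shown here, using only the Python standard library. It is not the notebook's exact code: the function names `jsonl_to_csv` and `convert_splits` are illustrative, and the column names are taken from whatever keys appear in the records rather than assumed.

```python
import csv
import json
from pathlib import Path


def jsonl_to_csv(jsonl_path, csv_path):
    """Convert a JSON Lines file to CSV and return the number of rows written."""
    with open(jsonl_path, encoding="utf-8") as src:
        # One JSON object per non-empty line.
        records = [json.loads(line) for line in src if line.strip()]

    # Collect column names in first-seen order so records with differing keys still fit.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    return len(records)


def convert_splits(dataset_dir):
    """Convert the split files of one subdirectory, skipping any that are missing."""
    counts = {}
    for split in ("train", "validation", "test"):
        src = Path(dataset_dir) / f"{split}.jsonl"
        if src.exists():
            counts[split] = jsonl_to_csv(src, src.with_suffix(".csv"))
    return counts
```

Calling `convert_splits` on a downloaded subdirectory writes a `.csv` file next to each `.jsonl` file and returns the row count per split, which is a quick way to confirm that the files were read completely.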

# 3. Models Implemented
- ### 3.1 Convolutional Neural Network (CNN)
CNNs are typically used for image data but can be adapted for sequence data by treating the sequence as a 1D "image".

- ### 3.2 Bidirectional Long Short-Term Memory (BiLSTM)
BiLSTM networks process data in both forward and backward directions, which can capture context from both past and future states.

- ### 3.3 Recurrent Neural Network (RNN)
RNNs are suitable for sequence data as they maintain a state that can capture information from previous steps in the sequence.

- ### 3.4 Gated Recurrent Unit (GRU)
GRUs are a type of RNN that aim to solve the vanishing gradient problem, similar to LSTMs but with a more streamlined architecture.

I also tried to implement a FastText model on the dataset, but installing it with "pip install FastText" failed repeatedly with an issue I could not resolve, so I didn't implement this model. If you know the solution, I will be more than happy to hear it. The error shown in PowerShell was:
- ERROR: Could not build wheels for fasttext, which is required to install pyproject.toml-based projects

For each implementation I have written a description of the important parts of the code, to make it clearer what I've done. I would love to go through every line, but since that would take too much time and too many pages, I will only give an overview. The first line of each cell also states which implementation it contains (e.g. #RNN Implementation). Now let's go through each implementation:

## Convolutional Neural Network (CNN)

After importing the libraries, loading the datasets, defining some constants, tokenizing the text, converting the text to sequences, padding the sequences, and converting the labels to NumPy arrays, we build the model. A Sequential model is created. An **Embedding** layer maps each word to a 128-dimensional vector. Two **Conv1D** layers with ReLU activation extract features from the sequences. A **MaxPooling1D** layer then reduces the sequence length by downsampling, and a **GlobalMaxPooling1D** layer reduces the sequence to a single vector. A **Dense** layer with 128 units and ReLU activation follows, and a final **Dense** layer with 1 unit and sigmoid activation outputs the probability for binary classification.
The model is compiled with the **Adam optimizer** and the binary crossentropy loss function. **Accuracy** is used as the evaluation metric.
(I tried using model.summary to show the layers and their output shapes, but it didn't produce any output. I also tried it with the other models to see whether it works there. I wanted to remove it from the code, but since rerunning the cells takes a long time and it wasn't a big deal, I left it as it is.)
Finally, the model is trained for 10 epochs with a batch size of 32. The validation data is used to monitor performance during training, and the model is then evaluated on the test set and the test accuracy is printed.
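
The layer stack described above can be sketched in Keras roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the vocabulary size, the padded sequence length, the number of filters, the kernel size, and the pool size are not given in this report, so the values below are placeholders, and `build_cnn` is a hypothetical helper name.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Assumed values: the report does not state them.
VOCAB_SIZE = 10000  # number of distinct token indices (placeholder)
MAX_LEN = 200       # padded sequence length (placeholder)


def build_cnn(vocab_size=VOCAB_SIZE, max_len=MAX_LEN):
    """Embedding -> Conv1D x2 -> MaxPooling1D -> GlobalMaxPooling1D -> Dense -> sigmoid."""
    model = keras.Sequential([
        keras.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, 128),          # 128-dimensional word vectors
        layers.Conv1D(128, 5, activation="relu"),   # filters and kernel size are assumptions
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),           # downsample the sequence (pool size assumed)
        layers.GlobalMaxPooling1D(),                # collapse the sequence to one vector
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # probability of the positive class
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


# Training as described in the text (X_* and y_* are the padded sequences and labels):
# model = build_cnn()
# model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))
# test_loss, test_acc = model.evaluate(X_test, y_test)
```

Declaring the input shape with `keras.Input` means the model is built as soon as it is created, so the layer output shapes are known before training starts.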

For some datasets the accuracy was about 96 to 98 percent. However, for others (e.g. Political Statements) the accuracy was low (about 60%), so I tried to improve the model by increasing the number of epochs and by using **GloVe** embeddings. *GloVe* (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. GloVe embeddings are pre-trained on large datasets and can be used to initialize the embedding layer of neural networks for various natural language processing tasks. I used *glove.6B.100d.txt* and wrote a code segment for reading this file. However, even after applying these methods, the model didn't improve significantly, and the results in Section 4 show that accuracy on some datasets was lower with GloVe. I think several explanations are possible. The datasets with higher accuracy might contain higher-quality data with clear and distinguishable features, while the datasets with lower accuracy might contain more noise or more ambiguous features. The underlying distributions of the high-accuracy and low-accuracy datasets might also differ significantly, and the model might be better suited to the distribution of the datasets on which it achieved higher accuracy. In addition, the datasets with about 60% accuracy might be more diverse, presenting more challenging cases that the model finds difficult to handle. Another possible reason is overfitting or underfitting.
The GloVe code segment works as follows:
The function reads the GloVe file line by line. Each line contains a word followed by its embedding coefficients. The word is stored in *word*, and its coefficients are stored in *coefs*. The **embeddings_index** dictionary maps each word to its embedding vector. Then an embedding matrix of zeros is initialized with shape **(len(word_index) + 1, embedding_dim)**. The **+ 1** accounts for the fact that word indices are 1-based (0 is reserved for padding). For each word in **word_index**, the GloVe vector is looked up in **embeddings_index**, and if the word has a GloVe embedding, it is placed in **embedding_matrix** at the row given by the word's index. Finally, the matrix is passed to the model through **weights=[embedding_matrix]**, which initializes the embedding layer with the GloVe embeddings, so that each row of the layer corresponds to the GloVe vector of one word. This setup can improve the model's performance, since the embeddings capture a lot of semantic meaning from a large corpus of text. However, it didn't improve the model's accuracy for some datasets.
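
The steps above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the function names `load_glove` and `build_embedding_matrix` are mine, while `embeddings_index`, `word_index`, `coefs`, and `embedding_matrix` follow the names used in the explanation.

```python
import numpy as np


def load_glove(path):
    """Read a GloVe text file into a dict that maps each word to its vector."""
    embeddings_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.rstrip().split(" ")
            word = values[0]                                  # first field is the word
            coefs = np.asarray(values[1:], dtype="float32")   # the rest are coefficients
            embeddings_index[word] = coefs
    return embeddings_index


def build_embedding_matrix(word_index, embeddings_index, embedding_dim):
    """Row i holds the GloVe vector of the word with index i; row 0 is the padding row."""
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector   # words without a GloVe vector stay all-zero
    return embedding_matrix
```

With *glove.6B.100d.txt*, `embedding_dim` is 100. Words that do not occur in the GloVe file keep an all-zero row, so a dataset with many such words gets little benefit from the pre-trained vectors.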

## Bidirectional Long Short-Term Memory (BiLSTM)

Since only the model code differs from the previous implementation, I will only explain the code that creates the model. This time I used the GloVe embeddings for the BiLSTM model by default, to make sure the model does its best on this task.
First, we define a function that creates a Bidirectional LSTM model: it uses the GloVe embeddings, adds a Bidirectional LSTM layer with 128 units, and uses GlobalMaxPooling1D to reduce the output. It then adds a Dense layer with 128 units and ReLU activation, a Dropout layer for regularization, and a Dense output layer with sigmoid activation for binary classification. We compile the model with the Adam optimizer and binary crossentropy loss.

For this model I used **K-fold Cross-Validation** to assess performance before training the final model on the entire dataset. We perform 5-fold cross-validation: for each fold, the data is split into training and validation sets, a new model is created and trained, the model is evaluated on the validation set, and the accuracy is appended to the *accuracies* list. At the end, we calculate and print the average cross-validation accuracy.
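
The cross-validation loop can be sketched with NumPy alone, as below. This is a hand-rolled illustration and not necessarily how the notebook does the split (a library helper may have been used there). `k_fold_indices`, `cross_validate`, and the `build_and_score` callback are hypothetical names; the callback stands in for "create a new model, train it, and return its validation accuracy".

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)          # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]                      # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


def cross_validate(X, y, build_and_score, k=5):
    """Return the mean validation accuracy over k folds and the per-fold values.

    build_and_score(X_train, y_train, X_val, y_val) is a hypothetical callback
    that creates a fresh model, trains it, and returns its validation accuracy.
    """
    accuracies = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        acc = build_and_score(X[train_idx], y[train_idx], X[val_idx], y[val_idx])
        accuracies.append(acc)
    return float(np.mean(accuracies)), accuracies
```

Creating the model inside the callback matters: reusing one model across folds would let weights learned on earlier folds leak into later ones and inflate the average.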

## Recurrent Neural Network (RNN)

The code for the RNN is the same, with a slight difference: it uses two LSTM layers.
**First LSTM Layer:** 128 is the number of LSTM units, in other words the dimensionality of the output space. *return_sequences=True* ensures that the LSTM layer returns the full sequence of outputs for the next LSTM layer.
**Second LSTM Layer:** By default, **return_sequences** is False, so this LSTM layer returns only the last output of the output sequence, which is passed to the next layer.
In the compile step, we use the Adam optimizer with a learning rate of 0.001. **loss='binary_crossentropy'**: the binary crossentropy loss function is used for binary classification tasks. **metrics=['accuracy']**: the accuracy metric is used to evaluate the model's performance.
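
A rough Keras sketch of the two stacked LSTM layers is given below. It is an illustration only: the embedding dimension, the number of units in the second LSTM layer, and the single sigmoid output layer are assumptions, since this section does not state them, and `build_stacked_lstm` is a hypothetical helper name.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_stacked_lstm(vocab_size, max_len, embedding_dim=100):
    """Two stacked LSTM layers followed by a sigmoid output (sizes partly assumed)."""
    model = keras.Sequential([
        keras.Input(shape=(max_len,)),
        layers.Embedding(vocab_size, embedding_dim),
        # Returns one output per time step, shape (batch, max_len, 128),
        # which is the 3D input the next LSTM layer expects.
        layers.LSTM(128, return_sequences=True),
        # return_sequences defaults to False: only the last output, shape (batch, 128).
        layers.LSTM(128),  # unit count of the second layer is an assumption
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.001),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

If `return_sequences=True` were left out of the first layer, the second LSTM would receive a 2D tensor and Keras would raise an input-shape error, which is why the flag is needed only on the lower layer.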

## Gated Recurrent Unit (GRU)

For the last model, the architecture is a little different. Let's go through the important parts of the code. The **tokenize** function converts text to lowercase and tokenizes it. The **build_vocab** function creates a vocabulary that maps each word to a unique index, reserving indices 0 and 1 for the padding and unknown tokens. It builds this vocabulary from the combined training and validation text. The **TextDataset** class then handles tokenizing and padding the text and makes the data compatible with PyTorch's **DataLoader**.
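
The vocabulary logic described above can be sketched with the standard library as follows. This is an illustration under assumptions: the regex tokenizer is a simple stand-in (the notebook may use a different tokenizer), and `encode`, `min_freq`, and the `<pad>`/`<unk>` token strings are names chosen here. The **TextDataset** wrapper is not shown.

```python
import re
from collections import Counter

PAD_IDX, UNK_IDX = 0, 1  # indices reserved for padding and unknown tokens


def tokenize(text):
    """Lowercase the text and split it into word tokens (simple regex stand-in)."""
    return re.findall(r"\w+", text.lower())


def build_vocab(texts, min_freq=1):
    """Map each token that occurs at least min_freq times to a unique index >= 2."""
    counts = Counter(token for text in texts for token in tokenize(text))
    vocab = {"<pad>": PAD_IDX, "<unk>": UNK_IDX}
    for token, freq in counts.most_common():
        if freq >= min_freq:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Convert text to a fixed-length list of indices, truncating or padding as needed."""
    ids = [vocab.get(token, UNK_IDX) for token in tokenize(text)][:max_len]
    return ids + [PAD_IDX] * (max_len - len(ids))
```

Building the vocabulary from the training and validation text only, as the report does, means that words seen only in the test set map to the unknown index, which keeps the test evaluation honest.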

Next, let's review the definition of the GRU model. It consists of an embedding layer, a GRU layer, a fully connected layer, and dropout for regularization.
The model instance is created from the vocabulary size, embedding dimension, hidden dimension, number of unique labels, and padding index, which are defined in the first lines of the code.

We also set the device to the GPU if one is available and to the CPU otherwise. The loss function is cross-entropy loss, and the optimizer is Adam.
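
A minimal PyTorch sketch of such a model is shown below. It is not the notebook's code: the class name `GRUClassifier`, the dropout rate, the single-layer unidirectional GRU, and the sizes in the usage lines are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class GRUClassifier(nn.Module):
    """Embedding -> GRU -> dropout -> fully connected layer (sizes are assumptions)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_idx, dropout=0.5):
        super().__init__()
        # padding_idx keeps the padding vector at zero and excludes it from gradient updates.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        _, hidden = self.gru(embedded)             # hidden: (1, batch, hidden_dim)
        return self.fc(self.dropout(hidden[-1]))   # logits: (batch, num_classes)


# Device, loss, and optimizer as described in the text (sizes here are placeholders).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GRUClassifier(vocab_size=5000, embed_dim=100, hidden_dim=128,
                      num_classes=2, pad_idx=0).to(device)
criterion = nn.CrossEntropyLoss()   # expects raw logits and integer class labels
optimizer = torch.optim.Adam(model.parameters())
```

Because `nn.CrossEntropyLoss` applies the softmax internally, the model returns raw logits with one column per label, which differs from the single sigmoid output used in the Keras models.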

We define functions for training and evaluating the model. In the code, **train_epoch** trains the model for one epoch, while **eval_model** evaluates the model without updating its weights. Finally, the training loop runs for a specified number of epochs and prints the training and validation loss and accuracy after each epoch. It then evaluates the model on the test set and prints the test accuracy.
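
The two functions can be sketched as follows. The names **train_epoch** and **eval_model** come from the report, but the signatures and bodies are assumptions: the sketch expects a loader that yields `(inputs, labels)` batches and a model that returns one logit per class.

```python
import torch
import torch.nn as nn


def train_epoch(model, loader, criterion, optimizer, device):
    """Train for one epoch and return (mean loss, accuracy) over the loader."""
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(inputs)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return total_loss / total, correct / total


@torch.no_grad()  # no gradients are tracked, so no weights can be updated
def eval_model(model, loader, criterion, device):
    """Evaluate without updating weights and return (mean loss, accuracy)."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    for inputs, labels in loader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)
        total_loss += criterion(logits, labels).item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    return total_loss / total, correct / total
```

Switching between `model.train()` and `model.eval()` matters for this architecture because dropout is active only in training mode, so validation and test accuracy are measured with the full network.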

# 4. Results

In this section, I document the output of each model. However, due to the limited processing power of my laptop, I was unable to gather results for every model on every dataset. Training took a considerable amount of time, especially for the GRU model, so I trained it only on the "Fake News" dataset. I ran the code on my Asus X570UD laptop, which has an Intel Core™ i5-8250U processor, an NVIDIA GeForce GTX 1050 GPU with 4GB GDDR5, and 16GB of DDR4 SO-DIMM memory. Although I have provided some results here, I will collect the remaining results for the model and dataset combinations that I have not run yet. The accuracy and loss values on the validation set can be found in the Jupyter notebooks of the implementations.

### CNN
| Dataset | Train Accuracy | Train Loss | Test Accuracy |
|---|---|---|---|
| **Fake News** | 0.9228 | 0.4898 | 91.94% |
| **Job Scams** | 0.9754 | 0.2205 | 97.27% |
| **Phishing** | 0.9823 | 0.1109 | 98.04% |
| **Political Statements** | 0.6389 | 2.7066 | 62.64% |
| **Product Reviews** | 0.6566 | 2.7181 | 62.96% |
| **SMS** | 0.9861 | 0.0796 | 99.09% |
| **Twitter Rumours** | 0.8686 | 0.6653 | 86.36% |


### CNN (with GloVe)
| Dataset | Train Accuracy | Train Loss | Test Accuracy |
|---|---|---|---|
| **Fake News** | 0.6415 | 0.9321 | 63.92% |
| **Job Scams** | 0.6616 | 0.9461 | 65.22% |
| **Phishing** | 0.6313 | 1.0321 | 62.32% |
| **Political Statements** | 0.6019 | 0.9624 | 60.08% |
| **Product Reviews** | 0.6313 | 0.7937 | 62.73% |
| **SMS** | 0.9914 | 0.0585 | 99.24% |
| **Twitter Rumours** | 0.8407 | 0.5868 | 84.80% |


### BiLSTM
| Dataset | Train Accuracy | Train Loss | Test Accuracy |
|---|---|---|---|
| **Fake News** | N/A | N/A | N/A |
| **Job Scams** | N/A | N/A | N/A |
| **Phishing** | 0.9823 | 0.0555 | 98.23% |
| **Political Statements** | 0.6539 | 0.9522 | 62.48% |
| **Product Reviews** | 0.6515 | 0.9305 | 63.87% |
| **SMS** | 0.9889 | 0.0547 | 98.94% |
| **Twitter Rumours** | 0.8609 | 0.4515 | 85.15% |


### RNN
| Dataset | Train Accuracy | Train Loss | Test Accuracy |
|---|---|---|---|
| **Fake News** | 0.8952 | 0.5062 | 88.76% |
| **Job Scams** | 0.9608 | 0.2809 | 96.08% |
| **Phishing** | 0.9827 | 0.0853 | 97.77% |
| **Political Statements** | 0.6000 | 2.7423 | 58.00% |
| **Product Reviews** | 0.6202 | 2.3471 | 60.25% |
| **SMS** | N/A | N/A | N/A |
| **Twitter Rumours** | N/A | N/A | N/A |

### GRU

I could only run the GRU on the "Fake News" dataset, with the following results. Its test accuracy is higher than that of the other models on this dataset (91.94% for the CNN and 88.76% for the RNN):

| Dataset | Train Accuracy | Train Loss | Test Accuracy |
|---|---|---|---|
| **Fake News** | 1.0 | 0.0003 | 96.48% |

# Future Work
Future work could explore:

1. Analyzing these results in more depth with plots and diagrams to see which model performs better on each dataset.
2. Hyperparameter tuning for further performance improvement.
3. Incorporating more advanced architectures such as Transformers and BERT.
4. Running all existing code on all of the datasets. However, this takes a lot of time for the GRU model, especially on my low-end laptop.

# Sources
Here are the sources that I used for this task:

- Hugging Face DiFraud Dataset: https://huggingface.co/datasets/difraud/difraud
- How to use GloVe File: https://www.geeksforgeeks.org/pre-trained-word-embedding-using-glove-in-nlp-models/
- GloVe source and documentation: https://nlp.stanford.edu/projects/glove/
- PyTorch DataLoader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
- PyTorch nn.Embedding: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
- Keras Conv1D Layer: https://keras.io/api/layers/convolution_layers/convolution1d/
- Keras MaxPooling1D Layer: https://keras.io/api/layers/pooling_layers/max_pooling1d/
- Keras GlobalMaxPooling1D Layer: https://keras.io/api/layers/pooling_layers/global_max_pooling1d/
- Keras Dense Layer: https://keras.io/api/layers/core_layers/dense/
- Keras Model Compilation and Training: https://keras.io/api/models/model_training_apis/
- Keras Tokenizer: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer
- Keras Padding Sequences: https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences
- PyTorch nn.GRU: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
- PyTorch nn.CrossEntropyLoss: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
- nltk Tokenizer: https://www.nltk.org/api/nltk.tokenize.html
- nltk Tokenizer: https://www.nltk.org/howto/tokenize.html
- Text Tokenizing: https://www.geeksforgeeks.org/tokenize-text-using-nltk-python/
- LSTM: https://www.youtube.com/watch?v=5dMXyiWddYs
- Simple Explanation of GRU: https://www.youtube.com/watch?v=tOuXgORsXJ4&pp=ygURZ3J1IGRlZXAgbGVhcm5pbmc%3D