### Competition Overview
DataHack is a datathon hosted by the Club Scientifique de l'ESI (CSE) community at the Higher National School of Computer Science (ESI, ex-INI) in Algiers for the 2023/2024 season.
### Description
In this challenge, you are tasked with developing machine learning models to distinguish spam from non-spam (genuine) reviews of Chicago hotels and restaurants. The dataset consists of reviews and metadata, and the goal is to build models that accurately classify each review as either spam (Y) or non-spam (N).  
[**Link to the competition**](https://www.kaggle.com/competitions/DATAHACK-Collective-Opinion-Spam-Detection/overview)

#### **Note**:
In this notebook I kept only the reviews column in order to use BERT. Later in this competition I tried encoding the reviews column with Word2Vec and training XGBoost on top, which gave me much better results because it also exploited the other columns.  

#### Imports

In [1]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
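from transformers import BertTokenizer, BertForSequenceClassification
# The AdamW shipped with transformers is deprecated; use the PyTorch implementation.
from torch.optim import AdamW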
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import accuracy_score



This code defines a custom PyTorch `Dataset` class named `CustomDataset` that turns rows of a dataframe into model-ready tensors. It takes three parameters at initialization: `dataframe` (the review texts and their labels), `tokenizer` (a pre-trained tokenizer from the Hugging Face Transformers library), and `max_length` (the maximum number of tokens kept per review).

The `__len__` method returns the number of rows, while `__getitem__` retrieves the sample at a given index. For each sample, it reads the text and label from the dataframe, encodes the text with the tokenizer (truncating longer reviews and padding shorter ones to exactly `max_length` tokens), and returns a dictionary containing `'input_ids'` (the token ids), `'attention_mask'` (1 for real tokens, 0 for padding), and `'labels'` (the label as a PyTorch tensor of type long). Because `torch.tensor` needs a numeric value, the `Y`/`N` labels have to be mapped to integers before the tensor is built.

The purpose is to make sure every text input is tokenized and shaped the way the model expects.
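To make the padding, truncation, and attention-mask behaviour described above concrete, here is a minimal sketch in plain Python. The `encode` function, the whitespace splitting, and the on-the-fly vocabulary are all made up for illustration; the real `BertTokenizer` uses WordPiece and adds special tokens, so its ids will look different.

```python
# Toy illustration only: a whitespace "tokenizer" with a made-up vocabulary.
PAD_ID = 0

def encode(text, vocab, max_length):
    """Return fixed-length input_ids and attention_mask for one text."""
    # Assign each new word the next free id (ids start at 1; 0 is padding).
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.lower().split()]
    ids = ids[:max_length]              # plays the role of truncation=True
    mask = [1] * len(ids)               # 1 marks a real token
    pad = max_length - len(ids)
    ids += [PAD_ID] * pad               # plays the role of padding='max_length'
    mask += [0] * pad                   # 0 marks padding
    return {"input_ids": ids, "attention_mask": mask}

vocab = {}
short = encode("great oysters", vocab, max_length=4)
long = encode("a friend told me about this place", vocab, max_length=4)

print(short)  # {'input_ids': [1, 2, 0, 0], 'attention_mask': [1, 1, 0, 0]}
print(long)   # {'input_ids': [3, 4, 5, 6], 'attention_mask': [1, 1, 1, 1]}
```

Both outputs have the same length, which is what lets the `DataLoader` stack samples into a batch.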


In [3]:
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

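    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        # Column names follow the CSV: 'reviews' for the text, 'Label' for the target.
        text = str(row['reviews'])
        # Labels are stored as 'Y' (spam) / 'N' (genuine); torch.tensor needs a number.
        label = 1 if row['Label'] == 'Y' else 0
        # Truncate or pad every review to exactly max_length tokens.
        encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')
        return {
            # flatten() drops the batch dimension added by return_tensors='pt'.
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }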

#### Loading training data

In [2]:
train_data = pd.read_csv("./data/training_data.csv")  
train_data.head()

Unnamed: 0,id,Date,review ID,reviewer ID,product ID,rating_Helpful,rating_Thanks,rating_LoveThis,rating_OhNo,reviews,Label
0,0,5/17/2009,0dFa6egshOwhusL8aSMw-Q,8GC6cFcby0stKarnzL9i2w,dKcO9OQ44RPRlkWe-vToFA,0,0,0,4,Just got back from Shaw's. Great oysters. They...,Y
1,1,10/25/2011,htQgJ_Z0ADA_QHeKthfeFw,88KSdQ5IMdpCkOidmq1udg,NkOir65b_YAAQVlJR_zmJA,0,0,0,2,Back from friday & saturday nite stays in King...,Y
2,2,8/23/2009,2RsvT8p0SuAC25bhAi3EIw,bMKlvA-zWF4jU3OJCVbVlA,cQnY_VneZisfUAqcbuEuKg,0,0,0,5,It is a beautiful Saturday afternoon and my wi...,N
3,3,10/28/2011,LM-zONQMUNnAuf6NBISrow,9DMoXd0afrTIdpcwcDDVsw,WBU0yq9J8qiYQfI_fh2P1Q,0,1,1,5,A friend told me about this place but I have t...,N
4,4,6/18/2010,-DoQeDcNYFdmhOYcgx2MjQ,PyUn2FeMuLdmyB6xxMe4NA,-pO0hsi0xlF4FwqLGJUizg,0,2,0,5,I went to Uncommon Ground for brunch on a Sund...,N


#### Loading testing data

In [3]:
test_data = pd.read_csv('./data/testing_data.csv')
test_data.head()

Unnamed: 0,id,Date,review ID,reviewer ID,product ID,rating_Helpful,rating_Thanks,rating_LoveThis,rating_OhNo,reviews
0,0,11/25/2007,EpUIAOmCal3KLpwfRPwaSw,y5-7amVLpxYyg3EsaV_nSw,tW2jfL-qMccAYZSghPBbHA,0,0,0,4,"Great pizza, good location, and a parking lot...."
1,1,3/16/2009,WP8YNEOrIYkA-JD1pj4SoA,8wtvJvvxDehPtYgep_525Q,LMaoM2Ue2BR_HI9ba3JsZg,0,0,0,5,This is my favorite place in Chicago. The food...
2,2,11/9/2009,fIklWlw56IGRosS,LdBKnXa6JeePsi_SVuSdHQ,O6uWHgJzylSjWjPSJKGhnQ,0,0,0,4,A few friends and I were visiting our other fr...
3,3,5/12/2010,7wVIW6OChqj4Y4y7OiuLVw,w2mghTRdP5THiUPSpxVl9g,DXwSYgiXqIVNdO9dazel6w,1,0,1,3,How do I put this politely without offending m...
4,4,8/26/2012,GdMImdnQta4l3AkQILj2HA,oq6T6FcKl0TA9LV_970_8Q,p9aMkgTdOKhsjkkv4G0QBw,0,0,0,3,"Traveling through Chicago for business, they b..."


#### Keeping only the reviews column and dropping IDs

In [9]:
columns_to_drop = ['id', 'Date', 'review ID','reviewer ID','product ID','rating_Helpful','rating_Thanks','rating_LoveThis','rating_OhNo']
test_data.drop(columns=columns_to_drop, inplace=True)
train_data.drop(columns=columns_to_drop, inplace=True)

In [10]:
test_data.head()

Unnamed: 0,reviews
0,"Great pizza, good location, and a parking lot...."
1,This is my favorite place in Chicago. The food...
2,A few friends and I were visiting our other fr...
3,How do I put this politely without offending m...
4,"Traveling through Chicago for business, they b..."


In [11]:
train_data.head()

Unnamed: 0,reviews,Label
0,Just got back from Shaw's. Great oysters. They...,Y
1,Back from friday & saturday nite stays in King...,Y
2,It is a beautiful Saturday afternoon and my wi...,N
3,A friend told me about this place but I have t...,N
4,I went to Uncommon Ground for brunch on a Sund...,N
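The `Label` column still holds the strings `Y` and `N`, while a classifier needs integer class ids. One way to convert them is sketched below on a tiny stand-in dataframe. The placeholder texts and the `label_id` column name are assumptions for illustration, as is the choice of 1 for spam and 0 for genuine.

```python
import pandas as pd

# Tiny stand-in for train_data; the texts are placeholders, not real reviews.
toy = pd.DataFrame({
    "reviews": ["first review", "second review", "third review"],
    "Label": ["Y", "N", "N"],
})

# Map the string labels to integers: spam (Y) -> 1, genuine (N) -> 0.
toy["label_id"] = toy["Label"].map({"Y": 1, "N": 0})

# Any value outside {Y, N} would become NaN, so check before training.
assert toy["label_id"].notna().all()
toy["label_id"] = toy["label_id"].astype(int)

print(toy["label_id"].tolist())  # [1, 0, 0]
```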


This code initializes a BERT (Bidirectional Encoder Representations from Transformers) model for sequence classification using the `BertForSequenceClassification` class from the Hugging Face Transformers library. The model is loaded with pre-trained weights from the `bert-base-uncased` checkpoint, a BERT model trained on lowercased English text.

The `num_labels` parameter is set to 2 because this is a binary classification task: the target is either 0 or 1, so the model outputs two scores (logits) per review.

Only the BERT encoder comes pre-trained. The classification layer on top is newly initialized, as the warning printed below the cell points out, so the model has to be fine-tuned on the review data before its predictions mean anything.

In [12]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
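With `num_labels=2` the model returns two logits per review, and the predicted class is the index of the larger one. The sketch below shows that step in plain Python with made-up logits. Treating index 0 as genuine (N) and index 1 as spam (Y) is an assumption that depends on how the labels were encoded.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for one review: index 0 = genuine (N), index 1 = spam (Y).
logits = [-0.3, 1.2]
probs = softmax(logits)

# Index of the largest logit, the same role torch.argmax plays in the notebook.
pred = max(range(len(logits)), key=lambda i: logits[i])

print(pred)  # 1
```

Softmax does not change which index is largest, so taking the argmax of the logits directly gives the same prediction as taking it over the probabilities.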


This code cell fine-tunes the BERT classifier and then measures its accuracy.

First, it sets the hyperparameters: `batch_size` (number of samples per batch), `max_length` (maximum number of tokens per review), `epochs` (number of passes over the training data), and `learning_rate` (learning rate for the optimizer).

Then it builds the training dataset with the `CustomDataset` class defined earlier, wraps it in a `DataLoader` that yields shuffled batches, and creates the AdamW optimizer with the specified learning rate. The `tokenizer` passed to `CustomDataset` has to match the model checkpoint, i.e. `BertTokenizer.from_pretrained('bert-base-uncased')`.

During training, for every batch in every epoch, it resets the gradients, runs a forward pass with the labels so that the model returns the loss, backpropagates, and updates the parameters. The average loss is printed at the end of each epoch.

After training, it switches the model to evaluation mode, runs it without gradient tracking, takes the argmax of the logits as the prediction, and compares the predictions with the ground-truth labels to compute accuracy. Accuracy can only be computed on labeled data: `testing_data.csv` has no `Label` column, so the evaluation has to use a labeled subset held out from the training data rather than the competition test file.

The resulting accuracy indicates how well the fine-tuned model separates spam from genuine reviews on text it was not trained on.
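The number of optimizer updates follows directly from these hyperparameters. The arithmetic below uses a hypothetical training-set size of 1000 reviews, which is not the real size of the competition data.

```python
import math

n_reviews = 1000      # hypothetical training-set size, not the real one
batch_size = 8
epochs = 3

# A DataLoader yields ceil(n / batch_size) batches per epoch unless drop_last=True.
steps_per_epoch = math.ceil(n_reviews / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 125 375
```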


In [None]:
batch_size = 8
max_length = 128
epochs = 3
learning_rate = 2e-5

# Use a GPU when one is available; fine-tuning BERT on CPU is very slow.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# The tokenizer must match the checkpoint the model was loaded from.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# testing_data.csv has no Label column, so hold out part of the labeled
# training data for evaluation (20% and the seed are arbitrary choices).
train_df, val_df = train_test_split(
    train_data, test_size=0.2, random_state=42, stratify=train_data['Label']
)

train_dataset = CustomDataset(train_df, tokenizer, max_length)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

optimizer = AdamW(model.parameters(), lr=learning_rate)

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        # Passing labels makes the model compute the cross-entropy loss itself.
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        loss.backward()
        optimizer.step()

    avg_train_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch + 1}/{epochs}, Average Training Loss: {avg_train_loss:.4f}")

# Evaluation on the held-out validation split
val_dataset = CustomDataset(val_df, tokenizer, max_length)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

model.eval()
all_preds = []
all_labels = []

for batch in val_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels']

    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    # The predicted class is the index of the largest logit.
    logits = outputs.logits
    preds = torch.argmax(logits, dim=1).cpu().numpy()
    all_preds.extend(preds)
    all_labels.extend(labels.numpy())

accuracy = accuracy_score(all_labels, all_preds)
print(f"Validation Accuracy: {accuracy:.4f}")
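For reference, the accuracy printed above is simply the fraction of predictions that match the labels. The plain-Python sketch below computes the same quantity as `accuracy_score` for this simple case. The `accuracy` helper and the example labels and predictions are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels and predictions (1 = spam, 0 = genuine).
labels = [1, 0, 0, 1, 0]
preds = [1, 0, 1, 1, 0]

print(f"{accuracy(labels, preds):.4f}")  # 0.8000
```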