# 1. Setup

In [1]:
# add root path to system path
import sys
import os

sys.path.append('../')

os.chdir("../")

In [2]:
import pandas as pd
import warnings

from utility.paths import DataPath
from preprocessing import Preprocessing
from models.gru import GRU
from tqdm.auto import tqdm

warnings.filterwarnings("ignore")



**Because implementing the GRU model is fairly involved, all the necessary logic is encapsulated in a dedicated class in [gru.py](../models/gru.py). This notebook is used mainly to present our results, which keeps it concise and clear.**

# 2. Data loading

First, we will load the pre-processed datasets.

In [3]:
# Create the training and testing preprocessing objects.
train_prep = Preprocessing([DataPath.TRAIN_NEG_FULL, DataPath.TRAIN_POS_FULL])
test_prep = Preprocessing([DataPath.TEST], is_test=True)

In [4]:
# Declare parameters for the GRU.
MAX_LEN = 256
BATCH_SIZE = 128
EPOCHS = 10
EMBEDDING_DIM = 100  # Since we're using 100-dimensional GloVe vectors

In [5]:
# Declare GRU model.
gru = GRU(weight_path=DataPath.GRU_WEIGHT,
          submission_path=DataPath.GRU_SUBMISSION,
          max_length=MAX_LEN)
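The `max_length` argument, together with the `padding` step that shows up in the later logs, suggests that every tokenized tweet is brought to a fixed length of `MAX_LEN` before it reaches the network. How `gru.py` does this is not shown in the notebook, so the following is only a minimal sketch. The helper name `pad_to_length` is hypothetical, and the "pre" padding and truncation behaviour is an assumption that mirrors the defaults of Keras `pad_sequences`.

```python
from typing import List, Sequence


def pad_to_length(seq: Sequence[int], max_length: int, pad_value: int = 0) -> List[int]:
    """Pad or truncate a token-id sequence to exactly `max_length` entries.

    Assumption: padding and truncation both happen at the start of the
    sequence ("pre"), as with the Keras `pad_sequences` defaults.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(seq) >= max_length:
        # Keep only the last `max_length` tokens (pre-truncation).
        return list(seq[-max_length:])
    # Prepend pad values until the sequence is long enough (pre-padding).
    return [pad_value] * (max_length - len(seq)) + list(seq)


short = pad_to_length([5, 8, 2], max_length=6)
long = pad_to_length(list(range(1, 11)), max_length=6)
print(short)  # [0, 0, 0, 5, 8, 2]
print(long)   # [5, 6, 7, 8, 9, 10]
```

With `MAX_LEN = 256`, a short tweet would consist mostly of padding under this scheme, which is why index 0 is usually reserved for the pad token.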

# 3. Data preprocessing

Now, we will preprocess the data to ensure it is clean before beginning the training process.

In [6]:
# Retrieve preprocessing steps declared in GRU class for both train and test data.
for step in tqdm(gru.preprocessing(), desc="Preprocessing train data"):
    getattr(train_prep, step)()

for step in tqdm(gru.preprocessing(is_train=False), desc="Preprocessing test data"):
    getattr(test_prep, step)()

Preprocessing train data:   0%|          | 0/14 [00:00<?, ?it/s]

Executing: `drop_duplicates`
Executing: `remove_ending`
Executing: `remove_extra_space`


100%|████████████████████████████████████████████████████████████████████| 2268591/2268591 [00:01<00:00, 1555423.27it/s]


Executing: `remove_space_around_emoji`
Executing: `remove_extra_space`


100%|████████████████████████████████████████████████████████████████████| 2268591/2268591 [00:01<00:00, 1620862.03it/s]


Executing: `reconstruct_emoji`


100%|█████████████████████████████████████████████████████████████████████| 2268591/2268591 [00:15<00:00, 142863.55it/s]


Executing: `remove_extra_space`


100%|████████████████████████████████████████████████████████████████████| 2268591/2268591 [00:01<00:00, 1153637.89it/s]


Executing: `emoji_to_tag`


100%|██████████████████████████████████████████████████████████████████████| 2268591/2268591 [00:27<00:00, 81485.42it/s]


Executing: `reconstruct_last_emoji`
Executing: `num_to_tag`
Executing: `hashtag_to_tag`
Executing: `repeat_symbols_to_tag`
Executing: `elongate_to_tag`
Executing: `remove_extra_space`


100%|████████████████████████████████████████████████████████████████████| 2268591/2268591 [00:01<00:00, 1328159.67it/s]


Preprocessing test data:   0%|          | 0/13 [00:00<?, ?it/s]

Executing: `remove_ending`
Executing: `remove_extra_space`


100%|█████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 985503.76it/s]


Executing: `remove_space_around_emoji`
Executing: `remove_extra_space`


100%|████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 1023450.30it/s]


Executing: `reconstruct_emoji`


100%|█████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 125352.78it/s]


Executing: `remove_extra_space`


100%|█████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 978856.92it/s]


Executing: `emoji_to_tag`


100%|██████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 71049.44it/s]


Executing: `reconstruct_last_emoji`
Executing: `num_to_tag`
Executing: `hashtag_to_tag`
Executing: `repeat_symbols_to_tag`
Executing: `elongate_to_tag`
Executing: `remove_extra_space`


100%|████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 1006238.51it/s]
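The two loops above receive a list of step names from `gru.preprocessing()` and call the method of the same name on the preprocessing object through `getattr`. The sketch below illustrates that dispatch pattern on a toy class. `TinyPreprocessing` and `run_steps` are hypothetical stand-ins, not the project's `Preprocessing` implementation, and the explicit check for unknown step names is an addition the notebook does not have.

```python
from typing import List


class TinyPreprocessing:
    """Hypothetical stand-in for the project's `Preprocessing` class."""

    def __init__(self, texts: List[str]) -> None:
        self.texts = texts

    def remove_extra_space(self) -> None:
        # Collapse runs of whitespace into a single space and trim the ends.
        self.texts = [" ".join(t.split()) for t in self.texts]

    def drop_duplicates(self) -> None:
        # Keep the first occurrence of each text, preserving order.
        self.texts = list(dict.fromkeys(self.texts))


def run_steps(prep: TinyPreprocessing, steps: List[str]) -> None:
    """Call each named step on `prep`, failing loudly on unknown names."""
    for step in steps:
        method = getattr(prep, step, None)
        if not callable(method):
            raise AttributeError(f"Unknown preprocessing step: {step!r}")
        method()


prep = TinyPreprocessing(["good   day", "good day", " bad  day "])
run_steps(prep, ["remove_extra_space", "drop_duplicates"])
print(prep.texts)  # ['good day', 'bad day']
```

Keeping the step list inside the model class lets each model declare its own pipeline while sharing one preprocessing implementation. The order matters: in this toy example, deduplicating before normalizing whitespace would leave both spellings of "good day" in place.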


In [22]:
# Retrieve the preprocessed df.
train_data = train_prep.__get__()
test_data = test_prep.__get__()

In [25]:
# Shuffle the training frame, then export both dataframes.
train_data = train_data.sample(frac=1)
train_data.to_csv(DataPath.GRU_TRAIN, index=False)

test_data.to_csv(DataPath.GRU_TEST, index=False)

In [29]:
# Read the dataframe
train_df = pd.read_csv(DataPath.GRU_TRAIN)
train_df.dropna(inplace=True)

In [30]:
# Separate the texts (features) from the labels
X, y = train_df['text'].values, train_df['label'].values

# 4. Training GRU

We can now begin training the GRU model.

In [None]:
# Update vocabulary for GRU embedding
gru.update_vocabulary(X)

# Start the training process
gru.train(X, y, batch_size=BATCH_SIZE, epochs=EPOCHS)

Executing: `update_vocabulary`
Vocabulary size: 439824
Executing: `padding`
Executing: `padding`
Executing: `generate_embedding_matrix`


Loading GloVe: 0it [00:00, ?it/s]

Found 1193514 word vectors


Generating embedding matrix:   0%|          | 0/439822 [00:00<?, ?it/s]

Converted 172879 words (266943 missing)
Executing: `build_model`
Model summary
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 256, 100)          43982400  
                                                                 
 bidirectional (Bidirection  (None, 200)               121200    
 al)                                                             
                                                                 
 dense (Dense)               (None, 100)               20100     
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 44123801 (168.32 MB)
Trainable params: 141401 (552.35 KB)
Non-trainable params: 43982400 (167.78 MB)
____________________________________________________________



Saving weights
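The `generate_embedding_matrix` step in the log above reports 172879 words converted and 266943 missing, so roughly 39% of the vocabulary has a pre-trained GloVe vector. The sketch below shows one common way such a matrix and its hit/miss counts can be produced. It is not the code from `gru.py`: the function name, the toy vectors, and the conventions that row 0 is reserved for padding and that missing words keep an all-zero row are assumptions. Plain lists are used to keep the example dependency-free, whereas a real pipeline would typically use a NumPy array.

```python
from typing import Dict, List, Tuple


def build_embedding_matrix(
    word_index: Dict[str, int],
    vectors: Dict[str, List[float]],
    dim: int,
) -> Tuple[List[List[float]], int, int]:
    """Return (matrix, hits, misses); row i holds the vector of the word with index i."""
    # Assumption: indices start at 1 and row 0 is reserved for the pad token.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    hits = misses = 0
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = list(vec)
            hits += 1
        else:
            # Out-of-vocabulary words keep their all-zero row.
            misses += 1
    return matrix, hits, misses


# Toy data standing in for the GloVe file and the tokenizer vocabulary.
toy_vectors = {"good": [0.1, 0.2], "bad": [-0.3, 0.4]}
toy_index = {"good": 1, "bad": 2, "<user>": 3}

matrix, hits, misses = build_embedding_matrix(toy_index, toy_vectors, dim=2)
print(f"Converted {hits} words ({misses} missing)")  # Converted 2 words (1 missing)
```

Because the embedding layer is listed under non-trainable parameters in the summary, words without a pre-trained vector would stay at their initial value throughout training under this scheme.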


The GRU model reaches an accuracy of `0.86` after `10` epochs, which appears to be enough for the loss to converge.
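The parameter counts in the model summary above can be checked by hand. The sketch below assumes 100 GRU units per direction (inferred from the `(None, 200)` output of the bidirectional layer with concatenated outputs), the Keras default `reset_after=True` (two bias vectors per gate), and 4-byte `float32` weights. None of these settings are stated explicitly in the notebook.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    # One `dim`-sized vector per vocabulary entry.
    return vocab_size * dim


def gru_params(input_dim: int, units: int) -> int:
    # Three gates, each with an input kernel, a recurrent kernel,
    # and two bias vectors (assumes reset_after=True).
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(inputs: int, outputs: int) -> int:
    # Weight matrix plus one bias per output unit.
    return inputs * outputs + outputs


embedding = embedding_params(439824, 100)   # 43982400
bigru = 2 * gru_params(100, 100)            # 121200 (forward + backward)
hidden = dense_params(200, 100)             # 20100
output = dense_params(100, 1)               # 101

trainable = bigru + hidden + output
total = embedding + trainable
print(trainable)                            # 141401
print(total)                                # 44123801
print(round(total * 4 / 1024 ** 2, 2))      # 168.32 (MB)
```

Under these assumptions the numbers match the summary exactly. The frozen embedding accounts for almost all of the parameters, while only about 141 thousand weights are updated during training.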

# 5. Submission

In [None]:
# Read preprocessed test data
test_df = pd.read_csv(DataPath.GRU_TEST)

# Retrieve `text` column for predicting
X_test = test_df["text"]

# Make the prediction
gru.predict(X_test)

Executing: `padding`
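`gru.predict` writes the submission file itself, so the conversion from model outputs to labels is not visible in this notebook. The sketch below shows what such a step could look like. It assumes that the single-unit output layer yields a probability for the positive class, that labels are encoded as `-1` and `1`, and that the file has `Id` and `Prediction` columns. All three are assumptions, and the helper names `to_labels` and `write_submission` are hypothetical.

```python
import csv
import io
from typing import Iterable, List, TextIO


def to_labels(probabilities: Iterable[float], threshold: float = 0.5) -> List[int]:
    """Map positive-class probabilities to -1 / 1 labels (assumed encoding)."""
    return [1 if p >= threshold else -1 for p in probabilities]


def write_submission(labels: Iterable[int], handle: TextIO) -> None:
    """Write an `Id,Prediction` CSV (assumed layout) with 1-based ids."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Id", "Prediction"])
    for row_id, label in enumerate(labels, start=1):
        writer.writerow([row_id, label])


# An in-memory buffer stands in for the real submission file.
buffer = io.StringIO()
write_submission(to_labels([0.91, 0.12, 0.5]), buffer)
print(buffer.getvalue())
```

Writing to a file-like object keeps the helper easy to test. Passing an open file handle instead of the `StringIO` buffer would produce the file on disk.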


This submission to AIcrowd achieved the following accuracy scores:

- First score = `0.865`
- Secondary score = `0.866`

You can access the results here:

- CSV output file: [test_predictions_GRU.csv](../submissions/gru/test_predictions_GRU.csv)
- AIcrowd submission ID: **[#247060](https://www.aicrowd.com/challenges/epfl-ml-text-classification/submissions/247060)**

Up to this point, this is the model with the highest accuracy. We will now explore transformer models, specifically [Bidirectional Encoder Representations from Transformers (BERT)](model_BERT.ipynb).