# Problem set 2 (30 + 35 = 65 pts)



## Problem 1: Sentiment Analysis for E-commerce Reviews (Total: 30 pts)

<div style="align: center;">
    <br>
    <img src="https://storage.googleapis.com/kaggle-datasets-images/11827/16290/140ca3b71ec51512dcac444c57583f25/dataset-cover.jpg" style="display:block; margin:auto; width:95%; height:250px;">
</div><br><br>

<div style="letter-spacing:normal; opacity:1.;">
<!--   https://xkcd.com/color/rgb/   -->
  <p style="text-align:center; background-color: lightsalmon; color: Jaguar; border-radius:10px; font-family:monospace; border-radius:20px;
            line-height:1.4; font-size:32px; font-weight:bold; text-transform: uppercase; padding: 9px;">
            <strong>Women's E-Commerce Clothing Reviews</strong></p>  
  


**Context** \\
​
This is a Women’s Clothing E-Commerce dataset revolving around the reviews written by customers. Its nine supportive features offer a great environment to parse out the text through its multiple dimensions. Because this is real commercial data, it has been anonymized, and references to the company in the review text and body have been replaced with “retailer”.
​
**Content**
​
This dataset includes **`23486 rows`** and **`10 feature`** variables. Each row corresponds to a customer review, and includes the variables:
​
- **Clothing ID:** Integer Categorical variable that refers to the **`specific piece`** being reviewed.
- **Age:** Positive Integer variable of the **`reviewers age`**.
- **Title:** String variable for the **`title of the review`**.
- **Review Text:** String variable for the **`review body`**.
- **Rating:** Positive Ordinal Integer variable for the **`product score`** granted by the customer from **`1 Worst, to 5 Best`**.
- **Recommended IND:** Binary variable stating where the customer recommends the product where **`1 is recommended, 0 is not recommended`**.
- **Positive Feedback Count:** Positive Integer documenting the **`number`** of other customers **`who found this review positive`**.
- **Division Name:** Categorical name of the **`product high level division`**.
- **Department Name:** Categorical name of the product **`department name`**.
- **Class Name:** Categorical name of the **`product class name`**.

**The dataset can be downloaded from lms canvas folder "dataset/Womens Clothing E-Commerce Reviews.zip"**

This task is to predict whether customers recommend the product they purchased using the information in their review text.
The model is to focus on finding reviews of products that are not recommended.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!unzip '/content/drive/MyDrive/Deep Learning/HW/Datasets/Womens Clothing E-Commerce Reviews.zip'
dataset_path = '/content/Womens Clothing E-Commerce Reviews.csv'
!pip install contractions

Archive:  /content/drive/MyDrive/Deep Learning/HW/Datasets/Womens Clothing E-Commerce Reviews.zip
replace Womens Clothing E-Commerce Reviews.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: Womens Clothing E-Commerce Reviews.csv  


In [2]:
import pandas as pd
import numpy as np
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, Dataset
import re #regular expressions
import nltk #NLP: tokenization, stemming, lemmatization
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# import contractions
from nltk import pos_tag
from sklearn.preprocessing import QuantileTransformer, LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from torchtext.vocab import Vocab, build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from sklearn.model_selection import train_test_split
from collections import Counter
from itertools import chain
from transformers import BertTokenizer, BertForSequenceClassification, BertModel, BertTokenizer, BertConfig
from torch.utils.data.sampler import SequentialSampler
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score, recall_score, average_precision_score


nltk.download('stopwords') #list of stopwords that has to be deleted
nltk.download('wordnet') #lemmatization of words
nltk.download('punkt') #module for splitting text on sentences and words
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seed_val = 42
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

### 1. Exploratory Data Analysis and Data Cleaning (2 pts)
Read Dataframe, and explore it's shape and distribution of missing values

In [None]:
df = pd.read_csv(dataset_path)
print(df.shape)
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values>0]
print(missing_values)
df.head(4)

(23486, 11)
Title              3810
Review Text         845
Division Name        14
Department Name      14
Class Name           14
dtype: int64


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants


### 2. Feature Selection (2 pts)
Identify relevant features for sentiment analysis and adjust column names for consistency and clarity (you need to define a column ["not_recommended"])

In [None]:
df = df.rename(columns={
    "Clothing ID":"product_id",
    "Age":"customer_age",
    "Title":"review_title",
    "Review Text":"review_text",
    "Rating":"review_rating",
    "Recommended IND":"recommended_product",
    "Positive Feedback Count":"positive_feedback_count",
    "Division Name":"product_division",
    "Department Name":"product_department",
    "Class Name":"product_class"})
df["not_recommended"] = np.where(df['recommended_product'] == 0, 1, 0)
assert (df["not_recommended"] + df['recommended_product']).unique() == [1]

#processing NaN values: categorical and text features can be replaced by empty values
for feature_name in ['review_title', 'review_text', 'product_division', 'product_department', 'product_class']:
  df.loc[df[feature_name].isnull(), feature_name] = '' #or df[feature_name] = df[feature_name].fillna('')
assert df.isnull().sum().unique() == [0]

### 4. Text Preprocessing and Text Mining - NLTK (Natural Language Toolkit) (6 pts)
- **Text Cleaning (1 pt)**
    - Remove unnecessary characters, convert case, trim unnecessary spaces,...
- **NLTK Tokenization (1 pt)**
    - Split text into words or tokens.
- **NLTK Stop Word Removal (1 pt)**
    - Eliminate common words that do not add significant meaning.
- **NLTK Normalization (1 pt)**
    - Apply stemming and lemmatization.
- **Handling Abbreviations and Acronyms (1 pt)**
    - Expand abbreviations to their full forms.
- **Removing Rare or Infrequent Words (1 pt)**
    - Filter out rare words based on a set frequency threshold.

In [None]:
#dataset analysis
numerical_features_list = ['customer_age', 'review_rating', 'positive_feedback_count']
categorical_features_list = ['product_division', 'product_department', 'product_class']
text_features_list = ['review_text', 'review_title']
label_column = ['not_recommended']
df_processed = df[numerical_features_list+categorical_features_list+text_features_list+label_column].copy(deep=True)
print(df_processed[df_processed['not_recommended']==0].count().sum())
print(df_processed[df_processed['not_recommended']==1].count().sum())
#Calculate weights for imbalanced classification problem
class_weights = compute_class_weight('balanced', classes=df_processed['not_recommended'].unique(), y=df_processed['not_recommended'])
class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32).to(device)

173826
37548


In [None]:
#text concatenation
df_processed['combined_text'] = df_processed['review_text'] + ' ' + df_processed['review_title']
df_processed = df_processed.drop(text_features_list, axis=1)

#categorical feature processing
le = LabelEncoder()
df_processed[categorical_features_list] = df_processed[categorical_features_list].apply(lambda x: le.fit_transform(x.astype(str)))

#numerical feature processing
numerical_transformer = QuantileTransformer(output_distribution='uniform', n_quantiles=10)
df_processed[numerical_features_list] = numerical_transformer.fit_transform(df_processed[numerical_features_list])

In [None]:
df_processed.head()

Unnamed: 0,customer_age,review_rating,positive_feedback_count,product_division,product_department,product_class,not_recommended,combined_text
0,0.222222,0.333333,0.0,3,3,6,0,Absolutely wonderful - silky and sexy and comf...
1,0.25,1.0,0.814815,1,2,4,0,Love this dress! it's sooo pretty. i happene...
2,0.888889,0.166667,0.0,1,2,4,1,I had such high hopes for this dress and reall...
3,0.722222,1.0,0.0,2,1,14,0,"I love, love, love this jumpsuit. it's fun, fl..."
4,0.666667,1.0,0.888889,1,5,1,0,This shirt is very flattering to all due to th...


In [None]:
#Perform analysis of existing abbreviations

from collections import Counter

# reviews_text =
pattern = r"\b[A-Z]{2,}\b|\b[A-Za-z]+\.[A-Za-z]+\b|[a-z]'[a-z]"
abbreviations = []
df_processed['combined_text'] = df_processed['combined_text'].dropna().apply(contractions.fix) #fix most popular contractions

def find_context(string, substring, count=3, size=20):
    start = 0
    counter = 0
    contexts = []
    while counter!=count:
        index = string.find(substring, start)
        if index == -1:
            break
        contexts.append(string[index-size//2:index+size//2])
        start = index + 1
        counter += 1

    return contexts

for sample in df_processed['combined_text'].dropna():
    found = re.findall(pattern, sample)
    abbreviations.extend(found)

abbreviation_counts = Counter(abbreviations)
most_common_abbreviations = abbreviation_counts.most_common(30)
for abbrev in most_common_abbreviations:
  print(abbrev, 'contexts: ', find_context(' '.join(df_processed['combined_text'].dropna().apply(str)), abbrev[0]))

abbreviations_dict = {
    "tts": "true to size",
    "lot's": "a lot",
    "i.e": "that is",
    "dind't": "did not",
    "did't": "did not",
    "preggo's": "pregnant",
    "t.la": "the laundry alternative",
    "p.js":"pajamas",
    "doe'n'tt": "does not",
    "p.s.": "post scriptum"
}

("r's", 170) contexts:  ["or new year's eve. i", "umid summer's day   ", "e alst year's versio"]
("e's", 83) contexts:  ["y valentine's dinner", "f round one's neck &", "d, this one's going "]
("n's", 57) contexts:  [" be a woman's white ", "this season's dresse", "ns that won's slide "]
("l's", 52) contexts:  ["s the model's photos", "e the model's black ", "s the model's photos"]
("d's", 43) contexts:  ["or a friend's spring", " who on god's green ", " my husband's size x"]
("t's", 38) contexts:  ["e dress fit's tts (f", " color. lot's of com", "n awesome t'shirt an"]
("y's", 27) contexts:  [" than today's peasan", "s or skinny's undern", "ny of today's low-ri"]
('i.e', 24) contexts:  [' chested (i.e., you ', 'aterials (i.e. not f', 'designer (i.e.: a li']
("d't", 23) contexts:  ["end, i dind't keep t", "eat. i dind't find t", "r and i did't realiz"]
("o's", 17) contexts:  ["s on anthto's facebo", "love pilcro's jeans.", " for preggo's! this "]
("i's", 15) contexts:  [" strap cami's 

In [None]:
class text_processing:
  def __init__(self, textseries):
    self.textseries = textseries
    self.processed_text = None
    self.processed_tokens = None
    self.procedures = {1: self.clear_text, 2: self.expand_abbreviations, 3:self.pos_tagging, 4:self.tokenize_text}

  def _procedure(num):
      def wrapper(func):
          func.__procedure__ = num
          return func
      return wrapper

  @_procedure(1)
  def clear_text(self): #clear text and delete stop words
    text = self.processed_text
    text = re.sub(r'[^\w\s]', ' ', text)  # replace all non-alphabet-digital-space symbols on space
    text = text.lower() #place symbols in low register
    text = text.strip() #delete useless spaces
    stop_words = set(stopwords.words('english')) #form set of stopwords
    tokens = [word for word in text.split() if word not in stop_words] #clear from stop words
    self.processed_text = ' '.join(tokens)

  @_procedure(2) #expand abbreviations
  def expand_abbreviations(self):
    text = self.processed_text
    tokens = text.split(sep=' ')
    tokens = [abbreviations_dict.get(word, word) for word in tokens]
    text = ' '.join(tokens)
    self.processed_text = text
    self.processed_tokens = tokens

  @_procedure(3) #apply PoS tagging
  def pos_tagging(self):
    info_tokens = pos_tag(self.processed_tokens)
    self.processed_tokens = [word for word, pos in info_tokens if pos in ['NN', 'JJ', 'RB']]

  @_procedure(4) #tokemize and normalize
  def tokenize_text(self):
    tokens = self.processed_tokens
    # tokens = nltk.word_tokenize(text) #obtain whole list of words
    lemmatizer = WordNetLemmatizer()
    stemmer = SnowballStemmer("english")
    tokens = [stemmer.stem(token) for token in tokens]
    tokens = [lemmatizer.lemmatize(token) for token in tokens] #get list of normalized tokens
    self.processed_tokens = tokens

  def remove_rare_words(self, packed_tokens, all_tokens, freq=5): #delete rare tokens

    token_counts = Counter(all_tokens)
    return [[word for word in tokenlist if token_counts[word] >= freq] for tokenlist in packed_tokens]

  def process_sample(self, sample, code='1234'):
      self.processed_text = str(sample)  # Преобразовать входные данные в строку
      for proc in code:
          self.procedures[int(proc)]()
      return self.processed_tokens

  def process_series(self, freq=5, code='1234'):
      packed_tokens = []
      all_tokens = []
      for sample in tqdm(self.textseries):
          sample_tokens = self.process_sample(sample)
          packed_tokens.append(sample_tokens)
          all_tokens.extend(sample_tokens)
      return self.remove_rare_words(packed_tokens, all_tokens, freq=freq)


df_processed['tokenized_text'] = text_processing(df_processed['combined_text']).process_series()
df_processed['tokenized_text'] = df_processed['tokenized_text'].apply(lambda x: x if (x and len(x[0]) > 0) else None)
df_processed = df_processed.dropna(subset=['tokenized_text'])

def convert_features_to_text(df, processing_field, features):
    text = df[processing_field].apply(lambda x: ' '.join(x))
    for col in features:
        text += ' ' + col + ':' + df[col].astype(str) + ' '
    return text
df_processed['concatenated_text_info'] = convert_features_to_text(df = df_processed, processing_field = 'tokenized_text', features = numerical_features_list+categorical_features_list)

100%|██████████| 23486/23486 [01:28<00:00, 264.88it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_processed['concatenated_text_info'] = convert_features_to_text(df = df_processed, processing_field = 'tokenized_text', features = numerical_features_list+categorical_features_list)


In [None]:
df_processed.head(3)

Unnamed: 0,customer_age,review_rating,positive_feedback_count,product_division,product_department,product_class,not_recommended,combined_text,tokenized_text,concatenated_text_info
0,0.222222,0.333333,0.0,3,3,6,0,Absolutely wonderful - silky and sexy and comf...,"[absolut, wonder, silki, sexi, comfort]",absolut wonder silki sexi comfort customer_age...
1,0.25,1.0,0.814815,1,2,4,0,Love this dress! it is sooo pretty. i happen...,"[dress, sooo, pretti, find, store, glad, never...",dress sooo pretti find store glad never onlin ...
2,0.888889,0.166667,0.0,1,2,4,1,I had such high hopes for this dress and reall...,"[high, realli, want, work, initi, petit, small...",high realli want work initi petit small usual ...


### 5. Deep Learning and BERT Models for Sentiment Classification (15 pts)
- **PyTorch LSTM-GRU Model (7.5 pts)**
    - Tokenization, sequencing, and padding (1.5 pt)
    - Train-test split (1 pt)
    - Define and train LSTM-GRU Model (3.5 pts)
    - Model evaluation (1.5 pt)

- **Transformers BERT Model using PyTorch (7.5 pts)**
    - Train-test split (1 pt)
    - Define tokenizer and apply tokenization and sequencing (1.5 pt)
    - Transform text to tensor and padding (1.5 pt)
    - Define and train BERT Model using PyTorch (2.5 pts)
    - Model evaluation (1 pt)

In [None]:
#In this task we have categorical, numerical and text features. There is 2 different approaches how to process it.
#1 approach - tokenize text data and after combine it with numerical and text data - it was choosen
#2 approach - combine different types of features all together and after that process this whole text

# Process text data to tensor
texts = df_processed['tokenized_text'].values

# Function to generate tokens from Series of list of tokens
def generate_tokens(textdata):
    for tokenlist in textdata: #iterate series
      for token in tokenlist: #iterate list
        yield token #get token

# Create the vocabulary
vocab = build_vocab_from_iterator(generate_tokens(texts), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

# Convert text data to tensors and add padding
text_tensor = [torch.tensor([vocab[token] for token in text], dtype=torch.long) for text in texts]  # Convert texts to tensors
padded_sequences = pad_sequence(text_tensor, batch_first=True, padding_value=vocab['<pad>'])  # Add padding to the sequences

#Tranform numerical, categorical and labels to tensors
numbers = torch.tensor(df_processed[numerical_features_list].values, dtype=torch.float32)
categories = torch.tensor(df_processed[categorical_features_list].values, dtype=torch.long)
labels = torch.tensor(df_processed['not_recommended'].values, dtype=torch.long)

# Split all data
X_text_train, X_text_val, y_text_train, y_text_val = train_test_split(padded_sequences, labels, test_size=0.2, random_state=42)
X_cat_train, X_cat_val, y_cat_train, y_cat_val = train_test_split(categories, labels, test_size=0.2, random_state=42)
X_num_train, X_num_val, y_num_train, y_num_val = train_test_split(numbers, labels, test_size=0.2, random_state=42)
#Make sure that all indices are equal
assert ((y_text_train == y_cat_train) & (y_cat_train == y_num_train)).all()

#Get test datasets
mask = labels == 1
y_test, X_text_test, X_num_test, X_cat_test = [tensor[mask] for tensor in [labels, padded_sequences, numbers, categories]]

#Prepare loaders
train_loader = DataLoader(TensorDataset(X_text_train, X_cat_train, X_num_train, y_text_train), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X_text_val, X_cat_val, X_num_val, y_text_val), batch_size=64, shuffle=False)
test_loader = DataLoader(TensorDataset(X_text_test, X_cat_test, X_num_test, y_test), batch_size=64, shuffle=False)

In [None]:
def get_accuracy(outputs, labels):
    _, predicted = torch.max(outputs, dim=1)
    true_labels = (predicted == labels).sum().item()
    return true_labels/len(labels)

def train_epoch(model, train_loader, optimizer, loss_fn, device, scheduler, use_attention_mask=False, attention_mask=None):
    model.train()
    avg_loss = 0.0
    avg_acc = 0.0
    for inputs in train_loader:
        if use_attention_mask:
            text_inputs, numerical_inputs, categorical_inputs, labels, attention_mask = inputs
        else:
            text_inputs, numerical_inputs, categorical_inputs, labels = inputs
        text_inputs, numerical_inputs = text_inputs.to(device), numerical_inputs.to(device).float()
        categorical_inputs, labels = categorical_inputs.to(device).float(), labels.to(device)
        optimizer.zero_grad()
        if use_attention_mask: outputs = model(text_inputs, numerical_inputs, categorical_inputs, attention_mask)
        else: outputs = model(text_inputs, numerical_inputs, categorical_inputs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        avg_loss += loss.item()
        avg_acc += get_accuracy(outputs, labels)
    avg_loss /= len(train_loader)
    avg_acc /= len(train_loader)

    if scheduler is not None:
        scheduler.step(avg_loss)

    return avg_loss, avg_acc

def val_epoch(model, val_loader, loss_fn, device, get_metrics=False, use_attention_mask=False, attention_mask=None):
    model.eval()
    avg_loss = 0.0
    avg_acc = 0.0
    all_labels = []
    all_preds = []

    with torch.no_grad():
        for inputs in val_loader:
            if use_attention_mask:
                text_inputs, numerical_inputs, categorical_inputs, labels, attention_mask = inputs
            else:
                text_inputs, numerical_inputs, categorical_inputs, labels = inputs
            text_inputs, numerical_inputs = text_inputs.to(device), numerical_inputs.to(device).float()
            categorical_inputs, labels = categorical_inputs.to(device).float(), labels.to(device)
            if use_attention_mask: outputs = model(text_inputs, numerical_inputs, categorical_inputs, attention_mask)
            else: outputs = model(text_inputs, numerical_inputs, categorical_inputs)
            loss = loss_fn(outputs, labels)
            avg_loss += loss.item()
            avg_acc += get_accuracy(outputs, labels)

            _, preds = torch.max(outputs, dim=1)
            all_labels += labels.cpu().tolist()
            all_preds += preds.cpu().tolist()


    avg_loss /= len(val_loader)
    avg_acc /= len(val_loader)
    if get_metrics:
        f1 = f1_score(all_labels, all_preds, average='macro')
        recall = recall_score(all_labels, all_preds, average='macro', zero_division=1)
        avg_precision = average_precision_score(all_labels, all_preds, average='macro')
        return avg_loss, avg_acc, f1, recall, avg_precision

    return avg_loss, avg_acc

In [None]:
class LSTMGRUModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, num_features, cat_features, output_dim, dropout):
        super().__init__()
        self.num_features = num_features
        self.cat_features = cat_features
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.embedding_norm = nn.LayerNorm(embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers, dropout=dropout, batch_first=True)
        self.gru = nn.GRU(hidden_dim, hidden_dim, num_layers, dropout=dropout, batch_first=True)
        self.fc_text = nn.Linear(hidden_dim, hidden_dim)
        self.fc_text_norm = nn.LayerNorm(hidden_dim)
        self.fc_num = nn.Linear(num_features, hidden_dim)
        self.fc_num_norm = nn.LayerNorm(hidden_dim)
        self.fc_cat = nn.Linear(cat_features, hidden_dim)
        self.fc_cat_norm = nn.LayerNorm(hidden_dim)
        self.fc_out = nn.Linear(hidden_dim * 3, output_dim)
        self.sigmoid = nn.Sigmoid()
        self.dropout = nn.Dropout(dropout)

    def forward(self, text_inputs, numerical_inputs, categorical_inputs):
        embedded = self.embedding(text_inputs)
        embedded = self.embedding_norm(embedded)
        lstm_out, _ = self.lstm(embedded)
        gru_out, _ = self.gru(lstm_out)
        output_text = self.dropout(self.fc_text_norm(self.fc_text(gru_out)))
        output_num = self.dropout(self.fc_num_norm(self.fc_num(numerical_inputs)))
        output_cat = self.dropout(self.fc_cat_norm(self.fc_cat(categorical_inputs)))
        output_text = output_text[:, -1, :]  # From (batch_size, sequence_length, hidden_size) to (batch_size, hidden_size). Last hidden state
        output = torch.cat((output_text, output_num, output_cat), dim=1)
        output = self.fc_out(output)
        output = self.sigmoid(output)
        return output


In [None]:
# Set hyperparameters
vocab_size = len(vocab)
embedding_dim = 128
hidden_dim = 128
num_layers = 5
num_features = len(numerical_features_list)
cat_features = len(categorical_features_list)
output_dim = 2
dropout = 0.2

# Initialize the model, loss function, and optimizer
model = LSTMGRUModel(vocab_size, embedding_dim, hidden_dim, num_layers, num_features, cat_features, output_dim, dropout).to(device)
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode = 'min')

In [None]:
# Train the model
num_epochs = 10
early_stopping = 5
best_val_acc = 0.0

for epoch in range(num_epochs):
  print(f"Epoch {epoch}")
  train_loss, train_acc = train_epoch(model, train_loader, optimizer, loss_fn, device=device, scheduler=scheduler)
  print(f"Train loss {train_loss:.3f}. Train acc {train_acc:.3f}")
  val_loss, val_acc = val_epoch(model, val_loader, loss_fn, device=device)
  if val_acc > best_val_acc:
    best_val_acc = val_acc
    stop_iter = 0
    torch.save(model.state_dict(), 'best_model_LSTM_sep.pt')
  else:
    stop_iter += 1
  print(f"Val loss {val_loss:.3f}. Val acc {val_acc:.3f}. Best val acc = {best_val_acc:.3f}")

val_loss, val_acc, val_f1, val_recall, val_avg_precision = val_epoch(model, val_loader, loss_fn, device, get_metrics=True)
test_loss, test_acc, test_f1, test_recall, test_avg_precision = val_epoch(model, test_loader, loss_fn, device, get_metrics=True)
LSTM_metrics = [test_loss, test_acc, test_f1, test_recall, test_avg_precision]
print(f"Val loss: {val_loss:.3f}, Val acc: {val_acc:.3f}, Val F1: {val_f1:.3f}, Val Recall: {val_recall:.3f}, Val Avg Precision: {val_avg_precision:.3f}")
print(f"Test loss: {test_loss:.3f}, Test acc: {test_acc:.3f}, Test F1: {test_f1:.3f}, Test Recall: {test_recall:.3f}, Test Avg Precision: {test_avg_precision:.3f}")

Epoch 0
Train loss 0.457. Train acc 0.858
Val loss 0.390. Val acc 0.932. Best val acc = 0.932
Epoch 1
Train loss 0.376. Train acc 0.938
Val loss 0.376. Val acc 0.932. Best val acc = 0.932
Epoch 2
Train loss 0.371. Train acc 0.938
Val loss 0.375. Val acc 0.932. Best val acc = 0.932
Epoch 3
Train loss 0.370. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Epoch 4
Train loss 0.369. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Epoch 5
Train loss 0.369. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Epoch 6
Train loss 0.370. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Epoch 7
Train loss 0.369. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Epoch 8
Train loss 0.369. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Epoch 9
Train loss 0.369. Train acc 0.938
Val loss 0.374. Val acc 0.932. Best val acc = 0.932
Val loss: 0.374, Val acc: 0.932, Val F1: 0.895, Val Recall: 

In [None]:
LSTM_metrics_dict = {
    'test_loss': test_loss,
    'test_acc': test_acc,
    'test_f1': test_f1,
    'test_recall': test_recall,
    'test_avg_precision': test_avg_precision
}
results_dict = {
    'LSTM_metrics_dict': LSTM_metrics_dict
}
with open('results.json', 'w') as f:
    json.dump(results_dict, f, indent=4)

# Reading JSON
with open('results.json', 'r') as f:
    loaded_results = json.load(f)

print(loaded_results)

{'first_LSTM_metrics_dict': {'test_loss': 0.3599729054805004, 'test_acc': 0.9532828282828283, 'test_f1': 0.4881609618451724, 'test_recall': 0.9768696069031639, 'test_avg_precision': 1.0}}


In [None]:
my_token = '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'

max_len = 256

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

config = BertConfig.from_pretrained('bert-base-uncased')

hidden_size = config.hidden_size

def encode(text, tokenizer, max_len):
    encoded_dict = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt',
    )
    input_ids = encoded_dict['input_ids']
    attention_mask = encoded_dict['attention_mask']


    return input_ids, attention_mask

text_inputs, attention_mask = zip(*[encode(text, tokenizer, 46) for text in texts])
text_inputs = torch.cat(text_inputs, dim=0).to(device)
attention_mask = torch.cat(attention_mask, dim=0).to(device)

attention_mask_train, attention_mask_val, _, _ = train_test_split(attention_mask, labels, test_size=0.2, random_state=42)
attention_mask_test = attention_mask[mask]

train_dataset_BERT = TensorDataset(X_text_train, X_cat_train, X_num_train, y_text_train, attention_mask_train)
val_dataset_BERT = TensorDataset(X_text_val, X_cat_val, X_num_val, y_text_val, attention_mask_val)
test_dataset_BERT = TensorDataset(X_text_test, X_cat_test, X_num_test, y_test, attention_mask_test)

train_loader_BERT = DataLoader(train_dataset_BERT, batch_size=64, shuffle=True)
val_loader_BERT = DataLoader(val_dataset_BERT, batch_size=64, shuffle=False)
test_loader_BERT = DataLoader(test_dataset_BERT, batch_size=64, shuffle=False)

In [None]:
class BERTModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_features, cat_features, output_dim, dropout):
        super().__init__()
        self.num_features = num_features
        self.cat_features = cat_features
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        for param in self.bert.parameters(): param.requires_grad = True
        self.fc_text = nn.Linear(hidden_size, embedding_dim)
        self.fc_text_norm = nn.LayerNorm(embedding_dim)
        self.fc_num = nn.Linear(num_features, embedding_dim)
        self.fc_num_norm = nn.LayerNorm(embedding_dim)
        self.fc_cat = nn.Linear(cat_features, embedding_dim)
        self.fc_cat_norm = nn.LayerNorm(embedding_dim)
        self.fc_out = nn.Linear(embedding_dim * 3, output_dim)
        self.sigmoid = nn.Sigmoid()
        self.dropout = nn.Dropout(dropout)

    def forward(self, text_inputs, numerical_inputs, categorical_inputs, attention_mask):
        embedded = self.bert(text_inputs, attention_mask=attention_mask)[0]
        embedded = self.fc_text(embedded)
        embedded = embedded.mean(dim=1).view(-1, embedding_dim)
        output_text = self.fc_text_norm(self.dropout(embedded))
        output_num = self.fc_num_norm(self.dropout(self.fc_num(numerical_inputs)))
        output_cat = self.fc_cat_norm(self.dropout(self.fc_cat(categorical_inputs)))
        output = torch.cat((output_text, output_num, output_cat), dim=1)
        output = self.fc_out(output)
        output = self.sigmoid(output)
        return output


In [None]:
#Choose parameters
vocab_size = len(vocab)
embedding_dim = 64
num_features = len(numerical_features_list)
cat_features = len(categorical_features_list)
output_dim = 2
dropout = 0.2
# Train the model
model = BERTModel(vocab_size, embedding_dim, num_features, cat_features, output_dim, dropout).to(device)
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode = 'min')
num_epochs = 5
early_stopping = 3
best_val_acc = 0.0

for epoch in range(num_epochs):
  print(f"Epoch {epoch}")
  train_loss, train_acc = train_epoch(model, train_loader_BERT, optimizer, loss_fn, device=device, scheduler=scheduler, use_attention_mask=True, attention_mask=attention_mask_train)
  print(f"Train loss {train_loss:.3f}. Train acc {train_acc:.3f}")
  val_loss, val_acc = val_epoch(model, val_loader_BERT, loss_fn, device=device, use_attention_mask=True, attention_mask=attention_mask_val)
  if val_acc > best_val_acc:
    best_val_acc = val_acc
    stop_iter = 0
    torch.save(model.state_dict(), 'best_model_BERT_sep.pt')
  else:
    stop_iter += 1
  print(f"Val loss {val_loss:.3f}. Val acc {val_acc:.3f}. Best val acc = {best_val_acc:.3f}")

val_loss, val_acc, val_f1, val_recall, val_avg_precision = val_epoch(model, val_loader_BERT, loss_fn, device, get_metrics=True, use_attention_mask=True, attention_mask=attention_mask_val)
test_loss, test_acc, test_f1, test_recall, test_avg_precision = val_epoch(model, test_loader_BERT, loss_fn, device, get_metrics=True, use_attention_mask=True, attention_mask=attention_mask_test)
BERT_metrics = [test_loss, test_acc, test_f1, test_recall, test_avg_precision]
print(f"Val loss: {val_loss:.3f}, Val acc: {val_acc:.3f}, Val F1: {val_f1:.3f}, Val Recall: {val_recall:.3f}, Val Avg Precision: {val_avg_precision:.3f}")
print(f"Test loss: {test_loss:.3f}, Test acc: {test_acc:.3f}, Test F1: {test_f1:.3f}, Test Recall: {test_recall:.3f}, Test Avg Precision: {test_avg_precision:.3f}")

Epoch 0
Train loss 0.477. Train acc 0.829
Val loss 0.386. Val acc 0.936. Best val acc = 0.936
Epoch 1
Train loss 0.381. Train acc 0.934
Val loss 0.368. Val acc 0.936. Best val acc = 0.936
Epoch 2
Train loss 0.374. Train acc 0.935
Val loss 0.367. Val acc 0.936. Best val acc = 0.936
Epoch 3
Train loss 0.372. Train acc 0.935
Val loss 0.366. Val acc 0.936. Best val acc = 0.936
Epoch 4
Train loss 0.372. Train acc 0.935
Val loss 0.366. Val acc 0.936. Best val acc = 0.936
Val loss: 0.366, Val acc: 0.936, Val F1: 0.901, Val Recall: 0.947, Val Avg Precision: 0.727
Test loss: 0.358, Test acc: 0.955, Test F1: 0.488, Test Recall: 0.977, Test Avg Precision: 1.000



### 6. Compare Models (3 pts)
- **Accuracy Scores, F1 Scores, Recall Scores, and Average Precision Score (3 pts)**: Compare the performance of LSTM-GRU and BERT models across these metrics.

### 7. Predict Test Data (2 pts)
- **All Unrecommended Reviews (2 pts)**: Use the trained models to predict sentiment on the dataset of unrecommended reviews.


In [None]:
BERT_metrics_dict = {
    'test_loss': test_loss,
    'test_acc': test_acc,
    'test_f1': test_f1,
    'test_recall': test_recall,
    'test_avg_precision': test_avg_precision
}

with open('results.json', 'r') as file:
    results = json.load(file)

results['BERT_metrics_dict'] = BERT_metrics_dict

with open('results.json', 'w') as file:
    json.dump(results, file)

with open('results.json', 'r') as file:
    results = json.load(file)

lstm_metrics_df = pd.DataFrame.from_dict(results['LSTM_metrics_dict'], orient='index').reset_index().rename(columns={'index': 'metric', 0: 'value'})
lstm_metrics_df['model'] = 'LSTM'

bert_metrics_df = pd.DataFrame.from_dict(results['BERT_metrics_dict'], orient='index').reset_index().rename(columns={'index': 'metric', 0: 'value'})
bert_metrics_df['model'] = 'BERT'

metrics_df = pd.concat([lstm_metrics_df, bert_metrics_df])

print(metrics_df)

               metric     value model
0           test_loss  0.359973  LSTM
1            test_acc  0.953283  LSTM
2             test_f1  0.488161  LSTM
3         test_recall  0.976870  LSTM
4  test_avg_precision  1.000000  LSTM
0           test_loss  0.358428  BERT
1            test_acc  0.955048  BERT
2             test_f1  0.488334  BERT
3         test_recall  0.977201  BERT
4  test_avg_precision  1.000000  BERT


## Problem 2. Natural language generation (35 pts)

This problem requires uploading two csv files along with the solution notebook. Please compress these three files in zip archive and upload it in Canvas.

Natural language generation (NLG) is a well-known research problem concerned with generating textual descriptions of structured data, such as tables, as output. Compared to machine translation, where the goal is to completely convert an input sentence into another language, NLG requires overcoming two different challenges: deciding what to say, by selecting a relevant subset of the input data to describe, and deciding how to say it, by generating text that flows and reads naturally.

In this task you will need to generate table descriptions and titles for the dataset that can be downloaded from lms canvas folder "dataset/dataset_nlg.zip". Your inference pipeline should receive `.csv` and output 2 strings: table description `text` and table title `title`.

As the solution to this task you shoud complete `submission.csv` and `submission_reranking.csv` files as below and report the link on your finetuned checkpoints.

In [3]:
import numpy as np
import pandas as pd
import torch
!pip install rouge-score
!pip install sacrebleu
!pip install bert-score
from tqdm import tqdm
from transformers import T5ForConditionalGeneration, T5Tokenizer
from sklearn.model_selection import train_test_split
import nltk
# from sacrebleu.metrics import BLEU
from nltk.translate.meteor_score import meteor_score
from bert_score import score
from rouge_score import rouge_scorer
from transformers import BertTokenizer, BertModel

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import single_meteor_score
import torch.optim as optim

import torch.nn as nn
import torch.nn.functional as F

device = ("cuda" if torch.cuda.is_available() else "cpu")

# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('omw')

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24933 sha256=8b0bebeb4466923d56108002c4f769c7a1bf3dd6d9e40b5571f7104112498593
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2
Collecting sacrebleu
  Downloading sacrebleu-2.4.2-py3-none-any.whl (106 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m106.7/106.7 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting portalocker (from sacrebleu)
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Collecting colorama (from sacrebleu)
  Downloading colorama-0.4.6-py2.py3-no

In [4]:
!unzip '/content/drive/MyDrive/Deep Learning/HW/Datasets/dataset_nlg.zip'
dataset_path = '/content/dataset_nlg'

[1;30;43mВыходные данные были обрезаны до нескольких последних строк (5000).[0m
  inflating: dataset_nlg_v1/data/223685202396115.csv  
  inflating: dataset_nlg_v1/data/1903008436452839588.csv  
  inflating: dataset_nlg_v1/data/17849348756475776434.csv  
  inflating: dataset_nlg_v1/data/223685202391810.csv  
  inflating: dataset_nlg_v1/data/13052822949931163071.csv  
  inflating: dataset_nlg_v1/data/3752806015948410158.csv  
  inflating: dataset_nlg_v1/data/3685863684819740294.csv  
  inflating: dataset_nlg_v1/data/7537359542959669888.csv  
  inflating: dataset_nlg_v1/data/14222033772740785160.csv  
  inflating: dataset_nlg_v1/data/1778422877764209398.csv  
  inflating: dataset_nlg_v1/data/7347000480692584125.csv  
  inflating: dataset_nlg_v1/data/11655774884564649005.csv  
  inflating: dataset_nlg_v1/data/17191328118726412715.csv  
  inflating: dataset_nlg_v1/data/223685202387825.csv  
  inflating: dataset_nlg_v1/data/15590756896164817683.csv  
  inflating: dataset_nlg_v1/data/223685

In [5]:
data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)
data.head(5)

Unnamed: 0,text,title
871923758931292416,This statistic presents the global revenue of ...,Omnicom Group 's revenue from 2006 to 2019 ( i...
12713542298181105208,This statistic shows the number of hotel and s...,Number of hotel and similar accommodation esta...
5796511258704617257,"In 2019 , just 2.5 percent of all private wage...",Unemployment rate in the U.S. broadcasting ind...
14629703118053421010,This statistic displays the benefits of using ...,If a â€œconnected deviceâ€ ? had the following...
14801098692472737046,The statistic shows global gross domestic prod...,Global gross domestic product ( GDP ) at curre...


In [7]:
sample = pd.read_csv(f'{dataset_path}/data/871923758931292416.csv', index_col=0)
sample

Unnamed: 0,Year,Revenue in billion U.S. dollars
0,2019,14.95
1,2018,15.29
2,2017,15.27
3,2016,15.42
4,2015,15.13
5,2014,15.32
6,2013,14.58
7,2012,14.22
8,2011,13.87
9,2010,12.54


In [8]:
submission = pd.read_csv(f'{dataset_path}/submission.csv', index_col=0)
submission.head(5)

Unnamed: 0,text,title
223685202396506,,
223685202396505,,
223685202396504,,
223685202396503,,
223685202396502,,


- (**5 pts**) Propose and implement at least 2 variants of the input data preprocessing from tables to string data.
- (**5 pts**) Fine-tune [T5](https://huggingface.co/docs/transformers/model_doc/t5) as a baseline using `t5-base` checkpoint ([paper](https://arxiv.org/pdf/1910.10683.pdf)). In order to handle 2 types of output, test usage of prefixes for T5 model.
- (**5 pts**) Propose and implement at least 2 variants of data augmentation, retune T5 and compare performance.
- (**5 pts**) Add domain adoptation via additional Masked language modeling loss (MLM, [paper, section 3.1, Task #1](https://arxiv.org/pdf/1810.04805.pdf)) loss term for encoder, provide hyperparameter search for the regularization parameter $\lambda$, use BERTScore as objective, compare performance:
$$L(x, y) = -LogLikelihood(x, y) + \lambda L_{MLM}(x_{masked}, x)$$


The following metrics should be reported:
- [SacreBLEU](https://github.com/mjpost/sacrebleu)
- [ROUGEL](https://github.com/google-research/google-research/tree/master/rouge)
- [METEOR](https://www.nltk.org/_modules/nltk/translate/meteor_score.html)
- [BERTScore](https://github.com/Tiiiger/bert_score) using `bert-base-uncased` checkpoint and 9th layer output

Using the best checkpoint from above prepare submission file `submission.csv`, where index is a table caption from the `data` folder, and report the link on your finetuned checkpoint.

In [5]:
#Processing 1 way
def process_data(dataframe):
    series = dataframe.stack().reset_index()
    series.columns = ['index', 'column', 'value']
    series['column'] = series['column'].str.lower()
    series['info'] = series['column'] + ': ' + series['value'].astype(str)
    main_info = series.groupby('index')['info'].apply('; '.join).reset_index(drop=True)
    return '\n '.join(main_info)

df = pd.read_csv(f'{dataset_path}/data/871923758931292416.csv', index_col=0)
main_info = process_data(df)
print(main_info)

year: 2019.0; revenue in billion u.s. dollars: 14.95
 year: 2018.0; revenue in billion u.s. dollars: 15.29
 year: 2017.0; revenue in billion u.s. dollars: 15.27
 year: 2016.0; revenue in billion u.s. dollars: 15.42
 year: 2015.0; revenue in billion u.s. dollars: 15.13
 year: 2014.0; revenue in billion u.s. dollars: 15.32
 year: 2013.0; revenue in billion u.s. dollars: 14.58
 year: 2012.0; revenue in billion u.s. dollars: 14.22
 year: 2011.0; revenue in billion u.s. dollars: 13.87
 year: 2010.0; revenue in billion u.s. dollars: 12.54
 year: 2009.0; revenue in billion u.s. dollars: 11.72
 year: 2008.0; revenue in billion u.s. dollars: 13.36
 year: 2007.0; revenue in billion u.s. dollars: 12.69
 year: 2006.0; revenue in billion u.s. dollars: 11.38


In [10]:
#Processing 2 way
def process_data2(dataframe):
    main_info = dataframe.apply(lambda row: '; '.join([f'{col.lower()}: {row[col]}' for col in row.index if col != 'main_info']), axis=1)
    return '\n'.join(main_info)

df = pd.read_csv(f'{dataset_path}/data/871923758931292416.csv', index_col=0)
main_info = process_data2(df)
print(main_info)

year: 2019.0; revenue in billion u.s. dollars: 14.95
year: 2018.0; revenue in billion u.s. dollars: 15.29
year: 2017.0; revenue in billion u.s. dollars: 15.27
year: 2016.0; revenue in billion u.s. dollars: 15.42
year: 2015.0; revenue in billion u.s. dollars: 15.13
year: 2014.0; revenue in billion u.s. dollars: 15.32
year: 2013.0; revenue in billion u.s. dollars: 14.58
year: 2012.0; revenue in billion u.s. dollars: 14.22
year: 2011.0; revenue in billion u.s. dollars: 13.87
year: 2010.0; revenue in billion u.s. dollars: 12.54
year: 2009.0; revenue in billion u.s. dollars: 11.72
year: 2008.0; revenue in billion u.s. dollars: 13.36
year: 2007.0; revenue in billion u.s. dollars: 12.69
year: 2006.0; revenue in billion u.s. dollars: 11.38


In [38]:
X_train_ind, X_val_ind, y_train, y_val = train_test_split(data.index.values, data.values, test_size=0.2, random_state=42) #y[text, values]

my_token = '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'
model = T5ForConditionalGeneration.from_pretrained("t5-base", token=my_token).to(device)
tokenizer = T5Tokenizer.from_pretrained("t5-base", token=my_token)


class CSVDataset(torch.utils.data.Dataset):
  def __init__(self, csv_indices, data):
    self.csv_indices = csv_indices
    self.database = pd.DataFrame({'main_info': [self.summarize(fileindex) for fileindex in tqdm(self.csv_indices, unit_divisor=5000)]}, index = self.csv_indices)
    self.database['text'] = data[:, 0]
    self.database['title'] = data[:, 1]

  def summarize(self, index):
    df = pd.read_csv(f'{dataset_path}/data/{index}.csv', index_col=0)
    main_info = df.apply(lambda row: '; '.join([f'{col.lower()}: {row[col]}' for col in row.index if col != 'main_info']), axis=1)
    return '\n'.join(main_info)

  def __getitem__(self, index):

    main_info = self.database.loc[self.csv_indices[index], 'main_info']
    # inputs = tokenizer.encode(main_info, return_tensors="pt", max_length=128, truncation=True, padding='max_length')
    inputs = tokenizer(main_info, return_tensors='pt', max_length=128, truncation=True, padding='max_length', return_token_type_ids=False).to(device)

    reference_text = self.database.loc[self.csv_indices[index], 'text']
    # targets_text = tokenizer.encode(reference_text, return_tensors="pt", max_length=128, truncation=True, padding='max_length')
    targets_text = tokenizer.encode_plus(reference_text, return_tensors='pt', max_length=128, truncation=True, padding='max_length', return_token_type_ids=False).to(device)

    reference_title = self.database.loc[self.csv_indices[index], 'title']
    # targets_title = tokenizer.encode(reference_title, return_tensors="pt", max_length=128, truncation=True, padding='max_length')
    targets_title = tokenizer(reference_title, return_tensors='pt', max_length=128, truncation=True, padding='max_length', return_token_type_ids=False).to(device)

    # results = [tensor[0, :] for tensor in [inputs, targets_text, targets_title]]
    return inputs, targets_text, targets_title

  def __len__(self):
    return len(self.database)

train_dataset = CSVDataset(X_train_ind, y_train)
val_dataset = CSVDataset(X_val_ind, y_val)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=8, shuffle=False)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
100%|██████████| 20256/20256 [00:48<00:00, 414.69it/s]
100%|██████████| 5065/5065 [00:20<00:00, 243.57it/s]


In [41]:
def get_metrics_sample(reference, hypothesis):
    # BLEU score
    cheng_lapata = SmoothingFunction()
    BLEU_score = sentence_bleu(reference, hypothesis, smoothing_function=cheng_lapata.method4)

    # ROUGE score
    rouge_scorer_obj = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    rg_score = rouge_scorer_obj.score(' '.join(hypothesis), ' '.join(reference))['rougeL'].precision

    reference_tokens = nltk.word_tokenize(reference.lower())
    hypothesis_tokens = nltk.word_tokenize(hypothesis.lower())

    reference_tokens = [token for token in reference_tokens if token.isalnum()]
    hypothesis_tokens = [token for token in hypothesis_tokens if token.isalnum()]

    meteor = single_meteor_score(reference_tokens, hypothesis_tokens)

    # BERT score
    bert_score_obj = score([hypothesis], [reference], model_type='bert-base-uncased', num_layers=9)
    bert_score = float(score([hypothesis], [reference], model_type='bert-base-uncased', num_layers=9)[0])

    return {
        'BLEU': BLEU_score,
        'ROUGE': rg_score,
        'METEOR': meteor,
        'BERT': bert_score
    }


def train_epoch(model, train_loader, optimizer, loss_fn, device, scheduler):
    model.train()
    avg_loss = 0.0

    for inputs_data in train_loader:
        optimizer.zero_grad()
        inputs, targets_text, targets_title = inputs_data
        inputs, targets_text, targets_title = inputs.to(device), targets_text.to(device), targets_title.to(device)
        outputs_text = model(inputs['input_ids'][:, 0, :], inputs['attention_mask'][:, 0, :], labels = inputs['input_ids'][:, 0, :])
        outputs_title = model(inputs['input_ids'][:, 0, :], inputs['attention_mask'][:, 0, :], labels = inputs['input_ids'][:, 0, :])
        print(outputs_text)
        print(outputs_title)
        loss = loss_fn(outputs_text, targets_text) + loss_fn(outputs_title, targets_title)
        loss.backward()
        optimizer.step()
        avg_loss += loss.item()
    avg_loss /= len(train_loader)

    if scheduler is not None:
        scheduler.step(avg_loss)

    return avg_loss

def eval_epoch(model, val_loader, loss_fn, device, get_metrics=False):
    avg_loss = 0.0
    metrics = {'texts': {'BLEU': [], 'ROUGE': [], 'METEOR': [], 'BERT': []},
               'titles': {'BLEU': [], 'ROUGE': [], 'METEOR': [], 'BERT': []}}
    with torch.no_grad():
      for inputs_data in val_loader:
          inputs, targets_text, targets_title = inputs_data
          inputs, targets_text, targets_title = inputs.to(device), targets_text.to(device), targets_title.to(device)
          outputs_text = model(inputs['input_ids'][:, 0, :], inputs['attention_mask'][:, 0, :], labels = inputs['input_ids'][:, 0, :])
          outputs_title = model(inputs['input_ids'][:, 0, :], inputs['attention_mask'][:, 0, :], labels = inputs['input_ids'][:, 0, :])
          if get_metrics:
            for reference, hypothesis in zip(targets_text, outputs_text):
                metric = get_metrics_sample(reference, hypothesis)
                for key in metric:
                    metrics['texts'][key].append(metric[key])
            for reference, hypothesis in zip(targets_title, outputs_title):
                metric = get_metrics_sample(reference, hypothesis)
                for key in metric:
                    metrics['titles'][key].append(metric[key])
          loss = loss_fn(outputs_text, targets_text).item() + loss_fn(outputs_title, targets_title).item()
          avg_loss += loss
      avg_loss /= len(train_loader)

    if get_metrics:
      for key in metrics['texts']:
          metrics['texts'][key] = sum(metrics['texts'][key]) / len(val_loader)
      for key in metrics['titles']:
          metrics['titles'][key] = sum(metrics['titles'][key]) / len(val_loader)
      return metrics, avg_loss
    else:
      return avg_loss


In [42]:
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode = 'min')
num_epochs = 10
early_stopping = 3
best_loss = 999.999

for epoch in range(num_epochs):
  print(f"Epoch {epoch}")
  train_loss = train_epoch(model, train_loader, optimizer, loss_fn, device=device, scheduler=scheduler)
  print(f"Train loss {train_loss:.3f}")
  val_loss = val_epoch(model, val_loader, loss_fn, device=device)
  if val_loss < best_loss:
    best_loss = val_loss
    stop_iter = 0
    torch.save(model.state_dict(), 'best_model.pt')
  else:
    stop_iter += 1
  print(f"Val loss {val_loss:.3f}. Best val loss = {best_loss:.3f}")

metrics = val_epoch(model, val_loader, loss_fn, device=device, get_metrics = True)
texts_result_base = pd.DataFrame(metrics['texts'])
titles_result_base = pd.DataFrame(metrics['titles'])

Epoch 0


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 9.06 MiB is free. Process 5690 has 14.74 GiB memory in use. Of the allocated memory 13.83 GiB is allocated by PyTorch, and 802.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
# Your code is here

## Reranking approach (15 pts)

Using maximum likelihood, an ideal model will assign all probability mass to the reference summary. During inference, the model must also generate the output based on possibly erroneous previous steps. This can affect the performance of the model, a phenomenon often called exposure bias. One way to solve this problem is to require our model to be able to accurately predict the ranking order of a set of most likely candidates via an additional contrastive loss term

$$L(x, y) = -LogLikelihood(x, y) + L_{contrastive}(x, y)$$

where

$$
L_{contrastive}(x, y) = \sum_i\sum_{j < i}\max(0, f(s_i(x)) - f(s_j(x)) + \alpha_{ij})
$$

where $\alpha_{ij} = \alpha \cdot (i - j)$ is a margin, $s_i$ and $s_j$ are different candidates (generated by [beam search](https://huggingface.co/blog/how-to-generate)) such that for selected ranking function $r$ $r(s_j, y) > r(s_i, y)$, and
$f(s)$ is a length-normalised estimated log-probability:

$$
f(s) = \frac{\sum_{t} LogProb(s_t| s_{<t}, x)}{|x|},
$$

where $|x|$ is a lenght of $x$.

Your task is to fine-tune the model with reranking-aware loss using BERTScore as the ranking function $r$, provide hyperparameter search for the margin scaling factor $\alpha$ using BERTScore as objective, report metrics for the best case (SacreBLEU, ROUGEL, METEOR, BERTScore), and prepare the submission file `submission_reranking.csv` and report the link on your finetuned checkpoint.

In [None]:
# Your code is here

In [9]:
import gc
gc.collect()
torch.cuda.empty_cache()
del model