# Requirements

In [27]:
%pip install tf-keras

Collecting tf-keras
  Downloading tf_keras-2.18.0-py3-none-any.whl.metadata (1.6 kB)
Downloading tf_keras-2.18.0-py3-none-any.whl (1.7 MB)
Installing collected packages: tf-keras
Successfully installed tf-keras-2.18.0
Note: you may need to restart the kernel to use updated packages.


In [34]:
# Add as many imports as you need.
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch
from sklearn.preprocessing import LabelEncoder

# Laboratory Exercise - Run Mode (8 points)

## Introduction
The primary objective of this laboratory assignment is to fine-tune a pre-trained language model for binary classification on a dataset of wine reviews. The dataset contains two attributes: **description** and **points**. The description is a brief text describing the wine, and the points are a quality score on a scale from 1 to 100. A wine with at least 90 points is considered **exceptional**. Your task is to predict whether a wine is **exceptional** based on its review.

## The Wine Reviews Dataset

Load the dataset using the `datasets` library.

In [3]:
# Write your code here. Add as many boxes as you need.
df = pd.read_csv('./wine-reviews.csv')

In [4]:
df.head()

                                         description  points
0  Translucent in color, silky in the mouth, this...      85
1  On the palate, this wine is rich and complex, ...      92
2  The producer blends 57% Chardonnay from the Ma...      92
3  Pure Baga in all its glory, packed with dry an...      93
4  Think of Subsídio as a contribution rather tha...      89


## Target Extraction
Extract the target **exceptional** for each wine review. A wine with at least 90 points is considered **exceptional**.

In [5]:
# Write your code here. Add as many boxes as you need.
df.isnull().sum()

description    0
points         0
dtype: int64

In [6]:
df['points'].value_counts()

points
90     1585
91     1114
87     1081
88     1033
92     1016
86      785
89      710
93      656
85      598
84      398
94      392
83      205
95      150
82      119
96       50
81       44
80       27
97       25
98        7
99        3
100       2
Name: count, dtype: int64

In [7]:
# "At least 90 points" includes 90 itself, so the comparison is >=.
df['points'] = df['points'].apply(lambda x: 'Exceptional' if x >= 90 else 'Non-Exceptional')

In [8]:
df['points'].value_counts()

points
Non-Exceptional    5000
Exceptional        5000
Name: count, dtype: int64

## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.
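
A minimal sketch of an 80:20 split on a toy frame (hypothetical data). The `random_state` and `stratify` arguments are not part of this notebook's own split; they are assumptions added here to make the partition reproducible and to keep the class ratio the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the wine reviews (hypothetical data).
toy = pd.DataFrame({
    "description": [f"review {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

train_df, test_df = train_test_split(
    toy,
    test_size=0.2,          # 80:20 ratio
    random_state=42,        # assumption: fixed seed for reproducibility
    stratify=toy["label"],  # assumption: preserve the class balance
)

print(len(train_df), len(test_df))  # 8 2
```

Stratifying matters most when the classes are imbalanced, because a purely random split can then leave the test set with a noticeably different share of exceptional wines.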


In [36]:
label = LabelEncoder()

In [37]:
df['points'] = label.fit_transform(df['points'])

In [61]:
# Write your code here. Add as many boxes as you need.
features = df.drop('points', axis=1)

In [62]:
target = df['points']

In [63]:
target

0       1
1       0
2       0
3       0
4       1
       ..
9995    0
9996    1
9997    0
9998    1
9999    0
Name: points, Length: 10000, dtype: int32
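
`LabelEncoder` assigns integer codes in sorted order of the class names, so in the output above `0` stands for Exceptional and `1` for Non-Exceptional (row 0 has 85 points and is encoded as 1). A small sketch that makes the mapping explicit:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Non-Exceptional", "Exceptional", "Exceptional"])

# classes_ is sorted, and each class's position is its integer code.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}

print(mapping)         # {'Exceptional': 0, 'Non-Exceptional': 1}
print(codes.tolist())  # [1, 0, 0]
```

This ordering is worth keeping in mind when computing precision and recall later, since many metric functions treat `1` as the positive class by default.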

In [64]:
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

## Tokenization
Tokenize the texts using the `AutoTokenizer` class.

In [65]:
# Write your code here. Add as many boxes as you need.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [66]:
# The tokenizer expects a list of strings, not a pandas Series.
tokenized_text = tokenizer(features['description'].tolist())

In [43]:
train_encodings = tokenizer(list(x_train), truncation=True, padding=True, max_length=128, return_tensors="tf")
test_encodings = tokenizer(list(x_test), truncation=True, padding=True, max_length=128, return_tensors="tf")

In [44]:
train_encodings

{'input_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[ 101, 2685,  102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[1, 1, 1]])>}

In [46]:
test_encodings

{'input_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[ 101, 2685,  102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(1, 3), dtype=int32, numpy=array([[1, 1, 1]])>}
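
The encodings above have shape (1, 3): a single three-token sequence rather than thousands of reviews. A likely cause is that iterating over a pandas DataFrame yields its column names, so `list(x_train)` hands the tokenizer one column name instead of the review texts. The sketch below shows the difference on a toy frame (hypothetical data).

```python
import pandas as pd

x_toy = pd.DataFrame({"description": ["Rich and complex.", "Light and crisp."]})

as_columns = list(x_toy)                  # iterates over column names
as_texts = x_toy["description"].tolist()  # the actual review strings

print(as_columns)  # ['description']
print(as_texts)    # ['Rich and complex.', 'Light and crisp.']
```

With the notebook's variables, the call would presumably read `tokenizer(x_train['description'].tolist(), truncation=True, padding=True, max_length=128)`. Leaving out `return_tensors="tf"` keeps plain Python lists, which a PyTorch dataset can convert with `torch.tensor`.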

## Fine-tuning a Pre-trained Language Model for Classification
Fine-tune a pre-trained language model for classification on the given dataset.

Define the model using the `AutoModelForSequenceClassification` class.

In [59]:
# Write your code here. Add as many boxes as you need.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Define the training parameters using the `TrainingArguments` class.

In [48]:
# Write your code here. Add as many boxes as you need.
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,  # small step size, a common choice when fine-tuning BERT
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
)
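
How these settings translate into optimizer steps can be checked with a little arithmetic. Assuming 8,000 training examples (80% of the 10,000 reviews), a single device, and no gradient accumulation, three epochs at batch size 16 give the 1,500 steps shown by the progress bar of `trainer.train()`.

```python
import math

num_train_examples = 8000  # 80% of the 10,000 reviews
batch_size = 16            # per_device_train_batch_size
num_epochs = 3             # num_train_epochs

# Assumption: one device and gradient_accumulation_steps=1.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 500 1500
```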



Define the training using the `Trainer` class.

In [49]:
# Write your code here. Add as many boxes as you need.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=x_train,
    eval_dataset=x_test
)

Fine-tune (train) the pre-trained language model.

In [50]:
# Write your code here. Add as many boxes as you need.
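class ClassificationDataset(Dataset):
    """Pairs tokenizer encodings with integer labels for use with Trainer."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        # Copy the labels into a list of Python ints so that positional
        # indexing works even when `labels` is a pandas Series whose index
        # was shuffled by the train/test split.
        self.labels = [int(label) for label in labels]

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)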

In [58]:
# convert_ids_to_tokens expects a flat sequence of ids, so pass one example at a time.
tokenizer.convert_ids_to_tokens(train_encodings['input_ids'][0])

In [51]:
train_dataset = ClassificationDataset(train_encodings, y_train)
test_dataset = ClassificationDataset(test_encodings, y_test)

Use the trained model to make predictions for the test set.
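
`Trainer.predict` returns an object whose `predictions` field holds one row of logits per example. Below is a minimal sketch, using made-up logits in place of real model output, of turning them into class ids and label names. The id-to-name mapping assumes the `LabelEncoder` ordering used earlier (0 for Exceptional, 1 for Non-Exceptional).

```python
import numpy as np

# Hypothetical logits for three reviews; columns follow label ids 0 and 1.
logits = np.array([[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]])

pred_ids = logits.argmax(axis=1)  # index of the larger logit in each row

id_to_label = {0: "Exceptional", 1: "Non-Exceptional"}  # assumed encoder order
pred_labels = [id_to_label[int(i)] for i in pred_ids]

print(pred_ids.tolist())  # [0, 1, 0]
print(pred_labels)
```

With a trained model, `logits` would come from `trainer.predict(test_dataset).predictions`.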

In [55]:
print(len(train_encodings['input_ids']), len(y_train))
print(len(test_encodings['input_ids']), len(y_test))

1 8000
1 2000


In [56]:
print(train_encodings['input_ids'][:5])  # View the first 5 examples of the input_ids
print(y_train[:5])  # View the first 5 labels in y_train


tf.Tensor([[ 101 2685  102]], shape=(1, 3), dtype=int32)
9254    0
1561    1
1670    1
6087    1
6669    1
Name: points, dtype: int32


In [54]:
# Write your code here. Add as many boxes as you need.
# This KeyError: 5542 most likely comes from how the dataset was split.
trainer.train()

  0%|          | 0/1500 [00:00<?, ?it/s]

KeyError: 5542
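
The `KeyError: 5542` is consistent with `Trainer` receiving the DataFrame `x_train` directly. The data loader asks for items by integer position, and `DataFrame[5542]` is a column lookup, not a row lookup. The toy example below (hypothetical data) reproduces that failure mode and shows the positional alternative.

```python
import pandas as pd

# Toy frame whose index labels are shuffled, as after a random split.
x_toy = pd.DataFrame({"description": ["a", "b", "c"]}, index=[7, 2, 5])

error = None
try:
    x_toy[1]  # looks for a column named 1, which does not exist
except KeyError as err:
    error = err
print("KeyError:", error)

# Positional access returns the row at position 1, whatever its index label.
second_row_text = x_toy.iloc[1]["description"]
print(second_row_text)  # b
```

Passing `train_dataset` and `test_dataset` built from `ClassificationDataset` to `Trainer`, instead of `x_train` and `x_test`, would avoid this lookup altogether.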

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.
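
A minimal sketch of the evaluation step with `scikit-learn`, using hypothetical labels and predictions in place of real model output. Because the encoder maps Exceptional to 0, `pos_label=0` is passed so that precision and recall describe the exceptional class; that choice is an assumption about which class is of interest.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels and predictions; 0 = Exceptional, 1 = Non-Exceptional.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=0
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

`sklearn.metrics.classification_report` prints the same quantities for both classes at once and may be more convenient for a quick overview.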

In [None]:
# Write your code here. Add as many boxes as you need.

# Laboratory Exercise - Bonus Task (+ 2 points)

Implement a simple machine learning pipeline to classify wine reviews as **exceptional** or not. Use TF-IDF vectorization to convert text into numerical features and train a logistic regression. Split the dataset into training and testing sets, fit the pipeline on the training data, and evaluate its performance using metrics such as precision, recall, and F1-score. Analyze the texts to find the most influential words or phrases associated with the **exceptional** wines. Use the coefficients from the logistic regression trained on TF-IDF features to identify the top positive and negative keywords for **exceptional** wines. Present these keywords in a simple table or visualization (e.g., bar chart).

In [None]:
# Write your code here. Add as many boxes as you need.