# Dataset Card for Amazon Review Polarity

The Amazon reviews dataset consists of reviews from Amazon. The data span a period of 18 years and include ~35 million reviews up to March 2013. Each review includes product and user information, a rating, and the plain-text review itself.

## Data Fields

'title': a string containing the title of the review, enclosed in double quotes ("). Any internal double quote is escaped as two double quotes (""), and a new line is escaped as a backslash followed by an "n" character, that is "\n".

'content': a string containing the body of the review, enclosed in double quotes (") with the same escaping rules as 'title'.

'label': either 1 (positive) or 0 (negative) rating.
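The escaping rules above can be undone with the standard library. The following is a minimal sketch, assuming one review per CSV line in the column order label, title, content; that order, the sample line, and the helper name `parse_review_row` are illustrative assumptions, not part of the dataset card.

```python
import csv
import io


def unescape_newlines(text: str) -> str:
    """Turn the two-character sequence backslash + n into a real newline."""
    return text.replace("\\n", "\n")


def parse_review_row(line: str) -> dict:
    """Parse one CSV line laid out as label,title,content (assumed order)."""
    # csv.reader strips the enclosing quotes and collapses "" into ".
    label, title, content = next(csv.reader(io.StringIO(line)))
    return {
        "label": int(label),
        "title": unescape_newlines(title),
        "content": unescape_newlines(content),
    }


# Hypothetical raw line: doubled quotes in the title, an escaped newline in the body.
raw = '"1","A ""great"" read","First line.\\nSecond line."'
row = parse_review_row(raw)
print(row["title"])  # A "great" read
print(row["content"])
```

The Parquet files used in the cells below already expose these fields as parsed columns, so this step would only matter when reading raw CSV text.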

In [6]:
import dask.dataframe as dd
from datasets import Dataset

# Open the training split lazily, then materialize only the first 12,000 rows
ddf = dd.read_parquet("hf://datasets/fancyzhx/amazon_polarity/amazon_polarity/train-*.parquet")
small_df = ddf.head(12000, npartitions=-1)  # npartitions=-1: draw rows from all partitions, not only the first

# Convert to Hugging Face Dataset
dataset = Dataset.from_pandas(small_df)
print(dataset)

Dataset({
    features: ['label', 'title', 'content', '__index_level_0__'],
    num_rows: 12000
})


In [7]:
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from sklearn.metrics import classification_report, f1_score




In [8]:
df = dataset.to_pandas()
df

Unnamed: 0,label,title,content,__index_level_0__
0,1,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...,0
1,1,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...,1
2,1,Amazing!,This soundtrack is my favorite music of all ti...,2
3,1,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...,3
4,1,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine...",4
...,...,...,...,...
11995,1,Quick Success to Stain Glass,I am a beginner and I'm teaching myself. This ...,11995
11996,1,A Must-Read for SoCal dog owners,"Look, you got a dog or two? Living or travelin...",11996
11997,1,Excellent Follow-Up Score.,This is an excellent follow-up score to TOY ST...,11997
11998,1,Not a Disney soundtrack masterpiece,Even if the 'Toy Story 2' soundtrack is'nt so ...,11998


In [9]:
df = df.drop(columns="__index_level_0__")
df

Unnamed: 0,label,title,content
0,1,Stuning even for the non-gamer,This sound track was beautiful! It paints the ...
1,1,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
2,1,Amazing!,This soundtrack is my favorite music of all ti...
3,1,Excellent Soundtrack,I truly like this soundtrack and I enjoy video...
4,1,"Remember, Pull Your Jaw Off The Floor After He...","If you've played the game, you know how divine..."
...,...,...,...
11995,1,Quick Success to Stain Glass,I am a beginner and I'm teaching myself. This ...
11996,1,A Must-Read for SoCal dog owners,"Look, you got a dog or two? Living or travelin..."
11997,1,Excellent Follow-Up Score.,This is an excellent follow-up score to TOY ST...
11998,1,Not a Disney soundtrack masterpiece,Even if the 'Toy Story 2' soundtrack is'nt so ...


In [13]:
dataset = Dataset.from_pandas(df)
training_df,test_df = dataset.train_test_split(test_size=0.2).values()
training_df

Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 9600
})

In [14]:
test_df

Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 2400
})

In [15]:
train_df,validation_df = training_df.train_test_split(test_size=0.1).values()
validation_df

Dataset({
    features: ['label', 'title', 'content'],
    num_rows: 960
})
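The two successive splits above can be checked with plain index lists. This is a minimal sketch rather than the `datasets` implementation: `split_indices` is a hypothetical helper and the seeds are arbitrary. It reproduces the split sizes and shows that `training_df` still contains every validation row, whereas `train_df` does not.

```python
import random


def split_indices(indices, test_fraction, seed):
    """Shuffle indices and cut off a test portion (hypothetical helper)."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


all_rows = range(12000)
# First split: 80% "training", 20% test.
training_idx, test_idx = split_indices(all_rows, 0.2, seed=0)
# Second split: carve 10% of "training" off as validation.
train_idx, validation_idx = split_indices(training_idx, 0.1, seed=1)

print(len(train_idx), len(validation_idx), len(test_idx))  # 8640 960 2400

overlap_with_training = len(set(training_idx) & set(validation_idx))
overlap_with_train = len(set(train_idx) & set(validation_idx))
print(overlap_with_training, overlap_with_train)  # 960 0
```

Training on the 9,600-row object while validating on the 960-row one therefore evaluates on rows the model has already seen; only the 8,640-row remainder is disjoint from the validation set.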

In [18]:
from datasets import DatasetDict

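dataset = DatasetDict({
    # train_df (8,640 rows) excludes the validation rows; training_df (9,600 rows) still contains them.
    # The logged run further down was trained on training_df, hence its 2,400 steps.
    "train": train_df,
    "validation": validation_df,
    "test": test_df
})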

In [48]:
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Define tokenization function
def tokenize_function(examples):
    # Combine 'title' and 'content' (adjust if needed)
    texts = [f"{title} {content}" for title, content in zip(examples["title"], examples["content"])]
    
    # Tokenize
    return tokenizer(
        texts,
        truncation=True,
        padding="max_length",  # or "longest" for dynamic padding
        max_length=50,        # Your chosen max length
        return_tensors=None   # Returns lists instead of tensors
    )

# Apply tokenization (batched for efficiency)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
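With `truncation=True` and `padding="max_length"`, every example ends up exactly `max_length` tokens long. The sketch below illustrates that shape rule on plain lists. It is a simplified stand-in, not the tokenizer's code: `pad_or_truncate` is a hypothetical helper, a pad id of 0 is assumed, and unlike the real tokenizer it does not keep the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # truncation: drop everything past max_length
    n_pad = max_length - len(kept)         # padding needed for short inputs
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)  # [101, 7, 8, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [0, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```

With `max_length=50`, any title plus review longer than 50 tokens falls into the second case, so the model only sees the opening of such reviews.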


In [49]:
model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased",num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [50]:
tokenized_dataset = tokenized_dataset.rename_column("label", "labels")

In [51]:
columns_to_remove = ['title', 'content']
tokenized_dataset = tokenized_dataset.remove_columns(columns_to_remove)

# Set to PyTorch format
tokenized_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'token_type_ids', 'labels'])

In [53]:
# Slicing a torch-formatted split returns a dict of stacked tensors for the formatted columns
sample_batch = tokenized_dataset['train'][:2]

# Should run without errors
outputs = model(**sample_batch)
print(outputs.loss)

tensor(1.0153, grad_fn=<NllLossBackward0>)
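The printed value is a mean cross-entropy over the two sampled examples, as the `NllLossBackward0` grad function suggests. The following sketch recomputes that quantity by hand on made-up logits (the logits are assumptions, not the model's actual outputs) to show the reference point: a model that splits probability 50/50 between two classes scores ln 2, about 0.693.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true label under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


def mean_loss(batch_logits, labels):
    """Average cross-entropy over a batch."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


uniform = mean_loss([[0.0, 0.0], [0.0, 0.0]], [0, 1])
skewed = mean_loss([[0.6, -0.4], [0.5, -0.3]], [1, 1])  # made-up logits leaning the wrong way
print(round(uniform, 4))  # 0.6931
print(round(skewed, 4))
```

A freshly initialized classification head can easily lean the wrong way on a batch of two, so a value above ln 2 at this stage is not alarming on its own.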


In [54]:
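args = TrainingArguments(
    output_dir="amazon",
    eval_strategy="epoch",
    num_train_epochs=2,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-4,  # high for BERT fine-tuning; 2e-5 to 5e-5 is the commonly recommended range
    metric_for_best_model="f1"  # used to select the best checkpoint when load_best_model_at_end=True (not set here)
)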
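Since no scheduler or warmup is configured here, the run presumably uses the Trainer defaults, which to the best of my knowledge are linear decay with zero warmup steps. Under that assumption, the sketch below shows how `learning_rate` evolves over the run; `linear_lr` is a hypothetical helper written for illustration, and the step count is the `global_step` reported by the training cell.

```python
def linear_lr(step, total_steps, peak_lr, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 2400  # global_step reported by the training run
for step in (0, 1200, 2400):
    print(step, linear_lr(step, total_steps, peak_lr=2e-4))
```

Without warmup, the very first updates are taken at the full 2e-4, which is one plausible reason a pretrained encoder can be pushed into a degenerate solution early in training.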

In [55]:
def compute_metrics(eval_pred):
    logits,labels = eval_pred
    pred = np.argmax(logits,axis=-1)
    f1 = f1_score(labels, pred, average='weighted')
    return {"f1": f1}
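To make explicit what `average='weighted'` computes, here is a dependency-free sketch of the same two steps: an argmax over the logits, then a per-class F1 averaged with weights equal to each class's share of the true labels. The helper names and the toy logits are assumptions for illustration; the notebook itself relies on scikit-learn.

```python
def f1_for_class(labels, preds, cls):
    """F1 of one class from true-positive, false-positive and false-negative counts."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(labels, preds):
    """Per-class F1 weighted by the number of true examples of each class."""
    classes = sorted(set(labels) | set(preds))
    total = sum(f1_for_class(labels, preds, c) * labels.count(c) for c in classes)
    return total / len(labels)


logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.7, 0.4]]  # toy values
labels = [0, 1, 1, 1]
preds = [max(range(len(row)), key=row.__getitem__) for row in logits]  # argmax per row
print(preds)                                  # [0, 1, 0, 1]
print(round(weighted_f1(labels, preds), 4))   # 0.7667
```

Because the weights follow class support, a model that ignores one class entirely can still reach a weighted F1 well above zero on a balanced dataset.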

In [60]:
trainer = Trainer(
    model=model,
    args=args,
    compute_metrics=compute_metrics,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation']
)

In [61]:
print(tokenized_dataset.column_names)

{'train': ['labels', 'input_ids', 'token_type_ids', 'attention_mask'], 'validation': ['labels', 'input_ids', 'token_type_ids', 'attention_mask'], 'test': ['labels', 'input_ids', 'token_type_ids', 'attention_mask']}


In [62]:
print(tokenized_dataset['train'][0])

{'labels': tensor(0), 'input_ids': tensor([  101,  1141,  1104,  1103,  4997,  1188,  1520,  1108,  1141,  1104,
         1103,  4997,   146,  1138,  1518,  2373,   119,   146,  1354,  1122,
         1108,  1280,  1106,  1129, 21964,  1113,  1103,  1297,  1104, 15518,
         4446,  1121,  1109,  3394,  8430,  1642,   119,  3743,   117,  1122,
         1108,   187,  3984, 11273,  1183,   117,  1105,  5733, 12533,   102]), 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0]), 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1])}


In [63]:
trainer.train()

Epoch,Training Loss,Validation Loss,F1
1,0.7053,0.699884,0.353193
2,0.7062,0.69273,0.353193


TrainOutput(global_step=2400, training_loss=0.708075091044108, metrics={'train_runtime': 2625.1044, 'train_samples_per_second': 7.314, 'train_steps_per_second': 0.914, 'total_flos': 493333228800000.0, 'train_loss': 0.708075091044108, 'epoch': 2.0})
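Two numbers in this log can be sanity-checked with simple arithmetic. The sketch below assumes a single device and no gradient accumulation. It shows that `global_step=2400` is consistent with 9,600 training rows (the size of `training_df`) at batch size 8 over 2 epochs, and that both validation losses sit within about 0.01 of ln 2, the loss of a 50/50 guess over two classes.

```python
import math

train_rows, batch_size, epochs = 9600, 8, 2
steps = math.ceil(train_rows / batch_size) * epochs
print(steps)  # 2400

chance_loss = math.log(2)  # cross-entropy of a 50/50 guess over two classes
print(round(chance_loss, 4))  # 0.6931
for val_loss in (0.699884, 0.69273):
    # Difference between the logged validation loss and the chance-level loss
    print(val_loss, round(val_loss - chance_loss, 4))
```

Together with an F1 that is identical in both epochs, this is consistent with a model that has not learned to separate the classes, though the log alone does not establish the cause.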

In [73]:
pred_output = trainer.predict(tokenized_dataset['test'])

In [74]:
pred_labels = np.argmax(pred_output.predictions, axis=-1)
print(classification_report(tokenized_dataset['test']['labels'], pred_labels, zero_division=0))

              precision    recall  f1-score   support

           0       0.51      1.00      0.68      1233
           1       0.00      0.00      0.00      1167

    accuracy                           0.51      2400
   macro avg       0.26      0.50      0.34      2400
weighted avg       0.26      0.51      0.35      2400
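The report is what a constant predictor would produce: recall 1.00 for class 0 and 0.00 for class 1 means every test review was assigned label 0. As a check, the sketch below derives the headline numbers from the two support counts alone, assuming a classifier that always outputs class 0.

```python
support = {0: 1233, 1: 1167}  # per-class support from the report
total = sum(support.values())

# A classifier that always predicts class 0:
precision_0 = support[0] / total  # all 2400 predictions are 0; only the true 0s are correct
recall_0 = 1.0                    # every true 0 is found
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 0.0                        # class 1 is never predicted

accuracy = support[0] / total
macro_f1 = (f1_0 + f1_1) / 2
weighted_f1 = (f1_0 * support[0] + f1_1 * support[1]) / total
print(round(accuracy, 2), round(f1_0, 2), round(macro_f1, 2), round(weighted_f1, 2))
# 0.51 0.68 0.34 0.35
```

The weighted F1 of about 0.35 is close to the 0.353 logged on the validation set, which suggests the same single-class behaviour there.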

