For this NLP course project, you will develop a machine learning model that classifies movies into genres based solely on their plot synopses. The goal is to train a model that can accurately predict whether a movie is an action film, a comedy, a drama, and so on, after reading a brief text description of its plot. This is a common text classification task with many real-world applications.

To make the project more concrete, imagine you are a data scientist working at VidFlex, a major movie studio with an extensive catalog of films across every genre. Automatically tagging each movie with its genre would improve VidFlex's recommendation system and help users find relevant titles on its streaming platform.

You will be provided with a dataset of movie synopses and corresponding genre labels for training. After training a deep learning text classifier, you can evaluate how accurately it predicts genres for thousands of movies it has never seen before. This project provides hands-on experience building and assessing NLP models for a practical text classification problem involving messy real-world data. A successful model could be deployed by companies like VidFlex to automatically tag and organize large databases of movies based on short plot descriptions. The notebook below starts with a simplified binary version of the task: thriller versus every other genre.

In [1]:
!pip install transformers[torch] datasets evaluate

Collecting transformers[torch]
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers[torch])
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[torch])
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_

In [97]:
import evaluate
import numpy as np
import torch
from datasets import load_dataset, ClassLabel
from rich import print  # pretty-printing replacement for the built-in print
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
)

In [98]:
dataset = load_dataset("parquet", data_files={"train": "train.parquet", "test": "test.parquet"})
labels = sorted(set(dataset["train"]["genre"]))  # all genres present in the training data
# Collapse the task to binary classification: "thriller" vs. every other genre.
label2id = {"thriller": 0, "not_thriller": 1}
id2label = {index: label for label, index in label2id.items()}

In [99]:
train_validation_dataset = dataset["train"].train_test_split(train_size=0.7)

In [100]:
train_validation_dataset["train"]

Dataset({
    features: ['id', 'movie_name', 'synopsis', 'genre'],
    num_rows: 37800
})

In [101]:
print(label2id)
print(id2label)

In [102]:
features = dataset["train"].features.copy()
# Multi-class alternative: features["label"] = ClassLabel(num_classes=len(labels), names=labels)
features["label"] = ClassLabel(num_classes=2, names=["thriller", "not_thriller"])

def adjust_labels(batch):
    # Map each genre string to its binary class id (0 = thriller, 1 = everything else).
    batch["label"] = [
        label2id[genre] if genre == "thriller" else label2id["not_thriller"]
        for genre in batch["genre"]
    ]
    return batch

train_dataset = train_validation_dataset["train"].map(adjust_labels, batched=True, features=features)
validation_dataset = train_validation_dataset["test"].map(adjust_labels, batched=True, features=features)

Map:   0%|          | 0/37800 [00:00<?, ? examples/s]

Map:   0%|          | 0/16200 [00:00<?, ? examples/s]

In [103]:
# Load the pretrained BERT tokenizer and a sequence-classification model with two output labels.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.config.label2id = label2id
model.config.id2label = id2label

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [104]:
tokenizer.decode(tokenizer.encode("Hello world"))

'[CLS] hello world [SEP]'

In [105]:
model.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "thriller",
    "1": "not_thriller"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "not_thriller": 1,
    "thriller": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.31.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [143]:
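def tokenize_function(example):
    # Return plain lists: map() stores them in Arrow, and the Trainer moves batches to the GPU itself.
    # padding=True pads to the longest synopsis within each map() batch.
    return tokenizer(example["synopsis"], padding=True, truncation=True)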

train_tokenized_datasets = train_dataset.map(tokenize_function, batched=True)
validation_tokenized_datasets = validation_dataset.map(tokenize_function, batched=True)
validation_tokenized_datasets.set_format("pt", columns=["input_ids"], output_all_columns=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
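
What the collator does can be pictured with a small plain-Python sketch: each batch is padded to the length of its longest sequence, and an attention mask marks which positions hold real tokens. `pad_batch` is a hypothetical helper written for illustration, not the library's implementation, and the token ids are toy values; only the pad id 0 is taken from `pad_token_id` in the model config shown earlier.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest one in this batch (illustrative only)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (ids are made up for the example).
batch = pad_batch([[101, 7592, 2088, 102], [101, 7592, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than to one global maximum keeps batches of short synopses small, which is the usual reason to pad at collation time.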

In [144]:
print(type(validation_tokenized_datasets[0]["input_ids"]))

In [107]:
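# eval_steps falls back to logging_steps (500 by default), which exceeds the 296 steps of this run, so set it explicitly.
training_args = TrainingArguments("test-trainer", evaluation_strategy="steps", eval_steps=100, per_device_train_batch_size=128, num_train_epochs=1)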

In [108]:
# Load the metrics once instead of on every evaluation call.
f1_metric = evaluate.load("f1")
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    f1_score = f1_metric.compute(predictions=predictions, references=labels, average="micro")
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    # compute() returns a dict such as {"f1": value}; unwrap it so the Trainer logs plain floats.
    return {"F1": f1_score["f1"], "Accuracy": accuracy["accuracy"]}
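
A note on `average="micro"` in `compute_metrics`: for single-label classification, micro-averaging pools true positives, false positives and false negatives over all classes, which makes the score numerically equal to accuracy. The plain-Python sketch below checks this on made-up predictions (the label lists and helper names are illustrative, not project data) and also computes F1 for the thriller class alone (label 0), which may be more informative if thrillers turn out to be a minority class.

```python
def accuracy(preds, refs):
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def counts(preds, refs, cls):
    """True positives, false positives and false negatives for one class."""
    tp = sum(p == cls and r == cls for p, r in zip(preds, refs))
    fp = sum(p == cls and r != cls for p, r in zip(preds, refs))
    fn = sum(p != cls and r == cls for p, r in zip(preds, refs))
    return tp, fp, fn

def micro_f1(preds, refs, classes):
    # Pool the counts over all classes before computing F1.
    tp = fp = fn = 0
    for cls in classes:
        c_tp, c_fp, c_fn = counts(preds, refs, cls)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return 2 * tp / (2 * tp + fp + fn)

def class_f1(preds, refs, cls):
    tp, fp, fn = counts(preds, refs, cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels: 0 = thriller, 1 = not_thriller.
refs = [0, 1, 1, 1, 0, 1, 1, 1]
preds = [0, 1, 1, 1, 1, 1, 0, 1]

print(accuracy(preds, refs), micro_f1(preds, refs, [0, 1]))  # 0.75 0.75
print(class_f1(preds, refs, 0))  # 0.5
```

In this toy case accuracy and micro-F1 agree at 0.75, while the thriller-only F1 is 0.5, so the two numbers reported by `compute_metrics` carry the same information.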

In [109]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_tokenized_datasets,
    eval_dataset=validation_tokenized_datasets,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

In [110]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss


TrainOutput(global_step=296, training_loss=0.3060935767921242, metrics={'train_runtime': 161.1374, 'train_samples_per_second': 234.582, 'train_steps_per_second': 1.837, 'total_flos': 1707492082640640.0, 'train_loss': 0.3060935767921242, 'epoch': 1.0})
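
As a sanity check on the training output, the step count can be derived from the split size and the batch size. The sketch below assumes a single device and no gradient accumulation (the `TrainingArguments` defaults); the variable names are illustrative.

```python
import math

num_train_rows = 37800   # size of the training split printed earlier
batch_size = 128         # per_device_train_batch_size
num_epochs = 1           # num_train_epochs

# Assumes one device and gradient_accumulation_steps=1 (the defaults).
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)  # 296

# Cross-check: reported throughput times runtime should recover the row count.
samples_seen = 234.582 * 161.1374
print(round(samples_seen))  # 37800
```

Both figures agree with `global_step=296` and the 37,800-row training split, which suggests the run covered the full split exactly once.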

In [135]:
validation_dataset[0]

{'id': 27632,
 'movie_name': 'Secret of the Wings',
 'synopsis': 'Tinkerbell wanders into the forbidden Winter woods and meets Periwinkle. Together they learn the secret of their wings and try to unite the warm fairies and the winter fairies to help Pixie Hollow.',
 'genre': 'family',
 'label': 1}

In [152]:
inputs = validation_tokenized_datasets[0:2]["input_ids"].to(model.device)  # same device as the model

In [153]:
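model.eval()  # disable dropout for inference
with torch.no_grad():  # no gradients needed; the mask hides padding tokens
    outputs = model(inputs, attention_mask=(inputs != tokenizer.pad_token_id).long())
print(outputs)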

In [154]:
import torch.nn.functional as F

In [161]:
predictions = F.softmax(outputs.logits, dim=-1)
print(predictions)
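
To turn these probabilities into genre tags, take the arg-max over the class dimension and look the index up in `id2label`. The sketch below does this in plain Python on made-up logits (the numbers are illustrative, not model output), reusing the `id2label` mapping defined earlier; `predict_labels` is a hypothetical helper.

```python
import math

id2label = {0: "thriller", 1: "not_thriller"}

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_labels(batch_logits):
    """Return (label, probability) for the most likely class of each row."""
    results = []
    for row in batch_logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((id2label[best], probs[best]))
    return results

example_logits = [[-1.2, 1.5], [0.8, -0.3]]  # hypothetical logits for two synopses
for label, prob in predict_labels(example_logits):
    print(label, round(prob, 3))
```

With the tensors from the cell above, the equivalent index lookup would be `predictions.argmax(dim=-1)` followed by `model.config.id2label`.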