In [None]:
!pip install transformers

## IMPORTS

In [1]:
# Using PyTorch: I could not find a TensorFlow text-classification checkpoint trained on IMDB on the Hugging Face Hub
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

## PRETRAINED MODEL

In [2]:
# SOURCE: https://huggingface.co/aychang/roberta-base-imdb
tokenizer = AutoTokenizer.from_pretrained("aychang/roberta-base-imdb")
model = AutoModelForSequenceClassification.from_pretrained("aychang/roberta-base-imdb")

## DEVELOPMENT

In [3]:
# SOURCE: https://www.denofgeek.com/tv/star-trek-discovery-season-4-episode-8-review-all-in/
text = """It’s been a month since Star Trek: Discovery has been on our screens, so you’d be forgiven for hoping that the series’ return from its midseason break might at last offer up some answers about the big Season 4 mystery surrounding the Dark Matter Anomaly and the mysterious Species 10-C that created it. Or at least move Season 4’s overall plot along a bit more deliberately than this particular hour manages to do. Instead, “All In” is a decent enough hour in a world where we haven’t had a new episode in several weeks, but one that’s more focused on Michael and Book’s relationship and whether it will be able to survive his decision to embrace Ruon Tarka’s plan to destroy the DMA.

The last-minute revelation that Species 10-C is some sort of highly evolved race with dangerously advanced technological capabilities that are using the DMA to dredge for the tiny energy-rich elements needed to power its vast Faraday cage-style space dome is intriguing. (And lends credence to the Federation’s fear that attempting to destroy the DMA–and thereby damage Species 10-C’s power supply—could lead to a war no one wants or is ready to fight.) But it all feels a bit tacked on to the rest of the hour, and one can only hope we’ll get to delve into all this in a more substantive way during next week’s installment.

In all honesty, a lot of “All In” feels as though it’s made of moments that might have been better served in different episodes: Hugh Culber’s stress breakdown that gets just a single scene to play out, Michael’s sudden interest in Owo’s mental state even though the two have barely spoken this season, and the revelation of the lieutenant commander’s apparent cage-fighting skills, just for a start. Yet, the hour’s criminal underworld setting and vague heist vibes are a welcome reprieve from the recent run of Discovery episodes that have largely tended to involve people arguing around a Federation Headquarters’ desk as much as I have enjoyed the depth of their conversations. (Speaking of which, are we just…never seeing Tilly again? Even though Discovery is constantly at what is essentially home base?)

As most of us likely expected, the bulk of this episode focuses on the aftermath of Book’s decision to steal the prototype spore drive and head off to try and destroy the DMA on his own. Michael, naturally, is torn between her love for her partner and her duty to the Federation—whose inclination to diplomacy she fully supports in this particular instance. Though both Admiral Vance and President Rillak tell her to keep out of it, as she is obviously too close to what’s happening to be involved, Vance counts on her not listening to those instructions and manages to give her a secondary mission—tracking down some star charts from a less advanced species whose homeworld is basically across the galactic barrier from where Species 10-C exists—that allows her to go after Book anyway. 

I keep thinking that at some point, I’m going to get over how annoying it is that Michael has bent and broken the rules so often that it’s now not only expected of her but behavior that is actively encouraged, and yet. Here I am! To be fair, Season 4 has been remarkably good at delineating the differences between Michael Burnham, Starfleet officer, and Michael Burnham, Starfleet captain, and what her change in station has meant for the way she’s had to approach problems and manage competing loyalties.

But so much of this episode feels like backsliding on that front, and suddenly we’re right back to Michael doing things she shouldn’t and getting rewarded for them in the end. Even her admission at the end of the hour that she’ll have to do whatever’s necessary to stop Book and Tarka from destroying the DMA rings more hollow than it would have before this episode when it at least felt like Discovery was trying to attach real stakes and weight and even some level of accountability to Michael’s decisions as captain. (At last!)

But now — Does anyone really think she’d kill Book to avert a war rather than come up with an insane scheme to stop him that eventually, Rillak will somehow have to dub brilliant? Don’t get me wrong, I don’t want anything to happen to Book (David Ajala remains my Season 4 MVP), but please let there at least be some real consequences attached to whatever Michael’s about to do to save him. Even if those consequences come from Book in some way.

One of the best elements about this episode, though, is the way it calls back, both in setting and tone, to Michael and Book’s first adventures together when she came to the 32nd century. It’s easy to forget that the two of them spent a year together as couriers when Michael thought she’d lost her Starfleet life forever, working with shady types and transporting illicit materials all over the galaxy. The ease with which they slip back into that partnership, from their awkwardly fake banter to the clear hand singles they use with one another, is a strangely welcome and necessary reminder of how good they really are together, something that has been (rightfully I think) overshadowed by Book’s grief over Kwaijon. 

Which, of course, comes at the precise moment where their future together seems more clouded and uncertain than ever. As is typical of the mature way Discovery has always written the romance between these two, the ultimate fate of Book and Burnham’s relationship has little  to do with how much they love one another. Instead, it’s about whether they’ll be able to successfully navigate what appear to be rather significant differences in moral philosophy. Book is willing to risk it all to prevent other races from feeling the loss he now struggles with every day and embraces an end justify the means attitude toward protecting those lives. 

But Burnham, a child of the Federation through and through, believes that diplomacy, leading with hope, and showing our best face to those that are different from us must be the first path that humanity always tries, even with those who may not deserve that particular bit of grace. How they will find a common ground in the middle of that, when Book’s personal experiences with the DMA and Species 10-C make him especially certain of the rightness of his perspective, is anyone’s guess. Despite Michael’’s clear willingness to fight for Book, here’s certainly more than one moment in “All In” where it seems as though the pair are gambling on more than a card game, but the very future of their relationship and it doesn’t feel like either of them really win. (But props to Burnham for understanding that no matter how much she might have wanted to believe she could change his mind, she still needed to plant that tracking device anyway.)"""

In [4]:
# do not add special tokens now - LATER, manually
tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors="pt")

Token indices sequence length is longer than the specified maximum sequence length for this model (1490 > 512). Running this sequence through the model will result in indexing errors


In [5]:
len(tokens["input_ids"][0])

1490

In [6]:
chunk_size = 512

# 510-token chunks: two positions per chunk are reserved for the special tokens added later

input_id_chunks = list(tokens["input_ids"][0].split(chunk_size - 2))
mask_chunks = list(tokens["attention_mask"][0].split(chunk_size - 2))

In [7]:
chunks_total = len(input_id_chunks)

In [8]:
# ceil(1490 / 510) = ceil(2.92) = 3 chunks (510, not 512, because of the two special tokens)
chunks_total

3
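
The chunk count follows from the 510-token payload per chunk rather than from 512. As a small worked check using only the numbers reported above (the variable names here are illustrative, not part of the notebook):

```python
import math

total_tokens = 1490          # token count reported above
chunk_size = 512             # model maximum, including the two special tokens
payload = chunk_size - 2     # text tokens that fit in one chunk

n_chunks = math.ceil(total_tokens / payload)
last_payload = total_tokens - (n_chunks - 1) * payload   # text tokens in the final chunk
pad_len = chunk_size - (last_payload + 2)                # padding needed for the final chunk

print(n_chunks, last_payload, pad_len)   # 3 470 40
```

Only the final chunk needs padding; the first two are exactly 512 long once the special tokens are added.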

In [9]:
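# wrap each chunk in special tokens and pad it to chunk_size

for i in range(chunks_total):
    # start- and end-of-sequence tokens
    # NOTE: 101, 102 and the padding ID 0 are BERT's [CLS], [SEP] and [PAD] IDs. This checkpoint is
    # RoBERTa, whose IDs are exposed as tokenizer.cls_token_id, tokenizer.sep_token_id and
    # tokenizer.pad_token_id. The outputs recorded below were produced with the hard-coded values.
    input_id_chunks[i] = torch.cat([torch.tensor([101]), input_id_chunks[i], torch.tensor([102])])

    # attention mask: the two special tokens must be attended to
    mask_chunks[i] = torch.cat([torch.tensor([1]), mask_chunks[i], torch.tensor([1])])

    # padding length: difference between 512 and the current chunk length
    pad_len = chunk_size - input_id_chunks[i].shape[0]

    if pad_len > 0:
        # the chunk is shorter than 512: append pad_len zeros to both the IDs and the mask
        # (integer zeros keep the dtype consistent across chunks; torch.Tensor([...]) would give floats)
        input_id_chunks[i] = torch.cat([input_id_chunks[i], torch.zeros(pad_len, dtype=torch.long)])
        mask_chunks[i] = torch.cat([mask_chunks[i], torch.zeros(pad_len, dtype=torch.long)])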
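
The same wrap-and-pad logic can be sketched on plain Python lists, with the special-token IDs passed in as parameters instead of hard-coded. This is a minimal illustration, not the notebook's implementation: `chunk_with_specials` is a hypothetical helper, and the IDs 0, 2 and 1 in the toy call are assumed RoBERTa-style values. In real code they would be read from `tokenizer.cls_token_id`, `tokenizer.sep_token_id` and `tokenizer.pad_token_id`.

```python
def chunk_with_specials(ids, cls_id, sep_id, pad_id, chunk_size):
    """Split ids into chunks of chunk_size, each wrapped in cls/sep and right-padded."""
    payload = chunk_size - 2  # room left for text tokens in each chunk
    input_ids, attention_mask = [], []
    for start in range(0, len(ids), payload):
        body = [cls_id] + ids[start:start + payload] + [sep_id]
        pad_len = chunk_size - len(body)
        input_ids.append(body + [pad_id] * pad_len)
        # real tokens (including the special ones) get 1, padding gets 0
        attention_mask.append([1] * len(body) + [0] * pad_len)
    return input_ids, attention_mask


# toy input: 7 token IDs, chunk_size 5, assumed special IDs cls=0, sep=2, pad=1
ids, mask = chunk_with_specials([10, 11, 12, 13, 14, 15, 16],
                                cls_id=0, sep_id=2, pad_id=1, chunk_size=5)
print(ids)   # [[0, 10, 11, 12, 2], [0, 13, 14, 15, 2], [0, 16, 2, 1, 1]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Passing the IDs in keeps the helper independent of any one vocabulary, so the same code works whether the checkpoint is BERT or RoBERTa.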

In [10]:
# check if all chunks are the same length
for chunk in input_id_chunks:
    print(len(chunk))

512
512
512


In [12]:
# check the last chunk: it starts with 101, and 102 follows the final text token, then zero padding up to 512
chunk.long()

tensor([  101,     8,  6328,     6,     7,   988,     8,  5972,    17,    27,
           29,    78, 18848,   561,    77,    79,   376,     7,     5,  2107,
         1187,  3220,     4,    85,    17,    27,    29,  1365,     7,  4309,
           14,     5,    80,     9,   106,  1240,    10,    76,   561,    25,
        15763, 21724,    77,   988,   802,    79,    17,    27,   417,   685,
           69, 47877,   301,  6000,     6,   447,    19, 31665,  3505,     8,
        21778, 16108,  3183,    70,    81,     5, 22703,     4,    20,  5136,
           19,    61,    51,  9215,   124,    88,    14,  3088,     6,    31,
           49, 34556,  4486, 32723,     7,     5,   699,   865,  7695,    51,
          304,    19,    65,   277,     6,    16,    10, 34374,  2814,     8,
         2139,  8306,     9,   141,   205,    51,   269,    32,   561,     6,
          402,    14,    34,    57,    36,  4070,  6459,    38,   206,    43,
        22140,    30,  5972,    17,    27,    29, 12903,    81, 

In [13]:
# stack everything together
input_ids = torch.stack(input_id_chunks).long()
attention_mask = torch.stack(mask_chunks).int()

In [14]:
input_ids

tensor([[ 101,  243,   17,  ..., 9123,    9,  102],
        [ 101, 5972,   17,  ...,   11, 2749,  102],
        [ 101,    8, 6328,  ...,    0,    0,    0]])

In [15]:
attention_mask

tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32)

In [16]:
# reformat
input_dict = {"input_ids": input_ids,
              "attention_mask": attention_mask}

In [17]:
input_dict

{'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32),
 'input_ids': tensor([[ 101,  243,   17,  ..., 9123,    9,  102],
         [ 101, 5972,   17,  ...,   11, 2749,  102],
         [ 101,    8, 6328,  ...,    0,    0,    0]])}

In [18]:
# forward pass: RoBERTa scores the three chunks as one batch
outputs = model(**input_dict)

In [19]:
outputs

SequenceClassifierOutput([('logits', tensor([[-2.7947,  2.7715],
                                   [ 2.2441, -2.0543],
                                   [-3.3942,  3.4687]], grad_fn=<AddmmBackward0>))])

In [20]:
probabilities = torch.nn.functional.softmax(outputs[0], dim=-1)

In [21]:
probabilities

tensor([[0.0038, 0.9962],
        [0.9866, 0.0134],
        [0.0010, 0.9990]], grad_fn=<SoftmaxBackward0>)

In [22]:
avg_probs = probabilities.mean(dim=0)

In [23]:
sentiment = torch.argmax(avg_probs).item()

In [24]:
# 0 - negative
# 1 - positive

# the reviewer rates the episode 3.6/5, so a positive label is plausible
sentiment

1
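
To make the aggregation step concrete, the softmax, mean and argmax can be reproduced by hand from the logits printed above. This is a stdlib-only sketch for illustration; the notebook itself uses `torch.nn.functional.softmax`, and the helper `softmax` below is a stand-in written for this example.

```python
import math

# logits copied from the model output shown above (one row per chunk)
logits = [[-2.7947, 2.7715],
          [2.2441, -2.0543],
          [-3.3942, 3.4687]]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]                      # per-chunk class probabilities
avg_probs = [sum(col) / len(probs) for col in zip(*probs)]    # mean over chunks, per class
label = max(range(len(avg_probs)), key=avg_probs.__getitem__)  # argmax: 0 negative, 1 positive

print([round(p, 2) for p in avg_probs], label)
```

The averaged probabilities come out at roughly 0.33 negative and 0.67 positive. The middle chunk is scored strongly negative, but the two positive chunks outweigh it. Note that a plain mean gives every chunk equal weight, including the shorter, padded final one.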

## FUNCTIONS AND NEW DATA

In [32]:
def long_text_tensors(text):
    """Split text into 512-token chunks and return stacked input_ids / attention_mask tensors."""
    tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors="pt")
    chunk_size = 512

    # 510 text tokens per chunk, leaving room for the two special tokens
    input_id_chunks = list(tokens["input_ids"][0].split(chunk_size - 2))
    mask_chunks = list(tokens["attention_mask"][0].split(chunk_size - 2))

    chunks_total = len(input_id_chunks)

    for i in range(chunks_total):
        # NOTE: 101 / 102 / 0 are BERT's special-token IDs; RoBERTa's are available as
        # tokenizer.cls_token_id / tokenizer.sep_token_id / tokenizer.pad_token_id
        input_id_chunks[i] = torch.cat([torch.tensor([101]), input_id_chunks[i], torch.tensor([102])])
        mask_chunks[i] = torch.cat([torch.tensor([1]), mask_chunks[i], torch.tensor([1])])
        pad_len = chunk_size - input_id_chunks[i].shape[0]

        if pad_len > 0:
            # integer zeros keep every chunk the same dtype before stacking
            input_id_chunks[i] = torch.cat([input_id_chunks[i], torch.zeros(pad_len, dtype=torch.long)])
            mask_chunks[i] = torch.cat([mask_chunks[i], torch.zeros(pad_len, dtype=torch.long)])

    input_ids = torch.stack(input_id_chunks)
    attention_mask = torch.stack(mask_chunks)

    input_dict = {"input_ids": input_ids.long(),
                  "attention_mask": attention_mask.int()}

    return input_dict


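def predict_sentiment(input_dict):
    """Average the per-chunk class probabilities and return the overall label."""
    # inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**input_dict)
    probabilities = torch.nn.functional.softmax(outputs[0], dim=-1)
    avg_probs = probabilities.mean(dim=0)
    sentiment = torch.argmax(avg_probs).item()

    # 0 - negative, 1 - positive
    if sentiment == 0:
        return "NEGATIVE"

    elif sentiment == 1:
        return "POSITIVE"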

In [33]:
# SOURCE: https://www.imdb.com/review/rw3814011/?ref_=tt_urv
txt = ("""Where, o where is the Zero rating that IMDB needs to add? This would certainly rate it. This absolute trash of a show takes fifty years of Star Trek (I've been watching since 1966) and flushes it down the toilet. There wasn't one episode worth watching in the entire first season. Not one. Let's see what's bad about this show.

1 - It seems that the show runners/writers know nothing about Star Trek. If they did, this wouldn't have been created. They also lied, claiming that this takes place in the real or original time line. No way.

2 - Here's the biggest question: If this is the prime time line, where has Spock's 'adopted human' sister been for FIFTY YEARS? In a coma? An alternate reality? Was he hiding her from Kirk? This is one of the absolute dumbest ideas ever. So because she was raised on Vulcan, does that make her able to be a Vulcan?

3 - Designs. The Discovery is one of the ugliest ships ever. It should be a Klingon ship. And since when to Federation ships have golden hulls? Which pinhead designed those horrible uniforms that look like they'd be rejected at Studio 54! As for the Klingon, Glenn Hetrick should be ashamed. Where the hell did they come from? They look like the only warring they would do is under a disco ball on a dance floor.

4 - Continuity. Let's just toss that to the winds. So the Federation knew about the Mirror universe at least ten years before Kirk discovered it? Why wasn't this in the databanks? Oh, we just wanted things to be cool. Yeah. Ok.

5 - the cast. Sorry but the guy playing the doctor looks WAY too young for the role. Perhaps he he were an alien, that might be different. Doug Jones is great, but the character he portrays is one of the sillier aliens. Oh, he can sense death can he? So can I - just put someone in a red shirt. Oh that's right - everyone's in those ridiculous disco uniforms.

6 - I know Bryan Fuller thinks it's cool (what an insipid concept) to give women Male names. Cool isn't a good enough reason. How about she's named after a relative who died in the service? See how simple?

This show is nothing but an obscene money grab for CBS. The sad part is, that those deluded fans who say "All Trek is Good Trek" may keep this alive for a while. I won't be one of them. I like quality writing, something this show is missing.""")

In [34]:
input_dict = long_text_tensors(txt)

In [35]:
predict_sentiment(input_dict)

'NEGATIVE'