In [1]:
from transformers import pipeline
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification



In [38]:
import pandas as pd

In [39]:
model = TFAutoModelForSequenceClassification.from_pretrained("sarahai/movie-sentiment-analysis")

# Truncation and padding take effect when the tokenizer is called, not when it is loaded,
# so they are not passed to from_pretrained.
tokenizer = AutoTokenizer.from_pretrained("sarahai/movie-sentiment-analysis")

All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.


In [40]:

#classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

In [41]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

Device set to use 0


In [42]:
df = pd.read_csv('../data/IMDB Dataset.csv')

In [43]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [44]:
df[df['sentiment'] == 'negative']

Unnamed: 0,review,sentiment
3,Basically there's a family where a little boy ...,negative
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
10,Phil the Alien is one of those quirky films wh...,negative
11,I saw this movie when I was about 12 when it c...,negative
...,...,...
49994,This is your typical junk comedy.<br /><br />T...,negative
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [45]:
classifier([df['review'][7],
            df['review'][8],
            df['review'][10],
            df['review'][11],])

[{'label': 'LABEL_0', 'score': 0.9784377813339233},
 {'label': 'LABEL_0', 'score': 0.9851534366607666},
 {'label': 'LABEL_1', 'score': 0.8562858700752258},
 {'label': 'LABEL_0', 'score': 0.9255669713020325}]

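Three of the four negative reviews above are classified as LABEL_0, so LABEL_0 appears to be "negative" and LABEL_1 "positive". Two unambiguous sentences support this: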

In [46]:
classifier(['I really enjoyed this movie',
            'This movie was terrible',
            ])

[{'label': 'LABEL_1', 'score': 0.9531933069229126},
 {'label': 'LABEL_0', 'score': 0.9388782978057861}]
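
The mapping inferred above can be written down once and reused to make predictions readable. This is a minimal sketch, assuming LABEL_0 is negative and LABEL_1 is positive as the two checks suggest; LABEL_TO_SENTIMENT and readable are illustrative names, not part of the transformers API.

```python
# Assumed mapping, inferred from the sample predictions
# (not read from the model's configuration).
LABEL_TO_SENTIMENT = {"LABEL_0": "negative", "LABEL_1": "positive"}


def readable(predictions):
    """Return the sentiment name for each pipeline-style prediction dict."""
    return [LABEL_TO_SENTIMENT[p["label"]] for p in predictions]


# Prediction dicts in the shape returned by the classifier.
sample = [
    {"label": "LABEL_1", "score": 0.9531933069229126},
    {"label": "LABEL_0", "score": 0.9388782978057861},
]
print(readable(sample))  # ['positive', 'negative']
```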

- Map the dataset's sentiment labels to the model's labels (negative to LABEL_0, positive to LABEL_1);
- Perform a train/test split;
- Apply the model to the test data and compute accuracy.
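
The last step could look roughly like the sketch below. It assumes predictions arrive as the list of dicts the classifier returns and that the true labels already use the LABEL_0 / LABEL_1 names; the accuracy helper and the toy data are illustrative, not model output.

```python
def accuracy(predictions, true_labels):
    """Fraction of predictions whose 'label' equals the true label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p["label"] == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


# Toy data, made up for illustration only.
preds = [
    {"label": "LABEL_0", "score": 0.98},
    {"label": "LABEL_1", "score": 0.86},
    {"label": "LABEL_0", "score": 0.93},
    {"label": "LABEL_1", "score": 0.95},
]
truth = ["LABEL_0", "LABEL_0", "LABEL_0", "LABEL_1"]
print(accuracy(preds, truth))  # 0.75
```

With the notebook's objects, the true labels would come from something like y_test.to_list(), sliced to match whichever part of X_test was classified.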

In [47]:
df_lbld = df.copy()

In [48]:
# Use .loc to assign in a single step and avoid pandas' chained-assignment warning.
df_lbld.loc[df_lbld['sentiment'] == 'negative', 'sentiment'] = 'LABEL_0'


In [49]:
# Use .loc to assign in a single step and avoid pandas' chained-assignment warning.
df_lbld.loc[df_lbld['sentiment'] == 'positive', 'sentiment'] = 'LABEL_1'


In [50]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [51]:
# Replace double quotes with a spaced single quote (literal match, no regex).
df_lbld['review'] = df_lbld['review'].str.replace('"', " ' ", regex=False)

In [52]:
#tokenized = tokenizer(df_lbld['review'].tolist(), truncation=True, padding=True, return_tensors='tf')

In [53]:
#tokenized

In [54]:
# df_lbld['reviews'] = tokenized['input_ids'].numpy().tolist()
# df_lbld['attention_mask'] = tokenized['attention_mask'].numpy().tolist()

In [55]:
df_lbld.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,LABEL_1
1,A wonderful little production. <br /><br />The...,LABEL_1
2,I thought this was a wonderful way to spend ti...,LABEL_1
3,Basically there's a family where a little boy ...,LABEL_0
4,Petter Mattei's ' Love in the Time of Money '...,LABEL_1


In [56]:
from sklearn.model_selection import train_test_split

In [57]:
X = df_lbld['review']
y = df_lbld['sentiment']

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

In [59]:
X_train.head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's  ' Love in the Time of Money '...
Name: review, dtype: object

In [60]:
X_test.head()

25000    This movie was bad from the start. The only pu...
25001    God, I never felt so insulted in my whole life...
25002    Not being a fan of the Coen Brothers or George...
25003    The movie Andaz Apna Apna in my books is the t...
25004    I have to say I was really looking forward on ...
Name: review, dtype: object

In [61]:
y_train.value_counts()

sentiment
LABEL_0    12526
LABEL_1    12474
Name: count, dtype: int64

In [62]:
y_test.value_counts()

sentiment
LABEL_1    12526
LABEL_0    12474
Name: count, dtype: int64

In [63]:
print(X_test.to_list()[3:5])

["The movie Andaz Apna Apna in my books is the top 5 intelligent comedy movies ever made in Bollywood perhaps even Hollywood. <br /><br />When the movie released i was a 8 year old and I heard it was a flop but I never understood till now why was it a flop...but let me tell you one thing...this movie would have more money by selling home Cassettes and DVDs and by showing in TV movie channels than any hit movie in theaters. This movie has been shown countless times in Movie channels and I think even now the public love and the TV producers keep repeating the movie again and again. I personally have watched the entire movie more than 80 -100 times and I still love it.....<br /><br />The performance by both Aamir khan as Amar and Salman khan as Prem is mind blowing but i especially liked the performance of Aamir khan as a street smart guy....his dialogs in the movie are Hilarious... the story is simple and heres how it goes.....<br /><br />Amar and Prem are poor , lazy chaps and come from

In [64]:
print(len(X_test.to_list()))

25000


In [65]:
result = classifier(X_test.to_list()[:3])

In [66]:
result

[{'label': 'LABEL_0', 'score': 0.9848387241363525},
 {'label': 'LABEL_0', 'score': 0.9800547361373901},
 {'label': 'LABEL_1', 'score': 0.9811004400253296}]

In [67]:
strange_entry = X_test[25003]

In [68]:
normal_entry = X_test[25004]

In [69]:
print(strange_entry)

The movie Andaz Apna Apna in my books is the top 5 intelligent comedy movies ever made in Bollywood perhaps even Hollywood. <br /><br />When the movie released i was a 8 year old and I heard it was a flop but I never understood till now why was it a flop...but let me tell you one thing...this movie would have more money by selling home Cassettes and DVDs and by showing in TV movie channels than any hit movie in theaters. This movie has been shown countless times in Movie channels and I think even now the public love and the TV producers keep repeating the movie again and again. I personally have watched the entire movie more than 80 -100 times and I still love it.....<br /><br />The performance by both Aamir khan as Amar and Salman khan as Prem is mind blowing but i especially liked the performance of Aamir khan as a street smart guy....his dialogs in the movie are Hilarious... the story is simple and heres how it goes.....<br /><br />Amar and Prem are poor , lazy chaps and come from a

In [70]:
print(normal_entry)

I have to say I was really looking forward on watching this film and finding some new life in it that would separate it from most dull and overly crafted mexican films. I have no idea why but I trusted Sexo, Pudor y Lagrimas to be the one to inject freshness and confidence to our non-existent industry. Maybe it was because the soundtrack(which I listened to before I saw the film) sounded different from others, maybe it was because it dared to include newer faces(apart from Demian Bichir who is always a favorite of mexican film directors) and supposedly dealed within it's script with modern social behaviour, maybe because it's photography I saw in the trailers was bright and realistic instead of theatrical. The film turned out to be a major crowd pleaser, and a major letdown. What Serrano actually deals here with is the very old fashioned  ' battle of the sexes '  as in  ' all men are the same '  and  ' why is it that all women...; '  blah,blah,blah. Nothing new in it, not even that, it

In [71]:
classifier(normal_entry)

[{'label': 'LABEL_1', 'score': 0.9450821876525879}]

In [72]:
classifier('''inputs=
          I have to say I was really looking forward on watching this film and finding some new life in it that would separate it from most dull and overly crafted mexican films. I have no idea why but I trusted Sexo, Pudor y Lagrimas to be the one to inject freshness and confidence to our non-existent industry. Maybe it was because the soundtrack(which I listened to before I saw the film) sounded different from others, maybe it was because it dared to include newer faces(apart from Demian Bichir who is always a favorite of mexican film directors) and supposedly dealed within it's script with modern social behaviour, maybe because it's photography I saw in the trailers was bright and realistic instead of theatrical. The film turned out to be a major crowd pleaser, and a major letdown. What Serrano actually deals here with is the very old fashioned "battle of the sexes" as in "all men are the same" and "why is it that all women...;" blah,blah,blah. Nothing new in it, not even that, it uses so much common ground and clichè that it eventually mocks itself without leaving any valuable reflexion on the female/male condition. Full of usual tramps on the audience like safe gags about the clichès I talked about before(those always work, always) and screaming performances(it is a well acted film in it's context)..and by screaming I mean, literally. The at first more compelling characters played by Monica Dionne and Demian Bichir turn out to be according to Serrano the more pathetic ones. I completely disagree with Serrano, they shouldn't have been treated that way only to serve as marionettes for his lesson to come through...he made sure we got HIS message and completely destroyed their roles that were the only solid ground in which this story could have stood. Anyway, it is after all, a very entertaining film at times and you will probably have a good time seeing it (if you accept to be manipulated by it) 
           ''')

[{'label': 'LABEL_1', 'score': 0.959822952747345}]

In [73]:
classifier(strange_entry, truncate=True)

Token indices sequence length is longer than the specified maximum sequence length for this model (749 > 512). Running this sequence through the model will result in indexing errors


InvalidArgumentError: Exception encountered when calling layer 'embeddings' (type TFEmbeddings).

{{function_node __wrapped__ResourceGather_device_/job:localhost/replica:0/task:0/device:CPU:0}} indices[0,705] = 705 is not in [0, 512) [Op:ResourceGather] name: 

Call arguments received by layer 'embeddings' (type TFEmbeddings):
  • input_ids=tf.Tensor(shape=(1, 749), dtype=int32)
  • position_ids=None
  • inputs_embeds=None
  • training=False
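
The error arises because this review tokenizes to 749 tokens while the model only has 512 positions, and the tokenizer's argument is named truncation rather than truncate, so nothing was cut. One possible fix, sketched below, is to pass truncation=True and max_length=512 in the pipeline call, which the text-classification pipeline forwards to the tokenizer. The helper name classify_truncated is hypothetical, and the stub classifier exists only so the snippet runs without downloading the model.

```python
def classify_truncated(classifier, texts, max_length=512):
    """Call a text-classification pipeline, cutting inputs to max_length tokens.

    `classifier` is assumed to be a transformers pipeline (or any callable
    that accepts the same keyword arguments).
    """
    return classifier(texts, truncation=True, max_length=max_length)


# Stand-in for the real pipeline so the sketch runs offline:
# it only records the keyword arguments it receives.
def stub_classifier(texts, **tokenizer_kwargs):
    return {"n_texts": len(texts), "kwargs": tokenizer_kwargs}


print(classify_truncated(stub_classifier, ["a very long review ..."]))
```

With the notebook's objects this would be classify_truncated(classifier, strange_entry). Truncation discards everything past the limit, so a verdict stated only at the end of a long review will not reach the model.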

In [None]:
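# Truncate to the model's 512-token limit so long reviews do not raise an indexing error.
results = classifier(X_test.to_list()[4:9], truncation=True, max_length=512)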