In [21]:
from transformers import pipeline

In [22]:
import pandas as pd

In [23]:
classifier = pipeline("sentiment-analysis", model = "sarahai/movie-sentiment-analysis")

Device set to use cpu


In [24]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
df = pd.read_csv(r'..\data\IMDB Dataset.csv')

In [25]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [26]:
df[df['sentiment'] == 'negative']

Unnamed: 0,review,sentiment
3,Basically there's a family where a little boy ...,negative
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
10,Phil the Alien is one of those quirky films wh...,negative
11,I saw this movie when I was about 12 when it c...,negative
...,...,...
49994,This is your typical junk comedy.<br /><br />T...,negative
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [27]:
classifier([df['review'][7],
            df['review'][8],
            df['review'][10],
            df['review'][11],])

[{'label': 'LABEL_0', 'score': 0.9784377813339233},
 {'label': 'LABEL_0', 'score': 0.9851534366607666},
 {'label': 'LABEL_1', 'score': 0.8562856912612915},
 {'label': 'LABEL_0', 'score': 0.9255669713020325}]

Three of the four negative reviews above come back as LABEL_0, so LABEL_0 seems to be "negative" and LABEL_1 "positive". A quick check on two unambiguous sentences:

In [28]:
classifier(['I really enjoyed this movie',
            'This movie was terrible',
            ])

[{'label': 'LABEL_1', 'score': 0.9531933069229126},
 {'label': 'LABEL_0', 'score': 0.9388781785964966}]
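The mapping suggested by this probe can be written down explicitly, so that predictions are translated back to the dataset's vocabulary in one place. The sketch below is only an illustration: the names `LABEL_TO_SENTIMENT` and `to_sentiment` are made up here, and the mapping itself is an assumption inferred from the probe sentences, not something taken from the model's documentation.

```python
# Mapping inferred from the two probe sentences above (an assumption,
# not confirmed by the model's documentation).
LABEL_TO_SENTIMENT = {"LABEL_0": "negative", "LABEL_1": "positive"}


def to_sentiment(predictions):
    """Translate pipeline output dicts into dataset-style sentiment strings."""
    return [LABEL_TO_SENTIMENT[p["label"]] for p in predictions]


# Output copied from the probe cell above.
probe_output = [
    {"label": "LABEL_1", "score": 0.9531933069229126},
    {"label": "LABEL_0", "score": 0.9388781785964966},
]
print(to_sentiment(probe_output))
```

When a model's author sets readable label names, they can be read from `classifier.model.config.id2label`. Here the labels are generic, so the probe remains the main evidence.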

- Relabel the dataset's sentiments to match the model's labels;
- Perform a train/test split;
- Apply the model to the test data and compute accuracy.
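The last step of the plan above reduces to comparing predicted labels with true labels. A minimal sketch of that computation follows; the helper name `accuracy` is illustrative. The example data comes from the four negative reviews classified earlier (rows 7, 8, 10 and 11), where the model returned LABEL_0 for three of them.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Rows 7, 8, 10 and 11 are all negative, i.e. LABEL_0 under the assumed mapping.
y_true = ["LABEL_0", "LABEL_0", "LABEL_0", "LABEL_0"]
# Labels the classifier returned for those rows in the earlier cell.
y_pred = ["LABEL_0", "LABEL_0", "LABEL_1", "LABEL_0"]

print(accuracy(y_true, y_pred))  # 0.75
```

`sklearn.metrics.accuracy_score(y_true, y_pred)` computes the same quantity and may be more convenient once scikit-learn is imported.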

In [29]:
df_lbld = df.copy()

In [30]:
# Single-step .loc assignment avoids pandas' chained-assignment warning
df_lbld.loc[df_lbld['sentiment'] == 'negative', 'sentiment'] = 'LABEL_0'


In [31]:
# Single-step .loc assignment avoids pandas' chained-assignment warning
df_lbld.loc[df_lbld['sentiment'] == 'positive', 'sentiment'] = 'LABEL_1'


In [32]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [33]:
df_lbld.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,LABEL_1
1,A wonderful little production. <br /><br />The...,LABEL_1
2,I thought this was a wonderful way to spend ti...,LABEL_1
3,Basically there's a family where a little boy ...,LABEL_0
4,"Petter Mattei's ""Love in the Time of Money"" is...",LABEL_1
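The relabeling from cells 30 and 31 can also be written as a single `Series.map` call, which needs no boolean indexing at all. The sketch below uses a two-row toy DataFrame (`toy`, made up for illustration) in place of the IMDB data. The `label_map` dictionary encodes the assumed LABEL_0 = negative, LABEL_1 = positive correspondence.

```python
import pandas as pd

# Toy stand-in for the IMDB DataFrame (illustrative data only).
toy = pd.DataFrame({
    "review": ["great film", "awful film"],
    "sentiment": ["positive", "negative"],
})

# Assumed correspondence between dataset labels and model labels.
label_map = {"negative": "LABEL_0", "positive": "LABEL_1"}

relabeled = toy.copy()
# map() replaces every value via the dictionary; unmapped values would become NaN.
relabeled["sentiment"] = relabeled["sentiment"].map(label_map)

print(relabeled)
```

Because `map` yields NaN for any value missing from the dictionary, a quick `relabeled["sentiment"].isna().sum()` would reveal unexpected labels in the source column.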


In [35]:
from sklearn.model_selection import train_test_split

In [36]:
X = df_lbld['review']
y = df_lbld['sentiment']

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

In [38]:
X_train.head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

In [39]:
X_test.head()

25000    This movie was bad from the start. The only pu...
25001    God, I never felt so insulted in my whole life...
25002    Not being a fan of the Coen Brothers or George...
25003    The movie Andaz Apna Apna in my books is the t...
25004    I have to say I was really looking forward on ...
Name: review, dtype: object

In [40]:
y_train.value_counts()

sentiment
LABEL_0    12526
LABEL_1    12474
Name: count, dtype: int64

In [41]:
y_test.value_counts()

sentiment
LABEL_1    12526
LABEL_0    12474
Name: count, dtype: int64
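With `shuffle=False` the split simply cuts the data in half, so the near-balance seen in the counts above depends on the row order of the file. If exact class balance matters, `train_test_split` also accepts a `stratify` argument; it requires shuffling, which is the default. The sketch below shows this on toy data; the variable names and `random_state=42` are arbitrary choices made for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: eight reviews, four per class (illustrative only).
X_toy = [f"review {i}" for i in range(8)]
y_toy = ["LABEL_0"] * 4 + ["LABEL_1"] * 4

# stratify=y_toy keeps the class proportions equal in both halves;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=42
)

print(Counter(y_tr), Counter(y_te))
```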

In [43]:
print(X_test.to_list()[:5])

["This movie was bad from the start. The only purpose of the movie was that Angela wanted to get a high body count. The acting was horrible. The killings were acted out very badly. Like when Ally got stuffed down that toilet I guess it was in the abandoned cabin. But when the end of the movie comes and Molly and the other guy are in the cabin you see Ally so Angela must have gone in to get her. The part that really got me was when the black girl and Angela were in the cabin and Angela took the guitar string and chocked her. One it was horrible acting and two why wouldn't you just turn around and punch the bitch?!?!? Then when Molly is getting chased by Angela if you have the neigh why not just turn around and stab her??? So stupid. This movie sucked...", "God, I never felt so insulted in my whole life than with this crap. There are so many ways to describe this piece of crap, that I think that if I said everything that came to mind, I would get banned by this site.<br /><br />How do I 

In [49]:
print(len(X_test.to_list()))

25000


In [47]:
result = classifier(X_test.to_list()[:2])

In [48]:
result

[{'label': 'LABEL_0', 'score': 0.9848387241363525},
 {'label': 'LABEL_0', 'score': 0.9800547361373901}]

In [50]:
results = classifier(X_test.tolist()[:])

RuntimeError: The size of tensor a (749) must match the size of tensor b (512) at non-singleton dimension 1
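This error most likely means that one review was tokenized into 749 tokens while the model accepts at most 512. One common workaround is to ask the pipeline's tokenizer to truncate long inputs by passing `truncation=True` and `max_length=512` in the call. The sketch below wraps that in a small batching helper. The name `classify_truncated` and the batch size are illustrative assumptions, and truncation discards the tail of long reviews, which may affect the measured accuracy.

```python
def classify_truncated(clf, texts, batch_size=32, max_length=512):
    """Run a text-classification callable over texts in batches,
    asking the tokenizer to cut inputs longer than max_length tokens."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # truncation/max_length are forwarded to the tokenizer by the pipeline.
        results.extend(clf(batch, truncation=True, max_length=max_length))
    return results


# Intended usage in this notebook (not executed here):
# results = classify_truncated(classifier, X_test.tolist())
# y_pred = [r["label"] for r in results]
```

Processing in batches also makes it easy to test on a small slice first, since classifying all 25,000 reviews on CPU may take a long time.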