<h3>Hugging Face Pipelines</h3>
<ul>
    <li>It already knows how to tokenize the input text</li>
    <li>It already comes with a default pretrained model</li>
    <li>For classification, it adds a classification head and returns labels and scores</li>
</ul>
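To make the "labels and scores" step concrete, here is a minimal sketch of the kind of post-processing a classification head's raw outputs (logits) typically go through. It assumes a single-label, two-class model where a softmax turns logits into probabilities; the logit values and the helper names (`softmax`, `to_label_and_score`) are made up for illustration and are not the library's internals.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label_and_score(logits, id2label):
    """Pick the most probable class and report it as a label/score pair."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}


# Illustrative values only: these logits were not produced by a real model.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}
example_logits = [-4.2, 4.8]

result = to_label_and_score(example_logits, ID2LABEL)
print(result)
```

The returned dictionary has the same shape as the entries the pipeline prints in the cells below.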

In [2]:
import warnings

warnings.filterwarnings("ignore")

In [3]:
from transformers import pipeline

# Create the pipeline object; later cells refer to it as `classifier`
classifier = pipeline("text-classification")
print(classifier("I really enjoyed the movie, it was fantastic"))

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9998760223388672}]


In [4]:
sentences = [
    "The Movie Was Fantastic, I really enjoyed it",
    "There was alot of corruption in the city, making the lives of people harder",
    "Despite the hardships the invention prevailed",
    "Nobody was happy with the idea but his execuation was successful and his invention cured the disease of millions"
]

# Passing a list returns one prediction dictionary per sentence, in the same order
answers = classifier(sentences)
for answer in answers:
    print(answer)

{'label': 'POSITIVE', 'score': 0.9998775720596313}
{'label': 'NEGATIVE', 'score': 0.9980985522270203}
{'label': 'POSITIVE', 'score': 0.9951199293136597}
{'label': 'POSITIVE', 'score': 0.9974551796913147}
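The warning printed when the pipeline was created says that relying on the default model is not recommended in production. One way to address it is to pass the model name and revision explicitly. The sketch below copies both values from that warning; `build_pinned_classifier` is a hypothetical helper name, and calling it downloads the model if it is not already cached.

```python
# Model name and revision copied from the warning printed by the default pipeline.
MODEL_NAME = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MODEL_REVISION = "714eb0f"


def build_pinned_classifier():
    """Hypothetical helper: build the sentiment pipeline with an explicit model and revision."""
    # Imported inside the function so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline(
        "text-classification",
        model=MODEL_NAME,
        revision=MODEL_REVISION,
    )
```

Calling `build_pinned_classifier()` should give a pipeline that behaves like the default one above, without depending on whatever the library's default happens to be.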


<h1>
    Movie Sentiment Analysis Using a Hugging Face Dataset
</h1>

<h5>Loading the IMDB Dataset</h5>

In [11]:
"""
The IMDB dataset is available through the Hugging Face datasets library. It has 50,000 labeled
movie reviews, split evenly into train/test, labeled as 0 = negative, 1 = positive.
An additional "unsupervised" split holds 50,000 unlabeled reviews.
"""

from datasets import load_dataset

imdb = load_dataset("imdb")  # Returns a DatasetDict with one Dataset per split
print(imdb)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [15]:
print(imdb["train"][0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [66]:
# Approach 1: Run inference on texts taken from a pandas DataFrame

df_train = imdb["train"].to_pandas()

sample_train_texts = df_train["text"][:5].to_list()
predictions_list = classifier(sample_train_texts)

for text, prediction_dict in zip(sample_train_texts, predictions_list):
    print(f"Movie Review: {text[:80]}...")
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Sentiment Prediction: {prediction_dict['label']}, Score: {prediction_dict['score']}")
    print("==============================")

Movie Review: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy ...
Sentiment Prediction: POSITIVE, Score: 0.7872827053070068
Movie Review: "I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't ma...
Sentiment Prediction: NEGATIVE, Score: 0.9991909861564636
Movie Review: If only to avoid making this type of film in the future. This film is interestin...
Sentiment Prediction: NEGATIVE, Score: 0.998217761516571
Movie Review: This film was probably inspired by Godard's Masculin, féminin and I urge you to ...
Sentiment Prediction: POSITIVE, Score: 0.8144614696502686
Movie Review: Oh, brother...after hearing about this ridiculous film for umpteen years all I c...
Sentiment Prediction: NEGATIVE, Score: 0.9993877410888672


In [75]:
# Approach 2: Directly on Hugging Face Dataset

sample_dataset = imdb["train"].select(range(5))  # Dataset equivalent of df.head()

sample_texts = list(sample_dataset["text"])
prediction_list2 = classifier(sample_texts)

for text, prediction_dict in zip(sample_texts, prediction_list2):
    print(f"Movie Review: {text[:80]}...")
    print(f"Sentiment Prediction: {prediction_dict['label']}, Score: {prediction_dict['score']}")
    print("=======================")

Movie Review: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy ...
Sentiment Prediction: POSITIVE, Score: 0.7872827053070068
Movie Review: "I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't ma...
Sentiment Prediction: NEGATIVE, Score: 0.9991909861564636
Movie Review: If only to avoid making this type of film in the future. This film is interestin...
Sentiment Prediction: NEGATIVE, Score: 0.998217761516571
Movie Review: This film was probably inspired by Godard's Masculin, féminin and I urge you to ...
Sentiment Prediction: POSITIVE, Score: 0.8144614696502686
Movie Review: Oh, brother...after hearing about this ridiculous film for umpteen years all I c...
Sentiment Prediction: NEGATIVE, Score: 0.9993877410888672


<h4>Preparing the test set</h4>

In [82]:
df_test = imdb["test"].to_pandas()
sample_text = df_test["text"][:10].to_list()
test_predictions = classifier(sample_text)

print(test_predictions[0])

{'label': 'NEGATIVE', 'score': 0.999616265296936}


In [84]:
df_test["label"].head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

In [89]:
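# Only the first 10 samples are used here: running on more raised a RuntimeError,
# most likely because some reviews exceed the model's 512-token limit.
true_labels = df_test["label"][:10].to_list()
# Map the pipeline's string labels to the dataset's integer labels (1 = positive, 0 = negative)
predicted_labels = [1 if prediction["label"] == "POSITIVE" else 0 for prediction in test_predictions]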
predicted_labels

[0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

In [90]:
for true, pred in zip(true_labels, predicted_labels):
    print(f"True: {true}, Predicted: {pred}")

True: 0, Predicted: 0
True: 0, Predicted: 0
True: 0, Predicted: 0
True: 0, Predicted: 0
True: 0, Predicted: 1
True: 0, Predicted: 0
True: 0, Predicted: 0
True: 0, Predicted: 0
True: 0, Predicted: 0
True: 0, Predicted: 0


In [92]:
from sklearn.metrics import accuracy_score

print(f"Accuracy: {accuracy_score(true_labels, predicted_labels)*100}")

Accuracy: 90.0
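For reference, accuracy here is simply the fraction of predictions that match the true labels. The sketch below recomputes it in plain Python from the two lists printed above; the `accuracy` function is an illustrative stand-in, not scikit-learn's implementation.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both lists must have the same length")
    if not true_labels:
        raise ValueError("At least one label is required")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Values copied from the cell outputs above.
true_labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
predicted_labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

print(f"Accuracy: {accuracy(true_labels, predicted_labels) * 100}")
```

Nine of the ten predictions match, so one misclassified review moves the score by ten percentage points. A sample this small gives only a rough estimate.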


In [100]:
batch_size = 10
n_samples = 500
true_labels = df_test["label"][:n_samples].to_list()
predicted_labels = []

# Classify the reviews in batches so only `batch_size` texts are processed at a time
for start in range(0, n_samples, batch_size):
    batch_text = df_test["text"][start:start + batch_size].to_list()
    test_predictions = classifier(batch_text,
                                  truncation=True,  # Drop any tokens beyond max_length
                                  max_length=512,   # Maximum number of tokens the model accepts
                                  padding=True)     # Pad shorter texts to the longest in the batch
    labels = [1 if prediction["label"] == "POSITIVE" else 0 for prediction in test_predictions]
    predicted_labels.extend(labels)


print(f"Accuracy: {accuracy_score(true_labels, predicted_labels)}")

Accuracy: 0.906
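The batching loop above relies on slicing with a fixed step. The same index arithmetic can be isolated in a small generator, sketched below with placeholder strings instead of real reviews; `iter_batches` is a hypothetical helper name. It also shows that the last batch is simply shorter when the number of items is not a multiple of the batch size.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder texts standing in for movie reviews.
reviews = [f"review {i}" for i in range(25)]

batch_lengths = [len(batch) for batch in iter_batches(reviews, 10)]
print(batch_lengths)  # [10, 10, 5]
```

With a helper like this, the evaluation loop could iterate over `iter_batches(texts, batch_size)` and pass each batch to the classifier without hard-coding the sample count in `range`.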
