PART I - The Dataset

In [1]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("imdb")

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [2]:
print(ds_builder.info.description)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.


In [3]:
ds_builder.info.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

In [4]:
from datasets import load_dataset

In [5]:
from datasets import get_dataset_split_names

In [6]:
get_dataset_split_names("imdb")

['train', 'test', 'unsupervised']

1 - How many splits does the dataset have?

In [57]:
print(len(get_dataset_split_names("imdb")))

3


In [58]:
dataset = load_dataset("imdb")

Found cached dataset imdb (C:/Users/leand/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


  0%|          | 0/3 [00:00<?, ?it/s]

2 - How big are these splits?

In [59]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

3 - What is the proportion of each class in the supervised splits?

In [10]:
sum(dataset["train"]["label"]) + sum(dataset["test"]["label"]) #count of supervised documents with positive reviews

25000

In [11]:
len(dataset["train"]) + len(dataset["test"]) #count of supervised documents

50000
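
The two counts above answer the question: 25,000 of the 50,000 supervised documents are positive, so across the two supervised splits combined each class makes up 50%. A minimal sketch of the same calculation as a reusable function follows; `class_proportions` is a hypothetical helper (not part of `datasets`), and the toy label list stands in for `dataset["train"]["label"]`.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of documents carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Toy labels standing in for the real label column (0 = neg, 1 = pos)
toy_labels = [0, 1, 1, 0]
proportions = class_proportions(toy_labels)
print(proportions)
```

On the real data the function could be called once per split to check that each split is balanced on its own, which the combined count does not show.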

PART II - Naive Bayes Classifier

In [60]:
import string
dataset = load_dataset("imdb")

Found cached dataset imdb (C:/Users/leand/.cache/huggingface/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)


  0%|          | 0/3 [00:00<?, ?it/s]

In [62]:
display(dataset["train"][0]["text"])

'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far between, ev

In [63]:
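def preprocess(data):
    # Replace every punctuation character with a space, then lowercase the text.
    # The variable is named `text` to avoid shadowing the built-in `str`.
    text = data["text"]
    for c in string.punctuation:
        text = text.replace(c, ' ')
    data["text"] = text.lower()
    return data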
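
To see what this preprocessing does to a review, the sketch below applies the same logic to a made-up sample string (the sample is an assumption, not taken from the dataset). It shows two side effects: apostrophes split contractions into separate tokens, and the HTML line breaks `<br />` present in the raw reviews survive as `br` tokens.

```python
import string

def preprocess(data):
    # Same logic as the notebook cell: punctuation -> space, then lowercase
    text = data["text"]
    for c in string.punctuation:
        text = text.replace(c, ' ')
    data["text"] = text.lower()
    return data

# Made-up sample mimicking the raw IMDB formatting
sample = {"text": "I don't fault the actors.<br /><br />Great film!"}
tokens = preprocess(sample)["text"].split()
print(tokens)
```

If the `br` tokens turn out to matter, one option would be to strip the `<br />` markup before removing punctuation; the notebook does not do this.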

In [65]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)

Loading cached processed dataset at C:\Users\leand\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-1e31ce8330321103.arrow
Loading cached processed dataset at C:\Users\leand\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-5743b37a20b41bc8.arrow


Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (529 > 512). Running this sequence through the model will result in indexing errors


In [66]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 50000
    })
})

In [67]:
updated_dataset = dataset.map(preprocess)

Loading cached processed dataset at C:\Users\leand\.cache\huggingface\datasets\imdb\plain_text\1.0.0\d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0\cache-0f3893649e3d5ad8.arrow


Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [71]:
def build_vocabulary_string(V, text, category, C):
    # Add the words of one document to V, counting occurrences per class
    for word in text.split():
        if word not in V:
            V[word] = {}
            for c in C:
                V[word][c] = 0
        V[word][category] += 1
    return V

def build_vocabulary(dataset, C):
    V = {}
    for document in dataset:
        V = build_vocabulary_string(V, document["text"], document["label"], C)
    return V


In [98]:
import math
def train_naive_bayes(D, C):  # C: set of class labels (0 = negative, 1 = positive)
    logprior = {}
    loglikelihood = {}
    Ndoc = len(D)
    # Per-class word counts; a separate "bigdoc" per class is not needed
    Vocabulary = build_vocabulary(D, C)
    for word in Vocabulary:
        loglikelihood[word] = {}
    for c in C:
        # Prior: fraction of training documents labelled c
        Nc = 0
        for document in D:
            if document["label"] == c:
                Nc += 1
        logprior[c] = math.log(Nc / Ndoc)
        # Add-one smoothing denominator: total word count in class c plus |V|
        word_number = 0
        for word in Vocabulary:
            word_number += Vocabulary[word][c] + 1
        for word in Vocabulary:
            loglikelihood[word][c] = math.log((Vocabulary[word][c] + 1) / word_number)
    return logprior, loglikelihood, Vocabulary
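
The add-one smoothing in `train_naive_bayes` can be checked on a toy corpus. The sketch below is a compact re-implementation of the likelihood step only, under the assumption that documents are plain `(text, label)` pairs; the corpus and variable names are illustrative. It verifies that, for each class, the smoothed word probabilities sum to 1.

```python
import math
from collections import Counter

# Toy corpus: (text, label) pairs, 0 = negative, 1 = positive
docs = [("good great film", 1), ("bad boring film", 0), ("great fun", 1)]
classes = {0, 1}

# Count word occurrences per class
counts = {c: Counter() for c in classes}
for text, label in docs:
    counts[label].update(text.split())
vocab = set().union(*counts.values())

loglikelihood = {}
for c in classes:
    # Denominator: total tokens in class c plus |V| (one pseudo-count per word)
    denom = sum(counts[c].values()) + len(vocab)
    loglikelihood[c] = {w: math.log((counts[c][w] + 1) / denom) for w in vocab}

# Each class's smoothed distribution should sum to 1
totals = {c: sum(math.exp(v) for v in loglikelihood[c].values()) for c in classes}
print(totals)
```

A word never seen in a class (for example "great" in class 0) still receives a non-zero probability, which is what keeps the logarithm defined.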

In [99]:
(r1, r2, r3) = train_naive_bayes(updated_dataset["train"], {0, 1})

In [91]:
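def test_naive_bayes(testdoc, logprior, loglikelihood, C, V):
    # Named `scores` to avoid shadowing the built-in `sum`
    scores = {}
    for c in C:
        scores[c] = logprior[c]
        for word in testdoc.split():
            # Words never seen in training are ignored
            if word in V:
                scores[c] += loglikelihood[word][c]
    # Return the class with the highest log-posterior score
    return max(scores, key=scores.get)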

In [101]:
good_answers = 0
for document in updated_dataset["test"]:
    if test_naive_bayes(document["text"], r1, r2, {0, 1}, r3) == document["label"] :
        good_answers += 1

In [103]:
print(good_answers/len(updated_dataset["test"]))

0.80976


In [83]:
import sklearn
from sklearn.pipeline import Pipeline

In [84]:
from sklearn.base import BaseEstimator, TransformerMixin

In [85]:
class PreProcessing(BaseEstimator, TransformerMixin):
    """Pipeline step that applies preprocess() and returns the cleaned texts."""
    def fit(self, X, y=None):
        # Nothing to learn; y is accepted only for scikit-learn API compatibility
        return self
    def transform(self, X):
        # X is a datasets.Dataset: map preprocess over it and return the "text" column
        X_transformed = X.map(preprocess)
        return X_transformed["text"]

In [86]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
pipe = Pipeline([("PreProcessing", PreProcessing()), ("CountVectorization", CountVectorizer()), ("Naive Bayes Classifier", MultinomialNB())])

In [87]:
pipe.fit(dataset["train"], dataset["train"]["label"])

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

In [88]:
pipe.score(dataset["test"], dataset["test"]["label"])

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

0.8136

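The difference between the two results (0.80976 for the hand-written classifier, 0.8136 for the scikit-learn pipeline) is small. One possible explanation is that MultinomialNB works directly with floats rather than integers. A more likely contributor is tokenization: CountVectorizer's default token pattern keeps only tokens of two or more word characters, whereas the hand-written classifier uses str.split() and keeps every whitespace-separated token, including single letters such as "i" and "t".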
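
One factor that may contribute to the gap between the two scores is tokenization. The sketch below compares `str.split()`, used by the hand-written classifier, with the regular expression that `CountVectorizer` uses as its default `token_pattern`. The sample sentence is taken from the preprocessed review printed further down; whether this difference accounts for the whole gap is an assumption that has not been verified here.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer:
# tokens of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "i don t fault the actors in this film at all"
split_tokens = text.split()                 # hand-written classifier
regex_tokens = TOKEN_PATTERN.findall(text)  # CountVectorizer default

# Tokens the hand-written classifier sees but CountVectorizer drops
dropped = [t for t in split_tokens if t not in regex_tokens]
print(dropped)
```

The two classifiers are therefore trained on slightly different vocabularies, even though both use add-one smoothing.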

Accuracy is an adequate evaluation metric here: it gives the proportion of correct and incorrect predictions, the test set is large (25,000 documents), and the classes are balanced (25,000 positive reviews out of the 50,000 supervised documents counted in Part I).

In [106]:
found = 0
for document in updated_dataset["test"]:
    if found < 2:
        if test_naive_bayes(document["text"], r1, r2, {0, 1}, r3) != document["label"] :
            found += 1
            print(document["text"])
            print(document["label"])
            print("\n\n\n")

blind date  columbia pictures  1934   was a decent film  but i have a few issues with this film  first of all  i don t fault the actors in this film at all  but more or less  i have a problem with the script  also  i understand that this film was made in the 1930 s and people were looking to escape reality  but the script made ann sothern s character look weak  she kept going back and forth between suitors and i felt as though she should have stayed with paul kelly s character in the end  he truly did care about her and her family and would have done anything for her and he did by giving her up in the end to fickle neil hamilton who in my opinion was only out for a good time  paul kelly s character  although a workaholic was a man of integrity and truly loved kitty  ann sothern  as opposed to neil hamilton  while he did like her a lot  i didn t see the depth of love that he had for her character  the production values were great  but the script could have used a little work 
0




ben 

In the first document, the author indicates that the film was not great but wants to qualify that judgment; as a result, a large part of the text lists positive aspects of the film, which skewed the prediction.

In the second document, the same problem appears: some of the words used ("excellent", "best", "interesting", "enjoyable") belong much more to the positive vocabulary than to the negative one, even though they may be used with negation or qualification.
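
Both errors come from the same limitation: a bag-of-words model counts words and ignores their order, so negation and nuance are lost. A minimal illustration with two made-up sentences of opposite meaning that produce exactly the same word counts:

```python
from collections import Counter

# Two made-up sentences with opposite sentiment
a = "the film was good not bad"
b = "the film was bad not good"

# A bag-of-words representation only keeps the word counts
same_bag = Counter(a.split()) == Counter(b.split())
print(same_bag)
```

Since the classifier only sees these counts, it has to assign both sentences the same class.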