# Introduction to Natural Language Processing 01
## Lab 03
Nelson VICEL--FARAH\
Karen KASPAR\
Romain BRAND

In [1]:
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")

imdb_dataset

Reusing dataset imdb (/home/miolith/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)



DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

### 1. Converting the dataset into a format usable by fastText

In [2]:
import re
import pandas as pd
import numpy as np
from string import punctuation


def preparing(input_df):
    """Return a copy of the DataFrame with string labels and normalized text."""
    df = input_df.copy()
    # Map the numeric labels 0 and 1 to 'negative' and 'positive'
    df['label'] = df['label'].replace([0, 1], ['negative', 'positive'])

    # Preprocessing: lowercase the text and replace punctuation with spaces
    df['text'] = df['text'].str.lower()
    # re.escape keeps characters such as ']', '\' and '^' literal inside the character class
    df['text'] = df['text'].str.replace('[{}]'.format(re.escape(punctuation)), ' ', regex=True)
    return df

def save_to_fasttext(df, filename):
    # Write one example per line in fastText's supervised format: "__label__<label> <text>"
    np.savetxt(filename, '__label__'+df['label']+' '+df['text'], fmt='%s')
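
To make the target file format concrete, here is a minimal sketch that applies the same lowercasing and punctuation replacement to a single string and builds one fastText-style line. The helper name `to_fasttext_line` and the sample review are hypothetical and only serve the illustration; the notebook itself works on whole DataFrames.

```python
import re
from string import punctuation

# Character class matching any ASCII punctuation character (escaped so that
# characters like ']' or '\' are treated literally).
_PUNCT_RE = re.compile("[{}]".format(re.escape(punctuation)))


def to_fasttext_line(label, text):
    """Hypothetical helper: build one '__label__<label> <text>' line."""
    cleaned = _PUNCT_RE.sub(" ", text.lower())
    return "__label__{} {}".format(label, cleaned)


# Made-up sample review, not taken from the dataset.
line = to_fasttext_line("positive", "Great movie!")
print(line)
```

Each punctuation character becomes a space rather than being deleted, which is why the real output contains runs of spaces and why `<br />` tags survive as the token `br`.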

In [3]:
# Shuffle the data to avoid ordering bias
train_df = pd.DataFrame(imdb_dataset['train']).sample(frac=1).reset_index()
test_df = pd.DataFrame(imdb_dataset['test']).sample(frac=1).reset_index()


save_to_fasttext(preparing(train_df), "train_df.txt")
save_to_fasttext(preparing(test_df), "test_df.txt")

In [4]:
!head -n 2 'train_df.txt'

__label__negative first of all  it s very dilettantish to try describe way of history only from positions of guns  germs and steel  the same tried to do marxists from economical positions  br    br   the reason of western success can t be just dumb luck  the advantages of domesticated plants and animals  we see  that all around the world any advantages and bonuses are complete useless if they aren t wisely managed  in the japan there isn t huge natural resources  but japan is one of the top world economies  the same situation in singapore  but in nigeria  country with rich oil resources  there are only middle low success  both of this nations had and still have access to western technology and inventions  but why such gap   br    br   in the end of movie daimond declared  that it s very important to understand factors of guns  germs and steel  to understand  maybe the main factor of world s difference is not geography  but people ability to understand and use things  the mental ability

### 2. Basic training and evaluation

In [5]:
import fasttext

In [22]:
default_model = fasttext.train_supervised('train_df.txt')

Read 6M words
Number of words:  75910
Number of labels: 2
Progress: 100.0% words/sec/thread: 3078785 lr:  0.000000 avg.loss:  0.400692 ETA:   0h 0m 0s


In [23]:
default_model.test("test_df.txt")

(25000, 0.8776, 0.8776)
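
The tuple returned by `test` holds the number of examples, the precision at 1 and the recall at 1. Since every review has exactly one true label and the model predicts exactly one label, both values coincide with the accuracy. The sketch below unpacks such a tuple; the helper name `describe_test_result` is an assumption made for illustration, and the input is simply the result shown above.

```python
def describe_test_result(result):
    """Hypothetical helper: name the fields of a (N, P@1, R@1) tuple."""
    n_examples, precision_at_1, recall_at_1 = result
    return {
        "examples": n_examples,
        "precision@1": precision_at_1,
        "recall@1": recall_at_1,
        # With one true label and one prediction per example,
        # precision@1 equals the accuracy, so this is the error count.
        "misclassified": round(n_examples * (1 - precision_at_1)),
    }


summary = describe_test_result((25000, 0.8776, 0.8776))
for key, value in summary.items():
    print(key, value)
```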

### 3. Hyperparameter search

In [8]:
from sklearn.model_selection import train_test_split

# Hold out part of the training set (25% by default) as a validation set for autotuning
train, val = train_test_split(train_df, shuffle=True)

save_to_fasttext(preparing(train), "cooking.train")
save_to_fasttext(preparing(val), "cooking.valid")

In [9]:
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')

Progress: 100.0% Trials:    9 Best score:  0.895680 ETA:   0h 0m 0s
Training again with best arguments
Read 4M words
Number of words:  67573
Number of labels: 2
Progress: 100.0% words/sec/thread: 1735495 lr:  0.000000 avg.loss:  0.047160 ETA:   0h 0m 0s


In [10]:
model.test("test_df.txt")

(25000, 0.89392, 0.89392)

### 4. Attribute differences

In [78]:
print("attribute | Default  |  Autotuned")
print("lr        |", default_model.lr, "     |", model.lr)
print("epoch     |", default_model.epoch, "       |", model.epoch)
print("dim       |", default_model.dim, "     |", model.dim)
print("wordNgrams|", default_model.wordNgrams, "       |", model.wordNgrams)
print("bucket    |", default_model.bucket, "       |", model.bucket)

attribute | Default  |  Autotuned
lr        | 0.1      | 0.08499425639667486
epoch     | 5        | 100
dim       | 100      | 92
wordNgrams| 1        | 2
bucket    | 0        | 4110692
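
The same comparison can be written as a loop over attribute names, which avoids repeating one `print` per row. This is only a sketch: `compare_attributes` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins carrying the values printed above so that the snippet runs without a trained model. Any pair of objects exposing these attributes could be passed instead.

```python
from types import SimpleNamespace


def compare_attributes(left, right, names):
    """Hypothetical helper: collect (name, left value, right value) rows."""
    return [(name, getattr(left, name), getattr(right, name)) for name in names]


# Stand-ins for the two models, filled with the values shown in the table above.
default_like = SimpleNamespace(lr=0.1, epoch=5, dim=100, wordNgrams=1, bucket=0)
tuned_like = SimpleNamespace(lr=0.08499425639667486, epoch=100, dim=92,
                             wordNgrams=2, bucket=4110692)

rows = compare_attributes(default_like, tuned_like,
                          ["lr", "epoch", "dim", "wordNgrams", "bucket"])
print("{:<10} | {:<8} | {}".format("attribute", "Default", "Autotuned"))
for name, default_value, tuned_value in rows:
    print("{:<10} | {:<8} | {}".format(name, default_value, tuned_value))
```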


### 5. Misclassified examples

In [11]:
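# Convert the predicted label to an integer: 1 for "positive", 0 for "negative".
# Note: predict() is called on the raw text here, whereas the model was trained on text
# processed by preparing(); applying the same preprocessing first would be more consistent.
test_df["prediction"] = test_df["text"].apply(lambda x: int(model.predict(x)[0][0] == "__label__positive"))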
test_df

| | index | text | label | prediction |
|---|---|---|---|---|
| 0 | 16562 | Chi-hwa-seong (Painted Fire) recounts the life... | 1 | 1 |
| 1 | 10265 | Far from combining the best bits of Pontypool ... | 0 | 0 |
| 2 | 14047 | Fantastic movie! One of the best film noir mov... | 1 | 1 |
| 3 | 4389 | This movie was sooooooo sloooow!!! And everyth... | 0 | 0 |
| 4 | 19773 | I don't know about the English version of this... | 1 | 1 |
| ... | ... | ... | ... | ... |
| 24995 | 9814 | I'm going to be generous here and give it a 3 ... | 0 | 0 |
| 24996 | 1616 | OK, what was this story about again? I am afra... | 0 | 0 |
| 24997 | 10201 | So one day I was in the video store looking fo... | 0 | 0 |
| 24998 | 12995 | A very delightful bit of filmwork that should ... | 1 | 1 |


In [20]:
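# Third false positive: predicted positive (1) but labeled negative (0)
failed_sample1 = test_df[(test_df["prediction"] == 1) & (test_df["label"] == 0)].iloc[2]["text"]
# Third false negative: predicted negative (0) but labeled positive (1)
failed_sample2 = test_df[(test_df["prediction"] == 0) & (test_df["label"] == 1)].iloc[2]["text"]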
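
The two filters above separate the errors by type. The same logic can be sketched in plain Python on a toy example; the function name `split_errors` and the five labels and predictions are invented for illustration and are not taken from the dataset.

```python
def split_errors(labels, predictions):
    """Hypothetical helper: return the indices of false positives and false negatives."""
    pairs = list(enumerate(zip(labels, predictions)))
    # False positive: labeled negative (0) but predicted positive (1)
    false_positives = [i for i, (y, p) in pairs if y == 0 and p == 1]
    # False negative: labeled positive (1) but predicted negative (0)
    false_negatives = [i for i, (y, p) in pairs if y == 1 and p == 0]
    return false_positives, false_negatives


# Toy data, made up for this sketch.
labels = [1, 0, 1, 0, 1]
predictions = [1, 1, 0, 0, 1]
fp, fn = split_errors(labels, predictions)
print("false positives at", fp)
print("false negatives at", fn)
```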

### This example is a false positive. The dataset labels this review as negative even though it praises the film (it ends with an 8/10 rating), so the inconsistency appears to come from the dataset rather than from the model.

In [21]:
failed_sample1

'Yes, it might be not historically accurate(actually only 6 soldiers of 9th rota were killed there), and yes, it has some mistakes and exaggeration(bended machine gun? come on! or the that "history lesson" about how Afghanistan was never conquered by anyone - educated Russian officer would know history much better than that - take for example British campaign in Afghanistan). And yes, it does not have multi-million dollars Hollywood-style special effects, but it\'s strongest point in showing soldier\'s life there, their relationships and their feelings when the best friends are being killed in front of their eyes. In my opinion 9ya rota really does a good job showing all those things.<br /><br />Again, movie has it weaknesses, but, in my opinion, it appears to be one of the strongest Russian movie for the past few years.<br /><br />8/10'

### This example is a false negative. Although the review is positive ("Great movie"), most of the text is devoted to pointing out the film's flaws.

In [17]:
failed_sample2

'Roy Anderssons "Du Levande" is not totally original as it is counter piece to Anderssons previous movie "Sånger från andra våningen". Still the movie has aura of total originality. Some conventions of movie making are still thrown away: most of the actors look nothing like what you would expect in movie and the shots take long time. Most of the time camera doesn\'t move but people move around it. The shots start from somewhere and many times the scenery builds up in amazing proportions. W.G. Sebald comes to mind in literature with same technique. Because of the time invested in every shot the suspension is really high in many of the scenes. There is a story and isn\'t - it is left for viewer to build up in his or her own mind. This movie is positive. It is determined not to see this all in negative way and at the same time will not pass the social injustices. One of the messages I got from it was that maybe all failures and accidents are not fatal after all. Great movie.'

### Bonus

### minn and maxn are set to 0 because English words are rarely compounds of other words, so there is no need to split them into subword tokens.
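
To illustrate what `minn` and `maxn` control, the sketch below lists the character n-grams of a word after wrapping it in the boundary markers `<` and `>`. It is a simplified illustration, not fastText's actual implementation (which also hashes the n-grams into buckets); the function name `char_ngrams` is an assumption.

```python
def char_ngrams(word, minn, maxn):
    """Simplified sketch: character n-grams of lengths minn..maxn."""
    # With minn or maxn at 0, no subword units are produced at all.
    if minn <= 0 or maxn <= 0:
        return []
    wrapped = "<" + word + ">"  # boundary markers around the word
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
print(char_ngrams("where", 0, 0))
```

With both parameters at 0 the list is empty, so each word is represented only by its own vector (plus word n-grams when `wordNgrams` is greater than 1).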