
# Second approach to changing the appreciations

## 1 - Installing the packages

In [None]:
!pip3 install torch torchvision torchtext



In [None]:
!python3 -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
!pip install msgpack==0.5.6

Collecting msgpack==0.5.6
  Downloading https://files.pythonhosted.org/packages/22/4e/dcf124fd97e5f5611123d6ad9f40ffd6eb979d1efdc1049e28a795672fcd/msgpack-0.5.6-cp36-cp36m-manylinux1_x86_64.whl (315kB)
     |████████████████████████████████| 317kB 7.5MB/s eta 0:00:01
Installing collected packages: msgpack
  Found existing installation: msgpack 1.0.2
    Uninstalling msgpack-1.0.2:
      Successfully uninstalled msgpack-1.0.2
Successfully installed msgpack-0.5.6


## 2 - CNN for text processing

We use a convolutional neural network (CNN) to analyse the sentiment of the words that make up the appreciations.

Convolutional neural networks are generally used for image processing, but in this project we use them to process text.

The idea is that two adjacent pixels in an image are related in much the same way as two adjacent words in a sentence. Just as a CNN looks for patterns in images, in our project it looks for n-grams (using 1xn filters, that is, filters spanning n consecutive words).

The main idea behind this is that the presence of certain 2-grams, 3-grams, and so on plays a role in determining the final sentiment.
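
To make the notion of n-gram concrete, here is a minimal pure-Python sketch that lists the 2-grams and 3-grams of a tokenized sentence. The helper name `extract_ngrams` and the example sentence are illustrative assumptions and are not part of the notebook; the CNN never builds these lists explicitly, its filters simply look at the same windows of consecutive words.

```python
def extract_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    # A sentence of length L contains L - n + 1 windows of n consecutive words
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, already tokenized
tokens = ["this", "movie", "is", "not", "good"]
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A window such as `("not", "good")` carries a sentiment that neither word carries alone, which is the kind of pattern a filter of size 2 can pick up.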

## 3 - Data preparation

In [None]:
import torch
from torchtext import data
from torchtext import datasets
import random

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()

train, test = datasets.IMDB.splits(TEXT, LABEL)

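# Seed Python's RNG, then pass its state explicitly (random.seed() itself returns None)
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())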


In [None]:
len(train[0].text)

158

## 4 - Loading the word-embedding model

In [None]:
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)

.vector_cache/glove.6B.zip: 862MB [06:27, 2.23MB/s]                          
100%|█████████▉| 398045/400000 [00:15<00:00, 26373.16it/s]

In [None]:
test_w=TEXT.vocab.itos[9205]
test_w2=TEXT.vocab.itos[9206]

print(test_w)
print(test_w2)

degrees
dental


In [None]:
from torch.nn.functional import cosine_similarity
test_v=TEXT.vocab.vectors[9205].unsqueeze(0)
test_v2=TEXT.vocab.vectors[9206]

cosine_similarity(test_v,TEXT.vocab.vectors,dim=1).sort()

torch.return_types.sort(values=tensor([-0.3674, -0.3650, -0.3144,  ...,  0.6289,  0.7273,  1.0000]), indices=tensor([21527, 18999, 14249,  ..., 13863,  2524,  9205]))

In [1]:
print(TEXT.vocab.itos[538])

In [None]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False,
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu'))




## 5 - The model

Images have two dimensions whereas text has only one. However, if each word is converted into a vector using NLP techniques (such as word embeddings), we obtain a two-dimensional representation.


Below is the two-dimensional representation of a sentence obtained with word embeddings. Our words are shown in green. There are 4 words and 5 embedding dimensions, which creates an image of size **[4x5]**.

![](https://i.imgur.com/ci1h9hv.png)

A filter covering two words is then drawn in yellow. The output of this filter is a real number.

We can therefore consider a filter of size **[n x emb_dim]**, which covers n consecutive words.

![](https://i.imgur.com/QlXduXu.png)

The filter has to slide over the whole image.

![](https://i.imgur.com/wuA330x.png)

![](https://i.imgur.com/gi1GaEz.png)
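
The sliding of an **[n x emb_dim]** filter over the sentence can be sketched in pure Python as follows. This is only an illustration under assumed toy values: the function name `slide_filter`, the sentence matrix and the filter weights are made up, and the notebook itself relies on `nn.Conv2d` for this step.

```python
def slide_filter(sentence_matrix, filter_weights, bias=0.0):
    """Slide an [n x emb_dim] filter down a [len x emb_dim] sentence matrix.

    Returns one real number per window of n consecutive words.
    """
    n = len(filter_weights)
    outputs = []
    for start in range(len(sentence_matrix) - n + 1):
        window = sentence_matrix[start:start + n]
        total = bias
        # Element-wise product between the window and the filter, then sum
        for word_vec, filter_row in zip(window, filter_weights):
            total += sum(x * w for x, w in zip(word_vec, filter_row))
        outputs.append(total)
    return outputs

# Toy sentence: 4 words, embedding dimension 5 (values are made up)
sentence = [
    [1, 0, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 3, 0, 0],
    [0, 0, 0, 4, 0],
]
# A [2 x 5] filter covering two consecutive words (weights are made up)
bigram_filter = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
feature_map = slide_filter(sentence, bigram_filter)
print(feature_map)
```

With 4 words and a filter covering 2 words, the filter fits in 4 - 2 + 1 = 3 positions, so the resulting feature map holds 3 values.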

In our model we also use filters of different sizes (3, 4, 5, 100, ...) so as to take into account different n-grams (2-grams, 3-grams, ...) when determining the sentiment of the sentence.

The next step of the model is to apply pooling layers after the convolution layers. Below is an example in which we take the maximum value, 0.9, from the output of the last convolution layer.

![](https://i.imgur.com/gzkS3ze.png)

The idea here is that the maximum value is the one that identifies the most important n-gram.
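
Max pooling over the sentence can be illustrated with a few lines of pure Python. The feature-map values and filter names below are hypothetical; in the notebook this step is performed by `F.max_pool1d` applied over the full length of each feature map.

```python
def max_over_time(feature_map):
    """Keep only the strongest response of one filter over the whole sentence."""
    return max(feature_map)

# Hypothetical feature maps: one list per filter, one value per n-gram window
feature_maps = {
    "filter_3gram": [0.1, 0.9, 0.3],
    "filter_4gram": [0.2, 0.4],
    "filter_5gram": [0.7],
}
# One number per filter, whatever the sentence length
pooled = [max_over_time(fm) for fm in feature_maps.values()]
# Position of the window that produced the maximum for the first filter
best_window = feature_maps["filter_3gram"].index(max(feature_maps["filter_3gram"]))
print(pooled)
print(best_window)
```

Because each filter contributes exactly one value after pooling, the vector fed to the final linear layer has a fixed size regardless of how many words the sentence contains.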


In [None]:
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu and F.max_pool1d below

class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.conv_0 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[0],embedding_dim))
        self.conv_1 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[1],embedding_dim))
        self.conv_2 = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(filter_sizes[2],embedding_dim))
        self.fc = nn.Linear(len(filter_sizes)*n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):          
        x = x.permute(1, 0)
                       
        embedded = self.embedding(x)                
        
        embedded = embedded.unsqueeze(1)
             
        conved_0 = F.relu(self.conv_0(embedded).squeeze(3))
        conved_1 = F.relu(self.conv_1(embedded).squeeze(3))
        conved_2 = F.relu(self.conv_2(embedded).squeeze(3))         
        
        pooled_0 = F.max_pool1d(conved_0, conved_0.shape[2]).squeeze(2)
        pooled_1 = F.max_pool1d(conved_1, conved_1.shape[2]).squeeze(2)
        pooled_2 = F.max_pool1d(conved_2, conved_2.shape[2]).squeeze(2)
                
        cat = self.dropout(torch.cat((pooled_0, pooled_1, pooled_2), dim=1))

        return self.fc(cat)

For now our CNN model uses only 3 different filter sizes, but the code can be improved with nn.ModuleList, which holds a list of nn.Module objects, so that any number of filter sizes can be used.

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.convs = nn.ModuleList([nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs,embedding_dim)) for fs in filter_sizes])
        self.fc = nn.Linear(len(filter_sizes)*n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):        
        x = x.permute(1, 0)
                        
        embedded = self.embedding(x)
                        
        embedded = embedded.unsqueeze(1)
                
        conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
                    
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
              
        cat = self.dropout(torch.cat(pooled, dim=1))
            
        return self.fc(cat)

## 6 - Creating an instance of our CNN


In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3,4,5]
OUTPUT_DIM = 1
DROPOUT = 0.5

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT)

In [None]:
pretrained_embeddings = TEXT.vocab.vectors

model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.4870, -0.3286,  0.6392,  ..., -0.4184, -0.0256,  0.1911],
        [-0.3896, -0.0554,  0.4922,  ..., -0.0182, -0.8245,  0.0696],
        [ 0.1829,  0.1536, -0.1446,  ..., -0.1389,  0.3579,  0.8286]])

## 7 - Training the model

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

criterion = nn.BCEWithLogitsLoss()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)


Implement the function that computes the accuracy



In [None]:
import torch.nn.functional as F

def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y.float()).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
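
As a sanity check of the logic in `binary_accuracy`, here is a pure-Python sketch of the same computation on a hand-made batch. The names `sigmoid` and `binary_accuracy_py` and the example logits are assumptions made for illustration; the notebook's version operates on tensors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_accuracy_py(logits, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    # A probability strictly above 0.5 is counted as the positive class
    preds = [1 if sigmoid(z) > 0.5 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Hypothetical raw model outputs (logits) and ground-truth labels
logits = [2.0, -1.0, 0.3, -0.2, 1.5]
labels = [1, 0, 0, 0, 1]
acc = binary_accuracy_py(logits, labels)
print(acc)
```

Only the third example is misclassified here (a positive logit with a negative label), so 4 predictions out of 5 are correct.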

Define a function to train our model

In [None]:
def train_model(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label.float())
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define a function to evaluate our model


In [None]:
def evaluate_model(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label.float())
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
N_EPOCHS = 5
for epoch in range(N_EPOCHS):

    train_loss, train_acc = train_model(model, train_iterator, optimizer, criterion)
    # Validation is disabled in this run, so the validation metrics are reported as 0
    #valid_loss, valid_acc = evaluate_model(model, valid_iterator, criterion)
    valid_loss=0
    valid_acc=0

    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc*100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc*100:.2f}%')



Epoch: 01, Train Loss: 0.495, Train Acc: 75.19%, Val. Loss: 0.000, Val. Acc: 0.00%
Epoch: 02, Train Loss: 0.302, Train Acc: 87.24%, Val. Loss: 0.000, Val. Acc: 0.00%
Epoch: 03, Train Loss: 0.216, Train Acc: 91.37%, Val. Loss: 0.000, Val. Acc: 0.00%
Epoch: 04, Train Loss: 0.144, Train Acc: 94.44%, Val. Loss: 0.000, Val. Acc: 0.00%
Epoch: 05, Train Loss: 0.089, Train Acc: 97.01%, Val. Loss: 0.000, Val. Acc: 0.00%


We then evaluate the trained model on the test set.

In [None]:
test_loss, test_acc = evaluate_model(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f}, Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.313, Test Acc: 88.07%


## User Input

In [None]:
# Some auxiliary functions in order to make a color palette.
# Courtesy of https://www.oreilly.com/library/view/python-cookbook/0596001673/ch09s11.html

import math

def floatRgb(mag):
    """ Return a tuple of floats between 0 and 1 for R, G, and B. """
    
    #blue  = min((max((4*(0.75-x), 0.)), 1.))
    #red   = min((max((4*(x-0.25), 0.)), 1.))
    #green = min((max((4*math.fabs(x-0.5)-1., 0.)), 1.))
    red=0
    blue=0
    green =0
    
    # Positive scores are shown in blue, negative scores in red (saturating at |mag| = 3)
    if mag>0:
      blue=min(1,mag/3)
    elif mag<0:
      red=min(1,-mag/3)

    return red, green, blue
  
def rgb(mag):
    """ Return a tuple of integers, as used in AWT/Java plots. """
    red, green, blue = floatRgb(mag)
    return int(red*255), int(green*255), int(blue*255)

def strRgb(mag):
    """ Return a hex string, as used in Tk plots. """
    return "#%02x%02x%02x" % rgb(mag)

In [None]:
import spacy
import numpy as np
from IPython.core.display import display,HTML
from torch.nn.functional import cosine_similarity

nlp = spacy.load('en')
eps = np.finfo(np.float32).eps.item()
def predict_sentiment(sentence, explain_scores=True,explain_relative_to=1):
  
    tokenized_sentence = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed_sentence = [TEXT.vocab.stoi[t] for t in tokenized_sentence]
    tensor = torch.LongTensor(indexed_sentence).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.sigmoid(model(tensor))
    
    original_input_embedding,input_grad=get_input_gradients(indexed_sentence,prediction,explain_relative_to)
    
    explanation=get_prediction_explanation(tokenized_sentence,original_input_embedding,input_grad,explain_scores)
      
    return {'tokenized_sentence': tokenized_sentence,'prediction': prediction.item(),'explanation': explanation }

def get_input_gradients(original_sentence,prediction,in_relation_to):
    gradient_truth=torch.Tensor([in_relation_to]).unsqueeze(0)
    if torch.cuda.is_available():
      gradient_truth=gradient_truth.cuda()
    
    loss=criterion(prediction,gradient_truth)
    optimizer.zero_grad()
    loss.backward()
    
    input_grad=torch.Tensor(len(original_sentence),model.embedding.weight.size(1))
    original_input_embedding=torch.Tensor(len(original_sentence),model.embedding.weight.size(1))
    
    for i in range(0,len(original_sentence)):
      original_input_embedding[i]=model.embedding.weight[original_sentence[i]]
      input_grad[i]=model.embedding.weight.grad[original_sentence[i]]
    
    return original_input_embedding,input_grad
    
    

def get_input_scores(input,input_embedding,input_grad):
  
  
  # Take a SGD step using grads
  
  input_after_step=input_embedding-input_grad
  after_grad_norms = torch.norm(input_after_step, 2, 1)
  before_grad_norms = torch.norm(input_embedding, 2, 1)
  variation = after_grad_norms-before_grad_norms
 
  standard_deviation=torch.std(variation)
  mean=torch.mean(variation)
  z_score=(variation-mean)/standard_deviation
 
  return z_score

def old_get_input_scores(input,input_embedding,input_grad):
  
  grad_norms=torch.norm(input_grad,2,1)
  
  return grad_norms/torch.max(grad_norms)

  
def get_prediction_explanation(input,input_embedding,input_grad, explain_scores):
 
  
  input_word_scores=get_input_scores(input,input_embedding,input_grad)
   
  explanation=""
  for i in range(0,len(input)):
    token=input[i]
    token_color=strRgb(input_word_scores[i])
    if explain_scores:
      str_token="%s (%.3f)"%(token,input_word_scores[i])
    else:
      str_token=token
    
    explanation=explanation+'<font color="'+token_color+'">'+str_token+'&nbsp;</font>'
    if i>0 and i%20==0:
      explanation=explanation+"<br/>"
    
  return {'word_scores': input_word_scores,'input_gradient': input_grad,'textual_explanation':explanation }
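
The scoring done in `get_input_scores` can be reproduced on a toy example in pure Python: each word's embedding is moved one step against its gradient, the change in L2 norm is measured, and the changes are standardized into z-scores. The function names and the two-dimensional vectors below are illustrative assumptions, not values from the notebook.

```python
import math
import statistics

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def word_scores(embeddings, gradients):
    """Z-score of the change in embedding norm after one gradient step."""
    variations = []
    for emb, grad in zip(embeddings, gradients):
        stepped = [e - g for e, g in zip(emb, grad)]
        variations.append(l2_norm(stepped) - l2_norm(emb))
    mean = statistics.mean(variations)
    # Sample standard deviation (n - 1), matching the default of torch.std
    std = statistics.stdev(variations)
    return [(v - mean) / std for v in variations]

# Toy embeddings and gradients for a three-word sentence (values are made up)
embeddings = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
gradients = [[0.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
scores = word_scores(embeddings, gradients)
print(scores)
```

Note that the z-score is only defined when the sentence has at least two words and the norm variations are not all identical; otherwise the standard deviation is undefined or zero.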

  
  

In [None]:
from torch.nn.functional import cosine_similarity

def get_projected_words(word,word_gradient,num_words=1):
  
  word_index=TEXT.vocab.stoi[word]
  word_embedding=TEXT.vocab.vectors[word_index]
  learning_rate=1
  i=0
  result=[]
  
  while i<100000:
    try: 
      word_embedding=word_embedding-learning_rate*word_gradient
    except:
      # We can have a float overflow here if this process gets out of control
      return result
    similarity_value,similarity_index=cosine_similarity(word_embedding.unsqueeze(0),TEXT.vocab.vectors,dim=1).sort(descending=True)
    if similarity_index[0]!=word_index:
      if  similarity_value[0]<0.5:
        break
      
      result.append({'word':TEXT.vocab.itos[similarity_index[0]],'similarity': similarity_value[0]})
      word_index=similarity_index[0]
      learning_rate=1
      if len(result)>=num_words:
        break
      
    i=i+1
    learning_rate=learning_rate*1.1
      
  return result

  
def get_projected_sentence_word(prediction,word):
  
  sentence=prediction['tokenized_sentence']
  word_index_in_sentence=[i for i in range(0,len(sentence)) if sentence[i]==word][0]
  word_gradient=prediction['explanation']['input_gradient'][word_index_in_sentence]
  
  
  return get_projected_words(word,word_gradient,1)
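
The search performed by `get_projected_words` can be sketched on a toy vocabulary in pure Python: the word vector is repeatedly moved against the gradient with a growing learning rate until its nearest neighbour (by cosine similarity) is a different word. This is a simplified sketch that returns a single word; the names `project_word`, `toy_vocab` and `toy_gradient` and all numeric values are assumptions made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_word(vector, vocab):
    """Return (word, similarity) of the vocabulary entry closest to vector."""
    return max(((w, cosine(vector, v)) for w, v in vocab.items()), key=lambda p: p[1])

def project_word(word, gradient, vocab, min_similarity=0.5, max_steps=1000):
    """Walk against the gradient until the nearest vocabulary word changes."""
    vector = list(vocab[word])
    learning_rate = 1.0
    for _ in range(max_steps):
        vector = [x - learning_rate * g for x, g in zip(vector, gradient)]
        candidate, similarity = nearest_word(vector, vocab)
        if candidate != word:
            # Reject replacements that are too far from any known word
            return candidate if similarity >= min_similarity else None
        learning_rate *= 1.1  # take bigger steps while nothing changes
    return None

# Toy 2-D vocabulary and gradient (values are made up)
toy_vocab = {"bad": [1.0, 0.0], "average": [1.0, 1.0], "good": [0.0, 1.0]}
toy_gradient = [0.1, -0.1]
replacement = project_word("bad", toy_gradient, toy_vocab)
print(replacement)
```

In this toy setup the walk leaves the neighbourhood of "bad" and first lands closest to "average", which mirrors how the notebook picks the first vocabulary word met along the gradient direction.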
  
  

In [None]:


prediction=predict_sentiment("This is a ridiculous movie and you should never see it.")
print(prediction['prediction'])
display(HTML(prediction['explanation']['textual_explanation']))
print(get_projected_sentence_word(prediction,'ridiculous'))

0.24004195630550385


[{'word': 'interesting', 'similarity': tensor(0.7075)}]


In [None]:

prediction=predict_sentiment("I affection Miiasaki memorable <PAD> <PAD>")
print(prediction['prediction'])
display(HTML(prediction['explanation']['textual_explanation']))




0.9992479681968689


In [None]:
print(get_projected_sentence_word(prediction,'affection'))

[]


## 8 - Importing the functions from nlp.py to get the correct sentiment of the sentences

---




In [None]:
import re
import string

import nltk
nltk.download('punkt')
import numpy as np


# Compute the vector associated with the text
def text2vec(wv, idf, text):
    text_vector = np.zeros(300)
    weights = 0
    # For every token of the text
    for word in text:
        try:
            # Extract the vector of the word
            vector = wv.get_vector(word)
            norm = np.linalg.norm(vector)
            # Normalize it
            vector = vector / norm
            # Look up the IDF of the word (see TP2)
            weight = idf[word]
            # Weight the vector before adding it to the vector representing the text
            text_vector += weight*vector
            weights += weight
        except KeyError:
            pass
    # Renormalize the vector
    if weights > 0:
        text_vector /= weights
    return text_vector


# Taken from the TP2 solution
def extract_tokens(text):
    res = []
    for sent in nltk.sent_tokenize(text):
        tmp_res = nltk.word_tokenize(sent)
        for token in tmp_res:
            res += re.split("[./]", token)
    return res

def clean_tokens(tokens):
    return [token.lower() for token in tokens if token not in string.punctuation]

def text2tokens(text):
    tokens = extract_tokens(text)
    tokens = clean_tokens(tokens)
    return tokens
## End of the TP2 solution ##

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
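
The weighting scheme used by `text2vec` (an IDF-weighted mean of unit-normalized word vectors) can be checked by hand on a tiny example. The sketch below is pure Python with made-up two-dimensional vectors and IDF values; `text_to_vector`, `toy_vectors` and `toy_idf` are hypothetical names, whereas the notebook uses 300-dimensional vectors loaded from `nlp_model.joblib`.

```python
import math

def text_to_vector(tokens, word_vectors, idf, dim):
    """IDF-weighted mean of the unit-normalized word vectors of a text."""
    text_vector = [0.0] * dim
    total_weight = 0.0
    for token in tokens:
        if token not in word_vectors or token not in idf:
            continue  # unknown words are skipped, as in text2vec
        vector = word_vectors[token]
        norm = math.sqrt(sum(x * x for x in vector))
        weight = idf[token]
        for i, x in enumerate(vector):
            text_vector[i] += weight * x / norm
        total_weight += weight
    if total_weight > 0:
        text_vector = [x / total_weight for x in text_vector]
    return text_vector

# Toy word vectors and IDF weights (values are made up)
toy_vectors = {"good": [2.0, 0.0], "film": [0.0, 3.0]}
toy_idf = {"good": 3.0, "film": 1.0}
vec = text_to_vector(["good", "film", "unknown"], toy_vectors, toy_idf, dim=2)
print(vec)
```

The rarer word (higher IDF) dominates the resulting vector, which is the purpose of the weighting: frequent, uninformative words contribute less to the representation of the text.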


In [None]:
import json
import numpy as np
import matplotlib.pyplot as plt

# Load the NLP model
import joblib
# See the nlp.py file: explore it!

# Load the MNIST model
import torch
from torchvision import transforms

nlp_model = joblib.load('nlp_model.joblib')
ml = nlp_model["ml"]
idf = nlp_model["idf"]
wv = nlp_model["wv"]

def compute_sentiment(text, wv, idf, ml, threshold=0.55):
    # NLP: feature extraction
    tokens = text2tokens(text)
    vector = text2vec(wv, idf, tokens)
    # Compute prediction
    prediction = ml.predict_proba(vector.reshape(1, -1))[0]
    # Use positive class proba and threshold to estimate sentiment
    sentiment = (prediction[1] > threshold)
    return sentiment



In [None]:
def display_message(message):
  display(HTML(message))
  
def predict_and_make_it_better(text,better_direction=1):
  
  version=0
  word_to_change=None
  better_word=None
  
  while int(compute_sentiment(text, wv, idf, ml, threshold=0.55) == True) != better_direction:
    
    prediction=predict_sentiment(text,explain_scores=False,explain_relative_to=better_direction)
    #display_message("<H2> Version "+str(version)+"</H2>")
    #if word_to_change!=None:
      #display_message("<H3>"+word_to_change+"->"+better_word+"</H3>")
      
    #display_message("<H3> Sentiment: "+str(prediction['prediction'])+"+</H3>")
    #display(HTML(prediction['explanation']['textual_explanation']))
    
    # Get the word with the highest absolute score
    word_to_change=None
    better_word=None
  
    word_scores=prediction['explanation']['word_scores']
    _,sorted_indices=torch.abs(word_scores).sort(descending=True)
    changed_text=False
    for i in range(0,sorted_indices.size(0)):
      tokenized_sentence=prediction['tokenized_sentence']
      word_to_change=tokenized_sentence[sorted_indices[i]]
      better_words=get_projected_sentence_word(prediction,word_to_change)
      if len(better_words)>0:
        better_word=better_words[0]['word']
        new_tokenized_sentence=[t if t!=word_to_change else better_word for t in tokenized_sentence]
        text=" ".join(new_tokenized_sentence)
        changed_text=True
        break
    
    if not changed_text:
      # No word could be replaced: return the current text instead of None
      return text
    
    version=version+1
  return text

In [None]:
print(predict_and_make_it_better("bad worse stupid not funny",better_direction=1))

good i crazy not funny


In [None]:
import json
# Read the "email"
with open("new_email.json", "r") as fp:
    email = json.load(fp)



import copy

def create_new_email(email):
    # Deep copy so that the original appreciations are left untouched
    new_email = copy.deepcopy(email)
    for count, student in enumerate(new_email, start=1):
        # The first 18 appreciations are pushed towards positive (1), the others towards negative (0)
        direction = 1 if count < 19 else 0
        student['appreciation'] = predict_and_make_it_better(student['appreciation'], direction)

    return new_email


## 9 - Saving the modified emails to a JSON file

In [None]:
new_email = create_new_email(email)
for student in new_email :
    sentiment = compute_sentiment(student['appreciation'], wv, idf, ml)
    print(student['appreciation'], sentiment)

with open("email_perf.json", "w") as ne:
    json.dump(new_email, ne)

friendship giving too much away , there is a fade to white an hour into the film . True
My wife and I really had high hopes for this film , but it was a major expressed . True
And how come the Whoop never changes her hair or glasses over the many years this film showcases ? True
I seriously feel like this is something that a screenwriting student would be written in a Quentin Tarantino / Eddie Murphy phase and unique True
Welles would go from there to explore the mystery narrative and the self-reference of Shakespeare with this eye. True
The idea of getting to spirit was n't yet for her , unique is became dedication hook between True
This is the wonderful of lauded - horror item you 'd find packaged in with 50 other random cheesefests and and row programmers . True
met , even at the age of 7 and could tell that and was watching collection . True
It 's rare G - rated showcase fare and at least you do n't have to worry about leaving your kids alone while they watch it . True
I remember b