## Sentence reformulation

In [22]:
import numpy as np
import nltk
from gensim.models import KeyedVectors
import scipy.spatial as sp

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hman1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\hman1\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Load the downloaded pretrained FastText vectors with the gensim library:

In [4]:
fast = KeyedVectors.load_word2vec_format('../data/cc.en.300.vec')

In [5]:
with open('../data/Yelp.train.text','r+') as f:
    yelp = f.read()

Compute the similarity of two words using gensim:

In [9]:
# Compare several pairs of related words (e.g. 'king' and 'queen') to put the similarity scores into context
pairs = [['king','queen'],['king','majesty'],['queen','majesty'],['man','woman'],['husband','wife'],['brother','sister'],['prince','princess'],['right','wrong'],['car','truck'],['porsche','subaru'],['porsche','bentley'],['superhero','villain'],['jesus','moses']]
for pair in pairs:
    w1,w2 = pair
    sim = np.round(fast.similarity(w1,w2),3)
    print("Similarity of {} and {} is: {:.3f}".format(w1,w2,sim))

Similarity of king and queen is: 0.707
Similarity of king and majesty is: 0.445
Similarity of queen and majesty is: 0.376
Similarity of man and woman is: 0.766
Similarity of husband and wife is: 0.894
Similarity of brother and sister is: 0.818
Similarity of prince and princess is: 0.755
Similarity of right and wrong is: 0.562
Similarity of car and truck is: 0.648
Similarity of porsche and subaru is: 0.525
Similarity of porsche and bentley is: 0.546
Similarity of superhero and villain is: 0.595
Similarity of jesus and moses is: 0.592
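
The scores above are cosine similarities between word vectors, which is what gensim's `similarity` computes. As a minimal sketch of that measure, the snippet below applies it to made-up 3-dimensional vectors; the `toy_vectors` values and the `cosine_similarity` helper are illustrative assumptions, not FastText data or gensim code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional "embeddings" (made-up values, not FastText vectors).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "truck": [0.1, 0.2, 0.9],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_truck = cosine_similarity(toy_vectors["king"], toy_vectors["truck"])
print(f"king/queen: {sim_king_queen:.3f}, king/truck: {sim_king_truck:.3f}")
```

Vectors that point in nearly the same direction score close to 1, orthogonal vectors score 0, and the vector length does not matter.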


Tokenization. Split the Yelp texts into separate tokens (words and punctuation marks).

In [13]:
#your code here
word_tokens = nltk.tokenize.word_tokenize(yelp)

Try part of speech tagging using [NLTK POS-tagger](https://www.nltk.org/book/ch05.html).
The function returns a list of tuples (word, POS_tag).

In [16]:
# Tag the token list, not the raw string: passing a string would tag every character
pos_tag = nltk.pos_tag(word_tokens)

Can you find the words most similar to a given word? Can you write a method that returns a list of tuples (word, similarity) in order of decreasing similarity?

In [53]:
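def find_most_sim_n(target, vec_space, n, func='euclidean'):
    # Distance from the target vector to every vector in the vocabulary
    vec = vec_space.wv.get_vector(target).reshape(1, -1)
    dist = sp.distance.cdist(vec_space.wv.vectors, vec, func)[:, 0]
    # Rank 0 of the sorted order is the target itself, so take ranks n..1;
    # the result runs from the farthest of the n neighbours to the closest
    idxs = np.argsort(dist)[n:0:-1]
    keys = list(vec_space.wv.vocab.keys())
    # Pairs are (word, distance): a smaller distance means a more similar word
    return [(keys[i], dist[i]) for i in idxs]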

In [45]:
find_most_sim_n ('moses',fast,5)

  
  


[('israelites', 1.2634099491206856),
 ('abrahams', 1.2596855196513037),
 ('joshua', 1.2592746645836506),
 ('malachi', 1.2513902803123798),
 ('bernice', 1.251252513206231)]
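
The function above ranks neighbours by Euclidean distance, so the numbers it returns are distances, not similarities. A possible variant that returns (word, similarity) pairs in order of decreasing similarity is sketched below with cosine similarity on a toy vocabulary. `most_similar_cosine`, `toy_words`, and `toy_matrix` are hypothetical names with made-up values. On the real vectors, gensim's built-in `fast.most_similar('moses', topn=5)` provides the same kind of ranking.

```python
import numpy as np

def most_similar_cosine(target, words, vectors, n=5):
    """Return up to n (word, cosine similarity) pairs, most similar first."""
    vectors = np.asarray(vectors, dtype=float)
    # Normalise rows so that a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    target_idx = words.index(target)
    sims = unit @ unit[target_idx]
    # Sort by decreasing similarity and drop the target word itself.
    order = [i for i in np.argsort(-sims) if i != target_idx][:n]
    return [(words[i], float(sims[i])) for i in order]

# Toy vocabulary with made-up 2-dimensional vectors.
toy_words = ["good", "great", "decent", "bad"]
toy_matrix = [[1.0, 0.1], [0.9, 0.2], [0.7, 0.5], [-1.0, 0.1]]

neighbours = most_similar_cosine("good", toy_words, toy_matrix, n=2)
print(neighbours)
```

Normalising the matrix once and using a single matrix-vector product avoids computing a full distance table for every query.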

Let's do the simplest reformulation task: we reformulate sentences by replacing each adjective with a similar one.

In [66]:
def reformulate_sentence(sentence):
    # Sentence tokenization
    tokenized_sentence = nltk.tokenize.word_tokenize(sentence)

    # Part of speech tagging
    POS_tagged_words = nltk.pos_tag(tokenized_sentence)

    reformulated_sentence_words = []
    for word, pos_tag in POS_tagged_words:
        # If the word is an adjective...
        if pos_tag in ['JJR', 'JJS', 'JJ']:
            try:
                # ...replace it with one of its 5 nearest neighbours, chosen at random
                w = find_most_sim_n(word,fast,5)[np.random.randint(5)][0]
                reformulated_sentence_words.append(w)
            except KeyError:
                # Keep the original word when it has no FastText vector
                print('There is no {} word in FastText dictionary! ...'.format(word))
                reformulated_sentence_words.append(word)
        else:
            reformulated_sentence_words.append(word)
    # Join words list in a sentence
    return ' '.join(reformulated_sentence_words)

In [70]:
lst = []
for i in range (5):
    lst.append(reformulate_sentence('the stupid turd gave a press conference telling good americans to consume toxic bleach'))

  


In [69]:
lst

['the asinine turd gave a press conference telling decent americans to consume poisonous bleach',
 'the dumbass turd gave a press conference telling excellent americans to consume harmful bleach',
 'the asinine turd gave a press conference telling good.Good americans to consume poisonous bleach',
 'the moronic turd gave a press conference telling good.Good americans to consume harmful bleach',
 'the asinine turd gave a press conference telling semi-good americans to consume noxious bleach']

## Sentiment analysis

In [71]:
import random

In [72]:
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\hman1\AppData\Roaming\nltk_data...


We use the VADER sentiment classifier from the NLTK library. Its compound score ranges from -1 to 1, where -1 is most negative, 0 is neutral, and 1 is most positive.

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

In [73]:
sentiment_analyzer = SentimentIntensityAnalyzer()

Read the dataset text file line by line and put the lines into a list:

In [77]:
#your code here
split_sent = yelp.split("\n")
split_sent[:10]

['i was sadly mistaken .',
 'so on to the hoagies , the italian is general run of the mill .',
 'minimal meat and a ton of shredded lettuce .',
 'nothing really special & not worthy of the $ _num_ price tag .',
 'second , the steak hoagie , it is atrocious .',
 'i had to pay $ _num_ to add cheese to the hoagie .',
 'she told me there was a charge for the dressing on the side .',
 'are you kidding me ?',
 'i was not going to pay for the dressing on the side .',
 'i ordered it without lettuce , tomato , onions , or dressing .']

Read the Yelp dataset from the text file and draw 1000 random sentences:

In [79]:
#your code here
samp = np.array(split_sent)[np.random.choice(range(len(split_sent)),1000,replace=False)]
samp[:10]

array(['she really should be fired .', 'best pizza in pittsburgh .',
       'i definitely will use them again .', 'thank you .',
       'it was only downhill from there .',
       "hands down the best ice cream i 've had in ages .",
       'excellent welcome from front desk when we arrived .',
       'well worth the money and experience to try out sushi and hibachi .',
       'the donuts were very fresh and warm .',
       'i will drive a longer distance to avoid this location .'],
      dtype='<U108')

Compute the average sentiment of the 1000 sampled sentences with the VADER sentiment classifier:

In [97]:
av_sent = np.mean([sentiment_analyzer.polarity_scores(sent)['compound'] for sent in samp])

print(f'the average compound sentiment of the sampled Yelp sentences is: {np.round(av_sent,2)}')

the average compound sentiment of the sampled Yelp sentences is: 0.25


Reformulate the sentences and compute the average sentiment again. Try to come up with ways to make the sentences more positive on average. What about more negative? Can you come up with an interesting experiment on this data with POS-tagged reformulations?

In [99]:
#your code here
import tqdm
reformulated = []
pbar = tqdm.tqdm(total=1000, position= 0 , leave = True)
for sent in samp:
    reformulated.append(reformulate_sentence(sent))
    pbar.update()
pbar.close()
av_sent = np.mean([sentiment_analyzer.polarity_scores(sent)['compound'] for sent in reformulated])
print(f'the average compound sentiment for the reformulated sentences is: {np.round(av_sent,2)}')

  
 14%|█▎        | 136/1000 [06:34<39:20,  2.73s/it]There is no a/c word in FastText dictionary! ...
 16%|█▌        | 162/1000 [08:06<56:50,  4.07s/it]There is no _num_ word in FastText dictionary! ...
 22%|██▎       | 225/1000 [11:19<1:08:29,  5.30s/it]There is no _num_ word in FastText dictionary! ...
 30%|██▉       | 299/1000 [14:48<40:51,  3.50s/it]There is no _num_ word in FastText dictionary! ...
 36%|███▌      | 355/1000 [17:31<30:10,  2.81s/it]There is no _num_ word in FastText dictionary! ...
 38%|███▊      | 382/1000 [18:52<31:46,  3.09s/it]There is no _num_ word in FastText dictionary! ...
 41%|████▏     | 414/1000 [20:34<27:06,  2.78s/it]There is no _num_ word in FastText dictionary! ...
 45%|████▌     | 453/1000 [22:24<22:49,  2.50s/it]There is no _num_ word in FastText dictionary! ...
 49%|████▉     | 494/1000 [24:12<32:35,  3.87s/it]There is no _num_ word in FastText dictionary! ...


In [100]:
print(f'the average compound sentiment for the reformulated sentences is: {np.round(av_sent,2)}')

the average compound sentiment for the reformulated sentences is: 0.19


## Changing the average score of a sentence
To make a sentence more positive (or more negative):
1. Find the n most negative words in the sentence. For each of them, look up the most similar words and replace the original word with the highest-scoring candidate, provided that candidate scores higher than the original word. <br>
Conversely, to make a sentence more negative, replace the most positive words with less positive similar words.
2. Use compound adjectives: randomly add additional adjectives next to any existing adjective in the original sentence.
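
The first strategy can be sketched as follows. This is a minimal illustration under stated assumptions: `make_more_positive`, `toy_scores`, and `toy_neighbours` are hypothetical, and the scores are made up. In the notebook, the word score could come from VADER's compound score for a single word and the neighbour lists from `find_most_sim_n`.

```python
def make_more_positive(tokens, word_score, neighbours, n_words=1):
    """Replace the n_words lowest-scoring tokens with their highest-scoring neighbour."""
    # Rank token positions from most negative to most positive.
    ranked = sorted(range(len(tokens)), key=lambda i: word_score(tokens[i]))
    result = list(tokens)
    for i in ranked[:n_words]:
        candidates = neighbours.get(tokens[i], [])
        if not candidates:
            continue
        best = max(candidates, key=word_score)
        # Only replace when the neighbour really scores higher.
        if word_score(best) > word_score(tokens[i]):
            result[i] = best
    return result

# Hypothetical word-level scores and neighbour lists, for illustration only.
toy_scores = {"atrocious": -0.8, "poor": -0.5, "mediocre": -0.1, "steak": 0.0}
toy_neighbours = {"atrocious": ["poor", "mediocre"]}

tokens = ["the", "steak", "is", "atrocious"]
print(make_more_positive(tokens, lambda w: toy_scores.get(w, 0.0), toy_neighbours))
```

Making a sentence more negative is the mirror image: rank from most positive to most negative and pick the lowest-scoring neighbour.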

## POS reformulation experiments
1. Before a review is submitted on Yelp, analyze its sentences and suggest alternative phrasings, asking the user whether each alternative could replace the original. This lets us examine how the words users are accustomed to in everyday life affect the sentiment calculation.
2. Give the original reviews to random readers and have them rate the overall text; use these ratings as a baseline. Then isolate different POS tags, reformulate the sentences to be more positive or more negative, and have another group of subjects rate those reviews. This experiment would give an intuition for which POS is most significant in determining the tone of a review.

**Side point**: when replacing a word in a sentence, we should check that the similar word can serve as the same POS. Some words cannot be used both as verbs and as adjectives, and we should control for that.
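
One way to control for this is to re-tag the sentence with each candidate substituted in and keep only candidates that receive the original tag. The sketch below assumes a tagger that takes a token list and returns (word, tag) pairs, as `nltk.pos_tag` does. `same_pos_candidates`, `toy_tagger`, and `TOY_TAGS` are hypothetical stand-ins so the example runs without NLTK data.

```python
def same_pos_candidates(tokens, index, candidates, tagger):
    """Keep candidates that receive the original word's POS tag in context."""
    original_tag = tagger(tokens)[index][1]
    kept = []
    for candidate in candidates:
        # Substitute the candidate and tag the whole sentence again.
        trial = tokens[:index] + [candidate] + tokens[index + 1:]
        if tagger(trial)[index][1] == original_tag:
            kept.append(candidate)
    return kept

# Stand-in tagger with a hard-coded lexicon; a real run would pass nltk.pos_tag.
TOY_TAGS = {"good": "JJ", "decent": "JJ", "goodness": "NN", "pizza": "NN"}

def toy_tagger(tokens):
    return [(tok, TOY_TAGS.get(tok, "DT")) for tok in tokens]

tokens = ["a", "good", "pizza"]
print(same_pos_candidates(tokens, 1, ["decent", "goodness"], toy_tagger))
```

Re-tagging in context costs one tagger call per candidate, but it respects the fact that a word's tag depends on its neighbours.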
