# Semantics and Sentiment Analysis
## 1. Semantics and Text Vectors
### Text Vectorization
* NLP libraries (e.g. spaCy) provide components that map elements of text (words, tokens, documents) to numerical vectors
* spaCy's medium and large English models (`en_core_web_md`, `en_core_web_lg`) ship with pre-trained word vectors in the style of Word2Vec or GloVe; Word2Vec is the name of a training algorithm, not a spaCy module
* As well as vectors for individual words, you also get document vectors, which are the average of the word vectors they contain
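
To make the averaging concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and the `mean_vector` helper are made up for illustration (the real vectors have 300 dimensions), and this is not spaCy's internal code.

```python
# toy word vectors (hypothetical values, 3 dimensions instead of 300)
word_vectors = {
    'the': [0.1, 0.3, 0.5],
    'lion': [0.7, 0.1, 0.2],
    'roars': [0.4, 0.8, 0.2],
}

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]

# a 'document' vector is the average of the vectors of its words
doc_vector = mean_vector([word_vectors[w] for w in ['the', 'lion', 'roars']])
print(doc_vector)
```

Because every word contributes equally, word order is lost: "lion roars the" would get exactly the same document vector.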

In [1]:
# load libraries
import spacy

# load large library
nlp = spacy.load('en_core_web_lg')

In [2]:
# look at vector for specific word (first 5 values only)
nlp(u'lion').vector[:5]

array([ 0.18963, -0.40309,  0.3535 , -0.47907, -0.43311], dtype=float32)

In [3]:
# 300 dimensions attached to this vector (i.e. word described by 300 values)
nlp(u'lion').vector.shape

(300,)

### Text Similarity
* You can compare the similarity of words (or tokens, spans, docs etc.) using spaCy's `.similarity` method
* It returns the cosine similarity of the two vectors: in principle a score between -1 and 1, though for most word pairs it falls between 0 and 1
* In the first example below, you can see that cat and pet are strongly similar (~0.75)
* This relies on the pre-trained word vectors in the spaCy model, which place words used in similar contexts close together
* The second example below shows that words with opposite meanings (e.g. love and hate) can still have a high vector similarity, because they're used in similar contexts to one another
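
The score itself can be sketched in a few lines of plain Python. The `cosine_similarity` helper and the two-dimensional vectors below are illustrative assumptions, not spaCy's own code; they show why the score is 1 for vectors pointing the same way, 0 for unrelated directions and -1 for opposite directions.

```python
import math

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors: dot product over the product of their lengths."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

# toy 2-dimensional vectors (hypothetical values)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Note that the score depends only on direction, not length: `[1, 2]` and `[2, 4]` count as identical.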

In [4]:
# create text tokens
tokens = nlp(u'lion cat pet')

# look at similarity between tokens
for token1 in tokens:
    for token2 in tokens:
        # use '.similarity' to compare 2 tokens vectors
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.52654374
lion pet 0.39923766
cat lion 0.52654374
cat cat 1.0
cat pet 0.7505456
pet lion 0.39923766
pet cat 0.7505456
pet pet 1.0


In [5]:
# create text tokens
tokens = nlp(u'like love hate')

# look at similarity between tokens
for token1 in tokens:
    for token2 in tokens:
        # use '.similarity' to compare 2 tokens vectors
        print(token1.text, token2.text, token1.similarity(token2))

like like 1.0
like love 0.65790397
like hate 0.6574652
love like 0.65790397
love love 1.0
love hate 0.6393099
hate like 0.6574652
hate love 0.6393099
hate hate 1.0


### Vector Attributes
* It's sometimes useful to aggregate the 300 dimensions of a vector into a single number
* Euclidean L2 norm = square root of the sum of the squared vector components for a specific word/token/doc
* There are a number of other attributes attached to tokens, such as whether they have a vector and whether they're out of vocabulary
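
As a quick illustration of the L2 norm, here is a small sketch in plain Python. The `l2_norm` helper is a name chosen for this example; spaCy exposes the same quantity as `vector_norm`.

```python
import math

def l2_norm(vector):
    """Euclidean (L2) norm: square root of the sum of squared components."""
    return math.sqrt(sum(value * value for value in vector))

# 3-4-5 triangle: sqrt(3^2 + 4^2) = 5
print(l2_norm([3.0, 4.0]))

# an all-zero vector (what an out-of-vocabulary word gets) has norm 0
print(l2_norm([0.0, 0.0, 0.0]))
```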

In [6]:
# check number of unique words within our vocabulary (that we have vectors for)
# each vector has 300 dimensions, so our vocab is a 684,830 x 300 matrix
len(nlp.vocab.vectors)

684830

In [7]:
# create text
tokens = nlp(u'dog cat nargle')

# check if token is within our vocab
for token in tokens:
    # nargle is a made up word so will have no vector
    # has_vector will be False, is_oov (out of vocab) will be True
    # the vector_norm is the L2 norm: the square root of the sum of squared values in our vector
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True


### Vector Arithmetic
* Once words are transformed into vectors, we can perform vector arithmetic on them
* Similarity itself is calculated this way: cosine similarity is the dot product of two vectors divided by the product of their norms
* For example...
    * An individual word (e.g. King) is given a numeric vector
    * This vector can be plotted/visualized in some dimensional space
    * You can then use cosine similarity to measure how close this word's vector is to another word's vector
    * You can also add and subtract the vectors of several words
    * For example, you could calculate the vector of King - man + woman
    * Then you could look for the closest vector in your vector library to this calculated vector
    * One of the closest words returned could be Queen
* This vector arithmetic allows spaCy to capture relationships between words such as tense, age, gender etc. based on the relationships between their vectors
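
The King - man + woman idea can be tried on a toy vocabulary first. Everything in this sketch is hypothetical: the three dimensions (royalty, maleness, femaleness) and their values are hand-picked so the analogy works, whereas real word vectors are learned from text and are far noisier. Excluding the query words from the candidates is a design choice made here, since they tend to sit close to the result.

```python
import math

def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)

# toy vectors (hypothetical values): [royalty, maleness, femaleness]
toy_vocab = {
    'king': [0.9, 0.8, 0.1],
    'queen': [0.9, 0.1, 0.8],
    'man': [0.1, 0.9, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'apple': [0.0, 0.2, 0.2],
}

# king - man + woman, component by component
new_vector = [k - m + w for k, m, w in
              zip(toy_vocab['king'], toy_vocab['man'], toy_vocab['woman'])]

# rank the remaining words by similarity to the new vector
query_words = {'king', 'man', 'woman'}
ranked = sorted(
    ((word, cosine_similarity(new_vector, vec))
     for word, vec in toy_vocab.items() if word not in query_words),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked[0][0])  # queen
```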

<img src="NLP Course Files/Images/WordVectors2.png">

In [8]:
# load libraries
from scipy import spatial

# manually calculate cosine similarity
cosine_similarity = lambda vec1, vec2: 1 - spatial.distance.cosine(vec1, vec2)

# create text vars
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# calculate combo of 3 vectors
new_vector = king - man + woman

# compare new vector to all words in vocab
# create list to store similarities
computed_similarities = []

# iterate through words in vocab
for word in nlp.vocab:
    # check if word has a vector, is lowercase and is alphabetic
    if word.has_vector and word.is_lower and word.is_alpha:
        # calculate similarity and store as tuple in list
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))

# sort list (descending order of similarity)
computed_similarities = sorted(computed_similarities, key=lambda item: item[1], reverse=True)

# print top 10 similar words
print([t[0].text for t in computed_similarities[:10]])

['king', 'woman', 'she', 'lion', 'who', 'when', 'dare', 'cat', 'love', 'was']


## 2. Sentiment Analysis
### Overview
* Sentiment can be analysed in many ways, but broadly it looks at polarity (+/-) and strength of sentiment
* VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment tool available in NLTK
* It analyses not only the polarity and strength of individual words, but also the words around them
    * It interprets context e.g. "I did not like" is negative despite containing "like"
    * It accounts for capitalisation, punctuation etc. too e.g. "I LOVE THIS!!!" receives a higher strength
* Sentiment analysis is a tricky field and no algorithm/model will be perfect
* Scenarios such as mixed reviews and sarcasm are almost impossible for a model to interpret perfectly
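
To give a feel for what "lexicon plus rules" means, here is a deliberately tiny toy scorer. It is not VADER's implementation: the lexicon values, the negation list and the boost factors are all invented for this sketch, and VADER's real rules are considerably richer.

```python
# hypothetical mini-lexicon: word -> valence (NOT VADER's real values)
TOY_LEXICON = {'like': 1.5, 'love': 3.0, 'hate': -3.0}
NEGATIONS = {'not', 'never', "don't"}

def toy_sentiment(text):
    """Sum word valences, flipping after a negation and boosting ALL-CAPS words and '!'."""
    score = 0.0
    tokens = text.replace('!', ' ').split()
    for i, token in enumerate(tokens):
        valence = TOY_LEXICON.get(token.lower(), 0.0)
        if valence == 0.0:
            continue
        # flip polarity if the previous word is a negation
        if i > 0 and tokens[i - 1].lower() in NEGATIONS:
            valence = -valence
        # boost emphasised (all-caps) words by an arbitrary factor
        if token.isupper() and len(token) > 1:
            valence *= 1.5
        score += valence
    # each exclamation mark (up to 3) strengthens the existing polarity
    boost = 0.3 * min(text.count('!'), 3)
    if score > 0:
        score += boost
    elif score < 0:
        score -= boost
    return score

print(toy_sentiment('I like this'))
print(toy_sentiment('I did not like this'))
print(toy_sentiment('I LOVE THIS!!!'))
```

The second sentence scores negative despite containing "like", and the shouted third sentence scores higher than a plain "I love this" would.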

In [9]:
# load libraries
import nltk

# download VADER lexicon
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Matthew.Allen2\AppData\Roaming\nltk_data...


True

In [14]:
# load libraries
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# create instance of analyzer
sid = SentimentIntensityAnalyzer()

# create text
text = "This was a good movie."

# analyze text
# neg/neu/pos are the proportions of negative/neutral/positive text (each between 0 and 1)
# compound is a single normalized overall score between -1 and 1
sid.polarity_scores(text)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
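
The compound value is not a simple average. As far as the NLTK source suggests, VADER sums the valence of the scored words and squashes the total into (-1, 1) with x / sqrt(x² + alpha), where alpha defaults to 15. The sketch below reimplements that formula; the valence of 1.9 for "good" is an assumption, chosen because it reproduces the 0.4404 shown above.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# assumed valence for 'good', the only scored word in the sentence
print(round(normalize(1.9), 4))  # 0.4404

# large sums approach, but never reach, +1 or -1
print(normalize(20.0), normalize(-20.0))
```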

In [15]:
# create text
text = "This was the best, most awesome movie EVER MADE!!!"

# analyze text
sid.polarity_scores(text)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

In [16]:
# create text
text = "This is the worst movie to ever disgrace the screen."

# analyze text
sid.polarity_scores(text)

{'neg': 0.477, 'neu': 0.523, 'pos': 0.0, 'compound': -0.8074}

## 3. Amazon Reviews
### Text Analysis
* We can load in a set of Amazon product reviews and analyse the sentiment attached
* These reviews already have labels (pos/neg), which we could use to train a classifier
* Instead, we can predict our own sentiment values using VADER and then compare our results to the original labels

In [19]:
# load libraries
import pandas as pd
import numpy as np

# read in Amazon reviews data
in_path = 'C:/Users/Matthew.Allen2/Documents/GitHub/Data-Science/Courses/Udemy/NLP/NLP Course Files/TextFiles/'
df = pd.read_csv(in_path + 'amazonreviews.tsv', sep='\t')

# drop nulls
df.dropna(inplace=True)

# peek at data
df.head()

|   | label | review |
|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |


In [20]:
# check mix of positive and negative reviews
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [21]:
# blanks list
blanks = []

# check for blanks
# iterate through dataframe, unpack objects
for i, lb, rv in df.itertuples():
    # if review is a string made up only of whitespace
    if isinstance(rv, str) and rv.isspace():
        # store index of blanks
        blanks.append(i)

# drop blanks if necessary
#df.drop(blanks, inplace=True)
            
# check blanks
blanks

[]

In [27]:
# check first review
df.iloc[0]['review']

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [26]:
# polarity scores of first review
sid.polarity_scores(df.iloc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

### Predicting and Scoring
* We can run polarity scores for the entire dataframe, looking at all reviews
* From this, we can extract the compound score, which summarises the neg/neu/pos scores in a single value between -1 and 1
* Finally, we can define positive and negative review labels based on their compound scores (i.e. neg < 0 <= pos)
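
The labelling rule can be written as a small standalone function. `label_from_compound` and its `threshold` parameter are names introduced for this sketch; with the default threshold of 0 it mirrors the neg < 0 <= pos rule.

```python
def label_from_compound(compound, threshold=0.0):
    """Return 'pos' when the compound score reaches the threshold, else 'neg'."""
    return 'pos' if compound >= threshold else 'neg'

# compound scores seen earlier, plus the boundary case
for score in [0.9454, 0.0, -0.8074]:
    print(score, label_from_compound(score))
```

One consequence of this rule is that a review with a compound score of exactly 0 (for instance, one where no sentiment words are found) is labelled positive.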

In [31]:
# calculate polarity scores for whole dataframe
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

# extract compound score specifically (most relevant score)
df['compound'] = df['scores'].apply(lambda scores: scores['compound'])

# create labels based on >= 0 for +ve reviews
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')

# check output
df.head()

|   | label | review | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 | pos |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 | pos |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 | pos |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 | pos |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 | pos |


Confusion Matrix:
* This shows us that it's not doing a horrific job of classifying
* Most positive reviews are being correctly classified (4,469 of 4,903)
* It's having issues with negative reviews though: almost half (2,475 of 5,097) are predicted as positive, possibly due to sarcasm or mixed reviews

In [42]:
# evaluate predictions
# load libraries
from sklearn.metrics import accuracy_score, classification_report

# confusion matrix
confusion_matrix = pd.crosstab(df['label'], df['comp_score'], rownames=['Actual'], colnames=['Predicted'])
confusion_matrix

| Actual \ Predicted | neg | pos |
|---|---|---|
| neg | 2622 | 2475 |
| pos | 434 | 4469 |


Accuracy and Other Scores:
* Overall accuracy is ~71%
* A model which randomly guessed would get around 50% on this near-balanced dataset, so it's definitely an improvement on a random model
* Our positive cases are mostly being correctly guessed, with high recall (0.91) and a reasonable f1 (0.75)
* Negative cases are trickier: precision is high (0.86), but recall (0.51) and f1 (0.64) are lower
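
These scores can be checked by hand from the four counts in the confusion matrix. The sketch below uses plain Python rather than scikit-learn; `class_metrics` is a helper name made up for this example.

```python
# counts taken from the confusion matrix: (actual, predicted) -> number of reviews
counts = {('neg', 'neg'): 2622, ('neg', 'pos'): 2475,
          ('pos', 'neg'): 434, ('pos', 'pos'): 4469}

def class_metrics(counts, label):
    """Precision, recall and F1 for one class from (actual, predicted) counts."""
    true_pos = counts[(label, label)]
    predicted = sum(n for (actual, pred), n in counts.items() if pred == label)
    actual_total = sum(n for (actual, pred), n in counts.items() if actual == label)
    precision = true_pos / predicted
    recall = true_pos / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# accuracy = correctly classified reviews / all reviews
accuracy = (counts[('neg', 'neg')] + counts[('pos', 'pos')]) / sum(counts.values())
print(f'accuracy: {accuracy:.4f}')

for label in ['neg', 'pos']:
    precision, recall, f1 = class_metrics(counts, label)
    print(f'{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

The low recall for `neg` comes directly from the 2,475 negative reviews predicted as positive; the same 2,475 errors are what pull down the precision for `pos`.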

In [41]:
# accuracy score (as a percentage to 2 decimal places)
print(f"{accuracy_score(df['label'], df['comp_score']) * 100:.2f}%")

70.91%


In [40]:
# classification report
print(classification_report(df['label'], df['comp_score']))

              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

    accuracy                           0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000

