### Word2Vec
**Word2Vec** is a two-layer neural net that processes text
> * Input is a text corpus
> * Output is a set of vectors (i.e. feature vectors for the words in the corpus)
> > * These vectors are distributed numerical representations of word features, such as the context in which individual words appear

**Purpose** and usefulness of Word2Vec is to group the vectors of similar words together in vector space
> * It detects similarities mathematically

**Outcome**: given enough data, Word2Vec can make highly accurate guesses about a word's meaning based on its past appearances

**Trains** words against the other words that neighbor them in the input corpus, in one of two ways
> 1. Using context to predict a target word - a method known as continuous bag of words (CBOW)
> 2. Using a word to predict its target context - Skip-Gram

<img src='CBOW_Skip.png'>
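The two setups differ only in which side is the input and which is the prediction target. A minimal sketch of the training pairs each one would see, assuming a whitespace-tokenized sentence and a window of 1 word on each side (the helper names are illustrative and not part of any library):

```python
def context_window(tokens, i, window):
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_pairs(tokens, window=1):
    """CBOW: (list of context words) -> target word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window=1):
    """Skip-gram: target word -> one context word per pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = "the quick brown fox".split()
print(cbow_pairs(tokens))
print(skipgram_pairs(tokens))
```

CBOW produces one example per position, while Skip-Gram produces one example per (word, neighbor) combination, so the same sentence yields more Skip-Gram examples.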

**Recall** each word is now represented by a vector
> With spaCy's `en_core_web_lg` model, each vector has **300** dimensions

**Cosine Similarity** is a measure of similarity between vectors
> * Now that we have our words in vectors, we can evaluate their relationships

<img src='Cos_Sim.png'>
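Cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$. A minimal sketch using only the standard library so that the formula stays visible; the function name `cosine_sim` is illustrative, and the sketch assumes neither vector is all zeros:

```python
import math


def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_sim([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 1.0
print(cosine_sim([1.0, 0.0], [0.0, 3.0]))   # orthogonal      -> 0.0
print(cosine_sim([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> -1.0
```

Only the direction of the vectors matters: scaling either vector leaves the result unchanged.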

**Vector Arithmetic** can now be performed
> e.g. subtracting the vector for man from the vector for king and adding the vector for woman gives a vector close to queen

$$\text{new vector} = \text{king} - \text{man} + \text{woman} \approx \text{queen}$$
<img src='Vec_Sim_Example.png'>

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_lg")

In [3]:
# The first 10 components of the vector for the word 'lion'.
# Doc and Span objects also have vectors, derived by averaging the individual token vectors,
# so whole documents can be embedded, not only single words
display(nlp(u"lion").vector[0:10])
display(nlp(u"lion").vector.shape)

array([ 0.18963 , -0.40309 ,  0.3535  , -0.47907 , -0.43311 ,  0.23857 ,
        0.26962 ,  0.064332,  0.30767 ,  1.3712  ], dtype=float32)

(300,)

In [4]:
# A Doc's vector is the average of the vectors of the tokens it contains
display(nlp(u'The quick brown fox jumped').vector[0:10])
display(nlp(u'The quick brown fox jumped').vector.shape)

array([-0.209218  , -0.0278228 , -0.0357064 ,  0.1552184 , -0.012805  ,
        0.13162704, -0.19946599,  0.0475812 ,  0.1267988 ,  1.647928  ],
      dtype=float32)

(300,)
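Averaging token vectors is a component-wise mean. A minimal sketch with hypothetical 3-dimensional token vectors (the real vectors here have 300 dimensions, and `mean_vector` is an illustrative helper, not a spaCy function):

```python
def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Hypothetical token vectors, one row per token
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]
print(mean_vector(token_vectors))  # [2.0, 2.0, 2.0]
```

With the loaded model, one would expect `doc.vector` to agree with the mean of `[t.vector for t in doc]` up to floating-point error.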

In [5]:
tokens = nlp(u'lion cat pet')

In [6]:
# This numerical value is the cosine similarity
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

lion lion 1.0
lion cat 0.5265437
lion pet 0.39923772
cat lion 0.5265437
cat cat 1.0
cat pet 0.7505456
pet lion 0.39923772
pet cat 0.7505456
pet pet 1.0


In [7]:
tokens = nlp(u'like love hate')

for t1 in tokens:
    for t2 in tokens:
        print(t1.text, t2.text, t1.similarity(t2))

like like 1.0
like love 0.65790397
like hate 0.6574652
love like 0.65790397
love love 1.0
love hate 0.6393099
hate like 0.6574652
hate love 0.6393099
hate hate 1.0


In [8]:
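# Number of vectors in the model's vocabulary, and the shape of the vector table (rows x dimensions).
# A vector can be summarized by its Euclidean (L2) norm: the square root of the sum of its squared components.
# spaCy exposes this as the `vector_norm` attribute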
display(len(nlp.vocab.vectors))
nlp.vocab.vectors.shape

684831

(684831, 300)

In [10]:
tokens = nlp(u'dog cat nargle')

In [13]:
for t in tokens:
    print(t.text,t.has_vector,t.vector_norm,t.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
nargle False 0.0 True
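The norm column is the Euclidean (L2) length of each vector. A minimal sketch of that calculation with the standard library; `l2_norm` is an illustrative name, and the all-zero vector stands in for a word the model has no vector for:

```python
import math


def l2_norm(vector):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


print(l2_norm([3.0, 4.0]))   # 5.0
print(l2_norm([0.0] * 300))  # 0.0, as for a token without a vector
```

A norm of 0.0 is therefore a quick signal that no usable vector is available for that token.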


In [14]:
tokens = nlp(u'dog cat John')
for t in tokens:
    print(t.text,t.has_vector,t.vector_norm,t.is_oov)

dog True 7.0336733 False
cat True 6.6808186 False
John True 6.533578 False


## Vector Arithmetic

In [20]:
from scipy import spatial

def cosine_similarity(vec1, vec2):
    # scipy returns the cosine *distance*, so similarity = 1 - distance
    return 1 - spatial.distance.cosine(vec1, vec2)

In [21]:
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

In [22]:
# King - man + woman --> new_vector similar to queen, princess, highness

In [23]:
new_vector = (king - man) + woman

In [24]:
computed_similarities = []
# For every word in the vocabulary that has a vector and is lowercase and alphabetic
for word in nlp.vocab:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarities.append((word, similarity))
# Sort in descending order by the similarity
computed_similarities = sorted(computed_similarities, key=lambda item: item[1], reverse=True)

In [28]:
# 'king' itself still ranks first, so swapping 'man' for 'woman' does not fully move the vector to 'queen', but 'queen' is the closest other word
print([t[0].text for t in computed_similarities[0:10]])

['king', 'queen', 'prince', 'kings', 'princess', 'royal', 'throne', 'queens', 'monarch', 'kingdom']
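Because the result vector stays close to its own inputs, a common refinement is to leave the query words out of the ranking. A minimal sketch with hypothetical 2-dimensional vectors chosen by hand for illustration (they are not taken from the model, and `most_similar` is an illustrative helper, not a spaCy function):

```python
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar(target, vectors, exclude=(), top_n=3):
    """Rank words by cosine similarity to `target`, skipping any word in `exclude`."""
    scored = [(word, cosine_sim(target, vec))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical toy vectors: [royalty, gender]
toy_vectors = {
    'king':  [1.0, 0.5],
    'queen': [1.0, -0.5],
    'man':   [0.1, 0.1],
    'woman': [0.1, -0.1],
    'apple': [-0.2, 0.0],
}
new_vec = [k - m + w for k, m, w in
           zip(toy_vectors['king'], toy_vectors['man'], toy_vectors['woman'])]

# Without exclusion the inputs dominate the ranking
print([w for w, _ in most_similar(new_vec, toy_vectors)])
# Excluding the three query words lets the analogy answer surface
print([w for w, _ in most_similar(new_vec, toy_vectors, exclude={'king', 'man', 'woman'})])
```

Applied to the real vocabulary, the same idea would mean skipping `king`, `man` and `woman` when collecting `computed_similarities`.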


# Sentiment Analysis Without Labels - Unsupervised
**VADER** (Valence Aware Dictionary and sEntiment Reasoner): model used for text sentiment analysis that is sensitive to both polarity (pos/neg) and intensity of emotion
> Found in nltk package

* Relies on a dictionary which maps lexical features to emotion intensities called sentiment scores
> The sentiment score of a text can be obtained by summing up the intensity of each word in the text

* It takes into account context and punctuation: "love" versus "do not love", "like" vs "LOVE!!!"
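The summed word intensities are unbounded, while the `compound` score always falls between -1 and 1. A minimal sketch of that squashing step, assuming the normalization used by the reference VADER implementation (`alpha = 15`) and assuming the lexicon rates "good" at 1.9; both values are assumptions here rather than something shown in these notes:

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


# Assumption: "good" is the only sentiment-bearing word in the sentence
# and carries a lexicon valence of 1.9
print(round(normalize(1.9), 4))  # 0.4404
```

Under those assumptions the result agrees with the `compound` value reported for 'This is a good movie'. Larger sums approach 1 without reaching it, which is why very enthusiastic text saturates near the top of the scale.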

In [29]:
import nltk

In [30]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/daiglechris/nltk_data...


True

In [31]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [32]:
sid = SentimentIntensityAnalyzer()

In [33]:
a = 'This is a good movie'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}

In [34]:
a = 'This was the best, most awesome movie EVER MADE!!!'
sid.polarity_scores(a)

{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}

In [35]:
a = 'This was the WORST movie that has ever disgraced the screen.'
sid.polarity_scores(a)

{'neg': 0.465, 'neu': 0.535, 'pos': 0.0, 'compound': -0.8331}

### Using **VADER** to analyze text (Amazon reviews)

In [56]:
import pandas as pd
import numpy as np

In [37]:
df = pd.read_csv('amazonreviews.tsv',sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... |


In [38]:
df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [44]:
# Cleaning: check for missing values, drop them, then drop whitespace-only reviews
display(df.shape)
display(df.isnull().sum())

df.dropna(inplace=True)
display(df.shape)
display(df.isnull().sum())

# Collect the index labels of reviews that contain only whitespace
blanks = []
for i, lb, rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)
df.drop(blanks, inplace=True)
display(df.shape)

(10000, 2)

label     0
review    0
dtype: int64

(10000, 2)

label     0
review    0
dtype: int64

(10000, 2)

In [46]:
print(df.iloc[0]['review'])
sid.polarity_scores(df.iloc[0]['review'])

Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^


{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [47]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

In [48]:
df.head()

| | label | review | scores |
|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... |


In [49]:
# Create a column of the values in compound under the column scores
df['compound'] = df['scores'].apply(lambda d:d['compound'])

In [50]:
df.head()

| | label | review | scores | compound |
|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 |


In [53]:
# Label a review 'pos' when its compound score is >= 0 and 'neg' otherwise (a score of exactly 0 counts as 'pos')
df['comp_score'] = df['compound'].apply(lambda score: 'pos' if score >= 0 else 'neg')
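The threshold used here sends every neutral score (exactly 0) to 'pos'. A minimal sketch of an alternative with a neutral band, assuming the ±0.05 cutoffs commonly suggested for VADER's compound score; the function name and the 'neu' label are illustrative:

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'pos', 'neg', or 'neu' using a symmetric neutral band."""
    if compound >= threshold:
        return 'pos'
    if compound <= -threshold:
        return 'neg'
    return 'neu'


for score in (0.9454, 0.0, -0.8331):
    print(score, label_from_compound(score))
```

Since the review labels in this dataset are only 'pos' and 'neg', any 'neu' predictions would have to be dropped or assigned to one side before comparing against the labels.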

In [54]:
df.head()

| | label | review | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | pos | Stuning even for the non-gamer: This sound tra... | {'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co... | 0.9454 | pos |
| 1 | pos | The best soundtrack ever to anything.: I'm rea... | {'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co... | 0.8957 | pos |
| 2 | pos | Amazing!: This soundtrack is my favorite music... | {'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com... | 0.9858 | pos |
| 3 | pos | Excellent Soundtrack: I truly like this soundt... | {'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com... | 0.9814 | pos |
| 4 | pos | Remember, Pull Your Jaw Off The Floor After He... | {'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp... | 0.9781 | pos |


In [61]:
# They match approximately 71% of the time
np.mean(df['label'] == df['comp_score'])

0.7091

In [62]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [65]:
display(accuracy_score(df['label'], df['comp_score']))
display(pd.DataFrame(confusion_matrix(df['label'], df['comp_score']),
                     index=['TrueNeg', 'TruePos'],
                     columns=['PredNeg', 'PredPos']))
print(classification_report(df['label'], df['comp_score']))

0.7091

| | PredNeg | PredPos |
|---|---|---|
| TrueNeg | 2623 | 2474 |
| TruePos | 435 | 4468 |


              precision    recall  f1-score   support

         neg       0.86      0.51      0.64      5097
         pos       0.64      0.91      0.75      4903

   micro avg       0.71      0.71      0.71     10000
   macro avg       0.75      0.71      0.70     10000
weighted avg       0.75      0.71      0.70     10000
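The figures in the report follow directly from the four confusion-matrix counts. A minimal sketch that recomputes them by hand, with the counts copied from the output above (the variable names are illustrative):

```python
# Counts copied from the confusion matrix
tn, fp = 2623, 2474   # true 'neg' reviews: predicted neg / predicted pos
fn, tp = 435, 4468    # true 'pos' reviews: predicted neg / predicted pos

accuracy = (tn + tp) / (tn + fp + fn + tp)

# Precision: of everything predicted as a class, how much really was that class
neg_precision = tn / (tn + fn)
pos_precision = tp / (tp + fp)

# Recall: of everything truly in a class, how much was predicted as that class
neg_recall = tn / (tn + fp)
pos_recall = tp / (tp + fn)


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(f"accuracy {accuracy:.4f}")
print(f"neg  precision {neg_precision:.2f}  recall {neg_recall:.2f}  f1 {f1(neg_precision, neg_recall):.2f}")
print(f"pos  precision {pos_precision:.2f}  recall {pos_recall:.2f}  f1 {f1(pos_precision, pos_recall):.2f}")
```

The low recall for 'neg' (about half of the negative reviews are predicted as positive) is what pulls overall accuracy down to roughly 71%.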

