# Goals of this notebook
- Semantic Vectors
- Sentiment Analysis

## 1. Word Vectors
Word vectors, also known as word embeddings, are numerical vector representations of words that capture semantic meaning by placing words with similar meanings close to each other in a multi-dimensional space. These vectors are used extensively in natural language processing (NLP) tasks such as word similarity, sentiment analysis, and machine translation.

Word embeddings are typically learned from large text corpora using various methods. Some popular algorithms for generating word vectors include Word2Vec, GloVe (Global Vectors for Word Representation), and FastText.

### 1.1 Word2Vec
Word2Vec is a neural network-based approach to learning distributed representations of words, i.e., word embeddings. It was introduced by Mikolov et al. in 2013. The main idea is to learn a vector representation for each word such that words with similar meanings are close to each other in the vector space. It comes in two main variants: **Continuous Bag of Words (CBOW)** and **Skip-gram**.

### 1.2 Skip-gram Model
In the Skip-gram model, the objective is to predict the context words given a target word. The model maximizes the probability of the context words given the target word over a large corpus of text.

The objective function for the Skip-gram model is:

$$ \max \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} \mid w_t) $$

Where:
- $ T $ is the total number of words in the corpus
- $ c $ is the context window size
- $ w_t $ is the target word at position $ t $
- $ w_{t+j} $ are the context words within a window of size $ c $ around the target word.
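The double sum in the objective runs over (target, context) pairs. A minimal sketch of how such pairs could be enumerated is shown here; the function name `skipgram_pairs` and the example sentence are illustrative assumptions, not part of Word2Vec itself.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every position t and offset j != 0."""
    pairs = []
    for t, target in enumerate(tokens):
        # Context positions t + j with -window <= j <= window, clipped to the sentence
        start = max(0, t - window)
        stop = min(len(tokens), t + window + 1)
        for pos in range(start, stop):
            if pos != t:
                pairs.append((target, tokens[pos]))
    return pairs


sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

Each pair contributes one $ \log P(w_{t+j} \mid w_t) $ term to the objective.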

The probability $ P(w_O \mid w_I) $ is usually computed using the softmax function.

$$ P(w_O \mid w_I) = \frac{\exp(\mathbf{v}_{w_O} \cdot \mathbf{v}_{w_I})}{\sum_{w \in V} \exp(\mathbf{v}_w \cdot \mathbf{v}_{w_I})} $$

Where:
- $ \mathbf{v}_{w_O} $ is the vector representation of the context (output) word $ w_O $.
- $ \mathbf{v}_{w_I} $ is the vector representation of the input (target) word $ w_I $.
- $ V $ is the vocabulary, so the sum in the denominator runs over every word in it.
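To make the softmax concrete, here is a small sketch that evaluates it over a three-word vocabulary. The vectors in `toy_vectors` are made up for illustration, and the helper name `softmax_probability` is an assumption.

```python
import math


def softmax_probability(context_word, target_word, vectors):
    """P(w_O | w_I): exponentiated dot product, normalized over the vocabulary."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    v_in = vectors[target_word]
    scores = {word: dot(vec, v_in) for word, vec in vectors.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {word: math.exp(s - max_score) for word, s in scores.items()}
    return exp_scores[context_word] / sum(exp_scores.values())


# Hypothetical 2-d vectors, for illustration only
toy_vectors = {
    "cat": [1.0, 0.5],
    "pet": [0.9, 0.4],
    "car": [-0.8, 0.3],
}
probs = {w: softmax_probability(w, "cat", toy_vectors) for w in toy_vectors}
print(probs)
```

The denominator touches every word in the vocabulary, which is why the full softmax becomes expensive for realistic vocabulary sizes.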

### 1.3 Continuous Bag of Words (CBOW) Model
The CBOW model is the inverse of the Skip-gram model. It predicts the target word given the context words. The objective is to maximize the probability of the target word given the context words.

The objective function for the CBOW model is:

$$ \max \frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) $$

The probability is computed using the softmax function in a similar manner to the Skip-gram model.
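As a rough sketch, CBOW training examples can be built by pairing each word with the words around it. The name `cbow_examples` and the example sentence are illustrative assumptions.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target) examples, one per position in the sentence."""
    examples = []
    for t, target in enumerate(tokens):
        left = tokens[max(0, t - window):t]    # up to `window` words before t
        right = tokens[t + 1:t + window + 1]   # up to `window` words after t
        context = left + right
        if context:
            examples.append((context, target))
    return examples


sentence = ["the", "cat", "sat", "on", "the", "mat"]
examples = cbow_examples(sentence, window=2)
print(examples[2])  # (['the', 'cat', 'on', 'the'], 'sat')
```

Compared with Skip-gram, each position yields a single example whose input is the whole context rather than one example per context word.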

### 1.4 Negative Sampling
To make the training more efficient, Word2Vec uses a technique called negative sampling. Instead of computing the softmax over the entire vocabulary, it samples a few negative examples (words that are not in the context) for each positive example (word in the context).

The objective function with negative sampling is:

$$ \log \sigma(\mathbf{v}_{w_O} \cdot \mathbf{v}_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{v}_{w_i} \cdot \mathbf{v}_{w_I}) \right] $$

Where:
- $ \sigma $ is the sigmoid function
- $ k $ is the number of negative samples
- $ P_n(w) $ is the noise distribution from which negative samples are drawn
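The objective above can be evaluated directly for one (target, context) pair. The sketch below uses made-up vectors and counts, approximates each expectation with a single sampled word, and assumes a noise distribution proportional to unigram counts raised to the 3/4 power; all names are illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_objective(v_context, v_target, negative_vectors):
    """log sigma(v_O . v_I) + sum over negatives of log sigma(-v_i . v_I)."""
    positive_term = math.log(sigmoid(dot(v_context, v_target)))
    negative_term = sum(
        math.log(sigmoid(-dot(v_neg, v_target))) for v_neg in negative_vectors
    )
    return positive_term + negative_term


# Hypothetical toy vectors and corpus counts, for illustration only
vectors = {"cat": [1.0, 0.5], "pet": [0.9, 0.4], "car": [-0.8, 0.3], "road": [-0.6, 0.1]}
counts = {"car": 50, "road": 30}

# Assumed noise distribution P_n(w): unigram counts raised to the 3/4 power
noise_words = list(counts)
noise_weights = [counts[w] ** 0.75 for w in noise_words]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
negatives = rng.choices(noise_words, weights=noise_weights, k=2)

objective = negative_sampling_objective(
    vectors["pet"], vectors["cat"], [vectors[w] for w in negatives]
)
print(negatives, round(objective, 4))
```

Only $ k + 1 $ dot products are needed per training pair, instead of one per vocabulary word.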

### 1.5 Visual Representation of CBOW and Skip-gram
![CBOW & Skip-gram](https://www.researchgate.net/profile/Wang-Ling-16/publication/281812760/figure/fig1/AS:613966665486361@1523392468791/Illustration-of-the-Skip-gram-and-Continuous-Bag-of-Word-CBOW-models.png)

## 2. Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are in terms of their orientation, irrespective of their magnitude. It’s commonly used in text analysis, particularly with word embeddings, to compare the similarity between two words or documents.

Cosine similarity between vectors $ \mathbf{A} $ and $ \mathbf{B} $ is:

$$ \text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $$

Where:
- $ A \cdot B $ is the dot product of vectors $ A $ and $ B $.
- $ \|A\| $ and $ \|B\| $ are the magnitudes or norms of vectors $ A $ and $ B $, respectively.

Cosine similarity ranges from -1 to 1:
- $ 1 $ indicates that the vectors point in the same direction (they need not have the same magnitude).
- $ 0 $ indicates orthogonality (i.e., no similarity).
- $ -1 $ indicates that the vectors are diametrically opposed (i.e., pointing in opposite directions).

### 2.1 Dot Product
The dot product of two vectors measures their similarity by calculating the sum of the products of their corresponding components. The formula for the dot product of vectors $ A $ and $ B $ is:

$$ \mathbf{A} \cdot \mathbf{B} = \sum_{i=1}^{n} A_i B_i $$

**Example**:

For vectors $ A = [1, 2, 3] $ and $ B = [4, 5, 6] $:

$$ A \cdot B = (1 \cdot 4) + (2 \cdot 5) + (3 \cdot 6) = 4 + 10 + 18 = 32 $$

### 2.2 Magnitude (Norm)
The magnitude (or norm) of a vector is a measure of its length. For a vector $ A $, the magnitude is calculated as:

$$ \|\mathbf{A}\| = \sqrt{\sum_{i=1}^{n} A_i^2} $$

**Example**:

For vector $ A = [1, 2, 3] $

$$ \|\mathbf{A}\| = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{1 + 4 + 9} = \sqrt{14} \approx 3.74 $$

For vector $ B = [4, 5, 6] $

$$ \|\mathbf{B}\| = \sqrt{4^2 + 5^2 + 6^2} = \sqrt{16 + 25 + 36} = \sqrt{77} \approx 8.77 $$

### 2.3 Cosine Similarity (Calculation)
Once we have the dot product and the magnitudes, we substitute them into the formula:

$$ \text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) = \frac{32}{\sqrt{14} \cdot \sqrt{77}} \approx \frac{32}{32.83} \approx 0.97 $$
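The worked example can be checked with a few lines of plain Python. The helper names (`dot`, `norm`, `cosine_sim`) are chosen here for illustration.

```python
import math


def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))


A = [1, 2, 3]
B = [4, 5, 6]
print(dot(A, B))                         # 32
print(round(norm(A), 2), round(norm(B), 2))  # 3.74 8.77
print(round(cosine_sim(A, B), 2))        # 0.97
```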

### 2.4 Cosine Distance
Cosine distance is derived from cosine similarity and is defined as

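$$ \text{Cosine Distance}(\mathbf{A}, \mathbf{B}) = 1 - \text{Cosine Similarity}(\mathbf{A}, \mathbf{B}) $$

Cosine distance is useful when you want a measure where smaller means more similar: 0 indicates vectors pointing in the same direction, 1 indicates orthogonal vectors, and 2 indicates vectors pointing in opposite directions.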
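Because cosine similarity ranges from -1 to 1, the distance defined this way ranges from 0 to 2. A short sketch with hand-picked vectors shows the three reference cases; the function name `cosine_distance` is an illustrative choice.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1, 0], [2, 0]))   # 0.0 (same direction)
print(cosine_distance([1, 0], [0, 3]))   # 1.0 (orthogonal)
print(cosine_distance([1, 0], [-1, 0]))  # 2.0 (opposite directions)
```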

## 3. Word Vectors with spaCy

In [65]:
import spacy

In [66]:
# only the medium and large models contain word vectors
NLP = spacy.load("en_core_web_lg")

Each token's vector is an array of floating-point numbers that encodes information about the word, learned by observing the contexts in which the word occurs.

You can read more in the [spaCy word vectors](https://spacy.io/usage/linguistic-features#vectors-similarity) guide and the TensorFlow [word2vec](https://www.tensorflow.org/text/tutorials/word2vec#:~:text=word2vec%20is%20not%20a%20singular,downstream%20natural%20language%20processing%20tasks.) tutorial.

In [67]:
# Vector components for the word cat
NLP(u"cat").vector

array([ 3.7032e+00,  4.1982e+00, -5.0002e+00, -1.1322e+01,  3.1702e-02,
       -1.0255e+00, -3.0870e+00, -3.7327e+00,  5.3875e-01,  3.5679e+00,
        6.9276e+00,  1.5793e+00,  5.1188e-01,  3.1868e+00,  6.1534e+00,
       -4.8941e+00, -2.9959e-01, -3.6276e+00,  2.3825e+00, -1.4402e+00,
       -4.7577e+00,  4.3607e+00, -4.9814e+00, -3.6672e+00, -1.8052e+00,
       -2.1888e+00, -4.2875e+00,  5.5712e+00, -5.2875e+00, -1.8346e+00,
       -2.2015e+00, -7.7091e-01, -4.8260e+00,  1.2464e+00, -1.7945e+00,
       -8.1280e+00,  1.9994e+00,  1.1413e+00,  3.8032e+00, -2.8783e+00,
       -4.2136e-01, -4.4177e+00,  7.7456e+00,  4.9535e+00,  1.7402e+00,
        1.8275e-01,  2.4218e+00, -3.1496e+00, -3.8057e-02, -2.9818e+00,
        8.3396e-01,  1.1531e+01,  3.5684e+00,  2.5970e+00, -2.8438e+00,
        3.2755e+00,  4.5674e+00,  3.2219e+00,  3.4206e+00,  1.1200e-01,
        1.0303e-01, -5.8396e+00,  4.6370e-01,  2.7750e+00, -5.3713e+00,
       -5.0247e+00, -2.0212e+00,  5.8772e-01,  1.1569e+00,  1.32

In [68]:
tokens = NLP(u"cat pet bird")

In [69]:
# similarity for each token with every other token
for first_token in tokens:
    for second_token in tokens:
        print(f"{first_token.text:{5}} and {second_token.text:{5}} similarity {round(first_token.similarity(second_token), 2) * 100}%")

cat   and cat   similarity 100.0%
cat   and pet   similarity 73.0%
cat   and bird  similarity 54.0%
pet   and cat   similarity 73.0%
pet   and pet   similarity 100.0%
pet   and bird  similarity 37.0%
bird  and cat   similarity 54.0%
bird  and pet   similarity 37.0%
bird  and bird  similarity 100.0%


## 4. Vector Arithmetic

In [70]:
from sklearn.metrics.pairwise import cosine_similarity

In [71]:
technology = NLP.vocab["technology"].vector
computer = NLP.vocab["computer"].vector
software = NLP.vocab["software"].vector
internet = NLP.vocab["internet"].vector
innovation = NLP.vocab["innovation"].vector

In [72]:
new_vector = technology - computer + software

In [73]:
similar_vects = []

# Iterate over words in the vocabulary
for word in NLP.vocab:
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity([new_vector], [word.vector])[0][0]
        similar_vects.append((word, similarity))

In [74]:
similar_vects = sorted(similar_vects, key=lambda item: -item[1])

In [75]:
# top 5 similar words
for t in similar_vects[:5]:
    print(t[0].text)

technology
software
innovation
internet
and
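In the output above, the input words themselves rank highest, which is common for this kind of query. One option is to skip the input words when ranking. The sketch below illustrates that with made-up 2-d vectors and a hypothetical helper `most_similar`; it is not spaCy's API.

```python
import heapq
import math


def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def most_similar(query, vectors, exclude=(), k=3):
    """Return the k words closest to `query`, skipping the words in `exclude`."""
    scored = (
        (word, cosine_sim(query, vec))
        for word, vec in vectors.items()
        if word not in exclude
    )
    return heapq.nlargest(k, scored, key=lambda item: item[1])


# Hypothetical 2-d vectors (royalty, gender), for illustration only
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
}
query = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
result = most_similar(query, toy, exclude={"king", "man", "woman"}, k=2)
print(result[0][0])  # queen
```

`heapq.nlargest` avoids sorting the full vocabulary when only the top few matches are needed.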


We can extend this approach to whole documents. The steps remain largely the same.

In [59]:
import numpy as np

In [76]:
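def document_vector(document):
    # Keep content tokens only: drop stop words and tokens without a vector
    tokens = [token for token in document if not token.is_stop and token.has_vector]
    # Average the remaining token vectors into a single document vector
    return np.mean([token.vector for token in tokens], axis=0)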

In [89]:
doc_one = "AI and machine learning are being rapidly transforming the technology landscape."
doc_two = "Let's eat something"

In [90]:
doc_one = NLP(doc_one)
doc_two = NLP(doc_two)

doc_one_vect = document_vector(doc_one)
doc_two_vect = document_vector(doc_two)

In [91]:
similarity = cosine_similarity([doc_one_vect], [doc_two_vect])[0][0]

In [94]:
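# Note: this prints spaCy's built-in Doc.similarity, not the `similarity` value computed above from the document vectors
print("Similarity between doc_one and doc_two", round(doc_one.similarity(doc_two), 2) * 100, "%")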

Similarity between doc_one and doc_two 20.0 %
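Averaging assumes there is at least one vector to average. A document with no usable tokens would leave nothing to average, so a guard can help. This is a plain-Python sketch with a hypothetical helper `mean_vector`; returning a zero vector for the empty case is one possible choice, not the notebook's behavior.

```python
def mean_vector(vectors, dim):
    """Component-wise mean of a list of vectors; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(mean_vector([[1.0, 2.0], [3.0, 4.0]], dim=2))  # [2.0, 3.0]
print(mean_vector([], dim=2))                        # [0.0, 0.0]
```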


## 5. Sentiment Analysis
Sentiment analysis is the process of determining the emotional tone behind a piece of text. It can be used to identify the sentiment expressed in reviews, social media posts, or any other text data.

**Approaches to sentiment analysis**

### 5.1 Lexicon-Based Approach
- Uses predefined dictionaries of words with associated sentiment scores.
- Words in the text are matched against these dictionaries, and the sentiment score of the text is calculated based on the matched words.
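The lexicon lookup described above can be sketched in a few lines. The word scores in `TOY_LEXICON` are invented for illustration and do not come from any real sentiment lexicon; averaging the matched scores is one simple aggregation choice among several.

```python
import re

# Hypothetical sentiment scores, for illustration only
TOY_LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "like": 1.0,
    "bad": -1.0,
    "mess": -1.5,
    "terrible": -2.0,
}


def lexicon_score(text, lexicon):
    """Average the scores of the words that appear in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    matched = [lexicon[word] for word in words if word in lexicon]
    return sum(matched) / len(matched) if matched else 0.0


print(lexicon_score("This product is fine, I like it.", TOY_LEXICON))          # 1.0
print(lexicon_score("Man what a mess! this thing is terrible.", TOY_LEXICON))  # -1.75
```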

### 5.2 Machine Learning Approach
- Uses supervised learning techniques where a model is trained on a labeled dataset containing text and corresponding sentiment labels.
- Common classifiers include Naive Bayes, Support Vector Machines (SVM), and deep learning models like Recurrent Neural Networks (RNN) and Transformers.

### 5.3 Hybrid Approach
- Combines both lexicon-based and machine learning methods to leverage the strengths of both.

### 5.4 Sentiment Analysis using VADER
(Valence Aware Dictionary and sEntiment Reasoner)

In [95]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [96]:
nltk.download("vader_lexicon")

[nltk_data] Downloading package vader_lexicon to C:\Users\AR-
[nltk_data]     LABS\AppData\Roaming\nltk_data...


True

In [177]:
analyzer = SentimentIntensityAnalyzer()

In [178]:
def analyze_sentiment_using_vader(text):
    # VADER scores the raw string directly, so no spaCy processing is needed
    return analyzer.polarity_scores(text)

In [179]:
doc_one = "This product is fine, I like it."
doc_two = "Man what a mess! this thing is terrible."

In [180]:
doc_one_sent = analyze_sentiment_using_vader(doc_one)
doc_two_sent = analyze_sentiment_using_vader(doc_two)

In [181]:
print("First document sentiment: ", doc_one_sent)
print("Second document sentiment: ", doc_two_sent)

First document sentiment:  {'neg': 0.0, 'neu': 0.482, 'pos': 0.518, 'compound': 0.5106}
Second document sentiment:  {'neg': 0.541, 'neu': 0.459, 'pos': 0.0, 'compound': -0.7088}


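As the output shows, the first document received a positive score of 0.518 and a neutral score of 0.482, with an overall compound score of 0.5106.

The second document received a negative score of 0.541 and a neutral score of 0.459, with an overall compound score of -0.7088.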
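The `compound` value is often turned into a single label by thresholding. The sketch below uses the commonly cited cutoffs of 0.05 and -0.05; the function name `label_from_compound` is an assumption, and the cutoffs can be tuned for a given dataset.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_from_compound(0.5106))   # positive
print(label_from_compound(-0.7088))  # negative
print(label_from_compound(0.0))      # neutral
```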

### 5.5 Sentiment Analysis using spaCy
spaCy can perform sentiment analysis through the `spacytextblob` pipeline component, which wraps the [TextBlob](https://github.com/sloria/TextBlob) library.

In [104]:
# re-import to clean things up
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [102]:
NLP = spacy.load("en_core_web_lg")

In [105]:
NLP.add_pipe("spacytextblob")

<spacytextblob.spacytextblob.SpacyTextBlob at 0x23062a17350>

In [107]:
doc = NLP("What an amazing day it is!")

In [110]:
polarity = doc._.blob.polarity
print(f"{round(polarity, 2) * 100}% {'positive' if polarity > 0 else 'negative' if polarity < 0 else 'neutral'}")

75.0% positive


In [111]:
doc = NLP("Horrible mess it is mate!")

In [112]:
polarity = doc._.blob.polarity
print(f"{round(polarity, 2) * 100}% {'positive' if polarity > 0 else 'negative' if polarity < 0 else 'neutral'}")

-61.0% negative


### 5.6 Sentiment Analysis using Machine Learning
We can also train a machine learning classifier, such as Naive Bayes or a support vector classifier, to perform sentiment analysis.

The datasets come from the UCI [Sentiment Labelled Sentences](https://archive.ics.uci.edu/dataset/331/sentiment+labelled+sentences) collection.

In [182]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [183]:
yelp_reviews = pd.read_csv(
    "./sentiment labelled sentences/yelp_labelled.txt",
    sep="\t",
    header=None
)

In [184]:
cols = ["Review", "Label"]

In [185]:
yelp_reviews.columns = cols

In [186]:
yelp_reviews.head(5)

|   | Review | Label |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [187]:
amazon_reviews = pd.read_csv(
    "./sentiment labelled sentences/amazon_cells_labelled.txt",
    sep="\t",
    header=None
)

In [188]:
amazon_reviews.columns = ["Review", "Label"]

In [189]:
amazon_reviews.head(5)

|   | Review | Label |
|---|--------|-------|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [190]:
imdb_reviews = pd.read_csv(
    "./sentiment labelled sentences/imdb_labelled.txt",
    sep="\t",
    header=None
)

In [191]:
imdb_reviews.columns = ["Review", "Label"]

In [192]:
imdb_reviews.head(5)

|   | Review | Label |
|---|--------|-------|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |


In [193]:
print(yelp_reviews.shape)
print(amazon_reviews.shape)
print(imdb_reviews.shape)

(1000, 2)
(1000, 2)
(748, 2)


In [194]:
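# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeated per-file indices
df = pd.concat([yelp_reviews, amazon_reviews, imdb_reviews], axis=0, ignore_index=True)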

In [195]:
df.head()

|   | Review | Label |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [196]:
df.shape

(2748, 2)

In [197]:
df.isnull().sum()

Review    0
Label     0
dtype: int64

In [198]:
X = df["Review"]
y = df["Label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [199]:
classifier_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(dual=False))
])
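To give a feel for what the `tfidf` step produces, here is a plain-Python sketch of TF-IDF weighting. It assumes raw term counts, a smoothed inverse document frequency of $ \ln\frac{1 + n}{1 + \text{df}} + 1 $, and L2 normalization per document, which is intended to mirror the default behavior of `TfidfVectorizer` but is not its implementation. The function name `tfidf` and the two sample sentences are illustrative.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        length = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / length for t, w in weights.items()} if length else {})
    return rows


rows = tfidf(["Crust is not good.", "The mic is great."])
print(rows[0])
```

The word "is" appears in both sentences, so it receives a lower weight than words that appear in only one.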

In [200]:
classifier_pipeline.fit(X_train, y_train)

In [201]:
preds = classifier_pipeline.predict(X_test)

In [202]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.84      0.82      0.83       435
           1       0.81      0.83      0.82       390

    accuracy                           0.82       825
   macro avg       0.82      0.82      0.82       825
weighted avg       0.82      0.82      0.82       825



In [203]:
print(confusion_matrix(y_test, preds))

[[358  77]
 [ 68 322]]
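The figures in the classification report can be recomputed from this confusion matrix. The sketch assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, and treats label 1 as the positive class.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted label
cm = [[358, 77],
      [68, 322]]

tn, fp = cm[0]  # true label 0: predicted 0 (true negative), predicted 1 (false positive)
fn, tp = cm[1]  # true label 1: predicted 0 (false negative), predicted 1 (true positive)

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))   # 0.82
print(round(precision, 2))  # 0.81
print(round(recall, 2))     # 0.83
print(round(f1, 2))         # 0.82
```

These values match the row for class 1 and the overall accuracy in the report.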


In [204]:
classifier_pipeline.predict(["Man you are looking amazing!"])[0]

1

- **1** is POSITIVE
- **0** is NEGATIVE

In [205]:
classifier_pipeline.predict(["Their service is terrible"])[0]

0

## The end