In [0]:
import pandas as pd

In [0]:
docs = list(pd.read_csv('yelp.csv')['text'])

# Print the first review
print(docs[0])

# Context-free sentence similarity detection: tf-idf

tf-idf stands for: Term Frequency - Inverse Document Frequency

It is a metric for how important a term is to a specific document within a set of available documents. It is calculated with the following equation:

**tfidf(t, d, D) = tf(t, d) * idf(t, D)**

where:
    
| Variable | Meaning                                                  |
| -------- | -------------------------------------------------------- |
| t        | The term to find                                         |
| d        | The document in which the term's frequency is measured   |
| D        | Set of all documents                                     |
| tf()     | Function to calculate the term frequency                 |
| idf()    | Function to calculate the inverse document frequency     |

## Calculating term frequency: tf(t, d)

Term Frequency refers to how often a specific term appears in a specific document. The simplest method is to count the number of times the term appears in the document. Let this raw count be denoted as $f_{t,d}$. By default, `gensim.models.TfidfModel` calculates the term frequency weighting using this approach. Below is basic pseudocode showing how $f_{t,d}$ could be implemented:

```python
def raw_count(t, d):
    """Return how many times term t occurs in document d (a list of tokens)."""
    count = 0
    for term in d:
        if t == term:
            count += 1
    return count
```

However, raw counts are biased toward longer documents, which tend to contain more occurrences of every term. Dividing $f_{t,d}$ by the count of the most frequent term in the document gives frequency values that are more comparable across several documents. To balance short documents, an offset can be added to account for their size. Below is one form of this augmented term frequency function:

$$ 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\limits_{t' \in d}(f_{t',d})} $$
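As a rough illustration, augmented term frequency can be sketched in plain Python. This sketch uses the commonly cited form $0.5 + 0.5 \cdot f_{t,d} / \max_{t'} f_{t',d}$, in which the ratio is also scaled so the result stays between 0.5 and 1. The function name `augmented_tf`, the `offset` parameter, and the sample document are hypothetical, and a document is assumed to be a list of tokens.

```python
from collections import Counter


def augmented_tf(t, d, offset=0.5):
    """Augmented term frequency of term t in document d (a list of tokens).

    Assumes the form offset + (1 - offset) * f(t, d) / max f(t', d).
    """
    counts = Counter(d)
    if not counts:
        # Assumption: an empty document gives every term only the offset.
        return offset
    most_frequent = max(counts.values())
    return offset + (1 - offset) * counts[t] / most_frequent


# Hypothetical tokenized document
doc = ["good", "food", "good", "service"]
tf_good = augmented_tf("good", doc)   # most frequent term
tf_food = augmented_tf("food", doc)   # appears half as often
tf_bad = augmented_tf("bad", doc)     # absent from the document
print(tf_good, tf_food, tf_bad)
```

The most frequent term always scores 1 and an absent term scores the offset, regardless of how long the document is.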

`gensim.models.TfidfModel` lets you change its term frequency weighting. See the following URL for the class documentation:

> https://radimrehurek.com/gensim/models/tfidfmodel.html

## Calculating Inverse Document Frequency: idf(t, D)

Inverse Document Frequency is a heuristic that measures how much information a term provides, based on how many of the given documents contain it. One possible way to define inverse document frequency is the following:

$$ \log \frac{|D|}{|\{d \mid d \in D, t \in d \}|} $$

where $|D|$ is the number of documents and $|\{d \mid d \in D, t \in d\}|$ is the number of documents that the term appears in.

Examining this function, we can see the inverse relationship between idf and the number of documents that contain the term: the fewer documents the term appears in, the higher its idf value.
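To make the relationship concrete, here is a minimal sketch of the idf formula above, combined with the raw count into a tf-idf score. It assumes the natural logarithm (the formula does not fix a base), treats each document as a list of tokens, and returns 0 for a term that appears in no document, which is a choice made here only to avoid division by zero. The function names and the tiny corpus are hypothetical.

```python
import math


def idf(t, D):
    """Inverse document frequency of term t over the corpus D."""
    n_containing = sum(1 for d in D if t in d)
    if n_containing == 0:
        # Assumption: a term found in no document gets an idf of 0.
        return 0.0
    return math.log(len(D) / n_containing)


def tfidf(t, d, D):
    """Raw count of t in d multiplied by the idf of t over D."""
    return d.count(t) * idf(t, D)


# Hypothetical corpus of three tokenized documents
D = [
    ["great", "food"],
    ["bad", "food"],
    ["great", "service", "great"],
]

print(idf("food", D))     # appears in 2 of 3 documents
print(idf("service", D))  # appears in 1 of 3 documents, so a higher idf
print(tfidf("great", D[2], D))
```

A term found in every document would get an idf of $\log 1 = 0$, so it contributes nothing to the score no matter how often it appears.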

By default, `gensim.models.TfidfModel` uses a function similar to the one above to calculate the inverse document frequency. As with term frequency, the class lets you change its inverse document frequency weighting. See the following URL for the class documentation:

> https://radimrehurek.com/gensim/models/tfidfmodel.html

In [0]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess
from gensim.similarities import MatrixSimilarity

In [0]:
#Tokenize words in each document

#Create a dictionary - assign indices to the words in the documents

In [0]:
#doc_test = "The food was terrible. The service was unprofessional"
doc_test = "I loved this place. The food was great. The staff was professional"

# Context-sensitive sentence similarity detection: word2vec

There are two possible methods for learning the word representations:

- continuous skip-gram
    - Given the middle word, predict the surrounding words.
- continuous bag-of-words
    - Given the surrounding words, predict the middle word.
    
In either of the methods listed above, cosine similarity is used to measure sentence similarity.
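The difference between the two methods is easiest to see in the training examples each one builds from a sentence. The sketch below only generates those (input, target) examples and does not train a model. The function names, the `window` parameter, and the sample sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: the middle word is the input, each surrounding word a target."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: the surrounding words are the input, the middle word the target."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        context = tokens[start:i] + tokens[i + 1:stop]
        pairs.append((context, center))
    return pairs


# Hypothetical tokenized sentence
tokens = ["the", "food", "was", "great"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_pairs(tokens, window=1)
print(sg)
print(cb)
```

Skip-gram produces one example per (middle word, neighbor) pair, while continuous bag-of-words produces one example per position.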

## Cosine Similarity

To determine how similar two documents are, we can rearrange the Euclidean dot product formula to get the following equation for vectors A and B:

$$ \text{similarity} = |\cos(\theta)| = \left| \frac{A \cdot B}{\|A\| \, \|B\|} \right| $$

From this equation, parallel vectors have the maximum similarity of 1, and perpendicular vectors have a similarity of 0.
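The equation can be checked with a few lines of plain Python. This is a minimal sketch: the function name `cosine_similarity` is hypothetical, vectors are assumed to be equal-length lists of numbers, and returning 0 for a zero-length vector is a choice made here because the formula is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Absolute cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: similarity with a zero vector is treated as 0.
        return 0.0
    return abs(dot / (norm_a * norm_b))


parallel = cosine_similarity([1, 2], [2, 4])
perpendicular = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
print(parallel, perpendicular, opposite)
```

Because of the absolute value, vectors pointing in opposite directions score the same as parallel ones.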

In [0]:
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [0]:
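# `model` is assumed to be a Doc2Vec model already trained on the reviews
test_data = simple_preprocess("I loved the food")
v = model.infer_vector(test_data)

# Most similar training documents (in gensim >= 4.0, `docvecs` is named `dv`)
sims = model.docvecs.most_similar([v])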

In [0]:
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

In [0]:
word_list = ['the', 'professional', 'best', 'good', 'bad', 'amazing', 'awful', 'awesome', 'food', 'service']

In [0]:
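# X_embedded is assumed to hold the 2-D t-SNE coordinates computed earlier,
# with 10 consecutive rows for each word in word_list.
fig = plt.figure()
ax = fig.add_subplot(111)
colors = ['r', 'g', 'b', 'magenta', 'cyan', 'brown', 'black', 'orange', 'purple', 'yellow']
for i, word in enumerate(word_list):
    pts = X_embedded[10*i:10*i+10]
    # One color and one legend entry per word
    ax.scatter(pts[:,0], pts[:,1], c=colors[i], label=word)
ax.legend()
plt.show()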