# Using the spaCy Package for Semantic Similarity
Python has several packages for natural language processing, including spaCy and NLTK. Here, we will look at spaCy.

## Installing spaCy and the English Pipeline
To use this package, we first need to install spaCy:

`pip install -U spacy`

We also need to download a trained language pipeline before we can use spaCy's tools:

`python -m spacy download en_core_web_lg`

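*(These commands are the same on macOS, Windows, and Linux, although depending on your setup you may need `python3` or `py` in place of `python`.)* Note that en_core_web_lg is a big download (around 500 MB). There are also the smaller en_core_web_sm and en_core_web_md pipelines. The small one does not ship with word vectors and the medium one has a reduced vector table, so we will use the large one, whose larger built-in vector table allows for better comparisons (our primary goal).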

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_lg")
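
Since the choice of pipeline hinges on word vectors, it can be worth checking what the loaded pipeline actually contains. The following is a minimal sketch, not part of the original notebook: the helper names `vector_table_summary` and `has_word_vectors` are made up for illustration, and they only read the `nlp.vocab.vectors.shape` attribute of whatever pipeline object is passed in.

```python
def vector_table_summary(nlp):
    """Return (number_of_vectors, vector_width) for a loaded spaCy pipeline."""
    # `vocab.vectors.shape` is (rows, dimensions) of the static vector table.
    n_vectors, width = nlp.vocab.vectors.shape
    return n_vectors, width


def has_word_vectors(nlp):
    """Return True if the pipeline ships a non-empty static vector table."""
    n_vectors, width = vector_table_summary(nlp)
    return n_vectors > 0 and width > 0
```

Calling `has_word_vectors(nlp)` on the pipeline loaded above should return `True` for en_core_web_lg. A pipeline without static vectors would be expected to return `False`.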

## Basic Word and Sentence Similarity
Now we can use spaCy to compare words, sentences, groups of sentences, and much more. For now, we'll start with the simple tasks.

### Word Comparisons

In [20]:
# Specify words we want to compare
w1 = 'red'
w2 = 'bowling'
w3 = 'blue'
w4 = 'burgundy'

In [21]:
# Turn words into spacy objects
sw1 = nlp(w1)
sw2 = nlp(w2)
sw3 = nlp(w3)
sw4 = nlp(w4)

In [22]:
s1 = sw1.similarity(sw2)
s2 = sw1.similarity(sw3)
s3 = sw1.similarity(sw4)

print(f'Similarity of "{w1}" to "{w2}": {s1:.2f}')
print(f'Similarity of "{w1}" to "{w3}": {s2:.2f}')
print(f'Similarity of "{w1}" to "{w4}": {s3:.2f}')


Similarity of "red" to "bowling": 0.23
Similarity of "red" to "blue": 0.81
Similarity of "red" to "burgundy": 0.55


We can see here that "red" and "blue" are the most similar pair, which makes sense because both are basic color terms that tend to be used in the same kinds of contexts. On the other hand, even though "red" and "burgundy" are closer as actual colors than "red" and "blue" are, "burgundy" is a more descriptive term that is used for extra specificity, so swapping it in for "red" changes the context more than swapping in "blue" does. As expected, "bowling" and "red" had the lowest similarity.
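
For reference, spaCy's `similarity` method defaults to the cosine similarity between the two objects' vectors. Written out for two vectors $u$ and $v$ (symbols introduced here purely for illustration):

```latex
\operatorname{sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
```

As a small worked case, $u = (1, 0)$ and $v = (1, 1)$ give $1 / (1 \cdot \sqrt{2}) \approx 0.71$. The score depends only on the angle between the vectors, not on their lengths.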

### Sentence Comparisons

In [29]:
s1 = nlp("The big red dog could barely fit through the door.")
s2 = nlp("The large blue cat hardly made it through the entrance.")
s3 = nlp("Do you want pizza or spaghetti for dinner tonight?")
ss1 = s1.similarity(s2)
ss2 = s1.similarity(s3)

print(f'Similarity between: \n1. {s1}\n2. {s2}\nSimilarity: {ss1:.2f}\n')
print(f'Similarity between: \n1. {s1}\n3. {s3}\nSimilarity: {ss2:.2f}')

Similarity between: 
1. The big red dog could barely fit through the door.
2. The large blue cat hardly made it through the entrance.
Similarity: 0.88

Similarity between: 
1. The big red dog could barely fit through the door.
3. Do you want pizza or spaghetti for dinner tonight?
Similarity: 0.42


Here, we see that sentence 1 is very similar to sentence 2, with a similarity of 0.88. They share an almost identical sentence structure and very similar vocabulary. The similarity between sentence 1 and sentence 3 is much lower (0.42), which also makes sense since the two sentences are about entirely different things.

In [31]:
s4 = nlp("The barely red door fit the big red dog.")
ss3 = s1.similarity(s4)
print(f'Similarity between: \n1. {s1}\n4. {s4}\nSimilarity: {ss3:.2f}')

Similarity between: 
1. The big red dog could barely fit through the door.
4. The barely red door fit the big red dog.
Similarity: 0.95


Here we've used mostly the same words to construct sentence 4, but the different phrasing changes the meaning. Nevertheless, the similarity between sentences 1 and 4 is 0.95, which is even higher than the similarity between sentences 1 and 2.

## Using spaCy for Our Project
spaCy includes a number of features for natural language processing and its applications. A few of them could be particularly useful for our project:

**Sentence Boundary Detection (SBD):** Finding and segmenting individual sentences. We can use this to compare individual sentences to each other when we want to match an input to the movie script line with the highest similarity.

**Similarity:** Comparing words, text spans, and documents to measure how similar they are to each other. The application here is direct: we want to measure how similar an input is to the lines of each movie character.

**Training:** Updating and improving a statistical model's predictions. This could help us identify a particular character more accurately.
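
The first two features can be combined into a rough sketch of the matching step. The function name `most_similar_sentence` and its arguments are hypothetical, not part of spaCy. The code assumes `nlp` is a loaded pipeline that sets sentence boundaries and has word vectors (such as en_core_web_lg), and relies only on `doc.sents` and `similarity`.

```python
def most_similar_sentence(nlp, user_input, script_text):
    """Return (sentence_text, score) for the script sentence closest to user_input.

    Returns (None, -inf) if the script contains no sentences.
    """
    query = nlp(user_input)
    script = nlp(script_text)

    best_text, best_score = None, float("-inf")
    # `script.sents` yields one span per detected sentence.
    for sent in script.sents:
        score = query.similarity(sent)
        if score > best_score:
            best_text, best_score = sent.text.strip(), score
    return best_text, best_score
```

With the pipeline from earlier, a call might look like `most_similar_sentence(nlp, "Want some pizza?", script)`, where `script` is a string holding the movie lines. This is one possible starting point rather than a finished matcher, since it inherits the default similarity's limitations.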

By default, the similarity of Doc and Span objects is computed from the average of their token (word) vectors, which means it does not take the ordering of the words into account. This is why sentences 1 and 4 had the highest similarity even though their meanings differ. We'll want to change this if possible, since we want to take phrasing into account.
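
The order-insensitivity of averaging can be illustrated without spaCy at all. The sketch below uses made-up 2-dimensional "word vectors" (the values in `TOY_VECTORS` are invented and do not come from any model) and shows that two sentences built from exactly the same words in a different order get the same averaged vector, and therefore a cosine similarity of 1.

```python
import math

# Toy 2-d "word vectors": made-up values, not taken from any spaCy model.
TOY_VECTORS = {
    "the": (0.1, 0.1),
    "dog": (0.9, 0.2),
    "fit": (0.3, 0.8),
    "door": (0.2, 0.7),
}


def average_vector(words):
    """Average the toy vectors of the given words, component by component."""
    vectors = [TOY_VECTORS[w] for w in words]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


original = ["the", "dog", "fit", "the", "door"]
reordered = ["the", "door", "fit", "the", "dog"]  # same words, different meaning

similarity = cosine(average_vector(original), average_vector(reordered))
print(f"{similarity:.2f}")  # prints 1.00
```

Sentences 1 and 4 in the notebook do not contain exactly the same words, which is consistent with their score being 0.95 rather than 1.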

*Sources:*

https://spacy.io/usage/spacy-101

https://www.sciencedirect.com/science/article/pii/S1877050919313791

