# spaCy

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

### Features

| NAME                              | DESCRIPTION                                                                                                        |
|-----------------------------------|--------------------------------------------------------------------------------------------------------------------|
| Tokenization                      | Segmenting text into words, punctuation marks etc.                                                                 |
| Part-of-speech (POS) Tagging      | Assigning word types to tokens, like verb or noun.                                                                 |
| Dependency Parsing                | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| Lemmatization                     | Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.      |
| Sentence Boundary Detection (SBD) | Finding and segmenting individual sentences.                                                                       |
| Named Entity Recognition (NER)    | Labelling named “real-world” objects, like persons, companies or locations.                                        |
| Entity Linking (EL)               | Disambiguating textual entities to unique identifiers in a knowledge base.                                         |
| Similarity                        | Comparing words, text spans and documents and how similar they are to each other.                                  |
| Text Classification               | Assigning categories or labels to a whole document, or parts of a document.                                        |
| Rule-based Matching               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.       |
| Training                          | Updating and improving a statistical model’s predictions.                                                          |
| Serialization                     | Saving objects to files or byte strings.                                                                           |

In [None]:
import spacy

### Downloading models

https://spacy.io/usage/models

In [None]:
# !python -m spacy download en_core_web_lg
# or you can use this solution from StackOverflow - https://stackoverflow.com/questions/55742788/ssl-certificate-verify-failed-error-while-downloading-python-m-spacy-download

### Creating nlp pipeline

At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp". You can use the nlp object like a function to analyze text. It contains all the different components in the pipeline.
It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages that are available in spacy.lang.

In [None]:
import en_core_web_sm as en_core
nlp = en_core.load()

In [None]:
nlp.pipe_names

### The Doc object

When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.
The Doc behaves like a normal Python sequence by the way and lets you iterate over its tokens, or get a token by its index. But more on that later!

In [None]:
# Created by processing a string of text with the nlp object
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
    
print(type(doc))
doc

### The Token object

Token objects represent the tokens in a document – for example, a word or a punctuation character.
To get a token at a specific position, you can index into the doc.
Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.

<img src="./img/spacy1.png">

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Index into the Doc to get a single Token
token = doc[2]

# Get the token text via the .text attribute
print(token.text)

### Tokenization
Segment text into words, punctuation marks, etc.


In [None]:
tokens = nlp.tokenizer("""
Tokenization segments text into words, punctuation marks, etc. 
It is smarter than regex. 
It won't split the U.K. for example.
""")
list(tokens)

### POS tagging


For each token in the doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.


In [None]:
doc = nlp("She ate the pizza. Billion is a number")

# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)

### Predicting Syntactic Dependencies
In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The .dep_ attribute returns the predicted dependency label.

The .head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

<img src="img/1.png">

In [None]:
doc = nlp("She ate the pizza")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

### Lemmatization

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for 1 billion of dollars")

for token in doc:
    print(token.text, token.lemma_)

### Named Entity Recognition

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

#### Hint: Spacy Explain
To get definitions for the most common tags and labels, you can use the spacy.explain helper function.

For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.

The same works for part-of-speech tags and dependency labels.

In [None]:
print(spacy.explain("GPE"))
print(spacy.explain("nsubj"))
print(spacy.explain("DET"))

# Semantic Similarity

spaCy can compare two objects and predict how similar they are – for example, documents, spans or single tokens.

The Doc, Token and Span objects have a .similarity method that takes another object and returns a floating point number, typically between 0 and 1, indicating how similar they are.

One thing that's very important: in order to use similarity, you need a larger spaCy model that has word vectors included, for example en_core_web_md or en_core_web_lg. The small en_core_web_sm model does not ship with word vectors.

By default, the similarity returned by spaCy is the cosine similarity between two vectors – but this can be adjusted if necessary.
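As a rough illustration of that default, the sketch below averages per-token vectors into one document vector and takes the cosine of two such averages. The three-dimensional vectors in `toy_vectors` are made up for this example (real spaCy vectors are much larger), and `doc_vector` and `cosine` are hypothetical helper names, not spaCy functions.

```python
import math

# Made-up 3-dimensional word vectors, for illustration only.
toy_vectors = {
    "i": [0.1, 0.1, 0.1],
    "like": [0.2, 0.9, 0.1],
    "fast": [0.7, 0.1, 0.3],
    "food": [0.9, 0.2, 0.8],
    "burgers": [0.8, 0.3, 0.9],
    "startup": [0.1, 0.2, -0.9],
}


def doc_vector(words):
    """Average the vectors of all words, component by component."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


fast_food_doc = doc_vector(["i", "like", "fast", "food"])
burgers_doc = doc_vector(["i", "like", "burgers"])
startup_doc = doc_vector(["startup"])

print(cosine(fast_food_doc, burgers_doc))  # close to 1
print(cosine(fast_food_doc, startup_doc))  # much lower
```

Averaging is simple and fast, but it ignores word order, so long documents tend to drift toward a generic vector.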

In [None]:
doc1 = nlp("I like fast food")
doc2 = nlp("I like burgers")
doc3 = nlp("Apple is looking at buying U.K. startup for $1 billion")
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))

https://spacy.io/usage/spacy-101

# Word Embeddings

Similarity is determined using word vectors, multi-dimensional representations of meanings of words.


## Word representation

Recall the one-hot representation:

$$
\begin{array}{cccccc}
\text{boy} & \text{girl} & \text{apple} & \text{orange} & \text{king} & \text{queen} \\
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} &
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} &
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} &
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} &
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} &
\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} \\
\text{1 at } 1458 & \text{1 at } 3945 & \text{1 at } 472 & \text{1 at } 6117 & \text{1 at } 4924 & \text{1 at } 9714
\end{array}
$$
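The one-hot vectors above can be built in a few lines. The sketch below reuses the word positions from the figure and assumes a vocabulary of 10,000 words (the size is an assumption for illustration; `one_hot` and `dot` are hypothetical helper names). It shows that the dot product of any two different words is always 0.

```python
VOCAB_SIZE = 10_000  # assumed vocabulary size, for illustration only


def one_hot(index, size=VOCAB_SIZE):
    """Return a vector of zeros with a single 1 at the given position."""
    vector = [0] * size
    vector[index] = 1
    return vector


def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


apple = one_hot(472)
orange = one_hot(6117)
king = one_hot(4924)

print(dot(apple, apple))   # 1: a word only matches itself
print(dot(apple, orange))  # 0
print(dot(apple, king))    # 0: "apple" is no closer to "orange" than to "king"
```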


This representation does not capture any relation between similar words, e.g. 
<br>`I like apple juice` and   `I like orange juice` 


The goal is to get a **vectorized representation**, even if the individual vector dimensions have no explicit semantic meaning. 

The resulting vectors are expected to be close by cosine similarity for semantically similar words. 

<img src="img/2.png" align = 'left' style="width:350;height:250px;"> <br>
<div style="clear:left;"></div>

If a model is trained using such vectors, then having in the training set $\quad$ `I like apple` <u><b> juice </b></u>  <br>  makes it easier to predict $\quad$ <u><b> juice </b></u> $\quad$ for $\quad$  `I like orange` <u><b> $\text{_____}$ </b></u>




### spaCy

In [None]:
doc = nlp("I have a banana")
# Access the vector via the token.vector attribute
print(len(doc[3].vector))
print(doc[3].vector)

### Word2Vec
https://uk.wikipedia.org/wiki/Word2vec

The idea behind Word2Vec is pretty simple. We are making an assumption that you can tell the meaning of a word by the company it keeps. This is analogous to the saying "show me your friends, and I'll tell you who you are". So if you have two words that have very similar neighbors (i.e. the usage context is about the same), then these words are probably quite similar in meaning or are at least highly related. For example, the words shocked, appalled and astonished are typically used in a similar context.
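To make "the company it keeps" concrete, the sketch below lists the neighbors of each word inside a fixed-size context window. It only illustrates the idea of a context window; it is not how gensim implements training, and `context_pairs` and the sample sentence are hypothetical.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token and its neighbors within the window."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, tokens[j]


sentence = "i was shocked by the dirty room".split()
pairs = list(context_pairs(sentence, window=2))

print([context for center, context in pairs if center == "shocked"])
# ['i', 'was', 'by', 'the']
```

Words that keep producing the same context words across many sentences end up with similar vectors.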

In this tutorial, you will learn how to use the Gensim implementation of Word2Vec and actually get it to work! I have heard a lot of complaints about poor performance etc., but it's really a combination of two things: (1) your input data and (2) your parameter settings. Note that the training algorithms in this package were ported from the original Word2Vec implementation by Google and extended with additional functionality.

#### Training word2vec model using gensim

In [None]:
import gzip
import gensim

#### Loading data


In [None]:
input_file = "./data/reviews_data.txt.gz"

with gzip.open(input_file, 'rb') as f:
    for i,line in enumerate (f):
        print(line)
        break

In [None]:
def read_input(input_file):
    """This method reads the input file which is in gzip format"""
    
    print("reading file {0}...this may take a while".format(input_file))
    
    with gzip.open(input_file, 'rb') as f:
        for i, line in enumerate (f): 
            if (i%10000==0):
                print("read {0} reviews".format(i))
            # do some pre-processing and return a list of words for each review text
            yield gensim.utils.simple_preprocess(line)

# read the tokenized reviews into a list
# each review item becomes a series of words
# so this becomes a list of lists
documents = list(read_input(input_file))
print("Done reading data file")

Training the model is fairly straightforward. You just instantiate Word2Vec and pass the reviews that we read in the previous step (the documents). So, we are essentially passing a list of lists, where each list within the main list contains the tokens from one user review. Word2Vec uses all these tokens to internally create a vocabulary.
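As a simplified picture of that vocabulary step, the sketch below counts tokens over a list of lists and drops the rare ones, which is the role the `min_count` argument plays. The helper `build_vocab` and the toy reviews are hypothetical; gensim's internal vocabulary building does more than this.

```python
from collections import Counter

# Toy stand-in for the tokenized reviews: a list of lists of words.
toy_documents = [
    ["the", "room", "was", "dirty"],
    ["the", "staff", "was", "polite"],
    ["dirty", "room", "and", "rude", "staff"],
]


def build_vocab(documents, min_count=2):
    """Count every token and keep those seen at least min_count times."""
    counts = Counter(token for document in documents for token in document)
    return {token: count for token, count in counts.items() if count >= min_count}


vocab = build_vocab(toy_documents, min_count=2)
print(sorted(vocab))
# ['dirty', 'room', 'staff', 'the', 'was']
```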

After building the vocabulary, we just need to call train(...) to start training the Word2Vec model. Training on the OpinRank dataset takes about 10 minutes so please be patient while running your code on this dataset.

<img src="img/7.png">

In [None]:
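# passing the corpus to the constructor builds the vocabulary and already trains with the default number of epochs
model = gensim.models.Word2Vec(documents, vector_size=150, window=10, min_count=2, workers=24)
# continue training for 10 more epochs on the same corpus
model.train(documents,total_examples=len(documents),epochs=10)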

In [None]:
import pickle

# with open("./w2v.model", "wb") as f:
#     pickle.dump(model, f)
    
with open("./w2v.model", "rb") as f:
    model = pickle.load(f)

In [None]:
word_v= model.wv.get_vector('hello')
print(len(word_v))
print(word_v)

#### Now, let's look at some output
This first example shows a simple case of looking up words similar to the word dirty. All we need to do here is to call the most_similar function and provide the word dirty as the positive example. This returns the top 10 similar words.

In [None]:
w1 = "dirty"
model.wv.most_similar(positive=w1)

In [None]:
w1 = ["polite"]
model.wv.most_similar(positive=w1,topn=6)

In [None]:
w1 = ["vehicle"]
model.wv.most_similar(positive=w1,topn=6)

That's nice. You can even specify several positive examples to get things that are related in the provided context, and provide negative examples to say what should not be considered as related. For instance, you could ask for all items that relate to bed only.


### Analogy reasoning

$boy - girl \approx \begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad king - queen \approx \begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$

We may say:  `boy` - `girl` = `king` - `queen`

Further, we can answer questions like: `boy` is to `girl` as `king` is to `WHO?`
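A minimal sketch of this vector arithmetic, assuming made-up two-dimensional embeddings whose axes loosely stand for "gender" and "royalty". The `analogy` helper and the `toy_embeddings` table are hypothetical. The input words are left out of the candidates so the answer cannot simply be one of the question words.

```python
import math

# Made-up 2-dimensional embeddings (gender, royalty), for illustration only.
toy_embeddings = {
    "boy": [1.0, 0.0],
    "girl": [-1.0, 0.0],
    "king": [1.0, 1.0],
    "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c):
    """Answer "a is to b as c is to ?" by ranking words against c - a + b."""
    target = [
        vc - va + vb
        for va, vb, vc in zip(toy_embeddings[a], toy_embeddings[b], toy_embeddings[c])
    ]
    candidates = [word for word in toy_embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(toy_embeddings[word], target))


print(analogy("boy", "girl", "king"))  # queen
```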




In [None]:
result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)

# king - man + woman = ?

print(result)

In [None]:
result = model.wv.most_similar(positive=["doctor", "woman"], negative=['man'], topn=1)

# doctor - man + woman = ?

print(result)

In [None]:
result = model.wv.most_similar(positive=["huge", "small"], negative=['big'], topn=1)

# huge - big + small = ?

print(result)

#### Similarity between two words in the vocabulary

In [None]:
print('similarity between two identical words')
print(model.wv.similarity(w1="dirty",w2="dirty"))

print('\nsimilarity between two different words')
print(model.wv.similarity(w1="dirty",w2="smelly"))
print(model.wv.similarity(w1="bye",w2="goodbye"))
print(model.wv.similarity(w1="car",w2="vehicle"))

print('\nsimilarity between two opposite words')
print(model.wv.similarity(w1="dirty",w2="clean"))
print(model.wv.similarity(w1="wet",w2="dry"))

print('\nsimilarity between two unrelated words')
print(model.wv.similarity(w1="green",w2="hotel"))
print(model.wv.similarity(w1="hello",w2="the"))


Under the hood, it computes the cosine similarity between the two specified words using the word vectors of each. From the scores, it makes sense that dirty is highly similar to smelly but dirty is dissimilar to clean. If you compute the similarity between two identical words, the score will be 1.0, which is the maximum of the cosine similarity range; the scores run from -1.0 to 1.0.
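For reference, the cosine similarity of two word vectors $u$ and $v$ with $n$ components is the standard formula below (the symbols $u$, $v$ and $n$ are notation introduced here, not gensim names):

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}$$

The value is 1 when the vectors point in the same direction, 0 when they are orthogonal and -1 when they point in opposite directions.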

#### Find the odd one out
You can even use Word2Vec to find odd items given a list of items.
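One plausible way to pick the odd item, sketched below with made-up two-dimensional vectors, is to normalize every vector, average them, and return the word that is least aligned with that average. This is meant as an intuition aid under those assumptions rather than a description of gensim's internals; `unit` and `odd_one_out` are hypothetical helpers.

```python
import math

# Made-up 2-dimensional embeddings, for illustration only.
toy_embeddings = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "france": [0.1, 0.9],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(words):
    """Return the word whose unit vector is least aligned with the mean unit vector."""
    units = {word: unit(toy_embeddings[word]) for word in words}
    mean = [sum(component) / len(words) for component in zip(*units.values())]
    return min(words, key=lambda word: sum(a * b for a, b in zip(units[word], mean)))


print(odd_one_out(["cat", "dog", "france"]))  # france
```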

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["cat","dog","france"])

In [None]:
# Which one is the odd one out in this list?
model.wv.doesnt_match(["bed","pillow","sheet","shower"])

### USE - Universal Sentence Encoder

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. 



In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

In [None]:
# Import the Universal Sentence Encoder's TF Hub module
# It may take a while

# module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
# embed = hub.load(module_url)

# or download the model and use it locally

embed = tf.saved_model.load("./USE/")

In [None]:
# Compute a representation for each message, showing various lengths supported.
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = (
    "Universal Sentence Encoder embeddings also support short paragraphs. "
    "There is no hard limit on how long the paragraph is. Roughly, the longer "
    "the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]


message_embeddings = embed(messages)

for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
    print("Message: {}".format(messages[i]))
    print("Embedding size: {}".format(len(message_embedding)))
    message_embedding_snippet = ", ".join(
        (str(x) for x in message_embedding[:3]))
    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

In [None]:
str1 = "I like my bike"
str2 = "My motorcycle looks good"
str3 = "Two more cells and we are done"

messages = [str1, str2, str3]
emb1, emb2, emb3 = embed(messages)

print(f"The similarity between '{str1}' and '{str2}' = {np.inner(emb1, emb2)}")
print(f"The similarity between '{str1}' and '{str3}' = {np.inner(emb1, emb3)}")

In [None]:
def plot_similarity(labels, features, rotation):
    print(len(features))
    corr = np.inner(features, features)
    sns.set(font_scale=1.2)
    g = sns.heatmap(
      corr,
      xticklabels=labels,
      yticklabels=labels,
      vmin=0,
      vmax=1,
      cmap="YlOrRd")
    g.set_xticklabels(labels, rotation=rotation)
    g.set_title("Semantic Textual Similarity")


def run_and_plot(messages, encoding_tensor):
    message_embeddings = encoding_tensor(messages)
    plot_similarity(messages, message_embeddings, 90)

In [None]:
messages = [
    # Smartphones
    "I like my phone",
    "My phone is not good.",
    "Your cellphone looks great.",

    # Weather
    "Will it snow tomorrow?",
    "Recently a lot of hurricanes have hit the US",
    "Global warming is real",

    # Food and health
    "An apple a day, keeps the doctors away",
    "Eating strawberries is healthy",
    "Is paleo better than keto?",

    # Asking about age
    "How old are you?",
    "what is your age?",
]

run_and_plot(messages, embed)

# Home task

1. Using spaCy, create a keyword extractor that does the following:
 - Take some text (article-like) as an input.
 - Remove all stop words from the text.
 - Extract all the nouns from the text and return them sorted by count in descending order, with the number of occurrences. 
 - Extract all the verbs from the text and return them sorted by count in descending order, with the number of occurrences.  
 - Extract all the numbers from the text and return them sorted by count in descending order, with the number of occurrences. 
 - Extract all the named entities from the text, group them into 4 groups (Location, Person, Organization, Misc.) and return the groups in descending order, with the number of occurrences. 


2. Using multilingual USE, align strings in English and Russian texts:
 - Download multilingual USE model - https://tfhub.dev/google/universal-sentence-encoder-multilingual/3
 - Read "./data/corpora/en.txt" and "./data/corpora/ru.txt" files
 - Align English strings with their Russian analogues using mUSE
 
 
3. Using the USE, create a Duplicate Phrase Finder that does the following:
 - Take some large text as an input.
 - Separate the text into SENTENCES (phrases). 
 - Find semantically similar strings (cosine similarity >= 0.80)

In [None]:
from nltk.corpus import stopwords, gutenberg
from nltk.tokenize import word_tokenize
import en_core_web_sm as en_core
from typing import Tuple
from nltk import FreqDist
import spacy

## 1. Using spaCy, create a keyword extractor that does the following:

In [None]:
# define nlp
nlp = en_core.load()

### Take some text (article like) as an input.

In [None]:
text = gutenberg.raw("milton-paradise.txt")
print(text[1:10])

### Remove stop words

In [None]:
def remove_stopwords(words: list[str], stopwords_type: str) -> str:
    """Remove stop words from text

    Args:
        words (list[str]): list of words
        stopwords_type (str): stop words language

    Returns:
        str: text with deleted stopwords
    """
    
    # define stop words
    stop_words = set(stopwords.words(stopwords_type)) 

    # keep only the words that are not stop words
    # (the NLTK list is lowercase, so compare case-insensitively)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    return " ".join(filtered_words)

In [None]:
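# preprocess: turn newlines into spaces so that words on adjacent lines are not glued together,
# then split on any whitespace
text = text.replace("\n", " ").split()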

# removing stopwords
text = remove_stopwords(text, stopwords_type="english")

# result
text[:100]

### Use `nlp` to analyze words

In [None]:
# use nlp
analyzed_words = nlp(text)

# result
analyzed_words[:100]

### Extract tokens with specific language parts

In [None]:
def extract_lang_parts(words: spacy.tokens.doc.Doc, lang_part: str = None) -> list:
    """Take the decided language part from text or all tokens

    Args:
        words (spacy.tokens.doc.Doc): all words
        lang_part (str, optional): language part that we want to extract. Defaults to None.

    Returns:
        list: result list with words or tokens
    """

    # check if [lang_part] is not None
    if lang_part:
        # return list with words
        return [
            token.text 
            for token in words
            if token.pos_ == lang_part]
    else:
        # return list with tokens
        return [
            token
            for token in words
        ]

#### Extract NOUN(sort by desc)

In [None]:
# take all NOUNs
NOUN_words = extract_lang_parts(analyzed_words, lang_part="NOUN")

# count NOUNs
NOUN_count = FreqDist(NOUN_words)

# check the result
print(NOUN_count.most_common()[0:10])

#### Extract VERB(sort by desc)

In [None]:
# take all VERBs
VERB_words = extract_lang_parts(analyzed_words, lang_part="VERB")

# count VERBs
VERB_count = FreqDist(VERB_words)

# check the result
print(VERB_count.most_common()[0:10])

#### Extract NUM(sort by desc)

In [None]:
# take all NUMs
NUM_words = extract_lang_parts(analyzed_words, lang_part="NUM")

# count NUMs
NUM_count = FreqDist(NUM_words)

# check the result
print(NUM_count.most_common()[0:5])

#### Extract all the Named Entities from the text, group them into 4 groups (Location, Person, Organization, Misc.)

In [None]:
def extract_entities(doc: spacy.tokens.doc.Doc, categories: list, with_label: bool = True) -> list:
    """Extract the named entities whose label is in the given categories

    Args:
        doc (spacy.tokens.doc.Doc): processed document
        categories (list): entity labels to extract
        with_label (bool, optional): if True, return (entity text, label) tuples; otherwise return only the labels. Defaults to True.

    Returns:
        list: result list
    """

    # check if it is True or not
    if with_label:
        # return result list with entity texts and labels
        return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in categories]
    else:
        # return only labels
        return [ent.label_ for ent in doc.ents if ent.label_ in categories]

In [None]:
from collections import Counter

In [None]:
# first method, we have list[Tuple]
ents = extract_entities(analyzed_words, categories=["LOC", "PERSON", "ORG", "GPE"])

# check result
print(dict(Counter(elem[1] for elem in ents)))

# second method, we have only labels
ent_labels = extract_entities(analyzed_words, with_label=False, categories=["LOC", "PERSON", "ORG", "GPE"])

# check result
print(FreqDist(ent_labels).most_common())
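The counts above are per spaCy label rather than per task group. A possible way to fold the labels into the four requested groups is sketched below. The mapping in `LABEL_TO_GROUP` is an assumption made for this example (any label not listed falls into "Misc."), `group_entities` is a hypothetical helper, and the sample entities are hand-written rather than real model output.

```python
from collections import Counter

# Assumed mapping from spaCy's English entity labels to the four task groups;
# any label that is not listed here is counted as "Misc.".
LABEL_TO_GROUP = {
    "LOC": "Location",
    "GPE": "Location",
    "FAC": "Location",
    "PERSON": "Person",
    "ORG": "Organization",
}


def group_entities(entities):
    """Count (text, label) pairs per group, most frequent group first."""
    counts = Counter(LABEL_TO_GROUP.get(label, "Misc.") for _text, label in entities)
    return counts.most_common()


# Hand-written sample in the shape [(ent.text, ent.label_), ...].
sample = [
    ("Eden", "LOC"),
    ("Heaven", "GPE"),
    ("Satan", "PERSON"),
    ("Adam", "PERSON"),
    ("Eve", "PERSON"),
    ("first", "ORDINAL"),
]

print(group_entities(sample))
# [('Person', 3), ('Location', 2), ('Misc.', 1)]
```

With a real document, the input could be built as `[(ent.text, ent.label_) for ent in doc.ents]` so that labels outside the three named groups still reach the "Misc." bucket.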


## 2. Using multilingual USE, align strings in English and Russian texts

In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

In [None]:
# load model
# note: the multilingual USE from the task also needs `import tensorflow_text` to register its ops
embed = tf.saved_model.load("./USE/")

In [None]:
# Read "./data/corpora/en.txt" and "./data/corpora/ru.txt" files

en = []
ru = []
with open("./data/corpora/en.txt", "r", encoding="utf-8") as f:
    for line in f.readlines()[:50]:
        en.append(line.strip())
        
with open("./data/corpora/ru.txt", "r", encoding="utf-8") as f:
    for line in f.readlines()[:50]:
        ru.append(line.strip())

en[0], ru[0]

### Processing with data

In [None]:
embed_en = embed(en)
embed_ru = embed(ru)

len(embed_en), len(embed_ru)

### Align English strings with their Russian analogues using USE

In [None]:
def plot_similarity(labels: list, features: list):
  # rows of corr follow features[0] (English), columns follow features[1] (Russian)
  corr = np.inner(features[0], features[1])
  sns.set(font_scale=1)
  g = sns.heatmap(
      corr,
      xticklabels=labels[1],
      yticklabels=labels[0],
      vmin=0,
      vmax=1,
      cmap="magma_r"
  )
  g.set_title("Semantic Textual Similarity")

In [None]:
plot_similarity(labels=[en, ru], features=[embed_en, embed_ru])
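The heatmap shows the scores, but the alignment itself still has to be read off. A minimal sketch of that step: for each English line, pick the Russian line with the highest score. The matrix below is made up for illustration and `align` is a hypothetical helper; with the real embeddings the matrix would be `np.inner(embed_en, embed_ru)`, and `np.argmax(corr, axis=1)` would give the same indices.

```python
def align(similarity):
    """For each row, return the index of the column with the highest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity]


# Made-up similarity scores: rows are English lines, columns are Russian lines.
toy_similarity = [
    [0.91, 0.12, 0.30],
    [0.20, 0.15, 0.88],
    [0.05, 0.79, 0.22],
]

print(align(toy_similarity))  # [0, 2, 1]
```

This greedy choice lets two English lines map to the same Russian line; a stricter one-to-one alignment would need an extra matching step.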

## 3. Using the USE, create a Duplicate Phrase Finder that will do the following
- Take some large text as an input.
- Separates text to SENTENCES (phrases). 
- Finds semantically similar strings (cosine similarity >=0.80)

In [None]:
import tensorflow as tf
import numpy as np
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
import re

In [None]:
# load model
embed = tf.saved_model.load("./USE/")

In [471]:
# get file id
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [474]:
# load text
def load_text(file_id: str) -> str:
    """Load your text

    Args:
        file_id (str): file id you want to download from gutenberg

    Returns:
        str: raw text
    """
    return gutenberg.raw(file_id).replace("\n", " ")

In [477]:
text_to_compare, text_to_compare_with = load_text("carroll-alice.txt"), load_text("whitman-leaves.txt")

In [478]:
# tokenize text
def tokenize_to_sent(text: str) -> list:
    """Tokenize your text to sentences and keep only the first 800 of them

    Args:
        text (str): full text

    Returns:
        list: sentence tokens
    """
    return sent_tokenize(text)[:800]

In [480]:
tokenized_to_comp, tokenized_to_comp_with = tokenize_to_sent(text_to_compare), tokenize_to_sent(text_to_compare_with)

In [483]:
# define embeddings
embeddings_to_comp = embed(tokenized_to_comp)
embeddings_to_comp_with = embed(tokenized_to_comp_with)

In [487]:
def find_similar_sents(embed_to_comp: np.array, embed_to_comp_with: np.array, texts_to_comp: list[str], texts_to_comp_with: list[str], similarity: float = 0.5) -> list[set]:
    """Finds semantically similar strings

    Args:
        embed_to_comp (np.array): embeddings of the sentences to compare
        embed_to_comp_with (np.array): embeddings of the sentences to compare with
        texts_to_comp (list[str]): list of sentences to compare
        texts_to_comp_with (list[str]): list of sentences to compare with
        similarity (float, optional): minimum similarity for a pair to be reported. Defaults to 0.5.

    Returns:
        list[set]: one set per similar pair, holding both sentences and their similarity
    """

    # define empty list
    results = []

    # iterate by two lists with their sentences
    for to_compare, to_sent in zip(embed_to_comp, texts_to_comp):
        
        for to_compare_with, to_sent_with in zip(embed_to_comp_with, texts_to_comp_with):

            # skip pairs whose embeddings are identical (a sentence compared with itself)
            if not np.array_equal(np.array(to_compare), np.array(to_compare_with)):

                # calculate similarity
                calced_similarity = np.inner(to_compare, to_compare_with)

                # define data to add
                to_add = {to_sent, to_sent_with, calced_similarity}

                # check if the calculated similarity reaches the threshold provided by the user and the pair is not already in the list
                if calced_similarity >= similarity and to_add not in results:

                    # add to list
                    results.append(to_add)

    return results

#### Check different texts

In [488]:
find_similar_sents(embeddings_to_comp, embeddings_to_comp_with, tokenized_to_comp, tokenized_to_comp_with, similarity=.8)

[{"'Who are YOU?'", 0.8118769, 'what are you?'}]

#### Check the same text

In [489]:
# load text
loaded_text = load_text("melville-moby_dick.txt")

# tokenize
tokenized_text = tokenize_to_sent(loaded_text)

# use embed
embeddings = embed(tokenized_text)

# check text
find_similar_sents(embeddings, embeddings, tokenized_text, tokenized_text, similarity=.85)
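When a text is compared with itself, every pair is visited twice and each sentence also meets itself. A sketch of an alternative that walks each unordered pair exactly once is shown below, using a made-up similarity matrix in place of real USE scores. `similar_pairs`, the sentences and the scores are all hypothetical.

```python
from itertools import combinations


def similar_pairs(sentences, scores, threshold=0.8):
    """Return (sentence_a, sentence_b, score) for each unordered pair at or above the threshold."""
    return [
        (sentences[i], sentences[j], scores[i][j])
        for i, j in combinations(range(len(sentences)), 2)
        if scores[i][j] >= threshold
    ]


toy_sentences = ["Who are you?", "What are you?", "Call me Ishmael."]
# Made-up symmetric similarity matrix, for illustration only.
toy_scores = [
    [1.00, 0.81, 0.10],
    [0.81, 1.00, 0.12],
    [0.10, 0.12, 1.00],
]

print(similar_pairs(toy_sentences, toy_scores))
# [('Who are you?', 'What are you?', 0.81)]
```

Returning tuples keeps the order of the two sentences and the score stable, which makes the results easier to sort or filter afterwards.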