In [1]:
import spacy
import numpy as np

In [2]:
# Load the medium English model, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [3]:
# Read the data
with open("DS_MarianaTrench.txt", "r") as f:
    text = f.read()

# Transform the text into a Doc
doc = nlp(text)

Word Vector Similarity: spaCy's word vectors can be used to look up the words in the vocabulary that are most similar to a given word. Each word in the model's vector table is mapped to a numerical vector, and spaCy compares two vectors with a similarity metric, by default cosine similarity.
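To make the metric concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are invented for illustration (real spaCy vectors have hundreds of dimensions), and returning 0.0 for a zero vector is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example (not taken from any model)
cat = [0.9, 0.1, 0.3]
dog = [0.8, 0.2, 0.35]
car = [0.1, 0.9, 0.0]

print("cat vs dog:", round(cosine_similarity(cat, dog), 3))
print("cat vs car:", round(cosine_similarity(cat, car), 3))
```

Vectors that point in nearly the same direction score close to 1, regardless of their length, which is why the metric suits word vectors.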

In [4]:
# Find the words most similar to a given word
your_word = 'pet'

# most_similar returns (keys, best_rows, scores), one row per query vector
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=5)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2][0]  # similarity scores (higher means more similar), not distances
print('Similar Words to "',your_word,'" are ',words)
# doc[0].vector  # shows the first token as a numerical vector



Similar Words to " pet " are  ['puppies', 'PUPPIES', 'CHINCHILLA', 'Breed', 'cattery']


Under the hood, spaCy computes the similarity of two Doc objects from the word vectors of their tokens. Each token has an associated word vector, which represents its meaning in a multi-dimensional space. By default, a Doc's vector is the average of its token vectors, and the similarity of two Docs is the cosine similarity of those averaged vectors.
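The averaging step can be sketched without a model. The two-dimensional token vectors below are invented for illustration, and `sentence_vector` is a hypothetical helper that stands in for what spaCy does by default when it builds a Doc vector.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up token vectors (assumption: real vectors come from the loaded model)
token_vectors = {
    "i": [0.1, 0.1],
    "enjoy": [0.2, 0.6],
    "oranges": [0.9, 0.2],
    "apples": [0.85, 0.25],
    "football": [0.1, 0.9],
}

def sentence_vector(tokens):
    """Hypothetical helper: represent a sentence by the mean of its token vectors."""
    return mean_vector([token_vectors[t] for t in tokens])

oranges = sentence_vector(["i", "enjoy", "oranges"])
apples = sentence_vector(["i", "enjoy", "apples"])
football = sentence_vector(["i", "enjoy", "football"])

print("oranges vs apples:  ", round(cosine(oranges, apples), 3))
print("oranges vs football:", round(cosine(oranges, football), 3))
```

Because the shared words contribute the same vectors to every average, sentences that differ in one word still score fairly high, which is worth keeping in mind when reading the percentages in the cells that follow.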

In [5]:
# Let's test the similarity of sentences
doc1 = nlp('I enjoe oranges.')
doc2 = nlp('I enjoe apples.')
doc3 = nlp('I enjoe burgers.')
doc4 = nlp('I enjoe playing football')

In [6]:
# Related words cluster together in vector space, e.g. fruits (apple, orange)
print('The similarity between "',doc1,'" & "',doc2,'" is :',round(doc1.similarity(doc2),3)*100,'%')

The similarity between " I enjoe oranges. " & " I enjoe apples. " is : 92.5 %


In [7]:
print('The similarity between "',doc1,'" & "',doc3,'" is :',round(doc1.similarity(doc3),3)*100,'%')

The similarity between " I enjoe oranges. " & " I enjoe burgers. " is : 83.6 %


In [8]:
print('The similarity between "',doc1,'" & "',doc4,'" is :',round(doc1.similarity(doc4),3)*100,'%')

The similarity between " I enjoe oranges. " & " I enjoe playing football " is : 78.8 %


Note that spaCy's similarity score measures how close two texts are in meaning (semantic similarity), not how closely their characters or words match (lexical similarity).
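For contrast, here is a small sketch of a purely lexical measure, using `difflib.SequenceMatcher` from the Python standard library. It is not what spaCy computes; it is shown only to illustrate the kind of string matching that the semantic score does not rely on. The word pairs are chosen for illustration.

```python
from difflib import SequenceMatcher

def lexical_similarity(a, b):
    """Character-level similarity in [0, 1], based only on spelling."""
    return SequenceMatcher(None, a, b).ratio()

# "car" and "cat" share most letters but not meaning;
# "car" and "automobile" share meaning but few letters.
pairs = [("car", "cat"), ("car", "automobile")]
for a, b in pairs:
    print(a, "vs", b, "->", round(lexical_similarity(a, b), 3))
```

A spelling-based score ranks "cat" closer to "car" than "automobile" is, whereas a vector-based score is expected to reflect meaning instead.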

================================================================================


QA System

A Question Answering (QA) system takes a question and a context as input. In a neural extractive QA model, the question and context are tokenized and mapped to numerical representations called embeddings, which capture the meaning of the tokens in a continuous vector space. Encoding layers and attention mechanisms then model the contextual dependencies between tokens, and the encoded question and context representations are combined so that they can interact. The model predicts the start and end positions of the answer within the context by computing a probability distribution over the tokens; the tokens in that span are extracted and post-processed to give the final answer. spaCy's core pipelines do not include such a model. What spaCy does provide is tokenization, sentence segmentation, and word vectors, and the cells that follow build a much simpler QA system on top of them: each sentence of the context is compared with the question by vector similarity, and the first sentence whose score exceeds a threshold is returned as the answer.
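The step in which an extractive QA model picks the answer span from per-token start and end scores can be sketched as follows. This is a minimal illustration and not spaCy functionality: `best_span`, the `max_len` limit, the tokens, and all scores are made up for the example, where a real model would produce the scores.

```python
def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end) maximizing start_scores[start] + end_scores[end],
    subject to start <= end and a span of at most max_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_scores):
        last = min(i + max_len, len(end_scores))
        for j in range(i, last):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

# Invented tokens and scores, for illustration only
tokens = ["The", "trench", "is", "in", "the", "Pacific", "Ocean"]
start_scores = [0.1, 0.2, 0.1, 0.3, 0.5, 2.0, 0.4]
end_scores = [0.0, 0.1, 0.1, 0.2, 0.1, 0.9, 2.5]

start, end = best_span(start_scores, end_scores)
answer_span = " ".join(tokens[start:end + 1])
print(answer_span)
```

Searching over pairs rather than taking the best start and best end independently guarantees that the end never precedes the start.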

In [9]:
question = "What is the Mariana Trench?"
# question = "Why was the Mariana Trench given that name?"

# Process the question
question_doc = nlp(question)

In [10]:
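answer = None
threshold = 0.8  # minimum similarity for a sentence to count as an answer

# Return the first sentence whose similarity to the question exceeds the threshold
for sentence in doc.sents:
    if question_doc.similarity(sentence) > threshold:
        answer = sentence.text
        break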

In [11]:
if answer is not None:
    print("Answer:", answer)
else:
    print("No answer found.")

Answer: The Mariana Trench is an oceanic trench located in the western Pacific Ocean, about 200 kilometres (124 mi) east of the Mariana Islands; it is the deepest oceanic trench on Earth.
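The loop above stops at the first sentence that clears the threshold, which is not necessarily the closest match. A possible variant, sketched here under stated assumptions, scores every sentence and keeps the best one. `best_match` is a hypothetical helper, and the word-overlap scorer `jaccard` is only a stand-in so that the example runs without a model; the sentences are invented for the demonstration.

```python
def best_match(query, candidates, score, threshold):
    """Return (candidate, score) for the highest-scoring candidate,
    or (None, score) if even the best one falls below the threshold."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        s = score(query, candidate)
        if s > best_score:
            best, best_score = candidate, s
    if best_score < threshold:
        return None, best_score
    return best, best_score

def jaccard(a, b):
    """Stand-in scorer (assumption): word overlap between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Invented sentences, for illustration only
sentences = [
    "The trench is very deep",
    "The Mariana Trench is an oceanic trench",
    "Fish live there",
]
match, match_score = best_match("what is the mariana trench", sentences, jaccard, 0.4)
print(match, round(match_score, 3))
```

With the objects from the cells above, the same helper could be called as `best_match(question_doc, list(doc.sents), lambda q, s: q.similarity(s), 0.8)`, since `Doc.similarity` accepts a sentence span.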
