<a href="https://colab.research.google.com/github/GabrielFePL/NLP-Fatec-Matao/blob/main/NlpPreProcessing_Exercise_17_09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing - Fatec Matão

## Environment Setup

In [6]:
!pip install scikit-learn spacy --quiet
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_sm

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installa

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
from spacy import displacy
import numpy as np

In [8]:
corpus = [
    "The solar system has eight planets.",
    "Planets orbit the sun in elliptical paths.",
    "Astronomy helps us understand the universe.",
    "The sun is the center of our solar system."
]

In [9]:
sent1 = corpus[0]
sent2 = corpus[3]

## Bag of Words

In [10]:
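# Fit the vocabulary on the whole corpus and build the term-count matrix
bow_vectorizer = CountVectorizer()
bow_vectors = bow_vectorizer.fit_transform(corpus)
# Compare the rows for sent1 (index 0) and sent2 (index 3)
bow_cos = cosine_similarity(bow_vectors[0], bow_vectors[3])[0][0]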

In [11]:
print("Sentence 1:", sent1)
print("Sentence 2:", sent2)
print("\nCosine Similarity (BoW): ", round(bow_cos, 3))

Sentence 1: The solar system has eight planets.
Sentence 2: The sun is the center of our solar system.

Cosine Similarity (BoW):  0.492
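
To make the 0.492 figure less of a black box, the same computation can be sketched with the standard library only. This is a minimal sketch, assuming a regular expression that mirrors `CountVectorizer`'s default token pattern (lowercased tokens of two or more word characters); the helper names `bag_of_words` and `cosine` are hypothetical and not part of the notebook.

```python
import math
import re
from collections import Counter

sent1 = "The solar system has eight planets."
sent2 = "The sun is the center of our solar system."

def bag_of_words(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern (assumption).
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(counts_a, counts_b):
    # Dot product over shared terms divided by the product of the norms.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)

bow1, bow2 = bag_of_words(sent1), bag_of_words(sent2)
manual_bow_cos = cosine(bow1, bow2)
print(round(manual_bow_cos, 3))  # 0.492
```

The sentences share "the", "solar" and "system"; because "the" occurs twice in the second sentence, the dot product is 4 and the norms are √6 and √11, giving 4/√66 ≈ 0.492.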


## TF-IDF

In [12]:
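# Same vocabulary as BoW, but counts are reweighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform(corpus)
# Rows are L2-normalized by default, so the cosine reduces to a dot product
tfidf_cos = cosine_similarity(tfidf_vectors[0], tfidf_vectors[3])[0][0]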

In [13]:
print("Sentence 1:", sent1)
print("Sentence 2:", sent2)
print("Cosine Similarity (TF-IDF): ", round(tfidf_cos, 3))

Sentence 1: The solar system has eight planets.
Sentence 2: The sun is the center of our solar system.
Cosine Similarity (TF-IDF):  0.333
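
The drop from 0.492 to 0.333 comes from the IDF weighting: "the" appears in all four sentences, so it contributes less to the match than rarer words. The sketch below reproduces the score by hand, assuming scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalization); the helper names `tokenize`, `idf` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter

corpus = [
    "The solar system has eight planets.",
    "Planets orbit the sun in elliptical paths.",
    "Astronomy helps us understand the universe.",
    "The sun is the center of our solar system.",
]

def tokenize(text):
    # Lowercased tokens of two or more word characters (assumed default pattern).
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [Counter(tokenize(text)) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many sentences each term occurs.
df = Counter(term for doc in docs for term in doc)

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Weight raw counts by IDF, then scale the vector to unit L2 norm.
    weights = {term: count * idf(term) for term, count in doc.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec1, vec2 = tfidf(docs[0]), tfidf(docs[3])
# Both vectors have unit norm, so the cosine is just the dot product.
manual_tfidf_cos = sum(vec1[t] * vec2[t] for t in vec1.keys() & vec2.keys())
print(round(manual_tfidf_cos, 3))  # 0.333
```

Here `idf("the")` is exactly 1.0, the lowest possible weight, while a word that occurs in a single sentence, such as "eight", gets about 1.916.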


## Word Embeddings

In [14]:
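nlp = spacy.load("en_core_web_md")
# Doc.vector defaults to the average of the token vectors
emb1 = nlp(sent1).vector
emb2 = nlp(sent2).vector
# Cosine similarity: dot product divided by the product of the norms
embed_cos = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))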

In [15]:
print("Sentence 1:", sent1)
print("Sentence 2:", sent2)
print("Cosine Similarity (Embeddings): ", round(embed_cos, 3))

Sentence 1: The solar system has eight planets.
Sentence 2: The sun is the center of our solar system.
Cosine Similarity (Embeddings):  0.869
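
The embedding score is higher because a sentence vector is built from dense token vectors rather than from exact word matches. As a rough illustration, the sketch below averages token vectors into a sentence vector and compares two of them. The three-dimensional `toy_vectors` are invented for illustration only and are not taken from `en_core_web_md`, whose vectors have 300 dimensions; `mean_vector` and `cosine` are hypothetical helper names.

```python
import math

# Hypothetical 3-dimensional token vectors, invented for illustration only.
toy_vectors = {
    "sun": [0.9, 0.1, 0.3],
    "planets": [0.8, 0.2, 0.4],
    "orbit": [0.7, 0.3, 0.1],
    "astronomy": [0.6, 0.5, 0.2],
}

def mean_vector(tokens):
    # Average the token vectors dimension by dimension.
    vectors = [toy_vectors[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = mean_vector(["sun", "planets"])
doc_b = mean_vector(["orbit", "astronomy"])
toy_cos = cosine(doc_a, doc_b)
print(doc_a, round(toy_cos, 3))
```

The two toy "sentences" share no words, so their BoW and TF-IDF similarity would be zero, yet the averaged vectors can still point in a similar direction.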


## Syntax Analysis

In [16]:
nlp = spacy.load("en_core_web_sm")

In [17]:
phrases = [
    "The cat sleeps.",
    "The quick brown fox jumps over the lazy dog.",
    "Although the weather was cold, the children continued playing outside, enjoying the fresh snow."
]

In [18]:
for phrase in phrases:
    doc = nlp(phrase)
    print(f"\nPhrase: {phrase}")
    print("Dependencies (token -> head, relation):")
    for token in doc:
        print(f"{token.text:15} -> {token.head.text:15} ({token.dep_})")

    displacy.render(doc, style="dep", jupyter=True, options={"distance": 110})


Phrase: The cat sleeps.
Dependencies (token -> head, relation):
The             -> cat             (det)
cat             -> sleeps          (nsubj)
sleeps          -> sleeps          (ROOT)
.               -> sleeps          (punct)



Phrase: The quick brown fox jumps over the lazy dog.
Dependencies (token -> head, relation):
The             -> fox             (det)
quick           -> fox             (amod)
brown           -> fox             (amod)
fox             -> jumps           (nsubj)
jumps           -> jumps           (ROOT)
over            -> jumps           (prep)
the             -> dog             (det)
lazy            -> dog             (amod)
dog             -> over            (pobj)
.               -> jumps           (punct)



Phrase: Although the weather was cold, the children continued playing outside, enjoying the fresh snow.
Dependencies (token -> head, relation):
Although        -> was             (mark)
the             -> weather         (det)
weather         -> was             (nsubj)
was             -> continued       (advcl)
cold            -> was             (acomp)
,               -> continued       (punct)
the             -> children        (det)
children        -> continued       (nsubj)
continued       -> continued       (ROOT)
playing         -> continued       (xcomp)
outside         -> playing         (advmod)
,               -> continued       (punct)
enjoying        -> continued       (advcl)
the             -> snow            (det)
fresh           -> snow            (amod)
snow            -> enjoying        (dobj)
.               -> continued       (punct)
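
The (token, head, relation) triples printed above describe a tree, and it can help to walk that tree explicitly. The sketch below does so for the second phrase, using the triples copied from the output. It matches heads by token text, which only works here because every token text in that sentence is unique; on real spaCy objects, `token.head` and `token.children` would be the more robust route. The variable names are hypothetical.

```python
from collections import defaultdict

# (token, head, relation) triples copied from the parser output
# for "The quick brown fox jumps over the lazy dog."
triples = [
    ("The", "fox", "det"),
    ("quick", "fox", "amod"),
    ("brown", "fox", "amod"),
    ("fox", "jumps", "nsubj"),
    ("jumps", "jumps", "ROOT"),
    ("over", "jumps", "prep"),
    ("the", "dog", "det"),
    ("lazy", "dog", "amod"),
    ("dog", "over", "pobj"),
    (".", "jumps", "punct"),
]

# The root is the token whose relation is ROOT (it is its own head).
root = next(tok for tok, head, dep in triples if dep == "ROOT")

# Group every other token under its head.
children = defaultdict(list)
for tok, head, dep in triples:
    if dep != "ROOT":
        children[head].append((tok, dep))

# The subject attaches to the root; its adjectives attach to the subject.
subject = next(tok for tok, dep in children[root] if dep == "nsubj")
modifiers = [tok for tok, dep in children[subject] if dep == "amod"]

print(root, subject, modifiers)  # jumps fox ['quick', 'brown']
```

The same lookup on `children["over"]` returns the prepositional object "dog", which is how "over the lazy dog" hangs off the verb.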
