<a href="https://colab.research.google.com/github/Guilherme-Inkotte/distance-matrix/blob/main/CI_Matriz_de_Distancia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 7 - Distance Matrix
## Interpreter Construction
## Student: Guilherme Henrique Schneider Inkotte

Your task is to generate a distance matrix by computing the cosine of the angle between every pair of vectors obtained with tf-idf. To compute the cosine, use the formula presented in Word2Vector [frankalcantara.com](https://frankalcantara.com/Aulas/Nlp/out/Aula4.html#/0/4/2) and shown in the figure accompanying the assignment.  
The result of this assignment is a matrix that relates each of the vectors already computed to every other vector in the term-document matrix.
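
For reference, the cosine of the angle between two vectors is conventionally written as follows. The symbols $A$, $B$, and $n$ (the number of vector components) are notation assumed here, not taken from the assignment's figure.

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Because tf-idf components are never negative, the cosine between two such vectors lies between 0 (no lexeme in common) and 1 (same direction).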

## Setup

In [1]:
from bs4 import BeautifulSoup
import requests
import spacy
import string
import numpy as np

## Part 1 - Building the Corpus

In [2]:
# Author: Guilherme Henrique Schneider Inkotte

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('sentencizer')

urls = [
  "https://en.wikipedia.org/wiki/Natural_language_processing", 
  "https://www.ibm.com/cloud/learn/natural-language-processing",
  "https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP",
  "https://www.tableau.com/learn/articles/natural-language-processing-examples",
  "https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1"
]
noLi = [
  "https://en.wikipedia.org/wiki/Natural_language_processing",
  "https://www.tableau.com/learn/articles/natural-language-processing-examples",
]
corpuses = []

for url in urls:
  page = requests.get(url)
  if page.status_code != 200:
    print(f"An error occurred while fetching the page {url}.")
    continue
  soup = BeautifulSoup(page.content, 'html.parser')
  corpus = []
  # Collect paragraphs and headings; list items are skipped for the URLs in noLi
  for paragraphTag in soup.find_all(["p", "h1", "h2", "h3", "h4", "h5", "h6", "li" if url not in noLi else ""]):
    paragraph = paragraphTag.get_text()
    strippedParagraph = paragraph.strip()
    if len(strippedParagraph) > 0:
      doc = nlp(strippedParagraph)
      for sentence in doc.sents:
        # Delete ASCII punctuation from each sentence before storing it
        corpus.append(sentence.text.translate(str.maketrans('', '', string.punctuation)))
  corpuses.append(corpus)
  print(f"Number of words in {url}: {len(' '.join(corpus).split(' '))}")
  print(corpus)

Number of words in https://en.wikipedia.org/wiki/Natural_language_processing: 1360
['Natural language processing', 'Natural language processing NLP is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data  ', 'The goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them', 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves', 'Challenges in natural language processing frequently involve speech recognition naturallanguage understanding and naturallanguage generation', 'Contents', 'Historyedit', 'Natural language processing has its roots in the 1950s', 'Already in 1950 Alan Turing published an article titled Computing Machinery
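
The punctuation step above deletes characters rather than replacing them with spaces, which is why tokens such as `naturallanguage` and `Historyedit` appear in the output. A minimal sketch of that behavior follows; the helper name `strip_punctuation` and the example sentence are hypothetical and not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
REMOVE_PUNCTUATION = str.maketrans('', '', string.punctuation)

def strip_punctuation(sentence: str) -> str:
    """Return the sentence with ASCII punctuation deleted (not replaced by spaces)."""
    return sentence.translate(REMOVE_PUNCTUATION)

# Hypothetical input, modeled on the Wikipedia text in the corpus.
example = "Natural language processing (NLP) involves natural-language understanding."
cleaned = strip_punctuation(example)
print(cleaned)
# Natural language processing NLP involves naturallanguage understanding

# A heading followed by an edit link collapses into one token.
print(strip_punctuation("History[edit]"))
# Historyedit
```

If hyphenated words should stay separate, one option would be to map punctuation to spaces instead, at the cost of producing extra empty tokens when splitting on a single space.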

## Part 2 - Building the Bag of Words

In [3]:
# Collect the distinct lexemes and count the sentences across all corpora
lexemes = []
numberOfSentences = 0
for corpus in corpuses:
  for sentences in corpus:
    numberOfSentences += 1
    # split(' ') keeps empty strings where spaces repeat or trail, so '' can become a lexeme
    for lexeme in sentences.split(' '):
      if lexeme not in lexemes:
        lexemes.append(lexeme)
# One row per sentence, one column per distinct lexeme
bagOfWords = np.zeros((numberOfSentences,len(lexemes)))
currentSentence = 0
for corpus in corpuses:
  for sentences in corpus:
    for lexeme in sentences.split(' '):
      bagOfWords[currentSentence][lexemes.index(lexeme)] += 1
    currentSentence += 1

print(f"Sentences: {len(bagOfWords)}")
print(f"Distinct lexemes: {len(bagOfWords[0])}")
print("Bag of Words: ")
print(bagOfWords)


Sentences: 839
Distinct lexemes: 2840
Bag of Words: 
[[1. 1. 1. ... 0. 0. 0.]
 [1. 3. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
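
The construction above can be traced by hand on a tiny input. The sketch below uses two hypothetical sentences and, as a deliberate variation on the notebook's code, a dictionary (`lexeme_index`) for column lookup instead of `list.index`, which rescans the list on every call.

```python
# Toy sentences (hypothetical), already stripped of punctuation.
toy_sentences = [
    "natural language processing",
    "language models process natural language",
]

# Map each distinct lexeme to a column, in order of first appearance.
lexeme_index = {}
for sentence in toy_sentences:
    for lexeme in sentence.split(' '):
        if lexeme not in lexeme_index:
            lexeme_index[lexeme] = len(lexeme_index)

# One row per sentence, one column per distinct lexeme.
toy_bag = [[0] * len(lexeme_index) for _ in toy_sentences]
for row, sentence in enumerate(toy_sentences):
    for lexeme in sentence.split(' '):
        toy_bag[row][lexeme_index[lexeme]] += 1

print(list(lexeme_index))
# ['natural', 'language', 'processing', 'models', 'process']
print(toy_bag)
# [[1, 1, 1, 0, 0], [1, 2, 0, 1, 1]]
```

The second row shows a count of 2 for `language`, mirroring the counts above 1 that appear in the notebook's matrix.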


## Part 3 - TF-IDF

In [4]:
# TF: occurrences of the lexeme in the sentence divided by the sentence's lexeme count
tfs = np.zeros((len(bagOfWords),len(bagOfWords[0])))
currentSentence = 0
for corpus in corpuses:
  for sentences in corpus:
    numberOfLexemes = len(sentences.split(' '))
    for lexeme in sentences.split(' '):
      tfs[currentSentence][lexemes.index(lexeme)] = bagOfWords[currentSentence][lexemes.index(lexeme)] / numberOfLexemes
    currentSentence += 1
print("TF: ")
print(tfs)

# IDF: log10(number of sentences / number of sentences containing the lexeme)
idfs = []
for lexeme in range(len(lexemes)):
  lexemeRate = 0
  for sentenceRates in bagOfWords:
    if sentenceRates[lexeme] > 0: lexemeRate += 1
  idfs.append(np.log10(len(bagOfWords)/lexemeRate))
print("IDF: ")
print(len(bagOfWords))
print(idfs)

# TF-IDF: each TF value scaled by the IDF of its lexeme
tfidf = np.zeros((len(bagOfWords),len(bagOfWords[0])))
for i in range(len(tfidf)):
  for j in range(len(tfidf[0])):
    tfidf[i][j] = tfs[i][j] * idfs[j]
print('TF-IDF: ')
print(tfidf)

TF: 
[[0.33333333 0.33333333 0.33333333 ... 0.         0.         0.        ]
 [0.02439024 0.07317073 0.02439024 ... 0.         0.         0.        ]
 [0.         0.04761905 0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
IDF: 
839
[1.3555602367617052, 0.7564446260805242, 1.0604391007082443, 0.851879953522575, 0.7334302626584088, 0.6542490166107839, 2.9237619608287004, 0.5381556872303881, 1.9237619608287002, 1.6933130394504263, 2.1456107104450566, 0.4628641180721524, 1.9695194513893755, 1.6933130394504263, 2.9237619608287004, 1.0487006974370003, 0.5054606695089547, 2.1456107104450566, 2.0206719738367567, 1.8823692756704753, 1.4466407061090378, 0.6882335139211514, 2.446640706109038, 1.7196419781727754, 0.5417449182538319, 2.2247919564926817, 1.5435507191170943, 1.
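
To make the three steps concrete, here is a small worked example in plain Python on a hypothetical 3x3 bag of words. It follows the same definitions as the cell above (TF as count over sentence length, IDF as a base-10 logarithm); the toy data and variable names are assumptions made for illustration.

```python
import math

# Toy bag of words (hypothetical): 3 sentences x 3 lexemes.
toy_bag = [
    [1, 1, 0],
    [0, 2, 0],
    [0, 1, 3],
]
number_of_sentences = len(toy_bag)
number_of_lexemes = len(toy_bag[0])

# TF: count of the lexeme in the sentence / total lexemes in the sentence.
toy_tf = [[count / sum(row) for count in row] for row in toy_bag]

# IDF: log10(number of sentences / number of sentences containing the lexeme).
toy_idf = []
for column in range(number_of_lexemes):
    sentences_with_lexeme = sum(1 for row in toy_bag if row[column] > 0)
    toy_idf.append(math.log10(number_of_sentences / sentences_with_lexeme))

# TF-IDF: element-wise product of each TF row with the IDF vector.
toy_tfidf = [[tf * idf for tf, idf in zip(row, toy_idf)] for row in toy_tf]

print(toy_idf)
print(toy_tfidf)
```

The middle lexeme occurs in every sentence, so its IDF is `log10(3/3) = 0` and it contributes nothing to any vector. As a consequence the second sentence, which contains only that lexeme, ends up with an all-zero TF-IDF vector, a case in which the cosine is undefined.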

## Part 4 - Distance Matrix

In [5]:
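distancesMatrix = np.zeros((len(tfidf),len(tfidf)))

# The cosine is symmetric, so compute only the entries with i >= currentVector
# and mirror each value across the diagonal.
for currentVector, vector in enumerate(tfidf):
  for i in range(currentVector, len(tfidf)):
    # Cosine of the angle between the two TF-IDF vectors:
    # 1 on the diagonal, 0 when the sentences share no lexemes.
    # An all-zero vector would make this 0/0 and yield nan.
    distance = np.dot(vector,tfidf[i])/(np.linalg.norm(vector)*np.linalg.norm(tfidf[i]))
    distancesMatrix[currentVector][i] = distance
    distancesMatrix[i][currentVector] = distance

print(distancesMatrix)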

[[1.         0.23695259 0.03804315 ... 0.         0.         0.        ]
 [0.23695259 1.         0.09584586 ... 0.         0.         0.02182825]
 [0.03804315 0.09584586 1.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.02182825 0.         ... 0.         0.         1.        ]]
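
The same matrix can also be computed without explicit Python loops. The sketch below is one possible vectorized variant, not the notebook's method: the function name `cosine_matrix` and the toy input are hypothetical, and assigning cosine 0 to all-zero rows is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def cosine_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine between the rows of `vectors`.

    Rows with zero norm get cosine 0 against every row, themselves included
    (an assumption of this sketch; a plain division would give nan instead).
    """
    norms = np.linalg.norm(vectors, axis=1)
    # Replace zero norms by 1 so the division is safe; those rows stay all-zero.
    safe_norms = np.where(norms == 0, 1.0, norms)
    unit_vectors = vectors / safe_norms[:, np.newaxis]
    # Entry (i, j) is the dot product of unit vectors i and j.
    return unit_vectors @ unit_vectors.T

# Toy TF-IDF rows (hypothetical); the last row is deliberately all zeros.
toy_tfidf = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
])
toy_cosines = cosine_matrix(toy_tfidf)
print(np.round(toy_cosines, 4))
```

Normalizing each row once and taking a single matrix product avoids recomputing the norms for every pair. The values stored are cosines, so identical directions score 1; if an actual distance is wanted, a common convention is `1 - cosine`, which the assignment does not ask for.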
