<a href="https://colab.research.google.com/github/Guilherme-Inkotte/tf-idf/blob/main/CI_TF_IDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 6 - TF-IDF
## Interpreter Construction
## Student: Guilherme Henrique Schneider Inkotte

Your task is to generate the term-document matrix using TF-IDF by applying the TF-IDF formulas to the term-document matrix built with the Bag of Words algorithm, over the corpus we retrieved earlier.
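
For reference, the formulas as they are applied in Part 3 of this notebook can be written as below. The symbols are introduced here for illustration only and do not appear in the original assignment statement. Each sentence of the corpus is treated as one document.

$$
\mathrm{tf}(t, s) = \frac{f_{t,s}}{|s|}, \qquad
\mathrm{idf}(t) = \log_{10}\frac{N}{n_t}, \qquad
\text{tf-idf}(t, s) = \mathrm{tf}(t, s) \cdot \mathrm{idf}(t)
$$

Here $f_{t,s}$ is the number of times lexeme $t$ occurs in sentence $s$, $|s|$ is the number of tokens in $s$, $N$ is the total number of sentences, and $n_t$ is the number of sentences that contain $t$ at least once.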

## Setup

In [1]:
from bs4 import BeautifulSoup
import requests
import spacy
import string
import numpy as np

## Part 1 - Building the Corpus

In [2]:
# Author: Guilherme Henrique Schneider Inkotte

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('sentencizer')

urls = [
  "https://en.wikipedia.org/wiki/Natural_language_processing", 
  "https://www.ibm.com/cloud/learn/natural-language-processing",
  "https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP",
  "https://www.tableau.com/learn/articles/natural-language-processing-examples",
  "https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1"
]
noLi = [
  "https://en.wikipedia.org/wiki/Natural_language_processing",
  "https://www.tableau.com/learn/articles/natural-language-processing-examples",
]
corpuses = []

for url in urls:
  page = requests.get(url)
  if page.status_code != 200:
    print(f"An error occurred while fetching the page {url}.")
    continue
  soup = BeautifulSoup(page.content, 'html.parser')
  corpus = []
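  # Collect paragraphs and headings; <li> elements are skipped for the URLs listed in noLi.
  tags = ["p", "h1", "h2", "h3", "h4", "h5", "h6"] + ([] if url in noLi else ["li"])
  for paragraphTag in soup.find_all(tags):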
    paragraph = paragraphTag.get_text()
    strippedParagraph = paragraph.strip()
    if len(strippedParagraph) > 0:
      doc = nlp(strippedParagraph)
      for sentence in doc.sents:
        corpus.append(sentence.text.translate(str.maketrans('', '', string.punctuation)))
  corpuses.append(corpus)
  print(f"Number of words in {url}: {len(' '.join(corpus).split(' '))}")
  print(corpus)

Number of words in https://en.wikipedia.org/wiki/Natural_language_processing: 1359
['Natural language processing', 'Natural language processing NLP is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data  ', 'The goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them', 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves', 'Challenges in natural language processing frequently involve speech recognition naturallanguage understanding and naturallanguage generation', 'Contents', 'Historyedit', 'Natural language processing has its roots in the 1950s', 'Already in 1950 Alan Turing published an article titled Computing Machinery
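
The punctuation removal used in the cell above can be looked at in isolation. The following is a minimal sketch, assuming a helper name (`strip_punctuation`) and sample strings that are hypothetical and not part of the notebook. It shows why hyphenated words such as "natural-language" appear as `naturallanguage` in the corpus: the hyphen is part of `string.punctuation` and is deleted rather than replaced by a space.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# Building it once avoids recreating it for every sentence.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(sentence: str) -> str:
    """Return the sentence with ASCII punctuation removed (hypothetical helper)."""
    return sentence.translate(PUNCT_TABLE)

# Illustrative inputs, not taken verbatim from the scraped pages.
print(strip_punctuation("Natural language processing (NLP) is a subfield."))
print(strip_punctuation("natural-language understanding"))
```

If keeping hyphenated words apart matters for the later steps, one option would be to map punctuation to spaces instead of deleting it, at the cost of producing extra whitespace that then has to be handled when splitting.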

## Part 2 - Building the Bag of Words

In [3]:
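# First pass: collect the distinct lexemes and count the sentences.
# Each sentence is treated as one document. Note that split(' ') keeps the
# empty strings produced by consecutive or trailing spaces, so '' is also
# counted as a lexeme in the results below.
lexemes = []
numberOfSentences = 0
for corpus in corpuses:
  for sentences in corpus:
    numberOfSentences += 1
    for lexeme in sentences.split(' '):
      if lexeme not in lexemes:
        lexemes.append(lexeme)

# Second pass: one row per sentence, one column per lexeme, filled with raw counts.
bagOfWords = np.zeros((numberOfSentences,len(lexemes)))
currentSentence = 0
for corpus in corpuses:
  for sentences in corpus:
    for lexeme in sentences.split(' '):
      bagOfWords[currentSentence][lexemes.index(lexeme)] += 1
    currentSentence += 1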

print(f"Sentences: {len(bagOfWords)}")
print(f"Distinct lexemes: {len(bagOfWords[0])}")
print("Bag of Words: ")
print(bagOfWords)


Sentences: 782
Distinct lexemes: 2782
Bag of Words: 
[[1. 1. 1. ... 0. 0. 0.]
 [1. 3. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
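
The cell above looks up each lexeme with `lexemes.index(...)`, which scans the list on every access. A possible alternative, sketched here under stated assumptions, keeps a dictionary from lexeme to column index so that each lookup is a constant-time operation on average. The function name `build_bag_of_words` and the toy sentences are hypothetical. Plain lists are used so the sketch runs without NumPy, and `split()` is used instead of `split(' ')`, which drops the empty strings caused by repeated spaces and therefore would not reproduce the exact counts shown above.

```python
def build_bag_of_words(sentences):
    """Return (lexemes, matrix) where matrix[i][j] counts lexeme j in sentence i."""
    index = {}  # lexeme -> column index, in order of first appearance
    for sentence in sentences:
        for lexeme in sentence.split():
            # Assign the next free column only if the lexeme is new.
            index.setdefault(lexeme, len(index))

    matrix = [[0] * len(index) for _ in sentences]
    for row, sentence in enumerate(sentences):
        for lexeme in sentence.split():
            matrix[row][index[lexeme]] += 1
    return list(index), matrix

# Toy corpus for illustration only.
toy = ["natural language processing", "language and language data"]
toy_lexemes, toy_bag = build_bag_of_words(toy)
print(toy_lexemes)  # ['natural', 'language', 'processing', 'and', 'data']
print(toy_bag)      # [[1, 1, 1, 0, 0], [0, 2, 0, 1, 1]]
```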


## Part 3 - TF-IDF
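
Before the full computation, a small worked example may help make the calculation concrete. This is a minimal sketch on a hypothetical two-sentence corpus, using the same definitions as the cell in this part: term frequency is the count divided by the number of tokens in the sentence, and inverse document frequency is the base-10 logarithm of the number of sentences divided by the number of sentences containing the lexeme. The function name `tf_idf` and the toy data are assumptions made for illustration, and the sketch assumes no sentence is empty.

```python
import math

def tf_idf(sentences):
    """Return (lexemes, matrix) with matrix[i][j] = tf * idf of lexeme j in sentence i."""
    tokenized = [sentence.split() for sentence in sentences]
    # Distinct lexemes in order of first appearance.
    lexemes = list(dict.fromkeys(t for tokens in tokenized for t in tokens))
    n = len(tokenized)
    # Number of sentences in which each lexeme occurs at least once.
    doc_freq = {lex: sum(lex in tokens for tokens in tokenized) for lex in lexemes}

    matrix = []
    for tokens in tokenized:
        row = []
        for lex in lexemes:
            tf = tokens.count(lex) / len(tokens)
            idf = math.log10(n / doc_freq[lex])
            row.append(tf * idf)
        matrix.append(row)
    return lexemes, matrix

# Toy corpus for illustration only.
toy_lexemes, toy_matrix = tf_idf(["natural language processing", "language data"])
print(toy_lexemes)  # ['natural', 'language', 'processing', 'data']
for row in toy_matrix:
    print([round(value, 4) for value in row])
# [0.1003, 0.0, 0.1003, 0.0]
# [0.0, 0.0, 0.0, 0.1505]
```

In this toy corpus "language" occurs in both sentences, so its IDF is log10(2/2) = 0 and its TF-IDF weight vanishes everywhere, while lexemes unique to one sentence keep a positive weight.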

In [22]:
# TF: count of each lexeme in a sentence divided by the number of tokens in that sentence
tfs = np.zeros((len(bagOfWords),len(bagOfWords[0])))
currentSentence = 0
for corpus in corpuses:
  for sentences in corpus:
    numberOfLexemes = len(sentences.split(' '))
    for lexeme in sentences.split(' '):
      tfs[currentSentence][lexemes.index(lexeme)] = bagOfWords[currentSentence][lexemes.index(lexeme)] / numberOfLexemes
    currentSentence += 1
print("TF: ")
print(tfs)

# IDF: log10 of (number of sentences / number of sentences containing the lexeme)
idfs = []
for lexeme in range(len(lexemes)):
  lexemeRate = 0  # number of sentences in which this lexeme occurs
  for sentenceRates in bagOfWords:
    if sentenceRates[lexeme] > 0:
      lexemeRate += 1
  idfs.append(np.log10(len(bagOfWords)/lexemeRate))
print("IDF: ")
print(len(bagOfWords))
print(idfs)

# TF-IDF: each TF value multiplied by the IDF of its lexeme (column)
tfidf = np.zeros((len(bagOfWords),len(bagOfWords[0])))
for i in range(len(tfidf)):
  for j in range(len(tfidf[0])):
    tfidf[i][j] = tfs[i][j] * idfs[j]
print('TF-IDF: ')
print(tfidf)

TF: 
[[0.33333333 0.33333333 0.33333333 ... 0.         0.         0.        ]
 [0.02439024 0.07317073 0.02439024 ... 0.         0.         0.        ]
 [0.         0.04761905 0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
IDF: 
782
[1.3369042522925607, 0.725889418311672, 1.0239750333288717, 0.8213247457537226, 0.720020484647574, 0.6331353650747732, 2.893206753059848, 0.5148088521117103, 1.893206753059848, 1.662757831681574, 2.1150555026762046, 0.4383618930513378, 1.938964243620523, 1.662757831681574, 2.893206753059848, 1.0067160278873661, 0.46345447305744003, 2.1150555026762046, 1.893206753059848, 1.851814067901623, 1.4160854983401856, 0.6757228088459417, 2.4160854983401854, 1.6890867704039232, 0.5148088521117103, 2.1942367487238292, 1.5314789170422551, 1.7470787