## Separação de páginas de acordo com o coletor necessário

Dado uma página inicial de portal de transparência de uma prefeitura, é de interesse que possamos gerar links a partir dela e separar essses links gerados de acordo com subtags de dados a serem coletados (o que é tarefa para um classificador de páginas) e principalmente considerando o coletor necessário para recuperação da informação (tema abordado nesse notebook). A hipótese inicial é que isso pode ser feito comparando e clusterizando as páginas de acordo com sua estrutura, seu estilo ou considerando os botões presentes nela.

In [1]:
!pip install jellyfish
!pip install html-similarity

  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
import urllib3
import jellyfish
import requests
import asyncio
import os
import time
import cv2
import numpy as np

from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib3.exceptions import *
from html_similarity import style_similarity, structural_similarity, similarity

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from playwright.async_api import async_playwright


options = Options()
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-web-security")
options.add_argument('--headless')

A seguir, há algumas páginas do Template 2. Essas páginas foram agrupadas de acordo com os coletores necessários para cada grupo, excetuando-se a paginas_extras, que se tratam de páginas isoladas do template. O ideal é que páginas de mesmo grupo (excetuando-se o último) apresentem resultados bem proximos entre si e resultados distantes de páginas de outros grupos. Assim, para que se possa mensurar isso, testou-se utilizando diversas métricas.

In [3]:
paginas_base = ["https://sardoa.mg.gov.br/transparencia/empenhos", 
                "https://sardoa.mg.gov.br/transparencia/pagamentos",
                "https://sardoa.mg.gov.br/transparencia/receitas"]

coletor_1 = ["https://sardoa.mg.gov.br/transparencia/folhas-de-pagamento/detalhes?indSituacaoServidorPensionista%5B0%5D=P&indSituacaoServidorPensionista%5B1%5D=03&IDE_FLPGO_id=35&ano=2015",
             "https://sardoa.mg.gov.br/transparencia/empenhos/detalhes/2022/02/118",
             "https://sardoa.mg.gov.br/transparencia/pagamentos/detalhes/2022/02/118"]

coletor_2 = ["https://sardoa.mg.gov.br/transparencia/empenhos/exibir/2022/02/33089",
             "https://sardoa.mg.gov.br/transparencia/pagamentos/exibir/2021/12/35798",
             "https://sardoa.mg.gov.br/transparencia/empenhos/exibir/2022/05/34915"]

paginas_extras = ["https://sardoa.mg.gov.br/transparencia",
                  "https://sardoa.mg.gov.br/transparencia/folhas-de-pagamento",
                  "https://sardoa.mg.gov.br/transparencia/coronavirus"]

testes = paginas_base + coletor_1 + coletor_2 + paginas_extras

Assim, geram-se matrizes de similaridade de acordo com cada uma das métricas consideradas. Nessa matriz, as submatrizes 3x3 que começam nos elementos [0][0], [3][3] e [6][6] idealmente devem possuir valores próximos e que indiquem alta semelhança, o que é um sinal de que um mesmo coletor pode ser utilizado.

In [4]:
RED   = "\033[1;31m"  
BLUE  = "\033[1;34m"
CYAN  = "\033[1;36m"
GREEN = "\033[0;32m"
RESET = "\033[0;0m" 

In [36]:
def gerarResultados(metrica, dados):
    
    similares = 0
    nao_similares = 0
    
    for i in range(len(dados)):
        for j in range(len(dados)):
            resultado = metrica(dados[i], dados[j])
            string = f"{resultado:.2f}"
            if i<3 and j< 3:
                print(GREEN + string, end = " ")
                if i != j:
                    similares += resultado
            elif i<6 and i>2 and j<6 and j>2:
                print(BLUE + string, end = " ")
                if i != j:
                    similares += resultado
            elif i<9 and i>5 and j<9 and j>5:
                print(CYAN + string, end = " ")
                if i != j:
                    similares += resultado
            else:
                print(RED + string, end = " ")
                if i != j:
                    nao_similares += resultado
        print("\n")
        
    print(RESET + "Similaridade Desejada Média = " + f"{(similares/18):.2f}")
    print("Similaridade Não Desejada Média = " + f"{(nao_similares/114):.2f}")

## Similaridade entre strings usando Jellyfish

In [17]:
def extrairHtml(url):

    driver = webdriver.Chrome(options=options)
    driver.get(url)
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    
    try:
        soup = BeautifulSoup(driver.page_source, "lxml")
        
        driver.close()
        
        for elm in soup.find_all():
            if not elm.find(recursive=False):
                elm.string = ''
            elm.attrs = {}
            
        html = str(soup.prettify()).replace("\n", "")
            
        return html.replace(" ", "")

    
    except:
        driver.close()
        
        print("Erro ao tentar abrir a pagina: " + url)

In [None]:
extrairHtml("https://sardoa.mg.gov.br/transparencia/coronavirus")

In [7]:
testes_1 = [extrairHtml(teste) for teste in testes]

In [17]:
gerarResultados(jellyfish.jaro_distance, testes_1)

[0;32m1.00 [0;32m1.00 [0;32m0.92 [1;31m0.85 [1;31m0.90 [1;31m0.86 [1;31m0.77 [1;31m0.77 [1;31m0.77 [1;31m0.78 [1;31m0.75 [1;31m0.75 

[0;32m1.00 [0;32m1.00 [0;32m0.92 [1;31m0.85 [1;31m0.90 [1;31m0.86 [1;31m0.77 [1;31m0.77 [1;31m0.77 [1;31m0.78 [1;31m0.75 [1;31m0.75 

[0;32m0.92 [0;32m0.92 [0;32m1.00 [1;31m0.85 [1;31m0.90 [1;31m0.86 [1;31m0.77 [1;31m0.77 [1;31m0.77 [1;31m0.78 [1;31m0.75 [1;31m0.74 

[1;31m0.85 [1;31m0.85 [1;31m0.85 [1;34m1.00 [1;34m0.87 [1;34m0.93 [1;31m0.85 [1;31m0.85 [1;31m0.85 [1;31m0.83 [1;31m0.84 [1;31m0.83 

[1;31m0.90 [1;31m0.90 [1;31m0.90 [1;34m0.87 [1;34m1.00 [1;34m0.89 [1;31m0.78 [1;31m0.78 [1;31m0.78 [1;31m0.79 [1;31m0.77 [1;31m0.76 

[1;31m0.86 [1;31m0.86 [1;31m0.86 [1;34m0.93 [1;34m0.89 [1;34m1.00 [1;31m0.84 [1;31m0.83 [1;31m0.84 [1;31m0.84 [1;31m0.82 [1;31m0.81 

[1;31m0.77 [1;31m0.77 [1;31m0.77 [1;31m0.85 [1;31m0.78 [1;31m0.84 [1;36m1.00 [1;36m0.97 [1;36m1.00 [1;31m0.91 [1;

In [18]:
gerarResultados(jellyfish.hamming_distance, testes_1)

[0;32m0.00 [0;32m0.00 [0;32m2325.00 [1;31m2306.00 [1;31m2106.00 [1;31m2306.00 [1;31m4031.00 [1;31m4028.00 [1;31m4031.00 [1;31m3967.00 [1;31m4057.00 [1;31m4071.00 

[0;32m0.00 [0;32m0.00 [0;32m2325.00 [1;31m2306.00 [1;31m2106.00 [1;31m2306.00 [1;31m4031.00 [1;31m4028.00 [1;31m4031.00 [1;31m3967.00 [1;31m4057.00 [1;31m4071.00 

[0;32m2325.00 [0;32m2325.00 [0;32m0.00 [1;31m2385.00 [1;31m2210.00 [1;31m2338.00 [1;31m4053.00 [1;31m4056.00 [1;31m4053.00 [1;31m4018.00 [1;31m4071.00 [1;31m4107.00 

[1;31m2306.00 [1;31m2306.00 [1;31m2385.00 [1;34m0.00 [1;34m2035.00 [1;34m1188.00 [1;31m2673.00 [1;31m2677.00 [1;31m2673.00 [1;31m2705.00 [1;31m2704.00 [1;31m2722.00 

[1;31m2106.00 [1;31m2106.00 [1;31m2210.00 [1;34m2035.00 [1;34m0.00 [1;34m1821.00 [1;31m3799.00 [1;31m3786.00 [1;31m3799.00 [1;31m3778.00 [1;31m3820.00 [1;31m3853.00 

[1;31m2306.00 [1;31m2306.00 [1;31m2338.00 [1;34m1188.00 [1;34m1821.00 [1;34m0.00 [1;31m2946.00 [1;31m29

In [19]:
gerarResultados(jellyfish.jaro_winkler, testes_1)

[0;32m1.00 [0;32m1.00 [0;32m0.95 [1;31m0.91 [1;31m0.94 [1;31m0.92 [1;31m0.86 [1;31m0.86 [1;31m0.86 [1;31m0.87 [1;31m0.85 [1;31m0.85 

[0;32m1.00 [0;32m1.00 [0;32m0.95 [1;31m0.91 [1;31m0.94 [1;31m0.92 [1;31m0.86 [1;31m0.86 [1;31m0.86 [1;31m0.87 [1;31m0.85 [1;31m0.85 

[0;32m0.95 [0;32m0.95 [0;32m1.00 [1;31m0.91 [1;31m0.94 [1;31m0.92 [1;31m0.86 [1;31m0.86 [1;31m0.86 [1;31m0.87 [1;31m0.85 [1;31m0.85 

[1;31m0.91 [1;31m0.91 [1;31m0.91 [1;34m1.00 [1;34m0.92 [1;34m0.96 [1;31m0.91 [1;31m0.91 [1;31m0.91 [1;31m0.90 [1;31m0.90 [1;31m0.90 

[1;31m0.94 [1;31m0.94 [1;31m0.94 [1;34m0.92 [1;34m1.00 [1;34m0.93 [1;31m0.87 [1;31m0.87 [1;31m0.87 [1;31m0.87 [1;31m0.86 [1;31m0.85 

[1;31m0.92 [1;31m0.92 [1;31m0.92 [1;34m0.96 [1;34m0.93 [1;34m1.00 [1;31m0.90 [1;31m0.90 [1;31m0.90 [1;31m0.91 [1;31m0.89 [1;31m0.88 

[1;31m0.86 [1;31m0.86 [1;31m0.86 [1;31m0.91 [1;31m0.87 [1;31m0.90 [1;36m1.00 [1;36m0.98 [1;36m1.00 [1;31m0.95 [1;

## HTML similarity

In [20]:
gerarResultados(structural_similarity, testes_1)

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.76 [1;31m0.71 [1;31m0.80 [1;31m0.69 [1;31m0.69 [1;31m0.69 [1;31m0.62 [1;31m0.68 [1;31m0.70 

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.76 [1;31m0.71 [1;31m0.80 [1;31m0.69 [1;31m0.69 [1;31m0.69 [1;31m0.62 [1;31m0.68 [1;31m0.70 

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.76 [1;31m0.71 [1;31m0.80 [1;31m0.69 [1;31m0.69 [1;31m0.69 [1;31m0.62 [1;31m0.68 [1;31m0.70 

[1;31m0.76 [1;31m0.76 [1;31m0.76 [1;34m1.00 [1;34m0.78 [1;34m0.91 [1;31m0.83 [1;31m0.83 [1;31m0.83 [1;31m0.76 [1;31m0.86 [1;31m0.88 

[1;31m0.71 [1;31m0.71 [1;31m0.71 [1;34m0.78 [1;34m1.00 [1;34m0.84 [1;31m0.68 [1;31m0.68 [1;31m0.68 [1;31m0.63 [1;31m0.70 [1;31m0.71 

[1;31m0.80 [1;31m0.80 [1;31m0.80 [1;34m0.91 [1;34m0.84 [1;34m1.00 [1;31m0.79 [1;31m0.79 [1;31m0.79 [1;31m0.73 [1;31m0.82 [1;31m0.84 

[1;31m0.70 [1;31m0.70 [1;31m0.69 [1;31m0.84 [1;31m0.69 [1;31m0.81 [1;36m1.00 [1;36m0.99 [1;36m1.00 [1;31m0.80 [1;

In [21]:
gerarResultados(similarity, testes_1)

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.88 [1;31m0.86 [1;31m0.90 [1;31m0.84 [1;31m0.85 [1;31m0.84 [1;31m0.81 [1;31m0.84 [1;31m0.85 

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.88 [1;31m0.86 [1;31m0.90 [1;31m0.84 [1;31m0.85 [1;31m0.84 [1;31m0.81 [1;31m0.84 [1;31m0.85 

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.88 [1;31m0.86 [1;31m0.90 [1;31m0.84 [1;31m0.85 [1;31m0.84 [1;31m0.81 [1;31m0.84 [1;31m0.85 

[1;31m0.88 [1;31m0.88 [1;31m0.88 [1;34m1.00 [1;34m0.89 [1;34m0.96 [1;31m0.91 [1;31m0.92 [1;31m0.91 [1;31m0.88 [1;31m0.93 [1;31m0.94 

[1;31m0.86 [1;31m0.86 [1;31m0.86 [1;34m0.89 [1;34m1.00 [1;34m0.92 [1;31m0.84 [1;31m0.84 [1;31m0.84 [1;31m0.81 [1;31m0.85 [1;31m0.86 

[1;31m0.90 [1;31m0.90 [1;31m0.90 [1;34m0.96 [1;34m0.92 [1;34m1.00 [1;31m0.89 [1;31m0.90 [1;31m0.89 [1;31m0.86 [1;31m0.91 [1;31m0.92 

[1;31m0.85 [1;31m0.85 [1;31m0.85 [1;31m0.92 [1;31m0.85 [1;31m0.90 [1;36m1.00 [1;36m1.00 [1;36m1.00 [1;31m0.90 [1;

In [6]:
def extrairHtmlCompleto(url):

    driver = webdriver.Chrome(options=options)
    driver.get(url)
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    
    try:
        soup = BeautifulSoup(driver.page_source, "lxml")
        driver.close()
        
        html = str(soup)
            
        return html
    
    except:
        driver.close()
        print("Erro ao tentar abrir a pagina: " + url)

In [14]:
testes_2 = [extrairHtmlCompleto(teste) for teste in testes]

In [23]:
gerarResultados(structural_similarity, testes_2)

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.77 [1;31m0.71 [1;31m0.80 [1;31m0.63 [1;31m0.69 [1;31m0.63 [1;31m0.62 [1;31m0.63 [1;31m0.70 

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.77 [1;31m0.71 [1;31m0.80 [1;31m0.63 [1;31m0.69 [1;31m0.63 [1;31m0.62 [1;31m0.63 [1;31m0.70 

[0;32m1.00 [0;32m1.00 [0;32m1.00 [1;31m0.77 [1;31m0.71 [1;31m0.80 [1;31m0.63 [1;31m0.69 [1;31m0.63 [1;31m0.62 [1;31m0.63 [1;31m0.70 

[1;31m0.77 [1;31m0.77 [1;31m0.77 [1;34m1.00 [1;34m0.82 [1;34m0.93 [1;31m0.72 [1;31m0.80 [1;31m0.72 [1;31m0.73 [1;31m0.74 [1;31m0.84 

[1;31m0.71 [1;31m0.71 [1;31m0.71 [1;34m0.82 [1;34m1.00 [1;34m0.84 [1;31m0.62 [1;31m0.68 [1;31m0.62 [1;31m0.63 [1;31m0.64 [1;31m0.71 

[1;31m0.80 [1;31m0.80 [1;31m0.80 [1;34m0.93 [1;34m0.84 [1;34m1.00 [1;31m0.72 [1;31m0.79 [1;31m0.72 [1;31m0.73 [1;31m0.74 [1;31m0.84 

[1;31m0.64 [1;31m0.64 [1;31m0.64 [1;31m0.73 [1;31m0.64 [1;31m0.73 [1;36m1.00 [1;36m0.90 [1;36m1.00 [1;31m0.73 [1;

In [68]:
gerarResultados(style_similarity, testes_2)

[0;32m1.00 [0;32m1.00 [0;32m0.99 [1;31m0.76 [1;31m0.88 [1;31m0.88 [1;31m0.53 [1;31m0.53 [1;31m0.53 [1;31m0.51 [1;31m0.54 [1;31m0.56 

[0;32m1.00 [0;32m1.00 [0;32m0.99 [1;31m0.76 [1;31m0.88 [1;31m0.88 [1;31m0.53 [1;31m0.53 [1;31m0.53 [1;31m0.51 [1;31m0.54 [1;31m0.56 

[0;32m0.99 [0;32m0.99 [0;32m1.00 [1;31m0.75 [1;31m0.87 [1;31m0.88 [1;31m0.52 [1;31m0.52 [1;31m0.52 [1;31m0.51 [1;31m0.54 [1;31m0.55 

[1;31m0.76 [1;31m0.76 [1;31m0.75 [1;34m1.00 [1;34m0.85 [1;34m0.86 [1;31m0.65 [1;31m0.65 [1;31m0.65 [1;31m0.66 [1;31m0.70 [1;31m0.72 

[1;31m0.88 [1;31m0.88 [1;31m0.87 [1;34m0.85 [1;34m1.00 [1;34m0.99 [1;31m0.57 [1;31m0.57 [1;31m0.57 [1;31m0.57 [1;31m0.61 [1;31m0.63 

[1;31m0.88 [1;31m0.88 [1;31m0.88 [1;34m0.86 [1;34m0.99 [1;34m1.00 [1;31m0.58 [1;31m0.58 [1;31m0.58 [1;31m0.58 [1;31m0.61 [1;31m0.63 

[1;31m0.53 [1;31m0.53 [1;31m0.52 [1;31m0.65 [1;31m0.57 [1;31m0.58 [1;36m1.00 [1;36m0.93 [1;36m1.00 [1;31m0.76 [1;

In [24]:
gerarResultados(similarity, testes_2)

[0;32m1.00 [0;32m1.00 [0;32m0.99 [1;31m0.81 [1;31m0.80 [1;31m0.83 [1;31m0.54 [1;31m0.61 [1;31m0.54 [1;31m0.57 [1;31m0.55 [1;31m0.64 

[0;32m1.00 [0;32m1.00 [0;32m0.99 [1;31m0.81 [1;31m0.80 [1;31m0.83 [1;31m0.54 [1;31m0.61 [1;31m0.54 [1;31m0.57 [1;31m0.55 [1;31m0.64 

[0;32m0.99 [0;32m0.99 [0;32m1.00 [1;31m0.80 [1;31m0.79 [1;31m0.83 [1;31m0.54 [1;31m0.61 [1;31m0.54 [1;31m0.57 [1;31m0.54 [1;31m0.63 

[1;31m0.81 [1;31m0.81 [1;31m0.80 [1;34m1.00 [1;34m0.88 [1;34m0.94 [1;31m0.61 [1;31m0.70 [1;31m0.61 [1;31m0.67 [1;31m0.63 [1;31m0.75 

[1;31m0.80 [1;31m0.80 [1;31m0.79 [1;34m0.88 [1;34m1.00 [1;34m0.90 [1;31m0.55 [1;31m0.63 [1;31m0.55 [1;31m0.61 [1;31m0.57 [1;31m0.67 

[1;31m0.83 [1;31m0.83 [1;31m0.83 [1;34m0.94 [1;34m0.90 [1;34m1.00 [1;31m0.61 [1;31m0.70 [1;31m0.61 [1;31m0.66 [1;31m0.63 [1;31m0.75 

[1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.62 [1;31m0.56 [1;31m0.61 [1;36m1.00 [1;36m0.81 [1;36m1.00 [1;31m0.66 [1;

## Extração de botões??

In [25]:
def extrairBotoes(url):

    resultados = ""
    
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data, 'html.parser')
    
    buttons = soup.find_all('button')
    for button in buttons:
        resultados+=str(button) + "\n"
        
    return resultados

In [26]:
testes_3 = [extrairBotoes(teste) for teste in testes]

In [27]:
gerarResultados(similarity, testes_3)

[0;32m1.00 [0;32m1.00 [0;32m0.90 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 

[0;32m1.00 [0;32m1.00 [0;32m0.90 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 [1;31m0.61 

[0;32m0.90 [0;32m0.90 [0;32m1.00 [1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.55 [1;31m0.55 

[1;31m0.61 [1;31m0.61 [1;31m0.55 [1;34m1.00 [1;34m1.00 [1;34m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 

[1;31m0.61 [1;31m0.61 [1;31m0.55 [1;34m1.00 [1;34m1.00 [1;34m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 

[1;31m0.61 [1;31m0.61 [1;31m0.55 [1;34m1.00 [1;34m1.00 [1;34m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m1.00 

[1;31m0.61 [1;31m0.61 [1;31m0.55 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;36m1.00 [1;36m1.00 [1;36m1.00 [1;31m1.00 [1;

In [30]:
def extrairBotoesSelenium(url):

    resultados = ""

    driver = webdriver.Chrome(options=options)
    driver.get(url)

    html = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html, 'html.parser')
    buttons = soup.find_all('button')
    for button in buttons:
        resultados+=str(button)+"\n"
        
    return resultados

In [31]:
testes_4 = [extrairBotoesSelenium(teste) for teste in testes]

In [32]:
gerarResultados(similarity, testes_4)

[0;32m1.00 [0;32m1.00 [0;32m0.95 [1;31m0.72 [1;31m0.84 [1;31m0.84 [1;31m0.31 [1;31m0.31 [1;31m0.31 [1;31m0.20 [1;31m0.31 [1;31m0.31 

[0;32m1.00 [0;32m1.00 [0;32m0.95 [1;31m0.72 [1;31m0.84 [1;31m0.84 [1;31m0.31 [1;31m0.31 [1;31m0.31 [1;31m0.20 [1;31m0.31 [1;31m0.31 

[0;32m0.95 [0;32m0.95 [0;32m1.00 [1;31m0.69 [1;31m0.79 [1;31m0.79 [1;31m0.29 [1;31m0.29 [1;31m0.29 [1;31m0.19 [1;31m0.29 [1;31m0.29 

[1;31m0.72 [1;31m0.72 [1;31m0.69 [1;34m1.00 [1;34m0.86 [1;34m0.86 [1;31m0.44 [1;31m0.44 [1;31m0.44 [1;31m0.24 [1;31m0.44 [1;31m0.44 

[1;31m0.84 [1;31m0.84 [1;31m0.79 [1;34m0.86 [1;34m1.00 [1;34m1.00 [1;31m0.37 [1;31m0.37 [1;31m0.37 [1;31m0.22 [1;31m0.37 [1;31m0.37 

[1;31m0.84 [1;31m0.84 [1;31m0.79 [1;34m0.86 [1;34m1.00 [1;34m1.00 [1;31m0.37 [1;31m0.37 [1;31m0.37 [1;31m0.22 [1;31m0.37 [1;31m0.37 

[1;31m0.31 [1;31m0.31 [1;31m0.29 [1;31m0.44 [1;31m0.37 [1;31m0.37 [1;36m1.00 [1;36m1.00 [1;36m1.00 [1;31m0.23 [1;

## Essa métrica de considerar os botões geraram alto nivel de diferença entre páginas de grupos diferentes, mas a similaridade desejada foi bem baixa em alguns casos. E se misturarmos essa métrica com uma de similaridade desejada média alta?

In [33]:
similares = 0
nao_similares = 0

for i in range(len(testes_4)):
    for j in range(len(testes_4)):
        resultado_1 = similarity(testes_4[i], testes_4[j])
        resultado_2 =  similarity(testes_2[i], testes_2[j])
        resultado = (resultado_1+resultado_2)/2
        string = f"{resultado:.2f}"
        if i<3 and j< 3:
            print(GREEN + string, end = " ")
            if i != j:
                similares += resultado
        elif i<6 and i>2 and j<6 and j>2:
            print(BLUE + string, end = " ")
            if i != j:
                similares += resultado
        elif i<9 and i>5 and j<9 and j>5:
            print(CYAN + string, end = " ")
            if i != j:
                similares += resultado
        else:
            print(RED + string, end = " ")
            if i != j:
                nao_similares += resultado
    print("\n")

print(RESET + "Similaridade Desejada Média = " + f"{(similares/18):.2f}")
print("Similaridade Não Desejada Média = " + f"{(nao_similares/114):.2f}")

[0;32m1.00 [0;32m1.00 [0;32m0.97 [1;31m0.76 [1;31m0.82 [1;31m0.83 [1;31m0.42 [1;31m0.46 [1;31m0.42 [1;31m0.39 [1;31m0.43 [1;31m0.47 

[0;32m1.00 [0;32m1.00 [0;32m0.97 [1;31m0.76 [1;31m0.82 [1;31m0.83 [1;31m0.42 [1;31m0.46 [1;31m0.42 [1;31m0.39 [1;31m0.43 [1;31m0.47 

[0;32m0.97 [0;32m0.97 [0;32m1.00 [1;31m0.74 [1;31m0.79 [1;31m0.81 [1;31m0.42 [1;31m0.45 [1;31m0.42 [1;31m0.38 [1;31m0.42 [1;31m0.46 

[1;31m0.76 [1;31m0.76 [1;31m0.74 [1;34m1.00 [1;34m0.87 [1;34m0.90 [1;31m0.52 [1;31m0.57 [1;31m0.52 [1;31m0.45 [1;31m0.53 [1;31m0.59 

[1;31m0.82 [1;31m0.82 [1;31m0.79 [1;34m0.87 [1;34m1.00 [1;34m0.95 [1;31m0.46 [1;31m0.50 [1;31m0.46 [1;31m0.41 [1;31m0.47 [1;31m0.52 

[1;31m0.83 [1;31m0.83 [1;31m0.81 [1;34m0.90 [1;34m0.95 [1;34m1.00 [1;31m0.49 [1;31m0.53 [1;31m0.49 [1;31m0.44 [1;31m0.50 [1;31m0.56 

[1;31m0.43 [1;31m0.43 [1;31m0.42 [1;31m0.53 [1;31m0.47 [1;31m0.49 [1;36m1.00 [1;36m0.91 [1;36m1.00 [1;31m0.44 [1;

## Metrica par a par

In [3]:
def gerar_resultado_botoes(url1, url2):

    resultados1, resultados2 = "", ""

    driver = webdriver.Chrome(options=options)
    driver.get(url1)

    html = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html, 'html.parser')
    buttons = soup.find_all('button')
    for button in buttons:
        resultados1+=str(button)+"\n"
        
    
    driver = webdriver.Chrome(options=options)
    driver.get(url2)

    html = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html, 'html.parser')
    buttons = soup.find_all('button')
    for button in buttons:
        resultados2+=str(button)+"\n"
    
    return similarity(resultados1, resultados2)

In [4]:
gerar_resultado_botoes("https://sardoa.mg.gov.br/transparencia/empenhos", "https://sardoa.mg.gov.br/transparencia/pagamentos")

WebDriverException: Message: unknown error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=103.0.5060.134)
Stacktrace:
#0 0x560344db0cd3 <unknown>
#1 0x560344bb8968 <unknown>
#2 0x560344bb2745 <unknown>
#3 0x560344ba5096 <unknown>
#4 0x560344ba6032 <unknown>
#5 0x560344ba5362 <unknown>
#6 0x560344ba4795 <unknown>
#7 0x560344ba345f <unknown>
#8 0x560344ba38c2 <unknown>
#9 0x560344bba3c2 <unknown>
#10 0x560344c2160f <unknown>
#11 0x560344c0dbb2 <unknown>
#12 0x560344c20ea8 <unknown>
#13 0x560344c0daa3 <unknown>
#14 0x560344be33fa <unknown>
#15 0x560344be4555 <unknown>
#16 0x560344df82bd <unknown>
#17 0x560344dfc418 <unknown>
#18 0x560344de236e <unknown>
#19 0x560344dfd078 <unknown>
#20 0x560344dd6bb0 <unknown>
#21 0x560344e19d58 <unknown>
#22 0x560344e19ed8 <unknown>
#23 0x560344e33cfd <unknown>
#24 0x7f4f4ae72609 <unknown>


## Comparando screenshots

In [54]:
cd = os.getcwd()

In [55]:
try:
    os.mkdir("screenshots")
except FileExistsError:
    pass

In [None]:
async def tirar_screenshots(urls):
    
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(headless = True)
    page = await browser.new_page()
    
    for i in range(len(urls)):
        await page.goto(urls[i])
        await page.screenshot(path=os.getcwd()+"/screenshots/ss"+str(i) +".png")
        
    await page.close()
    await browser.close()

In [34]:
loop = asyncio.get_event_loop()
loop.create_task(tirar_screenshots(testes))

<Task pending name='Task-16' coro=<tirar_screenshots() running at /tmp/ipykernel_171573/2948449917.py:1>>

In [101]:
def diferenca_duas_imagens(img_1, img_2):
    
    MSE = np.square(np.subtract(img_1,img_2)).mean() 
    RMSE = np.sqrt(MSE)
    
    return RMSE/255

In [111]:
def calcular_distancias():
    
    dir_screenshots = os.getcwd()+"/screenshots"
    
    screenshots = list(os.walk(os.getcwd()+"/screenshots"))[0][2]
    dados = [cv2.imread(dir_screenshots + "/"+image) for image in screenshots]
    
    tam = len(screenshots)
    resultados = np.zeros((tam, tam), dtype = np.float32)
    
    for i in range(0, tam):
        for j in range(0, tam):
            distancia = diferenca_duas_imagens(dados[i], dados[j])
            resultados[i,j] = distancia
            resultados[j,i] = distancia
            
    return resultados

In [112]:
res = calcular_distancias()

In [113]:
print(res)

[[0.         0.00755656 0.00743515 0.01499168 0.00756169 0.02532585
  0.00901948 0.00152697 0.01022694 0.00946298 0.01564439 0.00952083]
 [0.00755656 0.         0.00896011 0.01559064 0.00813164 0.02549205
  0.00947042 0.00757184 0.01058512 0.0097188  0.01584595 0.01055994]
 [0.00743515 0.00896011 0.         0.01505203 0.00873241 0.02527068
  0.01000045 0.00758476 0.01123649 0.01037665 0.01637761 0.01016361]
 [0.01499168 0.01559064 0.01505203 0.         0.01559751 0.02097796
  0.01633633 0.01500151 0.01704526 0.0164161  0.01889529 0.01635696]
 [0.00756169 0.00813164 0.00873241 0.01559751 0.         0.02546529
  0.00552943 0.00769193 0.01049148 0.00953105 0.01581229 0.01046621]
 [0.02532585 0.02549205 0.02527068 0.02097796 0.02546529 0.
  0.02552939 0.0253243  0.02563792 0.02557976 0.02507985 0.0257978 ]
 [0.00901948 0.00947042 0.01000045 0.01633633 0.00552943 0.02552939
  0.         0.00912895 0.01051213 0.00955527 0.01654168 0.01069114]
 [0.00152697 0.00757184 0.00758476 0.01500151 0.0

In [114]:
screenshots = list(os.walk(os.getcwd()+"/screenshots"))[0][2]
print(screenshots)

['ss0.png', 'ss2.png', 'ss4.png', 'ss3.png', 'ss8.png', 'ss11.png', 'ss6.png', 'ss1.png', 'ss10.png', 'ss7.png', 'ss9.png', 'ss5.png']


## Comparando localização de hiperlinks na página

In [18]:
def processar_html(texto_html):
    
    tam = len(texto_html)
    tags = []
    
    i = 0
    while i < tam:
        
        aux_tag = ""
        
        if texto_html[i] == "<":
            i+=1
            if texto_html[i] == "!":
                while texto_html[i] != "<":
                    i+=1
            else:
                while texto_html[i] != ">":
                    aux_tag+=texto_html[i]
                    i+=1
                i+=1
                tags.append(aux_tag)

        else:
            while texto_html[i] != "<":
                i+=1
    
    return tags

In [19]:
texto_html = extrairHtml("https://sardoa.mg.gov.br/transparencia/coronavirus")

In [20]:
lista_tags = processar_html(texto_html)
print(lista_tags)

['html', 'head', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'meta', '/meta', 'link', '/link', 'title', '/title', 'style', '/style', 'link', '/link', 'link', '/link', 'link', '/link', 'link', '/link', 'link', '/link', 'link', '/link', 'link', '/link', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', 'style', '/style', '/head', 'body', 'div', 'div', 'div', 'img', '/img', 'img', '/img', '/div', 'div', 'div', 'div', '/div', 'div', '/div', '

In [21]:
def localizar_tags(lista_tags):
    
    caminhos = []
    caminho_atual = []
    
    for ele in lista_tags:
        
        if ele.startswith("/"):
            caminho_atual.pop()
        else:
            caminho_atual.append(ele)
            
        if ele == "a":
            caminhos.append(caminho_atual.copy())
            
    return caminhos
            

In [22]:
localizacoes = localizar_tags(lista_tags)

In [23]:
class No:

    def __init__(self, tag):

        self.tag = tag
        self.sons = {}
        self.number = 0

In [28]:
class Arvore:
    
    def __init__(self,tags):
        
        self.raiz = No("")
        self.localizacao_tags = tags
        self.resultado = []
        self.resultado_string = ""
        self.vector = []
        
    def gerar_arvore(self):
        
        for tags in self.localizacao_tags:
            
            no_atual = self.raiz
            
            for tag in tags:
                
                if tag not in no_atual.sons:
                    
                    novo_no = No(tag)
                    no_atual.sons[tag] = novo_no
                    no_atual = novo_no
                    
                    if tag == 'a':
                        novo_no.number = 1
                    
                else:
                    
                    no_atual = no_atual.sons[tag]
                
                    if tag == 'a':
                        novo_no.number +=1
                
        return self.raiz
                
        
    def caminhar_em_profundidade(self, no):
        
        no_atual = no
        self.resultado.append(no_atual.tag)
        
        if not any(no_atual.sons):
            return
        
        else:
            
            for no in no_atual.sons:
            
                self.caminhar_em_profundidade(no_atual.sons[no])
                
    def gerar_resultado(self):
        
        self.resultado_string = "/".join(self.resultado)
    

In [29]:
arv = Arvore(localizacoes)
raiz = arv.gerar_arvore()

In [30]:
print(raiz)

<__main__.No object at 0x7f6036a83220>


In [31]:
arv.caminhar_em_profundidade(raiz)
arv.gerar_resultado()

In [32]:
arv.resultado_string

'/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/a/a/footer/ul/a/a'

In [33]:
def gerar_arvore_url(url):
    
    texto_html = extrairHtml(url)
    
    lista_tags = processar_html(texto_html)
    localizacoes = localizar_tags(lista_tags)
    
    arv = Arvore(localizacoes)
    raiz = arv.gerar_arvore()
    
    arv.caminhar_em_profundidade(raiz)
    arv.gerar_resultado()
    
    return arv.resultado_string

In [34]:
testes = [gerar_arvore_url(teste) for teste in testes]

In [41]:
gerarResultados(jellyfish.jaro_distance, testes)

[0;32m1.00 [0;32m0.92 [0;32m0.92 [1;31m0.93 [1;31m0.92 [1;31m0.93 [1;31m0.93 [1;31m0.93 [1;31m0.93 [1;31m0.94 [1;31m0.93 [1;31m0.93 

[0;32m0.92 [0;32m1.00 [0;32m0.95 [1;31m0.92 [1;31m0.95 [1;31m0.93 [1;31m0.92 [1;31m0.92 [1;31m0.92 [1;31m0.90 [1;31m0.92 [1;31m0.92 

[0;32m0.92 [0;32m0.95 [0;32m1.00 [1;31m0.92 [1;31m1.00 [1;31m0.92 [1;31m0.92 [1;31m0.92 [1;31m0.92 [1;31m0.90 [1;31m0.92 [1;31m0.92 

[1;31m0.93 [1;31m0.92 [1;31m0.92 [1;34m1.00 [1;34m0.92 [1;34m0.96 [1;31m1.00 [1;31m1.00 [1;31m1.00 [1;31m0.96 [1;31m1.00 [1;31m0.98 

[1;31m0.92 [1;31m0.95 [1;31m1.00 [1;34m0.92 [1;34m1.00 [1;34m0.92 [1;31m0.92 [1;31m0.92 [1;31m0.92 [1;31m0.90 [1;31m0.92 [1;31m0.92 

[1;31m0.93 [1;31m0.93 [1;31m0.92 [1;34m0.96 [1;34m0.92 [1;34m1.00 [1;31m0.96 [1;31m0.96 [1;31m0.96 [1;31m0.94 [1;31m0.96 [1;31m0.97 

[1;31m0.93 [1;31m0.92 [1;31m0.92 [1;31m1.00 [1;31m0.92 [1;31m0.96 [1;36m1.00 [1;36m1.00 [1;36m1.00 [1;31m0.96 [1;

In [40]:
for teste in testes:
    print(teste)

/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/table/tbody/tr/td/a/footer/ul/a/a
/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/div/table/tbody/tr/td/a/ul/li/a/footer/ul/a/a
/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/a/div/table/tbody/tr/td/a/ul/li/a/footer/ul/a/a
/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/a/footer/ul/a/a
/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/a/div/table/tbody/tr/td/a/ul/li/a/footer/ul/a/a
/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/a/ul/li/a/footer/ul/a/a
/html/body/div/nav/div/a/ul/li/a/div/div/ul/a/div/aside/div/ul/li/a/main/section/div/div/div/div/div/div/div/a/footer/ul/a/a
/html/body/di