## Separating pages by the collector they require

Given the home page of a city hall's transparency portal, we want to generate links from it and separate those links according to the subtags of the data to be collected (a task for a page classifier) and, above all, according to the collector needed to retrieve the information (the topic of this notebook). The initial hypothesis is that this can be done by comparing and clustering pages by their structure, their style, or the buttons they contain.
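
The link-generation step mentioned above can be sketched as follows. This is a minimal sketch that works on an HTML string already in hand and uses only the standard library; `LinkCollector` and `gerarLinks` are hypothetical names, and keeping only links on the portal's own host is an assumption about which pages matter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def gerarLinks(html, base_url):
    """Return absolute, de-duplicated links on the same host as base_url."""
    parser = LinkCollector()
    parser.feed(html)

    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        # Assumption: only pages on the portal's own host are of interest
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


exemplo = '<a href="/transparencia/empenhos">Empenhos</a><a href="https://example.com/">x</a>'
print(gerarLinks(exemplo, "https://freigaspar.mg.gov.br/transparencia"))
```

For pages rendered by JavaScript, the same function could be fed `driver.page_source` from Selenium instead of a static string.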

In [32]:
!pip install jellyfish
!pip install html-similarity



In [3]:
import urllib3
import jellyfish
import requests

from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib3.exceptions import *
from html_similarity import style_similarity, structural_similarity, similarity

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--disable-notifications")
options.add_argument("--disable-popup-blocking")
options.add_argument("--disable-web-security")
options.add_argument('--headless')

Below are some pages from Template 2. They are grouped by the collector each group requires, except for paginas_extras, which holds isolated pages of the template. Ideally, pages in the same group (other than the last one) should score close to each other and far from pages in other groups. To measure this, several metrics were tested.

In [34]:
paginas_base = ["https://freigaspar.mg.gov.br/transparencia/empenhos", 
                "https://freigaspar.mg.gov.br/transparencia/pagamentos",
                "https://freigaspar.mg.gov.br/transparencia/receitas"]

coletor_1 = ["https://freigaspar.mg.gov.br/transparencia/folhas-de-pagamento/detalhes?indSituacaoServidorPensionista%5B0%5D=P&indSituacaoServidorPensionista%5B1%5D=03&IDE_FLPGO_id=35&ano=2015",
             "https://freigaspar.mg.gov.br/transparencia/empenhos/detalhes/2022/02/118",
             "https://freigaspar.mg.gov.br/transparencia/pagamentos/detalhes/2022/02/118"]

coletor_2 = ["https://freigaspar.mg.gov.br/transparencia/empenhos/exibir/2022/02/33089",
             "https://freigaspar.mg.gov.br/transparencia/pagamentos/exibir/2021/12/35798",
             "https://freigaspar.mg.gov.br/transparencia/empenhos/exibir/2022/05/34915"]

paginas_extras = ["https://freigaspar.mg.gov.br/transparencia",
                  "https://freigaspar.mg.gov.br/transparencia/folhas-de-pagamento",
                  "https://freigaspar.mg.gov.br/transparencia/coronavirus"]

testes = paginas_base + coletor_1 + coletor_2 + paginas_extras

['https://freigaspar.mg.gov.br/transparencia/empenhos', 'https://freigaspar.mg.gov.br/transparencia/pagamentos', 'https://freigaspar.mg.gov.br/transparencia/receitas', 'https://freigaspar.mg.gov.br/transparencia/folhas-de-pagamento/detalhes?indSituacaoServidorPensionista%5B0%5D=P&indSituacaoServidorPensionista%5B1%5D=03&IDE_FLPGO_id=35&ano=2015', 'https://freigaspar.mg.gov.br/transparencia/empenhos/detalhes/2022/02/118', 'https://freigaspar.mg.gov.br/transparencia/pagamentos/detalhes/2022/02/118', 'https://freigaspar.mg.gov.br/transparencia/empenhos/exibir/2022/02/33089', 'https://freigaspar.mg.gov.br/transparencia/pagamentos/exibir/2021/12/35798', 'https://freigaspar.mg.gov.br/transparencia/empenhos/exibir/2022/05/34915', 'https://freigaspar.mg.gov.br/transparencia', 'https://freigaspar.mg.gov.br/transparencia/folhas-de-pagamento', 'https://freigaspar.mg.gov.br/transparencia/coronavirus']


Similarity matrices are then generated for each metric considered. In each matrix, the 3x3 submatrices starting at elements [0][0], [3][3] and [6][6] should ideally hold values that are close to each other and indicate high similarity, a sign that the same collector can be used for those pages.

In [85]:
RED   = "\033[1;31m"  
BLUE  = "\033[1;34m"
CYAN  = "\033[1;36m"
GREEN = "\033[0;32m"
RESET = "\033[0;0m" 

In [86]:
def gerarResultados(metrica, dados):
    """Print the pairwise matrix of `metrica` over `dados`, colored by group."""
    
    similares = 0
    nao_similares = 0
    
    for i in range(len(dados)):
        for j in range(len(dados)):
            resultado = metrica(dados[i], dados[j])
            string = f"{resultado:.2f}"
            if i < 3 and j < 3:                # paginas_base block
                print(GREEN + string, end=" ")
                similares += resultado
            elif 3 <= i < 6 and 3 <= j < 6:    # coletor_1 block
                print(BLUE + string, end=" ")
                similares += resultado
            elif 6 <= i < 9 and 6 <= j < 9:    # coletor_2 block
                print(CYAN + string, end=" ")
                similares += resultado
            else:                              # different groups, or paginas_extras
                print(RED + string, end=" ")
                nao_similares += resultado
        print("\n")
        
    # Three 3x3 blocks give 27 in-group cells (diagonal included);
    # the remaining 144 - 27 = 117 cells of the 12x12 matrix are cross-group.
    print(RESET + "Mean desired similarity = " + f"{(similares/27):.2f}")
    print("Mean undesired similarity = " + f"{(nao_similares/117):.2f}")
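
The two averages printed by gerarResultados can also be computed from a matrix and an explicit list of group labels, which avoids hardcoding the index ranges and the 27 and 117 denominators. A minimal sketch, assuming the matrix is a list of lists and that a label of None marks pages that belong to no group (as in paginas_extras); `mediasPorGrupo` is a hypothetical name and the small matrix holds made-up values for illustration only.

```python
def mediasPorGrupo(matriz, grupos):
    """Return (mean in-group value, mean cross-group value).

    grupos[i] is the group label of row/column i, or None when the page
    belongs to no group. Diagonal cells of grouped pages count as in-group,
    as they do in gerarResultados.
    """
    dentro, fora = [], []
    for i, linha in enumerate(matriz):
        for j, valor in enumerate(linha):
            mesmo_grupo = grupos[i] is not None and grupos[i] == grupos[j]
            (dentro if mesmo_grupo else fora).append(valor)

    def media(valores):
        return sum(valores) / len(valores) if valores else 0.0

    return media(dentro), media(fora)


# Labels matching the order of `testes`: three groups of three, then ungrouped pages
grupos_testes = ["base"] * 3 + ["coletor_1"] * 3 + ["coletor_2"] * 3 + [None] * 3

# Illustrative 3x3 matrix (made-up values, not results from this notebook)
matriz = [[1.0, 0.8, 0.2],
          [0.8, 1.0, 0.4],
          [0.2, 0.4, 1.0]]
print(mediasPorGrupo(matriz, ["a", "a", None]))
```

With `grupos_testes`, the in-group list has 27 entries and the cross-group list 117, matching the denominators used above.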

## String similarity using Jellyfish

In [78]:
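def extrairHtml(url):
    """Return the page's tag skeleton: no text, no attributes, no whitespace."""

    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")

        for elm in soup.find_all():
            # Leaf elements (no child tags) have their text blanked
            if not elm.find(recursive=False):
                elm.string = ''
            # Attributes are dropped so only tag names remain
            elm.attrs = {}

        html = str(soup.prettify()).replace("\n", "")

        return html.replace(" ", "")

    except Exception:
        print("Error while trying to open the page: " + url)
        return None

    finally:
        # quit() ends the browser session even when loading the page fails
        driver.quit()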

In [6]:
extrairHtml("https://freigaspar.mg.gov.br/transparencia/coronavirus")

'<html><head><!--|METAS|--><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><!--|OGTAGS|--><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><meta></meta><link></link><title></title><style></style><!--|CSS|--><link></link><link></link><link></link><link></link><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style><style></style></head><body><div><div><div><img></img><img></img></div><div><div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div><div></div></div></div></div><!--|NAVBAR|--><nav><div><a><img></img></a><a><i></i><span></span></a></div><button><i></i></button

In [79]:
testes_1 = [extrairHtml(teste) for teste in testes]

In [87]:
gerarResultados(jellyfish.jaro_distance, testes_1)

1.00 1.00 0.91 0.85 0.90 0.88 0.77 0.77 0.80 0.76 0.75 0.74

1.00 1.00 0.91 0.85 0.90 0.88 0.77 0.77 0.80 0.76 0.75 0.74

0.91 0.91 1.00 0.84 0.90 0.89 0.77 0.77 0.80 0.76 0.75 0.74

0.85 0.85 0.84 1.00 0.86 0.85 0.85 0.85 0.80 0.84 0.84 0.83

0.90 0.90 0.90 0.86 1.00 0.92 0.78 0.78 0.80 0.77 0.76 0.75

0.88 0.88 0.89 0.85 0.92 1.00 0.76 0.76 0.78 0.76 0.75 0.74

0.77 0.77 0.77 0.85 0.78 0.76 1.00 0.96 0.82 0.92 …

In [88]:
gerarResultados(jellyfish.hamming_distance, testes_1)

0.00 0.00 2313.00 2329.00 2119.00 2202.00 3837.00 3843.00 3824.00 3832.00 3891.00 3882.00

0.00 0.00 2313.00 2329.00 2119.00 2202.00 3837.00 3843.00 3824.00 3832.00 3891.00 3882.00

2313.00 2313.00 0.00 2375.00 2289.00 2317.00 3856.00 3860.00 3828.00 3854.00 3874.00 3905.00

2329.00 2329.00 2375.00 0.00 2029.00 2289.00 2483.00 2495.00 3219.00 2481.00 2529.00 2544.00

2119.00 2119.00 2289.00 2029.00 0.00 2065.00 3616.00 3608.00 3591.00 3606.00 3662.00 3655.00

2202.00 2202.00 2317.00 2289.00 2065.00 0.00 3884.00 38…
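
Unlike the Jaro scores, the Hamming values above are raw counts of differing positions, so they are not on the 0 to 1 scale that the averages in gerarResultados assume, and lower means more similar. A minimal pure-Python sketch of one way to rescale them follows; `hammingSimilarity` is a hypothetical helper, and it assumes that extra characters in the longer string all count as mismatches, which is believed (not verified here) to match how jellyfish.hamming_distance treats strings of different lengths.

```python
def hammingSimilarity(a, b):
    """Map a Hamming distance to a similarity in [0, 1] (1 means identical)."""
    maior = max(len(a), len(b))
    if maior == 0:
        return 1.0
    # Positions that differ within the common length
    distancia = sum(ca != cb for ca, cb in zip(a, b))
    # Assumption: extra characters of the longer string all count as mismatches
    distancia += abs(len(a) - len(b))
    return 1 - distancia / maior


print(f"{hammingSimilarity('<div><a></a></div>', '<div><p></p></div>'):.2f}")
```

Because it takes two strings and returns a float, the helper could be passed to gerarResultados in place of jellyfish.hamming_distance.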

In [90]:
gerarResultados(jellyfish.jaro_winkler, testes_1)

1.00 1.00 0.95 0.91 0.94 0.93 0.86 0.86 0.88 0.86 0.85 0.85

1.00 1.00 0.95 0.91 0.94 0.93 0.86 0.86 0.88 0.86 0.85 0.85

0.95 0.95 1.00 0.91 0.94 0.93 0.86 0.86 0.88 0.86 0.85 0.84

0.91 0.91 0.91 1.00 0.92 0.91 0.91 0.91 0.88 0.91 0.90 0.90

0.94 0.94 0.94 0.92 1.00 0.95 0.87 0.87 0.88 0.86 0.86 0.85

0.93 0.93 0.93 0.91 0.95 1.00 0.86 0.86 0.87 0.86 0.85 0.84

0.86 0.86 0.86 0.91 0.87 0.86 1.00 0.98 0.89 0.95 …

## HTML similarity

In [91]:
gerarResultados(structural_similarity, testes_1)

1.00 1.00 0.99 0.74 0.69 0.67 0.66 0.67 0.61 0.62 0.66 0.68

1.00 1.00 0.99 0.74 0.69 0.67 0.66 0.67 0.61 0.62 0.66 0.68

0.99 0.99 1.00 0.75 0.69 0.67 0.66 0.67 0.61 0.62 0.66 0.68

0.74 0.74 0.75 1.00 0.76 0.73 0.81 0.82 0.73 0.77 0.85 0.87

0.69 0.69 0.69 0.76 1.00 0.76 0.65 0.65 0.60 0.62 0.67 0.69

0.67 0.67 0.67 0.73 0.76 1.00 0.62 0.63 0.57 0.59 0.64 0.66

0.67 0.67 0.67 0.83 0.67 0.64 1.00 0.99 0.90 0.81 …

In [64]:
gerarResultados(similarity, testes_1)

1.00 1.00 1.00 0.87 0.85 0.83 0.83 0.83 0.83 0.81 0.83 0.84

1.00 1.00 1.00 0.87 0.85 0.83 0.83 0.83 0.83 0.81 0.83 0.84

1.00 1.00 1.00 0.87 0.85 0.83 0.83 0.83 0.83 0.81 0.83 0.84

0.87 0.87 0.87 1.00 0.88 0.86 0.91 0.91 0.91 0.89 0.92 0.93

0.85 0.85 0.85 0.88 1.00 0.88 0.83 0.83 0.83 0.81 0.84 0.84

0.83 0.83 0.83 0.86 0.88 1.00 0.81 0.81 0.81 0.80 0.82 0.83

0.84 0.84 0.84 0.91 0.83 0.82 1.00 1.00 1.00 0.91 …

In [92]:
def extrairHtmlCompleto(url):
    """Return the full rendered HTML of the page, text and attributes included."""

    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        return str(soup)

    except Exception:
        print("Error while trying to open the page: " + url)
        return None

    finally:
        # quit() ends the browser session even when loading the page fails
        driver.quit()

In [66]:
testes_2 = [extrairHtmlCompleto(teste) for teste in testes]

In [93]:
gerarResultados(structural_similarity, testes_2)

1.00 1.00 0.99 0.74 0.69 0.67 0.66 0.67 0.66 0.62 0.66 0.68

1.00 1.00 0.99 0.74 0.69 0.67 0.66 0.67 0.66 0.62 0.66 0.68

0.99 0.99 1.00 0.75 0.69 0.67 0.66 0.67 0.66 0.62 0.66 0.68

0.74 0.74 0.75 1.00 0.76 0.73 0.81 0.82 0.81 0.77 0.85 0.87

0.69 0.69 0.69 0.76 1.00 0.76 0.65 0.65 0.65 0.62 0.67 0.69

0.67 0.67 0.67 0.73 0.76 1.00 0.62 0.63 0.62 0.59 0.64 0.66

0.67 0.67 0.67 0.83 0.67 0.64 1.00 0.99 1.00 0.81 …

In [68]:
gerarResultados(style_similarity, testes_2)

1.00 1.00 0.99 0.76 0.88 0.88 0.53 0.53 0.53 0.51 0.54 0.56

1.00 1.00 0.99 0.76 0.88 0.88 0.53 0.53 0.53 0.51 0.54 0.56

0.99 0.99 1.00 0.75 0.87 0.88 0.52 0.52 0.52 0.51 0.54 0.55

0.76 0.76 0.75 1.00 0.85 0.86 0.65 0.65 0.65 0.66 0.70 0.72

0.88 0.88 0.87 0.85 1.00 0.99 0.57 0.57 0.57 0.57 0.61 0.63

0.88 0.88 0.88 0.86 0.99 1.00 0.58 0.58 0.58 0.58 0.61 0.63

0.53 0.53 0.52 0.65 0.57 0.58 1.00 0.93 1.00 0.76 …

In [94]:
gerarResultados(similarity, testes_2)

1.00 1.00 0.99 0.75 0.78 0.78 0.59 0.60 0.59 0.57 0.60 0.62

1.00 1.00 0.99 0.75 0.78 0.78 0.59 0.60 0.59 0.57 0.60 0.62

0.99 0.99 1.00 0.75 0.78 0.77 0.59 0.60 0.59 0.56 0.60 0.62

0.75 0.75 0.75 1.00 0.81 0.79 0.73 0.73 0.73 0.71 0.77 0.79

0.78 0.78 0.78 0.81 1.00 0.88 0.61 0.61 0.61 0.60 0.64 0.66

0.78 0.78 0.77 0.79 0.88 1.00 0.60 0.60 0.60 0.59 0.63 0.64

0.60 0.60 0.60 0.74 0.62 0.61 1.00 0.96 1.00 0.79 …

## Button extraction??

In [70]:
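def extrairBotoes(url):
    """Return every <button> of the static HTML (no JavaScript is executed)."""

    # A timeout keeps an unresponsive server from hanging the cell
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.text, 'html.parser')

    # One button per line, in document order
    buttons = soup.find_all('button')
    return "".join(str(button) + "\n" for button in buttons)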

In [71]:
testes_3 = [extrairBotoes(teste) for teste in testes]

In [95]:
gerarResultados(similarity, testes_3)

1.00 1.00 0.90 0.61 0.61 0.61 0.61 0.61 0.61 0.61 0.61 0.61

1.00 1.00 0.90 0.61 0.61 0.61 0.61 0.61 0.61 0.61 0.61 0.61

0.90 0.90 1.00 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55

0.61 0.61 0.55 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

0.61 0.61 0.55 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

0.61 0.61 0.55 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

0.61 0.61 0.55 1.00 1.00 1.00 1.00 1.00 1.00 1.00 …

In [74]:
def extrairBotoesSelenium(url):

    resultados = ""

    driver = webdriver.Chrome(options=options)
    driver.get(url)

    html = driver.page_source
    driver.quit()

    soup = BeautifulSoup(html, 'html.parser')
    buttons = soup.find_all('button')
    for button in buttons:
        resultados+=str(button)+"\n"
        
    return resultados

In [75]:
testes_4 = [extrairBotoesSelenium(teste) for teste in testes]

In [96]:
gerarResultados(similarity, testes_4)

1.00 1.00 0.95 0.72 0.84 0.84 0.31 0.31 0.20 0.31 0.31 0.31

1.00 1.00 0.95 0.72 0.84 0.84 0.31 0.31 0.20 0.31 0.31 0.31

0.95 0.95 1.00 0.69 0.79 0.79 0.29 0.29 0.19 0.29 0.29 0.29

0.72 0.72 0.69 1.00 0.86 0.86 0.44 0.44 0.24 0.44 0.44 0.44

0.84 0.84 0.79 0.86 1.00 1.00 0.37 0.37 0.22 0.37 0.37 0.37

0.84 0.84 0.79 0.86 1.00 1.00 0.37 0.37 0.22 0.37 0.37 0.37

0.31 0.31 0.29 0.44 0.37 0.37 1.00 1.00 0.23 1.00 …

## Using the buttons as the metric produced a high level of difference between pages of different groups, but the desired similarity was quite low in some cases. What if we mix this metric with one that has a high mean desired similarity?

In [104]:
similares = 0
nao_similares = 0

for i in range(len(testes_4)):
    for j in range(len(testes_4)):
        # Average of button similarity (testes_4) and full-HTML similarity (testes_2)
        resultado_1 = similarity(testes_4[i], testes_4[j])
        resultado_2 = similarity(testes_2[i], testes_2[j])
        resultado = (resultado_1 + resultado_2) / 2
        string = f"{resultado:.2f}"
        if i < 3 and j < 3:                # paginas_base block
            print(GREEN + string, end=" ")
            similares += resultado
        elif 3 <= i < 6 and 3 <= j < 6:    # coletor_1 block
            print(BLUE + string, end=" ")
            similares += resultado
        elif 6 <= i < 9 and 6 <= j < 9:    # coletor_2 block
            print(CYAN + string, end=" ")
            similares += resultado
        else:                              # different groups, or paginas_extras
            print(RED + string, end=" ")
            nao_similares += resultado
    print("\n")

# 27 in-group cells and 117 cross-group cells, as in gerarResultados
print(RESET + "Mean desired similarity = " + f"{(similares/27):.2f}")
print("Mean undesired similarity = " + f"{(nao_similares/117):.2f}")

1.00 1.00 0.97 0.74 0.81 0.81 0.45 0.45 0.40 0.44 0.45 0.46

1.00 1.00 0.97 0.74 0.81 0.81 0.45 0.45 0.40 0.44 0.45 0.46

0.97 0.97 1.00 0.72 0.79 0.78 0.44 0.44 0.39 0.43 0.44 0.45

0.74 0.74 0.72 1.00 0.83 0.83 0.58 0.58 0.48 0.58 0.60 0.62

0.81 0.81 0.79 0.83 1.00 0.94 0.49 0.49 0.42 0.48 0.51 0.51

0.81 0.81 0.78 0.83 0.94 1.00 0.49 0.49 0.41 0.48 0.50 0.51

0.45 0.45 0.44 0.59 0.50 0.49 1.00 0.98 0.61 0.89 …
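
The opening hypothesis was that pages could be clustered once a suitable similarity is available. One simple way to turn a similarity matrix into groups is to link every pair that reaches a threshold and take the connected components. The sketch below assumes this approach; the notebook itself does not cluster, `agruparPorLimiar` is a hypothetical name, the 0.9 threshold is an arbitrary choice, and the matrix holds made-up values for illustration only.

```python
def agruparPorLimiar(matriz, limiar=0.9):
    """Group indices whose pairwise similarity reaches `limiar` (connected components)."""
    n = len(matriz)
    pai = list(range(n))  # union-find parent pointers

    def raiz(x):
        while pai[x] != x:
            pai[x] = pai[pai[x]]  # path halving keeps the trees shallow
            x = pai[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if matriz[i][j] >= limiar:
                pai[raiz(i)] = raiz(j)

    grupos = {}
    for i in range(n):
        grupos.setdefault(raiz(i), []).append(i)
    return sorted(grupos.values())


# Illustrative matrix (made-up values, not results from this notebook)
matriz = [[1.00, 0.95, 0.40, 0.45],
          [0.95, 1.00, 0.42, 0.44],
          [0.40, 0.42, 1.00, 0.97],
          [0.45, 0.44, 0.97, 1.00]]
print(agruparPorLimiar(matriz))  # [[0, 1], [2, 3]]
```

Connected components can chain: two dissimilar pages end up in the same group if a third page is similar to both, so the threshold would need tuning against the known groups.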