<a href="https://colab.research.google.com/github/KakashiHataki-lab/github2/blob/main/Copy_of_CE_1_4_por.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Case Study 1.4: Spectral Clustering - Clustering News Articles

---
<br>

This case study takes a dataset of press articles on different topics and uses _spectral clustering_ to group them according to how often certain words appear. This notebook provides the code that generates the dataset.

This case study uses the [`mitie`](https://github.com/mit-nlp/MITIE) library, developed at MIT. All the steps needed to install the library and the NER model used in this case study can be found in the online documentation.

<br>

---

Notebook setup:

* First, download the MITIE library from its GitHub repository, install it in the runtime environment, and download its main *NLP* models, including the `NER` model used in this case study.

* Then, import the remaining libraries and load the `NER` model into a variable so that it can be used in the study.

In [1]:
!pip3 install git+https://github.com/mit-nlp/MITIE.git
!wget https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2.tar.bz2
!tar jxf MITIE-models-v0.2.tar.bz2

print('MITIE installed and models downloaded successfully!')

Collecting git+https://github.com/mit-nlp/MITIE.git
  Cloning https://github.com/mit-nlp/MITIE.git to /tmp/pip-req-build-e31ef9u5
  Running command git clone -q https://github.com/mit-nlp/MITIE.git /tmp/pip-req-build-e31ef9u5
Building wheels for collected packages: mitie
  Building wheel for mitie (setup.py) ... done
  Created wheel for mitie: filename=mitie-0.7.0-cp37-none-any.whl size=418688 sha256=a0a0779c972c56830beae6ddcf20c4b1e03e7bda94fa189da21f99acf95c2728
  Stored in directory: /tmp/pip-ephem-wheel-cache-45597w4s/wheels/b4/c1/21/8e7e7e14cf3211bf5c73aad0b1d76d1186fbf681f4b9ef6c06
Successfully built mitie
Installing collected packages: mitie
Successfully installed mitie-0.7.0
--2021-06-02 00:23:47--  https://github.com/mit-nlp/MITIE/releases/download/v0.4/MITIE-models-v0.2.tar.bz2
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://g

In [2]:
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import csv

# Machine learning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cluster

# Web scraping
from bs4 import BeautifulSoup

# NLP
from mitie import *
print('Libraries imported successfully!\n')
print("Loading the NER model...")
ner = named_entity_extractor('MITIE-models/english/ner_model.dat')
print("\nOutput tags of the NER model:", ner.get_possible_ner_tags())

Libraries imported successfully!

Loading the NER model...

Output tags of the NER model: ['PERSON', 'LOCATION', 'ORGANIZATION', 'MISC']


# Generating the dataset (web scraping)

In this example, articles were collected from several sections (topics) of the British newspaper __The Guardian__; the code below selects seven of them. The steps to build the dataset are:

1. Fetch the source code of The Guardian's main page and store the links to the sections (topics) of interest.
2. Iterate over the list of links and collect the title and content of 10 articles per section.
3. Save the articles, titles and topics to `.txt` files.

In [3]:
UK_news_url = 'https://www.theguardian.com/uk'
# Download the links to the different sections (topics)
html_data = requests.get(UK_news_url).text
soup = BeautifulSoup(html_data, 'html.parser')
url_topics = [el.find('a')['href'] for el in soup.find_all(class_='subnav__item')[1:8]]
# split() + join drops the indentation and newlines around each label before spaces become underscores
topics = ['_'.join(el.text.split()) for el in soup.find_all(class_='subnav-link')[1:8]]
for i, (topic, url_topic) in enumerate(zip(topics, url_topics), start=1):
    print('Topic {}: {} ({})'.format(i, topic, url_topic))


Topic 1: World (https://www.theguardian.com/world)
Topic 2: Environment (https://www.theguardian.com/us/environment)
Topic 3: Soccer (https://www.theguardian.com/football)
Topic 4: US_Politics (https://www.theguardian.com/us-news/us-politics)
Topic 5: Business (https://www.theguardian.com/us/business)
Topic 6: Tech (https://www.theguardian.com/us/technology)
Topic 7: Science (https://www.theguardian.com/science)
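
The link-extraction step can be tried offline on a hard-coded HTML fragment. The sketch below is only an illustration: the markup and the `example.com` URLs are hypothetical and merely reuse the class names from the cell above; they are not the real Guardian page. It also shows how collapsing whitespace with `split()` keeps indentation from turning into underscores.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics a navigation bar (not the real Guardian markup).
html = """
<ul>
  <li class="subnav__item"><a class="subnav-link" href="https://example.com/world">
      World
  </a></li>
  <li class="subnav__item"><a class="subnav-link" href="https://example.com/us-politics">
      US Politics
  </a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# One link per navigation item.
links = [el.find("a")["href"] for el in soup.find_all(class_="subnav__item")]

# split() discards newlines and indentation; the remaining words are joined with "_".
names = ["_".join(el.get_text().split()) for el in soup.find_all(class_="subnav-link")]

for name, link in zip(names, links):
    print(name, link)
```
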


In [4]:
def save_to_txt(filename, content):
    '''
    Creates a new .txt file with a specific name in the Data directory
    '''
    with open(r"Data/{}.txt".format(filename), "w") as f:
        print(content, file=f)


In [5]:
import os

# Create the directory where the articles will be saved (no error if it already exists)
os.makedirs('Data', exist_ok=True)

In [None]:
article_titles = []
article_contents = []
article_topics = []
articles_per_topic = 10
n = 1

In [29]:
for topic, url_topic in zip(topics, url_topics):
    # Collect the article links listed on the section's front page
    soup = BeautifulSoup(requests.get(url_topic).text, 'html.parser')
    url_articles = [el.find('a')['href'] for el in soup.find_all(class_='fc-item__content')]
    print('\n{}:'.format(topic))
    for url_article in url_articles:
        # Stop once enough articles have been saved for this topic
        if article_topics.count(topic) >= articles_per_topic:
            break
        try:
            soup = BeautifulSoup(requests.get(url_article).text, 'html.parser')
            title = soup.find(class_='content__headline').text.strip()
        except (requests.RequestException, AttributeError):
            # The request failed or the page has no standard headline: skip it
            continue
        # Join every <p> on the page; this also picks up bylines and timestamps
        content = ' '.join(el.text for el in soup.find_all('p'))
        if title not in article_titles:
            article_titles.append(title)
            article_contents.append(content)
            article_topics.append(topic)
            save_to_txt('title-{}'.format(n), title)
            save_to_txt('article-{}'.format(n), content)
            save_to_txt('topic-{}'.format(n), topic)
            print(title)
            n += 1
            if len(article_titles) % 10 == 0:
                print('Article count: {}'.format(len(article_titles)))
    if article_topics.count(topic) < articles_per_topic:
        print('Only {} articles found in "{}"'.format(article_topics.count(topic), topic))

df = pd.DataFrame({'topic': article_topics, 'title': article_titles, 'content': article_contents})

# Importing the dataset

After saving the dataset to the desired folder, we can use the case study code to import the information.

In [30]:
# total number of articles to process
N = df.shape[0]
# lists that store the topics, titles and contents of the articles
topics_array = []
titles_array = []
corpus = []
for i in range(1, N+1):
    # read the article content
    with open('Data/article-' + str(i) + '.txt', 'r') as myfile:
        d1 = myfile.read().replace('\n', '')
        d1 = d1.lower()
        corpus.append(d1)
    # read the article's original topic
    with open('Data/topic-' + str(i) + '.txt', 'r') as myfile:
        to1 = myfile.read().replace('\n', '')
        to1 = to1.lower()
        topics_array.append(to1)
    # read the article title
    with open('Data/title-' + str(i) + '.txt', 'r') as myfile:
        ti1 = myfile.read().replace('\n', '')
        ti1 = ti1.lower()
        titles_array.append(ti1)

# Generating the features

To generate the features of each instance (article):

1. We combine the full text corpus of the articles to determine all the unique words used in the dataset.
2. We look for the subset of entities recognized by the NER model among the unique words used in the dataset (determined in step 1).

In [31]:
# list of the entities found across all articles
entity_text_array = []
for i in range(1, N+1):
    # load the text file with the article content and convert it into a list of tokens
    tokens = tokenize(load_entire_file('Data/article-' + str(i) + '.txt'))
    # extract all the entities known to the NER model that are mentioned in this article
    entities = ner.extract_entities(tokens)
    # rebuild the text of each entity from its tokens and add it to the list
    for e in entities:
        range_array = e[0]  # token indices of the entity (e[1] is the tag, e[2] the score)
        entity_text = " ".join(tokens[j].decode("utf-8") for j in range_array)
        entity_text_array.append(entity_text.lower())
# remove duplicate entities; sorting keeps the vocabulary order reproducible
entity_text_array = sorted(set(entity_text_array))

Now that we have the list of all the entities used in the dataset, we can represent each article as a vector holding the [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) score of each entity stored in `entity_text_array`. This is straightforward with Python's [scikit-learn](http://scikit-learn.org/stable/) library.

In [32]:
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=entity_text_array)
corpus_tf_idf = vect.fit_transform(corpus)
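
One detail worth checking when a fixed `vocabulary` is passed: with the default `ngram_range=(1, 1)`, scikit-learn only matches single words, so multi-word entities never receive a non-zero score. The scikit-learn documentation also states that `max_df` is ignored when a vocabulary is supplied. The sketch below illustrates the n-gram behaviour on a hypothetical mini-corpus and entity list (both invented for this example); whether widening `ngram_range` helps on the real dataset is an assumption to verify, not something shown in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus and entity list, for illustration only.
docs = ["india speeds up vaccinations",
        "sri lanka faces a disaster",
        "india and sri lanka sign a deal"]
entities = ["india", "sri lanka"]

# Default ngram_range=(1, 1): only single words are compared with the vocabulary.
unigram = TfidfVectorizer(vocabulary=entities, stop_words="english")
X_unigram = unigram.fit_transform(docs)

# ngram_range=(1, 2) also generates word pairs, so "sri lanka" can match.
bigram = TfidfVectorizer(vocabulary=entities, stop_words="english", ngram_range=(1, 2))
X_bigram = bigram.fit_transform(docs)

print("unigrams only:\n", X_unigram.toarray().round(2))
print("unigrams and bigrams:\n", X_bigram.toarray().round(2))
```

In the first matrix the `sri lanka` column stays at zero for every document; in the second it is non-zero for the two documents that mention it.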

Now that the articles are represented by their features (TF-IDF scores), we can run spectral clustering on them, again using the `scikit-learn` library.

In [33]:
# Change n_clusters to the desired number of clusters
n_clusters = 8
# Spectral clustering
spectral = cluster.SpectralClustering(n_clusters=n_clusters,
                                      eigen_solver='arpack',
                                      affinity="nearest_neighbors",
                                      n_neighbors=10)
spectral.fit(corpus_tf_idf)

SpectralClustering(affinity='nearest_neighbors', assign_labels='kmeans',
                   coef0=1, degree=3, eigen_solver='arpack', eigen_tol=0.0,
                   gamma=1.0, kernel_params=None, n_clusters=8,
                   n_components=None, n_init=10, n_jobs=None, n_neighbors=10,
                   random_state=None)
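
To make the steps behind `SpectralClustering` more concrete, here is a minimal NumPy sketch for the two-cluster case. It is a simplification, not scikit-learn's implementation: it uses a Gaussian affinity instead of the nearest-neighbors graph configured above, the unnormalized Laplacian, and the sign of the second eigenvector instead of k-means. The toy points and `sigma` are assumptions chosen for illustration.

```python
import numpy as np

# Toy data (hypothetical): two separated groups of 2-D points.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [5, 5], [5, 6], [6, 5]], dtype=float)

# 1. Affinity matrix: Gaussian similarity between every pair of points.
sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
sigma = 2.0
W = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)

# 2. Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix.
D = np.diag(W.sum(axis=1))
L = D - W

# 3. Eigen-decomposition; eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# 4. For two clusters, the sign of the second eigenvector splits the graph.
fiedler = eigenvectors[:, 1]
labels = (fiedler > 0).astype(int)
print(labels)
```

With more than two clusters, the usual approach is to keep the first `k` eigenvectors and run k-means on their rows, which corresponds to the `assign_labels='kmeans'` setting shown in the output above.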

Finally, the following lines of code display the result in the format below (one line per article):

<br>

__article no., topic, cluster, title__

In [34]:
if hasattr(spectral, 'labels_'):
    # np.int was removed from recent NumPy releases; the built-in int is equivalent here
    cluster_assignments = spectral.labels_.astype(int)
    for i in range(len(cluster_assignments)):
        print(i, topics_array[i], cluster_assignments[i], titles_array[i])

0 soccer 6 premier league 2020-21 review: the big quiz of the season
1 world 5 coronavirus live news: india aims for 10m covid jabs a day by july; who approves chinese sinovac jab — as it happened
2 world 6 australia coronavirus live: victoria records six new covid cases as lockdown extension expected, nsw on alert
3 world 4 ugandan minister speaks from hospital bed after assassination attempt – video
4 world 4 ‘when will you know?’: richard colbeck can't say how many aged care workers are vaccinated – video
5 world 7 'democracy itself is in peril': biden delivers memorial day speech – video
6 world 5 sri lanka faces environmental disaster as c

In [35]:
df['predictions'] = cluster_assignments
predictions_df = pd.get_dummies(df, columns=['predictions']).drop(['title','content'],axis=1).groupby(['topic']).sum()
predictions_df

| topic | predictions_0 | predictions_1 | predictions_2 | predictions_3 | predictions_4 | predictions_5 | predictions_6 | predictions_7 |
|---|---|---|---|---|---|---|---|---|
| Environment | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Science | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 |
| Soccer | 1 | 2 | 3 | 3 | 0 | 0 | 1 | 0 |
| Tech | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| US_Politics | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| World | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 1 |


As we can see, the algorithm does not always group the articles according to the sections they came from. You can explore the model's parameters to improve these results, or look for an explanation of the criteria the algorithm uses to group the articles.
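
When comparing parameter settings, a single number that summarizes how well the clusters line up with the sections can be handy. One simple option, not part of the original case study, is cluster purity: the fraction of articles that carry the majority topic of their cluster. The helper `cluster_purity` and the example labels below are hypothetical and only sketch the idea.

```python
from collections import Counter, defaultdict

def cluster_purity(true_labels, cluster_ids):
    """Fraction of items that carry the majority label of their cluster."""
    members = defaultdict(list)
    for label, cid in zip(true_labels, cluster_ids):
        members[cid].append(label)
    # For each cluster, count how many items share its most common label.
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(true_labels)

# Hypothetical labels, not the notebook's results.
example_topics = ["world", "world", "world", "soccer", "soccer", "tech"]
example_clusters = [0, 0, 1, 1, 1, 1]
purity = cluster_purity(example_topics, example_clusters)
print(round(purity, 3))  # 4 of 6 items match their cluster's majority topic
```

In the notebook, the same function could be called as `cluster_purity(topics_array, cluster_assignments)`. Purity rises trivially as the number of clusters grows, so it is best used to compare runs with the same `n_clusters`.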