### Problem Statement

Given a data set of news articles on various topics, each labeled with a cluster, we build a solution that groups similar articles together based on their contextual similarity.

### Importing Dependencies

In [37]:
import re  # used by clean_data below
import pandas as pd
import numpy as np
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import seaborn as sns
import urllib.request  # urlopen lives in the urllib.request submodule
import json
import nltk
from nltk.tokenize import sent_tokenize

### Data Loader

In [22]:
# Download the JSON file and convert it into a pandas DataFrame
def load_convert_data(url):
    # Use a separate name for the response so the url argument is not shadowed
    with urllib.request.urlopen(url) as response:
        records = json.loads(response.read().decode('utf-8'))
    return pd.DataFrame.from_dict(records)

data = load_convert_data("https://storage.googleapis.com/public-resources/dataset/clusters.json")
# checking data features
print(data.columns)
# checking number of data points
print(data.shape)
# checking distinct clusters and their numbers (paired row-wise so names and ids stay aligned)
cluster_names = list(data[['cluster_name', 'cluster']].drop_duplicates().itertuples(index=False, name=None))
print(cluster_names[:13])

Index(['id', 'text', 'title', 'lang', 'date', 'cluster', 'cluster_name'], dtype='object')
(181, 7)
[('MS fails to respond', '0'), ('Anti-Russia', '1'), ('Claims about China', '2'), ('Collapse', '3'), ('Coronavirus is not serious', '4'), ('Cure', '5'), ('EU fails to respond', '6'), ('Miscellaneous', '7'), ('Origins', '8'), ('Properties', '9'), ('Was predicted', '10'), ('Secret plan of the global elite', '11'), ('Ukraine fails to respond', '12')]


### Data Preprocessing and Cleaning

In [23]:
def clean_data(sentence):
    # Strip a fixed set of symbols; letters, digits, periods, and commas are kept
    return re.sub(r'[!*)@#%(&$_^]', '', sentence)
data['text'] = data['text'].apply(clean_data)
data['title'] = data['title'].apply(clean_data)
print(data['title'][0])
print(data['text'][0])

Lithuania: No Physicians, No Food Stock - Strategic Culture Fund
The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.
At the very beginning of the ХХl century, the optimization of medicine and health care was carried out in Lithuania, as a result of which the number of medical institutions was sharply reduced - all small ones were closed and only large ones were left. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance.
If there are few hospitals, then there are few doctors - and today this problem is one of the main ones, we have to involve senior students of medical universities in the fight against the epidemic, but all the same, specialists are sorely lacking.
The Minister of Defense of Lithuania Raimundas Karoblis has already promis
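
To see what the character class in `clean_data` actually strips, here is a small sketch that applies the same pattern to a made-up string. The sample text and the name `strip_symbols` are assumptions for illustration and are not taken from the data set.

```python
import re

# Same character class as clean_data; the sample string below is invented.
PATTERN = re.compile(r'[!*)@#%(&$_^]')

def strip_symbols(sentence: str) -> str:
    """Remove the listed symbols; hyphens, periods, and digits are left alone."""
    return PATTERN.sub('', sentence)

sample = "Breaking!!! (COVID-19) costs rose 5% #news @who_int"
cleaned = strip_symbols(sample)
print(cleaned)
```

Note that the pattern also removes `%` and `_`, so "5%" becomes "5" and "who_int" becomes "whoint". If such tokens matter for similarity, the character class may need narrowing.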

### Text Summarization

In [None]:
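from summarizer import Summarizer
# Named summarizer_model so that the later `model = hub.load(...)` does not overwrite it
summarizer_model = Summarizer()

In [25]:
def bert_summarizer(news_article):
    # Extractive summary of a single article
    summarized = summarizer_model(news_article, min_length=30)
    return summarized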
original_news = data['text'][0]
summarized_news = bert_summarizer(original_news)
summarized_news

'The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance. The Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but would anyone really want him to be treated not by a doctor, but, for example, by an artilleryman? This meant that companies, receiving money in advance, pledged in an extreme situation to supply everything necessary for the needs of the population.'

In [None]:
data['new_summarized'] = data['text'].apply(bert_summarizer)

In [27]:
print(data.columns)
print(data['new_summarized'][0])

Index(['id', 'text', 'title', 'lang', 'date', 'cluster', 'cluster_name',
       'new_summarized'],
      dtype='object')
The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance. The Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but would anyone really want him to be treated not by a doctor, but, for example, by an artilleryman? This meant that companies, receiving money in advance, pledged in an extreme situation to supply everything necessary for the needs of the population.


### Contextual Clustering

In [33]:
def sentence_tokens(article):
    return sent_tokenize(article)
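# Run this cell only once: on a re-run, astype(str) stringifies the existing lists and
# sent_tokenize splits that repr, leaving brackets and quotes in the tokens (as in the output below).
data['new_summarized'] = data['new_summarized'].astype(str).apply(sentence_tokens)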
print(data['new_summarized'][0])

["['The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.", "', 'Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius.", "', 'In the districts, something like paramedic points remained for emergency assistance.", "', 'The Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but would anyone really want him to be treated not by a doctor, but, for example, by an artilleryman?", "', 'This meant that companies, receiving money in advance, pledged in an extreme situation to supply everything necessary for the needs of the population.']"]
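
The stray `['` and `', '` fragments in the output above are what a list looks like after it has been converted to a string. One way to guard against this is to make the tokenization step idempotent. The sketch below assumes a simple regex splitter (`naive_sent_split`, a hypothetical stand-in for `sent_tokenize` so the example runs without NLTK data) and a hypothetical wrapper `to_sentences`.

```python
import re

def naive_sent_split(text: str) -> list:
    """Hypothetical stand-in for nltk's sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def to_sentences(value):
    """Tokenize a string; return an already tokenized list unchanged."""
    if isinstance(value, list):
        return value
    return naive_sent_split(str(value))

summary = "Hospitals are few. Doctors are scarce. Help is coming."
once = to_sentences(summary)
twice = to_sentences(once)              # re-running is harmless
garbled = naive_sent_split(str(once))   # str() of a list drags in brackets and quotes

print(once)
print(garbled)
```

With a guard like this, applying the function through `data['new_summarized'].apply(...)` a second time would leave the column as it is instead of tokenizing the list's string representation.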


In [38]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
model = hub.load(module_url)
print ("module %s loaded" % module_url)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [42]:
def embed(sentences):
    # Returns one embedding vector per input sentence
    return model(sentences)
embeddings = []
for index, row in data.iterrows():
    news_emb = embed(row['new_summarized'])
    embeddings.append(news_emb.numpy())
    print(f'Obtain embedding of news {index+1}: {news_emb.numpy()}')

Obtain embedding of news 1: [[-0.05760741  0.00826574 -0.03200125 ...  0.02968193 -0.0036256
  -0.07033622]
 [-0.06308062  0.05199011  0.03511423 ... -0.03040699 -0.00728911
  -0.06726194]
 [-0.02893057  0.07098991 -0.04916837 ... -0.00700544 -0.02638289
   0.03984028]
 [-0.00077114 -0.03427493  0.04885418 ...  0.00221819 -0.02126683
  -0.0259736 ]
 [ 0.05516692 -0.08770609 -0.0058286  ... -0.04245264 -0.0042739
   0.07119424]]
Obtain embedding of news 2: [[ 0.01996352  0.03231549 -0.03987238 ...  0.03287501  0.02071153
  -0.09245825]
 [-0.02223844  0.0460071  -0.0715671  ...  0.00163913 -0.00470258
  -0.07795639]
 [ 0.04623949 -0.07806639 -0.06892981 ... -0.00627649 -0.06987774
  -0.0616219 ]
 ...
 [ 0.02578483  0.07967581 -0.06156636 ...  0.05955523 -0.00783031
   0.0223297 ]
 [ 0.06408015 -0.10791706 -0.01103734 ... -0.05439744 -0.03442081
   0.02399206]
 [ 0.07327369 -0.10620847  0.04003875 ... -0.04776834 -0.04433813
   0.07171834]]
Obtain embedding of news 3: [[-0.05047249 -0.013

Obtain embedding of news 98: [[ 0.00155185 -0.00550415 -0.03307436 ...  0.04971029 -0.02758695
  -0.06254242]
 [ 0.02894479 -0.07875541  0.03215922 ...  0.00262967 -0.00268277
  -0.0017538 ]
 [-0.00550923 -0.07450344  0.03734345 ... -0.01283344  0.01550711
  -0.01495657]
 ...
 [ 0.04586231 -0.03808804 -0.03328226 ...  0.03789429  0.02123699
  -0.04839479]
 [-0.00953209 -0.04441155  0.00136384 ... -0.01015722 -0.06655225
  -0.05491805]
 [ 0.04608577 -0.02507609 -0.06688628 ...  0.00278284 -0.04465536
   0.00181722]]
Obtain embedding of news 99: [[ 0.05762394  0.00394561 -0.07885604 ... -0.00181148 -0.05949799
  -0.09265035]
 [ 0.005181    0.0096109  -0.03231804 ... -0.00915689 -0.05979423
  -0.07393928]
 [-0.02433358 -0.04618163 -0.03184826 ...  0.00348379 -0.07573432
  -0.07659053]
 [ 0.02180123  0.02513633  0.05508893 ...  0.00073637  0.10361638
   0.02739098]]
Obtain embedding of news 100: [[ 0.07156038 -0.02977349 -0.04041868 ...  0.04159512  0.01547913
   0.04755787]
 [-0.01846815 

Obtain embedding of news 181: [[ 0.03904901 -0.04254241 -0.04478291 ...  0.03300881 -0.07601045
  -0.08094712]
 [ 0.05089664  0.01145332 -0.0148047  ...  0.02621235  0.00651448
  -0.06022458]
 [ 0.07986785  0.00471711  0.02968196 ...  0.02676776  0.01470877
   0.06059643]
 [-0.04794929 -0.01784362  0.02874328 ...  0.03655949  0.00168337
  -0.0667128 ]
 [ 0.0095497  -0.0489671   0.01170866 ... -0.01020362 -0.07349624
  -0.07244444]]


In [48]:
data['new_embeddings'] = embeddings
data.new_embeddings[0]

array([[-0.05760741,  0.00826574, -0.03200125, ...,  0.02968193,
        -0.0036256 , -0.07033622],
       [-0.06308062,  0.05199011,  0.03511423, ..., -0.03040699,
        -0.00728911, -0.06726194],
       [-0.02893057,  0.07098991, -0.04916837, ..., -0.00700544,
        -0.02638289,  0.03984028],
       [-0.00077114, -0.03427493,  0.04885418, ...,  0.00221819,
        -0.02126683, -0.0259736 ],
       [ 0.05516692, -0.08770609, -0.0058286 , ..., -0.04245264,
        -0.0042739 ,  0.07119424]], dtype=float32)

### Finding the Cluster of a New News Article

In [59]:
from sklearn.metrics.pairwise import cosine_similarity

class GetCluster:
    """Find the stored articles, and hence the cluster, most similar to a new article."""

    def __init__(self, news):
        self.news = news
        self.clean_news = None
        self.news_summary = None
        self.sent_tokens = None
        self.news_embeddings = None

    def _cleannews(self):
        self.clean_news = clean_data(self.news)

    def _get_tokens(self):
        self.sent_tokens = sentence_tokens(self.clean_news)
        # Summarize long articles (more than 10 sentences) before embedding
        if len(self.sent_tokens) > 10:
            self.news_summary = bert_summarizer(self.clean_news)
            self.sent_tokens = sentence_tokens(self.news_summary)

    def _get_st_embeddings(self):
        # The encoder expects a list of strings, so wrap a bare string
        tokens = self.sent_tokens
        if isinstance(tokens, str):
            tokens = [tokens]
        self.news_embeddings = embed(tokens).numpy()

    def _cosine_max_compute(self, vect1):
        # Mean of the 3 highest sentence-to-sentence cosine similarities
        flatten_similarities = cosine_similarity(vect1, self.news_embeddings).flatten()
        flatten_sorted = -np.sort(-flatten_similarities)
        return flatten_sorted[:3].mean()

    def _cosine_similarity(self):
        # Score each stored article's sentence embeddings against the new article
        data['similarities'] = data['new_embeddings'].apply(self._cosine_max_compute)

    def _get_sort_best(self, N=1):
        df = data.sort_values('similarities', ascending=False)
        # The cluster of the best-scoring article is taken as the prediction
        cluster_nom = df['cluster_name'].iloc[0]
        print()
        print(f"RESULT: The NEW NEWS article belongs to cluster :- {cluster_nom}")
        print()
        return df[['text', 'title']].head(N)

    def get_N_similar_news(self, n=1):
        self._cleannews()
        self._get_tokens()
        self._get_st_embeddings()
        self._cosine_similarity()
        n_similar_news = self._get_sort_best(N=n)
        return n_similar_news
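
The score computed in `_cosine_max_compute` is the mean of the three largest pairwise cosine similarities between the sentences of two articles. The dependency-free sketch below illustrates that idea on invented 2-D vectors. The function names and toy embeddings are assumptions for illustration; the class itself relies on scikit-learn and NumPy.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k_mean_similarity(article_a, article_b, k=3):
    """Mean of the k highest sentence-to-sentence cosine similarities."""
    scores = sorted((cosine(u, v) for u in article_a for v in article_b), reverse=True)
    top = scores[:k]
    return sum(top) / len(top) if top else 0.0

# Toy "sentence embeddings": one row per sentence (invented values).
query = [[1.0, 0.0], [0.0, 1.0]]
close = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
far = [[-1.0, 0.0], [0.0, -1.0]]

score_close = top_k_mean_similarity(query, close)
score_far = top_k_mean_similarity(query, far)
print(score_close, score_far)
```

Averaging only the top few pairs rewards articles that share a handful of strongly matching sentences, even when the remaining sentences are unrelated. A larger `k` would move the score closer to an overall average.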

In [None]:
new_news = """
WHO is following up with Chinese authorities about a cluster of COVID-19 cases in Beijing, People’s Republic of China.
Today, officials from the National Health Commission and Beijing Health Commission briefed WHO’s China country office, to share details of preliminary investigations ongoing in Beijing.  
As of 13 June, 41 symptomatic laboratory confirmed cases and 46 laboratory confirmed cases without symptoms of COVID-19 have been identified in Beijing.
The first identified case had symptom onset on 9 June, and was confirmed on 11 June.  Several of the initial cases were identified through six fever clinics in Beijing.  Preliminary investigations revealed that some of the initial symptomatic cases had a link to the Xinfadi Market in Beijing.  Preliminary laboratory investigations of throat swabs from humans and environmental samples from Xinfadi Market identified 45 positive human samples (all without symptoms at the time of reporting) and 40 positive environmental samples.  One additional case without symptoms was identified as a close contact of a confirmed case.
"""
n = 2
get_cluster = GetCluster(new_news)
results = get_cluster.get_N_similar_news(n)