<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Case-study-1.2.2:-Spectral-Clustering---Grouping-News-Stories" data-toc-modified-id="Case-study-1.2.2:-Spectral-Clustering---Grouping-News-Stories-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Case study 1.2.2: Spectral Clustering - Grouping News Stories</a></span></li><li><span><a href="#Database-generation-(Web-Scraping)" data-toc-modified-id="Database-generation-(Web-Scraping)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Database generation (Web Scraping)</a></span></li><li><span><a href="#Importing-database" data-toc-modified-id="Importing-database-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Importing database</a></span></li><li><span><a href="#Feature-generation" data-toc-modified-id="Feature-generation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature generation</a></span></li></ul></div>

# Case study 1.2.2: Spectral Clustering - Grouping News Stories

---
<br>

This case study considers a database of news articles covering different topics and uses _spectral clustering_ to group them according to how frequently certain words appear. The code for generating the database is provided, and a sample dataset of articles collected from the newspaper The Guardian on May 28, 2020 can be found in the folder `/Data`. This dataset was generated with _web scraping_, a data mining technique.

This case study uses the NLP library [`mitie`](https://github.com/mit-nlp/MITIE), developed at MIT. The steps for installing both the library and the NER model used in this case study can be found in the library's documentation.

<br>

---

Notebook setup:

In [1]:
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
import csv

#ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import cluster

#Web scraping libraries
from bs4 import BeautifulSoup

#NLP libraries
from mitie import *
print("loading NER model...")
#ner = named_entity_extractor('.../mitie/models/MITIE-models/english/ner_model.dat') #[PATH TO MITIE LIBRARY]
ner = named_entity_extractor('/Users/inigo/opt/anaconda3/lib/python3.7/site-packages/mitie/models/MITIE-models/english/ner_model.dat')
print("\nTags output by this NER model:", ner.get_possible_ner_tags())

## IGNORE THE CODE BELOW ##
#Getting names of imported libraries and versions for creating a requirements.txt file
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name
imports = list(set(get_imports()))

requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != "pip":
        requirements.append((m.project_name, m.version))

print('List of packages and versions:\n')      

for r in requirements:
        print("{}=={}".format(*r))

loading NER model...

Tags output by this NER model: ['PERSON', 'LOCATION', 'ORGANIZATION', 'MISC']
List of packages and versions:

scikit-learn==0.22.1
requests==2.23.0
pandas==1.0.0
numpy==1.18.1
mitie==0.7.36
matplotlib==3.1.3



# Database generation (Web Scraping)

In this example, news articles from the newspaper __The Guardian__ are collected across 8 different topics. The dataset is built in three steps:

1. Retrieving the source code from the main site of The Guardian and storing the links to different sections of interest in a list.
2. Iterating through the list of links and getting the information (title and content) for 10 articles from each topic.
3. Storing the articles, titles, and topics in `.txt` files.
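
The link extraction in steps 1 and 2 can be tried offline before any real page is requested. The sketch below runs on a made-up HTML snippet: the class name `subnav__item` mirrors the one used in this notebook, but the markup and the `example.com` URLs are hypothetical and are not real Guardian content.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page (not real Guardian markup).
sample_html = """
<ul>
  <li class="subnav__item"><a href="https://example.com/world">World</a></li>
  <li class="subnav__item"><a href="https://example.com/science">Science</a></li>
  <li class="other"><a href="https://example.com/ads">Ads</a></li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Keep only the elements carrying the navigation class, then read the
# link target and the visible text of the first <a> tag inside each one.
nav_items = sample_soup.find_all(class_='subnav__item')
sample_links = [el.find('a')['href'] for el in nav_items]
sample_names = [el.find('a').text.strip() for el in nav_items]

for name, link in zip(sample_names, sample_links):
    print(name, link)
```

Because the selectors depend on the site's current markup, checking them on a saved copy of the page like this is a cheap way to notice when a class name has changed.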

In [2]:
UK_news_url = 'https://www.theguardian.com/uk'
#Downloading the links to the different topics
html_data = requests.get(UK_news_url).text
soup = BeautifulSoup(html_data, 'html.parser')
url_topics = [el.find('a')['href'] for el in soup.find_all(class_ = 'subnav__item')[1:9]]
topics = [el.text.strip('\n').replace(' ','_') for el in soup.find_all(class_ = 'subnav-link')[1:9]]
for i in range(len(topics)):
    print('Topic {}: {} ({})'.format(i+1,topics[i],url_topics[i]))


Topic 1: Elections_2020 (https://www.theguardian.com/us-news/us-elections-2020)
Topic 2: World (https://www.theguardian.com/world)
Topic 3: Environment (https://www.theguardian.com/us/environment)
Topic 4: Soccer (https://www.theguardian.com/football)
Topic 5: US_Politics (https://www.theguardian.com/us-news/us-politics)
Topic 6: Business (https://www.theguardian.com/us/business)
Topic 7: Tech (https://www.theguardian.com/us/technology)
Topic 8: Science (https://www.theguardian.com/science)


In [3]:
def save_to_txt(filename, content):
    '''
    Creates a new .txt file with a specific name in the Data directory
    '''
    with open(r"Data/{}.txt".format(filename), "w") as f:
        print(content, file=f)

article_titles = []
article_contents = []
article_topics = []
articles_per_topic = 10
print('Getting news articles from The Guardian: \n')
n = 1
for topic, url_topic in zip(topics, url_topics):
    soup = BeautifulSoup(requests.get(url_topic).text, 'html.parser')
    url_articles = [el.find('a')['href'] for el in soup.find_all(class_ = 'fc-item__content')]
    print('\n{}:'.format(topic))
    for url_article in url_articles:
        #stop once enough articles have been collected for this topic
        if article_topics.count(topic) >= articles_per_topic:
            break
        soup = BeautifulSoup(requests.get(url_article).text, 'html.parser')
        try:
            title = soup.find(class_ = 'content__headline').text.strip('\n')
            content = ' '.join([el.text for el in soup.find(class_ = 'content__article-body from-content-api js-article__body').find_all('p')])
        except AttributeError:
            #soup.find returned None: the page does not have the expected article layout, so skip it
            continue
        if title not in article_titles:
            article_titles += [title]
            article_contents += [content]
            article_topics += [topic]
            save_to_txt('title-{}'.format(n),title)
            save_to_txt('article-{}'.format(n),content)
            save_to_txt('topic-{}'.format(n),topic)
            print('{}'.format(title))
            n += 1
            if len(article_titles) % 10 == 0:
                print('Article count: {}'.format(len(article_titles)))
    #report topics for which the section page did not yield enough articles
    if article_topics.count(topic) < articles_per_topic:
        print('Only {} articles found in "{}"'.format(article_topics.count(topic),topic))
        
                
df = pd.DataFrame({'topic':article_topics,'title':article_titles,'content':article_contents})

Getting news articles from The Guardian: 


Elections_2020:
Revealed: conservative group fighting to restrict voting tied to powerful dark money network
Republicans sense rich pickings in Biden archive – but will it be made public?
'Feels good to be out of my house': Biden lays Memorial Day wreath in Delaware
'You have to respond forcefully': can Joe Biden fight Trump's brutal tactics?
Why is Trump so restrained about the Biden sexual assault allegation?
Swing states become partisan battlegrounds in America's fight against Covid-19
Trump campaign focuses fire on Biden as pandemic undermines strategy
‘The United States is broken as hell’ – the division in politics over race and class
Socialism used to be a dirty word. Is America now ready to embrace it?
Article count: 10

World:
Global report: South Korea postpones school reopening due to new outbreak
'Gross incompentence at highest levels': ex-Obama adviser blasts Trump's Covid response
'Demand is huge': EU citizens flock to open-air c

# Importing database

Once the database is stored in the `Data` directory, we can use the code provided in the case study to import the information.

In [4]:
#total number of articles to process
N = 80
#in memory stores for the topics, titles and contents of the news stories
topics_array = []
titles_array = []
corpus = []
for i in range(1, N+1):
    #get the contents of the article
    with open('Data/article-' + str(i) + '.txt', 'r') as myfile:
        d1=myfile.read().replace('\n', '')
        d1 = d1.lower()
        corpus.append(d1)
    #get the original topic of the article
    with open('Data/topic-' + str(i) + '.txt', 'r') as myfile:
        to1=myfile.read().replace('\n', '')
        to1 = to1.lower()
        topics_array.append(to1)
    #get the title of the article
    with open('Data/title-' + str(i) + '.txt', 'r') as myfile:
        ti1=myfile.read().replace('\n', '')
        ti1 = ti1.lower()
        titles_array.append(ti1)
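
The same loading logic can be wrapped in a small reusable function. This is a minimal sketch, not the case study's own code: the name `load_corpus` is hypothetical, and reading the files as UTF-8 is an assumption (the cell above relies on the platform default encoding). The demo writes one made-up article to a temporary directory so that it runs without the real dataset.

```python
import tempfile
from pathlib import Path

def load_corpus(data_dir, n_articles):
    """Read article, topic and title files numbered 1..n_articles.

    Newlines are removed and the text is lowercased, as in the cell above.
    """
    def read(kind, i):
        path = Path(data_dir) / '{}-{}.txt'.format(kind, i)
        # Assumption: the files are UTF-8 encoded.
        return path.read_text(encoding='utf-8').replace('\n', '').lower()

    indices = range(1, n_articles + 1)
    topics = [read('topic', i) for i in indices]
    titles = [read('title', i) for i in indices]
    corpus = [read('article', i) for i in indices]
    return topics, titles, corpus

# Demo on a throwaway directory containing a single made-up article.
with tempfile.TemporaryDirectory() as tmp:
    samples = [('article', 'Some Text\n'), ('topic', 'World\n'), ('title', 'A Title\n')]
    for kind, text in samples:
        (Path(tmp) / '{}-1.txt'.format(kind)).write_text(text, encoding='utf-8')
    topics_demo, titles_demo, corpus_demo = load_corpus(tmp, 1)

print(topics_demo, titles_demo, corpus_demo)
```

With the real dataset the call would be `load_corpus('Data', N)`.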

# Feature generation

We are now ready to do the following:
1. Loop over the text of all the articles to determine all the unique words used across our dataset.
2. Find the subset of the entities from the NER model that are among the unique words being used across the dataset (determined in step 1).

In [5]:
#entity subset array
entity_text_array = []
for i in range(1, N+1):
    #load the article contents text file and convert it into a list of tokens
    tokens = tokenize(load_entire_file('Data/article-' + str(i) + '.txt'))
    #extract all entities known to the NER model mentioned in this article
    entities = ner.extract_entities(tokens)
    #each entity is a (token index range, tag, score) tuple; only the token range is needed here
    for e in entities:
        range_array = e[0]
        #under Python 3 MITIE may return tokens as bytes, so decode them before joining
        entity_text = " ".join(tokens[j].decode() if isinstance(tokens[j], bytes) else str(tokens[j])
                               for j in range_array)
        entity_text_array.append(entity_text.lower())
#remove duplicate entities detected
entity_text_array = list(set(entity_text_array))

Now that we have the list of all entities used across our dataset, we can represent each article as a vector that contains the [TF-IDF](https://en.wikipedia.org/wiki/Tf–idf) score for each entity stored in `entity_text_array`. This is straightforward with the [scikit-learn library](https://scikit-learn.org/stable/) for Python.

In [6]:
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=entity_text_array)
corpus_tf_idf = vect.fit_transform(corpus)
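
One detail worth checking when a fixed vocabulary is passed to `TfidfVectorizer`: with `analyzer='word'` and the default `ngram_range=(1, 1)`, each document is split into single words only, so a multi-word vocabulary entry such as `'joe biden'` can never match and its column stays at zero. The sketch below illustrates this on a made-up two-document corpus with a hypothetical entity list. It does not measure how many entries of `entity_text_array` are affected in the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and hypothetical entity vocabulary (not taken from the dataset).
toy_docs = ['joe biden fights trump', 'trump campaign focuses on biden']
toy_vocab = ['joe biden', 'trump']

# Default setting: only single words (unigrams) are produced.
unigram_vect = TfidfVectorizer(vocabulary=toy_vocab)
# Allowing bigrams lets two-word entities be matched as well.
bigram_vect = TfidfVectorizer(vocabulary=toy_vocab, ngram_range=(1, 2))

X_unigram = unigram_vect.fit_transform(toy_docs).toarray()
X_bigram = bigram_vect.fit_transform(toy_docs).toarray()

# Column 0 corresponds to the entity 'joe biden'.
print('unigrams only:', X_unigram[:, 0])
print('with bigrams: ', X_bigram[:, 0])
```

If many entities span several words, raising the upper bound of `ngram_range` to the length of the longest entity is one option to explore.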


Now that we have the articles represented as vectors of their TF-IDF scores, we are ready to perform Spectral Clustering on the articles. We can use the scikit-learn library for this purpose as well:

In [7]:
# change n_clusters to equal the number of clusters desired
n_clusters = 8
#spectral clustering
spectral = cluster.SpectralClustering(n_clusters= n_clusters, 
                                      eigen_solver='arpack', 
                                      affinity="nearest_neighbors", 
                                      n_neighbors = 10)
spectral.fit(corpus_tf_idf)

SpectralClustering(affinity='nearest_neighbors', assign_labels='kmeans',
                   coef0=1, degree=3, eigen_solver='arpack', eigen_tol=0.0,
                   gamma=1.0, kernel_params=None, n_clusters=8,
                   n_components=None, n_init=10, n_jobs=None, n_neighbors=10,
                   random_state=None)
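
To give an intuition for what happens inside the fit, here is a minimal NumPy sketch of spectral clustering restricted to two clusters. It is a simplification and not scikit-learn's implementation: it uses a Gaussian affinity between all pairs of points instead of the nearest-neighbour graph selected above, and it splits the data by the sign of a single eigenvector instead of running k-means on several of them. The function name `spectral_bipartition` and the toy points are hypothetical.

```python
import numpy as np

def spectral_bipartition(X, gamma=1.0):
    """Split the rows of X into two groups using the second eigenvector
    of the normalized graph Laplacian (two-cluster special case)."""
    # Gaussian affinity between every pair of points (assumption: the
    # notebook itself uses a nearest-neighbour graph instead).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-gamma * sq_dists)
    np.fill_diagonal(W, 0.0)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order, so column 1 holds the
    # eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(L)
    embedding = d_inv_sqrt * eigvecs[:, 1]

    # With two clusters the sign of the embedding is enough; for more
    # clusters one would cluster the rows of several eigenvectors.
    return (embedding > 0).astype(int)

# Two well-separated groups of three points each.
toy_points = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0],
                       [3.0, 0.0], [3.5, 0.0], [4.0, 0.0]])
toy_labels = spectral_bipartition(toy_points)
print(toy_labels)
```

Which group receives label 0 and which receives label 1 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.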

The spectral clustering model is now fitted to the dataset. The code below prints the result in the following format, one line per article:

<br>

<center>__article_number, topic, spectral_clustering_cluster_number, article_title__</center>

In [8]:
if hasattr(spectral, 'labels_'):
    #np.int was removed in NumPy 1.24; the built-in int is equivalent here
    cluster_assignments = spectral.labels_.astype(int)
    for i in range(0, len(cluster_assignments)):
        print(i, topics_array[i], cluster_assignments[i], titles_array[i])

0 elections_2020 0 revealed: conservative group fighting to restrict voting tied to powerful dark money network
1 elections_2020 7 republicans sense rich pickings in biden archive – but will it be made public?
3 elections_2020 5 'feels good to be out of my house': biden lays memorial day wreath in delaware
4 elections_2020 0 'you have to respond forcefully': can joe biden fight trump's brutal tactics?
5 elections_2020 0 why is trump so restrained about the biden sexual assault allegation?
6 elections_2020 4 swing states become partisan battlegrounds in america's fight against covid-19
7 elections_2020 0 trump campaign focuses fire on biden as pandemic undermines strategy
8 elections_2020 4 ‘the united states is broken as hell’ – the division in politics over race and class
9 elections_2020 0 socialism used to be a dirty word. is america now ready to embrace it?
10 world 4 global report: south korea postpones school reopening due to new outbreak
11 world 0 'gross incompentence at highes

As can be seen, the clusters found by the algorithm do not match the newspaper sections the articles were taken from. You can explore the model parameters to improve these results, or investigate which criteria the algorithm is currently using to cluster the articles.
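
When comparing parameter settings, a single number summarizing how well the clusters line up with the sections can be handy. The sketch below computes a contingency table and the cluster purity (the fraction of articles that belong to the majority topic of their cluster). The function name `cluster_purity` and the toy inputs are hypothetical, and purity is only one of several possible agreement measures.

```python
from collections import Counter, defaultdict

def cluster_purity(topics, labels):
    """Return (contingency table, purity) for parallel lists of
    ground-truth topics and cluster labels."""
    by_cluster = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        by_cluster[label][topic] += 1
    # For each cluster, count the articles that belong to its majority topic.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return by_cluster, majority_total / len(topics)

# Made-up example: four articles, two topics, two clusters.
toy_topics = ['world', 'world', 'science', 'science']
toy_clusters = [0, 0, 1, 0]
table, purity = cluster_purity(toy_topics, toy_clusters)

for label in sorted(table):
    print(label, dict(table[label]))
print('purity:', purity)
```

With the variables from this notebook, the call would be `cluster_purity(topics_array, spectral.labels_)`.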