# Find similar news headlines and compare sentiment
Having done sentiment analysis on headlines and worked with word vectors for article content, I came up with another question: given a headline, can I look up other similar headlines and compare their sentiment?

To do this, I use Doc2Vec to vectorize the headlines and find similar ones. I can then perform sentiment analysis on these headlines using TextBlob.

In [5]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import string
from sklearn.manifold import TSNE
from textblob import TextBlob
import pickle
import json

In [6]:
with open('../secrets.json') as file:
    secrets = json.load(file)
    connection_string = secrets['connection_string']
db = create_engine(connection_string)
df = pd.read_sql('select * from news_article', con=db)

In [7]:
df['headline'][0]

"Muons: 'Strong' evidence found for a new force of nature"

### Function to tokenize a headline and remove stop words and punctuation

In [8]:
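def clean_headline(headline):
    """Tokenize a headline and drop English stop words and punctuation tokens."""
    stop_words = set(stopwords.words('english'))
    # The stop-word check is case-sensitive, so capitalized stop words such as 'A' are kept.
    tokens = nltk.word_tokenize(headline)
    return [word for word in tokens if word not in stop_words and word not in string.punctuation]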
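
The filter above compares tokens against a lowercase stop-word list, which is why a capitalized token like 'A' can survive cleaning. The following is a minimal sketch of that effect, not the notebook's actual function: `STOP_WORDS` is a tiny hypothetical stand-in for `stopwords.words('english')`, and `filter_tokens` with its `lowercase` flag is an illustrative helper.

```python
import string

# Hypothetical, tiny stop-word list standing in for stopwords.words('english').
STOP_WORDS = {"a", "of", "or", "for", "the"}


def filter_tokens(tokens, lowercase=False):
    """Drop stop words and punctuation tokens from an already tokenized headline."""
    kept = []
    for token in tokens:
        # Optionally lowercase the token before the stop-word check only.
        key = token.lower() if lowercase else token
        if key in STOP_WORDS or token in string.punctuation:
            continue
        kept.append(token)
    return kept


tokens = ["A", "third", "of", "COVID", "survivors", ":", "study"]
case_sensitive = filter_tokens(tokens)
case_insensitive = filter_tokens(tokens, lowercase=True)
print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before the check is a design choice: it removes more stop words, but any change to cleaning also changes the tokens the model is trained on.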

### Prepare the data for Doc2Vec

In [9]:
sentences = []
tag_lookup = {}

# Tokenize each headline and remove stop words and punctuation.
# tag_lookup maps each tag to its original headline so I can look up the actual headlines later.
# sentences is the list of TaggedDocuments used for training the model.
for i in range(len(df)):
    headline = clean_headline(df['headline'][i])
    tag = f'SENT_{i}'
    sentences.append(TaggedDocument(words=headline, tags=[tag]))
    tag_lookup[tag] = df['headline'][i]

sentences

[TaggedDocument(words=['Muons', "'Strong", 'evidence', 'found', 'new', 'force', 'nature'], tags=['SENT_0']),
 TaggedDocument(words=['A', 'third', 'COVID', 'survivors', 'suffer', 'neurological', 'mental', 'disorders', 'study'], tags=['SENT_1']),
 TaggedDocument(words=['U.S.', 'disturbed', 'imprisoned', 'Kremlin', 'critic', 'Navalny', "'s", 'deteriorating', 'health'], tags=['SENT_2']),
 TaggedDocument(words=['Jailed', 'Kremlin', 'critic', 'Navalny', 'loses', 'sensation', 'hands', 'Lawyers'], tags=['SENT_3']),
 TaggedDocument(words=['Taiwan', 'says', 'may', 'shoot', 'Chinese', 'drones', 'South', 'China', 'Sea'], tags=['SENT_4']),
 TaggedDocument(words=['Wuhan', 'cemeteries', 'see', '320,000', 'early', 'mourners', 'Tomb', 'Sweeping', 'Festival', 'Chinese', 'media'], tags=['SENT_5']),
 TaggedDocument(words=['Navalny', 'lawyer', 'says', 'opposition', 'leader', 'diagnosed', 'spinal', 'hernias'], tags=['SENT_6']),
 TaggedDocument(words=['Rolls-Royce', 'hits', 'new', 'sales', 'record', 'first',

### Pickle the tag_lookup dictionary so I can load it later in my data app

In [18]:
with open('models/doc2vec/tag_lookup.pickle', 'wb') as f:
    pickle.dump(tag_lookup, f, protocol=pickle.HIGHEST_PROTOCOL)
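
For reference, loading the dictionary back is the mirror image of the dump. This is a minimal round-trip sketch under stated assumptions: `example_lookup` is a two-entry stand-in for the real `tag_lookup`, and a temporary directory is used in place of the `models/doc2vec/` path.

```python
import os
import pickle
import tempfile

# Two-entry stand-in for the full tag_lookup dictionary (tag -> original headline).
example_lookup = {
    "SENT_0": "Muons: 'Strong' evidence found for a new force of nature",
    "SENT_1": "A third of COVID survivors suffer neurological or mental disorders: study",
}

# A temporary directory is an assumption made so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tag_lookup.pickle")

    # Write the dictionary in binary mode, as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(example_lookup, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back the way a separate app process would.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == example_lookup)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.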

### Build and train a Doc2Vec model

In [11]:
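# alpha and min_alpha are equal here, so the learning rate stays constant instead of decaying during training.
model = Doc2Vec(alpha=0.025, min_alpha=0.025)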
model.build_vocab(sentences)
model.train(sentences, total_examples=len(sentences), epochs=100)
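
The `alpha` and `min_alpha` arguments set the start and end of the learning rate, which gensim decays linearly over training. The sketch below is a simplified per-epoch illustration, not gensim's internal schedule (which updates the rate more finely than once per epoch). `learning_rate_schedule` is a hypothetical helper, and 0.0001 is used as the comparison value on the assumption that it matches gensim's default `min_alpha`.

```python
def learning_rate_schedule(alpha, min_alpha, epochs):
    """Return one linearly interpolated learning rate per epoch (simplified)."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * epoch for epoch in range(epochs)]


# Same start and end value, as in the cell above: the rate never changes.
constant = learning_rate_schedule(0.025, 0.025, epochs=5)

# A lower end value: the rate shrinks a little every epoch.
decaying = learning_rate_schedule(0.025, 0.0001, epochs=5)

print(constant)
print(decaying)
```

With equal values the model takes full-size steps through all 100 epochs, so it may be worth comparing results against a decaying schedule.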

### Look up a headline, find similar headlines, and compare their sentiment

In [12]:
# Function to look up a headline and find similar headlines.
# Note: gensim 4.x renames model.docvecs to model.dv.
def find_similar(headline, topn):
    new_clean_headline = clean_headline(headline)
    inferred_vector = model.infer_vector(new_clean_headline)
    return model.docvecs.most_similar(positive=[inferred_vector], topn=topn)
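
Conceptually, `most_similar` ranks the stored document vectors by cosine similarity to the query vector and returns the top `topn` as `(tag, score)` pairs. Here is a pure-Python sketch of that ranking on toy two-dimensional vectors. The helper names and the vectors are illustrative assumptions, and gensim's own implementation is vectorized rather than written like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, tagged_vectors, topn):
    """Return the topn (tag, score) pairs, most similar first."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in tagged_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy document vectors keyed by tag; real Doc2Vec vectors have many more dimensions.
toy_vectors = {
    "SENT_0": [1.0, 0.0],
    "SENT_1": [0.0, 1.0],
    "SENT_2": [1.0, 1.0],
}

ranked = rank_by_similarity([2.0, 0.0], toy_vectors, topn=2)
print(ranked)
```

Because cosine similarity ignores vector length, the query `[2.0, 0.0]` matches `SENT_0` perfectly even though the two vectors differ in magnitude.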

### Test the model on a headline not included in the training set

In [13]:
new_headline = 'this is a test headline, this is not real'
similar = find_similar(new_headline, 5)

print(f'{new_headline}')
blob = TextBlob(new_headline)
print(blob.sentiment)
print()

# show similar headlines
for i, s in enumerate(similar):
    print(f'{i}: {tag_lookup[s[0]]} -- {TextBlob(tag_lookup[s[0]]).sentiment}')

this is a test headline, this is not real
Sentiment(polarity=-0.1, subjectivity=0.30000000000000004)

0: Hackers post fake stories on real news sites 'to discredit Nato' -- Sentiment(polarity=-0.15, subjectivity=0.65)
1: Conservative delegates reject adding 'climate change is real' to the policy book -- Sentiment(polarity=0.2, subjectivity=0.30000000000000004)
2: Cancer can be precisely diagnosed using a urine test with artificial intelligence -- Sentiment(polarity=-0.09999999999999998, subjectivity=0.9)
3: CERN scientists design trap to transport antimatter between facilities -- Sentiment(polarity=0.0, subjectivity=0.0)
4: Breakthrough mRNA vaccine developed for cancer immunotherapy by Chinese scientists -- Sentiment(polarity=0.05, subjectivity=0.15)


### Look up an existing headline to see other similar headlines
Since the headline I'm looking up was part of the training set, the most similar result is usually that same headline. This is not guaranteed, because `infer_vector` infers a fresh vector rather than reusing the stored one. To get the top 5 most similar headlines, I request the top 6 and ignore the first.
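
Dropping the first result assumes the queried headline ranks first. A sketch of a variant that filters by headline text instead is shown here. `drop_query_headline`, the toy lookup, and the toy scores are all hypothetical and only illustrate the idea.

```python
def drop_query_headline(similar, query_headline, tag_lookup, topn):
    """Keep the first topn results whose headline text differs from the query."""
    filtered = [
        (tag, score)
        for tag, score in similar
        if tag_lookup[tag] != query_headline
    ]
    return filtered[:topn]


# Toy data: the queried headline ("headline a") is ranked second, not first.
toy_lookup = {"SENT_0": "headline a", "SENT_1": "headline b", "SENT_2": "headline c"}
toy_similar = [("SENT_1", 0.91), ("SENT_0", 0.90), ("SENT_2", 0.42)]

result = drop_query_headline(toy_similar, "headline a", toy_lookup, topn=2)
print(result)
```

With this approach, a plain `pop(0)` would have discarded `SENT_1` and kept the query itself, while the filter keeps the two headlines that really are different.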

In [14]:
new_headline = df['headline'][0]
similar = find_similar(new_headline, 6)
similar.pop(0)

print(f'{new_headline}')
blob = TextBlob(new_headline)
print(blob.sentiment)
print()

# show similar headlines
for i, s in enumerate(similar):
    print(f'{i}: {tag_lookup[s[0]]} -- {TextBlob(tag_lookup[s[0]]).sentiment}')

Muons: 'Strong' evidence found for a new force of nature
Sentiment(polarity=0.2848484848484848, subjectivity=0.5939393939393939)

0: Israel keeps blowing up military targets in Iran, hoping to force a confrontation before Trump can be voted out in November, sources say -- Sentiment(polarity=-0.1, subjectivity=0.1)
1: Legislation coming this year to force Google, Facebook to pay for news content -- Sentiment(polarity=0.0, subjectivity=0.0)
2: WHO acknowledges 'emerging evidence' of airborne spread of COVID-19 -- Sentiment(polarity=0.0, subjectivity=0.0)
3: Taiwan produces evidence it warned WHO of coronavirus in December -- Sentiment(polarity=0.0, subjectivity=0.0)
4: Taiwan reports large incursion by Chinese air force -- Sentiment(polarity=0.10714285714285714, subjectivity=0.21428571428571427)
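
One simple way to compare the sentiment numerically is to set the query's polarity against the mean polarity of its neighbours. The sketch below does this with the polarity values printed above. The choice of the mean as the summary statistic is an assumption, not something the notebook itself computes.

```python
from statistics import mean

# Polarity values copied from the output above.
query_polarity = 0.2848484848484848
similar_polarities = [-0.1, 0.0, 0.0, 0.0, 0.10714285714285714]

# Average polarity of the five similar headlines.
mean_similar = mean(similar_polarities)

# Positive gap: the queried headline reads as more positive than its neighbours.
gap = query_polarity - mean_similar

print(f"query polarity: {query_polarity:.3f}")
print(f"mean polarity of similar headlines: {mean_similar:.3f}")
print(f"gap: {gap:.3f}")
```

With only five neighbours, and several of them scored exactly 0.0, the mean is a rough indicator at best.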


### Presenting the results
I figured the best way to present these results would be a data app. This way a user can select whatever headline they want and find similar ones. To do this, I will save the Doc2Vec model to a file and load it from the app.

In [17]:
model.save('models/doc2vec/headline_model')

Test loading the model to make sure it works.

In [16]:
# Load from the same path the model was saved to above.
loaded_model = Doc2Vec.load('models/doc2vec/headline_model')

new_headline = df['headline'][1]
new_clean_headline = clean_headline(new_headline)
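# Infer the vector with loaded_model (not model) so this cell actually tests the loaded model.
similar = loaded_model.docvecs.most_similar(positive=[loaded_model.infer_vector(new_clean_headline)], topn=6)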
similar.pop(0)

print(f'{new_headline}')
blob = TextBlob(new_headline)
print(blob.sentiment)
print()

# show similar headlines
for i, s in enumerate(similar):
    print(f'{i}: {tag_lookup[s[0]]} -- {TextBlob(tag_lookup[s[0]]).sentiment}')

A third of COVID survivors suffer neurological or mental disorders: study
Sentiment(polarity=-0.05, subjectivity=0.1)

0: Dark hair was common among Vikings, genetic study confirms -- Sentiment(polarity=-0.22499999999999998, subjectivity=0.45)
1: Saudi Arabia textbooks revised to be more tolerant - study -- Sentiment(polarity=0.5, subjectivity=0.5)
2: Nurses suffer burn-out, psychological distress in COVID fight - association -- Sentiment(polarity=0.0, subjectivity=0.1)
3: Billionaire Kerry Stokes exempted from strict quarantine rules after arriving in Perth from Aspen by private jet -- Sentiment(polarity=0.0, subjectivity=0.375)
4: A 64-year-old man accidentally ejected himself from a fighter jet at 2,500 feet -- Sentiment(polarity=0.0, subjectivity=0.0)
