# Find similar news headlines and compare sentiment
Having done sentiment analysis on headlines and worked with word vectors for article content, I came up with another question: given a headline, can I look up other similar headlines and compare their sentiment?

To do this, I use Doc2Vec to vectorize the headlines and find similar ones. I can then run sentiment analysis on those headlines using TextBlob.

In [4]:
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import nltk
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
import string
from sklearn.manifold import TSNE
from textblob import TextBlob
import pickle
import json

In [5]:
with open('../secrets.json') as file:
    secrets = json.load(file)
    connection_string = secrets['connection_string']
db = create_engine(connection_string)
df = pd.read_sql('select * from news_article', con=db)

In [27]:
df['headline'][0]

'Top U.S. general resists Trump administration?s efforts to provoke war with Iran ? Mondoweiss'

### Function to tokenize a headline and remove stop words and punctuation

In [28]:
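def clean_headline(headline):
    # Tokenize, then drop stop words and punctuation. The match is case-sensitive,
    # so capitalized stop words such as 'To' or 'And' are kept.
    stop_words = set(stopwords.words('english'))
    return [word for word in nltk.word_tokenize(headline)
            if word not in stop_words and word not in string.punctuation]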
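
Because NLTK's stop word list is lowercase, the filter above lets capitalized stop words through. Here is a minimal sketch of a case-insensitive variant. It assumes a tiny hand-written stop word set and whitespace tokenization in place of NLTK so that it runs without any downloads; `SAMPLE_STOP_WORDS` and `clean_tokens` are hypothetical names, not part of this notebook.

```python
import string

# Hypothetical, tiny stand-in for stopwords.words('english').
SAMPLE_STOP_WORDS = {"to", "and", "as", "of", "up", "the", "a"}


def clean_tokens(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop punctuation tokens and stop words, ignoring case."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and tok not in string.punctuation
    ]


# Whitespace split is an assumption; the notebook uses nltk.word_tokenize.
tokens = "Toyota Gearing Up To Build Ventilators And Face Shields".split()
cleaned = clean_tokens(tokens)
print(cleaned)
```

Lowercasing only for the comparison keeps the original casing in the tokens that survive, so the vocabulary the model sees stays the same apart from the stop words that are removed.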

### Prepare the data for Doc2Vec

In [29]:
sentences = []
tag_lookup = {}

# Tokenize each headline and remove stop words and punctuation.
# tag_lookup maps each tag to its original headline so I can look it up later.
# sentences is the list of TaggedDocuments used to train the model.
for i, raw_headline in enumerate(df['headline']):
    tag = f'SENT_{i}'
    sentences.append(TaggedDocument(words=clean_headline(raw_headline), tags=[tag]))
    tag_lookup[tag] = raw_headline

sentences

[TaggedDocument(words=['Top', 'U.S.', 'general', 'resists', 'Trump', 'administration', 'efforts', 'provoke', 'war', 'Iran', 'Mondoweiss'], tags=['SENT_0']),
 TaggedDocument(words=['Atalanta', 'vs', 'Valencia', 'linked', 'accelerating', 'coronavirus', 'spread'], tags=['SENT_1']),
 TaggedDocument(words=['Boris', 'Johnson', "'s", 'government', 'reportedly', 'furious', 'China', 'believes', 'could', '40', 'times', 'coronavirus', 'cases', 'claims'], tags=['SENT_2']),
 TaggedDocument(words=['Toyota', 'Gearing', 'Up', 'To', 'Build', 'Ventilators', 'And', 'Face', 'Shields', 'As', 'Mercedes', 'Offers', 'Use', 'Of', '3D', 'Printers'], tags=['SENT_3']),
 TaggedDocument(words=['Trudeau', 'vows', "'no", 'corners', 'cut', 'accepting', 'masks', 'supplies', 'China'], tags=['SENT_4']),
 TaggedDocument(words=['Endangered', 'sea', 'turtles', 'hatch', 'Brazil', "'s", 'deserted', 'beaches'], tags=['SENT_5']),
 TaggedDocument(words=['Edward', 'Snowden', 'says', 'COVID-19', 'could', 'give', 'governments', 'in

### Pickle the tag_lookup dictionary so I can load it later in my data app

In [32]:
with open('tag_lookup.pickle', 'wb') as f:
    pickle.dump(tag_lookup, f, protocol=pickle.HIGHEST_PROTOCOL)
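
On the app side, the dictionary can be read back with `pickle.load`. The following is a minimal round-trip sketch, assuming a temporary directory and a two-entry sample dictionary in place of the real file; `sample_lookup` is a hypothetical name and its second entry is a placeholder.

```python
import pickle
import tempfile
from pathlib import Path

# Sample stand-in for tag_lookup; the second entry is a placeholder.
sample_lookup = {
    "SENT_1": "Atalanta vs Valencia linked to accelerating coronavirus spread",
    "SENT_X": "placeholder headline",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tag_lookup.pickle"

    # Write the dictionary the same way the notebook does.
    with open(path, "wb") as f:
        pickle.dump(sample_lookup, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back, as the app would at start-up.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == sample_lookup)
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.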

### Build and train a Doc2Vec model

In [16]:
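# alpha == min_alpha keeps the learning rate fixed at 0.025 for all epochs;
# every other hyperparameter is left at its gensim default.
model = Doc2Vec(alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, total_examples=len(sentences), epochs=100)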

### Look up a headline, find similar headlines, and compare their sentiment

In [61]:
# Look up a headline and return the topn most similar training headlines
# as (tag, similarity) pairs.
def find_similar(headline, topn):
    new_clean_headline = clean_headline(headline)
    # model.docvecs is named model.dv in gensim 4 and later.
    inferred_vector = model.infer_vector(new_clean_headline)
    return model.docvecs.most_similar(positive=[inferred_vector], topn=topn)
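
`most_similar` ranks the stored document vectors by cosine similarity to the query vector. Below is a minimal pure-Python sketch of that ranking, using made-up 2-D vectors; `toy_vectors`, `cosine` and `rank_by_cosine` are illustrative names, not gensim API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query, vectors, topn):
    """Return the topn (tag, similarity) pairs, most similar first."""
    scored = [(tag, cosine(query, vec)) for tag, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors for illustration only.
toy_vectors = {
    "SENT_0": [1.0, 0.0],
    "SENT_1": [0.0, 1.0],
    "SENT_2": [1.0, 1.0],
}
ranked = rank_by_cosine([2.0, 0.1], toy_vectors, topn=2)
print([tag for tag, _ in ranked])
```

The query points almost along the first axis, so `SENT_0` ranks first and the diagonal `SENT_2` second. Vector length does not matter, only direction.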

### Test the model on a headline not included in the training set

In [60]:
new_headline = 'this is a test headline, this is not real'
similar = find_similar(new_headline, 5)

print(f'{new_headline}')
blob = TextBlob(new_headline)
print(blob.sentiment)
print()

# show similar headlines
for i, s in enumerate(similar):
    print(f'{i}: {tag_lookup[s[0]]} -- {TextBlob(tag_lookup[s[0]]).sentiment}')

this is a test headline, this is not real
Sentiment(polarity=-0.1, subjectivity=0.30000000000000004)

0: Two Brazilian governors test positive for coronavirus -- Sentiment(polarity=0.22727272727272727, subjectivity=0.5454545454545454)
1: UK scientists create coronavirus antibody test with '99.8% accuracy and results in 35 minutes' -- Sentiment(polarity=0.0, subjectivity=0.0)
2: Two Brazilian governors test positive for coronavirus -- Sentiment(polarity=0.22727272727272727, subjectivity=0.5454545454545454)
3: Rio de Janeiro governor tests positive for coronavirus -- Sentiment(polarity=0.22727272727272727, subjectivity=0.5454545454545454)
4: South Korea sending COVID-19 test kits to U.S. -- Sentiment(polarity=0.0, subjectivity=0.0)


### Look up an existing headline to see other similar headlines
Since the headline I'm looking up was part of training, the most similar result should be that same headline. So to get the top 5 most similar headlines, I request the top 6 and drop the first.
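
Dropping the first result relies on the looked-up headline ranking first. Because `infer_vector` produces an approximation rather than the stored training vector, a more defensive option is to filter by headline text. This is a sketch, assuming `similar` is a list of `(tag, score)` pairs and `tag_lookup` maps tags to headlines; `exclude_headline` is a hypothetical helper, and the toy tags and scores are made up. It also skips headlines that appear more than once.

```python
def exclude_headline(similar, tag_lookup, headline, topn):
    """Return up to topn (tag, score) pairs, skipping the query headline and repeats."""
    seen = {headline}
    results = []
    for tag, score in similar:
        text = tag_lookup[tag]
        if text in seen:
            continue  # the query itself, or a headline already listed
        seen.add(text)
        results.append((tag, score))
        if len(results) == topn:
            break
    return results


# Made-up data for illustration only.
toy_lookup = {
    "SENT_0": "headline A",
    "SENT_1": "headline B",
    "SENT_2": "headline B",  # same text stored under a second tag
    "SENT_3": "headline C",
}
toy_similar = [("SENT_0", 0.99), ("SENT_1", 0.80), ("SENT_2", 0.79), ("SENT_3", 0.60)]
filtered = exclude_headline(toy_similar, toy_lookup, "headline A", topn=2)
print(filtered)
```

With this approach it helps to request a few more candidates than needed, so that enough remain after filtering.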

In [57]:
new_headline = df['headline'][0]
similar = find_similar(new_headline, 6)
similar.pop(0)

print(f'{new_headline}')
blob = TextBlob(new_headline)
print(blob.sentiment)
print()

# show similar headlines
for i, s in enumerate(similar):
    print(f'{i}: {tag_lookup[s[0]]} -- {TextBlob(tag_lookup[s[0]]).sentiment}')

Top U.S. general resists Trump administration?s efforts to provoke war with Iran ? Mondoweiss
Sentiment(polarity=0.275, subjectivity=0.5)

0: Iran: $1.6 Billion Released from Luxembourg to Go to Coronavirus -- Sentiment(polarity=0.0, subjectivity=0.0)
1: WHO hopes U.S. funding will continue -- Sentiment(polarity=0.0, subjectivity=0.0)
2: Luxembourg blocks US bid for $1.6 billion 9/11 compensation from Iran -- Sentiment(polarity=0.0, subjectivity=0.0)
3: Iran coronavirus fatalities drop to double figures for first time in month -- Sentiment(polarity=0.125, subjectivity=0.16666666666666666)
4: Armed men seize, release tanker off Iran by Strait of Hormuz -- Sentiment(polarity=0.0, subjectivity=0.0)
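
One way to summarize the comparison is to set the looked-up headline's polarity against the mean polarity of its neighbours. Here is a small sketch using the polarity values printed above. The averaging is one possible summary, not something this notebook computes, and the variable names are illustrative.

```python
from statistics import mean

# Polarity values copied from the output above.
query_polarity = 0.275
similar_polarities = [0.0, 0.0, 0.0, 0.125, 0.0]

mean_similar = mean(similar_polarities)
gap = query_polarity - mean_similar

print(f"mean polarity of similar headlines: {mean_similar:.3f}")
print(f"gap to the looked-up headline: {gap:.3f}")
```

Most of the similar headlines score exactly 0.0, so the mean mostly reflects how many headlines TextBlob rates as neutral.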


### Presenting the results
I figured the best way to present these results would be a data app. This way a user can select whatever headline they want and find similar ones. To do this, I will save the Doc2Vec model to a file and load it from the app.

In [62]:
model.save('headline_model')

Test loading the model to make sure it works

In [66]:
loaded_model = Doc2Vec.load('headline_model')

new_headline = df['headline'][1]
new_clean_headline = clean_headline(new_headline)
# Use loaded_model for inference too, so the test exercises only the saved model.
similar = loaded_model.docvecs.most_similar(positive=[loaded_model.infer_vector(new_clean_headline)],topn=6)
similar.pop(0)

print(f'{new_headline}')
blob = TextBlob(new_headline)
print(blob.sentiment)
print()

# show similar headlines
for i, s in enumerate(similar):
    print(f'{i}: {tag_lookup[s[0]]} -- {TextBlob(tag_lookup[s[0]]).sentiment}')

Atalanta vs Valencia linked to accelerating coronavirus spread
Sentiment(polarity=0.0, subjectivity=0.0)

0: Boris Johnson continued to shake hands after his own scientific advisers warned it could spread the coronavirus -- Sentiment(polarity=0.6, subjectivity=1.0)
1: Air-conditioning spread the coronavirus to 9 people sitting near an infected person in a restaurant, researchers say. It has huge implications for the service industry. -- Sentiment(polarity=0.25000000000000006, subjectivity=0.65)
2: Coronavirus: Labour official spread conspiracy theory that Boris Johnson did not have Covid-19 -- Sentiment(polarity=0.0, subjectivity=0.0)
3: China’s initial coronavirus outbreak in Wuhan spread twice as fast as we thought, new study suggests -- Sentiment(polarity=0.11212121212121212, subjectivity=0.3515151515151515)
4: NHS coronavirus app: memo discussed giving ministers power to 'de-anonymise' users -- Sentiment(polarity=0.0, subjectivity=0.0)
