# Stories & Links for Behavioural Survey

I wanted to understand which links people would judge to be most similar to a given article: links inserted by journalists, links selected with TF-IDF, links selected with GloVe embeddings and cosine similarity, or links selected with GloVe embeddings and Word Mover's Distance (WMD). To test this, I decided to administer a short survey in which individuals read an article and then selected the story title they thought was most similar to it. This notebook shows how I selected the link options to include.

In [1]:
#Loading in required packages
import pandas as pd
import numpy as np
import sys
import pyemd
import inspect
from pathlib import Path

#Adding the scripts directory to the import path
sys.path.insert(1, '../scripts/')

#self written functions and scripts that will be needed
import text_normalization_funs as TN
import similiarity_funcs as SIM

In [2]:
#Specifying root path and path to data and scripts
root_path = Path('..')
data_path = root_path/"data"
scripts_path = data_path/"scripts"

In [3]:
#Loading in data 
news_corpus = pd.read_csv(data_path/"cleaned"/'Final_Cleaned_Corpus.txt')

In [4]:
#Loading in three pre-selected test stories

with open(data_path/"cleaned"/'TestStory_1.txt', 'r') as file:
    TestStory1 = file.read().replace('\n', '')

with open(data_path/"cleaned"/'TestStory_2.txt', 'r') as file:
    TestStory2 = file.read().replace('\n', '')

with open(data_path/"cleaned"/'TestStory_3.txt', 'r') as file:
    TestStory3 = file.read().replace('\n', '')


In [5]:
#Cleaning the test stories to get them ready for comparison
cleancomp1 = TN.normalize_Story(story = TestStory1)
cleancomp2 = TN.normalize_Story(story = TestStory2)
cleancomp3 = TN.normalize_Story(story = TestStory3)
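
The body of `TN.normalize_Story` is not shown in this notebook, so the following is only a minimal sketch of what a normalization step of this kind commonly does. The function name `normalize_text`, the tiny stop-word list, and the choice to return a list of tokens are all assumptions made for illustration, not the author's implementation.

```python
import re

# Hypothetical stop-word list for illustration; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def normalize_text(text):
    """Lowercase, strip non-letter characters, and drop stop words."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]


example_tokens = normalize_text("The Bank of Canada holds interest rates steady.")
```

Whatever the real function does, the same normalization should be applied to both the corpus and the query stories so that their vocabularies line up.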



### TF-IDF Links

In [8]:
#Computing similarity scores for given dataframe
scores1 = SIM.SimCorpCreate(corpora = news_corpus['cleaned_text'], directory = str(scripts_path),
                            querydoc = cleancomp1)
scores2 = SIM.SimCorpCreate(corpora = news_corpus['cleaned_text'], directory = str(scripts_path),
                            querydoc = cleancomp2)
scores3 = SIM.SimCorpCreate(corpora = news_corpus['cleaned_text'], directory = str(scripts_path),
                            querydoc = cleancomp3)

#Getting top similar stories
Top_Whole_Stories1 = SIM.StoryMatch(SimScores = scores1, originalDF = news_corpus)
Top_Whole_Stories2 = SIM.StoryMatch(SimScores = scores2, originalDF = news_corpus)
Top_Whole_Stories3 = SIM.StoryMatch(SimScores = scores3, originalDF = news_corpus)

In [48]:
Top_Whole_Stories1

| Unnamed: 0 | title | mainurl | sim_scores |
|---|---|---|---|
| 153 | Oil and gas industry must do more to address c... | https://www.cbc.ca/news/business/oil-and-gas-i... | 0.123619 |
| 1464 | 'No excitement at all' as oilpatch interest wa... | https://www.cbc.ca/news/canada/calgary/crown-d... | 0.121086 |
| 1346 | The new two solitudes: 'Alberta and the rest o... | https://www.cbc.ca/news/canada/calgary/angus-r... | 0.09426 |
| 1447 | P.E.I. immigration forecast to reach record le... | https://www.cbc.ca/news/canada/prince-edward-i... | 0.090523 |
| 304 | Another record year for SUV sales in Quebec as... | https://www.cbc.ca/news/canada/montreal/quebec... | 0.088483 |
| 1553 | Irving Oil refinery drops 2020 carbon cut pledge | https://www.cbc.ca/news/canada/new-brunswick/i... | 0.087952 |
| 480 | Bank of Canada holds interest rates steady | https://www.cbc.ca/news/business/bank-of-canad... | 0.087437 |
| 481 | Bank of Canada holds interest rates steady | https://www.cbc.cahttps://ici.radio-canada.ca/... | 0.087437 |
| 288 | China's economic growth sinks to lowest level ... | https://www.cbc.ca/news/world/china-economy-we... | 0.085211 |
| 1264 | Alberta leaders cheer court ruling against B.C... | https://www.cbc.ca/news/canada/calgary/trans-m... | 0.083661 |


In [None]:
Top_Whole_Stories2

In [None]:
Top_Whole_Stories3
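
The internals of `SIM.SimCorpCreate` and `SIM.StoryMatch` are not shown here, so the block below is only a rough sketch of the general technique this section relies on: weight terms by TF-IDF, score each corpus document against the query with cosine similarity, and keep the top-scoring ones. The function names, the smoothed IDF formula, the choice to include the query when computing document frequencies, and the toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(token_docs):
    """Return one {term: weight} dict per document (raw tf times smoothed idf)."""
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in token_docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_tokens, corpus_tokens, num_links=10):
    """Return (corpus index, score) pairs for the num_links most similar documents."""
    # The query is appended so that its terms share the same idf table as the corpus.
    vectors = tfidf_vectors(corpus_tokens + [query_tokens])
    query_vec = vectors[-1]
    scores = [(i, cosine(query_vec, vec)) for i, vec in enumerate(vectors[:-1])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:num_links]


# Toy, already-normalized documents (made up for this example).
toy_corpus = [
    ["oil", "gas", "alberta"],
    ["bank", "interest", "rates"],
    ["oil", "pipeline", "alberta", "court"],
]
toy_query = ["oil", "gas", "industry", "alberta"]
ranked = rank_by_similarity(toy_query, toy_corpus, num_links=2)
```

In this toy case the first document shares the most weighted terms with the query and ranks first, while the unrelated banking document drops out of the top two.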

### GloVe Embeddings & Cosine Similarity

In [60]:
#Calculating embeddings
#NOTE: GloveEmbeds is not imported above, so it must already be defined in the session
embeds = GloveEmbeds(corpus = news_corpus)

In [None]:
#Computing the similar stories using embeddings
glove1 = GloVeCosine(querydoc = cleancomp1, corpus_embeddings = embeds)
glove2 = GloVeCosine(querydoc = cleancomp2, corpus_embeddings = embeds)
glove3 = GloVeCosine(querydoc = cleancomp3, corpus_embeddings = embeds)

In [63]:
#Getting most similar stories
glovesim1 = SIM.StoryMatch_Val(SimScores = glove1, originalDF = news_corpus, NumLinks = 10)
glovesim2 = SIM.StoryMatch_Val(SimScores = glove2, originalDF = news_corpus, NumLinks = 10)
glovesim3 = SIM.StoryMatch_Val(SimScores = glove3, originalDF = news_corpus, NumLinks = 10)
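
The definitions of `GloveEmbeds` and `GloVeCosine` are not shown in this notebook. One common way to compare documents with word embeddings, sketched below, is to average the word vectors of each document and take the cosine similarity of the averages. Whether the author's functions do exactly this is an assumption; the helper names and the three-dimensional toy vectors (standing in for real GloVe vectors) are made up for illustration.

```python
import math

# Toy 3-dimensional vectors standing in for GloVe embeddings (values are made up).
TOY_EMBEDDINGS = {
    "oil": [0.9, 0.1, 0.0],
    "gas": [0.8, 0.2, 0.0],
    "pipeline": [0.7, 0.3, 0.1],
    "bank": [0.0, 0.9, 0.2],
    "rates": [0.1, 0.8, 0.3],
}


def document_vector(tokens, embeddings):
    """Average the vectors of tokens that have an embedding; None if none do."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


query_vec = document_vector(["oil", "gas"], TOY_EMBEDDINGS)
energy_score = cosine_similarity(
    query_vec, document_vector(["pipeline", "oil"], TOY_EMBEDDINGS)
)
finance_score = cosine_similarity(
    query_vec, document_vector(["bank", "rates"], TOY_EMBEDDINGS)
)
```

Unlike TF-IDF, this approach can score two documents as similar even when they share no exact words, as long as their words sit close together in the embedding space.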

### GloVe Embeddings & WMD

In [None]:
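#Calculating embeddings for WMD
#NOTE: WMDEmbeds is not imported above, so it must already be defined in the session
w_embeds = WMDEmbeds(corpus = news_corpus)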

In [35]:
#Computing the similar stories using embeddings
wmd1 = GloVeWMD(querydoc = cleancomp1, corpus_embeddings = w_embeds)
wmd2 = GloVeWMD(querydoc = cleancomp2, corpus_embeddings = w_embeds)
wmd3 = GloVeWMD(querydoc = cleancomp3, corpus_embeddings = w_embeds)

In [65]:
#Getting the most similar stories 
wmdsim1 = EmbeddingMatch(scores = wmd1, originalDF = news_corpus, NumLinks=10)
wmdsim2 = EmbeddingMatch(scores = wmd2, originalDF = news_corpus, NumLinks=10)
wmdsim3 = EmbeddingMatch(scores = wmd3, originalDF = news_corpus, NumLinks=10)
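
`GloVeWMD` is not defined in this notebook, and exact Word Mover's Distance needs an earth mover's distance solver (the notebook imports `pyemd`, presumably for that purpose). To illustrate the idea without a solver, the sketch below computes the relaxed WMD, a cheaper lower bound in which every word sends all of its weight to its nearest word in the other document. This is a swapped-in simplification, not the exact WMD and not the author's code; the helper names and toy vectors are made up.

```python
import math
from collections import Counter

# Toy 3-dimensional vectors standing in for GloVe embeddings (values are made up).
TOY_EMBEDDINGS = {
    "oil": [0.9, 0.1, 0.0],
    "gas": [0.8, 0.2, 0.0],
    "pipeline": [0.7, 0.3, 0.1],
    "bank": [0.0, 0.9, 0.2],
    "rates": [0.1, 0.8, 0.3],
}


def euclidean(u, v):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(source_tokens, target_tokens, embeddings):
    """Cost when each source word moves all of its weight to its nearest target word."""
    source = [t for t in source_tokens if t in embeddings]
    targets = {t for t in target_tokens if t in embeddings}
    if not source or not targets:
        return math.inf
    weights = Counter(source)
    total = sum(weights.values())
    return sum(
        (count / total)
        * min(euclidean(embeddings[word], embeddings[t]) for t in targets)
        for word, count in weights.items()
    )


def relaxed_wmd(doc_a, doc_b, embeddings):
    """Relaxed WMD: the larger of the two one-sided costs (a lower bound on WMD)."""
    return max(
        one_sided_cost(doc_a, doc_b, embeddings),
        one_sided_cost(doc_b, doc_a, embeddings),
    )


query = ["oil", "gas"]
energy_distance = relaxed_wmd(query, ["pipeline", "oil"], TOY_EMBEDDINGS)
finance_distance = relaxed_wmd(query, ["bank", "rates"], TOY_EMBEDDINGS)
```

Note that WMD is a distance, so smaller values mean more similar documents. Any top-N selection on these scores has to sort in ascending order, the opposite of the cosine-based rankings.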