# Task 2, 3 and 4

1. Done in previous notebook
2. Extract vector representations of the headlines and bodies in all the datasets, and compute the cosine similarity between these two vectors. You can use representations based on bag-of-words or other methods like Word2Vec for vector-based representations. You are encouraged to explore alternative representations as well.
3. Establish language model based representations of the headlines and the article bodies in all the datasets and calculate the KL-divergence for each pair of headlines and article bodies. Feel free to explore different smoothing techniques for language model based representations.
4. Propose and implement alternative features/distances that might be helpful for the stance detection task. Describe feature meaning and extraction process.

Remarks:
- Tasks 2, 3 and 4 are implemented in this notebook.

The output contains two csv files:
1. Word2Vec vector representations (associated with cosine similarities), named Headline_w2c_vec and Body_w2c_vec respectively
2. TF-IDF vector representations (associated with cosine similarities), named Headline_tfidf_vec and Body_tfidf_vec respectively

The Word2Vec model is trained on all the sentences in the training set.

The TF-IDF model is trained on all the unique sentences in the training set.

Variable Name List
1. Train_stance and Train_body: original tables (49972 stances (headlines), 1683 bodies).
   Pred_Train_stance and Pred_Train_body are the preprocessed versions of Train_stance and Train_body (49972 headlines, 1683 bodies).
2. Pred_Train_df: merged Pred_Train_stance and Pred_Train_body (49972 headlines + 49972 bodies)
3. All_sentence: all sentences of headlines and bodies (not unique, 99944)

In [94]:
import numpy as np
import pandas as pd
from numpy import dot
from numpy.linalg import norm

import re

#import libraries for data processing
import nltk
from nltk import FreqDist, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

#import libraries for vectorisation
import gensim
from gensim.models import Word2Vec

#train validation set split
from sklearn.model_selection import train_test_split

#cosine distance
from sklearn.metrics.pairwise import paired_cosine_distances

#Counter for count word frequency
from collections import Counter

#import tqdm for feature construction
from tqdm import tqdm

#Plot
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline
sns.set_style("white")  

#math for calculation
import math


#import the training set and test set
Train_body = pd.read_csv('/Users/weisihan/Downloads/fnc-1-master/train_bodies.csv')
Train_stance = pd.read_csv('/Users/weisihan/Downloads/fnc-1-master/train_stances.csv')
Test_body = pd.read_csv('/Users/weisihan/Downloads/fnc-1-master/competition_test_bodies.csv')
Test_stance = pd.read_csv('/Users/weisihan/Downloads/fnc-1-master/competition_test_stances.csv')

### Task 2. Extract vector representation of headlines and bodies, compute the cosine similarity between these two vectors. 
You can use representations based on bag-of-words or other methods like Word2Vec for vector based representations. You are encouraged to explore alternative representations as well.
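
As a minimal sketch of the quantity computed in this task: cosine similarity is the dot product of two vectors divided by the product of their norms. The helper name `cosine_similarity` and the toy count vectors below are illustrative assumptions; the notebook itself relies on `paired_cosine_distances` from scikit-learn.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy bag-of-words counts over the vocabulary ['apple', 'watch', 'police']
headline_vec = [1, 1, 0]
body_vec = [2, 1, 0]
unrelated_vec = [0, 0, 3]

sim_related = cosine_similarity(headline_vec, body_vec)        # shared words
sim_unrelated = cosine_similarity(headline_vec, unrelated_vec) # no shared words
```

Vectors that share no non-zero dimension get a similarity of zero, which is why exact word overlap matters so much for the bag-of-words style representations.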

##### Before doing that, we should preprocess the training set and the test set

In [2]:
# Loading English stopwords
#nltk.download('stopwords')

Summary:
The preprocessing is done in the following steps:
1. keep only English letters; filter out numbers, punctuation, etc.
2. split each piece of data (either headline or body) into single words
3. transform each word into lower case
4. lemmatise each word (e.g. transform plurals into singular)
5. delete English stop words, e.g. is, a, the ...
6. join the words again to get a single string
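
The steps above can be sketched on a single string as follows. This is a simplified illustration under stated assumptions: `TOY_STOPWORDS` is a tiny hand-written list standing in for NLTK's English stop words, `str.split` stands in for `word_tokenize`, and lemmatisation is left out so that the sketch runs without any NLTK data downloads.

```python
import re

# Tiny hand-written stop-word list, standing in for NLTK's English list (assumption)
TOY_STOPWORDS = {"is", "a", "the", "in", "of", "and", "to"}

def toy_preprocess(text):
    """Letters only -> tokens -> lower case -> stop-word removal -> joined string."""
    letters_only = re.sub("[^a-zA-Z]", " ", text)   # digits and punctuation become spaces
    tokens = letters_only.split()                    # split into single words
    tokens = [t.lower() for t in tokens]             # lower case
    # lemmatisation is omitted in this sketch
    tokens = [t for t in tokens if t not in TOY_STOPWORDS]  # remove stop words
    return " ".join(tokens)                          # join back into one string

cleaned = toy_preprocess("The iPhone 6 is bending in 3 pockets!")
```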

In [3]:
# Initialise the function for text preprocessing (keep only alphabetic characters, lowercase, lemmatise and remove stopwords)
def preprocessing(data, col):
    df = data.copy()
    stopWords = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()  # create the lemmatiser once instead of once per word
    table = [re.sub("[^a-zA-Z]", " ", str(dt)) for dt in df[col]]  # keep only English letters; numbers, punctuation, ... become spaces; returns a list of strings
    table = [word_tokenize(dt) for dt in table]  # split the strings into single words, e.g. [['it', 'is', 'a', 'good', 'sunny', 'day'], ['how', 'are', 'you'], ...]
    table = [[word.lower() for word in dt] for dt in table]  # convert to lower case
    table = [[lemmatizer.lemmatize(word) for word in dt] for dt in table]  # lemmatise (e.g. plural to singular)
    table = [[word for word in dt if word not in stopWords] for dt in table]  # delete stop words ("a", "the", ...) to reduce the number of noisy features
    table = [' '.join(words) for words in table]  # join the single words with spaces again
    df[col] = table  # write the result back as a DataFrame column
    return df

In [95]:
# do preprocessing of the training set and test set 
Pred_Train_stance = preprocessing(Train_stance,'Headline')
Pred_Train_body = preprocessing(Train_body,'articleBody')
Pred_Test_stance = preprocessing(Test_stance,'Headline')
Pred_Test_body = preprocessing(Test_body,'articleBody')
#Pred_Train_stance has a shape of (49972, 3)
#Pred_Train_body has a shape of (1683, 2)
#Pred_Test_stance has a shape of (25413, 3)
#Pred_Test_body has a shape of (904, 2)

In [5]:
# this is the Train Stance dataset after data preprocessing
Pred_Train_stance.head()

Unnamed: 0,Headline,Body ID,Stance
0,police find mass graf least body near mexico t...,712,unrelated
1,hundred palestinian flee flood gaza israel ope...,158,agree
2,christian bale pass role steve job actor repor...,137,unrelated
3,hbo apple talk month apple tv streaming servic...,1034,unrelated
4,spider burrowed tourist stomach chest,1923,disagree


In [6]:
# this is the Train body set after data preprocessing
Pred_Train_body.head()

Unnamed: 0,Body ID,articleBody
0,0,small meteorite crashed wooded area nicaragua ...
1,4,last week hinted wa come ebola fear spread acr...
2,5,newser wonder long quarter pounder cheese last...
3,6,posting photo gun toting child online isi supp...
4,7,least suspected boko haram insurgent killed cl...


In [7]:
#Change the name of trainingset column 'Body ID' to 'Body_ID'
Pred_Train_stance.rename(columns={'Body ID':'Body_ID'}, inplace=True) 
Pred_Train_body.rename(columns={'Body ID':'Body_ID'}, inplace=True) 
#Merge the tables Pred_Train_stance and Pred_Train_body
Pred_Train_df = pd.merge(Pred_Train_stance, Pred_Train_body, how='inner', on="Body_ID",  copy=True)  
#Reorder the columns for easier viewing
Pred_Train_df=Pred_Train_df[['Headline', 'Body_ID', 'articleBody','Stance']]
Pred_Train_df.shape

(49972, 4)

In [101]:
#Change the name of testset column 'Body ID' to 'Body_ID'
Pred_Test_stance.rename(columns={'Body ID':'Body_ID'}, inplace=True) 
Pred_Test_body.rename(columns={'Body ID':'Body_ID'}, inplace=True) 
#Merge the tables Pred_Test_stance and Pred_Test_body
Pred_Test_df = pd.merge(Pred_Test_stance, Pred_Test_body, how='inner', on="Body_ID",  copy=True)  
#Reorder the columns for easier viewing
Pred_Test_df=Pred_Test_df[['Headline', 'Body_ID', 'articleBody','Stance']]
Pred_Test_df.shape

(25413, 4)

In [8]:
Pred_Train_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance
0,police find mass graf least body near mexico t...,712,danny boyle directing untitled film seth rogen...,unrelated
1,seth rogen play apple steve wozniak,712,danny boyle directing untitled film seth rogen...,discuss
2,mexico police find mass grave near site studen...,712,danny boyle directing untitled film seth rogen...,unrelated
3,mexico say missing student found first mass graf,712,danny boyle directing untitled film seth rogen...,unrelated
4,new io bug delete icloud document,712,danny boyle directing untitled film seth rogen...,unrelated


### Task 2 - 1 - Word2Vec 

In [279]:
Headline_list = (Pred_Train_df['Headline'].str.split()).values.tolist()
Body_list = (Pred_Train_df['articleBody'].str.split()).values.tolist()

All_sentence = Headline_list + Body_list  #plain list concatenation (np.concatenate can fail on ragged token lists in recent NumPy versions)
#the length of All_sentence (headline+body) is 99944, not unique but all

In [11]:
#build the model (this uses all sentences in the training set: 99944)
model = gensim.models.word2vec.Word2Vec(sentences=All_sentence, min_count=1)
words = set(model.wv.vocab.keys()) #unique words in the training set (bodies and headlines); a set gives fast membership tests (gensim >= 4.0 exposes the vocabulary as model.wv.key_to_index)
model.save('word2vec_all_sentence_99444.model')

In [13]:
#Compute the headline vectors based on word2vec
Headline_w2c_vec = np.zeros((len(Headline_list), 100))

for h in range(len(Headline_list)):
    for w in Headline_list[h]:
        if w in words:                                                        #the vector representation of a whole sentence is the sum of the vectors of its words
            Headline_w2c_vec[h] = np.add(Headline_w2c_vec[h], model.wv.word_vec(w)) #model.wv.word_vec(w) looks up the vector of word w
    Headline_w2c_vec[h] = Headline_w2c_vec[h] / np.sqrt(Headline_w2c_vec[h].dot(Headline_w2c_vec[h]))  #L2-normalise the sum; a headline with no known words gives 0/0 = NaN

In [12]:
#Compute the Body vector list based on word2vec 
Body_w2c_vec = np.zeros((len(Body_list), 100))

for b in range(len(Body_list)):
    for w in Body_list[b]:
        if w in words:
            Body_w2c_vec[b] = np.add(Body_w2c_vec[b], model.wv.word_vec(w))
    Body_w2c_vec[b] = Body_w2c_vec[b] / np.sqrt(Body_w2c_vec[b].dot(Body_w2c_vec[b]))#scale to unit length
#15:55-max:13

In [15]:
#Save the word2vec represented training set headline vectors and body vectors into local files
np.savetxt('Headline_w2c_vec', Headline_w2c_vec, delimiter=',')
np.savetxt('Body_w2c_vec', Body_w2c_vec, delimiter=',') 
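
The loops above divide each summed vector by its norm, which is zero when a text contains no in-vocabulary words and therefore yields NaN. A minimal sketch of one way to guard against that is shown below; the function name `sentence_vector` and the two-dimensional toy embeddings are hypothetical and only stand in for the trained Word2Vec model.

```python
import math

def sentence_vector(tokens, word_vectors, dim):
    """Sum the vectors of known tokens and scale the sum to unit length."""
    total = [0.0] * dim
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is not None:               # skip out-of-vocabulary tokens
            total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    if norm == 0.0:
        return total                      # no known tokens: keep the zero vector
    return [t / norm for t in total]

# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {"police": [3.0, 0.0], "find": [0.0, 4.0]}

vec_known = sentence_vector(["police", "find", "unknownword"], toy_vectors, dim=2)
vec_empty = sentence_vector([], toy_vectors, dim=2)
```

Returning the zero vector for an empty text has the same effect as applying `np.nan_to_num` afterwards, but it avoids the runtime warning at the point of division.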

In [9]:
#load word2vec directly
"""
Headline_w2c_vec = np.loadtxt('Headline_w2c_vec', dtype='float32', delimiter=',')
Body_w2c_vec = np.loadtxt('Body_w2c_vec', dtype='float32', delimiter=',')
"""

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In [106]:
#Use the trained model to compute the vectors for the test set

Test_Headline_list = (Pred_Test_df['Headline'].str.split()).values.tolist()
Test_Body_list = (Pred_Test_df['articleBody'].str.split()).values.tolist()

Test_Headline_w2c_vec = np.zeros((len(Test_Headline_list), 100))

for h in range(len(Test_Headline_list)):
    for w in Test_Headline_list[h]:
        if w in words:                                                        #the vector representation of a whole sentence is the sum of the vectors of its words
            Test_Headline_w2c_vec[h] = np.add(Test_Headline_w2c_vec[h], model.wv.word_vec(w)) #model.wv.word_vec(w) looks up the vector of word w
    Test_Headline_w2c_vec[h] = Test_Headline_w2c_vec[h] / np.sqrt(Test_Headline_w2c_vec[h].dot(Test_Headline_w2c_vec[h]))  #L2-normalise the sum; a headline with no known words gives 0/0 = NaN



In [109]:
#Compute the Body vector list based on word2vec 
Test_Body_w2c_vec = np.zeros((len(Test_Body_list), 100))

for b in range(len(Test_Body_list)):
    for w in Test_Body_list[b]:
        if w in words:
            Test_Body_w2c_vec[b] = np.add(Test_Body_w2c_vec[b], model.wv.word_vec(w))
    Test_Body_w2c_vec[b] = Test_Body_w2c_vec[b] / np.sqrt(Test_Body_w2c_vec[b].dot(Test_Body_w2c_vec[b]))#scale to unit length
#20:28-max:13

In [110]:
#Save the word2vec represented testset headline vectors and body vectors into local files
np.savetxt('Test_Headline_w2c_vec', Test_Headline_w2c_vec, delimiter=',')
np.savetxt('Test_Body_w2c_vec', Test_Body_w2c_vec, delimiter=',') 

### Task 2 - 2 - TF - IDF


In [10]:
#len_doc=len(All_sentence_count)
def tf(word, count):
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)
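
To make the behaviour of these helpers concrete, here is a small worked example on a made-up corpus of four tokenised documents. The helper definitions are repeated so the snippet runs on its own; the corpus is purely illustrative.

```python
import math
from collections import Counter

def tf(word, count):
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

# Toy corpus of four tokenised documents (made up for illustration)
docs = [["apple", "watch", "apple"],
        ["police", "watch"],
        ["police", "watch", "report"],
        ["gold", "watch"]]
count_list = [Counter(d) for d in docs]

tfidf_apple = tf("apple", count_list[0]) * idf("apple", count_list)  # (2/3) * log(4/2)
idf_police = idf("police", count_list)                               # log(4/3)
idf_watch = idf("watch", count_list)                                 # log(4/5)
```

Note that with the `1 +` in the denominator, a word that occurs in every document ("watch" here) receives a negative IDF, so its TF-IDF weight is negative as well.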

In [11]:
# Function for selecting the unique headlines and article bodies; returns the list of unique items in their original order
def get_uni_strs(data):
    uni_str = []
    for eachstr in data:
        if eachstr not in uni_str:
            uni_str.append(eachstr)  
    return uni_str
# here each item is a token list, so uni_str looks like: [['news', 'head1'], ['news', 'head2'], ..., ['body1', ...], ['body2', ...], ...]

In [12]:
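def tfidf_dict_to_vector(tfidf, inx_voc):
    #turn a {word: tf-idf} dict into a dense vector indexed by the vocabulary inx_voc
    vector = np.zeros(len(inx_voc))
    for w,t in tfidf.items():
        if w in inx_voc: #use the parameter rather than the global inx_tvoc; words outside the vocabulary are skipped
            inx_w = inx_voc[w]
            vector[inx_w] = t
    return vector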

In [13]:
#get the list of Headlines and bodies
Headline_list_tfidf = (Pred_Train_stance['Headline'].str.split()).values.tolist()
Body_list_tfidf = (Pred_Train_body['articleBody'].str.split()).values.tolist()

All_sentence_tfidf = Headline_list_tfidf + Body_list_tfidf  #plain list concatenation (np.concatenate can fail on ragged token lists in recent NumPy versions)
#the length of All_sentence_tfidf (headline+body) is 51655, the initial size of Train_stance and Train_body

In [14]:
len(All_sentence_tfidf)

51655

In [15]:
#get unique headlines and bodies from the training set
All_sentence_unique = get_uni_strs(All_sentence_tfidf)
#get counters of the unique headlines and unique bodies from the training set; Counter() counts the number of appearances of each word in a token list
#All_sentence_unique_count is the count list used as the document collection for the IDF
All_sentence_unique_count = [Counter(strs) for strs in All_sentence_unique]

In [16]:
#get counters of all headlines in the training set (49972)
Headline_list_count = [Counter(strs) for strs in Headline_list_tfidf]
#get counters of all bodies in the training set (1683)
Body_list_count = [Counter(strs) for strs in Body_list_tfidf]

In [17]:
#Compute the tfidf for each headline and body
Headline_tfidf=[{word: tfidf(word, count, All_sentence_unique_count) for word in count}  for count in Headline_list_count]
#23:30-23:33
#16:28-16:31

In [19]:
Body_tfidf=[{word: tfidf(word, count, All_sentence_unique_count) for word in count}  for count in Body_list_count]
#00:44-00:46
#16:32-16:34

In [20]:
#collect the vocabulary: every unique word in the unique headlines and bodies
All_uni_words = set(word for item in All_sentence_unique for word in item)
len(All_uni_words)

19984

In [21]:
#add index
inx_tvoc = dict(zip(All_uni_words,range(len(All_uni_words))))

In [22]:
Body_tfidf_vec = [tfidf_dict_to_vector(tfidfs,inx_tvoc) for tfidfs in Body_tfidf]
Headline_tfidf_vec = [tfidf_dict_to_vector(tfidfs,inx_tvoc) for tfidfs in Headline_tfidf]

In [23]:
tmpb= Pred_Train_body.copy()
tmph= Pred_Train_stance.copy()
tmpb['tfidf_body'] = Body_tfidf_vec
tmph['tfidf_head'] = Headline_tfidf_vec
 
#Merge table Train_stance and Train_body
tmpcomb = pd.merge(tmph, tmpb, how='inner', on="Body_ID",  copy=True)  
tmpcomb.shape

#get the 49972 body tfidf vector
Body_tfidf_vec=list(tmpcomb['tfidf_body'])
Headline_tfidf_vec = list(tmpcomb['tfidf_head'])

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

In [111]:
#Use the trained tfidf model to compute the vectors for the test set

#get the list of Headlines and bodies
Test_Headline_list_tfidf = (Pred_Test_stance['Headline'].str.split()).values.tolist()
Test_Body_list_tfidf = (Pred_Test_body['articleBody'].str.split()).values.tolist()

Test_All_sentence_tfidf = Test_Headline_list_tfidf + Test_Body_list_tfidf  #plain list concatenation (np.concatenate can fail on ragged token lists in recent NumPy versions)
#the length of Test_All_sentence_tfidf (headline+body) is 26317, the initial size of test_stance and test_body

In [112]:
len(Test_All_sentence_tfidf)

26317

In [115]:
#get counters of all headlines in the test set (25413)
Test_Headline_list_count = [Counter(strs) for strs in Test_Headline_list_tfidf]
#get counters of all bodies in the test set (904)
Test_Body_list_count = [Counter(strs) for strs in Test_Body_list_tfidf]

In [116]:
#Compute the tfidf for each headline and body in the test set
Test_Headline_tfidf=[{word: tfidf(word, count, All_sentence_unique_count) for word in count}  for count in Test_Headline_list_count]
#20:40-20:42

In [117]:
Test_Body_tfidf=[{word: tfidf(word, count, All_sentence_unique_count) for word in count}  for count in Test_Body_list_count]
#20:42-20:43

In [123]:
Test_Body_tfidf_vec = [tfidf_dict_to_vector(tfidfs,inx_tvoc) for tfidfs in Test_Body_tfidf]
Test_Headline_tfidf_vec = [tfidf_dict_to_vector(tfidfs,inx_tvoc) for tfidfs in Test_Headline_tfidf]

In [128]:
tmpb= Pred_Test_body.copy()
tmph= Pred_Test_stance.copy()
tmpb['tfidf_body'] = Test_Body_tfidf_vec
tmph['tfidf_head'] = Test_Headline_tfidf_vec
 
#Merge table Test_stance and Test_body
tmpcomb = pd.merge(tmph, tmpb, how='inner', on="Body_ID",  copy=True)  
tmpcomb.shape

#get the 25413 body tfidf vectors
Test_Body_tfidf_vec=list(tmpcomb['tfidf_body'])
Test_Headline_tfidf_vec = list(tmpcomb['tfidf_head'])

### Task 3. Establish language model based representations of the headlines and the article bodies in all the datasets and calculate the KL-divergence for each pair of headlines and article bodies. 
Feel free to explore different smoothing techniques for language model based representations
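
For reference, the smoothed unigram language model and the KL divergence used in the next cell can be written as follows. The notation is introduced here and is not part of the original task text: $c(w, d)$ is the count of word $w$ in document $d$, $|d|$ is the number of tokens in $d$, $V$ is the joint vocabulary of one headline/body pair, and $\alpha$ is the smoothing constant (0.1 in the code).

```latex
P(w \mid d) = \frac{c(w, d) + \alpha}{|d| + \alpha\,|V|},
\qquad
D_{\mathrm{KL}}\left(P_H \,\|\, P_B\right)
  = \sum_{w \in V} P(w \mid H)\,\log\frac{P(w \mid H)}{P(w \mid B)}
```

Here $H$ denotes the headline and $B$ the article body. Smoothing keeps every $P(w \mid B)$ strictly positive, so the logarithm is always defined.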

In [282]:
# Implement Language Model and Calculate KL Divergence
def language_model(headline, body):
    row_vocab_dict={}
    for x in headline:
        row_vocab_dict[x]=0
    for x in body:
        row_vocab_dict[x]=0
    #dictionary of all unique words in a pair of headline and body
    
    headline_dict=row_vocab_dict.copy()
    #dictionary for counting words in headline
    articleBody_dict=row_vocab_dict.copy()
    #dictionary for counting words in body
    
    for x in headline:
        headline_dict[x]+=1
    #count frequency for every word in headline
    for x in body:
        articleBody_dict[x]+=1
    #count frequency for every word in body
    
    #Additive (Lidstone) smoothing with alpha = 0.1; Laplace smoothing is the special case alpha = 1
    for key in headline_dict:
        headline_dict[key]=(headline_dict[key]+0.1)/(len(headline)+0.1*len(headline_dict))
        #smoothed probability of each unique word under the headline model
    for key in articleBody_dict:
        articleBody_dict[key]=(articleBody_dict[key]+0.1)/(len(body)+0.1*len(articleBody_dict))
        #smoothed probability of each unique word under the body model
        
    #calculate kl_divergence
    headline_list=[]
    articleBody_list=[]
    for key in row_vocab_dict:
        headline_list.append(headline_dict[key])
        articleBody_list.append(articleBody_dict[key])
        
    headline_vector=np.array(headline_list)
    articleBody_vector=np.array(articleBody_list)
    
    kl_divergence=np.sum(headline_vector*np.log(headline_vector/articleBody_vector))
    
    return kl_divergence
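
A compact, self-contained sketch of the same computation on toy token lists may help to see how the feature behaves. The function name `smoothed_kl`, the `alpha` parameter and the example tokens are assumptions made for illustration; the logic mirrors the function above but uses only the standard library.

```python
import math
from collections import Counter

def smoothed_kl(headline, body, alpha=0.1):
    """KL(headline || body) between additively smoothed unigram models."""
    vocab = set(headline) | set(body)          # joint vocabulary of the pair
    h_counts, b_counts = Counter(headline), Counter(body)
    h_total = len(headline) + alpha * len(vocab)
    b_total = len(body) + alpha * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (h_counts[w] + alpha) / h_total    # smoothed P(w | headline)
        q = (b_counts[w] + alpha) / b_total    # smoothed P(w | body)
        kl += p * math.log(p / q)
    return kl

headline = ["police", "find", "grave"]
related_body = ["police", "find", "mass", "grave", "near", "town"]
unrelated_body = ["apple", "watch", "gold", "price", "store", "launch"]

kl_same = smoothed_kl(headline, headline)
kl_related = smoothed_kl(headline, related_body)
kl_unrelated = smoothed_kl(headline, unrelated_body)
```

In this toy case the divergence is zero for identical texts and grows as the body shares fewer words with the headline, which is the intuition behind using it as a relatedness feature.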

In [283]:
LM_kld=[language_model(h,b) for h,b in zip(Headline_list,Body_list)]

In [287]:
Pred_Train_df['LM_kld']=LM_kld
Pred_Train_df.to_csv('Pred_Train_features.csv', encoding='utf-8', index=False)

In [285]:
"""Test_LM_kld=[language_model(h,b) for h,b in zip(Test_Headline_list,Test_Body_list)]"""

In [288]:
"""Pred_Test_df['LM_kld']=Test_LM_kld
Pred_Test_df.to_csv('Pred_Test_features.csv', encoding='utf-8', index=False)"""

### Task 4. Propose and implement alternative features/distances that might be helpful for the stance detection task. 
Describe feature meaning and extraction process.

#### Feature 1. Cosine Similarity

In [26]:
cos_sim_w2v = 1 - paired_cosine_distances(Headline_w2c_vec, Body_w2c_vec)
cos_sim_w2v

array([-0.24361682,  0.85103494, -0.17421353, ...,  0.7424047 ,
        0.80527228,  0.90500337], dtype=float32)

In [24]:
cos_sim_tfidf=1 - paired_cosine_distances(Headline_tfidf_vec,Body_tfidf_vec)
cos_sim_tfidf
#17:00-17:05

array([ -2.22044605e-16,   3.30128623e-01,  -2.22044605e-16, ...,
         5.90225677e-01,   3.25946275e-01,   4.26769007e-01])

In [27]:
Pred_Train_df['cos_sim_w2v']=cos_sim_w2v
Pred_Train_df['cos_sim_tfidf']=cos_sim_tfidf
Pred_Train_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_w2v,cos_sim_tfidf
0,police find mass graf least body near mexico t...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.243617,-2.220446e-16
1,seth rogen play apple steve wozniak,712,danny boyle directing untitled film seth rogen...,discuss,0.851035,0.3301286
2,mexico police find mass grave near site studen...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.174214,-2.220446e-16
3,mexico say missing student found first mass graf,712,danny boyle directing untitled film seth rogen...,unrelated,-0.041024,0.02126375
4,new io bug delete icloud document,712,danny boyle directing untitled film seth rogen...,unrelated,0.318279,0.0


In [159]:
# generate features for the test set
t_cos_sim_w2v = 1 - paired_cosine_distances(np.nan_to_num(Test_Headline_w2c_vec), Test_Body_w2c_vec)
t_cos_sim_w2v
t_cos_sim_tfidf = 1 - paired_cosine_distances(Test_Headline_tfidf_vec, Test_Body_tfidf_vec)
t_cos_sim_tfidf
#21:12

array([ 0.25401375, -0.19543733,  0.49107634, ...,  0.1641917 ,
       -0.01334122,  0.38916749])

In [160]:
Pred_Test_df['cos_sim_tfidf']=t_cos_sim_tfidf
Pred_Test_df['cos_sim_w2v']=t_cos_sim_w2v
Pred_Test_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_tfidf,cos_sim_w2v
0,ferguson riot pregnant woman loses eye cop fir...,2008,respected senior french police officer investi...,unrelated,2.220446e-16,0.254014
1,apple store install safe secure gold apple watch,2008,respected senior french police officer investi...,unrelated,0.0,-0.195437
2,pregnant woman loses eye police shoot bean bag,2008,respected senior french police officer investi...,unrelated,0.06354246,0.491076
3,found ferguson protester claim wa shot eye rub...,2008,respected senior french police officer investi...,unrelated,0.02141642,0.341024
4,police chief charge paris attack commits suicide,2008,respected senior french police officer investi...,discuss,0.1742968,0.628163


#### Feature 2. Word Overlap Features

In [34]:
def word_overlap_features(headlines, bodies):
    X = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        clean_headline = word_tokenize(headline)
        clean_body = word_tokenize(body)
        features = [
            len(set(clean_headline).intersection(clean_body)) / float(len(set(clean_headline).union(clean_body)))]
        X.append(features)
    return X
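
The feature computed above is the Jaccard index of the two token sets: the size of their intersection divided by the size of their union. A minimal sketch on toy strings follows; `jaccard_overlap` is a hypothetical name, and `str.split` is used instead of `word_tokenize` so the snippet needs no NLTK data.

```python
def jaccard_overlap(headline, body):
    """|intersection| / |union| of the two token sets (0.0 if both are empty)."""
    h_tokens, b_tokens = set(headline.split()), set(body.split())
    union = h_tokens | b_tokens
    if not union:
        return 0.0   # guard added in this sketch; without it an empty pair divides by zero
    return len(h_tokens & b_tokens) / len(union)

overlap_related = jaccard_overlap("seth rogen play wozniak", "seth rogen cast wozniak film")
overlap_unrelated = jaccard_overlap("police find grave", "apple watch gold")
```

Because article bodies are much longer than headlines, the union is dominated by body words, so even related pairs tend to have small values.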

In [35]:
overlap=word_overlap_features(Pred_Train_df['Headline'],Pred_Train_df['articleBody'])

49972it [00:59, 837.46it/s]


In [36]:
pred_overlap=[ovr[0] for ovr in overlap] #each row of overlap holds a single value

In [38]:
Pred_Train_df['overlap']=pred_overlap
Pred_Train_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_w2v,cos_sim_tfidf,overlap
0,police find mass graf least body near mexico t...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.243617,-2.220446e-16,0.0
1,seth rogen play apple steve wozniak,712,danny boyle directing untitled film seth rogen...,discuss,0.851035,0.3301286,0.065217
2,mexico police find mass grave near site studen...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.174214,-2.220446e-16,0.0
3,mexico say missing student found first mass graf,712,danny boyle directing untitled film seth rogen...,unrelated,-0.041024,0.02126375,0.020408
4,new io bug delete icloud document,712,danny boyle directing untitled film seth rogen...,unrelated,0.318279,0.0,0.0


In [39]:
Pred_Train_df.to_csv('Pred_Train_features.csv', encoding='utf-8', index=False)

In [171]:
"""t_overlap=word_overlap_features(Pred_Test_df['Headline'],Pred_Test_df['articleBody'])"""

25413it [00:29, 867.72it/s]


In [172]:
"""t_pred_overlap=[]
waste=[t_pred_overlap.extend(ovr) for ovr in t_overlap]"""

In [173]:
"""Pred_Test_df['overlap']=t_pred_overlap
Pred_Test_df.head()"""

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_tfidf,cos_sim_w2v,overlap
0,ferguson riot pregnant woman loses eye cop fir...,2008,respected senior french police officer investi...,unrelated,2.220446e-16,0.254014,0.0
1,apple store install safe secure gold apple watch,2008,respected senior french police officer investi...,unrelated,0.0,-0.195437,0.0
2,pregnant woman loses eye police shoot bean bag,2008,respected senior french police officer investi...,unrelated,0.06354246,0.491076,0.007407
3,found ferguson protester claim wa shot eye rub...,2008,respected senior french police officer investi...,unrelated,0.02141642,0.341024,0.021739
4,police chief charge paris attack commits suicide,2008,respected senior french police officer investi...,discuss,0.1742968,0.628163,0.022727


In [174]:
"""Pred_Test_df.to_csv('Pred_Test_features.csv', encoding='utf-8', index=False)"""

#### Feature 3. Word Polarity Feature

In [46]:
def polarity_features(headlines, bodies):
    _refuting_words = [
        'fake',
        'fraud',
        'hoax',
        'false',
        'deny', 'denies',
        'not',
        'despite',
        'nope',
        'doubt', 'doubts',
        'bogus',
        'debunk',
        'pranks',
        'retract'
    ]

    def calculate_polarity(text):
        tokens = word_tokenize(text)
        return sum([t in _refuting_words for t in tokens]) % 2
    X = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        features = []
        features.append(calculate_polarity(headline))
        features.append(calculate_polarity(body))
        X.append(features)
    return np.array(X)
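
The polarity value is the parity of the number of refuting words in a text: an odd count gives 1, an even count (including zero) gives 0. The sketch below illustrates this on toy strings; `polarity` is a hypothetical stand-alone version of `calculate_polarity` that uses `str.split` instead of `word_tokenize`.

```python
REFUTING_WORDS = {"fake", "fraud", "hoax", "false", "deny", "denies", "not",
                  "despite", "nope", "doubt", "doubts", "bogus", "debunk",
                  "pranks", "retract"}

def polarity(text):
    """1 if the text holds an odd number of refuting words, else 0."""
    return sum(tok in REFUTING_WORDS for tok in text.split()) % 2

p_plain = polarity("police find mass grave")    # no refuting word
p_single = polarity("police deny report")       # one refuting word
p_double = polarity("fake report hoax claim")   # two refuting words cancel out
```

The parity rule treats a double negation as neutral, which is a rough heuristic rather than a real model of negation.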

In [47]:
polar=polarity_features(Pred_Train_df['Headline'],Pred_Train_df['articleBody'])

49972it [00:58, 848.95it/s]


In [54]:
Pred_Train_df['polar_h']=polar[:,0]
Pred_Train_df['polar_b']=polar[:,1]

In [55]:
Pred_Train_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_w2v,cos_sim_tfidf,overlap,polar_h,polar_b
0,police find mass graf least body near mexico t...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.243617,-2.220446e-16,0.0,0,0
1,seth rogen play apple steve wozniak,712,danny boyle directing untitled film seth rogen...,discuss,0.851035,0.3301286,0.065217,0,0
2,mexico police find mass grave near site studen...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.174214,-2.220446e-16,0.0,0,0
3,mexico say missing student found first mass graf,712,danny boyle directing untitled film seth rogen...,unrelated,-0.041024,0.02126375,0.020408,0,0
4,new io bug delete icloud document,712,danny boyle directing untitled film seth rogen...,unrelated,0.318279,0.0,0.0,0,0


In [56]:
Pred_Train_df.to_csv('Pred_Train_features.csv', encoding='utf-8', index=False)

In [183]:
"""t_polar=polarity_features(Pred_Test_df['Headline'],Pred_Test_df['articleBody'])"""

25413it [00:30, 846.93it/s]


In [184]:
"""Pred_Test_df['polar_h']=t_polar[:,0]
Pred_Test_df['polar_b']=t_polar[:,1]"""

In [185]:
"""Pred_Test_df.to_csv('Pred_Test_features.csv', encoding='utf-8', index=False)"""

#### Feature 4. Refuting Feature

In [63]:
def refuting_features(headlines, bodies):
    _refuting_words = [
        'fake',
        'fraud',
        'hoax',
        'false',
        'deny', 'denies',
        # 'refute',
        'not',
        'despite',
        'nope',
        'doubt', 'doubts',
        'bogus',
        'debunk',
        'pranks',
        'retract'
    ]
    X = []
    for i, (headline, body) in tqdm(enumerate(zip(headlines, bodies))):
        clean_headline = word_tokenize(headline)
        features = [1 if word in clean_headline else 0 for word in _refuting_words]
        X.append(features)
    return X

In [64]:
#Currently don't know how to use this feature
refute=refuting_features(Pred_Train_df['Headline'],Pred_Train_df['articleBody'])

49972it [00:06, 7345.60it/s]
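
One possible way to use these indicator vectors, offered only as a suggestion and not something this notebook does, is to expose each indicator as its own named feature and to add their sum as a single count. The function `refuting_row` and the `refute_` prefix are hypothetical names.

```python
REFUTING_WORDS = ["fake", "fraud", "hoax", "false", "deny", "denies", "not",
                  "despite", "nope", "doubt", "doubts", "bogus", "debunk",
                  "pranks", "retract"]

def refuting_row(headline):
    """Named 0/1 indicators plus their sum for one headline."""
    tokens = set(headline.split())
    row = {"refute_" + w: int(w in tokens) for w in REFUTING_WORDS}
    row["refute_count"] = sum(row.values())   # the sum is taken before the new key is added
    return row

row = refuting_row("hoax claim debunk doubt")
```

A list of such dictionaries can be turned into columns with `pd.DataFrame(rows)` and joined to the other features.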


#### Feature 5. Pearson Correlation Coefficient

In [66]:
# store the pearson correlation coefficient in this array
def PearsonCorr(Head,Body):
    corr=[]
    for x, y in zip(Head, Body):
        X=np.vstack([x,y])
        d1=np.corrcoef(X)[0][1]
        corr.append(d1)
    return corr

"""
# use equation to calculate Pearson Correlation Coefficients 

def PearsonCorr(Head,Body):
    corr=[]
    for x, y in zip(Head,Body):
        x_=x-np.mean(x)
        y_=y-np.mean(y)
        d2=np.dot(x_,y_)/(np.linalg.norm(x_)*np.linalg.norm(y_))
        corr.append(d2)
    return corr

"""

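
As the commented-out variant indicates, the Pearson correlation coefficient equals the cosine similarity of the two vectors after each has been mean-centred. The sketch below checks this relationship on toy vectors; the helper names `cosine` and `pearson` are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    """Pearson correlation as the cosine similarity of the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 9.0]
shifted = [a + 10.0 for a in x]   # same shape, different offset

r_xy = pearson(x, y)
r_shift = pearson(x, shifted)     # centring removes the offset
cos_shift = cosine(x, shifted)    # cosine still sees the offset
```

When the components of a vector average close to zero, centring changes little, so the Pearson and cosine features are expected to be strongly correlated with each other.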

In [67]:
P_Cor_Coe_w2c = PearsonCorr(Headline_w2c_vec, Body_w2c_vec)
P_Cor_Coe_tfidf = PearsonCorr(Headline_tfidf_vec, Body_tfidf_vec)

In [68]:
Pred_Train_df['P_Cor_Coe_w2c']=P_Cor_Coe_w2c
Pred_Train_df['P_Cor_Coe_tfidf']=P_Cor_Coe_tfidf
Pred_Train_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_w2v,cos_sim_tfidf,overlap,polar_h,polar_b,P_Cor_Coe_w2c,P_Cor_Coe_tfidf
0,police find mass graf least body near mexico t...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.243617,-2.220446e-16,0.0,0,0,-0.234044,-0.001445
1,seth rogen play apple steve wozniak,712,danny boyle directing untitled film seth rogen...,discuss,0.851035,0.3301286,0.065217,0,0,0.849136,0.329755
2,mexico police find mass grave near site studen...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.174214,-2.220446e-16,0.0,0,0,-0.162194,-0.001232
3,mexico say missing student found first mass graf,712,danny boyle directing untitled film seth rogen...,unrelated,-0.041024,0.02126375,0.020408,0,0,-0.036997,0.02016
4,new io bug delete icloud document,712,danny boyle directing untitled film seth rogen...,unrelated,0.318279,0.0,0.0,0,0,0.316359,-0.001007


In [69]:
Pred_Train_df.to_csv('Pred_Train_features.csv', encoding='utf-8', index=False)

In [186]:
"""Test_P_Cor_Coe_w2c = PearsonCorr(Test_Headline_w2c_vec, Test_Body_w2c_vec)
Test_P_Cor_Coe_tfidf = PearsonCorr(Test_Headline_tfidf_vec, Test_Body_tfidf_vec)"""


In [196]:
"""Pred_Test_df['P_Cor_Coe_w2c']=np.nan_to_num(Test_P_Cor_Coe_w2c)
Pred_Test_df['P_Cor_Coe_tfidf']=np.nan_to_num(Test_P_Cor_Coe_tfidf)
Pred_Test_df.head()"""

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_tfidf,cos_sim_w2v,overlap,polar_h,polar_b,P_Cor_Coe_w2c,P_Cor_Coe_tfidf
0,ferguson riot pregnant woman loses eye cop fir...,2008,respected senior french police officer investi...,unrelated,2.220446e-16,0.254014,0.0,0,0,0.254075,-0.001565
1,apple store install safe secure gold apple watch,2008,respected senior french police officer investi...,unrelated,0.0,-0.195437,0.0,0,0,-0.195883,-0.001134
2,pregnant woman loses eye police shoot bean bag,2008,respected senior french police officer investi...,unrelated,0.06354246,0.491076,0.007407,0,0,0.49141,0.062481
3,found ferguson protester claim wa shot eye rub...,2008,respected senior french police officer investi...,unrelated,0.02141642,0.341024,0.021739,1,0,0.34187,0.020011
4,police chief charge paris attack commits suicide,2008,respected senior french police officer investi...,discuss,0.1742968,0.628163,0.022727,0,0,0.628161,0.173617


In [197]:
"""Pred_Test_df.to_csv('Pred_Test_features.csv', encoding='utf-8', index=False)"""

#### Feature 6. Euclidean Distance

In [70]:
def Euclidean_distance(Head, Body):
    Euc_dis=[]
    for x, y in zip(Head, Body):
        d1=np.sqrt(np.sum(np.square(x-y)))
        Euc_dis.append(d1)
    return Euc_dis
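
For vectors of unit length, such as the L2-normalised Word2Vec sentence vectors built earlier, the Euclidean distance is a fixed function of the cosine similarity: the squared distance equals 2 minus twice the cosine. For example, the first training row has a Word2Vec cosine similarity of -0.243617, and sqrt(2 + 2 * 0.243617) is approximately 1.5771, matching its Euclidean distance of 1.577097. The sketch below checks the identity on two toy unit vectors; the helper names are illustrative.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Two unit-length vectors
u = [0.6, 0.8]
v = [1.0, 0.0]

dist = euclidean(u, v)
from_cosine = math.sqrt(2 - 2 * cosine(u, v))   # identity for unit vectors
```

This suggests the Word2Vec Euclidean feature adds little beyond the Word2Vec cosine feature, whereas the TF-IDF vectors are not normalised, so for them the two features can differ.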

In [71]:
Euc_distance_w2c = Euclidean_distance(Headline_w2c_vec, Body_w2c_vec)
Euc_distance_tfidf = Euclidean_distance(Headline_tfidf_vec, Body_tfidf_vec)

In [72]:
Pred_Train_df['Euc_distance_w2c']=Euc_distance_w2c
Pred_Train_df['Euc_distance_tfidf']=Euc_distance_tfidf
Pred_Train_df.head()

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_w2v,cos_sim_tfidf,overlap,polar_h,polar_b,P_Cor_Coe_w2c,P_Cor_Coe_tfidf,Euc_distance_w2c,Euc_distance_tfidf
0,police find mass graf least body near mexico t...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.243617,-2.220446e-16,0.0,0,0,-0.234044,-0.001445,1.577097,1.117929
1,seth rogen play apple steve wozniak,712,danny boyle directing untitled film seth rogen...,discuss,0.851035,0.3301286,0.065217,0,0,0.849136,0.329755,0.54583,1.447857
2,mexico police find mass grave near site studen...,712,danny boyle directing untitled film seth rogen...,unrelated,-0.174214,-2.220446e-16,0.0,0,0,-0.162194,-0.001232,1.532458,1.333335
3,mexico say missing student found first mass graf,712,danny boyle directing untitled film seth rogen...,unrelated,-0.041024,0.02126375,0.020408,0,0,-0.036997,0.02016,1.442931,1.244375
4,new io bug delete icloud document,712,danny boyle directing untitled film seth rogen...,unrelated,0.318279,0.0,0.0,0,0,0.316359,-0.001007,1.167665,1.941065


In [73]:
Pred_Train_df.to_csv('Pred_Train_features.csv', encoding='utf-8', index=False)

In [198]:
"""Test_Euc_distance_w2c = Euclidean_distance(Test_Headline_w2c_vec, Test_Body_w2c_vec)
Test_Euc_distance_tfidf = Euclidean_distance(Test_Headline_tfidf_vec, Test_Body_tfidf_vec)"""

In [199]:
"""Pred_Test_df['Euc_distance_w2c']=Test_Euc_distance_w2c
Pred_Test_df['Euc_distance_tfidf']=Test_Euc_distance_tfidf
Pred_Test_df.head()"""

Unnamed: 0,Headline,Body_ID,articleBody,Stance,cos_sim_tfidf,cos_sim_w2v,overlap,polar_h,polar_b,P_Cor_Coe_w2c,P_Cor_Coe_tfidf,Euc_distance_w2c,Euc_distance_tfidf
0,ferguson riot pregnant woman loses eye cop fir...,2008,respected senior french police officer investi...,unrelated,2.220446e-16,0.254014,0.0,0,0,0.254075,-0.001565,1.221463,1.370576
1,apple store install safe secure gold apple watch,2008,respected senior french police officer investi...,unrelated,0.0,-0.195437,0.0,0,0,-0.195883,-0.001134,1.546245,1.482062
2,pregnant woman loses eye police shoot bean bag,2008,respected senior french police officer investi...,unrelated,0.06354246,0.491076,0.007407,0,0,0.49141,0.062481,1.008884,1.686966
3,found ferguson protester claim wa shot eye rub...,2008,respected senior french police officer investi...,unrelated,0.02141642,0.341024,0.021739,1,0,0.34187,0.020011,1.148021,1.011718
4,police chief charge paris attack commits suicide,2008,respected senior french police officer investi...,discuss,0.1742968,0.628163,0.022727,0,0,0.628161,0.173617,0.862365,1.309439


In [200]:
"""Pred_Test_df.to_csv('Pred_Test_features.csv', encoding='utf-8', index=False)"""

#### Feature 6. KL Divergence

KL divergence is only defined for non-negative inputs, and Word2Vec vectors can contain negative components, so only the TF-IDF vectors are used to calculate the KL divergence in this part.
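For reference, the first expression below is the standard KL divergence between two discrete distributions P and Q over the same vocabulary. The second is the smoothed sum that the `KL` function in the next cell evaluates, where ε stands for `np.spacing(1)` and dimensions in which both entries are zero are skipped. The function does not itself renormalize its inputs to sum to 1.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} p_i \log \frac{p_i}{q_i}
\qquad
\mathrm{KL}(P, Q) = \sum_{i \,:\, (P_i, Q_i) \neq (0, 0)} (P_i + \varepsilon) \log \frac{P_i + \varepsilon}{Q_i + \varepsilon}
```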

In [None]:
def KL(P, Q):
    """KL divergence sum(p * log(p / q)) over the dimensions used by P or Q."""
    P = np.array(P, dtype=float)
    Q = np.array(Q, dtype=float)
    eps = np.spacing(1)  # machine epsilon, avoids log(0) and division by zero
    # Skip dimensions where both vectors are zero
    keep = ~((P == 0) & (Q == 0))
    p = P[keep] + eps
    q = Q[keep] + eps
    return np.sum(p * np.log(p / q))

In [84]:
# Asymmetric: KL(headline || body)
kl_dis_HB= [KL(h,b) for h, b in zip(Headline_tfidf_vec,Body_tfidf_vec)]
#19:04-19:08

In [87]:
# Asymmetric: KL(body || headline)
kl_dis_BH=[KL(b,h) for h, b in zip(Headline_tfidf_vec,Body_tfidf_vec)]
#19:10-19:15

In [88]:
# Symmetric: average of the two asymmetric directions
kl_dis = [(kl_dis_HB[i]+kl_dis_BH[i])/2 for i in range(len(kl_dis_HB))]  
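To make the asymmetry concrete, here is a minimal sketch on toy vectors. It differs from the notebook's `KL` in one labeled assumption: each vector is smoothed with a small `eps` and then normalized to sum to 1, so that it is a proper probability distribution. The function names, the `eps` value and the example vectors are all hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) after additive smoothing and normalization to sum to 1."""
    p = np.asarray(p, dtype=float) + eps  # assumption: additive smoothing
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q, eps=1e-12):
    """Average of the two directed divergences."""
    return 0.5 * (kl_divergence(p, q, eps) + kl_divergence(q, p, eps))

# Toy vectors (illustrative only, not taken from the dataset)
head_vec = [0.7, 0.3, 0.0]
body_vec = [0.2, 0.5, 0.3]

kl_hb = kl_divergence(head_vec, body_vec)
kl_bh = kl_divergence(body_vec, head_vec)
kl_sym = symmetric_kl(head_vec, body_vec)
print(kl_hb, kl_bh, kl_sym)
```

The two directions differ sharply here because the body puts mass on a dimension where the headline has none, and the smoothing constant largely determines the size of that term. This is one reason the choice of smoothing matters for short headlines.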

In [289]:
# Note: this stores the asymmetric KL(headline || body), not the symmetric kl_dis computed above
Pred_Train_df['kl_dis'] = kl_dis_HB
Pred_Train_df.to_csv('Pred_Train_features.csv', encoding='utf-8', index=False)

In [201]:
"""t_kl_dis_HB= [KL(h,b) for h, b in zip(Test_Headline_tfidf_vec,Test_Body_tfidf_vec)]"""

In [202]:
"""t_kl_dis_BH=[KL(b,h) for h, b in zip(Test_Headline_tfidf_vec,Test_Body_tfidf_vec)]"""

In [203]:
"""t_kl_dis = [(t_kl_dis_HB[i]+t_kl_dis_BH[i])/2 for i in range(len(t_kl_dis_HB))]  """

In [290]:
"""Pred_Test_df['kl_dis']=t_kl_dis_HB
Pred_Test_df.to_csv('Pred_Test_features.csv', encoding='utf-8', index=False)"""