# Feature Engineering
First, we list the features generated by the successful TALOS team.

1. **Basic Count Features**: turns unigrams, bi-grams, and tri-grams into various counts and ratios which mark the *relationship between a headline and its body text*. Features generated include:
    - Number of unique grams in the headline & body
    - Number of times a gram appears in the headline & body
    - Ratio of gram appearances over number of unique grams (for headline and body)
    - How many grams in the headline also appear in the body text (overlapping grams)
    - Number of overlapping grams normalized by the number of grams in the headline

2. **TF-IDF Features**: constructs sparse vector representations of the headline and body by calculating the Term Frequency score (TF) of each gram, and weighting it by its Inverse-Document Frequency score (IDF).
    - Calculates the cosine similarity between the headline and body tfidf vector

3. **SVD Features**: this model applies Singular-Value Decomposition (SVD) to the sparse matrices resulting from the TF-IDF analysis, and obtains a more compact vector representation of the headline and body. This is a well-known procedure used for Topic Modelling. As such, its output is used to understand which topics in the corpus represent the headline or body. Features include:
    - Latent topics from the corpus which represent headline/body text
    - The cosine similarities between the SVD features of headline and body text

4. **Word2Vec Features**: Using pre-trained word vectors from public sources, *vector representations of the headline and body* are built. These word vectors were trained on a Google News corpus. *Vector representation features can capture similarity even when the headline and body use synonyms instead of the exact same words*.

5. **Sentiment Features**: Used the Sentiment Analyzer in the NLTK package to assign a sentiment polarity to the headline and body separately. Features include:
    - Returned negative or positive scores on sentiment for the headline and body
    
Other features we can add:
6. **Bag of Words Features**:

### Import Libraries

In [1]:
#standard imports
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

#nlp imports
import re

### Import Data

In [2]:
#Locate Vedant's pickle ;)
#os.listdir('../playground/vedant/pre_processing/')
df = pd.read_pickle('../playground/vedant/pre_processing/claims.pkl')
df.head()

Unnamed: 0_level_0,claim,claimant,date,label,related_articles
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,line george orwell novel 1984 predict power sm...,,2017-07-17,0,"[122094, 122580, 130685, 134765]"
1,maine legislature candidate leslie gibson insu...,,2018-03-17,2,"[106868, 127320, 128060]"
4,17 year old girl name alyssa carson train nasa...,,2018-07-18,1,"[132130, 132132, 149722]"
5,1988 author roald dahl pen open letter urge pa...,,2019-02-04,2,"[123254, 123418, 127464]"
6,come fight terrorism another thing know work b...,Hillary Clinton,2016-03-22,2,"[41099, 89899, 72543, 82644, 95344, 88361]"


## Basic Count Features
### ngram Count
First we will attempt to create n-gram counts for {1,2,3}-grams.

In [3]:
test_claim = df['claim'].values[11]
print(test_claim)

socialist teacher south charlotte middle school put message fuck kavanaugh school sign


In [4]:
#generate a list of ngrams
def generate_ngrams(string, n):
    #break text in tokens, not counting empty tokens
    tokens = [token for token in string.split(" ") if token != ""]
    
    #use the zip function to generate the desired n_gram list
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

In [5]:
generate_ngrams(test_claim,2)

['socialist teacher',
 'teacher south',
 'south charlotte',
 'charlotte middle',
 'middle school',
 'school put',
 'put message',
 'message fuck',
 'fuck kavanaugh',
 'kavanaugh school',
 'school sign']
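
To see why the `zip` call inside `generate_ngrams` works, the sketch below builds the shifted token lists by hand. The token list is a made-up example rather than a claim from the dataset; the logic is intended to mirror the function above.

```python
# Toy tokens (an assumption for illustration, not taken from the dataset)
tokens = ["a", "b", "c", "d"]
n = 2

# One copy of the token list per position in the n-gram, each shifted by i
shifted = [tokens[i:] for i in range(n)]
# shifted is [['a', 'b', 'c', 'd'], ['b', 'c', 'd']]

# zip stops at the shortest list, so incomplete n-grams at the end are dropped
bigrams = [" ".join(gram) for gram in zip(*shifted)]
print(bigrams)  # ['a b', 'b c', 'c d']
```

A text with fewer than `n` tokens therefore yields an empty list, which is why `try_divide` guards against division by zero further down.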

We could also use the nltk ngrams function

In [6]:
from nltk.util import ngrams

In [7]:
tokens = [token for token in test_claim.split(" ") if token != ""]
list(ngrams(tokens,2))

[('socialist', 'teacher'),
 ('teacher', 'south'),
 ('south', 'charlotte'),
 ('charlotte', 'middle'),
 ('middle', 'school'),
 ('school', 'put'),
 ('put', 'message'),
 ('message', 'fuck'),
 ('fuck', 'kavanaugh'),
 ('kavanaugh', 'school'),
 ('school', 'sign')]

In [8]:
def add_feature_ngram_count(df,col,n):
    '''
    Adds the ngram count feature for a desired column in a pandas dataframe.
    '''
    if not isinstance(col,str):
        raise ValueError('col must be of type str')
    
    df['count_%sgram_%s' %(n,col)] = df[col].apply(lambda x: len(generate_ngrams(x,n)))

Add ngram counts to the desired column

In [9]:
add_feature_ngram_count(df,'claim',1)

In [10]:
df.head()

Unnamed: 0_level_0,claim,claimant,date,label,related_articles,count_1gram_claim
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,line george orwell novel 1984 predict power sm...,,2017-07-17,0,"[122094, 122580, 130685, 134765]",8
1,maine legislature candidate leslie gibson insu...,,2018-03-17,2,"[106868, 127320, 128060]",14
4,17 year old girl name alyssa carson train nasa...,,2018-07-18,1,"[132130, 132132, 149722]",11
5,1988 author roald dahl pen open letter urge pa...,,2019-02-04,2,"[123254, 123418, 127464]",12
6,come fight terrorism another thing know work b...,Hillary Clinton,2016-03-22,2,"[41099, 89899, 72543, 82644, 95344, 88361]",12


### Unique ngram Count & Ratio of Unique ngram
Now, piggyback off this function to add a count of **unique ngrams** and the **ratio of unique ngrams to total ngrams**.

In [11]:
def add_feature_unique_ngram_count(df,col,n):
    '''
    Adds the count of unique ngrams for a desired column in a pandas dataframe.
    '''
    if not isinstance(col,str):
        raise ValueError('col must be of type str')
    
    df['count_unique_%sgram_%s' %(n,col)] = df.apply(lambda x: len(set(generate_ngrams(x[col],n))),axis=1)

In [12]:
def try_divide(a,b):
    try:
        return a/b
    except ZeroDivisionError:
        return 0

In [13]:
def add_feature_unique_ngram_ratio(df,col,n):
    '''
    Adds a column of ratios which indicate the proportion of the feature's text which is unique.
    '''
    if not isinstance(col,str):
        raise ValueError('col must be of type str')
    
    df['ratio_unique_%sgram_%s' %(n,col)] = \
    list(map(try_divide, df['count_unique_%sgram_%s' %(n,col)], df['count_%sgram_%s' %(n,col)]))

In [14]:
add_feature_unique_ngram_count(df,'claim',1)
add_feature_unique_ngram_ratio(df,'claim',1)

In [15]:
print("Example of a headline with a ratio less than 1.0")
print(df.loc[df.ratio_unique_1gram_claim != 1.0].sample(1).claim.values)

Example of a headline with a ratio less than 1.0
['state federal government help run health care marketplace average american 50 different plan choose different level coverage']


### Overlapping ngram
Now we will attempt to make a count of **overlapping ngrams**. As we only have one piece of text per sample (the 
claim), we will use a custom sentence to make sure this function works.

In [16]:
test_claim

'socialist teacher south charlotte middle school put message fuck kavanaugh school sign'

In [17]:
artificial_headline = 'teacher in south charlotte school'
#this should result in an overlap count of 5 (because school appears twice)

In [18]:
#Slowly work our way towards the end function for sanity's sake
#here df.iloc[11]["claim"] is the standin for an x in a lambda function
[s for s in generate_ngrams(df.iloc[11]["claim"],1)]

['socialist',
 'teacher',
 'south',
 'charlotte',
 'middle',
 'school',
 'put',
 'message',
 'fuck',
 'kavanaugh',
 'school',
 'sign']

In [19]:
#add 1.0 if an ngram is in the set of ngrams of the artificial headline
[1.0 for s in generate_ngrams(df.iloc[11]["claim"],1) if s in set(generate_ngrams(artificial_headline,1))] 

[1.0, 1.0, 1.0, 1.0, 1.0]

In [20]:
print("Overlapping ngrams between artificial_headline and test_claim")
sum([1.0 for s in generate_ngrams(df.iloc[11]["claim"],1) if s in set(generate_ngrams(artificial_headline,1))] )

Overlapping ngrams between artificial_headline and test_claim


5.0

In [21]:
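def add_feature_overlap_ngrams(df,col1,col2,n):
    '''
    Adds the count of ngrams in col1 which also appear in col2.
    Repeated ngrams in col1 are counted every time they appear.
    '''
    if not isinstance(col1,str) or not isinstance(col2,str):
        raise ValueError('col1 and col2 must be of type str')
    
    def count_overlap(row):
        #build the set once per row rather than once per ngram
        grams2 = set(generate_ngrams(row[col2],n))
        return sum(1.0 for s in generate_ngrams(row[col1],n) if s in grams2)
    
    df['overlap_%sgram_%s_%s' %(n,col1,col2)] = list(df.apply(count_overlap,axis=1))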

To use this effectively, we will need the dataframe to have some comparison text. Let's make a column which has the first four words from each claim. This way, the overlap_ngram feature is expected to return a uniform number across every claim (not accounting for repeated words).

In [22]:
#Join the first four words together using the below operations
" ".join(df.iloc[11].claim.split(" ")[:4])

'socialist teacher south charlotte'

In [23]:
df['pseudo_headline'] = df.apply(lambda x: " ".join(x["claim"].split(" ")[:4]),axis=1) #need to set axis=1 to be able to treat x as a series

In [24]:
add_feature_overlap_ngrams(df,"claim","pseudo_headline",1)
add_feature_overlap_ngrams(df,"claim","pseudo_headline",2)
add_feature_overlap_ngrams(df,"claim","pseudo_headline",3)

Mission accomplished!

In [25]:
df.head(2)

Unnamed: 0_level_0,claim,claimant,date,label,related_articles,count_1gram_claim,count_unique_1gram_claim,ratio_unique_1gram_claim,pseudo_headline,overlap_1gram_claim_pseudo_headline,overlap_2gram_claim_pseudo_headline,overlap_3gram_claim_pseudo_headline
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,line george orwell novel 1984 predict power sm...,,2017-07-17,0,"[122094, 122580, 130685, 134765]",8,8,1.0,line george orwell novel,4.0,3.0,2.0
1,maine legislature candidate leslie gibson insu...,,2018-03-17,2,"[106868, 127320, 128060]",14,14,1.0,maine legislature candidate leslie,4.0,3.0,2.0
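
The feature list at the top also mentions the overlap count *normalized by the number of grams in the headline*, which is not implemented above. A minimal sketch of that ratio follows. The names `ngrams_of` and `overlap_ratio` are hypothetical, the example strings are made up, and returning `0.0` for an empty headline is an assumption rather than something taken from the TALOS code.

```python
def ngrams_of(text, n):
    # same whitespace tokenization as generate_ngrams above
    tokens = [t for t in text.split(" ") if t != ""]
    return [" ".join(g) for g in zip(*[tokens[i:] for i in range(n)])]

def overlap_ratio(body, headline, n):
    """Overlap count divided by the number of n-grams in the headline."""
    headline_grams = ngrams_of(headline, n)
    if not headline_grams:
        return 0.0  # assumption: avoid dividing by zero for very short headlines
    headline_set = set(headline_grams)
    overlap = sum(1.0 for g in ngrams_of(body, n) if g in headline_set)
    return overlap / len(headline_grams)

body = "socialist teacher south charlotte middle school put message school sign"
headline = "teacher in south charlotte school"
print(overlap_ratio(body, headline, 1))  # 5 overlapping unigrams / 5 headline unigrams = 1.0
print(overlap_ratio(body, headline, 2))  # 1 overlapping bigram / 4 headline bigrams = 0.25
```

Because repeated grams in the body are counted each time, this ratio can exceed 1.0 when the body repeats headline words often.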


Number of sentences in the claim

In [26]:
from nltk.tokenize import sent_tokenize
def add_feature_sentence_count(df,col):
    '''
    Adds the sentence count feature for a desired column in a pandas dataframe.
    '''
    if not isinstance(col,str):
        raise ValueError('col must be of type str')
    
    df['num_sents_%s' %(col)] = df[col].apply(lambda x: len(sent_tokenize(x)))

In [27]:
add_feature_sentence_count(df,'claim')

Create a list of columns which contain the features we just generated

In [28]:
feat_names = [ n for n in df.columns \
                if "count" in n \
                or "ratio" in n \
                or "num_sent" in n]

In [29]:
feat_names

['count_1gram_claim',
 'count_unique_1gram_claim',
 'ratio_unique_1gram_claim',
 'num_sents_claim']

### Other Basic Count Features to Explore
- Count of target words (i.e. if we wish to make a count of known negative, positive, or topical words). Inspired from TALOS's (discarded) 'refute_words' list, which counted words which would indicate a refutation mid-sentence.
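
As a rough illustration of the target-word idea, the sketch below counts how many tokens of a text fall in a given word set. The word list `REFUTING_WORDS` is invented for this example and is not the TALOS 'refute_words' list; `count_target_words` is likewise a hypothetical helper.

```python
# Hypothetical word list for illustration only; not the TALOS list
REFUTING_WORDS = {"fake", "hoax", "false", "deny", "debunk"}

def count_target_words(text, target_words):
    """Count how many tokens of `text` appear in `target_words`."""
    tokens = [t for t in text.lower().split(" ") if t != ""]
    return sum(1 for t in tokens if t in target_words)

example = "viral photo elephant carry lion cub fake hoax"
print(count_target_words(example, REFUTING_WORDS))  # 2
```

On the dataframe this could be applied with something like `df['claim'].apply(lambda x: count_target_words(x, REFUTING_WORDS))`, following the pattern of the other `add_feature_*` functions.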

## TF-IDF Features
The primary feature generated through TF-IDF processes is the metric of **cosine similarity** between headline TF-IDF features and body TF-IDF features. Thus the features we create will be `headlineTFIDF`,`bodyTFIDF`, and `similarityTFIDF`

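Until our dataframe has the articles associated with each claim, we will use the following test corpus. Note that two entries (the Gucci and Russia claims) are not followed by a comma, so Python joins each of them with the string on the next line and the list holds 20 documents rather than 22.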

In [30]:
corpus = [
    'if you\'re from Syria and you\'re a Christian, you cannot come into this country as a refugee says Trump',
    'video shows federal troops in armored vehicles patrolling the streets of Chicago on President Trump\'s orders.',
    'Donald Trump wrote in \"The Art of the Deal\" that he punched out his second grade music teacher.',
    'Actor Denzel Washington said electing President Trump saved the US from becoming an \"Orwellian police state.\"',
    "President Trump fired longtime White House butler Cecil Gaines for disobedience.",
    'Congress has approved the creation of a taxpayer-funded network called \"Trump TV.\"',
    'The Islamic State "just built a hotel in Syria", according to President Donald Trump',
    "A Gucci ensemble worn by Trump counselor Kellyanne Conway to the inauguration closely resembled a 1970s 'Simplicity' pattern.,"
    'Says Melania Trump hired exorcist to \"cleanse White House of Obama demons.\"',
    '"\"Russia, Iran, Syria & many others are not happy\" about US troops leaving Syria according to the US President.'
    "In January 2019, President Donald Trump ordered FEMA to stop or cancel funding for its disaster assistance efforts in California.",
    "Trump looking to open up E Coast & new areas for offshore oil drilling when Congress has passed no new safety standards since BP",
    '"You were here long before any of us were here, although we have a representative in Congress who, they say, was here a long time ago. They call her Pocahontas.", said Trump',
    "A photograph shows an elephant carrying a lion cub.",
    "Elephant carrying thirsty lion cub",
    "Nike makes their sneakers with elephant skins.",
    "A photograph shows a jumping baby elephant.",
    "Is this a video of an elephant trampling a man to death in India?",
    "The lion used for the original MGM logo killed its trainer and his assistants.",
    "A friend and I are arguing about the origin of the photo of Donald Trump, Ivanka, and Barron with Barron sitting on the stuffed lion. She swears it's a photo-shopped, fake photo. Do you have any idea who took it or published it first?",
    "A photograph shows a real baby platypus.",
    "Photograph shows a drop bear cub being fed human blood."
]

In [31]:
from math import log
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

In [32]:
def basic_preprocess_corpus(corpus,stop_words):
    '''
    Returns a simply cleaned corpus with basic lemmatization, lower casing, and stopword removal.
    
    Parameters
    ----------
    corpus:
        An uncleaned corpus of shape [num_documents,]. Each element is the raw text of one document.
    stop_words:
        A collection of stopwords (e.g. the list returned by nltk.corpus.stopwords.words) to be removed in preprocessing.

    Returns
    -------
    cleaned_corpus:
        A list of length num_documents. Each element is the list of tokens for that document
        after lower casing, stopword removal, and lemmatization.
    '''
    lem = WordNetLemmatizer() #create the lemmatizer once rather than once per document
    stop_words = set(stop_words) #set lookups are faster than list lookups
    cleaned_corpus = list()
    for text in corpus:
        # remove symbols and numbers
        document = re.sub('[^a-zA-Z]', ' ', text)
        # change to lower case
        document = document.lower()
        # convert string to a list of strings
        document = document.split()    
        # remove stopwords and single characters, then perform lemmatisation
        document = [lem.lemmatize(word) for word in document 
                    if (word not in stop_words) and (len(word) > 1)]
        cleaned_corpus.append(document)
    return cleaned_corpus

In [33]:
# define stopwords
stop_words = stopwords.words('english')
corpus = basic_preprocess_corpus(corpus,stop_words)

Now we will create a dictionary in which each key is a term and each value is a list of the documents in which that term is found.

In [34]:
def create_term_dict(corpus, threshold=False):
    '''
    Returns the term dictionary of a corpus.
    
    Parameters
    ----------
    corpus:
        A preprocessed corpus of shape [num_documents,]. Each element is the list of tokens for that document.
    threshold:
        The minimum number of documents a term must appear in for it to be kept in the term dictionary.
        If False, no terms are removed.

    Returns
    -------
    term_dict:
        A dictionary for which the key stores all unique terms, and the value for each key is a list of 
        indices for the documents in the corpus which contain the term. Duplicate indices indicate 
        multiple appearances within the same document.
    '''
    term_dict = {}
    for idx, document in enumerate(corpus):
        for term in document:
            if term in term_dict: #if the term is already in the keys, append
                term_dict[term].append(idx)
            else: #otherwise, add new key
                term_dict[term] = [idx]
    
    #automatically remove terms which appear in fewer than 'threshold' documents
    if threshold:
        for term in list(term_dict): #list allows us to change dictionary in iteration
            #count distinct documents, not total appearances
            if len(set(term_dict[term])) < threshold:
                term_dict.pop(term) #delete that term
    return term_dict

In [35]:
term_dict = create_term_dict(corpus)
print('Term dictionary (no threshold, so single appearance terms are kept)\n')
print(term_dict)

Term dictionary (no threshold, so single appearance terms are kept)

{'syria': [0, 6, 8, 8], 'christian': [0], 'cannot': [0], 'come': [0], 'country': [0], 'refugee': [0], 'say': [0, 7, 10], 'trump': [0, 1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 17], 'video': [1, 15], 'show': [1, 11, 14, 18, 19], 'federal': [1], 'troop': [1, 8], 'armored': [1], 'vehicle': [1], 'patrolling': [1], 'street': [1], 'chicago': [1], 'president': [1, 3, 4, 6, 8, 8], 'order': [1], 'donald': [2, 6, 8, 17], 'wrote': [2], 'art': [2], 'deal': [2], 'punched': [2], 'second': [2], 'grade': [2], 'music': [2], 'teacher': [2], 'actor': [3], 'denzel': [3], 'washington': [3], 'said': [3, 10], 'electing': [3], 'saved': [3], 'u': [3, 8, 8, 10], 'becoming': [3], 'orwellian': [3], 'police': [3], 'state': [3, 6], 'fired': [4], 'longtime': [4], 'white': [4, 7], 'house': [4, 7], 'butler': [4], 'cecil': [4], 'gaines': [4], 'disobedience': [4], 'congress': [5, 9, 10], 'approved': [5], 'creation': [5], 'taxpayer': [5], 'funded': [5], 'network': [5], 'calle

Notice that words which appear twice in the same document appear in `term_dict` as indices with repeated numbers.

In [36]:
def create_term_doc_matrix(term_dict,corpus):
    '''
    Returns the term-document matrix of a corpus of shape [num_documents,].
    
    Parameters
    ----------
    term_dict:
        The term dictionary of corpus.
    corpus:
         The corpus of text from which the term_dictionary has been made.

    Returns
    -------
    term_doc:
        A term-document matrix of shape [num_documents, num_terms] with 
        (rows, columns) corresponding to (documents, terms). Values at the [i,j]th
        index indicate the number of times term j appears in document i.   
    '''
    A = np.zeros([len(corpus),len(term_dict)]) #rows x col = doc x terms
    for idx, term in enumerate(term_dict):
        for d in term_dict[term]:
            A[d,idx] += 1 
    return np.asarray(A)

In [37]:
term_doc = create_term_doc_matrix(term_dict,corpus)
term_doc.shape

(20, 157)

In [38]:
term_doc

array([[1., 1., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 1., 1.]])

In [39]:
def tfidf_matrix(term_doc):
    '''
    Returns a tf-idf style weighting of a term-document matrix.
    
    Parameters
    ----------
    term_doc:
        The term-document matrix of a corpus, of shape [num_documents, num_terms]. 

    Returns
    -------
    tfidf_matrix:
        A matrix of weights for each term-document relationship with
        (rows, columns) corresponding to (documents, terms). 
    '''
    col_sums = np.sum(term_doc, axis=0) #total occurrences of each term across the corpus
    A = np.zeros(term_doc.shape)
    B = np.copy(term_doc)
    for i in range(term_doc.shape[0]):
        #nt is the number of distinct terms which appear in document i
        nt = len([d for d in B[i] if d>0])
        for j in range(term_doc.shape[1]):
            term_count = col_sums[j] #occurrences of term j in the whole corpus
            tf = B[i,j] / term_count #share of term j's occurrences which fall in document i
            
            #note: this is log(num_terms) / nt, not the textbook idf log(num_documents / docs_with_term)
            idf = log(float(term_doc.shape[1])) / nt
            A[i,j] = tf*idf
    return A

In [40]:
tfidf_mat = tfidf_matrix(term_doc)

In [41]:
tfidf_mat

array([[0.15800768, 0.63203073, 0.63203073, ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.63203073, 0.63203073,
        0.63203073]])
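
For comparison with the hand-rolled weighting above, here is a minimal sketch of the common textbook definition for a `[documents, terms]` count matrix: tf is the term count divided by the document length, and idf is `log(N / n_t)` where `n_t` is the number of documents containing the term. The function name `textbook_tfidf` and the toy matrix are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row by default, so its values will differ from both versions.

```python
import numpy as np

def textbook_tfidf(term_doc):
    """tf-idf for a [num_documents, num_terms] count matrix (illustrative sketch)."""
    term_doc = np.asarray(term_doc, dtype=float)
    num_docs = term_doc.shape[0]

    # tf: count of term j in document i, divided by the length of document i
    doc_lengths = term_doc.sum(axis=1, keepdims=True)
    tf = np.divide(term_doc, doc_lengths,
                   out=np.zeros_like(term_doc), where=doc_lengths > 0)

    # idf: log of (number of documents / number of documents containing term j)
    docs_with_term = np.count_nonzero(term_doc, axis=0)
    idf = np.log(num_docs / np.maximum(docs_with_term, 1))

    return tf * idf

# Toy counts: 2 documents, 3 terms
toy = np.array([[2, 1, 0],
                [0, 1, 1]])
result = textbook_tfidf(toy)
print(result)
```

The middle term appears in both documents, so its idf is `log(2/2) = 0` and it contributes nothing to either row.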

### Use SVD to Create a Low-Rank Approximation

In [42]:
from sklearn.decomposition import TruncatedSVD

- **n_components**: the desired dimensionality of the output data
- **n_iter**: the number of iterations for the randomized SVD solver to take.

In [43]:
svd = TruncatedSVD(n_components=50, n_iter=15)
svd.fit(tfidf_mat)

TruncatedSVD(algorithm='randomized', n_components=50, n_iter=15,
       random_state=None, tol=0.0)

In [44]:
svd_output= svd.transform(tfidf_mat)
svd_output.shape

(20, 20)
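
To make the low-rank idea concrete, the sketch below performs the truncation by hand with `np.linalg.svd` on a small random matrix. The matrix `X` and the choice `k = 3` are assumptions for illustration; this is not how `TruncatedSVD` is implemented internally, only the same underlying decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 10))          # toy stand-in for a [documents, terms] matrix

# thin SVD: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 3                            # number of latent dimensions to keep
X_reduced = U[:, :k] * s[:k]     # one k-dimensional vector per document
X_approx = X_reduced @ Vt[:k]    # rank-k approximation of X

print(X_reduced.shape)                   # (6, 3)
print(np.linalg.matrix_rank(X_approx))   # 3
```

A matrix with 20 documents has at most 20 non-trivial singular directions, which is consistent with the `(20, 20)` shape above even though `n_components=50` was requested.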

Now, assuming that we have a second corpus of text to perform this pipeline on, we can return the following feature.

In [45]:
from sklearn.metrics.pairwise import cosine_similarity

In [46]:
def cos_sim(A,B):
    '''
    Return the cosine similarity between the first row of A and the first row of B.
    '''
    return cosine_similarity(A,B)[0][0]

For the case in which we have many svd outputs

In [47]:
svd_outputs = [svd_output for i in range(5)]

In [48]:
svd_similarities = list(map(cos_sim,svd_outputs,svd_outputs))
print(svd_similarities)

[0.9999999999999998, 0.9999999999999998, 0.9999999999999998, 0.9999999999999998, 0.9999999999999998]
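
Note that `cos_sim` only returns the `[0][0]` entry, i.e. the similarity between the first rows of its two inputs. When two matrices hold one sample per row, a row-by-row similarity can be computed directly in NumPy. The helper `rowwise_cosine` below is a hypothetical sketch, and mapping zero-norm rows to a similarity of 0 is an assumption.

```python
import numpy as np

def rowwise_cosine(A, B):
    """Cosine similarity between row i of A and row i of B, for every i."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    dots = np.sum(A * B, axis=1)
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    # rows with zero norm get similarity 0 instead of a division error
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
B = np.array([[2.0, 0.0], [1.0, -1.0], [3.0, 4.0]])
sims = rowwise_cosine(A, B)
print(sims.tolist())  # [1.0, 0.0, 0.0]
```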


### Using Sklearn Library TfidfVectorizer
Now that we have explored the creation of the SVD similarity feature on some toy data, let's implement it on our claims and pseudo_headlines. Here, we will bring out the big guns and use scikit-learn's `TfidfVectorizer` object.

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [50]:
vectorizer = TfidfVectorizer(ngram_range=(1,3),max_df=0.8,min_df=2)
vectorizer.fit(df['claim']);

In [51]:
#this is the mapping of terms to feature vector indices
vocabulary = vectorizer.vocabulary_ 

In [52]:
claim_vectorizer = TfidfVectorizer(ngram_range=(1,3),max_df=0.8,
                                   min_df=2, vocabulary=vocabulary)
claims_tfidf = claim_vectorizer.fit_transform(df['claim'])

In [53]:
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_df=0.8,
                                      min_df=2, vocabulary=vocabulary)
headline_tfidf= headline_vectorizer.fit_transform(df['pseudo_headline'])

In [54]:
sim_tfidf = list(map(cos_sim, claims_tfidf,headline_tfidf))

In [55]:
print(np.squeeze(sim_tfidf).shape)

(15555,)


We can see that we have one similarity score per entry!

### Functionise TFIDF-SVD Cosine Similarity Feature

In [56]:
def add_feature_tfidf_svd_similarity(df,col1,col2,vec):
    '''
    Adds the cosine similarity feature for the tfidf matrices of each column
    
    Parameters
    ----------
    df:
        The dataframe which the column is added to.
    col1:
        The first column of text for the similarity measurements. 
    col2:
        The second column of text for the similarity measurements.
    vec:
        An unfitted TfidfVectorizer whose ngram_range, max_df, and min_df settings are reused.
        
    Returns
    -------
    N/A 
    '''
    #train on all text contained in the two columns, joined with a space so that
    #the last word of col1 does not fuse with the first word of col2
    vec.fit(df[col1] + " " + df[col2])
    vocabulary = vec.vocabulary_ 
    
    vec1 = TfidfVectorizer(ngram_range=vec.ngram_range,max_df=vec.max_df,
                           min_df=vec.min_df,vocabulary=vocabulary)
    tfidf1 = vec1.fit_transform(df[col1])
    vec2 = TfidfVectorizer(ngram_range=vec.ngram_range,max_df=vec.max_df,
                           min_df=vec.min_df,vocabulary=vocabulary)
    tfidf2 = vec2.fit_transform(df[col2])
    
    #compare the matrices computed above, not the notebook-level claims_tfidf and headline_tfidf
    df['sim_tfidf_%s_to_%sgram_%s_%s' %(vec.ngram_range[0],vec.ngram_range[1],col1,col2)] = list(map(cos_sim, tfidf1,tfidf2))

In [57]:
vec = TfidfVectorizer(ngram_range=(1,3),max_df=0.8,min_df=2)
add_feature_tfidf_svd_similarity(df,'claim','pseudo_headline',vec)

In [58]:
df.head(2)

Unnamed: 0_level_0,claim,claimant,date,label,related_articles,count_1gram_claim,count_unique_1gram_claim,ratio_unique_1gram_claim,pseudo_headline,overlap_1gram_claim_pseudo_headline,overlap_2gram_claim_pseudo_headline,overlap_3gram_claim_pseudo_headline,num_sents_claim,sim_tfidf_1_to_3gram_claim_pseudo_headline
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
0,line george orwell novel 1984 predict power sm...,,2017-07-17,0,"[122094, 122580, 130685, 134765]",8,8,1.0,line george orwell novel,4.0,3.0,2.0,1,0.708021
1,maine legislature candidate leslie gibson insu...,,2018-03-17,2,"[106868, 127320, 128060]",14,14,1.0,maine legislature candidate leslie,4.0,3.0,2.0,1,0.349062


Check to see if our function worked.

In [60]:
sum(np.subtract(sim_tfidf,df.sim_tfidf_1_to_3gram_claim_pseudo_headline.values))

0.0

A summed difference of 0.0 means the function's output matches the step-by-step result, AKA SUCCESS!!!

## Word2Vec Features
Now we will generate the **Word2Vec features** as indicated by the Talos team. They used pre-trained word vectors, trained on a Google News corpus.

In [61]:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',binary=True)



In [62]:
df['pseudo_headline_unigram_vec'] = \
df['pseudo_headline'].map(lambda x: generate_ngrams(x,1))
df['claim_unigram_vec'] = \
df['claim'].map(lambda x: generate_ngrams(x,1))

The `functools.reduce` tool applies a two-argument function cumulatively to the elements of a sequence, reducing the sequence to a single value.

In [63]:
from functools import reduce
lis = [1,2,3,4,5]

# using reduce to compute sum of list 
print ("The sum of the list elements is : ",end="") 
print (reduce(lambda a,b : a+b,lis)) 

#using reduce to compute maximum element from list 
print ("The maximum element of the list is : ",end="") 
print (reduce(lambda a,b : a if a > b else b,lis)) 

The sum of the list elements is : 15
The maximum element of the list is : 5


In [64]:
headline_unigram_arr = df.pseudo_headline_unigram_vec.values
claim_unigram_arr = df.claim_unigram_vec.values
headline_vecs = map(lambda x: reduce(np.add, [model[gram] for gram in x if gram in model], [0.]*300), headline_unigram_arr)
claim_vecs = map(lambda x: reduce(np.add, [model[gram] for gram in x if gram in model], [0.]*300), claim_unigram_arr)

Let's take a moment to break down what is happening in our `reduce` line, as there is a lot going on.
1. We are mapping (i.e. applying) a lambda function to each element of `claim_unigram_arr`. Thus the lambda function acts on each claim's list of unigrams.
2. Inside the lambda, `reduce` applies `np.add`, which adds vectors element-wise. What is it adding? It is cumulatively summing the list of vectors generated by the list comprehension `[model[gram] for gram in x if gram in model]`.
3. What is this list comprehension doing? It looks up the embedding of each unigram in `x` (the list of unigrams for that claim), skipping unigrams which are not in the model's vocabulary. Thus it returns a list of vector embeddings, one per known unigram.
4. What does the presence of `[0.]*300` mean? It is the initial value for `reduce`: the running sum starts from a 300-dimensional vector of zeros, so a claim with no known unigrams still yields a 300-dimensional vector of zeros instead of raising an error.

SO! What can we conclude? The output of this gloriously pythonic sentence is one vector embedding per claim, created by adding up the individual vectors for each unigram found in the claim. Because `map` is lazy, the vectors are only computed once we wrap the result in `list`.
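
The role of the initial value is easiest to see on a toy example. The sketch below swaps the real 300-dimensional model for a made-up 3-dimensional dictionary; `toy_model` and `sum_vectors` are hypothetical names used only for illustration.

```python
from functools import reduce
import numpy as np

# Toy 3-dimensional "embeddings" (made up; the real model is 300-dimensional)
toy_model = {
    "teacher": np.array([1.0, 0.0, 2.0]),
    "school":  np.array([0.5, 1.0, 0.0]),
}

def sum_vectors(grams, model, dim):
    # np.zeros(dim) is the starting value, and it is what reduce
    # returns when none of the grams are in the model
    return reduce(np.add, [model[g] for g in grams if g in model], np.zeros(dim))

known = sum_vectors(["teacher", "school", "unknownword"], toy_model, 3)
unknown = sum_vectors(["unknownword"], toy_model, 3)
print(known.tolist())    # [1.5, 1.0, 2.0]
print(unknown.tolist())  # [0.0, 0.0, 0.0]
```

One consequence worth keeping in mind: an all-zero vector has no direction, so its cosine similarity with any other vector is not meaningful.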

Now, what do we do with these?

In [65]:
headline_vecs = list(headline_vecs)
claim_vecs = list(claim_vecs)
print('Shape of headline vectors:',np.asarray(headline_vecs).shape)
print('Shape of claim vectors:',np.asarray(claim_vecs).shape)

Shape of headline vectors: (15555, 300)
Shape of claim vectors: (15555, 300)


Now we are able to compute the cosine similarity between the headline and claim Word2Vec features. We will get one cosine similarity for each of our 15K rows.

After much (much!) struggle and iteration, it has been determined that we must pass our vectors in with the shape (num_samples, 1, dimension_of_vector). Why? The `cosine_similarity` function requires that a single sample be passed in as a 2-D array of shape (1, dimensions). When we use `map(func, input_list_a, input_list_b)`, the function is applied to the inputs pairwise, i.e. to the elements which share the same index in each list.

Thus, we must make sure that each element of each input list has the shape (1, dimensions).

In [66]:
#pass the shape positionally; the newshape keyword is deprecated in recent NumPy releases
vecs_h = [np.reshape(x,(1,-1)) for x in headline_vecs]
vecs_c = [np.reshape(x,(1,-1)) for x in claim_vecs]

In [67]:
w2v_sims = np.squeeze(list(map(cos_sim, vecs_h,vecs_c)))

In [68]:
print(w2v_sims)

[0.71542829 0.74194606 0.65782847 ... 0.8944385  0.49021315 0.79096662]


Future work:
- Similarity metrics for different word2vec models!

### Functionise Word2Vec Features

In [69]:
def add_feature_w2v_similarity(df,col1,col2,model,n):
    '''
    Adds the Word2Vec cosine similarity feature to the dataframe between two specific columns.
    
    Parameters
    ----------
    df:
        The dataframe which the column is added to.
    col1:
        The first column of text for the similarity measurements. 
    col2:
        The second column of text for the similarity measurements.
    model:
        A gensim model loaded with Word2Vec embeddings.
    n:
        The ngram number to include in the similarity comparison.
        
    Returns
    -------
    N/A 
    '''
    gram_list1 = df[col1].map(lambda x: generate_ngrams(x,n)).values
    gram_list2 = df[col2].map(lambda x: generate_ngrams(x,n)).values
    
    #sum the embeddings of every gram found in the model, starting from a vector of zeros
    vecs1 = map(lambda x: reduce(np.add, [model[gram] for gram in x if gram in model], [0.]*300), gram_list1)
    vecs2 = map(lambda x: reduce(np.add, [model[gram] for gram in x if gram in model], [0.]*300), gram_list2)
    #shape each vector into the (1, dimensions) form which cosine_similarity expects
    vecs1 = [np.reshape(x,(1,-1)) for x in vecs1]
    vecs2 = [np.reshape(x,(1,-1)) for x in vecs2]
    #compare the vectors built above, not the notebook-level vecs_h and vecs_c
    df['sim_w2v_%s_%s' %(col1,col2)] = np.squeeze(list(map(cos_sim, vecs1,vecs2)))

In [70]:
add_feature_w2v_similarity(df,'claim','pseudo_headline',model,1)

In [72]:
df.sim_w2v_claim_pseudo_headline.head()

id
0    0.715428
1    0.741946
4    0.657828
5    0.420630
6    0.741683
Name: sim_w2v_claim_pseudo_headline, dtype: float64

## Sentiment Features

In [73]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [74]:
def compute_sentiment(sentences,average=False):
    '''
    Computes the sentiment of the sentences with nltk SentimentIntensityAnalyzer().
    
    Parameters
    ----------
    sentences:
        Either a list of single sentences or a list of sentence lists, where each inner list is
        composed of all the sentences tokenized per document.
    average:
        A binary True/False value of whether or not to average the sentiment.
        Sentiment should be averaged if your sentences parameter is a list of lists.
        
    Returns
    -------
    A dataframe of the resulting sentiments.
    '''
    result = []
    for sentence in sentences:
        vs = sid.polarity_scores(sentence)
        result.append(vs)
    if average:
        return pd.DataFrame(result).mean()
    else:
        return pd.DataFrame(result)

In [75]:
sents = compute_sentiment(df.claim.values)
sents.head()

Unnamed: 0,compound,neg,neu,pos
0,0.3182,0.0,0.753,0.247
1,-0.4939,0.297,0.573,0.13
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,-0.9022,0.552,0.448,0.0


In [76]:
def add_feature_sentiment_sid(df,col,avg=False):
    '''
    Adds features from the sentiment of the sentences as processed by nltk SentimentIntensityAnalyzer().
    
    Parameters
    ----------
    df:
        The dataframe which the columns are added to.
    col:
        The column of text to be analyzed. 
    avg:
        A binary True/False value of whether or not to average the sentiment over all rows.
        If True, every row receives the same mean scores.
        
    Returns
    -------
    N/A
    '''
    #build the scores on df's own index; a default 0..n-1 index would be aligned
    #by label on assignment and attach scores to the wrong ids
    scores = pd.DataFrame([sid.polarity_scores(sentence) for sentence in df[col].values],
                          index=df.index)
    scores.columns = ['sent_'+ x + '_' + col for x in scores.columns]
    if avg:
        #mean() returns one value per score, which is broadcast to every row
        for key, value in scores.mean().items():
            df[key] = value
    else:
        for key in scores:
            df[key] = scores[key]

In [77]:
add_feature_sentiment_sid(df,'claim')

In [79]:
df[[col for col in df.columns if 'sent_' in col]].head()

Unnamed: 0_level_0,sent_compound_claim,sent_neg_claim,sent_neu_claim,sent_pos_claim
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,0.3182,0.0,0.753,0.247
1,-0.4939,0.297,0.573,0.13
4,0.0,0.0,1.0,0.0
5,0.0,0.0,1.0,0.0
6,-0.9022,0.552,0.448,0.0


Each row matches the corresponding row of `sents` above (ids 0, 1, 4, 5, and 6 are the first five claims), so we're good!
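
One detail to watch when attaching computed scores to `df`: pandas aligns on index labels when a Series or DataFrame column is assigned, and `df` is indexed by `id` (0, 1, 4, 5, 6, ...) rather than by position. The toy frame below is made up to show the difference between label-based and positional assignment.

```python
import pandas as pd

# Toy frame whose index has gaps, like the `id` index of the claims dataframe
left = pd.DataFrame({"claim": ["a", "b", "c"]}, index=[0, 1, 4])

# Scores built from a plain list get a fresh 0..n-1 index
scores = pd.Series([0.1, 0.2, 0.3])

left["by_label"] = scores                 # aligned on index labels: id 4 has no match
left["by_position"] = scores.to_numpy()   # plain array: assigned row by row

print(left)
```

In the `by_label` column the third row is `NaN`, because label 4 does not exist in `scores`; in a longer frame it would silently pick up the score of a different row instead.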