## Assignment Alternus Vera 
### Factor: Partisan Biasnes


** As part of Alternus Vera assignment, Here we are trying to analyze liar liar dataset for partisan biasnes which includes two parties i.e. democratic and republic.**

* Data set used: Liar Liar.


### Reading data set

In [None]:
import pandas as pd
import numpy as np

In [86]:
dataset_train = pd.read_csv('liar_dataset/train.tsv', delimiter = '\t', quoting = 3, header=None)
dataset_train.columns = ["id", "label", "statement", "subject", "speaker", "speaker_title", "State", "party_affiliation", "barely_true", "false", "half_true", "mostly_true", "pants_on_fire","context"]
dataset_train.head(5)

Unnamed: 0,id,label,statement,subject,speaker,speaker_title,State,party_affiliation,barely_true,false,half_true,mostly_true,pants_on_fire,context
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0,1,0,0,0,a mailer
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0,0,1,1,0,a floor speech.
2,324.json,mostly-true,"Hillary Clinton agrees with John McCain ""by vo...",foreign-policy,barack-obama,President,Illinois,democrat,70,71,160,163,9,Denver
3,1123.json,false,Health care reform legislation is likely to ma...,health-care,blog-posting,,,none,7,19,3,5,44,a news release
4,9028.json,half-true,The economic turnaround started at the end of ...,"economy,jobs",charlie-crist,,Florida,democrat,15,9,20,19,2,an interview on CNN


In [87]:
dataset_train.shape

(10269, 14)

In [90]:
from gensim import corpora

documents = dataset_train['statement']

### Preprocessing data
** Following steps were taken for preprocessing the data **
* Tokenizing
* Removing stop words
* Stemming
* Lemmetizing

### Tokenizing

In [130]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
        for document in documents]

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

from pprint import pprint  # pretty-printer
print(texts)




### Stopwords 

In [199]:

from nltk.corpus import stopwords

from nltk.corpus import stopwords
stop = stopwords.words('english')

dataset_train['statement_processed'] = documents.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

print(documents[10244])
print(dataset_train['statement_processed'][10244])

The Obama administration spent $205,075 in stimulus funds to relocate a shrub that sells for $16.
The Obama administration spent $205,075 stimulus funds relocate shrub sells $16.


### Stemming

In [200]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
porter.stem('running')

stemmed_words = dataset_train['statement_processed'].apply(lambda x: ' '.join([porter.stem(word) for word in x.split() if word not in (stop)]))


### Lemmetizing

In [332]:
from textblob import Word
dataset_train['statement_processed'] = dataset_train['statement_processed'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))


### Replace number from string

In [333]:
dataset_train['statement_processed'] = stemmed_words.replace('\d+', '', regex = True)


### Creating dictionary for political words: 
** The dictionary words are taken from website: ** https://www.theatlantic.com/politics/archive/2016/07/why-democrats-and-republicans-literally-speak-different-languages/492539/

In [204]:
democratic_phrases=['private accounts',
'trade agreement',
'American people',
'tax breaks',
'trade deficit',
'oil companies',
'credit card',
'nuclear option',
'war in Iraq',
'middle class',
'President budget',
'Republican party',
'change the rules',
'minimum wage',
'budget deficit',
'Republican senators',
'wildlife refuge',
'card companies',
'worker\'s rights',
'poor people',
'Republican leader',
'cut funding',
'American workers',
'living in poverty',
'Senate Republicans',
'fuel efficiency',
'national wildlife',
'veterans health care',
'congressional black caucus',
'billion in tax cuts',
'security trust fund',
'social security trust',
'privatize social security',
'American free trade',
'central American free',
'corporation for public broadcasting',
'additional tax cuts',
'pay for tax cuts',
'tax cuts for people',
'oil and gax companies',
'prescription drug bill',
'caliber sniper rifles',
'increase the minimum wage',
'system of checks and balances',
'middle class families',
'cut health care',
'civil rights movement',
'cuts to child support',
'drilling in the Arctic National',
'victims of gun violence',
'solvency of social security',
'Voting Rights Act',
'war in Iraq and Afghanistan',
'civil rights protections',
'credit card debt',
'Affordable Care Act']

In [205]:
republican_phrases=[
'stem cell',
'natural gas',
'death tax',
'illegal aliens',
'class action',
'war on terror',
'embryonic cell',
'tax relief',
'illegal immigration',
'personal account',
'pass the bill',
'private property',
'border security',
'human life',
'human embryos',
'increase taxes',
'retirement accounts',
'government spending',
'national forest',
'minority leader',
'urge support',
'cell lines',
'cord blood',
'action lawsuits',
'economic growth',
'food program',
'hate crimes legislation',
'adult stem cells',
'oil for food',
'personal retirement accounts',
'energy and natural resources',
'hate crimes law',
'change hearts and minds',
'global war on terrorism',
'death tax repeal',
'housing and urban affairs',
'million jobs created',
'national flood insurance',
'private property rights',
'temporary worker program',
'class action reform',
'growth and job creation',
'reform social security',
'Obamacare'
]


### Cleaning the dictionary data

In [206]:
from nltk.corpus import stopwords

from nltk.corpus import stopwords
stop = stopwords.words('english')

filtered_democratic_list = []
for sent in democratic_phrases:
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sent)
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    filtered_sentence = ''
    for w in word_tokens:
        if w not in stop_words: 
#             print(w+"====>"+porter.stem(w))
            filtered_sentence=filtered_sentence+" "+porter.stem(w)
#             filtered_sentence.append(w)
    filtered_democratic_list.append(filtered_sentence)
    
filtered_republican_list = []
for sent in republican_phrases:
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sent)
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    filtered_sentence = ''
    for w in word_tokens:
        if w not in stop_words: 
#             print(w+"====>"+porter.stem(w))
            filtered_sentence=filtered_sentence+" "+porter.stem(w)
#             filtered_sentence.append(w)
    filtered_republican_list.append(filtered_sentence) 
    


### Applying TF-IDF 

In [207]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline 
y = dataset_train.label 
 
# Drop the `label` column 
dataset_train.drop("label", axis=1) 
 
# Make training and test sets 
X_train, X_test, y_train, y_test = train_test_split(dataset_train['statement_processed'], y, test_size=0.33, random_state=53)


print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)



(6880,)
(3389,)
(6880,)
(3389,)


In [312]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary = republican_phrases,norm='l2',ngram_range = (1,3),use_idf=True, smooth_idf=True,
                sublinear_tf=False)  

tfidf = vectorizer.fit_transform(X_train)


In [314]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer(analyzer='word', ngram_range=(1,3), min_df = 0, stop_words = 'english')

tfidf_matrix =  tf.fit_transform(X_train)
feature_names = tf.get_feature_names()

feature_names

['aaa',
 'aaa credit',
 'aaa credit rating',
 'aaron',
 'aaron bean',
 'aaron bean vote',
 'aarp',
 'aarp largest',
 'aarp largest resel',
 'aba',
 'aba criteria',
 'aba criteria judici',
 'abandon',
 'abandon republican',
 'abandon republican stood',
 'abba',
 'abba member',
 'abba member class',
 'abbas',
 'abbas leader',
 'abbas leader fatah',
 'abbott',
 'abbott benefit',
 'abbott benefit payday',
 'abbott charg',
 'abbott charg overse',
 'abbott compani',
 'abbott compani vacuum',
 'abbott convert',
 'abbott convert million',
 'abbott defend',
 'abbott defend billion',
 'abbott gift',
 'abbott gift free',
 'abbott lost',
 'abbott lost court',
 'abbott said',
 'abbott said wast',
 'abbott surrog',
 'abbott surrog refer',
 'abbott texa',
 'abbott texa year',
 'abbott went',
 'abbott went court',
 'abc',
 'abc allow',
 'abc allow ad',
 'abedin',
 'abedin tie',
 'abedin tie muslim',
 'abel',
 'abel billionaire',
 'abel elimin',
 'abel elimin sheriff',
 'abel year',
 'abel year noth',


In [315]:
tfidf_matrix

<6880x101422 sparse matrix of type '<class 'numpy.float64'>'
	with 176105 stored elements in Compressed Sparse Row format>

In [336]:
tfidf = vectorizer.fit_transform(X_train).todense().T

print(tfidf)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [317]:
dense = tfidf_matrix.todense()
len(dense[0].tolist()[0])
episode = dense[0].tolist()[0]
phrase_scores = [pair for pair in zip(range(0, len(episode)), episode) if pair[1] > 0]
len(phrase_scores)

15

In [318]:
sorted(phrase_scores, key=lambda t: t[1] * -1)[:5]

[(46407, 0.298083948114242),
 (46408, 0.298083948114242),
 (57470, 0.298083948114242),
 (70031, 0.298083948114242),
 (99017, 0.298083948114242)]

### Finding top frequent words

In [320]:
print(feature_names[46407])
print(feature_names[46408])
print(feature_names[57470])
print(feature_names[70031])
print(feature_names[99017])

latest job
latest job number
number wisconsin rank
rank midwest
wisconsin rank midwest


### Finding bag of Words

In [350]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [353]:
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from sklearn import preprocessing
statement_news=dataset_train[['statement_processed']]

processed_statement = statement_news['statement_processed'].map(preprocess)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_statement]
# bow_corpus

### Applying tfidf on bag of words

In [354]:
from gensim import corpora, models
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(2, 0.45196320593139855),
 (3, 0.5112111677756673),
 (1466, 0.2998361817377411),
 (1817, 0.551223590392398),
 (6595, 0.375024292468497)]


### Applying LDA using bag of words

In [356]:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

In [357]:
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

Topic: 0 
Words: 0.038*"health" + 0.034*"care" + 0.019*"dollar" + 0.019*"plan" + 0.016*"american" + 0.015*"say" + 0.014*"mccain" + 0.014*"john" + 0.012*"spend" + 0.011*"year"
Topic: 1 
Words: 0.015*"campaign" + 0.014*"million" + 0.014*"fund" + 0.014*"obama" + 0.013*"spend" + 0.011*"island" + 0.011*"time" + 0.011*"year" + 0.010*"care" + 0.010*"today"
Topic: 2 
Words: 0.031*"vote" + 0.015*"republican" + 0.013*"percent" + 0.013*"support" + 0.011*"plan" + 0.011*"time" + 0.010*"cost" + 0.010*"clinton" + 0.009*"democrat" + 0.009*"rate"
Topic: 3 
Words: 0.022*"romney" + 0.022*"scott" + 0.020*"state" + 0.019*"governor" + 0.018*"billion" + 0.017*"mitt" + 0.016*"walker" + 0.013*"year" + 0.012*"plan" + 0.011*"health"
Topic: 4 
Words: 0.061*"percent" + 0.034*"state" + 0.028*"million" + 0.017*"rate" + 0.016*"year" + 0.014*"lose" + 0.014*"florida" + 0.013*"world" + 0.009*"number" + 0.008*"highest"
Topic: 5 
Words: 0.029*"obama" + 0.019*"barack" + 0.018*"nation" + 0.013*"democrat" + 0.013*"come" + 0.

### Running LDA using Tfidf

In [358]:
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=10, id2word=dictionary, passes=2, workers=4)
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))

Topic: 0 Word: 0.023*"obama" + 0.013*"barack" + 0.008*"rate" + 0.008*"state" + 0.007*"vote" + 0.007*"bush" + 0.007*"democrat" + 0.006*"year" + 0.006*"lowest" + 0.006*"muslim"
Topic: 1 Word: 0.010*"percent" + 0.009*"tax" + 0.008*"month" + 0.007*"obama" + 0.007*"record" + 0.006*"state" + 0.006*"american" + 0.006*"vote" + 0.006*"dont" + 0.005*"person"
Topic: 2 Word: 0.012*"health" + 0.011*"cost" + 0.010*"state" + 0.009*"percent" + 0.009*"million" + 0.009*"job" + 0.009*"vote" + 0.009*"care" + 0.008*"year" + 0.008*"children"
Topic: 3 Word: 0.016*"year" + 0.016*"percent" + 0.015*"plan" + 0.012*"health" + 0.010*"state" + 0.010*"care" + 0.008*"fund" + 0.008*"florida" + 0.008*"american" + 0.007*"vote"
Topic: 4 Word: 0.014*"romney" + 0.011*"mitt" + 0.008*"say" + 0.008*"state" + 0.008*"time" + 0.007*"rate" + 0.007*"clinton" + 0.007*"year" + 0.007*"campaign" + 0.006*"percent"
Topic: 5 Word: 0.011*"support" + 0.010*"year" + 0.010*"state" + 0.009*"american" + 0.008*"percent" + 0.007*"right" + 0.007*

### Performance evaluation for LDA Bag of Words model

In [362]:
processed_statement[11]

['sinc', 'nearli', 'million', 'american', 'slip', 'middl', 'class', 'poverti']

In [364]:

for index, score in sorted(lda_model[bow_corpus[11]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.5259239077568054	 
Topic: 0.049*"percent" + 0.028*"obama" + 0.019*"time" + 0.017*"american" + 0.017*"spend" + 0.016*"school" + 0.014*"vote" + 0.013*"year" + 0.013*"clinton" + 0.013*"million"

Score: 0.2740309536457062	 
Topic: 0.031*"vote" + 0.015*"republican" + 0.013*"percent" + 0.013*"support" + 0.011*"plan" + 0.011*"time" + 0.010*"cost" + 0.010*"clinton" + 0.009*"democrat" + 0.009*"rate"

Score: 0.02501034550368786	 
Topic: 0.061*"percent" + 0.034*"state" + 0.028*"million" + 0.017*"rate" + 0.016*"year" + 0.014*"lose" + 0.014*"florida" + 0.013*"world" + 0.009*"number" + 0.008*"highest"

Score: 0.02500923164188862	 
Topic: 0.022*"romney" + 0.022*"scott" + 0.020*"state" + 0.019*"governor" + 0.018*"billion" + 0.017*"mitt" + 0.016*"walker" + 0.013*"year" + 0.012*"plan" + 0.011*"health"

Score: 0.025006147101521492	 
Topic: 0.038*"health" + 0.034*"care" + 0.019*"dollar" + 0.019*"plan" + 0.016*"american" + 0.015*"say" + 0.014*"mccain" + 0.014*"john" + 0.012*"spend" + 0.011*"year"

### Performance evaluation for LDA Tfidf model

In [366]:
for index, score in sorted(lda_model_tfidf[bow_corpus[11]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.7749489545822144	 
Topic: 0.016*"year" + 0.016*"percent" + 0.015*"plan" + 0.012*"health" + 0.010*"state" + 0.010*"care" + 0.008*"fund" + 0.008*"florida" + 0.008*"american" + 0.007*"vote"

Score: 0.02500765770673752	 
Topic: 0.012*"health" + 0.011*"cost" + 0.010*"state" + 0.009*"percent" + 0.009*"million" + 0.009*"job" + 0.009*"vote" + 0.009*"care" + 0.008*"year" + 0.008*"children"

Score: 0.02500726841390133	 
Topic: 0.015*"state" + 0.014*"budget" + 0.011*"percent" + 0.010*"unit" + 0.009*"year" + 0.008*"spend" + 0.008*"trillion" + 0.007*"billion" + 0.007*"world" + 0.006*"today"

Score: 0.025006849318742752	 
Topic: 0.014*"romney" + 0.011*"mitt" + 0.008*"say" + 0.008*"state" + 0.008*"time" + 0.007*"rate" + 0.007*"clinton" + 0.007*"year" + 0.007*"campaign" + 0.006*"percent"

Score: 0.02500617690384388	 
Topic: 0.016*"spend" + 0.012*"year" + 0.010*"state" + 0.010*"million" + 0.008*"percent" + 0.008*"dollar" + 0.008*"billion" + 0.008*"clinton" + 0.007*"half" + 0.006*"vote"

Score

### Randomly picking a test data and evaluating the performance of model

In [368]:
unseen_document = 'building a wall on the U.S.-Mexico border will take literally years'
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.8499708771705627	 Topic: 0.038*"health" + 0.034*"care" + 0.019*"dollar" + 0.019*"plan" + 0.016*"american"
Score: 0.016673222184181213	 Topic: 0.055*"state" + 0.051*"year" + 0.019*"unit" + 0.018*"obama" + 0.016*"vote"
Score: 0.016671452671289444	 Topic: 0.022*"romney" + 0.022*"scott" + 0.020*"state" + 0.019*"governor" + 0.018*"billion"
Score: 0.016670534387230873	 Topic: 0.022*"state" + 0.018*"year" + 0.016*"vote" + 0.015*"say" + 0.014*"trump"
Score: 0.016670146957039833	 Topic: 0.029*"obama" + 0.019*"barack" + 0.018*"nation" + 0.013*"democrat" + 0.013*"come"
Score: 0.016669603064656258	 Topic: 0.049*"percent" + 0.028*"obama" + 0.019*"time" + 0.017*"american" + 0.017*"spend"
Score: 0.016669457778334618	 Topic: 0.061*"percent" + 0.034*"state" + 0.028*"million" + 0.017*"rate" + 0.016*"year"
Score: 0.016668478026986122	 Topic: 0.029*"state" + 0.022*"billion" + 0.016*"budget" + 0.013*"year" + 0.011*"public"
Score: 0.01666816510260105	 Topic: 0.015*"campaign" + 0.014*"million" + 0.0

In [379]:


X_train, X_test, y_train, y_test = train_test_split(dataset_train['statement_processed'], y, test_size=0.33, random_state=53)



from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

nb_pipeline = Pipeline([('NBCV',vectorizer),('nb_clf',MultinomialNB())])

nb_pipeline.fit(X_train,y_train)
predicted_nb = nb_pipeline.predict(X_test)
print(np.mean(predicted_nb == y_test))


0.1950427854824432


### Next Steps to follow:
* Enrich the data
* Apply doc2Vec
* Apply LDA on top of doc2vec
* Cosine similarity

### Enriching Data to add publication and published date

In [234]:
import requests
def read_url_extract_date(id,i):
    URL = "http://www.politifact.com//api/v/2/statement/"+id+"?format=json"
    page = requests.get(URL).json()
    if(len(page['author'])!=0):
        publication_name = page['author'][0]['publication']['publication_name']
        dataset_train['Publication'][i]=publication_name
    ruling_date = page['ruling_date']
    dataset_train['Date'][i]=ruling_date

In [321]:
dataset_train['Publication'] = 'None'
dataset_train['Date'] = 'None'
for i in range(0, 10269):
    id = dataset_train['id'][i]
    id_split = id.split('.')
    read_url_extract_date(id_split[0],i)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


In [323]:
dataset_train.head(2)

Unnamed: 0,id,label,statement,subject,speaker,speaker_title,State,party_affiliation,barely_true,false,half_true,mostly_true,pants_on_fire,context,statement_processed,Publication,Date
0,2635.json,false,Says the Annies List political group supports ...,abortion,dwayne-bohac,State representative,Texas,republican,0,1,0,0,0,a mailer,say anni list polit group support third-trimes...,Austin American-Statesman,2010-10-20T06:00:00
1,10540.json,half-true,When did the decline of coal start? It started...,"energy,history,job-accomplishments",scott-surovell,State delegate,Virginia,democrat,0,0,1,1,0,a floor speech.,when declin coal start? It start natur ga took...,Richmond Times-Dispatch,2015-02-23T00:00:00


### Writing the enriched data to csv

In [311]:
dataset_train.to_csv('train_with_date', sep='\t', encoding='utf-8')

### Todos:
* Using enriched data to optimize the factors score
* Use doc2Vec
* Perform LDA, bag of words, tfidf and cosine similarity
* analyze the performance
* Integration with other team mebers work
* Combining all the factors to perform multi classifier to give vectorized output. 