# Final Presentation


## Members: Haorui Ji, Mengze Zhang, Weihao Tan


# Abstract

* Combine the title and text corpora to retrieve more information.
* Concatenate the vectors produced by the w2v and d2v methods (tf-idf optional) into the final feature vectors (600-d).
* Train an ensemble classifier (composed of three tuned estimators) on these feature vectors and use it to make predictions.
* Reach an F1 score above 0.94 on the test set.

# Data Ingestion

First, we import the data. Because we want to test whether combining title vectors and text vectors yields a better result, we import the title corpus and the text corpus separately. In addition, since we expect that combining the title and text corpora will give us more useful information, we generate an additional "titleandtext" corpus by concatenating the title and text tokens of every article into one list.

In [1]:
# MengZe Zhang 2018-9-23
# import all the data and split all strings to words in it

import numpy as np
import pandas as pd
import copy
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation
from itertools import chain
from gensim import corpora
from gensim.models import Word2Vec
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import train_test_split as tts
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report as cr
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn import cross_validation
from sklearn.model_selection import RandomizedSearchCV
from numpy import linspace
from scipy.sparse import csr_matrix

df=pd.read_csv('../data/fake_or_real_news.csv')
title_raw=np.asarray(df.title)
text_raw=np.asarray(df.text)
y_raw=np.asarray(df.label)
title_raw=[titles.lower().split() for titles in title_raw]
text_raw=[texts.lower().split() for texts in text_raw]
# Haorui Ji 2018-9-24
# Combine title and text in every review into one list 
titleandtext_raw = copy.deepcopy(title_raw)
for i in range(len(titleandtext_raw)):
    titleandtext_raw[i].extend(text_raw[i])
df.head()



| Unnamed: 0.1 | Unnamed: 0 | title | text | label | title_vectors |
|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | [ 1.1533764e-02 4.2144405e-03 1.9692603e-02 ... |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | [ 0.11267698 0.02518966 -0.00212591 0.021095... |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL | [ 0.04253004 0.04300297 0.01848392 0.048672... |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | [ 0.10801624 0.11583211 0.02874823 0.061732... |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL | [ 1.69016439e-02 7.13498285e-03 -7.81233795e-... |


In [2]:
print("first title : \n",title_raw[0])
print("first text : \n ",text_raw[0])
print("first titleandtext : \n ",titleandtext_raw[0])

first title : 
 ['you', 'can', 'smell', 'hillary’s', 'fear']
first text : 
first titleandtext : 


# Preprocessing

After importing the data, we preprocess the dataset to filter out uninformative noise. We first remove all stopwords and punctuation, then lemmatize and stem the remaining tokens to produce the "cleaned" version of each corpus.

## Remove noise data from raw dataset

In [3]:
# Haorui Ji 2018-9-24
# Check how many texts are empty; they are the "noise" in the dataset and we shall remove them

count = 0
for text in text_raw:
    if len(text) == 0:
        count = count + 1

print(count)

36


In [4]:
deleteindex = []
idx = -1
for text in text_raw:
    idx += 1
    num_words = len(text)
    if num_words == 0:
        deleteindex.append(idx)
        
y=np.delete(y_raw,deleteindex)
title = np.delete(title_raw,deleteindex)
text = np.delete(text_raw,deleteindex)
titleandtext = np.delete(titleandtext_raw,deleteindex)

print(len(title))
print(len(text))
print(len(titleandtext))
print(len(y))

6299
6299
6299
6299


## Preprocess

In [5]:
# Haorui Ji 2018-9-24
# Define a preprocessing function that lemmatizes, stems and removes stopwords/punctuation from the corpus
# Preprocess the title corpus, the text corpus and the titleandtext corpus

import nltk
from itertools import chain

english_stemmer = nltk.stem.SnowballStemmer('english')
english_lemmatizer = nltk.stem.WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english')
punct = punctuation

def lemmatize_tokens(tokens, lemmatizer):
    lemmatized = []
    for doc in tokens:
        lemmatized.append([lemmatizer.lemmatize(token) for token in doc])
    return lemmatized

def stem_tokens(tokens, stemmer):
    stemmed = []
    for doc in tokens:
        stemmed.append([stemmer.stem(token) for token in doc])
    return stemmed

def clean_tokens(tokenized_list,lemmatize=True,stem=True):
    # Build the removal set once; membership tests on a set are much faster than on a chain
    removable = set(punct) | set(stopwords)
    tokens = []
    for doc in tokenized_list:
        tokens.append([token for token in doc if token not in removable])
    tokens_cleaned = tokens
    if lemmatize:
        tokens_cleaned = lemmatize_tokens(tokens_cleaned, english_lemmatizer)
    if stem:
        tokens_cleaned = stem_tokens(tokens_cleaned, english_stemmer)
      
    return tokens_cleaned

# The function is named clean_tokens so that the clean_text variable below does not overwrite it
clean_titleandtext=clean_tokens(titleandtext,True,True)
clean_title=clean_tokens(title,True,True)
clean_text=clean_tokens(text,True,True)

In [6]:
print("first title cleaned : \n",clean_title[0])
print("first text cleaned : \n ",clean_text[0])
print("first titleandtext cleaned : \n ",clean_titleandtext[0])

first title cleaned : 
 ['smell', 'hillari', 'fear']
first text cleaned : 
first titleandtext cleaned : 


## Get the real number version of labels

In [7]:
# MengZe Zhang 2018-9-23
# Convert the string labels to integers: FAKE -> 0, REAL -> 1

realnumber_y = np.array([0 if label == 'FAKE' else 1 for label in y]).astype(int)

# Relation Between titles and texts

## Concatenation
Our first question is how titles and texts relate, and how to take full advantage of the information in the dataset, especially for a fake-news dataset. First of all, we regard titles as part of the corpus and train on them together with the text. We do this by concatenating the title and text tokens of each article into one document, training a single Word2Vec model on the combined corpus, and averaging the word vectors of each document. We then check whether the resulting vector provides sufficient information. The vectorization method is Word2Vec.

In [8]:
# Haorui Ji 2018-9-27
# Train a w2v model based on the whole corpus

model = Word2Vec(clean_titleandtext,
                       size = 300,
                       window = 5,
                       min_count = 0,
                       sg = 0,
                       alpha = 0.025,
                       iter=10,
                       batch_words = 10000)

In [9]:
# Haorui Ji 2018-9-24
# Build document vectors by averaging word vectors (naive Doc2Vec)
# Documents with no text were already removed above (deleteindex)

titleandtext_list = []
idx = -1
for review in clean_titleandtext:
    idx += 1
    init_vec = np.zeros([300,])
    num_words = len(review)
    for word in review:
        init_vec += model.wv[word]
    init_vec /= num_words
    titleandtext_list.append(init_vec)

In [10]:
titleandtext_array=np.asarray(titleandtext_list)
print(titleandtext_array.shape)

(6299, 300)


In [11]:
# Haorui Ji 2018-9-24
# Some data in the vectors are NaN, so we initialize them with mean value

# titleandtext
print(np.isnan(titleandtext_array).any())
titleandtext_array[np.isnan(titleandtext_array)] = np.mean(titleandtext_array[~np.isnan(titleandtext_array)])
print(np.isnan(titleandtext_array).any())

False
False


Now that we have the vectors, the next step is to find an accurate classifier. We tried four classification methods: Logistic Regression, Random Forest, XGBoost and SVC. We also intended to try a multinomial naive Bayes (MNB) model, but MNB does not accept negative inputs, and some elements of our vectors are inevitably negative.

In [12]:
clf1=LogisticRegression()
clf2=RandomForestClassifier()
clf3=XGBClassifier()
clf4=SVC(probability=True)

for clf,label in zip([clf1,clf2,clf3,clf4],['LR','RF','XGB','SVM']):
    scores=cross_validation.cross_val_score(clf,titleandtext_array,realnumber_y,scoring='f1')
    print("text_f1_score: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

text_f1_score: 0.90 (+/- 0.00) [LR]
text_f1_score: 0.87 (+/- 0.01) [RF]


text_f1_score: 0.90 (+/- 0.00) [XGB]
text_f1_score: 0.88 (+/- 0.00) [SVM]


##  Weighted Average

Titles are usually treated as part of the corpus and trained together with the text. However, the title is often the most informative part of an article, and a lot of fake news is clickbait. We wondered whether training on the title alone and combining the result with a model trained on the text corpus would give a better result. Therefore, we build two separate models, one trained on the text corpus and the other on the title corpus, and take the weighted average of their results. The weight of each model is chosen based on its classification performance.

In [13]:
# MengZe Zhang 2018-9-23
# train two w2v models, one trained on text corpus, the other trained on title corpus


model1 = Word2Vec(clean_title,
                       size = 300,
                       window = 5,
                       min_count = 0,
                       sg = 0,
                       alpha = 0.025,
                       iter=10,
                       batch_words = 10000)
model2 = Word2Vec(clean_text,
                        size = 300,
                        window = 5,
                        min_count = 1,
                        sg = 0,
                        alpha = 0.025,
                        iter=10,
                        batch_words = 10000)


In [14]:
# Haorui Ji 2018-9-24
# Build document vectors by averaging word vectors (naive Doc2Vec)
# Documents with no text were already removed above (deleteindex)

title_list = []
text_list = []
idx = -1
for title,text in zip(clean_title,clean_text):
    idx += 1
    init_vec_title = np.zeros([300,])
    init_vec_text = np.zeros([300,])
    num_words_title = len(title)
    num_words_text = len(text)
        
    for tiword in title:
        init_vec_title += model1.wv[tiword]    
    init_vec_title /= num_words_title
    title_list.append(init_vec_title)
    for teword in text:
        init_vec_text += model2.wv[teword]
    init_vec_text /= num_words_text
    text_list.append(init_vec_text)



In [15]:
# Haorui Ji 2018-9-27
# generate title and text vectors respectively

# This time, the "titleandtext" vectors are generated by concatenating the title and text vectors,
# because both make up our training and test sets

titleandtext_list=[]
for title,text in zip(title_list,text_list):
    titleandtext_list.append([*list(title),*list(text)])
titleandtext_array=np.asarray(titleandtext_list)
title_array=np.asarray(title_list)
text_array=np.asarray(text_list)
print(titleandtext_array.shape)
print(title_array.shape)
print(text_array.shape)

(6299, 600)
(6299, 300)
(6299, 300)


In [16]:
# Haorui Ji 2018-9-24
# Some data in the vectors are NaN, so we initialize them with mean value

# titleandtext
print(np.isnan(titleandtext_array).any())
titleandtext_array[np.isnan(titleandtext_array)] = np.mean(titleandtext_array[~np.isnan(titleandtext_array)])
print(np.isnan(titleandtext_array).any())

#title
print(np.isnan(title_array).any())
title_array[np.isnan(title_array)] = np.mean(title_array[~np.isnan(title_array)])
print(np.isnan(title_array).any())

#text
print(np.isnan(text_array).any())
text_array[np.isnan(text_array)] = np.mean(text_array[~np.isnan(text_array)])
print(np.isnan(text_array).any())

True
False
True
False
False
False
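
The `True` values printed above are consistent with documents that have no tokens left after cleaning (for example, a title made only of stopwords): dividing a zero vector by a word count of 0 produces NaN in NumPy. The following is a minimal sketch of an averaging step that guards against this case instead of patching NaNs afterwards. The `toy_vectors` table and the `average_vector` helper are hypothetical stand-ins for the trained Word2Vec model and are not part of the notebook.

```python
def average_vector(tokens, vectors, dim):
    """Return the mean of the known word vectors, or a zero vector if there are none."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            continue  # skip words that have no vector
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    if count == 0:
        return total  # empty document: avoid dividing by zero
    return [t / count for t in total]


# Hypothetical 2-dimensional embeddings, for illustration only
toy_vectors = {"smell": [1.0, 0.0], "fear": [0.0, 1.0]}

doc_vec = average_vector(["smell", "fear"], toy_vectors, dim=2)
empty_vec = average_vector([], toy_vectors, dim=2)
print(doc_vec)
print(empty_vec)
```

Returning a zero vector for an empty document is only one possible choice; the notebook instead replaces NaN entries with the mean of all non-NaN entries.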


We first evaluate the models based on the text corpus.

In [17]:
# MengZe Zhang 2018-9-23
# test the text vectors on four different classifiers(using crossvalidation)
# evaluate their mean value and std of f-1 values

w_text_array = text_array

for clf,label in zip([clf1,clf2,clf3,clf4],['LR','RF','XGB','SVM']):
    scores=cross_validation.cross_val_score(clf,w_text_array,realnumber_y,scoring='f1')
    print("text_f1_score: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

text_f1_score: 0.90 (+/- 0.00) [LR]
text_f1_score: 0.87 (+/- 0.01) [RF]


text_f1_score: 0.90 (+/- 0.00) [XGB]
text_f1_score: 0.88 (+/- 0.00) [SVM]


Then we evaluate the model trained only on title corpus.

In [18]:
# MengZe Zhang 2018-9-23
# test the title vectors on four different classifiers(using crossvalidation)
# evaluate their mean value and std of f-1 values

w_title_array = title_array

for clf,label in zip([clf1,clf2,clf3,clf4],['LR','RF','XGB','SVM']):
    scores=cross_validation.cross_val_score(clf,w_title_array,realnumber_y,scoring='f1')
    print("text_f1_score: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

text_f1_score: 0.60 (+/- 0.01) [LR]
text_f1_score: 0.62 (+/- 0.01) [RF]


text_f1_score: 0.67 (+/- 0.00) [XGB]
text_f1_score: 0.60 (+/- 0.01) [SVM]


To combine the results of different models, we built an ensemble classifier that computes the weighted average of the models' predicted probabilities. The EnsembleClassifier class is shown below:

In [19]:
# MengZe Zhang 2018-9-23
# Define the EnsembleClassifier class

class EnsembleClassifier():
    
    def __init__(self,textclfs,titleclfs,weights):
        self.textclfs=textclfs
        self.titleclfs=titleclfs
        self.weights=weights

    def fit(self,doc_vec,title_vec, y):
        for text_clf in self.textclfs:
            text_clf.fit(doc_vec,y)
        for title_clf in self.titleclfs:
            title_clf.fit(title_vec,y)
        #print("fitting process is over")
    def predict(self, title_vec,doc_vec):
        # Note: the argument order here is (title_vec, doc_vec), the reverse of fit()
        print("predict process begins")
        avg=self.predict_proba(doc_vec,title_vec)
        results=[]
        for item in avg:
            if(item[0]>item[1]):
                results.append('FAKE')
            else:
                results.append('REAL')
        print("predict process is over")
        return np.asarray(results)

    def predict_proba(self, doc_vec, title_vec):
        self.probas_ = [np.array(text_clf.predict_proba(doc_vec),dtype='float') for text_clf in self.textclfs]
        for title_clf in self.titleclfs:
            self.probas_.append(title_clf.predict_proba(title_vec))
        avg = np.average(self.probas_, axis=0, weights=self.weights)
#        for result,weight in zip(self.probas_,self.weights):
#            sumresult+=result*weight
#            sumweight+=weight
#        avg=sumresult/sumweight
        print(avg)
        return avg  

The EnsembleClassifier constructor takes three parameters: the list of classifiers to train on the text corpus, the list of classifiers to train on the title corpus, and the list of weights for all classifiers given in the first two parameters. Our previous tests show that the best classifier for the text corpus is Logistic Regression (tied with XGBoost at 0.90) and the best for the title corpus is XGBoost. We therefore used these two classifiers to build our ensemble classifier, and since the F1 score of the title-based model is fairly low, we set its weight to 0.5 against 1 for the text-based model.
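
To make the weighting concrete, here is a small sketch of the weighted average that the ensemble computes for a single article. The probability values are invented for illustration, and `weighted_average` is a hypothetical helper that mirrors what `np.average(..., axis=0, weights=...)` does in `predict_proba`.

```python
def weighted_average(probas, weights):
    """Combine per-classifier [P(FAKE), P(REAL)] pairs into one weighted average."""
    total_weight = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[k] for p, w in zip(probas, weights)) / total_weight
        for k in range(n_classes)
    ]


# Invented probabilities for one article: the text model leans REAL, the title model leans FAKE
text_proba = [0.2, 0.8]
title_proba = [0.6, 0.4]

combined = weighted_average([text_proba, title_proba], weights=[1, 0.5])
label = "FAKE" if combined[0] > combined[1] else "REAL"
print(combined, label)
```

With a weight of 0.5, the title model can only overturn the text model when the text model is close to undecided, which is the intent of down-weighting the weaker classifier.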

In [20]:
# MengZe Zhang 2018-9-23
# divide title corpus, text corpus and labels into training set and testing set

teandti_train,teandti_test,y_train,y_test=tts(titleandtext_list,y,test_size=0.33,random_state=22)
ultraclf=EnsembleClassifier(textclfs=[clf1],titleclfs=[clf3],weights=[1,0.5])
text_train=[]
title_train=[]
text_test=[]
title_test=[]
for train in teandti_train:
    title_train.append(train[0:300])
    text_train.append(train[300:])
for test in teandti_test:
    title_test.append(test[0:300])
    text_test.append(test[300:])

In [21]:
# MengZe Zhang 2018-9-23
# use the ensemble classifier to train on the title and text training vectors
# make predictions on the title and text test vectors
# evaluate its performance

text_train=np.asarray(text_train)
text_test=np.asarray(text_test)
title_train=np.asarray(title_train)
title_test=np.asarray(title_test)
ultraclf.fit(text_train,title_train,y_train)
prediction=ultraclf.predict(title_test,text_test)
target_names=['FAKE','REAL']
print(cr(y_test,prediction,target_names=target_names,digits=4))

predict process begins
[[0.24226075 0.75773924]
 [0.93897326 0.06102675]
 [0.17325811 0.82674189]
 ...
 [0.03689138 0.96310862]
 [0.86441097 0.13558904]
 [0.08708648 0.91291352]]
predict process is over
             precision    recall  f1-score   support

       FAKE     0.8810    0.9138    0.8971      1021
       REAL     0.9137    0.8809    0.8970      1058

avg / total     0.8977    0.8971    0.8971      2079

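
As a quick check on the report above, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the table; `f1_from` is a hypothetical helper name, not a scikit-learn function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the classification report
fake_f1 = f1_from(0.8810, 0.9138)
real_f1 = f1_from(0.9137, 0.8809)
print(round(fake_f1, 4), round(real_f1, 4))
```

Both values agree with the f1-score column of the report to four decimal places.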


# Using different embedding methods


We think the weighted average of title-based and text-based models performs no better because the information provided by titles is either too sparse to influence the final result or conflicts too much with the information provided by the text. We therefore expect that combining vectors produced by different vectorization methods may provide more information. The combination method we tried is again concatenation.

## TF-IDF by ourselves

We would like to strengthen the representation learned by the word vectors. One way is to multiply the word vectors obtained earlier by a tf-idf coefficient, that is, to combine word2vec with tf-idf so that each word carries a weight reflecting its importance. However, directly applying the tf-idf models in gensim and sklearn did not meet our need to extract the tf-idf coefficient of a particular word, so we wrote our own.

In [22]:
# Haorui Ji 2018-9-24
# Compute tf-idf coefficient for every word

from collections import Counter

# Count words' frequency in title and text
countlist = []
for i in range(len(clean_titleandtext)):
    count = Counter(clean_titleandtext[i])
    countlist.append(count)

print(countlist[0])



In [23]:
import math

def tf(word, count):
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)
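
A worked example on a three-document toy corpus may help to see what these functions return. The sketch below re-implements the same formulas in one hypothetical helper, `toy_tfidf`, so that it runs on its own; the toy documents are invented. Note that with the `1 +` smoothing in the denominator, a word that appears in every document gets a negative idf, and therefore a negative weight.

```python
import math
from collections import Counter

# Invented toy corpus: "email" appears in every document, "fbi" in only one
toy_docs = [["fbi", "email", "fbi"], ["email", "war"], ["email", "peace"]]
toy_counts = [Counter(doc) for doc in toy_docs]


def toy_tfidf(word, count, count_list):
    """Same formulas as above: tf = relative frequency, idf = log(N / (1 + df))."""
    term_freq = count[word] / sum(count.values())
    doc_freq = sum(1 for c in count_list if word in c)
    inverse_doc_freq = math.log(len(count_list) / (1 + doc_freq))
    return term_freq * inverse_doc_freq


fbi_score = toy_tfidf("fbi", toy_counts[0], toy_counts)      # tf = 2/3, idf = log(3/2)
email_score = toy_tfidf("email", toy_counts[0], toy_counts)  # tf = 1/3, idf = log(3/4) < 0
print(fbi_score, email_score)
```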

In [24]:
import math

tf_idf_manual = []

for i, count in enumerate(countlist):
    scores = {word: tfidf(word, count, countlist) for word in count}
    if i%100 == 0:
        print(i)
    tf_idf_manual.append(scores)
    

0
100
200
...
6200


In [25]:
print("Top words in document 1")
sorted_words = sorted(tf_idf_manual[0].items(), key=lambda x: x[1], reverse=True)
for word, score in sorted_words[:]:
    print("\tWord: {}, TF-IDF: {}".format(word, round(score, 5)))


Top words in document 1
	Word: fbi, TF-IDF: 0.0727
	Word: comey, TF-IDF: 0.04622
	Word: fbi., TF-IDF: 0.04513
	Word: hillari, TF-IDF: 0.02468
	Word: clinton, TF-IDF: 0.02458
	Word: scandal, TF-IDF: 0.02204
	Word: kgb., TF-IDF: 0.02164
	Word: it, TF-IDF: 0.02163
	Word: panicked., TF-IDF: 0.02083
	Word: email, TF-IDF: 0.01979
	Word: clintonworld, TF-IDF: 0.01968
	Word: awkward, TF-IDF: 0.01922
	Word: hatch, TF-IDF: 0.01886
	Word: setup, TF-IDF: 0.01673
	Word: fear, TF-IDF: 0.01607
	Word: preemptiv, TF-IDF: 0.016
	Word: unpreced, TF-IDF: 0.01447
	Word: smell, TF-IDF: 0.01431
	Word: fear., TF-IDF: 0.01392
	Word: bigger, TF-IDF: 0.01373
	Word: investig, TF-IDF: 0.01365
	Word: fire., TF-IDF: 0.01321
	Word: doj, TF-IDF: 0.01239
	Word: assault, TF-IDF: 0.01216
	Word: gone, TF-IDF: 0.01196
	Word: bizarr, TF-IDF: 0.01193
	Word: countless, TF-IDF: 0.01181
	Word: war, TF-IDF: 0.01172
	Word: fbi,, TF-IDF: 0.01166
	Word: whatev, TF-IDF: 0.01162
	Word: computer?, TF-IDF: 0.01139
	Word: spinmeist, TF-

Then we apply these coefficients to the word2vec model trained earlier on the titleandtext corpus to see whether its performance improves.

In [26]:
# Haorui Ji 2018-9-24
# Build document vectors from word vectors weighted by their tf-idf coefficients, divided by the word count
# Documents with no text were already removed above (deleteindex)

titleandtext_list = []
idx = -1
for review in clean_titleandtext:
    idx += 1
    init_vec = np.zeros([300,])
    num_words = len(review)
        
    for word in review:
        init_vec = init_vec + model.wv[word] * tf_idf_manual[idx][word]
    init_vec /= num_words
    titleandtext_list.append(init_vec)
        
titleandtext_array=np.asarray(titleandtext_list)

In [27]:
clf1=LogisticRegression()
clf2=RandomForestClassifier()
clf3=XGBClassifier()
clf4=SVC(probability=True)

for clf,label in zip([clf1,clf2,clf3,clf4],['LR','RF','XGB','SVM']):
    scores=cross_validation.cross_val_score(clf,titleandtext_array,realnumber_y,scoring='f1')
    print("text_f1_score: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

text_f1_score: 0.82 (+/- 0.01) [LR]
text_f1_score: 0.85 (+/- 0.00) [RF]


text_f1_score: 0.88 (+/- 0.01) [XGB]
text_f1_score: 0.67 (+/- 0.00) [SVM]


The results show that multiplying the word2vec vectors by the tf-idf coefficient does not work well: every classifier scores lower than with the unweighted average.

## Word2Vec

In [28]:
# Haorui Ji 2018-9-24
# Build document vectors by averaging word vectors (naive Doc2Vec)
# Documents with no text were already removed above
# Optional: whether to multiply by the tf-idf coefficient

titleandtext_list = []
idx = -1
for review in clean_titleandtext:
    idx += 1
    init_vec = np.zeros([300,])
    num_words = len(review)
    for word in review:
        init_vec += model.wv[word]
    init_vec /= num_words
    titleandtext_list.append(init_vec)

In [29]:
# Haorui Ji 2018-9-24

w_titleandtext_array=np.asarray(titleandtext_list)
print(w_titleandtext_array.shape)

print(np.isnan(w_titleandtext_array).any())
w_titleandtext_array[np.isnan(w_titleandtext_array)] = np.mean(w_titleandtext_array[~np.isnan(w_titleandtext_array)])
print(np.isnan(w_titleandtext_array).any())

(6299, 300)
False
False


## Doc2Vec

In [30]:
# Haorui Ji 2018-9-24
# take doc vectors by using d2v method trained on titleandtext corpus

tagged_docs = [TaggedDocument(doc, tags=[idx]) for idx, doc in enumerate(clean_titleandtext)]
pretrained_emb = 'GoogleNews-vectors-negative300.bin'
doc2vec = Doc2Vec(tagged_docs,size=300, window=5, min_count=5, dm = 0.5, iter=10, pretrained_emb=pretrained_emb)
doc2vec.train(tagged_docs, epochs=50, total_examples=doc2vec.corpus_count)

new_list2=[]
for i in range(doc2vec.docvecs.count):
    new_list2.append(doc2vec.docvecs[i])
    
d_titleandtext_array=np.asarray(new_list2)
print(d_titleandtext_array.shape)



(6299, 300)


## TF-IDF

In [32]:
# Weihao Tan 2018-9-27
# Generate tf-idf vector using gensim 

from gensim import corpora, models

dictionary = corpora.Dictionary(clean_titleandtext)

small_freq_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq < 10]
dictionary.filter_tokens(small_freq_ids)
dictionary.compactify()
print(dictionary)

corpus = [dictionary.doc2bow(text) for text in clean_titleandtext]
tfidf_model = models.TfidfModel(corpus=corpus,
                                dictionary=dictionary)
corpus_tfidf = [tfidf_model[doc] for doc in corpus]
lsi_model = models.LsiModel(corpus = corpus_tfidf, 
                            id2word = dictionary, 
                            num_topics=300)
corpus_lsi = [lsi_model[doc] for doc in corpus]
#print(corpus_lsi[0])

data = []
rows = []
cols = []
line_count = 0
for line in corpus_lsi:  # corpus_lsi holds the LSI vectors generated by gensim above
    for elem in line:
        rows.append(line_count)
        cols.append(elem[0])
        data.append(elem[1])
    line_count += 1
lsi_sparse_matrix = csr_matrix((data,(rows,cols))) # sparse matrix
lsi_matrix = lsi_sparse_matrix.toarray()  # dense matrix

tf_titleandtext_array=np.asarray(lsi_matrix)
print(tf_titleandtext_array.shape)

Dictionary(17949 unique tokens: ['60%', 'abedin.', 'abus', 'accus', 'act']...)
(6299, 300)
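
The conversion from gensim's sparse `(topic_id, value)` pairs to a dense matrix can be sketched in plain Python as follows. The helper `to_dense` and the toy data are hypothetical. Passing the width explicitly keeps every row at the same length even when some topic ids never occur, whereas `csr_matrix` without a `shape` argument infers the size from the largest indices it sees.

```python
def to_dense(sparse_docs, num_topics):
    """Turn a list of [(topic_id, value), ...] documents into fixed-width dense rows."""
    dense = []
    for doc in sparse_docs:
        row = [0.0] * num_topics
        for topic_id, value in doc:
            row[topic_id] = value
        dense.append(row)
    return dense


# Invented sparse vectors; the last document is empty
toy_lsi = [[(0, 0.5), (2, -0.1)], [(1, 0.3)], []]
dense_rows = to_dense(toy_lsi, num_topics=3)
print(dense_rows)
```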


In [33]:
# MengZe Zhang 2018-9-23
# Concatenate the w2v, d2v and tf-idf (LSI) vectors of each document into one feature vector

feature_list=[]
for word_vec,doc_vec,tfidf_vec in zip(w_titleandtext_array,d_titleandtext_array, tf_titleandtext_array):
    feature_list.append([*list(word_vec),*list(doc_vec), *list(tfidf_vec)])
    
feature_array=np.asarray(feature_list)
print(feature_array.shape)

print(np.isnan(feature_array).any())
feature_array[np.isnan(feature_array)] = np.mean(feature_array[~np.isnan(feature_array)])
print(np.isnan(feature_array).any())

(6299, 900)
False
False


In [34]:
# MengZe Zhang 2018-9-23
# test the concatenate vectors on four classifiers.

clf1=LogisticRegression()
clf2=RandomForestClassifier()
clf3=XGBClassifier()
clf4=SVC(probability=True)

for clf,label in zip([clf1,clf2,clf3,clf4],['LR','RF','XGB','SVM']):
    scores=cross_validation.cross_val_score(clf,feature_array,realnumber_y,scoring='f1')
    print("text_f1_score: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))

text_f1_score: 0.93 (+/- 0.00) [LR]
text_f1_score: 0.87 (+/- 0.01) [RF]


text_f1_score: 0.91 (+/- 0.00) [XGB]
text_f1_score: 0.94 (+/- 0.00) [SVM]


# Multi-Layer Perceptron

In [62]:
print(feature_array.shape)
print(realnumber_y.shape)

import torch
from torch.autograd import Variable
from torch import nn

feature_tensor = torch.from_numpy(feature_array).type(torch.FloatTensor)
realnumber_y_tensor = torch.from_numpy(realnumber_y).type(torch.LongTensor)

x, y = Variable(feature_tensor), Variable(realnumber_y_tensor)

print(x.shape)
print(y.shape)

(6299, 900)
(6299,)
torch.Size([6299, 900])
torch.Size([6299])


In [63]:
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden_1, n_hidden_2, n_output):
        super(Net, self).__init__()
        self.hidden_1 = nn.Sequential(
                                        torch.nn.Linear(n_feature, n_hidden_1),
                                        torch.nn.Dropout(0.5),  # drop 50% of the neuron
                                        torch.nn.ReLU()
        )
        
        self.hidden_2 = nn.Sequential(
                                        torch.nn.Linear(n_hidden_1, n_hidden_2),
                                        torch.nn.Dropout(0.5),  # drop 50% of the neuron
                                        torch.nn.ReLU()
        )
        
        self.out = torch.nn.Linear(n_hidden_2, n_output)

    def forward(self, x):
        x = self.hidden_1(x)
        x = self.hidden_2(x)
        x = self.out(x)
        return x

net = Net(n_feature=900, n_hidden_1=100, n_hidden_2=10, n_output=2)
print(net)

Net(
  (hidden_1): Sequential(
    (0): Linear(in_features=900, out_features=100, bias=True)
    (1): Dropout(p=0.5)
    (2): ReLU()
  )
  (hidden_2): Sequential(
    (0): Linear(in_features=100, out_features=10, bias=True)
    (1): Dropout(p=0.5)
    (2): ReLU()
  )
  (out): Linear(in_features=10, out_features=2, bias=True)
)


In [64]:
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_func = torch.nn.CrossEntropyLoss()

In [65]:
import torch.nn.functional as F

for epoch in range(10):
    
    out = net(x)
    loss = loss_func(out, y)  # CrossEntropyLoss takes the raw logits (values before softmax)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 2 == 0:
        # The predicted class is the one with the highest probability after softmax.
        # torch.max returns both the maximum along a dimension and the index of that maximum.
        prediction = torch.max(F.softmax(out, dim=1), 1)[1]  # index of the maximum along dimension 1
        pred_y = prediction.data.numpy().squeeze()
        target_y = y.data.numpy()
        accuracy = sum(pred_y == target_y) / len(target_y)  # fraction of training predictions that match the true labels
        print('Accuracy=%.4f' % accuracy)

Accuracy=0.5040
Accuracy=0.8306
Accuracy=0.8930
Accuracy=0.9028
Accuracy=0.9235
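
The prediction step in the loop takes the arg max of the softmax output. Because softmax preserves the ordering of its inputs, the arg max of the raw logits gives the same class, so the softmax call is only needed when the probabilities themselves are of interest. A minimal sketch in plain Python, with invented logits and hypothetical helper names:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.2, -0.4]  # invented network outputs for one article
probs = softmax(logits)
print(probs, argmax(probs), argmax(logits))
```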


The results are considerably better than with the averaged word vectors alone: on the concatenated features, Logistic Regression improves from 0.90 to 0.93 and SVC from 0.88 to 0.94, XGBoost improves slightly (0.90 to 0.91), and Random Forest stays at 0.87. This suggests that the combination is working.

## Using the Ensemble Voting Classifier
Now that we have our desired vectors, we need to decide which classifier to use. Instead of using a single classifier, a voting classifier that takes several classifiers' results into account before giving a final classification may be the best choice. With a small change to its parameters, the EnsembleClassifier class we created earlier can do this.

We take the three best-performing classifiers from the previous results: clf1 (Logistic Regression), clf3 (XGBoost) and clf4 (SVC). Since we no longer use separate title vectors, we pass all three models as the first parameter of the EnsembleClassifier and set all their weights to 1. Before that, we use randomized search on each classifier to tune its hyperparameters.

In [31]:
# MengZe Zhang 2018-9-23
# setting three sets of parameters for random grid search
# random_grid1 for clf1,random_grid3 for clf3,random_grid4 for clf4

c=np.linspace(1,50,200)
random_grid1={'C':c}


random_grid3 = {
        'min_child_weight': [1, 5, 10],
        'gamma': np.linspace(0.1,5,100),
        'max_depth': list(range(1,9,1)),
        'n_estimators': list(range(200,2000,10))
        }


random_grid4= {'C':linspace(1,50,200),'gamma':linspace(0.001,1,200), 'kernel':['linear','rbf']}
n_iter_search=100
random_search1=RandomizedSearchCV(clf1,cv=5,param_distributions=random_grid1,n_iter=n_iter_search,return_train_score=True,n_jobs=6)
random_search3=RandomizedSearchCV(clf3,cv=5,param_distributions=random_grid3,n_iter=n_iter_search,return_train_score=True,n_jobs=6)
random_search4=RandomizedSearchCV(clf4,cv=5,param_distributions=random_grid4,n_iter=n_iter_search,return_train_score=True,n_jobs=6)
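
To give a sense of scale, the sketch below counts the combinations in a grid shaped like `random_grid3` and draws 100 random candidates from it. It uses only the standard library; `sample_candidates` is a hypothetical helper that samples with replacement, which is a simplification and not a description of how `RandomizedSearchCV` samples internally.

```python
import random

# Same shape as random_grid3, with np.linspace(0.1, 5, 100) written out in plain Python
toy_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.1 + i * (5 - 0.1) / 99 for i in range(100)],
    "max_depth": list(range(1, 9)),
    "n_estimators": list(range(200, 2000, 10)),
}

# Total number of distinct parameter combinations in the grid
grid_size = 1
for values in toy_grid.values():
    grid_size *= len(values)


def sample_candidates(grid, n_iter, seed):
    """Draw n_iter random parameter settings, one value per parameter."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_iter)
    ]


candidates = sample_candidates(toy_grid, n_iter=100, seed=0)
print(grid_size, len(candidates), candidates[0])
```

The 100 sampled settings cover only a small fraction of the grid, and with `cv=5` each search still fits 500 models, which helps explain why the XGBoost search is slow.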

In [32]:
# MengZe Zhang 2018-9-23
# divide the whole corpus into three parts: training set (50%), validation set (25%) and test set (25%)
test_size=0.25
validation_size=0.25
test_validation_size=test_size+validation_size
test_size1=test_size/test_validation_size
x_train,x_testandvalid,y_train,y_testandvalid=tts(wordanddoc_array,final_y,test_size=test_validation_size,random_state=22)
x_valid,x_test,y_valid,y_test=tts(x_testandvalid,y_testandvalid,test_size=test_size1,random_state=22)
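
The two-step split works out as follows: the first call holds out 0.25 + 0.25 = 0.5 of the data, and the second call splits that held-out half with a ratio of 0.25 / 0.5 = 0.5, giving roughly 50% training, 25% validation and 25% test data. The sketch below reproduces the arithmetic on index lists with a hypothetical `split_indices` helper; it is not scikit-learn's `train_test_split`, and the exact set sizes may differ by one from what scikit-learn produces.

```python
import random


def split_indices(n, holdout_fraction, seed):
    """Shuffle range(n) and return (kept, held_out) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


n_samples = 6299
train_idx, rest_idx = split_indices(n_samples, holdout_fraction=0.5, seed=22)
valid_pos, test_pos = split_indices(len(rest_idx), holdout_fraction=0.5, seed=22)
valid_idx = [rest_idx[i] for i in valid_pos]
test_idx = [rest_idx[i] for i in test_pos]

for name, part in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]:
    print(name, len(part), round(len(part) / n_samples, 3))
```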

In [33]:
random_search1.fit(x_valid,y_valid)

RandomizedSearchCV(cv=5, error_score='raise',
          estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=100, n_jobs=6,
          param_distributions={'C': array([ 1.     ,  1.24623, ..., 49.75377, 50.     ])},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)

In [34]:
random_search3.fit(x_valid,y_valid)

KeyboardInterrupt: 

In [None]:
random_search4.fit(x_valid,y_valid)

In [None]:
# MengZe Zhang 2018-9-23
# assign all three best estimators of three random gridsearch to clf1, clf3 and clf4
# use the ensemble classifier to train and test on the training set and test set.
# evaluate its performance.

clf1=random_search1.best_estimator_
clf3=random_search3.best_estimator_
#clf4=random_search4.best_estimator_
ultraclf=EnsembleClassifier(textclfs=[clf1,clf3,clf4],titleclfs=[],weights=[1,1,1])
ultraclf.fit(x_train,x_train,y_train)
prediction=ultraclf.predict(x_test,x_test)
target_names=['FAKE','REAL']
print(cr(y_test,prediction,target_names=target_names,digits=4))

The final result given by the voting classifier is better than that of every one of its component classifiers: it reaches an F1 score of over 0.94, as shown above. As always, we present the confusion matrix of the results from our best classifier.

In [None]:
# MengZe Zhang 2018-9-23
# define a function that plots the confusion matrix
# This is a compact version of the original function (the original is here: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html)

import matplotlib.pyplot as plt
from sklearn import metrics
import itertools
def plot_confusion_matrix(cm,classes,title='Confusion Matrix',cmap=plt.cm.Blues):
    plt.imshow(cm,interpolation='nearest',cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]),range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# MengZe Zhang 2018-9-23
# plot the confusion matrix

%matplotlib inline
cm=metrics.confusion_matrix(y_test,prediction,labels=target_names)
plot_confusion_matrix(cm,classes=target_names)
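
For readers who want to see what the plotted matrix contains, the following sketch counts the entries by hand. Rows correspond to the true label and columns to the predicted label, the same layout scikit-learn uses. The helper `confusion_counts` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count (true label, predicted label) pairs into a square matrix."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


# Invented labels for five articles
toy_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL"]
toy_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

toy_cm = confusion_counts(toy_true, toy_pred, labels=["FAKE", "REAL"])
print(toy_cm)
```

The diagonal holds the correctly classified articles, and the off-diagonal cells hold the two kinds of error: fake news predicted as real, and real news predicted as fake.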