##  1. INTRODUCTION

This project aims to predict the truth of a set of news. Indeed, given the content of a news and how it has been spread on Twitter, we would like to predict if the news is real or fake. For this purpose, we have at our disposal a data set containing news and users who share these news. A part of these news are labeled contrary to others. Our objective is to use network and text mining tools to predict unlabeled news labels based on labeled news.

We start by loading all useful python libraries for this task.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics, model_selection
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import NMF, LatentDirichletAllocation
#from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae


import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.collocations import *
from nltk import tokenize

import networkx as nx
import pandas as pd
from gensim.models import Word2Vec
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib
import itertools as it
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import os
import random as rdm
import lightgbm as lgb
import xgboost as xgb

%matplotlib inline

pathTrainingSet = '/media/kabore/Data/M2STAT/web_mining/data_competition/news/training/'
pathTestSet = '/media/kabore/Data/M2STAT/web_mining/data_competition/news/test/'
userUserPath = '/media/kabore/Data/M2STAT/web_mining/data_competition/UserUser.txt'
newsUserPath = '/media/kabore/Data/M2STAT/web_mining/data_competition/newsUser.txt'
labelTrainPath = '/media/kabore/Data/M2STAT/web_mining/data_competition/labels_training.txt'

The data sets to use that are described as follows:

--training: this directory contains the content of the news in the training set. Each file is a news stored in the txt format and contains the title, the summary and the content of the news. The name of the txt file is @news_id.txt where @news_id is the identifier of the news.

--test: this directory contains the content of the news in the test set. The news are represented in the same format than for the training directory.

--newsUser.txt: the news-user relationship. For example, '240 1 1' means news 240 is posted/spreaded by user 1 for 1 time.

--UserUser.txt: the user-user relationship. For example, '1589 1' means user 1589 is following user 1;

--labels_training.txt: indicate whether the news in the training set is fake (1) or real (0). For example '23 0' means the news 23 is real.

A good question is to know how to make these predictions. Our main idea is to first build a graph to store all useful informations(news, users, their features and characteristics, relation between users and news, users and users, etc), then make a text mining analysis on this data sets, and finally we predict the label of each news using some useful informations(extract from the features) for prediction and the xgboost algorithm.

We start by the graph mining part.

## 2. GRAPH MINING 

In this section, we create a graph and store all users which share labeled and unlabeled news. Each user node has as attribute :
- followers : indicating the total number of users following current user,
- follows : indicating the total number of users followed by current user,
- totalShare : indicating the total number of news shared by current user,
- totalFake : indicating the total number of fake news among all news shared by current user, 
- totalDiffShare : indicating the total number of different news shared by current user,
- labeled : indicating the total number of labeled news among current user totalDiffShare news
- totalDiffFake : indicating the total number of fake news among user labeled news

In [5]:
g = nx.DiGraph()

usersId = []
with open(userUserPath, 'r') as userUser :
    for users in userUser :
        user1 = 'u' + users[:users.index('\t')]
        user2 = 'u' + users[users.index('\t')+1:len(users)-1]
        
        if user1 not in g.nodes :
            g.add_node(user1, nodeType = 'user', followers = 0, follows = 0, totalShare = 0, totalFake = None,
                       totalDiffShare = 0, labeled = None, totalDiffFake = None)
            usersId.append(user1)
        g.node[user1]['follows'] += 1
            
        
        if user2 not in g.nodes() :
            g.add_node(user2, nodeType = 'user', followers = 0, follows = 0, totalShare = 0, totalFake = None,
                       totalDiffShare = 0, labeled = None, totalDiffFake = None)
            usersId.append(user2)
        g.node[user2]['followers'] += 1
            
        
        if not g.has_edge(user1, user2) :
            g.add_edge(user1, user2, edgeType = 'user edge')

We load in the same graph news with these attributes :
- view : indicating the total number the news has been seen,
- share : indicating the total number news has been shared,
- soloUser : juste to know the presence of isolate user
- fake : to know text label. None for unkown label and 1 for fake news, 0 for good news,
- topics : to store news topics by importance

Note that we specify node type which is not very necessary here because nodes names already makes difference among them.
We also store users names and news names respectively in two lists to ease manipulating them.

In [6]:
newsUser = pd.read_csv(newsUserPath, sep = '\t', header=None)
labelTrain = pd.read_csv(labelTrainPath, sep = ',', header=0, index_col=0)
newsList = []
soloUser = []
userNews = dict()
labeledNews = ['n'+str(index) for index in list(labelTrain.index)]

for i in range(newsUser.shape[0]) :
    news = 'n' + str(newsUser[0][i])
    user = 'u' + str(newsUser[1][i])
    sharingOccurency = newsUser[2][i]
    if user not in list(userNews.keys()) :
        userNews[user] = list()
    
    if not g.has_edge(news, user) :
        userNews[user].append(news)
        if news not in newsList :
            g.add_node(news, nodeType = 'news', view = 0, share = 0, soloUser = 0, fake = None, topics = list())
            if news in labeledNews :
                g.node[news]['fake'] = labelTrain.at[int(news[1:]), 'class']
            newsList.append(news)
        g.node[news]['share'] += sharingOccurency
        g.node[news]['view'] += (sharingOccurency*g.node[user]['followers'] + 1)
    
        if user not in usersId :
            g.add_node(user, nodeType = 'user', followers = 0, follows = 0, totalShare = 0, totalFake = None,
                       totalDiffShare = 0, labeled = None, totalDiffFake = None)
            g.node[news]['soloUser'] += 1
            soloUser.append(user)
            usersId.append(user)
        g.node[user]['totalShare'] += sharingOccurency
        g.node[user]['totalDiffShare'] += 1
        if g.node[news]['fake'] != None :
            if g.node[user]['totalFake'] == None :
                g.node[user]['totalFake'] = 0
            
            if g.node[user]['totalDiffFake'] == None :
                g.node[user]['totalDiffFake'] = 0
            
            if g.node[news]['fake'] == 1 :
                g.node[user]['totalFake'] += sharingOccurency
                g.node[user]['totalDiffFake'] += 1
            
        g.add_edge(news, user, edgeType = 'news edge')
        if news in labeledNews :
            if g.node[user]['labeled'] == None :
                g.node[user]['labeled'] = 0
            g.node[user]['labeled'] += 1
        
unLabeledNews = list(set(newsList) - set(labeledNews))

Now that we have our graph with its nodes and features, let us now move to the text mining analysis part.

##  3. TEXT MINING 

The aim of this section is to make text mining analysis on our data sets. We start by loading the whole news in a dictionary called 'data'.

In [7]:
def buildText(textList, wordToDel) :
    '''
    function to read text
    '''
    result = ' '
    for sentence in textList :
        if sentence != wordToDel :
            result = result + sentence[:len(sentence)-2]
    return result



#reading training set
dataTraining = {}
for file in os.listdir(pathTrainingSet):
    iD = 'n' + file[:len(file)-4]
    with open(pathTrainingSet+file, 'r') as f :
        dataTraining[iD] = buildText(f.readlines(), '\n')


#reading test set
dataTest = {}
for file in os.listdir(pathTestSet):
    iD = 'n' + file[:len(file)-4]
    with open(pathTestSet+file, 'r') as f :
        dataTest[iD] = buildText(f.readlines(), '\n')
    

data = dataTraining.copy()
data.update(dataTest)

### 3.1 Cleaning the data set

In this section, we clean the data. For this purpose:
- we first remove the stopwords;
- the, we stem all remaining words, that is to keep only each remaining word root.

In [8]:
stopWords = stopwords.words('english')
stemming = PorterStemmer()
tokenizer = tokenize.RegexpTokenizer(r'\w+')

def dataFilter(data, stopWords) :
    '''
    Function to filter data and only keep important words
    '''
    returnValue = {}
    for key in data.keys() :
        returnValue[key] = [stemming.stem(word) for word in tokenizer.tokenize(data[key].lower()) if word not in stopWords and word.isalpha()]
    return returnValue


tokenazation = dataFilter(data, stopWords)

#setting up the cleaned data
cleanData = {}
for key in tokenazation.keys() :
    cleanData[key] = ' '.join(tokenazation[key])

cleanDataTraining = {key : cleanData[key] for key in dataTraining.keys()}
cleanDataTest = {key : cleanData[key] for key in dataTest.keys()}


posTag = [nltk.pos_tag(tokenazation[key]) for key in tokenazation.keys()]

sPosTag = pd.Series()
for p in posTag :
    sPosTag = sPosTag.append(pd.Series(p))
    
#sPosTag.value_counts()

### 3.2 Focus on the Topic of News

In this section, we try to find topics for our news. We do this in order to isolate topics only dealing with fake news or good news, which could be used in the prediction part.

In [9]:
#rdm.seed(10)
nFeatures = 1000
countVect = CountVectorizer(max_df = 0.5, min_df = 2, max_features = nFeatures)
dataMatrice = countVect.fit_transform(cleanData.values())

nTopics = 15
lda = LatentDirichletAllocation(n_components = nTopics, learning_method = 'batch', random_state=10).fit(dataMatrice)
ldaMatW = lda.transform(dataMatrice)
ldaMatH = lda.components_

print(ldaMatW.shape)
print(ldaMatH.shape)

(240, 15)
(15, 1000)


Function to display topics

In [19]:
def display_topics_full(H, W, feature_names, documents, no_top_words, no_top_documents):
    '''
    This function is to display each topic, his associated words, and news about it by order of importance. 
    '''
    for topic_idx, topic in enumerate(H):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for docIndex in top_doc_indices:
            print(docIndex)
            #print(documents['n'+str(docIndex+1)])
        print("------------------------")
        
noTopWords = 10
noTopDocs = 10

#display_topics_full(ldaMatH, ldaMatW, countVect.get_feature_names(), cleanData, noTopWords, noTopDocs)

Setting up texts topics

In [10]:
noTopDocs = len(cleanData.keys())
def settingTextTopics(H, W, feature_names, no_top_documents):
    '''
    This function computes the different topics and stores texts associated to each
    topic by importance.
    '''
    topicMatrix = pd.DataFrame(np.zeros((H.shape[0], no_top_documents)))
    for topic_idx, topic in enumerate(H):
        top_doc_indices = ['n'+str(value+1) for value in list(np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents])]
        topicMatrix.loc[topic_idx] = top_doc_indices
    return topicMatrix
            

def setTextsTopics(graph, topics, newsId) :
    '''
    This function sets texts topics and orders them by importance.
    '''
    textsTopics = pd.DataFrame(np.zeros((len(newsId), topics.shape[0])), index = newsId)
    for news in newsId :
        for colId in range(topics.shape[1]) :
            graph.node[news]['topics'].extend(['t'+str(i+1) for i, iD in enumerate(topics[colId]) if news == iD])
        textsTopics.loc[news] = graph.node[news]['topics']
    return textsTopics
  
    
topicsMatrix = settingTextTopics(ldaMatH, ldaMatW, countVect.get_feature_names(), len(cleanData.keys()))


for iD in newsList :
    g.node[iD]['topics'] = list()

#textsTopics = pd.DataFrame(np.zeros((len(cleanData.keys()), nTopics)), index = newsList)
textsTopics = setTextsTopics(g, topicsMatrix, newsList)

The following script is use to print each news and the list of its main topics.

In [11]:
for iD in newsList :
    if len(g.node[iD]['topics']) == 0 :
        print(iD)
        
        
for iD in newsList :
    print(iD)
    print(g.node[iD]['topics'])

n240
['t14', 't2', 't3', 't8', 't12', 't13', 't15', 't1', 't4', 't5', 't10', 't11', 't9', 't6', 't7']
n124
['t13', 't5', 't11', 't10', 't8', 't2', 't4', 't15', 't3', 't12', 't14', 't1', 't9', 't6', 't7']
n162
['t6', 't14', 't9', 't8', 't2', 't1', 't3', 't4', 't5', 't12', 't15', 't11', 't10', 't13', 't7']
n233
['t3', 't11', 't15', 't6', 't10', 't14', 't2', 't8', 't7', 't9', 't1', 't4', 't5', 't12', 't13']
n50
['t5', 't2', 't8', 't4', 't12', 't15', 't3', 't10', 't13', 't14', 't1', 't6', 't9', 't11', 't7']
n155
['t12', 't3', 't7', 't6', 't8', 't2', 't14', 't15', 't4', 't1', 't10', 't5', 't9', 't13', 't11']
n227
['t15', 't14', 't8', 't9', 't2', 't1', 't3', 't5', 't12', 't4', 't11', 't10', 't13', 't6', 't7']
n31
['t14', 't9', 't8', 't1', 't2', 't3', 't4', 't5', 't12', 't15', 't11', 't10', 't13', 't6', 't7']
n41
['t14', 't3', 't8', 't10', 't13', 't9', 't2', 't1', 't5', 't12', 't15', 't4', 't11', 't6', 't7']
n59
['t10', 't14', 't9', 't8', 't2', 't1', 't3', 't5', 't4', 't12', 't15', 't11', 't1

In the part below, we build a dataframe containing all topics as index and as columns the number of news dealing with each topic, the number of labeled news, the number of fake news and the ratio of fake news which is the ratio of fake column and labeled column.
As we can see in this dataframe, there are not topics associated to only fake news or only good news. And this dataframe depends a lot on the randomness of the process.

In [12]:
textsMainTopics = pd.Series(textsTopics[0], index=textsTopics.index)
topicsSummary = pd.DataFrame(textsMainTopics.value_counts(), index=textsMainTopics.value_counts().index)
topicsSummary.columns = ['news']
topicsSummary['labeled'] = -1
topicsSummary['fake'] = -1
topicsSummary['ratio'] = -1
indexes = ['n'+ str(iD) for iD in labelTrain.index]
ratio = []

for index in topicsSummary.index :
    tempIndexes = [iD for iD in indexes if textsMainTopics[iD] == index]
    count = 0
    labeled = 0
    for i in tempIndexes :
        if g.node[i]['fake'] != None :
            labeled += 1
            if g.node[i]['fake'] == 1 :
                count = count + 1
    topicsSummary.at[index, 'fake'] = count
    topicsSummary.at[index, 'labeled'] = labeled
    #topicsSummary.at[index, 'ratio'] = round(count/topicsSummary.at[index, 'news'], 2)
    ratio.append(round(count/topicsSummary.at[index, 'labeled'], 2))
    indexes = list(set(indexes)-set(tempIndexes))
    
topicsSummary['ratio'] = ratio
topicsSummary.sort_values('ratio', ascending=False)

Unnamed: 0,news,labeled,fake,ratio
t12,14,10,8,0.8
t8,13,10,7,0.7
t4,18,15,10,0.67
t10,17,14,9,0.64
t1,15,12,7,0.58
t2,16,13,7,0.54
t13,18,16,8,0.5
t11,14,14,7,0.5
t14,12,8,4,0.5
t3,18,15,7,0.47


### 3.3 Useful Informations to extract from Users

We focus now on informations we can extract from users. We start by building a dataframe of users, containing as columns :
- number of users followed by current user,
- current user number of followers,
- its total number of shares,
- its total number of different shares,
- its total number of labeled news,
- its total number of fake news,
- its proportion of fake news

In [13]:
usersSummaryFeatures = ['follows', 'followers', 'totalShare', 'totalDiffShare', 'labeled', 'fake', 'proportion']
usersSummary = pd.DataFrame(np.full((len(usersId), 7), None), columns = usersSummaryFeatures, index=usersId)
nbFakeTreshold = 2
bornSup = 0.6
bornInf = 0.4

for user in usersId :
    usersSummary.at[user, 'follows'] = g.node[user]['follows']
    usersSummary.at[user, 'followers'] = g.node[user]['followers']
    usersSummary.at[user, 'totalShare'] = g.node[user]['totalShare']
    usersSummary.at[user, 'totalDiffShare'] = g.node[user]['totalDiffShare']
    
    if g.node[user]['labeled'] == None :
        usersSummary.at[user, 'labeled'] = usersSummary.at[user, 'fake'] = usersSummary.at[user, 'proportion'] = -1
    else :
        usersSummary.at[user, 'labeled'] = g.node[user]['labeled']
        usersSummary.at[user, 'fake'] = g.node[user]['totalDiffFake']
        usersSummary.at[user, 'proportion'] = usersSummary.at[user, 'fake']/usersSummary.at[user, 'labeled']
    
usersSummary = usersSummary.sort_values('proportion', ascending=False)

potentialFakeUsers = list(usersSummary.loc[(usersSummary['proportion'] == 1) & (usersSummary['totalDiffShare'] >= nbFakeTreshold)].index)
potentialGoodUsers = list(usersSummary.loc[(usersSummary['proportion'] == 0) & (usersSummary['totalDiffShare'] >= nbFakeTreshold)].index)

mediumUsers = list(usersSummary.loc[(usersSummary['proportion'] <= bornSup) & (usersSummary['proportion'] >= bornInf) & (usersSummary['totalDiffShare'] >= nbFakeTreshold)].index)

fakeUsers = list(usersSummary.loc[(usersSummary['proportion'] < 1) & (usersSummary['proportion'] > bornSup) & (usersSummary['totalDiffShare'] >= nbFakeTreshold)].index)
goodUsers = list(usersSummary.loc[(usersSummary['proportion'] > 0) & (usersSummary['proportion'] < bornInf) & (usersSummary['totalDiffShare'] >= nbFakeTreshold)].index)

From the dataframe we have just created, we extract some informations. These informations are :
- potential fake news sharers : these are users with proportion of fake news equals to 1,
- potential good news sharers : these are users with proportion of fake news equals to 0,
- fake news sharers : these are users with proportion of fake news between 0.6 and 1,
- good news sharers : these are users with proportion of fake news between 0 and 0.4
- medium news sharers : these are users with proportion of fake news between 0.4 and 0.6

For every set of users, we extract news the users shared. In the part below, we extract news potential fake news sharers and potential good news sharers shared in two lists. We update these list by deleting news already labeled and keep the remaining ones. There are some news shared together by some users in both two sets. So, we define two set of news, one containing news only shared by potential fake news sharers and not by the potential good news sharers, and the other containing news only shared by potential good news sharers and not by the potential fake news sharers.

In [14]:
fakeNewsSet = list()
for news in labeledNews :
    if g.node[news]['fake'] == 1 :
        fakeNewsSet.append(news)

goodNewsSet = list(set(labeledNews) - set(fakeNewsSet))

potentialFakeNews = list()
for user in potentialFakeUsers :
    potentialFakeNews.extend(userNews[user])

potentialFakeNews = list(set(potentialFakeNews))
potFakeNews = list(set(potentialFakeNews) - set(labeledNews))


potentialGoodNews = list()
for user in potentialGoodUsers :
    potentialGoodNews.extend(userNews[user])

potentialGoodNews = list(set(potentialGoodNews))
potGoodNews = list(set(potentialGoodNews) - set(labeledNews))

fakeSet = list(set(potFakeNews) - set(potGoodNews))
goodSet = list(set(potGoodNews) - set(potFakeNews))

We do the same process as before but now one fake news sharers set and good news sharers set. We update the lists we create before by adding in the set of fake news the news only shared by fake news sharers and adding in the set of good news the news only shared by good news sharers.

In [15]:
fakeNews = list()
for user in fakeUsers :
    fakeNews.extend(userNews[user])

fakeNews = list(set(fakeNews))
fakeNews = list(set(fakeNews) - set(labeledNews))

goodNews = list()
for user in goodUsers :
    goodNews.extend(userNews[user])

goodNews = list(set(goodNews))
goodNews = list(set(goodNews) - set(labeledNews))


onlyFake = list(set(fakeNews) - set(goodNews))
fakeSet = list(set(fakeSet + onlyFake))
onlyGood = list(set(goodNews) - set(fakeNews))
goodSet = list(set(goodSet + onlyGood))

We do not exactly do the same process below as before. Here after finding news shared by medium news sharers, we update this list by deleting labeled news, the fake set news and the good set news.

In [16]:
mediumNews = list()
for user in mediumUsers :
    mediumNews.extend(userNews[user])

mediumNews = list(set(mediumNews))
mediumNews = list(set(mediumNews) - set(labeledNews))
mediumNews = list(set(mediumNews) - set(fakeSet))
mediumNews = list(set(mediumNews) - set(goodSet))

In the whole data set, they are some news which are only once shared by some users which did not share anything before. So below, we try to get all these news.

In [17]:
userSharingOnlyOnce = list(usersSummary.loc[(usersSummary.totalDiffShare == 1) & (usersSummary.labeled == -1)].index)
shareOnlyOnce = []
[shareOnlyOnce.extend(userNews[user]) for user in userSharingOnlyOnce]
shareOnlyOnce = list(set(shareOnlyOnce))

Up to now, we have build our graph and make some text mining analysis on it. Let us now build the model prediction.

## 4. Model Prediction

### 4.1 Defining the dataframe for the Prediction

In this part, we define the dataframe we will use for the prediction. This dataframe has as index the news names and as columns :
- potFakeNews : we have two values for this columns. 1 for the potential fake news we defined before and for the real fake news, 0 for the remaining news,
- potGoodNews : the same thing but 1 for the potential good news we defined before and for the real good news, 0 for the remaining news,
- fakeSet : 1 for the news in the fakeset we defined and for the real fake news, 0 for the rest,
- goodSet : 1 for the news in the goodset we defined and for the real good news, 0 for the rest,
- mediumSet : we have here : 
    - the label for each labeled news
    - for the news in the medium Set, the label will be 1 if users which share this news use to follow users sharing fake news and 0 otherwise
    - or we give label of one of the column we defined before.

In [18]:
features = ['potFakeNews', 'potGoodNews', 'fakeSet', 'goodSet', 'mediumSet', 'fake']
df = pd.DataFrame(np.full((len(newsList), len(features)), None), columns=features, index=newsList)


for news in newsList :
    temp = usersSummary.loc[list(g.neighbors(news))]
    temp = temp.loc[(temp.totalDiffShare > 1) & (temp.labeled >=1)]
    fakeUsersSize = len(list(temp.loc[temp.proportion >= 0.5].index))
    allUsersSize = len(list(temp.index))
    boolValue = False
    if fakeUsersSize > (allUsersSize/2) :
        boolValue = True

    df.at[news, 'fake'] = g.node[news]['fake']
    
    if news in potentialFakeNews :
        if boolValue :
            df.at[news, 'potFakeNews'] = 1
        else :
            df.at[news, 'potFakeNews'] = 0
    else :
        df.at[news, 'potFakeNews'] = 0
        
    if news in potentialGoodNews :
        if not boolValue :
            df.at[news, 'potGoodNews'] = 1
        else :
            df.at[news, 'potGoodNews'] = 0
    else :
        df.at[news, 'potGoodNews'] = 0
        
    if news in list(set(potentialFakeNews + fakeSet + fakeNewsSet)) :
        if boolValue :
            df.at[news, 'fakeSet'] = 1
        else :
            df.at[news, 'fakeSet'] = 0
    else :
        df.at[news, 'fakeSet'] = 0
        
    if news in list(set(potentialGoodNews + goodSet + goodNewsSet)) :
        if not boolValue :
            df.at[news, 'goodSet'] = 1
        else :
            df.at[news, 'goodSet'] = 0
    else :
        df.at[news, 'goodSet'] = 0
     
    fakeLabel = False
    if (df.at[news, 'potFakeNews'] == 1) or (df.at[news, 'fakeSet'] == 1) :
        fakeLabel = True
        
    goodLabel = False
    if (df.at[news, 'potGoodNews'] == 1) or (df.at[news, 'goodSet'] == 1) :
        goodLabel = True
        
    if news in labeledNews :
        df.at[news, 'mediumSet'] = g.node[news]['fake']
    if news in mediumNews :
        if boolValue :
            df.at[news, 'mediumSet'] = 1
        else :
            df.at[news, 'mediumSet'] = 0
    elif news in shareOnlyOnce and (fakeLabel == goodLabel):
        tempFrame = usersSummary.loc[list(g.neighbors(news))]
        hisUsers = list(tempFrame.loc[(tempFrame.totalDiffShare == 1) & (tempFrame.labeled == -1)].index)
        usersTemp = list()
        [usersTemp.extend(g.neighbors(user)) for user in hisUsers]
        usersTemp = list(set(usersTemp))
        tempFrame = usersSummary.loc[usersTemp]
        tempFrame = tempFrame.loc[tempFrame.totalDiffShare >=5]
        tempFrame = tempFrame.loc[(tempFrame.proportion != -1) & ((tempFrame.totalDiffShare - tempFrame.labeled > 0))]
        tempFakeUsersSize = len(list(tempFrame.loc[tempFrame.proportion >= 0.5].index))
        tempAllUsersSize = len(list(tempFrame.index))
        if tempFakeUsersSize > (tempAllUsersSize/2) :
            df.at[news, 'mediumSet'] = 1
        else :
            df.at[news, 'mediumSet'] = 0
    else :
        if fakeLabel :
            df.at[news, 'mediumSet'] = 1
        elif goodLabel :
            df.at[news, 'mediumSet'] = 0

del features

df.potFakeNews = df.potFakeNews.astype(int)
df.potGoodNews = df.potGoodNews.astype(int)
df.fakeSet = df.fakeSet.astype(int)
df.mediumSet = df.mediumSet.astype(int)
df.goodSet = df.goodSet.astype(int)
#df.fake = df.fake.astype(int)

### 4.2 Prediction

In this section, We start by training the data only considering labeled news, then we build a model on the whole data set to predict the unlabeled news. We use the xgboost algorithm which is one of the most efficient machine learning algorithm.

For this purpose, we will train de data nbSimution times and compute the mean error.
For each iteration, we create our training set and test set, make the prediction on the test set and compute the mean squared error between the test set real labels and the prediced labels.

In [19]:
trainingSize = round(len(labeledNews)*0.7)
testSize = round(len(labeledNews)*0.3)
if (trainingSize + testSize) != len(labeledNews) :
    trainingSize = len(labeledNews) - testSize

indexesGroups = labelTrain.groupby(by=['class'], sort=False)
for key in indexesGroups.indices.keys() :
    indexesGroups.groups[key]=['n'+str(index) for index in list(indexesGroups.groups[key])]
    
modelFeatures = list(df.columns[:(len(df.columns)-1)])
nbSimulations = 500
fakeSize = round(round(len(indexesGroups.groups[1])/len(labeledNews), 1)*trainingSize)
xgbModel = xgb.XGBClassifier()
xgbErrors = list()
#lgbModel = lgb.LGBMClassifier(objective = 'binary', random_state=5)
#lgbErrors = list()

for i in range(nbSimulations) :
    trainingSample = list(pd.Series(indexesGroups.groups[1]).sample(fakeSize))
    trainingSample.extend(list(pd.Series(indexesGroups.groups[0]).sample(trainingSize-fakeSize)))
    testSample = list(set(labeledNews) - set(trainingSample))
    df.loc[trainingSample]['fake'] = df.loc[trainingSample]['fake'].astype(int)
    y = df.loc[trainingSample]['fake']
    X = df.loc[trainingSample].drop('fake', axis = 1)
    
    xgbModel.fit(X, y.values)
    xgbPred = xgbModel.predict(df.loc[testSample].drop('fake', axis = 1))
    #lgbModel.fit(X, y.values)
    #lgbPred = lgbModel.predict(df.loc[testSample].drop('fake', axis = 1))
    
    xgbErrors.append(mse(xgbPred, df.loc[testSample]['fake']))
    #lgbErrors.append(mse(lgbPred, df.loc[testSample]['fake']))

  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:

In [21]:
print('the error is : {n}'.format(n = np.mean(xgbErrors)))

the error is : 0.0009310344827586208


we have a good prediction. Now we build a model based on the whole data set to predict the unlabeled news.

In [22]:
xgbModel.fit(df.loc[labeledNews].drop('fake', axis = 1), df.loc[labeledNews].fake)
testPred = xgbModel.predict(df.loc[unLabeledNews].drop('fake', axis = 1))

  if diff:


### 4.3 The Result

The result is given by the following dataframe where the first columns contains the index of unlabeled news, and the second column 'class' take the value 1 if the news is fake and 0 otherwise.

In [23]:
newsId = [int(news[1:len(news)]) for news in unLabeledNews]
result = pd.DataFrame(-1, columns=['doc', 'result'], index=unLabeledNews)
result.doc = newsId
result.result = testPred
result.columns = ['doc', 'class']
result = result.sort_values('doc')
result

## 5. SUBMISSION

We submit a csv file described as the above table. We create this file as follows.

In [None]:
result.to_csv('oursubmission.csv')