# Automatic fact-checking project using pandas and scikit-learn

### Work by: Bouali Mohammed-Amin, Oussama Nassim Sehout, Chahinez Benallal, Abdellah Choukri



####  Outline:
####   . Loading the dataset
####   . Text preprocessing
####   . Converting text to numerical values
####   . Splitting the dataset into training and test data
####   . Training and prediction using classification models

### Step 1: Loading the dataset
We generated three CSV files from the ClaimsKG site, one with true claims, one with false claims, and one with mixture claims, then concatenated them to build our dataset.

In [2]:
import glob
import os

import pandas as pd

# Concatenate the CSV files and read them into a single dataframe.
# Each file keeps its own row labels, so index values repeat across files.
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join("DataSet", "*.csv"))))

row, col = df.shape
print("Size of the dataframe :" + str(df.size) + " ("+str(row)+"*"+str(col)+")")

df

Size of the dataframe :342328 (24452*14)


Unnamed: 0,id,text,date,truthRating,ratingName,author,headline,named_entities_claim,named_entities_article,keywords,source,sourceURL,link,language
0,http://data.gesis.org/claimskg/claim_review/36...,'There will be no public funding for abortion ...,2010-03-21,3,True,Bart Stupak,Stupak revises abortion stance on health care ...,"Abortion rights,Barack Obama,Bart Stupak,Ben N...",abortion,"Abortion,Health Care",politifact,http://www.politifact.com,http://www.politifact.com/truth-o-meter/statem...,English
1,http://data.gesis.org/claimskg/claim_review/e6...,Central Health 'is the only hospital district ...,2011-03-15,3,True,Wayne Christian,State Rep. Wayne Christian says Central Health...,"Austin American-Statesman,Harris County Hospit...",,Abortion,politifact,http://www.politifact.com,http://www.politifact.com/texas/statements/201...,English
2,http://data.gesis.org/claimskg/claim_review/e0...,Says most of Perry's chiefs of staff have been...,2010-08-14,3,True,Bill White,Bill White says most of Gov. Rick Perry's chie...,"AT&T,Bill Clements,Bill White,Bracewell & Giul...",,Ethics,politifact,http://www.politifact.com,http://www.politifact.com/texas/statements/201...,English
3,http://data.gesis.org/claimskg/claim_review/48...,Says 'as Co-Chair of the Joint Ways & Means Co...,2012-09-28,3,True,Mary Nolan,Did Mary Nolan secure funding for Milwaukie br...,"Carolyn Tomei,Dave Hunt,Fetsch,Jeff Merkley,Ka...",Portland-Milwaukie Light Rail project,"State Budget,State Finances,Transportation",politifact,http://www.politifact.com,http://www.politifact.com/oregon/statements/20...,English
4,http://data.gesis.org/claimskg/claim_review/80...,Says Gary Farmer’s claim that he 'received an ...,2016-07-08,3,True,Jim Waldman,Florida Senate candidate never actually receiv...,"Gary Farmer,Gwyndolen Clarke-Reed,Jim Waldman,...",Gary Farmer,Guns,politifact,http://www.politifact.com,http://www.politifact.com/florida/statements/2...,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,http://data.gesis.org/claimskg/claim_review/41...,'The city that I was mayor of was 50 percent L...,2018-06-07,2,MIXTURE,Lou Barletta,Hazleton wasn’t half-Latino when Lou Barletta ...,"Barletta,Bob Casey,Fox Business Network,Hazlet...",Latino,Immigration,politifact,http://www.politifact.com,http://www.politifact.com/pennsylvania/stateme...,English
9996,http://data.gesis.org/claimskg/claim_review/9e...,A set of images show a congenital anomaly that...,Unknown,2,MIXTURE,Unknown,Is This a Real Hand With Eight Fingers?,"American Society for Surgery of the Hand,Bilas...",,"hands, Medical, medical anomalies, mirror hand",snopes,http://www.snopes.com,https://www.snopes.com/fact-check/real-hand-ei...,English
9997,http://data.gesis.org/claimskg/claim_review/1a...,'You see 34 people (are) murdered every single...,2012-12-23,2,MIXTURE,Cory Booker,Cory Booker says 34 Americans are killed by gu...,"ABC,Centers for Disease Control and Prevention...",,"Crime,Guns",politifact,http://www.politifact.com,http://www.politifact.com/new-jersey/statement...,English
9998,http://data.gesis.org/claimskg/claim_review/28...,'White men have committed more mass shootings ...,2017-10-02,2,MIXTURE,Newsweek,Are white males responsible for more mass shoo...,"2015 San Bernardino shooting,Aurora, Colo,Foll...",,Guns,politifact,http://www.politifact.com,http://www.politifact.com/punditfact/statement...,English
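
The dataframe printed above has 24452 rows but its last row label is 9998, because each CSV file keeps its own numbering after concatenation (with pandas, `pd.concat(..., ignore_index=True)` would renumber the rows). As a minimal sketch of the same loading step with a fresh running index and a per-label count, here is a standard-library version. The file names and the two-column layout in `FILES` are hypothetical stand-ins, not the real ClaimsKG export.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-ins for the three exported files (true, false, mixture).
FILES = {
    "true.csv": "text,ratingName\nclaim a,True\nclaim b,True\n",
    "false.csv": "text,ratingName\nclaim c,False\n",
    "mixture.csv": "text,ratingName\nclaim d,MIXTURE\nclaim e,MIXTURE\n",
}

def load_all(files):
    """Read every CSV and return one list of row dicts with a unique running index."""
    rows = []
    for name in sorted(files):
        for row in csv.DictReader(io.StringIO(files[name])):
            row["row_id"] = len(rows)  # unique across files, unlike per-file indices
            rows.append(row)
    return rows

rows = load_all(FILES)
label_counts = Counter(row["ratingName"] for row in rows)
print(len(rows), dict(label_counts))
```

Counting the labels right after loading is a cheap way to see how balanced the three classes are before any model is trained.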


### Step 2: Text preprocessing
 We applied the following preprocessing steps:
     . Lowercasing the text
     . Stripping leading and trailing whitespace
     . Removing punctuation
     . Removing stop words
     . Replacing negation words with the word 'not'
     . Removing non-ASCII characters
     . Lemmatization
     . Removing the 10 most frequent words
     . Spelling correction ##TODO

In [3]:
# Start of preprocessing.

# Required imports:
import re
import unicodedata
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')


# Only the text column is lowercased for now; the other columns may need the same treatment later
df['text'] = df['text'].str.lower()


# Strip leading and trailing whitespace
df['text'] = df['text'].str.strip()


# Stop-word removal:
# English stop-word list
stop_words = set(stopwords.words("english"))
# English negation words, all of which are replaced by 'not'
# (fixes the original 'haven''t', which Python read as 'havent', and the misspelled "shant't")
negation = {"haven't", 'cannot', "doesn't", "shouldn't", "needn't", "shan't", "weren't", "hasn't", "wasn't", "didn't", "aren't", 'not', "mightn't", "mustn't", 'no', "wouldn't", "won't", "isn't", "hadn't", "don't", "couldn't"}

from nltk.tokenize import TweetTokenizer
a = df['text'].str.replace("’", "'")  # normalize curly apostrophes so contractions match
pat = r'\b(?:{})\b'.format('|'.join(negation))
a = a.str.replace(pat, 'not', regex=True)  # regex=True must be explicit since pandas 2.0

text_without_stopwords = []  # texts after stop-word removal
tk = TweetTokenizer()
to_drop = stop_words - negation  # stop words to remove; 'not' and 'no' are kept
for _, text in a.items():  # iterate over all rows (iteritems() was removed in pandas 2.0)
    tokens = tk.tokenize(str(text))
    result = [tok for tok in tokens if tok not in to_drop]
    text_without_stopwords.append(" ".join(result))  # join the tokens back together

df['text'] = text_without_stopwords


# The following code removes this set of symbols [!”#$%&’()*+,-./:;<=>?@[\]^_`{|}~]:
# Raw string with '-', '[' and ']' escaped so the character class is unambiguous
punct_re = re.compile(r'[!”#$%&’()*+,\-./:;<=>?@\[\]^_`{|}~]')
a = df['text']  # text column as processed so far
text_without_punctuation = [punct_re.sub('', str(text)) for _, text in a.items()]

df['text'] = text_without_punctuation


# The following code removes non-ASCII characters:
text_without_ascii = []  # result
a = df['text']  # text column as processed so far
for _, text in a.items():  # iterate over all rows
    # NFKD splits accented letters into base letter + combining mark; the mark is then dropped
    cleaned = unicodedata.normalize('NFKD', str(text)).encode("ascii", "ignore").decode("ascii")
    text_without_ascii.append(cleaned)

df['text'] = text_without_ascii


# Lemmatization:
lemmatizer = WordNetLemmatizer()
text_lemmatizer = []  # result
a = df['text']  # text column as processed so far
for _, ligne in a.items():  # iterate over all rows
    wordList = re.sub(r"[^\w]", " ", ligne).split()  # split the row into words
    newList = []
    for word in wordList:
        if not word.isdigit():  # keep words, drop pure numbers
            newList.append(lemmatizer.lemmatize(word, pos='v'))  # lemmatize as a verb
    text_lemmatizer.append(" ".join(newList))

df['text'] = text_lemmatizer





# Removal of the most common words:
word_counter = Counter()
for sentence in df["text"].values:
    word_counter.update(sentence.split())
most = word_counter.most_common(10)
print("most common word"+str(most))
print("Suppression of the common word de nos artiles : ")
most_word = {w for (w, wc) in most}
def delmost_word(sentence):
    return " ".join([word for word in str(sentence).split() if word not in most_word])
df["text"] = df["text"].apply(delmost_word)

df

# End of preprocessing.

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/choukri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/choukri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/choukri/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


most common word[('say', 4888), ('s', 3903), ('not', 2872), ('state', 2269), ('president', 1825), ('obama', 1699), ('percent', 1672), ('show', 1558), ('tax', 1509), ('trump', 1433)]
Suppression of the common word de nos artiles : 


Unnamed: 0,id,text,date,truthRating,ratingName,author,headline,named_entities_claim,named_entities_article,keywords,source,sourceURL,link,language
0,http://data.gesis.org/claimskg/claim_review/36...,public fund abortion legislation,2010-03-21,3,True,Bart Stupak,Stupak revises abortion stance on health care ...,"Abortion rights,Barack Obama,Bart Stupak,Ben N...",abortion,"Abortion,Health Care",politifact,http://www.politifact.com,http://www.politifact.com/truth-o-meter/statem...,English
1,http://data.gesis.org/claimskg/claim_review/e6...,central health hospital district texas spend t...,2011-03-15,3,True,Wayne Christian,State Rep. Wayne Christian says Central Health...,"Austin American-Statesman,Harris County Hospit...",,Abortion,politifact,http://www.politifact.com,http://www.politifact.com/texas/statements/201...,English
2,http://data.gesis.org/claimskg/claim_review/e0...,perry chiefs staff lobbyists,2010-08-14,3,True,Bill White,Bill White says most of Gov. Rick Perry's chie...,"AT&T,Bill Clements,Bill White,Bracewell & Giul...",,Ethics,politifact,http://www.politifact.com,http://www.politifact.com/texas/statements/201...,English
3,http://data.gesis.org/claimskg/claim_review/48...,cochair joint ways mean committee secure key p...,2012-09-28,3,True,Mary Nolan,Did Mary Nolan secure funding for Milwaukie br...,"Carolyn Tomei,Dave Hunt,Fetsch,Jeff Merkley,Ka...",Portland-Milwaukie Light Rail project,"State Budget,State Finances,Transportation",politifact,http://www.politifact.com,http://www.politifact.com/oregon/statements/20...,English
4,http://data.gesis.org/claimskg/claim_review/80...,gary farmer claim receive nra absolute lie,2016-07-08,3,True,Jim Waldman,Florida Senate candidate never actually receiv...,"Gary Farmer,Gwyndolen Clarke-Reed,Jim Waldman,...",Gary Farmer,Guns,politifact,http://www.politifact.com,http://www.politifact.com/florida/statements/2...,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,http://data.gesis.org/claimskg/claim_review/41...,city mayor latino vote,2018-06-07,2,MIXTURE,Lou Barletta,Hazleton wasn’t half-Latino when Lou Barletta ...,"Barletta,Bob Casey,Fox Business Network,Hazlet...",Latino,Immigration,politifact,http://www.politifact.com,http://www.politifact.com/pennsylvania/stateme...,English
9996,http://data.gesis.org/claimskg/claim_review/9e...,set image congenital anomaly result hand eight...,Unknown,2,MIXTURE,Unknown,Is This a Real Hand With Eight Fingers?,"American Society for Surgery of the Hand,Bilas...",,"hands, Medical, medical anomalies, mirror hand",snopes,http://www.snopes.com,https://www.snopes.com/fact-check/real-hand-ei...,English
9997,http://data.gesis.org/claimskg/claim_review/1a...,see people murder every single day gunfire ame...,2012-12-23,2,MIXTURE,Cory Booker,Cory Booker says 34 Americans are killed by gu...,"ABC,Centers for Disease Control and Prevention...",,"Crime,Guns",politifact,http://www.politifact.com,http://www.politifact.com/new-jersey/statement...,English
9998,http://data.gesis.org/claimskg/claim_review/28...,white men commit mass shoot group,2017-10-02,2,MIXTURE,Newsweek,Are white males responsible for more mass shoo...,"2015 San Bernardino shooting,Aurora, Colo,Foll...",,Guns,politifact,http://www.politifact.com,http://www.politifact.com/punditfact/statement...,English
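
The preprocessing steps of this section can be condensed into a single function. The following is a minimal sketch using only the standard library, so it leaves out lemmatization and the removal of the most frequent words. `STOP_WORDS` and `NEGATIONS` are tiny illustrative lists and `preprocess` is a hypothetical helper name; the notebook itself relies on NLTK's full English stop-word list.

```python
import re
import unicodedata

# Tiny illustrative lists (assumptions); the notebook uses NLTK's English stop words.
STOP_WORDS = {"the", "a", "of", "is", "for", "will", "be", "there", "that"}
NEGATIONS = {"no", "not", "cannot", "don't", "doesn't", "isn't", "won't"}

# One pattern matching any negation word as a whole word.
NEGATION_RE = re.compile(
    r"\b(?:{})\b".format("|".join(re.escape(w) for w in sorted(NEGATIONS)))
)

def preprocess(text):
    """Lowercase, unify negations as 'not', drop non-ASCII, punctuation, stop words, numbers."""
    text = text.lower().strip().replace("’", "'")
    text = NEGATION_RE.sub("not", text)
    # NFKD separates accents from letters; encoding to ASCII then drops the accents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    tokens = [t for t in text.split() if t not in STOP_WORDS and not t.isdigit()]
    return " ".join(tokens)

print(preprocess("There will be NO public funding for abortion!"))
print(preprocess("Café doesn’t open 24 hours"))
```

Negations are rewritten before stop-word filtering on purpose: most contractions such as "doesn't" are themselves stop words, and removing them first would silently flip the meaning of a claim.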


### Step 3: Converting text to numerical values
 We convert our text data to numerical values by applying a LabelEncoder to the metadata columns. The text and keywords columns are the exception: we transform them with TF-IDF instead.
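
To make the label-encoding idea concrete, here is a minimal pure-Python sketch of what scikit-learn's `LabelEncoder` does: each distinct value receives an integer code according to its position in sorted order. `fit_label_encoder` is a hypothetical helper, and the value `"other"` is an invented third source added only for illustration.

```python
def fit_label_encoder(values):
    """Map each distinct value to its rank in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

# "politifact" and "snopes" appear in the dataset; "other" is a made-up example value.
sources = ["politifact", "snopes", "politifact", "other"]
mapping = fit_label_encoder(sources)
encoded = [mapping[s] for s in sources]
print(mapping)
print(encoded)
```

The resulting integers carry an arbitrary order (here `other` < `politifact` < `snopes`). A model that treats features as continuous quantities, such as Gaussian naive Bayes, will interpret that order as meaningful, which is worth keeping in mind when reading the scores further down.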

In [4]:
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer




#np.random.seed(500)  # used to get the same result on every run
# Convert every value to a string, because some columns mixed strings and floats
# (applymap is deprecated since pandas 2.1, so map each column instead)
df = df.apply(lambda column: column.map(str))

# Apply TF-IDF to the text column and the keywords column
Tfidf_vect = TfidfVectorizer(max_features=1000, sublinear_tf=True, min_df=200, norm='l2', encoding='latin-1', ngram_range=(1, 2), stop_words='english')
tfidfx = Tfidf_vect.fit_transform(df['text'])
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
df1 = pd.DataFrame(tfidfx.toarray(), columns=Tfidf_vect.get_feature_names_out())
Tfidf_vect = TfidfVectorizer(max_features=30)
print(df['keywords'])
tfidfx = Tfidf_vect.fit_transform(df['keywords'])
df2 = pd.DataFrame(tfidfx.toarray(), columns=Tfidf_vect.get_feature_names_out())
print(df1)


# Label-encode the columns we are going to use, with one LabelEncoder per column
encoded_columns = ['id', 'date', 'author', 'headline', 'named_entities_claim', 'named_entities_article', 'source', 'sourceURL', 'link', 'language']
encoders = {}
for column in encoded_columns:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

# join_axes was removed in pandas 1.0. df carries repeated row labels from the
# concatenation in step 1, so the TF-IDF frames are aligned by position instead.
training = pd.concat([df.reset_index(drop=True), df1, df2], axis=1)








0                                    Abortion,Health Care
1                                                Abortion
2                                                  Ethics
3              State Budget,State Finances,Transportation
4                                                    Guns
                              ...                        
9995                                          Immigration
9996       hands, Medical, medical anomalies, mirror hand
9997                                           Crime,Guns
9998                                                 Guns
9999    Corporations,Economy,Health Care,Public Health...
Name: keywords, Length: 24452, dtype: object
       act  actually  administration  allow  america  american  americans  \
0      0.0       0.0             0.0    0.0  0.00000       0.0        0.0   
1      0.0       0.0             0.0    0.0  0.00000       0.0        0.0   
2      0.0       0.0             0.0    0.0  0.00000       0.0        0.0   
3      0.
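
The TF-IDF weights printed above can be reproduced by hand on a toy corpus. The sketch below is a simplified pure-Python version that follows the formulas documented for `TfidfVectorizer` with `sublinear_tf=True`, smoothed IDF and L2 normalization; it ignores `max_features`, `min_df`, n-grams and stop-word filtering. The function name `tfidf` and the third toy sentence are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs, sublinear_tf=True):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        weights = {}
        for term, count in Counter(tokens).items():
            tf = 1 + math.log(count) if sublinear_tf else count
            weights[term] = tf * idf[term]
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = [
    "public fund abortion legislation",   # from the preprocessed output above
    "perry chiefs staff lobbyists",       # from the preprocessed output above
    "public health tax",                  # made-up toy sentence
]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, "public" also occurs in the third one, so it receives a lower weight than "abortion", which is unique to that document. That is the behaviour TF-IDF is chosen for: terms shared by many claims say less about any single claim.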



### Step 4: Splitting the dataset into training and test data
We shuffle the dataframe and split it into training and test sets. The intended split is 80% for training and 20% for testing, but the cell below passes test_size=0.7, so the recorded run trains on 30% of the rows and tests on the remaining 70% (17117 of 24452). We then select the feature columns of interest for classification and try several column combinations to reach a better accuracy (#TODO).

In [5]:
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# test_size=0.7 holds out 70% of the rows for testing.
# The seed changes on every run, so results are not reproducible;
# pass a fixed integer as random_state to make them so.
X_train, X_test = train_test_split(df, test_size=0.7, random_state=int(time.time()))


gnb = GaussianNB()
used_features =[
    "id",
    "date",
    "author",
    "named_entities_claim",
    "named_entities_article",
    "source",
    "sourceURL",
    "link", "language"
]

gnb.fit(
    X_train[used_features].values,
    X_train["ratingName"]
)

# Predict on plain arrays as well, to match how the model was fitted
y_pred = gnb.predict(X_test[used_features].values)
n_mislabeled = (X_test["ratingName"] != y_pred).sum()
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          X_test.shape[0],
          n_mislabeled,
          100*(1-n_mislabeled/X_test.shape[0])
))

Number of mislabeled points out of a total 17117 points : 7320, performance 57.24%
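
The split and the performance figure above come down to simple arithmetic. As a minimal sketch with the standard library only, the following shuffles with a fixed seed, holds out a chosen fraction for testing, and recomputes the reported percentage from the two counts in the output. `split_rows` and `accuracy_percent` are hypothetical helper names, and the seed value is arbitrary.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle reproducibly, then return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy_percent(n_total, n_mislabeled):
    """Share of correctly labeled points, as a percentage."""
    return 100 * (1 - n_mislabeled / n_total)

train, test = split_rows(range(100), test_fraction=0.2)
print(len(train), len(test))

# Counts taken from the recorded run: 7320 mislabeled out of 17117 test points.
print("{:05.2f}%".format(accuracy_percent(17117, 7320)))
```

With a fixed seed, two runs using different feature combinations are evaluated on exactly the same test rows, which makes their accuracies directly comparable.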


### Step 5: Training and prediction using classification models
We pass the features and targets of the training data to our classifiers (#TODO: test several classifiers), then let each one predict the labels of the test data (true, false, or mixture).

['2' '1' '2' ... '2' '1' '1']
0.5126789366053169
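
To illustrate what "fit on the training features and targets, then predict" means for the Gaussian naive Bayes model used in step 4, here is a minimal pure-Python sketch. It is not scikit-learn's implementation: `TinyGaussianNB` is a hypothetical class, the variance smoothing is a plain additive constant (an assumption made to avoid division by zero), and the toy data is invented.

```python
import math
from collections import defaultdict

class TinyGaussianNB:
    """Illustrative Gaussian naive Bayes: one mean and variance per class and feature."""

    def fit(self, X, y, var_smoothing=1e-9):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.stats = {}
        for label, rows in by_class.items():
            n = len(rows)
            columns = list(zip(*rows))
            means = [sum(col) / n for col in columns]
            variances = [
                sum((v - m) ** 2 for v in col) / n + var_smoothing
                for col, m in zip(columns, means)
            ]
            log_prior = math.log(n / len(X))  # class frequency in the training data
            self.stats[label] = (log_prior, means, variances)
        return self

    def _log_posterior(self, label, row):
        log_prior, means, variances = self.stats[label]
        # Features are assumed independent, so their log-likelihoods add up.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood

    def predict(self, X):
        return [max(self.stats, key=lambda c: self._log_posterior(c, row)) for row in X]

# Invented toy data: two numeric features, two labels.
X_toy = [[1.0, 2.0], [1.2, 1.9], [5.0, 6.0], [5.2, 6.1]]
y_toy = ["False", "False", "True", "True"]
model = TinyGaussianNB().fit(X_toy, y_toy)
predictions = model.predict([[1.1, 2.0], [5.1, 6.0]])
print(predictions)
```

Because the model fits a bell curve to every feature, it works best when feature values are real measurements. Label-encoded identifiers such as `id` or `link` have no meaningful numeric scale, which is one reason to try the other classifiers mentioned in the TODO.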
