# Data Science Methods Project

# I. Loading the data

In [33]:
import pandas as pd

df=pd.read_csv('Dataset/claimskg_result.csv')

# II. DataFrame information

In [34]:
shape=df.shape
print("Number of records: ")
print(shape[0])
print("Number of columns: ")
print(shape[1])
print("Column information")
df.info()

Number of records: 
10000
Number of columns: 
14
Column information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      10000 non-null  object
 1   text                    10000 non-null  object
 2   date                    10000 non-null  object
 3   truthRating             10000 non-null  int64 
 4   ratingName              10000 non-null  object
 5   author                  10000 non-null  object
 6   headline                9882 non-null   object
 7   named_entities_claim    9864 non-null   object
 8   named_entities_article  6497 non-null   object
 9   keywords                8691 non-null   object
 10  source                  10000 non-null  object
 11  sourceURL               10000 non-null  object
 12  link                    10000 non-null  object
 13  language          

Only one field holds numeric values: truthRating, which indicates whether the claim is true, false, or a mixture of true and false information. All the other columns hold strings.

# III. Feature engineering

## III.a. Removing unnecessary columns

Some columns of the DataFrame are not needed for the classification tasks. For example, we can check that there is a one-to-one correspondence (a bijection) between the values of truthRating and the values of ratingName.

In [35]:
g=df.groupby(['truthRating', 'ratingName'])
print(g['id'].count())

truthRating  ratingName
-1           OTHER         1761
 1           FALSE         3665
 2           MIXTURE       3247
 3           TRUE          1327
Name: id, dtype: int64
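
The grouped count above shows one ratingName per truthRating by inspection. As a minimal sketch, the same property can be tested programmatically by counting distinct partners in both directions. The `toy` DataFrame and the name `is_one_to_one` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for the two columns of the real DataFrame (hypothetical values).
toy = pd.DataFrame({
    'truthRating': [-1, 1, 1, 2, 3],
    'ratingName': ['OTHER', 'FALSE', 'FALSE', 'MIXTURE', 'TRUE'],
})

# Each truthRating must map to exactly one ratingName, and vice versa.
names_per_rating = toy.groupby('truthRating')['ratingName'].nunique()
ratings_per_name = toy.groupby('ratingName')['truthRating'].nunique()
is_one_to_one = bool((names_per_rating == 1).all() and (ratings_per_name == 1).all())
print(is_one_to_one)
```

On the real data, the same two `groupby` calls would be applied to `df` directly.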


We can therefore drop either truthRating or ratingName; we drop ratingName, since the values will in any case be transformed when the features are created in the next step.

In [36]:
del df['ratingName']

We can also drop the language column: every record in the DataFrame has the value 'English' for it, so it carries no information for learning.

In [37]:
del df['language']
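
A column that takes a single value cannot help a classifier. The following sketch shows one way such columns could be detected automatically rather than by inspection; the `toy` DataFrame and the `constant_columns` name are assumptions made for illustration.

```python
import pandas as pd

# Toy data (hypothetical): 'language' has the same value in every record.
toy = pd.DataFrame({
    'text': ['claim a', 'claim b', 'claim c'],
    'language': ['English', 'English', 'English'],
})

# A column is constant when it has exactly one distinct value (missing values included).
constant_columns = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
print(constant_columns)
```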

In [38]:
df.to_csv('Dataset/claimskg_columns_removed.csv')

## III.b. Missing values

The number of missing (null) values per column of the DataFrame is shown below.

In [39]:
def count_empty_values(df):
    columns=df.columns[df.isnull().any()].tolist()
    for column in columns:
        print(column + " : " + str(df[column].isnull().sum()))

count_empty_values(df)

headline : 118
named_entities_claim : 136
named_entities_article : 3503
keywords : 1309


All of these columns hold strings, so we cannot fill in the missing values with a mean. Dropping the corresponding records is not an option either, because their other columns carry information that is useful for learning.
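
The notebook does not say how these missing strings are eventually handled. One common option, shown here purely as an assumption, is to replace them with empty strings so that the records are kept and text-based feature extractors receive a string for every row. The `toy` DataFrame and the `text_columns` list are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical) with missing values in two string columns.
toy = pd.DataFrame({
    'headline': ['Some headline', None],
    'keywords': [None, 'health,taxes'],
})

text_columns = ['headline', 'keywords']

# Work on a copy and replace missing strings with the empty string.
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna('')
print(filled.isnull().sum())
```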

We now look at the records that do not carry one of the target classes of the learning tasks:

In [40]:
print("Records that are not FALSE, TRUE or MIXTURE: " + str(df[df['truthRating'] == -1].count()['id']))

Records that are not FALSE, TRUE or MIXTURE: 1761


Since these records cannot be used for learning, we can safely drop them.

In [41]:
df = df[df['truthRating'] != -1]

We can see the effect on the four columns that contained null values; headline no longer has any:

In [42]:
count_empty_values(df)

named_entities_claim : 11
named_entities_article : 2966
keywords : 464


We now count the number of dates whose value is 'Unknown' in the DataFrame.

In [47]:
print(df[df['date'] == 'Unknown'].count()['id'])

3422


The dates are stored as strings, so we convert them to timestamps so that the classification algorithms can pick up any correlation between the age of a claim and its veracity. The 'Unknown' values are replaced with the mean of the timestamps obtained.

In [68]:
import time
import datetime
from math import ceil

def convert_to_timestamp(datestring):
    # Parse a 'YYYY-MM-DD' string and return its Unix timestamp (local time zone)
    return time.mktime(datetime.datetime.strptime(datestring, "%Y-%m-%d").timetuple())


# Mean timestamp over the known dates, used to fill in the 'Unknown' ones
timestamp_sum = 0
number_of_dates = 0
for date in df['date']:
    if date != 'Unknown':
        timestamp_sum += convert_to_timestamp(date)
        number_of_dates += 1

timestamp_mean = ceil(timestamp_sum / number_of_dates)

def process_dates(date):
    # Replace 'Unknown' with the mean timestamp; convert every other date
    if date == 'Unknown':
        return timestamp_mean
    return convert_to_timestamp(date)

df_timestamp = pd.DataFrame(df['date'].apply(process_dates))
df_timestamp

Unnamed: 0,date
2,1.329260e+09
3,1.252966e+09
4,1.395176e+09
5,1.395176e+09
6,1.245017e+09
...,...
9995,1.219615e+09
9996,1.291072e+09
9997,1.472162e+09
9998,1.395176e+09
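
The same conversion can be written without an explicit loop. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses `pd.to_datetime` with `errors='coerce'` so that 'Unknown' becomes a missing value, then fills the gaps with the mean. Unlike `time.mktime`, which interprets dates in the local time zone, this version counts seconds from 1970-01-01 without any time-zone offset, and it does not round the mean up, so the values may differ slightly from those above. The `dates` Series is toy data.

```python
import pandas as pd

# Toy data (hypothetical) mimicking the 'date' column.
dates = pd.Series(['2012-02-15', 'Unknown', '2009-09-15'])

# Strings that do not match the format (here 'Unknown') become NaT.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# Seconds elapsed since the Unix epoch, as floats; NaT becomes NaN.
epoch = pd.Timestamp('1970-01-01')
timestamps = (parsed - epoch) / pd.Timedelta(seconds=1)

# Fill the unknown dates with the mean of the known timestamps.
timestamps = timestamps.fillna(timestamps.mean())
print(timestamps)
```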


## III.c. Preprocessing the claim text

We start by importing the NLTK library (Natural Language Toolkit).

In [None]:
import nltk

from nltk import sent_tokenize
from nltk.tokenize import word_tokenize

We then define the function that preprocesses the text of a claim.
It consists of the following steps:
- Expansion of English contractions
- Replacement of digits with words
- Removal of punctuation
- Case normalization (everything is lowercased)
- Removal of words that are not useful for classification (stop words)
- Reduction of words to their base form (lemmatization)

In [None]:
import inflect
import contractions
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

p = inflect.engine()
stop_words = set(stopwords.words('english'))
    
def replace_contractions(text):
    return contractions.fix(text)

def tokenize(text):
    return word_tokenize(text)

def replace_numbers(word):
    if (word.isdigit()):
        return p.number_to_words(word)
    else:
        return word

def remove_punctuations(words):
    return [word for word in words if word.isalpha()]

def normalize_case(word):
    return word.lower()

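def filter_stop_words(words):
    return [word for word in words if word not in stop_words]

# Build the lemmatizer once rather than on every call
lemmatizer = WordNetLemmatizer()

def lemmatize_word(word):
    # pos='v' treats every word as a verb when looking up its lemma
    return lemmatizer.lemmatize(word, pos='v')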

def process_text(text):
    text = replace_contractions(text)

    words = tokenize(text)
    words = map(replace_numbers, words)
    words = remove_punctuations(words)
    words = map(normalize_case, words)
    words = filter_stop_words(words)
    words = map(lemmatize_word, words)
    
    return ' '.join(words)
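
One subtlety in the order of the steps: replace_numbers runs before remove_punctuations, which keeps only tokens for which `str.isalpha()` is true. If the number-to-words conversion returns forms containing a hyphen or a space (we assume forms such as 'twenty-one' or 'one hundred' here, hard-coded so the sketch needs no third-party package), those tokens would be discarded. The helper `split_then_keep_alpha` is a hypothetical alternative, not part of the notebook.

```python
import re

# Spelled-out forms assumed for illustration (hard-coded, not produced by a library call).
spelled_out = {'7': 'seven', '21': 'twenty-one', '100': 'one hundred'}

def keep_alpha(words):
    """Same filter as remove_punctuations: keep purely alphabetic tokens."""
    return [w for w in words if w.isalpha()]

def split_then_keep_alpha(words):
    """Split each token on non-letters first, so multi-word numbers survive."""
    pieces = []
    for w in words:
        pieces.extend(p for p in re.split(r'[^A-Za-z]+', w) if p)
    return pieces

strict = keep_alpha(spelled_out.values())
lenient = split_then_keep_alpha(spelled_out.values())
print(strict)
print(lenient)
```

Whether dropping such numbers matters depends on the classification task; the sketch only makes the behaviour visible.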

In [None]:
# df_rows_removed: copy of the DataFrame after dropping records whose truthRating is -1
df_rows_removed = df.copy()
# From here on, df holds only the preprocessed claim text
df = pd.DataFrame(df['text'].apply(process_text))
print(df['text'].values)


## III.d. Extracting features from the claim text

We use the TfidfVectorizer class to turn each word of the preprocessed text of each claim into a feature. The value of a feature is the word's TF-IDF weight: its frequency in the claim, scaled down for words that appear in many claims.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

features_text = vectorizer.fit_transform(df['text'])

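# Reuse the matrix computed by fit_transform instead of transforming a second time
df_features_text = pd.DataFrame(
    data=features_text.toarray(),
    columns=vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
)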
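
To make the weighting concrete, here is a dependency-free sketch of the computation that TfidfVectorizer performs with its default settings (smoothed inverse document frequency, then L2 normalization of each row). It assumes whitespace tokenization and a three-document toy corpus, both chosen for illustration; the real vectorizer uses its own tokenizer.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already preprocessed.
docs = ['tax cut tax', 'tax raise', 'health plan']
tokenized = [d.split() for d in docs]

vocabulary = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each word.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(doc):
    """Term count times idf for every vocabulary word, scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for row in matrix:
    print([round(v, 3) for v in row])
```

In the second document, 'raise' gets a higher weight than 'tax' even though both occur once, because 'tax' also appears in another document.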

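## III.e. Adding the timestamp column

We define the 'features' DataFrame as the join, on the row index, of the features extracted from the text with the DataFrame containing the timestamps. df_timestamp still carries the index labels of the filtered DataFrame, so its index is reset to match the 0-based index of df_features_text.

In [70]:
features = df_features_text.join(df_timestamp.reset_index(drop=True))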


Unnamed: 0,aaa,aaron,ab,aba,abaco,abandon,abbey,abbott,abbreviation,abby,...,zip,zipper,zippo,zombie,zombies,zombism,zone,zoo,zuckerberg,zuma
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8236,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8237,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
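
`DataFrame.join` aligns rows on index labels, not on position. The sketch below uses toy frames (hypothetical names and values) to show what happens when one frame has a fresh 0-based index and the other keeps the labels left over from a filtered frame, and how resetting the index restores positional alignment.

```python
import pandas as pd

# Fresh RangeIndex 0..2, as produced when building a DataFrame from an array.
text_features = pd.DataFrame({'tax': [0.5, 0.0, 0.7]})

# Index labels left over from filtering rows out of a larger frame.
timestamps = pd.DataFrame({'date': [1.0e9, 1.1e9, 1.2e9]}, index=[2, 3, 5])

# Joining on mismatched labels leaves NaN wherever a label has no partner.
misaligned = text_features.join(timestamps)

# Resetting the index first pairs the rows by position.
aligned = text_features.join(timestamps.reset_index(drop=True))

print(misaligned)
print(aligned)
```

Checking `features.isna().sum().sum()` after a join is a quick way to notice this kind of misalignment.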


## III.f. Creating the features

In [None]:
## IV.a. Text engineering

In [None]:
# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import SGDClassifier
# from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# from time import time

# validation_size = 0.3  # 30% of the dataset is held out for testing
# train_size = 1 - validation_size
# seed = 30

# pipeline = Pipeline([('vect', TfidfVectorizer()),
#                      ('clf', SGDClassifier(loss='hinge',
#                                            penalty='l2',
#                                            alpha=1e-3,
#                                            random_state=42,
#                                            max_iter=5, tol=None)),
#                      ])

# X = df['text'].values                # preprocessed claim text
# y = df_rows_removed['truthRating']   # labels, kept in the copy made before preprocessing

# X_train, X_test, y_train, y_test = train_test_split(X,
#                                                     y,
#                                                     train_size=train_size,
#                                                     test_size=validation_size,
#                                                     random_state=seed)

# t0 = time()
# pipeline.fit(X_train, y_train)
# print("Fit done in %0.3fs" % (time() - t0))

# t0 = time()
# result = pipeline.predict(X_test)
# print("Prediction done in %0.3fs" % (time() - t0))

# print('\n accuracy:', accuracy_score(y_test, result), '\n')

# conf = confusion_matrix(y_test, result)
# print('\n confusion matrix \n', conf)

# print('\n', classification_report(y_test, result))
