# Group 6: assessing the fake-news character of social media posts

The goal of this project is, given a set of tweets, to produce a list of tweets whose information needs to be fact-checked, sorted by priority.

## Required installations

In [1]:
%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
!python3 -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import nltk
nltk.download("punkt")  # Punkt tokenizer models, used for tokenization.
nltk.download("vader_lexicon")  # VADER lexicon, used by NLTK for sentiment scoring.

[nltk_data] Downloading package punkt to /home/arthur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/arthur/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Reading the training and test sets

In [4]:
from src.read.ReadData import ReadData

reader = ReadData(train_path="data/train.csv", test_path="data/test.csv")
df_train = reader.read_train()
df_test = reader.read_test()
display(df_train)
display(df_test)

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...
7608,Two giant cranes holding a bridge collapse int...,1
7609,@aria_ahrary @TheTawniest The out of control w...,1
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,Police investigating after an e-bike collided ...,1


Unnamed: 0,text
0,Just happened a terrible car crash
1,"Heard about #earthquake is different cities, s..."
2,"there is a forest fire at spot pond, geese are..."
3,Apocalypse lighting. #Spokane #wildfires
4,Typhoon Soudelor kills 28 in China and Taiwan
...,...
3258,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,Storm in RI worse than last hurricane. My city...
3260,Green Line derailment in Chicago http://t.co/U...
3261,MEG issues Hazardous Weather Outlook (HWO) htt...


## Feature extraction for the training set
We build the training dataframe from the features of each tweet.

### Tweet-level features
We extract the following features for each tweet:
- length of the tweet (in characters or in tokens);
- sentiment score of the tweet;
- POS tags of the tweet;
- entities of the tweet.

In [5]:
import pandas as pd
import spacy
from src.features.tweetLevel import tweetLevel

extractor_features_tweet_level = tweetLevel()
nlp = spacy.load("en_core_web_sm")  # spaCy pipeline used for entity recognition



In [6]:
dico_tweet_level = {"len_in_tokens": [], "sentiment_score": []}  # key: feature name, value: list of per-tweet values
n_tweets = len(df_train.index)

# row_name is used as a list position below, which assumes df_train has a default 0..n-1 index.
for row_name in df_train.index:
    tweet = df_train["text"][row_name]

    # Tweet length:
    len_tweet = extractor_features_tweet_level.get_length_in_tokens(tweet)
    dico_tweet_level["len_in_tokens"].append(len_tweet)

    # Tweet sentiment:
    sentiment_score = extractor_features_tweet_level.get_positive_sentiment_score(tweet)
    dico_tweet_level["sentiment_score"].append(sentiment_score)

    # POS tags: count how many times each POS tag occurs in the tweet
    pos_tags = extractor_features_tweet_level.get_pos_tags(tweet)  # list of (token, POS tag) tuples
    for tag in pos_tags:
        if tag[1] not in dico_tweet_level:
            dico_tweet_level[tag[1]] = [0] * n_tweets
        dico_tweet_level[tag[1]][row_name] += 1

    # Entities: count how many times each entity type occurs in the tweet
    entities = extractor_features_tweet_level.get_entity_types(tweet, nlp)
    for entity in entities:
        if entity[1] not in dico_tweet_level:
            dico_tweet_level[entity[1]] = [0] * n_tweets
        dico_tweet_level[entity[1]][row_name] += 1

df_tweet_level = pd.DataFrame(dico_tweet_level, columns=list(dico_tweet_level.keys()))
df_tweet_level

Unnamed: 0,len_in_tokens,sentiment_score,PRP$,NNS,VBP,DT,NNP,IN,#,NN,...,RBS,WORK_OF_ART,FW,PERCENT,UH,SYM,PDT,LANGUAGE,``,WP$
0,14,0.149,1,1,1,3,4,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,8,0.000,0,0,0,0,5,1,0,1,...,0,0,0,0,0,0,0,0,0,0
2,24,0.000,0,3,2,2,0,3,0,4,...,0,0,0,0,0,0,0,0,0,0
3,9,0.000,0,3,1,0,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
4,18,0.000,0,1,1,2,2,4,2,3,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,13,0.000,0,2,0,1,0,1,0,4,...,0,0,0,0,0,0,0,0,0,0
7609,24,0.000,0,1,0,3,3,4,0,5,...,0,0,0,0,0,0,0,0,0,0
7610,15,0.000,0,0,0,0,6,1,0,2,...,0,0,0,0,0,0,0,0,0,0
7611,21,0.000,0,1,0,2,3,3,0,5,...,0,0,0,0,0,0,0,0,0,0
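
The per-label counting used for POS tags and entity types can also be sketched with `collections.Counter`, which removes the need to pre-allocate a zero-filled list for each new label. This is only an illustration under stated assumptions: the tagged tweets are hard-coded stand-ins for the tagger output, and `count_labels` is a hypothetical helper, not part of the project code.

```python
from collections import Counter

# Hypothetical stand-in for the tagger output: one list of
# (token, tag) tuples per tweet. Real tags would come from the tagger.
tagged_tweets = [
    [("Forest", "NNP"), ("fire", "NN"), ("near", "IN"), ("Canada", "NNP")],
    [("Storm", "NN"), ("in", "IN"), ("RI", "NNP")],
]


def count_labels(tagged_tweets):
    """Return (sorted labels, one row of counts per tweet)."""
    counters = [Counter(label for _, label in tweet) for tweet in tagged_tweets]
    labels = sorted(set().union(*counters))
    # A Counter returns 0 for a missing label, so every row has the same width.
    rows = [[counter[label] for label in labels] for counter in counters]
    return labels, rows


labels, rows = count_labels(tagged_tweets)
print(labels)
print(rows)
```

Passing `rows` and `labels` to `pd.DataFrame(rows, columns=labels)` would give a table with one column per label and one row per tweet, the same layout as the count columns shown above.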


### Word2Vec
We also include word2vec features. The procedure is as follows:
- We learn the embeddings of every word in the training set.
- Then, for each tweet, we average each embedding component over the words of the tweet.
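
As a minimal sketch of the averaging step, assuming the embeddings are available as a plain dictionary of NumPy vectors: the 3-dimensional values below are invented for illustration, and `mean_embedding` is a hypothetical helper. Skipping tokens without an embedding and falling back to a zero vector are assumptions of this sketch, not something the project code is known to do.

```python
import numpy as np

# Toy 3-dimensional embeddings; hypothetical values standing in for the
# vectors learned by the word2vec model on the training tweets.
embeddings = {
    "forest": np.array([1.0, 0.0, 2.0]),
    "fire": np.array([3.0, 2.0, 0.0]),
}


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of the known tokens, component by component."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        # No known token: fall back to a zero vector (an assumption of this sketch).
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


tweet_vector = mean_embedding(["forest", "fire", "unknown"], embeddings, dim=3)
print(tweet_vector)
```

Each tweet then becomes a fixed-length vector regardless of how many words it contains, which is what allows the embeddings to be used as ordinary columns of the training table.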

In [7]:
## Build the lookup table: word -> embedding of the word

from src.features.word2vec import word2vec

dico_correspondance = {}  # key: word, value: embedding

# Learn the embeddings:
tweets = df_train["text"].to_numpy()
word2vec_creator = word2vec(tweets)

# Shape the result into a dataframe (one column per word, one row per embedding component):
for row_name in df_train.index:
    tweet = df_train["text"][row_name]
    embedded_tweet = word2vec_creator.predict(tweet)
    for word in embedded_tweet:
        if word[0] not in dico_correspondance:
            dico_correspondance[word[0]] = word[1]

df_correspondance = pd.DataFrame(dico_correspondance, columns=list(dico_correspondance.keys()))
df_correspondance

Unnamed: 0,Our,Deeds,are,the,Reason,of,this,#,earthquake,May,...,Ssw,//t.co/5ueCmcv2Pk,Forney,developing,symptoms,//t.co/rqKK15uhEY,flip,//t.co/nF4IculOje,//t.co/STfMbbZFB5,//t.co/YmY4rSkQ3d
0,-0.088936,-0.013425,-0.539233,-0.506070,-0.009121,-0.660990,-0.584319,-1.895276,-0.128269,-0.054426,...,-0.002169,0.005666,-0.004421,-0.013284,0.005110,-0.007176,0.007163,-0.006583,0.000615,0.004299
1,0.080497,0.006168,0.701040,0.663755,0.015567,0.554867,0.751100,1.098641,0.106357,0.067234,...,0.003212,0.005514,0.010331,0.003313,-0.006954,-0.004480,-0.009891,0.004983,-0.009695,0.007272
2,-0.018796,-0.007523,-0.209598,-0.205669,0.003896,-0.080090,-0.214491,-0.257318,-0.022614,-0.012981,...,-0.003137,-0.003163,-0.007912,0.007406,-0.003189,-0.009242,0.006602,-0.002412,-0.005027,-0.002610
3,-0.083582,0.000253,-0.447759,-0.588088,-0.005413,-0.685263,-0.448505,-0.508912,-0.083952,-0.076006,...,0.001462,0.004868,-0.002141,-0.010971,0.003484,-0.003934,0.001862,-0.002630,-0.003808,0.007484
4,-0.029874,-0.001819,-0.062624,-0.184092,-0.011683,-0.325603,-0.083247,-0.277073,-0.041183,-0.027222,...,0.001802,-0.004759,0.009448,-0.006149,-0.009207,0.004046,0.003205,-0.007557,-0.001310,0.003895
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.136720,0.016389,0.801135,0.966699,0.016369,1.247418,0.735257,1.544007,0.177994,0.117343,...,0.003775,-0.010691,-0.001793,0.008014,0.006416,0.011428,-0.006595,0.003059,-0.001794,0.005913
96,-0.096362,-0.008263,-0.741638,-0.829982,-0.001822,-0.607123,-0.661063,0.461285,-0.045140,-0.066197,...,0.000741,0.006057,-0.001328,-0.009622,0.007589,0.000830,-0.010234,0.007526,-0.005848,0.007249
97,-0.106658,-0.007465,-0.707973,-0.779454,-0.006876,-0.885658,-0.714209,-1.772949,-0.134873,-0.082764,...,-0.011596,0.012781,0.002615,-0.007019,0.004415,-0.003601,-0.000330,-0.000854,-0.003502,-0.009513
98,0.108524,0.008342,0.622788,0.824014,0.008276,0.956606,0.588094,0.740957,0.106431,0.096092,...,0.001444,0.004847,-0.010965,0.002639,-0.001078,0.002506,-0.005893,-0.003122,-0.008760,0.010175


In [8]:
from src.features.tokenization import tokenization
import numpy as np

## Link to the training set: for each tweet, average the word embeddings component by component

index_composantes_embeddings = list(df_correspondance.index)
dico_word2vec = {key:[] for key in index_composantes_embeddings}

tokenizer = tokenization()
for row_name in df_train.index:
    tweet = df_train["text"][row_name]
    tokenized_tweet = tokenizer.tokenize_tweet(tweet)
    
    # Look up the embedding of each word of the current tweet (one column per token)
    embeddings_tweet = df_correspondance[tokenized_tweet]
    
    # Mean embedding over the words of the tweet, one value per component
    mean_tweet = np.mean(embeddings_tweet,axis=1)
    for i in range(len(mean_tweet)):
        dico_word2vec[i].append(mean_tweet[i])

df_word2vec = pd.DataFrame(dico_word2vec,columns=index_composantes_embeddings)
df_word2vec

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.356068,0.338908,-0.085931,-0.246757,-0.084820,-0.571304,0.228335,1.579388,-0.453341,-0.629924,...,0.673960,0.091271,0.076871,-0.101129,0.701610,0.469800,-0.238238,-0.422081,0.336559,0.045911
1,-0.205360,0.195587,-0.042980,-0.166417,-0.055329,-0.363630,0.153361,1.014028,-0.295475,-0.402520,...,0.451061,0.056253,0.054480,-0.056928,0.436828,0.309247,-0.173036,-0.250658,0.233053,0.044576
2,-0.311789,0.311245,-0.066701,-0.236514,-0.080394,-0.545277,0.222279,1.523329,-0.439398,-0.603005,...,0.667594,0.082338,0.057971,-0.071715,0.664900,0.456278,-0.271862,-0.376704,0.340553,0.052998
3,-0.386286,0.263185,-0.056592,-0.213195,-0.101669,-0.529203,0.202893,1.369470,-0.376829,-0.628044,...,0.568672,0.089015,0.233023,-0.127594,0.655403,0.468341,-0.090828,-0.418630,0.301391,0.102114
4,-0.433668,0.370416,-0.085201,-0.277377,-0.107744,-0.651715,0.252284,1.750950,-0.491779,-0.738362,...,0.744851,0.102746,0.147436,-0.140434,0.798981,0.546387,-0.217916,-0.491687,0.378283,0.063079
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,-0.363065,0.246640,-0.030519,-0.204263,-0.116032,-0.562066,0.221790,1.416838,-0.409839,-0.676202,...,0.614446,0.091445,0.253844,-0.110234,0.681128,0.487778,-0.115825,-0.392220,0.325851,0.163302
7609,-0.459293,0.411479,-0.086358,-0.331330,-0.137541,-0.768927,0.315346,2.098361,-0.622452,-0.880490,...,0.923146,0.127808,0.154040,-0.120158,0.950163,0.643282,-0.310277,-0.539347,0.473784,0.113972
7610,-0.540858,0.384962,-0.045198,-0.233981,-0.106030,-0.724649,0.292173,1.770980,-0.436086,-0.858365,...,0.679852,0.100674,0.334552,-0.186645,0.886941,0.591260,-0.071804,-0.546964,0.354610,0.156317
7611,-0.249972,0.258637,-0.061371,-0.208817,-0.071002,-0.469554,0.194539,1.326259,-0.396243,-0.518687,...,0.594328,0.072230,0.038656,-0.052288,0.571320,0.392147,-0.250986,-0.314459,0.303115,0.047459


### Building the final training set with the target
All that remains is to concatenate everything.

In [9]:
# Word2vec + tweet-level features: pd.concat returns a new frame, so df_word2vec is not modified
df_classification = pd.concat([df_word2vec, df_tweet_level], axis=1)

# Target:
df_classification["target"] = df_train["target"]

df_classification

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,WORK_OF_ART,FW,PERCENT,UH,SYM,PDT,LANGUAGE,``,WP$,target
0,-0.356068,0.338908,-0.085931,-0.246757,-0.084820,-0.571304,0.228335,1.579388,-0.453341,-0.629924,...,0,0,0,0,0,0,0,0,0,1
1,-0.205360,0.195587,-0.042980,-0.166417,-0.055329,-0.363630,0.153361,1.014028,-0.295475,-0.402520,...,0,0,0,0,0,0,0,0,0,1
2,-0.311789,0.311245,-0.066701,-0.236514,-0.080394,-0.545277,0.222279,1.523329,-0.439398,-0.603005,...,0,0,0,0,0,0,0,0,0,1
3,-0.386286,0.263185,-0.056592,-0.213195,-0.101669,-0.529203,0.202893,1.369470,-0.376829,-0.628044,...,0,0,0,0,0,0,0,0,0,1
4,-0.433668,0.370416,-0.085201,-0.277377,-0.107744,-0.651715,0.252284,1.750950,-0.491779,-0.738362,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,-0.363065,0.246640,-0.030519,-0.204263,-0.116032,-0.562066,0.221790,1.416838,-0.409839,-0.676202,...,0,0,0,0,0,0,0,0,0,1
7609,-0.459293,0.411479,-0.086358,-0.331330,-0.137541,-0.768927,0.315346,2.098361,-0.622452,-0.880490,...,0,0,0,0,0,0,0,0,0,1
7610,-0.540858,0.384962,-0.045198,-0.233981,-0.106030,-0.724649,0.292173,1.770980,-0.436086,-0.858365,...,0,0,0,0,0,0,0,0,0,1
7611,-0.249972,0.258637,-0.061371,-0.208817,-0.071002,-0.469554,0.194539,1.326259,-0.396243,-0.518687,...,0,0,0,0,0,0,0,0,0,1
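
Combining frames column-wise in pandas matches rows on index labels rather than on position, so the step above relies on all the frames sharing the same index. The toy sketch below uses invented data (not the project's frames) to show what happens when the labels differ, and one way to align by position instead.

```python
import pandas as pd

# Toy data standing in for a feature frame and a target column.
features = pd.DataFrame({0: [0.1, 0.2], 1: [0.3, 0.4]})    # index 0, 1
target = pd.Series([1, 0], index=[10, 11], name="target")  # different index

# Rows are matched on index labels, so mismatched labels produce NaN.
misaligned = pd.concat([features, target], axis=1)

# Resetting the index aligns the rows by position instead.
aligned = pd.concat([features, target.reset_index(drop=True)], axis=1)
print(misaligned.shape, aligned.shape)
```

Checking the shape and the absence of missing values after the concatenation is a cheap way to catch this kind of misalignment.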


We end up with 165 features in total, which is a lot. As seen in the articles, we will try to select the most discriminative features.

## Feature selection