# Group 6: assessing the fake-news character of social media messages

The goal of this project is to take a set of tweets and produce a list of the tweets whose information needs to be fact-checked, sorted by priority.

## Required installations:

In [1]:
%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
!python3 -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import nltk
nltk.download("punkt")  # Punkt tokenizer models, needed for tokenization.
nltk.download('vader_lexicon')  # NLTK's VADER lexicon, used to score how positive words are.

[nltk_data] Downloading package punkt to /home/arthur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/arthur/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

## Reading the training and test sets

In [4]:
from src.read.ReadData import ReadData

reader = ReadData(train_path="data/train.csv", test_path="data/test.csv")
df_train = reader.read_train()
df_test = reader.read_test()
display(df_train)
display(df_test)

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...
7608,Two giant cranes holding a bridge collapse int...,1
7609,@aria_ahrary @TheTawniest The out of control w...,1
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,Police investigating after an e-bike collided ...,1


Unnamed: 0,text
0,Just happened a terrible car crash
1,"Heard about #earthquake is different cities, s..."
2,"there is a forest fire at spot pond, geese are..."
3,Apocalypse lighting. #Spokane #wildfires
4,Typhoon Soudelor kills 28 in China and Taiwan
...,...
3258,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,Storm in RI worse than last hurricane. My city...
3260,Green Line derailment in Chicago http://t.co/U...
3261,MEG issues Hazardous Weather Outlook (HWO) htt...


## Feature extraction:

### Tweet length:

In [5]:
# Test on a single tweet:
from src.features.tweetLevel import tweetLevel
from src.features.tokenization import tokenization

extractor_features = tweetLevel()
tokenizer = tokenization()

tweet = df_train["text"][0]
tokenized_tweet = tokenizer.tokenize_tweet(tweet)
len_in_char = extractor_features.get_length_in_characters(tweet)
len_in_tokens = extractor_features.get_length_in_tokens(tweet)

print(f"Tweet: '{tweet}'")
print(f"Length in characters: {len_in_char}")
print("")
print(f"Token representation: {tokenized_tweet}")
print(f"Length in tokens: {len_in_tokens}")

Tweet: 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
Length in characters: 69

Token representation: ['Our', 'Deeds', 'are', 'the', 'Reason', 'of', 'this', '#', 'earthquake', 'May', 'ALLAH', 'Forgive', 'us', 'all']
Length in tokens: 14
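
The implementations of `get_length_in_characters` and `get_length_in_tokens` live in `src/features` and are not shown in this notebook. As a rough illustration of what the two lengths measure, the sketch below uses a plain regular expression as a stand-in tokenizer. This is an assumption, not the project's tokenizer, and the function names are hypothetical. On this particular tweet it gives the same counts as the output above.

```python
import re

# Stand-in tokenizer (assumption): runs of word characters, or runs of
# punctuation/symbols, so that '#earthquake' splits into '#' and 'earthquake'.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def length_in_characters(text):
    """Number of characters in the raw tweet, spaces included."""
    return len(text)


def tokenize(text):
    """Split the tweet into tokens with the stand-in pattern."""
    return TOKEN_PATTERN.findall(text)


def length_in_tokens(text):
    """Number of tokens produced by the stand-in tokenizer."""
    return len(tokenize(text))


tweet = "Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all"
print(length_in_characters(tweet))  # 69
print(length_in_tokens(tweet))      # 14
```

A regex like this will diverge from a trained tokenizer on URLs, contractions and emoji, so it should only be read as an illustration of the two measures.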


### Tweet sentiment

In [6]:
sentiment_analysis = extractor_features.get_positive_sentiment_score(tweet)

print(f"Tweet: '{tweet}'")
print(f"Positive sentiment score of the tweet: {sentiment_analysis}")

Tweet: 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
Positive sentiment score of the tweet: 0.149
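
The body of `get_positive_sentiment_score` is not shown in this notebook. Since the `vader_lexicon` resource is downloaded in the installation section, one plausible reading is that the score is the `pos` component returned by NLTK's VADER analyzer. The sketch below rests on that assumption. `build_vader_analyzer` and `positive_sentiment_score` are hypothetical names, and the analyzer is passed in as a parameter so the scoring function does not itself depend on the lexicon being installed.

```python
def build_vader_analyzer():
    """Return NLTK's VADER analyzer (requires nltk and the 'vader_lexicon' resource)."""
    # Imported lazily so that the scoring function below has no hard nltk dependency.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer()


def positive_sentiment_score(text, analyzer):
    """Hypothetical helper: proportion of the text that the analyzer rates as positive."""
    # polarity_scores returns a dict with the keys 'neg', 'neu', 'pos' and 'compound'.
    return analyzer.polarity_scores(text)["pos"]
```

With the lexicon installed, the call would be `positive_sentiment_score(tweet, build_vader_analyzer())`. Whether it reproduces 0.149 exactly depends on how the project's own method is written.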


### POS tags of the tweet

In [7]:
pos_tags = extractor_features.get_pos_tags(tweet)
print(f"Tweet: '{tweet}'")
print(f"Token representation: {tokenized_tweet}")
print(f"pos_tags: {pos_tags}")

Tweet: 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
Token representation: ['Our', 'Deeds', 'are', 'the', 'Reason', 'of', 'this', '#', 'earthquake', 'May', 'ALLAH', 'Forgive', 'us', 'all']
pos_tags: [('Our', 'PRP$'), ('Deeds', 'NNS'), ('are', 'VBP'), ('the', 'DT'), ('Reason', 'NNP'), ('of', 'IN'), ('this', 'DT'), ('#', '#'), ('earthquake', 'NN'), ('May', 'NNP'), ('ALLAH', 'NNP'), ('Forgive', 'NNP'), ('us', 'PRP'), ('all', 'DT')]
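
A list of `(token, tag)` pairs is not directly usable as a numeric feature. One common way to condense it, sketched here as an assumption about a possible next step rather than as the project's method, is to count how often each tag occurs. The helper name `pos_tag_counts` is hypothetical; the input is copied from the output above.

```python
from collections import Counter

# Tagged tokens copied from the notebook output above.
pos_tags = [
    ('Our', 'PRP$'), ('Deeds', 'NNS'), ('are', 'VBP'), ('the', 'DT'),
    ('Reason', 'NNP'), ('of', 'IN'), ('this', 'DT'), ('#', '#'),
    ('earthquake', 'NN'), ('May', 'NNP'), ('ALLAH', 'NNP'),
    ('Forgive', 'NNP'), ('us', 'PRP'), ('all', 'DT'),
]


def pos_tag_counts(tagged_tokens):
    """Count how many tokens carry each POS tag."""
    return Counter(tag for _, tag in tagged_tokens)


counts = pos_tag_counts(pos_tags)
print(counts.most_common(2))  # [('NNP', 4), ('DT', 3)]
```

Four of the fourteen tokens are tagged `NNP` (proper noun), including `Reason` and `Forgive`, which suggests the tagger is influenced by the capitalization in this tweet.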


### Entities in the tweet

In [8]:
entity_types = extractor_features.get_entity_types(tweet)
print(f"Tweet: '{tweet}'")
print(f"Entities: {entity_types}")

Tweet: 'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
Entities: [('May ALLAH Forgive', 'ORG')]
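
The notebook does not show how `get_entity_types` is written. Because the `en_core_web_sm` model is downloaded in the installation section, a reasonable guess is that it relies on spaCy's named entity recognizer. The sketch below is written under that assumption. `load_english_pipeline` and `entity_types` are hypothetical names, and the pipeline is passed in as a parameter so the extraction logic can be read independently of the model.

```python
def load_english_pipeline():
    """Load the small English spaCy pipeline (requires spacy and en_core_web_sm)."""
    # Imported lazily: spaCy is only needed when the real model is used.
    import spacy
    return spacy.load("en_core_web_sm")


def entity_types(text, nlp):
    """Hypothetical helper: list (entity text, entity label) pairs found by `nlp`."""
    doc = nlp(text)
    # Each recognized span exposes its surface text and a label such as 'ORG' or 'GPE'.
    return [(ent.text, ent.label_) for ent in doc.ents]
```

With the model installed, the call would be `entity_types(tweet, load_english_pipeline())`. In the output above, `May ALLAH Forgive` is labeled `ORG`, so entity labels on informal tweets should be treated as noisy signals.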
