Gabriel-Lino-Garcia/FakeRecogna

FakeRecogna

FakeRecogna is a dataset of real and fake news. The real news items are not directly linked to the fake ones, and vice versa, since such a link could bias classification. The news was collected by crawlers developed to mine the pages of well-known news agencies of great national importance. Each crawler was tailored to the webpage it analyzed, and the extracted information is first separated into categories and then grouped by date. The variety of news across several pages and the different writing styles give the dataset great diversity for natural language processing analysis and machine learning algorithms.
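The "separate into categories, then group by date" step can be illustrated with a minimal sketch. The crawler code itself is not shown in this README, so the record layout (`title`, `category`, `date` keys) and the function name `group_by_category_and_date` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import date


def group_by_category_and_date(articles):
    """Group article records first by category, then by publication date."""
    grouped = defaultdict(lambda: defaultdict(list))
    for article in articles:
        grouped[article["category"]][article["date"]].append(article)
    return grouped


# Placeholder records; the keys are assumptions, not the crawlers' real output.
articles = [
    {"title": "A", "category": "Health", "date": date(2020, 3, 1)},
    {"title": "B", "category": "Health", "date": date(2020, 3, 1)},
    {"title": "C", "category": "Politics", "date": date(2020, 3, 2)},
]

grouped = group_by_category_and_date(articles)
print(len(grouped["Health"][date(2020, 3, 1)]))  # 2
```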

The Dataset

Fake news mining focused mainly on pages listed by the Duke Reporters' Lab, which maintains a list of sites that verify the veracity of news worldwide. There were 160 active fact-checking agencies in the world in 2019, and Brazil is a growing ecosystem with currently 9 initiatives. Of these 9, 6 were considered during the search, and the number of fake news items extracted from each varied widely, yielding 5,951 samples in total. Table 1 presents the initiatives considered and the number of fake news items collected from each source.

| Fact-Check Agency | Web address | # News |
|---|---|---|
| Boatos.org | https://boatos.org | 2,605 |
| Fato ou Fake | https://oglobo.globo.com/fato-ou-fake | 1,055 |
| E-farsas | https://www.e-farsas.com | 812 |
| UOL Confere | https://noticias.uol.com.br/confere | 582 |
| AFP Checamos | https://checamos.afp.com/afp-brasil | 509 |
| Projeto Comprova | https://projetocomprova.com.br | 388 |
| Total | | 5,951 |

For the real news, the crawlers searched portals such as G1, UOL, and Extra, which are publicly recognized as reliable news outlets, as well as the home page of the Ministry of Health of Brazil, resulting in a collection of over 100,000 samples. From this set, 5,951 samples were selected to keep the classes balanced, resulting in a dataset of 11,902 samples.
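The balancing step can be sketched as undersampling the larger class. The README does not say how the 5,951 real samples were chosen, so uniform random selection with a fixed seed is an assumption here, and `balance_by_undersampling` is a hypothetical helper operating on placeholder data.

```python
import random


def balance_by_undersampling(real, fake, seed=0):
    """Return a random subset of `real` with as many items as `fake`."""
    if len(real) < len(fake):
        raise ValueError("need at least as many real samples as fake ones")
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return rng.sample(real, k=len(fake))


# Placeholder items with the class sizes reported in the text.
fake_news = [f"fake-{i}" for i in range(5951)]
real_news = [f"real-{i}" for i in range(100_000)]

real_subset = balance_by_undersampling(real_news, fake_news)
print(len(real_subset) + len(fake_news))  # 11902
```

Sampling without replacement guarantees that no real article appears twice in the balanced set.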

More information

The FakeRecogna dataset is available on GitHub as a single XLSX file with 8 metadata columns, where each row is one sample (real or fake news), as described in Table 2.

| Column | Description |
|---|---|
| Title | Title of the article |
| Sub-title (if available) | Brief description of the news |
| News | Information about the article |
| Category | News grouped according to its content |
| Author | Publication author |
| Date | Publication date |
| URL | Article web address |
| Class | 0 for fake news and 1 for real news |
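A minimal sketch of how a row could be checked against this schema once it has been read into a dictionary. The column names are taken from Table 2, but the exact header strings in the XLSX file are an assumption and may differ. `check_sample` is a hypothetical helper, and the example row is a placeholder, not real data.

```python
# Column names follow Table 2; the actual XLSX headers may differ (assumption).
EXPECTED_COLUMNS = (
    "Title", "Sub-title", "News", "Category", "Author", "Date", "URL", "Class",
)
CLASS_LABELS = {0: "fake", 1: "real"}


def check_sample(sample):
    """Validate one row (a dict) and return its class label as text."""
    missing = [col for col in EXPECTED_COLUMNS if col not in sample]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    if sample["Class"] not in CLASS_LABELS:
        raise ValueError(f"unexpected Class value: {sample['Class']!r}")
    return CLASS_LABELS[sample["Class"]]


# Placeholder row; the sub-title is optional, so None is allowed.
example = {
    "Title": "Example title",
    "Sub-title": None,
    "News": "Example body text.",
    "Category": "Health",
    "Author": "Unknown",
    "Date": "2020-03-01",
    "URL": "https://example.org/article",
    "Class": 0,
}
print(check_sample(example))  # fake
```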

The collected texts are distributed across six categories according to their main subject: Brazil, Entertainment, Health, Politics, Science, and World. The categories are defined by the section of the news outlet from which each article was extracted. Table 3 shows the number and percentage of news items per category.

| Category | # News | % |
|---|---|---|
| Brazil | 904 | 7.6 |
| Entertainment | 1,409 | 11.8 |
| Health | 4,456 | 37.4 |
| Politics | 3,951 | 33.2 |
| Science | 602 | 5.1 |
| World | 580 | 4.9 |
| Total | 11,902 | 100.0 |
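As a quick consistency check, the per-category shares can be recomputed from the counts. This sketch reads the Politics count as 3,951 and rounds to one decimal place; the `counts` dictionary is simply the count column typed in by hand.

```python
# Counts per category, as listed in the category table.
counts = {
    "Brazil": 904,
    "Entertainment": 1409,
    "Health": 4456,
    "Politics": 3951,
    "Science": 602,
    "World": 580,
}

total = sum(counts.values())
# Share of each category in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total)  # 11902
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```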
