# Using Data Mining to understand fake news and misinformation in election cycles

## Authors

Salomé Dias     - 118163
Daniel Pedrinho - 107378

## Project Objective

The objective of this project is to understand how fake news and misinformation spread during election cycles by applying data mining techniques to archived news content.

## Similar Works

1. https://ieeexplore.ieee.org/document/10068440
A. Matheven and B. V. D. Kumar, "Fake News Detection Using Deep Learning and Natural Language Processing," 2022 9th International Conference on Soft Computing & Machine Intelligence (ISCMI), Toronto, ON, Canada, 2022, pp. 11-14, doi: 10.1109/ISCMI56532.2022.10068440.

2. https://ieeexplore.ieee.org/document/9641517
X. Jose, S. D. M. Kumar and P. Chandran, "Characterization, Classification and Detection of Fake News in Online Social Media Networks," 2021 IEEE Mysore Sub Section International Conference (MysuruCon), Hassan, India, 2021, pp. 759-765, doi: 10.1109/MysuruCon52639.2021.9641517.

3. https://ieeexplore.ieee.org/document/9702824
F. W. Wibowo, A. Dahlan and Wihayati, "Detection of Fake News and Hoaxes on Information from Web Scraping using Classifier Methods," 2021 4th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Yogyakarta, Indonesia, 2021, pp. 178-183, doi: 10.1109/ISRITI54043.2021.9702824.

## Data Acquisition

The query is currently defined in two parts:

1. Part 1 covers the "base" data, which serves as a baseline for comparison.
2. Part 2 covers the "target" data, that is, the data we are interested in.

Both queries are built in the same way; the only difference is the time period of the search.

To filter the data, several news sources were selected based on this list: https://today.yougov.com/ratings/entertainment/popularity/news-websites/all

Sports websites and additional domains belonging to an organization already on the list were excluded.

The query is run once per site, and the results are stored in an external file that keeps the original arquivo.pt data structure.
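As an illustration of storing results without altering their structure, the sketch below writes each site's parsed JSON response to its own file and reads it back unchanged. The helper names and the one-file-per-site naming scheme are hypothetical and not part of the current notebook.

```python
import json
from pathlib import Path


def save_site_response(payload, site, out_dir):
    """Write one site's parsed API response (a dict) to <out_dir>/<site>.json."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{site}.json"
    with open(path, "w", encoding="utf-8") as f:
        # The payload is dumped as-is, so the original structure is preserved.
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return path


def load_site_response(site, out_dir):
    """Read back the response previously stored for one site."""
    with open(Path(out_dir) / f"{site}.json", encoding="utf-8") as f:
        return json.load(f)
```

Keeping one valid JSON file per site would make later parsing simpler than splitting a single text file on separator lines.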

The data acquired from the arquivo.pt TextSearch API seems lacking in quantity, and the website search confirms this: the number of results reported is far greater than the number actually returned, both by the API requests and by the website search.
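One cause worth ruling out is pagination, since a single request returns only one page of results. The sketch below pages through a query and returns the collected items together with the reported estimate, so the two numbers can be compared. It assumes the `offset` and `maxItems` request parameters and the `response_items` and `estimated_nr_results` response fields as we understand them from the arquivo.pt API documentation; these should be verified. `fetch_page` is a hypothetical callable, injected so the paging logic can be tried without network access.

```python
def collect_all_items(fetch_page, page_size=50, max_pages=100):
    """Page through results until a short or empty page is returned.

    fetch_page(offset, max_items) must return the parsed JSON response (a dict).
    Returns (items, estimated), where estimated is the count reported by the
    first page, or None if the field is missing.
    """
    items = []
    estimated = None
    for page in range(max_pages):
        payload = fetch_page(page * page_size, page_size)
        if estimated is None:
            # Assumed field name; missing keys simply leave the estimate as None.
            estimated = payload.get("estimated_nr_results")
        batch = payload.get("response_items", [])
        items.extend(batch)
        if len(batch) < page_size:
            break  # last page reached
    return items, estimated
```

With `requests`, `fetch_page` could wrap a call such as `requests.get('https://arquivo.pt/textsearch', params={..., 'offset': offset, 'maxItems': max_items}).json()`. If `len(items)` stays far below `estimated` even after paging, the shortfall is not explained by pagination alone.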

Furthermore, some of the selected news sources return no results, which we find odd, since both the search period and the search terms are very broad and well documented.

As such, we are considering changing the API from the Arquivo.pt API (Full-text & URL search) to the CDX-server API (URL search) or Memento API (URL search).
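For reference, a URL-only lookup against the CDX-server API might be built as in the sketch below. The endpoint `https://arquivo.pt/wayback/cdx`, the `url`, `from`, `to`, and `output` parameters, and the trailing `/*` prefix match are assumptions based on our reading of the arquivo.pt documentation and should be checked before use. `build_cdx_url` is a hypothetical helper. Note that a URL search cannot filter by query terms, so keyword filtering would have to happen after retrieval.

```python
from urllib.parse import urlencode

# Assumed endpoint of the arquivo.pt CDX server.
CDX_ENDPOINT = "https://arquivo.pt/wayback/cdx"


def build_cdx_url(site, date_from, date_to):
    """Build a CDX request URL listing captures under one site in a time window."""
    params = {
        "url": f"{site}/*",   # assumed prefix-match syntax
        "from": date_from,    # same 14-digit timestamps as the TextSearch query
        "to": date_to,
        "output": "json",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

For example, `build_cdx_url('www.cnn.com', '20200203000000', '20201103000000')` yields a URL that can be passed to `requests.get`.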

In [1]:
import requests

API_URL = 'https://arquivo.pt/textsearch'

# Search windows in YYYYMMDDhhmmss format; the two queries differ only in this period.
election_time = {'from': '20200203000000', 'to': '20201103000000'}
non_election_time = {'from': '20180203000000', 'to': '20181103000000'}

# Domains passed to the siteSearch parameter, one request per site.
news_sites = ['www.cbs.com',
              'www.nbc.com',
              'www.washingtonpost.com',
              'www.bbc.com',
              'www.forbes.com',
              'www.nytimes.com',
              'www.foxnews.com',
              'www.cnn.com']


def request_api(query, is_election, site):
    """Query the arquivo.pt TextSearch API for one site and one time period."""
    if not query:
        raise ValueError("No query provided")

    period = election_time if is_election == 1 else non_election_time
    params = {'q': query, 'siteSearch': site, 'prettyPrint': 'true', **period}
    # Passing params lets requests URL-encode the query terms.
    return requests.get(API_URL, params=params, timeout=30)


input_query = input("Enter a query: ")
is_election = int(input("Is it election time? (1 for yes, 0 for no): "))
if is_election not in (0, 1):
    raise ValueError("Expected 1 (election) or 0 (non-election)")

# Only the file for the selected period is replaced; the other period's output is kept.
output_file = 'output_election.txt' if is_election == 1 else 'output_non_election.txt'

# Write mode discards the previous run's output for this period.
with open(output_file, 'w', encoding='utf-8') as f:
    for site in news_sites:
        response = request_api(input_query, is_election, site)
        f.write('\n############################ ' + site + ' ############################\n')
        f.write(response.text)
        f.write('##############################\n')
