<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-packages" data-toc-modified-id="Load-packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load packages</a></span></li><li><span><a href="#Functions" data-toc-modified-id="Functions-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Webscraping" data-toc-modified-id="Webscraping-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Webscraping</a></span><ul class="toc-item"><li><span><a href="#Extract-info-of-each-hyperlink-to-build-dataset" data-toc-modified-id="Extract-info-of-each-hyperlink-to-build-dataset-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Extract info of each hyperlink to build dataset</a></span></li></ul></li><li><span><a href="#Some-EDA" data-toc-modified-id="Some-EDA-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Some EDA</a></span><ul class="toc-item"><li><span><a href="#Unique-fact_checkers" data-toc-modified-id="Unique-fact_checkers-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Unique fact_checkers</a></span></li><li><span><a href="#Trying-to-eliminate-NaN" data-toc-modified-id="Trying-to-eliminate-NaN-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Trying to eliminate NaN</a></span><ul class="toc-item"><li><span><a href="#Saving-cleaned-data" data-toc-modified-id="Saving-cleaned-data-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Saving cleaned data</a></span></li></ul></li><li><span><a href="#Period-covered-by-our-dataset" data-toc-modified-id="Period-covered-by-our-dataset-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Period covered by our dataset</a></span></li><li><span><a href="#Labels" data-toc-modified-id="Labels-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Labels</a></span></li></ul></li></ul></div>

In this notebook we scrape https://www.poynter.org/ifcn-covid-19-misinformation/ to build a dataset of COVID-19 fake news as reviewed by different fact checkers.

# Load packages

In [98]:
# importing packages

import pickle
import time

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

TodaysDate = time.strftime("%Y-%m-%d")

# Set max rows displayed in output to 4000
pd.set_option("display.max_rows", 4000)

# Functions

In [2]:
def retrieve_news_hyperlinks(main_url):
    """ 
    Extract all hyperlinks in 'main_url' and return a list with these hyperlinks 
    """
    
    # Browser-like User-Agent header sent with the request
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
    
    # Package the request, send it and catch the response: r
    
    r = requests.get(main_url, headers=headers)
    
    # Create a BeautifulSoup object from the HTML: soup
    
    soup = BeautifulSoup(r.text,"lxml")
    
    # Find all 'a' tags (which define hyperlinks): a_tags

    a_tags = soup.find_all('a')

    # Create a list with hyperlinks found

    list_links = [link.get('href') for link in a_tags]
    
    # keep only news links (i.e. containing "?ifcn_misinformation");
    # 'a' tags without an href yield None, so those are skipped before the substring test
    
    list_news = [link for link in list_links if link and "?ifcn_misinformation" in link]
    
    # remove duplicates
    
    list_news = list(set(list_news))
    
    return list_news



In [3]:
def save_list(list_to_save, pickle_path):
    """ 
    Save list using pickle
    
    Input:
        list_to_save: list that will be saved
        pickle_path: path prefix of the pickle file; the date and the .pkl extension are appended
    """
    
    with open(pickle_path+'_'+TodaysDate+'.pkl', 'wb') as f:
        pickle.dump(list_to_save, f)
        
    print(pickle_path+'_'+TodaysDate+'.pkl')
        
    return pickle_path+'_'+TodaysDate+'.pkl'
    
def load_list(list_path):
    """ 
    Load list saved as pickle file
    """
    
    with open(list_path, 'rb') as f:
        newlist = pickle.load(f)
        
    return newlist

In [4]:
def build_csv(list_news_urls):
    """ 
    Build a DataFrame with the information extracted from each hyperlink in a list of Poynter hyperlinks.
    
    """
    
    # lists of information extracted from the website
    
    fact_checker = []
    date = []
    location = []
    label = []
    title = []
    explanation = []
    claim_originated_by = []
    url_checker = []
    
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
        
    for idx in tqdm(range(len(list_news_urls))):
        
        r = requests.get(list_news_urls[idx], headers=headers)
    
        # Create a BeautifulSoup object from the HTML: soup
        soup = BeautifulSoup(r.text,"lxml")

        fact_checker.append(soup.find('p', class_='entry-content__text entry-content__text--org').get_text().split(':')[-1].strip())

        date_location = soup.find('p', class_="entry-content__text entry-content__text--topinfo").get_text()
    
        date.append(date_location.split('|')[-2])

        location.append(date_location.split('|')[-1].strip())

        label_title = soup.find("h1",class_="entry-title").get_text()

        label.append(label_title.split(":")[0].replace("\n","").strip().lower())

        title.append(label_title.split(":")[1].replace("\t\t","").strip())

        explanation.append(soup.find('p',class_="entry-content__text entry-content__text--explanation").get_text().split(":")[1].strip())

        claim_originated_by.append(soup.find('p', class_="entry-content__text entry-content__text--smaller").get_text().split(":")[1].strip())

        url_checker.append(soup.find('a', class_="button entry-content__button entry-content__button--smaller").get("href"))

    dict_info = {'fact_checker':fact_checker,
                 'date':date, 
                 'location': location,
                 'label':label,
                 'title':title,
                 'explanation':explanation,
                 'claim_originated_by':claim_originated_by,
                 'url_checker':url_checker}
    
    df = pd.DataFrame(dict_info)

    return df

# Webscraping

In [5]:
main_url = "https://www.poynter.org/ifcn-covid-19-misinformation/?orderby=views&order=DESC"

In [6]:
# list of hyperlinks in the 1st page

list_news = retrieve_news_hyperlinks(main_url)

list_news

['https://www.poynter.org/?ifcn_misinformation=facebook-posts-shared-in-at-least-three-countries-as-scientists-work-to-develop-a-covid-19-vaccine-claim-to-offer-a-legal-way-to-refuse-vaccination',
 'https://www.poynter.org/?ifcn_misinformation=a-video-shows-an-empty-hospital-it-states-that-although-globo-said-there-were-already-7-patients-at-a-temporary-covid-19-hospital-built-in-fortaleza-brazil-the-video-proves-it-is-still-not-working',
 'https://www.poynter.org/?ifcn_misinformation=clinical-trials-with-chlorine-dioxide-cds-allegedly-initiated-by-the-american-institute-of-health',
 'https://www.poynter.org/?ifcn_misinformation=the-canadian-federal-governement-gives-756-to-every-citizen',
 'https://www.poynter.org/?ifcn_misinformation=colombia-is-the-country-with-less-cases-and-deaths-for-coronavirus-per-inhabitant-in-america',
 'https://www.poynter.org/?ifcn_misinformation=video-of-an-fbi-raid-in-the-captions-it-says-the-agents-were-seizing-chinese-n95-masks-that-were-contaminated-wi

In [7]:
# retrieve all hyperlinks from all 308 pages with corona news

all_news_links = []

for page in tqdm(range(309)):
    if page == 0:
        # page 0 stands for the main URL, which has no page number
        all_news_links.extend(retrieve_news_hyperlinks(main_url))
    else:
        all_news_links.extend(retrieve_news_hyperlinks("https://www.poynter.org/ifcn-covid-19-misinformation/page/"+str(page)+"/?orderby=views&order=DESC"))


100%|████████████████████████████████████████████████████████████████████████████████| 309/309 [04:04<00:00,  1.27it/s]


In [8]:
len(all_news_links)

4623

In [9]:
# save list of hyperlinks in a pickle file

save_list(all_news_links, "../data/all_news_links")

../data/all_news_links_2020-04-24.pkl


'../data/all_news_links_2020-04-24.pkl'

In [10]:
# load this list to check if everything is fine

list_test = load_list("../data/all_news_links_2020-04-24.pkl")

In [12]:
# comparing to check if it was saved properly

all_news_links == list_test

True

## Extract info of each hyperlink to build dataset

Retrieving all the data at once caused a connection problem, so it was necessary to do it in pieces.
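
The piecewise retrieval can be sketched as a loop over fixed-size slices. This is a minimal sketch, not the code that was actually run: `iter_chunks` is a hypothetical helper, the chunk size of 1000 mirrors the slices used in this notebook, and the list of links is a stand-in of placeholder strings.

```python
def iter_chunks(items, chunk_size=1000):
    """Yield (part_number, chunk) pairs covering `items` in order.

    Hypothetical helper: part numbers start at 1 to match the
    data_poynter_part1 ... part5 file names used in this notebook.
    """
    for start in range(0, len(items), chunk_size):
        yield start // chunk_size + 1, items[start:start + chunk_size]


# Stand-in for the 4623 scraped links (placeholder strings, not real URLs)
links = ["link_%d" % i for i in range(4623)]

# Size of each part that would be scraped and saved separately
chunk_sizes = [(part, len(chunk)) for part, chunk in iter_chunks(links)]
print(chunk_sizes)
```

In the notebook, each chunk would be passed to `build_csv` and written to its own CSV right away, so a dropped connection costs at most one chunk of work.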

In [13]:
df1 = build_csv(list_test[:1000])

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [18:13<00:00,  1.09s/it]


In [14]:
df1.head()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
0,AFP,2020/04/22,"United States, Canada, Australia",false,Facebook posts shared in at least three countr...,The claims are false; immunization is not comp...,FB,https://factcheck.afp.com/false-advice-refusin...
1,Agência Lupa,2020/04/22,Brazil,false,A video shows an empty hospital. It states tha...,This video was originally published on April 1...,Facebook posts,https://piaui.folha.uol.com.br/lupa/2020/04/22...
2,Maldita.es,2020/04/22,"Spain, Colombia",misleading,Clinical trials with chlorine dioxide (CDS) al...,The study will not be carried out by the Natio...,WhatsApp,https://maldita.es/malditaciencia/2020/04/22/q...
3,Décrypteurs - Radio-Canada,2020/04/23,Canada,false,The Canadian federal governement gives 756 $ t...,The link to sign up for this program leads to ...,Facebook,https://ici.radio-canada.ca/nouvelle/1694856/i...
4,La Silla Vacía,2020/04/22,Colombia,false,Colombia is the country with less cases and de...,Colombia is the third country with less cases ...,Facebook,https://lasillavacia.com/detector-no-colombia-...


In [15]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         1000 non-null   object
 1   date                 1000 non-null   object
 2   location             1000 non-null   object
 3   label                1000 non-null   object
 4   title                1000 non-null   object
 5   explanation          1000 non-null   object
 6   claim_originated_by  1000 non-null   object
 7   url_checker          1000 non-null   object
dtypes: object(8)
memory usage: 62.6+ KB


In [16]:
df1.to_csv("../data/data_poynter_part1_"+TodaysDate+".csv", index=False)

In [17]:
df2 = build_csv(list_test[1000:2000])

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [30:28<00:00,  1.83s/it]


In [18]:
df2.head()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
0,PesaCheck,2020/04/04,"Uganda, Kenya",False,Uganda has postponed general elections due to ...,There has not been such a directive from Presi...,Website,https://pesacheck.org/hoax-ugandas-2021-genera...
1,AFP,2020/04/04,India,False,A video has been viewed thousands of times on ...,The claim is false; the video was published on...,FB,https://factcheck.afp.com/video-predates-india...
2,Factly,2020/04/04,India,False,Video of Rahul Gandhi and Priyanka Vadra Gandh...,The video is three months old. In December 201...,Social Media,https://factly.in/an-old-video-falsely-shared-...
3,AFP,2020/04/04,West Africa,False,An hospital was set on fire in Nairobi after p...,"To this day, there is still no vaccine against...",Social media,https://factuel.afp.com/non-aucun-hopital-na-e...
4,Maldita.es,2020/04/04,Spain,False,The Spanish political party Vox assures that t...,The government is not hiding the images of cof...,Twitter,https://maldita.es/malditobulo/2020/04/04/no-e...


In [19]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         1000 non-null   object
 1   date                 1000 non-null   object
 2   location             1000 non-null   object
 3   label                1000 non-null   object
 4   title                1000 non-null   object
 5   explanation          1000 non-null   object
 6   claim_originated_by  1000 non-null   object
 7   url_checker          1000 non-null   object
dtypes: object(8)
memory usage: 62.6+ KB


In [20]:
# save df in csv

df2.to_csv("../data/data_poynter_part2_"+TodaysDate+".csv", index=False)

In [21]:
df3 = build_csv(list_test[2000:3000])

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [36:23<00:00,  2.18s/it]


In [22]:
df3.head()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
0,franceinfo,2020/03/24,France,False,This photo shows nurses insulting French Presi...,The image has been modified. The original phot...,Twitter,https://www.francetvinfo.fr/sante/maladie/coro...
1,Rappler,2020/03/24,Philippines,False,Globe Telecom is giving every subscriber free ...,"In a message sent to Rappler, Globe said the f...",Website,https://www.rappler.com/newsbreak/fact-check/2...
2,La Nación,2020/03/24,Costa Rica,False,"In Costa Rica, dead bodies are not returned to...","There are procedures in place, and the bodies ...","Facebook, WhatsApp",https://www.nacion.com/no-coma-cuento/nocomacu...
3,AFP,2020/03/24,"Sri Lanka, Philippines",False,Purported advisories urging residents to stay ...,The claim is false. Both the Sri Lankan and Ph...,"Facebook, WhatsApp",https://factcheck.afp.com/false-claim-circulat...
4,PesaCheck,2020/03/24,Kenya,False,23 new COVID-19 cases confirmed in Mombasa.,The cases were not confirmed by Kenya's Minist...,Times Live Kenya,https://pesacheck.org/false-kenya-has-not-conf...


In [23]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         1000 non-null   object
 1   date                 1000 non-null   object
 2   location             1000 non-null   object
 3   label                1000 non-null   object
 4   title                1000 non-null   object
 5   explanation          1000 non-null   object
 6   claim_originated_by  1000 non-null   object
 7   url_checker          1000 non-null   object
dtypes: object(8)
memory usage: 62.6+ KB


In [24]:
df3.to_csv("../data/data_poynter_part3_"+TodaysDate+".csv", index=False)

In [25]:
df4 = build_csv(list_test[3000:4000])

100%|██████████████████████████████████████████████████████████████████████████████| 1000/1000 [36:23<00:00,  2.18s/it]


In [26]:
df4.head()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
0,Faktograf,2020/03/15,Croatia,False,COVID-19 arrived in the U.S. by train.,The photo has been manipulated.,Facebook,https://faktograf.hr/2020/03/16/vagon-vlak-cov...
1,Taiwan FactCheck Center,2020/03/15,Taiwan,False,"COVID-19 is large in size, where the cell diam...",The cell diameter is around 120 nanometers. Th...,"Line, Facebook",https://tfc-taiwan.org.tw/articles/3382
2,PesaCheck,2020/03/15,"Kenya, East Africa",False,Graphics with information about COVID-19 conta...,A series of infographics with UNICEF branding ...,Kenya Broadcasting Corporation,https://pesacheck.org/fake-these-infographics-...
3,Agencia Ocote,2020/03/15,Guatemala,False,Helicopter of the armed forces disinfects the ...,This claim is false.,Social networks,https://agenciaocote.com/bulo-ningun-helicopte...
4,PesaCheck,2020/03/15,"Kenya, East Africa",False,A situation update allegedly from Kenya’s Citi...,A post about COVID-19 in Kenya that contains a...,Tweet,https://pesacheck.org/hoax-this-post-with-a-si...


In [27]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         1000 non-null   object
 1   date                 1000 non-null   object
 2   location             1000 non-null   object
 3   label                1000 non-null   object
 4   title                1000 non-null   object
 5   explanation          1000 non-null   object
 6   claim_originated_by  1000 non-null   object
 7   url_checker          1000 non-null   object
dtypes: object(8)
memory usage: 62.6+ KB


In [28]:
df4.to_csv("../data/data_poynter_part4_"+TodaysDate+".csv", index=False)

In [29]:
df5 = build_csv(list_test[4000:])

100%|████████████████████████████████████████████████████████████████████████████████| 623/623 [22:09<00:00,  2.13s/it]


In [30]:
df5.head()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
0,Maldita.es,2020/02/13,Spain,false,Claims of cannibalism in China,The image shows an autopsy on a baby.,Instagram,https://maldita.es/malditobulo/bulo-personas-a...
1,AFP,2020/02/13,Philippines,false,Boiled ginger can cure COVID-19,Doctors deny this claim.,Katja Helena Ristilä,http://u.afp.com/GingerCovid19
2,Efecto Cocuyo,2020/02/13,Venezuela,false,There is a Cuban drug used in China against th...,Story cites the president and the Cuban embass...,"Miguel Díaz-Canel, Cuba´s president",https://efectococuyo.com/cocuyo-chequea/medica...
3,The Quint,2020/02/14,India,misleading,Coronavirus was created in a lab in Wuhan.,No scientific evidence supports that the coron...,Twitter,https://fit.thequint.com/fit-webqoof/fit-webqo...
4,Vishvas News,2020/02/13,India,false,Weed kills the coronavirus.,Weed does not kill the coronavirus.,Twitter,https://www.vishvasnews.com/english/health/fac...


In [31]:
df5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 623 entries, 0 to 622
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         623 non-null    object
 1   date                 623 non-null    object
 2   location             623 non-null    object
 3   label                623 non-null    object
 4   title                623 non-null    object
 5   explanation          623 non-null    object
 6   claim_originated_by  623 non-null    object
 7   url_checker          623 non-null    object
dtypes: object(8)
memory usage: 39.1+ KB


In [32]:
df5.to_csv("../data/data_poynter_part5_"+TodaysDate+".csv", index=False)

In [33]:
# concatenate all data
df_concat = pd.concat([df1,df2,df3,df4,df5])

# drop duplicates, if there are any
df_concat.drop_duplicates(subset=['date','location','title'], keep='first', inplace = True)

# reset index
df_concat.reset_index(drop=True,inplace = True)

# save in csv
df_concat.to_csv("../data/data_poynter_COMPLETE_"+TodaysDate+".csv", index=False)

# save in excel
df_concat.to_excel("../data/data_poynter_COMPLETE_"+TodaysDate+".xlsx")



Apparently, there were some duplicates based on `date`, `location`, and `title`. Dropping them reduced the dataset from the expected 4623 rows to 3934.
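
To make the effect of deduplicating on a subset of columns concrete, here is a minimal pure-Python sketch of a keep-first deduplication. `drop_duplicate_rows` is a hypothetical helper and the rows are made up for illustration; it is meant to mirror the behaviour of `drop_duplicates(subset=..., keep='first')`, not to replace it.

```python
def drop_duplicate_rows(rows, keys):
    """Keep the first row for each distinct combination of `keys`."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Toy rows (made up for illustration, not taken from the dataset)
rows = [
    {"date": "2020/04/22", "location": "Brazil", "title": "Claim A", "fact_checker": "X"},
    {"date": "2020/04/22", "location": "Brazil", "title": "Claim A", "fact_checker": "Y"},
    {"date": "2020/04/23", "location": "Brazil", "title": "Claim A", "fact_checker": "X"},
]

kept = drop_duplicate_rows(rows, ["date", "location", "title"])
print(len(rows) - len(kept), "duplicate row(s) dropped")
```

Note that the second toy row is dropped even though its `fact_checker` differs, because only the key columns are compared. With the real data the same rule removes 4623 - 3934 = 689 rows.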

In [34]:
df_conc_temp = pd.read_csv("../data/data_poynter_COMPLETE_"+TodaysDate+".csv")

In [35]:
df_conc_temp.head()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
0,AFP,2020/04/22,"United States, Canada, Australia",false,Facebook posts shared in at least three countr...,The claims are false; immunization is not comp...,FB,https://factcheck.afp.com/false-advice-refusin...
1,Agência Lupa,2020/04/22,Brazil,false,A video shows an empty hospital. It states tha...,This video was originally published on April 1...,Facebook posts,https://piaui.folha.uol.com.br/lupa/2020/04/22...
2,Maldita.es,2020/04/22,"Spain, Colombia",misleading,Clinical trials with chlorine dioxide (CDS) al...,The study will not be carried out by the Natio...,WhatsApp,https://maldita.es/malditaciencia/2020/04/22/q...
3,Décrypteurs - Radio-Canada,2020/04/23,Canada,false,The Canadian federal governement gives 756 $ t...,The link to sign up for this program leads to ...,Facebook,https://ici.radio-canada.ca/nouvelle/1694856/i...
4,La Silla Vacía,2020/04/22,Colombia,false,Colombia is the country with less cases and de...,Colombia is the third country with less cases ...,Facebook,https://lasillavacia.com/detector-no-colombia-...


In [36]:
df_conc_temp.tail()

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
3929,Science Feedback,2020/01/22,United Kingdom,no evidence,The new coronavirus outbreak is linked to eati...,Bats are known natural reservoirs for coronavi...,"Emma Parker, writer at the Daily Star (and man...",https://healthfeedback.org/claimreview/no-conc...
3930,Pagella Politica,2020/01/22,Italy,false,Warning,"Until Jan. 22, this information hadn't been co...",retenews24.it,https://pagellapolitica.it/bufale/show/902/not...
3931,Animal Político,2020/01/18,Mexico,false,Stores and supermarkets in Veracruz (Mexico) w...,"As of Mar. 18, stores had not said they woud c...",WhatsApp,https://www.animalpolitico.com/elsabueso/falso...
3932,Rappler,2020/01/14,Philippines,false,"A chain message circulated on Tuesday, Jan. 14...",The Department of Health (DOH) and Healthway M...,Chain message,https://www.rappler.com/newsbreak/fact-check/2...
3933,Animal Político,2020/01/19,Mexico,partially false,The peak of the new coronavirus will happen in...,Authorities say that what lasts two weeks or l...,WhatsApp,https://www.animalpolitico.com/elsabueso/caden...


In [37]:
df_conc_temp.shape

(3934, 8)

# Some EDA

In [38]:
df_conc_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3934 entries, 0 to 3933
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         3933 non-null   object
 1   date                 3934 non-null   object
 2   location             3934 non-null   object
 3   label                3934 non-null   object
 4   title                3934 non-null   object
 5   explanation          3932 non-null   object
 6   claim_originated_by  3866 non-null   object
 7   url_checker          3933 non-null   object
dtypes: object(8)
memory usage: 246.0+ KB


## Unique fact_checkers

In [47]:
# there was a float among the fact_checker values, probably the `nan`

df_conc_temp.fact_checker = df_conc_temp.fact_checker.apply(lambda x: str(x))

In [51]:
list_fact_checkers = df_conc_temp.fact_checker.unique().tolist()
list_fact_checkers.sort()
list_fact_checkers

['15min.lt',
 'AAP FactCheck',
 'AFP',
 'AfricaCheck',
 'Agencia Ocote',
 'Agência Lupa',
 'Animal Político',
 'Annie Lab',
 'Aos Fatos',
 'BOOM FactCheck',
 'Bolivia Verifica',
 'BuzzFeed Japan',
 'Check',
 'Check Your Fact',
 'CheckNews',
 'Chequeado',
 'Colombiacheck',
 'Convoca.pe',
 'Correctiv',
 'Delfi Melo Detektorius (Lie Detector)',
 'Demagog',
 'Dubawa',
 'Décrypteurs - Radio-Canada',
 'Décrypteurs - Radio-Canada and CBC',
 'Détecteur de rumeurs',
 'EFE Verifica',
 'Ecuador Chequea',
 'Efecto Cocuyo',
 'Effecinque - SkyTg24',
 'El Surtidor',
 'Ellinika Hoaxes',
 'Estadão Verifica',
 'FactCheck Georgia',
 'FactCheck.org',
 'FactCrescendo',
 'Factcheck.Vlaanderen',
 'Factcheck.kz',
 'Factly',
 'Factnameh',
 'Faktabaari/FactBar',
 'Faktograf',
 'Fatabyyano',
 'France 24 Observers',
 'Full Fact',
 'GhanaFact',
 'INFACT',
 'India Today',
 'Istinomer',
 'JTBC news',
 'La Nación',
 'La Silla Vacía',
 'La Voz de Guanacaste',
 'LeadStories',
 'Les Décodeurs',
 'Maldita.es',
 'MediaWis

In [53]:
len(list_fact_checkers)

90

## Trying to eliminate NaN

There is one news item without a fact checker. Should it be eliminated from the dataset?

In [76]:
df_conc_temp[df_conc_temp.fact_checker=='nan']

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
1236,,2020/03/26,"Spain, Russia",False,Putin accuses US and EU leaders of artificiall...,This is the video of a speech by the Russian l...,Video posted on Facebook and Youtube,https://www.efe.com/efe/espana/efeverifica/put...


In [64]:
df_conc_temp.title[df_conc_temp.fact_checker=='nan'].values[0]

'Putin accuses US and EU leaders of artificially creating COVID-19 to reduce world population.'

In [77]:
df_conc_temp.url_checker[df_conc_temp.fact_checker=='nan'].values[0]

'https://www.efe.com/efe/espana/efeverifica/putin-no-acusa-a-la-ue-ni-eeuu-de-crear-el-coronavirus/50001435-4205887'

In [88]:
df_conc_temp.fact_checker[df_conc_temp.url_checker.str.contains('efe.com', na=True)]

1236              nan
1354     EFE Verifica
1387     EFE Verifica
1390     EFE Verifica
1490     EFE Verifica
1636    Les Décodeurs
1807     EFE Verifica
1811     EFE Verifica
2068     EFE Verifica
2070     EFE Verifica
2072     EFE Verifica
2174     EFE Verifica
2182     EFE Verifica
2289     EFE Verifica
2292     EFE Verifica
2300     EFE Verifica
2591     EFE Verifica
2638     EFE Verifica
2642     EFE Verifica
3054     EFE Verifica
Name: fact_checker, dtype: object

Therefore, the fact checker here is `EFE Verifica`. We update our dataset.
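
The same reasoning (look at which fact checker usually goes with a given URL domain) can be sketched in a reusable form. This is only an illustration under stated assumptions: `most_common_checker_by_domain` is a hypothetical helper, and the records are toy values with shortened placeholder URLs.

```python
from collections import Counter
from urllib.parse import urlparse


def most_common_checker_by_domain(records):
    """Map each URL domain to the fact checker most often seen with it."""
    counts = {}
    for checker, url in records:
        # Skip rows where either value is missing
        if not checker or not url:
            continue
        domain = urlparse(url).netloc
        counts.setdefault(domain, Counter())[checker] += 1
    return {domain: c.most_common(1)[0][0] for domain, c in counts.items()}


# Toy (fact_checker, url_checker) pairs; URLs are shortened placeholders
records = [
    ("EFE Verifica", "https://www.efe.com/efe/espana/efeverifica/a"),
    ("EFE Verifica", "https://www.efe.com/efe/espana/efeverifica/b"),
    (None, "https://www.efe.com/efe/espana/efeverifica/c"),
    ("AFP", "https://factcheck.afp.com/d"),
]

lookup = most_common_checker_by_domain(records)
missing_url = records[2][1]
inferred = lookup[urlparse(missing_url).netloc]
print(inferred)
```

With a single missing value, the manual check done in this notebook is enough; a lookup like this would only pay off if many rows lacked a fact checker.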

In [89]:
df_conc_temp.loc[1236, 'fact_checker'] = 'EFE Verifica'

In [90]:
df_conc_temp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3934 entries, 0 to 3933
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   fact_checker         3934 non-null   object
 1   date                 3934 non-null   object
 2   location             3934 non-null   object
 3   label                3934 non-null   object
 4   title                3934 non-null   object
 5   explanation          3932 non-null   object
 6   claim_originated_by  3866 non-null   object
 7   url_checker          3933 non-null   object
dtypes: object(8)
memory usage: 246.0+ KB


In [75]:
df_conc_temp[df_conc_temp.url_checker.isna()]

Unnamed: 0,fact_checker,date,location,label,title,explanation,claim_originated_by,url_checker
1636,Les Décodeurs,2020/03/22,France,partly false,"Yves Levy, husband of the former french health...",The virus wasn't created in Wuhan's P4 laborat...,Facebook,


For `explanation` and `url_checker` we cannot fill in the missing values.

### Saving cleaned data

In [91]:
df_conc_temp.to_csv("../data/data_poynter_COMPLETE_2020-04-24_CLEANED.csv", index=False)

In [102]:
df_conc_temp.to_excel("../data/data_poynter_COMPLETE_2020-04-24_CLEANED.xlsx")

## Period covered by our dataset

The data, collected on April 24, 2020, covers the period from January 14, 2020 to April 23, 2020.

In [52]:
(df_conc_temp.date.min(),df_conc_temp.date.max())

('2020/01/14 ', '2020/04/23 ')
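
The two values above carry a trailing space, and they compare correctly as strings only because the format is zero-padded year/month/day. A minimal sketch of turning them into real dates follows; `parse_poynter_date` is a hypothetical helper, and the format string is an assumption based on the values shown in this notebook.

```python
from datetime import datetime


def parse_poynter_date(raw):
    """Parse a 'YYYY/MM/DD' string, ignoring surrounding whitespace."""
    return datetime.strptime(raw.strip(), "%Y/%m/%d").date()


first = parse_poynter_date('2020/01/14 ')
last = parse_poynter_date('2020/04/23 ')

# Number of days between the earliest and latest fact check
span_days = (last - first).days
print(first, last, span_days)
```

Parsed dates also make it possible to group the fact checks by week or month without relying on string order.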

## Labels

In [101]:
df_conc_temp.label.unique()

array(['false', 'misleading', 'partly false', 'no evidence',
       'pants on fire!', 'four pinocchios', 'mainly false', 'explanatory',
       'three pinocchios', 'mostly false', 'incorrect', 'fake',
       'half true', 'mostly true', 'suspicions', 'partly true',
       'true but', 'two pinocchios', 'partially false', 'inaccurate',
       'partially correct', 'misleading/false', 'unproven',
       "(org. doesn't apply rating)", 'in dispute',
       'false and misleading', 'not legit (false)', 'mixed', 'half truth',
       'partially true', 'correct', 'unlikely', 'pants on fire',
       'conspiracy theory', 'misinformation / conspiracy theory',
       'fake news', 'unverified', 'not true'], dtype=object)
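
Some labels in the array above differ only in punctuation, such as `'pants on fire!'` and `'pants on fire'`. A conservative cleanup could look like the sketch below. `normalize_label` is a hypothetical helper that only touches case, whitespace and trailing exclamation marks; it deliberately does not merge labels such as `'partly false'` and `'partially false'`, since whether those are equivalent is a judgment call.

```python
from collections import Counter


def normalize_label(label):
    """Lowercase, collapse whitespace and drop trailing exclamation marks."""
    return " ".join(label.lower().split()).rstrip("!").strip()


# A few labels copied from the array above
labels = ["pants on fire!", "pants on fire", "partly false", "partially false"]

counts = Counter(normalize_label(label) for label in labels)
print(counts)
```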

In [112]:
df_checker_label = df_conc_temp.groupby(["fact_checker", "label"])["location"].count().reset_index()
df_checker_label.drop(columns = ['location'],inplace=True)
df_checker_label

Unnamed: 0,fact_checker,label
0,15min.lt,conspiracy theory
1,15min.lt,false
2,15min.lt,misleading
3,15min.lt,partially false
4,AAP FactCheck,false
5,AAP FactCheck,partially false
6,AFP,false
7,AFP,misleading
8,AFP,mostly false
9,AFP,partially false
