# ClaimReview Subset

You will need to download:

* The ClaimReview feed from: https://www.datacommons.org/factcheck/download
* The fastText language identification model (lid.176.bin) from: https://fasttext.cc/docs/en/language-identification.html
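
Before running the cells below, it can help to confirm that both downloads are in place. The following is a minimal sketch, assuming both files sit in the notebook's working directory under the names used later in this notebook (`factcheck_datafeed.json` and `lid.176.bin`). `missing_files` is a hypothetical helper written for this check, not part of any library.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not Path(p).is_file()]


# File names assumed from the cells that load them further down.
required = ["factcheck_datafeed.json", "lid.176.bin"]
missing = missing_files(required)
if missing:
    print("Missing downloads:", ", ".join(missing))
else:
    print("All required files found.")
```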

In [None]:
import json
import pandas as pd
from tqdm.auto import tqdm
tqdm.pandas()

In [None]:
datafile = "factcheck_datafeed.json"

In [None]:
# Load the JSON data from file
with open(datafile, 'r') as f:
    data = json.load(f)

In [None]:
# Normalize the JSON data into a flattened Pandas DataFrame
df = pd.json_normalize(data, record_path=['dataFeedElement', 'item'])
df
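
To see what `record_path` does here, the sketch below applies the same call to a tiny feed. The `toy_feed` dictionary is invented for illustration and only mimics the `dataFeedElement` / `item` nesting; the real feed carries many more fields.

```python
import pandas as pd

# Hypothetical two-element feed that mimics the nesting of the real file.
toy_feed = {
    "dataFeedElement": [
        {"item": [{"claimReviewed": "Claim A",
                   "author": {"name": "Checker One"},
                   "reviewRating": {"alternateName": "False"}}]},
        {"item": [{"claimReviewed": "Claim B",
                   "author": {"name": "Checker Two"}}]},
    ]
}

# record_path walks dataFeedElement -> item and emits one row per item;
# nested dictionaries become dotted column names such as "author.name".
toy_df = pd.json_normalize(toy_feed, record_path=["dataFeedElement", "item"])
print(toy_df.columns.tolist())
```

Keys that are absent from a record, such as `reviewRating` in the second item, end up as NaN in the corresponding column.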

In [None]:
df.columns

The following columns are the most relevant to the ClaimReview schema. The explanations may be incomplete, so please check the schema.org documentation for details.

| Field | Explanation |
|-------|-------------|
| claimReviewed | Specifies the claim being reviewed |
| url | Specifies the URL of the page containing the review |
| author.@type | Specifies the type of the author, e.g. Person or Organization |
| author.image | Specifies the URL of the author's image |
| author.name | Specifies the name of the author |
| author.url | Specifies the URL of the author's website or profile |
| itemReviewed.@type | Specifies the type of the item being reviewed, e.g. Claim or CreativeWork |
| itemReviewed.author.@type | Specifies the type of the author of the item being reviewed, e.g. Person or Organization |
| itemReviewed.author.image | Specifies the URL of the image of the author of the item being reviewed |
| itemReviewed.author.jobTitle | Specifies the job title of the author of the item being reviewed |
| itemReviewed.author.name | Specifies the name of the author of the item being reviewed |
| itemReviewed.datePublished | Specifies the date the item being reviewed was published |
| reviewRating.@type | Specifies the type of the rating, e.g. Rating or AggregateRating |
| reviewRating.alternateName | Specifies the textual form of the rating, e.g. "False" or "Mostly true" |
| reviewRating.bestRating | Specifies the best possible rating value |
| reviewRating.image | Specifies the URL of the image representing the rating |
| reviewRating.ratingValue | Specifies the actual rating value |
| reviewRating.worstRating | Specifies the worst possible rating value |
| itemReviewed.name | Specifies the name of the item being reviewed |
| reviewRating.ratingExplanation | Specifies an explanation of the rating |
| datePublished | Specifies the date the review was published |
| itemReviewed.appearance | Specifies creative works (e.g. web pages or posts) in which the claim appears |
| itemReviewed.firstAppearance.@type | Specifies the type of the creative work in which the claim first appeared |
| itemReviewed.firstAppearance.url | Specifies the URL of the creative work in which the claim first appeared |
| itemReviewed.author.sameAs | Specifies the URL of a page that contains further information about the author of the item being reviewed |
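
The dotted field names in the table correspond to paths through the nested JSON of each review. The sketch below illustrates that mapping with a hand-written record; both `sample_review` and the `get_path` helper are hypothetical and exist only for this illustration.

```python
def get_path(record, dotted_key, default=None):
    """Follow a dotted key such as 'reviewRating.alternateName' into nested dicts."""
    current = record
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


# Hypothetical ClaimReview record, invented for illustration only.
sample_review = {
    "@type": "ClaimReview",
    "claimReviewed": "Example claim text",
    "url": "https://example.org/fact-check/1",
    "author": {"@type": "Organization", "name": "Example Checker"},
    "reviewRating": {"@type": "Rating", "alternateName": "False"},
}

print(get_path(sample_review, "author.name"))
print(get_path(sample_review, "reviewRating.alternateName"))
# Paths missing from the record fall back to the default value.
print(get_path(sample_review, "itemReviewed.author.name", default="n/a"))
```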

In [None]:
subcols = ['claimReviewed', 'url', 'author.@type', 'author.image', 'author.name', 'author.url', 'itemReviewed.@type',
           'itemReviewed.author.@type','itemReviewed.author.image','itemReviewed.author.jobTitle','itemReviewed.author.name',
           'itemReviewed.datePublished','reviewRating.@type','reviewRating.alternateName','reviewRating.bestRating','reviewRating.image',
           'reviewRating.ratingValue','reviewRating.worstRating','itemReviewed.name','reviewRating.ratingExplanation','datePublished',
           'itemReviewed.appearance','itemReviewed.firstAppearance.@type','itemReviewed.firstAppearance.url','itemReviewed.author.sameAs']

len(subcols)

In [None]:
df = df[subcols]
df
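
Selecting with `df[subcols]` raises a `KeyError` if any listed column is absent, which could happen if a particular download of the feed lacks one of these fields. A possible alternative, shown here on a hypothetical toy frame, is `reindex`, which keeps the requested order and fills absent columns with NaN.

```python
import pandas as pd

# Hypothetical frame that lacks one of the requested columns.
toy = pd.DataFrame({"claimReviewed": ["Claim A"], "url": ["https://example.org/1"]})
wanted = ["claimReviewed", "url", "author.name"]

# reindex keeps the requested order and fills absent columns with NaN,
# whereas toy[wanted] would raise a KeyError for "author.name".
subset = toy.reindex(columns=wanted)

# Report which requested columns were not present in the source frame.
absent = [c for c in wanted if c not in toy.columns]
print(absent)
```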

In [None]:
# Remove rows where claimReviewed is null
df = df[df.claimReviewed.notna()].copy()
df.shape

In [None]:
import fasttext
fasttext_model = fasttext.load_model('lid.176.bin')

In [None]:
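def fasttext_detect(text, model):
    """Return the language code predicted by the fastText model, or 'unknown'."""
    # Empty or non-string values cannot be classified
    if not isinstance(text, str) or not text.strip():
        return 'unknown'
    try:
        # fastText's predict() rejects newlines, so replace them with spaces
        text = text.replace('\n', ' ')
        # Labels look like '__label__en'; keep only the language code
        return model.predict(text)[0][0].split('__')[-1]
    except Exception as e:
        print(text, e)
        # Return a sentinel instead of an implicit None
        return 'unknown'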
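
The model's `predict` call returns a tuple of labels together with their probabilities, so short or ambiguous claims can also be filtered by confidence. The sketch below shows one way to do that. It uses a hypothetical `StubModel` in place of the real model so that it runs without `lid.176.bin`, and the 0.5 threshold is an arbitrary assumption rather than a recommended value.

```python
class StubModel:
    """Hypothetical stand-in that returns output shaped like fastText's predict()."""

    def predict(self, text, k=1):
        # fastText returns a tuple of labels and a sequence of probabilities.
        if "hola" in text.lower():
            return (("__label__es",), (0.95,))
        return (("__label__en",), (0.40,))


def detect_with_confidence(text, model, min_prob=0.5):
    """Return a language code, or 'unknown' when the top score is below min_prob."""
    if not isinstance(text, str) or not text.strip():
        return "unknown"
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    if float(probs[0]) < min_prob:
        return "unknown"
    # Strip the '__label__' prefix to keep only the language code.
    return labels[0].replace("__label__", "", 1)


stub = StubModel()
print(detect_with_confidence("Hola, ¿qué tal?", stub))
print(detect_with_confidence("ok", stub))
```

With the real model, `fasttext_model` would be passed in place of `stub`.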

In [None]:
df['claimReviewed_lang'] = df['claimReviewed'].progress_apply(lambda x: fasttext_detect(x, fasttext_model))

In [None]:
df.claimReviewed_lang.value_counts()

In [None]:
subcols.insert(1, 'claimReviewed_lang')
subcols

In [None]:
# Keep only English claims
df_en = df[df.claimReviewed_lang == 'en']

In [None]:
df_en.shape

In [None]:
df_en.to_csv('../data/claimreview_en_subcols.csv', index=False)
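
`to_csv` does not create missing parent directories, so the call above fails if `../data` does not exist yet. The sketch below shows the pattern of creating the directory first and reading the file back as a sanity check. It writes a hypothetical toy frame into a temporary directory instead of `../data`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_en.
toy = pd.DataFrame({"claimReviewed": ["Claim A"], "claimReviewed_lang": ["en"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "claimreview_en_subcols.csv"
    # Create the parent directory before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm the shape survived the round trip.
    round_trip = pd.read_csv(out_path)

print(round_trip.shape)
```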