# Hidden Engagement & Problems

This notebook documents problems encountered while gathering Facebook engagement data for scholarly articles based on their DOIs. Specifically, we want to collect URL-specific engagement data (shares, likes, comments) and compare it with the article-level data from Altmetric. The oddities and surprises found along the way are organized below.

## Intro

**What Altmetric does:**

Scholarly article -> crawl public Facebook pages -> count mentions of IDs, URLs

**What we are trying:**

Scholarly Article ID -> resolve to URL (e.g. via CrossRef) -> retrieve engagement data from FB OpenGraph API
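
As a rough illustration of the last step, the sketch below builds the kind of request that looks up a single URL in the Graph API. The unversioned endpoint, the helper name `build_engagement_query`, and the placeholder token are assumptions made for illustration; the examples further down use the project's own `Facebook` wrapper, whose internals may differ.

```python
from urllib.parse import urlencode

# Assumption: unversioned Graph API endpoint; a versioned path may be required.
GRAPH_ENDPOINT = "https://graph.facebook.com/"


def build_engagement_query(article_url, access_token,
                           fields=("og_object", "engagement")):
    """Return the request URL that looks up one article URL (hypothetical helper)."""
    params = {
        "id": article_url,             # the URL whose Open Graph object we want
        "fields": ",".join(fields),    # same fields the examples below request
        "access_token": access_token,  # app token supplied by the caller
    }
    return GRAPH_ENDPOINT + "?" + urlencode(params)


query = build_engagement_query(
    "http://www.tandfonline.com/doi/full/10.3402/fnr.v60.31694",
    access_token="APP_ID|APP_SECRET",  # placeholder, not a real token
)
print(query)
```

Each article URL needs its own lookup, which is why the number of candidate URLs per article matters for rate limiting.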

## Main challenges/questions

Looking at the big picture, we can see two main challenges for our approach:

### How do we get to the relevant URLs?

Starting at a single DOI for an article, how do we make sure to find the relevant URLs?

Several subproblems and examples:

+ DOIs point to different URLs over time

    DOIs are meant to be persistent, but articles can migrate to new domains. Thus, simply resolving a DOI and retrieving engagement data will miss previous shares.
    
+ Different URLs might exist concurrently for articles.

    Biomedical research is often identified by DOI and PMID. The DOI might resolve to a different URL than those based on PMID.

So far, there are a few things that we could try (keep in mind that we need to query each URL separately, and the Facebook API is rate-limited):

```
DOI ---CrossRef--> current URL
DOI -----PKP-----> original URL
DOI -----PMC-----> alternative URLs for biomed research
```
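
The mapping above can be sketched as a helper that collects every URL under which one article may have been shared. This is only a sketch: `candidate_urls` is a hypothetical name, the identifiers are passed in by hand rather than looked up via CrossRef, PKP, or PMC, and the PubMed/PMC URL patterns are simply the ones used in the examples later in this notebook.

```python
def candidate_urls(doi, current_url=None, original_url=None, pmid=None, pmcid=None):
    """Return the distinct URLs under which one article may have been shared."""
    urls = []
    # DOI resolver links in the scheme/host variants tested later in this notebook
    for scheme in ("http", "https"):
        for host in ("dx.doi.org", "doi.org"):
            urls.append("{}://{}/{}".format(scheme, host, doi))
    if current_url:
        urls.append(current_url)    # where the DOI resolves today
    if original_url:
        urls.append(original_url)   # landing page before a migration (e.g. PKP URL)
    if pmid:
        urls.append("https://www.ncbi.nlm.nih.gov/pubmed/{}".format(pmid))
    if pmcid:
        urls.append("https://www.ncbi.nlm.nih.gov/pmc/articles/{}/".format(pmcid))
    return list(dict.fromkeys(urls))  # drop duplicates, keep insertion order


urls = candidate_urls(
    "10.3402/fnr.v60.31694",
    current_url="http://www.tandfonline.com/doi/full/10.3402/fnr.v60.31694",
    original_url="http://www.foodandnutritionresearch.net/index.php/fnr/article/view/31694",
    pmid="27680091",
    pmcid="PMC5040825",
)
for u in urls:
    print(u)
```

Every entry in the returned list costs one API call, so the list should stay as short as the coverage goal allows.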

### What happens at Facebook with those URLs?

Having identified which URL we want to use to query the Facebook API, several things might happen now:

- The Facebook Crawler will resolve each link to an Open Graph object based on a canonical URL which
  - can be provided by the metatag `og:url`
  - can be inferred from the page content
  
The canonical URL makes sure that different variations of a page (http, https, trailing slashes, different views) still resolve to the same Open Graph object (and thus share counts). Unfortunately, sometimes this doesn't happen:

- Even if the page contains the recommended OG metatags, the FB crawler might fail to resolve the URL. In an extreme case, the canonical URL is modified because of resolution errors, which produces different OG objects for variations of the same page. (See [StackOverflow question](https://stackoverflow.com/questions/48159408/facebook-crawler-infers-different-ogurl-than-the-one-specified-in-the-metatag))
- If the page does not provide any metatags, FB tries to infer the canonical URL, which often simply fails. Various versions of the same article will then be resolved to different OG objects with varying share numbers.
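
To check which of the two cases a page falls into, one can look at what the page itself declares. The sketch below uses only the standard library to pull `og:url` out of an HTML document; falling back to `<link rel="canonical">` is my own assumption about what a page might offer, not a statement about what the FB crawler actually uses. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class CanonicalURLParser(HTMLParser):
    """Collect og:url and <link rel="canonical"> from an HTML document."""

    def __init__(self):
        super().__init__()
        self.og_url = None
        self.link_canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:url":
            self.og_url = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.link_canonical = attrs.get("href")


def declared_canonical(html):
    """Return the canonical URL the page declares, or None if it declares none."""
    parser = CanonicalURLParser()
    parser.feed(html)
    return parser.og_url or parser.link_canonical


sample = """
<html><head>
  <meta property="og:url" content="https://example.org/article/1" />
  <link rel="canonical" href="https://example.org/article/1?view=full">
</head><body></body></html>
"""
print(declared_canonical(sample))
```

A `None` result means the page leaves the canonical URL entirely to FB's inference, which is the second case described above.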

**Link resolving**

As previously mentioned, bad page design (the previous example involved the handling of a cookie error) can cause problems. In addition, some URLs require a browser to resolve successfully (see Joe Wass's blog post at CrossRef about [DOIs vs URLs](https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/)). According to [this question](https://stackoverflow.com/questions/25420887/does-facebook-crawler-currently-interpret-javascript-before-parsing-the-dom), FB does not execute JavaScript to resolve URLs.

**Max redirects or redirect loops**

FB also stops resolving URLs after 5 redirects.
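
A minimal sketch of how one could flag URLs that would exceed such a limit or run into a loop. The function `follow_redirects` and its `next_location` callback are hypothetical; the callback stands in for whatever performs the actual HTTP request, and here it is a plain dictionary lookup so the logic can be tried offline.

```python
def follow_redirects(url, next_location, max_redirects=5):
    """Follow a redirect chain and give up after max_redirects hops.

    next_location(url) must return the redirect target, or None if url is final.
    Returns (final_url, chain); final_url is None for loops or too many hops.
    """
    chain = [url]
    while True:
        target = next_location(chain[-1])
        if target is None:
            return chain[-1], chain      # resolved within the limit
        if target in chain or len(chain) > max_redirects:
            return None, chain           # redirect loop or limit exceeded
        chain.append(target)


# Toy redirect tables instead of real HTTP requests
short_hops = {"a": "b", "b": "c", "c": "d"}
long_hops = {"u{}".format(i): "u{}".format(i + 1) for i in range(6)}  # 6 redirects
loop_hops = {"x": "y", "y": "x"}

print(follow_redirects("a", short_hops.get))
print(follow_redirects("u0", long_hops.get))
print(follow_redirects("x", loop_hops.get))
```

In practice, `next_location` could wrap an HTTP request with automatic redirects disabled and return the `Location` header of 3xx responses.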

## Useful links

+ Joe Wass at CrossRef has looked into the complicated relationship between DOIs and URLs. [link](https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/)
+ A very quick introduction to the OpenGraph protocol. [link](https://www.jerriepelser.com/blog/introduction-to-the-open-graph-protocol)

## Examples

Now I want to go through some examples of the previously discussed problems.

In [1]:
import configparser
import datetime
import json
import time
import urllib.parse
from pprint import pprint

import pandas as pd

# Project-local helpers (ATB package)
from ATB.ATB.Altmetric import Altmetric, AltmetricHTTPException
from ATB.ATB.Facebook import Facebook
from ATB.ATB.Utils import resolve_doi

# Load config
Config = configparser.ConfigParser()
Config.read('config.cnf')
FACEBOOK_APP_ID = Config.get('facebook', 'app_id')
FACEBOOK_APP_SECRET = Config.get('facebook', 'app_secret')
ALTMETRIC_KEY = Config.get('altmetric', 'key')

In [2]:
fb_graph = Facebook(FACEBOOK_APP_ID, FACEBOOK_APP_SECRET)
altmetric = Altmetric(api_key = ALTMETRIC_KEY)

Generated access token: 287299458433880|6Y_ml710QWnU7HBYLWjaneoWVKU


### URL variations mapped to different Open Graph objects

One article and various URL variations -> different OG objects.

Details for this problem: https://stackoverflow.com/questions/48159408/facebook-crawler-infers-different-ogurl-than-the-one-specified-in-the-metatag

Nevertheless, it is interesting to compare the share counts obtained directly from FB with the numbers from Altmetric.com:

```
Sum of Facebook URLs: 1018
Altmetric shares: 48
```

(Even though I am not so sure about the 509 shares reported for two different Open Graph objects... Facebook is a mess...)

In [21]:
url_base = "www.nature.com/news/the-future-of-dna-sequencing-1.22787"
doi = "10.1038/550179a"

urls = ['http://' + url_base,
        'http://' + url_base + '/',
        'https://' + url_base,
        'https://' + url_base + '/',
        'http://dx.doi.org/' + doi,
        'https://dx.doi.org/' + doi,
        'http://doi.org/' + doi,
        'https://doi.org/' + doi]

og_ids = []
shares = []

for url in urls:
    try:
        r = fb_graph.get_object(url, fields="og_object, engagement")
        og_ids.append(r['og_object']['id'])
        shares.append(r['engagement']['share_count'])
    except Exception:
        # Lookup failed or the response lacks og_object/engagement
        og_ids.append(None)
        shares.append(None)
    
pd.DataFrame({'URL': urls,
              'OG IDs': og_ids,
              'Shares': shares})[['URL', 'OG IDs', 'Shares']]

| | URL | OG IDs | Shares |
|---|---|---|---|
| 0 | http://www.nature.com/news/the-future-of-dna-s... | 1431803343584077.0 | 509.0 |
| 1 | http://www.nature.com/news/the-future-of-dna-s... | 1313165148787816.0 | 0.0 |
| 2 | https://www.nature.com/news/the-future-of-dna-... | 1513472432101761.0 | 3.0 |
| 3 | https://www.nature.com/news/the-future-of-dna-... | 1500355130063165.0 | 0.0 |
| 4 | http://dx.doi.org/10.1038/550179a | 1472429859490322.0 | 509.0 |
| 5 | https://dx.doi.org/10.1038/550179a | | |
| 6 | http://doi.org/10.1038/550179a | | |
| 7 | https://doi.org/10.1038/550179a | | |


For comparison, the Altmetric FB share count for the DOI:

In [9]:
altmetric.doi(doi, fetch=True)['counts']['facebook']['posts_count']

49

### FB Crawler - Max. Redirects

The FB crawler no longer crawls the previous URLs properly because of too many redirects (check with this [tool](https://developers.facebook.com/tools/debug/og/object/)). The displayed share numbers stem from an earlier crawl, when the URLs could still be resolved. This also means that any displayed share count may be out of date.

### Different share counts for DOI & URL despite identical OG ID

In [14]:
url = "http://bmcbiol.biomedcentral.com/articles/10.1186/1741-7007-10-51"
doi = "http://dx.doi.org/10.1186/1741-7007-10-51"

fb_url = fb_graph.get_object(url, fields="engagement, og_object")
fb_doi = fb_graph.get_object(doi, fields="engagement, og_object")

print("DOI and URL have same og_object_id: {}".format(fb_url['og_object']['id'] == fb_doi['og_object']['id']))

print("FB shares for URL: {}".format(fb_url['engagement']['share_count']))
print("FB shares for DOI: {}".format(fb_doi['engagement']['share_count']))

DOI and URL have same og_object_id: True
FB shares for URL: 1
FB shares for DOI: 0


### 0 shares

The FB Graph API often reports 0 shares even though the link has definitely been shared. For example, compare this [FB posting](https://www.facebook.com/permalink.php?story_fbid=790084137845947&id=583799085141121) with the [FB debugger results](https://developers.facebook.com/tools/debug/sharing/?q=https%3A%2F%2Fejournal.undip.ac.id%2Findex.php%2Fjitaa%2Farticle%2Fview%2F11730) for the shared link. (Do not confuse this with the number of shares of the FB posting itself; see the previous point.)

### Facebook API share numbers vs Altmetric share numbers

**Example 1**

http://www.tandfonline.com/doi/full/10.3402/fnr.v60.31694

```
Altmetric: 276
FB API: 2087
```

In [57]:
print("Altmetric shares:", altmetric.doi("10.3402/fnr.v60.31694", fetch=True)['counts']['facebook']['posts_count'])
print("Current DOI URL:", fb_graph.get_object("http://www.tandfonline.com/doi/full/10.3402/fnr.v60.31694", fields="engagement, og_object")['engagement']['share_count'])
print("PMCID:", fb_graph.get_object("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5040825/", fields="engagement, og_object")['engagement']['share_count'])
print("PMID:", fb_graph.get_object("https://www.ncbi.nlm.nih.gov/pubmed/27680091", fields="engagement, og_object")['engagement']['share_count'])
print("PKP URL:", fb_graph.get_object("http://www.foodandnutritionresearch.net/index.php/fnr/article/view/31694", fields="engagement, og_object")['engagement']['share_count'])

Altmetric shares: 276
Current DOI URL: 39
PMCID: 134
PMID: 67
PKP URL: 1847
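
The "FB API: 2087" figure quoted above is the sum over the four URLs queried in this cell. A small worked check, with the numbers copied from the output above (the variable names are mine):

```python
# Share counts per URL type, copied from the cell output above
fb_shares = {
    "Current DOI URL": 39,
    "PMCID": 134,
    "PMID": 67,
    "PKP URL": 1847,
}
altmetric_posts = 276

total = sum(fb_shares.values())
print("FB API total:", total)  # 2087
print("Share of total from the PKP URL: {:.1%}".format(fb_shares["PKP URL"] / total))
print("FB API total / Altmetric: {:.1f}".format(total / altmetric_posts))
```

Most of the shares sit on the original PKP URL, which resolving the DOI today would miss entirely.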


**Example 2**

http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0080-62342013000100001&lng=pt&tlng=pt

```
Altmetric: 53
FB API: 1156
```

In [52]:
print(altmetric.doi("10.1590/S0080-62342013000100001", fetch=True)['counts']['facebook']['posts_count'])

53


In [43]:
print(fb_graph.get_object("http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0080-62342013000100001&lng=pt&tlng=pt", fields="engagement, og_object")['engagement']['share_count'])
print(fb_graph.get_object("http://www.revistas.usp.br/reeusp/article/view/52846", fields="engagement, og_object")['engagement']['share_count'])

1156
0
