## COVID-19
### Exploring Coronavirus Publications with WeLearn

Hey there! This is the notebook I used to build a curated page for WeLearn.

If you haven't already, check out the [Map on WeLearn][exp-page].

This analysis is open source and available on GitHub.

All the data used in this analysis is available via the [WeLearn API][api-url].


[api-url]: https://welearn.cri-paris.org/.meta/docs
[exp-page]: https://welearn.cri-paris.org/experiments/covid19

In [1]:
%config InlineBackend.figure_formats = ['png']

import requests
import pandas as pd
import numpy as np
import scipy as sci
import scipy.spatial
import scipy.linalg
import networkx as nx
import networkx.readwrite.json_graph

from urllib.parse import urlparse
from pprint import pprint

import plotly.graph_objects as plotly_go

In [2]:
COVID19_UUID = '8705149e0d3a449e9a70747df29ccea4'
API_BASE_URL = f'https://welearn.cri-paris.org/api/resources/user/{COVID19_UUID}?limit=500'

covid_resources = requests.get(API_BASE_URL).json()
resources = covid_resources['results']

print(f'n(Resources) = {len(resources)}\n')

pprint(resources[0])

n(Resources) = 203

{'concepts': [{'cuid': 'Q28946449',
               'title_en': 'Transcriptomics technologies',
               'title_fr': None,
               'wikidata_id': 'Q28946449'},
              {'cuid': 'Q7318015',
               'title_en': 'Revenue assurance',
               'title_fr': None,
               'wikidata_id': 'Q7318015'},
              {'cuid': 'Q858810',
               'title_en': 'Big data',
               'title_fr': 'Big data',
               'wikidata_id': 'Q858810'},
              {'cuid': 'Q5282133',
               'title_en': 'Disease surveillance in China',
               'title_fr': None,
               'wikidata_id': 'Q5282133'},
              {'cuid': 'Q891528',
               'title_en': 'Social media measurement',
               'title_fr': None,
               'wikidata_id': 'Q891528'}],
 'created': '2020-03-04T16:28:22.901685+00:00',
 'lang': 'en',
 'readability_score': 42.0,
 'resource_id': '91c1b55c623d46e08b05e62d7c77314e',
 'title': 'Novel

### Using NetworkX for Graph Statistics

The API returns the resources in this format:

```json
{
  "results": [
    {
      "resource_id": "string",
      "title": "string",
      "url": "string",
      "lang": "string",
      "readability_score": 0,
      "concepts": [
        {
          "cuid": "string",
          "wikidata_id": "string",
          "title_en": "string",
          "title_fr": "string"
        }
      ]
    }
  ],
  "pagination": {
    "count": 0,
    "skip": 0,
    "limit": 0,
    "next": 0
  }
}
```

We can easily transform the `results` list into a `node-link` format object, which `NetworkX` can load as a graph.

The `node-link` format looks like this:
```json
{ nodes: [{id, ...props}], links: [{source, target}] }
```
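
To make the shape concrete, here is a minimal hand-written node-link object. The IDs and titles are hypothetical, and `dangling_links` is an illustrative helper (it is not part of NetworkX or the WeLearn API) that checks that every link points at a declared node.

```python
# Hypothetical IDs and titles, for illustration only; real ones come from the API.
# group 1 = resource, group 2 = concept (the convention used in this notebook).
node_links = {
    "nodes": [
        {"id": "resource-a", "group": 1, "title": "First paper"},
        {"id": "resource-b", "group": 1, "title": "Second paper"},
        {"id": "Q1", "group": 2, "title_en": "Shared concept"},
        {"id": "Q2", "group": 2, "title_en": "Other concept"},
    ],
    "links": [
        {"source": "resource-a", "target": "Q1"},
        {"source": "resource-b", "target": "Q1"},
        {"source": "resource-b", "target": "Q2"},
    ],
}


def dangling_links(node_links):
    """Return the links whose source or target is not a declared node."""
    node_ids = {node["id"] for node in node_links["nodes"]}
    return [
        link
        for link in node_links["links"]
        if link["source"] not in node_ids or link["target"] not in node_ids
    ]


print(dangling_links(node_links))  # an empty list means the object is consistent
```

Two resources pointing at the same concept share a single concept node, which is why concept nodes must not be duplicated in `nodes`.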

In [12]:
def resource_graph(data):
  def links_gen():
    for resource in data:
      for concept in resource['concepts']:
        yield {
          'source': resource['resource_id'],
          'target': concept['wikidata_id'],
        }

  def nodes_gen():
    '''Sequentially emit the resource and concept nodes.

       Because this is a flat list, each node also carries a "group"
       property. Group 1: Resource, Group 2: Concept.

       Concepts are shared between resources, so we track the concept
       IDs already emitted in a set to avoid yielding duplicates.
    '''
    _cemit = set()
    for resource in data:
      yield {
        'id': resource['resource_id'],
        'group': 1,
        'url': resource['url'],
        'title': resource['title'],
      }
      for concept in resource['concepts']:
        if concept['wikidata_id'] in _cemit:
          continue

        _cemit.add(concept['wikidata_id'])
        yield {
          'id': concept['wikidata_id'],
          'group': 2,
          'title_en': concept['title_en'],
          'title_fr': concept['title_fr'],
        }

  node_links = {
    'nodes': list(nodes_gen()),
    'links': list(links_gen()),
  }
  graph = networkx.json_graph.node_link_graph(node_links)
  
  return (graph, node_links)


G, node_link_graph = resource_graph(resources)

Nice! We now have the node-link format object. Let's get some stats out of it:

In [4]:
_nedge, _nnode = G.number_of_edges(), G.number_of_nodes()

print(f'n(edges) = {_nedge}')
print(f'n(nodes) = {_nnode}')


n(edges) = 679
n(nodes) = 479
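
These counts describe the graph as a whole. As a sketch of a per-concept statistic, the node-link object can also be tallied directly with the standard library. `concept_usage` and the sample data below are hypothetical; on the real data you would pass the `node_link_graph` dict returned by `resource_graph`. The sketch assumes concept nodes carry `group == 2` and that every link runs from a resource to a concept, as in the code above.

```python
from collections import Counter


def concept_usage(node_links):
    """Return (concept_id, title_en, n_resources) tuples, most used first."""
    # Map concept IDs to their English titles (concept nodes are group 2).
    titles = {
        node["id"]: node.get("title_en")
        for node in node_links["nodes"]
        if node.get("group") == 2
    }
    # Each link targets a concept, so counting targets counts linked resources.
    counts = Counter(link["target"] for link in node_links["links"])
    return [(cid, titles.get(cid), n) for cid, n in counts.most_common()]


# Hypothetical sample, for illustration only.
sample = {
    "nodes": [
        {"id": "resource-a", "group": 1},
        {"id": "resource-b", "group": 1},
        {"id": "Q1", "group": 2, "title_en": "Shared concept"},
        {"id": "Q2", "group": 2, "title_en": "Other concept"},
    ],
    "links": [
        {"source": "resource-a", "target": "Q1"},
        {"source": "resource-b", "target": "Q1"},
        {"source": "resource-b", "target": "Q2"},
    ],
}

for concept_id, title, n_resources in concept_usage(sample):
    print(concept_id, title, n_resources)
```

The same information should be available from the graph itself through `G.degree`; the tally here simply avoids a second pass through NetworkX.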


**Which websites are referenced most often?**

In [9]:
# Use urllib to parse the URLs and extract the domain names.
# Then build a DataFrame of (domain, freq) pairs, sorted by frequency.

domains = [urlparse(r['url']).netloc for r in resources]
freq_list = [(x, domains.count(x)) for x in set(domains)]
domain_freq = (pd
               .DataFrame(freq_list, columns=['domain', 'freq'])
               .sort_values(by='freq', ascending=False))
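
Calling `domains.count(x)` rescans the whole list once per distinct domain. That is fine for a couple of hundred resources, but a single-pass tally is a small change. The sketch below uses `collections.Counter`; `domain_counts` and the sample URLs are hypothetical names introduced for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_counts(urls):
    """Return (domain, count) pairs, most frequent first."""
    return Counter(urlparse(url).netloc for url in urls).most_common()


# Hypothetical URLs, for illustration only.
sample_urls = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.com/c",
]

print(domain_counts(sample_urls))
# → [('example.org', 2), ('example.com', 1)]
```

The returned pairs are already sorted by frequency, so they could be passed to `pd.DataFrame(..., columns=['domain', 'freq'])` without a separate `sort_values` call.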

In [11]:
fig = plotly_go.Figure(data=plotly_go.Bar(x=domain_freq.domain, y=domain_freq.freq))
fig.show()