# CS848: The art and science of empirical computer science

## Assignment: Visualization Project

**Description:** At a high level, for this visualization project, I would like you to perform exploratory data analysis on a bibliometric dataset of your choice. From this exploration, I would like you to come up with one or more interesting observations or questions to ask. Then, I would like you to build a visualization that either "makes the points" or answers the questions that you posed.

### 1. Exploratory data analysis on a bibliometric dataset of your choice:


* The data set I chose is [DBLP](https://qlever.cs.uni-freiburg.de/dblp/jzdksf). 
* [DBLP](https://qlever.cs.uni-freiburg.de/dblp/jzdksf) allows you to run SPARQL queries to get bibliometrics about:
   * Papers
   * Their authors
   * The affiliation of the authors
   * Conferences in which those papers are published
   * etc.

### 2. Observations or questions to ask:
[CSRankings](https://csrankings.org/) provides **per-institution** rankings for different CS fields.
We are interested in a similar ranking, but **per country**.

More specifically, for **a given field** (for example, cloud), **what are the rankings by country?**


### How to answer this question?

[DBLP](https://qlever.cs.uni-freiburg.de/dblp/jzdksf) exposes a backend endpoint that accepts SPARQL queries from anywhere.

The following function executes a query that takes these parameters:
* keywords: a list of keywords that determine the desired field. The title of each returned paper contains at least one of these keywords.
* conferences: a list of the conferences to consider in the search.
* years: a `(from_year, to_year)` pair; both bounds are inclusive.
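To make the keyword matching concrete, here is a minimal sketch of what joining the keywords with `|` means. It uses Python's `re` module as an approximation of SPARQL's `REGEX`, which likewise matches a substring anywhere in the title and is case-sensitive unless a flag is given. The helper name `title_pattern` and the sample titles are hypothetical, chosen only for illustration.

```python
import re

def title_pattern(keywords):
    # Join the keywords with "|" so the pattern matches any one of them,
    # mirroring the '|'.join(keywords) expression used in the query.
    return "|".join(keywords)

# Hypothetical titles, used only to illustrate the matching behaviour.
titles = [
    "Scheduling in cloud data centers",
    "Serverless computing: a survey",
    "Cloud storage pricing",
    "A study of compilers",
]

pattern = title_pattern(["cloud", "Serverless"])

# Case-sensitive substring match: "Cloud storage pricing" is NOT matched.
matches = [t for t in titles if re.search(pattern, t)]

# Case-insensitive variant: "Cloud storage pricing" is matched as well.
matches_ci = [t for t in titles if re.search(pattern, t, re.IGNORECASE)]

print(matches)
print(matches_ci)
```

Because the match is case-sensitive, it may be worth listing both capitalizations of a keyword (for example, `cloud` and `Cloud`). Keywords containing regex metacharacters would also need escaping before being joined.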

This query will return the following information:
* paper 
* title 
* author 
* conference 
* affiliation 
* year

In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import matplotlib.pyplot as plt
from iso3166 import countries
import ipywidgets as widgets
from IPython.display import clear_output

def query(keywords, conferences, years):
    sparql = SPARQLWrapper("https://qlever.cs.uni-freiburg.de/api/dblp")
    sparql.setReturnFormat(JSON)

    sparql.setQuery(f"""
    PREFIX dblp: <https://dblp.org/rdf/schema#>
    SELECT ?paper ?title ?author ?conference ?affiliation ?year WHERE {{
      ?paper dblp:title ?title .
      ?paper dblp:publishedIn ?conference .
      ?paper dblp:yearOfPublication ?year .
      ?paper dblp:authoredBy ?author .
      ?author dblp:affiliation ?affiliation . 
      FILTER REGEX(?title, "{'|'.join(keywords)}") .
      FILTER REGEX (?conference , "{'|'.join(conferences)}") .
      FILTER (?year >= "{years[0]}") .
      FILTER (?year <= "{years[1]}") .
    }}
    """
    )
    query_res = sparql.queryAndConvert()

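    # Flatten the SPARQL JSON bindings into rows, using "" for unbound variables.
    cols = query_res['head']['vars']
    rows = []
    for res in query_res['results']['bindings']:
        rows.append([res[col]['value'] if col in res else "" for col in cols])

    # Assumption: the result is handed back as a DataFrame, one row per binding.
    return pd.DataFrame(rows, columns=cols)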


However, the returned information does not include the country.

We need a way to infer it from the affiliation. The following function checks whether the last comma-separated part of the affiliation is a country known to the [iso3166](https://pypi.org/project/iso3166/) package, which provides the ISO 3166 country definitions. If it is not, the function returns "Cannot tell".


In [None]:
def country(aff):
    res = "Cannot tell"
    if "," in aff:
        # If the country is mentioned in the affiliation,
        # it usually appears after the last ",".
        # Strip first, so the leading space does not defeat the prefix check.
        candidate = aff.split(",")[-1].strip()
        # Drop a leading article, e.g. "The Netherlands" -> "Netherlands".
        if candidate.startswith("The "):
            candidate = candidate[len("The "):].strip()

        if candidate in countries:
            res = countries.get(candidate).name

    return res
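
Once each affiliation is mapped to a country, a per-country ranking can be obtained by counting papers per country. The sketch below shows one possible way to do this with the standard library only. It is not the notebook's actual ranking code: `KNOWN_COUNTRIES` is a hypothetical stand-in for the iso3166 lookup, `detect_country` and `rank_countries` are illustrative names, and the sample rows are made up.

```python
from collections import Counter

# Hypothetical stand-in for the iso3166 lookup: a tiny set of country names.
KNOWN_COUNTRIES = {"Canada", "Germany", "Netherlands"}

def detect_country(affiliation):
    # Take the text after the last comma and check it against the known names.
    candidate = affiliation.rsplit(",", 1)[-1].strip()
    return candidate if candidate in KNOWN_COUNTRIES else "Cannot tell"

def rank_countries(rows):
    # Count each (paper, country) pair once, so that several co-authors
    # from the same country do not inflate that country's count.
    pairs = {(paper, detect_country(aff)) for paper, aff in rows}
    counts = Counter(country for _, country in pairs)
    # Sort by descending count, then by name to break ties deterministically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical (paper, affiliation) rows, shaped like two columns of the query result.
rows = [
    ("p1", "University of Waterloo, Canada"),
    ("p1", "University of Toronto, Canada"),
    ("p2", "University of Freiburg, Germany"),
    ("p3", "University of Waterloo, Canada"),
    ("p3", "Some Research Lab"),
]

ranking = rank_countries(rows)
print(ranking)
```

Counting each paper once per country is a design choice. Counting every author-affiliation row instead would weight papers with many co-authors more heavily.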