<a href="https://colab.research.google.com/github/BL-Labs/Jupyter-notebooks-projects-using-BL-Sources/blob/master/LOD_SPARQL/not_run/01_BNB_SPARQL_Compare_Publication_Year_for_two_Subjects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BNB -- SPARQL query - Compare resources' Publication Year for two Subjects

The [BNB Linked Data Platform](https://bnb.data.bl.uk/) provides access to the [British National Bibliography (BNB)](http://www.bl.uk/bibliographic/natbib.html) published as linked open data and made available through SPARQL services. 

This notebook shows how to query the platform for records of resources indexed under two given Subjects / Topics (LCSH), and how to compare the number of records per publication year.

## Define the Subjects to Compare

Choose the two Subjects to search for from the [LCSH list](http://id.loc.gov/authorities/subjects.html):

In [None]:
# Examples: uncomment the desired subjects, or add your own Subject and Label pairs, then choose "Run", or "Runtime" > "Run all" ("Run all cells"):

Label = ''

# Subject = 'Nanotubes'
# Label = 'Nanotubes'

Subject1 = 'Climaticchanges'
Label1 = 'Climatic Changes'

Subject2 = 'Globalwarming'
Label2 = 'Global Warming'

#### Required modules / libraries

In [None]:
import requests
import pandas as pd
import json
import csv
import matplotlib.pyplot as plt
from pandas import json_normalize

## Let's query the repository for the publications indexed under the defined Subjects
We will use the [SPARQL endpoint](https://bnb.data.bl.uk/flint-sparql) to create the query and configure the request to retrieve json as a result.

In [None]:
url = 'https://bnb.data.bl.uk/sparql'
query = """
PREFIX bibo: <http://purl.org/ontology/bibo/> 
PREFIX dct: <http://purl.org/dc/terms/> 
PREFIX schema: <http://schema.org/> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 

SELECT ?resource ?isbn ?title ?date ?author ?authorUri WHERE {{
      ?resource dct:subject <http://bnb.data.bl.uk/id/concept/lcsh/{0}>;  
            dct:title ?title ; 
            schema:author ?authorUri ; 
            schema:datePublished ?date . 
      ?authorUri rdfs:label ?author .
}}  
"""

# Query for Subject 1
query1 = query.format(Subject1)

# Request the results as JSON
headers = {'Accept': 'application/sparql-results+json'}
r = requests.get(url, params = {'format': 'application/sparql-results+json', 'query': query1}, headers=headers)
r.raise_for_status()  # stop here if the endpoint returned an HTTP error
print('Records retrieved for subject ' + Label1)
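
The doubled braces in the query template are there because `str.format` treats single braces as placeholders: `{{` and `}}` become literal `{` and `}` in the final SPARQL text, while `{0}` is replaced by the subject identifier. A minimal sketch with a shortened template (the name `template` and the trimmed query are assumptions made for brevity, not the notebook's full query):

```python
# Shortened template for illustration; the notebook's real query binds more variables.
template = """
PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?resource WHERE {{
  ?resource dct:subject <http://bnb.data.bl.uk/id/concept/lcsh/{0}> .
}}
"""

# {0} is filled with the subject identifier; {{ and }} collapse to single braces.
formatted = template.format('Globalwarming')
print(formatted)
```

Printing the formatted query before sending it is a quick way to check that the subject URI looks as intended.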


In [None]:
# print(r.text)

## Load the data into DataFrames

In [None]:
bnbdata = json.loads(r.text)
dfSub1 = json_normalize(bnbdata['results']['bindings']) 
dfSub1.head(2)
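
In the SPARQL JSON results format, each bound variable is an object with `type` and `value` keys, which is why `json_normalize` produces dotted column names such as `title.value`. A small sketch with a hand-written payload (the record below is made up for illustration and is not real BNB data):

```python
import pandas as pd

# Hypothetical payload shaped like a SPARQL JSON result with one binding.
sample = {
    "head": {"vars": ["title", "date"]},
    "results": {"bindings": [
        {
            "title": {"type": "literal", "value": "An example title"},
            "date": {"type": "literal", "value": "2010-01"},
        },
    ]},
}

# Each nested key becomes a "<variable>.<key>" column.
df = pd.json_normalize(sample["results"]["bindings"])
print(sorted(df.columns))
```

The later cells rely on this naming when they select `author.value` and `date.value`.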

In [None]:
# Query for Subject 2
query2 = query.format(Subject2)

r = requests.get(url, params = {'format': 'application/sparql-results+json', 'query': query2}, headers=headers)
r.raise_for_status()  # stop here if the endpoint returned an HTTP error
print('Records retrieved for subject ' + Label2)

bnbdata = json.loads(r.text)
dfSub2 = json_normalize(bnbdata['results']['bindings']) 
dfSub2.head(2)

## How many items?

In [None]:
# How many items for each Subject?
print('Number of records retrieved for subject ' + Label1 + ': ' + str(len(dfSub1)))
print('Number of records retrieved for subject ' + Label2 + ': ' + str(len(dfSub2)))
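
Note that `len()` counts result rows rather than distinct resources: because the query also binds `?author`, a resource with two authors would presumably come back as two rows. A sketch of counting distinct resources instead, on a made-up frame (the URIs and author names are hypothetical):

```python
import pandas as pd

# Hypothetical rows: resource 1 has two authors, so it appears twice.
rows = pd.DataFrame({
    'resource.value': [
        'http://example.org/resource/1',
        'http://example.org/resource/1',
        'http://example.org/resource/2',
    ],
    'author.value': ['Author A', 'Author B', 'Author C'],
})

n_rows = len(rows)                               # one per (resource, author) pair
n_resources = rows['resource.value'].nunique()   # one per distinct resource
print(n_rows, n_resources)
```

On the notebook's data the equivalent call would be `dfSub1['resource.value'].nunique()`.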


### Let's count the number of Resources by author

In [None]:
#Subject 1
resources_by_author_Sub1 = dfSub1['author.value'].value_counts()
print('Ranking of authors for ' + Label1)
resources_by_author_Sub1

In [None]:
#Subject 2
resources_by_author_Sub2 = dfSub2['author.value'].value_counts()
print('Ranking of authors for ' + Label2)
resources_by_author_Sub2

### Create a chart to visualize the results
The chart needs the publication year of each record, so we first derive it from the publication date.

### Let's group the books by year

In [None]:
# First we create a new column in pandas with the year
dfSub1['year'] = pd.DatetimeIndex(dfSub1['date.value']).year
dfSub2['year'] = pd.DatetimeIndex(dfSub2['date.value']).year

# Check the first five years for Subject 1
dfSub1['year'].head(5)
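
`pd.DatetimeIndex` expects every value to be parseable as a date. If some `date.value` entries turn out to be irregular (an assumption; the actual formats in the BNB data have not been checked here), one alternative is to pull out the first four-digit run with a regular expression. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical date strings, including one that is not a date at all.
dates = pd.Series(['2007-01', '2010', 'unknown'])

# Take the first run of four digits; rows without one become missing values.
year_text = dates.str.extract(r'(\d{4})', expand=False)
years = pd.to_numeric(year_text, errors='coerce').astype('Int64')
print(years)
```

The nullable `Int64` dtype keeps the years as integers while still allowing missing values, which `value_counts()` skips by default.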

### Creating the chart of resources per year

In [None]:
# .sort_index() is important: it orders the counts by year instead of by count

resources_by_year_Sub1 = dfSub1['year'].value_counts().sort_index()
resources_by_year_Sub2 = dfSub2['year'].value_counts().sort_index()

# let's check for Subject 1
resources_by_year_Sub1
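
The two series will generally not cover the same years. To compare them side by side, one option is to combine them into a single table and fill the gaps with zero. A sketch on made-up years (the column labels `'A'` and `'B'` stand in for the two subject labels):

```python
import pandas as pd

# Hypothetical publication years for two subjects.
years_a = pd.Series([2005, 2005, 2007])
years_b = pd.Series([2006, 2007, 2007])

counts_a = years_a.value_counts().sort_index()
counts_b = years_b.value_counts().sort_index()

# Outer-join on year; a year missing from one subject counts as 0.
comparison = (
    pd.concat([counts_a, counts_b], axis=1, keys=['A', 'B'])
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(comparison)
```

A table like this can also be plotted directly with `comparison.plot()`, which draws one line per column.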

In [None]:
plt.figure(figsize=(15,7))
resources_by_year_Sub1.plot(marker='o', markerfacecolor='blue', markersize=1, color='skyblue', linewidth=2, label=Label1)
resources_by_year_Sub2.plot(marker='x', color='orange', markersize=2, linewidth=2, label=Label2)
plt.legend(title = 'Number of Resources by Publication Year', title_fontsize = '14')
plt.xlabel("Year")
plt.ylabel("Number of Published Resources")