<a href="https://colab.research.google.com/github/JinboCi/Knowledge_Graph/blob/master/KG_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KG Project

Want to look up information about a Nobel Prize laureate with something you built yourself?

In this project, we build a search system whose information is stored in a knowledge graph. The pipeline has three parts:

1. Collecting the data from Wikipedia (<https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country>)
2. Mapping structured data into RDF (please refer to <https://jena.apache.org/tutorials/rdf_api.html> for details)
3. Querying the graph with SPARQL (please visit <https://www.w3.org/TR/rdf-sparql-query/> for more information)

First of all, let's install the required packages: SPARQLWrapper, rdb2rdf and scrapy!

In [0]:
!pip install SPARQLWrapper
!pip install rdb2rdf
!pip install scrapy

Import the modules we will need:

In [0]:
import scrapy
import re

from scrapy.crawler import CrawlerProcess
import csv
import logging

## The List of Nobel Prize Laureates

We divide the task of building the RDF graph into two parts:

 - First, we collect the data: we use the third-party library *scrapy* to crawl the Wikipedia page and store the results in a CSV file. 
 
 Most of the code in the first part is excerpted or adapted from Kyran Dale, *Data Visualization with Python and JavaScript: Scrape, Clean, Explore & Transform Your Data*, O'Reilly Media (2016), Chapter 6.
 
 - Then, when building graphs, we read the data from that CSV file and convert it into an RDF graph. 
 
 
 We write the graph-building function separately because we are sometimes given CSV files directly. In that case we only need to build RDF graphs from the existing CSV files rather than crawl the website.


### A simple version

We write this simple version for debugging purposes, but it is also a good chance to glance at the outline of our data.

For this version, we consider only three kinds of information about the Nobel Prize winners:
- country
- name
- link_text
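
To get a feel for these three fields before bringing in Scrapy, the heading-then-list pattern can be tried offline. The sketch below is only an approximation: it uses the standard library's `xml.etree.ElementTree` on a small, made-up, well-formed fragment (the names and the country are placeholders, not real page content), and `extract_winners` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up fragment that imitates the assumed page layout:
# an <h3> country heading followed by an <ol> of winners.
SAMPLE_HTML = """
<div>
  <h3><span class="mw-headline">Exampleland</span></h3>
  <ol>
    <li><a href="/wiki/Jane_Doe">Jane Doe</a>, Physics, 1950</li>
    <li><a href="/wiki/John_Roe">John Roe</a>*, Peace, 1960</li>
  </ol>
</div>
"""

def extract_winners(fragment):
    """Pair each country heading with the winners in the list that follows it."""
    root = ET.fromstring(fragment)
    items = []
    country = None
    for element in root:
        if element.tag == "h3":
            headline = element.find("span[@class='mw-headline']")
            country = headline.text if headline is not None else None
        elif element.tag == "ol" and country:
            for li in element.findall("li"):
                parts = list(li.itertext())  # text of the <li> and its descendants
                items.append({
                    "country": country,
                    "name": parts[0],
                    "link_text": " ".join(parts),
                })
            country = None  # use only the first list after each heading
    return items

winners = extract_winners(SAMPLE_HTML)
for winner in winners:
    print(winner)
```

The spider in the next cell does the same pairing with XPath (`following-sibling::ol[1]`), which also copes with real-world HTML that is not well-formed XML.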

In [0]:
# nwinners_list_spider.py

# A. Define the data to be scraped
class NWinnerItem(scrapy.Item):
  country = scrapy.Field()
  name = scrapy.Field()
  link_text = scrapy.Field()

  
# B Create a named spider
class NWinnerSpiderSimp(scrapy.Spider):
  """ Scrapes the country and link-text of the Nobel-winners. """
  name = 'nwinners_list'
  allowed_domains = ['en.wikipedia.org']
  start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

  
  # C A parse method to deal with the HTTP response
  def parse(self, response):
    h3s = response.xpath('//h3')
    items = []
    for h3 in h3s:
      country = h3.xpath('span[@class="mw-headline"]/text()')\
      .extract()
      if country:
        winners = h3.xpath('following-sibling::ol[1]')
        for w in winners.xpath('li'):
          text = w.xpath('descendant-or-self::text()')\
          .extract()
          items.append(NWinnerItem(
            country=country[0], name=text[0],
            link_text = ' '.join(text)
            ))
    return items

In [0]:
process = CrawlerProcess()
process.crawl(NWinnerSpiderSimp)
process.start()

Looks good? Remember to restart the kernel before running another crawl, though: Scrapy runs on the Twisted reactor, which cannot be restarted once it has stopped, so `process.start()` works only once per kernel session.
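
If restarting the kernel is inconvenient, one possible workaround is to launch each crawl in a fresh Python interpreter. The sketch below shows only the mechanism, with a one-line stand-in script instead of a real spider; `run_in_fresh_process` is a hypothetical helper name, not part of Scrapy.

```python
import subprocess
import sys

def run_in_fresh_process(script_source):
    """Run Python source in a new interpreter and return the CompletedProcess."""
    return subprocess.run(
        [sys.executable, "-c", script_source],
        capture_output=True,
        text=True,
    )

# Stand-in for a script that would define the spider and call process.start()
result = run_in_fresh_process("print('crawl finished')")
print(result.returncode, result.stdout.strip())
```

In practice the script passed in would contain the spider definition and the `CrawlerProcess` calls, so every crawl starts in its own process.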

Then import all the modules we will need in the following sections:

In [0]:
import scrapy
import re

from scrapy.crawler import CrawlerProcess
import csv
import logging

import os
import sys
import time
import random
import datetime

from SPARQLWrapper import SPARQLWrapper, JSON, XML

import rdflib
from rdflib import URIRef, BNode, Literal
from rdflib import Namespace
from rdflib.namespace import RDF, FOAF
from rdflib import Graph

### A comprehensive version

Besides the *country*, *name* and *link_text* fields considered previously, in this part we fetch more information about the Nobel Prize winners, including:
 - year
 - category
 - nationality
 - gender
 - born_in
 - date_of_birth
 - date_of_death
 - place_of_birth
 - place_of_death
 

In [0]:
BASE_URL = 'http://en.wikipedia.org'
class NWinnerItem(scrapy.Item):
  name = scrapy.Field()
  link = scrapy.Field()
  year = scrapy.Field()
  category = scrapy.Field()
  nationality = scrapy.Field()
  gender = scrapy.Field()
  born_in = scrapy.Field()
  date_of_birth = scrapy.Field()
  date_of_death = scrapy.Field()
  place_of_birth = scrapy.Field()
  place_of_death = scrapy.Field()
  text = scrapy.Field()
  
# B Create a named spider
class NWinnerSpiderComph(scrapy.Spider):
  """ Scrapes each winner's list entry, then follows the links to the biography and Wikidata pages. """
  name = 'nwinners_list'
  allowed_domains = ['en.wikipedia.org']
  start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

  
  # C A parse method to deal with the HTTP response
  def parse(self, response):
    h3s = response.xpath('//h3')
    for h3 in h3s:
      country = h3.xpath('span[@class="mw-headline"]/text()')\
                .extract()
      if country:
        winners = h3.xpath('following-sibling::ol[1]')
        for w in winners.xpath('li'):
          wdata = self.process_winner_li(w, country[0])
          # follow the winner's biography page, carrying the partial item along
          request = scrapy.Request(
            wdata['link'],
            callback=self.parse_bio,
            dont_filter=True)
          request.meta['item'] = NWinnerItem(**wdata)
          yield request

  def process_winner_li(self, w, country=None):
    """
    Process a winner's <li> tag, adding country of birth or
    nationality, as applicable.
    """
    wdata = {}
    wdata['link'] = BASE_URL + w.xpath('a/@href').extract()[0]
    text = ' '.join(w.xpath('descendant-or-self::text()').extract())
    # get the comma-delimited name and strip surrounding whitespace
    wdata['name'] = text.split(',')[0].strip()
    # look for four adjacent digits (the award year) in the text
    year = re.findall(r'\d{4}', text)
    if year:
      wdata['year'] = int(year[0])
    else:
      wdata['year'] = 0
      print('Oops, no year in ', text)
    category = re.findall('Physics|Chemistry|Physiology or Medicine|Literature|Peace|Economics',text)
    if category:
      wdata['category'] = category[0]
    else:
      wdata['category'] = ''
      print('Oops, no category in ', text)
    if country:
      # an asterisk marks a winner listed under their country of birth rather than their nationality
      if text.find('*') != -1:
        wdata['nationality'] = ''
        wdata['born_in'] = country
      else:
        wdata['nationality'] = country
        wdata['born_in'] = ''
    # store a copy of the link's text-string for any manual corrections
    wdata['text'] = text
    return wdata
  
  def parse_bio(self, response):
    item = response.meta['item']
    href = response.xpath("//li[@id='t-wikibase']/a/@href").extract()
    if href:
      request = scrapy.Request(href[0],\
                  callback=self.parse_wikidata,\
                              dont_filter=True)
      request.meta['item'] = item
      yield request
  def parse_wikidata(self, response):
    item = response.meta['item']
    property_codes = [
      {'name':'date_of_birth', 'code':'P569'},
      {'name':'date_of_death', 'code':'P570'},
      {'name':'place_of_birth', 'code':'P19', 'link':True},
      {'name':'place_of_death', 'code':'P20', 'link':True},
      {'name':'gender', 'code':'P21', 'link':True}
    ]    
    p_template = '//*[@id="%(code)s"]/div[2]/div[1]/div/div[2]/div[2]/div[1]'
    for prop in property_codes:
      extra_html = ''
      if prop.get('link'): # property string in <a> tag
        extra_html = '/a'
      sel = response.xpath(p_template%prop + extra_html + '/text()')
      if sel:
        item[prop['name']] = sel[0].extract()
    yield item


We will not run this comprehensive version for now, since it would print every log message to the screen. Instead, in the next part we run a very similar version that suppresses the verbose logging and stores what we crawl in a CSV file.
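
The text-parsing step of `process_winner_li` can still be tried in isolation, without crawling anything. The sketch below mirrors that logic on a made-up entry; `parse_winner_text`, the sample text and the country name are illustrative assumptions, not part of the spider.

```python
import re

CATEGORY_PATTERN = re.compile(
    r'Physics|Chemistry|Physiology or Medicine|Literature|Peace|Economics')

def parse_winner_text(text, country):
    """Extract name, year, category and country role from one list entry."""
    year_match = re.search(r'\d{4}', text)        # first run of four digits
    category_match = CATEGORY_PATTERN.search(text)
    born_only = '*' in text                       # asterisk: listed by country of birth
    return {
        'name': text.split(',')[0].strip(),
        'year': int(year_match.group()) if year_match else 0,
        'category': category_match.group() if category_match else '',
        'nationality': '' if born_only else country,
        'born_in': country if born_only else '',
    }

sample = parse_winner_text('Jane Doe *, Physics, 1950', 'Exampleland')
print(sample)
```

Note that, as in the spider, the asterisk stays in the `name` field, which is one reason the raw `text` is kept for manual corrections.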

### Storing to CSV
We may want to write the results to CSV files or store them directly as RDF graphs. How could we achieve this? 

Hmm...

We can first store the data as CSV, and then convert the CSV file into an RDF graph!

In [0]:
BASE_URL = 'http://en.wikipedia.org'
class NWinnerItem(scrapy.Item):
  name = scrapy.Field()
  link = scrapy.Field()
  year = scrapy.Field()
  category = scrapy.Field()
  nationality = scrapy.Field()
  gender = scrapy.Field()
  born_in = scrapy.Field()
  date_of_birth = scrapy.Field()
  date_of_death = scrapy.Field()
  place_of_birth = scrapy.Field()
  place_of_death = scrapy.Field()
  text = scrapy.Field()
  
# B Create a named spider
class NWinnerSpiderToCsv(scrapy.Spider):
  """ Scrapes each winner's list entry, biography and Wikidata pages, and exports the items to CSV. """
  name = 'nwinners_list'
  allowed_domains = ['en.wikipedia.org']
  start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]
  items = []
  # newer Scrapy releases (2.1+) prefer the single FEEDS setting over FEED_FORMAT/FEED_URI
  custom_settings = {
      'LOG_LEVEL': 'INFO',      # hide DEBUG-level log lines
      'FEED_FORMAT':'csv',      # export the scraped items as CSV
      'FEED_URI': 'nwinners_list.csv'
    }

  
  # C A parse method to deal with the HTTP response
  def parse(self, response):
    h3s = response.xpath('//h3')
    for h3 in h3s:
      country = h3.xpath('span[@class="mw-headline"]/text()')\
                .extract()
      if country:
        winners = h3.xpath('following-sibling::ol[1]')
        for w in winners.xpath('li'):
          wdata = self.process_winner_li(w, country[0])
          # follow the winner's biography page, carrying the partial item along
          request = scrapy.Request(
            wdata['link'],
            callback=self.parse_bio,
            dont_filter=True)
          request.meta['item'] = NWinnerItem(**wdata)
          yield request
  
      
  def process_winner_li(self, w, country=None):
    """
    Process a winner's <li> tag, adding country of birth or
    nationality, as applicable.
    """
    wdata = {}
    wdata['link'] = BASE_URL + w.xpath('a/@href').extract()[0]
    text = ' '.join(w.xpath('descendant-or-self::text()').extract())
    # get the comma-delimited name and strip surrounding whitespace
    wdata['name'] = text.split(',')[0].strip()
    # look for four adjacent digits (the award year) in the text
    year = re.findall(r'\d{4}', text)
    if year:
      wdata['year'] = int(year[0])
    else:
      wdata['year'] = 0
      print('Oops, no year in ', text)
    category = re.findall('Physics|Chemistry|Physiology or Medicine|Literature|Peace|Economics',text)
    if category:
      wdata['category'] = category[0]
    else:
      wdata['category'] = ''
      print('Oops, no category in ', text)
    if country:
      # an asterisk marks a winner listed under their country of birth rather than their nationality
      if text.find('*') != -1:
        wdata['nationality'] = ''
        wdata['born_in'] = country
      else:
        wdata['nationality'] = country
        wdata['born_in'] = ''
    # store a copy of the link's text-string for any manual corrections
    wdata['text'] = text
    return wdata
  
  def parse_bio(self, response):
    item = response.meta['item']
    # the "Wikidata item" link in the sidebar of the biography page
    href = response.xpath("//li[@id='t-wikibase']/a/@href").extract()
    if href:
      request = scrapy.Request(href[0],\
                  callback=self.parse_wikidata,\
                              dont_filter=True)
      request.meta['item'] = item
      return request

  def parse_wikidata(self, response):
    item = response.meta['item']
    property_codes = [
      {'name':'date_of_birth', 'code':'P569'},
      {'name':'date_of_death', 'code':'P570'},
      {'name':'place_of_birth', 'code':'P19', 'link':True},
      {'name':'place_of_death', 'code':'P20', 'link':True},
      {'name':'gender', 'code':'P21', 'link':True}
    ]
    # this template was obtained by carefully examining the Wikidata page's elements
    p_template = '//*[@id="%(code)s"]/div[2]/div/div/div[2]/div[1]/div/div[2]/div[2]/div[1]'
    for prop in property_codes:
      extra_html = ''
      if prop.get('link'): # property string in <a> tag
        extra_html = '/a'
      
      sel = response.xpath(p_template%prop + extra_html + '/text()')
      if sel:
        item[prop['name']] = sel[0].extract()
      else:
        # keep the CSV columns consistent when a property is missing
        item[prop['name']] = ""
    # returning the item hands it to the feed exporter, which writes the CSV row
    return item


Now, let's try to see how it works:

In [0]:
process = CrawlerProcess()
process.crawl(NWinnerSpiderToCsv)
process.start()

## Building Graphs

The code in this part generates RDF graphs from datasets. For simplicity, we consider only the simplest case: mapping structured data into graphs.

### Structured datasets

In this part, we build our RDF graphs from structured datasets (i.e. CSV files), using the libraries *rdflib* and *csv*. 

The function takes three parameters:

- filepath: the path of our CSV file
- output : the expected path of our output file
- output_format: the output format, including 'xml', 'n3', 'turtle', 'nt', 'pretty-xml', 'trix', 'trig' and 'nquads'
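
The mapping performed by the function below can be previewed without rdflib. The following is a minimal sketch under stated assumptions: triples are plain Python tuples rather than rdflib terms, the sample CSV row is made up, each predicate is simply the column name appended to the Wikipedia prefix (the notebook's own function picks some predicates differently), and empty cells are skipped, which is a choice of this sketch. `rows_to_triples` is a hypothetical name.

```python
import csv
import io

WIKI = 'https://en.wikipedia.org/wiki/'
FOAF_NAME = 'http://xmlns.com/foaf/0.1/name'

def rows_to_triples(csv_text):
    """Turn each CSV row into (subject, predicate, object) tuples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row['link']                 # the winner's page identifies the node
        triples.append((subject, FOAF_NAME, row['name']))
        for column in ('category', 'year', 'nationality'):
            if row[column]:                   # skip empty cells
                triples.append((subject, WIKI + column, row[column]))
    return triples

# Made-up row with an empty nationality cell
sample_csv = (
    "category,link,name,nationality,year\n"
    "Physics,http://en.wikipedia.org/wiki/Jane_Doe,Jane Doe,,1950\n"
)
triples = rows_to_triples(sample_csv)
for triple in triples:
    print(triple)
```

Reading columns by header name, as `csv.DictReader` does here, avoids depending on the column order of the CSV file.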

In [0]:
def R2RDF(filepath, output, output_format):
  first_Row = True
  graph = Graph()
  if os.path.isfile(filepath):
    with open(filepath, encoding="utf-8") as csvfile:
      readCSV = csv.reader(csvfile,delimiter=',')
      wiki_prefix = Namespace('https://en.wikipedia.org/wiki/')
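      for row in readCSV:
        # skip the header row
        if first_Row:
          first_Row = False
          continue
        # columns are read by position; this assumes the alphabetical
        # field order of the CSV file produced by the spider above
        born_in = Literal(row[0])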
        category = Literal(row[1])
        date_of_birth = Literal(row[2])
        date_of_death = Literal(row[3])
        gender = Literal(row[4])
        link = Literal(row[5])
        name = Literal(row[6])
        nationality = Literal(row[7])
        place_of_birth = Literal(row[8])
        place_of_death = Literal(row[9])
        text = Literal(row[10])
        year = Literal(row[11])
        current_node = URIRef(link)
        graph.add((current_node, RDF.type, FOAF.Person))
        graph.add((current_node, FOAF.name, name))
        graph.add((current_node, wiki_prefix.homeland, born_in))
        #if born_in != "":
        #graph.add((born_in, RDF.type, wiki_prefix.country))
        graph.add((current_node, wiki_prefix.Course_education, category))
        graph.add((current_node, wiki_prefix.date_of_birth,date_of_birth))
        #graph.add((date_of_birth, RDF.type, FOAF.Date))
        #if date_of_death != "":
          #graph.add((date_of_death, RDF.type, FOAF.Date))
        graph.add((current_node, wiki_prefix.date_of_death, date_of_death))
        #if gender != "":
        graph.add((current_node, FOAF.gender, gender))
        graph.add((current_node, FOAF.accountServiceHomepage, link))
        graph.add((current_node, wiki_prefix.nationality, nationality))
        #graph.add((nationality, RDF.type, wiki_prefix.country))
        #if place_of_birth != "":
        graph.add((current_node, wiki_prefix.place_of_birth, place_of_birth))
        #graph.add((place_of_birth, RDF.type, wiki_prefix.city))
        #if place_of_death != "":
        graph.add((current_node, wiki_prefix.place_of_death, place_of_death))
        #graph.add((place_of_death, RDF.type, wiki_prefix.city))
        graph.add((current_node, FOAF.depiction, text))
        graph.add((current_node, wiki_prefix.year, year))
    if os.path.isfile(output):
        os.remove(output)
    graph.serialize(destination=output, format=output_format)

Now let's see how it works:

In [0]:
R2RDF('nwinners_list.csv', 'nwinners_list.ttl', 'turtle')

In [0]:
R2RDF('nwinners_list.csv', 'nwinners_list.xml', 'pretty-xml')

## SPARQL Querier

We write a class that runs basic SPARQL queries against an RDF database. 

### Querying with SPARQLWrapper (unsuccessful attempt)
The following functions are to be considered:

- Function *init*
- Function *Querying_database* 

The class *Query_with_SPARQLWrapper* and its member functions take the following inputs: 

- Url: the path of the RDF graph

- Enum: indicates whether *Input_string* is the path of an input file ("File") or the SPARQL query itself ("Query")

- Return_format: JSON or XML
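
The file-or-query switch described by the second input can be sketched on its own, without any SPARQL library. This is an illustrative stand-alone helper under assumptions of this sketch (`load_query` is a hypothetical name, and the temporary `.rq` file exists only for the demonstration); it is not the class defined in the next cell.

```python
import os
import tempfile

def load_query(kind, input_string):
    """Return the query text, reading it from a file when kind is 'File'."""
    if kind == "File":
        if not os.path.isfile(input_string):
            raise FileNotFoundError(input_string + " does not exist")
        with open(input_string, encoding="utf-8") as query_file:
            return query_file.read()
    if kind == "Query":
        return input_string
    raise ValueError("kind must be 'File' or 'Query'")

QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 1"

# Write the query to a temporary file so both branches can be exercised
with tempfile.NamedTemporaryFile("w", suffix=".rq", delete=False,
                                 encoding="utf-8") as handle:
    handle.write(QUERY)
    query_path = handle.name

from_file = load_query("File", query_path)
inline = load_query("Query", QUERY)
os.remove(query_path)
print(from_file == inline)
```

Opening the file in a `with` block closes it automatically once the text has been read.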

In [0]:
class Query_with_SPARQLWrapper:
    File_or_Query = {
        "File",
        "Query",
    }
    Return_format = {
        "JSON": JSON,
        "XML": XML,
    }
    File_path = ""
    Query_string = ""
    Url = ""
    Sparql = SPARQLWrapper("")
    def __init__(self, Url):
        self.Sparql = SPARQLWrapper(Url)
    def Querying_database(self, Enum, Input_string, Return_format):
        if Enum == "File":
            self.File_path = Input_string
            if not os.path.isfile(self.File_path):
                raise TypeError(self.File_path + " does not exist")
            # read the query text; a str has no close(), so use a context manager
            with open(self.File_path) as query_file:
                self.Query_string = query_file.read()
        else:
            self.Query_string = Input_string
        self.Sparql.setQuery(self.Query_string)
        #pdb.set_trace()
        if Return_format == "JSON":
            self.Sparql.setReturnFormat(JSON)
        else:
            self.Sparql.setReturnFormat(XML)
        results = self.Sparql.query().convert()
        return results

Now let's see how it works:

In [0]:
Fp = Query_with_SPARQLWrapper("https://raw.githubusercontent.com/JinboCi/Knowledge_Graph/master/nwinners_list.xml")
# with JSON, convert() returns a parsed dict, so there is nothing to decode
print(Fp.Querying_database("Query", 
                           """SELECT ?subject ?predicate 
WHERE {
  ?subject ?predicate "Chicago"
}
LIMIT 25""", 
                           "JSON"))

### Rdflib method

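Unfortunately, SPARQLWrapper does not work here: it sends queries to a SPARQL endpoint, whereas our URL points to a static RDF/XML file that cannot evaluate queries. Fortunately, there is another way: rdflib can parse the file and run the query locally!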

In [0]:
def Rdflib_method(filepath, inputstring):
    graph = rdflib.Graph()
    graph.parse(filepath, format='xml')
    qres = graph.query(inputstring)
    for row in qres:
        print(row)
    return qres

How does it work?

In [6]:
Rdflib_method("https://raw.githubusercontent.com/JinboCi/Knowledge_Graph/master/nwinners_list.xml", 
             """
             PREFIX  foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?object 
WHERE {
  ?subject ?predicate "Tibet".
  ?subject foaf:name ?object
}
""")

(rdflib.term.Literal('14th Dalai Lama'),)


<rdflib.plugins.sparql.processor.SPARQLResult at 0x7f5d573ac278>
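
To see what the two-pattern query above asks for, here is a plain-Python sketch of the same join over hand-written tuples. It is only a conceptual illustration: the triples are placeholders rather than data from the real graph, `names_linked_to` is a hypothetical helper, and rdflib's actual query engine works differently internally.

```python
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hand-written placeholder triples, not taken from the real graph
triples = [
    ("http://example.org/jane", FOAF_NAME, "Jane Doe"),
    ("http://example.org/jane", "http://example.org/place_of_birth", "Exampletown"),
    ("http://example.org/john", FOAF_NAME, "John Roe"),
    ("http://example.org/john", "http://example.org/place_of_birth", "Otherville"),
]

def names_linked_to(triples, value):
    """Names of all subjects that have `value` as the object of any predicate."""
    # Pattern 1: ?subject ?predicate "value"
    subjects = {s for s, _, o in triples if o == value}
    # Pattern 2: ?subject foaf:name ?object, joined on ?subject
    return [o for s, p, o in triples if s in subjects and p == FOAF_NAME]

matches = names_linked_to(triples, "Exampletown")
print(matches)
```

The shared `?subject` variable is what ties the two patterns together: only subjects that satisfy the first pattern contribute a name in the second.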

Yes! We have successfully built this knowledge graph system!