# Task 1: Using RLTK to perform Entity Resolution (ER)

<sub>Content of this notebook was prepared by Basel Shbita (shbita@usc.edu) as part of the class <u>CSCI 563/INF 558: Building Knowledge Graphs</u> during Spring 2020 at University of Southern California (USC).</sub>

The Record Linkage ToolKit ([RLTK](https://github.com/usc-isi-i2/rltk)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can also find additional examples and use-cases in [RLTK's documentation](https://rltk.readthedocs.io/en/master/).

## Dataset analysis & RLTK components construction

First, you need to define what a single entry looks like for each type of record (one record class per dataset):

In [None]:
import rltk

# You can use this tokenizer in case you need to manipulate some data
tokenizer = rltk.CrfTokenizer()

# RLTK IMDB Record
class IMDBRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['url']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['name']

    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

# RLTK AFI Record
class AFIRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['url']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['title']

You can load your JSON Lines files into RLTK like this:

In [None]:
imdb_file = 'imdb.jl'
afi_file  = 'afi.jl'

# load Datasets
ds_imdb = rltk.Dataset(reader=rltk.JsonLinesReader(imdb_file), record_class=IMDBRecord, adapter=rltk.MemoryKeyValueAdapter())
ds_afi  = rltk.Dataset(reader=rltk.JsonLinesReader(afi_file),  record_class=AFIRecord,  adapter=rltk.MemoryKeyValueAdapter())
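
For reference, a JSON Lines file stores one JSON object per line. The sketch below parses such content with the standard library only. The two sample lines are made up for illustration (only the `url` and `name` keys are taken from the IMDB record class above), and `read_json_lines` is a hypothetical helper, not the way RLTK reads the file internally.

```python
import io
import json

# Hypothetical sample content; real files will likely carry more fields per line.
sample_jl = io.StringIO(
    '{"url": "https://example.org/title/1", "name": "First Movie"}\n'
    '{"url": "https://example.org/title/2", "name": "Second Movie"}\n'
)

def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

rows = list(read_json_lines(sample_jl))
for row in rows:
    print(row["url"], "->", row["name"])
```

Each parsed dict plays the role of the `raw_object` that a record class receives.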

And we can inspect a few entries:

In [None]:
# print some entries
print(ds_imdb.generate_dataframe().head(5))
print(ds_afi.generate_dataframe().head(5))

## Field (Attribute) Similarity

Here are two example functions for field (attribute) similarity:

In [None]:
def name_string_similarity_1(r_imdb, r_afi):
    ''' Example dummy similarity function: Jaro-Winkler on the first 4 characters of each name '''
    s1 = r_imdb.name_string[:4]
    s2 = r_afi.name_string[:4]
    
    return rltk.jaro_winkler_similarity(s1, s2)
    
def name_string_similarity_2(r_imdb, r_afi):
    ''' Example dummy similarity function: exact match on the first 2 characters of each name '''
    s1 = r_imdb.name_string[:2]
    s2 = r_afi.name_string[:2]
    
    if s1 == s2:
        return 1
    
    return 0
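
The two functions above only look at a fixed-length prefix, which is deliberately simplistic. One possible alternative is a token-based measure. The sketch below computes a Jaccard index over whitespace tokens using plain Python. `simple_tokens` and `token_jaccard_similarity` are hypothetical helpers written for this illustration (they are not RLTK functions), and whitespace splitting is an assumption that may be too crude for real titles.

```python
def simple_tokens(text):
    """Lowercase a string and split it on whitespace into a set of tokens."""
    return set(text.lower().split())

def token_jaccard_similarity(name_a, name_b):
    """Jaccard index of the two token sets: |A & B| / |A | B|, a value in [0, 1]."""
    tokens_a = simple_tokens(name_a)
    tokens_b = simple_tokens(name_b)
    if not tokens_a and not tokens_b:
        # Design choice in this sketch: two empty names give no evidence of a match.
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

print(token_jaccard_similarity("The Godfather", "the godfather"))          # 1.0
print(token_jaccard_similarity("The Godfather", "The Godfather Part II"))  # 0.5
```

In the notebook, a function like this could be called with the `name_string` of each record, in the same way as the two example functions.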

## Entity Linking

Here's how you can combine multiple similarity functions into a single weighted scoring function:

In [None]:
# threshold above which we are confident that two records match
MY_TRESH = 0.65 # this number is just an example, you need to tune it

# entity linkage scoring function
def rule_based_method(r_imdb, r_afi):
    score_1 = name_string_similarity_1(r_imdb, r_afi)
    score_2 = name_string_similarity_2(r_imdb, r_afi)
    
    total = 0.7 * score_1 + 0.3 * score_2
    
    # return two values: a boolean (match or not) and a float (the confidence score)
    return total > MY_TRESH, total
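
To see how the weights and the threshold interact, here is a small worked example of the same arithmetic. `weighted_score` is a hypothetical helper and the score pairs are invented inputs. Only the weights (0.7, 0.3) and the threshold (0.65) are taken from the cell above.

```python
def weighted_score(scores, weights):
    """Combine per-field similarity scores into one weighted sum."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))

THRESHOLD = 0.65      # same example threshold as above
WEIGHTS = [0.7, 0.3]  # same example weights as above

# Hypothetical (score_1, score_2) pairs
agree = weighted_score([0.9, 1], WEIGHTS)     # 0.7 * 0.9 + 0.3 * 1 = 0.93
disagree = weighted_score([0.9, 0], WEIGHTS)  # 0.7 * 0.9 + 0.3 * 0 = 0.63

print(agree > THRESHOLD, round(agree, 2))
print(disagree > THRESHOLD, round(disagree, 2))
```

With these example values, a strong Jaro-Winkler score alone (0.63) does not clear the threshold, so the second function acts as a tie-breaker. Keeping the weights summing to 1 keeps the total in the range [0, 1] as long as each individual score is in that range.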

You can run your predictions like this:

In [None]:
# test on a single entry from imdb
r_imdb = ds_imdb.get_record("https://www.imdb.com/title/tt0068646/")
r_imdb.raw_object

In [None]:
# test this record with AFI records
for r_afi in ds_afi:
    # get result and confidence
    result, confidence = rule_based_method(r_imdb, r_afi)
    #print(result, confidence)
    if result:
        print(f'found a match (with confidence of {confidence}) based on my methods. It is: {r_afi.raw_object}')
        break
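
The loop above stops at the first record that passes the threshold, which is not necessarily the best one. A possible variation is to scan all candidates and keep the one with the highest confidence. The sketch below shows that idea with toy stand-ins so it runs on its own: `best_match` and `toy_scorer` are hypothetical names, and plain strings are used in place of RLTK records.

```python
def best_match(query, candidates, scorer):
    """Return (candidate, confidence) for the highest-scoring match, or (None, 0.0)."""
    best, best_confidence = None, 0.0
    for candidate in candidates:
        is_match, confidence = scorer(query, candidate)
        if is_match and confidence > best_confidence:
            best, best_confidence = candidate, confidence
    return best, best_confidence

def toy_scorer(a, b):
    """Toy scorer: fraction of positions where the two strings have equal characters."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    confidence = same / max(len(a), len(b), 1)
    return confidence > 0.65, confidence

# "godfathers" already passes the threshold (0.9), but "godfather" scores higher (1.0).
candidates = ["godfathers", "godfather", "goodfellas"]
match, confidence = best_match("godfather", candidates, toy_scorer)
print(match, confidence)
```

In the notebook, the equivalent call would be along the lines of `best_match(r_imdb, ds_afi, rule_based_method)`, since `rule_based_method` returns the same `(boolean, float)` pair.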

# Task 2: Using RDFLib for Knowledge Representation

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information as graphs. RDFLib aims to be a Pythonic RDF API: a `Graph` is a Python collection of RDF subject, predicate, object triples.

This notebook introduces a few simple examples. You can also find additional information in the [official documentation](https://rdflib.readthedocs.io/en/stable/).

In [None]:
from rdflib import Graph, URIRef, Literal, XSD, Namespace, RDF

Let's define some namespaces:

In [None]:
FOAF = Namespace('http://xmlns.com/foaf/0.1/')
MYNS = Namespace('http://inf558.org/myfakenamespace#')

We can create a graph:

In [None]:
my_kg = Graph()
my_kg.bind('myns', MYNS)
my_kg.bind('foaf', FOAF)

Define a URI, then add a simple triple to the graph:

In [None]:
node_uri = URIRef(MYNS['inf558_production_company'])
my_kg.add((node_uri, RDF.type, MYNS['productionCompany']))

Add an additional triple (which describes the same subject, `node_uri`):

In [None]:
my_kg.add((node_uri, FOAF['name'], Literal('INF 558 Production Company')))

And now let's dump our graph triples into a Turtle (`ttl`) file:

In [None]:
my_kg.serialize('sample_graph.ttl', format="turtle")