# Task 1: Using RLTK to perform Entity Resolution (ER)

<sub>Content of this notebook was prepared by Basel Shbita, and modified by Avijit Thawani (thawani@usc.edu) as part of the class <u>DSCI 558: Building Knowledge Graphs</u> at University of Southern California (USC).</sub>

The Record Linkage ToolKit ([RLTK](https://github.com/usc-isi-i2/rltk)) is a general-purpose open-source record linkage platform that allows users to build powerful Python programs that link records referring to the same underlying entity.

This notebook introduces some applied examples using RLTK. You can also find additional examples and use-cases in [RLTK's documentation](https://rltk.readthedocs.io/en/master/).

## Dataset analysis & RLTK components construction

In [1]:
%pip install rltk

Note: you may need to restart the kernel to use updated packages.


### Task 1-1. Construct RLTK Datasets

First, you need to define what a single entry looks like for each type of record (one record class per dataset).

In [2]:
import rltk
import csv

# You can use this tokenizer in case you need to manipulate some data
tokenizer = rltk.tokenizer.crf_tokenizer.crf_tokenizer.CrfTokenizer()

In [3]:
'''
Feel free to add more columns here for use in record linkage.
COLS ADDED:
- three authors (one in each column, can use the count maybe to block)
- ISBN (fun fact, there are meanings to each group of 4 numbers in the ISBN13, use some grouping of that to block)
- publisher (we can try string matching)
'''

class GoodRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def ISBN13(self):
        # Defined by student
        return self.raw_object['ISBN13']
    
    @rltk.cached_property
    def first_author(self):
        # Defined by student
        if self.raw_object['FirstAuthor'] == '':
            return " "
        else:
            return self.raw_object['FirstAuthor']

    @rltk.cached_property
    def second_author(self):
        # Defined by student
        if self.raw_object['SecondAuthor'] == '':
            return " "
        else:
            return self.raw_object['SecondAuthor']

    @rltk.cached_property
    def third_author(self):
        # Defined by student
        if self.raw_object['ThirdAuthor'] == '':
            return " "
        else:
            return self.raw_object['ThirdAuthor']
    
    @rltk.cached_property
    def publisher(self):
        # Defined by student
        return self.raw_object['Publisher']

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['Title']

    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))
    
    @rltk.cached_property
    def num_authors(self):
        # Defined by student
        firstboi = self.raw_object['FirstAuthor']
        secondboi = self.raw_object['SecondAuthor']
        thirdboi = self.raw_object['ThirdAuthor']
        count = 0
        if firstboi != ' ':
            count+=1
        if secondboi != ' ':
            count+=1
        if thirdboi != ' ':
            count+=1
        return str(count)


class NobleRecord(rltk.Record):
    def __init__(self, raw_object):
        super().__init__(raw_object)
        self.name = ''

    @rltk.cached_property
    def id(self):
        return self.raw_object['ID']

    @rltk.cached_property
    def ISBN13(self):
        # Defined by student
        return self.raw_object['ISBN13']

    @rltk.cached_property
    def first_author(self):
        # Defined by student
        if self.raw_object['Author1'] == '':
            return " "
        else:
            return self.raw_object['Author1']

    @rltk.cached_property
    def second_author(self):
        # Defined by student
        if self.raw_object['Author2'] == '':
            return " "
        else:
            return self.raw_object['Author2']

    @rltk.cached_property
    def third_author(self):
        # Defined by student
        if self.raw_object['Author3'] == '':
            return " "
        else:
            return self.raw_object['Author3']
    
    @rltk.cached_property
    def publisher(self):
        # Defined by student
        return self.raw_object['Publisher']
    
    @rltk.cached_property
    def num_authors(self):
        # Defined by student
        firstboi = self.raw_object['Author1']
        secondboi = self.raw_object['Author2']
        thirdboi = self.raw_object['Author3']
        count = 0
        if firstboi != ' ':
            count+=1
        if secondboi != ' ':
            count+=1
        if thirdboi != ' ':
            count+=1
        return str(count)

    @rltk.cached_property
    def name_string(self):
        return self.raw_object['Title']
    
    @rltk.cached_property
    def name_tokens(self):
        return set(tokenizer.tokenize(self.name_string))

In [4]:
dir_ = ''
good_file = dir_ + 'goodreads.csv'
noble_file = dir_ + 'barnes_and_nobles.csv'

ds1 = rltk.Dataset(rltk.CSVReader(good_file),record_class=GoodRecord)
ds2 = rltk.Dataset(rltk.CSVReader(noble_file),record_class=NobleRecord)

The cell above loads the CSV files into RLTK datasets using `rltk.CSVReader`.

And we can inspect a few entries:

In [5]:
# print some entries
print(ds1.generate_dataframe().head(5))

  id         ISBN13     first_author   second_author     third_author  \
0  0  9780340728567    Alex Ferguson                                    
1  1  9780844627106  Boris Pasternak                                    
2  2  9780712679480  Betty Boothroyd                                    
3  3  9780725100148           Caddie                                    
4  4  9780340014684   Rudolf Nureyev  Richard Avedon  Alexander Bland   

               publisher                                 name_string  \
0     Hodder & Stoughton          Managing My Life: My Autobiography   
1  Peter Smith Publisher     I Remember: Sketch for an Autobiography   
2        Random House UK              Betty Boothroyd: Autobiography   
3              Sun Books  Caddie, A Sydney Barmaid: An Autobiography   
4        E P Dutton & Co     Nureyev: An Autobiography With Pictures   

                                         name_tokens num_authors  
0             {Autobiography, :, My, Managing, Life}         

In [6]:
# Printing some more entries
print(ds2.generate_dataframe().head(5))

  id         ISBN13           first_author second_author  third_author  \
0  0  9780984504176   Laura Ingalls Wilder                               
1  1  9780062376336             Chris Kyle  Scott McEwen  Jim DeFelice   
2  2  9780345350688              Malcolm X                               
3  3  9781556520747          Assata Shakur                               
4  4  9780876120798  Paramahansa Yogananda                               

                               publisher num_authors  \
0  South Dakota State Historical Society           1   
1               HarperCollins Publishers           3   
2          Random House Publishing Group           1   
3     Chicago Review Press, Incorporated           1   
4            Self-Realization Fellowship           1   

                                         name_string  \
0          Pioneer Girl: The Annotated Autobiography   
1  American Sniper (Movie Tie-in Edition): The Au...   
2                     The Autobiography of Malcolm

### Task 1-2. Blocking

First, we'll load the dev set to evaluate both blocking (Task 1-2) and entity linking (Task 1-3).

In [7]:
dev_set_file = dir_ + 'dev.csv'
dev = []
with open(dev_set_file, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            dev.append(row)
    print(f'Column names are: {", ".join(columns)}')
    print(f'Processed {len(dev)} lines.')

gt = rltk.GroundTruth()
for row in dev:    
    r1 = ds1.get_record(row[0])
    r2  = ds2.get_record(row[1])
    if row[-1] == '1':
        gt.add_positive(r1.raw_object['ID'], r2.raw_object['ID'])
    else:
        gt.add_negative(r1.raw_object['ID'], r2.raw_object['ID'])

trial = rltk.Trial(gt)

Column names are: goodreads.ID, barnes_and_nobles.ID, label
Processed 297 lines.
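
The manual header handling above can also be written with `csv.DictReader`, which treats the first row as column names and skips blank lines. This is a minimal sketch only: the in-memory sample below is a made-up stand-in for `dev.csv`, and `labelled_pairs` is a hypothetical name.

```python
import csv
import io

# Made-up stand-in for dev.csv; the real file has the same three columns.
sample = io.StringIO(
    "goodreads.ID,barnes_and_nobles.ID,label\n"
    "0,12,1\n"
    "\n"
    "3,40,0\n"
    "7,7,1\n"
)

reader = csv.DictReader(sample)  # first row becomes the field names

# Each row is a dict keyed by column name; the label is turned into a bool.
labelled_pairs = [
    (row["goodreads.ID"], row["barnes_and_nobles.ID"], row["label"] == "1")
    for row in reader
]

print(reader.fieldnames)
print(labelled_pairs)
```

Accessing columns by name avoids relying on `row[0]`, `row[1]` and `row[-1]` staying in a fixed order.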


Then, you can build your own blocking techniques and evaluate them.

Hint:

- What is the total number of pairs without blocking?
- What is the number of pairs with blocking?
- After blocking, how many "correct" (matched) pairs from the dev set are still present?
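
To make the idea behind hash blocking concrete, here is a plain-Python sketch that does not use RLTK. It groups records by a key (the first three characters of the title) and only pairs records that share a key. The records and the `hash_block` helper are made up for illustration and are not RLTK's implementation.

```python
from collections import defaultdict

def hash_block(records, key_fn):
    """Group record ids by the blocking key computed from each title."""
    blocks = defaultdict(list)
    for rec_id, title in records:
        blocks[key_fn(title)].append(rec_id)
    return blocks

# Made-up (id, title) records, for illustration only.
left = [("0", "Managing My Life"), ("1", "I Remember"), ("2", "Betty Boothroyd")]
right = [
    ("0", "Managing My Life: My Autobiography"),
    ("1", "Betty Boothroyd: Autobiography"),
    ("2", "The Autobiography of Malcolm X"),
]

def key(title):
    return title[:3]

left_blocks = hash_block(left, key)
right_blocks = hash_block(right, key)

# Candidate pairs are the cross product inside each block shared by both sides.
candidates = sorted(
    (l, r)
    for k in left_blocks.keys() & right_blocks.keys()
    for l in left_blocks[k]
    for r in right_blocks[k]
)

print("pairs without blocking:", len(left) * len(right))
print("pairs with blocking:", len(candidates))
```

Any true match whose titles differ in the first three characters (for example a leading "The") is lost at this stage, which is why pair completeness has to be checked alongside the pair count.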


In [8]:
# name_string first few characters blocking
bg1 = rltk.HashBlockGenerator()
block1 = bg1.generate(
    bg1.block(ds1, function_=lambda r: r.name_string[:3]),
    bg1.block(ds2, function_=lambda r: r.name_string[:3])
)
# pairs = rltk.get_record_pairs(ds1, ds2, block=block1)

In [9]:
# number of authors blocking
# bg2 = rltk.HashBlockGenerator()
# block2 = bg2.generate(
#     bg2.block(ds1, property_='num_authors', base_on=block1),
#     bg2.block(ds2, property_='num_authors', base_on=block1)
# )
# pairs = rltk.get_record_pairs(ds1, ds2, block=block2)

In [10]:
# for r1, r2 in pairs:
#     print(r1.id, r1.name_string, '\t', r2.id, r2.name_string)

# for idx, b in enumerate(block1.key_set_adapter):
#     print(b)

In [11]:
# ISBN based blocking
bg3 = rltk.HashBlockGenerator()
block3 = bg3.generate(
    bg3.block(ds1, function_=lambda r: r.ISBN13[:3], base_on=block1),
    bg3.block(ds2, function_=lambda r: r.ISBN13[:3], base_on=block1)
)

In [12]:
# Further ISBN Blocking
# bg4 = rltk.HashBlockGenerator()
# block4 = bg4.generate(
#     bg4.block(ds1, function_=lambda r: r.ISBN13[6:9], base_on=block3),
#     bg4.block(ds2, function_=lambda r: r.ISBN13[6:9], base_on=block3)
# )

### Metrics for 1-2
- First, how do we get the total number of potential pairs without blocking?
- Should be straightforward: if there are N entries in goodreads
    and M entries in barnes and nobles, then the total number of potential pairs
    is just N×M

In [13]:
total_possible_pairs = len(ds1.generate_dataframe()) * len(ds2.generate_dataframe())
print("Total possible pairs: ", total_possible_pairs)

Total possible pairs:  14681867


- Alrighty then. Next, what is the number of pairs _with_ blocking?
- From our previous steps we have the blocks, whose entries are
  grouped together. So we can count the candidate pairs by iterating over
  `rltk.get_record_pairs` with the last of our
  sequential blocks, i.e. `block3`

In [14]:
pairs = rltk.get_record_pairs(ds1, ds2, block=block3)
total_blocked_pairs = 0
for r1, r2 in pairs:
    total_blocked_pairs = total_blocked_pairs + 1
    # print(r1.id, r1.name_string, '\t', r2.id, r2.name_string)

# print('inside blocks:')
# for b, d, r in block3:
#     print(b, d, r)

print("Total blocked pairs: ", total_blocked_pairs)

Total blocked pairs:  411599


In [15]:
pairs = rltk.get_record_pairs(ds1, ds2, block=block3)
with open(dir_ + 'Devyan_Biswas_blocked.csv', mode='w') as file:
    writer = csv.writer(file)
    writer.writerow(['goodreads.ID','barnes_and_nobles.ID'])
    for gr, ban in pairs:
        writer.writerow([gr.id, ban.id])

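- So, here's our efficiency measure
- The ratio below is the number of pairs left after blocking over the total possible pairs of the whole dataset, i.e. the fraction of comparisons we still have to make; the reduction ratio as usually defined is 1 minus this value (about 0.972 here)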

In [16]:
red_ratio = total_blocked_pairs / total_possible_pairs
print("Reduction Ratio: ", red_ratio)

Reduction Ratio:  0.028034513594217957


Woah, nice reduction ratio! But how does our blocking hold up when comparing to pair completeness or quality?
<br>
Let's test that on the dev.csv examples that we'd established earlier!

In [17]:
true_matches = 0
neg_matches = 0
for entry in dev:
    if entry[2] == '1':
        true_matches = true_matches + 1
    elif entry[2] == '0':
        neg_matches = neg_matches + 1


print("Positive True matches: ", true_matches)
print("Negative True matches: ", neg_matches)

Positive True matches:  67
Negative True matches:  230


In [18]:
pairs = rltk.get_record_pairs(ds1, ds2, block=block3)
positive_compared_matches = 0
negative_compared_matches = 0

for r1,r2 in pairs:
    for entry in dev:
        if entry[2] == '1' and r1.id == entry[0] and r2.id == entry[1]:
            positive_compared_matches = positive_compared_matches + 1
        elif entry[2] == '0' and r1.id == entry[0] and r2.id == entry[1]:
            negative_compared_matches = negative_compared_matches + 1

total_compared_pairs = positive_compared_matches + negative_compared_matches # needs to be the total number in DEV

print("Positive True matches compared: ", positive_compared_matches)
print("Negative True matches compared: ", negative_compared_matches)
print("Total pairs compared (and found from ground truth): ", total_compared_pairs)


Positive True matches compared:  52
Negative True matches compared:  39
Total pairs compared (and found from ground truth):  91
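
The nested loop above rescans all of `dev` for every candidate pair. One alternative is a dictionary keyed by the ID pair, so each candidate is checked with a single lookup. This is a sketch with made-up IDs: `dev_rows` and `candidate_ids` are hypothetical stand-ins for `dev` and for the IDs of the pairs yielded by `rltk.get_record_pairs`, and it assumes each ID pair appears at most once in the dev set.

```python
# Hypothetical stand-ins: rows are [goodreads_id, barnes_and_nobles_id, label].
dev_rows = [["0", "5", "1"], ["1", "9", "0"], ["2", "2", "1"], ["3", "8", "0"]]
candidate_ids = [("0", "5"), ("1", "9"), ("4", "4"), ("3", "1")]

# Map each labelled ID pair to its label for constant-time lookup.
labels = {(row[0], row[1]): row[2] for row in dev_rows}

positive_compared = 0
negative_compared = 0
for pair in candidate_ids:
    label = labels.get(pair)  # None when the pair is not in the dev set
    if label == "1":
        positive_compared += 1
    elif label == "0":
        negative_compared += 1

print(positive_compared, negative_compared)
```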


In [19]:
# Recall/ pair completeness measure: true pairs compared over total true pairs
recall = positive_compared_matches / true_matches
print("Recall: ", recall)

Recall:  0.7761194029850746


In [20]:
# Pair quality measure: number of true pairs compared over total number of pairs compared
prec = positive_compared_matches/total_compared_pairs #???
print("Precision: ", prec)

Precision:  0.5714285714285714


As you can see, our recall is ~0.78, and our precision is ~0.57
<br>
Note that we're measuring these with respect to our ground truths in dev.csv
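
The blocking measures can be collected in one place. The sketch below plugs in the counts printed in this notebook. It reports both the fraction of pairs retained (what the notebook prints as "Reduction Ratio") and the reduction ratio under the common convention of 1 minus that fraction; the function and key names are my own.

```python
def blocking_metrics(blocked_pairs, total_pairs, true_in_blocks, true_total, labelled_in_blocks):
    """Summarise blocking quality from raw counts."""
    retained = blocked_pairs / total_pairs
    return {
        "retained_fraction": retained,                         # pairs still to compare
        "reduction_ratio": 1 - retained,                       # pairs pruned by blocking
        "pair_completeness": true_in_blocks / true_total,      # recall of true matches
        "pair_quality": true_in_blocks / labelled_in_blocks,   # precision on labelled pairs
    }

# Counts taken from the outputs above.
metrics = blocking_metrics(
    blocked_pairs=411599,
    total_pairs=14681867,
    true_in_blocks=52,
    true_total=67,
    labelled_in_blocks=91,
)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Pair quality here is computed only over the dev pairs that carry a label, so it says nothing about the roughly 411k unlabelled candidates.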

### Task 1-3. Entity Linking

Here are 2 example functions for field (attribute) similarity:

In [21]:
def basic_jaro_wrinkler(r1, r2):
    '''Jaro-Winkler similarity on the first five characters of the titles'''
    s1 = r1.name_string[:5]
    s2 = r2.name_string[:5]
    
    return rltk.jaro_winkler_similarity(s1, s2)

def basic_jaccard(r1, r2):
    '''Really basic jaccard to the title tokens'''
    s1 = r1.name_tokens
    s2 = r2.name_tokens
    return rltk.jaccard_index_similarity(s1,s2)

# def isbn13_sim(r1, r2):
#     '''
#     Basically, check their isbns
#     '''
#     print(r1.ISBN13, '\t', r2.ISBN13)
#     if r1.ISBN13 == r2.ISBN13:
#         print("HERE")
#         return 1
#     return 0


def author_similarity(r1,r2, print_flag=False):
    '''
    I had initially done the average of all the max matches.
    I hadn't considered the situation in which there's a subset of matching strings.
    So I could turn them into tokens and then do a subset measure, i.e. Jaccard?
    OR I could just take the max similarity score overall, which may be good enough...
    '''

    r1_author_list = []
    r1_author_list.append(r1.first_author)
    r1_author_list.append(r1.second_author)
    r1_author_list.append(r1.third_author)
    
    r2_author_list = []
    r2_author_list.append(r2.first_author)
    r2_author_list.append(r2.second_author)
    r2_author_list.append(r2.third_author)

    # max_sim_score = float('-inf')
    # for r1a in r1_author_list:
    #     for r2a in r2_author_list:
    #         curr_sim = rltk.jaro_winkler_similarity(r1a, r2a)
    #         if curr_sim > max_sim_score:
    #             max_sim_score = curr_sim
    
    # return max_sim_score

    cum_sum = 0
    for r1a in r1_author_list:
        max_sum = float('-inf')
        for r2a in r2_author_list:
            if  (r1a == ' ' or r2a == ' '):
                continue
            # print("R1 author: ", r1a)  
            # print("R2 author: ", r2a)  
            # print("---")
            curr_sum = rltk.jaro_winkler_similarity(r1a, r2a)
            # print(curr_sum)
            if curr_sum > max_sum:
                # print("HERE")
                max_sum = curr_sum
        if(max_sum != float('-inf')):
            cum_sum = cum_sum + max_sum
        # print("===")
    
    # print("CUM SUM: ", cum_sum)
    # print(r1.num_authors, r2.num_authors)
    # print("MAX AUTH: ", max(int(r1.num_authors), int(r2.num_authors)))
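    # Guard against division by zero when neither record lists an author
    max_authors = max(int(r1.num_authors), int(r2.num_authors))
    return cum_sum / max_authors if max_authors else 0.0
    # return cum_sum/3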
    

def name_string_similarity_test(r1, r2):
    ''' Example dummy similarity function '''
    s1 = r1.name_string
    s2 = r2.name_string
    
    if s1 == s2:
        return 1
    
    return 0
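
To make the token-based measure concrete, here is a from-scratch Jaccard index in plain Python. It is a sketch, not RLTK's implementation: the whitespace split stands in for the CRF tokenizer, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Whitespace splitting is a simplification of the tokenizer used in the notebook.
t1 = set("Managing My Life: My Autobiography".split())
t2 = set("Managing My Life".split())

print(sorted(t1))
print(sorted(t2))
print(jaccard(t1, t2))
```

With a whitespace split, "Life:" and "Life" count as different tokens, so only two of the five distinct tokens overlap. A tokenizer that separates punctuation avoids that penalty.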

In [22]:
rltk.jaro_winkler_similarity(' ', ' Agatha Christie ')

0.6862745098039215
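
The score above shows why blank author slots have to be skipped: a blank compared with a real name still scores about 0.69, which would inflate an average. The best-match averaging used in `author_similarity` can be sketched without RLTK as follows. `difflib.SequenceMatcher` is swapped in for Jaro-Winkler only so the snippet runs with the standard library; its scores differ from RLTK's, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher

def string_sim(a, b):
    # Stand-in for rltk.jaro_winkler_similarity; NOT the same measure.
    return SequenceMatcher(None, a, b).ratio()

def best_match_author_score(authors1, authors2):
    """Sum, over the non-blank authors of record 1, of the best match in record 2,
    divided by the larger non-blank author count."""
    a1 = [a for a in authors1 if a.strip()]
    a2 = [a for a in authors2 if a.strip()]
    if not a1 or not a2:
        return 0.0  # nothing to compare
    total = sum(max(string_sim(x, y) for y in a2) for x in a1)
    return total / max(len(a1), len(a2))

one_vs_three = best_match_author_score(
    ["Chris Kyle", " ", " "],
    ["Chris Kyle", "Scott McEwen", "Jim DeFelice"],
)
same_single = best_match_author_score(["Malcolm X", " ", " "], ["Malcolm X", " ", " "])

print(one_vs_three, same_single)
```

Dividing by the larger author count penalises records where one source lists fewer authors than the other, even when every listed author matches exactly.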

Here's how you can combine multiple similarity functions into a single weighted scoring function:

In [23]:
# threshold value to determine if we are confident the records match
MY_TRESH = 0.83 # this number is just an example, you need to change it

# entity linkage scoring function
def rule_based_method(r1, r2, print_flag=False):
    score_1 = basic_jaro_wrinkler(r1, r2)
    score_2 = basic_jaccard(r1, r2)
    score_3 = author_similarity(r1, r2, print_flag)

    # My current plan: conditional weights IF the scores are low/high enough on name similarity
    # 

    total = (0.20 * score_1) + (0.30 * score_2) + (0.50 * score_3)
    # print("SCORES: ", score_1, '\t', score_2, '\t', score_3, "\t Total: ", total)

    # return two values: boolean if they match or not, float to determine confidence
    return total > MY_TRESH, total
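
It can help to check what a given set of weights and threshold implies before running the trial. The sketch below uses the weights and threshold from the cell above; the constant and function names are my own. It works out the highest total reachable with a zero author score and the author score needed when both title scores are perfect.

```python
WEIGHTS = (0.20, 0.30, 0.50)  # title prefix, title tokens, authors (from the cell above)
THRESHOLD = 0.83

def weighted_score(scores, weights=WEIGHTS):
    """Weighted sum of the individual similarity scores."""
    return sum(w * s for w, s in zip(weights, scores))

# Best possible total when both title scores are perfect and the author score is 0.
max_without_authors = weighted_score((1.0, 1.0, 0.0))

# Author score that must be exceeded to pass the threshold even with perfect title scores.
min_author_score = (THRESHOLD - max_without_authors) / WEIGHTS[2]

print(round(max_without_authors, 2), round(min_author_score, 2))
```

Under these settings a pair cannot match on titles alone, and the author score has to be above roughly 0.66 in the best case, so the author comparison dominates the decision.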

Let's run some candidates using the ground truth:

In [24]:
trial = rltk.Trial(gt)
candidate_pairs = rltk.get_record_pairs(ds1, ds2, ground_truth=gt, block=block3)
for r1, r2 in candidate_pairs:
    result, confidence = rule_based_method(r1, r2, True)
    # if r1.id == '2':
    #     print(r1.name_string, "\t", r2.name_string)
    #     print(r1.first_author, "\t", r1.second_author, "\t", r1.third_author)
    #     print(r2.first_author, "\t", r2.second_author, "\t", r2.third_author)
    # print(confidence)
    # print("=======")
    trial.add_result(r1, r2, result, confidence)

Now let's evaluate our trial results:

In [25]:
trial.evaluate()
print('Trial statistics based on Ground-Truth from development set data:')
print(f'tp: {trial.true_positives:.06f} [{len(trial.true_positives_list)}]')
print(f'fp: {trial.false_positives:.06f} [{len(trial.false_positives_list)}]')
print(f'tn: {trial.true_negatives:.06f} [{len(trial.true_negatives_list)}]')
print(f'fn: {trial.false_negatives:.06f} [{len(trial.false_negatives_list)}]')

Trial statistics based on Ground-Truth from development set data:
tp: 0.788462 [41]
fp: 0.307692 [12]
tn: 0.692308 [27]
fn: 0.211538 [11]


In [26]:
trial.f_measure

0.780952380952381
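
The F-measure above can be reproduced from the bracketed counts in the trial output. The decimals printed before the brackets are consistent with rates rather than counts (for example 41/52 ≈ 0.788 for `tp`). A short check, using only those four counts:

```python
tp, fp, tn, fn = 41, 12, 27, 11  # bracketed counts from the trial output above

precision = tp / (tp + fp)  # share of predicted matches that are correct
recall = tp / (tp + fn)     # share of true matches that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```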

### Save Test predictions
You will be evaluated on dev and test predictions, over a hidden ground truth.

In [27]:
test_set_file = dir_ + 'test.csv'
test = []
with open(test_set_file, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            test.append(row)
    print(f'Column names are: {", ".join(columns)}')
    print(f'Processed {len(test)} lines.')

Column names are: goodreads.ID, barnes_and_nobles.ID
Processed 90 lines.


In [28]:
predictions = []
# print(len(ds1.generate_dataframe()))
# print(len(ds2.generate_dataframe()))
# print("----------")
for id1, id2 in test:
    # print(id1, '\t', id2)
    # print("=======")
    # print(r1.name_string, '\t', r2.name_string)
    r1 = ds1.get_record(id1)
    r2  = ds2.get_record(id2)
    result, confidence = rule_based_method(r1, r2, print_flag=True)
        # print("R1: ", r1.name_string, " \t ", r1.first_author, " , ", r1.second_author, " , ", r1.third_author)
        # print("R2: ", r2.name_string, " \t ", r2.first_author, " , ", r2.second_author, " , ", r2.third_author)
        # print(result, " , ", confidence)
        # print("=======")

    predictions.append((r1.id, r2.id, result, confidence))

In [29]:
len(predictions), len(ds1.generate_dataframe()), len(ds2.generate_dataframe())

(90, 3967, 3701)

In [30]:
with open(dir_ + 'Devyan_Biswas_predictions.csv', mode='w') as file:
    writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(['goodreads.ID','barnes_and_nobles.ID', 'prediction', 'confidence'])
    for row in predictions:
        writer.writerow(row)

In [31]:
with open(dir_ + 'Devyan_Biswas_valid_predictions.csv', mode='w') as file:
    writer = csv.writer(file)
    writer.writerow(['goodreads.ID','barnes_and_nobles.ID'])
    for row in predictions:
        if row[2]:
            writer.writerow([row[0], row[1]])

# Task 2: Using RDFLib for Knowledge Representation

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information as graphs. RDFLib aims to be a Pythonic RDF API: a Graph is a Python collection of RDF subject, predicate, object triples.

This notebook introduces simple examples. You can also find additional information in the [official documentation](https://rdflib.readthedocs.io/en/stable/).
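
Before touching RDFLib, the "graph as a collection of triples" idea can be shown with plain Python. This is a stand-in, not RDFLib's API: the graph is a set of tuples and `match` is a hypothetical helper in which `None` works as a wildcard. The namespace strings are the sample ones used in this notebook.

```python
# A graph as a plain set of (subject, predicate, object) tuples.
MYNS = "http://dsci558.org/myfakenamespace#"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

graph = {
    (MYNS + "dsci_558", RDF_TYPE, MYNS + "course"),
    (MYNS + "dsci_558", FOAF + "name", "Building Knowledge Graphs"),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

print(match(graph, p=FOAF + "name"))
print(len(match(graph, s=MYNS + "dsci_558")))
```

Because the graph is a set, adding the same triple twice has no effect, which is also how an RDF graph behaves.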

In [32]:
! pip install rdflib



In [33]:
from rdflib import Graph, URIRef, Literal, XSD, Namespace, RDF, BNode

Let's define some namespaces:

In [34]:
# FOAF = Namespace('http://xmlns.com/foaf/0.1/')
# MYNS = Namespace('http://dsci558.org/myfakenamespace#')

We can create a graph:

In [35]:
# my_kg = Graph()
# my_kg.bind('myns', MYNS)
# my_kg.bind('foaf', FOAF)

Define a URI, then add a simple triple to the graph:

In [36]:
# node_uri = URIRef(MYNS['dsci_558'])
# my_kg.add((node_uri, RDF.type, MYNS['course']))

Add an additional triple (which describes the same subject, `node_uri`):

In [37]:
# my_kg.add((node_uri, FOAF['name'], Literal('Building Knowledge Graphs')))

And now let's dump our graph triples into some `ttl` file:

In [38]:
# my_kg.serialize(dir_ + 'sample_graph.ttl', format="turtle")

In [39]:
!head sample_graph.ttl

head: sample_graph.ttl: No such file or directory


## Ok, for 2.1, gotta read in the valid pairs' info first

In [40]:
# Reading in the valid pairs
preds = dir_ + 'Devyan_Biswas_valid_predictions.csv'
valid_pairs = []
with open(preds, encoding='utf-8', errors="replace") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if len(row) <= 1:
            continue
        if line_count == 0:
            columns = row
            line_count += 1
        else:
            valid_pairs.append(row)
# print(len(valid_pairs), valid_pairs)


Now, what we gotta do is get the rows from the goodreads and barnes and nobles csv's and get the appropriate columns, then
<br>
filter based on id's and pair em up in a data struct
<br>
tbh i got kinda lazy, so I just started using pandas dataframes to do this lol

In [41]:
import pandas as pd

In [42]:
gr_pd = pd.read_csv('./goodreads.csv', sep=',', usecols = ['ID', 'Title','Description','ISBN', 'ISBN13', 'PageCount', 'FirstAuthor', 'SecondAuthor', 'ThirdAuthor', 'Rating' , 'NumberofRatings', 'NumberofReviews', 'Publisher', 'PublishDate', 'Format', 'Language'], header=0) 
ban_pd = pd.read_csv('./barnes_and_nobles.csv', sep=',', usecols = ['ID', 'Title','Author1', 'Author2', 'Author3', 'Publisher' , 'ISBN13', 'PublicationDate', 'Productdimensions', 'Salesrank', 'Ratingscount', 'Ratingvalue', 'Paperbackprice', 'Hardcoverprice', 'Nookbookprice', 'Audiobookprice'], header=0) 

In [43]:
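# The CSV reader yields IDs as strings; .iloc needs integer positions.
# This assumes each record's ID equals its row position in the CSV.
valid_gr_ids = [int(item[0]) for item in valid_pairs]
valid_ban_ids = [int(item[1]) for item in valid_pairs]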

In [44]:
valid_gr_pd = gr_pd.iloc[valid_gr_ids]
valid_ban_pd = ban_pd.iloc[valid_ban_ids]

In [45]:
valid_gr_tuples = list(valid_gr_pd.itertuples(index=False))
valid_ban_tuples = list(valid_ban_pd.itertuples(index=False))

We're only gonna be adding one entry/node per pair, but add the attributes from BOTH datasets as per HW2 documentation

This part's gonna be very hardcode-y, but eh 

I am defining the URI as the concat of the two id's. Why? Because I'm pressed for time and was running into an issue with making
<br>
the title part of the URI
<br>
You'll note that overall I give preference to stats from the goodreads dataset. Personal choice tbh, nothing more
<br>
If I had more time, I think a good approach would be to even create two separate nodes from GR and BAN, do respective dataproc,
<br>
then just add a link to the top levels of each entry. But this was, somehow, simpler/more straightforward lell
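
The notes above do not say what went wrong with title-based URIs. One plausible cause, which I am assuming here, is that titles contain spaces and punctuation that are not valid inside a URI. A minimal sketch of one workaround: percent-encode the title and append the ID pair so the URI stays unique. `title_to_uri` is a hypothetical helper, and the base namespace is the sample one from this notebook.

```python
import re
from urllib.parse import quote

BASE = "http://dsci558.org/myfakenamespace#"  # sample namespace from this notebook

def title_to_uri(title, record_id):
    """Build a URI from a percent-encoded title slug plus an id suffix for uniqueness."""
    slug = re.sub(r"\s+", "_", title.strip())  # collapse whitespace into underscores
    slug = quote(slug, safe="_")               # percent-encode commas, colons, etc.
    return f"{BASE}{slug}_{record_id}"

uri = title_to_uri("Caddie, A Sydney Barmaid: An Autobiography", "3.17")
print(uri)
```

The resulting string could then be wrapped in `URIRef` in place of the plain ID concatenation, while keeping the ID suffix guards against two different books sharing a title.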

In [46]:
SCHEMA = Namespace("https://schema.org/")
FOAF = Namespace('http://xmlns.com/foaf/0.1/')
MYNS = Namespace('http://dsci558.org/myfakenamespace#')

In [47]:
my_kg = Graph()
my_kg.bind('myns', MYNS)
my_kg.bind('foaf', FOAF)
my_kg.bind('schema', SCHEMA)

In [48]:
import math
counter = 0

for gr, ban in zip(valid_gr_tuples, valid_ban_tuples):
    counter = counter + 1
    # print(gr)
    # print("=====")
    # print(ban)

    # Node URI definition
    node_id = str(gr[0])+ "." + str(ban[0])
    node_URI = URIRef(SCHEMA[node_id])
    my_kg.add((node_URI, RDF.type, SCHEMA.Book))

    # Node title definition
    if len(gr[1]) > len(ban[1]):
        book_name = gr[1]
    else:
        book_name = ban[1]

    my_kg.add((node_URI, SCHEMA.name, Literal(book_name)))

    # Adding the abstract
    abstract_text = gr[2]
    if isinstance(abstract_text, str) and len(abstract_text) > 1:
        my_kg.add((node_URI, SCHEMA.abstract, Literal(abstract_text)))
    else:
        my_kg.add((node_URI, SCHEMA.abstract, BNode()))
    
    # Adding the ISBN and ISBN13

    # pandas reads empty cells as NaN (a float), so test with pd.isna
    # rather than comparing against the string "nan"
    if pd.isna(gr[3]):
        my_kg.add((node_URI, SCHEMA.isbn, BNode()))
    else:
        my_kg.add((node_URI, SCHEMA.isbn, Literal(gr[3], datatype=SCHEMA.ISBN)))

    if gr[4] == ' ':
        if ban[6] ==' ':
            my_kg.add((node_URI, SCHEMA.isbn13, BNode()))
        else:
            my_kg.add((node_URI, SCHEMA.isbn13, Literal(ban[6], datatype=SCHEMA.ISBN)))
    else:
        my_kg.add((node_URI, SCHEMA.isbn13, Literal(gr[4], datatype=SCHEMA.ISBN)))


    # Adding the bag of authors:
    # NOTE: This is NOT A CORRECT IMPLEMENTATION I am just low on time
    # If more time, would do a similarity check for strings and then add the three auths only but this should be aight
    auths = list()
    if gr[6] != ' ':
        auths.append(gr[6])
    if gr[7] != ' ':
        auths.append(gr[7])
    if gr[8] != ' ':
        auths.append(gr[8])
    if ban[2] != ' ' and ban[2] not in auths:
        auths.append(ban[2])
    if ban[3] != ' ' and ban[3] not in auths:
        auths.append(ban[3])
    if ban[4] != ' ' and ban[4] not in auths:
        auths.append(ban[4])

    bag = BNode()
    my_kg.add((node_URI, SCHEMA.author, bag))
    my_kg.add((bag, RDF.type, RDF.Bag))
    for auth in auths:
        my_kg.add((bag, SCHEMA.author, Literal(auth, datatype=SCHEMA.author)))
    
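    # Adding ratings (not including rating count tho)
    # Prefer the goodreads rating, fall back to BAN, use a blank node if neither exists.
    # pd.isna catches the NaN floats pandas uses for empty cells.
    if not pd.isna(gr[9]):
        rating = str(gr[9])
    elif not pd.isna(ban[11]) and ban[11] != ' ':
        rating = str(ban[11])
    else:
        rating = None

    if rating is None:
        my_kg.add((node_URI, SCHEMA.rating, BNode()))
    else:
        my_kg.add((node_URI, SCHEMA.rating, Literal(rating, datatype=SCHEMA.contentRating)))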
    
    # Adding Publisher Info
    publisher = ""
    if gr[12] != ' ':
        publisher = publisher + gr[12] + ", "
    if ban[5] != ' ':
        publisher = publisher + ban[5] 

    my_kg.add((node_URI, SCHEMA.publisher, Literal(publisher, datatype=SCHEMA.publisher)))
    
    if gr[13] == ' ':
        if ban[7] == ' ':
            my_kg.add((node_URI, SCHEMA.datePublished, BNode()))
        else:
            my_kg.add((node_URI, SCHEMA.datePublished, Literal(ban[7], datatype=SCHEMA.datePublished)))
    else:
        my_kg.add((node_URI, SCHEMA.datePublished, Literal(gr[13], datatype=SCHEMA.datePublished)))

    # Lastly, I just want to have the prices from the BAN record
    # Positions 12-15 of the BAN tuple: paperback, hardcover, Nook and audiobook prices
    papercost = ban[12]
    hardcost = ban[13]
    nookcost = ban[14]
    audiocost = ban[15]
    
    price_bag = BNode()
    my_kg.add((node_URI, SCHEMA.price, price_bag))
    my_kg.add((price_bag, RDF.type, RDF.Bag))
    if papercost == ' ':
        my_kg.add((price_bag, SCHEMA.paper_price,BNode()))
    else:
        my_kg.add((price_bag, SCHEMA.paper_price, Literal(papercost, datatype=SCHEMA.price)))

    if hardcost == ' ':
        my_kg.add((price_bag, SCHEMA.hard_price,BNode()))
    else:
        my_kg.add((price_bag, SCHEMA.hard_price, Literal(hardcost, datatype=SCHEMA.price)))

    if nookcost == ' ':
        my_kg.add((price_bag, SCHEMA.nook_price,BNode()))
    else:
        my_kg.add((price_bag, SCHEMA.nook_price, Literal(nookcost, datatype=SCHEMA.price)))

    if audiocost == ' ':
        my_kg.add((price_bag, SCHEMA.audio_price,BNode()))
    else:
        my_kg.add((price_bag, SCHEMA.audio_price, Literal(audiocost, datatype=SCHEMA.price)))

    # break

print(counter)

57
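
The loop above tests for missing values in several different ways (`' '`, `"nan"`). A single helper could centralise that check. The sketch below assumes missing cells arrive as `None`, as float NaN (how pandas represents empty cells), or as blank strings; `is_missing` and `first_present` are hypothetical names.

```python
import math

def is_missing(value):
    """True for None, float NaN, or a blank / whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""

def first_present(*values):
    """Return the first non-missing value, or None if all are missing."""
    for v in values:
        if not is_missing(v):
            return v
    return None

# Prefer the first source, fall back to the second.
print(first_present(float("nan"), " ", "9780340728567"))
print(first_present(None, ""))
```

With a helper like this, each "goodreads first, then BAN, else blank node" branch reduces to one `first_present(gr[i], ban[j])` call followed by a single `None` check.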


In [49]:
my_kg.serialize(dir_ + 'Devyan_Biswas_model.ttl', format="turtle")

<Graph identifier=Na78dadcf44b7439ebdff37f1be1da30c (<class 'rdflib.graph.Graph'>)>