Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [1]:
NAME = "Shubhangi Singhal"

---

# In this problem we will use publications dataset and write some datalog rules to check data integrity.

## Notes about datalog rules using Clingo in Jupyter Notebook environment:
* Refer [Clingo with Jupyter Intro](Clingo_with_Jupyter_Intro.ipynb) before attempting this notebook.
* It's important to run following cell first for rest of notebook to work.
* It's always a good idea to run cells in order. In case you have run cells in jumbled order and would want to start fresh, restart kernel from menu above.
* All clingo cells start with `%%clingo`.
* You can run your clingo cell against some basic facts and rules from a file. `set_db_file $filepath` sets the file against which your clingo cells will run.
* Each clingo cell is independent of others. Rules defined in one cell won't be available in others.
* It's nice to be able to execute clingo from within your notebook but don't forget to practice from command line. `%%clingo` is just a thin wrapper over command line and it's best to know how to use the underlying tool.
* Upon assignment submission, we will run your code against different set of facts. Please don't hardcode answers and save yourself the embarassment.

## Notes about the publication datalog questions:
* In this question, consider a “dirty” dataset such as the file “publications” posted on the class page. In order to improve the data quality of the original dataset, a reasonable approach is to first apply OpenRefine and then import the “OR-cleaned” dataset into a database. The IC-checking capabilities of a database provide a powerful way to detect inconsistencies.
* For this problem, assume the cleaned dataset has been loaded into a table of a relational database as shown below. We are going to write datalog rules to check ICs of data from the table.
![Publication](Publication_Table.png "Publication")

### Good luck!!

In [2]:
%reload_ext lib.clingo.clingo_magic
import os
from lib.clingo.clingo_evaluate_util import clingo_evaluate

In [3]:
# All clingo cells will run against this file containing some base facts.
publications_base_facts_and_rules_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
%set_db_file $publications_base_facts_and_rules_file

## We will now write various rules to find "bad" (inconsistent) data

### [10 points] The key attribute ID should uniquely determine all other attributes.
* In DENIAL form we report all IC violations, i.e., where there are at least two rows having the same ID same, but some differing attributes somewhere.
     - You can assume that the table is available as a Datalog predicate of the form publication (I,A,Y,T,J,V,N,F,L,P). Recall that in Datalog, arbitrary (capitalized) names can be chosen as variables, since it is the argument position that determines which attribute/column is meant.
     - (FD-1) The publication identifier Pid is a key, i.e., if a row agrees with another row on the key attribute Pid, then it also agrees on all other attributes (i.e., the “two” rows are in fact one and the same). As usual, your rule should return the IC-violations.
* Here we report both the name of the attribute and the duplicate values.


In [4]:
%%clingo {"predicate" : "icv_pid_key", "predicate_arity" : 4, "result_var": "Icv_pid_key"}

% Following code snippet and it's result will be assigned to local variable Icv_pid_key

% Change following expressions.
% In DENIAL form we report all IC violations, i.e., where there are at least two rows
% having the same ID same, but some differing attributes somewhere.
% Here we report both the name of the attribute and the duplicate values.
icv_pid_key(I,author,A1,A2) :-    publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), A1<A2.
icv_pid_key(I,year,Y1,Y2) :-      publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), Y1<Y2.
icv_pid_key(I,title,T1,T2) :-     publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), T1<T2.
icv_pid_key(I,journal,J1,J2) :-   publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), J1<J2.
icv_pid_key(I,vol,V1,V2) :-       publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), V1<V2.
icv_pid_key(I,no,N1,N2) :-        publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), N1<N2.
icv_pid_key(I,fp,F1,F2) :-        publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), F1<F2.
icv_pid_key(I,lp,L1,L2) :-        publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), L1<L2.
icv_pid_key(I,publisher,P1,P2) :- publication(I, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I, A2, Y2, T2, J2, V2, N2, F2, L2, P2), P1<P2.



Saving output to local variable Icv_pid_key['result']
Saving code snippet to local variable Icv_pid_key['code']



#### [3 points] Test 1 for icv_pid_key.
Following test will compare output of your icv_pid_key rule against expected output.
You must have run all clingo cells above for test to pass.

In [5]:
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_pid_key(4407,author,doe,kummel) icv_pid_key(4407,year,1969,2015) icv_pid_key(4407,title,ammonoids,foobar) icv_pid_key(4407,vol,10,137) icv_pid_key(4407,no,1,3) icv_pid_key(4407,fp,10,476) icv_pid_key(4407,lp,1,null) icv_pid_key(4407,publisher,null,publisher2)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_pid_key['code'], 'icv_pid_key', 4, expected_output)


#### [7 points] Test 2 for icv_pid_key.
Following is what is called a hidden test case. This will always pass in student's version but will actually be evaluated after submission.
* We will first add some facts that are hidden from student.
* We will run descendant rule using these new facts and see if rule still behaving correctly.

In [6]:
# This cell will test the descendant with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] Every journal has a single publisher, i.e., Journal --> Publisher
- (FD-2) Every Journal has a single Publisher. Like (FD-1), this is a functional dependency. It is sometimes written as Journal —> Publisher.
- In denial mode, we report the journals which have multiple publishers, two publishers at a time.


In [7]:
%%clingo {"predicate" : "icv_journal_publisher", "predicate_arity" : 3, "result_var": "Icv_journal_publisher"}

% Following code snippet and it's result will be assigned to local variable Icv_journal_publisher

% Food for thought: How are null values for publishers handled by your rules?
% Do you notice different repair options, depending on whether or not a null value is reported?
icv_journal_publisher(J1,P1,P2) :- publication(I1, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I2, A2, Y2, T2, J2, V2, N2, F2, L2, P2), J1=J2, P1<P2. 

Saving output to local variable Icv_journal_publisher['result']
Saving code snippet to local variable Icv_journal_publisher['code']


### [3 points] Test 1 for icv_journal_publisher.
Following test will compare output of your icv_journal_publisher rule against expected output.
You must have run all clingo cells above for test to pass.

In [8]:
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_journal_publisher(bullmcz,null,publisher1) icv_journal_publisher(bullmcz,publisher1,publisher2) icv_journal_publisher(bullmcz,null,publisher2)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_journal_publisher['code'], 'icv_journal_publisher', 3, expected_output)

#### [7 points] Test 2 for icv_journal_publisher.
Following is what is called a hidden test case. This will always pass in student's version but will actually be evaluated after submission.
* We will first add some facts that are hidden from student.
* We will run sibling rule using these new facts and see if rule still behaving correctly.

In [9]:
# This cell will test the icv_journal_publisher with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] The last page Lp cannot be smaller than the first page Fp.
- (NC-1) The last page Lp cannot be smaller than the first page Fp. Note: This numerical constraint can be evaluated independently on each row.
- In DENIAL form, we report the ones for which last page is smaller than first.


In [10]:
%%clingo {"predicate" : "icv_firstpage_lastpage", "predicate_arity" : 3, "result_var": "Icv_firstpage_lastpage"}

% Following code snippet and it's result will be assigned to local variable Icv_firstpage_lastpage

% Change following expression.
icv_firstpage_lastpage(ID,F,L) :- publication(ID, A, Y, T, J, V, N, F, L, P), F>L.


Saving output to local variable Icv_firstpage_lastpage['result']
Saving code snippet to local variable Icv_firstpage_lastpage['code']



#### [3 points] Test 1 for icv_firstpage_lastpage.
Following test will compare output of your icv_firstpage_lastpage rule against expected output.
You must have run all clingo cells above for test to pass.

In [11]:
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_firstpage_lastpage(6755,91,9) icv_firstpage_lastpage(4407,10,1)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_firstpage_lastpage['code'], 'icv_firstpage_lastpage', 3, expected_output)



#### [7 points] Test 2 for icv_firstpage_lastpage.
Following is what is called a hidden test case. This will always pass in student's version but will actually be evaluated after submission.
* We will first add some facts that are hidden from student.
* We will run icv_person_has_parent rule using these new facts and see if rule still behaving correctly.

In [12]:
# This cell will test the icv_person_has_parent with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] Inclusion Dependency: Every cited publication in CITES also occurs in PUBLICATION.
- Now consider that an additional table cites(P1, P2) is given which records pairs of publication P1, P2, where P1 is citing P2. We are going to define the following IC in denial form.
![Cites](Cites_Table.png "Cites")
- In DENIAL form, we report those publications which are in CITES but not in PUBLICATION.

In [17]:
%%clingo {"predicate" : "icv_cited_publication", "predicate_arity" : 1, "result_var": "Icv_cited_publication"}

% Following code snippet and it's result will be assigned to local variable Icv_cited_publication

% Change following expression.
%(Inclusion Dependency): Every cited publication in CITES also occurs in PUBLICATION.
icv_cited_publication(P2) :- cites(P1, P2), not publication(I, A, Y, T, J, V, N, F, L, P2). 

#### [3 points] Test 1 for icv_cited_publication.
Following test will compare output of your icv_person_has_father_mother rule against expected output.
You must have run all clingo cells above for test to pass.

In [14]:
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_cited_publication(2020) icv_cited_publication(3799)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_cited_publication['code'], 'icv_cited_publication', 1, expected_output)

NameError: name 'Icv_cited_publication' is not defined


#### [7 points] Test 2 for icv_cited_publication.
Following is what is called a hidden test case. This will always pass in student's version but will actually be evaluated after submission.
* We will first add some facts that are hidden from student.
* We will run icv_person_has_father_mother rule using these new facts and see if rule still behaving correctly.

In [None]:
# This cell will test the icv_cited_publication with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] If P1 cites P2 then P2's year of publication cannot be greater than P1.

In [15]:
%%clingo {"predicate" : "icv_p1_greater_p2", "predicate_arity" : 4, "result_var": "Icv_p1_greater_p2"}

% Following code snippet and it's result will be assigned to local variable Icv_p1_greater_p2

% Change following expression.
icv_p1_greater_p2(P1,P2,Y1,Y2) :- publication(I1, A1, Y1, T1, J1, V1, N1, F1, L1, P1), publication(I2, A2, Y2, T2, J2, V2, N2, F2, L2, P2), cites(P1, P2), Y2>Y1. 

Saving output to local variable Icv_p1_greater_p2['result']
Saving code snippet to local variable Icv_p1_greater_p2['code']


#### [3 points] Test 1 for icv_p1_greater_p2.
Following test will compare output of your icv_p1_greater_p2 rule against expected output.
You must have run all clingo cells above for test to pass.

In [16]:
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_p1_greater_p2(2044,2580,1934,1962)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_p1_greater_p2['code'], 'icv_p1_greater_p2', 4, expected_output)

AssertionError: 

#### [7 points] Test 2 for icv_p1_greater_p2.
Following is what is called a hidden test case. This will always pass in student's version but will actually be evaluated after submission.
* We will first add some facts that are hidden from student.
* We will run icv_person_has_father_mother rule using these new facts and see if rule still behaving correctly.

In [None]:
# This cell will test the icv_p1_greater_p2 with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.
