# Checking complex ICs on a simple publications dataset

* In this problem, consider a “dirty” dataset such as the file “publications” posted in class. To improve the data quality of the original dataset, a reasonable approach is to first clean it with OpenRefine and then import the “OR-cleaned” dataset into a database. Database queries can then check integrity constraints (ICs), which is a powerful way to detect the remaining inconsistencies.
* For this problem, assume that the pre-cleaned dataset (i.e., the result of the OpenRefine step) has been loaded into a table of a relational database as shown below. We are going to write rules (Datalog/clingo queries) that check ICs on the data in this table.
![Publication](Publication_Table.png "Publication")

### Good luck!!

In [1]:
%reload_ext lib.clingo.clingo_magic
import os
from lib.clingo.clingo_evaluate_util import clingo_evaluate

In [2]:
# All clingo cells will run against this file containing some base facts.
publications_base_facts_and_rules_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
%set_db_file $publications_base_facts_and_rules_file

## You will now write various rules to find "bad" (i.e., inconsistent) data

### [10 points] The key attribute ID should uniquely determine all other attributes.
* In DENIAL form we report all IC violations, i.e., cases where at least two rows have the same ID but differ on some attribute.
     - You can assume that the table is available as a Datalog predicate of the form `publication(I,A,Y,T,J,V,N,F,L,P)`. Recall that in Datalog, arbitrary (capitalized) names can be chosen as variables, since it is the argument position that determines which attribute/column is meant.
     - **(FD-1)** The publication identifier `Pid` (the first argument, `I` above) is a key, i.e., if a row agrees with another row on the key attribute `Pid`, then it also agrees on all other attributes (i.e., the “two” rows are in fact one and the same). As usual, your rule should return the IC violations.
* Here we report both the name of the attribute and the two conflicting values.
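
For intuition, the key check can also be sketched outside Datalog. The Python below is only an illustration under stated assumptions: the rows are invented toy data (not the course dataset), and `COLUMNS`, `term_key`, and `key_violations` are hypothetical names. It compares every pair of rows that share an ID and reports each attribute on which they differ.

```python
from itertools import combinations

# Attribute names in the argument order of publication(I,A,Y,T,J,V,N,F,L,P).
COLUMNS = ["id", "author", "year", "title", "journal",
           "vol", "no", "fp", "lp", "publisher"]

# Hypothetical toy rows, invented for illustration only.
rows = [
    (1, "smith", 1990, "alpha", "j1", 5, 1, 10, 20, "pub1"),
    (1, "smith", 1991, "alpha", "j1", 5, 1, 10, 20, "pub1"),
    (2, "jones", 2001, "beta", "j2", 7, 2, 30, 40, "pub2"),
]


def term_key(value):
    # Assumption: order numbers before strings so that mixed values
    # (e.g. a page number and "null") can still be sorted.
    return (isinstance(value, str), value)


def key_violations(rows):
    """Return (id, attribute, v1, v2) for rows sharing an ID but differing on an attribute."""
    found = set()
    for r1, r2 in combinations(rows, 2):
        if r1[0] != r2[0]:
            continue  # different IDs: FD-1 does not apply to this pair
        for name, v1, v2 in zip(COLUMNS[1:], r1[1:], r2[1:]):
            if v1 != v2:
                # Store each conflicting pair once, smaller value first.
                small, large = sorted((v1, v2), key=term_key)
                found.add((r1[0], name, small, large))
    return found


violations = key_violations(rows)
print(violations)
```

Ordering the two values before storing them keeps each conflict from being reported twice, once in each direction.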


In [6]:
# icv_pid_key(ID,author,A1,A2) :-    replace_me_fd1(ID,A1,A2).
    
# icv_pid_key(ID,year,Y1,Y2) :-      replace_me_fd1(ID,Y1,Y2).
    
# icv_pid_key(ID,title,T1,T2) :-     replace_me_fd1(ID,T1,T2).
    
# icv_pid_key(ID,journal,J1,J2) :-   replace_me_fd1(ID,J1,J2).
    
# icv_pid_key(ID,vol,V1,V2) :-       replace_me_fd1(ID,V1,V2).
    
# icv_pid_key(ID,no,N1,N2) :-        replace_me_fd1(ID,N1,N2).
    
# icv_pid_key(ID,fp,F1,F2) :-        replace_me_fd1(ID,F1,F2).
    
# icv_pid_key(ID,lp,L1,L2) :-        replace_me_fd1(ID,L1,L2).
    
# icv_pid_key(ID,publisher,P1,P2) :- replace_me_fd1(ID,P1,P2).

In [9]:
%%clingo {"predicate" : "icv_pid_key", "predicate_arity" : 4, "result_var": "Icv_pid_key"}
% Don't change the clingo magic command above. The header of this cell will determine how the datalog rules are saved for evaluation.

% The following code snippet and its result will be assigned to the local variable Icv_pid_key

% Change following expressions.
% In DENIAL form we report all IC violations, i.e., where there are at least two rows
% having the same ID, but some differing attributes somewhere.
% Here we report both the name of the attribute and the duplicate values.

% X < Y already implies X != Y, and it reports each conflicting pair only once.
icv_pid_key(ID, author, A1, A2) :-
    publication(ID, A1, _, _, _, _, _, _, _, _),
    publication(ID, A2, _, _, _, _, _, _, _, _),
    A1 < A2.

icv_pid_key(ID, year, Y1, Y2) :-
    publication(ID, _, Y1, _, _, _, _, _, _, _),
    publication(ID, _, Y2, _, _, _, _, _, _, _),
    Y1 < Y2.

icv_pid_key(ID, title, T1, T2) :-
    publication(ID, _, _, T1, _, _, _, _, _, _),
    publication(ID, _, _, T2, _, _, _, _, _, _),
    T1 < T2.

icv_pid_key(ID, journal, J1, J2) :-
    publication(ID, _, _, _, J1, _, _, _, _, _),
    publication(ID, _, _, _, J2, _, _, _, _, _),
    J1 < J2.

icv_pid_key(ID, vol, V1, V2) :-
    publication(ID, _, _, _, _, V1, _, _, _, _),
    publication(ID, _, _, _, _, V2, _, _, _, _),
    V1 < V2.

icv_pid_key(ID, no, N1, N2) :-
    publication(ID, _, _, _, _, _, N1, _, _, _),
    publication(ID, _, _, _, _, _, N2, _, _, _),
    N1 < N2.

icv_pid_key(ID, fp, F1, F2) :-
    publication(ID, _, _, _, _, _, _, F1, _, _),
    publication(ID, _, _, _, _, _, _, F2, _, _),
    F1 < F2.

icv_pid_key(ID, lp, L1, L2) :-
    publication(ID, _, _, _, _, _, _, _, L1, _),
    publication(ID, _, _, _, _, _, _, _, L2, _),
    L1 < L2.

icv_pid_key(ID, publisher, P1, P2) :-
    publication(ID, _, _, _, _, _, _, _, _, P1),
    publication(ID, _, _, _, _, _, _, _, _, P2),
    P1 < P2.

Saving output to local variable Icv_pid_key['result']
Saving code snippet to local variable Icv_pid_key['code']



### [3 points] Test 1 for icv_pid_key/4.
The following test will compare the output of your `icv_pid_key/4` rule against the expected output.

In [10]:
# Test 1 for icv_pid_key/4
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_pid_key(4407,author,doe,kummel) icv_pid_key(4407,year,1969,2015) icv_pid_key(4407,title,ammonoids,foobar) icv_pid_key(4407,vol,10,137) icv_pid_key(4407,no,1,3) icv_pid_key(4407,fp,10,476) icv_pid_key(4407,lp,1,null) icv_pid_key(4407,publisher,null,publisher2)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_pid_key['code'], 'icv_pid_key', 4, expected_output)


### [7 points] Test 2 for icv_pid_key/4
Hidden test case.

In [11]:
# Hidden Test 2 for icv_pid_key/4
# This cell will test icv_pid_key with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] Every journal has a single publisher: icv_journal_publisher/3
- **(FD-2)** Every Journal has a single Publisher. 
- Like (FD-1), this is a functional dependency. It is sometimes written as Journal → Publisher.
- As usual, we use denial mode and report the journals which have more than one publisher.
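
As a rough cross-check of the idea, the same dependency can be tested procedurally by grouping publishers per journal. This is a minimal sketch on invented toy pairs; `pairs` and `journal_publisher_violations` are hypothetical names, not part of the assignment's library.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (journal, publisher) pairs, invented for illustration only.
pairs = [
    ("j1", "pub1"),
    ("j1", "pub2"),
    ("j2", "pub3"),
    ("j2", "pub3"),
]


def journal_publisher_violations(pairs):
    """Return (journal, p1, p2) for every journal with two distinct publishers."""
    publishers = defaultdict(set)
    for journal, publisher in pairs:
        publishers[journal].add(publisher)

    found = set()
    for journal, pubs in publishers.items():
        # combinations over the sorted set yields each unordered pair once.
        for p1, p2 in combinations(sorted(pubs), 2):
            found.add((journal, p1, p2))
    return found


fd2_violations = journal_publisher_violations(pairs)
print(fd2_violations)
```

A journal with three distinct publishers produces three pairs, which is the same shape as the expected output of Test 1 for this problem.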


In [None]:
# icv_journal_publisher(J,P1,P2) :- replace_me_fd2(J,P1,P2).

In [12]:
%%clingo {"predicate" : "icv_journal_publisher", "predicate_arity" : 3, "result_var": "Icv_journal_publisher"}
% Don't change the clingo magic command above. The header of this cell will determine how the datalog rules are saved for evaluation.

% The following code snippet and its result will be assigned to the local variable Icv_journal_publisher

% P1 < P2 already implies P1 != P2 and reports each pair of publishers only once.
icv_journal_publisher(J, P1, P2) :-
    publication(_, _, _, _, J, _, _, _, _, P1),
    publication(_, _, _, _, J, _, _, _, _, P2),
    P1 < P2.

Saving output to local variable Icv_journal_publisher['result']
Saving code snippet to local variable Icv_journal_publisher['code']


### [3 points] Test 1 for icv_journal_publisher/3
You must have run all clingo cells above for the test to pass.

In [13]:
# Test 1 for icv_journal_publisher/3
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_journal_publisher(bullmcz,null,publisher1) icv_journal_publisher(bullmcz,publisher1,publisher2) icv_journal_publisher(bullmcz,null,publisher2)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_journal_publisher['code'], 'icv_journal_publisher', 3, expected_output)

### [7 points] Test 2 for icv_journal_publisher/3
Hidden test case.

In [14]:
# Hidden Test 2 for icv_journal_publisher/3
# This cell will test the icv_journal_publisher with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] The last page (Lp) cannot be smaller than the first page (Fp)
- **(NC-1)** The last page _Lp_ cannot be smaller than the first page _Fp_. 
- This numerical constraint can be evaluated independently on each row.
- In denial form, we report publication IDs for which the last page is smaller than the first.
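
Because this constraint looks at one row at a time, a procedural version is a simple filter. The sketch below uses invented toy rows, and `page_violations` is a hypothetical name. One difference worth noting: Python refuses to compare an integer with a string such as `"null"`, so the sketch skips non-numeric pages explicitly, whereas clingo compares terms of different types through a built-in term order.

```python
# Hypothetical (id, first_page, last_page) triples, invented for illustration only.
rows = [
    (1, 10, 20),
    (2, 91, 9),        # last page smaller than first page: a violation
    (3, 5, 5),         # single-page publication: fine
    (4, 476, "null"),  # missing last page: skipped by this sketch
]


def page_violations(rows):
    """Return (id, fp, lp) for rows whose last page is smaller than the first."""
    return {
        (pid, fp, lp)
        for pid, fp, lp in rows
        # Assumption: only compare when both pages are integers.
        if isinstance(fp, int) and isinstance(lp, int) and lp < fp
    }


nc1_violations = page_violations(rows)
print(nc1_violations)
```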


In [16]:
# icv_firstpage_lastpage(ID,F,L) :- replace_me_nc1(ID,F,L).

In [15]:
%%clingo {"predicate" : "icv_firstpage_lastpage", "predicate_arity" : 3, "result_var": "Icv_firstpage_lastpage"}
% Don't change the clingo magic command above. The header of this cell will determine how the datalog rules are saved for evaluation.

% The following code snippet and its result will be assigned to the local variable Icv_firstpage_lastpage

% Change following expression.
icv_firstpage_lastpage(ID, F, L) :-
    publication(ID, _, _, _, _, _, _, F, L, _),
    L < F.

Saving output to local variable Icv_firstpage_lastpage['result']
Saving code snippet to local variable Icv_firstpage_lastpage['code']



### [3 points] Test 1 for icv_firstpage_lastpage/3
You must have run all clingo cells above for the test to pass.

In [17]:
# Test 1 for icv_firstpage_lastpage/3
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_firstpage_lastpage(6755,91,9) icv_firstpage_lastpage(4407,10,1)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_firstpage_lastpage['code'], 'icv_firstpage_lastpage', 3, expected_output)



### [7 points] Test 2 for icv_firstpage_lastpage/3
Hidden test case.

In [18]:
# Hidden Test 2 for icv_firstpage_lastpage/3
# This cell will test the icv_firstpage_lastpage with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] Inclusion Dependency: Every cited publication in CITES also occurs in PUBLICATION.
- Now consider an additional table `cites(Pid1, Pid2)`, which records pairs of publications Pid1, Pid2 where Pid1 **cites** Pid2.
![Cites](Cites_Table.png "Cites")
- **(ID)** Every cited publication Pid2 must also occur in the publication table.
- In denial form, we report in `icv_cited_publication/1` any and all cited publications from `CITES` that are not also in `PUBLICATION`.
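
An inclusion dependency amounts to a set difference: the cited IDs minus the known publication IDs. The following sketch shows this on invented toy data; `publication_ids`, `cites`, and `dangling_citations` are hypothetical names.

```python
# Hypothetical toy data, invented for illustration only.
publication_ids = {1, 2, 3}
cites = [(1, 2), (2, 9), (3, 9), (3, 7)]  # (citing, cited) pairs


def dangling_citations(cites, publication_ids):
    """Return the cited IDs that do not occur in the publication table."""
    cited = {p2 for _, p2 in cites}
    return cited - publication_ids


missing = dangling_citations(cites, publication_ids)
print(sorted(missing))
```

Using a set means that an ID cited several times, such as 9 above, is reported only once.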

In [19]:
%%clingo {"predicate" : "icv_cited_publication", "predicate_arity" : 1, "result_var": "Icv_cited_publication"}
% Don't change the clingo magic command above. The header of this cell will determine how the datalog rules are saved for evaluation.

% The following code snippet and its result will be assigned to the local variable Icv_cited_publication

% Change following expression.
%(Inclusion Dependency): Every cited publication in CITES also occurs in PUBLICATION.
% icv_cited_publication(P2) :- replace_me_id(P2).


icv_cited_publication(P2) :-
    cites(_, P2),           % P2 is cited in the cites table
    not publication(P2, _, _, _, _, _, _, _, _, _).

Saving output to local variable Icv_cited_publication['result']
Saving code snippet to local variable Icv_cited_publication['code']


### [3 points] Test 1 for icv_cited_publication/1
You must have run all clingo cells above for the test to pass.

In [20]:
# Test 1 for icv_cited_publication/1
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_cited_publication(2020) icv_cited_publication(3799)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_cited_publication['code'], 'icv_cited_publication', 1, expected_output)


### [7 points] Test 2 for icv_cited_publication/1
Hidden test case.

In [21]:
# Hidden Test 2 for icv_cited_publication/1
# This cell will test the icv_cited_publication with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.


### [10 points] If P1 cites P2 then P2's year of publication cannot be greater than P1's.
* As usual, we use denial form and report publications P1 that cite P2 where the cited publication's year Y2 is larger than the citing publication's year Y1 (i.e., P1 is "looking into the future", which is suspicious and needs to be reported ;-).
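
This check joins the citation pairs with the publication table twice, once for each side of the citation. The sketch below spells out that join on invented toy data; `pub_years`, `cites`, and `future_citations` are hypothetical names. Citations whose endpoints are missing from the publication table simply produce no join result here.

```python
# Hypothetical toy data, invented for illustration only.
pub_years = [(1, 1990), (2, 1985), (3, 2001)]  # (id, year) pairs
cites = [(1, 2), (2, 3), (3, 1), (1, 9)]       # (citing, cited) pairs


def future_citations(cites, pub_years):
    """Return (p1, p2, y1, y2) where p1 cites p2 and p2's year y2 is after p1's year y1."""
    found = set()
    for p1, p2 in cites:
        for q1, y1 in pub_years:
            if q1 != p1:
                continue  # join on the citing publication
            for q2, y2 in pub_years:
                if q2 == p2 and y2 > y1:  # join on the cited publication
                    found.add((p1, p2, y1, y2))
    return found


nc2_violations = future_citations(cites, pub_years)
print(nc2_violations)
```

Iterating over `(id, year)` pairs rather than a dictionary keeps the sketch faithful to a join: an ID that appears with two different years would be checked against both.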

In [22]:
%%clingo {"predicate" : "icv_p1_greater_p2", "predicate_arity" : 4, "result_var": "Icv_p1_greater_p2"}
% Don't change the clingo magic command above. The header of this cell will determine how the datalog rules are saved for evaluation.

% The following code snippet and its result will be assigned to the local variable Icv_p1_greater_p2

% Change following expression.
% icv_p1_greater_p2(P1,P2,Y1,Y2) :- replace_me_nc2(P1,P2,Y1,Y2).

icv_p1_greater_p2(P1, P2, Y1, Y2) :-
    cites(P1, P2),          % P1 cites P2
    publication(P1, _, Y1, _, _, _, _, _, _, _),  % Year Y1 of publication P1
    publication(P2, _, Y2, _, _, _, _, _, _, _),  % Year Y2 of publication P2
    Y2 > Y1.

Saving output to local variable Icv_p1_greater_p2['result']
Saving code snippet to local variable Icv_p1_greater_p2['code']


### [3 points] Test 1 for icv_p1_greater_p2/4
You must have run all clingo cells above for the test to pass.

In [23]:
# Test 1 for icv_p1_greater_p2/4
# Following should be output of your previous cell.
# Order of predicates in the output doesn't matter.
# Run to see expected output with syntax highlighting.
expected_output = '''
icv_p1_greater_p2(2044,2580,1934,1962)
'''

db_file = os.path.expanduser('~/data_readonly/datalog/publications_base.lp')
clingo_evaluate(db_file, Icv_p1_greater_p2['code'], 'icv_p1_greater_p2', 4, expected_output)

### [7 points] Test 2 for icv_p1_greater_p2/4
Hidden test case.

In [24]:
# Hidden Test 2 for icv_p1_greater_p2/4
# This cell will test the icv_p1_greater_p2 with these new facts.
# Contents of this cell will not be present in student's version of assignment.
# This will only be evaluated after submission.
