In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [13]:
# Read the results of the three queries

query1 = pd.read_csv('Query1.csv')
query2 = pd.read_csv('Query2.csv')
query3 = pd.read_csv('Query3.csv')

## Part 1: Quality Measure Computation of Rules

Given a set of Horn rules that are assumed to be discovered by an automatic tool using quality measures such as support, head-coverage, and confidence, here's a reminder of how these measures can be computed for a given rule in the form \(\vec{B} \rightarrow r(x, y)\):

- **Support** (\(supp(\vec{B} \rightarrow r(x, y))\)) is defined as: 
  \[
  supp(\vec{B} \rightarrow r(x, y)) := \#\{(x, y) : \exists z_1, ..., z_m : \vec{B} \land r(x, y)\}
  \]

- **Head-Coverage** (\(hc(\vec{B} \rightarrow r(x, y))\)) is defined as: 
  \[
  hc(\vec{B} \rightarrow r(x, y)) := \frac{supp(\vec{B} \rightarrow r(x, y))}{\#\{(x, y) : r(x, y)\}}
  \]

- **Confidence** (\(conf(\vec{B} \rightarrow r(x, y))\)) is defined as: 
  \[
  conf(\vec{B} \rightarrow r(x, y)) := \frac{supp(\vec{B} \rightarrow r(x, y))}{\#\{(x, y) : \exists z_1, ..., z_m : \vec{B}\}}
  \]
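
As a quick illustration of the three definitions, here is a minimal sketch on a toy knowledge base. The entities and facts are made up, and each relation is assumed to be stored as a set of (subject, object) pairs. The rule has a single body atom, so there are no extra variables \(z_i\) to project away.

```python
# Toy knowledge base: each relation is a set of (subject, object) pairs.
# All names and facts below are invented for illustration only.
nationality = {("ann", "France"), ("bob", "Spain"), ("cat", "Italy")}
death_place = {("ann", "France"), ("bob", "Portugal"), ("dan", "Peru")}

# Rule: nationality(x, y) -> deathPlace(x, y)
body_pairs = nationality
head_pairs = death_place

support = len(body_pairs & head_pairs)     # pairs satisfying body AND head
head_coverage = support / len(head_pairs)  # share of head facts the rule covers
confidence = support / len(body_pairs)     # share of body matches that are correct

print(support, head_coverage, confidence)
```

Only the pair `("ann", "France")` satisfies both body and head, so the support is 1 and both ratios are 1/3.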

### Considered Rules

Let's consider the following rules \(r_1\), \(r_2\), and \(r_3\) where the atoms predicate(x, y) are written as \((?x\ \text{predicate}\ ?y)\):

- **r1**: \( (?a\ \text{nationality}\ ?b) \Rightarrow (?a\ \text{deathPlace}\ ?b) \)
- **r2**: \( (?a\ \text{birthPlace}\ ?b) \land (?a\ \text{country}\ ?b) \Rightarrow (?a\ \text{deathPlace}\ ?b) \)
- **r3**: \( (?a\ \text{child}\ ?h) \land (?h\ \text{parent}\ ?b) \Rightarrow (?a\ \text{spouse}\ ?b) \)


In [8]:
query1.head()

Unnamed: 0,writerNameLabel,birthPlaceLabel,nationalityLabel,deathPlaceLabel
0,Robert Zend,Budapest,Hungarian-Canadian,Canada
1,Robert Zend,Kingdom of Hungary (1920-1946),Hungarian-Canadian,Canada
2,Bruce Alistair McKelvie,Canada,Canadians,Canada
3,Leona Florentino,Captaincy General of the Philippines,Ilocano people,Captaincy General of the Philippines
4,Aziz Ahmad (writer),Hyderabad,Pakistani,Canada


In [14]:
query2.head()

Unnamed: 0,writerName,birthPlaceLabel,nationality,country,deathPlaceLabel
0,Caleb Whitefoord,Edinburgh,Scottish,Scotland,London
1,Caleb Whitefoord,Edinburgh,Scottish,United Kingdom,London
2,Carol Fenner,"North Hornell, New York",American,United States,"Battle Creek, Michigan"
3,Amelia Perrier,Cork (city),http://dbpedia.org/resource/United_Kingdom_of_...,Ireland,Sussex
4,Carola Prosperi,Turin,http://dbpedia.org/resource/Italy,Italy,Turin


In [16]:
query3.head(10)

Unnamed: 0,writerName,childLabel,parentLabel,spouseLabel
0,Caitlin Thomas,Aeronwy Thomas,Caitlin Thomas,Dylan Thomas
1,Dylan Thomas,Aeronwy Thomas,Caitlin Thomas,Caitlin Thomas
2,Qi Xin,Xi Yuanping,Qi Xin,Xi Zhongxun
3,Julia Rush Cutler Ward,Samuel Ward (lobbyist),Samuel Ward (banker),Samuel Ward (banker)
4,Rabindranath Tagore,Rathindranath Tagore,Mrinalini Devi,Mrinalini Devi
5,Henrik Ibsen,Sigurd Ibsen,Suzannah Thoresen,Suzannah Thoresen
6,Barry Shipman,Nina Shipman,Barry Shipman,Gwynne Shipman
7,Belkis Cuza Malé,Ernesto Padilla,Belkis Cuza Malé,Heberto Padilla
8,Nalini Prava Deka,Jim Ankan Deka,Bhabananda Deka,Bhabananda Deka
9,Gail Omvedt,Prachi Patankar,Bharat Patankar,Bharat Patankar


## Calculating Metrics

>Let us consider the following rules r1, r2 and r3, where the atoms predicate(x, y) are written as (?x predicate ?y):

>r1: (?a nationality ?b) => (?a deathPlace ?b)


In [29]:
# Convert each query result into a list of dictionaries (one per row)
query1_data = query1.to_dict(orient='records')
query2_data = query2.to_dict(orient='records')
query3_data = query3.to_dict(orient='records')

In [34]:
def calculate_metrics_for_rule_1(data):
    """
    Calculate support, head coverage, and confidence for rule 1:
    A person's nationality is the same as their place of death.
    
    Parameters:
    - data: List of dictionaries, each representing an instance with nationality and deathPlace attributes.
    
    Returns:
    - A dictionary with support, head coverage, and confidence metrics.
    """
    # Initialize counters
    support = 0  # Instances where nationality matches deathPlace
    total_nationality = 0  # Instances with a specified nationality
    total_deathPlace = 0  # Instances with a specified deathPlace
    
    for instance in data:
        has_nationality = 'nationalityLabel' in instance and instance['nationalityLabel']
        has_deathPlace = 'deathPlaceLabel' in instance and instance['deathPlaceLabel']
        
        if has_nationality:
            total_nationality += 1
        if has_deathPlace:
            total_deathPlace += 1
        if has_nationality and has_deathPlace and instance['nationalityLabel'] == instance['deathPlaceLabel']:
            support += 1
    
    # Calculate metrics
    head_coverage = support / total_deathPlace if total_deathPlace > 0 else 0
    confidence = support / total_nationality if total_nationality > 0 else 0
    
    return {
        'support': support,
        'head_coverage': head_coverage,
        'confidence': confidence
    }

In [37]:
data = query1_data

metrics = calculate_metrics_for_rule_1(data)
print("The metrics for rule 1 are:")
print(metrics)


The metrics for rule 1 are:
{'support': 227, 'head_coverage': 0.13243873978996498, 'confidence': 0.13243873978996498}
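
One caveat about the presence checks used above: `pd.read_csv` represents an empty cell as a float NaN, and NaN is truthy in Python, so `instance['nationalityLabel']` alone does not filter out missing values. The sketch below uses two hypothetical records (not taken from Query1.csv) to show the difference from `pd.notna`.

```python
import pandas as pd

# Hypothetical records; the second one has a missing nationality (NaN).
records = [
    {"nationalityLabel": "Canada", "deathPlaceLabel": "Canada"},
    {"nationalityLabel": float("nan"), "deathPlaceLabel": "Canada"},
]

# NaN is truthy, so a plain truthiness test counts the empty cell as present.
truthy_count = sum(1 for r in records if r["nationalityLabel"])

# pd.notna treats NaN (and None) as missing.
notna_count = sum(1 for r in records if pd.notna(r["nationalityLabel"]))

print(truthy_count, notna_count)
```

If the query results contain no empty cells, both checks give the same counts, so the figures above would be unaffected.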


>r2: (?a birthPlace ?b) and (?a country ?b) => (?a deathPlace ?b)


In [40]:
def calculate_metrics_for_rule_2(data):
    """
    Calculate support, head coverage, and confidence for rule 2:
    A person's birthPlace and country are the same, and this matches their place of death.
    
    Parameters:
    - data: List of dictionaries, each representing an instance with birthPlace, country, and deathPlace attributes.
    
    Returns:
    - A dictionary with support, head coverage, and confidence metrics.
    """
    # Initialize counters
    support = 0  # Instances where birthPlace and country match deathPlace
    total_conditions_met = 0  # Instances where birthPlace matches country
    total_deathPlace = 0  # Instances with a specified deathPlace
    
    for instance in data:
        has_birthPlace = 'birthPlaceLabel' in instance and instance['birthPlaceLabel']
        has_country = 'country' in instance and instance['country']
        has_deathPlace = 'deathPlaceLabel' in instance and instance['deathPlaceLabel']
        
        if has_deathPlace:
            total_deathPlace += 1
        if has_birthPlace and has_country and instance['birthPlaceLabel'] == instance['country']:
            total_conditions_met += 1
            if has_deathPlace and instance['birthPlaceLabel'] == instance['deathPlaceLabel']:
                support += 1
    
    # Calculate metrics
    head_coverage = support / total_deathPlace if total_deathPlace > 0 else 0
    confidence = support / total_conditions_met if total_conditions_met > 0 else 0
    
    return {
        'support': support,
        'head_coverage': head_coverage,
        'confidence': confidence
    }




In [41]:
data = query2_data

metrics = calculate_metrics_for_rule_2(data)
print("The metrics for rule 2 are:")
print(metrics)

The metrics for rule 2 are:
{'support': 23, 'head_coverage': 0.03576982892690513, 'confidence': 0.40350877192982454}
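
Note that the definitions in Part 1 count distinct \((x, y)\) pairs, whereas the function above counts query rows, and a writer can appear in several rows (Caleb Whitefoord appears twice in `query2.head()`). The following is only a sketch of a pair-based count on hypothetical rows with the same columns as Query2. It is not the computation used for the figures above.

```python
import pandas as pd

# Hypothetical rows shaped like Query2; W1 is repeated because the
# query returned two country values for that writer.
rows = pd.DataFrame({
    "writerName": ["W1", "W1", "W2", "W3"],
    "birthPlaceLabel": ["Turin", "Turin", "Italy", "Peru"],
    "country": ["Italy", "Piedmont", "Italy", "Peru"],
    "deathPlaceLabel": ["Turin", "Turin", "Italy", "Lima"],
})

# Body: birthPlace(a, b) and country(a, b), kept as distinct (a, b) pairs.
body_rows = rows.loc[rows["birthPlaceLabel"] == rows["country"],
                     ["writerName", "birthPlaceLabel"]]
body_pairs = set(body_rows.itertuples(index=False, name=None))

# Head: deathPlace(a, b), kept as distinct (a, b) pairs.
head_pairs = set(rows[["writerName", "deathPlaceLabel"]]
                 .itertuples(index=False, name=None))

support = len(body_pairs & head_pairs)
head_coverage = support / len(head_pairs)
confidence = support / len(body_pairs)

print(support, head_coverage, confidence)
```

On these toy rows the head contains three distinct pairs rather than four rows, so the head coverage is 1/3 instead of the row-based 1/4.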


>r3: (?a child ?h) and (?h parent ?b) => (?a spouse ?b)

In [46]:
# Support: rows where the child's parent is also the writer's spouse
support = (query3['parentLabel'] == query3['spouseLabel']).sum()

# Head coverage denominator: rows where a spouse is identified
total_spouse_instances = query3['spouseLabel'].notna().sum()

# Confidence denominator: rows where both a child and that child's parent are identified
total_body_instances = (query3['childLabel'].notna() & query3['parentLabel'].notna()).sum()

head_coverage = support / total_spouse_instances if total_spouse_instances > 0 else 0
confidence = support / total_body_instances if total_body_instances > 0 else 0


In [47]:
# Print the metrics
print("Metrics for rule 3:")
print(f"Support: {support}")
print(f"Head Coverage: {head_coverage}")
print(f"Confidence: {confidence}")

Metrics for rule 3:
Support: 104
Head Coverage: 0.3837638376383764
Confidence: 0.3837638376383764


## Part 2: AMIE Results Analysis

Analyse the AMIE output: inspect some of the mined rules and compare their measures (support, head coverage, and PCA confidence), in particular for the three rules used in Part 1.


Consider these rules from the Film dataset:

>Rule 1: db:name  ?a  ?g  db:runtime  ?b   => ?a  db:runtime  ?b	0.050075643	0.995488722	662	665	-1

>Rule 2: ?g  db:name  ?a  ?g  db:writer  ?b   => ?a  db:director  ?b	0.024831527	0.395104895	339	858	-1

>Rule 3: ?h  db:director  ?b  ?a  db:name  ?h   => ?a  db:director  ?b	0.050395546	0.995658466	688	691	-1

Analysis:

Rules 1 and 3 have very high confidence (~0.995): in almost every case where the body holds, the head holds as well (662 of 665 and 688 of 691 body matches, respectively).

- Rule 1 states that if ?a and ?g are linked by db:name and ?g has a db:runtime of ?b, then ?a also has a db:runtime of ?b. However, its head coverage is low (~0.05), so the rule accounts for only a small share of the db:runtime facts.

- Rule 2 states that if ?g is linked to ?a by db:name and to ?b by db:writer, then ?a is likely to have a db:director relation with ?b. Its confidence is much lower (~0.395): only 339 of the 858 body matches also satisfy the head. Its head coverage (~0.025) is also the lowest of the three rules.

- Rule 3 states that if ?h is linked to ?b by db:director and ?a is linked to ?h by db:name, then ?a is also linked to ?b by db:director. The high confidence makes this rule reliable, although its head coverage (~0.05) is as low as that of Rule 1.
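
The confidence values quoted in the three rule lines can be re-derived from the integer columns. The sketch below assumes that the two integers on each line are the support and the body size, which is an interpretation of the output columns and is not stated in the output itself.

```python
# (label, reported confidence, support, body size), copied from the rule lines.
# Reading the two integers as support and body size is an assumption.
rules = [
    ("Rule 1", 0.995488722, 662, 665),
    ("Rule 2", 0.395104895, 339, 858),
    ("Rule 3", 0.995658466, 688, 691),
]

# confidence = support / body size
recomputed = {label: support / body_size
              for label, _, support, body_size in rules}

for label, reported, support, body_size in rules:
    print(f"{label}: {support}/{body_size} = {recomputed[label]:.9f} "
          f"(reported {reported})")
```

Under that assumption, the recomputed ratios agree with the reported confidences to the printed precision.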