# 📚 Exercise: Taxonomy Induction

<br>
In this exercise, we will perform the various steps commonly employed for unsupervised taxonomy induction from text corpora. 

## Overview:
Taxonomy induction from text typically consists of three main steps:

<ol>
  <li><b>Relations Extraction:</b> In this step, we use lexico-syntactic patterns to extract <b>IsA</b> relations from text. An example of an IsA relation is (<i>apple, fruit</i>), which states that <i>apple</i> is a type of <i>fruit</i>.</li>
  <br>
  <li><b>Initial Graph Construction:</b> In this step, we aggregate the extracted IsA relations to construct a potentially-noisy initial hypernym graph.</li>
  <br>
  <li><b>Graph Pruning:</b> In the final step, we prune or optimize the noisy graph to induce a final, clean taxonomy.</li>
</ol>  

## Goal:
- Run an extraction of IsA relations using lexico-syntactic patterns and inspect the results.
- Given the true IsA relations, construct an initial potentially-noisy hypernym graph.

## What are you learning in this exercise:
You will learn about the three steps of taxonomy induction in detail in the rest of this exercise.




## Question 1 - Relations Extraction


In this part of the exercise, we will run a small-scale extraction of IsA relations and inspect the results. Relations extraction uses lexico-syntactic patterns to identify IsA relations from unstructured text. Examples of lexico-syntactic patterns include:

 Lexico-syntactic pattern | Sample matching text
  ------|------------------
  <b>X</b> is a <b>Y</b>     | <i><b>apple</b> is a <b>fruit</b></i>, <i><b>switzerland</b> is a <b>country</b></i>
  <b>X</b> such as <b>Y</b>     | <i><b>fruits</b> such as <b>mango</b></i>, <i><b>scientists</b> such as <b>Einstein</b></i>
  <b>X</b> is an example of <b>Y</b>     | <i><b>iphone</b> is an example of <b>smartphone</b></i>
  
  <br>
  In this exercise, we will use such lexico-syntactic patterns to identify IsA relations from text.
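
To make the table concrete, here is a minimal sketch of how two of these patterns could be matched with regular expressions. The toy sentence and the variable names are invented for illustration and are not part of the exercise data. Note that the position of the hypernym depends on the pattern.

```python
import re

# Toy sentence invented for illustration; it is not taken from wiki_food_en.txt.
text = "an apple is a fruit . we sell fruits such as mango ."

# Each pattern captures the single word on either side of the cue phrase.
is_a = re.compile(r"([a-z]+) is a ([a-z]+)")
such_as = re.compile(r"([a-z]+) such as ([a-z]+)")

# "X is a Y": X is the hyponym and Y is the hypernym.
is_a_pairs = is_a.findall(text)

# "X such as Y": the roles are reversed, so swap to (hyponym, hypernym).
such_as_pairs = [(hypo, hyper) for hyper, hypo in such_as.findall(text)]

print(is_a_pairs)
print(such_as_pairs)
```

Single-word captures keep the sketch short; real text contains multi-word terms, which is one source of the noise inspected later in this exercise.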
  
  
  ###  Question 1.a

  Load the given file <b>wiki_food_en.txt</b> into memory using the following code:
  
  
  

In [None]:
def load_text_file(filename):
    file_text = []
    with open(filename) as fp:
        for line in fp:
            file_text.append(line.strip().lower())
    return " ".join(file_text)

file_text = load_text_file("wiki_food_en.txt")

<br>
The following code uses Python's regular expression library (re) to find matches of a lexico-syntactic pattern in file_text. The example below uses a regular expression for the pattern "X is a Y". Fill in the blanks (...):


In [None]:
import re

def find_matches(file_text, regexp_string):
    #Compile a regular expression 
    regexp = re.compile(...)
    
    #Find all matches with the given regular expression
    matches = ...(regexp, file_text)
    
    return matches

isa_matches = find_matches(file_text, "[a-z]+ is a ...")

<br>
 ###  Question 1.b

Run the above code for relations extraction with the following lexico-syntactic patterns:

<ol>
  <li>X such as Y</li>
  <li>such X as Y</li>
<li>X and other Y</li>  
</ol>

Manually inspect the results and compute the accuracy of the first 20 matches for each lexico-syntactic pattern. What do you observe? Is there any important difference between patterns 1, 2, and 3?
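
One simple way to record the manual inspection is a list of booleans per pattern, from which the accuracy follows directly. The sketch below assumes accuracy means the fraction of inspected matches that are correct IsA relations. The judgments shown are hypothetical placeholders, not results from the exercise data.

```python
def pattern_accuracy(judgments):
    """Fraction of inspected matches judged to be correct IsA relations."""
    return sum(judgments) / len(judgments)

# Hypothetical manual judgments (True = correct relation); replace with your own.
judgments = [True, False, True, True, False]

accuracy = pattern_accuracy(judgments)
print(f"accuracy = {accuracy:.2f}")
```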

## Question 2 - Graph Construction

As you noticed in the previous step, relations extracted with lexico-syntactic patterns contain significant noise. Removing this noise is fairly involved and beyond the scope of this exercise. For further reading, we recommend the following paper, which presents a state-of-the-art effort for IsA relations extraction: <a href="http://webdatacommons.org/isadb/lrec2016.pdf">A Large Database of Hypernymy Relations Extracted from the Web</a>.

In this part of the exercise, we assume that IsA relations extracted using a state-of-the-art approach are already available. Given these relations, the aim of this step is to construct an initial potentially-noisy hypernym graph.

###  Question 2.a


Load the IsA relations of the food domain from the given file "food_isa_relations.txt" using the following code:

In [None]:
# Each line holds three tab-separated fields: hyponym, hypernym, frequency
rels = []
with open("food_isa_relations.txt") as fp:
    for line in fp:
        toks = line.strip().split('\t')
        rels.append((toks[0], toks[1], float(toks[2])))

In Python, a graph is conveniently represented as a two-level dictionary (here, a defaultdict whose values are dictionaries). For example, the edge (<i>apple</i>, <i>fruit</i>, freq) is stored as:

map['apple']['fruit'] = freq
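
As a small illustration of this representation, the sketch below stores three toy edges by hand. The terms and frequencies are made up and do not come from food_isa_relations.txt.

```python
from collections import defaultdict

# A missing outer key automatically gets an empty inner dictionary.
graph = defaultdict(dict)

# Toy edges: graph[hyponym][hypernym] = frequency (made-up numbers).
graph["apple"]["fruit"] = 12.0
graph["apple"]["company"] = 3.0
graph["mango"]["fruit"] = 7.0

print(graph["apple"])              # all hypernyms of 'apple' with frequencies
print("fruit" in graph["mango"])   # check whether an edge exists
print("kiwi" in graph)             # membership test does not create the key
```

One design point to keep in mind: reading graph[key] for an unseen key inserts an empty entry, so use a membership test (key in graph) when you only want to check for a node.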

The following code converts the IsA relations loaded from the file into a 2-level dictionary. Fill in the blanks:

In [None]:
from collections import defaultdict

noisy_relations = defaultdict(dict)
for hypo, hyper, freq in rels:
    noisy_relations[...][...] = ...

<br>

###  Question 2.b

The next step of taxonomy induction removes noisy IsA relations. In a real scenario, this usually involves a wide variety of filters. In this exercise, we implement only one: for each hyponym, we sort its hypernyms by frequency and retain only the top 5.
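
The idea of top-k filtering can be sketched on a toy dictionary. The helper name top_k and the data are hypothetical, and the sketch assumes that a higher frequency indicates a more reliable hypernym.

```python
def top_k(hypernyms, k):
    """Return the k highest-frequency entries of a {hypernym: freq} dict."""
    # Assumption: higher frequency means a more reliable hypernym.
    ranked = sorted(hypernyms.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])

# Toy data with made-up frequencies.
toy = {"fruit": 12.0, "company": 3.0, "food": 9.0, "tree": 1.0}

filtered = top_k(toy, 2)
print(filtered)
```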

First, print the hypernyms of 'apple':

In [None]:
noisy_relations['apple']

<br>
Fill in the blanks in the following code: 

In [None]:
for hypo in noisy_relations.keys():
    sorted_hypernyms = sorted(..., key = lambda x: ...)
    noisy_relations[hypo] = {k:v for k,v in ...}

# Print the filtered noisy relations.
noisy_relations['apple']

###  Question 2.c

Next, we convert the set of filtered IsA relations into a graph. First, install the networkx and matplotlib libraries:

   $ pip install networkx<br>
   
   $ pip install matplotlib
   


Then build the graph with the following code:

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

G=nx.DiGraph()

for hypo in noisy_relations.keys():
    for hyper in noisy_relations[hypo].keys():
        G.add_edge(hypo, hyper)
                

<br>
Print all the paths between the following terms (Hint: use the networkx function all_simple_paths):

1. 'apple' and 'food'
2. 'fusilli' and 'food'
3.  'okra' and 'food' 

Do you notice any relationship between the length of the path and its accuracy?
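
For intuition about what a simple path is, here is a plain-Python sketch that enumerates paths that never visit a node twice over a two-level dictionary. It illustrates the concept only and is not networkx's implementation. The function name simple_paths and the toy graph are hypothetical.

```python
def simple_paths(graph, source, target, path=None):
    """Yield every path from source to target that visits no node twice."""
    path = (path or []) + [source]
    if source == target:
        yield path
        return
    for neighbour in graph.get(source, {}):
        if neighbour not in path:  # skip nodes already on this path
            yield from simple_paths(graph, neighbour, target, path)

# Toy graph in the two-level dictionary form; the back edge creates a cycle.
toy_graph = {
    "apple": {"fruit": 1, "food": 1},
    "fruit": {"food": 1, "apple": 1},
}

paths = list(simple_paths(toy_graph, "apple", "food"))
for p in paths:
    print(p)
```

Because nodes already on the current path are skipped, the cycle between 'apple' and 'fruit' does not lead to infinite paths.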


###  Question 2.d

In this step, we build a taxonomy as follows:

<ol>
  <li> Let the vocabulary be {'apple', 'mango', 'peach', 'orange', 'banana'}.</li> 
<li> Let the root of the taxonomy be 'food'.</li> 
<li> Find all simple paths between terms in the vocabulary and the root. </li>
<li> Retain all simple paths of length $l$.</li>
<li> Construct a graph by aggregating all the edges in the retained paths. </li>
</ol>

The code below implements the steps above. Fill in the blanks:


In [None]:
def select_paths(vocab, root, l):
    retained_paths = []
    for term in vocab:
        for path in ...:
            if ...:
                retained_paths.append(path)
    return retained_paths


def aggregate_paths(paths):
    agg_graph = defaultdict(dict)
    
    for path in paths:
        for i, term in enumerate(path[:-1]):
            agg_graph[...][...] = 1
            
    return agg_graph


V = ['apple', 'mango', 'peach', 'orange', 'banana']
root = 'food'

graph = aggregate_paths(select_paths(V, root, 3))
 

# Plot the graph
Gt = nx.DiGraph()
for k in graph.keys():
    for k1 in graph[k].keys():
        Gt.add_edge(k, k1)

# Draw the graph with node labels before showing the figure
nx.draw(Gt, with_labels=True)
plt.show()
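
The path selection and aggregation steps of Question 2.d can also be traced by hand on toy data. The sketch below uses invented paths and assumes that the length of a path is its number of edges; the exercise may instead intend the number of nodes, so treat this as one possible reading.

```python
from collections import defaultdict

# Toy paths toward the root 'food'; invented for illustration.
toy_paths = [
    ["apple", "food"],
    ["apple", "fruit", "food"],
    ["mango", "fruit", "food"],
    ["mango", "tropical fruit", "fruit", "food"],
]

def edges_in(path):
    # Assumption: path length is counted in edges, not nodes.
    return len(path) - 1

# Keep only the paths with exactly two edges.
kept = [p for p in toy_paths if edges_in(p) == 2]

# Each pair of consecutive nodes on a kept path becomes one edge.
agg = defaultdict(dict)
for p in kept:
    for child, parent in zip(p, p[1:]):
        agg[child][parent] = 1

print(dict(agg))
```

Edges shared by several paths, such as ('fruit', 'food') here, are stored once, so the aggregated graph is typically much smaller than the sum of the paths.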

###  Question 2.e

Plot the aggregated graph using the previous steps, but with different path lengths (for example, 2 or 4). What do you notice?

###  Question 2.f

Repeat steps 2.b to 2.e, but skip the filtering of noisy relations in step 2.b. What do you observe?