# nlg: a Python package for analogy
This notebook shows how to use the __nlg__ package.  
__nlg__ is a Python 3 package containing modules and functions for working with analogies.  
Its main use is to extract analogies from a given text.

___
## Installation
Please follow the instructions in the __README__ file:
- You need to first install the Python package: __fast_distance__.
- Based on your environment, you may need to install __Cython__.

After successfully installing the package, we can import it.

In [1]:
import nlg

___
## Analogy

### Analogy class
*Analogy* is a class that represents an analogy between four terms:  
> *A* is to *B* as *C* is to *D*

In [2]:
from nlg.Analogy.Analogy import Analogy
term_A = "makan"
term_B = "dimakan"
term_C = "minum"
term_D = "diminum"
analogy = Analogy.fromTerms(term_A, term_B, term_C, term_D)
print(analogy)

makan : dimakan :: minum : diminum
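
As a rough illustration of what relates the four terms, the sketch below checks one necessary condition that is commonly used for formal analogies between strings: for every character, the count in *A* minus the count in *B* equals the count in *C* minus the count in *D*. This is an assumption made for illustration. The helpers `count_difference` and `char_count_condition` are hypothetical and not part of __nlg__, which may apply further constraints.

```python
from collections import Counter


def count_difference(x, y):
    """Signed per-character count difference between strings x and y."""
    diff = Counter(x)
    diff.subtract(Counter(y))
    # Drop zero entries so that equal differences compare equal.
    return {ch: n for ch, n in diff.items() if n != 0}


def char_count_condition(a, b, c, d):
    """Hypothetical helper: necessary (not sufficient) condition for A : B :: C : D."""
    return count_difference(a, b) == count_difference(c, d)


print(char_count_condition("makan", "dimakan", "minum", "diminum"))  # True
print(char_count_condition("makan", "dimakan", "minum", "minuman"))  # False
```

The condition is only necessary: strings with the right character counts in the wrong order would also pass, so it cannot replace the package's own checks.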


### Solving analogies
We can use *solvenlg* from the C module to solve an analogical equation:  
> given 3 terms *A*, *B* and *C*; coin the fourth term *D*

In [3]:
from _nlg import solvenlg
term_A = "makan"
term_B = "dimakan"
term_C = "minum"
term_D = solvenlg(term_A, term_B, term_C)
print(term_D)

diminum
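
To make the idea of an analogical equation concrete, here is a deliberately simple pure-Python sketch. It handles only the case where *B* is *A* with a prefix and/or suffix attached, and it is not how *solvenlg* works internally. The function name `solve_affix_analogy` is hypothetical.

```python
def solve_affix_analogy(a, b, c):
    """Toy solver for A : B :: C : x when B is A plus a prefix and/or suffix.

    Returns None when A does not occur inside B.
    """
    idx = b.find(a)
    if idx == -1:
        return None
    prefix = b[:idx]
    suffix = b[idx + len(a):]
    # Attach the same prefix and suffix to C.
    return prefix + c + suffix


print(solve_affix_analogy("makan", "dimakan", "minum"))  # diminum
print(solve_affix_analogy("makan", "makanan", "minum"))  # minuman
print(solve_affix_analogy("makan", "minum", "beli"))     # None
```

A real solver also has to deal with infixes, parallel changes inside the word, and equations that have no solution or several, none of which this sketch attempts.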


___
## Representing strings as vectors
To get the vector representation of a given string, you can use the Python script included in the package: __*Lines2Vectors.py*__  
Let us produce vector representations for a set of words contained in a file.

In [4]:
!cat toy_data/id.test.words

air
anto
beli
bola
cilok
dan
di
dia
dibeli
dimakan
diminum
enak
es
itu
juga
main
mainan
makan
makanan
melihat
memakan
memang
meminum
minum
minuman
nasi
olahraga
pasar
selesai
senang
setelah
suka

In [5]:
! python Lines2Vectors.py <toy_data/id.test.words -V

# Reading file...
# Number of lines read: 32	
# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Alphabet size: 19
# 
air	(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
anto	(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0)
beli	(0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
bola	(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0)
cilok	(0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0)
dan	(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
di	(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
dia	(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
dibeli	(0, 1, 0, 1, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
dimakan	(2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0)
diminum	(0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1)
enak	(1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
es	(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
itu	(0, 0

Now, we will do the same using the module provided by the package.

In [6]:
from nlg.Vector.Vectors import Vectors

In [7]:
filename = "toy_data/id.test.words"
# Use a context manager so the file handle is closed after reading.
with open(filename, encoding="utf-8") as f:
    set_of_words = [line.strip() for line in f]

print(f"# Number of lines: {len(set_of_words)}")
for i, elem in enumerate(set_of_words):
    print(f"\t{i+1:2}. {elem}")

# Number of lines: 32
	 1. air
	 2. anto
	 3. beli
	 4. bola
	 5. cilok
	 6. dan
	 7. di
	 8. dia
	 9. dibeli
	10. dimakan
	11. diminum
	12. enak
	13. es
	14. itu
	15. juga
	16. main
	17. mainan
	18. makan
	19. makanan
	20. melihat
	21. memakan
	22. memang
	23. meminum
	24. minum
	25. minuman
	26. nasi
	27. olahraga
	28. pasar
	29. selesai
	30. senang
	31. setelah
	32. suka


### Feature vector: characters
In this example, we use the number of occurrences of each character of the alphabet as the features of the vectors.

In [8]:
vectors = Vectors.fromFile(lines=set_of_words)
print(vectors)

# 
air	(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)
anto	(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0)
beli	(0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
bola	(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0)
cilok	(0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0)
dan	(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
di	(0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
dia	(1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
dibeli	(0, 1, 0, 1, 1, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
dimakan	(2, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0)
diminum	(0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1)
enak	(1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
es	(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0)
itu	(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
juga	(1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
main	(1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 

# Alphabet size: 19
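
The character features can be sketched in a few lines of plain Python. The sketch assumes that the alphabet is the sorted set of all characters in the input; this ordering is an assumption, although it yields an alphabet of size 19 for the toy word list and the same vector for *air* as shown above. The helpers `build_alphabet` and `char_vector` are hypothetical and not part of __nlg__.

```python
from collections import Counter

# The 32 words of toy_data/id.test.words, as listed earlier in this notebook.
WORDS = (
    "air anto beli bola cilok dan di dia dibeli dimakan diminum enak es itu "
    "juga main mainan makan makanan melihat memakan memang meminum minum "
    "minuman nasi olahraga pasar selesai senang setelah suka"
).split()


def build_alphabet(words):
    """Sorted list of all distinct characters (assumed ordering)."""
    return sorted(set("".join(words)))


def char_vector(word, alphabet):
    """Number of occurrences of each alphabet character in the word."""
    counts = Counter(word)
    return tuple(counts[ch] for ch in alphabet)


alphabet = build_alphabet(WORDS)
print(len(alphabet))                 # 19
print(char_vector("air", alphabet))
```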


### Feature vector: morphosyntactic description
We can also use the morphosyntactic description of a word as features. For example:

> the Indonesian word *makanan* has a part of speech tag of __noun__.

To demonstrate, let us use the SIGMORPHON data for English (unfortunately, Indonesian is not available).  
Each line has three tab-separated fields:

> __LEMMA__ *tab* __INFLECTED_FORM__ *tab* __MORPHOSYNTACTIC_DESCRIPTION__
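
A line in this format can be split into its parts as follows. This is a minimal sketch, assuming that the fields are separated by tab characters and that the tags of the description are separated by semicolons, as in a line such as `stodge<TAB>stodges<TAB>V;3;SG;PRS`. The names `SigmorphonEntry` and `parse_sigmorphon_line` are hypothetical and not part of __nlg__.

```python
from typing import NamedTuple, Tuple


class SigmorphonEntry(NamedTuple):
    lemma: str
    form: str
    tags: Tuple[str, ...]


def parse_sigmorphon_line(line):
    """Split 'lemma<TAB>form<TAB>tag;tag;...' into its parts.

    A line with only two fields is given an empty tag tuple.
    """
    fields = line.rstrip("\n").split("\t")
    lemma, form = fields[0], fields[1]
    msd = fields[2] if len(fields) > 2 else ""
    tags = tuple(tag for tag in msd.split(";") if tag)
    return SigmorphonEntry(lemma, form, tags)


entry = parse_sigmorphon_line("stodge\tstodges\tV;3;SG;PRS")
print(entry.lemma, entry.form, entry.tags)
```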

In [9]:
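sigmorphon_filename = 'toy_data/english-sigmorphon'
# The data contains non-ASCII characters, so read it explicitly as UTF-8.
with open(sigmorphon_filename, encoding='utf-8') as f:
    sigmorphon_lines = [line.strip() for line in f]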

print(f"# Number of lines: {len(sigmorphon_lines)}")
for i, elem in enumerate(sigmorphon_lines):
    print(f"\t{i+1:2}. {elem}")

# Number of lines: 100
	 1. dreep	dreep	V;NFIN
	 2. charcoal	charcoal	V;NFIN
	 3. stodge	stodges	V;3;SG;PRS
	 4. biotransform	biotransform	V;NFIN
	 5. disallow	disallowing	V;V.PTCP;PRS
	 6. precut	precut	V;V.PTCP;PST
	 7. outmanœuvre	outmanœuvred	V;PST
	 8. unsnib	unsnibbing	V;V.PTCP;PRS
	 9. Afghanize	Afghanized	V;PST
	10. redescribe	redescribes	V;3;SG;PRS
	11. overspeculate	overspeculates	V;3;SG;PRS
	12. reënter	reënters	V;3;SG;PRS
	13. waller	wallering	V;V.PTCP;PRS
	14. carboxylate	carboxylating	V;V.PTCP;PRS
	15. imprison	imprisoned	V;PST
	16. helicopt	helicopted	V;PST
	17. tut	tutted	V;V.PTCP;PST
	18. misdoom	misdooms	V;3;SG;PRS
	19. mush	mush	V;NFIN
	20. billhook	billhook	V;NFIN
	21. ingrave	ingraved	V;PST
	22. estheticize	estheticize	V;NFIN
	23. off-split	off-split	V;PST
	24. excecate	excecating	V;V.PTCP;PRS
	25. hegemonise	hegemonised	V;V.PTCP;PST
	26. overregularize	overregularized	V;PST
	27. innoculate	innoculates	V;3;SG;PRS
	28. mopy	mopying	V;V.PTCP;PRS
	29. unhyphenate	unhy

We can now represent them as vectors using the __*fromSigmorphonFile*__ function.  
Notice that we can control which kinds of features are embedded into the vectors.

In [10]:
char_feature = True
morph_feature = True
lemma_feature = False
gen_lemma = False

sigmorphon_vectors = Vectors.fromSigmorphonFile(
    lines=sigmorphon_lines,
    char_feature=char_feature,
    morph_feature=morph_feature,
    lemma_feature=lemma_feature,
    gen_lemma=gen_lemma,
)
print(sigmorphon_vectors)

# 
addending	(0, 0, 1, 0, 0, 3, 1, 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 3, 1, 1, 1, 2)
underfang	(0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 1)
begrasping	(0, 0, 1, 1, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2, 1, 0, 0, 0, 0, 3, 1, 1, 1, 2)
quahog	(0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 1)
reënters	(0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 3, 0, 0, 1, 0, 0, 1, 1, 2, 0, 1)
conculcated	(0, 0, 1, 0, 3, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1)
carabined	(0, 0, 2, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1)
autostarted	(0, 0,

# Alphabet size: 30
# Morph feature size: 13


### Feature vector: affixes
Instead of characters, we may also use affixes as features in the vector. For example:  

> the Indonesian word *makanan* is derived from
> - the stem *makan*, which is a verb, with
> - the suffix *-an*, which turns a verb into a noun

To demonstrate this, we will use the MALINDO-Morph dataset for Indonesian affixes.  
The data has already been preprocessed to follow the SIGMORPHON format, so we can use the same function as before to create the vector representation.

In [11]:
malindo_filename = 'toy_data/malindo-500.txt'
# Use a context manager so the file handle is closed after reading.
with open(malindo_filename, encoding='utf-8') as f:
    malindo_lines = [line.strip() for line in f]

print(f"# Number of lines: {len(malindo_lines)}")
for i, elem in enumerate(malindo_lines):
    print(f"\t{i+1:2}. {elem}")

# Number of lines: 500
	 1. abad	abadnya	-nya
	 2. ada	mengada-adakan	meN-;-kan;R-penuh
	 3. adeni	adeni
	 4. adik	adik-adik	R-penuh
	 5. afgan	afgan
	 6. akhlak	akhlaknya	-nya
	 7. aki	akinya	-nya
	 8. alang	teralang	ter-
	 9. alat	memperalat	meN-;per-
	10. alih	dialihkah	di-;-kah
	11. amin	mengaminkannya	meN-;-kan;-nya
	12. ampuh	mengampuhkan	meN-;-kan
	13. ampun	terampuninya	ter-;-i;-nya
	14. ancam	ancam-mengancam	meN-;R-penuh
	15. andil	berandil	ber-
	16. anggap	dianggapnya	di-;-nya
	17. anggar	teranggar-anggar	ter-;R-penuh
	18. angguk	dianggukkannya	di-;-kan;-nya
	19. angkat	diangkatkan	di-;-kan
	20. antar	pengantar	peN-
	21. antarklub	antarklub
	22. anti-rezim	anti-rezim
	23. antigen	antigen
	24. apit	mengapit	meN-
	25. apresiasi	terapresiasinya	ter-;-nya
	26. arak	diaraknya	di-;-nya
	27. archi	archi
	28. asah	diasah	di-
	29. atrium	atrium
	30. australia-indonesia	australia-indonesia
	31. babak	membabak	meN-
	32. baca	pembacaan	peN--an
	33. badari	badari
	34. baik	pembaik	peN-
	3

In [12]:
char_feature = True
morph_feature = True
lemma_feature = False
gen_lemma = False

malindo_vectors = Vectors.fromSigmorphonFile(
    lines=malindo_lines,
    char_feature=char_feature,
    morph_feature=morph_feature,
    lemma_feature=lemma_feature,
    gen_lemma=gen_lemma,
)
print(malindo_vectors)

# 
mewangikan	(0, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0)
imanlah	(0, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
ketelitian	(0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
lembagakan	(0, 3, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0)
ketatlah	(0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
beriring	(0, 0, 1, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
perintahkanlah	(0, 3, 0, 0, 0, 1, 0, 0, 2, 1, 0, 1, 1, 0, 2, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1

# Alphabet size: 25
# Morph feature size: 20


### Feature vector: combination of features
Remember that we can combine any of the features mentioned above to obtain a richer representation of the words.  
Using __different kinds of feature vectors__ also yields __different kinds of analogical clusters and grids__.
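
One simple way to picture a combined representation is to concatenate a character-count vector with a block of 0/1 indicators for the tags of a word. The sketch below does this for three illustrative entries. The entries, the indicator encoding and the helper `combined_vector` are assumptions made for this example; the way the package itself encodes morphological features may differ.

```python
from collections import Counter


def combined_vector(form, tags, alphabet, tagset):
    """Character counts followed by 0/1 indicators for each known tag."""
    counts = Counter(form)
    char_part = tuple(counts[ch] for ch in alphabet)
    tag_part = tuple(int(tag in tags) for tag in tagset)
    return char_part + tag_part


# Illustrative (form, tags) entries; not taken from the toy data files.
entries = [("makan", ()), ("dimakan", ("di-",)), ("makanan", ("-an",))]

alphabet = sorted({ch for form, _ in entries for ch in form})
tagset = sorted({tag for _, tags in entries for tag in tags})

for form, tags in entries:
    print(form, combined_vector(form, tags, alphabet, tagset))
```

With this encoding, two words that share an affix agree on the tag part of the vector even when their characters differ, which is what lets morphological information influence the clustering.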

___
## Extracting analogical clusters
There are several Python scripts that can be used to extract the analogical clusters:
- __*Words2Clusters.py*__ (from a set of words)
- __*Vectors2Clusters.py*__ (from vectors)

### *Words2Clusters.py*
This Python script reads a file containing one word per line and outputs a list of analogical clusters.

In [13]:
!python Words2Clusters.py <toy_data/id.test.words -V

# Reading words and computing feature vectors (features=characters)...
# Clustering strings according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Adding indistinguishable strings...
# Checking distance constraint...
# 
minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan
# Processing time: 0.64s
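
The first clustering step can be pictured as grouping word pairs that share the same difference between their feature vectors. The sketch below is a naive quadratic version of that idea on character counts only. It is an assumption about the general principle, not the algorithm used by __*Words2Clusters.py*__, and it skips the later steps (indistinguishable strings, distance constraints). The helper names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_difference(x, y):
    """Hashable signed difference of character counts, y minus x."""
    diff = Counter(y)
    diff.subtract(Counter(x))
    return tuple(sorted((ch, n) for ch, n in diff.items() if n != 0))


def naive_clusters(words, min_size=2):
    """Group word pairs by identical character-count difference."""
    groups = defaultdict(list)
    for w1, w2 in combinations(words, 2):
        groups[count_difference(w1, w2)].append((w1, w2))
    return {key: pairs for key, pairs in groups.items() if len(pairs) >= min_size}


words = ["beli", "dibeli", "main", "mainan", "makan",
         "dimakan", "makanan", "minum", "diminum", "minuman"]
clusters = naive_clusters(words)
for pairs in clusters.values():
    print(" :: ".join(f"{a} : {b}" for a, b in pairs))
```

Among the groups found this way are `beli : dibeli :: makan : dimakan :: minum : diminum` and `main : mainan :: makan : makanan :: minum : minuman`, which correspond to clusters in the script output above.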


### *Vectors2Clusters.py*
This Python script reads a file containing a word and its vector representation on each line and outputs a list of analogical clusters.  
In this example, we use __*Lines2Vectors.py*__ to produce the vector representation.

Notice that the programs can be chained in a *pipeline*, communicating with each other through *stdin* and *stdout*.

In [14]:
!python Lines2Vectors.py <toy_data/id.test.words -V | python Vectors2Clusters.py -V

# Reading file...
# Number of lines read: 32	
# Building vector with feature...
#	- char : True
#	- token: False
#	- morph: False
#	- lemma: False
# Alphabet size: 19
# Lines2Vectors.py - Processing time: 0:00:00
# Reading words and their vector representations...
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Add the indistinguishables...
# Checking distance constraints...
# 
minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan
# Vectors2Clusters.py - Processing time: 0:00:00


Let us now extract the analogical clusters from the vectors created earlier, using the module available in the package.

In [15]:
from nlg.Cluster.Cluster import ListOfClusters
from nlg.Cluster.Words2Clusters.StrCluster import ListOfStrClusters

In [16]:
min_clu_size = 2
max_clu_size = None

distinguishable_vectors = vectors.get_distinguishables()
list_of_clusters = ListOfClusters.fromVectors(
    distinguishable_vectors,
    minimal_size=min_clu_size,
    maximal_size=max_clu_size,
)
list_of_clusters.set_indistinguishables(vectors.indistinguishables)

We then verify the distance constraints.

In [17]:
list_of_strclusters = ListOfStrClusters.fromListOfClusters(
    clusters=list_of_clusters,
    minimal_size=min_clu_size,
    maximal_size=max_clu_size,
)
print(list_of_strclusters)

# 
minum : makan :: meminum : memakan :: diminum : dimakan :: minuman : makanan
minum : minuman :: main : mainan :: makan : makanan
minum : diminum :: beli : dibeli :: makan : dimakan
beli : makan :: dibeli : dimakan
minum : main :: minuman : mainan
main : makan :: mainan : makanan
meminum : minuman :: memakan : makanan
diminum : minuman :: dimakan : makanan
minum : beli :: diminum : dibeli
meminum : diminum :: memakan : dimakan
minum : meminum :: makan : memakan
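
The notebook does not say which distance or which constraints are verified. As a sketch under stated assumptions, the code below uses an edit distance based on the longest common subsequence (insertions and deletions only) and the constraints d(A, B) = d(C, D) and d(A, C) = d(B, D), which are commonly used for formal analogies between strings. All function names are hypothetical and not part of __nlg__ or __fast_distance__.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        curr = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(x, y):
    """Edit distance with insertions and deletions only."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def distance_constraints_hold(a, b, c, d):
    """Assumed constraints: d(A, B) = d(C, D) and d(A, C) = d(B, D)."""
    return (lcs_distance(a, b) == lcs_distance(c, d)
            and lcs_distance(a, c) == lcs_distance(b, d))


print(lcs_distance("makan", "dimakan"))                                   # 2
print(distance_constraints_hold("makan", "dimakan", "minum", "diminum"))  # True
print(distance_constraints_hold("makan", "dimakan", "minum", "meminum"))  # False
```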


___
## Extracting analogical grids
There are several Python scripts that can be used to extract the analogical grids:
- __*Words2Grids.py*__ (from a set of words)
- __*Vectors2Grids.py*__ (from vectors)
- __*Clusters2Grids.py*__ (from analogical clusters)

### *Words2Grids.py*
This Python script reads a file containing one word per line and outputs a list of analogical grids.

In [18]:
!python Words2Grids.py <toy_data/id.test.words -V

# Reading words and computing feature vectors (features=characters)...
# Clustering the words according to their feature vectors...
#	- min cluster size: 2
#	- max cluster size: None
# Add the indistinguishables...
# Checking distance constraints...
# Building grids...
#	- saturation ≥ 0.000
#	- cluster size ≥ {options.minimal_grids_cluster_size}
minum : meminum : diminum : minuman :: makan : memakan : dimakan : makanan :: main : None : None : mainan :: beli : None : dibeli : None

# Processing time: 0.07s


### *Vectors2Grids.py*
This Python script reads a file containing a word and its vector representation on each line and outputs a list of analogical grids.  
In this example, we again use __*Lines2Vectors.py*__ to produce the vector representation.

In [None]:
!python Lines2Vectors.py <toy_data/id.test.words -V | python Vectors2Grids.py -V

### *Clusters2Grids.py*
This Python script reads a file containing one analogical cluster per line and outputs a list of analogical grids.  
In this example, we use the previous Python script __*Vectors2Clusters.py*__ to produce the clusters.

In [None]:
!python Lines2Vectors.py <toy_data/id.test.words -V | python Vectors2Clusters.py -V | python Clusters2Grids.py -V

Let us now extract the analogical grids from the list of analogical clusters using the module.

In [None]:
from nlg.Grid.Grid import ListOfGrids

In [None]:
min_saturation = 0.0 
list_of_grids = ListOfGrids.fromClusters(list_of_strclusters, min_saturation)
print(list_of_grids.pretty_print())

### Properties of analogical grids
There are two properties of analogical grids: *size* and *saturation*.
- *Size* is simply the total number of cells inside the grid.
- *Saturation* is the ratio of non-empty cells to the total number of cells.
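
Both properties can be computed by hand from the grid line printed by __*Words2Grids.py*__ earlier in this notebook. The sketch assumes that rows are separated by ` :: `, cells by ` : `, and that empty cells are printed as `None`, as in that output. The helpers `parse_grid`, `grid_size` and `grid_saturation` are hypothetical and not part of __nlg__.

```python
def parse_grid(line):
    """Turn a printed grid line into a list of rows; empty cells become None."""
    rows = []
    for row in line.split(" :: "):
        cells = [None if cell == "None" else cell for cell in row.split(" : ")]
        rows.append(cells)
    return rows


def grid_size(grid):
    """Total number of cells in the grid."""
    return sum(len(row) for row in grid)


def grid_saturation(grid):
    """Ratio of non-empty cells to the total number of cells."""
    filled = sum(cell is not None for row in grid for cell in row)
    return filled / grid_size(grid)


line = ("minum : meminum : diminum : minuman :: makan : memakan : dimakan : makanan"
        " :: main : None : None : mainan :: beli : None : dibeli : None")
grid = parse_grid(line)
print(grid_size(grid))        # 16
print(grid_saturation(grid))  # 0.75
```

The grid has 4 rows of 4 cells, of which 12 are filled, hence a size of 16 and a saturation of 12 / 16 = 0.75.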

In [None]:
for i, grid in enumerate(list_of_grids):
    print(f'Grid no {i+1}:')
    print(f'  - size = {grid.attributes[2]}')
    print(f'  - saturation = {grid.attributes[4]}')

___
## Notes
This notebook shows only the basic use of the scripts and modules.  
Both the Python scripts and the modules accept many more parameters.  
Please look inside the scripts to perform more interesting experiments.