# Finding Similar Items

## 1.	Introduction

### Frequent pairs of items

- Extract the list of authors or editors per publication from the ACL Anthology dataset (https://aclanthology.org/), build baskets from them, and carry out the following tasks:

1. Find the frequent pairs of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each algorithm, compare the execution time and the results for supports s=10, 50, 100. Comment on your results.

2. For the PCY algorithm, create up to 5 compact hash tables. How do the results and the execution time differ for 1, 2, 3, 4 and 5 tables? Comment on your results.

3. Find the final list of frequent k-tuples for k=3 and 4. Experiment a bit and describe the best value for the support in each case. *Warning*: You can use any of the three algorithms, but choose carefully, because a poorly chosen algorithm can take too long (basically, don't use the naïve approach ;)).

4. Using one of the previous results, for one k (k=2 or 3), find the possible clusters using the 1-NN criterion. Comment on your results.

> 1-NN means that if you have the tuples {A,B,C} and {C,E,F}, then because they share one element {C}, they belong to the same cluster {A,B,C,E,F}.
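
The 1-NN rule described above amounts to finding connected components: two tuples land in the same cluster whenever a chain of shared elements links them. The following is a minimal sketch under that reading, using a small union-find structure. The function name `cluster_itemsets` and the toy tuples are illustrative assumptions, not part of the assignment.

```python
def cluster_itemsets(itemsets):
    """Merge itemsets that share at least one element (connected components)."""
    parent = {}

    def find(x):
        # follow parent links up to the root, shortening the path on the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for itemset in itemsets:
        items = list(itemset)
        if not items:
            continue
        root = find(items[0])
        for other in items[1:]:
            # attach the other element's component to the first element's root
            parent[find(other)] = root

    # group every element under its final root
    clusters = {}
    for item in parent:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


# toy data (hypothetical): the first two tuples share "C", the third is isolated
toy = [{"A", "B", "C"}, {"C", "E", "F"}, {"X", "Y"}]
clusters = cluster_itemsets(toy)
```

On real data the input would be the keys of a frequent-tuple dictionary (for example the frequent pairs for one chosen support), since `frozenset` keys can be passed in directly.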

- Define all package imports below, and only the ones actually used

In [1]:
import requests # to download the dataset
import gzip
import shutil # to extract the gz file
import re # for text cleaning

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random 

from nltk.corpus import stopwords # calculation of stopwords
import nltk
nltk.download('stopwords')
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# ! pip install pylatexenc
from pylatexenc.latex2text import LatexNodes2Text # fix umlaut vocals in names

import itertools

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alext\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2.	ELT

### Extract, Load and Transform of data.

- In your code data should be retrieved from an online source, NOT from your local drive, otherwise, nobody can run your code without additional effort.

In [2]:
# Download data 
url = 'https://aclanthology.org/anthology+abstracts.bib.gz'
filename = url.split("/")[-1]
with open(filename, "wb") as f:
    r = requests.get(url)
    f.write(r.content)

# Extract the gz file
with gzip.open('anthology+abstracts.bib.gz', 'rb') as f_in:
    with open('anthology+abstracts.bib', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [3]:

# Find all the lines in the file that contain an author/editor field and load the text into a list
authors = []
latex_to_text = LatexNodes2Text()  # create the converter once instead of once per line
with open("anthology+abstracts.bib", "r", encoding="UTF-8") as f:
    line = f.readline()
    while line != '':
        if 'author =' in line or 'editor =' in line:
            # a field can span several lines; keep reading until its closing '",'
            while not line.endswith('",\n'):
                line = line + f.readline()
            line = latex_to_text.latex_to_text(line)  # fix umlauts and other LaTeX accents
            authors.append(line)
        line = f.readline()


print("{} baskets of authors were found in the file.".format(len(authors)))
#the issue here is that only the first row after author/editor is stored. 

73199 baskets of authors were found in the file.


In [4]:
authors[1]

'    author = "Singh, Sumer  and\n      Li, Sheng",\n'

In [5]:
# Some cleaning

minletters = 5 
authors_clean = []

for a in authors: 
    if len(a) > minletters and len(re.findall('[a-zA-Z]',a)) >0.6*len(a):
        authors_clean.append(a) 
print("After cleaning, {} rows of authors were remaining.".format(len(authors_clean)))

After cleaning, 18197 rows of authors were remaining.


In [6]:
# Find all the lines in the file that contain an author/editor field and load the text into a list
authors = []
with open("anthology+abstracts.bib", "r", encoding="UTF-8") as f:
    line = f.readline()
    while line != '':

        if 'author =' in line or 'editor =' in line:
            # a field can span several lines; read on until its closing '",' or '},'
            while not line.endswith(('",\n', '},\n')):
                line = line + f.readline()
#           line = LatexNodes2Text().latex_to_text(line) # fix umlauts; this step takes some time to run
            # strip the field name, the quotes and the line breaks
            line = re.sub('    editor = "|    author = "|",|\n|    author = {|    editor = {', '', line)
            # drop the comma inside "Last, First" so that commas can separate authors
            line = re.sub(',', '', line)
            # "A  and\n      B" has become "A  and      B"; turn it into "A, B"
            line = re.sub('  and      ', ', ', line)
            authors.append(line)

        line = f.readline()

print("{} baskets of authors were found in the file.".format(len(authors)))

73199 baskets of authors were found in the file.


In [7]:
authors

['Mostafazadeh Davani Aida, Kiela Douwe, Lambert Mathias, Vidgen Bertie, Prabhakaran Vinodkumar, Waseem Zeerak',
 'Singh Sumer, Li Sheng',
 'Hahn Vanessa, Ruiter Dana, Kleinbauer Thomas, Klakow Dietrich',
 "Caselli Tommaso, Basile Valerio, Mitrovi{\\'c} Jelena, Granitzer Michael",
 'Kirk Hannah, Jun Yennie, Rauba Paulius, Wachtel Gal, Li Ruining, Bai Xingjian, Broestl Noah, Doff-Sotta Martin, Shtedritski Aleksandar, Asano Yuki M',
 'Kivlichan Ian, Lin Zi, Liu Jeremiah, Vasserman Lucy',
 'Caselli Tommaso, Schelhaas Arjan, Weultjes Marieke, Leistra Folkert, van der Veen Hylke, Timmerman Gerben, Nissim Malvina',
 'Niraula Nobal B., Dulal Saurab, Koirala Diwa',
 "Fortuna Paula, Cortez Vanessa, Sozinho Ramalho Miguel, P{\\'e}rez-Mayos Laura",
 'Manerba Marta Marchiori, Tonelli Sara',
 'Mostafazadeh Davani Aida, Omrani Ali, Kennedy Brendan, Atari Mohammad, Ren Xiang, Dehghani Morteza',
 'Zad Samira, Jimenez Joshuan, Finlayson Mark',
 'Chuang Yung-Sung, Gao Mingye, Luo Hongyin, Glass James, L

In [8]:
def readdata(k, fname=authors, report=False):
    """Yield every k-item subset (as a frozenset) of each author basket."""
    for line in fname:
        if report:
            print(line)

        # one line is one basket: the authors of a single publication
        basket = line.split(", ")
        for itemset in itertools.combinations(basket, k):
            yield frozenset(itemset)

        if report:
            print("")
    

### Report the essential description of data.
-	Don’t print out dozens of raw lines.

In [10]:
data = pd.DataFrame(authors_clean, columns=['authors'])
# Number of words
data['word_count'] = data['authors'].apply(lambda x: len(str(x).split(" ")))

# Number of characters (spaces included)
data['char_count'] = data['authors'].str.len()

# Average word length
def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words)/len(words))

data['avg_word'] = data['authors'].apply(lambda x: avg_word(x))

# Number of stop words 
stop = stopwords.words('english')
data['stopwords'] = data['authors'].apply(lambda x: len([x for x in x.split() if x in stop]))

# Number of Uppercase words
data['upper'] = data['authors'].apply(lambda x: len([x for x in x.split() if x.isupper()]))

# Descriptive statistics of the DataFrame
data.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| word_count | 18197.0 | 28.216354 | 20.397909 | 8.0 | 17.0 | 26.0 | 35.0 | 560.0 |
| char_count | 18197.0 | 106.383525 | 74.498631 | 32.0 | 62.0 | 91.0 | 132.0 | 1767.0 |
| avg_word | 18197.0 | 6.334394 | 0.526573 | 4.454545 | 6.0 | 6.25 | 6.571429 | 10.0 |
| stopwords | 18197.0 | 2.460186 | 2.675705 | 0.0 | 1.0 | 2.0 | 3.0 | 60.0 |
| upper | 18197.0 | 0.134033 | 0.419235 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 |


In [11]:
nitems = 5
for C_k in readdata(k=2, report=True):
    print(C_k)
    nitems -= 1
    
    if nitems == 0:
        break

Mostafazadeh Davani Aida, Kiela Douwe, Lambert Mathias, Vidgen Bertie, Prabhakaran Vinodkumar, Waseem Zeerak
frozenset({'Kiela Douwe', 'Mostafazadeh Davani Aida'})
frozenset({'Mostafazadeh Davani Aida', 'Lambert Mathias'})
frozenset({'Vidgen Bertie', 'Mostafazadeh Davani Aida'})
frozenset({'Prabhakaran Vinodkumar', 'Mostafazadeh Davani Aida'})
frozenset({'Mostafazadeh Davani Aida', 'Waseem Zeerak'})


In [12]:
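# Count how often every k-item set occurs across all baskets
import time


def get_C(k):
    """Return {itemset: count} for all k-item sets and print the elapsed time."""
    start = time.time()
    C = {}
    for key in readdata(k):  # report=False: stay silent while counting
        C[key] = C.get(key, 0) + 1
    print("Took {}s for k={}".format((time.time() - start), k))
    return C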


C1 = get_C(1)
C2 = get_C(2)

Took 0.44598817825317383s for k=1
Took 0.9703795909881592s for k=2


In [13]:
print(len(C1),len(C2))

64850 271422


In [14]:
for (ck, n), _ in zip(C2.items(), range(5)):
    print(ck,n)

frozenset({'Kiela Douwe', 'Mostafazadeh Davani Aida'}) 2
frozenset({'Mostafazadeh Davani Aida', 'Lambert Mathias'}) 1
frozenset({'Vidgen Bertie', 'Mostafazadeh Davani Aida'}) 2
frozenset({'Prabhakaran Vinodkumar', 'Mostafazadeh Davani Aida'}) 3
frozenset({'Mostafazadeh Davani Aida', 'Waseem Zeerak'}) 2


In [19]:

for s in range(10,110,10):
    L2 = {}
    for key, n in C2.items():
        if n >= s:
            L2[key] = n
    print('{} pairs with >={} occurrences'.format(len(L2), s))

1784 pairs with >=10 occurrences
253 pairs with >=20 occurrences
67 pairs with >=30 occurrences
25 pairs with >=40 occurrences
13 pairs with >=50 occurrences
9 pairs with >=60 occurrences
6 pairs with >=70 occurrences
1 pairs with >=80 occurrences
1 pairs with >=90 occurrences
0 pairs with >=100 occurrences


## 3.	Modeling

### Prepare analytics here and construct all the data objects you will use in your report.
•	Write functions and classes to simplify tasks. Do not repeat yourself.

•	Avoid output.

•	Refactor your code until it’s clean

In [16]:
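def naive_method(s):
    """Print pairs with support >= s whose rule A --> B has interest above 0.7.

    Relies on the globals C1, C2 (item and pair counts) and nbaskets.
    """
    # keep the frequent pairs; a "pair" with a repeated author collapses to one element, so skip it
    L2 = [pair for pair, n in C2.items() if n >= s and len(pair) > 1]

    for pair in L2:
        A, B = list(pair)
        support_AB = C2[pair]
        support_A = C1[frozenset([A])]
        conf_A_leads_to_B = support_AB / support_A  # confidence of A --> B

        support_B = C1[frozenset([B])]
        prob_B = support_B / nbaskets  # fraction of baskets that contain B

        # interest: how far the confidence is from B's overall frequency
        interest_A_leads_to_B = conf_A_leads_to_B - prob_B

        if interest_A_leads_to_B > 0.7:
            print("{} --> {} with interest {:.3f}".format(A, B,
                                                          interest_A_leads_to_B))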

In [17]:
def A_priori_method(s):
    # pass 1 filter: keep only the frequent single items
    L1 = {}
    for key, count in C1.items():
        if count >= s:
            L1[key] = count

    # pass 2: count only the pairs whose members are all frequent
    C2 = {}
    for key in readdata(k=2):
        # A-priori pruning, checked per pair instead of materialising all |L1|^2 candidates
        if any(frozenset([item]) not in L1 for item in key):
            continue

        # count the candidate pair
        if key not in C2:
            C2[key] = 1
        else:
            C2[key] += 1

    # filter stage: keep the pairs that reach the support threshold
    L2 = {}
    for key, count in C2.items():
        if count >= s:
            L2[key] = count
    print('A-priori: {} items with >={} occurrences'.format(len(L2), s))
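
The task list also asks for PCY with up to 5 hash tables, which this section does not yet implement. Below is a minimal sketch of how it could look, not the notebook's own method. It assumes the baskets are held in a list of lists, it keeps all hash tables in the first pass (the multihash variant), and it uses Python's built-in `hash` salted with the table index as a stand-in for independent hash functions. The names `pcy_pairs`, `n_tables` and `n_buckets` are hypothetical, and each table gets its own `n_buckets` buckets rather than sharing a fixed memory budget.

```python
from collections import Counter
from itertools import combinations


def pcy_pairs(baskets, s, n_tables=1, n_buckets=100_003):
    """Return {pair: count} for the pairs that occur in at least s baskets."""

    def bucket(pair, table):
        # salt the hash with the table index: one hash function per table (assumption)
        return hash((table, pair)) % n_buckets

    # Pass 1: count single items and hash every pair into each table
    item_counts = Counter()
    tables = [[0] * n_buckets for _ in range(n_tables)]
    for basket in baskets:
        items = sorted(set(basket))  # each basket counts an item or pair at most once
        item_counts.update(items)
        for pair in combinations(items, 2):
            for t in range(n_tables):
                tables[t][bucket(pair, t)] += 1

    # Compress each table into a bitmap: True marks a frequent bucket
    bitmaps = [[count >= s for count in table] for table in tables]
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count only pairs of frequent items that hit a frequent bucket in every table
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for pair in combinations(items, 2):
            if all(bitmaps[t][bucket(pair, t)] for t in range(n_tables)):
                pair_counts[pair] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= s}


# toy baskets (hypothetical data) to show the call
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
toy_result = pcy_pairs(toy, s=2, n_tables=3)
```

With the notebook's data the baskets could be built as `[line.split(", ") for line in authors]`. The number of tables only changes how many candidate pairs survive into the second pass, so the final frequent pairs should be identical for 1 to 5 tables while memory use and run time differ.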
    

In [18]:
# SLOW! (too many possible 3-tuples) So let's be smart and use a time constraint.
import time
start = time.time()
PERIOD_OF_TIME = 10  # seconds

# ASSUMPTION: the notebook does not state the support for this cell; s = 10 is a placeholder
s = 10
L1_items = set(key for key, n in C1.items() if n >= s)  # frequent single items
L2_items = set(key for key, n in C2.items() if n >= s)  # frequent pairs

# find frequent 3-tuples
C3 = {}
for key in readdata(k=3):

    # break out of a slow loop once the time budget is spent
    if time.time() > start + PERIOD_OF_TIME:
        print('Time is running out')
        break

    # filter out non-frequent tuples
    # A-Priori filtering, option 2: generate all possible subsets and check that they all are frequent
    non_freq_1 = set([frozenset(x) for x in itertools.combinations(list(key), 1)]) - L1_items
    if len(non_freq_1) > 0:
        continue

    non_freq_2 = set([frozenset(x) for x in itertools.combinations(list(key), 2)]) - L2_items
    if len(non_freq_2) > 0:
        continue

    # record candidate tuples
    if key not in C3:
        C3[key] = 1
    else:
        C3[key] += 1

print("{} items".format(len(C3)))
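
Since the tasks ask for execution times at s=10, 50 and 100, a small timing wrapper can keep that comparison out of the individual methods. This is only a sketch: `timed` and the toy function `count_at_least` are hypothetical names, and the toy counts stand in for the real candidate dictionaries.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def count_at_least(counts, s):
    # toy stand-in for a mining function: keep the entries that reach support s
    return {key: n for key, n in counts.items() if n >= s}


toy_counts = {"ab": 12, "ac": 55, "bc": 101}  # hypothetical pair counts
timings = {}
for s in (10, 50, 100):
    frequent, seconds = timed(count_at_least, toy_counts, s)
    timings[s] = (len(frequent), seconds)
```

The same wrapper could be applied to any of the mining functions, provided they return their result instead of only printing it.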

## 4.	Results

•	Print out relevant tables nicely, display well-annotated charts and explain if needed in plain English.
•	Use minimum code here, just output-functions’ calls.

In [None]:
nbaskets = len(authors)
%time naive_method(10)

In [None]:
%time naive_method(50)

In [None]:
%time naive_method(100)

## 5.	Conclusions

•	Summarize your findings here in 5...10 lines of text.
