# Finding Similar Items

## 1.	Introduction

### Frequent pairs of items

- Extract a list of the authors or editors per publication from the ACL Anthology dataset (https://aclanthology.org/), create baskets from it, and carry out the following tasks:

1. Find the frequent pairs of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each of these, compare the execution time and the results for supports s=10, 50, 100. Comment on your results.

2. For the PCY algorithm, create up to 5 compact hash tables. What is the difference in results and execution time for 1, 2, 3, 4 and 5 tables? Comment on your results.

3. Find the final list of frequent k-tuples for k=3 and 4. Experiment a bit and describe the best value for the support in each case. *Warning*: You can use any of the three algorithms, but be careful, because the run can take too long if you don't choose the algorithm properly (well, basically don't use the naïve approach ;)).

4. Using one of the results of the previous items, for one k (k=2 or 3), find the possible clusters using the 1-NN criterion. Comment on your results.

> 1-NN means that if you have the tuples {A,B,C} and {C,E,F}, then because they share one element, {C}, they belong to the same cluster {A,B,C,E,F}.
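
The 1-NN rule described above amounts to merging all tuples that are connected through shared elements. A minimal sketch of one way to do this uses a union-find structure; `cluster_1nn` is a hypothetical helper name that does not appear in the notebook, and the example tuples are the ones from the note plus one unrelated tuple.

```python
def cluster_1nn(itemsets):
    """Merge itemsets that share at least one element (union-find)."""
    parent = {}

    def find(x):
        # register unseen items as their own root, then walk up to the root
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for itemset in itemsets:
        items = list(itemset)
        if not items:
            continue
        root = find(items[0])
        for other in items[1:]:
            # attach the root of every other member to the first member's root
            parent[find(other)] = root

    # collect the members of each root into one cluster
    clusters = {}
    for item in parent:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


example = [{"A", "B", "C"}, {"C", "E", "F"}, {"X", "Y", "Z"}]
clusters = cluster_1nn(example)
```

In this sketch the first two tuples end up in one cluster because they share C, while the third stays on its own. The same function could be fed the frequent pairs or triples once they are available.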

-	Define all package imports below, and only those that are actually used

In [8]:
import requests # to download the dataset
import gzip
import shutil # to extract the gz file
import re # for text cleaning

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random 

from nltk.corpus import stopwords # calculation of stopwords
import nltk
nltk.download('stopwords')
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# ! pip install pylatexenc
from pylatexenc.latex2text import LatexNodes2Text # fix umlaut vocals in names

import itertools



## 2.	ELT

### Extract, Load and Transform of data.

- In your code data should be retrieved from an online source, NOT from your local drive, otherwise, nobody can run your code without additional effort.

In [9]:
# Download data
url = 'https://aclanthology.org/anthology+abstracts.bib.gz'
filename = url.split("/")[-1]
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
with open(filename, "wb") as f:
    f.write(r.content)

# Extract the gz file (same name without the .gz suffix)
with gzip.open(filename, 'rb') as f_in:
    with open(filename[:-3], 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [89]:

# Find all the rows in the file that contain an author or editor field and load the text into a list
authors = []
latex_to_text = LatexNodes2Text().latex_to_text  # build the converter once
with open("anthology+abstracts.bib", "r", encoding="UTF-8") as f:
    line = f.readline()
    while line != '':
        if 'author =' in line or 'editor =' in line:
            # a field can span several lines; keep reading until the closing '",'
            while not line.endswith('",\n'):
                nxt = f.readline()
                if nxt == '':  # end of file reached: avoid looping forever
                    break
                line += nxt
            line = latex_to_text(line)  # convert LaTeX accents (e.g. umlauts) to plain text
            authors.append(line)
        line = f.readline()


print("{} baskets of authors were found in the file.".format(len(authors)))
#the issue here is that only the first row after author/editor is stored. 

73129 baskets of authors were found in the file.


In [94]:
authors[1]

'    author = "Singh, Sumer  and\n      Li, Sheng",\n'
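
The raw field shown above still carries the field name, the quotes and a line break, so it is one string rather than a basket of items. A minimal sketch of how such a string could be split into a basket of names follows. `to_basket` is a hypothetical helper that is not part of the notebook, and it assumes that names are separated by the BibTeX keyword `and` and that the field value sits between the first and last double quote.

```python
import re


def to_basket(raw):
    """Turn one raw BibTeX author/editor field into a frozenset of names."""
    # keep only the text between the first and the last double quote
    match = re.search(r'"(.*)"', raw, flags=re.DOTALL)
    if match is None:
        return frozenset()
    # names are separated by the word "and" surrounded by whitespace
    names = re.split(r'\s+and\s+', match.group(1))
    # collapse runs of whitespace inside each name and drop empty pieces
    return frozenset(" ".join(name.split()) for name in names if name.strip())


sample = '    author = "Singh, Sumer  and\n      Li, Sheng",\n'
basket = to_basket(sample)
```

For the sample string this gives a basket with the two names `Singh, Sumer` and `Li, Sheng`. Names that differ only in spelling or abbreviation would still count as different items under this sketch.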

In [95]:
# Some cleaning: keep rows longer than minletters characters in which more than 60% of the characters are letters

minletters = 5
authors_clean = []

for a in authors:
    if len(a) > minletters and len(re.findall('[a-zA-Z]', a)) > 0.6 * len(a):
        authors_clean.append(a)
print("After cleaning, {} rows of authors remained.".format(len(authors_clean)))

After cleaning, 18181 rows of authors remained.


### Report the essential description of data.
-	Don’t print out dozens of raw lines.

In [92]:
data = pd.DataFrame(authors_clean, columns=['authors'])

# Number of words
data['word_count'] = data['authors'].apply(lambda x: len(str(x).split(" ")))

# Number of characters (this also includes spaces)
data['char_count'] = data['authors'].str.len()

# Average word length
def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)

data['avg_word'] = data['authors'].apply(avg_word)

# Number of stop words
stop = set(stopwords.words('english'))  # a set gives fast membership tests
data['stopwords'] = data['authors'].apply(lambda x: len([w for w in x.split() if w in stop]))

# Number of uppercase words
data['upper'] = data['authors'].apply(lambda x: len([w for w in x.split() if w.isupper()]))

# Descriptive statistics of the DataFrame
data.describe().transpose()

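|            | count   | mean       | std       | min      | 25%  | 50%  | 75%      | max    |
|------------|---------|------------|-----------|----------|------|------|----------|--------|
| word_count | 18181.0 | 28.208019  | 20.392735 | 8.0      | 17.0 | 26.0 | 35.0     | 560.0  |
| char_count | 18181.0 | 106.355811 | 74.47912  | 32.0     | 62.0 | 91.0 | 132.0    | 1767.0 |
| avg_word   | 18181.0 | 6.334465   | 0.526591  | 4.454545 | 6.0  | 6.25 | 6.571429 | 10.0   |
| stopwords  | 18181.0 | 2.459161   | 2.674914  | 0.0      | 1.0  | 2.0  | 3.0      | 60.0   |
| upper      | 18181.0 | 0.134206   | 0.41958   | 0.0      | 0.0  | 0.0  | 0.0      | 7.0    |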


## 3.	Modeling

### Prepare analytics here and construct all the data objects you will use in your report.
•	Write functions and classes to simplify tasks. Do not repeat yourself.

•	Avoid output.

•	Refactor your code until it’s clean

In [55]:
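def readdata(k, fname="data/data.txt", report=False):
    """Yield every k-itemset of every basket as a frozenset.

    `fname` is iterated directly, so despite its name it must be an iterable
    of lines (an open file, a list, a pandas Series). A plain string such as
    the default would be iterated character by character.
    Each non-empty line is one item; an empty line ends the current basket.
    """
    C_k = []
    b = 0

    for line in fname:
        line = line.replace('\n', '')  # remove newline symbol
        if report:
            print(line)

        if line != "":
            # gather all items in one basket
            C_k.append(line)
        else:
            # end of basket, report all itemsets
            for itemset in itertools.combinations(C_k, k):
                yield frozenset(itemset)
            C_k = []

            if report:
                print("")
                # report progress, printing every 1000th basket to reduce clutter
                if b % 1000 == 0:
                    print('processing bin ', b)
            b += 1

    # last basket
    if len(C_k) > 0:
        for itemset in itertools.combinations(C_k, k):
            yield frozenset(itemset)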
    

In [60]:
N = 5  # frequency threshold


# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, fname=data["authors"], report=False):
    if key not in C1:
        C1[key] = 1
    else:
        C1[key] += 1    
        
print("{} items".format(len(C1)))

# filter stage: keep the items that occur at least N times
L1 = {}
for key, count in C1.items():
    if count >= N:
        L1[key] = count
print('{} items with >={} occurrences'.format(len(L1), N))

33590 items
3328 items with >=5 occurrences
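
The count-then-filter step above is the first pass of A-priori. The second pass counts only those pairs whose two members both survived the filter. A minimal, self-contained sketch on toy baskets is given below. `apriori_pairs` and the toy data are illustrative assumptions and not part of the notebook, and baskets are assumed to be lists of item names.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, support):
    """Return {pair: count} for all pairs occurring in >= support baskets."""
    # pass 1: count single items and keep the frequent ones
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {item for item, n in item_counts.items() if n >= support}

    # pass 2: count only pairs made of two frequent items
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)  # sorted so (a, b) and (b, a) coincide
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= support}


toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
pairs = apriori_pairs(toy, support=2)
```

With a support of 2, item `d` is dropped in the first pass, so the pair `(b, d)` is never counted at all. This is where the memory saving over the naïve approach comes from.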


In [61]:
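import time


def get_C(k, fname="data/data.txt"):
    """Count every k-itemset yielded by readdata and report the elapsed time."""
    start = time.time()
    C = {}
    for key in readdata(k, fname):  # report defaults to False
        C[key] = C.get(key, 0) + 1
    print("Took {}s for k={}".format((time.time() - start), k))
    return C

# With the default fname, readdata iterates over the characters of the path
# string itself, so the counts below are over characters, not authors.
# Pass an iterable of lines (one item per line, blank line between baskets)
# to count real items.
get_C(2)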

Took 0.0s for k=2


{frozenset({'a', 'd'}): 8,
 frozenset({'d', 't'}): 8,
 frozenset({'/', 'd'}): 2,
 frozenset({'d'}): 1,
 frozenset({'.', 'd'}): 2,
 frozenset({'d', 'x'}): 2,
 frozenset({'a', 't'}): 16,
 frozenset({'a'}): 6,
 frozenset({'/', 'a'}): 4,
 frozenset({'.', 'a'}): 4,
 frozenset({'a', 'x'}): 4,
 frozenset({'/', 't'}): 4,
 frozenset({'t'}): 6,
 frozenset({'.', 't'}): 4,
 frozenset({'t', 'x'}): 4,
 frozenset({'.', '/'}): 1,
 frozenset({'/', 'x'}): 1,
 frozenset({'.', 'x'}): 1}
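
Counting every pair in a dictionary, as `get_C` does, is the naïve approach. PCY uses the first pass to also hash each pair into a bucket, and in the second pass counts a pair only if both items are frequent and its bucket is frequent. The sketch below illustrates the single-table variant under stated assumptions: `pcy_pairs` and `n_buckets` are hypothetical names, Python's built-in `hash` stands in for the hash function, and the multi-table variant asked for in the task is not covered.

```python
from collections import Counter
from itertools import combinations


def pcy_pairs(baskets, support, n_buckets=1009):
    """Return {pair: count} for frequent pairs using one PCY hash table."""
    baskets = [sorted(set(b)) for b in baskets]

    # pass 1: count items and hash every pair into a bucket
    item_counts = Counter()
    buckets = [0] * n_buckets
    for basket in baskets:
        item_counts.update(basket)
        for pair in combinations(basket, 2):
            buckets[hash(pair) % n_buckets] += 1
    frequent = {item for item, n in item_counts.items() if n >= support}

    # between the passes: compact the bucket counts into a bitmap
    bitmap = [n >= support for n in buckets]

    # pass 2: count a pair only if both items and its bucket are frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = [item for item in basket if item in frequent]
        for pair in combinations(kept, 2):
            if bitmap[hash(pair) % n_buckets]:
                pair_counts[pair] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= support}


toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
pcy_result = pcy_pairs(toy, support=2)
```

Because the second pass still counts the candidates exactly, the result should match A-priori. Only the number of candidates held in memory changes, and that depends on `n_buckets` and on how the pairs happen to collide.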

In [38]:
import itertools

for c in itertools.combinations(data["abstracts"][1:8], 2):
    print(c)

('Hate speech and profanity detection suffer from data sparsity, especially for languages other than English, due to the subjective nature of the tasks and the resulting annotation incompatibility of existing corpora. In this study, we identify profane subspaces in word and sentence representations and explore their generalization capability on a variety of similar and distant target tasks in a zero-shot setting. This is done monolingually (German) and cross-lingually to closely-related (English), distantly-related (French) and non-related (Arabic) tasks. We observe that, on both similar and distant target tasks and across all languages, the subspace-based representations transfer more effectively than standard BERT representations in the zero-shot setting, with improvements between F1 +10.9 and F1 +42.9 over the baselines across all tested monolingual and cross-lingual scenarios.', 'We introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was

In [41]:
itertools.combinations(data["abstracts"][1:8],2)

<itertools.combinations at 0x1cab0dc1450>
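
Since `itertools.combinations` is lazy, candidates can be generated and checked one at a time instead of being stored. For the k=3 and k=4 tasks, a level-wise search that prunes candidates with an infrequent subset is one option. The sketch below is an assumption about how this could look, not the notebook's method; `frequent_itemsets` and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, support, k):
    """Return the frequent k-itemsets, built level by level (A-priori style)."""
    baskets = [frozenset(b) for b in baskets]

    # level 1: frequent single items
    counts = Counter(frozenset([item]) for b in baskets for item in b)
    frequent = {s for s, n in counts.items() if n >= support}

    for size in range(2, k + 1):
        # only items that survive in some frequent (size-1)-itemset can matter
        alive = set().union(*frequent)
        counts = Counter()
        for basket in baskets:
            kept = sorted(basket & alive)
            for cand in combinations(kept, size):
                # monotonicity: all (size-1)-subsets must already be frequent
                if all(frozenset(sub) in frequent
                       for sub in combinations(cand, size - 1)):
                    counts[frozenset(cand)] += 1
        frequent = {s for s, n in counts.items() if n >= support}

    return frequent


toy_k = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["c", "d"]]
triples = frequent_itemsets(toy_k, support=2, k=3)
```

Filtering each basket down to the items that are still "alive" keeps the number of combinations manageable even when a few baskets are very long.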

## 4.	Results

•	Print out relevant tables nicely, display well-annotated charts and explain if needed in plain English.
•	Use minimum code here, just output-functions’ calls.

## 5.	Conclusions

•	Summarize your findings here in 5...10 lines of text.
