# Finding Similar Items

## 1.	Introduction

### Frequent pair of items

- Extract a list of the authors or editors per publication from the ACL Anthology dataset (https://aclanthology.org/), create baskets, and perform the following searches:

1. Find the frequent pairs of items (2-tuples) using the naïve, A-priori and PCY algorithms. For each of these, compare the execution time and results for supports s=10, 50, 100. Comment on your results.

2. For the PCY algorithm, create up to 5 compact hash tables. What is the difference in results and execution time for 1, 2, 3, 4 and 5 tables? Comment on your results.

3. Find the final list of k-frequent items (k-tuples) for k=3 and 4. Experiment a bit and describe the best value for the support in each case. *Warning*: You can use any of the three algorithms, but be careful, because the algorithm can take too long if you don't choose it properly (well, basically don't use the naïve approach ;)).

4. Using one of the results of the previous items, for one k (k=2 or 3) find the possible clusters using the 1-NN criterion. Comment on your results.

> 1-NN means that if you have the tuples {A,B,C} and {C,E,F}, then because they share one element {C}, they belong to the same cluster {A,B,C,E,F}.
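
The 1-NN criterion described above amounts to finding connected components among the itemsets. One possible way to do this is a small union-find structure; the sketch below is only an illustration, and the function name `cluster_itemsets` and the toy tuples are assumptions, not part of the assignment.

```python
from collections import defaultdict


def cluster_itemsets(itemsets):
    """Merge itemsets that share at least one element (1-NN criterion)."""
    parent = {}

    def find(x):
        # Walk up to the root, shortening the path as we go
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for itemset in itemsets:
        items = list(itemset)
        if not items:
            continue
        find(items[0])  # register single-item sets too
        for other in items[1:]:
            union(items[0], other)

    # Group every item under the root of its component
    clusters = defaultdict(set)
    for item in parent:
        clusters[find(item)].add(item)
    return list(clusters.values())


# Toy example: the first two tuples share "C", the third stands alone
clusters = cluster_itemsets([{"A", "B", "C"}, {"C", "E", "F"}, {"X", "Y"}])
```

Here the first two tuples end up in one cluster of five items, while {X, Y} forms its own cluster.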

-	Define all and only used package imports below

In [21]:
import requests # to download the dataset
import gzip
import shutil # to extract the gz file
import re # for text cleaning

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import random 

from nltk.corpus import stopwords # calculation of stopwords
import nltk
nltk.download('stopwords')
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import itertools

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alext\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2.	ELT

### Extract, Load and Transform of data.

- In your code data should be retrieved from an online source, NOT from your local drive, otherwise, nobody can run your code without additional effort.

In [13]:
# Download data
url = 'https://aclanthology.org/anthology+abstracts.bib.gz'
filename = url.split("/")[-1]
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
with open(filename, "wb") as f:
    f.write(r.content)

# Extract the gz file
with gzip.open(filename, 'rb') as f_in:
    with open('anthology+abstracts.bib', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

In [123]:
# Find all the rows in the file that start an author/editor field and load the text into a list
authors = []
with open("anthology+abstracts.bib", "r", encoding="UTF-8") as f:
    for x in f:
        # TODO: factor the two branches below into one helper function
        if 'author =' in x:
            start = x.find('    author = "') + len('    author = "')
            end = x.find('",')
            authors.append(x[start:end])
        if 'editor =' in x:
            start = x.find('    editor = "') + len('    editor = "')
            end = x.find('",')
            authors.append(x[start:end])
            authors.append("")  # empty string marks the end of a basket
# no explicit f.close() is needed: the with block closes the file

print("{} rows of authors were found in the file.".format(len(authors)))
#the issue here is that only the first row after author/editor is stored. 

74739 rows of authors were found in the file.


In [119]:
# the above is wrong, perhaps some solution like the one here: 
# https://stackoverflow.com/questions/21421060/read-a-file-and-extract-lines-between-two-lines-of-specific-text-in-c
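
One possible fix for the problem noted above is to keep reading lines until the closing quote of the field is reached, and to start a new basket at every `@` entry. The sketch below assumes the usual BibTeX layout of the ACL Anthology file (names joined by `and`, field closed by `",`); the function name `extract_baskets` and the sample entries are made up for illustration, and names that end in an escaped quote would need extra care.

```python
import re

FIELD_START = re.compile(r'^\s*(author|editor)\s*=\s*"')
FIELD_END = re.compile(r'",?\s*$')


def extract_baskets(lines):
    """Return one list of names per BibTeX entry (authors plus editors)."""
    baskets, current, buffer = [], [], None
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("@"):
            # A new entry begins: store the previous basket, if any
            if current:
                baskets.append(current)
            current, buffer = [], None
            continue
        match = FIELD_START.match(line)
        if match:
            buffer = line[match.end():]    # first line of the field
        elif buffer is not None:
            buffer += " " + stripped       # continuation line
        if buffer is not None and FIELD_END.search(buffer):
            # Closing quote reached: split the field into individual names
            value = FIELD_END.sub("", buffer)
            names = [" ".join(n.split()) for n in re.split(r"\s+and\s+", value)]
            current.extend(n for n in names if n)
            buffer = None
    if current:
        baskets.append(current)
    return baskets


# Hypothetical sample in the same layout as the .bib file
sample = '''@inproceedings{a,
    title = "T",
    author = "Doe, Jane  and
      Roe, Richard  and
      Poe, Pat",
    year = "2021",
}
@proceedings{b,
    editor = "Xu, Wei",
}
'''.splitlines(keepends=True)
baskets = extract_baskets(sample)
```

Returning a list of lists keeps the basket boundaries explicit, so later filtering steps cannot accidentally remove them.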

In [124]:
authors

['Mostafazadeh Davani, Aida  and',
 '',
 'Singh, Sumer  and',
 'Hahn, Vanessa  and',
 'Caselli, Tommaso  and',
 'Kirk, Hannah  and',
 'Kivlichan, Ian  and',
 'Caselli, Tommaso  and',
 'Niraula, Nobal B.  and',
 'Fortuna, Paula  and',
 'Manerba, Marta Marchiori  and',
 'Mostafazadeh Davani, Aida  and',
 'Zad, Samira  and',
 'Chuang, Yung-Sung  and',
 'Aksenov, Dmitrii  and',
 'Sodhi, Ravsimar  and',
 'Xenos, Alexandros  and',
 'Salawu, Semiu  and',
 'Risch, Julian  and',
 'Trujillo, Milo  and',
 'Shvets, Alexander  and',
 'Bertaglia, Thales  and',
 'Mathias, Lambert  and',
 'Aggarwal, Piush  and',
 'Zia, Haris Bin  and',
 'Kougia, Vasiliki  and',
 'Xu, Wei  and',
 '',
 'Dadu, Tanvi  and',
 'Olsen, Benjamin  and',
 '{H{\\"a}m{\\"a}l{\\"a}inen, Mika  and',
 'Lei, Yanfei  and',
 'Tran Phu, Minh  and',
 'Le, Duong  and',
 'Cho, Won Ik  and',
 'Feucht, Malte  and',
 'Higashiyama, Shohei  and',
 'Cheong, Sik Feng  and',
 'Chen, Shuguang  and',
 'Plepi, Joan  and',
 'Gao, Mengyi  and',
 'Lent,

In [101]:
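# Some cleaning: keep rows longer than minletters in which more than 60% of characters are letters
# NOTE: this filter also drops the empty strings that mark the end of a basket

minletters = 5
authors_clean = []

for a in authors:
    if len(a) > minletters and len(re.findall('[a-zA-Z]', a)) > 0.6 * len(a):
        authors_clean.append(a)
print("After cleaning, {} rows of authors were remaining.".format(len(authors_clean)))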

After cleaning, 70092 rows of authors were remaining.


### Report the essential description of data.
-	Don’t print out dozens of raw lines.

In [54]:
data = pd.DataFrame(authors_clean, columns=['authors'])

# Number of words (split on single spaces, so repeated spaces count as empty words)
data['word_count'] = data['authors'].apply(lambda x: len(str(x).split(" ")))

# Number of characters (this also includes spaces)
data['char_count'] = data['authors'].str.len()

# Average word length
def avg_word(sentence):
    words = sentence.split()
    return sum(len(word) for word in words) / len(words)

data['avg_word'] = data['authors'].apply(avg_word)

# Number of stop words
stop = set(stopwords.words('english'))
data['stopwords'] = data['authors'].apply(lambda s: sum(w in stop for w in s.split()))

# Number of uppercase words
data['upper'] = data['authors'].apply(lambda s: sum(w.isupper() for w in s.split()))

# Descriptive statistics of the DataFrame
data.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| word_count | 72081.0 | 5.304838 | 15.411135 | 1.0 | 4.0 | 4.0 | 4.0 | 414.0 |
| char_count | 72081.0 | 29.245668 | 103.73031 | 6.0 | 16.0 | 19.0 | 22.0 | 2673.0 |
| avg_word | 72081.0 | 5.632325 | 1.527317 | 2.25 | 4.666667 | 5.333333 | 6.333333 | 43.0 |
| stopwords | 72081.0 | 1.361732 | 5.583222 | 0.0 | 1.0 | 1.0 | 1.0 | 158.0 |
| upper | 72081.0 | 0.081769 | 0.426801 | 0.0 | 0.0 | 0.0 | 0.0 | 34.0 |


## 3.	Modeling

### Prepare analytics here and construct all the data objects you will use in your report.
•	Write functions and classes to simplify tasks. Do not repeat yourself.

•	Avoid output.

•	Refactor your code until it’s clean

In [55]:
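def readdata(k, fname="data/data.txt", report=False):
    """Yield every k-item combination (as a frozenset) from each basket.

    Despite its name, `fname` must be an iterable of lines, not a path:
    one item per line, with an empty line closing each basket. A plain
    string (such as the default) is iterated character by character.
    """
    C_k = []  # items of the basket currently being read
    b = 0     # number of baskets processed so far

    for line in fname:
        line = line.replace('\n', '')  # remove newline symbol
        if report:
            print(line)

        if line != "":
            # gather all items in one basket
            C_k.append(line)
        else:
            # end of basket, report all itemsets
            for itemset in itertools.combinations(C_k, k):
                yield frozenset(itemset)
            C_k = []

            # report progress on every 1000th basket to reduce clutter
            if report:
                print("")
                if b % 1000 == 0:
                    print('processing bin ', b)
            b += 1

    # last basket
    if len(C_k) > 0:
        for itemset in itertools.combinations(C_k, k):
            yield frozenset(itemset)
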

In [60]:
N = 5  # support threshold (minimum number of occurrences)


# find frequent 1-tuples (individual items)
C1 = {}
for key in readdata(k=1, fname=data["authors"], report=False):
    C1[key] = C1.get(key, 0) + 1

print("{} items".format(len(C1)))

# filter stage: keep items that occur at least N times
L1 = {key: count for key, count in C1.items() if count >= N}
print('{} items with >={} occurrences'.format(len(L1), N))

33590 items
3328 items with >=5 occurrences
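
The cell above corresponds to the first pass of A-priori (count single items, then filter by support). A second pass could then count only those pairs whose members are both frequent. The sketch below is a minimal illustration under the assumption that baskets are available as a list of lists; `apriori_pairs` and `toy_baskets` are hypothetical names, not objects defined in this notebook.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, support):
    """Return {pair: count} for pairs found in at least `support` baskets."""
    # Pass 1: count single items and keep the frequent ones
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= support}

    # Pass 2: count only pairs whose members are both frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: n for pair, n in pair_counts.items() if n >= support}


# Toy baskets: only A and B reach a support of 3
toy_baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"], ["A", "B", "D"]]
frequent_pairs = apriori_pairs(toy_baskets, support=3)
```

Sorting the kept items gives every pair a canonical order, so `("A", "B")` and `("B", "A")` are counted under one key.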


In [61]:
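import time


def get_C(k, fname="data/data.txt"):
    """Count every k-itemset produced by readdata(k, fname)."""
    start = time.time()
    C = {}
    for key in readdata(k, fname):  # report defaults to False
        C[key] = C.get(key, 0) + 1
    print("Took {}s for k={}".format((time.time() - start), k))
    return C

# NOTE: called without baskets, readdata iterates over the characters of its
# default string "data/data.txt", so the counts below are character pairs,
# not author pairs. Pass the basket lines as `fname` to count real pairs.
get_C(2)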

Took 0.0s for k=2


{frozenset({'a', 'd'}): 8,
 frozenset({'d', 't'}): 8,
 frozenset({'/', 'd'}): 2,
 frozenset({'d'}): 1,
 frozenset({'.', 'd'}): 2,
 frozenset({'d', 'x'}): 2,
 frozenset({'a', 't'}): 16,
 frozenset({'a'}): 6,
 frozenset({'/', 'a'}): 4,
 frozenset({'.', 'a'}): 4,
 frozenset({'a', 'x'}): 4,
 frozenset({'/', 't'}): 4,
 frozenset({'t'}): 6,
 frozenset({'.', 't'}): 4,
 frozenset({'t', 'x'}): 4,
 frozenset({'.', '/'}): 1,
 frozenset({'/', 'x'}): 1,
 frozenset({'.', 'x'}): 1}
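
A naive counter like the one above keeps a count for every pair it sees. PCY reduces this by hashing pairs into buckets during the first pass and, in the second pass, counting only pairs that consist of frequent items and fall into a frequent bucket. The following is a rough single-table sketch, assuming baskets as a list of lists; `pcy_pairs`, `n_buckets` and `toy_baskets` are illustrative names, and the multi-table variants asked for in the task are not covered.

```python
from collections import Counter
from itertools import combinations


def pcy_pairs(baskets, support, n_buckets=1000):
    """Frequent pairs via PCY with a single hash table (illustrative sketch)."""
    def bucket_of(pair):
        return hash(pair) % n_buckets

    # Pass 1: count items and hash every pair into a bucket
    item_counts = Counter()
    bucket_counts = [0] * n_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket_of(pair)] += 1

    frequent_items = {i for i, n in item_counts.items() if n >= support}
    # Compact the bucket counts into a bitmap of frequent buckets
    bitmap = [n >= support for n in bucket_counts]

    # Pass 2: count a pair only if both items and its bucket are frequent
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            if bitmap[bucket_of(pair)]:
                pair_counts[pair] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= support}


toy_baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "D"], ["A", "B", "D"]]
pcy_result = pcy_pairs(toy_baskets, support=3, n_buckets=7)
```

The bucket filter can only discard pairs that are certainly infrequent, so the final result should match an exact count; the number of buckets affects memory use and how many candidate pairs survive to the second pass.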

In [38]:
import itertools

for c in itertools.combinations(data["abstracts"][1:8], 2):
    print(c)

('Hate speech and profanity detection suffer from data sparsity, especially for languages other than English, due to the subjective nature of the tasks and the resulting annotation incompatibility of existing corpora. In this study, we identify profane subspaces in word and sentence representations and explore their generalization capability on a variety of similar and distant target tasks in a zero-shot setting. This is done monolingually (German) and cross-lingually to closely-related (English), distantly-related (French) and non-related (Arabic) tasks. We observe that, on both similar and distant target tasks and across all languages, the subspace-based representations transfer more effectively than standard BERT representations in the zero-shot setting, with improvements between F1 +10.9 and F1 +42.9 over the baselines across all tested monolingual and cross-lingual scenarios.', 'We introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was

In [41]:
itertools.combinations(data["abstracts"][1:8],2)

<itertools.combinations at 0x1cab0dc1450>

## 4.	Results

•	Print out relevant tables nicely, display well-annotated charts and explain if needed in plain English.
•	Use minimum code here, just output-functions’ calls.

## 5.	Conclusions

•	Summarize your findings here in 5...10 lines of text.
