# Project 1: Mining information from Text Data - Hamed Ahmadinia

This project explores and analyzes the information stored in a particular dataset, in this case the ACL Anthology dataset (https://aclanthology.org/). We will explore different techniques for obtaining valuable information.

## Task 1: Finding Similar Items

Randomly select 1000 abstracts from the whole dataset. Find the similar items using pairwise Jaccard similarities, MinHash and LSH (vectorized versions).

   1. Compare the performance in time and the results for *k*-shingles = 3, 5 and 10, for the three methods and similarity thresholds *s* = 0.9 and 0.95. Use 50 hashing functions. Comment on your results.

   2. Compare the results obtained for MinHash and LSH for different similarity thresholds *s* = 0.5, 0.9 and 0.95 and 50, 100 and 200 hashing functions. Comment on your results.

   3. For MinHashing using 100 hashing functions and *s* = 0.5 and 0.9, find the Jaccard distances (1 - Jaccard similarity) for all possible pairs. Use the obtained values within a k-NN algorithm, and for k = 1, 3 and 5 identify the clusters with similar abstracts for each *s*. Describe the obtained clusters: are they different? Randomly select at least 5 abstracts per cluster; upon visual inspection, what are the main topics?
 

In [1]:
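def jaccard_similarity(list1, list2):
    # Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets
    s1 = set(list1)
    s2 = set(list2)
    union = s1.union(s2)
    if not union:  # both inputs are empty: avoid a division by zero
        return 0.0
    return len(s1.intersection(s2)) / len(union)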

In [2]:
!pip install bibtexparser # We use this library to read the .bib file



In [3]:
import bibtexparser

# We unzipped the gzip archive and then used the .bib file on our local machine
with open('anthology+abstracts.bib') as bibtex_file:
    bib_database = bibtexparser.bparser.BibTexParser(common_strings=True).parse_file(bibtex_file) # Read bib file

In [4]:
import pandas as pd
import numpy as np
from random import shuffle
import time

data=pd.DataFrame(bib_database.entries) # Create pandas DataFrame from bib file
data.head()

Unnamed: 0,url,publisher,address,year,month,editor,title,ENTRYTYPE,ID,abstract,pages,doi,booktitle,author,volume,journal,language,number,isbn,note
0,https://aclanthology.org/2021.woah-1.0,Association for Computational Linguistics,Online,2021,August,"Mostafazadeh Davani, Aida and\nKiela, Douwe ...",Proceedings of the 5th Workshop on Online Abus...,proceedings,woah-2021-online,,,,,,,,,,,
1,https://aclanthology.org/2021.woah-1.1,Association for Computational Linguistics,Online,2021,August,,Exploiting Auxiliary Data for Offensive Langua...,inproceedings,singh-li-2021-exploiting,Offensive language detection (OLD) has receive...,1--5,10.18653/v1/2021.woah-1.1,Proceedings of the 5th Workshop on Online Abus...,"Singh, Sumer and\nLi, Sheng",,,,,,
2,https://aclanthology.org/2021.woah-1.2,Association for Computational Linguistics,Online,2021,August,,Modeling Profanity and Hate Speech in Social M...,inproceedings,hahn-etal-2021-modeling,Hate speech and profanity detection suffer fro...,6--16,10.18653/v1/2021.woah-1.2,Proceedings of the 5th Workshop on Online Abus...,"Hahn, Vanessa and\nRuiter, Dana and\nKleinba...",,,,,,
3,https://aclanthology.org/2021.woah-1.3,Association for Computational Linguistics,Online,2021,August,,{H}ate{BERT}: Retraining {BERT} for Abusive La...,inproceedings,caselli-etal-2021-hatebert,"We introduce HateBERT, a re-trained BERT model...",17--25,10.18653/v1/2021.woah-1.3,Proceedings of the 5th Workshop on Online Abus...,"Caselli, Tommaso and\nBasile, Valerio and\nM...",,,,,,
4,https://aclanthology.org/2021.woah-1.4,Association for Computational Linguistics,Online,2021,August,,Memes in the Wild: Assessing the Generalizabil...,inproceedings,kirk-etal-2021-memes,Hateful memes pose a unique challenge for curr...,26--35,10.18653/v1/2021.woah-1.4,Proceedings of the 5th Workshop on Online Abus...,"Kirk, Hannah and\nJun, Yennie and\nRauba, Pa...",,,,,,


In [5]:
# Select 1000 random abstracts from the ACL Anthology

abstract = [" ".join(x.lower().split()) for x in data['abstract'].dropna().to_numpy()] # lowercase and collapse extra spaces, tabs and newlines
abstract = ["".join([i for i in x if i.isalpha() or i==' ']) for x in abstract] # keep only letters and spaces (drops digits and characters such as {'\/$#%})
abstract = [x for x in abstract if len(x)>100] # keep only abstracts longer than 100 characters
shuffle(abstract) # randomize the order of the abstracts
abstract = abstract[:1000]

In [6]:
abstract[999]

'in largescale educational assessments the use of automated scoring has recently become quite common while the majority of student responses can be processed and scored without difficulty there are a small number of responses that have atypical characteristics that make it difficult for an automated scoring system to assign a correct score we describe a pipeline that detects and processes these kinds of responses at runtime we present the most frequent kinds of what are called nonscorable responses along with effective filtering models based on various nlp and speech processing technologies we give an overview of two operational automated scoring systems one for essay scoring and one for speech scoring and describe the filtering models they use finally we present an evaluation and analysis of filtering models used for spoken responses in an assessment of language proficiency'

In [7]:
def shingle(text, k):
    shingle_set = []
    for i in range(len(text) - k+1):
        shingle_set.append(text[i:i+k])
    return set(shingle_set)
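
To make the interplay between shingling and Jaccard similarity concrete, here is a small self-contained sketch. The two strings are made-up examples, and `char_shingles` and `jaccard` are hypothetical stand-ins that mirror `shingle` and `jaccard_similarity` above.

```python
def char_shingles(text, k):
    # All overlapping character k-grams of the text, as a set
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; two empty sets are treated as having similarity 0
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up example strings, not taken from the dataset
s1 = char_shingles("abcd", 2)   # {'ab', 'bc', 'cd'}
s2 = char_shingles("bcde", 2)   # {'bc', 'cd', 'de'}
print(jaccard(s1, s2))          # 2 shared shingles out of 4 distinct ones
```

Larger values of *k* make shared shingles rarer, which is why the same pair of texts usually scores lower with 10-shingles than with 3-shingles.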

In [8]:
print(shingle(abstract[0], 3))

{' in', 'nsi', ' we', 'hav', 'oft', 'wev', 'ura', ' di', 'ver', 'ose', 'r t', ' al', 'ive', 'hou', 'iza', 'ns ', 'bot', ' ca', 'e o', 'ngr', 'ubc', 'lic', 'tre', 'erb', 'rk ', ' re', ' fa', 'for', 'chr', ' us', 'r d', 'e i', 'cac', 'pti', 'int', ' ou', 'dit', 'e m', 'and', 'rem', 'ndo', 'omp', ' io', ' is', 'cce', 'che', ' en', 's t', 'the', 'ttp', ' fo', 'at ', 'que', 'x i', 'd n', 'k t', 'seq', 'wor', 'tte', 'y f', 'men', 'ark', 'hro', 'ely', 'gen', 'roc', 'res', 's e', 'uti', 'd u', ' ge', 'a s', 'ere', 'miz', 'r b', 'sts', ' ad', ' ht', ' im', 'ete', 'enc', 'ens', 'e t', 'gor', 'nil', 'to ', 'm f', ' to', 'sfo', 'ved', ' of', 'h a', 'lts', 'd o', 'cie', 'gua', 'lle', ' an', 'uto', ' un', 'e c', 'n n', 'cti', 'esu', 'cel', ' ar', 'pro', 'd i', 'ien', 'pel', 'oug', 't o', 'pee', 'sou', 'ono', ' eg', 'ous', 'm o', 'emo', 'ron', 'sof', 'mew', ' ho', ' ma', 'mer', 'zat', ' at', 'be ', 'ms ', 'nsf', 'hes', 'ilm', ' de', 'nch', 'ngu', 'eve', 'ten', 'o a', 'y l', 'sgi', 'hub', 'ame', 'wit'

In [9]:
def create_shingles(abstract, k):
    return [shingle(x, k) for x in abstract]

In [10]:
shingles = create_shingles(abstract, 3)

In [11]:
import itertools

def unionize(shingles): # union of all shingle sets, which gives the pool (vocabulary) of distinct shingles
    return list(set(itertools.chain(*shingles)))

In [12]:
pool = unionize(shingles)

In [13]:
def create_one_hot(shingles, pool):
    ab_1hot = []
    for ab in shingles:
        ab_1hot.append([1 if x in ab else 0 for x in pool])
    return ab_1hot

In [14]:
# For each abstract we create a one-hot encoding with the length of the pool.
# This gives a very sparse vector for each abstract.

ab_1hot = create_one_hot(shingles, pool)

In [15]:
print(ab_1hot[41])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 

In [16]:
def create_hash_func(size: int):
    # Create one hash function as a random permutation of 1..size
    hash_ex = list(range(1, size+1))
    shuffle(hash_ex)
    return hash_ex

def build_minhash_func(pool_size: int, nbits: int):
    # function for building multiple minhash vectors
    hashes = []
    for _ in range(nbits):
        hashes.append(create_hash_func(pool_size))
    return hashes

# we create 50 minhash vectors
minhash_func = build_minhash_func(len(pool), 50)

def create_hash(vector: list):
    # use this function for creating our signatures
    signature = []
    for func in minhash_func:
        for i in range(1, len(pool)+1):
            idx = func.index(i)
            signature_val = vector[idx]
            if signature_val == 1:
                signature.append(idx)
                break
    return signature

In [17]:
def create_signatures(ab_1hot):
    ab_sign = []
    for ab in ab_1hot:
        ab_sign.append(create_hash(ab))
    return ab_sign

In [18]:
%%time

# Create the signature of each abstract with the intuitive but very slow method.
# Every func.index(i) call scans the permutation, so as the pool grows the time grows up to quadratically.

ab_sign = create_signatures(ab_1hot) 

CPU times: user 45.6 s, sys: 220 ms, total: 45.9 s
Wall time: 49.7 s
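
The slow part of `create_hash` is the repeated `func.index(i)` lookup. The same signature can be obtained by looking only at the positions where the one-hot vector is 1 and keeping the one with the smallest rank in each permutation. The sketch below is a hedged illustration on a made-up vector: `fast_signature` and `slow_signature` are hypothetical names, and it assumes every document has at least one shingle.

```python
from random import Random

def fast_signature(one_hot, permutations):
    # Positions of the shingles that occur in this document
    ones = [idx for idx, bit in enumerate(one_hot) if bit == 1]
    # For each permutation, keep the present position with the smallest rank
    return [min(ones, key=perm.__getitem__) for perm in permutations]

def slow_signature(one_hot, permutations):
    # Same scan as `create_hash`, kept only to check that both variants agree
    signature = []
    for perm in permutations:
        for rank in range(1, len(perm) + 1):
            idx = perm.index(rank)
            if one_hot[idx] == 1:
                signature.append(idx)
                break
    return signature

rng = Random(0)
one_hot = [0, 1, 0, 0, 1, 1, 0, 1]          # made-up one-hot vector
permutations = []
for _ in range(5):
    perm = list(range(1, len(one_hot) + 1))
    rng.shuffle(perm)
    permutations.append(perm)

print(fast_signature(one_hot, permutations) == slow_signature(one_hot, permutations))
```

The fast variant touches each present shingle once per permutation, so its cost depends on the number of shingles in the document rather than on the square of the pool size.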


In [19]:
# This shows that our signatures work well: the similarity computed from them is close to the real Jaccard similarity
jaccard_similarity(shingles[10], shingles[100]), jaccard_similarity(ab_sign[10], ab_sign[100])

(0.23876404494382023, 0.22666666666666666)
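
The cell above compares the two signatures as sets. The usual MinHash estimate instead counts the fraction of signature positions on which the two signatures agree, because each hash function acts as one independent trial. A minimal sketch, where `signature_similarity` is a hypothetical helper name and the signatures are made-up values:

```python
def signature_similarity(sig_a, sig_b):
    # Fraction of hash functions for which both documents have the same minimum
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Made-up signatures of length 5
sig_a = [3, 7, 1, 9, 4]
sig_b = [3, 2, 1, 9, 7]

# Positions 0, 2 and 3 agree
print(signature_similarity(sig_a, sig_b))

# The set-based value ignores positions and therefore differs
set_a, set_b = set(sig_a), set(sig_b)
print(len(set_a & set_b) / len(set_a | set_b))
```

In this example the value 7 appears in both signatures but at different positions, so it counts towards the set-based value and not towards the position-wise estimate.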

In [20]:
def split_vector(signature, b):
    r = int(len(signature) / b)
    # split the signature into bands of r rows each (b bands when len(signature) is divisible by b)
    subvecs = []
    for i in range(0, len(signature), r):
        subvecs.append(signature[i : i+r])
    return subvecs

def split_all(ab_sign, b):
    return [split_vector(sign, b) for sign in ab_sign]

In [21]:
ab_split = split_all(ab_sign, 10) # split each signature vector into 10 bands, used later in LSH
ab_split[0]

[[6414, 2567, 3479, 4650, 3814],
 [306, 6853, 1990, 1359, 3620],
 [2324, 2579, 6097, 306, 669],
 [3241, 5980, 1531, 6046, 6414],
 [6537, 1419, 110, 5596, 4079],
 [343, 2539, 2716, 764, 2857],
 [340, 5039, 2522, 4667, 4539],
 [2373, 4268, 2189, 356, 6233],
 [1138, 1372, 5737, 5429, 6610],
 [2826, 6414, 5245, 878, 4699]]
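
With the signatures split into bands, candidate pairs can be collected by hashing each band into a bucket, so that only documents sharing a bucket are compared. The following is a minimal sketch under the assumption that a pair is a candidate when at least one whole band is identical. `candidate_pairs` is a hypothetical helper and the banded signatures are made up.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(banded_signatures):
    # banded_signatures[d][band] is the list of rows of document d in that band
    buckets = defaultdict(list)
    for doc_id, bands in enumerate(banded_signatures):
        for band_idx, rows in enumerate(bands):
            # The band index is part of the key so that different bands never mix
            buckets[(band_idx, tuple(rows))].append(doc_id)
    pairs = set()
    for docs in buckets.values():
        # Every pair of documents in the same bucket becomes a candidate
        pairs.update(combinations(docs, 2))
    return pairs

banded = [
    [[1, 2], [3, 4]],   # document 0
    [[1, 2], [9, 9]],   # document 1: same first band as document 0
    [[5, 6], [7, 8]],   # document 2: shares no band with the others
]
print(candidate_pairs(banded))
```

Because documents are grouped through a dictionary, this approach does not need to loop over all pairs of documents, which is where LSH gets its speed advantage.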

In [33]:
def create_jac_score(shingles):
    jacc_score = []
    for i in range(len(shingles)):
        for j in range(i, len(shingles)):
            jacc_score.append(list([i, j, jaccard_similarity(shingles[i], shingles[j])]))
    return pd.DataFrame(jacc_score, columns=["abstract_one", "abstract_two", "similarity"])

In [34]:
def create_minihash_score(ab_sign):
    minihash_score = []
    for i in range(len(ab_sign)):
        for j in range(i, len(ab_sign)):
            minihash_score.append(list([i, j, jaccard_similarity(ab_sign[i], ab_sign[j])]))
    return pd.DataFrame(minihash_score, columns=["abstract_one", "abstract_two", "similarity"])

In [35]:
def create_lsh_score(ab_splits):
    lsh_score = []
    for i in range(len(ab_splits)):
        for j in range(i, len(ab_splits)):
            for i_rows, j_rows in zip(ab_splits[i], ab_splits[j]):
                if i_rows == j_rows:
                    lsh_score.append(list([i,j,jaccard_similarity(ab_sign[i], ab_sign[j])]))
                    # we only need one band to match
                    break
    return pd.DataFrame(lsh_score, columns=["abstract_one", "abstract_two", "similarity"])

In [36]:
%%time

jac_score = create_jac_score(shingles)
jac_score.loc[(jac_score.similarity>.5)&(jac_score.similarity<1),:]

CPU times: user 37.4 s, sys: 131 ms, total: 37.6 s
Wall time: 38 s


Unnamed: 0,abstract_one,abstract_two,similarity


In [37]:
%%time

minihash_score = create_minihash_score(ab_sign)
minihash_score.loc[(minihash_score.similarity>.5)&(minihash_score.similarity<1),:]

CPU times: user 5.24 s, sys: 16.1 ms, total: 5.26 s
Wall time: 5.39 s


Unnamed: 0,abstract_one,abstract_two,similarity


In [38]:
%%time

lsh_score = create_lsh_score(ab_split)
lsh_score.loc[(lsh_score.similarity>.5)&(lsh_score.similarity<1),:]

CPU times: user 402 ms, sys: 7.84 ms, total: 410 ms
Wall time: 445 ms


Unnamed: 0,abstract_one,abstract_two,similarity


We have now created the basic building blocks for the different methods.
Now it is time to measure performance and results.


The MinHash implementation above uses a simple method that struggles with a large number of abstracts and shingles.


First, we need a faster and more efficient implementation of the MinHash algorithm.

In [39]:
from random import randint
from binascii import crc32

def minh(s, N, prime=4294967311):
    max_val = (2**32)-1
    perms = [(randint(0,max_val), randint(0,max_val)) for i in range(N)]
    vec = []

    for n in range(N):
        a, b = perms[n][0], perms[n][1]
        
        vec_ = [float('inf') for i in range(len(s))]
        for abstract_idx in range(len(s)):
            for i in s[abstract_idx]:
                ha = crc32(i.encode('utf-8')) 
                output = (a * ha + b) % prime
                if vec_[abstract_idx] > output:
                    vec_[abstract_idx] = output
        vec.append(vec_)
    return vec
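
Since the task asks for vectorized versions, the inner loops of `minh` can also be pushed into NumPy. The sketch below is one possible variant, not the implementation used for the timings in this notebook. `minh_vectorized` and its `seed` parameter are hypothetical, and it assumes the same CRC32-based hash family `(a * h + b) % prime` and non-empty shingle sets.

```python
from binascii import crc32
import numpy as np

def minh_vectorized(shingle_sets, n_hashes, prime=4294967311, seed=0):
    rng = np.random.default_rng(seed)
    # One (a, b) pair per hash function; all values stay below 2**32
    a = rng.integers(1, 2**32, size=(n_hashes, 1), dtype=np.uint64)
    b = rng.integers(0, 2**32, size=(n_hashes, 1), dtype=np.uint64)
    p = np.uint64(prime)
    signatures = np.empty((n_hashes, len(shingle_sets)), dtype=np.uint64)
    for doc_idx, shingles in enumerate(shingle_sets):
        # The CRC32 of every shingle is computed once and reused by all hash functions
        h = np.array([crc32(s.encode("utf-8")) for s in shingles], dtype=np.uint64)
        # a * h + b stays below 2**64, so the uint64 arithmetic does not overflow
        signatures[:, doc_idx] = ((a * h + b) % p).min(axis=1)
    return signatures

# Made-up shingle sets: the first two are identical, the third is different
docs = [{"abc", "bcd"}, {"abc", "bcd"}, {"xyz"}]
sig = minh_vectorized(docs, n_hashes=20)
print(sig.shape)
print((sig[:, 0] == sig[:, 1]).mean())
```

The result has one row per hash function and one column per document, the same layout as the list returned by `minh`.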

###### 2- Compare the results obtained for MinHash and LSH for different similarity thresholds s = 0.5, 0.9 and 0.95 and 50, 100 and 200 hashing functions. Comment your results.

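We implement MinHash and LSH and try 50, 100 and 200 hash functions, using 3-, 5- and 10-shingles. From the results we can say that:

- Pairwise Jaccard is precise but slow.
    Larger shingle sets make each comparison more expensive.
    The number of comparisons grows as O(n^2) in the number of abstracts n.

- MinHash is much faster than Jaccard; its precision is not as good as that of Jaccard, although it is still very reasonable.

- LSH solves the precision problem of MinHash, because candidate pairs are re-checked with the exact Jaccard similarity. It is also meant to be faster than MinHash, because only candidate pairs need to be compared. Our implementation still loops over all pairs, so the timings below do not show this speed-up.

- As the number of shingles increases, the runtime increases much faster, so LSH is needed instead of pairwise Jaccard; beyond about 10 million of them it is almost impossible to use Jaccard.

- As the number of hash functions increases, the LSH results become more precise.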
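
The effect of the number of hash functions on LSH can be made explicit with the standard banding formula: with `b` bands of `r` rows each, a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s^r)^b`. This assumes that a pair is a candidate only when all rows of at least one band match. The band and row splits below are illustrative choices, not the ones used in the experiments, and `candidate_probability` is a hypothetical helper.

```python
def candidate_probability(s, bands, rows):
    # Probability that at least one of `bands` bands matches on all `rows` rows
    return 1 - (1 - s ** rows) ** bands

# Illustrative splits of 50, 100 and 200 hash functions into bands of 5 rows
for n_hashes in (50, 100, 200):
    bands, rows = n_hashes // 5, 5
    probs = [round(candidate_probability(s, bands, rows), 3) for s in (0.5, 0.9, 0.95)]
    print(n_hashes, bands, rows, probs)
```

With the number of rows per band fixed, adding hash functions adds bands, which raises the chance that a truly similar pair is detected at the cost of more candidates to verify.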

In [40]:
def JACCARD(k_shingle=3, jac_threshold=.5):
    t0 = time.time()
    print(f'start jaccard with {k_shingle}_shingle ===================')
    shingles = create_shingles(abstract, k_shingle)
        
    for i in range(len(shingles)):
        for j in range(i,len(shingles)):
            jac = jaccard_similarity(shingles[i], shingles[j])
            if (jac >jac_threshold) & (jac<1):
                print(i, j, jac)
    print("finish jaccard in: ", time.time() - t0)
    print()

In [41]:
def MINHASH(N=50, k_shingle=3, jac_threshold=.5):
    t0 = time.time()

    print(f'start minHash with {N} hash function and {k_shingle}_shingle ===================')
    shingles = create_shingles(abstract, k_shingle)
    minhash_func = pd.DataFrame(minh(shingles, N))
    print("create sinatures: ", time.time() - t0)
    
    l=len(shingles)
    a=[]
    for i in range(l):
        a.append(minhash_func.iloc[:,i])
    for i in range(l):
        for j in range(i+1,l):
            jac = jaccard_similarity(a[i], a[j])
            if (jac >jac_threshold) & (jac<1):
                print(i, j, jac)
    print("finish minhash in: ", time.time() - t0)
    print()

In [42]:
def LSH(N=50, k_shingle=3, jac_threshold=.5, splits=4):
    t0 = time.time()
    print(f'start LSH with {N} hash function and {k_shingle}_shingle ===================')
    shingles = create_shingles(abstract, k_shingle)
    m = pd.DataFrame(minh(shingles, N))
    print("create sinatures: ", time.time() - t0)
    t0 = time.time()
    b=[]
    l=len(shingles)
    for i in range(l):
        b.append(split_vector(m.iloc[:,i],splits))

    print("split complete: ", time.time() - t0)
    
    for i in range(l):
        for j in range(i+1,l):
            # A pair is a candidate as soon as one row agrees in one of the bands.
            # (Standard LSH banding would require a whole band to agree.)
            # Using any() avoids a flag that could leak into the next pair and skip it.
            is_candidate = any(
                i_rows == j_rows
                for idx in range(splits)
                for i_rows, j_rows in zip(b[i][idx], b[j][idx])
            )
            if is_candidate:
                jac = jaccard_similarity(shingles[i], shingles[j])
                if (jac >jac_threshold) & (jac<1):
                    print(i, j, jac)
    print("finish lsh in: ", time.time() - t0)
    print()

In [51]:
JACCARD(k_shingle=3, jac_threshold=.5)

finish jaccard in:  40.837918519973755



In [44]:
MINHASH(N=50, k_shingle=3, jac_threshold=.5)

create sinatures:  9.123526096343994
finish minhash in:  17.518795251846313



In [46]:
LSH(N=50, k_shingle=3, jac_threshold=.4, splits=10)

create sinatures:  8.668784856796265
split complete:  0.6250503063201904
106 962 0.42191780821917807
123 503 0.41233373639661425
finish lsh in:  47.83500623703003



###### 1- Compare the performance in time and the results for k-shingles = 3, 5 and 10, for the three methods and similarity thresholds s=0.9 and 0.95. Use 50 hashing functions. Comment your results.

In [47]:
JACCARD(k_shingle=3, jac_threshold=.5)
JACCARD(k_shingle=5, jac_threshold=.5)
JACCARD(k_shingle=10, jac_threshold=.5)

finish jaccard in:  70.77935671806335

finish jaccard in:  72.55189490318298

finish jaccard in:  64.1764976978302



In [48]:
MINHASH(N=50, k_shingle=3, jac_threshold=.5)
MINHASH(N=50, k_shingle=5, jac_threshold=.5)
MINHASH(N=50, k_shingle=10, jac_threshold=.5)

create sinatures:  10.616516828536987
finish minhash in:  20.56183123588562

create sinatures:  15.003560066223145
finish minhash in:  24.707454919815063

create sinatures:  19.519854068756104
finish minhash in:  29.48997950553894



In [49]:
LSH(N=50, k_shingle=3, jac_threshold=.5, splits=4)
LSH(N=50, k_shingle=5, jac_threshold=.5, splits=4)
LSH(N=50, k_shingle=10, jac_threshold=.5, splits=4)

create sinatures:  9.783883094787598
split complete:  0.412304162979126
finish lsh in:  49.506980657577515

create sinatures:  16.02803111076355
split complete:  0.37647557258605957
finish lsh in:  59.94094800949097

create sinatures:  18.245524406433105
split complete:  0.44815897941589355
finish lsh in:  28.59602642059326



In [52]:
MINHASH(N=100, k_shingle=3, jac_threshold=.5)
MINHASH(N=100, k_shingle=5, jac_threshold=.5)
MINHASH(N=100, k_shingle=10, jac_threshold=.5)
MINHASH(N=200, k_shingle=3, jac_threshold=.5)
MINHASH(N=200, k_shingle=5, jac_threshold=.5)
MINHASH(N=200, k_shingle=10, jac_threshold=.5)

create sinatures:  20.665815830230713
finish minhash in:  40.56891751289368

create sinatures:  30.59933853149414
finish minhash in:  48.16977787017822

create sinatures:  33.732871294021606
finish minhash in:  52.467140674591064

create sinatures:  44.410902976989746
finish minhash in:  80.19050359725952

create sinatures:  63.09355640411377
finish minhash in:  99.13037848472595

create sinatures:  74.03358745574951
finish minhash in:  109.07237935066223



In [55]:
LSH(N=100, k_shingle=3, jac_threshold=.5, splits=4)
LSH(N=100, k_shingle=5, jac_threshold=.5, splits=4)
LSH(N=100, k_shingle=10, jac_threshold=.5, splits=4)
LSH(N=200, k_shingle=3, jac_threshold=.5, splits=4)
LSH(N=200, k_shingle=5, jac_threshold=.5, splits=4)
LSH(N=200, k_shingle=10, jac_threshold=.5, splits=4)

create sinatures:  20.648046255111694
split complete:  0.47182416915893555
finish lsh in:  47.13585376739502

create sinatures:  31.871016263961792
split complete:  0.4316837787628174
finish lsh in:  58.93512797355652

create sinatures:  38.67946195602417
split complete:  0.5153203010559082
finish lsh in:  45.597782611846924

create sinatures:  42.12301683425903
split complete:  0.22830581665039062
finish lsh in:  48.016915798187256

create sinatures:  62.0220890045166
split complete:  0.4754030704498291
finish lsh in:  65.30190920829773

create sinatures:  74.98971700668335
split complete:  0.6402044296264648
finish lsh in:  50.79961156845093



###### 3- For MinHashing using 100 hashing functions and s = 0.5 and 0.9, find the Jaccard distances (1-Jaccard similarity) for all possible pairs. Use the obtained values within a k-NN algorithm, and for k=1,3 and, 5 identify the clusters with similar abstracts for each s

In [56]:
t0 = time.time()

shingles = create_shingles(abstract, 3)
minhash_func = pd.DataFrame(minh(shingles, 100))
print("create sinatures: ", time.time() - t0)

l=len(shingles)
a=[]
jac=[]
for i in range(l):
    a.append(minhash_func.iloc[:,i])
for i in range(l):
    for j in range(i+1,l):
        jac.append([i, j, jaccard_similarity(a[i], a[j])])

print("finish minhash in: ", time.time() - t0)
print()

create sinatures:  18.717809438705444
finish minhash in:  38.61874604225159



In [57]:
result = pd.DataFrame(jac, columns=['abstract_one', 'abstract_two', 'jaccard-distance'])
result['jaccard-distance'] = 1 - result['jaccard-distance']
result.head()

Unnamed: 0,abstract_one,abstract_two,jaccard-distance
0,0,1,0.843931
1,0,2,0.857143
2,0,3,0.863636
3,0,4,0.876404
4,0,5,0.93617


In [58]:
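from sklearn.cluster import KMeans

# Note: KMeans performs k-means clustering (here with 3 clusters) rather than k-NN.
# It is fitted on the one-dimensional distance of each pair, so every label belongs to a pair of abstracts.
knn = KMeans(n_clusters=3).fit(result[['jaccard-distance']])
knn.labels_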

array([2, 0, 0, ..., 2, 2, 0], dtype=int32)

In [59]:
knn = KMeans(n_clusters=5).fit(result[['jaccard-distance']])
knn.labels_

array([1, 1, 0, ..., 1, 3, 0], dtype=int32)

In [60]:
result['knn'] = knn.labels_
result.sort_values(['jaccard-distance'])

Unnamed: 0,abstract_one,abstract_two,jaccard-distance,knn
248411,290,897,0.657718,3
128804,138,534,0.666667,3
125269,134,449,0.675497,3
45841,46,969,0.675497,3
45824,46,952,0.675497,3
...,...,...,...,...
422788,607,924,1.000000,4
347899,448,924,1.000000,4
323123,405,744,1.000000,4
214789,244,924,1.000000,4
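
The task statement asks for k-NN, whereas the cells above use `KMeans`. For comparison, a nearest-neighbour lookup on Jaccard distances can be sketched as follows. This is a minimal illustration on made-up sets; `jaccard_distance` and `k_nearest` are hypothetical helpers, and applying the idea to the MinHash signatures would only require swapping the distance function.

```python
def jaccard_distance(a, b):
    # 1 - Jaccard similarity; two empty sets are treated as maximally distant
    union = a | b
    return 1 - len(a & b) / len(union) if union else 1.0

def k_nearest(items, k):
    # For every item, the indices of its k closest other items by Jaccard distance
    neighbours = []
    for i, a in enumerate(items):
        dists = [(jaccard_distance(a, b), j) for j, b in enumerate(items) if j != i]
        neighbours.append([j for _, j in sorted(dists)[:k]])
    return neighbours

# Made-up shingle sets: items 0 and 1 overlap, items 2 and 3 overlap
items = [{"ab", "bc", "cd"}, {"ab", "bc", "ce"}, {"xy", "yz"}, {"xy", "yz", "zz"}]
print(k_nearest(items, k=1))
```

Abstracts that list each other as nearest neighbours can then be grouped and inspected, which gives clusters of abstracts rather than clusters of pairs.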


In [61]:
print(abstract[436])

lobjectif est letude des causes des disperiodicites des voix du type  qui sont pseudoperiodiques et monophoniques un modele qui explique quantitativement les perturbations des durees de cycles glottiques fait appel aux fluctuations de la tension du muscle vocal or ces fluctuations nexpliquent pas lenrouement qui peut faire suite a une charge vocale ou une laryngite legere par exemple cest pourquoi nous discutons plusieurs modeles qui montrent quune redistribution des amplitudes vibratoires entre le corps et la couverture du pli module les perturbations qui trouvent leur origine au niveau du muscle vocal des simulations a laide dun modele corpscouverture suggerent ainsi que les perturbations des durees des cycles glottiques augmentent avec une redistribution des amplitudes vibratoires de la couverture vers le muscle suite a une redistribution des masses vibrantes du muscle vers la couverture


In [62]:
print(abstract[624])

expressions with an aspectual variant of a light verb eg take on debt vs have debt are frequent in texts but often difficult to classify between verbal idioms light verb constructions or compositional phrases we investigate the properties of such expressions with a disputed membership and propose a selection of features that determine more satisfactory boundaries between the three categories in this zone assigning the expressions to one of them


In [63]:
print(abstract[814])

we introduce a novel method for multilingual transfer that utilizes deep contextual embeddings pretrained in an unsupervised fashion while contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts aligning them poses a challenge due to their dynamic nature to this end we construct contextindependent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the contextdependent spaces this mapping readily supports processing of a target language improving transfer by contextaware embeddings our experimental results demonstrate the effectiveness of this approach for zeroshot and fewshot learning of dependency parsing specifically our method consistently outperforms the previous stateoftheart on  tested languages yielding an improvement of  las points on average


In [64]:
print(abstract[410])

nowadays spoken dialogue agents such as communication robots and smart speakers listen to narratives of humans in order for such an agent to be recognized as a listener of narratives and convey the attitude of attentive listening it is necessary to generate responsive utterances moreover responsive utterances can express empathy to narratives and showing an appropriate degree of empathy to narratives is significant for enhancing speakers motivation the degree of empathy shown by responsive utterances is thought to depend on their type however the relation between responsive utterances and degrees of the empathy has not been explored yet this paper describes the classification of responsive utterances based on the degree of empathy in order to explain that relation in this research responsive utterances are classified into five levels based on the effect of utterances and literature on attentive listening quantitative evaluations using  responsive utterances showed the appropriateness o

In [65]:
print(abstract[769])

hyponymy is the cornerstone of taxonomies and concept hierarchies however the extraction of hypernymhyponym pairs from a corpus can be timeconsuming and reconstructing the hierarchical network of a domain is often an extremely complex process this paper presents the development and evaluation of the french ecolexicon semantic sketch grammar essgfr a french hyponymic sketch grammar for sketch engine based on knowledge patterns it offers a userfriendly way of extracting hyponymic pairs in the form of word sketches in any userowned corpus the essgfr contains three times more hyponymic patterns than its english counterpart and has been tested in a multidisciplinary corpus it is thus expected to be domainindependent moreover the following methodological innovations have been included in its development  use of english hyponymic patterns in a parallel corpus to find new french patterns  automatic inclusion of the results of the sketch engine thesaurus to find new variants of the patterns as 