# Hierarchical Cluster Labeling of Software Requirements using Contextual Word Embeddings

The popularization of social media has motivated research into machine learning methods for extracting software requirements from user comments and reviews. These methods analyze software review datasets to classify textual excerpts as software requirements, which allows software quality to be managed and monitored as it evolves, directly from the perspective of crowd users. However, existing methods have two major limitations. First, several duplicate requirements are extracted from reviews because users write the same requirement in different ways, using synonyms and non-technical language, often with misspellings and ambiguity. Second, requirements extraction methods do not deal with different granularity levels, thereby ignoring hierarchical relationships between software requirements. This paper presents a hierarchical cluster labeling approach for software requirements based on contextual word embeddings to address these challenges. We explore neural language models to obtain a semantically richer and more robust representation of the software requirements, in which each text is represented by an embedding vector that takes into account the context in which it occurs in reviews. Our approach organizes the software requirements into clusters and sub-clusters according to requirement similarities in the embedding space. Finally, we select representative software requirements to label each cluster and sub-cluster, thereby handling both duplicate requirements and different granularity levels. An experimental evaluation using review datasets from 8 mobile apps shows that our approach obtains promising results and presents new ideas and research directions for data-driven requirements engineering.

**`This notebook contains the main source code of the hierarchical cluster labeling method submitted to SBES 2021 (35th Brazilian Symposium on Software Engineering), Insightful Ideas and Emerging Results Track.`**

**`The code was anonymized due to the double-blind review process.`**

# Loading the dataset

The dataset contains reviews of 8 apps with software requirements extracted and validated by human experts.

Note that software requirements extracted from reviews should not be interpreted as a requirements document from a conventional software engineering process. Our work is in the context of data-driven requirements engineering.

In [1]:
!git clone https://github.com/jsdabrowski/CAiSE-20

Cloning into 'CAiSE-20'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 25 (delta 7), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (25/25), done.


In [2]:
import pandas as pd

df_data = pd.read_excel('CAiSE-20/Ground_truth.xlsx')
df_data

Unnamed: 0,App id,Review id,Sentence id,Sentence content,Feature (Positive),Feature (Neutral),Feature (Negative),Feature (All Annotated)
0,com.zentertain.photoeditor,gp:AOqpTOEW40L9WXqCjzq04bqaZImgMdzlczxIF3_ibs8...,1,May be i can check,,,,
1,com.zentertain.photoeditor,gp:AOqpTOF57AQPvmnCiWYurwLY-F2-mej25ON8RAFk-Ls...,1,It make me happy,,,,
2,com.zentertain.photoeditor,gp:AOqpTOHYdmt72q4tSD8TZ8A5fZQqGivlBkIbWuHuJMZ...,1,I have a normal phone and it made 1 of my pics...,,pics,,pics
3,com.zentertain.photoeditor,gp:AOqpTOFYnXMShrDJPS0jpM04pFQxYOJN1LDuX3lSNm0...,1,Love it so much,,,,
4,com.zentertain.photoeditor,gp:AOqpTOF_JO496wnThQ2kcYlPct_g1GhOmQyyVvHp4VV...,1,Cant get to install,,install,,install
...,...,...,...,...,...,...,...,...
2057,com.spotify.music,gp:AOqpTOE-vqukBoo4GbnnTJnBesSlsJR9w2yGydeMlIK...,1,Every time I go on I can't browse or do anythi...,,,,
2058,com.spotify.music,gp:AOqpTOE-vqukBoo4GbnnTJnBesSlsJR9w2yGydeMlIK...,2,I am fully connected to wifi and offline mode ...,,offline mode,,offline mode
2059,com.spotify.music,gp:AOqpTOE-vqukBoo4GbnnTJnBesSlsJR9w2yGydeMlIK...,3,"Loved this app when it worked, can't use at al...",,,,
2060,com.spotify.music,gp:AOqpTOE-vqukBoo4GbnnTJnBesSlsJR9w2yGydeMlIK...,4,Will cancel if not fixed very soon...,,,,
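The `Feature (All Annotated)` column can hold several requirements in one cell, separated by semicolons (for example `take notes;share;edit`), and is empty for sentences without annotated requirements. The following is a minimal sketch of how such a cell could be split into individual requirement strings. The helper name `split_requirements` is hypothetical and not part of the original notebook, and dropping empty fragments is an added assumption.

```python
def split_requirements(cell):
    """Split one 'Feature (All Annotated)' cell into requirement strings.

    Hypothetical helper for illustration. Empty cells are read by pandas as
    NaN (a float), so any value that is not a string yields an empty list.
    """
    if not isinstance(cell, str):
        return []
    # Assumption: fragments that are empty after stripping are discarded.
    return [part.strip() for part in cell.split(';') if part.strip()]


print(split_requirements('take notes;share;edit'))
print(split_requirements(float('nan')))
```

Checking the type rather than catching an exception keeps unrelated errors visible while still skipping unannotated sentences.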


# Loading BERT-based models for text preprocessing

In [3]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/3b/fd/8a81047bbd9fa134a3f27e12937d2a487bd49d353a038916a5d7ed4e5543/sentence-transformers-2.0.0.tar.gz (85kB)
Collecting transformers<5.0.0,>=4.6.0
  Downloading https://files.pythonhosted.org/packages/fd/1a/41c644c963249fd7f3836d926afa1e3f1cc234a1c40d80c5f03ad8f6f1b2/transformers-4.8.2-py3-none-any.whl (2.5MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/ac/aa/1437691b0c7c83086ebb79ce2da16e00bef024f24fec2a5161c35476f499/sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2MB)
Collecting huggingface-hub
  Downloading https://files.pythonhosted.org/packages/35/03/071adc023c0a7e540cf4652fa9cad13ab32e6ae469bf0cc0262045244812/huggingface_hub

# Loading the embedding model

In [4]:
from sentence_transformers import SentenceTransformer
language_model = SentenceTransformer('distiluse-base-multilingual-cased-v1')


## Check available apps

In [5]:
pd.DataFrame(df_data['App id'].unique())

Unnamed: 0,0
0,com.zentertain.photoeditor
1,B004LOMB2Q
2,B004SIIBGU
3,com.whatsapp
4,B005ZXWMUS
5,com.twitter.android
6,B0094BB4TW
7,com.spotify.music


## Selecting an app for hierarchical cluster labeling of software requirements

In [6]:
APP_NAME = 'B004LOMB2Q'  # one of the app ids listed above
RUN = 3  # used as random_state for MiniBatchKMeans

## Generating embeddings

In [7]:
sentence_embedding_cache = {}
requirement_embedding_cache = {}
L_sentence = []
L_requirement = []
for index,row in df_data.iterrows():
  L_sentence.append(row['Sentence content'])
  annotated = row['Feature (All Annotated)']
  # Sentences without annotated requirements hold NaN (a float); skip them.
  if isinstance(annotated, str):
    for r in annotated.split(';'):
      L_requirement.append(r.strip())

# Encode all sentences and all requirement phrases once.
L_sentence_embedding = list(language_model.encode(L_sentence))
L_requirement_embedding = list(language_model.encode(L_requirement))



In [8]:
# Map each text to its embedding so later cells can look vectors up by string.
sentence_embedding_cache.update(zip(L_sentence, L_sentence_embedding))
requirement_embedding_cache.update(zip(L_requirement, L_requirement_embedding))

In [9]:
# Copy the slice so that adding a column does not raise SettingWithCopyWarning.
df_app = df_data[df_data['App id']==APP_NAME].copy()
df_app['sentence_embedding'] = [sentence_embedding_cache[s] for s in df_app['Sentence content']]
df_app

Unnamed: 0,App id,Review id,Sentence id,Sentence content,Feature (Positive),Feature (Neutral),Feature (Negative),Feature (All Annotated),sentence_embedding
154,B004LOMB2Q,A3NOCGVOLK1IPO,1,I've used Evernote for quite awhile to take no...,,take notes;share;edit,,take notes;share;edit,"[0.036477115, 0.02155222, -0.069200546, -0.023..."
155,B004LOMB2Q,A3NOCGVOLK1IPO,2,"Not as nice as Microsoft OneNote, but at least...",,,,,"[-0.04071297, -0.034613997, 0.046789404, -0.04..."
156,B004LOMB2Q,A3NOCGVOLK1IPO,3,Nice to see that it runs on the Kindle Fire HD.,,,,,"[0.027171697, 0.04388646, 0.028232776, 0.01278..."
157,B004LOMB2Q,ALYCIMJ6ZVBA5,1,Love that I can sync to my other devices.,sync to my other devices,,,sync to my other devices,"[0.048731066, -0.03787678, -0.052123584, -0.04..."
158,B004LOMB2Q,ALYCIMJ6ZVBA5,2,It has been a big help in keeping me organized...,,keeping me organized with my lists,,keeping me organized with my lists,"[0.004379682, 0.07013778, 0.017054016, 0.05085..."
...,...,...,...,...,...,...,...,...,...
516,B004LOMB2Q,A3AJ6LK7A5ZZ9R,4,I like the way you can organize into notebooks.,organize into notebooks,,,organize into notebooks,"[0.030230064, -0.025762867, 0.047045365, 0.014..."
517,B004LOMB2Q,A3AJ6LK7A5ZZ9R,5,everything is very useful.,,,,,"[0.099809624, 0.023971448, -0.0021783714, -0.0..."
518,B004LOMB2Q,A3AJ6LK7A5ZZ9R,6,I like it a lot.,,,,,"[0.08920752, -0.07764692, -0.013945812, -0.052..."
519,B004LOMB2Q,AJCBQCE0DPTGX,1,great app for taking notes across all platforms.,taking notes,,,taking notes,"[0.016192125, -0.04975882, -0.035583932, -0.04..."


# Software Requirements Clustering

In [10]:
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_samples
import numpy as np
import math

def get_kmeans(df_):
  """Split the requirements of one cluster into sub-clusters with mini-batch k-means."""

  clustering = {}
  df_temp = df_.copy()
  X = np.array(df_temp['requirement_embedding'].to_list())
  # Heuristic: k = floor(sqrt(n)), capped at 30 clusters.
  num_clusters = min(int(math.sqrt(len(df_temp))), 30)
  if num_clusters < 2:
    # The caller unpacks two return values, so fail loudly instead of returning None.
    raise ValueError('Dataset too small to cluster: %d requirements' % len(df_temp))

  cluster_parent = df_temp['cluster'].unique()[0]
  print('Running cluster from ',cluster_parent,' total=',len(X),' num_clusters=',num_clusters)
  kmeans = MiniBatchKMeans(n_clusters=num_clusters, random_state=RUN, init='random',verbose=0,batch_size=1000,max_no_improvement=1000).fit(X)
  df_temp['cluster_parent'] = cluster_parent
  # Child ids are derived from the parent id: 10 * parent + k-means label.
  df_temp['cluster'] = np.array(kmeans.labels_)+(10*cluster_parent)
  print('Cluster parent',df_temp['cluster_parent'].unique())
  print('Cluster ',df_temp['cluster'].unique())
  clustering[cluster_parent] = [df_temp,kmeans]

  return df_temp, clustering
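The two numeric conventions in `get_kmeans` can be checked in isolation: the number of clusters is the integer part of the square root of the number of requirements (at most 30), and each child cluster id is `10 * parent + label`. The sketch below uses two hypothetical helper names, `choose_num_clusters` and `child_cluster_id`, that do not appear in the notebook.

```python
import math

MAX_CLUSTERS = 30  # cap applied in get_kmeans


def choose_num_clusters(n_items):
    """Hypothetical helper: floor(sqrt(n)), capped at MAX_CLUSTERS."""
    return min(int(math.sqrt(n_items)), MAX_CLUSTERS)


def child_cluster_id(parent_id, kmeans_label):
    """Hypothetical helper: id scheme used in get_kmeans."""
    return 10 * parent_id + kmeans_label


# Cluster sizes taken from the run log of the recursive clustering cell.
for n in (259, 41, 29, 23):
    print(n, '->', choose_num_clusters(n))

# Caveat: if a parent ever receives more than 10 children, ids from
# neighbouring parents can coincide, since 10*12 + 10 == 10*13 + 0.
print(child_cluster_id(12, 10), child_cluster_id(13, 0))
```

In the run shown in this notebook no sub-cluster has more than 6 children, so the id scheme stays unambiguous there; the caveat would only matter for considerably larger clusters.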

In [11]:
requirements = set()
df_temp = df_data[df_data['App id']==APP_NAME]
for index,row in df_temp[['Feature (All Annotated)']].dropna().iterrows():
    for r in row['Feature (All Annotated)'].split(';'):
      requirements.add(r.strip())

df_requirements = pd.DataFrame(requirements)
df_requirements.columns =  ['requirement']
L = []
for r in requirements:
    L.append(requirement_embedding_cache[r])
df_requirements['requirement_embedding'] = L
df_requirements

Unnamed: 0,requirement,requirement_embedding
0,updating,"[-0.042649217, 0.035711575, -0.041541334, 0.05..."
1,to do lists,"[0.013045504, 0.023163002, -0.030219816, 0.048..."
2,create short notes,"[-0.010133171, -0.031141514, -0.07525056, 0.06..."
3,grocery,"[0.02432415, 0.0022237618, -0.032460213, -0.03..."
4,UI,"[-0.022581171, 0.043188088, -0.020339454, 0.00..."
...,...,...
254,formatting options,"[-0.015440987, 0.07356892, -0.0076078833, 0.05..."
255,email myself links,"[-0.025898518, 0.027636474, -0.008167994, 0.02..."
256,note pad,"[-0.0029527291, 0.015027506, -0.082566604, 0.0..."
257,to-do lists,"[0.019597491, 0.0115555385, -0.029689817, 0.04..."


In [12]:
stack = []
df_temp = df_requirements.copy()
df_temp['cluster_parent'] = 0
df_temp['cluster'] = 1
stack.append(df_temp)

hc = []
clustering = []

while(len(stack)!=0):
  df_temp = stack.pop()
  if len(df_temp) <= 20: continue
  df_temp,clust = get_kmeans(df_temp)
  clustering.append(clust)
  hc.append(df_temp)
  for cluster_ in df_temp.cluster.unique():
    stack.append(df_temp[df_temp.cluster==cluster_])

Running cluster from  1  total= 259  num_clusters= 16
Cluster parent [1]
Cluster  [21 11 25 18 23 22 14 12 17 15 19 16 13 10 24 20]
Running cluster from  15  total= 23  num_clusters= 4
Cluster parent [15]
Cluster  [153 152 151 150]
Running cluster from  12  total= 29  num_clusters= 5
Cluster parent [12]
Cluster  [122 121 120 123 124]
Running cluster from  18  total= 23  num_clusters= 4
Cluster parent [18]
Cluster  [180 183 182 181]
Running cluster from  25  total= 41  num_clusters= 6
Cluster parent [25]
Cluster  [253 251 250 254 252 255]
Running cluster from  21  total= 23  num_clusters= 4
Cluster parent [21]
Cluster  [211 212 213 210]


In [13]:
df_hc = pd.concat(hc)
df_hc

Unnamed: 0,requirement,requirement_embedding,cluster_parent,cluster
0,updating,"[-0.042649217, 0.035711575, -0.041541334, 0.05...",1,21
1,to do lists,"[0.013045504, 0.023163002, -0.030219816, 0.048...",1,11
2,create short notes,"[-0.010133171, -0.031141514, -0.07525056, 0.06...",1,25
3,grocery,"[0.02432415, 0.0022237618, -0.032460213, -0.03...",1,18
4,UI,"[-0.022581171, 0.043188088, -0.020339454, 0.00...",1,23
...,...,...,...,...
184,check box,"[0.0143796075, 0.08743661, -0.028821062, 0.026...",21,212
186,sign,"[-0.032189954, -0.02814175, -0.06721526, -0.03...",21,212
193,create information,"[-0.039803077, -0.017100481, -0.04480628, 0.09...",21,212
224,add-in,"[-0.023695203, -0.01207363, -0.03556951, 0.001...",21,212


# Centroid Labeling (Baseline)

In [14]:
def centroid_labeling(df_hc):

  cluster_labels = {}

  df_hc1 = df_hc.copy() 
  for cluster in df_hc1.cluster.unique():
    cluster_centroid = np.mean(np.array(df_hc1[df_hc1.cluster==cluster].requirement_embedding.to_list()),axis=0)
    min_dist = 999999999
    cluster_label = None

    for index,row in df_hc1[df_hc1.cluster==cluster].iterrows():
      requirement_emb = row['requirement_embedding']
      dist = np.linalg.norm(requirement_emb-cluster_centroid)
      if dist < min_dist:
        cluster_label = row['requirement']
        min_dist = dist
    cluster_labels[cluster]=cluster_label

  L = []
  for index,row in df_hc1.iterrows():
    L.append(cluster_labels[row['cluster']])

  df_hc1['cluster_label'] = L
  return df_hc1

df_hc1 = centroid_labeling(df_hc)
df_hc1

Unnamed: 0,requirement,requirement_embedding,cluster_parent,cluster,cluster_label
0,updating,"[-0.042649217, 0.035711575, -0.041541334, 0.05...",1,21,upgrading
1,to do lists,"[0.013045504, 0.023163002, -0.030219816, 0.048...",1,11,making lists
2,create short notes,"[-0.010133171, -0.031141514, -0.07525056, 0.06...",1,25,take my notes
3,grocery,"[0.02432415, 0.0022237618, -0.032460213, -0.03...",1,18,capture
4,UI,"[-0.022581171, 0.043188088, -0.020339454, 0.00...",1,23,UI
...,...,...,...,...,...
184,check box,"[0.0143796075, 0.08743661, -0.028821062, 0.026...",21,212,add-ons
186,sign,"[-0.032189954, -0.02814175, -0.06721526, -0.03...",21,212,add-ons
193,create information,"[-0.039803077, -0.017100481, -0.04480628, 0.09...",21,212,add-ons
224,add-in,"[-0.023695203, -0.01207363, -0.03556951, 0.001...",21,212,add-ons
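The baseline picks, for each cluster, the requirement whose embedding lies closest (in Euclidean distance) to the cluster centroid. The sketch below reproduces that selection on toy data. The function name `nearest_to_centroid` is hypothetical, and the 2-D vectors are made up for illustration; they are not real embeddings from the model.

```python
import numpy as np


def nearest_to_centroid(names, vectors):
    """Return the name whose vector is closest to the mean of all vectors."""
    vectors = np.asarray(vectors, dtype=float)
    centroid = vectors.mean(axis=0)
    # Euclidean distance of every member to the centroid, computed at once.
    distances = np.linalg.norm(vectors - centroid, axis=1)
    return names[int(np.argmin(distances))]


# Toy 2-D vectors (illustrative only).
names = ['take notes', 'note pad', 'grocery']
vectors = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]
print(nearest_to_centroid(names, vectors))
```

Computing all distances in one vectorized call gives the same choice as the row-by-row loop in `centroid_labeling`, with ties resolved in favour of the first member in both cases.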


# Baseline Evaluation
Note that the experimental evaluation needs to be repeated multiple times, and the mean F-score values over the runs should be reported.
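One way to aggregate repeated runs is sketched here, assuming each run produces a list of `(cluster_id, cluster_size, f_score)` tuples as returned by `eval_labeling`. The function name `mean_f_score` is hypothetical, the unweighted average over clusters is an assumption rather than the authors' stated protocol, and the numbers are placeholders, not experimental results.

```python
from statistics import mean


def mean_f_score(results_by_run):
    """Average the per-cluster F-scores within each run, then across runs.

    results_by_run maps a run identifier (e.g. the RUN seed) to a list of
    (cluster_id, cluster_size, f_score) tuples.
    """
    per_run = [mean(f for _, _, f in results) for results in results_by_run.values()]
    return mean(per_run)


# Placeholder values for illustration only.
toy_results = {
    1: [(1, 10, 0.5), (12, 4, 1.0)],
    2: [(1, 10, 0.25), (12, 4, 0.75)],
}
print(mean_f_score(toy_results))
```

Weighting each cluster by its `cluster_size` would be an equally plausible choice; the sketch keeps the simpler unweighted mean.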

In [15]:
from sklearn.metrics.cluster import v_measure_score
from sklearn.metrics import f1_score
from sklearn.metrics.cluster import normalized_mutual_info_score



def isNaN(num):
  return num != num


def eval_labeling(df_hc_,df_app):

  
  results = []

  for cluster_parent in df_hc_.cluster_parent.unique():

    L = []

    for index,row in df_app.iterrows():
      for r in df_hc_[df_hc_.cluster_parent==cluster_parent].requirement.unique():
        if isNaN(row['Feature (All Annotated)'])==False:
          if r in row['Feature (All Annotated)']:
            L.append(row.to_dict())

    df_app2 = pd.DataFrame(L)

    cluster_reference = []
    cluster_labeling = []
    
    for index,row in df_app2.iterrows():

      ####### REFERENCE CLUSTER #######
      cluster_sel = -1
      min_dist = 999999999
      sentence_embedding = row['sentence_embedding']

      df_temp = df_hc_[df_hc_.cluster_parent==cluster_parent]
      for cluster in df_temp.cluster.unique():
        cluster_centroid = np.mean(np.array(df_temp[df_temp.cluster==cluster].requirement_embedding.to_list()),axis=0)
        dist = np.linalg.norm(sentence_embedding-cluster_centroid)
        if dist < min_dist:
          cluster_sel = cluster
          min_dist = dist
      
      cluster_reference.append(cluster_sel)


      ####### LABELING CLUSTER #######
      cluster_sel = -1
      min_dist = 999999999
      sentence_embedding = row['sentence_embedding']

      df_temp = df_hc_[df_hc_.cluster_parent==cluster_parent]
      for cluster in df_temp.cluster.unique():
        cluster_centroid = requirement_embedding_cache[df_temp[df_temp.cluster==cluster].head(1).cluster_label.values[0]]
        dist = np.linalg.norm(sentence_embedding-cluster_centroid)
        if dist < min_dist:
          cluster_sel = cluster
          min_dist = dist
      
      cluster_labeling.append(cluster_sel)

    # print(cluster_parent)
    # print(cluster_reference)
    # print(cluster_labeling)
    # print(v_measure_score(cluster_reference,cluster_labeling))
    # print(f1_score(cluster_reference,cluster_labeling,average='micro'))
    # print('--------------')
    results.append((cluster_parent,len(cluster_reference),f1_score(cluster_reference,cluster_labeling,average='micro')))

  return results
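`eval_labeling` scores each parent cluster with `f1_score(..., average='micro')`. When every sentence receives exactly one reference cluster and one predicted cluster, the micro-averaged F1 equals the fraction of sentences on which the two assignments agree. The dependency-free sketch below illustrates this; `micro_f1` and `agreement` are hypothetical names, and the cluster ids are toy values.

```python
def micro_f1(reference, predicted):
    """Micro-averaged F1 for single-label assignments, pooled over all labels."""
    tp = fp = fn = 0
    for label in set(reference) | set(predicted):
        for ref, pred in zip(reference, predicted):
            if pred == label and ref == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif ref == label:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def agreement(reference, predicted):
    """Fraction of positions where both assignments coincide."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


# Toy cluster assignments (illustrative only).
reference = [21, 11, 25, 11]
predicted = [21, 25, 25, 11]
print(micro_f1(reference, predicted), agreement(reference, predicted))
```

Each disagreement counts once as a false positive for the predicted cluster and once as a false negative for the reference cluster, which is why the pooled score reduces to plain agreement.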

In [16]:
results_baseline = eval_labeling(df_hc1,df_app)

In [None]:
df_baseline_results = pd.DataFrame(results_baseline)
df_baseline_results.columns = ['cluster_id','cluster_size','f_score']
df_baseline_results

# Proposed Contextual Cluster Labeling

In [18]:
# Copy the slice so that adding a column does not raise SettingWithCopyWarning.
df_app = df_data[df_data['App id']==APP_NAME].copy()
df_app['sentence_embedding'] = [sentence_embedding_cache[s] for s in df_app['Sentence content']]
df_app


Unnamed: 0,App id,Review id,Sentence id,Sentence content,Feature (Positive),Feature (Neutral),Feature (Negative),Feature (All Annotated),sentence_embedding
154,B004LOMB2Q,A3NOCGVOLK1IPO,1,I've used Evernote for quite awhile to take no...,,take notes;share;edit,,take notes;share;edit,"[0.036477115, 0.02155222, -0.069200546, -0.023..."
155,B004LOMB2Q,A3NOCGVOLK1IPO,2,"Not as nice as Microsoft OneNote, but at least...",,,,,"[-0.04071297, -0.034613997, 0.046789404, -0.04..."
156,B004LOMB2Q,A3NOCGVOLK1IPO,3,Nice to see that it runs on the Kindle Fire HD.,,,,,"[0.027171697, 0.04388646, 0.028232776, 0.01278..."
157,B004LOMB2Q,ALYCIMJ6ZVBA5,1,Love that I can sync to my other devices.,sync to my other devices,,,sync to my other devices,"[0.048731066, -0.03787678, -0.052123584, -0.04..."
158,B004LOMB2Q,ALYCIMJ6ZVBA5,2,It has been a big help in keeping me organized...,,keeping me organized with my lists,,keeping me organized with my lists,"[0.004379682, 0.07013778, 0.017054016, 0.05085..."
...,...,...,...,...,...,...,...,...,...
516,B004LOMB2Q,A3AJ6LK7A5ZZ9R,4,I like the way you can organize into notebooks.,organize into notebooks,,,organize into notebooks,"[0.030230064, -0.025762867, 0.047045365, 0.014..."
517,B004LOMB2Q,A3AJ6LK7A5ZZ9R,5,everything is very useful.,,,,,"[0.099809624, 0.023971448, -0.0021783714, -0.0..."
518,B004LOMB2Q,A3AJ6LK7A5ZZ9R,6,I like it a lot.,,,,,"[0.08920752, -0.07764692, -0.013945812, -0.052..."
519,B004LOMB2Q,AJCBQCE0DPTGX,1,great app for taking notes across all platforms.,taking notes,,,taking notes,"[0.016192125, -0.04975882, -0.035583932, -0.04..."


In [19]:
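def get_contextual_embedding(requirement,df_app,alpha=0.9,show_progress_bar=False):
  # Note: show_progress_bar is currently unused.

  df_feature = df_app[['Sentence content','Feature (All Annotated)','sentence_embedding']].dropna()
  # regex=False matches the requirement literally, so characters such as '+' or '(' are not read as a pattern.
  df_req_temp = df_feature[df_feature['Feature (All Annotated)'].str.contains(requirement, regex=False)]
  # Mean embedding of the sentences in which the requirement was annotated.
  context_center = np.mean(np.array(df_req_temp['sentence_embedding'].to_list()),axis=0)
  requirement_emb = requirement_embedding_cache[requirement]
  # alpha=0 keeps the plain requirement embedding; alpha=1 uses only the sentence context.
  contextual_embedding = (1.0-alpha)*requirement_emb + (alpha)*context_center

  return contextual_embedding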

#get_contextual_embedding('premium',df_app,alpha=0.5)
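The contextual embedding is a convex combination of the requirement's own embedding and the mean embedding of the sentences that mention it, controlled by `alpha`. The sketch below shows the effect of `alpha` on toy data. The function name `blend` is hypothetical, and the 2-D vectors are made up for illustration.

```python
import numpy as np


def blend(requirement_vec, context_vecs, alpha):
    """(1 - alpha) * requirement embedding + alpha * mean of context embeddings."""
    context_center = np.mean(np.asarray(context_vecs, dtype=float), axis=0)
    return (1.0 - alpha) * np.asarray(requirement_vec, dtype=float) + alpha * context_center


# Toy vectors (illustrative only): one requirement and two context sentences.
requirement_vec = [1.0, 0.0]
context_vecs = [[0.0, 1.0], [0.0, 3.0]]

print(blend(requirement_vec, context_vecs, 0.0))  # requirement embedding only
print(blend(requirement_vec, context_vecs, 1.0))  # context mean only
print(blend(requirement_vec, context_vecs, 0.5))  # equal mix of both
```

Small values of `alpha` keep the label close to the wording of the requirement itself, while large values pull it toward how users talk about it in reviews.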

In [20]:
def contextual_cluster_labeling(df_hc,df_app,alpha=0.1):

  cluster_labels = {}
  cache_used_labels = set()

  df_hc2 = df_hc.copy() 
  for cluster in df_hc2.cluster.unique():
    cluster_centroid = np.mean(np.array(df_hc2[df_hc2.cluster==cluster].requirement_embedding.to_list()),axis=0)
    min_dist = 999999999
    cluster_label = None

    for index,row in df_hc2[df_hc2.cluster==cluster].iterrows():
      requirement_emb = get_contextual_embedding(row['requirement'],df_app,alpha=alpha)
      #requirement_emb = row['requirement_embedding']
      dist = np.linalg.norm(requirement_emb-cluster_centroid)
      if dist < min_dist and row['requirement'] not in cache_used_labels:
        cluster_label = row['requirement']
        min_dist = dist
    cluster_labels[cluster]=cluster_label
    cache_used_labels.add(cluster_label)

  L = []
  for index,row in df_hc2.iterrows():
    L.append(cluster_labels[row['cluster']])

  df_hc2['cluster_label'] = L
  return df_hc2



In [None]:
from tqdm.notebook import tqdm
results_proposal = {}
for alpha in tqdm(np.linspace(0, 1, 51)):  # 51 values from 0.00 to 1.00 in steps of 0.02
  
  df_hc2 = contextual_cluster_labeling(df_hc,df_app,alpha=alpha)
  results_proposal[alpha] = eval_labeling(df_hc2,df_app)

In [None]:
L = []
for alpha in results_proposal:
  for item in results_proposal[alpha]:
    R = list(item)
    R.append(alpha)
    L.append(R)

df_proposal_results = pd.DataFrame(L)
df_proposal_results['method'] = 'proposal'
df_proposal_results.columns = ['cluster_id','cluster_size','f_score','alpha','method']
df_proposal_results