SemanticTextSegmentation NaN With All Stop Words #6

haowjy · 2022-07-13T15:50:08Z

When running semantic text segmentation, I found that if an input utterance line consists entirely of stop words (e.g. "Bye. Uh huh. Yeah."), SemanticTextSegmentation._get_similarity fails with ValueError: Input contains NaN.

Adding a NaN check on both embeddings before the cosine_similarity call avoids the error:

def _get_similarity(self, text1, text2):
    # Split each text into sentences, dropping very short ones.
    sentence_1 = [i.text.strip()
                  for i in nlp(text1).sents if len(i.text.split(' ')) > 1]
    sentence_2 = [i.text.strip()
                  for i in nlp(text2).sents if len(i.text.split(' ')) > 2]
    embeding_1 = model.encode(sentence_1)
    embeding_2 = model.encode(sentence_2)
    # Average the sentence embeddings into one row vector per text.
    embeding_1 = np.mean(embeding_1, axis=0).reshape(1, -1)
    embeding_2 = np.mean(embeding_2, axis=0).reshape(1, -1)

    # Return a similarity of 1 if either mean embedding contains NaN,
    # instead of letting cosine_similarity raise ValueError.
    if np.any(np.isnan(embeding_1)) or np.any(np.isnan(embeding_2)):
        return 1

    sim = cosine_similarity(embeding_1, embeding_2)
    return sim
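
For anyone trying to reproduce this without the model, here is a minimal NumPy-only sketch of the guard. It rests on an assumption I have not verified against the library: that the NaN appears because the mean is taken over zero sentence embeddings. The helper names mean_embedding and similarity_or_fallback, the dim argument, and the fallback parameter are hypothetical and not part of SemanticTextSegmentation.

```python
import warnings

import numpy as np


def mean_embedding(embeddings, dim):
    """Average row vectors into shape (1, dim); no rows gives all NaN."""
    arr = np.asarray(embeddings, dtype=float).reshape(-1, dim)
    with warnings.catch_warnings():
        # np.mean emits a RuntimeWarning when it averages zero rows.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.mean(arr, axis=0).reshape(1, -1)


def similarity_or_fallback(emb_1, emb_2, fallback=1.0):
    """Cosine similarity of two (1, dim) vectors, or fallback on NaN."""
    if np.isnan(emb_1).any() or np.isnan(emb_2).any():
        # Hypothetical policy: 1.0 keeps the two texts in one segment.
        return fallback
    a, b = emb_1.ravel(), emb_2.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


empty = mean_embedding([], dim=3)                      # no usable sentences
full = mean_embedding([[1, 0, 0], [1, 0, 0]], dim=3)   # two toy embeddings

print(np.isnan(empty).all())
print(similarity_or_fallback(empty, full))
```

Making the fallback a parameter keeps the segmentation decision explicit: 1.0 merges the line into the neighbouring segment, while a lower value would push toward a boundary there.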

I would like someone else to look at this, because returning 1 treats a stop-word-only line as belonging to the same segment as its neighbour, and I don't want to assume that is the right behaviour.

AnjanaRita · 2022-07-21T07:09:06Z

@haowjy, this solution looks good to me. Would you like to contribute by creating a PR for it?

haowjy mentioned this issue Jul 21, 2022

check for NaN embeddings #8

Merged

haowjy closed this as completed Jul 21, 2022
