## Reading JSON transcripts / BERT / Semantic similarity
- [Hugging Face repo](https://huggingface.co/models?library=flair&sort=downloads)
- [Flair embeddings](https://github.com/flairNLP/flair/blob/master/resources/docs/embeddings/TRANSFORMER_EMBEDDINGS.md)

The class used here is defined in [`utils/speech_sw.py`](https://github.com/FranckPrts/MBCS-RP2-Codes/blob/master/utils/speech_sw.py).

## Reading the transcripts

In [1]:
# Import the class
from utils.speech_sw import read_dialogue

# Define path to transcripts
path = "../Transcripts/"
path += "SAMPLE.json"

# Initialize the class
speech = read_dialogue(path)

In [2]:
speech.select_sw(sw_min=28.6, sw_max=42)

>> Starting dialogue extraction ...
>> Time window [28.600000 - 42.000000] 
[29.100000 - 29.400000] 	Speaker: 2 		Word: you
[29.400000 - 29.500000] 	Speaker: 2 		Word: don't
[29.500000 - 29.700000] 	Speaker: 2 		Word: want
[29.700000 - 29.800000] 	Speaker: 2 		Word: to
[29.800000 - 30.000000] 	Speaker: 2 		Word: push
[30.000000 - 30.100000] 	Speaker: 2 		Word: it
[30.100000 - 30.200000] 	Speaker: 2 		Word: in
[30.200000 - 32.600000] 	Speaker: 2 		Word: so
[32.600000 - 33.000000] 	Speaker: 2 		Word: she
[33.000000 - 33.100000] 	Speaker: 2 		Word: called
[33.100000 - 33.300000] 	Speaker: 2 		Word: me
[33.300000 - 33.400000] 	Speaker: 2 		Word: and
[33.400000 - 33.500000] 	Speaker: 2 		Word: I
[33.500000 - 33.700000] 	Speaker: 2 		Word: didn't
[33.700000 - 33.800000] 	Speaker: 2 		Word: know
[33.800000 - 33.800000] 	Speaker: 2 		Word: she
[33.800000 - 34.000000] 	Speaker: 2 		Word: had
[34.000000 - 34.500000] 	Speaker: 2 		Word: calls
[34.500000 - 34.600000] 	Speaker: 2 		Word: cuz
[34.60

See the contribution of each participant to the dialogue:

In [3]:
print("Text from speaker 1 : ", speech.sub1_speech_list)
print("Text from speaker 2 : ", speech.sub2_speech_list)

Text from speaker 1 :  ['she', 'like']
Text from speaker 2 :  ['you', "don't", 'want', 'to', 'push', 'it', 'in', 'so', 'she', 'called', 'me', 'and', 'I', "didn't", 'know', 'she', 'had', 'calls', 'cuz', 'he', 'was', 'talking', 'my', 'ear', 'off', 'with', 'my', 'home', 'DaVinci', 'come', 'to', 'my']
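
The window selection and per-speaker split shown above can be pictured as a filter over word records followed by a grouping step. The sketch below is only an assumption about how such a step might work, not the code of `read_dialogue.select_sw`: the `select_window` helper, the sample records, and the use of float timestamps in seconds are all hypothetical.

```python
def select_window(words, sw_min, sw_max):
    """Keep the words whose start and end times both fall inside [sw_min, sw_max]."""
    return [w for w in words if w["startTime"] >= sw_min and w["endTime"] <= sw_max]

# Hypothetical word records; timestamps are assumed to be floats in seconds.
words = [
    {"startTime": 28.0, "endTime": 28.4, "speakerTag": 1, "word": "okay"},
    {"startTime": 29.1, "endTime": 29.4, "speakerTag": 2, "word": "you"},
    {"startTime": 29.4, "endTime": 29.5, "speakerTag": 2, "word": "don't"},
    {"startTime": 41.9, "endTime": 42.3, "speakerTag": 1, "word": "like"},
]

selected = select_window(words, sw_min=28.6, sw_max=42)

# Group the selected words by speaker, preserving their order in the dialogue.
by_speaker = {}
for w in selected:
    by_speaker.setdefault(w["speakerTag"], []).append(w["word"])

print(by_speaker)
```

Whether a word that straddles a window boundary should be kept is a design choice; this sketch drops it.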


We now embed these two dialogue segments using `TransformerWordEmbeddings`:

In [4]:
len(speech.sub2_speech_list)

32

In [5]:
speech.embed_dialogue_wtov()

>> Starting dialogue embeding w/ w2v ...

>> Embeding S1 ...
embeding:  she
embeding:  like
>> Done.

>> Embeding S2 ...
embeding:  you
embeding:  don't
embeding:  want
embeding:  to
> Word not embeded : to
> Error: "Key 'to' not present"
embeding:  push
embeding:  it
embeding:  in
embeding:  so
embeding:  she
embeding:  called
embeding:  me
embeding:  and
> Word not embeded : and
> Error: "Key 'and' not present"
embeding:  I
embeding:  didn't
embeding:  know
embeding:  she
embeding:  had
embeding:  calls
embeding:  cuz
embeding:  he
embeding:  was
embeding:  talking
embeding:  my
embeding:  ear
embeding:  off
embeding:  with
embeding:  my
embeding:  home
embeding:  DaVinci
embeding:  come
embeding:  to
> Word not embeded : to
> Error: "Key 'to' not present"
embeding:  my
>> Done.
>> Dialogue embeded.


In [6]:
speech.not_embeded

{'to': 'S2', 'and': 'S2'}

In [7]:
len(speech.sub2_speech_embed)

29
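
The count drops from 32 to 29 because three tokens were skipped during embedding: `to` (twice) and `and`. `speech.not_embeded` lists only two entries because a dict keeps one entry per key. A minimal sketch of this skip-on-missing pattern follows, assuming a dict-like vocabulary lookup that raises `KeyError`. The `embed_words` helper and the toy `vectors` table are hypothetical and are not taken from `speech_sw.py`.

```python
def embed_words(words, vectors):
    """Look up each word; collect vectors for known words and record the misses."""
    embedded, missing = [], []
    for word in words:
        try:
            embedded.append(vectors[word])
        except KeyError:
            # Out-of-vocabulary word: skip it but remember that it was dropped.
            missing.append(word)
    return embedded, missing

# Toy vocabulary standing in for a pretrained model (hypothetical values).
vectors = {"come": [0.1, 0.3], "my": [0.2, 0.0]}

embedded, missing = embed_words(["come", "to", "my"], vectors)
print(len(embedded), missing)
```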

In [8]:
speech.compute_semantic_similarity()
speech.sem_sim

0.7117323279380798
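
The notebook does not show how `compute_semantic_similarity` arrives at this score. One common approach, given here purely as an assumption, is to average each speaker's word vectors and take the cosine similarity of the two means. The helper names and the toy vectors below are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-D embeddings for two speakers (hypothetical values).
s1_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # mean is [0.5, 0.5]
s2_embeddings = [[2.0, 2.0]]              # mean is [2.0, 2.0]

score = cosine_similarity(mean_vector(s1_embeddings), mean_vector(s2_embeddings))
print(score)
```

Cosine similarity ignores vector length, so the two toy means, which point in the same direction, score close to 1 despite their different magnitudes.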

## Archives

In [None]:
# speech.dialogue is a dict 
print(speech.dialogue.keys())
print(len(speech.dialogue["results"]))

There are 14 results available in our sample:

13 sentences, each containing a list of its words with their:

- confidence
- startTime
- endTime
- word


1 list of all the words in the transcript (this is our element of interest), with their:

- confidence
- startTime
- endTime
- **speakerTag**
- word

Because we're interested in sampling words that fall within a time window, we'll use the last element of the JSON, which provides the following for each word: confidence, startTime, endTime, speakerTag, word.
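
As a rough illustration of that layout, the sketch below builds a tiny stand-in for the transcript and reads the word list from the last result. The JSON content is hypothetical: only the key names (`results`, `alternatives`, `words`, and the per-word fields) come from this notebook, and the timestamp format is an assumption.

```python
import json

# Tiny hypothetical transcript: one sentence result, then the full word list.
raw = """
{"results": [
  {"alternatives": [{"words": [
     {"confidence": 0.9, "startTime": "0.1s", "endTime": "0.3s", "word": "hi"},
     {"confidence": 0.8, "startTime": "0.3s", "endTime": "0.6s", "word": "there"}]}]},
  {"alternatives": [{"words": [
     {"confidence": 0.9, "startTime": "0.1s", "endTime": "0.3s", "speakerTag": 1, "word": "hi"},
     {"confidence": 0.8, "startTime": "0.3s", "endTime": "0.6s", "speakerTag": 2, "word": "there"}]}]}
]}
"""

dialogue = json.loads(raw)

# The last result holds every word of the transcript, each with a speakerTag.
all_words = dialogue["results"][-1]["alternatives"][0]["words"]
tags = [w["speakerTag"] for w in all_words]

print(len(all_words), tags)
```

Indexing with `[-1]` rather than a fixed position such as `[13]` keeps the lookup valid for transcripts with a different number of sentences.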

Let's see how many words we have, and print the information for the 350th word.

In [None]:
print("There are %i words in that transcript" % len(speech.dialogue["results"][13]["alternatives"][0]["words"]))
print("Here is the info for the 350th word (index 349)")
print(speech.dialogue["results"][13]["alternatives"][0]["words"][349])

In [None]:
# Exploring the 13 individual sentence objects
print("Available keys for each sentence")
print(speech.dialogue["results"][1].keys())

print("\nThere is %i alternative in our transcription" % len(speech.dialogue["results"][0]["alternatives"]))
print("This alternative has %i keys " % len(speech.dialogue["results"][0]["alternatives"][0].keys()))
print('Which are:')
print(speech.dialogue["results"][0]["alternatives"][0].keys())

print("\nWith 'words' being a list of dicts, one per word of the sentence, holding that word's attributes")
speech.dialogue["results"][0]["alternatives"][0]["words"]