## In this notebook we generate the inputs of the training data for LLMs from transcripts.
- The idea is to take a random sentence from a given transcript and then add more sentences until we reach a given token limit.
- The outputs will be created by ChatGPT, similar to how it was done for Alpaca (https://github.com/tatsu-lab/stanford_alpaca).



Let's start by looking at how the text is formatted:

In [53]:
# Original text with timestamps
text = "[00:01:19 -> 00:01:23] Golden nugget number one time is your most valuable and scarcest resource\n" \
       "[00:01:23 -> 00:01:24] Hey\n" \
       "[00:01:24 -> 00:01:29] 1440 is the number that can change your life and that number is the number of minutes\n" \
       "[00:01:29 -> 00:01:33] We all have in a single day and while you know, most of the people that I interviewed\n" \
       "[00:01:34 -> 00:01:41] They're not all doing the same 15 secrets either. The common thread was that they always spoke about\n" \
       "[00:01:42 -> 00:01:46] minutes in the value of time and when you truly realize\n" \
       "[00:01:46 -> 00:01:52] What like just how valuable a single minute is I mean money we can lose it and make it back again our health\n" \
       "[00:01:52 -> 00:01:54] We can get sick and get healthy again\n" \
       "[00:01:54 -> 00:01:57] Time once it's gone. It is gone time is life"

### Preprocessing
- The transcripts have timestamps for each line. However, since we want to use complete sentences, we first need some preprocessing to get timestamps for sentences.
- To do so, we need the following steps:
    - get all lines into a dict, including the timestamp information
    - correct the punctuation of the text without the timestamps
    - separate the text into sentences
    - find the timestamps for a given sentence


Let's first define a regex pattern and try it out.

In [4]:
import re
# Define the three groups we are interested in:
#   1. start timestamp
#   2. end timestamp
#   3. text of the line
pattern = r'\[(\d+:\d+:\d+) -> (\d+:\d+:\d+)\] (.+)'
entries = re.findall(pattern, text)

In [5]:
print(entries)

[('00:01:19', '00:01:23', 'Golden nugget number one time is your most valuable and scarcest resource'), ('00:01:23', '00:01:24', 'Hey'), ('00:01:24', '00:01:29', '1440 is the number that can change your life and that number is the number of minutes'), ('00:01:29', '00:01:33', 'We all have in a single day and while you know, most of the people that I interviewed'), ('00:01:34', '00:01:41', "They're not all doing the same 15 secrets either. The common thread was that they always spoke about"), ('00:01:42', '00:01:46', 'minutes in the value of time and when you truly realize'), ('00:01:46', '00:01:52', 'What like just how valuable a single minute is I mean money we can lose it and make it back again our health'), ('00:01:52', '00:01:54', 'We can get sick and get healthy again'), ('00:01:54', '00:01:57', "Time once it's gone. It is gone time is life")]


We want to correct the punctuation of the sentences. To still be able to match the corrected sentences to the original lines, the easiest way is to compare both with all whitespace and punctuation removed.

In [8]:
test_line = entries[0][2]
print(test_line)

Golden nugget number one time is your most valuable and scarcest resource


In [13]:
# Helper function: strips the characters . , ? - : and all whitespace.
# Other characters (e.g. apostrophes) are kept.

def remove_punctuation_and_whitespaces(string):
    return re.sub(r"[.,?\-:\s]", "", string)

print(remove_punctuation_and_whitespaces(test_line))

Goldennuggetnumberonetimeisyourmostvaluableandscarcestresource
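
Why this helps with matching: a raw transcript line and its punctuation-corrected counterpart differ only in punctuation, spacing, and letter case, so both collapse to the same key once those are removed. A minimal sketch, where `normalize` is a hypothetical stand-in that repeats the regex of the helper above and additionally lower-cases the result:

```python
import re

def normalize(string):
    # Same character class as the helper above, plus lower-casing
    # (the lower-casing is an addition made for this sketch).
    return re.sub(r"[.,?\-:\s]", "", string).lower()

raw_line = "Golden nugget number one time is your most valuable and scarcest resource"
corrected = "Golden nugget number one: time is your most valuable and scarcest resource."

print(normalize(raw_line) == normalize(corrected))  # True
```

Characters outside the class, such as `!` or `;`, are left in place, so the match relies on the corrected text only differing in the characters listed.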


We also want the timestamps in seconds, which makes calculations easier.

In [14]:
#define a timestamp to test it
test_timestamp = entries[0][1]
print(test_timestamp)

00:01:23


In [15]:
from datetime import datetime

#helper function
def get_seconds(time_string):
    time_format = "%H:%M:%S"
    time = datetime.strptime(time_string, time_format)
    total_seconds = time.hour * 3600 + time.minute * 60 + time.second
    return total_seconds

#print test result
print(get_seconds(test_timestamp))

83
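
`datetime.strptime` with `%H` only accepts hours from 0 to 23, so a timestamp such as `25:00:00` would raise a `ValueError`. Should that ever matter, plain arithmetic is a possible alternative. `get_seconds_simple` below is a hypothetical name and not part of the notebook:

```python
def get_seconds_simple(time_string):
    # Split "HH:MM:SS" and convert each field; there is no upper bound on hours.
    hours, minutes, seconds = (int(part) for part in time_string.split(":"))
    return hours * 3600 + minutes * 60 + seconds

print(get_seconds_simple("00:01:23"))  # 83
print(get_seconds_simple("25:00:00"))  # 90000
```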


### Now we are ready to define our function to turn a transcript into a line dict!

In [16]:
def extract_entries(text):
    pattern = r'\[(\d+:\d+:\d+) -> (\d+:\d+:\d+)\] (.+)'
    entries = re.findall(pattern, text)
    result = {}
    # We keep a running letter count so that we can later map the index where a
    # sentence starts back to the line that contains it.
    cum_letters = 0
    for i, (timestamp_start, timestamp_end, line_text) in enumerate(entries):
        # Remove all punctuation and whitespace so the lines can be matched against
        # sentences taken from the punctuation-corrected text.
        letters = remove_punctuation_and_whitespaces(line_text)
        cum_letters += len(letters)
        result[i] = {
            "timestamp_start": get_seconds(timestamp_start),
            "timestamp_end": get_seconds(timestamp_end),
            "text": line_text,
            "letters": letters,
            "number_of_letters": len(letters),
            "cum_number_of_letters": cum_letters,
        }
    return result

Let's see the result:

In [18]:
line_dict = extract_entries(text)
line_dict

{0: {'timestamp_start': 79,
  'timestamp_end': 83,
  'text': 'Golden nugget number one time is your most valuable and scarcest resource',
  'letters': 'Goldennuggetnumberonetimeisyourmostvaluableandscarcestresource',
  'number_of_letters': 62,
  'cum_number_of_letters': 62},
 1: {'timestamp_start': 83,
  'timestamp_end': 84,
  'text': 'Hey',
  'letters': 'Hey',
  'number_of_letters': 3,
  'cum_number_of_letters': 65},
 2: {'timestamp_start': 84,
  'timestamp_end': 89,
  'text': '1440 is the number that can change your life and that number is the number of minutes',
  'letters': '1440isthenumberthatcanchangeyourlifeandthatnumberisthenumberofminutes',
  'number_of_letters': 69,
  'cum_number_of_letters': 134},
 3: {'timestamp_start': 89,
  'timestamp_end': 93,
  'text': 'We all have in a single day and while you know, most of the people that I interviewed',
  'letters': 'WeallhaveinasingledayandwhileyouknowmostofthepeoplethatIinterviewed',
  'number_of_letters': 67,
  'cum_number_of_lett

Looks good! Now we look into fixing the punctuation so that we can separate the sentences.

In [27]:
from sentencepiece import SentencePieceProcessor
# deepmultilingualpunctuation is a transformer-based model, so run this on a GPU if possible.
from deepmultilingualpunctuation import PunctuationModel
punc_model = PunctuationModel()



In [28]:
def remove_timestamps(sentence):
    return re.sub(r'\[\d{2}:\d{2}:\d{2} -> \d{2}:\d{2}:\d{2}\]', '', sentence)

In [29]:
# TODO: check whether it is significantly faster to iterate over the dict to get the texts instead of using regex again
raw_text = remove_timestamps(text)
cor_text = punc_model.restore_punctuation(raw_text)
cor_text


"Golden nugget number one: time is your most valuable and scarcest resource. Hey, 1440 is the number that can change your life, and that number is the number of minutes We all have in a single day. and while you know most of the people that I interviewed, They're not all doing the same 15 secrets either. The common thread was that they always spoke about minutes in the value of time and when you truly realize What, like just how valuable a single minute is. I mean money. we can lose it and make it back again. our health: We can get sick and get healthy again. Time- once it's gone, It is gone. time is life."

Looks much better! Now let's check out the sentences.

In [32]:
from langdetect import detect_langs
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

def seperate_sentences(text):
    ''' Use sumy to separate a given text into individual sentences.
    Sumy needs to know the language of the text to work correctly; we detect it with langdetect.
    We only want English or German text, so everything else is skipped.
    '''
    lang = detect_langs(text)

    if lang[0].lang =="en":
        parser = PlaintextParser.from_string(text, Tokenizer('english'))

    elif lang[0].lang == "de":
        parser = PlaintextParser.from_string(text, Tokenizer('german'))
    else:
        print("not German or english, this text needs to be skipped")
        return []

    sentences = parser.document.sentences
    # We know the first letter of each sentence needs to be upper case, so let's ensure that.
    result_sentences = [str(sentence).capitalize() for sentence in sentences]
    return result_sentences

In [33]:
sentences = seperate_sentences(cor_text)
sentences

['Golden nugget number one: time is your most valuable and scarcest resource.',
 'Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.',
 "And while you know most of the people that i interviewed, they're not all doing the same 15 secrets either.",
 'The common thread was that they always spoke about minutes in the value of time and when you truly realize what, like just how valuable a single minute is.',
 'I mean money.',
 'We can lose it and make it back again.',
 'Our health: we can get sick and get healthy again.',
 "Time- once it's gone, it is gone.",
 'Time is life.']
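
Note the lower-case `i` in the third sentence: `str.capitalize()` upper-cases the first character but also lower-cases everything after it. The later matching is case-insensitive, so nothing breaks, but if the original casing should be kept, one option is to touch only the first character. `capitalize_first` is a hypothetical helper for illustration:

```python
def capitalize_first(sentence):
    # Upper-case only the first character and leave the rest untouched.
    return sentence[:1].upper() + sentence[1:]

example = "and while you know most of the people that I interviewed"
print(example.capitalize())       # And while you know most of the people that i interviewed
print(capitalize_first(example))  # And while you know most of the people that I interviewed
```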

That looks really good! Let's write a function that combines the steps for getting sentences from a raw text with timestamps.

In [50]:
def get_clean_sentences(text_with_timestamps,punc_model):
    raw_text = remove_timestamps(text_with_timestamps)
    cor_text = punc_model.restore_punctuation(raw_text)
    sentences = seperate_sentences(cor_text)
    return sentences

In [54]:
get_clean_sentences(text,punc_model)

['Golden nugget number one: time is your most valuable and scarcest resource.',
 'Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.',
 "And while you know most of the people that i interviewed, they're not all doing the same 15 secrets either.",
 'The common thread was that they always spoke about minutes in the value of time and when you truly realize what, like just how valuable a single minute is.',
 'I mean money.',
 'We can lose it and make it back again.',
 'Our health: we can get sick and get healthy again.',
 "Time- once it's gone, it is gone.",
 'Time is life.']

Now we can try to find the timestamps for a test sentence. To do so, we:
- convert the sentence into a string of only letters
- do the same with the full text
- find the start and end index of the sentence within the full text

In [36]:
# Test sentence: index 1. It starts at line 1 and ends in the middle of line 3.
test_sentence = str(sentences[1])
test_letters = remove_punctuation_and_whitespaces(test_sentence)
print(test_sentence)
print(test_letters)

Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.
Hey1440isthenumberthatcanchangeyourlifeandthatnumberisthenumberofminutesweallhaveinasingleday


In [40]:
all_letters = "".join(entry["letters"] for entry in line_dict.values())
print(all_letters)

GoldennuggetnumberonetimeisyourmostvaluableandscarcestresourceHey1440isthenumberthatcanchangeyourlifeandthatnumberisthenumberofminutesWeallhaveinasingledayandwhileyouknowmostofthepeoplethatIinterviewedThey'renotalldoingthesame15secretseitherThecommonthreadwasthattheyalwaysspokeaboutminutesinthevalueoftimeandwhenyoutrulyrealizeWhatlikejusthowvaluableasingleminuteisImeanmoneywecanloseitandmakeitbackagainourhealthWecangetsickandgethealthyagainTimeonceit'sgoneItisgonetimeislife


In [42]:
index_start_letter = all_letters.lower().index(test_letters.lower())
index_end_letter = index_start_letter +  len(test_letters)

print("start index: ", index_start_letter)
print("end index: ", index_end_letter)

start index:  62
end index:  155
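
`str.index` raises a `ValueError` when the sentence is not found, for instance if the punctuation model changed a word, and it always returns the first occurrence if a sentence is repeated. A minimal sketch of a more defensive lookup, assuming a hypothetical helper `find_span` that takes the already normalised strings:

```python
def find_span(all_letters, query_letters, search_from=0):
    # str.find returns -1 instead of raising when there is no match.
    start = all_letters.lower().find(query_letters.lower(), search_from)
    if start == -1:
        return None
    # The end index is exclusive, as in the cell above.
    return start, start + len(query_letters)

haystack = "GoldennuggetHey1440isthenumber"
print(find_span(haystack, "hey1440"))  # (12, 19)
print(find_span(haystack, "missing"))  # None
```

Passing the previous end index as `search_from` while walking through the sentences in order would keep repeated sentences from all resolving to the first occurrence.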


### Now we need to find the start index and the end index in the line dict.
- Per text piece, we collect the following information:
    - start_time_first_line: the start time of the line that contains the beginning of the text piece
    - end_time_last_line: the end time of the line that contains the end of the text piece
    - interpolated_start_time: the estimated start time of the sentence, calculated by interpolation
    - interpolated_end_time: the estimated end time of the sentence, calculated by interpolation
- The interpolated start and end times are probably not necessary, as each line only covers a few seconds of text.

In [143]:
# The idea of the following code is:
# - iterate over the lines until you find the line that contains the start index
# - keep iterating until you find the line that contains the end index
# - return the timestamps of these lines
# - also call interpolate_timestamp for the start and end line to get a more exact estimate
# - interpolate_timestamp works by assuming a constant number of letters per second within a line


def get_timestamps(dictionary, start_index, end_index):
    start_time_first_line = None
    end_time_last_line = None
    # Initialised so the return below cannot fail if an index is never reached.
    interpolated_start_time = None
    interpolated_end_time = None

    for index, entry in dictionary.items():
        # start_index is 0-based, so its line is the first one whose cumulative count exceeds it.
        if entry['cum_number_of_letters'] >= start_index + 1 and start_time_first_line is None:
            start_time_first_line, interpolated_start_time = interpolate_timestamp(entry, start_index, "start")

        if entry['cum_number_of_letters'] >= end_index and end_time_last_line is None:
            end_time_last_line, interpolated_end_time = interpolate_timestamp(entry, end_index, "end")

        # Compare against None explicitly: a start time of 0 seconds is falsy.
        if start_time_first_line is not None and end_time_last_line is not None:
            break

    return start_time_first_line, end_time_last_line, interpolated_start_time, interpolated_end_time



def interpolate_timestamp(entry, target_letters, start_or_end):
    letters_per_second = entry['number_of_letters'] / (entry['timestamp_end'] - entry['timestamp_start'])
    #we need to find the position inside the line. To do so, we calculate how many letters were used before the line. 
    letters_before_line = entry['cum_number_of_letters']-entry['number_of_letters']
    elapsed_seconds = (target_letters - letters_before_line) / letters_per_second
    if start_or_end == "start":
        start_time_of_line = entry['timestamp_start']
        interpolated_start_time = entry['timestamp_start'] + elapsed_seconds
        return start_time_of_line, interpolated_start_time
    elif start_or_end =="end":
        end_time_of_line = entry['timestamp_end']
        interpolated_end_time = entry['timestamp_start'] + elapsed_seconds
        return end_time_of_line, interpolated_end_time
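
Since `cum_number_of_letters` is sorted, the line containing a given letter index could also be located with a binary search rather than a linear scan. This is only a sketch of an alternative, not what the notebook does; `line_index_for_letter` is a hypothetical name, and the cumulative counts are copied from the `line_dict` output above:

```python
from bisect import bisect_right

def line_index_for_letter(cum_counts, letter_index):
    # cum_counts[i] is the number of letters up to and including line i.
    # A 0-based letter_index lies in the first line whose cumulative count exceeds it.
    return bisect_right(cum_counts, letter_index)

cum_counts = [62, 65, 134]  # lines 0..2 of the example transcript
print(line_index_for_letter(cum_counts, 62))   # 1 (the first letter of "Hey")
print(line_index_for_letter(cum_counts, 133))  # 2
```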




In [146]:
start_time_first_line, end_time_last_line, interpolated_start_time, interpolated_end_time = get_timestamps(line_dict,index_start_letter, index_end_letter)
print("start_time_first_line: ", start_time_first_line)
print("end_time_last_line: ", end_time_last_line)
print("interpolated_start_time: ", interpolated_start_time)
print("interpolated_end_time: ", interpolated_end_time)

start_time_first_line:  83
end_time_last_line:  93
interpolated_start_time:  83.0
interpolated_end_time:  90.25373134328358


### The results seem to be good! Let's turn this into a simple function.

In [153]:
def find_timestamps_of_text(original_text, query_text, line_dict=None):
    # The line dict can be created outside the function if we search for multiple texts in the same original text.
    if line_dict is None:
        line_dict = extract_entries(original_text)

    # Concatenate the letters of all lines.
    all_letters = "".join(entry["letters"] for entry in line_dict.values())
    query_letters = remove_punctuation_and_whitespaces(query_text)
    # Case-insensitive search; raises ValueError if the query text is not found.
    index_start_letter = all_letters.lower().index(query_letters.lower())
    index_end_letter = index_start_letter + len(query_letters)

    return get_timestamps(line_dict, index_start_letter, index_end_letter)
    


In [154]:
text

"[00:01:19 -> 00:01:23] Golden nugget number one time is your most valuable and scarcest resource\n[00:01:23 -> 00:01:24] Hey\n[00:01:24 -> 00:01:29] 1440 is the number that can change your life and that number is the number of minutes\n[00:01:29 -> 00:01:33] We all have in a single day and while you know, most of the people that I interviewed\n[00:01:34 -> 00:01:41] They're not all doing the same 15 secrets either. The common thread was that they always spoke about\n[00:01:42 -> 00:01:46] minutes in the value of time and when you truly realize\n[00:01:46 -> 00:01:52] What like just how valuable a single minute is I mean money we can lose it and make it back again our health\n[00:01:52 -> 00:01:54] We can get sick and get healthy again\n[00:01:54 -> 00:01:57] Time once it's gone. It is gone time is life"

In [155]:
sentences

['Golden nugget number one: time is your most valuable and scarcest resource.',
 'Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.',
 "And while you know most of the people that i interviewed, they're not all doing the same 15 secrets either.",
 'The common thread was that they always spoke about minutes in the value of time and when you truly realize what, like just how valuable a single minute is.',
 'I mean money.',
 'We can lose it and make it back again.',
 'Our health: we can get sick and get healthy again.',
 "Time- once it's gone, it is gone.",
 'Time is life.']

In [156]:
find_timestamps_of_text(text,sentences[2])

(89, 101, 90.25373134328358, 97.41463414634147)

### Now we need a function that picks a random piece of text within a given token limit. Let's first check how to count the tokens.
- First we load the tokenizer of the model and try it out to get a feeling for it.
- Then we need a function that takes a text and returns a subtext that is close to the maximum number of tokens.
- Then we create a function that picks random samples from a given text.
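
The plan in the last two points can be sketched as follows. This is only an illustration under stated assumptions: `sample_random_window` and `count_tokens` are hypothetical names, and a whitespace split stands in for the real tokenizer so that the block runs on its own.

```python
import random

def count_tokens(sentence):
    # Stand-in for len(tokenizer.tokenize(sentence)); the whitespace split is an assumption.
    return len(sentence.split())

def sample_random_window(sentences, token_limit, rng=random):
    # Pick a random start sentence, then append the following sentences while they fit.
    start = rng.randrange(len(sentences))
    window, used = [], 0
    for sentence in sentences[start:]:
        cost = count_tokens(sentence)
        if used + cost > token_limit:
            break
        window.append(sentence)
        used += cost
    return start, window, used

sentences = ["Time is life.", "I mean money.", "We can lose it and make it back again."]
start, window, used = sample_random_window(sentences, token_limit=8, rng=random.Random(0))
print(start, window, used)
```

If the randomly chosen start sentence alone exceeds the limit, the window comes back empty, so a caller would need to retry or skip in that case.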

In [157]:
# If you want to use a model with a different tokenizer, you need to change the tokenizer here.
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_7b'

llama_tokenizer = LlamaTokenizer.from_pretrained(model_path)

In [158]:
short_text = "Hello, how are you? I hope you're doing well."

# Tokenize the text
tokens = llama_tokenizer.tokenize(short_text)

# Get the first n tokens
n = 5  # Number of tokens you want to extract
first_n_tokens = tokens[:n]

print(first_n_tokens)

['▁Hello', ',', '▁how', '▁are', '▁you']


Looks good! Now we can turn to our main task. Let's get a bigger text to work with.

In [159]:
def read_file_to_string(file_path):
    try:
        with open(file_path, 'r') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
        return ""
    except IOError:
        print(f"Error reading file '{file_path}'.")
        return ""


In [160]:
bigger_text = read_file_to_string("data/txts/-5e2tPyKNdw.txt")
bigger_text

"[00:00:00 -> 00:00:20]  In this session, we are going to understand the emergence of sociology.\n[00:00:20 -> 00:00:25]  So the topic for today is actually emergence of sociology.\n[00:00:25 -> 00:00:31]  Now you can always expect a question from emergence of sociology in all alternate years.\n[00:00:31 -> 00:00:36]  So if this year there is a question, next year there will be a question like that.\n[00:00:36 -> 00:00:39]  So this is actually a favorite area of the examiner.\n[00:00:39 -> 00:00:42]  So we need to prepare this topic completely.\n[00:00:42 -> 00:00:48]  And if you know this topic, emergence of sociology, you can easily understand the other topics\n[00:00:48 -> 00:00:54]  like scope of sociology, sociology and other sciences, the comparison, and some of the\n[00:00:54 -> 00:00:58]  other topics also which are related to this particular topic.\n[00:00:58 -> 00:01:00]  So we need to understand emergence of sociology.\n[00:01:00 -> 00:01:02]  Now what are the questions you 

In [161]:
import time
start_time = time.time()
bigger_line_dict = extract_entries(bigger_text)
sentences_bigger_text = get_clean_sentences(bigger_text,punc_model)
print(sentences_bigger_text)
print(time.time()-start_time)

['In this session we are going to understand the emergence of sociology.', 'So the topic for today is actually emergence of sociology.', 'Now you can always expect a question from emergence of sociology in all alternate years.', 'So if this year there is a question, next year there will be a question like that.', 'So this is actually a favorite area of the examiner.', 'So we need to prepare this topic completely and if you know this topic- emergence of sociology- you can easily understand the other topics like scope of sociology, sociology and other sciences, the comparison and some of the other topics also which are related to this particular topic.', 'So we need to understand emergence of sociology.', 'Now what are the questions you can get from this?', "Let's start from the questions.", 'The first question itself: you will be getting a short note: emergence of sociology.', 'So you can get a short note like 10 marker- emergence of sociology directly.', 'Or the second way in which the

This works well, but it takes some time. We should do this once for every text and save the result, but that is not a task for this notebook.
- The bottleneck is the punctuation correction.
- If we do this for more data than just the benchmark material, we might need to see if we can make it faster.
- Maybe we do not need punctuation correction for every text.
- Maybe the punctuation is better if we switch to the original Whisper or quantize less.
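
One way to run the expensive step only once per transcript is to cache the sentence list on disk. A minimal sketch, assuming a hypothetical helper `cached_sentences` and one JSON file per transcript; `compute` stands for any callable, such as `lambda: get_clean_sentences(text, punc_model)`:

```python
import json
import tempfile
from pathlib import Path

def cached_sentences(cache_path, compute):
    # Reuse the stored result if it exists; otherwise compute and store it.
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    sentences = compute()
    cache_path.write_text(json.dumps(sentences, ensure_ascii=False), encoding="utf-8")
    return sentences

# Demo with a dummy compute function in a temporary directory.
calls = []
def dummy_compute():
    calls.append(1)
    return ["Time is life."]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sentences.json"
    first = cached_sentences(path, dummy_compute)
    second = cached_sentences(path, dummy_compute)

print(first == second, len(calls))  # True 1
```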

In [162]:
def get_text_within_token_limit(sentences, tokenizer, token_limit):
    result = []
    cur_token_len = 0

    for sentence in sentences:
        new_token_len = len(tokenizer.tokenize(sentence))
        # Stop at the first sentence that would push the total over the limit.
        if cur_token_len + new_token_len <= token_limit:
            result.append(sentence)
            cur_token_len += new_token_len
        else:
            break

    return result, cur_token_len

In [163]:
result, cur_token_len = get_text_within_token_limit(sentences_bigger_text,llama_tokenizer, 400)
result, cur_token_len

(['In this session we are going to understand the emergence of sociology.',
  'So the topic for today is actually emergence of sociology.',
  'Now you can always expect a question from emergence of sociology in all alternate years.',
  'So if this year there is a question, next year there will be a question like that.',
  'So this is actually a favorite area of the examiner.',
  'So we need to prepare this topic completely and if you know this topic- emergence of sociology- you can easily understand the other topics like scope of sociology, sociology and other sciences, the comparison and some of the other topics also which are related to this particular topic.',
  'So we need to understand emergence of sociology.',
  'Now what are the questions you can get from this?',
  "Let's start from the questions.",
  'The first question itself: you will be getting a short note: emergence of sociology.',
  'So you can get a short note like 10 marker- emergence of sociology directly.',
  'Or the 

Looks good. Let's get the correct timestamps.

In [164]:
# Turn the list of sentences into a single string
result_str = " ".join(result)
result_str

"In this session we are going to understand the emergence of sociology. So the topic for today is actually emergence of sociology. Now you can always expect a question from emergence of sociology in all alternate years. So if this year there is a question, next year there will be a question like that. So this is actually a favorite area of the examiner. So we need to prepare this topic completely and if you know this topic- emergence of sociology- you can easily understand the other topics like scope of sociology, sociology and other sciences, the comparison and some of the other topics also which are related to this particular topic. So we need to understand emergence of sociology. Now what are the questions you can get from this? Let's start from the questions. The first question itself: you will be getting a short note: emergence of sociology. So you can get a short note like 10 marker- emergence of sociology directly. Or the second way in which they will ask you this question is im

In [165]:
find_timestamps_of_text(bigger_text,result_str)

(0, 142, 0.0, 142.0)

In [122]:
extract_entries(bigger_text)

{0: {'timestamp_start': 0,
  'timestamp_end': 20,
  'text': ' In this session, we are going to understand the emergence of sociology.',
  'letters': 'Inthissessionwearegoingtounderstandtheemergenceofsociology',
  'number_of_letters': 58,
  'cum_number_of_letters': 58},
 1: {'timestamp_start': 20,
  'timestamp_end': 25,
  'text': ' So the topic for today is actually emergence of sociology.',
  'letters': 'Sothetopicfortodayisactuallyemergenceofsociology',
  'number_of_letters': 48,
  'cum_number_of_letters': 106},
 2: {'timestamp_start': 25,
  'timestamp_end': 31,
  'text': ' Now you can always expect a question from emergence of sociology in all alternate years.',
  'letters': 'Nowyoucanalwaysexpectaquestionfromemergenceofsociologyinallalternateyears',
  'number_of_letters': 73,
  'cum_number_of_letters': 179},
 3: {'timestamp_start': 31,
  'timestamp_end': 36,
  'text': ' So if this year there is a question, next year there will be a question like that.',
  'letters': 'Soifthisyearthe