## In this notebook we generate the input part of the training data for LLMs from transcripts.
- The idea is to take a random sentence from a given transcript and then append the following sentences until we reach a given token limit.
- The output will be created by ChatGPT, similar to how it was done for Alpaca (https://github.com/tatsu-lab/stanford_alpaca).
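
For orientation, a single training record in the Alpaca style could look like the sketch below. The field names `instruction`, `input`, and `output` follow the Alpaca dataset; the helper name `make_record` and the example strings are assumptions made for illustration and are not part of this notebook's pipeline.

```python
import json

def make_record(instruction, input_text, output_text):
    """Bundle one training example in the Alpaca-style three-field layout."""
    return {
        "instruction": instruction,  # the task description (here: the system prompt)
        "input": input_text,         # the transcript excerpt
        "output": output_text,       # the answer generated by ChatGPT
    }

# Hypothetical example values, for illustration only
example_record = make_record(
    "Extract the required and taught skills from the text as JSON.",
    "Time is your most valuable and scarcest resource.",
    '{"Required Skills": [], "Taught Skills": ["Time Management"]}',
)
print(json.dumps(example_record, indent=2))
```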



Let's start by looking at how the text is formatted:

In [197]:
# Original text with timestamps
text = "[00:01:19 -> 00:01:23] Golden nugget number one time is your most valuable and scarcest resource\n" \
       "[00:01:23 -> 00:01:24] Hey\n" \
       "[00:01:24 -> 00:01:29] 1440 is the number that can change your life and that number is the number of minutes\n" \
       "[00:01:29 -> 00:01:33] We all have in a single day and while you know, most of the people that I interviewed\n" \
       "[00:01:34 -> 00:01:41] They're not all doing the same 15 secrets either. The common thread was that they always spoke about\n" \
       "[00:01:42 -> 00:01:46] minutes in the value of time and when you truly realize\n" \
       "[00:01:46 -> 00:01:52] What like just how valuable a single minute is I mean money we can lose it and make it back again our health\n" \
       "[00:01:52 -> 00:01:54] We can get sick and get healthy again\n" \
       "[00:01:54 -> 00:01:57] Time once it's gone. It is gone time is life"

### Preprocessing
- The transcripts have timestamps for each line. However, since we want to use complete sentences, we first need some preprocessing to get timestamps for sentences.
- To do so, we need the following steps:
    - get all lines in a dict, including information about the timestamps
    - correct the punctuation of the text without the timestamps
    - separate the text into sentences
    - find the timestamps for a given sentence


Let's first define a regex pattern and try it out.

In [198]:
import re
#define three groups we are interested in 
#start time stamp
#end time stamp
#text of line
pattern = r'\[(\d+:\d+:\d+) -> (\d+:\d+:\d+)\] (.+)'
entries = re.findall(pattern, text)

In [199]:
print(entries)

[('00:01:19', '00:01:23', 'Golden nugget number one time is your most valuable and scarcest resource'), ('00:01:23', '00:01:24', 'Hey'), ('00:01:24', '00:01:29', '1440 is the number that can change your life and that number is the number of minutes'), ('00:01:29', '00:01:33', 'We all have in a single day and while you know, most of the people that I interviewed'), ('00:01:34', '00:01:41', "They're not all doing the same 15 secrets either. The common thread was that they always spoke about"), ('00:01:42', '00:01:46', 'minutes in the value of time and when you truly realize'), ('00:01:46', '00:01:52', 'What like just how valuable a single minute is I mean money we can lose it and make it back again our health'), ('00:01:52', '00:01:54', 'We can get sick and get healthy again'), ('00:01:54', '00:01:57', "Time once it's gone. It is gone time is life")]


We want to correct the punctuation of the text. To still be able to match the corrected sentences to the original lines, the easiest way is to compare both after removing all whitespace and punctuation.
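
A small sketch of why this works: restoring punctuation changes only punctuation, whitespace, and casing, so both versions collapse to the same string once those are stripped and the case is lowered. The helper name `normalize` and the two example strings are assumptions for illustration; the character class is the same one used by the helper function defined further down.

```python
import re

def normalize(s):
    # Drop punctuation and whitespace, then lowercase, so only the letters remain
    return re.sub(r"[.,?!\-:\s]", "", s).lower()

raw_line = "Golden nugget number one time is your most valuable and scarcest resource"
corrected = "Golden nugget number one: time is your most valuable and scarcest resource."

# Both versions reduce to the same letter string
print(normalize(raw_line) == normalize(corrected))
```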

In [200]:
test_line = entries[0][2]
print(test_line)

Golden nugget number one time is your most valuable and scarcest resource


In [201]:
#helper function

def remove_punctuation_and_whitespaces(string):
    string = re.sub(r"[.,?!\-:\s]", "", string)
    return string

print(remove_punctuation_and_whitespaces(test_line))

Goldennuggetnumberonetimeisyourmostvaluableandscarcestresource


We also want the timestamps in seconds, which makes calculations easier.

In [202]:
#define a timestamp to test it
test_timestamp = entries[0][1]
print(test_timestamp)

00:01:23


In [203]:
from datetime import datetime

#helper function
def get_seconds(time_string):
    time_format = "%H:%M:%S"
    time = datetime.strptime(time_string, time_format)
    total_seconds = time.hour * 3600 + time.minute * 60 + time.second
    return total_seconds

#print test result
print(get_seconds(test_timestamp))

83


### Now we are ready to define our function that turns a transcript into a line dict.

In [204]:
def extract_entries(text):
    pattern = r'\[(\d+:\d+:\d+) -> (\d+:\d+:\d+)\] (.+)'
    entries = re.findall(pattern, text)
    result = {}
    # We count letters cumulatively, so we can later look up the index where a sentence starts and find the matching line.
    cum_letters = 0
    for i, entry in enumerate(entries):
        # Use a separate name for the line text so the function parameter is not shadowed
        timestamp_start, timestamp_end, line_text = entry
        # We remove all punctuation and whitespace so we can match the lines with the sentences from the corrected version of the text.
        letters = remove_punctuation_and_whitespaces(line_text)
        cum_letters += len(letters)
        result[i] = {
            "timestamp_start": get_seconds(timestamp_start),
            "timestamp_end": get_seconds(timestamp_end),
            "text": line_text,
            "letters": letters,
            "number_of_letters": len(letters),
            "cum_number_of_letters": cum_letters
        }
    return result

Let's see the result.

In [205]:
line_dict = extract_entries(text)
line_dict

{0: {'timestamp_start': 79,
  'timestamp_end': 83,
  'text': 'Golden nugget number one time is your most valuable and scarcest resource',
  'letters': 'Goldennuggetnumberonetimeisyourmostvaluableandscarcestresource',
  'number_of_letters': 62,
  'cum_number_of_letters': 62},
 1: {'timestamp_start': 83,
  'timestamp_end': 84,
  'text': 'Hey',
  'letters': 'Hey',
  'number_of_letters': 3,
  'cum_number_of_letters': 65},
 2: {'timestamp_start': 84,
  'timestamp_end': 89,
  'text': '1440 is the number that can change your life and that number is the number of minutes',
  'letters': '1440isthenumberthatcanchangeyourlifeandthatnumberisthenumberofminutes',
  'number_of_letters': 69,
  'cum_number_of_letters': 134},
 3: {'timestamp_start': 89,
  'timestamp_end': 93,
  'text': 'We all have in a single day and while you know, most of the people that I interviewed',
  'letters': 'WeallhaveinasingledayandwhileyouknowmostofthepeoplethatIinterviewed',
  'number_of_letters': 67,
  'cum_number_of_lett

Looks good! Now we look into fixing the punctuation so that we can separate sentences.

In [206]:
from sentencepiece import SentencePieceProcessor
#deepmultilingualpunctuation is a transformer based model. So run this on the GPU if possible
from deepmultilingualpunctuation import PunctuationModel
punc_model = PunctuationModel()



In [207]:
def remove_timestamps(sentence):
    return re.sub(r'\[\d{2}:\d{2}:\d{2} -> \d{2}:\d{2}:\d{2}\]', '', sentence)

In [209]:
raw_text

" Golden nugget number one time is your most valuable and scarcest resource\n Hey\n 1440 is the number that can change your life and that number is the number of minutes\n We all have in a single day and while you know, most of the people that I interviewed\n They're not all doing the same 15 secrets either. The common thread was that they always spoke about\n minutes in the value of time and when you truly realize\n What like just how valuable a single minute is I mean money we can lose it and make it back again our health\n We can get sick and get healthy again\n Time once it's gone. It is gone time is life"

In [208]:
#TODO: check whether it is significantly faster to iterate over the dict to get the texts instead of using regex again
raw_text = remove_timestamps(text)
cor_text = punc_model.restore_punctuation(raw_text)
cor_text


"Golden nugget number one: time is your most valuable and scarcest resource. Hey, 1440 is the number that can change your life, and that number is the number of minutes We all have in a single day. and while you know most of the people that I interviewed, They're not all doing the same 15 secrets either. The common thread was that they always spoke about minutes in the value of time and when you truly realize What, like just how valuable a single minute is. I mean money. we can lose it and make it back again. our health: We can get sick and get healthy again. Time- once it's gone, It is gone. time is life."

Looks much better! Now let's check out the sentences.

In [210]:
from langdetect import detect_langs
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser

def seperate_sentences(text):
    ''' Use sumy to split a given text into separate sentences.
    Sumy needs to know the language of the text to work correctly. We detect it with langdetect.
    We only want English or German text, so we skip everything else.
    '''
    lang = detect_langs(text)

    if lang[0].lang == "en":
        parser = PlaintextParser.from_string(text, Tokenizer('english'))

    elif lang[0].lang == "de":
        parser = PlaintextParser.from_string(text, Tokenizer('german'))
    else:
        print("Not German or English, this text needs to be skipped")
        return []

    sentences = parser.document.sentences
    # We know the first letter of each sentence needs to be upper case, so we ensure that.
    # Note: str.capitalize() also lowercases the rest of the sentence (e.g. "I" becomes "i").
    result_sentences = [str(sentence).capitalize() for sentence in sentences]
    return result_sentences

In [211]:
sentences = seperate_sentences(cor_text)
sentences

['Golden nugget number one: time is your most valuable and scarcest resource.',
 'Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.',
 "And while you know most of the people that i interviewed, they're not all doing the same 15 secrets either.",
 'The common thread was that they always spoke about minutes in the value of time and when you truly realize what, like just how valuable a single minute is.',
 'I mean money.',
 'We can lose it and make it back again.',
 'Our health: we can get sick and get healthy again.',
 "Time- once it's gone, it is gone.",
 'Time is life.']

That looks really good! Let's write a function that combines the steps for getting sentences from a raw text with timestamps.

In [212]:
def get_clean_sentences(text_with_timestamps,punc_model):
    raw_text = remove_timestamps(text_with_timestamps)
    cor_text = punc_model.restore_punctuation(raw_text)
    sentences = seperate_sentences(cor_text)
    return sentences

In [213]:
get_clean_sentences(text,punc_model)

['Golden nugget number one: time is your most valuable and scarcest resource.',
 'Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.',
 "And while you know most of the people that i interviewed, they're not all doing the same 15 secrets either.",
 'The common thread was that they always spoke about minutes in the value of time and when you truly realize what, like just how valuable a single minute is.',
 'I mean money.',
 'We can lose it and make it back again.',
 'Our health: we can get sick and get healthy again.',
 "Time- once it's gone, it is gone.",
 'Time is life.']

Now we can try to find the timestamps for a test sentence. To do so, we:
- convert the sentence into a string of only letters
- do the same with the full text
- find the start and end index of the sentence within the full text

In [214]:
# Test sentence with index 1: it starts at line 1 and ends in the middle of line 3
test_sentence = str(sentences[1])
test_letters = remove_punctuation_and_whitespaces(test_sentence)
print(test_sentence)
print(test_letters)

Hey, 1440 is the number that can change your life, and that number is the number of minutes we all have in a single day.
Hey1440isthenumberthatcanchangeyourlifeandthatnumberisthenumberofminutesweallhaveinasingleday


In [215]:
all_letters = ""
for i in line_dict:
    all_letters += line_dict[i]["letters"]
print(all_letters)    

GoldennuggetnumberonetimeisyourmostvaluableandscarcestresourceHey1440isthenumberthatcanchangeyourlifeandthatnumberisthenumberofminutesWeallhaveinasingledayandwhileyouknowmostofthepeoplethatIinterviewedThey'renotalldoingthesame15secretseitherThecommonthreadwasthattheyalwaysspokeaboutminutesinthevalueoftimeandwhenyoutrulyrealizeWhatlikejusthowvaluableasingleminuteisImeanmoneywecanloseitandmakeitbackagainourhealthWecangetsickandgethealthyagainTimeonceit'sgoneItisgonetimeislife


In [None]:
index_start_letter = all_letters.lower().index(test_letters.lower())
index_end_letter = index_start_letter +  len(test_letters)

print("start index: ", index_start_letter)
print("end index: ", index_end_letter)

### Now we need to find the start index and the end index in the line dict.
- Per text piece, we collect the following information:
    - start_time_first_line: the start time of the line that contains the beginning of the text piece
    - end_time_last_line: the end time of the line that contains the end of the text piece
    - interpolated_start_time: the estimated start time of the sentence, calculated by interpolation
    - interpolated_end_time: the estimated end time of the sentence, calculated by interpolation
- The interpolated start and end times are probably not necessary, as each line only contains a few seconds of text.
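
To make the interpolation concrete, here is a worked example on a single line, using the values of line 2 from `line_dict` above (84 s to 89 s, 69 letters, 65 letters before it). It assumes a constant speaking rate within the line, which is only an approximation. The function name `interpolate_within_line` is hypothetical and only serves this illustration.

```python
def interpolate_within_line(start, end, n_letters, letters_before, target_index):
    """Estimate the time that corresponds to a global letter index inside one line."""
    seconds_per_letter = (end - start) / n_letters
    # Position of the target relative to the first letter of this line
    offset_in_line = target_index - letters_before
    return start + offset_in_line * seconds_per_letter

# Line 2 of the example: 84 s -> 89 s, 69 letters, preceded by 65 letters.
# A sentence starting at global letter index 65 starts exactly at the line start.
print(interpolate_within_line(84, 89, 69, 65, 65))
# An end index of 134 (the cumulative count after line 2) lands at the line's end time.
print(interpolate_within_line(84, 89, 69, 65, 134))
```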

In [None]:
import numpy as np

# The idea of the following code:
# - iterate over the lines until we find the line containing the start index
# - keep iterating until we find the line containing the end index
# - return the timestamps of these lines
# - we also call interpolate_timestamp for the start and end line to get a more precise estimate
# - interpolate_timestamp works by calculating the average time per character


def get_timestamps(dictionary, start_index, end_index):
    start_time_first_line = None
    end_time_last_line = None
    # Initialise the interpolated values too, so the return cannot fail if no line matches
    interpolated_start_time = None
    interpolated_end_time = None

    for index, entry in dictionary.items():
        if entry['cum_number_of_letters'] >= start_index + 1 and start_time_first_line is None:
            start_time_first_line, interpolated_start_time = interpolate_timestamp(entry, start_index, "start")

        if entry['cum_number_of_letters'] >= end_index and end_time_last_line is None:
            end_time_last_line, interpolated_end_time = interpolate_timestamp(entry, end_index, "end")

        # Compare against None: a start time of 0 seconds is valid but falsy
        if start_time_first_line is not None and end_time_last_line is not None:
            break

    return start_time_first_line, end_time_last_line, interpolated_start_time, interpolated_end_time


def interpolate_timestamp(entry, target_letters, start_or_end):
    try:
        letters_per_second = entry['number_of_letters'] / (entry['timestamp_end'] - entry['timestamp_start'])
    except ZeroDivisionError:
        # The line has zero duration, so no rate can be computed
        letters_per_second = np.nan
    # We need to find the position inside the line. To do so, we calculate how many letters were used before the line.
    letters_before_line = entry['cum_number_of_letters'] - entry['number_of_letters']
    elapsed_seconds = (target_letters - letters_before_line) / letters_per_second
    if start_or_end == "start":
        start_time_of_line = entry['timestamp_start']
        interpolated_start_time = entry['timestamp_start'] + elapsed_seconds
        return start_time_of_line, interpolated_start_time
    elif start_or_end == "end":
        end_time_of_line = entry['timestamp_end']
        interpolated_end_time = entry['timestamp_start'] + elapsed_seconds
        return end_time_of_line, interpolated_end_time




In [None]:
start_time_first_line, end_time_last_line, interpolated_start_time, interpolated_end_time = get_timestamps(line_dict,index_start_letter, index_end_letter)
print("start_time_first_line: ", start_time_first_line)
print("end_time_last_line; ", end_time_last_line)
print("interpolated_start_time: ", interpolated_start_time)
print("interpolated_end_time: ", interpolated_end_time)

### The results look good! Let's turn this into a simple function.

In [None]:
def find_timestamps_of_text(original_text, query_text, line_dict=None):
    # The line dict can be created outside the function if we search for multiple texts in the same original text
    if line_dict is None:
        line_dict = extract_entries(original_text)

    # Concatenate the letters of all lines
    all_letters = "".join(entry["letters"] for entry in line_dict.values())
    query_letters = remove_punctuation_and_whitespaces(query_text)
    try:
        index_start_letter = all_letters.lower().index(query_letters.lower())
    except ValueError:
        # str.index raises ValueError if the query is not found; show both strings for debugging
        print("all letters : ", all_letters)
        print("query_letters : ", query_letters)
        raise
    index_end_letter = index_start_letter + len(query_letters)

    return get_timestamps(line_dict, index_start_letter, index_end_letter)
    


In [None]:
text

In [None]:
sentences

In [None]:
find_timestamps_of_text(text,sentences[2])

### Now we need a function that picks a random piece of text within a given token limit. Let's first check how to count the tokens.
- First we load the tokenizer of the model and try it out to get a feeling for it.
- Then we need a function that takes a text and returns a subtext that is close to the maximum number of tokens.
- Then we create a function that picks random samples from a given text.

In [None]:
# If you want to use a model with a different tokenizer, change the tokenizer here
from transformers import LlamaTokenizer, LlamaForCausalLM

model_path = 'openlm-research/open_llama_7b'

llama_tokenizer = LlamaTokenizer.from_pretrained(model_path)

In [None]:
short_text = "Hello, how are you? I hope you're doing well."

# Tokenize the text
tokens = llama_tokenizer.tokenize(short_text)

# Get the first n tokens
n = 5  # Number of tokens you want to extract
first_n_tokens = tokens[:n]

print(first_n_tokens)

Looks good! Now we can turn to our main task. Let's get a bigger text to work with.

In [None]:
def read_file_to_string(file_path):
    try:
        with open(file_path, 'r') as file:
            content = file.read()
            return content
    except FileNotFoundError:
        print(f"File '{file_path}' not found.")
        return ""
    except IOError:
        print(f"Error reading file '{file_path}'.")
        return ""


In [None]:
bigger_text = read_file_to_string("data/txts/-5e2tPyKNdw.txt")
bigger_text

In [None]:
import time
start_time = time.time()
bigger_line_dict = extract_entries(bigger_text)
sentences_bigger_text = get_clean_sentences(bigger_text,punc_model)
print(sentences_bigger_text)
print(time.time()-start_time)

This works well, but it takes some time. We should do this once for every text and save the result, but that is not a task for this notebook.
- The bottleneck is the punctuation correction.
- If we do this for more data than just the benchmark, we may need to look into making it faster.
- Maybe we do not need punctuation correction for every text.
- Maybe the punctuation is better if we switch to the original Whisper or quantize less.
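
One way to test the idea that not every text needs punctuation correction is a cheap heuristic that checks how much sentence-ending punctuation a transcript already contains. The sketch below is an assumption, not a validated method: the function name `needs_punctuation_fix` and the threshold of one sentence end per 40 words are made up for illustration and would need tuning on real transcripts.

```python
import re

def needs_punctuation_fix(text, max_words_per_sentence_end=40):
    """Return True if the text has suspiciously few sentence-ending marks."""
    n_words = len(text.split())
    n_sentence_ends = len(re.findall(r"[.?!]", text))
    if n_words == 0:
        return False  # nothing to fix in an empty text
    if n_sentence_ends == 0:
        return True   # no sentence boundaries at all
    # Hypothetical threshold: too many words per sentence end suggests missing punctuation
    return n_words / n_sentence_ends > max_words_per_sentence_end

unpunctuated = "time once it is gone it is gone time is life " * 10
punctuated = "Time, once it is gone, is gone. Time is life. " * 10
print(needs_punctuation_fix(unpunctuated), needs_punctuation_fix(punctuated))
```

The check is meant to run on text whose timestamps have already been removed, since the bracketed timestamps would otherwise be counted as words.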

In [None]:
def get_text_within_token_limit(sentences, tokenizer, token_limit):
    result = []
    cur_token_len = 0

    for sentence in sentences:
        new_token_len= len(tokenizer.tokenize(sentence))
        if cur_token_len + new_token_len <= token_limit:
            result.append(sentence)
            cur_token_len += new_token_len
        else:
            break


    return result, cur_token_len

In [None]:
result, cur_token_len = get_text_within_token_limit(sentences_bigger_text,llama_tokenizer, 400)
result, cur_token_len

Looks good. Let's get the correct timestamps.

In [None]:
# Join the list of sentences into a single string
result_str = " ".join(result)
result_str

In [None]:
find_timestamps_of_text(bigger_text,result_str)

In [None]:
bigger_text

In [None]:
extract_entries(bigger_text)

Let's combine all of that in a function that takes a text and returns random subpieces of it.

In [None]:
bigger_line_dict = extract_entries(bigger_text)


In [None]:
import numpy as np
import os

def create_pieces_of_text(filepath, punc_model, chance = 0.15, max_token_limit = 400):
    start_time = time.time()
    text = read_file_to_string(filepath)
    sentences = get_clean_sentences(text,punc_model)
    print("time to clean sentences: ", time.time()-start_time)
    line_dict = extract_entries(text)
    results = []
    filename = os.path.basename(filepath)
    # Iterate over all sentences (not over the characters of the raw text). With the given chance, each sentence can be the start of an input_text.
    for sentence_index, sentence in enumerate(sentences):
        if np.random.random() < chance:
            input_sentences, number_of_tokens = get_text_within_token_limit(sentences[sentence_index:],llama_tokenizer, max_token_limit)
            input_text = " ".join(input_sentences)
            start_time_first_line, end_time_last_line, interpolated_start_time, interpolated_end_time = find_timestamps_of_text(text,input_text, line_dict=line_dict)
            results.append({"text":filename,
                            "input_text": input_text, 
                            "number_of_tokens": number_of_tokens,
                            "start_time_first_line": start_time_first_line, 
                            "end_time_last_line":end_time_last_line,
                            "interpolated_start_time": interpolated_start_time,
                            "interpolated_end_time": interpolated_end_time})
    print("time to do the rest: ", time.time()-start_time)
    return results


In [None]:
results = create_pieces_of_text("data/txts/-5e2tPyKNdw.txt",punc_model)
results

Most of the time is spent cleaning the sentences, because that step uses the punctuation model. Since we want to create a lot of different training data in the future, it makes sense to clean the sentences first, save them, and then work with the already cleaned sentences. We will do this now.

In [None]:
import os

def get_file_paths(directory_path):
    file_paths = []
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            file_paths.append(os.path.join(root, file))
    return file_paths


In [None]:
list_of_files = get_file_paths("data/txts/")

In [None]:
import pickle

# This code takes some time, so it is commented out to keep it from being run by accident

# for index,f in enumerate(list_of_files):
#     if not "raw" in f:
#         #print(index)
#         try:
#             text_data = read_file_to_string(f)
#             sentences_of_text_data = get_clean_sentences(text_data,punc_model)
#             filename = os.path.basename(f)
#             new_filename = filename.replace(".txt", "_sentences.txt")
#             with open('data/sentences/' + new_filename, 'wb') as file:
#                 pickle.dump(sentences_of_text_data, file)
#         except Exception as e:
#             print(e)





Now we write a new version of 'create_pieces_of_text' that loads the sentences from the pickle file. We still need the original transcript file to get the timing right. In the future, we might want to use dictionaries for the sentence files, so that all information is already there.
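
A possible shape for such dictionary-based sentence files is sketched below: each sentence is stored together with its timestamps, and the list is written as JSON. The field names, the helper names `save_sentence_records` and `load_sentence_records`, and the example values are assumptions, not the format used in this notebook.

```python
import json
import os
import tempfile

def save_sentence_records(records, path):
    # JSON keeps the file human-readable, unlike pickle
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

def load_sentence_records(path):
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical records; in practice the times would come from find_timestamps_of_text
records = [
    {"sentence": "I mean money.", "start_time_first_line": 106, "end_time_last_line": 112},
    {"sentence": "Time is life.", "start_time_first_line": 114, "end_time_last_line": 117},
]

# Round trip through a temporary directory to show that nothing is lost
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_sentences.json")
    save_sentence_records(records, path)
    loaded = load_sentence_records(path)

print(loaded == records)
```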

In [None]:
def create_pieces_of_text(filepath,sentence_dir, punc_model, chance = 0.15, max_token_limit = 400):
    start_time = time.time()
    text = read_file_to_string(filepath)

    #sentences = get_clean_sentences(text,punc_model)

    line_dict = extract_entries(text)
    results = []
    filename = os.path.basename(filepath).replace(".txt", "_sentences.txt")
    id = filename.replace("_sentences.txt", "")
    id = id.replace(".mp3", "") #artifact from small bug in code
    sentence_file_path = os.path.join(sentence_dir, filename)
    if os.path.exists(sentence_file_path):
        with open(sentence_file_path, 'rb') as file:
            sentences = pickle.load(file)
    else:
        return []
    #iterate over all sentences. With the given chance each sentence can be the start of an input_text. 
    for sentence_index, sentence in enumerate(sentences):
        if np.random.random() < chance:
            input_sentences, number_of_tokens = get_text_within_token_limit(sentences[sentence_index:],llama_tokenizer, max_token_limit)
            input_text = " ".join(input_sentences)
            start_time_first_line, end_time_last_line, interpolated_start_time, interpolated_end_time = find_timestamps_of_text(text,input_text, line_dict=line_dict)
            results.append({"id":id,
                            "input_text": input_text, 
                            "number_of_tokens": number_of_tokens,
                            "start_time_first_line": start_time_first_line, 
                            "end_time_last_line":end_time_last_line,
                            "interpolated_start_time": interpolated_start_time,
                            "interpolated_end_time": interpolated_end_time})
    return results

In [227]:
start_time = time.time()
piece_of_text_single_file = create_pieces_of_text("data/txts/-5e2tPyKNdw.txt","data/sentences/",punc_model, max_token_limit=1500)
print(time.time()-start_time)
piece_of_text_single_file

0.5032508373260498


[{'id': '-5e2tPyKNdw',
  'input_text': "Now you can always expect a question from emergence of sociology in all alternate years. So if this year there is a question, next year there will be a question like that. So this is actually a favorite area of the examiner. So we need to prepare this topic completely and if you know this topic- emergence of sociology- you can easily understand the other topics like scope of sociology, sociology and other sciences, the comparison and some of the other topics also which are related to this particular topic. So we need to understand emergence of sociology. Now what are the questions you can get from this? Let's start from the questions. The first question itself: you will be getting a short note: emergence of sociology. So you can get a short note like 10 marker- emergence of sociology directly. Or the second way in which they will ask you this question is impact of modernity on the emergence of sociology. So impact of modernity and social changes 

Seems to work! Let's do this for all the files.

In [229]:
pieces_of_text = []
for index,f in enumerate(list_of_files):
    if not "raw" in f:
        try:
            pieces_of_text += create_pieces_of_text(f,"data/sentences/",punc_model,max_token_limit=1500)
        except Exception as e:
            print(e)


In [231]:
len(pieces_of_text)

9879

In [None]:
len(results)

In [None]:
results

In [None]:
import random
random.shuffle(results)
tmp = random.choice(results)

In [None]:
tmp["input_text"]

In [None]:
system_role = """ You are a helpful assistant who extracts informations from texts a json format. 

Your answer should always look like this. 

{
  "Required Skills": [
    "SKILL1",
      "SKILL2", ....
  ],
  "Taught Skills": [
        "SKILL1",
      "SKILL2", ....
  ]
} """


len(llama_tokenizer.tokenize(system_role))

In [None]:
test_answer = '''{
  "Required ESCO Skills": [
    "Reading Comprehension",
    "Analytical Thinking",
    "Historical Knowledge",
    "Sociological Knowledge"
  ],
  "Taught ESCO Skills": [
    "Understanding Historical Periods",
    "Identifying Dimensions of Modernity",
    "Analyzing Social Changes"
  ]
}'''

len(llama_tokenizer.tokenize(test_answer))

In [None]:
import os

import openai
openai.api_type = "azure"
openai.api_base = "https://andreastestressource.openai.azure.com/"
openai.api_version = "2023-05-15"
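# Read the key from the environment instead of hardcoding it in the notebook.
# The variable name AZURE_OPENAI_API_KEY is an assumption; use whatever name your setup defines.
openai.api_key = os.environ["AZURE_OPENAI_API_KEY"]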
deployment_id="AndreasTestBereitstellung"


In [None]:
def get_gpt_answer(system_role, text):
    try:
        response = openai.ChatCompletion.create(
            deployment_id=deployment_id, # engine = "deployment_name".
            messages=[
                {"role": "system", "content": system_role},
                {"role": "user", "content": text}
            ]
        )
    except Exception as e:
        print(f"Error occurred: {e}")
        response = f"Error occurred: {e}"

    return response
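
Because the model is asked to answer in JSON, the reply can be parsed and checked before it is used as training output. The following is a minimal sketch under the assumption that the reply content is a JSON object with the two keys from the system prompt; the helper name `parse_skill_answer` is hypothetical, and replies that do not parse are simply returned as `None`.

```python
import json

EXPECTED_KEYS = ("Required Skills", "Taught Skills")

def parse_skill_answer(content):
    """Parse the reply text; return a dict, or None if it is not usable."""
    try:
        parsed = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(parsed, dict):
        return None
    # Require both keys, each holding a list
    if not all(isinstance(parsed.get(key), list) for key in EXPECTED_KEYS):
        return None
    return parsed

good = '{"Required Skills": [], "Taught Skills": ["Teaching", "Mentoring"]}'
bad = "Sorry, I cannot answer that."
print(parse_skill_answer(good))
print(parse_skill_answer(bad))
```

For a response object `a` returned by the function above, the text to parse would be `a["choices"][0]["message"]["content"]`.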

In [None]:
answers = []
for r in results[:30]:
    answers.append(get_gpt_answer(system_role,r["input_text"]))

In [185]:
answers

[<OpenAIObject chat.completion id=chatcmpl-7eQExqMehloaRKtIKNtgfySipiklo at 0x7f154c51c4a0> JSON: {
   "choices": [
     {
       "finish_reason": "stop",
       "index": 0,
       "message": {
         "content": "{\n  \"Required Skills\": [],\n  \"Taught Skills\": [\n        \"Teaching\",\n        \"Mentoring\",\n        \"Career Development\",\n        \"Communication\",\n        \"Patience\"\n  ]\n}",
         "role": "assistant"
       }
     }
   ],
   "created": 1689868343,
   "id": "chatcmpl-7eQExqMehloaRKtIKNtgfySipiklo",
   "model": "gpt-35-turbo",
   "object": "chat.completion",
   "usage": {
     "completion_tokens": 42,
     "prompt_tokens": 455,
     "total_tokens": 497
   }
 },
 <OpenAIObject chat.completion id=chatcmpl-7eQEzidlwBli9vjPBW3zUnCHhK3AQ at 0x7f154c51c450> JSON: {
   "choices": [
     {
       "finish_reason": "stop",
       "index": 0,
       "message": {
         "content": "{\n  \"Required Skills\": [],\n  \"Taught Skills\": [\n    \"Self-awareness\",\n   

In [195]:
def print_pretty(text, max_line_length=80):
    if len(text) <= max_line_length:
        print(text)
    else:
        pretty_text = ""
        words = text.split()
        current_line = ""

        for word in words:
            if len(current_line) + len(word) + 1 <= max_line_length:
                current_line += word + " "
            else:
                pretty_text += current_line.strip() + "\n"
                current_line = word + " "

        if current_line:
            pretty_text += current_line.strip()

        print(pretty_text)

In [196]:
for r,a in zip(results,answers):
    if type(a) is str:
        continue
    print_pretty(r["input_text"])
    print(a["choices"][0]["message"]["content"])
    print("____________________________")
    

I am nowhere near that right now and the other piece is just the intermediary
piece of just being able to teach and build people's careers and help them kind
of get going, which is what i just i love doing. It's why i like to go to
university so much and talk to students and do presentations. I just love
teaching. It's my absolute favorite thing one, because i just nerd out and get
real detailed about things and then i find out people just don't know things
which i just assume you know and then it just gets me all excited to explain
and teach it. It's also great to to see these in the long run. So i've had some
students i've worked with- again not professionally, but those on the youtube
channel- and they've had just a lot of questions and it's a detailed and it's
very to the point and their intellectual questions and they're smart. And then
i see these individuals and they graduate and they leave and they go into the
industry and then they're starting to get these really good jobs and

In [218]:

for r in results:
    tmp_tokens = llama_tokenizer.tokenize(r["input_text"])
    print(len(tmp_tokens))

394
391
382
395
388
285
396
348
398
343
392
399
210
395
376
196
384
367
399
395
384
400
64
377
286
389
316
373
387
395
375
386
394
255
389
396
391
386
383
370
400
356
304
380
378
399
384
392
376
397
381
385
397
376
384
393
391
383
400
390
379
348
376
398
391
371
394
363
294
386
398
389
384
392
179
392
384
398
398
363
39
400
368
397
390
393
360
380
211
384
385
398
395
388
394
398
377
394
396
394
148
397
371
400
185
392
371
356
366
367
400
393
378
398
366
391
394
386
394
395
346
393
364
340
398
400
100
256
383
20
399
399
393
360
365
387
377
397
397
390
350
399
390
383
393
363
399
375
390
368
372
397
382
381
379
398
400
393
398
388
399
399
394
373
385
387
364
387
392
393
400
382
359
356
363
133
394
285
373
400
389
395
310
369
389
338
398
376
367
399
376
366
376
381
390
398
399
386
398
395
380
396
396
395
373
391
388
388
390
393
94
386
87
391
380
342
373
398
373
286
385
392
389
384
400
384
386
362
396
365
379
392
387
370
397
345
393
346
396
383
383
389
379
381
378
397
398
393
398
397
387
3

KeyboardInterrupt: 

In [216]:
tmp_tokens = llama_tokenizer.tokenize('''
I am nowhere near that right now and the other piece is just the intermediary
piece of just being able to teach and build people's careers and help them kind
of get going, which is what i just i love doing. It's why i like to go to
university so much and talk to students and do presentations. I just love
teaching. It's my absolute favorite thing one, because i just nerd out and get
real detailed about things and then i find out people just don't know things
which i just assume you know and then it just gets me all excited to explain
and teach it. It's also great to to see these in the long run. So i've had some
students i've worked with- again not professionally, but those on the youtube
channel- and they've had just a lot of questions and it's a detailed and it's
very to the point and their intellectual questions and they're smart. And then
i see these individuals and they graduate and they leave and they go into the
industry and then they're starting to get these really good jobs and they're
learning more and more and more. And then they write me a letter, an email or
the message on linkedin- like dimitri, this one- to thank you for helping me
with. You know my career and stuff. I'm here, i'm doing this and explain
everything that's going on and they thank me for my time for helping train them
and teach them. That is the most rewarding piece of everything. Out of the
podcast, the youtube channel, it's really seeing like the impact of it, right,
they're doing all the work, just like when i have employees working for me.
They're doing all the work, but it's it's the teaching and the guidance piece.
That's just really like. It just makes me excited.


''')

In [217]:
len(tmp_tokens)

426

In [219]:
results

[{'id': 'PcH8DopiJEc',
  'input_text': "I am nowhere near that right now and the other piece is just the intermediary piece of just being able to teach and build people's careers and help them kind of get going, which is what i just i love doing. It's why i like to go to university so much and talk to students and do presentations. I just love teaching. It's my absolute favorite thing one, because i just nerd out and get real detailed about things and then i find out people just don't know things which i just assume you know and then it just gets me all excited to explain and teach it. It's also great to to see these in the long run. So i've had some students i've worked with- again not professionally, but those on the youtube channel- and they've had just a lot of questions and it's a detailed and it's very to the point and their intellectual questions and they're smart. And then i see these individuals and they graduate and they leave and they go into the industry and then they're st

In [222]:
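def get_costs(token_array, cost_per_1000 = 0.002, answer_length_estimate = 0.01 ):
    # Cost of the given token counts at a flat price per 1000 tokens.
    # Note: answer_length_estimate is currently not used in the calculation.
    return sum(token_array) /1000 * cost_per_1000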

In [223]:
len(results)

9679

In [233]:
9879 * 2 * 0.002

39.516
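
The back-of-the-envelope estimate above can be written as a small function. This is a sketch under assumptions: it treats every request as having the same total token count (prompt plus answer) and uses the same flat price per 1,000 tokens as the cell above. The function name `estimate_total_cost` is hypothetical, and actual pricing may differ by model and between prompt and completion tokens.

```python
def estimate_total_cost(n_requests, tokens_per_request, cost_per_1000=0.002):
    """Flat-rate estimate: total tokens / 1000 * price per 1000 tokens."""
    total_tokens = n_requests * tokens_per_request
    return total_tokens / 1000 * cost_per_1000

# 9879 pieces of text, assumed to use roughly 2000 tokens each (prompt plus answer)
print(round(estimate_total_cost(9879, 2000), 3))
```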