Problem: 
- Transcripts (plus prompt and output text) usually exceed the maximum acceptable LLM context length.

Strategy: 

1. Split the transcript into chunks whose length is below (but probably close to) the model's maximum context length minus the context needed for the prompt and the expected output.
    - How do we choose split locations that lose as little context as possible?
      - With a sufficiently large context size, splitting at almost exactly the suggested word should be fine.
2. Get the LLM output for each chunk.
3. Combine the outputs of all chunks into a single output (using a plain function, i.e. no further LLM call).
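
The three steps above can be sketched as a small pipeline. This is only an illustration under assumptions: `split_words`, `process_transcript`, `run_llm` and `combine` are hypothetical names that do not appear elsewhere in this notebook, and words stand in for tokens to keep the sketch free of dependencies.

```python
from typing import Callable, List


def split_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Words stand in for tokens here; a real version would count tokens.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def process_transcript(
    text: str,
    run_llm: Callable[[str], str],        # hypothetical: one LLM call per chunk
    combine: Callable[[List[str]], str],  # plain function, no LLM call
    max_words: int,
) -> str:
    chunks = split_words(text, max_words)   # step 1: split
    outputs = [run_llm(c) for c in chunks]  # step 2: one output per chunk
    return combine(outputs)                 # step 3: merge without an LLM


def fake_llm(chunk: str) -> str:
    """Stand-in for a model call; it only reports the chunk length."""
    return f"{len(chunk.split())} words"


result = process_transcript("a b c d e", fake_llm, " | ".join, max_words=2)
print(result)  # 2 words | 2 words | 1 words
```

Passing the combine step in as a function keeps it swappable (concatenation, deduplication, merging of structured outputs) without touching the per-chunk calls.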


Required Parameters:
- maximum acceptable context length (may be lower than the model's actual context length, depending on output quality and memory constraints)
- prompt context size
- maximum expected (or allowed) output size
- model (determines which tokenizer to use)
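
As a rough sketch of how these parameters fit together, the token budget left for transcript text is the acceptable context length minus the prompt size and the reserved output size. The class name `ChunkingParams`, its field names and the example numbers are assumptions made for illustration; none of them come from this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingParams:
    max_context_tokens: int  # maximum acceptable context length
    prompt_tokens: int       # tokens taken up by the prompt
    max_output_tokens: int   # tokens reserved for the model output
    tokenizer_name: str      # the model decides which tokenizer counts the tokens

    def chunk_budget(self) -> int:
        """Tokens left for transcript text in a single LLM call."""
        budget = self.max_context_tokens - self.prompt_tokens - self.max_output_tokens
        if budget <= 0:
            raise ValueError("prompt and output leave no room for transcript text")
        return budget


# Illustrative numbers only.
params = ChunkingParams(
    max_context_tokens=4096,
    prompt_tokens=500,
    max_output_tokens=1000,
    tokenizer_name="mistralai/Mistral-7B-Instruct-v0.2",
)
print(params.chunk_budget())  # 2596
```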



### 1. Demonstration

In [1]:
# get data index

import pandas as pd

data = pd.read_csv("../scraping/6_filtered_videos_final/filtered_index_sorted_avg_channel_views.csv", sep=";")
data.head()

      uploader_id     video_id  view_count yt_video_type  channel_avg_view_count
0  @JennyHoyosLOL  i2bUeO1ID30  63002469.0         short               7569842.0
1  @JennyHoyosLOL  VvEBCXHx-74  48360277.0         short               7569842.0
2  @JennyHoyosLOL  CEdnanNgS3k  39446140.0         short               7569842.0
3  @JennyHoyosLOL  jOc1XfFNJTo  30411496.0         short               7569842.0
4  @JennyHoyosLOL  Gs0QiMVkUAw  29434395.0         short               7569842.0


In [2]:
# sample some videos (we want long examples!)
sample = data[data.yt_video_type == "video"].sample(2)
sample

          uploader_id     video_id  view_count yt_video_type  channel_avg_view_count
43292  @MarketMobster  BK1CsZf4oOo     17929.0         video            20870.965517
3579      @AndreiJikh  8mmSW8G_oYo    573854.0         video           499474.370192


In [3]:
# load transcripts for the sampled videos

transcript_path = "../scraping/5_transcripts_and_metadata/transcripts_csvs"
transcripts = {}
for _, row in sample.iterrows():
    video_id = row["video_id"]
    uploader_id = row["uploader_id"]
    print(f"Loading transcript for video {video_id}")
    transcripts[video_id] = pd.read_csv(f"{transcript_path}/{uploader_id}_{video_id}.csv", sep=";")


    

Loading transcript for video BK1CsZf4oOo
Loading transcript for video 8mmSW8G_oYo


In [4]:
# convert transcript dataframes to single strings
# (drop NaN lines first, since " ".join fails on non-string values)
transcript_texts = {}
for video_id, transcript in transcripts.items():
    transcript_texts[video_id] = " ".join(transcript.text.dropna().astype(str))

In [5]:
# load mistral tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [6]:
# tokenize transcript texts and get lengths
for video_id, transcript_text in transcript_texts.items():
    tokenized = tokenizer(transcript_text)
    print(f"{'-'*20}{video_id}{'-'*20}")
    print(f"{len(tokenized['input_ids'])} tokens")
    # also get number of characters and words
    print(f"{len(transcript_text)} characters")
    print(f"{len(transcript_text.split())} words")
    

--------------------BK1CsZf4oOo--------------------
2889 tokens
12210 characters
2378 words
--------------------8mmSW8G_oYo--------------------
3501 tokens
15305 characters
2813 words


In [7]:
# helper for display
def print_readable(text_to_print):
    words = text_to_print.split()
    # print 20 words per line
    for i in range(0, len(words), 20):
        print(" ".join(words[i:i+20]))

In [8]:
# chunk and display samples
from data_prep_utils import get_text_chunks

chunks = {}
for video_id, transcript_text in transcript_texts.items():
    print(f"{'-'*20}{video_id}{'-'*20}")
    chunks[video_id] = get_text_chunks(transcript_text, tokenizer)
    for chunk in chunks[video_id]:
        print_readable(chunk)
        print("\n\n")


--------------------BK1CsZf4oOo--------------------
hey everyone we're going to talk about silica it's been a long time so hi everyone welcome to the channel
if you are new please do subscribe the bell button leave a like and comment it is saturday and i'm
doing a video so make sure you have smashed that like button so i want to say thank you for
the support over the lives throughout the last few weeks and obviously growth in the channel if you are not
subscribed make sure you do because there's quite a lot of people that are not which is obviously nice that
you come back but i'd like it to remain mine forever which always helps so silica it's had a big
moon we'll go over in terms of price in terms of usd on the beautiful world of coin market cap
you can see big massive move a lot of people are getting a little bit anxious with it over the
last few weeks i've not been talking about it because it's been consolidating it's been pretty boring but that's good
good in a bull market a consol

### 2. Perform chunking for entire dataset

We proceed as follows:
- load the transcript CSVs and build one string per transcript
- split the transcript strings into chunks
- store the chunks in a dataframe with 3 columns: ``video_id``, ``chunk_number``, ``chunk_text``
  - chunks are numbered 1, 2, 3, ... within each video
- save the dataframe to a single CSV file
  - The CSV data we start from is only a little over 1 GB (including timestamp data), so the single file should be < 1 GB, which is workable (and more convenient than multiple files).
  - the filename includes the chunking parameters and data info
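
The splitting step relies on `get_text_chunks` from `data_prep_utils`, whose implementation is not shown here. The following is therefore not that function but a minimal sketch of one common approach: fixed-size windows over token ids, with consecutive windows sharing a few tokens. The name `split_token_ids` and the reading of "overlap" as a number of shared tokens are assumptions.

```python
from typing import List, Sequence


def split_token_ids(ids: Sequence[int], max_tokens: int, overlap: int) -> List[List[int]]:
    """Split token ids into windows of at most max_tokens ids.

    Consecutive windows share `overlap` ids (assumed meaning of "overlap").
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(ids):
        chunks.append(list(ids[start:start + max_tokens]))
        if start + max_tokens >= len(ids):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks


print(split_token_ids(list(range(10)), max_tokens=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With a Hugging Face tokenizer, the ids could come from `tokenizer(text, add_special_tokens=False)["input_ids"]` and each window could be turned back into text with `tokenizer.decode(chunk)`. Note that decoded text does not always re-tokenize to exactly the same ids, so chunk lengths are best re-checked after decoding.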




In [9]:
import pandas as pd

########################################################################################################
# data source folders <- adjust
index_path = "../scraping/6_filtered_videos_final"
transcripts_path = "../scraping/5_transcripts_and_metadata/transcripts_csvs"
# chunking parameters <- adjust
max_chunk_tokens = 2048
overlap = 50
# tokenizer <- adjust
tokenizer_hf_model = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer_filename_str = "Mistral"
# load index
index = pd.read_csv(f"{index_path}/filtered_index_sorted_avg_channel_views.csv", sep=";")
print(f"Loaded index with {len(index)} video ids.")
# result file saving info <- adjust
result_filename = f"transcript_chunks_nvids{len(index)}_chunksize{max_chunk_tokens}_overlap{overlap}_tok{tokenizer_filename_str}.csv"
result_path = "../data/transcript_chunks"
########################################################################################################


Loaded index with 45968 video ids.


In [11]:
from data_prep_utils import clean_text, get_text_chunks
from transformers import AutoTokenizer

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_hf_model)

# initialize result lists
video_ids = []
chunk_numbers = []
chunk_texts = []

# iterate over videos
for i, (video_id, uploader_id) in enumerate(zip(index.video_id, index.uploader_id)):

    # load transcript csv
    transcript_csv = pd.read_csv(f"{transcripts_path}/{uploader_id}_{video_id}.csv", sep=";")
    # convert transcript to single string 
    # (apparently there are NaN lines we need to filter out first)
    transcript_lines = [line for line in transcript_csv.text if isinstance(line, str)]
    transcript_text = " ".join(transcript_lines)
    # clean
    transcript_text = clean_text(transcript_text)
    
    # get chunks and store them
    if len(transcript_text) == 0:
        print(f"Empty transcript for video {video_id}. Skipping.")
    else:
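        # NOTE: max_chunk_tokens and overlap (set above) are not passed here, so
        # get_text_chunks runs with its own defaults; the result filename is only
        # accurate if those defaults match the values above.
        chunks = get_text_chunks(transcript_text, tokenizer)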
        # store data in result lists
        for chunk_number, chunk in enumerate(chunks, start=1):
            video_ids.append(video_id)
            chunk_numbers.append(chunk_number)
            chunk_texts.append(chunk)

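    # print progress (i is 0-based, so this reports the index of the last processed transcript)
    if (i+1) % 1000 == 0:
        print(f"Processed transcript {i} of {len(index)}")
        # __sizeof__ of a list covers only the list object itself, not the strings it holds
        print(f"texts memory: {chunk_texts.__sizeof__()}")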

print(f"{'-'*40}\nFinished processing {len(index)} transcripts.")

# save in dataframe/csv
chunks_df = pd.DataFrame({
    "video_id": video_ids,
    "chunk_number": chunk_numbers,
    "chunk_text": chunk_texts,
})

chunks_df.to_csv(f"{result_path}/{result_filename}", sep=";", index=False)

Processed transcript 999 of 45968
texts memory: 8840
Processed transcript 1999 of 45968
texts memory: 18216
Processed transcript 2999 of 45968
texts memory: 37192
Processed transcript 3999 of 45968
texts memory: 47144
Empty transcript for video 84wsLwCvNxo. Skipping.
Processed transcript 4999 of 45968
texts memory: 59720
Processed transcript 5999 of 45968
texts memory: 75656
Processed transcript 6999 of 45968
texts memory: 95848
Processed transcript 7999 of 45968
texts memory: 107880
Processed transcript 8999 of 45968
texts memory: 121416
Processed transcript 9999 of 45968
texts memory: 136616
Processed transcript 10999 of 45968
texts memory: 153736
Processed transcript 11999 of 45968
texts memory: 153736
Processed transcript 12999 of 45968
texts memory: 173000
Processed transcript 13999 of 45968
texts memory: 194664
Processed transcript 14999 of 45968
texts memory: 194664
Processed transcript 15999 of 45968
texts memory: 194664
Processed transcript 16999 of 45968
texts memory: 219048


### 3. Check results

In [12]:
import pandas as pd 

# drop references to large variables so their memory can be reclaimed
for var in ["video_ids", "chunk_numbers", "chunk_texts", "chunks_df"]:
    globals().pop(var, None)

# load saved file
filepath = "../data/transcript_chunks/transcript_chunks_nvids45968_chunksize2048_overlap50_tokMistral.csv"
#filepath = f"{result_path}/{result_filename}"
chunks_df = pd.read_csv(filepath, sep=";")
print(f"Loaded {len(chunks_df)} transcript chunks for {len(chunks_df.video_id.unique())} videos.")
print(f"df memory: {chunks_df.memory_usage(deep=True).sum() / 1024**3:.2f} GB")

Loaded 80176 transcript chunks for 45967 videos.
df memory: 0.46 GB


In [13]:
chunks_df.head()

      video_id  chunk_number                                         chunk_text
0  i2bUeO1ID30             1  my grandma thinks Christmas is expensive so I'...
1  VvEBCXHx-74             1  you can find golden dirt this is a 25 bag of d...
2  CEdnanNgS3k             1  one dollar chicken sandwich now Chick-fil-A ha...
3  jOc1XfFNJTo             1  Logan Paul made from Prime apparently over 100...
4  Gs0QiMVkUAw             1  two dollar pumpkin spice lattes apparently you...


In [14]:
chunks_per_video = chunks_df.groupby("video_id").size()
chunks_per_video.describe()

count    45967.000000
mean         1.744208
std          0.900603
min          1.000000
25%          1.000000
50%          2.000000
75%          2.000000
max          7.000000
dtype: float64
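
A further sanity check, sketched here under assumptions, is to confirm that the chunk numbers of every video are exactly 1, 2, ..., n with no gaps or repeats. The helper name `find_numbering_gaps` is hypothetical, and it works on plain `(video_id, chunk_number)` pairs so that it does not depend on pandas.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_numbering_gaps(rows: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Return the videos whose chunk numbers are not exactly 1, 2, ..., n."""
    numbers = defaultdict(list)
    for video_id, chunk_number in rows:
        numbers[video_id].append(int(chunk_number))
    return {
        video_id: sorted(nums)
        for video_id, nums in numbers.items()
        if sorted(nums) != list(range(1, len(nums) + 1))
    }


# Toy data: video "b" is missing chunk 2.
rows = [("a", 1), ("a", 2), ("b", 1), ("b", 3)]
print(find_numbering_gaps(rows))  # {'b': [1, 3]}
```

On the loaded data this could be called as `find_numbering_gaps(zip(chunks_df.video_id, chunks_df.chunk_number))`; an empty dict would mean the numbering is consistent.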

In [15]:
# check the few empty transcripts

# video_ids in index but not in chunks_df
index = pd.read_csv("../scraping/6_filtered_videos_final/filtered_index_sorted_avg_channel_views.csv", sep=";")
missing_ids = set(index.video_id) - set(chunks_df.video_id)
print(f"Check transcripts for ids: {missing_ids}")



Check transcripts for ids: {'84wsLwCvNxo'}


In [16]:
index = pd.read_csv("../scraping/6_filtered_videos_final/filtered_index_sorted_avg_channel_views.csv", sep=";")
# check for duplicate video_ids
duplicates = index.video_id[index.video_id.duplicated()]
print(f"Found {len(duplicates)} duplicate video_ids.")
duplicates

Found 0 duplicate video_ids.


Series([], Name: video_id, dtype: object)