**YouTube Cantonese Subtitle Parsing Metrics**

Welcome to YTCantoParse! I built this mostly for fun, but if it helps you at all in your learning I will be so glad!

I've split the Colab into two sections. The first section is for analyzing a single video. It is not integrated with the spreadsheet.

The second section recommends new videos based on either known/unknown words or frequency analysis.

Both sections have optional parameters depending on whether you use Migaku. If there are any issues with the Colab or you have questions, please contact chrisjwest99@gmail.com.

In [None]:
#@title Option A) - Analyze a Single Video
#@markdown Use this section if you would like to analyze a YouTube video that you already have a link for.

#@markdown By default, we assume you simply want basic frequency analysis with a standard parser and without Migaku learning status integration. Using your own parser or frequency list is a bit more involved; see the FAQ at the bottom.

#@markdown Using Migaku learning status is simple. First, check the UseMigaku box below and run the cell. Then download your Yue wordlist from Migaku settings and rename it to "learning.json". Finally, upload it into the Colab filesystem under the "Files" tab on the left, replacing the current "learning.json" file.
UseMigaku = True #@param {type:"boolean"}

### Downloading files for parsing ###

!gdown --id 1WF2FkidAVMLKjdtOf0lWb21d2mYa7-06
!gdown --id 1PWglfvAwhN9CDBjJiqPt_DkZr132bFAE

if (UseMigaku):
  !gdown --id 151b1fyxMtodhsy2EIgQPUjEo3QyVXoCC

#ParserPath = '' #@param {type:"string"}



Downloading...
From: https://drive.google.com/uc?id=1WF2FkidAVMLKjdtOf0lWb21d2mYa7-06
To: /content/parser.csv
100% 415k/415k [00:00<00:00, 121MB/s]
Downloading...
From: https://drive.google.com/uc?id=1PWglfvAwhN9CDBjJiqPt_DkZr132bFAE
To: /content/freq.json
100% 361k/361k [00:00<00:00, 95.6MB/s]
Downloading...
From: https://drive.google.com/uc?id=151b1fyxMtodhsy2EIgQPUjEo3QyVXoCC
To: /content/learning.json
100% 157k/157k [00:00<00:00, 96.2MB/s]


In [None]:
#@title Run this code to complete set-up

import csv

### Import the Parsing Data ###

parser_path = "/content/parser.csv"
with open(parser_path, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
# We convert to a set for fast lookup, keeping only the first column
# (the word itself) and skipping any blank rows
setParser = {row[0] for row in data if row}


### Define our Longest-Substring-Match Parser ###
# This uses a simple double-loop structure. Takes a long time
# for longer videos but is usually fast-enough and
# is very simple.
def parseInput(inputString):
  parsedOutput = []
  i = 0
  while (i<len(inputString)):
    longestSub = i+1
    for ii in range(i+1,i+21):
      if (inputString[i:ii] in setParser):
        longestSub = ii
    parsedOutput.append(inputString[i:longestSub])
    i = longestSub
  return parsedOutput


### Import the Frequency Data ###

import json

freq_path = "/content/freq.json"

# Opening JSON file
with open(freq_path) as json_file:
    data = json.load(json_file)
    numList = list(range(len(data)))
    freq_dict = dict(zip(data, numList))


### OPTIONAL: Import Migaku Learning Status ###
# We use a dictionary here for fast lookup. If UseMigaku is off,
# learning.json may not have been downloaded, so we skip the import
# and every word is treated as unknown.

words_path = "/content/learning.json"

word_dict = {}
if UseMigaku:
  with open(words_path) as json_file:
    data = json.load(json_file)
  for entry in data:
    # Each entry is [word, status]; keep only the text before "◴"
    wordOnly = entry[0].split("◴")[0]
    word_dict[wordOnly] = entry[1]

### Importing Youtube extension for grabbing subtitles ###
!pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

print("Success")

Success
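
The longest-substring-match parser defined in the set-up cell can be tried on a toy vocabulary to see what it does. The sketch below is only an illustration: `longest_match` and `toy_vocab` are made-up names, not part of the notebook, which uses `parseInput` and `setParser`. It assumes the same look-ahead of up to 20 characters.

```python
def longest_match(text, vocab, max_len=20):
    """Greedy longest-match segmentation: at each position take the longest
    vocabulary entry starting there, else fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        end = i + 1  # fallback: one character
        # Try every candidate length and remember the longest hit
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            if text[i:j] in vocab:
                end = j
        tokens.append(text[i:end])
        i = end
    return tokens


toy_vocab = {"我", "鍾意", "食", "點心"}  # hypothetical vocabulary
print(longest_match("我鍾意食點心", toy_vocab))
```

Characters that are not in the vocabulary come out as single-character tokens, which is why the analysis cell later keeps only tokens that are present in the parser set.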


In [None]:
#@title Paste in the youtube video URL and run this cell to analyze

VideoPath = 'https://youtube.com/watch?v=uiN8jfEtEN4' #@param {type:"string"}
video_id = VideoPath.split("v=")[1]
video_id = video_id.split("&")[0]

### OLD WAY TO FIND SUBS - PLEASE IGNORE ###

### NOTE: This is not ideal but I use a brute force way to
#         get past the fact that videos could have any number
#         of language codes. This can fail if it uses a special code.
#customLanguageCode = "zh-HK"

#try:
#  srt = YouTubeTranscriptApi.get_transcript(video_id, languages=['zh-HK'])
#except:
#  try:
#    srt = YouTubeTranscriptApi.get_transcript(video_id, languages=['yue-HK'])
#  except:
#      try:
#        srt = YouTubeTranscriptApi.get_transcript(video_id, languages=['yue'])
#      except:
#        try:
#          srt = YouTubeTranscriptApi.get_transcript(video_id, languages=[customLanguageCode])
#        except:
#          raise Exception("Error in fetching subtitles: Language code for this video not found. Please find language codec for this video (ex. zh-HK) and set under 'customLanguageCode' in code for this cell.")

### NEW WAY TO FIND SUBS ###

codecs = YouTubeTranscriptApi.list_transcripts(video_id)

# Search for correct cc. Stop at the first transcript that looks Cantonese
# (contains 係 or 佢); otherwise alltext would be overwritten by whichever
# track is listed last
alltext = ""
for codec in codecs:
  srt = YouTubeTranscriptApi.get_transcript(video_id, languages=[codec.language_code])
  # Combine all of the subtitles into one string
  alltext = "".join(line['text'] for line in srt)
  if (('係' in alltext) or ('佢' in alltext)):
    finalCodec = codec
    break


# Parse text into tokens
parsed = parseInput(alltext)

# Counters to track metrics
knownTokensNum = 0
learningTokensNum = 0
unknownTokensNum = 0
totalFreqSum = 0
unknownAndSeenFreqSum = 0

# Filter only those tokens which are in our parser
parseFiltered = []
for i in range(len(parsed)):
  if (parsed[i] in setParser):
    parseFiltered.append(parsed[i])

maxFreq = len(freq_dict)
for i in range(len(parseFiltered)):

  # Total Frequency metrics
  word = parseFiltered[i]
  freq = freq_dict.get(word, maxFreq)
  totalFreqSum += freq

  # Word learning status metrics
  knownStatus = word_dict.get(word, -1)
  if (knownStatus == 1):
    learningTokensNum+=1
  elif (knownStatus == 2):
    knownTokensNum+=1
  else:
    unknownTokensNum+=1

  # Learning status frequency metrics
  if ((knownStatus == 1) or (knownStatus == -1)):
    freq = freq_dict.get(word, maxFreq)
    unknownAndSeenFreqSum += freq

totalTokens = learningTokensNum+knownTokensNum+unknownTokensNum
ulNum = learningTokensNum+unknownTokensNum

print("Average Word Frequency: " + str(round(totalFreqSum / (len(parseFiltered)), 3)))

if UseMigaku:
  print("Percentage known words: " + str(round(knownTokensNum/totalTokens, 3)))
  print("Percentage learning words: " + str(round(learningTokensNum/totalTokens,3)))
  print("Percentage unknown words: " + str(round(unknownTokensNum/totalTokens, 3)))
  print("Average Word Frequency for Learning or Unknown Words: " + str(round(unknownAndSeenFreqSum/ulNum, 3)))


Average Word Frequency: 2310.526
Percentage known words: 0.559
Percentage learning words: 0.289
Percentage unknown words: 0.152
Average Word Frequency for Learning or Unknown Words: 4821.193
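
As a worked illustration of how these numbers are derived, the sketch below recomputes the same kind of metrics on made-up data. Everything here is hypothetical: `video_metrics`, `toy_freq` (word to rank, where 0 is the most frequent), `toy_status` (1 = learning, 2 = known, matching the codes used in the cell above), and the token list. A word missing from the frequency list gets the rank `len(freq)`, as in the cell above.

```python
def video_metrics(tokens, freq, status):
    """Average frequency rank plus known/learning/unknown fractions."""
    max_rank = len(freq)  # rank assigned to words missing from the list
    ranks = [freq.get(t, max_rank) for t in tokens]
    states = [status.get(t, -1) for t in tokens]  # -1 = not in the wordlist
    # Ranks of tokens that are still being learned (1) or unseen (-1)
    ul_ranks = [r for r, s in zip(ranks, states) if s in (1, -1)]
    total = len(tokens)
    return {
        "avg_rank": sum(ranks) / total,
        "known": states.count(2) / total,
        "learning": states.count(1) / total,
        "unknown": (total - states.count(1) - states.count(2)) / total,
        "avg_rank_ul": sum(ul_ranks) / len(ul_ranks) if ul_ranks else 0.0,
    }


toy_freq = {"我": 0, "係": 1, "食": 2}
toy_status = {"我": 2, "係": 2, "食": 1}
toy_tokens = ["我", "係", "食", "點心", "我"]

metrics = video_metrics(toy_tokens, toy_freq, toy_status)
print(metrics)
```

Note that the "percentage" values are fractions between 0 and 1, and that a lower average rank means the video uses more common words.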


In [None]:
#@title Option B) - Recommend a video from Youtube
#@markdown Use this section if you would like to be recommended a video from YouTube based on your learning preferences.

#@markdown Specifically, I use rtveitch's Cantonese spreadsheet, which can be found here:

#@markdown https://docs.google.com/spreadsheets/d/1CmN8GPalrb45YFIPrWgh7GRYyoUhnizEOImY6kAW82w

#@markdown As in Option A), you can specify whether to use Migaku when calculating metrics. Using Migaku learning status is simple. First, check the UseMigaku box below, set your other options, and run the cell. Then download your Yue wordlist from Migaku settings and rename it to "learning.json". Finally, upload it into the Colab filesystem under the "Files" tab on the left, replacing the current "learning.json" file.

UseMigaku = True #@param {type:"boolean"}

#@markdown Below are the preferences for how videos are selected for you. Since the spreadsheet has roughly 5,000 Canto-subbed videos, we randomly sample videos from the sheet and compute metrics only for that sample. This also hopefully means you aren't recommended the same videos every time. In addition, YouTube limits the number of times I can access its API every second, and sampling helps stay below that limit.

SampleSize = 10 #@param {type:"slider", min:0, max:100, step:1}

!gdown --id 1WF2FkidAVMLKjdtOf0lWb21d2mYa7-06
!gdown --id 1PWglfvAwhN9CDBjJiqPt_DkZr132bFAE
!gdown --id 1pOgV1mLyHkkfAIX39iqddG8mGBssN9jY
#!gdown --id 1CmN8GPalrb45YFIPrWgh7GRYyoUhnizEOImY6kAW82w/edit#gid=396865486

if (UseMigaku):
  !gdown --id 151b1fyxMtodhsy2EIgQPUjEo3QyVXoCC

#ParserPath = '' #@param {type:"string"}



Downloading...
From: https://drive.google.com/uc?id=1WF2FkidAVMLKjdtOf0lWb21d2mYa7-06
To: /content/parser.csv
100% 415k/415k [00:00<00:00, 76.9MB/s]
Downloading...
From: https://drive.google.com/uc?id=1PWglfvAwhN9CDBjJiqPt_DkZr132bFAE
To: /content/freq.json
100% 361k/361k [00:00<00:00, 24.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1pOgV1mLyHkkfAIX39iqddG8mGBssN9jY
To: /content/CantoneseVideosSheet.csv
100% 3.78M/3.78M [00:00<00:00, 176MB/s]
Downloading...
From: https://drive.google.com/uc?id=151b1fyxMtodhsy2EIgQPUjEo3QyVXoCC
To: /content/learning.json
100% 157k/157k [00:00<00:00, 79.5MB/s]


In [None]:
#@title Run this code to complete set-up

import csv

### Import the Video Spreadsheet ###

videos_path = "/content/CantoneseVideosSheet.csv"
with open(videos_path, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)

# Skip the header row and keep only rows whose ninth column (index 8) is 'Y'
newData = []
for i in range(1,len(data)):
  if(data[i][8] == 'Y'):
    newData.append(data[i])

### Import the Parsing Data ###

parser_path = "/content/parser.csv"
with open(parser_path, newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
# We convert to a set for fast lookup, keeping only the first column
# (the word itself) and skipping any blank rows
setParser = {row[0] for row in data if row}


### Define our Longest-Substring-Match Parser ###
# This uses a simple double-loop structure. Takes a long time
# for longer videos but is usually fast-enough and
# is very simple.
def parseInput(inputString):
  parsedOutput = []
  i = 0
  while (i<len(inputString)):
    longestSub = i+1
    for ii in range(i+1,i+21):
      if (inputString[i:ii] in setParser):
        longestSub = ii
    parsedOutput.append(inputString[i:longestSub])
    i = longestSub
  return parsedOutput


### Import the Frequency Data ###

import json

freq_path = "/content/freq.json"

# Opening JSON file
with open(freq_path) as json_file:
    data = json.load(json_file)
    numList = list(range(len(data)))
    freq_dict = dict(zip(data, numList))


### OPTIONAL: Import Migaku Learning Status ###
# We use a dictionary here for fast lookup. If UseMigaku is off,
# learning.json may not have been downloaded, so we skip the import
# and every word is treated as unknown.

words_path = "/content/learning.json"

word_dict = {}
if UseMigaku:
  with open(words_path) as json_file:
    data = json.load(json_file)
  for entry in data:
    # Each entry is [word, status]; keep only the text before "◴"
    wordOnly = entry[0].split("◴")[0]
    word_dict[wordOnly] = entry[1]

!pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

print("Success")

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.4.4-py3-none-any.whl (22 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.4.4
Success


In [None]:
#@title Run this code to generate results
import tqdm
from tqdm import trange

import random
indicies = random.sample(range(len(newData)), SampleSize)

allGlobalFreqs = []
allUnknownLearningFreqs = []
allKnownPercentages = []
allLearingPercentages = []
allUnknownPercentages = []

for i in trange(len(indicies)):

  newIndex = indicies[i]
  video_id = newData[newIndex][2]
  codecs = YouTubeTranscriptApi.list_transcripts(video_id)

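  # Stop at the first transcript that looks Cantonese (contains 係 or 佢);
  # otherwise alltext would be overwritten by whichever track is listed last
  alltext = ""
  for codec in codecs:
    srt = YouTubeTranscriptApi.get_transcript(video_id, languages=[codec.language_code])
    alltext = "".join(line['text'] for line in srt)
    if (('係' in alltext) or ('佢' in alltext)):
      finalCodec = codec
      break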

  # Parse text into tokens
  parsed = parseInput(alltext)

  # Counters to track metrics
  knownTokensNum = 0
  learningTokensNum = 0
  unknownTokensNum = 0
  totalFreqSum = 0
  unknownAndSeenFreqSum = 0

  # Filter only those tokens which are in our parser
  parseFiltered = []
  for ii in range(len(parsed)):
    if (parsed[ii] in setParser):
      parseFiltered.append(parsed[ii])

  maxFreq = len(freq_dict)
  for ii in range(len(parseFiltered)):

    # Total Frequency metrics
    word = parseFiltered[ii]
    freq = freq_dict.get(word, maxFreq)
    totalFreqSum += freq

    # Word learning status metrics
    knownStatus = word_dict.get(word, -1)
    if (knownStatus == 1):
      learningTokensNum+=1
    elif (knownStatus == 2):
      knownTokensNum+=1
    else:
      unknownTokensNum+=1

    # Learning status frequency metrics
    if ((knownStatus == 1) or (knownStatus == -1)):
      freq = freq_dict.get(word, maxFreq)
      unknownAndSeenFreqSum += freq

  totalTokens = learningTokensNum+knownTokensNum+unknownTokensNum
  ulNum = learningTokensNum+unknownTokensNum

  avgWordFreq = round(totalFreqSum / (len(parseFiltered)), 3)
  avgUnknownLearningWordFreq = round(unknownAndSeenFreqSum/ulNum, 3)
  percentKnown = round(knownTokensNum/totalTokens, 3)
  percentLearning = round(learningTokensNum/totalTokens, 3)
  percentUnknown = round(unknownTokensNum/totalTokens, 3)

  allGlobalFreqs.append(avgWordFreq)
  allUnknownLearningFreqs.append(avgUnknownLearningWordFreq)
  allKnownPercentages.append(percentKnown)
  allLearingPercentages.append(percentLearning)
  allUnknownPercentages.append(percentUnknown)

100%|██████████| 10/10 [00:13<00:00,  1.37s/it]


In [None]:
#@title Run this code to display and sort results
SortBy = 'Known Word Percentage [Migaku Only]' #@param ["Global Word Frequency", "Learning/Unknown Word Frequency [Migaku Only]", "Known Word Percentage [Migaku Only]"]

# Table package for nice prints
!pip install tabulate
from tabulate import tabulate

# Clumsy way to sort based on the metric
if (SortBy == "Global Word Frequency"):
  allGlobalFreqs2, allKnownLearningFreqs2, allKnownPercentages2, allLearningPercentages2, allUnknownPercentages2, indicies2 = zip(*sorted(zip(allGlobalFreqs, allUnknownLearningFreqs, allKnownPercentages, allLearingPercentages, allUnknownPercentages, indicies)))
elif (SortBy == "Learning/Unknown Word Frequency [Migaku Only]"):
  assert UseMigaku==True, "ERROR: You set UseMigaku as False. Either change to true in Option B or else change the sorting criteria"
  allKnownLearningFreqs2, allGlobalFreqs2, allKnownPercentages2, allLearningPercentages2, allUnknownPercentages2, indicies2 = zip(*sorted(zip(allUnknownLearningFreqs, allGlobalFreqs, allKnownPercentages, allLearingPercentages, allUnknownPercentages, indicies)))
else:
  assert UseMigaku==True, "ERROR: You set UseMigaku as False. Either change to true in Option B or else change the sorting criteria"
  allKnownPercentages2, allGlobalFreqs2, allKnownLearningFreqs2, allLearningPercentages2, allUnknownPercentages2, indicies2 = zip(*sorted(zip(allKnownPercentages, allGlobalFreqs, allUnknownLearningFreqs, allLearingPercentages, allUnknownPercentages, indicies),reverse=True))

# Clumsy way to write data to tables
print("Index  URL AvgFreq")
newData2 = []
for i in range(len(indicies2)):
  newIndex = indicies2[i]
  if UseMigaku:
    newData2.append([i, "https://youtube.com/watch?v=" + newData[newIndex][2], allGlobalFreqs2[i], allKnownLearningFreqs2[i], allKnownPercentages2[i], allLearningPercentages2[i], allUnknownPercentages2[i]])
  else:
    newData2.append([i, "https://youtube.com/watch?v=" + newData[newIndex][2], allGlobalFreqs2[i]])

# Clumsy way to assign headers lol
if UseMigaku:
  header = ["Index", "URL", "Global Frequency", "Average UL Frequency", "KPercentage", "LPercentage", "UPercentage"]
else:
  header = ["Index", "URL", "Global Frequency"]

print(tabulate(newData2, headers=header))

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Index  URL AvgFreq
  Index  URL                                        Global Frequency    Average UL Frequency    KPercentage    LPercentage    UPercentage
-------  ---------------------------------------  ------------------  ----------------------  -------------  -------------  -------------
      0  https://youtube.com/watch?v=IjgP9DjaXFU             1376.25                 6289.08          0.847          0.109          0.045
      1  https://youtube.com/watch?v=hGHQHx_l2zA             1789.23                 7253.29          0.788          0.118          0.094
      2  https://youtube.com/watch?v=UxIXbvlHsFI             2885.96                 8413.73          0.725          0.151          0.124
      3  https://youtube.com/watch?v=KjTOg0DoRBA             2316.36                 5748.55          0.646          0.21           0.143
      4  https://youtube.com/watch?v=OpjvRMyoAAs      

# FAQ

**How do I use my own Parser or Frequency list?**

This is not too hard, I hope. First, run the first cell of Option A or Option B, depending on what you are parsing. This will import the default parsing and frequency files (and the optional Migaku learning list file). Take note of the format of the frequency and parser files. Your frequency list should be fine as-is if it uses the Migaku format, but the parser file needs some simple reformatting, since the Colab expects a CSV file containing just the word itself on each line. I suggest a simple replace macro that keeps each line only up to the first space character. Then rename the result to a CSV file and upload it, overwriting the default files with your own. I will probably try to get this to work with the native u8 file in the future, but this was a lot easier.
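
If you would rather not use an editor macro, the same "keep everything up to the first space" reformatting could be done with a few lines of Python. This is only a sketch under assumptions: `words_only` and `convert_parser_file` are hypothetical helpers, the sample lines are invented, and the input file is assumed to be UTF-8 text with one entry per line.

```python
import csv


def words_only(lines):
    """Keep the text before the first space on each line; skip blank lines."""
    words = []
    for line in lines:
        word = line.strip().split(" ", 1)[0]
        if word:
            words.append(word)
    return words


def convert_parser_file(src_path, dst_path):
    """Write a one-column CSV (word only) from a space-separated source file."""
    with open(src_path, encoding="utf-8") as src:  # encoding is an assumption
        words = words_only(src)
    with open(dst_path, "w", newline="", encoding="utf-8") as dst:
        csv.writer(dst).writerows([w] for w in words)


# Invented sample lines, just to show the effect
sample_lines = ["點心 dim2 sam1 /dim sum/", "", "食 sik6 /to eat/"]
print(words_only(sample_lines))
```

Calling `convert_parser_file` with your own source file and an output name such as `parser.csv` would then give a file you can upload over the default one.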

**Help! I still don't know how to import my own Migaku Data!**

No problem! Here are careful step-by-step instructions.

1.   Go to your Migaku Browser Extension settings and set them to your language.
2.   Go to the Learning Status tab.
3.   Click "Export Backup".
4.   A JSON file will download. Rename it to learning.json.
5.   Run the first block of either Option A or Option B to import the other files. Click the folder icon on the left side of the screen (it will say "Files"). You will see the CSV and JSON files there, including one called "learning.json".
6.   Right-click, choose upload, and overwrite "learning.json" with your own learning data. Then run the rest of the code blocks and you are done!

NOTE: Every time you run the first cell of either Option A or B, it overwrites all the files again, and you will have to re-upload your own. Sorry!
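
To check that an uploaded file has the shape the set-up cell expects, it can help to see how the entries are turned into a lookup table. The sketch below assumes, as the set-up code does, that the file is a list of `[word, status]` entries and that anything after a "◴" in the word is discarded. `build_word_dict` and the sample entries are hypothetical; the text after "◴" in the sample is invented.

```python
from collections import Counter


def build_word_dict(entries):
    """Map each word (text before "◴") to its learning status."""
    word_dict = {}
    for entry in entries:
        word_dict[entry[0].split("◴")[0]] = entry[1]
    return word_dict


# Invented sample in the assumed [word, status] shape (1 = learning, 2 = known)
sample_entries = [["食◴sik6", 1], ["我", 2], ["點心◴dim2 sam1", 2]]
sample_dict = build_word_dict(sample_entries)

print(sample_dict)
print(Counter(sample_dict.values()))  # how many words per status
```

If the status counts for your own file look nothing like your Migaku statistics, the file is probably not the export the notebook expects.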

**I got a strange error message (429) involving YouTube**

This happens when you make too many API requests to YouTube. Try turning down your Sample Size and then waiting for a while before trying again.
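
Waiting can also be automated. The sketch below retries a call with exponentially growing pauses; it is a generic illustration, not something the notebook does. `with_backoff` and its parameters are hypothetical, and it catches every exception because which exception the transcript library raises on a 429 is not assumed here.

```python
import time


def with_backoff(fetch, retries=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay, 2x, 4x, ... and try again."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake request that fails twice, then succeeds.
# The sleep function is replaced so the demo records delays instead of waiting.
attempts = []
recorded_delays = []


def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("429 Too Many Requests")
    return "ok"


result = with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the notebook, the wrapped call would be the transcript request, for example `with_backoff(lambda: YouTubeTranscriptApi.get_transcript(video_id, languages=[codec.language_code]))`.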

**I have further questions or don't understand something**

Feel free to contact me on Discord (TofuChris) or else via email (chrisjwest99@gmail.com).

Please enjoy!

