YouTube Caption API Python example converted to Python 3

A successful attempt at extracting closed-caption information from Siraj Raval's ML videos. The ultimate goal of this extraction is to build a Sirajbot so that we can all have our own personal Siraj.

The original code is found in the download call for this API:

To run this code, you need to get OAuth2 credentials from Google:

Usage examples:

These examples extract the closed caption information from this video on creating a chatbot:

python --action=list --videoid=t5qgjJIBy9g
Caption track '(CTcNv09WGJMRU2JMppJBW2SFXPfsrJtllhu4z_DJ_fQ=)' in 'en' language.
Created and managed caption tracks.
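
For readers curious how a line like the one printed by `--action=list` might be produced, here is a minimal sketch. It assumes the script receives a response shaped like the YouTube Data API v3 `captions.list` result (an `items` list whose entries carry `id`, `snippet.name` and `snippet.language`). The helper name `format_caption_tracks` and the sample dictionary are hypothetical and are not taken from this repository; the empty track name is inferred from the output shown above.

```python
from typing import Dict, List


def format_caption_tracks(response: Dict) -> List[str]:
    """Turn a captions.list-style response into printable lines.

    Assumption: `response` has the shape {"items": [{"id": ..., "snippet":
    {"name": ..., "language": ...}}, ...]}. This helper is illustrative only.
    """
    lines = []
    for item in response.get("items", []):
        snippet = item["snippet"]
        lines.append(
            f"Caption track '{snippet['name']}({item['id']})' "
            f"in '{snippet['language']}' language."
        )
    return lines


# Hand-written stand-in for an API response (no network call is made here).
fake_response = {
    "items": [
        {
            "id": "CTcNv09WGJMRU2JMppJBW2SFXPfsrJtllhu4z_DJ_fQ=",
            "snippet": {"name": "", "language": "en"},
        }
    ]
}

for line in format_caption_tracks(fake_response):
    print(line)
```

Keeping the formatting separate from the API call makes it easy to check the output without OAuth2 credentials.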

python --action=download --videoid=t5qgjJIBy9g --captionid=CTcNv09WGJMRU2JMppJBW2SFXPfsrJtllhu4z_DJ_fQ= First line of caption track: b"1\n00:00:00,000 --> 00:00:04,529\nhello world its Suraj and let's build a\n\n2\n00:00:02,429 --> 00:00:07,140\nchat bot that can answer questions about\n\n3\n00:00:04,529 --> 00:00:09,540\nany text you give it it an article or\n\n4\n00:00:07,140 --> 00:00:11,790\neven a book using care off just imagine\n\n5\n00:00:09,540 --> 00:00:13,920\nthe boost in productivity all of us will\n\n6\n00:00:11,790 --> 00:00:16,350\nhave once we have access to expert\n\n7\n00:00:13,920 --> 00:00:18,029\nsystems for any given topic instead of\n\n8\n00:00:16,350 --> 00:00:20,250\nsifting through all the jargon in a\n\n9\n00:00:18,029 --> 00:00:22,230\nscientific paper you just give it the\n\n10\n00:00:20,250 --> 00:00:25,230\npaper then ask it the relevant questions\n\n11\n00:00:22,230 --> 00:00:27,779\nentire textbooks libraries videos images\n\n12\n00:00:25,230 --> 00:00:30,240\nwhatever you just feed it some data and\n\n13\n00:00:27,779 --> 00:00:32,279\nit would become an expert at it all 7\n\n14\n00:00:30,240 --> 00:00:34,530\nbillion people on earth would have the\n\n15\n00:00:32,279 --> 00:00:36,360\ncapability of learning anything much\n\n16\n00:00:34,530 --> 00:00:39,270\nfaster the web democratize information\n\n17\n00:00:36,360 --> 00:00:41,910\nand this next evolution will democratize\n\n18\n00:00:39,270 --> 00:00:43,530\nsomething just as important guidance the\n\n19\n00:00:41,910 --> 00:00:46,170\nideal chat box and talk intelligently\n\n20\n00:00:43,530 --> 00:00:48,750\nabout any domain that's the Holy Grail\n\n21\n00:00:46,170 --> 00:00:51,090\nbut domain-specific chat bots are\n\n22\n00:00:48,750 --> 00:00:53,789\ndefinitely possible the technical term\n\n23\n00:00:51,090 --> 00:00:55,800\nfor this is a question answering system\n\n24\n00:00:53,789 --> 00:00:58,739\nsurprisingly we've been able to do this\n\n25\n00:00:55,800 --> 
00:01:00,390\nsince way back in the 70s lunar was one\n\n26\n00:00:58,739 --> 00:01:02,489\nof the first it was as you might have\n\n27\n00:01:00,390 --> 00:01:04,860\nguessed rule-based so it allowed\n\n28\n00:01:02,489 --> 00:01:07,439\ngeologists to ask questions about moon\n\n29\n00:01:04,860 --> 00:01:09,270\nrocks from the Apollo missions a later\n\n30\n00:01:07,439 --> 00:01:11,729\nimprovement to rule-based Q&A systems\n\n31\n00:01:09,270 --> 00:01:13,680\nallowing programmers to encode patterns\n\n32\n00:01:11,729 --> 00:01:16,680\ninto their BOTS called artificial\n\n33\n00:01:13,680 --> 00:01:18,840\nintelligence markup language or a IML\n\n34\n00:01:16,680 --> 00:01:21,570\nthat meant less code for the same\n\n35\n00:01:18,840 --> 00:01:24,180\nresults but yeah don't use a IML it's so\n\n36\n00:01:21,570 --> 00:01:25,619\nold it makes numa numa look new now with\n\n37\n00:01:24,180 --> 00:01:27,869\ndeep learning we can do this without\n\n38\n00:01:25,619 --> 00:01:30,360\nhard coded responses and have much\n\n39\n00:01:27,869 --> 00:01:33,030\nbetter results the generic case is that\n\n40\n00:01:30,360 --> 00:01:35,250\nyou give it some tax as input and then\n\n41\n00:01:33,030 --> 00:01:37,259\nask it a question it'll give you the\n\n42\n00:01:35,250 --> 00:01:39,420\nright answer after logically reasoning\n\n43\n00:01:37,259 --> 00:01:42,090\nabout it the input could also be that\n\n44\n00:01:39,420 --> 00:01:44,100\neverybody is happy and then the question\n\n45\n00:01:42,090 --> 00:01:46,049\ncould be what's the sentiment the answer\n\n46\n00:01:44,100 --> 00:01:48,630\nwould be positive other possible\n\n47\n00:01:46,049 --> 00:01:50,670\nquestions are what's the entity what are\n\n48\n00:01:48,630 --> 00:01:53,549\nthe part of speech tags what's the\n\n49\n00:01:50,670 --> 00:01:55,979\ntranslation to French we need a common\n\n50\n00:01:53,549 --> 00:01:57,750\nmodel for all of these questions this is\n\n51\n00:01:55,979 --> 00:02:00,030\nwhat 
the AI community is trying to\n\n52\n00:01:57,750 --> 00:02:01,920\nfigure out how to do facebook research\n\n53\n00:02:00,030 --> 00:02:04,110\nmade some great progress with this just\n\n54\n00:02:01,920 --> 00:02:06,509\ntwo years ago when they released a paper\n\n55\n00:02:04,110 --> 00:02:09,780\nintroducing this really cool idea called\n\n56\n00:02:06,509 --> 00:02:12,599\na memory network lstm networks proved to\n\n57\n00:02:09,780 --> 00:02:13,800\nbe a useful tool in tasks like text\n\n58\n00:02:12,599 --> 00:02:16,050\nsummarization but\n\n59\n00:02:13,800 --> 00:02:19,110\ntheir memory encoded by hidden states\n\n60\n00:02:16,050 --> 00:02:22,200\nand weight is too small for very very\n\n61\n00:02:19,110 --> 00:02:25,410\nlong sequences of data be that a book or\n\n62\n00:02:22,200 --> 00:02:27,300\na movie a way around this for language\n\n63\n00:02:25,410 --> 00:02:29,880\ntranslation for example was to store\n\n64\n00:02:27,300 --> 00:02:31,860\nmultiple lstm states and use an\n\n65\n00:02:29,880 --> 00:02:34,110\nattention mechanism to choose between\n\n66\n00:02:31,860 --> 00:02:37,470\nthem but they develop another strategy\n\n67\n00:02:34,110 --> 00:02:40,080\nthat outperformed lft ms or QA systems\n\n68\n00:02:37,470 --> 00:02:42,540\nthe idea was to allow a neural network\n\n69\n00:02:40,080 --> 00:02:45,210\nto use an external data structure as\n\n70\n00:02:42,540 --> 00:02:47,370\nmemory storage it learns where to\n\n71\n00:02:45,210 --> 00:02:49,830\nretrieve the required memory from the\n\n72\n00:02:47,370 --> 00:02:51,750\nmemory bank in a supervised way when it\n\n73\n00:02:49,830 --> 00:02:54,060\ncame to entering questions from COI data\n\n74\n00:02:51,750 --> 00:02:55,950\nthat was generated that info was pretty\n\n75\n00:02:54,060 --> 00:02:59,100\neasy to come by but in real world data\n\n76\n00:02:55,950 --> 00:03:01,320\nit is not that easy most recently there\n\n77\n00:02:59,100 --> 00:03:03,660\nwas a four-month-long cattle 
contest\n\n78\n00:03:01,320 --> 00:03:06,570\nthat a startup called meta mind placed\n\n79\n00:03:03,660 --> 00:03:08,730\nin the top 5% for to do this they built\n\n80\n00:03:06,570 --> 00:03:11,520\na new state-of-the-art model called a\n\n81\n00:03:08,730 --> 00:03:14,130\ndynamic memory network that built on\n\n82\n00:03:11,520 --> 00:03:15,720\nFacebook's initial idea that's the one\n\n83\n00:03:14,130 --> 00:03:18,030\nwe'll focus on so let's build it\n\n84\n00:03:15,720 --> 00:03:20,010\nprogrammatically using care of this data\n\n85\n00:03:18,030 --> 00:03:22,410\nset is pretty well organized it was\n\n86\n00:03:20,010 --> 00:03:24,420\ncreated by Facebook AI research for the\n\n87\n00:03:22,410 --> 00:03:26,430\nspecific goal of improving textual\n\n88\n00:03:24,420 --> 00:03:29,489\nreasoning it's grouped into 20 different\n\n89\n00:03:26,430 --> 00:03:31,860\ntasks each task tests a different aspect\n\n90\n00:03:29,489 --> 00:03:33,720\nof reasoning so overall it provides a\n\n91\n00:03:31,860 --> 00:03:35,700\ngood overview of all the different\n\n92\n00:03:33,720 --> 00:03:37,500\ncapabilities of your learning model\n\n93\n00:03:35,700 --> 00:03:39,420\nthere are a thousand questions for\n\n94\n00:03:37,500 --> 00:03:41,700\ntraining at a thousand for testing per\n\n95\n00:03:39,420 --> 00:03:43,830\ntask each question is paired with a\n\n96\n00:03:41,700 --> 00:03:46,050\nstatement or series of statements as\n\n97\n00:03:43,830 --> 00:03:48,390\nwell as an answer the goal is to have\n\n98\n00:03:46,050 --> 00:03:50,940\none model that can succeed in all tasks\n\n99\n00:03:48,390 --> 00:03:52,980\neasily will use pre-trained glove\n\n100\n00:03:50,940 --> 00:03:55,200\nvectors to help create a sequence of war\n\n101\n00:03:52,980 --> 00:03:57,390\nvectors from our input sentences and\n\n102\n00:03:55,200 --> 00:03:59,970\nthese vectors will act as inputs to the\n\n103\n00:03:57,390 --> 00:04:02,070\nmodel the dmn architecture defines 
two\n\n104\n00:03:59,970 --> 00:04:04,680\ntypes of memory semantic and episodic\n\n105\n00:04:02,070 --> 00:04:07,320\nthese input vectors are considered the\n\n106\n00:04:04,680 --> 00:04:08,730\nsemantic memory whereas episodic memory\n\n107\n00:04:07,320 --> 00:04:11,130\nmight contain other knowledge as well\n\n108\n00:04:08,730 --> 00:04:12,810\nand we'll talk about that in a second we\n\n109\n00:04:11,130 --> 00:04:14,880\ncan fetch our babble data set from the\n\n110\n00:04:12,810 --> 00:04:16,769\nweb and split them into training and\n\n111\n00:04:14,880 --> 00:04:18,630\ntesting data the love will help convert\n\n112\n00:04:16,769 --> 00:04:20,760\nour words two vectors so they're ready\n\n113\n00:04:18,630 --> 00:04:23,280\nto be fed into our model the first\n\n114\n00:04:20,760 --> 00:04:25,229\nmodule the input module is a GRU or\n\n115\n00:04:23,280 --> 00:04:26,590\ngated recurrent unit that runs on a\n\n116\n00:04:25,229 --> 00:04:28,990\nsequence of words\n\n117\n00:04:26,590 --> 00:04:31,750\nvectors a GRU cell is kind of like an\n\n118\n00:04:28,990 --> 00:04:33,850\nlstm cell but it's more computationally\n\n119\n00:04:31,750 --> 00:04:36,520\nefficient since it only has two gates\n\n120\n00:04:33,850 --> 00:04:38,169\nand it doesn't use a memory unit the two\n\n121\n00:04:36,520 --> 00:04:40,600\ngates control when its content is\n\n122\n00:04:38,169 --> 00:04:45,190\nupdated and when it's erased off a\n\n123\n00:04:40,600 --> 00:04:49,479\nrecess up the resistance of a recession\n\n124\n00:04:45,190 --> 00:04:52,660\nand the hidden state of the input module\n\n125\n00:04:49,479 --> 00:04:54,910\nrepresents the input process so far in a\n\n126\n00:04:52,660 --> 00:04:57,100\nvector it outputs hidden States after\n\n127\n00:04:54,910 --> 00:04:59,169\nevery sentence and these outputs are\n\n128\n00:04:57,100 --> 00:05:00,910\ncalled facts and the paper because they\n\n129\n00:04:59,169 --> 00:05:02,620\nrepresent the essence of what is 
fed\n\n130\n00:05:00,910 --> 00:05:04,570\ngiven a word vector and the previous\n\n131\n00:05:02,620 --> 00:05:06,820\ntime step detector will compute the\n\n132\n00:05:04,570 --> 00:05:08,889\ncurrent time step vector the uplinking\n\n133\n00:05:06,820 --> 00:05:11,020\nis a single layer neural network we sum\n\n134\n00:05:08,889 --> 00:05:13,600\nup the matrix multiplications and add a\n\n135\n00:05:11,020 --> 00:05:15,430\nbias term and then the signal it\n\n136\n00:05:13,600 --> 00:05:18,370\nsquashes it to a list of values between\n\n137\n00:05:15,430 --> 00:05:20,560\n0 and 1 the output vector we do this\n\n138\n00:05:18,370 --> 00:05:22,900\ntwice with different sets of weights\n\n139\n00:05:20,560 --> 00:05:24,789\nthen we use a reset gate that will learn\n\n140\n00:05:22,900 --> 00:05:26,889\nto ignore the past time steps when\n\n141\n00:05:24,789 --> 00:05:29,020\nnecessary for example if the next\n\n142\n00:05:26,889 --> 00:05:31,450\nsentence has nothing to do with those\n\n143\n00:05:29,020 --> 00:05:32,830\nthat came before it the update gate is\n\n144\n00:05:31,450 --> 00:05:35,830\nsimilar in that it can learn to ignore\n\n145\n00:05:32,830 --> 00:05:37,510\nthe current time step entirely maybe the\n\n146\n00:05:35,830 --> 00:05:40,539\ncurrent sentence has nothing to do with\n\n147\n00:05:37,510 --> 00:05:42,849\nthe answer whereas previous one bit then\n\n148\n00:05:40,539 --> 00:05:45,820\nthere's the question module it processes\n\n149\n00:05:42,849 --> 00:05:48,310\nthe question word by word and outputs a\n\n150\n00:05:45,820 --> 00:05:50,979\nvector by using the same gru as the\n\n151\n00:05:48,310 --> 00:05:52,479\ninput module and the same weight we can\n\n152\n00:05:50,979 --> 00:05:54,849\nencode both of them by creating\n\n153\n00:05:52,479 --> 00:05:56,889\nembedding layers for both then we'll\n\n154\n00:05:54,849 --> 00:05:59,200\ncreate an episodic memory representation\n\n155\n00:05:56,889 --> 00:06:01,120\nfor both the motivation for 
this in the\n\n156\n00:05:59,200 --> 00:06:03,340\npaper came from the hippocampus function\n\n157\n00:06:01,120 --> 00:06:05,349\nin our brain it's able to retrieve\n\n158\n00:06:03,340 --> 00:06:08,260\ntemporal states that are triggered by\n\n159\n00:06:05,349 --> 00:06:10,660\nsome response like a site or a sound\n\n160\n00:06:08,260 --> 00:06:12,880\nboth the fact and question vectors that\n\n161\n00:06:10,660 --> 00:06:15,190\nare extracted from the input enter the\n\n162\n00:06:12,880 --> 00:06:17,590\nepisodic memory module it's composed of\n\n163\n00:06:15,190 --> 00:06:19,450\ntwo nested gr use the energy ru\n\n164\n00:06:17,590 --> 00:06:21,880\ngenerates what are called episodes it\n\n165\n00:06:19,450 --> 00:06:24,130\ndoesn't by passing over the facts from\n\n166\n00:06:21,880 --> 00:06:26,050\nthe input module but when updating its\n\n167\n00:06:24,130 --> 00:06:28,330\ninterstate it takes into account the\n\n168\n00:06:26,050 --> 00:06:30,130\noutput of an attention function on the\n\n169\n00:06:28,330 --> 00:06:32,289\ncurrent fact the attention function\n\n170\n00:06:30,130 --> 00:06:35,320\ngives a score between zero and one to\n\n171\n00:06:32,289 --> 00:06:38,050\neach fact and so the GRU ignores facts\n\n172\n00:06:35,320 --> 00:06:39,490\nwith low scores after each full pass on\n\n173\n00:06:38,050 --> 00:06:41,710\nall the facts the in\n\n174\n00:06:39,490 --> 00:06:43,750\ngru outputs an episode which is then fed\n\n175\n00:06:41,710 --> 00:06:46,120\nto the outer GRU the reason we need\n\n176\n00:06:43,750 --> 00:06:48,160\nmultiple episodes is so our model can\n\n177\n00:06:46,120 --> 00:06:50,349\nlearn what part of a sentence it should\n\n178\n00:06:48,160 --> 00:06:52,539\npay attention to after realizing after\n\n179\n00:06:50,349 --> 00:06:54,819\none pass that something else is\n\n180\n00:06:52,539 --> 00:06:56,979\nimportant with multiple passes we can\n\n181\n00:06:54,819 --> 00:06:59,710\ngather increasingly relevant 
information\n\n182\n00:06:56,979 --> 00:07:02,020\nwe can initialize our model and set its\n\n183\n00:06:59,710 --> 00:07:04,090\nloss function has categorical cross\n\n184\n00:07:02,020 --> 00:07:07,240\nentropy with the stochastic gradient\n\n185\n00:07:04,090 --> 00:07:09,069\ndescent implementation or MS prop then\n\n186\n00:07:07,240 --> 00:07:10,900\ntrain it on the given data using the fed\n\n187\n00:07:09,069 --> 00:07:12,669\nfunction we can test this code in the\n\n188\n00:07:10,900 --> 00:07:14,889\nbrowser without waiting for it to train\n\n189\n00:07:12,669 --> 00:07:16,930\nbecause luckily for us this researcher\n\n190\n00:07:14,889 --> 00:07:19,150\nuploaded a web app with a fully trained\n\n191\n00:07:16,930 --> 00:07:21,220\nmodel of this code we can generate a\n\n192\n00:07:19,150 --> 00:07:23,349\nstory which is a collection of sentences\n\n193\n00:07:21,220 --> 00:07:25,210\neach describing an event in sequential\n\n194\n00:07:23,349 --> 00:07:27,909\norder then we'll ask it a question\n\n195\n00:07:25,210 --> 00:07:29,590\npretty high accuracy response let's\n\n196\n00:07:27,909 --> 00:07:32,409\ngenerate another story and ask it\n\n197\n00:07:29,590 --> 00:07:34,060\nanother question hero status let's go\n\n198\n00:07:32,409 --> 00:07:36,490\nover the three key facts we've learned\n\n199\n00:07:34,060 --> 00:07:39,159\ngr use control the flow of data like\n\n200\n00:07:36,490 --> 00:07:41,530\nlstm cells but are more computationally\n\n201\n00:07:39,159 --> 00:07:43,990\nefficient using just two gates update\n\n202\n00:07:41,530 --> 00:07:46,180\nand reset dynamic memory networks offer\n\n203\n00:07:43,990 --> 00:07:48,490\nstate-of-the-art performance in question\n\n204\n00:07:46,180 --> 00:07:50,770\nentering systems and they do this by\n\n205\n00:07:48,490 --> 00:07:53,440\nusing both semantic and episodic memory\n\n206\n00:07:50,770 --> 00:07:57,039\ninspired by the hippocampus drumroll\n\n207\n00:07:53,440 --> 00:07:58,270\nplease no 
never mind nemanja tomek is\n\n208\n00:07:57,039 --> 00:08:00,639\nthe coding challenge winner from last\n\n209\n00:07:58,270 --> 00:08:02,530\nweek he implemented his own neural\n\n210\n00:08:00,639 --> 00:08:04,180\nmachine translator by training it on\n\n211\n00:08:02,530 --> 00:08:06,460\nmovie subtitles in both English and\n\n212\n00:08:04,180 --> 00:08:08,830\nGerman you can see all the results in\n\n213\n00:08:06,460 --> 00:08:10,930\nhis eye Python notebook amazing work\n\n214\n00:08:08,830 --> 00:08:12,550\nwizard of the week and the runner-up is\n\n215\n00:08:10,930 --> 00:08:14,680\nvishal bought two despite the massive\n\n216\n00:08:12,550 --> 00:08:16,719\namount of training time n empty requires\n\n217\n00:08:14,680 --> 00:08:19,599\nmichelle was able to achieve some great\n\n218\n00:08:16,719 --> 00:08:21,699\nresults I vow to both of you this week's\n\n219\n00:08:19,599 --> 00:08:23,740\nchallenge is to make your own Q&A chat\n\n220\n00:08:21,699 --> 00:08:25,360\nbot all the details are in the readme\n\n221\n00:08:23,740 --> 00:08:27,219\ngithub links go in the comments and\n\n222\n00:08:25,360 --> 00:08:28,599\nannounce winner a week from today please\n\n223\n00:08:27,219 --> 00:08:30,610\nsubscribe for more programming videos\n\n224\n00:08:28,599 --> 00:08:32,740\ncheck out this related video and for now\n\n225\n00:08:30,610 --> 00:08:35,849\nI've got to ask the right questions so\n\n226\n00:08:32,740 --> 00:08:35,849\nthanks for watching\n\n" Created and managed caption tracks.
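
The downloaded track above is a Python 3 `bytes` object in SubRip (SRT) layout: a cue number, a timing line, the caption text, then a blank line. As a rough sketch of a possible next step toward usable training text, the bytes could be decoded and the numbering and timestamps stripped. The names `Cue`, `parse_srt` and `transcript` are hypothetical helpers, not part of this repository, and the sample is just the first two cues from the output shown above.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: str
    end: str
    text: str


# Matches an SRT timing line such as "00:00:00,000 --> 00:00:04,529".
_TIMING = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


def parse_srt(raw: bytes) -> List[Cue]:
    """Decode SRT bytes (assumed UTF-8) and split them into cues."""
    text = raw.decode("utf-8").replace("\r\n", "\n")
    cues = []
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        # A usable cue needs an index line, a timing line and some text.
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue
        match = _TIMING.match(lines[1])
        if match is None:
            continue
        cues.append(
            Cue(int(lines[0]), match.group(1), match.group(2), " ".join(lines[2:]))
        )
    return cues


def transcript(cues: List[Cue]) -> str:
    """Join the cue texts into one plain-text string."""
    return " ".join(cue.text for cue in cues)


sample = (
    b"1\n00:00:00,000 --> 00:00:04,529\nhello world its Suraj and let's build a\n\n"
    b"2\n00:00:02,429 --> 00:00:07,140\nchat bot that can answer questions about\n\n"
)
cues = parse_srt(sample)
print(transcript(cues))
```

Note that the caption text is automatic speech recognition output, so misheard words (for example "Suraj") pass through unchanged.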