Finding video links

Now you know the basics, enough to scrape most content from most sites, but not from streaming sites. Because video hosting is expensive, video providers really don't want anyone scraping the video and bypassing the ads. This is why they often obfuscate, encrypt and hide their links, which makes scraping really hard. Some sites even put Google reCAPTCHA v3 on their links to prevent scraping, while the majority lock the video links to an IP, a time window or a referer to prevent sharing. You will almost never find a plain <video> element with an mp4 link.

This is why you should always try to scrape the video link first when scraping a video hosting site. Sometimes getting the video link turns out to be too hard, and it is better to find that out before you put effort into the rest of the site.

I will therefore explain how to do more advanced scraping: how to get these video links.

What you want to do is:

  1. Find the iFrame/Video host.*
  2. Open the iFrame in a separate tab to ease clutter.*
  3. Find the video link.
  4. Work backwards from the video link to find the source.
  * Steps 1 and 2 are not applicable to all sites.

Let's explain further.

Step 1: Most sites use an iFrame system to show their videos, which essentially loads a separate page within the page. This is most evident on Gogoanime (its link gets updated often, so google the name to find the current page if the link no longer works). The easiest way to spot these iFrames is to look at the network tab for requests that do not go to the original site. I recommend using the HTML filter.

[Image: finding]
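As a rough command-line counterpart to the network tab, the sketch below pulls the src of every iFrame out of a page's HTML. The function name `extract_iframe_srcs` and the `$episode_page` variable in the usage comment are made up for this example, and it only sees iFrames that are present in the static HTML. iFrames injected later by JavaScript will not show up, which is when the network tab is the better tool.

```shell
#!/bin/sh
# Hypothetical helper: print the src attribute of every <iframe> in the HTML read from stdin.
extract_iframe_srcs() {
  # Join the document into one line, then start a new line at every "<",
  # so each tag sits at the start of its own line (without the "<").
  tr '\n' ' ' | tr '<' '\n' |
    # Keep only iframe tags and print the value of their src attribute.
    sed -nE 's/^iframe[^>]*[[:space:]]src="([^"]*)".*/\1/p'
}

# Usage (assumes $episode_page holds the URL of the page with the player):
#   curl -s "$episode_page" | extract_iframe_srcs
```

Any src that points to a different host than the original site is a candidate for the video host.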

Step 2: Once you have found the iFrame, in this case a fembed-hd link, open it in another tab and work from there. With only the iFrame loaded it is much easier to find what is needed to generate the link, since a lot of irrelevant requests from the original site are filtered out.

Step 3: Find the video link. This is often quite easy: either filter for media requests or simply look for a request ending in .m3u8 or .mp4. This lets you exclude many requests (you only need to look at the requests made before the video link) and start looking for the link's origin (Step 4).

[Image: video_link]
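If you have the request URLs as plain text, for instance copied out of the network tab, the same filtering can be sketched in one line. The helper name `filter_video_links` and the file name in the usage comment are assumptions for this example. It matches on the file extension only, so a video served from a URL without .m3u8 or .mp4 in it will be missed.

```shell
#!/bin/sh
# Hypothetical helper: read one URL per line from stdin and keep only those
# whose path ends in .m3u8 or .mp4 (a query string or fragment may follow).
filter_video_links() {
  grep -E '\.(m3u8|mp4)([?#]|$)'
}

# Usage (assumes requests.txt holds the URLs copied from the network tab):
#   filter_video_links < requests.txt
```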

I usually search for parts of the video link and check whether any response text or headers from the preceding requests contain them. In this case fvs.io redirected to the mp4 link. Now repeat the same steps for the fvs.io link to follow the requests backwards to the origin, as the images show.

[Image: fvs]

[Image: fvs_redirector]

[Image: complete]
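A redirect like the fvs.io one can also be checked outside the browser. The sketch below reads raw response headers and prints where they redirect to. The helper name `redirect_target` and the `$iframe_url` and `$redirector_url` variables are placeholders for this example, and whether the host answers a HEAD request the same way as a normal request is an assumption that may not hold for every site.

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers from stdin and print
# the value of the first Location header, i.e. the redirect target.
redirect_target() {
  # Header lines end in CRLF, so drop the carriage returns first.
  tr -d '\r' | sed -nE 's/^[Ll]ocation:[[:space:]]*(.*)$/\1/p' | head -n 1
}

# Usage (assumes $redirector_url is the link seen in the network tab and
# $iframe_url is the page that requested it, sent along as the referer):
#   curl -sI -e "$iframe_url" "$redirector_url" | redirect_target
```

If the output ends in .mp4 or .m3u8 you have reproduced the redirect. If not, repeat the search on the URL it printed.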

NOTE: Some sites use encrypted JS to generate the video links. In that case you need to use the browser debugger and step through the code to find out how the links are generated.

What to do when the site uses a captcha?

You pretty much only have 3 options when that happens:

  1. Try to use a fake captcha token or no token at all. Some sites don't actually check that the captcha token is valid.
  2. Use Webview or some kind of browser in the background to load the site in your stead.
  3. Pray it's a captcha without payload, then it's possible to get the captcha key without a browser:

Before showing a code example, I'll explain some of the logic so it's easier to visualize what's happening. Our end goal is to make a request to https://www.google.com/recaptcha/api2/anchor. Some of its parameters can be hardcoded, since they're unlikely to change, but we also need to pass 3 dynamic parameters: k (the site key), co (the base64-encoded origin of the site) and v (the release version, called vtoken in the code).

Here is a proof of concept code example of how you can get a captcha token programmatically (the details can vary between websites):

# the main_page variable in this example is the home page of our website, for example https://zoro.to
key=$(curl -s "$main_page" | sed -nE "s@.*recaptcha_site_key = '([^']*)'.*@\1@p")
# base64 encode "<main_page>:443" and replace the "=" padding with ".",
# for example https://zoro.to:443 => aHR0cHM6Ly96b3JvLnRvOjQ0Mw..
co=$(printf "%s:443" "$main_page" | base64 | tr "=" ".")
vtoken=$(curl -s "https://www.google.com/recaptcha/api.js?render=$key" | sed -nE "s_.*po\.src=.*releases/(.*)/recaptcha.*_\1_p")
# keep the url on one line so no stray whitespace ends up inside it
anchor_url="https://www.google.com/recaptcha/api2/anchor?ar=1&hl=en&size=invisible&cb=cs3&k=${key}&co=${co}&v=${vtoken}"
recaptcha_token=$(curl -s "$anchor_url" | sed -nE 's_.*id="recaptcha-token" value="([^"]*)".*_\1_p')
# now we can use the recaptcha token to pass the verification on the site
curl -s "$main_page/some_url_requiring_token?token=${recaptcha_token}"
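The two purely local steps of the proof of concept, building co and pulling the token out of the anchor page, can be split into small helpers and checked without touching the network. The function names `recaptcha_co` and `extract_recaptcha_token` are made up for this sketch. It assumes the site is served over https on port 443 and that the anchor page still carries the token in an element with id="recaptcha-token", as in the example above.

```shell
#!/bin/sh
# Hypothetical helper: build the co parameter from a site origin such as https://zoro.to.
# It is the base64 of "<origin>:443" with the "=" padding replaced by ".".
recaptcha_co() {
  printf '%s:443' "$1" | base64 | tr '=' '.'
}

# Hypothetical helper: read the anchor page HTML from stdin and print the token value.
extract_recaptcha_token() {
  sed -nE 's_.*id="recaptcha-token" value="([^"]*)".*_\1_p'
}

# Usage:
#   co=$(recaptcha_co "https://zoro.to")
#   recaptcha_token=$(curl -s "$anchor_url" | extract_recaptcha_token)
```

Decoding a co value taken from a real request in the network tab is a quick way to check that your own value is built the same way.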