<a href="https://colab.research.google.com/github/JasmineElm/Notebooks/blob/master/Grab_all_Free_and_Netflix_titles_on_1001_Movie_list.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# identify movies on the 1001 movies list that are on
+ iPlayer
+ ITV Hub
+ BFi Player (free)

The script will trawl through a large number of movies( bfi results in particular),  so will take some time to run.  Around 5 minutes for the script to complete ins not unreasonable.

# process

Each service requires slightly different logic to get from the main (base) link to individual links.  the below table should give an overview of the logic required.

In this context, full metadata simply means the service can return the movie title and year of release.  If a service cannot, we need to use OMDB or the like to ~~fill in~~ guess the blanks

| service            | iPlayer | itvPlayer | bfiPlayer | Netflix |
|--------------------|:-------:|:---------:|:---------:|:-------:|
| main               | X       | X         | X         | X       |
| paginated results  | X       |           | X         | X       |
| individual results | X       | X         | X         |         |
| Full Metadata      | X       |           | X         | X       |
| requires OMDB      |         | X         |           | X       |

In [0]:
## grab our 1001 Movies List, download, and load into a dataframe 
url ='https://gist.githubusercontent.com/JasmineElm/ce8219c58bd416c0aec588a97e168221/raw/57717b4ae21f1721b2e1c22d2e8a74795e0e54d4/netflixTitles.csv'
r = requests.get(url)
filename = url.split('/')[-1]
 
with open(filename,'wb') as output_file:
    output_file.write(r.content)
    
ml=pd.read_csv(filename,index_col=[0])

# Expected Output

The script should generate a markdown table in the following format:

| Title              | Genre                             | Director       |   Year |   Runtime | Link                                                                     |
|:-------------------|:----------------------------------|:---------------|-------:|----------:|:-------------------------------------------------------------------------|
| A Trip to the Moon | Short, Adventure, Fantasy, Sci-Fi | Georges Méliès |   1902 |        13 | https://player.bfi.org.uk/free/film/watch-a-trip-to-the-moon-1902-online |

An additional column flagging the service may be added to make the table easier to read.

As a service could, quite legitimately, have no 1001 movie matches, the script should print a message where a service has been queried unsuccessfully.  e.g. ```bfi: no results```


In [0]:
!pip install cfscrape



In [0]:
import pandas as pd
from bs4 import BeautifulSoup
from time import sleep
from random import uniform  # for our sleep function
from tabulate import tabulate  # to export dataframe as markdown table
import requests, cfscrape, json
import urllib.request
import json


def scrapeSoup(url):
  scraper = cfscrape.create_scraper()
  webPage = scraper.get(url)
  soup = BeautifulSoup(webPage.text, "html5lib")
  return soup

    
def isYear(s):
    """ Returns a year if it looks 'legit'
        otherwise '0'
    """
    legit = False
    if s.isdigit():
        if 1880 <= int(s) <= 2030:
            legit = True
    return (0,s)[legit]

def progressBar(inList, currentStep, numeric=False):
    """ Prints a progress bar of sorts to make it easier to judge
        whether your loops are doing something. use in conjunction with 
        `enumerate` to get the index of the current step
    """
    ps = ['░', '▒', '▓', '█']
    if numeric:
        if currentStep == 0:
            print('iterations left: ', end='')
        print(len(inList)-currentStep, end=' ')
    else:
        if currentStep == 0:
            print('Progress: ', end='')
        stage = int(currentStep/len(inList)*len(ps))
        print(ps[stage], end='')

def isRuntime(s):
    """ checks if the first three passed characters
    "*" of a string look like a a runtime
    """
    legit = False
    if s[:3].isdigit():
        if 0 <= int(s[:3]) <= 999:
            legit = True
    return legit

def parseOMDB(query, full=False):
    """ queries OMDB returns data in a format similar to
        that used by the 1001 Movies List (full=True)
    """
    metadata = []
    ######## FIX THIS TO USE CFSCRAPE ##########
    with urllib.request.urlopen(query) as url:
        data = json.loads(url.read().decode())
        if full:
            metadata.append(data['Title'])
            metadata.append(data['Metascore'])
            metadata.append(data['Genre'])
            metadata.append(data['Director'])
            metadata.append(data['Language'])
            metadata.append(data['Country'])
            metadata.append(data['imdbRating'])
            metadata.append(data['Year'][:4])  # More appropriate to series...
            metadata.append(data['Plot'])
            metadata.append(data['Awards'])
            metadata.append(data['imdbID'])
            metadata.append(data['Runtime'][:-4])  # remove ' mins' at source
        else:
            metadata.append(data['Title'])
            metadata.append(data['Year'])
    return metadata

def cleanseTitle(s):
    """ removes bad case and spaces from search
        terms sent to OMDBapi
    """

    s = s.casefold()
    return s.replace(' ', '+')

def buildOMDBQuery(title, year=None, movie=True, apikey=None):
    """ simple function to build a query string for OMDB
        VERY beta: no logic around handling failed queries
    """
    omdbURL = 'https://www.omdbapi.com/?apikey='
    if apikey==None:
      print('API key required! get one from https://www.omdbapi.com/')
    title = '&t='+cleanseTitle(title)
    # 'and' can make the term too specific ('starsky and hutch' will not return the movie, 'starsky hutch' will)
    query = omdbURL+apikey+title
    if year != None:
        query = query+'&y='+year
    if movie == True:
        query = query+'&type=movie'
    return query

def iPlayerLinkList(urlList):
    """ 
    """
    soup=scrapeSoup(urlList[1])
    pageList = []
#     pl = [urList]
    pageList.append(urlList) 
    pl = []
    for link in soup.find('div', {'class': 'list__pagination'}).find_all('a'):
        if urlList[1]+link.get('href') != pageList[-1][1]:
            pl.append(urlList[0])
            pl.append(urlList[1]+link.get('href'))
            pageList.append(pl)
            pl = []
    return pageList

def bfiLinkList(urlList):
    """
    """
    pageList = []
    soup=scrapeSoup(urlList[1])
    base = 'https://player.bfi.org.uk'
    pl = []
    for collection in soup.findAll('div', {'class': 'collection-card'}):
        pl.append(urlList[0])
        pl.append(base+collection.find('a').get('href'))
        pageList.append(pl)
        pl = []
    return pageList


def netflixLinkList(urlList):
  """
  """
  pageList=[]
  page=[]
  baseLink="https://uk.newonnetflix.info/catalogue/a2z/all/"
  soup=scrapeSoup(urlList[1])
  for link in soup.find("span",{"class":"datemenu"}):
    if "".join(link).lower()!=" " and "".join(link).lower()!="\n":
      page=[]
      page.append(urlList[0])
      page.append(baseLink+"".join(link).lower())
      pageList.append(page)
  return pageList

def linkListBuilder(baseList):
    """
    """
    pageLinkList = []
    for item in baseList:
        if item[0] == 'itv':
            pageLinkList.append(item)
        elif item[0] == 'iplayer':
            for link in iPlayerLinkList(item):
                pageLinkList.append(link)
        elif item[0] == 'bfi':
            for link in bfiLinkList(item):
                pageLinkList.append(link)
        elif item[0] == 'netflix':
            for link in netflixLinkList(item):
                pageLinkList.append(link)
    return pageLinkList

def itvMoviePage(linkList):
    """ grab individual movie links from a
        page of itv listings
    """
    moviePage = []
    soup=scrapeSoup(linkList[1])
    # like iPlayer, the metadata exists only on the individual pages.
    movieList = []
    for category in soup.findAll('div', {'class', 'categories'}):
        for link in category.findAll('a', {'class': 'complex-link'}):
            movieList = []
            movieList.append(linkList[0])
            movieList.append(link.get('href'))
            moviePage.append(movieList)
    return moviePage

def iPlayerMoviePage(linkList):
    """ grab individual movie links from a
        page of iPlayer listings
    """
    movieDetails = []
    baseLink = 'https://www.bbc.co.uk'
    soup=scrapeSoup(linkList[1])
    for tile in soup.findAll('div', {'class': 'content-item'}):
        movieList = []
        movieList.append(linkList[0])
        movieList.append(baseLink+tile.find('a').get('href'))
        movieDetails.append(movieList)
    return movieDetails

def getMoviePages(linkList):
    """ iPlayer and itvHub keep metadata on
        individual movie pages.  this function
        captures those pages.  
        BFI and Netflix entries pass through
    """
    moviePage = []
    for itx, item in enumerate(linkList):
        progressBar(linkList, itx)
        if item[0] == 'itv':
            for link in itvMoviePage(item):
                moviePage.append(link)
        elif item[0] == 'iplayer':
            for link in iPlayerMoviePage(item):
                moviePage.append(link)
        elif item[0] == 'bfi':
            moviePage.append(item)
        elif item[0] == 'netflix':
            moviePage.append(item)
    return moviePage

def itvMovieData(moviePage, apikey=None):
    """
    """
    outerList = []
    movieLink=moviePage[1]
    outerList.append(movieLink)
    soup=scrapeSoup(movieLink)
    title = cleanseTitle(soup.find('h1').text)
    query = buildOMDBQuery(title, apikey=apikey)
    queryResults = parseOMDB(query)
    for result in queryResults:
        outerList.append(result)
    outerList.append(
        soup.find('li', {'class': 'episode-info__meta-item--availability'}).text)
    outerList.append(moviePage[0])
    return outerList

def iPlayerMovieData(moviePage):
    movieLink=moviePage[1]
    soup=scrapeSoup(movieLink)
    innerList = [moviePage[1]]
    movie = soup.find('div', {'class': 'play-cta__title-container'}).text
    innerList.append(
        soup.find('div', {'class': 'play-cta__title-container'}).text)
    meta = soup.find('div', {'class': 'gel-layout'}).findAll('li')
    for itx, metadata in enumerate(meta):
        if itx != 0:
            if itx == 1:
                innerList.append(metadata.find(
                    'span', {'class': 'episode-metadata__text'}).text[-4:])
            else:
                innerList.append(metadata.find(
                    'span', {'class': 'episode-metadata__text'}).text)
    innerList.append(moviePage[0])
    return innerList

def bfiMovieData(moviePage):
    base = 'https://player.bfi.org.uk'
    soup=scrapeSoup(moviePage[1])
    for card in soup.findAll('div', {'class': 'card--free'}):
        movieInfo = []
        movieInfo.append(base+card.find('a').get('href'))  # direct link
        movieInfo.append(card.find('h3').find('span').text)  # title
        if len(card.find('p', {'class': 'card__info'}).select("span:nth-of-type(2)")[0].text) == 4:
            movieInfo.append(card.find('p', {'class': 'card__info'}).select(
                "span:nth-of-type(2)")[0].text)
        else:
            # hacky way to ensure only years are parsed - needs fixing
            movieInfo.append('0')
        movieInfo.append('')
        movieInfo.append(moviePage[0])
        return movieInfo

def netflixMovieData(moviePage):
  movieData=[]
  empty=''
  soup=scrapeSoup(moviePage[1])
  for movieDetail in soup.find_all("a",{"class":"infopop"}):
    movieList=[]
    if movieDetail.find("img") == None:
      title = movieDetail.text[:-6].strip()
      year = isYear(movieDetail.text[-5:].strip(")").strip())
      movieList.append(empty)
      movieList.append(title)
      movieList.append(year)
      movieList.append(empty)
      movieList.append(moviePage[0])
    if len(movieList)==5:
      movieData.append(movieList)
  return movieData

def randSleep(minSleep=0.2, maxSleep=2.5):
    """ let's not spam servers.  
    """
    sleep(uniform(minSleep, maxSleep))

def getmovieData(moviePageList, apikey):
    """ for each page in list, grab the metadata
        if the page is itv, we also need to make a 
        call to omdb to guess the year.
    """
    movieDataList = []
    for itx, moviePage in enumerate(moviePageList):
        progressBar(moviePageList, itx)
        if moviePage[0] == 'itv':
            randSleep()
            movieDataList.append(itvMovieData(moviePage, apikey=apikey))
        elif moviePage[0] == 'iplayer':
            randSleep()
            movieDataList.append(iPlayerMovieData(moviePage))
        elif moviePage[0] == 'bfi':
            randSleep()
            movieDataList.append(bfiMovieData(moviePage))
        elif moviePage[0] == 'netflix':
            randSleep()
            for movie in netflixMovieData(moviePage):
              movieDataList.append(movie)
    return movieDataList

In [0]:
urlList=[['iplayer','https://www.bbc.co.uk/iplayer/categories/films/a-z']
        ,['itv',    'https://www.itv.com/hub/categories/films']
        ,['bfi',    'https://player.bfi.org.uk/free/collections']
        ,['netflix','https://uk.newonnetflix.info/catalogue/a2z/all']]

In [0]:
omdbAPI='' #########CHANGE ME#########
pll=linkListBuilder(urlList)
mp=getMoviePages(pll)
print('Getting movie data . . .')
metaDataList=getmovieData(mp, omdbAPI)



Progress: ░░░░░░░░░░░░░░░░░░░░░░░░░░▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓██████████████████████████Getting movie data . . .
Progress: ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓██████████████████████████████████████

In [0]:
metaDataList

[['https://www.bbc.co.uk/iplayer/episode/p04b183c/adam-curtis-hypernormalisation',
  'Adam Curtis',
  '2016',
  'Available for over a year',
  'iplayer'],
 ['https://www.bbc.co.uk/iplayer/episode/p07ctvvn/a-high-school-rape-goes-viral-roll-red-roll',
  'A High School Rape Goes Viral: Roll Red Roll',
  '2019',
  'Available for 4 months',
  'iplayer'],
 ['https://www.bbc.co.uk/iplayer/episode/b065jk5m/a-most-violent-year',
  'A Most Violent Year',
  '2014',
  'Expires tonight 1:35am',
  'iplayer'],
 ['https://www.bbc.co.uk/iplayer/episode/b0078cwc/a-simple-plan',
  'A Simple Plan',
  '1998',
  'Available for 3 months',
  'iplayer'],
 ['https://www.bbc.co.uk/iplayer/episode/b01dtlxl/the-awakening',
  'The Awakening',
  '2011',
  'Available for 17 days',
  'iplayer'],
 ['https://www.bbc.co.uk/iplayer/episode/b06vq3yn/the-big-short',
  'The Big Short',
  '2015',
  'Available until Tue 12:55am',
  'iplayer'],
 ['https://www.bbc.co.uk/iplayer/episode/b010y7my/the-boy-in-the-striped-pyjamas',


In [0]:
streaming=pd.DataFrame(metaDataList, columns =['Link','Title','Year','Availability','Service']).dropna().astype({'Year': 'int64'})
streaming.sort_values('Year')

Unnamed: 0,Link,Title,Year,Availability,Service
75,https://player.bfi.org.uk/free/film/watch-no-m...,No Mischief Here Can Satan Find for Idle Hands...,0,,bfi
105,https://player.bfi.org.uk/free/film/watch-the-...,The Bradford Godfather,0,,bfi
102,https://player.bfi.org.uk/free/film/watch-scre...,Screenwriter Tess Morris on how to write a romcom,0,,bfi
86,https://player.bfi.org.uk/free/film/watch-wedd...,Wedding of Princess Elizabeth and Philip Mount...,0,,bfi
52,https://player.bfi.org.uk/free/film/watch-colo...,Colour on the Thames,0,,bfi
68,https://player.bfi.org.uk/free/film/watch-the-...,The Dentist,0,,bfi
87,https://player.bfi.org.uk/free/film/watch-mini...,Mining Review 25th Year No. 9,0,,bfi
85,https://player.bfi.org.uk/free/film/watch-a-wo...,A Woman Undressing,1896,,bfi
53,https://player.bfi.org.uk/free/film/watch-prin...,Prince Ranjitsinhji Practising Batting at the ...,1897,,bfi
74,https://player.bfi.org.uk/free/film/watch-pano...,"Panorama of Calcutta, India, From the River Ga...",1899,,bfi


In [0]:
moviels= pd.merge(ml, streaming,  how='inner', on=['Title','Year'], )

In [0]:
#generate a nice markdown table that can be pasted into Reddit
print(tabulate(moviels.drop(['Metascore','Language','Country','IMDBRating', 'Plot','Awards','imdbID'],axis=1).sort_values('Title').sort_values('Service')
, tablefmt="pipe", headers="keys", showindex=False))

| Title                                             | Genre                                        | Director                                    |   Year |   Runtime | Link                                                                     | Availability                | Service   |
|:--------------------------------------------------|:---------------------------------------------|:--------------------------------------------|-------:|----------:|:-------------------------------------------------------------------------|:----------------------------|:----------|
| A Trip to the Moon                                | Short, Adventure, Fantasy, Sci-Fi            | Georges Méliès                              |   1902 |        13 | https://player.bfi.org.uk/free/film/watch-a-trip-to-the-moon-1902-online |                             | bfi       |
| Night of the Living Dead                          | Horror                                       | George A. Romero                            