<a href="https://colab.research.google.com/github/ApoorvRusia/webScraping_IMDB_storeAllThePostersAndMovies/blob/master/WebScraping_IMDB_for_movie_names_and_posters_since_2011.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Webscraping IMDB

### For learning purposes, let's collect movie posters and movie names from the IMDB website, going back to 2011

#### The motivation is to learn about web scraping and what we can do with it

First, import the necessary libraries.

In Google Colab these come preinstalled, but they are not part of the Python standard library as of this writing.

In [0]:
from bs4 import BeautifulSoup
import requests

In [0]:
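# Fetch a page from the server and return the response.
# Note: the year and month are always appended to the URL, so this is meant for the coming-soon pages.
def request_webpage(url, year_and_month = '2011-01'):
  res = requests.get(url + year_and_month + '/')
  try:
    res.raise_for_status()  # raises an exception for 4xx/5xx responses
  except Exception as exc:
    print('There was a problem with the request: %s' % (exc))
  return res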
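Since request_webpage always appends a year and month, it can help to see exactly which URL gets requested. The sketch below is only an illustration: 'build_month_url' is a hypothetical helper that mirrors the string concatenation in the function above, and the example.com address is a placeholder.

```python
def build_month_url(base_url, year_and_month='2011-01'):
    """Return the URL that request_webpage() would request (hypothetical helper)."""
    return base_url + year_and_month + '/'

base = 'https://www.imdb.com/movies-coming-soon/'
default_url = build_month_url(base)
february_url = build_month_url(base, '2019-02')

# Passing a complete file URL still gets the date appended, which is rarely intended.
image_url = build_month_url('https://example.com/poster.jpg')

print(default_url)
print(february_url)
print(image_url)
```

The last line shows why the function suits the coming-soon pages better than arbitrary URLs.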

Next, fetch the page and store the response in a local variable.

In [0]:
#The actual URL is 'https://www.imdb.com/movies-coming-soon/2011-01/'.
#The year and month are appended dynamically so that we can fetch different months.
url = 'https://www.imdb.com/movies-coming-soon/'
coming_soon_page = request_webpage(url)

Let's take a look at the HTML we got back.

In [73]:
coming_soon_page.text

'\n\n\n\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>New Movies Coming Soon - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n    if (typeof uex =

The 'prettify' method makes this a bit more human-readable.

In [0]:
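#name the parser explicitly; otherwise BeautifulSoup picks one itself and emits a warning
coming_soon_soup = BeautifulSoup(coming_soon_page.text, 'html.parser')
#print(coming_soon_soup.prettify())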

Next, let's find the images.

One can locate an HTML element by right-clicking on the webpage and selecting 'Inspect'.

The code below looks for an element like this:

< div class="list detail" >... (content)...< /div >

In [75]:
#find the main div tag, which contains all the information related to the movies
details = coming_soon_soup.find('div', attrs = {'class': 'list detail'})
#inside the div tag, let's find all the img tags
image_details = details.find_all('img')

#this list comprehension collects the 'src' values (URLs) of the posters.
#BeautifulSoup returns 'class' as a list of class names, so the test keeps the img tags
#that have a 'poster' class and filters out the icons for ratings
image_list = [x['src'] for x in image_details if 'poster' in x.get('class', [])]
image_list

['https://m.media-amazon.com/images/M/MV5BMzc3MjYxNzg2N15BMl5BanBnXkFtZTcwNzQyMTkwNA@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTUxMjQ0NjE3OV5BMl5BanBnXkFtZTcwODIxNDEwNA@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTcwOTMwMDYyMl5BMl5BanBnXkFtZTcwMzAxMjMyNA@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMzg3MDUwMTI1OV5BMl5BanBnXkFtZTcwNzY0NDIxNA@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTM4MTUwNDg0OF5BMl5BanBnXkFtZTcwMjUyODYxNA@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTc3MjkyMzk4N15BMl5BanBnXkFtZTcwODQxMDg5Mw@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTkxOTI3Njc4MF5BMl5BanBnXkFtZTcwMzI0NTIzNA@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/images/M/MV5BMTQxMTgyNDc5M15BMl5BanBnXkFtZTcwMzk4OTM5Mw@@._V1_UY209_CR0,0,140,209_AL_.jpg',
 'https://m.media-amazon.com/ima
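The class test in the list comprehension works because Beautiful Soup treats 'class' as a multi-valued attribute and returns it as a list of names. Here is a small offline sketch of that behaviour; the markup is made up for illustration and only mimics the structure described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped like the structure described above.
sample_html = '''
<div class="list detail">
  <img class="poster shadowed" src="https://example.com/a.jpg" alt="Movie A (2011) Poster">
  <img class="rating-icon" src="https://example.com/star.png" alt="star">
  <img src="https://example.com/no-class.png" alt="no class attribute">
</div>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
details = soup.find('div', attrs={'class': 'list detail'})
images = details.find_all('img')

# 'class' is multi-valued, so it comes back as a list of class names.
first_classes = images[0]['class']

# .get('class', []) avoids a KeyError for the img that has no class attribute.
poster_urls = [img['src'] for img in images if 'poster' in img.get('class', [])]

print(first_classes)
print(poster_urls)
```

Using .get() with a default is a design choice: indexing with ['class'] would raise a KeyError on an img tag that carries no class at all.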

We can get the full-size image URLs by removing everything between '_V1_' and '.jpg'.

In [0]:
image_url = image_list[0]
#find the position of '_V1_' in the image URL and drop everything after it to get the full-size image
slice_index = image_url.find('_V1_')
#print(slice_index)
full_size_image_url = image_url[:slice_index] + '_V1_.jpg'

In [77]:
# Now let's see what the image URL looks like. Open the URL to see the full image.
full_size_image_url

'https://m.media-amazon.com/images/M/MV5BMzc3MjYxNzg2N15BMl5BanBnXkFtZTcwNzQyMTkwNA@@._V1_.jpg'
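The slicing step can be wrapped in a function. This is a minimal sketch: 'full_size_url' is a hypothetical name and the example.com address is a placeholder. It also guards one edge case: str.find returns -1 when the marker is absent, and slicing at -1 would silently drop the last character of the URL.

```python
def full_size_url(thumbnail_url, marker='_V1_'):
    """Drop the resize/crop suffix that follows the marker in a thumbnail URL."""
    index = thumbnail_url.find(marker)
    if index == -1:
        # No marker found: return the URL unchanged rather than slicing at -1.
        return thumbnail_url
    return thumbnail_url[:index] + marker + '.jpg'

thumb = 'https://example.com/images/M/abc@@._V1_UY209_CR0,0,140,209_AL_.jpg'
print(full_size_url(thumb))
print(full_size_url('https://example.com/x.png'))
```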

In [0]:
#now let's get the image from the server.
#request_webpage() appends a year and month to the URL, so call requests.get() directly here
img_res = requests.get(full_size_image_url)

In [0]:
# Save the image, writing it to disk in chunks
with open('MoviePoster0' + '.jpg', 'wb') as imageFile:
  for chunk in img_res.iter_content(100000):
    imageFile.write(chunk)
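The chunked write can also live in a small reusable function. The sketch below assumes only that the response object has an iter_content method, as the requests response used above does; 'save_response' and 'FakeResponse' are hypothetical names, and the fake class exists solely so the example runs without a network connection.

```python
import os
import tempfile

def save_response(response, path, chunk_size=100000):
    """Write a streamed response body to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as image_file:
        for chunk in response.iter_content(chunk_size):
            image_file.write(chunk)
            written += len(chunk)
    return written

class FakeResponse:
    """Stand-in for a requests response, so the sketch runs offline."""
    def __init__(self, body):
        self.body = body

    def iter_content(self, chunk_size):
        for start in range(0, len(self.body), chunk_size):
            yield self.body[start:start + chunk_size]

target = os.path.join(tempfile.mkdtemp(), 'MoviePoster0.jpg')
bytes_written = save_response(FakeResponse(b'x' * 250000), target)
print(bytes_written, os.path.getsize(target))
```

The 'with' block closes the file even if a write fails partway through.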

You can find files (in colab) by clicking the '>' icon on the top-left side of the screen. You may need to refresh.

## Find the names of the movies for this month 

And save them in a list

In [86]:
from datetime import datetime
current_date = datetime.today().strftime("%Y-%m")
print('The current year and month is', current_date)
movie_page = BeautifulSoup(request_webpage(url, current_date).text, 'html.parser')

div_page_details = movie_page.find('div', attrs = {'class': 'list detail'})
#inside the div tag, let's find all the img tags
image_details = div_page_details.find_all('img')

#rebuild image_list from this month's page so that it lines up with name_list
#(until now it held the poster URLs from the 2011-01 page)
image_list = [x['src'] for x in image_details if 'poster' in x['class']]
name_list = [x['alt'] for x in image_details if 'poster' in x['class']]
name_list

The current year and month is 2019-02


['Miss Bala (2019) Poster',
 'Arctic (2018) Poster',
 'Ek Ladki Ko Dekha Toh Aisa Laga (2019) Poster',
 'Ahlat Agaci (2018) Poster',
 'The Lego Movie 2: The Second Part (2019) Poster',
 'What Men Want (2019) Poster',
 'Cold Pursuit (2019) Poster',
 'The Prodigy (2019) Poster',
 'Todos lo saben (2018) Poster',
 'Under the Eiffel Tower (2018) Poster',
 'Chokehold (2019) Poster',
 'xiao zhu pei qi guo da nian (2019) Poster',
 'The Final Wish (2018) Poster',
 'Happy Death Day 2U (2019) Poster',
 'Alita: Battle Angel (2019) Poster',
 'Fighting with My Family (2019) Poster',
 "Isn't It Romantic (2019) Poster",
 'Pájaros de verano (2018) Poster',
 'Ruben Brandt, Collector (2018) Poster',
 'Donnybrook (2018) Poster',
 'Plaire, aimer et courir vite (2018) Poster',
 'How to Train Your Dragon: The Hidden World (2019) Poster',
 'Total Dhamaal (2019) Poster',
 'Run the Race (2019) Poster',
 'The Turning (2020) Poster']
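Each entry follows the shape 'Title (Year) Poster', so the title and year can be pulled apart with a regular expression. This is a sketch under that assumption; the pattern is inferred from the list above rather than from any documented IMDB format, and 'parse_alt_text' is a hypothetical helper.

```python
import re

# Matches '<title> (<year>) Poster'; an assumption drawn from the list above.
ALT_PATTERN = re.compile(r'^(?P<title>.+) \((?P<year>\d{4})\) Poster$')

def parse_alt_text(alt_text):
    """Return (title, year), or None if the text does not fit the pattern."""
    match = ALT_PATTERN.match(alt_text)
    if match is None:
        return None
    return match.group('title'), int(match.group('year'))

print(parse_alt_text('Alita: Battle Angel (2019) Poster'))
print(parse_alt_text('Ruben Brandt, Collector (2018) Poster'))
print(parse_alt_text('no year here'))
```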

## Collect all of the movie posters (for this month).

And, put them in a folder.

In [0]:
import os
#create a folder named after the year and month;
#exist_ok=True avoids an error if the folder already exists
os.makedirs(current_date, exist_ok=True)

In [106]:
for i in range(len(image_list)):
  image_url = image_list[i]
  slice_index = image_url.find('_V1_')
  full_size_image_url = image_url[:slice_index] + '_V1_.jpg'
  #request_webpage() appends a year and month to the URL, so fetch the image directly
  img_res = requests.get(full_size_image_url)
  try:
    with open(os.path.join(current_date, name_list[i] + '.jpg'), 'wb') as imageFile:
      for chunk in img_res.iter_content(100000):
        imageFile.write(chunk)
  except Exception as exc:
    print('There was a problem with writing the file for %s: %s' % (name_list[i], exc))

print('All Finished')

All Finished


## Collect a year's worth of movie posters, placing them in folders by month.


In [0]:
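def collect_media_info(date):
  #fetch the coming-soon page for a 'YYYY-MM' date and return (poster URLs, poster names)
  soup = BeautifulSoup(request_webpage(url, date).text, 'html.parser')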
  details = soup.find('div', attrs = {'class': 'list detail'})
  image_details = details.find_all('img')
  
  image_list = [x['src'] for x in image_details if 'poster' in x['class']]
  name_list = [x['alt'] for x in image_details if 'poster' in x['class']]
  return (image_list, name_list)

In [0]:
def download_month_of_posters(images, names, date):
  #create the folder for this month; exist_ok=True avoids an error if it already exists
  os.makedirs(date, exist_ok=True)

  for i in range(len(images)):
    image_url = images[i]
    slice_index = image_url.find('_V1_')
    full_size_image_url = image_url[:slice_index] + '_V1_.jpg'
    #request_webpage() appends a year and month to the URL, so fetch the image directly
    img_res = requests.get(full_size_image_url)
    name = names[i].replace('/', '-') # because file names can't have a slash
    try:
      with open(os.path.join(date, name + '.jpg'), 'wb') as imageFile:
        for chunk in img_res.iter_content(100000):
          imageFile.write(chunk)
    except Exception as exc:
      print('There was a problem with writing the file for %s: %s' % (names[i], exc))

  print('All Finished with %s' % (date))
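The function above swaps '/' for '-' because a slash is a path separator. On Colab (Linux) that is the only character that needs replacing; if the files are later unpacked on Windows, a few more characters are rejected there. A hedged sketch of a broader clean-up follows; 'safe_file_name' is a hypothetical helper and the first title is only an illustrative input.

```python
import re

def safe_file_name(name, replacement='-'):
    """Replace characters that are awkward in file names on common systems."""
    # Slash and backslash are path separators; the others are rejected on Windows.
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, name)
    return cleaned.strip()

print(safe_file_name('Face/Off (1997) Poster'))
print(safe_file_name('Alita: Battle Angel (2019) Poster'))
```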

In [132]:
# month numbers for the URLs
month_nums = [str(x+1).zfill(2) for x in range(12)]
month_nums

['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
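Pairing these month numbers with a year always yields all twelve months, including months that are still in the future. If that is not wanted, one alternative is to generate the 'YYYY-MM' strings between two dates directly. This is a sketch only; 'month_range' is a hypothetical helper, not something the notebook defines.

```python
def month_range(start, end):
    """Yield 'YYYY-MM' strings from start to end, inclusive."""
    year, month = (int(part) for part in start.split('-'))
    end_year, end_month = (int(part) for part in end.split('-'))
    while (year, month) <= (end_year, end_month):
        yield '%d-%s' % (year, str(month).zfill(2))
        month += 1
        if month > 12:
            # Roll over into January of the next year.
            year, month = year + 1, 1

months = list(month_range('2018-11', '2019-02'))
print(months)
```

Passing the current month as the end value stops the sequence before any future months.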

In [133]:
year = '2019'
for ii in month_nums:
  date = year + '-' + ii
  images, names = collect_media_info(date)
  download_month_of_posters(images, names, date)

All Finished with 2019-01
All Finished with 2019-02
All Finished with 2019-03
All Finished with 2019-04
All Finished with 2019-05
All Finished with 2019-06
All Finished with 2019-07
All Finished with 2019-08
All Finished with 2019-09
All Finished with 2019-10
All Finished with 2019-11
All Finished with 2019-12


## Collect all the movie posters since the start of this page (2011-01)

And put them in different folders

In [151]:
current_year = datetime.now().year
print(current_year)
#loop over every year from 2011 up to and including the current year
for year in range(2011, current_year + 1):
  for ii in month_nums:
    folder_name = str(year) + '-' + ii
    #print(folder_name)
    images, names = collect_media_info(folder_name)
    download_month_of_posters(images, names, folder_name)

2019
2011-01
All Finished with 2011-01
2011-02
All Finished with 2011-02
2011-03
All Finished with 2011-03
2011-04
All Finished with 2011-04
2011-05
All Finished with 2011-05
2011-06
All Finished with 2011-06
2011-07
All Finished with 2011-07
2011-08
All Finished with 2011-08
2011-09
All Finished with 2011-09
2011-10
All Finished with 2011-10
2011-11
All Finished with 2011-11
2011-12
All Finished with 2011-12
2012-01
All Finished with 2012-01
2012-02
All Finished with 2012-02
2012-03
All Finished with 2012-03
2012-04
All Finished with 2012-04
2012-05
All Finished with 2012-05
2012-06
All Finished with 2012-06
2012-07
All Finished with 2012-07
2012-08
All Finished with 2012-08
2012-09
All Finished with 2012-09
2012-10
All Finished with 2012-10
2012-11
All Finished with 2012-11
2012-12
All Finished with 2012-12
2013-01
All Finished with 2013-01
2013-02
All Finished with 2013-02
2013-03
All Finished with 2013-03
2013-04
All Finished with 2013-04
2013-05
All Finished with 2013-05
2013-06
A

## Zip all the files you have collected in order to download them on your local machine (without a million clicks)

See this article for how to go about zipping: https://www.geeksforgeeks.org/working-zip-files-python/

In [0]:
from zipfile import ZipFile 

In [0]:
# code for deleting a directory,
# in case you make a mistake and want to clean up:
!rm -rf '2010-01' # '2010-01' is the folder to delete

In [0]:
def get_file_paths(directory): 
  file_paths = []
  files = os.listdir(directory)
  for filename in files: 
    filepath = os.path.join(directory, filename)
    file_paths.append(filepath)
  return file_paths 
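Note that os.listdir raises an error when the directory does not exist, for example for a month whose download never ran. One way to tolerate that is sketched below; 'get_existing_file_paths' is a hypothetical variant of the function above, and the temporary folder stands in for the real month folders.

```python
import os
import tempfile

def get_existing_file_paths(directory):
    """Like get_file_paths, but return [] when the directory is missing."""
    if not os.path.isdir(directory):
        return []
    return sorted(os.path.join(directory, name) for name in os.listdir(directory))

# Build a throwaway folder with two empty files to try it on.
root = tempfile.mkdtemp()
month_dir = os.path.join(root, '2011-01')
os.makedirs(month_dir)
for file_name in ('b.jpg', 'a.jpg'):
    open(os.path.join(month_dir, file_name), 'wb').close()

found = get_existing_file_paths(month_dir)
missing = get_existing_file_paths(os.path.join(root, '2011-02'))
print([os.path.basename(p) for p in found], missing)
```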

In [158]:
year = 2011
cwd_file_paths = []
while(year <= datetime.now().year):
  
  for i in range(12):
    date = str(year) + '-' + str(i+1).zfill(2)
    cwd_file_paths += get_file_paths(date)
  year+=1
print(cwd_file_paths)

['2011-01/From Prada to Nada (2011) Poster.jpg', '2011-01/The Mechanic (2011) Poster.jpg', '2011-01/No Strings Attached (2011) Poster.jpg', "2011-01/Barney's Version (2010) Poster.jpg", '2011-01/The Way Back (2010) Poster.jpg', '2011-01/Kaboom (2010) Poster.jpg', '2011-01/Country Strong (2010) Poster.jpg', '2011-01/The Dilemma (2011) Poster.jpg', '2011-01/Biutiful (2010) Poster.jpg', '2011-01/Season of the Witch (2011) Poster.jpg', '2011-01/The Rite (2011) Poster.jpg', '2011-01/The Green Hornet (2011) Poster.jpg', '2011-01/En ganske snill mann (2010) Poster.jpg', '2011-01/The Company Men (2010) Poster.jpg', '2011-01/Ong-bak 3 (2010) Poster.jpg', '2011-02/Heartbeats (2010) Poster.jpg', '2011-02/También la lluvia (2010) Poster.jpg', '2011-02/Unknown (2011) Poster.jpg', '2011-02/Drive Angry (2011) Poster.jpg', '2011-02/Justin Bieber: Never Say Never (2011) Poster.jpg', '2011-02/Just Go with It (2011) Poster.jpg', '2011-02/The Eagle (2011) Poster.jpg', '2011-02/Des hommes et des dieux (201

In [160]:
from tqdm import tqdm
#uncomment to list the files before zipping them
#print('The following files will be zipped:')
#for file_name in cwd_file_paths:
#  print(file_name)

#named zip_file rather than zip, to avoid shadowing the built-in zip()
with ZipFile('Movie-posters.zip','w') as zip_file:
  for file in tqdm(cwd_file_paths):
    zip_file.write(file)

print('All files zipped successfully!')

100%|██████████| 3393/3393 [01:08<00:00, 49.29it/s]

All files zipped successfully!
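Before deleting anything, it may be worth checking what actually ended up in the archive. The sketch below writes one placeholder file into a zip inside a temporary folder and reads the stored names back with namelist(). The file name and contents are made up; the 'arcname' argument is used so the archive stores a relative path instead of the absolute temporary path.

```python
import os
import tempfile
from zipfile import ZipFile

# Create a placeholder 'poster' inside a throwaway month folder.
root = tempfile.mkdtemp()
month_dir = os.path.join(root, '2011-01')
os.makedirs(month_dir)
poster_path = os.path.join(month_dir, 'Example (2011) Poster.jpg')
with open(poster_path, 'wb') as poster:
    poster.write(b'not a real image')

archive_path = os.path.join(root, 'Movie-posters.zip')
with ZipFile(archive_path, 'w') as archive:
    # arcname stores the path relative to root instead of the absolute temp path.
    archive.write(poster_path, arcname=os.path.relpath(poster_path, root))

# Reopen the archive for reading and list what it contains.
with ZipFile(archive_path) as archive:
    stored_names = archive.namelist()
print(stored_names)
```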





## Delete the folders and files after zipping them

In [0]:
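#To delete a single folder, we can use this command:
#!rm -rf '2019-01'
import re, shutil
path = '.'
files = os.listdir(path)

for name in files:
  #only remove directories named like 'YYYY-MM'; the raw string keeps '\d' intact
  if re.match(r'^\d{4}-\d{2}$', name) and os.path.isdir(os.path.join(path, name)):
    shutil.rmtree(os.path.join(path, name))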
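Since rmtree cannot be undone, a dry run that only lists the candidate folders can be a useful first step. This is a sketch under the assumption that the poster folders are the only ones named 'YYYY-MM'; 'month_folders' is a hypothetical helper, and the temporary directory stands in for the working directory.

```python
import os
import re
import tempfile

MONTH_FOLDER = re.compile(r'^\d{4}-\d{2}$')

def month_folders(path='.'):
    """Return the sub-directories of path whose names look like 'YYYY-MM'."""
    return sorted(
        name for name in os.listdir(path)
        if MONTH_FOLDER.match(name) and os.path.isdir(os.path.join(path, name))
    )

# Two month folders, one unrelated folder, and one file that starts with digits.
root = tempfile.mkdtemp()
for folder in ('2011-01', '2019-12', '2020_notes'):
    os.makedirs(os.path.join(root, folder))
open(os.path.join(root, '1984.txt'), 'w').close()

to_delete = month_folders(root)
print(to_delete)
```

Only the names in the printed list would then be passed to shutil.rmtree.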