The code is adapted from
[this article](https://python.plainenglish.io/how-to-scrape-imdb-data-9d7535b98576).
<br>
I added some extra columns and made a few modifications here and there.


First, we need to import Beautiful Soup along with some other packages. This application needs to download data from a large number of IMDB URLs, so we will use Python's concurrent.futures module to run the requests in parallel.

In [39]:
import requests
from bs4 import BeautifulSoup
from dateutil.parser import parse
import concurrent.futures
import pandas as pd
import re

Attributes that we are interested in:
<br>
For each movie, we will collect the following attributes:
<ol>
<li>Movie title</li>
<li>Year the movie was released</li>
<li>The genre of the movie</li>
<li>Synopsis of the movie</li>
<li>Image URL for the poster</li>
<li>Image ID (this is the same as the movie's unique identifier, for example tt0155217)</li>
<li>Director's Name</li>
<li>Cast List</li>
</ol>

In [40]:
movie_title_arr = []
movie_year_arr = []
movie_genre_arr = []
movie_synopsis_arr =[]
image_url_arr  = []
image_id_arr = []
director_arr = []
cast_arr = []

The helper functions below each extract one string from an HTML element and return 'NA' when the element or attribute cannot be found.

In [41]:
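def getMovieTitle(header):
    # The title is the text of the first link inside the item header
    try:
        return header[0].find("a").getText()
    except Exception:
        return 'NA'

def getReleaseYear(header):
    # Keep only the digits of the year label, e.g. "(1950)" becomes "1950"
    try:
        return re.sub(r'\D', '', header[0].find("span",  {"class": "lister-item-year text-muted unbold"}).getText())
    except Exception:
        return 'NA'

def getGenre(muted_text):
    # The genre sits in a span with class "genre"
    try:
        return muted_text.find("span",  {"class":  "genre"}).getText().strip()
    except Exception:
        return 'NA'

def getsynopsys(movie):
    # The synopsis is the second "text-muted" paragraph of the item
    try:
        ret = movie.find_all("p", {"class":  "text-muted"})[1].getText().replace('\n',' ')
        if ret == " Add a Plot ":
            return "NA"
        else:
            return ret
    except Exception:
        return 'NA'

def getImage(image):
    # The poster URL is stored in the "loadlate" attribute of the img tag
    try:
        return image.get('loadlate')
    except Exception:
        return 'NA'

def getImageId(image):
    # The unique identifier is stored in the "data-tconst" attribute
    try:
        return image.get('data-tconst')
    except Exception:
        return 'NA'

def getDirector(movie):
    # The first link in the credits paragraph is taken to be the director
    try:
        return movie.find('p',class_='').find_all('a')[0].text
    except Exception:
        return 'NA'

def getActors(movie):
    # The remaining links in the credits paragraph are taken to be the cast
    try:
        ret = ", ".join(a.text for a in movie.find('p',class_='').find_all('a')[1:])
        return ret if ret else "NA"
    except Exception:
        return 'NA'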
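
As a small illustration of the year helper, the sketch below shows what stripping every non-digit character does to a few label strings. The label strings are made-up examples, and `first_year` is a hypothetical alternative that is not part of the original code: it keeps only the first four-digit run, which would matter if a label ever contained more than one year.

```python
import re

def digits_only(text):
    """Same idea as getReleaseYear: remove every non-digit character."""
    return re.sub(r'\D', '', text)

def first_year(text):
    """Hypothetical alternative: return the first four-digit run, or 'NA'."""
    match = re.search(r'\d{4}', text)
    return match.group(0) if match else 'NA'

# Made-up label strings, used only for illustration
print(digits_only('(1950)'))        # 1950
print(digits_only('(I) (1950)'))    # 1950
print(digits_only('(2019-2021)'))   # 20192021
print(first_year('(2019-2021)'))    # 2019
print(first_year('no year here'))   # NA
```

With single-year labels both functions give the same result, so the choice only matters for labels that contain a range.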

The main function that uses the provided URL to scrape data
<br>
This function downloads one IMDB results page, iterates over every movie listed on it, and appends each extracted attribute to the corresponding list. We will call it once for each IMDB page URL.


In [42]:
def main(imdb_url):
    response = requests.get(imdb_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Movie Name
    movies_list  = soup.find_all("div", {"class": "lister-item mode-advanced"})
    
    for movie in movies_list:
        header = movie.find_all("h3", {"class":  "lister-item-header"})
        muted_text = movie.find_all("p", {"class":  "text-muted"})[0]
        imageDiv =  movie.find("div", {"class": "lister-item-image float-left"})
        image = imageDiv.find("img", "loadlate")
        
        #  Movie Title
        movie_title =  getMovieTitle(header)
        movie_title_arr.append(movie_title)
        
        #  Movie release year
        year = getReleaseYear(header)
        movie_year_arr.append(year)
        
        #  Genre  of movie
        genre = getGenre(muted_text)
        movie_genre_arr.append(genre)
        
        # Movie Synopsis
        synopsis = getsynopsys(movie)
        movie_synopsis_arr.append(synopsis)
        
        #  Image attributes
        img_url = getImage(image)
        image_url_arr.append(img_url)
        
        image_id = getImageId(image)
        image_id_arr.append(image_id)

        # director
        director=getDirector(movie)
        director_arr.append(director)

        # actors
        actors = getActors(movie)
        cast_arr.append(str(actors))
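
To see how the loop in main walks one result block without touching the network, here is a minimal sketch that parses a hand-written HTML fragment. The fragment is an assumption modeled on the class names used above, not a copy of a real IMDB page, and `parse_sample` is a hypothetical helper. For simplicity it matches on a single class name per lookup and covers only four of the attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates one search result; not real IMDB markup
SAMPLE_HTML = """
<div class="lister-item mode-advanced">
  <img class="loadlate" loadlate="https://example.com/poster.jpg" data-tconst="tt0000001">
  <h3 class="lister-item-header">
    <a href="#">Example Film</a>
    <span class="lister-item-year text-muted unbold">(1950)</span>
  </h3>
  <p class="text-muted"><span class="genre"> Drama </span></p>
</div>
"""

def parse_sample(html):
    """Return one dict per result block found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for movie in soup.find_all("div", class_="lister-item"):
        header = movie.find("h3", class_="lister-item-header")
        image = movie.find("img", class_="loadlate")
        genre = movie.find("span", class_="genre")
        # Fall back to 'NA' whenever a tag is missing, as the helpers above do
        rows.append({
            "Image_ID": image.get("data-tconst") if image else "NA",
            "Title": header.find("a").getText() if header else "NA",
            "Genre": genre.getText().strip() if genre else "NA",
            "image_url": image.get("loadlate") if image else "NA",
        })
    return rows

rows = parse_sample(SAMPLE_HTML)
print(rows[0]["Title"], rows[0]["Genre"])
```

Building one dictionary per movie keeps all attributes of a movie together, which is an alternative to appending to eight separate lists.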

The for loop below generates the URLs for the list of movies that match the filter we have specified. In this section, we are only looking at feature films in Bangla with a rating between 1.0 and 10.0, sorted by year in descending order. Each page returns 250 results, and main expects the advanced view, since it looks for items with the class "lister-item mode-advanced".

In [43]:
# List of all the URLs to be queried
imageArr = []

# Maximum number of pages to iterate over
MAX_PAGE = 10

# Loop to generate all the URLs
for i in range(0,MAX_PAGE):
    totalRecords = 0 if i==0 else (250*i)+1
    print(totalRecords)
    #imdb_url = f'https://www.imdb.com/search/title/?release_date=2020-01-02,2021-02-01&user_rating=4.0,10.0&languages=en&count=250&start={totalRecords}&ref_=adv_nxt'
    imdb_url = f'https://www.imdb.com/search/title/?languages=bn&sort=year,desc&user_rating=1.0,10.0&title_type=feature&count=250&start={totalRecords}&ref_=adv_nxt'
    imageArr.append(imdb_url)

0
251
501
751
1001
1251
1501
1751
2001
2251
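
The same URLs can also be assembled from a dictionary of query parameters, which makes the filter easier to read and change. This is a minimal sketch using the standard library's `urlencode`. The `build_page_urls` name is hypothetical, and the sketch assumes that result offsets are 1-based, so it starts the first page at 1 where the loop above uses 0.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"
PAGE_SIZE = 250

def build_page_urls(max_pages, language="bn"):
    """Build one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(max_pages):
        params = {
            "languages": language,
            "sort": "year,desc",
            "user_rating": "1.0,10.0",
            "title_type": "feature",
            "count": PAGE_SIZE,
            # Assumption: offsets are 1-based, giving 1, 251, 501, ...
            "start": page * PAGE_SIZE + 1,
        }
        # safe=',' keeps the commas in the sort and rating values unescaped
        urls.append(f"{BASE_URL}?{urlencode(params, safe=',')}")
    return urls

for url in build_page_urls(3):
    print(url)
```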


The download function below takes the list of URLs and calls the main function on each of them. It does this in parallel, using at most MAX_THREADS worker threads.

In [44]:
# Maximum number of threads that will be spawned
MAX_THREADS = 50
def download_stories(story_urls):
    threads = min(MAX_THREADS, len(story_urls))
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        # Consume the iterator so that exceptions raised inside main are not silently discarded
        list(executor.map(main, story_urls))
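
Because main appends to several shared lists from many threads at once, appends from different pages can in principle interleave, so that the n-th title and the n-th year no longer belong to the same movie. One way around this, sketched below, is to have each worker return its own rows and to merge them in the calling thread. The `scrape_page` function here is a hypothetical stand-in that fabricates rows without any network access, and `download_all` is likewise an illustrative name.

```python
import concurrent.futures

def scrape_page(url):
    """Hypothetical stand-in for a page scraper: returns a list of row dicts."""
    page = url.rsplit("=", 1)[-1]
    return [{"page": page, "title": f"title-{page}-{n}"} for n in range(2)]

def download_all(urls, max_threads=50):
    """Scrape all URLs in parallel and return the rows in input order."""
    workers = max(1, min(max_threads, len(urls)))
    rows = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the order of the input URLs and
        # re-raises any exception from a worker when its result is read
        for page_rows in executor.map(scrape_page, urls):
            rows.extend(page_rows)
    return rows

urls = [f"https://example.invalid/search?start={s}" for s in (1, 251, 501)]
rows = download_all(urls)
print(len(rows))
```

Since only the calling thread touches `rows`, no lock is needed, and every row keeps its attributes together.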

Finally, we call the download function and then get our required data. This may take a while.

In [45]:
# Call the download function with the array of URLS called imageArr
download_stories(imageArr)

print('--------- Download Complete --------')


--------- Download Complete --------


Next, we move the scraped data from the Python lists into a pandas DataFrame.

In [46]:
print(len(director_arr))
print(len(movie_title_arr))
# Attach all the data to the pandas dataframe. You can optionally write it to a CSV file as well
movieDf = pd.DataFrame({
    "Image_ID": image_id_arr,
    "Title": movie_title_arr,
    "Director": director_arr,
    "Cast": cast_arr,
    "Year": movie_year_arr,
    "Genre": movie_genre_arr,
    "Synopsis": movie_synopsis_arr,
    "image_url": image_url_arr,
})


2309
2309
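
The two printed lengths are a quick sanity check, since pd.DataFrame raises a ValueError when the lists passed in a dictionary have different lengths. A slightly more thorough check compares all columns at once. The sketch below uses a hypothetical `check_equal_lengths` helper and made-up sample values.

```python
def check_equal_lengths(columns):
    """Return the common length of all columns, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Made-up sample values, used only for illustration
sample = {
    "Title": ["Tathapi", "Mashaal"],
    "Year": ["1950", "1950"],
    "Genre": ["Drama", "Drama"],
}
print(check_equal_lengths(sample))
```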


Checking whether data has been properly stored or not.

In [47]:
movieDf[0:10]

Unnamed: 0,Image_ID,Title,Director,Cast,Year,Genre,Synopsis,image_url
0,tt0155217,Sudhar Prem,Premankur Atorthy,"Asitbaran, Manoranjan Bhattacharya, Lila Dasgu...",1950,,,https://m.media-amazon.com/images/S/sash/NapCx...
1,tt0043026,Tathapi,Manoj Bhattacharya,"Bhanu Bannerjee, Gangapada Basu, Bijon Bhattac...",1950,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
2,tt0042719,Mashaal,Nitin Bose,"Ashok Kumar, Sumitra Devi, Ruma Guha Thakurta,...",1950,Drama,,https://m.media-amazon.com/images/M/MV5BYmJjZj...
3,tt0267730,Mantramugdhu,Bimal Roy,"Jiben Bose, Reba Bose, Tulsi Chakraborty, Jaha...",1949,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
4,tt0231225,Bamuner Meye,Ajoy Kar,"Sunil Das Gupta, Anubha Gupta, Tulsi Lahiri, S...",1949,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
5,tt0214563,Cartoon,Dhirendranath Ganguly,,1949,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
6,tt0156695,Kavi,Debaki Bose,"Robin Majumdar, Nitish Mukherjee, Anubha Gupta...",1949,Drama,"Kavi is a 1949 Indian Bengali film, directed ...",https://m.media-amazon.com/images/M/MV5BOWE2MT...
7,tt0152283,Sankalpa,Agradoot,"N.B. Agrami, Bibhuti Laha, Sikharani Bag, Moli...",1949,Drama,,https://m.media-amazon.com/images/M/MV5BOGJkZj...
8,tt0243353,Kalo Chhaya,Premendra Mitra,"Gurudas Bannerjee, Dhiraj Bhattacharya, Sipra ...",1948,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
9,tt0157061,Sir Sankarnath,Debaki Bose,"Ajit Bandyopadhyay, Jiben Bose, Tulsi Chakrabo...",1948,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...


Performing some data cleaning: we drop the rows for which no genre was found.

In [48]:
# Keep only the rows whose genre is not the 'NA' placeholder
movieDf = movieDf[movieDf['Genre'] != 'NA']
movieDf[0:10]

Unnamed: 0,Image_ID,Title,Director,Cast,Year,Genre,Synopsis,image_url
1,tt0043026,Tathapi,Manoj Bhattacharya,"Bhanu Bannerjee, Gangapada Basu, Bijon Bhattac...",1950,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
2,tt0042719,Mashaal,Nitin Bose,"Ashok Kumar, Sumitra Devi, Ruma Guha Thakurta,...",1950,Drama,,https://m.media-amazon.com/images/M/MV5BYmJjZj...
3,tt0267730,Mantramugdhu,Bimal Roy,"Jiben Bose, Reba Bose, Tulsi Chakraborty, Jaha...",1949,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
4,tt0231225,Bamuner Meye,Ajoy Kar,"Sunil Das Gupta, Anubha Gupta, Tulsi Lahiri, S...",1949,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
5,tt0214563,Cartoon,Dhirendranath Ganguly,,1949,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
6,tt0156695,Kavi,Debaki Bose,"Robin Majumdar, Nitish Mukherjee, Anubha Gupta...",1949,Drama,"Kavi is a 1949 Indian Bengali film, directed ...",https://m.media-amazon.com/images/M/MV5BOWE2MT...
7,tt0152283,Sankalpa,Agradoot,"N.B. Agrami, Bibhuti Laha, Sikharani Bag, Moli...",1949,Drama,,https://m.media-amazon.com/images/M/MV5BOGJkZj...
8,tt0243353,Kalo Chhaya,Premendra Mitra,"Gurudas Bannerjee, Dhiraj Bhattacharya, Sipra ...",1948,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
9,tt0157061,Sir Sankarnath,Debaki Bose,"Ajit Bandyopadhyay, Jiben Bose, Tulsi Chakrabo...",1948,Drama,,https://m.media-amazon.com/images/S/sash/NapCx...
10,tt0156490,Drishtidan,Nitin Bose,"Asitbaran, Sunanda Banerjee, Biman Bannerjee, ...",1948,Drama,,https://m.media-amazon.com/images/M/MV5BZGM1Yj...


Finally, we write the pandas DataFrame to a CSV file.

In [49]:
movieDf.to_csv('imdb bangla movie dataset.csv', index=False)