## IMDB Scraper using Selenium

- import necessary libraries
- go to google.com
- search for top 100 movies
- get the list of details of 100 movies

In [1]:
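from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# The absolute XPaths below match the page layouts at the time of writing
driver.get('https://google.com')
# find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
input_box = driver.find_element(By.XPATH, '//*[@id="tsf"]/div[2]/div[1]/div[1]/div/div[2]/input')
input_box.clear()
input_box.send_keys('Top 100 Movies imdb')
input_box.send_keys(Keys.ENTER)
# open the first search result, which is the IMDb list
imdb_link = driver.find_element(By.XPATH, '//*[@id="rso"]/div[1]/div[1]/div/div[1]/div/div[2]/div/div[1]/a/h3')
imdb_link.click()

# one element per movie in the list
movies_list = driver.find_elements(By.XPATH, '/html/body/div[3]/div/div[2]/div[3]/div[1]/div/div[4]/div[3]/div')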

[WDM] - Current google-chrome version is 87.0.4280
[WDM] - Get LATEST driver version for 87.0.4280
[WDM] - Driver [C:\Users\prajw\.wdm\drivers\chromedriver\win32\87.0.4280.88\chromedriver.exe] found in cache


- take a number input from 1-100 (as there are 100 movies)
- display the movie name
- save the movie details as a screenshot with the name as the movie name

In [2]:
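n = int(input("Which movie in the list do you want?:\n"))
movie = movies_list[n-1]  # the list is 0-indexed, the ranking starts at 1
# the first line of the element text is the title; ':' is dropped because
# it is not allowed in Windows file names
movie_name = movie.text.split('\n')[0].replace(':', '')
print(f"Movie is: {movie_name}")
print("Saving screenshot with movie details!")
movie.screenshot(f"{movie_name}.png")  # returns True on success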

Which movie in the list do you want?:
32
Movie is: 32. Interstellar (2014)
Saving screenshot with movie details!


True
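
The cell above only removes `:` from the movie name before using it as a file name. A slightly more general clean-up could look like the sketch below. The helper name `safe_filename` is hypothetical, the character set assumes the screenshot is saved on Windows (as the driver path in the log suggests), and the `Face/Off (1997)` title is a made-up example that is not part of the scraped list.

```python
import re

# Characters that Windows does not allow in file names (assumption: the
# screenshots are saved on a Windows machine).
ILLEGAL_CHARS = r'[<>:"/\\|?*]'

def safe_filename(title: str, extension: str = ".png") -> str:
    """Build a file name from `title` with the illegal characters removed."""
    cleaned = re.sub(ILLEGAL_CHARS, "", title)
    # Collapse runs of whitespace and drop trailing dots and spaces.
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    # Fall back to a fixed name if nothing is left after cleaning.
    return f"{cleaned or 'untitled'}{extension}"

print(safe_filename("3. The Godfather: Part II (1974)"))
print(safe_filename("Face/Off (1997)"))  # hypothetical title
```

With a helper like this, the screenshot call would become `movie.screenshot(safe_filename(movie_name))`.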

- create an empty dataframe
- iterate over the movies list and add details of each movie to the dataframe
- if some detail of the movie is missing, then that movie is not included
- this part could also be done with BeautifulSoup, which might even be a little faster, but I used Selenium because I was learning it
- convert dataframe to a csv file

In [3]:
df = pd.DataFrame(columns=['Movie name', 'Rating', 'Duration', 'Genre', 'IMDB Rating',
                           'Metascore', 'Description', 'Director', 'Stars', 'Votes', 'Gross'])

for movie in movies_list:
    movie_details = movie.text.split('\n')
    index = len(df)

    try:
        name = movie_details[0]
        rating, duration, genre = list(map(str.strip, movie_details[1].split('|')))
        imdb_rating = movie_details[2]
        metascore = movie_details[4].split()[0]
        description = movie_details[5]
        director = movie_details[6].split('|')[0].split(':')[1].strip()
        stars = list(map(str.strip, movie_details[6].split('|')[1].split(':')[1:]))
        votes = movie_details[7].split('|')[0].split(':')[1].strip()
        gross = movie_details[7].split('|')[1].split(':')[1].strip()

        df.loc[index] = [name, rating, duration, genre, imdb_rating,
                         metascore, description, director, *stars, votes, gross]
    except (IndexError, ValueError):
        # a detail is missing from this entry, so the movie is skipped
        continue

df.to_csv('Top 100 Movies data.csv', index=False)
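
The loop above drops a movie as soon as one field is missing, because every field is read by position. One possible alternative, sketched below, is to read the labelled lines (`Director: ... | Stars: ...` and `Votes: ... | Gross: ...`) by label, so that a missing field simply comes back as `None`. The helper `labelled_fields` is hypothetical, and the two sample strings are shortened, made-up lines shaped like `movie_details[6]` and `movie_details[7]`, not text copied from the page.

```python
def labelled_fields(line: str) -> dict:
    """Split 'Label: value | Label: value' into a {label: value} dict."""
    fields = {}
    for part in line.split('|'):
        label, sep, value = part.partition(':')
        if sep:  # ignore fragments that contain no colon
            fields[label.strip()] = value.strip()
    return fields

# Hypothetical sample lines (assumed layout, shortened cast list)
credits = "Director: Frank Darabont | Stars: Tim Robbins, Morgan Freeman"
numbers = "Votes: 2319371"  # this entry has no Gross figure

row = {**labelled_fields(credits), **labelled_fields(numbers)}
print(row.get('Director'))
print(row.get('Gross'))  # a missing field yields None instead of an exception
```

Whether to keep incomplete rows or skip them is a design choice: skipping keeps the CSV free of gaps, while keeping them preserves all 100 titles at the cost of empty cells.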

In [4]:
df.head()

| | Movie name | Rating | Duration | Genre | IMDB Rating | Metascore | Description | Director | Stars | Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1. The Shawshank Redemption (1994) | R | 142 min | Drama | 9.3 | 80 | Two imprisoned men bond over a number of years... | Frank Darabont | Tim Robbins, Morgan Freeman, Bob Gunton, Willi... | 2319371 | $28.34M |
| 1 | 2. The Godfather (1972) | R | 175 min | Crime, Drama | 9.2 | 100 | The aging patriarch of an organized crime dyna... | Francis Ford Coppola | Marlon Brando, Al Pacino, James Caan, Diane Ke... | 1601681 | $134.97M |
| 2 | 3. The Godfather: Part II (1974) | R | 202 min | Crime, Drama | 9.0 | 90 | The early life and career of Vito Corleone in ... | Francis Ford Coppola | Al Pacino, Robert De Niro, Robert Duvall, Dian... | 1118688 | $57.30M |
| 3 | 4. The Dark Knight (2008) | PG-13 | 152 min | Action, Crime, Drama | 9.0 | 84 | When the menace known as the Joker wreaks havo... | Christopher Nolan | Christian Bale, Heath Ledger, Aaron Eckhart, M... | 2281507 | $534.86M |
| 4 | 5. 12 Angry Men (1957) | Approved | 96 min | Crime, Drama | 9.0 | 96 | A jury holdout attempts to prevent a miscarria... | Sidney Lumet | Henry Fonda, Lee J. Cobb, Martin Balsam, John ... | 682526 | $4.36M |
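
In the output above, the movie name still carries its rank and year, and `Duration` and `Gross` are strings such as `96 min` and `$4.36M`. For numeric analysis it may help to convert them first. The helpers below are a minimal sketch with hypothetical names (`split_name`, `minutes`, `gross_in_millions`); they assume every value follows the formats shown in the five rows above.

```python
import re

def split_name(movie_name: str):
    """Split 'rank. title (year)' into (rank, title, year)."""
    match = re.fullmatch(r"(\d+)\.\s+(.*)\s+\((\d{4})\)", movie_name)
    if match is None:
        raise ValueError(f"unexpected movie name format: {movie_name!r}")
    rank, title, year = match.groups()
    return int(rank), title, int(year)

def minutes(duration: str) -> int:
    """Convert a duration such as '96 min' to an integer number of minutes."""
    return int(duration.split()[0])

def gross_in_millions(gross: str) -> float:
    """Convert a gross such as '$4.36M' to a float in millions of dollars."""
    return float(gross.lstrip('$').rstrip('M'))

print(split_name("5. 12 Angry Men (1957)"))
print(minutes("96 min"))
print(gross_in_millions("$4.36M"))
```

On the dataframe, these could be applied column by column, for example `df['Duration'].map(minutes)`.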


### Movie data can be used to perform data analysis

In [5]:
from collections import Counter
Counter(df.Director).most_common(6)

[('Christopher Nolan', 6),
 ('Stanley Kubrick', 5),
 ('Quentin Tarantino', 4),
 ('Alfred Hitchcock', 4),
 ('Francis Ford Coppola', 3),
 ('Steven Spielberg', 3)]

In [6]:
for item in Counter(df.Director).most_common(6):
    director = item[0]
    print(f"Movies directed by {director}")
    print(df[df.Director == director]['Movie name'], end='\n\n')

Movies directed by Christopher Nolan
3            4. The Dark Knight (2008)
13                14. Inception (2010)
29             32. Interstellar (2014)
44             47. The Prestige (2006)
51                  55. Memento (2000)
62    70. The Dark Knight Rises (2012)
Name: Movie name, dtype: object

Movies directed by Stanley Kubrick
57                               62. The Shining (1980)
59    66. Dr. Strangelove or: How I Learned to Stop ...
78                     90. 2001: A Space Odyssey (1968)
82                         94. Full Metal Jacket (1987)
85                        97. A Clockwork Orange (1971)
Name: Movie name, dtype: object

Movies directed by Quentin Tarantino
7              8. Pulp Fiction (1994)
55        60. Django Unchained (2012)
73          83. Reservoir Dogs (1992)
76    87. Inglourious Basterds (2009)
Name: Movie name, dtype: object

Movies directed by Alfred Hitchcock
38                41. Psycho (1960)
47           51. Rear Window (1954)
79               9