**Importing Libraries**

The first step in the process is to import libraries. 

Goal: We're going to extract the top 250 movies from IMDb using Python with Beautiful Soup, lxml and a few other libraries.

The first stage is importing the libraries we need for the data extraction. At each stage we will explain the libraries involved and how they are used.

In [1]:
# !pip install unidecode

In [2]:
import requests
from bs4 import BeautifulSoup
from lxml import etree as et
import time
import random
import json
from unidecode import unidecode

In [3]:
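# The search page below lists only the top 100; the titleColumn XPath used later targets the chart page's table layout
# start_url = "https://www.imdb.com/search/title/?groups=top_100"
start_url = "https://www.imdb.com/chart/top"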
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
movie_urls = []

The next step is to collect the links to the individual movies from the start URL and save them in the list we declared above. First we use the requests library to fetch the page, then we use Beautiful Soup to parse it into a BeautifulSoup object. To query the HTML for the links, we use the etree module from lxml. Inspecting the page with Chrome Developer Tools showed that the links are available at the XPath given in the expression below. The expression returns a list of URLs.

In [4]:
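response = requests.get(start_url, headers=header)
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.content, 'html.parser')
# Hand the parsed page to lxml so it can be queried with XPath
dom = et.HTML(str(soup))
# href attribute of each title link
movie_urls_list = dom.xpath('//td[@class="titleColumn"]/a/@href')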
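
To see what the XPath expression selects without making a network request, here is a minimal offline sketch. It swaps lxml for the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (there is no `/@href` step, so the attribute is read with `.get()`). `SAMPLE_HTML` is a made-up, well-formed fragment, not IMDb's actual markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment imitating the structure the XPath expects
SAMPLE_HTML = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt0000001/?ref_=example_1">Example A</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt0000002/?ref_=example_2">Example B</a></td></tr>
  <tr><td class="posterColumn"><a href="/ignored/">poster</a></td></tr>
</table>
"""

root = ET.fromstring(SAMPLE_HTML.strip())

# Select <a> elements that are direct children of <td class="titleColumn">
anchors = root.findall(".//td[@class='titleColumn']/a")
hrefs = [a.get("href") for a in anchors]

print(hrefs)
```

Only the two links inside `titleColumn` cells are returned; the link in the `posterColumn` cell is skipped because the class predicate does not match.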

The data in movie_urls_list is not yet in the form we need: the entries do not include the IMDb domain name, and each one carries a long query string.

We prepend the IMDb domain to each path we obtained. On inspection, we can see that even if we remove everything after the question mark, the link is still valid and goes to the same page. We add this shortened URL to the movie_urls list.

In [5]:
for i in movie_urls_list:
    long_url = "https://www.imdb.com" + i
    short_url = long_url.split("?")[0]
    movie_urls.append(short_url)
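
The same clean-up can also be done with the standard library's `urllib.parse` instead of string concatenation and `split`. This is a minimal sketch, not the notebook's method; `to_short_url` and `BASE_URL` are hypothetical names, and the sample paths are made up.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"  # assumed base, same domain as in the cell above


def to_short_url(href, base=BASE_URL):
    """Make a relative href absolute and drop its query string and fragment."""
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


print(to_short_url("/title/tt0000001/?ref_=example_1"))
print(to_short_url("/title/tt0000002/"))
```

Parsing the URL rather than splitting on `?` keeps the behavior predictable if a link ever arrives already absolute or carries a `#fragment`.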

Once the movie URLs are in the list, the next step is to go to each movie page and extract the data. Before doing that, however, we need to settle on the attributes and the structure we are going to use.

For educational purposes, we will store the data in JSON format. Before parsing the data, we need to set up a way to write it to the JSON file.

See the code below to understand how.

In [6]:
def time_delay():
    time.sleep(random.randint(2, 5))

In [7]:
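# Start with an empty JSON array so every later read finds valid JSON
with open("data_v1.json", "w") as f:
    json.dump([], f)


def write_to_json(new_data, filename='data_v1.json'):
    # Read the whole list, append one record, then rewrite the file from the start
    with open(filename, 'r+') as file:
        file_data = json.load(file)
        file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file, indent=4)
        file.truncate()  # drop leftover bytes if the new content is ever shorter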
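
To try the read-append-rewrite pattern without touching data_v1.json, here is a small self-contained sketch. It repeats the notebook's write_to_json (with a `truncate()` call added as a safeguard) and points it at a scratch file; `demo_path` and the sample records are made up for illustration.

```python
import json
import os
import tempfile


def write_to_json(new_data, filename):
    # Load the existing list, append one record, rewrite from the start
    with open(filename, "r+") as file:
        file_data = json.load(file)
        file_data.append(new_data)
        file.seek(0)
        json.dump(file_data, file, indent=4)
        file.truncate()


# Hypothetical scratch file in a fresh temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
with open(demo_path, "w") as f:
    json.dump([], f)  # the file must already hold a JSON array

write_to_json({"rank": 1, "movie_name": "Example A"}, demo_path)
write_to_json({"rank": 2, "movie_name": "Example B"}, demo_path)

with open(demo_path) as f:
    records = json.load(f)

print(len(records), records[-1]["movie_name"])
```

Because each call re-reads and rewrites the whole file, the cost grows with the number of records. That is acceptable for a few hundred movies, and it leaves a valid JSON file on disk after every record.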

The next stage is to extract the individual elements from each page and write them to the JSON file.

Note a couple of things:

1 - We use the unidecode library. Its unidecode() function takes Unicode data and tries to represent it in ASCII characters.

The best way to understand this is to leave it out and inspect the data: you'll see some strange letters in between the text. Using unidecode eliminates that problem.

2 - We use the write_to_json function to write the data into the JSON file.
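
On the first note: one likely source of the strange letters (an assumption, since the raw output is not shown here) is that the `json` module escapes every non-ASCII character by default. The sketch below uses a sample name to show the default behavior next to `ensure_ascii=False`, which is an alternative to unidecode that keeps the original spelling.

```python
import json

name = "Léa Seydoux"  # sample value containing a non-ASCII character

escaped = json.dumps({"actor": name})                       # default: ensure_ascii=True
readable = json.dumps({"actor": name}, ensure_ascii=False)  # keep characters as they are

print(escaped)
print(readable)

# Both strings decode back to the same data
print(json.loads(escaped) == json.loads(readable))
```

The escaped form is still valid JSON and loads back correctly; it only looks odd when the file is read as plain text. When writing with `ensure_ascii=False`, it is safer to open the file with `encoding="utf-8"`.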

In [8]:
for rank, movie_url in enumerate(movie_urls, start=1):
    response = requests.get(movie_url, headers=header)
    soup = BeautifulSoup(response.content, 'html.parser')
    dom = et.HTML(str(soup))

    movie_name = dom.xpath('//h1[@data-testid="hero-title-block__title"]/text()')[0]
    movie_year = dom.xpath('//a[@class="ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"]/text()')[0]
    genre = dom.xpath('//span[@class="ipc-chip__text"]/text()')
    director_name = dom.xpath('//a[@class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link"]/text()')[0]
    rating = dom.xpath('//span[@class="sc-7ab21ed2-1 jGRxWM"]/text()')[0]
    actors_list = dom.xpath('//a[@data-testid="title-cast-item__actor"]/text()')
    actors_list = [unidecode(i) for i in actors_list]

    write_to_json({'rank': rank,
                   'movie_name': movie_name,
                   'movie_url': movie_url,
                   'movie_year': movie_year,
                   'genre': genre,
                   'director_name': unidecode(director_name),
                   'rating': rating,
                   'actors': actors_list})

    time_delay()
    print("{} written to json file".format(rank))
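
Each `[0]` in the loop above raises an IndexError when its XPath matches nothing, which can happen if a page lacks a field or if generated class names such as `sc-7ab21ed2-1 jGRxWM` change. Here is a minimal sketch of a guard; `first_or_default` is a hypothetical helper, not part of the notebook, and the lists stand in for what `dom.xpath(...)` returns.

```python
def first_or_default(matches, default=None):
    """Return the first XPath match, or `default` when the list is empty."""
    return matches[0] if matches else default


# Stand-ins for XPath results: a list of strings, possibly empty
found = ["The Example Movie"]
missing = []

print(first_or_default(found))
print(first_or_default(missing))
print(first_or_default(missing, "N/A"))
```

Wrapping a lookup this way, for example the rating query, would record a placeholder for that movie instead of stopping the whole run partway through the list.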