# IMDb Data Processing
- Retrieve data from [IMDb top 250](https://www.imdb.com/chart/top)
- Create JSON Object file
- Create pandas dataframe for data analysis 

## Data retrieval
Import the required packages. <br />
If any of the dependencies below are not already installed, install them before running the script:
 - pip install [requests](https://pypi.org/project/requests/)
 - pip install [beautifulsoup4](https://pypi.org/project/beautifulsoup4/)
 - pip install [pandas](https://pypi.org/project/pandas/)
 - pip install [tqdm](https://pypi.org/project/tqdm/) 


In [1]:
import requests                 # Simpler HTTP requests 
from bs4 import BeautifulSoup   # Python package for pulling data out of HTML and XML files
import pandas as pd             # Python package for data manipulation and analysis
import re                       # regular expressions
import json                     # Python package used to work with JSON data
from tqdm import tqdm           # Python package for displaying a progress bar
from datetime import datetime   # Python class for working with dates and times (used for the timestamp)

Fetch the Top 250 page and parse its HTML with BeautifulSoup

In [2]:
url = 'https://www.imdb.com/chart/top'              # IMDb Top 250 list link
url_text = requests.get(url).text                    # Download the page and keep the response body as text
url_soup = BeautifulSoup(url_text, 'html.parser')   # Parse the HTML into a searchable tree

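The request above, like the 250 per-movie requests in the next cell, is made without a timeout or a retry, so a single transient network error stops the whole run. Below is a minimal sketch of a retry wrapper. The names `fetch_with_retry`, `retries` and `delay` are hypothetical and not part of the original notebook; the fetch function is passed in as an argument so the sketch does not depend on a particular HTTP client.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) and retry on any exception, up to `retries` attempts.

    `fetch` is any callable that takes a URL and returns the page text.
    `retries` is assumed to be at least 1.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # deliberately broad: this is only a sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error
```

In the notebook this could wrap the existing call, for example `fetch_with_retry(lambda u: requests.get(u, timeout=10).text, url)`, where the 10-second timeout is an arbitrary choice.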

Visit each movie page in the Top 250 list and extract its fields

In [3]:
template = 'https://www.imdb.com%s'

# Get the title links for all the pages
title_links = [template % a.attrs.get('href') for a in url_soup.select('td.titleColumn a')]

imdb_movie_list = []
# Collect the fields below for each movie and build a list of dictionaries
#   - ranking | movie_name | url | year | rating | vote_count | summary | production | director | writers
#   - genres | release_date | censor_rating | movie_length | country | language
#   - budget | gross_worldwide | gross_usa | opening_week_usa

for i in tqdm(range(0, len(title_links)), desc="Movies processed", ncols=100):
    page_url = title_links[i]
    page_text = requests.get(page_url).text
    page_soup = BeautifulSoup(page_text, 'html.parser')

    # ------------------------------------------------------------------------------------------
    # Getting movie name, year, rating and number of votes
    movie_name = (page_soup.find("div",{"class":"title_wrapper"}).get_text(strip=True).split('|')[0]).split('(')[0]
    year = ((page_soup.find("div",{"class":"title_wrapper"}).get_text(strip=True).split('|')[0]).split('(')[1]).split(')')[0]
    rating = page_soup.find("span",{"itemprop":"ratingValue"}).text
    vote_count = page_soup.find("span",{"itemprop":"ratingCount"}).text

    # ------------------------------------------------------------------------------------------
    # Getting censor rating, movie length, genre list and release date
    # from the subtext
    subtext= page_soup.find("div",{"class":"subtext"}).get_text(strip=True).split('|')
    
    if len(subtext) < 4:
        # Setting values when the movie is unrated
        censor_rating = "No rating"
        movie_length = subtext[0]
        genre_list = subtext[1].split(',')

        while len(genre_list) < 4: genre_list.append(' ')
        genre_1, genre_2, genre_3, genre_4 = genre_list
        
        release_date_and_country = subtext[2].split('(')
        release_date = release_date_and_country[0]
    else: 
        censor_rating = subtext[0]
        movie_length = subtext[1]
        genre_list = subtext[2].split(',')

        while len(genre_list) < 4: genre_list.append(' ')

        release_date_and_country = subtext[3].split('(')
        release_date = release_date_and_country[0]

    # ------------------------------------------------------------------------------------------
    # Getting the movie summary
    summary = page_soup.find("div",{"class":"summary_text"}).get_text(strip=True).strip()
    
    # ------------------------------------------------------------------------------------------
    # Getting the credits for the director and writers
    credit_summary = []
    for summary_item in page_soup.find_all("div",{"class":"credit_summary_item"}):
        credit_summary.append(re.split(r',|:|\|',summary_item.get_text(strip=True)))
    
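    # The code assumes the last three credit blocks are, in order, director, writers
    # and one trailing block that is not used: discard the last block, then take up
    # to two writers and the director name(s), skipping each block's label
    credit_summary.pop()
    writers = credit_summary.pop()[1:3]
    director = credit_summary.pop()[1:]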
    
    while len(writers) < 2: writers.append(" ")
    writer_1, writer_2 = writers
    writer_1 = writer_1.split('(')[0]
    writer_2 = writer_2.split('(')[0]

    # ------------------------------------------------------------------------------------------
    # Getting the box office details for language, budget, Opening Weekend USA, 
    # Gross income worldwide and USA, and production company
    box_office_details = []
    box_office_dictionary = {'Country':'','Language':'','Budget':'', 'Opening Weekend USA':'','Gross USA':'','Cumulative Worldwide Gross':'','Production Co':''}
    for details in page_soup.find_all("div",{"class":"txt-block"}):
        detail = details.get_text(strip=True).split(':')
        if detail[0] in box_office_dictionary:
            box_office_details.append(detail)
    
    for detail in box_office_details: 
        if detail[0] in box_office_dictionary: 
            box_office_dictionary.update({detail[0] : detail[1]}) 

    country = box_office_dictionary['Country'].split("|")
    while len(country) < 4: country.append(' ')

    language = box_office_dictionary['Language'].split("|")
    while len(language) < 5: language.append(' ')

    budget = box_office_dictionary['Budget'].split('(')[0]

    opening_week_usa = ','.join((box_office_dictionary['Opening Weekend USA'].split(' ')[0]).split(',')[:-1])

    gross_usa = box_office_dictionary['Gross USA']
    gross_worldwide = box_office_dictionary['Cumulative Worldwide Gross'].split(' ')[0]
    production = box_office_dictionary['Production Co'].split('See more')[0]

    movie_dict = { 'ranking': i+1, 'movie_name': movie_name, 'url': page_url, 'year': year,
        'rating': rating, 'vote_count': vote_count, 'summary': summary, 'production': production,
        'director': director, 'writers': [writer_1, writer_2], 'genres': genre_list, 'release_date': release_date,
        'censor_rating': censor_rating, 'movie_length': movie_length, 'country': country,
        'language': language, 'budget': budget, 'gross_worldwide': gross_worldwide,
        'gross_usa': gross_usa,'opening_week_usa': opening_week_usa }

    imdb_movie_list.append(movie_dict)

Movies processed: 100%|███████████████████████████████████████████| 250/250 [08:23<00:00,  2.02s/it]

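The subtext handling in the loop above branches on whether a censor rating is present. As a rough sketch, the same logic can be pulled into a small function that is easy to test without any network access. The function name `parse_subtext` and the sample strings are assumptions made for illustration; unlike the loop, this version also strips surrounding whitespace from each value.

```python
def parse_subtext(text, max_genres=4):
    """Split a '|'-separated subtext string into its fields.

    Expects either 'rating|length|genres|release date (country)' or,
    for unrated movies, 'length|genres|release date (country)'.
    """
    parts = [part.strip() for part in text.split('|')]
    if len(parts) < 4:
        # No censor rating: the fields shift one position to the left
        censor_rating = "No rating"
        movie_length, genres, release = parts[0], parts[1], parts[2]
    else:
        censor_rating, movie_length, genres, release = parts[:4]

    genre_list = [genre.strip() for genre in genres.split(',')]
    # Pad with blanks so every movie has the same number of genre slots
    genre_list += [' '] * (max_genres - len(genre_list))

    return {
        'censor_rating': censor_rating,
        'movie_length': movie_length,
        'genres': genre_list,
        'release_date': release.split('(')[0].strip(),
    }


# Sample input in the assumed format
rated = parse_subtext("A|2h 22min|Drama|14 October 1994 (USA)")
unrated = parse_subtext("1h 36min|Crime, Drama|10 April 1957 (USA)")
```

Keeping the two cases in one function means the release date is computed by a single expression, which avoids the two branches drifting apart.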

## JSON Object file creation
Wrap the movie list in an object with a timestamp, then print the first five records as a sample

In [4]:
timestamp =  datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f')
imdb_list = {
    "timestamp" : timestamp,
    "imdb_movies" : imdb_movie_list
}
for i in range(0, 5):
    print(imdb_movie_list[i],'\n')

{'ranking': 1, 'movie_name': 'The Shawshank Redemption', 'url': 'https://www.imdb.com/title/tt0111161/', 'year': '1994', 'rating': '9.3', 'vote_count': '2,325,797', 'summary': 'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.', 'production': 'Castle Rock Entertainment', 'director': ['Frank Darabont'], 'writers': ['Stephen King', 'Frank Darabont'], 'genres': ['Drama', ' ', ' ', ' '], 'release_date': '14 October 1994 ', 'censor_rating': 'A', 'movie_length': '2h 22min', 'country': ['USA', ' ', ' ', ' '], 'language': ['English', ' ', ' ', ' ', ' '], 'budget': '$25,000,000', 'gross_worldwide': '$28,817,291', 'gross_usa': '$28,699,976', 'opening_week_usa': '$727,327'} 

{'ranking': 2, 'movie_name': 'The Godfather', 'url': 'https://www.imdb.com/title/tt0068646/', 'year': '1972', 'rating': '9.2', 'vote_count': '1,606,921', 'summary': 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empi

Store the object in a JSON file (the `./src/data/` directory must already exist)

In [5]:
with open('./src/data/imdb_movies_data.json', 'w') as file:
    json.dump(imdb_list, file)
print("Successfully saved to JSON file")


Successfully saved to JSON file

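Writing the file fails if the target folder is missing, and it can be useful to confirm that the saved file reads back unchanged. The sketch below shows one way to do both. The helper names `save_movies` and `load_movies` are hypothetical, the record is made up, and the round trip runs in a temporary folder rather than `./src/data/`.

```python
import json
import os
import tempfile
from datetime import datetime


def save_movies(movies, path):
    """Write the movie list with a timestamp, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    payload = {
        "timestamp": datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%f'),
        "imdb_movies": movies,
    }
    with open(path, 'w', encoding='utf-8') as file:
        # ensure_ascii=False keeps non-English titles readable in the file
        json.dump(payload, file, ensure_ascii=False, indent=2)


def load_movies(path):
    """Read the saved object back from disk."""
    with open(path, encoding='utf-8') as file:
        return json.load(file)


# Round trip with one made-up record in a temporary folder
sample = [{"ranking": 1, "movie_name": "Example Movie", "genres": ["Drama", " ", " ", " "]}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data", "imdb_movies_data.json")
    save_movies(sample, path)
    loaded = load_movies(path)
```

After the round trip, `loaded["imdb_movies"]` holds the same records as `sample`, and `loaded["timestamp"]` records when the file was written.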

## Dataframe file creation
Initialize an empty dataframe with one column per field

In [6]:
dataframe_columns = [ 'ranking', 'movie_name', 'url', 'year', 'rating', 'vote_count', 'summary', 'production',
        'director', 'writer_1', 'writer_2', 'genre_1', 'genre_2', 'genre_3', 'genre_4','release_date', 'censor_rating', 
        'movie_length', 'country_1', 'country_2', 'country_3', 'country_4', 'language_1', 'language_2', 'language_3', 
        'language_4', 'language_5', 'budget', 'gross_worldwide', 'gross_usa','opening_week_usa']
dataframe = pd.DataFrame(columns=dataframe_columns)
dataframe

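|  | ranking | movie_name | url | year | rating | vote_count | summary | production | director | writer_1 | ... | country_4 | language_1 | language_2 | language_3 | language_4 | language_5 | budget | gross_worldwide | gross_usa | opening_week_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|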


Copy each movie record into the dataframe, one row per movie

In [7]:
for i in range(0, len(imdb_movie_list)):
    dataframe.at[i, 'ranking'] = imdb_movie_list[i]['ranking']
    dataframe.at[i, 'movie_name'] = imdb_movie_list[i]['movie_name']
    dataframe.at[i, 'url'] = imdb_movie_list[i]['url']
    dataframe.at[i, 'year'] = imdb_movie_list[i]['year']
    dataframe.at[i, 'rating'] = imdb_movie_list[i]['rating']
    dataframe.at[i, 'vote_count'] = imdb_movie_list[i]['vote_count']
    dataframe.at[i, 'summary'] = imdb_movie_list[i]['summary']
    dataframe.at[i, 'production'] = imdb_movie_list[i]['production']
    dataframe.at[i, 'director'] = imdb_movie_list[i]['director'][0]
    dataframe.at[i, 'writer_1'] = imdb_movie_list[i]['writers'][0]
    dataframe.at[i, 'writer_2'] = imdb_movie_list[i]['writers'][1]
    dataframe.at[i, 'genre_1'] = imdb_movie_list[i]['genres'][0]
    dataframe.at[i, 'genre_2'] = imdb_movie_list[i]['genres'][1]
    dataframe.at[i, 'genre_3'] = imdb_movie_list[i]['genres'][2]
    dataframe.at[i, 'genre_4'] = imdb_movie_list[i]['genres'][3]
    dataframe.at[i, 'release_date'] = imdb_movie_list[i]['release_date']
    dataframe.at[i, 'censor_rating'] = imdb_movie_list[i]['censor_rating']
    dataframe.at[i, 'movie_length'] = imdb_movie_list[i]['movie_length']
    dataframe.at[i, 'country_1'] = imdb_movie_list[i]['country'][0]
    dataframe.at[i, 'country_2'] = imdb_movie_list[i]['country'][1]
    dataframe.at[i, 'country_3'] = imdb_movie_list[i]['country'][2]
    dataframe.at[i, 'country_4'] = imdb_movie_list[i]['country'][3]
    dataframe.at[i, 'language_1'] = imdb_movie_list[i]['language'][0]
    dataframe.at[i, 'language_2'] = imdb_movie_list[i]['language'][1]
    dataframe.at[i, 'language_3'] = imdb_movie_list[i]['language'][2]
    dataframe.at[i, 'language_4'] = imdb_movie_list[i]['language'][3]
    dataframe.at[i, 'language_5'] = imdb_movie_list[i]['language'][4]
    dataframe.at[i, 'budget'] = imdb_movie_list[i]['budget']
    dataframe.at[i, 'gross_worldwide'] = imdb_movie_list[i]['gross_worldwide']
    dataframe.at[i, 'gross_usa'] = imdb_movie_list[i]['gross_usa']
    dataframe.at[i, 'opening_week_usa'] = imdb_movie_list[i]['opening_week_usa']

dataframe = dataframe.set_index(['ranking'], drop=False)
dataframe.head(10)

| ranking (index) | ranking | movie_name | url | year | rating | vote_count | summary | production | director | writer_1 | ... | country_4 | language_1 | language_2 | language_3 | language_4 | language_5 | budget | gross_worldwide | gross_usa | opening_week_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | The Shawshank Redemption | https://www.imdb.com/title/tt0111161/ | 1994 | 9.3 | 2325797 | Two imprisoned men bond over a number of years... | Castle Rock Entertainment | Frank Darabont | Stephen King | ... |  | English |  |  |  |  | $25,000,000 | $28,817,291 | $28,699,976 | $727,327 |
| 2 | 2 | The Godfather | https://www.imdb.com/title/tt0068646/ | 1972 | 9.2 | 1606921 | The aging patriarch of an organized crime dyna... | Paramount Pictures,Alfran Productions | Francis Ford Coppola | Mario Puzo | ... |  | English | Italian | Latin |  |  | $6,000,000 | $246,120,986 | $134,966,411 | $302,393 |
| 3 | 3 | The Godfather: Part II | https://www.imdb.com/title/tt0071562/ | 1974 | 9.0 | 1121888 | The early life and career of Vito Corleone in ... | Paramount Pictures,The Coppola Company,America... | Francis Ford Coppola | Francis Ford Coppola | ... |  | English | Italian | Spanish | Latin | Sicilian | $13,000,000 | $48,035,783 | $47,834,595 | $171,417 |
| 4 | 4 | The Dark Knight | https://www.imdb.com/title/tt0468569/ | 2008 | 9.0 | 2287611 | When the menace known as the Joker wreaks havo... | Warner Bros.,Legendary Entertainment,Syncopy | Christopher Nolan | Jonathan Nolan | ... |  | English | Mandarin |  |  |  | $185,000,000 | $1,005,973,645 | $534,858,444 | $158,411,483 |
| 5 | 5 | 12 Angry Men | https://www.imdb.com/title/tt0050083/ | 1957 | 9.0 | 684546 | A jury holdout attempts to prevent a miscarria... | Orion-Nova Productions | Sidney Lumet | Reginald Rose | ... |  | English |  |  |  |  | $350,000 | $576 |  |  |
| 6 | 6 | Schindler's List | https://www.imdb.com/title/tt0108052/ | 1993 | 8.9 | 1205876 | In German-occupied Poland during World War II,... | Universal Pictures,Amblin Entertainment | Steven Spielberg | Thomas Keneally | ... |  | English | Hebrew | German | Polish | Latin | $22,000,000 | $322,161,245 | $96,898,818 | $656,636 |
| 7 | 7 | The Lord of the Rings: The Return of the King | https://www.imdb.com/title/tt0167260/ | 2003 | 8.9 | 1632410 | Gandalf and Aragorn lead the World of Men agai... | New Line Cinema,WingNut Films,The Saul Zaentz ... | Peter Jackson | J.R.R. Tolkien | ... |  | English | Quenya | Old English | Sindarin |  | $94,000,000 | $1,142,456,987 | $377,845,905 | $72,629,713 |
| 8 | 8 | Pulp Fiction | https://www.imdb.com/title/tt0110912/ | 1994 | 8.9 | 1813993 | The lives of two mob hitmen, a boxer, a gangst... | Miramax,A Band Apart,Jersey Films | Quentin Tarantino | Quentin Tarantino | ... |  | English | Spanish | French |  |  | $8,000,000 | $213,928,762 | $107,928,762 | $9,311,882 |
| 9 | 9 | Il buono, il brutto, il cattivo | https://www.imdb.com/title/tt0060196/ | 1966 | 8.8 | 684295 | A bounty hunting scam joins two men in an unea... | Produzioni Europee Associate (PEA),Arturo Gonz... | Sergio Leone | Luciano Vincenzoni | ... |  | Italian |  |  |  |  | $1,200,000 | $25,252,927 | $25,100,000 |  |
| 10 | 10 | The Lord of the Rings: The Fellowship of the Ring | https://www.imdb.com/title/tt0120737/ | 2001 | 8.8 | 1649091 | A meek Hobbit from the Shire and eight compani... | New Line Cinema,WingNut Films,The Saul Zaentz ... | Peter Jackson | J.R.R. Tolkien | ... |  | English | Sindarin |  |  |  | $93,000,000 | $888,159,092 | $315,544,750 | $47,211,490 |

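Filling the dataframe one cell at a time works, but the same mapping from nested records to flat columns can be written more compactly. The sketch below flattens one record into a row dictionary using only plain Python. The function name `flatten_movie` is hypothetical, and the example record is a shortened, partly made-up version of the first sample record.

```python
def flatten_movie(movie):
    """Turn one nested movie record into a flat row keyed by column name."""
    list_fields = ('director', 'writers', 'genres', 'country', 'language')
    # Copy the scalar fields unchanged
    row = {key: value for key, value in movie.items() if key not in list_fields}
    # Only the first director is kept, as in the cell above
    row['director'] = movie['director'][0]
    # Spread each padded list over numbered columns: writer_1, writer_2, genre_1, ...
    for prefix, key, count in (('writer', 'writers', 2), ('genre', 'genres', 4),
                               ('country', 'country', 4), ('language', 'language', 5)):
        for position in range(count):
            row[f'{prefix}_{position + 1}'] = movie[key][position]
    return row


# Shortened example record (values are illustrative)
example = {
    'ranking': 1, 'movie_name': 'The Shawshank Redemption', 'year': '1994',
    'director': ['Frank Darabont'], 'writers': ['Stephen King', 'Frank Darabont'],
    'genres': ['Drama', ' ', ' ', ' '], 'country': ['USA', ' ', ' ', ' '],
    'language': ['English', ' ', ' ', ' ', ' '], 'budget': '$25,000,000',
}
row = flatten_movie(example)
```

With rows in this shape, the whole table could be built in one call such as `pd.DataFrame([flatten_movie(m) for m in imdb_movie_list], columns=dataframe_columns)`, since the `DataFrame` constructor accepts a list of dictionaries.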

Save dataframe into a comma-separated values file (CSV)

In [8]:
dataframe.to_csv('./src/data/imdb_movies_data.csv')
print("Successfully saved to CSV")

Successfully saved to CSV

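The saved columns are still strings such as `$25,000,000` or `2h 22min`, so numeric analysis needs a conversion step first. The sketch below shows one possible pair of converters. The function names `money_to_int` and `length_to_minutes` are hypothetical, and the money converter simply drops every non-digit character, so it ignores the currency symbol.

```python
import re


def money_to_int(text):
    """Convert a string like '$25,000,000' or '2,325,797' to an int.

    Returns None for empty or digit-free input. The currency is ignored.
    """
    digits = re.sub(r'[^0-9]', '', text)
    return int(digits) if digits else None


def length_to_minutes(text):
    """Convert a running time like '2h 22min' to total minutes.

    Returns None when the text does not match the expected pattern.
    """
    match = re.fullmatch(r'\s*(?:(\d+)h)?\s*(?:(\d+)min)?\s*', text)
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


budget = money_to_int('$25,000,000')
runtime = length_to_minutes('2h 22min')
```

Applied to the dataframe, this could look like `dataframe['budget'].map(money_to_int)`; missing values come back as `None` and would still need handling before any arithmetic.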