<h1 align='center' style="margin-bottom: 0px"> A Machine Learning Approach to Predict Movie Genres Based on the Movie Overview </h1>


**We will be collecting data from TMDB (themoviedb).**

<h3>TMDB: https://www.themoviedb.org/</h3>

TMDB, or The Movie Database, is a community-built alternative to IMDb with a free-to-use API for collecting movie information. You do need an API key, but you can request one at no cost after creating a free account.

## Setting up our Multi-Label Classification Problem Statement
There are several ways of building a recommendation engine. When it comes to movie genres, you can slice and dice the data based on multiple variables. But here’s a simple approach – build a model that can automatically predict genre tags! I can already imagine the possibilities of adding such an option to a recommender. A win-win for everyone.

Our task is to build a model that can predict the genre of a movie using just the plot details (available in text form).
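
To make the multi-label framing concrete, here is a minimal sketch of how genre tags could be turned into a prediction target: every movie gets one 0/1 entry per genre, and a single movie may carry several 1s. The helper name `to_multi_hot` and the sample genre tags are illustrative assumptions, not part of the notebook.

```python
def to_multi_hot(genre_lists):
    """Return (sorted genre names, one 0/1 row per movie)."""
    genres = sorted({name for tags in genre_lists for name in tags})
    rows = [[int(name in tags) for name in genres] for tags in genre_lists]
    return genres, rows


# Hypothetical genre tags for three movies (illustrative only).
sample_genre_tags = [
    ["Action", "Thriller"],
    ["Comedy"],
    ["Action", "Comedy", "Romance"],
]

genres, rows = to_multi_hot(sample_genre_tags)
print(genres)
for row in rows:
    print(row)
```

A model for this task would then predict one such row per overview, rather than a single class.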

Take a look at the snapshot below from IMDb and pick out the different things on display:

![movie_genre_snapshot](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/04/img_3.png)

## Follow these steps to obtain an API key
* Sign up for TMDB and get set up to retrieve movie metadata.
* Step 1. Head over to [tmdb.org](https://www.themoviedb.org/?language=en) and create a new account by signing up.
* Step 2. Click your account icon at the top right, then select "Settings" from the drop-down menu.
* Step 3. On the settings page, click the "API" option in the left pane.
* Step 4. Apply for a new developer key and fill out the form as required. The "Application Name" and "Application URL" fields are not important, so you can enter anything there.
* Step 5. A new API key should be generated for you, and you should also receive an email.
Now that you have the API key for TMDB, you can query using TMDB. Remember, it allows on
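
Once you have a key, one common way to keep it out of the notebook itself is to read it from an environment variable. The sketch below assumes the key is stored in a variable named `TMDB_API_KEY`; the helper `get_tmdb_api_key` is a hypothetical name introduced here for illustration.

```python
import os


def get_tmdb_api_key(env_var="TMDB_API_KEY"):
    """Return the API key stored in the given environment variable.

    Raises RuntimeError with a readable message if the variable is unset or empty.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your TMDB API key."
        )
    return key
```

With this approach you would set the variable in your shell before starting Jupyter and call `get_tmdb_api_key()` wherever the key is needed.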

In [1]:
# Installing the Required Libraries for the Notebook

import os

os.system("python -m pip install --upgrade pip -q")
os.system("pip install tmdbsimple pandas joblib -q")

os.system("mkdir -p /kfs_public/movies_data")



0

In [2]:
# Importing the Necessary Libraries

import os
import urllib.request
import requests
import json
import time
import numpy as np
import pandas as pd

import random
import pickle
from joblib import dump, load

import tmdbsimple as tmdb

# Here is a broad outline of technical steps to be done for data collection
* Sign up for TMDB (themoviedb.org), and set up the API to scrape movie posters for the movies above.
* Set up and work with TMDb to get movie information from their database
* Compare the entries of IMDb and TMDb for a movie
* Get a listing and information of a few movies
* Think about the potential challenges that may come our way, and about interesting questions we can answer with the APIs we have in our hands.
* Get data from TMDb

Let's go over each of these one by one.

In [3]:
# Setting the TMDB_API_KEY environment variable.
# Replace the placeholder with your own key; never commit a real key to a shared notebook.

os.environ["TMDB_API_KEY"] = "<your-api-key>"
api_key = os.environ["TMDB_API_KEY"]

# Syncing tmdb.API_KEY with Our Own api_key & Initiating a search 

tmdb.API_KEY = api_key
search = tmdb.Search()

**To collect the movie overviews, we need to make sure they cover every genre that is available.**

**So let's grab all the different genres from TMDB.**

In [4]:
# Extracting the Genre IDs and Genre Names that are available in TMDB Database.

Genre_dict = tmdb.Genres().movie_list()["genres"]

# Mapping Genre IDs with Genre Names and making them a Dictionary.

Genres_id_name_dict = dict(zip([genre["id"] for genre in Genre_dict], [genre["name"] for genre in Genre_dict]))

# The "Foreign" genre (ID 10769) is added manually so it is included alongside the genres returned above.
Genres_id_name_dict[10769] = "Foreign"
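
As a small illustration of what the ID-to-name dictionary is for, the sketch below translates a list of genre IDs into readable names. It uses a hand-written sample of the mapping instead of the live API response, and `genre_names` is a hypothetical helper, not a `tmdbsimple` function.

```python
# Small sample of the ID -> name mapping (illustrative; the notebook builds
# the full dictionary from the API response).
sample_genres_id_name = {28: "Action", 35: "Comedy", 18: "Drama", 10769: "Foreign"}


def genre_names(genre_ids, id_to_name):
    """Translate genre IDs into names, skipping IDs missing from the mapping."""
    return [id_to_name[g] for g in genre_ids if g in id_to_name]


names = genre_names([28, 18, 99999], sample_genres_id_name)
print(names)  # ['Action', 'Drama']
```

Skipping unknown IDs is a design choice for this sketch; raising an error instead would surface gaps in the mapping earlier.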

**Now that we have all the available genres, let's use them to grab the movie overviews.**

**We need to make sure that every movie has an overview. If a movie doesn't have one, we won't take it into consideration.**
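
A minimal sketch of that filtering rule, assuming each movie is a dictionary with an optional `overview` field (the field name and the sample records are assumptions made for illustration):

```python
def has_overview(movie):
    """Return True if the movie dict carries a non-empty overview string."""
    overview = movie.get("overview")
    return isinstance(overview, str) and overview.strip() != ""


# Hypothetical records (illustrative only).
sample_movies = [
    {"title": "Movie A", "overview": "A heist goes wrong."},
    {"title": "Movie B", "overview": ""},
    {"title": "Movie C", "overview": None},
    {"title": "Movie D"},
]

movies_with_overview = [m for m in sample_movies if has_overview(m)]
print([m["title"] for m in movies_with_overview])  # ['Movie A']
```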

In [5]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2013 and 2014.
# Note: `movies` is created once before the loop, so 2014.csv also contains the 2013 rows.

movies = []

for baseyear in [2013, 2014]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2013 each genre. This will take a while, please wait...
		Lone Survivor
		Empire State
		Hunter x Hunter: The Last Mission
		Patema Inverted
		127 Hours
		Mad Max 2
		Khumba
		My Little Pony: Equestria Girls
		Fate/stay night: Unlimited Blade Works
		Police Academy
		Gulliver's Travels
		Mr. Bean's Holiday
		Beverly Hills Cop III
		Force of Execution
		Closed Circuit
		The Wolverine: Path of a Ronin
		Kids for Cash
		Where the Trail Ends
		Outbreak
		Evil Woman
		Patema Inverted
		Scooby-Doo! Mask of the Blue Falcon
		Free Birds
		An Extremely Goofy Movie
		Persona 3 the Movie: #1 Spring of Birth
		Horns
		Pokémon the Movie: Kyurem vs. the Sword of Justice
		Goltzius & the Pelican Company
		Good Ol' Freda
		All the President's Men Revisited
		Rise of the Zombies
		Wolf Creek 2
		Rape Zombie: Lust of the Dead 3
		Garth Brooks: Live from Las Vegas
		Rihanna 777 Documentary... 7Countries7Days7Shows
		Springsteen & I
		Tom at the Farm
		Grabbers
		Haun
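
The same request pattern is repeated for several pairs of years, so it can help to wrap it in a function. The sketch below is one possible refactoring, not the notebook's own code: `build_discover_url` and `fetch_movies` are hypothetical helper names, while the endpoint and query parameters are copied from the cell above.

```python
import json
import urllib.parse
import urllib.request

DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(api_key, year, genre_id, page):
    """Build the discover URL with the same query parameters as the cell above."""
    params = {
        "api_key": api_key,
        "language": "en-US",
        "sort_by": "popularity.desc",
        "year": year,
        "with_genres": genre_id,
        "page": page,
    }
    return DISCOVER_URL + "?" + urllib.parse.urlencode(params)


def fetch_movies(api_key, year, genre_ids, pages=range(1, 6)):
    """Collect the "results" entries for every genre and page of one year."""
    movies = []
    for genre_id in genre_ids:
        for page in pages:
            url = build_discover_url(api_key, year, genre_id, page)
            with urllib.request.urlopen(url) as response:
                movies.extend(json.load(response)["results"])
    return movies
```

Calling `fetch_movies(api_key, 2013, genre_ids)` once per year would return a fresh list each time, so each year's CSV would hold only that year's rows.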

In [6]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2015 and 2016.

movies = []

for baseyear in [2015, 2016]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2015 each genre. This will take a while, please wait...
		Into the Storm
		Star Wars: Episode II - Attack of the Clones
		Blackhat
		From Russia with Love
		The Little Prince
		Pan
		The Nut Job
		Jack and the Cuckoo-Clock Heart
		Harmony
		The Peanuts Movie
		Me and Earl and the Dying Girl
		Identity Thief
		Hot Pursuit
		Mr. Right
		Police Story 3: Super Cop
		Unity
		Backstreet Boys: Show 'Em What You're Made Of
		Human
		Inherent Vice
		Tamayura: Graduation Photo Part 1 - Kizashi
		Apocalypse Now
		Sapphire Blue
		Pokémon Ranger and the Temple of the Sea
		Dino Time
		Forbidden Empire
		Fairy Tail: Phoenix Priestess
		The Invisible Boy
		Field of Lost Shoes
		The Scandalous Lady W
		Jack Strong
		Grace
		Exists
		The Gallows
		Miley Cyrus: Bangerz Tour
		Nils Frahm - Live at Montreux Jazz Festival 2015
		Danny Says
		Strangerland
		Elephant Song
		Sherlock Holmes and the Secret Weapon
		The Memory Book
		Romeo & Juliet
		Kill Your Darlings
		10

In [7]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2017 and 2018.

movies = []

for baseyear in [2017, 2018]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2017 each genre. This will take a while, please wait...
		Men in Black
		Death Race
		Spider-Man 2
		A Monster in Paris
		Olaf's Frozen Adventure
		Animal Crackers
		Sahara
		Wonder Woman
		Justice League Dark
		Skiptrace
		One Piece "3D2Y": Overcome Ace's Death! Luffy's Vow to his Friends
		Tangled: Before Ever After
		1922
		Fabricated City
		Ghost in the Shell Arise - Border 1: Ghost Pain
		A Plastic Ocean
		Night Will Fall
		Saving Capitalism
		Last Days in the Desert
		Dunkirk
		The Girl on the Train
		The Princess Bride
		LEGO Jurassic World: The Indominus Escape
		Ratchet & Clank
		Gremlins
		One Piece "3D2Y": Overcome Ace's Death! Luffy's Vow to his Friends
		Tangled: Before Ever After
		Egon Schiele: Death and the Maiden
		Chappaquiddick
		The Innocents
		Escape Room
		Cyborg X
		Anna and the Apocalypse
		Beside Bowie: The Mick Ronson Story
		Taraji's White Hot Holiday Special
		A StoryBots Christmas
		War on Everyone
		Invasion of the Bod
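
The output above prints a few titles twice (for example `Tangled: Before Ever After`), presumably because a movie tagged with several genres comes back once per genre query. A minimal sketch of removing such repeats, assuming every result dictionary carries a unique `id` field (the sample records and the helper name `dedupe_by_id` are illustrative):

```python
def dedupe_by_id(movies):
    """Keep the first occurrence of each movie, identified by its "id" field."""
    seen = set()
    unique = []
    for movie in movies:
        if movie["id"] not in seen:
            seen.add(movie["id"])
            unique.append(movie)
    return unique


# Hypothetical records (the IDs are made up for illustration).
sample_results = [
    {"id": 1, "title": "Movie A"},
    {"id": 2, "title": "Movie B"},
    {"id": 1, "title": "Movie A"},
]

unique_movies = dedupe_by_id(sample_results)
print([m["title"] for m in unique_movies])  # ['Movie A', 'Movie B']
```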

In [8]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2019 and 2020.

movies = []

for baseyear in [2019, 2020]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2019 each genre. This will take a while, please wait...
		Midway
		PAW Patrol: Mighty Pups
		Batman Returns
		Checkered Ninja
		Dumbo
		Super Me
		Chao in Space
		Antz
		Missing Link
		Over the Hedge
		Monster Trucks
		Murder Mystery
		The Fanatic
		Terminal
		Under the Silver Lake
		Antoine Griezmann: The Making of a Legend
		Listen to Me Marlon
		Super Size Me 2: Holy Chicken!
		The Ring Two
		The Karate Kid
		Pinocchio
		Barnyard
		The Rescuers
		Christopher Robin
		Peter Pan
		Adventures of Aladdin
		Tokyo Ghoul 'S'
		Backstabbing for Beginners
		Redbad
		A Twelve-Year Night
		Resident Evil: Degeneration
		Rampant
		Polaroid
		I'll Find You
		Bohemian Rhapsody: Recreating Live Aid
		Gorillaz: Reject False Icons
		Regression
		The Lazarus Effect
		Three Colors: Red
		A Royal Winter
		Kaguya-sama: Love Is War
		Destination Wedding
		Synchronic
		Elizabeth Harvest
		Justice League vs. the Fatal Five
		A Summer to Remember
		Christmas at Graceland
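
The loops above send many requests in a row with the `time.sleep(1)` pause commented out, so a single failed request stops the whole cell. The sketch below shows one way to pause and retry. `call_with_retry` and its parameters are hypothetical names introduced here, not part of `urllib` or `tmdbsimple`.

```python
import time


def call_with_retry(func, retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and try again, up to `retries` calls in total."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
            if attempt < retries:
                sleep(delay_seconds * attempt)  # wait a little longer each time
    raise last_error
```

In the download loop, the request could then be wrapped as `call_with_retry(lambda: urllib.request.urlopen(url).read())`. The `sleep` argument exists only so the waiting can be swapped out when testing.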


In [9]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2011 and 2012.

movies = []

for baseyear in [2011, 2012]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2011 each genre. This will take a while, please wait...
		Dune
		Star Wars: Episode III - Revenge of the Sith
		The Last Samurai
		A Monster in Paris
		The Bear
		Free Willy
		Planet 51
		Pokémon: Arceus and the Jewel of Life
		Heavy Metal
		13 Going on 30
		Dragon Ball: Curse of the Blood Rubies
		Horrible Bosses
		Taking Lives
		Monster
		Raising Arizona
		The Extraordinary Voyage
		The Black Power Mixtape 1967-1975
		Britain's Greatest Codebreaker
		21
		The Rite
		The Other Boleyn Girl
		Arthur and the Revenge of Maltazard
		Sharpay's Fabulous Adventure
		Babe
		Babe
		The Forbidden Kingdom
		Arthur 3: The War of the Two Worlds
		Ausmerzen
		First Light
		Red Eagle: The Movie
		Bedevilled
		Brotherhood of the Wolf
		Wind Chill
		Lord, All Men Can't Be Dogs
		Inni
		The Dead Inside
		Giallo
		Murder!
		Sound of My Voice
		Music and Lyrics
		Men in Hope
		Take Me Home Tonight
		The Man from Earth
		Transformers
		Pokémon: The Rise of Darkrai
		A 

In [10]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2009 and 2010.

movies = []

for baseyear in [2009, 2010]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2009 each genre. This will take a while, please wait...
		Underworld: Rise of the Lycans
		Planet Terror
		Hellboy II: The Golden Army
		The Flyboys
		The Forbidden Kingdom
		Jumper
		Death Note Relight 2: L's Successors
		Sword of the Stranger
		Shark Bait
		Planet 51
		My Best Friend's Girl
		Shamelessly She-Hulk
		OSS 117: Cairo, Nest of Spies
		Midnight Express
		Ace Ventura Jr: Pet Detective
		Earth 2100
		Michael Jackson Memorial
		I Am... Yours: An Intimate Performance at Wynn Las Vegas
		Contact
		The Hurt Locker
		Memoirs of a Geisha
		Dr. Dolittle 3
		Labyrinth
		Mutant Pumpkins from Outer Space
		Final Fantasy VII: Advent Children
		Mr. Nobody
		The Scorpion King 2: Rise of a Warrior
		All the Mornings of the World
		My Winnipeg
		Arn: The Kingdom at Road's End
		The Echo
		Uncharted
		Zombie Flesh Eaters
		John Mayer Trio - Live at Bowery Ballroom, New York
		The Man Who Would Be Polka King
		Blue Man Group: How to Be a Megastar Live!
	

In [11]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2007 and 2008.

movies = []

for baseyear in [2007, 2008]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2007 each genre. This will take a while, please wait...
		Children of Men
		On Her Majesty's Secret Service
		Beowulf
		Pokémon: Lucario and the Mystery of Mew
		Transformers
		Flushed Away
		Garfield Gets Real
		Paprika
		Futurama: Bender's Big Score
		RV
		One Piece: The Desert Princess and the Pirates: Adventure in Alabasta
		Some Like It Hot
		The Proposition
		On the Waterfront
		Wheels on Meals
		Lake of Fire
		The Enemies of Reason
		Manufacturing Dissent
		Twitches
		Moulin Rouge!
		The Karate Kid Part II
		Underdog
		Hoodwinked!
		Agent Cody Banks
		The Rocky Horror Picture Show
		Day Watch
		It's a Boy Girl Thing
		The Ten Commandments
		Grbavica: The Land of My Dreams
		Z Channel: A Magnificent Obsession
		Severance
		The Entity
		Funny Games
		Reba McEntire and Kelly Clarkson: CMT Crossroads
		Ripple Effect
		Eric Clapton: Behind The Sun Tour
		The Good German
		Southland Tales
		The Flock
		Look Who's Talking
		Sex and Death 101
		Drag
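
Each cell writes one CSV file per year into `/kfs_public/movies_data/`, so at some point the files need to be read back into a single table. The sketch below is one way to do that with pandas. The helper `load_all_years` and the extra `source_year` column are assumptions made for illustration.

```python
import glob
import os

import pandas as pd


def load_all_years(data_dir):
    """Read every <year>.csv in data_dir into one DataFrame.

    A `source_year` column records which file each row came from.
    """
    frames = []
    for path in sorted(glob.glob(os.path.join(data_dir, "*.csv"))):
        # The files were written with the default index column, so read it back as the index.
        frame = pd.read_csv(path, index_col=0)
        frame["source_year"] = os.path.splitext(os.path.basename(path))[0]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Because `movies` is created once before each `for baseyear` loop, the second file written by a cell also contains the first year's rows, so the combined table will likely hold repeated rows that need to be dropped afterwards.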

In [12]:
# Let's separate out the IDs only.

genre_ids = list(Genres_id_name_dict.keys())

# Using the IDs above, we'll pull data from themoviedb.org for the years 2005 and 2006.

movies = []

for baseyear in [2005, 2006]:
    print(f'Starting pulling movies from TMDB for {baseyear} each genre. This will take a while, please wait...')
    done_ids=set()

    for g_id in genre_ids:
        for page in range(1, 6):  # result pages 1 through 5
            # time.sleep(1)  # uncomment to pause between requests

            url = 'https://api.themoviedb.org/3/discover/movie?api_key=' + api_key
            url += '&language=en-US&sort_by=popularity.desc&year=' + str(baseyear) 
            url += '&with_genres=' + str(g_id) + '&page=' + str(page)

            data = urllib.request.urlopen(url).read()

            dataDict = json.loads(data)
            movies.extend(dataDict["results"])
        last_movies = list(map(lambda x: x['title'],movies[-3:]))
        for title in last_movies:
            print('\t\t'+title)
        done_ids.add(str(g_id))
    print("\tPulled movies for genres - "+','.join(done_ids))
    print('\n')
    movie_data_df = pd.DataFrame(movies)

    movie_data_df.to_csv(f"/kfs_public/movies_data/{baseyear}.csv")

Starting pulling movies from TMDB for 2005 each genre. This will take a while, please wait...
		Mighty Morphin Power Rangers: The Movie
		Enemy of the State
		Star Trek: Nemesis
		Gorgeous
		Fear and Loathing in Las Vegas
		Hidalgo
		A Grand Day Out
		Slayers Return
		Air: The Motion Picture
		Kung Pow: Enter the Fist
		Problem Child 3: Junior in Love
		Ladyhawke
		Batman
		Black Cat, White Cat
		2 Fast 2 Furious
		The Comedians of Comedy
		The Mindscape of Alan Moore
		After Innocence
		9 Songs
		The Phantom of the Opera
		Closer
		Kim Possible: So the Drama
		The Land Before Time VII: The Stone of Cold Fire
		Save the Last Dance
		Dracula
		Dragonfly
		Wolf
		Joseph Smith: Prophet of the Restoration
		The Weeping Meadow
		The Travelling Players
		Phenomena
		Rings
		Ringu 0
		Classic Albums: Nirvana - Nevermind
		Rock School
		Crossing the Bridge: The Sound of Istanbul
		The Three Burials of Melquiades Estrada
		City by the Sea
		Gerry
		The Sound of Music
		The Russia House
		Anthon