# MOVIE DATA COLLECTION

`MOVIES with VOTES >= 700 to maintain credability`

In [1]:
import os

# Create 'datasets' folder if not exist
os.makedirs('datasets/imdb', exist_ok=True)

# List content of 'datasets' folder
os.listdir('datasets/imdb')

['title.ratings.tsv', 'title.ratings.tsv.gz']

`Download title.ratings.tsv if not exist; otherwise skip`

In [None]:
import urllib.request
import gzip

file_path = 'datasets/imdb/title.ratings.tsv'

# To download and unzip the rating file
if not os.path.isfile(file_path):
    url = 'https://datasets.imdbws.com/title.ratings.tsv.gz'
    urllib.request.urlretrieve(url, f'{file_path}.gz')
    
    with gzip.open(f'{file_path}.gz', 'rb') as compressed_file, open(file_path, 'wb') as decompressed_file:
        decompressed_file.write(compressed_file.read())

os.listdir('datasets/imdb')

## `IMDb non-commercial datasets for personal use`
These datasets are available in gzipped tab-separated-values (TSV) format and can be accessed from [IMDB datasets](https://datasets.imdbws.com/)

1. **title.akas.tsv.gz**
2. **title.basics.tsv.gz**
3. **title.crew.tsv.gz**
4. **title.episode.tsv.gz**
5. **title.principals.tsv.gz**
6. **title.ratings.tsv.gz**
7. **name.basics.tsv.gz**

You can access and download these datasets from the provided link and use them for personal, non-commercial purposes in accordance with IMDb's terms and conditions. Make sure to verify your compliance with their non-commercial licensing and copyright/license terms.

# `Disclaimer of Warranties and Limitation of Liability`
THE IMDB SERVICES AND ALL INFORMATION, CONTENT, MATERIALS, PRODUCTS (INCLUDING SOFTWARE) AND OTHER SERVICES INCLUDED ON OR OTHERWISE MADE AVAILABLE TO YOU THROUGH THE IMDB SERVICES ARE PROVIDED BY IMDB ON AN "AS IS" AND "AS AVAILABLE" BASIS. IMDB MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, AS TO THE OPERATION OF THE IMDB SERVICES OR THE INFORMATION, CONTENT, MATERIALS, PRODUCTS (INCLUDING SOFTWARE) OR OTHER SERVICES INCLUDED ON OR OTHERWISE MADE AVAILABLE TO YOU THROUGH THE IMDB SERVICES. YOU EXPRESSLY AGREE THAT YOUR USE OF THE IMDB SERVICES IS AT YOUR SOLE RISK. IMDB RESERVES THE RIGHT TO WITHDRAW ANY IMDB SERVICE OR DELETE ANY INFORMATION FROM THE IMDB SERVICES AT ANY TIME IN ITS DISCRETION.

TO THE FULL EXTENT PERMISSIBLE BY APPLICABLE LAW, IMDB DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IMDB DOES NOT WARRANT THAT THE IMDB SERVICES, INFORMATION, CONTENT, MATERIALS, PRODUCTS (INCLUDING SOFTWARE) OR OTHER SERVICES INCLUDED ON OR OTHERWISE MADE AVAILABLE TO YOU THROUGH THE IMDB SERVICES, ITS SERVERS, OR ELECTRONIC COMMUNICATIONS SENT FROM IMDB ARE FREE OF VIRUSES OR OTHER HARMFUL COMPONENTS. IMDB WILL NOT BE LIABLE FOR ANY DAMAGES OF ANY KIND ARISING FROM THE USE OF ANY IMDB SERVICE, OR FROM ANY INFORMATION, CONTENT, MATERIALS, PRODUCTS (INCLUDING SOFTWARE) OR OTHER SERVICES INCLUDED ON OR OTHERWISE MADE AVAILABLE TO YOU THROUGH ANY IMDB SERVICE, INCLUDING, BUT NOT LIMITED TO DIRECT, INDIRECT, INCIDENTAL, PUNITIVE, AND CONSEQUENTIAL DAMAGES.

CERTAIN STATE LAWS DO NOT ALLOW LIMITATIONS ON IMPLIED WARRANTIES OR THE EXCLUSION OR LIMITATION OF CERTAIN DAMAGES. IF THESE LAWS APPLY TO YOU, SOME OR ALL OF THE ABOVE DISCLAIMERS, EXCLUSIONS, OR LIMITATIONS MAY NOT APPLY TO YOU, AND YOU MIGHT HAVE ADDITIONAL RIGHTS.

IMDb Software Terms In addition to these Conditions of Use, the terms found here apply to any software (including any updates or upgrades to the software and any related documentation) that we make available to you from time to time for your use in connection with IMDb Services (“IMDb Software”). If we provide specific Terms for the IMDb Software and there is a conflict between the specific Terms for the IMDb Software and these Conditions of Use, the specific Terms for the IMDb Software will govern.

In [None]:
import pandas as pd

df = pd.read_csv('datasets/imdb/title.ratings.tsv', sep='\t')
df.head()

In [None]:
votes = df[df['numVotes'] >= 700]
votes.shape

In [None]:
moviesID = votes['tconst'].tolist()

## [Cinemagoer](https://cinemagoer.github.io/)
`Cinemagoer` (previously known as IMDbPY) is a Python package for retrieving and managing data from the [IMDb](https://www.imdb.com/)
 movie database, including information about movies, people, and companies. 
 Please note that this project and its authors are not affiliated in any way with Internet Movie Database Inc. for details about data licenses, please refer to the [DISCLAIMER](https://raw.githubusercontent.com/cinemagoer/cinemagoer/master/DISCLAIMER.txt) and [DOCUMENTATION](https://readthedocs.org/projects/imdbpy/downloads/pdf/latest/)


In [None]:
try:
    !pip show Cinemagoer
except ImportError:
    !pip install git+https://github.com/cinemagoer/cinemagoer

In [None]:
from imdb import Cinemagoer
imdbClient = Cinemagoer()

import os
os.makedirs('datasets/logs', exist_ok=True)

import logging
logging.basicConfig(filename='datasets/logs/movieScore.log', level=logging.ERROR)

In [None]:
pid = set()

try:
    pid.update(pd.read_csv('datasets/movieInfo.csv')['imdbID'])
except FileNotFoundError:
    header = pd.DataFrame(columns=['imdbID', 'Title', 'Genres', 'Plot', 'Directors', 'Writers', 'Actors', 'Language', 'Country', 'Kind', 'Runtime'])
    header.to_csv('datasets/movieInfo.csv', index=False)

`imdbID, Title, Genres, Plot, Directors, Writers, Actors, Language, Country, Kind, Runtime`

In [None]:
# Iterate through the list of IMDb IDs
def fetchData(imdbID):
    try:
        movie = imdbClient.get_movie(imdbID[2:])
        if movie:
            # Create a dictionary with the movie data
            movieData = pd.DataFrame([{
                'imdbID': imdbID,
                'Title': movie.get('title', 'N/A'),
                'Genres': ', '.join(movie.get('genres', [])),
                'Plot': ', '.join(movie.get('plot', [])),
                'Directors': ', '.join([director.get('name', '') for director in movie.get('directors', [])]),
                'Actors': ', '.join([actor.get('name', '') for actor in movie.get('cast', [])]),
                'Writers': ', '.join([writer.get('name', '') for writer in movie.get('writer', [])]),
                'Language': ', '.join(language for language in movie.get('language') if language and language.lower() != 'none'),
                'Country': ', '.join(movie.get('country', [])),
                'Kind': movie.get('kind', 'N/A'),
                'Runtime': movie.get('runtime', 'N/A')[0]
            }])

            # Append the movie data to the CSV file
            movieData.to_csv('datasets/movieInfo.csv', mode='a', header=False, index=False)

    except Exception as e:
        logging.error(f'IMDB ID: {imdbID} Error: {e}')

In [None]:
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

# A ThreadPoolExecutor to fetch data concurrently
with ThreadPoolExecutor(max_workers=8) as executor:
    queue = Queue()

    for imdbID in moviesID:
        if imdbID not in pid:
            queue.put(imdbID)
        while not queue.empty():
            imdbID = queue.get()
            executor.submit(fetchData, imdbID)

print('Data stored successfully !!!')

In [66]:
import pandas as pd

In [67]:
movieData = pd.read_csv('datasets/movieScore.csv')
movieData.head()

Unnamed: 0,imdbID,Title,Genres,Plot,Directors,Writers,Actors,Language,Country,Kind,Runtime
0,tt2178784,The Rains of Castamere,"Action, Adventure, Drama, Fantasy",Robb and Catelyn arrive at the Twins for the w...,David Nutter,"Emilia Clarke, Kit Harington, Richard Madden, ...","George R.R. Martin, , David Benioff, D.B. Weis...",English,United States,episode,51
1,tt2301451,Ozymandias,"Crime, Drama, Thriller",Walt goes on the run. Jesse is taken hostage. ...,Rian Johnson,"Bryan Cranston, Anna Gunn, Aaron Paul, Dean No...","Vince Gilligan, , Moira Walley-Beckett",English,United States,episode,47
2,tt12187040,Plan and Execution,"Crime, Drama",Jimmy and Kim deal with a last-minute snag in ...,Thomas Schnauz,"Bob Odenkirk, Jonathan Banks, Rhea Seehorn, Pa...","Vince Gilligan, Peter Gould, , Thomas Schnauz","English, Spanish",United States,episode,50
3,tt2301455,Felina,"Crime, Drama, Thriller",Walter White returns to Albuquerque one last t...,Vince Gilligan,"Bryan Cranston, Anna Gunn, Aaron Paul, Dean No...","Vince Gilligan, , Vince Gilligan",English,United States,episode,55
4,tt21151974,Connor's Wedding,"Comedy, Drama",While Logan doles out an unsavory task ahead o...,Mark Mylod,"Nicholas Braun, Brian Cox, Kieran Culkin, Dagm...","Jesse Armstrong, , Jesse Armstrong, , Miriam B...",English,United States,episode,62


In [68]:
movieData.shape

(5242, 11)

In [69]:
df = movieData[movieData['Kind']=='movie']

In [70]:
len(df)

176