# Project: Scraping IMDb Movies data

## About IMDb

Source: https://en.wikipedia.org/wiki/IMDb

The *Internet Movie Database* (IMDb) is an online database of information related to films, television series, podcasts, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.

As of March 2022, the database contained some 10.1 million titles and 11.5 million person records.

**User ratings of films**  

Alongside its data, IMDb offers a rating scale that lets users rate films from one to ten.

**Rankings**  

The IMDb Top 250 is a list of the 250 top-rated films, based on ratings by the registered users of the website. The "Top 250" rating is based only on the ratings of "regular voters".  
The number of votes a registered user would have to make to be considered as a user who votes regularly has been kept secret.  
IMDb has stated that to maintain the effectiveness of the Top 250 list they "deliberately do not disclose the criteria used for a person to be counted as a regular voter".  


The Top 250 list comprises a wide range of feature films, including major releases, cult films, independent films, critically acclaimed films, silent films, and non-English-language films.  
Documentaries, short films and TV episodes are not currently included.

**Data format and access**  

IMDb does not provide an API for automated queries. However, most of the data can be downloaded as compressed plain-text files, and the information can be extracted using the command-line interface tools provided. A Python package called IMDbPY can process the compressed plain-text files into a number of different SQL databases, enabling easier access to the entire dataset for searching or data mining.

### Objective
#### Extract information about the top 100 movies by IMDb rating

*Here I extract the following features:*  
Poster,   
Title,  
Release_Year,  
Category,  
Runtime,  
Genre,  
IMDb_Rating,    
Director,   
Stars,  
IMDb_votes,  
Revenue.

### Deliverable

A user-friendly CSV file for further exploratory data analysis.

### Importing Libraries

In [1]:
import requests
import bs4
import sys,os
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')



### Defining the URL, downloading the page with requests, and checking the status

In [2]:
url="https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating"
res=requests.get(url)
print(res.status_code)

200
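
The status check above could be made a little more defensive. The following is a minimal sketch, not the notebook's method: `fetch_html` is a hypothetical helper, and the `User-Agent` value and 10-second timeout are illustrative assumptions. It raises an exception on a 4xx/5xx response instead of relying on a printed status code.

```python
import requests

# Assumed header value for illustration; some sites respond differently
# to the default requests User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; imdb-scrape-demo)"}

def fetch_html(url, session=None, timeout=10):
    """Download a page and return its raw body, raising on HTTP errors."""
    http = session or requests.Session()
    response = http.get(url, headers=HEADERS, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.content
```

The returned bytes can be passed straight to `bs4.BeautifulSoup(..., 'html.parser')`, in the same way as `res.content`.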


### Use Beautiful Soup to parse and extract information

In [3]:
soup=bs4.BeautifulSoup(res.content,'html.parser')

### Selecting top 100 movies from parsed html document

In [4]:
movie_data=soup.find_all(name='div',attrs={'class':'lister-item mode-advanced'})
#movie_data

In [5]:
len(movie_data)

100

### Creating empty lists for storing data of each feature separately

In [6]:
movie_poster_links=[]
movie_names=[]
movie_release_years=[]
movie_categories=[]
movie_runtimes=[]
movie_genres=[]
imdb_ratings=[]
movie_directors=[]
movie_stars=[]
imdb_votes=[]
movie_gross_revenue=[]

### Extracting poster link for each movie

In [7]:
for movie in movie_data:
    img=movie.find('img')
    # 'loadlate' carries the poster URL on this page; use the placeholder if the tag or attribute is missing
    movie_poster_links.append(img.get("loadlate","*****") if img is not None else "*****")

In [8]:
len(movie_poster_links)

100

### Extracting title of each movie

In [9]:
for movie in movie_data:
    name=movie.h3.a.text
    movie_names.append(name)

In [10]:
len(movie_names)

100

### Extracting release year of each movie

In [11]:
for movie in movie_data:
    year=movie.h3.find('span',class_='lister-item-year text-muted unbold').text.replace("(","").replace(")","")
    movie_release_years.append(year)

In [12]:
len(movie_release_years)

100
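
Stripping parentheses leaves the year as a string, and it would leave extra characters behind if a year label carried anything besides the four digits (for example a disambiguation numeral such as `(I) (2019)`, which is an assumption about the page, not something shown in this notebook). A small sketch of a more tolerant parser, using a hypothetical `parse_year` helper:

```python
import re

def parse_year(raw):
    """Return the first four-digit year found in the text as an int, or None."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None

print(parse_year("(1994)"))
print(parse_year("(I) (2019)"))
print(parse_year(""))
```

Returning `None` for unparseable input keeps the list the same length as the other feature lists.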

### Extracting runtime of each movie

In [13]:
for movie in movie_data:
    time=movie.p.find('span',class_='runtime').text.replace(' min',"")
    movie_runtimes.append(time)

In [14]:
len(movie_runtimes)

100
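
The runtime loop calls `.text` directly on the result of `find`, which raises `AttributeError` if a card has no runtime span. A hedged sketch of a reusable guard follows; `text_or_default` is a hypothetical helper, and the HTML fragment is made up to mimic a card with a genre but no runtime.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, name, css_class, default="*****"):
    """Return the stripped text of the first matching child tag, or a default."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else default

# Made-up fragment shaped like one result card, deliberately missing a runtime.
SAMPLE = """
<div class="lister-item mode-advanced">
  <p class="text-muted"><span class="genre">
Drama            </span></p>
</div>
"""

card = BeautifulSoup(SAMPLE, "html.parser").find("div")
print(text_or_default(card, "span", "runtime"))  # falls back to the placeholder
print(text_or_default(card, "span", "genre"))    # surrounding whitespace is stripped
```

The default mirrors the `"*****"` placeholder used elsewhere in this notebook, so every list still ends up with one entry per movie.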

### Extracting IMDb rating for each movie

In [15]:
for movie in movie_data:
    rating=movie.find('div',class_='inline-block ratings-imdb-rating').text.strip()
    imdb_ratings.append(rating)

In [16]:
len(imdb_ratings)

100

### Extracting votes and gross revenue for each movie

In [17]:
for movie in movie_data:
    vote=movie.find_all('span',attrs={'name':'nv'})
    imdb_votes.append(vote[0].text)
    if len(vote)>1:
        movie_gross_revenue.append(vote[1].text)
    else:
        movie_gross_revenue.append("****")

In [18]:
print(len(imdb_votes))
print(len(movie_gross_revenue))

100
100
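
Votes and revenue are stored as text, which makes them awkward to sort or aggregate later. As a sketch only, the two hypothetical helpers below convert them to numbers, assuming vote counts may contain thousands separators and revenue follows the `$28.34M` pattern seen in the output further down; anything else, including the `****` placeholder, becomes `None`.

```python
def parse_votes(raw):
    """Convert a vote count such as '2,731,853' to an int."""
    return int(raw.replace(",", ""))

def parse_revenue_millions(raw):
    """Convert '$28.34M' to 28.34 (millions of dollars); return None otherwise."""
    text = raw.strip()
    if not (text.startswith("$") and text.endswith("M")):
        return None
    return float(text[1:-1].replace(",", ""))

print(parse_votes("2,731,853"))
print(parse_revenue_millions("$28.34M"))
print(parse_revenue_millions("****"))
```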


### Extracting category for each movie

In [19]:
for movie in movie_data:
    if len(movie.find_all('span',class_="certificate"))>0:
        movie_categories.append(movie.find_all('span',class_="certificate")[0].get_text())
    else:
        movie_categories.append("*****")

In [20]:
len(movie_categories)

100

### Extracting genre for each movie

In [21]:
for movie in movie_data:
    if len(movie.find_all('span',class_="genre"))>0:
        movie_genres.append(movie.find_all('span',class_="genre")[0].get_text())
    else:
        movie_genres.append("*****")
    

In [22]:
len(movie_genres)

100

### Extracting director for each movie

In [23]:
for movie in movie_data:
    if movie.find('p',class_=''):
        director=movie.find('p',class_='').find('a').text.strip()
        movie_directors.append(director)
    else:
        movie_directors.append("***")

In [24]:
len(movie_directors)

100

### Extracting stars of each movie

In [25]:
for movie in movie_data:
    # The first link in the credits paragraph is taken to be the director; the rest are stars
    links=movie.find('p',class_='').find_all('a')
    movie_stars.append([link.text.strip() for link in links[1:]])

In [26]:
len(movie_stars)

100
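
Treating the first link as the director and the remaining links as stars misattributes names when a film has more than one director. The sketch below shows one alternative, under the assumption that the credits paragraph separates directors from stars with a `<span class="ghost">` element; the fragment is hand-written for illustration, and `split_credits` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the 'ghost' separator is an assumption about the page layout.
SAMPLE = """
<p class="">
    Directors:
<a href="/name/nm1/">Lana Wachowski</a>,
<a href="/name/nm2/">Lilly Wachowski</a>
<span class="ghost">|</span>
    Stars:
<a href="/name/nm3/">Keanu Reeves</a>,
<a href="/name/nm4/">Laurence Fishburne</a>
</p>
"""

def split_credits(credit_tag):
    """Split the links in a credits paragraph into (directors, stars)."""
    directors, stars = [], []
    target = directors
    for child in credit_tag.children:
        name = getattr(child, "name", None)  # plain text nodes have no tag name
        if name == "span" and "ghost" in (child.get("class") or []):
            target = stars  # every link after the separator is a star
        elif name == "a":
            target.append(child.get_text(strip=True))
    return directors, stars

credit_tag = BeautifulSoup(SAMPLE, "html.parser").find("p")
directors, stars = split_credits(credit_tag)
print(directors)
print(stars)
```

If the separator is absent, every link ends up in `directors`, so it is worth checking the result on a few real cards before relying on it.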

### Creating a movies DataFrame for further processing

In [27]:
movie_df=pd.DataFrame({"movie_name":movie_names,
                       "release_year":movie_release_years,
                       "category":movie_categories,
                       "runtime":movie_runtimes,
                       "genre":movie_genres,
                       "rating":imdb_ratings,
                       "director":movie_directors,
                       "stars":movie_stars,
                       "votes":imdb_votes,
                       "revenue":movie_gross_revenue,
                      "link":movie_poster_links})

In [28]:
movie_df.head()

Unnamed: 0,movie_name,release_year,category,runtime,genre,rating,director,stars,votes,revenue,link
0,The Shawshank Redemption,1994,A,142,\nDrama,9.3,Frank Darabont,"[Tim Robbins, Morgan Freeman, Bob Gunton, Will...",2731853,$28.34M,https://m.media-amazon.com/images/M/MV5BNDE3OD...
1,The Godfather,1972,A,175,"\nCrime, Drama",9.2,Francis Ford Coppola,"[Marlon Brando, Al Pacino, James Caan, Diane K...",1899490,$134.97M,https://m.media-amazon.com/images/M/MV5BM2MyNj...
2,The Dark Knight,2008,UA,152,"\nAction, Crime, Drama",9.0,Christopher Nolan,"[Christian Bale, Heath Ledger, Aaron Eckhart, ...",2704998,$534.86M,https://m.media-amazon.com/images/M/MV5BMTMxNT...
3,The Lord of the Rings: The Return of the King,2003,U,201,"\nAction, Adventure, Drama",9.0,Peter Jackson,"[Elijah Wood, Viggo Mortensen, Ian McKellen, O...",1879267,$377.85M,https://m.media-amazon.com/images/M/MV5BNzA5ZD...
4,Schindler's List,1993,A,195,"\nBiography, Drama, History",9.0,Steven Spielberg,"[Liam Neeson, Ralph Fiennes, Ben Kingsley, Car...",1379351,$96.90M,https://m.media-amazon.com/images/M/MV5BNDE4OT...
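
The preview shows a leading newline in `genre`, and missing values elsewhere are stored as runs of asterisks. A possible cleaning pass is sketched below on a two-row, made-up frame shaped like `movie_df`; the column names follow the notebook, while the rows and the decision to map placeholders to `NaN` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Two made-up rows shaped like movie_df, including the notebook's placeholders.
sample_df = pd.DataFrame({
    "movie_name": ["Film A", "Film B"],
    "category": ["A", "*****"],
    "runtime": ["142", "175"],
    "genre": ["\nDrama", "\nCrime, Drama            "],
    "rating": ["9.3", "9.2"],
    "revenue": ["$28.34M", "****"],
})

clean_df = sample_df.copy()
clean_df["genre"] = clean_df["genre"].str.strip()      # drop newline and padding
clean_df["runtime"] = clean_df["runtime"].astype(int)  # minutes as integers
clean_df["rating"] = clean_df["rating"].astype(float)
# Treat cells made up only of asterisks as missing values
clean_df = clean_df.replace(r"^\*+$", np.nan, regex=True)

print(clean_df.dtypes)
print(clean_df)
```

Using `NaN` rather than a string placeholder lets pandas methods such as `isna()` and `dropna()` work on these columns during later analysis.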


In [29]:
movie_df.to_csv("imdb_movies_scraped.csv",index=False)

In [30]:
pd.read_csv("imdb_movies_scraped.csv").head()

Unnamed: 0,movie_name,release_year,category,runtime,genre,rating,director,stars,votes,revenue,link
0,The Shawshank Redemption,1994,A,142,\nDrama,9.3,Frank Darabont,"['Tim Robbins', 'Morgan Freeman', 'Bob Gunton'...",2731853,$28.34M,https://m.media-amazon.com/images/M/MV5BNDE3OD...
1,The Godfather,1972,A,175,"\nCrime, Drama",9.2,Francis Ford Coppola,"['Marlon Brando', 'Al Pacino', 'James Caan', '...",1899490,$134.97M,https://m.media-amazon.com/images/M/MV5BM2MyNj...
2,The Dark Knight,2008,UA,152,"\nAction, Crime, Drama",9.0,Christopher Nolan,"['Christian Bale', 'Heath Ledger', 'Aaron Eckh...",2704998,$534.86M,https://m.media-amazon.com/images/M/MV5BMTMxNT...
3,The Lord of the Rings: The Return of the King,2003,U,201,"\nAction, Adventure, Drama",9.0,Peter Jackson,"['Elijah Wood', 'Viggo Mortensen', 'Ian McKell...",1879267,$377.85M,https://m.media-amazon.com/images/M/MV5BNzA5ZD...
4,Schindler's List,1993,A,195,"\nBiography, Drama, History",9.0,Steven Spielberg,"['Liam Neeson', 'Ralph Fiennes', 'Ben Kingsley...",1379351,$96.90M,https://m.media-amazon.com/images/M/MV5BNDE4OT...
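
Note that after the round trip through CSV, the `stars` column comes back as text (`"['Tim Robbins', ...]"`) and not as a Python list. One way to restore the lists on load is sketched here with an in-memory CSV standing in for `imdb_movies_scraped.csv`; the single row is made up, and using `ast.literal_eval` as a converter is a suggestion, not part of the original notebook.

```python
import ast
import io

import pandas as pd

# Build a tiny CSV in memory the same way the notebook writes its file.
original = pd.DataFrame({"movie_name": ["Film A"],
                         "stars": [["Actor One", "Actor Two"]]})
buffer = io.StringIO()
original.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default read: the list is loaded as its string representation.
as_text = pd.read_csv(io.StringIO(csv_text))
# With a converter: each cell is parsed back into a real list.
as_list = pd.read_csv(io.StringIO(csv_text),
                      converters={"stars": ast.literal_eval})

print(type(as_text.loc[0, "stars"]))
print(type(as_list.loc[0, "stars"]))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice here than `eval`.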
