# What’s Web Scraping?

Web scraping consists of gathering data available on websites. This can be done manually by a human or by using a bot. A bot is a program you build that helps you extract the data you need much quicker than a human’s hand and eyes can.

# What Are We Going to Scrape?
It’s essential to identify the goal of your scraping right from the start, so that we don’t scrape any data we don’t actually need. For this project, we’ll scrape IMDb’s search results for feature films released in 1990, which are listed 100 movies per page. Here is the information we’ll gather from each movie listing:

- The title
- The year it was released
- The genre
- The certificate
- How long the movie is
- IMDb’s rating of the movie
- The Metascore of the movie
- How many votes the movie got
- The U.S. gross earnings of the movie

In [15]:
import requests  # makes HTTP requests and downloads page content
import urllib  # utilities for working with URLs
import os  # file and directory handling
import re  # regular expressions (not used below; kept for convenience)
from bs4 import BeautifulSoup  # a Python library for pulling data out of HTML and XML files
import pandas as pd  # DataFrames for working with tabular data


In [16]:
# Define the page to scrape and store its URL.
imdb_url = 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=1&ref_=adv_nxt'
imdb_resp = requests.get(imdb_url)  # download the content of the page with the requests package

In [17]:
imdb_soup = BeautifulSoup(imdb_resp.text, 'html.parser')

Let's create a helper function 'docs' that takes a URL as an argument and carries out the steps we followed above to return a BeautifulSoup object.

In [18]:
def docs(url):
    # download the page using the requests library
    response = requests.get(url)
    # check that the web page was downloaded successfully
    if response.status_code != 200:
        return "Page not successfully downloaded"
    contents = response.text

    # convert the contents into a BeautifulSoup object
    soup = BeautifulSoup(contents, 'html.parser')
    return soup

Extract the following details for each movie in the list:

- name
- year
- rating
- certificate
- duration
- genre
- metascore
- votes
- gross

In [22]:
imdb_containers = imdb_soup.find_all("div",{"class" : "lister-item mode-advanced"}) # get all the containers 
print('Containers is of type: ',type(imdb_containers)) # check the type 
print('Number of movies in the container:',len(imdb_containers)) # number of movies in the container

Containers is of type:  <class 'bs4.element.ResultSet'>
Number of movies in the container: 100


In [23]:
imdb_1 = imdb_containers[0] # get the first movie from the container
print('What is the data type of movie:',type(imdb_1)) # observe the tag of the first movie
print('details of the movie: ', len(imdb_1))

What is the data type of movie: <class 'bs4.element.Tag'>
details of the movie:  7


In [24]:
imdb_1.div

<div class="lister-top-right">
<div class="ribbonize" data-caller="filmosearch" data-tconst="tt0099685"></div>
</div>

In [25]:
imdb_1.a # the first anchor (<a>) tag; its href attribute holds the link to the movie page

<a href="/title/tt0099685/"> <img alt="Goodfellas" class="loadlate" data-tconst="tt0099685" height="98" loadlate="https://m.media-amazon.com/images/M/MV5BY2NkZjEzMDgtN2RjYy00YzM1LWI4ZmQtMjIwYjFjNmI3ZGEwXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UX67_CR0,0,67,98_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="67"/>
</a>

In [26]:
imdb_1.h3 

<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0099685/">Goodfellas</a>
<span class="lister-item-year text-muted unbold">(1990)</span>
</h3>

In [27]:
imdb_1.h3.a

<a href="/title/tt0099685/">Goodfellas</a>

In [28]:
movie_name = imdb_1.h3.a.text # Get The movie Title
movie_name

'Goodfellas'

In [29]:
movie_year = imdb_1.h3.find('span', class_ = 'lister-item-year text-muted unbold').text # get the movie's release year
movie_year

'(1990)'
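
The year comes back as a string wrapped in parentheses. Below is a minimal sketch of turning it into an integer with a regular expression. The function name `parse_year` is hypothetical, and the `'(I) (1990)'` input is an assumption based on the `replace('I','')` call that appears later in `get_movie_info`.

```python
import re

def parse_year(raw):
    """Return the first four-digit number found in raw as an int, or None."""
    match = re.search(r"\d{4}", raw)
    return int(match.group()) if match else None

print(parse_year("(1990)"))      # 1990
print(parse_year("(I) (1990)"))  # 1990
```

Searching for the digits avoids having to list every character that might surround the year.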

In [30]:
imdb_1.strong

<strong>8.7</strong>

In [31]:
# convert to float as expecting to see some decimal places for rating
movie_rating = float(imdb_1.strong.text) 
movie_rating

8.7

In [32]:
imdb_1.find('span', class_ = 'certificate') # Getting the certificate for the movie

<span class="certificate">A</span>

In [33]:
movie_certificate = imdb_1.find('span', class_ = 'certificate').text
movie_certificate

'A'

Not all movies have a certificate, so let's create a helper function that returns 'Not Rated' when no certificate is available.

In [34]:
def get_certificate(imdb):
    # cases where there is no certificate assigned - lets give default as Not Rated
    if imdb.find('span', class_ = 'certificate') is not None:
        cert = imdb.find('span', class_ = 'certificate').text
    else:
        cert = 'Not Rated'
    return cert

In [35]:
get_certificate(imdb_1)

'A'
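
To see the default in action without downloading a page, the helper can be tried on small hand-written HTML fragments. This is only a sketch: the two fragments are made up for illustration, and the function is a variant of `get_certificate` that looks up the tag once instead of twice.

```python
from bs4 import BeautifulSoup

def get_certificate(item):
    # look up the certificate span once and fall back to a default
    tag = item.find("span", class_="certificate")
    return tag.text if tag is not None else "Not Rated"

# hand-written fragments, not real IMDb markup
rated = BeautifulSoup('<div><span class="certificate">A</span></div>', "html.parser")
unrated = BeautifulSoup('<div><span class="runtime">96 min</span></div>', "html.parser")

print(get_certificate(rated))    # A
print(get_certificate(unrated))  # Not Rated
```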

In [36]:
movie_runtime = imdb_1.find('span', class_ = 'runtime').text # Get Movie runtime
movie_runtime

'146 min'
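
The runtime is also a string. If a numeric column is preferred, a small conversion like the following could be applied. `parse_runtime` is a hypothetical name, and the sketch assumes the value always has the form `'<number> min'`.

```python
def parse_runtime(raw):
    """Convert a string such as '146 min' to an integer number of minutes."""
    return int(raw.split()[0])

print(parse_runtime("146 min"))  # 146
```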

In [37]:
movie_genre = imdb_1.find('span', class_ = 'genre').text # Get Movie Genre
movie_genre

'\nBiography, Crime, Drama            '

In [38]:
movie_genre = movie_genre.strip()
movie_genre

'Biography, Crime, Drama'

In [39]:
# Get Votes 
v_g_dtl = imdb_1.findAll('span', attrs = {'name' : 'nv'})
movie_votes = v_g_dtl[0]['data-value']
print(movie_votes)

1043030


In [40]:
#Get Gross
movie_gross = v_g_dtl[1].text
print(movie_gross)

$46.84M


In [41]:
#get vote count 
imdb_1.findAll('span', attrs = {'name' : 'nv'})[0]['data-value']

'1043030'

In [42]:
# get gross 
imdb_1.findAll('span', attrs = {'name' : 'nv'})[1].text

'$46.84M'

Not all movies list gross earnings, so let's add a helper function that returns the vote count and the gross, with a default for a missing gross.



In [43]:
def get_votes_and_gross(imdb):
    imdb = imdb.find_all('span',{"name":"nv"})
    votes_and_gross_list = []
    for data_value in imdb:
        votes_and_gross_list.append(data_value.text)
    if(len(imdb)==2):
        votes=votes_and_gross_list[0]
        gross = votes_and_gross_list[1]
    else:
        votes=votes_and_gross_list[0]
        gross = None
    
    return votes,gross
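
The helper returns both values as text. Here is a minimal sketch of converting them to numbers afterwards. The names `parse_votes` and `parse_gross` are hypothetical, and the sketch assumes the vote text may contain thousands separators and that the gross always looks like `'$46.84M'`.

```python
def parse_votes(raw):
    """Convert a vote count such as '1,043,030' to an int."""
    return int(raw.replace(",", ""))

def parse_gross(raw):
    """Convert '$46.84M' to 46.84 (millions of USD); a missing gross stays None."""
    if raw is None:
        return None
    return float(raw.strip().lstrip("$").rstrip("M"))

print(parse_votes("1,043,030"))  # 1043030
print(parse_gross("$46.84M"))    # 46.84
print(parse_gross(None))         # None
```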

In [44]:
# Get MetaScore
movie_mscore = imdb_1.find('span', class_ = 'metascore').text
movie_mscore.strip()

'90'

In [45]:
def meta_score(imdb):
    # the metascore is not available for every movie; default to 0 in that case
    # (this class string matches only 'metascore favorable', so any other metascore also becomes 0)
    if imdb.find('span', class_ = 'metascore favorable') is not None:
        meta = imdb.find('span', class_ = 'metascore favorable').text
    else:
        meta = '0'
    return int(meta)
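
Searching for the full class string `'metascore favorable'` only matches spans with exactly those two classes. The sketch below shows one way to match any metascore span by searching for the single class `metascore`. The function name `meta_score_any` and the `metascore mixed` fragment are assumptions made for illustration, not confirmed IMDb markup.

```python
from bs4 import BeautifulSoup

def meta_score_any(item):
    # class_="metascore" matches any span that has 'metascore' among its classes
    tag = item.find("span", class_="metascore")
    return int(tag.text.strip()) if tag is not None else None

# hand-written fragments for illustration
favorable = BeautifulSoup('<div><span class="metascore favorable">90 </span></div>', "html.parser")
mixed = BeautifulSoup('<div><span class="metascore mixed">57 </span></div>', "html.parser")
missing = BeautifulSoup("<div></div>", "html.parser")

print(meta_score_any(favorable))  # 90
print(meta_score_any(mixed))      # 57
print(meta_score_any(missing))    # None
print(mixed.find("span", class_="metascore favorable"))  # None
```

Returning `None` instead of 0 for a missing score keeps "no score" distinct from a real value, at the cost of needing to handle missing values later.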

In [46]:
# assign the start page as the start_url and see if the function returns the pages
url = 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=1&ref_=adv_nxt'

# Using BeautifulSoup
BeautifulSoup makes the content we grabbed easy to work with.

soup is the variable that holds the object returned by BeautifulSoup. Passing 'html.parser' tells BeautifulSoup to parse the text as HTML, which lets Python navigate the components of the page rather than treating it as one long string.
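
As a small illustration of navigating the parsed tree, the sketch below parses the header fragment shown earlier for the first movie, taken from a string rather than from a live request.

```python
from bs4 import BeautifulSoup

# the <h3> header of the first listing, as printed earlier in this notebook
html = """
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0099685/">Goodfellas</a>
<span class="lister-item-year text-muted unbold">(1990)</span>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h3.a.text)     # Goodfellas
print(soup.h3.a["href"])  # /title/tt0099685/
```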

In [47]:
# let's create a helper function to collect the URLs of all the result pages
def all_imdb_page(start_url):
    all_urls = []  # list of all page URLs
    url = start_url  # first page
    while url is not None:  # loop over the result pages; stops after the last page
        all_urls.append(url)  # add the current page to the list
        soup = BeautifulSoup(requests.get(url).text, "html.parser")  # parse the page
        # the element with this class holds the link to the next page
        next_links = soup.find_all(class_='lister-page-next next-page')
        if len(next_links) == 0:  # no next-page link means this is the last page
            url = None
        else:
            next_page = "https://www.imdb.com" + next_links[0].get('href')
            url = next_page
            print(url)
    return all_urls
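
The step that finds the next-page link can be tried in isolation. This is a sketch under assumptions: `next_page_url` is a hypothetical helper, the HTML fragment is hand-written, and `urljoin` is used in place of the string concatenation above.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find(class_="lister-page-next")
    if link is None or not link.get("href"):
        return None
    # resolve the relative href against the URL of the current page
    return urljoin(current_url, link["href"])

# hand-written fragment for illustration
page = '<a class="lister-page-next next-page" href="/search/title/?start=101">Next</a>'
current = "https://www.imdb.com/search/title/?start=1"

print(next_page_url(page, current))           # https://www.imdb.com/search/title/?start=101
print(next_page_url("<p>end</p>", current))   # None
```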

In [48]:
all_imdb_page(url)

https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=101
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=201
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=301
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=401
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=501
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=601
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=701
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=801
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=901
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&

['https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=1&ref_=adv_nxt',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=101',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=201',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=301',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=401',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=501',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=601',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=701',
 'https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=801',
 'https://www.imdb.com/search/title/?title_t

# Finally, write and save the information to CSV files

In [49]:
def get_imdb_movies(start_year, end_year):
    start_year = str(start_year)  # first year of the range
    end_year = str(end_year)  # last year of the range
    url = get_imdb_url(start_year, end_year)  # helper function that builds the URL for the year range
    imdb_movies = get_imdb_movie_detail(url)  # helper function that collects the details of every movie
    # column titles for the data frame
    imdb_col_list = ['Title','Year','IMDB Rating','Rated','Duration','Genre','Metascore','Votes','Gross USD']
    imdb_df = pd.DataFrame(imdb_movies, columns = imdb_col_list)

    # write all the movies for the period into one file
    csv_name = 'imdb_movies_'+start_year+'-'+end_year+'.csv'
    imdb_df.to_csv(csv_name, index=False)
    print('imdb movies are also saved in {} file'.format(csv_name))

    # create a directory to hold one file per year
    os.makedirs('movies_data', exist_ok=True)
    # split the data frame by year into a dictionary
    imdb_movies_year = {j: imdb_df[imdb_df['Year'] == j] for j in imdb_df['Year'].unique()}
    # loop through the dictionary and write one file per year
    for k, v in imdb_movies_year.items():
        v.to_csv(os.path.join('movies_data', 'imdb_movies_{}.csv'.format(k)), index=False)
    return imdb_df
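
The per-year split can also be written with `groupby`. The following is a minimal sketch, not the notebook's own method: `write_by_year` is a hypothetical helper, the sample rows (including the 1991 entry) are invented for illustration, and a temporary directory stands in for `movies_data`.

```python
import os
import tempfile

import pandas as pd

def write_by_year(df, out_dir):
    """Write one CSV per value of the 'Year' column and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for year, group in df.groupby("Year"):
        path = os.path.join(out_dir, "imdb_movies_{}.csv".format(year))
        group.to_csv(path, index=False)
        paths.append(path)
    return paths

# invented sample data, only to exercise the function
sample = pd.DataFrame({
    "Title": ["Goodfellas", "Tremors", "Example Film"],
    "Year": [1990, 1990, 1991],
})

with tempfile.TemporaryDirectory() as tmp:
    written = write_by_year(sample, tmp)
    print([os.path.basename(p) for p in written])
```

Building the path with `os.path.join` keeps the code portable across operating systems.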

Let's write a helper function that builds the search URL for a given start and end year.

In [50]:
def get_imdb_url(start_year, end_year):    
    base_url = 'https://www.imdb.com/search/title/?title_type=feature&'
    year = 'year='
    sep = ','
    tail_url = '&count=100&start=1&ref_=adv_nxt'
    imdb_url = base_url + year + start_year + sep + end_year + tail_url
    return imdb_url
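
An alternative to concatenating strings is to let the standard library assemble the query string. This is a sketch under assumptions: `build_imdb_url` is a hypothetical name, and the `ref_` parameter from the original URL is left out.

```python
from urllib.parse import urlencode

def build_imdb_url(start_year, end_year, count=100, start=1):
    """Build the search URL for feature films released between two years."""
    params = {
        "title_type": "feature",
        "year": "{},{}".format(start_year, end_year),
        "count": count,
        "start": start,
    }
    # safe="," keeps the comma between the two years unescaped
    return "https://www.imdb.com/search/title/?" + urlencode(params, safe=",")

print(build_imdb_url(1990, 1990))
# https://www.imdb.com/search/title/?title_type=feature&year=1990,1990&count=100&start=1
```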

Create a helper function that goes through each page and collects the details of every movie on that page.

In [51]:
def get_imdb_movie_detail(url):
    imdb_movies = []  # details of every movie, one list per movie

    for link in all_imdb_page(url):  # visit every result page
        imdb_soup = BeautifulSoup(requests.get(link).text, 'html.parser')  # parse the current page
        imdb_containers = imdb_soup.find_all("div",{"class" : "lister-item mode-advanced"})  # get all the movie containers

        # loop through all the movies in the container to get the attributes
        for imdb in imdb_containers:
            imdb_movies.append(get_movie_info(imdb))
    return imdb_movies

Getting all the IMDb Movie data

In [52]:
def get_movie_info(imdb):
    name = imdb.h3.a.text
    year = imdb.h3.find('span', class_ = 'lister-item-year text-muted unbold')
    movie_year = pd.to_numeric(year.text.replace('(','').replace(')','').replace('I',''))
    rating = float(imdb.strong.text)
    certificate = get_certificate(imdb) 
    duration = imdb.find('span', class_ = 'runtime').text
    genre = imdb.find('span', class_ = 'genre').text.strip()
    metascore = meta_score(imdb)
    votes, gross = votes_and_gross(imdb)
    movie_info = [name, movie_year, rating, certificate, duration, genre, metascore, votes, gross]
    return movie_info

The helper functions below get the movie certificate, Metascore, votes, and gross.

In [53]:
def get_certificate(imdb):
    # cases where there is no certificate assigned - lets give default as Not Rated
    if imdb.find('span', class_ = 'certificate') is not None:
        cert = imdb.find('span', class_ = 'certificate').text
    else:
        cert = 'Not Rated'
    return cert

In [54]:
def meta_score(imdb):
    # for most of the movies metascore is not available and for those default to 0
    if imdb.find('span', class_ = 'metascore favorable') is not None:
        meta = imdb.find('span', class_ = 'metascore favorable').text
    else:
        meta = '0'
    return int(meta)

In [55]:
def votes_and_gross(imdb):
    imdb = imdb.find_all('span',{"name":"nv"})
    votes_and_gross_list = []
    for data_value in imdb:
        votes_and_gross_list.append(data_value.text)
    if(len(imdb)==2):
        votes=votes_and_gross_list[0]
        gross = votes_and_gross_list[1]
    else:
        votes=votes_and_gross_list[0]
        gross = None   
    return votes,gross

In [56]:
imdb_df = get_imdb_movies(1990,1990)

https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=101
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=201
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=301
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=401
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=501
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=601
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=701
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=801
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&start=901
https://www.imdb.com/search/title/?title_type=feature&year=1990-01-01,1990-12-31&count=100&

In [57]:
imdb_df[:5]

Unnamed: 0,Title,Year,IMDB Rating,Rated,Duration,Genre,Metascore,Votes,Gross USD
0,Goodfellas,1990,8.7,A,146 min,"Biography, Crime, Drama",90,1043061,$46.84M
1,Tremors,1990,7.1,Not Rated,96 min,"Comedy, Horror",65,122981,$16.67M
2,Dances with Wolves,1990,8.0,U,181 min,"Adventure, Drama, Western",72,245400,$184.21M
3,Total Recall,1990,7.5,A,113 min,"Action, Sci-Fi, Thriller",0,307497,$119.39M
4,Pretty Woman,1990,7.0,A,119 min,"Comedy, Romance",0,296525,$178.41M
