# Love Movies? Let's Scrape Them All!

## Introduction

This project involves scraping movie information from the popular IMDb website. In particular:
- Get the list of top-rated movies (the IMDb Top 250 chart)
- For each of these movies, collect detailed information from the movie's own web page
- Arrange all this information in a tidy pandas DataFrame and export it to an Excel (or CSV) file

In [1]:
from selenium import webdriver
from bs4 import BeautifulSoup
import os
import numpy as np
import pandas as pd
import re

In [2]:
proj_dir = r'D:\Work\Upwork\Giselle\webSrcapingProjs'

## Step (1) Get a DataFrame of the Top-Rated Movies

In [126]:
def get_best_movies(url):
    """Scrape the IMDb chart page at `url` into a DataFrame of rank, title, year and link."""
    # NOTE: webdriver.PhantomJS was deprecated in Selenium 3.x and removed in Selenium 4,
    # so this cell needs an older Selenium release.
    phantomJS_path = os.path.join(proj_dir, r"phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver = webdriver.PhantomJS(phantomJS_path)

    driver.get(url)
    html_page = driver.page_source

    soup = BeautifulSoup(html_page, 'lxml')
    movie_table = soup.find('tbody', class_='lister-list')

    # collect one [rank, title, year, link] row per movie
    rows = []
    for movie in movie_table.find_all('td', class_='titleColumn'):
        # get the movie text and remove the layout whitespace and newline characters
        movie_title = re.sub(r'\s{6}|\n', '', movie.get_text().strip())

        # build the absolute link to the movie page
        movie_link = "http://www.imdb.com" + movie.find('a')['href']

        # split the text into rank, title and year
        m = re.search(r'(?P<rank>\d{1,3})\.\s*(?P<title>.*?)\((?P<year>\d{4})', movie_title)
        if m:
            rows.append([m.group('rank'), m.group('title'), m.group('year'), movie_link])

    driver.quit()

    top_movies = pd.DataFrame(rows, columns=['rank', 'title', 'year', 'link'])

    # convert the rank and year columns to numeric (pd.to_numeric is applied per column)
    top_movies[['rank', 'year']] = top_movies[['rank', 'year']].apply(pd.to_numeric)

    return top_movies
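
The title parsing inside `get_best_movies` can be tried offline, without a browser. The following is a minimal sketch, assuming each chart cell's text has the shape "rank.", newline, indented title, newline, "(year)". The sample string and the helper name `parse_chart_entry` are made up for illustration and are not part of the notebook.

```python
import re

# Same pattern as in get_best_movies: rank, title, then a four-digit year in parentheses.
ENTRY_PATTERN = re.compile(r'(?P<rank>\d{1,3})\.\s*(?P<title>.*?)\((?P<year>\d{4})')


def parse_chart_entry(cell_text):
    """Return (rank, title, year) for one chart cell, or None if the text does not match."""
    # Remove the layout whitespace the same way the notebook does.
    cleaned = re.sub(r'\s{6}|\n', '', cell_text.strip())
    m = ENTRY_PATTERN.search(cleaned)
    if m is None:
        return None
    return int(m.group('rank')), m.group('title').strip(), int(m.group('year'))


# Hypothetical cell text, shaped like the assumed page layout.
sample = "1.\n" + " " * 6 + "The Shawshank Redemption\n(1994)"
entry = parse_chart_entry(sample)
print(entry)
```

Returning `None` for unmatched text mirrors the `if m:` guard in the function, which silently skips cells that do not fit the pattern.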

In [136]:
top_movies_url = "http://www.imdb.com/chart/top?ref_=nv_mv_250_6"
top_movies = get_best_movies(top_movies_url)


In [137]:
top_movies.head()

| | rank | title | year | link |
|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | http://www.imdb.com/title/tt0111161/?pf_rd_m=A... |
| 1 | 2 | The Godfather | 1972 | http://www.imdb.com/title/tt0068646/?pf_rd_m=A... |
| 2 | 3 | The Godfather: Part II | 1974 | http://www.imdb.com/title/tt0071562/?pf_rd_m=A... |
| 3 | 4 | The Dark Knight | 2008 | http://www.imdb.com/title/tt0468569/?pf_rd_m=A... |
| 4 | 5 | 12 Angry Men | 1957 | http://www.imdb.com/title/tt0050083/?pf_rd_m=A... |


In [138]:
top_movies.tail()

| | rank | title | year | link |
|---|---|---|---|---|
| 245 | 246 | Gangs of Wasseypur | 2012 | http://www.imdb.com/title/tt1954470/?pf_rd_m=A... |
| 246 | 247 | Dog Day Afternoon | 1975 | http://www.imdb.com/title/tt0072890/?pf_rd_m=A... |
| 247 | 248 | What Ever Happened to Baby Jane? | 1962 | http://www.imdb.com/title/tt0056687/?pf_rd_m=A... |
| 248 | 249 | Pirates of the Caribbean: The Curse of the Bla... | 2003 | http://www.imdb.com/title/tt0325980/?pf_rd_m=A... |
| 249 | 250 | PK | 2014 | http://www.imdb.com/title/tt2338151/?pf_rd_m=A... |


In [139]:
# show the 10 most recent movies in the list (sort by year, newest first; break ties by rank)
top_movies.sort_values(['year', 'rank'], ascending=[False, True]).head(10)

| | rank | title | year | link |
|---|---|---|---|---|
| 73 | 74 | Dunkirk | 2017 | http://www.imdb.com/title/tt5013056/?pf_rd_m=A... |
| 169 | 170 | Logan | 2017 | http://www.imdb.com/title/tt3315342/?pf_rd_m=A... |
| 203 | 204 | Baby Driver | 2017 | http://www.imdb.com/title/tt3890160/?pf_rd_m=A... |
| 70 | 71 | Dangal | 2016 | http://www.imdb.com/title/tt5074352/?pf_rd_m=A... |
| 84 | 85 | Your Name | 2016 | http://www.imdb.com/title/tt5311514/?pf_rd_m=A... |
| 153 | 154 | La La Land | 2016 | http://www.imdb.com/title/tt3783958/?pf_rd_m=A... |
| 168 | 169 | Hacksaw Ridge | 2016 | http://www.imdb.com/title/tt2119532/?pf_rd_m=A... |
| 131 | 132 | Inside Out | 2015 | http://www.imdb.com/title/tt2096673/?pf_rd_m=A... |
| 139 | 140 | Room | 2015 | http://www.imdb.com/title/tt3170832/?pf_rd_m=A... |
| 197 | 198 | Mad Max: Fury Road | 2015 | http://www.imdb.com/title/tt1392190/?pf_rd_m=A... |


## Step (2) Get Detailed Information for One Movie

In [205]:
def get_movie_details(movie_record, driver):
    """Load the movie page in `driver` and add rating, summary, director, writers and stars to `movie_record`."""
    driver.get(movie_record['link'])
    html_page = driver.page_source
    soup = BeautifulSoup(html_page, 'lxml')

    movie_record['rating'] = float(soup.find('span', itemprop='ratingValue').get_text().strip())

    # the summary, director, writers and stars all live inside the plot summary block
    plot_summary = soup.find('div', class_='plot_summary')

    movie_record['summary_text'] = plot_summary.find('div', itemprop='description').get_text().strip()

    movie_record['director'] = plot_summary.find('span', itemprop='director').get_text().strip()

    movie_record['writers'] = [writer.get_text().strip() for writer in plot_summary.find_all('span', itemprop='creator')]

    # clean the writers column: drop the extra comma that follows a closing parenthesis
    movie_record['writers'] = [re.sub(r'(\)),', r'\1', writer) for writer in movie_record['writers']]

    movie_record['stars'] = [star.get_text().strip() for star in plot_summary.find_all('span', itemprop='actors')]

    # clean the stars column: drop the extra comma that follows a name
    movie_record['stars'] = [re.sub(r'(\w+),', r'\1', star) for star in movie_record['stars']]

    return movie_record
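
The two comma clean-ups in `get_movie_details` can be checked on plain strings. This is a small sketch under stated assumptions: the writer names are made up, the trailing commas on the star names are assumed, and `strip_trailing_comma` is a hypothetical helper shown only as a possible simpler alternative.

```python
import re

# Made-up sample values shaped like the scraped text; real page text may differ.
raw_writers = ["Jane Doe (novel),", "John Roe (screenplay)"]
raw_stars = ["Tim Robbins,", "Morgan Freeman,", "Bob Gunton"]

# The notebook's substitutions: keep the captured group, drop the comma after it.
writers_regex = [re.sub(r'(\)),', r'\1', w) for w in raw_writers]
stars_regex = [re.sub(r'(\w+),', r'\1', s) for s in raw_stars]


def strip_trailing_comma(names):
    """Drop trailing commas and surrounding whitespace from each name."""
    return [name.strip().rstrip(',').strip() for name in names]


writers_simple = strip_trailing_comma(raw_writers)
stars_simple = strip_trailing_comma(raw_stars)
print(writers_simple, stars_simple)
```

One difference worth noting: the stars regex removes every comma that follows a word character, so a comma inside a name would also disappear, while `rstrip` only touches the end of the string.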

In [206]:
phantomJS_path = os.path.join(proj_dir, r"phantomjs-2.1.1-windows\bin\phantomjs.exe")
driver = webdriver.PhantomJS(phantomJS_path)

In [207]:
full_movie_record = get_movie_details(top_movies.iloc[0, :], driver)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
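
The warning above most likely appears because `top_movies.iloc[0, :]` returns a row taken from the DataFrame, and `get_movie_details` then assigns new fields to that row. One common way to avoid it is to take an explicit copy of the row before adding fields. The sketch below uses a small made-up frame instead of the scraped data.

```python
import pandas as pd

# Small stand-in for top_movies; only two columns are needed for the illustration.
movies = pd.DataFrame({
    "rank": [1, 2],
    "title": ["The Shawshank Redemption", "The Godfather"],
})

# Take an independent copy of the first row, then add a new field to the copy.
record = movies.iloc[0, :].copy()
record["rating"] = 9.3

print(record)
```

With the copy, the new field lives only on `record`, and `movies` keeps its original columns.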

In [214]:
full_movie_record

rank                                                            1
title                                    The Shawshank Redemption
year                                                         1994
link            http://www.imdb.com/title/tt0111161/?pf_rd_m=A...
rating                                                        9.3
summary_text    Two imprisoned men bond over a number of years...
director                                           Frank Darabont
writers         [Stephen King (short story "Rita Hayworth and ...
stars                   [Tim Robbins, Morgan Freeman, Bob Gunton]
Name: 0, dtype: object

In [215]:
driver.quit()

## Step (3) Get Full Details of All Movies

In [216]:
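def get_best_movies_details(movies_df):
    """Add the detail columns to every row of `movies_df`, reusing one browser session."""
    phantomJS_path = os.path.join(proj_dir, r"phantomjs-2.1.1-windows\bin\phantomjs.exe")
    driver = webdriver.PhantomJS(phantomJS_path)

    # axis='columns' passes one row (one movie) at a time to get_movie_details
    movies_df = movies_df.apply(lambda x: get_movie_details(x, driver), axis='columns')

    driver.quit()

    return movies_df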
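
The row-wise `apply` is what turns the per-movie fields into new columns: when the applied function returns a Series, pandas assembles the results into a DataFrame whose columns come from that Series' index. Here is a minimal sketch of that behaviour without any network access. `FAKE_RATINGS` and `add_rating` are hypothetical stand-ins for the scraping step.

```python
import pandas as pd

# Hypothetical stand-in for the page lookup: ratings keyed by title.
FAKE_RATINGS = {"The Shawshank Redemption": 9.3, "The Godfather": 9.2}


def add_rating(row):
    """Return a copy of the row with an extra 'rating' field."""
    row = row.copy()
    row["rating"] = FAKE_RATINGS[row["title"]]
    return row


movies = pd.DataFrame({
    "rank": [1, 2],
    "title": ["The Shawshank Redemption", "The Godfather"],
})

# One call per row; the returned Series become the rows of the new DataFrame.
detailed = movies.apply(add_rating, axis="columns")
print(detailed)
```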

In [217]:
# demo run on the first five movies only; pass the whole DataFrame to process the full chart
full_movies_df = get_best_movies_details(get_best_movies(top_movies_url)[0:5])

In [218]:
full_movies_df

| | rank | title | year | link | rating | summary_text | director | writers | stars |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | The Shawshank Redemption | 1994 | http://www.imdb.com/title/tt0111161/?pf_rd_m=A... | 9.3 | Two imprisoned men bond over a number of years... | Frank Darabont | [Stephen King (short story "Rita Hayworth and ... | [Tim Robbins, Morgan Freeman, Bob Gunton] |
| 1 | 2 | The Godfather | 1972 | http://www.imdb.com/title/tt0068646/?pf_rd_m=A... | 9.2 | The aging patriarch of an organized crime dyna... | Francis Ford Coppola | [Mario Puzo (screenplay), Francis Ford Coppola... | [Marlon Brando, Al Pacino, James Caan] |
| 2 | 3 | The Godfather: Part II | 1974 | http://www.imdb.com/title/tt0071562/?pf_rd_m=A... | 9.0 | The early life and career of Vito Corleone in ... | Francis Ford Coppola | [Francis Ford Coppola (screenplay), Mario Puzo... | [Al Pacino, Robert De Niro, Robert Duvall] |
| 3 | 4 | The Dark Knight | 2008 | http://www.imdb.com/title/tt0468569/?pf_rd_m=A... | 9.0 | When the menace known as the Joker emerges fro... | Christopher Nolan | [Jonathan Nolan (screenplay), Christopher Nola... | [Christian Bale, Heath Ledger, Aaron Eckhart] |
| 4 | 5 | 12 Angry Men | 1957 | http://www.imdb.com/title/tt0050083/?pf_rd_m=A... | 8.9 | A jury holdout attempts to prevent a miscarria... | Sidney Lumet | [Reginald Rose (story), Reginald Rose (screenp... | [Henry Fonda, Lee J. Cobb, Martin Balsam] |
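
The introduction mentions exporting the result to an Excel or CSV file. A possible sketch of the CSV step follows, assuming the list-valued columns should be flattened to plain text first. The small frame, the `"; "` separator and the in-memory buffer are illustrative choices, not part of the notebook.

```python
import io

import pandas as pd

# Small stand-in for full_movies_df with one list-valued column.
movies = pd.DataFrame({
    "rank": [1],
    "title": ["The Shawshank Redemption"],
    "stars": [["Tim Robbins", "Morgan Freeman", "Bob Gunton"]],
})

# Join each list into one string so the CSV cell does not hold a Python list repr.
export_df = movies.copy()
export_df["stars"] = export_df["stars"].apply(lambda names: "; ".join(names))

# Write to an in-memory buffer here; a file path such as "top_movies.csv" works the same way.
buffer = io.StringIO()
export_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

`DataFrame.to_excel` could be used in the same way for an Excel file, but it needs an Excel writer engine such as openpyxl to be installed.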
