# Web Scraping assignment - IMDB Top Rated Movies
This notebook builds a web scraper that collects data from the IMDB list of the 250 top-rated movies.<br>
Source page: [IMDB Top rated movies](https://www.imdb.com/chart/top?ref_=nv_mv_250)

#### CONTENTS

[1. Part 1 - Scraping details from IMDB Top Rated Movies page](#1.0-Scrape-details-from-IMDB-Top-Rated-Movies-page)

[2. Part 2 - Scraping details from individual movie pages](#2.0-Scrape-details-from-individual-movie-pages)

[3. Part 3 - Data Exploration and cleaning](#3.0-Data-Exploration-and-cleaning)

[4. Part 4 - Concatenate dataframes and export to CSV](#4.0-Concatenate-dataframes-and-export-to-CSV)

[End Notes](#End-Notes)

In [1]:
# importing basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# packages for web scraping
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
from requests import get
import re

### 1.0 Scrape details from IMDB Top Rated Movies page

In [2]:
#  Specifying the URL containing the main dataset and creating beautiful soup object.
link1 = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
main_url = link1
html = urlopen(main_url)
msoup = BeautifulSoup(html, 'lxml')

# List of movies based on the title
movies_list = msoup.find_all('td', class_="titleColumn")
print (len(movies_list))

250


In [3]:
import time
start = time.time()

# list of movie names
movie_name = []
for row in movies_list:
    # title list
    mn = row.findAll('a')
    movie_name.append(mn[0].text)

# obtain year released
year_released = []
for row in movies_list:
    # year release list
    yr = row.findAll('span')
    stp1= yr[0].text
    stp2= stp1.strip('(')
    stp3= stp2.strip(')')
    year_released.append(float(stp3))

# obtain links of individual movies
movie_links = []
for row in movies_list:
    # movie links list
    lnk = row.findAll('a', attrs={'href': re.compile("/title")})
    movie_links.append(lnk[0].get('href')) 

# obtain imdb rating for movies
imdb_rate = []
imdb_rating = msoup.find_all('td', class_="ratingColumn imdbRating")
for row in imdb_rating:
    rt = row.findAll('strong')
    imdb_rate.append(float(rt[0].text))

end = time.time()
print('Start time:',start)
print('End time:',end)
print('Elapsed time: (minutes)',(end - start)/60)

Start time: 1548587623.9641278
End time: 1548587624.0521326
Elapsed time: (minutes) 0.0014667471249898275
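
The year-extraction loop above removes the parentheses with two `strip` calls and then converts the rest to `float`. A minimal alternative sketch, assuming the span text always contains a four-digit year somewhere, pulls the year out with a regular expression and returns an integer. The helper name `parse_year` is hypothetical and is not part of the original notebook.

```python
import re

def parse_year(text):
    """Return the first four-digit year found in text, or None if there is none.

    parse_year is a hypothetical helper, not part of the original notebook.
    """
    match = re.search(r"\d{4}", text)
    return int(match.group()) if match else None

print(parse_year("(1994)"))      # year wrapped in parentheses, as on the chart page
print(parse_year("(I) (2008)"))  # extra parenthesized text before the year
print(parse_year("n/a"))         # no year present
```

Returning `None` for a missing year keeps the loop from failing on a single malformed row, at the cost of having to handle the missing value later.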


In [4]:
print ('Movie names :\n', movie_name[0:5])
print ('\nYear Released :\n', year_released[0:5])
print ('\nLinks :\n', movie_links[0:5])
print ('\nIMDB rating :\n', imdb_rate[0:5])

Movie names :
 ['The Shawshank Redemption', 'The Godfather', 'The Godfather: Part II', 'The Dark Knight', '12 Angry Men']

Year Released :
 [1994.0, 1972.0, 1974.0, 2008.0, 1957.0]

Links :
 ['/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=2YMZYFZ51ARGYEEYM1XX&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1', '/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=2YMZYFZ51ARGYEEYM1XX&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_2', '/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=2YMZYFZ51ARGYEEYM1XX&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_3', '/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=2YMZYFZ51ARGYEEYM1XX&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_4', '/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=2YMZY

In [5]:
# Creating dataframe from the available lists of movies from the main page
df = pd.DataFrame({'Name of the movie': movie_name, 
                   'Link':movie_links, 
                   'Year released' :year_released, 
                   'IMDB rating': imdb_rate})

In [6]:
df.head(10)

Unnamed: 0,Name of the movie,Link,Year released,IMDB rating
0,The Shawshank Redemption,/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1994.0,9.2
1,The Godfather,/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1972.0,9.2
2,The Godfather: Part II,/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1974.0,9.0
3,The Dark Knight,/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,2008.0,9.0
4,12 Angry Men,/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1957.0,8.9
5,Schindler's List,/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1993.0,8.9
6,The Lord of the Rings: The Return of the King,/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,2003.0,8.9
7,Pulp Fiction,/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1994.0,8.9
8,"Il buono, il brutto, il cattivo",/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1966.0,8.8
9,Fight Club,/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1999.0,8.8
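
The `Link` column keeps the tracking parameters (`pf_rd_m`, `ref_`, and so on) that the chart page attaches to every movie link. As a sketch only, the path and the title ID could be separated from the query string with the standard library. The helper name `clean_link` is hypothetical, and the sample `href` is a shortened version of the first link shown above.

```python
import re
from urllib.parse import urlsplit

def clean_link(href):
    """Split a scraped href into its path (query string dropped) and its title ID.

    clean_link is a hypothetical helper, not part of the original notebook.
    """
    path = urlsplit(href).path
    match = re.search(r"tt\d+", path)
    return path, (match.group() if match else None)

# Shortened sample of the first scraped link (assumption: same shape as the real ones).
href = "/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&ref_=chttp_tt_1"
print(clean_link(href))
```

The title ID is a stable key for each movie, which may be more convenient than the full link when joining tables later.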


### 2.0 Scrape details from individual movie pages

In [7]:
# Loop to obtain values from individual movie pages

import time
start = time.time()

# ----- Assign dummy lists to append data
lst_rev = [] # number of reviewers list
cens_rat = [] # movie rating
mov_len=[] # movie length
genre=[] # genre
rldt=[] # release date
sumry=[] # movie summary
drct=[] # movie director
wrtr=[] # movie writers
str=[] # movie stars (this name shadows the built-in str)
kywrd=[] # keywords
budget=[] # budget
grsusa=[] # Gross USA
cumgrs=[] # Cumulative gross world wide
prodcpn=[] # Production company
# ----------

# ----code for looping through 250 movie pages ----
for i in range (0, len(movies_list)): 
    mov_url = "https://www.imdb.com"+df.Link[i]
    mov_html = urlopen(mov_url)
    mvsoup = BeautifulSoup(mov_html, 'lxml')
    
    # --- reviewers
    reviewers = mvsoup.find_all('span', itemprop="ratingCount" )
    res1 = re.sub(',','',reviewers[0].text)
    lst_rev.append(float(res1))
    
    # --- movie rating
    censor = mvsoup.find('div', class_="subtext").contents
    res1 = re.sub(re.compile('\n'), '',censor[0].string)
    res2 = res1.strip()
    cens_rat.append(res2) 
    
    # --- movie length
    res3 = mvsoup.find('time').string
    res4 = re.sub(re.compile('\n'), '',res3)
    dur = res4.strip()
    mov_len.append(dur)
    
    # --- movie genre
    gen1 = mvsoup.find_all('div', class_="see-more inline canwrap")
    cnt=0
    for j in range (0,len(gen1)):
        if gen1[j].h4 is not None:
            res1 = gen1[j].h4.string
            if res1=="Genres:":
                res21=gen1[j].text
                genre.append(res21)
                cnt=1
        if j==len(gen1)-1 and cnt==0:
            genre.append(" ")
    
    # --- movie release date
    res5 = mvsoup.find('a', title="See more release dates").string
    reldt = re.sub(re.compile('\n'), '',res5)
    rldt.append(reldt)
    
    # --- movie summary 
    res6 = mvsoup.find('div', class_="summary_text").string
    if res6 is not None:
        res7 = re.sub(re.compile('\n'), '',res6)
        sumary = res7.strip()
        sumry.append(sumary)
    else :
        sumry.append(" ")
    
    
    # --- movie director
    credt = mvsoup.find_all('div', class_="credit_summary_item")
    cnt=0
    for j in range (0,len(credt)):
        res1 = credt[j].h4.string
        if res1=="Director:":
            res21=credt[j].text
            drct.append(res21)
            cnt=1
        if j==len(credt)-1 and cnt==0:
            drct.append(" ")
    
    # --- movie writers
    # Scan every credit block and append " " when no writer credit is found,
    # so that wrtr always receives exactly one entry per movie.
    cnt=0
    for j in range (0,len(credt)):
        res1 = credt[j].h4.string
        if (res1== 'Writer:' or res1== 'Writers:') and cnt==0:
            res21=credt[j].text
            wrtr.append(res21)
            cnt=1
    if cnt==0:
        wrtr.append(" ")
    
    # --- movie stars
    # Same pattern as the writers: one entry per movie, " " as the fallback.
    cnt=0
    for j in range (0,len(credt)):
        res1 = credt[j].h4.string
        if (res1== 'Stars:' or res1== 'Star:') and cnt==0:
            res21=credt[j].text
            str.append(res21)
            cnt=1
    if cnt==0:
        str.append(" ")
    
    # --- plot keywords
    keyw = mvsoup.find_all('div', class_="see-more inline canwrap")
    cnt=0
    for j in range (0,len(keyw)):
        if keyw[j].h4 is not None:
            res1 = keyw[j].h4.string
            if res1=="Plot Keywords:":
                res21=keyw[j].text
                kywrd.append(res21)
                cnt=1
        if j==len(keyw)-1 and cnt==0:
            kywrd.append(" ")
    
    # --- movie budget
    txtblk = mvsoup.find_all('div', class_="txt-block")
    cntr=0
    for j in range (6,len(txtblk)):
        if txtblk[j].h4 is not None:        
            res1 = txtblk[j].h4.string
            if res1=="Budget:":
                res21=txtblk[j].text
                cntr=1
                budget.append(res21)
        if j==len(txtblk)-1 and cntr==0:
            budget.append(" ")
    
    # --- movie gross USA
    #grs = mvsoup.find_all('div', class_="txt-block")
    cntr=0
    for j in range (6,len(txtblk)):
        if txtblk[j].h4 is not None:        
            res1 = txtblk[j].h4.string
            if res1=="Gross USA:":
                res21=txtblk[j].text
                cntr=1
                grsusa.append(res21)
        if j==len(txtblk)-1 and cntr==0:
            grsusa.append(" ")
    
    # --- cumulative gross world wide
    #cum = mvsoup.find_all('div', class_="txt-block")
    cnt=0
    for j in range (8,len(txtblk)):
        if txtblk[j].h4 is not None:        
            res1 = txtblk[j].h4.string
            if res1=="Cumulative Worldwide Gross:":
                res21=txtblk[j].text
                cnt=1
                cumgrs.append(res21)
        if j==len(txtblk)-1 and cnt==0:
            cumgrs.append(" ")
    
    # --- production company        
    #cum = mvsoup.find_all('div', class_="txt-block")
    cnt=0
    for j in range (8,len(txtblk)):
        if txtblk[j].h4 is not None:        
            res1 = txtblk[j].h4.string
            if res1=="Production Co:":
                res21=txtblk[j].text
                cnt=1
                prodcpn.append(res21)
        if j==len(txtblk)-1 and cnt==0:
            prodcpn.append(" ")

# ---Clocking the time taken to run the code
end = time.time()
print('Start time:',start)
print('End time:',end)
print('Elapsed time: (minutes)',(end - start)/60)

Start time: 1548587624.4651563
End time: 1548588374.7240686
Elapsed time: (minutes) 12.50431520541509
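
Every field in the loop above follows the same pattern: look for a block whose `h4` label matches, take its text, and otherwise append a blank. Each list must end up with exactly one entry per movie, because `pd.DataFrame` raises an error when the lists differ in length. The sketch below isolates that pattern in one hypothetical helper, `find_labelled`. To keep it runnable without a network connection, the parsed `<div>` elements are modelled as plain `(label, text)` pairs, which is an assumption and not how the notebook stores them.

```python
def find_labelled(blocks, labels, default=" "):
    """Return the text of the first block whose label is in labels, else default.

    blocks is a list of (label, text) pairs. This is a stand-in (assumption)
    for the parsed <div> elements, used so the example runs offline.
    """
    for label, text in blocks:
        if label in labels:
            return text
    return default

# Sample credit blocks shaped like the raw values shown in the notebook output.
credits = [
    ("Director:", "\nDirector:\nFrank Darabont"),
    ("Stars:", "\nStars:\nTim Robbins, Morgan Freeman, Bob Gunton"),
]
print(repr(find_labelled(credits, {"Director:"})))
print(repr(find_labelled(credits, {"Writer:", "Writers:"})))  # no writer block: default
```

Because the helper always returns exactly one value, a call such as `wrtr.append(find_labelled(...))` cannot add zero or two entries for a movie.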


__Creating Dataframe2__ <br>
To view all the scraped data as a single table, the appended lists are combined into a dataframe.

In [8]:
df2 = pd.DataFrame({'Reviewers': lst_rev,
                   'Movie_rating':cens_rat, 
                   'Movie_duration' :mov_len, 
                   'Genre': genre,
                   'Release_date': rldt,
                   'Summary': sumry,
                   'Director': drct,
                   'Writers': wrtr,
                   'Stars': str,
                   'PlotKeywords':kywrd,
                   'Budget': budget,
                   'Gross_USA': grsusa,
                   'Cum_Gross_Worldwide': cumgrs,
                   'Production_Company': prodcpn})

In [9]:
df2.head()

Unnamed: 0,Reviewers,Movie_rating,Movie_duration,Genre,Release_date,Summary,Director,Writers,Stars,PlotKeywords,Budget,Gross_USA,Cum_Gross_Worldwide,Production_Company
0,2045328.0,A,2h 22min,\nGenres:\n Drama\n,14 October 1994 (USA),Two imprisoned men bond over a number of years...,\nDirector:\nFrank Darabont,"\nWriters:\nStephen King (short story ""Rita Ha...","\nStars:\nTim Robbins, Morgan Freeman, Bob Gun...",\nPlot Keywords:\n wrongful imprisonment\n|\n ...,"\nBudget:$25,000,000\n (estimated)\n","\nGross USA: $28,341,469\n","\nCumulative Worldwide Gross: $58,500,000\n ...",\nProduction Co:\n Castle Rock Entertainment \...
1,1402791.0,A,2h 55min,\nGenres:\n Crime |\n Drama\n,24 March 1972 (USA),The aging patriarch of an organized crime dyna...,\nDirector:\nFrancis Ford Coppola,"\nWriters:\nMario Puzo (screenplay by), Franci...","\nStars:\nMarlon Brando, Al Pacino, James Caan...",\nPlot Keywords:\n mafia\n|\n crime family\n|\...,"\nBudget:$6,000,000\n (estimated)\n","\nGross USA: $134,966,411, 11 May 1997\n","\nCumulative Worldwide Gross: $245,066,411\n ...","\nProduction Co:\n Paramount Pictures, Alfran ..."
2,972533.0,,3h 22min,\nGenres:\n Crime |\n Drama\n,20 December 1974 (USA),The early life and career of Vito Corleone in ...,\nDirector:\nFrancis Ford Coppola,\nWriters:\nFrancis Ford Coppola (screenplay b...,"\nStars:\nAl Pacino, Robert De Niro, Robert Du...",\nPlot Keywords:\n revenge\n|\n corrupt politi...,"\nBudget:$13,000,000\n (estimated)\n","\nGross USA: $57,300,000\n",,"\nProduction Co:\n Paramount Pictures, The Cop..."
3,2012714.0,UA,2h 32min,\nGenres:\n Action |\n Crime |\n Drama |\n Thr...,18 July 2008 (India),When the menace known as the Joker emerges fro...,\nDirector:\nChristopher Nolan,"\nWriters:\nJonathan Nolan (screenplay), Chris...","\nStars:\nChristian Bale, Heath Ledger, Aaron ...",\nPlot Keywords:\n dc comics\n|\n moral dilemm...,"\nBudget:$185,000,000\n (estimated)\n","\nGross USA: $534,858,444, 19 July 2012\n","\nCumulative Worldwide Gross: $1,004,558,444, ...","\nProduction Co:\n Warner Bros., Legendary Ent..."
4,576027.0,,1h 36min,\nGenres:\n Drama\n,10 April 1957 (USA),A jury holdout attempts to prevent a miscarria...,\nDirector:\nSidney Lumet,"\nWriters:\nReginald Rose (story), Reginald Ro...","\nStars:\nHenry Fonda, Lee J. Cobb, Martin Bal...",\nPlot Keywords:\n jury\n|\n dialogue driven\n...,"\nBudget:$350,000\n (estimated)\n","\nGross USA: $4,360,000\n",,\nProduction Co:\n Orion-Nova Productions \nSe...


### 3.0 Data Exploration and cleaning

In [10]:
df2.columns

Index(['Reviewers', 'Movie_rating', 'Movie_duration', 'Genre', 'Release_date',
       'Summary', 'Director', 'Writers', 'Stars', 'PlotKeywords', 'Budget',
       'Gross_USA', 'Cum_Gross_Worldwide', 'Production_Company'],
      dtype='object')

__Observation__:<br>
Most columns in the newly created dataframe still need to be formatted or cleaned.<br>
The exceptions are:
- Reviewers, Movie_rating, Movie_duration, Release_date and Summary


In [11]:
# replacing unwanted strings in Genre column
for i in range(0,len(genre)): 
    comp = re.compile(r'(Genres:|\n|\xa0)')
    stp1 = re.sub(comp,'',genre[i])
    comp1 = re.compile(r'\|')
    stp2 = re.sub(comp1,',',stp1)
    stp3=stp2.strip()
    genre[i] = stp3
print (genre[0:5])

['Drama', 'Crime, Drama', 'Crime, Drama', 'Action, Crime, Drama, Thriller', 'Drama']


In [12]:
# splitting genre into 4 columns as per requirement
genre1=[]
genre2=[]
genre3=[]
genre4=[]
for i in range (0, len(genre)):
    splt= re.split(',', genre[i])
    if len(splt)==1:
        genre1.append(splt[0])
        genre2.append(" ")
        genre3.append(" ") 
        genre4.append(" ")
    if len(splt)>1 and len(splt)<3:
        genre1.append(splt[0])
        genre2.append(splt[1])
        genre3.append(" ") 
        genre4.append(" ")
    if len(splt)>2 and len(splt)<4:
        genre1.append(splt[0])
        genre2.append(splt[1])
        genre3.append(splt[2]) 
        genre4.append(" ")
    if len(splt)>3 and len(splt)<5:
        genre1.append(splt[0])
        genre2.append(splt[1])
        genre3.append(splt[2]) 
        genre4.append(splt[3])
    if len(splt)>=5:
        genre1.append(splt[0])
        genre2.append(splt[1])
        genre3.append(splt[2]) 
        genre4.append(splt[3])

print (genre1[0:5],'\n')
print (genre2[0:5],'\n')
print (genre3[0:5],'\n')
print (genre4[0:5],'\n')

['Drama', 'Crime', 'Crime', 'Action', 'Drama'] 

[' ', ' Drama', ' Drama', ' Crime', ' '] 

[' ', ' ', ' ', ' Drama', ' '] 

[' ', ' ', ' ', ' Thriller', ' '] 
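
The chain of `if` blocks above handles each possible number of genres separately. As a sketch of the same idea, a single hypothetical helper, `split_pad`, can split the string and then pad or truncate the result to a fixed width. Unlike the cell above, it also strips the leading space that the split leaves on every item after the first.

```python
def split_pad(text, width, sep=",", fill=" "):
    """Split text on sep, strip each part, then pad or truncate to exactly width items.

    split_pad is a hypothetical helper, not part of the original notebook.
    """
    parts = [part.strip() for part in text.split(sep) if part.strip()]
    return (parts + [fill] * width)[:width]

print(split_pad("Action, Crime, Drama, Thriller", 4))
print(split_pad("Crime, Drama", 4))
print(split_pad("Drama", 4))
```

The same helper would cover the writers (width 3) and stars (width 5) columns, since only the width changes.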



In [13]:
# replacing unwanted strings in director list
for i in range(0,len(drct)): 
    comp=re.compile(r'(Director:|\n)')
    stp1=re.sub(comp, '',drct[i])
    stp2=stp1.strip()
    drct[i] = stp2
print(drct[0:5])

['Frank Darabont', 'Francis Ford Coppola', 'Francis Ford Coppola', 'Christopher Nolan', 'Sidney Lumet']


In [14]:
# replacing unwanted strings in writers list
for i in range(0,len(wrtr)): 
    comp=re.compile(r'(Writer:|Writers:|\n|more credit\xa0»|more credits\xa0»)')
    stp1=re.sub(comp,'',wrtr[i])
    comp2=re.compile(r'\|4 | \|3 | \|2 | \|1')
    stp2=re.sub(comp2,'',stp1)
    stp3=stp2.strip()
    wrtr[i]=stp3
print (wrtr[0:5])

['Stephen King (short story "Rita Hayworth and Shawshank Redemption"), Frank Darabont (screenplay)', 'Mario Puzo (screenplay by), Francis Ford Coppola (screenplay by)', 'Francis Ford Coppola (screenplay by), Mario Puzo (screenplay by)', 'Jonathan Nolan (screenplay), Christopher Nolan (screenplay)', 'Reginald Rose (story), Reginald Rose (screenplay)']


In [15]:
# Splitting writers into 3 columns as per requirement
wrtr1=[]
wrtr2=[]
wrtr3=[]
for i in range (0, len(wrtr)):
    splt= re.split(',', wrtr[i])
    if len(splt)==1:
        wrtr1.append(splt[0].strip())
        wrtr2.append(" ")
        wrtr3.append(" ")       
    if len(splt)>1 and len(splt)<3:
        wrtr1.append(splt[0].strip())
        wrtr2.append(splt[1].strip())
        wrtr3.append(" ")
    if len(splt)>2:
        wrtr1.append(splt[0].strip())
        wrtr2.append(splt[1].strip())
        wrtr3.append(splt[2].strip()) 
print (wrtr1[0:5],'\n')
print (wrtr2[0:5],'\n')
print (wrtr3[0:5],'\n')

['Stephen King (short story "Rita Hayworth and Shawshank Redemption")', 'Mario Puzo (screenplay by)', 'Francis Ford Coppola (screenplay by)', 'Jonathan Nolan (screenplay)', 'Reginald Rose (story)'] 

['Frank Darabont (screenplay)', 'Francis Ford Coppola (screenplay by)', 'Mario Puzo (screenplay by)', 'Christopher Nolan (screenplay)', 'Reginald Rose (screenplay)'] 

[' ', ' ', ' ', ' ', ' '] 
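
Splitting the writers on every comma works for the five rows shown, but it would cut a credit in two if the parenthesized role itself contained a comma. The sketch below, under that assumption, splits only on commas that fall outside parentheses. The helper name `split_outside_parens` and the second sample string are hypothetical, and nested parentheses are not handled.

```python
import re

def split_outside_parens(text):
    """Split on commas that are not inside a (...) group; nesting is not handled.

    split_outside_parens is a hypothetical helper, not part of the original notebook.
    """
    # The negative lookahead rejects a comma that is followed by a ")" with no
    # opening "(" before it, which means the comma sits inside parentheses.
    parts = re.split(r",\s*(?![^()]*\))", text)
    return [part.strip() for part in parts if part.strip()]

# Real value from the output above.
print(split_outside_parens("Mario Puzo (screenplay by), Francis Ford Coppola (screenplay by)"))
# Hypothetical value with a comma inside the parentheses.
print(split_outside_parens("Jane Doe (story, screenplay), John Roe (novel)"))
```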



In [16]:
# replacing unwanted strings in stars list

for i in range(0,len(str)): 
    comp=re.compile(r'(Stars:|\n|See full cast & crew\xa0»)')
    stp1=re.sub(comp,'',str[i])
    comp2=re.compile(r'\|')
    stp2=re.sub(comp2,'',stp1)
    stp3=stp2.strip()
    str[i]=stp3
print(str[0:5])

['Tim Robbins, Morgan Freeman, Bob Gunton', 'Marlon Brando, Al Pacino, James Caan', 'Al Pacino, Robert De Niro, Robert Duvall', 'Christian Bale, Heath Ledger, Aaron Eckhart', 'Henry Fonda, Lee J. Cobb, Martin Balsam']


In [17]:
# Splitting stars into 5 columns as per requirement
str1=[]
str2=[]
str3=[]
str4=[]
str5=[]
for i in range (0, len(str)):
    splt= re.split(',', str[i])
    if len(splt)==1:
        str1.append(splt[0])
        str2.append(" ")
        str3.append(" ") 
        str4.append(" ")
        str5.append(" ")
        
    if len(splt)>1 and len(splt)<3:
        str1.append(splt[0])
        str2.append(splt[1])
        str3.append(" ") 
        str4.append(" ")
        str5.append(" ")
        
    if len(splt)>2 and len(splt)<4:
        str1.append(splt[0])
        str2.append(splt[1])
        str3.append(splt[2]) 
        str4.append(" ")
        str5.append(" ")
        
    if len(splt)>3 and len(splt)<5:
        str1.append(splt[0])
        str2.append(splt[1])
        str3.append(splt[2]) 
        str4.append(splt[3])
        str5.append(" ")
        
    if len(splt)>4:  # five or more names: keep the first five
        str1.append(splt[0])
        str2.append(splt[1])
        str3.append(splt[2]) 
        str4.append(splt[3])
        str5.append(splt[4])

print (str1[0:5],'\n')   
print (str2[0:5],'\n')  
print (str3[0:5],'\n')  
print (str4[0:5],'\n')  
print (str5[0:5],'\n')  

['Tim Robbins', 'Marlon Brando', 'Al Pacino', 'Christian Bale', 'Henry Fonda'] 

[' Morgan Freeman', ' Al Pacino', ' Robert De Niro', ' Heath Ledger', ' Lee J. Cobb'] 

[' Bob Gunton', ' James Caan', ' Robert Duvall', ' Aaron Eckhart', ' Martin Balsam'] 

[' ', ' ', ' ', ' ', ' '] 

[' ', ' ', ' ', ' ', ' '] 



In [18]:
# replacing unwanted strings in plot keywords list
for i in range(0,len(kywrd)): 
    comp=re.compile(r'(Plot Keywords:|\n|\xa0See All|\xa0»)')
    stp1=re.sub(comp,'',kywrd[i])
    # r'((|))' matches only the empty string, so this substitution leaves stp1 unchanged
    comp2=re.compile(r'((|))')
    stp2=re.sub(comp2,'',stp1)
    stp3=stp2.strip()
    kywrd[i]=stp3
print(kywrd[0:5])

['wrongful imprisonment| escape from prison| based on the works of stephen king| prison| voice over narration| (296)', 'mafia| crime family| patriarch| organized crime| rise to power| (237)', 'revenge| corrupt politician| bloody body of child| mafia| 1950s| (274)', 'dc comics| moral dilemma| psychopath| clown| scarred face| (639)', 'jury| dialogue driven| courtroom| single set production| trial| (78)']
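
The cleaned keyword strings still contain the `|` separators and a trailing parenthesized number, which appears to be what remains of the "See All" link. A minimal sketch, assuming that shape, turns each string into a list of keywords plus the number. The helper name `parse_keywords` is hypothetical.

```python
import re

def parse_keywords(text):
    """Return (keywords, total) from a cleaned 'a| b| c| (N)' string.

    parse_keywords is a hypothetical helper, not part of the original notebook.
    total is None when no parenthesized number is present.
    """
    total = None
    keywords = []
    for part in text.split("|"):
        part = part.strip()
        count = re.fullmatch(r"\((\d+)\)", part)
        if count:
            total = int(count.group(1))
        elif part:
            keywords.append(part)
    return keywords, total

print(parse_keywords("jury| dialogue driven| courtroom| single set production| trial| (78)"))
```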


In [19]:
# replacing unwanted strings in budget list
for i in range(0,len(budget)): 
    comp=re.compile(r'(Budget:|\n|\()|\)|(estimated)')
    stp1=re.sub(comp,'',budget[i])
    stp2=stp1.strip()        
    budget[i]=stp2
print(budget[0:15])

# replacing currency symbols with three-letter currency codes
for i in range(0,len(budget)): 
    cur1=budget[i].replace('$', 'USD ')
    cur3=cur1.replace('€', 'EUR ')
    cur4=cur3.replace('£', 'GBP ')
    budget[i]=cur4
print(budget[0:15],'\n')

# obtain currencies as separate list
budget_curr=[]
for i in range(0,len(budget)):
    stp1= budget[i][0:3]
    budget_curr.append(stp1)
    
print(budget_curr[0:15],'\n')

# slicing only the numerical value from budget list
budget_val=[]
for i in range(0,len(budget)):
    stp1 = re.sub(',','',budget[i])
    ln = len(stp1)
    if ln>3:
        stp2= stp1[4:ln]
    else:
        # blank or too-short entry: use "0" instead of reusing the previous value
        stp2="0"
    budget_val.append(float(stp2))
    
print(budget_val[0:15])

['$25,000,000', '$6,000,000', '$13,000,000', '$185,000,000', '$350,000', '$22,000,000', '$94,000,000', '$8,000,000', '$1,200,000', '$63,000,000', '$93,000,000', '$55,000,000', '$18,000,000', '$160,000,000', '$94,000,000']
['USD 25,000,000', 'USD 6,000,000', 'USD 13,000,000', 'USD 185,000,000', 'USD 350,000', 'USD 22,000,000', 'USD 94,000,000', 'USD 8,000,000', 'USD 1,200,000', 'USD 63,000,000', 'USD 93,000,000', 'USD 55,000,000', 'USD 18,000,000', 'USD 160,000,000', 'USD 94,000,000'] 

['USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD'] 

[25000000.0, 6000000.0, 13000000.0, 185000000.0, 350000.0, 22000000.0, 94000000.0, 8000000.0, 1200000.0, 63000000.0, 93000000.0, 55000000.0, 18000000.0, 160000000.0, 94000000.0]
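
The budget cell works in four passes: strip the label, replace the symbol, slice the currency, slice the value. The sketch below shows how one regular expression could return both parts at once. It assumes the amount is preceded either by one of the three symbols handled above or by a three-letter code. The helper name `parse_money` and the `INR` sample are hypothetical. It returns `None` for a missing budget instead of `0.0`, so that missing values are not mistaken for a budget of zero.

```python
import re

SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}

def parse_money(text):
    """Return (currency, amount) from strings like '$25,000,000'.

    parse_money is a hypothetical helper, not part of the original notebook.
    Returns (None, None) when no amount is present.
    """
    match = re.search(r"([$€£]|[A-Z]{3})\s*([\d,]*\d)", text)
    if not match:
        return None, None
    currency = SYMBOLS.get(match.group(1), match.group(1))
    amount = float(match.group(2).replace(",", ""))
    return currency, amount

print(parse_money("\nBudget:$25,000,000\n (estimated)\n"))  # raw scraped shape
print(parse_money("INR 550,000,000"))                       # hypothetical three-letter code
print(parse_money(" "))                                     # missing budget
```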


In [20]:
# replacing unwanted strings in Gross USA list
for i in range(0,len(grsusa)): 
    comp=re.compile(r'(Gross USA:|\n)')
    stp1=re.sub(comp,'',grsusa[i])
    com1=re.compile(r'\xa0')
    stp0=re.sub(com1,'',stp1)
    stp2=stp0.strip()
    grsusa[i]=stp2

print(grsusa[0:15],'\n')
    
# replacing currency symbols with three-letter currency codes
for i in range(0,len(grsusa)): 
    cur1=grsusa[i].replace('$', 'USD ')
    cur3=cur1.replace('€', 'EUR ')
    cur4=cur3.replace('£', 'GBP ')
    grsusa[i]=cur4


for i in range(0,len(grsusa)): 
    comp=re.compile(r'\xa0')
    stp1=re.sub(comp,' ',grsusa[i])
    grsusa[i] = stp1

print(grsusa[0:15],'\n')

# obtain currencies as separate list
grsusa_curr=[]
for i in range(0,len(grsusa)):
    stp1= grsusa[i][0:3]
    grsusa_curr.append(stp1)

print(grsusa_curr[0:15],'\n')

# slicing only the numerical values
grsusa_val=[]
for i in range(0,len(grsusa)):
    stp1 = re.sub(',','',grsusa[i])
    ln = len(stp1)
    if ln>2:
        stp2= stp1[4:ln]
    else:
        # blank or too-short entry: use '0' instead of reusing the previous value
        stp2='0'
    grsusa_val.append(stp2)
print(grsusa[0:15],'\n')
print(grsusa_val[0:15],'\n')


for j in range(0,len(grsusa_val)):
    end = grsusa_val[j].find(' ')
    if end>0:
        stp1 = grsusa_val[j][0:end]
    if end<0:
        stp1 = grsusa_val[j]
    grsusa_val[j]=float(stp1)

print(grsusa_val[0:15])

    

['$28,341,469', '$134,966,411, 11 May 1997', '$57,300,000', '$534,858,444, 19 July 2012', '$4,360,000', '$96,067,179', '$377,845,905, 28 June 2011', '$107,928,762', '$6,100,000', '$37,030,102', '$315,544,750, 14 June 2011', '$330,252,182', '$290,475,067, 31 December 1997', '$292,576,195, 6 January 2011', '$342,551,365, 21 June 2011'] 

['USD 28,341,469', 'USD 134,966,411, 11 May 1997', 'USD 57,300,000', 'USD 534,858,444, 19 July 2012', 'USD 4,360,000', 'USD 96,067,179', 'USD 377,845,905, 28 June 2011', 'USD 107,928,762', 'USD 6,100,000', 'USD 37,030,102', 'USD 315,544,750, 14 June 2011', 'USD 330,252,182', 'USD 290,475,067, 31 December 1997', 'USD 292,576,195, 6 January 2011', 'USD 342,551,365, 21 June 2011'] 

['USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD', 'USD'] 

['USD 28,341,469', 'USD 134,966,411, 11 May 1997', 'USD 57,300,000', 'USD 534,858,444, 19 July 2012', 'USD 4,360,000', 'USD 96,067,179', 'USD 377,845,905, 28 June 2011', '
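
Some gross figures carry a trailing date, as in `'$134,966,411, 11 May 1997'`, which the cell above discards by cutting at the first space. As a sketch, assuming the date is always separated from the amount by a comma followed by a space, both parts can be kept. The helper name `parse_gross` is hypothetical, and `%B` relies on English month names, which is the default in an unmodified Python locale.

```python
import re
from datetime import datetime

def parse_gross(text):
    """Split '$134,966,411, 11 May 1997' into (amount, date).

    parse_gross is a hypothetical helper, not part of the original notebook.
    amount is None for a blank entry; date is None when no date follows.
    """
    # Thousands separators have no space after them, so the first ", "
    # marks the boundary between the amount and the date.
    amount_part, _, date_part = text.partition(", ")
    digits = re.sub(r"\D", "", amount_part)
    amount = float(digits) if digits else None
    date_part = date_part.strip()
    date = datetime.strptime(date_part, "%d %B %Y").date() if date_part else None
    return amount, date

print(parse_gross("$134,966,411, 11 May 1997"))
print(parse_gross("USD 28,341,469"))
print(parse_gross(""))
```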

In [21]:
# replacing unwanted strings in Cumulative worldwide Gross list
for i in range(0,len(cumgrs)): 
    comp=re.compile(r'(Cumulative Worldwide Gross:|\n)')
    stp1=re.sub(comp,'',cumgrs[i])
    stp2=stp1.strip()
    cumgrs[i]=stp2
print(cumgrs[0:15],'\n')

# replacing currency symbols with three-letter currency codes
for i in range(0,len(cumgrs)): 
    cur1=cumgrs[i].replace('$', 'USD ')
    cur3=cur1.replace('€', 'EUR ')
    cur4=cur3.replace('£', 'GBP ')
    cumgrs[i]=cur4
print(cumgrs[0:15],'\n')

for i in range(0,len(cumgrs)): 
    comp=re.compile(r'\xa0')
    stp1=re.sub(comp,' ',cumgrs[i])
    cumgrs[i] = stp1

print(cumgrs[0:15],'\n')

# obtain currencies as separate list
cumgrs_curr=[]
for i in range(0,len(cumgrs)):
    stp1= cumgrs[i][0:3]
    cumgrs_curr.append(stp1)

print(cumgrs_curr[0:15],'\n')

# slicing only the numerical values
cumgrs_val=[]
for i in range(0,len(cumgrs)):
    stp1 = re.sub(',','',cumgrs[i])
    ln = len(stp1)
    if ln>2:
        stp2= stp1[4:ln]
    else:
        # blank or too-short entry: use '0' instead of reusing the previous value
        stp2='0'
    cumgrs_val.append(stp2)
print(cumgrs[0:15],'\n')
print(cumgrs_val[0:15],'\n')


for j in range(0,len(cumgrs_val)):
    end = cumgrs_val[j].find(' ')
    if end>0:
        stp1 = cumgrs_val[j][0:end]
    if end<0:
        stp1 = cumgrs_val[j]
    cumgrs_val[j]=float(stp1)

print(cumgrs_val[0:15])


['$58,500,000', '$245,066,411', '', '$1,004,558,444, 19 July 2012', '', '$221,000,000', '$1,119,929,521, 25 November 2011', '$213,928,762', '', '$71,000,000', '$871,530,324, 25 November 2011', '$677,945,399', '$247,916,602', '$825,532,764, 6 January 2011', '$926,047,111, 25 November 2011'] 

['USD 58,500,000', 'USD 245,066,411', '', 'USD 1,004,558,444, 19 July 2012', '', 'USD 221,000,000', 'USD 1,119,929,521, 25 November 2011', 'USD 213,928,762', '', 'USD 71,000,000', 'USD 871,530,324, 25 November 2011', 'USD 677,945,399', 'USD 247,916,602', 'USD 825,532,764, 6 January 2011', 'USD 926,047,111, 25 November 2011'] 

['USD 58,500,000', 'USD 245,066,411', '', 'USD 1,004,558,444, 19 July 2012', '', 'USD 221,000,000', 'USD 1,119,929,521, 25 November 2011', 'USD 213,928,762', '', 'USD 71,000,000', 'USD 871,530,324, 25 November 2011', 'USD 677,945,399', 'USD 247,916,602', 'USD 825,532,764, 6 January 2011', 'USD 926,047,111, 25 November 2011'] 

['USD', 'USD', '', 'USD', '', 'USD', 'USD', 'USD'

In [22]:
# replacing unwanted strings in Production company list
for i in range(0,len(prodcpn)): 
    comp=re.compile(r'(Production Co:|\n|See more\xa0»)')
    stp1=re.sub(comp,'',prodcpn[i])
    stp2=stp1.strip()
    prodcpn[i]=stp2
print(prodcpn[0:5])

['Castle Rock Entertainment', 'Paramount Pictures, Alfran Productions', 'Paramount Pictures, The Coppola Company', 'Warner Bros., Legendary Entertainment, Syncopy', 'Orion-Nova Productions']


__Re-Creating Dataframe__ <br>
After cleaning, the dataframe is rebuilt from the updated and split lists, as the assignment requires.

In [23]:
df3 = pd.DataFrame({'Reviewers': lst_rev,
                   'Movie_rating':cens_rat, 
                   'Movie_duration' :mov_len, 
                   'Genre1': genre1,
                   'Genre2': genre2,
                   'Genre3': genre3,
                   'Genre4': genre4,
                   'Release_date': rldt,
                   'Summary': sumry,
                   'Director': drct,
                   'Writer1': wrtr1,
                   'Writer2': wrtr2,
                   'Writer3': wrtr3,
                   'Stars1': str1,
                   'Stars2': str2,
                   'Stars3': str3,
                   'Stars4': str4,
                   'Stars5': str5,
                   'PlotKeywords':kywrd,
                   'Budget_curr': budget_curr,
                   'Budget_val': budget_val,
                   'Gross_USA_curr': grsusa_curr,
                   'Gross_USA_val': grsusa_val,
                   'Cum_Gross_Worldwide_curr': cumgrs_curr,
                   'Cum_Gross_Worldwide_val': cumgrs_val,
                   'Production_Company': prodcpn})

In [24]:
df3.head(10)

Unnamed: 0,Reviewers,Movie_rating,Movie_duration,Genre1,Genre2,Genre3,Genre4,Release_date,Summary,Director,...,Stars4,Stars5,PlotKeywords,Budget_curr,Budget_val,Gross_USA_curr,Gross_USA_val,Cum_Gross_Worldwide_curr,Cum_Gross_Worldwide_val,Production_Company
0,2045328.0,A,2h 22min,Drama,,,,14 October 1994 (USA),Two imprisoned men bond over a number of years...,Frank Darabont,...,,,wrongful imprisonment| escape from prison| bas...,USD,25000000.0,USD,28341469.0,USD,58500000.0,Castle Rock Entertainment
1,1402791.0,A,2h 55min,Crime,Drama,,,24 March 1972 (USA),The aging patriarch of an organized crime dyna...,Francis Ford Coppola,...,,,mafia| crime family| patriarch| organized crim...,USD,6000000.0,USD,134966411.0,USD,245066400.0,"Paramount Pictures, Alfran Productions"
2,972533.0,,3h 22min,Crime,Drama,,,20 December 1974 (USA),The early life and career of Vito Corleone in ...,Francis Ford Coppola,...,,,revenge| corrupt politician| bloody body of ch...,USD,13000000.0,USD,57300000.0,,0.0,"Paramount Pictures, The Coppola Company"
3,2012714.0,UA,2h 32min,Action,Crime,Drama,Thriller,18 July 2008 (India),When the menace known as the Joker emerges fro...,Christopher Nolan,...,,,dc comics| moral dilemma| psychopath| clown| s...,USD,185000000.0,USD,534858444.0,USD,1004558000.0,"Warner Bros., Legendary Entertainment, Syncopy"
4,576027.0,,1h 36min,Drama,,,,10 April 1957 (USA),A jury holdout attempts to prevent a miscarria...,Sidney Lumet,...,,,jury| dialogue driven| courtroom| single set p...,USD,350000.0,USD,4360000.0,,0.0,Orion-Nova Productions
5,1058362.0,A,3h 15min,Biography,Drama,History,,4 February 1994 (USA),"In German-occupied Poland during World War II,...",Steven Spielberg,...,,,accountant| champagne| villa| womanizer| soap|...,USD,22000000.0,USD,96067179.0,USD,221000000.0,"Universal Pictures, Amblin Entertainment"
6,1456606.0,PG-13,3h 21min,Adventure,Drama,Fantasy,,6 February 2004 (India),Gandalf and Aragorn lead the World of Men agai...,Peter Jackson,...,,,orc| battle| journey| hobbit| ring| (245),USD,94000000.0,USD,377845905.0,USD,1119930000.0,"New Line Cinema, WingNut Films, The Saul Zaent..."
7,1598144.0,A,2h 34min,Crime,Drama,,,14 October 1994 (USA),"The lives of two mob hitmen, a boxer, a gangst...",Quentin Tarantino,...,,,nonlinear timeline| black comedy| overdose| bo...,USD,8000000.0,USD,107928762.0,USD,213928800.0,"Miramax, A Band Apart, Jersey Films"
8,607098.0,,2h 41min,Western,,,,23 December 1966 (Italy),A bounty hunting scam joins two men in an unea...,Sergio Leone,...,,,spaghetti western| new mexico territory| sonor...,USD,1200000.0,USD,6100000.0,,0.0,"Produzioni Europee Associate (PEA), Arturo Gon..."
9,1637140.0,A,2h 19min,Drama,,,,15 October 1999 (USA),An insomniac office worker and a devil-may-car...,David Fincher,...,,,surprise ending| fighting| multiple personalit...,USD,63000000.0,USD,37030102.0,USD,71000000.0,"Fox 2000 Pictures, Regency Enterprises, Linson..."


### 4.0 Concatenate dataframes and export to CSV

In [25]:
df4 = pd.concat([df, df3], axis=1, sort=False)

In [26]:
df4.columns

Index(['Name of the movie', 'Link', 'Year released', 'IMDB rating',
       'Reviewers', 'Movie_rating', 'Movie_duration', 'Genre1', 'Genre2',
       'Genre3', 'Genre4', 'Release_date', 'Summary', 'Director', 'Writer1',
       'Writer2', 'Writer3', 'Stars1', 'Stars2', 'Stars3', 'Stars4', 'Stars5',
       'PlotKeywords', 'Budget_curr', 'Budget_val', 'Gross_USA_curr',
       'Gross_USA_val', 'Cum_Gross_Worldwide_curr', 'Cum_Gross_Worldwide_val',
       'Production_Company'],
      dtype='object')

In [27]:
df4.head(10)

Unnamed: 0,Name of the movie,Link,Year released,IMDB rating,Reviewers,Movie_rating,Movie_duration,Genre1,Genre2,Genre3,...,Stars4,Stars5,PlotKeywords,Budget_curr,Budget_val,Gross_USA_curr,Gross_USA_val,Cum_Gross_Worldwide_curr,Cum_Gross_Worldwide_val,Production_Company
0,The Shawshank Redemption,/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1994.0,9.2,2045328.0,A,2h 22min,Drama,,,...,,,wrongful imprisonment| escape from prison| bas...,USD,25000000.0,USD,28341469.0,USD,58500000.0,Castle Rock Entertainment
1,The Godfather,/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1972.0,9.2,1402791.0,A,2h 55min,Crime,Drama,,...,,,mafia| crime family| patriarch| organized crim...,USD,6000000.0,USD,134966411.0,USD,245066400.0,"Paramount Pictures, Alfran Productions"
2,The Godfather: Part II,/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1974.0,9.0,972533.0,,3h 22min,Crime,Drama,,...,,,revenge| corrupt politician| bloody body of ch...,USD,13000000.0,USD,57300000.0,,0.0,"Paramount Pictures, The Coppola Company"
3,The Dark Knight,/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,2008.0,9.0,2012714.0,UA,2h 32min,Action,Crime,Drama,...,,,dc comics| moral dilemma| psychopath| clown| s...,USD,185000000.0,USD,534858444.0,USD,1004558000.0,"Warner Bros., Legendary Entertainment, Syncopy"
4,12 Angry Men,/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1957.0,8.9,576027.0,,1h 36min,Drama,,,...,,,jury| dialogue driven| courtroom| single set p...,USD,350000.0,USD,4360000.0,,0.0,Orion-Nova Productions
5,Schindler's List,/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1993.0,8.9,1058362.0,A,3h 15min,Biography,Drama,History,...,,,accountant| champagne| villa| womanizer| soap|...,USD,22000000.0,USD,96067179.0,USD,221000000.0,"Universal Pictures, Amblin Entertainment"
6,The Lord of the Rings: The Return of the King,/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,2003.0,8.9,1456606.0,PG-13,3h 21min,Adventure,Drama,Fantasy,...,,,orc| battle| journey| hobbit| ring| (245),USD,94000000.0,USD,377845905.0,USD,1119930000.0,"New Line Cinema, WingNut Films, The Saul Zaent..."
7,Pulp Fiction,/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1994.0,8.9,1598144.0,A,2h 34min,Crime,Drama,,...,,,nonlinear timeline| black comedy| overdose| bo...,USD,8000000.0,USD,107928762.0,USD,213928800.0,"Miramax, A Band Apart, Jersey Films"
8,"Il buono, il brutto, il cattivo",/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1966.0,8.8,607098.0,,2h 41min,Western,,,...,,,spaghetti western| new mexico territory| sonor...,USD,1200000.0,USD,6100000.0,,0.0,"Produzioni Europee Associate (PEA), Arturo Gon..."
9,Fight Club,/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_rd...,1999.0,8.8,1637140.0,A,2h 19min,Drama,,,...,,,surprise ending| fighting| multiple personalit...,USD,63000000.0,USD,37030102.0,USD,71000000.0,"Fox 2000 Pictures, Regency Enterprises, Linson..."


__Export dataframe__:

In [29]:
df4.to_csv('results/imdb_web_scraping.csv', index=False)
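
The export writes into a `results/` folder, which presumably has to exist before the call. The sketch below shows the two points that matter for this file using only the standard library, without pandas: creating the parent folder first, and checking that a title containing commas survives a write-and-read round trip. The helper name `export_rows` and the sample file name are hypothetical.

```python
import csv
import os
import tempfile

def export_rows(path, header, rows):
    """Write rows to path as CSV, creating the parent directory first if needed.

    export_rows is a hypothetical helper, not part of the original notebook.
    """
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "results", "imdb_sample.csv")
    export_rows(target, ["Name of the movie", "IMDB rating"],
                [["Il buono, il brutto, il cattivo", 8.8]])
    with open(target, newline="", encoding="utf-8") as handle:
        round_trip = list(csv.reader(handle))

print(round_trip)
```

The reader returns every field as a string, so numeric columns need to be converted again after loading.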

----