## Web Scraping
**Web scraping is the extraction of data from websites.**\
In web scraping, we download a web page and extract information from it.\
Extraction typically involves the following activities:
1. The content of the page is parsed, searched, and reformatted.
2. Its data is copied into a spreadsheet or loaded into a database.
3. A web scraper typically pulls specific information out of a web page for a defined purpose.
4. For example, it might find and copy names and telephone numbers, companies and their URLs, or e-mail addresses into a list for a product promotion.
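
The activities above can be sketched on a small inline HTML string, so nothing has to be downloaded. The markup and the names `demo_html`, `demo_soup` and `contacts` are made up for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; a real scraper would download this first.
demo_html = """
<ul>
  <li class="contact"><span class="name">Asha</span> <span class="phone">555-0101</span></li>
  <li class="contact"><span class="name">Ben</span> <span class="phone">555-0102</span></li>
</ul>
"""

# Parse the HTML into a searchable tree.
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Search the tree and reformat each match into a (name, phone) pair.
contacts = [
    (
        li.find("span", class_="name").get_text(strip=True),
        li.find("span", class_="phone").get_text(strip=True),
    )
    for li in demo_soup.find_all("li", class_="contact")
]
print(contacts)
```

The resulting list of tuples is the kind of structure that can then be copied into a spreadsheet or a database table.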

In [697]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [698]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


In [886]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import json
import pandas as pd
import numpy as np
import time
import sys

### Open the URL for the list of the top 50 films of all time.

In [774]:
movie = urlopen('https://www.imdb.com/list/ls053181721/')
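
Some sites reject requests that do not look like they come from a browser. Whether this page does is an assumption, but a slightly more defensive download might look like the sketch below. The helper names `build_request` and `fetch_html` and the User-Agent string are hypothetical.

```python
from urllib.request import Request, urlopen

LIST_URL = "https://www.imdb.com/list/ls053181721/"


def build_request(url):
    """Return a Request carrying a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes, closing the connection afterwards."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html() does.
request = build_request(LIST_URL)
print(request.full_url)
```

Calling `fetch_html(LIST_URL)` would return bytes that can be passed to `BeautifulSoup` in place of `movie.read()`.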

### Parse the downloaded page with BeautifulSoup.

In [775]:
soup = BeautifulSoup(movie.read(), 'html.parser')

In [740]:
# soup now holds the parsed HTML of https://www.imdb.com/list/ls053181721/.

soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>Top 50 Best Movies of All Time - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https://www.imdb.com/list/ls053181721/" rel="canonical"/>
<meta content="http://www.imdb.com/list/ls053181721/

In [741]:
soup.findAll('div', {'class': "article listo"})

[<div class="article listo">
 <div class="overflow-menu">
 <div class="circle">
 <div class="vertical-ellipsis">
 <svg fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
 <path d="M0 0h24v24H0z" fill="none"></path>
 <path d="M12 8c1.1 0 2-.9 2-2s-.9-2-2-2-2 .9-2 2 .9 2 2 2zm0 2c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 
                             2-2-.9-2-2-2zm0 6c-1.1 0-2 .9-2 2s.9 2 2 2 2-.9 2-2-.9-2-2-2z"></path>
 </svg>
 </div>
 </div>
 <div class="pop-up-dialog">
 <ul class="pop-up-menu-list-items">
 <li><a class="pop-up-menu-list-item-link" href="/list/ls053181721/copy">
 Copy from this list
 </a></li>
 <li><a class="pop-up-menu-list-item-link" href="/list/ls053181721/export">
 Export
 </a></li>
 <li><a class="pop-up-menu-list-item-link" href="/listo/report?list=ls053181721">
 Report this list
 </a></li>
 </ul>
 </div>
 </div>
 <h1 class="header list-name">Top 50 Best Movies of All Time</h1>
 <span class="list-overview text-small" id="list-overvie

In [746]:
soup.findAll('div', {'class': 'lister-item-content'})

[<div class="lister-item-content">
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">1.</span>
 <a href="/title/tt0109830/">Forrest Gump</a>
 <span class="lister-item-year text-muted unbold">(1994)</span>
 </h3>
 <p class="text-muted text-small">
 <span class="certificate">UA</span>
 <span class="ghost">|</span>
 <span class="runtime">142 min</span>
 <span class="ghost">|</span>
 <span class="genre">
 Drama, Romance            </span>
 </p>
 <div class="ipl-rating-widget">
 <div class="ipl-rating-star small">
 <span class="ipl-rating-star__star">
 <svg class="ipl-icon ipl-star-icon" fill="#000000" height="24" viewbox="0 0 24 24" width="24" xmlns="http://www.w3.org/2000/svg">
 <path d="M0 0h24v24H0z" fill="none"></path>
 <path d="M12 17.27L18.18 21l-1.64-7.03L22 9.24l-7.19-.61L12 2 9.19 8.63 2 9.24l5.46 4.73L5.82 21z"></path>
 <path d="M0 0h24v24H0z" fill="none"></path>
 </svg>
 </span>
 <span class="ipl-rating-star__rating">8.8</span>
 </div>
 <div c

In [762]:
pd.DataFrame(soup.findAll('h1'))

Unnamed: 0,0
0,Top 50 Best Movies of All Time


In [745]:
soup.findAll('h3', {'class': 'lister-item-header'})

[<h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">1.</span>
 <a href="/title/tt0109830/">Forrest Gump</a>
 <span class="lister-item-year text-muted unbold">(1994)</span>
 </h3>,
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">2.</span>
 <a href="/title/tt0111161/">The Shawshank Redemption</a>
 <span class="lister-item-year text-muted unbold">(1994)</span>
 </h3>,
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">3.</span>
 <a href="/title/tt1659337/">The Perks of Being a Wallflower</a>
 <span class="lister-item-year text-muted unbold">(2012)</span>
 </h3>,
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">4.</span>
 <a href="/title/tt0468569/">The Dark Knight</a>
 <span class="lister-item-year text-muted unbold">(2008)</span>
 </h3>,
 <h3 class="lister-item-header">
 <span class="lister-item-index unbold text-primary">5.</span>
 <a href="/tit

In [704]:
soup.findAll('div', {'class': 'lister list detail sub-list'})


[<div class="lister list detail sub-list">
 <div class="header filmosearch">
 <div class="nav">
 <div class="lister-controls float-right lister-activated">
 <div class="lister-control-group">
     Sort by: <br/>
 <select class="lister-sort-by" name="sort">
 <option selected="selected" value="list_order:ascending">List Order</option>
 <option value="moviemeter:ascending">Popularity</option>
 <option value="alpha:ascending">Alphabetical</option>
 <option value="user_rating:descending">IMDb Rating</option>
 <option value="num_votes:descending">Number of Votes</option>
 <option value="release_date:descending">Release Date</option>
 <option value="runtime:descending">Runtime</option>
 <option value="date_added:descending">Date Added</option>
 </select>
 <span class="global-sprite lister-sort-reverse descending" data-sort="list_order:descending" title="Ascending order"></span>
 </div>
 <div class="lister-control-group">
     View: <br/>
 <span +="" class="global-sprite lister-mode grid" data

In [705]:
soup.findAll('p', {'class': 'text-muted text-small'})

[<p class="text-muted text-small">
 <span class="certificate">UA</span>
 <span class="ghost">|</span>
 <span class="runtime">142 min</span>
 <span class="ghost">|</span>
 <span class="genre">
 Drama, Romance            </span>
 </p>,
 <p class="text-muted text-small">
     Director:
 <a href="/name/nm0000709/">Robert Zemeckis</a>
 <span class="ghost">|</span> 
     Stars:
 <a href="/name/nm0000158/">Tom Hanks</a>, 
 <a href="/name/nm0000705/">Robin Wright</a>, 
 <a href="/name/nm0000641/">Gary Sinise</a>, 
 <a href="/name/nm0000398/">Sally Field</a>
 </p>,
 <p class="text-muted text-small">
 <span class="text-muted">Votes:</span>
 <span data-value="1970787" name="nv">1,970,787</span>
 <span class="ghost">|</span> <span class="text-muted">Gross:</span>
 <span data-value="330,252,182" name="nv">$330.25M</span>
 </p>,
 <p class="text-muted text-small">
 <span class="certificate">A</span>
 <span class="ghost">|</span>
 <span class="runtime">142 min</span>
 <span class="ghost">|</span>
 <sp

In [706]:
# The HTML shown above can also be inspected in the browser's developer tools.

# Press Ctrl+Shift+I on https://www.imdb.com/list/ls053181721/ to view the page's HTML.

In [707]:
# Press Ctrl+Shift+C to pick an element on the page and see its tag in the HTML.
# Alternatively, right-click the item whose HTML you want and choose "Inspect".
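
Once the tag and class of an element are known from the developer tools, they translate directly into a search. The sketch below runs on a fragment copied from the output above and shows two equivalent lookups; the names `snippet`, `fragment`, `by_find` and `by_css` are illustrative.

```python
from bs4 import BeautifulSoup

# Fragment copied from the page output shown earlier.
snippet = """
<h3 class="lister-item-header">
<span class="lister-item-index unbold text-primary">1.</span>
<a href="/title/tt0109830/">Forrest Gump</a>
<span class="lister-item-year text-muted unbold">(1994)</span>
</h3>
"""
fragment = BeautifulSoup(snippet, "html.parser")

# Lookup by tag name and class, as in the rest of this notebook.
by_find = fragment.find("h3", class_="lister-item-header").find("a").get_text()

# The same lookup written as a CSS selector.
by_css = fragment.select_one("h3.lister-item-header > a").get_text()

print(by_find, "|", by_css)
```

Both lookups return the link text of the header; which style to use is largely a matter of taste.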

In [708]:
# <h1 class="header list-name">Top 50 Best Movies of All Time</h1>

pd.DataFrame(soup.findAll('h1', {'class': 'header list-name'}), columns= ['Header'])

Unnamed: 0,Header
0,Top 50 Best Movies of All Time


In [709]:
# Read the movie names from the <a> tag inside each <h3> header.

In [792]:
movie_name = []

# Each movie sits in its own "lister-item-content" div; the title is the link text inside its <h3>.
for i in soup.findAll('div', {'class': "lister-item-content"}):
    for h in i.findAll('h3'):
        for a in h.findAll('a'):
            movie_name.append(a.text)
title = pd.DataFrame(movie_name, columns=['movie_name'])
title

Unnamed: 0,movie_name
0,Forrest Gump
1,The Shawshank Redemption
2,The Perks of Being a Wallflower
3,The Dark Knight
4,Changeling
5,This Boy's Life
6,It's a Wonderful Life
7,The Silence of the Lambs
8,8 Mile
9,The Breakfast Club


In [796]:
title.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   movie_name  50 non-null     object
dtypes: object(1)
memory usage: 528.0+ bytes


In [793]:
movie_link = []

# The href attribute is site-relative (e.g. /title/tt0109830/), so prepend the host to get a full URL.
for l1 in soup.findAll('div', {'class': "lister-item-content"}):
    for l2 in l1.findAll('h3'):
        for l3 in l2.findAll('a'):
            movie_link.append('http://www.imdb.com' + l3['href'])
weblink = pd.DataFrame(movie_link, columns=['movie_link'])
weblink

Unnamed: 0,movie_link
0,http://www.imdb.com/title/tt0109830/
1,http://www.imdb.com/title/tt0111161/
2,http://www.imdb.com/title/tt1659337/
3,http://www.imdb.com/title/tt0468569/
4,http://www.imdb.com/title/tt0824747/
5,http://www.imdb.com/title/tt0108330/
6,http://www.imdb.com/title/tt0038650/
7,http://www.imdb.com/title/tt0102926/
8,http://www.imdb.com/title/tt0298203/
9,http://www.imdb.com/title/tt0088847/
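
As an alternative to string concatenation, the standard library can resolve a relative `href` against the page URL. This is only a sketch; `PAGE_URL`, `hrefs` and `absolute_links` are illustrative names, and the result uses `https` because the page URL does.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.imdb.com/list/ls053181721/"

# Relative links as they appear in the href attributes above.
hrefs = ["/title/tt0109830/", "/title/tt0111161/"]

# urljoin keeps the scheme and host of the page and replaces the path.
absolute_links = [urljoin(PAGE_URL, href) for href in hrefs]
print(absolute_links)
```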


In [795]:
weblink.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   movie_link  50 non-null     object
dtypes: object(1)
memory usage: 528.0+ bytes


In [822]:
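# Note: the name `type` shadows Python's built-in type(); it is kept here because a later cell refers to it.
type = pd.DataFrame(soup.findAll('span', {'class': 'genre'}), columns=['type'])
type.T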

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
type,"\nDrama, Romance",\nDrama,"\nDrama, Romance","\nAction, Crime, Drama","\nBiography, Crime, Drama","\nBiography, Drama","\nDrama, Family, Fantasy","\nCrime, Drama, Thriller","\nDrama, Music","\nComedy, Drama",...,"\nAction, Adventure, Sci-Fi",\nDrama,"\nComedy, Drama, Romance","\nDrama, Thriller, War","\nAdventure, Drama, Thriller","\nDrama, Thriller","\nAction, Adventure, Drama","\nAction, Crime, Drama","\nCrime, Drama, Thriller","\nBiography, Drama, Sport"
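
The genre strings still carry a leading newline and padding. A minimal clean-up sketch, using a few of the raw values above (the helper name `split_genres` is hypothetical):

```python
# Raw genre strings as scraped, including the newline and trailing spaces.
raw_genres = ["\nDrama, Romance            ", "\nDrama", "\nAction, Crime, Drama"]


def split_genres(text):
    """Split a comma-separated genre string into a list of trimmed names."""
    return [part.strip() for part in text.split(",") if part.strip()]


clean_genres = [split_genres(text) for text in raw_genres]
print(clean_genres)
```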


In [820]:
year = pd.DataFrame(soup.findAll('span', {'class': 'lister-item-year text-muted unbold'}), columns=['release_year'])
year.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
release_year,(1994),(1994),(2012),(2008),(2008),(1993),(1946),(1991),(2002),(1985),...,(2010),(1997),(2011),(I) (2009),(2006),(1992),(2000),(2009),(2008),(2006)
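
The year values include parentheses, and some carry a prefix such as `(I)`. A small sketch that pulls out the four-digit year as an integer (the helper name `parse_year` is hypothetical):

```python
import re


def parse_year(text):
    """Return the four-digit year found in parentheses, or None if there is none."""
    match = re.search(r"\((\d{4})\)", text)
    return int(match.group(1)) if match else None


years = [parse_year(text) for text in ["(1994)", "(I) (2009)", ""]]
print(years)
```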


In [824]:
metascore = pd.DataFrame(soup.findAll('span', {'class': 'metascore'}), columns=['score'])
metascore.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
score,82,81,67,84,63,60,89,85,77,66,...,74,85,72,58,64,62,67,34,47,58


In [826]:
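# Note: this selects the Metascore block, so the 'rating' column repeats the Metascore rather than the IMDb user rating.
rating = pd.DataFrame(soup.findAll('div', {'class': "inline-block ratings-metascore"}), columns=[0,'rating',2]).loc[:, ['rating']]
rating.T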


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
rating,[82 ],[81 ],[67 ],[84 ],[63 ],[60 ],[89 ],[85 ],[77 ],[66 ],...,[74 ],[85 ],[72 ],[58 ],[64 ],[62 ],[67 ],[34 ],[47 ],[58 ]
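
The values above match the Metascore column. In the HTML shown earlier, the IMDb user rating (8.8 for the first film) sits in a `span` with class `ipl-rating-star__rating`. The sketch below reads it from a reduced fragment of that markup; on the full page the same class may occur in more than one widget per film, so the search may need narrowing. The names `rating_snippet` and `imdb_ratings` are illustrative.

```python
from bs4 import BeautifulSoup

# Reduced fragment of the rating widget shown in the page output above.
rating_snippet = """
<div class="ipl-rating-star small">
<span class="ipl-rating-star__rating">8.8</span>
</div>
"""
rating_fragment = BeautifulSoup(rating_snippet, "html.parser")

# Convert the text of each rating span to a float.
imdb_ratings = [
    float(span.get_text(strip=True))
    for span in rating_fragment.find_all("span", class_="ipl-rating-star__rating")
]
print(imdb_ratings)
```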


In [828]:
length = pd.DataFrame(soup.findAll('span', {'class': 'runtime'}), columns= ['length'])
length.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
length,142 min,142 min,103 min,152 min,141 min,115 min,130 min,118 min,110 min,97 min,...,148 min,155 min,100 min,105 min,143 min,138 min,155 min,109 min,110 min,118 min
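
For numeric work, the runtime strings can be reduced to a number of minutes. A minimal sketch, assuming every value has the form `"<number> min"` as in the output above (the helper name `parse_minutes` is hypothetical):

```python
def parse_minutes(text):
    """Convert a runtime string such as '142 min' to an integer number of minutes."""
    return int(text.split()[0])


minutes = [parse_minutes(text) for text in ["142 min", "103 min", "97 min"]]
print(minutes)
```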


In [830]:
certificate = pd.DataFrame(soup.findAll('span', {'class': 'certificate'}), columns= ['certificate'])
certificate.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
certificate,UA,A,UA,UA,A,U,PG,A,UA,UA,...,A,UA,18,18,UA,A,U,UA,A,U


In [879]:
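# Join the columns on their row index. Caution: `certificate` has 49 rows (index 0 to 48) while the
# other frames have 50, so a certificate may be attached to the wrong movie wherever a movie lacks one.
dataset = title.join(weblink).join(year).join(type).join(metascore).join(rating).join(length).join(certificate)
dataset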

Unnamed: 0,movie_name,movie_link,release_year,type,score,rating,length,certificate
0,Forrest Gump,http://www.imdb.com/title/tt0109830/,(1994),"\nDrama, Romance",82,[82 ],142 min,UA
1,The Shawshank Redemption,http://www.imdb.com/title/tt0111161/,(1994),\nDrama,81,[81 ],142 min,A
2,The Perks of Being a Wallflower,http://www.imdb.com/title/tt1659337/,(2012),"\nDrama, Romance",67,[67 ],103 min,UA
3,The Dark Knight,http://www.imdb.com/title/tt0468569/,(2008),"\nAction, Crime, Drama",84,[84 ],152 min,UA
4,Changeling,http://www.imdb.com/title/tt0824747/,(2008),"\nBiography, Crime, Drama",63,[63 ],141 min,A
5,This Boy's Life,http://www.imdb.com/title/tt0108330/,(1993),"\nBiography, Drama",60,[60 ],115 min,U
6,It's a Wonderful Life,http://www.imdb.com/title/tt0038650/,(1946),"\nDrama, Family, Fantasy",89,[89 ],130 min,PG
7,The Silence of the Lambs,http://www.imdb.com/title/tt0102926/,(1991),"\nCrime, Drama, Thriller",85,[85 ],118 min,A
8,8 Mile,http://www.imdb.com/title/tt0298203/,(2002),"\nDrama, Music",77,[77 ],110 min,UA
9,The Breakfast Club,http://www.imdb.com/title/tt0088847/,(1985),"\nComedy, Drama",66,[66 ],97 min,UA
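
Because the columns are scraped separately and joined by position, a field that is missing for one movie can shift the values for the movies after it. One way to keep rows aligned is to extract every field from within each movie's own `lister-item-content` block. The sketch below uses a reduced, partly made-up fragment in which the second film has no certificate; `item_html`, `text_or_none` and `rows` are illustrative names.

```python
from bs4 import BeautifulSoup

# Reduced markup modelled on the page; the second entry is hypothetical and has no certificate.
item_html = """
<div class="lister-item-content">
<h3 class="lister-item-header"><a href="/title/tt0109830/">Forrest Gump</a></h3>
<p class="text-muted text-small"><span class="certificate">UA</span>
<span class="runtime">142 min</span></p>
</div>
<div class="lister-item-content">
<h3 class="lister-item-header"><a href="/title/tt0000000/">Hypothetical Film</a></h3>
<p class="text-muted text-small"><span class="runtime">90 min</span></p>
</div>
"""


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching tag inside parent, or None."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag else None


items = BeautifulSoup(item_html, "html.parser").find_all("div", class_="lister-item-content")

# One dictionary per movie, so a missing field becomes None instead of shifting later rows.
rows = [
    {
        "movie_name": item.h3.a.get_text(strip=True),
        "length": text_or_none(item, "span", "runtime"),
        "certificate": text_or_none(item, "span", "certificate"),
    }
    for item in items
]
print(rows)
```

Passing `rows` to `pd.DataFrame` would give one row per movie, with a missing value where a field is absent.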


**We can export this dataset to CSV or Excel format.**
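
A minimal sketch of the CSV export, using a two-row sample built from the values above and an in-memory buffer so that no file is written. The names `sample`, `buffer` and `csv_text` are illustrative.

```python
import io

import pandas as pd

# Small sample with two of the scraped columns.
sample = pd.DataFrame(
    {
        "movie_name": ["Forrest Gump", "The Shawshank Redemption"],
        "release_year": ["(1994)", "(1994)"],
    }
)

# Write CSV text to an in-memory buffer; index=False leaves out the row numbers.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

For the full table, `dataset.to_csv("movies.csv", index=False)` would write a file instead (the file name is hypothetical). `dataset.to_excel(...)` works similarly but needs an Excel writer package such as `openpyxl` to be installed.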