# Web Scraping
- Web scraping extracts data from web pages so that it can be processed further.
- The web scraping process can be divided into four major parts:
    1. Reading: downloading the HTML of the page

    2. Parsing: turning the raw HTML into a structured, navigable tree

    3. Extraction: pulling the required data out of the parsed page

    4. Transformation: converting the extracted data into the required format, e.g., CSV
    
- Import the required module and class:
    1. urllib.request (standard library): downloads the page
    2. BeautifulSoup (from the bs4 package): parses the HTML
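
The four steps listed above can be sketched end to end on a tiny, hard-coded HTML snippet, so nothing has to be downloaded. The snippet, the variable names, and the in-memory buffer are assumptions made for illustration; only the class names mirror the IMDb page used in this notebook.

```python
import csv
import io

from bs4 import BeautifulSoup

# 1. Reading: a hard-coded snippet stands in for the downloaded page (assumption).
SAMPLE_HTML = """
<div class="lister-item mode-detail">
  <h3 class="lister-item-header"><a>The Godfather</a>
    <span class="lister-item-year">(1972)</span></h3>
  <span class="runtime">175 min</span>
</div>
<div class="lister-item mode-detail">
  <h3 class="lister-item-header"><a>12 Angry Men</a>
    <span class="lister-item-year">(1957)</span></h3>
  <span class="runtime">96 min</span>
</div>
"""

# 2. Parsing: build a navigable tree with the built-in parser.
page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 3. Extraction: collect one [name, year, runtime] row per movie.
rows = []
for item in page.find_all("div", class_="lister-item"):
    rows.append([
        item.h3.a.get_text(strip=True),
        item.find("span", class_="lister-item-year").get_text(strip=True),
        item.find("span", class_="runtime").get_text(strip=True),
    ])

# 4. Transformation: serialize the rows as CSV text (in memory, not on disk).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Name", "Year", "Runtime"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The cells that follow perform the same four steps against the live page.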

In [2]:
# import the required libraries and classes
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

In [3]:
# download the HTML page
my_url = "https://www.imdb.com/list/ls055386972/"
uClient = uReq(my_url)      # open the connection
page_html = uClient.read()  # raw page content as bytes
uClient.close()             # release the connection
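
Some sites reject requests that lack a browser-like User-Agent header, and a connection that is never closed can leak. The sketch below is one possible variant of the download step under those assumptions: `build_request`, `fetch_html`, the header value, and the timeout are hypothetical names and values, not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical header value; any descriptive User-Agent string could be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; scraping-tutorial)"}


def build_request(url):
    """Create a request object that carries the custom headers."""
    return Request(url, headers=HEADERS)


def fetch_html(url, timeout=10):
    """Download a page and return its body decoded as text."""
    # The with-block closes the connection even if read() raises.
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# Building the request does not touch the network; fetch_html() would.
request = build_request("https://www.imdb.com/list/ls055386972/")
print(request.full_url, request.get_header("User-agent"))
```

Calling `fetch_html(my_url)` would return a string rather than bytes; BeautifulSoup accepts either.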

In [4]:
page_html

b'\n\n\n<!DOCTYPE html>\n<html\n    xmlns:og="http://ogp.me/ns#"\n    xmlns:fb="http://www.facebook.com/2008/fbml">\n    <head>\n         \n        <meta charset="utf-8">\n        <meta http-equiv="X-UA-Compatible" content="IE=edge">\n\n    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///list/ls055386972?src=mdot">\n\n\n\n        <script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:\'java\'};</script>\n\n<script>\n    if (typeof uet == \'function\') {\n      uet("bb", "LoadTitle", {wb: 1});\n    }\n</script>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>\n        <title>The 50 Best Movies Ever Made - IMDb</title>\n  <script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>\n<script>\n    if (typeof uet == \'function\') {\n      uet("be", "LoadTitle", {wb: 1});\n    }\n</script>\n<script>\n   

In [5]:
# parse the HTML with Python's built-in parser
page_soup = soup(page_html, 'html.parser')

In [6]:
page_soup


<!DOCTYPE html>

<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="app-id=342792525, app-argument=imdb:///list/ls055386972?src=mdot" name="apple-itunes-app"/>
<script type="text/javascript">var IMDbTimer={starttime: new Date().getTime(),pt:'java'};</script>
<script>
    if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
</script>
<script>(function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);</script>
<title>The 50 Best Movies Ever Made - IMDb</title>
<script>(function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);</script>
<script>
    if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
</script>
<script>
    if (typeof uex == 'function') {
      uex("ld", "LoadTitle", {wb: 1});
    }
</script>
<link href="https

In [7]:
# collect every <div> that holds one movie entry
containers = page_soup.find_all('div', {'class': 'lister-item mode-detail'})
print(len(containers))

50
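
Indexing a result list with `[0]` raises an IndexError when an entry lacks the element being looked up, for example a movie without a runtime. A minimal sketch of a more defensive lookup follows; the snippet and the helper `text_or_default` are hypothetical, and the second entry deliberately has no runtime.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second entry deliberately has no runtime span.
SNIPPET = """
<div class="lister-item mode-detail">
  <span class="lister-item-year text-muted unbold">(1972)</span>
  <span class="runtime">175 min</span>
</div>
<div class="lister-item mode-detail">
  <span class="lister-item-year text-muted unbold">(1993)</span>
</div>
"""


def text_or_default(container, css_selector, default=""):
    """Return the stripped text of the first match, or `default` if none."""
    node = container.select_one(css_selector)
    return node.get_text(strip=True) if node is not None else default


snippet_soup = BeautifulSoup(SNIPPET, "html.parser")
# The CSS selector matches divs that carry both classes.
items = snippet_soup.select("div.lister-item.mode-detail")
runtimes = [text_or_default(item, "span.runtime", default="n/a") for item in items]
print(runtimes)  # ['175 min', 'n/a']
```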


### 50 Best movies ever made

In [8]:
# pretty-print the first container to inspect its structure
print(containers[0].prettify())

<div class="lister-item mode-detail">
 <div class="lister-item-image ribbonize" data-tconst="tt0068646">
  <a href="/title/tt0068646/">
   <img alt="The Godfather" class="loadlate" data-tconst="tt0068646" height="209" loadlate="https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY209_CR3,0,140,209_AL_.jpg" src="https://m.media-amazon.com/images/S/sash/4FyxwxECzL-U1J8.png" width="140"/>
  </a>
 </div>
 <div class="lister-item-content">
  <h3 class="lister-item-header">
   <span class="lister-item-index unbold text-primary">
    1.
   </span>
   <a href="/title/tt0068646/">
    The Godfather
   </a>
   <span class="lister-item-year text-muted unbold">
    (1972)
   </span>
  </h3>
  <p class="text-muted text-small">
   <span class="certificate">
    A
   </span>
   <span class="ghost">
    |
   </span>
   <span class="runtime">
    175 min
   </span>
   <span class="ghost">
    |
   </span>
   <span class="genre">
    

In [15]:
# write the extracted data to a CSV file
filename = 'imdb_m.csv'
headers = 'Name,Year,Runtime\n'

# latin1 matches the encoding used when the file is read back;
# the with-block closes the file even if a row fails
with open(filename, 'w', encoding='latin1') as f:
    f.write(headers)
    for container in containers:
        # a comma inside a title would split the CSV field, so replace it
        name = container.img['alt'].replace(',', ' ')
        year = container.find_all('span', {'class': 'lister-item-year'})[0].text
        runtime = container.find_all('span', {'class': 'runtime'})[0].text
        row = name + ',' + year + ',' + runtime + '\n'
        print(row)
        f.write(row)

The Godfather,(1972),175 min

Schindler's List,(1993),195 min

12 Angry Men,(1957),96 min

La vita è bella,(1997),116 min

Il buono  il brutto  il cattivo,(1966),161 min

The Shawshank Redemption,(1994),142 min

The Pursuit of Happyness,(2006),117 min

Shichinin no samurai,(1954),207 min

The Intouchables,(2011),112 min

Central do Brasil,(1998),110 min

Requiem for a Dream,(2000),102 min

A Beautiful Mind,(2001),135 min

Hachi: A Dog's Tale,(2009),93 min

Taken,(I) (2008),90 min

Yeopgijeogin geunyeo,(2001),137 min

Amores perros,(2000),154 min

The Shining,(1980),146 min

Apocalypto,(2006),139 min

Gladiator,(2000),155 min

Cast Away,(2000),143 min

The Dark Knight,(2008),152 min

The Pianist,(2002),150 min

Titanic,(1997),194 min

Bin-jip,(2004),88 min

Braveheart,(1995),178 min

It's a Wonderful Life,(1946),130 min

Bom yeoreum gaeul gyeoul geurigo bom,(2003),103 min

Alien,(1979),117 min

Salinui chueok,(2003),131 min

Vozvrashchenie,(2003),110 min

Ang-ma-reul bo-at-da,(2010),144
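
Replacing commas with spaces changes the titles, as the double spaces in "Il buono  il brutto  il cattivo" show. As an alternative sketch, the standard-library csv module quotes such fields instead. The rows below reuse values from the output above, with the comma-separated title restored on the assumption that this is how it appears on the page; the in-memory buffer stands in for the real file.

```python
import csv
import io

# Sample rows; the second title contains commas on purpose.
rows = [
    ("The Godfather", "(1972)", "175 min"),
    ("Il buono, il brutto, il cattivo", "(1966)", "161 min"),
]

# StringIO stands in for open('imdb_m.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Year", "Runtime"])
writer.writerows(rows)  # fields containing commas are quoted automatically

# Reading the text back shows that the quoted title survives as one field.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(buffer.getvalue())
print(parsed[2])
```

pandas reads quoted fields the same way, so no post-processing of the titles would be needed.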

### Checking that the CSV was created as expected

In [18]:
import pandas as pd

# read the file back and show the first five rows
df = pd.read_csv('imdb_m.csv', encoding='latin1')
df.head()

|   | Name | Year | Runtime |
|---|------|------|---------|
| 0 | The Godfather | (1972) | 175 min |
| 1 | Schindler's List | (1993) | 195 min |
| 2 | 12 Angry Men | (1957) | 96 min |
| 3 | La vita è bella | (1997) | 116 min |
| 4 | Il buono il brutto il cattivo | (1966) | 161 min |
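
The Year and Runtime columns are still text: values such as "(1972)", "(I) (2008)", and "175 min" cannot be sorted or averaged as numbers. A minimal sketch of one way to clean them with regular expressions is shown here; `parse_year` and `parse_runtime` are hypothetical helpers, not part of the original notebook.

```python
import re


def parse_year(raw):
    """Return the four-digit year inside text such as '(I) (2008)', or None."""
    match = re.search(r"\((\d{4})\)", raw)
    return int(match.group(1)) if match else None


def parse_runtime(raw):
    """Return the minutes in text such as '175 min', or None."""
    match = re.search(r"(\d+)\s*min", raw)
    return int(match.group(1)) if match else None


print(parse_year("(1972)"), parse_year("(I) (2008)"), parse_runtime("175 min"))
# prints: 1972 2008 175
```

The helpers could be applied to the DataFrame columns, for example with `df['Year'].map(parse_year)`.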


In [17]:
# make sure the file is closed (and fully flushed) before the CSV is read back
f.close()