## The Step-by-Step Guide to Build a Web Scraper With Python
With tools like Repl, pandas, Requests, NumPy, and BeautifulSoup

## What we’ll cover

This guide will take you through understanding HTML web pages, building a web scraper using Python, and creating a DataFrame with pandas. It’ll cover data quality, data cleaning, and data-type conversion, step by step, with instructions, code, and explanations of how every piece works. I hope you code along and enjoy!

## Disclaimer
Websites can restrict or ban scraping of their data, and you can be subject to legal ramifications depending on where and how you attempt to scrape information. Websites usually describe their rules in their terms of use and in their robots.txt file, which usually looks something like this: www.example.com/robots.txt. So scrape responsibly, and respect the robots.txt.
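
As a minimal sketch of what respecting robots.txt can look like in code, Python's standard-library `urllib.robotparser` can check whether a URL is allowed for a given user agent. The robots.txt content, the `is_allowed` helper, and the `my-scraper` user agent below are all hypothetical and inlined so the example runs offline; a real site's rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "my-scraper" is an assumed user-agent name used only for illustration.
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://www.example.com/search/title/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://www.example.com/private/data"))
```

For a live site, `RobotFileParser` can also download the file itself through `set_url()` and `read()`.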

## What’s Web Scraping?
Web scraping consists of gathering data available on websites. This can be done manually by a human or by using a bot.
A bot is a program you build that helps you extract the data you need much quicker than a human’s hand and eyes can.


## What Are We Going to Scrape?
It’s essential to identify the goal of your scraping right from the start. We don’t want to scrape any data we don’t actually need.
For this project, we’ll scrape data from IMDb’s “Top 1,000” movies, specifically the top 50 movies on this page. Here is the information we’ll gather from each movie listing:

* The Title
* The year it was released
* How long the movie is
* Genre of the movie
* IMDb’s rating of the movie
* The Metascore of the movie
* How many votes the movie got
* The U.S. gross earnings of the movie

## How Do Web Scrapers Work?
Web scrapers gather website data in the same way a human would: They go to a web page of the website, get the relevant data, and move on to the next web page — only much faster.
Every website has a different structure. These are a few important things to think about when building a web scraper:

* What’s the structure of the web page that contains the data you’re looking for?
* How do we get to those web pages?
* Will you need to gather more data from the next page?

## The URL
To begin, let’s look at the URL of the page we want to scrape.
This is what we see in the URL:

https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating

#### We notice a few things about the URL:
* ? acts as a separator: it indicates the end of the URL resource path and the start of the parameters
* groups=top_1000 specifies what the page will be about
* &ref_=adv_nxt or &ref_=adv_prv shows up when you move to the next or the previous page. The reference is the page we’re currently on. adv_nxt and adv_prv are the two possible values, translated to advance to next page and advance to previous page.

When you navigate back and forth through the pages, you’ll notice only the parameters change. Keep this structure in mind as it’s helpful to know as we build the scraper.
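
The same breakdown can be done in code. This is a small sketch using the standard-library `urllib.parse` module to split the URL above into its resource path and its parameters; the variable names are only illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating"

parts = urlsplit(url)           # separates scheme, host, path, and query string
params = parse_qs(parts.query)  # maps each parameter name to a list of values

print(parts.path)    # the resource path before the "?"
for name, values in params.items():
    print(name, "=", values[0])
```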


## The HTML
HTML stands for hypertext markup language, and most web pages are written using it. Essentially, HTML is the language a server uses to tell your browser what a web page contains and how it’s structured.

When you access a URL, your computer sends a request to the server that hosts the site. Any technology can be running on that server (JavaScript, Ruby, Java, etc.) to process your request. Eventually, the server returns a response to your browser; oftentimes, that response will be in the form of an HTML page for your browser to display.

HTML describes the structure of a web page semantically, and originally included cues for the appearance of the document.
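
To make the idea of structure concrete, here is a rough sketch that prints the nesting of tags in a tiny, made-up HTML snippet. It uses Python's built-in `html.parser` module rather than BeautifulSoup, and the `TagOutline` class is a hypothetical helper written just for this illustration. It assumes every tag is explicitly closed, which real pages do not guarantee.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


# Made-up sample page; every tag is closed so the depth tracking stays correct.
sample_html = (
    "<html><head><title>Example</title></head>"
    "<body><h1>Hello</h1><p>Some text</p></body></html>"
)

parser = TagOutline()
parser.feed(sample_html)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```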

## Inspect HTML

Chrome, Firefox, and Safari users can examine the HTML structure of any page by right-clicking on it and choosing the Inspect option.

A panel will appear at the bottom or right-hand side of your page with a long list of all the HTML tags housing the information displayed in your browser window. If you’re in Safari, you’ll want to press the button to the left of the search bar, which looks like a target. If you’re in Chrome or Firefox, there’s a small box with an arrow icon in it at the top left that you’ll use to inspect.

Once it’s clicked, if you move your cursor over any element of the page, you’ll notice it gets highlighted along with the HTML tags in the panel that it’s associated with.

Knowing how to read the basic structure of a page’s HTML is important so we can turn to Python to help us extract the HTML from the page.

## Tools

The tools we’re going to use are:

* Repl (optional): a simple, interactive computer-programming environment used via your web browser. I recommend using this just for code-along purposes if you don’t already have an IDE. If you use Repl, make sure you’re using the Python environment.
* Requests: will allow us to send HTTP requests to get HTML files
* BeautifulSoup: will help us parse the HTML files
* pandas: will help us assemble the data into a DataFrame to clean and analyze it
* NumPy: will add support for mathematical functions and tools for working with arrays

### Now, Let’s Code
You can follow along below inside your Repl environment or IDE. Have fun!
### Import tools
First, we’ll import the tools we’ll need so we can use them to help us build the scraper and get the data we need.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Movies in English
When we run our code to scrape some of these movies, it’s very likely we’ll get the movie names translated into the main language of the country the movie originated in.

Use this code to make sure we get English-translated titles for all the movies we scrape:

In [2]:
headers = {"Accept-Language": "en-US, en;q=0.5"}

### Request contents of the URL
Get the contents of the page we’re looking at by requesting the URL:


In [3]:

url = "https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating"

results = requests.get(url, headers=headers) 

* url is the variable we create and assign the URL to
* results is the variable we create to store the response returned by requests.get
* requests.get(url, headers=headers) is the method we use to grab the contents of the URL. The headers part tells our scraper to ask for English content, based on our previous line of code.
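
As a hedged sketch of what goes over the wire, the snippet below builds the same request without sending it, so the header can be inspected offline. It also shows one possible way to wrap `requests.get` so that HTTP errors fail loudly. The `fetch` helper and the 10-second timeout are assumptions for illustration, not part of the original code.

```python
import requests

url = "https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating"
headers = {"Accept-Language": "en-US, en;q=0.5"}

# Build the request without sending it, so nothing touches the network here.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.method, prepared.url)
print(prepared.headers["Accept-Language"])


def fetch(url, headers, timeout=10):
    """Hypothetical helper: send the request and raise on 4xx/5xx responses."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response
```

Calling `fetch(url, headers)` would perform the actual download; it is left uncalled here so the example stays offline.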

### Using BeautifulSoup
Make the content we grabbed easy to read by using BeautifulSoup:

In [4]:
soup = BeautifulSoup(results.text, "html.parser")

print(soup.prettify())

<!DOCTYPE html>
<html xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://ogp.me/ns#">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="app-id=342792525, app-argument=imdb:///?src=mdot" name="apple-itunes-app"/>
  <script type="text/javascript">
   var IMDbTimer={starttime: new Date().getTime(),pt:'java'};
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("bb", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_pre_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <title>
   IMDb "Top 1000"
(Sorted by IMDb Rating Descending) - IMDb
  </title>
  <script>
   (function(t){ (t.events = t.events || {})["csm_head_post_title"] = new Date().getTime(); })(IMDbTimer);
  </script>
  <script>
   if (typeof uet == 'function') {
      uet("be", "LoadTitle", {wb: 1});
    }
  </script>
  <script>
   if (typeof uex == 'function') {
      uex("ld", "L

#### Breaking BeautifulSoup down:
* soup is the variable we assign the BeautifulSoup object to. We pass it results.text and specify the HTML parser as the desired format; this allows Python to read the components of the page rather than treating it as one long string
* print(soup.prettify()) will print what we’ve grabbed in a more structured tree format, making it easier to read
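
To see the difference between one long string and a parsed page, here is a minimal sketch on a made-up HTML snippet (not IMDb's real markup). Once parsed, individual elements can be reached by tag name or searched for by attribute.

```python
from bs4 import BeautifulSoup

# Made-up sample page used only for illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body><p class="intro">Hello, scraper!</p></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text                        # reach a tag by name
intro_text = sample_soup.find("p", class_="intro").text    # search by tag and class

print(page_title)
print(intro_text)
```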

## Initialize your storage
When we write code to extract our data, we need somewhere to store that data. Create a variable for each type of data you’ll extract, and assign an empty list to each, indicated by square brackets []. Remember the list of information we wanted to grab from each movie from earlier:

In [5]:
#initialize empty lists where you'll store your data
titles = []
years = []
time = []
genre = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []

## Find the right div container
It’s time to check out the HTML code in our web page.

Go to the web page we’re scraping, inspect it, and hover over a single movie in its entirety.

We need to figure out what distinguishes each of these from other "div" containers we see.

You’ll notice a list of "div" elements in the inspector with a "class" attribute that has two values: "lister-item" and "mode-advanced".

If you click on each of those, you’ll notice it highlights the corresponding movie container on the page.

If we do a quick search within inspect (press Ctrl+F and type "lister-item mode-advanced"), we’ll see 50 matches representing the 50 movies displayed on a single page. We now know all the information we seek lies within this specific "div" tag.

### Find all "lister-item mode-advanced" divs

Our next move is to tell our scraper to find all of these "lister-item mode-advanced" divs:

In [6]:
movie_div = soup.find_all('div', class_='lister-item mode-advanced')

### Breaking "find_all" down:
* movie_div is the variable we’ll use to store all of the div containers with a class of lister-item mode-advanced
* the find_all() method extracts all the div containers that have a class attribute of lister-item mode-advanced from what we have stored in our variable soup.
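
The behavior of find_all() can be tried out on a small, made-up snippet that imitates the container structure described above. The HTML here is a simplified assumption, not IMDb's actual markup. Note that passing a string with a space to `class_` matches the exact value of the class attribute, so the order of the two class names matters.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup: two movie containers and one unrelated div.
sample_html = """
<div class="lister-item mode-advanced"><h3><a>Movie A</a></h3></div>
<div class="lister-item mode-advanced"><h3><a>Movie B</a></h3></div>
<div class="sidebar"><h3><a>Not a movie</a></h3></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_movie_div = sample_soup.find_all("div", class_="lister-item mode-advanced")

sample_titles = [div.h3.a.text for div in sample_movie_div]
print(len(sample_movie_div))
print(sample_titles)
```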

### Get ready to extract each item

Some movies are missing gross earnings, while others (like the second movie) include them, so our extraction code has to handle both cases.

In [7]:
#initiate the for loop
#this tells your scraper to iterate through
#every div container we stored in movie_div
for container in movie_div:
    # Title
    name = container.h3.a.text
    titles.append(name)

    # Release year, e.g. "(1994)"
    year = container.h3.find('span', class_='lister-item-year').text
    years.append(year)

    # Runtime, e.g. "142 min"; empty string if the span is missing
    runtime_span = container.p.find('span', class_='runtime')
    runtime = runtime_span.text if runtime_span else ''
    time.append(runtime)

    # Genre
    movie_genre = container.p.find('span', class_='genre').text
    genre.append(movie_genre)

    # IMDb rating
    imdb = float(container.strong.text)
    imdb_ratings.append(imdb)

    # Metascore; '-' if the movie has none
    metascore_span = container.find('span', class_='metascore')
    m_score = metascore_span.text if metascore_span else '-'
    metascores.append(m_score)

    # There are up to two "nv" spans; grab both, as they hold the votes and the gross
    nv = container.find_all('span', attrs={'name': 'nv'})

    # The first nv span holds the votes
    vote = nv[0].text
    votes.append(vote)

    # The second nv span, if present, holds the gross; '-' otherwise
    grosses = nv[1].text if len(nv) > 1 else '-'
    us_gross.append(grosses)
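
The "use the text if the tag exists, otherwise fall back to a placeholder" pattern appears several times in the loop. One possible way to factor it out is sketched below. The `text_or_default` helper is hypothetical, and unlike the loop above it also strips surrounding whitespace. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, tag, css_class, default=""):
    """Return the stripped text of the first matching tag, or default if absent."""
    found = parent.find(tag, class_=css_class)
    return found.text.strip() if found else default


# Made-up container that has a runtime but no metascore.
sample_html = """
<div class="lister-item mode-advanced">
  <p><span class="runtime">142 min</span></p>
</div>
"""
sample_container = BeautifulSoup(sample_html, "html.parser").div

sample_runtime = text_or_default(sample_container, "span", "runtime")
sample_metascore = text_or_default(sample_container, "span", "metascore", default="-")

print(sample_runtime)
print(sample_metascore)
```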

In [8]:
print(titles)
print(years)
print(time)
print(genre)
print(imdb_ratings)
print(metascores)
print(votes)
print(us_gross)

['The Shawshank Redemption', 'The Godfather', 'The Dark Knight', 'The Godfather: Part II', 'The Lord of the Rings: The Return of the King', 'Pulp Fiction', "Schindler's List", '12 Angry Men', 'Inception', 'Fight Club', 'The Lord of the Rings: The Fellowship of the Ring', 'Forrest Gump', 'The Good, the Bad and the Ugly', 'The Lord of the Rings: The Two Towers', 'The Matrix', 'Goodfellas', 'Star Wars: Episode V - The Empire Strikes Back', "One Flew Over the Cuckoo's Nest", 'Harakiri', 'Parasite', 'Interstellar', 'City of God', 'Spirited Away', 'Saving Private Ryan', 'The Green Mile', 'Life Is Beautiful', 'Se7en', 'The Silence of the Lambs', 'Star Wars: Episode IV - A New Hope', 'Anand', 'Seven Samurai', "It's a Wonderful Life", 'Joker', 'Ayla: The Daughter of War', 'Whiplash', 'The Intouchables', 'The Prestige', 'The Departed', 'The Pianist', 'Gladiator', 'American History X', 'The Usual Suspects', 'Léon: The Professional', 'The Lion King', 'Terminator 2: Judgment Day', 'Cinema Paradiso'

In [9]:

#building our Pandas dataframe         
movies = pd.DataFrame({
'movie': titles,
'year': years,
'timeMin': time,
'Genre':genre,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes,
'us_grossMillions': us_gross,
})
movies.head(10)

| | movie | year | timeMin | Genre | imdb | metascore | votes | us_grossMillions |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | (1994) | 142 min | \nDrama | 9.3 | 80 | 2262217 | $28.34M |
| 1 | The Godfather | (1972) | 175 min | \nCrime, Drama | 9.2 | 100 | 1560788 | $134.97M |
| 2 | The Dark Knight | (2008) | 152 min | \nAction, Crime, Drama | 9.0 | 84 | 2227107 | $534.86M |
| 3 | The Godfather: Part II | (1974) | 202 min | \nCrime, Drama | 9.0 | 90 | 1090806 | $57.30M |
| 4 | The Lord of the Rings: The Return of the King | (2003) | 201 min | \nAdventure, Drama, Fantasy | 8.9 | 94 | 1595570 | $377.85M |
| 5 | Pulp Fiction | (1994) | 154 min | \nCrime, Drama | 8.9 | 94 | 1769520 | $107.93M |
| 6 | Schindler's List | (1993) | 195 min | \nBiography, Drama, History | 8.9 | 94 | 1175179 | $96.90M |
| 7 | 12 Angry Men | (1957) | 96 min | \nCrime, Drama | 8.9 | 96 | 662880 | $4.36M |
| 8 | Inception | (2010) | 148 min | \nAction, Adventure, Sci-Fi | 8.8 | 74 | 1984768 | $292.58M |
| 9 | Fight Club | (1999) | 139 min | \nDrama | 8.8 | 66 | 1796624 | $37.03M |


In [10]:
print(movies.dtypes)

movie                object
year                 object
timeMin              object
Genre                object
imdb                float64
metascore            object
votes                object
us_grossMillions     object
dtype: object


In [12]:
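#cleaning data with Pandas
#keep only the digits: "(1994)" becomes 1994 and "142 min" becomes 142
movies['year'] = movies['year'].astype(str).str.extract(r'(\d+)').astype(int)
movies['timeMin'] = movies['timeMin'].astype(str).str.extract(r'(\d+)').astype(int)
#drop the leading newline from the genre text
movies['Genre'] = movies['Genre'].map(lambda x: x.lstrip('\n'))
#placeholders such as '-' can't be parsed, so errors='coerce' turns them into NaN
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')
#remove the thousands separators before converting to int
movies['votes'] = movies['votes'].astype(str).str.replace(',', '').astype(int)
#strip the "$" prefix and the "M" suffix, then convert to float
movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')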
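
To check what each cleaning step does, the same transformations can be run on a tiny hand-written frame. The two rows below are made-up sample values in the same shape as the scraped strings. This sketch uses the `.str` accessor throughout and `expand=False` so that `str.extract` returns a Series, which is a stylistic variation on the cell above rather than the original code.

```python
import pandas as pd

# Hand-written sample rows shaped like the raw scraped strings.
sample = pd.DataFrame({
    "year": ["(1994)", "(1972)"],
    "timeMin": ["142 min", "175 min"],
    "votes": ["2,262,217", "1,560,788"],
    "us_grossMillions": ["$28.34M", "-"],   # "-" stands for a missing gross
})

# Pull out the first run of digits and convert it to an integer.
sample["year"] = sample["year"].str.extract(r"(\d+)", expand=False).astype(int)
sample["timeMin"] = sample["timeMin"].str.extract(r"(\d+)", expand=False).astype(int)

# Remove thousands separators before converting.
sample["votes"] = sample["votes"].str.replace(",", "", regex=False).astype(int)

# Strip "$" and "M"; anything unparseable (like "-") becomes NaN.
sample["us_grossMillions"] = pd.to_numeric(
    sample["us_grossMillions"].str.lstrip("$").str.rstrip("M"), errors="coerce"
)

print(sample)
print(sample.dtypes)
```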

In [13]:
movies.head(10)

| | movie | year | timeMin | Genre | imdb | metascore | votes | us_grossMillions |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 142 | Drama | 9.3 | 80.0 | 2262217 | 28.34 |
| 1 | The Godfather | 1972 | 175 | Crime, Drama | 9.2 | 100.0 | 1560788 | 134.97 |
| 2 | The Dark Knight | 2008 | 152 | Action, Crime, Drama | 9.0 | 84.0 | 2227107 | 534.86 |
| 3 | The Godfather: Part II | 1974 | 202 | Crime, Drama | 9.0 | 90.0 | 1090806 | 57.30 |
| 4 | The Lord of the Rings: The Return of the King | 2003 | 201 | Adventure, Drama, Fantasy | 8.9 | 94.0 | 1595570 | 377.85 |
| 5 | Pulp Fiction | 1994 | 154 | Crime, Drama | 8.9 | 94.0 | 1769520 | 107.93 |
| 6 | Schindler's List | 1993 | 195 | Biography, Drama, History | 8.9 | 94.0 | 1175179 | 96.90 |
| 7 | 12 Angry Men | 1957 | 96 | Crime, Drama | 8.9 | 96.0 | 662880 | 4.36 |
| 8 | Inception | 2010 | 148 | Action, Adventure, Sci-Fi | 8.8 | 74.0 | 1984768 | 292.58 |
| 9 | Fight Club | 1999 | 139 | Drama | 8.8 | 66.0 | 1796624 | 37.03 |


##### The procedure above covers one page only. To scrape multiple pages, use the code below.

In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

from time import sleep
from random import randint

headers = {"Accept-Language": "en-US,en;q=0.5"}

titles = []
years = []
time = []
genre = []
imdb_ratings = []
metascores = []
votes = []
us_gross = []

#one start offset per page of 100 movies: 1, 101, ..., 901
pages = np.arange(1, 1001, 100)

for start in pages:

    response = requests.get("https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&start=" + str(start) + "&ref_=adv_nxt", headers=headers)

    soup = BeautifulSoup(response.text, 'html.parser')
    movie_div = soup.find_all('div', class_='lister-item mode-advanced')

    # Pause for a random 2 to 10 seconds between requests to go easy on the server
    sleep(randint(2, 10))

    for container in movie_div:

        name = container.h3.a.text
        titles.append(name)

        year = container.h3.find('span', class_='lister-item-year').text
        years.append(year)

        # .text is needed here so we store the string rather than the whole tag
        runtime = container.p.find('span', class_='runtime').text if container.p.find('span', class_='runtime') else ''
        time.append(runtime)

        movie_genre = container.p.find('span', class_='genre').text
        genre.append(movie_genre)

        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)

        m_score = container.find('span', class_='metascore').text if container.find('span', class_='metascore') else ''
        metascores.append(m_score)

        nv = container.find_all('span', attrs={'name': 'nv'})

        vote = nv[0].text
        votes.append(vote)

        grosses = nv[1].text if len(nv) > 1 else ''
        us_gross.append(grosses)

movies = pd.DataFrame({
'movie': titles,
'year': years,
'genre': genre,
'imdb': imdb_ratings,
'metascore': metascores,
'votes': votes,
'us_grossMillions': us_gross,
'timeMin': time
})

#cleaning data with Pandas
movies['year'] = movies['year'].astype(str).str.extract(r'(\d+)').astype(int)

movies['timeMin'] = movies['timeMin'].astype(str).str.extract(r'(\d+)').astype(int)

movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')

movies['votes'] = movies['votes'].astype(str).str.replace(',', '').astype(int)

movies['us_grossMillions'] = movies['us_grossMillions'].map(lambda x: x.lstrip('$').rstrip('M'))
movies['us_grossMillions'] = pd.to_numeric(movies['us_grossMillions'], errors='coerce')
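
The paging logic in the loop above comes down to generating one URL per start offset. Below is a small offline sketch of that step using the standard-library `urlencode`. The `page_urls` function and `BASE_URL` constant are hypothetical names, and the parameter values are copied from the request URL used in the loop.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.imdb.com/search/title/"


def page_urls(total=1000, per_page=100):
    """Build one search URL per page; start is the 1-based rank of the first movie."""
    urls = []
    for start in range(1, total + 1, per_page):
        query = urlencode(
            {
                "groups": "top_1000",
                "sort": "user_rating,desc",
                "count": per_page,
                "start": start,
                "ref_": "adv_nxt",
            },
            safe=",",  # keep the comma in "user_rating,desc" unescaped
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


all_urls = page_urls()
print(len(all_urls))
print(all_urls[0])
print(all_urls[-1])
```

Printing the URLs before requesting them is a cheap way to confirm the offsets are right before any request is sent.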


In [22]:
movies

| | movie | year | genre | imdb | metascore | votes | us_grossMillions | timeMin |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | \nDrama | 9.3 | 80.0 | 2262217 | 28.34 | 142 |
| 1 | The Godfather | 1972 | \nCrime, Drama | 9.2 | 100.0 | 1560788 | 134.97 | 175 |
| 2 | The Dark Knight | 2008 | \nAction, Crime, Drama | 9.0 | 84.0 | 2227107 | 534.86 | 152 |
| 3 | The Godfather: Part II | 1974 | \nCrime, Drama | 9.0 | 90.0 | 1090806 | 57.30 | 202 |
| 4 | The Lord of the Rings: The Return of the King | 2003 | \nAdventure, Drama, Fantasy | 8.9 | 94.0 | 1595570 | 377.85 | 201 |
| 5 | Pulp Fiction | 1994 | \nCrime, Drama | 8.9 | 94.0 | 1769520 | 107.93 | 154 |
| 6 | Schindler's List | 1993 | \nBiography, Drama, History | 8.9 | 94.0 | 1175179 | 96.90 | 195 |
| 7 | 12 Angry Men | 1957 | \nCrime, Drama | 8.9 | 96.0 | 662880 | 4.36 | 96 |
| 8 | Inception | 2010 | \nAction, Adventure, Sci-Fi | 8.8 | 74.0 | 1984768 | 292.58 | 148 |
| 9 | Fight Club | 1999 | \nDrama | 8.8 | 66.0 | 1796624 | 37.03 | 139 |


In [23]:
movies['genre']=movies['genre'].astype(str).str.replace('\n','')

In [24]:
movies

| | movie | year | genre | imdb | metascore | votes | us_grossMillions | timeMin |
|---|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | Drama | 9.3 | 80.0 | 2262217 | 28.34 | 142 |
| 1 | The Godfather | 1972 | Crime, Drama | 9.2 | 100.0 | 1560788 | 134.97 | 175 |
| 2 | The Dark Knight | 2008 | Action, Crime, Drama | 9.0 | 84.0 | 2227107 | 534.86 | 152 |
| 3 | The Godfather: Part II | 1974 | Crime, Drama | 9.0 | 90.0 | 1090806 | 57.30 | 202 |
| 4 | The Lord of the Rings: The Return of the King | 2003 | Adventure, Drama, Fantasy | 8.9 | 94.0 | 1595570 | 377.85 | 201 |
| 5 | Pulp Fiction | 1994 | Crime, Drama | 8.9 | 94.0 | 1769520 | 107.93 | 154 |
| 6 | Schindler's List | 1993 | Biography, Drama, History | 8.9 | 94.0 | 1175179 | 96.90 | 195 |
| 7 | 12 Angry Men | 1957 | Crime, Drama | 8.9 | 96.0 | 662880 | 4.36 | 96 |
| 8 | Inception | 2010 | Action, Adventure, Sci-Fi | 8.8 | 74.0 | 1984768 | 292.58 | 148 |
| 9 | Fight Club | 1999 | Drama | 8.8 | 66.0 | 1796624 | 37.03 | 139 |


In [25]:
#Save Dataframe to csv file
movies.to_csv('movies.csv')
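
One detail worth knowing about `to_csv`: by default it also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference on a made-up two-row frame, writing to an in-memory buffer instead of a file. Whether to pass `index=False` is a design choice that depends on whether the index carries meaning.

```python
import io

import pandas as pd

# Made-up sample frame used only to show the effect of the index argument.
sample = pd.DataFrame({"movie": ["Movie A", "Movie B"], "imdb": [9.3, 9.2]})

# Default behavior: the row index is written as a first column with no name.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```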