<a href="https://colab.research.google.com/github/NitinShindeJ/My_Learning_DayByDay/blob/master/Python/Web%20Scrap/WebScrapping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping**

**Web scraping** is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in tabular format.

You’ll come across multiple libraries and frameworks in Python for web scraping. Here are three popular ones that do the task with efficiency and aplomb:

* **BeautifulSoup**  
It is a parsing library for Python that lets us extract data from HTML and XML documents. It can automatically detect encodings and gracefully handles HTML documents even with special characters. We can navigate a parsed document and find what we need, which makes it quick and painless to extract data from web pages. In this course, we will learn in detail how to build web scrapers using Beautiful Soup.

* **Scrapy**  
It is a framework for large-scale web scraping. It gives you all the tools you need to efficiently **extract** data from websites, **process** it as you want, and store it in your preferred **structure** and format. You can read more about [scrapy](https://www.analyticsvidhya.com/blog/2017/07/web-scraping-in-python-using-scrapy/?utm_source=courses&amp;utm_medium=web-scraping-hands-on-introduction-pytho).

* **Selenium**  
It is a popular tool for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping. Check out this [article](https://www.analyticsvidhya.com/blog/2019/05/scraping-classifying-youtube-video-data-python-selenium/?utm_source=courses&utm_medium=web-scraping-hands-on-introduction-python) to learn more about how it works in Python.

Here’s a brilliant illustration of the three main components that make up web scraping:  
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/components-of-web-scraping.png)






# **Problem Statement**

We’ll understand the components involved by scraping hotel details like the name of the hotel and price per room from the goibibo website:
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/target-url.png)  

Note: Always follow the [robots.txt](https://www.goibibo.com/robots.txt) file of the target website, which implements what is known as the robots exclusion protocol. It tells web robots which pages not to crawl.  
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/updated_robots_goibibo.png) 

So, looks like we are allowed to scrape the data from our targeted URL. We are good to go and write the script of our web robot. Let’s begin!
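
The same check can be scripted with Python's standard `urllib.robotparser` module. The sketch below parses a small, made-up set of rules (not goibibo's actual file) so that it runs offline; against a live site you would call `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; not the real goibibo robots.txt
sample_rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# can_fetch(user_agent, url) reports whether the rules allow crawling the URL
allowed = parser.can_fetch("*", "https://example.com/hotels/hotels-in-shimla-ct/")
blocked = parser.can_fetch("*", "https://example.com/private/page")
print(allowed, blocked)
```

Running a check like this before each crawl keeps the scraper within the site's published rules even if those rules change later.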



The first step is to navigate to the target website and download the source code of the web page. We are going to use the [requests](https://pypi.org/project/requests/) library to do this. A couple of other libraries that can make requests and download the source code are [http.client](https://docs.python.org/3/library/http.client.html#module-http.client) and [urllib2](https://docs.python.org/2/library/urllib2.html) (Python 2 only; its Python 3 counterpart is `urllib.request`).  

In [8]:
# Import Libraries 
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

In [15]:
# target URL to scrape
url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

# headers
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}

# send request to download the data
response = requests.request("GET", url, headers=headers)
#response = requests.get(url)

# parse the downloaded data
data = BeautifulSoup(response.text, 'html.parser')
print(data)


<!DOCTYPE html>

<html lang="en">
<head>
<script>
          var starttime = new Date();
        </script>
<title data-react-helmet="true">Hotels in Shimla, 517 Shimla Hotels from ₹300 + upto 30% off</title>
<meta content="#2d67b2" data-react-helmet="true" name="theme-color"/><meta content="122023101161980" data-react-helmet="true" property="fb:app_id"/><meta content="239522418693" data-react-helmet="true" property="fb:pages"/><meta content="l3rQIge7B2N_G1cQl0VZP0y7-nE" data-react-helmet="true" name="alexaVerifyID"/><meta charset="utf-8" data-react-helmet="true"/><meta content="width=device-width,initial-scale=1.0, maximum-scale=1.0, user-scalable=0" data-react-helmet="true" name="viewport"/><meta content="Book Hotels in Shimla at lowest Prices on Goibibo. Get Free Cancellation and Instant Refund on 517  Shimla Hotels starting from  ₹300. Book from 96 goSafe Hotels in Shimla, ensuring clean and safe hotel stay in current Coronavirus scenario. Use code GETSETGO for discounts upto 30% of

The next step is to parse this data with an HTML parser, and for that we use the **BeautifulSoup** library (the `BeautifulSoup(response.text, 'html.parser')` call above). If you look at our target web page, the details of each hotel sit on a separate card, as on most listing pages.  
So the next step is to filter the card data out of the complete source code. Select a card and click the ‘Inspect Element’ option to see the source code of that particular card. You will get something like this:  
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/source-cord-card-mapping.png)  

The class name of all the cards is the same, so we can get a list of those cards by passing the tag name together with its attributes, here the **class** attribute and its value, as shown below: 

In [18]:
# find all the sections with the specified class name
# (attrs must be a dict mapping the attribute name to its value, not a set)

cards_data = data.find_all('div', attrs={'class': 'HotelCardstyles__OuterWrapperDiv-sc-1s80tyk-0 eXWmAQ'})

In [19]:
# total number of cards
print('Total Number of Cards Found : ', len(cards_data))

Total Number of Cards Found :  10


In [20]:
# source code of hotel cards
for card in cards_data:
  print(card)

<div class="HotelCardstyles__OuterWrapperDiv-sc-1s80tyk-0 eXWmAQ" itemscope="" itemtype="http://schema.org/Hotel"><div class="HotelCardstyles__WrapperSectionMetaDiv-sc-1s80tyk-2 fKNNeH"><span class="PersuasionHoverTextstyles__WrapperDiv-sc-1c06rw1-13 hmlsFt"><span class="PersuasionHoverTextstyles__HoverTargetWrapperDiv-sc-1c06rw1-2 dfSYRF" type=""><div class="HotelCardDealTag__WrapperDiv-sc-8faxgn-0 bcGOvP" color="#ffffff"><div class="HotelCardDealTag__TagMarkup-sc-8faxgn-2 hivUQz"></div><span class="HotelCardDealTag__TextWrapperSpan-sc-8faxgn-1 gErUjJ">Hot Deal</span></div></span></span><div class="HotelCardstyles__ImageGalleryWrapperDiv-sc-1s80tyk-1 iGkcms"><div class="FreeBreakfastTag__WrapperDiv-sc-1t1x6un-0 gMWZSi"><div class="FreeBreakfastTag__TagMarkup-sc-1t1x6un-2 ksKWcf"></div><span class="FreeBreakfastTag__TextWrapperSpan-sc-1t1x6un-1 fyWKnD">Free Breakfast</span></div><img class="HotelCardImageGallerystyles__ImageStyle-r3dzqu-1 beqcaa" data-testid="" itemprop="image" onload=

We have filtered the cards data from the complete source code of the web page and each card here contains the information about a separate hotel. Select only the Hotel Name, perform the Inspect Element step, and do the same with the Room Price:  
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/inspect-element-nested1.png)   

Now, for each card, we have to find the Hotel Name and the Room Price. **In the markup shown above, the Hotel Name is the text of the first (a) tag in each card, and the Room Price sits in a (span) tag whose (itemprop) attribute is `priceRange`:** 



In [23]:
# extract the hotel name and price per room
for card in cards_data:
  
  # get the hotel name
  hotel_name = card.find('a')

  # get the room price
  room_price = card.find('span', attrs={'itemprop': 'priceRange'})

  print(hotel_name, room_price)

<a class="HotelCardstyles__HotelNameSeoAnchor-sc-1s80tyk-12 hiSwbR" content="Snow Valley Resorts" href='/hotels/snow-valley-resorts-hotel-in-shimla-1953288179585905793/?hquery={"ci":"20200806","co":"20200807","r":"1-2-0","ibp":"v3"}&amp;hmd=8f36714b239e573faf2c444188b5265be80131b1187c9ce651409af3093068526cf31cbcfa7839f70d07ceba2f7caa6be0e0f08e94353d9abac928f54bdfb5c0b6884c4010c722940b254d3aba1578250994d8b2556224f7e33712d89e40bd201d6aa49bf5bb5318cf7b2351dd78d3f288a0d20e6211ea40ea975e8baabedef64d5e62a128dc76c673cc36f17313a8422447df87c998a62ca0275f8721d01db293cebf6713e294739c714354b386a5e78eb24a5bcc48940739924b16480b2cfab0179feedfb90550675e978c7ccbbf21e84c9a39b285415b67ab11b68d831482493399d6bb438d9b5e474e7d6bd26f7d31c72e0a681d3cc2aeff65f0559e5961241ad82db6900107fb3cc57c8678662cf08410320017de9a98ce83da02b639de6ab44e6f86ec3b536e023cefe1524dcd50626ac8dec8e1d841f0bfc12ceb107ac7d8ac&amp;cc=IN' itemprop="name" target="_blank">Snow Valley Resorts</a> <span class="HotelCardstyles__CurrentPrice-sc

The final step is to store the extracted data in a CSV file. For each card, we extract the Hotel Name, Price, and Rating, store them in a Python dictionary, and append that dictionary to a list.  
Next, we transform this list into a Pandas data frame, which can then be written out as a CSV or JSON file:  

In [35]:
# create a list to store the data
scraped_data = []

for card in cards_data:
  
  # initialize the dictionary
  card_details = {}

  # get the hotel name
  hotel_name = card.find('a')

  # get the room price
  room_price = card.find('span', attrs={'itemprop': 'priceRange'})

  # get the hotel rating
  hotel_rating = card.find('span', attrs={'itemprop':'ratingValue'})

  # get the hotel location (currently disabled)
  #hotel_add = card.find('span', attrs={'class':'PersuasionHoverTextstyles__TextWrapperSpan-sc-1c06rw1-14 dlEtqh'}) 

  # add data to the dictionary
  card_details['Hotel_Name'] = hotel_name.text
  card_details['Room_Price'] = room_price.text
  card_details['Hotel_Ratings']= hotel_rating.text
  #card_details['Hotel_Location']= hotel_add.text 

  # append the scraped data to the list
  scraped_data.append(card_details)
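
The loop above calls `.text` directly on each `find()` result. Because `find()` returns `None` when nothing matches, a card without a price or rating would raise an `AttributeError`. A minimal sketch of a guarded lookup follows, run on a tiny hand-written HTML snippet; the markup, the `card` class name, and the helper `text_or_none` are assumptions for illustration, not goibibo's real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the real cards are far larger
sample_html = """
<div class="card"><a itemprop="name">Hotel A</a><span itemprop="priceRange">3361</span></div>
<div class="card"><a itemprop="name">Hotel B</a></div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

soup = BeautifulSoup(sample_html, "html.parser")
rows = []
for card in soup.find_all("div", attrs={"class": "card"}):
    name = text_or_none(card.find("a"))
    price = text_or_none(card.find("span", attrs={"itemprop": "priceRange"}))
    rows.append({"Hotel_Name": name, "Room_Price": price})

print(rows)
```

Missing fields end up as `None` in the dictionary, which pandas later shows as an empty value instead of stopping the whole run.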

In [36]:
# create a data frame from the list of dictionaries
dataFrame = pd.DataFrame.from_dict(scraped_data)
dataFrame.head()

|   | Hotel_Name | Room_Price | Hotel_Ratings |
|---|---|---|---|
| 0 | Snow Valley Resorts | 3361 | 4.4 / 5 |
| 1 | Rocky Knob (Explore World Art in One Property) | 2593 | 4.6 / 5 |
| 2 | Radisson Jass Shimla | 8000 | 4.4 / 5 |
| 3 | Hotel Willow Banks | 5275 | 4.1 / 5 |
| 4 | The Oaktree House | 2971 | 4.8 / 5 |


In [None]:
# save the scraped data as CSV file
dataFrame.to_csv('Hotel_Data.csv',index=False)
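
If pandas is not available, the same list of dictionaries can be written with the standard library. The sketch below uses `csv.DictWriter` and `json.dumps` on two rows copied from the output above, and writes to an in-memory buffer instead of a file so that it has no side effects. It is an alternative to the pandas calls, not what this notebook itself runs.

```python
import csv
import io
import json

# Two rows copied from the data frame output shown above
scraped_data = [
    {"Hotel_Name": "Snow Valley Resorts", "Room_Price": "3361", "Hotel_Ratings": "4.4 / 5"},
    {"Hotel_Name": "Radisson Jass Shimla", "Room_Price": "8000", "Hotel_Ratings": "4.4 / 5"},
]

# Write CSV text into a buffer; for a real file use open('Hotel_Data.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(scraped_data[0].keys()))
writer.writeheader()
writer.writerows(scraped_data)
csv_text = buffer.getvalue()

# The same records as a JSON string
json_text = json.dumps(scraped_data, indent=2)

print(csv_text)
```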

Congrats! We have successfully created a basic web scraper. I want you to try out these steps and try to get more data like ratings and address of the hotel. Now let’s see how to perform some common tasks like scraping URLs, Email IDs, Images, and Scrape Data on Page Loads.

# **Single Webpage Scraping**

Two of the most common features we try to scrape are website URLs and email IDs. I'm sure you've worked on projects or challenges where extracting email IDs in bulk was required (see marketing teams!). So let's see how to scrape these aspects in Python.

#### **Using the Console of the Web Browser**
*  Let's say we want to keep track of our Instagram followers and want to know the username of the person who unfollowed our account. First, log in to your Instagram account and click on followers to check the list:                                            
![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/instagram_blurred.png)  

*  Scroll down all the way so that we have all the usernames loaded in the background in our browser's memory  
*  Right-click on the browser's window and click 'Inspect Element'  
*  In the Console Window, type this command:  

## **`urls = $$('a'); for (url in urls) console.log ( urls[url].href);`**

![alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/console.png)  
*  With just one line of code, we can find out all the URLs present on that particular page:
![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/usernames-blurred.png)  
*  Next, save this list at two different time stamps and a simple Python program will let you know the difference between the two. We would be able to know the username of who unfollowed our account!
*  There can be multiple ways we can use this hack to simplify our tasks. The main idea is that with a single line of code we can get all the URLs in one go  
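
The "simple Python program" mentioned above can be as short as a set difference. This is a minimal sketch with made-up usernames; in practice the two lists would be loaded from the files saved at the two time stamps.

```python
# Hypothetical username lists saved at two different times
followers_before = ["alice", "bob", "carol", "dave"]
followers_after = ["alice", "carol", "erin"]

# Set difference: present earlier but missing now means they unfollowed
unfollowed = sorted(set(followers_before) - set(followers_after))
# The reverse difference gives the new followers
new_followers = sorted(set(followers_after) - set(followers_before))

print("Unfollowed:", unfollowed)
print("New followers:", new_followers)
```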

#### **Using the Chrome Extension Email Extractor**
*  [Email Extractor](https://chrome.google.com/webstore/detail/email-extractor/jdianbbpnakhcmfkcckaboohfgnngfcc?hl=en) is a Chrome plugin that captures the Email IDs present on the page that we are currently browsing.
*  It even allows us to download the list of Email IDs in CSV or Text file:  
![](https://s3.amazonaws.com/thinkific/file_uploads/118220/images/497/1cd/31d/1586174856067.jpg)  






# **Multiple Webpage Scraping (BeautifulSoup and Regex)**

The above solutions are efficient only when we want to scrape data from just one page. But what if we want the same steps to be done on multiple webpages?  
There are many websites that can do that for us at some price. But here’s the good news – we can also write our own web scraper using Python! You can use the code below:  

In [37]:
# Web Scraping - URLs and Email IDs
import urllib.request
from bs4 import BeautifulSoup

In [38]:
# URL to scrape
wiki = "https://dlca.logcluster.org/display/public/DLCA/4.1+Nepal+Government+Contact+List"

In [39]:
#Query the website and return the html to the variable 'page'
page = urllib.request.urlopen(wiki) 

In [40]:
#Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page,features='html.parser')
print('\n\nPage Scraped !!!\n\n')
print('\n\nTITLE OF THE PAGE\n\n')



Page Scraped !!!




TITLE OF THE PAGE




In [41]:
print(soup.title.string)
print('\n\nALL THE URLs IN THE WEB PAGE\n\n')

4.1 Nepal Government Contact List - Logistics Capacity Assessment - Digital Logistics Capacity Assessments


ALL THE URLs IN THE WEB PAGE




In [42]:
all_links = soup.find_all('a')
print('Total number of URLs present = ',len(all_links)) 

Total number of URLs present =  108


In [43]:
print('\n\nLast 5 URLs in the page are : \n')

if len(all_links) > 5:

  last_5 = all_links[-5:]

  for url in last_5:
    print(url.get('href'))



Last 5 URLs in the page are : 

http://www.atlassian.com/c/conf/11460
http://www.atlassian.com/software/confluence
https://support.atlassian.com/help/confluence
http://www.atlassian.com/about/connected.jsp?s_kwcid=Confluence-stayintouch
http://www.atlassian.com/


In [44]:
emails=[]
for url in all_links:
  if (str(url.get('href')).find('@') > 0):
    emails.append(url.get('href'))

print('\n\nTotal Number of Email IDs Present: ', len(emails))



Total Number of Email IDs Present:  27


In [45]:
print('\n\nSome of the emails are: \n\n')
for email in emails[:5]:
  print(email)



Some of the emails are: 


mailto:info@nepal.gov.np
mailto:info@moad.gov.np
mailto:info@mod.gov.np
mailto:info@mohp.gov.np
mailto:gunaso@moha.gov.np
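
The collected values still carry the `mailto:` prefix, and the `'@'` test would also accept any other link that happens to contain that character. A rough sketch that strips the prefix and keeps only strings shaped like an address is shown below, using `re`. The pattern is a deliberately simple approximation, not a full validator, and the third href is made up.

```python
import re

# The first two hrefs are from the output above; the third is a made-up non-email link
hrefs = [
    "mailto:info@nepal.gov.np",
    "mailto:info@moad.gov.np",
    "https://example.com/profile/@someone",
]

# Simple approximation of an email address; real-world addresses can be more complex
EMAIL_PATTERN = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

clean_emails = []
for href in hrefs:
    if not href.startswith("mailto:"):
        continue
    # drop the scheme and any '?subject=...' query part
    address = href[len("mailto:"):].split("?", 1)[0]
    if EMAIL_PATTERN.match(address):
        clean_emails.append(address)

print(clean_emails)
```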


# **Scrape Images in Python**

In this section, we will scrape all the images from the same *Goibibo* webpage. The first step is the same as before: navigate to the target website and download the source code. Next, we will find all the images using the (img) tag.

In [46]:
# importing required libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

In [48]:
# target URL
url = "https://www.goibibo.com/hotels/hotels-in-shimla-ct/"

headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"}

response = requests.request("GET", url, headers=headers)

data = BeautifulSoup(response.text, 'html.parser')

In [49]:
# find all with the image tag
images = data.find_all('img', src=True)
print('Number of Images: ', len(images))

Number of Images:  66


In [50]:
for img in images:
  print(img)

<img onload="pagespeed.CriticalImages.checkImageForCriticality(this);" pagespeed_url_hash="753931433" src="https://goibibo.ibcdn.com/styleguide/images/goLogo.png"/>
<img alt="irctc train logo" class="marginT5 marginB5" height="25px" onload="pagespeed.CriticalImages.checkImageForCriticality(this);" pagespeed_url_hash="137809912" src="https://goibibo.ibcdn.com/styleguide/images/train_logo.png" width="24px"/>
<img class="fl" height="23px" onload="pagespeed.CriticalImages.checkImageForCriticality(this);" pagespeed_url_hash="1432842091" src="https://goibibo.ibcdn.com/styleguide/images/goCash-header.png" width="23px"/>

<img class="HotelCardImageGallerystyles__MiniImageStyle-r3dzqu-2 gVnLtq" data-testid="" itemp

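From all the image tags, select only the **src** attribute. Note that many entries are inline `data:` URIs rather than downloadable links, so we keep only the URLs with the image extension we want (`.png` in the code below; change it to `.jpg` for images served in that format):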

In [73]:
# select the src attribute of each image tag
image_src = [x['src'] for x in images]

# keep only png format images (uncomment to apply the filter)
#image_src = [x for x in image_src if x.endswith('.png')]

for img in image_src:
  print(img)

https://goibibo.ibcdn.com/styleguide/images/goLogo.png
https://goibibo.ibcdn.com/styleguide/images/train_logo.png
https://goibibo.ibcdn.com/styleguide/images/goCash-header.png
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
data:image/gif;base64,R0lGODlhAQAB

In [64]:
len(image_src)

3

Now that we have a list of image URLs, all we have to do is request the image content and write it to a file. Make sure that you open the file in **'wb' (write binary)** mode:

In [72]:
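image_count = 1
for img in image_src:

  # skip inline 'data:' URIs, which requests cannot download
  if not img.startswith('http'):
    continue

  # request the image first, then write its bytes in binary mode
  res = requests.get(img)
  with open('image_'+str(image_count)+'.png', 'wb') as f:
    f.write(res.content)

  image_count = image_count+1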
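
The loop above names every file `.png` regardless of the real format. One possible refinement, sketched here with the standard library only, is to take the file name from the URL path. The helper name `filename_from_url` is hypothetical; the sample values are copied from the output shown earlier.

```python
import os
from urllib.parse import urlparse

# Sample src values copied from the output above
image_src = [
    "https://goibibo.ibcdn.com/styleguide/images/goLogo.png",
    "https://goibibo.ibcdn.com/styleguide/images/train_logo.png",
    "data:image/gif;base64,R0lGODlhAQAB",
]

def filename_from_url(url):
    """Return the last path component of an http(s) URL, or None for other schemes."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return None
    return os.path.basename(parsed.path) or None

# Map each downloadable URL to the local file name it would be saved under
targets = {url: filename_from_url(url) for url in image_src if filename_from_url(url)}
print(targets)
```

Each key in `targets` can then be passed to `requests.get()` and its content written to the matching file name in `'wb'` mode.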