# Scraping Product Details by Category from GIVA using Python


![](https://i.imgur.com/cMC7RFW.png)

### Introduction
- Web scraping is the process of automatically extracting specific data from web pages.
- In this project we scrape https://www.giva.co/
- Note: This project is intended for learning purposes only.

### About GIVA
 - GIVA is an e-commerce jewellery store that sells high-quality, affordable designs.
 - We scrape the categories listed on the site and the URL of each category, and for each product its actual price, sale price, rating, and the number of reviews it has received.
   
   
### Tools that are used for the project
1. Python
2. Requests
3. Beautiful Soup
4. Pandas

## Outline of the project:
1. Download the webpage (GIVA website) using `requests`
2. Install and import the required libraries
3. Parse the page and extract the category names and URLs from the website using `BeautifulSoup`
![](https://i.imgur.com/oVqrjse.png)
4. Access each category using its URL
5. Parse the details of the top products into 6 fields (item name, actual price, sale price, rating, number of reviews, item URL) using helper functions
![](https://i.imgur.com/BOTaRld.png)
6. Store the extracted data in a dictionary
7. Compile all the data into a DataFrame using Pandas and save the data to a CSV file


## Using BeautifulSoup to scrape GIVA Website

#### Scraping the list of different Categories and their URL from GIVA

- Use requests to download the page
- Use BeautifulSoup to extract and parse information
- Convert to a pandas DataFrame

### 1.Download the webpage (GIVA Website) using requests

In [21]:
def get_GIVA_page():
    #Downloading the page
    topics_url= "https://www.giva.co/"
    response=requests.get(topics_url)
    
    #Checking if the response is successful
    if response.status_code != 200:
        raise Exception('Failed to Load Page {}'.format(topics_url))
        
    #Parse using BeautifulSoup
    doc=BeautifulSoup(response.text, 'html.parser')
    return doc

### How to Run the code

You can run the code using the `Run` button at the top of the page. You can also make changes and save your own version of the notebook to [Jovian](https://www.jovian.ai) by executing the following cells.

### 2.Installing and Importing required libraries

In [22]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os
import time

In [2]:
get_GIVA_page()

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<!--
Elevar Data Layer V2

This file is automatically updated and should not be edited directly.

https://knowledge.getelevar.com/how-to-customize-data-layer-version-2

Updated: 2022-07-19 20:17:54+00:00
Version: 2.37.5
-->
<!-- Google Tag Manager -->
<script>
  window.dataLayer = window.dataLayer || [];
</script>
<script type="lazyload2">
(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({"gtm.start":
  new Date().getTime(),event:"gtm.js"});var f=d.getElementsByTagName(s)[0],
  j=d.createElement(s),dl=l!="dataLayer"?"&l="+l:"";j.async=true;j.src=
  "https://gtm.giva.co/gtm.js?id="+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,"script","dataLayer","GTM-5XNL7GF");
</script>
<!-- End Google Tag Manager -->
<script id="elevar-gtm-suite-config" type="lazyload2">{"gtm_id": "GTM-5XNL7GF", "event_config": {"cart_reconcile": true, "cart_view": true, "checkout_complete": true, "checkout_step": true, "collection_view": true, "product_ad

We import the packages required for the project:
 - requests (to send HTTP requests and get the responses)
 - BeautifulSoup (for pulling data out of HTML files)
 - pandas (for organising the scraped data in a table and writing it to CSV files)
 - os (for checking whether a file already exists)
 - time (for pausing a few seconds between page downloads)

A successful HTTP response has a status code in the range 200 to 299.
- Refer to this link to learn more about the different status codes: [Status codes](https://www.codegrepper.com/code-examples/whatever/list+of+http+status+codes)
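
As a small illustration of the status-code check, here is a minimal sketch that uses only the standard library. The helper name `is_success` is an assumption made for this example and is not part of the project code.

```python
from http import HTTPStatus

def is_success(status_code):
    """Return True when the status code is in the 2xx (successful) range."""
    return 200 <= status_code <= 299

# Describe a few common status codes using the standard library's HTTPStatus enum
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, is_success(code))
```

With `requests`, calling `response.raise_for_status()` raises an `HTTPError` for 4xx and 5xx responses, which is a common alternative to comparing the status code against 200 by hand.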

In [3]:
doc=get_GIVA_page()

In [4]:
type(doc)

bs4.BeautifulSoup

Let's create some helper functions to parse information from GIVA

- To get the category names, we take the `div` tag with `id = shopify-section-1559045890945` as the parent tag and the `div` tags with `class = collection_list_item` as the child tags

- To get the category URLs, we use the same tags as for the category names and read the `href` of the `a` tag inside each of them
![](https://i.imgur.com/8oyBx1W.png)
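
The parent/child lookup described above can be tried on a small, made-up HTML snippet before running it against the live site. The markup below is a simplified assumption about the page structure (only the `id` and `class` values are taken from the text); the real page contains much more.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the structure described above
sample_html = """
<div id="shopify-section-1559045890945">
  <div class="collection_list_item"><a href="/collections/all-rings">Rings</a></div>
  <div class="collection_list_item"><a href="/collections/earrings">Earrings</a></div>
</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Parent tag: the div with the section id; child tags: divs with the item class
parent = sample_doc.find("div", {"id": "shopify-section-1559045890945"})
children = parent.find_all("div", {"class": "collection_list_item"})

# The visible text gives the category name, the nested <a> tag gives the link
names = [child.text.strip() for child in children]
links = [child.find("a")["href"] for child in children]

print(names)
print(links)
```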

### 3.Parsing the page and extracting the Category Names and URLs from the website using BeautifulSoup

#### Function to get Category Names of GIVA

In [23]:
def Category_Name(doc):
    #defining a variable to get required div_tag (refer to the above image)
    ItemName_tags=doc.find_all('div',{'id':'shopify-section-1559045890945'})[0].find_all('div',{'class':'collection_list_item'})
    ItemNames=[]
    
    #looping over ItemName_tags to get all the Category Names
    for item in ItemName_tags:
        ItemNames.append(item.text.strip())
    return ItemNames

`Category_Name` can be used to get Category Names of GIVA

In [6]:
Category_Name(doc)

['Necklaces & Pendants',
 'Rings',
 'Earrings',
 'Bracelets',
 'Sets',
 'Rakhis',
 'Mangalsutra Collection',
 'Toe Rings',
 'Anklets',
 'Nose Pins',
 'Fragrances & Candles',
 'Chains',
 'Diamond Collection',
 "Men's Collection"]

#### Function to get Category URLs of GIVA

In [24]:
def Category_URLs(doc):
    #defining a variable to get required div_tag (refer to the above image)
    ItemName_tags=doc.find_all('div',{'id':'shopify-section-1559045890945'})[0].find_all('div',{'class':'collection_list_item'})
    ItemURLs=[]
    
    #Since ItemName_tags contain link as well, hence looping over ItemName_tags to get all the Category URLs
    for i in ItemName_tags:
        ItemURLs.append("https://www.giva.co/"+i('a')[0]['href'])
    return ItemURLs

`Category_URLs` can be used to get URLs for each of the Categories

In [8]:
Category_URLs(doc)

['https://www.giva.co//collections/pendants',
 'https://www.giva.co//collections/all-rings',
 'https://www.giva.co//collections/earrings',
 'https://www.giva.co//collections/bracelets',
 'https://www.giva.co//collections/all-sets',
 'https://www.giva.co//collections/rakhis',
 'https://www.giva.co//collections/silver-mangalsutra',
 'https://www.giva.co//collections/toe-rings',
 'https://www.giva.co//collections/anklets',
 'https://www.giva.co//collections/nose-pins',
 'https://www.giva.co//collections/fragrances-candles',
 'https://www.giva.co//collections/chains',
 'https://www.giva.co//collections/diamond-collection',
 'https://www.giva.co//collections/men-silver-jewellery']
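
The URLs above contain a double slash (`https://www.giva.co//collections/...`) because the base URL ends with `/` and each `href` starts with `/`. One way to avoid this, shown here as a sketch rather than as the notebook's method, is `urllib.parse.urljoin` from the standard library. The name `build_category_url` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.giva.co/"

def build_category_url(href):
    """Join the site root with a relative href without duplicating slashes."""
    return urljoin(BASE_URL, href)

print(build_category_url("/collections/pendants"))
```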

Let's scrape the categories in a single function and create a DataFrame

#### 4.Accessing each Category with the URL

In [25]:
def scrape_topics():
    topics_url= "https://www.giva.co/"
    response=requests.get(topics_url)
    if response.status_code != 200:
        raise Exception('Failed to Load Page {}'.format(topics_url))
    doc=BeautifulSoup(response.text, 'html.parser')
    topics_dict={
        'Category Name':Category_Name(doc),
        'Category URL':Category_URLs(doc)
    }  
    return pd.DataFrame(topics_dict)

In [10]:
scrape_topics()

| | Category Name | Category URL |
|---|---|---|
| 0 | Necklaces & Pendants | https://www.giva.co//collections/pendants |
| 1 | Rings | https://www.giva.co//collections/all-rings |
| 2 | Earrings | https://www.giva.co//collections/earrings |
| 3 | Bracelets | https://www.giva.co//collections/bracelets |
| 4 | Sets | https://www.giva.co//collections/all-sets |
| 5 | Rakhis | https://www.giva.co//collections/rakhis |
| 6 | Mangalsutra Collection | https://www.giva.co//collections/silver-mangal... |
| 7 | Toe Rings | https://www.giva.co//collections/toe-rings |
| 8 | Anklets | https://www.giva.co//collections/anklets |
| 9 | Nose Pins | https://www.giva.co//collections/nose-pins |


#### 5.Parsing the top products' details into 6 fields

- We download each category URL and parse it into a doc
- We define the tags that give us `Item Name, Actual Price, Sale Price, Rating, Number of Reviews, Item URL`

#### 1.Function to download each category page

- `get_topic_page` takes a category URL, checks that the response is successful, and parses the page into a doc called `topic_doc`

In [26]:

def get_topic_page(topic_URL):
    #Downloading the page
    response=requests.get(topic_URL)

    #Checking if the response is successful
    if response.status_code != 200:
        raise Exception('Failed to Load Page {}'.format(topic_URL))
    #Pause for 3 seconds before the next request
    time.sleep(3)
        
    #Parse using BeautifulSoup
    topic_doc=BeautifulSoup(response.text,'html.parser')
    return topic_doc
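
`get_topic_page` pauses for 3 seconds on every download so that the site is not hit too quickly. The same idea can be sketched independently of `requests` as a loop that waits between calls. Here `fetch_all`, `fetch`, and `delay_seconds` are illustrative names, and `fetch` stands in for any function that downloads one URL.

```python
import time

def fetch_all(urls, fetch, delay_seconds=3):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results

# Usage with a stand-in fetch function (no network access needed)
pages = fetch_all(["page-1", "page-2"], fetch=lambda url: url.upper(), delay_seconds=0)
print(pages)
```

In the notebook, `fetch` could be a function like `get_topic_page` with its own `time.sleep` call removed, so that the delay lives in one place.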

#### 2.Function to define tags of each result

- `get_repo_info` takes the `div` tag of a single product as input and extracts all the required fields from it
- This `div` tag is the main tag; all the other tags that hold the required information are nested under it

In [27]:
def get_repo_info(div_tag):
    base_url="https://www.giva.co/"
    a_tags=div_tag.find_all('a')
    Item_Name=a_tags[0].text.strip()
    Item_URL=base_url+a_tags[0]['href']
    Sale_Price=div_tag.find('span',class_='price-item price-item--sale money').text.strip()
    Regular_Price=div_tag.find('span',class_='price-item price-item--regular').text.strip()
    Item_Review=div_tag.find('div',class_='loox-rating')['data-raters']
    Item_Rating=div_tag.find('div',class_='loox-rating')['data-rating']
    return Item_Name,Sale_Price,Regular_Price,Item_Review,Item_Rating,Item_URL

- To get the Item Name and Item URL, we take the first `a` tag inside the product's `div` tag
- To get the Sale Price, we use the `span` tag with `class = price-item price-item--sale money`
![](https://i.imgur.com/r86B7Rl.png)

- To get the Regular Price, we use the `span` tag with `class = price-item price-item--regular`
![](https://i.imgur.com/WpG1qMo.png)

- To get the number of Reviews, we use the `div` tag with `class = loox-rating` and read its `data-raters` attribute
![](https://i.imgur.com/U2Z2hVy.png)

- To get the Rating, we use the `div` tag with `class = loox-rating` and read its `data-rating` attribute
![](https://i.imgur.com/2ZA5C1i.png)
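
If a product card has no rating block, `find` returns `None`, and indexing `None` raises a `TypeError`. A defensive variant might look like the sketch below. The helper `get_attr_or_none` and the sample markup are assumptions for illustration; only the `loox-rating` class and the `data-rating` / `data-raters` attributes come from the code above.

```python
from bs4 import BeautifulSoup

def get_attr_or_none(parent_tag, tag_name, class_name, attr_name):
    """Return an attribute of the first matching child tag, or None if absent."""
    tag = parent_tag.find(tag_name, class_=class_name)
    if tag is None:
        return None
    return tag.get(attr_name)

# Hypothetical product cards: one with a rating block, one without
cards_html = """
<div class="card"><div class="loox-rating" data-rating="4.8" data-raters="30"></div></div>
<div class="card"></div>
"""
cards = BeautifulSoup(cards_html, "html.parser").find_all("div", class_="card")

ratings = [get_attr_or_none(card, "div", "loox-rating", "data-rating") for card in cards]
raters = [get_attr_or_none(card, "div", "loox-rating", "data-raters") for card in cards]
print(ratings, raters)
```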

### 6.Storing the extracted data into a dictionary.

- Here we create a dictionary `topic_repos_dict` to store all the information that we have parsed
- We then create a pandas DataFrame whose columns come from the keys of the dictionary

In [28]:
def get_topic_repos(topic_doc):
    #Getting div tag which contains all the information that we need
    repo_tags=topic_doc('div',{'class':'grid-view-item product-card'})
    
    topic_repos_dict={
    'Item_Name': [],
    'Sale_Price':[],
    'Regular_Price':[],
    'Item_Review':[],
    'Item_Rating':[],
    'Item_URL':[]
    }
    #Get the info of each product
    for repo_tag in repo_tags:
        repo_info=get_repo_info(repo_tag)
        topic_repos_dict['Item_Name'].append(repo_info[0])
        topic_repos_dict['Sale_Price'].append(repo_info[1])
        topic_repos_dict['Regular_Price'].append(repo_info[2])
        topic_repos_dict['Item_Review'].append(repo_info[3])
        topic_repos_dict['Item_Rating'].append(repo_info[4])
        topic_repos_dict['Item_URL'].append(repo_info[5])
        
    return pd.DataFrame(topic_repos_dict)

### 7.Compiling all the data into a DataFrame using Pandas and saving the data into CSV file

- We store the parsed data in CSV files using `topic_df.to_csv(path,index=None)`
- Before scraping, we check whether the file already exists; if it does, that category is skipped and we move on to the next one



In [43]:
def scrape_topic(topic_url, path):
    
    if os.path.exists(path):
        print("The file {} exists, skipping".format(path))
        return
    topic_df=get_topic_repos(get_topic_page(topic_url))
    topic_df.to_csv(path,index=None)


#### Function to Store CSV files into Final CSV

- We have a function to get the list of categories
- We have a function to create a CSV file for each category from the products scraped from GIVA
- Let's create a function to put them all together

In [44]:
def scrape_topics_repos():
    print('Scraping list of Categories from Giva')
    topic_df=scrape_topics()
    
    for index, row in topic_df.iterrows():
        print('Scraping top items for "{}"'.format(row['Category Name']))
        scrape_topic(row['Category URL'], '{}.csv'.format(row['Category Name']))

This gives us top products for all the categories on the page

In [45]:
scrape_topics_repos()

Scraping list of Categories from Giva
Scraping top items for "Necklaces & Pendants"
Scraping top items for "Rings"
Scraping top items for "Earrings"
Scraping top items for "Bracelets"
Scraping top items for "Sets"
Scraping top items for "Rakhis"
Scraping top items for "Mangalsutra Collection"
Scraping top items for "Toe Rings"
Scraping top items for "Anklets"
Scraping top items for "Nose Pins"
Scraping top items for "Fragrances & Candles"
Scraping top items for "Chains"
Scraping top items for "Diamond Collection"
Scraping top items for "Men's Collection"
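
The CSV file names are built directly from category names such as `Necklaces & Pendants` and `Men's Collection`. If simpler file names are preferred, a small helper can normalise them. This is an optional sketch; `category_to_filename` is a hypothetical name and is not used elsewhere in the notebook.

```python
import re

def category_to_filename(category_name):
    """Turn a category name into a lowercase, underscore-separated CSV file name."""
    # Drop apostrophes first so "Men's" becomes "Mens" rather than "Men_s"
    cleaned = category_name.replace("'", "")
    # Replace every run of characters that is not a letter or digit with "_"
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", cleaned).strip("_")
    return cleaned.lower() + ".csv"

for name in ["Necklaces & Pendants", "Men's Collection", "Toe Rings"]:
    print(name, "->", category_to_filename(name))
```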


In [46]:
df = pd.concat(
    map(pd.read_csv, ['Necklaces & Pendants.csv',
 'Rings.csv',
 'Earrings.csv',
 'Bracelets.csv',
 'Sets.csv',
 'Rakhis.csv',
 'Mangalsutra Collection.csv',
 'Toe Rings.csv',
 'Anklets.csv',
 'Nose Pins.csv',
 'Fragrances & Candles.csv',
 'Chains.csv',
 'Diamond Collection.csv',
 "Men's Collection.csv"]), ignore_index=True)
print(df)


                                             Item_Name    Sale_Price  \
0    Anushka Sharma Golden Star Constellation Necklace  Rs. 1,699.00   
1    Anushka Sharma Silver Zircon Pendant with Link...  Rs. 1,599.00   
2                             Rose Gold Heart Necklace  Rs. 1,999.00   
3            Anushka Sharma Silver Deer Heart Necklace  Rs. 1,799.00   
4                Anushka Sharma Silver Queens Necklace  Rs. 1,599.00   
..                                                 ...           ...   
104                       Silver Shine Zircon Earrings  Rs. 1,199.00   
105  Oxidised Silver Unwind Moon Pendant with Box C...  Rs. 1,699.00   
106                      Oxidised Silver Om Shiva Ring  Rs. 1,499.00   
107              Oxidised Silver Threaded Sun Bracelet    Rs. 999.00   
108                      Silver Solitaire Band For Him  Rs. 2,399.00   

    Regular_Price  Item_Review  Item_Rating  \
0    Rs. 2,999.00        179.0          4.9   
1    Rs. 2,599.00        231.0          4

In [47]:
df.to_csv('FinalFile.csv',index=False)
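
The list of file names above is typed out by hand. As an alternative sketch, the same list can be derived from the category names, and a `Category` column can be added so that each row in the combined file records where it came from. The function name `combine_category_csvs` and the `Category` column are assumptions, not part of the original notebook.

```python
import os
import tempfile
import pandas as pd

def combine_category_csvs(category_names, folder="."):
    """Read '<category>.csv' for each name and stack them into one DataFrame."""
    frames = []
    for name in category_names:
        frame = pd.read_csv(os.path.join(folder, "{}.csv".format(name)))
        frame["Category"] = name  # remember which file each row came from
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# Small self-contained demo using made-up rows in a temporary folder
with tempfile.TemporaryDirectory() as demo_folder:
    pd.DataFrame({"Item_Name": ["A", "B"]}).to_csv(
        os.path.join(demo_folder, "Rings.csv"), index=False)
    pd.DataFrame({"Item_Name": ["C"]}).to_csv(
        os.path.join(demo_folder, "Chains.csv"), index=False)
    combined = combine_category_csvs(["Rings", "Chains"], folder=demo_folder)

print(combined)
```

In the notebook this could be called as `combine_category_csvs(Category_Name(doc))`, assuming a CSV file was written for every category.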

Check that the CSV files were created successfully by reading one of them with pandas

In [48]:
pd.read_csv('Anklets.csv')

| | Item_Name | Sale_Price | Regular_Price | Item_Review | Item_Rating | Item_URL |
|---|---|---|---|---|---|---|
| 0 | Silver Black Bead Anklet | Rs. 2,199.00 | Rs. 2,699.00 | 30.0 | 4.8 | https://www.giva.co//collections/anklets/produ... |
| 1 | Silver Tiny Charm Anklet | Rs. 1,599.00 | Rs. 2,799.00 | | | https://www.giva.co//collections/anklets/produ... |
| 2 | Rose Gold Heart Anklet | Rs. 2,299.00 | Rs. 3,299.00 | 1.0 | 5.0 | https://www.giva.co//collections/anklets/produ... |
| 3 | Layered Rose Gold Queen's Anklet | Rs. 1,699.00 | Rs. 2,999.00 | | | https://www.giva.co//collections/anklets/produ... |
| 4 | Oxidised Silver Leaf Anklet | Rs. 1,599.00 | Rs. 2,999.00 | 9.0 | 4.7 | https://www.giva.co//collections/anklets/produ... |
| 5 | Silver Snowflake Charm Anklet | Rs. 1,699.00 | Rs. 2,099.00 | 1.0 | 4.0 | https://www.giva.co//collections/anklets/produ... |
| 6 | Silver Zircon Bubble Anklet | Rs. 1,699.00 | Rs. 2,999.00 | 3.0 | 4.7 | https://www.giva.co//collections/anklets/produ... |
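
The price columns are stored as text such as `Rs. 2,199.00`, so they cannot be compared or averaged directly. A possible cleanup step, sketched here with the standard library only, converts them to numbers. `parse_price` is a hypothetical helper, and it assumes every price follows the `Rs. 1,234.00` pattern seen in the output above.

```python
import re

def parse_price(price_text):
    """Convert a string like 'Rs. 2,199.00' to a float; return None if no number is found."""
    # Match digits with optional thousands separators and an optional decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

sale = parse_price("Rs. 2,199.00")
regular = parse_price("Rs. 2,699.00")

# Discount of the sale price relative to the regular price, in percent
discount_percent = round((regular - sale) / regular * 100, 1)
print(sale, regular, discount_percent)
```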


# Summary 

 - We have successfully scraped `https://www.giva.co/`.
 - Scraping was done using Python, requests, BeautifulSoup and Pandas.
 - We scraped the top products from the 14 categories of the GIVA website, capturing details such as the actual price, sale price, rating and number of reviews of each product.
 - We saved the scraped data to one CSV file per category, each with 6 columns, and combined them into `FinalFile.csv`, which has 109 rows (index 0 to 108) and 6 columns.
 

# Ideas for future work

 - Other groupings of products, such as collections `by color,stone,style` etc., can be scraped
![](https://i.imgur.com/nNsLwz2.png)
 - The `GIVA` website can also be scraped using `Selenium`, since it has dynamic pages as well
   Website: https://www.giva.co/
 - Improve the documentation of the project
 - Scrape flight/hotel/bus/train details (MakeMyTrip): similar to this project, capture details such as the name, timings, route, and price of each flight/bus/train
   Website: https://www.makemytrip.com/

# References

 - Complete project: https://jovian.ai/aakashns/python-web-scraping-project-guide
 - Requests: https://www.w3schools.com/python/module_requests.asp
 - BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
 - Pandas: https://www.w3schools.com/python/pandas/default.asp


In [None]:
import jovian

jovian.commit(files=['FinalFile.csv'])