## Introduction
- Today, the Internet holds far more data than it did a decade ago. According to Forbes, 2.5 quintillion bytes of data are generated every day at our current pace, driven in large part by Internet of Things (IoT) devices. With access to this data in the form of audio, video, text, images, or any other format, most businesses rely heavily on it to beat their competitors and succeed. Unfortunately, most of this data is not open: most websites do not provide an option to save the data they display. This is where web scraping tools and software come in, extracting the data from the websites.

## What is Web Scraping? 
- Web Scraping is the process of automatically downloading the data displayed on a website using a computer program. A web scraping tool can scrape multiple pages from a website and automate the tedious task of manually copying and pasting the displayed data. Web Scraping is important because, irrespective of the industry, the web contains information that can provide actionable insights for businesses to gain an advantage over competitors.

## Steps to fetch data using Web Scraping in Python
- Find the URL that you want to scrape
- Inspect the page
- Find the data you want to extract
- Write the code
- Run the code & extract the data
- Finally, store the data in the required format

## Packages used for Web Scraping
- We'll use the following Python packages:
   - Pandas: Pandas is a library used for data manipulation and analysis. Here it is used to store and view the data in the desired format.
   - BeautifulSoup4: BeautifulSoup is a Python library used for parsing HTML documents. It creates parse trees that are helpful in extracting tags from the HTML string.
   - Selenium: Selenium is a tool designed to help you run automated tests on web applications. Although it's not its main purpose, Selenium is also used in Python for web scraping, because it drives a real browser and can therefore access JavaScript-rendered content (which an HTML parser like BeautifulSoup can't produce on its own). We'll use Selenium to download the HTML content from Flipkart and see interactively what's happening.

## Project Demonstration

### Importing necessary Libraries

In [1]:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

### Starting up the WebDriver

In [2]:
# Creating an instance of webdriver for google chrome
driver = webdriver.Chrome()

In [3]:
# Using webdriver we'll now open the flipkart website in chrome
url = 'https://flipkart.com'
# We'll use the get method of driver and pass in the URL
driver.get(url)

- Now there are a few ways we can conduct a product search:
   - A) The first is to automate the browser: find the search input element, type the text into it, and press the Enter key, as in the image below.

![](https://imgur.com/eWdEOk8.png)

- However, this kind of automation is unnecessary and creates more opportunities for the program to fail. The rule of thumb when web scraping is to automate only what you absolutely need to.

- B) The second is to type the search term into the search box ourselves and press Enter. You'll notice that the search term is now embedded in the URL. We can use this pattern to create a function that builds the URL for our driver to retrieve. This is more efficient in the long term and less prone to program failure. See the image below.

![](https://imgur.com/79tSPGV.png)

- Let's copy this pattern and create a function that inserts the search term using string formatting.

In [4]:
def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    # We're replacing every space with '+' to match the URL pattern
    search_item = search_item.replace(" ","+")
    return template.format(search_item)

- Now we have a function that will generate a URL based on the search term we provide.

In [5]:
# Checking whether the function is working properly or not
url = get_url('mobile phones')
print(url)

https://www.flipkart.com/search?q=mobile+phones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on


- The function produces the same URL as before.
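
- As a hedged alternative, the same kind of URL can be assembled with the standard library's `urllib.parse.urlencode`, which escapes spaces as `+` and also handles other special characters. The sketch below assumes that Flipkart accepts a search URL carrying only the `q` parameter (and optionally `page`); the tracking parameters from the copied URL are dropped on that assumption. `BASE_URL` and `build_search_url` are illustrative names, not part of the notebook.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint works without the tracking parameters
BASE_URL = 'https://www.flipkart.com/search'

def build_search_url(search_item, page=None):
    """Build a search URL; urlencode turns spaces into '+' and escapes '&', '=' etc."""
    params = {'q': search_item}
    if page is not None:
        params['page'] = page
    return BASE_URL + '?' + urlencode(params)

print(build_search_url('mobile phones'))
# https://www.flipkart.com/search?q=mobile+phones
print(build_search_url('mobile phones', page=2))
# https://www.flipkart.com/search?q=mobile+phones&page=2
```

- Letting `urlencode` do the escaping means a search term such as `'pen & paper'` does not accidentally start a new query parameter.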

### Extracting the collection
- Now we are going to extract the contents of the webpage that holds the information we want.
- To do that, we create a BeautifulSoup object that parses the HTML content of the page source.

In [17]:
# Loading the search results page so that driver.page_source holds the product listing
driver.get(url)
# Creating a soup object from driver.page_source and parsing the HTML text with the default HTML parser
soup = BeautifulSoup(driver.page_source, 'html.parser')

![](https://imgur.com/PR7ZWht.png)

- The card highlighted by the box above contains all the information we need for a mobile phone, so let's find the tags for these boxes/cards.
- We'll extract the model, stars, number of ratings, number of reviews, RAM, storage capacity, expandable storage option, display, camera information, battery, processor, warranty and price.

## Inspecting the tags

![](https://imgur.com/ug5Ft6R.png)

- We can fetch the `a` tags with `class = _1fQZEK` to get all the cards/boxes, and then easily pull the information for each mobile phone out of these boxes.

In [19]:
results = soup.find_all('a',{'class':"_1fQZEK"})
len(results)

24

### Prototyping for a single record

In [None]:
# picking the 1st card from the complete list of cards
item = results[0]

In [21]:
# Extracting the model of the phone from the 1st card
model = item.find('div',{'class':"_4rR01T"}).text
model

'REDMI 9i (Nature Green, 64 GB)'

In [22]:
# Extracting Stars from 1st card
star = item.find('div',{'class':"_3LWZlK"}).text
star

'4.3'

In [23]:
# Extracting Number of Ratings from 1st card
# The span text reads '<ratings>\xa0&\xa0<reviews>', so the ratings are the part before the separator
num_ratings = item.find('span',{'class':"_2_R_DZ"}).text.split('\xa0&\xa0')[0].strip()
num_ratings

'4,06,452 Ratings'

In [24]:
# Extracting Number of Reviews from 1st card
# The span text reads '<ratings>\xa0&\xa0<reviews>', so the reviews are the part after the separator
reviews = item.find('span',{'class':"_2_R_DZ"}).text.split('\xa0&\xa0')[-1].strip()
reviews

'23,336 Reviews'
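
- The two cells above pull both values out of the same span. A minimal sketch of doing this in one step with `str.partition` is shown below; `split_ratings_reviews` is an illustrative helper name, and the input string is assumed to have the `'<ratings>\xa0&\xa0<reviews>'` shape seen on this card.

```python
def split_ratings_reviews(text):
    """Split '<ratings>\xa0&\xa0<reviews>' into its two labels.

    str.strip() also removes the non-breaking spaces (\xa0) around the '&'.
    If there is no '&', the second value is an empty string.
    """
    ratings, _, reviews = text.partition('&')
    return ratings.strip(), reviews.strip()

num_ratings, reviews = split_ratings_reviews('4,06,452 Ratings\xa0&\xa023,336 Reviews')
print(num_ratings)  # 4,06,452 Ratings
print(reviews)      # 23,336 Reviews
```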

In [25]:
# Extracting RAM from the 1st card
ram = item.find('li',{'class':"rgWa7D"}).text[0:item.find('li',{'class':"rgWa7D"}).text.find('|')]
ram

'4 GB RAM '

In [26]:
# Extracting Storage/ROM from 1st card
storage = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][0:10].strip()
storage

'64 GB ROM'

In [27]:
# Extracting whether there is an option of expanding the storage or not
expandable = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][13:]
expandable

'Expandable Upto 512 GB'
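
- The fixed-width slices `[0:10]` and `[13:]` work for a two-digit storage size, but they cut a three-digit value short (the `256 GB RO` entry in the final scraped data is an example). A hedged sketch of a position-independent alternative is to split the line on `|` and classify each part by its wording. `parse_memory_specs` is an illustrative name, and the sketch assumes each part ends with `RAM` or `ROM`, or starts with `Expandable`, as on this card.

```python
def parse_memory_specs(text):
    """Split 'RAM | ROM | Expandable ...' into named parts without fixed offsets."""
    specs = {'ram': '', 'storage': '', 'expandable': ''}
    for part in text.split('|'):
        part = part.strip()
        if part.endswith('RAM'):
            specs['ram'] = part
        elif part.endswith('ROM'):
            specs['storage'] = part
        elif part.startswith('Expandable'):
            specs['expandable'] = part
    return specs

print(parse_memory_specs('4 GB RAM | 64 GB ROM | Expandable Upto 512 GB'))
# {'ram': '4 GB RAM', 'storage': '64 GB ROM', 'expandable': 'Expandable Upto 512 GB'}
print(parse_memory_specs('12 GB RAM | 256 GB ROM')['storage'])
# 256 GB ROM
```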

In [28]:
# Extracting the display option from the 1st card
display = item.find_all('li')[1].text.strip()
display

'16.59 cm (6.53 inch) HD+ Display'

In [29]:
# Extracting camera options from the 1st card
camera = item.find_all('li')[2].text.strip()
camera

'13MP Rear Camera | 5MP Front Camera'

In [30]:
# Extracting the battery option from the 1st card
battery = item.find_all('li')[3].text
battery

'5000 mAh Lithium Polymer Battery'

In [31]:
# Extracting the processor option from the 1st card
processor = item.find_all('li')[4].text.strip()
processor

'MediaTek Helio G25 Processor'

In [32]:
# Extracting Warranty from the 1st card
warranty = item.find_all('li')[-1].text.strip()
warranty

'Brand Warranty of 1 Year Available for Mobile and 6 Months for Accessories'

In [33]:
# Extracting price of the model from the 1st card
price = item.find('div',{'class':'_30jeq3 _1_WHN1'}).text

### Generalizing the Pattern

- Now let's create a function that extracts all of this information from a single card, so that we can apply it to every card on a page.

In [34]:
def extract_phone_model_info(item):
    """
    This function extracts the model, stars, number of ratings, number of reviews, RAM, storage,
    expandable storage option, display, camera, battery, processor, warranty and price
    of a phone model from a single flipkart result card
    """
    # Model name of the phone
    model = item.find('div',{'class':"_4rR01T"}).text
    # Star rating
    star = item.find('div',{'class':"_3LWZlK"}).text
    # The ratings span reads '<ratings>\xa0&\xa0<reviews>'; split it on the separator
    ratings_text = item.find('span',{'class':"_2_R_DZ"}).text
    num_ratings = ratings_text.split('\xa0&\xa0')[0].strip()
    reviews = ratings_text.split('\xa0&\xa0')[-1].strip()
    # The first spec line reads 'RAM | ROM | Expandable option'
    specs = item.find('li',{'class':"rgWa7D"}).text
    after_ram = specs[specs.find('|')+1:]
    # RAM is everything before the first '|'
    ram = specs[0:specs.find('|')]
    # Storage/ROM is the next 10 characters
    storage = after_ram[0:10].strip()
    # Whether there is an option of expanding the storage or not
    expandable = after_ram[13:]
    # The remaining spec lines are picked by their position in the list
    display = item.find_all('li')[1].text.strip()
    camera = item.find_all('li')[2].text.strip()
    battery = item.find_all('li')[3].text
    processor = item.find_all('li')[4].text.strip()
    # Warranty is always the last line
    warranty = item.find_all('li')[-1].text.strip()
    # Price of the model
    price = item.find('div',{'class':'_30jeq3 _1_WHN1'}).text
    result = (model,star,num_ratings,reviews,ram,storage,expandable,display,camera,battery,processor,warranty,price)
    return result

In [35]:
# Extracting the information from every card/phone model on the page and collecting it in a list
records_list = []
results = soup.find_all('a',{'class':"_1fQZEK"})
for item in results:
    records_list.append(extract_phone_model_info(item))
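
- BeautifulSoup's `find` returns `None` when no tag matches, so a card that lacks one of the elements (for example a phone with no ratings yet) would raise an `AttributeError` on `.text` and stop the loop. A minimal sketch of a guard is shown below; `safe_text` is an illustrative helper, not part of the notebook, and `SimpleNamespace` objects stand in for real tags so the snippet runs without a browser.

```python
from types import SimpleNamespace

def safe_text(tag, default=''):
    """Return the stripped text of a tag, or `default` when the lookup found nothing."""
    if tag is None:
        return default
    return tag.text.strip()

# Stand-ins for BeautifulSoup results: a found tag and a failed lookup (None)
found = SimpleNamespace(text=' 4.3 ')
print(safe_text(found))        # 4.3
print(repr(safe_text(None)))   # ''
```

- Inside the extraction function this would be used as, for example, `star = safe_text(item.find('div',{'class':"_3LWZlK"}))`.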

- Let's view what our dataframe looks like for the 1st page.

In [36]:
pd.DataFrame(records_list,columns=['model',"star","num_ratings"
   ,"reviews",'ram',"storage","expandable","display","camera","battery","processor","warranty","price"])

Unnamed: 0,model,star,num_ratings,reviews,ram,storage,expandable,display,camera,battery,processor,warranty,price
0,"REDMI 9i (Nature Green, 64 GB)",4.3,"4,06,452 Ratings","23,336 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 512 GB,16.59 cm (6.53 inch) HD+ Display,13MP Rear Camera | 5MP Front Camera,5000 mAh Lithium Polymer Battery,MediaTek Helio G25 Processor,Brand Warranty of 1 Year Available for Mobile ...,"₹8,499"
1,"realme C21 (Cross Blue, 64 GB)",4.4,"63,273 Ratings","2,912 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹9,499"
2,"realme C21 (Cross Black, 64 GB)",4.4,"63,273 Ratings","2,912 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹9,499"
3,"realme C21 (Cross Black, 32 GB)",4.4,"51,035 Ratings","2,564 Reviews",3 GB RAM,32 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹8,499"
4,"realme C21 (Cross Blue, 32 GB)",4.4,"51,035 Ratings","2,564 Reviews",3 GB RAM,32 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹8,499"
5,"REDMI 9 Power (Mighty Black, 64 GB)",4.3,"1,30,038 Ratings","9,051 Reviews",4 GB RAM,64 GB ROM,,16.59 cm (6.53 inch) Full HD+ Display,48MP + 8MP + 2MP + 2MP | 8MP Front Camera,6000 mAh Battery,Qualcomm Snapdragon 662 Processor,1 year manufacturer warranty for device and 6 ...,"₹10,999"
6,"POCO M3 (Cool Blue, 64 GB)",4.3,"14,630 Ratings",930 Reviews,4 GB RAM,64 GB ROM,Expandable Upto 512 GB,16.59 cm (6.53 inch) Full HD+ Display,48MP + 2MP + 2MP | 8MP Front Camera,6000 mAh Lithium-ion Polymer Battery,Qualcomm Snapdragon 662 Processor,"One Year Warranty for Handset, 6 Months for Ac...","₹10,499"
7,"POCO M2 Reloaded (Mostly Blue, 64 GB)",4.3,"20,010 Ratings","1,315 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 512 GB,16.59 cm (6.53 inch) Full HD+ Display,13MP + 8MP + 5MP + 2MP | 8MP Front Camera,5000 mAh Lithium Polymer Battery,MediaTek Helio G80 Processor,"1 Year for Handset, 6 Months for Accessories","₹9,999"
8,"realme C11 2021 (Cool Grey, 32 GB)",4.3,"3,584 Ratings",197 Reviews,2 GB RAM,32 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,8MP Rear Camera | 5MP Front Camera,5000 mAh Battery,Octa-core Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹6,999"
9,"realme C11 2021 (Cool Blue, 32 GB)",4.3,"3,584 Ratings",197 Reviews,2 GB RAM,32 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,8MP Rear Camera | 5MP Front Camera,5000 mAh Battery,Octa-core Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹6,999"


### Navigating to next page

- Let's extend the URL function with a page number placeholder so that we can fetch information from multiple pages.

In [37]:
def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    search_item = search_item.replace(" ","+")
    # Add the search term to the URL
    url = template.format(search_item)
    # Add a page number placeholder, filled in later with url.format(page)
    url += '&page={}'
    return url

### Putting it all together

- Now let's combine all the things that we have done so far.

In [45]:
# Importing necessary Libraries
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    search_item = search_item.replace(" ","+")
    # Add the search term to the URL
    url = template.format(search_item)
    # Add a page number placeholder, filled in later with url.format(page)
    url += '&page={}'
    return url

def extract_phone_model_info(item):
    """
    This function extracts the model, stars, number of ratings, number of reviews, RAM, storage,
    expandable storage option, display, camera, battery, processor, warranty and price
    of a phone model from a single flipkart result card
    """
    # Model name of the phone
    model = item.find('div',{'class':"_4rR01T"}).text
    # Star rating
    star = item.find('div',{'class':"_3LWZlK"}).text
    # The ratings span reads '<ratings>\xa0&\xa0<reviews>'; split it on the separator
    ratings_text = item.find('span',{'class':"_2_R_DZ"}).text
    num_ratings = ratings_text.split('\xa0&\xa0')[0].strip()
    reviews = ratings_text.split('\xa0&\xa0')[-1].strip()
    # The first spec line reads 'RAM | ROM | Expandable option'
    specs = item.find('li',{'class':"rgWa7D"}).text
    after_ram = specs[specs.find('|')+1:]
    # RAM is everything before the first '|'
    ram = specs[0:specs.find('|')]
    # Storage/ROM is the next 10 characters
    storage = after_ram[0:10].strip()
    # Whether there is an option of expanding the storage or not
    expandable = after_ram[13:]
    # The remaining spec lines are picked by their position in the list
    display = item.find_all('li')[1].text.strip()
    camera = item.find_all('li')[2].text.strip()
    battery = item.find_all('li')[3].text
    processor = item.find_all('li')[4].text.strip()
    # Warranty is always the last line
    warranty = item.find_all('li')[-1].text.strip()
    # Price of the model
    price = item.find('div',{'class':'_30jeq3 _1_WHN1'}).text
    result = (model,star,num_ratings,reviews,ram,storage,expandable,display,camera,battery,processor,warranty,price)
    return result

def main(search_item):
    '''
    This function scrapes the details of every phone from all the result pages and saves them to a csv file
    '''
    driver = webdriver.Chrome()
    records = []
    url = get_url(search_item)
    for page in range(1,464):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source,'html.parser')
        results = soup.find_all('a',{'class':"_1fQZEK"})
        for item in results:
            records.append(extract_phone_model_info(item))
    # quit() closes every window and ends the WebDriver session
    driver.quit()
    # Saving the data into a csv file
    with open('Flipkart_results.csv','w',newline='',encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Model','Stars','Num_of_Ratings','Reviews','Ram','Storage','Expandable',
                        'Display','Camera','Battery','Processor','Warranty','Price'])
        writer.writerows(records)
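
- The loop in `main` relies on a hard-coded page count. A hedged sketch of an alternative is to keep requesting pages until one comes back with no cards. In the sketch below, `collect_pages` and `fetch_page` are illustrative names: `fetch_page` stands for any callable that loads one page and returns its list of records (in this notebook it would wrap `driver.get`, the soup creation and `extract_phone_model_info`), and `max_pages` is only a safety cap.

```python
def collect_pages(fetch_page, max_pages=500):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first empty page."""
    records = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(page)
        if not rows:
            # No cards on this page: assume we have run past the last results page
            break
        records.extend(rows)
    return records

# Demonstration with a fake two-page site instead of a real browser
fake_site = {1: [('phone A',), ('phone B',)], 2: [('phone C',)]}
print(collect_pages(lambda page: fake_site.get(page, [])))
# [('phone A',), ('phone B',), ('phone C',)]
```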

### Extracting Information of all the Mobile phones present on multiple pages

In [46]:
%%time
main('mobile phones')

Wall time: 40min 54s


### Viewing the data

In [47]:
scraped_df = pd.read_csv('C:\\Users\\DELL\\Desktop\\Jupyter Notebook\\Jovian Web Scraping\\Amazon Products Web Scrapper\\Flipkart_results.csv')
scraped_df.head()

Unnamed: 0,Model,Stars,Num_of_Ratings,Reviews,Ram,Storage,Expandable,Display,Camera,Battery,Processor,Warranty,Price
0,"REDMI 9i (Nature Green, 64 GB)",4.3,"4,06,452 Ratings","23,336 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 512 GB,16.59 cm (6.53 inch) HD+ Display,13MP Rear Camera | 5MP Front Camera,5000 mAh Lithium Polymer Battery,MediaTek Helio G25 Processor,Brand Warranty of 1 Year Available for Mobile ...,"₹8,499"
1,"realme C21 (Cross Black, 64 GB)",4.4,"63,273 Ratings","2,912 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹9,499"
2,"realme C21 (Cross Blue, 64 GB)",4.4,"63,273 Ratings","2,912 Reviews",4 GB RAM,64 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹9,499"
3,"SAMSUNG Galaxy S21 Ultra (Phantom Silver, 256 GB)",4.5,537 Ratings,101 Reviews,12 GB RAM,256 GB RO,,17.27 cm (6.8 inch) Quad HD+ Display,108MP + 12MP + 10MP + 10MP | 40MP Front Camera,5000 mAh Lithium-ion Battery,Exynos 2100 Processor,1 Year Manufacturer Warranty for Handset and 6...,"₹1,05,999"
4,"realme C21 (Cross Black, 32 GB)",4.4,"51,035 Ratings","2,564 Reviews",3 GB RAM,32 GB ROM,Expandable Upto 256 GB,16.51 cm (6.5 inch) HD+ Display,13MP + 2MP + 2MP | 5MP Front Camera,5000 mAh Battery,MediaTek Helio G35 Processor,1 Year Warranty for Mobile and 6 Months for Ac...,"₹8,499"
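
- All the scraped columns are stored as text, so values such as `₹8,499` or `4,06,452 Ratings` need converting before they can be sorted or averaged. A minimal sketch using the standard library's `re` module is shown below; `to_int` is an illustrative helper name, and the sketch assumes every digit in the label belongs to a single whole number.

```python
import re

def to_int(label):
    """Keep only the digits of a label like '₹8,499' and return them as an int."""
    digits = re.sub(r'\D', '', label)
    return int(digits) if digits else None

print(to_int('₹8,499'))            # 8499
print(to_int('4,06,452 Ratings'))  # 406452
print(to_int('930 Reviews'))       # 930
```

- With pandas, the same helper could be applied to a column, for example `scraped_df['Price'].apply(to_int)`.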
