# Web Scraping

### From 'edureka' page on web scraping - https://www.edureka.co/blog/web-scraping-with-python/

## *Written By Nathanael Hitch*

Web scraping is useful for 'pulling' large amounts of data from a website; it is much faster and easier than going to the website and collecting the data manually. Examples include:

- Price Comparison: collecting data from online shopping websites to compare prices
- Social Media Trending: collecting data from social media websites, such as Twitter, to see what's trending

Web scraping collects data that is usually unstructured and stores it in a structured form. There are different ways to web scrape, including:

- Online Services
- APIs
- Writing your own code

### Can you scrape off of THIS website?

Although web scraping is legal, some websites allow it and others don't. To see whether a website allows web scraping, look at the site's "robots.txt" file by appending "/robots.txt" to the site's root URL.
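
The robots.txt check can also be done in code. Below is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content and the `example.com` URLs are made up for illustration, and `is_allowed` is a hypothetical helper name; for a real site the file would be downloaded first (for example with `RobotFileParser.set_url()` followed by `.read()`).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The product listing is permitted, the checkout pages are not.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/laptops"))        # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/checkout/cart"))  # False
```

Rules are checked in the order they appear in the file, so the more specific `Disallow` line is listed before the general `Allow` line.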

## How does it work?

Running web scraping code sends a request to the URL; the server responds by returning an HTML or XML page. The basic steps for web scraping using Python are:

1. Find the URL that you want to scrape
2. Inspect the page
3. Find the data you want to extract
4. Write the code
5. Run the code and extract the data
6. Store the data in the required format
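
As a rough illustration of steps 3 to 6, here is a small sketch that uses only the standard library. The HTML snippet stands in for a page returned by a server, and the class names (`item`, `name`, `price`) and the helper names `ItemParser` and `scrape_to_csv` are assumptions made up for this example, not taken from any real website.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page content; in practice this would be the server's response.
SAMPLE_HTML = """
<div class="item"><span class="name">Laptop A</span><span class="price">500</span></div>
<div class="item"><span class="name">Laptop B</span><span class="price">750</span></div>
"""

class ItemParser(HTMLParser):
    """Collects the text of <span class="name"> and <span class="price"> tags."""

    def __init__(self):
        super().__init__()
        self.current_field = None  # field whose text is currently being read
        self.rows = []             # one dict per <div class="item">

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "item":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field and self.rows:
            self.rows[-1][self.current_field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None

def scrape_to_csv(html):
    """Extract the items from html and return them as CSV text."""
    parser = ItemParser()
    parser.feed(html)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.rows)
    return buffer.getvalue()

print(scrape_to_csv(SAMPLE_HTML))
```

The libraries listed next do the same jobs with far less code: BeautifulSoup replaces the hand-written parser and Pandas replaces the CSV writer.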

The libraries that'll be used for web scraping are:

- Selenium: web testing library, used to automate browser activities
- BeautifulSoup: parses HTML and XML documents, creating parse trees
- Pandas: used for data manipulation and analysis; here it holds the extracted data and stores it in the desired format

## Scraping Flipkart Website

1. Find the URL you want to scrape:

Use the Flipkart website to extract the Price, Name and Rating of the laptops.
- https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2

2. Inspect the page:

As the data is usually nested in tags, we need to see which tag holds the data we want to scrape. To inspect the page, right-click on the element and click "Inspect"; a "Browser Inspection Box" will open.

3. Find the data we want extracted

Find where the Name, Price and Rating for the laptops are nested in the box. Additionally, extracting the Processor is a useful exercise, as it presents certain obstacles that are worth learning to work around.
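
To show what "nested in tags" means in practice, here is a small sketch using BeautifulSoup on a made-up HTML snippet. The class names (`card`, `title`, `cost`) and the helper names `text_or_none` and `extract_cards` are hypothetical; Flipkart's real class names are different and change over time. The second card deliberately has no price and no processor line, which is the sort of obstacle mentioned above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each product sits inside an <a> tag, with details nested in it.
SAMPLE_HTML = """
<a class="card" href="/p/1">
  <div class="title">Laptop A</div>
  <div class="cost">500</div>
  <ul><li>Intel Core i5 Processor</li><li>8 GB RAM</li></ul>
</a>
<a class="card" href="/p/2">
  <div class="title">Laptop B</div>
  <ul><li>4 GB RAM</li></ul>
</a>
"""

def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching tag, or None if it is missing."""
    found = parent.find(tag, attrs={"class": css_class})
    return found.text.strip() if found is not None else None

def extract_cards(html):
    """Return one dict per product card found in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("a", href=True, attrs={"class": "card"}):
        # The processor has no class of its own, so search the <li> texts for it.
        processor = next(
            (li.text for li in card.find_all("li") if "processor" in li.text.lower()),
            None,
        )
        rows.append({
            "name": text_or_none(card, "div", "title"),
            "price": text_or_none(card, "div", "cost"),
            "processor": processor,
        })
    return rows

for row in extract_cards(SAMPLE_HTML):
    print(row)
```

Recording `None` for a missing value keeps every product's fields aligned, which matters later when the data is put into a table.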

4. Write the code

You need to install the packages required for the code below: Selenium, Pandas and BeautifulSoup4 (bs4). To install a package with pip, from the command line:

- \>>> pip install *Package Name*

To install onto Anaconda:

- \>>> conda install *Package Name*

Additionally, the ChromeDriver needs to be installed for selenium to use it:

- Download ChromeDriver from the ChromeDriver website - https://chromedriver.chromium.org/.
- Download the necessary version (e.g. Windows) and cut and paste the download into your WebDrivers folder in the C: drive (create a folder if needed).
- Add the WebDrivers folder path to the PATH system variables.

Selenium should then be able to find and use the WebDriver.<br>
*[Additional notes on 'ChromeDriver' are in NoteBook 4]*.
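
A quick way to confirm that the PATH step worked is to ask Python where it finds the driver. This is a small sketch using the standard library's `shutil.which`; `find_driver` is a hypothetical helper name, and the executable name `chromedriver` is assumed to match the downloaded file.

```python
import shutil

def find_driver(name="chromedriver"):
    """Return the full path of the executable if it is on PATH, otherwise None."""
    return shutil.which(name)

driver_path = find_driver()
if driver_path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at:", driver_path)
```

If the driver is not found, open a new command window and try again, as PATH changes only apply to programs started afterwards.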

In [1]:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

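driver = webdriver.Chrome(r"C:\WebDrivers\chromedriver")
# Calls the ChromeDriver so that selenium can access Chrome.
# The raw string (r"...") stops the backslashes being read as escape sequences.
# Note: Selenium 4.10+ no longer accepts the path here; use webdriver.Chrome() with the driver on PATH.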

driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniq")
# Opens a browser (Chrome in this case) with that URL.

products = []
prices = []
ratings = []
processors = []

content = driver.page_source
# Gets the HTML source from the driver's page - the content seen in the Browser Inspection Box.

soup = BeautifulSoup(content, "html.parser")
# Parses the content into a parse tree, using Python's built-in HTML parser.

findallInSoup = soup.find_all('a', href=True, attrs={'class':'_31qSD5'})
# Finds all the <a> tags in soup that have a hyper-reference and their class equals '_31qSD5'.
# Basically, the details about the laptops on the website.

for a in findallInSoup:
# For each <a> tag found in soup...
    
    name = a.find('div', attrs={'class':'_3wU53n'})
    # Finds a <div> tag with the class equaling '_3wU53n' - the name of the laptop.
    
    price = a.find('div', attrs={'class':'_1vC4OE _2rQ-NK'})
    # Finds a <div> tag with the class equaling '_1vC4OE _2rQ-NK' - the price of the laptop.
    
    rating = a.find('div', attrs={'class':'hGSR34'}) # This initially didn't work; unknown why.
    # Finds a <div> tag with the class equaling 'hGSR34' - the rating of the laptop.
        # OR -> rating = a.find('span', attrs={'class':'_2_KrJI'})
            # On edureka, the class was said to equal 'hGSR34 _2beYZw' but that didn't work.
            
    products.append(name.text if name is not None else None)
    prices.append(price.text if price is not None else None)
    ratings.append(rating.text if rating is not None else None)
    # Appends each name/price/rating to its respective list (None if the tag wasn't found).

    laptop_details = a.find_all('li', attrs={'class':'tVe95H'})
    # Finds all the <li> tags with the class equaling 'tVe95H'.
        # There is more than one of these tags though.
    for eachInfo in laptop_details:
        if "processor" in eachInfo.text.lower():
        # Find whether this part of the laptop information contains the processor.
            processors.append(eachInfo.text)
            # Append the processor to the processors list.
            break
            # 'Break' as we've found what we needed.
    else:
        processors.append(None)
        # No processor was found; this keeps the list the same length as the others.
    
    # Larger portions of data can be extracted:
    comp_dets = a.find('ul', attrs={'class':'vFw0gD'})
    #print("Details:\n", comp_dets, "\n")
        
# Print data:
print("\n","Products:\n***************************************************************\n",\
      products,"\n***************************************************************")
print("\n","Prices:\n***************************************************************\n",\
      prices,"\n***************************************************************")
print("\n","Ratings:\n***************************************************************\n",\
      ratings,"\n***************************************************************")
print("\n","Processors:\n***************************************************************\n",\
      processors,"\n***************************************************************")

driver.quit()
# Closes the browser that the driver opened at the start.

print("\n","FINISHED")


 Products:
***************************************************************
 ['Apple MacBook Air Core i5 5th Gen - (8 GB/128 GB SSD/Mac OS Sierra) MQD32HN/A A1466', 'HP 15 Core i3 6th Gen - (4 GB/1 TB HDD/Windows 10 Home) 15-be014TU Laptop', 'Lenovo Ideapad Core i5 7th Gen - (8 GB/1 TB HDD/Windows 10 Home/2 GB Graphics) IP 320-15IKB Laptop', 'Lenovo Core i5 7th Gen - (8 GB/2 TB HDD/Windows 10 Home/4 GB Graphics) IP 520 Laptop', 'Lenovo Core i5 7th Gen - (8 GB/1 TB HDD/DOS/2 GB Graphics) IP 320-15IKB Laptop'] 
***************************************************************

 Prices:
***************************************************************
 ['₹66,990', '₹36,163', '₹51,990', '₹79,500', '₹56,990'] 
***************************************************************

 Ratings:
***************************************************************
 ['4.7', '4.1', '4.3', '4.4', '4.3'] 
***************************************************************

 Processors:
**********************************

5. Run the code and extract the data

To run the code from the command line (assuming you've set up the PATH variables for python and/or conda):

- \>>> python "*{File Directory}* \\ *{File name}*.py"
- \>>> conda run python "*{File Directory}* \\ *{File name}*.py"

The quotation marks are needed if there are spaces in the file directory or name; it is probably best to include them anyway.

6. Store the data in a required format

Once the code is run, storing the information in a file is useful:

In [2]:
""" Continuing the code from before: """

df = pd.DataFrame({'ProductName':products,'Price':prices,
                   'Rating':ratings,'Processor':processors})
# Creating a dataframe from the previous lists (they must all be the same length).

df.to_csv(r'Files\products.csv', index=False, encoding='utf-8')
# Creating a csv file / overwriting a previously created file, populating it with the dataframe.
    # 'Files' folder needs to already exist.

print("File Created")

File Created


Unless a full file path is stated, the file will be saved relative to the current working directory, which is usually the folder the Python file is run from.
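
Since the output folder has to exist before the file is written, it can help to create it from code. Below is a minimal sketch using the standard library's `pathlib`; `output_path` is a hypothetical helper, and the commented usage line assumes the dataframe from the cell above is named `df`.

```python
from pathlib import Path
from typing import Optional

def output_path(folder: str, filename: str, base: Optional[Path] = None) -> Path:
    """Build base/folder/filename, creating the folder if it is missing.

    When base is omitted, the current working directory is used, which is
    where relative paths are resolved.
    """
    base_dir = Path.cwd() if base is None else Path(base)
    target_dir = base_dir / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / filename

# Usage with the dataframe from the cell above (assumed to be named df):
# df.to_csv(output_path("Files", "products.csv"), index=False, encoding="utf-8")
```

Building the path with `pathlib` also avoids typing backslashes by hand, so the same code works on Windows, macOS and Linux.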