## Web Scraping Project

In this project I will be gathering data on gaming laptops from Amazon.

In [139]:
# Importing libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import csv

## Starting the webdriver
I will use the Chrome browser, driven by Selenium, to scrape the data.

In [64]:
driver = webdriver.Chrome()

In [65]:
url = 'https://www.amazon.com'
driver.get(url)

In [60]:
def get_url(search_text):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_text.replace(' ', '+')
    return template.format(search_term)

In [66]:
# Adding 'gaming laptop' to the url
url = get_url('gaming laptop')
print(url)

https://www.amazon.com/s?k=gaming+laptop&ref=nb_sb_noss_1
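
The `replace(' ', '+')` call only handles spaces, so a search term containing characters such as `&` or `#` would produce a malformed query string. Below is a minimal sketch of a more defensive variant, assuming the same URL template. The function name `get_url_encoded` is hypothetical; `quote_plus` comes from the standard library's `urllib.parse`.

```python
from urllib.parse import quote_plus

TEMPLATE = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'

def get_url_encoded(search_text):
    """Generate a search URL, percent-encoding the search term."""
    # quote_plus turns spaces into '+' and escapes reserved characters
    return TEMPLATE.format(quote_plus(search_text))

print(get_url_encoded('gaming laptop'))
print(get_url_encoded('laptop & mouse'))
```

For a plain term such as `'gaming laptop'` the result matches the URL printed above; the difference only shows up for terms with reserved characters.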


In [67]:
driver.get(url)

## Extracting Collection
Parse the page source and collect every search result listed on the first page.

In [68]:
soup = BeautifulSoup(driver.page_source,'html.parser')

In [69]:
results = soup.find_all('div', {'data-component-type': 's-search-result'})

In [70]:
len(results)

22
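
To see what the attribute filter does without loading a live page, the same `find_all` call can be tried on a small piece of markup. This is only a sketch: `sample_html` is hypothetical, heavily simplified markup, not what Amazon actually serves.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a search results page
sample_html = """
<div data-component-type="s-search-result"><h2><a href="/a">Laptop A</a></h2></div>
<div data-component-type="s-search-result"><h2><a href="/b">Laptop B</a></h2></div>
<div data-component-type="s-ad-banner">Not a product</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Only <div> tags whose data-component-type equals 's-search-result' match
sample_results = sample_soup.find_all('div', {'data-component-type': 's-search-result'})
titles = [result.h2.a.text for result in sample_results]

print(len(sample_results))
print(titles)
```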

## Prototype the record

Select the fields to extract from a single search result.

In [71]:
item = results[0]

In [72]:
atag = item.h2.a

In [73]:
description = atag.text.strip()

In [74]:
url = 'https://www.amazon.com' + atag.get('href')

In [75]:
price_parent = item.find('span', 'a-price')

In [76]:
price = price_parent.find('span', 'a-offscreen').text

In [77]:
rating = item.i.text

In [78]:
review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text

## Generalize the pattern

Wrap the extraction steps in a function so the same fields can be pulled from every item.

In [79]:
def extract_record(item):
    """Extract and return data from a single record"""
    
    # Description and URL
    atag = item.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.com' + atag.get('href')
    
    
    # Price
    price_parent = item.find('span', 'a-price')
    price = price_parent.find('span', 'a-offscreen').text
    
    # Rating and Review count
    rating = item.i.text
    review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text
    
    result = (description, price, rating, review_count, url)
    
    return result

In [80]:
records = []
results = soup.find_all('div', {'data-component-type': 's-search-result'})

for item in results:
    records.append(extract_record(item))

AttributeError: 'NoneType' object has no attribute 'text'

## Handling Errors

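Some items are missing details such as a price or a rating. In that case `find` returns `None`, and accessing `.text` on it raises the `AttributeError` shown above. The revised function skips items that have no price and substitutes empty strings when the rating or review count is missing.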

In [112]:
def extract_record(item):
    """Extract and return data from a single record"""
    
    # Description and URL
    atag = item.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.com' + atag.get('href')
    
    try:
        # Price
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    
    try:
        # Rating and review count
        rating = item.i.text
        review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text
    except AttributeError:
        rating = ''
        review_count = ''
        
    result = (description, price, rating, review_count, url)
    
    return result

In [113]:
records = []
results = soup.find_all('div', {'data-component-type': 's-search-result'})

for item in results:
    record = extract_record(item)
    if record:
        records.append(record)
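
One side effect of sharing a single `try` block is that a missing review count also blanks out a rating that was found. A possible alternative, sketched here under the assumption that each field should fall back independently, is a small helper that checks for `None` before reading the text. The name `text_or_default` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

def text_or_default(node, default=''):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return node.get_text(strip=True) if node is not None else default

# Hypothetical item that has a rating but no review count
sample_item = BeautifulSoup(
    '<div><i>4.3 out of 5 stars</i></div>', 'html.parser'
).div

rating = text_or_default(sample_item.i)
review_count = text_or_default(
    sample_item.find('span', {'class': 'a-size-base s-underline-text'})
)

print(repr(rating), repr(review_count))
```

With this approach the rating is kept even though the review count falls back to an empty string.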

Here is the record extracted from the first item on the page.

In [114]:

records[0]

('HP Omen 16 Premium Gaming Laptop I 16.1” Full HD IPS 144Hz 7ms I 11th Gen Intel 8-Core i7-11800H I 64GB DDR4 1TB SSD I Geforce RTX 3060 6GB I Thunderbolt Backlit WiFi6E Win11 + 32GB MicroSD Card',
 '$1,999.00',
 '4.3 out of 5 stars',
 '3',
 'https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A09647243963WPCBSGBFJ&url=%2FHP-Omen-Premium-i7-11800H-Thunderbolt%2Fdp%2FB09N8SZWJJ%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Dgaming%2Blaptop%26qid%3D1645150997%26sr%3D8-1-spons%26psc%3D1&qualifier=1645150997&id=8460303053454000&widgetName=sp_atf')
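
The URL in this record is a sponsored redirect link, with the product path carried percent-encoded in its `url` query parameter. The sketch below shows one way to recover the direct product link using only the standard library. It assumes the redirect format seen in this record; the function name `resolve_sponsored_url` and the shortened sample link are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def resolve_sponsored_url(url, base='https://www.amazon.com'):
    """Return the product URL wrapped inside a sponsored redirect link.

    Links without a `url` query parameter are returned unchanged.
    """
    query = parse_qs(urlsplit(url).query)
    target = query.get('url')
    # parse_qs already percent-decodes the value
    return base + target[0] if target else url

# Hypothetical, shortened link in the same shape as the record above
sponsored = ('https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=x'
             '?ie=UTF8&url=%2FSome-Laptop%2Fdp%2FB000000000&qualifier=1')
direct = 'https://www.amazon.com/Some-Laptop/dp/B000000000'

print(resolve_sponsored_url(sponsored))
print(resolve_sponsored_url(direct))
```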

The price of every item on the first page:

In [99]:
for row in records:
    print(row[1])

$1,999.00
$1,269.00
$789.89
$999.99
$1,099.00
$1,489.00
$1,399.00
$1,279.99
$899.99
$3,699.00
$899.99
$929.00
$1,799.99
$1,949.99
$688.00
$1,519.99
$849.00
$1,449.00
$739.99
$979.99
$1,265.00
$979.99
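
The prices and ratings are scraped as text, so they need converting before any numeric analysis. A minimal sketch, assuming US-formatted prices such as `$1,999.00` and ratings phrased as `4.3 out of 5 stars`; the helper names `parse_price` and `parse_rating` are hypothetical.

```python
def parse_price(price_text):
    """Convert a price string such as '$1,999.00' to a float."""
    return float(price_text.replace('$', '').replace(',', ''))

def parse_rating(rating_text):
    """Convert '4.3 out of 5 stars' to 4.3; return None for an empty string."""
    return float(rating_text.split()[0]) if rating_text else None

sample_prices = ['$1,999.00', '$789.89', '$3,699.00']
values = [parse_price(p) for p in sample_prices]

print(min(values), max(values))
print(parse_rating('4.3 out of 5 stars'))
```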


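## Grabbing next page

To request pages beyond the first, `get_url` now appends a `page` query parameter whose placeholder is filled in later.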

In [93]:
def get_url(search_text):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_text.replace(' ', '+')
    
    # add term query to url
    url = template.format(search_term)
    
    # add page query placeholder
    url += '&page={}'
    
    return url
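
The returned string still contains a `{}` placeholder, which `str.format` fills with a page number when the page is requested. A short sketch of that step, assuming result pages are numbered from 1; `url_template` is written out by hand here and is the value `get_url('gaming laptop')` returns.

```python
# The string returned by get_url('gaming laptop'), including the page placeholder
url_template = 'https://www.amazon.com/s?k=gaming+laptop&ref=nb_sb_noss_1&page={}'

# Build the URLs for pages 1 to 3 (assumes pages are 1-indexed)
page_urls = [url_template.format(page) for page in range(1, 4)]

for page_url in page_urls:
    print(page_url)
```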

## Putting it all together

The sections above are combined into a single script.

In [None]:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver

def get_url(search_text):
    """Generate a url from search term"""
    template = 'https://www.amazon.com/s?k={}&ref=nb_sb_noss_1'
    search_term = search_text.replace(' ', '+')
    
    # add term query to url
    url = template.format(search_term)
    
    # add page query placeholder
    url += '&page={}'
    
    return url

def extract_record(item):
    """Extract and return data from a single record"""
    
    # Description and URL
    atag = item.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.com' + atag.get('href')
    
    try:
        # Price
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    
    try:
        # Rating and review count
        rating = item.i.text
        review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text
    except AttributeError:
        rating = ''
        review_count = ''
        
    result = (description, price, rating, review_count, url)
    
    return result

def main(search_term):
    """Scrape the search results for a term and save them to a CSV file"""
    driver = webdriver.Chrome()
    
    records = []
    url = get_url(search_term)
    
    # Start at page 1 (range(1) would request page=0);
    # widen the range to scrape more pages
    for page in range(1, 2):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('div', {'data-component-type': 's-search-result'})
        
        for item in results:
            record = extract_record(item)
            if record:
                records.append(record)
    
    # Close the browser only after every page has been visited
    driver.close()
    
    # Save the data
    with open('GamingLaptops.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Description', 'Price', 'Rating', 'ReviewCount', 'Url'])
        writer.writerows(records)

In [129]:
main('gaming laptop')
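
The save step can be tried on its own, without launching a browser, by writing a few records to an in-memory buffer instead of a file. This is only a sketch: `sample_records` is made-up data in the same five-field layout.

```python
import csv
import io

# Made-up records in the same (description, price, rating, review_count, url) layout
sample_records = [
    ('Laptop A, 15.6" display', '$1,099.00', '4.6 out of 5 stars', '69',
     'https://www.amazon.com/a'),
    ('Laptop B', '$789.89', '', '', 'https://www.amazon.com/b'),
]

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['Description', 'Price', 'Rating', 'ReviewCount', 'Url'])
writer.writerows(sample_records)

csv_text = buffer.getvalue()
print(csv_text)
```

The `csv` module quotes fields that contain commas or double quotes, which is why descriptions and prices such as `$1,099.00` survive the round trip intact.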

## Results

The data was saved to a CSV file, which can be opened in Excel or loaded with pandas for further analysis. Here is a view of the data.

In [133]:
df = pd.read_csv('GamingLaptops.csv')

In [134]:
df

| | Description | Price | Rating | ReviewCount | Url |
|---|---|---|---|---|---|
| 0 | Acer Nitro 5 AN515-55-53E5 Gaming Laptop \| Int... | $789.89 | 4.6 out of 5 stars | 6497.0 | https://www.amazon.com/gp/slredirect/picassoRe... |
| 1 | Acer Predator Helios 300 PH315-54-760S Gaming ... | $1,279.99 | 4.6 out of 5 stars | 4216.0 | https://www.amazon.com/gp/slredirect/picassoRe... |
| 2 | Acer Nitro 5 AN515-55-53E5 Gaming Laptop \| Int... | $789.89 | 4.6 out of 5 stars | 6497.0 | https://www.amazon.com/Acer-AN515-55-53E5-i5-1... |
| 3 | ASUS ROG Strix G15 (2021) Gaming Laptop, 15.6”... | $999.99 | 4.6 out of 5 stars | 69.0 | https://www.amazon.com/ASUS-Display-GeForce-Ke... |
| 4 | Newest MSI Crosshair 15.6" 144Hz FHD IPS Gamin... | $1,099.00 | | | https://www.amazon.com/MSI-Crosshair-i7-11800H... |
| 5 | 2021 ASUS ROG Strix G17 17.3" FHD 144Hz Gaming... | $1,489.00 | 4.7 out of 5 stars | 38.0 | https://www.amazon.com/ASUS-ROG-Strix-G17-Comp... |
| 6 | MSI GS66 Stealth 15.6" 240Hz 3ms Ultra Thin an... | $1,399.00 | 4.3 out of 5 stars | 17.0 | https://www.amazon.com/MSI-GS66-Stealth-i7-107... |
| 7 | Acer Predator Helios 300 PH315-54-760S Gaming ... | $1,279.99 | 4.6 out of 5 stars | 4216.0 | https://www.amazon.com/Acer-Predator-PH315-54-... |
| 8 | Victus 16 Gaming Laptop, NVIDIA GeForce RTX 30... | $899.99 | 4.4 out of 5 stars | 3.0 | https://www.amazon.com/GeForce-i5-11260H-Displ... |
| 9 | MSI Crosshair17 17.3" 144Hz FHD Gaming Laptop ... | $1,299.00 | 4.4 out of 5 stars | 14.0 | https://www.amazon.com/MSI-Crosshair17-i7-1180... |
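
Rows 0 and 2, and rows 1 and 7, describe the same product with the same price, apparently once as a sponsored link and once as a regular listing. One way to remove such repeats before analysis is `drop_duplicates`, sketched here on a small made-up frame (`sample_df` is hypothetical). Treating rows with the same description and price as duplicates is an assumption.

```python
import pandas as pd

# Made-up frame with one product listed twice
sample_df = pd.DataFrame({
    'Description': ['Laptop A', 'Laptop B', 'Laptop A'],
    'Price': ['$789.89', '$1,279.99', '$789.89'],
})

# Keep the first occurrence of each (Description, Price) pair
deduped = (
    sample_df
    .drop_duplicates(subset=['Description', 'Price'])
    .reset_index(drop=True)
)

print(deduped)
```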
