## Using requests-html and Pandas to create and process a dataframe

#### Using requests-html to parse amazon.com

The following shows one way requests-html can be used to parse a webpage and obtain the needed data.

The site being parsed is amazon.com, an e-commerce site used around the world. The product names, prices, ratings and other valuable information are obtained using a combination of CSS selectors and XPath.

Afterwards, the data will be put in a Pandas dataframe and processed. This dataframe will be saved as a .csv file for further evaluation in the future.

In [2]:
# Using requests-html's Async session 

from requests_html import AsyncHTMLSession

# Creating request headers that mimic a regular browser client

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

asession = AsyncHTMLSession()



In [23]:
# The url from amazon and setting page variable

page = 1

url = f"https://www.amazon.com/Best-Sellers-Electronics-Headphones-Earbuds-Accessories/zgbs/electronics/24046923011/ref=zg_bs_pg_1?_encoding=UTF8&pg={page}"
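The f-string above is evaluated once, when the cell runs, so changing `page` later does not change `url`. A minimal sketch of one way to rebuild the URL for each page is shown below. The names `BASE_URL` and `build_url` are hypothetical, and the `ref=zg_bs_pg_1` segment is copied unchanged from the original URL; whether Amazon expects it to vary per page is not confirmed here.

```python
# Same URL as in the cell above, with the page number left as a placeholder.
BASE_URL = (
    "https://www.amazon.com/Best-Sellers-Electronics-Headphones-Earbuds-Accessories"
    "/zgbs/electronics/24046923011/ref=zg_bs_pg_1?_encoding=UTF8&pg={page}"
)


def build_url(page: int) -> str:
    """Return the listing URL for one page number (hypothetical helper)."""
    return BASE_URL.format(page=page)


# Build the URLs for the first three pages.
urls = [build_url(page) for page in range(1, 4)]
for u in urls:
    print(u)
```

Each call returns a fresh string, so a loop over page numbers can request a different address on every pass.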

In [24]:
# Sending an asynchronous GET request to the url with the custom headers

s = await asession.get(url, headers = headers)

In [5]:
#Checking status code

s.status_code

200

In [6]:
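# Getting the electronic product names

product_names = []

# Note: `s` is not requested again inside the loop, so every pass
# re-reads the same response and appends the same names again.
while page <= 10:
    product_names_path = s.html.find('div._cDEzb_p13n-sc-css-line-clamp-3_g3dy1')

    # Keep only the text before the first '. ' of each name
    product_names.extend(p.text.split('. ')[0] for p in product_names_path)

    page += 1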

print(product_names)

print(len(product_names))


['Apple AirPods (2nd Generation) Wireless Earbuds with Lightning Charging Case Included', 'Apple EarPods Headphones with Lightning Connector', 'Apple AirPods Pro (2nd Generation) Wireless Earbuds, Up to 2X More Active Noise Cancelling, Adaptive Transparency, Personalized Spatial Audio, MagSafe Charging Case, Bluetooth Headphones for iPhone', 'Apple Lightning to 3.5 mm Headphone Jack Adapter', 'TOZO A1 Mini Wireless Earbuds Bluetooth 5.3 in Ear Light-Weight Headphones Built-in Microphone, IPX5 Waterproof, Immersive Premium Sound Long Distance Connection Headset with Charging Case, Black', 'Apple EarPods Headphones with 3.5mm Plug', 'TOZO T6 True Wireless Earbuds Bluetooth 5.3 Headphones Touch Control with Wireless Charging Case IPX8 Waterproof Stereo Earphones in-Ear Built-in Mic Headset Premium Deep Bass Black (2022 Upgraded)', 'Sony ZX Series Wired On-Ear Headphones, Black MDR-ZX110', 'TOZO T10 Bluetooth 5.3 Wireless Earbuds with Wireless Charging Case IPX8 Waterproof Stereo Headphone

In [25]:
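# Getting the product prices

product_price = []

# Note: `s` is parsed again on every pass, so the same 30 prices
# are appended 10 times (300 values in total).
while page <= 10:
    product_price_path = s.html.find('span.a-size-base')

    # Strip the dollar sign; the values are still strings at this point
    product_price.extend(t.text.replace("$", '') for t in product_price_path)

    page += 1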

print(len(product_price))

print(product_price)

300
['91.08', '16.22', '202.76', '13.95', '15.29', '16.99', '26.99', '6.00', '23.99', '47.99', '59.99', '9.99', '42.99', '119.55', '29.73', '7.19', '19.99', '16.99', '29.96', '8.99', '29.73', '29.99', '19.99', '10.98', '13.99', '25.49', '36.99', '25.99', '14.95', '33.98', '91.08', '16.22', '202.76', '13.95', '15.29', '16.99', '26.99', '6.00', '23.99', '47.99', '59.99', '9.99', '42.99', '119.55', '29.73', '7.19', '19.99', '16.99', '29.96', '8.99', '29.73', '29.99', '19.99', '10.98', '13.99', '25.49', '36.99', '25.99', '14.95', '33.98', '91.08', '16.22', '202.76', '13.95', '15.29', '16.99', '26.99', '6.00', '23.99', '47.99', '59.99', '9.99', '42.99', '119.55', '29.73', '7.19', '19.99', '16.99', '29.96', '8.99', '29.73', '29.99', '19.99', '10.98', '13.99', '25.49', '36.99', '25.99', '14.95', '33.98', '91.08', '16.22', '202.76', '13.95', '15.29', '16.99', '26.99', '6.00', '23.99', '47.99', '59.99', '9.99', '42.99', '119.55', '29.73', '7.19', '19.99', '16.99', '29.96', '8.99', '29.73', '29.
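The prices collected above are still strings. A minimal sketch of turning such text into numbers is shown below. The helper `to_float` is hypothetical, and the sample values (including the thousands separator and the `N/A` entry) are illustrative, not scraped data; it is an assumption that a broad selector such as `span.a-size-base` could also match text that is not a price.

```python
from typing import Optional


def to_float(text: str) -> Optional[float]:
    """Convert price-like text to a float, or None if it is not a number."""
    # Remove the currency sign, thousands separators and stray whitespace.
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        # Text that is not numeric is reported as missing instead of crashing.
        return None


# Illustrative inputs only.
sample = ["$91.08", "16.22", "$1,299.00", "N/A"]
prices = [to_float(t) for t in sample]
print(prices)
```

Returning `None` for unparseable text keeps the list the same length as the input, which makes it easier to line prices up with names and ratings later.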

In [18]:
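# Getting the product ratings

product_rating = []

# Note: `s` is parsed again on every pass, so the same 30 ratings
# are appended 10 times (300 values in total).
while page <= 10:
    product_rating_path = s.html.find('span.a-icon-alt')

    # The first space-separated token of the text is the numeric rating
    product_rating.extend(j.text.split(" ")[0] for j in product_rating_path)

    page += 1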
    
print(len(product_rating))

print(product_rating)


300
['4.8', '4.6', '4.7', '4.7', '4.3', '4.6', '4.4', '4.5', '4.3', '4.3', '4.5', '4.3', '4.3', '4.7', '4.5', '4.6', '4.3', '4.4', '4.8', '4.2', '4.5', '4.7', '4.4', '3.8', '4.5', '4.8', '4.9', '4.4', '4.3', '4.3', '4.8', '4.6', '4.7', '4.7', '4.3', '4.6', '4.4', '4.5', '4.3', '4.3', '4.5', '4.3', '4.3', '4.7', '4.5', '4.6', '4.3', '4.4', '4.8', '4.2', '4.5', '4.7', '4.4', '3.8', '4.5', '4.8', '4.9', '4.4', '4.3', '4.3', '4.8', '4.6', '4.7', '4.7', '4.3', '4.6', '4.4', '4.5', '4.3', '4.3', '4.5', '4.3', '4.3', '4.7', '4.5', '4.6', '4.3', '4.4', '4.8', '4.2', '4.5', '4.7', '4.4', '3.8', '4.5', '4.8', '4.9', '4.4', '4.3', '4.3', '4.8', '4.6', '4.7', '4.7', '4.3', '4.6', '4.4', '4.5', '4.3', '4.3', '4.5', '4.3', '4.3', '4.7', '4.5', '4.6', '4.3', '4.4', '4.8', '4.2', '4.5', '4.7', '4.4', '3.8', '4.5', '4.8', '4.9', '4.4', '4.3', '4.3', '4.8', '4.6', '4.7', '4.7', '4.3', '4.6', '4.4', '4.5', '4.3', '4.3', '4.5', '4.3', '4.3', '4.7', '4.5', '4.6', '4.3', '4.4', '4.8', '4.2', '4.5', '4.7', '

In [26]:
#Importing pandas and creating a dataframe

import pandas as pd

amazon_electronics_bestsellers_data = pd.DataFrame()

amazon_electronics_bestsellers_data['product_names'] = pd.Series(product_names) 

amazon_electronics_bestsellers_data['product_prices_feb(2023)'] = pd.Series(product_price) 

amazon_electronics_bestsellers_data['product_ratings_feb(2023)'] = pd.Series(product_rating) 

amazon_electronics_bestsellers_data.head()



    

|   | product_names | product_prices_feb(2023) | product_ratings_feb(2023) |
|---|---|---|---|
| 0 | Apple AirPods (2nd Generation) Wireless Earbud... | 91.08 | 4.8 |
| 1 | Apple EarPods Headphones with Lightning Connector | 16.22 | 4.6 |
| 2 | Apple AirPods Pro (2nd Generation) Wireless Ea... | 202.76 | 4.7 |
| 3 | Apple Lightning to 3.5 mm Headphone Jack Adapter | 13.95 | 4.7 |
| 4 | TOZO A1 Mini Wireless Earbuds Bluetooth 5.3 in... | 15.29 | 4.3 |


In [27]:
# Saving the dataframe as a csv file
amazon_electronics_bestsellers_data.to_csv("amazon_electronics_bestsellers.csv")
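By default `to_csv` also writes the row index as an unnamed first column, and the scraped prices and ratings are stored as text. A minimal sketch of converting the columns to numbers and saving without the index is shown below. The rows and the shortened column names are illustrative stand-ins, and an in-memory buffer replaces the real file so the example has no side effects.

```python
import io

import pandas as pd

# Illustrative rows only; the values are strings, as the scraping cells produce them.
df = pd.DataFrame({
    "product_names": ["Item A", "Item B"],
    "product_prices": ["91.08", "16.22"],
    "product_ratings": ["4.8", "4.6"],
})

# Convert the text columns to numbers; values that cannot be parsed become NaN.
for col in ["product_prices", "product_ratings"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Write without the index so no extra unnamed column appears on reload.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read the CSV text back to check what was stored.
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(reloaded.columns))
print(reloaded.dtypes)
```

With `index=False` the reloaded frame has only the three data columns, and the numeric columns come back as floats.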



#### Conclusion

This part focused on parsing the Amazon electronics best-sellers page and collecting the bestsellers data. The data was stored in a Pandas dataframe named 'amazon_electronics_bestsellers_data' and saved to a csv file named 'amazon_electronics_bestsellers.csv'.

Upon re-running the program, one note was taken:

-  Every time a cell that builds **product_names, product_price or product_rating** needs to be run, the cells with the ***url and the get request*** should be run first. Each loop leaves `page` at 11, so the next `while page <= 10` loop does nothing until the url cell resets `page` to 1.
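One way to avoid that re-run dependency is to give each loop its own counter instead of sharing the global `page` variable. The sketch below is a minimal illustration under stated assumptions: `collect_pages` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that returns made-up items so the example runs without any network access.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[str]], last_page: int) -> List[str]:
    """Call fetch_page once per page and gather the results in order."""
    collected: List[str] = []
    # A for loop owns its counter, so re-running this never depends on
    # a leftover value of a global `page` variable.
    for page in range(1, last_page + 1):
        collected.extend(fetch_page(page))
    return collected


def fake_fetch(page: int) -> List[str]:
    """Stand-in for a real request-and-parse step (assumption, no network)."""
    return [f"page{page}-item{i}" for i in range(1, 3)]


names = collect_pages(fake_fetch, last_page=3)
print(names)
print(len(names))
```

In the notebook, the stand-in would be replaced by a request to that page's URL followed by `s.html.find(...)`. Since that request is awaited, the loop would then have to live in async code.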

Further data processing continues below.


