### Using requests-html and Pandas to create a dataframe


#### The following shows one way requests-html can be used to parse a webpage and obtain the needed data:

The site being parsed is jumia.com, an e-commerce site commonly used in West Africa. The product names, prices, ratings and other valuable information will be obtained using a combination of **CSS selectors** and **XPath**.

Afterwards, the data will be put in a **Pandas** dataframe and processed. This dataframe will be saved as a ***.csv*** file for further evaluation.

In [102]:
# Installing all necessary packages (httpx and parsel are used in later cells).

!pip install requests-html httpx parsel




##### The Jumia website

As explained, the Jumia site is an e-commerce site featuring a wide range of products. Anything from electronics to food can be ordered for delivery. This project focuses on the electronics section of the site.

There are many electronics stores on the site, such as Itel, Samsung and Telefonika. Telefonika was chosen because of the diversity of its phone and tablet products. The following factors were used to narrow down the selection:

- Products were chosen based on popularity (most popular to least popular). The popularity metric was a ***seller score of 40% or more***.

- Prices were reduced, so there are two price lists: one for the original price and one for the current price.

- Not all products have ratings, so a rating of 0 indicates that no ratings are available.

- The main products featured were from Infinix, Nokia, Samsung and Tecno.
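The price and rating rules above could be applied with two small helpers. This is only a sketch: the function names `parse_price` and `parse_rating` are hypothetical, and the input formats (`"GH₵ 1,299.00"`, `"4.5 out of 5"`) are assumptions about how the site might display these values, not something confirmed from the page.

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Extract the first number from a price string (assumed format: 'GH₵ 1,299.00')."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return float(match.group().replace(",", ""))


def parse_rating(text: Optional[str]) -> float:
    """Return the leading number of a rating string, or 0.0 when no rating is present."""
    if not text:
        return 0.0
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else 0.0


print(parse_price("GH₵ 1,299.00"))   # original or current price as a float
print(parse_rating("4.5 out of 5"))  # rating as a float
print(parse_rating(None))            # missing rating falls back to 0.0
```

Keeping the original and current prices as separate numeric values makes it straightforward to store them in two dataframe columns later.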


The data obtained from this site would include:

1. The device name

2. The original price

3. The new price

4. The brand name

5. Rating if available
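As a rough illustration of the intended end result, the five fields listed above could be arranged in a Pandas dataframe and exported as CSV. The rows below are hypothetical placeholders (no data was actually scraped), and the column names are assumptions chosen for this sketch.

```python
import pandas as pd

# Hypothetical placeholder rows; real values would come from the scraped page.
records = [
    {"device_name": "Example Phone A", "original_price": 1000.0,
     "new_price": 850.0, "brand": "Samsung", "rating": 4.5},
    {"device_name": "Example Phone B", "original_price": 600.0,
     "new_price": 540.0, "brand": "Tecno", "rating": 0.0},  # 0.0 = no rating
]

columns = ["device_name", "original_price", "new_price", "brand", "rating"]
df = pd.DataFrame(records, columns=columns)

# With no path argument, to_csv returns the CSV as a string;
# pass a file name such as "phones.csv" to write it to disk instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```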






In [1]:
#importing AsyncHTMLSession from requests-html 

from requests_html import AsyncHTMLSession

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}

asession = AsyncHTMLSession()


url = "https://www.jumia.com.gh/mlp-telefonika-store/mobile-phones/infinix--nokia--samsung--tecno/?price=169-25999&seller_score=2-5&page=1"

In [2]:
# Sending a GET request to the URL.

s = await asession.get(url)

In [7]:
s.status_code

403

In [26]:
# getting the device name 

device_name = s.html.find('article > div.info > h3.name')


In [106]:
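# Attempting the same request with httpx

import httpx
from parsel import Selector

def get_html(url):
    # Fetch the page and return the response body as text
    html2 = httpx.get(url).text
    return html2

html2 = get_html(url)

# Build a parsel Selector from the returned HTML
r2 = Selector(text=html2)

device_name = r2.xpath('//div/h3[contains(@class, "name")]/text()').getall()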

print(device_name)

[]


In [8]:
# Attempting to use requests

import requests as rq
from parsel import Selector

response = rq.get(url, headers=headers)

html = response.content

print(response.status_code)



403


#### Conclusion

Requests-html was used to attempt to get the required data. This led to some problems: an empty list was returned each time `device_name` was run.

Since other HTTP clients such as requests and httpx are available, they were tried to see whether the problem would persist.

After trying each client, the status code of the response was checked using the `status_code` attribute of the requests response.

> response = rq.get(url, headers=headers)

> print(response.status_code)

>> 403

The status code 403 (Forbidden) was returned, meaning the server understood the request but refused to serve it. As web scrapers, we therefore do not have access to this site for our use.
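A status check like this can be made before any parsing is attempted. The sketch below uses only the standard library; `describe_status` and `is_success` are hypothetical helper names, and the code is given a plain integer so it can be run without contacting the site.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard reason phrase, e.g. '403 Forbidden'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes not defined in HTTPStatus
        return f"{code} Unknown status"


def is_success(code: int) -> bool:
    """True for 2xx responses, the only ones worth parsing."""
    return 200 <= code < 300


status_code = 403  # the value reported above
if not is_success(status_code):
    print(f"Request blocked or failed: {describe_status(status_code)}")
```

In the notebook, `response.status_code` (or `s.status_code`) would be passed in place of the literal value.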