Website: https://sg.carousell.com/ 

Item: Baby Chair 

Condition: New 

Seller’s location: Within 5km from Woodlands 

Price Range: <span>&#36;</span>5 to <span>&#36;</span>150 

Seller’s Ratings: The more stars the better. The maximum number of stars is five. 

The results should be returned from the most recent first as we prefer to buy latest posted item. 
The results should include <b>Date</b>, <b>title</b>, <b>link</b>, <b>price</b>, <b>seller</b>, <b>seller's link</b> and <b>seller’s ratings</b>.  


Example:
https://sg.carousell.com/search/baby%20chair?condition_v2=NEW&location_name=Woodlands&price_end=150&price_start=0&range=5&sort_by=time_created%2Cdescending

<p>    
    
References: <br />
https://selenium-python.readthedocs.io/index.html
    
<p>
Notes: <br />
print(self.driver.page_source) # Get page source via Selenium

In [9]:
# TODO:
# 1. Send Key to Search Box Input Type=Search
# OK 2. Save page source
# 3. Generate csv of <b>Date</b>, <b>title</b>, <b>link</b>, <b>price</b>, <b>seller</b>, <b>seller's link</b> and <b>seller’s ratings</b>
# OK 4. Clean up links returned
# OK 5. Format Date time including Spotlight
# 6. Retrieve Seller's Rating
# OK 7. Filter items with title without "Baby Chair"
# OK 8. Screenshot of main page
# 9. Load More
# 10. Handle emoji

In [10]:
from datetime import datetime, timedelta

t = datetime.now() - timedelta(hours=3)

In [7]:
help(str.find)

Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.



In [18]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
from urllib.request import Request

from datetime import datetime,timedelta
import csv
import os
import re

In [27]:
class CarousellScraper(object):
    def __init__(self,item='baby chair', condition='NEW', location='Woodlands', 
                 distance='5', min_price='0', max_price='150'):
        
        # For data logging
        self.curdatetime = datetime.now().strftime('%Y%m%d_%H%M%S')
        self.item      = item
        self.condition = condition
        self.location  = location
        self.distance  = distance
        self.min_price = min_price
        self.max_price = max_price
        self.base_url  = 'https://sg.carousell.com'

        self.url = f"{self.base_url}/search/{item}?condition_v2={condition}&location_name={location}&price_end=\
{max_price}&price_start={min_price}&range={distance}&sort_by=time_created,descending"
        print(self.url)
        
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe')
        self.delay = 3
        
        #Create folder for data logging
        if not os.path.exists("raw"):
            os.mkdir("raw")
        
        if not os.path.exists("processed"):
            os.mkdir("processed")
        
    def load_carousell_url(self):
        
        self.driver.get(self.url)
        
        try:
            wait = WebDriverWait(self.driver, self.delay)
            # Search box
            wait.until(EC.presence_of_element_located((By.XPATH,"//input[@type='search']" )))
            
            #Save page source
            dest_path = "raw/"+self.curdatetime+"_CarousellSource.html"
            with open(dest_path, "w+", encoding="utf-8") as f:
                f.write(self.driver.page_source)

        except TimeOutException:
            print("Time out to load ", self.url)
        
        ### image saved in raw folder
        screenshot_path = "raw/"+self.curdatetime+"_CarousellSearch.png"
        self.driver.get_screenshot_as_file(screenshot_path) 
     
    def extract_item_information(self):
        pass
    
#    The results should include <b>Date</>, <b>title</b>, <b>link</b>, <b>price</b>, <b>seller<b> and <b>seller’s ratings<b/>.  

    def extract_item_title(self):
        
        print()
        headers = {}
        headers['User-Agent'] = "Chrome/24.0.1312.27"
        request = Request(self.url, headers = headers)
        html = urlopen(request).read()
        soup = bs(html, 'lxml')
        
        #The items are in three layers below main tag
        sellers        = []
        seller_links   = []
        dates          = []
        titles         = []
        item_links     = []
        prices         = []
        seller_ratings = []
        
        items = soup.find("main").find_all('div')[0].find_all('div')[0].children
        for item in items:
            base_items = item.find_all('a')
            if (len(base_items) > 0):
                # seller, seller's page, date time
                base_items01 = base_items[0]
#                 print(base_items01)
                seller      = base_items01.find_all('p')[0].text
        
                seller_link = self.shorten_url(self.base_url + base_items01['href'])
                
                date        = base_items01.find_all('p')[-1].text
                print("Date:" + date)
                date = self.return_date(date)
                print("Date:" + date)
                print("Seller:" +  seller)
                print("Seller_Link:" +  seller_link)
                
                base_items02 = base_items[1]
                item_link = self.shorten_url(self.base_url + base_items02['href'])
                title = base_items02.find_all('p')[0].text
                price = base_items02.find_all('p')[1].text
                
                seller_rating = self.extract_item_seller_ratings(seller_link)
                
                
                # Only match item's name containing 'baby' and 'chair' (case insensitive)
                if re.search('(?i)(baby.*chair)', title) is not None:
                    sellers.append(seller)
                    seller_links.append(seller_link)
                    dates.append(date)
                    item_links.append(item_link)
                    titles.append(title)
                    prices.append(price)
                    seller_ratings.append(seller_rating)
                
                print("\nTitle:" + title  + "\nLink:" + item_link + "\nPrice:" + price)
                print()
                print()
        
        print("sellers:", len(sellers), ", seller_links:", len(seller_links), ", dates: ", len(dates), 
", titles: ", len(titles), ", title links: ", len(item_links), ", prices: ", len(prices), ", seller_ratings:", len(seller_ratings))
        
        # Write to csv
        dest_path = "processed/"+self.curdatetime+"_Carousell_Search_"+self.item+".csv"
        dest_path = dest_path.replace(' ', '')
        csvFile = open(dest_path, 'w+', encoding='utf-8',newline='')
        try:
            writer = csv.writer(csvFile)
            writer.writerow(('Date', 'Item', 'Item_Link', 'Price', 'Seller', 'Seller_Link, Seller_Ratings'))
            for i in range(len(dates)):
                writer.writerow((dates[i].strip(), titles[i].strip(), item_links[i].strip(), prices[i].strip(), 
sellers[i].strip(), seller_links[i].strip(), seller_ratings[i].strip()))
        finally:
            csvFile.close()
    
    def return_date(self, d):
        # https://docs.python.org/3/library/datetime.html?highlight=datetime#datetime.timedelta
        # date2 = date1 + timedelta
        # class datetime.timedelta(days=0, seconds=0, microseconds=0, milliseconds=0, minutes=0, hours=0, weeks=0)¶
        #re.sub(r"(\w)(\w+)(\w)", repl, text)
        if 'hour' in d:
            hour = re.sub(r"[a-zA-Z]", "", d)
            t = datetime.now() - timedelta(hours=int(hour))
            posted_date = t.strftime('%Y-%m-%d')
            print('posted_date:' + posted_date)
        elif 'day' in d:
            day = re.sub(r"[a-zA-Z]", "", d)
            t = datetime.now() - timedelta(days=int(day))
            posted_date = t.strftime('%Y-%m-%d')
            print('posted_date:' + posted_date)
        elif 'minute' in d:
            minute = re.sub(r"[a-zA-Z]", "", d)
            t = datetime.now() - timedelta(days=int(minute))
            posted_date = t.strftime('%Y-%m-%d')
            print('posted_date:' + posted_date)
        else:
            posted_date = d
     
        return posted_date
    
    def extract_item_seller_ratings(self, seller_url):
        headers = {}
        headers['User-Agent'] = "Chrome/24.0.1312.27"
        request = Request(seller_url, headers = headers)
        html = urlopen(request).read()
        soup = bs(html, 'lxml')
        
        NA_RATINGS = "No ratings yet"
        contents = soup.find_all('p')
        seller_rating = ""
        
        for cont in contents:
            if cont.text.find(NA_RATINGS) != -1:
#                 return cont.text.strip()
                seller_rating = cont.text.strip()
                break
            if re.match("^\d+\.\d+$", cont.text.strip()):
                seller_rating = cont.text.strip()
            
        
        
#         if "(" in contents or ")" in contents:
#             index = 3
#             contents = soup.find_all('p')[index].text
        
#         if re.search('[a-zA-Z]', contents):
#             index -= 2
#             contents = soup.find_all('p')[index].text
        
       
            
        print(seller_rating)
        return seller_rating
   
        
    
    def shorten_url(self, url):
        pos = url.find('?')
        if pos != -1:
            return url[:pos]
        else:
            return url
 
    
    def extract_item_url(self):
        pass
    
    def extract_item_price(self):
        pass
    
    def extract_item_seller(self):
        pass
    
   
    
    def quit(self):
        self.driver.close()
 
 
 
 

In [28]:
scraper = CarousellScraper()

https://sg.carousell.com/search/baby chair?condition_v2=NEW&location_name=Woodlands&price_end=150&price_start=0&range=5&sort_by=time_created,descending


In [29]:
scraper.load_carousell_url()

In [None]:
scraper.extract_item_title()


Date:Spotlight
Date:Spotlight
Seller:littlebabybernice
Seller_Link:https://sg.carousell.com/littlebabybernice/
5.0

Title:🦷  [INSTOCK]  Baby Feeding Chair With Squeaking Sound
Link:https://sg.carousell.com/p/🦷-instock-baby-feeding-chair-with-squeaking-sound-275984583
Price:S$34.90


Date:2 minutes ago
posted_date:2020-03-04
Date:2020-03-04
Seller:desmondlim01238
Seller_Link:https://sg.carousell.com/desmondlim01238/
5.0

Title:Ceilling hook instsll
Link:https://sg.carousell.com/p/ceilling-hook-instsll-230784946
Price:S$50


Date:57 minutes ago
posted_date:2020-01-09
Date:2020-01-09
Seller:raylwq
Seller_Link:https://sg.carousell.com/raylwq/
5.0

Title:Antibacterial Non Alcohol Spray
Link:https://sg.carousell.com/p/antibacterial-non-alcohol-spray-281592901
Price:S$35


Date:7 hours ago
posted_date:2020-03-06
Date:2020-03-06
Seller:victoryforeva
Seller_Link:https://sg.carousell.com/victoryforeva/
4.9

Title:Baby High Chair
Link:https://sg.carousell.com/p/baby-high-chair-281536023
Price:S$

In [None]:
#scraper.quit()