Subject: Data Scraping Code Submission.

Date: October 15, 2024.

Dear Professor Ilia Tetin,

I am writing on behalf of our presentation team, which consists of two members:



*   LE TRAN NHA TRAN - JASMINE (Student ID: 11285100M);

*   DINH VAN LONG - BRAD (Student ID: 11285109M).


We have decided to focus on the topic: "Consumer Trends in the E-Commerce Platform (Chotot.com) for Used Cell Phones: Insights and Predictions". This study will analyze consumer behavior, market dynamics, and emerging trends in the used smartphone sector, using a dataset of over 19,000 observations collected from Chotot.com. The data represents 63 provinces and cities across Vietnam, encompassing both urban and rural areas, ensuring a comprehensive understanding of the market.

Chotot.com operates under the motto "A Way to Your Wants" (LinkedIn: https://www.linkedin.com/company/cho-tot/) and functions as a marketplace offering a wide variety of physical goods to Vietnamese consumers. For our research, we have specifically focused on the used smartphone category.


Enclosed below is the data scraping code we developed to extract listings from Chotot.com. The resulting dataset serves as the foundation for our topic, which investigates pricing strategy trends within this second-hand marketplace.

In [None]:
!pip install -U scrapy scrapy-user-agents

Collecting scrapy
  Downloading Scrapy-2.12.0-py2.py3-none-any.whl.metadata (5.3 kB)
Collecting scrapy-user-agents
  Downloading scrapy_user_agents-0.1.1-py2.py3-none-any.whl.metadata (3.4 kB)
Collecting Twisted>=21.7.0 (from scrapy)
  Downloading twisted-24.10.0-py3-none-any.whl.metadata (20 kB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl.metadata (2.2 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.3.2-py3-none-any.whl.metadata (3.9 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.9.1-py2.py3-none-any.whl.metadata (11 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queuelib-1.7.0-py2.py3-none-any.whl.metadata (5.7 kB)
Collecting service-identity>=18.1.0 (from scrapy)
  Downloading service_identity-24.2.0-py3-none-any.whl.metadata (5.1 kB)
Collecting w3lib>=1.17.0 (from scrapy)
  Downloading w3lib-2.2.1-py3-none-any.whl.metadata (2.1 kB)
Collecting zope.interface>=5.1.0 (from scrap

In [None]:
import logging
import scrapy
import json
from datetime import datetime
from dataclasses import dataclass, asdict
from scrapy.crawler import CrawlerProcess

In [None]:
# Number of pages the scraper will attempt to crawl
NUM_PAGES = 10000

# Scrapy settings dictionary
settings = {
    # Define output format and file for the scraped data
    "FEEDS": {"posts.jsonl": {"format": "jsonlines"}},  # Save data to `posts.jsonl` in JSON Lines format
    "FEED_EXPORT_ENCODING": "utf-8",  # Use UTF-8 encoding for the output file

    # Middleware settings for handling user agents
    "DOWNLOADER_MIDDLEWARES": {
        # Disable the default UserAgentMiddleware
        "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
        # Enable a custom middleware for rotating random user agents
        "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
    },

    # Default headers sent with each request
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",  # Accept a variety of content types
        "Accept-Language": "en",  # Set preferred language to English
    },

    # Performance tuning
    "CONCURRENT_REQUESTS": 128,  # Number of concurrent requests allowed
    "COOKIES_ENABLED": False,  # Disable cookies for better performance and fewer restrictions
    "TELNETCONSOLE_ENABLED": False,  # Disable Telnet console for security and simplicity
    "DOWNLOAD_DELAY": 0.1,  # Delay between consecutive requests (100ms)

    # Crawler behavior
    "ROBOTSTXT_OBEY": True,  # Respect `robots.txt` rules to avoid violating site policies

    # Logging settings
    "LOG_LEVEL": "INFO",  # Set log verbosity to show only informational messages and above
}

# Disable all logging to suppress unnecessary output
logging.disable(logging.CRITICAL)

The settings cell above caps the crawl at 10,000 listing pages (NUM_PAGES); this is the maximum number of pages the scraper will attempt to crawl.

The code below defines a dataclass called PhoneCrawlerItem, which structures the data collected during web crawling.

In [None]:
# Import the dataclass decorator
from dataclasses import dataclass

# Define a PhoneCrawlerItem class to represent a scraped item
@dataclass
class PhoneCrawlerItem:
    # Unique identifier for the listing
    listing_id: str  # Type: String

    # URL of the webpage where the listing was scraped
    url: str  # Type: String

    # Raw or processed content of the listing
    content: dict  # Type: Dictionary (key-value pairs)

    # Date when the data was crawled
    crawl_date: str  # Type: String

    # Source or website from which the data was scraped
    source: str  # Type: String
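
As a minimal sketch of how such an item could map onto one line of the JSON Lines output, the snippet below builds a PhoneCrawlerItem and serializes it with asdict. The field values are hypothetical placeholders, not records from our dataset, and the class is repeated only so that the snippet runs on its own.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PhoneCrawlerItem:
    listing_id: str
    url: str
    content: dict
    crawl_date: str
    source: str


# Hypothetical values, for illustration only; not taken from the real dataset.
item = PhoneCrawlerItem(
    listing_id="123456789",
    url="https://www.chotot.com/123456789.htm",
    content={"ad": {"subject": "example phone"}},
    crawl_date="2024-10-15",
    source="chotot",
)

# asdict() recursively converts the dataclass into plain dicts,
# which json.dumps can then serialize as a single JSON Lines record.
line = json.dumps(asdict(item), ensure_ascii=False)

# Reading the line back yields an ordinary dictionary with the same fields.
record = json.loads(line)
print(record["listing_id"], record["source"])
```

Keeping the nested content field as a plain dict means the raw listing data stays available for later cleaning without changing the item definition.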

This Scrapy spider, UsedSmartphoneChototSpier, is designed to scrape used smartphone listings from the Chotot website.

In [None]:
class UsedSmartphoneChototSpier(scrapy.Spider):
    name = "chotot"  # Name of the spider
    allowed_domains = ["chotot.com"]  # Domains the spider is allowed to crawl
    start_urls = [
        f"https://www.chotot.com/mua-ban-dien-thoai?page={i}" for i in range(NUM_PAGES)
    ]  # List of URLs to start scraping from
    crawl_date = datetime.now().strftime(r"%Y-%m-%d")  # Date of the crawl

    def parse(self, response: scrapy.http.Response):
        listings = json.loads(
            response.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
        )  # Extract JSON data from a specific script tag
        urls = [
            f"{listing['list_id']}.htm"
            for listing in listings["props"]["pageProps"]["initialState"]["adlisting"][
                "data"
            ]["ads"]
        ]  # Extract listing IDs and create URLs
        print(f"Found {len(urls)} listings @ {response.url}")
        yield from response.follow_all(urls, self.parse_post)  # Follow each URL to parse post details

    def parse_post(self, response):
        json_content = response.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
        json_content = json.loads(json_content)
        results = {}
        results["listing_id"] = json_content["query"]["listId"]
        results["content"] = {}
        results["url"] = json_content["props"]["canonicalUrl"]
        results["content"]["props"] = {}
        results["content"]["props"]["pageProps"] = {}
        results["content"]["props"]["pageProps"]["initialState"] = {}
        results["content"]["props"]["pageProps"]["initialState"]["adView"] = {}
        results["content"]["props"]["pageProps"]["initialState"]["adView"][
            "adInfo"
        ] = {}
        results["content"]["props"]["pageProps"]["initialState"]["adView"]["adInfo"][
            "ad"
        ] = json_content["props"]["pageProps"]["initialState"]["adView"]["adInfo"]["ad"]
        results["content"]["props"]["pageProps"]["initialState"]["adView"]["adInfo"][
            "ad_params"
        ] = json_content["props"]["pageProps"]["initialState"]["adView"]["adInfo"][
            "ad_params"
        ]
        results["content"]["props"]["pageProps"]["initialState"]["nav"] = {}
        results["content"]["props"]["pageProps"]["initialState"]["nav"]["navObj"] = (
            json_content["props"]["initialState"]["nav"]["navObj"]
        )
        yield PhoneCrawlerItem(
            listing_id=results["listing_id"],
            url=results["url"],
            content=results["content"],
            crawl_date=self.crawl_date,
            source="chotot",
        )
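
The extraction step in parse can be tried offline with a small sketch like the one below. It assumes a heavily trimmed, hypothetical page (SAMPLE_HTML) and uses a standard-library regular expression in place of Scrapy's XPath selector, so it only illustrates the idea. The helper names extract_next_data and listing_urls are ours, for illustration.

```python
import json
import re

# Hypothetical, heavily trimmed stand-in for a listing page.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"initialState": {"adlisting": {"data": {"ads": [
  {"list_id": 111}, {"list_id": 222}
]}}}}}}
</script>
</body></html>
"""

# Matches the body of the <script id="__NEXT_DATA__"> tag.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str) -> dict:
    """Return the JSON payload embedded in the __NEXT_DATA__ script tag."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))


def listing_urls(next_data: dict) -> list[str]:
    """Build the relative post URLs from the listing IDs, as parse does."""
    ads = next_data["props"]["pageProps"]["initialState"]["adlisting"]["data"]["ads"]
    return [f"{ad['list_id']}.htm" for ad in ads]


urls = listing_urls(extract_next_data(SAMPLE_HTML))
print(urls)  # ['111.htm', '222.htm']
```

Raising an explicit error when the script tag is missing makes a changed page layout easier to notice than a failure inside json.loads.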

In [None]:
# Create a CrawlerProcess instance with the defined settings
process = CrawlerProcess(settings=settings)
# Add the UsedSmartphoneChototSpier spider to the process
process.crawl(UsedSmartphoneChototSpier)
# Start the crawling process
process.start()

Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=2
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=0
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=5
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=1
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=4
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=3
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=6
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=7
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=8
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=9
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=15
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=11
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=12
Found 20 listings @ https://www.chotot.com/mua-ban-dien-thoai?page=10
Found 20 listings @ https://ww

The spider consistently finds 20 listings on each page and successfully iterates through multiple pages (page=0 to page=62).

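No errors are visible during the crawling process. Because the settings cell disables logging, this indicates that parse runs on each page, as the printed lines show, while the results of parse_post are best confirmed by checking the records written to posts.jsonl.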
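
Since logging is disabled in the settings cell, one way to check the outcome is to inspect the output file directly. The sketch below assumes the posts.jsonl layout produced by the settings above. To stay self-contained it writes a tiny hypothetical sample file and then counts records and unique listing IDs; the helper names load_jsonl and summarize are ours.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of dictionaries, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Count all records and the number of distinct listing IDs among them."""
    ids = [record["listing_id"] for record in records]
    return {"records": len(ids), "unique_listings": len(set(ids))}


# Hypothetical sample data standing in for the real posts.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "posts.jsonl"
    sample.write_text(
        '{"listing_id": "1", "source": "chotot"}\n'
        '{"listing_id": "2", "source": "chotot"}\n'
        '{"listing_id": "1", "source": "chotot"}\n',
        encoding="utf-8",
    )
    summary = summarize(load_jsonl(sample))

print(summary)  # {'records': 3, 'unique_listings': 2}
```

On the real file, a gap between the two counts would point to listings that appeared on more than one page during the crawl.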