# **Web Scraping Project - Dataset of Video Games Powered by Steam**

## **Project Idea**


The objective of this project is to create a dataset of video games available on the Steam platform. We will scrape the Steam website to extract information about these games, including their **names, release dates, prices and number of reviews**.

The output will be a CSV file containing a structured dataset that can be used for further analysis to gain insights into the games available on Steam.

This is the URL of the website which we will be scraping: 'https://store.steampowered.com/search/?filter=topsellers'


## **Project Strategy**

We will adopt a step-wise approach to this project, focusing on specific aspects of the web scraping process as we proceed.

The first step involves selecting the website (Steam) and defining the objective, while subsequent steps involve downloading web pages, parsing HTML content, extracting relevant information, and finally creating CSV files.

Libraries like requests and BeautifulSoup will be used for downloading and parsing data respectively. The project will be documented in a Jupyter notebook, ensuring proper explanations and documentation for each step. Additionally, a final CSV file will be generated, meeting the specified criteria of at least 3 columns and 100 rows of data.


## **Project Steps**

Here is an outline of the steps we will follow:

* Install and import all necessary libraries
* Download the web page using requests
* Parse the HTML source code using the BeautifulSoup library
* Build the scraper components for one search filter
* Compile the extracted information into Python lists and dictionaries
* Write the information into a CSV file using Python's csv module
* Build the scraper for all search filters
* Write the information for all search filters into a CSV file using Python's csv module
* Future work and references







## **Installing and Importing Necessary Libraries**

1. **Requests**

Requests is a Python HTTP library that allows us to send HTTP requests to web servers from code, instead of using a browser to communicate with the web.

2. **BeautifulSoup**

BeautifulSoup is a Python library for pulling data out of HTML and XML files.

3. **CSV**

The `csv` module implements classes to read and write tabular data in CSV format.

4. **Regex (re)**

The `re` module provides regular expression matching operations.



> We use pip, a package-management system, to install and manage software packages. Since the platform we selected is Binder, we type a line such as `!pip install requests` in a notebook cell to install requests. You will see more `!pip` lines when installing other packages.

> To use prewritten functions from a library, we use the `import` statement. For example, once we run `import requests` after installation, we can use any function from the requests library.

In [1]:
# Install the requests library
# !pip install requests --upgrade --quiet

In [2]:
# Install the beautiful soup library
# !pip install beautifulsoup4 --upgrade --quiet

In [3]:
# Import necessary libraries

import requests
from bs4 import BeautifulSoup

# Import csv module
import csv

# Import regex
import re

In [4]:
# List of search queries
search_queries = ['topsellers', 'mostplayed', 'newreleases', 'upcomingreleases']

## **Download the web page using requests**


In [5]:
# URL of the website to be scraped for the current search query
url = 'https://store.steampowered.com/search/?filter=topsellers'

# Send a GET request to the specified URL
response = requests.get(url)
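
The search URL above combines a base address with a `filter` query parameter, and the pagination loop later in this notebook appends a `page` parameter. As a minimal sketch, such URLs could be assembled with the standard library's `urllib.parse.urlencode`. The names `BASE_URL` and `build_search_url` are hypothetical and introduced only for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

# Base address of the Steam search page used throughout this notebook
BASE_URL = "https://store.steampowered.com/search/"


def build_search_url(filter_name, page=None):
    """Return the search URL for one filter, optionally for a given page.

    Hypothetical helper: it only builds the string and sends no request.
    """
    params = {"filter": filter_name}
    if page is not None:
        params["page"] = page
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("topsellers"))
# https://store.steampowered.com/search/?filter=topsellers
print(build_search_url("mostplayed", page=2))
# https://store.steampowered.com/search/?filter=mostplayed&page=2
```

Building the query string from a dictionary keeps the parameters in one place and takes care of escaping if a value ever contains special characters.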

In [6]:
# Get the content of the downloaded page and save in a variable
page_content = response.text
page_content



In [7]:
len(page_content)

730619

## **Writing and Reading the downloaded web page**

In [8]:
# Write the content of the web page to games_topsellers.html
with open('games_topsellers.html', 'w', encoding="utf-8") as file:
    file.write(page_content)

In [9]:
# To scrape, read the games_topsellers.html file

with open('games_topsellers.html', 'r', encoding="utf-8") as f:
    html_source = f.read()

## **Parse the HTML source code using BeautifulSoup library**

In [10]:
# Parse the HTML source into a BeautifulSoup object
doc = BeautifulSoup(html_source, 'html.parser')

In [11]:
# Find all the games on the page
games = doc.find_all('div', {'class': 'responsive_search_name_combined'})
games

[<div class="responsive_search_name_combined">
 <div class="col search_name ellipsis">
 <span class="title">PUBG: BATTLEGROUNDS</span>
 <div>
 <span class="platform_img win"></span> </div>
 </div>
 <div class="col search_released responsive_secondrow">
 
                     21 Dec, 2017                </div>
 <div class="col search_reviewscore responsive_secondrow">
 <span class="search_review_summary mixed" data-tooltip-html="Mixed&lt;br&gt;57% of the 2,276,384 user reviews for this game are positive.">
 </span>
 </div>
 <div class="col search_price_discount_combined responsive_secondrow" data-price-final="0">
 <div class="col search_discount_and_price responsive_secondrow">
 <div class="discount_block no_discount search_discount_block"> <div class="discount_prices"> <div class="discount_final_price free">Free</div> </div></div> </div>
 </div>
 </div>,
 <div class="responsive_search_name_combined">
 <div class="col search_name ellipsis">
 <span class="title">Baldur's Gate 3</span>
 <

## **Building a scraper component and converting to a list of dictionaries**

In [12]:
game_data_list = []

for game in games:
    name = game.find('span', {'class': 'title'}).text
    published_date = game.find('div', {'class': 'col search_released responsive_secondrow'}).text.strip()

    # Check if the element is present before accessing the text attribute
    original_price_elem = game.find('div', {'class': 'discount_original_price'})
    original_price = original_price_elem.text.strip() if original_price_elem else 'N/A'

    discount_price_elem = game.find('div', {'class': 'discount_final_price'})
    discount_price = discount_price_elem.text.strip() if discount_price_elem else 'N/A'

    # Extract review information using regular expressions
    review_summary = game.find('span', {'class': 'search_review_summary'})
    reviews_html = review_summary['data-tooltip-html'] if review_summary else 'N/A'

    # Use a regular expression to extract the number of reviews;
    # [\d,]* allows several thousands groups, e.g. 2,276,384
    match = re.search(r'(\d[\d,]*)\s+user reviews', reviews_html)
    reviews_number = match.group(1).replace(',', '') if match else 'N/A'

    # Store each game's information as a dictionary
    game_data = {
        'name': name,
        'date': published_date,
        'original_price': original_price,
        'discount_price': discount_price,
        'reviews_number': reviews_number
    }

    game_data_list.append(game_data)



In [13]:
game_data_list

[{'name': 'PUBG: BATTLEGROUNDS',
  'date': '21 Dec, 2017',
  'original_price': 'N/A',
  'discount_price': 'Free',
  'reviews_number': '2276384'},
 {'name': "Baldur's Gate 3",
  'date': '3 Aug, 2023',
  'original_price': '$59.99',
  'discount_price': '$53.99',
  'reviews_number': '465110'},
 {'name': 'Steam Deck',
  'date': '',
  'original_price': 'N/A',
  'discount_price': 'N/A',
  'reviews_number': 'N/A'},
 {'name': 'Counter-Strike 2',
  'date': '21 Aug, 2012',
  'original_price': 'N/A',
  'discount_price': 'Free',
  'reviews_number': '815239'},
 {'name': 'Lethal Company',
  'date': '23 Oct, 2023',
  'original_price': 'N/A',
  'discount_price': '$9.99',
  'reviews_number': '191863'},
 {'name': 'Cyberpunk 2077',
  'date': '9 Dec, 2020',
  'original_price': '$59.99',
  'discount_price': '$29.99',
  'reviews_number': '617237'},
 {'name': 'ELDEN RING',
  'date': '24 Feb, 2022',
  'original_price': '$59.99',
  'discount_price': '$35.99',
  'reviews_number': '558743'},
 {'name': 'Call of Dut
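
The review count is embedded in the `data-tooltip-html` text, and counts of a million or more contain two commas (the PUBG tooltip in the parsed HTML reads "2,276,384 user reviews"). The sketch below, using only the standard library, shows one pattern that accepts any number of comma-separated groups. The helper name `parse_review_count` is hypothetical, and returning an `int` or `None` instead of a string is an assumption made for this illustration.

```python
import html
import re

# Tooltip text copied from the PUBG entry in the parsed HTML output
tooltip = "Mixed&lt;br&gt;57% of the 2,276,384 user reviews for this game are positive."


def parse_review_count(tooltip_html):
    """Return the review count as an int, or None when no count is present."""
    # Decode entities such as &lt; in case the text is still HTML-escaped
    text = html.unescape(tooltip_html)
    # A digit followed by any mix of digits and commas, then "user reviews"
    match = re.search(r"(\d[\d,]*)\s+user reviews", text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_review_count(tooltip))  # 2276384
print(parse_review_count("N/A"))    # None
```

The percentage ("57%") is not captured because it is followed by `%` rather than whitespace and the words "user reviews".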

## **Extending the scraper component to save the result as a CSV file using the CSV module**

In [14]:
# Create a CSV file to keep our result using the CSV module

with open('games_topsellers.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Published Date', 'Original Price', 'Discount Price', 'Reviews'])

    # Loop through each game and extract the relevant information
    for game in games:
        name = game.find('span', {'class': 'title'}).text
        published_date = game.find('div', {'class': 'col search_released responsive_secondrow'}).text.strip()

        # Check if the element is present before accessing the text attribute
        original_price_elem = game.find('div', {'class': 'discount_original_price'})
        original_price = original_price_elem.text.strip() if original_price_elem else 'N/A'

        discount_price_elem = game.find('div', {'class': 'discount_final_price'})
        discount_price = discount_price_elem.text.strip() if discount_price_elem else 'N/A'

        # Extract review information using regular expressions
        review_summary = game.find('span', {'class': 'search_review_summary'})
        reviews_html = review_summary['data-tooltip-html'] if review_summary else 'N/A'

        # Use a regular expression to extract the number of reviews;
        # [\d,]* allows several thousands groups, e.g. 2,276,384
        match = re.search(r'(\d[\d,]*)\s+user reviews', reviews_html)
        reviews_number = match.group(1).replace(',', '') if match else 'N/A'

        # Write the extracted information to the CSV file
        writer.writerow([name, published_date, original_price, discount_price, reviews_number])
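
Since the previous section already collected each game as a dictionary, an alternative to re-extracting every field is to write that list of dictionaries directly. The following is a minimal sketch using `csv.DictWriter`. The function name `write_games_csv` is hypothetical, the two sample rows are copied from the output shown earlier, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Two sample rows in the same shape as game_data_list
sample_rows = [
    {"name": "Baldur's Gate 3", "date": "3 Aug, 2023", "original_price": "$59.99",
     "discount_price": "$53.99", "reviews_number": "465110"},
    {"name": "Lethal Company", "date": "23 Oct, 2023", "original_price": "N/A",
     "discount_price": "$9.99", "reviews_number": "191863"},
]


def write_games_csv(rows, file_obj):
    """Write a list of game dictionaries to an open text file object."""
    fieldnames = ["name", "date", "original_price", "discount_price", "reviews_number"]
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()      # header row taken from fieldnames
    writer.writerows(rows)    # one CSV row per dictionary


# In-memory buffer used here instead of open('games_topsellers.csv', 'w', newline='')
buffer = io.StringIO(newline="")
write_games_csv(sample_rows, buffer)
print(buffer.getvalue())
```

The dates contain a comma, so the `csv` module quotes them automatically; this is one reason to prefer it over joining fields by hand.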



## **Building a scraper for all search filters and saving the information in a CSV file**

In [15]:
# List of search filters
search_filters = ['topsellers', 'mostplayed', 'newreleases', 'upcomingreleases']

# Create a CSV file to store the scraped data
with open('games_all.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Published Date', 'Original Price', 'Discount Price', 'Reviews', 'Search Filter'])

    # Loop through each search filter
    for filter in search_filters:
        # URL of the website to be scraped for the current search query
        url = f'https://store.steampowered.com/search/?filter={filter}'

        # Send a GET request to the specified URL
        response = requests.get(url)

        # Parse the HTML content of the page using BeautifulSoup
        webpage = BeautifulSoup(response.content, 'html.parser')

        # Find the total number of pages
        total_pages = int(webpage.find('div', {'class': 'search_pagination_right'}).find_all('a')[-2].text)

        # Counter to keep track of the number of lines written
        line_count = 0

        # Loop through each page and extract the relevant information
        for page in range(1, total_pages + 1):
            # Send a GET request to the specified URL
            response = requests.get(url + '&page=' + str(page))

            # Parse the HTML content of the page using BeautifulSoup
            doc = BeautifulSoup(response.content, 'html.parser')

            # Find all the games on the page
            games = doc.find_all('div', {'class': 'responsive_search_name_combined'})

            # Loop through each game and extract the relevant information
            for game in games:
                name = game.find('span', {'class': 'title'}).text
                published_date = game.find('div', {'class': 'col search_released responsive_secondrow'}).text.strip()

                # Check if the element is present before accessing the text attribute
                original_price_elem = game.find('div', {'class': 'discount_original_price'})
                original_price = original_price_elem.text.strip() if original_price_elem else 'N/A'

                discount_price_elem = game.find('div', {'class': 'discount_final_price'})
                discount_price = discount_price_elem.text.strip() if discount_price_elem else 'N/A'

                # Extract review information using regular expressions
                review_summary = game.find('span', {'class': 'search_review_summary'})
                reviews_html = review_summary['data-tooltip-html'] if review_summary else 'N/A'

                # Use a regular expression to extract the number of reviews;
                # [\d,]* allows several thousands groups, e.g. 2,276,384
                match = re.search(r'(\d[\d,]*)\s+user reviews', reviews_html)
                reviews_number = match.group(1).replace(',', '') if match else 'N/A'

                # Write the extracted information to the CSV file
                writer.writerow([name, published_date, original_price, discount_price, reviews_number, filter])

                # Increment the line count
                line_count += 1

                # Stop scraping if we have reached the minimum data requirement
                if line_count > 100:
                    break

            # Stop scraping if we have reached the minimum data requirement
            if line_count > 100:
                break


## **Decomposing the scraper into functions**

In [None]:
# Create a function that takes a URL and returns the total number of result pages

def get_total_pages(url):
    response = requests.get(url)
    doc = BeautifulSoup(response.content, 'html.parser')
    total_pages = int(doc.find('div', {'class': 'search_pagination_right'}).find_all('a')[-2].text)
    return total_pages

In [None]:
# Create a function that extracts the information for a single game element

def extract_game_info(game):
    name = game.find('span', {'class': 'title'}).text
    published_date = game.find('div', {'class': 'col search_released responsive_secondrow'}).text.strip()

    original_price_elem = game.find('div', {'class': 'discount_original_price'})
    original_price = original_price_elem.text.strip() if original_price_elem else 'N/A'

    discount_price_elem = game.find('div', {'class': 'discount_final_price'})
    discount_price = discount_price_elem.text.strip() if discount_price_elem else 'N/A'

    review_summary = game.find('span', {'class': 'search_review_summary'})
    reviews_html = review_summary['data-tooltip-html'] if review_summary else 'N/A'

    # [\d,]* allows several thousands groups, e.g. 2,276,384
    match = re.search(r'(\d[\d,]*)\s+user reviews', reviews_html)
    reviews_number = match.group(1).replace(',', '') if match else 'N/A'

    return name, published_date, original_price, discount_price, reviews_number


In [None]:
# Create a function that scrapes the result pages for one search filter

def scrape_page(url, filter, writer):
    # Call get_total_pages to find out how many pages to request
    total_pages = get_total_pages(url)

    line_count = 0

    for page in range(1, total_pages + 1):
        response = requests.get(f"{url}&page={page}")
        doc = BeautifulSoup(response.content, 'html.parser')
        games = doc.find_all('div', {'class': 'responsive_search_name_combined'})

        for game in games:
            # Invoking the extract game info function
            game_info = extract_game_info(game)
            writer.writerow([*game_info, filter])

            line_count += 1
            if line_count > 100:
                break

        if line_count > 100:
            break
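
The nested loops above need two `break` statements to stop once enough rows are written. One alternative, sketched here with the standard library's `itertools`, is to treat the pages as a lazy stream of rows and cut it off at a limit. `take_rows` and `fake_pages` are hypothetical names; `fake_pages` is a stand-in that produces placeholder rows instead of requesting real pages. Note that this sketch stops at exactly `limit` rows.

```python
from itertools import chain, islice


def take_rows(pages, limit=100):
    """Return an iterator over at most `limit` rows.

    `pages` is an iterable of pages, and each page is an iterable of rows.
    """
    return islice(chain.from_iterable(pages), limit)


def fake_pages():
    # Stand-in for per-page scraping: an endless stream of 50-row pages
    page = 1
    while True:
        yield [f"page{page}-row{i}" for i in range(50)]
        page += 1


rows = list(take_rows(fake_pages(), limit=100))
print(len(rows), rows[0], rows[-1])  # 100 page1-row0 page2-row49
```

Because both `chain.from_iterable` and `islice` are lazy, no further page is produced once the limit is reached, which in a real scraper would mean no further request is sent.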

In [None]:

# Create the main function, which calls scrape_page for each search filter and writes the results to a CSV file

def main(search_filters=('topsellers',)):


    with open('games_all.csv', mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Name', 'Date', 'Original Price', 'Discount Price', 'Reviews', 'Search Filter'])

        for filter in search_filters:
            url = f'https://store.steampowered.com/search/?filter={filter}'
            # Invoking the scrape page function
            scrape_page(url, filter, writer)

In [None]:
# Invoking the main function
search_queries = ['topsellers', 'mostplayed', 'newreleases', 'upcomingreleases']
main(search_queries)
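
The project strategy sets a target of at least 3 columns and 100 rows for the final CSV file. A minimal sketch of checking that target with the `csv` module follows. The function name `check_csv_shape` is hypothetical, and the example runs on a small in-memory sample rather than on the real output file.

```python
import csv
import io


def check_csv_shape(file_obj, min_rows=100, min_cols=3):
    """Return (ok, n_rows, n_cols) for a CSV whose first row is a header."""
    reader = csv.reader(file_obj)
    header = next(reader, [])          # empty list if the file has no rows
    n_rows = sum(1 for _ in reader)    # count data rows after the header
    ok = len(header) >= min_cols and n_rows >= min_rows
    return ok, n_rows, len(header)


# In-memory stand-in with a header and two data rows
sample = io.StringIO('Name,Date,Reviews\r\nA,"1 Jan, 2020",10\r\nB,,N/A\r\n', newline="")
print(check_csv_shape(sample, min_rows=2))  # (True, 2, 3)
```

To check the real output, the same function could be called on a file opened with `open('games_all.csv', newline='', encoding='utf-8')`.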

## **Future Work**

The completion of this project opens avenues for future exploration and analysis. The dataset we create can be utilized for diverse purposes, from market trends to user preferences. To deepen the understanding of this domain, future work may involve:


*   Exploratory Data Analysis (EDA) to uncover patterns and trends.

*   Visualizations to present insights in an accessible manner.

*   Integration with machine learning models for predictive analysis.
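
Before any of these analyses, the scraped strings would need to be converted to typed values. The sketch below shows one way to do that with the standard library. `parse_price` and `parse_date` are hypothetical helpers, and they assume the formats seen in the sample output (`'$59.99'`, `'Free'`, `'N/A'`, `'21 Dec, 2017'`) and English month abbreviations; other currencies or date formats would need extra handling.

```python
from datetime import datetime


def parse_price(text):
    """Convert '$59.99' to 59.99 and 'Free' to 0.0; return None otherwise."""
    text = text.strip()
    if text.lower() == "free":
        return 0.0
    if text.startswith("$"):
        try:
            return float(text[1:].replace(",", ""))
        except ValueError:
            return None
    return None


def parse_date(text):
    """Convert '21 Dec, 2017' to a date; return None for empty or other formats."""
    try:
        return datetime.strptime(text.strip(), "%d %b, %Y").date()
    except ValueError:
        return None


print(parse_price("$59.99"), parse_price("Free"), parse_price("N/A"))
print(parse_date("21 Dec, 2017"), parse_date(""))
```

Returning `None` for missing values keeps them distinguishable from a genuine price of zero.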