---
# 1/ Finding the AppID of the selected game (The Crew 2 in this case)
---

To access the reviews page of a Steam game, we need the AppID of the game.  
To achieve this, we will use Steam's game search page by entering the game name.  
Then, we will display (on the Streamlit app) a list of the top 5 results (or fewer/more) from the search, allowing the user to select the corresponding game.  
Based on this selection, we will scrape the AppID of the chosen game using the HTML content of Steam's game search page.

## Importing libraries

In [1]:
import requests

from bs4 import BeautifulSoup

import urllib.parse

In [2]:
# This URL corresponds to the game search page on Steam
# The game name must be URL-encoded before it is appended to the URL

game_name = "the crew 2"
search_page_url = "https://store.steampowered.com/search/?term=" + urllib.parse.quote(game_name)
print(search_page_url)

https://store.steampowered.com/search/?term=the%20crew%202
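
As a side note, the query string can also be built with `urllib.parse.urlencode`, which encodes every parameter in one call. This is only an alternative sketch, not what the notebook does: `urlencode` writes spaces as `+` rather than `%20`, and it is an assumption here that Steam's search page treats both forms the same way. `build_search_url` and `STEAM_SEARCH_URL` are hypothetical names.

```python
from urllib.parse import urlencode

STEAM_SEARCH_URL = "https://store.steampowered.com/search/"

def build_search_url(game_name):
    """Return the Steam search URL for a game name (hypothetical helper)."""
    # urlencode escapes reserved characters and writes spaces as "+"
    return STEAM_SEARCH_URL + "?" + urlencode({"term": game_name})

print(build_search_url("the crew 2"))
# https://store.steampowered.com/search/?term=the+crew+2
```

Passing a dictionary makes it easy to add further search parameters later without concatenating strings by hand.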


In [3]:
# Request the URL of the search page
search_page_response = requests.get(search_page_url)

In [4]:
# Retrieve the HTML content of the page
search_page_soup = BeautifulSoup(search_page_response.text, 'lxml')

## Scrape the game names and their corresponding AppIDs

In [5]:
game_names_list = []
appIDs_list = []

games_list_result = search_page_soup.find_all("a", class_ = "search_result_row ds_collapse_flag", limit = 5)

for game in games_list_result :
    game_names_list.append(game.find(class_ = "title").get_text(strip = True))
    appIDs_list.append(game.get("data-ds-appid"))

# Dictionary of game names and their corresponding AppIDs
dict_game = dict(zip(game_names_list, appIDs_list))
print(dict_game)

{'The Crew™ 2': '646910', 'The Crew 2 - Season Pass': '889890', 'The Crew 2 Demo': '1075340', 'The Crew Motorfest | Year 2 Pass': '3251420', 'The Crew 2 - Mazda RX8 Starter Pack': '2183882'}
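
Two details are worth guarding against when building this dictionary: `Tag.get` returns `None` when a row has no `data-ds-appid` attribute, and `dict(zip(...))` silently keeps only the last entry when two results share the same title. Below is a minimal sketch of a more defensive version. `build_game_dict` is a hypothetical helper, and the sample lists reuse two values from the output above plus one made-up row without an AppID.

```python
def build_game_dict(names, app_ids):
    """Map game names to AppIDs, skipping rows without an AppID (hypothetical helper)."""
    games = {}
    for name, app_id in zip(names, app_ids):
        if app_id is None:
            # The row had no "data-ds-appid" attribute
            continue
        # Keep the first (highest-ranked) result when a title appears twice
        games.setdefault(name, app_id)
    return games

# "Some Bundle" is a made-up row used only to illustrate the None case
sample_names = ["The Crew™ 2", "The Crew 2 - Season Pass", "Some Bundle"]
sample_app_ids = ["646910", "889890", None]

print(build_game_dict(sample_names, sample_app_ids))
# {'The Crew™ 2': '646910', 'The Crew 2 - Season Pass': '889890'}
```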


In [6]:
# We select the game "The Crew 2"
final_AppID = dict_game[game_names_list[0]]
print(final_AppID)

646910


In [7]:
# We now have the URL of the Steam reviews page for the selected game
reviews_page_url = "https://steamcommunity.com/app/" + final_AppID + "/reviews?l=english&"
print(reviews_page_url)

https://steamcommunity.com/app/646910/reviews?l=english&


# 2/ Set the parameters for the reviews search

To filter the reviews, we can set parameters such as the review sorting method, the review language, and the maximum number of reviews to scrape.  
In our case, the user will choose these parameters on the Streamlit app.  
After setting these parameters, we will define the URL of the reviews page based on the selected filters.  
To retrieve the available filter options, we will use BeautifulSoup once again...

In [8]:
# Request the URL of the initial reviews page
reviews_page_response = requests.get(reviews_page_url)

In [9]:
# Retrieve the HTML content of the page
reviews_page_soup = BeautifulSoup(reviews_page_response.text, 'lxml')

## Set the review sorting method

In [10]:
# List containing the names of the sorting methods to be displayed on the Streamlit app
sorting_methods_name_list = []

# List containing the identifiers of the sorting methods to be added to the final reviews page URL
sorting_methods_identifier_list = []

sorting_methods_result = reviews_page_soup.find("div", class_ = "filterselect_options shadow_content").find_all("div", class_ = "option")

for sorting_method in sorting_methods_result :
    sorting_methods_identifier_list.append(sorting_method.get("onclick").replace("javascript:SelectContentFilter( '?", "").replace("' );", ""))
    sorting_method_text = sorting_method.get_text(strip = True)
    sorting_methods_name_list.append(sorting_method_text)

# Dictionary of sorting method names and their corresponding identifiers
dict_sorting_methods = dict(zip(sorting_methods_name_list, sorting_methods_identifier_list))
print(dict_sorting_methods)

{'Most Helpful(All Time)': 'l=english&p=1&browsefilter=toprated', 'Most Helpful(Today)': 'l=english&p=1&browsefilter=trendday', 'Most Helpful(Week)': 'l=english&p=1&browsefilter=trendweek', 'Most Helpful(Month)': 'l=english&p=1&browsefilter=trendmonth', 'Most Helpful(Three Months)': 'l=english&p=1&browsefilter=trendthreemonths', 'Most Helpful(Six Months)': 'l=english&p=1&browsefilter=trendsixmonths', 'Most Helpful(Year)': 'l=english&p=1&browsefilter=trendyear', 'Most Recent': 'l=english&p=1&browsefilter=mostrecent', 'Recently Updated': 'l=english&p=1&browsefilter=recentlyupdated', 'Funny': 'l=english&p=1&browsefilter=funny'}


In [11]:
# In our case, we choose the sorting method 'Most Helpful(Week)'
sorting_method_identifier = dict_sorting_methods['Most Helpful(Week)']
print(sorting_method_identifier)

l=english&p=1&browsefilter=trendweek
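
The chained `replace` calls above depend on the exact function name and spacing inside the `onclick` attribute. A regular expression that only looks for the quoted `'?...'` argument is a little more tolerant. This is a sketch under the assumption that the attribute keeps the shape seen in this notebook; `extract_query` and `ONCLICK_QUERY` are hypothetical names.

```python
import re

# Captures the text between "'?" and the closing quote of the JavaScript call
ONCLICK_QUERY = re.compile(r"\(\s*'\?([^']*)'\s*\)")

def extract_query(onclick):
    """Return the query fragment found in an onclick attribute, or None (hypothetical helper)."""
    match = ONCLICK_QUERY.search(onclick)
    return match.group(1) if match else None

print(extract_query("javascript:SelectContentFilter( '?l=english&p=1&browsefilter=trendweek' );"))
# l=english&p=1&browsefilter=trendweek
```

The same pattern also matches the `SelectLanguageFilter( '?...' )` calls, so one helper could serve both filters.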


## Set the language for the reviews

In [12]:
# List containing the names of the languages to be displayed on the Streamlit app
languages_name_list = []

# List containing the identifiers of the languages to be added to the final reviews page URL
languages_identifier_list = []

languages_result = reviews_page_soup.find("div", class_ = "filterselect_options language shadow_content").find_all("div", class_ = "option")

for language in languages_result :
    languages_identifier_list.append(language.get("onclick").replace("javascript:SelectLanguageFilter( '?", "").replace("' );", ""))
    language_text = language.get_text(strip = True)
    languages_name_list.append(language_text)

# Dictionary of language names and their corresponding identifiers
dict_languages = dict(zip(languages_name_list, languages_identifier_list))
print(dict_languages)

{'All Languages': 'l=english&filterLanguage=all', 'Simplified Chinese': 'l=english&filterLanguage=schinese', 'Traditional Chinese': 'l=english&filterLanguage=tchinese', 'Japanese': 'l=english&filterLanguage=japanese', 'Korean': 'l=english&filterLanguage=koreana', 'Thai': 'l=english&filterLanguage=thai', 'Bulgarian': 'l=english&filterLanguage=bulgarian', 'Czech': 'l=english&filterLanguage=czech', 'Danish': 'l=english&filterLanguage=danish', 'German': 'l=english&filterLanguage=german', 'English': 'l=english&filterLanguage=english', 'Spanish - Spain': 'l=english&filterLanguage=spanish', 'Spanish - Latin America': 'l=english&filterLanguage=latam', 'Greek': 'l=english&filterLanguage=greek', 'French': 'l=english&filterLanguage=french', 'Italian': 'l=english&filterLanguage=italian', 'Indonesian': 'l=english&filterLanguage=indonesian', 'Hungarian': 'l=english&filterLanguage=hungarian', 'Dutch': 'l=english&filterLanguage=dutch', 'Norwegian': 'l=english&filterLanguage=norwegian', 'Polish': 'l=en

In [13]:
# In our case, we choose the English language
language_identifier = dict_languages['English']
print(language_identifier)

l=english&filterLanguage=english


## Set the maximum number of reviews to scrape

In [14]:
max_review = 100

## Define the final URL with the selected filters

In [15]:
final_reviews_page_url = reviews_page_url + sorting_method_identifier + "&" + language_identifier
print(final_reviews_page_url)

https://steamcommunity.com/app/646910/reviews?l=english&l=english&p=1&browsefilter=trendweek&l=english&filterLanguage=english
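
The URL above contains `l=english` three times because the base URL and each scraped fragment carry their own copy. The page still loaded in this run, but the fragments can be merged into a URL without repeated parameters. The following is a minimal sketch using only the standard library; `build_reviews_url` is a hypothetical helper, and it assumes that when a parameter appears in several fragments, keeping the last value is acceptable.

```python
from urllib.parse import parse_qsl, urlencode

def build_reviews_url(app_id, *filter_fragments):
    """Merge query fragments into one reviews URL without duplicates (hypothetical helper)."""
    params = {}
    for fragment in filter_fragments:
        # parse_qsl turns "a=1&b=2" into [("a", "1"), ("b", "2")]
        params.update(parse_qsl(fragment))
    return "https://steamcommunity.com/app/" + app_id + "/reviews?" + urlencode(params)

print(build_reviews_url(
    "646910",
    "l=english&p=1&browsefilter=trendweek",
    "l=english&filterLanguage=english",
))
# https://steamcommunity.com/app/646910/reviews?l=english&p=1&browsefilter=trendweek&filterLanguage=english
```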


By observing this page, we can see that not all reviews are in the HTML content...  
This is because the page needs to be loaded by scrolling to the bottom.  
To solve this, we will use the Selenium library to scroll to the bottom of the page, ensuring that all reviews are loaded in the HTML content.  
Selenium allows navigation on a webpage as a human would. For example, it enables clicking on an element, entering text into a search bar, locating an element using the HTML code, and more.

---
# 3/ Scroll the page using Selenium
---

## Importing libraries

In [16]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains

from webdriver_manager.chrome import ChromeDriverManager

## Create and launch the Selenium driver

In [17]:
# Set the options to launch the driver in the background
chrome_options = Options()
chrome_options.add_argument("--headless")

# Launch the headless driver
driver = webdriver.Chrome(options = chrome_options, service=Service(ChromeDriverManager().install()))

# Load the driver with graphical interface
# driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Access the Steam reviews page
driver.get(final_reviews_page_url)

## Scrolling on the Reviews Page

This step was not easy to code.  
To simplify, we use a loop that, at each iteration, scrolls to the bottom of the loaded (HTML) content.  
The loop ends when the program detects that it has reached the bottom of the page, or when the number of loaded reviews reaches `max_review`.  
Moreover, between each scroll, we add a wait time to allow the additional content to load: we wait until the "action wait" image disappears.

In [18]:
# Variable used to determine if the bottom of the page has been reached
end = False

# Element corresponding to the bottom of the currently loaded content
getMoreContent = driver.find_element(by = By.ID, value = 'GetMoreContentBtn')

# Element corresponding to the "action wait" image
action_wait = driver.find_element(by = By.XPATH, value = '//*[@id="action_wait"]/img')

# Element corresponding to the bottom of the page
noMoreContent = driver.find_element(by = By.XPATH, value = '//*[@id="NoMoreContent"]/div[2]/a')

# While we haven't reached the bottom of the page and the number of reviews is lower than the maximum review limit...
while not end and len(driver.find_elements(By.CLASS_NAME, 'apphub_CardTextContent')) < max_review :
    # Check if the bottom-of-page element is present in the loaded content
    try :
        ActionChains(driver).scroll_to_element(noMoreContent).perform()
        end = True

    # If scrolling to it fails, the bottom of the page has not been reached yet
    except Exception :
        # Scrolling to the bottom of the currently loaded content
        ActionChains(driver).scroll_to_element(getMoreContent).perform()

        # Waiting for the page to load
        WebDriverWait(driver, 10).until(EC.invisibility_of_element(action_wait))
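
A common alternative to targeting specific page elements is to scroll with JavaScript and stop once the document height no longer grows. The sketch below is not the approach used above, and it makes several assumptions: a fixed `pause` replaces the wait on the "action wait" image, the review limit is not checked, and `scroll_until_stable`, `max_rounds` and `pause` are hypothetical names. It relies only on `driver.execute_script`.

```python
import time

def scroll_until_stable(driver, max_rounds=50, pause=1.0):
    """Scroll to the bottom repeatedly until the page height stops changing (hypothetical helper)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom of the content loaded so far
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to load the next batch of content
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded: assume the bottom has been reached
            break
        last_height = new_height
    return last_height
```

With the driver created earlier, this would be called as `scroll_until_stable(driver)`. A fixed pause is simpler but less reliable than an explicit wait, since a slow response can be mistaken for the end of the page.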

## Scrape the reviews from the fully loaded page

Once the Steam reviews page is fully loaded, we will scrape the reviews from the HTML content.  
As before, we will use the BeautifulSoup library...

In [19]:
# Extract the soup HTML content from the driver's page source
reviews_loaded_page_soup = BeautifulSoup(driver.page_source, "lxml")

In [20]:
# Retrieve all review details (date posted, review text, and additional information such as "received_compensation" or "refunded")
reviews_content = reviews_loaded_page_soup.find_all("div", class_ = "apphub_CardTextContent")

In [21]:
print("Number of reviews :", len(reviews_content))

Number of reviews : 87


Finally, we will extract only the review text from the review content.

In [22]:
reviews_list = []

for review in reviews_content :
    # Remove the additional unnecessary information
    for elt_a_suppr in review.find_all(['div'], class_=["date_posted", "received_compensation", "refunded"]):
        elt_a_suppr.decompose()
    
    # Extract the review text from the HTML content
    review = review.get_text(strip=True, separator = "\n")

    reviews_list.append(review)
    print(review, end = "\n\n")

Titled: 'The crying dollar'
Prologue/authors notes:
I despise this game with amounts I foresaw humanly impossible. How can this game take the worst f*cking elements of every racing game ever and combine it into one?
Graphics: Forza Horizon 2 was better then this ♥♥♥♥.
Open world: Is exactly that, 'open', like a damned wasteland in Fallout
AI: Rubberbands like it's some alternate NFS Underground era all over again.
Quantity over quality: Of course we need 750+ items to grind for.
Vehicles:
-Airplanes: Them ballers pull 30G and could commit multiple mass crimes before getting destroyed.
-Sport Cars: Semi-okay driver assisted land boats.
-Dirt bikes: Can accelerate at an angle of 70 degrees?
-Rally cars: Hopeless sand boats, the bigger the spoiler the harder the ai slaps. Steering is weird
-Drift cars: Just as low grossing as Tokyo Drift in 2006, feels very forced.
-Boat races: Handles better then most cars but seems to permanently suffer from 'permanent hurricane' syndrome, in which the 