# Scraping Top Selling Games on Steam using Selenium with Python

![banner_image](https://i.imgur.com/Ztwv6dx.png)

**Web Scraping** is a technique used to extract data from websites. It is the process of collecting structured web data in an automated fashion. It's also called web data extraction. Some of the main use cases of web scraping include price monitoring, price intelligence, news monitoring, lead generation, and market research among many others.

Here, in this web scraping we will scrap the data of top selling games from [Steam](https://store.steampowered.com/search/?filter=topsellers).

[Steam](https://store.steampowered.com/) is an online platform from game developer Valve where you can buy, play, create, and discuss PC games. The platform hosts thousands of games (as well as downloadable content, or DLC, and user-generated features called "mods") from major developers and indie game designers alike.

As the website that we will scrape is a [dynamic website](https://www.geeksforgeeks.org/dynamic-websites/) meaning this website has a infinite scrolling of games which is happening on one page only so, therefore we'll use the Python library `Selenium` to perform scrapping from the webpage and then we will use the `Pandas` library to store the information in  structured tabular format.





## Project workflow

Here's an outline of the steps that we will follow:
1. Create a webdriver with selenium to access the website.
2. Download the webpage using the `Selenium` library.
3. Simulate the scrolling of the webpage using `Selenium` to get to the end of the page to get all the games.
4. Extract the Title and url of the games.
5. Cleaning the data of any anomalies which occour.
6. Extracting data from every game's page.
7. Compile the extracted data into python lists and dictionary.
8. Save the extracted information in a csv file.

## Exepected results:

By the end of the project we will create a CSV file in the following format:

```
Title,Game_url,Release_date,Reviews,Price,Discounted_price
LEGO® Star Wars™: The Skywalker Saga,https://store.steampowered.com/app/920210/LEGO_Star_Wars_The_Skywalker_Saga/?snr=1_7_7_7000_150_1,5 Apr 2022,Very Positive 94% of the 8,348 user reviews for this game are positive,2499,2499
Sea of Thieves,https://store.steampowered.com/app/1172620/Sea_of_Thieves/?snr=1_7_7_7000_150_1,3 Jun 2020,Very Positive<br>90% of the 178,341 user reviews for this game are positive,

```

## Run the code

You can execute the code using **Run** button at the top of this page and the select **'Run on Colab'**. You can make changes and save your own version of the notebook at [jovian](https://www.jovian.ai) by executing the following cells.

In [None]:
!pip install jovian --upgrade --quiet

In [None]:
import jovian

In [None]:
# Execute this to save new versions of the notebook
jovian.commit(project="scraping-top-selling-games-on-steam")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "altamashwaseem04/scraping-top-selling-games-on-steam" on https://jovian.ai[0m
[jovian] Committed successfully! https://jovian.ai/altamashwaseem04/scraping-top-selling-games-on-steam[0m


'https://jovian.ai/altamashwaseem04/scraping-top-selling-games-on-steam'

## Install and import important libraries

For downloding the page and scraping the website we will use the [`Selenium`](https://www.geeksforgeeks.org/selenium-python-tutorial/) library and use the [chromium](https://en.wikipedia.org/wiki/Chromium_(web_browser)) browser to access the website

In [68]:
!apt update
!apt install chromium-chromedriver --quiet
!pip install selenium --quiet

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:3 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:4 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:5 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:6 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Get:7 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages [2,732 kB]
Ign:8 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:10 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Ign:11 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Get:12 http://security.ubuntu.com/ubuntu bionic-security/universe amd64 Packages [1,496 kB]
Hit:13 https://

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select

We will use the [`Pandas`](https://mode.com/python-tutorial/libraries/pandas/) library to convert the Collected data into a dataframe

In [None]:
!pip install pandas --upgrade --quiet
import pandas as pd

## Creating a webdriver

[Webdriver]('https://stackoverflow.com/questions/54459701/what-is-selenium-and-what-is-webdriver') will help us to run the website as it runs on a website so that the content of a dynamic website get's loaded, but before we can create  webdriver we need to define some options for the webdriver so that the driver can run in background

In [None]:
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

Now we can create a webdriver

In [None]:
wd = webdriver.Chrome(options=options)

We can access the driver with the "wd".

## Downloading the page

To download the page we will use the `get` property of the webdriver.

In [None]:
url ='https://store.steampowered.com/search/?filter=topsellers'

wd.get(url)

To check if the page was download successfully we can check the title of the page with the `title` property.

In [None]:
wd.title

'Steam Search'

We can create a function here for creating the driver and downloading the page.

In [None]:
def create_driver(url):
  '''Takes the url as an input, creates the webdriver and returns the driver with the page'''
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  wd = webdriver.Chrome(options=options)
  wd.get(url)
  return wd

Now we have the [Steam's top selling games page](https://store.steampowered.com/search/?filter=topsellers).

![Top-selling](https://i.imgur.com/Pc7VkLR.png)

This page all the top selling games in steam. Now when we scroll down and get to the end of the page we can see that new games get loaded and the page's height keeps increasing and if we try to extract the information now we will not get all the games but only those which are on the page when the page gets loaded.

To get all the games we need to simulate the scrolling on the page so that all the games can be loaded on the page and we can extract all the games. As there are a lot of games probably more the 10,000 therefore we are going to limit the scrolling with a for loop

## Simulating the scrolling on the page

In [None]:
import time

SCROLL_PAUSE_TIME = 1

# Get scroll height
last_height = wd.execute_script("return document.body.scrollHeight")

for i in range(6):
    # Scroll down to bottom
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

In [None]:
last_height

17704

and with that the web page will be scrolled down to the bottom and we will get all the games.

We can create a function to simulate the scrolling of the webpage.

In [None]:
def scroll_page(wd):
  '''Takes the driver as an input and simulates the scrolling to get all the games'''  
  SCROLL_PAUSE_TIME = 2

  # Get scroll height
  last_height = wd.execute_script("return document.body.scrollHeight")

  for i in range(6):
    # Scroll down to bottom
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
  return

## Extracting the information

We need to extract the information of games like it's title, price, reviews, release date, game_url and also the discounted price.

To do that we first need to get all the games that is in th page. Let's do this first:

First let's go the the [Steam](https://store.steampowered.com/search/?filter=topsellers) top selling page, right click on any game and select inspect element, we will get this screen.

![games_search](https://i.imgur.com/5Fer3Ks.png)

We can see that the all the games are in a `div` tag. We can get this `div` by the id attribute and also by an `xpath`. To get the `xpath` right click on the `div` tag and select "Copy>Copy Xpath".

![xpath](https://i.imgur.com/E0eByhC.png)




To get any element we can use the `find_element` property of web driver.

In [None]:
games_rows = wd.find_element(By.ID, 'search_resultsRows')

This gives us a single element which contains all the games. Now we need to get a list of all the games so that we can scroll through them to get all the info.

![games](https://i.imgur.com/cz2QBtO.png)

as we can see in the above image, the games are inside of an anchor tag so we extract the games using the `tag_name`.

In [None]:
games = games_rows.find_elements_by_tag_name('a')



Let's create a function for getting all the games

In [None]:
def get_games(wd):
  '''Takes the driver as an input gives and returns a selenium web element with all the games'''
  games_rows = wd.find_element(By.ID, 'search_resultsRows')
  games = games_rows.find_elements_by_tag_name('a')
  return games

## Looping through all the games and storing them inside a list of dictionary.

In [None]:
#Creating an empty list
games_list = []

#looping through all the games
for i in range(len(games)): 
  title =  games[i].find_element_by_class_name('title').text
  game_url = games[i].get_attribute('href')
  
  #storing the extracted information inside a dictionary
  my_game = {
      'title': title,
      'url': game_url,
  }

  #adding the dictionary inside the list
  games_list.append(my_game)



Now we have a list of dictionary with the title, price, discounted, url, release date and the reviews. Lets's check how many games have we got the information of:

In [69]:
len(games_list)

350

We should create a function for parsing and extracting all the information from all the games.

In [None]:
def parse_data(games):
  '''Takes the web element "games" as input and returns a list of dictionary with the title and urls of the games'''
  #Creating an empty list
  games_list = []

  #looping through all the games
  for i in range(len(games)): 
    title =  games[i].find_element_by_class_name('title').text
    game_url = games[i].get_attribute('href')
    
    #storing the extracted information inside a dictionary
    my_game = {
        'title': title,
        'url': game_url,
    }

    #adding the dictionary inside the list
    games_list.append(my_game)
  return games_list

In [None]:
jovian.commit()

[jovian] Detected Colab notebook...[0m
[jovian] Please enter your API key ( from https://jovian.ai/ ):[0m
API KEY: ··········
[jovian] Uploading colab notebook to Jovian...[0m
Committed successfully! https://jovian.ai/altamashwaseem04/scraping-top-selling-games-on-steam


'https://jovian.ai/altamashwaseem04/scraping-top-selling-games-on-steam'

Now we have a list of all the games and their information.

## Cleaning the data of any anomalies

Before we can clean the data we need convert it to a Pandas [DataFrame]('https://www.geeksforgeeks.org/python-pandas-dataframe/') which will make interacting and cleaning the data much easier.

In [None]:
gamesdf = pd.DataFrame(games_list)

Now as we can see that there are lots of rows which have blank titles and also some titles which atre not even a game like "EA Play".

![](https://i.imgur.com/yEOXn7g.png)

we need to remove these anomalies before proceeding further.

Let's create a function here for the cleaning of the data

In [None]:
def clean_data(games_list):
  '''Takes the list of games as an input,  cleans the list of any anomalies and returns it's dataframe'''
  gamesdf = pd.DataFrame(games_list)
  for i in range(len(gamesdf)):
    if gamesdf['title'][i] == '':
      new_gamesdf = gamesdf.drop(i)
      gamesdf = new_gamesdf
    elif gamesdf['title'][i] == 'EA Play' or gamesdf['title'][i] == 'Valve Index VR Kit' or gamesdf['title'][i] == 'Steam Deck' or gamesdf['title'][i] == 'Valve Index® Base Station':
      new_gamesdf = gamesdf.drop(i)
      gamesdf = new_gamesdf
  return new_gamesdf

In [None]:
new_gamesdf = clean_data(gamesdf)

now we can see that the data is clean. There are only game titles and no blank titles

![](https://i.imgur.com/IKXG3nS.png)

## Geting information from each of the games.

Now that we have a dataframe with title and url we can use the url of the game, we can use the url of every game to extract information from the each game's page.



> We will extract the price, discounted price, release date and reviews

But before we can create functions there is a problem that we need to tackle first which is age selector for some games which are violent.

![](https://i.imgur.com/S5kGtNO.png)



So we need a way to get past of this page and get to the actual game page.
With `Selenium` we can do this very easily. Let's create a function to tackle this.









In [70]:
def check_page(wd):
    '''Takes the driver as an input with a game url and, checks the page and returns the driver'''
    try: 
       try:
          info_tag = wd.find_element(By.CLASS_NAME, 'glance_ctn_responsive_left')
          return wd
       except:
          year_tag = wd.find_element(By.CLASS_NAME, 'agegate_birthday_selector')
          year = year_tag.find_element(By.ID, 'ageYear')
          yearDD = Select(year)
          yearDD.select_by_value('1900')
          view_button = wd.find_element(By.XPATH, '//*[@id="view_product_page_btn"]')
          view_button.click()
          time.sleep(4)
    except:
         return wd

    return wd

Now that we have tackled the age selector problem, we can create functions for extraction.

We can see that the price is under a `div` with a class name "game_purchase_action". let's create a function to extract price first:


In [116]:
def get_price(wd):
  '''Takes the driver an returns the price of the game'''
  try:  
    try:  
      try:
        price_tag = wd.find_element(By.CLASS_NAME, 'game_purchase_action')
        price = price_tag.find_element(By.CLASS_NAME, 'price').text.strip('$')
        return price
        
      except:
        price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
        prices = price_tag.text.strip().replace('\n').split('$')
        price = prices[1]
        return price
        
    except:
      price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
      price = price_tag.find_element(By.CLASS_NAME, 'discount_original_price').text.split('$')[1]
      return price

  except:
     price = 'not available'
     return price

With `get_price()` we can extract the price of a game. 

Next let's create the functions for discounted price, release date and reviews.

In [71]:
def get_discounted(wd):
  '''Takes the driver an returns the discounted price of the game'''
  try:
    try:  
      try:
        price_tag = wd.find_element(By.CLASS_NAME, 'game_purchase_action')
        discprice = price_tag.find_element(By.CLASS_NAME, 'price').text.strip('$')
          
      except:
        price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
        prices = price_tag.text.strip('\n').split('$')
        discprice = prices[2]
      
    except:
      discprice = 'not availale'

  except:
      price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
      discprice = price_tag.find_element(By.CLASS_NAME, 'discount_final_price')

  return discprice

In [72]:
def get_release(wd):
   '''Takes the driver an returns the release date of the game'''
   try: 
      info_tag = wd.find_element(By.CLASS_NAME, 'glance_ctn_responsive_left')

      release = info_tag.find_element(By.CLASS_NAME, 'release_date').text.strip().replace('RELEASE DATE:\n','')
   except:
      release = 'not available'
       
   return release

In [155]:
def get_reviews(wd):
  '''Takes the driver an returns the reviews of the game'''
  try:  
    info_tag = wd.find_element(By.CLASS_NAME, 'glance_ctn_responsive_left')
    try:
      reviews = info_tag.find_element(By.XPATH, '//*[@id="userReviews"]/div[2]').text.replace('ALL REVIEWS:\n', '')
    except:
      reviews = info_tag.find_element(By.CLASS_NAME, 'user_reviews').text.replace('ALL REVIEWS:\n', '')
  except:
    reviews = 'not available'
  return reviews

Let's get the url of all the games from the dataframe that we created of the games.

In [None]:
games_url = new_gamesdf['url']

let's ceate a function to create a function to download avery individual game page

In [None]:
def get_page(url):
  '''Takes the url and returns the driver with the page'''
  wd.get(url)
  return wd

Now that we have have all the functions to extract info let's combine them into one function which gives us the information in the form of a dictionary

In [74]:
def get_game_info(url):
  '''Takes the url and returns a dictionary with the price, discounted price, release date and the reviews'''
  wd_1 = get_page(url)
  wd_new = check_page(wd_1)
  price = get_price(wd_new)
  discounted = get_discounted(wd_new)
  release = get_release(wd_new)
  reviews = get_reviews(wd_new)
  mygame = {
            'Price': price,
            'Discounted Price': discounted,
            'Release Date': release,
            'Reviews': reviews
           }
  return mygame

The `get_game_info()` gives us the information of one game in the form of a dictionary, so to get information of all the games we need to loop through all the games url and fetch the information.

Let's create a function for that:

In [75]:
def get_all_games(games_url):
  '''Takes all the urls of the games, creates a list of dictionary of info of the games and returns a dataframe of this'''
  game_info_list = []
  for i in games_url:
    game_info = get_game_info(i)
    game_info_list.append(game_info)
  game_info_df = pd.DataFrame(game_info_list)
  return game_info_df

The `get_all_games()` function gives us information of all the games in the form of a dataframe.

In [156]:
game_info_df = get_all_games(games_url)

In [157]:
game_info_df

Unnamed: 0,Price,Discounted Price,Release Date,Reviews
0,29.99,29.99,"Apr 26, 2022",Very Positive - 83% of the 623 user reviews fo...
1,59.99,29.99,"Mar 21, 2019","Very Positive - 94% of the 125,833 user review..."
2,59.99,59.99,"Feb 24, 2022","Very Positive - 89% of the 348,755 user review..."
3,19.99,17.99,"Apr 25, 2022",Very Positive - 85% of the 437 user reviews fo...
4,44.99,40.49,"Jan 26, 2021",Very Positive - 82% of the 683 user reviews fo...
...,...,...,...,...
323,19.99,9.99,"Feb 2, 2017",Mostly Positive - 76% of the 663 user reviews ...
324,69.99,69.99,not available,not available
325,39.99,25.99,not available,not available
326,59.99,59.99,"Apr 2, 2018","Mostly Positive - 75% of the 5,901 user review..."


The dataframe of the all the games info looks like this:

![](https://i.imgur.com/Rwloo7h.png)

Now that we have both the dataframe we can use the `conact()` function of pandas to merge the two dataframe and use the `reset_index()` function to reset the index of the final dataframe.

In [158]:
result = pd.concat([new_gamesdf, game_info_df], axis=1, join='inner')

In [159]:
result.reset_index(drop=True, inplace=True)

In [160]:
result

Unnamed: 0,title,url,Price,Discounted Price,Release Date,Reviews
0,Dune: Spice Wars,https://store.steampowered.com/app/1605220/Dun...,29.99,29.99,"Apr 26, 2022",Very Positive - 83% of the 623 user reviews fo...
1,Sekiro™: Shadows Die Twice - GOTY Edition,https://store.steampowered.com/app/814380/Seki...,59.99,29.99,"Mar 21, 2019","Very Positive - 94% of the 125,833 user review..."
2,ELDEN RING,https://store.steampowered.com/app/1245620/ELD...,59.99,59.99,"Feb 24, 2022","Very Positive - 89% of the 348,755 user review..."
3,Peglin,https://store.steampowered.com/app/1296610/Peg...,49.99,49.99,"Apr 5, 2022","Very Positive - 93% of the 17,457 user reviews..."
4,King Arthur: Knight's Tale,https://store.steampowered.com/app/1157390/Kin...,49.99,24.99,"Oct 29, 2020","Very Positive - 94% of the 2,566 user reviews ..."
...,...,...,...,...,...,...
302,CarX Drift Racing Online,https://store.steampowered.com/app/635260/CarX...,19.99,9.99,"Feb 2, 2017",Mostly Positive - 76% of the 663 user reviews ...
303,FINAL FANTASY I-VI Bundle,https://store.steampowered.com/bundle/21478/FI...,69.99,69.99,not available,not available
304,Age of Empires II: Definitive Edition - Dynast...,https://store.steampowered.com/app/1869820/Age...,39.99,25.99,not available,not available
305,Police Simulator: Patrol Officers,https://store.steampowered.com/app/997010/Poli...,59.99,59.99,"Apr 2, 2018","Mostly Positive - 75% of the 5,901 user review..."


Now we can use the `to_csv()` function to convert the final dataframe into a csv file

In [None]:
result.to_csv('top_selling_games' + '.csv', index=False)

Now we have two dataframes, `new_gamesdf` with the title and url and `game_info_df` with the release, reviews and other info of these games.

let's create a function a function to combine these two dataframe and then save it to csv file

In [None]:
def save_to_csv(df_1, df_2, file_name):
  '''Takes both the dataframes & the filename, merges both the dataframes and saves it to a csv file'''
  result = pd.concat([df_1, df_2], axis=1, join='inner')
  result.to_csv(file_name + '.csv', index=False)

Now that we have info of every games, we need to combine both the dataframes `new_gamesdf` and `game_info_df` to create a dataframe of details of every games.

In [None]:
save_to_csv(new_gamesdf, game_info_df, 'top_selling_games')

Now we have a csv file with the Games title, url, release, price, discounted price and reviews with the name 'top_selling_games'

## Let's have a look at all the functions we have created:

In [None]:
def create_driver(url):
  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  options.add_argument('--no-sandbox')
  options.add_argument('--disable-dev-shm-usage')
  wd = webdriver.Chrome(options=options)
  wd.get(url)
  return wd


def scroll_page(wd):
  SCROLL_PAUSE_TIME = 2
  # Get scroll height
  last_height = wd.execute_script("return document.body.scrollHeight")

  for i in range(6):
    # Scroll down to bottom
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
  return


def get_games(wd):
  games_rows = wd.find_element(By.ID, 'search_resultsRows')
  games = games_rows.find_elements_by_tag_name('a')
  return games


def parse_data(games):
  #Creating an empty list
  games_list = []

  #looping through all the games
  for i in range(len(games)): 
    title =  games[i].find_element(By.CLASS_NAME, 'title').text
    game_url = games[i].get_attribute('href')
    
    #storing the extracted information inside a dictionary
    my_game = {
        'title': title,
        'url': game_url,
    }

    #adding the dictionary inside the list
    games_list.append(my_game)
  return games_list



These are the functions that we have created for the extraction of title and url of games from the top selling page of steam.

Let's create a function to combine all these so that wo don't have to run every function individually.

In [None]:
def get_all_games(url):
  wd = create_driver(url)
  scroll_page(wd)
  games = get_games(wd)
  games_list = parse_data(games)
  return games_list

Next we have a function for cleaning up of data 

In [None]:
def clean_data(games_list):
  gamesdf = pd.DataFrame(games_list)
  for i in range(len(gamesdf)):
    if gamesdf['title'][i] == '':
      new_gamesdf = gamesdf.drop(i)
      gamesdf = new_gamesdf
    elif gamesdf['title'][i] == 'EA Play' or gamesdf['title'][i] == 'Valve Index VR Kit' or gamesdf['title'][i] == 'Steam Deck' or gamesdf['title'][i] == 'Valve Index® Base Station':
      new_gamesdf = gamesdf.drop(i)
      gamesdf = new_gamesdf
  return new_gamesdf
  

Let's take a look at the functions of extracting info from every games:

In [None]:
def check_page(wd):
 try: 
    try:
      info_tag = wd.find_element(By.CLASS_NAME, 'glance_ctn_responsive_left')
      return wd
    except:
      year_tag = wd.find_element(By.CLASS_NAME, 'agegate_birthday_selector')
      year = year_tag.find_element(By.ID, 'ageYear')
      yearDD = Select(year)
      yearDD.select_by_value('1900')
      view_button = wd.find_element(By.XPATH, '//*[@id="view_product_page_btn"]')
      view_button.click()
      time.sleep(4)
 except:
     return wd

 return wd


 
def get_price(wd):
  try:  
    try:  
      try:
        price_tag = wd.find_element(By.CLASS_NAME, 'game_purchase_action')
        price = price_tag.find_element(By.CLASS_NAME, 'price').text.strip('$')
        
      except:
        price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
        prices = price_tag.text.strip('\n').split('$')
        price = prices[1]
        
    except:
      price = 'not availale'

  except:
     price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
     price = price_tag.find_element(By.CLASS_NAME, 'discount_original_price').text

  return price



def get_discounted(wd):
  try:
    try:  
      try:
        price_tag = wd.find_element(By.CLASS_NAME, 'game_purchase_action')
        discprice = price_tag.find_element(By.CLASS_NAME, 'price').text.strip('$')
          
      except:
        price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
        prices = price_tag.text.strip('\n').split('$')
        discprice = prices[2]
      
    except:
      discprice = 'not availale'

  except:
      price_tag = wd.find_element(By.CLASS_NAME, 'discount_prices')
      discprice = price_tag.find_element(By.CLASS_NAME, 'discount_final_price')

  return discprice



def get_release(wd):
   try: 
      info_tag = wd.find_element(By.CLASS_NAME, 'glance_ctn_responsive_left')

      release = info_tag.find_element(By.CLASS_NAME, 'release_date').text.strip('RELEASE DATE:\n')
   except:
      release = 'not available'
       
   return release



def get_reviews(wd):
  try:  
    info_tag = wd.find_element(By.CLASS_NAME, 'glance_ctn_responsive_left')

    reviews_tag = info_tag.find_element(By.CLASS_NAME, 'user_reviews').text.strip('\n')
  except:
    reviews_tag = 'not available'
  return reviews_tag



def get_page(url, wd):
  wd.get(url)
  return wd



def get_game_info(url):
  wd_1 = get_page(url)
  wd_new = check_page(wd_1)
  price = get_price(wd_new)
  discounted = get_discounted(wd_new)
  release = get_release(wd_new)
  reviews = get_reviews(wd_new)
  mygame = {
            'Price': price,
            'Discounted Price': discounted,
            'Release Date': release,
            'Reviews': reviews
           }
  return mygame


Let's create a function to combine all of these into one:

In [None]:
def get_all_games(new_gamesdf):
  games_url = new_gamesdf['url']
  game_info_list = []
  for i in games_url:
    game_info = get_game_info(i)
    game_info_list.append(game_info)
  game_info_df = pd.DataFrame(game_info_list)
  return game_info_df

And at last we have have the function to combine both dataframe and save it to a csv file.

In [None]:
def save_to_csv(df_1, df_2, file_name):
  result = pd.concat([df_1, df_2], axis=1, join='inner')
  result.to_csv(file_name + '.csv', index=False)

These are all the function that have used to scrape the top selling games of steam but it is really a hassle to run all the functions and get the csv file.

So,  let's create a function that combine all the above functions which fisrst get all the games from top selling page, cleans the data of any anomalies, get the information of every games and combine & save the data into a csv file.

In [None]:
def scrape_top_games(url):
  games_list = get_all_games(url)
  new_gamesdf = clean_data(games_list)
  game_info_df = get_all_games(new_gamesdf)
  save_to_csv(new_gamesdf, game_info_df, 'top_selling_games')

This is the final function which is going to gives us the top selling games in the form of csv file

## Project Summary

Here's what we've covered in this notebook:

1. Create a webdriver with selenium to access the website.
2. Download the webpage using the Selenium library.
3. Simulate the scrolling of the webpage using Selenium to get to the end of the page to get all the games.
4. Extract the Title and url of the games.
5. Cleaning the data of any anomalies which occour.
6. Extracting data from every game's page.
7. Compile the extracted data into python lists and dictionary.
8. Save the extracted information in a csv file.


The CSV file we created has this format:

![](https://i.imgur.com/1zCuJqi.png)

## Future work



*   We can now fetch individual games page and get all the information about the games. Further refinement of this notebook could include top selling games from a specific genre, games for particular OS or even by number of players who can play the game like single player or multiple player.
*   With the collected data further analysis can be done. For example as the data gets change in the top selling games we can determine for how many days a games stays on top.





## References



1.   Web scraping tutorial : https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
2.   Webscraping with selenium tutorial : https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/workshop-web-scraping-with-selenium-aws

1.   `Selenium` documentation library : https://pypi.org/project/selenium/
2.   `Pandas` : https://www.w3schools.com/python/pandas/default.asp 

1.   Steam Website(top selling games) : https://store.steampowered.com/search/?filter=topsellers







In [None]:
#@title Default title text
jovian.commit(files = ['top_selling_games.csv'])

<IPython.core.display.Javascript object>