# STEAM CHART WEBSCRAPING PROJECT 

Goal of the project: scrape the relevant data for the 5000 most played games on Steam. In particular, we're interested in:
- Name of the game
- Current active players
- Peak players
- Hours played
- Game's genre
- Price
- Date of release
- Game developer
- Game publisher
- Reviews (positive and negative counts)

This data can be retrieved from https://steamcharts.com/top and https://store.steampowered.com/

First, let's import the required libraries

In [2]:
import pandas as pd
from bs4 import BeautifulSoup as bs
import requests

We're starting with steamcharts.com. From there we can retrieve the number of players and hours for every game. 

In [3]:
url = "https://steamcharts.com/top/p.1"

req = requests.get(url).text
steamchart = bs(req, "html.parser")

First, we need to scrape all the titles on the first page, along with the number of active players, peak players and total hours played.

In [None]:
# Let's create a function that scrapes all the data for one game...

body = steamchart.find("tbody")
games = body.find_all("tr")

def get_title_and_players_data():
    # Note: reads the global row list `games` and the global loop index `g`.
    try:
        title = games[g].find("a", href = True)
        title = title.string
        title = title.strip()
        
        # The three "num" cells hold current players, peak players and hours played.
        players_data = games[g].find_all(class_ = "num")
        
        current_players = players_data[0].string
        peak_players = players_data[1].string
        total_hours = players_data[2].string

    except Exception:   # missing link, empty text or fewer than three cells
        title = "missing value"
        current_players = "missing value"
        peak_players = "missing value"
        total_hours = "missing value"
    
    return title, current_players, peak_players, total_hours

# ... and then iterate it for all the games in the page.

t = []
cp = []
pp = []
th = []
for g in range(len(games)):
  # Call the function once per row and unpack the four values.
  title, current_players, peak_players, total_hours = get_title_and_players_data()
  t.append(title)
  cp.append(current_players)
  pp.append(peak_players)
  th.append(total_hours)
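
As a hedged alternative, the same extraction can be sketched as a function that receives the table row as an argument instead of reading the global `games` and `g`. The HTML below is a made-up, simplified stand-in for the chart table, and `parse_row` and `SAMPLE_HTML` are hypothetical names that do not appear elsewhere in this notebook.

```python
from bs4 import BeautifulSoup

# Simplified, invented sample of the chart table (assumption: the real page
# has the same link and "num" cell layout that the notebook code relies on).
SAMPLE_HTML = """
<table><tbody>
<tr><td>1.</td><td><a href="/app/730"> Counter-Strike: Global Offensive </a></td>
<td class="num">1075054</td><td class="num">1320219</td><td class="num">558998360</td></tr>
<tr><td>2.</td><td>row without a link</td></tr>
</tbody></table>
"""

def parse_row(row):
    """Return (title, current players, peak players, total hours) for one <tr>."""
    link = row.find("a", href=True)
    nums = row.find_all(class_="num")
    if link is None or len(nums) < 3:
        return ("missing value",) * 4
    return (link.get_text(strip=True),) + tuple(n.get_text(strip=True) for n in nums[:3])

rows = BeautifulSoup(SAMPLE_HTML, "html.parser").find("tbody").find_all("tr")
parsed = [parse_row(row) for row in rows]
print(parsed)
```

Passing the row explicitly makes the function usable on any page without depending on the loop variable that happens to be in scope.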

Create a dataframe for the first 25 games.

In [None]:
# Dataframe

Titles = t
Current_players = cp
Peak_players = pp
Total_hours = th

data = {"Titles": Titles, "Current Players": Current_players, "Peak Players": Peak_players, "Total Hours": Total_hours}
DF = pd.DataFrame(data)
DF


,Titles,Current Players,Peak Players,Total Hours
0,Counter-Strike: Global Offensive,1075054,1320219,558998360
1,Dota 2,623139,676653,295245771
2,PUBG: BATTLEGROUNDS,288743,452239,152986898
3,Apex Legends,277684,441067,155374796
4,Lost Ark,221578,223820,101661911
5,Goose Goose Duck,150457,701898,83690695
6,Rust,136693,167827,69037300
7,Grand Theft Auto V,135366,176528,78797127
8,Team Fortress 2,106360,119399,68325043
9,Dark and Darker Demo,106010,108429,9546942


Now for the hard part: to retrieve the prices, developers, publishers, etc., we need to scrape each game's page on the Steam store. Let's start by gathering the store URLs for all the games on the first page.

In [None]:
url = "https://steamcharts.com/top/p.1"

def get_store_urls():
    # Note: reads the global `steamchart` soup of the current chart page.
    body = steamchart.find("tbody")
    links = body.find_all(href = True)
    store_urls = []    

    for link in links:
        try:
            link.string.strip()    # raises if the game has no title, so we don't collect its url
            # The chart href looks like "/app/730"; cc=IT asks for the Italian storefront.
            store_urls.append("https://store.steampowered.com" + link["href"] + "/?cc=IT") 

        except Exception:
            store_urls.append("missing value")    # keeps the list aligned with the rows
       
    return store_urls
    
urls_list = get_store_urls()
urls_list

['https://store.steampowered.com/app/730/?cc=IT',
 'https://store.steampowered.com/app/570/?cc=IT',
 'https://store.steampowered.com/app/578080/?cc=IT',
 'https://store.steampowered.com/app/1172470/?cc=IT',
 'https://store.steampowered.com/app/1599340/?cc=IT',
 'https://store.steampowered.com/app/1568590/?cc=IT',
 'https://store.steampowered.com/app/252490/?cc=IT',
 'https://store.steampowered.com/app/271590/?cc=IT',
 'https://store.steampowered.com/app/440/?cc=IT',
 'https://store.steampowered.com/app/2258570/?cc=IT',
 'https://store.steampowered.com/app/1938090/?cc=IT',
 'https://store.steampowered.com/app/236390/?cc=IT',
 'https://store.steampowered.com/app/431960/?cc=IT',
 'https://store.steampowered.com/app/1811260/?cc=IT',
 'https://store.steampowered.com/app/1085660/?cc=IT',
 'https://store.steampowered.com/app/346110/?cc=IT',
 'https://store.steampowered.com/app/1904540/?cc=IT',
 'https://store.steampowered.com/app/1245620/?cc=IT',
 'https://store.steampowered.com/app/304930/?c
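
The URL construction used above can be sketched on its own, without any network access. This is a minimal illustration, assuming chart hrefs always have the form `/app/<numeric id>`; `store_url` and `STORE_BASE` are hypothetical names introduced only for this example.

```python
from urllib.parse import urlencode

STORE_BASE = "https://store.steampowered.com"

def store_url(href, country="IT"):
    """Turn a chart href such as '/app/730' into a store URL for one country.

    Returns None when the href does not look like '/app/<digits>'.
    """
    prefix = "/app/"
    if not href.startswith(prefix) or not href[len(prefix):].isdigit():
        return None
    return f"{STORE_BASE}{href}/?{urlencode({'cc': country})}"

print(store_url("/app/730"))
```

The `cc` query parameter selects the storefront country, which is why the prices scraped later in this notebook are expressed in euros.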

We now have the store page link for every game on the page. From these pages we can extract the other data we're interested in.
We'll start by scraping GTA V's page.



In [None]:
# Get the genre
url = "https://store.steampowered.com/app/271590/Grand_Theft_Auto_V/?cc=IT"
req = requests.get(url).text
store_page = bs(req, "html.parser")

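def get_game_genre():
  try:
    # Use the first three store tags of the page as the genre.
    G = store_page.find_all(class_ = "app_tag")
    genre = G[0].string.strip() + "; " + G[1].string.strip() + "; " + G[2].string.strip()
  
  except Exception:   # fewer than three tags, or no tags at all
    genre = "missing value"
    
  return genre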

get_game_genre()

'Open World; Action; Multiplayer'

In [None]:
# Get the price

def get_game_price():
  # Try the regular price first, then the original price of a discounted game,
  # then the price block shown when the game is only sellable in a bundle.
  for price_class in ("game_purchase_price price", "discount_original_price", "discount_prices"):
    try:
      return store_page.find(class_ = price_class).string.strip()

    except Exception:   # element not found or without plain text: try the next class
      continue

  return "missing value"


get_game_price()

'29,98€'
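
The scraped price is a string, so it needs converting before any numeric analysis. The following is a minimal sketch, assuming the Italian storefront format seen above (comma as decimal separator, dot as thousands separator, trailing euro sign); `parse_price` is a hypothetical helper, not part of the original notebook.

```python
def parse_price(text):
    """Convert a scraped price string to a float in euros.

    Returns 0.0 for free-to-play labels and None for anything unparseable.
    """
    cleaned = text.strip().lower()
    if cleaned.startswith("free"):
        return 0.0
    # Drop the currency sign and thousands dots, then use a dot as decimal mark.
    number = cleaned.replace("€", "").replace(".", "").replace(",", ".").strip()
    try:
        return float(number)
    except ValueError:
        return None

print(parse_price("29,98€"))
print(parse_price("Free To Play"))
print(parse_price("missing value"))
```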

In [None]:
# Get the release date 

infotab = store_page.find(class_ = "glance_ctn_responsive_left")   # panel that reports the release date, average reviews, developer and publisher

def get_date_of_release():
  try:
    rel = infotab.find(class_ = "date")
    rel = rel.string
    rel = rel.replace(",", "")       
  
  except:
    rel = "missing value"

  return rel

get_date_of_release()

'14 Apr 2015'
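
For later analysis it may help to turn the release date string into a real date. This sketch assumes the day-month-year layout returned above (after the comma has been removed) and English month abbreviations, which is what `%b` expects under Python's default locale; `parse_release_date` is a hypothetical helper.

```python
from datetime import datetime

def parse_release_date(text):
    """Parse strings such as '14 Apr 2015'; return None if the format differs."""
    try:
        return datetime.strptime(text.strip(), "%d %b %Y").date()
    except ValueError:
        return None

print(parse_release_date("14 Apr 2015"))
print(parse_release_date("missing value"))
```

Dates written in another layout (for example a storefront that puts the month first) would come back as `None` and need their own format string.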

In [None]:
# Get the publisher and developer

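def get_company():
  try:
    coms = infotab.find_all("a")
  
  except Exception:   # infotab is None when the info panel is missing
    coms = []
  
  # The first link is taken as the developer and the second as the publisher.
  # Caveat: if a page lists two developers, the second link may be a developer too.
  try:
    dev = coms[0].string

  except Exception:
    dev = "missing value"
  
  try:
    pub = coms[1].string
  
  except Exception:
    pub = dev   # only one company listed: assume it is also the publisher
  
  return dev, pub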

get_company()

('Rockstar North', 'Rockstar Games')

In [None]:
# Get the reviews

def get_reviews():

  try:
    revtypes = store_page.find(class_ = "user_reviews_filter_menu_flyout")
    revs = revtypes.find_all(class_ = "user_reviews_count")
    
    # revs[1] and revs[2] hold the positive and negative counts, e.g. "(1,291,675)".
    posrevs = revs[1].string
    posrevs = posrevs.replace("(", "").replace(")", "")
    
    negrevs = revs[2].string
    negrevs = negrevs.replace("(", "").replace(")", "")

  except Exception:
    posrevs = "missing value"
    negrevs = "missing value"
  
  return posrevs, negrevs

get_reviews()

('1,291,675', '217,690')
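
The review counts are returned as strings with thousands separators. A minimal sketch of converting them to integers and deriving the share of positive reviews follows; `to_int` and `positive_share` are hypothetical helpers added for illustration.

```python
def to_int(text):
    """Convert a scraped count such as '1,291,675' to an int, or None if it is not a number."""
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None

def positive_share(positive, negative):
    """Return positive / (positive + negative), or None when either count is unusable."""
    pos, neg = to_int(positive), to_int(negative)
    if pos is None or neg is None or pos + neg == 0:
        return None
    return pos / (pos + neg)

print(to_int("1,291,675"))
print(positive_share("1,291,675", "217,690"))
```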

We successfully scraped all the data we needed! Now we just need to automate the process for the first 200 pages of the Steam chart (25 games per page, 5000 games in total).

WARNING: the following loop will take a while to complete.

In [None]:
# Scrape all the data for the first 200 pages

all_titles = []
all_cp = []
all_pp = []
all_th = []
all_genres = []
all_prices = []
all_release_dates = []
all_devs = []
all_pubs = []
all_pos_revs = []
all_neg_revs = []

for i in range(1, 201):
  url = "https://steamcharts.com/top/p." + str(i)
  req = requests.get(url).text
  steamchart = bs(req, "html.parser")
  body = steamchart.find("tbody")
  games = body.find_all("tr")

  urls_list = get_store_urls()

  for g in range(len(games)):
    url_store = urls_list[g]

    # Chart data: call the function once per row and unpack the four values.
    title, current_players, peak_players, total_hours = get_title_and_players_data()
    all_titles.append(title)
    all_cp.append(current_players)
    all_pp.append(peak_players)
    all_th.append(total_hours)
    
    try: 
        # A "missing value" url makes requests.get raise, which lands in the except branch.
        req_store = requests.get(url_store).text
        store_page = bs(req_store, "html.parser")
        infotab = store_page.find(class_ = "glance_ctn_responsive_left")
        print(store_page.find(class_ = "apphub_AppName").string)
        
        # Store data: call each function once.
        dev, pub = get_company()
        pos_revs, neg_revs = get_reviews()
        all_genres.append(get_game_genre())
        all_prices.append(get_game_price())
        all_release_dates.append(get_date_of_release())
        all_devs.append(dev)
        all_pubs.append(pub)
        all_pos_revs.append(pos_revs)
        all_neg_revs.append(neg_revs)
        
    except Exception:
        print("missing value")
        all_genres.append("missing value")
        all_prices.append("missing value")
        all_release_dates.append("missing value")
        all_devs.append("missing value")
        all_pubs.append("missing value")
        all_pos_revs.append("missing value")
        all_neg_revs.append("missing value")

  print("Page " + str(i) + " completed\n")
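
Because the loop sends several thousand requests, a single temporary network failure on a chart page would stop it. One possible mitigation, shown here only as a sketch, is a small retry wrapper with an increasing pause between attempts. `fetch_with_retries` is a hypothetical helper; the `fetch` argument is any function that takes a URL and returns the page text.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception with a growing pause.

    The pause before retry number n is delay * n seconds. The last error is
    re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(delay * (attempt + 1))
    raise last_error
```

In the loop above it could be used, for example, as `fetch_with_retries(lambda u: requests.get(u, timeout=10).text, url)`; the `sleep` parameter exists only so that the pause can be replaced in tests.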

Done! All we need to do now is store the data into a dataframe.

In [None]:
I = list(range(1, len(all_titles) + 1))   # one index per scraped game (5000 if every page returned 25 rows)

data = {"Index": I, "Titles": all_titles, "Current Players": all_cp, "Peak Players": all_pp, "Total Hours": all_th, "Price": all_prices, "Genre": all_genres, "Release date": all_release_dates, "Developer": all_devs, "Publisher": all_pubs, "Positive Reviews": all_pos_revs, "Negative Reviews": all_neg_revs}
df = pd.DataFrame(data)
df.set_index("Index", inplace = True)
df

Index,Titles,Current Players,Peak Players,Total Hours,Price,Genre,Release date,Developer,Publisher,Positive Reviews,Negative Reviews
1,Counter-Strike: Global Offensive,1075054,1320219,558998360,Free to Play,FPS; Shooter; Multiplayer,21 Aug 2012,Valve,Hidden Path Entertainment,6123686,804911
2,Dota 2,623139,676653,295245771,Free to Play,Free to Play; MOBA; Multiplayer,9 Jul 2013,Valve,Valve,1595197,332805
3,PUBG: BATTLEGROUNDS,288743,452239,152986898,Free to Play,Survival; Shooter; Battle Royale,21 Dec 2017,"KRAFTON, Inc.","KRAFTON, Inc.",1219342,919837
4,Apex Legends,277684,441067,155374796,Free to Play,Free to Play; Multiplayer; Battle Royale,4 Nov 2020,Respawn Entertainment,Electronic Arts,492111,97196
5,Lost Ark,221578,223820,101661911,Free To Play,MMORPG; Free to Play; Action RPG,11 Feb 2022,Smilegate RPG,Amazon Games,136039,53143
...,...,...,...,...,...,...,...,...,...,...,...
4996,Imperiums: Greek Wars,16,27,6603,"27,58€",4X; Grand Strategy; Turn-Based Strategy,30 Jul 2020,Kube Games,Kube Games,331,62
4997,Monster Energy Supercross - The Official Video...,16,56,9715,"19,99€",Simulation; Sports; Motorbike,13 Feb 2018,Milestone S.r.l.,Milestone S.r.l.,662,145
4998,Poker Club,16,22,5481,"19,99€",Simulation; Sports; Indie,20 Nov 2020,Ripstone,Ripstone,174,137
4999,Risk,16,25,8780,missing value,Strategy; Casual; Board Game,missing value,"Sperasoft, Inc.",PopCap,159,141


Some of the values are missing, but that was expected. In the Excel file we'll assess how many there are.
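
The number of missing values can also be counted directly in pandas before exporting. This is a minimal sketch on a small made-up dataframe (the three rows below are invented for illustration); on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Tiny invented sample that mimics the "missing value" placeholder used above.
sample = pd.DataFrame({
    "Titles": ["Game A", "Game B", "Game C"],
    "Price": ["19,99€", "missing value", "Free to Play"],
    "Release date": ["missing value", "missing value", "20 Nov 2020"],
})

# Compare every cell with the placeholder and sum the matches per column.
missing_per_column = (sample == "missing value").sum()
print(missing_per_column)
```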

Export the dataframe to csv:

In [None]:
df.to_csv("F://Utente/Desktop/steamgames.csv") 

If we are working in Google Colab, the following lines export the file and download it instead.

In [None]:
from google.colab import files
df.to_csv('steamgames.csv', encoding = 'utf-8-sig') 
files.download('steamgames.csv')

Finished! Check out the Excel file for the final results ;)