# QuadStore project - Data collection

- Many **key features** of the project require *realistic* game data in order to function correctly.
Some of these features are:
    - Games display.
    - Filtering system (by genre, etc.).
    - Recommendation system.
- As a result, we will collect data from **Steam**, a popular website for purchasing games.
---
**Disclaimer**: 

No private data is collected, and none of the pages we access are disallowed by Steam's robots.txt.

The_Quad team **guarantees** that under no circumstances will this data be used for *commercial or unlawful* purposes.
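
A minimal sketch of how a URL could be checked against a robots.txt policy before scraping, using the standard library's `urllib.robotparser`. The robots.txt content and the `example.com` URLs below are made-up placeholders, not Steam's actual rules, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of downloading a real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/category/action"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
```

For a live site, the same parser can download the file itself through `set_url()` and `read()`.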

## Collection Method

- We will use simple web scraping tools, such as **Selenium** and **requests**, to collect game data.

## Metadata

- Our data will have **9 features**:
    - **Title**: title of the game.
    - **Release date**: release date of the game.
    - **Categories**: the main categories that the game belongs to.
    - **Sub-Categories**: all sub-categories that the game belongs to.
    - **Price**: Retail price of the game.
    - **Img url**: Url for the thumbnail image of the game.
    - **Description**: short description of the game. *(currently unavailable)*
    - **Rating**: Rating of the game.
    - **Reviews**: Number of reviews made about the game.

### 0. Import necessary libraries

In [None]:
import requests
from multiprocessing.dummy import Pool
import pandas as pd
import json

### 1. Extract categories and sub-categories 

In [None]:
#Collect categories and sub-categories list
with open('./categories.json', 'r') as rstream:
    CATEGORIES = json.load(rstream)
with open('./sub_categories.json', 'r') as rstream:
    SUB_CATEGORIES = json.load(rstream)

In [None]:
CATEGORIES

In [None]:
cat_list = CATEGORIES.keys()
sub_cat_list = SUB_CATEGORIES.keys()

In [None]:
cat_list
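
The later cells iterate over `CATEGORIES.items()` as `(category, url)` pairs, so `categories.json` presumably maps each category name to its listing page. The sketch below shows that assumed shape with placeholder entries (the real file is not shown in this notebook) and converts the names to a `set` for fast membership tests.

```python
import json

# Placeholder content standing in for categories.json (assumed shape: name -> URL).
sample_json = """
{
    "Action": "https://example.com/category/action",
    "RPG": "https://example.com/category/rpg"
}
"""

categories = json.loads(sample_json)
cat_names = set(categories)  # names only, for quick "tag in cat_names" checks

for name, url in categories.items():
    print(name, "->", url)
```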

### 2. Selenium and Scraping

- We will use `selenium.webdriver` to handle dynamic JavaScript content.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from time import sleep

- Create function to extract necessary data

In [None]:
def extract_data(game_list, cur_titles, data: list, default_cat):
    """Append one row per unseen game in game_list to data; return (data, cur_titles)."""
    for game in game_list:
        title = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_TitleCtn_1F4bc').text
        if title in cur_titles:  # Skip games already collected from another category
            continue
        cur_titles.append(title)
        img_url = game.find_element(By.TAG_NAME, 'img').get_attribute('src')
        release_date = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_StoreSaleWidgetRelease_3eOdk').text
        try:
            price = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_StoreSalePriceBox_Wh0L8').text.strip('₫')
            if price == "Free To Play":
                price = 0
        except Exception:  # No price box found: treat the game as free
            price = 0
        rating = game.find_element(By.CSS_SELECTOR, 'a[class="gamehover_ReviewScore_24NyY ReviewScore Focusable"]').find_elements(By.TAG_NAME, 'div')[1].text
        reviews = game.find_element(By.CLASS_NAME, 'gamehover_ReviewScoreCount_1Deyv').text.strip('|')
        # str.strip() removes a set of characters, not a substring, so use replace() here
        reviews = reviews.replace('User Reviews', '').strip()

        # The category being scraped always comes first; the game's tags are then
        # sorted into known categories and sub-categories (';'-separated strings)
        categories = default_cat + ";"
        sub_categories = ""
        cats = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_StoreSaleWidgetTags_3OSJs')
        tags = cats.find_elements(By.TAG_NAME, 'a')
        for tag in tags:
            t = tag.text
            if t == default_cat:
                continue
            if t in cat_list:
                categories += t + ";"
            elif t in sub_cat_list:
                sub_categories += t + ";"

        data.append([title, release_date, categories, sub_categories, price, img_url, "dummy desc", rating, reviews])
    return data, cur_titles
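
The price and review-count labels are scraped as text. A small sketch of turning such labels into integers is shown here; `label_to_int` is a hypothetical helper, and the label formats (`'1.090.000₫'`, `'| 7,694,152 User Reviews'`) are assumptions based on the selectors and values seen in this notebook.

```python
import re


def label_to_int(text: str) -> int:
    """Keep only the digits of a label and return them as an int (0 if none)."""
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else 0


print(label_to_int("1.090.000₫"))                # price label (assumed format)
print(label_to_int("| 7,694,152 User Reviews"))  # review-count label (assumed format)
print(label_to_int("Free To Play"))              # no digits, so 0
```

This only suits currencies without a decimal part; a label such as `59.99` would need separate handling.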
        

- Prepare some necessary variables

In [None]:
urls = CATEGORIES.items()
data = []
cur_titles = []

#Initialize and run Chrome browser
options = webdriver.ChromeOptions()
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(5)

- Start scraping

In [None]:
#Start scraping
for cat,url in urls:
    print("Scraping: ",url)
    browser.get(url)
    sleep(7)
    
    game_list = browser.find_elements(By.CLASS_NAME,'salepreviewwidgets_SaleItemBrowserRow_y9MSd')
    data, cur_titles = extract_data(game_list, cur_titles, data, cat)
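
The loop above pauses for a fixed 7 seconds per page. An alternative is to poll until the content appears, which is the pattern Selenium's `WebDriverWait` (imported earlier) provides. Below is a library-agnostic sketch of that idea; `wait_until` and its parameters are hypothetical names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Demo with a fake "page" that becomes ready on the third check.
checks = []


def fake_page_ready():
    checks.append(1)
    return ["row"] if len(checks) >= 3 else []


print(wait_until(fake_page_ready, timeout=2.0, interval=0.01))
```

In the scraping loop, the condition could be a lambda wrapping the `browser.find_elements(...)` call, since that call returns an empty list while no rows are present.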

- Create dataframe and save to csv file

In [None]:
columns = ['title','release_date','categories','sub_categories','price','img_url','desc','rating','reviews_count']
df = pd.DataFrame(data,columns=columns)

In [None]:
df.to_csv('games_steam_org.csv')
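
Note that `to_csv` writes the row index by default, which comes back as an `Unnamed: 0` column when the file is read again (as seen later in this notebook). A small sketch with made-up rows, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"title": ["Game A", "Game B"], "price": [0, 158500]})  # made-up rows


def round_trip_columns(frame: pd.DataFrame, **to_csv_kwargs) -> list:
    """Write frame to an in-memory CSV, read it back, and return the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer).columns.tolist()


cols_with_index = round_trip_columns(df_demo)
cols_without_index = round_trip_columns(df_demo, index=False)
print(cols_with_index)
print(cols_without_index)
```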

### 3. Preprocessing

- We will encode all *ratings* as numerical values **(from 0 to 5).**

In [None]:
df['rating'].unique()

In [None]:
mapping = {'Overwhelmingly Positive':5,'Very Positive':4,'Mostly Positive':3,'Mixed':2,'Mostly Negative':1,'Overwhelmingly Negative':0}

df['rating'] = df['rating'].map(mapping)
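
`Series.map` with a dictionary turns any label that is not a key into `NaN`. The sketch below uses made-up labels to show one way of listing such unmapped labels before they are overwritten; `'Positive'` stands in for any label missing from the mapping.

```python
import pandas as pd

mapping = {'Overwhelmingly Positive': 5, 'Very Positive': 4, 'Mostly Positive': 3,
           'Mixed': 2, 'Mostly Negative': 1, 'Overwhelmingly Negative': 0}

labels = pd.Series(['Very Positive', 'Mixed', 'Positive'])  # made-up sample
encoded = labels.map(mapping)

# Labels that had no entry in the mapping end up as NaN after map().
unmapped = labels[encoded.isna()].unique().tolist()
print(encoded.tolist())
print(unmapped)
```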

In [None]:
df.head()

In [None]:
df.to_csv('games_steam_processed.csv')

## (04/12/2023) Recollecting Data
This section collects the extra features needed on top of the current data.

In [1]:
import requests
from multiprocessing.dummy import Pool
import pandas as pd
import json

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from time import sleep

In [2]:
df = pd.read_csv('../data/processed_data.csv')
df.head()

Unnamed: 0,title,release_date,categories,sub_categories,price,img_url,desc,rating,reviews_count
0,COUNTER-STRIKE 2,"AUG 22, 2012",Action;,Multiplayer;,0.0,https://cdn.cloudflare.steamstatic.com/steam/a...,dummy desc,4,7694152
1,EA SPORTS FC™ 24,"SEP 28, 2023",Action;,,1090000.0,https://cdn.cloudflare.steamstatic.com/steam/a...,dummy desc,2,11131
2,APEX LEGENDS™,"NOV 5, 2020",Action;,Multiplayer;,0.0,https://cdn.cloudflare.steamstatic.com/steam/a...,dummy desc,3,732521
3,CALL OF DUTY®,"OCT 28, 2022",Action;,Multiplayer;Singleplayer;,0.0,https://cdn.cloudflare.steamstatic.com/steam/a...,dummy desc,2,248663
4,BATTLEFIELD™ 2042,"NOV 19, 2021",Action;,Military;Multiplayer;,158500.0,https://cdn.cloudflare.steamstatic.com/steam/a...,dummy desc,2,164144


### Extra features to be collected
- long/short description
- img_urls
- videos urls
- developer

In [106]:
url = 'https://store.steampowered.com/search/?term='
#search_result_row ds_collapse_flag  app_impression_tracked
dev = []
short_desc = []
long_desc = []
imgs = []
vids = []

options = webdriver.ChromeOptions()
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(3)

for name in df.title:
    name = name.replace(' ','+')
    browser.get(url+name)
    sleep(2)
    
    link = browser.find_element(By.ID, 'search_resultsRows')
    link = link.find_element(By.TAG_NAME, 'a')
    link = link.get_attribute('href')
    browser.get(link)
    sleep(2)
    
    try:
        dev_val = browser.find_element(By.ID, 'developers_list')
    except: #Age restricted
        dev.append(None)
        short_desc.append(None)
        long_desc.append(None)
        imgs.append(None)
        vids.append(None)
        continue
    dev.append(dev_val.text)
    
    try:
        short_desc_val = browser.find_element(By.CLASS_NAME, 'game_description_snippet')
        short_desc.append(short_desc_val.text)
    except:
        short_desc.append(None)
    
    try:
        long_desc_val = browser.find_element(By.CLASS_NAME, 'game_area_description')
        long_desc.append(long_desc_val.text)
    except:
        long_desc.append(None)
    
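    try:
        imgs_vals = browser.find_element(By.ID, 'highlight_strip_scroll')
        imgs_vals = imgs_vals.find_elements(By.TAG_NAME, 'img')
        print(len(imgs_vals))
        ivals = ''  # Image thumbnail URLs, ';'-separated
        vvals = ''  # Video thumbnail URLs, ';'-separated
        text = 'movie'  # A src containing this substring is treated as a video
        for val in imgs_vals:
            val = val.get_attribute('src')
            if val.find(text) != -1:
                vvals += val + ';'
                continue
            ivals += val + ';'
        imgs.append(ivals)
        vids.append(vvals if vvals != '' else None)
    except Exception:
        # Append to both lists so they stay the same length as the others
        imgs.append(None)
        vids.append(None)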

23
13
10
14
42
6
19
15
33
13
56
7
15
16
44
35
9
25
12
8
12
10
5
25
26
14
14
14
27
8
13
11
18
10
24
152
12
10
95
29
56
20
11
17
19
11
10
18
9
9
16
26
13
12
21
15
18
9
12
21
21
5
8
11
10
14
19
20
10
13
21
14
21
9
4
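
Two steps of the loop above can be sketched as small standalone functions. `build_search_url` and `split_media` are hypothetical helper names; the search URL prefix is the one used above, and the thumbnail URLs in the demo are placeholders. `quote_plus` encodes spaces as `+` and also escapes characters such as `&` that a plain `replace(' ', '+')` would leave in the query.

```python
from urllib.parse import quote_plus

SEARCH_URL = 'https://store.steampowered.com/search/?term='


def build_search_url(title: str) -> str:
    """Return the search URL for a game title, with the title URL-encoded."""
    return SEARCH_URL + quote_plus(title)


def split_media(srcs, marker='movie'):
    """Split thumbnail URLs into (images, videos) strings; None when a group is empty."""
    images = [s for s in srcs if marker not in s]
    videos = [s for s in srcs if marker in s]
    return ';'.join(images) or None, ';'.join(videos) or None


print(build_search_url('COUNTER-STRIKE 2'))
print(split_media([
    'https://example.com/apps/1/ss_a.jpg',          # placeholder URLs
    'https://example.com/apps/1/movie.293x165.jpg',
    'https://example.com/apps/1/ss_b.jpg',
]))
```

Unlike the cell above, this version joins the URLs without a trailing `;`.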


Append to Dataframe

In [107]:
df['developer'] = pd.Series(dev)
df['short_desc'] = pd.Series(short_desc)
df['desc'] = pd.Series(long_desc)
df = df.rename(columns={'img_url':'banner_url'})
df['img_urls'] = pd.Series(imgs)
df['vid_urls'] = pd.Series(vids)
df.head()

Unnamed: 0,title,release_date,categories,sub_categories,price,banner_url,desc,rating,reviews_count,developer,short_desc,img_urls,vid_urls
0,COUNTER-STRIKE 2,"AUG 22, 2012",Action;,Multiplayer;,0.0,https://cdn.cloudflare.steamstatic.com/steam/a...,"ABOUT THIS GAME\nFor over two decades, Counter...",4,7694152,Valve,"For over two decades, Counter-Strike has offer...",https://cdn.akamai.steamstatic.com/steam/apps/...,https://cdn.akamai.steamstatic.com/steam/apps/...
1,EA SPORTS FC™ 24,"SEP 28, 2023",Action;,,1090000.0,https://cdn.cloudflare.steamstatic.com/steam/a...,PLAY NOW FOR A UEFA EURO 2024™ ULTIMATE TEAM™ ...,2,11131,EA Canada & EA Romania,EA SPORTS FC™ 24 welcomes you to The World’s G...,https://cdn.akamai.steamstatic.com/steam/apps/...,https://cdn.akamai.steamstatic.com/steam/apps/...
2,APEX LEGENDS™,"NOV 5, 2020",Action;,Multiplayer;,0.0,https://cdn.cloudflare.steamstatic.com/steam/a...,REVIEWS\n“The champion of Battle Royales.”\n9/...,3,732521,Respawn Entertainment,"Apex Legends is the award-winning, free-to-pla...",https://cdn.akamai.steamstatic.com/steam/apps/...,https://cdn.akamai.steamstatic.com/steam/apps/...
3,CALL OF DUTY®,"OCT 28, 2022",Action;,Multiplayer;Singleplayer;,0.0,https://cdn.cloudflare.steamstatic.com/steam/a...,,2,248663,,,,https://cdn.akamai.steamstatic.com/steam/apps/...
4,BATTLEFIELD™ 2042,"NOV 19, 2021",Action;,Military;Multiplayer;,158500.0,https://cdn.cloudflare.steamstatic.com/steam/a...,MASTER THE UNKNOWN IN BATTLEFIELD™ 2042 – SEAS...,2,164144,DICE,Master the unknown in Season 6: Dark Creations...,https://cdn.akamai.steamstatic.com/steam/apps/...,https://cdn.akamai.steamstatic.com/steam/apps/...
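
Because each feature is collected in its own list, the columns only line up if every list receives exactly one value per game. In the preview above, row 3 has a `vid_urls` value but no developer, which may be a sign that the lists drifted out of step. One possible alternative, sketched below with made-up values, is to build one record per game so that missing fields stay `None` in the right row; `FIELDS` and `new_record` are hypothetical names.

```python
FIELDS = ('developer', 'short_desc', 'desc', 'img_urls', 'vid_urls')


def new_record(**found):
    """Return a dict with every field set to None, then filled with the values found."""
    unknown = set(found) - set(FIELDS)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    record = dict.fromkeys(FIELDS)
    record.update(found)
    return record


records = [
    new_record(developer='Studio A', short_desc='A made-up game.'),  # partial page
    new_record(),                                                    # e.g. age-restricted page
]
for record in records:
    print(record)
```

A list of such dicts can be passed directly to `pd.DataFrame(records)`, giving one row per game.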


Save to csv

In [158]:
df.to_csv('../data/new_processed_data.csv', index=False)