# QuadStore project - Data collection

- Many **key features** of the project require *realistic* game data in order to function correctly.
Some of these features are:
    - Game display.
    - Filtering system (by genre, etc.).
    - Recommendation system.
- As a result, we will collect data from **Steam** - a popular website for purchasing games.
---
**Disclaimer**: 

No private data is collected. None of the data gathered is prohibited by Steam's `robots.txt`.

The_Quad team **guarantees** that under no circumstances will this data be used for *commercial or unlawful* purposes.
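
The `robots.txt` claim can be checked programmatically. A minimal sketch using the standard library's `urllib.robotparser`; the rules and URLs below are hypothetical placeholders, not Steam's actual `robots.txt`, and `is_allowed` is an illustrative helper:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Steam's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allowed = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/category/action")
blocked = is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data")
print(allowed, blocked)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the real file instead of parsing a string.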

## Collection Method

- We will use simple web scraping techniques to collect game data, with libraries such as **Selenium** and **requests**.

## Metadata

- Our data will have **9 features**:
    - **Title**: title of the game.
    - **Release date**: release date of the game.
    - **Categories**: the main categories that the game belongs to.
    - **Sub-Categories**: all sub-categories that the game belongs to.
    - **Price**: retail price of the game.
    - **Img url**: URL of the game's thumbnail image.
    - **Description**: short description of the game. *(currently unavailable)*
    - **Rating**: Rating of the game.
    - **Reviews**: Number of reviews made about the game.
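
One way to make the nine features concrete is a typed record. This is only a sketch: the `GameRecord` name and the field types are assumptions, and the field names mirror the CSV columns used later in this notebook.

```python
from dataclasses import dataclass, astuple
from typing import Union

@dataclass
class GameRecord:
    """Hypothetical container for one scraped game (one row of the dataset)."""
    title: str
    release_date: str
    categories: str          # ';'-terminated main categories
    sub_categories: str      # ';'-terminated sub-categories
    price: Union[str, int]   # scraped text, or 0 for free games
    img_url: str
    desc: str                # placeholder until descriptions are collected
    rating: str
    reviews_count: str

# Made-up example values, for illustration only.
record = GameRecord("Example Game", "1 Jan, 2020", "Action;", "Indie;",
                    "100.000", "https://example.com/thumb.jpg",
                    "dummy desc", "Very Positive", "1,234")
row = list(astuple(record))  # same column order as the list-based rows
print(len(row))
```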

### 0. Import necessary libraries

In [None]:
import requests
from multiprocessing.dummy import Pool
import pandas as pd
import json

### 1. Extract categories and sub-categories 

In [None]:
#Collect categories and sub-categories list
with open('./categories.json', 'r') as rstream:
    CATEGORIES = json.load(rstream)
with open('./sub_categories.json', 'r') as rstream:
    SUB_CATEGORIES = json.load(rstream)

In [None]:
CATEGORIES

In [None]:
cat_list = CATEGORIES.keys()
sub_cat_list = SUB_CATEGORIES.keys()

In [None]:
cat_list

### 2. Selenium and Scraping

- We will use `selenium.webdriver` to handle dynamic JavaScript content.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from time import sleep

- Create a function to extract the necessary data.

In [None]:
def extract_data(game_list, cur_titles, data: list, default_cat):
    """Append one row per unseen game in game_list to data; return (data, cur_titles)."""
    for game in game_list:
        title = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_TitleCtn_1F4bc').text
        if title in cur_titles:  # Skip games already collected from another category
            continue
        cur_titles.append(title)
        img_url = game.find_element(By.TAG_NAME, 'img').get_attribute('src')
        release_date = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_StoreSaleWidgetRelease_3eOdk').text
        try:
            price = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_StoreSalePriceBox_Wh0L8').text.strip('₫')
            if price == "Free To Play":
                price = 0
        except Exception:  # e.g. no price box on the row; a bare except would also swallow KeyboardInterrupt
            price = 0
        rating = game.find_element(By.CSS_SELECTOR, 'a[class="gamehover_ReviewScore_24NyY ReviewScore Focusable"]').find_elements(By.TAG_NAME, 'div')[1].text
        reviews = game.find_element(By.CLASS_NAME, 'gamehover_ReviewScoreCount_1Deyv').text.strip('|')
        # str.strip() removes a set of characters, not a suffix, so drop the label with replace()
        reviews = reviews.replace('User Reviews', '').strip()
        
        categories = default_cat + ";"
        sub_categories = ""
        cats = game.find_element(By.CLASS_NAME, 'salepreviewwidgets_StoreSaleWidgetTags_3OSJs')
        tags = cats.find_elements(By.TAG_NAME, 'a')
        for tag in tags:
            t = tag.text
            if t == default_cat:
                continue
            if t in cat_list:
                categories += t + ";"
            elif t in sub_cat_list:
                sub_categories += t + ";"
    
        data.append([title,release_date,categories,sub_categories,price,img_url,"dummy desc",rating,reviews])
    return data, cur_titles
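
The tag-splitting step inside `extract_data` does not need a browser, so it can be pulled into a pure function and tested on plain strings. A minimal sketch; `split_tags` and the sample names are hypothetical, not part of the original notebook:

```python
def split_tags(tags, default_cat, cat_names, sub_cat_names):
    """Split tag texts into ';'-terminated category and sub-category strings.

    Mirrors the loop in extract_data: the category being scraped always
    comes first, and tags matching neither list are dropped.
    """
    categories = default_cat + ";"
    sub_categories = ""
    for t in tags:
        if t == default_cat:
            continue  # already added as the first category
        if t in cat_names:
            categories += t + ";"
        elif t in sub_cat_names:
            sub_categories += t + ";"
    return categories, sub_categories

# Made-up names for illustration only.
cats, subs = split_tags(
    ["Action", "RPG", "Indie", "Unknown Tag"],
    default_cat="Action",
    cat_names={"Action", "RPG"},
    sub_cat_names={"Indie"},
)
print(cats, subs)
```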
        

- Prepare some necessary variables

In [None]:
urls = CATEGORIES.items()
data = []
cur_titles = []

#Initialize and run Chrome browser
options = webdriver.ChromeOptions()
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(5)

- Start scraping

In [None]:
#Start scraping
for cat,url in urls:
    print("Scraping: ",url)
    browser.get(url)
    sleep(7)
    
    game_list = browser.find_elements(By.CLASS_NAME,'salepreviewwidgets_SaleItemBrowserRow_y9MSd')
    data, cur_titles = extract_data(game_list, cur_titles, data, cat)
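
A fixed `sleep(7)` either wastes time or is too short on a slow connection. Selenium's `WebDriverWait(browser, timeout).until(...)` polls a condition instead. The standard-library sketch below shows the same polling idea without a browser; `wait_until` is a hypothetical helper, not a Selenium function:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)

# Stand-in for a page whose game rows appear on the third poll.
polls = []

def rows_loaded():
    polls.append(1)
    return ["row"] if len(polls) >= 3 else []

rows = wait_until(rows_loaded, timeout=2.0, poll=0.01)
print(rows, len(polls))
```

With Selenium, the equivalent would be along the lines of `WebDriverWait(browser, 10).until(lambda d: d.find_elements(By.CLASS_NAME, 'salepreviewwidgets_SaleItemBrowserRow_y9MSd'))`, which returns as soon as the list is non-empty. That line is untested here and is offered only as a starting point.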

- Create a DataFrame and save it to a CSV file.

In [None]:
columns = ['title','release_date','categories','sub_categories','price','img_url','desc','rating','reviews_count']
df = pd.DataFrame(data,columns=columns)

In [None]:
df.to_csv('games_steam_org.csv')
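
The raw CSV stores `price` and `reviews_count` as scraped text. A sketch of converting them to integers, assuming prices use `.` as the thousands separator (plausible for the `₫` amounts scraped above) and review counts use `,`. Both formats are assumptions, and `to_int` is a hypothetical helper:

```python
import re

def to_int(text, default=0):
    """Keep only the digits of `text` and convert them to int.

    Falls back to `default` when there are no digits (e.g. an empty string).
    Assumes separators are purely cosmetic, so this is wrong for decimal prices.
    """
    digits = re.sub(r"\D", "", str(text))
    return int(digits) if digits else default

print(to_int("250.000"))  # price text with '.' as thousands separator
print(to_int("1,234"))    # review count with ',' as thousands separator
print(to_int(0))          # free games are already stored as 0
```

Applied to the DataFrame, this would look like `df['price'] = df['price'].map(to_int)`.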

### 3. Preprocessing

- We will encode all *ratings* as numerical values **(ranging from 0 to 5).**

In [None]:
df['rating'].unique()

In [None]:
mapping = {
    'Overwhelmingly Positive': 5,
    'Very Positive': 4,
    'Mostly Positive': 3,
    'Mixed': 2,
    'Mostly Negative': 1,
    'Overwhelmingly Negative': 0,
}

df['rating'] = df['rating'].map(mapping)
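
`Series.map` with a dict turns every label missing from the dict into `NaN`, so any rating text outside the six keys above silently becomes a missing value. A plain-Python sketch of surfacing such labels before encoding; the sample labels and the `unmapped_labels` helper are illustrative assumptions:

```python
RATING_MAP = {
    'Overwhelmingly Positive': 5, 'Very Positive': 4, 'Mostly Positive': 3,
    'Mixed': 2, 'Mostly Negative': 1, 'Overwhelmingly Negative': 0,
}

def unmapped_labels(labels, mapping=RATING_MAP):
    """Return the sorted distinct labels that have no entry in mapping."""
    return sorted({label for label in labels if label not in mapping})

# Made-up sample; 'Positive' stands in for any label missing from the dict.
sample = ['Very Positive', 'Mixed', 'Positive', 'Positive']
missing = unmapped_labels(sample)
encoded = [RATING_MAP.get(label) for label in sample]  # None where unmapped
print(missing, encoded)
```

On the real data, calling `unmapped_labels(df['rating'].dropna().unique())` before the `map` step would list the labels to add to the mapping or to drop.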

In [None]:
df.head()

In [None]:
df.to_csv('games_steam_processed.csv')

## (04/12/2023) Recollecting Data
This section collects extra necessary features for the current data.