# Project 2: scraping Steam sales

This project reads the first 5 pages of Steam's discounted games listing and parses them to extract information on each game title. The parsed data is stored in a pandas DataFrame, which is then saved as a CSV file so that it is easier to digest on the executive side of things.

## Imports

In [17]:
import bs4 #beautifulsoup 4
from urllib.request import urlopen as uReq #url open module
from bs4 import BeautifulSoup as soup #beautifulsoup module
import pandas as pd
import numpy as np
import re # regex
import os
from datetime import datetime

## Read all the web pages

In [19]:
base_url = "https://store.steampowered.com/search/?specials=1&page="

# Download all 5 pages we want to scrape and dump them to a variable
uClient1 = uReq(base_url + "1")
page_html_1 = uClient1.read()
uClient2 = uReq(base_url + "2")
page_html_2 = uClient2.read()
uClient3 = uReq(base_url + "3")
page_html_3 = uClient3.read()
uClient4 = uReq(base_url + "4")
page_html_4 = uClient4.read()
uClient5 = uReq(base_url + "5")
page_html_5 = uClient5.read()
uClient1.close()
uClient2.close()
uClient3.close()
uClient4.close()
uClient5.close()
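
The five near-identical downloads above could also be written as a loop. The following is a minimal sketch, not the code this notebook ran: `page_urls` and `fetch_pages` are hypothetical helper names, and the `opener` parameter is an addition that lets the function be tried without network access.

```python
from urllib.request import urlopen

BASE_URL = "https://store.steampowered.com/search/?specials=1&page="


def page_urls(base_url, page_count):
    """Build the URLs for pages 1..page_count (hypothetical helper)."""
    return [base_url + str(page) for page in range(1, page_count + 1)]


def fetch_pages(urls, opener=urlopen):
    """Download each URL and return the raw bodies in order.

    The with-statement closes every connection, even if read() raises.
    """
    bodies = []
    for url in urls:
        with opener(url) as response:
            bodies.append(response.read())
    return bodies


# Usage (performs real network requests, so it is left commented out):
# pages_html = fetch_pages(page_urls(BASE_URL, 5))
```

Changing the page count then only means changing one argument, and no connection is left open if a read fails.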

## Convert each page to soup and pick out each game

`container` holds each game as its own component.

In [20]:
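# Make soup out of each page; the parser is named explicitly so bs4 does not have to guess one
page_1 = soup(page_html_1, "html.parser")
page_2 = soup(page_html_2, "html.parser")
page_3 = soup(page_html_3, "html.parser")
page_4 = soup(page_html_4, "html.parser")
page_5 = soup(page_html_5, "html.parser")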

# Combine the results from each page into one container
container = page_1.find_all('a', class_="search_result_row ds_collapse_flag")
container += page_2.find_all('a', class_="search_result_row ds_collapse_flag")
container += page_3.find_all('a', class_="search_result_row ds_collapse_flag")
container += page_4.find_all('a', class_="search_result_row ds_collapse_flag")
container += page_5.find_all('a', class_="search_result_row ds_collapse_flag")

# Functions for parsing data

All functions in this section are written for this specific scrape of the discounted items on Steam. Each function works primarily on a single base item, which we get from the soup container.

## Parse ratings into relevant formats

***See bottom of page for a small description of possible ratings***

rating_text_to_int takes a Steam rating text, such as "very positive", and converts it to an int.

rating_count takes in a string and returns the 4th value when the string is split, converted to an int. In this application that value is the total number of ratings a game has.

parse_ratings splits an item into the formats the 2 functions above expect and is the "main" function handling the ratings for the item.

parse_ratings information:
* rating contains a number from 1-9 based on how the game is rated
* rate_count is the number of ratings the item has in total
* Both values are 0 if the game has no rating

In [43]:
# Convert a specific text to int
def rating_text_to_int(text):
    if (text == "overwhelmingly positive"):
        return 9
    elif(text == "very positive"):
        return 8
    elif(text == "positive"):
        return 7
    elif(text == "mostly positive"):
        return 6
    elif(text == "mixed"):
        return 5
    elif(text == "mostly negative"):
        return 4
    elif(text == "negative"):
        return 3
    elif(text == "very negative"):
        return 2
    elif(text == "overwhelmingly negative"):
        return 1
    else:
        return 0

# Grab the count from an item
def rating_count(item):
    # The 4th whitespace-separated word holds the total number of ratings
    words = item.split()
    number = int(words[3].replace(',', ''))
    return number

# Converts the tooltip text in a span to a rating (as an int) and the count of ratings
def parse_ratings(item):
    # We find the review section in the item
    span = item.find('span', class_="search_review_summary")
    # Pick out rating from the tooltip (very positive etc.)
    try:
        content = span['data-tooltip-html']
        split = content.split('<br>')
        rating = rating_text_to_int(split[0].lower())
        rate_count = rating_count(split[1])
    except TypeError:
        # No review span was found, so the game has no rating; use ints to match the other rows
        rating = 0
        rate_count = 0
    return rating, rate_count
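
To make the two splits concrete, here is a small sketch that runs on a hand-written tooltip string. The sample text is an assumption about what `data-tooltip-html` looks like, not captured data, and `RATING_SCORES` and `parse_tooltip` are hypothetical names. A dictionary lookup stands in for the if/elif chain.

```python
# Same mapping as rating_text_to_int, expressed as a dictionary lookup.
RATING_SCORES = {
    "overwhelmingly positive": 9,
    "very positive": 8,
    "positive": 7,
    "mostly positive": 6,
    "mixed": 5,
    "mostly negative": 4,
    "negative": 3,
    "very negative": 2,
    "overwhelmingly negative": 1,
}


def parse_tooltip(tooltip):
    """Return (rating, rate_count) from a tooltip string, or (0, 0) if unusable."""
    parts = tooltip.split("<br>")
    if len(parts) < 2:
        return 0, 0
    words = parts[1].split()
    # The 4th word is assumed to be the total review count, e.g. "20,700".
    if len(words) < 4:
        return 0, 0
    try:
        count = int(words[3].replace(",", ""))
    except ValueError:
        return 0, 0
    rating = RATING_SCORES.get(parts[0].strip().lower(), 0)
    return rating, count


# Assumed example of a tooltip; not real scraped data.
sample = "Very Positive<br>87% of the 20,700 user reviews for this game are positive."
print(parse_tooltip(sample))  # (8, 20700)
```

Returning `(0, 0)` for anything unexpected keeps the "no rating" convention described above.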

## Get the discounted price

The price text in the item has a format that lets us split on the € sign and strip the unwanted characters to get the discounted price.

In [22]:
def parse_sale_price(item):
    text = item.find('div', class_="col search_price discounted responsive_secondrow").text
    price = text.split('€')[1].replace('\n', '')
    return price
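
As an illustration of the split, the sketch below works on a hand-written string. The layout "retail price, €, sale price, €" is an assumption inferred from taking index 1 after splitting on €; `split_prices` and `to_cents` are hypothetical helpers, not part of the notebook.

```python
def split_prices(text):
    """Split a price cell into (retail, sale) strings.

    Assumes the layout "<retail>€<sale>€", which is what splitting on the
    euro sign and taking index 1 for the sale price implies.
    """
    parts = [part.strip() for part in text.split("€")]
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"unexpected price text: {text!r}")
    return parts[0], parts[1]


def to_cents(price):
    """Convert a comma-decimal price such as "20,39" to integer cents."""
    euros, _, cents = price.partition(",")
    return int(euros) * 100 + int(cents.ljust(2, "0")[:2])


# Assumed example of the cell text; not real scraped data.
sample = "\n59,99€20,39€\n"
retail, sale = split_prices(sample)
print(retail, sale, to_cents(sale))  # 59,99 20,39 2039
```

Storing prices as integer cents is one way to avoid comma-versus-dot decimal issues when the CSV is analysed later.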

## Get the discount percent from an item

We convert the percent to an int after we have replaced all unnecessary values in the string.

In [23]:
def parse_sale_percentage(item):
    text = item.find('div', class_="col search_discount responsive_secondrow").text
    text = text.replace('\n', '').replace('%', '')
    # We return the discount % as a negative int, e.g. -67
    return int(text)
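
One caveat: `int("")` raises a `ValueError`, so an item whose discount cell is empty would stop the loop. A minimal sketch of a guarded conversion follows, assuming the cell text looks like `"\n-67%\n"`; `discount_percent` is a hypothetical name.

```python
def discount_percent(text):
    """Return the discount as a negative int (e.g. -67), or None if absent."""
    cleaned = text.strip().replace("%", "")
    if not cleaned:
        return None
    return int(cleaned)


print(discount_percent("\n-67%\n"))  # -67
print(discount_percent("\n"))        # None
```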

## Get the year from an item

The year can be found in the last 4 characters of the string.

In [24]:
def parse_year(item):
    text = item.find('div', class_="col search_released responsive_secondrow").text
    return text[-4:]
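
Slicing the last 4 characters returns whatever is there, so an empty release cell gives `''` and a text such as "Coming soon" gives `'soon'`. The sketch below is a stricter variant that only accepts a trailing 4-digit number; `release_year` is a hypothetical name and the sample strings are assumptions.

```python
import re


def release_year(text):
    """Return the trailing 4-digit year as an int, or None if there is none."""
    match = re.search(r"(\d{4})\s*$", text)
    return int(match.group(1)) if match else None


print(release_year("28 Sep, 2017"))  # 2017
print(release_year("Coming soon"))   # None
```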

## Define our pandas DataFrame

We define an empty DataFrame to store our parsed data. It has 11 columns.

In [44]:
column_names = ['Name', 'Rating', 'Rating_Count', 'Discount', 'Discount_price', 'Retail_price', 'Release_year', 'Win', 'Linux', 'Mac', 'Time']
df = pd.DataFrame(columns=column_names)
df.head()

| | Name | Rating | Rating_Count | Discount | Discount_price | Retail_price | Release_year | Win | Linux | Mac | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|


## Convert DataFrame types to what they should be

Many of our columns will hold integers, so we set those columns to the int data type up front.

In [45]:
df.Rating = df.Rating.astype(int)
df.Rating_Count = df.Rating_Count.astype(int)
df.Discount = df.Discount.astype(int)
df.Win = df.Win.astype(int)
df.Linux = df.Linux.astype(int)
df.Mac = df.Mac.astype(int)
df.dtypes

Name              object
Rating             int32
Rating_Count       int32
Discount           int32
Discount_price    object
Retail_price      object
Release_year      object
Win                int32
Linux              int32
Mac                int32
Time              object
dtype: object

# The actual application

The application is a simple loop over each item in the container. Each item is an HTML `a` tag and contains all the data we want to pick out, except for the time, which we generate ourselves.

The time is set outside the loop because we want every row from the same scrape to carry the same timestamp. This makes grouping and sorting more accurate when analysing the data.
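
The shared-timestamp idea can be sketched without pandas. In this minimal example `build_rows` is a hypothetical helper and the names are placeholder input; the point is only that the timestamp is taken once, before the rows are built.

```python
from datetime import datetime


def build_rows(names, scraped_at=None):
    """Attach one shared timestamp to every row of a scrape."""
    if scraped_at is None:
        # Taken once, so every row of this scrape gets the identical value.
        scraped_at = datetime.now()
    return [{"Name": name, "Time": scraped_at} for name in names]


rows = build_rows(["Stellaris", "Arma 3"])
print(rows[0]["Time"] == rows[1]["Time"])  # True
```

Calling `datetime.now()` inside the loop instead would give each row a slightly different value, so grouping by `Time` would no longer yield one group per scrape. A list of dicts like `rows` can also be passed to `pd.DataFrame(rows)` to build the frame in one step.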

## Loop through each item and append it to the DataFrame

We loop through each item and use the functions defined above to extract the values we want.

In [46]:
# Time is defined separately to have the same time for each item per scrape
time = datetime.now()
i = 0
for item in container:
    # Parse name
    name = item.find('span', class_="title").text
    # Parse rating & review count, and print them as a progress check
    rating, rate_count = parse_ratings(item)
    print(rating, rate_count)
    # Sale %
    sale_percentage = parse_sale_percentage(item)
    # Sale price
    sale_price = parse_sale_price(item)
    # Ordinary price
    ordinary_price = item.find('strike').text.replace('€', '')
    # Release year
    release_year = parse_year(item)
    # Win
    win = 1 if bool(item.find_all('span', class_="platform_img win")) else 0
    # Linux
    linux = 1 if bool(item.find_all('span', class_="platform_img linux")) else 0
    # Mac
    mac = 1 if bool(item.find_all('span', class_="platform_img mac")) else 0
    # Append the row; time is the shared timestamp defined before the loop
    df.loc[i] = name, rating, rate_count, sale_percentage, sale_price, ordinary_price, release_year, win, linux, mac, time
    i += 1

8 20700
8 61835
8 37574
6 39023
8 41675
6 20649
8 65909
8 89827
8 39778
5 43173
6 24282
6 22456
8 29032
6 27583
5 9633
6 4042
8 44019
6 5685
8 96186
5 49706
6 8065
8 1703
8 35248
8 45315
5 4243
8 103
5 2228
6 1352
8 202
8 63073
7 41
6 1176
8 2127
5 171492
9 8648
8 167174
6 1376
8 6139
8 30243
6 1931
6 2251
8 30035
8 5003
8 158
6 41183
6 3504
8 5191
6 29566
8 4057
6 9472
5 5542
8 6481
8 8470
9 2583
8 5653
5 3509
6 1463
8 15396
6 4721
8 9955
6 3228
8 51775
8 25106
5 8189
6 4896
8 40732
8 41798
9 40473
6 258
8 1420
8 555
8 1721
6 2570
8 16389
8 5285
9 744
8 93368
8 5989
9 11209
8 1347
9 4961
6 2283
8 1995
8 1963
8 5565
8 43211
8 11570
5 172
8 7227
8 11302
8 6463
5 18587
5 2244
5 805
5 183
8 14871
8 2756
6 5102
6 61
8 12944
5 220
7 16
5 297
8 7165
8 748
8 576
8 3716
0 0
6 630
6 356
9 37503
8 12567
8 16829
6 1815
5 448
8 4320
8 42921
6 3385
8 5452
6 3438
8 3360
8 146
6 2365
6 4538
8 4258


## Describe the DataFrame we have just filled

This section gives a quick look at the DataFrame we just generated, now that it has data in it.

In [28]:
df.head()

| | Name | Rating | Rating_Count | Discount | Discount_price | Retail_price | Release_year | Win | Linux | Mac | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Total War: WARHAMMER II | 8 | 20700 | -66 | 2039 | 5999 | 2017.0 | 1 | 1 | 1 | 2019-10-20 15:03:47.039044 |
| 1 | Cities: Skylines | 8 | 61835 | -75 | 699 | 2799 | 2015.0 | 1 | 1 | 1 | 2019-10-20 15:03:47.039044 |
| 2 | Hearts of Iron IV | 8 | 37574 | -60 | 1599 | 3999 | 2016.0 | 1 | 1 | 1 | 2019-10-20 15:03:47.039044 |
| 3 | Stellaris | 6 | 39023 | -75 | 999 | 3999 | 2016.0 | 1 | 1 | 1 | 2019-10-20 15:03:47.039044 |
| 4 | Hearts of Iron IV: Mobilization Pack | 8 | 41675 | -56 | 4666 | 10495 | | 1 | 1 | 1 | 2019-10-20 15:03:47.039044 |


In [29]:
# describe is a method and must be called; without the parentheses
# the notebook only displays the bound method instead of summary statistics
df.describe()

## Save the DataFrame to a CSV file

The file is named discounts.csv and is saved in the current working directory, which is normally the same directory as the script. If the file does not exist yet, it is created with the headers included; if it already exists, the new data is appended without repeating the headers.

In [47]:
file_name = 'discounts.csv'
if os.path.exists(file_name):
    df.to_csv(file_name, mode='a', header=False, index=False)
else:
    df.to_csv(file_name, index=False)
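
The same "header only for a new file" rule can be sketched with the standard-library `csv` module instead of pandas. `append_rows` is a hypothetical helper, and the extra check for an empty file is an addition that the cell above does not make.

```python
import csv
import os


def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV, writing the header only for a new file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Usage (writes to disk, so it is left commented out):
# append_rows("discounts.csv", ["Name", "Discount"], [{"Name": "Stellaris", "Discount": -75}])
```

Either way, appending only works as long as the column order stays the same between runs, because later rows are written without headers.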

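## Rating formats on Steam

There are 10 possibilities for ratings on Steam. The number in parentheses is the value rating_text_to_int assigns to each:

* overwhelmingly positive (9)
* very positive (8)
* positive (7)
* mostly positive (6)
* mixed (5)
* mostly negative (4)
* negative (3)
* very negative (2)
* overwhelmingly negative (1)
* no rating (0)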
