# Data Pre-Processing

**The original dataset can be found on Kaggle**

**URL = https://www.kaggle.com/datasets/trolukovich/steam-games-complete-dataset?resource=download**

**It is licensed under "CC0: Public Domain", which allows me to use the dataset.**

# 1. Loading the dataset
**Starting with importing libraries and reading the dataset.**

In [356]:
import re
import random
import warnings

import pandas as pd

from collections import Counter

warnings.filterwarnings("ignore")

In [357]:
dataframe = pd.read_csv("Datasets/steam_games.csv")
dataframe.shape

(40833, 20)

In [358]:
dataframe.head()

Unnamed: 0,url,types,name,desc_snippet,recent_reviews,all_reviews,release_date,developer,publisher,popular_tags,game_details,languages,achievements,genre,game_description,mature_content,minimum_requirements,recommended_requirements,original_price,discount_price
0,https://store.steampowered.com/app/379720/DOOM/,app,DOOM,Now includes all three premium DLC packs (Unto...,"Very Positive,(554),- 89% of the 554 user revi...","Very Positive,(42,550),- 92% of the 42,550 use...","May 12, 2016",id Software,"Bethesda Softworks,Bethesda Softworks","FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...","English,French,Italian,German,Spanish - Spain,...",54.0,Action,"About This Game Developed by id software, the...",,"Minimum:,OS:,Windows 7/8.1/10 (64-bit versions...","Recommended:,OS:,Windows 7/8.1/10 (64-bit vers...",$19.99,$14.99
1,https://store.steampowered.com/app/578080/PLAY...,app,PLAYERUNKNOWN'S BATTLEGROUNDS,PLAYERUNKNOWN'S BATTLEGROUNDS is a battle roya...,"Mixed,(6,214),- 49% of the 6,214 user reviews ...","Mixed,(836,608),- 49% of the 836,608 user revi...","Dec 21, 2017",PUBG Corporation,"PUBG Corporation,PUBG Corporation","Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","English,Korean,Simplified Chinese,French,Germa...",37.0,"Action,Adventure,Massively Multiplayer",About This Game PLAYERUNKNOWN'S BATTLEGROUND...,Mature Content Description The developers de...,"Minimum:,Requires a 64-bit processor and opera...","Recommended:,Requires a 64-bit processor and o...",$29.99,
2,https://store.steampowered.com/app/637090/BATT...,app,BATTLETECH,Take command of your own mercenary outfit of '...,"Mixed,(166),- 54% of the 166 user reviews in t...","Mostly Positive,(7,030),- 71% of the 7,030 use...","Apr 24, 2018",Harebrained Schemes,"Paradox Interactive,Paradox Interactive","Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","English,French,German,Russian",128.0,"Action,Adventure,Strategy",About This Game From original BATTLETECH/Mec...,,"Minimum:,Requires a 64-bit processor and opera...","Recommended:,Requires a 64-bit processor and o...",$39.99,
3,https://store.steampowered.com/app/221100/DayZ/,app,DayZ,The post-soviet country of Chernarus is struck...,"Mixed,(932),- 57% of the 932 user reviews in t...","Mixed,(167,115),- 61% of the 167,115 user revi...","Dec 13, 2018",Bohemia Interactive,"Bohemia Interactive,Bohemia Interactive","Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","English,French,Italian,German,Spanish - Spain,...",,"Action,Adventure,Massively Multiplayer",About This Game The post-soviet country of Ch...,,"Minimum:,OS:,Windows 7/8.1 64-bit,Processor:,I...","Recommended:,OS:,Windows 10 64-bit,Processor:,...",$44.99,
4,https://store.steampowered.com/app/8500/EVE_On...,app,EVE Online,EVE Online is a community-driven spaceship MMO...,"Mixed,(287),- 54% of the 287 user reviews in t...","Mostly Positive,(11,481),- 74% of the 11,481 u...","May 6, 2003",CCP,"CCP,CCP","Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","English,German,Russian,French",,"Action,Free to Play,Massively Multiplayer,RPG,...",About This Game,,"Minimum:,OS:,Windows 7,Processor:,Intel Dual C...","Recommended:,OS:,Windows 10,Processor:,Intel i...",Free,


# 2. Cleaning the dataset
**I now have an idea of how the data looks. The next step is to check each column for missing values and drop the rows that contain them.**
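Before going column by column, the missing values can also be counted for every column in a single call. This is a minimal sketch on a toy frame; the `toy` data is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame(
    {
        "name": ["Alpha", None, "Gamma"],
        "developer": ["Studio A", "Studio B", None],
        "genre": ["Action", "RPG", "Indie"],
    }
)

# isna() marks the missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same call on `dataframe` would give an overview of which columns need attention before any rows are dropped.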

**1. Column "name"**

In [359]:
dataframe.loc[dataframe["name"].isna()].index

Index([  704,  4847,  6381,  7869,  9615,  9616,  9956, 12146, 12879, 23099,
       28380, 31321, 34989, 34991, 35169, 39575],
      dtype='int64')

In [360]:
dataframe = dataframe.drop(dataframe.loc[dataframe["name"].isna()].index)
dataframe.loc[dataframe["name"].isna()].index

Index([], dtype='int64')

In [361]:
dataframe.shape

(40817, 20)

**I have 16 fewer rows now.**

**2. Irrelevant columns**

**The next step is to drop the columns that provide no usable data. The goal is to predict the popularity of a new game, and most of this data is absent upon release.**

In [362]:
dataframe_clean = dataframe.drop(["url", "types", "desc_snippet", "recent_reviews", "achievements",
                                      "game_description", "mature_content", "minimum_requirements",
                                     "recommended_requirements", "original_price", "discount_price"], axis=1)
dataframe_clean.head()

Unnamed: 0,name,all_reviews,release_date,developer,publisher,popular_tags,game_details,languages,genre
0,DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...","May 12, 2016",id Software,"Bethesda Softworks,Bethesda Softworks","FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...","English,French,Italian,German,Spanish - Spain,...",Action
1,PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...","Dec 21, 2017",PUBG Corporation,"PUBG Corporation,PUBG Corporation","Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","English,Korean,Simplified Chinese,French,Germa...","Action,Adventure,Massively Multiplayer"
2,BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...","Apr 24, 2018",Harebrained Schemes,"Paradox Interactive,Paradox Interactive","Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","English,French,German,Russian","Action,Adventure,Strategy"
3,DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...","Dec 13, 2018",Bohemia Interactive,"Bohemia Interactive,Bohemia Interactive","Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","English,French,Italian,German,Spanish - Spain,...","Action,Adventure,Massively Multiplayer"
4,EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...","May 6, 2003",CCP,"CCP,CCP","Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","English,German,Russian,French","Action,Free to Play,Massively Multiplayer,RPG,..."


**4. Column "release_date"**

In [363]:
dataframe_clean.loc[dataframe_clean["release_date"].isna()].index

Index([    5,    15,    25,    39,    44,    57,    63,    66,    68,    74,
       ...
       40239, 40281, 40315, 40326, 40339, 40401, 40454, 40536, 40651, 40701],
      dtype='int64', length=3164)

In [364]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["release_date"].isna()].index)
dataframe_clean.loc[dataframe_clean["release_date"].isna()].index

Index([], dtype='int64')

In [365]:
dataframe_clean.shape

(37653, 9)

**3,164 fewer rows (40,817 down to 37,653). A noticeable loss, about 7.8% of the rows, but most of the data is left.**

**5. Column "developer"**

In [366]:
dataframe_clean.loc[dataframe_clean["developer"].isna()].index

Index([  146,  1101,  1854,  2517,  2712,  3471,  3989,  4722,  4854,  4933,
       ...
       39457, 39458, 39679, 39680, 39706, 40133, 40234, 40248, 40266, 40785],
      dtype='int64', length=275)

In [367]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["developer"].isna()].index)
dataframe_clean.loc[dataframe_clean["developer"].isna()].index

Index([], dtype='int64')

In [368]:
dataframe_clean.shape

(37378, 9)

**275 fewer rows. Insignificant compared with the amount of data left.**

**6. Column "publisher"**

In [369]:
dataframe_clean.loc[dataframe_clean["publisher"].isna()].index

Index([   45,  1061,  1899,  1979,  2422,  2574,  2645,  2711,  2854,  2941,
       ...
       40812, 40814, 40817, 40819, 40820, 40821, 40827, 40828, 40829, 40830],
      dtype='int64', length=4726)

In [370]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["publisher"].isna()].index)
dataframe_clean.loc[dataframe_clean["publisher"].isna()].index

Index([], dtype='int64')

In [371]:
dataframe_clean.shape

(32652, 9)

**4,726 fewer rows. A significant loss, but there is no way to fill in the missing data.**

**7. Column "popular_tags"**

In [372]:
dataframe_clean.loc[dataframe_clean["popular_tags"].isna()].index

Index([ 1433,  6027, 10649, 10664, 11206, 11377, 11379, 11814, 11940, 11942,
       12508, 12663, 13206, 13213, 13967, 14104, 14702, 14763, 15580, 15635,
       15637, 15740, 16082, 16139, 16583, 17437, 18580, 19396, 19996, 20268,
       21321, 21642, 22442, 23009, 23050, 23414, 23562, 24129, 25036, 27069,
       27845, 28949, 31218, 31227, 31615, 31754, 32041, 32045, 32633, 32673,
       34091, 34092, 34271, 34357, 34608, 34969, 35132, 35159, 35160, 35161,
       35384, 35527, 36381, 37251, 37393, 37394, 38333, 38835, 40729],
      dtype='int64')

In [373]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["popular_tags"].isna()].index)
dataframe_clean.loc[dataframe_clean["popular_tags"].isna()].index

Index([], dtype='int64')

In [374]:
dataframe_clean.shape

(32583, 9)

**69 fewer rows. Insignificant compared with the amount of data left.**

**8. Column "game_details"**

In [375]:
dataframe_clean.loc[dataframe_clean["game_details"].isna()].index

Index([  478,   663,   727,   859,   949,  1007,  1052,  1114,  1153,  1305,
       ...
       36284, 36685, 36701, 36821, 36874, 37314, 37456, 37489, 39721, 40693],
      dtype='int64', length=290)

In [376]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["game_details"].isna()].index)
dataframe_clean.loc[dataframe_clean["game_details"].isna()].index

Index([], dtype='int64')

In [377]:
dataframe_clean.shape

(32293, 9)

**290 fewer rows. Insignificant compared with the amount of data left.**

**9. Column "languages"**

In [378]:
dataframe_clean.loc[dataframe_clean["languages"].isna()].index

Index([], dtype='int64')

**No missing data.**

**10. Column "genre"**

In [379]:
dataframe_clean.loc[dataframe_clean["genre"].isna()].index

Index([  211,   528,  1613,  2254,  2603,  3639,  4018,  5211,  6127,  6743,
        6990,  7138,  7510,  7836,  8239,  8780, 10677, 11948, 12388, 12987,
       13857, 14871, 14879, 14880, 14884, 15119, 16370, 18200, 18268, 19055,
       21998, 22570, 23695, 23696, 24331, 24884, 26416, 27277, 27402, 29379,
       30928, 39098],
      dtype='int64')

In [380]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["genre"].isna()].index)
dataframe_clean.loc[dataframe_clean["genre"].isna()].index

Index([], dtype='int64')

In [381]:
dataframe_clean.shape

(32251, 9)

**42 fewer rows. Insignificant compared with the amount of data left.**
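The repeated find-and-drop pattern used for each column can be expressed more compactly with `dropna(subset=...)`. The sketch below runs on a toy frame; the `required` list is a hypothetical choice of columns, not the exact list used in this notebook.

```python
import pandas as pd

# Toy frame with gaps in two columns (values are made up).
toy = pd.DataFrame(
    {
        "name": ["Alpha", "Beta", "Gamma", "Delta"],
        "developer": ["Studio A", None, "Studio C", "Studio D"],
        "genre": ["Action", "RPG", None, "Indie"],
    }
)

required = ["developer", "genre"]  # hypothetical list of columns that must be present

rows_before = len(toy)
# Drop every row that has a missing value in any of the required columns.
cleaned = toy.dropna(subset=required)
rows_dropped = rows_before - len(cleaned)
print(f"{rows_dropped} rows dropped, {len(cleaned)} left")
```

Computing the difference in row counts, as done here, is a cheap way to keep the reported numbers in line with the actual shapes.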

# 3. Reshaping the dataset

**1. Creating the column "has_setting". It describes whether the game continues an existing setting (for example, a sequel) or starts a new one.**

In [382]:
# Look for numbers in the range 2-10 in the game name and return True if at least one is found
def check_for_setting(name):
    setting = re.findall(r"\d+", name)
    if setting:
        for i in setting:
            if 2 <= int(i) <= 10:
                return True
    return False

In [383]:
dataframe_clean["has_setting"] = dataframe_clean["name"].apply(check_for_setting)
dataframe_clean = dataframe_clean.set_index("name")

In [384]:
has_setting = dataframe_clean[dataframe_clean["has_setting"]]
has_setting.shape

(3423, 9)

**3,423 rows in the dataset have True in the column "has_setting".**
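To see what the number heuristic catches and misses, here is a small sketch that applies the same rule to a few illustrative titles. The titles are chosen only as test strings; nothing is assumed about whether or how they appear in the dataset.

```python
import re


def check_for_setting(name):
    # Same rule as the function above: True if the name contains a number from 2 to 10.
    return any(2 <= int(number) <= 10 for number in re.findall(r"\d+", name))


examples = ["Portal 2", "DOOM", "Fallout 76", "7 Days to Die"]
results = {title: check_for_setting(title) for title in examples}
for title, flag in results.items():
    print(f"{title}: {flag}")
```

A name such as "7 Days to Die" is flagged even though the number is not a sequel marker, so the column is best read as a rough signal rather than an exact label.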

**2. Changing the datatype of the column "release_date" to an int that indicates the year of release.**

In [385]:
dataframe_clean["release_date"] = pd.to_datetime(dataframe_clean["release_date"], 
                                                 format="%b %d, %Y", errors="coerce").dt.year.astype("Int64")

**Checking whether some data was lost in the process and deleting the rows where it is absent.**

In [386]:
dataframe_clean.loc[dataframe_clean["release_date"].isna()].index

Index(['Steel Division 2', 'Vampire: The Masquerade® - Bloodlines™ 2',
       'Aura Kingdom', 'League of Maidens®', 'Technobabylon',
       'Age of Wonders Shadow Magic', 'Puzzle Agent',
       'NECROPOLIS: BRUTAL EDITION', 'Red Stone Online',
       'Silent Hunter 5®: Battle of the Atlantic',
       ...
       'Zombie Rollerz', 'Torn Asunder', 'Ruins to Rumble',
       'Touch Type Tale - Strategic Typing', '温室之城（Glass City : The Dust）',
       'Organism8', 'Mike goes on hike', 'Paws and Soul', 'Gamers Club',
       'Gravia'],
      dtype='object', name='name', length=1568)

In [387]:
dataframe_clean = dataframe_clean.drop(dataframe_clean.loc[dataframe_clean["release_date"].isna()].index)
dataframe_clean.loc[dataframe_clean["release_date"].isna()].index

Index([], dtype='object', name='name')

In [388]:
dataframe_clean.head()

name,all_reviews,release_date,developer,publisher,popular_tags,game_details,languages,genre,has_setting
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,id Software,"Bethesda Softworks,Bethesda Softworks","FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...","English,French,Italian,German,Spanish - Spain,...",Action,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,PUBG Corporation,"PUBG Corporation,PUBG Corporation","Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","English,Korean,Simplified Chinese,French,Germa...","Action,Adventure,Massively Multiplayer",False
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,Harebrained Schemes,"Paradox Interactive,Paradox Interactive","Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","English,French,German,Russian","Action,Adventure,Strategy",False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,Bohemia Interactive,"Bohemia Interactive,Bohemia Interactive","Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","English,French,Italian,German,Spanish - Spain,...","Action,Adventure,Massively Multiplayer",False
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,CCP,"CCP,CCP","Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","English,German,Russian,French","Action,Free to Play,Massively Multiplayer,RPG,...",False


**1,568 rows with an unparseable release date were deleted in the process.**
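One thing to watch for at this step: the index is now the game name, and names are not guaranteed to be unique. Dropping by index label removes every row that shares a label, while `dropna` removes only the rows that actually lack a value. The sketch below shows the difference on a toy frame with made-up names and dates.

```python
import pandas as pd

# Toy frame indexed by name; "Beta" appears twice, once without a parseable date.
games = pd.DataFrame(
    {"release_date": ["May 12, 2016", "Coming Soon", "Dec 21, 2017"]},
    index=pd.Index(["Alpha", "Beta", "Beta"], name="name"),
)

# Strings that do not match the format become NaT, then <NA> after the cast.
games["release_date"] = pd.to_datetime(
    games["release_date"], format="%b %d, %Y", errors="coerce"
).dt.year.astype("Int64")

missing_labels = games.loc[games["release_date"].isna()].index

by_label = games.drop(missing_labels)             # drops both "Beta" rows
by_mask = games.dropna(subset=["release_date"])   # drops only the row without a year

print(len(by_label), len(by_mask))
```

If the dataset contains duplicate names, the mask-based variant keeps the valid duplicates that the label-based drop would discard.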

**3. Creating the column "published_by_developer". The column holds a boolean that is True when the values in the columns "publisher" and "developer" match.**

**Removing the duplicated strings in the column "publisher".**

In [389]:
# Split the string on commas and return its unique parts (set() does not preserve their order)
def remove_duplicates(string):
    publishers = string.split(",")
    unique_publishers = list(set(publishers))
    return ",".join(unique_publishers)

In [390]:
dataframe_clean["publisher"] = dataframe_clean["publisher"].apply(remove_duplicates)
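Because `set()` does not keep the original order, a publisher string with several distinct names may come back in a different order from run to run. A possible order-preserving variant is sketched below; the helper name `dedupe_preserving_order` and the sample strings other than the DOOM publisher value are hypothetical.

```python
def dedupe_preserving_order(string):
    # dict.fromkeys keeps the first occurrence of each key in insertion order.
    parts = [part.strip() for part in string.split(",")]
    return ",".join(dict.fromkeys(parts))


single = dedupe_preserving_order("Bethesda Softworks,Bethesda Softworks")
several = dedupe_preserving_order("Studio B,Studio A,Studio B")
print(single)
print(several)
```

For rows with a single repeated publisher, both approaches give the same result; the difference only shows up when several distinct names are listed.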

In [391]:
dataframe_clean.head()

name,all_reviews,release_date,developer,publisher,popular_tags,game_details,languages,genre,has_setting
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,id Software,Bethesda Softworks,"FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...","English,French,Italian,German,Spanish - Spain,...",Action,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,PUBG Corporation,PUBG Corporation,"Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","English,Korean,Simplified Chinese,French,Germa...","Action,Adventure,Massively Multiplayer",False
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,Harebrained Schemes,Paradox Interactive,"Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","English,French,German,Russian","Action,Adventure,Strategy",False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,Bohemia Interactive,Bohemia Interactive,"Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","English,French,Italian,German,Spanish - Spain,...","Action,Adventure,Massively Multiplayer",False
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,CCP,CCP,"Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","English,German,Russian,French","Action,Free to Play,Massively Multiplayer,RPG,...",False


**Creating the column "published_by_developer" and deleting the columns "developer" and "publisher".**

In [392]:
dataframe_clean["published_by_developer"] = dataframe_clean["developer"] == dataframe_clean["publisher"]
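The equality check compares the two strings as a whole, so a game whose developers are listed in a different order than its publishers would come out as False. A hedged sketch of an order-insensitive comparison follows; the helper `same_companies` and the studio names are hypothetical.

```python
def same_companies(developer, publisher):
    # Compare the comma-separated names as sets: order, duplicates and
    # surrounding spaces are ignored.
    developers = {part.strip() for part in developer.split(",")}
    publishers = {part.strip() for part in publisher.split(",")}
    return developers == publishers


same_single = same_companies("CCP", "CCP,CCP")
same_reordered = same_companies("Studio A,Studio B", "Studio B,Studio A")
different = same_companies("id Software", "Bethesda Softworks")
print(same_single, same_reordered, different)
```

Whether this stricter notion of "same companies" is wanted depends on how multi-company entries should be treated, which the notebook does not specify.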

In [393]:
published_by_developer = dataframe_clean[dataframe_clean["published_by_developer"]]
dataframe_clean = dataframe_clean.drop(["developer", "publisher"], axis=1)
dataframe_clean.head()

name,all_reviews,release_date,popular_tags,game_details,languages,genre,has_setting,published_by_developer
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,"FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...","English,French,Italian,German,Spanish - Spain,...",Action,False,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,"Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","English,Korean,Simplified Chinese,French,Germa...","Action,Adventure,Massively Multiplayer",False,True
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,"Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","English,French,German,Russian","Action,Adventure,Strategy",False,False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,"Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","English,French,Italian,German,Spanish - Spain,...","Action,Adventure,Massively Multiplayer",False,True
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,"Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","English,German,Russian,French","Action,Free to Play,Massively Multiplayer,RPG,...",False,True


**4. Creating the column "multiple_languages". The column holds a boolean that is True when the column "languages" lists more than one language. The column "languages" is dropped after the data extraction.**

In [394]:
dataframe_clean["multiple_languages"] = dataframe_clean["languages"].str.split(",").apply(lambda x: len(x) > 1)
dataframe_clean = dataframe_clean.drop(["languages"], axis=1)
dataframe_clean.head()

name,all_reviews,release_date,popular_tags,game_details,genre,has_setting,published_by_developer,multiple_languages
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,"FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...",Action,False,False,True
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,"Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","Action,Adventure,Massively Multiplayer",False,True,True
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,"Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","Action,Adventure,Strategy",False,False,True
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,"Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","Action,Adventure,Massively Multiplayer",False,True,True
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,"Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","Action,Free to Play,Massively Multiplayer,RPG,...",False,True,True


In [395]:
multiple_languages = dataframe_clean[dataframe_clean["multiple_languages"]]
multiple_languages.shape

(16071, 8)

**16,071 rows have True in the column "multiple_languages".**
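The split-and-count rule can be checked on a couple of sample strings. The sketch below also shows an equivalent formulation that counts commas directly; the sample values are made up.

```python
import pandas as pd

languages = pd.Series(["English", "English,French,German"])

# Rule used above: more than one item after splitting on commas.
by_split = languages.str.split(",").apply(lambda x: len(x) > 1)

# Equivalent check: at least one comma in the string.
by_count = languages.str.count(",") > 0

print(by_split.tolist())
print(by_count.tolist())
```

Both variants assume that a comma never appears inside a single language name.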

**5. Creating multiple columns for the unique tags, details and genres. Each column stores a boolean indicating whether the value appears in the string it was extracted from.**

**Listing and counting the tags from column "popular_tags".**

In [396]:
all_tags = ",".join(dataframe_clean["popular_tags"])
tag_list = all_tags.split(',')
tag_counts = Counter(tag_list)
most_popular_tags = tag_counts.most_common()
print("Most popular tags:")
for tag, count in most_popular_tags:
    if count > 300:
        print(f"{tag}: {count}")

Most popular tags:
Indie: 19636
Action: 13745
Adventure: 11320
Casual: 10797
Simulation: 7147
Strategy: 6707
Singleplayer: 6208
Early Access: 5553
RPG: 5524
Great Soundtrack: 2921
2D: 2835
Multiplayer: 2809
Atmospheric: 2805
Puzzle: 2731
Free to Play: 2285
VR: 2227
Violent: 1949
Story Rich: 1926
Difficult: 1924
Horror: 1732
Anime: 1712
Gore: 1637
Pixel Graphics: 1607
Sports: 1580
Funny: 1576
Platformer: 1575
Shooter: 1556
Sci-fi: 1493
Female Protagonist: 1485
Open World: 1484
First-Person: 1481
Fantasy: 1469
Co-op: 1450
Retro: 1372
Arcade: 1284
Racing: 1281
Nudity: 1256
FPS: 1238
Family Friendly: 1164
Survival: 1149
Massively Multiplayer: 1075
Visual Novel: 1065
Comedy: 1040
Sexual Content: 1037
Sandbox: 1029
Cute: 986
Classic: 985
Point & Click: 952
Exploration: 882
Turn-Based: 881
Masterpiece: 878
Replay Value: 830
Space: 803
Relaxing: 799
Psychological Horror: 757
Third Person: 750
Local Multiplayer: 658
Colorful: 653
Fast-Paced: 634
Tactical: 624
Mystery: 623
RPGMaker: 619
Controll
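The threshold decides how many indicator columns get created, so it can help to count how many tags would pass each candidate value before committing to one. A small sketch with toy tag strings and hypothetical thresholds:

```python
from collections import Counter

# Toy tag strings (made up); each string lists the tags of one game.
tag_strings = ["Indie,Action", "Indie,Casual", "Action,Indie", "Indie"]
tag_counts = Counter(",".join(tag_strings).split(","))

# For each candidate threshold, count the tags that occur more often than it.
candidate_thresholds = [0, 1, 2, 3]  # hypothetical values for the toy data
columns_per_threshold = {
    threshold: sum(1 for count in tag_counts.values() if count > threshold)
    for threshold in candidate_thresholds
}
print(columns_per_threshold)
```

On the real data the same dictionary comprehension could be run over `tag_counts` with thresholds such as 300, 500 or 1300.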

**Creating new columns for the unique tags. A tag gets its own column only if it has more than 300 entries.**

In [397]:
the_most_popular_tags = [(tag, count) for tag, count in tag_counts.items() if count > 300]
for tag, count in the_most_popular_tags:
    dataframe_clean[tag] = dataframe_clean["popular_tags"].apply(lambda x: tag in x)
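Note that `tag in x` is a substring test on the whole string, so a tag such as "Action" also matches inside a longer tag such as "Action RPG". If exact matches are wanted, one option is `Series.str.get_dummies`, which splits on the separator and creates one indicator column per whole token. The sketch below uses made-up tag strings.

```python
import pandas as pd

# Toy tag strings (made up).
tags = pd.Series(["Action RPG,Indie", "Action,RPG"], index=["Game A", "Game B"])

# Substring test, as in the loop above: matches "Action" inside "Action RPG".
substring_match = tags.apply(lambda x: "Action" in x)

# Exact token test: one boolean column per whole tag.
exact_match = tags.str.get_dummies(sep=",").astype(bool)

print(substring_match.tolist())
print(exact_match["Action"].tolist())
```

The indicator frame contains a column for every tag, so it would still need to be filtered down to the frequent tags before being joined to the main frame.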

In [398]:
dataframe_clean.head()

name,all_reviews,release_date,popular_tags,game_details,genre,has_setting,published_by_developer,multiple_languages,FPS,Gore,...,Replay Value,Education,Design & Illustration,Procedural Generation,Music,Shoot 'Em Up,RPGMaker,Web Publishing,Hidden Object,Minimalist
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,"FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...",Action,False,False,True,True,True,...,False,False,False,False,False,False,False,False,False,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,"Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","Action,Adventure,Massively Multiplayer",False,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,"Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","Action,Adventure,Strategy",False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,"Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","Action,Adventure,Massively Multiplayer",False,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,"Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","Action,Free to Play,Massively Multiplayer,RPG,...",False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False


**Listing and counting the details from the column "game_details".**

In [399]:
all_details = ",".join(dataframe_clean["game_details"])
details_list = all_details.split(',')
details_counts = Counter(details_list)
most_popular_details = details_counts.most_common()
print("Most popular details:")
for detail, count in most_popular_details:
    if count > 300:
        print(f"{detail}: {count}")

Most popular details:
Single-player: 28370
Steam Achievements: 16424
Steam Cloud: 9443
Steam Trading Cards: 9389
Downloadable Content: 8346
Full controller support: 7345
Profile Features Limited 
									: 6602
Multi-player: 6243
Partial Controller Support: 5223
Steam Leaderboards: 4454
Online Multi-Player: 3426
Co-op: 2874
Shared/Split Screen: 2608
Stats: 2211
Local Multi-Player: 1788
Steam Workshop: 1692
Online Co-op: 1565
Cross-Platform Multiplayer: 1420
Includes level editor: 1244
Steam is learning about this game 
									: 1225
Local Co-op: 1109
In-App Purchases: 1036
Captions available: 868
MMO: 632


In [400]:
the_most_popular_details = [(detail, count) for detail, count in details_counts.items() if count > 500]
for detail, count in the_most_popular_details:
    new_column_name = f"detail_{detail}"
    dataframe_clean[new_column_name] = dataframe_clean["game_details"].apply(lambda x: detail in x)
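Some detail values carry trailing whitespace (carriage return, newline and tabs), which ends up in the counts and in the new column names. A possible clean-up is to strip each token before counting, sketched here on made-up strings that imitate that residue.

```python
import pandas as pd

# Toy detail strings (made up) with the kind of trailing whitespace seen in the output above.
details = pd.Series(
    [
        "Single-player,Profile Features Limited \r\n\t\t\t\t\t\t\t\t\t",
        "Single-player,Steam Cloud",
        "Profile Features Limited \r\n\t\t\t\t\t\t\t\t\t",
    ]
)

# Split into tokens, put each token on its own row, strip whitespace, then count.
detail_counts = details.str.split(",").explode().str.strip().value_counts()
print(detail_counts)
```

With stripped tokens, a column name such as `detail_Profile Features Limited` would no longer contain escape sequences.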

In [401]:
dataframe_clean.head()

name,all_reviews,release_date,popular_tags,game_details,genre,has_setting,published_by_developer,multiple_languages,FPS,Gore,...,detail_Local Co-op,detail_Shared/Split Screen,detail_Steam Leaderboards,detail_Local Multi-Player,detail_Captions available,detail_Includes level editor,detail_In-App Purchases,detail_Profile Features Limited \r\n\t\t\t\t\t\t\t\t\t,detail_Steam is learning about this game \r\n\t\t\t\t\t\t\t\t\t,detail_Downloadable Content
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,"FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...",Action,False,False,True,True,True,...,False,False,False,False,False,False,False,False,False,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,"Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","Action,Adventure,Massively Multiplayer",False,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,"Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","Action,Adventure,Strategy",False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,"Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","Action,Adventure,Massively Multiplayer",False,True,True,True,False,...,False,False,False,False,False,False,False,False,False,False
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,"Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","Action,Free to Play,Massively Multiplayer,RPG,...",False,True,True,False,False,...,False,False,False,False,False,False,False,False,False,False


**Listing and counting the genres from the column "genre".**

In [402]:
all_genres = ",".join(dataframe_clean["genre"])
genre_list = all_genres.split(",")
genre_counts = Counter(genre_list)
most_popular_genres = [(genre, count) for genre, count in genre_counts.items() if count > 300]
print("Most popular genres:")
for genre, count in most_popular_genres:
    print(f"{genre}: {count}")

Most popular genres:
Action: 12996
Adventure: 10173
Massively Multiplayer: 947
Strategy: 6158
Free to Play: 2079
RPG: 5161
Indie: 18864
Early Access: 2485
Simulation: 6627
Racing: 1212
Casual: 9921
Sports: 1499
Design & Illustration: 486
Web Publishing: 391


In [403]:
the_most_popular_genres = [(genre, count) for genre, count in genre_counts.items() if count > 500]
for genre, count in the_most_popular_genres:
    new_column_name = f"genre_{genre}"
    dataframe_clean[new_column_name] = dataframe_clean["genre"].apply(lambda x: genre in x)
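The tag, detail and genre steps follow the same pattern: count the comma-separated values, keep the frequent ones, and add one boolean column per value. A hedged sketch of a shared helper is shown below; the function name `add_indicator_columns`, its parameters and the toy data are assumptions made for illustration. Unlike the loops above, it strips whitespace and matches whole tokens.

```python
from collections import Counter

import pandas as pd


def add_indicator_columns(frame, column, prefix, min_count):
    # Split each string into a list of stripped tokens.
    token_lists = frame[column].str.split(",").apply(
        lambda tokens: [token.strip() for token in tokens]
    )
    counts = Counter(token for tokens in token_lists for token in tokens)

    result = frame.copy()
    for token, count in counts.items():
        if count > min_count:
            # Whole-token membership test, bound to the current token.
            result[f"{prefix}{token}"] = token_lists.apply(
                lambda tokens, token=token: token in tokens
            )
    return result


toy = pd.DataFrame({"genre": ["Action,Indie", "Action", "RPG"]})
with_genres = add_indicator_columns(toy, "genre", prefix="genre_", min_count=1)
print(with_genres)
```

In this toy run only "Action" occurs more than once, so a single `genre_Action` column is added.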

In [404]:
dataframe_clean.head()

name,all_reviews,release_date,popular_tags,game_details,genre,has_setting,published_by_developer,multiple_languages,FPS,Gore,...,genre_Massively Multiplayer,genre_Strategy,genre_Free to Play,genre_RPG,genre_Indie,genre_Early Access,genre_Simulation,genre_Racing,genre_Casual,genre_Sports
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,"FPS,Gore,Action,Demons,Shooter,First-Person,Gr...","Single-player,Multi-player,Co-op,Steam Achieve...",Action,False,False,True,True,True,...,False,False,False,False,False,False,False,False,False,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,"Survival,Shooter,Multiplayer,Battle Royale,PvP...","Multi-player,Online Multi-Player,Stats","Action,Adventure,Massively Multiplayer",False,True,True,True,False,...,True,False,False,False,False,False,False,False,False,False
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,"Mechs,Strategy,Turn-Based,Turn-Based Tactics,S...","Single-player,Multi-player,Online Multi-Player...","Action,Adventure,Strategy",False,False,True,False,False,...,False,True,False,False,False,False,False,False,False,False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,"Survival,Zombies,Open World,Multiplayer,PvP,Ma...","Multi-player,Online Multi-Player,Steam Worksho...","Action,Adventure,Massively Multiplayer",False,True,True,True,False,...,True,False,False,False,False,False,False,False,False,False
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,"Space,Massively Multiplayer,Sci-fi,Sandbox,MMO...","Multi-player,Online Multi-Player,MMO,Co-op,Onl...","Action,Free to Play,Massively Multiplayer,RPG,...",False,True,True,False,False,...,True,True,True,True,False,False,False,False,False,False


**Deleting the columns "genre", "popular_tags" and "game_details" as I no longer need them.**

In [405]:
dataframe_clean = dataframe_clean.drop(["genre", "popular_tags", "game_details"], axis=1)
dataframe_clean.head()

name,all_reviews,release_date,has_setting,published_by_developer,multiple_languages,FPS,Gore,Action,Shooter,First-Person,...,genre_Massively Multiplayer,genre_Strategy,genre_Free to Play,genre_RPG,genre_Indie,genre_Early Access,genre_Simulation,genre_Racing,genre_Casual,genre_Sports
DOOM,"Very Positive,(42,550),- 92% of the 42,550 use...",2016,False,False,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
PLAYERUNKNOWN'S BATTLEGROUNDS,"Mixed,(836,608),- 49% of the 836,608 user revi...",2017,False,True,True,True,False,True,True,True,...,True,False,False,False,False,False,False,False,False,False
BATTLETECH,"Mostly Positive,(7,030),- 71% of the 7,030 use...",2018,False,False,True,False,False,True,False,False,...,False,True,False,False,False,False,False,False,False,False
DayZ,"Mixed,(167,115),- 61% of the 167,115 user revi...",2018,False,True,True,True,False,True,True,False,...,True,False,False,False,False,False,False,False,False,False
EVE Online,"Mostly Positive,(11,481),- 74% of the 11,481 u...",2003,False,True,True,False,False,True,False,False,...,True,True,True,True,False,False,False,False,False,False


In [406]:
dataframe_clean.shape

(30675, 153)

**6. Creating the columns "total_reviews" and "positive_reviews_share" from the column "all_reviews". They will hold int and float values respectively.**

**Checking whether every row has a string from which the data can be extracted. Rows that do not are given random placeholder values instead.**

In [407]:
# Extracts the review count (first number in parentheses) and the positive share (first percentage)
# from a string; returns them as an int and a float, or None where nothing matches
def extract_numbers(string):
    total_reviews = None
    positive_reviews_share = None
    match_total = re.search(r"\(([\d,]+)\)", string)
    match_positive = re.search(r"(\d+)%", string)
    if match_total:
        total_reviews = int(match_total.group(1).replace(",", ""))
    if match_positive:
        positive_reviews_share = float(match_positive.group(1)) / 100
    return total_reviews, positive_reviews_share
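
A quick check of the two patterns on a made-up string in the same shape as the "all_reviews" values. The sample text is an assumption of mine, not a row copied from the dataset; the function is repeated so the snippet runs on its own.

```python
import re

def extract_numbers(string):
    total_reviews = None
    positive_reviews_share = None
    match_total = re.search(r"\(([\d,]+)\)", string)   # e.g. "(42,550)"
    match_positive = re.search(r"(\d+)%", string)      # e.g. "92%"
    if match_total:
        total_reviews = int(match_total.group(1).replace(",", ""))
    if match_positive:
        positive_reviews_share = float(match_positive.group(1)) / 100
    return total_reviews, positive_reviews_share

# Hypothetical sample string
sample = "Very Positive,(42,550),- 92% of the 42,550 user reviews for this game are positive."
print(extract_numbers(sample))  # (42550, 0.92)
print(extract_numbers("nan"))   # (None, None)
```

A string without a parenthesised count or a percentage yields `None` for that value, which is what the later fill steps rely on.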

In [408]:
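def replace_string_or_fill_na(text):
    # After astype(str), missing values arrive as the string "nan"
    if pd.isna(text) or str(text) == "nan" or "Need more" in str(text):
        random_number = random.randint(10, 1000)
        random_percent = random.randint(1, 100)
        # Use the same "(count)" and "NN%" shape that extract_numbers looks for
        return f"({random_number}),- {random_percent}% of the {random_number} user reviews"
    return text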

In [409]:
dataframe_clean["all_reviews"] = dataframe_clean["all_reviews"].astype(str).apply(replace_string_or_fill_na)

In [410]:
dataframe_clean[["total_reviews", 
                 "positive_reviews_share"]] = dataframe_clean["all_reviews"].apply(lambda x: pd.Series(extract_numbers(x)))

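The row-wise `apply` above can be slow on roughly 30,000 rows. A minimal sketch of a vectorised variant, assuming the same two regular expressions; the sample strings and the names `reviews`, `counts`, `share` and `result` are my own. It uses `Series.str.extract` and keeps the count as a nullable integer so missing values do not turn it into a float.

```python
import pandas as pd

reviews = pd.Series(
    [
        "Mixed,(6,214),- 49% of the 6,214 user reviews are positive.",  # made-up sample
        "nan",  # a missing value after astype(str)
    ],
    index=["Game A", "Game B"],
)

# First parenthesised number, thousands separators removed
counts = reviews.str.extract(r"\(([\d,]+)\)", expand=False)
total_reviews = pd.to_numeric(counts.str.replace(",", "", regex=False)).astype("Int64")

# First percentage, converted to a share between 0 and 1
share = reviews.str.extract(r"(\d+)%", expand=False)
positive_reviews_share = pd.to_numeric(share) / 100

result = pd.DataFrame(
    {"total_reviews": total_reviews, "positive_reviews_share": positive_reviews_share}
)
print(result)
```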
In [411]:
min_value = 10
max_value = 1000

dataframe_clean["total_reviews"] = dataframe_clean["total_reviews"].apply(lambda x: random.randint(
    min_value, max_value) if pd.isnull(x) else x)

In [412]:
min_value = 0.01
max_value = 1.00

dataframe_clean["positive_reviews_share"] = dataframe_clean["positive_reviews_share"].apply(lambda x: round(random.uniform(
    min_value, max_value), 2) if pd.isnull(x) else x)
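
Because the fills above are drawn at random, each run of the notebook produces a different dataset. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable. The seed value, the `rng` name and the toy frame are my own placeholders. It also casts "total_reviews" to int once no missing values remain.

```python
import random

import pandas as pd

rng = random.Random(42)  # hypothetical seed; any fixed value works

# Toy frame with one missing row
toy = pd.DataFrame(
    {
        "total_reviews": [42550.0, None, 7030.0],
        "positive_reviews_share": [0.92, None, 0.71],
    }
)

# Fill missing counts with a random integer, then store the column as int
toy["total_reviews"] = toy["total_reviews"].apply(
    lambda x: rng.randint(10, 1000) if pd.isnull(x) else x
).astype(int)

# Fill missing shares with a random value rounded to two decimals
toy["positive_reviews_share"] = toy["positive_reviews_share"].apply(
    lambda x: round(rng.uniform(0.01, 1.00), 2) if pd.isnull(x) else x
)

print(toy.dtypes)
```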

**Deleting the column "all_reviews" as the data has already been extracted.**

In [413]:
dataframe_clean = dataframe_clean.drop(["all_reviews"], axis=1)

**Checking what the dataset looks like now.**

In [415]:
dataframe_clean

name,release_date,has_setting,published_by_developer,multiple_languages,FPS,Gore,Action,Shooter,First-Person,Great Soundtrack,...,genre_Free to Play,genre_RPG,genre_Indie,genre_Early Access,genre_Simulation,genre_Racing,genre_Casual,genre_Sports,total_reviews,positive_reviews_share
DOOM,2016,False,False,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,42550.0,0.92
PLAYERUNKNOWN'S BATTLEGROUNDS,2017,False,True,True,True,False,True,True,True,False,...,False,False,False,False,False,False,False,False,836608.0,0.49
BATTLETECH,2018,False,False,True,False,False,True,False,False,True,...,False,False,False,False,False,False,False,False,7030.0,0.71
DayZ,2018,False,True,True,True,False,True,True,False,False,...,False,False,False,False,False,False,False,False,167115.0,0.61
EVE Online,2003,False,True,True,False,False,True,False,False,False,...,True,True,False,False,False,False,False,False,11481.0,0.74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Bare Boob Brawlerz: Novel 01 (Visual Novel),2019,False,True,False,False,False,False,False,False,False,...,False,False,True,False,False,False,True,False,695.0,0.99
Galactis,2018,False,True,False,False,False,True,False,False,False,...,False,False,True,False,False,False,False,False,426.0,0.98
Alive,2019,False,True,False,True,False,True,False,False,False,...,True,True,False,False,False,False,False,False,206.0,0.23
Mega Man X5 Sound Collection,2018,True,True,True,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,640.0,0.33


# 3. Saving the reshaped dataset

In [416]:
dataframe_clean.to_csv("Datasets/steam_games_test_03.csv", index=False)
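
The tables above suggest that "name" is the index of `dataframe_clean`. If so, `index=False` leaves the game names out of the saved file. A small sketch with a toy frame and in-memory buffers (both my own stand-ins) shows the difference:

```python
import io

import pandas as pd

# Toy frame with "name" as the index, mirroring the layout shown above
toy = pd.DataFrame(
    {"release_date": [2016, 2017]},
    index=pd.Index(["DOOM", "PLAYERUNKNOWN'S BATTLEGROUNDS"], name="name"),
)

without_index = io.StringIO()
toy.to_csv(without_index, index=False)  # index (game names) is not written

with_index = io.StringIO()
toy.to_csv(with_index)                  # index is written as the first column

print(without_index.getvalue().splitlines()[0])  # release_date
print(with_index.getvalue().splitlines()[0])     # name,release_date
```

If the names are needed later, leaving out `index=False` keeps them in the file.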