# Sigmoid Exam Part 1

## Data Importation

Connect to Google Drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import some important libraries:

In [2]:
import pandas as pd
import numpy as np

Read the dataset:

In [3]:
data = pd.read_csv('/content/drive/MyDrive/Exam_01_07_2024_4/data.csv')
data.head()

Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,original_price,discount_percentage,discounted_price,...,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count,recent_review,recent_review_%,recent_review_count
0,730,Counter-Strike 2,"21 Aug, 2012","Action, Free to Play","Cross-Platform Multiplayer, Steam Trading Card...",Valve,Valve,,,Free,...,True,False,True,1,Very Positive,87.0,8062218.0,Mostly Positive,79.0,57466.0
1,570,Dota 2,"9 Jul, 2013","Action, Strategy, Free to Play","Steam Trading Cards, Steam Workshop, SteamVR C...",Valve,Valve,,,Free,...,True,True,True,0,Very Positive,81.0,2243112.0,Mostly Positive,72.0,23395.0
2,2215430,Ghost of Tsushima DIRECTOR'S CUT,"16 May, 2024","Action, Adventure","Single-player, Online Co-op, Steam Achievement...",Sucker Punch Productions,PlayStation PC LLC,,,"₹3,999.00",...,True,False,False,0,Very Positive,89.0,12294.0,,,
3,1245620,ELDEN RING,"24 Feb, 2022","Action, RPG","Single-player, Online PvP, Online Co-op, Steam...",FromSoftware Inc.,FromSoftware Inc.,,,"₹3,599.00",...,True,False,False,6,Very Positive,93.0,605191.0,Very Positive,94.0,7837.0
4,1085660,Destiny 2,"1 Oct, 2019","Action, Adventure, Free to Play","Single-player, Online PvP, Online Co-op, Steam...",Bungie,Bungie,,,Free,...,True,False,False,0,Very Positive,80.0,594713.0,Mostly Positive,73.0,4845.0


In [4]:
data.columns

Index(['app_id', 'title', 'release_date', 'genres', 'categories', 'developer',
       'publisher', 'original_price', 'discount_percentage',
       'discounted_price', 'dlc_available', 'age_rating', 'content_descriptor',
       'about_description', 'win_support', 'mac_support', 'linux_support',
       'awards', 'overall_review', 'overall_review_%', 'overall_review_count',
       'recent_review', 'recent_review_%', 'recent_review_count'],
      dtype='object')

Description of each column:

1. **app_id**: The unique identifier assigned to the application (game) by the platform.
2. **title**: The name of the game.
3. **release_date**: The date when the game was officially released.
4. **genres**: The categories of games that describe the gameplay style or thematic elements.
5. **categories**: Additional classifications that provide information about features or elements of the game.
6. **developer**: The company or individual responsible for creating the game.
7. **publisher**: The company or individual responsible for distributing the game.
8. **original_price**: The original cost of the game before any discounts, usually given in a specific currency.
9. **discount_percentage**: The percentage reduction in the price of the game during a sale or promotion.
10. **discounted_price**: The cost of the game after applying the discount, usually given in a specific currency.
11. **dlc_available**: Indicates additional content created for an already released video game, distributed through the Internet by the game's publisher.
12. **age_rating**: Binary value indicating whether the game carries an age-rating recommendation, which helps parents and carers decide what is appropriate for their child's stage of development.
13. **content_descriptor**: Descriptions of the types of content in the game that may influence its age rating.
14. **about_description**: A brief overview or description of the game's plot, gameplay, or features.
15. **win_support**: Indicates whether the game supports the Windows operating system.
16. **mac_support**: Indicates whether the game supports the Mac operating system.
17. **linux_support**: Indicates whether the game supports the Linux operating system.
18. **awards**: Represents the number of notable awards the game has won.
19. **overall_review**: The overall sentiment of user reviews for the game.
20. **overall_review_%**: The percentage of positive reviews out of the total number of reviews.
21. **overall_review_count**: The total number of user reviews for the game.
22. **recent_review**: The sentiment of user reviews over a recent period.
23. **recent_review_%**: The percentage of positive reviews out of the total number of recent reviews.
24. **recent_review_count**: The total number of recent user reviews for the game.


## Exploratory Data Analysis

### Validation and Data Preparation

In [5]:
data.shape

(42497, 24)

The dataset contains 42497 rows and 24 columns.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42497 entries, 0 to 42496
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   app_id                42497 non-null  int64  
 1   title                 42497 non-null  object 
 2   release_date          42440 non-null  object 
 3   genres                42410 non-null  object 
 4   categories            42452 non-null  object 
 5   developer             42307 non-null  object 
 6   publisher             42286 non-null  object 
 7   original_price        4859 non-null   object 
 8   discount_percentage   4859 non-null   object 
 9   discounted_price      42257 non-null  object 
 10  dlc_available         42497 non-null  int64  
 11  age_rating            42497 non-null  int64  
 12  content_descriptor    2375 non-null   object 
 13  about_description     42359 non-null  object 
 14  win_support           42497 non-null  bool   
 15  mac_support        

This dataframe contains 24 columns of the bool, float, int and object data types.

Check for duplicates:

In [8]:
data.duplicated().sum()

0

There are no duplicates.
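`data.duplicated()` only flags rows that are identical in every column. As a complementary check, duplicates can also be counted on the identifier alone. The sketch below uses a small toy frame with illustrative values (not rows from the dataset) to show the difference between the two checks.

```python
import pandas as pd

# Toy frame with made-up values: the same app_id appears twice,
# but the two rows differ in the 'awards' column.
toy = pd.DataFrame({
    "app_id": [730, 570, 730],
    "title": ["Counter-Strike 2", "Dota 2", "Counter-Strike 2"],
    "awards": [1, 0, 2],
})

# Rows identical in every column.
full_row_duplicates = int(toy.duplicated().sum())

# Rows that repeat an app_id, whatever the other columns contain.
id_duplicates = int(toy.duplicated(subset=["app_id"]).sum())

print(full_row_duplicates, id_duplicates)
```

On the real dataframe the equivalent call would be `data.duplicated(subset=['app_id']).sum()`.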

Check for NaN values:

In [7]:
data.isna().sum()

app_id                      0
title                       0
release_date               57
genres                     87
categories                 45
developer                 190
publisher                 211
original_price          37638
discount_percentage     37638
discounted_price          240
dlc_available               0
age_rating                  0
content_descriptor      40122
about_description         138
win_support                 0
mac_support                 0
linux_support               0
awards                      0
overall_review           2477
overall_review_%         2477
overall_review_count     2477
recent_review           36994
recent_review_%         36994
recent_review_count     36994
dtype: int64

Many columns have missing values. I will handle them one column at a time.
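Raw counts are easier to judge as a share of the 42497 rows: from the output above, 'content_descriptor' is missing in roughly 94% of rows, 'original_price' in roughly 89% and 'recent_review' in roughly 87%. A minimal sketch of that calculation, shown on a toy frame with illustrative values, could look like this:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_date": ["21 Aug, 2012", None, "9 Jul, 2013", None],
    "original_price": [None, None, None, "₹100.00"],
})

# The mean of the boolean NaN mask is the fraction of missing values per column.
missing_share = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)
```

Applied to the real dataframe, `data.isna().mean().mul(100)` gives the same view and makes it easier to decide which columns are worth repairing and which are candidates for removal.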

Print the data types of each column in order to know what data type the missing values should have:

In [None]:
data.dtypes

app_id                    int64
title                    object
release_date             object
genres                   object
categories               object
developer                object
publisher                object
original_price           object
discount_percentage      object
discounted_price         object
dlc_available             int64
age_rating                int64
content_descriptor       object
about_description        object
win_support                bool
mac_support                bool
linux_support              bool
awards                    int64
overall_review           object
overall_review_%        float64
overall_review_count    float64
recent_review            object
recent_review_%         float64
recent_review_count     float64
dtype: object

First of all, I take care of the missing values in 'release_date'. This column has 57 missing values.

In [None]:
data.loc[data['release_date'].isna()]

Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,original_price,discount_percentage,discounted_price,...,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count,recent_review,recent_review_%,recent_review_count
196,12210,Grand Theft Auto IV: Complete Edition,,"Action, Adventure",Single-player,Rockstar North,Rockstar Games,,,,...,True,False,False,0,Very Positive,81.0,133557.0,Very Positive,88.0,1747.0
967,24810,Command & Conquer™ 3: Kane’s Wrath,,Strategy,"Single-player, Family Sharing",EA Los Angeles,Electronic Arts,,,"₹1,720.00",...,True,False,False,0,Overwhelmingly Positive,95.0,3952.0,Very Positive,91.0,78.0
1553,30561,"Warhammer 40,000: Dawn of War II - Grand Maste...",,,,,,"₹2,891.00",-80%,₹578.00,...,True,True,True,0,,,,,,
1681,44370,Dawn of War Franchise Pack,,,,,,"₹5,275.00",-80%,"₹1,055.00",...,True,False,False,0,,,,,,
2388,45615,Fallout Classic Collection,,,,,,,,₹565.00,...,True,False,False,0,,,,,,
3747,17093,Total War: NAPOLEON - Definitive Edition,,Strategy,"Single-player, Downloadable Content, Steam Ach...",,,,,"₹1,499.00",...,True,True,False,0,,,,,,
4222,132479,METAL GEAR SOLID V: The Definitive Experience,,,,,,,,"₹1,599.00",...,True,False,False,0,,,,,,
4520,202170,Sleeping Dogs,,"Action, Adventure","Single-player, Steam Achievements, Steam Tradi...",United Front Games,Square Enix,,,,...,True,False,False,0,Very Positive,93.0,9995.0,,,
5904,17092,Total War: Empire - Definitive Edition,,Strategy,"Single-player, Online PvP, LAN PvP, Online Co-...",,,,,"₹1,499.00",...,True,True,True,0,,,,,,
6959,31220,Sam & Max 301: The Penal Zone,,"Action, Adventure","Single-player, Family Sharing",Telltale Games,Skunkape Games,,,,...,True,False,False,0,Very Positive,92.0,294.0,,,


Since these are well-known games, their release dates can be found on Google. I need to do some web scraping to get this data and replace the NaN values.

The first method is to use BeautifulSoup to scrape the Google results page.

In [None]:
# !pip install beautifulsoup4 requests

# import requests
# from bs4 import BeautifulSoup

# def get_release_date_from_google(title):
#     try:
#         query = f"{title} release date"
#         url = f"https://www.google.com/search?q={query}"
#         headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"}
#         response = requests.get(url, headers=headers)
#         soup = BeautifulSoup(response.text, 'html.parser')

#         date = soup.find('div', {'class': 'BNeawe iBp4i AP7Wnd'}).text
#         return date
#     except Exception as e:
#         print(f"Could not retrieve release date for {title}. Error: {e}")
#         return None

# for index, row in data.iterrows():
#     if pd.isna(row['release_date']):
#         title = row['title']
#         release_date = get_release_date_from_google(title)
#         if release_date:
#             data.at[index, 'release_date'] = release_date
#             print(f"Updated release date for {title}: {release_date}")
#         else:
#             print(f"Could not find release date for {title}")

# data.isna().sum()

This method fails because the pages returned by Google each have a different format, and it is difficult to find a single scraping template that fits all of them.

Since OpenAI's GPT-3.5 model was trained on data up to 2021, it should have information about most of these games. Let's try to use its API to fill in the missing values.

In [None]:
#!pip install openai==0.28

In [None]:
# import os
# import openai

# openai.api_key = "DELETED THIS BEFORE SUBMITTING THE COLAB"
# response = []
# prompt = ''
# def chatWithGPT(prompt, data):
#     for index, row in data.iterrows():
#         if pd.isna(row['release_date']):
#             game_title = row['title']
#             release_value = f"Retrieve release value for {game_title} here having only the date in the form DD Month Year (ex. 21 Aug, 2012). I need only the date, without any additional words."
#             prompt += f" {release_value}"  # Modify the prompt to include the release value
#         completion = openai.ChatCompletion.create(
#             model="gpt-3.5-turbo",
#             messages=[
#                 {"role": "user", "content": prompt}
#             ]
#         )
#         response.append(completion.choices[0].message.content)
#         time.sleep(2)

# chatWithGPT(prompt, data)
# print(response)

Unfortunately, I get an error about the usage limit, so I will move on.

I will try Cohere's LLM to fill in the missing values in my dataset.
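When a usage-limit error is caused by sending requests too quickly (rather than by an exhausted quota, where retrying does not help), a common remedy is to retry with increasing pauses. The sketch below is generic and does not call any real API: `call_with_backoff` and `flaky_request` are hypothetical names introduced only for illustration.

```python
import time

def call_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)

# Stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}

def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limit reached")
    return "21 Aug, 2012"

# Record the pauses instead of really sleeping, to keep the demo instant.
waits = []
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

In real use the `sleep` argument would be left at its default, and `func` would wrap the actual request.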

In [None]:
!pip install cohere



In [None]:
import time
import cohere
import re

co = cohere.Client(api_key="DELETED THIS BEFORE SUBMITTING THE COLAB")

def fetch_release_date_with_chatbot(data):
    # Ask the chatbot for the release date of every game that lacks one.
    for index, row in data.iterrows():
        if pd.isna(row['release_date']):
            game_title = row['title']
            prompt = f"Retrieve release value for {game_title} here having only the date in the form DD Month Year (ex. 21 Aug, 2012). I need only the date, without any additional words."
            response = co.chat(message=prompt)
            print(response)
            # Pull the reply out of the response's string form; this pattern
            # assumes the reply itself contains no apostrophes.
            response_string = str(response)
            match = re.search(r"text='([^']*)'", response_string)
            if match:
                extracted_text = match.group(1)
                # Note: this updates every row that shares the same title.
                data.loc[data['title'] == game_title, 'release_date'] = extracted_text
                print(data.loc[data['title'] == game_title])
            else:
                print("No match found.")
            time.sleep(6)  # There is a maximum of 10 requests per minute with the free trial


fetch_release_date_with_chatbot(data)

     app_id                                  title     release_date  \
196   12210  Grand Theft Auto IV: Complete Edition  26 October 2010   

                genres     categories       developer       publisher  \
196  Action, Adventure  Single-player  Rockstar North  Rockstar Games   

    original_price discount_percentage discounted_price  ...  win_support  \
196            NaN                 NaN              NaN  ...         True   

     mac_support linux_support awards  overall_review  overall_review_%  \
196        False         False      0   Very Positive              81.0   

     overall_review_count  recent_review recent_review_%  recent_review_count  
196              133557.0  Very Positive            88.0               1747.0  

[1 rows x 24 columns]
     app_id                               title   release_date    genres  \
967   24810  Command & Conquer™ 3: Kane’s Wrath  26 March 2008  Strategy   

                        categories       developer        publisher 
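The filled values printed above (for example `26 October 2010` and `26 March 2008`) do not follow the `21 Aug, 2012` style used by the rest of the column, which could complicate later date parsing. A minimal sketch of a normaliser follows. The function name `to_dataset_format` and the list of accepted formats are assumptions made for illustration, and the list may need extending for other replies.

```python
from datetime import datetime

# Formats tried in order: the dataset's own style first, then the long style
# seen in the chatbot output. This list is an assumption, not exhaustive.
KNOWN_FORMATS = ("%d %b, %Y", "%d %B %Y", "%d %B, %Y", "%d %b %Y")

def to_dataset_format(text):
    """Return the date as e.g. '26 Oct, 2010', or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
        # Build the day without zero padding, as in '9 Jul, 2013'.
        return f"{parsed.day} {parsed.strftime('%b')}, {parsed.year}"
    return None

print(to_dataset_format("26 October 2010"))
print(to_dataset_format("21 Aug, 2012"))
```

Replies that match none of the formats come back as `None`, so it is safer to inspect those by hand before writing the result back into 'release_date'.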

In [None]:
data.isna().sum()

app_id                      0
title                       0
release_date                0
genres                     87
categories                 45
developer                 190
publisher                 211
original_price          37638
discount_percentage     37638
discounted_price          240
dlc_available               0
age_rating                  0
content_descriptor      40122
about_description         138
win_support                 0
mac_support                 0
linux_support               0
awards                      0
overall_review           2477
overall_review_%         2477
overall_review_count     2477
recent_review           36994
recent_review_%         36994
recent_review_count     36994
dtype: int64

As can be seen, this worked: 'release_date' no longer has any missing values. Let's save the new dataframe to a CSV file.

In [None]:
data.to_csv('data_v1.csv')

Work with the data from the newly saved dataset.

In [None]:
data = pd.read_csv('/content/data_v1.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,original_price,discount_percentage,...,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count,recent_review,recent_review_%,recent_review_count
0,0,730,Counter-Strike 2,"21 Aug, 2012","Action, Free to Play","Cross-Platform Multiplayer, Steam Trading Card...",Valve,Valve,,,...,True,False,True,1,Very Positive,87.0,8062218.0,Mostly Positive,79.0,57466.0
1,1,570,Dota 2,"9 Jul, 2013","Action, Strategy, Free to Play","Steam Trading Cards, Steam Workshop, SteamVR C...",Valve,Valve,,,...,True,True,True,0,Very Positive,81.0,2243112.0,Mostly Positive,72.0,23395.0
2,2,2215430,Ghost of Tsushima DIRECTOR'S CUT,"16 May, 2024","Action, Adventure","Single-player, Online Co-op, Steam Achievement...",Sucker Punch Productions,PlayStation PC LLC,,,...,True,False,False,0,Very Positive,89.0,12294.0,,,
3,3,1245620,ELDEN RING,"24 Feb, 2022","Action, RPG","Single-player, Online PvP, Online Co-op, Steam...",FromSoftware Inc.,FromSoftware Inc.,,,...,True,False,False,6,Very Positive,93.0,605191.0,Very Positive,94.0,7837.0
4,4,1085660,Destiny 2,"1 Oct, 2019","Action, Adventure, Free to Play","Single-player, Online PvP, Online Co-op, Steam...",Bungie,Bungie,,,...,True,False,False,0,Very Positive,80.0,594713.0,Mostly Positive,73.0,4845.0


I will remove the rows with NaN values in the 'genres' column, because there are few options for handling them. Asking a language model to predict the genre may not yield accurate results, so I take the safer approach of dropping these rows.

In [None]:
data = data.dropna(subset=['genres'])

In [None]:
data.shape

(42410, 25)

The dataset now has 87 fewer rows; 42410 rows remain.

Let's also remove the rows with NaN values in the 'categories' column, as there is no reliable way to predict a game's categories.

In [None]:
data = data.dropna(subset=['categories'])

In [None]:
data.shape

(42401, 25)

The dataset now has only 9 fewer rows, although 'categories' originally had 45 NaN values. This means that 36 of those rows also had NaN in 'genres' and were already dropped in the previous step. 42401 rows remain.
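The overlap between the two columns can also be measured directly before dropping anything, instead of being inferred from the change in `data.shape`. The sketch below uses a toy frame with illustrative values; on the real dataframe the same expressions would be applied to `data[['genres', 'categories']]`.

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "genres": ["Action", None, None, "RPG"],
    "categories": ["Single-player", None, "Co-op", None],
})

missing = toy[["genres", "categories"]].isna()

# Rows where both columns are NaN.
both_missing = int(missing.all(axis=1).sum())

# Rows where only 'categories' is NaN (these survive a drop on 'genres').
only_categories = int((missing["categories"] & ~missing["genres"]).sum())

print(both_missing, only_categories)
```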

In [None]:
data.isna().sum()

Unnamed: 0                  0
app_id                      0
title                       0
release_date                0
genres                      0
categories                  0
developer                 155
publisher                 176
original_price          37557
discount_percentage     37557
discounted_price          240
dlc_available               0
age_rating                  0
content_descriptor      40029
about_description         102
win_support                 0
mac_support                 0
linux_support               0
awards                      0
overall_review           2442
overall_review_%         2442
overall_review_count     2442
recent_review           36912
recent_review_%         36912
recent_review_count     36912
dtype: int64

For the 'developer' column, an LLM can usually provide accurate information, so let's use one to fill in the missing values.

In [None]:
co = cohere.Client(api_key="DELETED THIS BEFORE SUBMITTING THE COLAB")

def fetch_developer_with_chatbot(data):
    # Ask the chatbot for the developer of every game that lacks one.
    for index, row in data.iterrows():
        if pd.isna(row['developer']):
            game_title = row['title']
            prompt = f"Retrieve who is the developer company for {game_title}. I need only the name, without any additional words."
            response = co.chat(message=prompt)
            print(response)
            # Pull the reply out of the response's string form; this pattern
            # assumes the reply itself contains no apostrophes.
            response_string = str(response)
            match = re.search(r"text='([^']*)'", response_string)
            if match:
                extracted_text = match.group(1)
                data.loc[data['title'] == game_title, 'developer'] = extracted_text
                print(data.loc[data['title'] == game_title])
            else:
                print("No match found.")
            time.sleep(6)  # There is a maximum of 10 requests per minute with the free trial


fetch_developer_with_chatbot(data)

     Unnamed: 0  app_id                                        title  \
456         456  124923  The Witcher 3: Wild Hunt - Complete Edition   

     release_date genres                                         categories  \
456  30 Aug, 2016    RPG  Single-player, Downloadable Content, Steam Ach...   

          developer publisher original_price discount_percentage  ...  \
456  CD Projekt Red       NaN      ₹2,199.00                -75%  ...   

    win_support  mac_support  linux_support awards overall_review  \
456        True        False          False      0            NaN   

     overall_review_%  overall_review_count  recent_review  recent_review_%  \
456               NaN                   NaN            NaN              NaN   

    recent_review_count  
456                 NaN  

[1 rows x 25 columns]
     Unnamed: 0  app_id                             title  release_date  \
968         968    1679  Oblivion Game of the Year Deluxe  16 Jun, 2009   

    genres               
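The same fill loop is used for several columns, so it could be written once and parameterised by column. The sketch below is one possible shape, not the code that produced the output above. `fill_missing_with_lookup` and its `lookup` argument are hypothetical names, and the lookup here is a plain dictionary rather than a chatbot call. It visits only the rows that are missing a value and writes with `.at[index, column]`, so other rows that happen to share the same title are left untouched.

```python
import pandas as pd

def fill_missing_with_lookup(frame, column, lookup):
    """Fill NaNs in `column` by calling lookup(title) for each affected row."""
    missing_index = frame.index[frame[column].isna()]
    for index in missing_index:
        answer = lookup(frame.at[index, "title"])
        if answer:  # skip None or empty answers and leave the NaN in place
            frame.at[index, column] = answer.strip()
    return frame

# Toy data and a dictionary-based lookup, both illustrative only.
toy = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "developer": ["Studio A", None, None],
})
known = {"Game B": " Studio B "}

fill_missing_with_lookup(toy, "developer", lambda title: known.get(title))
print(toy)
```

With a chatbot, `lookup` would be a small function that sends the prompt, waits to respect the rate limit, and returns the reply text.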

In [None]:
data.isna().sum()

Unnamed: 0                  0
app_id                      0
title                       0
release_date                0
genres                      0
categories                  0
developer                   0
publisher                 176
original_price          37557
discount_percentage     37557
discounted_price          240
dlc_available               0
age_rating                  0
content_descriptor      40029
about_description         102
win_support                 0
mac_support                 0
linux_support               0
awards                      0
overall_review           2442
overall_review_%         2442
overall_review_count     2442
recent_review           36912
recent_review_%         36912
recent_review_count     36912
dtype: int64

As can be seen, this worked: 'developer' no longer has any missing values. Let's save the new dataframe to a CSV file.

In [None]:
data.to_csv('data_v2.csv')

Work with the data from the newly saved dataset.

In [None]:
data = pd.read_csv('/content/data_v2.csv')
data.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,original_price,...,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count,recent_review,recent_review_%,recent_review_count
0,0,0,730,Counter-Strike 2,"21 Aug, 2012","Action, Free to Play","Cross-Platform Multiplayer, Steam Trading Card...",Valve,Valve,,...,True,False,True,1,Very Positive,87.0,8062218.0,Mostly Positive,79.0,57466.0
1,1,1,570,Dota 2,"9 Jul, 2013","Action, Strategy, Free to Play","Steam Trading Cards, Steam Workshop, SteamVR C...",Valve,Valve,,...,True,True,True,0,Very Positive,81.0,2243112.0,Mostly Positive,72.0,23395.0
2,2,2,2215430,Ghost of Tsushima DIRECTOR'S CUT,"16 May, 2024","Action, Adventure","Single-player, Online Co-op, Steam Achievement...",Sucker Punch Productions,PlayStation PC LLC,,...,True,False,False,0,Very Positive,89.0,12294.0,,,
3,3,3,1245620,ELDEN RING,"24 Feb, 2022","Action, RPG","Single-player, Online PvP, Online Co-op, Steam...",FromSoftware Inc.,FromSoftware Inc.,,...,True,False,False,6,Very Positive,93.0,605191.0,Very Positive,94.0,7837.0
4,4,4,1085660,Destiny 2,"1 Oct, 2019","Action, Adventure, Free to Play","Single-player, Online PvP, Online Co-op, Steam...",Bungie,Bungie,,...,True,False,False,0,Very Positive,80.0,594713.0,Mostly Positive,73.0,4845.0


For the 'publisher' column, an LLM can also usually provide accurate information, so let's use one to fill in the missing values.

In [None]:
co = cohere.Client(api_key="DELETED THIS BEFORE SUBMITTING THE COLAB")

def fetch_publisher_with_chatbot(data):
    # Ask the chatbot for the publisher of every game that lacks one.
    for index, row in data.iterrows():
        if pd.isna(row['publisher']):
            game_title = row['title']
            prompt = f"Retrieve who is the publisher company for {game_title}. I need only the name, without any additional words."
            response = co.chat(message=prompt)
            print(response)
            # Pull the reply out of the response's string form; this pattern
            # assumes the reply itself contains no apostrophes.
            response_string = str(response)
            match = re.search(r"text='([^']*)'", response_string)
            if match:
                extracted_text = match.group(1)
                data.loc[data['title'] == game_title, 'publisher'] = extracted_text
                print(data.loc[data['title'] == game_title])
            else:
                print("No match found.")
            time.sleep(6)  # There is a maximum of 10 requests per minute with the free trial


fetch_publisher_with_chatbot(data)

     Unnamed: 0.1  Unnamed: 0  app_id  \
456           456         456  124923   

                                           title  release_date genres  \
456  The Witcher 3: Wild Hunt - Complete Edition  30 Aug, 2016    RPG   

                                            categories       developer  \
456  Single-player, Downloadable Content, Steam Ach...  CD Projekt Red   

          publisher original_price  ... win_support mac_support  \
456  CD Projekt Red      ₹2,199.00  ...        True       False   

     linux_support  awards overall_review overall_review_%  \
456          False       0            NaN              NaN   

     overall_review_count  recent_review  recent_review_%  recent_review_count  
456                   NaN            NaN              NaN                  NaN  

[1 rows x 26 columns]
     Unnamed: 0.1  Unnamed: 0  app_id                             title  \
963           968         968    1679  Oblivion Game of the Year Deluxe   

     release_date genres 

In [None]:
data.isna().sum()

Unnamed: 0.1                0
Unnamed: 0                  0
app_id                      0
title                       0
release_date                0
genres                      0
categories                  0
developer                   0
publisher                   0
original_price          37557
discount_percentage     37557
discounted_price          240
dlc_available               0
age_rating                  0
content_descriptor      40029
about_description         102
win_support                 0
mac_support                 0
linux_support               0
awards                      0
overall_review           2442
overall_review_%         2442
overall_review_count     2442
recent_review           36912
recent_review_%         36912
recent_review_count     36912
dtype: int64

The 'publisher' column no longer has any missing values. Let's save the new dataframe to a CSV file.

In [None]:
data.to_csv('data_v3.csv')

Work with the data from the newly saved dataset.

In [None]:
data = pd.read_csv('/content/data_v3.csv')
data.head()

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,...,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count,recent_review,recent_review_%,recent_review_count
0,0,0,0,730,Counter-Strike 2,"21 Aug, 2012","Action, Free to Play","Cross-Platform Multiplayer, Steam Trading Card...",Valve,Valve,...,True,False,True,1,Very Positive,87.0,8062218.0,Mostly Positive,79.0,57466.0
1,1,1,1,570,Dota 2,"9 Jul, 2013","Action, Strategy, Free to Play","Steam Trading Cards, Steam Workshop, SteamVR C...",Valve,Valve,...,True,True,True,0,Very Positive,81.0,2243112.0,Mostly Positive,72.0,23395.0
2,2,2,2,2215430,Ghost of Tsushima DIRECTOR'S CUT,"16 May, 2024","Action, Adventure","Single-player, Online Co-op, Steam Achievement...",Sucker Punch Productions,PlayStation PC LLC,...,True,False,False,0,Very Positive,89.0,12294.0,,,
3,3,3,3,1245620,ELDEN RING,"24 Feb, 2022","Action, RPG","Single-player, Online PvP, Online Co-op, Steam...",FromSoftware Inc.,FromSoftware Inc.,...,True,False,False,6,Very Positive,93.0,605191.0,Very Positive,94.0,7837.0
4,4,4,4,1085660,Destiny 2,"1 Oct, 2019","Action, Adventure, Free to Play","Single-player, Online PvP, Online Co-op, Steam...",Bungie,Bungie,...,True,False,False,0,Very Positive,80.0,594713.0,Mostly Positive,73.0,4845.0


In [None]:
data.tail()

Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,...,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count,recent_review,recent_review_%,recent_review_count
42396,42396,42492,42492,477910,Wanderer of Teandria,"26 Sep, 2017","Action, Adventure, Indie","Single-player, Steam Achievements, Steam Tradi...",Silentplaygames,Silentplaygames,...,True,False,False,0,,,,,,
42397,42397,42493,42493,1501390,KING PONG,"18 Feb, 2021","Action, Sports, Early Access","Single-player, Online PvP, Steam Achievements,...",Iconik,Iconik,...,True,False,False,1,,,,,,
42398,42398,42494,42494,2683250,Falnarion Tactics III,"25 Dec, 2023",Strategy,"Single-player, Steam Achievements, Steam Cloud...",Team Syukino,Team Syukino,...,True,False,False,0,,,,,,
42399,42399,42495,42495,1508840,Great Exploration VR: New Colony beyond Viking...,"12 Jan, 2021","Action, Casual","Single-player, Tracked Controller Support, VR ...",William at Oxford,William at Oxford,...,True,False,False,0,,,,,,
42400,42400,42496,42496,1191080,SANCTION,"10 Jun, 2022","Action, Adventure, Casual, Indie, Simulation, ...","Single-player, Steam Achievements, Family Shar...",LethalLizard Studios,LethalLizard Studios,...,True,False,False,0,,,,,,


In [None]:
data[['title','about_description']].values

array([['Counter-Strike 2',
        'For over two decades, Counter-Strike has offered an elite competitive experience, one shaped by millions of players from across the globe. And now the next chapter in the CS story is about to begin. This is Counter-Strike 2.'],
       ['Dota 2',
        "Every day, millions of players worldwide enter battle as one of over a hundred Dota heroes. And no matter if it's their 10th hour of play or 1,000th, there's always something new to discover. With regular updates that ensure a constant evolution of gameplay, features, and heroes, Dota 2 has taken on a life of its own."],
       ["Ghost of Tsushima DIRECTOR'S CUT",
        'A storm is coming. Venture into the complete Ghost of Tsushima DIRECTOR’S CUT on PC; forge your own path through this open-world action adventure and uncover its hidden wonders. Brought to you by Sucker Punch Productions, Nixxes Software and PlayStation Studios.'],
       ...,
       ['Falnarion Tactics III',
        'A missing qu

After a little research, I can see that these are the descriptions from Steam.

In [None]:
data[['title','about_description']].loc[data['about_description'].isna()]

Unnamed: 0,title,about_description
456,The Witcher 3: Wild Hunt - Complete Edition,
963,Oblivion Game of the Year Deluxe,
1601,A Plague Tale Bundle,
1960,Alan Wake Franchise,
1997,DARK SOULS III Deluxe Edition,
...,...,...
38574,DYNASTY WARRIORS 9 Complete Edition,
39025,DYNASTY WARRIORS 9 Special Weapon Edition,
39242,Airline Tycoon 2: Gold,
40062,APOX and Legend DLC Combo,


When I search Steam for these games, I see that they have no description, hence the NaN values. I could generate descriptions with an LLM, but they might not be consistent with the rest of the data, so I decided to delete these rows.

In [None]:
data = data.dropna(subset=['about_description'])

In [None]:
data.shape

(42299, 27)

We still have 42299 rows, so very little data has been lost.

In [None]:
data.isna().sum()

Unnamed: 0.2                0
Unnamed: 0.1                0
Unnamed: 0                  0
app_id                      0
title                       0
release_date                0
genres                      0
categories                  0
developer                   0
publisher                   1
original_price          37462
discount_percentage     37462
discounted_price          234
dlc_available               0
age_rating                  0
content_descriptor      39927
about_description           0
win_support                 0
mac_support                 0
linux_support               0
awards                      0
overall_review           2346
overall_review_%         2346
overall_review_count     2346
recent_review           36810
recent_review_%         36810
recent_review_count     36810
dtype: int64

In the meantime, I can see that because I saved my dataset as a CSV file multiple times, I got new columns called Unnamed. I will delete them. I can also see one NaN value in the 'publisher' column, which appeared during the conversion to CSV. I will drop that row.

The columns 'original_price', 'discount_percentage', 'content_descriptor', 'recent_review', 'recent_review_%', and 'recent_review_count' hold data from Steam that is available for only a small share of the games. This produces a lot of NaN values, and even if I wanted to complete these columns through web scraping, I could not, because the data is unavailable. So I am forced to drop these columns.

In [None]:
data = data.dropna(subset=['publisher'])

In [None]:
data.shape

(42298, 27)

In [None]:
data = data.drop(columns=['original_price', 'discount_percentage', 'content_descriptor', 'recent_review', 'recent_review_%', 'recent_review_count', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0'])
data.shape

(42298, 18)
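
The choice of columns to drop can also be derived from the data instead of being typed by hand. The following is a minimal sketch on a made-up toy frame, not the notebook's actual method: the 0.5 cutoff and the names `toy`, `missing_share`, and `sparse_columns` are assumptions for illustration. In the counts above, each of the six dropped columns is missing in well over half of the rows, so a cutoff like this would select the same ones.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Steam data (values are made up for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "original_price": [np.nan, np.nan, np.nan, np.nan, 100.0],
    "recent_review": [np.nan, "Positive", np.nan, np.nan, np.nan],
    "overall_review_%": [87.0, np.nan, 89.0, 93.0, 80.0],
})

# Fraction of missing values per column (True counts as 1 in the mean).
missing_share = toy.isna().mean()

# Hypothetical cutoff: treat a column as too sparse if more than half is missing.
threshold = 0.5
sparse_columns = missing_share[missing_share > threshold].index.tolist()

# Drop the sparse columns; the remaining ones keep their original order.
toy_reduced = toy.drop(columns=sparse_columns)
print(sparse_columns)
print(toy_reduced.columns.tolist())
```

Columns below the cutoff, such as the target here, are kept and have to be handled separately.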

In [None]:
data.isna().sum()

app_id                     0
title                      0
release_date               0
genres                     0
categories                 0
developer                  0
publisher                  0
discounted_price         234
dlc_available              0
age_rating                 0
about_description          0
win_support                0
mac_support                0
linux_support              0
awards                     0
overall_review          2346
overall_review_%        2346
overall_review_count    2346
dtype: int64

In [None]:
print(data['discounted_price'].unique())

['Free' '₹3,999.00' '₹3,599.00' ... '₹461.00' '₹1,075.00' '₹150,000.00']


The currency is the Indian Rupee. Let's transform this column from the object data type to numeric. I will strip the ₹ symbol and the thousands separators, and I will change Free to 0.

In [None]:
for value in data['discounted_price'].unique():
    if pd.isna(value):
        continue  # missing prices are handled later
    if value == 'Free':
        data.loc[data['discounted_price'] == value, 'discounted_price'] = 0
    else:
        # Drop the leading ₹ and the commas, e.g. '₹3,999.00' -> 3999.0
        data.loc[data['discounted_price'] == value, 'discounted_price'] = float(value[1:].replace(',', ''))

# Convert the column from object to a numeric dtype
data['discounted_price'] = pd.to_numeric(data['discounted_price'])
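
The same conversion could probably be written without a Python loop by using pandas string methods. This is only a sketch on a small made-up Series (`prices`, `cleaned`, and `numeric_prices` are illustrative names). Unlike the loop above, it removes the ₹ symbol explicitly rather than dropping the first character.

```python
import numpy as np
import pandas as pd

# Made-up sample in the same format as the 'discounted_price' column.
prices = pd.Series(["Free", "₹3,999.00", "₹461.00", np.nan, "₹150,000.00"])

cleaned = (
    prices.replace("Free", "0")              # free games cost 0
    .str.replace("₹", "", regex=False)       # strip the currency symbol
    .str.replace(",", "", regex=False)       # strip thousands separators
)

# Missing values pass through the string methods and stay NaN.
numeric_prices = pd.to_numeric(cleaned)
print(numeric_prices.tolist())
```

On the full column, this avoids one boolean-mask assignment per unique price, which should be noticeably faster when there are thousands of distinct values.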

In [None]:
data.head()

Unnamed: 0,app_id,title,release_date,genres,categories,developer,publisher,discounted_price,dlc_available,age_rating,about_description,win_support,mac_support,linux_support,awards,overall_review,overall_review_%,overall_review_count
0,730,Counter-Strike 2,"21 Aug, 2012","Action, Free to Play","Cross-Platform Multiplayer, Steam Trading Card...",Valve,Valve,0.0,1,0,"For over two decades, Counter-Strike has offer...",True,False,True,1,Very Positive,87.0,8062218.0
1,570,Dota 2,"9 Jul, 2013","Action, Strategy, Free to Play","Steam Trading Cards, Steam Workshop, SteamVR C...",Valve,Valve,0.0,2,0,"Every day, millions of players worldwide enter...",True,True,True,0,Very Positive,81.0,2243112.0
2,2215430,Ghost of Tsushima DIRECTOR'S CUT,"16 May, 2024","Action, Adventure","Single-player, Online Co-op, Steam Achievement...",Sucker Punch Productions,PlayStation PC LLC,3999.0,0,1,A storm is coming. Venture into the complete G...,True,False,False,0,Very Positive,89.0,12294.0
3,1245620,ELDEN RING,"24 Feb, 2022","Action, RPG","Single-player, Online PvP, Online Co-op, Steam...",FromSoftware Inc.,FromSoftware Inc.,3599.0,2,1,"THE NEW FANTASY ACTION RPG. Rise, Tarnished, a...",True,False,False,6,Very Positive,93.0,605191.0
4,1085660,Destiny 2,"1 Oct, 2019","Action, Adventure, Free to Play","Single-player, Online PvP, Online Co-op, Steam...",Bungie,Bungie,0.0,14,1,Destiny 2 is an action MMO with a single evolv...,True,False,False,0,Very Positive,80.0,594713.0


In [None]:
data.dtypes

app_id                    int64
title                    object
release_date             object
genres                   object
categories               object
developer                object
publisher                object
discounted_price        float64
dlc_available             int64
age_rating                int64
about_description        object
win_support                bool
mac_support                bool
linux_support              bool
awards                    int64
overall_review           object
overall_review_%        float64
overall_review_count    float64
dtype: object

In [None]:
data.describe(include=['object'])

Unnamed: 0,title,release_date,genres,categories,developer,publisher,about_description,overall_review
count,42298,42298,42298,42298,42298,42298,42298,39952
unique,42177,4415,1515,4559,25139,21022,42020,9
top,Under Pressure,"28 Mar, 2024","Action, Indie","Single-player, Family Sharing",Choice of Games,Big Fish Games,Find the objects that are hidden on the map.,Very Positive
freq,3,66,2374,3808,128,244,47,11128


The last step is to take care of the missing values in the 'discounted_price', 'overall_review', 'overall_review_%', and 'overall_review_count' columns.

'discounted_price' has few missing values, so I will replace them with the mode.

'overall_review_%', however, is my target column, and I don't want to distort it with imputed values, so I will delete the rows where it is missing. This also resolves the missing values in the 'overall_review' and 'overall_review_count' columns.

In [None]:
data['discounted_price'] = data['discounted_price'].fillna(data['discounted_price'].mode()[0])

In [None]:
data = data.dropna(subset=['overall_review_%'])

In [None]:
data.shape

(39952, 18)
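
To make the two steps above easier to follow, here is a small sketch on made-up values (the frame `toy` and the name `most_common_price` are assumptions for illustration). Note that `Series.mode()` returns a Series, because ties are possible, with the values sorted in ascending order, so `[0]` picks the smallest of the most frequent values.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing price and one missing target value.
toy = pd.DataFrame({
    "discounted_price": [0.0, 0.0, 3999.0, np.nan, 461.0],
    "overall_review_%": [87.0, np.nan, 89.0, 93.0, 80.0],
})

# Impute the feature: fill missing prices with the most frequent price.
most_common_price = toy["discounted_price"].mode()[0]
toy["discounted_price"] = toy["discounted_price"].fillna(most_common_price)

# Do not impute the target: drop the rows where it is missing.
toy = toy.dropna(subset=["overall_review_%"])
print(most_common_price, toy.shape)
```

Imputing the feature keeps the row available for training, while dropping rows with a missing target avoids training on invented labels.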

In [None]:
data.isna().sum()

app_id                  0
title                   0
release_date            0
genres                  0
categories              0
developer               0
publisher               0
discounted_price        0
dlc_available           0
age_rating              0
about_description       0
win_support             0
mac_support             0
linux_support           0
awards                  0
overall_review          0
overall_review_%        0
overall_review_count    0
dtype: int64

Great! All the NaN values have been taken care of. Now I can proceed to the next step. I will save this dataset to a new CSV file and continue [here](https://colab.research.google.com/drive/1C4q3hDfuIoXHq2MEP_IQxMHt4IBeBAM7?usp=sharing).

In [None]:
data.to_csv('data_v4.csv')
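
The Unnamed columns removed earlier come from writing the DataFrame index to the file on every save. One way to avoid them is to pass `index=False` to `to_csv`. This is a sketch on a toy frame written to in-memory buffers, not a change to the cell above; `toy`, `cols_with_index`, and `cols_without_index` are illustrative names.

```python
import io

import pandas as pd

toy = pd.DataFrame({"app_id": [730, 570], "title": ["Counter-Strike 2", "Dota 2"]})

# Default behaviour: the index is written as an extra first column without a name.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'app_id', 'title']
print(cols_without_index)  # ['app_id', 'title']
```

If the file has already been saved with the index, reading it back with `pd.read_csv(path, index_col=0)` restores the index instead of creating an Unnamed column.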