# Analysis of video game sales, by Deborah Thomas.

<div style="background-color: rgb(255, 176, 155); padding: 10px; border-radius: 5px;">
    <h2>Introduction</h2>
</div>

#### This notebook analyzes historic video game sales, from 1980 to 2016, from the (fictional) online store called "Ice". The dataset includes sales from these three regions:
- North America
- Europe
- Japan
#### I will analyze the following to understand which video games are likely to be successful:
- Video game platforms
- Year of release
- Video game genres
- Critics' scores
- Users' scores
- ESRB's ratings (Entertainment Software Rating Board)

<div style="background-color: rgb(255, 176, 155); padding: 10px; border-radius: 5px;">
    <h2>Import libraries, and read in the dataset</h2>
</div>

In [1]:
import pandas as pd
import numpy as np

from IPython.display import Image

In [2]:
Image(url='../girl_videoGame_dog_ice.webp', width=300, height=300)

In [3]:
games = pd.read_csv('../games.csv')
display(games.head(5))

Unnamed: 0,Name,Platform,Year_of_Release,Genre,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score,User_Score,Rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009.0,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,11.27,8.89,10.22,1.0,,,


<div style="background-color: rgb(255, 176, 155); padding: 10px; border-radius: 5px;">
    <h2>Basic summary of the data</h2>
</div>

In [4]:
print(f"This dataset has {games.shape[1]} columns, and {games.shape[0]} rows.")

This dataset has 11 columns, and 16715 rows.


In [5]:
games.describe()

Unnamed: 0,Year_of_Release,NA_sales,EU_sales,JP_sales,Other_sales,Critic_Score
count,16446.0,16715.0,16715.0,16715.0,16715.0,8137.0
mean,2006.484616,0.263377,0.14506,0.077617,0.047342,68.967679
std,5.87705,0.813604,0.503339,0.308853,0.186731,13.938165
min,1980.0,0.0,0.0,0.0,0.0,13.0
25%,2003.0,0.0,0.0,0.0,0.0,60.0
50%,2007.0,0.08,0.02,0.0,0.01,71.0
75%,2010.0,0.24,0.11,0.04,0.03,79.0
max,2016.0,41.36,28.96,10.22,10.57,98.0


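#### North America had the most sales: its mean sales per game (0.26) are higher than those of Europe (0.15), Japan (0.08), and the other regions (0.05).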
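One way to check this claim directly is to total each regional sales column and compare the sums. The following is a minimal sketch, assuming the original column names; the `sample` frame is illustrative stand-in data built from the five rows previewed by `games.head(5)`, not the full dataset.

```python
import pandas as pd

# Stand-in data: the five rows shown by games.head(5) above.
sample = pd.DataFrame({
    "NA_sales": [41.36, 29.08, 15.68, 15.61, 11.27],
    "EU_sales": [28.96, 3.58, 12.76, 10.93, 8.89],
    "JP_sales": [3.77, 6.81, 3.79, 3.28, 10.22],
})

# Sum each regional column, then sort so the largest region comes first.
region_totals = sample.sum().sort_values(ascending=False)
top_region = region_totals.idxmax()

print(region_totals)
print("Region with the highest total:", top_region)
```

On the real data, the same two lines would be applied to `games[['NA_sales', 'EU_sales', 'JP_sales']]`.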

In [6]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Name             16713 non-null  object 
 1   Platform         16715 non-null  object 
 2   Year_of_Release  16446 non-null  float64
 3   Genre            16713 non-null  object 
 4   NA_sales         16715 non-null  float64
 5   EU_sales         16715 non-null  float64
 6   JP_sales         16715 non-null  float64
 7   Other_sales      16715 non-null  float64
 8   Critic_Score     8137 non-null   float64
 9   User_Score       10014 non-null  object 
 10  Rating           9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


In [7]:
games.Critic_Score.max()

98.0

In [8]:
games.User_Score.value_counts()

User_Score
tbd    2424
7.8     324
8       290
8.2     282
8.3     254
       ... 
1.1       2
1.9       2
9.6       2
0         1
9.7       1
Name: count, Length: 96, dtype: int64

#### A quick study of the data shows:
- North America had the highest video game sales overall.
    
- Column names need to be lowercase.

- Year_of_Release and Critic_Score should not have decimals. These datatypes will need to change to int.
- User_Score appears to only go up to 10, so it can keep its decimals. However, it is stored as object (because of the 'tbd' entries), so the datatype will need to change to float.
- The Rating column will need to change to type category.

- These columns have NaN values and / or missing data: Name, Year_of_Release, Genre, Critic_Score, User_Score, Rating.
- Luckily, there is no data missing from the sales columns: NA_sales, EU_sales, JP_sales, Other_sales.
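The missing-value observations above can also be read off programmatically instead of from `info()`. This is a small sketch, assuming a stand-in `sample` frame with a few of the original columns (values taken from the preview rows); on the real data the same calls would run on `games`.

```python
import numpy as np
import pandas as pd

# Stand-in data: three rows in the shape of the dataset.
sample = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros.", "Mario Kart Wii"],
    "Year_of_Release": [2006.0, 1985.0, 2008.0],
    "Critic_Score": [76.0, np.nan, 82.0],
    "User_Score": ["8", np.nan, "8.3"],
    "Rating": ["E", np.nan, "E"],
})

# Count missing values per column and keep only the columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```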

<div style="background-color: rgb(255, 176, 155); padding: 10px; border-radius: 5px;">
    <h2>Clean the data</h2>
</div>

### Rename the column names to lowercase.

In [9]:
# New column names
new_columns = ['name', 'platform', 'year', 'genre', 'sales_na', 'sales_eu', 'sales_jp', 'sales_other', 'critic_score', 'user_score', 'rating']

# Assign the new column names to the DataFrame
games.columns = new_columns

#Display dataframe with new lowercase names
display(games.head(3))

Unnamed: 0,name,platform,year,genre,sales_na,sales_eu,sales_jp,sales_other,critic_score,user_score,rating
0,Wii Sports,Wii,2006.0,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985.0,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008.0,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
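Assigning a list to `games.columns` depends on the columns being in exactly the expected order. A possible alternative, sketched here under the assumption that the original headers are as shown earlier, is an explicit mapping passed to `rename`; `column_map` and `sample` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical mapping from the original column names to the new ones.
column_map = {
    "Name": "name", "Platform": "platform", "Year_of_Release": "year",
    "Genre": "genre", "NA_sales": "sales_na", "EU_sales": "sales_eu",
    "JP_sales": "sales_jp", "Other_sales": "sales_other",
    "Critic_Score": "critic_score", "User_Score": "user_score",
    "Rating": "rating",
}

# Stand-in frame with the original headers and no rows.
sample = pd.DataFrame(columns=list(column_map))

# rename() matches by name, so column order does not matter.
renamed = sample.rename(columns=column_map)
print(list(renamed.columns))
```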


### Get rid of NaN values.

#### Impute data for the 'year' column.

In [10]:
# Group by 'name' and 'year' and count occurrences
grouped = games.groupby(['name', 'year']).size().reset_index(name='count')
print("Grouped DataFrame with Counts:")
print(grouped)

Grouped DataFrame with Counts:
                               name    year  count
0                    Beyblade Burst  2016.0      1
1                 Fire Emblem Fates  2015.0      1
2              Frozen: Olaf's Quest  2013.0      2
3        Haikyu!! Cross Team Match!  2016.0      1
4                 Tales of Xillia 2  2012.0      1
...                             ...     ...    ...
12190            thinkSMART FAMILY!  2010.0      1
12191    thinkSMART: Chess for Kids  2011.0      1
12192                  uDraw Studio  2010.0      1
12193  uDraw Studio: Instant Artist  2011.0      2
12194  ¡Shin Chan Flipa en colores!  2007.0      1

[12195 rows x 3 columns]


In [11]:
# Additional analysis: determine the most common year for each game.
# Group the rows by 'name' and apply a function to each group's 'year' values.
# mode() returns the most frequent value(s) of 'year' within the group.
# .iloc[0] selects the first mode value if mode() returns multiple values.
# If mode() is empty (every year for that name is NaN), assign np.nan instead.
common_year = games.groupby('name')['year'].apply(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)

print("\nMost Common Year for Each Game Name:")
print(common_year)


Most Common Year for Each Game Name:
name
 Beyblade Burst                 2016.0
 Fire Emblem Fates              2015.0
 Frozen: Olaf's Quest           2013.0
 Haikyu!! Cross Team Match!     2016.0
 Tales of Xillia 2              2012.0
                                 ...  
thinkSMART: Chess for Kids      2011.0
uDraw Studio                    2010.0
uDraw Studio: Instant Artist    2011.0
wwe Smackdown vs. Raw 2006         NaN
¡Shin Chan Flipa en colores!    2007.0
Name: year, Length: 11559, dtype: float64


In [12]:
#Take the above info, and now fill in the missing values.  
#Create a dictionary from the common_year Series
common_year_dict = common_year.to_dict()

#Use this dictionary to fill in missing 'year' values.
#If the game's name is found in common_year_dict, the most common year for that game is used;
#otherwise the original (missing) value is kept.
games['year'] = games.apply(lambda row: common_year_dict.get(row['name'], row['year']) if pd.isna(row['year']) else row['year'], axis=1)
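The row-by-row `apply` above could also be written as a vectorized fill. The sketch below is one possible equivalent on made-up data (`sample`, `first_mode`, and the game names are illustrative): it maps each name to its most common year and uses that value only where the year is missing.

```python
import numpy as np
import pandas as pd

# Made-up data: "Game A" has one known and one missing year, "Game B" has none.
sample = pd.DataFrame({
    "name": ["Game A", "Game A", "Game B", "Game C"],
    "year": [2010.0, np.nan, np.nan, 2005.0],
})

def first_mode(years):
    """Return the most frequent year, or NaN if no year is known."""
    modes = years.mode()
    return modes.iloc[0] if not modes.empty else np.nan

common_year = sample.groupby("name")["year"].apply(first_mode)

# Look up each row's name and fill only the rows where 'year' is missing.
sample["year"] = sample["year"].fillna(sample["name"].map(common_year))
print(sample)
```

Games whose year is unknown on every platform ("Game B" here) stay NaN, which matches the missing values that remain after this step in the notebook.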

In [13]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          16713 non-null  object 
 1   platform      16715 non-null  object 
 2   year          16569 non-null  float64
 3   genre         16713 non-null  object 
 4   sales_na      16715 non-null  float64
 5   sales_eu      16715 non-null  float64
 6   sales_jp      16715 non-null  float64
 7   sales_other   16715 non-null  float64
 8   critic_score  8137 non-null   float64
 9   user_score    10014 non-null  object 
 10  rating        9949 non-null   object 
dtypes: float64(6), object(5)
memory usage: 1.4+ MB


#### There are still 146 NaN values in the 'year' column (16715 rows, of which 16569 are non-null).

In [14]:
#Fill the remaining 146 NaN values with 0 before changing type from float to int (int cannot hold NaN).
#This gets rid of the decimal in the year column. Note that 0 is a placeholder, not a real year.
games['year'] = games['year'].fillna(0).astype(int)

# Verify the changes
print("\nDataFrame after cleaning 'year' column:")
display(games)


DataFrame after cleaning 'year' column:


Unnamed: 0,name,platform,year,genre,sales_na,sales_eu,sales_jp,sales_other,critic_score,user_score,rating
0,Wii Sports,Wii,2006,Sports,41.36,28.96,3.77,8.45,76.0,8,E
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,,,
2,Mario Kart Wii,Wii,2008,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009,Sports,15.61,10.93,3.28,2.95,80.0,8,E
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.00,,,
...,...,...,...,...,...,...,...,...,...,...,...
16710,Samurai Warriors: Sanada Maru,PS3,2016,Action,0.00,0.00,0.01,0.00,,,
16711,LMA Manager 2007,X360,2006,Sports,0.00,0.01,0.00,0.00,,,
16712,Haitaka no Psychedelica,PSV,2016,Adventure,0.00,0.00,0.01,0.00,,,
16713,Spirits & Spells,GBA,2003,Platform,0.01,0.00,0.00,0.00,,,


In [15]:
games.year.value_counts()

year
2008    1441
2009    1430
2010    1270
2007    1202
2011    1153
2006    1019
2005     948
2002     845
2003     783
2004     764
2012     662
2015     606
2014     581
2013     548
2016     502
2001     486
1998     379
2000     351
1999     339
1997     289
1996     263
1995     219
0        146
1994     122
1993      62
1981      46
1992      43
1991      41
1982      36
1986      21
1989      17
1983      17
1990      16
1987      16
1988      15
1985      14
1984      14
1980       9
Name: count, dtype: int64
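The 146 rows with year 0 are placeholders, so any later statistic on `year` (minimum, mean, plots by year) would need to filter them out. A possible alternative, shown here only as a sketch on a made-up Series, is pandas' nullable integer dtype `Int64`, which stores whole numbers while keeping missing values as `<NA>`.

```python
import numpy as np
import pandas as pd

# Made-up years with one missing value.
years = pd.Series([2006.0, np.nan, 2008.0], name="year")

# Nullable integer dtype: whole-number years without a sentinel value.
years_nullable = years.astype("Int64")
print(years_nullable)

# Missing years are skipped by aggregations instead of counting as year 0.
print(years_nullable.min())
```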

#### Get rid of NaN values and decimals in the 'critic_score' column. There are currently 8578 NaN values.

In [16]:
games.critic_score.value_counts()

critic_score
70.0    256
71.0    254
75.0    245
78.0    240
73.0    238
       ... 
20.0      3
21.0      1
17.0      1
22.0      1
13.0      1
Name: count, Length: 82, dtype: int64

### Impute the data for 'critic_score'

In [17]:
# Group by 'genre' and calculate the median critic_score for each group
genre_medians_critic = games.groupby('genre')['critic_score'].transform('median')

# Fill NaN values in 'critic_score' column with the Critic's median score for that genre
games['critic_score'] = games['critic_score'].fillna(genre_medians_critic)
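To make the group-median imputation concrete, here is a minimal sketch on made-up data (the `sample` frame and its scores are illustrative). `transform('median')` returns one value per row, aligned with the original index, so it can be passed straight to `fillna`.

```python
import numpy as np
import pandas as pd

# Made-up scores: one missing value in each genre.
sample = pd.DataFrame({
    "genre": ["Sports", "Sports", "Sports", "Racing", "Racing"],
    "critic_score": [76.0, 80.0, np.nan, 82.0, np.nan],
})

# Median per genre, broadcast back to every row of that genre.
genre_medians = sample.groupby("genre")["critic_score"].transform("median")

# Only the missing scores are replaced; existing scores are left untouched.
sample["critic_score"] = sample["critic_score"].fillna(genre_medians)
print(sample)
```

The missing Sports score becomes 78.0 (the median of 76 and 80) and the missing Racing score becomes 82.0. Rows without a genre belong to no group, so they would be expected to stay NaN.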

In [18]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16715 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          16713 non-null  object 
 1   platform      16715 non-null  object 
 2   year          16715 non-null  int64  
 3   genre         16713 non-null  object 
 4   sales_na      16715 non-null  float64
 5   sales_eu      16715 non-null  float64
 6   sales_jp      16715 non-null  float64
 7   sales_other   16715 non-null  float64
 8   critic_score  16713 non-null  float64
 9   user_score    10014 non-null  object 
 10  rating        9949 non-null   object 
dtypes: float64(5), int64(1), object(5)
memory usage: 1.4+ MB


#### A few of the columns have only 2 rows with missing values. Could it be that the same 2 rows have missing values in many columns? If so, those two rows should be dropped from the dataframe.

In [19]:
# Filter rows where 'critic_score' is NaN
nan_critic_score_rows = games[games['critic_score'].isna()]

# Display the filtered rows
print("Rows with NaN values in 'critic_score':")
display(nan_critic_score_rows)

Rows with NaN values in 'critic_score':


Unnamed: 0,name,platform,year,genre,sales_na,sales_eu,sales_jp,sales_other,critic_score,user_score,rating
659,,GEN,1993,,1.78,0.53,0.0,0.08,,,
14244,,GEN,1993,,0.0,0.0,0.03,0.0,,,


#### Yes, those 2 rows have multiple columns with NaN values. These 2 rows will be dropped from the dataframe.

In [20]:
# Drop the 2 rows where 'critic_score' is NaN
games = games.dropna(subset=['critic_score'])

# Verify the changes by displaying the modified DataFrame
print("\nDataFrame after dropping rows with NaN values in 'critic_score':")
display(games.head(5))


DataFrame after dropping rows with NaN values in 'critic_score':


Unnamed: 0,name,platform,year,genre,sales_na,sales_eu,sales_jp,sales_other,critic_score,user_score,rating
0,Wii Sports,Wii,2006,Sports,41.36,28.96,3.77,8.45,76.0,8.0,E
1,Super Mario Bros.,NES,1985,Platform,29.08,3.58,6.81,0.77,69.0,,
2,Mario Kart Wii,Wii,2008,Racing,15.68,12.76,3.79,3.29,82.0,8.3,E
3,Wii Sports Resort,Wii,2009,Sports,15.61,10.93,3.28,2.95,80.0,8.0,E
4,Pokemon Red/Pokemon Blue,GB,1996,Role-Playing,11.27,8.89,10.22,1.0,74.0,,


### Clean NaN from 'user_score' column, and change datatype to float.

In [21]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16713 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          16713 non-null  object 
 1   platform      16713 non-null  object 
 2   year          16713 non-null  int64  
 3   genre         16713 non-null  object 
 4   sales_na      16713 non-null  float64
 5   sales_eu      16713 non-null  float64
 6   sales_jp      16713 non-null  float64
 7   sales_other   16713 non-null  float64
 8   critic_score  16713 non-null  float64
 9   user_score    10014 non-null  object 
 10  rating        9949 non-null   object 
dtypes: float64(5), int64(1), object(5)
memory usage: 1.5+ MB


#### Since user_score only goes up to 10, I will keep the decimals and impute the median score for each game's genre.

In [22]:
games['user_score'] = games['user_score'].replace('tbd', np.nan)

# Change 'user_score' from object (string) to float type.
games['user_score'] = games['user_score'].astype(float)
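The replace-then-cast steps can also be done in a single call. As a sketch on a made-up Series (not the notebook's data), `pd.to_numeric` with `errors='coerce'` converts any value it cannot parse, such as 'tbd', to NaN and returns a float column.

```python
import pandas as pd

# Made-up scores stored as strings, including a 'tbd' entry and a missing value.
scores = pd.Series(["8", "tbd", "8.3", None], name="user_score")

# errors='coerce' turns anything unparseable (such as 'tbd') into NaN.
scores_numeric = pd.to_numeric(scores, errors="coerce")
print(scores_numeric)
```

The trade-off is that `errors='coerce'` silently hides any other unexpected strings as well, so the explicit `replace('tbd', np.nan)` is the stricter choice if only 'tbd' is expected.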

In [23]:
# Group by 'genre' and calculate the median user_score for each group
genre_medians_user = games.groupby('genre')['user_score'].transform('median')

# Fill NaN values in 'user_score' column with the User's median score for that genre
games['user_score'] = games['user_score'].fillna(genre_medians_user)

In [24]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16713 entries, 0 to 16714
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          16713 non-null  object 
 1   platform      16713 non-null  object 
 2   year          16713 non-null  int64  
 3   genre         16713 non-null  object 
 4   sales_na      16713 non-null  float64
 5   sales_eu      16713 non-null  float64
 6   sales_jp      16713 non-null  float64
 7   sales_other   16713 non-null  float64
 8   critic_score  16713 non-null  float64
 9   user_score    16713 non-null  float64
 10  rating        9949 non-null   object 
dtypes: float64(6), int64(1), object(4)
memory usage: 1.5+ MB


### Change datatype of 'rating' column

games.info()