<h1 style='text-align: center; font-size: 50px;'>Patterns Behind Best-Selling Games</h1>

# Introduction:

In this project, we work with data from the online store 'Ice', which sells video games all over the world. User and expert reviews, genres, platforms (e.g. Xbox or PlayStation), and historical data on game sales are available from open sources. Our goal is to clean the data and prepare a report that identifies the patterns determining whether a game succeeds. This will allow us to spot potential big winners and plan advertising campaigns. The dataset is stored in a single file (/datasets/games.csv). During data preprocessing and analysis, we will:

- Load and display the dataset in a standardized format.
- Verify and correct data types.
- Identify and handle missing values.
- Detect and remove duplicate entries.
- Create visualizations to clearly communicate insights from the data.

By following this process, we aim to produce a detailed report that provides actionable insights for business strategy.

# Step 1. Initialization:

In [None]:
# Loading all the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt
import scipy.stats as stats

# Step 2. Load data:

In [None]:
video_games = pd.read_csv('/datasets/games.csv')
video_games.head()

# Step 3. Preparing and Fixing the Data:

In [None]:
# Renaming columns to lowercase names:
video_games = video_games.rename(columns={'Name': 'name',
                                         'Platform': 'platform',
                                         'Year_of_Release': 'year_of_release',
                                         'Genre': 'genre',
                                         'NA_sales': 'na_sales',
                                         'EU_sales': 'eu_sales',
                                         'JP_sales': 'jp_sales',
                                         'Other_sales': 'other_sales',
                                         'Critic_Score': 'critic_score',
                                         'User_Score': 'user_score',
                                         'Rating': 'rating'}
                                )
video_games.head()

In [None]:
# Data overview:
video_games.info()

In [None]:
# Checking for missing values:
video_games.isna().sum()

In [None]:
# Checking for duplicates:
video_games.duplicated().sum()
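
The check above counts only rows that are identical in every column. A minimal sketch of a stricter check on key columns, assuming a small made-up frame (the games and numbers below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample: the first two rows describe the same game on the same
# platform and year, but their sales figures differ
games = pd.DataFrame({
    'name': ['Game A', 'Game A', 'Game B'],
    'platform': ['PS3', 'PS3', 'PS3'],
    'year_of_release': [2012, 2012, 2012],
    'eu_sales': [0.2, 0.01, 0.5],
})

# Fully identical rows: none here
full_dupes = games.duplicated().sum()

# Rows repeating the same (name, platform, year) key: one here
key_dupes = games.duplicated(subset=['name', 'platform', 'year_of_release']).sum()

print(f"Full duplicates: {full_dupes}, key duplicates: {key_dupes}")
```

Whether such key duplicates should be dropped or have their sales merged depends on the data, so this only shows how they could be detected.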

In [None]:
# Dropping rows where 'name', 'year_of_release', 'genre' is missing:
video_games.dropna(subset=['name', 'year_of_release', 'genre'], inplace=True)

In [None]:
# Converting 'year_of_release' to integer years (rows with missing years were dropped above):
video_games['year_of_release'] = video_games['year_of_release'].astype(int)
video_games.head(10)

In [None]:
# Replacing 'tbd' with NaN (np.nan avoids the ambiguous handling of value=None in replace()):
video_games['user_score'] = video_games['user_score'].replace('tbd', np.nan)
video_games.head()

In [None]:
# Converting 'user_score' into Numerical:
video_games['user_score'] = video_games['user_score'].astype(float)
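
Handling 'tbd' and casting to float could also be done in one call. A minimal sketch, assuming a small made-up `raw_scores` series in place of the real column; `pd.to_numeric` with `errors='coerce'` turns any value it cannot parse into NaN:

```python
import pandas as pd

# Hypothetical sample mimicking the raw 'user_score' column:
# numeric strings, the 'tbd' placeholder, and a missing value
raw_scores = pd.Series(['8.0', 'tbd', None, '7.5'], name='user_score')

# errors='coerce' converts anything that is not a number to NaN
clean_scores = pd.to_numeric(raw_scores, errors='coerce')
print(clean_scores)
```

The trade-off is that `errors='coerce'` silently hides any other unexpected string as well, so it is worth inspecting the unique values first.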

In [None]:
# Filling missing values in 'rating' using 'Unknown':
video_games['rating'] = video_games['rating'].fillna('Unknown')
video_games.head(10)

In [None]:
# Double checking for missing values:
video_games.isna().sum()

In [None]:
# Calculating the total sales for each game:
video_games['total_sales'] = video_games[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
video_games.sample(n=5)

# Step 4. Analyzing the data:

In [None]:
# Number of games released in different years:
yearly_game_counts = video_games['year_of_release'].value_counts().sort_index()
yearly_game_counts.head()

In [None]:
yearly_game_counts.describe()

In [None]:
# Creating a Line Chart: 
plt.figure(figsize=(8, 4))
plt.plot(yearly_game_counts)
plt.title('Number of Games Released Yearly')
plt.xlabel('Year of Release')
plt.ylabel('Number of Games')
plt.show()

The dataset covers 37 years, with an average of 452 games released per year, though this varies widely **(standard deviation: 469.66)**. Yearly releases range from 9 (min) to 1,696 (max), peaking around **2005 - 2011** before declining. In 25% of years fewer than 36 games were released, and in another 25% the count was 762 or more. The median is 338 games per year, well below the mean, which reflects the limited data in earlier years, the rapid expansion starting in **1995**, and the decline in recent years.
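
The gap between mean and median mentioned above is typical for skewed counts. A minimal sketch with made-up yearly counts (not the real dataset) showing how a few large years pull the mean above the median:

```python
import pandas as pd

# Hypothetical yearly release counts: three sparse early years, two large late years
counts = pd.Series([10, 20, 30, 400, 900], index=[1980, 1981, 1982, 2008, 2009])

summary = counts.describe()   # count, mean, std, min, quartiles, max
peak_year = counts.idxmax()   # index label (year) with the most releases

print(summary[['mean', '50%', 'std']])
print(f"Peak year: {peak_year}")
```

Here the mean is 272 while the median is only 30, the same direction of skew as in the summary above.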

In [None]:
# Summing total sales per year and platform:
platform_sales = video_games.groupby(['year_of_release', 'platform'])['total_sales'].sum().reset_index()
platform_sales.sample(n=5)

In [None]:
# Sorting 'platform_sales' by total sales and selecting the top 10 year-platform pairs:
top_10_platforms = platform_sales.sort_values(by='total_sales', ascending=False).reset_index(drop=True).head(10)
top_10_platforms

In [None]:
# Sorting 'platform_sales' by total sales and selecting the bottom 10 year-platform pairs:
bottom_10_platforms = platform_sales.sort_values(by='total_sales', ascending=False).reset_index(drop=True).tail(10)
bottom_10_platforms

In [None]:
# Create a figure with 2 subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

# First subplot for 'top_10_platforms': 
sns.barplot(data=top_10_platforms, x='year_of_release', y='total_sales', hue='platform', ax=ax1)
ax1.set_xlabel('Year of Release')
ax1.set_ylabel('Total Sales')
ax1.set_title('Top 10 Platforms by Total Sales')
ax1.tick_params(axis='x', rotation=45)

# Second subplot for 'bottom_10_platforms':
sns.barplot(data=bottom_10_platforms, x='year_of_release', y='total_sales', hue='platform', ax=ax2)
ax2.set_title('Bottom 10 Platforms by Total Sales')
ax2.set_xlabel('Year of Release')
ax2.set_ylabel('Total Sales')
ax2.tick_params(axis='x', rotation=45)


# Adjusting layout to prevent overlap:
plt.tight_layout()
plt.show()

The graphs above show a significant gap between the top and bottom platforms in total sales. The **PS2** dominates the top 10, followed by the **Xbox 360, PS3, and Wii**, showcasing the success of brands like Sony, Microsoft and Nintendo. Most top-performing platforms were released between 1998 and 2010. On the other hand, devices like **3DO, GG, and PCFX**, many of which were released between 1985 and 1995, struggled with much lower sales.
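
Because `platform_sales` holds one row per year and platform, the rankings above compare individual platform-years. A minimal sketch of how lifetime totals per platform could be obtained instead, assuming a made-up frame with the same columns (the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical (year, platform) sales in the same shape as 'platform_sales'
platform_year = pd.DataFrame({
    'year_of_release': [2004, 2005, 2004, 2005, 1995],
    'platform': ['PS2', 'PS2', 'XB', 'XB', '3DO'],
    'total_sales': [200.0, 160.0, 65.0, 49.0, 0.1],
})

# Summing over years gives a single lifetime total per platform
lifetime_sales = (platform_year.groupby('platform')['total_sales']
                  .sum()
                  .sort_values(ascending=False))
print(lifetime_sales)
```

Ranking on lifetime totals keeps a long-lived platform from occupying several of the top 10 slots with different years.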

In [None]:
# Filtering data from 2013 onward:
video_games_from_2013 = video_games[video_games['year_of_release'] >= 2013].reset_index(drop=True)
video_games_from_2013.sample(n=5)

In [None]:
# Calculating the 'total sales_2013' for each game:
video_games_from_2013['total_sales_2013'] = video_games_from_2013[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum(axis=1)
video_games_from_2013.head()

In [None]:
# Grouping 'platform' by 'total sales' in the 'video_games_from_2013':
platform_sales_2013 = video_games_from_2013.groupby(['year_of_release', 'platform'])['total_sales_2013'].sum().reset_index()
platform_sales_2013.head()

In [None]:
# Sorting 'platform_sales_2013' by 'total_sales_2013' and select the top 5:
top_5_platforms = platform_sales_2013.sort_values(by='total_sales_2013', ascending=False).reset_index(drop=True).head()
top_5_platforms

In [None]:
# Sorting 'platform_sales_2013' by 'total_sales_2013' and select the bottom 5:
bottom_5_platforms = platform_sales_2013.sort_values(by='total_sales_2013', ascending=False).reset_index(drop=True).tail()
bottom_5_platforms

From 2013 onward, the **Xbox 360, PS3, and PS4** dominated total sales, highlighting their strong market impact and long-lasting popularity. In contrast, platforms like **PS2**, **PSP** and **Wii** saw a significant decline in sales, indicating the end of their lifecycle.

In [None]:
# Creating the box plot:
plt.figure(figsize=(12, 6))
sns.boxplot(x='platform', y='total_sales_2013', data=video_games_from_2013)
plt.title('Global Sales of 2013 and up by Platform')
plt.xlabel('Platform')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.ylim(0,3)
plt.show()

In [None]:
# Calculating the Average of 'total_sales_2013':
average_sales_by_platform = video_games_from_2013.groupby('platform')['total_sales_2013'].mean().sort_values(ascending=False)
average_sales_by_platform

From 2013 onward, **Xbox 360, PS4, XOne** dominated global sales, with higher average sales compared to other platforms. These platforms had a wide range of game performances, indicating strong popularity.

In contrast, platforms like **PSP, DS, PSV** had much lower sales, showing they were no longer major players in the market during this period. Platforms like **PC** had consistent but lower sales compared to the top platforms.

Overall, the top platforms significantly outperformed others, making them key drivers of the gaming industry after 2013.
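
Average sales can be driven by a handful of blockbusters, which is what the box plot outliers hint at. A minimal sketch comparing mean and median per platform, assuming a made-up frame that reuses the `total_sales_2013` column name (the values are hypothetical):

```python
import pandas as pd

# Hypothetical sales: one blockbuster on PS4, evenly modest titles on PC
games = pd.DataFrame({
    'platform': ['PS4', 'PS4', 'PS4', 'PC', 'PC', 'PC'],
    'total_sales_2013': [0.1, 0.2, 12.0, 0.1, 0.2, 0.3],
})

# Mean, median and number of titles per platform in one table
stats_by_platform = (games.groupby('platform')['total_sales_2013']
                     .agg(['mean', 'median', 'count']))
print(stats_by_platform)
```

In this toy data both platforms have the same median, yet the PS4 mean is far higher because of a single title, so reading the two columns together gives a fuller picture than the mean alone.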

In [None]:
# Filtering data by the chosen platform:
platform_data = video_games_from_2013[video_games_from_2013['platform'] == 'X360']
platform_data.head()

In [None]:
# Create a figure with 2 subplots:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# First scatter plot: user scores vs. 'total_sales_2013':
sns.scatterplot(x='user_score', y='total_sales_2013', data=platform_data, ax=ax1)
ax1.set_title('User Scores vs. Sales for (X360)')
ax1.set_xlabel('User Score')
ax1.set_ylabel('Total Sales')

# Second scatter plot: critic scores vs. 'total_sales_2013':
sns.scatterplot(x='critic_score', y='total_sales_2013', data=platform_data, ax=ax2)
ax2.set_title('Critic Scores vs. Sales for (X360)')
ax2.set_xlabel('Critic Score')
ax2.set_ylabel('Total Sales')

# Adjust layout to prevent overlap:
plt.tight_layout()
plt.show()

In [None]:
# Calculating correlation between User Score and Total Sales:
user_score_corr = platform_data['user_score'].corr(platform_data['total_sales_2013'])
f"Correlation (User Score Vs. Total Sales) for 'X360':{user_score_corr:.2f}"

In [None]:
# Calculating correlation between Critic Score and Total Sales:
critic_score_corr = platform_data['critic_score'].corr(platform_data['total_sales_2013'])
f"Correlation (Critic Score Vs. Total Sales) for 'X360':{critic_score_corr:.2f}"

The analysis shows that **User Scores** have almost no linear relationship with game sales on X360 (correlation: -0.01), while **Critic Scores** show a weak positive correlation (0.35). This suggests that critic reviews may slightly influence sales, although correlation alone does not establish causation.
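
Many games have no score, so it helps to know how many pairs a correlation is based on. A minimal sketch, assuming a made-up frame with the same column names; `Series.corr` computes the Pearson coefficient by default and skips rows where either value is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last game has sales but no critic score
sample = pd.DataFrame({
    'critic_score': [60, 70, 80, 90, np.nan],
    'total_sales_2013': [0.5, 1.0, 1.5, 2.0, 9.9],
})

# Pearson correlation over the rows where both values are present
pearson = sample['critic_score'].corr(sample['total_sales_2013'])

# Number of complete pairs that actually entered the calculation
n_pairs = sample[['critic_score', 'total_sales_2013']].dropna().shape[0]

print(f"Pearson r = {pearson:.2f} based on {n_pairs} games")
```

Reporting the number of pairs next to the coefficient makes it clear how much of the platform's catalogue the figure actually describes.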

In [None]:
# Identifying unique game names on X360:
x360_game = platform_data['name'].unique()
len(x360_game)

In [None]:
# Filtering data across all platforms:
same_game = video_games_from_2013[video_games_from_2013['name'].isin(x360_game)]
same_game.head()

In [None]:
# Grouping by Names and platform to compare Sales:
comparison_sales = same_game.groupby(['name', 'platform'])['total_sales_2013'].sum().reset_index()
comparison_sales.head(10)

After filtering data across all platforms, **X360** appears to have higher sales for multi-platform games than other platforms, while **PS** has moderate sales for the same titles.
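
A long table with one row per game and platform is hard to compare by eye. A minimal sketch of a side-by-side view using `pivot_table`, assuming a made-up frame with the same columns (titles and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical sales: 'Game C' was not released on X360
sales = pd.DataFrame({
    'name': ['Game A', 'Game A', 'Game B', 'Game B', 'Game C'],
    'platform': ['X360', 'PS3', 'X360', 'PS3', 'PS3'],
    'total_sales_2013': [1.0, 1.5, 0.4, 0.3, 2.0],
})

# Keep only titles that exist on X360, then spread platforms into columns
x360_titles = sales.loc[sales['platform'] == 'X360', 'name'].unique()
shared = sales[sales['name'].isin(x360_titles)]
side_by_side = shared.pivot_table(index='name', columns='platform',
                                  values='total_sales_2013', aggfunc='sum')
print(side_by_side)
```

Each row is then one title and each column one platform, so per-title differences can be read directly or summarized with `side_by_side.mean()`.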


In [None]:
# Grouping data by genre and summing total sales:
genre_sales = video_games_from_2013.groupby('genre')['total_sales_2013'].sum().sort_values(ascending=False)
genre_sales

In [None]:
plt.figure(figsize=(8, 4))
genre_sales.plot(kind='bar', color='blue')
plt.title('Total Sales by Genre')
plt.xlabel('Genre')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Creating the box plot:
plt.figure(figsize=(12, 6))
sns.boxplot(x='genre', y='total_sales_2013', data=video_games_from_2013)
plt.title('Total Sales of 2013 and up by Genre')
plt.xlabel('Genre')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.ylim(0,3)
plt.show()

The graphs above show that the **Action, Shooter, and Sports** genres have the highest total sales, making them the most popular and profitable. Genres like **Role-Playing and Racing** have moderate sales, while **Strategy and Puzzle** have the lowest sales.

# Step 5. Creating a user profile for each region:

In [None]:
# Grouping Sales by Platform and Region:
region_platform_sales = video_games_from_2013.groupby('platform')[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum()
region_platform_sales

In [None]:
# Determining Top 5 Platforms in NA region:
top_na_platforms = region_platform_sales['na_sales'].sort_values(ascending=False).head()
top_na_platforms

In [None]:
# Creating a Bar Plot NA Region:
plt.figure(figsize=(8,4))
top_na_platforms.plot(kind='bar', color='blue')
plt.title('Top 5 Platforms in NA region')
plt.xlabel('Platforms')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**North American gamers** favored **XOne** and **PS4**, suggesting strong support and popularity during the chosen period (2013 onward).

In [None]:
# Determining Top 5 Platforms in EU region:
top_eu_platforms = region_platform_sales['eu_sales'].sort_values(ascending=False).head()
top_eu_platforms

In [None]:
# Creating a Bar Plot EU Region:
plt.figure(figsize=(8,4))
top_eu_platforms.plot(kind='bar', color='green')
plt.title('Top 5 Platforms in EU region')
plt.xlabel('Platforms')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**PS4** leads in **Europe**, showing its strong dominance.

In [None]:
# Determining Top 5 Platforms in JP region:
top_jp_platforms = region_platform_sales['jp_sales'].sort_values(ascending=False).head()
top_jp_platforms

In [None]:
# Creating a Bar Plot JP Region:
plt.figure(figsize=(8,4))
top_jp_platforms.plot(kind='bar', color='pink')
plt.title('Top 5 Platforms in JP region')
plt.xlabel('Platforms')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**Japan** has a unique gaming market where handheld platforms like **3DS and PSV** dominate. This reflects a preference for portable gaming devices.

In [None]:
# Determining Top 5 Platforms in Other regions:
top_other_platforms = region_platform_sales['other_sales'].sort_values(ascending=False).head()
top_other_platforms

In [None]:
# Creating a Bar Plot Other Regions:
plt.figure(figsize=(8,4))
top_other_platforms.plot(kind='bar', color='orange')
plt.title('Top 5 Platforms in Other regions')
plt.xlabel('Platforms')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

In **Other Regions** we can notice a dominance of **PS4, PS3, XOne**, which highlights a preference for home consoles.
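
The four regional rankings above repeat the same steps, and absolute sales are hard to compare between markets of different size. A minimal sketch that converts sales to within-region shares and picks the leaders in one pass, assuming a made-up table shaped like `region_platform_sales` (the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical regional sales per platform
region_sales = pd.DataFrame(
    {'na_sales': [100.0, 80.0, 20.0], 'jp_sales': [10.0, 5.0, 35.0]},
    index=['PS4', 'XOne', '3DS'],
)

# Share of each platform within its region (each column sums to 1)
region_share = region_sales / region_sales.sum()

# Top 2 platforms per region, collected in a dictionary
top_by_region = {
    region: region_share[region].nlargest(2).index.tolist()
    for region in region_share.columns
}
print(top_by_region)
```

With shares, a statement such as "3DS holds 70% of this toy Japanese market" can be compared directly with the leaders in larger regions.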

In [None]:
# Grouping by Genre and calculating total sales for each region:
region_genre_sales = video_games_from_2013.groupby('genre')[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum()
region_genre_sales

In [None]:
# Determining Top 5 Genres in NA region:
top_na_genres = region_genre_sales['na_sales'].sort_values(ascending=False).head()
top_na_genres

In [None]:
# Creating a Bar Plot NA Region:
plt.figure(figsize=(8,4))
top_na_genres.plot(kind='bar', color='blue')
plt.title('Top 5 Genres in NA region')
plt.xlabel('Genre')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**North America** has a strong preference for **Action** and **Shooter** genres, which dominate the market. **Sports** games are also highly popular.

In [None]:
# Determining Top 5 Genres in EU region:
top_eu_genres = region_genre_sales['eu_sales'].sort_values(ascending=False).head()
top_eu_genres

In [None]:
# Creating a Bar Plot EU Region:
plt.figure(figsize=(8,4))
top_eu_genres.plot(kind='bar', color='green')
plt.title('Top 5 Genres in EU region')
plt.xlabel('Genre')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

Similar to North America, **Europe** has a strong preference for **Action** and **Shooter**, with **Sports** games also performing well.

In [None]:
# Determining Top 5 Genres in JP region:
top_jp_genres = region_genre_sales['jp_sales'].sort_values(ascending=False).head()
top_jp_genres

In [None]:
# Creating a Bar Plot JP Region:
plt.figure(figsize=(8,4))
top_jp_genres.plot(kind='bar', color='pink')
plt.title('Top 5 Genres in JP region')
plt.xlabel('Genre')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

The **Japanese** market prefers **Role-Playing** games significantly over other genres. **Action** games also perform well.

In [None]:
# Determining Top 5 Genres in Other regions:
top_other_genres = region_genre_sales['other_sales'].sort_values(ascending=False).head()
top_other_genres

In [None]:
# Creating a Bar Plot Other Regions:
plt.figure(figsize=(8,4))
top_other_genres.plot(kind='bar', color='orange')
plt.title('Top 5 Genres in Other regions')
plt.xlabel('Genre')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**Action** and **Shooter** continue to dominate, with **Sports** games also contributing significantly.

In [None]:
# Grouping Sales by ESRB Rating for each region:
region_rating_sales = video_games_from_2013.groupby('rating')[['na_sales', 'eu_sales', 'jp_sales', 'other_sales']].sum()
region_rating_sales

In [None]:
# The impact of ESRB Rating in NA region:
na_esrb = region_rating_sales['na_sales'].sort_values(ascending=False)
na_esrb

In [None]:
# Creating a Bar Plot NA Region:
plt.figure(figsize=(8,4))
na_esrb.plot(kind='bar', color='blue')
plt.title('Game Sales by ESRB Rating in NA')
plt.xlabel('ESRB Rating')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

Games rated **M** dominate sales, highlighting a preference for mature content, followed by games whose rating is **Unknown** (possibly missing data). **E**, **E10+** and **T** games perform moderately.

In [None]:
# The impact of ESRB Rating in EU region:
eu_esrb = region_rating_sales['eu_sales'].sort_values(ascending=False)
eu_esrb

In [None]:
# Creating a Bar Plot EU Region:
plt.figure(figsize=(8,4))
eu_esrb.plot(kind='bar', color='green')
plt.title('Game Sales by ESRB Rating in EU')
plt.xlabel('ESRB Rating')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**M**- and **E**-rated games lead sales, showing demand for both mature and all-ages content. **T** and **E10+** contribute less.

In [None]:
# The impact of ESRB Rating in JP region:
jp_esrb = region_rating_sales['jp_sales'].sort_values(ascending=False)
jp_esrb

In [None]:
# Creating a Bar Plot JP Region:
plt.figure(figsize=(8,4))
jp_esrb.plot(kind='bar', color='pink')
plt.title('Game Sales by ESRB Rating in JP')
plt.xlabel('ESRB Rating')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

The majority of game sales in **Japan** lack an ESRB rating, indicating missing data or a different rating system. Other than that, **T**- and **E**-rated games are dominant.

In [None]:
# The impact of ESRB Rating in Other regions:
other_esrb = region_rating_sales['other_sales'].sort_values(ascending=False)
other_esrb

In [None]:
# Creating a Bar Plot Other Regions:
plt.figure(figsize=(8,4))
other_esrb.plot(kind='bar', color='orange')
plt.title('Game Sales by ESRB Rating in Other Regions')
plt.xlabel('ESRB Rating')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.show()

**M**-rated games dominate sales in **Other Regions**, indicating a greater demand for mature content. However, the high share of **Unknown** ratings indicates incomplete data or a different rating system.

# Step 6. Testing the hypotheses:

#Test the hypotheses:

- Null Hypothesis (H_0): Average user ratings of the Xbox One and PC platforms are the same.

- Alternative Hypothesis (H_1): Average user ratings of the Xbox One and PC platforms are different. 

In [None]:
# Extracting user ratings for each platform:
xbox_one_ratings = video_games_from_2013[video_games_from_2013['platform'] == 'XOne']['user_score'].dropna()
pc_ratings = video_games_from_2013[video_games_from_2013['platform'] == 'PC']['user_score'].dropna()

# Conducting the t-test:
t_stat, p_value = stats.ttest_ind(xbox_one_ratings, pc_ratings, equal_var=False)

# Printing the results:
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Interpretation (significance level alpha = 0.05):
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Average user ratings differ between Xbox One and PC.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average user ratings.")



#Test the hypotheses:

- Null Hypothesis (H_0): Average user ratings for the Action and Sports genres are the same.

- Alternative Hypothesis (H_1): Average user ratings for the Action and Sports genres are different.

In [None]:
# Extracting user ratings for each genre:
action_ratings = video_games_from_2013[video_games_from_2013['genre'] == 'Action']['user_score'].dropna()
sports_ratings = video_games_from_2013[video_games_from_2013['genre'] == 'Sports']['user_score'].dropna()

# Conducting the t-test:
t_stat, p_value = stats.ttest_ind(action_ratings, sports_ratings, equal_var=False)

# Printing the results:
print(f"T-statistic: {t_stat}, P-value: {p_value}")

# Interpretation:
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: Average user ratings differ between Action and Sports genres.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average user ratings.")
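
Both tests above follow the same pattern, so they could share a small helper. A minimal sketch, assuming a hypothetical function name `welch_test` and synthetic samples rather than the real user scores; `equal_var=False` selects Welch's t-test, which does not assume equal variances in the two groups:

```python
import numpy as np
from scipy import stats


def welch_test(sample_a, sample_b, alpha=0.05):
    """Run Welch's t-test and return (t statistic, p-value, reject H_0?)."""
    t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    return t_stat, p_value, bool(p_value < alpha)


# Synthetic samples standing in for user scores (hypothetical, seeded for repeatability)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=6.5, scale=1.0, size=200)
group_b = rng.normal(loc=5.0, scale=1.5, size=200)

t_stat, p_value, reject = welch_test(group_a, group_b)
print(f"T-statistic: {t_stat:.2f}, P-value: {p_value:.4g}, reject H_0: {reject}")
```

Keeping `alpha` as a parameter makes the significance level explicit and identical for every comparison.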

# General Conclusion:

The dataset highlights significant trends in the gaming industry over 37 years, showcasing variability in game releases and sales performance across platforms and genres. The number of games released per year fluctuates widely, with a median of **338 games**, reflecting both peaks in production and years with limited activity.

Platform performance varied significantly, with **PS2**, **Xbox 360**, **PS3**, and **Wii** dominating total sales, especially between **1998** and **2013**. **Post-2013**, **Xbox One** and **PS4** led the market, while older platforms like **PS2** and **PSP** experienced sharp declines. Older platforms like **3DO** and **PCFX** struggled to achieve meaningful sales, underscoring the dominance of major brands like **Sony**, **Microsoft**, and **Nintendo**.

Sales trends also show that the **Action**, **Shooter**, and **Sports** genres were the most profitable in North America, Europe, and other regions, while Japan favored **Role-Playing** games. **Strategy** and **Puzzle** genres had the lowest sales. On X360, critic scores have a weak positive correlation with sales **(correlation: 0.35)**, while user scores show almost none **(correlation: -0.01)**. For multi-platform games, **X360** showed higher sales than other platforms, while **PS** showed moderate performance.

Games rated **E** and **M** dominate sales in most regions, reflecting a balance between broad audience appeal and demand for mature content. **T** and **E10+** contribute moderately. However, the high share of **Unknown** ratings, especially in Japan, indicates incomplete data or a different rating system.