# Advanced Video Game Sales Analysis for Three Questions

This notebook works through three questions about video game sales. Each question is answered in three code cells (blocks) that produce the tables and graphs used to interpret the results.

Questions:
1. **Which factors most significantly predict global sales?**
2. **How does the popularity of a game evolve with age, and what is the longevity effect on sales?**
3. **Can we cluster games into distinct groups based on their regional sales patterns?**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Load the dataset
df = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')

# Convert Year_of_Release to numeric and drop rows with missing years
df['Year_of_Release'] = pd.to_numeric(df['Year_of_Release'], errors='coerce')
df = df.dropna(subset=['Year_of_Release'])
df['Year_of_Release'] = df['Year_of_Release'].astype(int)

# For demonstration, if Critic_Score does not exist, create a synthetic column
if 'Critic_Score' not in df.columns:
    np.random.seed(42)
    df['Critic_Score'] = np.random.randint(50, 100, size=len(df))

print('Setup complete. Data loaded and preprocessed.')

FileNotFoundError: [Errno 2] No such file or directory: 'Video_Games_Sales_as_at_22_Dec_2016.csv'
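
The traceback above indicates that the CSV was not in the notebook's working directory when the setup cell ran, so none of the later cells can execute until the file is available. The following is a minimal sketch of a loader that fails with a more actionable message and applies the same year cleaning as the setup cell. `load_sales` and its `path` argument are hypothetical names introduced here for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_sales(path):
    """Read the sales CSV, failing early with a hint if the file is missing.

    `load_sales` is a hypothetical helper, not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found; place the CSV next to the notebook "
            f"(current directory: {Path.cwd()}) or pass its full path."
        )
    df = pd.read_csv(csv_path)
    # Same cleaning as the setup cell: numeric years, drop missing, cast to int.
    df["Year_of_Release"] = pd.to_numeric(df["Year_of_Release"], errors="coerce")
    df = df.dropna(subset=["Year_of_Release"])
    df["Year_of_Release"] = df["Year_of_Release"].astype(int)
    return df
```

It could be called as `df = load_sales('Video_Games_Sales_as_at_22_Dec_2016.csv')` in place of the bare `pd.read_csv` call.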

## Question 1: Which factors most significantly predict global sales?

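We first compute the correlation of each numeric column with `Global_Sales`, then fit a linear regression on `Year_of_Release`, `Critic_Score`, and `Genre` to see which of these factors predict global sales. Block 1 produces the correlation table and heatmap; blocks 2 and 3 report the test-set R^2, the fitted coefficients, and a plot of predicted versus actual sales.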

In [None]:
# Q1 Block 1: Correlation Analysis
numeric_cols = ['Year_of_Release', 'Critic_Score', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
corr_matrix = df[numeric_cols].corr()
corr_table = pd.DataFrame(corr_matrix['Global_Sales'].sort_values(ascending=False))
print('Correlation of numeric factors with Global Sales:')
display(corr_table)

plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix of Numeric Features')
plt.tight_layout()
plt.show()

In [None]:
# Q1 Block 2: Regression Model for Predicting Global Sales
# Select features and target (we include Year_of_Release, Critic_Score, and Genre)
features = df[['Year_of_Release', 'Critic_Score', 'Genre']].dropna()
target = df.loc[features.index, 'Global_Sales']

# One-hot encode 'Genre'
features_encoded = pd.get_dummies(features, columns=['Genre'], drop_first=True)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

print('Regression model trained.')

# Evaluate model performance (named test_r2 to avoid shadowing sklearn.metrics.r2_score)
test_r2 = model.score(X_test, y_test)
print('R^2 Score on Test Set:', test_r2)
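
The R^2 from a single train/test split depends on which rows happen to land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data as a stand-in for the encoded features; the feature matrix, true weights, and noise level are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for the encoded feature matrix and the sales target.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -0.5, 0.0]) + rng.normal(0, 0.5, size=200)

# Five-fold cross-validated R^2: one score per held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean(), "Std:", scores.std())
```

In the notebook, `features_encoded` and `target` would take the place of `X` and `y`.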

In [None]:
# Q1 Block 3: Display Regression Coefficients and Plot Predictions vs Actual
coefficients = pd.Series(model.coef_, index=X_train.columns)
print('Regression Coefficients:')
display(coefficients.sort_values(ascending=False))

# Generate predictions for test set
predictions = model.predict(X_test)
results = pd.DataFrame({'Actual': y_test, 'Predicted': predictions})
print('Sample Predictions:')
display(results.head(10))

# Scatter plot of actual vs predicted values
plt.figure(figsize=(8,6))
plt.scatter(y_test, predictions, alpha=0.6, edgecolor='k')
plt.xlabel('Actual Global Sales')
plt.ylabel('Predicted Global Sales')
plt.title('Actual vs Predicted Global Sales')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.tight_layout()
plt.show()
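
The coefficients displayed above are expressed in the units of each feature (per year, per critic point, per genre indicator), so sorting them does not by itself rank the factors by importance. One common approach, sketched here on synthetic data, is to standardize the predictors before fitting so that each coefficient describes the effect of a one-standard-deviation change. The data and the weights used to generate the target are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for two predictors on different scales.
X = pd.DataFrame({
    "Year_of_Release": rng.integers(1990, 2017, size=n),
    "Critic_Score": rng.uniform(50, 100, size=n),
})
# Hypothetical target: driven mostly by the score, weakly by the year.
y = 0.05 * X["Critic_Score"] + 0.01 * X["Year_of_Release"] + rng.normal(0, 0.1, size=n)

# Standardize predictors, fit, and rank by absolute coefficient size.
X_std = StandardScaler().fit_transform(X)
model_std = LinearRegression().fit(X_std, y)
std_coefs = pd.Series(model_std.coef_, index=X.columns)
ranking = std_coefs.abs().sort_values(ascending=False)
print(ranking)
```

With a train/test split, the scaler would normally be fitted on the training rows only and then applied to the test rows.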

## Question 2: How does the popularity of a game evolve with age, and what is the longevity effect on sales?

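In this analysis we compute each game's age as the number of years between its release and a fixed reference year, group games into age intervals, and compare the average global sales per age group. Each game has a single sales total in the data, so this compares release cohorts rather than tracking one game over time. A table and two plots summarize the result.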

In [None]:
# Q2 Block 1: Calculate Game Age and Create Age Groups
current_year = 2025
df['Game_Age'] = current_year - df['Year_of_Release']

# Define half-open age bins [0, 5), [5, 10), [10, 15), [15, 20), [20, inf).
# With right=False a finite top edge would exclude the oldest games, so use np.inf.
bins = [0, 5, 10, 15, 20, np.inf]
labels = ['0-4', '5-9', '10-14', '15-19', '20+']
df['Age_Group'] = pd.cut(df['Game_Age'], bins=bins, labels=labels, right=False)

print('Sample of Game Age and Age Groups:')
display(df[['Name', 'Year_of_Release', 'Game_Age', 'Age_Group']].head(10))
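
With `right=False`, `pd.cut` builds half-open intervals of the form [a, b), so a value equal to the last bin edge falls outside every bin and becomes `NaN`. The short sketch below shows this on a handful of hypothetical ages, comparing an open-ended last bin against one capped at the maximum age.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 4, 5, 9, 10, 20, 45])  # hypothetical ages in years

# Open-ended last bin: every non-negative age is assigned a group.
bins = [0, 5, 10, 15, 20, np.inf]
labels = ["0-4", "5-9", "10-14", "15-19", "20+"]
groups = pd.cut(ages, bins=bins, labels=labels, right=False)

# Last bin capped at the maximum age: the oldest value is left unassigned.
finite = pd.cut(ages, bins=[0, 5, 10, 15, 20, ages.max()], right=False)

print(pd.DataFrame({"Game_Age": ages, "Open_Ended": groups, "Capped": finite}))
```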

In [None]:
# Q2 Block 2: Create Table of Average Global Sales by Age Group
# observed=False keeps empty age groups in the table and makes the grouping behavior explicit
age_group_sales = (
    df.groupby('Age_Group', observed=False)['Global_Sales']
    .mean()
    .reset_index()
    .rename(columns={'Global_Sales': 'Avg_Global_Sales'})
)
print('Average Global Sales by Game Age Group:')
display(age_group_sales)
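
A group mean can be pulled upward by a few very high-selling titles, so it can help to report the median and the number of games next to it. The sketch below does this on a small made-up frame; the values are illustrative only and are not taken from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age_Group": ["5-9", "5-9", "5-9", "10-14", "10-14", "10-14"],
    "Global_Sales": [0.1, 0.2, 30.0, 0.5, 0.6, 0.7],  # made-up values, in millions
})

# Named aggregation: mean, median, and number of games per age group.
summary = (
    toy.groupby("Age_Group")["Global_Sales"]
    .agg(Avg_Global_Sales="mean", Median_Global_Sales="median", Games="count")
    .reset_index()
)
print(summary)
```

In the toy data the "5-9" group has the higher mean but the lower median, because a single large value dominates its average.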

In [None]:
# Q2 Block 3: Visualize the Effect of Game Age on Global Sales
plt.figure(figsize=(10,6))
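# errorbar=None replaces the deprecated ci=None (seaborn >= 0.12; older versions use ci=None)
sns.barplot(x='Age_Group', y='Global_Sales', data=df, palette='viridis', errorbar=None)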
plt.xlabel('Game Age Group (years)')
plt.ylabel('Average Global Sales (millions)')
plt.title('Longevity Effect on Global Sales by Age Group')
plt.tight_layout()
plt.show()

# Optionally, also display a boxplot to show distribution within each group
plt.figure(figsize=(10,6))
sns.boxplot(x='Age_Group', y='Global_Sales', data=df, palette='Set2')
plt.xlabel('Game Age Group (years)')
plt.ylabel('Global Sales (millions)')
plt.title('Sales Distribution by Game Age Group')
plt.tight_layout()
plt.show()

## Question 3: Can we cluster games into distinct groups based on their regional sales patterns?

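For this analysis we cluster games on their four regional sales columns (`NA_Sales`, `EU_Sales`, `JP_Sales`, `Other_Sales`). We standardize the columns, apply K-Means with three clusters, and project the result onto two principal components for plotting. A table of mean regional sales per cluster helps interpret what each cluster represents.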

In [None]:
# Q3 Block 1: Data Preparation for Clustering
regional_cols = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']
regional_data = df[regional_cols].dropna()

# Scale the regional sales data
scaler = StandardScaler()
regional_scaled = scaler.fit_transform(regional_data)

print('First 5 rows of scaled regional sales data:')
display(pd.DataFrame(regional_scaled, columns=regional_cols).head())

In [None]:
# Q3 Block 2: Apply K-Means Clustering
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(regional_scaled)

# Add the cluster labels back to the original dataframe (for the rows used)
df_cluster = df.loc[regional_data.index].copy()
df_cluster['Cluster'] = clusters

print('Cluster distribution:')
# rename_axis/reset_index(name=...) yields 'Cluster' and 'Count' columns on both old and new pandas
display(df_cluster['Cluster'].value_counts().rename_axis('Cluster').reset_index(name='Count'))
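
The choice of three clusters above is fixed in advance. One common way to check it is to compare silhouette scores across several values of k and prefer the value with the highest score. The sketch below runs that comparison on synthetic, well-separated blobs standing in for the scaled regional sales; the blob centers and spread are assumptions made for illustration, and real sales data may not separate this cleanly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated data standing in for the scaled regional sales.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0, 0, 0], [10, 10, 0, 0], [0, 0, 10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit K-Means for several k and record the mean silhouette score of each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

In the notebook, `regional_scaled` would take the place of `X`.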

In [None]:
# Q3 Block 3: Visualize Clusters Using PCA
pca = PCA(n_components=2)
pca_components = pca.fit_transform(regional_scaled)

plt.figure(figsize=(10,6))
scatter = plt.scatter(pca_components[:, 0], pca_components[:, 1], c=clusters, cmap='Set1', edgecolor='k', s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection of Regional Sales Clusters')
plt.legend(*scatter.legend_elements(), title='Cluster')
plt.tight_layout()
plt.show()

# Create a table summarizing the mean regional sales per cluster
cluster_summary = df_cluster.groupby('Cluster')[regional_cols].mean().reset_index()
print('Mean Regional Sales by Cluster:')
display(cluster_summary)
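
A two-component PCA plot is only as faithful as the share of variance those two components retain. The sketch below reads `explained_variance_ratio_` from a fitted `PCA` to check this, using synthetic correlated columns as a stand-in for the four regional sales columns; the generating weights and noise are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: four columns driven by one shared factor plus noise.
base = rng.gamma(shape=1.0, scale=1.0, size=(500, 1))
X = np.hstack([base * w + rng.normal(0, 0.1, size=(500, 1)) for w in (1.0, 0.6, 0.2, 0.3)])

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each component and by both together.
ratio = pca.explained_variance_ratio_
print("Per component:", ratio)
print("Total for 2 components:", ratio.sum())
```

If the total is low, clusters that overlap in the plot may still be well separated in the full four-dimensional space.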