
# Video Game Sales — Regression Report

**Audience:** Product & marketing stakeholders at a gaming company  
**Tone:** Clear and practical (business-focused)

This report builds and compares multiple **regression models** to predict **Global Sales** of video games from regional sales and metadata. It is structured for readability: each section includes short narrative text followed by the minimal code needed to reproduce the analysis. Plots are generated by the provided code when you run the notebook.



## Problem Statement

The gaming company wants to **predict Global_Sales** of titles to improve decisions around **inventory, marketing spend, and forecasting**. We use the public `vgsales.csv` dataset to build baseline models and compare their performance in a way that is easily explainable to non‑technical stakeholders.



## Objectives

1. **Data Understanding & EDA:** Explore distributions and relationships to Global_Sales.  
2. **Modeling:** Train and compare:
   - Linear Regression using **3 features** (NA, EU, JP sales)
   - Linear Regression using **5 features** (+ Other_Sales, Year)
   - KNN Regressor with **two tests** (k=3 and k=5)
3. **Evaluation:** Report **R²** and **RMSE**, and visualize residuals and predicted vs. actual.  
4. **Improvement:** Simple **hyperparameter search** for KNN (optional baseline tuning).  
5. **Communication:** Provide short, practical interpretations for a business audience.
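
For readers less familiar with the two metrics named in objective 3, the following is a minimal sketch of how they are computed. The helper names `rmse` and `r2` and the sample numbers are made up for illustration; the report itself relies on scikit-learn's `r2_score` and `mean_squared_error`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: typical size of a miss, in the target's own units."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(actual, predicted):
    """Share of the variance in `actual` accounted for by the predictions (1.0 is perfect)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical sales figures, not taken from vgsales.csv
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

print("RMSE =", round(rmse(actual, predicted), 4))
print("R2   =", round(r2(actual, predicted), 4))
```

A lower RMSE and a higher R² both indicate a closer fit, which is how the comparison table later in the report should be read.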


## Data Loading

In [None]:

import pandas as pd

df = pd.read_csv("vgsales.csv")
df.head()


## Data Overview

In [None]:

df.info()
df.describe(include='all').T



## Data Cleaning

- Drop rows with missing values in the columns required by each model.  
- Convert `Year` to integer where applicable.


In [None]:

import numpy as np

df_clean = df.copy()
# Leave the raw df untouched; per-model cleaning happens right before each model is trained
missing_by_col = df_clean.isna().sum().sort_values(ascending=False)
missing_by_col
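
The two cleaning rules listed at the top of this section could be wrapped in a small helper that is called once per model. This is only a sketch: `prepare_model_frame` is a hypothetical name, and the toy frame stands in for `vgsales.csv`.

```python
import numpy as np
import pandas as pd

def prepare_model_frame(frame, required_cols):
    """Return a copy with complete rows for `required_cols`; cast Year to int if it is required."""
    out = frame.dropna(subset=required_cols).copy()
    if "Year" in required_cols:
        out["Year"] = out["Year"].astype(int)
    return out

# Toy rows standing in for vgsales.csv (values are made up)
toy = pd.DataFrame({
    "NA_Sales": [1.0, 0.5, 0.2],
    "Year": [2006.0, np.nan, 2010.0],
    "Global_Sales": [2.0, 0.9, 0.4],
})

with_year = prepare_model_frame(toy, ["NA_Sales", "Year", "Global_Sales"])
without_year = prepare_model_frame(toy, ["NA_Sales", "Global_Sales"])
print(len(with_year), "rows kept when Year is required")
print(len(without_year), "rows kept when Year is not required")
```

Cleaning per model rather than globally means a model that does not use `Year` keeps the rows where only `Year` is missing.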


## Exploratory Data Analysis (EDA)

In [None]:

import matplotlib.pyplot as plt

# Correlation heatmap (numeric columns only)
numeric_df = df.select_dtypes(include='number')
corr = numeric_df.corr()

plt.figure(figsize=(10,6))
# Pin the color scale to [-1, 1] so zero correlation maps to the middle of the palette
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1, interpolation='nearest')
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha='right')
plt.yticks(range(len(corr.columns)), corr.columns)
plt.colorbar()
plt.title("Correlation Heatmap (Numeric Features)")
plt.tight_layout()


In [None]:

# All-in-one scatter vs Global_Sales
features = ['NA_Sales','EU_Sales','JP_Sales','Other_Sales']
plt.figure(figsize=(10,6))
for col in features:
    if col in df.columns:
        plt.scatter(df[col], df['Global_Sales'], s=8, alpha=0.6, label=col)
plt.title("Scatter: Features vs Global_Sales")
plt.xlabel("Feature value")
plt.ylabel("Global_Sales")
plt.legend()
plt.tight_layout()



## Train / Test Split

- Use a **75/25** split with `random_state=67` for reproducibility.


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
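
To see what the 75/25 split and the fixed seed do in practice, the sketch below splits a small made-up array twice with the same arguments. The 16-row array is an assumption for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 16 made-up rows with 3 features each, plus a made-up target
X_demo = np.arange(48, dtype=float).reshape(16, 3)
y_demo = X_demo.sum(axis=1)

# Same arguments as the report: 25% held out, seed fixed at 67
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=67)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=67)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print("train rows:", len(X_train_a), "test rows:", len(X_test_a))
print("identical test rows across runs:", np.array_equal(X_test_a, X_test_b))
```

Fixing `random_state` is what lets two people running the notebook compare the same held-out rows.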


## Model 1 — Linear Regression (3 features)

In [None]:

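# Drop incomplete rows for the columns this model uses
df1 = df.dropna(subset=['NA_Sales','EU_Sales','JP_Sales','Global_Sales']).copy()
X = df1[['NA_Sales','EU_Sales','JP_Sales']]
y = df1['Global_Sales']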

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=67)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

r2_lr1 = r2_score(y_test, y_pred)
rmse_lr1 = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print("LR (3 features): R2 =", round(r2_lr1,6), "RMSE =", round(rmse_lr1,6))


In [None]:

# Predicted vs Actual (Model 1)
plt.figure(figsize=(5,4))
plt.scatter(y_test, y_pred, s=10, alpha=0.7)
plt.title("LR (3) — Predicted vs Actual")
plt.xlabel("Actual Global_Sales")
plt.ylabel("Predicted Global_Sales")
plt.tight_layout()

# Residuals (Model 1)
plt.figure(figsize=(5,4))
plt.scatter(y_test, y_pred - y_test, s=10, alpha=0.7)
plt.title("LR (3) — Residuals")
plt.xlabel("Actual Global_Sales")
plt.ylabel("Residual (Pred - Actual)")
plt.tight_layout()


## Model 2 — Linear Regression (5 features incl. Other_Sales, Year)

In [None]:

# Keep only rows that have every column this model needs
df2 = df.dropna(subset=['Year','Other_Sales','NA_Sales','EU_Sales','JP_Sales','Global_Sales']).copy()
df2['Year'] = df2['Year'].astype(int)

# Platform and Genre are not inputs to this model, so no one-hot encoding is needed
X2 = df2[['NA_Sales','EU_Sales','JP_Sales','Other_Sales','Year']]
y2 = df2['Global_Sales']

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.25, random_state=67)

lr2 = LinearRegression()
lr2.fit(X2_train, y2_train)
y_pred2 = lr2.predict(X2_test)

r2_lr2 = r2_score(y2_test, y_pred2)
rmse_lr2 = float(np.sqrt(mean_squared_error(y2_test, y_pred2)))

print("LR (5 features): R2 =", round(r2_lr2,6), "RMSE =", round(rmse_lr2,6))


In [None]:

# Predicted vs Actual (Model 2)
plt.figure(figsize=(5,4))
plt.scatter(y2_test, y_pred2, s=10, alpha=0.7)
plt.title("LR (5) — Predicted vs Actual")
plt.xlabel("Actual Global_Sales")
plt.ylabel("Predicted Global_Sales")
plt.tight_layout()

# Residuals (Model 2)
plt.figure(figsize=(5,4))
plt.scatter(y2_test, y_pred2 - y2_test, s=10, alpha=0.7)
plt.title("LR (5) — Residuals")
plt.xlabel("Actual Global_Sales")
plt.ylabel("Residual (Pred - Actual)")
plt.tight_layout()
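
The residual scatter pools every title together. A per-segment summary is one way to see whether errors cluster in particular platforms or genres. The sketch below uses a made-up frame; the helper `residuals_by_segment` and the `Predicted` column are hypothetical names, and the sign convention follows the plots (predicted minus actual).

```python
import pandas as pd

def residuals_by_segment(frame, segment_col, actual_col="Global_Sales", pred_col="Predicted"):
    """Mean and count of (predicted - actual) per segment, largest absolute bias first."""
    resid = frame[pred_col] - frame[actual_col]
    summary = resid.groupby(frame[segment_col]).agg(["mean", "count"])
    order = summary["mean"].abs().sort_values(ascending=False).index
    return summary.loc[order]

# Made-up predictions for illustration only
demo = pd.DataFrame({
    "Genre": ["Sports", "Sports", "Puzzle", "Puzzle"],
    "Global_Sales": [2.0, 4.0, 1.0, 1.0],
    "Predicted": [2.5, 4.5, 1.0, 0.8],
})

by_genre = residuals_by_segment(demo, "Genre")
print(by_genre)
```

A positive mean flags a segment the model tends to over-predict; a negative mean flags under-prediction.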


## Model 3 — KNN Regressor (Two Tests)

In [None]:

dfk = df.dropna(subset=['NA_Sales','EU_Sales','JP_Sales','Global_Sales']).copy()
Xk = dfk[['NA_Sales','EU_Sales','JP_Sales']]
yk = dfk['Global_Sales']

Xk_train, Xk_test, yk_train, yk_test = train_test_split(Xk, yk, test_size=0.25, random_state=67)

knn3 = KNeighborsRegressor(n_neighbors=3)
knn3.fit(Xk_train, yk_train)
pred3 = knn3.predict(Xk_test)
r2_k3 = r2_score(yk_test, pred3)
rmse_k3 = float(np.sqrt(mean_squared_error(yk_test, pred3)))

knn5 = KNeighborsRegressor(n_neighbors=5)
knn5.fit(Xk_train, yk_train)
pred5 = knn5.predict(Xk_test)
r2_k5 = r2_score(yk_test, pred5)
rmse_k5 = float(np.sqrt(mean_squared_error(yk_test, pred5)))

print("KNN k=3: R2 =", round(r2_k3,6), "RMSE =", round(rmse_k3,6))
print("KNN k=5: R2 =", round(r2_k5,6), "RMSE =", round(rmse_k5,6))


In [None]:

# Predicted vs Actual (KNN k=3)
plt.figure(figsize=(5,4))
plt.scatter(yk_test, pred3, s=10, alpha=0.7)
plt.title("KNN k=3 — Predicted vs Actual")
plt.xlabel("Actual Global_Sales")
plt.ylabel("Predicted Global_Sales")
plt.tight_layout()

# Predicted vs Actual (KNN k=5)
plt.figure(figsize=(5,4))
plt.scatter(yk_test, pred5, s=10, alpha=0.7)
plt.title("KNN k=5 — Predicted vs Actual")
plt.xlabel("Actual Global_Sales")
plt.ylabel("Predicted Global_Sales")
plt.tight_layout()


## Model Improvement — Simple Hyperparameter Search (KNN)

In [None]:

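# Basic search over k to see if a different neighborhood size helps.
# Tuning on the test split would make the final test score optimistic,
# so a validation split is carved out of the training data instead.
Xk_fit, Xk_val, yk_fit, yk_val = train_test_split(
    Xk_train, yk_train, test_size=0.25, random_state=67
)

best_k = None
best_r2 = float("-inf")
for k in range(2, 21):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(Xk_fit, yk_fit)
    score = r2_score(yk_val, model.predict(Xk_val))
    if score > best_r2:
        best_r2 = score
        best_k = k

print("Best k by R2 on validation split:", best_k, "with R2 =", round(best_r2,6))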
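
A single split can give a noisy picture of which `k` works best. One alternative, sketched below on synthetic data, is a k-fold search over the training rows only, which leaves the test split untouched for the final comparison. `GridSearchCV` is scikit-learn's standard tool for this; the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the three regional sales columns (not real data)
rng = np.random.default_rng(67)
X_demo = rng.random((200, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(0.0, 0.05, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=67)

# 5-fold cross-validation over the same k range as the loop above
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(2, 21))},
    scoring="r2",
    cv=5,
)
search.fit(X_tr, y_tr)

best_k_cv = search.best_params_["n_neighbors"]
print("best k:", best_k_cv)
print("mean CV R2:", round(search.best_score_, 4))
print("held-out test R2:", round(search.score(X_te, y_te), 4))
```

Because each candidate `k` is scored on five different folds, the chosen value depends less on which rows happened to land in one particular split.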


## Model Comparison

In [None]:

results = pd.DataFrame({
    "Model": [
        "Linear Regression (3 features)",
        "Linear Regression (5 features)",
        "KNN (k=3)",
        "KNN (k=5)"
    ],
    "R2": [r2_lr1, r2_lr2, r2_k3, r2_k5],
    "RMSE": [rmse_lr1, rmse_lr2, rmse_k3, rmse_k5]
})
results



## Interpretation (Business-Focused)

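- **Linear Regression (5 features)** performs best (highest R², lowest RMSE). Most of the gain over the 3‑feature model comes from **Other_Sales**: Global_Sales in this dataset is the total of the four regional columns, so once all four are supplied the fit is close to an accounting identity.  
- **KNN** is reasonable but more sensitive to local variance and the chosen `k`. It underperforms the linear models in this dataset.  
- For **forecasting and planning**, the 5‑feature Linear Regression is the most reliable baseline: it is **simple, fast, and explainable**. Because it takes regional sales as inputs, it applies once regional figures are available, not before launch.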
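
When reading the 5‑feature result, it can help to check how closely the four regional columns add up to Global_Sales, since a near-exact sum would by itself explain a near-perfect linear fit. A minimal sketch, with a hypothetical helper `regional_sum_gap` and made-up rows in place of `vgsales.csv`:

```python
import pandas as pd

REGION_COLS = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

def regional_sum_gap(frame):
    """Absolute difference between Global_Sales and the sum of the regional columns, per row."""
    return (frame["Global_Sales"] - frame[REGION_COLS].sum(axis=1)).abs()

# Made-up rows for illustration only
demo = pd.DataFrame({
    "NA_Sales": [1.00, 0.50],
    "EU_Sales": [0.50, 0.25],
    "JP_Sales": [0.25, 0.00],
    "Other_Sales": [0.25, 0.25],
    "Global_Sales": [2.00, 1.01],
})

gap = regional_sum_gap(demo)
print("largest gap:", round(gap.max(), 2))
```

On the real data, `regional_sum_gap(df).describe()` would show whether the gaps stay at rounding level.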



## Recommendations & Next Steps

1. **Adopt the 5‑feature Linear Regression** as the baseline predictor for Global_Sales.  
2. **Monitor residuals** by platform and genre to identify systematic under/over‑prediction (opportunity for feature engineering).  
3. **Feature engineering:** include release window (seasonality), platform install base, and marketing spend if available.  
4. **Validation:** add cross‑validation and time‑based splits (by Year) to better simulate real‑world deployment.  
5. **Model governance:** keep the model explainable for non‑technical stakeholders; provide a short briefing with what‑if examples.
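
The time-based split in recommendation 4 could look like the sketch below: train on titles released up to a cutoff year and test on later ones. The helper `split_by_year`, the cutoff value, and the toy rows are all assumptions for illustration.

```python
import pandas as pd

def split_by_year(frame, cutoff_year, year_col="Year"):
    """Rows with Year <= cutoff go to train, later rows to test; rows without a Year are dropped."""
    known = frame.dropna(subset=[year_col])
    train = known[known[year_col] <= cutoff_year]
    test = known[known[year_col] > cutoff_year]
    return train, test

# Toy rows for illustration only
demo = pd.DataFrame({
    "Year": [2005, 2008, 2011, 2014, None],
    "Global_Sales": [1.2, 0.8, 2.5, 0.3, 0.1],
})

train_df, test_df = split_by_year(demo, cutoff_year=2010)
print(len(train_df), "training rows,", len(test_df), "test rows")
```

Unlike a random split, this never lets the model learn from titles released after the ones it is tested on, which is closer to how a forecast would be used.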
