# Practice 3: Implement linear regression to perform prediction.

**Name:** Keshat Saini
**Roll Number:** RA2311026010916  
**Dataset:** Video Game Sales (Kaggle - Gregorut) https://www.kaggle.com/datasets/gregorut/videogamesales?utm_source=chatgpt.com

**Tool Used:** Google Colab  


In this practice, we will implement Linear Regression to perform prediction on the **Video Game Sales dataset**.  
Steps:  
1. Load the dataset  
2. Explore the data  
3. Apply Linear Regression  
4. Evaluate the model  
5. Perform prediction

## Step 1: Load the dataset
Upload the *Video Game Sales* dataset (`vgsales.csv`) to the Colab session and read it into a pandas DataFrame.  

In [None]:
import pandas as pd
from google.colab import files
uploaded = files.upload()
df = pd.read_csv('vgsales.csv')
df.head()

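Note: an earlier run of this cell ended with `MessageError: CustomError: Timed out waiting for output iframe load.` This error comes from the Colab upload widget, not from pandas; re-running the cell and selecting the file again typically resolves it.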
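When the file is already on disk (for example outside Colab), it can be read without the upload widget. The following is a minimal sketch, assuming `vgsales.csv` sits next to the notebook; the helper name `load_vgsales` is hypothetical and not part of pandas or Colab.

```python
from pathlib import Path

import pandas as pd


def load_vgsales(path="vgsales.csv"):
    """Read the CSV from a local path, failing with a clear message if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path} not found; download it from Kaggle or upload it first"
        )
    return pd.read_csv(csv_path)
```

Calling `df = load_vgsales()` followed by `df.head()` would then replace the upload cell.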

## Step 2: Explore the data

Count the null values in each column, because scikit-learn's `LinearRegression` does not accept missing values in its inputs.

In [None]:

df.isnull().sum()

Unnamed: 0,0
Rank,0
Name,0
Platform,0
Year,271
Genre,0
Publisher,58
NA_Sales,0
EU_Sales,0
JP_Sales,0
Other_Sales,0


### Step 2.5: Handling Missing Values

Our dataset has missing values in two columns:
- **Year**: 271 missing
- **Publisher**: 58 missing

We'll **replace missing Publisher with "Unknown"** (since publisher names are categorical).  
For **Year**, we'll fill with the **mode** of the column (the most common year), so every row keeps a valid year without being dropped.  
Neither column is used as a feature in Step 3, so these choices do not change the fitted model.

In [None]:
# Fill missing publishers with the placeholder "Unknown"
df['Publisher'] = df['Publisher'].fillna("Unknown")

# Fill missing years with the most common year (the mode of the column)
mode_year = df['Year'].mode()[0]
df['Year'] = df['Year'].fillna(mode_year)

# Check the null counts again
df.isnull().sum()

Unnamed: 0,0
Rank,0
Name,0
Platform,0
Year,0
Genre,0
Publisher,0
NA_Sales,0
EU_Sales,0
JP_Sales,0
Other_Sales,0
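
The same imputation can be checked on a tiny frame where the result is easy to verify by eye. This is a sketch on made-up values (the `toy` frame is hypothetical, not taken from `vgsales.csv`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Year": [2006.0, 2006.0, np.nan, 2010.0],
    "Publisher": ["Nintendo", None, "Sega", "Nintendo"],
})

# Categorical column: a placeholder label keeps the row without inventing a publisher.
toy["Publisher"] = toy["Publisher"].fillna("Unknown")

# Numeric column: mode() returns values sorted ascending, so [0] is the most
# frequent value (the smallest one if several values tie).
toy["Year"] = toy["Year"].fillna(toy["Year"].mode()[0])

remaining_nulls = int(toy.isnull().sum().sum())
print(toy)
print("Remaining nulls:", remaining_nulls)
```

Mode imputation piles all missing rows onto a single year, which would matter if `Year` were later used as a feature; here it is not.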


## Step 3: Apply Linear Regression

Now that the dataset is clean, we can apply a simple **Linear Regression** model.  
The goal is to predict Global Sales based on other numerical features.  

**Approach:**
- We'll use `NA_Sales`, `EU_Sales`, `JP_Sales`, and `Other_Sales` as independent variables (X).
- The target variable (y) will be `Global_Sales`.
- We'll split the data into training (80%) and testing (20%) sets.
- Then, we train a Linear Regression model using scikit-learn.
- Finally, we evaluate the model on the test set with the **R² score** and **Mean Squared Error (MSE)** to measure how well it fits.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Features: the four regional sales columns
X = df[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]
# Target: worldwide sales
y = df['Global_Sales']

# Hold out 20% of the rows for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training split only
model = LinearRegression()
model.fit(X_train, y_train)

# Score on the unseen test split
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R² Score:", r2)

Mean Squared Error: 2.7402923389188876e-05
R² Score: 0.9999934776126175
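
A score this close to 1 is what one would expect if `Global_Sales` is essentially the sum of the four regional columns. That is an assumption about how the dataset was built, although the R² above is consistent with it. The sketch below reproduces the effect on synthetic data (made-up numbers, not the Kaggle file): when the target is a rounded sum of the features, the fitted coefficients come out near 1 and the intercept near 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regional "sales" (made-up numbers, not the Kaggle data).
X = rng.exponential(scale=0.3, size=(2000, 4))

# Target is the row sum rounded to 2 decimals, mimicking a reported total.
y = np.round(X.sum(axis=1), 2)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))

print("Coefficients:", model.coef_)    # each should be close to 1
print("Intercept:", model.intercept_)  # should be close to 0
print("R²:", r2)
```

On the real data, printing `model.coef_` and `model.intercept_` after the Step 3 cell would show whether the same pattern holds. If it does, the model is recovering an identity rather than learning a predictive relationship.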


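## Step 4: Evaluate the Model

Once the linear regression model is trained, we evaluate it on the held-out test set by comparing the predicted and actual `Global_Sales` values with standard metrics:

- **Mean Absolute Error (MAE):** The average absolute difference between actual and predicted values, in the same units as the target.
- **Mean Squared Error (MSE):** The average squared difference between actual and predicted values. Lower values indicate a better fit, and large errors are penalized more heavily than in MAE.
- **Root Mean Squared Error (RMSE):** The square root of MSE, which puts the error back on the same scale as the target variable (Global Sales). The cell below does not print it; for the MSE reported in Step 3 it is about 0.005.
- **R-squared (R²):** The proportion of variance in the dependent variable that is explained by the independent variables. The best possible value is 1; on held-out data it can fall below 0 when the model does worse than always predicting the mean.

The cell below prints each metric to two decimal places, so errors this small appear as 0.00 and R² as 1.00. The unrounded values are in the Step 3 output.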
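
To make the definitions concrete, the metrics can be computed by hand with NumPy and compared against scikit-learn. This is a sketch on four made-up value pairs, not on the model's actual predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back on the scale of the target

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE :", mae, "| sklearn:", mean_absolute_error(y_true, y_hat))
print("MSE :", mse, "| sklearn:", mean_squared_error(y_true, y_hat))
print("RMSE:", rmse)
print("R²  :", r2, "| sklearn:", r2_score(y_true, y_hat))
```

RMSE is taken with `np.sqrt` on the MSE here rather than through a scikit-learn argument, so the sketch does not depend on a particular library version.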



In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Model Evaluation Metrics:")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Model Evaluation Metrics:
Mean Absolute Error (MAE): 0.00
Mean Squared Error (MSE): 0.00
R² Score: 1.00


## Step 5: Perform prediction

In [None]:
y_pred = model.predict(X_test)
prediction_results = pd.DataFrame({
    "Actual Sales": y_test.values,
    "Predicted Sales": y_pred
}).head(10)
prediction_results

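Note: the `NameError: name 'model' is not defined` recorded for this cell indicates it was run in a session where the Step 3 cell had not been executed. The cell depends on `model`, `X_test`, and `y_test` from Step 3. After re-running Step 3, it displays the first 10 actual and predicted sales values side by side.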
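
Prediction can also be run on a row that is not in the test set. The sketch below is self-contained, so it trains on synthetic data standing in for `vgsales.csv`; the `new_game` values are hypothetical. With the real notebook, only the last three statements would be needed after Step 3.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Synthetic training data standing in for vgsales.csv (made-up values).
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.exponential(scale=0.3, size=(500, 4)), columns=features)
train["Global_Sales"] = train[features].sum(axis=1)

model = LinearRegression().fit(train[features], train["Global_Sales"])

# A hypothetical new title; column names and order must match the training features.
new_game = pd.DataFrame(
    [{"NA_Sales": 1.2, "EU_Sales": 0.8, "JP_Sales": 0.3, "Other_Sales": 0.1}]
)
predicted = model.predict(new_game[features])[0]
print(f"Predicted Global_Sales: {predicted:.2f}")
```

Passing a DataFrame with the same column names used in training avoids scikit-learn's feature-name mismatch warning.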