<a href="https://colab.research.google.com/github/FranciscoTeon/Video-game-ratings-best-to-worst-Data-Analysis/blob/main/Classificaton_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, mean_squared_error, r2_score, classification_report

In [6]:
# Load dataset
df = pd.read_csv("/content/sample_data/IGN games from best to worst.csv")

# Preprocessing: Convert categorical variables, fill missing data, etc.
df = pd.get_dummies(df, drop_first=True)

# Handling missing data
df.fillna(df.mean(), inplace=True)

# Define features (X) and target (y)
X = df.drop(columns=['score']) # Features
y = df['score']  # Target (could be classification or regression)


In [7]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Experiment 1
# Linear Regression (instead of Logistic Regression)

# Input (X): Game attributes like genre, developer, year, ratings.
# Target (y): Continuous values representing game scores.

# Model: Linear Regression
lr_model = LinearRegression() # Change to LinearRegression
lr_model.fit(X_train, y_train)

# Model Performance
y_train_pred = lr_model.predict(X_train)
y_test_pred = lr_model.predict(X_test)

# Use regression metrics (e.g., MSE, R-squared)
print("Train MSE:", mean_squared_error(y_train, y_train_pred))
print("Test MSE:", mean_squared_error(y_test, y_test_pred))
print("Train R-squared:", r2_score(y_train, y_train_pred))
print("Test R-squared:", r2_score(y_test, y_test_pred))

# Compare train and test metrics: a large gap between them signals poor generalization.
# (The model is already fitted above, so it is not refitted here.)

Experiment 1: Linear Regression
Input Data and Target:

Input (X): Game attributes like genre, developer, year, and ratings (after applying one-hot encoding and filling missing values).
Target (y): Continuous values representing game scores.
Model Performance:

Train MSE: This gives you the mean squared error on the training set.

Train MSE: 2.68

Test MSE: This is the mean squared error on the test set.

Test MSE: $1.25 \times 10^{22}$ (very high)

Train R-squared: Measures how well the model fits the training data.

Train R-squared: 0.09

Test R-squared: Shows how well the model generalizes to unseen data.

Test R-squared: $-4.38 \times 10^{21}$ (extremely poor performance)
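A negative R-squared of this size can look like a bug, so it helps to recall the definition: R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares, and it drops below zero whenever the model predicts worse than the mean of the target. The sketch below uses made-up toy values (not the IGN data) to show how a single exploding prediction drives the value far below zero; `manual_r2` is a helper written for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, not taken from the IGN dataset.
y_true = np.array([6.0, 7.0, 8.0, 9.0])
y_pred_good = np.array([6.5, 7.0, 7.5, 9.0])
y_pred_bad = np.array([6.0, 7.0, 8.0, 1e6])  # one prediction explodes


def manual_r2(y, y_hat):
    """R-squared = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)     # squared error of the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot


r2_good = manual_r2(y_true, y_pred_good)
r2_bad = manual_r2(y_true, y_pred_bad)

print("Good predictions R-squared:", r2_good)
print("One exploding prediction R-squared:", r2_bad)
print("Matches sklearn:", np.isclose(r2_good, r2_score(y_true, y_pred_good)))
```

Because the residuals are squared, one wildly wrong prediction is enough to dominate both the MSE and the R-squared.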

Overfitting/Underfitting:

If the train R-squared is much higher than the test R-squared and test MSE is higher than train MSE, the model may be overfitting.
If both the train and test R-squared values are low, the model could be underfitting.
Iteration:

If overfitting: You can try regularization techniques such as Ridge Regression or Lasso Regression to prevent the model from fitting the noise in the training data.
If underfitting: Consider adding more features, or using non-linear models like Polynomial Regression.
Potential Improvements:

Add more specific features about the games, such as reviews, social media sentiment, or player statistics, to improve model performance.

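Analysis: The Linear Regression model fits the training set poorly (R-squared 0.09) and fails catastrophically on the test set. The low train R-squared suggests that a linear fit captures little of the signal. The test MSE on the order of $10^{22}$, however, is not ordinary underfitting: with a train MSE of only 2.68, it more likely points to numerically unstable coefficients. This commonly happens when one-hot encoding produces many sparse or collinear columns, some of which are active only in test rows. Either way, the model does not generalize.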

Next Steps: Linear Regression may not be suitable for this task due to the complexity of the data. Trying more flexible models like Random Forest or Gradient Boosting is a better approach. Regularization (Ridge/Lasso) might also help.
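As a rough illustration of why regularization is worth trying, the sketch below fits `LinearRegression` and `Ridge` on synthetic data with nearly duplicated columns. The data, the `alpha=1.0` setting, and the `*_demo` / `Xd_*` names are assumptions made for this example; it does not reproduce the IGN results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5 informative columns plus 5 near-copies,
# which makes the feature matrix almost collinear.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 5))
near_copy = base + 1e-6 * rng.normal(size=base.shape)
X_demo = np.hstack([base, near_copy])
y_demo = base @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=1.0, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

ols = LinearRegression().fit(Xd_train, yd_train)
ridge = Ridge(alpha=1.0).fit(Xd_train, yd_train)  # alpha chosen arbitrarily

ols_test_mse = mean_squared_error(yd_test, ols.predict(Xd_test))
ridge_test_mse = mean_squared_error(yd_test, ridge.predict(Xd_test))
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS   test MSE:", ols_test_mse, "| coefficient norm:", ols_norm)
print("Ridge test MSE:", ridge_test_mse, "| coefficient norm:", ridge_norm)
```

The point to look at is the coefficient norm: the L2 penalty keeps the Ridge coefficients small even when columns are nearly redundant. In practice `alpha` would be tuned, for example with `RidgeCV`.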

In [None]:
# Experiment 2
# Random Forest

# Input (X): Game attributes like genre, developer, year, ratings.
# Target (y): Continuous values representing game scores (using regression).

# Model: Random Forest Regressor (instead of Classifier)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)  # Changed to Regressor
rf_model.fit(X_train, y_train)

# Model Performance
y_train_pred = rf_model.predict(X_train)
y_test_pred = rf_model.predict(X_test)

# Use regression metrics (e.g., MSE, R-squared)
print("Train MSE:", mean_squared_error(y_train, y_train_pred))
print("Test MSE:", mean_squared_error(y_test, y_test_pred))
print("Train R-squared:", r2_score(y_train, y_train_pred))
print("Test R-squared:", r2_score(y_test, y_test_pred))

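# Address overfitting: refit with a depth limit and re-evaluate
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train, y_train)
print("Depth-limited Train R-squared:", r2_score(y_train, rf_model.predict(X_train)))
print("Depth-limited Test R-squared:", r2_score(y_test, rf_model.predict(X_test)))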

Experiment 2: Random Forest Regressor
Input Data and Target:

Input (X): Game attributes like genre, developer, year, and ratings.
Target (y): Continuous values representing game scores.
Model Performance:

Train MSE: Check how well the model fits the training data.

Train MSE: 0.46

Test MSE: Indicates how well the model generalizes.

Test MSE: 2.74

Train R-squared: Typically, Random Forest performs well on training data.

Train R-squared: 0.84

Test R-squared: Evaluate the generalization of the model.

Test R-squared: 0.04

Overfitting/Underfitting:

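Random Forests can overfit when individual trees are grown to full depth, which is the scikit-learn default (max_depth=None). Adding more trees does not by itself cause overfitting. A very low train MSE paired with a much higher test MSE, as seen here (0.46 vs 2.74), indicates that the model overfits.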
Iteration:

If overfitting: You can reduce the max_depth of the trees or limit the number of features considered at each split to prevent overfitting.
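In the code, the second Random Forest model limits max_depth to 10 for this reason. Its metrics are not listed here, so compare them against the values above to confirm that the train/test gap actually narrows.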
Potential Improvements:

Increase the number of trees (n_estimators) or fine-tune hyperparameters like max_depth, min_samples_split, etc.
Try increasing the diversity of the training data or using techniques like cross-validation to get more robust results.

Analysis: The Random Forest model fits the training data very well (with a high R-squared), but its performance on the test set is poor, indicating overfitting. The test R-squared of 0.04 suggests that the model barely explains the variance in the test set.

Next Steps: To reduce overfitting, limiting tree depth (max_depth) and reducing the number of features considered at each split could help. Cross-validation could also be used to fine-tune hyperparameters.
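One way to carry out this tuning is a cross-validated grid search over `max_depth` and `max_features`. The sketch below is a minimal example on synthetic data from `make_regression`; the grid values, `n_estimators=50`, and `cv=3` are arbitrary choices for illustration, not recommendations for the IGN dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the IGN dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# Hypothetical search grid: shallower trees and fewer features per split
# both reduce the variance of the individual trees.
param_grid = {
    "max_depth": [3, 5, None],
    "max_features": [0.5, 1.0],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # sklearn maximizes scores, hence the sign
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_mse = -search.best_score_
print("Best parameters:", best_params)
print("Best cross-validated MSE:", best_cv_mse)
```

With the notebook's data, you would pass `X_train` and `y_train` to `search.fit` and keep the test set aside for one final evaluation of `search.best_estimator_`.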


In [None]:
# Experiment 3
# Gradient Boosting

# Input (X): Game attributes like genre, developer, year, ratings.
# Target (y): Continuous values representing game scores (using regression).

# Model: Gradient Boosting Regressor (instead of Classifier)
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42) # Changed to Regressor
gb_model.fit(X_train, y_train)

# Model Performance
y_train_pred = gb_model.predict(X_train)
y_test_pred = gb_model.predict(X_test)

# Use regression metrics (e.g., MSE, R-squared); both are imported in the first cell
print("Train MSE:", mean_squared_error(y_train, y_train_pred))
print("Test MSE:", mean_squared_error(y_test, y_test_pred))
print("Train R-squared:", r2_score(y_train, y_train_pred))
print("Test R-squared:", r2_score(y_test, y_test_pred))

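# Iterate: refit with a lower learning rate and re-evaluate
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.01, random_state=42)
gb_model.fit(X_train, y_train)
print("Lower-LR Train R-squared:", r2_score(y_train, gb_model.predict(X_train)))
print("Lower-LR Test R-squared:", r2_score(y_test, gb_model.predict(X_test)))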

Experiment 3: Gradient Boosting Regressor
Input Data and Target:

Input (X): Game attributes like genre, developer, year, and ratings.
Target (y): Continuous values representing game scores.
Model Performance:

Train MSE: Shows the training error.

Train MSE: 2.63

Test MSE: Indicates generalization performance.

Test MSE: 2.60

Train R-squared: Likely to be high if the model is learning well.

Train R-squared: 0.11

Test R-squared: May suffer if the model overfits or underfits.

Test R-squared: 0.09

Overfitting/Underfitting:

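In Gradient Boosting, overfitting can occur if the learning rate is too high or if you use too many trees (n_estimators). Here, however, the train and test metrics are nearly identical, so overfitting is not the main problem; the low R-squared on both sets points to underfitting. Lowering the learning rate to 0.01 while keeping 100 trees (the second Gradient Boosting model) is therefore likely to make the fit weaker unless n_estimators is raised to compensate.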
Iteration:

You’ve already made a change by lowering the learning rate. If overfitting does appear after further tuning, try early stopping (halting training once the error on a held-out validation set stops decreasing, while keeping the test set for the final evaluation) or regularization techniques like subsampling.
Potential Improvements:

Add more features or use external data to provide more context about each game.
You could also use feature importance to select the most significant features, which might improve performance.

Analysis: The Gradient Boosting model performs almost identically on the train and test sets (MSE 2.63 vs 2.60, R-squared 0.11 vs 0.09), so there is no sign of overfitting. Its test MSE is the lowest of the three experiments, but an R-squared near 0.1 means the model explains only about a tenth of the variance in the scores.

Next Steps: Gradient Boosting can be further tuned by adjusting the learning rate or increasing the number of estimators. Feature engineering or gathering more data could also improve performance.
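The tuning ideas mentioned for this experiment (more estimators, early stopping, subsampling, feature importance) can be combined in one model. The sketch below uses synthetic data and arbitrary settings as assumptions; `n_iter_no_change` and `validation_fraction` make scikit-learn hold out part of the training data internally and stop adding trees once the validation score stops improving.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (assumption: not the IGN dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=42
)

gb_demo = GradientBoostingRegressor(
    n_estimators=500,         # upper bound; early stopping may use fewer
    learning_rate=0.05,
    subsample=0.8,            # stochastic gradient boosting
    validation_fraction=0.2,  # internal hold-out used only for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=42,
)
gb_demo.fit(X_demo, y_demo)

n_trees_used = gb_demo.n_estimators_
importances = gb_demo.feature_importances_
top_features = np.argsort(importances)[::-1][:3]

print("Trees actually fitted:", n_trees_used)
print("Indices of the three most important features:", top_features)
```

On the notebook's data, the same settings would be passed `X_train` and `y_train`, and `importances` could be paired with `X_train.columns` to see which encoded attributes the model relies on.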

In [None]:
# Extra Credit

# Analysis Structure

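# For Classification Models
# These metrics need discrete labels, so the continuous scores are binarized first.
# Assumption: a game counts as "high" (1) if its score is at or above the training median.
threshold = y_train.median()
y_train_cls = (y_train >= threshold).astype(int)
y_test_cls = (y_test >= threshold).astype(int)
y_train_pred_cls = (y_train_pred >= threshold).astype(int)
y_test_pred_cls = (y_test_pred >= threshold).astype(int)

print("Train Accuracy:", accuracy_score(y_train_cls, y_train_pred_cls))
print("Test Accuracy:", accuracy_score(y_test_cls, y_test_pred_cls))
print("Precision:", precision_score(y_test_cls, y_test_pred_cls))
print("Recall:", recall_score(y_test_cls, y_test_pred_cls))
print("F1 Score:", f1_score(y_test_cls, y_test_pred_cls))
print("Confusion Matrix:\n", confusion_matrix(y_test_cls, y_test_pred_cls))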

# For Regression Models

print("Train MSE:", mean_squared_error(y_train, y_train_pred))
print("Test MSE:", mean_squared_error(y_test, y_test_pred))
print("Train R-Squared:", r2_score(y_train, y_train_pred))
print("Test R-Squared:", r2_score(y_test, y_test_pred))