## Handling Noise in the Regression Dataset

Our regression dataset contains noise, which can degrade the model's predictive performance. Such noise typically comes from measurement errors, uncontrolled variability, or other factors unrelated to the target.

### Solution: PCA (Principal Component Analysis)
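To mitigate the noise, we applied PCA to reduce the dimensionality of the feature matrix. PCA projects the features onto the directions of greatest variance; keeping only the first 10 components discards the low-variance directions, which we assume are dominated by noise.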
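
To illustrate why dropping low-variance components can act as a noise filter, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration (the 2-dimensional latent signal, the noise level of 0.3, and names such as `demo_pca`); it does not use the sales dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical clean signal: 10 features driven by only 2 latent factors.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
clean = latent @ mixing

# Add independent Gaussian noise to every feature.
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# Keep 2 components, then map back to the original 10-dimensional space.
demo_pca = PCA(n_components=2).fit(noisy)
denoised = demo_pca.inverse_transform(demo_pca.transform(noisy))

# Compare how far each version is from the clean signal.
noisy_mse = np.mean((noisy - clean) ** 2)
denoised_mse = np.mean((denoised - clean) ** 2)
print(f"MSE of noisy data vs clean signal:    {noisy_mse:.4f}")
print(f"MSE of denoised data vs clean signal: {denoised_mse:.4f}")
print(f"Explained variance ratio: {demo_pca.explained_variance_ratio_}")
```

The filter works here because the signal really is low-dimensional. On real data this has to be checked, for example by inspecting `explained_variance_ratio_`.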

In [1]:
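import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the raw sales data
df = pd.read_csv('Model/VGChartzGamesSalesDataset.csv')

# Drop columns that are not used as features, then one-hot encode the categorical ones
df = df.drop(['name', 'img_url', 'release_date'], axis=1)
df = pd.get_dummies(df, columns=['publisher', 'genre'])

# Separate the features from the regression target
X = df.drop('total_sales', axis=1)
y = df['total_sales']

# Project the features onto the first 10 principal components.
# Note: PCA is fitted on all rows here, before the split, so the test rows
# also influence the components.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)

# Fit a linear regression on the reduced features
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out rows
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error (MSE) with PCA: {mse}')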


Mean Squared Error (MSE) with PCA: 0.28479284804888955
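
The cell above fits PCA on the full feature matrix before splitting and does not standardize the features, although PCA is sensitive to feature scale. One possible variant, sketched below, wraps scaling, PCA, and the regression in a scikit-learn pipeline so that every step is fitted on the training rows only, and adds a baseline without PCA for comparison. The data is a synthetic stand-in generated with `make_regression`, and names such as `X_demo` and `pca_model` are hypothetical; the numbers it prints say nothing about the sales dataset.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (assumption: not the real dataset).
X_demo, y_demo = make_regression(
    n_samples=400, n_features=30, n_informative=8, noise=5.0, random_state=42
)

# Split first, so the test rows never influence the scaler or the PCA.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scaling -> PCA -> regression, all fitted on the training rows only.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
pca_model.fit(X_tr, y_tr)

# Baseline without PCA, to check whether the reduction actually helps.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_tr, y_tr)

mse_pca = mean_squared_error(y_te, pca_model.predict(X_te))
mse_base = mean_squared_error(y_te, baseline.predict(X_te))
print(f"MSE with PCA:    {mse_pca:.4f}")
print(f"MSE without PCA: {mse_base:.4f}")
```

Comparing the two scores on the same split shows whether the discarded components carried mostly noise or useful signal.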
