## Handling Outliers in the Datasets

We identified that both our classification and regression datasets contained outliers, which could distort the models' predictions and overall performance. Outliers can arise from various sources, including data entry errors, variability in data collection, or genuine anomalies in the data.

### Solution: Isolation Forest
To address the outliers, we used Isolation Forest, an unsupervised anomaly-detection algorithm. It builds an ensemble of random trees that repeatedly split the data on randomly chosen features and thresholds; observations that can be isolated in fewer splits than the rest are scored as more anomalous. We applied it to both the classification and regression datasets with `contamination=0.1`, which flags roughly 10% of the rows in each dataset as outliers, and dropped the flagged rows before training, with the aim of making the models more robust.
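To make the mechanism concrete, here is a minimal sketch on synthetic data (not the project datasets). The arrays `inliers` and `planted` and the radius of 8 are assumptions chosen purely for illustration; only the `IsolationForest` settings mirror the cells in this section.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# 200 synthetic inliers clustered around the origin (illustrative data only)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# 10 synthetic outliers placed on a circle of radius 8, far from the cluster
angles = np.linspace(0.0, 2.0 * np.pi, num=10, endpoint=False)
planted = 8.0 * np.column_stack([np.cos(angles), np.sin(angles)])

X_demo = np.vstack([inliers, planted])

# Same settings as the notebook cells: flag roughly 10% of the rows
forest = IsolationForest(contamination=0.1, random_state=42)
labels = forest.fit_predict(X_demo)    # -1 = outlier, 1 = inlier
scores = forest.score_samples(X_demo)  # lower score = more anomalous

flagged_fraction = float(np.mean(labels == -1))
planted_flagged = int(np.sum(labels[-10:] == -1))

print(f"Fraction flagged: {flagged_fraction:.3f}")
print(f"Planted outliers flagged: {planted_flagged} of 10")
```

Because `contamination` sets the score threshold, the flagged fraction stays close to 10% regardless of how many genuine anomalies the data contains. Some ordinary points at the edge of the cluster are therefore flagged along with the planted ones.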

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest

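# Load the mushroom classification dataset
df_class = pd.read_csv('Model/mushrooms.csv')

# One-hot encode the categorical features and label-encode the target
df_class_encoded = pd.get_dummies(df_class.drop('class', axis=1))
le = LabelEncoder()
y_class = le.fit_transform(df_class['class'])

# Fit Isolation Forest; contamination=0.1 flags roughly 10% of rows as outliers
iso_forest_class = IsolationForest(contamination=0.1, random_state=42)
yhat_class = iso_forest_class.fit_predict(df_class_encoded)

# Keep only the inliers (fit_predict returns -1 for outliers, 1 for inliers)
X_class_no_outliers = df_class_encoded[yhat_class != -1]
y_class_no_outliers = y_class[yhat_class != -1]

# Compare the class counts before and after outlier removal
print('Class distribution before outliers removal:', pd.Series(y_class).value_counts())
print('Class distribution after outliers removal:', pd.Series(y_class_no_outliers).value_counts())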


Class distribution before outliers removal: 0    4208
1    3916
Name: count, dtype: int64
Class distribution after outliers removal: 1    3801
0    3510
Name: count, dtype: int64
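
The counts above imply that the filter did not remove rows evenly across classes. The short sketch below redoes that arithmetic from the printed numbers; the variable names are illustrative and the counts are copied by hand from the output rather than recomputed from the dataset.

```python
import pandas as pd

# Class counts copied from the printed output above (labels as produced by LabelEncoder)
before = pd.Series({0: 4208, 1: 3916})
after = pd.Series({0: 3510, 1: 3801})

removed = before - after
removed_share = removed / before               # share of each class flagged as outliers
overall_share = removed.sum() / before.sum()   # share of all rows flagged

summary = pd.DataFrame({
    "before": before,
    "after": after,
    "removed": removed,
    "removed_share": removed_share.round(4),
})
print(summary)
print(f"Overall share removed: {overall_share:.4f}")
```

About 16.6% of class 0 was dropped versus about 2.9% of class 1, while the overall share stays near the 10% set by `contamination`. The majority class therefore flips from 0 to 1 after filtering, which may be worth keeping in mind when reading downstream metrics.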


In [15]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest
from scipy import sparse

# Step 1: Load the dataset
df_reg = pd.read_csv('Model/VGChartzGamesSalesDataset.csv')  # forward slash avoids the invalid '\V' escape and works on every OS

# Step 2: Apply Label Encoding to the 'publisher' column
le = LabelEncoder()
df_reg['publisher'] = le.fit_transform(df_reg['publisher'])

# Step 3: Convert 'release_date' to a usable format (extracting year)
df_reg['release_year'] = pd.to_datetime(df_reg['release_date'], errors='coerce').dt.year
df_reg = df_reg.drop('release_date', axis=1)

# Step 4: Remove non-numeric columns like 'name' and 'img_url' if present
if 'name' in df_reg.columns:
    df_reg = df_reg.drop('name', axis=1)
if 'img_url' in df_reg.columns:
    df_reg = df_reg.drop('img_url', axis=1)

# Step 5: One-Hot Encode the 'genre' column
df_reg = pd.get_dummies(df_reg, columns=['genre'])

# Step 6: Convert all boolean columns to integers (1 for True, 0 for False) if any remain
bool_cols = df_reg.select_dtypes(include='bool').columns
df_reg[bool_cols] = df_reg[bool_cols].astype(int)

# Step 7: Handle missing values
df_reg = df_reg.fillna(0)  # Fill missing values with 0

# Step 8: Define features (X) and target (y)
X_reg = df_reg.drop('total_sales', axis=1)  # Features
y_reg = df_reg['total_sales']  # Target

# Step 9: Convert the feature matrix into a sparse matrix to reduce memory usage
X_sparse = sparse.csr_matrix(X_reg)

# Step 10: Apply Isolation Forest for outlier detection
iso_forest = IsolationForest(contamination=0.1, random_state=42)
yhat_reg = iso_forest.fit_predict(X_sparse)

# Step 11: Filter out the outliers (-1 means outlier, 1 means inlier)
X_reg_no_outliers = X_reg[yhat_reg != -1]
y_reg_no_outliers = y_reg[yhat_reg != -1]

# Step 12: Display shape of the data before and after outlier removal
print(f"\nShape of X_reg before outlier removal: {X_reg.shape}")
print(f"Shape of X_reg after outlier removal: {X_reg_no_outliers.shape}")

# Optional: Display the first few rows of the cleaned data
print("\nFirst few rows of cleaned data:\n", X_reg_no_outliers.head())



Shape of X_reg before outlier removal: (37715, 26)
Shape of X_reg after outlier removal: (33943, 26)

First few rows of cleaned data:
    publisher  vgchartz_score  critic_score  user_score  total_shipped  \
0        299        7.310415      7.228117    8.086988       5.066328   
1       2542        7.310415      7.228117    8.086988       5.066328   
2         96        7.310415      7.228117    8.086988       5.066328   
3       1286        7.310415      7.228117    8.086988       5.066328   
5       1238        7.310415      7.228117    8.086988       5.066328   

   release_year  genre_Action  genre_Action-Adventure  genre_Adventure  \
0          2003             0                       0                0   
1          1991             0                       0                0   
2          2005             0                       0                0   
3          2010             0                       0                1   
5          1999             0                       0  
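
Both cells repeat the same pattern: fit Isolation Forest, then mask the features and the target with the inlier labels. One possible way to factor that out is sketched below. The helper name `remove_outliers` and the synthetic `X_demo` and `y_demo` are hypothetical and not part of the original notebook; the defaults simply mirror the settings used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def remove_outliers(X, y, contamination=0.1, random_state=42):
    """Hypothetical helper: drop rows that Isolation Forest labels as outliers."""
    forest = IsolationForest(contamination=contamination, random_state=random_state)
    inlier_mask = forest.fit_predict(X) != -1  # True for inliers, False for outliers
    return X[inlier_mask], y[inlier_mask], inlier_mask


# Synthetic stand-in data, used only to exercise the helper
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=["a", "b", "c", "d"])
y_demo = pd.Series(rng.normal(size=500), name="target")

X_clean, y_clean, mask = remove_outliers(X_demo, y_demo)
print(f"Shape before: {X_demo.shape}, after: {X_clean.shape}")
```

Returning the mask alongside the filtered data keeps the features and target aligned and makes it easy to inspect which rows were dropped.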