### By Amon Johnson and Francisco Teon

In [91]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


In [92]:
# Load Dataset
ign_df = pd.read_csv('ign.csv')

# Display the first few rows of the dataset
ign_df.head()

Unnamed: 0,title,score,score_phrase,platform,genre,release_year,release_month,release_day
0,Checkered Flag,10.0,Masterpiece,Lynx,Racing,1999,7,6
1,Chrono Trigger,10.0,Masterpiece,Wii,"Action, RPG",2011,5,25
2,Dragon Warrior III,10.0,Masterpiece,Game Boy Color,RPG,2001,7,20
3,Grand Theft Auto IV,10.0,Masterpiece,Xbox 360,"Action, Adventure",2008,4,25
4,Grand Theft Auto IV,10.0,Masterpiece,PlayStation 3,"Action, Adventure",2008,4,25


In [93]:
# Encode the genre column as integers so the algorithms can use it.
# Note: the integer codes carry no ordering or distance meaning.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
ign_df['genre'] = le.fit_transform(ign_df['genre'])

In [94]:
# Select the attributes for anomaly detection
attributes = ign_df[['score', 'genre', 'release_year']]

# Function to calculate Z-Scores and identify anomalies
def detect_anomalies_zscore(data, threshold=3):
    anomalies = {}
    for col in data.columns:
        # Population z-score (ddof=0), matching np.std's default
        z_scores = (data[col] - data[col].mean()) / data[col].std(ddof=0)
        # A boolean mask avoids label-based lookups, which break on a non-default index
        anomalies[col] = data.loc[z_scores.abs() > threshold, col].tolist()
    return anomalies

# Detect anomalies
anomalies = detect_anomalies_zscore(attributes)

# Display potential anomalies
print("Potential Anomalies Detected:")
print(anomalies)

Potential Anomalies Detected:
{'score': [1.8, 1.7, 1.7, 1.7, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.4, 1.3, 1.2, 1.2, 1.2, 1.2, 1.1, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.7, 0.5], 'genre': [], 'release_year': [1970]}


In [95]:
for col, values in anomalies.items():
    print(f"\nAnomalies in {col}:")
    for value in values:
        print(value)



Anomalies in score:
1.8
1.7
1.7
1.7
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.5
1.4
1.3
1.2
1.2
1.2
1.2
1.1
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
0.8
0.7
0.5

Anomalies in genre:

Anomalies in release_year:
1970


Did you notice any potential anomalies?

  Yes. The z-score test (threshold of 3) flagged 69 values in the 'score' column, all between 0.5 and 1.8, and one value in the 'release_year' column (1970). It flagged nothing in the 'genre' column.

Look into a few of the anomalies and try to identify what could have caused them. Explain what you found.

  The 'release_year' anomaly is straightforward. The record lists a 1970 release for the Xbox 360, a console that did not launch until 2005, so the year is almost certainly a data-entry error or a default value standing in for a missing date. It is also the only game in the dataset with a release year of 1970.
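
One way to check this kind of error systematically is to compare each review's release year with the launch year of its platform. The sketch below is only an illustration: `PLATFORM_LAUNCH_YEAR` is a hypothetical hand-written lookup that is not part of the dataset, and the rows in `sample` are made up to mimic the columns of ign.csv.

```python
import pandas as pd

# Hypothetical lookup, not part of ign.csv: year each console launched.
PLATFORM_LAUNCH_YEAR = {"Xbox 360": 2005, "PlayStation 3": 2006}

# Made-up rows that mimic the ign.csv columns used here.
sample = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "platform": ["Xbox 360", "PlayStation 3", "Xbox 360"],
    "release_year": [2008, 2008, 1970],
})

# Platforms missing from the lookup map to NaN and are never flagged.
launch_year = sample["platform"].map(PLATFORM_LAUNCH_YEAR)

# A release year earlier than the platform's launch year is impossible.
impossible = sample[sample["release_year"] < launch_year]
print(impossible)
```

A check like this would need to run on the original platform names, and the lookup would have to be extended by hand to cover every platform of interest.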

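  The 'score' anomalies are less clear-cut. The z-score is computed over the whole 'score' column, so these values are flagged only because they lie more than three standard deviations below the overall mean score; genre plays no part in this test. Since every flagged score falls between 0.5 and 1.8, they are more likely genuine reviews of very poorly received games than data errors. Whether a low score is also unusual for its genre (for example, a 1.5 in a genre that mostly scores high) would need a separate per-genre comparison.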
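
To test whether a score is unusual for its genre rather than for the dataset as a whole, one option is a per-genre z-score. The sketch below contrasts the two on made-up data (the `sample` frame and its values are illustrative, not taken from ign.csv), and it assumes the genre column still holds its original labels.

```python
import pandas as pd

# Made-up scores for two genres; not taken from ign.csv.
sample = pd.DataFrame({
    "genre": ["Action"] * 5 + ["Puzzle"] * 5,
    "score": [9.0, 9.0, 9.0, 9.0, 4.0, 4.0, 5.0, 6.0, 4.0, 5.0],
})

# Global z-score: one mean and standard deviation for the whole column.
global_z = (sample["score"] - sample["score"].mean()) / sample["score"].std(ddof=0)

# Per-genre z-score: each score is compared only with its own genre.
grouped = sample.groupby("genre")["score"]
genre_mean = grouped.transform("mean")
genre_std = grouped.transform(lambda s: s.std(ddof=0))
genre_z = (sample["score"] - genre_mean) / genre_std

sample["global_z"] = global_z
sample["genre_z"] = genre_z
print(sample.round(2))
```

In this toy frame, a 4.0 gets the same global z-score in both genres, but its per-genre z-score is noticeably lower in the high-scoring genre than in the low-scoring one.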

In [96]:
# Function to detect anomalies using Isolation Forest
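def detect_anomalies_isolation_forest(data):
    # contamination=0.05 tells the model to flag roughly 5% of the rows.
    # random_state fixes the seed (42 is arbitrary) so reruns flag the same rows.
    iso_forest = IsolationForest(contamination=0.05, random_state=42)
    preds = iso_forest.fit_predict(data)  # -1 = anomaly, 1 = normal
    return data[preds == -1]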

# Detect anomalies using Isolation Forest
anomalies_if = detect_anomalies_isolation_forest(attributes)

# Display potential anomalies
print("Potential Anomalies Detected (Isolation Forest):")
print(anomalies_if)

Potential Anomalies Detected (Isolation Forest):
       score  genre  release_year
1       10.0      6          2011
7       10.0      1          2013
8       10.0      1          2013
9       10.0      1          2014
10      10.0      1          2014
...      ...    ...           ...
18620    1.0      0          1997
18621    1.0      0          2001
18622    0.8     80          2009
18623    0.7      0          1998
18624    0.5     80          2003

[932 rows x 3 columns]


Did you notice any potential anomalies?

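Yes. Isolation Forest flagged 932 rows. Unlike the z-score test, it evaluates 'score', 'genre', and 'release_year' together, so a row can be flagged for an unusual combination of values rather than for a single extreme one. The count itself follows from contamination=0.05, which tells the model to flag roughly 5% of rows however unusual they actually are.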
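
The number of flagged rows is driven largely by the contamination setting, which can be seen on synthetic data. The sketch below uses random Gaussian data with no planted outliers (an assumption made purely for illustration; it is not the IGN dataset) and fits the model at three contamination values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data with no planted outliers; not the IGN dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))

fractions = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(contamination=contamination, random_state=0)
    preds = model.fit_predict(X)  # -1 = anomaly, 1 = normal
    fractions[contamination] = float(np.mean(preds == -1))
    print(f"contamination={contamination:.2f} -> flagged {fractions[contamination]:.3f}")
```

The flagged fraction tracks the contamination value even though the data contain no real outliers, so the row count on its own says little about how many records are actually wrong.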

Look into a few of the anomalies and try to identify what could have caused them. Explain what you found.

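One likely cause is that some genres, and combinations of genres, appear rarely in the dataset, which makes those rows easy to isolate. Another is an unusual pairing of score and genre: a score of 0.5, for example, stands out in a genre whose average score is high. The flagged rows also include perfect 10.0 scores, which the z-score test did not flag, so the model treats both ends of the score range as unusual. Because 'genre' is label-encoded, its integer codes have no real ordering, so some genre-related flags may reflect the encoding rather than the data.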
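
The rare-genre explanation can be checked by attaching each genre's share of the dataset to its rows. The sketch below uses made-up labels and scores, and the 0.2 cutoff is an arbitrary value chosen for this toy example. On the real data it would need to run before the LabelEncoder step, or on the decoded labels.

```python
import pandas as pd

# Made-up genre labels and scores; not taken from ign.csv.
sample = pd.DataFrame({
    "genre": ["Action"] * 6 + ["RPG"] * 3 + ["Action, Trivia"],
    "score": [8.0, 7.5, 9.0, 6.0, 7.0, 8.5, 9.0, 8.0, 7.0, 6.5],
})

# Share of rows carrying each genre label.
genre_share = sample["genre"].value_counts(normalize=True)

# Attach the share to every row so rows can be inspected by genre rarity.
sample["genre_share"] = sample["genre"].map(genre_share)

# Arbitrary cutoff for this toy example: genres on under 20% of rows.
rare = sample[sample["genre_share"] < 0.2]
print(rare)
```

Comparing the typical genre share of flagged rows with that of unflagged rows would show whether rarity is a plausible driver of the flags.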