### CS 421 PROJECT

**Background & Objective**

In this project, you will work with data extracted from well-known recommender-system datasets: you are provided with a large set of interactions between users (persons) and items (movies). Whenever a user "interacts" with an item, they watch the movie and give it a "rating". There are 4 possible ratings: "dislike", "neutral", "like", and "watched". The "watched" rating indicates that the user has rated the movie, but the specific rating is unknown (you know that the user has watched the movie, but not whether they liked it).

In this exercise, we will **not** be performing the recommendation task per se. Instead, we will identify *anomalous users*. In the dataset that you are provided with, some of the data has been corrupted. Whilst most of the data comes from real-life user-item interactions from a famous movie rating website, some "users" are anomalous: they were generated by me according to some undisclosed procedure. Furthermore, there are **two types of anomalies**, each with its own generation procedure.

**Data**

You are provided with two data frames: the first one ("X") contains the interactions provided to you, and the second one ("yy") contains the labels for the users.

The three columns in "X" correspond to the user ID, the item ID and the rating (encoded in numerical form). Thus, each row of "X" contains a single interaction. For instance, if the row "142, 152, 10" is present, this means that the user with ID 142 has given movie 152 a positive rating of "like".

The table below shows what each numerical encoding of the rating corresponds to:

| Rating in X    | Meaning     |
| :------------- | :---------- |
| -10            | dislike     |
| 0              | neutral     |
| 10             | like        |
| 1              | watched     |

The dataframe "yy" has two columns. The first column contains the user IDs, whilst the second column contains the labels. A label of 0 denotes a natural user (coming from real-life interactions), whilst a label of 1 or 2 indicates an anomaly generated by me. The anomalies with label 1 are generated with a different procedure from the anomalies with label 2.

For instance, if the labels matrix contains the line "142, 1", it means that ALL of the ratings given by the user with ID 142 are fake, and were generated according to the first anomaly generation procedure. This means all rows in the dataframe "X" which start with the user ID 142 correspond to fake interactions.
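To make the layout concrete, here is a minimal sketch on a tiny made-up example. The array values and the names `RATING_NAMES`, `interactions`, `labels` and `labelled` are illustrative assumptions, not part of the provided files. It decodes the numeric ratings and attaches each user's label to all of that user's interactions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same layout as "X": user ID, item ID, rating.
X = np.array([
    [142, 152, 10],
    [142, 7, -10],
    [143, 152, 1],
    [143, 9, 0],
])
# Hypothetical toy labels in the same layout as "yy": user ID, label.
yy = np.array([
    [142, 1],
    [143, 0],
])

# Encoding taken from the rating table above.
RATING_NAMES = {-10: "dislike", 0: "neutral", 10: "like", 1: "watched"}

interactions = pd.DataFrame(X, columns=["user", "item_id", "rating"])
labels = pd.DataFrame(yy, columns=["user", "label"])

# Decode the ratings, then copy each user's label onto every one of their rows.
interactions["rating_name"] = interactions["rating"].map(RATING_NAMES)
labelled = interactions.merge(labels, on="user", how="left")
print(labelled)
```

In this toy example, both rows of user 142 carry label 1, mirroring the statement that all of an anomalous user's interactions are fake.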

**Evaluation**

Your task is to classify unseen instances as either anomalies or non-anomalies (that is, guess whether they are real users or were generated by me), and to indicate which anomaly type each anomaly belongs to.

There are **far more** normal users than anomalies in the dataset, which makes this a very heavily **unbalanced dataset**. Accuracy is therefore not a good measure of performance, since simply predicting that every user is normal already gives good accuracy. We need to use other evaluation metrics instead (see lecture notes from week 3).
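The point about accuracy can be checked on a small made-up label vector. The class counts below are assumptions chosen for illustration, not the real proportions in the dataset.

```python
import numpy as np

# Hypothetical label vector: 950 normal users, 30 of type 1, 20 of type 2.
y_true = np.array([0] * 950 + [1] * 30 + [2] * 20)

# Trivial baseline: predict "normal" for every user.
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())

# Recall per class: fraction of that class's users that are predicted correctly.
recall = {c: float((y_pred[y_true == c] == c).mean()) for c in (0, 1, 2)}

print(f"accuracy = {accuracy:.3f}")
print(f"recall per class = {recall}")
```

The baseline scores high on accuracy while finding none of the anomalies, which is why accuracy is uninformative here.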

The **evaluation metric** is the **AUC** (area under the curve) for each class, so there are three performance measures, one per class. The main final metric used for the ranking will be the average of the three. This means your programs should return a **score** for each combination of user and class. For instance, your model's prediction for user 1200 should consist of three scores $z_0,z_1,z_2$ corresponding to the normal class and the two anomalous classes respectively.
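As a rough sketch of how the three AUCs and their average could be computed, the block below treats each class one-vs-rest using scikit-learn's `roc_auc_score`. The labels and scores are randomly generated stand-ins, and the official evaluation script may differ in its details.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical ground truth for 100 users: 80 normal, 10 of each anomaly type.
y_true = np.array([0] * 80 + [1] * 10 + [2] * 10)

# Hypothetical scores z_0, z_1, z_2 per user (one column per class).
scores = rng.random((100, 3))
scores[np.arange(100), y_true] += 0.5  # make the toy scores somewhat informative

# One-vs-rest AUC per class: class c against everything else, using column c.
aucs = []
for c in range(3):
    is_class_c = (y_true == c).astype(int)
    aucs.append(float(roc_auc_score(is_class_c, scores[:, c])))

mean_auc = float(np.mean(aucs))
print("per-class AUC:", [round(a, 3) for a in aucs])
print("mean AUC:", round(mean_auc, 3))
```

Because AUC depends only on how the users are ranked within each column, the scores do not need to be calibrated probabilities or sum to one.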

Every few weeks, we will evaluate the performance of each team (on a *test set with unseen labels* that I will provide) in terms of all three AUCs.

The difficulty implied by **the generation procedure of the anomalies MAY CHANGE as the project evolves: depending on how well the teams are doing, I may generate easier or harder anomalies**.

**Deliverables**

Together with this file, you are provided with a first batch of labelled examples "first_batch_multi_labels.npz". You are also provided with the test samples to rank by the next round (without labels) in the file "second_batch_multi.npz".

The **first round** will take place after recess (week 9): you must hand in your scores for the second batch by **WEDNESDAY at NOON (16th of October)**. We will then look at the results together on the Thursday.

We will then check everyone's performance in the same way every week (once in week 10, once in week 11 and once in week 12).

To summarise, the project deliverables are as follows:

- Before every checkpoint's deadline, you need to submit **a `.npz` file** containing a NumPy array of shape $\text{number of test batch users} \times 3$, where each cell holds the predicted score of the user (row) for the class (column), with the columns ordered as normal, anomaly type 1, anomaly type 2. Rows should be ordered by user ID. For example, if the test batch contains users 1100-2200, the scores for user 1100 should be the first row (row 0), the scores for user 1101 should be the second row (row 1), and so on.

- In week 12 or 13 (schedule to be decided), you need to present your work in class. The presentation lasts **10 minutes**, followed by 5 minutes of Q&A.

- In week 12, you need to submit your **Jupyter Notebook** (with comments in Markdown) and the **slides** for your presentation.
- In week 13, you need to submit your **final report**. The final report should be 2-3 pages long (consisting of problem statement, literature review, and motivation of algorithm design), with unlimited references/appendix.
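The submission format from the first bullet can be sketched as follows. The key name `scores` inside the `.npz` archive, the file name and the toy values are assumptions made for illustration, since the brief does not specify them.

```python
import os
import tempfile

import numpy as np

# Hypothetical test batch: user IDs in arbitrary order, one score triple each.
user_ids = np.array([1102, 1100, 1104, 1101, 1103])
scores = np.array([
    [0.7, 0.2, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.6, 0.1, 0.3],
    [0.2, 0.2, 0.6],
])

# Reorder the rows so that row 0 is the smallest user ID, row 1 the next, and so on.
order = np.argsort(user_ids)
submission = scores[order]

# Save as .npz; the key name "scores" is an assumption, not a stated requirement.
path = os.path.join(tempfile.gettempdir(), "submission_example.npz")
np.savez(path, scores=submission)

# Reload to confirm the expected (number of users) x 3 shape.
with np.load(path) as loaded:
    reloaded = loaded["scores"]
print(reloaded.shape)
```

Sorting by user ID before saving guards against a row order that silently follows the model's internal ordering instead.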

Whilst performance (expressed in terms of AUC and your ranking compared to other teams) at **each of the checkpoints** (weeks 9 to 12 inclusive) is an **important component** of your **final grade**, the **final report** and the detail of the various methods you have tried will **also** be very **important**. Ideally, to get perfect marks (A+), you should try at least **two supervised methods** and **two unsupervised methods**, as well as be ranked the **best team** in terms of performance.


In addition, I will be especially interested in your **reasoning**. Particularly high marks will be awarded to any team that is able to **qualitatively describe** the difference between the two anomaly types. You are also encouraged to compute statistics for each class and describe how the classes differ.
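One possible starting point for such per-class statistics is sketched below on a made-up labelled interaction table. The values and the names `rating_share` and `ratings_per_user` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy interactions with the user's label already attached.
df = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "rating": [10, 1, 0, -10, -10, 10, 10, 10, 1],
    "label":  [0, 0, 0, 1, 1, 2, 2, 2, 2],
})

# Share of each rating value within each class (each row sums to 1).
rating_share = pd.crosstab(df["label"], df["rating"], normalize="index")

# Number of ratings per user, then averaged within each class.
per_user = df.groupby(["label", "user"]).size().rename("n_ratings").reset_index()
ratings_per_user = per_user.groupby("label")["n_ratings"].mean()

print(rating_share)
print(ratings_per_user)
```

Comparing these tables across labels 0, 1 and 2 is one way to begin describing the classes qualitatively.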

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import linear_model, preprocessing
from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split

b1_data = np.load("first_batch_multi_labels.npz")
b2_data = np.load("second_batch_multi_labels.npz")

In [28]:
b1_X = b1_data["X"]
b1_y = b1_data["yy"]
b2_X = b2_data["X"]
b2_y = b2_data["yy"]

b1_XX = pd.DataFrame(b1_X)
b1_yy = pd.DataFrame(b1_y)
b2_XX = pd.DataFrame(b2_X)
b2_yy = pd.DataFrame(b2_y)

data_XX = pd.concat([b1_XX, b2_XX])
data_YY = pd.concat([b1_yy, b2_yy])

In [29]:
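# Name the columns first, then preview the result (only the last expression in a cell is displayed)
data_XX.rename(columns={0:"user", 1:"item_id", 2:"rating"}, inplace=True)
data_XX.head()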

In [30]:
data_YY.rename(columns={0:"user",1:"label"},inplace=True)

In [31]:
data_YY.head()

|     | user | label |
| :-- | :--- | :---- |
| 0   | 0    | 0     |
| 1   | 1    | 0     |
| 2   | 2    | 0     |
| 3   | 3    | 0     |
| 4   | 4    | 0     |


In [68]:
!pip install numpy scipy numba


Collecting numpy
  Using cached numpy-1.22.4-cp39-cp39-win_amd64.whl (14.7 MB)
  Downloading numpy-1.21.6-cp39-cp39-win_amd64.whl (14.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.5
    Uninstalling numpy-1.23.5:
      Successfully uninstalled numpy-1.23.5
Successfully installed numpy-1.21.6


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.5.0 requires daal==2021.4.0, which is not installed.
tensorflow-intel 2.15.0 requires ml-dtypes~=0.2.0, but you have ml-dtypes 0.4.1 which is incompatible.
tensorflow-intel 2.15.0 requires numpy<2.0.0,>=1.23.5, but you have numpy 1.21.6 which is incompatible.


In [69]:
# Merge the interactions data (X) with the labels data (yy) on the 'user' column
merged_data = pd.merge(data_XX, data_YY, on='user')

# Function to calculate the ratio of a particular rating for a user
def user_rating_ratio(df, rating_value):
    return df.groupby('user')['rating'].apply(lambda x: (x == rating_value).mean())

# Function to calculate the ratio of a particular rating for an item
def item_rating_ratio(df, rating_value):
    return df.groupby('item_id')['rating'].apply(lambda x: (x == rating_value).mean())

# Add user-based features
merged_data['like_ratio_user'] = merged_data['user'].map(user_rating_ratio(merged_data, 10))
merged_data['dislike_ratio_user'] = merged_data['user'].map(user_rating_ratio(merged_data, -10))
merged_data['mean_rating_user'] = merged_data.groupby('user')['rating'].transform('mean')
merged_data['median_rating_user'] = merged_data.groupby('user')['rating'].transform('median')

# Add item-based features
merged_data['like_ratio_item'] = merged_data['item_id'].map(item_rating_ratio(merged_data, 10))
merged_data['dislike_ratio_item'] = merged_data['item_id'].map(item_rating_ratio(merged_data, -10))

# Keep the 'label' as it is from previous merging
features = merged_data[['user', 'like_ratio_user', 'dislike_ratio_user', 'mean_rating_user', 
                        'median_rating_user', 'like_ratio_item', 'dislike_ratio_item', 'label']]
features

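|        | user | like_ratio_user | dislike_ratio_user | mean_rating_user | median_rating_user | like_ratio_item | dislike_ratio_item | label |
| :----- | :--- | :-------------- | :----------------- | :--------------- | :----------------- | :-------------- | :----------------- | :---- |
| 0      | 1073 | 0.259259        | 0.000000           | 2.962963         | 1.0                | 0.222222        | 0.116466           | 0     |
| 1      | 1073 | 0.259259        | 0.000000           | 2.962963         | 1.0                | 0.416144        | 0.038401           | 0     |
| 2      | 1073 | 0.259259        | 0.000000           | 2.962963         | 1.0                | 0.353858        | 0.060795           | 0     |
| 3      | 1073 | 0.259259        | 0.000000           | 2.962963         | 1.0                | 0.305503        | 0.049336           | 0     |
| 4      | 1073 | 0.259259        | 0.000000           | 2.962963         | 1.0                | 0.241463        | 0.080488           | 0     |
| ...    | ...  | ...             | ...                | ...              | ...                | ...             | ...                | ...   |
| 353519 | 1660 | 0.197531        | 0.166667           | 0.851852         | 1.0                | 0.400236        | 0.043684           | 2     |
| 353520 | 1660 | 0.197531        | 0.166667           | 0.851852         | 1.0                | 0.355401        | 0.055749           | 2     |
| 353521 | 1660 | 0.197531        | 0.166667           | 0.851852         | 1.0                | 0.238434        | 0.085409           | 2     |
| 353522 | 1660 | 0.197531        | 0.166667           | 0.851852         | 1.0                | 0.337434        | 0.054482           | 2     |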
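The table above has one row per interaction, so the same user appears many times, whereas the evaluation expects one score triple per user. A possible way to work at user level is sketched below on made-up data. The toy frame, the column names such as `n_ratings`, and the choice of a stratified split are assumptions, not the pipeline used in the cells of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: 30 users (20 normal, 5 of each anomaly type), 10 interactions each.
user_labels = np.array([0] * 20 + [1] * 5 + [2] * 5)
toy = pd.DataFrame({
    "user": np.repeat(np.arange(30), 10),
    "rating": rng.choice([-10, 0, 1, 10], size=300),
})
toy["label"] = user_labels[toy["user"].to_numpy()]

# Boolean helper columns so that ratios become simple means.
toy["is_like"] = toy["rating"] == 10
toy["is_dislike"] = toy["rating"] == -10

# Collapse to one row per user.
per_user = toy.groupby("user").agg(
    like_ratio=("is_like", "mean"),
    dislike_ratio=("is_dislike", "mean"),
    mean_rating=("rating", "mean"),
    n_ratings=("rating", "count"),
    label=("label", "first"),
)

# Split users (not interactions), keeping the class proportions on both sides.
train_users, test_users = train_test_split(
    per_user, test_size=0.2, stratify=per_user["label"], random_state=42
)
print(len(train_users), len(test_users))
```

Splitting at user level keeps every interaction of a given user on one side of the split, so a held-out score reflects users the model has not seen.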


In [74]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, roc_auc_score
import numpy as np

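# 'features' is the dataframe with the engineered features and 'label' is the target.
# Note: each row is one interaction (not one user), and the 'user' ID column
# is kept among the input features here.
X = features.drop(columns=['label'])
y = features['label']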

# Scale the feature data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Build the MLPClassifier model
mlp = MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=50, verbose=True, random_state=42, early_stopping=True)


# Train the model
mlp.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = mlp.predict(X_test)

# Predict the probabilities for the test set
y_pred_proba = mlp.predict_proba(X_test)

# Classification report for Precision, Recall, F1-Score
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Normal', 'Anomaly Type 1', 'Anomaly Type 2']))

# AUC for each individual class (one-vs-rest)
y_test_onehot = pd.get_dummies(y_test)  # One-hot encode the true labels
for i, class_name in enumerate(['Normal', 'Anomaly Type 1', 'Anomaly Type 2']):
    class_auc = roc_auc_score(y_test_onehot.iloc[:, i], y_pred_proba[:, i])
    print(f'AUC for {class_name}: {class_auc}')





Iteration 1, loss = 0.22509208
Validation score: 0.962379
Iteration 2, loss = 0.07702514
Validation score: 0.981189
Iteration 3, loss = 0.05996857
Validation score: 0.982745
Iteration 4, loss = 0.05511781
Validation score: 0.983700
Iteration 5, loss = 0.05028678
Validation score: 0.985432
Iteration 6, loss = 0.04649387
Validation score: 0.986422
Iteration 7, loss = 0.04427035
Validation score: 0.986316
Iteration 8, loss = 0.04131991
Validation score: 0.987660
Iteration 9, loss = 0.03896515
Validation score: 0.988296
Iteration 10, loss = 0.03786821
Validation score: 0.983488
Iteration 11, loss = 0.03545316
Validation score: 0.988190
Iteration 12, loss = 0.03393842
Validation score: 0.989923
Iteration 13, loss = 0.03220766
Validation score: 0.991443
Iteration 14, loss = 0.03096851
Validation score: 0.990559
Iteration 15, loss = 0.02952529
Validation score: 0.992398
Iteration 16, loss = 0.02733703
Validation score: 0.990135
Iteration 17, loss = 0.02701565
Validation score: 0.990984
Iterat