### CS 421 PROJECT

In this project, you will be working with data extracted from famous recommender systems type datasets: you are provided with a large set of interactions between users (persons)  and items (movies). Whenever a user "interacts" with an item, it watches the movie and gives a mark or "rating": you can interpret a rating of "1" as a "like", a rating of "-1" as a "dislike" and a rating of "0" as a neutral "meh" rating. 




In this exercise, we will **not** be performing the recommendation task per se. Instead, we will identify *anomalous users*. In the dataset that you are provided with, some of the data was corrupted. Whilst most of the data comes from real life user-item interactions from a famous movie rating website, some "users" are anomalous: they were generated by me according to some undisclosed procedure. 

You are provided with two data frames: the first one ("ratings") contains the interactions provided to you, and the second one ("labels") contains the labels for the users.

As you can see, the three columns in "ratings" correspond to the user ID, the item ID and the rating. Thus, each row of "ratings" contains a single interaction. For instance, if the row "142, 152, 1" is present, this means that the user with ID 142 has given the movie 152 a positive rating of "1" ("like").

The dataframe "labels" has two columns. In the first column we have the user ids, whilst the second column contains the labels. A label of 1 indicates that the user is fake (generated by me), whilst a label of 0 denotes a natural user (coming from real life interactions). 

For instance, if the labels matrix contains the line "142, 1", it means that all of the ratings given by the user with id 142 are fake. This means all lines in the dataframe "ratings" which start with the userID 142 correspond to fake interactions. 

#### Evaluation

Your task is to be able to classify unseen instances as either anomalies or non anomalies (guess whether they are real users or if they were generated by me). 

There are **far more** normal users than anomalies in the dataset, which makes this a very heavily **unbalanced dataset**. Thus, accuracy will not be a good measure of performance, since simply predicting that every user is normal will give good accuracy. Thus, we need to use some other evaluation metrics (see lecture notes from week 3). 

THE **EVALUATION METRICS** are:  THE **AUC** (AREA UNDER CURVE), the **PRECISION**, THE **RECALL**, and the **F1 score**. The **main metric** will be the **AREA UNDER CURVE**, and it will by default be used to rank teams. This means your programs should return an **anomaly score** for each user (the higher the score, the more likely the model think the sample is anomalous).  

Every few weeks, we will evaluate the performance of each team (on an *unseen test set* I will provide) in terms of AUC, PRECISION, RECALL and F1 score, and rank the teams by **AUC** and by F1 score to distinguish between ties, where a tie is defined by a difference of less than 0.005 in AUC.  

The difficulty implied by **the generation procedure of the anomalies MAY CHANGE as the project evolves: depending on how well the teams are doing, I may generate easier or harder anomalies**.

Together with this file, you are provided with a first batch of labelled examples "first_batch_with_labels_likes.npz". You are also provided with the test samples to rank by the next round (without labels) in the file "second_batch_likes.npz".

The **first round** will take place after recess (week 9): this means that I will **release the next test set on the tuesday of week 9**, and you must hand in your scores for the second batch before the **WEDNESDAY at NOON (11th of October)**. Your submission will be a numpy array containing the **scores** for each of the users I will send you for each test set. We will then look at the results together on the thursday.  

We will check everyone's performance in this way every week (once on  week 10, once on week 11 and once on week 12). 

Whilst performance (expressed in terms of AUC and your ranking compared to other teams) at **each of the check points** (weeks 9 to 12 inclusive) is an **important component** of your **final grade**, the **final report** and the detail of the various methods you will have tried will **also** be very **important**. Ideally, to get perfect marks (A+), you should try at least **two supervised methods** and **two unsupervised methods**, as well as be ranked the **best team** in terms of performance.



In [1]:
import numpy as np
import pandas as pd
data=np.load("first_batch_with_labels_likes.npz")

In [2]:
X=data["X"]
y=data["y"]

XX=pd.DataFrame(X)
yy=pd.DataFrame(y)
XX.rename(columns={0:"user",1:"item",2:"rating"},inplace=True)

In [3]:
XX.head()

Unnamed: 0,user,item,rating
0,1220,6,0
1,1220,21,1
2,1220,31,0
3,1220,33,0
4,1220,35,-1


In [4]:
yy.rename(columns={0:"user",1:"label"},inplace=True)

In [5]:
yy.head(10)

Unnamed: 0,user,label
0,0,1
1,1,0
2,2,1
3,3,0
4,4,0
5,5,0
6,6,0
7,7,0
8,8,0
9,9,0


In [6]:
XX = XX.sort_values(by=["item"], ascending=True)
dictItem = {}
for i, row in XX.iterrows():
    rating = row["rating"]
    item = row["item"]
    
    try:
        keyValue = dictItem[item]
    except KeyError:
        keyValue = [0,0,0]
    if(rating == 1):
        keyValue[0] = keyValue[0] + 1
        dictItem[item] = keyValue
        
    elif(rating == 0):
        keyValue[1] = keyValue[1] + 1
        dictItem[item] = keyValue
        
    elif(rating == -1):
        keyValue[2] = keyValue[2] + 1
        dictItem[item] = keyValue
        
dictItem

{0: [437, 223, 71],
 1: [119, 195, 117],
 2: [35, 70, 67],
 3: [29, 78, 56],
 4: [239, 124, 38],
 5: [66, 72, 51],
 6: [155, 179, 72],
 7: [204, 94, 35],
 8: [160, 71, 35],
 9: [65, 149, 168],
 10: [167, 126, 73],
 11: [59, 75, 45],
 12: [161, 89, 35],
 13: [109, 42, 16],
 14: [11, 8, 8],
 15: [161, 71, 32],
 16: [181, 156, 85],
 17: [29, 73, 90],
 18: [63, 68, 35],
 19: [469, 155, 63],
 20: [63, 87, 84],
 21: [496, 127, 57],
 22: [50, 56, 27],
 23: [3, 9, 11],
 24: [23, 56, 46],
 25: [105, 134, 78],
 26: [21, 23, 18],
 27: [32, 45, 38],
 28: [67, 137, 95],
 29: [16, 24, 15],
 30: [134, 126, 97],
 31: [447, 184, 93],
 32: [52, 23, 12],
 33: [131, 121, 52],
 34: [92, 59, 33],
 35: [78, 185, 182],
 36: [49, 23, 8],
 37: [39, 91, 103],
 38: [35, 70, 123],
 39: [89, 127, 50],
 40: [202, 197, 75],
 41: [31, 95, 96],
 42: [34, 98, 135],
 43: [10, 19, 6],
 44: [30, 30, 23],
 45: [77, 148, 187],
 46: [234, 116, 44],
 47: [52, 60, 51],
 48: [12, 17, 24],
 49: [149, 192, 142],
 50: [14, 16, 25],

In [7]:
for i in dictItem:
    itemValue = dictItem[i]
    largestIdx = itemValue.index(max(itemValue))
    if(largestIdx == 0): 
        dictItem[i] = 1
    if(largestIdx == 1): 
        dictItem[i] = 0
    if(largestIdx == 2): 
        dictItem[i] = -1
dictItem   

{0: 1,
 1: 0,
 2: 0,
 3: 0,
 4: 1,
 5: 0,
 6: 0,
 7: 1,
 8: 1,
 9: -1,
 10: 1,
 11: 0,
 12: 1,
 13: 1,
 14: 1,
 15: 1,
 16: 1,
 17: -1,
 18: 0,
 19: 1,
 20: 0,
 21: 1,
 22: 0,
 23: -1,
 24: 0,
 25: 0,
 26: 0,
 27: 0,
 28: 0,
 29: 0,
 30: 1,
 31: 1,
 32: 1,
 33: 1,
 34: 1,
 35: 0,
 36: 1,
 37: -1,
 38: -1,
 39: 0,
 40: 1,
 41: -1,
 42: -1,
 43: 0,
 44: 1,
 45: -1,
 46: 1,
 47: 0,
 48: -1,
 49: 0,
 50: -1,
 51: 0,
 52: 0,
 53: 1,
 54: 1,
 55: 1,
 56: 0,
 57: 1,
 58: 1,
 59: 1,
 60: 0,
 61: 1,
 62: -1,
 63: 0,
 64: 1,
 65: 0,
 66: 0,
 67: 0,
 68: 1,
 69: 1,
 70: 0,
 71: 1,
 72: 1,
 73: 0,
 74: 0,
 75: 0,
 76: 1,
 77: 0,
 78: 1,
 79: -1,
 80: 0,
 81: 0,
 82: -1,
 83: 1,
 84: 0,
 85: 1,
 86: -1,
 87: 0,
 88: 1,
 89: 1,
 90: -1,
 91: 1,
 92: 0,
 93: 1,
 94: 0,
 95: 1,
 96: 0,
 97: 1,
 98: 1,
 99: 1,
 100: 1,
 101: 0,
 102: 0,
 103: 1,
 104: 1,
 105: 1,
 106: 0,
 107: 1,
 108: 0,
 109: 1,
 110: 1,
 111: 0,
 112: 1,
 113: 1,
 114: 1,
 115: 0,
 116: 1,
 117: 1,
 118: 1,
 119: 0,
 120: 1,
 121: 

In [8]:
XX = XX.sort_values(by=["user"], ascending=True)
XX
curr_user = -1

userItem = {}
for index, row in XX.iterrows():
    if (row["user"] != curr_user):
        curr_user +=1
        userItem[curr_user] = 0
        
    mostPopRating = dictItem[row["item"]] 
    
    if(mostPopRating == row["rating"]):
        userItem[curr_user] = userItem[curr_user] + 1
userItem

{0: 64,
 1: 19,
 2: 64,
 3: 69,
 4: 66,
 5: 12,
 6: 90,
 7: 550,
 8: 76,
 9: 6,
 10: 257,
 11: 63,
 12: 72,
 13: 83,
 14: 277,
 15: 270,
 16: 58,
 17: 43,
 18: 23,
 19: 107,
 20: 71,
 21: 32,
 22: 118,
 23: 47,
 24: 86,
 25: 6,
 26: 29,
 27: 67,
 28: 70,
 29: 32,
 30: 59,
 31: 60,
 32: 69,
 33: 25,
 34: 48,
 35: 34,
 36: 130,
 37: 94,
 38: 101,
 39: 105,
 40: 185,
 41: 37,
 42: 114,
 43: 221,
 44: 25,
 45: 176,
 46: 68,
 47: 119,
 48: 91,
 49: 7,
 50: 32,
 51: 14,
 52: 39,
 53: 48,
 54: 16,
 55: 187,
 56: 398,
 57: 275,
 58: 173,
 59: 24,
 60: 43,
 61: 122,
 62: 173,
 63: 96,
 64: 16,
 65: 67,
 66: 73,
 67: 11,
 68: 125,
 69: 119,
 70: 6,
 71: 117,
 72: 55,
 73: 78,
 74: 114,
 75: 52,
 76: 49,
 77: 32,
 78: 38,
 79: 43,
 80: 12,
 81: 142,
 82: 33,
 83: 50,
 84: 11,
 85: 33,
 86: 36,
 87: 46,
 88: 8,
 89: 159,
 90: 48,
 91: 173,
 92: 217,
 93: 75,
 94: 8,
 95: 8,
 96: 67,
 97: 68,
 98: 141,
 99: 52,
 100: 33,
 101: 129,
 102: 8,
 103: 75,
 104: 46,
 105: 198,
 106: 84,
 107: 80,
 108: 1

In [9]:
# Grouping by user and creating aggregated features
df_grouped = XX.groupby('user').agg(
    average_rating=('rating', 'mean'),
    total_interactions=('rating', 'size'),
    likes=('rating', lambda x: (x == 1).sum()),
    dislikes=('rating', lambda x: (x == -1).sum()),
    neutral_ratings=('rating', lambda x: (x == 0).sum())
)
df_grouped['likes_ratio'] = df_grouped['likes'] / df_grouped['total_interactions']
df_grouped['dislikes_ratio'] = df_grouped['dislikes'] / df_grouped['total_interactions']
df_grouped['interaction_balance'] = df_grouped['likes'] - df_grouped['dislikes']
df_grouped['neutral_ratio'] = df_grouped['neutral_ratings'] / df_grouped['total_interactions']
df_grouped['balance_ratio'] = df_grouped['interaction_balance'] / df_grouped['total_interactions']

# Merging with labels to create a single DataFrame
df_final = df_grouped.merge(yy, left_index=True, right_on='user')

df_final.head()


Unnamed: 0,average_rating,total_interactions,likes,dislikes,neutral_ratings,likes_ratio,dislikes_ratio,interaction_balance,neutral_ratio,balance_ratio,user,label
0,0.324786,117,59,21,37,0.504274,0.179487,38,0.316239,0.324786,0,1
1,0.62963,27,17,0,10,0.62963,0.0,17,0.37037,0.62963,1,0
2,0.071856,167,61,49,57,0.365269,0.293413,12,0.341317,0.071856,2,1
3,0.811321,106,88,2,16,0.830189,0.018868,86,0.150943,0.811321,3,0
4,0.65,100,66,1,33,0.66,0.01,65,0.33,0.65,4,0


In [10]:
df_grouped = XX.groupby('user').agg(
    total_interactions=('rating', 'size'),
    likes=('rating', lambda x: (x == 1).sum()),
    dislikes=('rating', lambda x: (x == -1).sum()),
    meh=('rating', lambda x: (x == 0).sum())
)
df_grouped['mean'] = df_grouped['likes'] + df_grouped['dislikes'] + df_grouped['meh'] / 3
df_grouped['std'] = df_grouped[['likes', 'dislikes', 'meh']].std(axis=1)
df_grouped['cv'] = df_grouped['std']/df_grouped['mean'] * 100

df_grouped['followed majority'] = pd.DataFrame(userItem.values())
df_grouped['followed majority %'] = df_grouped['followed majority'] / df_grouped['total_interactions']
# df_grouped.drop(['mean'], axis=1, inplace=True)
# df_grouped.drop(['std'], axis=1, inplace=True)

# Merging with labels to create a single DataFrame
df_final = df_grouped.merge(yy, left_index=True, right_on='user')

df_final.head()

Unnamed: 0,total_interactions,likes,dislikes,meh,mean,std,cv,followed majority,followed majority %,user,label
0,117,59,21,37,92.333333,19.078784,20.662943,64,0.547009,0,1
1,27,17,0,10,20.333333,8.544004,42.019691,19,0.703704,1,0
2,167,61,49,57,129.0,6.110101,4.736512,64,0.383234,2,1
3,106,88,2,16,95.333333,46.1447,48.403531,69,0.650943,3,0
4,100,66,1,33,78.0,32.501282,41.66831,66,0.66,4,0


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

# # Splitting the data into training and validation sets
# features = df_final.columns.difference(['user', 'label'])
# X = df_final[features]
# y = df_final['label']
# X_train, X_val, y_train, y_val = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y)



# # Scaling the features
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_val_scaled = scaler.transform(X_val)

# # # Training a logistic regression model
# # # logreg = LogisticRegression(solver='saga',max_iter=1000, random_state=42, penalty='elasticnet', l1_ratio=0, C=1.0)
# # logreg = LogisticRegression(max_iter=1000, random_state=42)
# # logreg.fit(X_train_scaled, y_train)

# logreg = LogisticRegression()

# param_grid = [    
#     {'penalty' : ['l1', 'l2', 'elasticnet', 'none'],
#     'C' : np.logspace(-4, 4, 20),
#     'solver' : ['lbfgs','newton-cg','liblinear','sag','saga'],
#     'max_iter' : [100, 1000, 2500, 5000, 10000, 25000]
#     }
# ]

# cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# clf = GridSearchCV(logreg, param_grid = param_grid, scoring='roc_auc', cv = cv, verbose=True, n_jobs=-1)
# best_clf = clf.fit(X_train_scaled,y_train)
# best_clf.best_estimator_

In [12]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Splitting the data into training and validation sets
features = df_final.columns.difference(['user', 'label'])
X = df_final[features]
y = df_final['label']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y)

# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Training a 2-layer neural network
# Define the hyperparameters
hidden_layer_sizes = (50, 10)  # The number of neurons in each hidden layer
activation = 'tanh'  # Activation function for the hidden layers ('logistic', 'tanh', 'relu', etc.)
solver = 'adam'  # The optimization algorithm ('adam', 'sgd', 'lbfgs', etc.)
alpha = 0.0001  # L2 regularization parameter
learning_rate = 'adaptive'  # The learning rate schedule for weight updates ('constant', 'invscaling', 'adaptive')
max_iter = 2000  # Maximum number of iterations
random_state = 44  # Seed for random initialization

mlp = MLPClassifier(
    hidden_layer_sizes=hidden_layer_sizes,
    activation=activation,
    solver=solver,
    alpha=alpha,
    learning_rate=learning_rate,
    max_iter=max_iter,
    random_state=random_state,
    batch_size=410,
    beta_1=0.7,
    beta_2=0.994
)
mlp.fit(X_train_scaled, y_train)

# Predicting probabilities for the validation set
mlp_probs = mlp.predict_proba(X_val_scaled)[:, 1]
mlp_auc = roc_auc_score(y_val, mlp_probs)

# Convert probabilities to binary predictions using a threshold (e.g., 0.5)
mlp_preds = (mlp_probs >= 0.50).astype(int)

# Calculate precision, recall, and F1-score
precision = precision_score(y_val, mlp_preds)
recall = recall_score(y_val, mlp_preds)
f1 = f1_score(y_val, mlp_preds)

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("ROC AUC for 2-layer Neural Network:", mlp_auc)

Precision: 0.76
Recall: 0.6333333333333333
F1-score: 0.6909090909090909
ROC AUC for 2-layer Neural Network: 0.8965833333333333


In [26]:
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

def get_scores(actual, pred):
    precision = precision_score(actual, pred)
    recall = recall_score(actual, pred)
    f1 = f1_score(actual, pred)
    auc_score = roc_auc_score(actual, pred)
    
    return {"Precision":precision, "Recall":recall, "F1_Score":f1, "AUC":auc_score}

In [21]:
# Splitting the data into training and validation sets
features = df_final.columns.difference(['user', 'label'])
X = df_final[features]
y = df_final['label']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=11, stratify=y)
X_test, X_val, y_test, y_val = train_test_split(
    X_val, y_val, test_size=0.5, random_state=11)

# Scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

In [32]:
import tensorflow as tf
from tensorflow.keras.utils import to_categorical

# Check if GPU is available and set the appropriate device
if tf.test.is_gpu_available():
    device_name = tf.test.gpu_device_name()
    print(f'GPU found: {device_name}')
else:
    print('No GPU found. Using CPU.')

# Create a simple neural network model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(9,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define the number of output classes (assuming a multi-class classification problem)
output_dim = 3  # Adjust based on the number of classes in your classification problem

# Convert target labels to one-hot encoding
y_train_encoded = to_categorical(y_train, num_classes=output_dim)
y_val_encoded = to_categorical(y_val, num_classes=output_dim)

# Train the model (provide your own training data and labels)
model.fit(X_train_scaled, y_train_encoded, epochs=200, batch_size=16, validation_data=(X_val_scaled, y_val_encoded))


No GPU found. Using CPU.
Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/

<keras.callbacks.History at 0x15f1936cfd0>

In [34]:
y_pred = model.predict(X_test)[:, 1]
y_pred = (y_pred >= 0.50).astype(int)
get_scores(y_test, y_pred)



  _warn_prf(average, modifier, msg_start, len(result))


{'Precision': 0.0, 'Recall': 0.0, 'F1_Score': 0.0, 'AUC': 0.5}

In [24]:
# lr = LogisticRegression(C=0.03359818286283781)

# lr.fit(X_train_scaled,y_train)
# # Predicting probabilities for the validation set
# logreg_probs = lr.predict_proba(X_val_scaled)[:, 1]
# logreg_auc = roc_auc_score(y_val, logreg_probs)

# precision = precision_score(y_val, lr.predict(X_val_scaled))
# recall = recall_score(y_val, lr.predict(X_val_scaled))
# f1 = f1_score(y_val, lr.predict(X_val_scaled))

# # Printing the results
# print("AUC:", logreg_auc)
# print("Precision:", precision)
# print("Recall:", recall)
# print("F1:", f1)