## CS/INFO 5304 Assignment 2: Recommender Systems

#### Credit: 35 points + possible bonus (10 points)
#### Due date: May 2nd, 11:59PM

### Part D: bonus [10 points]

Your goal is to build a good recommendation system for Yelp with an ensemble of predictors. You can use any individual predictor and any method to combine them (it could be linear weighted combination or vote)

- Test set: 
    - 5 new users x the same 1000 businesses. Their records of the 100 first businesses are also erased.
    
- Submission: 
    - the prediction of the erased records which are 1s and 0s.
    - Submission format: sample_bonus_submission.csv, (5 rows, 100 columns, separator = comma, integers) Please make sure your raw text exactly matches the sample format. Otherwise you might have 0 points since we run auto grading.
    
- Evaluation metrics: 
    - Since the test set is sparse, i.e. most entries are 0s.  We use F1 score as our evaluation metric.
    - You can split a validation set out of the training set (for example, user 5-9 ) if you want to test your model. 
    
- Code and write-up:
    - Write your code for the test set in a separate jupyter notebook. At the top of the notebook, add brief write-ups to explain each predictor you used and how you combined them.
    - Your bonus points = max(10*min( yourF1 - 0.120.5 - 0.12, 1), 0)
        - This means that you will get some points as long as you attempt! For reference, a random guess(all as 1s) is 0.12. And 0.6 is pretty accurate.
      

In [4]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity
import random

# Load the datasets
business_df = pd.read_csv('data/business.csv', header = None, names = ["Business"])
user_business_df = pd.read_csv('data/user-business.csv', header = None)
user_business_test_df = pd.read_csv('data/user-business_test.csv', header = None)

In [5]:
from sklearn.metrics import f1_score
import random

def normalize_arr(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

class my_recom_system:
    def __init__(self, erased_vals = 100, train_sample = 10, epochs = 10, 
                 weight_step_limit = 0.05, bias = 0.3):
        self.erased_vals = erased_vals
        self.weights = np.array([0.5, 0.5])
        self.bias = bias
        self.train_sample = train_sample
        self.epochs = epochs
        self.weight_step_limit = weight_step_limit

    def train(self):
        best_weights = self.weights.copy()
        prev_increm_idx = 0
        bias_increm_sign = 1
        best_f1_score = 0.0
        best_bias = self.bias
        
        random_user_train_sample = random.sample(range(len(user_business_df)), self.train_sample)
        for e in range(1, self.epochs + 1):
            curr_f1_score = 0.0
            
            for i in random_user_train_sample:
                user_row = user_business_df.iloc[i, :].copy()
                user_row_targets = user_row[:self.erased_vals].copy()
                user_row[:self.erased_vals] = 0

                predictions = self.predict(user_business_df, user_row, i)
                curr_f1_score += f1_score(user_row_targets, predictions, zero_division = 1) / \
                                 self.train_sample
            
            
            print("epoch: ", e, "\t F1 Score: ", curr_f1_score, 
                  "\t Bias-Weights: ", self.bias, " - ", self.weights)  
            
            if curr_f1_score < best_f1_score:
                self.weights = best_weights.copy()
                prev_increm_idx = (prev_increm_idx + 1) % 2
                self.bias = best_bias
                bias_increm_sign *= -1
            else :
                best_weights = self.weights.copy()
                best_f1_score = curr_f1_score
                best_bias = self.bias                
            
            weight_rand_increment = random.uniform(0, self.weight_step_limit)
            self.bias += bias_increm_sign * random.uniform(0, 0.05)
            self.weights -= weight_rand_increment
            self.weights[prev_increm_idx] += 2 * weight_rand_increment
    
        # Chose best weights and bias.
#         self.bias = best_bias
#         self.weights = best_weights
    
    def predict(self, user_df, user_row, i):
     
        # user system
        user_simil_matrix = cosine_similarity(user_df.iloc[:, self.erased_vals:], 
                                      np.array(user_row[self.erased_vals:]).reshape(1, -1))
        user_recomm_matrix = normalize_arr(np.dot(user_simil_matrix.T, user_df)
                                          ).flatten()[:self.erased_vals]
        
        # item system    
        item_simil_matrix = cosine_similarity(user_df.drop(labels = [i], axis = 0).T)
        item_recomm_matrix = normalize_arr(np.dot(user_row, item_simil_matrix)
                                          )[:self.erased_vals]

        mixed_score = self.bias + user_recomm_matrix * self.weights[0] + \
                      item_recomm_matrix * self.weights[1]
        predictions = (mixed_score > 0.5).astype(int)
        
        return predictions
    
    
    def test(self, user_df):
        all_preds = np.zeros_like(user_df.iloc[:, :self.erased_vals])
        for i in range(len(user_df)):
            user_row = user_df.iloc[i, :].copy()
            predictions = self.predict(user_business_df, user_row, i)
            all_preds[i, :] = predictions
        
        pred_df = pd.DataFrame(all_preds)
        return pred_df

In [6]:
recom_sys = my_recom_system(train_sample = 250, epochs = 25)

recom_sys.train()

epoch:  1 	 F1 Score:  0.33631287601287607 	 Bias-Weights:  0.3  -  [0.5 0.5]
epoch:  2 	 F1 Score:  0.25549249378072914 	 Bias-Weights:  0.3388049400259937  -  [0.53378123 0.46621877]
epoch:  3 	 F1 Score:  0.414458585858586 	 Bias-Weights:  0.25039423138531414  -  [0.48098335 0.51901665]
epoch:  4 	 F1 Score:  0.4744366522366524 	 Bias-Weights:  0.20861170422933922  -  [0.43306314 0.56693686]
epoch:  5 	 F1 Score:  0.4937033189033191 	 Bias-Weights:  0.20078403246889065  -  [0.40450134 0.59549866]
epoch:  6 	 F1 Score:  0.5289532467532473 	 Bias-Weights:  0.17089912619931685  -  [0.38134285 0.61865715]
epoch:  7 	 F1 Score:  0.5549437229437234 	 Bias-Weights:  0.145271374958865  -  [0.35648908 0.64351092]
epoch:  8 	 F1 Score:  0.5931777777777782 	 Bias-Weights:  0.11231274897808666  -  [0.31981355 0.68018645]
epoch:  9 	 F1 Score:  0.5976952380952385 	 Bias-Weights:  0.10245075141367904  -  [0.31683343 0.68316657]
epoch:  10 	 F1 Score:  0.6032571428571433 	 Bias-Weights:  0.0635893

In [7]:
test_predictions = recom_sys.test(user_business_test_df)

display(test_predictions)
test_predictions.to_csv("bonus_submission.csv", header = False, index = False)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


### Turn in:
#### A2
    a) A Jupyter notebook a2.jpynb with the code and answers (if you work in a study group, write their names at the top to avoid any trouble in the plagiarism check. And we encourage you to write your code independently.)
    b) A a2.py exported from your .jpynb

#### A2-bonus
    c) bonus_submission.csv
    d) bonus.ipynb
    e) A bonus.py exported from your .jpynb (last time, some students submit an unexpected messy file with the raw content of Jupyter. Please make sure it is the exported one with codes only.)

__Please double check that you have all the required files submitted!__ Last time we received many regrading requests about this. In this and the following assignments, we will need to apply at least 20% penalty.