# Model Evaluation:
<p><b><u>Evaluation measures used:</u></b></p>
<p>I will be using the F1 Score as the primary measure for performance.</p>
<p><b><u>Vector similarity measure:</u></b></p>
<p>I will be using the cosine similarity between user vectors and business vectors as a “closeness” measure. This measure shows how similar a user’s taste is to a specific business. It takes a value between -1 and 1, with 1 representing a perfect match.</p>
<p><b><u>Approach:</u></b></p>
<p>A set of users with a large number of reviews (at least 100) will be retrieved. These users will have their profiles (vectors) compared to a balanced set of their existing likes and dislikes.</p>
<p>A cosine similarity value will then be computed for each business vector in the balanced test set. A cut-off point will be created for each user that maps the cosine similarity to a binary value of 1 or -1. The cut-off point will be the median of all similarity scores obtained for the current user’s balanced test set. I use the median so that roughly half of the predictions are 1s and half are -1s.</p>
<p>The predicted (mapped) user score of 1 or -1 will then be compared to the actual score using a Boolean column of True/False. The TP, FP, TN, FN results will be calculated from the Boolean column and the F1 score will be obtained (for each user).</p>
<p>Lastly, the mean of all the F1 scores will be used to show how well this model performs.</p>
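<p>As a small illustration of the closeness measure used in the approach above, the sketch below computes cosine similarity for toy vectors in plain Python. The helper name <code>cosine_sim</code>, the three-feature vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for this example only; the notebook itself uses scikit-learn's <code>cosine_similarity</code>.</p>

```python
import math

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption for this sketch: an all-zero vector has similarity 0.0
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative 3-feature profiles (made-up values, not from the data set)
user = [0.7, -0.2, 0.5]
same_taste = [1.4, -0.4, 1.0]        # same direction, twice the length
opposite_taste = [-0.7, 0.2, -0.5]   # exactly the opposite direction

print(cosine_sim(user, same_taste))      # close to 1
print(cosine_sim(user, opposite_taste))  # close to -1
```

<p>Only the direction of the vectors matters, which is why a business vector that is a scaled copy of the user vector still scores as a perfect match.</p>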
<p><b><u>Additional info:</u></b></p>
<p>The user profile vectors are created with all of the user's likes and dislikes, even if the classes are imbalanced (more likes than dislikes). This gives a good idea of the user's taste based on the businesses they have already liked or disliked.
I did not perform a train/test split on the data used to create the user profiles, in order to preserve the information already used to build them. The test measures how well the full user profile vectors (containing all of the user-to-business interactions) perform, and can be run on a per-user basis to show how well each user's current model is performing.</p>
<p>The test set is a set of user reviews balanced between likes and dislikes. The balancing is performed by under-sampling the majority class. It is done to get reliable test results for the models, and to prevent high accuracy scores that arise only because one class outcome holds most of the values.
The test set can be pictured as new businesses that a user hasn't seen yet, with similar attributes to what they've already liked.</p>
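<p>The under-sampling described above can be sketched with pandas as follows. The function name <code>balance_by_undersampling</code>, the fixed <code>seed</code>, and the toy data are assumptions for illustration; the notebook samples likes and dislikes separately and without a fixed seed.</p>

```python
import pandas as pd

def balance_by_undersampling(reviews, label_col="like/dislike", seed=0):
    """Down-sample every class to the size of the smallest class present."""
    n_min = reviews[label_col].value_counts().min()
    return (
        reviews.groupby(label_col, group_keys=False)
               .sample(n=n_min, random_state=seed)   # fixed seed for repeatability
               .reset_index(drop=True)
    )

# Toy reviews: 7 likes and 3 dislikes (made-up data)
toy = pd.DataFrame({
    "business_id": [f"b{i}" for i in range(10)],
    "like/dislike": [1] * 7 + [-1] * 3,
})

balanced = balance_by_undersampling(toy)
print(balanced["like/dislike"].value_counts())
```

<p>Note that a user with only one class present would pass through unchanged in this sketch, so such users still need to be skipped separately.</p>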


In [2]:
import numpy as np
import pandas as pd
import mysql.connector
from itertools import chain
from sklearn.metrics.pairwise import cosine_similarity


fp = "C:/Users/Tolis/Documents/Data Analytics Cource/CKME136 X10/Project/data/final/profiles"


db_settings = {
    "host": "localhost",
    "un": "root",
    "pw": "",
    "db_name": "yelp"
}

def mysql_result_to_df(myresult, mycursor):
    field_names = [i[0] for i in mycursor.description]
    return pd.DataFrame(myresult, columns=field_names)

def flatten_list(l):
    return list(chain.from_iterable(l))

"""
    Takes in a user_id (string) and returns all reviews for the user provided.
    The user star rating gets converted to 1 (like) or -1 (dislike) based
    on the minimum like threshold of 3.5 stars.
"""
def get_user_reviews(user_id):
    mydb = mysql.connector.connect(
        host=db_settings["host"],
        user=db_settings["un"],
        passwd=db_settings["pw"],
        database=db_settings["db_name"]
    )
    
    mycursor = mydb.cursor()
    
    # Parameterized query: the connector escapes user_id, which avoids SQL injection
    q = """
        SELECT review_id, user_id, business_id,
        CASE
            WHEN stars >= 3.5 THEN 1
            ELSE -1
        END AS "like/dislike"
        FROM review WHERE user_id = %s
        """
    
    mycursor.execute(q, (user_id,))
    user_reviews = mycursor.fetchall()
    
    user_reviews_df = mysql_result_to_df(user_reviews, mycursor)
    mycursor.close()
    mydb.close()
    
    return user_reviews_df

"""
    Takes in a user profile vector and a set of business profile vectors.
    The cosine similarity is obtained between the user and business vector(s) and returned as a series.
"""
def get_recommendations(user_profile_df, business_profiles, max_amt=None):
    #Get the cosine similarity of the user profile vector and each business profile vector passed in
    rec_scores = cosine_similarity(user_profile_df.drop("user_id", axis=1), 
                                              business_profiles.drop("business_id", axis=1))

    # Convert scores to series object and set labels equal to business ids
    rec_scores = pd.Series(rec_scores[0])
    rec_scores.index = business_profiles["business_id"]
    rec_scores = rec_scores.sort_values(ascending=False)
    if max_amt is not None:
        rec_scores = rec_scores.head(max_amt)
        
    return rec_scores


"""
    Maps the user id string to a user vector, drops columns that aren't needed,
    and returns the result.
"""
def get_user_profile(user_profiles, user_id_map, user_id):
    
    result = user_profiles.merge(user_id_map, left_on="user_id", right_on="index")
    result = result.drop(["user_id_x","index"], axis=1)
    result["user_id"] = result["user_id_y"]
    result = result.drop(["user_id_y"], axis=1)
    result = result[ result["user_id"] == user_id ]
    
    return result

"""
    Takes in the confusion matrix from a test result and obtains metrics for evaluation:
    accuracy, recall, precision, f1_score.
    
    The function returns the f1_score, which is the primary measure.
"""
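def get_test_scores(conf_matrix):
    # .get() treats an outcome that never occurred as 0 instead of raising a KeyError
    tp = conf_matrix.get("TP", 0)
    fp_count = conf_matrix.get("FP", 0)  # not named fp, to avoid shadowing the global file path
    tn = conf_matrix.get("TN", 0)
    fn = conf_matrix.get("FN", 0)
    results = []
    
    accuracy = (tp+tn)/(tp+fp_count+tn+fn)
    # Guard against division by zero when a denominator is empty
    recall = tp/(tp+fn) if (tp+fn) > 0 else 0.0
    precision = tp/(tp+fp_count) if (tp+fp_count) > 0 else 0.0
    f1_score = 2*(recall * precision) / (recall + precision) if (recall + precision) > 0 else 0.0
    
    results.append(f1_score)
    
    return results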

## Query database for users that have at least 100 reviews and obtain their review rows.
<p>I chose users with at least 100 reviews because the model should work better when users have more business interactions, since the user profile vector changes every time a like/dislike is performed on a business. Users with enough reviews make testing easier, and a problem with the model is more visible during evaluation than it would be for a user with very few interactions.</p>

In [3]:
mydb = mysql.connector.connect(
    host=db_settings["host"],
    user=db_settings["un"],
    passwd=db_settings["pw"],
    database=db_settings["db_name"]
)

mycursor = mydb.cursor()

q = f"""
    SELECT review_id, user_id, business_id,
    CASE
        WHEN stars >= 3.5 THEN 1
        ELSE -1
    END AS "like/dislike"
    FROM review;
    """
mycursor.execute(q)
user_reviews = mycursor.fetchall()

user_reviews_df = mysql_result_to_df(user_reviews, mycursor)

q = f"""
    SELECT user_id, COUNT(*) FROM review
    GROUP BY user_id HAVING COUNT(*)>=100;
    """
mycursor.execute(q)
user_counts = mycursor.fetchall()

user_counts_df = mysql_result_to_df(user_counts, mycursor)

mycursor.close()
which_rows = user_reviews_df["user_id"].isin(user_counts_df["user_id"])
user_reviews_df_filt = user_reviews_df[which_rows]

## Apply a second filter to the data to ensure that the minority class between (1) like and (-1) dislike has at least 40 rows.
<p>A previous step shows a left skewed distribution for user star ratings, with most ratings between 3-5. Due to this, I expect a class imbalance mostly with the like count being greater than the dislike count.</p>
<p>To address this issue, I will under-sample the majority class based on the count of the minority class. The minimum count for the minority class (40) filters out users with very low dislike counts and prevents the balanced sample from being too small.</p>
<p>Example: If a user has 200 likes but only 4 dislikes, a sample of 4 likes will be taken to balance the likes/dislikes class.  This will leave us with a total of 8 records to test on, which is too small for a good evaluation.</p>
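<p>A minimal sketch of this filter on toy data is shown below. The function name <code>users_with_min_minority</code> and the three toy users are hypothetical. One deliberate difference from a plain minimum over existing groups: a class with no rows at all is counted as zero here, so a user with only likes is filtered out as well.</p>

```python
import pandas as pd

def users_with_min_minority(reviews, count_min=40):
    """Return user ids whose smaller class (like or dislike) has >= count_min rows."""
    counts = (
        reviews.groupby(["user_id", "like/dislike"]).size()
               .unstack(fill_value=0)                    # one column per class
               .reindex(columns=[-1, 1], fill_value=0)   # make sure both classes exist
    )
    minority = counts.min(axis=1)
    return minority[minority >= count_min].index.tolist()

# Toy users (made-up data):
#   a: 200 likes, 4 dislikes   -> minority of 4, too small
#   b: 55 likes, 45 dislikes   -> minority of 45, kept
#   c: 120 likes, no dislikes  -> minority of 0, dropped
toy = pd.DataFrame({
    "user_id": ["a"] * 204 + ["b"] * 100 + ["c"] * 120,
    "like/dislike": [1] * 200 + [-1] * 4 + [1] * 55 + [-1] * 45 + [1] * 120,
})

kept_users = users_with_min_minority(toy)
print(kept_users)
```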

In [4]:
# Min count for minority class
count_min = 40

# Get a count of each user's likes and dislikes to identify the minority class
user_reviews_df_grouped = user_reviews_df_filt.groupby(["user_id","like/dislike"], as_index=False).count()

# Get the unique ids for each user with at least 100 reviews
user_review_ids = pd.Series(user_reviews_df_grouped["user_id"].unique().tolist())

def filter_reviews(x, df=user_reviews_df_grouped, count_min=count_min):
    # Find row in grouped dataframe (by user_id and like/dislike)
    which_row = df["user_id"] == x
    current = df[which_row]
    
    # Check if the min count (minority class) between like or dislike meets the minimum
    return current["review_id"].min() >= count_min
    
user_review_ids = user_review_ids[user_review_ids.apply(filter_reviews)]
user_review_ids.shape
#user_id="-P3SyBLmBhyhDcYatlBgBQ"
#def user_reviews_df_grouped(row)
#user_reviews_df_grouped.apply()

(63,)

## Single user evaluation:
<p>I will perform the evaluation steps on a single user first to outline the process to obtain the F1 score.</p>

#### Balance like/dislike class
<p>I will use the balanced class result as the test set.</p>

In [4]:
# Take first id in filtered list (positional access, since the index keeps its pre-filter labels)
user_id = user_review_ids.iloc[0]

#Get current reviews for user
top_user_reviews = get_user_reviews(user_id)

#Filter reviews between likes and dislikes.
top_user_reviews_dislikes = top_user_reviews[ top_user_reviews["like/dislike"] == -1]
top_user_reviews_likes = top_user_reviews[ top_user_reviews["like/dislike"] == 1 ]

# If dislike count is minority class, use its count as the under sample amount.
# Otherwise, use the like count.
if(top_user_reviews_dislikes.shape[0]<top_user_reviews_likes.shape[0]):
    s_size = top_user_reviews_dislikes.shape[0]
else:
    s_size = top_user_reviews_likes.shape[0]

# Obtain samples with the minority class count.
# Note: one of these dataframes will have all of their rows returned (minority class)
# The majority class will be the only one sampled
top_user_reviews_dislikes = top_user_reviews_dislikes.sample(n=s_size)
top_user_reviews_likes = top_user_reviews_likes.sample(n=s_size)

#Combine balanced classes to one dataframe and use as test set
top_user_reviews_balanced = pd.concat([top_user_reviews_likes,top_user_reviews_dislikes], axis=0, ignore_index=True)

#### Preview balanced class count for current test user

In [6]:
top_user_reviews_balanced.groupby("like/dislike").count()["business_id"]

like/dislike
-1    66
 1    66
Name: business_id, dtype: int64

#### Get user profile for current test user
<p>The profile was constructed in previous steps utilizing all of the user's likes/dislikes</p>

In [7]:
user_id_map = pd.read_parquet(fp+"/user_id_map.gzip")
user_profiles = pd.read_parquet(fp+"/user_profile_weighted.gzip")
user_profiles_filt = get_user_profile(user_profiles, user_id_map, user_id)
user_profiles_filt

| | lot | garage | valet | street | validated | lunch | dinner | brunch | breakfast | dessert | ... | matchmakers | badminton | perfume | themed-cafes | misting-system-services | cideries | bike-tours | private-schools | drive-in-theater | user_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1979 | 0.197248 | -0.159493 | 0.0 | 0.690468 | -0.208199 | 0.51492 | 0.705978 | 0.168132 | 0.176434 | 0.514746 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -P3SyBLmBhyhDcYatlBgBQ |


#### Obtain the business profile vectors for the balanced test set
<p>First, all business profiles are opened, then filtered by the balanced set's business ids</p>

In [8]:
business_profiles_weighted = pd.read_parquet(fp+"/business_profiles_weighted.gzip")
test_ids = business_profiles_weighted["business_id"].isin(top_user_reviews_balanced["business_id"])

#### Get recommendation scores for current user profile vector and business vectors from the balanced set

In [9]:
rec_scores = get_recommendations(user_profiles_filt, business_profiles_weighted[test_ids])
rec_scores = rec_scores.reset_index()
# Join in actual review data for user to obtain TP, FP, TN, FN count.
rec_scores = top_user_reviews_balanced.merge(rec_scores, on="business_id")
rec_scores.columns = top_user_reviews_balanced.columns.tolist() + ["rec_score"]

#### Map the cosine similarity value to either +1 or -1
<p>I use the median value for all recommendation scores obtained as the cut-off point for a +1 or -1.  I did this to ensure that an even number of 1s and -1s are obtained from the recommendation score.  For example, making the cut-off value higher than the median will cause most rec_scores to be less than that amount, causing -1s to be returned most of the time.  Since we are testing on a balanced class of likes/dislikes, the model shouldn't overpredict only one class outcome.</p>
<p>*Note: The recommendation score median will be different for each user's results and will fluctuate depending on the results obtained from the balanced class sample</p>
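<p>The median cut-off can also be applied to the whole column at once instead of row by row. The sketch below uses <code>numpy.where</code> on made-up scores; the data frame name <code>rec</code> and its values are assumptions for illustration only.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical similarity scores for one user's balanced test set
rec = pd.DataFrame({"rec_score": [0.10, 0.25, 0.40, 0.55, 0.70, 0.90]})

# The cut-off is the median of this user's scores
cutoff = rec["rec_score"].median()

# Scores at or above the cut-off become +1, all others -1
rec["pred"] = np.where(rec["rec_score"] >= cutoff, 1, -1)
print(cutoff)
print(rec)
```

<p>With six distinct scores the median falls between the third and fourth values, so exactly three rows land on each side of the cut-off.</p>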

In [10]:
score_median = rec_scores["rec_score"].median()
def get_pred_score(row, cutoff):
    if(row["rec_score"]>=cutoff):
        return 1
    else:
        return -1
    
rec_scores["pred"] = rec_scores.apply(get_pred_score, axis=1, args=(score_median,))

#### Show the count of predictions made
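<p>The distribution should be close to 50% for each class outcome. It is not always exactly 50% because scores equal to the median are mapped to +1, and several reviews of the same business share the same score.</p>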
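<p>One way the split can drift away from an exact 50/50 is when several scores tie with the median, which may happen if a user reviewed the same business more than once. The scores below are made up to show the effect.</p>

```python
from statistics import median

# Hypothetical scores; the three 0.4 values mimic repeated reviews of one business
scores = [0.1, 0.2, 0.4, 0.4, 0.4, 0.9]
cutoff = median(scores)

# Same rule as in the notebook: scores at or above the cut-off map to +1
preds = [1 if s >= cutoff else -1 for s in scores]
n_pos = preds.count(1)
n_neg = preds.count(-1)
print(cutoff, n_pos, n_neg)
```

<p>Here every tied score lands on the +1 side, so four predictions are +1 and only two are -1.</p>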

In [11]:
rec_scores.groupby("pred").count()["rec_score"]

pred
-1    65
 1    67
Name: rec_score, dtype: int64

#### Create comparison column that checks if the predicted outcome matches the actual outcome

In [12]:
rec_scores["is_equal"] = rec_scores.apply(lambda row: row["like/dislike"] == row["pred"], axis=1)

#### Use boolean comparison field created above + the actual and predicted class outcome for a user to get (TP, FP, TN, FN) outcomes.
<p>I will group by the obtained test outcomes and get the counts for each one.  This will be the confusion matrix.</p>
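<p>As an alternative sketch, the four outcome counts can be obtained directly from the actual and predicted labels with boolean masks. The helper name <code>confusion_counts</code> and the toy labels are assumptions for this example; a like (+1) is treated as the positive class, matching the notebook.</p>

```python
import pandas as pd

def confusion_counts(actual, pred):
    """Count TP, FP, TN, FN for labels in {1, -1}, with 1 as the positive class."""
    actual = pd.Series(actual).reset_index(drop=True)
    pred = pd.Series(pred).reset_index(drop=True)
    return pd.Series({
        "TP": int(((actual == 1) & (pred == 1)).sum()),
        "FP": int(((actual == -1) & (pred == 1)).sum()),
        "TN": int(((actual == -1) & (pred == -1)).sum()),
        "FN": int(((actual == 1) & (pred == -1)).sum()),
    })

# Made-up labels for six test reviews
actual = [1, 1, 1, -1, -1, -1]
pred = [1, 1, -1, -1, 1, 1]

toy_counts = confusion_counts(actual, pred)
print(toy_counts)
```

<p>Unlike a group-by over observed outcomes, this always returns all four entries, with 0 for outcomes that never occurred.</p>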

In [13]:
def get_test_cat(row):
    if(row["is_equal"] and row["like/dislike"]==1):
        return "TP"
    elif(row["is_equal"] and row["like/dislike"]==-1):
        return "TN"
    elif( not(row["is_equal"]) and row["like/dislike"]==1):
        return "FN"
    else:
        return "FP"
rec_scores["is_equal_type"] = rec_scores.apply(get_test_cat, axis=1)
rec_scores.head()

| | review_id | user_id | business_id | like/dislike | rec_score | pred | is_equal | is_equal_type |
|---|---|---|---|---|---|---|---|---|
| 0 | 7lq-KgoGt_rZjG65Sn2-gg | -P3SyBLmBhyhDcYatlBgBQ | qwlsGR-pJb6xMWJBqiNIQw | 1 | 0.209411 | -1 | False | FN |
| 1 | BMI1Ad8Snnu29rH4DL0Oew | -P3SyBLmBhyhDcYatlBgBQ | oDBz4UnpaAVkBFGEAUaQPA | 1 | 0.435074 | 1 | True | TP |
| 2 | rZuh0QhZCmB2IxV6yI1fUQ | -P3SyBLmBhyhDcYatlBgBQ | oDBz4UnpaAVkBFGEAUaQPA | 1 | 0.435074 | 1 | True | TP |
| 3 | JfIQIv2UXypMxR3qjUOrAA | -P3SyBLmBhyhDcYatlBgBQ | oDBz4UnpaAVkBFGEAUaQPA | 1 | 0.435074 | 1 | True | TP |
| 4 | rb5wztLkodfzOZa8iDzAgQ | -P3SyBLmBhyhDcYatlBgBQ | oDBz4UnpaAVkBFGEAUaQPA | 1 | 0.435074 | 1 | True | TP |


#### Create confusion matrix from test results and calculate F1 score

In [14]:
conf_matrix = rec_scores.groupby("is_equal_type").count()["is_equal"]
conf_matrix

is_equal_type
FN    20
FP    21
TN    45
TP    46
Name: is_equal, dtype: int64

In [15]:
results = get_test_scores(conf_matrix)
print("User Id: ", user_id, "\nF1 Score:", results[0])

User Id:  -P3SyBLmBhyhDcYatlBgBQ 
F1 Score: 0.6917293233082707
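
<p>As a check on the figure above, the F1 score can be recomputed by hand from the confusion matrix counts (TP = 46, FP = 21, FN = 20):</p>

$$
\text{precision} = \frac{TP}{TP + FP} = \frac{46}{67} \approx 0.6866,
\qquad
\text{recall} = \frac{TP}{TP + FN} = \frac{46}{66} \approx 0.6970
$$

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
    = \frac{92}{133} \approx 0.6917
$$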


## Multiple user evaluation:
<p>The same steps as the single user evaluation will be performed for all users obtained in the filtered set.  I will take the average of all F1 scores obtained as the primary metric for how well this model performs.</p>
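<p>The averaging step can be sketched as below: one F1 score per user, then the unweighted mean across users. The helper <code>f1_from_counts</code> and the users <code>user_b</code> and <code>user_c</code> are hypothetical; only the counts for <code>user_a</code> are taken from the single-user example. Returning 0.0 when no likes were present or predicted is an assumption of this sketch.</p>

```python
from statistics import mean

def f1_from_counts(counts):
    """F1 from a dict of outcome counts; missing outcomes count as 0."""
    tp = counts.get("TP", 0)
    fp = counts.get("FP", 0)
    fn = counts.get("FN", 0)
    denom = 2 * tp + fp + fn
    # Assumption: F1 is reported as 0.0 when the denominator is empty
    return 2 * tp / denom if denom else 0.0

per_user_counts = {
    "user_a": {"TP": 46, "FP": 21, "TN": 45, "FN": 20},  # counts from the single-user example
    "user_b": {"TP": 10, "TN": 10},                      # hypothetical: every prediction correct
    "user_c": {"FP": 5, "FN": 5},                        # hypothetical: every prediction wrong
}

per_user_f1 = {uid: f1_from_counts(c) for uid, c in per_user_counts.items()}
mean_f1 = mean(per_user_f1.values())
print(per_user_f1)
print(mean_f1)
```

<p>Because each user contributes one score regardless of how many reviews they have, users with small test sets weigh as much as users with large ones.</p>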

In [16]:
i=0

# Will store each user's evaluation metrics
evaluation_results = []


for i,v in enumerate(user_review_ids):
    user_id = v
    
    # Get current user's reviews and separate to two datasets:
    # One for likes, one for dislikes...
    curr_user_reviews = get_user_reviews(user_id)
    curr_user_reviews_dislikes = curr_user_reviews[ curr_user_reviews["like/dislike"] == -1]
    curr_user_reviews_likes = curr_user_reviews[ curr_user_reviews["like/dislike"] == 1 ]
    
    # Get row counts for like and dislike data
    dislikes_count = curr_user_reviews_dislikes.shape[0]
    likes_count = curr_user_reviews_likes.shape[0]
    
    # Find the smaller row count for the likes, dislikes data and make that the under sample size.
    if(dislikes_count<likes_count):
        s_size = curr_user_reviews_dislikes.shape[0]
    else:
        s_size = curr_user_reviews_likes.shape[0]
        
    # Sample like, dislike data using smaller row count as amount
    # This is to balance the class...
    curr_user_reviews_dislikes = curr_user_reviews_dislikes.sample(n=s_size)
    curr_user_reviews_likes = curr_user_reviews_likes.sample(n=s_size)
    
    # Skip user if they only have all likes or all dislikes...
    if(curr_user_reviews_dislikes.empty or curr_user_reviews_likes.empty):
        continue
    
    
    # Create balanced test set for the user
    curr_user_reviews_balanced = pd.concat([curr_user_reviews_likes,
                                           curr_user_reviews_dislikes], axis=0, ignore_index=True)
    
    # Get user profile vector for current user
    user_profiles_filt = get_user_profile(user_profiles, user_id_map, user_id)
    
    # Get business profiles using business ids obtained from balanced class
    test_ids = business_profiles_weighted["business_id"].isin(curr_user_reviews_balanced["business_id"])
    
    # Get rec scores for current user and balanced test set
    rec_scores = get_recommendations(user_profiles_filt, business_profiles_weighted[test_ids])
    rec_scores = rec_scores.reset_index()
    rec_scores = curr_user_reviews_balanced.merge(rec_scores, on="business_id")
    rec_scores.columns = curr_user_reviews_balanced.columns.tolist() + ["rec_score"] 
    
    # Map scores to either +1, -1 using rec score median as the cut off
    score_median = rec_scores["rec_score"].median()
    rec_scores["pred"] = rec_scores.apply(get_pred_score, axis=1, args=(score_median,))
    
    # Check if predictions are equal to actuals
    rec_scores["is_equal"] = rec_scores.apply(lambda row: row["like/dislike"] == row["pred"], axis=1)
    
    # Map is_equal results to TP, FP, TN, FN
    rec_scores["is_equal_type"] = rec_scores.apply(get_test_cat, axis=1)
    
    # Get confusion matrix for current user and obtain results
    conf_matrix = rec_scores.groupby("is_equal_type").count()["is_equal"]
    results = get_test_scores(conf_matrix)
    
    # Append results for current user to list
    user_results = [user_id] + results
    evaluation_results.append(user_results)
    
evaluation_results = pd.DataFrame(evaluation_results, columns=["user_id","fscore"])

#### Preview the first 20 multi-user evaluation results, sorted by F1 score

In [17]:
evaluation_results.head(20).sort_values(by=["fscore"], ascending=False)

| | user_id | fscore |
|---|---|---|
| 9 | 9pNcdrQLWWrX0vEGGJlEbg | 0.893617 |
| 10 | ACwBMSJzgW6vOvV7vOrk8Q | 0.85 |
| 1 | 0Zswwlz4NzUJoG-skyWzIw | 0.846154 |
| 15 | FgyvflZtqRF03j5bIrlnlA | 0.804878 |
| 3 | 3xBFFH866WoySDG7uuwBSQ | 0.759494 |
| 17 | Ksp1e9Dw0Jcog_ZBD3-45g | 0.74 |
| 4 | 5CgjjDAic2-FAvCtiHpytA | 0.71875 |
| 18 | Lk70TsLeGBYSXsnr5q-cXg | 0.71 |
| 13 | CxDOIDnH8gp9KXzpBHJYXw | 0.703704 |
| 8 | 9-oFF_fYUJEfpm_Gm9fMAQ | 0.7 |


#### Obtain mean of all F1 score values

In [18]:
evaluation_results["fscore"].mean()

0.6958174263493548

In [19]:
evaluation_results = None