# Introduction

This notebook aims to compare users to each other so we can recommend content to them based on their similarity.

The notebook consists of 2 sections which are briefly explained below:

1. Generate dummy data for users who answered questionnaires/quizzes on Clear Your Mind
    - Includes username, answers to quiz 1 and quiz 2
    - This dummy data is saved to a csv file
    
    
2. Read the dummy data and use it to calculate the cosine similarity of the users
    - Performed on quiz 1 and quiz 2
    - Results displayed as top 10 most similar users
    - Answers displayed for a sample user and for their most and least similar users
    

# Section 1 - Generating dummy data

The first section demonstrates how the dummy data was generated using the Faker library. 

The results are saved to a CSV file, which the second section reads so that the recommender algorithm can be run without installing Faker.

**Note: Section 1 can be skipped**

In [None]:
# Run this command to install the library if it doesn't already exist locally
! pip install Faker

In [None]:
import random
import pandas as pd
from faker import Faker

fake = Faker()

### Generating data for 1000 users
Generate dummy data for 1000 users including a username and their answers to the questionnaires.

The questionnaire I used in the frontend has 21 answers. Each answer is an integer from 0 to 3, where 0 represents least likely and 3 represents most likely.

In [None]:
users = []
quiz1 = []
quiz2 = []

for _ in range(1000):
    # Generate a unique username
    users.append(fake.unique.user_name())
    # 21 answers per quiz, each a random integer from 0 to 3 inclusive
    quiz1.append([random.randint(0, 3) for _ in range(21)])
    quiz2.append([random.randint(0, 3) for _ in range(21)])

❗❗❗❗ `WARNING: Running the cell below will overwrite the saved csv file.` ❗❗❗❗

The code will still work as intended, but the usernames and quiz answers are regenerated at random, so the similarity results in Section 2 will differ from the ones shown here.

In [None]:
userdata = {
    "username": users,
    "quiz1": quiz1,
    "quiz2": quiz2,
}

users_df = pd.DataFrame(userdata)
users_df.to_csv('dummy_data.csv', index=False)

users_df.head()
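
One way to make the generated answers repeatable is to draw them from a seeded random number generator. The sketch below is only an illustration using the standard library: the seed value and the helper name `generate_answers` are assumptions, not part of the original notebook. Faker also documents a `Faker.seed()` class method that could be used in the same way for the usernames.

```python
import random

NUM_QUESTIONS = 21  # answers per quiz, as described above
SEED = 42           # arbitrary value chosen for this sketch


def generate_answers(rng, num_questions=NUM_QUESTIONS):
    """Return one user's answers, each an integer from 0 to 3 inclusive."""
    return [rng.randint(0, 3) for _ in range(num_questions)]


# Two generators created with the same seed produce the same answers.
rng = random.Random(SEED)
first_run = [generate_answers(rng) for _ in range(5)]

rng = random.Random(SEED)
second_run = [generate_answers(rng) for _ in range(5)]

print(first_run == second_run)  # True
```

With a fixed seed, re-running the generation cell would recreate the same quiz answers instead of replacing them with new ones.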

# Section 2 - Calculating similarities

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

pd.set_option('display.max_rows', 20)

In [2]:
# Import the CSV file and convert each stored answer string (e.g. "[3, 0, 2]") into a list of integers so it can be used with NumPy
users_df = pd.read_csv('dummy_data.csv')

users_df['quiz1'] = users_df['quiz1'].apply(lambda x: list(map(int, x.strip('[]').split(', '))))
users_df['quiz2'] = users_df['quiz2'].apply(lambda x: list(map(int, x.strip('[]').split(', '))))

users_df.head()

,username,quiz1,quiz2
0,kenneth14,"[3, 0, 2, 2, 3, 2, 3, 1, 2, 0, 1, 0, 3, 2, 3, ...","[0, 3, 2, 1, 2, 1, 3, 3, 2, 2, 0, 1, 0, 0, 3, ..."
1,coneill,"[0, 2, 1, 0, 1, 1, 1, 1, 3, 1, 0, 2, 3, 3, 2, ...","[0, 2, 0, 3, 2, 1, 2, 3, 1, 2, 1, 3, 1, 2, 1, ..."
2,emilyholt,"[3, 3, 0, 2, 0, 2, 0, 0, 0, 1, 0, 1, 0, 1, 1, ...","[2, 1, 0, 1, 0, 0, 3, 1, 2, 2, 3, 1, 0, 0, 0, ..."
3,tiffanyquinn,"[0, 0, 2, 3, 0, 2, 0, 3, 0, 1, 3, 3, 1, 0, 1, ...","[2, 0, 2, 3, 3, 3, 1, 2, 3, 0, 0, 1, 3, 2, 2, ..."
4,jenniferwilson,"[2, 2, 1, 2, 3, 2, 1, 0, 0, 3, 3, 0, 1, 2, 2, ...","[1, 0, 3, 1, 0, 0, 1, 1, 2, 2, 1, 1, 1, 2, 3, ..."
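
The CSV stores each answer list as a string, so it has to be turned back into a list on load. As an alternative sketch to stripping and splitting the string by hand, the standard library's `ast.literal_eval` can parse the list literal directly and does not depend on the exact spacing. The helper name `parse_answers` is hypothetical.

```python
import ast


def parse_answers(cell):
    """Turn a string such as "[3, 0, 2]" back into a list of integers."""
    answers = ast.literal_eval(cell)
    if not isinstance(answers, list) or not all(isinstance(a, int) for a in answers):
        raise ValueError(f"expected a list of integers, got {cell!r}")
    return answers


print(parse_answers("[3, 0, 2, 2, 3]"))  # [3, 0, 2, 2, 3]
print(parse_answers("[3,0,2]"))          # [3, 0, 2]
```

In the notebook this could be applied with `users_df['quiz1'].apply(parse_answers)`.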


In [3]:
# Convert the quiz answers to NumPy arrays so they can be passed to the cosine similarity function

quiz1 = users_df['quiz1']
quiz1_answers = np.array(quiz1.tolist())

quiz2 = users_df['quiz2']
quiz2_answers = np.array(quiz2.tolist())

In [4]:
# Calculate the user-by-user similarity matrix for each quiz
sim_quiz1 = cosine_similarity(quiz1_answers)

sim_quiz2 = cosine_similarity(quiz2_answers)

In [5]:
# Wrap each similarity matrix in a DataFrame labelled by username on both axes

# Quiz 1
similarity_quiz1_df = pd.DataFrame(sim_quiz1, columns=users_df['username'], index=users_df['username'])
similarity_quiz1_df.head()

username,kenneth14,coneill,emilyholt,tiffanyquinn,jenniferwilson,victoriaayala,gonzalesnicholas,eclements,simschad,matthewtaylor,...,catherine24,xstevens,gary75,angela89,paulmoreno,jessicakennedy,thomas42,perezbrenda,iking,dgarcia
kenneth14,1.0,0.666973,0.651773,0.594212,0.725655,0.682001,0.801279,0.691515,0.626783,0.812801,...,0.598764,0.586353,0.835182,0.707925,0.646632,0.714652,0.496521,0.554322,0.653835,0.722222
coneill,0.666973,1.0,0.521321,0.5,0.4258,0.623675,0.652233,0.585678,0.386953,0.523635,...,0.544581,0.28068,0.657667,0.584765,0.60212,0.523256,0.60212,0.762704,0.681608,0.754555
emilyholt,0.651773,0.521321,1.0,0.571772,0.632997,0.496139,0.68497,0.441425,0.537319,0.528352,...,0.415168,0.21398,0.661823,0.597566,0.560449,0.635764,0.528423,0.561747,0.524935,0.585517
tiffanyquinn,0.594212,0.5,0.571772,1.0,0.55354,0.610117,0.585678,0.745409,0.718627,0.508234,...,0.726108,0.392953,0.697127,0.522556,0.644129,0.506904,0.490098,0.723923,0.723339,0.565916
jenniferwilson,0.725655,0.4258,0.632997,0.55354,1.0,0.575766,0.757969,0.565265,0.666904,0.59457,...,0.668946,0.632094,0.736304,0.804547,0.594649,0.662837,0.51356,0.598878,0.537022,0.689242


In [6]:
# Quiz 2
similarity_quiz2_df = pd.DataFrame(sim_quiz2, columns=users_df['username'], index=users_df['username'])
similarity_quiz2_df.head()

username,kenneth14,coneill,emilyholt,tiffanyquinn,jenniferwilson,victoriaayala,gonzalesnicholas,eclements,simschad,matthewtaylor,...,catherine24,xstevens,gary75,angela89,paulmoreno,jessicakennedy,thomas42,perezbrenda,iking,dgarcia
kenneth14,1.0,0.717258,0.584305,0.668041,0.634069,0.613462,0.576392,0.529668,0.671576,0.631103,...,0.719708,0.663561,0.561448,0.729241,0.699455,0.575086,0.634498,0.5125,0.680451,0.566641
coneill,0.717258,1.0,0.621034,0.672664,0.608762,0.617708,0.713385,0.55,0.702742,0.757677,...,0.882803,0.697849,0.728652,0.763084,0.507093,0.601338,0.75,0.749497,0.803746,0.570563
emilyholt,0.584305,0.621034,1.0,0.410982,0.471728,0.55241,0.654477,0.470679,0.468065,0.532058,...,0.666687,0.454257,0.561747,0.64379,0.397796,0.3669,0.506633,0.534941,0.46513,0.411491
tiffanyquinn,0.668041,0.672664,0.410982,1.0,0.705279,0.816956,0.596863,0.729581,0.753316,0.796742,...,0.552241,0.636134,0.690354,0.576623,0.60349,0.788253,0.646792,0.70951,0.699505,0.66855
jenniferwilson,0.634069,0.608762,0.471728,0.705279,1.0,0.686772,0.439488,0.694879,0.708739,0.653218,...,0.436663,0.507937,0.738671,0.64646,0.737865,0.428571,0.564218,0.577948,0.577522,0.708338


### Top 10 similar users for quiz 1
The output below shows the 10 users with the highest similarity to the sample user, in this case the first user in the dataset.

Cosine similarity in general ranges from -1.0 to 1.0. Because every answer here is non-negative (0 to 3), the values fall between 0.0 and 1.0, where 0.0 means there is no question on which both users gave a non-zero answer and 1.0 means the two answer vectors point in exactly the same direction.

The first result always compares the sample user with themselves, giving a similarity of 1.0. This entry is of no value to the recommender system, which is why the cell below displays 11 rows rather than 10.
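
To make the score concrete, cosine similarity is the dot product of two answer vectors divided by the product of their lengths. The sketch below computes it by hand with the standard library only. The function name `cosine` and the short example vectors are made up for illustration, and returning 0.0 for an all-zero vector is a choice made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length answer vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treated as 0 here
    return dot / (norm_a * norm_b)


# Proportional answers point in the same direction.
print(round(cosine([1, 0, 1, 1], [3, 0, 3, 3]), 3))  # 1.0

# No question with a non-zero answer from both users.
print(cosine([3, 0, 0, 2], [0, 1, 3, 0]))  # 0.0

# Partly overlapping answers land somewhere in between.
print(round(cosine([3, 0, 2, 1], [2, 1, 2, 0]), 3))
```

Note that a score of 1.0 does not require identical answers, only answers in the same proportions.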


In [7]:
# Get the first user so the results can be sorted by similarity to them
sample_username = similarity_quiz1_df.index[0]

# Sort results from most to least similar and show the top 10 for the sample user

# Row of similarity scores for the sample user
user_similarity = similarity_quiz1_df.loc[sample_username]
user_similarity = user_similarity.sort_values(ascending=False)

# 11 rows: the sample user themselves (1.0) plus the 10 most similar users
user_similarity[:11]


username
kenneth14          1.000000
carlosvang         0.928279
oliviaturner       0.887940
jason48            0.887500
cchurch            0.880647
nathanielwright    0.877734
jennifer61         0.874123
jadewilliams       0.873445
marcsanchez        0.872717
aaronhernandez     0.872517
ubaker             0.870930
Name: kenneth14, dtype: float64
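
Since the self-comparison is not useful, it can be dropped before taking the top results. The following is a minimal sketch assuming a square similarity DataFrame labelled by username on both axes, like the ones built above. The helper name `top_similar` and the toy usernames and scores are hypothetical.

```python
import pandas as pd


def top_similar(similarity_df, username, n=10):
    """Return the n users most similar to `username`, excluding the user."""
    scores = similarity_df.loc[username].drop(labels=username)
    return scores.nlargest(n)


# Toy 4-user similarity matrix (made-up values, not taken from the dataset)
names = ["ann", "bob", "cat", "dan"]
toy = pd.DataFrame(
    [
        [1.0, 0.9, 0.2, 0.5],
        [0.9, 1.0, 0.4, 0.3],
        [0.2, 0.4, 1.0, 0.7],
        [0.5, 0.3, 0.7, 1.0],
    ],
    index=names,
    columns=names,
)

result = top_similar(toy, "ann", n=2)
print(result)
```

For "ann" this returns "bob" and "dan", in that order. In the notebook the equivalent call would be `top_similar(similarity_quiz1_df, sample_username)`.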

In [8]:
# Display the quiz 1 answers of the sample user and of the users with the highest and lowest similarity

sample_user_answers = users_df.loc[users_df['username'] == sample_username, 'quiz1'].values[0]
similar_user_answers = users_df.loc[users_df['username'] == user_similarity.index[1], 'quiz1'].values[0]
least_similar_user_answers = users_df.loc[users_df['username'] == user_similarity.index[-1], 'quiz1'].values[0]


print("Quiz 1 answers of\033[1m %s\033[0m: \n%s" %(sample_username, sample_user_answers))

print("\nHighest similarity answers of\033[1m %s\033[0m: \n%s" %(user_similarity.index[1], similar_user_answers))

print("\nLowest similarity answers of\033[1m %s\033[0m: \n%s" %(user_similarity.index[-1], least_similar_user_answers))


Quiz 1 answers of kenneth14: 
[3, 0, 2, 2, 3, 2, 3, 1, 2, 0, 1, 0, 3, 2, 3, 3, 1, 3, 3, 1, 2]

Highest similarity answers of carlosvang: 
[3, 2, 3, 2, 2, 3, 2, 1, 2, 1, 2, 0, 3, 2, 3, 2, 1, 3, 2, 0, 1]

Lowest similarity answers of katherinekelly: 
[0, 3, 0, 0, 3, 2, 0, 1, 1, 3, 0, 2, 0, 3, 0, 0, 2, 0, 2, 1, 0]


### Top 10 similar users for quiz 2

In [9]:
# Get the first user so the results can be sorted by similarity to them
sample_username2 = similarity_quiz2_df.index[0]

# Sort results from most to least similar and show the top 10 for the sample user

# Row of similarity scores for the sample user
user_similarity2 = similarity_quiz2_df.loc[sample_username2]
user_similarity2 = user_similarity2.sort_values(ascending=False)

# 11 rows: the sample user themselves (1.0) plus the 10 most similar users
user_similarity2[:11]


username
kenneth14        1.000000
karen57          0.864406
taylormegan      0.860745
harperangela     0.855628
hfisher          0.850160
james01          0.849315
amanda08         0.845982
wareangela       0.842762
ryanfischer      0.838931
petersonjenna    0.837478
donna83          0.837426
Name: kenneth14, dtype: float64

In [10]:
# Display the quiz 2 answers of the sample user and of the users with the highest and lowest similarity

sample_user_answers2 = users_df.loc[users_df['username'] == sample_username2, 'quiz2'].values[0]
similar_user_answers2 = users_df.loc[users_df['username'] == user_similarity2.index[1], 'quiz2'].values[0]
least_similar_user_answers2 = users_df.loc[users_df['username'] == user_similarity2.index[-1], 'quiz2'].values[0]

print("Quiz 2 answers of\033[1m %s\033[0m: \n%s" %(sample_username2, sample_user_answers2))

print("\nHighest similarity answers of\033[1m %s\033[0m: \n%s" %(user_similarity2.index[1], similar_user_answers2))

print("\nLowest similarity answers of\033[1m %s\033[0m: \n%s" %(user_similarity2.index[-1], least_similar_user_answers2))


Quiz 2 answers of kenneth14: 
[0, 3, 2, 1, 2, 1, 3, 3, 2, 2, 0, 1, 0, 0, 3, 1, 3, 0, 2, 0, 2]

Highest similarity answers of karen57: 
[0, 3, 0, 2, 2, 2, 3, 2, 2, 2, 1, 0, 2, 1, 1, 0, 3, 0, 2, 0, 2]

Lowest similarity answers of phillipsdenise: 
[1, 2, 2, 2, 0, 3, 0, 0, 0, 0, 1, 1, 3, 3, 1, 1, 0, 2, 2, 2, 0]


# Conclusion

Based on these results, we can recommend videos to a user by looking at the videos liked by the users most similar to them.
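
As a rough sketch of how that recommendation step might look, each video can be scored by summing the similarity of the similar users who liked it, skipping videos the target user has already seen. Everything here is hypothetical: the function name `recommend_videos`, the usernames, the similarity values and the liked videos are invented for illustration, and the notebook itself contains no video data.

```python
from collections import defaultdict


def recommend_videos(similar_users, likes, already_seen, top_n=3):
    """Rank unseen videos by the summed similarity of the users who liked them."""
    scores = defaultdict(float)
    for user, similarity in similar_users.items():
        for video in likes.get(user, ()):
            if video not in already_seen:
                scores[video] += similarity
    # Highest score first; ties are broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [video for video, _ in ranked[:top_n]]


# Hypothetical inputs: similarity of three users to the target user,
# and the videos each of those users liked.
similar_users = {"user_a": 0.9, "user_b": 0.8, "user_c": 0.5}
likes = {
    "user_a": {"breathing", "sleep"},
    "user_b": {"sleep", "focus"},
    "user_c": {"focus"},
}

recommendations = recommend_videos(similar_users, likes, already_seen={"breathing"})
print(recommendations)  # ['sleep', 'focus']
```

Weighting by similarity means a video liked by several close matches outranks one liked by a single, more distant user.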

This currently works on one quiz at a time (i.e. the similarity is computed separately from the answers to quiz 1 and from the answers to quiz 2).
Future work could combine all quizzes to make the recommendations more accurate.
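
One possible way to combine the quizzes, sketched here under assumptions, is to concatenate each user's answers to both quizzes into a single longer vector and compute the similarity on that. The helper `cosine_matrix` and the small arrays are made up for illustration. Averaging the two per-quiz similarity matrices would be another option.

```python
import numpy as np


def cosine_matrix(answers):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    answers = np.asarray(answers, dtype=float)
    norms = np.linalg.norm(answers, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing by zero for all-zero rows
    unit = answers / norms
    return unit @ unit.T


# Toy data: 3 users, 3 answers per quiz (the real quizzes have 21 answers each)
quiz1 = np.array([[3, 0, 2], [3, 1, 2], [0, 3, 0]])
quiz2 = np.array([[1, 1, 0], [0, 3, 3], [1, 2, 0]])

# One row per user, with the quiz 1 answers followed by the quiz 2 answers
combined = np.hstack([quiz1, quiz2])
sim = cosine_matrix(combined)

print(combined.shape)  # (3, 6)
print(sim.shape)       # (3, 3)
print(np.round(sim, 3))
```

In the notebook the same idea could be tried with `cosine_similarity(np.hstack([quiz1_answers, quiz2_answers]))`.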

