# Hybrid Recommendation System, Personal Matching System (Cosine Similarity) and KNN

---

Author: Corné van den Boogert <br>
Date: 23/07/21, last edited: 02/08/21 <br>
contact: corne.vandenboogert@student.hu.nl <br>
Organisation: Hogeschool Utrecht 

**Short description** <br>
This notebook is the "B" variant of iteration two. Instead of using only KNN to recommend a time management method, it adds user matching, so that a matched user "recommends" time management techniques to the other user. <br>

First, users are matched with cosine similarity. This is one of the most _basic_ approaches to user recommendation, so it quickly produces "a" match, though not necessarily the best one. Since this isn't a social matching platform, it only needs to match the users who are most similar. One could argue that someone who scores very high on time management should instead be matched with somebody who scores low, so they can help each other out. Due to time constraints I have set this idea aside for now.

Second, another recommender system (RS) is built on top of the matching system. It uses the K-nearest neighbors algorithm to find a user's top recommendations for time management methods, which are then shown to the user in the interface.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt

In [2]:
df = pd.read_csv('Survey_3.csv')

# Anonymise the users by deleting the name column. A user can still be identified by index number for testing purposes.
del df['What is your name?']

## Data cleaning

As we all know, the most fun part of working with data is cleaning the data. Let's remove everything we don't need, and see what we can retrieve back from the dataframe.

After that, add a new column named 'User' so we can always refer back to a user by their number. Split the column "school, work or personal" (what the time management methods are used for) into dummy variables, since multiple options could be selected.

In [3]:
df.rename(columns={'Tijdstempel':'time','User consent':'consent','What is your age?':'age',
                  'What is your highest education level? (can be ongoing or already finished)':'education',
                  'On a scale from one to five, how well can you manage your own time?':'manage-time',
                  'When you’re working in a flow, you have the feeling that you have a good balance between the difficulty of the goal you’re trying to reach, and your own skill for your goal (could be anything).':'difficulty',
                  'When you’re in a workflow, you have the feeling that you want to have immediate feedback on your work.':'feedback',
                  'When you’re in a workflow, you have the feeling that you are only aware of your own goal you’re trying to reach.':'goal',
                  'When you are in a workflow, you have a clear focus on what you are doing.':'focus',
                  'When you are in a workflow, the attention span is long enough to complete a small task.':'attentionspan',
                  'When you are in a workflow, you have a decreased sense of self awareness.':'awareness',
                'When you are in a workflow, you have a strong feeling of discipline to complete your task or goal at hand.':'discipline',
                  'When you are in a workflow, you have no awareness of time.':'time-awareness',
                  'Do you use time management methods for school, work or personal life?':'school-work-personal',}, inplace=True)
del df['time']
del df['consent']
del df['Did you miss any methods, please let me know here below!']
del df['Before you go']

In [4]:


df

Unnamed: 0,age,education,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,...,The Swiss Cheese Method / The Salami Method,The Seinfeld Method/Don't break the chain,The Spotlight Method,Time Boxing,Time Blocking,To-done-list,Top Goal,Triage Technique,Who's Got the Monkey,Zen to Done
0,28,MBO,3,4,5,3,4,4,3,4,...,3,3.0,1,3.0,4,1,3,2.0,1.0,1.0
1,28,Masters,4,3,4,3,5,4,4,5,...,3,1.0,1,4.0,5,4,3,2.0,1.0,1.0
2,35,College of Applied Sciences (HBO-BA),5,4,4,5,4,5,5,5,...,1,1.0,1,1.0,3,5,2,,1.0,4.0
3,24,master degree Universiteit,3,5,2,4,5,5,5,5,...,1,1.0,5,1.0,1,4,4,4.0,1.0,5.0
4,28,Mbo,4,4,4,5,3,4,4,3,...,4,3.0,3,4.0,5,3,3,4.0,2.0,3.0
5,30,University Master,4,5,1,5,5,3,3,4,...,4,1.0,1,3.0,2,1,2,1.0,1.0,3.0
6,25,MBO,4,5,4,2,4,5,4,5,...,3,1.0,1,5.0,1,2,5,5.0,1.0,3.0
7,24,MBO,4,4,4,1,5,3,4,5,...,4,1.0,1,1.0,5,1,1,3.0,3.0,4.0
8,25,Master,4,4,3,5,5,2,4,4,...,4,2.0,1,4.0,2,4,4,2.0,2.0,2.0
9,25,Master,4,4,3,5,5,2,4,4,...,4,2.0,1,4.0,2,4,4,2.0,2.0,2.0


In [5]:
df['User'] = np.arange(len(df))
#df1.set
df.head()

Unnamed: 0,age,education,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,...,The Seinfeld Method/Don't break the chain,The Spotlight Method,Time Boxing,Time Blocking,To-done-list,Top Goal,Triage Technique,Who's Got the Monkey,Zen to Done,User
0,28,MBO,3,4,5,3,4,4,3,4,...,3.0,1,3.0,4,1,3,2.0,1.0,1.0,0
1,28,Masters,4,3,4,3,5,4,4,5,...,1.0,1,4.0,5,4,3,2.0,1.0,1.0,1
2,35,College of Applied Sciences (HBO-BA),5,4,4,5,4,5,5,5,...,1.0,1,1.0,3,5,2,,1.0,4.0,2
3,24,master degree Universiteit,3,5,2,4,5,5,5,5,...,1.0,5,1.0,1,4,4,4.0,1.0,5.0,3
4,28,Mbo,4,4,4,5,3,4,4,3,...,3.0,3,4.0,5,3,3,4.0,2.0,3.0,4


I already knew the data frame contained some NaN values (which also force their columns to float), so these are filled in with 0 for the later analysis.

In [6]:
#df['Triage Technique'] = df['Triage Technique'].fillna(0)
#df['Triage Technique'] = df['Triage Technique'].astype(int)
df = df.replace(np.nan, 0)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 61 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   age                                          20 non-null     int64  
 1   education                                    20 non-null     object 
 2   manage-time                                  20 non-null     int64  
 3   difficulty                                   20 non-null     int64  
 4   feedback                                     20 non-null     int64  
 5   goal                                         20 non-null     int64  
 6   focus                                        20 non-null     int64  
 7   attentionspan                                20 non-null     int64  
 8   awareness                                    20 non-null     int64  
 9   discipline                                   20 non-null     int64  
 10  time

In [7]:
del df['age']
del df['education']
df

Unnamed: 0,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,school-work-personal,...,The Seinfeld Method/Don't break the chain,The Spotlight Method,Time Boxing,Time Blocking,To-done-list,Top Goal,Triage Technique,Who's Got the Monkey,Zen to Done,User
0,3,4,5,3,4,4,3,4,4,Work,...,3.0,1,3.0,4,1,3,2.0,1.0,1.0,0
1,4,3,4,3,5,4,4,5,5,School;Personal,...,1.0,1,4.0,5,4,3,2.0,1.0,1.0,1
2,5,4,4,5,4,5,5,5,4,School;Work;Personal,...,1.0,1,1.0,3,5,2,0.0,1.0,4.0,2
3,3,5,2,4,5,5,5,5,5,School;Work;Personal,...,1.0,5,1.0,1,4,4,4.0,1.0,5.0,3
4,4,4,4,5,3,4,4,3,4,Work,...,3.0,3,4.0,5,3,3,4.0,2.0,3.0,4
5,4,5,1,5,5,3,3,4,4,Work;Personal,...,1.0,1,3.0,2,1,2,1.0,1.0,3.0,5
6,4,5,4,2,4,5,4,5,4,Work;Personal,...,1.0,1,5.0,1,2,5,5.0,1.0,3.0,6
7,4,4,4,1,5,3,4,5,5,Work;Personal,...,1.0,1,1.0,5,1,1,3.0,3.0,4.0,7
8,4,4,3,5,5,2,4,4,4,School,...,2.0,1,4.0,2,4,4,2.0,2.0,2.0,8
9,4,4,3,5,5,2,4,4,4,School,...,2.0,1,4.0,2,4,4,2.0,2.0,2.0,9


In [8]:
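# Note: the answers are separated by ";" in the data, so splitting on "|" keeps each
# combination (e.g. "School;Personal") as a single dummy column, as the output shows
df = df.join(df['school-work-personal'].str.get_dummies("|"))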
del df['school-work-personal']

df

Unnamed: 0,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,1-3-5,...,Zen to Done,User,0,Personal,School,School;Personal,School;Work,School;Work;Personal,Work,Work;Personal
0,3,4,5,3,4,4,3,4,4,2.0,...,1.0,0,0,0,0,0,0,0,1,0
1,4,3,4,3,5,4,4,5,5,1.0,...,1.0,1,0,0,0,1,0,0,0,0
2,5,4,4,5,4,5,5,5,4,1.0,...,4.0,2,0,0,0,0,0,1,0,0
3,3,5,2,4,5,5,5,5,5,3.0,...,5.0,3,0,0,0,0,0,1,0,0
4,4,4,4,5,3,4,4,3,4,4.0,...,3.0,4,0,0,0,0,0,0,1,0
5,4,5,1,5,5,3,3,4,4,1.0,...,3.0,5,0,0,0,0,0,0,0,1
6,4,5,4,2,4,5,4,5,4,1.0,...,3.0,6,0,0,0,0,0,0,0,1
7,4,4,4,1,5,3,4,5,5,1.0,...,4.0,7,0,0,0,0,0,0,0,1
8,4,4,3,5,5,2,4,4,4,2.0,...,2.0,8,0,0,1,0,0,0,0,0
9,4,4,3,5,5,2,4,4,4,2.0,...,2.0,9,0,0,1,0,0,0,0,0
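
The separator passed to `str.get_dummies` decides whether a multi-choice answer such as `School;Personal` becomes one combined column or several separate ones. The sketch below illustrates the difference on a small made-up series; the `answers` values are hypothetical and only mimic the shape of the survey column.

```python
import pandas as pd

# Hypothetical answers in the same "A;B;C" shape as the survey column
answers = pd.Series(["Work", "School;Personal", "School;Work;Personal"])

# "|" never occurs in the answers, so every full string is treated as one category
combined = answers.str.get_dummies("|")

# ";" splits the answers, so every option gets its own 0/1 column
separate = answers.str.get_dummies(";")

print(list(combined.columns))
print(separate)
```

With `";"` a user who picked both School and Personal gets a 1 in both columns, which lets the similarity step see the overlap with users who picked only one of them.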


In [9]:
df_2 = pd.read_csv('Survey_1_29.csv')


In [10]:
df_2.rename(columns={'Tijdstempel':'time','User consent':'consent','What is your age?':'age',
                  'What is your highest education level? (can be ongoing or already finished)':'education',
                  'On a scale from one to five, how well can you manage your own time?':'manage-time',
                  'When you’re working in a flow, you have the feeling that you have a good balance between the difficulty of the goal you’re trying to reach, and your own skill for your goal (could be anything).':'difficulty',
                  'When you’re in a workflow, you have the feeling that you want to have immediate feedback on your work.':'feedback',
                  'When you’re in a workflow, you have the feeling that you are only aware of your own goal you’re trying to reach.':'goal',
                  'When you are in a workflow, you have a clear focus on what you are doing.':'focus',
                  'When you are in a workflow, the attention span is long enough to complete a small task.':'attentionspan',
                  'When you are in a workflow, you have a decreased sense of self awareness.':'awareness',
                'When you are in a workflow, you have a strong feeling of discipline to complete your task or goal at hand.':'discipline',
                  'When you are in a workflow, you have no awareness of time.':'time-awareness',
                  'Do you use time management methods for school, work or personal life?':'school-work-personal',}, inplace=True)
#df_2.head()

In [11]:
del df_2['time']
del df_2['consent']
del df_2['age']
del df_2['What is your gender?']

df_2

Unnamed: 0,"The next part of the survey is about time management, and time management techniques. What time management methods do you use and are you aware of?",Which of the following time managing methods do you feel the best using? (only choose one),manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,Would you like to help me with my research?
0,Time Boxing;To-do lists;Calendars (offline as ...,To-do lists,5,4,3,4,5,5,3,5,4,
1,To-do lists;Digital reminders;Calendars (offli...,Digital Reminders,3,4,3,5,4,5,3,4,5,
2,Time Boxing;To-do lists;Calendars (offline as ...,To-do lists,2,3,4,4,4,3,3,2,4,
3,To-do lists;Digital reminders;Calendars (offli...,Calendar,3,4,3,4,5,5,3,3,5,
4,To-do lists;Calendars (offline as online);Goal...,Calendar,3,4,3,4,4,5,4,4,5,
5,To-do lists;Calendars (offline as online);Goal...,To-do lists,4,3,5,3,4,3,3,4,2,Nienkemart@gmail.com - Nienkemart on instagram...
6,To-do lists;Calendars (offline as online);Goal...,To-do lists,4,3,2,5,5,5,5,5,5,pfr.voorrips@gmail.com
7,Digital reminders;Goal setting (setting a goal...,Digital Reminders,3,4,4,4,5,4,4,4,4,
8,To-do lists;Calendars (offline as online);Goal...,To-do lists,4,4,4,3,5,5,4,4,3,
9,To-do lists;Digital reminders;Calendars (offli...,To-do lists,4,4,5,1,4,4,4,5,1,


In [12]:
del df_2['The next part of the survey is about time management, and time management techniques. What time management methods do you use and are you aware of?']
del df_2['Which of the following time managing methods do you feel the best using? (only choose one)']
del df_2['Would you like to help me with my research?']

df_2

Unnamed: 0,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness
0,5,4,3,4,5,5,3,5,4
1,3,4,3,5,4,5,3,4,5
2,2,3,4,4,4,3,3,2,4
3,3,4,3,4,5,5,3,3,5
4,3,4,3,4,4,5,4,4,5
5,4,3,5,3,4,3,3,4,2
6,4,3,2,5,5,5,5,5,5
7,3,4,4,4,5,4,4,4,4
8,4,4,4,3,5,5,4,4,3
9,4,4,5,1,4,4,4,5,1


In [13]:
df_n = pd.concat([df, df_2], axis=0)
df_n

Unnamed: 0,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,1-3-5,...,Zen to Done,User,0,Personal,School,School;Personal,School;Work,School;Work;Personal,Work,Work;Personal
0,3,4,5,3,4,4,3,4,4,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4,3,4,3,5,4,4,5,5,1.0,...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,5,4,4,5,4,5,5,5,4,1.0,...,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3,5,2,4,5,5,5,5,5,3.0,...,5.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,4,4,4,5,3,4,4,3,4,4.0,...,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,4,5,1,5,5,3,3,4,4,1.0,...,3.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,4,5,4,2,4,5,4,5,4,1.0,...,3.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,4,4,4,1,5,3,4,5,5,1.0,...,4.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,4,4,3,5,5,2,4,4,4,2.0,...,2.0,8.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,4,4,3,5,5,2,4,4,4,2.0,...,2.0,9.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [14]:
df_n = df_n.replace(np.nan, 0)
df_n.head()

Unnamed: 0,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,1-3-5,...,Zen to Done,User,0,Personal,School,School;Personal,School;Work,School;Work;Personal,Work,Work;Personal
0,3,4,5,3,4,4,3,4,4,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4,3,4,3,5,4,4,5,5,1.0,...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,5,4,4,5,4,5,5,5,4,1.0,...,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3,5,2,4,5,5,5,5,5,3.0,...,5.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,4,4,4,5,3,4,4,3,4,4.0,...,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
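
`pd.concat` aligns the two surveys on column names, so rows from the second survey get NaN in every column that only exists in the first one, and those are the cells the `replace` call fills with 0. A minimal sketch with two hypothetical frames (`first` and `second` are stand-ins, not the survey data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: only the first survey has a rating column
first = pd.DataFrame({"focus": [4, 5], "Time Boxing": [3.0, 4.0]})
second = pd.DataFrame({"focus": [5, 4]})

stacked = pd.concat([first, second], axis=0)  # missing cells become NaN
filled = stacked.replace(np.nan, 0)           # same fill as in the notebook

print(stacked)
print(filled)
```

The stacked frame keeps the original row labels (0, 1, 0, 1 here), which is why the notebook rebuilds the `User` column with `np.arange` afterwards.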


In [15]:
df_n

Unnamed: 0,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,1-3-5,...,Zen to Done,User,0,Personal,School,School;Personal,School;Work,School;Work;Personal,Work,Work;Personal
0,3,4,5,3,4,4,3,4,4,2.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4,3,4,3,5,4,4,5,5,1.0,...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,5,4,4,5,4,5,5,5,4,1.0,...,4.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3,5,2,4,5,5,5,5,5,3.0,...,5.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,4,4,4,5,3,4,4,3,4,4.0,...,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,4,5,1,5,5,3,3,4,4,1.0,...,3.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,4,5,4,2,4,5,4,5,4,1.0,...,3.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,4,4,4,1,5,3,4,5,5,1.0,...,4.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,4,4,3,5,5,2,4,4,4,2.0,...,2.0,8.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,4,4,3,5,5,2,4,4,4,2.0,...,2.0,9.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [16]:
df_n['User'] = np.arange(len(df_n))
df_n = df_n.set_index('User')
#df1.set
df_n

User,manage-time,difficulty,feedback,goal,focus,attentionspan,awareness,discipline,time-awareness,1-3-5,...,Who's Got the Monkey,Zen to Done,0,Personal,School,School;Personal,School;Work,School;Work;Personal,Work,Work;Personal
0,3,4,5,3,4,4,3,4,4,2.0,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,4,3,4,3,5,4,4,5,5,1.0,...,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,5,4,4,5,4,5,5,5,4,1.0,...,1.0,4.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,3,5,2,4,5,5,5,5,5,3.0,...,1.0,5.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,4,4,4,5,3,4,4,3,4,4.0,...,2.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,4,5,1,5,5,3,3,4,4,1.0,...,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,4,5,4,2,4,5,4,5,4,1.0,...,1.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
7,4,4,4,1,5,3,4,5,5,1.0,...,3.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,4,4,3,5,5,2,4,4,4,2.0,...,2.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,4,4,3,5,5,2,4,4,4,2.0,...,2.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


---

### This is where the fun begins, let's learn some stuff with Cosine Sim.

I've used cosine similarity to compare every 'user profile' in the dataframe and use the result to match each user with another user. As this isn't a social matching system, the matching doesn't have to be top-notch. The 63 columns in the data set are enough to work with; however, with eleven rows the chance that one person gets multiple (identical) matches is very high. Alas, there will never be enough data.

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(df_n)
print(similarity)

[[1.         0.87533006 0.79660812 ... 0.49125336 0.48827363 0.46489413]
 [0.87533006 1.         0.84208314 ... 0.50202258 0.5092325  0.49224094]
 [0.79660812 0.84208314 1.         ... 0.63430458 0.63144689 0.60580895]
 ...
 [0.49125336 0.50202258 0.63430458 ... 1.         0.97877496 0.91846339]
 [0.48827363 0.5092325  0.63144689 ... 0.97877496 1.         0.97458318]
 [0.46489413 0.49224094 0.60580895 ... 0.91846339 0.97458318 1.        ]]
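
Each entry in the matrix above is the cosine of the angle between two users' answer vectors: the dot product divided by the product of the vector lengths. The sketch below works this out by hand for three hypothetical users (the vectors are made up; `cosine_sim` is an illustrative helper, not part of the notebook).

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical answer vectors; the trailing zeros mimic zero-filled columns
user_a = [3, 4, 5, 2, 4]
user_b = [4, 3, 4, 1, 5]
user_c = [5, 4, 3, 0, 0]

print(round(cosine_sim(user_a, user_b), 3))
print(round(cosine_sim(user_a, user_c), 3))
```

Zero-filled columns contribute nothing to the dot product, which is one plausible reason the users from the second survey score noticeably lower against the first twenty users in the matrix. The helper does not guard against an all-zero vector, which would cause a division by zero.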


In [18]:
df_n['User'] = np.arange(len(df_n))

In [19]:
#create a matrix of the similarity scores
df_n = pd.DataFrame(similarity, columns=df_n.index, index=df_n['User']).reset_index()
df_n

User,User.1,0,1,2,3,4,5,6,7,8,...,39,40,41,42,43,44,45,46,47,48
0,0,1.0,0.87533,0.796608,0.792604,0.846293,0.800362,0.856296,0.824687,0.875257,...,0.475688,0.485391,0.456997,0.495103,0.481929,0.481376,0.476226,0.491253,0.488274,0.464894
1,1,0.87533,1.0,0.842083,0.81218,0.880802,0.827422,0.868426,0.816627,0.898854,...,0.502224,0.493112,0.488656,0.515678,0.509359,0.494044,0.487503,0.502023,0.509232,0.492241
2,2,0.796608,0.842083,1.0,0.868959,0.822867,0.833282,0.818039,0.808915,0.840411,...,0.617921,0.62081,0.58816,0.624189,0.624024,0.605007,0.623768,0.634305,0.631447,0.605809
3,3,0.792604,0.81218,0.868959,1.0,0.821178,0.823492,0.819298,0.782546,0.836456,...,0.54281,0.506943,0.536667,0.532377,0.535497,0.535032,0.526028,0.523088,0.538752,0.53217
4,4,0.846293,0.880802,0.822867,0.821178,1.0,0.83661,0.830683,0.823047,0.891226,...,0.466516,0.477033,0.441642,0.478467,0.478856,0.472585,0.47643,0.490441,0.481767,0.464901
5,5,0.800362,0.827422,0.833282,0.823492,0.83661,1.0,0.805026,0.809298,0.866434,...,0.702357,0.667661,0.68116,0.700053,0.715151,0.690809,0.711462,0.682092,0.714637,0.725217
6,6,0.856296,0.868426,0.818039,0.819298,0.830683,0.805026,1.0,0.828339,0.879679,...,0.457748,0.453517,0.445503,0.464001,0.452495,0.444065,0.450006,0.454554,0.461083,0.445698
7,7,0.824687,0.816627,0.808915,0.782546,0.823047,0.809298,0.828339,1.0,0.819567,...,0.511327,0.493292,0.503708,0.529902,0.519859,0.49665,0.48667,0.499146,0.512697,0.501082
8,8,0.875257,0.898854,0.840411,0.836456,0.891226,0.866434,0.879679,0.819567,1.0,...,0.484583,0.472767,0.462944,0.49641,0.506745,0.484122,0.490865,0.490542,0.49641,0.488441
9,9,0.875257,0.898854,0.840411,0.836456,0.891226,0.866434,0.879679,0.819567,1.0,...,0.484583,0.472767,0.462944,0.49641,0.506745,0.484122,0.490865,0.490542,0.49641,0.488441


To "compare" the cosine similarity scores and the matrix, separate them into different data sets, so we can mix and match them in the recommender function later on.

In [20]:
df1 = df_n.drop(columns=['User'])
df1 = pd.DataFrame(df1)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 49 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       49 non-null     float64
 1   1       49 non-null     float64
 2   2       49 non-null     float64
 3   3       49 non-null     float64
 4   4       49 non-null     float64
 5   5       49 non-null     float64
 6   6       49 non-null     float64
 7   7       49 non-null     float64
 8   8       49 non-null     float64
 9   9       49 non-null     float64
 10  10      49 non-null     float64
 11  11      49 non-null     float64
 12  12      49 non-null     float64
 13  13      49 non-null     float64
 14  14      49 non-null     float64
 15  15      49 non-null     float64
 16  16      49 non-null     float64
 17  17      49 non-null     float64
 18  18      49 non-null     float64
 19  19      49 non-null     float64
 20  20      49 non-null     float64
 21  21      49 non-null     float64
 22  22  

In [21]:
indices = pd.Series(df_n.index, index=df_n['User']).drop_duplicates()
indices

User
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
11    11
12    12
13    13
14    14
15    15
16    16
17    17
18    18
19    19
20    20
21    21
22    22
23    23
24    24
25    25
26    26
27    27
28    28
29    29
30    30
31    31
32    32
33    33
34    34
35    35
36    36
37    37
38    38
39    39
40    40
41    41
42    42
43    43
44    44
45    45
46    46
47    47
48    48
dtype: int64

In [22]:
# Function that takes a user as input and returns the most similar users
def get_recommendations(User, df1=df1):
    # Get the index of the user in the similarity matrix
    idx = indices[User]

    # Get the pairwise similarity scores of all users with that user
    sim_scores = list(enumerate(df1[idx]))
    
    # Sort the users by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 4 most similar users (position 0 is skipped, as it is normally the user themself)
    sim_scores = sim_scores[1:5]

    # Get the user indices
    match_indices = [i[0] for i in sim_scores]

    # Return the 4 most similar users
    return df_n['User'].loc[match_indices]

Because this notebook is uploaded to GitHub, I've commented out the user input. If you want to try it, you can uncomment that line.

"What match do you want to see?" asks which user's matches you want to see. For testing purposes I've chosen user number 11 for now.

In [23]:
user = 11
#user = int(input('What match do you want to see? '))

#Print recommendations
def recommendation(x):
    print ('Your top 4 is:')
    # use the argument instead of the global `user`
    recommendations = get_recommendations(x, similarity)
    return recommendations

recommendation(user)

Your top 4 is:


17    17
13    13
6      6
8      8
Name: User, dtype: int64
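
`get_recommendations` skips the first entry of the sorted list on the assumption that it is the user themself. When another user has an identical profile (users 8 and 9 both show a similarity of 1.0 above), that assumption can fail and the user may end up in their own top 4. A minimal sketch of an alternative that filters by index instead; `top_matches` and the similarity row are hypothetical, not part of the notebook.

```python
def top_matches(sim_row, user, k=4):
    """Return the k (index, score) pairs most similar to `user`, never the user itself."""
    # Drop the user's own entry by index instead of by position after sorting
    others = [(i, s) for i, s in enumerate(sim_row) if i != user]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return others[:k]

# Hypothetical similarity row for user 2; user 1 has an identical profile
row_for_user_2 = [0.80, 1.00, 1.00, 0.65, 0.90, 0.70]

print(top_matches(row_for_user_2, user=2, k=3))
```

On this row, sorting everything and slicing `[1:4]` would keep user 2 in the result and drop user 1, because Python's sort leaves tied scores in their original order.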

---

# Rec system 2, Electric Boogaloo: KNN setup with user-sparse matrix

For the second recommendation system the data was already 'cleaned' *(taken from the "A" test of iteration 2)*, so (almost) no cleaning was needed. However, there are a lot of 1s in the dataframe that should actually be 0s, due to a fault in my survey, so I replaced them a little further down the line.

In [24]:
df_ur = pd.read_csv('user_ratings_final.csv', index_col=0)
df_ur

Unnamed: 0,user_0,user_1,user_2,user_3,user_4,user_5,user_6,user_7,user_8,user_9,...,user_39,user_40,user_41,user_42,user_43,user_44,user_45,user_46,user_47,user_48
1-3-5,2,1,1,3,4,1,1,1,2,2,...,0,0,0,0,0,0,0,0,0,0
168 Hours,4,2,1,1,1,1,4,1,4,4,...,0,0,0,0,0,0,0,0,0,0
10 Minutes,5,4,1,1,3,1,4,1,3,3,...,0,0,0,0,0,0,0,0,0,0
10 Minute Task,2,2,1,1,2,1,1,1,2,2,...,0,0,0,0,0,0,0,0,0,0
18 Minutes,2,1,4,5,2,1,2,1,1,1,...,0,0,0,0,0,0,0,0,0,0
90 Minute Focus Session,4,3,5,4,2,1,3,1,1,1,...,0,0,0,0,0,0,0,0,0,0
4D Systems,4,4,1,4,1,1,3,1,1,1,...,0,0,0,0,0,0,0,0,0,0
52/17,1,1,1,1,1,1,1,1,2,2,...,0,0,0,0,0,0,0,0,0,0
7 Minute Life,3,2,1,4,1,1,2,5,4,4,...,0,0,0,0,0,0,0,0,0,0
ABCDE,4,1,1,4,3,2,1,4,4,4,...,0,0,0,0,0,0,0,0,0,0


In [25]:
df_ur.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, 1-3-5 to Zen to Done
Data columns (total 49 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_0   48 non-null     int64
 1   user_1   48 non-null     int64
 2   user_2   48 non-null     int64
 3   user_3   48 non-null     int64
 4   user_4   48 non-null     int64
 5   user_5   48 non-null     int64
 6   user_6   48 non-null     int64
 7   user_7   48 non-null     int64
 8   user_8   48 non-null     int64
 9   user_9   48 non-null     int64
 10  user_10  48 non-null     int64
 11  user_11  48 non-null     int64
 12  user_12  48 non-null     int64
 13  user_13  48 non-null     int64
 14  user_14  48 non-null     int64
 15  user_15  48 non-null     int64
 16  user_16  48 non-null     int64
 17  user_17  48 non-null     int64
 18  user_18  48 non-null     int64
 19  user_19  48 non-null     int64
 20  user_20  48 non-null     int64
 21  user_21  48 non-null     int64
 22  user_22  48 non-null

In [26]:
#df_ur['user_10'] = df_ur['user_10'].fillna(0)
#df_ur['user_10'] = df_ur['user_10'].astype(int)

#df_ur['User Average'] = df_ur['User Average'].fillna(0)
#df_ur['User Average'] = df_ur['User Average'].astype(int)

df_ur.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48 entries, 1-3-5 to Zen to Done
Data columns (total 49 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   user_0   48 non-null     int64
 1   user_1   48 non-null     int64
 2   user_2   48 non-null     int64
 3   user_3   48 non-null     int64
 4   user_4   48 non-null     int64
 5   user_5   48 non-null     int64
 6   user_6   48 non-null     int64
 7   user_7   48 non-null     int64
 8   user_8   48 non-null     int64
 9   user_9   48 non-null     int64
 10  user_10  48 non-null     int64
 11  user_11  48 non-null     int64
 12  user_12  48 non-null     int64
 13  user_13  48 non-null     int64
 14  user_14  48 non-null     int64
 15  user_15  48 non-null     int64
 16  user_16  48 non-null     int64
 17  user_17  48 non-null     int64
 18  user_18  48 non-null     int64
 19  user_19  48 non-null     int64
 20  user_20  48 non-null     int64
 21  user_21  48 non-null     int64
 22  user_22  48 non-null

In [27]:
#df_ur['TMM'] = np.arange(len(df_ur))

In [28]:
#df_ur = df_ur.replace([1], 0)

#df_ur = df_ur.reset_index(drop=True)
#df_ur = pd.Series(df_ur.index, index=df_ur['TMM']).drop_duplicates()
#df_ur = pd.Series().drop_duplicates()
df_ur.head()

Unnamed: 0,user_0,user_1,user_2,user_3,user_4,user_5,user_6,user_7,user_8,user_9,...,user_39,user_40,user_41,user_42,user_43,user_44,user_45,user_46,user_47,user_48
1-3-5,2,1,1,3,4,1,1,1,2,2,...,0,0,0,0,0,0,0,0,0,0
168 Hours,4,2,1,1,1,1,4,1,4,4,...,0,0,0,0,0,0,0,0,0,0
10 Minutes,5,4,1,1,3,1,4,1,3,3,...,0,0,0,0,0,0,0,0,0,0
10 Minute Task,2,2,1,1,2,1,1,1,2,2,...,0,0,0,0,0,0,0,0,0,0
18 Minutes,2,1,4,5,2,1,2,1,1,1,...,0,0,0,0,0,0,0,0,0,0


---

### Setup K Nearest Neighbors

For this prototype I decided to use N = 3. Normally I would have used N = 5 or N = 7, but as there are only 11 user ratings in this dataset I really can't use N = 7, so I have stuck with N = 3. Note that the cell below calls `kneighbors` with `n_neighbors=5`.

In [29]:
from sklearn.neighbors import NearestNeighbors
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df_ur.values)
distances, indices = knn.kneighbors(df_ur.values, n_neighbors=5)

In [30]:
indices

array([[ 0,  4, 46,  7, 47],
       [ 1, 31, 21,  7, 10],
       [ 2, 15, 29, 26, 25],
       [ 3, 14, 25,  7, 29],
       [ 4, 47,  0,  5, 10],
       [ 5, 15, 29,  2, 20],
       [ 6, 10,  5, 15,  4],
       [ 7, 31, 10,  1, 15],
       [ 8, 15,  9, 47,  7],
       [ 9, 44, 42, 32, 17],
       [10, 44, 12, 11,  7],
       [11, 28, 10, 25, 15],
       [12, 10,  7, 44, 13],
       [13, 44, 31, 10,  7],
       [14,  3, 20, 25,  2],
       [15, 26,  2, 25,  5],
       [16, 45, 15, 32,  5],
       [17, 44, 34, 38, 42],
       [18, 40, 12,  9, 16],
       [19, 10, 34, 44, 17],
       [20, 14,  3,  5,  2],
       [21,  1, 41, 10, 28],
       [22, 32, 45, 30, 44],
       [23, 35, 37, 38, 11],
       [24, 25, 29, 11, 15],
       [25, 24, 29, 28, 26],
       [26, 15, 25, 38,  2],
       [27, 44, 33, 12, 31],
       [28, 25, 46, 11, 21],
       [29, 25,  2, 15,  5],
       [30, 37, 33,  9,  7],
       [31,  7,  1, 13, 44],
       [32, 42, 17,  9, 44],
       [33, 30, 27,  7, 32],
       [34, 17

In [31]:
distances

array([[3.33066907e-16, 1.07062173e-01, 1.51125312e-01, 1.54845745e-01,
        1.57984723e-01],
       [0.00000000e+00, 9.80247664e-02, 1.05413559e-01, 1.08943615e-01,
        1.17439809e-01],
       [0.00000000e+00, 1.00016118e-01, 1.07089123e-01, 1.19296634e-01,
        1.25603165e-01],
       [0.00000000e+00, 4.78915941e-02, 1.19325685e-01, 1.25957156e-01,
        1.32890030e-01],
       [0.00000000e+00, 1.00108524e-01, 1.07062173e-01, 1.48562666e-01,
        1.71336614e-01],
       [0.00000000e+00, 1.07408728e-01, 1.24950819e-01, 1.26043759e-01,
        1.40787035e-01],
       [0.00000000e+00, 1.14349173e-01, 1.91638598e-01, 1.98517301e-01,
        1.99492104e-01],
       [3.33066907e-16, 8.96745121e-02, 1.00264589e-01, 1.08943615e-01,
        1.10266080e-01],
       [0.00000000e+00, 1.16673554e-01, 1.32437566e-01, 1.34015505e-01,
        1.36823372e-01],
       [0.00000000e+00, 1.16984666e-01, 1.18288309e-01, 1.20770534e-01,
        1.23715573e-01],
       [4.44089210e-16, 8.3739
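
With `metric='cosine'` and `algorithm='brute'`, the search amounts to computing the cosine distance (1 minus the cosine similarity) from one row to every row and keeping the smallest values. The sketch below reproduces that idea in plain Python on a hypothetical ratings table; `k_nearest` is an illustrative helper and not the scikit-learn implementation.

```python
from math import sqrt

def cosine_distance(a, b):
    """1 - cosine similarity of two rating rows."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def k_nearest(rows, query_index, k):
    """Brute-force k nearest rows to rows[query_index] by cosine distance."""
    dists = [(cosine_distance(rows[query_index], r), i) for i, r in enumerate(rows)]
    dists.sort()
    return [i for _, i in dists[:k]], [d for d, _ in dists[:k]]

# Hypothetical ratings: one row per method, one column per user
ratings = [
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [1, 2, 4, 5],
]

neighbours, dists_out = k_nearest(ratings, query_index=0, k=3)
print(neighbours, dists_out)
```

As in the `indices` and `distances` arrays above, the first neighbour of a row is the row itself, at a distance that is zero up to floating-point error. Asking for 5 neighbours therefore yields 4 other methods.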

---

### Make the recommendation function for the KNN Recommender.

This function was taken from [here](https://github.com/TheClub4/collaborative_filtering/blob/master/collaborative_filtering.ipynb), as writing it gave me incredible headaches; it did seem to work in the end.

First, let's test the KNN from the perspective of a user.
This first test is a bit backwards: it checks which results are the closest.

In [32]:
# get the index for 'POSEC'
index_for_tmm = df_ur.index.tolist().index('POSEC')

# find the indices for the similar TMM
sim_tmm = indices[index_for_tmm].tolist()

# distances between 'POSEC' and the similar TMM
tmm_distances = distances[index_for_tmm].tolist()

# the position of 'POSEC' in the list sim_tmm
id_tmm = sim_tmm.index(index_for_tmm)

# remove 'POSEC' from the list sim_tmm
sim_tmm.remove(index_for_tmm)

# remove 'POSEC' from the list tmm_distances
tmm_distances.pop(id_tmm)
print('The Nearest TMM to POSEC:', sim_tmm)
print('The Distance from POSEC:', tmm_distances)

The Nearest TMM to POSEC: [25, 29, 11, 15]
The Distance from POSEC: [0.06567118213103829, 0.12786973138587987, 0.1405979452452727, 0.15963263627405588]


As we can see in this example, the nearest time management methods (TMM) to POSEC, based on user ratings, are numbers 25, 29, 11 and 15.

I have a Dropbox Paper file, which you can find [here](https://paper.dropbox.com/doc/List-of-time-management-techniques-WC2YZPULFGbqze4cC7Bgu), that lists all the TMM. Yes, there really are 47 of them.

The four TMM chosen by the KNN algorithm are:<br>
25 - **RACI Matrix** *A system to help you determine who is responsible, who is accountable, who needs to be consulted, and who must be kept informed at every step of the project to increase the odds of success in meeting your goals.*

29 - **The Action Method** *Developed by Behance. Break ideas down into three categories:<br>
Action items are the steps to get projects done<br>
Backburner items are all the interesting ideas that don’t lead to progress on the project<br>
Reference items are resources and information needed to complete a project*

11 - **Autofocus** *Use four different lists: New tasks, recurring tasks, unfinished and old tasks. Start with new tasks, move on to recurring tasks, then spend some time on the unfinished tasks. At the end, clear some old tasks.*

15 - **Do it now**
*If it takes less than 3 minutes, do it now, without any thinking or planning.*

In [33]:
# copy df
df2 = df_ur.copy()

# find the nearest neighbors of every TMM (5 are requested below, and a row's own entry is usually among them)
n_neighbors = 5
number_neighbors = 5
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(df_ur.values)
distances, indices = knn.kneighbors(df_ur.values, n_neighbors=number_neighbors)

# convert user_name to user_index
user_index = df_ur.columns.tolist().index('user_17')

# t: TMM title, m: the row number of t in df_ur
for m,t in list(enumerate(df_ur.index)):
  
    # find TMM without ratings by the chosen user (user_17 here)
    if df_ur.iloc[m, user_index] == 0:
        sim_tmm = indices[m].tolist()
        tmm_distances = distances[m].tolist()
 
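    # NOTE: this check sits at the same level as the one above, so it also runs for TMM the user
    # has already rated, reusing sim_tmm and tmm_distances from an earlier iteration or cell
    if m in sim_tmm: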
        id_tmm = sim_tmm.index(m)
        sim_tmm.remove(m)
        tmm_distances.pop(id_tmm) 

    
    else:
        sim_tmm = sim_tmm[:n_neighbors-1]
        tmm_distances = tmm_distances[:n_neighbors-1]
        
    # movie_similarty = 1 - movie_distance    
    tmm_similarity = [1-x for x in tmm_distances]
    tmm_similarity_copy = tmm_similarity.copy()
    nominator = 0

    # for each similar movie
    for s in range(0, len(tmm_similarity)):
      
      # check if the rating of a similar tmm is zero
        if df_ur.iloc[sim_tmm[s], user_index] == 0:

        # if the rating is zero, ignore the rating and the similarity in calculating the predicted rating
            if len(tmm_similarity_copy) == (number_neighbors - 1):
                tmm_similarity_copy.pop(s)
          
            else:
                tmm_similarity_copy.pop(s-(len(tmm_similarity)-len(tmm_similarity_copy)))

      # if the rating is not zero, use the rating and similarity in the calculation
        else:
            nominator = nominator + tmm_similarity[s]*df_ur.iloc[sim_tmm[s],user_index]

    # check if the number of the ratings with non-zero is positive
    if len(tmm_similarity_copy) > 0:
      
      # check if the sum of the ratings of the similar movies is positive.
        if sum(tmm_similarity_copy) > 0:
            predicted_r = nominator/sum(tmm_similarity_copy)

      # Even if there are some movies for which the ratings are positive, some movies have zero similarity even though they are selected as similar movies.
      # in this case, the predicted rating becomes zero as well  
        else:
            predicted_r = 0

    # if all the ratings of the similar movies are zero, then predicted rating should be zero
    else:
        predicted_r = 0

  # place the predicted rating into the copy of the original dataset
    df_ur.iloc[m,user_index] = predicted_r
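
The core of the loop above is a similarity-weighted average: the predicted rating for an unrated TMM is the sum of similarity times rating over its rated neighbours, divided by the sum of those similarities. A minimal sketch of just that step, assuming the similarities and the user's ratings of the neighbours are already available as lists; the helper `predict_rating` and the numbers are hypothetical and not part of the notebook:

```python
def predict_rating(similarities, ratings):
    """Similarity-weighted average of the user's ratings of similar TMM.

    Neighbours with rating 0 (not rated) are left out of both sums.
    """
    rated = [(s, r) for s, r in zip(similarities, ratings) if r != 0]
    weight = sum(s for s, _ in rated)
    # no rated neighbour, or only neighbours with zero similarity
    if weight <= 0:
        return 0
    return sum(s * r for s, r in rated) / weight

# three neighbours, the second one was never rated by the user
print(predict_rating([0.75, 0.5, 0.25], [5, 0, 1]))  # 4.0
```

The unrated neighbour drops out entirely, so the result is (0.75 * 5 + 0.25 * 1) / (0.75 + 0.25).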

In [34]:
def recommend_tmm(user, num_recommended_tmm):

    #print('The list of Time Management Methods {} Has Used \n'.format(user))

    #for m in df_ur[df_ur[user] > 0][user].index.tolist():
    #    print(m)
    
    #print('\n')

    recommended_tmm = []

    for m in df_ur[df_ur[user] == 0].index.tolist():

        index_df_ur= df_ur.index.tolist().index(m)
        predicted_rating = df2.iloc[index_df_ur, df2.columns.tolist().index(user)]
        recommended_tmm.append((m, predicted_rating))

    sorted_rm = sorted(recommended_tmm, key=lambda x:x[1], reverse=True)
  
    print('The list of the Recommended TMM \n')
    # rank starts at 1; tmm is the title, rating is the predicted rating
    for rank, (tmm, rating) in enumerate(sorted_rm[:num_recommended_tmm], start=1):
        print('{}: {} - predicted rating:{}'.format(rank, tmm, rating))

In [35]:
recommend_tmm('user_17', 10)

The list of the Recommended TMM 

1: The Action Method - predicted rating:1
2: Not-to-do-list - predicted rating:1
3: The Autofocus Method - predicted rating:1
4: The checklist Manifesto - predicted rating:1
5: The Final Version - predicted rating:1
6: The Medium Method - predicted rating:1
7: The Now Habit/Unscheduling - predicted rating:1
8: The Productivity Journal - predicted rating:1
9: The Jar Glass - predicted rating:1
10: The Swiss Cheese Method / The Salami Method - predicted rating:1
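
To pass the recommendations on to the interface, it may be handier to return them instead of printing them. A minimal sketch, assuming one frame with the original ratings and one with the predicted ratings, both with unique TMM titles as index and one column per user; the function `top_recommendations` and the toy frames are hypothetical:

```python
import pandas as pd

def top_recommendations(original, predicted, user, n):
    """Return the n highest predicted ratings for TMM the user has not rated yet."""
    # TMM the user left unrated in the original data
    unrated = original.index[original[user] == 0]
    return predicted.loc[unrated, user].sort_values(ascending=False).head(n)

# hypothetical toy data: user_a rated only TMM A
original = pd.DataFrame({'user_a': [1, 0, 0, 0]},
                        index=['TMM A', 'TMM B', 'TMM C', 'TMM D'])
predicted = original.astype(float)
predicted.loc[['TMM B', 'TMM C', 'TMM D'], 'user_a'] = [0.2, 0.9, 0.5]

result = top_recommendations(original, predicted, 'user_a', 2)
print(result)
```

The result is a pandas Series with the TMM titles as index, which can be turned into a list or JSON for the interface. Ties between equal predicted ratings are returned in no guaranteed order.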


---

## "Sources"

These are the sources I gathered for my last prototype. They are not really references, as sometimes they only served as inspiration. Where code was copied and pasted for my own benefit, this is noted in the code.

https://github.com/youonf/recommendation_system/blob/master/content_based_filtering/content_based_recommender_approach1.ipynb

https://github.com/youonf/recommendation_system/blob/master/content_based_filtering/content_based_recommender_approach2_v2.ipynb

https://github.com/TheClub4/collaborative_filtering/blob/master/collaborative_filtering.ipynb

https://towardsdatascience.com/recommendation-system-in-python-lightfm-61c85010ce17

https://towardsdatascience.com/introduction-to-two-approaches-of-content-based-recommendation-system-fc797460c18c

https://github.com/KayleighDaisydeHaan/Gradutation_MDDD/blob/021e0af819e16f6a4a80d1bb62bba04d4f92811c/Recommendation_system_iteration1.ipynb

https://github.com/KayleighDaisydeHaan/Gradutation_MDDD/blob/021e0af819e16f6a4a80d1bb62bba04d4f92811c/Final_RS.ipynb

https://github.com/pedro-de-bastos/Practical-Data-Science-IL181/blob/master/Prototyping_Movie_Recommender.ipynb

https://www.offerzen.com/blog/how-to-build-a-content-based-recommender-system-for-your-product

https://medium.com/@bindhubalu/content-based-recommender-system-4db1b3de03e7

https://medium.com/swlh/how-to-build-simple-recommender-systems-in-python-647e5bcd78bd

---