### Content-Based Filtering
This notebook implements a simple version of content-based recommendation.  
Content-based filtering is one of the two most common recommendation methods, alongside collaborative filtering.  
It takes data on the items as input (in our case the courses). It then calculates the similarity between each pair of items and recommends the items that are most similar to the input item.  
It has been shown to work pretty well, and an advantage is that we only need data on the courses and not on the users. 
On the other hand, this means that it is not personalized, and it is often hard to suggest things that differ from the input. 

This is only a test (but gives reasonable results). 
The code was largely inspired by and partly copied from https://www.datacamp.com/community/tutorials/recommender-systems-python

In [79]:
#import the needed packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
from IPython.display import display

In [15]:
#to show all columns
# pd.set_option('display.max_columns', None)  
#to show full info per column (None means no limit; -1 is deprecated in recent pandas)
# pd.set_option('display.max_colwidth', None)

In [16]:
#read the course info
df=pd.read_csv('../Data/filtered_courses.csv')
df['content']=df['content'].fillna('')
print("Number of courses:",len(df),"\n")
df.head()

Number of courses: 1212 



Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master's programs of B...,1125574316,6,...,E701,Most Master's Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"['Perttu Kähäri', 'Laura Peni', 'Pekka Pälli',...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","['Alice Wickström', 'Ingmar Björkman']","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,['Kathrin Sele'],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,['Mikko Martela'],"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"['Esko Aho', 'Kirsti Iivonen']",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...


In [17]:
#Construct a reverse map of indices and course titles
# we use this to map a title to its index and the other way around
indices = pd.Series(df.index, index=df['name'])
print(indices[:10])

name
Capstone: Business Development Project    0
Introduction to business                  1
Human Resource Management                 2
Current Issues in Leadership              3
Business and Society                      4
Managing Corporate Careers                5
Doing Qualitative Research                6
Gender and Diversity at Work              7
Managing Mergers and Acquisitions         8
Innovation Processes in Transition        9
dtype: int64


#### Tf-idf
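We now use tf-idf to map each word in a document to a number that reflects how important the word is for that document. 
To be more precise, tf-idf is calculated as (#times the word appears in the document / #words in the document) * log(#documents / #documents the word appears in). A word scores high when it is frequent in one document but rare across the collection. 
See wiki for more info: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
Here we use the lazy version given by sklearn, which does all the preprocessing (tokenizing etc.) by itself. Note that sklearn uses a slightly different variant by default: raw counts for the term frequency, a smoothed idf, and L2 normalization of each document vector.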
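
To make the weighting concrete, here is a minimal sketch in plain Python of the variant that sklearn documents as its default: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per document. The toy corpus and the helper names `idf` and `tfidf_vector` are made up for illustration and are not part of this notebook's pipeline.

```python
import math
from collections import Counter

# Toy corpus, already tokenized and lowercased (hypothetical example data).
docs = [
    ["machine", "learning", "methods", "learning"],
    ["marketing", "methods"],
    ["machine", "vision"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each token occur?
doc_freq = Counter(token for doc in docs for token in set(doc))


def idf(token):
    # Smoothed idf, as documented for sklearn's default (smooth_idf=True).
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1


def tfidf_vector(doc):
    counts = Counter(doc)  # raw term frequency
    weights = {tok: cnt * idf(tok) for tok, cnt in counts.items()}
    # L2-normalize so that every document vector has length 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}


for doc in docs:
    print(tfidf_vector(doc))
```

In the first document, "learning" gets the largest weight because it occurs twice there and in no other document, while "machine" and "methods" each occur once and are shared with another document.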

In [18]:
#define the tf-idf vectorizer
tfidf = TfidfVectorizer(stop_words='english')
#get the tf-idf score for each word in the content description of each course
tfidf_matrix = tfidf.fit_transform(df['content'])
print("Shape of matrix:",tfidf_matrix.shape)
print("Number of unique tokens:",tfidf_matrix.shape[1])

Shape of matrix: (1212, 6938)
Number of unique tokens: 6938


#### Cosine similarity
We now want to compute the similarity of the different courses.  
Here, we use the cosine similarity. This is one of the common measures to calculate similarity.  
Some other common ones are the Euclidean distance and the Pearson coefficient. It depends on the situation which one works best. 
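
As a rough illustration of how these measures differ, the sketch below implements cosine similarity and Euclidean distance in plain Python on made-up vectors. It also shows that for unit-length vectors the plain dot product equals the cosine similarity; since `TfidfVectorizer` L2-normalizes its rows by default, this is why a linear kernel can stand in for cosine similarity. All function and variable names here are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    # Cosine of the angle between u and v; ignores vector length.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def euclidean_distance(u, v):
    # Straight-line distance; sensitive to vector length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalize(u):
    length = math.sqrt(dot(u, u))
    return [a / length for a in u]


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice as long
c = [0.0, 0.0, 3.0]  # shares no non-zero component with a

print(cosine_similarity(a, b))   # same direction
print(cosine_similarity(a, c))   # orthogonal
print(euclidean_distance(a, b))  # non-zero although the direction is identical
# For unit-length vectors the plain dot product equals the cosine similarity.
print(dot(normalize(a), normalize(b)))
```

Cosine similarity treats `a` and `b` as identical because only the direction matters, which suits text where a long and a short description can cover the same topic.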

In [19]:
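# Compute the cosine similarity matrix
# We use the linear kernel (dot product) of sklearn for this: TfidfVectorizer
# L2-normalizes each row by default, so the dot product equals the cosine similarity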
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [20]:
cosine_sim

array([[1.        , 0.06331961, 0.00273555, ..., 0.03626142, 0.01779766,
        0.02844093],
       [0.06331961, 1.        , 0.04472521, ..., 0.05631803, 0.00961036,
        0.0327205 ],
       [0.00273555, 0.04472521, 1.        , ..., 0.00875949, 0.00758236,
        0.05219148],
       ...,
       [0.03626142, 0.05631803, 0.00875949, ..., 1.        , 0.07130622,
        0.00303629],
       [0.01779766, 0.00961036, 0.00758236, ..., 0.07130622, 1.        ,
        0.00581591],
       [0.02844093, 0.0327205 , 0.05219148, ..., 0.00303629, 0.00581591,
        1.        ]])

We can now use this cosine similarity to get the courses that are most similar to the input course!

In [71]:
def get_mostsim(title,indices=indices,sim=cosine_sim):
    
    # Get index of course given title
    idx = indices[title]

    #Get similarity of course to all other courses
    # structure is list of (index, similarity)
    # drop the course itself explicitly: it is not guaranteed to sort first
    # (e.g. a course without content has similarity 0 to everything, itself included)
    sim_row = [(i,s) for i,s in enumerate(sim[idx]) if i!=idx]
    
    #sort the courses by descending score
    sim_sorted = sorted(sim_row, key=lambda x: x[1], reverse=True)

    sim_indices = [i[0] for i in sim_sorted]
    sim_scores=[i[1] for i in sim_sorted]
    
    return sim_indices,sim_scores

In [72]:
title='Artificial Intelligence'
sim_indices,sim_scores=get_mostsim(title)
# Print the 10 most similar courses
print("The 10 most similar courses for the course",title)
print(df['name'].iloc[sim_indices[:10]])
print("Similarity scores")
print(sim_scores[:10])

429
The 10 most similar courses for the course Artificial Intelligence
402                              Declarative Programming
64                            Capstone course: Marketing
673                               Reinforcement learning
708                            AI in health technologies
430     Machine Learning: Advanced Probabilistic Methods
921                       Computational inverse problems
401                   Machine Learning: Basic Principles
1071                                            Training
1195                     Art and Artificial Intelligence
1187                                     Design Learning
Name: name, dtype: object
Similarity scores
[0.2270661141951742, 0.16058039039089853, 0.15006070317973458, 0.14938852706142705, 0.14410021166907916, 0.14290986102893344, 0.14178510803135097, 0.1355741332432976, 0.12374501553732745, 0.12244226856485471]


In [41]:
ra=indices.sample(5)
for title,v in ra.items():
    # Print the 10 most similar courses
    sim_indices,sim_scores=get_mostsim(title)
    print("The 10 most similar courses for the course",title)
    print(df['name'].iloc[sim_indices[:10]])
    print("Similarity scores")
    print(sim_scores[:10])

The 10 most similar courses for the course Managing Corporate Careers
31                              Corporate Governance
880                                Corporate Finance
135     Tax Challenges for Multinational Enterprises
1158                       Managing Innovative Sales
854                             Corporate Governance
527                      Game Design Basics Workshop
87                                 Corporate Finance
101                         Mergers and Acquisitions
174           Corporate Responsibility Communication
157                  CEMS Global Management Practice
Name: name, dtype: object
Similarity scores
[0.12882109787384063, 0.11819161016557521, 0.09685054990094283, 0.09095382268695999, 0.08812908135418979, 0.08712414176048891, 0.08546154918545394, 0.0848332727417784, 0.078308558272884, 0.07779534342026469]
The 10 most similar courses for the course Visual Cultures and Aesthetics in Digital Communication and Learning Designs
757     Introduction to Inte

In [None]:
#Seems to work pretty well! Several courses are ones you would expect but that don't really add something,
#some courses might add something (Art and Artificial Intelligence),
#and some courses seem totally random but are not (Capstone course: Marketing)

### Experimenting more with tf-idf
Largely inspired by https://buhrmann.github.io/tfidf-analysis.html

More links on tf-idf  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform  
https://stackoverflow.com/questions/35757560/sklearns-tfidfvectorizer-word-frequency  
https://www.quora.com/How-does-TfidfVectorizer-work-in-laymans-terms  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html  
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction  
https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html  

In [76]:
def top_tfidf(ind,tfidfvec=tfidf,tfidfmatrix=tfidf_matrix,n_tok=10):
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    features=tfidfvec.get_feature_names_out()
    course_tfidf=np.squeeze(tfidfmatrix[ind].toarray())
    sorted_tfidf=np.argsort(-course_tfidf) #negate because we want descending order
    return [(features[i],course_tfidf[i]) for i in sorted_tfidf[:n_tok]]

In [78]:
title='Artificial Intelligence'
print("10 tokens with highest tf-idf for",title)
top_tfidf(indices[title])

10 tokens with highest tf-idf for Artificial Intelligence


[('solving', 0.29926907106902434),
 ('logical', 0.28092653701559595),
 ('representations', 0.27159056146075444),
 ('machine', 0.2115488799139426),
 ('learning', 0.20487671875013339),
 ('problems', 0.17203409026085273),
 ('techniques', 0.15611608245605438),
 ('solver', 0.1549633629436502),
 ('adaptation', 0.1549633629436502),
 ('formulas', 0.14648135144176114)]

In [82]:
corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english',smooth_idf=False)
tfidfs = vectorizer.fit_transform(corpus)

In [83]:
vectorizer.idf_

array([1.69314718, 1.69314718, 1.        , 1.69314718, 1.69314718])
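
The values above can be reproduced by hand. With `smooth_idf=False`, sklearn documents the idf as `ln(n / df) + 1`, where `n` is the number of documents and `df` the number of documents containing the token. The sketch below assumes that the five tokens listed are the ones left after lowercasing and English stop-word removal, which matches the five idf values; the variable names are illustrative.

```python
import math

# Tokens assumed to remain after lowercasing and English stop-word removal.
docs = [
    ["dog", "ate", "sandwich", "ate", "sandwich"],
    ["wizard", "transfigured", "sandwich"],
]
n_docs = len(docs)
# sklearn orders its features alphabetically, so sort the vocabulary too.
vocab = sorted({tok for doc in docs for tok in doc})

idf_unsmoothed = {}
idf_smoothed = {}
for tok in vocab:
    df = sum(tok in doc for doc in docs)  # number of documents containing tok
    idf_unsmoothed[tok] = math.log(n_docs / df) + 1              # smooth_idf=False
    idf_smoothed[tok] = math.log((1 + n_docs) / (1 + df)) + 1    # smooth_idf=True

for tok in vocab:
    print(f"{tok:<13} unsmoothed={idf_unsmoothed[tok]:.8f} smoothed={idf_smoothed[tok]:.8f}")
```

"sandwich" occurs in both documents, so its unsmoothed idf is `ln(2/2) + 1 = 1`, while every other token occurs in one document and gets `ln(2/1) + 1 ≈ 1.69314718`.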

### Questions
- sklearn preprocessing vs. my own (see topic modelling)?
- sklearn tf-idf vs. implementing it myself (see link on topic modelling)
    - What about tokenization in tf-idf?
- Try different similarity measures
- Dive more into the workings and mathematics of the methods I am using (tf-idf, cosine similarity)
- Some funny things happen because of the data, see 'Magnificent Life' below

In [None]:
#this course has no content --> all similarities are 0 --> just give first 10 courses
sim_indices,sim_scores=get_mostsim('Magnificent Life')
# Print the 10 most similar courses
print(df['name'].iloc[sim_indices[:10]])
print("Similarity scores")
print(sim_scores[:10])
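
One possible way to handle such courses is to return only candidates with a strictly positive similarity, so that a course without content yields an empty list instead of arbitrary courses. The sketch below is a standalone illustration on made-up similarity rows; `top_similar` is a hypothetical helper, not a function defined elsewhere in this notebook.

```python
def top_similar(sim_row, idx, k=10):
    """Return up to k (index, score) pairs, skipping the query course and zero scores."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != idx and s > 0]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Hypothetical similarity rows for a 4-course catalogue.
normal_row = [1.0, 0.2, 0.0, 0.5]  # row of course 0, which has a description
empty_row = [0.0, 0.0, 0.0, 0.0]   # row of course 2, whose description is empty

print(top_similar(normal_row, idx=0))  # [(3, 0.5), (1, 0.2)]
print(top_similar(empty_row, idx=2))   # []
```

An empty result makes the missing content visible to the caller, who can then fall back to something else, for example other text fields of the course.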