### Content-Based Filtering
This notebook implements a simple version of content-based recommendation.  
Content-based filtering is one of the two most common recommendation methods, alongside collaborative filtering.  
It takes data on the items as input (in our case, the courses), calculates the similarity between each pair of items, and recommends the items that are most similar to the input.  
It has been shown to work pretty well, and an advantage is that we only need data on the courses and not on the users. 
On the other hand, this means that it is not personalized, and it is often hard to suggest things that differ from the input. 
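
To make the idea concrete, here is a minimal sketch on made-up data. The keyword counts, the course names and the helpers `cosine_matrix` and `recommend` are assumptions for illustration only; the notebook itself works on TF-IDF vectors of the real course descriptions.

```python
import numpy as np

# Hypothetical toy data: each row is a course, each column counts one keyword.
keywords = ["learning", "networks", "history"]
course_names = ["Machine Learning", "Deep Learning", "Medieval History"]
counts = np.array([
    [3.0, 1.0, 0.0],
    [2.0, 3.0, 0.0],
    [0.0, 0.0, 4.0],
])

def cosine_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

def recommend(name, k=1):
    """Names of the k courses most similar to `name`, excluding the course itself."""
    idx = course_names.index(name)
    scores = cosine_matrix(counts)[idx]
    order = [j for j in np.argsort(-scores) if j != idx]
    return [course_names[j] for j in order[:k]]

print(recommend("Machine Learning"))
```

Only item data goes in, so every user who asks about the same course gets the same answer, which is the lack of personalization mentioned above.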

This is only a test (but gives reasonable results). 
The code was largely inspired by and partly copied from https://www.datacamp.com/community/tutorials/recommender-systems-python

In [3]:
#import the needed packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel
from IPython.display import display

In [4]:
#read the course info
df=pd.read_csv('../Data/filtered_courses.csv')
df['content']=df['content'].fillna('')
print("Number of courses:",len(df),"\n")
df.head()

Number of courses: 1212 



Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master's programs of B...,1125574316,6,...,E701,Most Master's Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"['Perttu Kähäri', 'Laura Peni', 'Pekka Pälli',...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","['Alice Wickström', 'Ingmar Björkman']","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,['Kathrin Sele'],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,['Mikko Martela'],"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"['Esko Aho', 'Kirsti Iivonen']",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...


In [29]:
def recom_freetext(user_input,courses=df):
    # Add the free-text input as an extra "course", so it is vectorized together with the real courses
    # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    new_row = pd.DataFrame([{'content':user_input,'name':user_input}])
    courses = pd.concat([courses,new_row],ignore_index=True)
    # Construct a map from course name to row index
    indices = pd.Series(courses.index, index=courses['name'])
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(courses['content'])
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    sim_indices,sim_scores=get_mostsim(user_input,indices,cosine_sim)
    # Print the 10 most similar courses
    print("The 10 most similar courses for the input",user_input)
    print(courses['name'].iloc[sim_indices[:10]])
    print("Similarity scores")
    print(sim_scores[:10])

In [27]:

# Debugging: the appended row has no 'name', so it cannot be looked up in `indices` by name
df_try=pd.concat([df,pd.DataFrame([{'content':'entrepeneurship'}])],ignore_index=True)
indices = pd.Series(df_try.index, index=df_try['name'])

In [28]:
indices

name
Capstone: Business Development Project                                                                          0
Introduction to business                                                                                        1
Human Resource Management                                                                                       2
Current Issues in Leadership                                                                                    3
Business and Society                                                                                            4
Managing Corporate Careers                                                                                      5
Doing Qualitative Research                                                                                      6
Gender and Diversity at Work                                                                                    7
Managing Mergers and Acquisitions                                                  

In [25]:
recom_freetext('entrepeneurship')

KeyError: 'entrepeneurship'

We can now use this cosine similarity to get the courses that are most similar to the input course!
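
A short check of why `linear_kernel` can stand in for cosine similarity here: `TfidfVectorizer` L2-normalizes every row by default (`norm='l2'`), so the plain dot product of two rows already equals their cosine similarity. The three documents below are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical mini corpus, not taken from the course data.
docs = [
    "introduction to machine learning",
    "machine learning for business analytics",
    "strategy and leadership in organisations",
]
matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Every TF-IDF row has unit length because of the default norm='l2'.
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())

dot = linear_kernel(matrix, matrix)      # plain dot products
cos = cosine_similarity(matrix, matrix)  # normalizes the rows first, then takes dot products

print(row_norms)
print(np.allclose(dot, cos))
```

`linear_kernel` skips the normalization step, which is why it is the cheaper choice when the rows are already unit vectors.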

In [10]:
def get_mostsim(title,indices,sim):
    # Get index of course given title
    idx = indices[title]

    # Get similarity of course to all other courses
    # structure is list of (index, similarity)
    sim_row = list(enumerate(sim[idx]))

    # Sort the courses by descending score
    sim_sorted = sorted(sim_row, key=lambda x: x[1], reverse=True)

    # Skip the first entry, which is normally the course itself (similarity 1)
    sim_indices = [i[0] for i in sim_sorted[1:]]
    sim_scores = [i[1] for i in sim_sorted[1:]]

    return sim_indices,sim_scores
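
`get_mostsim` expects a name-to-index lookup and a similarity matrix. The sketch below shows one way these could be built once for a whole catalogue. The helper `build_similarity` and the toy frame are hypothetical, and the lookup assumes that course names are unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def build_similarity(courses):
    """Return (indices, cosine_sim) for a frame with 'name' and 'content' columns."""
    courses = courses.reset_index(drop=True)
    # Map course name -> row position; assumes names are unique.
    indices = pd.Series(courses.index, index=courses["name"])
    tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(
        courses["content"].fillna("")
    )
    return indices, linear_kernel(tfidf_matrix, tfidf_matrix)

# Hypothetical toy catalogue with the same two columns the notebook uses.
toy = pd.DataFrame({
    "name": ["Urban GIS", "Spatial Analytics", "Strategy Process"],
    "content": [
        "maps and spatial data analysis",
        "spatial data analysis methods",
        "corporate strategy and management",
    ],
})
toy_indices, toy_sim = build_similarity(toy)

# Same ranking logic as get_mostsim: sort by score, then drop the course itself.
idx = toy_indices["Urban GIS"]
ranked = sorted(enumerate(toy_sim[idx]), key=lambda pair: pair[1], reverse=True)
print([toy["name"].iloc[i] for i, _ in ranked[1:]])
```

If two courses share a name, `indices[title]` returns several positions instead of one, so deduplicating names first would be a sensible precaution.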

In [19]:
title='Explorative Information Visualization'
sim_indices,sim_scores=get_mostsim(title,indices,cosine_sim)
# Print the 10 most similar courses
print("The 10 most similar courses for the course",title)
print(df['name'].iloc[sim_indices[:10]])
print("Similarity scores")
print(sim_scores[:10])

The 10 most similar courses for the course Explorative Information Visualization
1182                                     Digital Urban
109                    Quantitative Empirical Research
477         Information Design Studio (Advanced level)
476     Information Design Studio (Intermediate level)
535     Topics in Visualization and Cultural Analytics
23                        Perspectives on organization
749                       Advanced Spatial Analytics L
147                     Management Information Systems
1175                        Urban GIS and Visual Tools
475                                           Data Now
Name: name, dtype: object
Similarity scores
[0.21711850689321494, 0.20787109715824192, 0.19651746874909612, 0.19154857322369867, 0.19145319273618902, 0.18879540490395033, 0.1872989166578033, 0.1813301746506613, 0.17487816471808537, 0.17189368406838348]


In [10]:
ra=indices.sample(5)
for title,v in ra.items():
    # Print the 10 most similar courses
    sim_indices,sim_scores=get_mostsim(title,indices,cosine_sim)
    print("The 10 most similar courses for the course",title)
    print(df['name'].iloc[sim_indices[:10]])
    print("Similarity scores")
    print(sim_scores[:10])

The 10 most similar courses for the course Electromechanics
677                        Design of Electrical Machines
687                        Seminar on Electromechanics P
688               Special Course on Electromechanics P V
802                      Finite Element Method in Solids
840                            Finite Element Analysis L
585                                  Electronic circuits
690                   An Introduction to Electric Energy
676                                 Converter Techniques
225    Ways of Making 2 - Constructing sculpture with...
181            Introduction to Advanced Energy Solutions
Name: name, dtype: object
Similarity scores
[0.3326772014288948, 0.28394697086702986, 0.28394697086702986, 0.20418609872481755, 0.16596945989083237, 0.15689931128732673, 0.15387144435726985, 0.14041973798350596, 0.13125891464333567, 0.12850464569793243]
The 10 most similar courses for the course Thesis Writing for Engineers (MSc) (w) - H01
771    Academic Communicatio

In [11]:
#seems to work pretty well! Several courses you would expect but that don't really add something, some courses that might add something (Art and artificial intelligence), some courses that seem totally random but are not (capstone course: marketing)

### Experimenting more with tf-idf
Largely inspired by https://buhrmann.github.io/tfidf-analysis.html

More links on tf-idf  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform  
https://stackoverflow.com/questions/35757560/sklearns-tfidfvectorizer-word-frequency  
https://www.quora.com/How-does-TfidfVectorizer-work-in-laymans-terms  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html  
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction  
https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html  

In [12]:
def top_tfidf(ind,tfidfvec=tfidf,tfidfmatrix=tfidf_matrix,n_tok=10):
    features=tfidfvec.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2
    course_tfidf=np.squeeze(tfidfmatrix[ind].toarray())
    sorted_tfidf=np.argsort(-course_tfidf) #minus cause want descending order
    return [(features[i],course_tfidf[i]) for i in sorted_tfidf[:n_tok]]

In [13]:
title='Artificial Intelligence'
print("10 tokens with highest tf-idf for",title)
top_tfidf(indices[title])

10 tokens with highest tf-idf for Artificial Intelligence


[('solving', 0.29926907106902434),
 ('logical', 0.28092653701559595),
 ('representations', 0.27159056146075444),
 ('machine', 0.2115488799139426),
 ('learning', 0.20487671875013339),
 ('problems', 0.17203409026085273),
 ('techniques', 0.15611608245605438),
 ('solver', 0.1549633629436502),
 ('adaptation', 0.1549633629436502),
 ('formulas', 0.14648135144176114)]

In [14]:
corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english',smooth_idf=False)
tfidfs = vectorizer.fit_transform(corpus)

In [15]:
vectorizer.idf_

array([1.69314718, 1.69314718, 1.        , 1.69314718, 1.69314718])
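
The values above can be reproduced by hand. With `smooth_idf=False`, scikit-learn documents the inverse document frequency as idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) the number of documents containing term t. The sketch below recomputes this for the same two-sentence corpus; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words="english", smooth_idf=False)
vectorizer.fit(corpus)

# Tokenize each document the same way the vectorizer does.
analyze = vectorizer.build_analyzer()
doc_tokens = [set(analyze(doc)) for doc in corpus]

# Vocabulary terms ordered by their column index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Document frequency: in how many documents does each term occur?
doc_freq = np.array([sum(term in tokens for tokens in doc_tokens) for term in terms])

# idf(t) = ln(n / df(t)) + 1 when smooth_idf=False
manual_idf = np.log(len(corpus) / doc_freq) + 1
print(dict(zip(terms, manual_idf.round(8))))
```

"sandwich" occurs in both documents, so its idf is ln(2/2) + 1 = 1, while every other term occurs in one document and gets ln(2/1) + 1 ≈ 1.693.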

### Questions
- sklearn preprocessing vs. my own preprocessing (see topic modelling)?
- sklearn tf-idf vs. implementing it myself (see link under topic modelling)
    - What about tokenization in tf-idf?
- Try different similarity measures
- Dive deeper into the workings and mathematics of the methods I am using (tf-idf, cosine similarity)
- Some funny things happen because of the data, see 'Magnificent Life' below

In [16]:
#this course has no content --> all similarities are 0 --> the stable sort keeps the original order and [1:] drops course 0, so we just get courses 1-10
sim_indices,sim_scores=get_mostsim('Magnificent Life',indices,cosine_sim)
# Print the 10 most similar courses
print(df['name'].iloc[sim_indices[:10]])
print("Similarity scores")
print(sim_scores[:10])

1               Introduction to business
2              Human Resource Management
3           Current Issues in Leadership
4                   Business and Society
5             Managing Corporate Careers
6             Doing Qualitative Research
7           Gender and Diversity at Work
8      Managing Mergers and Acquisitions
9     Innovation Processes in Transition
10                      Strategy Process
Name: name, dtype: object
Similarity scores
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
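
One possible way to avoid these meaningless recommendations is to drop zero scores before ranking, so a course without content yields an empty list instead of the first ten courses. This is only a sketch: `get_mostsim_nonzero` is a hypothetical variant that takes a row index rather than a title, and the small similarity matrix is made up.

```python
def get_mostsim_nonzero(idx, sim, k=10):
    """Top-k (index, score) pairs for row idx, ignoring idx itself and zero scores."""
    scored = [(i, s) for i, s in enumerate(sim[idx]) if i != idx and s > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical similarity matrix; course 2 has no content, so its row is all zeros.
toy_sim = [
    [1.0, 0.4, 0.0],
    [0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]

print(get_mostsim_nonzero(0, toy_sim))
print(get_mostsim_nonzero(2, toy_sim))
```

Excluding `idx` explicitly, rather than slicing off the first sorted entry, also avoids dropping the wrong course when all scores are tied.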


## Brainstorm Tomi 26-11 outcomes

- Incorporate data from other universities
- Also interesting for teachers: how do other courses relate to yours, and how does that change when you change the description?
- Student: input what you want to learn and get the courses closest to that

Holistic view: 
A system where you can say what you want to learn, what your current level is, and what your "wish level" is. It then suggests activities based on this. These activities can be courses but also other things, e.g. reading an article. It can also suggest when to do these activities, e.g. that it is best to listen to the radio in the morning. 

The paper doesn't per se need to have results and an evaluation. It can also be more holistic: what approaches could there be, and what are their pros and cons?
It might also be good to describe the whole process. According to Tomi, this is possible to do in a paper.

User data is going to take some time, so maybe take the focus off that. 
Two things: first, generate ultimate scenarios. What would the perfect system be?
Second, think of how we can get creative with the data we have now and generate something cool. 

A good approach is to first test some methods and then choose based on the results.

Document things and share them, but mention my name everywhere.
Think about how to connect this to A!OLE. I would say it is just part of it: made by Tinka Valentijn, in cooperation with A!OLE. 
