### Content-Based Filtering
This notebook implements a simple version of content-based recommendation.  
Content-based filtering is one of the two most common recommendation methods, alongside collaborative filtering.  
It takes data on the items as input (in our case, the courses). It then calculates the similarity between each pair of items and recommends the items that are most similar to a given input item.  
It has been shown to work pretty well, and an advantage is that we only need data on the courses and not on the users.  
On the other hand, this means that it is not personalized, and it is often hard to suggest things that differ from the input.  

This is only a test (but gives reasonable results). 
The code was largely inspired by and partly copied from https://www.datacamp.com/community/tutorials/recommender-systems-python

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import datetime

from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

In [39]:
#to show all columns
# pd.set_option('display.max_columns', None)  
#to show full info per column (None means no truncation)
# pd.set_option('display.max_colwidth', None)

In [3]:
#read the course info
df=pd.read_csv('../Data/filtered_courses.csv')
df['content']=df['content'].fillna('')
print("Number of courses:",len(df),"\n")
df.head()

Number of courses: 1212 



Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master's programs of B...,1125574316,6,...,E701,Most Master's Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"['Perttu Kähäri', 'Laura Peni', 'Pekka Pälli',...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","['Alice Wickström', 'Ingmar Björkman']","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,['Kathrin Sele'],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,['Mikko Martela'],"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"['Esko Aho', 'Kirsti Iivonen']",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...


In [4]:
import datetime

In [40]:
def check_startdate(df):
    """Check that course starts in the future"""
    df['startDate']=pd.to_datetime(df['startDate'])
    now=datetime.datetime.now()
    return df[df['startDate']>=now]

In [6]:
now=datetime.datetime.now()
now

datetime.datetime(2019, 1, 16, 15, 9, 34, 663056)

In [7]:
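#startDate is read from the CSV as text, so convert it before comparing with a datetime
df_test=df[pd.to_datetime(df['startDate']) > now]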

In [8]:
#Construct a reverse map from course names to row indices
# we use this to look up the index of a course given its name
indices = pd.Series(df.index, index=df['name'])
print(indices[:10])

name
Capstone: Business Development Project    0
Introduction to business                  1
Human Resource Management                 2
Current Issues in Leadership              3
Business and Society                      4
Managing Corporate Careers                5
Doing Qualitative Research                6
Gender and Diversity at Work              7
Managing Mergers and Acquisitions         8
Innovation Processes in Transition        9
dtype: int64


In [9]:
tfidf_train = TfidfVectorizer()
#get the tf-idf score for each word in the content description of each course
tfidf_matrix_train = tfidf_train.fit_transform(df['content'])
#print(tfidf_matrix_train.shape)
#reuse the training vocabulary; note that fit_transform re-estimates the idf weights on df_test
tfidf_test=TfidfVectorizer(use_idf=True,vocabulary=tfidf_train.vocabulary_)
brr = tfidf_test.fit_transform(df_test['content'])
print(brr.shape)
print(brr[0].toarray())
#print(tfidf_train.idf_)
#print(tfidf_train.get_feature_names())

(342, 7178)
[[0. 0. 0. ... 0. 0. 0.]]


In [10]:
df

Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master's programs of B...,1125574316,6,...,E701,Most Master's Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"['Perttu Kähäri', 'Laura Peni', 'Pekka Pälli',...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","['Alice Wickström', 'Ingmar Björkman']","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,['Kathrin Sele'],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,['Mikko Martela'],"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"['Esko Aho', 'Kirsti Iivonen']",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...
5,5,Priority to BIZ students; other Aalto students...,40% exam 50% assignments 10% participation,True,,21C23000,"Each year, thousands of the best and brightest...",Management minor elective,1132617804,6,...,E706,,,2018-09-10,,Derin Kent,['Derin Kent'],"Period I (2018-2019), Otaniemi campus Period I...",course,Contact teaching 15h Independent work 142h Exa...
6,6,Compulsory active participation in class (5/8)...,100% assignments,True,,21E00011,The course covers the basic epistemological un...,Master's programme in Management and Internati...,1113172843,6,...,E706,The course is designed for students who are wo...,WebOodi,2018-10-29,This course can be substituted with 80E80100 B...,Saija Katila,['Saija Katila'],"Period II, Töölö Campus and IV Otaniemi Campus...",course,Contact teaching 24 h Independent work 136 h T...
7,7,The course is restricted to 50 students (inclu...,Assignments 100 %,True,,21E00012,The course provides an overview of gender and ...,"M.Sc. degree, elective course in common studie...",1113202815,6,...,E706,,Via WebOodi one week before the start of the t...,2019-02-26,"21E80000 Gender, Management and Organizations....",Saija KatilaKirsi Eräranta,"['Kirsi Eräranta', 'Saija Katila']",Period IV (2018-2019) Otaniemi campusPeriod IV...,course,Contact teaching 24 h Independent work 136 h T...
8,8,The course is restricted to 50 students (inclu...,Assignments 100%,True,,21E00029,The course offers a theoretically grounded and...,Master's Programme in Management and Internati...,1129318458,6,...,E706,Completing basic courses on strategic manageme...,WebOodi.,2019-04-16,,Natalia VuoriAlexei Koveshnikov,"['Paulina Junni', 'Alexei Koveshnikov']","Period V 2018-2019, Otaniemi campusPeriod V 20...",course,Contact teaching 30hIndependent work 130...
9,9,All Aalto students are welcome but BIZ incl...,20% of evaluation: Participating in and prepar...,True,,21E00032,Core content of the course is to give a compre...,Elective course in the Master's programme in M...,1116420822,6,...,E706,Students must have an interest to learn these ...,via Weboodi,2018-10-30,,Erkki OrmalaTaina Tukiainen,['Erkki Ormala'],"Period II (2018-2019), Töölö campusPeriod II (...",course,18h Participation in the course sessions;60h r...


#### Tf-idf
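We now use tf-idf to map each word in a document to a number that says something about the importance of the word for that document.  
To be more precise, tf-idf is calculated as (number of times the word appears in the document / number of words in the document) * log(number of documents / number of documents the word appears in).  
See wiki for more info: https://en.wikipedia.org/wiki/Tf%E2%80%93idf  
Here we use the lazy version given by sklearn, which does all the preprocessing (tokenizing etc.) by itself. Note that the sklearn default differs slightly from the formula above: it uses the raw count as tf, a smoothed idf, and scales each document vector to unit length.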
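
As a rough illustration of the textbook definition (term frequency times the log of the inverse document frequency), the sketch below computes tf-idf by hand in plain Python. The helper `tf_idf` and the three toy descriptions are made up for this example and are not part of the course data; tokenization is a naive whitespace split.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Textbook tf-idf: (count / document length) * log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

# made-up toy descriptions, only for illustration
toy_docs = ["machine learning for business",
            "business strategy and leadership",
            "machine learning and logic"]
toy_scores = tf_idf(toy_docs)
for doc, doc_scores in zip(toy_docs, toy_scores):
    print(doc, "->", max(doc_scores, key=doc_scores.get))
```

Words that occur in only one description get the highest weight, while words shared by several descriptions are pushed down.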

In [2]:
#copied from https://github.com/senticr/SentiCR/blob/master/SentiCR/SentiCR.py
stemmer =SnowballStemmer("english")

def stem_tokens(tokens):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens)
    return stems

In [3]:
def define_tfidf(df,smooth,stopwords,sublin,tokenize):
    print(sublin)
    #define the tf-idf vectorizer
    tfidf = TfidfVectorizer(stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize)
    #get the tf-idf score for each word in the content description of each course
    tfidf_matrix = tfidf.fit_transform(df['content'])
    print("Shape of matrix:",tfidf_matrix.shape)
    print("Number of unique tokens:",tfidf_matrix.shape[1])
    return tfidf,tfidf_matrix

#### Cosine similarity
We now want to compute the similarity of the different courses.  
Here, we use the cosine similarity. This is one of the common measures to calculate similarity.  
Some other common ones are the Euclidean distance and the Pearson coefficient. It depends on the situation which one works best. 
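
A small sketch of how these measures relate on tf-idf vectors, assuming the sklearn default of L2-normalized rows. The three toy descriptions are made up. For unit-length vectors the plain dot product equals the cosine similarity, and the squared Euclidean distance is 2 - 2 * cosine, so both measures give the same ranking in that case.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import (cosine_similarity,
                                      euclidean_distances,
                                      linear_kernel)

# made-up toy descriptions, only for illustration
toy_courses = [
    "machine learning and logic",
    "machine learning for business",
    "strategy and leadership in business",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_courses)

# Rows are L2-normalized by default (norm='l2'), so the dot product
# computed by linear_kernel equals the cosine similarity.
dot_sim = linear_kernel(toy_matrix, toy_matrix)
cos_sim = cosine_similarity(toy_matrix)
print(np.allclose(dot_sim, cos_sim))

# Euclidean distance for comparison: smaller means more similar.
euc_dist = euclidean_distances(toy_matrix)
# For unit-length rows: distance^2 = 2 - 2 * cosine similarity
print(np.allclose(euc_dist ** 2, 2 - 2 * cos_sim, atol=1e-6))
```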

In [4]:
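def get_sim(tfidf_matrix,meas='cosine'):
    if meas=='cosine':
        # Compute the cosine similarity matrix
        # The tf-idf rows are L2-normalized by default, so the linear kernel
        # (plain dot product) of sklearn gives the cosine similarity
        return linear_kernel(tfidf_matrix, tfidf_matrix)
    #no other measures are implemented yet
    raise ValueError("Unknown similarity measure: "+str(meas))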

We can now use this cosine similarity to get the courses that are most similar to the input course!

In [5]:
def get_mostsim(course,df,meas='cosine',stopwords='english',smooth=True,sublin=False,tokenize=None):
    indices = pd.Series(df.index, index=df['name'])
    # Get index of course given title
    idx = indices[course]
    
    tfidf,tfidf_matrix=define_tfidf(df,smooth,stopwords,sublin,tokenize)
    sim=get_sim(tfidf_matrix,meas)
    #Get similarity of course to all other courses
    # structure is list of (index, similarity)
    sim_row = list(enumerate(sim[idx]))
    
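    #sort the courses by descending score
    sim_sorted = sorted(sim_row, key=lambda x: x[1], reverse=True)

    #skip the first entry, which is normally the input course itself (similarity 1)
    #note: this assumption breaks for courses with an empty description
    sim_indices = [i[0] for i in sim_sorted[1:]]
    sim_scores=[i[1] for i in sim_sorted[1:]]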
    
    return sim_indices,sim_scores

In [6]:
title='Artificial Intelligence'
sim_indices_stem,sim_scores_stem=get_mostsim(title,df,tokenize=tokenize_and_stem)
# Print the 10 most similar courses
print("The 10 most similar courses for the course",title)
print(df['name'].iloc[sim_indices_stem[:10]])
print("Similarity scores")
print(sim_scores_stem[:10])

sim_indices,sim_scores=get_mostsim(title,df)
# Print the 10 most similar courses
print("The 10 most similar courses for the course",title)
print(df['name'].iloc[sim_indices[:10]])
print("Similarity scores")
print(sim_scores[:10])


In [None]:
ra=indices.sample(5)
for title,v in ra.items():
    # Print the 10 most similar courses
    sim_indices,sim_scores=get_mostsim(title,df)
    print("The 10 most similar courses for the course",title)
    print(df['name'].iloc[sim_indices[:10]])
    print("Similarity scores")
    print(sim_scores[:10])

In [18]:
#seems to work pretty well! Several courses you would expect but that don't really add something, some courses that might add something (Art and artificial intelligence), and some courses that seem totally random but are not (capstone course: marketing)

### Experimenting more with tf-idf
Largely inspired by https://buhrmann.github.io/tfidf-analysis.html

More links on tf-idf  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit_transform  
https://stackoverflow.com/questions/35757560/sklearns-tfidfvectorizer-word-frequency  
https://www.quora.com/How-does-TfidfVectorizer-work-in-laymans-terms  
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html  
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction  
https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html  
Why normalization: https://www.quora.com/What-is-the-benefit-of-normalization-in-the-tf-idf-algorithm  
Sub-linear TF: https://nlp.stanford.edu/IR-book/html/htmledition/sublinear-tf-scaling-1.html  
Why sub-linear TF: https://stackoverflow.com/questions/27067992/why-is-log-used-when-calculating-term-frequency-weight-and-idf-inverse-document  
Great code explanation: http://billchambers.me/tutorials/2014/12/21/tf-idf-explained-in-python.html  
Excellent explanation on cosine similarity and TF-IDF: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

In [30]:
def top_tfidf(ind,n_tok=10):
    tfidfvec,tfidf_matrix=define_tfidf(df,True,'english',False,None)
    #get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
    features=tfidfvec.get_feature_names()
    course_tfidf=np.squeeze(tfidf_matrix[ind].toarray())
    sorted_tfidf=np.argsort(-course_tfidf) #minus because we want descending order
    return [(features[i],course_tfidf[i]) for i in sorted_tfidf[:n_tok]]

In [31]:
title='Artificial Intelligence'
print("10 tokens with highest tf-idf for",title)
top_tfidf(indices[title])

10 tokens with highest tf-idf for Artificial Intelligence
False
Shape of matrix: (1212, 6938)
Number of unique tokens: 6938


[('solving', 0.29926907106902434),
 ('logical', 0.28092653701559595),
 ('representations', 0.27159056146075444),
 ('machine', 0.2115488799139426),
 ('learning', 0.20487671875013339),
 ('problems', 0.17203409026085273),
 ('techniques', 0.15611608245605438),
 ('solver', 0.1549633629436502),
 ('adaptation', 0.1549633629436502),
 ('formulas', 0.14648135144176114)]

In [32]:
corpus = ["The dog ate a sandwich and I ate a sandwich",
          "The wizard transfigured a sandwich"]
vectorizer = TfidfVectorizer(stop_words='english')#,smooth_idf=False)
tfidfs = vectorizer.fit_transform(corpus)

In [33]:
vectorizer.idf_

array([1.40546511, 1.40546511, 1.        , 1.40546511, 1.40546511])

In [34]:
vectorizer.get_feature_names()

['ate', 'dog', 'sandwich', 'transfigured', 'wizard']

In [35]:
tfidfs.todense()

matrix([[0.75458397, 0.37729199, 0.53689271, 0.        , 0.        ],
        [0.        , 0.        , 0.44943642, 0.6316672 , 0.6316672 ]])
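
To see where these numbers come from, the sketch below redoes the calculation by hand with NumPy. It assumes the sklearn defaults used above (raw counts as tf, smoothed idf, L2 normalization); the count matrix is typed in manually from the two toy sentences after stop-word removal.

```python
import numpy as np

# Term counts for the two toy sentences after English stop-word removal,
# columns ordered as ['ate', 'dog', 'sandwich', 'transfigured', 'wizard'].
counts = np.array([[2, 1, 2, 0, 0],
                   [0, 0, 1, 1, 1]], dtype=float)

n_docs = counts.shape[0]
# number of documents each term appears in
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts by idf, then scale each row to unit Euclidean length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(idf)
print(manual_tfidf)
```

"sandwich" appears in both sentences, so it gets the lowest idf (exactly 1), while "ate" scores highest in the first sentence because it occurs twice and only there.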

### Questions
- sklearn preprocessing vs own one (see topic modelling)?
- sklearn tf-idf vs doing by self (see link topic modelling)
    - What about tokenization in tf-idf?
- Try different similarity measures
- Dive more into the working and mathematics of the methods I am using (tf-idf, cosine similarity)
- Some funny things happen because of the data, see the Magnificent Life example below

In [37]:
#this course has no content --> all similarities are 0 --> just give first 10 courses
sim_indices,sim_scores=get_mostsim('Magnificent Life',df)
# Print the 10 most similar courses
print(df['name'].iloc[sim_indices[:10]])
print("Similarity scores")
print(sim_scores[:10])

False
Shape of matrix: (1212, 6938)
Number of unique tokens: 6938
1               Introduction to business
2              Human Resource Management
3           Current Issues in Leadership
4                   Business and Society
5             Managing Corporate Careers
6             Doing Qualitative Research
7           Gender and Diversity at Work
8      Managing Mergers and Acquisitions
9     Innovation Processes in Transition
10                      Strategy Process
Name: name, dtype: object
Similarity scores
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
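
One possible way to avoid this, sketched under the assumption that we would rather return nothing than arbitrary courses: exclude the input course by its index instead of by position, and drop courses with zero similarity. `rank_similar` and the toy similarity matrix are hypothetical and not part of the code above.

```python
import numpy as np

def rank_similar(sim, idx, top_n=10):
    """Hypothetical helper: rank the other items by similarity to item idx."""
    scores = np.asarray(sim[idx], dtype=float)
    # stable sort so ties keep their original order
    order = np.argsort(-scores, kind="stable")
    # drop the query itself and every item with zero similarity
    keep = [int(i) for i in order if i != idx and scores[i] > 0][:top_n]
    return keep, [float(scores[i]) for i in keep]

# made-up similarity matrix; item 2 stands for a course without a description
toy_sim = np.array([[1.0, 0.2, 0.0],
                    [0.2, 1.0, 0.0],
                    [0.0, 0.0, 0.0]])

print(rank_similar(toy_sim, 0))
print(rank_similar(toy_sim, 2))
```

For the course without content this returns empty lists, which makes it explicit that there is nothing to recommend.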


## Brainstorm Tomi 26-11 outcomes

Incorporate data from other universities
Also interesting for teachers: how do other courses relate to yours, and how does that change when you change the description?
Student: input what you want to learn and it gives the courses closest to that
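
A minimal sketch of the free-text idea for students, reusing the tf-idf approach from this notebook. The course names are taken from the data, but the descriptions are invented for illustration, and `courses_for_query` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

toy_names = ["Artificial Intelligence",
             "Human Resource Management",
             "Strategy Process"]
# invented descriptions, only for illustration
toy_content = ["machine learning, logical representations and problem solving",
               "recruitment, careers and managing people in organizations",
               "strategic planning and decision making in firms"]

vectorizer = TfidfVectorizer(stop_words='english')
course_matrix = vectorizer.fit_transform(toy_content)

def courses_for_query(query, top_n=2):
    """Rank courses by cosine similarity between the query and each description."""
    # transform (not fit) so the query is expressed in the course vocabulary
    query_vec = vectorizer.transform([query])
    scores = linear_kernel(query_vec, course_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    return [(toy_names[i], float(scores[i])) for i in ranked]

print(courses_for_query("machine learning and problem solving"))
```

Words in the query that do not occur in any course description are simply ignored, so very short or unusual queries can end up with all-zero scores.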

Holistic view: 
A system where you can say what you want to learn, what your current level is and what your "wish level" is. It then suggests activities based on this. These activities can be courses but also other things, e.g. read this article. It can also suggest when to do these activities, e.g. it is best to listen to the radio in the morning. 

The paper doesn't per se need to have results and an evaluation. It can also be more holistic: what approaches could there be and what are their pros and cons?
Also it might be good to describe the whole process. According to Tomi this is possible to do in a paper

User data is going to take some time. Maybe take the focus off that. 
2 things: generate ultimate scenarios. What would be the perfect system?
Think of how, with the data we have now, we can get creative and generate something cool. 

Good approach: first test some approaches and then choose based on those

Document things and share, but mention my name everywhere.
Think about how this connects to A!OLE. Would say it is just part of it. Made by Tinka Valentijn, in cooperation with A!OLE. 
