### Content-Based Recommendation
This notebook implements a simple recommendation system for courses at Aalto University. 
The method used is content-based recommendation, which is one of the two most common recommendation methods; the other is collaborative filtering.  
 
Content-based models use information about the items, in our case the courses, and about the current user, in our case a student. They combine this information to produce a recommendation. 
This approach has been shown to work well, and an advantage is that we only need data on the courses and the current user, not on other users.
On the other hand, this means that the amount of data is limited, and it is hard to suggest a diverse range of items. 

The code was partly inspired by https://www.datacamp.com/community/tutorials/recommender-systems-python

In this notebook I will go through the code step-by-step. See Content-based_clean for a more functional approach.

We will first import and analyze the data. 
We use the pandas library for this.

In [1]:
#import pandas
import pandas as pd
#read the course info. The courses are stored in the file filtered_courses.csv
courses=pd.read_csv('../Data/filtered_courses.csv')
#if an entry is empty, i.e. NaN, replace it with an empty string. This makes processing easier 
courses=courses.fillna('')
#convert all content to strings. We do this because we will be treating everything as strings and this makes it a lot easier to work with
courses=courses.astype(str)
#Add a column with the name of the course lowered and stripped
courses['name_low']=courses['name'].str.lower()
courses['name_low']=courses['name_low'].str.strip()
#print the number of courses and the first entries, to get a feel for the data
print("Number of courses:",len(courses),"\n")
courses.head()
#courses.to_csv("../Data/filtered_courses_ITS.csv",index=False)

Number of courses: 1212 



Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload,name_low
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master's programs of B...,1125574316,6,...,Most Master's Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"['Perttu Kähäri', 'Laura Peni', 'Pekka Pälli',...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...,capstone: business development project
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","['Alice Wickström', 'Ingmar Björkman']","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...,introduction to business
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,['Kathrin Sele'],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...,human resource management
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,['Mikko Martela'],"2018-2019: III, Otaniemi campusNo teaching 201...",course,,current issues in leadership
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"['Esko Aho', 'Kirsti Iivonen']",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...,business and society


We then create one column per course that contains all the information we are interested in, i.e. the information we will use as item features for our model.  
The columns to include are specified in the variable input_columns. You can change the values here! All available columns can be listed with courses.columns.

In [2]:
#give the column names of the columns we want to consider
#this can be changed!
input_columns=['additionalInformation', 'assesmentMethods','content','courseStatus','credits','gradingScale','learningOutcomes','level','literature','organizationId','prerequisities','teacherInCharge','type','workload']

#create one column that contains all info we consider per course
courses['combined']=courses[input_columns].apply(lambda x: ' '.join(x), axis=1)

Now we give an input: the topic or course the user is interested in.

In [3]:
user_input='Artificial Intelligence'

We process the user input and check whether it is the name of an existing course or some other term.

In [4]:
#lower input to make processing easier
user_input=user_input.lower()

In [5]:
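#check if user input is a course or not
#compare for exact equality; a regex match would misread inputs that contain special characters such as '+' or '('
if not (courses['name_low']==user_input).any():
    print('Not a course')
    #if not a course append the input to the courses dataframe. We use this structure to later compute the similarities between input and courses
    #DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
    courses=pd.concat([courses,pd.DataFrame([{'combined':user_input,'name_low':user_input}])],ignore_index=True)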
else:
    print('A course')

A course


### Future courses
For the recommender it makes sense to only recommend courses that take place in the future.  
Hence, we add a variable "future". When this is True, the recommender only recommends courses that start in the future.   
For testing purposes we set it to False here, but the filtering code is still implemented.  
Other requirements, such as language or credits, can easily be added in the same way. 

In [6]:
#set the future variable
#you can change this!
future=False

In [7]:
import datetime
def check_startdate(df):
    """Check that course starts in the future"""
    df['startDate']=pd.to_datetime(df['startDate'])
    now=datetime.datetime.now()
    return df[df['startDate']>=now]

In [8]:
#create dataframe with courses that meet the requirements 
if future:
    courses_req=check_startdate(courses)
else:
    courses_req=courses
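
As a minimal sketch of how another requirement could be added in the same way, the cell below filters on language. It runs on a small made-up dataframe; the helper name check_english and the flag english_only are hypothetical, and I assume that availableEnglish holds the strings 'True' and 'False', as it does after converting everything to strings above.

```python
import pandas as pd

# Toy stand-in for the courses dataframe (made-up rows, all values are strings)
toy_courses = pd.DataFrame({
    'name': ['Course A', 'Course B', 'Course C'],
    'availableEnglish': ['True', 'False', 'True'],
})

def check_english(df):
    """Keep only courses marked as available in English (hypothetical helper)."""
    return df[df['availableEnglish'] == 'True']

# hypothetical flag, used in the same way as the future flag
english_only = True

if english_only:
    toy_req = check_english(toy_courses)
else:
    toy_req = toy_courses

print(toy_req['name'].tolist())
```

Several such filters can be chained by applying them one after another to the same dataframe.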

## TF-IDF 
Now that we have the correct data in place, we want to vectorize it. 
By vectorizing we mean assigning a number to each word in each document, so that we can later calculate similarities by comparing those numbers.  
As the method we use tf-idf, which stands for term frequency–inverse document frequency. tf-idf indicates the importance of a certain word in a document relative to the whole corpus.   
Tf-idf consists of two terms, the term frequency (tf) and the inverse document frequency (idf). We multiply these two terms to get the tf-idf.    
The term frequency in its most basic form states how often word d occurs in document n. Mathematically this can be written as $|d \in n|$.  
The inverse document frequency states how informative a certain word d is across the whole corpus. For example, 'entrepreneurship' can be very indicative whereas the word 'course' isn't. It is calculated as the total number of courses, N, divided by the number of courses in which the word d occurs. Mathematically this can be written as $\frac{N}{|\{n \in N: d \in n\}|}$.  

In our implementation we expand this a bit by calculating it as $\log(\frac{N}{|\{n \in N: d \in n\}|})+1$ if $|\{n \in N: d \in n\}|>0$, else $0$.  
We add the log because the importance of a word does not scale linearly but sub-linearly. If word1 occurs in 10 courses instead of 5, its relevance is probably not half as large. The log strongly dampens the difference in idf between these two cases.  
The +1 is added to distinguish the case $|\{n \in N: d \in n\}|=0$ from the case $\frac{N}{|\{n \in N: d \in n\}|}=1$: since the log of 1 equals 0, both would otherwise get an idf of 0.  

By multiplying the tf with the idf we get an importance score for every word in every course. Note that this score depends not only on the word, but also on the number of occurrences in and the length of the specific course you are looking at.    
In theory the tf-idf is highest when a course consists only of word d and no other course contains word d. The tf-idf is always greater than or equal to zero. 
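
To make the formulas concrete, here is a small pure-Python sketch on a made-up corpus of three "courses". It uses the raw term frequency and the idf variant with the log and the +1 described above. The function names tf, idf and tfidf are my own, and this is not the sklearn implementation, which additionally applies smoothing and normalization by default.

```python
import math

# Toy corpus: each "course" is a list of words (made up for illustration)
docs = [
    ['machine', 'learning', 'course'],
    ['entrepreneurship', 'course'],
    ['machine', 'vision', 'course', 'course'],
]
N = len(docs)

def tf(word, doc):
    """Raw term frequency: how often the word occurs in the document."""
    return doc.count(word)

def idf(word):
    """log(N / document frequency) + 1, or 0 if the word occurs in no document."""
    doc_freq = sum(1 for doc in docs if word in doc)
    return math.log(N / doc_freq) + 1 if doc_freq > 0 else 0.0

def tfidf(word, doc):
    """Importance of the word in this document: tf multiplied by idf."""
    return tf(word, doc) * idf(word)

# 'course' occurs in every document, 'entrepreneurship' in only one
print('idf course:          ', idf('course'))
print('idf entrepreneurship:', idf('entrepreneurship'))
print('tf-idf of machine in the first course:', tfidf('machine', docs[0]))
```

The word 'course' gets the lowest possible idf for a word that occurs at all, while 'entrepreneurship' gets a clearly higher one.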

We use the implementation by sklearn. This implementation comes with several tunable parameters. There is no gold standard for which values to use. I will go through some of the parameters, but you do not need to understand them to follow the rest of this code.

### Stopwords
You can choose whether to remove stopwords. In general this is a good idea. By default, this code uses the English stopword list provided by sklearn.  
Default=english

### Tokenizer
You can pass in any kind of tokenizer, which means the model will do some pre-processing on the data. A tokenizer is mostly used to remove punctuation etc. In the tokenizer you can also add a stemmer, which reduces words to their stem, e.g. innovation and innovate both become innov.  
One tokenizer is available in this code, see the cell below. However, it is not used in the pre-set parameters. Whether stemming and tokenizing helps, or generalizes too much, really depends on the situation.  
Default=None    
Enable the tokenizer with tokenize=tokenize_and_stem

### Smooth_idf
This adds one to the numerator and denominator of the idf fraction, which prevents division by zero. 
Hence, the full formula becomes $\log(\frac{N+1}{|\{n \in N: d \in n\}|+1})+1$  
Default=True  
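
As a quick sanity check, the sketch below recomputes the smoothed idf by hand on a made-up toy corpus and compares it to the idf_ attribute of a fitted TfidfVectorizer. The corpus and variable names are assumptions for illustration. The natural logarithm is used, as sklearn does.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus, three short "course descriptions"
toy_docs = ['machine learning course', 'entrepreneurship course', 'machine vision course']
N = len(toy_docs)

vec = TfidfVectorizer(smooth_idf=True)
vec.fit(toy_docs)

# Vocabulary terms in the vectorizer's column order
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
# Document frequency: in how many toy documents each term occurs
doc_freq = np.array([sum(term in doc.split() for doc in toy_docs) for term in terms])

# Smoothed idf as in the formula above
manual_idf = np.log((N + 1) / (doc_freq + 1)) + 1

print(np.allclose(manual_idf, vec.idf_))
```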

### Sublinear_tf
This makes the term frequency sublinear as well. Hence, the formula becomes $\log(|d \in n|)+1$    
The same logic applies as with the idf. If a word occurs 200 times instead of 100 times, that probably doesn't make the word twice as relevant. The log dampens the difference in tf between these two cases.    
From experiments we concluded that it really depends on the case whether this works better or not. It has to be tested again in the final experiments.    
Default=False
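
A small sketch of the sublinear term frequency, to show how much the log dampens large counts. The function name sublinear_tf is my own and only illustrates the formula; it is not a function from sklearn.

```python
import math

def sublinear_tf(count):
    """log(count) + 1 for a word that occurs at least once, else 0."""
    return math.log(count) + 1 if count > 0 else 0.0

# Compare the raw count with its sublinear version
for count in (1, 100, 200):
    print(count, round(sublinear_tf(count), 3))

# Doubling the count from 100 to 200 increases the sublinear tf by far less than a factor 2
ratio = sublinear_tf(200) / sublinear_tf(100)
print('ratio:', round(ratio, 3))
```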

### Norm
This divides the tf-idf vector of each document by its norm, i.e. $\text{tf-idf}_{norm}=\frac{\text{tf-idf}}{||\text{tf-idf}||}$  
We do this to normalize for documents of different lengths. Without normalization, longer documents tend to get higher tf-idf scores. 
In our case the documents are not that long and their lengths do not differ much, so it makes little difference whether we use it or not.   
Default='l2' 
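
The effect of the l2 norm can be sketched with two made-up vectors: a short document, and a long document with the same mix of words but ten times as many occurrences. The helper l2_normalize is hypothetical and only mirrors what norm='l2' does per document.

```python
import numpy as np

# Unnormalized toy tf-idf vectors: same word mix, but the long document has 10x the counts
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc

def l2_normalize(v):
    """Divide a vector by its Euclidean length (leave all-zero vectors unchanged)."""
    length = np.linalg.norm(v)
    return v / length if length > 0 else v

print(l2_normalize(short_doc))
print(l2_normalize(long_doc))
```

After normalization both documents have the same vector, so the length of a document no longer influences its scores.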

In [9]:
#Initialize the above described variables
#You can change these!
stopwords='english'
smooth=True
sublin=False
tokenize=None
norm='l2'

In [10]:
#the stemmer and tokenizer. This is an extra feature and you don't have to understand this to use the code 

from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

#create stemmer and tokenizer, to be used by tf-idf
#copied from https://github.com/senticr/SentiCR/blob/master/SentiCR/SentiCR.py
stemmer =SnowballStemmer("english")

def stem_tokens(tokens):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens)
    return stems

Now we can finally calculate the tf-idfs. We use the TfidfVectorizer by sklearn for this.   
We first define the vectorizer with its parameters. Then we fit it to our dataset.   
Since we selected courses based on requirements, in this case taking place in the future, we calculate two different tf-idf matrices: one for all courses, and one for only the courses that meet the requirements. For the latter, however, we pass in the vocabulary of all courses. Hence, the tf-idf is calculated over the whole vocabulary, which gives more reliable results. 
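
The idea of reusing the vocabulary can be sketched on a made-up toy corpus, where I simply pretend that only the first two documents meet the requirements. The variable names are assumptions for illustration. The point is that both matrices end up with the same columns, even though one word never occurs in the filtered set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus; pretend only the first two documents meet the requirements
all_docs = ['machine learning course', 'entrepreneurship course', 'machine vision course']
req_docs = all_docs[:2]

# Fit on all documents to learn the full vocabulary
vec_all = TfidfVectorizer()
matrix_all = vec_all.fit_transform(all_docs)

# Reuse the full vocabulary so both matrices share the same columns
vec_req = TfidfVectorizer(vocabulary=vec_all.vocabulary_)
matrix_req = vec_req.fit_transform(req_docs)

print(matrix_all.shape, matrix_req.shape)

# 'vision' only occurs in the excluded document, so its column is all zeros
vision_col = vec_all.vocabulary_['vision']
print(matrix_req[:, vision_col].sum())
```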

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
#define the tf-idf vectorizer
tfidf_all = TfidfVectorizer(stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize,norm=norm)
#get the tf-idf score for each word in the combined description of each course
tfidf_matrix_all = tfidf_all.fit_transform(courses['combined'])

#do the same but with only the courses that meet the requirements. However, use the vocabulary from the list of all courses
#norm is passed here as well, so that both vectorizers use the same settings
tfidf_req=TfidfVectorizer(use_idf=True,vocabulary=tfidf_all.vocabulary_,stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize,norm=norm)
tfidf_matrix_req=tfidf_req.fit_transform(courses_req['combined'])

Now that we have the tf-idf for each word in every document, we need a way to compare the documents.   
We do this by looking at the similarity between courses. Several similarity measures could be used; two common ones are Jaccard and cosine similarity. In this implementation we use the cosine similarity.   
The cosine similarity returns the cosine of the angle between two vectors, i.e. how much the two vectors point in the same direction. The more they point in the same direction, the more similar they are. An excellent explanation of how the cosine similarity works, applied to TF-IDF, can be found here http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/   
The dimensions of the vectors correspond to the different words in the vocabulary, so the vectors are very high-dimensional. The component in each dimension is the tf-idf value of that word in that course.   
Because we look at the angle between two vectors, the length of the vectors doesn't matter, only their direction.   
The tf-idf is always greater than or equal to zero, which results in a cosine similarity $0 \leq \cos(\theta) \leq 1$
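
A minimal sketch of the cosine similarity on three made-up vectors. The helper cosine is my own. The last lines check that, once the vectors are l2-normalized, the plain dot product computed by sklearn's linear_kernel gives the same result as cosine_similarity.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up tf-idf vectors
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no words with a

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(a, b))  # same direction, so maximal similarity
print(cosine(a, c))  # no overlap, so no similarity

# After l2 normalization the dot product (linear kernel) equals the cosine similarity
X = np.vstack([a, b, c])
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.allclose(linear_kernel(X_unit), cosine_similarity(X)))
```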


In [12]:
from sklearn.metrics.pairwise import linear_kernel

#Construct a map from course name to row position in the tf-idf matrix
#we use positions rather than the dataframe index, because the index has gaps when courses were filtered out
indices = pd.Series(range(len(courses_req)), index=courses_req['name_low'])

#construct the cosine similarities between each course
#the linear kernel is a plain dot product, which equals the cosine similarity because the tf-idf rows are l2-normalized
sim = linear_kernel(tfidf_matrix_req, tfidf_matrix_req)

# Get index of course given title
idx = indices[user_input]

#Get similarity of course to all other courses
# structure is list of (index, similarity)
sim_row = list(enumerate(sim[idx]))

#sort the courses by descending score
sim_sorted = sorted(sim_row, key=lambda x: x[1], reverse=True)

#get indices and scores in sorted order of most similar
sim_indices = [i[0] for i in sim_sorted[1:]]
sim_scores=[i[1] for i in sim_sorted[1:]]
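
As an alternative sketch of the ranking step, the same result can be obtained with numpy's argsort. Here the input course is excluded explicitly by its position instead of assuming that it is always the first entry after sorting. The similarity values below are made up, and the variable names are assumptions.

```python
import numpy as np

# Made-up similarities of the input course (position 0) to all five courses
toy_sim_row = np.array([1.0, 0.2, 0.7, 0.0, 0.5])
toy_idx = 0
top_n = 3

# Positions sorted by descending similarity
order = np.argsort(-toy_sim_row)
# Drop the input course itself, then keep the top_n most similar courses
order = order[order != toy_idx][:top_n]

print(order.tolist())
print(toy_sim_row[order].tolist())
```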

Now we have all we need and can print the courses most similar to the user's input. 

In [13]:
#t is number of most similar courses to be displayed
#you can change this!
t=10
courses_req_top=courses_req['name'].iloc[sim_indices[:t]]
sim_scores_top=sim_scores[:t]
urls=['https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste='+courses_req['code'].iloc[i] for i in sim_indices[:t]]
pd.DataFrame({'name':courses_req_top,'similarity':sim_scores_top,'url':urls})

Unnamed: 0,name,similarity,url
401,Machine Learning: Basic Principles,0.471659,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
434,Research Project in Machine Learning and Data ...,0.45336,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
432,Information Visualization,0.427604,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
427,Algorithmic Methods of Data Mining,0.41809,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
435,Deep Learning,0.416107,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
463,Bayesian Data Analysis,0.370345,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
431,Kernel Methods in Machine Learning,0.357362,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
388,Introduction to Artificial Intelligence,0.352603,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
464,Complex Networks,0.344221,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
397,Data Science,0.34418,https://oodi.aalto.fi/a/opintjakstied.jsp?Kiel...
