### Content-Based Filtering
This notebook implements a simple version of content-based recommendation.  
Content-based filtering and collaborative filtering are the two most common approaches to recommendation.  
Content-based filtering takes data on the items as input (in our case, the courses). It computes the similarity between each pair of items and recommends the items most similar to the input.  
It tends to work well in practice, and one advantage is that it only needs data on the courses, not on the users.  
On the other hand, this means the recommendations are not personalized, and it is often hard to suggest items that differ much from the input.  
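
As a minimal sketch of the idea, assume three made-up course descriptions (the titles and texts below are illustrative and not taken from the dataset). The snippet builds one TF-IDF vector per description and ranks the other items by cosine similarity to a chosen one. The helper name `most_similar` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy item data (hypothetical, not from filtered_courses.csv)
toy_courses = {
    "Machine Learning Basics": "supervised learning regression classification neural networks",
    "Deep Learning": "neural networks backpropagation supervised learning images",
    "Medieval Art History": "painting sculpture architecture medieval europe",
}
titles = list(toy_courses)

# One TF-IDF vector per item description
toy_tfidf = TfidfVectorizer().fit_transform(list(toy_courses.values()))

# Pairwise cosine similarity between all items
toy_sim = cosine_similarity(toy_tfidf)

def most_similar(title):
    """Return the other titles, most similar first."""
    idx = titles.index(title)
    order = toy_sim[idx].argsort()[::-1]  # descending similarity
    return [titles[i] for i in order if i != idx]

print(most_similar("Machine Learning Basics"))
```

The two machine learning descriptions share several words, while the art history description shares none with them, so "Deep Learning" is ranked first for "Machine Learning Basics".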

This is only a prototype, but it gives reasonable results.  
The code was largely inspired by and partly copied from https://www.datacamp.com/community/tutorials/recommender-systems-python

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import datetime

In [2]:
#read the course info
df=pd.read_csv('../Data/filtered_courses.csv')
df['content']=df['content'].fillna('')
print("Number of courses:",len(df),"\n")
df.head()

Number of courses: 1212 



Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master¿s programs of B...,1125574316,6,...,E701,Most Master¿s Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"['Perttu Kähäri', 'Laura Peni', 'Pekka Pälli',...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","['Alice Wickström', 'Ingmar Björkman']","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,['Kathrin Sele'],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,['Mikko Martela'],"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"['Esko Aho', 'Kirsti Iivonen']",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...


In [3]:
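def check_startdate(df):
    """Return only the courses whose start date lies in the future"""
    # Parse the dates without modifying the caller's DataFrame
    start=pd.to_datetime(df['startDate'])
    now=datetime.datetime.now()
    # Missing dates (NaT) compare as False, so those rows are dropped
    return df[start>=now]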

In [4]:
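def get_mostsim(user_input,indices,sim):
    """Get the most similar courses for the input"""

    # Get the row of the course in the similarity matrix, given its title
    idx = indices[user_input]
    # If several courses share the title, indices[...] is a Series; use the first match
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]

    # Get similarity of course to all other courses
    # structure is list of (index, similarity)
    sim_row = list(enumerate(sim[idx]))

    # Sort the courses by descending score
    sim_sorted = sorted(sim_row, key=lambda x: x[1], reverse=True)
    # Leave out the input itself (it is not always first when scores are tied)
    sim_sorted = [pair for pair in sim_sorted if pair[0] != idx]

    sim_indices = [i[0] for i in sim_sorted]
    sim_scores = [i[1] for i in sim_sorted]

    return sim_indices,sim_scores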
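
The ranking step in `get_mostsim` (sort one row of the similarity matrix, then leave out the input itself) can also be written with NumPy. The following is only a sketch on a hand-made 3×3 similarity matrix; the values and the name `rank_similar` are invented for illustration.

```python
import numpy as np

# Hypothetical similarity matrix for three items (symmetric, ones on the diagonal)
toy_sim_matrix = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])

def rank_similar(idx, sim):
    """Return (indices, scores) of all other items, most similar first."""
    order = np.argsort(-sim[idx], kind="stable")  # descending by score
    order = order[order != idx]                   # drop the query item itself
    return order.tolist(), sim[idx][order].tolist()

ranked_indices, ranked_scores = rank_similar(0, toy_sim_matrix)
print(ranked_indices, ranked_scores)  # [2, 1] [0.7, 0.2]
```

Removing the query by its index, rather than dropping the first sorted entry, keeps the result correct when another item has the same score as the query.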

In [5]:
def recom_freetext(user_input,courses,input_columns=('content','name'),future=False,meas='cosine',stopwords='english',smooth=True,sublin=False,tokenize=None):
    """
    Give recommendations based on the user input
    user_input: string entered by the user (a course name or free text)
    courses: DataFrame with the courses
    input_columns: the columns of the courses DataFrame we want to consider
    future: if True only recommend courses taking place in the future
    meas: the similarity measure to use (only 'cosine' is implemented)
    stopwords: which stopwords should be removed
    smooth: whether to use smooth idf
    sublin: whether to use sublinear TF
    tokenize: optional custom tokenizer (callable) passed to TfidfVectorizer
    """
    user_input=user_input.lower().strip()
    #work on a copy with a clean 0..n-1 index so the caller's DataFrame is not modified
    courses=courses.copy().reset_index(drop=True)
    #create one column that contains all info we consider per course
    courses['combined']=courses[list(input_columns)].fillna('').astype(str).apply(' '.join, axis=1)
    #some preprocessing on the name (title of courses)
    courses['name_low']=courses['name'].str.lower().str.strip()
    #check if user input is a course or not (exact match, so no regex escaping is needed)
    if not (courses['name_low']==user_input).any():
        print('Not a course')
        #add the free text as a pseudo-course (DataFrame.append was removed in pandas 2.0)
        new_row=pd.DataFrame([{'combined':user_input,'name_low':user_input}])
        courses=pd.concat([courses,new_row],ignore_index=True)
    else:
        print('A course')

    #create dataframe with courses that meet the requirements
    if future:
        #always keep the input row itself, otherwise it cannot be looked up below
        is_input=courses['name_low']==user_input
        courses_req=courses.loc[check_startdate(courses).index.union(courses.index[is_input])]
    else:
        courses_req=courses

    #define the tf-idf vectorizer
    tfidf_all = TfidfVectorizer(stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize)
    #fit on the combined text of all courses to obtain the full vocabulary
    tfidf_all.fit(courses['combined'])

    #compute the tf-idf scores only for the courses that meet the requirements, but with the vocabulary from all courses
    #(the idf weights are re-estimated on this subset)
    tfidf_req=TfidfVectorizer(use_idf=True,vocabulary=tfidf_all.vocabulary_,stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize)
    tfidf_matrix_req=tfidf_req.fit_transform(courses_req['combined'])

    #map each lower-cased title to its row position in the tf-idf / similarity matrix
    #(positions, not index labels: the labels have gaps once courses are filtered)
    indices = pd.Series(range(len(courses_req)), index=courses_req['name_low'])

    #get the similarity scores
    if meas=='cosine':
        #tf-idf rows are L2-normalised, so the linear kernel equals cosine similarity
        sim = linear_kernel(tfidf_matrix_req, tfidf_matrix_req)
    else:
        raise ValueError("Unknown similarity measure: %r"%meas)
    sim_indices,sim_scores=get_mostsim(user_input,indices,sim)
    top=sim_indices[:10]
    urls=['https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste='+courses_req['code'].iloc[i] for i in top]
    return courses_req['name'].iloc[top],urls,sim_scores[:10]
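
`recom_freetext` uses `linear_kernel` for the `'cosine'` measure. This works because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two rows equals their cosine similarity. A quick check on made-up sentences (illustrative only, not from the course data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up documents, for illustration only
toy_docs = [
    "introduction to artificial intelligence",
    "machine learning and artificial intelligence",
    "financial accounting principles",
]

# Default norm='l2': every row of the matrix has unit length
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

dot = linear_kernel(toy_matrix, toy_matrix)   # plain dot products
cos = cosine_similarity(toy_matrix)           # explicit cosine similarity

print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two results would generally differ, and `cosine_similarity` would be the safer choice.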

In [6]:
def get_recom(user_input):
    """Load the course data and return (titles, urls, similarities) of the top 10 recommendations"""
    df=pd.read_csv('../Data/filtered_courses.csv')
    df['content']=df['content'].fillna('')
    return recom_freetext(user_input,df)

In [7]:
get_recom('artificial intelligence')

A course


(402                               Declarative Programming
 1195                      Art and Artificial Intelligence
 673                                Reinforcement learning
 401                    Machine Learning: Basic Principles
 430      Machine Learning: Advanced Probabilistic Methods
 708                             AI in health technologies
 64                             Capstone course: Marketing
 388               Introduction to Artificial Intelligence
 644     Machine Learning for Mobile and Pervasive syst...
 921                        Computational inverse problems
 Name: name, dtype: object,
 ['https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=CS-E3220',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=UWAS-C0025',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=ELEC-E8125',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=CS-E3210',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=