### Content-Based Filtering
This notebook implements a simple version of content-based recommendation.  
Content-based filtering is one of the two most common methods, next to collaborative filtering.  
It uses as input data on the items (in our case the courses). It then calculates similarity between each of the item, and recommends the items with the highest similarity.  
It has been shown to work pretty well and an advantage is that we only need data on the courses and not on the users. 
On the other hand, this means that it is not personalized and it often is hard to suggest things that are different than the input. 

This is only a test (but gives reasonable results). 
The code was largely inspired by and partly copied from https://www.datacamp.com/community/tutorials/recommender-systems-python

In [113]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import datetime

from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

In [412]:
#read the course info
courses=pd.read_csv('../Data/filtered_courses.csv')
#fix string list to make it just a string
courses['teachers']=courses['teachers'].str.replace("[\[,'\]]", '')
courses['content']=courses['content'].fillna('')
print("Number of courses:",len(courses),"\n")
courses.head()

Number of courses: 1212 



Unnamed: 0,index,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master¿s programs of B...,1125574316,6,...,E701,Most Master¿s Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,Perttu Kähäri Laura Peni Pekka Pälli Iiris Sai...,"Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman",Alice Wickström Ingmar Björkman,"2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,Kathrin Sele,"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,Mikko Martela,"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,Esko Aho Kirsti Iivonen,Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...


In [414]:
# bla='creating stories and narratives'
# courses['name_low'].str.contains(r'^%s$'%bla).any()

In [415]:
def check_startdate(df):
    """Check if course starts in future"""
    df['startDate']=pd.to_datetime(df['startDate'])
    now=datetime.datetime.now()
    return df[df['startDate']>=now]

In [416]:
#create stemmer and tokenizer, to be used by tf-idf
#copied from https://github.com/senticr/SentiCR/blob/master/SentiCR/SentiCR.py
stemmer =SnowballStemmer("english")

def stem_tokens(tokens):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize_and_stem(text):
    tokens = word_tokenize(text)
    stems = stem_tokens(tokens)
    return stems

In [417]:
def top_tfidf(ind,tfidfvec,tfidf_matrix,n_tok=10):
    #tfidfvec,tfidf_matrix=define_tfidf(df,True,'english',False,None)
    features=tfidfvec.get_feature_names()
    course_tfidf=np.squeeze(tfidf_matrix[ind].toarray())
    sorted_tfidf=np.argsort(-course_tfidf) #minus cause want descending order
    return [(features[i],course_tfidf[i]) for i in sorted_tfidf[:n_tok]]

In [418]:
def get_mostsim(user_input,courses_ed,indices,sim):
    """Get the most similar courses for the input"""

    # Get index of course given title
    idx = indices[user_input]
    
    #Get similarity of course to all other courses
    # structure is list of (index, similarity)
    sim_row = list(enumerate(sim[idx]))

    #sort the courses by descending score
    sim_sorted = sorted(sim_row, key=lambda x: x[1], reverse=True)

    sim_sorted=[sim_sorted[i] for i in range(len(sim_sorted)) if courses_ed['name_low'].iloc[sim_sorted[i][0]] not in history_names]
    sim_indices = [i[0] for i in sim_sorted[1:]]
    sim_scores=[i[1] for i in sim_sorted[1:]] 

    return sim_indices,sim_scores

In [472]:
input_columns=['name','content']#['additionalInformation', 'assesmentMethods','content','courseStatus','credits','gradingScale','learningOutcomes','level','literature','organizationId','prerequisities','teacherInCharge','type','workload']#['content','name']
future=False
meas='cosine'
stopwords='english'
smooth=True
sublin=False
tokenize=None

In [473]:
courses.columns

Index(['index', 'additionalInformation', 'assesmentMethods',
       'availableEnglish', 'cefrLevel', 'code', 'content', 'courseStatus',
       'courseUnitId', 'credits', 'endDate', 'gradingScale', 'homepage', 'id',
       'languageOfInstruction', 'languageOfInstructionCodes',
       'learningOutcomes', 'level', 'literature', 'name', 'organizationId',
       'prerequisities', 'registration', 'startDate', 'substitutes',
       'teacherInCharge', 'teachers', 'teachingPeriod', 'type', 'workload',
       'name_low', 'combined'],
      dtype='object')

In [474]:
courses['name_low']=courses['name'].str.lower()
courses['name_low']=courses['name_low'].str.strip()

In [475]:
courses=courses.astype(str)

In [476]:
courses['combined']=courses[input_columns].apply(lambda x: ' '.join(x), axis=1)

In [477]:
#history tinka
history_names=['business and society','web software development','magnificent life','creating stories and narratives','digital workshop basics','machine learning: basic principles','algorithmic methods of data mining','machine learning: advanced probabilistic methods','information visualization','deep learning','law in digital society','bayesian data analysis','sustainable built environment','state of the world and development','finnish 1a','finnish 1b','finnish 2a']#['artificial intelligence','law in digital society','machine learning: advanced probabilistic methods','creating stories and narratives']

In [478]:
#change from using names to using course codes?

In [479]:
#history Jukka
#history_names=['theoretical computer science','computer graphics','information security','operating systems','linear algebra and differential equations','fourier analysis','matrix algebra','differential and integral calculus 1 (sci)','foundations of discrete mathematics','first course in probability and statistics','principles of algorithmic techniques','discrete models and search','machine learning: basic principles','declarative programming','seminar in computer science algorithms','special course in copmuter science gnetic algorithms','individual studies in computer science hands-on natural language processing study group','scalable cloud computing','mobile systems programming','advanced course in algorithms','distributed algorithms','computational complexity theory','programming parallel computers','algorithmic methods of data mining','artificial intelligence','machine learning: advanced probabilistic methods','information visualization','computer vision','speech recognition p','reinforcement learning']

In [480]:
len(history_names)

17

In [481]:
history_courses=courses[courses['name_low'].isin(history_names)]

In [482]:
#history_courses.columns

In [483]:
#history_courses

In [484]:
#list(history_courses['name'])

In [485]:
#courses from history that do not occur in dataset
#[i for i in history_names if i not in list(history_courses['name_low'])]

In [486]:
#top tfidfs from history
top_tfidf(1212,tfidf_all,tfidf_matrix,10)

[('stories', 0.33368812635031453),
 ('data', 0.3200009026407034),
 ('science', 0.3170613305107867),
 ('machine', 0.23048201122921785),
 ('course', 0.1846978505139666),
 ('bayesian', 0.15802140523129538),
 ('programme', 0.15657345827309793),
 ('learning', 0.1487121214571029),
 ('major', 0.14384523575918112),
 ('ccis', 0.14181086536001855)]

In [487]:
histo_all=' '.join(history_courses['combined'])

In [488]:
courses_ed=courses.append({'combined':histo_all,'name_low':'history'},ignore_index=True)

In [489]:
#define the tf-idf vectorizer
tfidf_all = TfidfVectorizer(stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize)
#get the tf-idf score for each word in each ontent description of each course
tfidf_matrix = tfidf_all.fit_transform(courses_ed['combined'])

In [490]:
#Construct a reverse map of indices and courses
# we use this to map index to title and the other way around
indices = pd.Series(courses_ed.index, index=courses_ed['name_low'])

In [491]:
sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [492]:
sim_indices,sim_scores=get_mostsim('history',courses_ed,indices,sim)

In [493]:
urls=['https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste='+courses_ed['code'].iloc[i] for i in sim_indices[:10]]

In [494]:
courses_ed['name'].iloc[sim_indices[:10]],urls,sim_scores[:10]

(397                                          Data Science
 434     Research Project in Machine Learning and Data ...
 408     Special Course in Machine Learning and Data Sc...
 1161                                        Aalto Fellows
 469                          Modeling Biological Networks
 387                        Data Structures and Algorithms
 631             Statistical Natural Language Processing P
 404     Seminar in Computer Science: Internet, Data an...
 109                       Quantitative Empirical Research
 864            Research Methods in International Business
 Name: name, dtype: object,
 ['https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=CS-C3160',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=CS-E4870',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=CS-E4070',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste=TU-E4110',
  'https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tu

In [495]:
[top_tfidf(ind,tfidf_all,tfidf_matrix,1) for ind in sim_indices[:10]]

[[('data', 0.3776560526587816)],
 [('science', 0.5717873114771781)],
 [('year', 0.35895200719320364)],
 [('fellows', 0.4041658907631192)],
 [('networks', 0.5807095435544969)],
 [('structures', 0.4412323994563839)],
 [('natural', 0.3297622388333133)],
 [('things', 0.3861632584994783)],
 [('data', 0.43877208661504247)],
 [('research', 0.37994514232665805)]]

In [252]:
def recom_freetext(user_input,courses,input_columns=['content','name'],future=False,meas='cosine',stopwords='english',smooth=True,sublin=False,tokenize=None):
    """
    Give recommendations based on the user input
    user_input: string user inputs
    courses: DataFrame with the courses
    input_columns: the columns of the courses dataframe we want to consider
    future: if True only recommend courses taking place in the future
    meas: indicates the similarity measure we use
    stopwords: indicates which stopwords should be removed
    smooth: whether to use smooth idf
    sublin: whether to use sublinear TF
    tokenize: whether to tokenize the input
    """
    user_input=user_input.lower()
    #create one column that contains all info we consider per course
    courses['combined']=courses[input_columns].apply(lambda x: ' '.join(x), axis=1)
    #some preprocessing on the name (title of courses)
    courses['name_low']=courses['name'].str.lower()
    courses['name_low']=courses['name_low'].str.strip()
    #check if user input is a course or not
    if not courses['name_low'].str.contains(r'^%s$'%user_input).any():
        print('Not a course')
        courses=courses.append({'combined':user_input,'name_low':user_input},ignore_index=True)
    else:
        print('A course')

    #create dataframe with courses that meet the requirements 
    if future:
        courses_req=check_startdate(courses)
    else:
        courses_req=courses

    #define the tf-idf vectorizer
    tfidf_all = TfidfVectorizer(stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize)
    #get the tf-idf score for each word in each ontent description of each course
    tfidf_matrix_all = tfidf_all.fit_transform(courses['combined'])
    
    #do the same but with only the courses that meet the requirements. However, use the vocabulary from the list of all courses
    tfidf_req=TfidfVectorizer(use_idf=True,vocabulary=tfidf_all.vocabulary_,stop_words=stopwords,smooth_idf=smooth,sublinear_tf=sublin,tokenizer=tokenize)
    tfidf_matrix_req=tfidf_req.fit_transform(courses_req['combined'])
    
    #Construct a reverse map of indices and courses
    # we use this to map index to title and the other way around
    indices = pd.Series(courses_req.index, index=courses_req['name_low'])
    
    #get the similarity scores
    if meas=='cosine':
        sim = linear_kernel(tfidf_matrix_req, tfidf_matrix_req)
    sim_indices,sim_scores=get_mostsim(user_input,indices,sim)
    urls=['https://oodi.aalto.fi/a/opintjakstied.jsp?Kieli=6&html=1&Tunniste='+courses_req['code'].iloc[i] for i in sim_indices[:10]]
    return courses_req['name'].iloc[sim_indices[:10]],urls,sim_scores[:10]

In [253]:
def get_recom(user_input):
    df=pd.read_csv('../Data/filtered_courses.csv')
    df['content']=df['content'].fillna('')
    course_titles,urls,similarities=recom_freetext(user_input,df)
    return course_titles, urls,similarities

In [70]:
get_recom('artificial intelligence')

A course


TypeError: get_mostsim() missing 1 required positional argument: 'sim'