# Recommender system for edX course data


**Content Based Filtering**


**What we would like to achieve in this notebook**


1.1 | Problem Formulation/Statement
With the world becoming digital, almost any new skill can be picked up with a click. However, many of us still need a dedicated curriculum in order to excel in a specific topic.

This is where e-learning platforms come in handy, and edX is one such massive open online course (MOOC) provider.

So we've found a course we like and worked through it. What next?

With so many online courses available, it can take some time and effort to look through all of them.
We can utilise a recommendation system to suggest which course the user might like to take next.
Whilst there are quite a number of approaches to recommendation systems, we'll utilise one that relies on NLP.

1.2 | Recommendation system

GOALS

The purpose of our recommendation system is to inform a user about courses they may like, based on a course they already liked.


METHOD

We will utilise scraped course description data (our corpus). We'll convert each document into vector form (bag-of-words or embeddings), then calculate the cosine similarity between the vectors, from which we will be able to extract the courses that are most similar.
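
As a rough illustration of the method, the sketch below builds bag-of-words vectors and compares them with cosine similarity using only the standard library. The course names and descriptions in `toy_corpus` are made up for the example and are not taken from the dataset; the notebook itself relies on scikit-learn for this step.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Turn a text into a word-count vector (stored sparsely as a Counter)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical course descriptions, not taken from the dataset
toy_corpus = {
    "Calculus I": "limits derivatives and integrals of functions",
    "Calculus II": "integrals series and functions of several variables",
    "Intro to Marketing": "brands customers and market research",
}

vectors = {name: bag_of_words(text) for name, text in toy_corpus.items()}
liked = "Calculus I"

# Score every other course against the one the user liked
scores = {name: cosine(vectors[liked], vec)
          for name, vec in vectors.items() if name != liked}
best_match = max(scores, key=scores.get)
print(best_match, round(scores[best_match], 3))
```

Courses that share more vocabulary with the liked course score closer to 1, which is the property the recommender exploits.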


1.3 | The Dataset

This dataset was scraped from the publicly available information on the edX website.
It consists of 720 rows and 6 columns: the name of the course, the name of the university, the difficulty level, the course URL, a short summary of the course, and the course description.

What is edX?

edX online courses are self-paced, interactive courses offered by leading universities and organizations around the world. These courses provide learners with a range of topics to explore and learn from, including computer science, business, health, engineering, humanities, and more. With edX courses, learners can gain valuable skills and knowledge in an engaging and convenient way.


1.4 | Notebook Goals

Two subgoals are of interest:

EDA study | Analyse and draw conclusions based on the courses that are available

Course Recommendation system | Create course recommendations based on a specified course.



2 | edX DATASET

WHAT WE WILL DO IN THIS SECTION

We'll read the data from EdX.csv

Convert the column names to lower case

Show the name, about & description for one course

# Data Import

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
#add the path of edx csv file
data1=pd.read_csv('**path to edx csv file**')
data=data1.copy()
data.head()

For the **corpus** we use the course name, about text and course description.

In [None]:
data.duplicated().sum()

In [None]:
data.drop_duplicates(inplace = True)

In [None]:
data.duplicated().sum()

In [None]:
data.reset_index(drop = True, inplace = True)

In [None]:
data.shape

In [None]:
data.isnull().sum()

In [None]:
data.info()

In [None]:
data['University'].value_counts()

In [None]:
data['Difficulty Level'].value_counts()

In [None]:
data['Link'].nunique()

In [None]:



data['About'].nunique()

In [None]:
data['About'].value_counts()

In [None]:
data['Name'].value_counts()

In [None]:
data.info()

In [None]:
df=data[['University','Difficulty Level']]
df

In [None]:
import numpy as np
from google.colab import autoviz  # available only inside Google Colab

def categorical_histogram(df, colname, figscale=1, mpl_palette_name='Dark2'):
  from matplotlib import pyplot as plt
  import seaborn as sns
  # horizontal bar chart of the number of rows per category
  df.groupby(colname).size().plot(kind='barh', color=sns.palettes.mpl_palette(mpl_palette_name), figsize=(8*figscale, 4.8*figscale))
  plt.gca().spines[['top', 'right',]].set_visible(False)
  return autoviz.MplChart.from_current_mpl_state()

chart = categorical_histogram(df, 'Difficulty Level')
chart

In [None]:
# share of courses per difficulty level
level_counts = df['Difficulty Level'].value_counts()
plt.pie(x=level_counts, labels=level_counts.index, autopct='%0.2f%%')
plt.show()

In [None]:
# Group the data by university and difficulty level
grouped_data = df.groupby(['University', 'Difficulty Level']).size().unstack(fill_value=0)

# Keep only universities that offer courses at all three difficulty levels
universities_with_all_levels = grouped_data[
    (grouped_data['Beginner'] > 0) &
    (grouped_data['Intermediate'] > 0) &
    (grouped_data['Advanced'] > 0)
]

# Define colors for each difficulty level
color_map = {
    'Beginner': 'skyblue',
    'Intermediate': 'gold',
    'Advanced': 'lightgreen'
}

# Pass figsize to plot() directly; a separate plt.figure() call would leave an empty figure behind
ax = universities_with_all_levels.plot(kind='bar', width=0.9, stacked=True, figsize=(20, 6),
                                       color=[color_map[level] for level in universities_with_all_levels.columns])

# Shorten long university names so the tick labels stay readable
shortened_labels = [label if len(label) <= 15 else label[:15] + '...' for label in universities_with_all_levels.index]

# Label each non-empty segment with its course count, centred inside the segment
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(str(int(height)), (p.get_x() + p.get_width() / 2., p.get_y() + height / 2.), ha='center', va='center', color='black', fontweight='normal', fontsize=8)

ax.set_xticks(range(len(universities_with_all_levels)))
ax.set_xticklabels(shortened_labels, rotation='vertical')

# Adding labels and title
plt.xlabel('University')
plt.ylabel('Number of Courses')
plt.title('Courses by Difficulty Level for Universities with All Levels')

# Display the legend; labels come from the plotted columns so they match the bar colors
ax.legend(title='Difficulty level')

# Display the chart
plt.tight_layout()
plt.show()


In [None]:
data['About'].head()

# N-grams of the course descriptions

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
nltk.download('punkt')
nltk.download('wordnet')

In [None]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

stopwords_en = set(stopwords.words('english'))  # a set gives fast membership tests
lemma=WordNetLemmatizer()

def cleaning(text):
  text=re.sub("[^a-zA-Z0-9]"," ",text) # replace punctuation and other symbols with spaces (0-9 keeps the digit 0)
  text=text.lower()
  tokens=word_tokenize(text)
  cleaned_list=[]
  for token in tokens:
    if token not in stopwords_en: # drop stop words
      cleaned_list.append(lemma.lemmatize(token)) # reduce each word to its lemma
  return  " ".join(cleaned_list)

df1=data['Course Description'].apply(cleaning)

In [None]:
df1

In [None]:
import spacy
from collections import Counter


nlp_en = spacy.load('en_core_web_sm')

ngrams = {'unigrams':[],'bigrams':[],'trigrams':[]}

def ngrams1(tokens,n):
  # slide a window of n tokens over the document and join each window into one string
  list_ngrams=[' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
  return list_ngrams

for document in df1:
  doc=nlp_en(document)
  tokens=[token.text for token in doc]
  ngrams['unigrams'].extend(ngrams1(tokens,1))
  ngrams['bigrams'].extend(ngrams1(tokens,2))
  ngrams['trigrams'].extend(ngrams1(tokens,3))

print(ngrams['unigrams'][0:3])
print('unigrams : ',len(ngrams['unigrams']))
print(ngrams['bigrams'][0:3])
print('bigrams : ',len(ngrams['bigrams']))
print(ngrams['trigrams'][0:3])
print('trigrams : ',len(ngrams['trigrams']))


In [None]:
ngrams1(tokens,1)

In [None]:
def plot_counter(counter,top,name):
    labels, values = zip(*counter.items())
    fig = px.bar(pd.Series(values,index=labels,name=name).sort_values(ascending=False)[:top],
                 template='plotly_white',orientation='h')
    fig.show()

plot_counter(Counter(ngrams['unigrams']),10,'unigram')
plot_counter(Counter(ngrams['bigrams']),10,'bigram')
plot_counter(Counter(ngrams['trigrams']),10,'trigram')
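
The n-gram counting above can also be sketched without spaCy or plotly. The example below is a minimal version that assumes whitespace tokenisation is good enough; `make_ngrams` and `toy_docs` are illustrative names and data, not part of the notebook.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical cleaned descriptions, standing in for df1
toy_docs = [
    "learn data science with python",
    "data science for business",
]

# Accumulate bigram counts over the whole corpus
bigram_counts = Counter()
for doc in toy_docs:
    bigram_counts.update(make_ngrams(doc.split(), 2))

# most_common returns the n-grams already sorted by frequency
top_bigrams = bigram_counts.most_common(1)
print(top_bigrams)
```

`Counter.most_common` does the sorting that `plot_counter` performs through a pandas Series, which is handy when only the top few n-grams are needed.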


#  Natural Language Processing





*   Remove irrelevant columns that won't be utilised in this study
*   Create a new column, text, which will be used in our analysis
*   Do some text cleaning & lemmatisation of the text column data
*   Prepare the data for both TF-IDF & Word2Vec, which require slightly different inputs
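
To make the "slightly different inputs" concrete: scikit-learn's `TfidfVectorizer` takes one string per document, while Word2Vec-style trainers take a list of tokens per document. The sketch below shows both shapes for a single made-up course title; the tiny `STOP_WORDS` set and the `tokenize` helper are stand-ins for the NLTK-based cleaning used in this notebook.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's full English list
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}

def tokenize(text):
    """Lower-case, keep only letters and digits, and drop stop words."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]

# Hypothetical course title
raw = "An Introduction to Python: the basics of programming"

tokens = tokenize(raw)      # list of tokens: the shape a Word2Vec-style trainer expects
joined = " ".join(tokens)   # single string: the shape TfidfVectorizer expects

print(tokens)
print(joined)
```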





 **Drop irrelevant columns**

In [None]:
data.columns

In [None]:
data.drop(columns=['University','Difficulty Level'],inplace=True)
df2=data.copy()
df2.head()


Create a document for each course, made up of the course name, about text & course description

We will use these documents as the corpus we feed into the TF-IDF & Word2Vec models

data['text'] will be our corpus

In [None]:
data['text']= df2['Name']+' '+df2['About']+' '+df2['Course Description']
data.head()

In [None]:
text_data = data[['Name','About','Course Description','text']]
text_data.to_csv('text_data.csv',index=False)

In [None]:
data['text'][0]

# Text Cleaning / Lemmatisation

In [None]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer,WordNetLemmatizer

In [None]:
en_stopwords = stopwords.words("english") # stop words
lemma = WordNetLemmatizer() # lemmatiser

# define a function for preprocessing
def clean(text):
    text = re.sub("[^A-Za-z0-9 ]", "", text) #removes punctuation marks (0-9 keeps the digit 0)
    text = text.lower() #changes to lower case
    tokens = word_tokenize(text) #tokenize the text
    clean_list = []
    for token in tokens:
        if token not in en_stopwords: #removes stopwords
            clean_list.append(lemma.lemmatize(token)) #lemmatizing and appends to clean_list
    return " ".join(clean_list)# joins the tokens

# applying the "clean" function on the text column
data.text = data.text.apply(clean)

data.text

In [None]:
# Preprocessing, returns list instead
def clean_for_word2vec(text):

    text = re.sub("[^A-Za-z0-9 ]", "", text) #removes punctuation marks (0-9 keeps the digit 0)
    text = text.lower() #changes to lower case
    tokens = word_tokenize(text) #tokenize the text
    clean_list = []
    for token in tokens:
        if token not in en_stopwords: #removes stopwords
            clean_list.append(lemma.lemmatize(token)) #lemmatizing and appends to clean_list
    return clean_list # list of tokens rather than a joined string

#cleaning the documents
corpus_cleaned = data.text.apply(clean_for_word2vec)
lst_corpus = corpus_cleaned.tolist()

In [None]:
len(corpus_cleaned[0])

In [None]:
len(corpus_cleaned)

In [None]:
corpus = []
for words in data['text']:
    corpus.append(words.split())

# report the number of documents, not the length of the formatted string
print(f'corpus length: {len(corpus)}')

In [None]:
data['text'][0]

In [None]:
len(corpus)

In [None]:
course_list={}
names1={}
tags1={}
for i in range(len(data['text'])):
  names1[i]=data['Name'][i]
  tags1[i]=data['text'][i]

course_list['course_name']=names1
course_list['tags']=tags1
course_list

In [None]:
#course_lists=course_list.to_pickle('course_list_edx.pkl')
import pickle
pickle.dump(course_list, open("course_list_edx.pkl", "wb"))

# COURSE RECOMMENDATIONS

Our approach to providing recommendations is based on cosine similarity of input vectors

The first approach we can utilise to generate vectors for each course is by utilising Term Frequency-Inverse Document Frequency (TF-IDF)

The second approach we can utilise to generate vectors for each course is by utilising Embedding Vectors
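
To show what a TF-IDF vector contains, here is a simplified from-scratch sketch. It uses the textbook weighting tf × log(N / df), which is an assumption made for readability: scikit-learn's `TfidfVectorizer`, used below in this notebook, applies a smoothed IDF and L2 normalisation, so its numbers will differ. The tokenised `docs` are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenised document to a {term: tf * idf} dictionary."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical tokenised course texts
docs = [
    ["calculus", "course", "derivatives"],
    ["calculus", "course", "integrals"],
    ["marketing", "course", "brands"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that appears in every document ("course") gets weight zero, while a term unique to one document ("derivatives") gets the highest weight, so similarity is driven by the distinctive vocabulary of each course.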

In [None]:
list_names=list(data['Name'])
list_names[:5]

In [None]:
data['text']

In [None]:
new_df=pd.DataFrame(list_names,columns=['course_name'])
new_df['tags']=data['text']
new_df['Link']=data['Link']
new_df

In [None]:
# to_pickle writes the file and returns None, so there is nothing to assign
new_df.to_pickle('courses.pkl')

**GENERATION OF VECTOR REPRESENTATION OF TEXT**

* TF-IDF was described in the notebook nlp | Natural Language Processing Reference

* test_matrix is the input to our recommendation function, Recommendation_wth_Cos_similarity

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectoriser=TfidfVectorizer()
test_matrix = vectoriser.fit_transform(data['text']).toarray()

test_matrix.shape

In [None]:
pd.DataFrame(test_matrix)

In [None]:
len(vectoriser.vocabulary_)


Example recommendation for: MathTrackX: Differential Calculus

1 MathTrackX: Differential Calculus

2 MathTrackX: Integral Calculus

3 MathTrackX: Statistics

4 MathTrackX: Polynomials, Functions and Graphs

5 MathTrackX: Probability

6 MathTrackX: Special Functions
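
The ranking behind a list like this can be sketched as follows. The names and the similarity matrix are hypothetical, and `top_k_similar` is an illustrative helper rather than one of the notebook's functions. Unlike the listing above, this version leaves the query course out of its own recommendations.

```python
def top_k_similar(similarity, names, query, k=5):
    """Return the k course names most similar to `query`, excluding the query itself."""
    row = similarity[names.index(query)]
    # Indices of all courses, ordered from most to least similar
    ranked = sorted(range(len(names)), key=lambda i: row[i], reverse=True)
    return [names[i] for i in ranked if names[i] != query][:k]

# Hypothetical names and similarity scores (symmetric, ones on the diagonal)
names = ["Differential Calculus", "Integral Calculus", "Probability"]
similarity = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]

recommended = top_k_similar(similarity, names, "Differential Calculus", k=2)
print(recommended)
```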

In [None]:
similarity=cosine_similarity(test_matrix)

In [None]:

pickle.dump(similarity, open("similarity_edx.pkl", "wb"))

In [None]:
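def recommend(course):
    course_index = new_df[new_df['course_name'] == course].index[0]

    distances = similarity[course_index]
    # sort by similarity, highest first, and skip position 0 (the course itself)
    course_list = sorted(list(enumerate(distances)),reverse=True, key=lambda x:x[1])[1:7]

    for i in course_list:
        print(new_df.iloc[i[0]].course_name,":")
        # read the link from new_df: data1 is the raw frame before de-duplication,
        # so its row positions may not line up with the similarity matrix
        print(new_df.iloc[i[0]].Link)
recommend('MathTrackX: Differential Calculus')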

In [None]:
def Recommendation_wth_Cos_similarity(matrix,name):

  row_ind=list_names.index(name)
  similarity=cosine_similarity(matrix)

  # rank all courses by cosine similarity to the chosen course; the first hit is the course itself
  similar_courses = list(enumerate(similarity[row_ind]))
  sorted_courses=sorted(similar_courses, key=lambda x:x[1] , reverse=True)[:6]

  print(f'Recommended courses for {name} \n')

  recommendations=[]
  for rank, (course_ind, score) in enumerate(sorted_courses, start=1):
    course_name=data['Name'].iloc[course_ind] # plain string rather than a one-row Series
    print(f'{rank} {course_name}')
    recommendations.append(course_name)
  # return the names; print() returns None, so its result is not worth keeping
  return recommendations

Recommendation_wth_Cos_similarity(test_matrix,'MathTrackX: Statistics')