**Content-based filtering** recommends items using information about the items themselves (and, optionally, about users) rather than relying on users' interactions and feedback, so it requires a good amount of information about item features. This approach can be used to recommend a wide variety of items, including movies, music, books, products, and news articles. It is a popular approach for recommender systems because it is relatively easy to implement and can be effective in many cases.

### Prerequisites
- Revise the following concepts
    - TF-IDF
    - Content-based filtering
    - Cosine similarity
- Install the following dependencies
    - pandas
    - numpy
    - scikit-learn (imported as `sklearn`)
    - nltk (**its data resources, such as the stopword list, can be downloaded using `nltk.download()`**)

### Dataset

2021 saw a boom in MOOCs (Massive Open Online Courses) due to the Covid-19 pandemic. MOOCs are online courses that anyone can enroll and participate in. They offer a great opportunity for students to learn new skills at their own pace and from the comfort of their own home. With numerous paid and free resources available on the internet, however, choosing what to learn can be overwhelming for students.

This dataset was scraped from the publicly available information on the [edX website](https://www.edx.org/) up to September 2021. You can download the dataset [from here](https://github.com/pratham76/UE21CS342AA2-Data-Analytics/blob/main/EDX_WORKSHEET4B.csv). The data is a subset of this [Kaggle dataset](https://www.kaggle.com/datasets/khusheekapoor/edx-courses-dataset-2021) that has been extensively preprocessed and stored in `EDX_WORKSHEET4B.csv`.

In this notebook, we will explore and analyze the courses offered by edX. First, we will vectorize the textual data using TF-IDF. Then, we will find the top-k most similar courses using the cosine similarity between the transformed vectors.






### Data Dictionary
**Name**: Name of the course.

**University**: The university offering the course.

**Difficulty Level**: The difficulty of the course, classified as Beginner, Intermediate, or Advanced.

**Link**: Link to the course page.

**About**: Objective of the course.

**Course Description**: Complete description of the course.


### Loading the dataset

In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('/kaggle/input/edx-dataset-preprocessed/EDX.csv')
df.head(5)

| Unnamed: 0 | Name | University | Difficulty Level | Link | About | Course Description |
|---|---|---|---|---|---|---|
| 0 | How to Learn Online | edX | Beginner | https://www.edx.org/course/how-to-learn-online | Learn essential strategies for successful onli... | Designed for those who are new to elearning, t... |
| 1 | Programming for Everybody (Getting Started wit... | The University of Michigan | Beginner | https://www.edx.org/course/programming-for-eve... | This course is a "no prerequisite" introductio... | This course aims to teach everyone the basics ... |
| 2 | CS50's Introduction to Computer Science | Harvard University | Beginner | https://www.edx.org/course/cs50s-introduction-... | An introduction to the intellectual enterprise... | This is CS50x , Harvard University's introduct... |
| 3 | The Analytics Edge | Massachusetts Institute of Technology | Intermediate | https://www.edx.org/course/the-analytics-edge | Through inspiring examples and stories, discov... | In the last decade, the amount of data availab... |
| 4 | Marketing Analytics: Marketing Measurement Str... | University of California, Berkeley | Beginner | https://www.edx.org/course/marketing-analytics... | This course is part of a MicroMasters® Program | Begin your journey in a new career in marketin... |



Next, preprocess the list of courses along with their descriptions. Some of the preprocessing techniques to be used are stopword removal, stemming, lemmatisation, and special character removal. (Complete the `preprocess` function.)

Steps:
1. Remove the special characters using the `re` module, with `[^A-Za-z1-9 ]` as the regular expression
2. Transform the text into lowercase
3. Remove the stopwords defined in `eng_stopwords`
4. Perform stemming and lemmatisation using `PorterStemmer()` and `WordNetLemmatizer()`
5. Print the list after preprocessing the text

The implementation in the cells below is a simplified version of these steps: it keeps only letters and whitespace, lowercases the text, and removes stopwords, without applying stemming or lemmatisation.

For further reference: https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/
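
As a minimal sketch of the first two steps, the snippet below uses the `re` module on a made-up sentence. It assumes the intended character class is `0-9`; the pattern as written in the steps, `1-9`, would also strip the digit `0`. It also contrasts deleting special characters with replacing them by a space, since deletion glues together words that were separated only by punctuation.

```python
import re

# Hypothetical sample text, loosely modelled on a course description.
raw = "successful online learner.The edX team curated science-backed techniques before September 1, 2020."

# Deleting special characters outright merges words that were separated
# only by punctuation ("learner.The" -> "learnerThe").
deleted = re.sub(r"[^A-Za-z0-9 ]", "", raw).lower()

# Replacing them with a space keeps the word boundaries; a second pass
# collapses the runs of whitespace this leaves behind.
spaced = re.sub(r"[^A-Za-z0-9 ]", " ", raw)
spaced = re.sub(r"\s+", " ", spaced).strip().lower()

# With the character class 1-9, the digit 0 counts as a special character.
zero_stripped = re.sub(r"[^A-Za-z1-9 ]", "", "2020")

print(deleted)
print(spaced)
print(zero_stripped)
```

Whether digits should be kept at all is a separate design choice; dropping them entirely is also a reasonable option for course descriptions.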

In [2]:
#imports
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

# Download NLTK resources (if not already downloaded)
nltk.download('stopwords')
nltk.download('punkt')



[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
eng_stopwords = stopwords.words('english')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [4]:

# A set makes the stopword membership test faster than a list
eng_stopwords = set(stopwords.words('english'))

# function for preprocessing the text
def preprocess(text):
    # Keep only letters and whitespace (drops digits, punctuation and special characters)
    text = ''.join([char for char in text if char.isalpha() or char.isspace()])
    # Convert to lowercase
    text = text.lower()

    tokens = word_tokenize(text)  # tokenize the text
    clean_text = []
    for token in tokens:
        if token not in eng_stopwords:  # skip stopwords
            clean_text.append(token)  # keep the remaining tokens
    return clean_text


# preprocess the course details
preprocessed_course_details = [preprocess(text) for text in df['Course Description']]

# print the preprocessed course details
print(preprocessed_course_details[0])



['designed', 'new', 'elearning', 'course', 'prepare', 'strategies', 'successful', 'online', 'learnerthe', 'edx', 'learning', 'design', 'team', 'curated', 'powerful', 'sciencebacked', 'techniques', 'start', 'using', 'right', 'away', 'learning', 'platformthe', 'verified', 'certificate', 'course', 'free', 'use', 'following', 'coupon', 'code', 'september', 'upgrade', 'cost', 'yzadmnuanjuthis', 'course', 'help', 'answer', 'following', 'questions', 'education', 'teacher', 'training']


In [5]:
print(df['Course Description'][0])

Designed for those who are new to elearning, this course will prepare you with strategies to be a successful online learner.The edX learning design team has curated some of the most powerful, science-backed techniques which you can start using right away and on any learning platform.The Verified Certificate for this course is free. Use the following coupon code before September 1, 2020 to upgrade at no cost to you: Y5ZADM5NU2AN5JU7This course will help you answer the following questions: Education & Teacher Training


In [6]:
print(preprocessed_course_details[0])

['designed', 'new', 'elearning', 'course', 'prepare', 'strategies', 'successful', 'online', 'learnerthe', 'edx', 'learning', 'design', 'team', 'curated', 'powerful', 'sciencebacked', 'techniques', 'start', 'using', 'right', 'away', 'learning', 'platformthe', 'verified', 'certificate', 'course', 'free', 'use', 'following', 'coupon', 'code', 'september', 'upgrade', 'cost', 'yzadmnuanjuthis', 'course', 'help', 'answer', 'following', 'questions', 'education', 'teacher', 'training']


Let's analyze the difference between the raw text and the preprocessed text:

* The entire raw **text is now in lowercase**, making it easier for the model to analyze the data. **Models are generally case-sensitive**, treating words like **"Online" and "online"** as **distinct during vectorization**.

* Previously, **numbers** were present in our text; they typically **don't provide much information** but **increase** the **dimensionality and redundancy** of our data.

* Many **stop words** have been removed from the dataset, and it now contains mostly **keywords**. Removing stop words matters because they occur so often within each document that their **term frequency is high**. IDF reduces their weight but, in scikit-learn's smoothed formulation, never to zero, so **without their removal** they can still **receive high TF-IDF weights**, potentially **misleading the model**.

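* Along with other stop words, **short function words** such as "to" and "at" have been **removed**, because they are part of the stopword list. The `preprocess` function does not filter tokens by length, so **misspelled or uninformative short words** that are not stop words remain in the data and would need a separate filter to be eliminated.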

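* The **removal of punctuation and special characters** was an important preprocessing step. These elements tend to **increase the dimensionality** of the dataset without **adding meaningful information** for the model to learn from. Because the characters are deleted rather than replaced with a space, words separated only by punctuation are merged: "learner.The" becomes `learnerthe` and "science-backed" becomes `sciencebacked`.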

Next, vectorize the preprocessed text in `preprocessed_course_details` using TF-IDF from sklearn.

Steps:
1. Initialize the `TfidfVectorizer()`
2. Use the `.fit_transform()` method on the entire text to learn the vocabulary and IDF weights and transform the text in one call
3. Use `.transform()` only for text that was not part of the fit
4. Print the number of samples and features using `.shape`
5. Print the TF-IDF of the first row

In [7]:
df['Preprocessed_Course_Details'] = preprocessed_course_details

In [8]:
df.head()

| Unnamed: 0 | Name | University | Difficulty Level | Link | About | Course Description | Preprocessed_Course_Details |
|---|---|---|---|---|---|---|---|
| 0 | How to Learn Online | edX | Beginner | https://www.edx.org/course/how-to-learn-online | Learn essential strategies for successful onli... | Designed for those who are new to elearning, t... | [designed, new, elearning, course, prepare, st... |
| 1 | Programming for Everybody (Getting Started wit... | The University of Michigan | Beginner | https://www.edx.org/course/programming-for-eve... | This course is a "no prerequisite" introductio... | This course aims to teach everyone the basics ... | [course, aims, teach, everyone, basics, progra... |
| 2 | CS50's Introduction to Computer Science | Harvard University | Beginner | https://www.edx.org/course/cs50s-introduction-... | An introduction to the intellectual enterprise... | This is CS50x , Harvard University's introduct... | [csx, harvard, universitys, introduction, inte... |
| 3 | The Analytics Edge | Massachusetts Institute of Technology | Intermediate | https://www.edx.org/course/the-analytics-edge | Through inspiring examples and stories, discov... | In the last decade, the amount of data availab... | [last, decade, amount, data, available, organi... |
| 4 | Marketing Analytics: Marketing Measurement Str... | University of California, Berkeley | Beginner | https://www.edx.org/course/marketing-analytics... | This course is part of a MicroMasters® Program | Begin your journey in a new career in marketin... | [begin, journey, new, career, marketing, analy... |


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert the list of tokens back to strings for each document
preprocessed_texts = [' '.join(tokens) for tokens in df['Preprocessed_Course_Details']]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the data
tfidf_matrix = vectorizer.fit_transform(preprocessed_texts)

# Print the output matrix size
print(f'Output matrix size: {tfidf_matrix.shape}')

# Print the length of the vectorizer vocabulary
print(f'Length of the vectorizer vocabulary: {len(vectorizer.vocabulary_)}')


Output matrix size: (720, 11767)
Length of the vectorizer vocabulary: 11767
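
To inspect the TF-IDF weights of a single row, one option is to map the non-zero columns of that row back to their terms. The sketch below does this on a small made-up corpus rather than the course data; the names `docs`, `index_to_term`, and `weights` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the preprocessed course descriptions.
docs = [
    "python data science basics",
    "python data structures",
    "marketing analytics strategy",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Invert the term -> column mapping so column indices can be read as terms.
index_to_term = {idx: term for term, idx in vectorizer.vocabulary_.items()}

# Densify only the first row and keep its non-zero entries.
row = tfidf[0].toarray().ravel()
weights = {index_to_term[i]: float(row[i]) for i in row.nonzero()[0]}

# Print the terms of the first document, highest weight first.
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {weight:.3f}")
```

Terms that appear only in the first document ("science", "basics") should come out with a higher weight than terms shared with another document ("python", "data").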


* **Term Frequency** (TF): It measures how often a **term (word) appears in a document**. The TF for a term in a document is calculated as the ratio of the number of times the term appears in the document to the total number of terms in the document.

* **Inverse Document Frequency** (IDF): It measures how rare, and therefore how informative, a term is across the entire corpus. The IDF for a term is calculated as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term.

The TF-IDF score represents how important a term is in a specific document relative to its importance in the entire corpus. High TF-IDF scores are assigned to terms that appear frequently in a document but are rare across the entire corpus.
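
The calculation can be sketched by hand on a tiny, made-up corpus, using the common logarithmic form of IDF. This is the textbook formulation, not scikit-learn's exact one: `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data).
corpus = [
    ["python", "data", "science", "python"],
    ["python", "data", "structures"],
    ["marketing", "analytics"],
]

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # Log of (number of documents / number of documents containing `term`).
    # Assumes `term` occurs in at least one document.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["python", "science"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 3))
```

Although "python" occurs twice in the first document and "science" only once, "science" receives the higher score because it appears in fewer documents.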

Other text representations could also be considered for this task:

* **Word Embeddings (Word2Vec, GloVe)**: Dense vector representations of words that capture semantic relationships, so that words with similar meanings have similar vectors.

* **Doc2Vec**: Extends Word2Vec to represent entire documents as vectors that capture their semantic meaning.

* **Count Vectorization**: Represents a document as a vector of word counts. It is similar to TF-IDF but without the inverse document frequency component.



Next, find the top-5 most similar courses that a person might opt for if they are interested in the course named 'Python Basics for Data Science'.

Steps:
1. Import `cosine_similarity` from sklearn.metrics.pairwise
2. Compute `cosine_similarity` between the row of `tfidf_matrix` at the index of 'Python Basics for Data Science' and all rows
3. Use `argsort` to sort the cosine_similarity results
4. Print indices of top-5 most similar results from sorted array (hint: argsort sorts in ascending order)
5. Display text of top-5 most similar courses
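
The ranking in steps 3 and 4 can be sketched on a small array of made-up similarity scores. The values, `query_index`, and `k` below are hypothetical and only illustrate how `argsort` and slicing interact.

```python
import numpy as np

# Hypothetical similarity scores of a query (index 2) against five items.
similarities = np.array([0.10, 0.75, 1.00, 0.40, 0.90])
query_index = 2
k = 2

# argsort() orders indices from lowest to highest score, so reverse it.
ranked = similarities.argsort()[::-1]

# Drop the query itself (it matches itself with similarity 1) and keep k.
top_k = [int(i) for i in ranked if i != query_index][:k]
print(top_k)

# A negative-step slice gives the same ranking in one go: this returns the
# k + 1 highest-scoring indices, best first, with the query still included.
top_k_plus_query = similarities.argsort()[:-(k + 2):-1]
print(top_k_plus_query)
```

Filtering on `query_index` explicitly avoids relying on the query being ranked first, which may not hold if another item has an identical vector.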

In [10]:
from sklearn.metrics.pairwise import cosine_similarity

# Assuming df is your DataFrame and tfidf_matrix is the TF-IDF matrix obtained from TfidfVectorizer

# Index of 'Python Basics for Data Science' course
query_index = df[df['Name'] == 'Python Basics for Data Science'].index[0]

# Compute cosine similarity
cosine_similarities = cosine_similarity(tfidf_matrix[query_index], tfidf_matrix).flatten()

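# argsort sorts in ascending order, so reverse it to rank by descending similarity
ranked_indices = cosine_similarities.argsort()[::-1]

# Keep the 5 most similar courses, excluding the query course itself
top_5_indices = [index for index in ranked_indices if index != query_index][:5]

# Display text of top-5 most similar courses
top_5 = [df.iloc[index]['Name'] for index in top_5_indices]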
print("Top-5 most similar courses:")
for course in top_5:
    print(course)


Top-5 most similar courses:
Python Data Structures
Using Python for Research
Analytics in Python
Visualizing Data with Python
Programming for Everybody (Getting Started with Python)


Apart from cosine similarity, several other metrics can be used to measure the similarity between vector representations of text:

* **Euclidean Distance**: Measures the straight-line distance between two vectors in the vector space. Smaller distances indicate higher similarity.

* **Manhattan Distance** (L1 Norm): Sum of the absolute differences between the vector components. It is also known as the "city block" distance.

* **Minkowski Distance**: A generalization of Euclidean and Manhattan distances. It allows you to control the "order" of the distance calculation.

* **Jaccard Similarity**: Measures the similarity between sets. It is the size of the intersection divided by the size of the union of the sets.
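
These metrics are short enough to write out directly. The following is a plain-Python sketch on made-up vectors and token lists; the function names are illustrative, and in practice library implementations (for example in SciPy or scikit-learn) would normally be used.

```python
def minkowski(u, v, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def jaccard(a, b):
    # Size of the intersection over the size of the union of two token sets.
    # Assumes at least one of the inputs is non-empty.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

u = [3.0, 0.0, 0.0]
v = [0.0, 4.0, 0.0]

print(minkowski(u, v, 1))  # Manhattan: |3| + |4|
print(minkowski(u, v, 2))  # Euclidean: sqrt(3^2 + 4^2)
print(jaccard(["python", "data", "science"], ["python", "data", "structures"]))
```

Note that the distances grow as vectors become less alike, while Jaccard similarity grows as sets become more alike, so the two kinds of score are ranked in opposite directions.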


In the context of finding similar courses based on content, cosine similarity is often preferred for several reasons:

* **Scale-Invariant**: Cosine similarity depends only on the direction of the vectors and is not affected by their magnitude.

* **Insensitive to Document Length**: As a consequence, a long and a short course description that use terms in similar proportions are still scored as similar.

* **Angle Measure**: Cosine similarity measures the cosine of the angle between two vectors. In the context of text similarity, this can be interpreted as comparing the orientation of the vectors in the high-dimensional space.
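
Scale invariance can be illustrated with two made-up term-count vectors that point in the same direction but differ in magnitude. The helper functions below are a plain-Python sketch, not the scikit-learn implementation.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]  # same term proportions, ten times the counts

cos_score = cosine(short_doc, long_doc)
euc_score = euclidean(short_doc, long_doc)
print(cos_score)  # approximately 1: the vectors point the same way
print(euc_score)  # large, despite the identical term proportions
```

For the vectors produced by `TfidfVectorizer` with its default settings, each row is already L2-normalized, so magnitude differences are removed before any similarity is computed and cosine similarity reduces to a dot product.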

Given the common usage and effectiveness of cosine similarity in text-related tasks, it is a reasonable choice for finding similar courses based on content. 