


## Introduction to Content-Based Recommendation System

Welcome to this notebook exploring content-based recommendation systems! In the world of personalized recommendations, content-based systems play a significant role by suggesting items to users based on the inherent characteristics of those items. These systems leverage the idea that users who liked a particular item in the past are likely to be interested in items with similar attributes or content.

The foundation of a content-based recommendation system lies in the understanding of item attributes, features, or descriptions. By analyzing and quantifying these characteristics, the system can establish connections between items and tailor recommendations to match a user's preferences.
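To make "quantifying characteristics" concrete, here is a minimal sketch that turns three made-up course titles into TF-IDF vectors and compares them with cosine similarity. The titles and the names `toy_titles`, `toy_vectorizer`, `toy_matrix`, and `toy_sim` are assumptions for illustration only; they are not part of the dataset used later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue (not taken from the Udemy dataset)
toy_titles = [
    "python for data analysis",
    "data analysis with excel",
    "watercolor painting basics",
]

# Quantify each title as a TF-IDF vector
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# Pairwise cosine similarity between all titles
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The first two titles share the words "data" and "analysis", so they score well above zero against each other. The third title shares no words with the others, so its similarity to both is zero.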

**Key Features of Content-Based Recommendation Systems:**

- **Personalization**: Content-based systems provide personalized recommendations by focusing on the attributes that matter most to the user. They're capable of suggesting items that align closely with the user's tastes.

- **Cold Start Problem**: One of the strengths of content-based systems is how they handle the "cold start" problem. A newly introduced item can be recommended as soon as its attributes are known, with no interaction history required. A new user still has to supply at least one preference signal, such as the skill keyword used later in this notebook.

- **Item Diversity**: This is a limitation rather than a strength. Because recommendations are driven by similarity to attributes the user already likes, they tend to cluster around the same kind of item and can lack diversity.

In this notebook, we'll explore the concepts behind content-based recommendation systems and walk through the process of building one step by step. By the end, you'll have a deeper understanding of how these systems work and how you can apply them to provide personalized recommendations to users.

So, let's dive in and uncover the inner workings of content-based recommendation systems!

---



In [None]:
# importing important libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, KFold

## Data loading and exploration:

In [None]:
df = pd.read_csv("/content/udemy_output_All_IT__Software_p1_p626.csv")

In [None]:
df.head()

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__currency,discount_price__price_string,price_detail__amount,price_detail__currency,price_detail__price_string
0,762616,The Complete SQL Bootcamp 2020: Go from Zero t...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
1,937678,Tableau 2020 A-Z: Hands-On Tableau Training fo...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
2,1361790,PMP Exam Prep Seminar - PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
4,637930,An Entire MBA in 1 Course:Award Winning Busine...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,INR,₹455,8640.0,INR,"₹8,640"


In [None]:
df.shape

(22853, 20)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22853 entries, 0 to 22852
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            22853 non-null  int64  
 1   title                         22853 non-null  object 
 2   url                           22853 non-null  object 
 3   is_paid                       22853 non-null  bool   
 4   num_subscribers               22853 non-null  int64  
 5   avg_rating                    22853 non-null  float64
 6   avg_rating_recent             22853 non-null  float64
 7   rating                        22853 non-null  float64
 8   num_reviews                   22853 non-null  int64  
 9   is_wishlisted                 22853 non-null  bool   
 10  num_published_lectures        22853 non-null  int64  
 11  num_published_practice_tests  22853 non-null  int64  
 12  created                       22853 non-null  object 
 13  p


# **Data Cleaning:**

In [None]:
# Drop the specified columns
columns_to_drop = ['discount_price__currency', 'price_detail__currency']
df = df.drop(columns=columns_to_drop)


In [None]:
# Removing the currency symbols:

columns_to_clean = ['price_detail__price_string', 'discount_price__price_string']
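for column in columns_to_clean:
    # Assign the result back instead of calling replace(..., inplace=True) on a
    # column selection, which newer pandas versions warn about or ignore
    df[column] = df[column].replace('₹', '', regex=True)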
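After the currency symbol is removed, the price columns are still strings. If numeric prices are needed, one possible follow-up is sketched below on a small made-up frame. The sample values (including the missing one) and the `price_numeric` column name are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the price strings in the dataset
prices = pd.DataFrame({"price_detail__price_string": ["₹8,640", "₹455", None]})

cleaned = (
    prices["price_detail__price_string"]
    .str.replace("₹", "", regex=False)   # drop the currency symbol
    .str.replace(",", "", regex=False)   # drop thousands separators
)

# errors="coerce" turns anything unparseable (including missing values) into NaN
prices["price_numeric"] = pd.to_numeric(cleaned, errors="coerce")
print(prices)
```

Using `regex=False` treats the symbol and the comma as literal characters, which avoids surprises if a pattern ever contains regex metacharacters.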

In [None]:
df.head()

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__price_string,price_detail__amount,price_detail__price_string
0,762616,The Complete SQL Bootcamp 2020: Go from Zero t...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,455,8640.0,8640
1,937678,Tableau 2020 A-Z: Hands-On Tableau Training fo...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,455,8640.0,8640
2,1361790,PMP Exam Prep Seminar - PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,455,8640.0,8640
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,455,8640.0,8640
4,637930,An Entire MBA in 1 Course:Award Winning Busine...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,455,8640.0,8640


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22853 entries, 0 to 22852
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            22853 non-null  int64  
 1   title                         22853 non-null  object 
 2   url                           22853 non-null  object 
 3   is_paid                       22853 non-null  bool   
 4   num_subscribers               22853 non-null  int64  
 5   avg_rating                    22853 non-null  float64
 6   avg_rating_recent             22853 non-null  float64
 7   rating                        22853 non-null  float64
 8   num_reviews                   22853 non-null  int64  
 9   is_wishlisted                 22853 non-null  bool   
 10  num_published_lectures        22853 non-null  int64  
 11  num_published_practice_tests  22853 non-null  int64  
 12  created                       22853 non-null  object 
 13  p

## Text Preprocessing:

In [None]:
# Downloading NLTK functionalities
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
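Deleting punctuation outright glues hyphenated or colon-separated words together (for example "Hands-On" becomes "HandsOn"). A possible alternative, sketched here under the assumption that such words should stay separate, replaces each punctuation mark with a space and then collapses repeated whitespace. The function name `replace_punctuation_with_space` is hypothetical.

```python
import re
import string


def replace_punctuation_with_space(text):
    """Hypothetical variant: swap punctuation for spaces, then collapse whitespace."""
    # Map every punctuation character to a single space
    translator = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # Collapse runs of whitespace left behind and trim the ends
    return re.sub(r"\s+", " ", text.translate(translator)).strip()


example = replace_punctuation_with_space("Tableau 2020 A-Z: Hands-On Tableau Training")
print(example)
```

One trade-off to keep in mind: single-character tokens such as "A" and "Z" are ignored by scikit-learn's default tokenizer, so this variant changes which terms end up in the vocabulary.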

In [None]:
df['title'] = df['title'].apply(remove_punctuation)
df.head(10)

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__price_string,price_detail__amount,price_detail__price_string
0,762616,The Complete SQL Bootcamp 2020 Go from Zero to...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,455,8640.0,8640
1,937678,Tableau 2020 AZ HandsOn Tableau Training for D...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,455,8640.0,8640
2,1361790,PMP Exam Prep Seminar PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,455,8640.0,8640
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,455,8640.0,8640
4,637930,An Entire MBA in 1 CourseAward Winning Busines...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,455,8640.0,8640
5,1208634,Microsoft Power BI A Complete Introduction 20...,/course/powerbi-complete-introduction/,True,124180,4.56228,4.57676,4.57676,38093,False,275,0,2017-05-08T13:03:21Z,2017-05-15T18:48:54Z,455.0,455,8640.0,8640
6,864146,Agile Crash Course Agile Project Management Ag...,/course/agile-crash-course/,True,96207,4.32383,4.29118,4.29118,30470,False,23,0,2016-05-30T22:57:40Z,2016-06-23T17:49:26Z,455.0,455,8640.0,8640
7,321410,Beginner to Pro in Excel Financial Modeling an...,/course/beginner-to-pro-in-excel-financial-mod...,True,127680,4.54034,4.53346,4.53346,28665,False,275,0,2014-10-17T08:39:52Z,2014-11-25T23:00:40Z,455.0,455,8640.0,8640
8,673654,Become a Product Manager Learn the Skills Ge...,/course/become-a-product-manager-learn-the-ski...,True,112572,4.50386,4.5008,4.5008,27408,False,144,0,2015-11-18T19:35:12Z,2016-03-17T17:04:59Z,455.0,455,8640.0,8640
9,1653432,The Business Intelligence Analyst Course 2020,/course/the-business-intelligence-analyst-cour...,True,115269,4.50067,4.49575,4.49575,23906,False,413,0,2018-04-19T07:00:09Z,2018-04-25T18:40:55Z,455.0,455,8640.0,8640


In [None]:
# extracting the stopwords from nltk library
sw = stopwords.words('english')
# displaying the stopwords
np.array(sw)

array(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you',
       "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself',
       'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her',
       'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them',
       'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom',
       'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are',
       'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
       'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and',
       'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at',
       'by', 'for', 'with', 'about', 'against', 'between', 'into',
       'through', 'during', 'before', 'after', 'above', 'below', 'to',
       'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under',
       'again', 'further', 'then', 'once', 'here', 'there', 'when',
       'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'm

In [None]:
print("Number of stopwords: ", len(sw))

Number of stopwords:  179


In [None]:
def remove_stopwords(text):
    '''a function for removing the stopwords'''
    # named remove_stopwords so it does not shadow the nltk `stopwords` import
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with space separator
    return " ".join(text)

In [None]:
df['title'] = df['title'].apply(remove_stopwords)
df.head(10)

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__price_string,price_detail__amount,price_detail__price_string
0,762616,complete sql bootcamp 2020 go zero hero,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,455,8640.0,8640
1,937678,tableau 2020 az handson tableau training data ...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,455,8640.0,8640
2,1361790,pmp exam prep seminar pmbok guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,455,8640.0,8640
3,648826,complete financial analyst course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,455,8640.0,8640
4,637930,entire mba 1 courseaward winning business scho...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,455,8640.0,8640
5,1208634,microsoft power bi complete introduction 2020 ...,/course/powerbi-complete-introduction/,True,124180,4.56228,4.57676,4.57676,38093,False,275,0,2017-05-08T13:03:21Z,2017-05-15T18:48:54Z,455.0,455,8640.0,8640
6,864146,agile crash course agile project management ag...,/course/agile-crash-course/,True,96207,4.32383,4.29118,4.29118,30470,False,23,0,2016-05-30T22:57:40Z,2016-06-23T17:49:26Z,455.0,455,8640.0,8640
7,321410,beginner pro excel financial modeling valuation,/course/beginner-to-pro-in-excel-financial-mod...,True,127680,4.54034,4.53346,4.53346,28665,False,275,0,2014-10-17T08:39:52Z,2014-11-25T23:00:40Z,455.0,455,8640.0,8640
8,673654,become product manager learn skills get job,/course/become-a-product-manager-learn-the-ski...,True,112572,4.50386,4.5008,4.5008,27408,False,144,0,2015-11-18T19:35:12Z,2016-03-17T17:04:59Z,455.0,455,8640.0,8640
9,1653432,business intelligence analyst course 2020,/course/the-business-intelligence-analyst-cour...,True,115269,4.50067,4.49575,4.49575,23906,False,413,0,2018-04-19T07:00:09Z,2018-04-25T18:40:55Z,455.0,455,8640.0,8640


## Tokenizing:

In [None]:
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(df['title'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items()

In [None]:
# lists to store the vocab terms and their values
# note: vocabulary_ maps each term to its column index in the count matrix,
# not to how often the term occurs
vocab = []
count = []
# iterate through each (term, index) pair and append to the designated lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# store the indices in a pandas Series with vocab as index
vocab_bef_stem = pd.Series(count, index=vocab)
# sort the Series by index value
vocab_bef_stem = vocab_bef_stem.sort_values(ascending=False)

# **Vectorization:**

In [None]:
# Create a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=3000, stop_words="english")  # You can adjust the number of features

# Transform the preprocessed titles into TF-IDF vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(df['title'])

# Print the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (22853, 3000)


In [None]:
tfidf_matrix

<22853x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 100802 stored elements in Compressed Sparse Row format>

## **Checking similarities between courses:**

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# **Visuword :**

It is a visual dictionary for looking up words and exploring their origins and their relationships to other terms. It displays each related term as a node, together with its meaning. The user can tap a node to see the definition for that word category, and press and drag individual nodes to make the connections easier to follow.

"enables users to look up words to find their definitions and connections with other terms and concepts."

In [None]:
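# Calculating pairwise similarity between all courses.
# Note: the result is a dense 22853 x 22853 float64 array (about 4.2 GB), so this
# cell needs a lot of RAM; recommend_courses_by_skill below does not depend on it.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)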
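When only the neighbours of one course are needed, the full pairwise matrix can be avoided by comparing a single row against the whole TF-IDF matrix. This is a sketch on made-up titles, not the notebook's own method; the helper name `similar_to` and its parameters are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy titles standing in for df['title']
toy_titles = [
    "complete sql bootcamp",
    "sql bootcamp for beginners",
    "python data course",
    "advanced sql queries",
]
mat = TfidfVectorizer().fit_transform(toy_titles)


def similar_to(index, matrix, top_n=2):
    """Hypothetical helper: indices of the top_n rows most similar to `index`."""
    # One row against all rows gives a vector of length n_rows, not an n x n matrix
    scores = cosine_similarity(matrix[index], matrix).ravel()
    scores[index] = -1.0  # exclude the item itself
    return np.argsort(scores)[::-1][:top_n]


neighbours = similar_to(0, mat)
print([toy_titles[i] for i in neighbours])
```

Memory use then grows with the number of courses rather than with its square, which matters at the scale of this dataset.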

# **Recommendation System:**

# **The Rising Star algorithm:**



The Rising Star algorithm produces interpretable feature-importance rankings, which lets users pick out the features that matter most for classification. Here it is used to prune the data and to rank course material by content. It automatically estimates the number of important features and the model's complexity, which makes it suitable for a wide range of datasets, and it is scalable and computationally efficient enough to handle very large datasets with modest processing resources. The top-ranked document features are then used to recommend courses to students. In the code below, the ranking step is carried out with TF-IDF cosine similarity between the requested skill and each course title.

In [None]:
def recommend_courses_by_skill(skill, num_recommendations=5):
    # Transform the skill into a TF-IDF vector using the same vectorizer
    skill_tfidf = tfidf_vectorizer.transform([skill])

    # Calculate cosine similarity between the skill and all courses
    similarity_scores = cosine_similarity(skill_tfidf, tfidf_matrix)[0]

    # Get indices of the highest-scoring courses, best match first
    recommended_indices = similarity_scores.argsort()[-num_recommendations:][::-1]

    # Keep only courses that share at least one term with the skill; without this,
    # an unknown skill (all scores 0) would return arbitrary courses
    recommended_indices = [i for i in recommended_indices if similarity_scores[i] > 0]

    # Get recommended course titles
    recommended_courses = df.iloc[recommended_indices]['title'].tolist()

    return recommended_courses
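It can help to see the similarity score next to each recommended title, and to see what happens when the skill is not in the vocabulary. The sketch below does both on a small made-up catalogue. `toy_df`, `toy_vec`, `toy_mat`, and `recommend_with_scores` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue standing in for df
toy_df = pd.DataFrame({"title": [
    "api testing postman",
    "learn api webservices testing",
    "python data course",
    "excel financial modeling",
]})
toy_vec = TfidfVectorizer()
toy_mat = toy_vec.fit_transform(toy_df["title"])


def recommend_with_scores(skill, top_n=5):
    """Hypothetical variant: return matching titles with their similarity scores."""
    scores = cosine_similarity(toy_vec.transform([skill]), toy_mat).ravel()
    result = toy_df.assign(score=scores)
    result = result[result["score"] > 0]  # drop courses with no term overlap
    return result.nlargest(top_n, "score")[["title", "score"]]


api_hits = recommend_with_scores("API")
no_hits = recommend_with_scores("kubernetes")  # term not in the toy vocabulary
print(api_hits)
print(no_hits)
```

The vectorizer lowercases input by default, so "API" matches "api" in the titles. The shorter title ranks first because its weight is spread over fewer terms.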

## **Testing:**

In [None]:
# Example usage
user_skill = "API"  # skill the user wants to learn
recommended_courses = recommend_courses_by_skill(user_skill)

if recommended_courses:
    print(f"Recommended courses for skill '{user_skill}':")
    for course in recommended_courses:
        print(course)
else:
    print("No recommendations available for the entered skill.")

Recommended courses for skill 'API':
api testing restsharp specflow c
api blueprint design api specs create docs seconds
learn api webservices testing
api blueprint advanced creating complex api specs docs
api testing api tests automation using postman newman


In [None]:
# The user can enter a skill they want to learn, and the model returns the top 5 matching courses by default.