# **Problem Definition and Objective**
**Project Track:** This project is a Content-Based Recommendation System that leverages Natural Language Processing (NLP) and TF-IDF vectorization to measure semantic similarity between learning materials.

**Problem Statement:** With thousands of options on e-learning platforms, learners struggle to find courses that align with their specific goals. Standard keyword searches often miss semantic context, leading to discovery fatigue. This project addresses this by analyzing metadata—including titles, descriptions, skills, difficulty levels, and ratings—to provide intelligent, context-aware suggestions.

**Motivation & Real-World Relevance:** Personalization is essential for modern digital education. By simulating the recommendation engines of top-tier platforms, this system:
1. Optimizes Discovery: Reduces the time learners spend searching for relevant content.
2. Enhances Engagement: Tailors learning pathways to individual skill levels and interests.
3. Drives Outcomes: Increases course completion rates by matching users with content that truly fits their needs.

**1. Install Dependencies**

In [92]:
!pip install nltk




**2. Import Libraries**

In [93]:
import pandas as pd
import numpy as np
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("stopwords")
nltk.download("wordnet")

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## **Dataset Source** - Kaggle

**3. Load Dataset**

In [94]:
from google.colab import files
uploaded = files.upload()

df = pd.read_csv("coursera_courses.csv")

df.head()

Saving coursera_courses.csv to coursera_courses.csv


Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,�cole Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...


In [95]:
df.columns = [col.lower().replace(" ", "_") for col in df.columns]

In [96]:
print("Updated Columns:", df.columns.tolist())

Updated Columns: ['course_name', 'university', 'difficulty_level', 'course_rating', 'course_url', 'course_description', 'skills']


In [97]:
df.shape

(3522, 7)

**4. Basic Cleaning**

In [98]:
# 1. Select only the columns you need (using the new lowercase names)
possible_cols = ["course_name", "course_description", "skills", "difficulty_level", "course_rating"]
df = df[possible_cols]

# 2. Fill missing values with an empty string to avoid errors during text processing
df = df.fillna("")

# 3. Remove duplicate courses based on their main content
df = df.drop_duplicates(subset=["course_name", "course_description", "skills"])
df = df.reset_index(drop=True) # Add this line to reset the index

# 4. Shorten the description to the first 150 words to keep the data manageable
df["course_description"] = df["course_description"].apply(lambda x: " ".join(str(x).split()[:150]))

# 5. Display the cleaned result
df.head()

Unnamed: 0,course_name,course_description,skills,difficulty_level,course_rating
0,Write A Feature Length Screenplay For Film Or ...,Write a Full Length Feature Film Script In thi...,Drama Comedy peering screenwriting film D...,Beginner,4.8
1,Business Strategy: Business Model Canvas Analy...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...,Beginner,4.8
2,Silicon Thin Film Solar Cells,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...,Advanced,4.1
3,Finance for Managers,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...,Intermediate,4.8
4,Retrieve Data using Single-Table SQL Queries,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...,Beginner,4.6


In [99]:
df.shape

(3424, 5)

**5. Preprocessing Function**

In [100]:
stop_words = set(stopwords.words("english")) # Initialize a set of English stop words for efficient lookup
lemm = WordNetLemmatizer() # Initialize the WordNet Lemmatizer to reduce words to their base form

def preprocess(text):
    text = str(text).lower() # Convert the input text to a string and then to lowercase
    text = re.sub(r"[^a-zA-Z ]", " ", text) # Remove any characters that are not letters or spaces, replacing them with a space
    words = text.split() # Split the text into individual words
    words = [lemm.lemmatize(w) for w in words if w not in stop_words] # Lemmatize each word and remove stop words
    return " ".join(words) # Join the processed words back into a single string

**6. Combine Text Features**

In [101]:
df["combined_text"] = (
    df["course_name"] + " " + # Combine course name
    df["course_description"] + " " + # Combine course description
    df["skills"] + " " + # Combine skills
    df["difficulty_level"] # Combine difficulty level
).apply(preprocess) # Apply the preprocessing function to the combined text

**7. TF-IDF + Similarity Matrix**

In [102]:
# 1. Text Vectorization and Similarity
vectorizer = TfidfVectorizer() # Initialize a TF-IDF Vectorizer to convert text into numerical feature vectors
tfidf_matrix = vectorizer.fit_transform(df["combined_text"]) # Fit the vectorizer to the combined text and transform it into a TF-IDF matrix
base_similarity = cosine_similarity(tfidf_matrix) # Compute the cosine similarity between all course TF-IDF vectors

# 2. Fix Rating Normalization
# errors='coerce' turns "Not Calibrated" or any text into NaN
df["rating_norm"] = pd.to_numeric(df["course_rating"], errors='coerce') # Convert 'course_rating' to numeric, coercing errors to NaN

# Fill those NaN values with 0 so the math works
df["rating_norm"] = df["rating_norm"].fillna(0) # Fill any NaN values in 'rating_norm' with 0

# Normalize the ratings between 0 and 1
max_rating = df["rating_norm"].max() # Find the maximum rating for normalization
if max_rating > 0: # Check if max_rating is greater than 0 to avoid division by zero
    df["rating_norm"] = df["rating_norm"] / max_rating # Normalize ratings to a 0-1 scale

**8. Hybrid Scoring (Boosting)**

In [103]:
def final_scores(idx):
    difficulty_match = (df["difficulty_level"] == df.loc[idx, "difficulty_level"]).astype(int) # Create a boolean array indicating if a course's difficulty matches the input course's difficulty, then convert to int (0 or 1)
    popularity_boost = 0.05 * df["rating_norm"] # Calculate a popularity boost based on normalized course ratings
    difficulty_boost = 0.05 * difficulty_match # Calculate a difficulty boost based on difficulty level matching
    return base_similarity[idx] + popularity_boost + difficulty_boost # Return the final scores by adding similarity, popularity boost, and difficulty boost

**9. Recommendation Function (By Index)**

In [104]:
def recommend(course_index, top_k=10):
    if course_index < 0 or course_index >= len(df): # Check if the provided course_index is valid
        print("Invalid Index") # Print an error message if the index is invalid
        return pd.DataFrame() # Return an empty DataFrame

    scores = final_scores(course_index) # Calculate the final scores for all courses relative to the given course_index
    sorted_idx = np.argsort(scores)[::-1] # Get the indices that would sort the scores in descending order

    recommendations = [] # Initialize an empty list to store recommendation dictionaries
    for i in sorted_idx: # Iterate through the sorted indices
        if i == course_index: # Skip the course itself from the recommendations
            continue
        recommendations.append({ # Append a dictionary of recommended course details
            "Course Name": df.loc[i, "course_name"], # Add the course name
            "Difficulty": df.loc[i, "difficulty_level"], # Add the difficulty level
            "Rating": df.loc[i, "rating_norm"], # Add the normalized course rating
            "Score": round(scores[i], 3) # Add the rounded recommendation score
        })
        if len(recommendations) == top_k: # Break the loop once top_k recommendations are collected
            break

    return pd.DataFrame(recommendations) # Return the recommendations as a Pandas DataFrame

**10. Search Course by Title**

In [105]:
def search_course(keyword):
    return df[df["course_name"].str.contains(keyword, case=False, na=False)] # Search for courses where the course name contains the keyword (case-insensitive)

**11. Query-Based Recommendations**

In [106]:
def recommend_by_query(query, top_k=10):
    processed = preprocess(query) # Preprocess the input query string
    query_vec = vectorizer.transform([processed]) # Transform the processed query into a TF-IDF vector
    scores = cosine_similarity(query_vec, tfidf_matrix)[0] # Calculate cosine similarity between the query vector and all course TF-IDF vectors

    if max(scores) < 0.05: # If the maximum similarity score is very low, suggest no strong match
        print("No strong match found. Showing popular fallback.") # Inform the user about the fallback
        return df.sort_values(by="rating_norm", ascending=False).head(top_k) # Return top_k courses by normalized rating as a fallback

    sorted_idx = np.argsort(scores)[::-1] # Get the indices that would sort the scores in descending order

    results = [] # Initialize an empty list to store results
    for i in sorted_idx[:top_k]: # Iterate through the top_k sorted indices
        results.append({ # Append a dictionary of recommended course details
            "Course Name": df.loc[i, "course_name"], # Add the course name
            "Difficulty": df.loc[i, "difficulty_level"], # Add the difficulty level
            "Rating": df.loc[i, "rating_norm"], # Add the normalized course rating
            "Score": round(scores[i], 3) # Add the rounded recommendation score
        })

    return pd.DataFrame(results) # Return the results as a Pandas DataFrame

**12. Random Recommendation Test**

In [107]:
import random # Import the random module for generating random numbers

def test_random(top_k=5):
    idx = random.randint(0, len(df)-1) # Generate a random index within the range of the DataFrame length
    print("Random Course:", df.loc[idx, "course_name"]) # Print the name of the randomly selected course
    return recommend(idx, top_k) # Call the recommend function with the random index and top_k

**13. Evaluation Metrics**

In [108]:
def coverage():
    covered = 0 # Initialize a counter for covered items
    for i in range(len(df)): # Iterate through all courses in the DataFrame
        if len(recommend(i, 5)) > 0: # If the recommendation function returns any recommendations for the course
            covered += 1 # Increment the covered counter
    return covered / len(df) # Return the coverage as a ratio of covered items to total items

def average_similarity_k(k=5):
    scores = [] # Initialize an empty list to store similarity scores
    for i in range(len(df)): # Iterate through all courses
        recs = recommend(i, k) # Get recommendations for the current course
        if len(recs) > 0: # If recommendations exist
            scores.extend(recs["Score"].tolist()) # Extend the scores list with the 'Score' values from recommendations
    return np.mean(scores) # Return the mean of all collected similarity scores

def diversity(idx, k=5):
    recs = recommend(idx, k) # Get recommendations for a given course index
    if len(recs) < 2: # If fewer than 2 recommendations, diversity cannot be calculated
        return 0 # Return 0 diversity
    indices = recs.index.tolist() # Get the original indices of the recommended courses
    sim_matrix = base_similarity[np.ix_(indices, indices)] # Create a similarity sub-matrix for the recommended courses
    return 1 - np.mean(sim_matrix) # Calculate diversity as 1 minus the mean similarity within the recommendations

def global_diversity(k=5):
    return np.mean([diversity(i, k) for i in range(len(df))]) # Calculate the average diversity across all courses

def novelty(k=5):
    pop_list = [] # Initialize an empty list to store popularity scores
    for i in range(len(df)): # Iterate through all courses
        recs = recommend(i, k) # Get recommendations for the current course
        pop_list.extend(recs["Rating"].tolist()) # Extend the popularity list with 'Rating' values from recommendations
    return 1 - np.mean(pop_list) # Calculate novelty as 1 minus the mean popularity of recommended items

def precision_at_k(k=5):
    results = [] # Initialize an empty list to store precision results
    for idx in range(len(df)): # Iterate through all courses
        target_skills = set(df.loc[idx, "skills"].lower().split()) # Get the skills of the target course, converted to a set of lowercase words
        recs = recommend(idx, k) # Get recommendations for the current course
        match = 0 # Initialize a counter for skill matches
        for _, row in recs.iterrows(): # Iterate through each recommended course
            rec_skills = set(str(row["Course Name"]).lower().split()) # Get the skills from the recommended course, converted to a set of lowercase words
            if target_skills.intersection(rec_skills): # Check if there is any intersection between target skills and recommended course skills
                match += 1 # Increment match counter if there's an intersection
        results.append(match / k) # Append the precision for the current course (matches / k)
    return np.mean(results) # Return the average precision at k across all courses

In [109]:
print("\n Evaluation Metrics")
print("Coverage:", round(coverage(), 3))
print("Average Similarity:", round(average_similarity_k(), 3))
print("Global Diversity:", round(global_diversity(), 3))
print("Novelty Score:", round(novelty(), 3))
print("Precision@5:", round(precision_at_k(), 3))



 Evaluation Metrics
Coverage: 1.0
Average Similarity: 0.475
Global Diversity: 0.774
Novelty Score: 0.095
Precision@5: 0.736


# **Sample Outputs**

In [111]:
print("Search Result:")
print(search_course("python").head())

Search Result:
                                           course_name  \
16                       Python Programming Essentials   
61            Python Tricks and Hacks for Productivity   
114                       Exception Handling in Python   
174  Using Python to Interact with the Operating Sy...   
192  AWS Elastic Beanstalk:Deploy a Python(Flask) W...   

                                    course_description  \
16   This course will introduce you to the wonderfu...   
61   By the end of this project, you are going to b...   
114  In this 1-hour long project-based course, you ...   
174  By the end of this course, you�ll be able to m...   
192  In this 1-hour long project-based course, you ...   

                                                skills difficulty_level  \
16   semantics  Python Programming  coding conventi...         Beginner   
61   Computer Programming  Transpose  concision  Py...         Advanced   
114  Python Programming  exception handling  relati...        

In [112]:
print("\nRecommend by Index:")
print(recommend(10, 5))


Recommend by Index:
                                         Course Name Difficulty  Rating  Score
0  Agile Projects: Defining Epics and Mapping Val...   Beginner    1.00  0.876
1  Agile Projects: Develop Product Wireframe Prot...   Beginner    0.92  0.803
2  Agile Projects: Creating User Stories with Val...   Beginner    0.98  0.792
3  Business Strategy: Business Model Canvas Analy...   Beginner    0.96  0.522
4  Product Development: Customer Persona Developm...   Beginner    0.98  0.511


In [113]:
print("\nRecommend by Query:")
print(recommend_by_query("beginner data science"))




Recommend by Query:
                                Course Name    Difficulty  Rating  Score
0            Introduction to Data Analytics      Advanced    0.94  0.496
1                     What is Data Science?    Conversant    0.92  0.463
2                  Data Science Methodology      Beginner    0.90  0.441
3             Applied Data Science Capstone      Beginner    0.92  0.421
4      Data Science for Business Innovation      Advanced    0.84  0.421
5         Data Management and Visualization      Advanced    0.86  0.417
6     Introduction to Clinical Data Science      Advanced    0.92  0.412
7      Research Data Management and Sharing  Intermediate    0.94  0.408
8            Data Visualization with Python      Beginner    0.88  0.402
9  Big Data Modeling and Management Systems  Intermediate    0.88  0.399


In [114]:
print("\nRandom Test:")
print(test_random())


Random Test:
Random Course: Create an FPS Weapon in Unity (Part 3 -Damage Effects)
                                         Course Name Difficulty  Rating  Score
0  Create an FPS Weapon in Unity (Part 2 - Firing...   Advanced    0.00  0.857
1  Create an FPS Weapon in Unity (Part 1 - Revolver)   Advanced    1.00  0.833
2  Make Your Pick-Ups Look Cool in Unity (Intro t...   Beginner    0.88  0.391
3  Create UI in Unity Part 1 - Screen Overlay Canvas   Advanced    0.96  0.345
4  Create Animation Transitions in Unity (Intro t...   Advanced    0.94  0.340


# **Performance Analysis & Limitations**
***Strengths***
* Works without user history
* Explainable recommendations
* Efficient for medium-sized datasets

***Limitations***

Does not adapt to individual user behavior

* Does not adapt to individual user behavior
* Relies on quality of textual metadata
* Cannot capture evolving user preferences







# **Ethical Considerations & Responsible AI**
This system uses only course metadata and does not rely on personal user data. Recommendations may favor courses with richer descriptions, and dataset limitations such as subjective ratings or inconsistent difficulty labels can affect results. The system is designed for educational use, with explainable recommendations to promote transparency and responsible AI usage.

# **Conclusion & Future Scope**
***Conclusion***
* Developed a content-based recommendation system for online learning courses
* Successfully generated relevant and explainable course recommendations
* Enabled filtering based on rating and difficulty level

***Future Scope***
* Incorporate user interaction data for personalized recommendations
* Extend the system using deep learning–based embeddings
* Deploy as a real-time recommendation service with a web interface