
## 🔍 Knowledge Base: Recommendation System Overview

### 📌 1. What is a Recommendation System?

A **Recommendation System** suggests relevant items to users based on various data signals. The two most widely used types are **Collaborative Filtering** and **Content-Based Filtering**, covered in the next two sections.

---

### 📘 2. Collaborative Filtering (CF)

**Collaborative Filtering** relies on past user-item interactions (like ratings, views, purchases) to recommend items. It doesn't need item metadata (like course description), only interaction data.

#### Types of Collaborative Filtering:

- **User-Based CF**: Finds similar users and recommends items they liked.
- **Item-Based CF**: Recommends items similar to those the user liked.
- **Matrix Factorization**: Learns latent user and item factors with techniques such as SVD (Singular Value Decomposition) and ALS (Alternating Least Squares); deep learning models extend the same latent-factor idea.
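
As a rough illustration of the latent-factor idea, the sketch below factorizes a small, invented rating matrix with a plain NumPy SVD and keeps two factors. It treats unrated cells (0) as actual ratings, which is a simplification; methods such as ALS usually fit only the observed entries.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
# The values are invented for illustration only.
ratings = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Full SVD, then keep only the k largest singular values (the latent factors).
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # shape (n_users, k)
item_factors = Vt[:k, :]          # shape (k, n_items)

# The low-rank product yields a score for every user-item pair,
# including the pairs that were 0 in the input.
predicted = user_factors @ item_factors
print(np.round(predicted, 2))
```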

#### Pros:
- Captures complex behavior
- Doesn’t need content information

#### Cons:
- Suffers from the cold-start problem (new users or items have no interaction history)
- Needs large interaction datasets

---

### 📗 3. Content-Based Filtering (Used in This Notebook ✅)

This notebook uses **Content-Based Filtering**, which recommends courses based on their **text features** (Course Name, Description, Skills).

#### Steps involved:
1. Preprocess course text (cleaning and lemmatization)
2. Combine all relevant text into a `tags` column
3. Use **TF-IDF Vectorizer** to convert text into numerical vectors
4. Reduce the vectors to 100 dimensions with **Truncated SVD**
5. Use **Cosine Similarity** to compute similarity between courses
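
The vectorize-then-compare part of these steps can be sketched on a tiny, invented catalogue. The strings below only stand in for the `tags` column; they are not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented mini-catalogue standing in for the `tags` column.
tags = [
    "silicon solar cell physics energy",
    "introduction solar cell energy photovoltaic",
    "screenplay writing film drama comedy",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(tags)   # sparse matrix, one row per course
similarity = cosine_similarity(vectors)    # dense (3, 3) matrix

# The two solar courses share terms, so they score higher with each other
# than either does with the screenwriting course.
print(similarity.round(2))
```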

#### Pros:
- Works well with descriptive content
- No need for user interaction data

#### Cons:
- Limited to known features
- Can become overly specialized (recommendations too similar)

---

### ✅ Summary

| Feature | Collaborative Filtering | Content-Based Filtering |
|--------|--------------------------|--------------------------|
| Needs User Interaction Data | ✅ Yes | ❌ No |
| Uses Item Metadata | ❌ No | ✅ Yes |
| Used in this Notebook | ❌ No | ✅ Yes |

---

**This notebook builds a content-based recommendation engine for Coursera courses using TF-IDF and cosine similarity.**


In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import TruncatedSVD
import pickle
import nltk
import re
from nltk.corpus import wordnet

# Download wordnet once (if needed)
try:
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('wordnet')


print('Dependencies Imported')

Dependencies Imported


[nltk_data] Downloading package wordnet to /root/nltk_data...


In [2]:
data = pd.read_csv('/content/Coursera.csv')
data.head()

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,�cole Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you�ll learn how to effectively...,Data Analysis select (sql) database manageme...


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3522 entries, 0 to 3521
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Course Name         3522 non-null   object
 1   University          3522 non-null   object
 2   Difficulty Level    3522 non-null   object
 3   Course Rating       3522 non-null   object
 4   Course URL          3522 non-null   object
 5   Course Description  3522 non-null   object
 6   Skills              3522 non-null   object
dtypes: object(7)
memory usage: 192.7+ KB


In [4]:
data.isnull().sum()

Unnamed: 0,0
Course Name,0
University,0
Difficulty Level,0
Course Rating,0
Course URL,0
Course Description,0
Skills,0


In [5]:
data.nunique()


Unnamed: 0,0
Course Name,3416
University,184
Difficulty Level,5
Course Rating,31
Course URL,3424
Course Description,3397
Skills,3424


In [6]:
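# Drop duplicate rows, then reset the index so row labels match row positions
# (the similarity matrix built later is addressed by position).
data = data.drop_duplicates(subset=['Course Name', 'University', 'Difficulty Level', 'Course Rating',
       'Course URL', 'Course Description']).reset_index(drop=True)
data.shape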

(3424, 7)
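
One detail worth keeping in mind: `drop_duplicates` keeps the original index labels, so labels and row positions can diverge unless the index is reset. That matters whenever a label is later used to address a NumPy array by position. A minimal sketch with an invented frame:

```python
import pandas as pd

# Invented frame with one duplicated row.
df = pd.DataFrame({"Course Name": ["A", "A", "B", "C"]})
deduped = df.drop_duplicates(subset=["Course Name"])

print(deduped.index.tolist())  # labels keep their original values: [0, 2, 3]

# Label of course "C" versus its row position in the de-duplicated frame.
label = deduped[deduped["Course Name"] == "C"].index[0]
position = deduped["Course Name"].tolist().index("C")
print(label, position)  # the two differ

# After reset_index, labels and positions coincide again.
clean = deduped.reset_index(drop=True)
clean_label = clean[clean["Course Name"] == "C"].index[0]
```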

In [7]:
import nltk
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('wordnet2022')

! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet # temp fix for lookup error.


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet2022 to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet2022.zip.


cp: cannot stat '/usr/share/nltk_data/corpora/wordnet2022': No such file or directory


In [8]:
lemmatizer = WordNetLemmatizer()

# Function for text cleaning (removes mis-encoded and non-letter characters, lowercases, and lemmatizes;
# stop words are removed later by TfidfVectorizer, not here)
def clean_for_tags(text):
    text = re.sub(r'��+', '', text)  # This removes "��" or any repeated "��" characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)  # Removes non-ASCII characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove anything that is not a letter or space
    text = text.lower()  # Convert text to lowercase
    text = ' '.join([lemmatizer.lemmatize(word) for word in text.split()])  # Lemmatization
    return text

training_data = data.copy()

# Apply clean_for_tags on columns to be used in tags column
training_data['Course Name'] = training_data['Course Name'].apply(clean_for_tags)
training_data['Course Description'] = training_data['Course Description'].apply(clean_for_tags)
training_data['Skills'] = training_data['Skills'].apply(clean_for_tags)

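# Combine the cleaned 'Course Name', 'Course Description', and 'Skills' into 'tags'.
# 'tags' is stored on `data` so that 'Course Name' keeps its original (uncleaned) text.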
data['tags'] = training_data['Course Name'] + ' ' + training_data['Course Description'] + ' ' + training_data['Skills']

training_data = data[['Course Name', 'tags']]
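
A side effect of deleting non-letter characters outright is that hyphenated words are fused, which is why "Single-Table" shows up as `singletable` in the `tags` preview. The sketch below isolates that regex step (no lemmatization) and compares it with a hypothetical variant that substitutes a space instead. `strip_to_letters` is an illustrative helper, not part of the notebook.

```python
import re

def strip_to_letters(text, replacement=""):
    """Replace every character that is not an ASCII letter or whitespace."""
    text = re.sub(r"[^a-zA-Z\s]", replacement, text)
    # Lowercase and collapse any runs of whitespace.
    return " ".join(text.lower().split())

title = "Retrieve Data using Single-Table SQL Queries"

joined = strip_to_letters(title)                       # same rule as the notebook
separated = strip_to_letters(title, replacement=" ")   # hypothetical variant

print(joined)      # retrieve data using singletable sql queries
print(separated)   # retrieve data using single table sql queries
```

Whether the fused or the separated form is preferable depends on the vocabulary; the separated form lets "table" match other courses that mention tables.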


In [9]:
training_data.head()


Unnamed: 0,Course Name,tags
0,Write A Feature Length Screenplay For Film Or ...,write a feature length screenplay for film or ...
1,Business Strategy: Business Model Canvas Analy...,business strategy business model canvas analys...
2,Silicon Thin Film Solar Cells,silicon thin film solar cell this course consi...
3,Finance for Managers,finance for manager when it come to number the...
4,Retrieve Data using Single-Table SQL Queries,retrieve data using singletable sql query in t...


In [10]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(training_data['tags'])
print("TF-IDF matrix shape:", tfidf_matrix.shape)



TF-IDF matrix shape: (3424, 5000)


In [11]:
n_components = 100 # Reduce to 100 dimensions
svd = TruncatedSVD(n_components=n_components, random_state=42)
tfidf_matrix = svd.fit_transform(tfidf_matrix)

print("Reduced TF-IDF matrix shape:", tfidf_matrix.shape)


Reduced TF-IDF matrix shape: (3424, 100)
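
For intuition about what the reduction does, here is a small sketch on an invented four-document corpus. It projects the sparse TF-IDF rows onto two components and reports how much variance those components retain; the documents and the choice of two components are assumptions made for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: two solar-energy documents and two accounting documents.
docs = [
    "silicon solar cell physics",
    "solar cell energy introduction",
    "financial accounting statements",
    "management accounting finance",
]

sparse_tfidf = TfidfVectorizer().fit_transform(docs)   # sparse, (4, n_terms)
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(sparse_tfidf)              # dense, (4, 2)

print(sparse_tfidf.shape, "->", reduced.shape)
print("variance kept:", svd.explained_variance_ratio_.sum())
```

Checking `explained_variance_ratio_` in the same way on the real matrix is one way to judge whether 100 components is a reasonable choice.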


In [12]:
similarity_matrix = cosine_similarity(tfidf_matrix)
print(similarity_matrix[0][1])



0.02368863791605983


In [13]:
def normalize_rating(rating_str):
    """
    Normalize a course rating from the 0-5 scale to 0-1.
    Returns 0 when the rating cannot be parsed as a number.
    """
    try:
        return float(rating_str) / 5  # Normalize to 0-1
    except (TypeError, ValueError):
        return 0


In [14]:
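def get_recommendations(course_name, data, similarity_matrix, top_n=3, rating_weight=0.05):
    """
    Get top N course recommendations based on similarity to the given course name.
    The queried course itself is included, since it is its own nearest neighbour.
    """
    matches = data[data['Course Name'] == course_name]  # Filter data for selected course
    if matches.empty:
        return []  # Unknown course name
    # Row position of the selected course (labels and positions can differ after drop_duplicates)
    course_idx = data.index.get_loc(matches.index[0])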
    similarity_scores = list(enumerate(similarity_matrix[course_idx]))  # Get similarity scores for all courses

    recommendations = []
    for idx, similarity_score in sorted(similarity_scores, key=lambda x: x[1], reverse=True)[:top_n]:
        course_data = data.iloc[idx]  # Get course data for the current recommendation
        normalized_rating = normalize_rating(course_data.get('Course Rating', '0'))  # Normalize rating

        # Prepare recommendation dictionary with relevant course information
        recommendations.append({
            "course_name": course_data['Course Name'],
            "course_url": course_data.get('Course URL', ''),
            "rating": course_data['Course Rating'],
            "institution": course_data.get('University', 'Unknown'),
            "difficulty_level": course_data.get('Difficulty Level', 'Unknown'),
            "similarity": similarity_score,
            "final_score": similarity_score * (1 - rating_weight) + normalized_rating * rating_weight
        })

    return sorted(recommendations, key=lambda x: x['final_score'], reverse=True)


In [15]:
get_recommendations('Silicon Thin Film Solar Cells', data, similarity_matrix)

[{'course_name': 'Silicon Thin Film Solar Cells',
  'course_url': 'https://www.coursera.org/learn/silicon-thin-film-solar-cells',
  'rating': '4.1',
  'institution': '�cole Polytechnique',
  'difficulty_level': 'Advanced',
  'similarity': np.float64(0.9999999999999997),
  'final_score': np.float64(0.9909999999999997)},
 {'course_name': 'Physics of silicon solar cells',
  'course_url': 'https://www.coursera.org/learn/physics-silicon-solar-cells',
  'rating': '4.4',
  'institution': '�cole Polytechnique',
  'difficulty_level': 'Intermediate',
  'similarity': np.float64(0.9700534637088671),
  'final_score': np.float64(0.9655507905234237)},
 {'course_name': 'Introduction to solar cells',
  'course_url': 'https://www.coursera.org/learn/solar-cells',
  'rating': '4.8',
  'institution': 'Technical University of Denmark (DTU)',
  'difficulty_level': 'Beginner',
  'similarity': np.float64(0.962698497883828),
  'final_score': np.float64(0.9625635729896366)}]
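
The `final_score` values above can be reproduced by hand. The sketch below restates the blending formula from `get_recommendations` with the 0-5 rating scale used in this cell; `blend` is a hypothetical helper name introduced only for this check.

```python
def blend(similarity, rating, rating_weight=0.05, max_rating=5.0):
    """Weighted mix of a similarity score and a rating scaled to 0-1."""
    normalized_rating = float(rating) / max_rating
    return similarity * (1 - rating_weight) + normalized_rating * rating_weight

# Values taken from the first and third results above.
first = blend(0.9999999999999997, "4.1")
third = blend(0.962698497883828, "4.8")

print(round(first, 3))   # 0.991
print(round(third, 4))   # 0.9626
```

With `rating_weight=0.05`, the rating can shift a score by at most 0.05, so it mostly acts as a tie-breaker between courses with near-identical similarity.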


In [17]:

# ratings_df has courses as rows and users as columns; transpose so each row is a user
user_similarity = cosine_similarity(ratings_df.T)
user_sim_df = pd.DataFrame(user_similarity, index=ratings_df.columns, columns=ratings_df.columns)

print("User-User Similarity Matrix:")
user_sim_df


User-User Similarity Matrix:


Unnamed: 0,User1,User2,User3,User4,User5
User1,1.0,0.778792,0.395817,0.0,0.0
User2,0.778792,1.0,0.599002,0.272639,0.0
User3,0.395817,0.599002,1.0,0.476326,0.0
User4,0.0,0.272639,0.476326,1.0,0.780869
User5,0.0,0.0,0.0,0.780869,1.0
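
Each entry of this matrix is the cosine of the angle between two users' rating vectors. As a check, the User1/User2 entry can be recomputed directly from the simulated ratings used for `ratings_df`:

```python
import numpy as np

# Rating vectors for User1 and User2 over the five sample courses
# (the simulated values used for ratings_df in this notebook).
user1 = np.array([5, 3, 0, 0, 2], dtype=float)
user2 = np.array([4, 0, 0, 2, 1], dtype=float)

# cosine = dot product divided by the product of the vector lengths
cosine = user1 @ user2 / (np.linalg.norm(user1) * np.linalg.norm(user2))
print(round(float(cosine), 6))  # 0.778792
```

User1 and User5 have no course in common with a non-zero rating, so their dot product, and therefore their similarity, is 0.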


## Project

In [3]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the Coursera dataset
data = pd.read_csv('Coursera.csv')

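# Preprocess Skills column: lowercase and split on whitespace into a set of tokens
# (multi-word skills such as "deep learning" become the separate tokens "deep" and "learning")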
data['Skills'] = data['Skills'].apply(lambda x: set(x.lower().split()))

# Simulated User-Course Ratings Matrix (for collaborative filtering)
sample_courses = data['Course Name'].head(5).tolist()
ratings_data = {
    'User1': [5, 3, 0, 0, 2],
    'User2': [4, 0, 0, 2, 1],
    'User3': [1, 1, 0, 5, 4],
    'User4': [0, 0, 5, 4, 0],
    'User5': [0, 0, 4, 0, 0]
}
ratings_df = pd.DataFrame(ratings_data, index=sample_courses)

# Collaborative Filtering: User-User Similarity
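def collaborative_filtering(ratings_df, user_id, top_n=2):
    # Cosine similarity between users (the columns of ratings_df)
    user_similarity = cosine_similarity(ratings_df.T)
    user_sim_df = pd.DataFrame(user_similarity, index=ratings_df.columns, columns=ratings_df.columns)
    user_ratings = ratings_df[user_id]
    similar_users = user_sim_df[user_id].drop(user_id)
    relevant_ratings = ratings_df[similar_users.index]
    total_similarity = similar_users.sum()
    if total_similarity == 0:
        # No overlap with any other user, so there is nothing to recommend
        return pd.Series(dtype=float)
    # Similarity-weighted average of the other users' ratings per course
    weighted_scores = relevant_ratings.dot(similar_users) / total_similarity
    # Keep only courses the user has not rated (0 marks "not rated")
    unrated_items = user_ratings[user_ratings == 0].index
    recommendations = weighted_scores.loc[unrated_items].sort_values(ascending=False).head(top_n)
    return recommendations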

# Knowledge-Based Filtering
def knowledge_based_recommend(data, user_skills, top_n=5):
    recommendations = []
    for _, row in data.iterrows():
        course_name = row['Course Name']
        course_skills = row['Skills']
        overlap = user_skills.intersection(course_skills)
        score = len(overlap)
        if score > 0:
            recommendations.append((course_name, score, overlap))
    recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)
    return recommendations[:top_n]

# Hybrid Recommender
def hybrid_recommend(data, ratings_df, user_id, user_skills, top_n=5):
    cf_recommendations = collaborative_filtering(ratings_df, user_id, top_n=3)
    kb_recommendations = knowledge_based_recommend(data, user_skills, top_n=5)
    hybrid_recommendations = []

    for course in cf_recommendations.index:
        score = cf_recommendations[course] * 0.6
        hybrid_recommendations.append((course, round(score, 2), set(['collaborative filtering'])))

    for course, score, overlap in kb_recommendations:
        existing = next((x for x in hybrid_recommendations if x[0] == course), None)
        if existing:
            existing_score = existing[1] + (score * 0.4)
            hybrid_recommendations[hybrid_recommendations.index(existing)] = (
                course, round(existing_score, 2), existing[2].union(overlap)
            )
        else:
            hybrid_recommendations.append((course, round(score * 0.4, 2), overlap))

    hybrid_recommendations = sorted(hybrid_recommendations, key=lambda x: x[1], reverse=True)[:top_n]
    return hybrid_recommendations
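
To make the collaborative part of the hybrid score concrete, the sketch below redoes the calculation for `User1` in plain NumPy, using the simulated ratings from this cell. It mirrors `collaborative_filtering` followed by the 0.6 weighting in `hybrid_recommend`; like the original, it counts a 0 ("not rated") as a rating when averaging.

```python
import numpy as np

# Simulated ratings, one row per user over the five sample courses.
ratings = np.array([
    [5, 3, 0, 0, 2],   # User1 (target)
    [4, 0, 0, 2, 1],   # User2
    [1, 1, 0, 5, 4],   # User3
    [0, 0, 5, 4, 0],   # User4
    [0, 0, 4, 0, 0],   # User5
], dtype=float)

target, neighbours = ratings[0], ratings[1:]

# Cosine similarity of every neighbour to the target user.
sims = neighbours @ target / (np.linalg.norm(neighbours, axis=1) * np.linalg.norm(target))

# Similarity-weighted average of the neighbours' ratings for every course.
weighted = sims @ neighbours / sims.sum()

# Courses the target has not rated, and their contribution to the hybrid score.
unrated = np.flatnonzero(target == 0)
hybrid_part = np.round(weighted[unrated] * 0.6, 2)
print(unrated, hybrid_part)
```

Only User2 and User3 have non-zero similarity to User1, so the fourth sample course gets its score entirely from their ratings (2 and 5), while the third sample course scores 0.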

In [4]:

# Content-Based Filtering: Prepare similarity matrix
data['Content'] = data.apply(lambda x: ' '.join(x['Skills']) + ' ' + str(x['Course Description']), axis=1)
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix = tfidf.fit_transform(data['Content'])
similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Normalize rating function
def normalize_rating(rating):
    try:
        rating = float(rating)
        return (rating - 1) / 4  # Normalize from [1, 5] to [0, 1]
    except (TypeError, ValueError):
        return 0.0

# Content-Based Recommendation Function
def get_recommendations(course_name, data, similarity_matrix, top_n=3, rating_weight=0.05):
    matches = data[data['Course Name'] == course_name]  # Rows matching the requested course
    if matches.empty:
        return []
    course_idx = matches.index[0]
    similarity_scores = list(enumerate(similarity_matrix[course_idx]))

    recommendations = []
    for idx, similarity_score in sorted(similarity_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]:
        course_data = data.iloc[idx]
        normalized_rating = normalize_rating(course_data.get('Course Rating', '0'))
        recommendations.append({
            "course_name": course_data['Course Name'],
            "course_url": course_data.get('Course URL', ''),
            "rating": course_data['Course Rating'],
            "institution": course_data.get('University', 'Unknown'),
            "difficulty_level": course_data.get('Difficulty Level', 'Unknown'),
            "similarity": round(similarity_score, 3),
            "final_score": round(similarity_score * (1 - rating_weight) + normalized_rating * rating_weight, 3)
        })

    return sorted(recommendations, key=lambda x: x['final_score'], reverse=True)


In [4]:
# Example usage
user_id = 'User1'

# Simulated user skills
user_skills = {"python", "deep learning", "statistics"}

# Run hybrid recommender
recommendations = hybrid_recommend(data, ratings_df, user_id, user_skills)

# Display hybrid recommendations
print("🔍 Top Course Recommendations for", user_id, ":\n")
for course, score, skills in recommendations:
    print(f"📘 {course} — ✅ Score: {score} — Matched: {', '.join(skills)}")

# Run content-based recommendations for each hybrid course
print("\n🔍 Content-Based Recommendations for Each Hybrid Course\n")
for course, _, _ in recommendations:
    print(f"For: {course}")
    content_recs = get_recommendations(course, data, similarity_matrix, top_n=3)
    for rec in content_recs:
        print(f"  📘 {rec['course_name']} — Final Score: {rec['final_score']} — Similarity: {rec['similarity']}")
        print(f"     - Institution: {rec['institution']}")
        print(f"     - Difficulty: {rec['difficulty_level']}")
        print(f"     - Rating: {rec['rating']}")
        print(f"     - URL: {rec['course_url']}")
    print()

🔍 Top Course Recommendations for User1 :

📘 Finance for Managers — ✅ Score: 1.81 — Matched: collaborative filtering
📘 Statistical Mechanics: Algorithms and Computations — ✅ Score: 0.8 — Matched: statistics, python
📘 Data Science at Scale - Capstone Project — ✅ Score: 0.8 — Matched: statistics, python
📘 Data Management and Visualization — ✅ Score: 0.8 — Matched: statistics, python
📘 Inferential Statistical Analysis with Python — ✅ Score: 0.8 — Matched: statistics, python

🔍 Content-Based Recommendations for Each Hybrid Course

For: Finance for Managers
  📘 Understanding Financial Statements: Company Position — Final Score: 0.435 — Similarity: 0.409
     - Institution: University of Illinois at Urbana-Champaign
     - Difficulty: Beginner
     - Rating: 4.7
     - URL: https://www.coursera.org/learn/financial-statements
  📘 Fundamentals of financial and management accounting — Final Score: 0.432 — Similarity: 0.406
     - Institution: Politecnico di Milano
     - Difficulty: Beginner
   

In [5]:
import joblib

# Save the TF-IDF vectorizer
joblib.dump(tfidf, 'tfidf_vectorizer.pkl')

# Save the TF-IDF matrix
joblib.dump(tfidf_matrix, 'tfidf_matrix.pkl')

# Save the similarity matrix
joblib.dump(similarity_matrix, 'similarity_matrix.pkl')

# Save the preprocessed data
data.to_pickle('preprocessed_courses.pkl')

# Save the user ratings dataframe
ratings_df.to_pickle('user_ratings.pkl')


In [3]:
import joblib
import pandas as pd
# Load objects
tfidf = joblib.load('tfidf_vectorizer.pkl')
tfidf_matrix = joblib.load('tfidf_matrix.pkl')
similarity_matrix = joblib.load('similarity_matrix.pkl')
data = pd.read_pickle('preprocessed_courses.pkl')
ratings_df = pd.read_pickle('user_ratings.pkl')


In [4]:
user_id = 'User1'
user_skills = {"python", "deep learning", "statistics"}

# Run hybrid recommender
recommendations = hybrid_recommend(data, ratings_df, user_id, user_skills)

print("🔍 Top Course Recommendations for", user_id, ":\n")
for course, score, skills in recommendations:
    print(f"📘 {course} — ✅ Score: {score} — Matched: {', '.join(skills)}")

# Optional: content-based for each
print("\n🔍 Content-Based Recommendations for Each Hybrid Course\n")
for course, _, _ in recommendations:
    print(f"For: {course}")
    content_recs = get_recommendations(course, data, similarity_matrix)
    for rec in content_recs:
        print(f"  📘 {rec['course_name']} — Final Score: {rec['final_score']} — Similarity: {rec['similarity']}")
        print(f"     - Institution: {rec['institution']}")
        print(f"     - Difficulty: {rec['difficulty_level']}")
        print(f"     - Rating: {rec['rating']}")
        print(f"     - URL: {rec['course_url']}")
    print()


NameError: name 'hybrid_recommend' is not defined

In [None]:
# The model artifacts have been saved.

In [6]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

class Recommender:
    def __init__(self, data, ratings_df, tfidf, tfidf_matrix, similarity_matrix):
        self.data = data
        self.ratings_df = ratings_df
        self.tfidf = tfidf
        self.tfidf_matrix = tfidf_matrix
        self.similarity_matrix = similarity_matrix

    def collaborative_filtering(self, user_id, top_n=2):
        # Cosine similarity between users (the columns of ratings_df)
        user_similarity = cosine_similarity(self.ratings_df.T)
        user_sim_df = pd.DataFrame(user_similarity, index=self.ratings_df.columns, columns=self.ratings_df.columns)
        user_ratings = self.ratings_df[user_id]
        similar_users = user_sim_df[user_id].drop(user_id)
        relevant_ratings = self.ratings_df[similar_users.index]
        total_similarity = similar_users.sum()
        if total_similarity == 0:
            # No overlap with any other user, so there is nothing to recommend
            return pd.Series(dtype=float)
        # Similarity-weighted average of the other users' ratings per course
        weighted_scores = relevant_ratings.dot(similar_users) / total_similarity
        unrated_items = user_ratings[user_ratings == 0].index
        recommendations = weighted_scores.loc[unrated_items].sort_values(ascending=False).head(top_n)
        return recommendations

    def knowledge_based_recommend(self, user_skills, top_n=5):
        recommendations = []
        for _, row in self.data.iterrows():
            course_name = row['Course Name']
            course_skills = row['Skills']
            overlap = user_skills.intersection(course_skills)
            score = len(overlap)
            if score > 0:
                recommendations.append((course_name, score, overlap))
        recommendations = sorted(recommendations, key=lambda x: x[1], reverse=True)
        return recommendations[:top_n]

    def hybrid_recommend(self, user_id, user_skills, top_n=5):
        cf_recommendations = self.collaborative_filtering(user_id, top_n=3)
        kb_recommendations = self.knowledge_based_recommend(user_skills, top_n=5)
        hybrid_recommendations = []

        for course in cf_recommendations.index:
            score = cf_recommendations[course] * 0.6
            hybrid_recommendations.append((course, round(score, 2), set(['collaborative filtering'])))

        for course, score, overlap in kb_recommendations:
            existing = next((x for x in hybrid_recommendations if x[0] == course), None)
            if existing:
                existing_score = existing[1] + (score * 0.4)
                hybrid_recommendations[hybrid_recommendations.index(existing)] = (
                    course, round(existing_score, 2), existing[2].union(overlap)
                )
            else:
                hybrid_recommendations.append((course, round(score * 0.4, 2), overlap))

        hybrid_recommendations = sorted(hybrid_recommendations, key=lambda x: x[1], reverse=True)[:top_n]
        return hybrid_recommendations


In [7]:
import joblib

recommender = Recommender(data, ratings_df, tfidf, tfidf_matrix, similarity_matrix)
joblib.dump(recommender, "recommender_model.pkl")


['recommender_model.pkl']

In [2]:
import joblib

model = joblib.load("recommender_model.pkl")

# Use like this
user_id = 'User1'
user_skills = {"python", "deep learning", "statistics"}

recommendations = model.hybrid_recommend(user_id, user_skills)

for course, score, skills in recommendations:
    print(f"{course}: {score} ({', '.join(skills)})")


AttributeError: Can't get attribute 'Recommender' on <module '__main__'>
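
This error is expected when the class was defined inside the notebook: pickle (which joblib builds on) stores only a reference to the class by module and name, not its code, so a fresh session has no `__main__.Recommender` to rebuild the object from. One common way around it is to keep the class in an importable module. The sketch below shows the mechanism with the standard `pickle` module and a stand-in class; the module name `recommender_module`, the simplified class body, and the temporary directory are all assumptions made for illustration.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

# Hypothetical module holding a simplified stand-in for the Recommender class.
MODULE_SOURCE = '''
class Recommender:
    def __init__(self, weights):
        self.weights = weights

    def score(self, values):
        return sum(w * v for w, v in zip(self.weights, values))
'''

# Write the module to a temporary directory and make it importable.
workdir = Path(tempfile.mkdtemp())
(workdir / "recommender_module.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(workdir))

recommender_module = importlib.import_module("recommender_module")
model = recommender_module.Recommender([0.6, 0.4])

# The pickle payload records the import path of the class, not its source.
payload = pickle.dumps(model)
print(b"recommender_module" in payload)

# Loading works in any session where `recommender_module` can be imported.
restored = pickle.loads(payload)
restored_score = restored.score([3.0, 2.0])
print(type(restored).__module__, restored_score)
```

In the notebook's setting, the equivalent step would be to move `Recommender` into a `.py` file next to the notebook and import it before calling `joblib.load`; alternatively, re-running the cell that defines the class before loading also resolves the lookup.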