In [1]:
import numpy as np
import pandas as pd

## Load Coursera Dataset & Basic Exploration

In this step, we load the Coursera dataset from Kaggle and do a preliminary check:
- Preview top rows  
- Check dataset shape  
- Display columns  
- Verify missing values  

This helps us understand the structure before building the recommendation pipeline.

In [2]:
import pandas as pd

# Load dataset (update filename if needed)
df = pd.read_csv("../data/Coursera.csv")

print("Dataset Loaded Successfully!\n")

print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

display(df.head())

print("\n Missing Values:")
print(df.isna().sum())


Dataset Loaded Successfully!

Shape: (3522, 7)

Columns: ['Course Name', 'University', 'Difficulty Level', 'Course Rating', 'Course URL', 'Course Description', 'Skills']


Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course URL,Course Description,Skills
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,https://www.coursera.org/learn/write-a-feature...,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,https://www.coursera.org/learn/canvas-analysis...,"By the end of this guided project, you will be...",Finance business plan persona (user experien...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,https://www.coursera.org/learn/silicon-thin-fi...,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...
3,Finance for Managers,IESE Business School,Intermediate,4.8,https://www.coursera.org/learn/operational-fin...,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,https://www.coursera.org/learn/single-table-sq...,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...



 Missing Values:
Course Name           0
University            0
Difficulty Level      0
Course Rating         0
Course URL            0
Course Description    0
Skills                0
dtype: int64


## Select Relevant Columns and Clean Text

In this step, we:
- Keep only the columns needed for recommendation
- Convert text to lowercase
- Strip leading and trailing whitespace
- Build a combined text field (name, description, skills) for TF-IDF vectorization
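
The cleaning steps above can be tried on a single string. This is a minimal sketch, assuming you also want to collapse internal runs of whitespace, which goes one step further than trimming the ends. `normalize_text` is a hypothetical helper name and the sample string is made up.

```python
import re

def normalize_text(text):
    """Lowercase, trim, and collapse internal whitespace runs (hypothetical helper)."""
    text = str(text).lower().strip()
    # Replace any run of whitespace (spaces, tabs, newlines) with a single space
    return re.sub(r"\s+", " ", text)

sample = "  Drama   Comedy \t Screenwriting  "
print(normalize_text(sample))  # drama comedy screenwriting
```

Extra internal spaces should not change the TF-IDF vectors, since the vectorizer's default tokenizer extracts word tokens, so this mainly affects how the text looks when displayed.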

In [3]:
import pandas as pd

# Select important columns
df_clean = df[[
    "Course Name",
    "University",
    "Difficulty Level",
    "Course Rating",
    "Course Description",
    "Skills"
]].copy()

# Basic text cleaning function
def clean_text(text):
    text = str(text).lower().strip()
    return text

# Apply cleaning
df_clean["clean_description"] = df_clean["Course Description"].apply(clean_text)
df_clean["clean_skills"] = df_clean["Skills"].apply(clean_text)
df_clean["clean_name"] = df_clean["Course Name"].apply(clean_text)

# Combine fields for modeling
df_clean["combined_text"] = (
    df_clean["clean_name"] + " " +
    df_clean["clean_description"] + " " +
    df_clean["clean_skills"]
)

print("Cleaned Dataset:\n")
display(df_clean.head())

print("Shape:", df_clean.shape)

Cleaned Dataset:



Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Course Description,Skills,clean_description,clean_skills,clean_name,combined_text
0,Write A Feature Length Screenplay For Film Or ...,Michigan State University,Beginner,4.8,Write a Full Length Feature Film Script In th...,Drama Comedy peering screenwriting film D...,write a full length feature film script in th...,drama comedy peering screenwriting film d...,write a feature length screenplay for film or ...,write a feature length screenplay for film or ...
1,Business Strategy: Business Model Canvas Analy...,Coursera Project Network,Beginner,4.8,"By the end of this guided project, you will be...",Finance business plan persona (user experien...,"by the end of this guided project, you will be...",finance business plan persona (user experien...,business strategy: business model canvas analy...,business strategy: business model canvas analy...
2,Silicon Thin Film Solar Cells,École Polytechnique,Advanced,4.1,This course consists of a general presentation...,chemistry physics Solar Energy film lambda...,this course consists of a general presentation...,chemistry physics solar energy film lambda...,silicon thin film solar cells,silicon thin film solar cells this course cons...
3,Finance for Managers,IESE Business School,Intermediate,4.8,"When it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...,"when it comes to numbers, there is always more...",accounts receivable dupont analysis analysis...,finance for managers,"finance for managers when it comes to numbers,..."
4,Retrieve Data using Single-Table SQL Queries,Coursera Project Network,Beginner,4.6,In this course you'll learn how to effectively...,Data Analysis select (sql) database manageme...,in this course you'll learn how to effectively...,data analysis select (sql) database manageme...,retrieve data using single-table sql queries,retrieve data using single-table sql queries i...


Shape: (3522, 10)


## TF-IDF Vectorization

In this step, we:
- Convert the combined course text into numerical vectors
- Use TF-IDF to weight keywords and skill terms that distinguish one course from another
- Limit `max_features` to 5000 to cap the vocabulary size, which drops rare terms and reduces computation
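
To see what the `max_features` cap does, here is a small sketch on a made-up three-document corpus (the documents are illustrative stand-ins, not rows from the dataset). With a cap, scikit-learn keeps only the terms with the highest total frequency across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the combined course text (made-up examples)
docs = [
    "machine learning with python",
    "python for data analysis",
    "screenwriting for film and television",
]

# Without a cap, the vocabulary holds every token that is not a stop word
full = TfidfVectorizer(stop_words="english").fit(docs)

# With max_features=3, only the 3 most frequent terms across the corpus are kept
capped = TfidfVectorizer(stop_words="english", max_features=3).fit(docs)

print(len(full.vocabulary_), len(capped.vocabulary_))
print(capped.transform(docs).shape)  # (3, 3)
```

The number of columns in the resulting matrix equals the vocabulary size, which is why the notebook's matrix has 5000 columns.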

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF
tfidf = TfidfVectorizer(
    stop_words="english",
    max_features=5000
)

# Fit-transform the combined text
tfidf_matrix = tfidf.fit_transform(df_clean["combined_text"])

print("TF-IDF matrix shape:", tfidf_matrix.shape)


TF-IDF matrix shape: (3522, 5000)


## Compute Cosine Similarity Matrix

Using the TF-IDF vectors, we compute pairwise cosine similarity between courses.
This matrix will be used to recommend similar courses based on textual features
(course name, description, and skills).
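
As a sanity check on what cosine similarity measures, the sketch below computes it by hand for three made-up vectors and compares the result with scikit-learn. The values are illustrative only.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy vectors (made-up values), one row per "course"
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0
    [0.0, 0.0, 3.0],   # orthogonal to rows 0 and 1
])

# Cosine similarity by hand: scale each row to unit length, then take dot products
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
manual = unit @ unit.T

print(np.round(manual, 3))
print(np.allclose(manual, cosine_similarity(X)))  # True
```

Rows pointing in the same direction score 1 regardless of length, and rows with no shared terms score 0, which is why the measure suits text of varying length.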


In [5]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix)

print("Cosine similarity matrix shape:", cosine_sim.shape)

Cosine similarity matrix shape: (3522, 3522)


## Course Recommendation Function

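This function:
- Takes a course name (or part of one) as input
- Finds the first course whose name contains that text
- Retrieves that course's similarity scores from the cosine similarity matrix
- Sorts all courses by similarity, highest first
- Returns the top recommended courses, excluding the matched course itself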
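
The ranking step can also be expressed with `np.argsort`. The sketch below uses a made-up 4x4 similarity matrix, and `top_n_similar` is a hypothetical helper name, not the notebook's function. It removes the query course by index rather than by position in the sorted list.

```python
import numpy as np

# Toy 4x4 similarity matrix (made-up values); row i holds similarities to course i
sim = np.array([
    [1.0, 0.2, 0.9, 0.4],
    [0.2, 1.0, 0.1, 0.7],
    [0.9, 0.1, 1.0, 0.3],
    [0.4, 0.7, 0.3, 1.0],
])

def top_n_similar(sim_matrix, idx, n):
    """Return indices of the n rows most similar to row idx, excluding idx itself."""
    order = np.argsort(sim_matrix[idx])[::-1]  # indices sorted by descending score
    order = order[order != idx]                # drop the query course explicitly
    return order[:n].tolist()

print(top_n_similar(sim, 0, 2))  # [2, 3]
```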

In [6]:
# Reset index to ensure consistent mapping
df_clean = df_clean.reset_index(drop=True)

def recommend_courses(course_name, num_recommendations=5):
    # Normalize the input the same way clean_name was normalized
    course_name = course_name.lower().strip()
    
    # Find courses whose name contains the query as a literal substring
    # (regex=False so characters such as "(" or "+" are not read as a pattern)
    indices = df_clean[df_clean["clean_name"].str.contains(course_name, regex=False, na=False)].index
    
    if len(indices) == 0:
        return {"error": "Course not found in dataset."}
    
    idx = indices[0]  # choose first match
    
    # Get similarity scores as (row position, score) pairs
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort courses by similarity score (descending)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Exclude the matched course by index, then take the top N.
    # Slicing off the first entry is unreliable: a duplicate listing can tie at 1.0.
    sim_scores = [s for s in sim_scores if s[0] != idx][:num_recommendations]
    
    # Extract recommended course indices
    recommended_indices = [i[0] for i in sim_scores]
    
    # Build the result DataFrame
    recommendations = df_clean.iloc[recommended_indices][[
        "Course Name",
        "University",
        "Difficulty Level",
        "Course Rating",
        "Skills"
    ]]

    return recommendations

In [7]:
test_output = recommend_courses("machine learning", num_recommendations=5)
test_output

Unnamed: 0,Course Name,University,Difficulty Level,Course Rating,Skills
2538,Introduction to Applied Machine Learning,Alberta Machine Intelligence Institute,Intermediate,4.7,Algorithms Machine Learning Algorithms Appli...
2405,Machine Learning With Big Data,University of California San Diego,Beginner,4.6,statistical classification knime data cluste...
2707,Machine Learning for Data Analysis,Wesleyan University,Intermediate,4.2,Algorithms Data Analysis numbers (spreadshee...
3471,Optimizing Machine Learning Performance,Alberta Machine Intelligence Institute,Beginner,4.5,project Deep Learning Strategy mathematical...
3230,Machine Learning for All,University of London,Conversant,4.7,robotics Machine Learning Artificial Neural ...


## Save Model Artifacts

We save:
- The TF-IDF vectorizer
- Cosine similarity matrix
- Cleaned course dataset

These files will be used in the Flask backend to perform real-time recommendations.
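
For reference, the save-and-load round trip that a backend would rely on looks like the sketch below. It uses a made-up stand-in object and a temporary directory so it runs without the real files; the filename `example_artifact.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in artifact (made-up values) so the sketch runs without the real files
artifact = {"vocabulary": ["python", "sql"], "scores": [[1.0, 0.3], [0.3, 1.0]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_artifact.pkl")  # hypothetical filename

    # Save: pickle requires binary write mode
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # Load: a backend would typically do this once at startup
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == artifact)  # True
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.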

In [8]:
import pickle
import os

# Path to save models (adjust if needed)
save_path = "../models"
os.makedirs(save_path, exist_ok=True)

# Save TF-IDF vectorizer
with open(f"{save_path}/course_recommender_tfidf.pkl", "wb") as f:
    pickle.dump(tfidf, f)

# Save similarity matrix
with open(f"{save_path}/course_similarity_matrix.pkl", "wb") as f:
    pickle.dump(cosine_sim, f)

# Save cleaned dataset
df_clean.to_pickle(f"{save_path}/course_recommender_dataset.pkl")

print("Model artifacts saved successfully in ../models/")

Model artifacts saved successfully in ../models/
