This project builds a course recommendation system to help learners find relevant online courses.

It uses content-based filtering to suggest similar courses and collaborative filtering to recommend top-rated ones.

The system also evaluates prediction accuracy using RMSE and MAE to gauge how reliable the predicted ratings are.


In [6]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity


We’re importing tools to handle data (pandas, numpy), scale numbers (MinMaxScaler), and measure similarity between courses (cosine_similarity).


In [18]:

# Load dataset
df = pd.read_excel("online_course_recommendation_v2.xlsx")
df

Unnamed: 0,user_id,course_id,course_name,instructor,course_duration_hours,certification_offered,difficulty_level,rating,enrollment_numbers,course_price,feedback_score,study_material_available,time_spent_hours,previous_courses_taken
0,15796,9366,Python for Beginners,Emma Harris,39.1,Yes,Beginner,5.0,21600,317.50,0.797,Yes,17.60,4
1,861,1928,Cybersecurity for Professionals,Alexander Young,36.3,Yes,Beginner,4.3,15379,40.99,0.770,Yes,28.97,9
2,38159,9541,DevOps and Continuous Deployment,Dr. Mia Walker,13.4,Yes,Beginner,3.9,6431,380.81,0.772,Yes,52.44,4
3,44733,3708,Project Management Fundamentals,Benjamin Lewis,58.3,Yes,Beginner,3.1,48245,342.80,0.969,No,22.29,6
4,11285,3361,Ethical Hacking Masterclass,Daniel White,30.8,Yes,Beginner,2.8,34556,381.01,0.555,Yes,22.01,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,10647,5466,Graphic Design with Canva,Emma Harris,17.3,Yes,Beginner,3.9,49101,485.34,0.838,Yes,16.35,3
99996,13800,2623,Stock Market and Trading Strategies,Dr. John Smith,68.4,Yes,Beginner,3.5,35107,162.80,0.922,Yes,41.94,7
99997,47131,1556,Networking and System Administration,Dr. John Smith,73.8,Yes,Beginner,4.3,12146,24.02,0.990,Yes,15.87,5
99998,49654,6001,Graphic Design with Canva,Daniel White,30.3,Yes,Beginner,3.5,9933,402.24,0.630,Yes,21.05,4


This reads the Excel file into a DataFrame called df, which holds all course and user information.


In [19]:

# Drop unused columns to reduce memory
df = df[['user_id', 'course_id', 'course_name', 'instructor', 'difficulty_level',
         'course_duration_hours', 'certification_offered', 'study_material_available',
         'course_price', 'feedback_score', 'rating']]
df


Unnamed: 0,user_id,course_id,course_name,instructor,difficulty_level,course_duration_hours,certification_offered,study_material_available,course_price,feedback_score,rating
0,15796,9366,Python for Beginners,Emma Harris,Beginner,39.1,Yes,Yes,317.50,0.797,5.0
1,861,1928,Cybersecurity for Professionals,Alexander Young,Beginner,36.3,Yes,Yes,40.99,0.770,4.3
2,38159,9541,DevOps and Continuous Deployment,Dr. Mia Walker,Beginner,13.4,Yes,Yes,380.81,0.772,3.9
3,44733,3708,Project Management Fundamentals,Benjamin Lewis,Beginner,58.3,Yes,No,342.80,0.969,3.1
4,11285,3361,Ethical Hacking Masterclass,Daniel White,Beginner,30.8,Yes,Yes,381.01,0.555,2.8
...,...,...,...,...,...,...,...,...,...,...,...
99995,10647,5466,Graphic Design with Canva,Emma Harris,Beginner,17.3,Yes,Yes,485.34,0.838,3.9
99996,13800,2623,Stock Market and Trading Strategies,Dr. John Smith,Beginner,68.4,Yes,Yes,162.80,0.922,3.5
99997,47131,1556,Networking and System Administration,Dr. John Smith,Beginner,73.8,Yes,Yes,24.02,0.990,4.3
99998,49654,6001,Graphic Design with Canva,Daniel White,Beginner,30.3,Yes,Yes,402.24,0.630,3.5


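This keeps only the columns needed for the recommendations, which reduces memory usage and simplifies processing. Because df is now a selection taken from the original frame, the column assignments in the following cells trigger pandas' SettingWithCopyWarning.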
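
The column selection above produces a subset of the original frame, and assigning to that subset's columns is what makes pandas emit SettingWithCopyWarning in later cells. One common way to avoid it is to take an explicit copy at selection time. The sketch below is only an illustration on a toy frame; the names raw and subset and the values are assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative values only).
raw = pd.DataFrame({
    "course_id": [1, 2, 3],
    "certification_offered": ["Yes", "No", "Yes"],
    "time_spent_hours": [17.6, 28.97, 52.44],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments modify it directly instead of a possible view of `raw`.
subset = raw[["course_id", "certification_offered"]].copy()
subset["certification_offered"] = subset["certification_offered"].map({"Yes": 1, "No": 0})

print(subset)
```

The original frame raw is left untouched, which also makes it safe to re-derive the subset later.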


In [20]:
# Encode binary and categorical features
df['certification_offered'] = df['certification_offered'].map({'Yes': 1, 'No': 0})
df['certification_offered']



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['certification_offered'] = df['certification_offered'].map({'Yes': 1, 'No': 0})


Unnamed: 0,certification_offered
0,1
1,1
2,1
3,1
4,1
...,...
99995,1
99996,1
99997,1
99998,1


In [21]:
df['difficulty_level'] = df['difficulty_level'].map({'Beginner': 1, 'Intermediate': 2, 'Advanced': 3})
df['difficulty_level']


Unnamed: 0,difficulty_level
0,1
1,1
2,1
3,1
4,1
...,...
99995,1
99996,1
99997,1
99998,1


In [23]:
df['study_material_available'] = df['study_material_available'].map({'Yes': 1, 'No': 0})
df['study_material_available']


Unnamed: 0,study_material_available
0,1
1,1
2,1
3,0
4,1
...,...
99995,1
99996,1
99997,1
99998,1


We convert labels such as "Yes"/"No" and "Beginner"/"Intermediate"/"Advanced" into numbers so that they can be used in numeric similarity calculations.
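
One thing to watch with this kind of encoding: Series.map returns NaN for any value that is missing from the dictionary, which also happens if an encoding cell is run a second time on already-encoded values. A small check along the following lines can surface that early. The label "Expert" and the variable names are hypothetical and exist only to show the failure case.

```python
import pandas as pd

# "Expert" is a made-up label that the mapping does not cover.
levels = pd.Series(["Beginner", "Advanced", "Intermediate", "Expert"])
difficulty_map = {"Beginner": 1, "Intermediate": 2, "Advanced": 3}

encoded = levels.map(difficulty_map)

# Labels that could not be encoded show up as NaN in `encoded`.
unmapped = levels[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("Unmapped labels:", unmapped)
```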

In [24]:

# Normalize numerical features
scaler = MinMaxScaler()
df[['course_duration_hours', 'course_price', 'feedback_score']] = scaler.fit_transform(
    df[['course_duration_hours', 'course_price', 'feedback_score']]
)



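We scale course duration, price, and feedback score to the range 0 to 1 so that columns with large raw values, such as price, do not dominate the similarity calculation. Note that difficulty_level keeps its 1 to 3 encoding, so it can still weigh more heavily than the scaled columns.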
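
With its default feature_range of (0, 1), MinMaxScaler applies x' = (x - min) / (max - min) to each column independently. The sketch below works through that formula by hand with NumPy on four course_price values taken from the rows printed earlier. The minimum and maximum here are those of this small sample, not of the full dataset, so the scaled numbers will differ from the notebook's.

```python
import numpy as np

# Four course_price values copied from the printed rows (a sample only).
prices = np.array([24.02, 40.99, 162.80, 485.34])

# Min-max scaling: the smallest value maps to 0, the largest to 1.
scaled = (prices - prices.min()) / (prices.max() - prices.min())

print(scaled.round(3))
```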

In [25]:
# Content-based: course similarity matrix
course_features = df.drop_duplicates('course_id')[[
    'course_id', 'difficulty_level', 'course_duration_hours',
    'certification_offered', 'study_material_available',
    'course_price', 'feedback_score'
]].set_index('course_id')
course_features

course_id,difficulty_level,course_duration_hours,certification_offered,study_material_available,course_price,feedback_score
9366,1,0.358947,1,1,0.619792,0.776432
1928,1,0.329474,1,1,0.043729,0.746696
9541,1,0.088421,1,1,0.751688,0.748899
3708,1,0.561053,1,0,0.672500,0.965859
3361,1,0.271579,1,1,0.752104,0.509912
...,...,...,...,...,...,...
1570,2,0.238947,1,1,0.871146,0.906388
1629,1,0.545263,1,1,0.497542,0.806167
4445,3,0.489474,0,1,0.914604,0.595815
4004,2,0.892632,1,1,0.014000,0.857930


We build a table with one row of features per course, indexed by course_id (the first row found for each course is used). This lets us compare courses to each other.

In [26]:
course_sim = cosine_similarity(course_features)
course_sim_df = pd.DataFrame(course_sim, index=course_features.index, columns=course_features.index)
course_sim_df

course_id,9366,1928,9541,3708,3361,8076,7887,2876,1578,4298,...,7560,7580,3014,964,90,1570,1629,4445,4004,5636
9366,1.000000,0.958730,0.988930,0.863103,0.988362,0.842756,0.996928,0.626996,0.967013,0.901127,...,0.820706,0.967786,0.715655,0.829823,0.749937,0.950554,0.993963,0.758861,0.897098,0.828707
1928,0.958730,1.000000,0.929967,0.796828,0.926272,0.845556,0.938041,0.603032,0.983419,0.926730,...,0.762209,0.963018,0.749220,0.778312,0.742832,0.905672,0.969614,0.717520,0.938890,0.852006
9541,0.988930,0.929967,1.000000,0.838306,0.989135,0.819168,0.986369,0.545382,0.922211,0.878983,...,0.831252,0.919849,0.709541,0.776376,0.725694,0.953001,0.966818,0.753119,0.847981,0.794407
3708,0.863103,0.796828,0.838306,1.000000,0.829479,0.778594,0.878981,0.770575,0.839393,0.784036,...,0.709917,0.855823,0.629202,0.714208,0.838413,0.864462,0.867900,0.686590,0.821855,0.768167
3361,0.988362,0.926272,0.989135,0.829479,1.000000,0.847608,0.983843,0.573701,0.932849,0.897834,...,0.816911,0.937407,0.695004,0.793805,0.743251,0.949604,0.972544,0.767570,0.865419,0.824811
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1570,0.950554,0.905672,0.953001,0.864462,0.949604,0.941275,0.945260,0.699296,0.901331,0.954787,...,0.930707,0.898759,0.867567,0.815818,0.887326,1.000000,0.935380,0.904752,0.923271,0.921347
1629,0.993963,0.969614,0.966818,0.867900,0.972544,0.848237,0.989891,0.675670,0.987605,0.906271,...,0.801204,0.989399,0.713391,0.855073,0.757087,0.935380,1.000000,0.751269,0.922437,0.842854
4445,0.758861,0.717520,0.753119,0.686590,0.767570,0.945822,0.753883,0.776991,0.726576,0.893037,...,0.966710,0.730860,0.949876,0.816719,0.901216,0.904752,0.751269,1.000000,0.857893,0.929970
4004,0.897098,0.938890,0.847981,0.821855,0.865419,0.953953,0.878712,0.801983,0.948223,0.975089,...,0.834442,0.938401,0.866480,0.839796,0.896098,0.923271,0.922437,0.857893,1.000000,0.967117


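We compute the cosine similarity between every pair of courses. The result is a square matrix of scores indexed by course_id on both axes, with 1.0 on the diagonal because each course is identical to itself.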
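
To make a single score concrete, the sketch below recomputes one entry by hand with NumPy, using the feature rows for courses 9366 and 1928 as printed in the course_features table. The helper name cosine is hypothetical. The result should land close to the 0.958730 shown in the matrix, up to rounding of the printed features.

```python
import numpy as np

# Feature rows as printed above: difficulty, duration, certification,
# study material, price, feedback score.
a = np.array([1, 0.358947, 1, 1, 0.619792, 0.776432])  # course 9366
b = np.array([1, 0.329474, 1, 1, 0.043729, 0.746696])  # course 1928

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(a, b)
print(round(sim_ab, 4))
```

Because every feature is non-negative, all scores fall between 0 and 1, and the three binary or ordinal columns push most pairs toward the high end of that range.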

In [27]:
# Collaborative: average ratings per course
course_avg = df.groupby('course_id')['rating'].mean()
course_avg

course_id,rating
1,3.972727
2,4.125000
3,3.950000
4,3.722222
5,3.866667
...,...
9995,3.962500
9996,3.700000
9997,3.838462
9998,4.166667


We compute the average rating for each course across all users. This lets us recommend the top-rated courses.
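
A plain mean treats a course with a single 5.0 rating the same as one with many ratings. One possible refinement, shown here purely as a sketch on toy data, is to compute the rating count alongside the mean and rank only courses above a minimum count. The min_ratings threshold and the variable names are assumptions and are not used in the notebook.

```python
import pandas as pd

# Toy ratings table (illustrative values only).
ratings = pd.DataFrame({
    "course_id": [1, 1, 2, 2, 2, 3],
    "rating":    [4.0, 5.0, 3.0, 4.0, 5.0, 5.0],
})

# Mean and number of ratings per course.
stats = ratings.groupby("course_id")["rating"].agg(["mean", "count"])

# Hypothetical threshold: ignore courses with fewer than 2 ratings.
min_ratings = 2
reliable = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)

print(reliable)
```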

In [29]:
# Content-based recommendation
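def recommend_similar_courses(course_id, top_n=5):
    if course_id not in course_sim_df.index:
        return "Course not found."
    # Sort by similarity; position 0 is normally the course itself (score 1.0), so skip it.
    similar = course_sim_df[course_id].sort_values(ascending=False).iloc[1:top_n+1]
    # This filters the full table, so every enrollment row of a matching course is returned.
    return df[df['course_id'].isin(similar.index)][['course_name', 'instructor', 'difficulty_level']]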


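This function finds the courses most similar to a given course, based on features such as difficulty and price. Because the final lookup filters the full table, the output has one row per matching enrollment rather than one row per course, and the rows are not ordered by similarity.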
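
If one row per course, ordered by similarity, is preferred, a variant along the following lines could be used. This is a sketch on toy data, not the notebook's function: the names sim, enrollments, and similar_courses_unique and all values are hypothetical.

```python
import pandas as pd

# Toy similarity matrix and enrollment table (illustrative values only).
sim = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.5],
     [0.2, 0.5, 1.0]],
    index=[101, 102, 103], columns=[101, 102, 103],
)
enrollments = pd.DataFrame({
    "course_id":   [101, 102, 102, 103],
    "course_name": ["Python", "Swift", "Swift", "Canva"],
    "instructor":  ["A", "B", "B", "C"],
})

def similar_courses_unique(course_id, top_n=5):
    """Return one row per similar course, most similar first."""
    if course_id not in sim.index:
        return pd.DataFrame(columns=["course_name", "instructor", "similarity"])
    # Drop the course itself by label rather than by position, then rank.
    scores = sim[course_id].drop(course_id).sort_values(ascending=False).head(top_n)
    # One metadata row per course, indexed by course_id.
    catalog = enrollments.drop_duplicates("course_id").set_index("course_id")
    result = catalog.loc[scores.index, ["course_name", "instructor"]].copy()
    result["similarity"] = scores
    return result

print(similar_courses_unique(101, top_n=2))
```

Returning an empty DataFrame for an unknown course keeps the return type consistent, which is a design choice of this sketch rather than something the notebook does.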

In [30]:
# Collaborative recommendation
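def recommend_top_courses(user_id, top_n=5):
    # Courses this user has already rated.
    rated = df[df['user_id'] == user_id]['course_id'].tolist()
    # Highest average ratings among the remaining courses.
    unrated = course_avg[~course_avg.index.isin(rated)].sort_values(ascending=False).head(top_n)
    # This filters the full table, so a course appears once per enrollment row.
    return df[df['course_id'].isin(unrated.index)][['course_name', 'instructor', 'difficulty_level']]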


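This function recommends the highest-rated courses that the user has not rated yet, based on the average ratings given by all users. The result contains one row per matching enrollment, so a course can appear several times.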
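
A variant that returns each recommended course once, together with its average rating and in ranked order, might look like the sketch below. It runs on toy data; the names enrollments, avg, and top_unrated_courses and the values are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy enrollment table (illustrative values only).
enrollments = pd.DataFrame({
    "user_id":     [5, 6, 6, 7, 7],
    "course_id":   [101, 101, 102, 102, 103],
    "course_name": ["Python", "Python", "Swift", "Swift", "Canva"],
    "rating":      [4.0, 5.0, 3.0, 4.0, 5.0],
})
avg = enrollments.groupby("course_id")["rating"].mean()

def top_unrated_courses(user_id, top_n=5):
    """Return the best-rated courses the user has not rated, one row each."""
    rated = enrollments.loc[enrollments["user_id"] == user_id, "course_id"]
    ranked = avg[~avg.index.isin(rated)].sort_values(ascending=False).head(top_n)
    # One name per course, looked up in ranked order.
    names = enrollments.drop_duplicates("course_id").set_index("course_id")["course_name"]
    result = names.loc[ranked.index].to_frame()
    result["avg_rating"] = ranked
    return result

print(top_unrated_courses(5, top_n=2))
```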

In [33]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [34]:
# Accuracy Evaluation (Collaborative baseline using course_avg)
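def evaluate_accuracy():
    # Sample 20% of the rows as a test set.
    test_df = df.sample(frac=0.2, random_state=42)
    # Predict each rating with the average rating of its course.
    test_df['predicted_rating'] = test_df['course_id'].map(course_avg)
    # Drop rows whose course has no average.
    test_df = test_df.dropna(subset=['predicted_rating'])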

    rmse = np.sqrt(mean_squared_error(test_df['rating'], test_df['predicted_rating']))
    mae = mean_absolute_error(test_df['rating'], test_df['predicted_rating'])

    print(f"📊 Accuracy Evaluation:\nRMSE: {rmse:.3f}\nMAE: {mae:.3f}")

# Run evaluation
evaluate_accuracy()

📊 Accuracy Evaluation:
RMSE: 0.700
MAE: 0.571


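This function checks how close the predicted ratings are to the actual ratings. Lower RMSE and MAE mean better accuracy. Because course_avg was computed from all rows, including the sampled ones, these numbers are an optimistic estimate.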
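
RMSE is the square root of the mean squared error and MAE is the mean absolute error. For an estimate that is not influenced by the test rows, the averages can be computed from the training rows only. The sketch below illustrates that split on synthetic ratings, so its scores say nothing about the course dataset; the variable names and the global-mean fallback for unseen courses are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic ratings (hypothetical data, not the course dataset).
ratings = pd.DataFrame({
    "course_id": rng.integers(1, 21, size=1000),
    "rating": rng.uniform(1.0, 5.0, size=1000).round(1),
})

# Hold out 20% of the rows and keep the rest for training.
test = ratings.sample(frac=0.2, random_state=42)
train = ratings.drop(test.index)

# Averages come from the training rows only, so no test rating leaks in.
train_avg = train.groupby("course_id")["rating"].mean()
global_avg = train["rating"].mean()

# Courses unseen in training fall back to the global training mean.
pred = test["course_id"].map(train_avg).fillna(global_avg)
err = test["rating"] - pred

rmse = np.sqrt((err ** 2).mean())  # root mean squared error
mae = err.abs().mean()             # mean absolute error
print(f"RMSE: {rmse:.3f}  MAE: {mae:.3f}")
```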

In [31]:
# Content-based: similar to course_id = 101
print(recommend_similar_courses(course_id=101))

# Collaborative: top courses for user_id = 5
print(recommend_top_courses(user_id=5))

                                     course_name        instructor  \
23             Mobile App Development with Swift         Sarah Lee   
282          Stock Market and Trading Strategies    Benjamin Lewis   
1420                 Ethical Hacking Masterclass         Sarah Lee   
3657                        Python for Beginners    Isabella Scott   
11641            Project Management Fundamentals    Charlotte King   
12567            Cybersecurity for Professionals    Benjamin Lewis   
12810          Mobile App Development with Swift         Sarah Lee   
17459           DevOps and Continuous Deployment    Charlotte King   
25250             Fitness and Nutrition Coaching    William Thomas   
27189              Photography and Video Editing     Olivia Taylor   
27416          Mobile App Development with Swift   Sophia Anderson   
27680                    Public Speaking Mastery        Ethan Hall   
31965                Game Development with Unity    William Thomas   
32383            Cyb

With the accuracy evaluated, we print recommendations for a sample course (course_id 101) and a sample user (user_id 5).