<a href="https://colab.research.google.com/github/AAdewunmi/Online-Course-Recommendation-App-Project/blob/main/Online_Course_Recommendation_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Online Course Recommendation Project README

## Overview

This project builds a **recommendation system for online courses**, using course title information to suggest similar courses. The aim is to help learners discover relevant content by applying text vectorisation and similarity techniques.

## Objectives

* Clean and preprocess course titles.
* Vectorise text using Natural Language Processing (NLP) methods.
* Compute similarity scores between courses.
* Generate course recommendations based on user input.

## Tech Stack

* **Python 3.x**
* **Jupyter Notebook / Google Colab**
* Libraries:

  * `pandas` – data handling
  * `numpy` – numerical operations
  * `scikit-learn` – vectorisation (`CountVectorizer`, `TfidfVectorizer`) and similarity measures
  * `neattext` – text preprocessing (stopword and special-character removal)

## Repository Contents

* `Online_Course_Recommendation_Project.ipynb` – notebook with data preprocessing, vectorisation, and recommendation system.
* `README.md` – project documentation.

## How It Works

1. Course titles are cleaned and normalised (lowercasing, stopword removal, punctuation stripping).
2. Titles are transformed into numerical vectors using bag-of-words counts (`CountVectorizer`) or **TF-IDF** (term frequency–inverse document frequency).
3. **Cosine similarity** is calculated to find the closest matches between course titles.
4. Given an input course, the system recommends the most similar courses.
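
The four steps above can be sketched end to end on a handful of made-up titles. This is a minimal illustration rather than the notebook's exact code: the titles are hypothetical, and `TfidfVectorizer` with `stop_words="english"` stands in for the separate cleaning step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles, used only for illustration.
titles = [
    "Data Science with Python",
    "Python for Data Analysis",
    "Watercolour Painting for Beginners",
]

# Steps 1-2: TfidfVectorizer lowercases, tokenises and drops English
# stopwords before building one TF-IDF vector per title.
vectoriser = TfidfVectorizer(stop_words="english")
tfidf = vectoriser.fit_transform(titles)

# Step 3: pairwise cosine similarity between all titles.
sim = cosine_similarity(tfidf)

# Step 4: rank every other title against the first one.
query = 0
ranked = sorted(
    (i for i in range(len(titles)) if i != query),
    key=lambda i: sim[query, i],
    reverse=True,
)
best_match = titles[ranked[0]]
print(best_match)
```

Titles that share no terms with the query score exactly zero, so on a real catalogue it is usually worth filtering those out before presenting results.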

## Example Usage

```python
# Input course title
input_course = "Data Science with Python"

# System outputs top N similar courses, e.g.:
# 1. Python for Data Analysis
# 2. Machine Learning A-Z
# 3. Statistics for Data Science
```
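
Selecting the top N from one row of similarity scores could look roughly like the sketch below. The `top_n` helper and the score values are hypothetical and are not taken from the notebook; the helper skips the query course itself and any course with a score of zero.

```python
def top_n(scores, query_idx, n=3):
    """Return (index, score) pairs for the n highest-scoring other items."""
    candidates = [
        (i, s) for i, s in enumerate(scores)
        if i != query_idx and s > 0  # skip the query itself and non-matches
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:n]

# Hypothetical similarity scores of five courses against course 0.
scores = [1.0, 0.58, 0.0, 0.35, 0.71]
top = top_n(scores, query_idx=0, n=3)
print(top)
```

Excluding the query by index, rather than dropping the first sorted element, stays correct even when another course ties with it at a score of 1.0.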

## How to Run

1. Clone this repository.
2. Install dependencies:

   ```bash
   pip install pandas numpy scikit-learn nltk neattext seaborn
   ```
3. Launch Jupyter or open in Google Colab.
4. Run all cells in `Online_Course_Recommendation_Project.ipynb`.

## Next Steps

* Enhance preprocessing (lemmatisation, stemming).
* Experiment with **word embeddings** (Word2Vec, GloVe, BERT).
* Extend recommendations beyond titles (e.g., course description, subject, reviews).



# Data Analysis on Online_Course_Recommendation_Project dataset

In [None]:
# Install the neattext Package
!pip install neattext

Collecting neattext
  Downloading neattext-0.1.3-py3-none-any.whl.metadata (12 kB)
Downloading neattext-0.1.3-py3-none-any.whl (114 kB)
Installing collected packages: neattext
Successfully installed neattext-0.1.3


In [None]:
# Import packages

import pandas as pd
import numpy as np
import neattext.functions as nfx
import seaborn as sn

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity,linear_kernel

In [None]:
# Read udemy_course_data.csv
# and show the first 5 rows

df = pd.read_csv('/content/sample_data/udemy_course_data.csv')
df.head()

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject,profit,published_date,published_time,year,month,day
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5 hours,2017-01-18T20:58:58Z,Business Finance,429400,2017-01-18,20:58:58Z,2017,1,18
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39 hours,2017-03-09T16:34:20Z,Business Finance,209400,2017-03-09,16:34:20Z,2017,3,9
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5 hours,2016-12-19T19:26:30Z,Business Finance,97830,2016-12-19,19:26:30Z,2016,12,19
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3 hours,2017-05-30T20:07:24Z,Business Finance,232845,2017-05-30,20:07:24Z,2017,5,30
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2 hours,2016-12-13T14:57:18Z,Business Finance,255200,2016-12-13,14:57:18Z,2016,12,13


In [None]:
# List all names (constants and functions) exposed by the neattext.functions module

dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

In [None]:
# Preview rows 1 to 4 of the 'course_title' column
# (iloc[1:5] excludes the end position)

df['course_title'].iloc[1:5]

Unnamed: 0,course_title
1,Complete GST Course & Certification - Grow You...
2,Financial Modeling for Business Analysts and C...
3,Beginner to Pro - Financial Analysis in Excel ...
4,How To Maximize Your Profits Trading Options


In [None]:
# Generate clean text by removing the
# stopwords and special characters

df['Clean_title'] = df['course_title'].apply(nfx.remove_stopwords)
df['Clean_title'] = df['Clean_title'].apply(nfx.remove_special_characters)
df['Clean_title'].iloc[1:5]

Unnamed: 0,Clean_title
1,Complete GST Course Certification Grow Practice
2,Financial Modeling Business Analysts Consultants
3,Beginner Pro Financial Analysis Excel 2017
4,Maximize Profits Trading Options
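
For readers without `neattext`, roughly the same cleaning can be approximated with the standard library. This is only a sketch: the `STOPWORDS` set below is a small hypothetical list, not the one `neattext` ships, so results on the full dataset will differ.

```python
import re

# Small illustrative stopword list (an assumption; neattext's is larger).
STOPWORDS = {"to", "your", "how", "for", "and", "in", "the", "a"}

def clean_title(title):
    # Drop stopwords (case-insensitive), then strip anything that is not
    # a letter, digit or whitespace, and collapse repeated spaces.
    kept = [w for w in title.split() if w.lower() not in STOPWORDS]
    no_special = re.sub(r"[^A-Za-z0-9\s]", "", " ".join(kept))
    return re.sub(r"\s+", " ", no_special).strip()

cleaned = clean_title("How To Maximize Your Profits Trading Options")
print(cleaned)
```

On the two sample titles checked here (rows 3 and 4) the result matches the `Clean_title` values shown above.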


# Vectorise the Clean Title

In [None]:
# Implement CountVectorizer for text feature extraction

countvect = CountVectorizer()
cvmat = countvect.fit_transform(df['Clean_title'])
cvmat


<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 18364 stored elements and shape (3683, 3564)>

In [None]:
# Return a dense representation of this sparse matrix

cvmat.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
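
The all-zero corners are expected: each title uses only a few of the 3,564 vocabulary terms. A quick back-of-the-envelope calculation, using the shape and stored-element count from the sparse-matrix summary above, shows how much the dense form inflates memory use.

```python
# Figures taken from the sparse-matrix summary printed earlier.
rows, cols, stored = 3683, 3564, 18364

cells = rows * cols        # total entries in the dense matrix
density = stored / cells   # fraction of non-zero entries
dense_bytes = cells * 8    # int64 uses 8 bytes per entry

print(f"{cells:,} cells, {density:.2%} non-zero, "
      f"about {dense_bytes / 1e6:.0f} MB when dense")
```

That works out to about 13.1 million cells, of which roughly 0.14% are non-zero, and about 105 MB once densified. `cosine_similarity` accepts the sparse matrix directly, so the dense copy is only needed for inspection.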

In [None]:
# Convert vectorized text to word count DataFrame

df_cv_words = pd.DataFrame(cvmat.todense(),columns=countvect.get_feature_names_out())
df_cv_words

Unnamed: 0,000005,001,01,02,10,100,101,101master,102,10k,...,zend,zero,zerotohero,zf2,zinsen,zoho,zombie,zu,zuhause,zur
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3678,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3679,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3680,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3681,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Cosine Similarity Matrix


In [None]:
# Compute pairwise cosine similarity between the title count vectors

cosine_sim_mat = cosine_similarity(cvmat)
cosine_sim_mat

array([[1.        , 0.20412415, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.20412415, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.23570226],
       [0.        , 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.23570226, 0.        ,
        1.        ]])
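
The 0.20412415 entry between rows 0 and 1 can be reproduced by hand. The sketch below uses plain word counts and assumes that row 0's cleaned title is unchanged by cleaning (it is not printed in the notebook); row 1's cleaned title is the one shown earlier. Lowercasing mirrors `CountVectorizer`'s default behaviour.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Row 0 is an assumption (see above); row 1 comes from the Clean_title output.
row0 = Counter("ultimate investment banking course".split())
row1 = Counter("complete gst course certification grow practice".split())

sim_01 = cosine(row0, row1)
print(round(sim_01, 8))
```

The two titles share only the word "course", so the dot product is 1 and the norms are √4 and √6, giving 1/√24 ≈ 0.20412415.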

# Recommended Course

In [None]:
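# Create a Series mapping each course title to its row position.
# Repeated titles are filtered on the index, because Series.drop_duplicates()
# compares the values (row positions), which are already unique.

course_index = pd.Series(df.index, index=df['course_title'])
course_index = course_index[~course_index.index.duplicated(keep='first')]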
course_index

Unnamed: 0_level_0,0
course_title,Unnamed: 1_level_1
Ultimate Investment Banking Course,0
Complete GST Course & Certification - Grow Your CA Practice,1
Financial Modeling for Business Analysts and Consultants,2
Beginner to Pro - Financial Analysis in Excel 2017,3
How To Maximize Your Profits Trading Options,4
...,...
Learn jQuery from Scratch - Master of JavaScript library,3678
How To Design A WordPress Website With No Coding At All,3679
Learn and Build using Polymer,3680
CSS Animations: Create Amazing Effects on Your Website,3681
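
A title-to-position lookup like `course_index` needs some care when titles repeat, because indexing by a repeated title returns several rows instead of one position. The sketch below uses a made-up three-row frame to show that `Series.drop_duplicates()` compares values rather than index labels, whereas filtering with `index.duplicated()` keeps one row per title.

```python
import pandas as pd

# Hypothetical frame with one repeated title.
toy = pd.DataFrame(
    {"course_title": ["Intro to SQL", "Python Basics", "Intro to SQL"]}
)

lookup = pd.Series(toy.index, index=toy["course_title"])

# drop_duplicates() looks at the values (row positions 0, 1, 2), which are
# already unique, so the repeated title survives.
by_value = lookup.drop_duplicates()

# Filtering on the index keeps the first row for each title instead.
by_title = lookup[~lookup.index.duplicated(keep="first")]

print(len(by_value), len(by_title))
print(by_title["Intro to SQL"])
```

With one entry per title, `by_title[...]` returns a single integer position that can be used to select a row of the similarity matrix.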


In [None]:
# Add course recommendation logic using cosine similarity

# Title of the course to find recommendations for (replace as needed)
my_course_title = 'How To Maximize Your Profits Trading Options'

# Look up the row position of the course in the 'course_index' Series
index = course_index[my_course_title]

# Pair each row position with its similarity score and sort, highest first
scores = list(enumerate(cosine_sim_mat[index]))
sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

# Drop the query course by position (with ties at 1.0 it is not guaranteed
# to be first), then split the remaining pairs into positions and scores
sorted_scores = [s for s in sorted_scores if s[0] != index]
selected_course_index = [i[0] for i in sorted_scores]
selected_course_score = [i[1] for i in sorted_scores]

# Create the final DataFrame
rec_df = df.iloc[selected_course_index].copy() # Using .copy() to avoid a SettingWithCopyWarning
rec_df['Similarity_Score'] = selected_course_score
final_recommended_courses = rec_df[['course_title','Similarity_Score','url',
                                    'price', 'num_subscribers']]
print(final_recommended_courses)

                                           course_title  Similarity_Score  \
410                              Trading Options Basics          0.577350   
43     Options Trading - How to Win with Weekly Options          0.566947   
96    Intermediate Options trading concepts for Stoc...          0.530330   
138   Forex Trading with Fixed 'Risk through Options...          0.530330   
195   Trading Options For Consistent Returns: Option...          0.530330   
...                                                 ...               ...   
3678  Learn jQuery from Scratch - Master of JavaScri...          0.000000   
3679  How To Design A WordPress Website With No Codi...          0.000000   
3680                      Learn and Build using Polymer          0.000000   
3681  CSS Animations: Create Amazing Effects on Your...          0.000000   
3682  Using MODX CMS to Build Websites: A Beginner's...          0.000000   

                                                    url  price  \
410      