# 🧹 PIPELINE 2: COURSE DATA CLEANING (SBERT-READY)

**Goal:** Prepare course data for the Recommendation Engine.

**Key Steps:**
1. **Load & Standardize**: Make sure the data has `course_name` and `description` columns.
2. **Level Tagging**: Identify whether a course is a Degree (undergraduate or postgraduate), a Diploma, or a Short Course (certificate).
3. **ESCO Mapping**: (Optional) Link courses to official job titles using SBERT.

In [1]:
import pandas as pd
import re
from pathlib import Path
import unicodedata

RAW_PATH = Path("../data/raw/courses.csv")  # Adjust if your filename differs
PROCESSED_DIR = Path("../data/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"Loading courses from: {RAW_PATH}")
try:
    courses_df = pd.read_csv(RAW_PATH, on_bad_lines='skip')
    # Standardize column names
    courses_df.columns = courses_df.columns.str.lower().str.strip().str.replace(' ', '_')
    print(f" Loaded {len(courses_df)} courses.")
except FileNotFoundError:
    print(" Raw data not found. Creating dummy data for testing.")
    courses_df = pd.DataFrame({
        'course_name': ['BSc in Computer Science', 'Python Zero to Hero', 'MBA'],
        'description': ['Learn Java, Algorithms', 'Python basics', 'Management level'],
        'fees': ['500000', '10000', '1200000'],
        'duration': ['3 years', '3 months', '1 year']
    })

Loading courses from: ..\data\raw\courses.csv
 Loaded 24930 courses.
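The load cell handles a missing `description` later on, but a CSV without `course_name` would only fail in the next step with a bare `KeyError`. A minimal sketch of an explicit check is shown here. The helper names `standardize_columns` and `check_required` are hypothetical and not part of the notebook, and treating `course_name` as the only hard requirement is an assumption.

```python
import pandas as pd

# Assumption: only course_name is strictly required, because the notebook
# falls back to course_name when description is missing.
REQUIRED_COLUMNS = ["course_name"]


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case, trim, and snake_case the column names (same rule as the load cell)."""
    out = df.copy()
    out.columns = out.columns.str.lower().str.strip().str.replace(" ", "_")
    return out


def check_required(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> pd.DataFrame:
    """Raise a descriptive error if any required column is absent."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(
            f"Missing required column(s): {missing}. Found: {list(df.columns)}"
        )
    return df


# Example with messy headers, as often seen in scraped CSVs
raw = pd.DataFrame({" Course Name ": ["MBA"], "Fees": ["1200000"]})
checked = check_required(standardize_columns(raw))
print(list(checked.columns))
```

Failing early with the list of columns actually found tends to make a renamed or misspelled header easier to spot than a failure deep inside `.apply()`.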


## STEP 2: GENTLE TEXT CLEANING
We clean lightly: no lowercasing, stop-word removal, or stemming. This keeps the text natural and context-rich for the SBERT model.

In [2]:
def gentle_clean(text):
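    # Missing values become empty strings so later steps never see NaN
    if pd.isna(text):
        return ""
    text = str(text)
    # Fold to ASCII: decompose accented characters (NFKD), then drop anything
    # with no ASCII equivalent (curly quotes, dashes, non-Latin scripts)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")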
    return text.strip()

courses_df['course_name'] = courses_df['course_name'].apply(gentle_clean)
if 'description' in courses_df.columns:
    courses_df['description'] = courses_df['description'].apply(gentle_clean)
else:
    print(" No description column found. Using course_name as description.")
    courses_df['description'] = courses_df['course_name']

print("Text cleaned.")

Text cleaned.
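Note that the NFKD-plus-ASCII step does more than repair encoding: it removes every character that has no ASCII form. The sketch below reproduces that transformation as `ascii_fold` and contrasts it with a hypothetical alternative, `clean_keep_unicode`, which is not part of the notebook and is shown only to illustrate the trade-off if course names in other scripts need to survive.

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Same transformation as gentle_clean: decompose, then drop non-ASCII."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return folded.strip()


def clean_keep_unicode(text: str) -> str:
    """Hypothetical alternative: NFKC-normalize and collapse whitespace, keeping non-ASCII."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


samples = [
    "Caf\u00e9 Management",            # accented letter
    "Data Science \u2013 Basics",      # en dash
    "\u65e5\u672c\u8a9e Course",       # non-Latin script
]
for sample in samples:
    print(repr(ascii_fold(sample)), "|", repr(clean_keep_unicode(sample)))
```

With `ascii_fold`, the accent is stripped ("Cafe Management"), the en dash disappears and leaves a double space, and the non-Latin prefix is removed entirely. Whether that loss matters depends on the language coverage of the dataset and of the SBERT model used downstream.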


## STEP 3: LEVEL TAGGING (Critical for Career Paths)
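We need to know if a course is a Degree (Level 6+) or a Certificate (Level 4). Each course is tagged as Postgraduate, Undergraduate, Diploma, or Certificate from keywords in its name; anything that matches no keyword defaults to Certificate.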

In [3]:
def tag_course_level(title):
    title = title.lower()
    # Plain substring checks, evaluated in order: the first match wins.
    # Note: 'master' also matches 'masterclass', and 'mba' matches 'samba'.
    if any(x in title for x in ['msc', 'mba', 'master', 'phd', 'postgraduate']):
        return 'Postgraduate'   # Suits senior roles
    elif any(x in title for x in ['bsc', 'bachelor', 'degree', 'undergraduate']):
        return 'Undergraduate'  # Suits entry- to mid-level roles
    elif 'diploma' in title:
        return 'Diploma'        # Suits entry-level roles
    else:
        return 'Certificate'    # Default: short courses that fill skill gaps

courses_df['course_level'] = courses_df['course_name'].apply(tag_course_level)
print(" Course Levels Tagged:")
print(courses_df['course_level'].value_counts())

 Course Levels Tagged:
course_level
Certificate      15715
Diploma           5306
Undergraduate     2190
Postgraduate      1719
Name: count, dtype: int64
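Because `tag_course_level` relies on substring checks, a title such as "Masterclass in Baking" contains `master` and would be tagged Postgraduate. One possible refinement, sketched here, matches the same keywords as whole words using the `re` module. `LEVEL_PATTERNS` and `tag_course_level_strict` are hypothetical names, not part of the pipeline, and the counts shown above would change if this variant were used.

```python
import re

# Same keywords as tag_course_level, but anchored on word boundaries.
# Order matters: the first matching level wins, as in the original function.
LEVEL_PATTERNS = [
    ("Postgraduate", re.compile(r"\b(msc|mba|masters?|phd|postgraduate)\b")),
    ("Undergraduate", re.compile(r"\b(bsc|bachelors?|degree|undergraduate)\b")),
    ("Diploma", re.compile(r"\bdiploma\b")),
]


def tag_course_level_strict(title: str) -> str:
    """Tag a course level using whole-word keyword matches."""
    title = title.lower()
    for level, pattern in LEVEL_PATTERNS:
        if pattern.search(title):
            return level
    return "Certificate"  # default, as in the original


for name in ["Masterclass in Baking", "Master of Science in AI",
             "BSc in Computer Science", "Diploma in Nursing"]:
    print(f"{name!r:35} -> {tag_course_level_strict(name)}")
```

The trade-off is that whole-word matching can miss abbreviations written with punctuation (for example "B.Sc."), so it is worth spot-checking a sample of titles before switching.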


## STEP 4: SAVE
This file will be used by `sbert_pipeline.ipynb`.

In [4]:
filename = PROCESSED_DIR / "courses_cleaned_sbert_ready.csv"
courses_df.to_csv(filename, index=False)
print(f" SUCCESS! Saved to: {filename}")

 SUCCESS! Saved to: ..\data\processed\courses_cleaned_sbert_ready.csv
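One thing to watch when the saved file is reloaded: `gentle_clean` turns missing text into empty strings, but `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below demonstrates this with an in-memory buffer and made-up rows, and shows one way a downstream notebook could restore the empty strings. How `sbert_pipeline.ipynb` actually loads the file is not shown here, so this is an assumption about its needs.

```python
import io

import pandas as pd

# Made-up rows; "" is what gentle_clean returns for a missing description
cleaned = pd.DataFrame({
    "course_name": ["MBA", "Diploma in Nursing"],
    "description": ["Management level", ""],
})

# Round-trip through CSV using an in-memory buffer instead of a real file
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
missing_after_reload = int(reloaded["description"].isna().sum())

# Restore empty strings on the text columns only
text_cols = ["course_name", "description"]
reloaded[text_cols] = reloaded[text_cols].fillna("")
missing_after_fill = int(reloaded["description"].isna().sum())

print(f"NaN after reload: {missing_after_reload}, after fillna: {missing_after_fill}")
```

Filling only the text columns leaves numeric columns such as `fees` free to keep `NaN` as their missing-value marker.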
