
# Job Recommendation System — Global Notebook

```mermaid
gantt
    dateFormat  YYYY-MM-DD
    title       Job Recommendation Project — 2-week Plan
    excludes    weekends

    section Week 1 - Setup & Data
    Planning & Kickoff          :    planning, 2025-10-01, 2d
    Data gathering & cleaning   :active,  data,    2025-10-03, 3d
    Quick EDA and schema review :         eda,     2025-10-06, 1d

    section Week 1-2 - Core NLP
    Text preprocessing & mapping:         preprocess, 2025-10-07, 2d
    Model experiments (MiniLM / mpnet):crit, model, 2025-10-07, 4d
    Embedding pipeline & scoring :         pipeline, 2025-10-09, 2d

    section Week 2 - Frontend & Integration
    Streamlit frontend dev      :         frontend, 2025-10-10, 2d
    Integration (NLP ↔ Frontend) :         integrate, 2025-10-10, 2d
    Testing & bugfixing         :         test,      2025-10-11, 1d

    section Finalization
    Documentation & notebook    :milestone, docs, 2025-10-12, 1d
    Final review & presentation :         final,    2025-10-12, 1d
```
---



## Imports and Model Initialization

The imports below cover data manipulation (pandas), text preprocessing (NLTK and `re`), visualization (Plotly), and semantic similarity with **Sentence-BERT (SBERT)**.  
NLTK handles tokenization and stopword removal.

### Model Choices
During experimentation, we tested two Sentence-BERT (SBERT) models:

- `all-MiniLM-L6-v2` – a lightweight English model (384-dimensional embeddings) known for its speed and small memory footprint.

- `all-mpnet-base-v2` – a larger English model (768-dimensional embeddings) with higher embedding quality and semantic accuracy.

**We ultimately chose `all-mpnet-base-v2` because it provides better semantic similarity performance, especially for nuanced English text.
Although it is heavier and slower, its higher accuracy and contextual understanding make it more suitable for precise job–profile matching, where the quality of embeddings has a strong impact on the ranking results.**

In [None]:

import pandas as pd
import re
import plotly.graph_objects as go
import nltk
from pathlib import Path
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sentence_transformers import SentenceTransformer, util
import math

# Download required NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Model identifier; the model itself is loaded inside nlp()
MODEL_ID = 'all-mpnet-base-v2'

data_path = Path.cwd() / "data"



## Function: `graph_result(jobScores)`

**Purpose:**  
Creates a **polar radar chart** using Plotly to visualize how well a user's profile matches different job roles.

**Input:**  
- `jobScores`: list of tuples `(job_title, score, top_skills)` representing the job name, similarity score, and top matching skills.

**Output:**  
- Returns a Plotly `Figure` object ready for visualization.

**Role in Project:**  
Used at the end of the NLP pipeline to graphically represent how strongly the user matches each potential job.
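
As a rough illustration of the values the chart receives, the sketch below converts raw similarity scores to percentages and derives the radial-axis range the same way `graph_result` does (minimum minus 4, maximum plus 1). The helper name `radar_inputs` and the sample jobs are hypothetical, and Plotly is left out so the snippet runs on its own.

```python
def radar_inputs(job_scores):
    """Return (titles, percentages, radial_range) for a list of (title, score, skills) tuples."""
    titles = [title for title, _score, _skills in job_scores]
    percentages = [round(score * 100, 2) for _title, score, _skills in job_scores]
    # Same padding as graph_result: some room below the minimum and above the maximum.
    radial_range = [min(percentages) - 4, max(percentages) + 1]
    return titles, percentages, radial_range


# Hypothetical scores, only to show the shapes involved.
sample = [
    ("Data Analyst", 0.50, ["SQL"]),
    ("ML Engineer", 0.25, ["Python"]),
]
titles, percentages, radial_range = radar_inputs(sample)
print(titles, percentages, radial_range)
```

Because the range is derived from the data, the chart zooms in on the differences between jobs rather than showing the full 0 to 100 scale.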


In [None]:

def graph_result(jobScores):
    bestJobsTitle = [job for job, score, topSkills in jobScores]
    bestJobsScoresValues = [round(score * 100, 2) for job, score, topSkills in jobScores]

    maxScore = max(bestJobsScoresValues)
    minScore = min(bestJobsScoresValues)

    fig = go.Figure(data=go.Scatterpolar(
        r=bestJobsScoresValues,
        theta=bestJobsTitle,
        fill='toself',
        name='Profile Match'
    ))
    fig.update_layout(
        polar=dict(radialaxis=dict(visible=True, range=[minScore - 4, maxScore + 1])),
        title="Overall Job Profile Match (Weighted)"
    )
    
    return fig



## Function: `loadData()`

**Purpose:**  
Loads the competency and job datasets from the `data/` folder.

**Input:**  
- None (uses relative paths).

**Output:**  
- `df_competencies`: DataFrame containing all competencies.
- `df_jobs`: DataFrame containing job descriptions and required competencies.

**Role in Project:**  
Provides the base data for the recommendation system. Each job is associated with several competencies.
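
The loader expects a semicolon-separated list of competency IDs in the `RequiredCompetencies` column of `jobs.csv`. The sketch below parses a tiny in-memory file with the standard `csv` module to show that layout. The rows and the ID format (`C1`, `C2`, ...) are invented for illustration; only the column names come from the code in this notebook.

```python
import csv
import io

# Hypothetical content of data/jobs.csv (column names taken from the notebook code).
jobs_csv = io.StringIO(
    "JobTitle,RequiredCompetencies\n"
    "Data Analyst,C1;C2\n"
    "ML Engineer,C2;C3;C4\n"
)

jobs = []
for row in csv.DictReader(jobs_csv):
    # Same split as loadData(): one string becomes a list of competency IDs.
    row["RequiredCompetencies"] = row["RequiredCompetencies"].split(";")
    jobs.append(row)

for job in jobs:
    print(job["JobTitle"], job["RequiredCompetencies"])
```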


In [None]:

def loadData():
    try:
        df_competencies = pd.read_csv(data_path / r"competencies.csv", sep=",")
        df_jobs = pd.read_csv(data_path / r"jobs.csv", sep=",")
        df_jobs["RequiredCompetencies"] = df_jobs["RequiredCompetencies"].apply(lambda x: x.split(";"))
    except FileNotFoundError:
        print("Error: Make sure 'competencies.csv' and 'jobs.csv' are in a 'data' folder at the project root.")
        df_competencies = pd.DataFrame()
        df_jobs = pd.DataFrame()
    return df_competencies, df_jobs



## Function: `transformInDf(...)`

**Purpose:**  
Transforms user input (skills, experience, interests) into a pandas DataFrame suitable for NLP processing.

**Input:**  
- User profile data (skills levels, tools, experience, etc.).

**Output:**  
- A single-row DataFrame representing the user's answers.

**Role in Project:**  
Acts as the bridge between the Streamlit interface and the NLP engine, converting form input into structured data.


In [None]:

def transformInDf(level_python, level_ai, level_visu, level_sql, level_token_embedding,
                  tools, languages, frameworks, data_types, preferred_domains,
                  experience_text, challenges, learning_goals):
    data = {
        "level_python": level_python,
        "level_ai": level_ai,
        "level_visu": level_visu,
        "level_sql": level_sql,
        "level_token_embedding": level_token_embedding,
        "tools": tools,
        "languages": languages,
        "frameworks": frameworks,
        "data_types": data_types,
        "preferred_domains": preferred_domains,
        "experience_text": experience_text,
        "challenges": challenges,
        "learning_goals": learning_goals
    }
    df = pd.DataFrame([data])
    return df



## Function: `clean_text(text)`

**Purpose:**  
Cleans and normalizes text for embedding.

**Input:**  
- `text`: Raw user input text.

**Output:**  
- Cleaned and tokenized string with stopwords removed.

**Role in Project:**  
Ensures consistent input for SBERT embedding and similarity calculation.

**Technical Choice:**<br>
We chose not to use lemmatization to avoid losing contextual meaning.<br>
Since Sentence-BERT encodes the meaning of full sentences, reducing words to their base forms could slightly change the intended meaning.
Keeping the original word forms lets the embeddings reflect the user's phrasing more accurately.
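
To make the cleaning steps concrete, here is a dependency-free approximation of `clean_text`. It swaps NLTK's tokenizer for `str.split` and uses a tiny hand-picked stopword set, so it is an illustration rather than the function used in the pipeline. Note that the length filter (`len(t) > 2`) also removes short technical terms such as `r` or `bi`.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this sketch).
SAMPLE_STOPWORDS = {"the", "and", "with", "have", "for"}


def clean_text_sketch(text):
    """Lowercase, strip punctuation, then drop stopwords and tokens of 2 characters or fewer."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()  # the notebook uses nltk.word_tokenize here
    kept = [t for t in tokens if t not in SAMPLE_STOPWORDS and len(t) > 2]
    return " ".join(kept)


cleaned = clean_text_sketch("I have built dashboards with Power BI, R and Python!")
print(cleaned)
```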


In [None]:

def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Use NLTK's word_tokenize and English stopword list
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))  # build the set once instead of once per token
    # Note: the length filter also drops short terms such as "r" or "ai"
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return " ".join(tokens)



## Function: `preprocessing(df)`

**Purpose:**  
Maps numeric skill levels to descriptive phrases and cleans all text columns.

**Input:**  
- `df`: DataFrame containing the user profile data.

**Output:**  
- Preprocessed DataFrame ready for sentence embedding.

**Role in Project:**  
Transforms structured numeric and text inputs into natural language sentences for semantic embedding.

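**Explanation:**
Each numeric level (1 to 5) is turned into a sentence such as "I am a Novice in SQL". Most free-text fields are prefixed with a short introductory phrase, and all text fields are then cleaned with `clean_text`.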
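
The mapping can be sketched without pandas as follows. The level labels, prefixes, and sentence template mirror the dictionaries in `preprocessing`; the function names `describe_level` and `describe_field` and the sample values are illustrative only.

```python
LEVEL_LABELS = {1: "Beginner", 2: "Novice", 3: "Intermediate", 4: "Advanced", 5: "Expert"}
FIELD_PREFIXES = {
    "languages": "I know the following programming languages:",
    "tools": "I have used these tools and software:",
}


def describe_level(level, competence):
    """Turn a 1-5 slider value into the sentence that gets embedded."""
    return f"I am a {LEVEL_LABELS[level]} in {competence}"


def describe_field(field, value):
    """Prefix a free-text answer so it reads as a sentence; empty answers stay empty."""
    return f"{FIELD_PREFIXES[field]} {value}" if value else ""


level_sentence = describe_level(2, "SQL")
field_sentence = describe_field("languages", "Python, R")
print(level_sentence)
print(field_sentence)
```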


In [None]:

def preprocessing(df):
    mappingLevel = {
        1: "Beginner",
        2: "Novice",
        3: "Intermediate",
        4: "Advanced",
        5: "Expert"
    }
    mappingCompetence = {
        "level_python": "python",
        "level_ai": "Artificial Intelligence",
        "level_visu": "Visualization",
        "level_sql": "SQL",
        "level_token_embedding": "Tokenization and embeddings"
    }
    mappingExperience = {
        "experience_text": "My experience includes:  ",
        "tools": "I have used these tools and software:  ",
        "languages": "I know the following programming languages: ",
        "frameworks": "I am familiar with these frameworks and libraries: ",
        "data_types": "I have worked with these types of data: ",
        "preferred_domains": "I am interested in these domains: "
    }

    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = df[col].apply(lambda numLevel: f"I am a {mappingLevel.get(numLevel)} in {mappingCompetence.get(col)}")
        elif pd.api.types.is_string_dtype(df[col]):
            if col in mappingExperience.keys():
                df[col] = df[col].apply(lambda x: f"{mappingExperience[col]} {x}" if pd.notnull(x) and x != "" else "")
            df[col] = df[col].fillna("").apply(clean_text)
    return df



## Function: `nlp(...)`

**Purpose:**  
The **core NLP engine** — combines embeddings, computes similarity, applies weighting, and generates job recommendations.

**Input:**  
- All user answers (skill levels, text fields, etc.).

**Output:**  
- A dictionary containing:
  - `"fig"` → Polar chart (Plotly figure).
  - `"top_jobs"` → List of top 3 matching jobs.
  - `"block_scores"` → Dictionary of competency block coverage.

**Role in Project:**  
Executes the entire recommendation pipeline:
1. Converts user input to natural text.  
2. Embeds user data and competencies using SBERT.  
3. Computes similarity scores and weights them by how rare each competency is across jobs (IDF).  
4. Returns final ranked job recommendations and visualizations.
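
Step 3 can be illustrated with plain Python. The sketch below applies the same formula as the code, `log(total_jobs / (count + 1))`, and then takes the IDF-weighted average of competency scores for one job. The jobs, competency IDs, and similarity values are made up. One property worth knowing: with this formula a competency required by every job gets a negative IDF, and one required by all but one job gets exactly 0.

```python
import math
from collections import Counter

# Hypothetical jobs and their required competency IDs.
jobs = {
    "Data Analyst": ["C1", "C2"],
    "ML Engineer": ["C1", "C3"],
    "Data Engineer": ["C1", "C2"],
    "NLP Engineer": ["C1", "C4"],
}
total_jobs = len(jobs)
counts = Counter(comp for comps in jobs.values() for comp in comps)


def idf(comp_id):
    # Same formula as calculate_idf in the notebook.
    return math.log(total_jobs / (counts.get(comp_id, 0) + 1))


# Hypothetical weighted similarity scores per competency.
weighted_score = {"C1": 0.30, "C2": 0.50, "C3": 0.70, "C4": 0.20}


def job_score(comp_ids):
    """IDF-weighted average of competency scores; 0 when the weights do not sum to a positive value."""
    total_weight = sum(idf(c) for c in comp_ids)
    if total_weight <= 0:
        return 0
    return sum(weighted_score[c] * idf(c) for c in comp_ids) / total_weight


for title, comps in jobs.items():
    print(title, round(job_score(comps), 3))
print("IDF of C1 (required by every job):", round(idf("C1"), 3))
```

In this toy data the negative weight of `C1` pushes the ML Engineer score above its best competency score (0.7). If that behavior is unwanted, a smoothed variant such as `log((1 + N) / (1 + count)) + 1`, which is always positive, may be worth considering.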

In [None]:
def nlp(level_python, level_ai, level_visu, level_sql, level_token_embedding,
        tools, languages, frameworks, data_types, preferred_domains,
        experience_text, challenges, learning_goals):

    # --- STEP 1: Create a dataframe from user input ---
    # Convert all input parameters (skills, tools, experience, etc.) into a single-row DataFrame.
    df_question = transformInDf(level_python, level_ai, level_visu, level_sql, level_token_embedding,
                                tools, languages, frameworks, data_types, preferred_domains,
                                experience_text, challenges, learning_goals)
    
    # --- STEP 2: Preprocess the text data ---
    # Transform numerical skill levels into descriptive sentences and clean all text fields.
    df_question = preprocessing(df_question)

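    # --- STEP 3: Load competencies and job datasets ---
    df_competencies, df_jobs = loadData()
    # loadData() returns empty DataFrames when the CSV files are missing;
    # return None so the frontend can show its error message.
    if df_competencies.empty or df_jobs.empty:
        return None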

    # --- STEP 4: Compute IDF scores for competencies ---
    # IDF (Inverse Document Frequency) gives more weight to rare or specialized competencies.
    total_jobs = len(df_jobs)

    # Count how often each competency appears across all jobs.
    competency_counts = df_jobs.explode('RequiredCompetencies')['RequiredCompetencies'].value_counts().to_dict()

    # Define a helper function to compute the IDF for each competency.
    def calculate_idf(competency_id):
        count = competency_counts.get(competency_id, 0)
        return math.log(total_jobs / (count + 1))  # add 1 to avoid division by zero

    # Apply the IDF function to all competencies.
    df_competencies['idf_score'] = df_competencies['CompetencyID'].apply(calculate_idf)

    # --- STEP 5: Load the Sentence-BERT model ---
    # This model transforms text into numerical embeddings (semantic vectors).
    model = SentenceTransformer(MODEL_ID)

    # --- STEP 6: Encode the user profile and competency text ---
    # Convert the user’s text profile into a vector representation.
    listQuestion = df_question.iloc[0].tolist()  # plain list of strings, one per profile field
    userEmbedding = model.encode(listQuestion, convert_to_tensor=True)

    # Encode all competencies and competency block names (skill categories).
    compEmbeddings = model.encode(df_competencies["Competency"].tolist(), convert_to_tensor=True)
    blockNames = df_competencies["BlockName"].unique().tolist()
    blockEmbeddings = model.encode(blockNames, convert_to_tensor=True)

    # --- STEP 7: Compute cosine similarity between user and competencies ---
    # The cosine similarity measures how close the user's profile is to each competency. (-1 to 1)
    compCosineMatrix = util.cos_sim(userEmbedding, compEmbeddings).cpu().numpy()
    compCosineScores = compCosineMatrix.max(axis=0)  # take the highest similarity per competency
    df_competencies["similarity"] = compCosineScores

    # --- STEP 8: Compute similarity at the block (category) level ---
    blockCosineMatrix = util.cos_sim(userEmbedding, blockEmbeddings).cpu().numpy()
    blockCosineScores = blockCosineMatrix.max(axis=0)
    # Look scores up by block name rather than by BlockID position,
    # so the result does not depend on how BlockID values are numbered.
    blockScoreByName = dict(zip(blockNames, blockCosineScores))

    # --- STEP 9: Compute a weighted score per competency ---
    # Combines the competency similarity and block-level similarity for better weighting.
    df_competencies["weightedScore"] = df_competencies["similarity"] * (
        1 + 0.1 * df_competencies["BlockName"].map(blockScoreByName)
    )

    # --- STEP 10: Aggregate scores by block for visualization ---
    scoresByBlock = df_competencies.groupby("BlockName")["weightedScore"].mean().to_dict()

    # --- STEP 11: Compute job recommendation scores ---
    jobScores = []
    for _, job in df_jobs.iterrows():
        # Filter competencies required for this job.
        jobComps = df_competencies[df_competencies["CompetencyID"].isin(job["RequiredCompetencies"])]

        if not jobComps.empty:
            # Weighted average of similarity scores, adjusted by IDF.
            weighted_job_score = (jobComps['weightedScore'] * jobComps['idf_score']).sum()
            sum_of_idf = jobComps['idf_score'].sum()
            jobScore = weighted_job_score / sum_of_idf if sum_of_idf > 0 else 0

            # Identify top 3 matching competencies for this job.
            topCompScores = jobComps.sort_values(by='weightedScore', ascending=False).head(3)
        else:
            # If the job has no competencies, assign a score of 0.
            jobScore = 0
            topCompScores = pd.DataFrame(columns=["Competency"])

        # Save the job title, score, and best-matching skills.
        jobScores.append((job["JobTitle"], jobScore, topCompScores["Competency"].tolist()))

    # Sort jobs by descending similarity score.
    jobScores.sort(key=lambda x: x[1], reverse=True)

    # --- STEP 12: Select top 3 recommended jobs ---
    top3Jobs = []
    for job, score, topSkills in jobScores[:3]:
        top3Jobs.append({
            "title": job,
            "score": round(score * 100, 2),  # convert to percentage
            "matching_skills": topSkills
        })

    # --- STEP 13: Generate visualization ---
    # Create a radar/polar chart showing the user's match across all jobs.
    fig = graph_result(jobScores)

    # --- STEP 14: Return all computed results ---
    return {
        "fig": fig,                  # Polar chart visualization
        "top_jobs": top3Jobs,        # Top 3 recommended jobs
        "block_scores": scoresByBlock  # Average score per competency block
    }
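
Steps 7 to 9 can be pictured with toy vectors. In the sketch below each user field and each competency is a hand-written 3-dimensional vector standing in for an SBERT embedding. The code keeps, for every competency, the best cosine similarity over all user fields, then applies the 10% block bonus. All names and numbers are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy stand-ins for embeddings of two user answers and two competencies.
user_fields = {"languages": [1.0, 0.0, 0.0], "experience_text": [0.0, 1.0, 0.0]}
competencies = {"Write SQL queries": [1.0, 0.0, 0.0], "Train neural networks": [0.0, 1.0, 1.0]}

# Step 7: for each competency keep the highest similarity over all user fields.
similarity = {
    name: max(cosine(field_vec, comp_vec) for field_vec in user_fields.values())
    for name, comp_vec in competencies.items()
}

# Step 9: boost by 10% of the (hypothetical) similarity between the user and the competency's block.
block_similarity = {"Write SQL queries": 0.5, "Train neural networks": 0.2}
weighted = {name: sim * (1 + 0.1 * block_similarity[name]) for name, sim in similarity.items()}

print(similarity)
print(weighted)
```

Because of the bonus, a weighted score can exceed 1, so percentages slightly above 100 are possible in principle.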



# Streamlit Frontend (`front.py`)

Below is the full Streamlit code that defines the web interface for user input and visualization of job recommendations.  
It collects user information, sends it to the `nlp()` function, and displays results (top 3 jobs, polar chart, and competency bars).
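
Since the sliders always return a value between 1 and 5, only the text inputs can be left empty. A minimal, framework-free sketch of that check is shown below; `missing_fields` is a hypothetical helper, not part of `front.py`.

```python
def missing_fields(answers):
    """Return the names of text answers that are empty or contain only whitespace."""
    return [name for name, value in answers.items() if not value.strip()]


# Hypothetical form content.
answers = {
    "tools": "Power BI",
    "languages": "  ",
    "frameworks": "",
    "experience_text": "Two internships",
}
missing = missing_fields(answers)
print(missing)
```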


In [None]:
import streamlit as st
import plotly.graph_objects as go
import pandas as pd
import sys
import os

# Get the absolute path of the current file's directory (interface/)
current_dir = os.path.dirname(os.path.abspath(__file__))
# Get the path of the parent directory (the project root)
project_root = os.path.dirname(current_dir)
# Add the project root to the system's path
sys.path.append(project_root)

# Now, Python can find the NLP module which is in a sibling directory
from NLP import main

def start():
    st.set_page_config(page_title="Job Finder", page_icon="💼", layout="wide")
    st.title("Find Your Ideal Job 🔎")
    st.write("Answer a few questions and find out the best job opportunities based on your profile.")

    with st.form("job_form"):
        st.header("👤 Your Profile")

        # Using columns for a cleaner layout
        col1, col2 = st.columns(2)
        with col1:
            st.subheader("📊 Rate Your Skills (1-Beginner, 5-Expert)")
            level_python = st.slider("How much do you love Python?", 1, 5, 3)
            level_ai = st.slider("Do you like working with AI?", 1, 5, 2)
            level_visu = st.slider("Can you make art with data?", 1, 5, 1)
            level_sql = st.slider("How confident are you in your SQL knowledge?", 1, 5, 2)
            level_token_embedding = st.slider("How familiar are you with tokenization and embeddings?", 1, 5, 1)

        with col2:
            st.subheader("💡 Domains & Tools")
            tools = st.text_input("Tools / software (e.g., Power BI, Excel)")
            languages = st.text_input("Programming languages (e.g., Python, R, SQL)")
            frameworks = st.text_input("AI frameworks / libraries (e.g., Scikit-learn, Pandas)")
            data_types = st.text_input("Types of data handled (e.g., tabular, text, images)")
            preferred_domains = st.text_input("Preferred domains (e.g., finance, healthcare)")
        
        st.subheader("📝 Describe Your Experience")
        experience_text = st.text_area("Provide a summary of your projects and professional experience.", height=150)
        challenges = st.text_area("What was the biggest challenge you faced in your projects, and how did you overcome it?", height=150)
        learning_goals = st.text_area("What skills or domains are you looking to improve or learn next?", height=100)

        submitted = st.form_submit_button("🔍 Find My Job")

    if submitted:
        # Sliders always return a value from 1 to 5, so only the text inputs can be empty.
        text_fields = [tools, languages, frameworks, data_types, preferred_domains,
                       experience_text, challenges, learning_goals]
        if all(text_fields):

            # Reject answers that contain only whitespace.
            if not all(field.strip() for field in text_fields):
                st.warning("⚠️ Please fill in all the text fields for an accurate analysis!")
            else:
                with st.spinner('Analyzing your profile...'):
                    results = main.nlp(
                        level_python, level_ai, level_visu, level_sql, level_token_embedding,
                        tools, languages, frameworks, data_types, preferred_domains,
                        experience_text, challenges, learning_goals
                    )

                if results:
                    st.header("📈 Your Personalized Results Dashboard")
                    st.subheader("🏆 Your Top 3 Job Recommendations")
                    cols = st.columns(3)
                    for i, job in enumerate(results["top_jobs"]):
                        with cols[i]:
                            st.metric(label=job['title'], value=f"{job['score']}% Match")
                            with st.expander("Why this recommendation?"):
                                st.write("This role is a good fit because of your skills in:")
                                for skill in job['matching_skills']:
                                    st.markdown(f"- **{skill}**")
                    
                    st.markdown("---")

                    col1, col2 = st.columns(2)
                    with col1:
                        st.subheader("🎯 Overall Profile Match")
                        st.plotly_chart(results["fig"], use_container_width=True)
                    with col2:
                        st.subheader("💡 Competency Block Coverage")
                        st.write("This shows how well your profile covers different skill areas.")
                        block_names = list(results["block_scores"].keys())
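                        # Scores are fractions; convert to rounded percentages to match the chart title
                        block_values = [round(float(v) * 100, 2) for v in results["block_scores"].values()]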
                        bar_fig = go.Figure([go.Bar(x=block_values, y=block_names, orientation='h', text=block_values, textposition='auto', marker_color='#4169E1')])
                        bar_fig.update_layout(title="Coverage per Skill Category (%)", xaxis_title="Coverage Score", yaxis_title="Competency Block")
                        st.plotly_chart(bar_fig, use_container_width=True)
                else:
                    st.error("❌ Could not process your profile. Please check if the data files are available.")
        else:
            st.warning("Please answer all the questions!")

        
start()

### Run the Streamlit Frontend Interface

In [2]:
!streamlit run ../interface/front.py

