# Notebook 4: Gap Analyst Agent - Advanced Normalization

### Objective
This notebook develops and tests the `Gap Analyst` agent. Its key feature is a semantic normalization pipeline: instead of relying on hand-written rules, we use sentence embeddings and community detection (a form of clustering) to group near-duplicate skill strings extracted by our `Market Analyst` agent and map each group to a single standard name. This gives the gap analysis a more consistent skill vocabulary, and the same approach carries over to new job categories without writing new rules.

## 1. Setup and Data Loading

In [9]:
import pandas as pd
import ast
from sentence_transformers import SentenceTransformer, util

# Load the processed data from Notebook 2
processed_file_path = '../data/processed/processed_google_jobs.csv'
df = pd.read_csv(processed_file_path)

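# Convert string-encoded lists (e.g. "['python', 'sql']") back to list objects
def safe_converter(value):
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Missing or malformed cells become an empty list
        return []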

df['extracted_technical_skills'] = df['extracted_technical_skills'].apply(safe_converter)
df['extracted_soft_skills'] = df['extracted_soft_skills'].apply(safe_converter)
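
To make the conversion step concrete, here is a small, self-contained sketch of how string-encoded lists behave when passed through `ast.literal_eval`. The `parse_skill_list` name and the sample values are illustrative assumptions, not rows from the dataset; the extra `isinstance` guard is also an addition that the cell above does not have.

```python
import ast

def parse_skill_list(value):
    """Parse a string such as "['python', 'sql']" into a list; return [] on failure."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Missing values (NaN) and malformed strings land here
        return []
    # Guard against cells that parse to something other than a list
    return parsed if isinstance(parsed, list) else []

# Hypothetical cell values as they might appear after a CSV round trip
samples = ["['python', 'sql']", "[]", "not a list", float("nan")]
for raw in samples:
    print(repr(raw), "->", parse_skill_list(raw))
```

Only the first sample yields a non-empty list; the others fall back to `[]`, which keeps the later aggregation step from failing on bad rows.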

## 2. Initial Skill Aggregation for a Target Role

Before we can normalize the skills, we first need to collect them. We'll choose a target job category (e.g., 'Software Engineering') and aggregate all the raw skill strings mentioned in those job postings.

In [2]:
target_category = 'Software Engineering'

df_filtered = df[df['Category'].str.contains(target_category, case=False, na=False)]

tech_skills = [skill for sublist in df_filtered['extracted_technical_skills'] for skill in sublist if skill]
soft_skills = [skill for sublist in df_filtered['extracted_soft_skills'] for skill in sublist if skill]
all_skills_raw = tech_skills + soft_skills

unique_skills = sorted({skill.strip().lower() for skill in all_skills_raw})

print(f"Found {len(all_skills_raw)} total skill instances.")
print(f"Found {len(unique_skills)} unique skills to normalize.")

Found 278 total skill instances.
Found 110 unique skills to normalize.


## 3. Advanced Normalization via Embeddings and Clustering

This is the core of our intelligent normalization pipeline. We will process our list of `unique_skills` in three steps:
1.  **Generate Embeddings**: Convert each skill string into a meaningful numerical vector using a pre-trained `SentenceTransformer` model.
2.  **Cluster Skills**: Use a community detection algorithm to group vectors (and thus skills) that are semantically similar.
3.  **Generate Normalization Map**: Create a mapping dictionary from these clusters to standardize skill variations.
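
The idea behind steps 1 and 2 can be sketched without any model: treat each skill as a vector, compare vectors by cosine similarity, and group the ones that exceed a threshold. The code below is a simplified stand-in for community detection, not the library's implementation, and the 3-dimensional vectors are made up for illustration (the embeddings generated in this notebook have 384 dimensions).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_by_threshold(vectors, threshold=0.85, min_size=2):
    """Simplified threshold-based grouping (illustrative only)."""
    n = len(vectors)
    # For each candidate centre, collect every item at least `threshold` similar to it
    candidates = []
    for i in range(n):
        members = [j for j in range(n)
                   if cosine_similarity(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # Prefer larger groups, then drop members already claimed by an earlier group
    candidates.sort(key=len, reverse=True)
    assigned, groups = set(), []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            groups.append(free)
            assigned.update(free)
    return groups

# Hypothetical toy "embeddings" for four skills
toy_skills = ["problem solving", "problem-solving", "python", "sql"]
toy_vectors = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.02],
    [0.00, 1.00, 0.00],
    [0.00, 0.00, 1.00],
]

for group in group_by_threshold(toy_vectors):
    print([toy_skills[i] for i in group])
```

With these toy vectors, only the two spellings of "problem solving" are close enough to form a group; `python` and `sql` stay unclustered because no other vector reaches the threshold.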

In [3]:
print("Loading sentence transformer model...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded.")

skill_embeddings = model.encode(unique_skills, convert_to_tensor=True)
print(f"Embeddings generated. Shape: {skill_embeddings.shape}")

Loading sentence transformer model...
Model loaded.
Embeddings generated. Shape: torch.Size([110, 384])


In [4]:
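# We use a community detection algorithm to find clusters of similar skills.
# `threshold=0.85` is the minimum cosine similarity for two skills to count as similar.
# `min_community_size=2` drops singletons: a skill is clustered only if it has at least one close match.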

print("Clustering similar skills...")
clusters = util.community_detection(skill_embeddings, min_community_size=2, threshold=0.85)
print(f"Clustering complete. Found {len(clusters)} skill clusters.")

print("\n--- Sample Skill Clusters ---")
for i, cluster in enumerate(clusters[:10]):
    print(f"Cluster {i+1}: {[unique_skills[skill_id] for skill_id in cluster]}")

Clustering similar skills...
Clustering complete. Found 5 skill clusters.

--- Sample Skill Clusters ---
Cluster 1: ['unix/linux', 'unix', 'linux']
Cluster 2: ['c++', 'c/c++']
Cluster 3: ['code review', 'code reviews']
Cluster 4: ['objective c', 'objective-c']
Cluster 5: ['problem solving', 'problem-solving']


In [5]:
# Now, we create our normalization map automatically from the clusters.
skill_mapping = {}
for cluster in clusters:

    cluster_skills = [unique_skills[skill_id] for skill_id in cluster]
    
    # Choose the shortest name as the standard for the group.
    # On a length tie, min() keeps whichever name appears first in the cluster.
    canonical_name = min(cluster_skills, key=len)
    
    # Map all skills in the cluster to the canonical name
    for skill in cluster_skills:
        skill_mapping[skill] = canonical_name

print(f"Generated a normalization map with {len(skill_mapping)} entries.")
print("\n--- Sample of the generated map ---")
for i, (key, value) in enumerate(skill_mapping.items()):
    if i >= 10: break
    print(f"'{key}'  ->  '{value}'")

Generated a normalization map with 11 entries.

--- Sample of the generated map ---
'unix/linux'  ->  'unix'
'unix'  ->  'unix'
'linux'  ->  'unix'
'c++'  ->  'c++'
'c/c++'  ->  'c++'
'code review'  ->  'code review'
'code reviews'  ->  'code review'
'objective c'  ->  'objective c'
'objective-c'  ->  'objective c'
'problem solving'  ->  'problem solving'
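
The effect of the shortest-name rule is easier to see in isolation. The sketch below rebuilds the map from three of the clusters printed above, using plain lists of strings in place of index-based clusters; the `build_mapping` helper is a hypothetical name introduced for this example.

```python
# Clusters copied from the sample output above, as lists of skill strings
sample_clusters = [
    ["unix/linux", "unix", "linux"],
    ["c++", "c/c++"],
    ["objective c", "objective-c"],
]

def build_mapping(clusters):
    """Map every skill in a cluster to the cluster's shortest member."""
    mapping = {}
    for cluster_skills in clusters:
        # min() returns the first shortest item, so ties go to the earlier entry
        canonical_name = min(cluster_skills, key=len)
        for skill in cluster_skills:
            mapping[skill] = canonical_name
    return mapping

mapping = build_mapping(sample_clusters)
print(mapping["objective-c"])  # 'objective c' and 'objective-c' tie at 11 characters
print(mapping["linux"])        # 'unix' is shorter than 'linux', so it becomes the label
```

One consequence worth keeping in mind: the rule labels the whole Unix/Linux cluster `unix`, so a posting that asks only for Linux is counted under `unix`. If that distinction matters for a given role, picking the most frequent member of each cluster is one possible alternative.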


In [6]:
# Apply the basic normalization (lowercase, strip) first
normalized_skills = [skill.strip().lower() for skill in all_skills_raw]

# Apply the advanced semantic normalization using our auto-generated map
final_skills = [skill_mapping.get(skill, skill) for skill in normalized_skills]

# Now, we calculate the frequencies on the fully cleaned and normalized data
SKILL_BENCHMARK_COUNT = 30
market_skills_series = pd.Series(final_skills)
top_market_skills = market_skills_series.value_counts().head(SKILL_BENCHMARK_COUNT)


market_skills_list = top_market_skills.index.tolist()

print(f"\n--- Top {SKILL_BENCHMARK_COUNT} Most In-Demand Market Skills (Semantically Normalized) ---")
print(market_skills_list)


--- Top 30 Most In-Demand Market Skills (Semantically Normalized) ---
['problem solving', 'c++', 'communication', 'python', 'teamwork', 'java', 'agile methodologies', 'javascript', 'c', 'go', 'c#', 'unix', 'objective c', 'collaboration', 'leadership', 'rtl', 'code review', 'stratus', 'perl', 'shell', 'systemc', 'machine learning', 'catapult', 'vivado', 'vp9', 'angularjs', 'sql', 'css', 'unit testing', 'software design']
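
To see why normalization matters for the ranking, the sketch below counts a small list of skill mentions before and after applying a map. It uses `collections.Counter` in place of pandas purely to stay dependency-free, and the mentions are hypothetical rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical raw mentions (not taken from the dataset)
raw_mentions = ["C++", "c/c++", " python", "Python", "code reviews", "code review", "c++"]

# A slice of the kind of map generated in the previous step
sample_mapping = {"c/c++": "c++", "code reviews": "code review"}

# Basic normalization: lowercase and strip whitespace
basic = [skill.strip().lower() for skill in raw_mentions]
# Semantic normalization: replace known variants with their canonical name
semantic = [sample_mapping.get(skill, skill) for skill in basic]

print("Basic only:", Counter(basic).most_common())
print("With map:  ", Counter(semantic).most_common())
```

With basic normalization alone, `c++` and `c/c++` are counted separately; after the map is applied their mentions are pooled, which can change which skills make it into the top of the benchmark.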


## 4. Defining the User's Current Skills (Simulation)

With the market skill benchmark established, we now need the second piece of input for our `Gap Analyst`: a list of skills the user already possesses. In a real application, this would be provided dynamically. For our development and testing, we will simulate it with a predefined list.

In [7]:
user_skills_list = [
    'python',
    'sql',
    'java',
    'git',
    'data structures',
    'algorithms',
    'debugging',
    'problem-solving',
    'communication',
    'teamwork',
    'scrum',
    'api design'
]

print("--- Simulated User Skills ---")
print(user_skills_list)
print(f"\nThe simulated user has {len(user_skills_list)} skills.")

--- Simulated User Skills ---
['python', 'sql', 'java', 'git', 'data structures', 'algorithms', 'debugging', 'problem-solving', 'communication', 'teamwork', 'scrum', 'api design']

The simulated user has 12 skills.


## 5. Implementing and Running the Gap Analyst

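With both the market and user skill lists prepared, we can now define and run the core logic of the `Gap Analyst`. The function lowercases and strips both lists, converts them to Python sets, and returns the market skills that are missing from the user's set. Note that the comparison is on exact strings: a user skill is recognized only if it is spelled the same way as the benchmark entry.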

In [8]:
def find_skill_gaps(market_skills, user_skills):
    """
    Compare market skills with user skills and return the gaps.

    Both arguments are iterables of skill strings. Matching is by exact string
    after lowercasing and stripping, so spelling variants are not merged here.
    Returns a sorted list of market skills that the user does not have.
    """
    # Normalize inputs and convert to sets for efficient difference calculation
    market_set = {skill.lower().strip() for skill in market_skills}
    user_set = {skill.lower().strip() for skill in user_skills}
    
    # Calculate the skills that are in the market set but not in the user's set
    gap_set = market_set.difference(user_set)
    
    # Return as a sorted list for consistent and readable output
    return sorted(gap_set)


skill_gaps = find_skill_gaps(market_skills_list, user_skills_list)

print("\n--- Analysis Complete ---")
print(f"Target Role: '{target_category}'")
print(f"Total Market Skills in Benchmark: {len(market_skills_list)}")
print(f"User's Current Skills: {len(user_skills_list)}")
print("---------------------------------")
print(f"\nIdentified {len(skill_gaps)} Skill Gaps to focus on:")


for skill in skill_gaps:
    print(f"- {skill.title()}")


--- Analysis Complete ---
Target Role: 'Software Engineering'
Total Market Skills in Benchmark: 30
User's Current Skills: 12
---------------------------------

Identified 25 Skill Gaps to focus on:
- Agile Methodologies
- Angularjs
- C
- C#
- C++
- Catapult
- Code Review
- Collaboration
- Css
- Go
- Javascript
- Leadership
- Machine Learning
- Objective C
- Perl
- Problem Solving
- Rtl
- Shell
- Software Design
- Stratus
- Systemc
- Unit Testing
- Unix
- Vivado
- Vp9
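
The output above lists `Problem Solving` as a gap even though the simulated user has `'problem-solving'`. The market list went through the normalization map, but the user list is compared as exact strings. One possible fix, sketched below under the assumption that the same map is available at comparison time, is to pass both lists through it first. The `find_skill_gaps_normalized` name and its `mapping` parameter are hypothetical and not part of the notebook.

```python
def find_skill_gaps_normalized(market_skills, user_skills, mapping=None):
    """Like find_skill_gaps, but maps both lists to canonical names first."""
    mapping = mapping or {}

    def canonical(skill):
        # Basic cleanup, then look up the canonical name if one exists
        cleaned = skill.lower().strip()
        return mapping.get(cleaned, cleaned)

    market_set = {canonical(skill) for skill in market_skills}
    user_set = {canonical(skill) for skill in user_skills}
    return sorted(market_set - user_set)

# Small excerpt of the lists and map from this notebook
market_excerpt = ["problem solving", "c++", "python", "unix"]
user_excerpt = ["python", "problem-solving", "git"]
mapping_excerpt = {
    "problem solving": "problem solving",
    "problem-solving": "problem solving",
}

# Without a map the comparison is exact-string, as in find_skill_gaps
print(find_skill_gaps_normalized(market_excerpt, user_excerpt))
# With the map, 'problem-solving' matches 'problem solving' and drops out of the gaps
print(find_skill_gaps_normalized(market_excerpt, user_excerpt, mapping_excerpt))
```

This only helps for variants that ended up in a cluster. A user skill such as `scrum` would still not match `agile methodologies` unless the two were grouped by the clustering step.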


## 6. Conclusion and Modularization

This development notebook successfully achieved its objectives. We have:

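1.  **Developed the logic for `get_top_skills_by_category`**: The steps for building a dynamic skill benchmark (aggregation, semantic normalization, and frequency ranking) were developed and tested as cells in this notebook, ready to be wrapped in a single function.
2.  **Developed `find_skill_gaps`**: The core logic for the `Gap Analyst` agent was defined and run against a simulated user, returning the benchmark skills that are absent from the user's list. Because it compares exact strings, user skills should pass through the same normalization map first; otherwise a variant such as 'problem-solving' is reported as a gap.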

### Next Step: Creating the Application Modules

With the core logic for Phase 2 now validated, the final step is to create the permanent application modules. **These new `.py` files will be created with the final, validated code developed within this notebook.**

-   **`src/core/skill_processing.py`**: This file will contain the `get_top_skills_by_category` function, including the complete semantic normalization pipeline.
-   **`src/agents/gap_analyst.py`**: This file will contain the `find_skill_gaps` function, which forms the core logic of our second agent.

This action cleanly separates our experimental and development work (this notebook) from our final, organized application code (the `.py` modules).