# Notebook 2: Batch Processing (Colab GPU Version)

### Objective
The goal of this notebook is to leverage Google Colab's free GPU resources to apply our `Market_Analyst` agent at scale to the entire dataset.

### Key Steps:
1.  **Environment Setup**: Install and start Ollama, pull the required model (`llama3:8b`), and install all necessary Python libraries within the temporary Colab virtual machine.
2.  **Data Loading**: Mount Google Drive to access our project files and load the prepared dataset.
3.  **Agent Definition**: Define the agent's logic directly within the notebook for portability.
4.  **Batch Processing**: Iterate through every job description, using the GPU-accelerated LLM to extract skills and store the results.
5.  **Save Processed Data**: Write the final, enriched DataFrame to a new CSV file in Google Drive for use in the next notebook.

## 1. Environment Setup (Colab with GPU)

This first cell configures the entire Colab environment. It performs the following actions:
- Installs the Ollama server.
- Starts the server in the background.
- Pulls the `llama3:8b` model.
- Installs all required Python packages.

In [1]:
import subprocess
import time

# --- 1. Install Ollama ---
print("Starting Ollama installation...")
!curl -fsSL https://ollama.com/install.sh | sh
print("Ollama has been installed.")

# --- 2. Start the Ollama Server as a Background Process ---
print("Starting Ollama server in the background...")
command = "ollama serve"
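# Discard server logs: an unread PIPE buffer can fill up and stall the server.
server_process = subprocess.Popen(command.split(), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
time.sleep(25)  # Give the server time to start before pulling the model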
print("Ollama server process started.")

# --- 3. Pull the LLM Model ---
print("Pulling the llama3:8b model...")
!ollama pull llama3:8b
print("Model 'llama3:8b' pulled successfully.")

# --- 4. Install Required Python Libraries ---
print("Installing Python libraries...")
# The '-q' flag makes pip's output less verbose
!pip install -q langchain langchain-community langchain-ollama pandas numpy tqdm matplotlib seaborn
print("Required libraries have been installed.")

# --- 5. Health Check: Verify Server and Model Availability ---
print("\n--- Verifying Ollama Server Status ---")
# This command confirms the server is responsive and the model is ready
!ollama list
print("\n Environment configured and ready for use!")

Starting Ollama installation...
>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
Ollama has been installed.
Starting Ollama server in the background...
Ollama server process started.
Pulling the llama3:8b model...
[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l[?2026h[?25l[1G[?25h[?2026l
Model 'llama3:8b' pulled successfully.
Installing Python libraries...
Required libraries have been installed.
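
The setup cell waits a fixed 25 seconds for the server to come up. As a hedged alternative, the sketch below polls the address reported by the installer (`127.0.0.1:11434`) until it answers. The helper name `wait_for_server` and its `probe` parameter are assumptions introduced for illustration; they are not part of Ollama or of the original notebook.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://127.0.0.1:11434", timeout=60.0, interval=1.0, probe=None):
    """Poll `url` until it answers or `timeout` seconds pass; return True on success."""

    def http_probe(target):
        # Any HTTP 200 response is treated as "the server is up".
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    # `probe` is a hypothetical hook so the check can be swapped out or tested offline.
    check = probe or http_probe
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check(url):
            return True
        time.sleep(interval)
    return False
```

In the setup cell, a call such as `wait_for_server()` could stand in for `time.sleep(25)`, with an error raised if it returns `False`.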

## 2. Data Loading & Preparation

Now, we'll mount Google Drive to the Colab environment. This allows us to access our project files as if they were local. After mounting, we will load the dataset with pandas and recreate the `full_description` column.

In [2]:
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/drive')

file_path = '/content/drive/MyDrive/agentic-skill-mapper/data/raw/job_skills.csv'

try:
    df = pd.read_csv(file_path)
    print(f"\n Dataset loaded successfully from: {file_path}")

    # --- Re-create the 'full_description' column ---
    text_columns = ['Responsibilities', 'Minimum Qualifications', 'Preferred Qualifications']
    for col in text_columns:
        df[col] = df[col].fillna('')
    df['full_description'] = df[text_columns].apply(lambda x: '\n\n'.join(x), axis=1)
    print(" 'full_description' column created successfully.")

except FileNotFoundError:
    print(f"\n ERROR: File not found at '{file_path}'.")
    print("Please verify the file path and ensure the file is in your Google Drive.")

df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

 Dataset loaded successfully from: /content/drive/MyDrive/agentic-skill-mapper/data/raw/job_skills.csv
 'full_description' column created successfully.


| | Company | Title | Category | Location | Responsibilities | Minimum Qualifications | Preferred Qualifications | full_description |
|---|---|---|---|---|---|---|---|---|
| 0 | Google | Google Cloud Program Manager | Program Management | Singapore | Shape, shepherd, ship, and show technical prog... | BA/BS degree or equivalent practical experienc... | Experience in the business technology market a... | Shape, shepherd, ship, and show technical prog... |
| 1 | Google | Supplier Development Engineer (SDE), Cable/Con... | Manufacturing & Supply Chain | Shanghai, China | Drive cross-functional activities in the suppl... | BS degree in an Engineering discipline or equi... | BSEE, BSME or BSIE degree.\nExperience of usin... | Drive cross-functional activities in the suppl... |
| 2 | Google | Data Analyst, Product and Tools Operations, Go... | Technical Solutions | New York, NY, United States | Collect and analyze data to draw insight and i... | Bachelor’s degree in Business, Economics, Stat... | Experience partnering or consulting cross-func... | Collect and analyze data to draw insight and i... |
| 3 | Google | Developer Advocate, Partner Engineering | Developer Relations | Mountain View, CA, United States | Work one-on-one with the top Android, iOS, and... | BA/BS degree in Computer Science or equivalent... | Experience as a software developer, architect,... | Work one-on-one with the top Android, iOS, and... |
| 4 | Google | Program Manager, Audio Visual (AV) Deployments | Program Management | Sunnyvale, CA, United States | Plan requirements with internal customers.\nPr... | BA/BS degree or equivalent practical experienc... | CTS Certification.\nExperience in the construc... | Plan requirements with internal customers.\nPr... |


## 3. Agent Definition

For maximum portability and simplicity within the Colab environment, we will define our agent's Pydantic model and processing function directly in this notebook. The logic is identical to the one in our `market_analyst.py` script.

In [3]:
from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field
from typing import List

# -- Pydantic Model for Structured Output --
# Defines the exact data structure we want the LLM to return.
# This ensures the output is consistent, predictable, and easy to work with.

class SkillSet(BaseModel):
    """Output model representing the extracted skills from a job description."""
    technical_skills: List[str] = Field(description="A comprehensive list of specific technical skills, e.g., 'Python', 'PyTorch', 'AWS S3', 'SQL', 'Git'.")
    soft_skills: List[str] = Field(description="A list of soft or behavioral skills, e.g., 'Teamwork', 'Agile Methodologies', 'Problem-solving', 'Communication'.")



def analyze_job_description(job_description_text: str) -> SkillSet:
    """
    Analyzes a job description using a local LLM to extract technical and soft skills.

    This function connects to a local Ollama server, sends the job description
    with a structured prompt, and parses the JSON output into a Pydantic model.

    Args:
        job_description_text: The full text of the job description to be analyzed.

    Returns:
        A SkillSet object containing the lists of extracted skills.
    """
    # Initialize the LLM connection.
    # Temperature=0 makes the output more deterministic and less "creative".
    llm = OllamaLLM(model="llama3:8b", temperature=0)

    # The parser converts the LLM's JSON string output into a Python dict.
    # The SkillSet schema is used to generate the format instructions;
    # the dict is turned into a SkillSet object after the chain runs.
    parser = JsonOutputParser(pydantic_object=SkillSet)

    # The prompt template is the core instruction for the LLM.
    # It defines its role, the task, rules, and the expected output format.
    prompt_template = """
    You are an expert AI assistant specialized in tech recruitment and HR analytics.
    Your primary task is to meticulously analyze the following job description and extract all relevant technical and soft skills.

    Follow these rules precisely:
    1.  **Extract, do not infer**: Only list skills that are explicitly mentioned or very strongly implied in the text.
    2.  **Be Specific**: Avoid overly generic terms. For example, prefer 'AWS S3' over 'Cloud Storage'.
    3.  **Normalize Skills**: Use the canonical name for technologies (e.g., "Python" not "python programming", "PyTorch" not "Pytorch framework").
    4.  **Format**: You MUST return your response *only* as a valid JSON object that adheres to the provided schema. Do not add any introductory text, explanations, or markdown formatting around the JSON.

    {format_instructions}

    JOB DESCRIPTION TEXT TO ANALYZE:
    ---
    {job_description}
    ---
    """

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["job_description"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    # The LCEL (LangChain Expression Language) chain pipes the components together.
    # It's a sequence of operations: format the prompt -> send to LLM -> parse the output.
    chain = prompt | llm | parser

    try:
        # The invoke method runs the chain with the provided input.
        result = chain.invoke({"job_description": job_description_text})
        return SkillSet(**result)
    except Exception as e:
        print(f"Market_Analyst Agent: An error occurred during analysis: {e}")
        # Return an empty SkillSet in case of an error to prevent crashes.
        return SkillSet(technical_skills=[], soft_skills=[])
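
Even with the "JSON only" rule in the prompt, a model may still wrap its answer in prose or a markdown fence. The sketch below is a standalone illustration of one way to recover the first JSON object from such a reply, independent of whatever cleanup the LangChain parser already performs. The helper names `extract_json_object` and `to_skill_lists` are hypothetical and do not appear in `market_analyst.py`.

```python
import json


def extract_json_object(raw_text):
    """Return the first decodable JSON object found in `raw_text`, or None."""
    decoder = json.JSONDecoder()
    position = raw_text.find("{")
    while position != -1:
        try:
            # raw_decode parses one JSON value starting at `position`
            # and ignores any trailing text after it.
            candidate, _ = decoder.raw_decode(raw_text, position)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict):
            return candidate
        position = raw_text.find("{", position + 1)
    return None


def to_skill_lists(raw_text):
    """Map a raw reply to the two skill lists, defaulting to empty lists."""
    data = extract_json_object(raw_text) or {}

    def as_list(key):
        value = data.get(key)
        return value if isinstance(value, list) else []

    return {
        "technical_skills": as_list("technical_skills"),
        "soft_skills": as_list("soft_skills"),
    }
```

The returned dict has the same two keys as `SkillSet`, so it could be passed to `SkillSet(**...)` in the same way as the parser's output.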

## 4. Batch Processing with GPU Acceleration

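Now we will execute the main task. We'll iterate through our DataFrame, feeding each `full_description` to our agent. Thanks to the Colab GPU, this process should be significantly faster than on a local CPU. Because a Colab session can disconnect at any time, the loop writes a checkpoint to Google Drive every 25 rows and resumes from the last saved row on the next run. We will use `tqdm` to monitor the progress.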

In [4]:
import os
from tqdm.auto import tqdm
import ast

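# --- 1. Checkpoint Configuration ---
checkpoint_path = '/content/drive/MyDrive/agentic-skill-mapper/data/processed/processed_job_skills_checkpoint.csv'
SAVE_EVERY_N_ROWS = 25
# Make sure the target folder exists before the first checkpoint is written.
os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)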

# --- 2. Load Progress or Initialize ---
start_index = 0
try:
    df = pd.read_csv(checkpoint_path, keep_default_na=False)
    print(f"Checkpoint file found at '{checkpoint_path}'. Loading progress...")

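    def robust_converter(value):
        """Turn a stringified list from the CSV back into a Python list."""
        # Empty cells mark rows that have not been processed yet.
        if pd.isna(value) or value == '':
            return None
        try:
            return ast.literal_eval(value)
        except (ValueError, SyntaxError):
            # Malformed cell: treat the row as processed with no skills found.
            return []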

    df['extracted_technical_skills'] = df['extracted_technical_skills'].apply(robust_converter)
    df['extracted_soft_skills'] = df['extracted_soft_skills'].apply(robust_converter)

    start_index = df['extracted_technical_skills'].notna().sum()
    print(f"Resuming processing from index {start_index}.")

except FileNotFoundError:
    print("No checkpoint file found. Starting from the beginning.")
    df['extracted_technical_skills'] = [None] * len(df)
    df['extracted_soft_skills'] = [None] * len(df)


# --- 3. Main Processing Loop ---
if start_index < len(df):
    for index, row in tqdm(df.iloc[start_index:].iterrows(), initial=start_index, total=len(df), desc="Processing Jobs"):
        description = row['full_description']

        if isinstance(description, str) and description.strip() != '':
            skills_result = analyze_job_description(description)
            df.at[index, 'extracted_technical_skills'] = skills_result.technical_skills
            df.at[index, 'extracted_soft_skills'] = skills_result.soft_skills
        else:
            df.at[index, 'extracted_technical_skills'] = []
            df.at[index, 'extracted_soft_skills'] = []

        if (index + 1) % SAVE_EVERY_N_ROWS == 0:
            df.to_csv(checkpoint_path, index=False)

    df.to_csv(checkpoint_path, index=False)
    print("\n Batch processing complete! Final results saved.")
else:
    print("\n Processing was already 100% complete from previous runs.")


display(df[['Title', 'extracted_technical_skills', 'extracted_soft_skills']].tail())

Checkpoint file found at '/content/drive/MyDrive/agentic-skill-mapper/data/processed/processed_job_skills_checkpoint.csv'. Loading progress...
Resuming processing from index 1225.


Processing Jobs:  98%|#########8| 1225/1250 [00:00<?, ?it/s]


 Batch processing complete! Final results saved.


| | Title | extracted_technical_skills | extracted_soft_skills |
|---|---|---|---|
| 1245 | Global Investigator | [Google's codes, Google's policies] | [Teamwork, Problem-solving, Communication, Lea... |
| 1246 | Campus Security Manager | [Google applications, Common business software... | [Teamwork, Flexibility, Independence, Written ... |
| 1247 | Facilities Manager | [Google products] | [Teamwork, Agile Methodologies, Problem-solvin... |
| 1248 | Physical Security Manager | [] | [multi-disciplinary approach, high ethical sta... |
| 1249 | Physical Security Manager | [Google Security Vendor Guard Force] | [Agility, Attention to Detail, Discretion, Eff... |
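
The resume logic in the cell above can be hard to follow inside the DataFrame code, so here is a stripped-down sketch of the same idea. It is a simplification under stated assumptions: results are kept in a plain list and saved as JSON rather than CSV, and the names `process_with_checkpoints` and `analyze` are hypothetical stand-ins for the notebook's loop and agent call.

```python
import json
import os


def process_with_checkpoints(items, analyze, checkpoint_path, save_every=25):
    """Apply `analyze` to each item, saving partial results so a rerun can resume."""
    results = []
    if os.path.exists(checkpoint_path):
        # A previous run left partial results behind: load them first.
        with open(checkpoint_path, encoding="utf-8") as handle:
            results = json.load(handle)

    def save():
        with open(checkpoint_path, "w", encoding="utf-8") as handle:
            json.dump(results, handle)

    # Results are stored in input order, so their count is the resume position.
    for position in range(len(results), len(items)):
        results.append(analyze(items[position]))
        if (position + 1) % save_every == 0:
            save()

    save()  # Final save covers the rows after the last periodic checkpoint.
    return results
```

If the session dies mid-run, at most `save_every - 1` finished items are lost, and calling the function again with the same `checkpoint_path` continues from the last saved position.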


## 5. Save Processed Data

The final step in this notebook is to save our enriched DataFrame, which now contains all the extracted skills.

This action creates a persistent, structured dataset in a new CSV file within our Google Drive. By saving the results, we avoid ever having to re-run this time-consuming batch processing task.

This new file, `processed_google_jobs.csv`, will serve as the clean input for our next local notebook, where we will perform the final analysis.

In [7]:
output_path = '/content/drive/MyDrive/agentic-skill-mapper/data/processed/processed_google_jobs.csv'

os.makedirs(os.path.dirname(output_path), exist_ok=True)

df.to_csv(output_path, index=False)

print("DataFrame successfully saved")

DataFrame successfully saved
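
One thing to keep in mind for the next notebook: CSV has no list type, so the skill columns come back as strings such as `"['SQL', 'Python']"` unless they are parsed on load. The sketch below shows one way to do that with the `converters` argument of `pd.read_csv`. The two rows are toy data made up for illustration, and an in-memory string stands in for `processed_google_jobs.csv`.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the saved file; the rows are illustrative only.
csv_text = '''Title,extracted_technical_skills,extracted_soft_skills
Data Analyst,"['SQL', 'Python']","['Communication']"
Facilities Manager,[],"['Teamwork']"
'''

list_columns = ["extracted_technical_skills", "extracted_soft_skills"]

# Each converter receives the raw cell text and returns the parsed Python list.
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    converters={column: ast.literal_eval for column in list_columns},
)

print(reloaded["extracted_technical_skills"].tolist())
```

Note that `ast.literal_eval` raises on an empty cell, so this assumes every row was fully processed before the file was saved.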
