# Homework 2: CV Verification System

This notebook implements an agentic CV verification pipeline using MCP SocialGraph tools.


# Setup


## Installing packages


In [1]:
!pip install requests PyPDF2 gdown
!pip install 'markitdown[pdf]'
!pip install langchain_mcp_adapters langchain_google_genai langchain-openai




## Setup your API key

To run the following cell, your API key must be stored in a Colab Secret named `VERTEX_API_KEY`.


1.   Look for the key icon on the left panel of your Colab.
2.   Under `Name`, create `VERTEX_API_KEY`.
3.   Copy your key to `Value`.

If you cannot use `VERTEX_API_KEY`, you can use DeepSeek models via `DEEPSEEK_API_KEY`. This does not affect your score.


In [2]:
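from google.colab import userdata
# Note: this cell reads the secret named 'GEMINI_API_KEY'; the name passed here
# must match the name of the Colab Secret you actually created.
GEMINI_VERTEX_API_KEY = userdata.get('GEMINI_API_KEY')
# DEEPSEEK_API_KEY = userdata.get('DEEPSEEK_API_KEY')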
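
The fallback between the two keys described above can be sketched as a small helper. This is a minimal sketch, not part of the assignment: `first_available_secret` is a hypothetical name, and it assumes the secret getter either returns an empty value or raises when a secret is missing.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_available_secret(
    get: Callable[[str], Optional[str]],
    names: Sequence[str],
) -> Tuple[str, str]:
    """Return (name, value) for the first secret that is set and non-empty.

    `get` is any lookup function (hypothetical here); it may raise or
    return None/"" when the secret does not exist.
    """
    for name in names:
        try:
            value = get(name)
        except Exception:
            # Treat any lookup failure as "secret not available"
            value = None
        if value:
            return name, value
    raise RuntimeError(f"None of these secrets are set: {', '.join(names)}")


# Stand-in for a secret store that only has the DeepSeek key configured
demo_secrets = {"DEEPSEEK_API_KEY": "sk-demo"}
chosen_name, chosen_value = first_available_secret(
    demo_secrets.__getitem__, ["VERTEX_API_KEY", "DEEPSEEK_API_KEY"]
)
print(chosen_name)
```

In Colab, the same helper could be called with `userdata.get` as the getter; which model class to build from the chosen key is left to the rest of the notebook.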

## Downloading sample CVs

The code below downloads the sample CVs.

In [3]:
import os
import gdown

folder_id = "1adYKq7gSSczFP3iikfA8Er-HSZP6VM7D"
folder_url = f"https://drive.google.com/drive/folders/{folder_id}"

output_dir = "downloaded_cvs"
os.makedirs(output_dir, exist_ok=True)

gdown.download_folder(
    url=folder_url,
    output=output_dir,
    quiet=False,
    use_cookies=False
)

Retrieving folder contents


Processing file 1NR1RUKx4GyM7QOBxKXkfh4e8jUkxFCsp CV_1.pdf
Processing file 16lrd-uO8AAnCnv7UG9Rs_Nk6SUu0Iwbs CV_2.pdf
Processing file 15hVEuBan_EKhEty2aZBd6rcpDpP4o7Vr CV_3.pdf
Processing file 1Y2w_mAUEhg4vZBdvvR-0n3Jf2mKuGDRk CV_4.pdf
Processing file 1PLwkva-pdua6ZVvmLg9mxHeljq9D8C_C CV_5.pdf


Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1NR1RUKx4GyM7QOBxKXkfh4e8jUkxFCsp
To: /content/downloaded_cvs/CV_1.pdf
100%|██████████| 147k/147k [00:00<00:00, 36.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=16lrd-uO8AAnCnv7UG9Rs_Nk6SUu0Iwbs
To: /content/downloaded_cvs/CV_2.pdf
100%|██████████| 75.1k/75.1k [00:00<00:00, 6.52MB/s]
Downloading...
From: https://drive.google.com/uc?id=15hVEuBan_EKhEty2aZBd6rcpDpP4o7Vr
To: /content/downloaded_cvs/CV_3.pdf
100%|██████████| 72.0k/72.0k [00:00<00:00, 7.19MB/s]
Downloading...
From: https://drive.google.com/uc?id=1Y2w_mAUEhg4vZBdvvR-0n3Jf2mKuGDRk
To: /content/downloaded_cvs/CV_4.pdf
100%|██████████| 73.3k/73.3k [00:00<00:00, 31.3MB/s]
Downloading...
From: https://drive.google.com/uc?id=1PLwkva-pdua6ZVvmLg9mxHeljq9D8C_C
To: /content/downloaded_cvs

['downloaded_cvs/CV_1.pdf',
 'downloaded_cvs/CV_2.pdf',
 'downloaded_cvs/CV_3.pdf',
 'downloaded_cvs/CV_4.pdf',
 'downloaded_cvs/CV_5.pdf']

In [4]:
# =====================================================
#  Load and display all CV PDFs in order
# =====================================================
import os
from markitdown import MarkItDown

cv_dir = "downloaded_cvs"

# Initialize MarkItDown
md = MarkItDown(enable_plugins=False)

# Collect and sort PDFs numerically
pdf_files = sorted(
    [f for f in os.listdir(cv_dir) if f.lower().endswith(".pdf")],
    key=lambda x: int("".join(filter(str.isdigit, x)))  # CV_1.pdf → 1
)

all_cvs = []

for pdf_name in pdf_files:
    pdf_path = os.path.join(cv_dir, pdf_name)
    result = md.convert(pdf_path)

    all_cvs.append({
        "file": pdf_name,
        "text": result.text_content
    })

    print("=" * 80)
    print(f"📄 {pdf_name}")
    print("=" * 80)
    print(result.text_content)
    print(" ")


📄 CV_1.pdf
|     |     |     |     | John         |           | Smith        |                   |     |     |
| --- | --- | --- | --- | ------------ | --------- | ------------ | ----------------- | --- | --- |
|     |     |     |     | Marketing    |           | Professional |                   |     |     |
|     |     |     |     | + Singapore, | Singapore |              | (cid:209) Kowloon |     |     |
Experience
|                |                  |     |          |                     |              |            |     | 2020 – | Present |
| -------------- | ---------------- | --- | -------- | ------------------- | ------------ | ---------- | --- | ------ | ------- |
| Engineer,      | ByteDance        |     |          |                     |              |            |     |        |         |
| • Worked       | in a fast-paced, |     | global   | technology          | environment. |            |     |        |         |
| • Collaborated | across           |     | teams
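
The numeric sort key used when collecting the PDFs above matters once file numbers reach two digits, because plain string sorting places `CV_10.pdf` before `CV_2.pdf`. A minimal sketch of the difference follows; `numeric_key` is a hypothetical helper, and returning `-1` for names without digits is an assumption added here (the original lambda would raise on such names).

```python
def numeric_key(filename: str) -> int:
    """Sort key: the integer formed by all digits in the file name."""
    digits = "".join(filter(str.isdigit, filename))
    # Assumption: files without any digit sort first instead of raising
    return int(digits) if digits else -1


names = ["CV_10.pdf", "CV_2.pdf", "CV_1.pdf"]

lexicographic = sorted(names)
numeric = sorted(names, key=numeric_key)

print(lexicographic)  # ['CV_1.pdf', 'CV_10.pdf', 'CV_2.pdf']
print(numeric)        # ['CV_1.pdf', 'CV_2.pdf', 'CV_10.pdf']
```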

# Connect to our MCP server

Documentation about MCP: https://modelcontextprotocol.io/docs/getting-started/intro.

Using MCP servers in LangChain: https://docs.langchain.com/oss/python/langchain/mcp.


## Check which tools the MCP server provides


In [7]:
import asyncio
import json
import re
from langchain_core.tools import tool
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage, ToolMessage
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_mcp_adapters.client import MultiServerMCPClient
from pydantic import BaseModel, Field
from typing import List, Literal, Optional, get_args, Union
from datetime import datetime

import nest_asyncio
nest_asyncio.apply()

client = MultiServerMCPClient({
    "social_graph": {
        "transport": "http",
        "url": "https://ftec5660.ngrok.app/mcp",
        "headers": {"ngrok-skip-browser-warning": "true"}
    }
})

mcp_tools = await client.get_tools()
tools = mcp_tools

# Use a loop variable that does not shadow the imported `tool` decorator
for mcp_tool in mcp_tools:
    print(mcp_tool.name)
    print(mcp_tool.description)
    print(mcp_tool.args)
    print("\n\n------------------------------------------------------\n\n")

search_facebook_users
Search for Facebook users by display name (supports partial and fuzzy matching).

Args:
    q: Search query string (case-insensitive, matches any part of display name)
       Examples: "John", "john smith", "Smith"
    limit: Maximum number of results to return (default: 20, max: 20)
    fuzzy: Enable fuzzy matching if exact search returns no results (default: True)

Returns:
    List of user dictionaries, each containing:
    - id (int): Unique Facebook user ID for use with get_facebook_profile()
    - display_name (str): User's Facebook display name (may differ from legal name)
    - city (str): Current city of residence
    - country (str): Country of residence
    - match_type (str): "exact" or "fuzzy" (indicates search method used)
    
    Returns empty list [] if no matches found.

Example:
    search_facebook_users("Alex Chan", limit=5)
    → [{"id": 123, "display_name": "Alex Chan", "city": "Hong Kong", "country": "Hong Kong", "match_type": "exact"}]
  

In [9]:
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=GEMINI_VERTEX_API_KEY,
    vertexai=False,
    temperature=0
)

llm_with_tools = llm.bind_tools(tools)

# Define helper functions

In [19]:
# ---------------------- Extract text from tool result ----------------------
def extract_text(result):
    """Recursively extract string content from tool call results."""
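    if isinstance(result, str):
        return result
    if isinstance(result, list):
        return "\n".join(extract_text(item) for item in result)
    # MCP tools can return content blocks as dicts, e.g. {"type": "text", "text": "..."}
    if isinstance(result, dict) and "text" in result:
        return extract_text(result["text"])
    if hasattr(result, 'content'):
        return extract_text(result.content)
    if hasattr(result, 'text'):
        return result.text
    return str(result)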

# ---------------------- Call tool by name ----------------------
async def call(tools, tool_name, args):
    for tool in tools:
        if tool.name == tool_name:
            raw = await tool.ainvoke(args) if hasattr(tool, "ainvoke") else tool.invoke(args)
            return extract_text(raw)
    return f"Error: Tool {tool_name} not found."

# Facebook & LinkedIn Agents

In [20]:
# ---------------------- Facebook Agent ----------------------
async def facebook_agent(query: str, _llm=llm, _tools=mcp_tools):
    system_prompt = """
You are the Facebook Verification Agent. Search by NAME ONLY.
`search_facebook_users` accepts only `q`, `limit`, `fuzzy`. NEVER use `location` as input.
After search, call `get_facebook_profile()` until a good match is found.
If you cannot find a good match after reasonable attempts, select the most similar profile (even if low confidence) and explain why it's not a perfect match. Do not simply give up.


MISMATCH RULES:
- Company mismatch (EY vs PwC) = MAJOR → confidence LOW
- Title mismatch (Senior Engineer vs Engineer) = MINOR
- Education mismatch (PhD vs MSc) = MAJOR
- Unclear degree (BSc vs BSc in Engineering) = MINOR

Output format:
Match Confidence: high|medium|low|none
Matched Fields ...
Mismatched Fields ...
Uncertain Fields ...
Profile Summary: ...
"""

    llm_tools = _llm.bind_tools(_tools)
    history = [("system", system_prompt), ("human", f"Query: {query}")]
    for _ in range(20):
        # Await the async call so this agent does not block the event loop
        response = await llm_tools.ainvoke(history)
        history.append(response)
        if not response.tool_calls:
            return extract_text(response.content)
        for tc in response.tool_calls:
            obs = await call(_tools, tc["name"], tc["args"])
            history.append(ToolMessage(tool_call_id=tc["id"], content=obs))
    return "Loop reached max turns without final answer."

# ---------------------- LinkedIn Agent ----------------------
async def linkedin_agent(query: str, _llm=llm, _tools=mcp_tools):
    system_prompt = """
You are the LinkedIn Verification Agent.
Search strategy:
1. Call `search_linkedin_people(q=name, industry=industry)`, set limit to 5.
2. Don't use the location arg in `search_linkedin_people`
3. Analyze headlines to pick best matches.
4. Call `get_linkedin_profile()` to get full LinkedIn profile.
5. Verify Company, Title, Dates, Education.

MISMATCH RULES:
- PhD vs MSc = MAJOR
- Different schools = MAJOR
- Senior vs regular = MINOR
- Different field/industry = MAJOR

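Output format:
Match Confidence: high|medium|low|none
Matched Fields ...
Mismatched Fields ...
Uncertain Fields ...
Profile Summary: ...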
"""
    llm_tools = _llm.bind_tools(_tools)
    history = [("system", system_prompt), ("human", f"Query: {query}")]
    for _ in range(20):
        # Await the async call so this agent does not block the event loop
        response = await llm_tools.ainvoke(history)
        history.append(response)
        if not response.tool_calls:
            return extract_text(response.content)
        for tc in response.tool_calls:
            obs = await call(_tools, tc["name"], tc["args"])
            history.append(ToolMessage(tool_call_id=tc["id"], content=obs))
    return "Loop reached max turns without final answer."

In [21]:
# ---------------------- Structured Models ----------------------
RiskLevel = Literal["low", "medium", "high"]
Severity = Literal["minor", "major", "acceptable"]

class Item(BaseModel):
    field: str
    description: str
    cv: str
    facebook: str
    linkedin: str
    severity: Severity

class Report(BaseModel):
    risk_level: RiskLevel
    summary: str
    key_matches: List[Item]
    key_discrepancies: List[Item]
    key_missing: List[Item]
    internal_consistency: List[Item]
    uncertains: List[Item]

    @property
    def text(self) -> str:
        return self.model_dump_json(indent=2, exclude_none=True)

class FinalReport(Report):
    final_score: float
    cv_struct: str
    facebook_report: str
    linkedin_report: str

    @classmethod
    def from_report(cls, report: Report, score: float, cv: str, fb: str, li: str):
        return cls(**report.model_dump(exclude_none=True), final_score=score, cv_struct=cv, facebook_report=fb, linkedin_report=li)

# ---------------------- CV Extraction Models ----------------------
Present = Literal["Present"]

class EducationField(BaseModel):
    school: Optional[str]
    degree: Optional[str]
    graduation_year: Optional[int]

class ExperienceField(BaseModel):
    company: Optional[str]
    title: Optional[str]
    start_year: Optional[int]
    # Allow either an integer year or the literal "Present" for ongoing jobs
    end_year: Optional[Union[int, Present]]

class ExtractedCV(BaseModel):
    name: Optional[str]
    location: List[str]
    current_title: Optional[str]
    education: List[EducationField]
    experience: List[ExperienceField]
    skills: List[str]

    @property
    def text(self) -> str:
        return self.model_dump_json(indent=2, exclude_none=True)

# ---------------------- CV Extraction with Normalization ----------------------
def extract_cv_agent(text: str, _llm=llm):
    system_prompt = """
Extract structured info from CV. DO NOT call tools. Use null if missing.
IMPORTANT: For ongoing jobs, use "Present" (not "Current" or other variations).
"""
    history = [('system', system_prompt), ('human', text)]
    ext = _llm.with_structured_output(ExtractedCV).invoke(history)

    # Normalize location: split comma-separated entries into separate items
    loc = []
    for entry in ext.location:
        loc.extend([s.strip() for s in entry.split(',')])
    ext.location = loc

    # Normalize end_year: map "Current"/"Now"/"Today" variants to "Present"
    for exp in ext.experience:
        if exp.end_year and isinstance(exp.end_year, str):
            if exp.end_year.lower() in ["current", "now", "today"]:
                exp.end_year = "Present"

    # Internal consistency checks
    cy = datetime.now().year
    internal = []
    exp = ext.experience
    for i in range(len(exp)):
        # Pairwise check: two jobs overlap if each starts before the other ends
        for j in range(i+1, len(exp)):
            a, b = exp[i], exp[j]
            a_end = cy if a.end_year in get_args(Present) or a.end_year is None else a.end_year
            b_end = cy if b.end_year in get_args(Present) or b.end_year is None else b.end_year
            if a.start_year and b.start_year and a.start_year < b_end and b.start_year < a_end:
                internal.append(f"{a.company}[{a.start_year}-{a.end_year}] overlap with {b.company}[{b.start_year}-{b.end_year}]")
        # Per-job checks: future years and start after end
        e = exp[i]
        st = e.start_year or 0
        ed = cy if e.end_year in get_args(Present) or e.end_year is None else e.end_year
        if st > cy or ed > cy:
            internal.append(f"{e.company}[{e.start_year}-{e.end_year}] contains future year")
        if st > ed:
            internal.append(f"{e.company}[{e.start_year}-{e.end_year}] start > end")
    # More than one job marked "Present" is flagged as suspicious
    present = [e for e in exp if e.end_year in get_args(Present)]
    if len(present) > 1:
        internal.append(f"Multiple present jobs: {', '.join([f'{e.company}[{e.start_year}-{e.end_year}]' for e in present])}")
    dic = ext.model_dump()
    dic["internal_inconsistent"] = internal
    return json.dumps(dic, indent=2)
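
The overlap test in the internal consistency checks above treats each job as a year interval and flags two jobs when each one starts before the other ends. A minimal standalone sketch of that rule follows; `resolve_end` and `jobs_overlap` are hypothetical names, and the `current_year` parameter is added here only to make the example deterministic.

```python
from datetime import datetime
from typing import Optional, Union

EndYear = Optional[Union[int, str]]


def resolve_end(end_year: EndYear, current_year: int) -> int:
    """Treat a missing end year or "Present" as the current year."""
    if end_year is None or end_year == "Present":
        return current_year
    return int(end_year)


def jobs_overlap(
    a_start: int,
    a_end: EndYear,
    b_start: int,
    b_end: EndYear,
    current_year: Optional[int] = None,
) -> bool:
    """True when the two year ranges overlap (strict comparison)."""
    cy = current_year if current_year is not None else datetime.now().year
    a_last = resolve_end(a_end, cy)
    b_last = resolve_end(b_end, cy)
    return a_start < b_last and b_start < a_last


# A job ending in 2021 overlaps one that started in 2020
overlapping = jobs_overlap(2018, 2021, 2020, "Present", current_year=2025)
# Back-to-back jobs (end 2020, start 2020) are not flagged
back_to_back = jobs_overlap(2015, 2020, 2020, "Present", current_year=2025)
print(overlapping, back_to_back)
```

Because the comparison is strict, a candidate who leaves one job and starts the next in the same calendar year is not reported as overlapping.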

# Scoring & Reporting Agent

In [22]:
# ---------------------- Scoring & Reporting ----------------------
def scoring_agent(report: str, _llm=llm) -> float:
    sys_prompt = """
Score the CV trustworthiness from 0-1 based on the report.

Scoring Guidelines:
- **0.8-1.0:** CV is highly consistent with both LinkedIn and Facebook profiles. Minimal or no discrepancies.
- **0.51-0.79:** CV shows acceptable alignment in core education and professional experience (e.g., names of institutions, companies, and roles) with *at least one* social media profile (LinkedIn or Facebook). Minor discrepancies such as specific employment dates, "current" status indicators, or slight location variations are tolerated if the overall narrative holds true. If there are general matches for school and experience on LinkedIn, the score should be above 0.5.
- **0.0-0.49:** CV presents severe and irreconcilable contradictions with *both* LinkedIn and Facebook profiles regarding fundamental claims like education or work history. This indicates significant misrepresentation or completely incorrect information.

Output only the number(float).
"""
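    raw = _llm.invoke([('system', sys_prompt), ('human', report)]).content
    # The model replies with text; convert it so the return value matches the float annotation
    return float(extract_text(raw).strip())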

def report_agent(cv, fb, li, _llm=llm) -> Report:
    system_prompt = """
Synthesize verification report from CV, Facebook, and LinkedIn.
- SUPPORT FROM EITHER PLATFORM = MATCH (include in key_matches).
- Discrepancies only if CV contradicts BOTH platforms.
- DO NOT compare Facebook vs LinkedIn directly.
- Include internal consistency issues.
- Use severity: major/minor/acceptable.
"""
    llm_struct = _llm.with_structured_output(Report)
    history = [('system', system_prompt), ('human', f"[CV]\n{cv}\n[Facebook]\n{fb}\n[LinkedIn]\n{li}")]
    return llm_struct.invoke(history)
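
The scoring prompt above asks the model to output only a number, but a model may still wrap the number in extra text. A defensive parser is one way to handle that. This is a minimal sketch under stated assumptions: `parse_score` is a hypothetical helper, it takes the first number found in the reply, and it falls back to `0.0` (below the decision threshold) when no number is present.

```python
import re


def parse_score(raw: str, default: float = 0.0) -> float:
    """Extract the first number from an LLM reply and clamp it to [0, 1]."""
    match = re.search(r"[-+]?\d*\.?\d+", raw)
    if match is None:
        # Assumption: an unparseable reply is treated as the lowest score
        return default
    return min(1.0, max(0.0, float(match.group())))


print(parse_score("0.65"))
print(parse_score("Score: 0.9\n"))
print(parse_score("n/a"))
```

The choice of fallback value is a design decision: defaulting to `0.0` marks a CV as unreliable whenever the model's reply cannot be read, which may or may not be the desired behavior.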

# Main Agent

In [23]:
async def main_agent(cv):
    text = extract_cv_agent(cv['text'])
    fb, li = await asyncio.gather(
        facebook_agent(text),
        linkedin_agent(text)
    )
    # Convert to string
    fb = extract_text(fb) if not isinstance(fb, str) else fb
    li = extract_text(li) if not isinstance(li, str) else li
    report = report_agent(text, fb, li)
    score = scoring_agent(report.text)
    return FinalReport.from_report(report, score, text, fb, li)

# Result

In [24]:
# ---------------------- Run on all CVs ----------------------
reports = []
for cv in all_cvs:
    reports.append(await main_agent(cv))

# ---------------------- Output ----------------------
# Use a name that does not shadow the imported `re` module
for report in reports:
    print(report.text)

{
  "risk_level": "medium",
  "summary": "The candidate's CV largely aligns with their LinkedIn profile, providing strong verification for most details including name, location, current title, education, and skills. However, Facebook shows several discrepancies with the CV, particularly regarding current employment company and education details. A significant discrepancy exists for the current employment end year, where the CV states 'Present' but LinkedIn's internal check indicates the employment is 'Not current'. This is also flagged as a major internal inconsistency within the LinkedIn data itself, contributing to a medium overall risk level.",
  "key_matches": [
    {
      "field": "name",
      "description": "Candidate's full name.",
      "cv": "John Smith",
      "facebook": "John Smith",
      "linkedin": "John Smith",
      "severity": "acceptable"
    },
    {
      "field": "location",
      "description": "Candidate's listed locations.",
      "cv": "Singapore, Kowloon",


In [25]:
tool_map = {tool.name: tool for tool in mcp_tools}
# This block provides some tests to help you get familiar with our MCP server

# # Test 1: Search Facebook users (exact match)
# await tools[0].ainvoke({'q': "Alex Chan", 'limit': 5})

# # Test 2: Search Facebook users (fuzzy match with typo)
# await tools[0].ainvoke({'q': "Alx Chn", 'limit': 5, 'fuzzy': True})

# # Test 3: Get Facebook profile
# await tools[1].ainvoke({'user_id': 123})

# # Test 4: Get Facebook mutual friends
# await tools[2].ainvoke({'user_id_1': 123, 'user_id_2': 456})

# # Test 5: Search LinkedIn people (exact match)
# await tools[3].ainvoke({'q': "Python", 'location': "Hong Kong", 'limit': 5})

# # Test 6: Search LinkedIn people (fuzzy match with typo)
# await tools[3].ainvoke({'q': "Python", 'location': "Hong Kong", 'limit': 5, 'fuzzy': True})

# # Test 7: Get LinkedIn profile
# await tools[4].ainvoke({'person_id': 456})

# Test 8: Get LinkedIn interactions
await tools[5].ainvoke({'person_id': 456})

[{'type': 'text',
  'text': '{"profile_id":456,"post_count":4,"total_likes":5,"liked_by":[4390,3622,7500,4269,8464],"engagement_score":1.25}',
  'id': 'lc_890fbb5a-326d-4718-8d87-6bed2ea6af0c'}]
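
The result above is a list of content blocks whose `text` field holds a JSON string. A minimal sketch for decoding such a result into Python objects follows; `decode_text_blocks` is a hypothetical helper, and the block shape is assumed from the output shown above rather than from the adapter's documentation.

```python
import json
from typing import Any, List


def decode_text_blocks(blocks: List[Any]) -> List[Any]:
    """Decode the JSON payload of each text block; keep raw text if it is not JSON."""
    payloads = []
    for block in blocks:
        if isinstance(block, dict) and block.get("type") == "text":
            try:
                payloads.append(json.loads(block["text"]))
            except json.JSONDecodeError:
                payloads.append(block["text"])
    return payloads


# Sample copied from the tool output shown above
sample = [{
    "type": "text",
    "text": '{"profile_id":456,"post_count":4,"total_likes":5,'
            '"liked_by":[4390,3622,7500,4269,8464],"engagement_score":1.25}',
    "id": "lc_890fbb5a-326d-4718-8d87-6bed2ea6af0c",
}]

stats = decode_text_blocks(sample)[0]
print(stats["profile_id"], stats["engagement_score"])
```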

---
# Sample Verification Results


In [26]:
print("\n" + "="*60)
print("CV VERIFICATION RESULTS (Plain Text Summary)")
print("="*60)

# Use `rep` as the loop variable so the imported `re` module is not shadowed
for idx, rep in enumerate(reports, 1):
    # Parse the candidate name from cv_struct
    try:
        cv_data = json.loads(rep.cv_struct)
        name = cv_data.get("name", "Unknown")
    except (json.JSONDecodeError, TypeError, AttributeError):
        name = "Unknown"

    print(f"\nCV_{idx}.pdf - {name}")
    print(f"  Risk Level  : {rep.risk_level.upper()}")
    print(f"  Final Score : {rep.final_score} (0-1)")
    print(f"  Summary     : {rep.summary[:200]}..." if len(rep.summary) > 200 else f"  Summary     : {rep.summary}")
    print("-" * 60)


CV VERIFICATION RESULTS (Plain Text Summary)

CV_1.pdf - John Smith
  Risk Level  : MEDIUM
  Final Score : 0.65 (0-1)
  Summary     : The candidate's CV largely aligns with their LinkedIn profile, providing strong verification for most details including name, location, current title, education, and skills. However, Facebook shows se...
------------------------------------------------------------

CV_2.pdf - Minh Pham
  Risk Level  : MEDIUM
  Final Score : 0.6 (0-1)
  Summary     : The candidate's CV largely aligns with their LinkedIn profile, confirming key details such as name, location, current and previous employment (company, title, dates), education, and skills. However, t...
------------------------------------------------------------

CV_3.pdf - Wei Zhang
  Risk Level  : LOW
  Final Score : 0.9 (0-1)
  Summary     : The candidate's profile shows strong consistency across CV and LinkedIn for name, location (Munich, Germany), current title, company, education institution, and gra

# Evaluation code

In the test phase, you will be given 5 CV files with fixed names:

    CV_1.pdf, CV_2.pdf, CV_3.pdf, CV_4.pdf, CV_5.pdf

Your system must process these CVs and output a list of 5 scores,
one score per CV, in the same order:

    scores = [s1, s2, s3, s4, s5]

Each score must be a float in the range [0, 1], representing the
reliability or confidence that the CV is valid (or meets the task criteria).

The ground-truth labels are binary:

    groundtruth = [0 or 1, ..., 0 or 1]

Each CV is evaluated independently using a threshold of 0.5:

- If score > 0.5 and groundtruth == 1 → Full credit
- If score ≤ 0.5 and groundtruth == 0 → Full credit
- Otherwise → No credit

In other words, 0.5 is the decision threshold.

- Each CV contributes equally.
- Final score = (number of correct decisions) / 5


In [27]:
# =====================================================
#  Evaluation code
# =====================================================

def evaluate(scores, groundtruth, threshold=0.5):
    """
    scores: list of floats in [0, 1], length = 5
    groundtruth: list of ints (0 or 1), length = 5
    """
    assert len(scores) == 5
    assert len(groundtruth) == 5

    correct = 0
    decisions = []

    for s, gt in zip(scores, groundtruth):
        pred = 1 if s > threshold else 0
        decisions.append(pred)
        if pred == gt:
            correct += 1

    final_score = correct / len(scores)

    return {
        "decisions": decisions,
        "correct": correct,
        "total": len(scores),
        "final_score": final_score
    }

In [29]:
scores = [r.final_score for r in reports] # Your code should generate this list, e.g. [0.2, 0.3, 0.4, 0.5, 0.6]
groundtruth = [1, 1, 1, 0, 0] # Do not modify

result = evaluate(scores, groundtruth)
print(result)

{'decisions': [1, 1, 1, 0, 0], 'correct': 5, 'total': 5, 'final_score': 1.0}
