<a href="https://colab.research.google.com/github/PriyaSinha786/research-papers/blob/main/CDIPR/hr_poc_colab_v4_fixed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HR Governance POC v4 — Colab Notebook (Fixed OpenAI & HF issues)

This fixed notebook includes:
- pinned `huggingface_hub` to avoid `cached_download` import error
- upgraded `openai` client usage to the modern `OpenAI` class for embeddings
- diagnostic cells to check for name collisions

**Run cells in order.** If you changed packages, restart the runtime (Runtime → Restart runtime) before running the index cell.

In [2]:
# Install and pin packages that are known to work together
# This pins huggingface_hub to a compatible version for sentence-transformers 2.2.2
!pip install --upgrade --quiet openai
!pip install --upgrade --quiet sentence-transformers==2.2.2
!pip install --upgrade --quiet "huggingface_hub==0.25.2"
!pip install --upgrade --quiet scikit-learn pandas joblib networkx matplotlib
print('Installed / upgraded packages. IMPORTANT: If this changed packages, restart the runtime (Runtime -> Restart runtime) and re-run the notebook cells from the start.)')

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gradio 5.44.1 requires huggingface-hub<1.0,>=0.33.5, but you have huggingface-hub 0.25.2 which is incompatible.
transformers 4.56.1 requires huggingface-hub<1.0,>=0.34.0, but you have huggingface-hub 0.25.2 which is incompatible.
diffusers 0.35.1 requires huggingface-hub>=0.34.0, but you have huggingface-hub 0.25.2 which is incompatible.[0m[31m
[0mInstalled / upgraded packages. IMPORTANT: If this changed packages, restart the runtime (Runtime -> Restart runtime) and re-run the notebook cells from the start.)


In [3]:
# Diagnostic checks for common problems (name shadowing or missing attributes)
import os, sys
print('Working directory:', os.getcwd())
print('Files here:', [f for f in os.listdir('.')][:50])

try:
    import openai
    print('\nopenai module:', openai)
    print('openai __file__:', getattr(openai, '__file__', 'builtin or not file-backed'))
    print('\nSelect attributes containing "emb" or "Emb":', [k for k in dir(openai) if 'emb' in k.lower() or 'Emb' in k])
except Exception as e:
    print('Importing openai failed:', e)

try:
    import sentence_transformers, huggingface_hub
    print('\nsentence_transformers version OK, huggingface_hub path:', getattr(huggingface_hub, '__file__', None))
except Exception as e:
    print('\nsentence-transformers / huggingface_hub import issue:', e)

print('\nIf you see openai pointing to a local file (e.g. /content/openai.py), rename or remove that file and re-run this cell.\n')

Working directory: /content
Files here: ['.config', 'sample_data']

openai module: <module 'openai' from '/usr/local/lib/python3.12/dist-packages/openai/__init__.py'>
openai __file__: /usr/local/lib/python3.12/dist-packages/openai/__init__.py

Select attributes containing "emb" or "Emb": ['Embedding', 'embeddings']

sentence-transformers / huggingface_hub import issue: huggingface-hub>=0.34.0,<1.0 is required for a normal functioning of this module, but found huggingface-hub==0.25.2.
Try: `pip install transformers -U` or `pip install -e '.[dev]'` if you're working with git main

If you see openai pointing to a local file (e.g. /content/openai.py), rename or remove that file and re-run this cell.



In [4]:
# Set your OpenAI API key (optional). Use getpass so it isn't stored in clear text.
from getpass import getpass
import os
key = getpass('Enter your OpenAI API key (leave blank to skip): ')
if key:
    os.environ['OPENAI_API_KEY'] = key
    print('OPENAI_API_KEY set (in environment).')
else:
    print('No API key provided; the notebook will use local SBERT or TF-IDF fallback.')

Enter your OpenAI API key (leave blank to skip): ··········
OPENAI_API_KEY set (in environment).


In [5]:
# Prepare synthetic data (resumes, jds, policies, attrition CSV)
import os, random, json
from pathlib import Path
import pandas as pd
random.seed(42)
BASE = Path('/content/hr_poc_openai')
DATA = BASE / 'data'
RES = DATA / 'resumes'
JDS = DATA / 'jds'
POL = DATA / 'policy_docs'
MODELS = BASE / 'models'
OUT = BASE / 'output'
for p in [DATA, RES, JDS, POL, MODELS, OUT]:
    p.mkdir(parents=True, exist_ok=True)

# Attrition CSV
rows = []
for i in range(500):
    age = random.randint(22,60)
    monthly_income = random.randint(2000,20000)
    job_sat = random.choice([1,2,3,4])
    years = random.randint(0,30)
    gender = random.choice(['Male','Female'])
    education = random.choice([1,2,3,4,5])
    prob_attr = 0.2
    if age < 30 and job_sat <= 2 and years < 3:
        prob_attr = 0.6
    attrition = 1 if random.random() < prob_attr else 0
    rows.append([age, monthly_income, job_sat, years, gender, education, attrition])
df = pd.DataFrame(rows, columns=['Age','MonthlyIncome','JobSatisfaction','YearsAtCompany','Gender','Education','Attrition'])
df.to_csv(DATA/'attrition_synthetic.csv', index=False)
print('Wrote attrition CSV to', DATA/'attrition_synthetic.csv')

# Create resumes
skills_pool = ['Python','Java','SQL','Machine Learning','Deep Learning','NLP','Computer Vision','Data Engineering','TensorFlow','PyTorch','Kubernetes','AWS','Docker','Communication','Leadership','UX Design','Figma','React','Sales','SEO','Google Analytics','Recruiting','Interviewing','Payroll','Testing']
roles = ['Data Scientist','ML Engineer','Backend Developer','DevOps Engineer','Product Manager','QA Engineer','UX Designer','Sales Executive','Marketing Specialist','HR Specialist']
for i in range(1,81):
    name = f'Candidate_{i}'
    role = roles[i % len(roles)]
    if i % 8 == 0:
        skills = random.sample([s for s in skills_pool if s not in ['Python','Machine Learning','SQL','TensorFlow','PyTorch']], k=random.randint(2,5))
    else:
        if 'Data' in role or 'ML' in role:
            core = ['Python','SQL','Machine Learning','TensorFlow','PyTorch']
        elif 'Backend' in role or 'DevOps' in role:
            core = ['Java','Kubernetes','Docker','AWS','SQL']
        elif 'Product' in role:
            core = ['Communication','Leadership','React']
        elif 'QA' in role:
            core = ['Testing','Python','Java']
        elif 'UX' in role:
            core = ['UX Design','Figma','Communication']
        elif 'Sales' in role or 'Marketing' in role:
            core = ['Sales','SEO','Google Analytics','Communication']
        elif 'HR' in role:
            core = ['Recruiting','Interviewing','Payroll','Communication']
        skills = list(set(random.sample(core, k=max(1,min(len(core),2))) + random.sample(skills_pool, k=random.randint(1,3))))
    years = random.randint(0,12)
    exp = f'I worked as a {role} for {years} years. I have experience in ' + ', '.join(skills) + '.'
    resume_text = f'{name}\\n{role}\\n{exp}\\nResponsibilities: Delivered projects.'
    (RES/f'resume_{i}.txt').write_text(resume_text)
print('Wrote resumes to', RES)

# JDs
jds = {
    'JD_Data_Scientist.txt': 'Looking for a Data Scientist with Python, SQL, Machine Learning and TensorFlow or PyTorch.',
    'JD_Backend_Developer.txt': 'Seeking Backend Developer experienced in Java, SQL, Docker, and AWS.',
    'JD_DevOps_Engineer.txt': 'DevOps Engineer with Kubernetes, Docker, AWS, and CI/CD automation.',
    'JD_UX_Designer.txt': 'UX Designer with Figma, user research and prototyping experience.',
    'JD_HR_Specialist.txt': 'HR Specialist experienced in recruiting, interviewing, payroll systems.'
}
for fn, txt in jds.items():
    (JDS/fn).write_text(txt)
print('Wrote JDs to', JDS)

# Policies
policies = {
    'policy_1.txt': 'Equal Opportunity Policy: assess candidates on skills and experience.',
    'policy_2.txt': 'Data Privacy Policy: Candidate personal data must be handled per local laws. Do not expose PII.',
    'policy_3.txt': 'Promotion Eligibility: Minimum 2 years in role and demonstrable impact.',
    'policy_4.txt': 'Interview Feedback Policy: notes must be factual.'
}
for fn, txt in policies.items():
    (POL/fn).write_text(txt)
print('Wrote policies to', POL)

# Save config
config = {'resumes_dir': str(RES), 'jds_dir': str(JDS), 'policies_dir': str(POL), 'attrition_csv': str(DATA/'attrition_synthetic.csv')}
(DATA/'config.json').write_text(json.dumps(config, indent=2))
print('Wrote config.json')

Wrote attrition CSV to /content/hr_poc_openai/data/attrition_synthetic.csv
Wrote resumes to /content/hr_poc_openai/data/resumes
Wrote JDs to /content/hr_poc_openai/data/jds
Wrote policies to /content/hr_poc_openai/data/policy_docs
Wrote config.json


In [6]:
# Build document index (preferred: OpenAI embeddings via new client API).
import os, joblib, json
from pathlib import Path
import numpy as np
BASE = Path('/content/hr_poc_openai')
DATA = BASE / 'data'
RES = DATA / 'resumes'
JDS = DATA / 'jds'
POL = DATA / 'policy_docs'
MODELS = BASE / 'models'
MODELS.mkdir(parents=True, exist_ok=True)

# Gather corpus
corpus = []
meta = []
for p in sorted(RES.glob('*.txt')):
    corpus.append(p.read_text()); meta.append({'source':str(p),'type':'resume'})
for p in sorted(JDS.glob('*.txt')):
    corpus.append(p.read_text()); meta.append({'source':str(p),'type':'jd'})
for p in sorted(POL.glob('*.txt')):
    corpus.append(p.read_text()); meta.append({'source':str(p),'type':'policy'})
print('Corpus size:', len(corpus))

use_openai = bool(os.environ.get('OPENAI_API_KEY'))
use_embeddings = False

if use_openai:
    try:
        # Modern OpenAI client usage
        from openai import OpenAI
        client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
        emb_model = 'text-embedding-3-small'
        print('Creating OpenAI embeddings with', emb_model)
        all_embs = []
        B = 50
        for i in range(0, len(corpus), B):
            batch = corpus[i:i+B]
            resp = client.embeddings.create(model=emb_model, input=batch)
            # resp.data is sequence of objects with .embedding attribute
            batch_embs = [d.embedding for d in resp.data]
            all_embs.extend(batch_embs)
        embs = np.array(all_embs)
        joblib.dump({'embeddings':embs, 'docs':corpus, 'meta':meta, 'method':'openai', 'model':emb_model}, MODELS/'doc_index_openai.joblib')
        print('Saved OpenAI index to', MODELS/'doc_index_openai.joblib')
        use_embeddings = True
    except Exception as e:
        print('OpenAI embeddings failed:', e)
        use_openai = False

if (not use_embeddings):
    try:
        from sentence_transformers import SentenceTransformer
        model_name = 'paraphrase-MiniLM-L3-v2'
        print('Using SBERT model', model_name)
        model = SentenceTransformer(model_name)
        embs = model.encode(corpus, show_progress_bar=True)
        joblib.dump({'embeddings':embs, 'docs':corpus, 'meta':meta, 'method':'sbert', 'model':model_name}, MODELS/'doc_index_sbert.joblib')
        print('Saved SBERT index to', MODELS/'doc_index_sbert.joblib')
        use_embeddings = True
    except Exception as e:
        print('SBERT failed:', e)

if (not use_embeddings):
    print('Falling back to TF-IDF')
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors
    vec = TfidfVectorizer(max_features=4000)
    X = vec.fit_transform(corpus)
    nn = NearestNeighbors(n_neighbors=6, metric='cosine').fit(X)
    joblib.dump({'vectorizer':vec, 'nn':nn, 'docs':corpus, 'meta':meta, 'method':'tfidf'}, MODELS/'doc_index_tfidf.joblib')
    print('Saved TF-IDF index to', MODELS/'doc_index_tfidf.joblib')

Corpus size: 89
Creating OpenAI embeddings with text-embedding-3-small
Saved OpenAI index to /content/hr_poc_openai/models/doc_index_openai.joblib


In [7]:
# Train RandomForest on attrition CSV (same as before)
from pathlib import Path
import pandas as pd, joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
BASE = Path('/content/hr_poc_openai')
DATA = BASE / 'data'
MODELS = BASE / 'models'
df = pd.read_csv(DATA/'attrition_synthetic.csv')
df2 = pd.get_dummies(df, columns=['Gender','Education'], drop_first=True)
X = df2.drop(columns=['Attrition']); y = df2['Attrition']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42, stratify=y)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, preds))
print('F1:', f1_score(y_test, preds))
joblib.dump({'model':clf, 'features':list(X.columns)}, MODELS/'attrition_rf.joblib')
print('Saved model to', MODELS/'attrition_rf.joblib')

Accuracy: 0.79
F1: 0.0
Saved model to /content/hr_poc_openai/models/attrition_rf.joblib


In [8]:
# Audit logger for RAG queries
import json, time
from pathlib import Path
BASE = Path('/content/hr_poc_openai')
OUT = BASE / 'output'
OUT.mkdir(parents=True, exist_ok=True)
LOG_FILE = OUT / 'audit_log.jsonl'
def audit_log(entry: dict):
    entry = dict(entry)
    entry['ts'] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
    with open(LOG_FILE, 'a') as f:
        f.write(json.dumps(entry) + '\n')
    print('Logged audit entry to', LOG_FILE)

In [11]:
# RAG retrieval + ChatCompletion (OpenAI if present)
import os, joblib, numpy as np
from pathlib import Path
BASE = Path('/content/hr_poc_openai')
MODELS = BASE / 'models'
DATA = BASE / 'data'
RES = DATA / 'resumes'

# Load index helper
def load_index():
    if (MODELS/'doc_index_openai.joblib').exists():
        return joblib.load(MODELS/'doc_index_openai.joblib')
    if (MODELS/'doc_index_sbert.joblib').exists():
        return joblib.load(MODELS/'doc_index_sbert.joblib')
    if (MODELS/'doc_index_tfidf.joblib').exists():
        return joblib.load(MODELS/'doc_index_tfidf.joblib')
    raise FileNotFoundError('No index found. Run the index cell.')

idx = load_index()
method = idx.get('method')
from numpy.linalg import norm

def retrieve_topk(query_text, topk=3):
    docs = idx['docs']; meta = idx['meta']
    if method in ('openai','sbert'):
        try:
            if method == 'openai':
                from openai import OpenAI
                client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
                q_emb = np.array(client.embeddings.create(model=idx.get('model','text-embedding-3-small'), input=[query_text]).data[0].embedding)
            else:
                from sentence_transformers import SentenceTransformer
                model = SentenceTransformer(idx.get('model','paraphrase-MiniLM-L3-v2'))
                q_emb = model.encode([query_text])[0]
        except Exception as e:
            print('Query encoding failed:', e); return []
        embs = np.array(idx['embeddings'])
        scores = (embs @ q_emb) / ((norm(embs, axis=1) * norm(q_emb)) + 1e-8)
        ids = list(scores.argsort()[-topk:][::-1])
        return [{'score': float(scores[i]), 'text': docs[i], 'meta': meta[i]} for i in ids]
    else:
        vec = idx['vectorizer']; nn = idx['nn']
        qv = vec.transform([query_text])
        dists, ids = nn.kneighbors(qv, n_neighbors=topk)
        return [{'score': float(1-d), 'text': idx['docs'][i], 'meta': idx['meta'][i]} for i,d,i in zip(ids[0], dists[0], ids[0])]

def rag_answer(resume_text, topk=3):
    retrieved = retrieve_topk(resume_text, topk=topk)
    print('\nTop retrieved docs:')
    for r in retrieved:
        print('-', Path(r['meta']['source']).name, f'(score={r["score"]:.3f})')
    sources = [r['meta']['source'] for r in retrieved]
    context = ''
    for i,r in enumerate(retrieved):
        snippet = r['text'][:800]
        context += f'[DOC_{i}] {Path(r["meta"]["source"]).name}\n{snippet}\n\n'
    answer = None
    if os.environ.get('OPENAI_API_KEY'):
        try:
            from openai import OpenAI
            client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))
            system = 'You are an HR governance assistant. Use retrieved passages and always cite filenames like [DOC_0].'
            prompt = f'Resume:\n{resume_text}\n\nRetrieved documents:\n{context}\n\nTask: Provide a 2-3 sentence recommendation (HIRE / NO HIRE / CONSIDER), justify it, and cite sources as [DOC_i].'
            resp = client.chat.completions.create(model='gpt-4o-mini', messages=[{'role':'system','content':system},{'role':'user','content':prompt}], max_tokens=300)
            answer = resp.choices[0].message.content
        except Exception as e:
            print('OpenAI chat failed:', e)
            answer = None
    if not answer:
        text = resume_text.lower()
        if 'machine learning' in text or 'data' in text or 'ml' in text:
            rec = 'Consider for Data/ML role — strong relevant skills.'
        else:
            rec = 'Consider with caution — insufficient domain-specific skills.'
        reasons = 'Recommendation based on skills and policy.'
        cited = ', '.join([Path(s).name for s in sources])
        answer = f'Recommendation: {rec}\nReasons: {reasons}\nCited sources: {cited}'
    audit_log({'query': resume_text[:200], 'retrieved':[Path(s).name for s in sources], 'answer': answer.splitlines()[0]})
    return answer

# Demo on first 6 resumes
for p in sorted(RES.glob('*.txt'))[:6]:
    print('\n====', p.name, '====')
    rtext = p.read_text()
    print(rag_answer(rtext, topk=3))


==== resume_1.txt ====

Top retrieved docs:
- resume_1.txt (score=1.000)
- resume_11.txt (score=0.909)
- resume_21.txt (score=0.878)
Logged audit entry to /content/hr_poc_openai/output/audit_log.jsonl
Recommendation: NO HIRE

Justification: Candidate_1 has no practical experience as an ML Engineer, which is a significant drawback given the demands of the role. While they possess relevant skills in UX Design, Python, and TensorFlow, the lack of experience compared to other candidates, such as Candidate_21 with 6 years of relevant experience [DOC_2], makes them less suitable for immediate hiring.

==== resume_10.txt ====

Top retrieved docs:
- resume_10.txt (score=1.000)
- resume_20.txt (score=0.880)
- resume_50.txt (score=0.870)
Logged audit entry to /content/hr_poc_openai/output/audit_log.jsonl
Recommendation: CONSIDER

Candidate_10 has relevant experience as a Data Scientist for 6 years and is skilled in tools such as PyTorch and SQL, which are valuable for the role. However, compare

## Notes

- After running the install cell, if packages were changed, **restart the Colab runtime** (Runtime → Restart runtime) and then re-run from the top. This avoids stale imports.
- The notebook prefers OpenAI embeddings if you provided a valid key. If that fails it tries SBERT, and finally TF-IDF.
- Embedding and Chat calls to OpenAI will consume credits — use text-embedding-3-small and limit corpus size when experimenting.

Artifacts will be under `/content/hr_poc_openai/output/` and indexes in `/content/hr_poc_openai/models/`.