# AI Internship Recommender System

This Jupyter notebook builds a simple, extendable recommender that matches student candidates (from `candidates.csv`) to internship openings (from `internships.csv`). It uses skill similarity (TF-IDF + cosine similarity) plus simple preference boosts (location, qualification, rural/first-time bonuses).

**Libraries used:** `pandas`, `numpy`, `scikit-learn`

> Make sure `candidates.csv` and `internships.csv` are in the same directory as this notebook before running the cells.

## Step 0 — Imports

Import required libraries and set constants.

In [3]:

import os
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from heapq import nlargest

print('Imports ready.')

Imports ready.


## Step 1 — Load & Explore Data

Read `candidates.csv` and `internships.csv`, show shape and first few rows.

In [5]:
# Step 1: Load & Explore Data
CANDIDATES_CSV = 'candidates.csv'
INTERNSHIPS_CSV = 'internships.csv'


for f in (CANDIDATES_CSV, INTERNSHIPS_CSV):
    if not os.path.exists(f):
        raise FileNotFoundError(f"Required file not found: {f}. Please place it in the notebook directory before running.")

candidates = pd.read_csv(CANDIDATES_CSV)
internships = pd.read_csv(INTERNSHIPS_CSV)

print('Candidates shape:', candidates.shape)
display(candidates.head(5))

print('\nInternships shape:', internships.shape)
display(internships.head(5))

Candidates shape: (100, 10)


Unnamed: 0,ID,Name,State,City,Skills,Qualifications,Year_of_Study,GPA,Rural_or_Urban,First_Time_Applicant
0,1,Rajeev Ranjan,Karnataka,Bengaluru,"Social Work, Teaching, Training",BA,3rd,6.99,Rural,No
1,2,Manish Yadav,Uttar Pradesh,Lucknow,"Accounting, Excel, Business Strategy",MBA,Postgraduate,9.35,Rural,No
2,3,Shruti Pandey,Karnataka,Bengaluru,"Accounting, Business Strategy, Financial Analysis",MBA,Graduate,9.86,Rural,No
3,4,Aditya Verma,Maharashtra,Pune,"Public Health, Lab Analysis, Nursing",MBA,Postgraduate,8.52,Urban,Yes
4,5,Siddharth Khanna,Assam,Guwahati,"Policy Analysis, Drafting, Legal Research",Diploma,3rd,7.39,Rural,No



Internships shape: (100, 11)


Unnamed: 0,ID,Company,Role,Required_Skills,Qualification_Required,Location_State,Location_City,Sector,Duration_Months,Stipend_Range,Capacity
0,1,Flipkart,Software Intern,"Python, Django, SQL",B.Com,Uttar Pradesh,Lucknow,IT,5,₹6k–₹37k,7
1,2,Times of India,Accounting Intern,"Accounting, Excel, Auditing",BA,Madhya Pradesh,Indore,Commerce,3,₹28k–₹38k,17
2,3,HDFC Bank,Social Work Intern,"Social Work, Community Outreach, Counselling",BBA,Maharashtra,Pune,Social Work,6,₹8k–₹32k,2
3,4,Indian Railways,Video Editor,"Video Editing, Adobe Premiere Pro, After Effects",Diploma,Punjab,Chandigarh,Design,4,₹16k–₹38k,4
4,5,OYO Rooms,Social Work Intern,"Social Work, Community Outreach, Counselling",B.Com,Bihar,Patna,Social Work,2,₹12k–₹40k,13


## Step 2 — Preprocess Skills Data

- Handle missing values
- Clean skills strings into lists and text (lowercase, trimmed)
- Create helper columns used later

In [9]:
# Step 2:Preprocess Skills Data
def clean_skill_string(s):
    """Convert a comma-separated skills string into a cleaned list of skill tokens."""
    if pd.isna(s):
        return []
    s = str(s)
    parts = [p.strip() for p in s.split(',') if p.strip()]
    parts = [re.sub(r"\s+", ' ', p).lower() for p in parts]
    return parts

# Apply cleaning to candidates
candidates['skills_list'] = candidates['Skills'].apply(clean_skill_string)
candidates['skills_set'] = candidates['skills_list'].apply(set)
candidates['skills_text'] = candidates['skills_list'].apply(lambda l: ' '.join(l))

# Apply cleaning to internships
internships['req_skills_list'] = internships['Required_Skills'].apply(clean_skill_string)
internships['req_skills_set'] = internships['req_skills_list'].apply(set)
internships['req_skills_text'] = internships['req_skills_list'].apply(lambda l: ' '.join(l))

# Fill missing qualifications/location with placeholders
candidates['Qualifications'] = candidates['Qualifications'].fillna('Unknown')
internships['Qualification_Required'] = internships['Qualification_Required'].fillna('Any')
internships['Location_State'] = internships['Location_State'].fillna('Unknown')
internships['Location_City'] = internships['Location_City'].fillna('Unknown')

print('Sample cleaned candidate skills:')
display(candidates[['ID','Name','skills_list']].head())

print('\nSample cleaned internship skills:')
display(internships[['ID','Company','Role','req_skills_list']].head())

Sample cleaned candidate skills:


Unnamed: 0,ID,Name,skills_list
0,1,Rajeev Ranjan,"[social work, teaching, training]"
1,2,Manish Yadav,"[accounting, excel, business strategy]"
2,3,Shruti Pandey,"[accounting, business strategy, financial anal..."
3,4,Aditya Verma,"[public health, lab analysis, nursing]"
4,5,Siddharth Khanna,"[policy analysis, drafting, legal research]"



Sample cleaned internship skills:


Unnamed: 0,ID,Company,Role,req_skills_list
0,1,Flipkart,Software Intern,"[python, django, sql]"
1,2,Times of India,Accounting Intern,"[accounting, excel, auditing]"
2,3,HDFC Bank,Social Work Intern,"[social work, community outreach, counselling]"
3,4,Indian Railways,Video Editor,"[video editing, adobe premiere pro, after effe..."
4,5,OYO Rooms,Social Work Intern,"[social work, community outreach, counselling]"


## Step 3 — Vectorize Skills for AI Matching

Use TF-IDF on combined skill texts (candidates + internships), then compute cosine similarity.

In [12]:
# Step 3:Vectorize skills and compute cosine similarity
# Build corpus from both sides to ensure shared vocabulary
corpus = pd.concat([candidates['skills_text'], internships['req_skills_text']], ignore_index=True).astype(str)

vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(corpus)

n_candidates = candidates.shape[0]
n_internships = internships.shape[0]

X_candidates = X[:n_candidates]
X_internships = X[n_candidates:n_candidates+n_internships]

similarity_matrix = cosine_similarity(X_candidates, X_internships)
print('Similarity matrix shape:', similarity_matrix.shape)

Similarity matrix shape: (100, 100)


## Step 4 — Apply Matching Logic

Start from TF-IDF similarity, then apply preference boosts:

- +2.0 if Qualification matches  
- +0.5 if City matches  
- +0.3 if State matches  
- +0.2 if candidate is Rural  
- +0.2 if First_Time_Applicant == Yes  
- +1.0 per skill match  
- -2.0 if Qualification mismatch  

> The total score is then normalized to a **percentage fit (0–100%)**.  
> Eligibility is determined using a configurable threshold (default 50%).  


In [57]:
# Step 4: Improved Matching logic with normalized percentage score
def compute_boosted_score(base_score, cand_row, intern_row):
    """
    Returns a percentage match score (0-100%) and reasons for boosting.
    """
    # Count skill matches
    cand_skills = cand_row.get('skills_set', set())
    intern_skills = intern_row.get('req_skills_set', set())
    matched_skills = cand_skills.intersection(intern_skills)
    skill_score = len(matched_skills) * 1.0  # 1pt per matching skill
    reasons = []
    if matched_skills:
        reasons.append('Skills matched: ' + ', '.join([s.title() for s in sorted(matched_skills)]))
    else:
        reasons.append('No direct skill overlap')
    
    score = skill_score
    
    # Qualification match
    cand_q = str(cand_row.get('Qualifications','')).strip().lower()
    intern_q = str(intern_row.get('Qualification_Required','')).strip().lower()
    if intern_q not in ('any','') and cand_q == intern_q:
        score += 2.0
        reasons.append('Qualification matched')
    elif intern_q not in ('any','') and cand_q != intern_q:
        score -= 2.0  #ineligible penalty
        reasons.append('Qualification mismatch penalty')
    
    # City match
    try:
        if str(cand_row.get('City','')).strip().lower() == str(intern_row.get('Location_City','')).strip().lower():
            score += 0.5
            reasons.append('City matched')
    except Exception:
        pass
    
    # State match
    try:
        if str(cand_row.get('State','')).strip().lower() == str(intern_row.get('Location_State','')).strip().lower():
            score += 0.3
            reasons.append('State matched')
    except Exception:
        pass
    
    # Rural boost
    try:
        if str(cand_row.get('Rural_or_Urban','')).strip().lower() == 'rural':
            score += 0.2
            reasons.append('Rural boost')
    except Exception:
        pass
    
    # First-time applicant boost
    try:
        if str(cand_row.get('First_Time_Applicant','')).strip().lower() == 'yes':
            score += 0.2
            reasons.append('First-time applicant boost')
    except Exception:
        pass
    
    # Calculate max possible score for normalization
    max_score = len(intern_skills) * 1.0 + 2.0 + 0.5 + 0.3 + 0.2 + 0.2 
    if score < 0:
        score = 0  # do not allow negative scores
    perc_score = round((score / max_score) * 100, 2)
    
    return perc_score, reasons


final_matches['Eligible'] = final_matches['Score'] >= 50  #edit here 
final_matches.loc[final_matches['Reason'].str.contains('Qualification mismatch penalty'), 'Eligible'] = False













# # Step 4: Matching logic helpers
# def compute_boosted_score(base_score, cand_row, intern_row):
#     score = float(base_score)
#     reasons = []
#     # State match
#     try:
#         if str(cand_row.get('State','')).strip().lower() == str(intern_row.get('Location_State','')).strip().lower():
#             score += 0.1
#             reasons.append('State matched')
#     except Exception:
#         pass
#     # City match
#     try:
#         if str(cand_row.get('City','')).strip().lower() == str(intern_row.get('Location_City','')).strip().lower():
#             score += 0.1
#             reasons.append('City matched')
#     except Exception:
#         pass
#     # Qualification match (exact)
#     try:
#         cand_q = str(cand_row.get('Qualifications','')).strip().lower()
#         intern_q = str(intern_row.get('Qualification_Required','')).strip().lower()
#         if intern_q not in ('any','') and cand_q == intern_q:
#             score += 0.2
#             reasons.append('Qualification matched')
#     except Exception:
#         pass
#     # Rural boost
#     try:
#         if str(cand_row.get('Rural_or_Urban','')).strip().lower() == 'rural':
#             score += 0.05
#             reasons.append('Rural boost')
#     except Exception:
#         pass
#     # First-time applicant boost
#     try:
#         if str(cand_row.get('First_Time_Applicant','')).strip().lower() == 'yes':
#             score += 0.05
#             reasons.append('First-time applicant boost')
#     except Exception:
#         pass
#     return score, reasons

## Step 5 — Generate Top Recommendations

For each candidate, select Top 3 internships ranked by boosted score and provide explanations.

In [60]:
# Step 5: Generate top recommendations with eligibility
TOP_K = 3
matches = []

ELIGIBILITY_THRESHOLD = 50 

for ci in range(n_candidates):
    cand_row = candidates.iloc[ci].to_dict()
    base_scores = similarity_matrix[ci]  # skill-based similarity
    scored_list = []
    
    for ii in range(n_internships):
        intern_row = internships.iloc[ii].to_dict()
        perc_score, reasons = compute_boosted_score(base_scores[ii], cand_row, intern_row)
        
        # Determine eligibility
        eligible = perc_score >= ELIGIBILITY_THRESHOLD
        # If there is a qualification mismatch penalty, mark as not eligible
        if 'Qualification mismatch penalty' in reasons:
            eligible = False
        
        scored_list.append((ii, perc_score, reasons, eligible))
    
    # Select top K internships for this candidate
    topk = nlargest(TOP_K, scored_list, key=lambda x: x[1])
    
    for (ii, score, reasons, eligible) in topk:
        matches.append({
            'Candidate_ID': candidates.at[ci,'ID'],
            'Candidate_Name': candidates.at[ci,'Name'],
            'Internship_ID': internships.at[ii,'ID'],
            'Company': internships.at[ii,'Company'],
            'Role': internships.at[ii,'Role'],
            'Score': score,  # already percentage
            'Eligible': 'Yes' if eligible else 'No',
            'Reason': '; '.join(reasons)
        })

final_matches = pd.DataFrame(matches)
final_matches = final_matches.sort_values(['Candidate_ID','Score'], ascending=[True, False]).reset_index(drop=True)
print('Total matches generated:', final_matches.shape[0])


# # Step 5: Generate top recommendations
# TOP_K = 3
# matches = []

# for ci in range(n_candidates):
#     cand_row = candidates.iloc[ci].to_dict()
#     base_scores = similarity_matrix[ci]  # length = n_internships
#     scored_list = []
#     for ii in range(n_internships):
#         intern_row = internships.iloc[ii].to_dict()
#         base = base_scores[ii]
#         boosted, boost_reasons = compute_boosted_score(base, cand_row, intern_row)
#         skills_intersection = candidates.at[ci, 'skills_set'].intersection(internships.at[ii, 'req_skills_set'])
#         if skills_intersection:
#             skill_reason = 'Skills matched: ' + ', '.join([s.title() for s in sorted(skills_intersection)])
#         else:
#             skill_reason = 'No direct skill overlap'
#         reasons = [skill_reason] + boost_reasons
#         full_reason = '; '.join(reasons)
#         scored_list.append((ii, boosted, full_reason))
#     topk = nlargest(TOP_K, scored_list, key=lambda x: x[1])
#     for (ii, sc, reason) in topk:
#         matches.append({
#             'Candidate_ID': candidates.at[ci,'ID'],
#             'Candidate_Name': candidates.at[ci,'Name'],
#             'Internship_ID': internships.at[ii,'ID'],
#             'Company': internships.at[ii,'Company'],
#             'Role': internships.at[ii,'Role'],
#             'Score': round(sc,4),
#             'Reason': reason
#         })

# final_matches = pd.DataFrame(matches)
# final_matches = final_matches.sort_values(['Candidate_ID','Score'], ascending=[True, False]).reset_index(drop=True)
# print('Total matches generated:', final_matches.shape[0])

Total matches generated: 300


## Step 6 — Save & Display Results

Save `final_matches.csv` and display the first 10 recommendations.

In [63]:
# Step 6: Save and display with eligibility
OUT_CSV = 'final_matches.csv'
final_matches.to_csv(OUT_CSV, index=False)
print(f'Saved final matches to {OUT_CSV}')

print('\nSample matches (first 10):')
display(final_matches.head(10))


Saved final matches to final_matches.csv

Sample matches (first 10):


Unnamed: 0,Candidate_ID,Candidate_Name,Internship_ID,Company,Role,Score,Eligible,Reason
0,1,Rajeev Ranjan,9,EY India,Teaching Intern,67.74,Yes,"Skills matched: Teaching, Training; Qualificat..."
1,1,Rajeev Ranjan,6,Teach For India,Social Work Intern,51.61,Yes,Skills matched: Social Work; Qualification mat...
2,1,Rajeev Ranjan,8,Akshaya Patra Foundation,Social Work Intern,51.61,Yes,Skills matched: Social Work; Qualification mat...
3,2,Manish Yadav,36,Swiggy,HR Intern,51.61,Yes,Skills matched: Excel; Qualification matched; ...
4,2,Manish Yadav,72,ICICI Bank,HR Intern,51.61,Yes,Skills matched: Excel; Qualification matched; ...
5,2,Manish Yadav,81,Teach For India,Data Analyst Intern,51.61,Yes,Skills matched: Excel; Qualification matched; ...
6,3,Shruti Pandey,79,Deloitte India,Copywriter,40.32,No,No direct skill overlap; Qualification matched...
7,3,Shruti Pandey,12,HDFC Bank,Copywriter,35.48,No,No direct skill overlap; Qualification matched...
8,3,Shruti Pandey,33,Infosys,Copywriter,35.48,No,No direct skill overlap; Qualification matched...
9,4,Aditya Verma,90,NITI Aayog,Business Development Intern,48.39,No,No direct skill overlap; Qualification matched...


## Optional — Test with a Live Candidate Input

Demonstrates recommending for a single candidate dictionary rather than the full CSV.

In [65]:
# Optional live-candidate recommend function with eligibility
def recommend_for_live_candidate(live_candidate, top_k=5, eligibility_threshold=50):
    # Clean and vectorize skills
    skills_list = clean_skill_string(live_candidate.get('Skills',''))
    skills_text = ' '.join(skills_list)
    vec = vectorizer.transform([skills_text])
    sims = cosine_similarity(vec, X_internships).flatten()
    
    recs = []
    for ii in range(n_internships):
        intern_row = internships.iloc[ii].to_dict()
        base = sims[ii]
        perc_score, boost_reasons = compute_boosted_score(base, live_candidate, intern_row)
        
        # Determine eligibility
        eligible = perc_score >= eligibility_threshold
        if 'Qualification mismatch penalty' in boost_reasons:
            eligible = False
        
        recs.append((ii, perc_score, boost_reasons, eligible))
    
    # Select top K
    top = nlargest(top_k, recs, key=lambda x: x[1])
    
    rows = []
    for ii, sc, reasons, eligible in top:
        rows.append({
            'Internship_ID': internships.at[ii,'ID'],
            'Company': internships.at[ii,'Company'],
            'Role': internships.at[ii,'Role'],
            'Score': sc,  # percentage fit
            'Eligible': 'Yes' if eligible else 'No',
            'Reason': '; '.join(reasons)
        })
    
    return pd.DataFrame(rows)

# Example usage
live_candidate = {
    'Name':'Live Candidate',
    'State':'Guwahati',
    'City':'Dispur',
    'Skills':'Content Writing, Copywriting',
    'Qualifications':'BBA',
    'Year_of_Study':'3rd',
    'GPA':'7',
    'Rural_or_Urban':'Urban',
    'First_Time_Applicant':'Yes'
}

print('Top recommendations for example live candidate:')
display(recommend_for_live_candidate(live_candidate, top_k=5))


Top recommendations for example live candidate:


Unnamed: 0,Internship_ID,Company,Role,Score,Eligible,Reason
0,3,HDFC Bank,Social Work Intern,35.48,No,No direct skill overlap; Qualification matched...
1,16,Infosys,Data Analyst Intern,35.48,No,No direct skill overlap; Qualification matched...
2,20,Smile Foundation,UI/UX Intern,35.48,No,No direct skill overlap; Qualification matched...
3,53,Indian Railways,Software Intern,35.48,No,No direct skill overlap; Qualification matched...
4,68,Indian Railways,Marketing Associate,35.48,No,No direct skill overlap; Qualification matched...
