In [6]:
import nltk
from nltk.corpus import reuters

# Download Reuters corpus (Run this once)
nltk.download('reuters')

# Get a list of all file IDs (documents)
file_ids = reuters.fileids()

# Join the tokens of the first 50 Reuters articles into strings.
# These news articles are only stand-ins for resumes in this demo.
resume_corpus = [" ".join(reuters.words(file_id)) for file_id in file_ids[:50]]

# Print sample resumes
for i, resume in enumerate(resume_corpus, 1):
    print(f"\nResume {i}:")
    print(resume[:500])  # Print only first 500 characters for readability



Resume 1:
ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . They told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And lead to curbs on American imports of their products . But some exporters said that while the conflict would 

Resume 2:
CHINA DAILY SAYS VERMIN EAT 7 - 12 PCT GRAIN STOCKS A survey of 19 provinces and seven cities showed vermin consume between seven and 12 pct of China ' s grain stocks , the China Daily said . It also said that each year 1 . 575 mln tonnes , or 25 pct , of China ' s fruit output are left to rot , and 2 . 1 mln tonnes , or up to 30 pct , of its vegetables . The paper blamed the waste on inadequate storage and bad preservation methods . It said the government had launched a n

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\SATHVIK\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


In [7]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

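# Download the tokenizer models and the stopword list (run this once)
nltk.download('punkt')
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    """Lowercase, tokenize, and drop stopwords and single-character punctuation."""
    tokens = word_tokenize(text.lower())
    # Note: `word not in string.punctuation` is a substring test, so
    # multi-character tokens such as ".-" or "``" are not removed.
    tokens = [word for word in tokens if word not in STOP_WORDS and word not in string.punctuation]
    return " ".join(tokens)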

# Apply preprocessing to all resumes
cleaned_resumes = [preprocess_text(resume) for resume in resume_corpus]

# Print cleaned sample
for i, resume in enumerate(cleaned_resumes, 1):
    print(f"\nCleaned Resume {i}:")
    print(resume[:500])  # First 500 characters


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SATHVIK\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SATHVIK\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Cleaned Resume 1:
asian exporters fear damage u .- japan rift mounting trade friction u japan raised fears among many asia exporting nations row could inflict far reaching economic damage businessmen officials said told reuter correspondents asian capitals u move japan might boost protectionist sentiment u lead curbs american imports products exporters said conflict would hurt long run short term tokyo loss might gain u said impose 300 mln dlrs tariffs imports japanese electronics goods april 17 retaliation japan

Cleaned Resume 2:
china daily says vermin eat 7 12 pct grain stocks survey 19 provinces seven cities showed vermin consume seven 12 pct china grain stocks china daily said also said year 1 575 mln tonnes 25 pct china fruit output left rot 2 1 mln tonnes 30 pct vegetables paper blamed waste inadequate storage bad preservation methods said government launched national programme reduce waste calling improved technology storage preservation greater production additives paper gav
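
The cleaned output above still contains tokens such as `.-`, because the filter in `preprocess_text` checks whether each token is a substring of `string.punctuation`. A stricter filter could keep only tokens that contain at least one letter or digit. The following is a minimal sketch under that assumption; `drop_non_word_tokens` is a hypothetical helper, not part of the notebook, and the stopword set is a tiny stand-in for `stopwords.words('english')`.

```python
import string

def drop_non_word_tokens(tokens, stop_words):
    """Keep tokens that are not stopwords and contain at least one letter or digit."""
    kept = []
    for token in tokens:
        if token in stop_words:
            continue
        # Tokens made only of punctuation (".-", "``", ",") have no alphanumeric character.
        if not any(ch.isalnum() for ch in token):
            continue
        kept.append(token)
    return kept

# ".-" is not a contiguous substring of string.punctuation, so the original test keeps it.
survives_original_filter = ".-" not in string.punctuation

# Toy input; the stopword set is a stand-in for the NLTK English list.
sample_tokens = ["u", ".-", "japan", "``", "the", "rift", ",", "u.s."]
filtered = drop_non_word_tokens(sample_tokens, stop_words={"the"})
print(survives_original_filter)
print(filtered)
```

Swapping this check into the list comprehension would change the cleaned text, so the TF-IDF results further down would need to be recomputed.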

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Convert resumes into TF-IDF matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(cleaned_resumes)

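# Sum the TF-IDF weights in each row to get one score per resume.
# TfidfVectorizer L2-normalizes rows by default, so this sum tends to favor
# documents with many distinct terms rather than measuring relevance.
tfidf_scores = np.array(tfidf_matrix.sum(axis=1)).flatten()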

# Find top N resumes based on highest TF-IDF scores
top_n = 3
top_indices = np.argsort(tfidf_scores)[-top_n:][::-1]  # Sort in descending order

# Print top resumes
print("\nTop Resumes based on TF-IDF scores:")
for idx in top_indices:
    print(f"\nResume {idx+1}:")
    print(cleaned_resumes[idx][:500])  # Print first 500 chars



Top Resumes based on TF-IDF scores:

Resume 34:
economic spotlight kuwaiti economy kuwait oil reliant debt ridden economy started pull nosedive oil prices determine pace recovery bankers economists say crucial ability 13 member opec hold oil prices around new benchmark 18 dlrs barrel northern hemisphere summer demand usually slackens bankers estimate economy measured terms gross domestic product gdp shrank 19 pct real terms last year contracting 8 1 pct year taking account inflation consumer prices 1 5 pct 1985 slowing 1 0 pct 1986 factors de

Resume 1:
asian exporters fear damage u .- japan rift mounting trade friction u japan raised fears among many asia exporting nations row could inflict far reaching economic damage businessmen officials said told reuter correspondents asian capitals u move japan might boost protectionist sentiment u lead curbs american imports products exporters said conflict would hurt long run short term tokyo loss might gain u said impose 300 mln dlrs tariffs 
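
The ranking above sums each resume's TF-IDF weights. Because `TfidfVectorizer` normalizes every row to unit Euclidean length by default (`norm='l2'`), that sum depends heavily on how many distinct terms a document has. The toy sketch below illustrates the effect with made-up equal weights, not the notebook's data; `l2_normalize` is a hypothetical helper written for this illustration.

```python
import math

def l2_normalize(weights):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

# Two toy documents with equal raw weights: 4 distinct terms versus 16.
short_doc = l2_normalize([1.0] * 4)
long_doc = l2_normalize([1.0] * 16)

# With k equal weights, the sum of the normalized row is sqrt(k).
short_sum = sum(short_doc)
long_sum = sum(long_doc)
print(short_sum, long_sum)
```

Under these assumptions the longer document scores higher purely because of its vocabulary size, which is why a summed score is better read as a rough length signal than as a quality measure.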

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# Define a sample job description
job_desc = "Looking for a data scientist with experience in Python, machine learning, and deep learning."

# Preprocess the job description
job_desc_cleaned = preprocess_text(job_desc)

# Transform the job description with the already-fitted vectorizer
# (terms that were not seen during fit are ignored)
job_tfidf = vectorizer.transform([job_desc_cleaned])

# Compute cosine similarity between resumes and job description
similarities = cosine_similarity(tfidf_matrix, job_tfidf).flatten()

# Get top matching resumes
top_indices = np.argsort(similarities)[-top_n:][::-1]

# Print top matches
print("\nTop Resumes Matching the Job Description:")
for idx in top_indices:
    print(f"\nResume {idx+1} (Similarity: {similarities[idx]:.2f}):")
    print(cleaned_resumes[idx][:500])  # First 500 chars



Top Resumes Matching the Job Description:

Resume 39 (Similarity: 0.12):
japanese official takes data microchip talks ministry international trade industry miti vice minister makoto kuroda leaves washington today data hopes refute u charges japan violated pact microchip trade three man japanese trade team already washington laying groundwork talks kuroda deputy u trade representative michael smith aimed persuading u impose tariffs certain japanese products kuroda said taking new proposals `` nothing briefcase except explanation current situation '' kuroda told daily 

Resume 22 (Similarity: 0.06):
german industrial employment seen stagnating number workers employed west german industrial sector stagnated last quarter 1986 50 000 increase overall employment benefited services branch diw economic institute said diw report added general downturn economy since last autumn negative effect willingness firms take workers referred marked downturn number workers taken capital goods sector new 
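
The similarities above are low (0.12 at best), and the top match contains the word "data". A likely reason is that few job-description terms occur in the news vocabulary at all. One way to check is to split the query tokens into those the fitted vectorizer knows and those it does not. This is a minimal sketch: `vocabulary_overlap` is a hypothetical helper, and `toy_vocabulary` is invented for illustration in place of `vectorizer.vocabulary_`.

```python
def vocabulary_overlap(query_tokens, vocabulary):
    """Split query tokens into those present in the vocabulary and those absent."""
    known = [t for t in query_tokens if t in vocabulary]
    unknown = [t for t in query_tokens if t not in vocabulary]
    return known, unknown

# Invented vocabulary (term -> column index), standing in for vectorizer.vocabulary_.
toy_vocabulary = {"data": 0, "trade": 1, "experience": 2, "japan": 3}

# Toy query, roughly a job description after stopword and punctuation removal.
query = "looking data scientist experience python machine learning deep learning".split()

known, unknown = vocabulary_overlap(query, toy_vocabulary)
print(known)
print(unknown)
```

In the notebook, passing `job_desc_cleaned.split()` and `vectorizer.vocabulary_` would show which terms actually contribute to the cosine similarity.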