# **Title : LLM Training Phase 1**
# **Document by : Sanchana Mohankumar**

### **Mounting Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## **Data**

Link - https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset


## **Installations**

In [None]:
pip install openai

In [None]:
pip install PyPDF2

## **Libraries**

In [4]:
import json
import multiprocessing
import os

import gensim
import nltk
import numpy as np
import openai
import PyPDF2
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

### **API KEY**

In [5]:
# OpenAI API key
openai.api_key = "YOUR_API_KEY"

### **Data Preprocessing Downloads**



In [None]:
# Downloading NLTK data
nltk.download("punkt")
nltk.download("stopwords")

# Initialized NLTK's stopwords
stop_words = set(nltk.corpus.stopwords.words("english"))

# **PART 1**



### **TASK 1:**
Task 1 automates extracting text from a PDF resume, sending it to GPT-3 for further processing, receiving a JSON-like response, and storing it in a specified file on Google Drive. Here a single resume is passed in and the resulting JSON file is checked.


This code performs the following tasks:

**1. Text Extraction:** It extracts text content from a PDF file located at pdf_resume_path using the PyPDF2 library. This extracted text is essentially the content of a resume.

**2. Prompt Construction:** It constructs a GPT-3 prompt by including the extracted resume text within a structured prompt. The prompt asks GPT-3 to generate a JSON-like output based on the resume content.

**3. GPT-3 Interaction:** It interacts with the OpenAI GPT-3 API, using the constructed prompt. It requests GPT-3 to generate a text response in JSON format based on the provided resume.

**4. JSON Output Retrieval:** The generated JSON-like output from GPT-3 is obtained and stored in the raw_output_text variable.

**5. File Saving:** The JSON output is saved to a file located at output_file_path, which is specified as "/content/drive/MyDrive/output_1.json". This allows you to store the generated JSON output in your Google Drive.

**6. Printing:** Finally, the generated JSON-like output is printed to the console.
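
Because the model's reply is only JSON-like, it can help to check that it parses before writing it to disk. The following is a minimal sketch, assuming the reply is held in a plain string; `parse_model_json` is a hypothetical helper and the sample strings are made up, neither is part of the notebook code.

```python
import json

def parse_model_json(raw_text):
    """Return the parsed object if raw_text contains valid JSON, else None.

    Hypothetical helper: it trims anything before the first '{' and after
    the last '}', since a model may wrap the JSON in extra prose.
    """
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None

# Made-up replies: one complete, one cut off mid-object.
complete = 'Here is the JSON:\n{"name": "Jane Doe", "skills": ["python"]}'
truncated = '{"name": "Jane Doe", "workExperience": [{"position": "Data Ana'

print(parse_model_json(complete))
print(parse_model_json(truncated))
```

A reply that is cut off when the completion reaches `max_tokens` would return `None` here, which is a cue to retry or shorten the prompt rather than save an unusable file.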



In [20]:
# Converted to text from PDF file using PyPDF2 library
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

# Constructed and sent the prompt to GPT-3
def generate_raw_output(pdf_resume_path):
    resume_text = extract_text_from_pdf(pdf_resume_path)

    prompt = f"Given the following resume:\n{resume_text}\n\nPlease generate a JSON output."

    response = openai.Completion.create(
        engine="text-davinci-003",  # 4,097-token context window, shared by prompt and completion
        prompt=prompt,
        max_tokens= 2500
    )

    output_text = response.choices[0].text.strip()
    return output_text

# Provided the path to the PDF resume file
pdf_resume_path = "/content/drive/MyDrive/Resume_Database/Sanchana_Mohankumar_Resume.pdf"

# Generated raw JSON-like output using GPT-3
raw_output_text = generate_raw_output(pdf_resume_path)

# Saved the JSON output to a file in Google Drive
output_file_path = "/content/drive/MyDrive/output_1.json"
with open(output_file_path, "w") as output_file:
    output_file.write(raw_output_text)

print(raw_output_text)

{
  "name": "Sanchana",
  "contactInfo": {
    "phoneNumber": "+1 (206) 209-9901",
    "email": "mohankumar.s@northeastern.edu",
    "name": "Mohankumar"
  },
  "socialProfiles": {
    "linkedin": "www.linkedin.com/in/sanchanamohankumar",
    "github": "https://github.com/Sanchana1997"
  },
  "backgroundSummary": "Experienced and driven Data Analyst with almost 2 years of experience in delivering results. Seeking new challenges and opportunities in field of Data to apply my skills and expertise.",
  "workExperience": [
    {
      "position": "Graduate Teaching Assistant",
      "employer": "Northeastern University",
      "timePeriod": "January 2023 - April 2023",
      "jobsPerformed": [
        "Demonstrated technical expertise in python, statistical analysis, and machine learning while conducting lab sessions, grading assignments, and assisting students with data mining course work."
      ]
    },
    {
      "position": "Data Analyst",
      "employer": "Accenture",
      "timePe

### **TASK 2:**

#### **tf-idf Vectorization**
In Task 2, I extract text from the PDF resumes as in Task 1, vectorize it with TF-IDF, calculate cosine similarity scores against an input resume, and print the top 5 most similar resumes along with their similarity scores. The GPT-3 helper from Task 1 is defined again here, but `main` does not call it: similarity is computed on the extracted text directly.

**1. Text Extraction:** It defines a function extract_text_from_pdf to extract text content from a PDF file using the PyPDF2 library. The extracted text represents the content of a resume.

**2. GPT-3 Prompt Construction:** It defines a function generate_raw_output that extracts text from a specified PDF resume file, constructs a GPT-3 prompt by embedding the extracted text, and sends this prompt to the GPT-3 model. The model is instructed to generate a JSON-like output based on the provided resume text.

**3. TF-IDF Vectorization:** It defines a function vectorize_resume to convert resume text into a TF-IDF vector using scikit-learn's TF-IDF vectorizer.

**4. Main Function:** In the main function:

- It loads the input resume (specified by input_resume_path) and extracts its text.
- Creates a TF-IDF vectorizer and fits it on the input resume text only, so the vocabulary is limited to terms that appear in that resume.
- Vectorizes all resumes in the database folder using TF-IDF and stores them as TF-IDF matrices.
- Calculates the cosine similarity between the input resume and all resumes in the database using TF-IDF vectors.
- Retrieves the indices of the most similar resumes based on cosine similarity and their respective similarity scores.
- Prints the top 5 most similar resume filenames along with their cosine similarity scores.
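
The ranking steps above can be illustrated on toy strings without any PDFs. The sketch below is a plain-Python approximation of TF-IDF with smoothed IDF and L2 normalisation (the defaults of scikit-learn's `TfidfVectorizer`), using whitespace tokenization for brevity. Unlike `main`, it fits the vocabulary on every document plus the query rather than on the input resume alone; that choice, the helper names, and the sample texts are all assumptions of this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Made-up stand-ins for extracted resume text.
resumes = {
    "data_analyst.pdf": "python sql statistics dashboards",
    "data_scientist.pdf": "python machine learning statistics",
    "fitness_coach.pdf": "training nutrition coaching",
}
query = "python statistics machine learning"

names = list(resumes)
vectors = tfidf_vectors(list(resumes.values()) + [query])
query_vector, db_vectors = vectors[-1], vectors[:-1]

# Sort database entries by similarity to the query, highest first.
ranked = sorted(
    ((name, cosine(query_vector, vec)) for name, vec in zip(names, db_vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"Filename: {name}, Similarity Score: {score:.4f}")
```

Fitting on the whole collection lets terms that occur only in database resumes contribute to the vectors, and gives the IDF weights something to discriminate between.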

In [11]:
# Defined a function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

# Constructed and sent the prompt to GPT-3
def generate_raw_output(pdf_resume_path):
    resume_text = extract_text_from_pdf(pdf_resume_path)

    prompt = f"Given the following resume:\n{resume_text}\n\nPlease generate a JSON output."

    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=2500
    )

    text = response.choices[0].text.strip()
    return text

# Defined a function to vectorize the text using TF-IDF
def vectorize_resume(text, tfidf_vectorizer):
    tfidf_matrix = tfidf_vectorizer.transform([text])
    return tfidf_matrix


def main():
    # Loaded and vectorized the input resume using TF-IDF
    input_resume_path = "/content/drive/MyDrive/Resume_Database/Data_Resume.pdf"
    input_resume_text = extract_text_from_pdf(input_resume_path)

    # Created a TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit([input_resume_text])

    # Vectorized the resumes in the database using TF-IDF
    database_resumes = []
    database_folder = "/content/drive/MyDrive/Resume_Database"

    # Listed all files in the Resume_Database folder
    pdf_resume_filenames = [filename for filename in os.listdir(database_folder) if filename.lower().endswith(".pdf")]

    for pdf_resume_filename in pdf_resume_filenames:
        db_resume_path = os.path.join(database_folder, pdf_resume_filename)
        db_resume_text = extract_text_from_pdf(db_resume_path)
        db_resume_vector = vectorize_resume(db_resume_text, tfidf_vectorizer)
        database_resumes.append(db_resume_vector)

    # Stacked the TF-IDF vectors into a single 2-D numpy array for similarity calculation
    database_matrix = np.vstack([tfidf_vector.toarray() for tfidf_vector in database_resumes])

    # Calculated the cosine similarity between the input resume and all resumes in the database
    input_resume_vector = tfidf_vectorizer.transform([input_resume_text])
    cosine_similarities = cosine_similarity(input_resume_vector, database_matrix).flatten()

    # Got the indices of the most similar resumes and their cosine similarity scores
    indices = cosine_similarities.argsort()[::-1]
    top_5_similar_resumes = [(pdf_resume_filenames[idx], cosine_similarities[idx]) for idx in indices[:5]]

    # Printed the top 5 most similar resume filenames along with their cosine similarity scores
    print("Top 5 most similar resume filenames and their cosine similarity scores:")
    for resume_filename, similarity_score in top_5_similar_resumes:
        print(f"Filename: {resume_filename}, Similarity Score: {similarity_score}")

if __name__ == "__main__":
    main()

Top 5 most similar resume filenames and their cosine similarity scores:
Filename: Data_Resume.pdf, Similarity Score: 1.0
Filename: Sanchana_Mohankumar_Resume.pdf, Similarity Score: 0.8420834224000379
Filename: Data_Scientist_Resume.pdf, Similarity Score: 0.7429042776944733
Filename: Data2_Resume.pdf, Similarity Score: 0.718642800041404
Filename: Fitness_Resume.pdf, Similarity Score: 0.7138810369392904


#### **Doc2Vec Vectorization**
In this part of Task 2, I extract text from the PDF resumes as before, vectorize it with Doc2Vec, calculate cosine similarity scores against an input resume, and print the top 5 most similar resumes along with their similarity scores.

**1. Text Extraction:** It defines a function extract_text_from_pdf to extract text content from a PDF file using the PyPDF2 library. The extracted text represents the content of a resume.

**2. Text Preprocessing and Tagging:** preprocess_text cleans the extracted text with Gensim's simple_preprocess, which lowercases and tokenizes it and drops punctuation and very short or very long tokens; it does not remove stopwords. The tokens are then wrapped in a Gensim TaggedDocument object whose tag is the document's original text.

**3. GPT-3 Prompt Construction:** It defines a function generate_json_output that embeds the extracted resume text in a GPT-3 prompt and asks the model to generate a JSON-like output. The main function calls it for the input resume, but the result is not used in the similarity calculation.

**4. Main Function:** The main function is where everything comes together:

- Load and Preprocess Data: It loads and preprocesses the input resume text and the text of resumes in the database. The input resume and database resumes are represented as TaggedDocument objects.
- Model Training: It trains a Doc2Vec model using the TaggedDocuments from the input and database resumes. The model learns vector representations of the resume texts.
- Cosine Similarity Calculation: It calculates cosine similarity scores between the input resume and all resumes in the database. The similarity scores indicate how closely related each database resume is to the input resume.
- Identify Top Matches: It identifies the top 5 resumes in the database that have the highest cosine similarity scores with the input resume.
- Display Results: It displays the filenames of the top 5 most similar resumes along with their respective similarity scores.

In [12]:
# Function to extract text from a PDF
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

# Function to preprocess text and create TaggedDocument objects
def preprocess_text(text):
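    # Lowercase and tokenize; simple_preprocess also drops punctuation and very short or very long tokens
    words = gensim.utils.simple_preprocess(text)
    # The full text doubles as the document tag, so identical texts share one vector
    return TaggedDocument(words=words, tags=[text])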

# Function to generate a JSON output using GPT-3
def generate_json_output(resume_text):
    prompt = f"Given the following resume:\n{resume_text}\n\nPlease generate a JSON output."

    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt,
        max_tokens=2500
    )

    json_output = response.choices[0].text.strip()
    return json_output

def main():
    # Load and preprocess the input resume
    input_resume_path = "/content/drive/MyDrive/Resume_Database/Data_Resume.pdf"
    input_resume_text = extract_text_from_pdf(input_resume_path)
    input_resume_doc = preprocess_text(input_resume_text)

    # Generated JSON output using GPT-3
    json_output = generate_json_output(input_resume_text)

    # Preprocessed and tag the resumes in the database
    database_resumes = []
    database_folder = "/content/drive/MyDrive/Resume_Database"

    pdf_resume_filenames = [filename for filename in os.listdir(database_folder) if filename.lower().endswith(".pdf")]

    for pdf_resume_filename in pdf_resume_filenames:
        db_resume_path = os.path.join(database_folder, pdf_resume_filename)
        db_resume_text = extract_text_from_pdf(db_resume_path)
        db_resume_doc = preprocess_text(db_resume_text)
        database_resumes.append(db_resume_doc)

    # Trained a Doc2Vec model
    model = Doc2Vec(vector_size=300, min_count=2, epochs=40)
    model.build_vocab([input_resume_doc] + database_resumes)
    model.train([input_resume_doc] + database_resumes, total_examples=model.corpus_count, epochs=model.epochs)

    # Calculated cosine similarity scores
    similarity_scores = []
    input_vector = model.dv[input_resume_doc.tags[0]]
    for db_resume in database_resumes:
        db_vector = model.dv[db_resume.tags[0]]
        similarity = cosine_similarity([input_vector], [db_vector])[0][0]
        similarity_scores.append(similarity)

    # Created a list of tuples with filename and similarity score
    similarity_results = list(zip(pdf_resume_filenames, similarity_scores))

    # Sorted the results by similarity score in descending order
    similarity_results.sort(key=lambda x: x[1], reverse=True)

    print("Top 5 Resumes with the Highest Cosine Similarity Scores:")
    for filename, similarity_score in similarity_results[:5]:
        print(f"Filename: {filename}, Similarity Score: {similarity_score:.4f}")


if __name__ == "__main__":
    main()


Top 5 Resumes with the Highest Cosine Similarity Scores:
Filename: Data_Resume.pdf, Similarity Score: 1.0000
Filename: Sanchana_Mohankumar_Resume.pdf, Similarity Score: 0.8374
Filename: Data2_Resume.pdf, Similarity Score: 0.7934
Filename: Robotics_Engineer_Resume.pdf, Similarity Score: 0.6508
Filename: Data_Scientist_Resume.pdf, Similarity Score: 0.5340


# **PART 1 Conclusion**

In this phase I used one resume for Task 1 to convert the input resume to JSON format. In Task 2 I used TF-IDF and Doc2Vec vectorization to compare the resulting similarity scores.

- While TF-IDF has its strengths, such as simplicity and interpretability, it may not capture the semantic relationships present in documents like resumes. Doc2Vec, on the other hand, is designed to capture the meaning and context of text, which makes it a more appropriate choice for matching and ranking resumes in this case.




# **PART 1 Future Enhancements**

**1. Feature Engineering:** I used TF-IDF and Doc2Vec as text representations. Other word and document embedding techniques, both established and recent, remain to be tried and may improve the accuracy and precision of the resume matching system.

**2. Text Preprocessing:** More advanced preprocessing than the techniques used here could improve the quality of the input data and lead to more precise and relevant results in the resume matching process.

**3. Similarity Metrics:** The resume matching system currently relies on cosine similarity. Exploring other metrics, such as Jaccard similarity and Euclidean distance, could make the rankings more effective and reliable.
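
As a rough illustration of the two alternative metrics named above, the sketch below computes Jaccard similarity on token sets and Euclidean distance on numeric vectors. The function names and sample inputs are hypothetical and not part of the notebook.

```python
import math

def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection of two token sets divided by the size of their union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def euclidean_distance(vec_a, vec_b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.dist(vec_a, vec_b)

# Made-up token lists standing in for two preprocessed resumes.
resume_a = "python sql statistics".split()
resume_b = "python statistics machine learning".split()

print(jaccard_similarity(resume_a, resume_b))  # 2 shared terms out of 5 distinct -> 0.4
print(euclidean_distance([1.0, 0.0, 2.0], [1.0, 3.0, 6.0]))  # sqrt(0 + 9 + 16) -> 5.0
```

Note that Euclidean distance is a distance, so smaller values mean more similar documents, the opposite direction from cosine and Jaccard similarity.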


# **Deliverables PART 1:**
- Task 1:
Input_data_Task_1,
Output_task_1
- Task 2:
Input_data_Task_2,
Output_task_2

# **PART 2**

### **TASK 1:**

This code provides a comprehensive workflow to preprocess and enhance multiple PDF resumes using GPT-3, making it suitable for various applications like resume parsing and improvement.

**1. Text Extraction (`extract_text_from_pdf`):**
- This function extracts the text content from a PDF file. It opens the PDF, reads its pages, and accumulates the text.

**2. Text Preprocessing (`preprocess_text`):**
- Text preprocessing improves the quality of the input data for GPT-3. All text is converted to lowercase for consistency, followed by tokenization, stopword removal, noise reduction (dropping non-alphanumeric tokens), and text normalization.

**3. Text Enhancement (`process_chunk_with_gpt`):**
- This function uses GPT-3 to improve text quality. It takes several inputs:
  - `prompt`: A predefined prompt instructing GPT-3 on what to do.
  - `chunk`: A portion of the text for GPT-3 to work on.
  - `resume_text`: The entire resume text (context).
  - `start_date`, `end_date`, `experience_details`: Specific details to be incorporated into the text.
- It assembles a full prompt with the provided data and sends it to GPT-3. GPT-3 processes the prompt and returns enhanced text.

**4. Single Resume Processing (`process_single_pdf`):**
- This function takes a single PDF resume, preprocesses its text, and enhances it chunk by chunk.

**5. Batch Processing (`process_pdfs_in_folder`):**
- This function handles multiple resumes using multiprocessing, which enhances resumes in parallel.
- Outputs the enhanced data in JSON format to "output.json".


In [17]:
# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            text += pdf_reader.pages[page_num].extract_text()
    return text

# Function to perform text preprocessing
def preprocess_text(text):
    # Lowercasing
    text = text.lower()

    # Tokenization
    words = nltk.word_tokenize(text)

    # Stopword removal and noise removal (keeps alphanumeric tokens only)
    words = [word for word in words if word.isalnum() and word not in stop_words]

    # Text normalization
    text = " ".join(words)

    return text

# Function to process a chunk of text using GPT-3
def process_chunk_with_gpt(prompt, chunk, resume_text, start_date, end_date, experience_details):
    # Replace placeholders with actual data
    full_prompt = f"{prompt}\n\n{chunk}"
    full_prompt = full_prompt.replace("{resume_text}", resume_text)
    full_prompt = full_prompt.replace("{start_date}", start_date)
    full_prompt = full_prompt.replace("{end_date}", end_date)
    full_prompt = full_prompt.replace("{experience_details}", experience_details)

    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=full_prompt,
        max_tokens=1500
    )

    return response.choices[0].text.strip()

# Function to process a single PDF file
def process_single_pdf(pdf_path, prompt, start_date, end_date, experience_details):
    resume_text = extract_text_from_pdf(pdf_path)

    # Preprocess the text
    cleaned_resume_text = preprocess_text(resume_text)

    # Process the text in smaller chunks using GPT-3
    chunk_size = 6000  # Characters per chunk (not tokens); adjust as needed
    chunked_text = [cleaned_resume_text[i:i+chunk_size] for i in range(0, len(cleaned_resume_text), chunk_size)]
    generated_text = ""
    for chunk in chunked_text:
        # Sent each chunk to GPT-3 together with the prompt and placeholder values
        generated_chunk = process_chunk_with_gpt(prompt, chunk, cleaned_resume_text, start_date, end_date, experience_details)
        generated_text += generated_chunk

    return pdf_path, generated_text

# Function to process PDFs in a folder and its subdirectories using multiprocessing
def process_pdfs_in_folder(root_folder):
    output_data = {}  # Initializing an empty dictionary to store output data
    pdf_files = []  # Created a list to store PDF file paths

    for root, dirs, files in os.walk(root_folder):
        for file_name in files:
            if file_name.endswith(".pdf"):
                pdf_resume_path = os.path.join(root, file_name)
                pdf_files.append(pdf_resume_path)  # Stored PDF file paths

    # Created a pool of worker processes
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())

    # Defined placeholder values for start_date, end_date, and experience_details
    start_date = "Placeholder for start date"
    end_date = "Placeholder for end date"
    experience_details = "Placeholder for experience details"

    # Used starmap to process PDF files in parallel (prompt is the module-level variable defined below)
    results = pool.starmap(process_single_pdf, [(pdf_path, prompt, start_date, end_date, experience_details) for pdf_path in pdf_files])

    # Closed the pool of worker processes
    pool.close()
    pool.join()

    # Stored the results in the output dictionary
    for pdf_path, generated_text in results:
        output_data[os.path.basename(pdf_path)] = generated_text

    # Output the data in JSON format
    with open("output.json", "w") as json_file:
        json.dump(output_data, json_file, indent=4)

# Provided the root folder path where subfolders contain PDF files
root_folder = "/content/drive/MyDrive/data/AGRICULTURE_ART_Resumes"

# Defined my prompt here
prompt = "Given the following resume:\n{resume_text}\n\nPlease generate a JSON output."

# Processed PDF files in the specified folder and subdirectories using multiprocessing
process_pdfs_in_folder(root_folder)


In [None]:
# Load the JSON data from the file
with open("output.json", "r") as json_file:
    data = json.load(json_file)

# Print the data
print(json.dumps(data, indent=4))


### **TASK 2:**

This code is used for matching an input resume against a database of resumes based on their text content using TF-IDF and cosine similarity. It helps identify the most similar resumes in the database to the input resume.

1. **Loading JSON Data:** It defines a function `load_json_data` to load JSON data from a file. The JSON data is expected to contain text data, typically from processed resumes.

2. **Text Vectorization:** It defines a function `vectorize_text_data` to vectorize text data using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization.

3. **Cosine Similarity Calculation:** It defines a function `calculate_cosine_similarity` to calculate the cosine similarity between an input vector and a matrix of vectors. In this context, it calculates the cosine similarity between an input resume and a database of resumes.

4. **Main Function:** In the main function:

   - It loads JSON data from a file (the path needs to be specified).
   - Extracts text data from the loaded JSON data.
   - Vectorizes the text data using TF-IDF.
   - Loads an input resume (the path needs to be specified) and vectorizes it using the same TF-IDF vectorizer.
   - Calculates the cosine similarity between the input resume and all resumes in the database.
   - Retrieves the indices of the most similar resumes based on cosine similarity.
   - Prints the top 5 most similar resume filenames along with their cosine similarity scores.



In [19]:
# Function to load JSON data from a file
def load_json_data(file_path):
    with open(file_path, "r") as json_file:
        data = json.load(json_file)
    return data

# Function to vectorize text data using TF-IDF
def vectorize_text_data(text_data):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(text_data)
    return tfidf_matrix, tfidf_vectorizer

# Function to calculate cosine similarity
def calculate_cosine_similarity(input_vector, database_matrix):
    cosine_similarities = cosine_similarity(input_vector, database_matrix)
    return cosine_similarities

# Main function
def main():
    # Load the JSON data from the file
    json_file_path = "/content/output.json"  # Replace with the correct path to your JSON file
    json_data = load_json_data(json_file_path)

    # Extracted the text data from the JSON file
    text_data = list(json_data.values())

    # Vectorized the text data using TF-IDF
    tfidf_matrix, tfidf_vectorizer = vectorize_text_data(text_data)

    # Defined the input resume text (you can change this)
    input_resume_path = "/content/drive/MyDrive/data/AGRICULTURE_ART_Resumes/AGRICULTURE/10953078.pdf"
    input_resume_text = extract_text_from_pdf(input_resume_path)

    # Vectorized the input resume text using the same TF-IDF vectorizer
    input_resume_vector = tfidf_vectorizer.transform([input_resume_text])

    # Calculated cosine similarity between the input resume and all resumes in the database
    cosine_similarities = calculate_cosine_similarity(input_resume_vector, tfidf_matrix)

    # Get the indices of the most similar resumes
    indices = cosine_similarities.argsort()[0][::-1]

    # Print the top 5 most similar resume filenames and their similarity scores
    top_5_similar_resumes = [(list(json_data.keys())[i], cosine_similarities[0][i]) for i in indices[:5]]
    print("Top 5 most similar resume filenames and their similarity scores:")
    for filename, similarity_score in top_5_similar_resumes:
        print(f"Filename: {filename}, Similarity Score: {similarity_score:.4f}")

if __name__ == "__main__":
    main()

Top 5 most similar resume filenames and their similarity scores:
Filename: 10953078.pdf, Similarity Score: 0.8873
Filename: 24061629.pdf, Similarity Score: 0.3975
Filename: 36102323.pdf, Similarity Score: 0.3599
Filename: 17694454.pdf, Similarity Score: 0.3582
Filename: 46055835.pdf, Similarity Score: 0.3452


# **PART 2 Challenges**
**1. Reducing Resume Processing Time:**
- In this case I used all resumes from the Agriculture and Arts resume database. At first, processing them all took about 1 hour, which was quite slow.
- To speed this up, I used Python's multiprocessing module to process resumes in parallel, which reduced the processing time to about 30 minutes.
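
Because each resume spends most of its time waiting on an API response, a thread pool is another option alongside `multiprocessing`. This is a minimal sketch under that assumption, not the approach used in the notebook: `fake_process_single_pdf` is a hypothetical stand-in that only sleeps briefly instead of calling the API, and the file paths are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_process_single_pdf(pdf_path):
    """Hypothetical stand-in for a worker that waits on a network call."""
    time.sleep(0.05)  # simulates waiting for the API
    return pdf_path, f"processed {pdf_path}"

pdf_files = [f"resume_{i}.pdf" for i in range(8)]  # made-up paths

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns results in the same order as pdf_files.
    results = list(executor.map(fake_process_single_pdf, pdf_files))
elapsed = time.perf_counter() - start

# Same shape as the notebook's output dictionary: filename -> generated text.
output_data = dict(results)
print(f"Processed {len(output_data)} files in {elapsed:.2f}s")
```

Threads share memory with the main process, so arguments such as the prompt do not need to be pickled and sent to workers; whether this is faster than the process pool for real API calls would need to be measured.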

**2. Overcoming Token Limitations with Data Chunking:**

- A key obstacle when working with GPT-3 was its token limit: the model accepts at most 4097 tokens, counting both the prompt and the generated output. Longer resume texts exceed this threshold.
- To let the model process the entire content of a resume, I used a data chunking strategy. Instead of sending the entire resume at once, I divided the text into manageable chunks.
- These chunks were then processed sequentially by the model, so that all of the data in the resume was read and enhanced and no vital information was omitted.
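
The notebook slices the cleaned text every 6000 characters, which can cut a word in half at a chunk boundary. Below is a sketch of a word-boundary variant; `chunk_by_words` and its `max_chars` parameter are hypothetical names, the sample text is made up, and character counts are only a rough proxy for tokens.

```python
def chunk_by_words(text, max_chars):
    """Split text into chunks of at most max_chars characters without splitting words.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # +1 accounts for the space that joins this word to the current chunk.
        added = len(word) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "managed crop rotation schedules and trained seasonal staff"
for chunk in chunk_by_words(text, max_chars=20):
    print(repr(chunk))
```

Joining the chunks with single spaces reproduces the preprocessed text, so nothing is lost between requests.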

**3. Data Preprocessing:**
- To make the most of the limited tokens available, I preprocessed the text before sending it to GPT-3: lowercasing, tokenization, and removal of stopwords and non-alphanumeric tokens. This keeps only the essential content, so that more of each resume fits within the token limit while GPT-3 can still comprehend it and provide meaningful responses.

# **PART 2 Future Enhancements**

**Processing Time:** Running the pipeline on AWS cloud infrastructure with further parallel processing could make resume processing faster and more efficient. The cloud's scalability would allow many tasks to run concurrently, and it could also reduce cost by lowering reliance on on-premises hardware.

# **Deliverables PART 2:**
- Task 1:
Input_data_Task_1,
Output_task_1
- Task 2:
Input_data_Task_2,
Output_task_2
