# InfoRaiders

* James Ballari
* Kuljeet Kaur
* Hema Sri Nakirikanti
* Tishya Thakkar

# Information Retrieval Problem: Job Matcher and Trends Finder

We propose the development of the "Job Matcher and Trends Finder" application, which aims to streamline the job search process by:
Aggregating Job Listings, Resume-Based Job Matching, Keyword-Based Job Search, Trending Insights
    
### Importance:
* Efficiency in Job Search- Simplifies job search, saves users from manually browsing multiple listings.
* Personalized Job Recommendations - Matches users with jobs based on their resume, allows users to search for jobs based on queries
* Market Insights - Find What's Trending Hidden Trends in regards with companies Hirings, Jobs Positions, Specific Roles, etc. at the current moment.

# Overview of Past and Current Solution Ideas

## State-of-the-Art Solutions

* Job Search Features in Job Searching websites such as Linkdin, Google Jobs, Indeed, Glass Door Reflections etc.
* Good Job Search, Query Reminders from Google Jobs.
<center>
<img src="Images/linkedin.png" alt="LinkedIn" style="width:600px;"/>
</center>

## Solution Ideas from Journal and Conference Papers

* Ranking & Smilarity Matching:
    * Pivoted Document Length Normalization [1]
    * Document Similarity Detection using K-Means and Cosine Distance [2]
    * Document Similarity Index based on the Jaccard Distance for Mail Filtering [3]
    * Similarity Measures for Text Document Clustering [4]
    * Document Search in Information Retrieval System Using Vector Space Model [5]

## Solution Ideas Helpful to the Team

* Web Scraping using selenium
* Document preprocessing(tokenization, Stop words, stemming and lemmatization)
* Ranked Retrival:
    * cosine similarity
    * k-means
    * tf-idf
    * Jaccard coefficient
    * H-index

# New Solution Ideas

* Finding Trends using H-index and G-index: The H-index and G-index are metrics used to evaluate the impact and productivity of an author's or researcher's work in the field of academia. Both indices take into account the number of publications and their citation counts.
#### How is this going to help us?
* Get a score for each term using H-index and G-index
* Score is the maximum value of H such that there is a document with at least H occurances of a term t and there are H such documents. 

# Hardware, Software, and Data

## Hardware Needs

A laptop with the following specifications:
* CPU: intel core i7 9th gen and above
* RAM: 16.0 GB or higher
* OS: Windows 11, 64-bit

## Software Needs

* Language: Python 3.10
* Development environment: JupyterLab 3.6.3
* Python Packages: 
    * nltk 3.8.1
        * snowball
        * WordNetLemmatizer
        * PorterStemmer 
    * pandas 2.1.2
    * gensim 4.3.2
    * selenium 4.15.2
    * numpy 1.26.1

## Data Needs

* Online job posting dataset
  * Collect job posting data from popular websites using web scrapping techniques
      * Used selenium to get the job posting data from www.indeed.com.  
* Sample Resumes
  * planning to create sample resumes for focused testing.
      * Used ChatGPT to create resumes.
  * use publicly available resume's online.
      * Used resume dataset from [kaggle](https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset)

## Software System Diagram

![FlowDiagram](Images/SystemFlow.png)

## Significant Software Tasks Accomplished

* Task Accomplished:
    * Job matching using pearson correlation coefficient
    * Job search using phrase queries
    * Job matching using jaccard coefficient
    * Job trends using H-index

* Testing and evaluation:
    * Tested pearson correlation coefficient using the resumes from the corpus.
    * Tested jaccard coefficient using targeted resume.
    
* Results:
    * tf-idf outperformes cosine similarity ranking, pearson correlation coefficient and jaccard coefficient

# Future Work

<table>
  <tr>
    <th>Team Member Name</th>
    <th>Team Member Tasks</th>
  </tr>
  <tr>
    <td>James Ballari</td>
    <td>
        <ul>
            <li>Work on designing and developing user interface</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Kuljeet Kaur</td>
    <td>
        <ul>
            <li>Work on getting better results with cosine similarity</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Hema Sri Nakirikanti</td>
    <td>
        <ul>
            <li>Work on preprocessing the data for job trends</li>
        </ul>
    </td>
  </tr>
 <tr>
    <td>Tishya Thakkar</td>
    <td>
        <ul>
            <li>Research on new techniques related to document similarity for better results</li>
        </ul>
    </td>
  </tr>
</table>

# Team Report

<table>
  <tr>
    <th>Team Member Name</th>
    <th>Team Member Accomplishments</th>
  </tr>
  <tr>
    <td>James Ballari</td>
    <td>
        <ul>
            <li>Worked on Job Trends</li>
            <li>Worked on gathering more data - job postings</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Kuljeet Kaur</td>
    <td>
        <ul>
            <li>Worked on Jaccard Coefficient</li>
            <li>Tested Jaccard Coefficient and phrase queries</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td>Hema Sri Nakirikanti</td>
    <td>
        <ul>
            <li>Worked on pearson correlation coefficient</li>
            <li>Worked on system flow diagram</li>
        </ul>
    </td>
  </tr>
 <tr>
    <td>Tishya Thakkar</td>
    <td>
        <ul>
            <li>Worked on phrase querries</li>
            <li>Worked on compiling all code into a single notebook</li>
        </ul>
    </td>
  </tr>   
</table>

# Appendix

## Reference List

1. Singhal, A., Buckley, C., & Mitra, M. (2017). Pivoted document length normalization. ACM SIGIR Forum, 51(2), 176–184. https://doi.org/10.1145/3130348.3130365
2. Usino, W., Satria, A., Hamed, K., Bramantoro, A., A, H., & Amaldi, W. (2019). Document similarity detection using k-means and cosine distance. International Journal of Advanced Computer Science and Applications, 10(2). https://doi.org/10.14569/ijacsa.2019.0100222
3. Temma, S., Sugii, M., &amp; Matsuno, H. (2019). The document similarity index based on the Jaccard distance for mail filtering. 2019 34th International Technical Conference on Circuits/Systems, Computers and Communications (ITC-CSCC). https://doi.org/10.1109/itc-cscc.2019.8793419
4. Huang, A., 2008, April. Similarity measures for text document clustering. In Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand (Vol. 4, pp. 9-56).
5. Yusrandi, et al. “Document Search in Information Retrieval System Using Vector Space Model.” 2021 7th International Conference on Electrical, Electronics and Information Engineering (ICEEIE), IEEE, 2021, pp. 604–08, https://doi.org/10.1109/ICEEIE52663.2021.9616735.

## Other Material

* https://www.depts.ttu.edu/library/
* https://www.linkedin.com/
* https://www.indeed.com/
* https://nlp.stanford.edu/IR-book/
* https://www.kaggle.com/datasets/gauravduttakiit/resume-dataset
* https://www.library.yorku.ca/web/research-metrics/author/#:~:text=g-index%3A%20a%20modification%20of,received%20at%20least%2010%20citations.
* https://dev.to/sajal2692/coding-k-means-clustering-using-python-and-numpy-fg1

# Software
#### Dependencies:
Python 3.10 Packages: 
* nltk 3.8.1
   * snowball
   * WordNetLemmatizer
   * PorterStemmer 
* pandas 2.1.2
* gensim 4.3.2
* selenium 4.15.2
* numpy 1.26.1

Install the above packages using the command:
`pip install package_name==desired_version`

Example:
`pip install numpy==1.26.1`


In [1]:
# Imports
# Import the Selenium library for web automation
from selenium import webdriver

# Import the time module for handling wait times
import time

# Import random, string, and json modules for generating random strings and working with JSON
import random
import string
import json

import re

# Download necessary resources from NLTK
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Import stopwords and word_tokenize from NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Import PorterStemmer and WordNetLemmatizer from NLTK
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Import the pandas library for data manipulation and analysis
import pandas as pd

# Import the defaultdict class from the collections module
from collections import defaultdict

# Import the sent_tokenize function from nltk.tokenize
from nltk.tokenize import sent_tokenize

# Import the snowball stemmer and WordNetLemmatizer from nltk.stem
from nltk.stem import snowball
from nltk.stem import WordNetLemmatizer

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile
import logging
import numpy as np
import math
import heapq
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import get_tmpfile
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tishy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tishy\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tishy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Gathering the Data

### Web Scrapping Using Selenium

#### Dependencies:
* Need to run /chromedeiver/chromedriver.exe in the background

In [2]:
# Constant: Time to wait for the webpage to load
WEBPAGE_WAIT_TIME = 3

# Constant: Time to wait for job-related content to load on the webpage
JOB_LOAD_WAIT_TIME = 3

# Create a Chrome WebDriver instance
driver = webdriver.Chrome()

# Maximize the browser window
driver.maximize_window()   

In [3]:
def wait(time_in_seconds):
    """
    Function to pause the execution for a specified amount of time.

    Parameters:
    - time_in_seconds (int): The duration, in seconds, to wait.
    """
    time.sleep(time_in_seconds)

In [4]:
def get_job_url(page_no):
    """
    Function to generate a job search URL for a specific page number on Indeed.

    Parameters:
    - page_no (int): The page number for the job search results.

    Returns:
    - str: The generated job search URL.
    """
    return 'https://www.indeed.com/jobs?q=&l=Lubbock%2C+TX&radius=100&start=' + str(int(page_no * 10))

In [5]:
def get_job_details_from_page(page_number):
    """
    Function to retrieve job details from a specific page on Indeed.

    Parameters:
    - page_number (int): The page number for the job search results.

    Returns:
    - list: A list of dictionaries containing job details, including URL and content.
    """
    # Initialize an empty list to store job details
    result = []

    # Load the job search URL for the given page number
    driver.get(get_job_url(page_number))
    wait(WEBPAGE_WAIT_TIME)
    
    # Refresh the page to ensure content is up-to-date
    driver.refresh()
    wait(WEBPAGE_WAIT_TIME)

    # Get the total number of job links on the page
    total_links = int(driver.execute_script(' return document.getElementById("mosaic-jobResults").getElementsByTagName("h2").length;'))
    print("Total links found = {0}".format(total_links))

    # Iterate through each job link
    for i in range(total_links):
        data = {}
        
        # Click on the i-th job link to view details
        command = 'document.getElementById("mosaic-jobResults").getElementsByTagName("h2")[{0}].getElementsByTagName("a")[0].click()'.format(i)
        driver.execute_script(command)
        wait(JOB_LOAD_WAIT_TIME)
        
        # Get job content from the page
        job_content = driver.execute_script('return document.getElementById("jobsearch-ViewjobPaneWrapper").innerHTML')
        
        # Store URL and content in a dictionary
        data["url"] = driver.current_url
        data["content"] = job_content
        result.append(data)

    return result

In [6]:
# Pause execution for 5 seconds
wait(5)

# Initialize a dictionary to store job data
result = {}
result["data"] = []

# Get job details from the first 25 pages
for i in range(25):
    # Retrieve job details from the current page and append to the result dictionary
    temp_result = get_job_details_from_page(i)
    for x in temp_result:
        result["data"].append(x)

Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15
Total links found = 15


In [7]:
# Convert the result dictionary to a JSON-formatted string
json_str_jobs = json.dumps(result)

# Generate a unique filename based on the current date and time
file_name = "Data/" + time.strftime("%Y%m%d-%H%M%S") + ".json"

# Open the file in write mode and write the JSON data
file = open(file_name, 'w')
file.write(json_str_jobs)
file.close()

## Data Preprocessing

### Job postings
In this section we are processing the job postings we get after web scrapping

In [8]:
# Compile a regular expression pattern for matching HTML tags
regex_html_tags = re.compile('<.*?>') 

# Specify the filename for the collection data
collection_file = '20231024-214547.json'

In [9]:
# Record the start time
start = time.time()

# Load the collection data from the specified JSON file
with open('Data/'+collection_file) as json_data:
    main_collection = json.load(json_data)

# Extract the 'data' field from the loaded JSON, representing the collection
collection = main_collection["data"]

# Record the end time
end = time.time()

# Print the time taken to process the cell
print("Time taken to process cell: {}".format(end - start))
# Uncomment the line below to print the content of the first item in the collection
# print(collection[0]['content'])

Time taken to process cell: 0.1458725929260254


In [10]:
# Record the start time
start = time.time()

# Create instances of PorterStemmer and WordNetLemmatizer
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Record the end time
end = time.time()

# Print the time taken to process the cell
print("Time taken to process cell: {}".format(end - start))

Time taken to process cell: 0.0


In [11]:
def create_custom_stopwords():
    stop_words = stopwords.words('english')
    custom_stopwords = {
        "(",
        ")",
        ";",
        "amp",
        "nbsp",
        "'",
        "’",
        ".",
        "-",
        "!",
        "&",
        ",",
        ":"
    }
    new_stopwords = {word: True for word in stop_words}
    new_stopwords.update({word: True for word in custom_stopwords})
    return new_stopwords

In [12]:
# Record the start time
start = time.time()
new_stopwords = create_custom_stopwords()
# Iterate through each document in the collection
for i in range(len(collection)):
    # Assign a document ID to each document
    collection[i]['doc_id'] = i
    
    # Remove HTML tags from the document's content
    collection[i]['text'] = re.sub(regex_html_tags, ' ', collection[i]['content'])
    text = collection[i]['text']
    
    # Tokenize the text
    text_tokens = word_tokenize(text)
    
    # Remove stopwords from the tokenized text
    stop_words_text = [word.lower() for word in text_tokens if not word.lower() in new_stopwords]
    collection[i]['stop_word_text'] = stop_words_text
    
    # Initialize dictionaries and variables for stemming, lemmatization, and frequency tracking
    stem_words_map = {}
    lemm_words_map = {}
    term_frequency = {}
    term_position = {}
    stem_lemm_words = {}
    pos = 0
    stem_lemm_text = ""
    
    # Process each word in the cleaned text
    for word in stop_words_text:
        # Stemming and lemmatization
        stem_word = ps.stem(word)
        lemm_word = lemmatizer.lemmatize(word)
        stem_lemm_word = lemmatizer.lemmatize(stem_word)
        
        # Update dictionaries and variables based on the processed words
        if stem_word in stem_words_map:
            stem_words_map[stem_word] += 1
        else:
            stem_words_map[stem_word] = 1
            
        if stem_lemm_word in stem_lemm_words:
            stem_lemm_words[stem_lemm_word] += 1
        else:
            stem_lemm_words[stem_lemm_word] = 1
        
        if lemm_word in lemm_words_map:
            lemm_words_map[lemm_word] += 1
        else:
            lemm_words_map[lemm_word] = 1
        
        if stem_lemm_word in term_frequency:
            term_frequency[stem_lemm_word] += 1
        else:
            term_frequency[stem_lemm_word] = 1
        
        if stem_lemm_word not in term_position:
            term_position[stem_lemm_word] = []
        term_position[stem_lemm_word].append(pos)
        stem_lemm_text += stem_lemm_word + " "
        pos += 1
    
    # Update the collection with processed information for each document
    collection[i]['stem_words_map'] = stem_words_map
    collection[i]['stem_words_map_len'] = len(stem_words_map)
    collection[i]['lemm_words_map'] = lemm_words_map
    collection[i]['lemm_words_map_len'] = len(lemm_words_map)
    collection[i]['stem_lemm_words_map'] = stem_lemm_words
    collection[i]['stem_lemm_words_map_len'] = len(stem_lemm_words)
    collection[i]['term_frequency_map'] = term_frequency
    collection[i]['term_frequency_map_len'] = len(term_frequency)
    collection[i]['term_frequency_pos'] = term_position
    collection[i]['stem_lemm_text'] = stem_lemm_text
    stem_lemm_text_total_count = 0
    for word, freq in term_frequency.items():
        stem_lemm_text_total_count += freq
    collection[i]['stem_lemm_text_total_count'] = stem_lemm_text_total_count

# Write the processed collection to a new file
file = open("Data/job_postings_collection_" + collection_file, 'w')
file.write(json.dumps(main_collection))
file.close()

# Record the end time
end = time.time()

# Print the time taken to process the cell
print("Time taken to process cell: {}".format(end - start))

Time taken to process cell: 7.7626025676727295


In [13]:
# Record the start time
start = time.time()

# Initialize dictionaries to store collection-wise results
main_collection["document_collection"] = {}
doc_collection_map = main_collection["document_collection"]
terms_in_doc_collection = {}

# Iterate through each document in the collection
for i in range(len(collection)):
    # Iterate through each term in the document's term frequency map
    for word, values in collection[i]['term_frequency_map'].items():
        # Update the term-document map with document indices
        if word in terms_in_doc_collection:
            terms_in_doc_collection[word].append(i)
        else:
            terms_in_doc_collection[word] = [i]

# Update the document collection map in the main collection
doc_collection_map["term_document_map"] = terms_in_doc_collection
doc_collection_map["term_document_map_len"] = len(terms_in_doc_collection)

# Write the updated collection to a new file
file = open("Data/job_postings_collection_" + collection_file, 'w')
file.write(json.dumps(main_collection))
file.close()

# Record the end time
end = time.time()

# Print the time taken to process the cell
print("Time taken to process cell: {}".format(end - start))

Time taken to process cell: 0.37221360206604004


### Resumes
In this section we are preprocessing the resumes for testing

In [14]:
# Read a CSV file into a pandas DataFrame
df = pd.read_csv('Data/UpdatedResumeDataSet.csv')

# Convert the DataFrame into a list of dictionaries and then into a JSON-formatted string
corpus = json.loads(json.dumps(list(df.T.to_dict().values())))

In [15]:
# Open a file in write mode, write the JSON-formatted corpus, and close the file
file = open("Data/corpus.json", 'w')
file.write(json.dumps(corpus))
file.close()

In [16]:
start = time.time()
# Create lists of words and lowercase words for each document in the corpus
for i in range(len(corpus)):
    word_doc = []
    word_doc_lower = []
    
    # Assign a document ID to each document
    corpus[i]['doc_id'] = i
    
    # Tokenize the 'Resume' field and populate word lists
    resume = corpus[i]['Resume']
    resume_tokens = word_tokenize(resume)
    for w in resume_tokens:
        word_doc.append(w)
        word_doc_lower.append(w.lower())
    
    # Update the corpus with the word lists
    corpus[i]['word_doc'] = word_doc
    corpus[i]['word_doc_lower'] = word_doc_lower

# Remove stopwords, punctuation, and non-alphabetic characters from the lowercase word lists
stwords = set(stopwords.words('english'))
punctuation = list(string.punctuation)
for i in range(len(corpus)):
    word_doc_lower_cleaned = [pair for pair in corpus[i]['word_doc_lower'] if pair not in stwords and pair not in punctuation and pair.isalpha()]
    corpus[i]['word_doc_lower_cleaned'] = word_doc_lower_cleaned

# Stem the cleaned lowercase word lists
for i in range(len(corpus)):
    stemmer = snowball.SnowballStemmer('english')
    word_doc_lower_cleaned_stemmed = [stemmer.stem(pair) for pair in corpus[i]['word_doc_lower_cleaned']]
    corpus[i]['word_doc_lower_cleaned_stemmed'] = word_doc_lower_cleaned_stemmed

# Lemmatize the cleaned lowercase word lists
lemmatizer = WordNetLemmatizer()
for i in range(len(corpus)):
    word_doc_lower_cleaned_lem = [lemmatizer.lemmatize(word) for word in corpus[i]['word_doc_lower_cleaned']]
    word_doc_lower_cleaned_stem_lem = [lemmatizer.lemmatize(word) for word in corpus[i]['word_doc_lower_cleaned_stemmed']]
    corpus[i]['word_doc_lower_cleaned_lem'] = word_doc_lower_cleaned_lem
    corpus[i]['word_doc_lower_cleaned_stem_lem'] = word_doc_lower_cleaned_stem_lem

# Write the processed corpus to a new file
file = open("Data/corpus.json", 'w')
file.write(json.dumps(corpus))
file.close()
end = time.time()
# Print the time taken to process the cell
print("Time taken to process cell: {}".format(end - start))

Time taken to process cell: 9.648219347000122


In [17]:
start = time.time()
# Sort the words in each document's 'word_doc_lower_cleaned_stem_lem' list
for c in range(len(corpus)):
    for i in range(len(corpus[c]['word_doc_lower_cleaned_stem_lem']) - 1):
        for j in range(len(corpus[c]['word_doc_lower_cleaned_stem_lem']) - 1 - i):
            if (
                (corpus[c]['word_doc_lower_cleaned_stem_lem'][j] > corpus[c]['word_doc_lower_cleaned_stem_lem'][j + 1])
                or (
                    (
                        corpus[c]['word_doc_lower_cleaned_stem_lem'][j]
                        == corpus[c]['word_doc_lower_cleaned_stem_lem'][j + 1]
                    )
                )
            ):
                temp = corpus[c]['word_doc_lower_cleaned_stem_lem'][j]
                corpus[c]['word_doc_lower_cleaned_stem_lem'][j] = corpus[c]['word_doc_lower_cleaned_stem_lem'][j + 1]
                corpus[c]['word_doc_lower_cleaned_stem_lem'][j + 1] = temp

# Remove duplicate words and count occurrences
for c in range(len(corpus)):
    prev = ["", 0]
    count = 1
    word_doc_lower_cleaned_stem_lem_duprem = []
    for i in range(len(corpus[c]['word_doc_lower_cleaned_stem_lem']) - 1):
        if corpus[c]['word_doc_lower_cleaned_stem_lem'][i] == corpus[c]['word_doc_lower_cleaned_stem_lem'][i + 1]:
            count += 1
        else:
            word_doc_lower_cleaned_stem_lem_duprem.append(
                [corpus[c]['word_doc_lower_cleaned_stem_lem'][i], count]
            )
            count = 1
    # Append the last word and its count
    word_doc_lower_cleaned_stem_lem_duprem.append(
        [corpus[c]['word_doc_lower_cleaned_stem_lem'][-1], count]
    )
    corpus[c]['word_doc_lower_cleaned_stem_lem_duprem'] = word_doc_lower_cleaned_stem_lem_duprem

# Write the updated corpus to a new file
file = open("Data/corpus.json", 'w')
file.write(json.dumps(corpus))
file.close()
end = time.time()
# Print the time taken to process the cell
print("Time taken to process cell: {}".format(end - start))

Time taken to process cell: 57.29428243637085


## Ranked Retrieval using TF-IDF

In [18]:
# Create instances of PorterStemmer and WordNetLemmatizer
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# flags and Constants
COLLECTION_FILE = 'Data/job_postings_collection_20231024-214547.json'
TEST_FILE = "test/resume_125"

In [19]:
# methdods required.

def calculate_TF_IDF(term,idf_score,collection_documents,collection_term_document_map):
    document_freq_of_term = 0
    if term in collection_term_document_map:
            document_freq_of_term = len(collection_term_document_map[term])
    else:
        return 
    total_docs = len(collection_documents)
    idf = math.log(total_docs/document_freq_of_term  ,2)
    for document_index in collection_term_document_map[term]:
        doc = collection_documents[document_index]
        tf = 1
        if term in doc["term_frequency_map"]:
            total_terms = doc["stem_lemm_text_total_count"]
            term_freq = doc["term_frequency_map"][term]
            tf = term_freq / total_terms
            idf_score[document_index][0] +=  (idf * tf)

Note: we have created sample test resumes for job posting with doc_id's 0 and 125. TF-IDF ranks the corresponding job posting 125 in 2nd postion where as job posting 0 at 1st position as a match to the sample resumes Provided. Refer to the flags and constants section above to change the required files.

In [20]:
start = time.time()
collection_data = None  
with open(COLLECTION_FILE, "r") as read_file:
    collection_data = json.load(read_file)
    
collection_documents = collection_data["data"]
collection_term_document_map = collection_data["document_collection"]["term_document_map"]

#process input document.
f_open = open(TEST_FILE,'r')
temp = f_open.read()
f_open.close()
text_tokens = word_tokenize(temp)
query_stop_words_text = [word.lower() for word in text_tokens if not word.lower() in new_stopwords]

query_stem_lemm_words = {}
    
for word in query_stop_words_text:
        stem_word = ps.stem(word)
        #lemm_word = lemmatizer.lemmatize(stem_word)
        stem_lemm_word = lemmatizer.lemmatize(stem_word)
        query_stem_lemm_words[stem_lemm_word] = True

idf_score = []
for i in range(len(collection_documents)):
    idf_score.append([0,i])
    
for term,_ in query_stem_lemm_words.items():
    calculate_TF_IDF(term,idf_score,collection_documents,collection_term_document_map)

idf_score.sort(reverse = True)
for i in range(20):
    print("document {} score {}".format(idf_score[i][1] , idf_score[i][0]))

        
end = time.time()
print("Time taken to process cell {}".format(end - start))

document 153 score 0.9330603815799637
document 125 score 0.9330603815799637
document 288 score 0.4906479599015021
document 172 score 0.4906479599015021
document 90 score 0.4906479599015021
document 369 score 0.4388391823400369
document 357 score 0.41199435472916707
document 166 score 0.41199435472916707
document 316 score 0.4023280921467847
document 370 score 0.392965627304336
document 341 score 0.39063527063687287
document 273 score 0.39063527063687287
document 205 score 0.39063527063687287
document 32 score 0.38933376976986117
document 22 score 0.38933376976986117
document 94 score 0.38543482727810086
document 85 score 0.3828405628640079
document 136 score 0.3746371456076682
document 155 score 0.37232337736836274
document 100 score 0.37232337736836274
Time taken to process cell 0.6352336406707764


## Ranked Retrieval using Doc2Vec Model, Cosine Similarity and K-means

In [21]:
# Set up logging configuration
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the collection data from the specified file
collection_file = 'Data/job_postings_collection_20231024-214547.json'
with open(collection_file) as json_data:
    main_collection = json.load(json_data)
collection = main_collection["data"]
docs = []

# Create TaggedDocuments for training the Doc2Vec model
for i in range(len(collection)):
    docs.append(TaggedDocument(collection[i]['stem_lemm_text'], [i]))

# Initialize and train the Doc2Vec model (dm = 0 is bag of words model)
model = Doc2Vec(docs, dm=0, vector_size=400, workers=4, epochs=1)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=2000)

# Save the trained model
fname = './Models/doc2vec_model_' + time.strftime("%Y%m%d-%H%M%S")
model.save(fname)

# Record the end time
end = time.time()
print("Time taken to process cell {}".format(end - start))

2023-11-30 09:50:04,677 : INFO : collecting all words and their counts
2023-11-30 09:50:04,678 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags
2023-11-30 09:50:04,756 : INFO : collected 70 word types and 375 unique tags from a corpus of 375 examples and 905346 words
2023-11-30 09:50:04,757 : INFO : Creating a fresh vocabulary
2023-11-30 09:50:04,758 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=5 retains 66 unique words (94.29% of original 70, drops 4)', 'datetime': '2023-11-30T09:50:04.758346', 'gensim': '4.3.2', 'python': '3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'prepare_vocab'}
2023-11-30 09:50:04,759 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 905342 word corpus (100.00% of original 905346, drops 4)', 'datetime': '2023-11-30T09:50:04.759344', 'gensim': '4.3.2', 'python': '3.10.13 | packaged by A

2023-11-30 09:50:08,118 : INFO : EPOCH 17: training on 905346 raw words (177011 effective words) took 0.2s, 998615 effective words/s
2023-11-30 09:50:08,280 : INFO : EPOCH 18: training on 905346 raw words (176942 effective words) took 0.2s, 1120107 effective words/s
2023-11-30 09:50:08,461 : INFO : EPOCH 19: training on 905346 raw words (176591 effective words) took 0.2s, 996067 effective words/s
2023-11-30 09:50:08,652 : INFO : EPOCH 20: training on 905346 raw words (177094 effective words) took 0.2s, 957082 effective words/s
2023-11-30 09:50:08,813 : INFO : EPOCH 21: training on 905346 raw words (176624 effective words) took 0.2s, 1108519 effective words/s
2023-11-30 09:50:08,973 : INFO : EPOCH 22: training on 905346 raw words (176174 effective words) took 0.2s, 1117050 effective words/s
2023-11-30 09:50:09,154 : INFO : EPOCH 23: training on 905346 raw words (176591 effective words) took 0.2s, 994025 effective words/s
2023-11-30 09:50:09,318 : INFO : EPOCH 24: training on 905346 raw 

2023-11-30 09:50:18,987 : INFO : EPOCH 79: training on 905346 raw words (176522 effective words) took 0.2s, 973307 effective words/s
2023-11-30 09:50:19,156 : INFO : EPOCH 80: training on 905346 raw words (176781 effective words) took 0.2s, 1067130 effective words/s
2023-11-30 09:50:19,323 : INFO : EPOCH 81: training on 905346 raw words (176893 effective words) took 0.2s, 1094155 effective words/s
2023-11-30 09:50:19,494 : INFO : EPOCH 82: training on 905346 raw words (176562 effective words) took 0.2s, 1048537 effective words/s
2023-11-30 09:50:19,661 : INFO : EPOCH 83: training on 905346 raw words (176578 effective words) took 0.2s, 1090320 effective words/s
2023-11-30 09:50:19,840 : INFO : EPOCH 84: training on 905346 raw words (176730 effective words) took 0.2s, 1007361 effective words/s
2023-11-30 09:50:20,026 : INFO : EPOCH 85: training on 905346 raw words (176981 effective words) took 0.2s, 972155 effective words/s
2023-11-30 09:50:20,195 : INFO : EPOCH 86: training on 905346 ra

2023-11-30 09:50:29,429 : INFO : EPOCH 140: training on 905346 raw words (176914 effective words) took 0.2s, 969222 effective words/s
2023-11-30 09:50:29,636 : INFO : EPOCH 141: training on 905346 raw words (176443 effective words) took 0.2s, 863659 effective words/s
2023-11-30 09:50:29,811 : INFO : EPOCH 142: training on 905346 raw words (175937 effective words) took 0.2s, 1037470 effective words/s
2023-11-30 09:50:30,034 : INFO : EPOCH 143: training on 905346 raw words (177010 effective words) took 0.2s, 800455 effective words/s
2023-11-30 09:50:30,272 : INFO : EPOCH 144: training on 905346 raw words (177038 effective words) took 0.2s, 764449 effective words/s
2023-11-30 09:50:30,470 : INFO : EPOCH 145: training on 905346 raw words (177011 effective words) took 0.2s, 908520 effective words/s
2023-11-30 09:50:30,695 : INFO : EPOCH 146: training on 905346 raw words (176830 effective words) took 0.2s, 796092 effective words/s
2023-11-30 09:50:30,861 : INFO : EPOCH 147: training on 90534

2023-11-30 09:50:40,052 : INFO : EPOCH 201: training on 905346 raw words (176456 effective words) took 0.2s, 1061784 effective words/s
2023-11-30 09:50:40,228 : INFO : EPOCH 202: training on 905346 raw words (177222 effective words) took 0.2s, 1028349 effective words/s
2023-11-30 09:50:40,399 : INFO : EPOCH 203: training on 905346 raw words (176418 effective words) took 0.2s, 1055380 effective words/s
2023-11-30 09:50:40,566 : INFO : EPOCH 204: training on 905346 raw words (176941 effective words) took 0.2s, 1083913 effective words/s
2023-11-30 09:50:40,730 : INFO : EPOCH 205: training on 905346 raw words (176616 effective words) took 0.2s, 1102593 effective words/s
2023-11-30 09:50:40,903 : INFO : EPOCH 206: training on 905346 raw words (176249 effective words) took 0.2s, 1044559 effective words/s
2023-11-30 09:50:41,074 : INFO : EPOCH 207: training on 905346 raw words (176548 effective words) took 0.2s, 1053312 effective words/s
2023-11-30 09:50:41,255 : INFO : EPOCH 208: training on

2023-11-30 09:50:50,646 : INFO : EPOCH 262: training on 905346 raw words (177114 effective words) took 0.2s, 969875 effective words/s
2023-11-30 09:50:50,818 : INFO : EPOCH 263: training on 905346 raw words (177241 effective words) took 0.2s, 1051991 effective words/s
2023-11-30 09:50:50,990 : INFO : EPOCH 264: training on 905346 raw words (176840 effective words) took 0.2s, 1046410 effective words/s
2023-11-30 09:50:51,175 : INFO : EPOCH 265: training on 905346 raw words (176739 effective words) took 0.2s, 966410 effective words/s
2023-11-30 09:50:51,342 : INFO : EPOCH 266: training on 905346 raw words (176461 effective words) took 0.2s, 1080794 effective words/s
2023-11-30 09:50:51,505 : INFO : EPOCH 267: training on 905346 raw words (177232 effective words) took 0.2s, 1109920 effective words/s
2023-11-30 09:50:51,680 : INFO : EPOCH 268: training on 905346 raw words (176883 effective words) took 0.2s, 1023412 effective words/s
2023-11-30 09:50:51,846 : INFO : EPOCH 269: training on 9

2023-11-30 09:51:01,108 : INFO : EPOCH 323: training on 905346 raw words (176397 effective words) took 0.2s, 1082217 effective words/s
2023-11-30 09:51:01,275 : INFO : EPOCH 324: training on 905346 raw words (177030 effective words) took 0.2s, 1078003 effective words/s
2023-11-30 09:51:01,453 : INFO : EPOCH 325: training on 905346 raw words (176693 effective words) took 0.2s, 1019867 effective words/s
2023-11-30 09:51:01,617 : INFO : EPOCH 326: training on 905346 raw words (177371 effective words) took 0.2s, 1100088 effective words/s
2023-11-30 09:51:01,803 : INFO : EPOCH 327: training on 905346 raw words (177197 effective words) took 0.2s, 971927 effective words/s
2023-11-30 09:51:01,980 : INFO : EPOCH 328: training on 905346 raw words (176607 effective words) took 0.2s, 1008095 effective words/s
2023-11-30 09:51:02,148 : INFO : EPOCH 329: training on 905346 raw words (177413 effective words) took 0.2s, 1079846 effective words/s
2023-11-30 09:51:02,317 : INFO : EPOCH 330: training on 

2023-11-30 09:51:11,531 : INFO : EPOCH 384: training on 905346 raw words (177462 effective words) took 0.2s, 1082297 effective words/s
2023-11-30 09:51:11,703 : INFO : EPOCH 385: training on 905346 raw words (175964 effective words) took 0.2s, 1051453 effective words/s
2023-11-30 09:51:11,871 : INFO : EPOCH 386: training on 905346 raw words (176793 effective words) took 0.2s, 1081226 effective words/s
2023-11-30 09:51:12,042 : INFO : EPOCH 387: training on 905346 raw words (176950 effective words) took 0.2s, 1061200 effective words/s
2023-11-30 09:51:12,210 : INFO : EPOCH 388: training on 905346 raw words (176377 effective words) took 0.2s, 1068857 effective words/s
2023-11-30 09:51:12,375 : INFO : EPOCH 389: training on 905346 raw words (176708 effective words) took 0.2s, 1093666 effective words/s
2023-11-30 09:51:12,542 : INFO : EPOCH 390: training on 905346 raw words (176420 effective words) took 0.2s, 1086055 effective words/s
2023-11-30 09:51:12,717 : INFO : EPOCH 391: training on

2023-11-30 09:51:21,891 : INFO : EPOCH 445: training on 905346 raw words (176530 effective words) took 0.2s, 1055322 effective words/s
2023-11-30 09:51:22,058 : INFO : EPOCH 446: training on 905346 raw words (177094 effective words) took 0.2s, 1082347 effective words/s
2023-11-30 09:51:22,226 : INFO : EPOCH 447: training on 905346 raw words (176682 effective words) took 0.2s, 1080332 effective words/s
2023-11-30 09:51:22,407 : INFO : EPOCH 448: training on 905346 raw words (177120 effective words) took 0.2s, 1007891 effective words/s
2023-11-30 09:51:22,571 : INFO : EPOCH 449: training on 905346 raw words (176570 effective words) took 0.2s, 1096078 effective words/s
2023-11-30 09:51:22,740 : INFO : EPOCH 450: training on 905346 raw words (176906 effective words) took 0.2s, 1069425 effective words/s
2023-11-30 09:51:22,912 : INFO : EPOCH 451: training on 905346 raw words (176895 effective words) took 0.2s, 1050816 effective words/s
2023-11-30 09:51:23,077 : INFO : EPOCH 452: training on

2023-11-30 09:51:32,244 : INFO : EPOCH 506: training on 905346 raw words (176599 effective words) took 0.2s, 1134754 effective words/s
2023-11-30 09:51:32,417 : INFO : EPOCH 507: training on 905346 raw words (176826 effective words) took 0.2s, 1039036 effective words/s
2023-11-30 09:51:32,600 : INFO : EPOCH 508: training on 905346 raw words (176437 effective words) took 0.2s, 980888 effective words/s
2023-11-30 09:51:32,765 : INFO : EPOCH 509: training on 905346 raw words (177064 effective words) took 0.2s, 1105126 effective words/s
2023-11-30 09:51:32,935 : INFO : EPOCH 510: training on 905346 raw words (176559 effective words) took 0.2s, 1064451 effective words/s
2023-11-30 09:51:33,129 : INFO : EPOCH 511: training on 905346 raw words (177036 effective words) took 0.2s, 928982 effective words/s
2023-11-30 09:51:33,293 : INFO : EPOCH 512: training on 905346 raw words (177305 effective words) took 0.2s, 1100757 effective words/s
2023-11-30 09:51:33,455 : INFO : EPOCH 513: training on 9

2023-11-30 09:51:42,981 : INFO : EPOCH 567: training on 905346 raw words (176890 effective words) took 0.2s, 1038490 effective words/s
2023-11-30 09:51:43,149 : INFO : EPOCH 568: training on 905346 raw words (177135 effective words) took 0.2s, 1089500 effective words/s
2023-11-30 09:51:43,326 : INFO : EPOCH 569: training on 905346 raw words (176907 effective words) took 0.2s, 1026868 effective words/s
2023-11-30 09:51:43,500 : INFO : EPOCH 570: training on 905346 raw words (176890 effective words) took 0.2s, 1033841 effective words/s
2023-11-30 09:51:43,666 : INFO : EPOCH 571: training on 905346 raw words (176948 effective words) took 0.2s, 1098561 effective words/s
2023-11-30 09:51:43,837 : INFO : EPOCH 572: training on 905346 raw words (176762 effective words) took 0.2s, 1063883 effective words/s
2023-11-30 09:51:44,010 : INFO : EPOCH 573: training on 905346 raw words (176807 effective words) took 0.2s, 1041014 effective words/s
2023-11-30 09:51:44,174 : INFO : EPOCH 574: training on

2023-11-30 09:51:53,423 : INFO : EPOCH 628: training on 905346 raw words (176753 effective words) took 0.2s, 1061921 effective words/s
2023-11-30 09:51:53,593 : INFO : EPOCH 629: training on 905346 raw words (176775 effective words) took 0.2s, 1067511 effective words/s
2023-11-30 09:51:53,768 : INFO : EPOCH 630: training on 905346 raw words (176157 effective words) took 0.2s, 1027314 effective words/s
2023-11-30 09:51:53,937 : INFO : EPOCH 631: training on 905346 raw words (176842 effective words) took 0.2s, 1079854 effective words/s
2023-11-30 09:51:54,101 : INFO : EPOCH 632: training on 905346 raw words (176602 effective words) took 0.2s, 1093323 effective words/s
2023-11-30 09:51:54,272 : INFO : EPOCH 633: training on 905346 raw words (176328 effective words) took 0.2s, 1050730 effective words/s
2023-11-30 09:51:54,444 : INFO : EPOCH 634: training on 905346 raw words (176697 effective words) took 0.2s, 1054235 effective words/s
2023-11-30 09:51:54,612 : INFO : EPOCH 635: training on

2023-11-30 09:52:03,863 : INFO : EPOCH 689: training on 905346 raw words (176923 effective words) took 0.2s, 1045374 effective words/s
2023-11-30 09:52:04,037 : INFO : EPOCH 690: training on 905346 raw words (175896 effective words) took 0.2s, 1042575 effective words/s
2023-11-30 09:52:04,205 : INFO : EPOCH 691: training on 905346 raw words (176501 effective words) took 0.2s, 1074814 effective words/s
2023-11-30 09:52:04,380 : INFO : EPOCH 692: training on 905346 raw words (177180 effective words) took 0.2s, 1046925 effective words/s
2023-11-30 09:52:04,560 : INFO : EPOCH 693: training on 905346 raw words (176818 effective words) took 0.2s, 1002480 effective words/s
2023-11-30 09:52:04,730 : INFO : EPOCH 694: training on 905346 raw words (176198 effective words) took 0.2s, 1051432 effective words/s
2023-11-30 09:52:04,897 : INFO : EPOCH 695: training on 905346 raw words (176724 effective words) took 0.2s, 1080744 effective words/s
2023-11-30 09:52:05,078 : INFO : EPOCH 696: training on

2023-11-30 09:52:14,319 : INFO : EPOCH 750: training on 905346 raw words (176473 effective words) took 0.2s, 1065335 effective words/s
2023-11-30 09:52:14,487 : INFO : EPOCH 751: training on 905346 raw words (177308 effective words) took 0.2s, 1073227 effective words/s
2023-11-30 09:52:14,651 : INFO : EPOCH 752: training on 905346 raw words (177052 effective words) took 0.2s, 1092980 effective words/s
2023-11-30 09:52:14,827 : INFO : EPOCH 753: training on 905346 raw words (176539 effective words) took 0.2s, 1032316 effective words/s
2023-11-30 09:52:15,000 : INFO : EPOCH 754: training on 905346 raw words (176473 effective words) took 0.2s, 1035936 effective words/s
2023-11-30 09:52:15,171 : INFO : EPOCH 755: training on 905346 raw words (176482 effective words) took 0.2s, 1069661 effective words/s
2023-11-30 09:52:15,351 : INFO : EPOCH 756: training on 905346 raw words (176931 effective words) took 0.2s, 997396 effective words/s
2023-11-30 09:52:15,520 : INFO : EPOCH 757: training on 

2023-11-30 09:52:24,838 : INFO : EPOCH 811: training on 905346 raw words (176939 effective words) took 0.2s, 1094772 effective words/s
2023-11-30 09:52:25,012 : INFO : EPOCH 812: training on 905346 raw words (176598 effective words) took 0.2s, 1045006 effective words/s
2023-11-30 09:52:25,178 : INFO : EPOCH 813: training on 905346 raw words (176444 effective words) took 0.2s, 1082228 effective words/s
2023-11-30 09:52:25,353 : INFO : EPOCH 814: training on 905346 raw words (176559 effective words) took 0.2s, 1035036 effective words/s
2023-11-30 09:52:25,524 : INFO : EPOCH 815: training on 905346 raw words (176088 effective words) took 0.2s, 1057520 effective words/s
2023-11-30 09:52:25,694 : INFO : EPOCH 816: training on 905346 raw words (176625 effective words) took 0.2s, 1084307 effective words/s
2023-11-30 09:52:25,859 : INFO : EPOCH 817: training on 905346 raw words (176964 effective words) took 0.2s, 1097694 effective words/s
2023-11-30 09:52:26,030 : INFO : EPOCH 818: training on

2023-11-30 09:52:35,194 : INFO : EPOCH 872: training on 905346 raw words (176775 effective words) took 0.2s, 1065187 effective words/s
2023-11-30 09:52:35,364 : INFO : EPOCH 873: training on 905346 raw words (177151 effective words) took 0.2s, 1077919 effective words/s
2023-11-30 09:52:35,545 : INFO : EPOCH 874: training on 905346 raw words (177140 effective words) took 0.2s, 993397 effective words/s
2023-11-30 09:52:35,739 : INFO : EPOCH 875: training on 905346 raw words (176068 effective words) took 0.2s, 934131 effective words/s
2023-11-30 09:52:35,908 : INFO : EPOCH 876: training on 905346 raw words (176819 effective words) took 0.2s, 1063363 effective words/s
2023-11-30 09:52:36,065 : INFO : EPOCH 877: training on 905346 raw words (176998 effective words) took 0.2s, 1144200 effective words/s
2023-11-30 09:52:36,241 : INFO : EPOCH 878: training on 905346 raw words (176545 effective words) took 0.2s, 1023579 effective words/s
2023-11-30 09:52:36,413 : INFO : EPOCH 879: training on 9

2023-11-30 09:52:45,610 : INFO : EPOCH 933: training on 905346 raw words (176927 effective words) took 0.2s, 1052223 effective words/s
2023-11-30 09:52:45,819 : INFO : EPOCH 934: training on 905346 raw words (177243 effective words) took 0.2s, 874140 effective words/s
2023-11-30 09:52:45,997 : INFO : EPOCH 935: training on 905346 raw words (177184 effective words) took 0.2s, 1008565 effective words/s
2023-11-30 09:52:46,161 : INFO : EPOCH 936: training on 905346 raw words (177046 effective words) took 0.2s, 1102616 effective words/s
2023-11-30 09:52:46,327 : INFO : EPOCH 937: training on 905346 raw words (176647 effective words) took 0.2s, 1103346 effective words/s
2023-11-30 09:52:46,496 : INFO : EPOCH 938: training on 905346 raw words (177145 effective words) took 0.2s, 1076633 effective words/s
2023-11-30 09:52:46,662 : INFO : EPOCH 939: training on 905346 raw words (176946 effective words) took 0.2s, 1095726 effective words/s
2023-11-30 09:52:46,836 : INFO : EPOCH 940: training on 

2023-11-30 09:52:56,133 : INFO : EPOCH 994: training on 905346 raw words (177103 effective words) took 0.2s, 1054142 effective words/s
2023-11-30 09:52:56,307 : INFO : EPOCH 995: training on 905346 raw words (176213 effective words) took 0.2s, 1039027 effective words/s
2023-11-30 09:52:56,487 : INFO : EPOCH 996: training on 905346 raw words (176745 effective words) took 0.2s, 1005532 effective words/s
2023-11-30 09:52:56,676 : INFO : EPOCH 997: training on 905346 raw words (176874 effective words) took 0.2s, 954513 effective words/s
2023-11-30 09:52:56,851 : INFO : EPOCH 998: training on 905346 raw words (176443 effective words) took 0.2s, 1033898 effective words/s
2023-11-30 09:52:57,019 : INFO : EPOCH 999: training on 905346 raw words (177154 effective words) took 0.2s, 1070709 effective words/s
2023-11-30 09:52:57,212 : INFO : EPOCH 1000: training on 905346 raw words (176654 effective words) took 0.2s, 927340 effective words/s
2023-11-30 09:52:57,381 : INFO : EPOCH 1001: training on

2023-11-30 09:53:07,155 : INFO : EPOCH 1055: training on 905346 raw words (176635 effective words) took 0.2s, 1083507 effective words/s
2023-11-30 09:53:07,342 : INFO : EPOCH 1056: training on 905346 raw words (176753 effective words) took 0.2s, 967547 effective words/s
2023-11-30 09:53:07,583 : INFO : EPOCH 1057: training on 905346 raw words (176936 effective words) took 0.2s, 744255 effective words/s
2023-11-30 09:53:07,762 : INFO : EPOCH 1058: training on 905346 raw words (176973 effective words) took 0.2s, 1015298 effective words/s
2023-11-30 09:53:07,934 : INFO : EPOCH 1059: training on 905346 raw words (176749 effective words) took 0.2s, 1038663 effective words/s
2023-11-30 09:53:08,111 : INFO : EPOCH 1060: training on 905346 raw words (177266 effective words) took 0.2s, 1041915 effective words/s
2023-11-30 09:53:08,289 : INFO : EPOCH 1061: training on 905346 raw words (176556 effective words) took 0.2s, 1012205 effective words/s
2023-11-30 09:53:08,481 : INFO : EPOCH 1062: train

2023-11-30 09:53:17,784 : INFO : EPOCH 1116: training on 905346 raw words (176712 effective words) took 0.2s, 1075081 effective words/s
2023-11-30 09:53:17,959 : INFO : EPOCH 1117: training on 905346 raw words (177720 effective words) took 0.2s, 1043641 effective words/s
2023-11-30 09:53:18,126 : INFO : EPOCH 1118: training on 905346 raw words (176580 effective words) took 0.2s, 1081889 effective words/s
2023-11-30 09:53:18,292 : INFO : EPOCH 1119: training on 905346 raw words (176823 effective words) took 0.2s, 1086565 effective words/s
2023-11-30 09:53:18,458 : INFO : EPOCH 1120: training on 905346 raw words (176944 effective words) took 0.2s, 1093996 effective words/s
2023-11-30 09:53:18,623 : INFO : EPOCH 1121: training on 905346 raw words (176859 effective words) took 0.2s, 1098134 effective words/s
2023-11-30 09:53:18,792 : INFO : EPOCH 1122: training on 905346 raw words (176511 effective words) took 0.2s, 1077439 effective words/s
2023-11-30 09:53:18,967 : INFO : EPOCH 1123: tra

2023-11-30 09:53:28,332 : INFO : EPOCH 1177: training on 905346 raw words (176996 effective words) took 0.2s, 1035208 effective words/s
2023-11-30 09:53:28,500 : INFO : EPOCH 1178: training on 905346 raw words (177286 effective words) took 0.2s, 1084778 effective words/s
2023-11-30 09:53:28,668 : INFO : EPOCH 1179: training on 905346 raw words (176848 effective words) took 0.2s, 1069309 effective words/s
2023-11-30 09:53:28,838 : INFO : EPOCH 1180: training on 905346 raw words (176700 effective words) took 0.2s, 1074642 effective words/s
2023-11-30 09:53:29,008 : INFO : EPOCH 1181: training on 905346 raw words (176281 effective words) took 0.2s, 1059928 effective words/s
2023-11-30 09:53:29,176 : INFO : EPOCH 1182: training on 905346 raw words (176020 effective words) took 0.2s, 1075300 effective words/s
2023-11-30 09:53:29,351 : INFO : EPOCH 1183: training on 905346 raw words (176306 effective words) took 0.2s, 1039595 effective words/s
2023-11-30 09:53:29,519 : INFO : EPOCH 1184: tra

2023-11-30 09:53:39,284 : INFO : EPOCH 1238: training on 905346 raw words (176465 effective words) took 0.2s, 917816 effective words/s
2023-11-30 09:53:39,458 : INFO : EPOCH 1239: training on 905346 raw words (177208 effective words) took 0.2s, 1042099 effective words/s
2023-11-30 09:53:39,727 : INFO : EPOCH 1240: training on 905346 raw words (176800 effective words) took 0.3s, 664881 effective words/s
2023-11-30 09:53:39,917 : INFO : EPOCH 1241: training on 905346 raw words (177276 effective words) took 0.2s, 952399 effective words/s
2023-11-30 09:53:40,103 : INFO : EPOCH 1242: training on 905346 raw words (177208 effective words) took 0.2s, 962005 effective words/s
2023-11-30 09:53:40,293 : INFO : EPOCH 1243: training on 905346 raw words (176623 effective words) took 0.2s, 946964 effective words/s
2023-11-30 09:53:40,480 : INFO : EPOCH 1244: training on 905346 raw words (176389 effective words) took 0.2s, 969155 effective words/s
2023-11-30 09:53:40,702 : INFO : EPOCH 1245: training 

2023-11-30 09:53:50,421 : INFO : EPOCH 1299: training on 905346 raw words (177019 effective words) took 0.2s, 973649 effective words/s
2023-11-30 09:53:50,607 : INFO : EPOCH 1300: training on 905346 raw words (177205 effective words) took 0.2s, 974238 effective words/s
2023-11-30 09:53:50,778 : INFO : EPOCH 1301: training on 905346 raw words (176300 effective words) took 0.2s, 1065867 effective words/s
2023-11-30 09:53:50,950 : INFO : EPOCH 1302: training on 905346 raw words (176427 effective words) took 0.2s, 1049246 effective words/s
2023-11-30 09:53:51,116 : INFO : EPOCH 1303: training on 905346 raw words (176673 effective words) took 0.2s, 1091742 effective words/s
2023-11-30 09:53:51,280 : INFO : EPOCH 1304: training on 905346 raw words (176917 effective words) took 0.2s, 1099848 effective words/s
2023-11-30 09:53:51,451 : INFO : EPOCH 1305: training on 905346 raw words (177098 effective words) took 0.2s, 1070011 effective words/s
2023-11-30 09:53:51,620 : INFO : EPOCH 1306: train

2023-11-30 09:54:01,664 : INFO : EPOCH 1360: training on 905346 raw words (177237 effective words) took 0.2s, 1061939 effective words/s
2023-11-30 09:54:01,831 : INFO : EPOCH 1361: training on 905346 raw words (176746 effective words) took 0.2s, 1086236 effective words/s
2023-11-30 09:54:02,020 : INFO : EPOCH 1362: training on 905346 raw words (176641 effective words) took 0.2s, 961906 effective words/s
2023-11-30 09:54:02,201 : INFO : EPOCH 1363: training on 905346 raw words (176736 effective words) took 0.2s, 993097 effective words/s
2023-11-30 09:54:02,368 : INFO : EPOCH 1364: training on 905346 raw words (176668 effective words) took 0.2s, 1078664 effective words/s
2023-11-30 09:54:02,534 : INFO : EPOCH 1365: training on 905346 raw words (177242 effective words) took 0.2s, 1092090 effective words/s
2023-11-30 09:54:02,708 : INFO : EPOCH 1366: training on 905346 raw words (177205 effective words) took 0.2s, 1039835 effective words/s
2023-11-30 09:54:02,876 : INFO : EPOCH 1367: train

2023-11-30 09:54:12,463 : INFO : EPOCH 1421: training on 905346 raw words (177555 effective words) took 0.2s, 930758 effective words/s
2023-11-30 09:54:12,636 : INFO : EPOCH 1422: training on 905346 raw words (177245 effective words) took 0.2s, 1054656 effective words/s
2023-11-30 09:54:12,808 : INFO : EPOCH 1423: training on 905346 raw words (177007 effective words) took 0.2s, 1053623 effective words/s
2023-11-30 09:54:12,985 : INFO : EPOCH 1424: training on 905346 raw words (177505 effective words) took 0.2s, 1026836 effective words/s
2023-11-30 09:54:13,167 : INFO : EPOCH 1425: training on 905346 raw words (177070 effective words) took 0.2s, 989629 effective words/s
2023-11-30 09:54:13,349 : INFO : EPOCH 1426: training on 905346 raw words (176930 effective words) took 0.2s, 990536 effective words/s
2023-11-30 09:54:13,529 : INFO : EPOCH 1427: training on 905346 raw words (177347 effective words) took 0.2s, 1006208 effective words/s
2023-11-30 09:54:13,693 : INFO : EPOCH 1428: traini

2023-11-30 09:54:23,302 : INFO : EPOCH 1482: training on 905346 raw words (177407 effective words) took 0.2s, 1084453 effective words/s
2023-11-30 09:54:23,476 : INFO : EPOCH 1483: training on 905346 raw words (177046 effective words) took 0.2s, 1046326 effective words/s
2023-11-30 09:54:23,679 : INFO : EPOCH 1484: training on 905346 raw words (176866 effective words) took 0.2s, 883310 effective words/s
2023-11-30 09:54:23,871 : INFO : EPOCH 1485: training on 905346 raw words (176862 effective words) took 0.2s, 946192 effective words/s
2023-11-30 09:54:24,041 : INFO : EPOCH 1486: training on 905346 raw words (176902 effective words) took 0.2s, 1074540 effective words/s
2023-11-30 09:54:24,213 : INFO : EPOCH 1487: training on 905346 raw words (176905 effective words) took 0.2s, 1038703 effective words/s
2023-11-30 09:54:24,380 : INFO : EPOCH 1488: training on 905346 raw words (176875 effective words) took 0.2s, 1087966 effective words/s
2023-11-30 09:54:24,545 : INFO : EPOCH 1489: train

2023-11-30 09:54:34,132 : INFO : EPOCH 1543: training on 905346 raw words (176734 effective words) took 0.2s, 1023386 effective words/s
2023-11-30 09:54:34,400 : INFO : EPOCH 1544: training on 905346 raw words (177534 effective words) took 0.3s, 667737 effective words/s
2023-11-30 09:54:34,739 : INFO : EPOCH 1545: training on 905346 raw words (176250 effective words) took 0.3s, 531021 effective words/s
2023-11-30 09:54:35,050 : INFO : EPOCH 1546: training on 905346 raw words (177360 effective words) took 0.3s, 582718 effective words/s
2023-11-30 09:54:35,275 : INFO : EPOCH 1547: training on 905346 raw words (176416 effective words) took 0.2s, 798419 effective words/s
2023-11-30 09:54:35,447 : INFO : EPOCH 1548: training on 905346 raw words (176459 effective words) took 0.2s, 1047533 effective words/s
2023-11-30 09:54:35,617 : INFO : EPOCH 1549: training on 905346 raw words (176823 effective words) took 0.2s, 1059059 effective words/s
2023-11-30 09:54:35,873 : INFO : EPOCH 1550: trainin

2023-11-30 09:54:45,926 : INFO : EPOCH 1604: training on 905346 raw words (176731 effective words) took 0.2s, 1004596 effective words/s
2023-11-30 09:54:46,120 : INFO : EPOCH 1605: training on 905346 raw words (176657 effective words) took 0.2s, 929572 effective words/s
2023-11-30 09:54:46,332 : INFO : EPOCH 1606: training on 905346 raw words (177124 effective words) took 0.2s, 846178 effective words/s
2023-11-30 09:54:46,510 : INFO : EPOCH 1607: training on 905346 raw words (176378 effective words) took 0.2s, 1007350 effective words/s
2023-11-30 09:54:46,733 : INFO : EPOCH 1608: training on 905346 raw words (177307 effective words) took 0.2s, 810278 effective words/s
2023-11-30 09:54:46,913 : INFO : EPOCH 1609: training on 905346 raw words (176697 effective words) took 0.2s, 1010462 effective words/s
2023-11-30 09:54:47,129 : INFO : EPOCH 1610: training on 905346 raw words (176431 effective words) took 0.2s, 823946 effective words/s
2023-11-30 09:54:47,357 : INFO : EPOCH 1611: trainin

2023-11-30 09:54:57,512 : INFO : EPOCH 1665: training on 905346 raw words (176243 effective words) took 0.2s, 797255 effective words/s
2023-11-30 09:54:57,709 : INFO : EPOCH 1666: training on 905346 raw words (177064 effective words) took 0.2s, 916218 effective words/s
2023-11-30 09:54:57,899 : INFO : EPOCH 1667: training on 905346 raw words (176880 effective words) took 0.2s, 948642 effective words/s
2023-11-30 09:54:58,082 : INFO : EPOCH 1668: training on 905346 raw words (176684 effective words) took 0.2s, 992762 effective words/s
2023-11-30 09:54:58,275 : INFO : EPOCH 1669: training on 905346 raw words (176996 effective words) took 0.2s, 940244 effective words/s
2023-11-30 09:54:58,479 : INFO : EPOCH 1670: training on 905346 raw words (176705 effective words) took 0.2s, 879646 effective words/s
2023-11-30 09:54:58,704 : INFO : EPOCH 1671: training on 905346 raw words (177230 effective words) took 0.2s, 805396 effective words/s
2023-11-30 09:54:58,980 : INFO : EPOCH 1672: training o

2023-11-30 09:55:09,844 : INFO : EPOCH 1726: training on 905346 raw words (176895 effective words) took 0.2s, 892490 effective words/s
2023-11-30 09:55:10,028 : INFO : EPOCH 1727: training on 905346 raw words (176459 effective words) took 0.2s, 979639 effective words/s
2023-11-30 09:55:10,208 : INFO : EPOCH 1728: training on 905346 raw words (176815 effective words) took 0.2s, 994638 effective words/s
2023-11-30 09:55:10,399 : INFO : EPOCH 1729: training on 905346 raw words (177161 effective words) took 0.2s, 945091 effective words/s
2023-11-30 09:55:10,576 : INFO : EPOCH 1730: training on 905346 raw words (176514 effective words) took 0.2s, 1013078 effective words/s
2023-11-30 09:55:10,763 : INFO : EPOCH 1731: training on 905346 raw words (176788 effective words) took 0.2s, 958806 effective words/s
2023-11-30 09:55:10,960 : INFO : EPOCH 1732: training on 905346 raw words (177147 effective words) took 0.2s, 914327 effective words/s
2023-11-30 09:55:11,140 : INFO : EPOCH 1733: training 

2023-11-30 09:55:21,213 : INFO : EPOCH 1787: training on 905346 raw words (177035 effective words) took 0.2s, 933925 effective words/s
2023-11-30 09:55:21,381 : INFO : EPOCH 1788: training on 905346 raw words (177961 effective words) took 0.2s, 1098702 effective words/s
2023-11-30 09:55:21,554 : INFO : EPOCH 1789: training on 905346 raw words (176363 effective words) took 0.2s, 1038115 effective words/s
2023-11-30 09:55:21,732 : INFO : EPOCH 1790: training on 905346 raw words (176466 effective words) took 0.2s, 1013780 effective words/s
2023-11-30 09:55:21,903 : INFO : EPOCH 1791: training on 905346 raw words (177011 effective words) took 0.2s, 1055647 effective words/s
2023-11-30 09:55:22,072 : INFO : EPOCH 1792: training on 905346 raw words (176287 effective words) took 0.2s, 1069025 effective words/s
2023-11-30 09:55:22,258 : INFO : EPOCH 1793: training on 905346 raw words (176410 effective words) took 0.2s, 973866 effective words/s
2023-11-30 09:55:22,423 : INFO : EPOCH 1794: train

2023-11-30 09:55:32,063 : INFO : EPOCH 1848: training on 905346 raw words (176713 effective words) took 0.2s, 1012068 effective words/s
2023-11-30 09:55:32,227 : INFO : EPOCH 1849: training on 905346 raw words (177449 effective words) took 0.2s, 1101669 effective words/s
2023-11-30 09:55:32,419 : INFO : EPOCH 1850: training on 905346 raw words (176608 effective words) took 0.2s, 935062 effective words/s
2023-11-30 09:55:32,632 : INFO : EPOCH 1851: training on 905346 raw words (176179 effective words) took 0.2s, 847965 effective words/s
2023-11-30 09:55:32,811 : INFO : EPOCH 1852: training on 905346 raw words (176798 effective words) took 0.2s, 1006553 effective words/s
2023-11-30 09:55:32,988 : INFO : EPOCH 1853: training on 905346 raw words (176809 effective words) took 0.2s, 1012712 effective words/s
2023-11-30 09:55:33,169 : INFO : EPOCH 1854: training on 905346 raw words (176670 effective words) took 0.2s, 997917 effective words/s
2023-11-30 09:55:33,351 : INFO : EPOCH 1855: traini

2023-11-30 09:55:44,264 : INFO : EPOCH 1909: training on 905346 raw words (176583 effective words) took 0.2s, 911043 effective words/s
2023-11-30 09:55:44,532 : INFO : EPOCH 1910: training on 905346 raw words (176156 effective words) took 0.3s, 668516 effective words/s
2023-11-30 09:55:44,759 : INFO : EPOCH 1911: training on 905346 raw words (176291 effective words) took 0.2s, 792697 effective words/s
2023-11-30 09:55:45,011 : INFO : EPOCH 1912: training on 905346 raw words (176522 effective words) took 0.2s, 710888 effective words/s
2023-11-30 09:55:45,243 : INFO : EPOCH 1913: training on 905346 raw words (177082 effective words) took 0.2s, 779012 effective words/s
2023-11-30 09:55:45,500 : INFO : EPOCH 1914: training on 905346 raw words (175872 effective words) took 0.3s, 691569 effective words/s
2023-11-30 09:55:45,689 : INFO : EPOCH 1915: training on 905346 raw words (176925 effective words) took 0.2s, 955618 effective words/s
2023-11-30 09:55:45,888 : INFO : EPOCH 1916: training o

2023-11-30 09:55:56,434 : INFO : EPOCH 1970: training on 905346 raw words (176670 effective words) took 0.2s, 1000022 effective words/s
2023-11-30 09:55:56,609 : INFO : EPOCH 1971: training on 905346 raw words (177059 effective words) took 0.2s, 1020231 effective words/s
2023-11-30 09:55:56,782 : INFO : EPOCH 1972: training on 905346 raw words (176882 effective words) took 0.2s, 1053407 effective words/s
2023-11-30 09:55:56,970 : INFO : EPOCH 1973: training on 905346 raw words (176740 effective words) took 0.2s, 969021 effective words/s
2023-11-30 09:55:57,145 : INFO : EPOCH 1974: training on 905346 raw words (176267 effective words) took 0.2s, 1028417 effective words/s
2023-11-30 09:55:57,341 : INFO : EPOCH 1975: training on 905346 raw words (177446 effective words) took 0.2s, 924843 effective words/s
2023-11-30 09:55:57,505 : INFO : EPOCH 1976: training on 905346 raw words (176790 effective words) took 0.2s, 1098340 effective words/s
2023-11-30 09:55:57,675 : INFO : EPOCH 1977: train

Time taken to process cell 462.6369948387146


In [22]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [23]:
# Other Constants
COLLECTION_FILE = 'Data/job_postings_collection_20231024-214547.json'
DOC2VEC_MODEL_FILE = 'Models/doc2vec_model_20231110-004321'
TEST_FILE = "./test/resume_125"

In [24]:
# Flags
create_vectors = True

In [25]:
# Load the Data
collection_data = None  
with open(COLLECTION_FILE, "r") as read_file:
    collection_data = json.load(read_file)
doc2vec_model = Doc2Vec.load(DOC2VEC_MODEL_FILE)

2023-11-30 10:04:37,974 : INFO : loading Doc2Vec object from Models/doc2vec_model_20231110-004321
2023-11-30 10:04:37,984 : INFO : loading dv recursively from Models/doc2vec_model_20231110-004321.dv.* with mmap=None
2023-11-30 10:04:37,985 : INFO : loading wv recursively from Models/doc2vec_model_20231110-004321.wv.* with mmap=None
2023-11-30 10:04:37,985 : INFO : setting ignored attribute cum_table to None
2023-11-30 10:04:37,987 : INFO : Doc2Vec lifecycle event {'fname': 'Models/doc2vec_model_20231110-004321', 'datetime': '2023-11-30T10:04:37.987693', 'gensim': '4.3.2', 'python': '3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.22621-SP0', 'event': 'loaded'}


In [26]:
# Process Cell
start = time.time()

# Check if creating vectors is required
if create_vectors:
    collection = collection_data['data']
    model = doc2vec_model
    
    # Iterate over each document in the collection
    for i in range(len(collection)):
        text = collection[i]['stem_lemm_text']
        text_tokens = word_tokenize(text)
        
        # Infer the Doc2Vec vector for the document's text
        vector = model.infer_vector(text_tokens)
        vector_norm = vector / np.linalg.norm(vector)
        
        # Update the collection data with the vectors
        collection[i]["vec"] = vector.tolist()
        collection[i]["vec_norm"] = vector_norm.tolist()
        print('.', end='')
        
    # Save the updated collection data
    file = open(COLLECTION_FILE, 'w')
    file.write(json.dumps(collection_data))
    file.close()

# Record the end time
end = time.time()
print('')
print("Time taken to process cell {}".format(end - start))

.......................................................................................................................................................................................................................................................................................................................................................................................
Time taken to process cell 49.57989478111267


In [27]:
# K means and cosine similarity funcitons.

K = int(math.log(len(collection_data['data']),2))
#K = 2
def cosine_similarity(x1, x2):
    return np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

def cosine_similarity_norm(x1, x2):
    return np.dot(x1, x2)

def euclidean_distance(x1, x2):
    return np.sqrt(np.sum(np.power(x1 - x2, 2)))

def initialize_centroids(K, points):
    total_points, point_dimensions = np.shape(points)
    centroids = np.empty((K, point_dimensions))
    selected_points = [False for i in range(len(points))]
    
    # pick a random data point from X as the centroid
    random_point = np.random.choice(range(total_points))
    selected_points[random_point] = True
    centroids[0] =  points[random_point]
    
    # pick the second point which is farthest from the first point
    max_distance = -1
    second_point = None
    for i in range(total_points):
        distance = euclidean_distance(points[i],centroids[0])
        if(distance > max_distance):
            max_distance = distance
            centroids[1] = points[i]
            second_point = i
    selected_points[second_point] = True

    # pick the rest of the k points.
    j = 2
    while j<K:
        new_centroid = 0-1
        max_distance = -1
        for i in range(total_points):
            if not selected_points[i]:
                min_distance = 99999999999
                for l in range(j):
                    distance = euclidean_distance(points[i], centroids[l])
                    if distance < min_distance:
                        min_distance = distance
                if min_distance > max_distance:
                    centroids[j] = points[i];
                    max_distance = min_distance
                    new_centroid = i
        selected_points[new_centroid] = True           
        j+=1;

    return centroids

def closest_centroid(K, point, centroids):
    distances = np.empty(K)
    for i in range(K):
        sim = cosine_similarity_norm(centroids[i], point)
        distances[i] = sim
    return np.argmax(distances) 

def create_clusters( K, centroids, points):
    total_points, point_dimensions = np.shape(points)
    cluster_idx = np.empty(total_points)
    for i in range(total_points):
        cluster_idx[i] = closest_centroid(K, points[i], centroids)
    return cluster_idx

def compute_means(K, cluster_idx,  points):
    total_points, point_dimensions = np.shape(points)
    centroids = np.empty((K, point_dimensions))
    for i in range(K):
        gathered_points = points[cluster_idx == i]
        centroids[i] = np.mean(gathered_points, axis=0)
    return centroids

def kmeans_cluster(K, points, max_iterations = 1500 ):
    centroids = initialize_centroids(K, points)
    #print(f"initial centroids: {centroids}")
    
    for _ in range(max_iterations):
        clusters = create_clusters(K, centroids, points)
        previous_centroids = centroids                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
        centroids = compute_means(K , clusters, points)
        # if the new_centroids are the same as the old centroids, return clusters
        diff = previous_centroids - centroids
        if not diff.any():
            return clusters, centroids
        print(".", end='')
    return clusters, centroids

def getClusterMap(K, clusters):
    myMap = {}
    for i in range(K):
        myMap[i] = [];
    for (index, value) in enumerate(clusters):
        myMap[value].append(index)
    return myMap

def search_k_means(K, doc_vector_norm, points, clusters_map, centroids):
    centroid = closest_centroid(K, doc_vector_norm, centroids)
    points_index = clusters_map[centroid]
    #print(points_index)
    result = []
    for value in points_index:
        #print('cosine_sim : ' , cosine_similarity_norm(points[value], doc_vector_norm), 'point index is : ', value)
        result.append((cosine_similarity_norm(points[value], doc_vector_norm), value))
    result.sort(reverse=True)
    return  result

In [28]:
total_points, dimensions = len(collection_data['data']) , len(collection_data['data'][0]['vec_norm'])
points = np.array([[x for x in y['vec_norm']] for y in collection_data['data']])
clusters, centroids = kmeans_cluster(K, points)
clusters_map = getClusterMap(K, clusters)
print("")
print(clusters)
print("")
print(centroids)
print("")
print(centroids[0])
print("")

.
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 6. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 5. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 7. 0. 0.
 0. 3. 0. 0. 0. 0. 2. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 4. 0. 0. 0. 0. 0. 0. 0. 7. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 4. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

In [29]:
f_open = open(TEST_FILE,'r')
temp = f_open.read()
f_open.close()

new_stopwords = create_custom_stopwords()

text_tokens = word_tokenize(temp)
query_stop_words_text = [word.lower() for word in text_tokens if not word.lower() in new_stopwords]


doc_vector = doc2vec_model.infer_vector(query_stop_words_text)
doc_vector_norm = doc_vector / np.linalg.norm(doc_vector)
print('query Resume document normalization vector is ', doc_vector_norm)
results = search_k_means(K, doc_vector_norm, points, clusters_map, centroids)
print()
results_to_print = 50
if len(results)<results_to_print:
    results_to_print = len(results)
    
for i in range(results_to_print):
     print("{}. document {} score {}".format(i+1,results[i][1],results[i][0]))

query Resume document normalization vector is  [ 5.65146990e-02  1.87292751e-02  3.36369127e-02 -7.40271360e-02
 -2.58440208e-02  6.45897835e-02 -3.51775028e-02 -3.38896364e-02
  6.24345206e-02 -2.37462725e-02  3.46487202e-02  5.80606796e-02
 -4.14688773e-02  5.42238131e-02  3.55059281e-03 -5.28287329e-02
 -5.10773137e-02 -2.35499460e-02  3.45377997e-02 -3.96846347e-02
 -2.67160423e-02  1.65704894e-03 -1.19296173e-02 -2.51497645e-02
 -7.62513373e-03 -2.48553813e-04  4.34662886e-02 -1.13668703e-02
  7.40519678e-03  2.91218441e-02 -7.52154514e-02  3.63982166e-03
 -2.86423657e-02 -2.49711331e-03  1.90997347e-02  2.60568336e-02
 -7.52115902e-03  1.73673574e-02 -1.42808165e-02 -2.69986521e-02
 -1.04923043e-02  5.41129559e-02 -5.37676513e-02  6.90250620e-02
 -2.87256762e-02  3.78628671e-02 -4.87763947e-03 -6.04655966e-02
  6.57555386e-02  9.08890665e-02 -5.50300442e-02 -8.89891014e-02
  4.65130098e-02  4.88675907e-02  7.77361495e-03 -5.12365624e-02
  8.82597920e-03 -6.08159183e-03  1.8296096

Note: As done with TF-IDF we have reused the same sample test resumes created for job posting with doc_id's 0 and 125. Document to Vecotr model ranks the job posting 0 at position 37 with score 0.70802332034608 where as job posting 125 at position 5 with score 0.7517491222406862.

Comparing These results of cosine similiarity with TF-IDF we see that TF-IDF shows relavent documents(job positings) at a higher rank.

## Ranked Retrieval using Pearson Correlation Coefficient

In [32]:
from scipy.stats import pearsonr
corpus = None
with open("Data/corpus.json", "r") as read_file:
    corpus = json.load(read_file)
collection = None
with open("Data/job_postings_collection_20231024-214547.json", "r") as read_file:
    collection = json.load(read_file)


resume_numbers= [2,4,5,6,7]
pearson_coffiecient_job_resume ={}
Job_Description_words_count =[]
#for resume_number in range(len(corpus)):
for resume_number in resume_numbers:
  for i in range(len(collection["data"])):
        Job_descriptions_words =  set(collection["data"][i]['stem_lemm_words_map'].keys())
        Resume_words = {word for word in corpus[resume_number]['word_doc_lower_cleaned_stem_lem']}
        final_resume =" ".join(corpus[resume_number]['word_doc_lower_cleaned_stem_lem'])
        words = Resume_words.union(Job_descriptions_words)
        Resume_words_count = [final_resume.split().count(word) for word in words]
        Job_Description_words_count =[]
        for word in words:
          if(word in Job_descriptions_words):
              Job_Description_words_count.append(collection["data"][i]['stem_lemm_words_map'][word])
          else:
              Job_Description_words_count.append(0)
        r, _ = pearsonr(Resume_words_count, Job_Description_words_count)
        pearson_coffiecient_job_resume[i]=r

  sorted_jobs = dict(sorted(pearson_coffiecient_job_resume.items(), key=lambda item: item[1], reverse=True))
  # Displaying the top 10 jobs
  top_10_jobs = dict(list(sorted_jobs.items())[:10])
  print(f"Top 10 jobs for resume {resume_number} with category as {corpus[resume_number]['Category']}:", top_10_jobs)

Top 10 jobs for resume 2 with category as Data Science: {320: -0.05961840151827503, 355: -0.05961840151827503, 58: -0.10117153954008691, 4: -0.11734505048508614, 201: -0.11945814802193899, 93: -0.12094380390207204, 157: -0.12864662516912462, 170: -0.12864662516912462, 122: -0.13458381721398585, 229: -0.13458381721398585}
Top 10 jobs for resume 4 with category as Data Science: {115: 0.02753234282657356, 65: -0.0026504952876935306, 81: -0.0026504952876935306, 157: -0.011533634830674686, 170: -0.011533634830674686, 88: -0.030781546969047467, 320: -0.032222131874075195, 355: -0.032222131874075195, 201: -0.033056736809616426, 3: -0.034443869979628075}
Top 10 jobs for resume 5 with category as Data Science: {320: -0.05270787978642338, 355: -0.05270787978642338, 49: -0.059455450963451555, 65: -0.06222802684163031, 81: -0.06222802684163031, 115: -0.0770641808790755, 151: -0.09171581920116895, 189: -0.09171581920116895, 93: -0.0918376448945398, 154: -0.09336901743273254}
Top 10 jobs for resume 

## Ranked Retrieval using Jaccard Coefficient

In [39]:
def create_stem_lemm_words(text_tokens):
    # Create instances of PorterStemmer and WordNetLemmatizer
    ps = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    query_stop_words_text = [word.lower() for word in text_tokens if not word.lower() in new_stopwords]

    resume_stem_lemm_words = []

    for word in range(len(query_stop_words_text)):
            stem_word = ps.stem(query_stop_words_text[word])
            #lemm_word = lemmatizer.lemmatize(stem_word)
            stem_lemm_word = lemmatizer.lemmatize(stem_word)
            resume_stem_lemm_words.append(stem_lemm_word)
    return resume_stem_lemm_words

In [40]:
#define Jaccard Similarity function
def jaccard(set1, set2):
    intersection = len(list(set(set1).intersection(set2)))
    union = (len(set1) + len(set2)) - intersection
    return float(intersection) / union

In [41]:
with open("Data/job_postings_collection_20231024-214547.json") as json_data:
    main_collection = json.load(json_data)
    df = main_collection["data"]
    job_words = []
    for i in range(len(df)):
        txt =df[i]["stem_lemm_text"]
        job_tokens = txt.split(' ')
        job_words.append(job_tokens)

In [42]:
#Show job according to the resume
TEST_FILE = "test/resume_12.txt"
f_open = open(TEST_FILE,'r')
temp = f_open.read()
f_open.close()
#query ="Assistant Chiropractic"
text_tokens = word_tokenize(temp)

new_stopwords = create_custom_stopwords()
resume_stem_lemm_words = create_stem_lemm_words(text_tokens)


#calculate Jaccard Coefficient of the two sets
jaccard_simi=[]
for j in job_words:
    y = jaccard(j, resume_stem_lemm_words)
    jaccard_simi.append(y)
largest = max(jaccard_simi)
indx = jaccard_simi.index(max(jaccard_simi))
print("similarity", largest)
print("")
print("************************ Job Description **************************")
print("")
print(df[indx]['text'])
print("")
print("")
print("************************ Resume **************************")
print(temp)

similarity 0.12020725388601036

************************ Job Description **************************

 Return to Search Result  Job Post Details                  Enterprise Architect Infrastructure  - job post            Southern Glazer's Wine &amp; Spirits                                                1,699 reviews      Texas                You must create an Indeed account before continuing to the company website to apply        Apply on company site                 save-icon                               Benefits Pulled from the full job description       401(k)  Disability insurance  Flexible spending account  Health insurance  Life insurance  Paid sick time  Parental leave         Show more   chevron down        Full Job Description    
  What You Need To Know 
   
   Open the door to a groundbreaking tech career with an industry leader. Southern Glazer's Wine &amp; Spirits is North America's preeminent wine and spirits distributor, as well as a family-owned, privately held compan

## Job Search using Phrase Queries

In [43]:
col =[]
query = 'Clinic Patient Access Clerk'
text_tokens = word_tokenize(query)

new_stopwords = create_custom_stopwords()

resume_stem_lemm_words = create_stem_lemm_words(text_tokens)

wordPositionalIndex={}
for stem_lemm_word in resume_stem_lemm_words:
    wordPageIndex ={}
    for index,j in enumerate(job_words):
        postion =[]
        for indx,word in enumerate(j):
            if word==stem_lemm_word:
                postion.append(indx)
        if len(postion) > 0:
            wordPageIndex[index]=postion
    wordPositionalIndex[stem_lemm_word] = wordPageIndex 

commonValObj = {}
#check if query has more than one word
if len(resume_stem_lemm_words)>1:
    key1 = resume_stem_lemm_words[0]
    key2 = resume_stem_lemm_words[1]
    #iterate for first two words, to reduce the number of iteration
    for queryWord in wordPositionalIndex[key1]:
        pos =[]
        isFound = False
        if queryWord in wordPositionalIndex[key2]:
            for val in wordPositionalIndex[key1][queryWord]: 
                if val+1 in wordPositionalIndex[key2][queryWord]:
                    isFound = True
                    pos.append(val+1)
            if len(pos) > 0:
                commonValObj[queryWord]=pos
    newArr =resume_stem_lemm_words[2:]

    if len(newArr)>2:
        lastIndx = 0
        for ind, wrd in enumerate(newArr):
            lastFoundIndx = 0
            for wordPosition in commonValObj:
                isFound = False
                if wordPosition in wordPositionalIndex[wrd]:
                    for val in wordPositionalIndex[wrd][wordPosition]: 
                        if val-1 in commonValObj[wordPosition] :
                            foundItemIndx = commonValObj[wordPosition].index(val - 1)
                            commonValObj[wordPosition][foundItemIndx] = val
                            isFound = True
                            lastIndx = wordPosition              
        if isFound:
            print('***************Job Description for given query***************************')
            print(df[lastIndx]['text'])
            print('******************************************')
    else:
        for kys in commonValObj:
            print('***************Job Description for given query***************************')
            print(df[kys]['text'])
            print('******************************************')
else:
    key = resume_stem_lemm_words[0]
    for kys in wordPositionalIndex[key]:
        print('***************Job Description for given query***************************')
        print(df[kys]['text'])
        print('******************************************')

***************Job Description for given query***************************
 Return to Search Result  Job Post Details                      Clinic Patient Access Clerk  - job post            Martin County Hospital District                                                8 reviews          600 East Interstate 20, Stanton, TX 79782                You must create an Indeed account before continuing to the company website to apply        Apply on company site                 save-icon                               Full Job Description    
 Description: 
   Martin County Hospital District (MCHD) was organized in 1967 and has boundaries coterminous with Martin County, Texas. It is located in the heart of West Texas. Opened in 1949, the hospital was originally located near downtown Stanton. This facility served the community well for many years, but was replaced with a state of the art 18 bed facility in 2012. The new facility is located on I20, and has been designated as a Critical Access Hospi

## Job Trends

We tried using H-index here, but most words were not repeating that frequently in job postings. Only the stop words had good H-index results, which did not give us the trends.

Instead, we used the terms that appeared highest-number of times in at max 15 to 75 percent of the documents.

In [44]:
# other constants
COLLECTION_FILE = 'Data/job_postings_collection_20231024-214547.json'

# load the data
collection_data = None  
with open(COLLECTION_FILE, "r") as read_file:
    collection_data = json.load(read_file)

In [46]:
term_in_documents = collection_data["document_collection"]["term_document_map"]
document_data = collection_data["data"]
results = [] 
documents_in_corpus = len(document_data)
upper_limit = int(0.93 * documents_in_corpus)
lower_limit = int(0.2 * documents_in_corpus)

for term,doc_list in term_in_documents.items():
    
    if len(doc_list) == 0 or len(term) <= 1:
        continue
    total_docs = len(doc_list)
    if total_docs > upper_limit or total_docs < lower_limit:
        continue
    score = 0
    total_doc = len(doc_list)
    total_sum = 0
    for i in doc_list:
        total_sum += document_data[i]["term_frequency_map"][term]
    avg = total_sum / total_doc
    #score = (avg /documents_in_corpus) 
    #score = (1 + total_docs / documents_in_corpus ) * ( avg )
    #score = total_sum
    score = (1 + total_docs / documents_in_corpus) * (1 + avg)
    #score = ( total_docs / documents_in_corpus) * (avg)
    results.append((score,total_doc,term))
results.sort(reverse=True)
for i in range(10):
    print("{}. {} , Score = {}".format(i+1, results[i][2], results[i][0]))

1. requir , Score = 8.710222222222221
2. custom , Score = 8.3793837535014
3. insur , Score = 7.519707317073171
4. servic , Score = 7.212912280701755
5. hour , Score = 7.182731707317075
6. experi , Score = 7.078910299003322
7. sale , Score = 6.908
8. assist , Score = 6.905470085470086
9. employe , Score = 6.831474178403756
10. shift , Score = 6.767701149425288
