<a href="https://colab.research.google.com/github/PedroBritodSa/NLP-Resume-job-description-scan/blob/main/NLP_project_resume_scan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simple Resume Scan and WordCloud
- In this notebook, we implement a simple NLP pipeline that scans a resume in docx format and compares it with a job description.
- The code is organized into functions for ease of use. To run it, you need to upload your own docx documents.

In [None]:
%pip install docx2txt



- Loading some important libraries


In [None]:
import docx2txt #library to transform docx file into txt

from sklearn.feature_extraction.text import CountVectorizer #Simple vectorizer

from sklearn.metrics.pairwise import cosine_similarity #function to measure similarity between texts

import string #to remove punctuations

import re #regular expression library

import pandas as pd

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # Set of stopwords to be removed

from wordcloud import WordCloud

import matplotlib.pyplot as plt

import plotly.express as px

import nltk

from nltk.tokenize import word_tokenize

from collections import Counter

nltk.download("punkt")

nltk.download("punkt_tab") # newer NLTK releases load this resource for word_tokenize


punc = string.punctuation # punctuations to be removed

stopwords = ENGLISH_STOP_WORDS # stopwords to be removed

In [None]:

from google.colab import files

upload = files.upload()

# Data Visualization

- First, we create a tool to visualize the most frequent words in the documents, using the wordcloud library.
- To focus the analysis on keywords, we first remove common words that carry little meaning, known as stopwords. The following functions do this.
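
As a minimal sketch of the stopword filtering step, the snippet below applies the same list-comprehension pattern used in the functions of this notebook. The tiny `demo_stopwords` set and the `remove_stopwords` helper are hypothetical stand-ins so the example runs without extra libraries; the notebook itself uses scikit-learn's much larger `ENGLISH_STOP_WORDS` set.

```python
# Hypothetical, hand-picked stopword set used only for this illustration.
# The notebook uses sklearn's ENGLISH_STOP_WORDS instead.
demo_stopwords = {"we", "are", "for", "a", "with", "in", "the", "and"}


def remove_stopwords(text, stopword_set):
    """Keep only the whitespace-separated words whose lowercase form is not a stopword."""
    return " ".join(word for word in text.split() if word.lower() not in stopword_set)


sample = "We are hiring a Python developer with SQL skills"
cleaned = remove_stopwords(sample, demo_stopwords)
print(cleaned)  # Python developer and the other keywords remain
```

Note that the comparison uses `word.lower()`, so capitalized stopwords such as "We" are removed while the original casing of the remaining words is preserved.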

In [None]:
# Function to visualize the most frequent words in a document
# docname is the name of the docx file to analyze, for example: docname = "name.docx"
def wc_func(docname):

  data = docx2txt.process(docname)

  data = ' '.join([word for word in data.split() if word.lower() not in stopwords]) # removing stopwords

  wc = WordCloud(background_color="Black", repeat=False, width = 1200 , height = 600) # Setting up the wordcloud

  wc.generate(data) # applying the wc on the data

  plt.figure(figsize = (20,20))

  plt.axis("off") # hide axis

  plt.imshow(wc, interpolation="bilinear")

  plt.show()





In [None]:
wc_func("job_description.docx")

- We can also examine the words by their frequency in the text, which will be set up in the following function.
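
The counting idea can be sketched on a short string before applying it to a whole document. This example is only an illustration: the sample text is made up, and it splits on whitespace instead of using NLTK's `word_tokenize`, so it runs with the standard library alone.

```python
import string
from collections import Counter

# Made-up sample text, used only for illustration
text = "Python, SQL and Python. Data analysis with SQL; Python!"

# Strip punctuation, lowercase, and split on whitespace
no_punc = text.translate(str.maketrans("", "", string.punctuation))
tokens = no_punc.lower().split()

# Count every token and keep the two most frequent ones
counts = Counter(tokens)
top = counts.most_common(2)
print(top)  # [('python', 3), ('sql', 2)]
```

`most_common(n)` returns `(word, count)` pairs sorted from most to least frequent, which is the shape that is later loaded into a DataFrame for plotting.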

In [None]:


def most_freq_words(docname,number):

    data = docx2txt.process(docname)

    data = ''.join([char for char in data if char not in punc])


    data = ' '.join([word for word in data.split() if word.lower() not in stopwords]) # removing stopwords

    words = word_tokenize(data.lower())
    # Count word frequency
    word_count = Counter(words)

    # Get the most common words and their frequencies
    most_common_words = word_count.most_common(number)

    # Convert the Counter object to a DataFrame
    df = pd.DataFrame(most_common_words, columns=['Word', 'freq'])

    df = df.sort_values(by='freq', ascending=True)

    fig = px.bar(df, y='Word', x='freq', text_auto='.2s',
            title="Most Common Words", orientation = 'h')

    fig.show()


In [None]:
most_freq_words("job_description.docx",18)

# Scanning the Resume

- Finally, we use scikit-learn's CountVectorizer to build a count matrix from the two texts, and then calculate the similarity between them using cosine similarity.
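
To make the calculation concrete, here is a plain-Python sketch of what the count matrix and the cosine similarity represent. It is not the scikit-learn implementation: `count_vector` and `cosine` are hypothetical helpers, the two short texts are made up, and the regular expression only approximates CountVectorizer's default tokenization (lowercased tokens of two or more word characters).

```python
import math
import re
from collections import Counter


def count_vector(text):
    """Map each token (2+ word characters, lowercased) to its number of occurrences."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


resume_text = "python sql python"   # counts: python=2, sql=1
job_text = "python sql excel"       # counts: python=1, sql=1, excel=1

score = cosine(count_vector(resume_text), count_vector(job_text))
print(round(score * 100, 2))  # 77.46
```

Here the dot product is 2·1 + 1·1 = 3 and the vector norms are √5 and √3, so the similarity is 3/√15 ≈ 0.7746. A score of 1 means both texts use the same words in the same proportions, and 0 means they share no words.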

In [None]:
def resume_scan(your_resume, job_description):

  resume = docx2txt.process(your_resume) #Reading resume docx

  job_des = docx2txt.process(job_description) #Reading job description docx

  resume = ' '.join([word for word in resume.split() if word.lower() not in stopwords])

  job_des = ' '.join([word for word in job_des.split() if word.lower() not in stopwords])

  text = [resume, job_des] # Putting the two texts in a list

  cv = CountVectorizer() # Creating the CountVectorizer instance

  count_matrix = cv.fit_transform(text) #Creating the count matrix

  #Print the cosine similarity scores

  print("\nSimilarity Scores:")

  print(cosine_similarity(count_matrix))

  print(" ")

  print(" ")

  print("\nMatch Rate of: " + str(your_resume) +" and " + str(job_description))

  print(" ")

  print(" ")

  matchrate = cosine_similarity(count_matrix)[0][1]

  matchpercen = round(matchrate*100,2) # rounded value

  print("Match percentage between your resume and the job description is " + str(matchpercen) + "%")


In [None]:
resume_scan("resume.docx","job_description.docx")