<a href="https://colab.research.google.com/github/Trailblazer29/Resume-Scanner/blob/master/resume_scanner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Applicant Tracking System (ATS)</center>
## <center>Author: **Ilham Seladji**</center>

# Install Required Packages & Software

In [None]:
!pip install docx2txt
!pip install pdfplumber
!sudo apt install -y tesseract-ocr
!pip install pytesseract
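
**pytesseract** is a wrapper that calls the Tesseract executable, so the `pip` package alone is not enough for OCR. The snippet below is a minimal sketch for checking that the binary is reachable, assuming the default setup in which pytesseract looks for a `tesseract` command on the PATH. The helper name `tesseract_available` is illustrative and not part of the notebook.

```python
import shutil

def tesseract_available():
    """Return True if a `tesseract` executable is found on the PATH."""
    return shutil.which("tesseract") is not None

if tesseract_available():
    print("Tesseract found at:", shutil.which("tesseract"))
else:
    print("Tesseract not found; image resumes cannot be converted.")
```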

# Import Packages

In [2]:
import io
import os
import pandas as pd
import docx2txt
from itertools import chain
import pdfplumber
import pytesseract
from PIL import Image
from google.colab import files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Convert any file type (DOCX, PDF, TXT, IMAGE) to text

This function converts a supported file (DOCX, PDF, TXT, or image) to text. DOCX files are handled by the **docx2txt** package, PDF files are converted using **pdfplumber**, and text is extracted from images using the **Tesseract** OCR engine (via **pytesseract**).

In [14]:
def file_to_text(file_path):
    """Extract the text of a DOCX, PDF, TXT, or image file as a single string."""

    # Lower-case the extension so that ".PDF" and ".pdf" are treated alike
    _, file_extension = os.path.splitext(file_path)
    file_extension = file_extension.lower()

    if file_extension == ".docx":
        # Replace line breaks with spaces so adjacent words are not merged
        return docx2txt.process(file_path).replace("\n", " ")

    elif file_extension == ".pdf":
        text = ""
        with pdfplumber.open(file_path) as pdf:
            for page in pdf.pages:
                # Guard against pages for which no text is returned
                page_content = page.extract_text() or ""
                text += " " + page_content.replace("\n", " ")
        return text

    elif file_extension == ".txt":
        # The with-block closes the file; no explicit close() is needed
        with open(file_path, "r") as f:
            return f.read().replace("\n", " ")

    elif file_extension in (".jpg", ".jpeg", ".png"):
        image = Image.open(file_path)
        return pytesseract.image_to_string(image).replace("\n", " ")

    else:
        print("Unsupported Format.")
        # Return an empty string so callers always receive text
        return ""
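
The conversion is chosen from the file extension alone. `os.path.splitext` returns the extension with its original case, so `resume.PDF` and `resume.pdf` yield different strings unless the extension is normalised first. The sketch below illustrates that lookup in isolation. The `SUPPORTED_FORMATS` table and the `detect_format` helper are hypothetical names used for illustration, not part of the notebook.

```python
import os

# Hypothetical mapping from extension to the kind of converter it needs.
SUPPORTED_FORMATS = {
    ".docx": "docx",
    ".pdf": "pdf",
    ".txt": "txt",
    ".jpg": "image",
    ".jpeg": "image",
    ".png": "image",
}

def detect_format(file_path):
    """Return the format label for a path, or None if unsupported."""
    _, extension = os.path.splitext(file_path)
    # splitext keeps the original case, so normalise before the lookup.
    return SUPPORTED_FORMATS.get(extension.lower())

print(detect_format("resume.PDF"))
print(detect_format("scan.jpeg"))
print(detect_format("notes.odt"))
```

A table lookup like this keeps the list of accepted extensions in one place, which makes it easier to see which formats are covered.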

# Open Resume & Job Description

Select a job description and a set of resumes to be evaluated, and convert them into textual format.

In [None]:
job_description = files.upload()
if job_description:
  job_description_path = "/content/" + next(iter(job_description))
  job_description = file_to_text(job_description_path)

In [None]:
resumes = files.upload()
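
`files.upload()` is only available inside Google Colab. Outside Colab, one possible substitute is to gather the resume paths from a local folder. The sketch below assumes the resumes sit directly inside a single directory. The `collect_resume_paths` helper and the `SUPPORTED_EXTENSIONS` set are hypothetical names, not part of the notebook.

```python
from pathlib import Path

# Extensions the conversion step is meant to handle (lower-case).
SUPPORTED_EXTENSIONS = {".docx", ".pdf", ".txt", ".jpg", ".jpeg", ".png"}

def collect_resume_paths(folder):
    """Return sorted paths of supported files directly inside `folder`."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )

# Example: list candidate files in the current directory.
for resume_path in collect_resume_paths("."):
    print(resume_path)
```

Each returned path could then be passed to `file_to_text` in place of the `"/content/" + name` paths built from the Colab upload.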

# Evaluate Similarity between Resume & Job Description

* The similarity between the job description and each resume is measured with **cosine similarity**, as implemented in **Scikit-learn**, applied to word-count vectors built with `CountVectorizer`.

* Cosine similarities are recorded in an Excel file in descending order (i.e., the most relevant applicant profiles appear at the top of the list).
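
To make the two points above concrete, here is a plain-Python sketch of cosine similarity on word counts, followed by a ranking with the best match first. It is not Scikit-learn's implementation. The tokenizer only approximates `CountVectorizer`'s defaults (lower-casing, tokens of two or more word characters), and the sample texts and file names are invented for illustration.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count lower-cased word tokens of two or more characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine_sim(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text shares nothing with any other text
    return dot / (norm_a * norm_b)

sample_job = "Python developer with SQL experience"
# Hypothetical resume snippets, for illustration only
sample_resumes = {
    "alice.txt": "Python developer, five years of SQL",
    "bob.txt": "Graphic designer with Photoshop experience",
}

scores = {
    name: round(cosine_sim(sample_job, text), 2)
    for name, text in sample_resumes.items()
}
# Sort in descending order so the closest match comes first
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(name, score)
```

The first snippet shares three words with the job text and ranks above the second, which shares two. Because the score depends only on overlapping word counts, generic words such as "with" contribute as much as skill keywords.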

In [None]:
# Save uploaded resumes' names in a list
resume_names = list(resumes.keys())
similarities = [] # Range between 0 (0%) and 1 (100%)

# Convert resumes to text and do a pairwise comparison with job description
cv = CountVectorizer()
for name in resume_names:
  path = "/content/" + name
  # Fall back to an empty string if no text could be extracted
  resume = file_to_text(path) or ""
  content = [job_description, resume]
  # Fit on the pair so both vectors share the same vocabulary
  matrix = cv.fit_transform(content)
  similarity_matrix = cosine_similarity(matrix)
  # Entry [0][1] compares the job description (row 0) with the resume (row 1)
  similarity = round(similarity_matrix[0][1], 2)
  similarities.append(similarity)

# Display cosine similarities in a dataframe, most similar resumes first
ats_data = {"Applicant File": resume_names, "Similarity With Job Description": similarities}
ats_data = pd.DataFrame(ats_data)
ats_data.sort_values(by="Similarity With Job Description", ascending=False, inplace=True)
print(ats_data)

# Store cosine similarities in Excel file
ats_data.to_excel("ATS Results.xlsx")