<h1 style='color: gold; text-align: center; font-family: cursive;font-size: 30px;'>Hire the perfect candidate</h1>

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>A Perfect Fit : HackerEarth Machine Learning Challenge</h1>

<h1 style='color: #8f7be9; text-align: left; font-family: cursive;font-size: 20px;'><p>Hiring employees effectively is vital to the survival of any organization. The hiring process consists of soliciting potential candidates during recruitment and then determining the best candidates to be employees during the selection process.</p>

<p>The selection process, in particular, enables organizations to build and maintain a productive and motivated workforce that will be the key to their success.&nbsp;</p>

<p><strong>Task</strong></p>

<p>You want to hire a Machine Learning engineer for your team. Your manager has provided a job description and&nbsp;resumes of various candidates. You are given the responsibility to filter the candidates that fit the most based on the provided job description for the first interview round.</p>

<p>You are given a dataset that contains the resumes of various candidates. Your task is to determine the percentage that a candidate fits the job role based on the job description.</p>

<p><strong>Data description</strong></p>

<p>The <em>dataset</em> folder contains the following files:</p>

<ul>
	<li><em>train.csv</em>: 90 x 2</li>
	<li><em>trainResumes</em>: 90 resumes that you must use for training your model</li>
	<li><em>test.csv</em>: 60 x 1</li>
	<li><em>testResumes</em>: 60 resumes that you must use for testing&nbsp;your model</li>
	<li><em>sample_submission.csv</em>: 5 x 2</li>
	<li><em>Job description.pdf</em>: PDF file that represents the job description of a Machine Learning engineer</li>
</ul>

<p>The dataset contains the following columns:</p>

<table border="1">
	<tbody>
		<tr>
			<td style="text-align:center"><strong>Column name&nbsp;</strong></td>
			<td style="text-align:center"><strong>Column description</strong></td>
		</tr>
		<tr>
			<td style="text-align:center">CandidateID</td>
			<td>Represents the unique identification number of a candidate</td>
		</tr>
		<tr>
			<td style="text-align:center">Match Percentage</td>
			<td>Represents the percentage that a candidate fits based on the job description</td>
		</tr>
	</tbody>
</table>

<p><strong>Evaluation metric</strong></p>

<pre class="prettyprint">
<code>score = 100*max(0, 1 - metrics.mean_squared_log_error(actual, predicted))</code></pre>

<p><strong>Result submission guidelines</strong></p>

<ul>
	<li>The index&nbsp;is the&nbsp;<em>CandidateID</em> column.&nbsp;</li>
	<li>The target&nbsp;is the <em>Match Percentage</em>&nbsp;column.&nbsp;</li>
	<li>The submission file must be submitted in <strong>.csv</strong> format only.</li>
	<li>The size of this submission file must be&nbsp;60 x 2.</li>
</ul>

<p><em>Notes</em>&nbsp;</p>

<p>Ensure that your submission file contains the following:</p>

<ul>
	<li>Correct index values as per the test file</li>
	<li>Correct names of&nbsp;columns as provided in the <em>sample_submission.csv</em> file</li>
</ul></h1>
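
The evaluation metric above can be reproduced locally to sanity-check predictions before submitting. The sketch below is a minimal pure-Python version, assuming the standard definition of mean squared logarithmic error (the mean of the squared differences of `log(1 + value)`), which is what `metrics.mean_squared_log_error` in scikit-learn computes. The function names `msle` and `challenge_score` are illustrative and not part of the challenge.

```python
import math


def msle(actual, predicted):
    """Mean squared logarithmic error: mean of (log(1 + a) - log(1 + p)) ** 2."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_log_errors = [
        (math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)
    ]
    return sum(squared_log_errors) / len(actual)


def challenge_score(actual, predicted):
    """Score from the challenge statement: 100 * max(0, 1 - MSLE)."""
    return 100 * max(0, 1 - msle(actual, predicted))
```

Because the error is computed on logarithms, match percentages must be non-negative, and a miss of a few points costs more for a low true percentage than for a high one.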

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>Importing Libraries...</h1>

In [83]:
# basic modules
import numpy as np
import pandas as pd

# modules to read pdf files
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

# ignore warning
import warnings 
warnings.simplefilter('ignore') 

<h1 style='color: red; text-align: left; font-family: cursive;font-size: 20px;'>Major challenge --> How to read data from PDF files, because pandas doesn't support PDF files</h1>

<h1 style='color: lightgreen; text-align: left; font-family: cursive;font-size: 20px;'>Solution --> pdf_reader function</h1>

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>Function to read selected pages of a PDF file</h1>

In [68]:
def pdf_text_reader(pdf_file_name, pages=None):
    """Read text from a PDF file (including complicated layouts) using pdfminer.

    `pages` is an iterable of zero-based page numbers. If it is None or empty,
    every page of the file is read.
    """
    pagenums = set(pages) if pages else set()

    ## 1) Initiate the PDF text converter and interpreter
    textOutput = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, textOutput, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    ## 2) Extract text from the file using the interpreter
    with open(pdf_file_name, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)

    ## 3) Collect the extracted text and close the connections
    paras = textOutput.getvalue()
    converter.close()
    textOutput.close()

    return paras


<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>Function to read 'whole' pdf file</h1>

In [69]:
def pdf_reader(file_name):
    # start from page number 0
    file = pdf_text_reader(file_name, pages=[0])
    page_no = 1
    while True:
        another_page = pdf_text_reader(file_name, pages=[page_no])

        # keep adding pages until a page number past the end returns no text
        if len(another_page) == 0:
            break
        file = file + another_page
        page_no += 1

    # split the text into lowercase, non-empty lines for easier handling
    file = [i.lower() for i in file.split('\n') if i]
    return file  # list containing the lines of the PDF pages as strings
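
The last step of the reader above (lowercasing and dropping empty lines) can be tried on a plain string without any PDF. The sketch below uses a hypothetical helper, `clean_lines`, that is not part of the notebook. Unlike the original it also strips surrounding whitespace, which removes the form-feed character (`\x0c`) that pdfminer's text converter is assumed to write at the end of each page.

```python
def clean_lines(text):
    """Split raw extracted text into lowercase, stripped, non-empty lines."""
    lines = []
    for raw_line in text.split("\n"):
        # str.strip() also removes the form feed ("\x0c") left at page boundaries
        line = raw_line.strip()
        if line:
            lines.append(line.lower())
    return lines


sample = "Jacob Smith\n\nFRESHER\n\x0cPage Two\n"
cleaned = clean_lines(sample)
```

Stripping before the emptiness check means that lines consisting only of spaces are dropped as well, which the plain `if i` filter would keep.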

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>Loading Data...</h1>

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>1) Reading training and testing data</h1>

In [70]:
#training data
train = pd.read_csv('dataset/train.csv')
print("Train Data :") 
train.head(3) #printing first 3 entries


Train Data :


Unnamed: 0,CandidateID,Match Percentage
0,candidate_011,13.6
1,candidate_113,36.63
2,candidate_123,54.93


In [71]:
#testing data 
test = pd.read_csv('dataset/test.csv')  
print("Test Data :") 
test.head(3) #printing first 3 entries 

Test Data :


Unnamed: 0,CandidateID
0,candidate_014
1,candidate_098
2,candidate_075


<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>Check if any missing entry</h1>

In [72]:
train.isnull().sum() 

CandidateID         0
Match Percentage    0
dtype: int64

In [73]:
test.isnull().sum() 

CandidateID    0
dtype: int64

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 15px;'>Observation --> There are no missing values in the train and test datasets, and the trainResumes and testResumes folders are confirmed to contain all 90 and 60 resumes, respectively</h1>
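
The folder check in the observation above was done by inspection. It could also be done in code, as in the sketch below. The helper name `missing_resumes` is hypothetical, and it assumes each resume is stored as `<CandidateID>.pdf`, which matches the paths used later in this notebook.

```python
from pathlib import Path


def missing_resumes(candidate_ids, resume_dir):
    """Return the IDs, in input order, that have no '<id>.pdf' file in resume_dir."""
    folder = Path(resume_dir)
    return [cid for cid in candidate_ids if not (folder / f"{cid}.pdf").is_file()]


# Example usage (assumes the dataset layout used in this notebook):
# missing_resumes(train["CandidateID"], "dataset/trainResumes")
# missing_resumes(test["CandidateID"], "dataset/testResumes")
```

An empty list from both calls would confirm that every row in `train.csv` and `test.csv` has a matching resume file.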

<h1 style='color: red; text-align: left; font-family: cursive;font-size: 20px;'>Challenge --> CandidateIDs are shuffled, so we can't use them directly</h1>

<h1 style='color: lightgreen; text-align: left; font-family: cursive;font-size: 20px;'>Solution --> We will sort the dataframes by CandidateID</h1>

In [74]:
#sort dataframe based on CandidateID
train.sort_values(by='CandidateID', inplace=True)

#reset index, dropping the old shuffled index
train.reset_index(drop=True, inplace=True)

#print sorted train dataset
train.head() 

Unnamed: 0,CandidateID,Match Percentage
0,candidate_000,13.7
1,candidate_001,40.09
2,candidate_002,48.91
3,candidate_003,36.89
4,candidate_006,44.96


In [75]:
#sort dataframe based on CandidateID
test.sort_values(by='CandidateID', inplace=True)

#reset index, dropping the old shuffled index
test.reset_index(drop=True, inplace=True)

#print sorted test dataset
test.head() 

Unnamed: 0,CandidateID
0,candidate_004
1,candidate_005
2,candidate_014
3,candidate_016
4,candidate_017


<h1 style='color: red; text-align: left; font-family: cursive;font-size: 20px;'>Challenge --> CVs are in PDF format, but we need them in the dataframe</h1>

<h1 style='color: lightgreen; text-align: left; font-family: cursive;font-size: 20px;'>Solution --> Loop through the CVs, read each one with the pdf_reader function, and add it to the dataframe</h1>

<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>2) Reading CVs</h1>

In [89]:
# read each resume in CandidateID order and store its lines in a new column
train["CV"] = [
    pdf_reader("dataset/trainResumes/" + str(candidate_id) + ".pdf")
    for candidate_id in train["CandidateID"]
]

train.head(3)     

Unnamed: 0,CandidateID,Match Percentage,CV
0,candidate_000,13.7,"[jacob smith, f r e s h e r, personal profile,..."
1,candidate_001,40.09,"[brianna williams, j uni or developer, work e..."
2,candidate_002,48.91,"[mason quadrado, associate analyst, about, cer..."


In [90]:
# read each resume in CandidateID order and store its lines in a new column
test["CV"] = [
    pdf_reader("dataset/testResumes/" + str(candidate_id) + ".pdf")
    for candidate_id in test["CandidateID"]
]

test.head(3)    

Unnamed: 0,CandidateID,CV
0,candidate_004,"[olivia santos, consultant analyst, executive,..."
1,candidate_005,"[armin fitzgerald, d a t a m a n ..."
2,candidate_014,"[grace bailry, m a c h i n e l e a r n i n g..."


<h1 style='color: rgb(77, 229, 240); text-align: left; font-family: cursive;font-size: 20px;'>3) Reading Job Description</h1>

In [88]:
jd = pdf_reader("dataset/Job description.pdf")
jd[:5] #print first 5 lines

['machine learning engineering',
 '13585abc',
 'knowledge and innovation',
 'what you’ll do',
 'you will focus on researching, building, and designing self-running artificial intelligence (ai)']