## Job Recommendation Based on Resume Skills
### Flow of Execution
* Import the resume in PDF/Word format
* Clean the text and extract skills using NLP
* Scrape jobs in real time from <b>Indeed.com</b> based on the extracted skill set
* Clean the job data and store it in a DataFrame
* Vectorize/embed the job descriptions using a pretrained Word2Vec model (transfer learning)
* Vectorize/embed the resume text using Word2Vec
* Calculate the cosine similarity between the resume and each job description
* Suggest jobs based on the similarity score

### Resume Import, Cleaning, and Skill Extraction


In [1]:
import warnings
warnings.filterwarnings('ignore')

#### Importing the resume PDF and cleaning the text

In [2]:
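# PDF text extractor
def pdf_reader(path):
    # PdfReader, .pages and .extract_text() are the current PyPDF2 names;
    # PdfFileReader/numPages/getPage/extractText no longer work in PyPDF2 3.0.
    import PyPDF2
    text=[]
    with open(path,'rb') as file:
        pdf=PyPDF2.PdfReader(file)
        for page in pdf.pages:
            # fall back to '' if no text could be extracted from a page
            text.append(page.extract_text() or '')
    return text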

# Clean resume text


def resume_clean(inp_pdf_text):
    import re
    return_list=[]
    for inp_text in inp_pdf_text:
        sub=re.sub(r'\n+','',inp_text) # remove newlines
        sub=re.sub(r'\S*.\S*\/[^\s]+',' ',sub) # remove links (text around a '/')
        sub=re.sub(r'\S*@[^\s]+',' ',sub) # remove email addresses
        sub=re.sub(r'[^ a-z A-Z]',' ',sub) # keep letters and spaces only (drops digits and punctuation)
        sub=re.sub(r' +',' ',sub) # collapse repeated spaces
        return_list.append(sub)
    return return_list
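
The cleaning steps can be tried on a short string without a PDF. The sketch below uses a hypothetical `clean_text` helper (not part of the notebook) that applies the same order of substitutions as `resume_clean`, under two assumptions that differ from it: newlines are replaced by a space so that words on adjacent lines do not merge, and links are matched by an explicit `http(s)://` or `www.` prefix.

```python
import re

def clean_text(raw):
    """Hypothetical helper: same cleaning order as resume_clean, on one string."""
    text = re.sub(r'\n+', ' ', raw)                      # newlines -> space (assumption)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # links with an explicit prefix (assumption)
    text = re.sub(r'\S*@\S+', ' ', text)                 # email addresses
    text = re.sub(r'[^a-zA-Z ]', ' ', text)              # keep letters and spaces only
    text = re.sub(r' +', ' ', text)                      # collapse repeated spaces
    return text.strip()

# Made-up sample text, not taken from the resume used in this notebook
sample = "Jane Doe\njane.doe@example.com\nSkills: Python, SQL (2021)\nhttps://example.com/jane"
cleaned = clean_text(sample)
print(cleaned)
```

Running the helper on a few lines of a real resume is a quick way to check whether skill names survive the cleaning before the text is passed on.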

In [3]:
pdf_text=pdf_reader('D:\\IVY Batches\\CV\\Fresher Sample Resume\\Fresher Sample Resume\\SujathaAmbati - Resume.pdf')
pdf_text=resume_clean(pdf_text)


In [4]:
# Join the cleaned text of all pages (works for any page count)
final_text=' '.join(pdf_text)
# final_text=final_text.lower()


#### POS Tagging

In [10]:
import nltk
from nltk.tokenize import word_tokenize
tokens=word_tokenize(final_text)
pos_tokens=nltk.pos_tag(tokens) # list of (token, POS tag) pairs


In [6]:
# ('NNP') is a plain string, so `in` would also match the substring tag 'NN'; compare explicitly
ner_text=' '.join([i[0] for i in pos_tokens if i[1]=='NNP'])

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
# recent scikit-learn versions accept 'english', a list, or None for stop_words
cvec=TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS))

x=cvec.fit_transform([ner_text])





#### Skills Extraction from Resume text

In [5]:
import pickle
with open('skills.pkl','rb') as file:
    sk=pickle.load(file)

In [8]:
def skill_extract(resume_text,sample_skills):
    import nltk
    tokens=nltk.word_tokenize(resume_text)
    # all 1- to 3-word n-grams, joined back into strings
    skill_tokens=map(' '.join,nltk.everygrams(tokens,1,3))
    # keep the lowercased n-grams found in the skills list, without duplicates
    return list({i.lower() for i in skill_tokens if i.lower() in sample_skills})
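
The matching idea behind `skill_extract` can be illustrated without NLTK or the pickled skills file. The following is a minimal sketch, assuming a plain whitespace split in place of `nltk.word_tokenize` and a toy skill set; `match_skills` and the sample data are hypothetical names introduced only for illustration.

```python
def match_skills(resume_text, known_skills, max_n=3):
    """Hypothetical helper: find known skills among 1..max_n word n-grams."""
    tokens = resume_text.lower().split()   # whitespace split stands in for nltk.word_tokenize
    found = set()
    for n in range(1, max_n + 1):
        # slide a window of n tokens over the text
        for start in range(len(tokens) - n + 1):
            gram = ' '.join(tokens[start:start + n])
            if gram in known_skills:
                found.add(gram)
    return sorted(found)

skills = {"machine learning", "power bi", "sql", "excel"}   # toy skill list (assumption)
matched = match_skills("Experienced in Machine Learning and Power BI with SQL", skills)
print(matched)
```

Using n-grams up to length three is what lets multi-word skills such as "machine learning" and "power bi" match alongside single-word ones.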

In [9]:
scrap_skills=skill_extract(final_text,sk) # Skills to scrap

In [10]:
scrap_skills

['sql',
 'statistics',
 'python',
 'data',
 'excel',
 'tableau',
 'c',
 'analysis',
 'visualization',
 'machine learning',
 'power bi']

#### Final Text Generation for Vectorization
* This step removes named entities such as persons, languages, countries, and events, which are not relevant to job descriptions.

In [11]:
import spacy
import en_core_web_trf
spa=spacy.load('en_core_web_trf')

In [12]:
def vec_text(final_text):
    import re
    ner=spa(final_text)
    final_text=final_text.lower()
    for i in ner.ents:
        if i.label_ in ['GPE','PERSON','ORG','LOC','LANGUAGE','EVENT','WORK_OF_ART','DATE','TIME']:
            final_text=final_text.replace(i.text.lower(),' ')
    
    final_text=re.sub(r'\b\w{1}\b',' ',final_text)
    final_text=re.sub(r' +',' ',final_text)
    
    return final_text.strip()

In [13]:
text_to_vectorize=vec_text(final_text)

In [30]:
# text_to_vectorize

## Job Scraping

In [14]:
# Job scraping
# #################################################
def description_scrap(titles):
    import requests
    from bs4 import BeautifulSoup
    Description=[]
    for title in titles:
        for pages in [0,10,20]:
            link='https://in.indeed.com/jobs?q={}&start={}'.format(title,pages)

            req=requests.get(link)

            bs=BeautifulSoup(req.content,'lxml')

            # Collect the job cards on the results page

            job_div=bs.find_all('div',class_='job_seen_beacon')

            # Extract the description snippet from each card


            for i in job_div:
                description=i.find_all('table',class_='jobCardShelfContainer')
                for j in description:
                    footer=j.find('tr',class_='underShelfFooter')
                    # a card without a footer row yields an empty description instead of an AttributeError
                    snippet=footer.find_all('div',class_='job-snippet') if footer else []
                    temp_str=''
                    for k in snippet:
                        for item in k.find_all('li'):
                            temp_str=temp_str+' '+item.text
                    Description.append(temp_str)
    
    return Description

# #############################################################

def job_scrap(titles):
    
    from bs4 import BeautifulSoup
    import requests
    titles_show=[]
    company=[]
    location=[]
    salary=[]
    string=''
    for title in titles:
        for i in [0,10,20]:
            link='https://in.indeed.com/jobs?q={}&start={}'.format(title,i)

            req=requests.get(link)
            bs=BeautifulSoup(req.content,'lxml')

            for j in bs.find_all('h2',class_='jobTitle'):
                string=j.text
                titles_show.append(string)


            for k in bs.find_all('span',class_='companyName'):
                company.append(k.text)

            for l in bs.find_all('div',class_='companyLocation'):
                location.append(l.text)


            job_div=bs.find_all('div',class_='job_seen_beacon')

            # Salary: last <span> in the first cell of each card's main table
            for card in job_div:
                table=card.find_all('table',class_='jobCard_mainContent')
                for t in table:
                    span=t.find('td').find_all('span')
                    salary.append(span[-1].text)
        
    
    return titles_show,company,location,salary





# ##################################################################################
# Link retrieval
def link_retriver(title):
    import requests
    from bs4 import BeautifulSoup
    append_link=[]
    # Retrieve all job links from each results page
    for pages in [0,10,20]:
        link='https://in.indeed.com/jobs?q={}&start={}'.format(title,pages)

        req=requests.get(link)
        bs=BeautifulSoup(req.content,'lxml')

        # Container that holds the job cards
        job_div=bs.find('div',id='mosaic-provider-jobcards')

        # for i in job_div:
        #     print(i.find_all('a'))
        
        final_link=''
        for i in job_div.find_all('a')[1:]:
            link='https://www.indeed.com'+i.get('href')
            final_link=final_link+'  '+link
        
        append_link.append(final_link)
    
    return append_link

# ##################################################################################
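
The scraping functions format each skill directly into the query string, and some skills (for example `machine learning` or `power bi`) contain spaces. As a minimal sketch, the search URLs could be built with the standard library's `urlencode` so that the query is encoded explicitly; `build_search_url` and `search_urls` are hypothetical helpers, and only the endpoint and the `q`/`start` parameters come from the code above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://in.indeed.com/jobs'   # same endpoint the functions above query

def build_search_url(skill, start):
    """Hypothetical helper: results-page URL for one skill and page offset."""
    return BASE_URL + '?' + urlencode({'q': skill, 'start': start})

def search_urls(skills, offsets=(0, 10, 20)):
    """One URL per (skill, offset) pair, in the order the scraper visits them."""
    return [build_search_url(skill, offset) for skill in skills for offset in offsets]

urls = search_urls(['machine learning', 'sql'])
print(urls[1])
```

Building the list of URLs once would also let the three functions share the same pages rather than each formatting its own links.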


In [15]:
title,company,location,salary=job_scrap(scrap_skills)
description=description_scrap(scrap_skills)



#### Storing Scraped Jobs in a DataFrame and Cleaning the Text

In [16]:
# JOB dataframe
import pandas as pd

Job_df=pd.DataFrame({'title':title,'company':company,'location':location,'salary':salary,'description':description})
Job_df.head()

| | title | company | location | salary | description |
|---|---|---|---|---|---|
| 0 | newTeam Lead - Data Analyst | ICICI Bank Ltd | Mumbai, Maharashtra | ₹15,00,000 a year | Provide end to end analytical support for t... |
| 1 | newMS SQL Developer | Dusane Infotech Pvt Ltd | Thane, Maharashtra | ₹25,000 - ₹40,000 a month | 1.Designing database tables and structures. 2... |
| 2 | newSQL Developer | MaaxtreeM | Hyderabad, Telangana | ₹19,413 - ₹88,072 a month | Experience : Freshers/experience of 1 to 5 ye... |
| 3 | newSQL Developer | d-insights pvt ltd | Mumbai, Maharashtra | ₹2,00,000 - ₹5,00,000 a year | Knowledge on data bases, preferably oracle/My... |
| 4 | SQL Developer | Premade Innovations Pvt Ltd | Pune, Maharashtra | ₹15,000 - ₹20,000 a month | Candidate must have proven years of experienc... |


In [89]:
# Clean a description: keep letters only, collapse repeated spaces, lowercase
def description_clean(text):
    import re
    text=re.sub('[^a-z A-Z]',' ',text)
    text=re.sub(' +',' ',text)
    return text.strip().lower()

# The salary column contains some junk values that are not salary amounts, so replace them

def salary_clean(text):
    if '₹' in text:
        return text
    else:
        return "Not Available"


In [90]:
Job_df['salary']=Job_df['salary'].apply(salary_clean)

#### Generating the Text to Vectorize
* Build the final cleaned text (title plus description) that will be vectorized with Word2Vec

In [18]:
Job_df['to_vectorize']=Job_df['title']+' '+Job_df['description']
Job_df['to_vectorize']=Job_df['to_vectorize'].apply(description_clean)

### Vectorization/Embedding of Resume and Job description

#### Count-vectorizing the job description text

In [20]:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
# recent scikit-learn versions accept 'english', a list, or None for stop_words
cnt=CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
X=cnt.fit_transform(Job_df['to_vectorize']).toarray() # count vectorizing

features=cnt.get_feature_names_out() # feature names (get_feature_names() was removed in scikit-learn 1.2)

job_vectors_df=pd.DataFrame(X,columns=features) # vectors dataframe

#### Word Embedding Using the Word2Vec Google News 300 Model

In [23]:
from gensim.models import KeyedVectors

# load_word2vec_format is a classmethod of KeyedVectors
wordvec=KeyedVectors.load_word2vec_format('D:\\IVY Batches\\ML DL NLP\\word2vec\\GoogleNews-vectors-negative300.bin.gz',binary=True)

#### Creating a DataFrame of Embedded Description Word Vectors
* Create a temporary DataFrame that stores the embedding result for each job description.

In [51]:
import numpy as np
import pandas as pd
wordvec_keys=wordvec.key_to_index # vocabulary dict of the word2vec model (O(1) membership tests)

rows=[] # one summed 300-d vector per job description
for i in range(job_vectors_df.shape[0]):
    word_bool=job_vectors_df.iloc[i,:].values!=0
    sent_vector=np.zeros(300)
    words=[val for val,bools in zip(features,word_bool) if bools]
    
    for word in words:
        if word in wordvec_keys:
            sent_vector=sent_vector+wordvec[word]
    
    rows.append(sent_vector)

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
embedd_vectors=pd.DataFrame(rows)
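
Each job description is embedded by summing the Word2Vec vectors of its in-vocabulary words and skipping the rest. The sketch below shows that idea on a toy three-dimensional vocabulary using plain Python lists; `sentence_vector` and `toy_vectors` are hypothetical names, and the numbers are made up rather than taken from the Google News model.

```python
def sentence_vector(words, vectors, dim):
    """Hypothetical helper: sum the vectors of known words, ignore unknown ones."""
    total = [0.0] * dim
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue                      # out-of-vocabulary word contributes nothing
        total = [a + b for a, b in zip(total, vec)]
    return total

# Toy embedding table (assumption): real Word2Vec vectors have 300 dimensions
toy_vectors = {'python': [1.0, 0.0, 0.5], 'sql': [0.0, 1.0, 0.5]}
doc_vec = sentence_vector(['python', 'sql', 'tableau'], toy_vectors, dim=3)
print(doc_vec)
```

A description whose words are all out of vocabulary ends up as an all-zero vector, which is worth keeping in mind when the similarity scores are computed.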
    

#### Vectorization/Embedding of Resume Text

In [60]:
count=CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
resume_vec_features=count.fit([text_to_vectorize]).get_feature_names_out()

resume_wordvec=np.zeros(300)

for i in resume_vec_features:
    if i in wordvec_keys:
        resume_wordvec=resume_wordvec+wordvec[i]

### Calculating Cosine Similarity Between Resume and Job Descriptions

In [80]:
from sklearn.metrics.pairwise import cosine_similarity

# Compare the resume vector with every job vector in one call;
# the result has shape (1, n_jobs), so flatten it into a single column.
scores=cosine_similarity(resume_wordvec.reshape(1,-1),embedd_vectors.values)
similarity=pd.DataFrame(scores.ravel()) # one similarity value per job, indexed 0..n_jobs-1
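
Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a minimal sketch of what is computed for each resume/job pair, the plain-Python version below uses made-up three-dimensional vectors; `cosine` is a hypothetical helper, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math

def cosine(u, v):
    """Hypothetical helper: cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0   # assumption: an all-zero vector has no similarity to anything
    return dot / (norm_u * norm_v)

resume = [1.0, 1.0, 0.0]                                      # toy resume vector
jobs = {'job_a': [2.0, 2.0, 0.0], 'job_b': [0.0, 0.0, 3.0]}   # toy job vectors
ranked = sorted(jobs, key=lambda name: cosine(resume, jobs[name]), reverse=True)
print(ranked)
```

Because cosine similarity ignores vector length, summing word vectors and averaging them give the same score, so a long job description is not favoured merely for containing more words.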

In [81]:
Job_df['Similarity_with_resume']=similarity # Appending Similarity to Job Dataframe for recommendation

### Showing Recommendations

In [91]:
Job_df.sort_values(by='Similarity_with_resume',ascending=False).head(10)[['title','company','location','salary','description']]

| | title | company | location | salary | description |
|---|---|---|---|---|---|
| 145 | newData Entry Operator | Three Ess Computer Services (I) Pvt. Ltd. | Mumbai, Maharashtra | ₹10,000 - ₹13,000 a month | Proficiency in data capturing and MS Office (... |
| 148 | Data Entry Operator | XLNC | Ghatkopar, Mumbai, Maharashtra | ₹13,800 - ₹18,500 a month | Job Role- KYC Document verification. \*We req... |
| 391 | newSenior Research Executive - Instrumental Ev... | L'Oreal | Mumbai, Maharashtra | Not Available | Exposure to any visualization tool (including... |
| 150 | newData Entry Operator | Growit India Private Limited (www.thegrowit.com) | Surat, Gujarat | ₹10,000 - ₹15,000 a month | Job Role- KYC Document verification. \*We req... |
| 191 | newInternational BPO | Epicenter Technologies Pvt. Ltd | Remote in Bhayandar, Mumbai, Maharashtra | ₹15,000 - ₹25,000 a month | Good typing ability (minimum 40 WPM). Minimum... |
| 425 | newHIRING Machine Learning_Permanent Work from... | Net Connect | Remote in Mumbai, Maharashtra | Not Available | Must have strong experience with statistical ... |
| 52 | newAssociate , Content Strategy & Analysis | Netflix | Mumbai, Maharashtra | Not Available | Candidate must have excellent subjective know... |
| 421 | Consultant/AM- AI-Machine Learning - Mumbai | KPMG | Mumbai, Maharashtra | Not Available | The Test Engineering Services Automation Deve... |
| 203 | newData Entry Operator | Growit India Private Limited (www.thegrowit.com) | Surat, Gujarat | ₹10,000 - ₹15,000 a month | Well versed with Google sheets & advance exce... |
| 205 | Executive Assistant To Leadership | NISA INDUSTRIAL SERVICES PVT. LTD. | Andheri, Mumbai, Maharashtra | ₹2,50,000 - ₹5,00,000 a year | Ability to read and write English. Desire to ... |
