## Extract Keywords

This notebook explores the most efficient way to extract keywords from a resume.

The end result of this notebook is implemented in `keyword_extraction.py`.

In [4]:
import re # For regex
import PyPDF2 # For pdfs
import random # For sampling
import glob # For multiple files
import keyword_extraction


# Opening PDF file and cleaning text.

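# Note: PdfFileReader, numPages, getPage and extractText are the legacy PyPDF2 names.
# PyPDF2 3.0 replaced them with PdfReader, len(reader.pages), reader.pages[0] and extract_text().
with open('./SamplePDF/Sample_Resume.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    print("# of pages:", pdfReader.numPages)

    # Read the text of the first page before the file is closed.
    pageObj = pdfReader.getPage(0)
    text = pageObj.extractText()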

# Clean spaces and handle special characters.
text = re.sub(r"\s\s+", " ", text).lower()
text = text.replace("ó","\"").replace("ò","\"").replace("¥!", "").replace("õ","'").replace("\n"," ")

# Apply regex to get list of words.
reg_list = (re.split(r'[^(\w+(\.+|@)\w+)]|[()]', text))

# Drop the empty strings that re.split produces between consecutive delimiters.
# A better regex could make this step unnecessary, but the current one works.
key_list = list(filter(None,reg_list))
final_text = " ".join(key_list)

# print(text)
print(final_text)

# of pages: 1
ekrem guzelyel b.s. m.s in computer science 2020 ekremguzelyel@gmail.com linkedin.com in ekrem guzelyel 224 830 1998 chicago il 60642 education illinois institute of technology chicago expected dec ember 2020 co terminal degree in computer science b.s. m.s. joint degree specialization in computational intelligence minor in applied mathematics courses taken elem. linear algebra discrete structures data structures algorithms data mining probability statistics machine learning deep learning database organizations gpa 3. 47 4.00 work experience projects machine learning lab at iit undergraduate research assistant chicago il october 2018 present research data categorization for transparent text classification summarize preprocess movie reviews to desired format. train a convolutional neural network cnn model to classify text document s. automate b ack propagat ion to identify reasoning for the decisions using multiple models. google engineering practicum intern ep mountain vie
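
The cell above notes that a better regex could remove the filtering step. A minimal sketch of one option, assuming the intent is to keep runs of word characters plus `+`, `.`, `|` and `@` (which is what the character class in the split pattern effectively preserves): match the tokens directly with `re.findall` instead of splitting on everything else. The `tokenize` name is hypothetical and is not part of `keyword_extraction.py`.

```python
import re

# Characters kept inside a token: word characters plus "+", ".", "|" and "@".
# This mirrors what the split-based pattern in the cell above leaves intact.
TOKEN_RE = re.compile(r"[\w+.|@]+")


def tokenize(text):
    """Return the tokens of `text` without producing empty strings."""
    return TOKEN_RE.findall(text)


sample = "ekrem guzelyel (b.s./m.s.) ekremguzelyel@gmail.com 224-830-1998"
print(tokenize(sample))
```

Because `findall` only returns non-empty matches, no `filter(None, ...)` pass is needed afterwards.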

### Helper words, keywords
Move these into a separate folder as text files later on.
- See below for an easier way to load them.

In [5]:
# Action Words
# Taken from https://www.thebalancecareers.com/list-of-resume-and-cover-letter-keywords-2060287

with open(r"./keywords/action_key.txt", "r") as fa:
    action_words = fa.read()

In [6]:
action_set = set(re.split(', |\n+', action_words.lower())) # another way is '\W+'

In [7]:
# Sample from action_set.
# random.sample needs a sequence (sets are rejected since Python 3.11), so sort first.
for word in random.sample(sorted(action_set), 10):
    print(word)

performed
multiplied
judged
experienced
welcomed
assembled
drafted
built
updated
measured


In [8]:
text_set = set(key_list)

In [9]:
len(text_set), len(action_set)

(332, 267)

In [10]:
len(text_set&action_set)

10

In [11]:
text_set.intersection(action_set)

{'analyzed',
 'created',
 'expected',
 'implemented',
 'launched',
 'observed',
 'organized',
 'ranked',
 'tested',
 'trained'}

In [12]:
# Open keywords for "Computer Skills"
with open(r"./keywords/comp_sci_key.txt", "r") as f:
    computer_skills = f.read()

In [13]:
comp_skill_set = set(re.split(r'\W+', computer_skills.lower()))
random.sample(sorted(comp_skill_set), 10)

['based',
 'analyze',
 'testing',
 'reporting',
 'aros',
 'analytics',
 'logical',
 'optimizing',
 'machine',
 'third']

In [14]:
# Soft Skills
with open(r"./keywords/soft_skills.txt", "r") as fs:
    soft_skills = fs.read()

In [15]:
soft_skill_set = set(re.split(r'\W+', soft_skills.lower()))
random.sample(sorted(soft_skill_set), 10)

['energy',
 'relationships',
 'communication',
 'monitoring',
 'etiquette',
 'influential',
 'managing',
 'conversations',
 'storytelling',
 'adaptable']

__An even easier way__

In [16]:
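import os  # For file names

keyword_dic = {}
stop_words = ['with', 'to', 'of', 'the', 'a', 'and', 'on']

list_of_files = glob.glob('./keywords/*.txt')  # all keyword files
for file_name in list_of_files:
    # Category name = file name without directory and extension, e.g. "action_key".
    category = os.path.splitext(os.path.basename(file_name))[0]

    with open(file_name, 'r') as FI:
        keys = FI.read()

    # Split into lowercase words and drop the stop words.
    keyword_dic[category] = set(re.split(r'\W+', keys.lower())).difference(stop_words)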

In [17]:
for i in keyword_dic:
    print(i, random.sample(sorted(keyword_dic[i]), 5))

action_key ['set', 'utilized', 'illustrated', 'headed', 'invested']
hard_skills ['legal', 'manufacturing', 'engineering', 'carpentry', 'bookkeeping']
soft_skills ['social', 'independent', 'deal', 'writing', 'solving']
comp_sci_key ['technologies', 'languages', 'coding', 'aros', 'interfaces']
legal_key ['real', 'court', 'litigation', 'criminology', 'notarization']


In [18]:
## Trying new keyword_extraction.py
new_dic = keyword_extraction.import_keys()
print(new_dic.keys())

new_resume_keys, new_resume_text = keyword_extraction.import_resume()
print(new_resume_text[:15])

Keywords imported.
dict_keys(['action_key', 'hard_skills', 'soft_skills', 'comp_sci_key', 'legal_key'])
Resume found.
# of pages: 1
Resume successfully extracted
ekrem guzelyel 


## Comparing resume with keywords

-> _This is left for another notebook_

Keyword references:

Action Words, hard, soft skills, CS - https://www.thebalancecareers.com/list-of-resume-and-cover-letter-keywords-2060287

Legal - https://aneliteresume.com/resume-writing/keywords-are-key-law/

------
## The Idea

Create a panel that shows, as a percentage, how strong a candidate is.
- Find the average number of keywords occurring in a resume.
    - e.g. 15 matches from action_key is an excellent result. Show it as a green pie chart.
- Extract email, education, major
- Find matches with skills.
    - Return buzz words like Research, Machine Learning, Deep Neural Networks, Marketing, Java...
- Find companies that match.
    - Machine Learning can be implemented in this part. 
    - Label marketing, negotiation, sales as 1 and coding, python as 0; then try to predict which category the applicant belongs to.
- Give volunteer score.
    - Again, keywords like "helped", "organized", "free" should do.
- Find extracurricular score.
    - Brainstorm. A keyword list for extracurricular actions might work.
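
A rough sketch of the first few bullets, under stated assumptions: the keyword sets below are hypothetical stand-ins for `keyword_dic`, the resume text is made up, and `category_matches`, `extract_emails` and the email pattern are illustrative names rather than anything in `keyword_extraction.py`. The email regex is deliberately simple and will not cover every valid address.

```python
import re

# Hypothetical keyword sets; in the notebook these would come from keyword_dic.
KEYWORDS = {
    "action_key": {"analyzed", "created", "trained", "tested"},
    "soft_skills": {"communication", "adaptable"},
}

# Simplified email pattern (an assumption, not a full RFC 5322 matcher).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def category_matches(tokens, keywords):
    """Return, per category, the sorted keywords that appear among the tokens."""
    token_set = set(tokens)
    return {cat: sorted(token_set & words) for cat, words in keywords.items()}


def extract_emails(text):
    """Return every substring of `text` that looks like an email address."""
    return EMAIL_RE.findall(text)


resume = "trained a cnn model, tested it, contact jane.doe@example.com"
tokens = re.findall(r"[\w.@]+", resume)

matches = category_matches(tokens, KEYWORDS)
scores = {cat: len(found) for cat, found in matches.items()}
print(matches)
print(scores)
print(extract_emails(resume))
```

The per-category counts in `scores` are the numbers a panel or pie chart could be built on.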
    
After the evaluations, make recommendations based on the candidate's scores (for both recruiters and candidates).
- E.g. "Your extracurricular activities seem insufficient. Do something!" (for candidates)
- "This applicant is a good match for the 'Software Engineering Internship' position." (for recruiters)
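
One possible shape for the recommendation step, as a sketch only: the cutoff of 3 matches, the position keyword sets, and the function names `recommend` and `best_position` are all assumptions made for illustration.

```python
def recommend(scores, minimum=3):
    """Turn per-category match counts into short notes for the candidate.

    `minimum` is an assumed cutoff below which a category is flagged.
    """
    notes = []
    for category, count in sorted(scores.items()):
        if count < minimum:
            notes.append(f"{category}: only {count} matches, consider strengthening this area")
        else:
            notes.append(f"{category}: {count} matches, looks solid")
    return notes


def best_position(candidate_keywords, positions):
    """Return the position whose keyword set overlaps most with the candidate's."""
    return max(positions, key=lambda name: len(candidate_keywords & positions[name]))


# Hypothetical data for illustration.
scores = {"extracurricular": 1, "action_key": 15}
positions = {
    "Software Engineering Internship": {"python", "java", "coding"},
    "Marketing Internship": {"marketing", "sales"},
}

for note in recommend(scores):
    print(note)
print(best_position({"python", "coding", "research"}, positions))
```

A plain overlap count is the simplest matching rule; a trained classifier, as suggested in the list above, could replace `best_position` later.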