# Context
Goal of project is to classify resumes (not to grade resumes). Therefore, goal of preprocessing is to ensure that the text are properly normalized such that they can be properly compared. Priority should therefore be given to keywords etc. due to the specialized nature of each class.

# Data Exploration

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import os
from collections import Counter
from nltk.stem import WordNetLemmatizer
from nltk.metrics.distance import edit_distance as levenshteinDistance

from typing_extensions import Literal

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


In [2]:
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jscho\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\jscho\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\jscho\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\jscho\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
VAR = {
    'data_path': os.path.join('UpdatedResumeDataSet_T1_7.csv'),
    'batch_size': 32,
}

In [4]:
res_data_raw = pd.read_csv(VAR['data_path'])

In [5]:
res_data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9595 entries, 0 to 9594
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  9595 non-null   object
 1   Resume    9595 non-null   object
dtypes: object(2)
memory usage: 150.1+ KB


In [6]:
res_data_raw.head(5)

Unnamed: 0,Category,Resume
0,Data Science,qwtnrvduof Education Details \nMay 2013 to May...
1,Data Science,"qwtnrvduof Areas of Interest Deep Learning, Co..."
2,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
3,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
4,Data Science,"SKILLS C Basics, IOT, Python, MATLAB, Data Sci..."


### Cleaning

In [7]:
res_data_raw['Resume'].duplicated(keep='first').sum() #Check for duplicates

9407

In [8]:
res_data = res_data_raw.drop_duplicates(subset=['Resume'])

In [9]:
res_data['Category'].value_counts()

Category
Java Developer               14
Data Science                 12
HR                           12
Database                     11
Advocate                     10
DotNet Developer              8
Hadoop                        8
DevOps Engineer               8
Business Analyst              8
Testing                       8
Civil Engineer                7
SAP Developer                 7
Health and fitness            7
Python Developer              7
Arts                          7
Automation Testing            7
Electrical Engineering        6
Sales                         6
Network Security Engineer     6
ETL Developer                 6
Mechanical Engineer           5
Web Designing                 5
Blockchain                    5
Operations Manager            4
PMO                           4
Name: count, dtype: int64

### Explore Resume Text

In [10]:
sample_res = res_data['Resume'][0]
print(sample_res)

qwtnrvduof Education Details 
May 2013 to May 2017 BbNTGBqLmkKE   UIT-RGPV
Data Scientist 

Data Scientist - Matelabs
Skill Details 
Python- Exprience - Less than 1 year months
Statsmodels- Exprience - 12 months
AWS- Exprience - Less than 1 year months
Machine learning- Exprience - Less than 1 year months
Sklearn- Exprience - Less than 1 year months
Scipy- Exprience - Less than 1 year months
Keras- Exprience - Less than 1 year monthsCompany Details 
company - Matelabs
description - ML Platform for business professionals, dummies and enthusiastsckeKJOFvWQ
60/A Koramangala 5th block,
Achievements/Tasks behind sukh sagar, Bengaluru,
India                               Developed and deployed auto preprocessing steps of machine learning mainly missing value
treatment, outlier detection, encoding, scaling, feature selection and dimensionality reductionqunsOBcUdT
Deployed automated classification and regression modelRYNOolXhuV
linkedinSAJhwmUxoOcom/in/aditya-rathore-
b4600b146                

Initial Review:
Some words are seemingly gibberish and consists of a sequence of random characters

These words should be removed. However, care must be taken to ensure that other important text such as links are not classified as gibberish

List of issues:
Broken links (Solved)
Long whitespaces (Solved)
Combined words without clear separators https://github.com/grantjenks/python-wordsegment

In [11]:
sample_res

'qwtnrvduof Education Details \nMay 2013 to May 2017 BbNTGBqLmkKE   UIT-RGPV\nData Scientist \n\nData Scientist - Matelabs\nSkill Details \nPython- Exprience - Less than 1 year months\nStatsmodels- Exprience - 12 months\nAWS- Exprience - Less than 1 year months\nMachine learning- Exprience - Less than 1 year months\nSklearn- Exprience - Less than 1 year months\nScipy- Exprience - Less than 1 year months\nKeras- Exprience - Less than 1 year monthsCompany Details \ncompany - Matelabs\ndescription - ML Platform for business professionals, dummies and enthusiastsckeKJOFvWQ\n60/A Koramangala 5th block,\nAchievements/Tasks behind sukh sagar, Bengaluru,\nIndia                               Developed and deployed auto preprocessing steps of machine learning mainly missing value\ntreatment, outlier detection, encoding, scaling, feature selection and dimensionality reductionqunsOBcUdT\nDeployed automated classification and regression modelRYNOolXhuV\nlinkedinSAJhwmUxoOcom/in/aditya-rathore-\nb46

In [12]:
def clean_links(potentialLinks: list):
    
    '''
    Assumption: Potential link will always have at the minimum a .com
    
    Checks validity of link and returns cleaned link string
    '''
    
    assert isinstance(potentialLinks, list)
    
    http_exist = False
    www_exist = False
    com_exist = False
    
    if len(potentialLinks) < 1:
        return []
    
    ret_list = []
    
    for link in potentialLinks:
        
        http_match = re.search(r'(https?)(:)?(\/){0,2}', link)
        www_match = re.search(r'(www)(\.)?', link)
        com_match = re.search(r'(\.)?(com)', link)
        # print('flagged', link)
        
        #http
        if http_match != None:
            http_exist = True
        
        #www
        if www_match != None:
            www_exist = True
        
        #com
        if com_match != None:
            com_exist = True
            
        if (com_exist) or (com_exist and www_exist) or (com_exist and www_exist and http_exist):
            link = re.sub(r'(https?)(:)?(\/){0,2}', 'https://', link)
            link = re.sub(r'(www)(\.)?', 'www.', link)
            link = re.sub(r'(\.)?(com)', '.com', link)
            
            ret_list.append(link)
        else:
            #Not valid link
            ret_list.append(False)
            
    return ret_list

In [13]:
def clean_raw_text(text: str):
    
    # Clean links section
    potential_links = re.findall(
        r'(?:(?:https?:?\/\/{1,2})?w{1,3}\.?)?[a-zA-z0-9]{1,2048}\.?[a-zA-Z0-9]{1,6}\/\b[/\-a-zA-Z0-9]*\w', text
    ) 
    '''
    / will flag a sequence of characters as potential links
    
    Optional criteions: 
    http(s)
    //
    www & .
    . & com
    '''
    
    finalized_links = clean_links(potential_links)

    for potential_link, finalized_link in zip(potential_links, finalized_links):
        if finalized_link == False:
            continue
        else:
#             print('real_links', finalized_link)
            text = re.sub(potential_link, ' ', text) #Remove link
    
    #Clean non-characters
    text = re.sub(r'[^a-zA-Z0-9]', r' ', text)
    
    #Normalize text
    text = text.lower()

    #Clean whitespace section
    text = re.sub(r'[ ]{1,}', r' ', text)
    
    return text

clean_raw_text(sample_res)

'qwtnrvduof education details may 2013 to may 2017 bbntgbqlmkke uit rgpv data scientist data scientist matelabs skill details python exprience less than 1 year months statsmodels exprience 12 months aws exprience less than 1 year months machine learning exprience less than 1 year months sklearn exprience less than 1 year months scipy exprience less than 1 year months keras exprience less than 1 year monthscompany details company matelabs description ml platform for business professionals dummies and enthusiastsckekjofvwq 60 a koramangala 5th block achievements tasks behind sukh sagar bengaluru india developed and deployed auto preprocessing steps of machine learning mainly missing value treatment outlier detection encoding scaling feature selection and dimensionality reductionqunsobcudt deployed automated classification and regression modelrynoolxhuv b4600b146 reasearch and deployed the time series forecasting model arima sarimax holt winter and prophetiqmadshiyn worked on meta feature

In [14]:
def check_numpy(text):
    
    if isinstance(text, list):
        text = np.array(text)
        return text
    elif isinstance(text, np.ndarray):
        return text
    else:
        raise TypeError('Not a list or numpy array')

In [15]:
def in_english_corpus(text: list | np.ndarray, behaviour: Literal['inside', 'outside'] = 'inside'):
    
    text = check_numpy(text)

    english_dictionary = nltk.corpus.words.raw().split('\n')

    english_dictionary = [word.lower() for word in english_dictionary] # normalize to lowercase
    
    word_in_dict_bool = np.isin(text, english_dictionary)
    
    if behaviour == 'inside':
        return text[word_in_dict_bool]
    elif behaviour == 'outside':
        word_not_in_dict_bool = np.invert(word_in_dict_bool)
        return text[word_not_in_dict_bool]
    else:
        return None

In [16]:
def clean_structured_text(text: list | np.ndarray, customer_dictionary: list = nltk.corpus.words.raw().split('\n')):
    
    #TODO There may be no point to cleaning mistyped random words > Intefere with keywords > Model may have to simply learn the noise

    text = check_numpy(text)
    
    customer_dictionary = [word.lower() for word in customer_dictionary] # normalize to lowercase
    
    word_in_dict_bool = np.isin(text, customer_dictionary)
    
    word_not_in_dict_bool = np.invert(word_in_dict_bool)
    
    
    
    words_in_dict = text[word_not_in_dict_bool]
    
    print(words_in_dict)

# clean_structured_text(sample_lemmas)

In [17]:
def wordnet_tag_format(tag: str):
    if tag.startswith('N'):
        return 'n'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('A'):
        return 'a'
    if tag.startswith('R'):
        return 'r'
    
    return 'n' #Ensure lemmatize function can run

In [18]:
def extract_lemmas(tagged_tokens: list[tuple], lemmatizer=nltk.stem.WordNetLemmatizer()):
    lemmas = [lemmatizer.lemmatize(token[0], wordnet_tag_format(token[1])) for token in tagged_tokens]
    
    return lemmas

--- Start Analysis ---
### Goal:

With reference to a list of possible english words, we aim to separate keywords important to job scopes and misspelled/invalid words. With the list of misspelled/invalids words, we can find the closest possible related word using Levenshtein Distance.

### Theory:

Since most misspelled words with random number of combinations occuringly more than once has a very small probability it is more likely that we will see keywords occur more frequently compared to misspelled words and characters of random sequence.

In [19]:
def extract_common_words_from_raw_data_ood(resumes_df: pd.DataFrame, column: str):
    
    resumes = resumes_df[column].to_numpy()
    resumes = check_numpy(resumes)

    lemmatizer = WordNetLemmatizer()
    
    counter = Counter()
    
    for index, resume in enumerate(resumes):
        normalized_resume = extract_lemmas(
            nltk.pos_tag(
                nltk.tokenize.word_tokenize(
                    clean_raw_text(resume))), 
            lemmatizer)
        
        # out of dictionary
        ood = in_english_corpus(normalized_resume, 'outside')
        counter.update(ood)
        
        if index % 10 == 0:
            print(index)
        
    return counter

In [20]:
count = extract_common_words_from_raw_data_ood(res_data, 'Resume')
print(count)

0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
160
170
180
Counter({'exprience': 633, '1': 334, 'database': 328, 'sql': 238, 'maharashtra': 230, 'le': 224, '6': 211, 'ltd': 190, 'pune': 156, 'software': 151, 'monthscompany': 129, 'pvt': 119, 'mumbai': 119, 'automation': 114, '2': 98, 'mysql': 87, '3': 80, 'hadoop': 77, '2008': 74, '4': 72, 'etc': 71, '2016': 68, '2012': 66, 'javascript': 66, '5': 65, 'jquery': 65, 'html': 64, 'windows': 62, '2017': 61, '7': 56, 'linux': 55, '24': 55, 'coordinate': 50, 'troubleshoot': 50, 'etl': 49, '2014': 49, 'co': 45, 'hana': 44, 'website': 44, '2015': 43, 'microsoft': 42, '2018': 39, 'monitoring': 39, '10': 38, 'ajax': 38, '8': 37, 'cs': 37, '2010': 36, '2013': 35, 'hr': 35, 'online': 34, 'informatica': 34, '12': 33, 'qa': 33, 'nagpur': 32, 'firewall': 32, 'sqoop': 29, 'automate': 28, 'api': 28, 'mvc': 28, 'db': 28, 'xp': 27, 'dr': 27, 'hdfs': 27, '2011': 26, 'jsp': 26, 'unix': 26, 'erp': 25, 'com': 25, '9': 24, 'ssc': 24, 'tata': 24, '11g': 2

#### Output Analysis

Clearly misspelled words like "exprience" occur most frequently, and sequence of seemingly random characters "bbntgbqlmkkeckekjofvwq" appeared 5 times.

Conversely, keywords like "mozilla" which may be important to Web Designers only appeared once. Other important keywords like "tensorflow" and "scikit" only appears 5 times, the same as "bbntgbqlmkkeckekjofvwq". I therefore hypothesise that words such as "bbntgbqlmkkeckekjofvwq" occurring is not based on chance due to the miniscule probability. There is thus no apparent clear threshold/boundary between misspelled/noise words and keywords.

--- End Analysis ---

In [21]:
def pipeline(filepath: str, feature_name: str):
    
    def total_normalize(text):
        text = clean_raw_text(text)
        text_tag = nltk.pos_tag(
            nltk.word_tokenize(text)
        )
        text_lemmas = extract_lemmas(text_tag)
        
        return ' '.join(text_lemmas)
    
    df = pd.read_csv(filepath)
    # df = df.drop_duplicates(subset=[feature_name], keep='first')
    df[feature_name] = df[feature_name].apply(total_normalize)
    
    return df

In [22]:
processed_resumes = pipeline('UpdatedResumeDataSet_T1_7.csv', feature_name='Resume')
processed_resumes['Resume'][2]

'skill r python sap hana tableau sap hana sql sap hana pal m sql sap lumira c linear program data model advance analytics scm analytics retail analytics social medium analytics nlp education detail january 2017 to january 2018 pgdm business analytics great lake institute of management illinois institute of technology january 2013 bachelor of engineering electronics and communication bengaluru karnataka new horizon college of engineering bangalore visvesvaraya technological university data science consultant consultant deloitte usi skill detail linear program exprience 6 month retail exprience 6 month retail marketing exprience 6 month scm exprience 6 month sql exprience le than 1 year month deep learn exprience le than 1 year month machine learn exprience le than 1 year month python exprience le than 1 year month r exprience le than 1 year monthscompany detail company deloitte usi description the project involve analyse historic deal and come with insight to optimize future dealsbntgbq

In [23]:
processed_resumes.to_csv('cleanedResumes.csv', index=False)

In [24]:
print(processed_resumes['Category'].value_counts())

Category
Java Developer               839
Testing                      699
DevOps Engineer              549
Python Developer             479
Web Designing                449
HR                           439
Hadoop                       419
Blockchain                   399
ETL Developer                399
Operations Manager           399
Data Science                 399
Sales                        399
Mechanical Engineer          399
Arts                         359
Database                     329
Electrical Engineering       299
Health and fitness           299
PMO                          299
Business Analyst             279
DotNet Developer             279
Automation Testing           259
Network Security Engineer    249
SAP Developer                239
Civil Engineer               239
Advocate                     199
Name: count, dtype: int64


In [25]:
processed_resumes

Unnamed: 0,Category,Resume
0,Data Science,qwtnrvduof education detail may 2013 to may 20...
1,Data Science,qwtnrvduof area of interest deep learn control...
2,Data Science,skill r python sap hana tableau sap hana sql s...
3,Data Science,education detail mca ymcaust faridabad haryana...
4,Data Science,skill c basic iot python matlab data science m...
...,...,...
873,Testing,skill set o window xp 7 8 8bntgbqlmkk1 10 data...
874,Testing,good logical and analytical skill positive att...
878,Testing,personal skill quick learner eagerness to lear...
1540,DevOps Engineer,core skill project program management agile sc...


# OUT OF CODE

In [49]:
from nltk.corpus import wordnet as wn

In [50]:
wn.synsets('prophetiqmadshiyn')

[]

In [51]:
sample_lemmas

NameError: name 'sample_lemmas' is not defined

In [None]:
lemmatizer.lemmatize('deployed', 'v')

In [None]:
sample_tag

In [None]:
# Watch Cell
# print(clean_raw_text(sample_res))
# clean_raw_text(sample_res)
# nltk.corpus.words.raw().split('\n')
# np.isin(['detail'], nltk.corpus.words.raw().split('\n'))]
def x():
    x_dict = nltk.corpus.words.raw().split('\n')
    x_list = [levenshteinDistance('sklearn', word) for word in x_dict]
    
    id = x_list.index(min(x_list))
    print(id)
    print(min(x_list))
    print(nltk.corpus.words.raw().split('\n')[id])
    
x()

# lemmatizer.lemmatize('extracting')

In [None]:
from nltk.metrics.distance import edit_distance as test
test('aws', 'reductionqunsobcudt', substitution_cost=1)