Extracting skills from job descriptions
The function returns a nested dict (see NLP_skillNer_example).

!NOTE: Extraction is a very slow process. 500 jobs can take multiple hours.

skillNer raises errors on specific rows. These are handled with "return None" in the function below and with a try/except plus "continue" in the for loop further down.
If you want the dictionaries for each row as a column of the df, run the cell below with the function "skills".

In [1]:
# imports SKILLNER
import spacy
from spacy.matcher import PhraseMatcher

# load default skills data base
from skillNer.general_params import SKILL_DB # EMSI skills database 
# import skill extractor
from skillNer.skill_extractor_class import SkillExtractor

import pandas as pd 
import json
import numpy as np
import ast  

In [4]:
df_main = pd.read_csv("data/df_clean_for_token.csv")

In [5]:
df_main.head()

Unnamed: 0,index,description,date_time
0,0,"as the leader in cloud-managed it, cisco merak...",2023-08-02 03:00:13.054897
1,1,as a senior business analyst you will contribu...,2023-08-02 03:00:13.054897
2,2,overview:\n\namyx is seeking to hire a data an...,2023-08-02 03:00:13.054897
3,3,i am looking for someone to help me build an a...,2023-08-02 03:00:13.054897
4,4,position vacancy – data analyst to support the...,2023-08-02 03:00:13.054897


In [6]:
# NLP + skillner
nlp = spacy.load("en_core_web_lg")
# init skill extractor
skill_extractor = SkillExtractor(nlp, SKILL_DB, PhraseMatcher)

loading full_matcher ...
loading abv_matcher ...
loading full_uni_matcher ...
loading low_form_matcher ...
loading token_matcher ...


Function that extracts skills. It can be applied directly to a df column.

In [None]:
# # dictionaries for each row
# def skills(text):
#     try:
#         skills_ex = skill_extractor.annotate(text)
#         return(skills_ex)
#     except Exception as e:
#         return None

# # applies the function to the range of rows specified by indexing
# df_main['skills_ex'] = df_main['description_cleaned'][0:5].apply(skills) 

If you want a list of dictionaries, run the loop. Adjust the index range to fit your needs, but keep in mind how long it takes to process a large number of jobs.
The list of dictionaries can then be saved and exported, e.g. as a df.
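
Since a long run is easier to manage in pieces, the index range can be cut into fixed-size batches. The following is a minimal sketch of that idea; `batch_ranges` and the numbers used are hypothetical and only illustrate how start/stop pairs and matching file names could be generated.

```python
def batch_ranges(total, size):
    """Yield (start, stop) pairs that cover range(total) in steps of `size`."""
    for start in range(0, total, size):
        yield start, min(start + size, total)


# Hypothetical numbers: 1,250 descriptions processed 500 at a time.
for start, stop in batch_ranges(1250, 500):
    # One output file per batch, named after the rows it covers.
    print(f"rows {start}:{stop} -> data/skills_list_{start}_{stop}.csv")
```

Each pair can be used as the slice in the loop, so a failed or interrupted batch only has to be repeated for its own range.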

In [None]:
# Work around skillNer's bugs by catching the errors and moving on
# Loop through the rows of df.description_cleaned
counter = 0
# Collects the positions (relative to the start of the slice) of rows that raised an error such as "out of range",
# e.g., for [0:500] = [146, 171, 247, 249, 274, 285, 286, 355, 357, 438, 499]
index_list = []
skills_list = []


for text in df_main['description_cleaned'][30990:31000]:  # specify the index range
    try:
        skills_ex = skill_extractor.annotate(text)
        skills_list.append(skills_ex)
        counter = counter + 1  # skipped when annotate fails; the except branch increments instead
        print(counter)
    except Exception as e:
        print(f"Error processing entry at counter {counter}: {e}")
        index_list.append(counter)  # positions of rows that weren't processed
        counter = counter + 1
        continue

print(counter)
print(len(skills_list))
display(skills_list)
print(index_list)
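
The positions stored in index_list are counted from the start of the slice, not from the start of the df. A small sketch of how they could be mapped back to absolute row positions, assuming the slice was taken positionally on a default integer index; `to_absolute` is a hypothetical helper:

```python
def to_absolute(relative_positions, start):
    """Shift slice-relative positions by the slice start."""
    return [start + pos for pos in relative_positions]


# Example: positions 18 and 19 inside a slice that started at row 31500.
failed_rows = to_absolute([18, 19], start=31500)
print(failed_rows)
```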

There are two ways to continue from here.
1. work directly with skills_list.                              -- marked as WAY 1
2. save skills_list and export-import back as e.g. a dataframe  -- marked as WAY 2

WAY 1

Code Issues:
1. Unhashable Type Error:

    The initial error was caused by attempting to use a dictionary as a key in another dictionary (dict_frequency). Dictionaries are mutable objects and not hashable.

2. TypeError: Object of type int64 is not JSON serializable:

    This issue likely occurred when trying to serialize an int64 object to JSON. The int64 type is specific to NumPy and can cause issues when working with JSON.

3. Empty Dictionaries:

    Even after resolving the above issues, the dictionaries dict_hard and dict_soft remained empty, which could be due to mismatched keys between skills_list and SKILL_DB.
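
Issues 1 and 3 can be reproduced on a small scale with the standard library alone. The sketch below uses simplified stand-in records and a made-up lookup table (`records`, `skill_db` and the IDs are assumptions, not real skillNer output); issue 2 is left out because it needs NumPy values.

```python
import json

# Simplified stand-ins for annotation records.
records = [
    {"skill_id": "A1", "score": 1},
    {"score": 1, "skill_id": "A1"},  # same content, different key order
    {"skill_id": "B2", "score": 1},
]

# Issue 1: a dict cannot be used as a dictionary key.
try:
    {records[0]: 1}
except TypeError as err:
    print("TypeError:", err)

# Workaround: serialize each record to a JSON string with sorted keys
# and count the strings instead.
frequency = {}
for record in records:
    key = json.dumps(record, sort_keys=True)
    frequency[key] = frequency.get(key, 0) + 1
print(frequency)

# Issue 3: IDs that are missing from the lookup table produce no counts at all,
# so it helps to list them explicitly.
skill_db = {"A1": {"skill_name": "Data Analysis", "skill_type": "Hard Skill"}}
missing = [r["skill_id"] for r in records if r["skill_id"] not in skill_db]
print("IDs without a lookup entry:", missing)
```

Checking for missing IDs first is a quick way to tell whether empty result dictionaries come from a key mismatch or from the counting logic.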

In [None]:
# WAY 1
# the code above does not group related skills (e.g. we get analytics, analytical & data analysis separately)
# to group them we access skill_id in the output dict, count the IDs and translate them back to skill_name via SKILL_DB

# Extract values with the key 'skill_id' from each dictionary in the skills_list
skills_ids = []
for d in skills_list:
    full_matches = d.get('results', {}).get('full_matches', [])
    ngram_scored = d.get('results', {}).get('ngram_scored', [])
    for item in full_matches:
        skills_ids.append(item.get('skill_id'))
    for item in ngram_scored:
        skills_ids.append(item.get('skill_id'))
        
print(skills_ids)
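
The counting and translating step described in the comments above can be sketched with `collections.Counter`. The IDs and the small `skill_db` table are hypothetical and only mimic the shape of SKILL_DB as it is used in this notebook (an ID mapped to `skill_name` and `skill_type`).

```python
from collections import Counter

# Hypothetical IDs, as they might come out of the extraction loop.
skills_ids = ["KS1", "KS2", "KS1", "KS3", "KS1"]

# Hypothetical lookup table shaped like SKILL_DB.
skill_db = {
    "KS1": {"skill_name": "Data Analysis", "skill_type": "Hard Skill"},
    "KS2": {"skill_name": "Communication", "skill_type": "Soft Skill"},
}

# Count each ID, then translate known IDs to their names, most frequent first.
id_counts = Counter(skills_ids)
name_counts = {
    skill_db[sid]["skill_name"]: n
    for sid, n in id_counts.most_common()
    if sid in skill_db
}
print(name_counts)
```

IDs without an entry in the table ("KS3" here) are skipped rather than raising a KeyError.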

In [None]:
class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        return super(NumpyEncoder, self).default(obj)

# count how often each annotation dict occurs in skills_list
# (dicts are unhashable, so a JSON string of each dict serves as the key)
dict_frequency = {}
for i in skills_list:
    # Convert the dictionary to a JSON string using the custom encoder
    i_json = json.dumps(i, sort_keys=True, cls=NumpyEncoder)
    
    if i_json in dict_frequency:
        dict_frequency[i_json] += 1
    else:
        dict_frequency[i_json] = 1

print(dict_frequency)

In [None]:
# Initialize empty dictionaries to store counts of hard and soft skills
dict_hard = {}
dict_soft = {}

# Iterate over each item in the skills_list
for item in skills_list:
    # Extract the 'full_matches' list from the 'results' dictionary, default to an empty list if not present
    full_matches = item.get('results', {}).get('full_matches', [])
    
    # Iterate over each match in the 'full_matches' list
    for match in full_matches:
        # Extract the 'skill_id' from the match
        skill_id = match.get('skill_id')
        
        # Check if 'skill_id' exists
        if skill_id:
            # Retrieve 'skill_type' and 'skill_name' from SKILL_DB using 'skill_id'
            skill_type = SKILL_DB.get(skill_id, {}).get('skill_type')
            skill_name = SKILL_DB.get(skill_id, {}).get('skill_name')
            
            # Check if 'skill_type' is 'Hard Skill' and 'skill_name' exists
            if skill_type == 'Hard Skill' and skill_name:
                # Update count in 'dict_hard'
                dict_hard[skill_name] = dict_hard.get(skill_name, 0) + 1
            # If not a 'Hard Skill' and 'skill_name' exists
            elif skill_name:
                # Update count in 'dict_soft'
                dict_soft[skill_name] = dict_soft.get(skill_name, 0) + 1

# Sort dictionaries in descending order by count
sorted_dict_soft = dict(sorted(dict_soft.items(), key=lambda item: item[1], reverse=True))
sorted_dict_hard = dict(sorted(dict_hard.items(), key=lambda item: item[1], reverse=True))

# Display sorted dictionaries
display(sorted_dict_soft, sorted_dict_hard)

WAY 2 - import skills_list saved as dataframes from CSV files
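
Writing the list to CSV stores each nested dictionary as its string representation, so the `results` column comes back as text and has to be parsed before use. The sketch below shows that round trip with the standard `csv` module standing in for pandas, so it runs without the saved files; the row content is made up.

```python
import ast
import csv
import io

# Made-up row shaped like one entry of skills_list.
row = {
    "text": "sql and python",
    "results": {"full_matches": [{"skill_id": "KS1"}], "ngram_scored": []},
}

# Write the row to an in-memory CSV; the nested dict is stored as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "results"])
writer.writeheader()
writer.writerow(row)

# Read it back: 'results' is now a plain string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["results"]))

# ast.literal_eval turns the string back into a dict without executing code.
restored = ast.literal_eval(loaded["results"])
print(restored["full_matches"])
```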

In [None]:
# Export
# As going through job descriptions takes much time, it is better to split across multiple data frames

#skills_list_df = pd.DataFrame(skills_list)
#skills_list_df.to_csv("data/skills_list_31500_32022.csv") # index_list [18,19]
#skills_list_df.to_csv("data/skills_list_31000_31500.csv") # index list empty
#skills_list_df.to_csv("data/skills_list_29500_31000.csv") #[2, 98, 103, 261, 262, 282, 311, 448, 526, 758, 759, 883, 911, 926, 943, 955, 963, 1000]

In [None]:
# Import back
SKILLS1 = pd.read_csv("data/skills_list_31500_32022.csv")
SKILLS2 = pd.read_csv("data/skills_list_31000_31500.csv")  # check me later! repeating text "position summary what you ..."
SKILLS3 = pd.read_csv("data/skills_list_29500_31000.csv")


In [None]:
# Merge all data frames to one in order to extract the skills
SKILLS = pd.concat([SKILLS1, SKILLS2, SKILLS3], ignore_index=True)
SKILLS.shape

In [None]:
# requires "import ast" (see the imports at the top)
# the 'results' column is read back from CSV as a string, so it is parsed with ast.literal_eval

# Initialize an empty list to store skill_ids
skills_ids = []

# Iterate over each row in SKILLS DataFrame
for _, row in SKILLS.iterrows():
    # Safely evaluate the literals in 'results' column using ast.literal_eval
    results_dict = ast.literal_eval(row['results'])
    
    # Extract 'full_matches' and 'ngram_scored' from the evaluated dictionary
    full_matches = results_dict.get('full_matches', [])
    ngram_scored = results_dict.get('ngram_scored', [])
    
    # Iterate over full_matches
    for item in full_matches:
        skills_ids.append(item.get('skill_id'))
    
    # Iterate over ngram_scored
    for item in ngram_scored:
        skills_ids.append(item.get('skill_id'))

# Display the first few skill_ids
print(skills_ids[:5])

In [None]:
class NumpyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        return super(NumpyEncoder, self).default(obj)

# Count how often each 'results' dict occurs in the SKILLS DataFrame
# (dicts are unhashable, so a JSON string of each dict serves as the key)
dict_frequency = {}

# Iterate over each row in SKILLS DataFrame
for _, row in SKILLS.iterrows():
    # Safely evaluate the literals in 'results' column using ast.literal_eval
    results_dict = ast.literal_eval(row['results'])
    
    # Convert the dictionary to a JSON string using the custom encoder
    results_json_str = json.dumps(results_dict, sort_keys=True, cls=NumpyEncoder)
    
    if results_json_str in dict_frequency:
        dict_frequency[results_json_str] = dict_frequency[results_json_str] + 1
    else:
        dict_frequency[results_json_str] = 1

display(dict_frequency)

In [None]:
# Initialize empty dictionaries to store counts of hard and soft skills
dict_hard = {}
dict_soft = {}

# Iterate over each row in the SKILLS DataFrame
for _, row in SKILLS.iterrows():
    # Safely evaluate the literals in 'results' column using ast.literal_eval
    results_dict = ast.literal_eval(row['results'])
    
    # Extract the 'full_matches' list from the 'results' dictionary, default to an empty list if not present
    full_matches = results_dict.get('full_matches', [])
    
    # Iterate over each match in the 'full_matches' list
    for match in full_matches:
        # Extract the 'skill_id' from the match
        skill_id = match.get('skill_id')
        
        # Check if 'skill_id' exists
        if skill_id:
            # Retrieve 'skill_type' and 'skill_name' from SKILL_DB using 'skill_id'
            skill_type = SKILL_DB.get(skill_id, {}).get('skill_type')
            skill_name = SKILL_DB.get(skill_id, {}).get('skill_name')
            
            # Check if 'skill_type' is 'Hard Skill' and 'skill_name' exists
            if skill_type == 'Hard Skill' and skill_name:
                # Update count in 'dict_hard'
                dict_hard[skill_name] = dict_hard.get(skill_name, 0) + 1
            # If not a 'Hard Skill' and 'skill_name' exists
            elif skill_name:
                # Update count in 'dict_soft'
                dict_soft[skill_name] = dict_soft.get(skill_name, 0) + 1

# Sort dictionaries in descending order by count
sorted_dict_soft = dict(sorted(dict_soft.items(), key=lambda item: item[1], reverse=True))
sorted_dict_hard = dict(sorted(dict_hard.items(), key=lambda item: item[1], reverse=True))

# Display sorted dictionaries
display(sorted_dict_soft, sorted_dict_hard)


Creating a skill list based on the extracted skills

In [None]:
# add info hard/ soft skill to skill count data
# first convert to DataFrame:
df_soft = pd.DataFrame(list(sorted_dict_soft.items()), columns=['skill', 'count'])
df_hard = pd.DataFrame(list(sorted_dict_hard.items()), columns=['skill', 'count'])
df_soft.head()

In [None]:
# add type of skill inside the data frames in order to concat
df_soft['type'] = 'Soft Skill'
df_hard['type'] = 'Hard Skill'

# concat df_soft & df_hard to create one complete dataframe
df_type = pd.concat([df_hard, df_soft])
df_type.head()

In [None]:
df_type["type"].value_counts() # hard skills outnumber soft skills by a factor of more than 16


In [None]:
# ideas for later, not relevant currently
# preparing for an NLP stemmer that has problems when there are multiple words per entry
# raw string so that "\(" is passed to the regex engine as an escaped bracket
df_type_brackets = df_type["skill"].str.contains(r"\(")
df_type2 = df_type[df_type_brackets]
display(df_type2)

In [None]:
df_type.head()

In [None]:
# ideas for later, not relevant currently
# splitting the strings in skill into separate words
df_type['word_list'] = df_type['skill'].str.split()
display(df_type[12:21])

In [None]:
# remove skills that appeared only once
df_skills_top = df_type.query('count >= 2').copy()
# drop count because it is the overall count, which includes multiple mentions of the same skill inside the same job description
df_skills_top.drop(['count'], axis=1, inplace=True)
# remove brackets and the abbreviations inside them, plus the whitespace left in front of the bracket
df_skills_top['skill_clean'] = df_skills_top['skill'].str.split('(').str[0].str.strip()
# convert to lower case as description is also lower case
df_skills_top['skill_clean'] = df_skills_top['skill_clean'].str.lower()

display(df_skills_top)

In [None]:
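# strip a trailing "s" as a crude singular form (note: this also shortens words such as "analysis")
df_skills_top['skill_clean'] = df_skills_top['skill_clean'].str.replace('s$', '', regex=True)
# check the result for the communication-related skills
df_skills_top[df_skills_top['skill_clean'].str.contains('communicat', case=False)]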
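
Removing every trailing "s" also cuts words that are not plurals. The sketch below compares the plain regex with a slightly more careful variant; `strip_plural` and its list of protected endings are a hypothetical heuristic, not a complete singularization rule.

```python
import re


def strip_plural(skill):
    """Drop one trailing 's' unless the word ends in 'ss', 'us' or 'is'."""
    skill = skill.strip()
    if re.search(r"(ss|us|is)$", skill):
        return skill
    return re.sub(r"s$", "", skill)


# Left: plain regex as used above. Right: the guarded variant.
for s in ["dashboards", "data analysis", "business"]:
    print(s, "->", re.sub(r"s$", "", s), "|", strip_plural(s))
```

A stemmer or lemmatizer would handle more cases, at the cost of an extra dependency.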

This section is not finished yet

Running a skill list through job descriptions

In [None]:
# For this section we use a skill list that was not extracted above, but comes from a different data set
# later that skill list and the skill list produced in this notebook will be combined
# import the "other" skills list

In [None]:
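# PorterStemmer comes from nltk (the import was missing)
from nltk.stem import PorterStemmer

# make a list of all skills
# df_top_skills is the "other" skills list mentioned in the cell above; it is not loaded in this notebook yet
skill_list_top = df_top_skills.skill.tolist()

# stemmer
ps = PorterStemmer()

# print each skill next to its stem
for w in skill_list_top:
    print(w, " : ", ps.stem(w))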
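
Running a skill list through the descriptions could look roughly like the sketch below. It assumes lower-case descriptions and lower-case skills, as prepared earlier in this notebook; `find_skills` and the sample text are hypothetical, and this is only one simple way to do the matching.

```python
import re


def find_skills(description, skills):
    """Return the skills that occur in `description` as whole words or phrases."""
    found = []
    for skill in skills:
        # Lookarounds keep "sql" from matching inside "nosql" and
        # "r" from matching inside other words.
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, description):
            found.append(skill)
    return found


description = "we need strong sql skills and experience with data analysis."
print(find_skills(description, ["sql", "data analysis", "r", "nosql"]))
```

`re.escape` matters for skill names that contain characters such as "+" or ".", which would otherwise be read as regex syntax.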


In [None]:
# uploading
# load to database
# df exp1 as "jobs_current_skills_timeline"
# df_skill_count as "skill_count_current"

from dotenv import load_dotenv
load_dotenv()

# write dataset into database

# Import get_engine from sql_functions.py. You will need to restart your kernel and rerun at this point since we changed the module since we first imported it.
from sql_functions import get_engine
# create a variable called engine using the get_engine function
engine = get_engine()

import psycopg2

table_name = "skill_count_current"
schema = 'capstone_datacvpro'

# Write records stored in a dataframe to SQL database
if engine is not None:
    try:
        df_skill_count.to_sql(name=table_name, # Name of SQL table variable  
                        con=engine, # Engine or connection
                        schema=schema, # your class schema variable
                        if_exists='replace', # Drop the table before inserting new values 
                        index=False, # Do not write the DataFrame index as a column
                        chunksize=5000, # Specify the number of rows in each batch to be written at a time
                        method='multi') # Pass multiple values in a single INSERT clause
        print(f"The {table_name} table was imported successfully.")
    # Error handling
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        engine = None
else:
    print('No engine')