# Job Description Analysis ('Bag-of-Words') Single Word Approach

##### This guide describes a methodology for analyzing job descriptions. It can be used to:
+ Compare the technologies used by companies in the same sector or by direct competitors.
+ Count the companies that are hiring for specific technologies to gain an understanding of 'market share'. For example: how many companies are hiring for AWS vs. Azure?
+ Identify specific companies that are hiring for technologies that are not widely used.
+ Identify trending technologies, and classify companies as early adopters or technology laggards.

In [13]:
# Import Libraries 
import string
from nltk.corpus import stopwords
import datetime
import pandas as pd
import numpy as np
import collections
import os
from xml.etree import ElementTree as ET

# set stopwords.  After installing the nltk library, you can modify the stopwords text file to add or 
# remove additional stop words as desired.  For this tutorial, I am using the library defaults.
stopwords = set(stopwords.words('english'))

# Visual control for the purpose of tutorial display
pd.set_option('display.max_rows', 500)

# Get the Specific Job Descriptions You Want to Analyze

### Step 1:  Get hashes for the records you want to analyze

###### Get the hashes you want to analyze by querying `job_records`. For example, with SQL you can use the following query:
```mysql
SELECT job_records.hash
FROM job_records
JOIN fs_company_identifiers ON job_records.company_id = fs_company_identifiers.company_id
WHERE 
    fs_company_identifiers.primary_flag = True AND 
    fs_company_identifiers.end_date IS NULL AND
    isin IN ('US0378331005', 'US5949181045', 'US0231351067')
INTO OUTFILE '/hashes.csv'
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```
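
The exported file then needs to be read back into Python for the filtering in Step 2. The following is a minimal sketch, assuming the query wrote one quoted hash per line with no header row. The helper name `load_hashes` and the sample hash values are illustrative and not part of the original workflow.

```python
import csv
import io


def load_hashes(handle):
    """Read the first column of each non-empty row into a set of hashes."""
    reader = csv.reader(handle)
    return {row[0].strip() for row in reader if row and row[0].strip()}


# Stand-in for open('/hashes.csv', newline=''); these hash values are made up.
sample = io.StringIO('"abc123"\n"def456"\n"abc123"\n')
hashes = load_hashes(sample)
print(sorted(hashes))
```

A set is used because Step 2 only tests membership, and duplicates in the export carry no extra information.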

### Step 2: Load job descriptions into a database and filter for the hashes you are interested in.

##### Caution: We recommend using a production-grade database instead of flat CSV files in production. This guide is only meant to communicate the methodology.


In [None]:
# This converts the XML descriptions to CSV. I will be working with descriptions locally as flat CSV files.
# It is highly recommended that you replace this step by loading the data into a database that fits
# your internal systems (SQL, Solr, MongoDB, etc.).

xml_dir = "../linkup_job_descriptions/xml/"
csv_dir = "../linkup_job_descriptions/csv/"
files = os.listdir(xml_dir)

# Hashes exported in Step 1 (one hash per line, no header row)
hashes = set(pd.read_csv('/hashes.csv', header=None)[0])

for file in files:
    # Each child of the root is assumed to hold the hash first and the description second
    root = ET.parse(os.path.join(xml_dir, file)).getroot()
    df = pd.DataFrame([(job[0].text, job[1].text) for job in root], columns=['jobHash', 'jobDesc'])

    # Filter for hashes you want from your universe
    df = df[df['jobHash'].isin(hashes)].copy()

    # Ensure all jobDesc are type string
    df['jobDesc'] = df['jobDesc'].astype(str)

    # Text cleanup: replace newlines, strip punctuation, lowercase, and split into words
    df['jobDesc'] = df['jobDesc'].str.replace('\n', ' ')
    df['jobDesc'] = df['jobDesc'].str.translate(str.maketrans('', '', string.punctuation)).str.lower().str.split()
    # Keep each word once per description so Step 3 counts descriptions, not occurrences
    df['jobDesc'] = df['jobDesc'].apply(lambda x: list(dict.fromkeys(x)))

    # Remove stopwords
    df['jobDesc'] = df['jobDesc'].apply(lambda x: [item for item in x if item not in stopwords])
    df['jobDesc'] = df['jobDesc'].apply(lambda x: ' '.join(x))

    # Export one CSV per input file
    df.to_csv(os.path.join(csv_dir, os.path.splitext(file)[0] + '.csv'), index=False)
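
To make the cleanup steps easier to follow, here is a minimal sketch that applies the same sequence to a single string without pandas. The function name `clean_description`, the small stand-in stop-word set, and the example text are assumptions for illustration; the cell above uses nltk's English stop-word list.

```python
import string

# Tiny stand-in stop-word set; the notebook cell uses nltk's English list instead.
STOPWORDS = {"a", "and", "the", "with", "in", "of", "is"}


def clean_description(text, stop_words=STOPWORDS):
    """Lowercase, strip punctuation, de-duplicate words, and drop stop words."""
    text = text.replace("\n", " ")
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    unique_words = list(dict.fromkeys(text.split()))  # keeps first-seen order
    return " ".join(w for w in unique_words if w not in stop_words)


example = "Experience with Python and SQL.\nPython is required, AWS a plus!"
cleaned = clean_description(example)
print(cleaned)
```

One consequence of stripping punctuation is that tokens such as "c++" and "c#" both collapse to "c", so this single-word approach cannot tell them apart.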


### Step 3: Combine all output files from previous step and do word counts with threshold

##### Caution: We recommend using a production-grade database instead of flat CSV files in production. This guide is only meant to communicate the methodology.

In [None]:
csv_dir = "../linkup_job_descriptions/csv/"
files = os.listdir(csv_dir)

# Combine all csv files together into one DataFrame
df_full = pd.concat([pd.read_csv(os.path.join(csv_dir, file)) for file in files], ignore_index=True)

# Count the number of job descriptions that each word appears in
# (words were de-duplicated within each description in Step 2)
df_full['jobDesc'] = df_full['jobDesc'].astype(str)
counts = collections.Counter(word for desc in df_full['jobDesc'] for word in desc.split())

# Filter out words that appear in fewer than min_threshold descriptions. Adjust based on use case.
min_threshold = 1000
counts = {word: n for word, n in counts.items() if n >= min_threshold}

# Optional conversion to DataFrame - can be left as a dictionary.
# Written without a header row so it can be read back with names=['word', 'count'].
counts = pd.DataFrame(list(counts.items()), columns=['word', 'count'])
counts.to_csv('counts.csv', index=False, header=False)
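
The counting step can be checked on a toy input. This is a minimal sketch, assuming each string is an already-cleaned description in which every word appears at most once; the three sample descriptions and the threshold of 2 are made up for illustration.

```python
import collections

# Each string stands for one cleaned description (unique words only).
cleaned = ["python sql aws", "python azure", "sql python excel"]

# Because words are unique within a description, this counts descriptions per word.
doc_counts = collections.Counter(word for desc in cleaned for word in desc.split())

min_threshold = 2  # toy value; the cell above uses 1000
frequent = {word: n for word, n in doc_counts.items() if n >= min_threshold}
print(frequent)
```

Without the de-duplication in Step 2, the same code would count total occurrences instead, and a description that repeats a word many times would inflate that word's total.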

### Step 4: Examine Results

##### The results below apply this methodology to the Russell 3000.

In [23]:
# load Results
counts = pd.read_csv('/Users/iflath/DataFiles/raw_daily_2019-10-24/Russel 3k/counts.csv', names = ['word','count'])
counts = counts.sort_values(['count'], ascending = False)

In [35]:
# Search for specific key words
search_words = ['python','r','data','analytics','javascript','mongodb','sql','aws','azure']
counts[counts['word'].isin(search_words)].sort_values(['count'], ascending = False)

| | word | count |
|---|---|---|
| 1557 | data | 203320 |
| 3760 | analytics | 42682 |
| 3313 | sql | 26255 |
| 1571 | python | 23195 |
| 1592 | aws | 18327 |
| 3103 | javascript | 12073 |
| 6491 | azure | 8843 |
| 7342 | r | 6957 |
| 5815 | mongodb | 2136 |
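
As one way to read these numbers, the sketch below turns the counts for competing technologies into relative shares, using the AWS and Azure figures from the output above. The helper name `relative_share` is illustrative. These are counts of job descriptions rather than companies, so the result is only a rough proxy for the 'market share' question raised in the introduction.

```python
# Counts copied from the results above.
counts = {"aws": 18327, "azure": 8843}


def relative_share(counts, words):
    """Return each word's fraction of the combined count for the given words."""
    total = sum(counts[w] for w in words)
    return {w: counts[w] / total for w in words}


share = relative_share(counts, ["aws", "azure"])
for word, fraction in share.items():
    print(f"{word}: {fraction:.1%}")
```

A description that mentions both technologies is counted once for each, so the shares describe mentions and do not partition the descriptions.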


### Optional: A faster method for parsing descriptions when working with a small subset of records or a small set of keywords.

##### This Command Line script will do the following:

1. Loop line by line through xml file
2. Use regular expressions to grab the hash and the description
3. Search each description for a specific keyword

##### This is intended to be starter code for fast processing and analysis of the XML job descriptions file, not final production code.

In [None]:
import argparse
import csv
import re
import sys
import time

# Add CLI args
parser = argparse.ArgumentParser()

parser.add_argument('--inputFile', help='the XML descriptions file to read')
parser.add_argument('--outputFile', help='the CSV file to print the filtered descriptions')
parser.add_argument('--keyword', help='the keyword to search descriptions for')

args = parser.parse_args()

if not args.inputFile or not args.outputFile:
    print("Must specify both an input and output file.")
    sys.exit(1)

# None when --keyword is omitted, in which case every description is written
keyword = args.keyword

start = time.time()
count = 0
# newline="" lets the csv module control line endings (Python 3)
with open(args.inputFile) as file, open(args.outputFile, "w", newline="") as outFile:
    jobHash = ''
    jobDesc = ''
    inJobElement = False
    writer = csv.writer(outFile)
    for line in file:
        if "<job>" in line and "</job>" not in line:
            inJobElement = True
        elif "</job>" in line:
            # A closing tag may share a line with the next opening tag
            inJobElement = "<job>" in line
        elif inJobElement:
            if "<hash>" in line:
                jobHash = re.match(r"\s*<hash>(.*)<", line).group(1)
            elif "<description>" in line:
                jobDesc = re.match(r"\s*<description><..CDATA.(.*)]]><", line).group(1)

        if jobHash and jobDesc:
            # Write the record if no keyword was given or the keyword appears in the description
            if not keyword or keyword in jobDesc:
                writer.writerow([jobHash, jobDesc])
                count = count + 1
            # Reset for the next job whether or not this one matched
            jobHash = ''
            jobDesc = ''

print(count)
end = time.time()
print(end - start)
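
To see what the two regular expressions capture, here is a minimal sketch that runs them against a few sample lines. The XML layout shown (one element per line, with the description wrapped in CDATA) is an assumption inferred from the patterns in the script, and the sample hash and description are made up.

```python
import re

# Same patterns as the script above, compiled once.
HASH_RE = re.compile(r"\s*<hash>(.*)<")
DESC_RE = re.compile(r"\s*<description><..CDATA.(.*)]]><")

sample_lines = [
    "<job>",
    "  <hash>abc123</hash>",
    "  <description><![CDATA[Python developer with AWS experience]]></description>",
    "</job>",
]

job_hash = job_desc = ""
for line in sample_lines:
    if "<hash>" in line:
        job_hash = HASH_RE.match(line).group(1)
    elif "<description>" in line:
        job_desc = DESC_RE.match(line).group(1)

print(job_hash, "|", job_desc)
```

Because the script reads one line at a time and `.` does not match a newline, a description that spans several lines would not match the description pattern, and `re.match` would return `None`. A guard for that case is worth adding before using this on real files.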