# Homework #1

In this homework you will analyze job descriptions from a number of different fields. The idea is that these job descriptions might contain both jargon words and jargon phrases.

The challenge here will be to analyze the text of the included job descriptions, but to also compare the words and phrases there with a reference set. In this case, we will use Reuters news articles as a background corpus to compare our possible jargon text with.

This homework will require that you read in the text of the job descriptions and then tokenize them. You will then need to take the tokens and compare them to the Reuters corpus, both as individual tokens and as bigrams.

You need not look at the frequency of the terms. We are aiming for just term differences, so simply reporting back the tokens that appear only in the job descriptions will be sufficient. One key thing to consider here is what kind of tokens you want to report. For example, the job descriptions might contain numbers and other noise. Generally, you would not want to report back numbers. You might also want to consider lowercasing the tokens so that "Data" and "data" are treated as the same term.
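As a rough illustration of the comparison described above, here is a minimal sketch on toy data. The helper names `keep_token` and `unique_terms` and the sample token lists are made up for this example and are not part of the homework code; the sketch only assumes that both corpora have already been tokenized.

```python
def keep_token(token):
    """Keep purely alphabetic tokens; this drops numbers and punctuation."""
    return token.isalpha()


def unique_terms(target_tokens, background_tokens):
    """Return lowercased target terms that never appear in the background."""
    target = {t.lower() for t in target_tokens if keep_token(t)}
    background = {t.lower() for t in background_tokens if keep_token(t)}
    # set difference: terms seen in the target but not in the background
    return target - background


# toy stand-ins for the job description and Reuters tokens
job_tokens = ["Tableau", "experience", "5", "years", "Salesforce", "."]
news_tokens = ["Years", "of", "experience", "in", "1987"]

jargon_candidates = unique_terms(job_tokens, news_tokens)
print(sorted(jargon_candidates))  # ['salesforce', 'tableau']
```

The same set difference works for bigrams, since tuples of tokens can be stored in a set just like single tokens.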

If you'd like you can also try to stem or lemmatize the text.
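A minimal sketch of the stemming option, assuming NLTK's `PorterStemmer` (which needs no extra data download). The helper name `stem_tokens` is made up for this example. Lemmatizing with `WordNetLemmatizer` would follow the same pattern, but it requires the WordNet data to be downloaded first.

```python
from nltk.stem import PorterStemmer

# one stemmer instance can be reused for every token
stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Reduce each token to its Porter stem so inflected forms collapse together."""
    return [stemmer.stem(token) for token in tokens]


print(stem_tokens(["running", "cats"]))  # ['run', 'cat']
```

If you stem the job description tokens, apply the same step to the Reuters tokens; otherwise every stemmed form that differs from its surface form would look unique to the job descriptions.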

The code has been built around NLTK, but you could just as easily do this with spaCy.

In [1]:
import nltk.data
from os import listdir
from os.path import isfile, join
from nltk.util import bigrams 
from nltk.tokenize import TreebankWordTokenizer

In [2]:
# here we will import necessary libraries for using NLTK

# check = nltk.download('punkt')
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
treebank_tokenizer = TreebankWordTokenizer()



In [3]:
dir_base = "C:/Users/Tanaya/PycharmProjects/s20_ds_nlp/homeworks/homework_1/data"


####
# Notice: We are reusing code from class notes... remember these kind of building blocks
####

def read_file(filename):
    # the context manager closes the file handle once the text is read
    with open(filename, encoding='utf-8') as input_file:
        return input_file.read()

    
def read_directory_files(directory):
    file_texts = []
    files = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in files:
        file_text = read_file(join(directory, f))
        file_texts.append({"file":f, "content": file_text })
    return file_texts
    
# here we will generate the list that contains all the files and their contents
text_corpus = read_directory_files(dir_base)

In [4]:
###
# You will need to work on filling out the content of this method. 
###
from nltk.tokenize import TreebankWordTokenizer
from nltk.corpus import stopwords
import string
from nltk.util import ngrams


def process_description(job_description, n_gram = False):

    # take the job description text, and tokenize it
    treebank_tokenizer = TreebankWordTokenizer()
    tokens = treebank_tokenizer.tokenize(job_description)
    
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    
    # remove punctuation from each word
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]

    # remove stop words (build the set once instead of once per token)
    stop_words = set(stopwords.words('english'))
    tokens_sw = [word for word in tokens if word not in stop_words]

    # you could also remove numbers and other noise tokens here too
    token_alp = [word for word in tokens_sw if word.isalpha()]
#     print(token_alp)

    # also, you might generate bigrams here as well
    if n_gram:
        # ngrams() returns a generator, so wrap it to return a list of tuples
        token_alp = list(ngrams(token_alp, 2))
        #     print(bigrams)
        #     freq_bi = nltk.FreqDist(bigrams)

        #     freq_bi.plot(20)
    
    # the later function assumes you are returning a list of terms
    return token_alp



In [5]:
# This loop will simply apply your method to all the job descriptions
all_job_description_words = []
for job_description in text_corpus:
    job_description = job_description["content"]
    all_job_description_words.extend(process_description(job_description))

In [6]:
from nltk.corpus import reuters
num_docs = len(reuters.fileids())
#  this has a large number of files... 
# you might wish to limit the number of documents
# you use while developing your technique 
# ex. reuters.fileids()[0:25]
print(num_docs)
all_reuters_words = []

# this iterates over every document in the corpus;
# while developing you might limit it to a slice,
# e.g. reuters.fileids()[0:25] or reuters.fileids()[0:250]
for doc_id in reuters.fileids(): 
    # reuters_text holds the raw text of the news article
    reuters_text = reuters.open(doc_id).read()
    # run the same processing method used for the job descriptions
    # and add the output to the all_reuters_words list
    try:
        all_reuters_words.extend(process_description(reuters_text))
    except Exception:
        # skip any document that fails to process
        pass


10788


In [7]:
# here you will want to find ways to compare the words in the job 
# descriptions and the reuters text
# you might consider using Python's set capabilities to intersect things
# also, you might just iterate over the job description words to see if 
# they are in the reuters word list

# all_job_description_words
# print("all_job_description_words", all_job_description_words)
# all_reuters_words
# print("all_reuters_words", all_reuters_words)

# intersection
print("Common description between reuters and job descriptions", set(all_reuters_words).intersection(set(all_job_description_words)))


# not common 
print("job descriptions not present in the reuters", set(all_job_description_words) - set(all_reuters_words))


Common description between reuters and job descriptions {'system', 'identity', 'according', 'called', 'broad', 'first', 'official', 'returning', 'perception', 'doctors', 'positive', 'daily', 'pay', 'confidence', 'needed', 'software', 'etc', 'annually', 'provides', 'encouraged', 'pressure', 'dominion', 'contribute', 'information', 'processing', 'assistance', 'everything', 'growing', 'plans', 'keen', 'realistic', 'make', 'societies', 'freely', 'psychological', 'encourage', 'path', 'compensated', 'entirely', 'asking', 'note', 'analytics', 'savings', 'department', 'provider', 'assisting', 'locate', 'protects', 'businesses', 'older', 'closing', 'following', 'quickly', 'directly', 'want', 'persons', 'assure', 'inconsistent', 'pass', 'keller', 'gather', 'define', 'customized', 'designated', 'arithmetic', 'overseeing', 'appetite', 'commercial', 'advancing', 'laboratory', 'healthcare', 'oral', 'solution', 'commissions', 'united', 'blanks', 'timely', 'option', 'show', 'duty', 'staffing', 'orient

In [8]:
## Bigrams

In [9]:
# This loop will simply apply your method to all the job descriptions
all_job_description_words = []
for job_description in text_corpus:
    job_description = job_description["content"]
#     print(job_description)
    # pass a real boolean; the string "TRUE" only worked because it is truthy
    all_job_description_words.extend(process_description(job_description, n_gram=True))

In [10]:
from nltk.corpus import reuters
num_docs = len(reuters.fileids())
#  this has a large number of files... 
# you might wish to limit the number of documents
# you use while developing your technique 
# ex. reuters.fileids()[0:25]
print(num_docs)
all_reuters_words = []

# this iterates over every document in the corpus;
# while developing you might limit it to a slice,
# e.g. reuters.fileids()[0:25] or reuters.fileids()[0:250]
for doc_id in reuters.fileids(): 
    # reuters_text holds the raw text of the news article
    reuters_text = reuters.open(doc_id).read()
    # run the same processing method, this time asking for bigrams,
    # and add the output to the all_reuters_words list
    try:
        all_reuters_words.extend(process_description(reuters_text, n_gram=True))
    except Exception:
        # skip any document that fails to process
        pass


10788


In [11]:
# all_job_description_words
# print("all_job_description_words", all_job_description_words)
# all_reuters_words
# print("all_reuters_words", all_reuters_words)

print("bigrams from job description not present in reuters", set(all_job_description_words) - set(all_reuters_words))

bigrams from job description not present in reuters {('offices', 'disseminate'), ('university', 'years'), ('company', 'north'), ('reports', 'followsup'), ('work', 'requires'), ('excellence', 'care'), ('recruiter', 'information'), ('multiple', 'locations'), ('members', 'competitive'), ('establish', 'compassionate'), ('challenges', 'tough'), ('term', 'care'), ('develop', 'elements'), ('priorities', 'inside'), ('changeorders', 'subcontractors'), ('working', 'environment'), ('success', 'discover'), ('skills', 'fear'), ('new', 'referral'), ('attitudes', 'empower'), ('network', 'build'), ('officer', 'basic'), ('qualified', 'applicants'), ('unique', 'earned'), ('comfortable', 'collaborating'), ('job', 'description'), ('assigned', 'team'), ('driven', 'entrepreneurial'), ('creating', 'maintaining'), ('experience', 'nursing'), ('training', 'feedback'), ('insurance', 'vision'), ('settle', 'less'), ('resources', 'disposal'), ('game', 'win'), ('site', 'logistics'), ('apple', 'take'), ('league', 'nu

# Analysis of your results

Below this cell, please put a short writeup of your approach and comments on your results. The goal here is to explain how well you think your method worked based on looking at some of your output data. Additionally, please describe things you might do differently or ways in which you might improve the process if you were given more time.

My approach is to break the job descriptions into tokens and convert all tokens to lower case. I then clean the token list by removing punctuation, stop words, and any token that is not purely alphabetic, such as numbers. The same process is applied to the Reuters data, and I then compare the tokens from the job descriptions with the tokens from Reuters. Many of the tokens that do not occur in Reuters are jargon, for example glucose, salesforce, and tableau. However, the list still contains many words that are not jargon, so the method needs further refinement. I therefore decided to look at bigrams as well: the bigrams generated from Reuters are compared with those from the job descriptions to identify jargon phrases. I made a similar observation with the bigrams: those that do not occur in Reuters are likely to be jargon.

With more time, I could also have compared the data against another academic corpus of different job descriptions, which would give a better match for the jargon present in each domain. I would also look at how often the tokens or bigrams occur to judge whether they are really jargon.
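A minimal sketch of that frequency check, on toy data. This was not part of the submitted solution; the helper names `relative_frequencies` and `rank_by_frequency_ratio` and the `smoothing` constant are assumptions made for illustration. The idea is to rank each term by how much more often it appears in the job descriptions than in the background corpus, rather than only checking whether it is absent.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to its share of all tokens in the list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def rank_by_frequency_ratio(target_tokens, background_tokens, smoothing=1e-6):
    """Order target terms from most to least over-represented versus the background."""
    target_freq = relative_frequencies(target_tokens)
    background_freq = relative_frequencies(background_tokens)
    # smoothing avoids division by zero for terms missing from the background
    scores = {
        term: freq / (background_freq.get(term, 0.0) + smoothing)
        for term, freq in target_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# toy stand-ins for the processed job description and Reuters tokens
job_tokens = ["tableau", "tableau", "data", "team", "data"]
news_tokens = ["data", "team", "team", "market"]

print(rank_by_frequency_ratio(job_tokens, news_tokens))
# ['tableau', 'data', 'team']
```

Terms near the top of the ranking are better jargon candidates than terms that merely happen to be missing from Reuters once, and the same functions accept bigram tuples in place of single tokens.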