<h2>Keyword Extraction - TF-IDF</h2>

Author: Joshua White

This notebook contains the code used to extract keywords from a pandas Series for model creation in my CSCE 799 Experiment and my CSCE 623 Project. The example input file, 'nyc-jobs-cleaned.csv', has already had some pre-processing applied to the 'job description' column, and we will use TF-IDF to extract keywords from this column.

Source: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.XpRvAVNKi8U

In [1]:
#Start out with some imports 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import re

#Set up the pandas frame on an already preprocessed file
DFrame = pd.read_csv('training_set_1.csv')

In [2]:
#The column conversion below is old code, but the function stays because generate_keywords uses it later.
#Right now our 'job_description' column is a column of lists, and we just want it 
#to be a column of strings. We will transform it here
def convert_to_text(text):
    return ' '.join(text)

#Now we will make a new column called 'jd_text' and run the function on it
#DFrame['jd_text'] = DFrame['job_description'].apply(lambda x: convert_to_text(x))
#DFrame['jd_text'].head(10)

In [3]:
#Now we will create a vocabulary of words from the 'processed_text' column.
#max_df = 0.85 ignores words that appear in more than 85% of the documents
cv = CountVectorizer(max_df = 0.85)
word_count_vector = cv.fit_transform(DFrame['processed_text'])

Now we can take a look at how many words we are working with. 

In [4]:
word_count_vector.shape #The second number will be the size of the vocabulary

(1328, 5483)

If we wanted to limit the number of words we are working with, we could set max_features = <size> in the CountVectorizer call, which keeps only the <size> most frequent words. Below we set it to 10,000. Because the vocabulary here has only 5,483 words, this cap does not remove anything yet.
  
The cv.fit_transform(...) call creates the vocabulary and returns a document-term matrix, which is what we are looking for. Each row in this matrix represents a document (in this example a document is just the job description text), each column represents a word in the vocabulary, and each value is a word count. A count is 0 if the word does not appear in the corresponding document.
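
To make the layout of that matrix concrete, here is a minimal hand-rolled sketch in plain Python. It is not what CountVectorizer does internally, and the three toy documents are made up for illustration. It only shows the "one row per document, one column per word" idea:

```python
from collections import Counter

# Hypothetical toy documents (not from the dataset)
docs = ["plumbing engineer design", "engineer design design", "water engineer"]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocab = sorted({w for doc in tokenized for w in doc})

# One row per document, one column per word, values are word counts
matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

The zeros are the words that do not appear in a given document, which is why the real matrix is stored in a sparse format.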

If you wanted, you could also filter out stop words in the CountVectorizer call by passing stop_words = <list_of_stop_words> as a parameter.

Remember that after this the matrix will have one row per document (1,328 here) and at most max_features columns.
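
Here is a small sketch of how the three options discussed above change the vocabulary. The toy documents and the `vocab` helper are assumptions for illustration only. The parameters themselves (max_df, stop_words, max_features) are the real CountVectorizer arguments:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (not from the dataset)
docs = ["plumbing engineer design",
        "engineer design design",
        "water engineer water"]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    cv = CountVectorizer(**kwargs)
    cv.fit(docs)
    return sorted(cv.vocabulary_)

all_terms = vocab()                            # no filtering
no_common = vocab(max_df=0.85)                 # drops words in more than 85% of documents
no_stop = vocab(stop_words=["design"])         # drops the listed stop words
capped = vocab(max_df=0.85, max_features=2)    # keeps the 2 most frequent remaining words

print(all_terms, no_common, no_stop, capped, sep="\n")
```

In this toy corpus 'engineer' appears in every document, so max_df = 0.85 removes it, and max_features = 2 then keeps the two words with the highest total counts among what is left.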

In [5]:
cv = CountVectorizer(max_df = 0.85, max_features = 10000)
word_count_vector = cv.fit_transform(DFrame['processed_text'])

Now we can take a quick look at some of the words in our vocabulary to make sure it makes sense with the following code. 

In [6]:
list(cv.vocabulary_.keys())[:10]

['plumbing',
 'engineer',
 'please',
 'read',
 'posting',
 'carefully',
 'make',
 'certain',
 'meet',
 'minimum']

Now we are going to compute the IDF values. We are going to take the matrix we made from CountVectorizer and generate the IDF. 

In [7]:
#Imports
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
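
For reference, with smooth_idf=True scikit-learn documents the IDF of a term t as ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. The sketch below computes that formula by hand on a made-up, already tokenized corpus. It is an illustration of the formula, not a replacement for the fitted transformer above:

```python
import math

# Hypothetical toy corpus, already tokenized (not from the dataset)
docs = [["plumbing", "engineer", "design"],
        ["engineer", "design", "design"],
        ["water", "engineer"]]
n = len(docs)

# Document frequency: in how many documents does each term appear?
df = {}
for doc in docs:
    for term in set(doc):
        df[term] = df.get(term, 0) + 1

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

for term in sorted(idf):
    print(term, round(idf[term], 3))
```

A word that appears in every document ('engineer' here) gets the lowest possible weight of 1.0, while rarer words get larger weights.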

Now that we have the IDF we can compute the TF-IDF value for a "document" (whatever we point it at), and get the vector of TF-IDF scores. Then we sort the words in the vector in descending order of TF-IDF values and iterate over it one more time to extract the top-n keywords for each "document" (in this case it's the job description text). 
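
As a rough sketch of what the TF-IDF score for a single document looks like, assuming the settings shown in the transformer output above (norm='l2', sublinear_tf=False): each word count is multiplied by the word's IDF, and the resulting vector is scaled to unit length. The IDF values and the document below are made up for illustration:

```python
import math

# Hypothetical IDF weights and one tokenized document (not from the dataset)
idf = {"design": 1.29, "engineer": 1.0, "plumbing": 1.69}
doc = ["engineer", "design", "design"]

# Raw term count times IDF for every word in the document
raw = {t: doc.count(t) * idf[t] for t in set(doc)}

# L2-normalize so the document vector has length 1
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}

# Words in descending order of TF-IDF score
top = sorted(tfidf, key=tfidf.get, reverse=True)
print(top)
```

Words with a high count in the document and a high IDF end up at the front of the list, which is exactly what we want from a keyword.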

First we need to define some functions. I'm just going to copy these from the article:

In [8]:
# This function sorts the values in the vector while preserving the column index
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# This function gets the feature names and tf-idf score of top n items
def extract_topn_from_vector(feature_names, sorted_items, topn=20):
    #use only topn items from vector
    sorted_items = sorted_items[:topn]
 
    score_vals = []
    feature_vals = []
    
    # word index and corresponding tf-idf score
    for idx, score in sorted_items:
        
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])
 
    #build a dictionary of feature -> score
    results = dict(zip(feature_vals, score_vals))
    
    return results
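
To see what the two functions above do together, here is a self-contained sketch on made-up data. The (column index, score) pairs stand in for what zip(coo_matrix.col, coo_matrix.data) yields, and the feature names and scores are hypothetical:

```python
# Hypothetical toy data: (column index, tf-idf score) pairs for one document
feature_names = ["design", "engineer", "plumbing", "water"]
pairs = [(0, 0.31), (1, 0.12), (2, 0.80), (3, 0.31)]

# Same sort key as sort_coo: score first, column index as tie-breaker, descending
ranked = sorted(pairs, key=lambda x: (x[1], x[0]), reverse=True)

# Keep the top 3 and map indices back to words, like extract_topn_from_vector
top3 = {feature_names[idx]: round(score, 3) for idx, score in ranked[:3]}
print(top3)
```

Note the tie between 'design' and 'water': because the sort is reversed on the whole (score, index) tuple, the word with the higher column index comes first.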

Now we will try to get the top-n terms from each job description. 

In [9]:
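# you only need to do this once; this maps each column index to its word
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use it instead)
feature_names=cv.get_feature_names_out()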
 

def generate_keywords(text):
    #generate tf-idf for the given document
    tf_idf_vector=tfidf_transformer.transform(cv.transform([text]))
    
    #sort the tf-idf vectors by descending order of scores
    sorted_items=sort_coo(tf_idf_vector.tocoo())
    
    #extract only the top n; n here is 20
    keywords = extract_topn_from_vector(feature_names, sorted_items, 20)
    
    # keywords is a dictionary of word -> score; we just want the words as one string, so take the keys.
    # Source: https://stackoverflow.com/questions/16819222/how-to-return-dictionary-keys-as-a-list-in-python
    list_of_keywords = list(keywords.keys())
    list_of_keywords = convert_to_text(list_of_keywords)
    return list_of_keywords

DFrame['keywords'] = DFrame['processed_text'].apply(generate_keywords)
DFrame['keywords'].head(10)

0    plumbing engineering design project phase cons...
1    homeless aside housing set rental restriction ...
2    distribute equipment supply stock mail materia...
3    metal welding wearing burning mechanic face eq...
4    investment estate real portfolio asset committ...
5    eligibility federal child information liaison ...
6    environmental direction facility research trea...
7    payment vendor fiscal provider someone supervi...
8    inspector investigative investigation general ...
9    water accountable submittal engineering meetin...
Name: keywords, dtype: object

In [10]:
# Export the dataframe here:
DFrame.to_csv(r'C:\Users\Joshua\Google Drive\Thesis Work\Python\training_set__keywords.csv', index = False, header = True)