# Job Aggregator

1. [Motivation](#motivation)
2. [Approach](#approach)
3. [Gathering the Job Links](#joblinks)
4. [Webscraping the Job Descriptions](#webscraping)
5. [Analysis of Job Data](#analysis)          
    5.1 [Useful Functions](#functions)    
    5.2 [Jobs by Sector and Size](#sector_size)    
    5.3 [Analysis of Job Descriptions](#job_descriptions)    
    5.4 [Term Frequency - Inverse Document Frequency](#tfidf)    
    5.5 [Extracting the Top Keywords from a Job Description](#keywords)

# How to Use this Notebook

Sections 1-4 gather the data. If you already have your data saved as a CSV file, skip to Section 5 and load it there.


<a id='motivation'></a>

## 1. Motivation

 - Anyone who has been on a modern job hunt has probably wondered and worried about the central question: 
 
 **what are companies looking for in candidates?**
 

 - Job boards publish descriptions listing the requirements and preferred skills that companies seek for specific positions.
 

 - Reading each description is useful for knowing what specific jobs require, but a **statistical** view of these descriptions would help provide a valuable perspective. What key words appear most frequently, and specifically, what skills are in greatest demand?
 
 

 - This information would empower students, career advisors, and job seekers to cater to the job market.



<a id='approach'></a>

## 2. Approach

Reading each description one by one and keeping track of statistics like word counts is infeasible when the scale becomes too large. This is where computational resources like `Python` really shine. 

We'll need to search job board sites such as Glassdoor and LinkedIn and then save the job descriptions and company information. To perform this [web scraping](https://en.wikipedia.org/wiki/Web_scraping), we'll use libraries such as `scrapy` and `Selenium` to crawl through the job description subpages and interact with them to collect the information.

There are two tasks needed to get our data:


1. Get a list of job links to feed `scrapy`


2. Use `scrapy` to get what we need from each site.


Where does `Selenium` come in? Websites sometimes run JavaScript that changes the content in response to user actions; such a site is called *dynamic*. `Selenium` automates browser actions, so it can navigate these more complicated websites.

<a id='joblinks'></a>

## 3. Gathering the Job Links

Here you can enter your own Glassdoor login information and customize the job title to search for, as well as the list of cities to search in. To limit duplicate jobs, choose cities that are spread out so that their search results overlap as little as possible. The number of cities affects how long it takes to aggregate all the job links.

In [None]:
import time
from selenium import webdriver
from job_scraping_functions import GlassdoorDriver, LinkedInDriver
import pickle

In [None]:
# Choose file names to save the links under

base_name = ''
job_links_glassdoor =  base_name + '_search_glassdoor.pickle'
job_links_linkedin = base_name + '_search_linkedin.pickle'

output_job_data_glassdoor = base_name+'_glassdoor.csv'
output_job_data_linkedin = base_name+'_linkedin.csv'



# Put in any valid login account for Glassdoor
username = ''
password = ''

# Replace with the job title you wish to search for
job = 'Operations Manager'

# Place cities as they appear in Glassdoor, in the format City, state initials
locations_ = ['Davis, CA', 'San Francisco, CA', 'San Jose, CA', 'Los Angeles, CA',
             'San Diego, CA', 'Irvine, CA', 'Portland, OR', 'Houston, TX',
             'Austin, TX', 'Charlotte, NC','Raleigh, NC', 'Nashville, TN',
             'Miami, FL', 'Tampa, FL', 'Orlando, FL', 'Washington, DC',
             'Arlington, VA','Boston, MA', 'New York, NY']

# Choose a subset of the above locations, or even new ones you've tried on Glassdoor

locations = ['San Francisco, CA', 'San Jose, CA']

In [7]:
driver_glassdoor = GlassdoorDriver()
# We save the links in a dictionary and write that to a disk with a filename we choose
glassdoor_start = time.time()
job_dict_glassdoor = driver_glassdoor.full_search_results(username, password, job, locations)
glassdoor_end = time.time()

Problem with page 29
Problem with page 29


In [5]:
driver_linkedin = LinkedInDriver()
linkedin_start = time.time()
job_dict_linkedin = driver_linkedin.full_search_results(job, locations, delay=2)
linkedin_end = time.time()

In [6]:
job_dict_linkedin

{'San Francisco, CA': ['https://www.linkedin.com/jobs/view/operations-manager-at-stockwell-1453535656?refId=6384f27b-263a-4e2d-8a40-e61caa33a0e8&position=1&pageNum=0&trk=guest_job_search_job-result-card_result-card_full-click',
  'https://www.linkedin.com/jobs/view/senior-creative-operations-manager-at-planet-interactive-1483161817?refId=6384f27b-263a-4e2d-8a40-e61caa33a0e8&position=2&pageNum=0&trk=guest_job_search_job-result-card_result-card_full-click',
  'https://www.linkedin.com/jobs/view/operations-manager-at-paradigm-1490886996?refId=6384f27b-263a-4e2d-8a40-e61caa33a0e8&position=3&pageNum=0&trk=guest_job_search_job-result-card_result-card_full-click',
  'https://www.linkedin.com/jobs/view/operations-manager-at-dark-heart-nursery-1477492496?refId=6384f27b-263a-4e2d-8a40-e61caa33a0e8&position=4&pageNum=0&trk=guest_job_search_job-result-card_result-card_full-click',
  'https://www.linkedin.com/jobs/view/operations-manager-at-g4s-1461198131?refId=6384f27b-263a-4e2d-8a40-e61caa33a0e8&

In [8]:
glassdoor_time = glassdoor_end-glassdoor_start
linkedin_time = linkedin_end - linkedin_start
print('Glassdoor Link Mining Time: %.2f' % glassdoor_time)
print('LinkedIn Link Time: %.2f' % linkedin_time)
with open(job_links_glassdoor, 'wb') as f:
    pickle.dump(job_dict_glassdoor, f)
with open(job_links_linkedin, 'wb') as f:
    pickle.dump(job_dict_linkedin, f)

Glassdoor Link Mining Time: 546.35
LinkedIn Link Time: 193.78
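
Section 3 notes that overlapping cities can return the same posting more than once. One way to drop such repeats before scraping is sketched below. It assumes that two links point to the same posting when their URL paths match, which looks plausible for the LinkedIn links printed above (the job ID sits in the path and only the tracking parameters differ) but has not been checked for Glassdoor. The function name `dedupe_job_links` and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def dedupe_job_links(links_by_location):
    """Drop URLs whose path already appeared under an earlier location."""
    seen_paths = set()
    deduped = {}
    for location, urls in links_by_location.items():
        unique_urls = []
        for url in urls:
            # Compare on the path only, so differing tracking parameters
            # (the query string) do not hide a duplicate posting.
            path = urlsplit(url).path
            if path not in seen_paths:
                seen_paths.add(path)
                unique_urls.append(url)
        deduped[location] = unique_urls
    return deduped


# Made-up links for illustration only
example_links = {
    'San Francisco, CA': ['https://example.com/jobs/view/ops-manager-111?refId=a',
                          'https://example.com/jobs/view/ops-manager-222?refId=a'],
    'San Jose, CA': ['https://example.com/jobs/view/ops-manager-222?refId=b',
                     'https://example.com/jobs/view/ops-manager-333?refId=b'],
}
deduped_links = dedupe_job_links(example_links)
```

A posting is kept under the first location in which it appears, so the order of the cities decides where a shared posting is counted.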


<a id='webscraping'></a>

## 4. Webscraping the Job Descriptions

Now that we've gathered the job links and saved them in a dictionary, we can use `scrapy` to visit each link and scrape the relevant job data, including company size, location, industry, sector, and job description.

To make things simple, we'll run it as a script. This involves setting up a Twisted [reactor](https://twistedmatrix.com/documents/13.1.0/core/howto/reactor-basics.html) to allow for [asynchrony](https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)) which makes it possible to scrape many sites quickly.
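
To see why asynchrony speeds things up, here is a small stand-alone illustration. It uses the standard library's `asyncio` rather than Twisted, and it simulates network waits with `asyncio.sleep` instead of making real requests, so it only shows the general idea and not how `scrapy` works internally. The names `fake_fetch` and `fetch_all` and the URLs are hypothetical.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: waits without blocking other tasks."""
    await asyncio.sleep(delay)
    return url, 'ok'


async def fetch_all(urls):
    # gather() runs the coroutines concurrently, so their waits overlap
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


demo_urls = [f'https://example.com/job/{i}' for i in range(5)]
t0 = time.perf_counter()
results = asyncio.run(fetch_all(demo_urls))
elapsed = time.perf_counter() - t0
print(f'{len(results)} simulated requests in {elapsed:.2f} s')
```

Five sequential waits of 0.1 s would take about half a second, while the overlapping waits finish in roughly the time of one. Run this as a plain script: inside Jupyter an event loop is already running, so you would `await fetch_all(demo_urls)` instead of calling `asyncio.run`.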

In [9]:
from twisted.internet import reactor
import scrapy
import logging
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
import pickle
from job_scraping_functions import GlassdoorDriver, LinkedInDriver
import re



class GlassdoorSpider(scrapy.Spider):

    name = "glassdoor"
    
    global output_job_data_glassdoor
    
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'FEED_FORMAT':'csv',                                 
        'FEED_URI': output_job_data_glassdoor                       
    }
    
    def start_requests(self):
        
        # this file should be changed to whatever dictionary was saved
        # from the previous search
        global job_links_glassdoor
        with open(job_links_glassdoor, 'rb') as f:
            location_urls = pickle.load(f)
        for location in location_urls:
            for url in location_urls[location]:
                yield scrapy.Request(url=url, callback=self.parse,
                                     cb_kwargs=dict(location=location))

    def parse(self, response, location):
        driver = GlassdoorDriver()
        driver.get(response.url)
        # Get job description
        job_container = driver.find_element_by_xpath(
            '//*[@id="JobDescriptionContainer"]')
        job_text = str(job_container.text)
        job_text = re.sub('\n', ' ', job_text)

        # Access company data on dynamic webpage via the Company tab
        tabs = driver.find_elements_by_class_name('link')
        for i in tabs:
            if re.search('Company', str(i.text)):
                i.click()
                break
        company_container = driver.find_element_by_xpath(
            '//*[@id="CompanyContainer"]')
        company_name_element = driver.find_element_by_xpath(
            '//*[@id="HeroHeaderModule"]/div[3]/div[1]/div[2]/span[2]')
        company_name = str(company_name_element.text)
        company_text = str(company_container.text)
        job_data = {'Location': location, 'Name': company_name, 'Description': job_text}
        driver.close()
        for i in ['Size', 'Industry', 'Sector']:
            job_data[i] = re.search(f'({i})(.*)(\n)', company_text).group(2)
        yield job_data

        
class LinkedinSpider(scrapy.Spider):

    name = "linkedin"
    
    global output_job_data_linkedin
    
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'FEED_FORMAT':'csv',                                 
        'FEED_URI': output_job_data_linkedin                       
    }
    
    def start_requests(self):        
        # this file should be changed to whatever dictionary was saved
        # from the previous search
        global job_links_linkedin
        with open(job_links_linkedin, 'rb') as f:
            location_urls = pickle.load(f)
        for location in location_urls:
            for url in location_urls[location]:
                yield scrapy.Request(url=url, callback=self.parse,
                                     cb_kwargs=dict(location=location))

    def parse(self, response, location):
        driver = LinkedInDriver()
        driver.get(response.url)
        # Get job description
        job_data = driver.get_job_data()
        yield job_data
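
`GlassdoorSpider.parse` pulls `Size`, `Industry`, and `Sector` out of the Company tab text with a regular expression, and `re.search(...).group(2)` raises an `AttributeError` whenever one of those labels is missing. Below is a minimal sketch of a more forgiving variant, assuming each label and its value share one line of text. The helper `extract_company_fields` and the sample text are hypothetical, since the real layout of the Company tab is not shown in this notebook.

```python
import re


def extract_company_fields(company_text, fields=('Size', 'Industry', 'Sector')):
    """Pull 'Label value' lines out of a text block; missing labels map to None."""
    data = {}
    for field in fields:
        # MULTILINE lets ^ and $ match at every line; [ \t]* skips the gap
        # between label and value without crossing onto the next line.
        match = re.search(rf'^{field}[ \t]*(.*)$', company_text, flags=re.MULTILINE)
        data[field] = match.group(1).strip() if match else None
    return data


# Made-up text for illustration only
sample_text = 'Headquarters Springfield\nSize 51 to 200 employees\nIndustry Logistics\n'
company_fields = extract_company_fields(sample_text)
```

Returning `None` for a missing label keeps the row in the output, so one incomplete company profile does not stop the spider from yielding the rest of the job data.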
        

In [None]:
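configure_logging()
runner = CrawlerRunner()

# NOTE: runner.crawl() only schedules a crawl and returns a Deferred right away;
# the scraping itself happens inside reactor.run() below. The start/end pairs
# therefore time the scheduling call only, not the crawl.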

glassdoor_start = time.time()
runner.crawl(GlassdoorSpider)
glassdoor_end = time.time()

linkedin_start = time.time()
runner.crawl(LinkedinSpider)
linkedin_end = time.time()

d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

In [11]:
print('Glassdoor Job Scraping Time: %d' % (glassdoor_end-glassdoor_start))
print('LinkedIn Job Scraping Time: %d' % (linkedin_end-linkedin_start))

Glassdoor Job Scraping Time: 0
LinkedIn Job Scraping Time: 0


<a id='analysis'></a>

## 5. Analysis of Job Data

Now that we have the data, let's take a look at the fields we've scraped using `pandas`.

In [None]:
import pandas as pd
df_glassdoor = pd.read_csv(output_job_data_glassdoor)
df_linkedin = pd.read_csv(output_job_data_linkedin)

In [None]:
df_glassdoor.tail()

In [None]:
df_linkedin.tail()

<a id='functions'></a>

### 5.1 Useful Functions

To better visualize and perform some natural language processing (NLP) analysis, we define some helper functions.

In [None]:
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from nltk.stem.wordnet import WordNetLemmatizer
import re
import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def normalize_text(df, stop_words):
    """
    Takes in a pandas Series of text and returns a normalized
    corpus

    Parameters
    ---------------
   
    df : pandas Series object
        The text column to process.
    stop_words : iterable
        Common English words to disregard in analysis
    
    Returns
    ----------------

    corpus : list of strings
        A list of text documents with the applied modifications
    
    """
    corpus = []
    lem = WordNetLemmatizer()
    # dropna() returns a new object, so iterate over its result directly
    for i, text in df.dropna().items():
        try:
            # Lowercase before filtering so capitalized stop words ("The") are removed too
            words = text.lower().split()
            words = [lem.lemmatize(word) for word in words if word not in stop_words]
            corpus.append(" ".join(words))
        except Exception as err:
            print(f'Trouble with row {i}: {err}')
            continue
    return corpus


def df_counts(df, category):
    """
    Creates a dataframe of counts for a specified 
    categorical variable

    Parameters
    ---------------
   
    df : pandas DataFrame object
        DataFrame object with categorical values of interest
    category : string
        String corresponding to a categorical column in the dataframe

    Returns
    ----------------

    df2 : pandas DataFrame object
        New dataframe with the counts of the categorical variable
    
    """ 
    df_grouped = (df
             .groupby(category)[category]
             .count()
             )
    
    d = {category: df_grouped.index.values, 'Count':df_grouped.values.flatten()}
    df2 = pd.DataFrame(d)
    return df2    
    
    
def flatten_groupby(df, cols, label):
    """
    Converts the multi-level dataframe from a multi-level groupby
    operation into a single level dataframe

    Parameters
    ---------------
   
    df : pandas DataFrame object
        The multi-level dataframe resulting from a groupby operation
    cols : string or list of strings
        The column(s) used for grouping in the order used for grouping
    label : string
        Name assigned to last column of dataframe, corresponding to the
        group operation ie. count, sum, mean etc.

    Returns
    ----------------

    df_flat_groupby : pandas DataFrame object        
        Flat hierarchy dataframe
        
    """
    
    cols_data = {}
    # Single groupby category case
    if type(df.index) is pd.core.indexes.base.Index:
        if type(cols) is list:
            category = cols[0]
        elif type(cols) is str:
            category = cols
        cols_data[category] = df.index.values
    
    # Multiple grouping case   
    elif type(df.index) is pd.core.indexes.multi.MultiIndex:
        for idx, category in enumerate(cols):
            cols_data[category] = [tup[idx] for tup in df.index.values]
    cols_data[label] = df.values.flatten()
    
    df_flat_groupby = pd.DataFrame(cols_data)
    
    return df_flat_groupby


def plot_word_cloud(corpus, stop_words,
    max_words=100,
    max_font_size=50,
    random_state=0, background_color='white'):
    """
    Creates and plots word cloud    

    Parameters
    ---------------
    corpus : iterable of text
        The collection of texts used to generate the wordcloud
    stop_words : iterable
        Common English words to disregard
    max_words : int
        Cutoff for the number of top frequent words to display.
    max_font_size : int
        Font size of the most frequent, and thus the largest, word
        
    """ 
    wordcloud = WordCloud( 
        background_color=background_color,
        stopwords=stop_words,
        max_words=max_words,
        max_font_size=max_font_size,
        random_state=random_state
        ).generate(" ".join(corpus))  # join the documents; str(list) would add brackets and quotes
    print(wordcloud)
    fig = plt.figure(1)
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
    
    
def get_top_N_grams(corpus, stop_words, ngram_range=(1,1), max_features=2000):
    """
    Get the most frequently occurring N-grams
    
    Parameters:
    ----------------
    corpus: list of strings
        Text documents to analyze
    stop_words : iterable
        Common English words to disregard
    ngram_range : tuple (a,b)
        Range of N, [a,b], to consider for N-grams
    max_features: int or None
        Limit of top N-grams to return
        
    Returns:    
    ----------------
    words_freq: list of tuples
         (N-gram, count) pairs sorted by count in descending order
    
    """
    vec = CountVectorizer(ngram_range=ngram_range, 
                          max_features=max_features,
                          token_pattern=r'(?u)\b\w+\b',
                          # recent scikit-learn versions expect a list here, not a set
                          stop_words=list(stop_words)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_of_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_of_words[0,idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:max_features]


def search_top_N_grams(corpus, stop_words, n_grams, max_features=2000):
    """
    Searches for the given N-grams' rank among the top frequent N-grams.    

    Parameters
    ---------------
   
    corpus : list of strings
        List of string documents
    stop_words : iterable
        Common English words to disregard
    n_grams : list of strings
        The N-grams to search for; matching is case-insensitive
    max_features : int
        The cutoff for how many top N-grams to show

    Returns
    ----------------

    sorted_df : DataFrame object 
        Pandas dataframe with columns N-gram, Count, and Rank with words ordered
        by rank in ascending order
       
    """
    counts = [None for i in n_grams]
    rankings = counts[:]
    
    n_grams_lowercase = [i.lower() for i in n_grams]
    N_ = [len(i.split()) for i in n_grams]
    N_max = max(N_)
    topn = get_top_N_grams(corpus, stop_words, ngram_range=(1,N_max), max_features=max_features)
    
    # topn is already sorted by count, so the position in the list is the rank
    for rank, (top_n_gram, count) in enumerate(topn, start=1):
        if top_n_gram in n_grams_lowercase:
            idx = n_grams_lowercase.index(top_n_gram)
            rankings[idx] = rank
            counts[idx] = count
            
    d = {'N-gram': n_grams, 'Count':counts, 'Ranking':rankings}
    df = pd.DataFrame(d)       
    sorted_df = df.sort_values('Ranking', ascending=True).reset_index(drop=True)
    return sorted_df

def bar_plot(df,x,y,title=None,figsize=(13,8), rotation=75,
             xtickfont=12, xlabelfont=16, ylabelfont=16,
             titlefont=25, context='poster', ci=None):
    """
    Wrapper function for a barplot with seaborn

    Parameters
    ---------------
   
    df : pandas DataFrame object
        Dataframe containing the data to plot
    x : string
        Column in dataframe containing x values
    y : string
        Column in dataframe containing y values
    title : string
        Title of the chart
    figsize : tuple of numbers
        Dimensions of chart
    rotation : int
        Degree of rotation for xtick labels
    xtickfont : int
        Fontsize for xtick labels
    context : string
        Specific seaborn context keywords ie 'talk','poster','figure'
        'notebook', 'paper'
    ci : float or None
        Confidence interval bar width
    """ 

    sns.set(rc={'figure.figsize':figsize})    
    g = sns.barplot(x=df[x], y=df[y],data=df,ci=ci)
    g.set_xticklabels(df[x], rotation=rotation, fontdict={'fontsize':xtickfont})
    g.set_xlabel(xlabel=x,fontdict={'fontsize':xlabelfont})
    g.set_ylabel(ylabel=y,fontdict={'fontsize':ylabelfont})
    
    g.set_title(title, fontdict={'fontsize':titlefont});
    sns.set(context=context)


def plot_N_gram_counts(corpus, stop_words, n_grams, title, rotation=75, max_features=2000):
    """
    Plot the word counts of the specified N-grams if they are found among the
    top N-grams of the corpus.    

    Parameters
    ---------------
   
    corpus : list of strings
        List of string documents
    stop_words : iterable
        Common English words to disregard
    n_grams : list of strings
        The N-grams to search for
    rotation : int
        Degrees to rotate the x-axis tick labels by 
    max_features : int
        The cutoff for how many top N-grams to consider
    title : string
        Title to give the plot, e.g. the rationale for the chosen words

    Returns
    ----------------

    sorted_df : DataFrame object 
        Pandas dataframe with columns N-gram, Count, and Ranking with words ordered
        by rank in ascending order.
       
    """
    
    df = search_top_N_grams(corpus, stop_words, n_grams, max_features)
    x = 'N-gram'
    y = 'Count'
    
    bar_plot(df, x, y, title=title, rotation=rotation)
    # Return the table so the function matches its docstring
    return df
    
def plot_top_N_grams(corpus, stop_words, max_features=10, ngram_range=(1,1), rotation=75):
    """
    Plots a bar chart of the most frequently occurring N-grams
    
    Parameters:
    ----------------
    corpus: list of text
        Text to analyze
    stop_words : iterable
        Common English words to disregard
    max_features: int or None
        Limit of top N-grams to return
    ngram_range: tuple
        Defines the N-grams to examine
    rotation : int
        Degrees to rotate the x-axis tick labels by
        
    """
    top_words = get_top_N_grams(corpus, stop_words, max_features=max_features, ngram_range=ngram_range)
    top_df = pd.DataFrame(top_words)
    top_df.columns=["N-grams", "Freq"]
    
    #Barplot of most freq words
    
    context='talk'
    x="N-grams"
    y="Freq"   
    title =f'Top {max_features} {ngram_range}-grams'
    # Pass rotation and context through; they were previously defined but unused
    bar_plot(top_df, x, y, title=title, rotation=rotation, context=context)
    
def create_tfidf_vector(text, corpus, stop_words, max_features=10000, ngram_range=(1,3)):
    """
    Convert text to a term-frequency inverse document frequency vector
    based on a fitted corpus and include feature names  

    Parameters
    ---------------
   
    text : string
        Job description to convert to a tf-idf word vector         
    corpus : list of strings        
        Text to analyze, the documents from which inverse document frequency is
        calculated
    stop_words : list of strings
        Words to ignore in analysis due to being common or insignificant 
    max_features : int
        Cutoff for the number of words to consider
    ngram_range : tuple of ints
        Min and max n-gram sizes to consider

    Returns
    ----------------

    tfidf_vector : scipy sparse matrix object
        Vector object of tf-idf scores whose indices correspond to words
    feature_names : array
        Array of vocabulary words at the indices of the tfidf_vector
    
    """
    # Use the function arguments instead of hardcoded values, and pass the
    # stop words as a list (recent scikit-learn versions reject a set)
    TV = TfidfVectorizer(max_df=.8, stop_words=list(stop_words),
                    max_features=max_features, ngram_range=ngram_range,
                    smooth_idf=True, use_idf=True,
                    min_df=1)
    TV.fit(corpus)
    tfidf_vector = TV.transform([text])
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    feature_names = TV.get_feature_names_out()
    
    return tfidf_vector, feature_names


        
def sort_tfidf_vector(tf_idf_vector):
    """
    Sorts tf-idf vector in descending order
    
    Parameters:
    ----------------
    tf_idf_vector: scipy.sparse.csr_matrix object
        Compressed Sparse Row (row, column, data) representation of a sparse matrix            
    
    Returns:    
    ----------------
    sorted_tups: list of tuples
         Sorted tuples of tf-idf scores and corresponding feature indices
    
    """ 
    tuples = zip(tf_idf_vector.data, tf_idf_vector.indices)
    return sorted(tuples, key=lambda x: (x[0],x[1]), reverse=True)

def extract_topn_from_vector(feature_names, tf_idf_vector, topn=10):
    """
    Get top n features ordered by tf-idf
    
    Parameters:
    ----------------
    feature_names: list
        List of column features, ie n-grams
    tf_idf_vector: scipy.sparse.csr_matrix object
        Compressed Sparse Row (row, column, data) representation of a sparse matrix            
    topn: int
        Cutoff of top-n list
        
    Returns:    
    ----------------
    keywords_scores: pandas DataFrame object
        DataFrame of keywords and tf-idf scores in order of descending scores
    
    """
    sorted_items = sort_tfidf_vector(tf_idf_vector)
    d = {feature_names[idx]:round(score,3) for score,idx in sorted_items[:topn]}
    return pd.DataFrame(list(d.items()),columns=['Keyword', 'tf-idf Score'])
    

<a id='sector_size'></a>

### 5.2 Jobs by Sector and Size

#### Overall

How many different sectors of the economy is our job title found in?

In [None]:
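# Create a column of sector counts
# Assumption: `df` is never defined above; here it is taken to be the Glassdoor
# table, since the Glassdoor spider is the one that scrapes 'Sector'.
df = df_glassdoor
category = 'Sector'
df_sector_counts = df_counts(df, category)
len(df['Sector'].unique())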

Which sectors have the most openings? 

In [None]:
df_sector_counts.sort_values('Count', ascending=False).reset_index(drop=True)

Here we group the job openings by location and economic sector:

In [None]:
cols = ['Location','Sector']
df_sector2 = (df
                 .groupby(cols)[cols[-1]]
                 .count()
                )

df_location_sectorcounts = flatten_groupby(df=df_sector2,cols=cols,label='Count')
df_location_sectorcounts
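
As a plain-Python picture of what the two-level grouping computes, the sketch below counts (location, sector) pairs with `collections.Counter` and lays them out as flat records, one per pair. The rows are made up for illustration and do not come from the scraped data.

```python
from collections import Counter

# Made-up rows standing in for the Location and Sector columns
rows = [('San Jose, CA', 'Retail'),
        ('San Jose, CA', 'Retail'),
        ('San Jose, CA', 'Finance'),
        ('San Francisco, CA', 'Retail')]

# One count per distinct (location, sector) pair, like groupby(...).count()
pair_counts = Counter(rows)

# Flatten into one record per pair, like the output of flatten_groupby
flat_table = [{'Location': loc, 'Sector': sec, 'Count': n}
              for (loc, sec), n in sorted(pair_counts.items())]
```

Each record carries both grouping keys as ordinary fields, which is the flat shape that makes filtering by city straightforward.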

Here are visualizations of these numbers:

In [None]:
plot_args = {'df':df_sector_counts, 'x':'Sector','y':'Count',
             'title': 'Jobs by Sector', 'ci':None}
bar_plot(**plot_args)

#### Specific Locations

The spread of jobs might be significantly different depending on locations, so it might be worth looking at the differences to consider which locations to focus on. In the following script, feel free to change the variable `city` to whatever you'd like in the proper format.

In [None]:
# repeat plot above but passing it to seaborn filtering by location
city = 'San Jose, CA (US)'
df_city = df_location_sectorcounts[df_location_sectorcounts['Location']==city]
plot_args = {'df':df_city, 'x':'Sector','y':'Count',
             'title': f'Jobs by Sector: {city}', 'ci':None}
bar_plot(**plot_args)

In [None]:
city = 'San Francisco, CA (US)'
df_city = df_location_sectorcounts[df_location_sectorcounts['Location']==city]
plot_args = {'df':df_city, 'x':'Sector','y':'Count',
             'title': f'Jobs by Sector: {city}', 'ci':None}
bar_plot(**plot_args)

<a id='job_descriptions'></a>

### 5.3 Analysis of Job Descriptions

#### Creating a Corpus of Text

Below we define *stop words*: words that appear so often in the background that we ignore them. When using your own dataset, you should iteratively add stop words that you don't want to see. We'll use the full list of job descriptions as our **corpus** of text to analyze.

In [None]:
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words("english"))
##Creating a list of custom stopwords
new_words = ['sexual', 'orientation', 'equal', 'opportunity', 'race', 'color',
            'gender', 'identity', 'regard', 'religion', 'veteran', 'status', 'protected', 'we', 'the', 'you',
            'sex', 'national', 'origin']
stop_words = stop_words.union(new_words)

corpus = normalize_text(df['Description'],stop_words=stop_words)

In [None]:
# number of stop words we'll exclude
len(stop_words)

In [None]:
# the number of job descriptions
len(corpus)

#### Wordcounts

The word cloud below gives a quick view of the most frequent words: the larger the font, the more often the word appears. Notice any words that don't seem interesting? You can always add them to the stop words list defined above so that more meaningful words are emphasized.

In [None]:
plot_word_cloud(corpus, stop_words)

#### Top N-grams

The actual n-gram counts that were used to create the previous visualization can be seen below. We can see the words that pop up the most in job descriptions:

In [None]:
topn = get_top_N_grams(corpus, stop_words, ngram_range=(1,1), max_features=100000)
topn
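
For intuition about what these counts are, here is a simplified pure-Python version of n-gram counting. It splits on whitespace and ignores stop words, so it is only a sketch of the idea and will not reproduce the exact tokenization of `CountVectorizer`. The function `count_ngrams` and the toy corpus are made up for illustration.

```python
from collections import Counter


def count_ngrams(documents, n=1):
    """Count every run of n consecutive whitespace-separated tokens."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # Slide a window of n tokens across each document separately,
        # so no n-gram spans two documents.
        for start in range(len(tokens) - n + 1):
            counts[' '.join(tokens[start:start + n])] += 1
    return counts


toy_corpus = ['manage team budget', 'manage project budget', 'lead team']
unigram_counts = count_ngrams(toy_corpus, n=1)
bigram_counts = count_ngrams(toy_corpus, n=2)
print(unigram_counts.most_common(3))
```

Sorting these counts in descending order gives the same kind of (n-gram, count) list that `get_top_N_grams` returns.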

#### Searching for N-gram Ranking

What if we wanted to look at the relative frequencies of words of our own choosing? We can search the list above for a set of words we define, and so build a "top skills" list ourselves:

In [None]:
# Feel free to replace this list with words of your choice
n_grams = ['Python','R','SAS','Stata','Java','C','Javascript', 'Excel', 'Power Point', 'Tableau', 'Word', 'html']

skill_rankings = search_top_N_grams(corpus, stop_words, n_grams, max_features=1000000)

skill_rankings

#### Plotting the Top N-grams

We can also visualize the most frequent n-grams in a bar graph. The call below plots the top 20 n-grams of one to three words:

In [None]:
plot_top_N_grams(corpus, stop_words, ngram_range=(1,3), max_features=20)

<a id='tfidf'></a>

### 5.4 Term Frequency - Inverse Document Frequency

In some cases, we may wish to convert the raw word counts produced above into a normalized metric that favors words appearing frequently in a given document but penalizes words that occur frequently across the background corpus. This metric is the [term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

**Term Frequency**

$$ tf(t,d) = \text{raw count}$$

where $t$ is the term and $d$ is the document.

**Inverse Document Frequency**

$$ idf(t) = \log\Big(\frac{n}{df(t)+1}\Big)$$

where $df(t)$ is the document frequency, the number of documents containing term $t$ and $n$ is the number of documents in the corpus. 

**Term Frequency - Inverse Document Frequency**

$$ \text{tf-idf}(t,d) = tf(t,d) \cdot idf(t)$$
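
The formulas above can be checked by hand on a toy corpus. The sketch below implements them literally (raw counts and $\log\big(\frac{n}{df(t)+1}\big)$); the helper names and the four toy documents are made up for illustration. Note that scikit-learn's `TfidfVectorizer` with `smooth_idf=True` uses a slightly different definition, $\ln\big(\frac{1+n}{1+df(t)}\big)+1$, and L2-normalizes each document vector by default, so its scores will not match these numbers exactly.

```python
import math


def tf(term, document):
    """Raw count of `term` among the whitespace-separated tokens of `document`."""
    return document.split().count(term)


def idf(term, documents):
    """log(n / (df + 1)), with df the number of documents containing `term`."""
    df = sum(1 for doc in documents if term in doc.split())
    return math.log(len(documents) / (df + 1))


def tf_idf(term, document, documents):
    return tf(term, document) * idf(term, documents)


toy_docs = ['python sql python', 'excel sql', 'excel word', 'sql word']
score_python = tf_idf('python', toy_docs[0], toy_docs)  # rare term, appears twice
score_sql = tf_idf('sql', toy_docs[0], toy_docs)        # common term, appears once
```

Here `python` occurs twice in the first document and in only one of the four documents, so its score is $2\log(4/2) \approx 1.39$. `sql` occurs in three documents, giving $\log(4/4) = 0$, so it contributes nothing as a keyword.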

<a id='keywords'></a>

### 5.5 Extracting the Top Keywords from a Job Description

What if we were looking at a job description and wanted to focus on words unique to that one as opposed to being found commonly in other descriptions? We could then use the tf-idf metric to find *keywords* of the job description of interest.

To do this we first choose a job description to convert. For example, we can choose one from the job descriptions we scraped in our corpus:

In [None]:
corpus[100]

Now we take this text and feed it into a few functions that turn the description into a word vector that contains tf-idf scores. 

In [None]:
text = corpus[100]
tfidf_vector, feature_names = create_tfidf_vector(text, corpus, stop_words, ngram_range=(1,4))
extract_topn_from_vector(feature_names, tfidf_vector, topn=40)
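
The sorting step in `sort_tfidf_vector` relies on how a one-row SciPy CSR matrix is stored: `.data` holds the non-zero scores and `.indices` holds the column, that is the vocabulary position, of each score. Below is a plain-Python sketch of the same pairing and sorting, using ordinary lists in place of the sparse matrix. The vocabulary, scores, and the helper `top_keywords` are made up for illustration.

```python
def top_keywords(data, indices, feature_names, topn=3):
    """Pair each score with its column index, sort descending, map to words."""
    pairs = sorted(zip(data, indices), reverse=True)
    return [(feature_names[idx], round(score, 3)) for score, idx in pairs[:topn]]


# Made-up vocabulary and scores for illustration only
toy_features = ['budget', 'excel', 'python', 'sql', 'team']
toy_indices = [2, 0, 4]       # columns with a non-zero score
toy_data = [0.8, 0.1, 0.59]   # the scores at those columns
keywords = top_keywords(toy_data, toy_indices, toy_features, topn=2)
```

Only the non-zero entries are ever sorted, which keeps this step cheap even when the vocabulary holds thousands of n-grams.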