# Creating a Wordcloud in Python

![](wc.png)

Wordclouds are an interesting form of data visualization which show the prevalence of words in a given piece of text. 
In the context of applying for jobs, they seem particularly adept at identifying potential keywords (and/or *buzzwords*) which employers may be looking for in an application. 
It seems that the use of Applicant Tracking Systems (ATS) is becoming a major hurdle for jobseekers, with [tons](https://www.topresume.com/career-advice/what-is-an-ats-resume) [of](https://www.forbes.com/sites/nextavenue/2014/03/18/how-to-get-your-resume-read-by-an-employer/#346551266865) [articles](https://business.fullerton.edu/news/2018/05/29/applicant-tracking-systems-what-job-seekers-should-know/) outlining how to "*optimize*" one's resume for ATS to avoid being filtered out by algorithms.

In the spirit of desiring fulfilling employment, and recognizing the importance of displaying the traits which an employer deems most important, I wanted to see which keywords were showing up most often in the kinds of jobs I was applying to. 
It also seemed like a perfect opportunity to get my feet wet with text analysis in Python.

I list the resources I used in the cell below, and try to note any place where code was shamelessly stolen. 
Anyway, let's get started and load in the libraries.

In [140]:
# Resources:
#   -   Text Analysis in Python: https://medium.com/towards-artificial-intelligence/text-mining-in-python-steps-and-examples-78b3f8fd913b
#   -   Text Analysis overview: https://monkeylearn.com/text-analysis/
#   -   https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f
#   -   https://towardsdatascience.com/basic-binary-sentiment-analysis-using-nltk-c94ba17ae386
#   -   https://stackoverflow.com/questions/32957895/wordnetlemmatizer-not-returning-the-right-lemma-unless-pos-is-explicit-python


# Importing necessary libraries
import pandas as pd
import numpy as np
import string
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, RegexpTokenizer
from nltk.corpus import stopwords, wordnet
import re
from wordcloud import WordCloud, STOPWORDS
from PIL import Image
import matplotlib.pyplot as plt  # pyplot provides imshow(), axis() and show() used in make_image()


# Functions

The functions used to process the text data are:

- **clean()** which removes characters which do not aid analysis - newline, dashes, slashes, etc.  
- **remove_stopwords()** does what it says on the tin  
- **word_lemmatizer()** reduces words to their root form e.g. looking => look  
- **get_wordnet_pos()** is a function which tags words with their corresponding *part-of-speech*. This is necessary for lemmatization.  
- **top_words()** finds the most commonly used words
- **make_image()** creates the wordcloud and outputs a .png file

In [141]:
## Functions -----------------------------------------------------------------------------------------------------------
# Text cleaning function, shamelessly stolen from: https://github.com/datanizing/reddit-selfposts-blog
def clean(s):
    s = re.sub(r'((\n)|(\r))', " ", s) # replace newlines (\n) and carriage returns (\r, the other half of Windows line endings) with spaces
    s = re.sub(r'/', " ", s) # replace forward slashes with spaces
    s = re.sub(r'\-', " ", s) # replace dashes with spaces (I will be forever cursed for not accounting for the em dash)
    no_punct = "".join([c.lower() for c in s if c not in string.punctuation]) # remove punctuation

    return no_punct

# Function to remove stopwords from a list of words
def remove_stopwords(text):
    stop_words = set(stopwords.words('english')) # build the set once instead of reloading the list for every word
    words = [w for w in text if w not in stop_words]
    return words

# Helper for the lemmatizer: tag a single word with its part of speech
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

# Lemmatizing reduces words to their root form
# Note: relies on the global `lemmatizer` that is instantiated in the Lemmatization section
def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(word=i, pos=get_wordnet_pos(i)) for i in text]
    return lem_text

# Function for finding the most commonly used words in job desc
def top_words(cleaned_desc, n = 3):
    # Count word freq
    freq = pd.Series(cleaned_desc.split()).value_counts()

    # Select top n words
    top_n = freq[:n].index.to_list()
    return top_n

# Function for creating masked wordcloud
# Found here: https://amueller.github.io/word_cloud/auto_examples/masked.html
def make_image(text, img):
    # Need to get a mask image
    mask = np.array(Image.open(img))

    wc = WordCloud(background_color="white", max_words=1000, mask=mask)
    # generate word cloud
    wc.generate_from_frequencies(text)

    # show
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()


# Read in data

First things first, we'll read in the data. 
This is a simple dataset of job postings I collected using Google Sheets. 
I collected the **Position**, **Company**, **Location**, **Role Description**, **Qualifications**, and **Benefits** (although benefits information was far less common).

In [142]:
# Read in data
jobapps_file = 'data/jobapps.csv'
jobapps_df = pd.read_csv(jobapps_file)

jobapps_df.columns

Index(['Position', 'Company', 'Type', 'Location', 'Link', 'description', 'qualifications', 'benefits'], dtype='object')

In [143]:
jobapps_df.head()

|  | Position | Company | ... | qualifications | benefits |
| --- | --- | --- | --- | --- | --- |
| 0 | Associate Governmental Program Analyst | Fair Employment Agency | ... |  |  |
| 1 | Data and Policy Analyst - Statistical Programmer | Acumen LLC | ... | Bachelor’s degree in a quantitative, public policy, or related field or equivalent relevant expe... |  |
| 2 | Lead Business Intelligence Engineer | sweetgreen | ... | Experience with modern data platforms (e.g. AWS, Snowflake, SQL, Airflow, Python)\nExperience wi... | Three different medical plans to suit your and your family's needs\nDental and Vision insurance\... |
| 3 | Capacity Planning Analyst | Beyond Meat | ... | 5+ years of experience in operations or business analysis\r\nBachelor’s degree in statistics, ap... |  |
| 4 | Data and Policy Analyst - Writer/Coordinator | Acumen LLC | ... | Bachelor’s degree in a quantitative, public policy, or related field or equivalent relevant expe... |  |


In [144]:
# Isolate job description text
job_desc = jobapps_df[['description']]

## Cleaning

First we clean the text data. As you can see in the results below, this function removes punctuation, converts the text to lowercase, and generally makes the text more machine-readable.
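
One gap worth flagging: `clean()` only swaps the plain hyphen for a space, and `string.punctuation` is ASCII-only, so en and em dashes pass straight through. Below is a minimal sketch of a variant that handles them too. The name `clean_with_dashes` and the sample string are made up for illustration and are not part of the notebook, and unlike `clean()` it collapses a `\r\n` pair into a single space.

```python
import re
import string

def clean_with_dashes(s):
    """Hypothetical variant of clean() that also treats en/em dashes as separators."""
    s = re.sub(r'[\r\n]+', ' ', s)            # any run of line-break characters -> one space
    s = re.sub(r'[/\-\u2013\u2014]', ' ', s)  # slash, hyphen, en dash, em dash -> space
    # drop ASCII punctuation and lowercase everything, as clean() does
    return ''.join(c.lower() for c in s if c not in string.punctuation)

sample = "Build dashboards\u2014fast.\r\nSQL/Python, 25% ad-hoc!"
print(clean_with_dashes(sample))
```

Curly quotes (like the apostrophe in "Bachelor’s") are also outside `string.punctuation`, so they would need the same kind of explicit handling.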

In [145]:
# Cleaning
job_desc_clean = job_desc
job_desc_clean = job_desc_clean.assign(desc_clean = job_desc.description.apply(clean))

job_desc_clean.head()

|  | description | desc_clean |
| --- | --- | --- |
| 0 | 30% Ensure that the DFEH complies with all OSHA/Cal-OSHA Regulations, in part by maintaining fam... | 30 ensure that the dfeh complies with all osha cal osha regulations in part by maintaining famil... |
| 1 | Data and Policy Analysts perform a wide array of functions as part of the research process. Thos... | data and policy analysts perform a wide array of functions as part of the research process those... |
| 2 | Lead BI Engineers are responsible for owning approximately 25% of our reporting portfolio - owni... | lead bi engineers are responsible for owning approximately 25 of our reporting portfolio ownin... |
| 3 | We are looking for an exceptional analyst who can diagnose and solve complex business problems t... | we are looking for an exceptional analyst who can diagnose and solve complex business problems t... |
| 4 | Data and Policy Analysts perform a wide array of functions as part of the research process. Thos... | data and policy analysts perform a wide array of functions as part of the research process those... |


## Tokenization

[Tokenization](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html) is the process of separating a sentence into smaller chunks such as words and number elements.

Below I tokenize the job description text twice - once with stop words and once without, but this was mostly for my own edification to see the difference between the two. 
Stopwords are words which are fundamental to language but may not always be useful when analyzing text.  

For example, you can see below by comparing `desc_clean` and `desc_clean_nostop` what kind of words are removed: 
"*that*", "*the*", "*with*", "*a*", etc.
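
To make the idea concrete, here is a small sketch of stop word filtering. The `STOP_WORDS` set is a tiny hand-picked stand-in (an assumption so the example runs without downloading the NLTK corpus); the real analysis uses `stopwords.words('english')`, and `drop_stop_words` is just an illustrative name.

```python
# Tiny stand-in for NLTK's English stop word list (assumption: hand-picked for this demo)
STOP_WORDS = {"that", "the", "with", "a", "all", "in", "by", "of", "and"}

def drop_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t not in stop_words]

tokens = "ensure that the dfeh complies with all osha regulations".split()
print(drop_stop_words(tokens))
# ['ensure', 'dfeh', 'complies', 'osha', 'regulations']
```

Using a `set` keeps each membership check cheap, which starts to matter once the descriptions get long.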

In [146]:
## Tokenizing
# Instantiate Tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Add tokenized column
job_desc_clean['desc_tokenized'] = job_desc_clean.desc_clean.apply(lambda x: tokenizer.tokenize(x))

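# Remove stop words (build the stop word set once rather than reloading it for every word)
stop_words = set(stopwords.words('english'))
job_desc_clean['desc_clean_nostop'] = job_desc_clean['desc_clean'].apply(lambda x: " ".join(w for w in x.split() if w not in stop_words))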
# Add tokenized column w/o stop words
job_desc_clean['desc_tokenized_nostop'] = job_desc_clean.desc_tokenized.apply(lambda x: remove_stopwords(x))

job_desc_clean.head()

|  | description | desc_clean | desc_tokenized | desc_clean_nostop | desc_tokenized_nostop |
| --- | --- | --- | --- | --- | --- |
| 0 | 30% Ensure that the DFEH complies with all OSHA/Cal-OSHA Regulations, in part by maintaining fam... | 30 ensure that the dfeh complies with all osha cal osha regulations in part by maintaining famil... | [30, ensure, that, the, dfeh, complies, with, all, osha, cal, osha, regulations, in, part, by, m... | 30 ensure dfeh complies osha cal osha regulations part maintaining familiarity current laws regu... | [30, ensure, dfeh, complies, osha, cal, osha, regulations, part, maintaining, familiarity, curre... |
| 1 | Data and Policy Analysts perform a wide array of functions as part of the research process. Thos... | data and policy analysts perform a wide array of functions as part of the research process those... | [data, and, policy, analysts, perform, a, wide, array, of, functions, as, part, of, the, researc... | data policy analysts perform wide array functions part research process applicants interested fo... | [data, policy, analysts, perform, wide, array, functions, part, research, process, applicants, i... |
| 2 | Lead BI Engineers are responsible for owning approximately 25% of our reporting portfolio - owni... | lead bi engineers are responsible for owning approximately 25 of our reporting portfolio ownin... | [lead, bi, engineers, are, responsible, for, owning, approximately, 25, of, our, reporting, port... | lead bi engineers responsible owning approximately 25 reporting portfolio owning exec level stak... | [lead, bi, engineers, responsible, owning, approximately, 25, reporting, portfolio, owning, exec... |
| 3 | We are looking for an exceptional analyst who can diagnose and solve complex business problems t... | we are looking for an exceptional analyst who can diagnose and solve complex business problems t... | [we, are, looking, for, an, exceptional, analyst, who, can, diagnose, and, solve, complex, busin... | looking exceptional analyst diagnose solve complex business problems data analysis scenario mode... | [looking, exceptional, analyst, diagnose, solve, complex, business, problems, data, analysis, sc... |
| 4 | Data and Policy Analysts perform a wide array of functions as part of the research process. Thos... | data and policy analysts perform a wide array of functions as part of the research process those... | [data, and, policy, analysts, perform, a, wide, array, of, functions, as, part, of, the, researc... | data policy analysts perform wide array functions part research process applicants interested fo... | [data, policy, analysts, perform, wide, array, functions, part, research, process, applicants, i... |


Now that the text is tokenized, we can create a frequency distribution to see how often a particular word comes up.
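
If you haven't met frequency distributions before, the idea is just "count how many times each token shows up". NLTK's `FreqDist` builds on Python's `collections.Counter`, so a plain `Counter` on a made-up token list gives a feel for it:

```python
from collections import Counter

# Made-up tokens for illustration
tokens = ["data", "policy", "data", "research", "policy", "data"]

freq = Counter(tokens)       # maps each token to its count
print(freq["data"])          # 3
print(freq.most_common(2))   # [('data', 3), ('policy', 2)]
```

The same `most_common()` call is what we lean on a little further down to pull out the top words per description.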

In [147]:
# Find the frequency of each distinct token
# Import FreqDist from nltk and build one distribution per job description
from nltk.probability import FreqDist
job_desc_freq = [FreqDist(desc) for desc in job_desc_clean.desc_tokenized_nostop]
job_desc_freq

[FreqDist({'dfeh': 5, 'regulations': 5, 'plans': 5, 'services': 5, 'state': 5, 'contract': 5, 'purchase': 5, 'maintain': 5, 'evacuation': 4, 'coordinate': 4, ...}),
 FreqDist({'research': 3, 'statistical': 3, 'data': 2, 'perform': 2, 'analyses': 2, 'policy': 1, 'analysts': 1, 'wide': 1, 'array': 1, 'functions': 1, ...}),
 FreqDist({'data': 10, 'customer': 8, 'within': 5, 'reporting': 4, 'marketing': 4, 'teams': 3, 'customers': 3, 'days': 3, 'owning': 2, 'portfolio': 2, ...}),
 FreqDist({'capacity': 5, 'multiple': 4, 'global': 4, 'production': 4, 'business': 3, 'data': 3, 'planning': 3, 'worldwide': 3, 'ability': 3, 'analyst': 2, ...}),
 FreqDist({'research': 3, 'findings': 3, 'perform': 2, 'project': 2, 'clients': 2, 'data': 1, 'policy': 1, 'analysts': 1, 'wide': 1, 'array': 1, ...}),
 FreqDist({'data': 6, 'business': 5, 'insights': 5, 'analytics': 4, 'work': 3, 'player': 3, 'call': 2, 'duty': 2, 'mobile': 2, 'activision': 2, ...}),
 FreqDist({'business': 11, 'data': 7, 'sales': 6, 'fi

Let's simplify and find the 10 most common words in each job description.

In [148]:
# To find the frequency of top 10 words
desc_most_common = [fdist.most_common(10) for fdist in job_desc_freq]
desc_most_common

[[('dfeh', 5),
  ('regulations', 5),
  ('plans', 5),
  ('services', 5),
  ('state', 5),
  ('contract', 5),
  ('purchase', 5),
  ('maintain', 5),
  ('evacuation', 4),
  ('coordinate', 4)],
 [('research', 3),
  ('statistical', 3),
  ('data', 2),
  ('perform', 2),
  ('analyses', 2),
  ('policy', 1),
  ('analysts', 1),
  ('wide', 1),
  ('array', 1),
  ('functions', 1)],
 [('data', 10),
  ('customer', 8),
  ('within', 5),
  ('reporting', 4),
  ('marketing', 4),
  ('teams', 3),
  ('customers', 3),
  ('days', 3),
  ('owning', 2),
  ('portfolio', 2)],
 [('capacity', 5),
  ('multiple', 4),
  ('global', 4),
  ('production', 4),
  ('business', 3),
  ('data', 3),
  ('planning', 3),
  ('worldwide', 3),
  ('ability', 3),
  ('analyst', 2)],
 [('research', 3),
  ('findings', 3),
  ('perform', 2),
  ('project', 2),
  ('clients', 2),
  ('data', 1),
  ('policy', 1),
  ('analysts', 1),
  ('wide', 1),
  ('array', 1)],
 [('data', 6),
  ('business', 5),
  ('insights', 5),
  ('analytics', 4),
  ('work', 3),
 

## Lemmatization

In [149]:
## Lemmatization
# Instantiate the lemmatizer (WordNetLemmatizer was imported at the top)
lemmatizer = WordNetLemmatizer()

# Add lemmatized column
job_desc_clean['desc_lemmatized'] = job_desc_clean.desc_tokenized_nostop.apply(lambda x: word_lemmatizer(x))

job_desc_clean.head()

|  | description | desc_clean | ... | desc_tokenized_nostop | desc_lemmatized |
| --- | --- | --- | --- | --- | --- |
| 0 | 30% Ensure that the DFEH complies with all OSHA/Cal-OSHA Regulations, in part by maintaining fam... | 30 ensure that the dfeh complies with all osha cal osha regulations in part by maintaining famil... | ... | [30, ensure, dfeh, complies, osha, cal, osha, regulations, part, maintaining, familiarity, curre... | [30, ensure, dfeh, complies, osha, cal, osha, regulation, part, maintain, familiarity, current, ... |
| 1 | Data and Policy Analysts perform a wide array of functions as part of the research process. Thos... | data and policy analysts perform a wide array of functions as part of the research process those... | ... | [data, policy, analysts, perform, wide, array, functions, part, research, process, applicants, i... | [data, policy, analyst, perform, wide, array, function, part, research, process, applicant, inte... |
| 2 | Lead BI Engineers are responsible for owning approximately 25% of our reporting portfolio - owni... | lead bi engineers are responsible for owning approximately 25 of our reporting portfolio ownin... | ... | [lead, bi, engineers, responsible, owning, approximately, 25, reporting, portfolio, owning, exec... | [lead, bi, engineer, responsible, own, approximately, 25, reporting, portfolio, own, exec, level... |
| 3 | We are looking for an exceptional analyst who can diagnose and solve complex business problems t... | we are looking for an exceptional analyst who can diagnose and solve complex business problems t... | ... | [looking, exceptional, analyst, diagnose, solve, complex, business, problems, data, analysis, sc... | [look, exceptional, analyst, diagnose, solve, complex, business, problem, data, analysis, scenar... |
| 4 | Data and Policy Analysts perform a wide array of functions as part of the research process. Thos... | data and policy analysts perform a wide array of functions as part of the research process those... | ... | [data, policy, analysts, perform, wide, array, functions, part, research, process, applicants, i... | [data, policy, analyst, perform, wide, array, function, part, research, process, applicant, inte... |


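You might notice that the lemmatization isn't completely accurate, e.g. *complies* is not lemmatized to *comply* and *reporting* does not become *report*. 
To be honest, I am not completely sure why this happens. The *part-of-speech* tagging is not tagging as expected in a number of cases. One likely culprit is that `get_wordnet_pos()` hands `pos_tag` a single word at a time, so the tagger has no surrounding context to work with and may label a verb as a noun. That is something I will have to investigate in the future.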
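
One thing I would try is tagging the whole token list in a single call, so the tagger sees neighbouring words, and only then lemmatizing. Below is a rough, untested-on-this-data sketch of that idea. `lemmatize_in_context` is a hypothetical helper, and `fake_tagger` and `fake_lemmatize` are stand-ins so the example runs without NLTK data; in practice you would pass `nltk.pos_tag` and `WordNetLemmatizer().lemmatize` instead.

```python
# WordNet part-of-speech codes (the values behind wordnet.ADJ, NOUN, VERB, ADV)
PENN_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def lemmatize_in_context(tokens, tag_tokens, lemmatize):
    """Tag the full token list at once, then lemmatize each word with its tag.

    tag_tokens: callable mapping a token list to (token, tag) pairs, e.g. nltk.pos_tag
    lemmatize:  callable taking (word, pos), e.g. WordNetLemmatizer().lemmatize
    """
    tagged = tag_tokens(tokens)
    # fall back to noun when the tag's first letter is not one we map
    return [lemmatize(word, PENN_TO_WORDNET.get(tag[:1].upper(), "n"))
            for word, tag in tagged]

# Stand-ins (assumptions) so this runs without NLTK; real output depends on the tagger
def fake_tagger(tokens):
    tags = {"complies": "VBZ", "regulations": "NNS"}
    return [(t, tags.get(t, "NN")) for t in tokens]

def fake_lemmatize(word, pos):
    table = {("complies", "v"): "comply", ("regulations", "n"): "regulation"}
    return table.get((word, pos), word)

result = lemmatize_in_context(["dfeh", "complies", "regulations"],
                              fake_tagger, fake_lemmatize)
print(result)  # ['dfeh', 'comply', 'regulation']
```

Since stop words carry a lot of grammatical context, tagging before removing them might help further, though I have not verified that here.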

## Count word frequencies

So, what words occur most frequently?

Well we have a number of different text columns which we can look at. First let's simply look at the cleaned text.

In [150]:
# Count word frequencies
freq = pd.Series(' '.join(job_desc_clean['desc_clean']).split()).value_counts()[:10]
freq

and         288
to          118
the         108
of           89
data         82
in           51
with         48
for          45
business     39
our          37
dtype: int64

Notice that we get a lot of stop words. This is one reason why we remove them. 
So what about the cleaned text without the stop words?

In [151]:
# Count word freq w/o stop words
freq_nostop = pd.Series(' '.join(job_desc_clean['desc_clean_nostop']).split()).value_counts()

freq_nostop

data              82
business          39
analysis          22
work              21
reports           17
                  ..
idea               1
inspections        1
approach           1
transportation     1
bar                1
Length: 1111, dtype: int64

Now we're starting to get a little insight into the content of the job descriptions: **data**, **business**, **analysis**, and **work** are our most common words. As you might be able to tell, I am looking primarily at jobs which leverage data analysis.

How does this differ when we use lemmatized words instead?

In [152]:
freq_lemma = pd.Series(' '.join(job_desc_clean['desc_lemmatized'].apply(lambda x: ' '.join(x))).split()).value_counts()

freq_lemma

data          82
business      39
analysis      30
team          27
work          27
              ..
capability     1
12             1
good           1
instal         1
40lbs          1
Length: 933, dtype: int64

You might notice that there is a little inconsistency when switching between applying functions to `desc_clean_nostop` and `desc_lemmatized`. This is due to the two being different data types - a string, and a list respectively.
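
One way to smooth over that string-versus-list difference is a small helper that accepts either form. This is only a sketch using the standard library; `count_words` is a hypothetical name and the sample data is made up.

```python
from collections import Counter

def count_words(descriptions):
    """Count words across descriptions given as strings or as token lists."""
    counts = Counter()
    for desc in descriptions:
        # split strings on whitespace; lists are assumed to be tokenized already
        tokens = desc.split() if isinstance(desc, str) else desc
        counts.update(tokens)
    return counts

as_strings = ["data policy data", "research data"]
as_lists = [["data", "policy", "data"], ["research", "data"]]

print(count_words(as_strings) == count_words(as_lists))  # True
print(count_words(as_lists).most_common(1))              # [('data', 3)]
```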

In [153]:
# Select top words for each
job_desc_clean['top_words'] = job_desc_clean.desc_clean_nostop.apply(lambda x: top_words(x, 5))

# Join back to the job data to see each position's most common terms
jobapps_df.iloc[:,0:2].join(job_desc_clean.top_words).head()

|  | Position | Company | top_words |
| --- | --- | --- | --- |
| 0 | Associate Governmental Program Analyst | Fair Employment Agency | [state, purchase, services, regulations, contract] |
| 1 | Data and Policy Analyst - Statistical Programmer | Acumen LLC | [statistical, research, perform, data, analyses] |
| 2 | Lead Business Intelligence Engineer | sweetgreen | [data, customer, within, marketing, reporting] |
| 3 | Capacity Planning Analyst | Beyond Meat | [capacity, production, multiple, global, business] |
| 4 | Data and Policy Analyst - Writer/Coordinator | Acumen LLC | [research, findings, project, perform, clients] |


In [154]:
# What about the top words using lemmatized descriptions?
job_desc_clean['top_words_lemma'] = job_desc_clean.desc_lemmatized.apply(lambda x: ' '.join(x)).apply(lambda x: top_words(x, 5))

jobapps_df.iloc[:,0:2].join(job_desc_clean.top_words_lemma).head()

|  | Position | Company | top_words_lemma |
| --- | --- | --- | --- |
| 0 | Associate Governmental Program Analyst | Fair Employment Agency | [contract, maintain, office, service, dfeh] |
| 1 | Data and Policy Analyst - Statistical Programmer | Acumen LLC | [statistical, research, perform, data, analysis] |
| 2 | Lead Business Intelligence Engineer | sweetgreen | [customer, data, within, reporting, marketing] |
| 3 | Capacity Planning Analyst | Beyond Meat | [capacity, multiple, global, production, business] |
| 4 | Data and Policy Analyst - Writer/Coordinator | Acumen LLC | [finding, research, report, perform, coordinate] |


# Wordcloud

So now that we have our top words, we can make the wordcloud. To do this, we will use the [wordcloud](https://pypi.org/project/wordcloud/) library. You can find a little tutorial on how to use it [here](https://www.geeksforgeeks.org/generating-word-cloud-python/). 

First we convert our word frequencies into a dictionary, and then we pass it to the `make_image()` function.
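
The lemmatized frequencies include a few tokens like `12` and `40lbs` that are not much use in a wordcloud. If you want to drop them before plotting, something like the following would do it. This is an optional extra step that I did not run for the image in this post, and `drop_numeric_tokens` is a hypothetical helper.

```python
def drop_numeric_tokens(freqs):
    """Return a copy of a word -> count dict without tokens that contain digits."""
    return {word: count for word, count in freqs.items()
            if not any(ch.isdigit() for ch in word)}

# Made-up subset of the frequency dictionary
freqs = {"data": 82, "business": 39, "12": 1, "40lbs": 1}
print(drop_numeric_tokens(freqs))  # {'data': 82, 'business': 39}
```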

In [155]:
# Wordcloud
## Convert word frequencies to dictionary
dict_for_wc = freq_lemma.to_dict()

dict_for_wc

{'data': 82,
 'business': 39,
 'analysis': 30,
 'team': 27,
 'work': 27,
 'report': 20,
 'analyst': 19,
 'support': 18,
 'develop': 18,
 'reporting': 16,
 'insight': 16,
 'provide': 16,
 'include': 15,
 'research': 14,
 'project': 13,
 'customer': 13,
 'service': 12,
 'process': 12,
 'policy': 12,
 'analytics': 12,
 'product': 12,
 'use': 12,
 'across': 12,
 'contract': 12,
 'system': 11,
 'related': 11,
 'analyze': 11,
 'program': 11,
 'performance': 10,
 'drive': 10,
 'plan': 10,
 'ensure': 10,
 'manager': 10,
 'operation': 10,
 'management': 10,
 'within': 10,
 'role': 9,
 'organization': 9,
 'prepare': 9,
 'build': 9,
 'need': 9,
 'client': 9,
 'stakeholder': 9,
 'sale': 8,
 'perform': 8,
 'requirement': 8,
 'office': 8,
 'complex': 8,
 'public': 8,
 'maintain': 8,
 'experience': 8,
 'development': 7,
 'model': 7,
 'new': 7,
 'planning': 7,
 'manage': 7,
 'internal': 7,
 'coordinate': 7,
 'financial': 7,
 'conduct': 7,
 'source': 7,
 'multiple': 7,
 'tool': 7,
 'cross': 7,
 'operat

In [156]:
# plot the WordCloud image
# make_image(dict_for_wc, 'charlie_black.png')

To give the wordcloud a specific shape, we need to use an image as a mask. To do this, grab an image online and select only the part of the image you want to be the mask for the wordcloud. I used Photoshop to select the shape I wanted:

![Image after using selection tool in photoshop](charlie.png)

Then fill the shape with black:

![Image filled black](charlie_black.png)
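
Why black? According to the wordcloud documentation, pure white entries in the mask are treated as "masked out" and everything else is free to draw on, so the black shape is where the words end up. If you don't have Photoshop handy, thresholding a grayscale image gets you a similar mask. The sketch below uses a toy array in place of a real image, and `to_mask` and the threshold of 128 are my own assumptions. For a real file you could load the pixels with something like `np.array(Image.open(path).convert("L"))`.

```python
import numpy as np

def to_mask(gray, threshold=128):
    """Return a uint8 mask: 0 (drawable) where pixels are dark, 255 (masked out) elsewhere."""
    gray = np.asarray(gray)
    return np.where(gray < threshold, 0, 255).astype(np.uint8)

# Toy 3x3 "image": a dark plus sign on a light background
gray = np.array([[250,  10, 240],
                 [ 20,   0,  30],
                 [245,  15, 235]])

mask = to_mask(gray)
print(mask)
```

The resulting array can be passed as the `mask` argument to `WordCloud`, just like the one `make_image()` builds from the image file.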

And now we can pass this image to our function `make_image()`.

And here is the result:

![jobbies](wc.png)

Now we can [strap on our job helmets, squeeze down into a job cannon, and fire off into Jobland where jobs grow on jobbies](https://youtu.be/wbq571QME2Y?t=31).