# Pre-Processing Indeed Employee Comments

### Importing The Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import fasttext
import pickle
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
# public import path; the private `stop_words` module is not available in recent scikit-learn releases
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer

pd.options.mode.chained_assignment = None
pd.set_option('display.max_colwidth', 100)


In [2]:
# read_csv accepts a path directly, so no explicit open/close is needed
df = pd.read_csv('indeed_scrape.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,rating,rating_title,rating_description,rating_pros,rating_cons
0,0,4,Design work with their engineering team,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,,
1,1,5,Great work environment,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,,
2,2,5,Nice place to work,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,,
3,3,5,\uma empesa ótima para trabalhar,A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...,,
4,4,5,Amazing work culture,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,,


In [4]:
df.drop('Unnamed: 0', axis=1, inplace=True)

In [5]:
# rating_pros and rating_cons are frequently missing
# a missing rating_title is not a concern, since that column is not used in the analysis
for col in df.columns:
    print(col, df[col].isnull().sum())

rating 0
rating_title 7
rating_description 0
rating_pros 2278
rating_cons 2440


In [6]:
# we'll be using rating and rating_description exclusively for our work. 
rws = df.loc[:, ['rating', 'rating_description']]
rws.head()

Unnamed: 0,rating,rating_description
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...
3,5,A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...


## Data Pre-Processing

Let's begin by expanding any contractions we might have (e.g. "I've" or "I'll"). Keep in mind that splitting on whitespace effectively tokenizes the rating descriptions, but each expanded contraction stays a single token. In other words, "I've" becomes the single token "I have" rather than the two tokens "I" and "have".
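To make that behaviour concrete, here is a minimal sketch that needs no third-party packages. The `TOY_CONTRACTIONS` table and the `expand_word` helper are hypothetical stand-ins for `contractions.fix` and cover only two contractions; the point is only that a word-by-word expansion leaves a multi-word string in a single list slot.

```python
# Hypothetical stand-in for contractions.fix: a two-entry lookup table,
# used only to illustrate the shape of the result.
TOY_CONTRACTIONS = {"I've": "I have", "wasn't": "was not"}

def expand_word(word):
    """Return the expanded form of a contraction, or the word unchanged."""
    return TOY_CONTRACTIONS.get(word, word)

text = "I've heard it wasn't hectic"
expanded = [expand_word(word) for word in text.split()]

print(expanded)
# ['I have', 'heard', 'it', 'was not', 'hectic']
print(len(text.split()), len(expanded))  # 5 5
```

The list length is unchanged by the expansion, which is why a second tokenization pass is needed later.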

In [7]:
rws['no_contract'] = rws['rating_description'].apply(lambda x: [contractions.fix(word) for word in x.split()])
rws.head(10)

Unnamed: 0,rating,rating_description,no_contract
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ..."
3,5,A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...,"[A, melhor, empresa, que, trabalhei., A, viagem, que, ganhei, aos, EUA, -, Moutain, View, para, ..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not..."
6,5,Technically strong people. Google was client and i wasn't directly working for google.. We had ...,"[Technically, strong, people., Google, was, client, and, i, was not, directly, working, for, goo..."
7,5,I was on contract at Google for 7 months and loved every minute of it! The people are great and ...,"[I, was, on, contract, at, Google, for, 7, months, and, loved, every, minute, of, it!, The, peop..."
8,5,Great experience. Great perspective. Google fiber optics. Great place to get your cash and check...,"[Great, experience., Great, perspective., Google, fiber, optics., Great, place, to, get, your, c..."
9,4,"I really enjoyed working there. It was a great environment, had good food free lunch and made go...","[I, really, enjoyed, working, there., It, was, a, great, environment,, had, good, food, free, lu..."


Now that the contractions are expanded, let's join each list back into a string so that a proper tokenizer can split the expanded contractions into two separate tokens instead of one.

In [8]:
rws['rating_description_str'] = [' '.join(map(str, l)) for l in rws['no_contract']]
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...
3,5,A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...,"[A, melhor, empresa, que, trabalhei., A, viagem, que, ganhei, aos, EUA, -, Moutain, View, para, ...",A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...
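The effect of this round trip can be checked on a small hand-written list (the values are illustrative, not taken from the dataset): joining with a space and splitting again turns each expanded contraction into separate words.

```python
# Illustrative list in the same shape as a `no_contract` row:
# each expanded contraction occupies one slot.
no_contract = ["I have", "heard", "it", "was not", "hectic"]

# Same join as in the cell above; map(str, ...) guards against non-string items.
joined = " ".join(map(str, no_contract))
print(joined)
# I have heard it was not hectic

# Any whitespace-based split now sees the expansion as separate words.
resplit = joined.split()
print(resplit)
# ['I', 'have', 'heard', 'it', 'was', 'not', 'hectic']
```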


In [9]:
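# Language detection
# Download the pretrained language-identification model first:
#   https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
pretrained_model = "lid.176.bin"
model = fasttext.load_model(pretrained_model)

langs = []
for sent in rws['rating_description_str']:
    # predict() returns (labels, probabilities); labels look like '__label__en'
    labels, _ = model.predict(sent)
    # strip the prefix instead of slicing by position, so 3-letter codes survive intact
    langs.append(labels[0].replace('__label__', ''))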

rws['langs'] = langs

rws.head()



Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en
3,5,A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...,"[A, melhor, empresa, que, trabalhei., A, viagem, que, ganhei, aos, EUA, -, Moutain, View, para, ...",A melhor empresa que trabalhei. A viagem que ganhei aos EUA - Moutain View para conhecer a sede ...,pt
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en
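fastText reports each language as a label of the form `__label__xx`. The sketch below shows one way to turn such a label into a bare language code without loading the model: the tuples are hand-written in the shape of the first element returned by `model.predict`, and `label_to_code` is a hypothetical helper name. It also shows why extracting the code by fixed character positions only works for two-letter codes.

```python
LABEL_PREFIX = "__label__"

def label_to_code(labels):
    """Strip fastText's label prefix from the top-ranked label."""
    top = labels[0]
    return top[len(LABEL_PREFIX):] if top.startswith(LABEL_PREFIX) else top

# Hand-written tuples shaped like the labels part of a prediction.
print(label_to_code(("__label__en",)))   # en
print(label_to_code(("__label__pt",)))   # pt
print(label_to_code(("__label__ceb",)))  # ceb

# Position-based slicing on the tuple's string form keeps only two characters.
print(str(("__label__ceb",))[11:13])     # ce
```

Because the notebook only keeps rows labelled `en`, both approaches select the same reviews here; the prefix-stripping version simply avoids truncated codes in the `langs` column.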


In [10]:
# keep only the reviews detected as English
rws = rws[rws['langs'] == 'en']
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en


Now let's tokenize as normal, so that the expanded contractions are split into separate tokens.

In [11]:
rws['tokenized'] = rws['rating_description_str'].apply(word_tokenize)
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n..."
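The difference between whitespace splitting and a real tokenizer is visible in the output above, where `adapters.` becomes `adapters` and `.`. As a rough illustration that runs without NLTK data, the sketch below uses a simple regular expression as a stand-in for `word_tokenize`. It is not NLTK's algorithm and will disagree with it on many inputs; `simple_tokenize` is a hypothetical helper.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

text = "Culture is great. I have learned a lot!"

print(text.split())
# ['Culture', 'is', 'great.', 'I', 'have', 'learned', 'a', 'lot!']
print(simple_tokenize(text))
# ['Culture', 'is', 'great', '.', 'I', 'have', 'learned', 'a', 'lot', '!']
```

Separating punctuation into its own tokens is what makes the later punctuation filter effective.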


In [12]:
rws['lower'] = rws['tokenized'].apply(lambda x: [word.lower() for word in x])
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized,lower
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, ., involved, multiple, design, phases, a..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, ., you, are, given, an..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ...","[work, is, responsibility, ., culture, is, great, ., the, hardest, part, of, job, is, that, it, ..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., i, have, le..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, ., fostered, creativity, and, did, n..."


In [13]:
punc = string.punctuation
rws['no_punc'] = rws['lower'].apply(lambda x: [word for word in x if word not in punc])
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized,lower,no_punc
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, ., involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, involved, multiple, design, phases, and,..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, ., you, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, you, are, given, an, o..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ...","[work, is, responsibility, ., culture, is, great, ., the, hardest, part, of, job, is, that, it, ...","[work, is, responsibility, culture, is, great, the, hardest, part, of, job, is, that, it, is, ve..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., i, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, i, have, learn..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, ., fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, fostered, creativity, and, did, not,..."
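One detail worth checking in this step: `string.punctuation` is a plain string, so a membership test against it is a substring test rather than a per-character test. Single marks such as `.` are matched, but a multi-character token such as `...` is only matched if that exact sequence happens to occur inside the constant. The sketch below shows the behaviour and one possible stricter filter; `is_punctuation` is a hypothetical helper, not part of the notebook.

```python
import string

# `in` on a string is a substring test.
print("." in string.punctuation)    # True
print(",-" in string.punctuation)   # True, because ",-" occurs inside the constant
print("..." in string.punctuation)  # False, so an ellipsis token would be kept

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

tokens = ["great", "...", "well-paid", ".", "``", "pay"]
kept = [tok for tok in tokens if not is_punctuation(tok)]
print(kept)
# ['great', 'well-paid', 'pay']
```

Whether the stricter filter is preferable depends on how often multi-character punctuation tokens appear in the reviews.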


In [14]:
stop_words = set(stopwords.words('english'))
rws['stopwords_removed'] = rws['no_punc'].apply(lambda x: [word for word in x if word not in stop_words])
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized,lower,no_punc,stopwords_removed
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, ., involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, involved, multiple, design, phases, and,...","[contracted, design, custom, pitot, test, adapters, involved, multiple, design, phases, prototyp..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, ., you, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, you, are, given, an, o...","[lots, support, collaboration, across, many, engaging, projects, given, opportunity, grow, ideas..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ...","[work, is, responsibility, ., culture, is, great, ., the, hardest, part, of, job, is, that, it, ...","[work, is, responsibility, culture, is, great, the, hardest, part, of, job, is, that, it, is, ve...","[work, responsibility, culture, great, hardest, part, job, hectic, management, good, google, ads..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., i, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, i, have, learn...","[amazing, work, environment, everyone, smart, friendly, learned, lot, great, coworkers, amazing,..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, ., fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, fostered, creativity, and, did, not,...","[productive, innovative, culture, environment, fostered, creativity, limit, potential, positive,..."
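One side effect is visible in the last row above: NLTK's English stop-word list contains negations such as `not` and `no`, so `did, not, limit` shrinks to `limit`. If negation matters for later analysis, a possible variation is to subtract a few words from the list before filtering. The sketch uses a tiny hand-written stand-in set (`TOY_STOP_WORDS`) and illustrative tokens so that it runs without NLTK data; `KEEP` is an assumption about which words are worth retaining.

```python
# Hand-written stand-in for a few entries of the NLTK English list.
TOY_STOP_WORDS = {"a", "and", "did", "not", "your", "the", "is"}
KEEP = {"not", "no"}  # assumption: negations are worth keeping

# Compute the reduced list once, outside the comprehension.
reduced_stop_words = TOY_STOP_WORDS - KEEP

tokens = ["fostered", "creativity", "and", "did", "not", "limit", "your", "potential"]

plain = [t for t in tokens if t not in TOY_STOP_WORDS]
keep_negation = [t for t in tokens if t not in reduced_stop_words]

print(plain)          # ['fostered', 'creativity', 'limit', 'potential']
print(keep_negation)  # ['fostered', 'creativity', 'not', 'limit', 'potential']
```

With the real list, the same set difference could be applied to `stop_words` before the filtering cell runs.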


In [15]:
rws['pos_tags'] = rws['stopwords_removed'].apply(nltk.tag.pos_tag)
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized,lower,no_punc,stopwords_removed,pos_tags
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, ., involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, involved, multiple, design, phases, and,...","[contracted, design, custom, pitot, test, adapters, involved, multiple, design, phases, prototyp...","[(contracted, VBN), (design, NN), (custom, NN), (pitot, JJ), (test, NN), (adapters, NNS), (invol..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, ., you, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, you, are, given, an, o...","[lots, support, collaboration, across, many, engaging, projects, given, opportunity, grow, ideas...","[(lots, NNS), (support, NN), (collaboration, NN), (across, IN), (many, JJ), (engaging, VBG), (pr..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ...","[work, is, responsibility, ., culture, is, great, ., the, hardest, part, of, job, is, that, it, ...","[work, is, responsibility, culture, is, great, the, hardest, part, of, job, is, that, it, is, ve...","[work, responsibility, culture, great, hardest, part, job, hectic, management, good, google, ads...","[(work, NN), (responsibility, NN), (culture, NN), (great, JJ), (hardest, JJS), (part, NN), (job,..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., i, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, i, have, learn...","[amazing, work, environment, everyone, smart, friendly, learned, lot, great, coworkers, amazing,...","[(amazing, VBG), (work, NN), (environment, NN), (everyone, NN), (smart, JJ), (friendly, RB), (le..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, ., fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, fostered, creativity, and, did, not,...","[productive, innovative, culture, environment, fostered, creativity, limit, potential, positive,...","[(productive, JJ), (innovative, JJ), (culture, NN), (environment, NN), (fostered, VBD), (creativ..."


In [16]:
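def get_wordnet_pos(tag):
    """Map a Penn Treebank POS tag to the WordNet POS constant the lemmatizer expects."""
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        # any other tag (e.g. IN, CD) falls back to noun, the lemmatizer's default
        return wordnet.NOUN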
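For reference, the WordNet constants returned by this function are single-character strings (`'a'`, `'v'`, `'n'`, `'r'`), which is what appears in the `wordnet_pos` column. The sketch below restates the same mapping with those literal values so that it runs without loading the WordNet corpus; `PENN_TO_WORDNET` and `to_wordnet_pos` are illustrative names, not part of the notebook.

```python
# Literal values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
PENN_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PENN_TO_WORDNET.get(tag[:1], "n")

tagged = [("contracted", "VBN"), ("hardest", "JJS"), ("friendly", "RB"), ("across", "IN")]
mapped = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(mapped)
# [('contracted', 'v'), ('hardest', 'a'), ('friendly', 'r'), ('across', 'n')]
```

The noun fallback matters for tags such as `IN`, which have no WordNet counterpart.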

In [17]:
rws['wordnet_pos'] = rws['pos_tags'].apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
rws.head()

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized,lower,no_punc,stopwords_removed,pos_tags,wordnet_pos
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, ., involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, involved, multiple, design, phases, and,...","[contracted, design, custom, pitot, test, adapters, involved, multiple, design, phases, prototyp...","[(contracted, VBN), (design, NN), (custom, NN), (pitot, JJ), (test, NN), (adapters, NNS), (invol...","[(contracted, v), (design, n), (custom, n), (pitot, a), (test, n), (adapters, n), (involved, v),..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, ., you, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, you, are, given, an, o...","[lots, support, collaboration, across, many, engaging, projects, given, opportunity, grow, ideas...","[(lots, NNS), (support, NN), (collaboration, NN), (across, IN), (many, JJ), (engaging, VBG), (pr...","[(lots, n), (support, n), (collaboration, n), (across, n), (many, a), (engaging, v), (projects, ..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ...","[work, is, responsibility, ., culture, is, great, ., the, hardest, part, of, job, is, that, it, ...","[work, is, responsibility, culture, is, great, the, hardest, part, of, job, is, that, it, is, ve...","[work, responsibility, culture, great, hardest, part, job, hectic, management, good, google, ads...","[(work, NN), (responsibility, NN), (culture, NN), (great, JJ), (hardest, JJS), (part, NN), (job,...","[(work, n), (responsibility, n), (culture, n), (great, a), (hardest, a), (part, n), (job, n), (h..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., i, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, i, have, learn...","[amazing, work, environment, everyone, smart, friendly, learned, lot, great, coworkers, amazing,...","[(amazing, VBG), (work, NN), (environment, NN), (everyone, NN), (smart, JJ), (friendly, RB), (le...","[(amazing, v), (work, n), (environment, n), (everyone, n), (smart, a), (friendly, r), (learned, ..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, ., fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, fostered, creativity, and, did, not,...","[productive, innovative, culture, environment, fostered, creativity, limit, potential, positive,...","[(productive, JJ), (innovative, JJ), (culture, NN), (environment, NN), (fostered, VBD), (creativ...","[(productive, a), (innovative, a), (culture, n), (environment, n), (fostered, v), (creativity, n..."


In [18]:
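# lemmatize each remaining token, passing its WordNet POS tag so that
# verbs and adjectives are reduced correctly (e.g. amazing -> amaze, hardest -> hard)
wnl = WordNetLemmatizer()
rws['lemmatized'] = rws['wordnet_pos'].apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
rws.head()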

Unnamed: 0,rating,rating_description,no_contract,rating_description_str,langs,tokenized,lower,no_punc,stopwords_removed,pos_tags,wordnet_pos,lemmatized
0,4,Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,"[Contracted, to, design, custom, pitot, test, adapters., Involved, multiple, design, phases, and...",Contracted to design custom pitot test adapters. Involved multiple design phases and prototypes....,en,"[Contracted, to, design, custom, pitot, test, adapters, ., Involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, ., involved, multiple, design, phases, a...","[contracted, to, design, custom, pitot, test, adapters, involved, multiple, design, phases, and,...","[contracted, design, custom, pitot, test, adapters, involved, multiple, design, phases, prototyp...","[(contracted, VBN), (design, NN), (custom, NN), (pitot, JJ), (test, NN), (adapters, NNS), (invol...","[(contracted, v), (design, n), (custom, n), (pitot, a), (test, n), (adapters, n), (involved, v),...","[contract, design, custom, pitot, test, adapter, involve, multiple, design, phase, prototypes, f..."
1,5,Lots of support and collaboration across many engaging projects. You are given an opportunity to...,"[Lots, of, support, and, collaboration, across, many, engaging, projects., You, are, given, an, ...",Lots of support and collaboration across many engaging projects. You are given an opportunity to...,en,"[Lots, of, support, and, collaboration, across, many, engaging, projects, ., You, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, ., you, are, given, an...","[lots, of, support, and, collaboration, across, many, engaging, projects, you, are, given, an, o...","[lots, support, collaboration, across, many, engaging, projects, given, opportunity, grow, ideas...","[(lots, NNS), (support, NN), (collaboration, NN), (across, IN), (many, JJ), (engaging, VBG), (pr...","[(lots, n), (support, n), (collaboration, n), (across, n), (many, a), (engaging, v), (projects, ...","[lot, support, collaboration, across, many, engage, project, give, opportunity, grow, idea, resp..."
2,5,Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,"[Work, is, responsibility., Culture, is, great., The, Hardest, part, of, job, is, that, it, is, ...",Work is responsibility. Culture is great. The Hardest part of job is that it is very hectic. Man...,en,"[Work, is, responsibility, ., Culture, is, great, ., The, Hardest, part, of, job, is, that, it, ...","[work, is, responsibility, ., culture, is, great, ., the, hardest, part, of, job, is, that, it, ...","[work, is, responsibility, culture, is, great, the, hardest, part, of, job, is, that, it, is, ve...","[work, responsibility, culture, great, hardest, part, job, hectic, management, good, google, ads...","[(work, NN), (responsibility, NN), (culture, NN), (great, JJ), (hardest, JJS), (part, NN), (job,...","[(work, n), (responsibility, n), (culture, n), (great, a), (hardest, a), (part, n), (job, n), (h...","[work, responsibility, culture, great, hard, part, job, hectic, management, good, google, ad, le..."
4,5,An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly., I, have, lear...",An amazing work environment where everyone is very smart and friendly. I have learned a lot from...,en,"[An, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., I, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, ., i, have, le...","[an, amazing, work, environment, where, everyone, is, very, smart, and, friendly, i, have, learn...","[amazing, work, environment, everyone, smart, friendly, learned, lot, great, coworkers, amazing,...","[(amazing, VBG), (work, NN), (environment, NN), (everyone, NN), (smart, JJ), (friendly, RB), (le...","[(amazing, v), (work, n), (environment, n), (everyone, n), (smart, a), (friendly, r), (learned, ...","[amaze, work, environment, everyone, smart, friendly, learn, lot, great, coworkers, amaze, manag..."
5,5,A productive and innovative culture and environment. Fostered creativity and did not limit your ...,"[A, productive, and, innovative, culture, and, environment., Fostered, creativity, and, did, not...",A productive and innovative culture and environment. Fostered creativity and did not limit your ...,en,"[A, productive, and, innovative, culture, and, environment, ., Fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, ., fostered, creativity, and, did, n...","[a, productive, and, innovative, culture, and, environment, fostered, creativity, and, did, not,...","[productive, innovative, culture, environment, fostered, creativity, limit, potential, positive,...","[(productive, JJ), (innovative, JJ), (culture, NN), (environment, NN), (fostered, VBD), (creativ...","[(productive, a), (innovative, a), (culture, n), (environment, n), (fostered, v), (creativity, n...","[productive, innovative, culture, environment, foster, creativity, limit, potential, positive, t..."
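
The `wordnet_pos` column consumed by the lemmatizer holds WordNet tags (`n`, `v`, `a`, `r`) rather than the Penn Treebank tags stored in `pos_tags`. The following is only a sketch of one common way to do that conversion, not necessarily the one used in this notebook. The helper name `penn_to_wordnet` is hypothetical. The fallback to noun is an assumption, though it matches the output shown here, where `(across, IN)` becomes `(across, n)`.

```python
# hypothetical helper: map a Penn Treebank tag to a WordNet POS letter
def penn_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, ...
    if tag.startswith('R'):
        return 'r'  # adverbs: RB, RBR, RBS
    return 'n'      # assumed default: treat everything else as a noun

# a few (word, tag) pairs taken from the pos_tags output
pos_tags = [('amazing', 'VBG'), ('work', 'NN'), ('friendly', 'RB'),
            ('hardest', 'JJS'), ('across', 'IN')]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in pos_tags]
print(wordnet_pos)
```

The tag matters because `WordNetLemmatizer.lemmatize` treats a word as a noun when no `pos` argument is given. Under that default, a verb form such as `amazing` would typically not be reduced to `amaze`.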


In [19]:
# persist the pre-processed dataframe so later steps can load it without re-running the pipeline
with open('indeed_scrape_clean.pkl', 'wb') as pickle_file:
    pickle.dump(rws, pickle_file)
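
A quick way to confirm that a pickled object can be read back is a round trip like the one sketched below. It uses a small stand-in dictionary and a temporary file, both of which are assumptions for illustration, so the real `indeed_scrape_clean.pkl` is left untouched.

```python
import os
import pickle
import tempfile

# stand-in for a slice of the cleaned data; the notebook itself pickles the whole dataframe
sample = {'rating': [4, 5],
          'lemmatized': [['contract', 'design'], ['lot', 'support']]}

# write to a temporary location rather than the real output file
path = os.path.join(tempfile.mkdtemp(), 'sample_clean.pkl')
with open(path, 'wb') as pickle_file:
    pickle.dump(sample, pickle_file)

# read it back and check that nothing changed
with open(path, 'rb') as pickle_file:
    restored = pickle.load(pickle_file)

print(restored == sample)
```

For the dataframe itself, `pd.read_pickle('indeed_scrape_clean.pkl')` should load the saved frame in a later notebook, provided a compatible pandas version is installed.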