## An introduction to this project

### What is the goal of this research?
The goal of this research is to help jobseekers in the niche area of marketing research gain a clear and current understanding of how they should present themselves to be attractive to the "global" employer.

### What are the objectives of this research?
1.   Determine the key competencies that are needed in Marketing Researchers
2.   Determine the general characteristics of employers seeking this skillset
3.   Track changes in employer requirements and Marketing Researcher competencies over time

### What is the data source and what are the data characteristics?
*   This data is the first 250 results from the Indeed job search engine for the query 'Marketing research'
*   The name of each .CSV file records the date and time at which that round of data mining was completed
*   This data is dynamic: the listings change over time, so each round of mined data has to be saved for later analysis.
*   In its raw state, this data contains many HTML elements that have to be removed before the text content can be extracted.






### What are some preliminary assumptions that we may have about this data?

1.   The job ad was posted by an HR professional
2.   The job description in each job advert will match the actual day-to-day work required
3.   The more frequently a word denoting a skill occurs, the higher the demand for that particular skill
4.   Marketing research is an interdisciplinary field; therefore, employers require competencies for this role that are drawn from different fields.
5.   The requirements for this role apply to all employers around the world (the employers under study are all from the United States, or their headquarters are in the United States; even so, we will use these listings as a global benchmark for the time being).

## Data mining

In [0]:
import requests
from bs4 import BeautifulSoup
import html5lib
import pandas as pd
import numpy as np
import base64
import datetime

In [0]:
data = []

pgs = ['l=']
pgs.extend(range(0, 1000))

for pg in pgs:
  if pg != 'l=':
    r = requests.get('https://www.indeed.com/jobs?q=Marketing+research&start='+str(pg)+'0')
  else:
    r = requests.get('https://www.indeed.com/jobs?q=Marketing+research&'+str(pg))
  soup = BeautifulSoup(r.content, "html5lib")
  info_block = soup.find_all("a", attrs={"class":"jobtitle turnstileLink "})
  test_info_block = str(info_block)
  new = test_info_block.split(' href="', 10)
  
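  # Loop over the pieces actually returned by the split, so that a page with
  # fewer than ten links does not raise an IndexError
  for piece in new:
    new_new = piece.split('" id=', 1)
    link = new_new[0].replace('/rc/clk', 'https://www.indeed.com/viewjob')
    data.append(link)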
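
The paginated search URLs requested above can also be assembled from their query parameters instead of by string concatenation. The following is a minimal sketch that uses only the standard library; `BASE_URL` and `search_page_url` are hypothetical names introduced for illustration, while the `q` and `start` parameters are the ones already used in the cell above.

```python
from urllib.parse import urlencode

# Search endpoint taken from the requests made above
BASE_URL = 'https://www.indeed.com/jobs'

def search_page_url(query, start):
    """Return the search URL for one results page (hypothetical helper)."""
    # urlencode escapes the query string, turning spaces into '+'
    return BASE_URL + '?' + urlencode({'q': query, 'start': start})

# Offsets 0, 10 and 20 correspond to the first three result pages
urls = [search_page_url('Marketing research', start) for start in range(0, 30, 10)]
for url in urls:
    print(url)
```

Building the URL this way keeps the escaping of the query in one place, which may help if the search term is changed later.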

In [3]:
data_df = pd.DataFrame(data)
data_df.columns = ['links']
data_df = data_df[~data_df.links.str.startswith('[<a class')]
data_df = data_df[~data_df.links.str.startswith('/pagead/clk?')]
data_df = data_df[~data_df.links.str.startswith('/company/')]
print(len(data_df.links.unique()), ' unique links found from a dataframe of shape ', data_df.shape)

924  unique links found from a dataframe of shape  (10007, 1)


In [0]:
a = data_df.links.unique()
new_data = []
# Fetch each unique listing and keep its job-description block
for url in a:
  new_r = requests.get(url)
  new_soup = BeautifulSoup(new_r.content, "html5lib")
  new_info_block = new_soup.find_all("div", attrs={"class":"jobsearch-jobDescriptionText"})
  new_data.append(new_info_block)

In [5]:
new_data_df = pd.DataFrame(new_data)
new_data_df = new_data_df.astype('str')

stop_time = str(datetime.datetime.now())
print('Mining completed at ', stop_time)
new_data_df.shape

Mining completed at  2020-01-13 09:19:40.395290


(924, 1)

## Data loading and initial transformation into raw content

In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
nltk.download('stopwords')
nltk.download('punkt')

!pip install unidecode
import unidecode

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.1.1


In [0]:
# Loading data...
web_content = new_data_df

In [0]:
# We start by renaming the single column to 'text'
web_content.columns = ['text']

In [60]:
# Remove any accents present (assigning the whole column at once avoids
# the SettingWithCopyWarning raised by element-wise chained assignment)
web_content['text'] = web_content['text'].apply(unidecode.unidecode)

# Strip symbols, i.e. anything that is neither a word character nor whitespace
fillers = [r'[^\w\s]']

for filler in fillers:
  web_content['text'] = web_content['text'].str.replace(filler, ' ', regex=True)

# Make everything lowercase
web_content['text'] = web_content['text'].str.lower()


In [0]:
# Clean the text by replacing html denoters, tags and repeated spaces with a single space
tags = [' div ', ' class ', ' jobsearch ', ' jobdescriptiontext ', ' id ',
        ' p ', '\n', ' b ', ' ul ', ' li ', ' br ', ' h3 ', ' jobsectionheader ']

for tag in tags:
  # The tags are literal strings, so regex matching is switched off
  web_content['text'] = web_content['text'].str.replace(tag, ' ', regex=False)

# Collapse every run of two or more spaces into one
web_content['text'] = web_content['text'].str.replace(r' {2,}', ' ', regex=True)
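
The cleaning above works on the stringified HTML, so tag and attribute names are removed as if they were ordinary words; a genuine occurrence of a word such as "id" in a job description would be removed as well. One possible alternative is to keep only the text nodes of the HTML before any further cleaning. The sketch below uses the standard-library `html.parser` for this; `TextExtractor` and `html_to_text` are hypothetical names, and the sample fragment is made up for illustration.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(fragment):
    """Return lowercase text with symbols removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    text = ' '.join(parser.chunks)
    text = re.sub(r'[^\w\s]', ' ', text)   # strip symbols, as in the cell above
    return re.sub(r'\s+', ' ', text).strip().lower()

# Made-up fragment that mimics the structure of a job description block
sample = ('<div class="jobsearch-jobDescriptionText"><p>Market <b>Research</b> '
          'Analyst</p><ul><li>SQL, Excel</li></ul></div>')
print(html_to_text(sample))
```

Because the tags never reach the text, no list of tag names has to be maintained. BeautifulSoup, which the notebook already imports, offers `get_text()` for the same purpose.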

In [0]:
# Tokenize words
tokenized_text = [word_tokenize(i) for i in  web_content['text']]
web_content.loc[:, 'tokenized_text'] = tokenized_text

allWords = []
for wordList in web_content['tokenized_text']:
  allWords += wordList

# Get their frequency
dataset_Fdist = nltk.FreqDist(allWords)
# and save it as a frequency distribution table i.e. a df
freq_df = pd.DataFrame.from_dict(dataset_Fdist, orient='index')
freq_df.reset_index(level=0, inplace=True)
freq_df.columns = ['word','frequency']

# then pick the most common words
dataset_mode = dataset_Fdist.most_common(100)
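
The frequency table above counts every token, including function words such as "the" and "for", although the NLTK stopword list was downloaded earlier. A minimal sketch of how such words could be filtered out before counting is given below. `STOPWORDS` is a shortened, hand-written stand-in (an assumption made so that the example runs without downloads); in the notebook it could be replaced by `nltk.corpus.stopwords.words('english')`. `content_word_counts` is a hypothetical helper name.

```python
from collections import Counter

# Shortened stand-in for the NLTK English stopword list (assumption)
STOPWORDS = {'the', 'a', 'for', 'at', 'is', 'to', 'of', 'and'}

def content_word_counts(tokens, stopwords=STOPWORDS):
    """Count alphabetic tokens that are not stopwords."""
    return Counter(t for t in tokens if t.isalpha() and t not in stopwords)

# Tokens taken from the start of the first listing shown in this notebook
tokens = 'caring for the world one person at a time'.split()
counts = content_word_counts(tokens)
print(counts.most_common(3))
```

Filtering before counting keeps the most common entries focused on content words, which is closer to the stated aim of identifying skills.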

In [12]:
#Download this table to visualize offline:
from IPython.display import HTML
import base64 

def create_download_link( df, title = "Download CSV file", filename = "data.csv"):  
    csv = df.to_csv(index =True)
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

name = 'search_data mined ' + stop_time + '.csv'
create_download_link(freq_df, filename=name)
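
The file name above is built from `str(datetime.datetime.now())`, which contains spaces and colons; colons are not allowed in Windows file names. A possible alternative, sketched below, formats the timestamp explicitly. `mined_filename` is a hypothetical helper, and the date passed to it is simply the mining time reported earlier in this notebook, truncated to seconds.

```python
import datetime

def mined_filename(prefix, when):
    """Build a CSV file name that records when a mining round finished."""
    # Year-month-day, then hours, minutes and seconds without separators
    return f"{prefix}_{when:%Y-%m-%d_%H%M%S}.csv"

example = mined_filename('search_data', datetime.datetime(2020, 1, 13, 9, 19, 40))
print(example)
```

File names in this form also sort chronologically, which may be convenient when several rounds of mined data are compared over time.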

In [13]:
web_content.head(10)

|   | text | tokenized_text |
|---|------|----------------|
| 0 | caring for the world one person at a time ... | [caring, for, the, world, one, person, at, a, ... |
| 1 | facebook s mission is to give people the p... | [facebook, s, mission, is, to, give, people, t... |
| 2 | what makes microsoft a great place for mar... | [what, makes, microsoft, a, great, place, for,... |
| 3 | facebook s mission is to give people the p... | [facebook, s, mission, is, to, give, people, t... |
| 4 | intern at one of the most interesting in... | [intern, at, one, of, the, most, interesting, ... |
| 5 | facebook s mission is to give people the p... | [facebook, s, mission, is, to, give, people, t... |
| 6 | facebook s mission is to give people the p... | [facebook, s, mission, is, to, give, people, t... |
| 7 | we are the espn brand marketing team ... | [we, are, the, espn, brand, marketing, team, a... |
| 8 | summary posted aug 27 2019 weekly ho... | [summary, posted, aug, 27, 2019, weekly, hours... |
| 9 | facebook s mission is to give people the p... | [facebook, s, mission, is, to, give, people, t... |


## EDA: Top 20 Nouns & their Neighbours

In [0]:
# We start by examining word neighbours for the key words
# Let's create a separate dataset examining this:
word_neighbour = web_content['text']

In [0]:
# Save individual jds:
jd_df = pd.DataFrame(word_neighbour)

In [0]:
# Now we can see the word neighbours of the top 20 nouns:
terms = ['marketing', 'research' , 'experience', 'work', 'team',
         'skills', 'business', 'ability', 'data', 'management',
         'media', 'development', 'product', 'support', 'market',
         'job',  'position', 'content', 'projects', 'knowledge']

for term in terms:
  pattern_a = ' ' + term + ' '
  pattern_b = '-' + term + '-'
  word_neighbour = word_neighbour.str.replace(pattern_a, pattern_b)

# Finally, remove the "words" -- and -
word_neighbour = word_neighbour.str.replace(' -- ', '')
word_neighbour = word_neighbour.str.replace(' - ', '')
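
The replacements above glue each term to its neighbours with hyphens. Because every replacement consumes the spaces around the term, two terms that stand next to each other (for example "marketing research") cannot both be matched. A different way to look at neighbours, sketched here under the assumption that the text is already tokenized, is to count the word directly before and after each occurrence of a term. `neighbour_counts` is a hypothetical helper and the sentence is made up for illustration.

```python
from collections import Counter

def neighbour_counts(tokens, term):
    """Count the words immediately before and after each occurrence of term."""
    before, after = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != term:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i < len(tokens) - 1:
            after[tokens[i + 1]] += 1
    return before, after

# Made-up sentence in which 'research' appears three times
tokens = ('we need market research skills and research experience '
          'in market research').split()
before, after = neighbour_counts(tokens, 'research')
print('before:', before.most_common())
print('after: ', after.most_common())
```

Keeping the left and right neighbours in separate counters makes it possible to tell "market research" apart from "research experience" when the results are reviewed.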

In [0]:
# Tokenize!
wn_tokenized_text = [word_tokenize(i) for i in  word_neighbour]
# Build the dataframe directly from the tokenized listings
word_neighbour_df = pd.DataFrame({'word_neighbour_tokenized_text': wn_tokenized_text})

wn_allWords = []
for wordList in word_neighbour_df['word_neighbour_tokenized_text']:
  wn_allWords += wordList

# Get their frequency
wn_dataset_Fdist = nltk.FreqDist(wn_allWords)
# and save it as a frequency distribution table i.e. a df
word_neighbour_freq_df = pd.DataFrame.from_dict(wn_dataset_Fdist, orient='index')
word_neighbour_freq_df.reset_index(level=0, inplace=True)
word_neighbour_freq_df.columns = ['word','frequency']

In [37]:
# Download to view offline:
name2 = 'wn_data mined ' + stop_time + '.csv'
create_download_link(word_neighbour_freq_df, filename=name2)

## EDA: Top 20 Noun Frequency in Job Ad

In [0]:
# Count occurrences of each of the top 20 nouns in each listing:
for term in terms:
  term_counter = jd_df['text'].str.count(' '+term+' ')
  string = term + '_wc'
  jd_df.loc[:, string] = term_counter
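
Counting a term with a space on either side misses occurrences at the very start or end of a listing, and it undercounts a term that is repeated back to back, because the matches cannot share a space. The sketch below compares that approach with a word-boundary pattern on a made-up string; `count_term` is a hypothetical helper. Since `Series.str.count` accepts a regular expression, the same pattern could in principle be passed to it.

```python
import re

def count_term(text, term):
    """Count whole-word occurrences of term using word boundaries."""
    return len(re.findall(r'\b' + re.escape(term) + r'\b', text))

# Made-up text with the term at the start, in the middle and at the end
text = 'research research and market research'

spaced = text.count(' research ')      # space-delimited matching
bounded = count_term(text, 'research')  # word-boundary matching
print(spaced, bounded)
```

The difference matters most for short listings, where a term at the start or end of the text is a larger share of all occurrences.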

In [92]:
# Download the jd dataframe
name3 = 'jd_data mined ' + stop_time + '.csv'
create_download_link(jd_df, filename=name3)
