<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [3]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text data in the description column is still messy - full of html tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task. 

In [4]:
#Read the csv into a dataframe, get the shape

df = pd.read_csv('data/job_listings.csv')
df.shape

(426, 3)

In [5]:
#Look at the head and tail of the df

df.head(10)

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist
5,5,b'<div>Create various Business Intelligence An...,Data Scientist
6,6,b'<div><p>As Spotify Premium swells to over 96...,Associate Data Scientist – Premium Analytics
7,7,"b""Everytown for Gun Safety, the nation's large...",Data Scientist
8,8,"b""<ul><li>MS in a quantitative discipline such...",Sr. Data Scientist
9,9,b'<div><p>Slack is hiring experienced data sci...,"Data Scientist, Lifecyle"


In [6]:
df.tail(10)

Unnamed: 0.1,Unnamed: 0,description,title
416,416,b'<div></div><div><div><div><div><div><div>Los...,Senior Data Scientist
417,417,b'<div><b><i>About the Role...</i></b><br/>\n<...,Data Analyst / Jr. Data Scientist
418,418,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist- Enterprise Product Analytics
419,419,b'<div><ul><li>Bachelor\xe2\x80\x99s or Master...,Data Scientist - Delphi
420,420,"b'<div><div>At Uber, we ignite opportunity by ...","Sr Data Scientist, NLP - Customer Obsession"
421,421,"b""<b>About Us:</b><br/>\nWant to be part of a ...",Senior Data Science Engineer
422,422,"b'<div class=""jobsearch-JobMetadataHeader icl-...",2019 PhD Data Scientist Internship - Forecasti...
423,423,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist - Insurance
424,424,"b""<p></p><div><p>SENIOR DATA SCIENTIST</p><p>\...",Senior Data Scientist
425,425,b'<div></div><div><div><div><div><p>Cerner Int...,Data Scientist


In [7]:
#View a full description

df['description'][0]

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\\n</li><li><p>Master\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><d

In [13]:
#Function that cleans up html tags

def clean_text(text):
    text = BeautifulSoup(ihtml.unescape(text)).text
    text = re.sub(r"\\xe2\\x80\\x99", "", text)
    text = re.sub(r"\xe2\x80\x93", "", text)
    text = re.sub(r"\\xc2\\xa8", "", text)
    text = re.sub(r"\\n", " ", text)
    text = re.sub(r"b\'", " ", text)
    return text

In [14]:
#Clean up one full description to test

clean_text(df['description'][0])

NameError: name 'html' is not defined

In [197]:
#Clean up the entire column, output it into a new column

df['clean_description'] = df['description'].apply(lambda x: clean_text(x))
df.head()

Unnamed: 0.1,Unnamed: 0,description,title,clean_description
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"b""Job Requirements: Conceptual understanding i..."
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"b'Job Description As a Data Scientist 1, you ..."
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,b'As a Data Scientist you will be working on c...
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...


## 2) Use Spacy to tokenize the listings 

In [198]:
nlp = spacy.load('en_core_web_md')

In [199]:
#Use spacy to take out stop words

tokens = []

for doc in nlp.pipe(df['clean_description'], disable=['parser', 'tagger']):
    doc_tokens = []
    for token in doc:
        if (token.is_stop == False) & (token.is_punct == False):
            doc_tokens.append(token.lemma_.lower())
    tokens.append(doc_tokens)
    
df['spacy_tokens'] = tokens

df.head()

Unnamed: 0.1,Unnamed: 0,description,title,clean_description,spacy_tokens
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"b""Job Requirements: Conceptual understanding i...","[b""job, requirements, conceptual, understand, ..."
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"b'Job Description As a Data Scientist 1, you ...","[b'job, description, , data, scientist, 1, he..."
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,b'As a Data Scientist you will be working on c...,"[b'as, data, scientist, work, consult, busines..."
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen...","[b'$4,969, $, 6,756, monthcontractunder, gener..."
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...,"[b'location, usa, \xe2\x80\x93, multiple, loca..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [200]:
data = []

def gather_data(column):
    for row in column:
        data.append(row)
    return data

tokens = gather_data(df['spacy_tokens'])

In [201]:
#Create the transformer
# vect = CountVectorizer(lowercase=False)
vect = CountVectorizer(stop_words='english',
                       ngram_range=(1,2),
                       min_df=3,
                       max_df=0.25,
                       max_features=20)
#Build the vocab
vect.fit(df['clean_description'])

#Transform text
dtm = vect.transform(df['clean_description'])

dtm

<426x20 sparse matrix of type '<class 'numpy.int64'>'
	with 1942 stored elements in Compressed Sparse Row format>

## 4) Visualize the most common word counts

In [202]:
#Get the word counts in matrix form

dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())
print(dtm.shape)
dtm.head()

(426, 20)


Unnamed: 0,benefits,clients,cross,data driven,digital,engineers,functional,global,growth,health,intelligence,internal,lead,level,model,optimization,plus,process,project,visualization
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0
1,0,1,1,0,0,0,1,0,1,0,0,0,0,1,0,0,2,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [203]:
dtm['data driven'].sum()

144

## 5) Use Scikit-Learn's tfidfVectorizer to get a TF-IDF feature matrix

In [204]:
#Instantiate vectorizer object
tfidf = TfidfVectorizer(stop_words='english',
                        ngram_range=(1,2),
                        min_df=3,
                        max_df=0.25)

#Createa a vocabulary and get word counts per document
dtm = tfidf.fit_transform(df['clean_description'])

#Get features names to use as dataframe column headers
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())
print(dtm.shape)
dtm.head()

(426, 12100)


Unnamed: 0,000,000 employees,000 yearthe,04,10,10 time,10 years,100,100 000,100 companies,...,youll opportunity,youll partner,youll work,youre,youre data,youre looking,youre ready,youve,youve worked,yrs
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 6) Create a NearestNeighbor Model. Write the description of your ideal datascience job and query your job listings. 

In [205]:
from sklearn.metrics.pairwise import cosine_similarity

#Calculate distance of TF-IDF Vectors
dist_matrix = cosine_similarity(dtm)
print(dist_matrix.shape)

#Turn it into a dataframe
df_cosine = pd.DataFrame(dist_matrix)
df_cosine.head()

(426, 426)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,416,417,418,419,420,421,422,423,424,425
0,1.0,0.018515,0.0,0.0,0.0,0.049815,0.005967,0.018846,0.021252,0.011684,...,0.032188,0.029858,0.006242,0.019773,0.016934,0.0,0.005607,0.021672,0.030512,0.022926
1,0.018515,1.0,0.037708,0.01729,0.0,0.039791,0.052584,0.040737,0.063558,0.039914,...,0.060049,0.097631,0.010475,0.034526,0.042631,0.045471,0.018047,0.044502,0.052045,0.040193
2,0.0,0.037708,1.0,0.00401,0.014037,0.014103,0.022646,0.021887,0.013719,0.007796,...,0.017142,0.0117,0.013144,0.03106,0.025584,0.01857,0.006458,0.00339,0.042726,0.00912
3,0.0,0.01729,0.00401,1.0,0.0,0.027795,0.004364,0.046463,0.012102,0.028322,...,0.017925,0.020473,0.0,0.026256,0.015509,0.032969,0.005033,0.050179,0.040611,0.072124
4,0.0,0.0,0.014037,0.0,1.0,0.0,0.0,0.012558,0.013185,0.008601,...,0.0,0.015495,0.0,0.028894,0.0,0.0,0.0,0.006952,0.0,0.021582


In [207]:
df_cosine[df_cosine[0] < 1][0].sort_values(ascending=False)[:10]

338    0.102782
115    0.099515
274    0.088402
168    0.086934
276    0.078681
403    0.072875
366    0.070662
199    0.070210
206    0.068376
352    0.068376
Name: 0, dtype: float64

In [214]:
from sklearn.neighbors import NearestNeighbors

#Fit on DTM 
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

#sample a doc from dtm to use as the query point
doc = dtm.iloc[0].values

#Query using kneighbors
nn.kneighbors([doc])

(array([[0.        , 1.33956587, 1.34200255, 1.35025769, 1.35134485]]),
 array([[  0, 338, 115, 274, 168]], dtype=int64))

In [213]:
#Comparing the docs that match best

df['clean_description'][0][:350]

'b"Job Requirements: Conceptual understanding in Machine Learning models like Naive Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data '

In [212]:
df['clean_description'][168][:350]

'b"Logistics done differently. At XPO Logistics, we invest over $450 million in technology every year so that we can continue to develop state-of-the-art solutions for our customers. As the Data Scientist, you will be responsible for developing analytical experiments in a methodical manner and regularly evaluating alternate models to support strateg'

In [215]:
df['clean_description'][338][:350]

'b"The FCA Manufacturing Planning and Control (MPC) organization is currently seeking a highly skilled, result-oriented Data Scientist to join our BDA (Big Data & Analytics) team at our FCA Headquarters in Auburn Hills, Michigan. The Data Scientist position offers the selected candidate an opportunity to be an integral part of a company whose challe'

In [216]:
#Picking a random row not on the knearest neighbors list to compare

df['clean_description'][10][:350]

"b'Who We Are BlackThorn Therapeutics is a computational sciences company with capabilities to develop proprietary therapeutics focused on neurobehavioral disorders such as depression, schizophrenia, and autism. We have pioneered the development of a computational psychiatry platform to advance our robust pipeline of novel therapeutics. We leverage "

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings
 - Try and identify requirements for experience specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 