
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [15]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [22]:


##### Your Code Here #####
PATH = './data/job_listings.csv'
                
df = pd.read_csv(PATH)

In [23]:
df = df[['description', 'title']]
df.head()

Unnamed: 0,description,title
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [24]:
def soupify(row):
    """Strip HTML tags from one description and return the plain text."""
    soup = BeautifulSoup(row, 'html.parser')
    return soup.get_text()
    

In [25]:
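df['description'] = df['description'].apply(soupify)
# Drop the leading "b" left by the bytes repr. The surrounding quotes and the
# escaped sequences (a literal backslash followed by n, xe2, ...) stay in the text.
df['description'] = df['description'].apply(lambda x: x.strip('b'))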

In [26]:
df.head()

Unnamed: 0,description,title
0,"""Job Requirements:\nConceptual understanding i...",Data scientist
1,"'Job Description\n\nAs a Data Scientist 1, you...",Data Scientist I
2,'As a Data Scientist you will be working on co...,Data Scientist - Entry Level
3,"'$4,969 - $6,756 a monthContractUnder the gene...",Data Scientist
4,'Location: USA \xe2\x80\x93 multiple locations...,Data Scientist
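
The cleaned descriptions above still carry quote characters and escaped byte sequences such as `\xe2\x80\x93`, because each cell holds the text of a Python bytes literal rather than real text. The following is a minimal sketch of one way to decode such a cell with the standard library, assuming every description looks like `b'...'` or `b"..."` and is UTF-8 encoded. The helper name `decode_bytes_literal` is hypothetical and not part of the original notebook.

```python
import ast


def decode_bytes_literal(raw):
    """Turn the text "b'...'" into a real str; return raw unchanged otherwise."""
    if isinstance(raw, str) and raw[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(raw)  # safely evaluates literals only
        except (ValueError, SyntaxError):
            return raw  # malformed literal: leave the cell as it was
        if isinstance(value, bytes):
            # Assumption: the scraped pages were UTF-8 encoded.
            return value.decode("utf-8", errors="replace")
    return raw


# A cell shaped like the rows shown above (backslashes are literal characters).
sample = "b'<ul><li>Location: USA \\xe2\\x80\\x93 multiple locations</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

If this approach suits the data, it could run before `soupify`, for example `df['description'].apply(decode_bytes_literal).apply(soupify)`, so that BeautifulSoup receives real HTML text.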


## 2) Use spaCy to tokenize the listings

In [16]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_lg")

In [27]:
def tokenize(row):
    doc = nlp(row)
    tokenized_row = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return tokenized_row

In [28]:
df['tokens'] = df['description'].apply(tokenize)
df.head()

Unnamed: 0,description,title,tokens
0,"""Job Requirements:\nConceptual understanding i...",Data scientist,"[job, Requirements:\nConceptual, understanding..."
1,"'Job Description\n\nAs a Data Scientist 1, you...",Data Scientist I,"[job, description\n\na, Data, scientist, 1, he..."
2,'As a Data Scientist you will be working on co...,Data Scientist - Entry Level,"[Data, scientist, work, consult, business, res..."
3,"'$4,969 - $6,756 a monthContractUnder the gene...",Data Scientist,"[$, 4,969, $, 6,756, monthcontractunder, gener..."
4,'Location: USA \xe2\x80\x93 multiple locations...,Data Scientist,"[location, USA, \xe2\x80\x93, multiple, locati..."


## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [55]:
##### Your Code Here #####
vect = CountVectorizer(stop_words='english', max_features=1000)
vect.fit(df['description'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=1000, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [56]:
dtm = vect.transform(df['description'])

In [57]:
# On scikit-learn >= 1.0 use vect.get_feature_names_out(); get_feature_names() was removed in 1.2.
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())

In [58]:
dtm.head()

Unnamed: 0,000,10,100,2019,40,abilities,ability,able,academic,access,...,xa6,xae,xb7,xbb,xc2,xe2,xef,year,years,york
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,...,2,0,0,0,0,8,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
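
Columns such as `xa6`, `xc2`, and `xe2` in the matrix above are not real words. The sketch below applies the default `token_pattern` from the `CountVectorizer` repr to a snippet shaped like row 4 of the descriptions, using only the standard library. It illustrates the regular expression alone and skips the stop-word filtering the vectorizer also performs.

```python
import re
from collections import Counter

# CountVectorizer's default token pattern, copied from the repr above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# The backslashes here are literal characters, as in the uncleaned descriptions.
snippet = "Location: USA \\xe2\\x80\\x93 multiple locations"

# CountVectorizer lowercases by default (lowercase=True) before tokenizing.
tokens = re.findall(TOKEN_PATTERN, snippet.lower())
counts = Counter(tokens)
print(tokens)
```

Each escaped byte becomes its own token, which is why these fragments rank among the most frequent terms in the sections that follow.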


## 4) Visualize the most common word counts

In [59]:
##### Your Code Here #####
dtm.sum().sort_values(ascending=False)[:10]

data          4394
xe2           1417
x80           1404
experience    1238
business      1198
work           976
team           972
science        956
learning       912
analytics      730
dtype: int64
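
The cell above prints the top ten counts without plotting them. One possible visualization is a horizontal bar chart, sketched here with the values copied from that output so the block runs on its own. In the notebook the same dictionary could presumably be built from `dtm.sum().sort_values(ascending=False)[:10]`.

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt

# Top ten term counts, copied from the output above.
top_counts = {
    "data": 4394, "xe2": 1417, "x80": 1404, "experience": 1238,
    "business": 1198, "work": 976, "team": 972, "science": 956,
    "learning": 912, "analytics": 730,
}

# Reverse the order so the most frequent term ends up at the top of the chart.
terms = list(top_counts)[::-1]
values = [top_counts[term] for term in terms]

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(terms, values)
ax.set_xlabel("Occurrences across all listings")
ax.set_title("Ten most common terms (CountVectorizer)")
fig.tight_layout()
```

The chart makes the encoding fragments `xe2` and `x80` easy to spot next to the genuine vocabulary.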

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [60]:
##### Your Code Here #####
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)

dtm = tfidf.fit_transform(df['description'])

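# On scikit-learn >= 1.0 use tfidf.get_feature_names_out(); get_feature_names() was removed in 1.2.
dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())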

dtm.head()

Unnamed: 0,000,04,10,100,1079302,11,12,125,14,15,...,years,yearthe,yes,yeti,york,young,yrs,zeus,zf,zillow
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.093431,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
dtm.sum().sort_values(ascending=False)[:10]

data          57.941157
xe2           22.472915
x80           22.347036
business      19.469013
experience    18.524879
learning      16.979824
team          15.454177
work          15.408115
science       15.344552
analytics     14.373028
dtype: float64
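
To make the weights above less opaque, here is a small pure-Python sketch of the formula that `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`): raw term counts multiplied by `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit length. The toy documents and the function name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, each L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows


# Toy corpus (illustrative only): "data" appears in every document.
toy_docs = [
    ["data", "science", "team"],
    ["data", "analytics"],
    ["data", "python", "python"],
]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

A term that appears in every document, like `data` here, gets the smallest IDF, so rarer terms outweigh it within a row even though it still dominates the column sums.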

## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [63]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [71]:
ideal = ['Entry, Junior level. Lots of room for advancemnet. On the job training. Relaxed work environment.']

new = tfidf.transform(ideal)

In [72]:
nn.kneighbors(new.toarray())

(array([[1.34930319, 1.35010339, 1.35022157, 1.35022157, 1.35876917]]),
 array([[185,  70,  69, 402, 170]], dtype=int64))
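
The distances above are Euclidean. With the default `norm='l2'`, every TF-IDF row and the transformed query have unit length, so a distance `d` corresponds to a cosine similarity of `1 - d**2 / 2`. The sketch below applies that identity to the five reported distances. It assumes the query shares at least one term with the vocabulary, since otherwise its vector would be all zeros.

```python
# Distances copied from the kneighbors output above.
distances = [1.34930319, 1.35010339, 1.35022157, 1.35022157, 1.35876917]


def cosine_from_unit_distance(d):
    """For unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - (d * d) / 2.0


similarities = [cosine_from_unit_distance(d) for d in distances]
for d, s in zip(distances, similarities):
    print(f"distance {d:.4f} -> cosine similarity {s:.4f}")
```

All five similarities fall below 0.1, so even the best matches share little vocabulary with the short query. Fitting with `NearestNeighbors(metric='cosine', algorithm='brute')` would be one way to work with cosine distance directly.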

In [73]:
df['description'][70]

'"Junior Data Scientist - Big Data (Entry-Level)-1079302\\n\\nWHO WE ARE\\n\\n\\nKBRwyle is a global government services organization delivering full life cycle professional and technical services from over 60 U.S. and 40 international locations. Our core capabilities include logistics, engineering, science, cyber, intelligence and security services.\\n\\n\\nWHAT TO EXPECT\\n\\n\\nWhen you become part of the KBR team, your career opportunities are endless. We offer challenging assignments on some of the world\'s largest and most complex projects where our customers have come to value us, because they know, We Deliver!\\nABOUT THIS POSITION\\n\\n\\nThe successful candidate will be part of a talented team developing and maintaining mission-critical applications for NAVAIR\\xe2\\x80\\x99s 6.8.4 commodity line. The successful candidate will be assigned tasks in support of data analytics tasks. Other duties include: documenting, managing configuration, testing and bug fixing. The successful

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
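
As a starting point for the stretch goal on specific technologies, the sketch below counts how many listings mention each term from a keyword list. The list `TECH_TERMS`, the helper `count_listings_mentioning`, and the three sample descriptions are all hypothetical. In the notebook the descriptions would presumably come from `df['description']`.

```python
import re
from collections import Counter

# Hypothetical keyword list; extend it with whatever technologies matter to you.
TECH_TERMS = ["python", "sql", "spark", "r", "tensorflow"]


def count_listings_mentioning(descriptions, terms):
    """Count, per term, the number of descriptions that mention it at least once."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for term in terms:
            # Lookarounds keep short terms like "r" from matching inside words.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            if re.search(pattern, lowered):
                counts[term] += 1
    return counts


# Made-up sample descriptions, for illustration only.
sample_descriptions = [
    "Experience with Python and SQL required.",
    "Spark, Python, or R preferred.",
    "Strong research background.",
]
tech_counts = count_listings_mentioning(sample_descriptions, TECH_TERMS)
print(tech_counts.most_common())
```

Counting listings rather than total mentions keeps a single verbose posting from skewing the distribution.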