<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [107]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy
from spacy.tokenizer import Tokenizer
from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text data in the description column is still messy - full of html tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task. 

In [215]:
from bs4 import BeautifulSoup
import requests
df = pd.read_csv('job_listings.csv')
##### Your Code Here #####

                


In [185]:
df.head()

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist


In [186]:
to_scrape = [df['description']]

In [187]:
to_scrape

[0      b"<div><div>Job Requirements:</div><ul><li><p>...
 1      b'<div>Job Description<br/>\n<br/>\n<p>As a Da...
 2      b'<div><p>As a Data Scientist you will be work...
 3      b'<div class="jobsearch-JobMetadataHeader icl-...
 4      b'<ul><li>Location: USA \xe2\x80\x93 multiple ...
                              ...                        
 421    b"<b>About Us:</b><br/>\nWant to be part of a ...
 422    b'<div class="jobsearch-JobMetadataHeader icl-...
 423    b'<div class="jobsearch-JobMetadataHeader icl-...
 424    b"<p></p><div><p>SENIOR DATA SCIENTIST</p><p>\...
 425    b'<div></div><div><div><div><div><p>Cerner Int...
 Name: description, Length: 426, dtype: object]

In [216]:
df['soup_text'] = [BeautifulSoup(text).get_text() for text in df['description']]

In [150]:
df['soup_text'][0]

'b"Job Requirements:\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them\\nIntermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)\\nExposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R\\nAbility to communicate Model findings to both Technical and Non-Technical stake holders\\nHands on experience in SQL/Hive or similar programming language\\nMust show past work via GitHub, Kaggle or any other published article\\nMaster\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.\\nApply Now"'

In [131]:
def clean_description(description):
    pos = re.sub(r'.\\n', ' ', description)
    return pos



In [217]:
df['soup_text'] = df['soup_text'].replace(r'.\\n', ' ', regex=True)

In [218]:
df['soup_text'] = df['soup_text'].replace(r'.\\xe2\\x80\\x99', "'", regex=True )

In [219]:
df['soup_text'] = df['soup_text'].replace(r'.\\xe2\\x80\\xa', "", regex=True )

In [213]:
# df['soup_text'] = df['soup_text'].apply(get_lemmas)

## 2) Use Spacy to tokenize the listings 

In [220]:
##### Your Code Here #####
nlp = spacy.load("en_core_web_lg")

In [277]:
STOP_WORDS = nlp.Defaults.stop_words.union(['b"job', 'requirements', 'data', 'scientists', "job", "b'job", "you\'ll", "x80", "x93", "xe2", 	"you\'re", "\ntoday" ])

tokenizer = Tokenizer(nlp.vocab)

In [278]:
tokens = []

for doc in tokenizer.pipe(df['soup_text']):

    doc_tokens =[]

    for token in doc:
        if token.text.lower() not in STOP_WORDS:
            doc_tokens.append(token.text.lower())
    tokens.append(doc_tokens)
df['tokens'] = tokens

In [279]:
df['tokens']

0      [conceptual, understanding, machine, learning,...
1      [descriptio, \nas, scientist, 1,, help, build,...
2      [b'as, scientist, working, consulting, busines...
3      [b'$4,969, -, $6,756, monthcontractunder, gene...
4      [b'location:, usa, \xe2\x80\x93, multiple, loc...
                             ...                        
421    [b"about, want, fantastic, fun, startup, tha's...
422    [b'internshipat, uber,, ignite, opportunity, s...
423    [b'$200,000, -, $350,000, yeara, million, peop...
424    [b"senior, scientis, descriptio, \nabout, u, \...
425    [b'cerner, intelligence, new,, innovative, org...
Name: tokens, Length: 426, dtype: object

In [262]:
def get_lemmas(text):

    lemmas = []
    
    doc = nlp(text)
    
    # Something goes here :P
    for token in doc: 
        if ((token.is_stop == False) and (token.is_punct == False)):
            lemmas.append(token.lemma_)
    
    return lemmas

In [263]:
def listToString(s):

    str1=" "
    return (str1.join(s))

In [264]:
df['tokens'] = df['tokens'].apply(listToString)

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [265]:
##### Your Code Here #####
from sklearn.feature_extraction.text import CountVectorizer

In [266]:
vect = CountVectorizer(stop_words='english',
                       max_df=.90,
                       min_df=.10)
                       

In [267]:
dtm = vect.fit_transform(df['tokens'])

In [268]:
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())

In [269]:
dtm.head()

Unnamed: 0,ability,able,access,achieve,action,actionable,ad,additional,address,advanced,...,worl,world,writing,written,x80,x93,xe2,year,years,yo
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,1,0,0,...,0,2,2,1,0,0,0,1,0,0
2,1,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,1,0,1,0


## 4) Visualize the most common word counts

In [270]:
##### Your Code Here #####


## 5) Use Scikit-Learn's tfidfVectorizer to get a TF-IDF feature matrix

In [271]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [272]:
def tokenize(document):
    """
    Takes a doc and returns a list of tokens in the form of lemmas.
    Stop words and punctuation are filtered out. 
    """
    
    doc = nlp(document)
    
    return [token.lemma_.strip() for token in doc if (token.is_stop != True) and (token.is_punct != True)]

In [273]:
##### Your Code Here #####
tfidf = TfidfVectorizer(stop_words='english', 
                        ngram_range=(1,2),
                        max_df=.97,
                        min_df=3,
                        tokenizer=tokenize)

In [274]:
dtm_tfidf = tfidf.fit_transform(df['tokens'])

In [275]:
dtm_tfidf = pd.DataFrame(dtm_tfidf.todense(), columns=tfidf.get_feature_names())

In [276]:
dtm_tfidf

Unnamed: 0,Unnamed: 1,\ntoday,ability,clearance,design,excellent,experience,hand,proficiency,provide,...,york,york city,york office,you\'ll,you\'ll d,you\'ll work,you\'re,yrs,|,||
0,0.076140,0.0,0.143124,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
1,0.034248,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.039943,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
422,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.110078,0.073072,0.0,0.0,0.0,0.0,0.0
423,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0
424,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0


## 6) Create a NearestNeighbor Model. Write the description of your ideal datascience job and query your job listings. 

In [282]:
##### Your Code Here #####
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(algorithm="kd_tree")
nn.fit(dtm_tfidf)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [284]:
doc_index = 0
doc_vect = [dtm_tfidf.iloc[doc_index].values]
doc_vect

[array([0.07613972, 0.        , 0.14312408, ..., 0.        , 0.        ,
        0.        ])]

In [287]:
ideal_desc = ["""ideal candidate can code well in python and use relevant modules to data science. we have a clear drive towards the future of green energy"""]

In [288]:
new = tfidf.transform(ideal_desc)

In [289]:
nn.kneighbors(new.todense())

(array([[1.33156575, 1.33598471, 1.33845594, 1.34250948, 1.35195631]]),
 array([[ 62, 308, 368, 297, 325]], dtype=int64))

In [290]:
df['soup_text'][62]

"b'Business Unit Introduction \\nThis position is in the Enterprise Digital and Analytics \\xe2\\x80\\x93 Global Customer Data Science team and is responsible for driving positive value for the consumer base, our internal partners and shareholders through best-in-class data and decision science \\nThe Team Our team is the decision science team with emphasis on optimization, simulation, reinforcement learning, and Natural Language Processing (NLP). We are responsible for the innovation of the AMEX Chatbot module and the optimization layer of Orchestra, a cross-channel content delivery system \\nResponsibilities and Value Proposition applying rigorious data science and optimization techniques on digital advertising helping our customer care professionals to reolve card members\\' issues faster and more accurately by utilizing NLP techniques collaborating with internal production teams to build and launch machine learning products monitoring and delivering insightful analytics on model pe

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape Job Listings for the job title "Data Analyst". How do these differ from Data Scientist Job Listings
 - Try and identify requirements for experience specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that algorithms that depend on Euclidean distance break down with high dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 