<a href="https://colab.research.google.com/github/BrokenShell/DS-Unit-4-Sprint-1-NLP/blob/master/module2-vector-representations/LS_DS_412_Vector_Representations_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>

# Vector Representations
## *Data Science Unit 4 Sprint 1 Assignment 2*

In [1]:
!pip install IteratorAlgorithms

Collecting IteratorAlgorithms
  Downloading https://files.pythonhosted.org/packages/f5/e3/30ba5d26018ebf1c9d8bb5a6af3d73193d3d4766efa7f0f5483ffc43fb6e/IteratorAlgorithms-0.1.4-py3-none-any.whl
Installing collected packages: IteratorAlgorithms
Successfully installed IteratorAlgorithms-0.1.4


In [None]:
!python -m spacy download en_core_web_lg

In [1]:
# If this cell won't run, restart the runtime!
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_lg")

In [3]:
import pandas as pd
import numpy as np
import requests
import string
import re

import IteratorAlgorithms as ia

import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import PCA

In [4]:
text = "We created a new dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans—specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl."
doc = nlp(text)
print({token.lemma_ for token in doc if not token.is_stop and not token.is_punct})

{'specifically', 'indicator', 'interesting', 'user', 'scrape', 'diversity', 'new', 'CommonCrawl', 'curate', 'receive', 'internet', 'link', 'funny', '3', 'dataset', 'high', 'karma', 'heuristic', 'think', 'datum', 'content', 'quality', 'document', 'find', 'preserve', 'educational', 'human', 'similar', 'emphasize', 'filter', 'order', 'outbound', 'Reddit', 'lead', 'page', 'create'}


## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text data in the description column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

`Tip:` You will need to install the `bs4` library inside your conda environment. 

In [5]:
url = "https://github.com/BrokenShell/DS-Unit-4-Sprint-1-NLP/blob/master/module2-vector-representations/data/job_listings.csv?raw=true"

In [6]:
jobs = pd.read_csv(url)
jobs

Unnamed: 0.1,Unnamed: 0,description,title
0,0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist
1,1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I
2,2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level
3,3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist
4,4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist
...,...,...,...
421,421,"b""<b>About Us:</b><br/>\nWant to be part of a ...",Senior Data Science Engineer
422,422,"b'<div class=""jobsearch-JobMetadataHeader icl-...",2019 PhD Data Scientist Internship - Forecasti...
423,423,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist - Insurance
424,424,"b""<p></p><div><p>SENIOR DATA SCIENTIST</p><p>\...",Senior Data Scientist


In [7]:
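def drink_soup(src):
    # Each description is the text of a bytes literal (b'...'); drop the wrapper.
    soup = BeautifulSoup(src[2:-1], 'html.parser')
    # Lowercase once, then turn separators and escaped newlines into spaces.
    return (soup.text.lower()
            .replace(':', ' ')
            .replace('\\n', ' ')
            .replace('\\', ' ')
            .replace(', ', ' ')
            .replace('/', ' ')
    )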

In [8]:
def clean_str(s):
    return re.sub(r'[^a-z ]', '', s.lower())

In [9]:
jobs = pd.read_csv(url)
# Keep only the columns used below; this also drops the leftover index column.
jobs = jobs[['title', 'description']]
jobs['description'] = jobs['description'].apply(drink_soup)
jobs['description']

0      job requirements  conceptual understanding in ...
1      job description  as a data scientist 1 you wil...
2      as a data scientist you will be working on con...
3      $4,969 - $6,756 a monthcontractunder the gener...
4      location  usa  xe2 x80 x93 multiple locations ...
                             ...                        
421    about us  want to be part of a fantastic and f...
422    internshipat uber we ignite opportunity by set...
423    $200,000 - $350,000 a yeara million people a y...
424    senior data scientist job description  about u...
425    cerner intelligence is a new innovative organi...
Name: description, Length: 426, dtype: object

In [10]:
jobs['description'] = jobs['description'].apply(clean_str)
jobs['description']

0      job requirements  conceptual understanding in ...
1      job description  as a data scientist  you will...
2      as a data scientist you will be working on con...
3         a monthcontractunder the general supervisio...
4      location  usa  xe x x multiple locations  year...
                             ...                        
421    about us  want to be part of a fantastic and f...
422    internshipat uber we ignite opportunity by set...
423       a yeara million people a year die in car co...
424    senior data scientist job description  about u...
425    cerner intelligence is a new innovative organi...
Name: description, Length: 426, dtype: object
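
The `xe x x` fragments in row 4 are left over from escaped UTF-8 bytes. Each description is stored as the text of a Python bytes literal (`b'...'`), so slicing off the wrapper keeps sequences such as `\xe2\x80\x93` as ordinary characters. One possible alternative, sketched here with only the standard library, is to evaluate the literal back into bytes and decode it before passing the string to BeautifulSoup. The helper name `decode_bytes_literal` is hypothetical, and the sample string is made up to imitate row 4.

```python
import ast


def decode_bytes_literal(src):
    """Turn the text "b'...'" into a decoded str; return src unchanged on failure."""
    try:
        value = ast.literal_eval(src)
    except (ValueError, SyntaxError):
        return src
    if isinstance(value, bytes):
        # Replace undecodable bytes instead of raising on malformed input.
        return value.decode("utf-8", errors="replace")
    return src


# Made-up sample imitating the escaped en dash seen in row 4.
sample = "b'<ul><li>Location: USA \\xe2\\x80\\x93 multiple locations</li></ul>'"
decoded = decode_bytes_literal(sample)
print(decoded)
```

With this approach the escaped newlines also become real newlines, so the `'\\n'` and `'\\'` replacements in `drink_soup` would no longer be needed. The outputs shown in this notebook were produced without it.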

## 2) Use spaCy to tokenize the listings

In [11]:
stop_words = {
    'xe', 'xs', 'x', 'aa', 'aap', 'aas', 'ab'
}

In [12]:
tokenizer = Tokenizer(nlp.vocab)

tokens = []
strings = []
for doc in tokenizer.pipe(jobs['description'], batch_size=500):
    doc_tokens = []
    for token in doc:
        t = token.lemma_.strip()
        if t and not token.is_stop and not token.is_punct and t not in stop_words:
            doc_tokens.append(t)
    tokens.append(doc_tokens)
    strings.append(' '.join(doc_tokens))
jobs['tokens'] = tokens
jobs['strings'] = strings
jobs['strings']

0      job requirement conceptual understand machine ...
1      job description datum scientist help build mac...
2      datum scientist work consult business responsi...
3      monthcontractunder general supervision profess...
4      location usa multiple location year analytics ...
                             ...                        
421    want fantastic fun startup revolutionize onlin...
422    internshipat uber ignite opportunity set world...
423    yeara million people year die car collision wo...
424    senior datum scientist job description amplion...
425    cerner intelligence new innovative organizatio...
Name: strings, Length: 426, dtype: object

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [13]:
data = jobs['strings']

In [14]:
vect = CountVectorizer(stop_words='english')

# Learn the vocabulary
vect.fit(data)

# Get sparse dtm
dtm = vect.transform(data)

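# Note: get_feature_names() was removed in scikit-learn 1.2; on newer
# versions use get_feature_names_out() here and in the later cells.
dtm = pd.DataFrame(dtm.todense(), columns=vect.get_feature_names())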

In [15]:
dtm

Unnamed: 0,abernathy,ability,able,abound,abroad,absence,absolutely,absorb,abstract,abstraction,abstractly,abtest,abundant,abuse,academia,academic,academy,accelerate,acceleration,accelerator,accelerometer,accept,acceptable,acceptance,acceptedcurrent,access,accessibility,accessible,accidental,accolade,accommodate,accommodation,accommodationspaloaltonetworkscom,accommodationsrelxcom,accomplish,accomplishment,accord,accordance,account,accountability,...,xtypically,xunlocking,xve,xwe,xyou,yard,year,yeara,yearas,yearcollects,yeardescription,yearjob,yearlrs,yearsexperience,yearsummary,yearthe,yeartitle,yearworking,yes,yeti,yield,york,youd,youll,young,youre,youtube,youve,yr,zenreach,zero,zeus,zf,zheng,zillow,zogsports,zone,zoom,zuckerberg,zurich
0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
421,0,2,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
422,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
423,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
424,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
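
To make the document-term matrix above less opaque, here is a minimal standard-library sketch of roughly what a count vectorizer computes: one column per vocabulary term, one row per document. It ignores `CountVectorizer`'s own tokenization rules and stop-word handling, and the three toy documents are invented for illustration.

```python
from collections import Counter

# Toy documents, invented for illustration.
docs = [
    "datum scientist build model",
    "datum engineer build pipeline",
    "scientist publish model model",
]

# Vocabulary: sorted unique terms, one column per term.
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, holding the count of each vocabulary term.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[term] for term in vocab])

print(vocab)
for row in rows:
    print(row)
```

Most entries are zero even in this tiny example, which is why the vectorizer returns a sparse matrix before it is densified for display.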


## 4) Visualize the most common word counts

In [16]:
# Sum each column of the document-term matrix to get corpus-wide counts.
df = dtm.sum().rename_axis('Word').reset_index(name='Count')
df.head()

Unnamed: 0,Word,Count
0,abernathy,3
1,ability,554
2,able,142
3,abound,1
4,abroad,1


In [17]:
df_select = df[df['Count'] > 500]

In [18]:
alt.Chart(df_select, title="Word Count").mark_circle(size=200).encode(
    x=alt.X('Word:O', sort='y'),
    y='Count:Q',
    color=alt.Color('Count', scale=alt.Scale(scheme='lightmulti')),
    tooltip=['Word', 'Count'],
)

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [19]:
def tokenize(document):
    doc = nlp(document)
    return [
        token.lemma_.strip() for token in doc
        if not token.is_stop and not token.is_punct
    ]

In [20]:
tfidf = TfidfVectorizer(tokenizer=tokenize)
t_vec = tfidf.fit_transform(df['Word'])
t_vec_df = pd.DataFrame(t_vec.todense(), columns=tfidf.get_feature_names())
t_vec_df.head()

Unnamed: 0,-PRON-,abernathy,ability,able,abound,abroad,absence,absolutely,absorb,abstract,abstraction,abstractly,abtest,abundant,abuse,academia,academic,academy,accelerate,acceleration,accelerator,accelerometer,accept,acceptable,acceptance,acceptedcurrent,access,accessibility,accessible,accidental,accolade,accommodate,accommodation,accommodationspaloaltonetworkscom,accommodationsrelxcom,accomplish,accomplishment,accord,accordance,account,...,xtable,xthe,xthink,xto,xtypically,xunlocke,xve,xwe,xyou,yard,year,yeara,yearas,yearcollects,yeardescription,yearjob,yearlrs,yearsexperience,yearsummary,yearthe,yeartitle,yearworke,yes,yeti,yield,york,young,youtube,yr,zenreach,zero,zeus,zf,zheng,zillow,zogsport,zone,zoom,zuckerberg,zurich
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
tfidf = TfidfVectorizer(stop_words='english',
                        max_df=.97,
                        min_df=3,
                        tokenizer=tokenize)

dtm = tfidf.fit_transform(data)

dtm = pd.DataFrame(dtm.todense(), columns=tfidf.get_feature_names())

dtm.head()

Unnamed: 0,2,ability,able,absence,absolutely,abstract,abundant,academic,accelerate,accept,access,accessibility,accessible,accommodate,accommodation,accomplish,accomplishment,accord,accordance,account,accountability,accountable,accredit,accuracy,accurate,accurately,achieve,achievement,acquire,acquisition,act,action,actionable,activation,active,actively,activity,actual,actuarial,acumen,...,workforce,worklife,workload,workplace,workshop,workspace,world,worldaltering,worldclass,worldwide,worth,wrangle,write,wwwcivisanalyticscom,wwwdolgov,wwwkbrcom,wwwsquarespacecom,xa,xae,xand,xbb,xbig,xc,xcbig,xd,xef,xgboost,xincluding,xll,xre,xt,xto,xve,xwe,y,year,yearthe,yes,york,yr
0,0.0,0.107926,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.066365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.138109,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.023436,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.068315,0.0,0.0,0.0,0.0,0.0,0.051326,0.0,0.0,0.0,0.0,0.0,0.0,0.057053,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.028736,0.0,0.0,0.0,0.0,0.0,0.086465,0.0,0.0,0.0,0.0,0.110111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09111,0.0,0.0,0.0,0.0,0.0,0.020168,0.0,0.0,0.0,0.0
2,0.0,0.076509,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.133172,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056276,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1066,0.0,0.0,0.0,0.0
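
The fractional weights above come from scaling raw counts by how rare each term is and then normalizing each row. The sketch below works through that by hand, assuming the smoothed formula that scikit-learn documents for its defaults (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1. The toy documents and the helper name `tfidf_row` are made up for illustration, and no custom tokenizer or `min_df`/`max_df` filtering is applied.

```python
import math
from collections import Counter

# Toy documents, invented for illustration.
docs = [
    "datum scientist build model",
    "datum engineer build pipeline",
    "scientist publish model model",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: number of documents that contain each term.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Weight term counts by idf, then scale the row to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]


matrix = [tfidf_row(doc) for doc in tokenized]
for row in matrix:
    print([round(x, 3) for x in row])
```

Terms that appear in fewer documents get a larger idf, so `publish` outweighs a single occurrence of `model` in the third document.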


## 6) Create a NearestNeighbors model. Write the description of your ideal data science job and query your job listings.

In [22]:
# Fit on TF-IDF vectors
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [23]:
# Query Using kneighbors 
nn.kneighbors([dtm.iloc[100]])

(array([[0.        , 0.        , 1.20915576, 1.24145593, 1.2438092 ]]),
 array([[100,  47, 201, 307, 336]]))

In [24]:
data[100][:400]

'description rare opportunity join development division information system steward health care leader competitive forprofit hospital industry seek motivate experience datum scientist contribute software development initiative improve quality health care country individual work programmer analyst senior level management optimize development process enterprise expand exist predictive analytics field '

In [25]:
data[336][:400]

'discover business insight identify opportunity provide solution recommendation solve business problem use statistical algorithmic datum mine visualization technique level close supervision conduct predictive analysis population health management market campaign management forecast analyze design solution healthcare datum work dataset vary degree size complexity include structure unstructured datum'

In [26]:
my_job_description = ["Teach Python"]

In [27]:
new = tfidf.transform(my_job_description)
nn.kneighbors(new.todense())

(array([[1.2914778 , 1.33558843, 1.34289143, 1.35887689, 1.35887689]]),
 array([[385, 160,  22, 352, 206]]))

In [28]:
data[[385, 160,  22, 352, 206]]

385    coursera found computer science professor stan...
160    core challenge come intersect grow internal da...
22     seek datum scientist join product insight team...
352    openx seek datum scientist responsible execute...
206    openx seek datum scientist responsible execute...
Name: strings, dtype: object

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify requirements for experience with specific technologies that are asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little bit of research to see what might be good for this. Also, remember that *algorithms that depend on Euclidean distance break down with high dimensional data*.

> Euclidean distance does NOT break in higher dimensions. Human intuition breaks in higher dimensions - not the math!

 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 
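
The hint and the note about Euclidean distance can both hold: the formula stays well defined in any dimension, but on many kinds of data the nearest and farthest points end up at nearly the same distance, which makes "nearest" less informative. The sketch below illustrates this with uniformly random points, which is an assumption and not a model of TF-IDF vectors. The function name `relative_contrast` and its parameters are hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the farthest point is only slightly farther than the nearest one, which is one reason to compare results across metrics or reduce dimensionality before clustering.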