
# Vector Representations
## *Data Science Unit 4 Sprint 2 Assignment 2*

In [1]:
import re
import string

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy

from bs4 import BeautifulSoup

## 1) *Clean:* Job Listings from indeed.com that contain the title "Data Scientist" 

You have `job_listings.csv` in the data folder for this module. The text in the `description` column is still messy: it is full of HTML tags. Use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library to clean up this column. You will need to read through the documentation to accomplish this task.

In [3]:
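from bs4 import BeautifulSoup

df = pd.read_csv('./data/job_listings.csv')

def clean_description(desc):
    """Strip HTML tags from a raw job description and return the plain text."""
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(desc, 'html.parser')
    return soup.get_text()

df['clean_desc'] = df['description'].apply(clean_description)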


In [10]:
df = df.drop(columns='Unnamed: 0')
print(df.shape)
df.head()

(426, 3)


,description,title,clean_desc
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,"b""Job Requirements:\nConceptual understanding ..."
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,"b'Job Description\n\nAs a Data Scientist 1, yo..."
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,b'As a Data Scientist you will be working on c...
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,"b'$4,969 - $6,756 a monthContractUnder the gen..."
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,b'Location: USA \xe2\x80\x93 multiple location...
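The `b'...'` prefix and escape sequences such as `\xe2\x80\x93` in the output above suggest that each description was saved as the printed form of a Python bytes object rather than as plain text. Below is a minimal sketch of one way to recover the text, assuming every affected cell is a valid bytes literal encoded as UTF-8. The helper name `decode_bytes_literal` is hypothetical and not part of the assignment.

```python
import ast


def decode_bytes_literal(text):
    """Turn a string like "b'caf\\xc3\\xa9'" back into readable text.

    Assumes the cell is the repr of a UTF-8 encoded bytes object; anything
    that does not parse as a bytes literal is returned unchanged.
    """
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='replace')
    return text


# A shortened stand-in for one raw cell; the doubled backslashes here
# produce the single literal backslashes that appear in the CSV.
raw = "b'Location: USA \\xe2\\x80\\x93 multiple locations\\nFull time'"
print(decode_bytes_literal(raw))
```

Applying the helper before `clean_description`, for example with `df['description'].apply(decode_bytes_literal)`, would let BeautifulSoup and the tokenizer see real newlines and punctuation instead of escape sequences. The outputs shown in this notebook were produced without this step.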


## 2) Use spaCy to tokenize the listings

In [18]:
# Base
from collections import Counter

# Plotting
import squarify
import matplotlib.pyplot as plt
import seaborn as sns

# NLP
import spacy
from spacy.tokenizer import Tokenizer
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_lg")

In [49]:
def tokenize(text):
    """Parse a string into a list of lowercase word tokens.

    Args:
        text (str): The string that the function will tokenize.

    Returns:
        list: lowercase tokens, split on whitespace after removing every
            character that is not an ASCII letter, a digit, or a space.
    """
    # Drop everything except ASCII letters, digits, and spaces.
    tokens = re.sub('[^a-zA-Z 0-9]', '', text)
    # Lowercase, then split on whitespace.
    tokens = tokens.lower().split()

    return tokens

df['tokens'] = df['clean_desc'].apply(tokenize)

In [54]:
#df['tokens'] = df['tokens'].map(nlp)

In [59]:
type(df.tokens.array[0])

list
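The `tokenize` function above relies on a regular expression rather than spaCy. A minimal sketch of a spaCy-based alternative follows. It assumes a blank English pipeline from `spacy.blank("en")`, which provides the tokenizer and stop-word flags without downloading a model, and the name `spacy_tokenize` is hypothetical. Lemmatization is left out because it needs a trained pipeline such as `en_core_web_lg`.

```python
import spacy

# Blank English pipeline: tokenizer and lexical attributes only, no model download.
nlp_blank = spacy.blank("en")


def spacy_tokenize(text):
    """Return lowercase tokens, skipping stop words, punctuation, and whitespace."""
    doc = nlp_blank(text)
    return [
        token.text.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]


print(spacy_tokenize("We are hiring a Data Scientist!"))
```

For the whole column, `nlp_blank.pipe(df['clean_desc'])` streams the documents in batches, which is usually faster than calling the pipeline one row at a time.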

## 3) Use Scikit-Learn's CountVectorizer to get word counts for each listing.

In [74]:
from sklearn.feature_extraction.text import CountVectorizer

# create the transformer
vect = CountVectorizer(stop_words='english',
                       max_df=.95,
                       max_features=1500,
                       ngram_range=(1, 3),
                       tokenizer=tokenize)

# build the vocabulary and transform the text in one step
dtm = vect.fit_transform(df.clean_desc)

# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0,
# call vect.get_feature_names() instead
dtm = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())

In [75]:
print(dtm.shape)
dtm.head()

(426, 1500)


,1,10,100,12,2,2 years,2019,3,3 years,4,...,years experience data,years professional,years relevant,years work,yearsxe2x80x99,yearsxe2x80x99 experience,york,youll,youre,youxe2x80x99ll
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 4) Visualize the most common word counts

In [81]:
dtm.sum().sort_values(ascending=False).head(10)

business            1093
experience          1022
work                 927
science              877
team                 859
learning             842
machine              667
analytics            645
machine learning     598
analysis             581
dtype: int64
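The cell above lists the ten largest totals but does not plot them. Here is a minimal sketch of a horizontal bar chart, with the counts copied by hand from that output so the block runs on its own. The variable names and the choice of a bar chart are illustrative assumptions, not the notebook's prescribed visualization.

```python
import matplotlib.pyplot as plt

# Top-10 totals copied from the output above.
top_terms = {
    'business': 1093, 'experience': 1022, 'work': 927, 'science': 877,
    'team': 859, 'learning': 842, 'machine': 667, 'analytics': 645,
    'machine learning': 598, 'analysis': 581,
}

fig, ax = plt.subplots(figsize=(8, 4))
# Reverse the order so the most frequent term is drawn at the top.
terms = list(top_terms)[::-1]
counts = [top_terms[t] for t in terms]
ax.barh(terms, counts)
ax.set_xlabel('Total count across all listings')
ax.set_title('Ten most common terms')
fig.tight_layout()
```

With the real matrix, a similar plot is available directly from pandas as `dtm.sum().sort_values().tail(10).plot.barh()`.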

## 5) Use Scikit-Learn's TfidfVectorizer to get a TF-IDF feature matrix

In [77]:
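from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
dtm2 = tfidf.fit_transform(df.clean_desc)
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0,
# call tfidf.get_feature_names() instead
dtm2 = pd.DataFrame(dtm2.toarray(), columns=tfidf.get_feature_names_out())
dtm2.head()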

,000,04,10,100,1079302,11,12,125,14,15,...,years,yearthe,yes,yeti,york,young,yrs,zeus,zf,zillow
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.093431,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
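To make the numbers in the matrix above less opaque, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings (`smooth_idf=True`, `norm='l2'`, raw counts as term frequency). It is an illustration on a made-up toy corpus, not a replacement for the library. `tfidf_rows` is a hypothetical helper that skips tokenization, stop-word removal, and `max_features`.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute L2-normalized TF-IDF weights for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 for an empty document to avoid dividing by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


# Made-up toy corpus, already tokenized.
docs = [
    ['python', 'sql', 'python'],
    ['sql', 'statistics'],
    ['python', 'statistics', 'spark'],
]
for row in tfidf_rows(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the third document, `spark` appears in only one document and so receives a larger weight than `python` or `statistics`, each of which appears in two. This is the effect that pushes rare, distinctive terms up in the feature matrix.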


## 6) Create a NearestNeighbors model. Write a description of your ideal data science job and query your job listings.

In [78]:
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm2)
nn.kneighbors([dtm2.iloc[42].values])

(array([[0.        , 0.        , 1.20452612, 1.2724882 , 1.2724882 ]]),
 array([[138,  42, 226,  52, 142]], dtype=int64))

In [80]:
df.clean_desc[52]

'b"The challenge\\nAdobe is looking for a Senior Data Scientist who will be building the next generation of marketing cloud products by leveraging machine learning, predictive modeling and optimization techniques. These products would help businesses understand, manage, and optimize the experience throughout the customer journey. Example applications include real-time online media optimization, media attribution, predictive sales analytics, product recommendation, mobile analytics, predictive customer scoring and segmentation and large-scale experimentation.\\nIdeal candidates will have a strong academic background as well as technical skills including applied statistics, machine learning, data mining, and software development. Familiarity working with large-scale datasets and big data techniques would be a plus.\\nWhat you\\xe2\\x80\\x99ll do\\nDevelop predictive models on large-scale datasets to address various business problems through leveraging advanced statistical modeling, machi
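The cells above query the model with an existing listing (row 42) rather than a hand-written description. A minimal sketch of the second half of the task follows, using a made-up three-listing corpus so the block runs on its own. It swaps in cosine distance with brute-force search instead of the notebook's `kd_tree`, in line with the stretch-goal hint about Euclidean distance in high dimensions. All names ending in `_demo`, the toy listings, and the `ideal_job` text are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for df.clean_desc (made up for this sketch).
listings = [
    'Build machine learning models in Python for marketing analytics',
    'Maintain SQL dashboards and weekly business reports',
    'Research deep learning methods for computer vision',
]

tfidf_demo = TfidfVectorizer(stop_words='english')
matrix = tfidf_demo.fit_transform(listings)

# Cosine distance with brute-force search accepts the sparse matrix directly.
nn_demo = NearestNeighbors(n_neighbors=2, metric='cosine', algorithm='brute')
nn_demo.fit(matrix)

ideal_job = 'Python machine learning models for analytics'
# transform (not fit_transform) maps the query onto the fitted vocabulary.
query = tfidf_demo.transform([ideal_job])
distances, indices = nn_demo.kneighbors(query)
print(indices[0], distances[0])
```

With the notebook's own objects, a comparable query would be `nn.kneighbors(tfidf.transform([ideal_job]).toarray())`, followed by `df.clean_desc[i]` for each returned index `i`.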

## Stretch Goals

 - Try different visualizations for words and frequencies - what story do you want to tell with the data?
 - Scrape job listings for the job title "Data Analyst". How do these differ from Data Scientist job listings?
 - Try to identify the specific technologies and experience requirements asked for in the job listings. How are those distributed among the job listings?
 - Use a clustering algorithm to cluster documents by their most important terms. Do the clusters reveal any common themes?
  - **Hint:** K-means might not be the best algorithm for this. Do a little research to see what might work better. Also, remember that algorithms that depend on Euclidean distance break down with high-dimensional data.
 - Create a labeled dataset - which jobs will you apply for? Train a model to select the jobs you are most likely to apply for. :) 