# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to tune your model's hyperparameters by varying ngram_range, max_features, and your data cleaning methods
- Try to get the highest accuracy possible!
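
One way to approach the tuning requirement is to wrap the vectorizer and classifier in a pipeline and search over the vectorizer settings. The sketch below is a minimal illustration, not the assignment's reference solution: the toy `docs` and `labels` are made-up stand-ins for `df['description']` and `df['job']`, and the grid values are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with df['description'] and df['job'].
docs = [
    "build machine learning models in python",
    "train deep learning models and deploy them",
    "research statistical models and machine learning",
    "design experiments and predictive models",
    "create dashboards and reports in excel",
    "write sql queries for business reports",
    "maintain dashboards and sql reports",
    "analyze business metrics in excel",
]
labels = ["Data Scientist"] * 4 + ["Data Analyst"] * 4

# Vectorizer and classifier are refit together inside every CV fold.
pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(random_state=42)),
])

# Parameter names are "<step name>__<argument name>".
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, its vocabulary is learned only from the training portion of each fold, which avoids leaking test text into the features.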

In [4]:
##### Your Code Here #####
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv')

df.head()


Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [20]:
df = df.dropna()
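
The `description` values shown above are stringified bytes literals that still contain HTML tags, which is why tokens such as `div`, `br`, `ul`, and `li` show up in the vocabulary later on. A possible cleaning step, sketched under the assumption that every value looks like the samples in `df.head()`, is shown below. The helper name `clean_description` and the sample string are hypothetical.

```python
import ast
import html
import re


def clean_description(raw):
    """Turn a stringified bytes literal of HTML into plain lowercase text."""
    text = raw
    # The CSV stores str(bytes), e.g. "b'<div>...'"; parse it back to bytes.
    if text.startswith(("b'", 'b"')):
        try:
            text = ast.literal_eval(text).decode("utf-8", errors="replace")
        except (ValueError, SyntaxError):
            text = text[2:-1]  # fall back to trimming the b'...' wrapper
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = html.unescape(text)            # e.g. &amp; becomes &
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip().lower()


# Made-up sample in the same shape as the rows above.
sample = "b'<div><p>As a Data Scientist you will be working with R&amp;D.<br/>\\nApply now</p></div>'"
print(clean_description(sample))
```

It could be applied before the train/test split with `df['description'] = df['description'].apply(clean_description)`.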

In [21]:
from sklearn.model_selection import train_test_split

X = df['description']
y = df['job']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [22]:
# Count Vectorizer

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'div': 2484, 'unitedhealthcare': 8925, 'company': 1673, 'rise': 7590, 'expanding': 2993, 'multiple': 5082, 'directions': 2389, 'borders': 1104, 'way': 9222, 'think': 8627, 'innovation': 4073, 'isn': 4267, 'gadget': 3406, 'transforming': 8753, 'health': 3697, 'care': 1297, 'industry': 4012, 'ready': 7253, 'make': 4703, 'difference': 2356, 'home': 3774, 'start': 8227, 'doing': 2517, 'life': 4543, 'best': 1021, 'work': 9310, 'sm': 8016, 'br': 1120, 'primary': 6929, 'responsibilities': 7513, 'ul': 8864, 'li': 4528, 'responsible': 7515, 'management': 4714, 'manipulation': 4729, 'structured': 8333, 'data': 2127, 'focus': 3254, 'building': 1195, 'business': 1209, 'intelligence': 4150, 'tools': 8691, 'conducting': 1769, 'analysis': 614, 'distinguish': 2471, 'patterns': 6568, 'recognize': 7281, 'trends': 8792, 'performing': 6619, 'normalization': 5853, 'operations': 6380, 'assuring': 814, 'quality': 7163, 'ncreating': 5374, 'specifications': 8132, 'bring': 1156, 'common': 1650, 'structure': 83

In [23]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(399, 9519)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 9519)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)



In [28]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 1.0
Test Accuracy: 0.92


In [29]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')


Train Accuracy: 0.9573934837092731
Test Accuracy: 0.91


In [30]:
# TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'div': 2484, 'unitedhealthcare': 8925, 'company': 1673, 'rise': 7590, 'expanding': 2993, 'multiple': 5082, 'directions': 2389, 'borders': 1104, 'way': 9222, 'think': 8627, 'innovation': 4073, 'isn': 4267, 'gadget': 3406, 'transforming': 8753, 'health': 3697, 'care': 1297, 'industry': 4012, 'ready': 7253, 'make': 4703, 'difference': 2356, 'home': 3774, 'start': 8227, 'doing': 2517, 'life': 4543, 'best': 1021, 'work': 9310, 'sm': 8016, 'br': 1120, 'primary': 6929, 'responsibilities': 7513, 'ul': 8864, 'li': 4528, 'responsible': 7515, 'management': 4714, 'manipulation': 4729, 'structured': 8333, 'data': 2127, 'focus': 3254, 'building': 1195, 'business': 1209, 'intelligence': 4150, 'tools': 8691, 'conducting': 1769, 'analysis': 614, 'distinguish': 2471, 'patterns': 6568, 'recognize': 7281, 'trends': 8792, 'performing': 6619, 'normalization': 5853, 'operations': 6380, 'assuring': 814, 'quality': 7163, 'ncreating': 5374, 'specifications': 8132, 'bring': 1156, 'common': 1650, 'structure': 83

In [31]:
train_word_counts = vectorizer.transform(X_train)

X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(399, 9519)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.030841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
test_word_counts = vectorizer.transform(X_test)

X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 9519)


Unnamed: 0,00,000,00011236,00079,00805,00am,00pm,01,02115,03,...,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zoom,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9598997493734336
Test Accuracy: 0.9




In [34]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.949874686716792
Test Accuracy: 0.9
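
The four vectorizer and classifier combinations tried above could also be compared in a single loop. The following is a rough sketch on a made-up toy corpus (the `docs` and `labels` values are placeholders, not the job listings). It uses cross-validation rather than a single split and keeps the sparse matrices instead of converting them to dense DataFrames.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; replace with df['description'] and df['job'].
docs = [
    "python machine learning models",
    "deep learning research and modeling",
    "statistical modeling and machine learning",
    "predictive models and experiments",
    "excel dashboards and reporting",
    "sql queries and business reporting",
    "dashboards built on sql data",
    "business metrics tracked in excel",
]
labels = ["Data Scientist"] * 4 + ["Data Analyst"] * 4

vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
classifiers = {
    "logreg": lambda: LogisticRegression(random_state=42),
    "nb": MultinomialNB,
}

results = {}
for vec_name, make_vec in vectorizers.items():
    for clf_name, make_clf in classifiers.items():
        # A fresh pipeline per combination; it accepts raw strings directly.
        model = make_pipeline(make_vec(stop_words="english"), make_clf())
        scores = cross_val_score(model, docs, labels, cv=2, scoring="accuracy")
        results[(vec_name, clf_name)] = scores.mean()

for combo, score in sorted(results.items()):
    print(combo, round(score, 3))
```

With only 100 test rows, differences of one or two points between models (such as 0.92 versus 0.90) correspond to one or two listings, so averaged cross-validation scores may give a steadier comparison.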


# Stretch Goals

- Try agglomerative clustering with cosine distance, which tends to work better in high-dimensional spaces. A hierarchical method such as Ward linkage would be a good choice (note that Ward is defined for Euclidean distance, so use average or complete linkage if you cluster on cosine distance directly). Try to create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goals:
 - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
 - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
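
As a starting point for the clustering stretch goal, the sketch below builds a hierarchy over the highest-weighted TF-IDF terms using SciPy. It is only an illustration under several assumptions: the toy `docs` are made up, "most important" is taken to mean the largest summed TF-IDF weight, and average linkage is used because it is compatible with cosine distance.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with the cleaned job descriptions.
docs = [
    "python machine learning models",
    "machine learning research in python",
    "statistics and predictive models",
    "excel dashboards and sql reporting",
    "sql queries for business reporting",
    "business dashboards in excel",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs).toarray()

# Feature names ordered by column index (works across scikit-learn versions).
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)

# Keep the 8 terms with the largest total TF-IDF weight.
top = np.argsort(X.sum(axis=0))[::-1][:8]
top_terms = [terms[i] for i in top]
term_vectors = X[:, top].T  # one row per term, one column per document

# Average linkage on pairwise cosine distances between term vectors.
Z = linkage(term_vectors, method="average", metric="cosine")

# no_plot=True returns the layout only; drop it to draw with matplotlib.
tree = dendrogram(Z, no_plot=True, labels=top_terms)
print(tree["ivl"])  # leaf labels in dendrogram order
```

Terms that appear in the same documents end up close together in the tree. To cluster documents instead of terms, pass the rows of `X` to `linkage`, making sure no document vector is all zeros, since cosine distance is undefined for it.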
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset: 
 - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
 - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
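
The linked tutorials use Gensim; for a quick first pass, scikit-learn's `LatentDirichletAllocation` fits into the same vectorizer workflow used earlier in this notebook. The sketch below assumes a made-up toy corpus and an arbitrary choice of two topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; replace with the cleaned job descriptions.
docs = [
    "python machine learning models",
    "machine learning research in python",
    "statistics and predictive models",
    "excel dashboards and sql reporting",
    "sql queries for business reporting",
    "business dashboards in excel",
]

# LDA models word counts, so use CountVectorizer rather than TF-IDF.
vect = CountVectorizer(stop_words="english")
counts = vect.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document, rows sum to 1

# Feature names ordered by column index.
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top_terms)}")
```

On the real dataset, the number of topics and the vectorizer's `max_df` and `min_df` settings are worth experimenting with, since very common and very rare words tend to dominate otherwise.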
