# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "Hyperparameter Tune" your model by using different ngram_range values, max_features values, and data cleaning methods
- Try to get the highest accuracy possible!

In [85]:
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

In [86]:
jobs = pd.read_csv('https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv',
                  encoding='utf-8')

In [87]:
jobs.head()

| Unnamed: 0 | description | title | job |
| --- | --- | --- | --- |
| 0 | `b"<div><div>Job Requirements:</div><ul><li><p>...` | Data scientist | Data Scientist |
| 1 | `b'<div>Job Description<br/>\n<br/>\n<p>As a Da...` | Data Scientist I | Data Scientist |
| 2 | `b'<div><p>As a Data Scientist you will be work...` | Data Scientist - Entry Level | Data Scientist |
| 3 | `b'<div class="jobsearch-JobMetadataHeader icl-...` | Data Scientist | Data Scientist |
| 4 | `b'<ul><li>Location: USA \xe2\x80\x93 multiple ...` | Data Scientist | Data Scientist |


In [88]:
jobs.isnull().sum()

description    1
title          1
job            0
dtype: int64

In [89]:
# Drop rows with missing values; a listing without a description or title has no usable text
jobs = jobs.dropna(axis=0)
jobs.isnull().sum()

description    0
title          0
job            0
dtype: int64

In [90]:
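# Compile the pattern once; the non-greedy match removes one tag at a time
TAG_RE = re.compile(r'<.*?>')

def cleanhtml(raw_html):
    """Strip HTML tags from a raw job description string."""
    return TAG_RE.sub('', raw_html)

jobs['description'] = jobs['description'].apply(cleanhtml)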
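
The descriptions shown by `jobs.head()` look like stringified bytes literals (`b'...'` with escapes such as `\n` and `\xe2\x80\x93` kept as text), so stripping tags alone may leave tokens like `xe2` in the vocabulary. Below is a minimal sketch of one extra cleaning step to try, assuming every description really is a valid Python bytes literal. The helper name `decode_description` and the sample string are made up for illustration.

```python
import ast
import html
import re

TAG_RE = re.compile(r'<.*?>')

def decode_description(raw):
    """Turn a stringified bytes literal such as "b'<div>...'" into plain text."""
    text = raw
    if raw.startswith(("b'", 'b"')):
        try:
            # literal_eval only parses literals; it never executes code
            text = ast.literal_eval(raw).decode('utf-8', errors='replace')
        except (ValueError, SyntaxError):
            text = raw  # leave malformed literals untouched
    text = TAG_RE.sub(' ', text)   # a space keeps neighbouring words from fusing
    text = html.unescape(text)     # &amp; -> &, and so on
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample modeled on the rows above (not copied from the dataset)
sample = "b'<ul><li>Location: USA \\xe2\\x80\\x93 remote</li></ul>'"
print(decode_description(sample))
```

In the notebook this would be applied with `jobs['description'].apply(decode_description)` in place of `cleanhtml`. Whether it improves accuracy is something to check against the grid search scores rather than assume.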

In [91]:
X = jobs.description
y = jobs.job.map({'Data Scientist' : 1, 'Data Analyst' : 0})

# No random_state is set, so the split (and the scores below) will vary between runs
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.2,
                                                   stratify=y)

### Count Vectorizer

In [92]:
lr_count = make_pipeline(CountVectorizer(stop_words='english'),
                        LogisticRegression(solver='lbfgs',
                                          max_iter=500))
lr_grid_params = [{'countvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
                  'countvectorizer__max_features' : [50, 100, None]}]

lr_grid = GridSearchCV(lr_count, lr_grid_params, cv=3)
lr_grid.fit(X_train, y_train)
print('Best Params', lr_grid.best_params_)
print('CV Score', lr_grid.best_score_)
print('Test Score', lr_grid.score(X_test, y_test))

Best Params {'countvectorizer__max_features': None, 'countvectorizer__ngram_range': (1, 2)}
CV Score 0.9298245614035088
Test Score 0.88


In [93]:
nb_count = make_pipeline(CountVectorizer(stop_words='english'),
                        MultinomialNB())
nb_grid_params = [{'countvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
                  'countvectorizer__max_features' : [50, 100, None]}]

nb_grid = GridSearchCV(nb_count, nb_grid_params, cv=3)
nb_grid.fit(X_train, y_train)
print('Best Params', nb_grid.best_params_)
print('CV Score', nb_grid.best_score_)
print('Test Score', nb_grid.score(X_test, y_test))

Best Params {'countvectorizer__max_features': 100, 'countvectorizer__ngram_range': (1, 2)}
CV Score 0.9197994987468672
Test Score 0.89


### TF-IDF Vectorizer

In [94]:
lr_tfidf = make_pipeline(TfidfVectorizer(stop_words='english'),
                        LogisticRegression(solver='lbfgs',
                                          max_iter=500))
lr_grid_params = [{'tfidfvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
                  'tfidfvectorizer__max_features' : [50, 100, None]}]
lr_grid = GridSearchCV(lr_tfidf, lr_grid_params, cv=3)
lr_grid.fit(X_train, y_train)
print('Best Params', lr_grid.best_params_)
print('CV Score', lr_grid.best_score_)
print('Test Score', lr_grid.score(X_test, y_test))

Best Params {'tfidfvectorizer__max_features': 100, 'tfidfvectorizer__ngram_range': (1, 2)}
CV Score 0.9172932330827067
Test Score 0.89


In [95]:
nb_tfidf = make_pipeline(TfidfVectorizer(stop_words='english'),
                        MultinomialNB())
nb_grid_params = [{'tfidfvectorizer__ngram_range' : [(1,1), (1,2), (1,3)],
                  'tfidfvectorizer__max_features' : [50, 100, None]}]
nb_grid = GridSearchCV(nb_tfidf, nb_grid_params, cv=3)
nb_grid.fit(X_train, y_train)
print('Best Params', nb_grid.best_params_)
print('CV Score', nb_grid.best_score_)
print('Test Score', nb_grid.score(X_test, y_test))

Best Params {'tfidfvectorizer__max_features': 100, 'tfidfvectorizer__ngram_range': (1, 2)}
CV Score 0.9147869674185464
Test Score 0.87
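
The four cells above repeat the same pattern, which makes the scores awkward to compare side by side. One possible way to collect them in a single table is sketched below. The toy corpus, the step names `vec` and `clf`, and the small parameter grid are assumptions made so the sketch runs offline; swap in `jobs.description`, the mapped labels, and the grids used above.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with the real descriptions and 0/1 labels
docs = ([f"scientist machine learning python models project {i}" for i in range(12)]
        + [f"analyst sql excel dashboards reporting project {i}" for i in range(12)])
labels = [1] * 12 + [0] * 12

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

vectorizers = {'count': CountVectorizer, 'tfidf': TfidfVectorizer}
models = {'logreg': lambda: LogisticRegression(max_iter=500),
          'naive_bayes': MultinomialNB}
param_grid = {'vec__ngram_range': [(1, 1), (1, 2)],
              'vec__max_features': [5, None]}

rows = []
for vec_name, make_vec in vectorizers.items():
    for model_name, make_model in models.items():
        # Explicit step names let one param_grid serve both vectorizers
        pipe = Pipeline([('vec', make_vec(stop_words='english')),
                         ('clf', make_model())])
        grid = GridSearchCV(pipe, param_grid, cv=3)
        grid.fit(X_train, y_train)
        rows.append({'vectorizer': vec_name,
                     'model': model_name,
                     'cv_score': grid.best_score_,
                     'test_score': grid.score(X_test, y_test),
                     'best_params': grid.best_params_})

results = pd.DataFrame(rows).sort_values('cv_score', ascending=False)
print(results.to_string(index=False))
```

Using `Pipeline` with explicit step names instead of `make_pipeline` is a design choice for the sketch only: it avoids the `countvectorizer__` versus `tfidfvectorizer__` prefixes, so the same grid works for every combination.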


# Stretch Goals

- Try some agglomerative clustering using cosine distance, which tends to work better in high-dimensional spaces and gives more robust clusters. A linkage like Ward would be cool to explore, but Ward assumes Euclidean distances, so pair cosine distance with another linkage such as average. Then try to create an awesome dendrogram of the most important terms from the dataset.

- Awesome resources for the clustering stretch goal:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
