# Now it's your turn!

Use the following dataset of scraped "Data Scientist" and "Data Analyst" job listings to create your own Document Classification Models.

<https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv>

Requirements:

- Apply both CountVectorizer and TfidfVectorizer methods to this data and compare results
- Use at least two different classification models to compare differences in model accuracy
- Try to "hyperparameter tune" your model by using different ngram_range and max_features values and different data cleaning methods
- Try to get the highest accuracy possible!

In [1]:
import pandas as pd
url = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-2-NLP/master/module3-Document-Classification/job_listings.csv'
df = pd.read_csv(url)
print(df.shape)
df.head()

(500, 3)


Unnamed: 0,description,title,job
0,"b""<div><div>Job Requirements:</div><ul><li><p>...",Data scientist,Data Scientist
1,b'<div>Job Description<br/>\n<br/>\n<p>As a Da...,Data Scientist I,Data Scientist
2,b'<div><p>As a Data Scientist you will be work...,Data Scientist - Entry Level,Data Scientist
3,"b'<div class=""jobsearch-JobMetadataHeader icl-...",Data Scientist,Data Scientist
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple ...,Data Scientist,Data Scientist


In [2]:
# What does one entry look like?
df.description[0]

'b"<div><div>Job Requirements:</div><ul><li><p>\\nConceptual understanding in Machine Learning models like Nai\\xc2\\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN along with hands-on experience in at least 2 of them</p>\\n</li><li><p>Intermediate to expert level coding skills in Python/R. (Ability to write functions, clean and efficient data manipulation are mandatory for this role)</p>\\n</li><li><p>Exposure to packages like NumPy, SciPy, Pandas, Matplotlib etc in Python or GGPlot2, dplyr, tidyR in R</p>\\n</li><li><p>Ability to communicate Model findings to both Technical and Non-Technical stake holders</p>\\n</li><li><p>Hands on experience in SQL/Hive or similar programming language</p>\\n</li><li><p>Must show past work via GitHub, Kaggle or any other published article</p>\\n</li><li><p>Master\'s degree in Statistics/Mathematics/Computer Science or any other quant specific field.</p></li></ul><div><div><div><div><div><d

In [3]:
pd.set_option('display.max_colwidth', 200)
df.sample(5)

Unnamed: 0,description,title,job
480,"b'<div><p>Responsible for data acquisition and ongoing data integrity in University Advancement systems. Works independently to analyze, compare, manipulate, and transform data sets in preparation...",Data Analyst,Data Analyst
473,"b'<div class=""jobsearch-JobMetadataHeader icl-u-xs-mb--md""><div class=""jobsearch-JobMetadataHeader-item icl-u-xs-mt--xs"">Internship</div></div><div><div><b>BrightStart Intern - Salesforce Data Ana...",BrightStart - Intern - Salesforce Data Analyst,Data Analyst
250,"b'<div>Creates, generates and analyzes statistical reports to track and measurevarious clinical functions. Troubleshoots hardware and softwareproblems, maintains databases, and provides standing a...",DATA ANALYST,Data Analyst
4,b'<ul><li>Location: USA \xe2\x80\x93 multiple locations</li>\n<li>2+ years of Analytics experience</li>\n<li>Understand business requirements and technical requirements</li>\n<li>Can handle data e...,Data Scientist,Data Scientist
200,"b'<p></p><div><h2 class=""jobSectionHeader""><b>About the team</b></h2>\nZillow is looking for an extraordinary Data Scientist to join a growing team. Zillow is on a mission to give consumers certai...",Data Scientist- Zillow Offers,Data Scientist


In [4]:
df.isna().sum()

description    1
title          1
job            0
dtype: int64

In [5]:
df = df.dropna().reset_index()
df.isna().sum()

index          0
description    0
title          0
job            0
dtype: int64

In [6]:
df.job.value_counts()

Data Scientist    250
Data Analyst      249
Name: job, dtype: int64

In [7]:
df['label_num'] = df.job.map({'Data Analyst': 0, 'Data Scientist': 1})
df.tail()

Unnamed: 0,index,description,title,job,label_num
494,495,"b'<div><p><b>POSITION DESCRIPTION</b><br/>\n<br/>\nThe <b>Data Analysts\xe2\x80\x99</b> primary responsibilities will be to fulfill data requests for RFPs, research, evaluation and CQI projects. T...",Data Analyst,Data Analyst,0
495,496,b'The Data Analyst supports the Service Delivery team from initiation to project delivery and closure. The DA is an important resource in helping to streamline and create best practices to ensure ...,Data Analyst,Data Analyst,0
496,497,"b'TITLE Data Analyst\n<br/><br/>\n<b>MINIMUM CLEARANCE LEVEL:</b> SECRET\n<br/><br/>\n<b>CITIZENSHIP:</b> US Citizenship\n<br/><b>LOCATION:</b> Warner Robins Air Force Base, Warner Robins, GA\n<br...",Systems Data Analyst,Data Analyst,0
497,498,"b'<div><h2 class=""jobSectionHeader""><b>Overview\n</b></h2><div><div><div><div>Zywave provides unique, industry-leading solutions for insurance brokers, and is hiring an Associate Database Analyst ...",Associate Data Analyst,Data Analyst,0
498,499,"b'<div><div>Location: El Segundo, California, United States<br/>\n</div><div></div><div><b>\nJob Summary:</b></div><div></div><div><br/>\nAT&amp;T is seeking a Data Analyst to partner with Enterta...",Data Analyst,Data Analyst,0


In [8]:
len(df)

499

In [9]:
from sklearn.model_selection import train_test_split
X = df.description
y = df.label_num
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(399,)
(100,)
(399,)
(100,)


In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')
vectorizer.fit(X_train)
print(vectorizer.vocabulary_)

{'div': 2421, 'class': 1459, 'jobsearch': 4203, 'jobmetadataheader': 4201, 'icl': 3755, 'xs': 9261, 'mb': 4694, 'md': 4702, 'item': 4171, 'mt': 4952, 'temporary': 8405, 'internship': 4077, 'description': 2218, 'nthe': 5996, 'data': 2068, 'analytics': 605, 'innovations': 3970, 'group': 3501, 'docomo': 2441, 'seeking': 7661, 'ph': 6522, 'students': 8178, 'computer': 1688, 'science': 7590, 'engineering': 2730, 'related': 7220, '12': 37, 'week': 9043, 'summer': 8243, 'aws': 889, 'operation': 6226, 'cost': 1913, 'analysis': 599, 'nrequired': 5872, 'skills': 7847, 'experiences': 2931, 'education': 2593, 'ncandidates': 5157, 'position': 6644, 'solid': 7914, 'background': 909, 'machine': 4568, 'learning': 4376, 'optimization': 6244, 'theory': 8447, 'experience': 2929, 'algorithm': 536, 'design': 2223, 'developing': 2255, 'supervised': 8252, 'unsupervised': 8776, 'models': 4874, 'depth': 2208, 'concepts': 1696, 'statistics': 8092, 'linux': 4469, 'ssh': 8029, 'github': 3406, 'sql': 8017, 'highly

In [12]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(399, 9288)


Unnamed: 0,00,000,00011236,00079,00805,00pm,01,02115,03,0356,...,zeta,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 9288)


Unnamed: 0,00,000,00011236,00079,00805,00pm,01,02115,03,0356,...,zeta,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zywave
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
from sklearn.linear_model import LogisticRegression

LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)



In [15]:
from sklearn.metrics import accuracy_score

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.92


In [16]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9674185463659147
Test Accuracy: 0.88


In [18]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.89


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=None, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

print(vectorizer.vocabulary_)

{'div': 2421, 'class': 1459, 'jobsearch': 4203, 'jobmetadataheader': 4201, 'icl': 3755, 'xs': 9261, 'mb': 4694, 'md': 4702, 'item': 4171, 'mt': 4952, 'temporary': 8405, 'internship': 4077, 'description': 2218, 'nthe': 5996, 'data': 2068, 'analytics': 605, 'innovations': 3970, 'group': 3501, 'docomo': 2441, 'seeking': 7661, 'ph': 6522, 'students': 8178, 'computer': 1688, 'science': 7590, 'engineering': 2730, 'related': 7220, '12': 37, 'week': 9043, 'summer': 8243, 'aws': 889, 'operation': 6226, 'cost': 1913, 'analysis': 599, 'nrequired': 5872, 'skills': 7847, 'experiences': 2931, 'education': 2593, 'ncandidates': 5157, 'position': 6644, 'solid': 7914, 'background': 909, 'machine': 4568, 'learning': 4376, 'optimization': 6244, 'theory': 8447, 'experience': 2929, 'algorithm': 536, 'design': 2223, 'developing': 2255, 'supervised': 8252, 'unsupervised': 8776, 'models': 4874, 'depth': 2208, 'concepts': 1696, 'statistics': 8092, 'linux': 4469, 'ssh': 8029, 'github': 3406, 'sql': 8017, 'highly

In [20]:
train_word_counts = vectorizer.transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_train_vectorized = pd.DataFrame(train_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_train_vectorized.shape)
X_train_vectorized.head()

(399, 9288)


Unnamed: 0,00,000,00011236,00079,00805,00pm,01,02115,03,0356,...,zeta,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
test_word_counts = vectorizer.transform(X_test)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_test_vectorized = pd.DataFrame(test_word_counts.toarray(), columns=vectorizer.get_feature_names_out())

print(X_test_vectorized.shape)
X_test_vectorized.head()

(100, 9288)


Unnamed: 0,00,000,00011236,00079,00805,00pm,01,02115,03,0356,...,zeta,zetahub,zeus,zheng,zillow,zogsports,zoho,zone,zones,zywave
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
LR = LogisticRegression(random_state=42).fit(X_train_vectorized, y_train)

train_predictions = LR.predict(X_train_vectorized)
test_predictions = LR.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9649122807017544
Test Accuracy: 0.86




In [23]:
from sklearn.naive_bayes import MultinomialNB

MNB = MultinomialNB().fit(X_train_vectorized, y_train)

train_predictions = MNB.predict(X_train_vectorized)
test_predictions = MNB.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9649122807017544
Test Accuracy: 0.85


In [25]:
from sklearn.ensemble import RandomForestClassifier

RFC = RandomForestClassifier(n_estimators=100).fit(X_train_vectorized, y_train)

train_predictions = RFC.predict(X_train_vectorized)
test_predictions = RFC.predict(X_test_vectorized)

print(f'Train Accuracy: {accuracy_score(y_train, train_predictions)}')
print(f'Test Accuracy: {accuracy_score(y_test, test_predictions)}')

Train Accuracy: 0.9974937343358395
Test Accuracy: 0.89


In [28]:
from bs4 import BeautifulSoup
from unidecode import unidecode
# Clean the description 

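def clean_html_with_bs4(string):
    # Name the parser explicitly to avoid the "no parser was explicitly specified" warning
    soup = BeautifulSoup(string, 'html.parser')
    string = soup.get_text()
    return string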

listings = []
for x in df['description']:
    # Strip the leading b' or b" prefix and the trailing quote left over from the bytes repr
    x = x[2:-1]
    # Clean out HTML
    x = clean_html_with_bs4(x)
    # Remove line breaks
    x = x.replace('\\n',' ')
    # Translate unicode characters to ASCII
#     x = unidecode(x)
    listings.append(x)
    
df['description'] = listings
df.head()

Unnamed: 0,index,description,title,job,label_num
0,0,"Job Requirements: Conceptual understanding in Machine Learning models like Nai\xc2\xa8ve Bayes, K-Means, SVM, Apriori, Linear/ Logistic Regression, Neural, Random Forests, Decision Trees, K-NN alo...",Data scientist,Data Scientist,1
1,1,"Job Description As a Data Scientist 1, you will help us build machine learning models, data pipelines, and micro-services to help our clients navigate their healthcare journey. You will do so by ...",Data Scientist I,Data Scientist,1
2,2,"As a Data Scientist you will be working on consulting side of our business. You will be responsible for analyzing large, complex datasets and identify meaningful patterns that lead to actionable r...",Data Scientist - Entry Level,Data Scientist,1
3,3,"$4,969 - $6,756 a monthContractUnder the general supervision of Professors Dana Mukamel and Kai Zheng, the incumbent will join the CalMHSA Mental Health Tech Suite Innovation (INN) Evaluation Team...",Data Scientist,Data Scientist,1
4,4,"Location: USA \xe2\x80\x93 multiple locations 2+ years of Analytics experience Understand business requirements and technical requirements Can handle data extraction, preparation and transformatio...",Data Scientist,Data Scientist,1
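
The cleaned descriptions above still contain escaped byte sequences such as `\xe2\x80\x93`. One possible way to resolve them is sketched below, assuming each raw entry is a valid Python bytes literal stored as text (the `b'...'` prefix suggests this, but it has not been verified for every row). The helper name `decode_bytes_literal` and the sample string `raw` are hypothetical and only for illustration.

```python
import ast


def decode_bytes_literal(text):
    """Parse a string that looks like a bytes literal and decode it as UTF-8."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a parseable Python literal: return the input unchanged.
        return text
    if isinstance(value, bytes):
        # Replace undecodable bytes rather than raising.
        return value.decode("utf-8", errors="replace")
    return text


# Hypothetical sample shaped like one raw description from the dataset.
raw = "b'Location: USA \\xe2\\x80\\x93 multiple locations'"
decoded = decode_bytes_literal(raw)
print(decoded)
```

If this holds for the whole column, something like `df['description'].apply(decode_bytes_literal)` before the HTML cleaning step could replace the `x[2:-1]` slice, and the `\\n` replacement would no longer be needed because the newlines would already be real characters.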


In [29]:
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.pipeline import Pipeline

In [30]:
vectorizers = [
    TfidfVectorizer(stop_words='english',
                    max_features=None),
    CountVectorizer(stop_words='english',
                   max_features=None)
]

classifiers = [
    MultinomialNB(),
    LinearSVC(),
    LogisticRegression(),
    RandomForestClassifier()
]

clf_names = [
         "Naive Bayes",
         "Linear SVC",
         "Logistic Regression",
         "Random Forest"
        ]

vect_names = [
    "TfidfVectorizer",
    "CountVectorizer"
]

clf_params = [
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': (1e-2, 1e-3)},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (np.logspace(-5, 1, 5))},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__C': (np.logspace(-5, 1, 5))},
              {'vect__ngram_range': [(1, 1), (1, 2)],
              'clf__max_depth': (1, 2)},
             ]

In [35]:
models = []
for classifier, clf_name, params in zip(classifiers, 
                                        clf_names, 
                                        clf_params):
    for vectorizer, vect_name in zip(vectorizers, 
                                     vect_names):
        pipe = Pipeline([
            ('vect', vectorizer),
            ('clf', classifier),
        ])
        gs = GridSearchCV(pipe, 
                          param_grid=params, 
                          n_jobs=-1,
                          scoring='roc_auc',
                          cv=5,
                          verbose=10)
        
        gs.fit(df.description, df.label_num)
        score = gs.best_score_
        print(f'''
Classifier: {clf_name}
Vectorizer: {vect_name}
Score: {gs.best_score_:.4f}
Params: {gs.best_params_}
------------------------------
            ''')
        models.append((clf_name, vect_name, gs.best_score_, gs.best_params_))

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    6.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    7.9s
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:   10.3s remaining:    2.5s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:   11.2s finished



Classifier: Naive Bayes
Vectorizer: TfidfVectorizer
Score: 0.9155
Params: {'clf__alpha': 0.01, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:    5.6s remaining:    1.3s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    6.7s finished



Classifier: Naive Bayes
Vectorizer: CountVectorizer
Score: 0.8978
Params: {'clf__alpha': 0.01, 'vect__ngram_range': (1, 1)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.9s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    7.5s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   10.1s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   16.5s finished



Classifier: Linear SVC
Vectorizer: TfidfVectorizer
Score: 0.9606
Params: {'clf__C': 0.31622776601683794, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.6s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    7.4s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   10.6s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   14.9s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   19.2s finished



Classifier: Linear SVC
Vectorizer: CountVectorizer
Score: 0.9761
Params: {'clf__C': 0.01, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    7.2s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    9.9s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.2s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   16.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.



Classifier: Logistic Regression
Vectorizer: TfidfVectorizer
Score: 0.9602
Params: {'clf__C': 10.0, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    5.7s
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:    7.3s
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:   10.1s
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   13.8s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   17.0s finished



Classifier: Logistic Regression
Vectorizer: CountVectorizer
Score: 0.9754
Params: {'clf__C': 10.0, 'vect__ngram_range': (1, 2)}
------------------------------
            
Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.3s
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:    5.6s remaining:    1.3s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    6.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.



Classifier: Random Forest
Vectorizer: TfidfVectorizer
Score: 0.8796
Params: {'clf__max_depth': 2, 'vect__ngram_range': (1, 1)}
------------------------------
            
Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    1.5s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:    5.9s remaining:    1.4s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    7.0s finished



Classifier: Random Forest
Vectorizer: CountVectorizer
Score: 0.8447
Params: {'clf__max_depth': 2, 'vect__ngram_range': (1, 1)}
------------------------------
            




In [49]:
models = sorted(models, key=lambda tup: tup[2])
print('And the winner is...')
print()
print('Classifier:', models[-1][0])
print('Vectorizer:', models[-1][1])
print('Score:', models[-1][2])
print('Params:', models[-1][3])

And the winner is...

Classifier: Linear SVC
Vectorizer: CountVectorizer
Score: 0.9760566030019222
Params: {'clf__C': 0.01, 'vect__ngram_range': (1, 2)}
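
Note that the scores above are mean cross-validated ROC AUC values (`scoring='roc_auc'`), not accuracy. A minimal sketch of how the winning configuration could be checked against the accuracy goal on a held-out split follows. The small `texts` and `labels` lists are made-up stand-ins for `df.description` and `df.label_num`, so the printed number says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for df.description / df.label_num.
texts = [
    "build machine learning models in python",
    "train neural networks and deploy models",
    "research statistical models and algorithms",
    "develop predictive models with deep learning",
    "create dashboards and reports in excel",
    "write sql queries for weekly reports",
    "maintain reports and clean spreadsheet data",
    "prepare excel reports for business teams",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out a stratified test split that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# Same settings as the winning grid-search candidate.
best_pipe = Pipeline([
    ("vect", CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", LinearSVC(C=0.01)),
])
best_pipe.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, best_pipe.predict(X_test))
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Fitting the vectorizer inside the pipeline keeps the vocabulary from being learned on the test rows.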


# Stretch Goals

- Try agglomerative clustering using cosine distance (it works better in high-dimensional spaces). A robust agglomerative method such as Ward would be a good option. Try to create a dendrogram of the most important terms in the dataset.

- Resources for the clustering stretch goals:
  - Agglomerative Clustering with Scipy: <https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/>
  - Agglomerative Clustering for NLP: <http://brandonrose.org/clustering>
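
A minimal sketch of the agglomerative clustering idea, assuming SciPy and scikit-learn are available. The `docs` list is made-up sample text, not the job listings. Ward linkage is defined for Euclidean distances, so this sketch pairs cosine distance with average linkage instead; that substitution is a choice made here, not something the exercise prescribes.

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample documents for illustration only.
docs = [
    "machine learning models and python",
    "deep learning models and python",
    "sql reports and excel dashboards",
    "excel dashboards and weekly reports",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Condensed vector of pairwise cosine distances between documents.
distances = pdist(tfidf.toarray(), metric="cosine")

# Average linkage works with arbitrary precomputed distances.
Z = linkage(distances, method="average")

# Cut the tree into at most two flat clusters.
cluster_ids = fcluster(Z, t=2, criterion="maxclust")
print(cluster_ids)

# Compute the dendrogram layout without drawing it;
# drop no_plot=True to render it with matplotlib.
tree = dendrogram(Z, no_plot=True,
                  labels=[f"doc {i}" for i in range(len(docs))])
print(tree["ivl"])
```

To cluster terms rather than documents, the same steps could be applied to the transposed matrix (`tfidf.T.toarray()`), ideally after limiting the vocabulary with `max_features` so the dendrogram stays readable.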
 
- Use Latent Dirichlet Allocation (LDA) to perform topic modeling on the dataset:
  - Topic Modeling and LDA in Python: <https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24>
  - Topic Modeling and LDA using Gensim: <https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/>
