<a href="https://colab.research.google.com/github/Lakshmi-Chandana/DATA-602/blob/main/NaturalLanguageProcessingNLP_4_full_pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Full NLP and ML Pipeline for Document Classification
Based on the following tutorials.   
With permission from Michael Harmon:
http://michael-harmon.com/blog/NLP.html   
https://github.com/mdh266/DocumentClassificationNLP/blob/master/NLP.ipynb   

Classification of text documents using sparse features:   
https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html   

https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html   


### 20 Newsgroups Corpus, a sample dataset available through scikit-learn
A collection of almost 20,000 articles on 20 different topics or 'newsgroups'.   
A corpus is a collection of texts.

In [None]:
import pandas as pd

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [None]:
# first 3 classes (of 20)
twenty_train.target_names[0:3]
# Python slicing excludes the end index

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc']

In [None]:
# data holds the input documents and target holds the desired output labels
# the training split contains about 11K document/label pairs

print( len(twenty_train.data) )
print( len(twenty_train.target) )

11314
11314


In [None]:
i = 29
print(twenty_train.data[i])

From: jimf@centerline.com (Jim Frost)
Subject: Re: Is car saftey important?
Organization: CenterLine Software, Inc.
Lines: 14
NNTP-Posting-Host: 140.239.3.202

tcorkum@bnr.ca (Trevor Corkum) writes:
>Is it only me, or is
>safety not one of the most important factors when buying a car?

It depends on your priorities.  A lot of people put higher priorities
on gas mileage and cost than on safety, buying "unsafe" econoboxes
instead of Volvos.  I personally take a middle ground -- the only
thing I really look for is a three-point seatbelt and 5+mph bumpers.
I figure that 30mph collisions into brick walls aren't common enough
for me to spend that much extra money for protection, but there are
lots of low-speed collisions that do worry me.

jim frost
jimf@centerline.com



In [None]:
twenty_train.target[i]

7

In [None]:
twenty_train.target_names[  twenty_train.target[i]  ]

'rec.autos'

## scikit-learn Pipeline
- Scikit-learn pipelines are a sequence of transforms followed by a final estimator.   
- Intermediate steps within the pipeline must be ‘transforms’ 
 * They must implement fit and transform methods 
 * The CountVectorizer and TfidfTransformer are transformers in this example   
- The estimator of a pipeline, the final step, only needs to implement the fit method   
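
To make the list above concrete, here is a minimal sketch of what a pipeline does for you: each transformer is fit and applied in order, and the final estimator is fit on the result. The toy corpus and the variable names are hypothetical and only for illustration; they are not part of the 20 newsgroups data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only
docs = ["the engine of the car is loud",
        "new brakes for my car",
        "the graphics card renders images",
        "rendering images on the screen"]
labels = [0, 0, 1, 1]
new_docs = ["car engine", "images on screen"]

# Manual version: fit_transform each transformer in turn, then fit the estimator
vect = CountVectorizer()
tfidf = TfidfTransformer()
model = MultinomialNB()
counts = vect.fit_transform(docs)
weights = tfidf.fit_transform(counts)
model.fit(weights, labels)
manual_pred = model.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline version: the same steps chained behind a single fit/predict call
pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB())])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

# Both approaches should agree on every prediction
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline version is shorter and makes it harder to forget to apply a transformer to the test data.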

### A simple pipeline with two steps, count vectorizer and model that uses the count vectors.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

pipe = Pipeline([('vect', CountVectorizer(stop_words='english')),
                 ('model', MultinomialNB()),])

mod = pipe.fit(twenty_train.data, twenty_train.target)

predicted = mod.predict(twenty_test.data)

print(classification_report(twenty_test.target,
                            predicted, 
                            target_names=twenty_test.target_names))

from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(twenty_test.target, predicted))

                          precision    recall  f1-score   support

             alt.atheism       0.80      0.81      0.80       319
           comp.graphics       0.65      0.80      0.72       389
 comp.os.ms-windows.misc       0.80      0.04      0.08       394
comp.sys.ibm.pc.hardware       0.55      0.80      0.65       392
   comp.sys.mac.hardware       0.85      0.79      0.82       385
          comp.windows.x       0.69      0.84      0.76       395
            misc.forsale       0.89      0.74      0.81       390
               rec.autos       0.89      0.92      0.91       396
         rec.motorcycles       0.95      0.94      0.95       398
      rec.sport.baseball       0.95      0.92      0.93       397
        rec.sport.hockey       0.92      0.97      0.94       399
               sci.crypt       0.80      0.96      0.87       396
         sci.electronics       0.79      0.70      0.74       393
                 sci.med       0.88      0.87      0.87       396
         

### Adding tf-idf transformation to the count vectors

I used these two steps in the pipeline as an example. Together, the first two steps are equivalent to TfidfVectorizer; the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html says: "Equivalent to CountVectorizer followed by TfidfTransformer."
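
The equivalence can be checked directly. The sketch below compares the two-step and one-step versions on a small hypothetical corpus (made up for illustration), using default settings apart from `stop_words`.

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Hypothetical toy corpus, for illustration only
docs = ["the engine of the car is loud",
        "new brakes for my car",
        "the graphics card renders images",
        "rendering images on the screen"]

# Two steps: raw counts first, then tf-idf weighting
counts = CountVectorizer(stop_words='english').fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both at once
one_step = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Compare the resulting document-term matrices
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

Keeping the two steps separate is mainly useful when you want to tune or remove the tf-idf step on its own.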


In [None]:
# Easy to add, remove, or modify steps and retest
from sklearn.feature_extraction.text import TfidfTransformer

pipe = Pipeline([('vect', CountVectorizer(stop_words='english')),
                  ('tfidf', TfidfTransformer()), # Added TFIDF
                  ('model', MultinomialNB()),])

mod = pipe.fit(twenty_train.data, twenty_train.target)

predicted = mod.predict(twenty_test.data)

print(classification_report(twenty_test.target,
                            predicted, 
                            target_names=twenty_test.target_names))

from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(twenty_test.target, predicted))

                          precision    recall  f1-score   support

             alt.atheism       0.80      0.69      0.74       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.79      0.72      0.75       394
comp.sys.ibm.pc.hardware       0.68      0.81      0.74       392
   comp.sys.mac.hardware       0.86      0.81      0.84       385
          comp.windows.x       0.87      0.78      0.82       395
            misc.forsale       0.87      0.80      0.83       390
               rec.autos       0.88      0.91      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.91      0.92      0.92       397
        rec.sport.hockey       0.88      0.98      0.93       399
               sci.crypt       0.75      0.96      0.84       396
         sci.electronics       0.84      0.65      0.74       393
                 sci.med       0.92      0.79      0.85       396
         

In [None]:
# from sklearn.svm import SVC

# # Easy to add, remove, or modify steps and retest
# pipe = Pipeline([('vect', CountVectorizer(stop_words='english')),
#                   ('tfidf', TfidfTransformer()), # Added TFIDF
#                   ('model', SVC()),])

# mod = pipe.fit(twenty_train.data, twenty_train.target)

# predicted = mod.predict(twenty_test.data)

# print(classification_report(twenty_test.target,
#                             predicted, 
#                             target_names=twenty_test.target_names))

# from sklearn.metrics import accuracy_score
# print("Accuracy:", accuracy_score(twenty_test.target, predicted))

## Experimenting, and Hyperparameter tuning using GridSearchCV and Pipeline
We often want to run many experiments with many combinations of parameters, for example to test whether removing stop words helps.  
GridSearchCV does this for a single estimator or for a full pipeline. 
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html   
   
To set parameters for a step in the pipeline, use the key `<step name>__<parameter name>` (the step name, two underscores, then the parameter name) and map it to the list of values you want to test.
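
As a small sketch of this naming scheme, the snippet below lists some of the parameter keys a pipeline exposes and sets two of them with `set_params`. The step names match the ones used in this notebook; the value `0.1` for `alpha` is an arbitrary choice for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB())])

# Every step parameter appears under a "<step name>__<parameter name>" key
param_names = sorted(pipe.get_params().keys())
print([name for name in param_names if name.startswith('model__')])

# The same keys can be used to change a parameter of a single step
pipe.set_params(vect__stop_words='english', model__alpha=0.1)
print(pipe.named_steps['vect'].stop_words)
print(pipe.named_steps['model'].alpha)
```

Calling `pipe.get_params().keys()` is a quick way to find the exact key to put in a parameter grid.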

#### Example Experiment; Effect of removing stop_words on accuracy:

In [None]:
# Configure experiments
    
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('vect', CountVectorizer()),
                  ('tfidf', TfidfTransformer()),
                  ('model', MultinomialNB()),])

# stop_words is a parameter of CountVectorizer.
# Because that step is named 'vect' in the pipeline,
# the key 'vect__stop_words' holds the options we want to try
# in the parameters dictionary:
parameters = {'vect__stop_words': ('english', None)}

# We can perform the grid search using all available CPU by setting n_jobs=-1:
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)

In [None]:
# Run the experiments 
grid_search.fit(twenty_train.data, twenty_train.target)

Fitting 5 folds for each of 2 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   32.9s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [None]:
# Results of the grid search
grid_search.cv_results_

{'mean_fit_time': array([4.96422958, 5.16370244]),
 'mean_score_time': array([1.06413231, 1.09180346]),
 'mean_test_score': array([0.88165116, 0.84399858]),
 'param_vect__stop_words': masked_array(data=['english', None],
              mask=[False, False],
        fill_value='?',
             dtype=object),
 'params': [{'vect__stop_words': 'english'}, {'vect__stop_words': None}],
 'rank_test_score': array([1, 2], dtype=int32),
 'split0_test_score': array([0.88422448, 0.84887318]),
 'split1_test_score': array([0.88378259, 0.84180292]),
 'split2_test_score': array([0.88024746, 0.84401237]),
 'split3_test_score': array([0.87715422, 0.84136103]),
 'split4_test_score': array([0.88284704, 0.84394341]),
 'std_fit_time': array([0.31819013, 0.05190805]),
 'std_score_time': array([0.08349578, 0.17589451]),
 'std_test_score': array([0.00263772, 0.00266618])}
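
The raw `cv_results_` dictionary is easier to read as a table. The sketch below loads it into a pandas DataFrame and keeps only the columns needed to compare candidates. It uses a hypothetical toy corpus and `cv=3` so that it runs quickly on its own; in this notebook you would pass `grid_search.cv_results_` instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (two topics, six documents each), for illustration only
docs = ["the car engine is loud", "my car needs new brakes",
        "the truck has a diesel engine", "brakes and tires for the car",
        "the engine of this truck", "new tires for my truck",
        "the graphics card renders images", "rendering images on the screen",
        "a new graphics driver for the card", "the screen shows the images",
        "the driver renders graphics", "images and graphics on a screen"]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('model', MultinomialNB())])
search = GridSearchCV(pipe, {'vect__stop_words': ('english', None)}, cv=3)
search.fit(docs, labels)

# One row per candidate, one column per entry of cv_results_
results = pd.DataFrame(search.cv_results_)
summary = results[['param_vect__stop_words', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```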

Because GridSearchCV already performs cross-validation, we could combine the train and test data and pass all of it to GridSearchCV.

In [None]:
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('model',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [None]:
parameters2 = {'tfidf__use_idf': (True, False),
               'model__alpha': (1e1, 1e-3),
               'model__fit_prior': (True, False)}


grid_search2 = GridSearchCV(pipe, parameters2, n_jobs=-1, cv=5)
grid_search2.fit(twenty_train.data, twenty_train.target)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep
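
The `parameters2` grid has 2 x 2 x 2 = 8 candidate settings, so with `cv=5` the search fits the pipeline 40 times. As a sketch, scikit-learn's `ParameterGrid` can enumerate the candidates before you commit to a long run:

```python
from sklearn.model_selection import ParameterGrid

parameters2 = {'tfidf__use_idf': (True, False),
               'model__alpha': (1e1, 1e-3),
               'model__fit_prior': (True, False)}

grid = ParameterGrid(parameters2)
n_candidates = len(grid)
n_fits = n_candidates * 5   # 5 cross-validation folds per candidate

print(n_candidates, n_fits)
for candidate in grid:
    print(candidate)
```

Counting the fits this way helps estimate run time, since every extra parameter value multiplies the number of candidates.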

In [None]:
grid_search2.cv_results_

{'mean_fit_time': array([4.70247774, 4.76134863, 4.88082309, 4.91927133, 5.05858846,
        4.58875818, 4.89991722, 4.62617698]),
 'mean_score_time': array([1.00830226, 1.05477238, 1.20125141, 1.1564826 , 1.14066358,
        1.03355842, 1.09082742, 0.95299568]),
 'mean_test_score': array([0.82066512, 0.75702699, 0.85451692, 0.78133328, 0.90772501,
        0.91161416, 0.90754825, 0.91223293]),
 'param_model__alpha': masked_array(data=[10.0, 10.0, 10.0, 10.0, 0.001, 0.001, 0.001, 0.001],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_model__fit_prior': masked_array(data=[True, True, False, False, True, True, False, False],
              mask=[False, False, False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_tfidf__use_idf': masked_array(data=[True, False, True, False, True, False, True, False],
              mask=[False, False, False, False, False,

In [None]:
grid_search2.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words='english', strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=False)),
                ('model',
                 MultinomialNB(alpha=0.001, class_prior=None,
                               fit_prior=False))],
         verbose=False)

In [None]:
predicted2 = grid_search2.predict(twenty_test.data)

print(classification_report(twenty_test.target,
                            predicted2, 
                            target_names=twenty_test.target_names))

print("Accuracy:", accuracy_score(twenty_test.target, predicted2))

                          precision    recall  f1-score   support

             alt.atheism       0.85      0.81      0.83       319
           comp.graphics       0.66      0.74      0.70       389
 comp.os.ms-windows.misc       0.72      0.63      0.67       394
comp.sys.ibm.pc.hardware       0.65      0.72      0.68       392
   comp.sys.mac.hardware       0.83      0.82      0.83       385
          comp.windows.x       0.83      0.76      0.80       395
            misc.forsale       0.80      0.82      0.81       390
               rec.autos       0.89      0.89      0.89       396
         rec.motorcycles       0.94      0.96      0.95       398
      rec.sport.baseball       0.96      0.93      0.94       397
        rec.sport.hockey       0.94      0.97      0.96       399
               sci.crypt       0.89      0.94      0.91       396
         sci.electronics       0.80      0.74      0.77       393
                 sci.med       0.90      0.83      0.86       396
         

Accuracy increased from 0.816 to 0.832.

In [None]:
0.832 - 0.816

0.016000000000000014

About 1 or 2 more documents out of every 100 are classified correctly as a result of hyperparameter tuning.
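
To put the 0.016 difference in concrete terms, the sketch below converts it into documents per 100 and into an approximate count over the whole test set. The test set size of 7,532 is an assumption based on the standard 20 newsgroups test subset; in this notebook you could use `len(twenty_test.data)` instead. Because the accuracies are rounded, the totals are approximate.

```python
baseline_accuracy = 0.816   # accuracy before tuning, as quoted above
tuned_accuracy = 0.832      # accuracy after tuning, as quoted above
n_test_docs = 7532          # assumption: size of the 20 newsgroups test subset

gain = tuned_accuracy - baseline_accuracy
extra_per_100 = gain * 100          # extra correct documents per 100
extra_total = gain * n_test_docs    # extra correct documents over the test set

print(f"Gain: {gain:.3f}")
print(f"Extra correct per 100 documents: {extra_per_100:.1f}")
print(f"Extra correct across the test set: about {extra_total:.0f}")
```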

## Next Steps

Productionize the classification model:
- Save the model to a file
- Implement a web service that loads the model
 * Make a prediction with the model when a request is received
 * Return the prediction

In [None]:
grid_search2.predict(["Fast Automobiles are fun to drive","ethernet cables suck"])

array([7, 4])
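
The returned values are class indices. A short sketch of mapping them back to newsgroup names is shown below; it hard-codes the first eight names in the order shown in the classification reports so that it runs on its own. In this notebook you would index `twenty_train.target_names` directly.

```python
# First eight class names, in the order shown in the classification reports
target_names = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
                'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware',
                'comp.windows.x', 'misc.forsale', 'rec.autos']

predicted_indices = [7, 4]   # the indices returned by predict above
predicted_names = [target_names[i] for i in predicted_indices]
print(predicted_names)
```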

## Save model to disk for later reuse
You don't want to retrain each time!

In [None]:
%pip install joblib



In [None]:
import joblib
joblib.dump(grid_search2.best_estimator_,"email_classifier.joblib")

['email_classifier.joblib']

In [None]:
email_classifier_model = joblib.load("email_classifier.joblib")

In [None]:
# And use it to classify, as above
email_classifier_model.predict(["Fast Automobiles are fun to drive","ethernet cables suck"])

array([7, 4])
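
The web service from the Next Steps list could be sketched roughly as follows, using only the standard library's `http.server` together with `joblib`. This is one possible design under stated assumptions, not a production setup: the JSON request format (`{"texts": [...]}`), the port, the handler name `PredictHandler`, and the toy model standing in for `email_classifier.joblib` are all hypothetical, and error handling is omitted.

```python
import json
import os
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in for "email_classifier.joblib": a toy model saved to a temp file
# (hypothetical data, for illustration only)
toy_model = Pipeline([('vect', CountVectorizer()), ('model', MultinomialNB())])
toy_model.fit(["fast car engine", "car brakes and tires",
               "ethernet network cable", "network card driver"],
              [0, 0, 1, 1])
model_path = os.path.join(tempfile.mkdtemp(), "toy_classifier.joblib")
joblib.dump(toy_model, model_path)

# Load the model once at start-up, not on every request
model = joblib.load(model_path)


def predict_json(body: bytes) -> bytes:
    """Turn a JSON request {"texts": [...]} into a JSON response."""
    texts = json.loads(body)["texts"]
    predictions = model.predict(texts).tolist()
    return json.dumps({"predictions": predictions}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    """Answer POST requests with the model's predictions."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = predict_json(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


# To serve requests (blocks until interrupted), uncomment:
# HTTPServer(("localhost", 8000), PredictHandler).serve_forever()

# The request logic can be tried without starting the server
print(predict_json(b'{"texts": ["car engine", "network cable"]}'))
```

Keeping the prediction logic in a plain function such as `predict_json` makes it easy to test without a running server.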