# **Scikit-learn - NLP (Natural Language Processing)**

## Objectives

* Understand and create an ML pipeline for NLP (Natural Language Processing)




---

# Change working directory

* We assume you store the notebooks in a subfolder, so when running the notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [6]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\ML_practice\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory
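To see what os.path.dirname() returns without changing any directory, here is a small sketch. It uses ntpath, the Windows flavour of os.path, only so that the example behaves the same on any operating system; the variable names are illustrative.

```python
import ntpath  # Windows flavour of os.path, so the result is the same on any OS

# Path taken from the notebook output above
notebook_dir = r"c:\Users\mikee\Desktop\ML_practice\jupyter_notebooks"

# dirname() drops the last path component, i.e. it returns the parent folder
parent_dir = ntpath.dirname(notebook_dir)
print(parent_dir)
```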

In [7]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [8]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\mikee\\Desktop\\ML_practice'

## NLP (Natural Language Processing)

Import Libraries

In [9]:
import numpy as np
import pandas as pd

Conversational language is unlike text neatly entered into form inputs. It is unstructured data that cannot be broken down into the rows and columns of a database table, yet it holds a vast quantity of information waiting to be accessed.

* Natural language processing aims to gather, extract and make available this information.

NLP is not a trivial task since its goal is to understand the language and not only process the text/strings/keywords.

* As we know, language is ambiguous, subjective and subtle. New words and terms are constantly added/updated, and their meaning may change according to the context.
* These aspects all together make NLP a very interesting and challenging task for ML.

We will study NLP (Natural Language Processing) as a supervised learning approach where the features are text, and the target variable is a meaning associated with that given text. Therefore, the ML task is Classification.

* The workflow will be similar to what we covered for Classification tasks, where we:
    * Load the data
    * Define the pipeline steps
    * Split the data into train and test sets
    * Train multiple pipelines using hyperparameter optimisation
    * Evaluate pipeline performance
* One difference will be defining the pipeline steps, where we will use steps for pre-processing the textual data before the modelling stage. Once you have a processed text, you can use ML algorithms to predict your target variable.


### Load data

We will use a dataset that contains records telling if a given SMS message is spam or not (spam or ham). We load the data from GitHub.

In this project, we are interested in predicting if a given message is spam or not; therefore, the ML task is Classification.

In [10]:
url = 'https://raw.githubusercontent.com/ShresthaSudip/SMS_Spam_Detection_DNN_LSTM_BiLSTM/master/SMSSpamCollection'
df = (pd.read_csv(url, sep ='\t',names=["label", "message"])
    .sample(frac=0.6, random_state=0)
    .reset_index(drop=True)
    )
df = df.sample(frac=0.5, random_state=101)
print(df.shape)
df.head()

(1672, 2)


| | label | message |
|---|---|---|
| 1337 | spam | Someone U know has asked our dating service 2 ... |
| 568 | ham | I'm home. Doc gave me pain meds says everythin... |
| 1548 | ham | Feb &lt;#&gt; is "I LOVE U" day. Send dis to... |
| 2603 | ham | Just finished. Missing you plenty |
| 1966 | ham | Hello. Sort of out in town already. That . So ... |


#### Split Data

As usual, we are splitting the data into train and test sets.

* In this case, the dataset has two columns: message contains the SMS text, and label tells whether the message was spam or not.
* In the end, we have one Pandas Series for the features (message) and one for the target (label). Note the single brackets used to subset the data, for example, df['message']

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'],
                                                    test_size=0.2, random_state=101)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(1337,) (1337,) (335,) (335,)
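The shapes above follow from simple arithmetic. A short sketch, assuming (as we understand the scikit-learn behaviour) that a fractional test_size is rounded up and the remaining rows go to the train set:

```python
import math

n_rows = 1672      # rows in df, from df.shape above
test_size = 0.2

# The test share is rounded up; the remainder goes to the train set
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```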


### Create the pipeline

We will consider steps for (1) cleaning the textual data and (2) representing the text as numbers or feature extraction.

* (1) In our case, we will make the text lowercase and remove punctuation for text cleaning.

    * The practical tasks for cleaning textual data differ from dataset to dataset. For example, you may need a function to strip HTML tags, or to remove diacritics (marks placed above or below a letter to indicate a particular pronunciation, as in resumé).
* (2) There are also multiple techniques for feature extraction. We will use the ones we covered in Module 2: we tokenise the text and then apply TF-IDF (Term Frequency-Inverse Document Frequency).
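The HTML and diacritics cases mentioned under (1) are not needed for our dataset, but they could be sketched as follows. The function names strip_html_tags and strip_diacritics are hypothetical, and the regex is a rough approximation rather than a full HTML parser.

```python
import re
import unicodedata

def strip_html_tags(text):
    # Rough sketch: removes anything that looks like <...>; not a full HTML parser
    return re.sub(r"<[^>]+>", "", text)

def strip_diacritics(text):
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_html_tags("<p>Win a <b>prize</b> now</p>"))
print(strip_diacritics("resumé"))
```

Functions like these could be called inside the transform method of a custom transformer, in the same way as the lowercase and punctuation steps.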

TextHero does not work with Python 3.12, so we clean the text with pandas and regex instead.

* We need to create a custom Python class so that the cleaning step can be integrated into the pipeline. We use the same approach for creating custom transformers that we saw in the feature-engine lesson: the class inherits from BaseEstimator and TransformerMixin and implements fit and transform methods, so it can be added to the ML pipeline like any other step.

In [12]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import re

class text_cleaning(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Lowercase and remove punctuation using pandas and regex
        return X.apply(lambda s: re.sub(r'[^\w\s]', '', str(s).lower()))
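To check what the cleaning expression does, we can apply the same lambda to a couple of made-up messages outside the class. This is only an illustration; the sample strings are not from the dataset.

```python
import re
import pandas as pd

# Made-up messages for illustration
messages = pd.Series(["Hello, World!", "FREE_entry: call 0800-123 NOW!!!"])

# Same expression as in the transform method: lowercase, then drop every
# character that is neither a word character nor whitespace
cleaned = messages.apply(lambda s: re.sub(r"[^\w\s]", "", str(s).lower()))
print(cleaned.tolist())
```

Note that the underscore survives, because the regex treats it as a word character, and that removing the hyphen joins the digits on either side of it.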

For feature extraction, we use CountVectorizer and TfidfTransformer. You can find their documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).

* We need to convert the textual data to a format from which the algorithms can learn the relationships, also known as vectors.
* CountVectorizer: According to its documentation, it converts a collection of text documents to a matrix of token counts. It stores the number of times every word is used in our text data. We are also removing English "stop words".
* TfidfTransformer (Term Frequency-Inverse Document Frequency Transformer): According to its documentation, it transforms a count matrix to a normalised tf or tf-idf representation. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and are empirically less informative than features that occur in a small fraction of the data. In addition, this highlights the words that are unique to a document and therefore better at characterising it.
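The two steps can be tried on a toy corpus before they go into the pipeline. The three documents below are made up for illustration, and the vectoriser is used with its default settings (no stop-word removal) to keep the vocabulary easy to follow.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, made up for illustration
docs = ["free prize call now", "call me later", "call now for a free prize"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse matrix: documents x tokens

tfidf = TfidfTransformer()
weights = tfidf.fit_transform(counts)     # same shape, tf-idf weighted

# Map each token to its inverse document frequency
idf = {term: tfidf.idf_[i] for term, i in vectorizer.vocabulary_.items()}
print(counts.shape)
print(idf["call"], idf["later"])

# With the default settings, every row is scaled to unit length
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(row_norms)
```

The token "call" appears in every document, so it receives a lower idf than "later", which appears in only one. This is the scaling-down effect described above.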

Our pipeline will have four steps:

* Text cleaning: lowercase the text and remove punctuation
* CountVectorizer: convert the text to a matrix of token counts
* TF-IDF: transform the count matrix to a normalised tf or tf-idf representation
* Model: the algorithm that predicts the target variable

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def PipelineOptimization(model):
  pipeline = Pipeline([
                       
        ( 'text_cleaning', text_cleaning() ),
        ( 'vect', CountVectorizer(stop_words='english') ),
        ( 'tfidf', TfidfTransformer() ),
        ( 'model', model )
    ])
  
  return pipeline

We load the Python class (HyperparameterOptimizationSearch) which aims to fit a set of algorithms with multiple hyperparameters. A quick reminder of what this class does:

* We define a set of algorithms and their respective hyperparameter values.
* The code iterates on each algorithm and fits pipelines using GridSearchCV, considering its respective hyperparameter values. The result is stored. This process is repeated for all algorithms that the user listed.
* Once all pipelines are trained, the developer can retrieve a list with a performance result summary and an object that contains all trained pipelines. The developer can then subset the best pipeline.
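For a single algorithm, the loop described above boils down to one GridSearchCV call. Here is a minimal sketch on made-up data; the toy messages, the two-fold split and the alpha grid are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration
texts = ["win a free prize now", "free cash call now", "claim your free prize",
         "urgent win cash today", "see you at lunch", "are you home yet",
         "call me after work", "thanks for the lift"]
labels = ["spam"] * 4 + ["ham"] * 4

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("model", MultinomialNB()),
])

# One candidate per value of alpha; each candidate is scored on every fold
grid = GridSearchCV(pipeline, {"model__alpha": [0.1, 1.0]}, cv=2, scoring="accuracy")
grid.fit(texts, labels)

print(grid.cv_results_["params"])
print(grid.cv_results_["split0_test_score"], grid.cv_results_["split1_test_score"])
```

The per-fold entries in cv_results_ (split0_test_score, split1_test_score and so on) are what a summary method can read to compute the minimum, maximum, mean and standard deviation for each candidate.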

In [14]:
from sklearn.model_selection import GridSearchCV
class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model=  PipelineOptimization(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs, verbose=verbose, scoring=scoring)
            gs.fit(X,y)
            self.grid_searches[key] = gs    

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                 'estimator': key,
                 'min_score': min(scores),
                 'max_score': max(scores),
                 'mean_score': np.mean(scores),
                 'std_score': np.std(scores),
            }
            return pd.Series({**params,**d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]        
                scores.append(r.reshape(len(params),1))

            all_scores = np.hstack(scores)
            for p, s in zip(params,all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score', 'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

### List Algorithms

Now, we list the algorithms we want to use for this task. First, we are considering new estimators from Scikit-learn that typically offer reasonable performance for NLP tasks.

* It doesn't mean we couldn't have considered the algorithms we have already seen in the course, like tree-based algorithms. However, we prefer algorithms that tend to work well with the sparse, high-dimensional features that text vectorisation produces.
* We will consider four algorithms.


In [15]:
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

models_search = {
    "MultinomialNB":MultinomialNB(),
    "SGDClassifier":SGDClassifier(random_state=101),
    "SVC": SVC(random_state=101),
    "LinearSVC": LinearSVC(random_state=101),
}


params_search = {
   "MultinomialNB":{},
    "SGDClassifier": {},
   "SVC": {},
    "LinearSVC": {},
}


#### Fit multiple pipelines with multiple algorithms using their default hyperparameters

We start by fitting multiple pipelines using the default hyperparameters.

We pass in the training data, set the scoring metric to accuracy (we assume our stakeholders are interested in how accurate their system is) and set cv=4.
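With cv=4, each candidate is scored four times, each time on a different quarter of the training data. A small sketch of the arithmetic, assuming the 1337 training rows are spread across the folds as evenly as possible (stratified folds may distribute the rows slightly differently):

```python
n_train, n_folds = 1337, 4

# Spread the rows as evenly as possible: the first (n_train % n_folds) folds get one extra row
base, extra = divmod(n_train, n_folds)
fold_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
print(fold_sizes)

# Accuracy on a fold is the number of correct predictions divided by the fold size
print(round(325 / 335, 6), round(329 / 334, 6))
```

Under this assumption, a fold score of 0.970149 corresponds to 325 correct predictions out of 335, and 0.98503 to 329 out of 334, so each extra misclassification moves a fold score by about 0.003.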

In [16]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-2,
           cv=4)


Running GridSearchCV for MultinomialNB 

Fitting 4 folds for each of 1 candidates, totalling 4 fits

Running GridSearchCV for SGDClassifier 

Fitting 4 folds for each of 1 candidates, totalling 4 fits

Running GridSearchCV for SVC 

Fitting 4 folds for each of 1 candidates, totalling 4 fits

Running GridSearchCV for LinearSVC 

Fitting 4 folds for each of 1 candidates, totalling 4 fits




We can now check the training results summary.

Note that SGDClassifier performed best (mean accuracy of about 0.976), with LinearSVC close behind (about 0.969). SVC and MultinomialNB score lower, at about 0.93, which is still a good result.

In [17]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score |
|---|---|---|---|---|---|
| 1 | SGDClassifier | 0.970149 | 0.97607 | 0.98503 | 0.005577 |
| 3 | LinearSVC | 0.958084 | 0.968587 | 0.979042 | 0.00748 |
| 2 | SVC | 0.92515 | 0.931185 | 0.937313 | 0.004301 |
| 0 | MultinomialNB | 0.92515 | 0.930443 | 0.934132 | 0.003859 |


## NLP Part 2

Next, we fit pipelines for our two best-performing algorithms, SGDClassifier and LinearSVC, this time searching over two values of the tol hyperparameter.

In [18]:

models_search = {
    #"MultinomialNB":MultinomialNB(),
    "SGDClassifier":SGDClassifier(random_state=101),
    #"SVC": SVC(random_state=101),
    "LinearSVC": LinearSVC(random_state=101),
}


params_search = {
   #"MultinomialNB":{},
    "SGDClassifier": {'model__tol':[1e-2, 1e-1], },
   #"SVC": {},
    "LinearSVC": {'model__tol':[1e-2, 1e-1], },
}
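The key model__tol follows the scikit-learn convention for pipelines: the step name, two underscores, then the parameter name. A short sketch of how such a key reaches the estimator, using a reduced two-step pipeline for brevity:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("vect", CountVectorizer()), ("model", LinearSVC(random_state=101))])

# "<step name>__<parameter>" addresses a parameter of one pipeline step
pipeline.set_params(model__tol=1e-2)
print(pipeline.named_steps["model"].tol)
```

GridSearchCV applies each candidate value in the same way, which is why the keys in params_search must match the step name used in PipelineOptimization ('model').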

Next, we fit multiple pipelines with the selected algorithms.

In [19]:
search = HyperparameterOptimizationSearch(models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring='accuracy',
           n_jobs=-2,
           cv=4)


Running GridSearchCV for SGDClassifier 

Fitting 4 folds for each of 2 candidates, totalling 8 fits

Running GridSearchCV for LinearSVC 

Fitting 4 folds for each of 2 candidates, totalling 8 fits




Then we check the results summary.

In [22]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

| | estimator | min_score | mean_score | max_score | std_score | model__tol |
|---|---|---|---|---|---|---|
| 2 | LinearSVC | 0.958084 | 0.968587 | 0.979042 | 0.00748 | 0.01 |
| 3 | LinearSVC | 0.958084 | 0.968587 | 0.979042 | 0.00748 | 0.1 |
| 0 | SGDClassifier | 0.958084 | 0.968585 | 0.976048 | 0.006531 | 0.01 |
| 1 | SGDClassifier | 0.958084 | 0.968585 | 0.976048 | 0.006531 | 0.1 |


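LinearSVC comes out marginally ahead (mean accuracy of 0.968587 versus 0.968585), so the two algorithms are effectively tied, and both tol values give identical scores for each algorithm. Next, we retrieve the name of the best model from the first row of the summary.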

In [23]:
best_model = grid_search_summary.iloc[0, 0]
best_model

'LinearSVC'

Then we retrieve its best hyperparameters.

In [24]:
grid_search_pipelines[best_model].best_params_

{'model__tol': 0.01}

Finally, we retrieve the best pipeline.

In [25]:
best_pipeline = grid_search_pipelines[best_model].best_estimator_
best_pipeline

### Pipeline performance

Next, we check our pipeline performance on the train and test sets.

In [26]:
from sklearn.metrics import classification_report, confusion_matrix

def confusion_matrix_and_report(X,y,pipeline,label_map):

  prediction = pipeline.predict(X)

  print('---  Confusion Matrix  ---')
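  # Predictions are passed as y_true on purpose, so rows show predictions and columns show actual values
  print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
        columns=[ ["Actual " + sub for sub in label_map] ], 
        index= [ ["Prediction " + sub for sub in label_map ]]
        ))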
  print("\n")


  print('---  Classification Report  ---')
  print(classification_report(y, prediction, target_names=label_map),"\n")


def clf_performance(X_train,y_train,X_test,y_test,pipeline,label_map):
  print("#### Train Set #### \n")
  confusion_matrix_and_report(X_train,y_train,pipeline,label_map)

  print("#### Test Set ####\n")
  confusion_matrix_and_report(X_test,y_test,pipeline,label_map)

We pass in the arguments.

* Train and Test set
* Best pipeline
* For label_map, we pass the class names in sorted order with sorted(df['label'].unique()). confusion_matrix and classification_report order the classes alphabetically (ham, then spam), so the names must follow the same order; the order returned by .unique() alone depends on which label appears first in the data and would swap the two names here.


Note: The model learned the relationships in the data in the train set and predicted everything correctly. In the test set, we had a few misclassifications (6 out of 335), but still, the performance looks good, and the model could generalise on the unseen data (test set).

In [27]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=best_pipeline,
                label_map=sorted(df['label'].unique())
                )

#### Train Set #### 

---  Confusion Matrix  ---
                Actual ham Actual spam
Prediction ham        1174           0
Prediction spam          0         163


---  Classification Report  ---
              precision    recall  f1-score   support

         ham       1.00      1.00      1.00      1174
        spam       1.00      1.00      1.00       163

    accuracy                           1.00      1337
   macro avg       1.00      1.00      1.00      1337
weighted avg       1.00      1.00      1.00      1337

#### Test Set ####

---  Confusion Matrix  ---
                Actual ham Actual spam
Prediction ham         292           5
Prediction spam          1          37


---  Classification Report  ---
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       293
        spam       0.97      0.88      0.93        42

    accuracy                           0.98       335
   macro avg       0.98      0.94      0.96       335
w
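The figures in the test-set report can be reproduced by hand from the four counts in the confusion matrix. A short worked sketch for the minority class (the second row and column); the variable names are illustrative, with "a" standing for the majority class and "b" for the minority class:

```python
# Test-set counts from the confusion matrix above (rows: predicted, columns: actual)
pred_a_actual_a, pred_a_actual_b = 292, 5
pred_b_actual_a, pred_b_actual_b = 1, 37

total = pred_a_actual_a + pred_a_actual_b + pred_b_actual_a + pred_b_actual_b
accuracy = (pred_a_actual_a + pred_b_actual_b) / total

# Precision: of everything predicted as class b, how much really is class b
precision_b = pred_b_actual_b / (pred_b_actual_a + pred_b_actual_b)
# Recall: of everything that really is class b, how much was predicted as class b
recall_b = pred_b_actual_b / (pred_a_actual_b + pred_b_actual_b)
f1_b = 2 * precision_b * recall_b / (precision_b + recall_b)

print(f"accuracy={accuracy:.4f} precision={precision_b:.4f} "
      f"recall={recall_b:.4f} f1={f1_b:.4f}")
```

The lower recall for the minority class comes from the 5 messages of that class that were predicted as the majority class.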

#### Using Real Time Data

We can now make predictions on new, unseen messages. The pipeline expects a Pandas Series, so we wrap the message in one before calling predict.

In [33]:
##############  Real-time Prediction  ######################################
real_time_msg = 'Click here for free spins'
########################################################################

X_live = pd.Series(data=real_time_msg, name='message')
best_pipeline.predict(X_live)

array(['spam'], dtype=object)