# Enriched feature engineering for NLP

By HK Turesson

This tutorial explores how to enrich bag-of-words (BOW) representations with non-standard features such as part-of-speech (POS) tags, dependency labels, and word shapes.

We will use [spaCy](https://spacy.io/), an advanced NLP library, to enrich the documents.

## Imports

In [None]:
import spacy
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

## Load spaCy's English pipeline
[`en_core_web_sm`](https://spacy.io/models/en#en_core_web_sm) is an English spaCy pipeline optimized for CPU ([see here](https://spacy.io/models/en#en_core_web_sm) for details). Its components are: `tok2vec`, `tagger`, `parser`, `senter`, `ner`, `attribute_ruler`, `lemmatizer`.
`en_core_web_sm` is already installed on Google Colab. If you get an error when loading it, try downloading it with `python -m spacy download en_core_web_sm`.

In [None]:
nlp = spacy.load("en_core_web_sm")

## Tokenization with spaCy

In [None]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

In [None]:
print('Text\t\tLemma\tPOS\tTag\tDep\tShape\talpha\tstop')
print('-'*80)
for token in doc:
    print(f'{token.text}\t\t{token.lemma_}\t{token.pos_}\t{token.tag_}\t{token.dep_}\t{token.shape_}\t{token.is_alpha}\t{token.is_stop}')

See spaCy's [linguistic features documentation](https://spacy.io/usage/linguistic-features) for a full explanation of these attributes.
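
Of the attributes printed above, `shape_` is probably the least self-explanatory. The sketch below approximates the idea in plain Python. It is not spaCy's implementation: the helper name `approx_shape` and the run-length cap `max_run=4` are assumptions made for illustration, so treat the output as indicative only.

```python
def approx_shape(text, max_run=4):
    """Approximate a word shape: X = uppercase, x = lowercase, d = digit.

    Hypothetical helper for illustration; runs of the same shape character
    are capped at `max_run` so long words collapse to a short pattern.
    """
    shape = []
    last, run = "", 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(f"{word}\t{approx_shape(word)}")
```

Because the shape discards the letters themselves, words such as "looking" and "billion" map to the same pattern, which is what makes it a coarse but compact feature.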

## Data

We will use the [BANKING77](https://huggingface.co/datasets/PolyAI/banking77) dataset.
BANKING77 is composed of online banking queries annotated with their corresponding intents. It comprises 13,083 customer service queries labelled with 77 intents and focuses on fine-grained, single-domain intent detection.

In [None]:
!unzip banking_data.zip

**Task**: Read `train.csv` and `test.csv`, storing the data with the names `train_data` and `test_data`, respectively.

**Tutorial question 1**: What is the last text in `train_data`?

**Tutorial question 2**: How many unique classes are in the data set?
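
The kind of calls the task and the two questions above rely on can be tried on a toy table first. This is a minimal sketch with made-up rows held in memory; it assumes the real CSV files have the same two columns, `text` and `category`, as used later in this tutorial. The names `toy_csv` and `toy_data` are illustrative.

```python
import io
import pandas as pd

# Toy stand-in for train.csv; the real file is assumed to have the same
# two columns, 'text' and 'category'.
toy_csv = io.StringIO(
    "text,category\n"
    "I am still waiting on my card?,card_arrival\n"
    "What is the exchange rate today?,exchange_rate\n"
    "My card has not arrived yet,card_arrival\n"
)
toy_data = pd.read_csv(toy_csv)

last_text = toy_data["text"].iloc[-1]        # last query in the frame
n_classes = toy_data["category"].nunique()   # number of distinct intents

print(last_text)
print(n_classes)
```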

## Pre-processing

Applying spaCy's `nlp()` pipeline to a document takes a bit of time. If possible, it is best to only do it once. Thus, we'll do it once, store the output in `train_docs` and `test_docs` and then use these pre-computed lists repeatedly.

In [None]:
train_docs, test_docs = [], []

# Run the spaCy pipeline once per query and cache the resulting Doc objects.
for _, row in train_data.iterrows():
    train_docs.append(nlp(row['text']))

for _, row in test_data.iterrows():
    test_docs.append(nlp(row['text']))
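
If the loop above feels slow, spaCy's `nlp.pipe` may help, since it processes texts in batches rather than one call at a time. The sketch below uses `spacy.blank("en")`, a tokenizer-only pipeline, purely so that it runs without a downloaded model; `toy_nlp`, `toy_texts` and the batch size are assumptions for illustration.

```python
import spacy

# spacy.blank("en") builds a tokenizer-only pipeline, used here so the sketch
# runs without a downloaded model; in the tutorial `nlp` would be the loaded
# en_core_web_sm pipeline instead.
toy_nlp = spacy.blank("en")
toy_texts = ["I am still waiting on my card?", "What is the exchange rate today?"]

# nlp.pipe streams the texts through the pipeline in batches and yields Docs
# in the same order as the input.
toy_docs = list(toy_nlp.pipe(toy_texts, batch_size=64))

print(len(toy_docs), [token.text for token in toy_docs[0]])
```

With the tutorial's objects, the equivalent would presumably be `train_docs = list(nlp.pipe(train_data['text']))`.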

### Helper function to enrich features

Concatenating the linguistic features into a new long string (i.e. an un-tokenized document) and then tokenizing it again using sklearn's `TfidfVectorizer` is a bit hacky. However, here we do it for educational purposes.
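
To make the concatenation idea concrete, here is a toy sketch that works on hand-written (lemma, tag) pairs rather than real spaCy tokens. The values in `toy_tokens` are illustrative and not actual pipeline output.

```python
# Hand-written (lemma, tag) pairs standing in for spaCy tokens; the values
# are illustrative, not actual pipeline output.
toy_tokens = [("card", "NN"), ("arrive", "VBN"), ("week", "NN")]

# Glue each token's selected attributes together, then join the enriched
# tokens with spaces to form a new "document" for the vectorizer.
enriched_tokens = [f"{lemma}{tag}" for lemma, tag in toy_tokens]
enriched_doc = " ".join(enriched_tokens)

print(enriched_doc)
```

The effect is that the same lemma with two different tags becomes two different vocabulary entries once the string is tokenized again.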

In [None]:
def enrich_features(docs, features):
    """
    Arguments
    ---------
        docs     : A list of outputs from spaCy's nlp()
        features : A dictionary with the following keys
                    'keep_noalpha', 
                    'rm_stop',
                    'text',
                    'lemma',
                    'pos',
                    'tag',
                    'dep',
                    'shape'
                   and boolean values.
    
                   E.g.:
                       features = {
                        'keep_noalpha': False,
                        'rm_stop': True,
                        'text': False,
                        'lemma': True,
                        'pos': False,
                        'tag': True,
                        'dep': False,
                        'shape': False}
    Return
    ------
    enriched : A list of enriched docs.
    
    """
    
    enriched = []
    
    for doc in docs:
      
        enriched_doc = ''
          
        for token in doc:
            
            enriched_token = ''
            
            if features['keep_noalpha'] or token.is_alpha:
              
                if not (features['rm_stop'] and token.is_stop):        
                  
                    if features['text']:
                        enriched_token = f'{enriched_token}{token.text}'
                    if features['lemma']:
                        enriched_token = f'{enriched_token}{token.lemma_}'
                    if features['pos']:
                        enriched_token = f'{enriched_token}{token.pos_}'
                    if features['tag']:
                        enriched_token = f'{enriched_token}{token.tag_}'                  
                    if features['dep']:
                        enriched_token = f'{enriched_token}{token.dep_}'
                    if features['shape']:
                        enriched_token = f'{enriched_token}{token.shape_}'                  
                
                    enriched_doc = f'{enriched_doc} {enriched_token}'
                    
        enriched.append(enriched_doc)
    
    return enriched

In [None]:
features = {
    'keep_noalpha': False,
    'rm_stop': True,
    'text': False,
    'lemma': True,
    'pos': False,
    'tag': True,
    'dep': False,
    'shape': False}
train = enrich_features(train_docs, features)
test = enrich_features(test_docs, features)

In [None]:
train[:5]

In [None]:
train_data[:5]

### Tokenize again

**Task**: Use sklearn's [`TfidfVectorizer`](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#) to vectorize `train` and `test`, storing the outputs in `X_train` and `X_test`, respectively.

Set `lowercase` to `False`, `stop_words` to `None`, and `use_idf` to `True`.

**Tutorial question 3**: How many features are there in `X_train` (i.e. what is $|V|$)?

**Tutorial question 4**: What is the 23rd token in $V$?

**Tutorial question 5**: What is the POS associated with that token?

## Text classification

Here, we focus on feature enrichment and not the learner. Thus, we'll stick with one learner (Multinomial Naive Bayes) and default hyperparameters.

### Train

In [None]:
clf = MultinomialNB().fit(X_train, train_data['category'])

### Evaluate

In [None]:
preds = clf.predict(X_test)

print('Test set accuracy:', (preds == test_data['category']).mean())
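
The accuracy expression above compares predictions and labels element-wise and averages the resulting booleans. As a sanity check, the same steps can be run on a data set small enough to verify by hand. Everything in this sketch (the texts, the two labels, the `toy_` names) is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up data set with two intents.
toy_train = ["card lost", "card stolen", "exchange rate", "currency exchange"]
toy_y_train = np.array(["card", "card", "fx", "fx"])
toy_test = ["lost card", "exchange currency"]
toy_y_test = np.array(["card", "fx"])

toy_vec = TfidfVectorizer(lowercase=False, stop_words=None, use_idf=True)
toy_clf = MultinomialNB().fit(toy_vec.fit_transform(toy_train), toy_y_train)

toy_preds = toy_clf.predict(toy_vec.transform(toy_test))

# Element-wise comparison gives a boolean array; its mean is the accuracy.
toy_acc = (toy_preds == toy_y_test).mean()
print("Toy accuracy:", toy_acc)
```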

**Tutorial question 6**: What is the test set accuracy?

**Task**: Combine the above steps (`enrich_features`, `TfidfVectorizer`, training and evaluation) into a function called `pipeline`.
`pipeline()` should take `train_docs`, `test_docs`, and `features` as arguments and return the accuracy. Make sure that it can handle empty docs.

In [None]:
def pipeline(train_docs, test_docs, features):
    
    train = enrich_features(train_docs, features)
    test = enrich_features(test_docs, features)

    vectorizer = TfidfVectorizer(lowercase=False, stop_words=None, use_idf=True)

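    try:
        X_train = vectorizer.fit_transform(train)
        X_test = vectorizer.transform(test)

        # Labels come from the global train_data / test_data DataFrames.
        clf = MultinomialNB().fit(X_train, train_data['category'])

        preds = clf.predict(X_test)

        acc = (preds == test_data['category']).mean()

    except ValueError:
        # TfidfVectorizer raises ValueError when every enriched doc is empty
        # (empty vocabulary), e.g. when no feature is switched on.
        acc = 0

    return acc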
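
The failure that the `try` block in `pipeline()` guards against can be reproduced in isolation. When no feature is switched on, `enrich_features` returns strings made only of spaces, and fitting the vectorizer on such documents fails. The sketch below mimics that case with two hand-written blank strings; `empty_docs` and `raised` are illustrative names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# With every feature switched off, enrich_features would return strings that
# contain only spaces; these toy docs mimic that case.
empty_docs = ["   ", "  "]

vectorizer = TfidfVectorizer(lowercase=False, stop_words=None, use_idf=True)

try:
    vectorizer.fit_transform(empty_docs)
    raised = False
except ValueError as err:
    # scikit-learn reports an empty vocabulary here.
    raised = True
    print("Caught:", err)
```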

**Task**: Find the best feature combination by training and evaluating models on all possible combinations. Store the feature configurations and accuracies in a list called `configs`. Don't forget to use `features.copy()` when storing the feature configurations in `configs`.
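
One way to enumerate all combinations is `itertools.product` over the eight boolean flags, which yields $2^8 = 256$ configurations. The sketch below is not the intended solution: `toy_score` is a hypothetical stand-in for `pipeline(train_docs, test_docs, features)` so that the code runs on its own, and it simply counts the switched-on flags.

```python
from itertools import product

feature_names = ['keep_noalpha', 'rm_stop', 'text', 'lemma',
                 'pos', 'tag', 'dep', 'shape']


def toy_score(features):
    """Hypothetical stand-in for pipeline(); counts the switched-on flags."""
    return sum(features.values())


configs = []
features = {}
for values in product([False, True], repeat=len(feature_names)):
    # Reuse one dict and overwrite it in place on every iteration.
    features.update(zip(feature_names, values))
    # Without .copy(), every entry would point at the same (final) dict.
    configs.append((features.copy(), toy_score(features)))

best_features, best_score = max(configs, key=lambda item: item[1])
print(len(configs), best_score)
```

The `.copy()` matters because the loop mutates a single dictionary. Storing `features` directly would leave `configs` holding 256 references to the same object, all showing the last configuration.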

**Tutorial question 7**: What is the best 