# Enriched feature engineering for NLP

By HK Turesson

This tutorial explores how to enrich bag-of-words (BOW) representations with non-standard features such as part-of-speech (POS) tags, dependencies, and word shapes.

We will use [spaCy](https://spacy.io/) - an advanced NLP library - to enrich the documents.

## Imports

In [60]:
import spacy
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [61]:
# pip install spacy pandas numpy scikit-learn

## Load spaCy's English pipeline
[`en_core_web_sm`](https://spacy.io/models/en#en_core_web_sm) is an English spaCy pipeline optimized for CPU. Its components are: `tok2vec`, `tagger`, `parser`, `senter`, `ner`, `attribute_ruler`, `lemmatizer`.
`en_core_web_sm` is already installed on Google Colab. If you get an error when loading it, try downloading it with `python -m spacy download en_core_web_sm`.

In [62]:
nlp = spacy.load("en_core_web_sm")

In [63]:
# python -m spacy download en_core_web_sm

## Tokenization with spaCy

In [64]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

In [65]:
print('Text\t\tLemma\tPOS\tTag\tDep\tShape\talpha\tstop')
print('-'*80)
for token in doc:
    print(f'{token.text}\t\t{token.lemma_}\t{token.pos_}\t{token.tag_}\t{token.dep_}\t{token.shape_}\t{token.is_alpha}\t{token.is_stop}')

Text		Lemma	POS	Tag	Dep	Shape	alpha	stop
--------------------------------------------------------------------------------
Apple		Apple	PROPN	NNP	nsubj	Xxxxx	True	False
is		be	AUX	VBZ	aux	xx	True	True
looking		look	VERB	VBG	ROOT	xxxx	True	False
at		at	ADP	IN	prep	xx	True	True
buying		buy	VERB	VBG	pcomp	xxxx	True	False
U.K.		U.K.	PROPN	NNP	nsubj	X.X.	False	False
startup		startup	VERB	VBD	ccomp	xxxx	True	False
for		for	ADP	IN	prep	xxx	True	True
$		$	SYM	$	quantmod	$	False	False
1		1	NUM	CD	compound	d	False	False
billion		billion	NUM	CD	pobj	xxxx	True	False


In [66]:
!pip install -U spacy



In [67]:
# !python -m spacy download en_core_web_sm

See spaCy's [linguistic features documentation](https://spacy.io/usage/linguistic-features) for a full explanation.
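To make the `Shape` column in the table above more concrete, here is a minimal pure-Python sketch of how such a shape string can be derived. It is an approximation written for illustration (the function name `word_shape` and the `max_run` parameter are assumptions), not spaCy's own implementation: letters become `X` or `x`, digits become `d`, other characters are kept, and runs of the same symbol are cut off after four.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Approximate a spaCy-style shape string (illustrative, not spaCy's code)."""
    shape = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and symbols are kept as they are
        # Count how long the current run of identical shape symbols is
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(word, "->", word_shape(word))
```

For the tokens of the example sentence, this sketch reproduces the shapes listed in the table (`Xxxxx`, `xxxx`, `X.X.`, `$`, `d`, `xxxx`).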

## Data

We will use the dataset [BANKING77](https://huggingface.co/datasets/PolyAI/banking77).
BANKING77 consists of online banking queries annotated with their corresponding intents. It comprises 13,083 customer service queries labelled with 77 fine-grained intents and targets single-domain intent detection.

In [68]:
# !unzip banking_data.zip

**Task**: Read `train.csv` and `test.csv`, storing the data with the names `train_data` and `test_data`, respectively.

**Tutorial question 1**: What is the last text in `train_data`?

**Tutorial question 2**: How many unique classes are in the data set?

In [69]:
train_data = pd.read_csv('banking_data/train.csv')
test_data = pd.read_csv('banking_data/test.csv')


In [70]:
# Get last text in train_data
train_data.tail()

Unnamed: 0,text,category
9998,You provide support in what countries?,country_support
9999,What countries are you supporting?,country_support
10000,What countries are getting support?,country_support
10001,Are cards available in the EU?,country_support
10002,Which countries are represented?,country_support


In [71]:
train_data.iloc[-1]

text        Which countries are represented?
category                     country_support
Name: 10002, dtype: object

In [72]:
# Unique classes
train_data["category"].value_counts()
len(train_data["category"].value_counts())

77

In [73]:
train_data["category"].nunique()

77

## Pre-processing

Applying spaCy's `nlp()` pipeline to a document takes a bit of time. If possible, it is best to only do it once. Thus, we'll do it once, store the output in `train_docs` and `test_docs` and then use these pre-computed lists repeatedly.
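One possible way to speed up this one-off pre-processing is spaCy's `nlp.pipe`, which processes texts in batches. The sketch below uses `spacy.blank("en")` (a tokenizer-only pipeline) purely so that it runs without a model download; in this tutorial the loaded `nlp` object would take its place. The `texts` list and the `batch_size` value are illustrative assumptions.

```python
import spacy

# Tokenizer-only pipeline, used here only so the sketch runs without a model download.
# In the tutorial, the `nlp` object loaded from "en_core_web_sm" would take its place.
nlp_blank = spacy.blank("en")

texts = ["I am still waiting on my card?", "Which countries are represented?"]

# nlp.pipe yields Doc objects in input order, processing the texts in batches
docs = list(nlp_blank.pipe(texts, batch_size=64))

print(len(docs), [token.text for token in docs[0]])
```

With the real data, something like `list(nlp.pipe(train_data["text"]))` would produce the same kind of list as the loop over `iterrows()`.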

In [74]:
train_docs, test_docs = [], []   

In [75]:
for i, row in train_data.iterrows():
  train_docs.append(nlp(row['text']))

for i, row in test_data.iterrows():
  test_docs.append(nlp(row['text'])) 

In [76]:
# train_data["text"]

### Helper function to enrich features

Concatenating the linguistic features into a new long string (i.e. an un-tokenized document) and then tokenizing it again using sklearn's `TfidfVectorizer` is a bit hacky. However, we do it here for educational purposes.

In [77]:
def enrich_features(docs, features):
    """
    Arguments
    ---------
        docs     : A list of outputs from spaCy's nlp()
        features : A dictionary with the following keys
                    'keep_noalpha', 
                    'rm_stop',
                    'text',
                    'lemma',
                    'pos',
                    'tag',
                    'dep',
                    'shape'
                   and boolean values.
    
                   E.g.:
                       features = {
                        'keep_noalpha': False,
                        'rm_stop': True,
                        'text': False,
                        'lemma': True,
                        'pos': False,
                        'tag': True,
                        'dep': False,
                        'shape': False}
    Return
    ------
    enriched : A list of enriched docs.
    
    """
    
    enriched = []
    
    for doc in docs:
      
        enriched_doc = ''
          
        for token in doc:
            
            enriched_token = ''
            
            if features['keep_noalpha'] or token.is_alpha:
              
                if not (features['rm_stop'] and token.is_stop):        
                  
                    if features['text']:
                        enriched_token = f'{enriched_token}{token.text}'
                    if features['lemma']:
                        enriched_token = f'{enriched_token}{token.lemma_}'
                    if features['pos']:
                        enriched_token = f'{enriched_token}{token.pos_}'
                    if features['tag']:
                        enriched_token = f'{enriched_token}{token.tag_}'                  
                    if features['dep']:
                        enriched_token = f'{enriched_token}{token.dep_}'
                    if features['shape']:
                        enriched_token = f'{enriched_token}{token.shape_}'                  
                
                    enriched_doc = f'{enriched_doc} {enriched_token}'
                    
        enriched.append(enriched_doc)
    
    return enriched
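To see what the helper produces without running spaCy, the following sketch applies a condensed version of the same logic to hand-written stand-in tokens. The `make_token` and `enrich_doc` names are hypothetical, only the `lemma` and `tag` flags are handled, and the token attributes are typed in by hand as an assumption rather than taken from spaCy output.

```python
from types import SimpleNamespace


def make_token(text, lemma, tag, is_alpha=True, is_stop=False):
    """Hypothetical stand-in exposing a few spaCy-like token attributes."""
    return SimpleNamespace(text=text, lemma_=lemma, tag_=tag,
                           is_alpha=is_alpha, is_stop=is_stop)


def enrich_doc(tokens, features):
    """Condensed version of the per-document logic (lemma and tag only)."""
    parts = []
    for token in tokens:
        if not (features["keep_noalpha"] or token.is_alpha):
            continue  # drop non-alphabetic tokens
        if features["rm_stop"] and token.is_stop:
            continue  # drop stop words
        enriched = ""
        if features["lemma"]:
            enriched += token.lemma_
        if features["tag"]:
            enriched += token.tag_
        parts.append(enriched)
    # Like the original helper, prefix every enriched token with a space
    return "".join(f" {part}" for part in parts)


# Hand-written attributes for "I am still waiting on my card?" (assumed, not spaCy output)
tokens = [
    make_token("I", "I", "PRP", is_stop=True),
    make_token("am", "be", "VBP", is_stop=True),
    make_token("still", "still", "RB", is_stop=True),
    make_token("waiting", "wait", "VBG"),
    make_token("on", "on", "IN", is_stop=True),
    make_token("my", "my", "PRP$", is_stop=True),
    make_token("card", "card", "NN"),
    make_token("?", "?", ".", is_alpha=False),
]
features = {"keep_noalpha": False, "rm_stop": True, "lemma": True, "tag": True}
print(repr(enrich_doc(tokens, features)))
```

Under these assumed attributes, stop words and the question mark are dropped, and the two remaining tokens become `waitVBG` and `cardNN`.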

In [78]:
# What does the function above do?
# It iterates through the docs and, for each token, checks 'keep_noalpha' (whether to keep
# non-alphabetic tokens) and 'rm_stop' (whether to remove stop words).
# For each kept token, it concatenates the selected attributes (text, lemma, POS, ...)
# into one new token and appends that token to enriched_doc.

In [79]:
# repeat the same steps for training doc and test doc

In [80]:
features = {
    'keep_noalpha': False,
    'rm_stop': True,
    'text': False,
    'lemma': True,
    'pos': False,
    'tag': True,
    'dep': False,
    'shape': False}
train = enrich_features(train_docs, features)
test = enrich_features(test_docs, features)

In [81]:
train[:5]

# Each kept token becomes one new token: the selected features concatenated,
# e.g. lemma "wait" + tag "VBG" -> "waitVBG", lemma "card" + tag "NN" -> "cardNN".
# The new tokens are joined with spaces into one string per document
# so that sklearn's vectorizer can tokenize them again.

# Are we using the dependency labels here?
# No: 'dep' is False in this configuration, so only the lemma and the tag are concatenated.

# Next: figure out which feature combination works best.

[' waitVBG cardNN',
 ' cardNN arriveVBN weekNNS',
 ' waitVBG weekNN cardNN comeVBG',
 ' trackVB cardNN processNN deliveryNN',
 ' knowVB cardNN loseVBN']

In [82]:
train_data[:5]

Unnamed: 0,text,category
0,I am still waiting on my card?,card_arrival
1,What can I do if my card still hasn't arrived ...,card_arrival
2,I have been waiting over a week. Is the card s...,card_arrival
3,Can I track my card while it is in the process...,card_arrival
4,"How do I know if I will get my card, or if it ...",card_arrival


### Tokenize again

**Task**: Use sklearn's [`TfidfVectorizer`](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#) to vectorize `train` and `test`, storing the outputs in `X_train` and `X_test`, respectively.

Set `lowercase` to `False`, `stop_words` to `None`, and `use_idf` to `True`.
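It can help to know how the vectorizer will split the enriched strings. The sketch below applies the default `token_pattern` of sklearn's text vectorizers (tokens of two or more word characters) with Python's `re` module; the example strings are illustrative assumptions.

```python
import re

# Default `token_pattern` of sklearn's text vectorizers:
# a token is a run of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

examples = [" waitVBG cardNN", " $ 1 billion", " X.X. xxxx"]
for enriched_doc in examples:
    print(repr(enriched_doc), "->", re.findall(DEFAULT_TOKEN_PATTERN, enriched_doc))
```

Enriched tokens such as `waitVBG` survive intact, whereas single characters such as `$` or `1` and dotted shapes such as `X.X.` are dropped. If such tokens should be kept, a custom pattern (for example `token_pattern=r"\S+"`) is one option to consider.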

In [101]:
# Set up the vectorizer
vectorizer = TfidfVectorizer(lowercase=False, stop_words=None, use_idf=True)

# Fit on the training data only, then transform both sets with the fitted vocabulary
X_train = vectorizer.fit_transform(train)
X_test = vectorizer.transform(test)  # never fit on the test data, only transform it

In [97]:
vectorizer
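The following toy sketch illustrates why the test data is only transformed: the vocabulary and IDF weights come from the training documents, and a token that appears only in the test set is ignored. The `*_toy` names and the example strings are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_toy = [" waitVBG cardNN", " cardNN arriveVBN weekNNS"]
test_toy = [" cardNN refundNN"]  # "refundNN" never occurs in train_toy

toy_vectorizer = TfidfVectorizer(lowercase=False, stop_words=None, use_idf=True)
X_train_toy = toy_vectorizer.fit_transform(train_toy)  # learns vocabulary and IDF weights
X_test_toy = toy_vectorizer.transform(test_toy)        # reuses them; unseen tokens are ignored

print(sorted(toy_vectorizer.vocabulary_))
print(X_train_toy.shape, X_test_toy.shape)
```

Both matrices have the same number of columns, which is what allows a classifier trained on one to predict on the other.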

**Tutorial question 3**: How many features are there in `X_train` (i.e. what is $|V|$)?

In [84]:
X_train.shape  # (number of documents, |V|)


(10003, 2566)

In [95]:
# Dense view of the TF-IDF vector for the first training document
X_train[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

**Tutorial question 4**: What is the 23rd token in $V$?
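As a hint for this kind of question: the vectorizer's feature names are ordered by column index, and the columns follow the sorted order of the tokens, so the $n$-th token sits at index $n - 1$. The sketch below mimics this by hand on the five enriched documents shown earlier; building the vocabulary with `sorted` and `split` is a simplified stand-in for the vectorizer, not its actual code.

```python
enriched_docs = [
    " waitVBG cardNN",
    " cardNN arriveVBN weekNNS",
    " waitVBG weekNN cardNN comeVBG",
    " trackVB cardNN processNN deliveryNN",
    " knowVB cardNN loseVBN",
]

# Sorted set of tokens: a simplified stand-in for the vectorizer's feature names
vocabulary = sorted({token for doc in enriched_docs for token in doc.split()})

n = 3  # 1-based position of interest
print(len(vocabulary), vocabulary[n - 1])
```

Note that Python sorts strings by code point, so tokens starting with an uppercase letter come before lowercase ones.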

In [86]:
# vocabulary = vectorizer.vocabulary_

In [105]:
feature_names = vectorizer.get_feature_names_out()

In [102]:
feature_names[22]  # 23rd token (index 22, since indexing is 0-based)

In [None]:
# BANK NNP is the answer

**Tutorial question 5**: What is the POS associated with that token?

## Text classification

Here, we focus on feature enrichment and not the learner. Thus, we'll stick with one learner (Multinomial Naive Bayes) and default hyperparameters.

### Train

In [98]:
clf = MultinomialNB().fit(X_train, train_data['category'])

### Evaluate

In [88]:
preds = clf.predict(X_test)

print('Test set accuracy:', (preds == test_data['category']).mean())

# accuracy is 0.74642

Test set accuracy: 0.7464285714285714
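The accuracy above is simply the fraction of test queries whose predicted intent equals the true intent. A minimal sketch with made-up labels (the `accuracy` helper and the toy lists are assumptions), together with the chance level for 77 equally likely classes as a point of comparison:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels and predictions, for illustration only
y_true = ["card_arrival", "card_arrival", "country_support", "country_support"]
y_pred = ["card_arrival", "country_support", "country_support", "country_support"]

print(accuracy(y_true, y_pred))  # 3 of 4 correct
print(round(1 / 77, 4))          # uniform random guessing over 77 classes
```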


**Tutorial question 6**: What is the test set accuracy?

**Task**: Combine the above steps (`enrich_features`, `TfidfVectorizer`, training and evaluation) into a pipeline called `pipeline`.
`pipeline()` should take `train_docs`, `test_docs`, and `features` as arguments and return the accuracy. Make sure that it can handle empty docs.

In [100]:
def pipeline(train_docs, test_docs, features):
    
    train = enrich_features(train_docs, features)
    test = enrich_features(test_docs, features)

    vectorizer = TfidfVectorizer(lowercase=False, stop_words=None, use_idf=True)

    try:
        X_train = vectorizer.fit_transform(train)
        X_test = vectorizer.transform(test)

        clf = MultinomialNB().fit(X_train, train_data['category'])
    
        preds = clf.predict(X_test)
    
        acc = (preds == test_data['category']).mean()    
        
    except ValueError:
        # TfidfVectorizer raises ValueError (empty vocabulary) when all enriched docs are empty
        acc = 0

    return acc
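The empty-docs case handled by the `try`/`except` above arises when all six token-level flags are `False`: every enriched token is then an empty string, so the vectorizer has nothing to build a vocabulary from. An alternative is to check the configuration up front. The helper name `yields_tokens` and the constant `TOKEN_FEATURES` below are hypothetical.

```python
TOKEN_FEATURES = ("text", "lemma", "pos", "tag", "dep", "shape")


def yields_tokens(features):
    """True if at least one token-level attribute is switched on."""
    return any(features[name] for name in TOKEN_FEATURES)


empty_cfg = {"keep_noalpha": True, "rm_stop": False, "text": False, "lemma": False,
             "pos": False, "tag": False, "dep": False, "shape": False}
lemma_cfg = dict(empty_cfg, lemma=True)

print(yields_tokens(empty_cfg), yields_tokens(lemma_cfg))
```

A caller could skip configurations for which this returns `False` and record an accuracy of 0 for them directly.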

**Task**: Find the best feature combination by training and evaluating models on all possible combinations. Store the feature configurations and accuracies in a list called `configs`. Don't forget to use `features.copy()` when storing the feature configurations in `configs`.

In [108]:
features = {}
configs = []
opts = (True, False)

# Exhaustive search over all 2^8 = 256 combinations:
# each of the 8 boolean feature flags can be True or False.
# `features` is updated in place as the loops advance.
for keep_noalpha in opts:
    features["keep_noalpha"] = keep_noalpha
    for rm_stop in opts:
        features["rm_stop"] = rm_stop
        for text in opts:
            features["text"] = text
            for lemma in opts:
                features["lemma"] = lemma
                for pos in opts:
                    features["pos"] = pos
                    for tag in opts:
                        features["tag"] = tag
                        for dep in opts:
                            features["dep"] = dep
                            for shape in opts:
                                features["shape"] = shape
                                # Evaluate this configuration and store a copy of it
                                acc = pipeline(train_docs, test_docs, features)
                                # .copy() stores a snapshot; otherwise every entry would point to the same dict
                                configs.append({'acc': acc, 'features': features.copy()})
                                print(f'{len(configs)}/{2**8} configs completed')


1/256 configs completed
2/256 configs completed
...
256/256 configs completed
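The eight nested loops can also be written more compactly. The sketch below, offered as one possible alternative, uses `itertools.product` to enumerate the same 256 configurations in the same order; the names `FEATURE_NAMES` and `all_features` are assumptions.

```python
from itertools import product

FEATURE_NAMES = ["keep_noalpha", "rm_stop", "text", "lemma", "pos", "tag", "dep", "shape"]

# One dict per combination; each dict is a fresh object, so no .copy() is needed
all_features = [
    dict(zip(FEATURE_NAMES, values))
    for values in product((True, False), repeat=len(FEATURE_NAMES))
]

print(len(all_features))
print(all_features[0])
print(all_features[-1])
```

Each dict in `all_features` could then be passed to `pipeline(train_docs, test_docs, features)` in a single loop.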

**Tutorial question 7**: What is the best accuracy?

In [117]:
my_list = []
# Loop
for item in configs:
    print(item["acc"])
    my_list.append(item["acc"])

0.7022727272727273
0.7022727272727273
0.7477272727272727
0.7477272727272727
0.7012987012987013
0.7012987012987013
0.7454545454545455
0.7454545454545455
0.7051948051948052
0.7048701298701299
0.7538961038961038
0.7535714285714286
0.7035714285714286
0.7035714285714286
0.7487012987012988
0.7483766233766234
0.7022727272727273
0.701948051948052
0.7477272727272727
0.7477272727272727
0.700974025974026
0.700974025974026
0.7454545454545455
0.7454545454545455
0.7048701298701299
0.7048701298701299
0.7532467532467533
0.7532467532467533
0.7042207792207792
0.7042207792207792
0.749025974025974
0.7464285714285714
0.7032467532467532
0.701948051948052
0.7487012987012988
0.7493506493506493
0.6974025974025974
0.6987012987012987
0.7366883116883117
0.7373376623376623
0.7058441558441558
0.7077922077922078
0.7545454545454545
0.7555194805194805
0.7006493506493506
0.7012987012987013
0.7353896103896104
0.7366883116883117
0.25551948051948054
0.20454545454545456
0.19512987012987013
0.1279220779220779
0.187987012987

In [118]:
max(my_list)

np.float64(0.7834415584415585)

In [119]:
max_acc = 0
for cfg in configs:
    if cfg["acc"] > max_acc:
        max_acc = cfg["acc"]
        best_cfg = cfg

print(f'Best accuracy: {max_acc}')
print(f'Best features: {best_cfg["features"]}')

Best accuracy: 0.7834415584415585
Best features: {'keep_noalpha': True, 'rm_stop': False, 'text': False, 'lemma': True, 'pos': False, 'tag': False, 'dep': False, 'shape': False}
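The search for the best configuration can also be expressed with `max` and a `key` function, and `sorted` gives a full ranking. The sketch below uses a small list of made-up entries (`toy_configs`, with invented accuracies) in the same format as `configs`.

```python
# Made-up entries in the same format as `configs`, for illustration only
toy_configs = [
    {"acc": 0.70, "features": {"lemma": False, "tag": True}},
    {"acc": 0.78, "features": {"lemma": True, "tag": False}},
    {"acc": 0.75, "features": {"lemma": True, "tag": True}},
]

# Entry with the highest accuracy
best = max(toy_configs, key=lambda cfg: cfg["acc"])

# All entries, best first
ranked = sorted(toy_configs, key=lambda cfg: cfg["acc"], reverse=True)

print(best["acc"], best["features"])
print([cfg["acc"] for cfg in ranked])
```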


In [None]:
# Best configuration: use the lemma only; don't use text, POS, tag, dep, or shape.
# Keep stop words (rm_stop=False) and keep non-alphabetic tokens (keep_noalpha=True).

In [None]:
# Why no transfer mapping?
# We are not using the feature names; we are using the indices.
# If we ran this over and over, we would have to consider how fast it runs.
