
In [2]:
%pip install composable



In [3]:
from composable import pipeable
from composable.strict import map, filter

In [4]:
class PipeableObject(object):
    def __init__(self, function = lambda x: x, after_method_call = False):
        self._function = function
        self._after_method_call = after_method_call

    def __getattr__(self, name):
        return PipeableObject(lambda x: getattr(self._function(x), name), after_method_call = False)

    def __call__(self, *args, **kwargs):
        if self._after_method_call:
            return self._function(*args, **kwargs)
        else:
            return PipeableObject(lambda x: self._function(x)(*args, **kwargs),
                                  after_method_call = True)

    def __rrshift__(self, other):
        return self._function(other)

obj = PipeableObject()

In [5]:
class PipeableAttribute(object):
    def __init__(self, function = lambda x: x):
        self.function = function

    def __getattr__(self, name):
        return pipeable(lambda x: getattr(x, name))

    def __rrshift__(self, other):
        return self.function(other)

    def __call__(self, *args, **kwargs):
        return self.function(*args, **kwargs)

attr = PipeableAttribute()

In [6]:
import functools

# fold and reduce differ only in where the initial value comes from:
# fold takes it as an argument, reduce uses the first element of seq

@pipeable
def fold(func, init, seq):
    return functools.reduce(func, seq, init)

@pipeable
def reduce(func, seq):
    # Take the first element as the initial value; iter() makes this
    # work for lists, sets, and generators alike.
    it = iter(seq)
    init = next(it)
    return functools.reduce(func, it, init)
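
The relationship between the two helpers can be seen with plain `functools.reduce`, without any pipes. The following is a minimal sketch using made-up data (`numbers`, `add`, and `letters` are illustrative names, not part of the notebook): a fold supplies an explicit starting value, while a reduce borrows the first element.

```python
import functools

numbers = [1, 2, 3, 4]
add = lambda acc, x: acc + x

# fold: the caller supplies the starting accumulator
folded = functools.reduce(add, numbers, 0)

# reduce: the first element acts as the starting accumulator
reduced = functools.reduce(add, numbers)

# An explicit initial value lets the accumulator have a different type
# than the elements, e.g. collecting characters into a set.
letters = functools.reduce(lambda acc, s: acc | set(s), ["ab", "bc"], set())

print(folded, reduced, sorted(letters))
```

The last line is the reason the vocabulary and count examples later in this notebook use `fold`: the accumulator (a `set` or a `dict`) is not the same type as the items being folded.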

In [7]:
from toolz import get as tlz_get

get = pipeable(tlz_get)

In [8]:
el = list(range(10))

get(2, el)

2

In [9]:
get([1,3], el)

(1, 3)

In [10]:
(d := {str(i): i for i in range(10)})

{'0': 0,
 '1': 1,
 '2': 2,
 '3': 3,
 '4': 4,
 '5': 5,
 '6': 6,
 '7': 7,
 '8': 8,
 '9': 9}

In [11]:
get('2',d)

2

In [13]:
d >> get(['2','3'])

(2, 3)

# **Feature Engineering for NLP**

In this notebook, we discuss

1. What is feature engineering?<br>
2. Common NLP methods for feature engineering:<br>
    a. Bag of Words,<br>
    b. Term Frequency-Inverse Document Frequency (TF-IDF),<br>
    c. Word Embeddings (Word2Vec, GloVe), and<br>
    d. BERT Embeddings.<br>


In [None]:
# feature >> target
# treatment >> response

## **Basics of Feature Engineering - Motivation**

* **Tabular data.** Most machine learning (ML) algorithms require tabular data.<br>
* **Feature engineering.** The act of creating columns of predictors (features), often from unstructured data.<br>
* **Successful feature engineering.** Creating features that<br>
    - Capture the signal, that is, are helpful in the learning task, and <br>
    - Reduce the noise, that is, cut out elements of the data that are not helpful.<br>

## **Example Task - Text classification.**

Text classification is
* **Supervised learning.** We need training data with a known target value.
* **Classification.** We are trying to predict a label (in contrast to *regression* which predicts a quantity).
* **Examples.**<br>
    - Predict the author of a text.<br>
    - Classify the text by overall sentiment, e.g., *positive* or *negative*.
    - Determine the outcome of an intervention.<br>

## **Common feature engineering techniques for NLP.**

### Technique 1 - Bag of Words

Bag of words involves
1. Tokenizing the text [e.g., words],
2. Building a vocabulary [e.g., unique words], and
3. Vectorizing the text [e.g., getting the word count].
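
The three steps can be sketched in a few lines of plain Python. This is only an illustration on two made-up sentences (`docs` is not the notebook's data), using list comprehensions rather than the pipe-based style used in the rest of the notebook.

```python
docs = ["the cat sat", "the cat saw the dog"]

# 1. Tokenize: lowercase and split on whitespace
tokens = [doc.lower().split() for doc in docs]

# 2. Build a vocabulary: the sorted unique words across all documents
vocabulary = sorted({word for doc in tokens for word in doc})

# 3. Vectorize: one count per vocabulary word, per document
vectors = [[doc.count(word) for word in vocabulary] for doc in tokens]

print(vocabulary)
print(vectors)
```

Each inner list in `vectors` lines up with `vocabulary`, so every document becomes a row of the same length regardless of how many words it contains.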

#### Small example.

In [14]:
(documents :=
 ["Natural language processing is fun",
  "Language models are important in NLP",
  "I enjoy learning about artificial intelligence",
  "Machine learning and NLP are closely related",
  "Deep learning is a subset of machine learning",
 ]
)

['Natural language processing is fun',
 'Language models are important in NLP',
 'I enjoy learning about artificial intelligence',
 'Machine learning and NLP are closely related',
 'Deep learning is a subset of machine learning']

#### Step 0 - Preprocess the text [e.g., convert to lowercase].

#### Step 1 - Tokenize into words.

In [15]:
(words :=
 documents
 >> map(obj.lower())
 >> map(obj.split(' '))
)

[['natural', 'language', 'processing', 'is', 'fun'],
 ['language', 'models', 'are', 'important', 'in', 'nlp'],
 ['i', 'enjoy', 'learning', 'about', 'artificial', 'intelligence'],
 ['machine', 'learning', 'and', 'nlp', 'are', 'closely', 'related'],
 ['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning']]

#### Step 2 - Build a vocabulary

In [17]:
# a set can only contain unique values
init_vocab = set([])
update_vocab = lambda voc, doc: voc.union(doc)

(vocab :=
 words
 >> fold(update_vocab,init_vocab)
)

{'a',
 'about',
 'and',
 'are',
 'artificial',
 'closely',
 'deep',
 'enjoy',
 'fun',
 'i',
 'important',
 'in',
 'intelligence',
 'is',
 'language',
 'learning',
 'machine',
 'models',
 'natural',
 'nlp',
 'of',
 'processing',
 'related',
 'subset'}

#### Step 3 - Create a vector of word counts *for each document* [Dense representation].

The *dense representation* will include a count for the whole vocabulary for each document, including zeros for missing words.

In [29]:
init_counts = {w:0 for w in vocab}
update_counts = lambda cnts, w: cnts | {w:cnts[w] + 1}

(bag_of_words_dense :=
 words
 >> map(fold(update_counts, init_counts))
)
# Why are we using both map and fold?
# fold aggregates the counts within one document;
# map applies that fold to each document (row).

[{'in': 0,
  'intelligence': 0,
  'i': 0,
  'about': 0,
  'natural': 1,
  'of': 0,
  'artificial': 0,
  'models': 0,
  'nlp': 0,
  'subset': 0,
  'learning': 0,
  'deep': 0,
  'and': 0,
  'fun': 1,
  'is': 1,
  'are': 0,
  'processing': 1,
  'related': 0,
  'closely': 0,
  'enjoy': 0,
  'machine': 0,
  'a': 0,
  'important': 0,
  'language': 1},
 {'in': 1,
  'intelligence': 0,
  'i': 0,
  'about': 0,
  'natural': 0,
  'of': 0,
  'artificial': 0,
  'models': 1,
  'nlp': 1,
  'subset': 0,
  'learning': 0,
  'deep': 0,
  'and': 0,
  'fun': 0,
  'is': 0,
  'are': 1,
  'processing': 0,
  'related': 0,
  'closely': 0,
  'enjoy': 0,
  'machine': 0,
  'a': 0,
  'important': 1,
  'language': 1},
 {'in': 0,
  'intelligence': 1,
  'i': 1,
  'about': 1,
  'natural': 0,
  'of': 0,
  'artificial': 1,
  'models': 0,
  'nlp': 0,
  'subset': 0,
  'learning': 1,
  'deep': 0,
  'and': 0,
  'fun': 0,
  'is': 0,
  'are': 0,
  'processing': 0,
  'related': 0,
  'closely': 0,
  'enjoy': 1,
  'machine': 0,
  

## Review - Understanding the two `fold`s

To understand how the folds work, we should investigate how the updates happen at each step.
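
One way to see every intermediate state at once is `itertools.accumulate`, which behaves like a fold that keeps each step. The sketch below uses a small made-up token list (`words_demo`) so that it runs on its own; it requires Python 3.8 or later for the `initial` argument.

```python
from itertools import accumulate

words_demo = [["natural", "language"], ["language", "models"], ["nlp"]]

update_vocab = lambda voc, doc: voc.union(doc)

# accumulate yields the initial value, then the accumulator after each step
steps = list(accumulate(words_demo, update_vocab, initial=set()))

for step in steps:
    print(sorted(step))

# The last state equals the result of the fold
final_vocab = steps[-1]
```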

#### Understanding the vocabulary fold.

In [18]:
init_vocab = set([]) # empty set
update_vocab = lambda voc, doc: voc.union(doc)

In [19]:
(up := init_vocab)

set()

In [20]:
ws = words[0]

(ws, (up := update_vocab(up, ws)))

(['natural', 'language', 'processing', 'is', 'fun'],
 {'fun', 'is', 'language', 'natural', 'processing'})

In [21]:
ws = words[1]

(ws, (up := update_vocab(up, ws)))

(['language', 'models', 'are', 'important', 'in', 'nlp'],
 {'are',
  'fun',
  'important',
  'in',
  'is',
  'language',
  'models',
  'natural',
  'nlp',
  'processing'})

In [23]:
up = set([])
print(up)
for ws in words:
    up = update_vocab(up, ws)
    print(ws, up)

set()
['natural', 'language', 'processing', 'is', 'fun'] {'fun', 'is', 'natural', 'processing', 'language'}
['language', 'models', 'are', 'important', 'in', 'nlp'] {'in', 'natural', 'models', 'nlp', 'fun', 'is', 'are', 'processing', 'important', 'language'}
['i', 'enjoy', 'learning', 'about', 'artificial', 'intelligence'] {'important', 'learning', 'in', 'fun', 'intelligence', 'i', 'is', 'natural', 'are', 'about', 'processing', 'enjoy', 'artificial', 'models', 'nlp', 'language'}
['machine', 'learning', 'and', 'nlp', 'are', 'closely', 'related'] {'in', 'intelligence', 'i', 'about', 'natural', 'artificial', 'models', 'nlp', 'learning', 'and', 'fun', 'is', 'are', 'processing', 'related', 'closely', 'enjoy', 'machine', 'important', 'language'}
['deep', 'learning', 'is', 'a', 'subset', 'of', 'machine', 'learning'] {'in', 'intelligence', 'i', 'about', 'natural', 'of', 'artificial', 'models', 'nlp', 'subset', 'learning', 'deep', 'and', 'fun', 'is', 'are', 'processing', 'related', 'closely', 'e

## <font color="red"> Exercise 3.3.1 </font>

**Task.** Use a similar approach to explore the fold for creating the word counts.  Be sure to also investigate how the `dict` merge operator `|` works!
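
As a starting point for the `|` investigation, here is a small sketch with made-up counts (Python 3.9 or later). It shows the two properties the count fold relies on: the merge returns a new `dict`, and on duplicate keys the right-hand value wins.

```python
counts = {"deep": 0, "learning": 1}

# `|` builds a new dict; on duplicate keys the right-hand value wins
updated = counts | {"learning": counts["learning"] + 1}

print(counts)   # the original dict is unchanged
print(updated)

# Keys that appear only on the right-hand side are added
extended = counts | {"subset": 1}
print(extended)
```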

In [None]:
# Your code here

In [42]:
init_vocab = set([]) # empty set
update_vocab = lambda voc, doc: voc.union(doc)

In [46]:
(acc := init_counts)

{'important': 0,
 'learning': 0,
 'deep': 0,
 'fun': 0,
 'intelligence': 0,
 'natural': 0,
 'machine': 0,
 'processing': 0,
 'related': 0,
 'closely': 0,
 'subset': 0,
 'enjoy': 0,
 'artificial': 0,
 'models': 0,
 'nlp': 0,
 'language': 0}

In [47]:
w = 'learning'  # any word in the vocabulary
acc | {w:acc[w] + 1}

{'important': 0,
 'learning': 1,
 'deep': 0,
 'fun': 0,
 'intelligence': 0,
 'natural': 0,
 'machine': 0,
 'processing': 0,
 'related': 0,
 'closely': 0,
 'subset': 0,
 'enjoy': 0,
 'artificial': 0,
 'models': 0,
 'nlp': 0,
 'language': 0}

In [48]:
doc = words[-1]  # the last document
w = doc[0]

print(acc)
print(w)


{'important': 0, 'learning': 0, 'deep': 0, 'fun': 0, 'intelligence': 0, 'natural': 0, 'machine': 0, 'processing': 0, 'related': 0, 'closely': 0, 'subset': 0, 'enjoy': 0, 'artificial': 0, 'models': 0, 'nlp': 0, 'language': 0}
deep


In [49]:
out = []

for doc in words:
    acc = init_counts  # the fold restarts from the initial counts for each document
    print(doc, acc)
    for w in doc:
        acc = update_counts(acc, w)  # this is the fold pattern
        print(w, acc)
        print('\n\n')
    out = out + [acc] # this is the map pattern
    print(out)
    print('\n\n\n')

['natural', 'language', 'processing', 'fun'] {'important': 0, 'learning': 0, 'deep': 0, 'fun': 0, 'intelligence': 0, 'natural': 0, 'machine': 0, 'processing': 0, 'related': 0, 'closely': 0, 'subset': 0, 'enjoy': 0, 'artificial': 0, 'models': 0, 'nlp': 0, 'language': 0}
natural {'important': 0, 'learning': 0, 'deep': 0, 'fun': 0, 'intelligence': 0, 'natural': 1, 'machine': 0, 'processing': 0, 'related': 0, 'closely': 0, 'subset': 0, 'enjoy': 0, 'artificial': 0, 'models': 0, 'nlp': 0, 'language': 0}



language {'important': 0, 'learning': 0, 'deep': 0, 'fun': 0, 'intelligence': 0, 'natural': 1, 'machine': 0, 'processing': 0, 'related': 0, 'closely': 0, 'subset': 0, 'enjoy': 0, 'artificial': 0, 'models': 0, 'nlp': 0, 'language': 1}



processing {'important': 0, 'learning': 0, 'deep': 0, 'fun': 0, 'intelligence': 0, 'natural': 1, 'machine': 0, 'processing': 1, 'related': 0, 'closely': 0, 'subset': 0, 'enjoy': 0, 'artificial': 0, 'models': 0, 'nlp': 0, 'language': 1}



fun {'important': 

#### Step 3b - Creating a feature vector for each word in the vocabulary.

In [31]:
(features :=
{w: [get(w, word_counts)
 for word_counts in bag_of_words_dense] # the count of w in each document
 for w in vocab
}
)

{'in': [0, 1, 0, 0, 0],
 'intelligence': [0, 0, 1, 0, 0],
 'i': [0, 0, 1, 0, 0],
 'about': [0, 0, 1, 0, 0],
 'natural': [1, 0, 0, 0, 0],
 'of': [0, 0, 0, 0, 1],
 'artificial': [0, 0, 1, 0, 0],
 'models': [0, 1, 0, 0, 0],
 'nlp': [0, 1, 0, 1, 0],
 'subset': [0, 0, 0, 0, 1],
 'learning': [0, 0, 1, 1, 2],
 'deep': [0, 0, 0, 0, 1],
 'and': [0, 0, 0, 1, 0],
 'fun': [1, 0, 0, 0, 0],
 'is': [1, 0, 0, 0, 1],
 'are': [0, 1, 0, 1, 0],
 'processing': [1, 0, 0, 0, 0],
 'related': [0, 0, 0, 1, 0],
 'closely': [0, 0, 0, 1, 0],
 'enjoy': [0, 0, 1, 0, 0],
 'machine': [0, 0, 0, 1, 1],
 'a': [0, 0, 0, 0, 1],
 'important': [0, 1, 0, 0, 0],
 'language': [1, 1, 0, 0, 0]}

In [32]:
import polars as pl

pl.DataFrame(features)

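| in | intelligence | i | about | natural | of | artificial | models | nlp | subset | learning | deep | and | fun | is | are | processing | related | closely | enjoy | machine | a | important | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 | i64 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |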


## Using a sparse representation

Did you notice the abundance of zeros in the count dictionaries? We can clean up the output by tracking only the nonzero counts.

#### Step 3 - Create a vector of word counts *for each document* [Sparse representation].

The *sparse representation* will only include a count for the words that appear in the given document.
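
The difference between the two representations can also be sketched with `collections.Counter`, which is swapped in here in place of the notebook's fold purely for illustration. The extra vocabulary words `"nlp"` and `"fun"` are added by hand to stand in for words from other documents.

```python
from collections import Counter

doc = ["deep", "learning", "is", "a", "subset", "of", "machine", "learning"]
vocabulary = sorted(set(doc) | {"nlp", "fun"})  # pretend the corpus has two more words

# Sparse: only words that occur in the document are stored
sparse = Counter(doc)

# Dense: one entry per vocabulary word; a Counter returns 0 for missing keys
dense = {word: sparse[word] for word in vocabulary}

print(dict(sparse))
print(dense)
print(len(sparse), "stored counts instead of", len(dense))
```

With a realistic vocabulary of tens of thousands of words, most documents use only a small fraction of them, so the gap between the two sizes becomes much larger than in this toy case.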

In [34]:
# zeros omitted  -- sparse representation
# zeros included -- dense representation

init_counts = {}
# `|` merges two dicts; on duplicate keys the right-hand value wins
update_counts = lambda cnts, w: cnts | {w: cnts.get(w, 0) + 1}

(bag_of_words :=
 words
 >> map(fold(update_counts, init_counts))
)

[{'natural': 1, 'language': 1, 'processing': 1, 'is': 1, 'fun': 1},
 {'language': 1, 'models': 1, 'are': 1, 'important': 1, 'in': 1, 'nlp': 1},
 {'i': 1,
  'enjoy': 1,
  'learning': 1,
  'about': 1,
  'artificial': 1,
  'intelligence': 1},
 {'machine': 1,
  'learning': 1,
  'and': 1,
  'nlp': 1,
  'are': 1,
  'closely': 1,
  'related': 1},
 {'deep': 1,
  'learning': 2,
  'is': 1,
  'a': 1,
  'subset': 1,
  'of': 1,
  'machine': 1}]

### Let's redo that, but remove the stop words.

In [35]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

eng_stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [37]:
(words :=
 documents
 >> map(obj.lower())
 >> map(obj.split(' '))
 >> map(filter(lambda w: w not in eng_stop_words)) # remove stop words
)

[['natural', 'language', 'processing', 'fun'],
 ['language', 'models', 'important', 'nlp'],
 ['enjoy', 'learning', 'artificial', 'intelligence'],
 ['machine', 'learning', 'nlp', 'closely', 'related'],
 ['deep', 'learning', 'subset', 'machine', 'learning']]
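
The filtering step itself does not depend on NLTK. A minimal sketch with a tiny hand-picked stop list (`demo_stop_words` is hypothetical and far shorter than NLTK's English list) shows the same idea on one sentence:

```python
# A tiny hand-picked stop list, for illustration only
demo_stop_words = {"is", "are", "in", "i", "about", "and", "a", "of"}

sentence = "Deep learning is a subset of machine learning"

tokens = sentence.lower().split(" ")

# Keep only the tokens that are not stop words
kept = [w for w in tokens if w not in demo_stop_words]

print(kept)
```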

In [38]:
(vocab :=
 words
 >> fold(lambda voc, ws: voc.union(ws), set([]))
)

{'artificial',
 'closely',
 'deep',
 'enjoy',
 'fun',
 'important',
 'intelligence',
 'language',
 'learning',
 'machine',
 'models',
 'natural',
 'nlp',
 'processing',
 'related',
 'subset'}

In [39]:
init_counts = {w:0 for w in vocab}
update_counts = lambda cnts, w: cnts | {w:cnts[w] + 1}

(bag_of_words :=
 words
 >> map(fold(update_counts, init_counts))
)

[{'important': 0,
  'learning': 0,
  'deep': 0,
  'fun': 1,
  'intelligence': 0,
  'natural': 1,
  'machine': 0,
  'processing': 1,
  'related': 0,
  'closely': 0,
  'subset': 0,
  'enjoy': 0,
  'artificial': 0,
  'models': 0,
  'nlp': 0,
  'language': 1},
 {'important': 1,
  'learning': 0,
  'deep': 0,
  'fun': 0,
  'intelligence': 0,
  'natural': 0,
  'machine': 0,
  'processing': 0,
  'related': 0,
  'closely': 0,
  'subset': 0,
  'enjoy': 0,
  'artificial': 0,
  'models': 1,
  'nlp': 1,
  'language': 1},
 {'important': 0,
  'learning': 1,
  'deep': 0,
  'fun': 0,
  'intelligence': 1,
  'natural': 0,
  'machine': 0,
  'processing': 0,
  'related': 0,
  'closely': 0,
  'subset': 0,
  'enjoy': 1,
  'artificial': 1,
  'models': 0,
  'nlp': 0,
  'language': 0},
 {'important': 0,
  'learning': 1,
  'deep': 0,
  'fun': 0,
  'intelligence': 0,
  'natural': 0,
  'machine': 1,
  'processing': 0,
  'related': 1,
  'closely': 1,
  'subset': 0,
  'enjoy': 0,
  'artificial': 0,
  'models': 0,
  

### Using `CountVectorizer` to get the bag of words

In [40]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer() # create an instance of the CountVectorizer class

(X := vectorizer.fit_transform(documents))

<5x22 sparse matrix of type '<class 'numpy.int64'>'
	with 29 stored elements in Compressed Sparse Row format>

In [41]:
print(X)

  (0, 16)	1
  (0, 12)	1
  (0, 19)	1
  (0, 11)	1
  (0, 7)	1
  (1, 12)	1
  (1, 15)	1
  (1, 2)	1
  (1, 8)	1
  (1, 9)	1
  (1, 17)	1
  (2, 6)	1
  (2, 13)	1
  (2, 0)	1
  (2, 3)	1
  (2, 10)	1
  (3, 2)	1
  (3, 17)	1
  (3, 13)	1
  (3, 14)	1
  (3, 1)	1
  (3, 4)	1
  (3, 20)	1
  (4, 11)	1
  (4, 13)	2
  (4, 14)	1
  (4, 5)	1
  (4, 21)	1
  (4, 18)	1


## **Exercise 3.3.2**

Explain the previous output.

**<font color = red>
`CountVectorizer` from sklearn counts how many times each vocabulary word appears in each document. In each `(row, column)` pair, the number on the left is the row (the document index) and the number on the right is the column (the vocabulary index). This is a sparse representation because zero counts are not stored. For example, the entry `(0, 16) 1` means that document 0 contains vocabulary word #16 once.</font>**
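
To connect the column numbers back to words, the indices can be rebuilt by hand. The sketch below is not scikit-learn code; it assumes `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters, vocabulary indexed in sorted order) and reproduces the `(row, column) count` entries as a plain `dict`. The names `tokens`, `index_of`, and `sparse` are illustrative.

```python
import re

documents = [
    "Natural language processing is fun",
    "Language models are important in NLP",
    "I enjoy learning about artificial intelligence",
    "Machine learning and NLP are closely related",
    "Deep learning is a subset of machine learning",
]

# Assumed defaults: lowercase, keep tokens of two or more word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokens = [token_pattern.findall(doc.lower()) for doc in documents]

# Column index = position of the word in the sorted vocabulary
vocabulary = sorted({tok for doc in tokens for tok in doc})
index_of = {tok: j for j, tok in enumerate(vocabulary)}

# Sparse storage: (row, column) -> count, zeros are never stored
sparse = {}
for i, doc in enumerate(tokens):
    for tok in doc:
        key = (i, index_of[tok])
        sparse[key] = sparse.get(key, 0) + 1

print(len(vocabulary), "columns,", len(sparse), "stored elements")
print(vocabulary[16], sparse[(0, 16)])   # the word behind entry (0, 16)
print(vocabulary[13], sparse[(4, 13)])   # the only count greater than 1
```

Under these assumptions the single-character tokens "i" and "a" are dropped, which is why the matrix has 22 columns rather than the 24 words in the hand-built vocabulary.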

## Example - Text Classification with Bag of Words

Let's use a Naive Bayes classifier to classify our documents using the Bag of Words features.

**Note.** While this toy example is too small to be of real interest, most real problems will involve very similar code.
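
To make explicit what a multinomial Naive Bayes classifier computes from word counts, here is a minimal pure-Python sketch of the textbook formula with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the helper names (`tokenize`, `train`, `predict`) and the hard-coded split (`train_idx`, `test_idx`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

documents = [
    "Natural language processing is fun",
    "Language models are important in NLP",
    "I enjoy learning about artificial intelligence",
    "Machine learning and NLP are closely related",
    "Deep learning is a subset of machine learning",
]
labels = [1, 1, 0, 1, 0]  # 1 => NLP-related, 0 => AI-related

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# Assumed split for illustration: hold out the second document
train_idx, test_idx = [0, 2, 3, 4], [1]

vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

def train(docs, ys, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    model = {}
    for c in set(ys):
        class_docs = [d for d, y in zip(docs, ys) if y == c]
        counts = Counter(tok for d in class_docs for tok in tokenize(d))
        total = sum(counts.values()) + alpha * len(vocabulary)
        model[c] = {
            "log_prior": math.log(len(class_docs) / len(docs)),
            "log_prob": {w: math.log((counts[w] + alpha) / total) for w in vocabulary},
        }
    return model

def predict(model, text):
    """Return the class with the highest log posterior score."""
    def score(c):
        log_prob = model[c]["log_prob"]
        return model[c]["log_prior"] + sum(
            log_prob[tok] for tok in tokenize(text) if tok in log_prob
        )
    return max(model, key=score)

model = train([documents[i] for i in train_idx], [labels[i] for i in train_idx])
predictions = [predict(model, documents[i]) for i in test_idx]
print(predictions, [labels[i] for i in test_idx])
```

The smoothing term `alpha` keeps a word that never appeared in a class from driving that class's probability to zero, which matters a great deal with only four training documents.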

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


# Cases
(documents :=
 ["Natural language processing is fun",
  "Language models are important in NLP",
  "I enjoy learning about artificial intelligence",
  "Machine learning and NLP are closely related",
  "Deep learning is a subset of machine learning",
 ]
)

# Target vector: 1 => NLP-related, 0 => AI-related
labels = [1, 1, 0, 1, 0]

# Create the sparse feature set
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(documents)

# Create a training and test (validation) set
X_train, X_test, y_train, y_test = train_test_split(X, labels,
                                                    test_size = 0.2,
                                                    random_state=42)

# Train the model
classifier = MultinomialNB()

classifier.fit(X_train, y_train)

# Evaluate the model
y_pred = classifier.predict(X_test)

(accuracy := accuracy_score(y_test, y_pred))

1.0

In [None]:
print(X_train, y_train)

  (0, 11)	1
  (0, 13)	2
  (0, 14)	1
  (0, 5)	1
  (0, 21)	1
  (0, 18)	1
  (1, 6)	1
  (1, 13)	1
  (1, 0)	1
  (1, 3)	1
  (1, 10)	1
  (2, 16)	1
  (2, 12)	1
  (2, 19)	1
  (2, 11)	1
  (2, 7)	1
  (3, 2)	1
  (3, 17)	1
  (3, 13)	1
  (3, 14)	1
  (3, 1)	1
  (3, 4)	1
  (3, 20)	1 [0, 0, 1, 1]
