# 4.4. Bag-of-Words 🛍️


## Warmup

#### In groups, research to find three different products/use cases of text classification that you find useful in the real world


1. Summarising texts or scientific papers (e.g. https://www.paper-digest.com)
2. Sentiment analysis, for instance of tweets
3. Machine translation (e.g. DeepL)
4. Classification of any text (e.g. scientific articles)
5. Chatbots
6. Hiring and recruitment: CV screening







## How do we process natural language in Machine Learning?

* The project this week is to classify text. How is this data different from other classification tasks you have seen so far?


* What pre-processing might we need to apply to text to make it understandable to an ML algorithm?


### Text Preprocessing

In order to get better results and use fewer resources, some steps that we typically take in NLP text preprocessing include:

- **Tokenisation:** each *document* (= string) in the *corpus* (=list of strings) is split into meaningful substrings called *tokens*. 
    - Tokens are use-case and language-dependent. In English, for example, a token could be any substring surrounded by whitespace, whereas in German it may be more useful to split compound nouns further (*"Rechtsschutzversicherungsgesellschaften"*).
    - To preserve some meaningful word order/context, it is possible to split the text into bi-grams or, more generally, n-grams.
    - "the cat eats the mouse" becomes:
        - ["the", "cat", "eats", "the", "mouse"] with word-based tokens
        - ["the cat", "cat eats", "eats the", "the mouse"] with bi-grams
        - ["the cat eats", "cat eats the", "eats the mouse"] with tri-grams, and so on
       
- Cleaning the text:
    - **Punctuation** is usually removed.
    - **Capitalisation** is normalised, typically by converting everything to lowercase.
    - Normalisation: **Stemming/Lemmatisation**:
        - **Stemming** is the reduction of a word to its (pseudo)stem by removing suffixes via some heuristic rules. It does not always result in a real word.
            - chang**es**, chang**ing**, chang**ed**, chang**ingly** (definitely a real word!) --> **chang**
        - **Lemmatisation** is the conversion of a word to its dictionary form ("lemma"):
            - went, going, goes, gone, go --> go
        - Stemming is faster and less computationally expensive. Lemmatisation is more advanced and can preserve more meaning. The results of the two often overlap.
        - A word can have multiple lemmas depending on its part of speech, so it is useful to combine lemmatisation with **Part-Of-Speech (POS) tagging**. Many packages include this.
    - Removing **stopwords**: some words are very common and add almost no meaning, so they are removed to save time and memory (e.g. "the" can be removed from all documents with practically no loss of information). You can use the preexisting stopword lists included in Python packages and/or add your own.
        - You can also remove **common words** that appear in more than X% of your documents (e.g. if your documents are all about water, it may be worth removing "water" from your corpus, as it adds no information in that case).
- Vectorisation:
    - In order to have our data in a format that can be used in Machine Learning, it must be somehow numeric. There are several ways of vectorising our text, two of which we will look at today:
        - **Bag of Words**
        - **TF-IDF**          
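
The cleaning steps above can be sketched in plain Python. This is only a minimal illustration: the stop-word set and the `toy_stem` suffix list are made up for this example, and a real project would more likely use a library tokeniser and a proper stemmer or lemmatiser.

```python
import string

# Hypothetical, tiny stop-word list used only for this illustration
STOPWORDS = {"the", "a", "an"}

def tokenise(text):
    """Lowercase the text, strip punctuation, and split on whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_stem(word, suffixes=("ingly", "ing", "es", "ed")):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenise("The cat eats the mouse.")
bigrams = ngrams(tokens, 2)
no_stopwords = [t for t in tokens if t not in STOPWORDS]
stems = [toy_stem(w) for w in ["changes", "changing", "changed", "changingly"]]

print(tokens)        # ['the', 'cat', 'eats', 'the', 'mouse']
print(bigrams)       # ['the cat', 'cat eats', 'eats the', 'the mouse']
print(no_stopwords)  # ['cat', 'eats', 'mouse']
print(stems)         # ['chang', 'chang', 'chang', 'chang']
```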


**Once we have our text data in its preprocessed state, what could we use as features to train our model?**



## What is Bag of Words (BoW)?

* Text vectorization is the process of converting each piece of text in the corpus into a numerical vector.
* There are various ways to do this: Bag of Words, TF-IDF, neural networks, etc.


### Bag of Words:

One of the earliest approaches to turning text into vectors.
It was the common approach for word vectorization until 2013 (when word2vec was proposed by Mikolov et al.).

![1*hLvya7MXjsSc3NS2SoLMEg.png](attachment:1*hLvya7MXjsSc3NS2SoLMEg.png)

> We take all the words in our vocabulary and assign each of them a column in a matrix. We then count the number of occurrences of that word in each document (= row of the matrix).
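
To make the counting concrete, here is a small sketch that builds such a matrix by hand with `collections.Counter`. The two-document corpus is invented for the example, and tokens are simply split on whitespace.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of punctuation
docs = ["the cat eats the mouse", "the dog eats"]

# Vocabulary: every distinct token in the corpus, sorted so columns are stable
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word
counts = []
for doc in docs:
    token_counts = Counter(doc.split())
    counts.append([token_counts[word] for word in vocabulary])

print(vocabulary)  # ['cat', 'dog', 'eats', 'mouse', 'the']
print(counts)      # [[1, 0, 1, 1, 2], [0, 1, 1, 0, 1]]
```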

## Bag of Words with song lyrics

### Let's collect a very small corpus of song lyrics!

* Artist 1: Bob Marley


* Artist 2: Eminem




In [None]:
# X: before preprocessing, feature engineering and vectorisation
corpus= ["excuse me while I light my spliff which I love",             # Bob Marley
          "no woman no cry I love spliffs",        # Bob Marley
          "I'm slim shady and I sometimes cry cry cry in love",      # Eminem
          "Palms are sweaty, love vomit on sweater, woman's spaghetti"]   # Eminem
# y:
labels = ["Bob Marley"] * 2 + ["Eminem"] * 2

In [None]:
labels

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

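`CountVectorizer` automatically transforms the strings to lowercase. It has a built-in token regex pattern (`(?u)\b\w\w+\b`) that it uses by default unless another is specified; this pattern drops punctuation and single-character tokens such as "I". Accents are only stripped if the `strip_accents` parameter is set.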
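
To see what the default settings do to a single string, one option is to call the vectorizer's analyzer directly. This is a small sketch using `build_analyzer()`; the input sentence is taken from the corpus above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function that handles lowercasing and tokenising
analyzer = CountVectorizer().build_analyzer()

tokens = analyzer("I'm slim shady and I sometimes cry")
print(tokens)  # ['slim', 'shady', 'and', 'sometimes', 'cry']
```

Note that "I'm" and "I" vanish entirely: the default token pattern only keeps runs of two or more word characters.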

In [None]:
X_df.shape

This is why it is called a bag of words: we have lost all word order and semantic relationships. The representations of "the cat ate the mouse" and "the mouse ate the cat" are exactly the same.
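
A quick way to check this claim is to vectorise both sentences and compare the rows:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat ate the mouse", "the mouse ate the cat"]
vectors = CountVectorizer().fit_transform(sentences).toarray()

# Columns are in alphabetical order: ate, cat, mouse, the
print(vectors)

# Both rows contain the same counts, so the two sentences are indistinguishable
same = bool((vectors[0] == vectors[1]).all())
print(same)  # True
```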

####  What is our X?

In [None]:
X

In [None]:
type(X)

A **sparse matrix** is one where most of the entries are zero. As a matrix like this can get very large very fast, SciPy stores only the locations and values of the nonzero elements, and none of the zeros.
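
The following sketch shows what this storage looks like for a small, made-up matrix in SciPy's CSR format:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 matrix with only three nonzero entries
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 2]])
sparse = csr_matrix(dense)

print(sparse.nnz)      # 3 stored values out of 12 cells
print(sparse.data)     # the stored values, row by row: [3 1 2]
print(sparse.indices)  # the column index of each stored value: [2 0 3]
```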

In [None]:
#print(X)

To return the **dense** (= not sparse) version of the matrix, we use the `.todense()` method:

In [None]:
X.todense()

From just four very short snippets of text with repeating words we already have a dataframe with 24 columns. As our corpus grows, so will the size of our vocabulary. This becomes an issue with both storage and computational cost. This is why we need to make our vocabulary as small as possible while preserving as much useful information as we can.

How can we remove the most common words?

1. Using a list of stop words
2. Removing the words that appear in more than X% of documents


- See `CountVectorizer` documentation for which parameters to use: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- After running `CountVectorizer` use `.vocabulary_` and `.stop_words_` attributes on the `vectorizer` variable to see which words have remained and which are removed (latter attribute only works in the case of the second method). Alternatively, you can also do this by inspecting `X_df`.
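
Another way to see exactly which words each option removes is to compare the resulting vocabularies as sets. This is a minimal sketch that repeats the corpus so it can run on its own; the variable names are chosen for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["excuse me while I light my spliff which I love",
          "no woman no cry I love spliffs",
          "I'm slim shady and I sometimes cry cry cry in love",
          "Palms are sweaty, love vomit on sweater, woman's spaghetti"]

# vocabulary_ maps each kept word to its column index; we only need the words
full_vocab = set(CountVectorizer().fit(corpus).vocabulary_)
stopword_vocab = set(CountVectorizer(stop_words="english").fit(corpus).vocabulary_)
max_df_vocab = set(CountVectorizer(max_df=0.8).fit(corpus).vocabulary_)

removed_by_stopwords = sorted(full_vocab - stopword_vocab)
removed_by_max_df = sorted(full_vocab - max_df_vocab)

print(removed_by_stopwords)  # words dropped by the built-in English list
print(removed_by_max_df)     # words that appear in more than 80% of documents
```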

In [None]:
#Using stopwords
#english is built-in, can substitute for any list of stopwords in any language

vectorizer = CountVectorizer(stop_words='english')

X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

In [None]:
X_df.shape

In [None]:
# Remove words that appear in more than 80% of documents
vectorizer = CountVectorizer(max_df=0.8)
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_df

In [None]:
vectorizer.stop_words_ 

# Use-case specific stop-words that are detected

## Normalisation using TF-IDF

* This normalizes the word counts in our BOW and aims to address the popularity/frequency of words in the whole corpus (not just inside a single document).


**TF-IDF**
* TF - Term Frequency (how often a word $w$ occurs in doc $d$; scikit-learn uses the raw count)
* IDF - Inverse Document Frequency

$$TF\cdot IDF = TF(w,d) \cdot IDF(w)$$

$$IDF(w) = \log\left(\dfrac{1 + n_{\text{documents}}}{1 + n_{\text{documents containing } w}}\right) + 1$$



![1*qQgnyPLDIkUmeZKN2_ZWbQ.png](attachment:1*qQgnyPLDIkUmeZKN2_ZWbQ.png)

**The steps for calculating TFIDF are:**

For each vector:
1. Calculate the term frequency for each term in the vector
2. Calculate the inverse doc frequency for each term in the vector
3. Multiply the two for each term in the vector
4. Then normalise each vector by its Euclidean norm (`numpy.linalg.norm`)

$$v_{norm} = \dfrac{v}{||v||_2}$$
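
The four steps can be reproduced by hand with NumPy and checked against scikit-learn. This is a sketch under two assumptions: the count matrix is made up for illustration, and the raw count is used as the term frequency (which is what `TfidfTransformer` does with its default settings).

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical word counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF, natural log
tfidf = counts * idf                             # multiply TF by IDF
# Divide each row by its Euclidean length
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

# Compare with scikit-learn's default TfidfTransformer
sk_tfidf = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
matches = bool(np.allclose(tfidf, sk_tfidf))
print(matches)  # True
```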

We can do this in `sklearn` and add it to our pipeline:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

tf = TfidfTransformer() 
#Use the CountVectorised data
X_norm = tf.fit_transform(X)

In [None]:
X_norm

In [None]:
X_norm_df = pd.DataFrame(X_norm.todense(), columns=vectorizer.get_feature_names_out(), index=labels)
X_norm_df

### How do we put it all together?

**Option 1:** go through all the steps outlined above: preprocess, tokenise, lemmatise, vectorise, TF-IDF

**Option 2:** use sklearn's `TfidfVectorizer` class, which combines `CountVectorizer` and `TfidfTransformer` in one

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer()

In [None]:
X_vec = vectorizer.fit_transform(corpus)
X_vec_df = pd.DataFrame(X_vec.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [None]:
X_vec_df
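
As a sanity check, the one-step and two-step routes should give the same numbers when both use default settings. A small sketch with a made-up three-document corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical mini corpus for the comparison
docs = ["no woman no cry", "cry cry cry in love", "love vomit on sweater"]

# Two steps: count the words first, then reweight the counts
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# One step: TfidfVectorizer does both internally
one_step = TfidfVectorizer().fit_transform(docs)

same = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(same)  # True
```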

What happens if our test set has unseen words in it? As long as we `.fit_transform()` our `TfidfVectorizer` on the training data only and `.transform()` our test data, any words not in the original training vocabulary will be ignored.


In [None]:
test_data=['hello I love my dog but not when it vomits']
test_vec=vectorizer.transform(test_data)

In [None]:
pd.DataFrame(test_vec.todense(), columns=vectorizer.get_feature_names_out())
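
One way to guarantee that the test data is only ever transformed, never fitted, is to wrap the vectorizer and a classifier in a scikit-learn pipeline. The sketch below is just one possible setup: the choice of `MultinomialNB` is an assumption made for illustration, and four training lines are far too few for a meaningful prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["excuse me while I light my spliff which I love",
               "no woman no cry I love spliffs",
               "I'm slim shady and I sometimes cry cry cry in love",
               "Palms are sweaty, love vomit on sweater, woman's spaghetti"]
train_labels = ["Bob Marley"] * 2 + ["Eminem"] * 2
test_texts = ["hello I love my dog but not when it vomits"]

# MultinomialNB is an illustrative choice; any sklearn classifier could go here
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

model.fit(train_texts, train_labels)     # vocabulary is learned from training data only
predictions = model.predict(test_texts)  # test data is transformed, unseen words ignored
print(predictions)
```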

## Next steps:


- Pre-Process and vectorise your lyrics corpus
- Can you draw some interesting insights from your data?
- Can you run a classification algorithm (or two) on your data to predict the artist? How would you make sure to preprocess the test data in exactly the same way? How does it perform on unseen lyrics?
- Bonus if you're done: see if you can create a wordcloud ([Guide here](https://spiced.space/euclidean-eukalyptus/ds-course/chapters/project_lyrics/README.html)) to visualise each artist's lyrics
- Another Bonus: try the LDA topic modelling exercise in the [course notes](https://spiced.space/naive-zatar/ds-course/chapters/project_lyrics/bag_of_words/README.html)


## References

- [An explanation of Stemming/Lemmatisation from Stanford with several stemming examples](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)
- [Nice article on Lemmatisation](https://devopedia.org/lemmatization)
- [Comparison of several commonly used tokenisers/lemmatisers](https://www.machinelearningplus.com/nlp/lemmatization-examples-python/)
- [What is WordNet](https://wordnet.princeton.edu/)
- [Lemmatisation function](https://gist.github.com/MaxHalford/68b584e9154098151e6d9b5aa7464948) adapted in notes
- [Top 5 Word Tokenizers That Every NLP Data Scientist Should Know](https://towardsdatascience.com/top-5-word-tokenizers-that-every-nlp-data-scientist-should-know-45cc31f8e8b9)
- [Great explainer of TF-IDF](https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558)
- [Another nice article on calculating TF-IDF](https://towardsdatascience.com/a-gentle-introduction-to-calculating-the-tf-idf-values-9e391f8a13e5)

**German Language Resources**
- [German Language NLP with NLTK examples](https://data-dive.com/german-nlp-binary-text-classification-of-reviews-part1)
- [German models with SpaCy](https://spacy.io/models/de) - see the [SpaCy lesson in the course materials](https://spiced.space/naive-zatar/ds-course/chapters/project_lyrics/spacy/README.html) too
- [Common pitfalls with the preprocessing of German text for NLP](https://medium.com/idealo-tech-blog/common-pitfalls-with-the-preprocessing-of-german-text-for-nlp-3cfb8dc19ebe)


### Bonus content! 

In [None]:
# If you are interested in what sklearn's English stop words are:
# (the old sklearn.feature_extraction.stop_words module has been removed)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
print(ENGLISH_STOP_WORDS)