# Bag of Words (Or: *What is unique about working with text*)

### 🎯 Goal: create a feature matrix `X` for your lyrics corpus

**Q**: How is text different from the types of data you've seen so far?
* Often has to be extracted/scraped.
* Unstructured.
* Ambiguity.
* Can be user-entered.

**Q**: What does this mean for how you can work with text compared to types of data you've worked with so far?
* We have to turn it into numbers (feature engineering / vectorization)
* We have to clean / preprocess it

There are two main aspects of working with text that help us feed it into computers / machine learning models:
1. Data/text preprocessing
2. Turning text into features

### 1. Data Preprocessing

**Q**: Things we may want to do to clean our textual data:
* separate words (_tokenization_), deal with spaces, new lines, etc.
* deal with user-entered input (see e.g. https://github.com/seatgeek/thefuzz)
* remove most frequent words ("stop words")
    * using a list of words to remove
    * removing words that appear in more than X% of the documents
* remove special characters, punctuation, emojis 
* deal with capitalization
* reduce words to their base forms:
    * _stemming_
    * _lemmatization_
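
A minimal sketch of a few of these cleaning steps, using only the standard library. The stop-word list is a tiny hypothetical one chosen for illustration; real lists (for example the one shipped with scikit-learn) are much longer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "all", "in", "my", "was", "we"}

def clean(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()                    # deal with capitalization
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation, emojis, etc. with spaces
    tokens = text.split()                  # naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

cleaned = clean("Yesterday, my Submarine was in LOVE!")
print(cleaned)  # ['yesterday', 'submarine', 'love']
```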

Tokenization: 
* splitting text into _tokens_
* most often words, could also be sentences, sometimes individual letters
* tools: `.split()`, regex, nltk (treebank tokenizer), scikit-learn, spacy, keras
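
The difference between the simplest tokenizers can be sketched with the standard library alone. The regular expression used here is the same pattern scikit-learn's `CountVectorizer` uses by default (`token_pattern`); the variable names are only illustrative.

```python
import re

text = "yesterday, my submarine was in love"

# 1. Whitespace split: punctuation stays attached to the word.
whitespace_tokens = text.split()
print(whitespace_tokens)  # ['yesterday,', 'my', 'submarine', 'was', 'in', 'love']

# 2. Regex: runs of two or more word characters, so punctuation
#    and single-character tokens (such as "a") are dropped.
regex_tokens = re.findall(r"\b\w\w+\b", text)
print(regex_tokens)       # ['yesterday', 'my', 'submarine', 'was', 'in', 'love']

# 3. Character-level tokens.
char_tokens = list("love")
print(char_tokens)        # ['l', 'o', 'v', 'e']
```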

Stemming:
* reducing the word to a more basic form, a _stem_
* by removing suffixes (-able, -ed, -ing)
* tools: nltk
* based on heuristics / rules
* does not have to result in a word
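
To make the rule-based idea concrete, here is a toy suffix stripper. It is a deliberately simplified sketch, not the Porter algorithm or any nltk stemmer; the suffix list and the minimum stem length of 3 are arbitrary assumptions.

```python
# Suffixes are checked in this order (illustrative choice).
SUFFIXES = ("able", "ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["loving", "loved", "lovable", "submarines", "love"]]
print(stems)  # ['lov', 'lov', 'lov', 'submarine', 'love']
```

Note that `lov` is not a word, and that `love` itself is not mapped to the same stem: simple rules are fast but crude.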

Lemmatization:
* reducing the word to a more basic form, a _lemma_
* based on morphology of a word (vocabulary/dictionary)
* tools: nltk (wordnet lemmatizer), spacy
* returns a word
* it doesn't always result in a reduced form
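
In contrast to stemming, lemmatization looks words up. The sketch below uses a hypothetical hand-written mini-dictionary in place of a real resource such as WordNet, purely to show the lookup idea.

```python
# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMAS = {
    "was": "be",
    "is": "be",
    "are": "be",
    "submarines": "submarine",
}

def toy_lemmatize(word):
    """Return the dictionary lemma, or the word itself if it is unknown."""
    return LEMMAS.get(word, word)

lemmas = [toy_lemmatize(w) for w in ["was", "are", "submarines", "love"]]
print(lemmas)  # ['be', 'be', 'submarine', 'love']
```

The result is always a word, and a word that is already a lemma (or missing from the dictionary) comes back unchanged.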

In [4]:
corpus = ["we all love a yellow submarine",             # Beatles
          "yesterday, my submarine was in love",        # Beatles
          "we are love trouble with loyalty here",      # Eminem
          "loyalty to us is worth more than love is"]   # Eminem
labels = ['Beatles'] * 2 + ['Eminem'] * 2

### 2. Turning text into features

**Q**: Once you have your data cleaned and preprocessed like this, what can you imagine using as your features?
* word counts
* length of text
* number of words starting with each letter
* repetitions: # per song, length of the longest
* number of unique words
* average word length in a song
* rule-based sentiment analysis
* count of slang/swear/domain words in a song
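
A few of these hand-crafted features can be computed with plain Python. This is a minimal sketch; the function name and feature names are made up for illustration, and words are split on whitespace only.

```python
def simple_features(text):
    """Compute a handful of length-based features for one document."""
    words = text.split()
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_unique_words": len(set(words)),
        # Guard against empty documents to avoid dividing by zero.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }

features = simple_features("loyalty to us is worth more than love is")
print(features["n_words"], features["n_unique_words"])  # 9 8
```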

#### Bag of words

Each token/word is a column/feature.
* word order lost
* large sparse matrix
* counts not normalized
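
Before using a library, it may help to see the idea built by hand. The sketch below counts tokens with `collections.Counter` and lays them out as one row per document and one column per vocabulary term. The tokenizer (lowercase, tokens of two or more word characters) is chosen to mimic scikit-learn's defaults; the variable names are illustrative.

```python
import re
from collections import Counter

docs = ["we all love a yellow submarine",
        "yesterday, my submarine was in love"]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(d)) for d in docs]

# One column per distinct token, in alphabetical order.
vocabulary = sorted(set().union(*counts))

# A Counter returns 0 for missing keys, which fills in the zeros.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
# [1, 0, 1, 0, 1, 0, 1, 1, 0]
# [0, 1, 1, 1, 1, 1, 0, 0, 1]
```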

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
vectorizer = CountVectorizer()

In [5]:
X = vectorizer.fit_transform(corpus)

In [6]:
X

<4x20 sparse matrix of type '<class 'numpy.int64'>'
	with 26 stored elements in Compressed Sparse Row format>

In [7]:
type(X)

scipy.sparse.csr.csr_matrix

In [12]:
print(X)

  (0, 15)	1
  (0, 0)	1
  (0, 5)	1
  (0, 18)	1
  (0, 9)	1
  (1, 5)	1
  (1, 9)	1
  (1, 19)	1
  (1, 8)	1
  (1, 14)	1
  (1, 3)	1
  (2, 15)	1
  (2, 5)	1
  (2, 1)	1
  (2, 12)	1
  (2, 16)	1
  (2, 6)	1
  (2, 2)	1
  (3, 5)	1
  (3, 6)	1
  (3, 11)	1
  (3, 13)	1
  (3, 4)	2
  (3, 17)	1
  (3, 7)	1
  (3, 10)	1


In [13]:
X.todense()

matrix([[1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
        [0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0],
        [0, 0, 0, 0, 2, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0]])

In [9]:
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [10]:
X_df

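| | all | are | here | in | is | love | loyalty | more | my | submarine | than | to | trouble | us | was | we | with | worth | yellow | yesterday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beatles | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Beatles | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| Eminem | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Eminem | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |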


##### How can we remove the most common words?

* Using a list of stop words
* Removing the words that appear in more than X% of documents
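
The second option relies on each term's document frequency: the fraction of documents that contain it. A minimal hand-rolled sketch on the example corpus follows (the tokenizer again mimics scikit-learn's default pattern; the 0.8 threshold is just an example value).

```python
import re

docs = ["we all love a yellow submarine",
        "yesterday, my submarine was in love",
        "we are love trouble with loyalty here",
        "loyalty to us is worth more than love is"]

# Use sets: a term counts once per document, however often it repeats.
tokenized = [set(re.findall(r"\b\w\w+\b", d.lower())) for d in docs]
vocabulary = set().union(*tokenized)

# Fraction of documents in which each term appears.
doc_freq = {t: sum(t in doc for doc in tokenized) / len(docs) for t in vocabulary}

# Terms appearing in strictly more than 80% of the documents.
too_common = sorted(t for t, f in doc_freq.items() if f > 0.8)
print(too_common)  # ['love']
```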

In [14]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [15]:
X_df

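| | love | loyalty | submarine | trouble | worth | yellow | yesterday |
|---|---|---|---|---|---|---|---|
| Beatles | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| Beatles | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| Eminem | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| Eminem | 1 | 1 | 0 | 0 | 1 | 0 | 0 |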


In [16]:
vectorizer = CountVectorizer(max_df=0.8)
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [19]:
X_df

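| | all | are | here | in | is | loyalty | more | my | submarine | than | to | trouble | us | was | we | with | worth | yellow | yesterday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beatles | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Beatles | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| Eminem | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Eminem | 0 | 0 | 0 | 0 | 2 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |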


#### n-grams

Instead of single tokens, we now also count token pairs (bigrams), triplets (trigrams), etc.
* preserves local word order
* even larger, sparser matrix
* too many features:
    * remove high frequency n-grams (max_df): not very informative
    * remove low frequency n-grams (min_df): very rare n-grams; can lead to overfitting
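
Generating n-grams from a token list is a sliding window over the tokens. A minimal sketch (the helper name `ngrams` is made up here, and the tokens are assumed to be already cleaned):

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "yesterday my submarine was in love".split()

bigrams = ngrams(tokens, 2)
print(bigrams)
# ['yesterday my', 'my submarine', 'submarine was', 'was in', 'in love']

trigrams = ngrams(tokens, 3)
print(len(trigrams))  # 4
```

A document with `k` tokens yields `k - n + 1` n-grams, so adding bigrams and trigrams quickly multiplies the number of columns.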

In [29]:
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [30]:
X_df

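| | all | all love | are | are love | here | in | in love | is | is worth | love | ... | we all | we are | with | with loyalty | worth | worth more | yellow | yellow submarine | yesterday | yesterday my |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beatles | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Beatles | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Eminem | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Eminem | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |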


#### TF-IDF

Stands for `term frequency - inverse document frequency`. It accounts for how common a word is across the whole corpus (not just inside a single document).

##### TF = term frequency 

$TF(t, d)$ - frequency of term (n-gram) _t_ in document _d_

##### IDF(t) = inverse document frequency (of term _t_ in the whole corpus)

$$ IDF(t) = \log \frac{1+N}{1+N_t}+1 $$

Where $N$ is the number of documents

and $N_t$ is the number of documents in which term $t$ appears

If term $t$ doesn't appear in many documents: $IDF(t)$ is "large"

If term $t$ appears in many documents: $IDF(t)$ is "small", i.e. it is close to 1.

-> terms that are too common are penalized

$$ TFIDF(t, d) = TF(t, d) \cdot IDF(t) $$

**Q**: What kind of terms will have high TF-IDF?
* Those that appear a lot in a small number of documents (songs)
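
As a worked example of the IDF formula, the sketch below evaluates it for three terms of the four-document example corpus. It assumes the natural logarithm, which is what scikit-learn uses; the document counts are read off the count matrix of the corpus.

```python
import math

N = 4  # number of documents in the example corpus

# N_t: number of documents containing each term.
docs_with_term = {"love": 4, "submarine": 2, "yellow": 1}

def idf(n_docs, n_docs_with_term):
    """Smoothed inverse document frequency: log((1 + N) / (1 + N_t)) + 1."""
    return math.log((1 + n_docs) / (1 + n_docs_with_term)) + 1

idf_values = {t: idf(N, n_t) for t, n_t in docs_with_term.items()}
for term, value in idf_values.items():
    print(f"{term}: {value:.4f}")
# love: 1.0000
# submarine: 1.5108
# yellow: 1.9163
```

`love` appears in every document, so its IDF is exactly 1, the smallest possible value; the rarer the term, the larger the weight.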

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [32]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
X_df = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out(), index=labels)

In [33]:
X_df

| | all | are | here | in | is | love | loyalty | more | my | submarine | than | to | trouble | us | was | we | with | worth | yellow | yesterday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Beatles | 0.533343 | 0.0 | 0.0 | 0.0 | 0.0 | 0.27832 | 0.0 | 0.0 | 0.0 | 0.420493 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.420493 | 0.0 | 0.0 | 0.533343 | 0.0 |
| Beatles | 0.0 | 0.0 | 0.0 | 0.452035 | 0.0 | 0.235891 | 0.0 | 0.0 | 0.452035 | 0.356389 | 0.0 | 0.0 | 0.0 | 0.0 | 0.452035 | 0.0 | 0.0 | 0.0 | 0.0 | 0.452035 |
| Eminem | 0.0 | 0.425802 | 0.425802 | 0.0 | 0.0 | 0.222201 | 0.335707 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.425802 | 0.0 | 0.0 | 0.335707 | 0.425802 | 0.0 | 0.0 | 0.0 |
| Eminem | 0.0 | 0.0 | 0.0 | 0.0 | 0.635837 | 0.165903 | 0.250651 | 0.317919 | 0.0 | 0.0 | 0.317919 | 0.317919 | 0.0 | 0.317919 | 0.0 | 0.0 | 0.0 | 0.317919 | 0.0 | 0.0 |


What can you say about the values in your new `X_df` (think about sums, normalizations, etc.)?

In [36]:
np.square(X_df).sum(axis=1)

Beatles    1.0
Beatles    1.0
Eminem     1.0
Eminem     1.0
dtype: float64
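
The squared values of each row sum to 1, which means every row has been scaled to unit Euclidean (L2) length. The first row of the table can be reproduced by hand under that assumption: compute TF times IDF for each term of "we all love a yellow submarine", then divide by the length of the resulting vector. This is a sketch of the arithmetic, not scikit-learn's actual implementation.

```python
import math

N = 4  # documents in the corpus

# Document frequencies and in-document counts for the first song
# ("a" is dropped by the default tokenizer).
doc_freq = {"all": 1, "love": 4, "submarine": 2, "we": 2, "yellow": 1}
term_counts = {"all": 1, "love": 1, "submarine": 1, "we": 1, "yellow": 1}

# Unnormalized TF-IDF: TF(t, d) * IDF(t).
raw = {t: tf * (math.log((1 + N) / (1 + doc_freq[t])) + 1)
       for t, tf in term_counts.items()}

# Scale the row to unit L2 length.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}

print({t: round(v, 6) for t, v in tfidf.items()})
# {'all': 0.533343, 'love': 0.27832, 'submarine': 0.420493, 'we': 0.420493, 'yellow': 0.533343}
```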

#### BONUS: extracting most predictive words for each class

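```
import operator
from sklearn.linear_model import LogisticRegression

# Fit on the feature matrix first; coef_ does not exist before fitting.
model = LogisticRegression().fit(X, labels)
ranked = operator.itemgetter(*np.argsort(model.coef_[0]))(vectorizer.get_feature_names_out())
print(ranked[-20:])  # largest coefficients: most predictive of model.classes_[1]
print(ranked[:20])   # smallest coefficients: most predictive of model.classes_[0]
```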