# Bag of Words (Or: *What is unique about working with text*)

### 🎯 Goal: create a feature matrix `X` for your lyrics corpus

**Q**: How is text different from the types of data you've seen so far?
* ...

**Q**: What does this mean for how you can work with text compared to types of data you've worked with so far?
* ...

There are two main aspects of working with text that help us feed it into computers / machine learning models:
1. Data/text preprocessing
2. Turning text into features

### 1. Data Preprocessing

**Q**: Things we may want to do to clean our textual data:
* ...

In [1]:
corpus = ["we all love a yellow submarine",             # Beatles
          "yesterday, my submarine was in love",        # Beatles
          "we are love trouble with loyalty here",      # Eminem
          "loyalty to us is worth more than love is"]   # Eminem
labels = ['Beatles'] * 2 + ['Eminem'] * 2

### 2. Turning text into features

**Q**: Once you have your data cleaned and preprocessed like this, what can you imagine using as your features?
* ...

#### Bag of words

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
vectorizer = CountVectorizer()

In [4]:
X = vectorizer.fit_transform(corpus)

In [5]:
X_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out(), index=labels)

##### How can we remove the most common words?

* Using a list of stop words
* Removing the words that appear in more than X% of documents
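
Both options map onto `CountVectorizer` parameters. The following is a minimal sketch, assuming scikit-learn's built-in English stop word list (`stop_words="english"`) and an illustrative threshold of 75% (`max_df=0.75`). Both values are assumptions chosen for this toy corpus, not recommended defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["we all love a yellow submarine",
          "yesterday, my submarine was in love",
          "we are love trouble with loyalty here",
          "loyalty to us is worth more than love is"]

# Baseline: no filtering at all.
full_vocab = set(CountVectorizer().fit(corpus).vocabulary_)

# Option 1: drop words found in scikit-learn's built-in English stop word list.
stop_vocab = set(CountVectorizer(stop_words="english").fit(corpus).vocabulary_)

# Option 2: drop words that appear in more than 75% of the documents.
maxdf_vocab = set(CountVectorizer(max_df=0.75).fit(corpus).vocabulary_)

print("removed by stop word list:", sorted(full_vocab - stop_vocab))
print("removed by max_df=0.75:  ", sorted(full_vocab - maxdf_vocab))
```

With only four documents, `max_df=0.75` can only remove terms that occur in every document; on a real lyrics corpus the threshold needs tuning.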

#### n-grams

Instead of single tokens, we now also count token pairs (bigrams), triplets (trigrams), etc.

#### TF-IDF

Stands for `term frequency - inverse document frequency`. It weights each term by how common it is across the whole corpus (not just inside a single document).

##### TF = term frequency 

$TF(t, d)$ - frequency of term (n-gram) _t_ in document _d_

##### IDF(t) = inverse document frequency (of term _t_ in the whole corpus)

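$$ IDF(t) = \log \frac{1+N}{1+N_t}+1 $$

where $N$ is the number of documents in the corpus and $N_t$ is the number of documents that contain term _t_.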


If term _t_ doesn't appear in many documents: ...

If term _t_ appears in many documents: ...

$$ TFIDF(t, d) = TF(t, d) \cdot IDF(t) $$
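
To get a feel for these formulas, they can be evaluated by hand on the toy corpus. This is a sketch under two assumptions: tokenization is a plain whitespace split after stripping commas, and $TF(t, d)$ is the raw count of the term in the document. The helper names `tf`, `idf` and `tfidf` are made up for illustration.

```python
import math

corpus = ["we all love a yellow submarine",
          "yesterday, my submarine was in love",
          "we are love trouble with loyalty here",
          "loyalty to us is worth more than love is"]

# Assumption: simple tokenization (strip commas, split on whitespace).
docs = [doc.replace(",", "").split() for doc in corpus]
N = len(docs)  # number of documents in the corpus

def tf(term, doc):
    """Raw count of `term` in one tokenized document (assumed TF definition)."""
    return doc.count(term)

def idf(term):
    """Smoothed inverse document frequency, following the IDF formula above."""
    n_t = sum(term in doc for doc in docs)  # documents that contain `term`
    return math.log((1 + N) / (1 + n_t)) + 1

def tfidf(term, doc):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term)

# Compare a few terms in the first lyric.
for term in ["love", "submarine", "yellow"]:
    print(term, round(idf(term), 3), round(tfidf(term, docs[0]), 3))
```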

**Q**: What kind of terms will have high TF-IDF?
* ...

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

What can you say about the values in your new `X_df` (think about sums, normalizations, etc.)?

#### BONUS: extracting most predictive words for each class

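```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(X, labels)  # the model must be fitted before reading coef_
feature_names = vectorizer.get_feature_names_out()
order = np.argsort(model.coef_[0])  # feature indices, sorted by ascending coefficient
print(feature_names[order[-20:]])   # largest coefficients: words pointing to model.classes_[1]
print(feature_names[order[:20]])    # smallest coefficients: words pointing to model.classes_[0]
```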