---
# Lecture notes - NLP intro - text feature extraction
---

This is the lecture note for **NLP intro**. More on text feature extraction and preprocessing will be given in the NLP section of the deep learning course.

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to NLP with a focus on text feature extraction. I encourage you to read further on the topic. </p>

Read more:
- [Bag-of-words model - wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)
- [TF-IDF - wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [Sparse matrix - wikipedia](https://en.wikipedia.org/wiki/Sparse_matrix)
- [Bag-of-words - sklearn](https://scikit-learn.org/stable/modules/feature_extraction.html?highlight=tfidf#text-feature-extraction)
- [CountVectorizer - sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer)
- [TfidfTransformer - sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html?highlight=tfidftransformer#sklearn.feature_extraction.text.TfidfTransformer)
- [TfidfVectorizer - sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- [Sparse matrix - SciPy](https://docs.scipy.org/doc/scipy/reference/sparse.html)
---

## Term frequency
Term frequency $tf(t,d)$ is the relative frequency of term $t$ in document $d$, i.e. how frequently the term occurs in the document.

$$tf(t,d) = \frac{f_{t,d}}{\sum_{t'\in d} f_{t',d}}$$

where $f_{t,d}$ is the raw count, i.e. the number of times the term $t$ occurs in document $d$. The denominator is the total number of terms in the document. $tf(t,d)$ can be defined in several ways; a simple one is to set it equal to the raw count, which is what the code cells in this section do.
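
As a small worked example of the relative definition, take the lowercased document $d$ = "i love this book about love". It has six terms in total, and "love" occurs twice, so:

$$tf(\text{love}, d) = \frac{2}{6} = \frac{1}{3}, \qquad tf(\text{book}, d) = \frac{1}{6}$$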

In [1]:
import numpy as np

review1 = "I LOVE this book about love"
review2 = "No this book was okay"


all_words = [text.lower().split() for text in [review1, review2]]
print(all_words)

[['i', 'love', 'this', 'book', 'about', 'love'], ['no', 'this', 'book', 'was', 'okay']]


In [2]:
# we don't want a nested (2D) list, so we flatten it to a 1D list

all_words = [word for text in all_words for word in text]
all_words


['i',
 'love',
 'this',
 'book',
 'about',
 'love',
 'no',
 'this',
 'book',
 'was',
 'okay']

In [3]:
# the set of all words contains only the unique words
# a set has no defined order, so the words may come out in a different order than they went in

unique_words = set(all_words)
type(unique_words), unique_words


(set, {'about', 'book', 'i', 'love', 'no', 'okay', 'this', 'was'})

In [4]:
# loop through the unique words and give each one an index
vocab = {word: index for index, word in enumerate(unique_words)}
print(vocab)

{'i': 0, 'love': 1, 'about': 2, 'no': 3, 'this': 4, 'was': 5, 'okay': 6, 'book': 7}


In [5]:
# vocab is built from the words of all reviews
# for each review we want to return a vector of word counts
def term_frequency_vectorizer(document, vocab=vocab):
    term_frequency = np.zeros(len(vocab))

    # the document is in this case a single review that we pass in
    # vocab is a dictionary, so we get the index by looking up the word
    for word in document.lower().split():
        index = vocab[word]
        term_frequency[index] += 1

    return term_frequency

review1_term_frequency = term_frequency_vectorizer(review1)
review2_term_frequency = term_frequency_vectorizer(review2)

print(vocab)
print(review1)
print(review2)
review1_term_frequency, review2_term_frequency


{'i': 0, 'love': 1, 'about': 2, 'no': 3, 'this': 4, 'was': 5, 'okay': 6, 'book': 7}
I LOVE this book about love
No this book was okay


(array([1., 2., 1., 0., 1., 0., 0., 1.]),
 array([0., 0., 0., 1., 1., 1., 1., 1.]))

In [6]:
import pandas as pd

bag_of_words = pd.DataFrame([review1_term_frequency, review2_term_frequency], columns=vocab.keys())
bag_of_words

# We have now gone through manually how to go from a text to a vector
# This is a good exercise for understanding what happens behind the scenes

|   | i | love | about | no | this | was | okay | book |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
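
The vectors above hold raw counts. To get the relative term frequency from the formula at the start of this section, each row can be divided by its total number of terms. The following is a minimal sketch; `raw_counts` and `relative_tf` are names introduced here for illustration, and the counts are copied by hand from the output above.

```python
import numpy as np

# Raw counts for the two reviews, copied from the output above.
# Column order: i, love, about, no, this, was, okay, book
raw_counts = np.array([
    [1., 2., 1., 0., 1., 0., 0., 1.],
    [0., 0., 0., 1., 1., 1., 1., 1.],
])

# Divide each row by the total number of terms in that document.
# keepdims=True keeps the row sums as a column so broadcasting works row-wise.
relative_tf = raw_counts / raw_counts.sum(axis=1, keepdims=True)

print(relative_tf.round(3))
```

Each row of `relative_tf` now sums to 1, and the entry for "love" in the first review is 2/6.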


## Bag of words sklearn
---
### Feature extraction with sklearn
- CountVectorizer - creates a bag of words model
- TfidfTransformer - transforms it using TF-IDF
- TfidfVectorizer - does CountVectorizer and TfidfTransformer
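
The relation between the three classes can be checked directly. The sketch below assumes default parameters everywhere and uses the two example reviews from this note; the variable names are only illustrative. TF-IDF itself is described in the TF-IDF section of this note.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

reviews = ["I LOVE this book about love", "No this book was okay"]

# Two steps: count the words first, then reweight the counts with TF-IDF.
counts = CountVectorizer().fit_transform(reviews)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(reviews)

# Both routes should give the same matrix when default parameters are used.
print(np.allclose(two_step.toarray(), one_step.toarray()))
```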

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
bag_of_words_sparse = count_vectorizer.fit_transform([review1, review2])
bag_of_words_sparse.todense(), count_vectorizer.get_feature_names_out()


(matrix([[1, 1, 2, 0, 0, 1, 0],
         [0, 1, 0, 1, 1, 1, 1]]),
 array(['about', 'book', 'love', 'no', 'okay', 'this', 'was'], dtype=object))

In [12]:
bag_of_words = pd.DataFrame(bag_of_words_sparse.todense(), columns=count_vectorizer.get_feature_names_out())
bag_of_words

|   | about | book | love | no | okay | this | was |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |


---
## TF-IDF 
- Term frequency - inverse document frequency
- TF-IDF is a way to represent how important a word is to a document in a corpus of documents. Basically, it is a vector of numeric weights, one per word, where higher weights are put on terms that are rarer across the corpus.

The inverse document frequency $idf(t,D)$ gives information on the rarity of the word in all documents $D$.
$$idf(t,D) = \log{\frac{|D|}{1+|\{d\in D: t\in d\}|}}$$

where $|D|$ is the number of documents in the corpus and $|\{d\in D: t\in d\}|$ is the number of documents in which the word $t$ occurs. We add 1 to the denominator to avoid division by zero in case the word is not in the corpus.

$$tfidf(t,d,D) = tf(t,d)\cdot idf(t,D)$$
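
The formulas above can be sketched in plain Python. This is only an illustration under a few assumptions: the logarithm is taken to be the natural logarithm (the base is not specified in this note), documents are lists of lowercased words, and the function names `tf`, `idf` and `tfidf` are introduced here.

```python
import math

documents = [
    "i love this book about love".split(),
    "no this book was okay".split(),
]

def tf(term, document):
    # Relative frequency: raw count divided by the number of terms in the document.
    return document.count(term) / len(document)

def idf(term, documents):
    # Number of documents containing the term, with 1 added in the denominator.
    n_containing = sum(term in document for document in documents)
    return math.log(len(documents) / (1 + n_containing))

def tfidf(term, document, documents):
    return tf(term, document) * idf(term, documents)

love_weight = tfidf("love", documents[0], documents)  # "love" is in one document
this_weight = tfidf("this", documents[0], documents)  # "this" is in both documents
print(love_weight, this_weight)
```

With only two documents, a word that occurs in one document gets $idf = \log(2/2) = 0$ and a word that occurs in both gets $idf = \log(2/3) < 0$, so this variant of the formula is more informative on larger corpora.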

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform([review1, review2]).todense()

# a word that is common in a specific document but rare across the corpus of documents gets a higher tf-idf value

matrix([[0.4078241 , 0.29017021, 0.81564821, 0.        , 0.        ,
         0.29017021, 0.        ],
        [0.        , 0.35520009, 0.        , 0.49922133, 0.49922133,
         0.35520009, 0.49922133]])
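
The numbers above do not follow directly from the formulas earlier in this section, because sklearn uses a slightly different variant by default: the raw count as $tf$, a smoothed $idf(t) = \ln\frac{1+n}{1+df(t)} + 1$, and L2 normalisation of each row. The sketch below reproduces the matrix by hand under those defaults; `smoothed_idf` and `manual_tfidf` are names introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

reviews = ["I LOVE this book about love", "No this book was okay"]

# Raw counts per document (rows) and term (columns).
counts = CountVectorizer().fit_transform(reviews).toarray()

n_documents = counts.shape[0]
# Document frequency: in how many documents each term occurs.
document_frequency = (counts > 0).sum(axis=0)

# Smoothed idf, as documented for sklearn's default smooth_idf=True.
smoothed_idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# Weight the counts, then scale each row to unit Euclidean length (norm="l2").
weighted = counts * smoothed_idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(reviews).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```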