<a href="https://colab.research.google.com/github/kokchun/Machine-learning-AI22/blob/main/Lecture_code/Lec11-NLP_intro.ipynb" target="_parent"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a> &nbsp; for interacting with the code


---
# Lecture notes - NLP intro - text feature extraction
---

This is the lecture note for **NLP intro**. More on text feature extraction and preprocessing will be covered in the NLP section of the deep learning course. 

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to NLP with a focus on text feature extraction. I encourage you to read further about text feature extraction. </p>

Read more:
- [Bag-of-words model - wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)
- [TF-IDF - wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [Sparse matrix - wikipedia](https://en.wikipedia.org/wiki/Sparse_matrix)
- [Bag-of-words - sklearn](https://scikit-learn.org/stable/modules/feature_extraction.html?highlight=tfidf#text-feature-extraction)
- [CountVectorizer - sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer)
- [TfidfTransformer - sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html?highlight=tfidftransformer#sklearn.feature_extraction.text.TfidfTransformer)
- [TfidfVectorizer - sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
- [Sparse matrix - SciPy](https://docs.scipy.org/doc/scipy/reference/sparse.html)
---


## Term frequency
Term frequency $tf(t,d)$ is the relative frequency of term $t$ in document $d$, i.e. how frequently a term occurs in a document.

$$tf(t,d) = \frac{f_{t,d}}{\sum_{t'\in d} f_{t',d}}$$

where $f_{t,d}$ is the raw count, i.e. the number of times the term $t$ occurs in document $d$. The denominator is the total number of terms in the document. $tf(t,d)$ can be defined in several ways; a simple alternative is to set it equal to the raw count.
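
As a minimal sketch of the relative-frequency definition, the function below counts terms and divides by the total number of terms in the document. The name `relative_term_frequency` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def relative_term_frequency(document: str) -> dict[str, float]:
    """Return tf(t, d) = f_{t,d} / (total number of terms in d) for each term."""
    tokens = document.lower().split()  # naive whitespace tokenization (assumption)
    if not tokens:
        return {}  # an empty document has no terms
    raw_counts = Counter(tokens)  # f_{t,d} for every term t in d
    total_terms = len(tokens)     # the denominator of the formula
    return {term: count / total_terms for term, count in raw_counts.items()}


tf_example = relative_term_frequency("I LOVE this book about love")
print(tf_example)
```

Here "love" occurs 2 times out of 6 terms, so its relative frequency is 1/3, and the frequencies of all terms in the document sum to 1.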

In [12]:
import numpy as np 

review1 = "I LOVE this book about love"
review2 = "No this book was okay"

[text for text in [review1, review2]]

['I LOVE this book about love', 'No this book was okay']

In [24]:
[text.lower() for text in [review1, review2]]

['i love this book about love', 'no this book was okay']

In [39]:
[text.lower().split() for text in [review1, review2]]

[['i', 'love', 'this', 'book', 'about', 'love'],
 ['no', 'this', 'book', 'was', 'okay']]

In [25]:
all_words = [text.lower().split() for text in [review1, review2]]
print(all_words)

[['i', 'love', 'this', 'book', 'about', 'love'], ['no', 'this', 'book', 'was', 'okay']]


In [26]:
# flattens 2D list to 1D list 
all_words = [word for text in all_words for word in text]
all_words

['i',
 'love',
 'this',
 'book',
 'about',
 'love',
 'no',
 'this',
 'book',
 'was',
 'okay']

In [28]:
set(all_words) # gives the unique words, which have no particular order; returns type set

{'about', 'book', 'i', 'love', 'no', 'okay', 'this', 'was'}

In [32]:
# removes all duplicates, but sets don't have any particular ordering 
unique_words = set(all_words)
unique_words

{'about', 'book', 'i', 'love', 'no', 'okay', 'this', 'was'}

In [33]:
type(all_words), type(unique_words)

(list, set)

In [37]:
# the order will differ between different computers
# creates a dict with words as keys and indices as values, i.e. word:index pairs
{word: index for index, word in enumerate(unique_words)}

{'about': 0,
 'i': 1,
 'this': 2,
 'no': 3,
 'love': 4,
 'book': 5,
 'was': 6,
 'okay': 7}

In [38]:
# dictionary of all words 
vocabulary = {word: index for index, word in enumerate(unique_words)}
print(vocabulary)

{'about': 0, 'i': 1, 'this': 2, 'no': 3, 'love': 4, 'book': 5, 'was': 6, 'okay': 7}


In [41]:
# to get index write dict_name[key], for example:
vocabulary['this']

2
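
Since the iteration order of a set of strings is not guaranteed, one possible way to get a reproducible vocabulary is to sort the unique words before enumerating them. This is a sketch, not the approach used in the cells above; the names `all_words_example` and `sorted_vocabulary` are made up for illustration.

```python
all_words_example = ["i", "love", "this", "book", "about", "love",
                     "no", "this", "book", "was", "okay"]

# sorted() returns a list in alphabetical order, so every word gets the same index on every run
sorted_vocabulary = {word: index
                     for index, word in enumerate(sorted(set(all_words_example)))}
print(sorted_vocabulary)
```

With a sorted vocabulary the columns of the bag of words come out in alphabetical order.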

In [47]:
# the vocabulary corresponds to the corpus of documents
def term_frequency_vectorizer(document, vocabulary): 
    term_frequency = np.zeros(len(vocabulary)) # vector of zeroes

    for word in document.lower().split():
        index = vocabulary[word]
        term_frequency[index] += 1

    return term_frequency
    

# note that we use the raw count itself and do not divide by the total number of terms in the document
# this is another way to define the term frequency, and it is simpler to understand
review1_term_freq = term_frequency_vectorizer(review1, vocabulary)
review2_term_freq = term_frequency_vectorizer(review2, vocabulary)

print(vocabulary)
#review1 = "I LOVE this book about love"
#review2 = "No this book was okay"
review1_term_freq, review2_term_freq

{'about': 0, 'i': 1, 'this': 2, 'no': 3, 'love': 4, 'book': 5, 'was': 6, 'okay': 7}


(array([1., 1., 1., 0., 2., 1., 0., 0.]),
 array([0., 0., 1., 1., 0., 1., 1., 1.]))
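
Note that `term_frequency_vectorizer` raises a `KeyError` if the document contains a word that is not in the vocabulary. A hedged variant that skips unknown words could look like the sketch below; `safe_term_frequency_vectorizer`, `example_vocabulary` and `new_review` are hypothetical names, and ignoring unknown words is just one possible design choice.

```python
import numpy as np


def safe_term_frequency_vectorizer(document, vocabulary):
    """Count vocabulary terms in the document, skipping words not in the vocabulary."""
    term_frequency = np.zeros(len(vocabulary))  # vector of zeros
    for word in document.lower().split():
        index = vocabulary.get(word)  # None if the word is out of vocabulary
        if index is not None:
            term_frequency[index] += 1
    return term_frequency


example_vocabulary = {"about": 0, "book": 1, "i": 2, "love": 3,
                      "no": 4, "okay": 5, "this": 6, "was": 7}
new_review = "I love this movie"  # "movie" is not in the vocabulary
print(safe_term_frequency_vectorizer(new_review, example_vocabulary))
```

The unknown word "movie" is simply not counted, so the vector has the same length as the vocabulary.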

In [45]:
# create bag of words
import pandas as pd

bag_of_words = pd.DataFrame([review1_term_freq, review2_term_freq],
                            columns=vocabulary.keys(), dtype="int16")
bag_of_words

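|   | about | i | this | no | love | book | was | okay |
|---|-------|---|------|----|------|------|-----|------|
| 0 | 1     | 1 | 1    | 0  | 2    | 1    | 0   | 0    |
| 1 | 0     | 0 | 1    | 1  | 0    | 1    | 1   | 1    |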


---
## TF-IDF 
- Term frequency - inverse document frequency
- TF-IDF is a way to represent how important a word is to a document in a corpus of documents. Basically it is a vector with a numeric weight for each word, where higher weights are put on rarer terms.

The inverse document frequency $idf(t,D)$ gives information on the rarity of the word in all documents $D$.
$$idf(t,D) = \log{\frac{|D|}{1+|\{d\in D: t\in d\}|}}$$

where $|D|$ is the number of documents in the corpus and $|\{d\in D: t\in d\}|$ is the number of documents in which the term $t$ occurs. We add 1 to the denominator to avoid division by zero in case the term is not in the corpus.

$$tfidf(t,d,D) = tf(t,d)\cdot idf(t,D) $$
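
The formulas above can be sketched directly in Python. This is a minimal illustration using the natural logarithm and relative term frequency; the names `documents`, `inverse_document_frequency` and `tf_idf` are assumptions made for this example.

```python
import math

# two tokenized documents, i.e. the corpus D
documents = [
    "i love this book about love".split(),
    "no this book was okay".split(),
]


def inverse_document_frequency(term, documents):
    """idf(t, D) = log(|D| / (1 + number of documents containing t))."""
    documents_with_term = sum(term in document for document in documents)
    return math.log(len(documents) / (1 + documents_with_term))


def tf_idf(term, document, documents):
    """tfidf(t, d, D) = tf(t, d) * idf(t, D), with tf as relative frequency."""
    term_frequency = document.count(term) / len(document)
    return term_frequency * inverse_document_frequency(term, documents)


for term in ["love", "book", "movie"]:
    print(term, inverse_document_frequency(term, documents))
```

With only two documents, the 1 added in the denominator makes the idf zero for a term that occurs in one document and negative for a term that occurs in both, so this definition is better behaved on larger corpora.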

---
### Bag of words - Feature extraction with sklearn
- CountVectorizer - creates a bag of words model
- TfidfTransformer - transforms it using TF-IDF
- TfidfVectorizer - does CountVectorizer and TfidfTransformer

In [55]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
bag_of_words_sparse = count_vectorizer.fit_transform([review1, review2])
bag_of_words_sparse.todense(), count_vectorizer.get_feature_names_out()

(matrix([[1, 1, 2, 0, 0, 1, 0],
         [0, 1, 0, 1, 1, 1, 1]], dtype=int64),
 array(['about', 'book', 'love', 'no', 'okay', 'this', 'was'], dtype=object))
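
`fit_transform` returns a SciPy sparse matrix, which only stores the non-zero entries. A small sketch for inspecting it is shown below; the names `reviews` and `counts_sparse` are chosen for illustration.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["I LOVE this book about love", "No this book was okay"]
counts_sparse = CountVectorizer().fit_transform(reviews)

print(sparse.issparse(counts_sparse))  # whether the result is a SciPy sparse matrix
print(counts_sparse.shape)             # (number of documents, vocabulary size)
print(counts_sparse.nnz)               # number of stored non-zero entries
print(counts_sparse.toarray())         # convert to a dense NumPy array
```

For a realistic corpus with thousands of words most entries are zero, so the sparse representation saves a lot of memory compared to the dense one.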

In [56]:
# note that the default tokenization ignores one-letter words such as I 
bag_of_words_sklearn = pd.DataFrame(bag_of_words_sparse.todense(), columns = count_vectorizer.get_feature_names_out())
bag_of_words_sklearn

|   | about | book | love | no | okay | this | was |
|---|-------|------|------|----|------|------|-----|
| 0 | 1     | 1    | 2    | 0  | 0    | 1    | 0   |
| 1 | 0     | 1    | 0    | 1  | 1    | 1    | 1   |


In [57]:
# result from the manual construction of bag of words
bag_of_words

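|   | about | i | this | no | love | book | was | okay |
|---|-------|---|------|----|------|------|-----|------|
| 0 | 1     | 1 | 1    | 0  | 2    | 1    | 0   | 0    |
| 1 | 0     | 0 | 1    | 1  | 0    | 1    | 1   | 1    |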


### TF-IDF

In [61]:
from sklearn.feature_extraction.text import TfidfVectorizer

# same result as CountVectorizer followed by TfidfTransformer, but done in one go

tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit_transform([review1, review2]).todense()

# a higher number means a more important word
# a word that is common in a specific document but not common in the corpus of documents will have a higher tf-idf value

matrix([[0.4078241 , 0.29017021, 0.81564821, 0.        , 0.        ,
         0.29017021, 0.        ],
        [0.        , 0.35520009, 0.        , 0.49922133, 0.49922133,
         0.35520009, 0.49922133]])

In [63]:
from sklearn.feature_extraction.text import TfidfTransformer

# with TfidfTransformer you have to fit on the bag-of-words count matrix
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit_transform(bag_of_words_sparse).todense()

matrix([[0.4078241 , 0.29017021, 0.81564821, 0.        , 0.        ,
         0.29017021, 0.        ],
        [0.        , 0.35520009, 0.        , 0.49922133, 0.49922133,
         0.35520009, 0.49922133]])
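
The numbers above differ from what the textbook formula gives, because scikit-learn with default settings uses the raw count as term frequency, a smoothed idf, $\ln{\frac{1+|D|}{1+df(t)}} + 1$, and then scales each row to unit Euclidean length. The sketch below reproduces this with NumPy under those default settings (`smooth_idf=True`, `norm="l2"`); the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

reviews = ["I LOVE this book about love", "No this book was okay"]
counts_sparse = CountVectorizer().fit_transform(reviews)
counts = counts_sparse.toarray()

n_documents = counts.shape[0]
document_frequency = (counts > 0).sum(axis=0)  # number of documents containing each term

# smoothed idf as used by scikit-learn's default settings
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

tfidf_unnormalized = counts * idf  # raw count times idf
# scale each row (document) to unit Euclidean length
row_norms = np.linalg.norm(tfidf_unnormalized, axis=1, keepdims=True)
tfidf_manual = tfidf_unnormalized / row_norms

tfidf_sklearn = TfidfTransformer().fit_transform(counts_sparse).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))
```

Because of the added 1, a term that occurs in every document still gets a positive weight instead of being zeroed out.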

---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---
