<a href="https://colab.research.google.com/github/HHansi/Applied-AI-Course/blob/main/NLP/Text_Feature_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Feature Extraction

This notebook contains the practical examples and exercises for the Natural Language Processing part of the Applied AI course.

*Created by Hansi Hettiarachchi*

Importing libraries

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import pandas as pd

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Bag-of-Words (BoW) Model

Machine learning algorithms cannot work directly with raw text. Text needs to be converted into numbers first.

The bag-of-words (BoW) model is a simple and effective way to convert text features into numeric features for use with machine learning models. Each document is represented by how many times each vocabulary word occurs in it, ignoring word order.
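
Before using a library, it may help to see the idea by hand. The following is a minimal sketch in plain Python, assuming whitespace tokenisation and a made-up two-document corpus (`docs` is hypothetical and not part of the notebook data). It is not how CountVectorizer is implemented, only an illustration of the counting idea.

```python
from collections import Counter

# Hypothetical mini-corpus, used only for this illustration
docs = ["good food good price", "bad food"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, holding word counts
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each row has the same length as the vocabulary, so documents of different lengths end up as vectors of the same size.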



Let's define some sample texts to use with the following examples. <br>
Here, I picked some food reviews.

In [2]:
text_documents = ["good quality food",
                  "not as advertised",
                  "great taffy"]

The simplest BoW model can be built using CountVectorizer with its default arguments.

In [3]:
count_vectorizer = CountVectorizer()
# learn a vocabulary dictionary of all tokens in the text_documents
count_vectorizer.fit(text_documents)

print(f'Features: {count_vectorizer.get_feature_names_out()}\n')

Features: ['advertised' 'as' 'food' 'good' 'great' 'not' 'quality' 'taffy']



In [4]:
# convert text data to numeric vectors
matrix = count_vectorizer.transform(text_documents)

# summarise vector details
print(f'Result matrix size: {matrix.shape}')
print(f'Result matrix:\n {matrix.toarray()}\n')

# for visualisation purpose, let's convert the features into a dataframe
pd.DataFrame(matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

Result matrix size: (3, 8)
Result matrix:
 [[0 0 1 1 0 0 1 0]
 [1 1 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 1]]



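| | advertised | as | food | good | great | not | quality | taffy |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |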


<b> <u>Understanding CountVectorizer arguments</u> </b>
- lowercase (boolean, default=True) - If True, convert all characters to lowercase before tokenizing.
- token_pattern (string, default=r"(?u)\b\w\w+\b") - Regular expression denoting what constitutes a “token”. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
- stop_words (list, default=None) - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- ngram_range (tuple, default=(1, 1)) - The lower and upper boundary of the range of n-values for different word n-grams.

Please refer to the [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for more details.
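
To get a feel for what `lowercase` and `token_pattern` do, the default pattern can be tried directly with Python's `re` module. This is only a rough sketch: the sample sentence and the second pattern (`single_char_pattern`) are assumptions made for illustration, and the vectorizer itself does more than a single `re.findall` call.

```python
import re

# Hypothetical sample sentence, not part of the notebook data
text = "I ate a GOOD taffy!"

# Default pattern quoted above: tokens of two or more word characters
default_pattern = r"(?u)\b\w\w+\b"
# Hypothetical alternative that also keeps one-character tokens
single_char_pattern = r"(?u)\b\w+\b"

# lowercase=True corresponds to lowercasing the text before tokenizing
default_tokens = re.findall(default_pattern, text.lower())
all_tokens = re.findall(single_char_pattern, text.lower())

print(default_tokens)  # one-character words and punctuation are dropped
print(all_tokens)      # one-character words are kept, punctuation still dropped
```

With the default pattern, short words such as "I" and "a" never become features, even without a stop word list.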

<b>Defining stop words to be removed</b> <br>
Let's use the stop words list available with NLTK.

In [5]:
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [6]:
count_vectorizer = CountVectorizer(stop_words=stop_words)
# learn a vocabulary dictionary of all tokens in the text_documents
count_vectorizer.fit(text_documents)

print(f'Features: {count_vectorizer.get_feature_names_out()}\n')

Features: ['advertised' 'food' 'good' 'great' 'quality' 'taffy']



### <font color='green'>**Activity 1**</font>

Compare the features you obtained with and without the 'stop_words' argument. What differences do you see?

In [7]:
# convert text data to numeric vectors
matrix = count_vectorizer.transform(text_documents)

# summarise vector details
print(f'Result matrix size: {matrix.shape}')
print(f'Result matrix:\n {matrix.toarray()}\n')

# for visualisation purpose, let's convert the features into a dataframe
pd.DataFrame(matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

Result matrix size: (3, 6)
Result matrix:
 [[0 1 1 0 1 0]
 [1 0 0 0 0 0]
 [0 0 0 1 0 1]]



| | advertised | food | good | great | quality | taffy |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 1 |


### <font color='green'>**Activity 2**</font>

Let's assume our final goal is to predict the sentiment (positive, negative and neutral) of the food reviews.

a) Do the features without stopwords reflect the original sentiment of each review?

b) If not, what solution would you suggest to obtain features that reflect the original sentiment?

<b>Use different n-grams</b> <br>
Examples of lower and upper boundaries for ngram_range:
- (1, 1) - 1-grams
- (1, 2) - 1-grams and 2-grams
- (2, 2) - 2-grams
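
The boundaries above can be illustrated with a small helper. This is a sketch only: `word_ngrams` is a hypothetical function written for this explanation (not a scikit-learn API), and it assumes the text has already been split into tokens.

```python
def word_ngrams(tokens, ngram_range):
    """Return all word n-grams for every n between the lower and upper boundary."""
    low, high = ngram_range
    ngrams = []
    for n in range(low, high + 1):
        # slide a window of n tokens across the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = "not as advertised".split()

unigrams = word_ngrams(tokens, (1, 1))         # 1-grams only
uni_and_bigrams = word_ngrams(tokens, (1, 2))  # 1-grams and 2-grams
bigrams = word_ngrams(tokens, (2, 2))          # 2-grams only

print(unigrams)
print(uni_and_bigrams)
print(bigrams)
```

Note that 2-grams keep neighbouring words together, so a phrase such as "not as" survives as a single feature.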


In [8]:
count_vectorizer = CountVectorizer(ngram_range=(1,2))
# learn a vocabulary dictionary of all tokens in the text_documents
count_vectorizer.fit(text_documents)

print(f'Features: {count_vectorizer.get_feature_names_out()}\n')

Features: ['advertised' 'as' 'as advertised' 'food' 'good' 'good quality' 'great'
 'great taffy' 'not' 'not as' 'quality' 'quality food' 'taffy']



In [9]:
# convert text data to numeric vectors
matrix = count_vectorizer.transform(text_documents)

# summarise vector details
print(f'Result matrix size: {matrix.shape}')
print(f'Result matrix:\n {matrix.toarray()}\n')

# for visualisation purpose, let's convert the features into a dataframe
pd.DataFrame(matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

Result matrix size: (3, 13)
Result matrix:
 [[0 0 0 1 1 1 0 0 0 0 1 1 0]
 [1 1 1 0 0 0 0 0 1 1 0 0 0]
 [0 0 0 0 0 0 1 1 0 0 0 0 1]]



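| | advertised | as | as advertised | food | good | good quality | great | great taffy | not | not as | quality | quality food | taffy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |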


**n-grams with stop word removal:**

In [10]:
count_vectorizer = CountVectorizer(ngram_range=(1,2), stop_words=stop_words)
# learn a vocabulary dictionary of all tokens in the text_documents
count_vectorizer.fit(text_documents)

print(f'Features: {count_vectorizer.get_feature_names_out()}\n')

Features: ['advertised' 'food' 'good' 'good quality' 'great' 'great taffy' 'quality'
 'quality food' 'taffy']



In [11]:
# convert text data to numeric vectors
matrix = count_vectorizer.transform(text_documents)

# summarise vector details
print(f'Result matrix size: {matrix.shape}')
print(f'Result matrix:\n {matrix.toarray()}\n')

# for visualisation purpose, let's convert the features into a dataframe
pd.DataFrame(matrix.toarray(), columns=count_vectorizer.get_feature_names_out())

Result matrix size: (3, 9)
Result matrix:
 [[0 1 1 1 0 0 1 1 0]
 [1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 1 0 0 1]]



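| | advertised | food | good | good quality | great | great taffy | quality | quality food | taffy |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |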
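
The features above are consistent with stop words being dropped before the n-grams are built: the second review keeps only 'advertised' and produces no 2-grams at all. Below is a rough sketch of that order of operations, assuming whitespace tokenisation. The helper `demo_features` and the two-word list `stop_words_demo` are hypothetical; both words also appear in the NLTK list printed earlier.

```python
# Hypothetical two-word stop list, kept separate from the NLTK stop_words variable
stop_words_demo = {"not", "as"}

def demo_features(text, stop_words_to_remove):
    """Remove stop words first, then build 1-grams and 2-grams from what is left."""
    tokens = [t for t in text.lower().split() if t not in stop_words_to_remove]
    two_grams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + two_grams

with_stop_words = demo_features("not as advertised", set())
without_stop_words = demo_features("not as advertised", stop_words_demo)

print(with_stop_words)
print(without_stop_words)
```

Once 'not' and 'as' are removed, no neighbouring pair is left to form a 2-grams feature, so adding n-grams cannot bring the negation back.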


### <font color='green'>**Activity 3**</font>

a) Generate a BoW model using <i>train_text_documents</i> given below. Convert all tokens to lowercase, remove stop words and use both 1-grams and 2-grams to build the model.  
b) Using the built model, convert <i>test_text_documents</i> into numeric vectors.

In [12]:
train_text_documents = ["The quick brown fox jumped over the lazy dog.",
                        "The cat played with the dog."]

test_text_documents = ["The dog saw the donkey."]

# create an instance of the CountVectorizer


# learn a vocabulary dictionary of all tokens in train_text_documents


# convert test_text_documents to numeric vectors


# print output vectors



## Token weighting with TfidfVectorizer

Let's define more detailed sample texts to use with the following examples. <br>


In [13]:
text_documents = ["the quality of food is very good",
                  "the best hot sauce in the world",
                  "this is the best instant oatmeal"]

<b> <u>Understanding TfidfVectorizer arguments</u> </b>
- lowercase (boolean, default=True) - If True, convert all characters to lowercase before tokenizing.
- token_pattern (string, default=r"(?u)\b\w\w+\b") - Regular expression denoting what constitutes a “token”. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
- stop_words (list, default=None) - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- ngram_range (tuple, default=(1, 1)) - The lower and upper boundary of the range of n-values for different word n-grams.

Please refer to the [TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html?highlight=tfidfvectorizer#sklearn.feature_extraction.text.TfidfVectorizer) for more details.

As you can see below, TfidfVectorizer follows the same steps as CountVectorizer: first fit to learn the vocabulary, then transform to produce the vectors.

In [14]:
tfidf_vectorizer = TfidfVectorizer()
# learn a vocabulary dictionary of all tokens in the text_documents
tfidf_vectorizer.fit(text_documents)

print(f'Features: {tfidf_vectorizer.get_feature_names_out()}\n')

Features: ['best' 'food' 'good' 'hot' 'in' 'instant' 'is' 'oatmeal' 'of' 'quality'
 'sauce' 'the' 'this' 'very' 'world']



In [15]:
# convert text data to numeric vectors
matrix = tfidf_vectorizer.transform(text_documents)

# for visualisation purpose, let's convert the features into a dataframe
pd.DataFrame(matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

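| | best | food | good | hot | in | instant | is | oatmeal | of | quality | sauce | the | this | very | world |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.410747 | 0.410747 | 0.0 | 0.0 | 0.0 | 0.312384 | 0.0 | 0.410747 | 0.410747 | 0.0 | 0.242594 | 0.0 | 0.410747 | 0.0 |
| 1 | 0.311166 | 0.0 | 0.0 | 0.409146 | 0.409146 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.409146 | 0.483296 | 0.0 | 0.0 | 0.409146 |
| 2 | 0.358291 | 0.0 | 0.0 | 0.0 | 0.0 | 0.47111 | 0.358291 | 0.47111 | 0.0 | 0.0 | 0.0 | 0.278245 | 0.47111 | 0.0 | 0.0 |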
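
To see where these weights come from, the calculation can be repeated by hand. The sketch below assumes the default TfidfVectorizer settings (smoothed idf, i.e. `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row) and the default token pattern. The variable names are made up for this illustration, and the code is not the library's implementation.

```python
import math
import re
from collections import Counter

docs = ["the quality of food is very good",
        "the best hot sauce in the world",
        "this is the best instant oatmeal"]

# Tokenise with the default token pattern listed earlier
tokenised = [re.findall(r"(?u)\b\w\w+\b", doc.lower()) for doc in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents each token appears
doc_freq = Counter(token for tokens in tokenised for token in set(tokens))

# Smoothed inverse document frequency (assumed to match the library defaults)
idf = {token: math.log((1 + n_docs) / (1 + count)) + 1
       for token, count in doc_freq.items()}

tfidf_rows = []
for tokens in tokenised:
    term_freq = Counter(tokens)
    raw = {token: term_freq[token] * idf[token] for token in term_freq}
    norm = math.sqrt(sum(value * value for value in raw.values()))  # L2 norm of the row
    tfidf_rows.append({token: value / norm for token, value in raw.items()})

print(round(tfidf_rows[0]["the"], 6), round(tfidf_rows[1]["the"], 6))
```

The two printed values should come out close to the weights of 'the' in rows 0 and 1 of the table above. 'the' appears in every document, so its idf is the lowest, and it only receives a larger weight in row 1 because it occurs twice there.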


Let's generate vectors using only 2-grams, with stop words removed.

In [16]:
tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(2,2))

# learn a vocabulary dictionary of all tokens in the text_documents
tfidf_vectorizer.fit(text_documents)

print(f'Features: {tfidf_vectorizer.get_feature_names_out()}\n')

Features: ['best hot' 'best instant' 'food good' 'hot sauce' 'instant oatmeal'
 'quality food' 'sauce world']



In [17]:
# convert text data to numeric vectors
matrix = tfidf_vectorizer.transform(text_documents)

# for visualisation purpose, let's convert the features into a dataframe
pd.DataFrame(matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

| | best hot | best instant | food good | hot sauce | instant oatmeal | quality food | sauce world |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 |
| 1 | 0.57735 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.57735 |
| 2 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 |
