

# **MIS710 Extra Explanation of text vectorisation**
Author: Associate Professor Lemai Nguyen

Objectives:
1. To explain key concepts related to vectorisation
2. To visualise vector representation of text data

Note: This is a supplementary explanation. Please read the lecture slides and watch the class recordings for full details.



**Key concept:** Vectorisation is the process of converting text data into numerical form so that machine learning algorithms can understand and process it. Think of it like translating a language the computer doesn't understand (text) into a language it does (numbers).

The purpose of this supplementary material is to help you 'visualise' the concept of vectorisation and what vectors 'look' like.

# **1. Import Libraries and Functions**

Import fundamental libraries and functions

In [81]:
# import libraries
import pandas as pd #for data manipulation and analysis
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# **2. Load data**

For simplicity, let's load a small dataset I collected, curated, and labelled for demonstration purposes.

In [82]:
records = pd.read_csv('https://raw.githubusercontent.com/VanLan0/MIS710-ML/main/Datasets/DrCha_reviews.csv')
print(records)

                                               review sentiment
0   I just watched the first episode and it’s 10/1...  positive
1   this is going to be absolutely amazing I can't...  positive
2   Accidentally watched the first episode, it was...  positive
3                          This actress is hilarious!  positive
4                             Looks very entertaining  positive
5                         annoying unnecesarry  scene  negative
6                           I love this drama so much  positive
7   Too convoluted and over dramatic. the writers ...  negative
8   Really good to see Km Byung Chul in a more com...  positive
9                              disappointed, too long  negative
10  I'm addicted to this movie, but If she doesn't...  positive
11      I love it. Can't wait to see the next episode  positive
12  The movie really disappoint me, his mother wil...  negative
13  Loved the first 2 episodes and excited for the...  positive


Text vectorisation produces a large sparse matrix, so for demonstration purposes let's keep only a few short reviews that are easy to inspect.

I'll explain what a sparse matrix is later. Let's keep things simple for now :-)

In [83]:
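# .copy() makes an independent DataFrame, so later column assignments do not trigger a SettingWithCopyWarning
records = records.iloc[3:7].copy()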
print(records)

                        review sentiment
3   This actress is hilarious!  positive
4      Looks very entertaining  positive
5  annoying unnecesarry  scene  negative
6    I love this drama so much  positive


# **2. Text pre-processing**
* Text cleaning: Remove any HTML tags, punctuation, and special characters from the reviews.
* Lowercasing: Convert all text to lowercase to ensure that words with the same spelling but different capitalization are treated as the same word.
* Tokenization: Split the text into individual words or tokens.
* Stopword removal: Remove common words such as "the", "and", "a", etc. that do not add much meaning to the text.
* Stemming/Lemmatization: Reduce words to their base form to normalize the text and group together words with similar meanings.

re is Python's built-in module for working with regular expressions. A regular expression is a pattern, written with ordinary characters and special characters (metacharacters), that describes the text to match. For example, the . character in a regular expression matches any single character, while the * character matches zero or more occurrences of the preceding character.

## **Text cleaning**

### **Irrelevant data removal**

In [84]:
#import the Python module re to work with regular expressions
import re

The re.sub() function takes three main arguments:

1. The regular expression pattern to search for.
2. The replacement string to use for each match of the pattern.
3. The input string to perform the substitution on.

Let's define our clean_text function, which cleans a given text and returns the cleaned version of it.

In [85]:
# Define function to clean text
def clean_text(text):
  # Remove HTML tags
  text = re.sub(r'<.*?>', '', text)
  # Remove punctuation and special characters
  text = re.sub(r'[^\w\s]', '', text)
  # Remove extra whitespace
  text = re.sub(r'\s+', ' ', text).strip()
  return text

* The `.` (dot) metacharacter matches any single character except a newline.
* The `*` (asterisk) metacharacter matches zero or more occurrences of the preceding character or group.
* A `?` placed directly after `*` makes the match non-greedy, so `<.*?>` stops at the first `>` instead of the last one.
* `[^...]` matches any single character that is not listed inside the brackets, so `[^\w\s]` matches anything that is neither a word character nor whitespace.
* `\w` matches any word character: letters (upper- and lowercase), digits, and the underscore.
* `\s` matches any whitespace character, including spaces, tabs, and newlines.
* `\s+` matches one or more consecutive whitespace characters. The `+` (plus) metacharacter means "one or more occurrences of the preceding pattern".
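The three substitutions in clean_text can be traced one step at a time. The snippet below is a minimal sketch on a made-up input string (raw is hypothetical, not taken from the dataset); it also shows what happens if the HTML pattern is written greedily.

```python
import re

# Hypothetical input containing HTML tags, repeated spaces, and punctuation
raw = "<p>This   actress is <b>hilarious</b>!!!</p>"

# Step 1: remove each HTML tag (non-greedy match)
no_tags = re.sub(r'<.*?>', '', raw)

# Step 2: remove every character that is neither a word character nor whitespace
no_punct = re.sub(r'[^\w\s]', '', no_tags)

# Step 3: collapse runs of whitespace into one space and trim both ends
cleaned = re.sub(r'\s+', ' ', no_punct).strip()

print(repr(no_tags))   # 'This   actress is hilarious!!!'
print(repr(no_punct))  # 'This   actress is hilarious'
print(repr(cleaned))   # 'This actress is hilarious'

# For comparison: a greedy pattern matches from the first '<' to the last '>'
greedy = re.sub(r'<.*>', '', raw)
print(repr(greedy))    # ''
```

Because this input starts with a tag and ends with one, the greedy version deletes the entire string, which is why the non-greedy form is used for tag removal.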

Now that we have defined the clean_text function, let's apply it to all the reviews.

The pandas apply() function applies a function to each value of a column (or along an axis of a DataFrame). Calling it on the ['review'] column runs clean_text on every review in that column.
https://www.w3schools.com/python/pandas/ref_df_apply.asp

In [86]:
#let's get reviews to run clean_text
records['review']= records['review'].apply(clean_text)

Note: pandas prints a SettingWithCopyWarning at this step when records is a slice of the original DataFrame rather than an independent copy (see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy). The assignment still takes effect here, as the next cell shows; taking the slice with .copy() avoids the warning.


In [87]:
#You can check to have a feel:
records['review']

3     This actress is hilarious
4       Looks very entertaining
5    annoying unnecesarry scene
6     I love this drama so much
Name: review, dtype: object

### **Lowercasing**

In [88]:
#define the lowercasing function to turn all letters into lowercases.
def lowercasing(text):
  # Convert to lowercase
  text = text.lower()
  return text

In [89]:
# let's get reviews to run lowercasing
records['review']= records['review'].apply(lowercasing)



In [90]:
#You can check to have a feel:
records['review']

3     this actress is hilarious
4       looks very entertaining
5    annoying unnecesarry scene
6     i love this drama so much
Name: review, dtype: object

## **Linguistic preprocessing**

#### **Tokenization, Stop word removal, and Lemmatization**

* Tokenization breaks the text into individual words or tokens, and then stemming/lemmatization is applied on each token individually to obtain their root form.

* Stemming strips word endings with simple rules to reduce words to a root form, which reduces the number of unique words and therefore the size of the vocabulary. For example: studies, study, studied, studying -> studi. The stem need not be a real word, and irregular forms are not resolved: flew stays flew. Import PorterStemmer for stemming.

* Lemmatization reduces a word to its base or dictionary form, which can improve the accuracy of the analysis. For example, when each word is treated as a verb: studies, study, studied, studying -> study; flew -> fly. Import WordNetLemmatizer for lemmatization.

NLTK (Natural Language Toolkit) is a Python library used for natural language processing (NLP) tasks such as tokenization, stemming, lemmatization, part-of-speech tagging, and more.
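The stemming behaviour described above can be checked directly with NLTK's PorterStemmer, which needs no extra downloads. This is a small illustrative sketch; the word list is chosen only for demonstration, and lemmatization is left out because it requires the WordNet data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Regular inflections of "study" plus one irregular verb form
words = ['studies', 'study', 'studied', 'studying', 'flew']

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

The four forms of "study" collapse to the same stem, while the irregular form "flew" is returned unchanged, which is the kind of case where a lemmatizer with the right part-of-speech tag does better.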

In [91]:
#import NLTK and functions for Tokenization, Stemming and Lemmatization
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

#import NLTK stopwords from its corpus
from nltk.corpus import stopwords

Punkt and WordNet are data resources that NLTK downloads separately from the library itself.

* Punkt is a pre-trained sentence tokenizer that word_tokenize relies on. It was trained on large corpora with an unsupervised algorithm, so it can find sentence boundaries even in challenging text (for example, around abbreviations).

* WordNet is a large lexical database of English words, which WordNetLemmatizer uses for lemmatization.

In [92]:
# Newer NLTK releases may additionally need: nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [93]:
# define the stop words to remove, keeping negation words in the text
stop_words = set(stopwords.words('english'))
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}
# set difference: every stop word except the negation words
filtered_words = stop_words - negation_words
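To see why the negation words are set aside, here is a minimal sketch with a hypothetical six-word stop-word list (toy_stop_words is an assumption standing in for NLTK's much longer English list) and plain whitespace splitting instead of word_tokenize.

```python
# Hypothetical mini stop-word list, standing in for NLTK's English list
toy_stop_words = {'this', 'is', 'a', 'the', 'not', 'no'}
negation_words = {'no', 'not', 'nor', 'neither', 'none', 'never'}

# Every stop word except the negation words
toy_filtered_words = toy_stop_words - negation_words

review = 'this is not a good drama'

# Removing only the filtered list keeps "not" in the text
kept = [token for token in review.split() if token not in toy_filtered_words]
print(kept)              # ['not', 'good', 'drama']

# Removing the full stop-word list drops "not" and flips the apparent sentiment
dropped_negation = [token for token in review.split() if token not in toy_stop_words]
print(dropped_negation)  # ['good', 'drama']
```

For sentiment analysis, losing "not" turns a negative review into one that looks positive, so keeping negations is usually the safer choice.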

Stop words can be removed before, during, or after lemmatization, depending on the specific use case, the nature of the text data, and the desired outcome. Here we combine the two steps into one for efficiency.

Below, we lemmatize only the tokens that are not in filtered_words.

In [94]:
#define a function to perform tokenization, stop word removal, and lemmatization
def tokenize_lemmatize(text):
  #tokenization
  tokens = nltk.word_tokenize(text)

  #initialize the lemmatizer
  lemmatizer = WordNetLemmatizer()

  # Filter out the stop words (negation words are kept) and lemmatize the remaining tokens
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in filtered_words]
  return ' '.join(lemmatized_tokens)


**Alternatively**, you can use stemming


```
#define a function to perform tokenization and stemming
def tokenize_stem(text):
  #tokenization
  tokens = nltk.word_tokenize(text.lower())
  
  #initialize stemmer
  stemmer = PorterStemmer()

  # Filter out the stop words and stem the remaining tokens
  stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in filtered_words]

  return ' '.join(stemmed_tokens)
```



In [95]:
# Tokenize and lemmatize the review text
processed_text = records['review'].apply(tokenize_lemmatize)

In [96]:
#now you can see the effect:
processed_text

3             actress hilarious
4             look entertaining
5    annoying unnecesarry scene
6               love drama much
Name: review, dtype: object

**Note**: The WordNetLemmatizer from NLTK relies on the WordNet database. By default, it treats every word as a noun (pos='n'), so verb forms such as gerunds are left unchanged unless we pass the matching part-of-speech (POS) tag. Doing this for every token requires an extra POS-tagging step, which is not covered here.

# **3. Text representation**
Text **representation** involves the process of **vectorisation**, wherein textual data is transformed into numerical form for machine learning algorithms.

**Preventing vocabulary leakage**

Note: If we vectorise the whole dataset before splitting it into training and test sets, there is a risk of data leakage, because the vocabulary and the IDF weights would be learned from the whole dataset, including the test set.

To prevent this, we split the dataset first, fit the vectorizer on the training data only, and then use the fitted vectorizer to transform both the training and test data. This is similar to how we scaled datasets in previous labs.

For the purpose of demonstrating the vectorisation concept, we skip the data split here.

#### **Feature Selection**

In [97]:
X=processed_text


In [98]:
X

3             actress hilarious
4             look entertaining
5    annoying unnecesarry scene
6               love drama much
Name: review, dtype: object

### **Vectorisation**

There are various methods to vectorise text, turning it into a numerical representation suitable for machine learning tasks. In this context, we're using TF-IDF vectorisation. TF-IDF stands for **Term Frequency-Inverse Document Frequency**. With TF-IDF, the importance of each word is determined based on its frequency in a document (e.g., a review) compared to its prevalence in the entire dataset (i.e., all reviews). Words that appear frequently in one document (or review) but are rare across the entire dataset receive higher weights, indicating their significance in characterising that particular document. For instance, this can highlight someone expressing a strong view.

In [99]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Let's use 1-grams first
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))

In [100]:
X_tfidf = tfidf_vectorizer.fit_transform(X)


## **Sparse matrix**

A 'sparse matrix' is a matrix where the majority of its values are zeros (or default values). The outcome of text vectorisation often results in such a sparse matrix because not every document (or review) will contain every word from the vocabulary (unique tokens or words).

To view the matrix (one row vector per review, one column per token), let's turn it into a dense dataframe. A dense dataframe refers to a tabular representation where all data is explicitly stored, as opposed to a sparse format where only non-default values (like non-zero entries) are specifically represented.

Each review is represented as a vector consisting of TF-IDF values for each token.
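How sparse the matrix is can be measured directly. The sketch below hard-codes the four pre-processed reviews so that it runs on its own, and uses a separate variable name (matrix) to avoid touching the notebook's own variables.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The four pre-processed reviews, hard-coded so the snippet is self-contained
reviews = ['actress hilarious', 'look entertaining',
           'annoying unnecesarry scene', 'love drama much']

matrix = TfidfVectorizer().fit_transform(reviews)

n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)   # share of cells that hold a non-zero value

print(type(matrix).__name__)                    # a SciPy compressed sparse row type
print(matrix.shape)                             # (4, 10)
print(matrix.nnz)                               # 10
print(f'{density:.0%} of cells are non-zero')   # 25% of cells are non-zero
```

Even with four tiny reviews, three quarters of the cells are zero; with thousands of reviews and a vocabulary of tens of thousands of tokens, the share of zeros is typically far higher, which is why only the non-zero entries are stored.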

In [101]:
# Convert the result to a dense DataFrame so that we can view the vectors
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

df_tfidf


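| | actress | annoying | drama | entertaining | hilarious | look | love | much | scene | unnecesarry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 |
| 3 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.0 | 0.0 |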
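The values 0.707107 and 0.57735 in this table come from normalisation rather than from differences in word importance. In this tiny dataset every token occurs once, in one review only, so all non-zero weights within a row start out equal; by default TfidfVectorizer then scales each row to unit Euclidean length (norm='l2'). A minimal sketch of that last step, using rows of ones as stand-ins for the equal raw weights:

```python
import numpy as np

# Stand-ins for rows whose non-zero raw weights are all equal
two_token_row = np.ones(2)     # e.g. a review with two remaining tokens
three_token_row = np.ones(3)   # e.g. a review with three remaining tokens

# L2 normalisation: divide each weight by the row's Euclidean length
two_normalised = two_token_row / np.linalg.norm(two_token_row)
three_normalised = three_token_row / np.linalg.norm(three_token_row)

print(two_normalised)    # each entry is 1/sqrt(2), about 0.707107
print(three_normalised)  # each entry is 1/sqrt(3), about 0.57735
```

So a two-token review gets 0.707107 per token and a three-token review gets 0.57735 per token, matching the table.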


In [102]:
df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   actress       4 non-null      float64
 1   annoying      4 non-null      float64
 2   drama         4 non-null      float64
 3   entertaining  4 non-null      float64
 4   hilarious     4 non-null      float64
 5   look          4 non-null      float64
 6   love          4 non-null      float64
 7   much          4 non-null      float64
 8   scene         4 non-null      float64
 9   unnecesarry   4 non-null      float64
dtypes: float64(10)
memory usage: 448.0 bytes


In [103]:
tfidf_vectorizer.vocabulary_

{'actress': 0,
 'hilarious': 4,
 'look': 5,
 'entertaining': 3,
 'annoying': 1,
 'unnecesarry': 9,
 'scene': 8,
 'love': 6,
 'drama': 2,
 'much': 7}

In [104]:
from sklearn.feature_extraction.text import TfidfVectorizer
#Let's have 1-grams and 2-grams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))

In [105]:
X_tfidf = tfidf_vectorizer.fit_transform(X)


In [106]:
# Convert the result to a dense DataFrame
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

df_tfidf


| | actress | actress hilarious | annoying | annoying unnecesarry | drama | drama much | entertaining | hilarious | look | look entertaining | love | love drama | much | scene | unnecesarry | unnecesarry scene |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.57735 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.447214 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.0 | 0.447214 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.0 |


In [107]:
df_tfidf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   actress               4 non-null      float64
 1   actress hilarious     4 non-null      float64
 2   annoying              4 non-null      float64
 3   annoying unnecesarry  4 non-null      float64
 4   drama                 4 non-null      float64
 5   drama much            4 non-null      float64
 6   entertaining          4 non-null      float64
 7   hilarious             4 non-null      float64
 8   look                  4 non-null      float64
 9   look entertaining     4 non-null      float64
 10  love                  4 non-null      float64
 11  love drama            4 non-null      float64
 12  much                  4 non-null      float64
 13  scene                 4 non-null      float64
 14  unnecesarry           4 non-null      float64
 15  unnecesarry scene     4 non-null      float64
dtypes: float64(16)

In [108]:
tfidf_vectorizer.vocabulary_

{'actress': 0,
 'hilarious': 7,
 'actress hilarious': 1,
 'look': 8,
 'entertaining': 6,
 'look entertaining': 9,
 'annoying': 2,
 'unnecesarry': 14,
 'scene': 13,
 'annoying unnecesarry': 3,
 'unnecesarry scene': 15,
 'love': 10,
 'drama': 4,
 'much': 12,
 'love drama': 11,
 'drama much': 5}
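The extra entries in this vocabulary are 2-grams: pairs of adjacent tokens. One way to see which features a single review contributes is to call the vectorizer's analyzer directly; the sketch below does this for one of the pre-processed reviews, using a separate variable name so the notebook's tfidf_vectorizer is untouched.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn text into features
analyzer = TfidfVectorizer(ngram_range=(1, 2)).build_analyzer()

features = analyzer('love drama much')
print(features)
# ['love', 'drama', 'much', 'love drama', 'drama much']
```

With ngram_range=(1, 2), a review of three tokens yields three 1-grams and two 2-grams, which is why the vocabulary grows from 10 to 16 features for these four reviews.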

# **Final Words:**

I hope this has provided you with a clearer understanding of text data vectorisation. Remember, TF-IDF is just one of many methods used for this purpose. There are also other more advanced techniques such as word embeddings (like Word2Vec and GloVe) which can capture deeper semantic meanings. Systems like ChatGPT utilise semantic and contextual embeddings for vectorisation.

In a similar way to text vectorisation, we can also convert other types of data, such as images and audio, into numerical form. This transformation enables machine learning algorithms to process and analyse them.

Keep learning and enjoy!