<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Examples.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Examples: Text feature extraction
© ExploreAI Academy

In this notebook, we will delve into text feature extraction techniques, focusing on the bag-of-words model and n-grams. We'll explore how to transform text data into feature sets usable by classifiers, particularly using the NLTK library. The bag-of-words model simplifies text into word presence features, while n-grams capture combinations of words to extract deeper meaning from text. 

## Learning Objectives

By the end of this notebook, you should be able to:
* Understand the bag-of-words model and its role in text feature extraction.
* Implement the bag-of-words model to transform text data into feature sets.
* Explain the concept of n-grams and their significance in capturing combinations of words.
* Use n-grams to extract contextual information from text data.
* Fine-tune CountVectorizer parameters for optimal text feature extraction.


Before we get started, let's get the data and the libraries we will be using.

In [1]:
import ssl

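# Disable HTTPS certificate verification so the downloads below work on machines
# without a usable CA certificate bundle (only do this in a sandbox environment)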
ssl._create_default_https_context = ssl._create_unverified_context

In [2]:
import nltk

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
import string

# set plot style
sns.set_theme()

In [None]:
nltk.download()
# or you can download directly, i.e.
nltk.download(['punkt','stopwords'])

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [None]:
from nltk.corpus import stopwords

Continuing with our `MBTI` dataset, let's read the data and clean it up a bit.

In [None]:
# Read the MBTI dataset
mbti = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/mbti_train.csv')
mbti.head()

In [None]:
# Separate each post in the 'posts' column into its own row
all_mbti = []
for i, row in mbti.iterrows():
    for post in row['posts'].split('|||'):
        all_mbti.append([row['type'], post])
all_mbti = pd.DataFrame(all_mbti, columns=['type', 'post'])

all_mbti

In [None]:
#Remove noise
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'
all_mbti['post'] = all_mbti['post'].replace(to_replace = pattern_url, value = subs_url, regex = True)
all_mbti['post'] = all_mbti['post'].str.lower()

#Remove punctuation
def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])

all_mbti['post'] = all_mbti['post'].apply(remove_punctuation)

# Tokenize the text using the TreebankWordTokenizer
tokeniser = TreebankWordTokenizer()
all_mbti['tokens'] = all_mbti['post'].apply(tokeniser.tokenize)
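
To see what these cleaning steps do to a single post, here is a minimal sketch on a made-up sentence. The sample text is an assumption, and `str.split()` stands in for `TreebankWordTokenizer` purely to keep the sketch free of dependencies; the regular expression is the one used above.

```python
import re
import string

# Same URL pattern and placeholder as in the cell above
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'

# Hypothetical post, made up for illustration
sample_post = "Check https://example.com/page NOW, it's great!"

step_1 = re.sub(pattern_url, subs_url, sample_post)  # replace URLs with a placeholder
step_2 = step_1.lower()                               # lowercase everything
step_3 = ''.join(ch for ch in step_2 if ch not in string.punctuation)  # strip punctuation
sample_tokens = step_3.split()                        # stand-in for the Treebank tokeniser

print(sample_tokens)
# ['check', 'urlweb', 'now', 'its', 'great']
```

Because punctuation is removed after the URL substitution, the placeholder `url-web` ends up as the token `urlweb`, which is the form that appears in the word counts later on.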


## Text feature extraction

### Bag of words

Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect `dict`-style feature sets, so we must transform our text into a Python dictionary object. The bag-of-words model is the simplest method: it ignores word order and builds a feature set from all the words in the text. Here we record the number of times each word appears, rather than only whether it is present.

In [None]:
def bag_of_words_count(words, word_dict=None):
    """ this function takes in a list of words and returns a dictionary 
        with each word as a key, and the value represents the number of 
        times that word appeared"""
    # Create a fresh dictionary per call; a mutable default ({}) would be
    # shared between calls and silently accumulate counts
    if word_dict is None:
        word_dict = {}
    for word in words:
        if word in word_dict:
            word_dict[word] += 1
        else:
            word_dict[word] = 1
    return word_dict
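
As a quick check of what such a function produces, the standard library's `collections.Counter` builds the same kind of word-to-count mapping. This is a sketch on a made-up token list, not a replacement for the function above.

```python
from collections import Counter

# Hypothetical token list, made up for illustration
toy_tokens = ['i', 'think', 'i', 'like', 'you']

# Counter maps each word to the number of times it appears
toy_counts = Counter(toy_tokens)

print(dict(toy_counts))
# {'i': 2, 'think': 1, 'like': 1, 'you': 1}
```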

Here we create a set of dictionaries, one for each of the MBTI types.

In [None]:
#Create a list of all the MBTI personality types that are present in the original dataset
type_labels = list(all_mbti.type.unique())

In [None]:
personality = {}
# Group once, outside the loop, rather than regrouping for every type
grouped = all_mbti.groupby('type')
for pp in type_labels:
    personality[pp] = {}
    for row in grouped.get_group(pp)['tokens']:
        personality[pp] = bag_of_words_count(row, personality[pp])

Next, we create a list of all of the unique words.

In [None]:
all_words = set()
for pp in type_labels:
    for word in personality[pp]:
        all_words.add(word)

This was done so that we can create a combined bag of words dictionary for all the words in the text.

In [None]:
personality['all'] = {}
for pp in type_labels:    
    for word in all_words:
        if word in personality[pp].keys():
            if word in personality['all']:
                personality['all'][word] += personality[pp][word]
            else:
                personality['all'][word] = personality[pp][word]
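
The merge above can also be sketched with `Counter` objects, which support addition. The per-type dictionaries here are made up for illustration.

```python
from collections import Counter

# Hypothetical per-type word counts
toy_personality = {
    'INFP': {'i': 3, 'feel': 2},
    'ENTP': {'you': 4, 'i': 1},
}

# Adding Counters sums the counts key by key
toy_all = sum((Counter(bow) for bow in toy_personality.values()), Counter())

# Total number of word occurrences across all types
toy_total_words = sum(toy_all.values())

print(dict(toy_all), toy_total_words)
```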

Now we can easily calculate how many words there are in total.

In [None]:
total_words = sum([v for v in personality['all'].values()])
total_words

Let's take a look at the distribution of words which occur fewer than 10 times in the whole dataset.

In [None]:
_ = plt.hist([v for v in personality['all'].values() if v < 10],bins=10)
plt.ylabel("# of words")
plt.xlabel("word frequency")

There are a lot of words that only appear once! We'll print out that value here.

In [None]:
len([v for v in personality['all'].values() if v == 1])

What kind of words do you think would appear once? Let's print out a few of these rare words.

In [None]:
rare_words = [k for k, v in personality['all'].items() if v==1] 
print(rare_words[:100])

As you can see, some of these words don't make sense, but before we decide to remove them, let's see how much data we'll be left with.

In [None]:
# how many distinct words appear at least 10 times?
# how many of the total word occurrences does that account for?
print(len([v for v in personality['all'].values() if v >= 10]))
occurs_more_than_10_times = sum([v for v in personality['all'].values() if v >= 10])
print(occurs_more_than_10_times)

In [None]:
occurs_more_than_10_times/total_words

Using words that appear at least 10 times seems much more useful, and they account for 97% of all word occurrences!

Finally, let's remove all words that occur fewer than 10 times.

In [None]:
min_count = 10
remaining_word_index = [k for k, v in personality['all'].items() if v >= min_count]

### Hypothesis testing
Remember our hypothesis from earlier?

- Introverts tend to use the word `I` more than extroverts
- Conversely, extroverts tend to favour the word `you`

Let's see if we finally have what we need to test it out. We'll first create one big dataframe with the word counts by personality profile (this may take a while).

In [None]:
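hm = []
# A set makes the membership test below much faster than scanning a list
words_to_keep = set(remaining_word_index)
for p, p_bow in personality.items():
    df_bow = pd.DataFrame([(k, v) for k, v in p_bow.items() if k in words_to_keep], columns=['Word', p])
    df_bow.set_index('Word', inplace=True)
    hm.append(df_bow)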

# create one big dataframe
df_bow = pd.concat(hm, axis=1)
df_bow.fillna(0, inplace=True)

What are the top 10 words which appear most often?

In [None]:
df_bow.sort_values(by='all', ascending=False).head(10)

This isn't very helpful at all, is it? It's very difficult to extract insights from this data.  Let's see if we can use the $\chi^2$ test to see whether introverts favour the word **`I`**. 

The $\chi^2$ test looks at observed versus expected results and lets us know where the greatest differences from expected values are.  The bigger the statistic, the greater the difference from expectation.  The formula is 

$$\chi^2 = \sum{\frac{(\text{observed} - \text{expected})^2}{\text{expected}}}$$

The $\chi^2$ test will compare the **observed frequencies** of word usage by **introverts** to the **expected frequencies** based on the overall population and indicate the extent of this difference for each word.

Using the $\chi^2$ statistic rather than simply comparing the observed percentages (i.e. `I_perc`) means that we consider both the observed frequencies (word usage by introverts) and the expected frequencies (the overall population's word usage) for each word, and scale each squared difference by the expected frequency. This helps us judge whether a difference is large relative to how common the word is. Note that a formal significance test would compute the statistic on raw counts rather than proportions, so that the sample size is taken into account.
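
To make the formula concrete, here is a minimal sketch of the per-word calculation on made-up counts. All numbers and names are assumptions for illustration and are not taken from the MBTI data.

```python
# Hypothetical word counts, made up for illustration
toy_all_counts = {'i': 50, 'you': 30, 'the': 20}     # whole population
toy_group_counts = {'i': 30, 'you': 10, 'the': 10}   # one group, e.g. introverts

all_total = sum(toy_all_counts.values())
group_total = sum(toy_group_counts.values())

toy_chi2 = {}
for word, all_count in toy_all_counts.items():
    expected = all_count / all_total                  # share of the word overall
    observed = toy_group_counts[word] / group_total   # share of the word within the group
    toy_chi2[word] = (observed - expected) ** 2 / expected

for word, value in toy_chi2.items():
    print(word, round(value, 4))
```

Both `i` and `you` differ from expectation by the same 0.1, but `you` receives the larger score because it is rarer overall. The statistic on its own does not say whether a word is over-used or under-used, so the two percentages still need to be compared to tell the direction.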

We'll do this first by extracting introvert types only from all the personality types.

In [None]:
intro_types_i = [p for p in type_labels if p[0] == 'I']

Next, we'll create an introvert total word count column, which sums the counts of all introvert columns.

In [None]:
df_bow['I'] = df_bow[intro_types_i].sum(axis=1)

Now we'll calculate and add percentage columns.

In [None]:
for col in ['I', 'all']:
    df_bow[col+'_perc'] = df_bow[col] / df_bow[col].sum()

Print off the dataframe to view what we've done.

In [None]:
df_bow.sort_values(by='all', ascending=False).head(5)

In [None]:
# calculate chi2
df_bow['chi2_i'] = np.power((df_bow['I_perc'] - df_bow['all_perc']), 2) / df_bow['all_perc']

In [None]:
df_bow[['I_perc', 'all_perc', 'chi2_i']][df_bow['I_perc'] > df_bow['all_perc']].sort_values(by='chi2_i', ascending=False).head(10)

And there it is! What can we conclude from this?

Looking at the words with the highest chi-square values, we can see that "urlweb", "infp", "infj" and "i" all appear near the top. This indicates that these words are used more frequently by introverts than would be expected based on their overall occurrence in the dataset.

The word "i" ranks 9th among the 10 highest chi-square values, with a value of 0.000003, suggesting that introverts use it more often than its general frequency would predict.

Based on these findings, we can conclude that introverts use "i" more often than the population as a whole, which supports the hypothesis that introverts favour the word "I".

Let's now have a look at the words most used by extroverts, following the same process but for extrovert types.

In [None]:
# extract extrovert types only from all the personality types
extro_types_e = [p for p in type_labels if p[0] == 'E']
# create an extrovert total word count column, which sums the counts of all extrovert columns
df_bow['E'] = df_bow[extro_types_e].sum(axis=1)
# calculate and add percentage column for extroverts
df_bow['E_perc'] = df_bow['E'] / df_bow['E'].sum()
# calculate chi2 for extroverts
df_bow['chi2_e'] = np.power((df_bow['E_perc'] - df_bow['all_perc']), 2) / df_bow['all_perc']
df_bow[['E_perc', 'all_perc', 'chi2_e']][df_bow['E_perc'] > df_bow['all_perc']].sort_values(by='chi2_e', ascending=False).head(15)

Based on the chi-squared analysis, there is evidence to suggest that extroverts tend to use words like "enfp", "entp", "entps", and "enfps" as well as "you" more frequently compared to their overall usage. This supports our hypothesis.

### n-grams

While individual words do carry meaning, it is often the case that combinations of words change meanings of sentences entirely.  For example, what difference does removing the `not` from a sentence make?

Natural Language Processing is **not** easy!

n-grams are a method to extract combinations of words into features for model building.  The `n` in n-grams specifies the number of tokens to include.  For example, a 2-gram returns all the consecutive pairs of words in a sentence.
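
The sliding window behind n-grams can be sketched in plain Python. The helper name `toy_ngrams` is hypothetical and only illustrates the idea.

```python
def toy_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples (hypothetical helper)."""
    # zip over n shifted copies of the list; zip stops at the shortest copy
    return list(zip(*(tokens[i:] for i in range(n))))

toy_sentence = 'natural language processing is not easy'.split()
toy_bigrams = toy_ngrams(toy_sentence, 2)

print(toy_bigrams)
```

The pair `('not', 'easy')` keeps the negation attached to the word it modifies, which single-word features cannot do.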

In [None]:
from nltk.util import ngrams

In [None]:
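def word_grams(words, min_n=1, max_n=4):
    """Return all n-grams of `words` as space-joined strings,
       for n from min_n up to (but not including) max_n."""
    s = []
    for n in range(min_n, max_n):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s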

In [None]:
print(word_grams('one two three four'.split(' ')))

Let's combine consecutive words into groups of 2 using n-grams.

In [None]:
[x for x in ngrams(all_mbti.iloc[55555]['tokens'], 2)]

Now let's combine consecutive words into groups of 3 using n-grams.

In [None]:
[x for x in ngrams(all_mbti.iloc[55555]['tokens'], 3)]

## Now that we understand all of that, let's cheat!

**Praise be to Python...**

`sklearn` has a built-in text feature extraction class called [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) that will do all of that work in a couple of lines of code! It converts a collection of documents (rows of text) into a matrix of token counts.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Initialize CountVectorizer
vect = CountVectorizer()
# Fit the CountVectorizer on the preprocessed 'post' column
vect.fit(all_mbti['post'])

### Tuning the vectorizer

We have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune with examples on how to do so:

- **stop_words:** string 'english', list, or None (default)
    * If 'english', a built-in stop word list for English is used.
    * If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    * If None, no stop words will be used.

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english')

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [None]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [None]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)

### Guidelines for tuning CountVectorizer:

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.
- **Experiment**, and let the data tell you the best approach!

Finally, let's fit a tuned CountVectorizer to the MBTI data.

In [None]:
betterVect = CountVectorizer(stop_words='english', 
                             min_df=2, 
                             max_df=0.5, 
                             ngram_range=(1, 1))

In [None]:
betterVect.fit(all_mbti['post'])

After vectorization using `CountVectorizer`, we can view the transformed data as a matrix where each row represents a document (post in our case) and each column represents a unique word in the vocabulary. The cell values indicate the count of the corresponding word in each document.

It's essential to note that this process generates a very large dataset, potentially consuming significant memory on your machine.

Uncomment the code below if you would still want to view the vectorized data.

In [None]:
"""
# Transform the training data
vectorized_data = betterVect.transform(all_mbti['post'].iloc[:10000])

# Convert the sparse matrix to a dense array for easier viewing (optional)
dense_vectorized_data = vectorized_data.toarray()

# Create a DataFrame to display the vectorized data
vectorized_df = pd.DataFrame(dense_vectorized_data, columns=betterVect.get_feature_names_out())

# Display the vectorized DataFrame
print(vectorized_df)
"""

## Conclusion

In this notebook, we covered various techniques for cleaning text data and extracting features to use with machine learning models. We also demonstrated how scikit-learn's `CountVectorizer` can be used to extract features, transforming the text data into a matrix of numbers that can be fed into a machine learning model.


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>