Run this command in your terminal! It downloads stopwords for the nltk library.

python -c "import nltk; nltk.download('stopwords')"

Here we have some new imports! We will need unicodedata to standardize our text characters, re (the regex library) to remove special characters, and several tools from the nltk (natural language toolkit) library to tokenize our text, remove stopwords, and stem or lemmatize words.

In [1]:
import pandas as pd

import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jongarcia/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

We will work with an explanation of string theory from my website. It has special characters and long words, which will help us visualize the effects our operations have on text data!

In [3]:
data = "Advanced: String theory is a mathematical framework that proposes to be a theory of quantum gravity, seeking to reconcile general relativity (which describes gravity on a large scale) and quantum mechanics (which describes the behavior of particles at a microscopic level). It introduces the idea that the fundamental building blocks of the universe are not particles, but rather one-dimensional strings of energy. These strings can vibrate at different frequencies, giving rise to different types of particles and forces. String theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with, which are compactified or curled up into tiny sizes."

Let's start off by converting all text to lowercase.

In [4]:
data = data.lower()
data[:100]

'advanced: string theory is a mathematical framework that proposes to be a theory of quantum gravity,'

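Now we can normalize the text to strip accented characters. NFKD normalization splits each accented character into a base letter plus a combining mark, and encoding to ASCII with errors ignored then drops the marks along with any other non-ASCII characters. Our example contains no such characters, so the output below is unchanged.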

In [5]:
data = unicodedata.normalize('NFKD', data)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

data[:100]

'advanced: string theory is a mathematical framework that proposes to be a theory of quantum gravity,'
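Since the string theory text has no accented characters, here is a minimal sketch of the same normalization on a few made-up strings. The helper name strip_accents and the sample strings are assumptions for illustration and are not part of the original notebook.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD), then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('utf-8', 'ignore')


# Hypothetical sample strings, not taken from the string theory text
print(strip_accents('café naïve résumé'))  # accents are removed, base letters stay
print(strip_accents('ﬁne'))                # the 'ﬁ' ligature is expanded to 'fi'
print(strip_accents('straße'))             # 'ß' has no decomposition, so it is dropped
```

The last line shows that this step is lossy: characters with no ASCII equivalent disappear entirely, which may or may not be acceptable for a given corpus.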

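Let's use the regex library to remove every character that is not a lowercase letter, a digit, an apostrophe, or whitespace. Note that punctuation is deleted rather than replaced with a space, so a hyphenated word such as one-dimensional becomes onedimensional.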

In [6]:
data = re.sub(r"[^a-z0-9'\s]", "", data)
data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'
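To see exactly what this pattern keeps and what it drops, here is a small sketch on a made-up sentence. The helper name strip_special and the sample string are assumptions for illustration.

```python
import re

# Same pattern as above: matches anything that is NOT a lowercase letter,
# a digit, an apostrophe, or whitespace
REMOVE_PATTERN = r"[^a-z0-9'\s]"


def strip_special(text):
    """Delete every character matched by REMOVE_PATTERN."""
    return re.sub(REMOVE_PATTERN, "", text)


# Hypothetical sample string for illustration
print(strip_special("it's a one-dimensional string (100%)!"))
```

The apostrophe in it's survives, while the hyphen, parentheses, percent sign, and exclamation mark are deleted. If fused words like onedimensional are a problem for your analysis, one option is to substitute a space instead of an empty string for some characters.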

In [7]:
def basic_clean(data):
    # Convert the text to lowercase
    data = data.lower()
    
    # Normalize the text by removing any diacritical marks
    data = unicodedata.normalize('NFKD', data)\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    
    # Remove any characters that are not lowercase letters, digits, apostrophes, or whitespace
    data = re.sub(r"[^a-z0-9'\s]", "", data)
    
    # Return the cleaned data
    return data


In [8]:
data = basic_clean(data)
data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'

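We can create an instance of the ToktokTokenizer class and use it to tokenize our data. With return_str=True it returns a single string rather than a list of tokens. Because we already stripped the punctuation, the tokenizer has nothing left to separate, so the output looks the same.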

In [9]:
tokenizer = ToktokTokenizer()

data = tokenizer.tokenize(data, return_str=True)

data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'

Let's retrieve a list of stopwords from the nltk library. As you can see, the words in the stopwords list are common words we use in English, but they don't contribute much meaning to a sentence.

In [16]:
stopwords_list = stopwords.words('english')
stopwords_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [17]:
# extra_words = ['theory', 'framework']
# exclude_words = ['is', 'that']

extra_words = ['theory', 'framework']
exclude_words = ['me', 'my']

stopwords_list.extend(extra_words)

for word in exclude_words:
    if word in stopwords_list:
        stopwords_list.remove(word)


stopwords_list[:4], stopwords_list[-4:] 

(['i', 'myself', 'we', 'our'], ['wouldn', "wouldn't", 'theory', 'framework'])

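With a simple list comprehension, we can remove all words from our corpus that are found in the stopwords list. Note that the output below still contains 'theory' and 'framework': judging by the cell numbers, it appears to have been produced with the default list, before the extra words were added.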

In [11]:
words = [word for word in data.split() if word not in stopwords_list]
words[:10]

['advanced',
 'string',
 'theory',
 'mathematical',
 'framework',
 'proposes',
 'theory',
 'quantum',
 'gravity',
 'seeking']

We can use the .join() string method to recompile our words into a single string.

In [12]:
new_data = ' '.join(words)
new_data[:100]

'advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile gene'

In [13]:
def tokenize(data):
    # Initialize a tokenizer object
    tokenizer = ToktokTokenizer()

    # Tokenize the input data using the tokenizer object
    data = tokenizer.tokenize(data, return_str=True)

    # Return the processed data
    return data


In [14]:
data = tokenize(data)
data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'

In [15]:
def remove_stopwords(data):
    # Create a list of stopwords in English
    stopwords_list = stopwords.words('english')

    # Split the data into individual words and filter out stopwords
    words = [word for word in data.split() if word not in stopwords_list]
    
    # Join the filtered words back into a string
    data = ' '.join(words)
    return data

In [16]:
data = remove_stopwords(data)
data[:100]

'advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile gene'
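The remove_stopwords function above always uses the default English list, so the extra_words and exclude_words customization shown earlier is lost. Below is one possible sketch of a version that accepts both. The function name remove_stopwords_custom, its parameters, and the tiny sample_stopwords list are assumptions for illustration; in the notebook you would pass stopwords.words('english') as base_stopwords.

```python
def remove_stopwords_custom(data, base_stopwords, extra_words=(), exclude_words=()):
    """Filter stopwords out of a whitespace-separated string.

    base_stopwords: any iterable of words, e.g. stopwords.words('english')
    extra_words:    additional words to treat as stopwords
    exclude_words:  words to keep even if they are in the base list
    """
    # Building a new set gives fast membership checks and leaves the
    # caller's list unmodified
    stopword_set = (set(base_stopwords) | set(extra_words)) - set(exclude_words)
    words = [word for word in data.split() if word not in stopword_set]
    return ' '.join(words)


# Tiny hand-written stand-in for stopwords.words('english'), for illustration only
sample_stopwords = ['is', 'a', 'that', 'to', 'be', 'of']

text = ('string theory is a mathematical framework that proposes '
        'to be a theory of quantum gravity')

print(remove_stopwords_custom(text, sample_stopwords,
                              extra_words=['theory', 'framework']))
print(remove_stopwords_custom(text, sample_stopwords, exclude_words=['is']))
```

Using a set rather than extending and removing from the list in place also means that re-running the cell does not keep appending duplicate entries.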

We can create an instance of the PorterStemmer class and use it to stem our words. Stemming strips suffixes by rule, so many of the resulting stems (theori, graviti) are not found in the dictionary!

In [17]:
ps = nltk.porter.PorterStemmer()

stems = [ps.stem(word) for word in words]

stemmed_data = ' '.join(stems)

stemmed_data[:100]

'advanc string theori mathemat framework propos theori quantum graviti seek reconcil gener rel descri'

In [18]:
def stem(data):
    # Create an instance of the PorterStemmer class from the nltk library
    ps = nltk.porter.PorterStemmer()
    # Create a list of words from data
    words = data.split()
    # Apply stemming to each word in the input data
    stems = [ps.stem(word) for word in words]

    # Join the stemmed words into a single string with spaces in between
    stemmed_data = ' '.join(stems)

    # Return the stemmed data
    return stemmed_data


In [19]:
data = stem(data)
data[:100]

'advanc string theori mathemat framework propos theori quantum graviti seek reconcil gener rel descri'

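We can also use the WordNetLemmatizer. Called without a part-of-speech argument, lemmatize treats every word as a noun, which is why it hardly changes our text: words like proposes and seeking are left as they are.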

In [20]:
wnl = nltk.stem.WordNetLemmatizer()

lemmas = [wnl.lemmatize(word) for word in words]

lemmatized_data = ' '.join(lemmas)

lemmatized_data[:100]

'advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile gene'

In [21]:
def lemmatize(data):
    # Create an instance of WordNetLemmatizer
    wnl = nltk.stem.WordNetLemmatizer()
    
    # Create a list of words from data
    words = data.split()
    
    # Lemmatize each word in the input data
    lemmas = [wnl.lemmatize(word) for word in words]

    # Join the lemmatized words into a single string
    lemmatized_data = ' '.join(lemmas)

    # Return the lemmatized data
    return lemmatized_data


In [22]:
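# Note: data was already stemmed in the previous cell, so the lemmatizer
# receives stems here, not the original words
data = lemmatize(data)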
data[:100]

'advanc string theori mathemat framework propos theori quantum graviti seek reconcil gener rel descri'
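Because each function above takes a string and returns a string, the steps can be chained in an explicit order. The sketch below illustrates the idea with standard-library steps only; run_pipeline, strip_accents, strip_special, and the sample string are assumptions for illustration. In the notebook, the list of steps would hold basic_clean, tokenize, remove_stopwords, and then either stem or lemmatize.

```python
import re
import unicodedata
from functools import reduce


def strip_accents(text):
    """NFKD-normalize, then drop non-ASCII characters."""
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))


def strip_special(text):
    """Keep only lowercase letters, digits, apostrophes, and whitespace."""
    return re.sub(r"[^a-z0-9'\s]", "", text)


def run_pipeline(text, steps):
    """Apply each step to the result of the previous one, left to right."""
    return reduce(lambda current, step: step(current), steps, text)


sample = 'Advanced: String Theory'  # hypothetical input

# Lowercasing first: capital letters survive the regex as lowercase letters
print(run_pipeline(sample, [str.lower, strip_accents, strip_special]))

# Regex first: capital letters are deleted before they can be lowercased
print(run_pipeline(sample, [strip_special, str.lower]))
```

The two calls give different results, which is why the order of the cleaning steps matters.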