In [1]:
%autosave 0

Autosave disabled


In [2]:
#Run this once to download the stopwords corpus for the nltk library. The line below is Python, so it runs in a notebook cell or a Python session.

import nltk; nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lordvoldemort/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Here we have some new imports! We will need unicodedata to standardize our text characters, re (regex library) to remove special characters, and several tools from the nltk (natural language toolkit) library to tokenize our text, remove stopwords, and stem/lemmatize the remaining words.

In [3]:
import pandas as pd

import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

We will work with an explanation of string theory from my website. It has special characters and long words, which will help us visualize the effects our operations have on text data!

In [4]:
data = "Advanced: String theory is a mathematical framework that proposes to be a theory of quantum gravity, seeking to reconcile general relativity (which describes gravity on a large scale) and quantum mechanics (which describes the behavior of particles at a microscopic level). It introduces the idea that the fundamental building blocks of the universe are not particles, but rather one-dimensional strings of energy. These strings can vibrate at different frequencies, giving rise to different types of particles and forces. String theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with, which are compactified or curled up into tiny sizes."

Let's start off by converting all text to lowercase.

In [5]:
data = data.lower()
data[:100]

'advanced: string theory is a mathematical framework that proposes to be a theory of quantum gravity,'

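Now we can normalize the text to plain ASCII. NFKD normalization splits each accented character into a base letter plus a combining mark, and encoding to ASCII with errors ignored drops the marks along with any other non-ASCII characters. Our example is already plain ASCII, so we will not see a change here.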

In [6]:
data = unicodedata.normalize('NFKD', data)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

data[:100]

'advanced: string theory is a mathematical framework that proposes to be a theory of quantum gravity,'
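Since the string theory text contains no accented characters, the effect of this step is easier to see on a different input. Here is a minimal sketch using a hypothetical sample string that is not part of the original data:

```python
import unicodedata

# Hypothetical sample text; the accented characters are chosen for illustration.
sample = "café naïve résumé"

# NFKD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with errors='ignore' then drops the combining marks.
ascii_sample = (
    unicodedata.normalize("NFKD", sample)
    .encode("ascii", "ignore")
    .decode("utf-8", "ignore")
)

print(ascii_sample)
```

Keep in mind that characters with no ASCII base letter in their decomposition (for example ß or ø) are dropped entirely rather than converted.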

Let's use the regex library to substitute out all characters we don't want.

In [14]:
data = re.sub(r"[^a-z0-9'\s]", "", data)  #the caret means "not": any character that is not a-z, 0-9, an apostrophe, or whitespace is replaced with nothing
data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'
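To see exactly what this pattern keeps and removes, here is a small sketch on a hypothetical sample string (chosen to include a hyphen, an apostrophe, and punctuation; it is not part of the original data):

```python
import re

# Hypothetical sample text chosen to include punctuation, a hyphen, and an apostrophe.
sample = "one-dimensional strings (it's 100% fun)!"

# Keep only lowercase letters, digits, apostrophes, and whitespace.
cleaned = re.sub(r"[^a-z0-9'\s]", "", sample)

print(cleaned)
```

Note that the hyphen is deleted rather than replaced, so "one-dimensional" becomes "onedimensional". If that is not what you want, one option is to substitute a space instead of an empty string and then collapse repeated whitespace.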

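We can create an instance of the ToktokTokenizer object and use it to tokenize our data. With return_str=True, the tokens come back joined by spaces in a single string. Since we have already removed punctuation, the output looks the same here.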

In [8]:
tokenizer = ToktokTokenizer()

data = tokenizer.tokenize(data, return_str=True)

data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'

Let's retrieve a list of stopwords from the nltk library. As you can see, the words in the stopwords list are common words we use in English, but they don't contribute much meaning to a sentence.

In [15]:
stopwords_list = stopwords.words('english')
stopwords_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

With a simple list comprehension, we can remove all words from our corpus that are found in the stopwords list.

In [10]:
words = [word for word in data.split() if word not in stopwords_list]
words[:10]

['advanced',
 'string',
 'theory',
 'mathematical',
 'framework',
 'proposes',
 'theory',
 'quantum',
 'gravity',
 'seeking']
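The list comprehension above checks each word against a list, which scans the list on every lookup. A possible refinement, sketched here with a hypothetical miniature stopword list standing in for stopwords.words('english'), is to convert the list to a set first:

```python
# Hypothetical miniature stopword list; the notebook uses stopwords.words('english').
stopwords_sample = ["is", "a", "that", "to", "be", "of"]

# Set membership tests take constant time on average, unlike scanning a list.
stopword_set = set(stopwords_sample)

text = "string theory is a mathematical framework that proposes to be a theory of quantum gravity"
kept = [word for word in text.split() if word not in stopword_set]

print(kept)
```

For a short paragraph the difference is negligible, but it can matter when filtering a large corpus.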

We can use the .join() string method to recompile our words into a single string.

In [11]:
new_data = ' '.join(words)
new_data[:100]

'advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile gene'
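The steps so far can be collected into one reusable function. The sketch below is only one way to do it: the name clean_text and its stopword_set parameter are made up for illustration, and it leaves out the ToktokTokenizer step so that it depends only on the standard library.

```python
import re
import unicodedata


def clean_text(text, stopword_set):
    """Lowercase, strip accents and punctuation, then drop stopwords."""
    text = text.lower()
    # Decompose accented characters and drop anything that is not ASCII.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    # Keep only lowercase letters, digits, apostrophes, and whitespace.
    text = re.sub(r"[^a-z0-9'\s]", "", text)
    words = [word for word in text.split() if word not in stopword_set]
    return " ".join(words)


# Hypothetical miniature stopword set; in the notebook you could pass
# set(stopwords.words('english')) instead.
demo_stopwords = {"is", "a", "of", "the"}
result = clean_text("String theory is a theory of Quantum Gravity!", demo_stopwords)

print(result)
```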

We can create an instance of the PorterStemmer object and use it to stem our words. Note that many of the resulting words are not found in the dictionary!

In [16]:
ps = nltk.stem.PorterStemmer()

stems = [ps.stem(word) for word in words]

stemmed_data = ' '.join(stems)

stemmed_data[:100]

'advanc string theori mathemat framework propos theori quantum graviti seek reconcil gener rel descri'

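We can also use the WordNetLemmatizer. With its default settings it treats every word as a noun, so it hardly changes text from its original form. It depends on the WordNet corpus, which must be downloaded first with nltk.download('wordnet'); without it, the cell fails with the LookupError below.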

In [19]:
wnl = nltk.stem.WordNetLemmatizer()

lemmas = [wnl.lemmatize(word) for word in words]

lemmatized_data = ' '.join(lemmas)

lemmatized_data[:100]

LookupError: 
**********************************************************************
  Resource wordnet not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('wordnet')

  For more information see: https://www.nltk.org/data.html

  Attempted to load corpora/wordnet

  Searched in:
    - '/Users/lordvoldemort/nltk_data'
    - '/opt/homebrew/anaconda3/nltk_data'
    - '/opt/homebrew/anaconda3/share/nltk_data'
    - '/opt/homebrew/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
