In [1]:
%autosave 0

Autosave disabled


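Run this command in your terminal! It downloads the stopwords corpus for the nltk library. The WordNetLemmatizer used at the end of this notebook also needs the WordNet corpus, which you can fetch the same way with nltk.download('wordnet').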

python -c "import nltk; nltk.download('stopwords')"

Here we have some new imports! We will need unicodedata to standardize our text characters, re (regex library) to remove special characters, and many items from the nltk (natural language toolkit) library to remove stopwords and stem/lemmatize our text.

In [13]:
import pandas as pd

import unicodedata # lib to standardize characters
import re

import nltk # natural language tool kit lib for all operations
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

We will work with an explanation of string theory from my website. It has special characters and long words, which will help us visualize the effects our operations have on text data!

In [3]:
data = "Advanced: String theory is a mathematical framework that proposes to be a theory of quantum gravity, seeking to reconcile general relativity (which describes gravity on a large scale) and quantum mechanics (which describes the behavior of particles at a microscopic level). It introduces the idea that the fundamental building blocks of the universe are not particles, but rather one-dimensional strings of energy. These strings can vibrate at different frequencies, giving rise to different types of particles and forces. String theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with, which are compactified or curled up into tiny sizes."

Let's start off by converting all text to lowercase.

In [15]:
data = data.lower()
data

'advanced: string theory is a mathematical framework that proposes to be a theory of quantum gravity, seeking to reconcile general relativity (which describes gravity on a large scale) and quantum mechanics (which describes the behavior of particles at a microscopic level). it introduces the idea that the fundamental building blocks of the universe are not particles, but rather one-dimensional strings of energy. these strings can vibrate at different frequencies, giving rise to different types of particles and forces. string theory also requires the existence of additional dimensions beyond the three spatial dimensions we are familiar with, which are compactified or curled up into tiny sizes.'

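Next, we normalize the text to plain ASCII in three steps. NFKD normalization splits an accented character into its base letter plus a separate combining mark, encoding to ASCII with 'ignore' drops every character that is not ASCII, and decoding turns the resulting bytes back into a string. Our example contains no accented characters, so the output looks unchanged.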

In [5]:
data = unicodedata.normalize('NFKD', data)\
    .encode('ascii', 'ignore')\
    .decode('utf-8', 'ignore')

data[:100]

'advanced: string theory is a mathematical framework that proposes to be a theory of quantum gravity,'
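
Since the string theory text is already plain ASCII, the effect is easier to see on a string that does contain accents. Here is a minimal sketch of the same three-step chain. The helper name `strip_accents` and the sample string are made up for illustration and are not part of the notebook above.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper wrapping the same chain used in the cell above."""
    decomposed = unicodedata.normalize('NFKD', text)    # 'é' becomes 'e' + combining accent
    ascii_bytes = decomposed.encode('ascii', 'ignore')  # non-ASCII code points are dropped
    return ascii_bytes.decode('utf-8', 'ignore')        # bytes back to a str

sample = "café naïve résumé"  # illustrative sample, not from the notebook
cleaned = strip_accents(sample)
print(cleaned)  # cafe naive resume
```

One caveat: a character with no ASCII base letter in its decomposition is dropped entirely rather than simplified. For example, `strip_accents("smørbrød")` returns `smrbrd`, because 'ø' does not decompose into 'o' plus a mark.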

Let's use the regex library to substitute out all characters we don't want.

In [16]:
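# The caret (^) inside the brackets negates the character class, so this matches
# every character that is NOT a lowercase letter, digit, apostrophe, or whitespace
# and replaces it with an empty string.
data = re.sub(r"[^a-z0-9'\s]", "", data)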
data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'
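
A side effect worth noticing is that deleting a hyphen fuses the two halves of a word, which is why "one-dimensional" shows up later as "onedimensional". The sketch below runs the same pattern on a short made-up string, then tries one possible variation (an assumption, not something the notebook does): replacing hyphens with spaces before stripping.

```python
import re

sample = "one-dimensional strings (it's 1-d)!"  # illustrative sample

# Same pattern as above: keep only a-z, 0-9, apostrophes, and whitespace.
stripped = re.sub(r"[^a-z0-9'\s]", "", sample)
print(stripped)  # onedimensional strings it's 1d

# Assumed variation: turn hyphens into spaces first, so hyphenated
# words split into separate words instead of fusing together.
spaced = re.sub(r"[^a-z0-9'\s]", "", sample.replace("-", " "))
print(spaced)  # one dimensional strings it's 1 d
```

Apostrophes are kept by the pattern, which matters later because the nltk stopwords list contains contractions such as "you're".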

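We can create an instance of the ToktokTokenizer object and use it to tokenize our data. Because we already stripped the punctuation, there is nothing left for the tokenizer to separate from the words, so the output looks the same.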

In [7]:
tokenizer = ToktokTokenizer()

data = tokenizer.tokenize(data, return_str=True)

data[:100]

'advanced string theory is a mathematical framework that proposes to be a theory of quantum gravity s'
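
To see the tokenizer actually do something, we can hand it text that still has punctuation. This is a small sketch on a made-up sentence. It assumes nltk is installed, and ToktokTokenizer itself needs no extra downloads.

```python
from nltk.tokenize.toktok import ToktokTokenizer

tokenizer = ToktokTokenizer()
raw = "strings, not particles (as far as we know)."  # illustrative sample

tokens = tokenizer.tokenize(raw)                      # default: a list of tokens
as_string = tokenizer.tokenize(raw, return_str=True)  # one string, tokens separated by spaces

print(tokens)
print(as_string)
```

Punctuation such as the comma comes back as its own token, separate from the neighboring word. With return_str=True we get a string back instead of a list, which is why data stays a string in the cell above.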

Let's retrieve a list of stopwords from the nltk library. As you can see, the words in the stopwords list are common words we use in English, but they don't contribute much meaning to a sentence.

In [21]:
stopwords_list = stopwords.words('english')
print(stopwords_list[:10])
print('\n\n\n')
print(stopwords_list)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]




['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 

**With a simple list comprehension, we can remove all words from our corpus that are found in the stopwords list.**

#### words = [: Assigns the list built by the comprehension to words. It will hold every word from data that is not in stopwords_list.

#### word for word in data.split(): Iterates over each word produced by data.split(). Called with no arguments, split() breaks the string on any run of whitespace.

#### if word not in stopwords_list]: The filter condition. A word is kept in words only if it does not appear in stopwords_list.

In [22]:
words = [word for word in data.split() if word not in stopwords_list]
print(words[:10])
print('\n\n\n')
print(words)


['advanced', 'string', 'theory', 'mathematical', 'framework', 'proposes', 'theory', 'quantum', 'gravity', 'seeking']




['advanced', 'string', 'theory', 'mathematical', 'framework', 'proposes', 'theory', 'quantum', 'gravity', 'seeking', 'reconcile', 'general', 'relativity', 'describes', 'gravity', 'large', 'scale', 'quantum', 'mechanics', 'describes', 'behavior', 'particles', 'microscopic', 'level', 'introduces', 'idea', 'fundamental', 'building', 'blocks', 'universe', 'particles', 'rather', 'onedimensional', 'strings', 'energy', 'strings', 'vibrate', 'different', 'frequencies', 'giving', 'rise', 'different', 'types', 'particles', 'forces', 'string', 'theory', 'also', 'requires', 'existence', 'additional', 'dimensions', 'beyond', 'three', 'spatial', 'dimensions', 'familiar', 'compactified', 'curled', 'tiny', 'sizes']
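
Each `word not in stopwords_list` check scans the list from the start, which is fine for one paragraph but slows down on large corpora. A common tweak is to convert the list to a set first. The sketch below uses a tiny hand-picked stand-in for stopwords.words('english') so that it runs without any downloads. The name `stopword_set` is an assumption for illustration.

```python
# Hand-picked stand-in for stopwords.words('english'); the real list is much longer.
stopwords_list = ['is', 'a', 'that', 'to', 'be', 'of', "it's", 'the']

# Set membership is O(1) on average, versus O(n) for scanning a list.
stopword_set = set(stopwords_list)

text = "string theory is a framework that proposes to be a theory of quantum gravity"
words = [word for word in text.split() if word not in stopword_set]
print(words)
# ['string', 'theory', 'framework', 'proposes', 'theory', 'quantum', 'gravity']
```

The result is the same as filtering against the list. Only the lookup cost changes.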


We can use the .join() string method to recompile our words into a single string.

In [25]:
new_data = ' '.join(words)
print(new_data[:100])
print('\n\n\n')
print(new_data)

advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile gene




advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile general relativity describes gravity large scale quantum mechanics describes behavior particles microscopic level introduces idea fundamental building blocks universe particles rather onedimensional strings energy strings vibrate different frequencies giving rise different types particles forces string theory also requires existence additional dimensions beyond three spatial dimensions familiar compactified curled tiny sizes
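
If we want to reuse these steps on other text, one option is to wrap them in small functions. This is only a sketch: the function names `basic_clean` and `remove_stopwords` and the three-word `demo_stopwords` list are hypothetical, and the stopword list is passed in as an argument so the block runs without nltk.

```python
import re
import unicodedata

def basic_clean(text):
    """Lowercase, normalize to ASCII, and strip unwanted characters."""
    text = text.lower()
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore'))
    return re.sub(r"[^a-z0-9'\s]", "", text)

def remove_stopwords(text, stopword_list):
    """Drop every word found in stopword_list and rejoin with single spaces."""
    stopword_set = set(stopword_list)
    return ' '.join(word for word in text.split() if word not in stopword_set)

demo_stopwords = ['is', 'a', 'of']  # stand-in for stopwords.words('english')
result = remove_stopwords(basic_clean("String theory is a Theory of quantum gravity!"),
                          demo_stopwords)
print(result)  # string theory theory quantum gravity
```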


We can create an instance of the PorterStemmer object and use it to stem our words. Stemming strips suffixes by rule rather than by looking words up, so many of the resulting stems (theori, graviti) are not found in the dictionary!

In [33]:
# more aggressive

In [27]:
ps = nltk.porter.PorterStemmer() # create a PorterStemmer object

stems = [ps.stem(word) for word in words] # stem each word in our list of words

stemmed_data = ' '.join(stems) # join the stems back into a single string

print(stemmed_data[:100])
print('\n\n\n')
print(stemmed_data)

advanc string theori mathemat framework propos theori quantum graviti seek reconcil gener rel descri




advanc string theori mathemat framework propos theori quantum graviti seek reconcil gener rel describ graviti larg scale quantum mechan describ behavior particl microscop level introduc idea fundament build block univers particl rather onedimension string energi string vibrat differ frequenc give rise differ type particl forc string theori also requir exist addit dimens beyond three spatial dimens familiar compactifi curl tini size
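
Reading the stemmed string on its own makes it hard to tell which word each stem came from. A small sketch that prints word and stem side by side, using four words taken from the text above (it assumes nltk is installed, and the stemmer needs no extra downloads):

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

sample_words = ['describes', 'dimensions', 'frequencies', 'particles']
stems = [ps.stem(word) for word in sample_words]

# Print each original word next to its stem.
for word, stem in zip(sample_words, stems):
    print(f"{word} -> {stem}")
# describes -> describ
# dimensions -> dimens
# frequencies -> frequenc
# particles -> particl
```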


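We can also use the WordNetLemmatizer. Note that it hardly changes the text from its original form: lemmatize() treats every word as a noun unless we pass a different pos argument, so plurals are reduced (particles becomes particle) while verb forms such as describes and seeking are left alone.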

In [32]:
# less aggressive

In [31]:
wnl = nltk.stem.WordNetLemmatizer()

lemmas = [wnl.lemmatize(word) for word in words] # list comprehension: lemmatize each word from our stopword-filtered list

lemmatized_data = ' '.join(lemmas)

print(lemmatized_data[:100])
print('\n\n\n')
print(lemmatized_data)

advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile gene




advanced string theory mathematical framework proposes theory quantum gravity seeking reconcile general relativity describes gravity large scale quantum mechanic describes behavior particle microscopic level introduces idea fundamental building block universe particle rather onedimensional string energy string vibrate different frequency giving rise different type particle force string theory also requires existence additional dimension beyond three spatial dimension familiar compactified curled tiny size
