In [1]:
import nltk; nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/lordvoldemort/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

1. Convert text to all lowercase for consistency
2. Normalize accented characters to ASCII and drop any remaining non-ASCII characters
3. Remove special characters
4. Stem or lemmatize the words
5. Remove stopwords
6. Store the clean text and original text for use in future notebooks
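
Steps 1 to 3 can be sketched as a single helper. This is a minimal sketch, not the notebook's own code: the function name `basic_clean` and the sample string are hypothetical, and it uses only the standard library.

```python
import re
import unicodedata


def basic_clean(text):
    """Lowercase, convert to ASCII, and drop special characters.

    `basic_clean` is a hypothetical helper name used for illustration.
    """
    # Step 1: lowercase so that "Data" and "data" count as the same word
    text = text.lower()
    # Step 2: split accented letters into base letter + accent mark,
    # then drop every character that has no ASCII encoding
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Step 3: keep only letters, digits, apostrophes, and whitespace
    return re.sub(r"[^a-z0-9'\s]", "", text)


sample = "Résumé Tips: Math & Stats!"  # hypothetical sample text
cleaned = basic_clean(sample)
print(cleaned)
```

Removing a character such as `&` leaves its surrounding spaces behind, so the cleaned text can contain double spaces until it is tokenized or split.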


In [2]:
import unicodedata
import re
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import pandas as pd


In [3]:
article = "Coming into our Data Science program, you will need to know some math and \
stats. However, many of our applicants actually learn in the application process – you \
don’t need to be an expert before applying! Data science is a very accessible field to \
anyone dedicated to learning new skills, and we can work with any applicant to help them \
learn what they need to know. But what “skills” do we mean, exactly? Just what exactly \
are the data science math and stats principles you need to know?', 'What are the main \
math principles you need to know to get into Codeup’s Data Science program?'"


In [4]:
# convert text to all lowercase so that words match regardless of capitalization
article = article.lower()
article

"coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process – you don’t need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can work with any applicant to help them learn what they need to know. but what “skills” do we mean, exactly? just what exactly are the data science math and stats principles you need to know?', 'what are the main math principles you need to know to get into codeup’s data science program?'"

In [5]:
# decompose accented characters (NFKD), then drop every character that is not ASCII
article = unicodedata.normalize('NFKD', article)\
    .encode('ascii','ignore')\
    .decode('utf-8', 'ignore')


In [6]:
article[0:300]

'coming into our data science program, you will need to know some math and stats. however, many of our applicants actually learn in the application process  you dont need to be an expert before applying! data science is a very accessible field to anyone dedicated to learning new skills, and we can wo'
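
The output above has lost the en dash and the curly apostrophe ("don’t" became "dont"). The following sketch illustrates why, assuming a hypothetical helper `to_ascii` that repeats the normalization step: NFKD splits an accented letter into a base letter plus a combining mark, so the base letter survives the ASCII encoding, while typographic punctuation has no ASCII decomposition and is dropped entirely.

```python
import unicodedata


def to_ascii(text):
    """Hypothetical helper: NFKD-normalize, then drop non-ASCII characters."""
    return (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )


# "\u00e9" is e with an acute accent; NFKD turns it into "e" + a combining accent
decomposed = unicodedata.normalize("NFKD", "\u00e9")
print(len(decomposed))  # two code points after decomposition

examples = [
    "caf\u00e9",           # accented letter: the base letter is kept
    "don\u2019t",          # curly apostrophe: removed, no ASCII equivalent
    "process \u2013 you",  # en dash: removed, leaving two spaces
]
for original in examples:
    print(repr(original), "->", repr(to_ascii(original)))
```

If contractions matter for later steps, one option is to replace curly quotes with straight quotes before normalizing; the notebook does not do this.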

In [7]:
# remove special characters: keep only lowercase letters, digits, apostrophes, and whitespace
article = re.sub(r"[^a-z0-9'\s]", '', article)
print(article)

coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process  you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work with any applicant to help them learn what they need to know but what skills do we mean exactly just what exactly are the data science math and stats principles you need to know' 'what are the main math principles you need to know to get into codeups data science program'


# Tokenization

Tokenization is the process of breaking text into discrete units (words and any leftover punctuation), here using nltk.


In [10]:
tokenizer = nltk.tokenize.ToktokTokenizer()
tokenizer.tokenize(article, return_str=True)[0:300]

'coming into our data science program you will need to know some math and stats however many of our applicants actually learn in the application process you dont need to be an expert before applying data science is a very accessible field to anyone dedicated to learning new skills and we can work wit'
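
The cell above returns one string because of `return_str=True`. As a small sketch on a hypothetical snippet of already-cleaned text, the same tokenizer returns a list of tokens by default, and in both forms the double space left behind by the removed dash is collapsed.

```python
from nltk.tokenize import ToktokTokenizer

tokenizer = ToktokTokenizer()

# hypothetical snippet of cleaned text; note the double space after "process"
snippet = "the application process  you dont need to be an expert"

tokens = tokenizer.tokenize(snippet)                      # list of tokens
as_string = tokenizer.tokenize(snippet, return_str=True)  # tokens joined by single spaces

print(tokens)
print(as_string)
```

Note that the notebook cell does not assign the tokenized result back to `article`, so later cells still work on the untokenized string; `article.split()` handles the extra whitespace on its own.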

# Stemming

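Stemming reduces each word to its stem by stripping suffixes. The stem is not always a real word (for example, "science" becomes "scienc").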

In [12]:
# create the nltk stemmer object, then use it
ps = nltk.stem.PorterStemmer()

ps.stem('call'), ps.stem('called'), ps.stem('calling')

('call', 'call', 'call')

In [13]:
stems = [ps.stem(word) for word in article.split()]
article_stemmed = ' '.join(stems)
article_stemmed

"come into our data scienc program you will need to know some math and stat howev mani of our applic actual learn in the applic process you dont need to be an expert befor appli data scienc is a veri access field to anyon dedic to learn new skill and we can work with ani applic to help them learn what they need to know but what skill do we mean exactli just what exactli are the data scienc math and stat principl you need to know' 'what are the main math principl you need to know to get into codeup data scienc program'"

In [14]:
pd.Series(stems).value_counts().head(10)

to        9
need      5
data      4
scienc    4
you       4
know      3
learn     3
and       3
math      3
the       3
dtype: int64

# Lemmatization

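Lemmatization also reduces words to a base form, but it looks each word up in a vocabulary, so the result (the lemma) is an actual word. It is a more complex process and slower than stemming, but it is usually preferred when the output needs to stay readable.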

# Removing Stopwords

Stopwords are words that carry little or no significance on their own: articles, conjunctions, and prepositions, for example "a", "an", and "the".

Stopwords are removed word by word, so the text must first be split into tokens.

In [25]:
stopword_list = stopwords.words('english')

# keep the negations 'no' and 'not', since dropping them can change the meaning of a sentence
stopword_list.remove('no')
stopword_list.remove('not')

stopword_list[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [27]:
words = article.split()
filtered_words = [w for w in words if w not in stopword_list]

print('Removed {} stopwords'.format(len(words)-len(filtered_words)))
print('---')

Removed 48 stopwords
---


In [28]:
article_without_stopwords = ' '.join(filtered_words)
print(article_without_stopwords)

coming data science program need know math stats however many applicants actually learn application process dont need expert applying data science accessible field anyone dedicated learning new skills work applicant help learn need know skills mean exactly exactly data science math stats principles need know' 'what main math principles need know get codeups data science program'
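
Step 6 of the plan (storing the clean text and the original text) is not shown in the notebook. One possible sketch, assuming a JSON file with hypothetical keys `original` and `clean` and a hypothetical file name, writes the pair to disk and reads it back. A temporary directory is used here only so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path

# hypothetical stand-ins for the original and cleaned article text
original = "Data Science is a very accessible field!"
clean = "data science accessible field"

# keep both versions together so later notebooks can use either one
record = {"original": original, "clean": clean}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "article_clean.json"  # hypothetical file name
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")

    # read the file back the way a later notebook might
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored == record)
```

For many articles, a list of such records (or a pandas DataFrame with one column per version) would follow the same pattern.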
