Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

A lot of the data that you could be analyzing is unstructured data and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. Let's look at the kinds of text preprocessing tasks you can do with NLTK so that you'll be ready to apply them in future projects. You'll also see how to do some basic text analysis and create visualizations.

### Tokenizing

By tokenizing, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. It’s your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you can tokenize by word or by sentence. Here’s what each type of tokenization brings to the table:
 - <b>Tokenizing by word</b>: Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often.
 - <b>Tokenizing by sentence</b>: When you tokenize by sentence, you can analyze how those words relate to one another and see more context. Are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?

In [10]:
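# Import the sentence and word tokenizers from nltk.tokenize.
# Both rely on NLTK's Punkt models; if they are missing, NLTK raises a
# LookupError that names the resource to fetch with nltk.download().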
from nltk.tokenize import sent_tokenize, word_tokenize

In [11]:
# Create sample data for demonstration
sample_data = """Today I am very happy to tell you that I have completed a project in NLP. With the use of NLP I extracted information on twitter sentiment analysis. I used that data to segment tweets into positive, negative and neutral tweets. In future I will use it to analyze comments from twitter account."""
sample_data

'Today I am very happy to tell you that I have completed a project in NLP. With the use of NLP I extracted information on twitter sentiment analysis. I used that data to segment tweets into positive, negative and neutral tweets. In future I will use it to analyze comments from twitter account.'

In [12]:
# Split the text into sentences
sent_tokenize(sample_data)

['Today I am very happy to tell you that I have completed a project in NLP.',
 'With the use of NLP I extracted information on twitter sentiment analysis.',
 'I used that data to segment tweets into positive, negative and neutral tweets.',
 'In future I will use it to analyze comments from twitter account.']
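
With sentence tokens in hand, one simple way to look at the context around a word is to pull out the sentences that mention it. The sketch below hard-codes the sentence list printed above so that it runs without NLTK; `sentences_mentioning` is a hypothetical helper written for this illustration, not an NLTK function.

```python
import re

# Sentences copied from the sent_tokenize output above.
sentences = [
    "Today I am very happy to tell you that I have completed a project in NLP.",
    "With the use of NLP I extracted information on twitter sentiment analysis.",
    "I used that data to segment tweets into positive, negative and neutral tweets.",
    "In future I will use it to analyze comments from twitter account.",
]


def sentences_mentioning(word, sentences):
    """Return the sentences that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


twitter_sentences = sentences_mentioning("twitter", sentences)
for sentence in twitter_sentences:
    print(sentence)
```

Matching on whole words means that "tweets" does not count as a mention of "tweet". Whether that is the behavior you want depends on the question you are asking of the text.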

In [13]:
# Split the text into words and punctuation marks
word_tokenize(sample_data)

['Today',
 'I',
 'am',
 'very',
 'happy',
 'to',
 'tell',
 'you',
 'that',
 'I',
 'have',
 'completed',
 'a',
 'project',
 'in',
 'NLP',
 '.',
 'With',
 'the',
 'use',
 'of',
 'NLP',
 'I',
 'extracted',
 'information',
 'on',
 'twitter',
 'sentiment',
 'analysis',
 '.',
 'I',
 'used',
 'that',
 'data',
 'to',
 'segment',
 'tweets',
 'into',
 'positive',
 ',',
 'negative',
 'and',
 'neutral',
 'tweets',
 '.',
 'In',
 'future',
 'I',
 'will',
 'use',
 'it',
 'to',
 'analyze',
 'comments',
 'from',
 'twitter',
 'account',
 '.']
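
Once the text is split into word tokens, counting them shows which ones come up most often. Here is a minimal sketch using `collections.Counter` from the standard library. To keep it runnable without NLTK data, it swaps `word_tokenize` for a rough regular-expression stand-in, which is an assumption of this example and will not match NLTK's output on more complex text (contractions, for instance).

```python
import re
from collections import Counter

sample_data = (
    "Today I am very happy to tell you that I have completed a project in NLP. "
    "With the use of NLP I extracted information on twitter sentiment analysis. "
    "I used that data to segment tweets into positive, negative and neutral tweets. "
    "In future I will use it to analyze comments from twitter account."
)

# Rough stand-in for word_tokenize: runs of word characters,
# or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sample_data)

# Count how often each token occurs.
word_counts = Counter(tokens)

for token, count in word_counts.most_common(3):
    print(token, count)
```

The top of the ranking is dominated by tokens such as "I", "." and "to", which say little about the topic of the text. That is the motivation for filtering stop words, covered in the next section.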

### Filtering Stop Words

Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [14]:
# The stop word lists ship as a separate corpus: run nltk.download("stopwords") once if needed
from nltk.corpus import stopwords

In [15]:
# Lowercase the text first so that tokens can match the lowercase stop word list
sample_data_lower = sample_data.lower()
sample_data_lower

'today i am very happy to tell you that i have completed a project in nlp. with the use of nlp i extracted information on twitter sentiment analysis. i used that data to segment tweets into positive, negative and neutral tweets. in future i will use it to analyze comments from twitter account.'
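
Lowercasing matters because membership checks against a stop word list are case-sensitive, and NLTK's English list is all lowercase. The sketch below illustrates the difference with a small hand-picked set, `tiny_stop_words`, which is a hypothetical stand-in for the real list so that the snippet runs without downloading NLTK data.

```python
# Hypothetical mini stop word set; the real NLTK list is much longer.
tiny_stop_words = {"i", "in", "with", "the", "of"}

tokens = ["With", "the", "use", "of", "NLP", "I", "extracted"]

# Without lowercasing, "With" and "I" slip through the filter.
kept_as_is = [token for token in tokens if token not in tiny_stop_words]

# Lowercasing first lets every stop word match.
kept_lowercased = [
    token.lower() for token in tokens if token.lower() not in tiny_stop_words
]

print(kept_as_is)       # ['With', 'use', 'NLP', 'I', 'extracted']
print(kept_lowercased)  # ['use', 'nlp', 'extracted']
```

If you need to preserve the original casing of the tokens you keep, an alternative is to lowercase only inside the comparison and store the token unchanged.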

In [16]:
# Tokenize the lowercased text by word
sample_data_tokenized = word_tokenize(sample_data_lower)
sample_data_tokenized

['today',
 'i',
 'am',
 'very',
 'happy',
 'to',
 'tell',
 'you',
 'that',
 'i',
 'have',
 'completed',
 'a',
 'project',
 'in',
 'nlp',
 '.',
 'with',
 'the',
 'use',
 'of',
 'nlp',
 'i',
 'extracted',
 'information',
 'on',
 'twitter',
 'sentiment',
 'analysis',
 '.',
 'i',
 'used',
 'that',
 'data',
 'to',
 'segment',
 'tweets',
 'into',
 'positive',
 ',',
 'negative',
 'and',
 'neutral',
 'tweets',
 '.',
 'in',
 'future',
 'i',
 'will',
 'use',
 'it',
 'to',
 'analyze',
 'comments',
 'from',
 'twitter',
 'account',
 '.']

In [17]:
# Load the English stop words into a set for fast membership checks
stop_words = set(stopwords.words("english"))

In [18]:
# Keep only the tokens that are not stop words
filtered_data = [word for word in sample_data_tokenized if word not in stop_words]
filtered_data

['today',
 'happy',
 'tell',
 'completed',
 'project',
 'nlp',
 '.',
 'use',
 'nlp',
 'extracted',
 'information',
 'twitter',
 'sentiment',
 'analysis',
 '.',
 'used',
 'data',
 'segment',
 'tweets',
 'positive',
 ',',
 'negative',
 'neutral',
 'tweets',
 '.',
 'future',
 'use',
 'analyze',
 'comments',
 'twitter',
 'account',
 '.']
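
The filtered list above still contains punctuation tokens such as `'.'` and `','`, because they are not part of the stop word list. One possible follow-up step, sketched here under the assumption that you only care about purely alphabetic tokens, is to drop them with `str.isalpha()` and then count what remains. The token list is copied from the output above so that the snippet runs on its own.

```python
from collections import Counter

# Tokens copied from the filtered_data output above.
filtered_data = [
    "today", "happy", "tell", "completed", "project", "nlp", ".",
    "use", "nlp", "extracted", "information", "twitter", "sentiment",
    "analysis", ".", "used", "data", "segment", "tweets", "positive", ",",
    "negative", "neutral", "tweets", ".", "future", "use", "analyze",
    "comments", "twitter", "account", ".",
]

# Keep only tokens made up entirely of letters; this drops '.' and ','.
words_only = [token for token in filtered_data if token.isalpha()]

# Count the remaining content words.
content_counts = Counter(words_only)

for word, count in content_counts.most_common(4):
    print(word, count)
```

Note that `str.isalpha()` also discards tokens that contain digits, hyphens, or apostrophes, so on other texts a filter based on `string.punctuation` may be a better fit.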