Natural language processing (NLP) is a field that focuses on making natural human language usable by computer programs. NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

Much of the data you could be analyzing is unstructured and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it. Let's look at the kinds of text preprocessing tasks you can do with NLTK so that you'll be ready to apply them in future projects. You'll also see how to do some basic text analysis and create visualizations.

### Tokenizing

By tokenizing, you can conveniently split up text by word or by sentence. This will allow you to work with smaller pieces of text that are still relatively coherent and meaningful even outside of the context of the rest of the text. Tokenizing is your first step in turning unstructured data into structured data, which is easier to analyze.

When you’re analyzing text, you’ll be tokenizing by word and tokenizing by sentence. Here’s what both types of tokenization bring to the table:
 - <b>Tokenizing by word</b>: Words are like the atoms of natural language. They’re the smallest unit of meaning that still makes sense on its own. Tokenizing your text by word allows you to identify words that come up particularly often.
 - <b>Tokenizing by sentence</b>: When you tokenize by sentence, you can analyze how those words relate to one another and see more context. If the text is a job ad, for example, are there a lot of negative words around the word “Python” because the hiring manager doesn’t like Python? Are there more terms from the domain of herpetology than the domain of software development, suggesting that you may be dealing with an entirely different kind of python than you were expecting?
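
To make the second point concrete, here is a minimal sketch of looking at the words that share a sentence with a target word. The two sentences and the `context_words` helper are hypothetical and made up for illustration, and the text is split with plain `str.split()` only to keep the sketch self-contained. It isn't a substitute for a real tokenizer.

```python
# hypothetical sentences, already split by sentence for this illustration
sentences = [
    "The python basked on a warm rock",
    "We deploy the service with Python and Docker",
]

def context_words(sentences, target):
    """Return, per matching sentence, the other words that appear alongside target."""
    contexts = []
    for sentence in sentences:
        words = sentence.lower().split()  # naive whitespace split, not a real tokenizer
        if target in words:
            contexts.append([word for word in words if word != target])
    return contexts

for context in context_words(sentences, "python"):
    print(context)
```

The first context leans toward herpetology and the second toward software development, which is the kind of signal that sentence-level tokenization lets you look for.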

In [2]:
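# import the sentence and word tokenizers from nltk.tokenize
# (both need NLTK's Punkt data: if you get a LookupError, run nltk.download("punkt") once,
# or nltk.download("punkt_tab") on newer NLTK releases)
from nltk.tokenize import sent_tokenize, word_tokenize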

In [3]:
# create sample data for demonstration
sample_data = """Today I am very happy to tell you that I have completed a project in NLP. With the use of NLP I extracted information on 
twitter sentiment analysis. I used that data to segment tweets into positive, negative and neutral tweets. In future I will use it to analyze 
comments from twitter account."""

In [4]:
# tokenize by sentence
sent_tokenize(sample_data)

['Today I am very happy to tell you that I have completed a project in NLP.',
 'With the use of NLP I extracted information on \ntwitter sentiment analysis.',
 'I used that data to segment tweets into positive, negative and neutral tweets.',
 'In future I will use it to analyze \ncomments from twitter account.']
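
Notice that two of the sentences still contain the `\n` line breaks from the original triple-quoted string. If that matters for later steps, one option is to collapse the whitespace yourself. The following is a minimal sketch using only built-in string methods. The list is copied from the output above, and the name `clean_sentences` is just an illustrative choice:

```python
# sentences copied from the sent_tokenize output above
sentences = [
    'Today I am very happy to tell you that I have completed a project in NLP.',
    'With the use of NLP I extracted information on \ntwitter sentiment analysis.',
    'I used that data to segment tweets into positive, negative and neutral tweets.',
    'In future I will use it to analyze \ncomments from twitter account.',
]

# str.split() with no arguments splits on any run of whitespace (spaces, newlines, tabs),
# so joining the pieces with a single space collapses the embedded line breaks
clean_sentences = [" ".join(sentence.split()) for sentence in sentences]

print(clean_sentences[1])
# With the use of NLP I extracted information on twitter sentiment analysis.
```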

In [5]:
# tokenize by word
word_tokenize(sample_data)

['Today',
 'I',
 'am',
 'very',
 'happy',
 'to',
 'tell',
 'you',
 'that',
 'I',
 'have',
 'completed',
 'a',
 'project',
 'in',
 'NLP',
 '.',
 'With',
 'the',
 'use',
 'of',
 'NLP',
 'I',
 'extracted',
 'information',
 'on',
 'twitter',
 'sentiment',
 'analysis',
 '.',
 'I',
 'used',
 'that',
 'data',
 'to',
 'segment',
 'tweets',
 'into',
 'positive',
 ',',
 'negative',
 'and',
 'neutral',
 'tweets',
 '.',
 'In',
 'future',
 'I',
 'will',
 'use',
 'it',
 'to',
 'analyze',
 'comments',
 'from',
 'twitter',
 'account',
 '.']
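
Once you have a list of word tokens, identifying the words that come up most often is a short step. Below is a minimal sketch that uses `collections.Counter` from the standard library. The token list is copied from the output above, and lowercasing plus the `isalpha()` filter for dropping punctuation are assumptions of this example rather than something `word_tokenize` does for you:

```python
from collections import Counter

# word tokens copied from the word_tokenize output above
tokens = [
    'Today', 'I', 'am', 'very', 'happy', 'to', 'tell', 'you', 'that', 'I',
    'have', 'completed', 'a', 'project', 'in', 'NLP', '.',
    'With', 'the', 'use', 'of', 'NLP', 'I', 'extracted', 'information', 'on',
    'twitter', 'sentiment', 'analysis', '.',
    'I', 'used', 'that', 'data', 'to', 'segment', 'tweets', 'into',
    'positive', ',', 'negative', 'and', 'neutral', 'tweets', '.',
    'In', 'future', 'I', 'will', 'use', 'it', 'to', 'analyze', 'comments',
    'from', 'twitter', 'account', '.',
]

# lowercase each token and keep only alphabetic ones, which drops '.' and ','
words = [token.lower() for token in tokens if token.isalpha()]

# count how often each word occurs
word_counts = Counter(words)

print(word_counts.most_common(2))  # [('i', 5), ('to', 3)]
```

The most frequent words here are "I" and "to", which say little about the topic, so raw counts like these are usually only a starting point.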