# Installing NLTK and Configuring the Python Environment

In [1]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


NLTK also requires additional data to be downloaded before it can be used effectively. This data includes pre-trained models, corpora, and other resources that NLTK uses to perform various NLP tasks.

In [1]:
import nltk

nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\av

True
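
Downloading the 'all' collection fetches every corpus and model, which takes a while and uses a lot of disk space. As a lighter alternative, the sketch below downloads only the packages this tutorial appears to rely on. The package list is an assumption (for example, punkt_tab is only needed by newer NLTK releases), and download_required is a hypothetical helper name, not part of NLTK.

```python
# Packages assumed to be sufficient for tokenizing, stop-word removal,
# lemmatizing, and VADER scoring; adjust the list for your NLTK version.
REQUIRED_PACKAGES = [
    "punkt",
    "punkt_tab",
    "stopwords",
    "wordnet",
    "omw-1.4",
    "vader_lexicon",
]


def download_required(packages=REQUIRED_PACKAGES, download=None):
    """Download each package and return a {name: success} mapping."""
    if download is None:
        # Import lazily so the helper can be exercised without NLTK installed.
        import nltk
        download = nltk.download
    return {name: download(name) for name in packages}
```

Calling download_required() with no arguments uses nltk.download, which returns True for each package that was fetched or is already up to date.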

To perform sentiment analysis using NLTK in Python, the text data must first be preprocessed using techniques such as tokenization, stop word removal, and stemming or lemmatization. Once the text is preprocessed, we pass it to the VADER sentiment analyzer to classify its sentiment as positive or negative.

## Step 1 - Import Libraries and Load Data
First, we will import the libraries needed for text analysis and sentiment analysis, such as pandas for data processing, nltk for natural language processing, and SentimentIntensityAnalyzer for sentiment analysis.

We will then download all NLTK data packages (corpora, pre-trained models, and other resources) using nltk.download('all').

Once the environment is set up, we'll load a dataset of Amazon reviews using pd.read_csv(). This will create a DataFrame object in Python that we can use to analyze the data. We'll display the contents of the DataFrame using df.

In [3]:
# import libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# download all NLTK data packages (first time only)
nltk.download('all')

# load the Amazon review dataset
df = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/amazon.csv')
df

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\King\AppData\Roaming\nltk_data...
[nltk_

Unnamed: 0,reviewText,Positive
0,This is a one of the best apps acording to a b...,1
1,This is a pretty good version of the game for ...,1
2,this is a really cool game. there are a bunch ...,1
3,"This is a silly game and can be frustrating, b...",1
4,This is a terrific game on any pad. Hrs of fun...,1
...,...,...
19995,this app is fricken stupid.it froze on the kin...,0
19996,Please add me!!!!! I need neighbors! Ginger101...,1
19997,love it! this game. is awesome. wish it had m...,1
19998,I love love love this app on my side of fashio...,1


## Step 2 - Text Preprocessing
Let's create a preprocess_text function that first tokenizes each document using NLTK's word_tokenize function, then removes stop words using the English list from NLTK's stopwords corpus, and finally lemmatizes the filtered tokens using NLTK's WordNetLemmatizer.

In [5]:
# build the stop-word set and the lemmatizer once, not on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# create preprocess_text function
def preprocess_text(text):
    # Tokenize the lowercased text
    tokens = word_tokenize(text.lower())

    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # Lemmatize the tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

    # Join the tokens back into a string
    return ' '.join(lemmatized_tokens)

# apply the function to the reviewText column
df['reviewText'] = df['reviewText'].apply(preprocess_text)
df

Unnamed: 0,reviewText,Positive
0,one best apps acording bunch people agree bomb...,1
1,pretty good version game free . lot different ...,1
2,really cool game . bunch level find golden egg...,1
3,"silly game frustrating , lot fun definitely re...",1
4,terrific game pad . hr fun . grandkids love . ...,1
...,...,...
19995,app fricken stupid.it froze kindle wont allow ...,0
19996,please add ! ! ! ! ! need neighbor ! ginger101...,1
19997,love ! game . awesome . wish free stuff house ...,1
19998,love love love app side fashion story fight wo...,1
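
One thing to keep in mind about the stop-word step: NLTK's English stop-word list includes negations such as "not", and VADER uses negations to flip the polarity of the words that follow them. The sketch below illustrates the effect with plain Python. The tiny STOP_WORDS set and the keep_negations flag are hypothetical stand-ins for illustration, not NLTK's actual list or API.

```python
# Tiny illustrative subset of a stop-word list (hypothetical, not NLTK's list)
STOP_WORDS = {"this", "is", "a", "the", "not", "and", "it"}

# Words we may want to keep because they carry sentiment information
NEGATIONS = {"not", "no", "nor", "never"}


def remove_stop_words(tokens, keep_negations=False):
    """Drop stop words, optionally keeping negation words."""
    kept = []
    for token in tokens:
        is_protected = keep_negations and token in NEGATIONS
        if token in STOP_WORDS and not is_protected:
            continue
        kept.append(token)
    return kept


tokens = "this is not a good game".split()
print(remove_stop_words(tokens))                       # negation is dropped
print(remove_stop_words(tokens, keep_negations=True))  # negation is kept
```

With the default settings the sentence collapses to "good game", which reads as positive even though the original was negative. Whether keeping negations improves results on this dataset is something to check empirically.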


## Step 3 - NLTK Sentiment Analyzer
First, we'll initialize a SentimentIntensityAnalyzer object from the nltk.sentiment.vader module.

Next, we'll define a function called get_sentiment that takes a text string as input. The function calls the analyzer object's polarity_scores method to obtain a dictionary of sentiment scores for the text: neg, neu, and pos give the proportion of negative, neutral, and positive sentiment, and compound is a single normalized overall score.

The function then checks if the positive score is greater than 0 and returns a sentiment score of 1 if it is, and 0 otherwise. This means that any text with a positive score will be classified as having positive sentiment, and any text with a non-positive score will be classified as having negative sentiment.

Finally, we'll apply the get_sentiment function to the reviewText column of the DataFrame df using the apply method. This creates a new column called sentiment in the DataFrame, which stores the sentiment score for each review. We'll then display the updated DataFrame using df.

In [6]:
# initialize NLTK sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function
def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    # label as positive (1) when VADER assigns any positive weight, else 0
    sentiment = 1 if scores['pos'] > 0 else 0
    return sentiment

# apply get_sentiment function
df['sentiment'] = df['reviewText'].apply(get_sentiment)
df

Unnamed: 0,reviewText,Positive,sentiment
0,one best apps acording bunch people agree bomb...,1,1
1,pretty good version game free . lot different ...,1,1
2,really cool game . bunch level find golden egg...,1,1
3,"silly game frustrating , lot fun definitely re...",1,1
4,terrific game pad . hr fun . grandkids love . ...,1,1
...,...,...,...
19995,app fricken stupid.it froze kindle wont allow ...,0,0
19996,please add ! ! ! ! ! need neighbor ! ginger101...,1,1
19997,love ! game . awesome . wish free stuff house ...,1,1
19998,love love love app side fashion story fight wo...,1,1
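
The rule in get_sentiment labels a review as positive whenever any positive word is found, even if the review is mostly negative. A common alternative is to threshold the compound score instead; the VADER authors suggest treating compound >= 0.05 as positive. The sketch below compares the two rules on a hand-written score dictionary. The numbers in mixed are hypothetical, not real analyzer output, and label_from_compound is an illustrative helper name.

```python
def label_from_pos(scores):
    """Rule used in get_sentiment: any positive weight counts as positive."""
    return 1 if scores["pos"] > 0 else 0


def label_from_compound(scores, threshold=0.05):
    """Alternative rule: positive only if the overall score clears a threshold."""
    return 1 if scores["compound"] >= threshold else 0


# Hypothetical scores for a mostly negative review containing one positive word
mixed = {"neg": 0.41, "neu": 0.47, "pos": 0.12, "compound": -0.62}

print(label_from_pos(mixed), label_from_compound(mixed))  # prints "1 0"
```

Because the labels here are binary, this version folds neutral reviews into class 0. To try it on the DataFrame, pass analyzer.polarity_scores(text) to label_from_compound inside get_sentiment.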


The NLTK sentiment analyzer returns neg, neu, and pos scores between 0 and 1, plus a compound score between -1 and +1. The get_sentiment function above applies a cutoff of 0 to the pos score: anything above 0 is classified as 1 (i.e., positive). Since we have real labels, we can evaluate the performance of this method by constructing a confusion matrix.

In [7]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(df['Positive'], df['sentiment']))

[[ 1131  3636]
 [  576 14657]]
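
To read the matrix: scikit-learn puts the true labels on the rows and the predictions on the columns, so the first row holds the truly negative reviews and the second row the truly positive ones. The short worked example below derives the usual metrics from the four counts printed above. The variable names are illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predictions
matrix = [[1131, 3636],
          [576, 14657]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all reviews labeled correctly
precision_pos = tp / (tp + fp)        # how often a "positive" prediction is right
recall_pos = tp / (tp + fn)           # share of positive reviews that were found
recall_neg = tn / (tn + fp)           # share of negative reviews that were found

# Accuracy of always predicting the more frequent class
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy          {accuracy:.3f}")
print(f"precision (pos)   {precision_pos:.3f}")
print(f"recall (pos)      {recall_pos:.3f}")
print(f"recall (neg)      {recall_neg:.3f}")
print(f"majority baseline {majority_baseline:.3f}")
```

The accuracy comes out at roughly 79%, only slightly above the roughly 76% obtained by always predicting the positive class. The rule finds about 96% of the positive reviews but only about 24% of the negative ones.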


We can also consult the classification report:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(df['Positive'], df['sentiment']))

As you can see, the overall accuracy of this rule-based sentiment analysis model is 79%. Since this is labeled data, you can also try building an ML model to evaluate whether an ML-based approach achieves better accuracy.
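
As a starting point for the ML-based approach, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written in plain Python so that every step is visible. The training texts are hypothetical toy examples, not rows from the Amazon dataset, and the function names are illustrative. In practice you would more likely use a library implementation and a proper train/test split.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Count words per label; returns the pieces needed for prediction."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter(labels)      # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior of the label
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothed word likelihood
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical), already preprocessed like reviewText
train_texts = ["love game awesome", "great fun love",
               "app froze stupid", "waste money stupid"]
train_labels = [1, 1, 0, 0]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "love fun"), predict(model, "stupid app"))  # prints "1 0"
```

To compare fairly with the 79% figure, train on one part of df and measure accuracy on a held-out part using the Positive column as the target.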

## Conclusion
NLTK is a powerful and flexible library for performing sentiment analysis and other natural language processing tasks in Python. NLTK allows you to preprocess text data, convert it into a bag-of-words model, and perform sentiment analysis using the VADER sentiment analyzer.

In this tutorial, we explored the basics of sentiment analysis with NLTK: preprocessing text data, scoring reviews with NLTK's VADER analyzer, and evaluating the predictions against the true labels with a confusion matrix and a classification report.

Overall, NLTK is a powerful and widely used tool for performing sentiment analysis and other natural language processing tasks in Python. By mastering the techniques and tools presented in this tutorial, you can gain valuable insights into the sentiment of text data and use them to make data-driven decisions in a wide range of applications.