<a href="https://colab.research.google.com/github/peiyulan/Embeddedsystem/blob/master/Notebooks/2_WorkingWithTextInNLTK_2026.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLTK with Lewis Grassic Gibbon First Editions

**Data Source:** [National Library of Scotland Data Foundry](https://data.nls.uk/data/digitised-collections/lewis-grassic-gibbon-first-editions/)

**Code Reference:**
National Library of Scotland. Exploring Lewis Grassic Gibbon First Editions. National Library of Scotland, 2020. https://doi.org/10.34812/gq6w-6e91

**Date:** Feb 23, 2026

**Course:** Text Analysis with NLTK (Week 2); Centre for Data, Culture & Society

## Table of Contents

I. [Loading Data](#Loading_Data)

II. [Pre-processing](#pre-processing)

III. [Data Cleaning](#data_cleaning)

IV. [Analysis](#analysis)


<a id="Loading_Data"></a>
## I. Loading Data

In [None]:
# To load a CSV file with an inventory of the documents in the corpus
import pandas as pd
import numpy as np

# To create data visualizations
import altair as alt
import matplotlib.pyplot as plt

# To perform text analysis
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import PlaintextCorpusReader
from nltk.corpus import twitter_samples, stopwords
from nltk.text import Text
from nltk.probability import FreqDist
from nltk.draw.dispersion import dispersion_plot as displt
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('twitter_samples')

import re # Regular Expressions (RegEx)
import string
import random
import seaborn as sns

To get a sense of the data we're working with, let's have a look at the documentation for the NLTK Twitter sample corpus:

https://www.nltk.org/howto/corpus.html

Now we can follow the instructions in the documentation:

In [None]:
twitter_samples.fileids()

It shows that the corpus is divided into three files: negative tweets, positive tweets, and a third file of tweets that are not labelled by sentiment.

As we want to do sentiment analysis, let's store the labelled data in
`positive_twts` and `negative_twts`.

We can then take a look at the size of the dataset.

In [None]:
positive_twts = twitter_samples.strings('positive_tweets.json')
negative_twts = twitter_samples.strings('negative_tweets.json')

In [None]:
print('Positive Tweets: ' + str(len(positive_twts)))
print('Negative Tweets: ' + str(len(negative_twts)))

The dataset is balanced: each category has 5,000 data points.

Now let's take a look at the actual tweets in the corpus.

In [None]:
print("positive:", positive_twts[5])
print("negative:", negative_twts[5])

---

<a id="pre-processing"></a>
## II. Pre-processing

As we discussed last week, it’s important to clean up the text before starting your analysis. This removes unwanted noise and errors, leading to more reliable results.

In general, pre-processing includes the following steps:

1. Use regular expressions (regex) to remove punctuation and unwanted characters.
2. Normalise the text by converting it to lowercase.
3. Perform tokenisation.
4. Remove stopwords.
5. Apply stemming or lemmatisation.
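The five steps above can be sketched end to end using only the standard library. This is a minimal illustration rather than the approach used later in the notebook: `TINY_STOPWORDS` and `crude_stem` are hypothetical stand-ins for NLTK's stopword list and `PorterStemmer`, and whitespace splitting stands in for a real tokeniser.

```python
import re

# Hypothetical stand-in for stopwords.words('english')
TINY_STOPWORDS = {"the", "is", "a", "and", "are", "to"}

def crude_stem(word):
    """Very rough stand-in for a stemmer: strips a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"[^\w\s]", "", text)   # 1. remove punctuation
    text = text.lower()                    # 2. normalise to lowercase
    tokens = text.split()                  # 3. tokenise (whitespace only)
    tokens = [t for t in tokens if t not in TINY_STOPWORDS]  # 4. remove stopwords
    return [crude_stem(t) for t in tokens]  # 5. stem

print(preprocess("The cats are playing, and the dog is sleeping!"))
```

The order matters: lowercasing before stopword removal ensures that "The" and "the" are both caught by the same stopword entry.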

### Regular Expressions (RegEx)

* **WHAT? Pattern matching strings in Python**
* **WHY? To find specific words or phrases, or variations of a particular word or phrase**
    * Once found, they can be replaced, so this is useful for cleaning text with digitization errors. Optical Character Recognition (OCR) and Handwriting Recognition (HWR or HTR) technologies are imperfect, so you will find errors in digitized text corpora (unless of course they've been manually reviewed and corrected).
* **HOW? Combinations of special characters with a RegEx compiler**
    * In programming, a *compiler* translates code from one programming language to another.  In a sense, RegEx is a language that can sit on top of Python.  RegEx works with Python data types and syntax but it also has its own special characters and methods that plain Python doesn't use.
    
Resource for practice with and testing Regular Expressions: [Regex101.com](https://regex101.com): also check out [W3Schools](https://www.w3schools.com/python/python_regex.asp) for the cheat sheet it provides!

In [None]:
# # To use Regular Expressions (RegEx)
# import re

To replace a substring (a selection of characters in a string), we can use `re.sub(pattern, replacement, text)`. To remove the substring instead, pass an empty string (either `""` or `''`) as the second input.

In [None]:
txt = "Jenny and Josh are having lunch."
new_txt = re.sub(r'Jenny', 'Peter', txt)
print(txt)
print(new_txt)

1. `[]`	A set of characters, e.g. `[a-m]` matches any lowercase letter from a to m
2. `\w`	Matches any word character (letters a to z and A to Z, digits 0-9, and the underscore `_`)
3. `+`	One or more occurrences of the preceding element
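To see how these three building blocks behave on their own and in combination, here is a small sketch using `re.findall()`, which returns every match as a list. The sample strings are made up for illustration.

```python
import re

# [] : a set of characters; the range is case-sensitive, so "J" is not matched
in_range = re.findall(r"[a-m]", "Jam")

# \w : a single word character (letter, digit, or underscore)
word_chars = re.findall(r"\w", "a_1!")

# + : one or more of the preceding element, so \w+ matches whole "words"
words = re.findall(r"\w+", "hi there_2 !")

# Combined: an @ sign followed by one or more word characters
mentions = re.findall(r"@\w+", "@BhaktisBanter @PallaviRuhail hi")

print(in_range)
print(word_chars)
print(words)
print(mentions)
```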

In [None]:
txt = "@BhaktisBanter @PallaviRuhail This one is irresistible :)"
new_txt = re.sub(r'@[\w]+', '', txt)
print(txt)
print(new_txt)

Now let's try to remove links in the following text:

In [None]:
txt = "#FlipkartFashionFriday http://t.co/EbZ0L2VENM"

new_txt = re.sub(r'YOUR CODE HERE', '', txt)
print(txt)
print(new_txt)
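If you get stuck, here is one possible pattern for the exercise above. It is only a sketch: it assumes every link starts with `http://` or `https://` and runs until the next whitespace character (`\S` means "any non-whitespace character"). Other patterns would work too.

```python
import re

txt = "#FlipkartFashionFriday http://t.co/EbZ0L2VENM"

# "https?" matches "http" with an optional "s"; "\S+" consumes the rest of the link
no_links = re.sub(r"https?://\S+", "", txt).strip()

print(txt)
print(no_links)
```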

In [None]:
def cleaning(tweet):
    """
    Preprocesses a tweet by removing hashes, RTs, @mentions,
    links, stopwords and punctuation, tokenizing and stemming
    the words.

    Accepts:
        tweet {str} -- tweet string

    Returns:
        {str}
    """
    ## Your code here
    clean_twt = tweet  # placeholder so the cell runs; replace with your cleaning steps
    return clean_twt

In [None]:
clean_positive_twts = [cleaning(tweet) for tweet in positive_twts]
clean_negative_twts = [cleaning(tweet) for tweet in negative_twts]
print(clean_positive_twts[5])
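One possible way to fill in the regex part of `cleaning()` is sketched below. It is not the only solution, and the name `cleaning_sketch` is ours so that it does not overwrite your own function. It handles only the steps that regular expressions do well (retweet markers, links, @mentions, hash signs) and leaves punctuation, stopwords and stemming to the later steps.

```python
import re

def cleaning_sketch(tweet):
    """Remove retweet markers, links, @mentions and hash signs from a tweet."""
    tweet = re.sub(r"^RT\s+", "", tweet)        # leading retweet marker
    tweet = re.sub(r"https?://\S+", "", tweet)  # links
    tweet = re.sub(r"@\w+", "", tweet)          # @mentions
    tweet = re.sub(r"#", "", tweet)             # hash sign only; the word is kept
    tweet = re.sub(r"\s+", " ", tweet)          # collapse repeated whitespace
    return tweet.strip()

print(cleaning_sketch("RT @user Loving #NLTK today! http://t.co/abc123"))
```

Keeping the hashtag word while dropping the `#` is a design choice: hashtags often carry sentiment ("#happy"), so deleting them entirely could throw away useful signal.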

### Tokenization

In [None]:
tokenizer = TweetTokenizer()
twt_tokens = tokenizer.tokenize(clean_positive_twts[5])
print(twt_tokens)

In [None]:
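pos_tokens = [tokenizer.tokenize(tweet) for tweet in clean_positive_twts]
neg_tokens = [tokenizer.tokenize(tweet) for tweet in clean_negative_twts]  # needed in the lowercasing step
print(pos_tokens[5])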

### Lowercasing

Let's casefold the tokens so that capitalized and lowercased versions of a word are treated as the same word:

In [None]:
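def lowercase_all(words):
    # your code here: return a new list with every word lowercased
    return words  # placeholder so the cell runs; returns the input unchanged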


txt = ["SAMPLE", "Text"]
new_text = lowercase_all(txt)
print(new_text)

In [None]:
pos_tokens = [lowercase_all(token) for token in pos_tokens]
neg_tokens = [lowercase_all(token) for token in neg_tokens]
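A possible solution for `lowercase_all()` is sketched below, under the name `lowercase_all_sketch` so it does not replace your own attempt. It uses `str.casefold()`, which behaves like `lower()` for English text but is more aggressive for some other languages (for example, it maps the German "ß" to "ss").

```python
def lowercase_all_sketch(words):
    """Return a new list with every word casefolded."""
    return [word.casefold() for word in words]

txt = ["SAMPLE", "Text"]
print(lowercase_all_sketch(txt))
```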

### Remove Stopwords

Next, we exclude stopwords using `stopwords.words(language)`. The loop below also drops tokens that are punctuation marks:

In [None]:
pos_tokens_nosd = []
stopwords_en = stopwords.words('english')
for word in pos_tokens[5]:
    if word not in stopwords_en and word not in string.punctuation:
        pos_tokens_nosd.append(word)


In [None]:
print(stopwords_en)

In [None]:
print(pos_tokens[5])
print(pos_tokens_nosd)

In [None]:
def tokenise_tweet(tweet):
    clean_tokens = []
    twt_tokens = tokenizer.tokenize(tweet)  # tokenise a single tweet string

    # lowercasing
    #### Your code here

    # is alphabetical
    #### Your code here


    # remove stopword
    #### Your code here

    return clean_tokens

In [None]:
pos_twt_token = [tokenise_tweet(tweet) for tweet in clean_positive_twts]
neg_twt_token = [tokenise_tweet(tweet) for tweet in clean_negative_twts]
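The three filtering steps in `tokenise_tweet()` could be filled in along the following lines. To keep the sketch self-contained, the tokeniser and the stopword list are passed in as parameters: `str.split` and `DEMO_STOPWORDS` are hypothetical stand-ins, and in the notebook you would pass `tokenizer.tokenize` and `set(stopwords.words('english'))` instead.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
DEMO_STOPWORDS = frozenset({"i", "the", "a", "is", "this", "to", "and"})

def tokenise_tweet_sketch(tweet, tokenize=str.split, stop_words=DEMO_STOPWORDS):
    """Tokenise a tweet, then lowercase, keep alphabetic tokens, drop stopwords."""
    clean_tokens = []
    for token in tokenize(tweet):
        token = token.lower()        # lowercasing
        if not token.isalpha():      # is alphabetical (drops ":)", "123", "!")
            continue
        if token in stop_words:      # remove stopword
            continue
        clean_tokens.append(token)
    return clean_tokens

print(tokenise_tweet_sketch("This one is irresistible :)"))
```

Note that with plain whitespace splitting a token such as "today!" fails the `isalpha()` check and is dropped whole, whereas `TweetTokenizer` would separate the word from the punctuation first. Using a `set` for the stopwords also makes each membership test faster than searching a list.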

In [None]:
print("original:\n", positive_twts[5])
print("\nafter regex:\n", clean_positive_twts[5])
print("\ntokenised, lowercased, stopwords removed:\n", pos_twt_token[5])

<a id="analysis"></a>
## IV. Analysis

In [None]:
import itertools

def freq_plot(corpus, n):
    word = list(itertools.chain.from_iterable(corpus))
    fdist = FreqDist(word)
    plt.figure(figsize = (20, 8))
    plt.rc('font', size=12)
    fdist.plot(n, title=f'Frequency Distribution for {n} Most Common Tokens in the Dataset (excluding stop words)')
    plt.show()

In [None]:
freq_plot(pos_twt_token, 100) ## Try increasing or decreasing this number to view more or fewer tokens in the visualization
freq_plot(neg_twt_token, 100)
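To see what `freq_plot()` is counting before it draws anything, here is a small sketch with `collections.Counter`, which NLTK's `FreqDist` builds on. The `toy_corpus` below is made up for illustration; it has the same shape as `pos_twt_token` (a list of token lists).

```python
from collections import Counter
from itertools import chain

# Made-up corpus: one list of tokens per tweet
toy_corpus = [["good", "morning"], ["good", "day"], ["happy", "day", "good"]]

# Flatten the list of lists into one stream of tokens, then count them
counts = Counter(chain.from_iterable(toy_corpus))

# The n most frequent tokens, as (token, count) pairs
top_two = counts.most_common(2)
print(top_two)
```

The same `most_common(n)` call works on a `FreqDist`, which is handy when you want the numbers behind the plot rather than the plot itself.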

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = list(itertools.chain.from_iterable(pos_twt_token))
text = " ".join(text)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud")
plt.show()

Next, we use NLTK’s VADER sentiment analyser (a lexicon-based tool) as an example of how to perform sentiment analysis.

**VADER (Valence Aware Dictionary and sEntiment Reasoner)** is a sentiment analysis tool designed to capture sentiments expressed in **social media**, which makes it particularly effective in analysing tweets.

One advantage of VADER is that it does not only classify text as positive or negative; it also indicates the intensity of the sentiment. In the output, it provides separate scores for the positive, negative, and neutral components.

In addition, VADER returns a fourth value, the compound score. This is a normalised metric that aggregates the overall sentiment and ranges from –1 to 1. Based on this score, we determine whether the overall sentiment is positive, negative, or neutral.
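To give a feel for what "normalised" means here, the sketch below reproduces the squashing step as we understand it from VADER's reference implementation: the summed word valences `x` are mapped to `x / sqrt(x² + alpha)` with `alpha = 15`. Treat the function name and the example inputs as illustrative assumptions rather than part of the NLTK API.

```python
import math

def normalise_compound(raw_sum, alpha=15):
    """Squash an unbounded sum of valence scores into the range -1 to 1."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Guard against floating-point values just outside the range
    return max(-1.0, min(1.0, score))

# Illustrative raw sums: the larger the magnitude, the closer to -1 or 1
for raw in (-8, -1, 0, 1, 8):
    print(raw, round(normalise_compound(raw), 4))
```

Because of this squashing, the compound score grows quickly for the first few sentiment-bearing words and then flattens out, which is why small thresholds such as ±0.05 are enough to separate neutral text from mildly positive or negative text.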

In [None]:
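import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # lexicon required by SentimentIntensityAnalyzer

# Create the analyser once and reuse it for every call
sia = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    """Return a sentiment label and the compound score for a text."""
    compound = sia.polarity_scores(text)['compound']
    if compound >= 0.05:
        return ['positive, score:', compound]
    elif compound <= -0.05:
        return ['negative, score:', compound]
    else:
        return ['neutral, score:', compound]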

In [None]:
sample_text = "thank you!"
print(get_vader_sentiment(sample_text))

In [None]:
print(get_vader_sentiment("thank"))
print(get_vader_sentiment("you!"))