# Text Cleaning & Preprocessing Pipeline

In this project, I will build a Python script to process raw text files by applying basic NLP techniques. The goal is to transform unstructured text into a cleaner, more structured format, making it more suitable for further analysis.

The script will take a raw text file and apply fundamental NLP preprocessing steps to extract meaningful information such as keywords and named entities. These techniques will enhance the text’s usability for advanced NLP tasks like summarization, sentiment analysis, or topic modeling.

This project aims to provide hands-on experience with key NLP concepts, tools, and libraries. The main steps include:

- **Tokenization**: Splitting text into words or sentences.
- **Stopword Removal**: Removing common words (e.g., "the," "is," "and") that add little meaning.
- **Stemming & Lemmatization**: Reducing words to their root forms.
- **Part-of-Speech (POS) Tagging**: Assigning grammatical categories to words.
- **Named Entity Recognition (NER)**: Identifying proper names, places, and organizations.

### List of Contents

1. Read & Load the Text Data

2. Tokenization

3. Stopword Removal

4. Stemming & Lemmatization

5. Part-of-Speech (POS) Tagging

6. Keyword Extraction
    * Named Entity Recognition (NER)
    * Noun Chunks



### 1. Read & Load the Text Data

First, I will load a text file (in .txt format) and remove extra whitespace, special characters, and extra line breaks. I will also convert the text to lowercase for processing.

#### Why is cleaning text a crucial preprocessing step in Natural Language Processing?

Algorithms do not interpret text the same way people do. For you and me, the separation between words is necessary to understand what is written. For NLP models, however, extra whitespace, line breaks, and stray characters only add noise and carry no meaning.

Text preprocessing ensures that only the parts that carry meaning are fed to the model, which helps it perform better.

Of course, this has to be carried out with some care, for instance when lowercasing. "Apple" and "apple" are the same word but, in some contexts, they can have different meanings (the first may refer to the company and the latter to the fruit).

In [1]:
import re

In [2]:
# Load the text file

def load_text(file_path):
    """Reads a text file and returns the content as a string."""
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

In [3]:
file_path = '/Users/ezequ/Documents/Python/projects/intelligence.txt'

In [4]:
raw_text = load_text(file_path)
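If the path is wrong, `open` fails with a fairly terse error. As an optional variation, here is a minimal sketch of a loader that checks the path first. The name `load_text_checked` and the temporary sample file are my own assumptions for illustration, not part of the pipeline above.

```python
import tempfile
from pathlib import Path

def load_text_checked(file_path, encoding="utf-8"):
    """Hypothetical variant of load_text that fails early with a clear message."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"No text file found at: {path}")
    return path.read_text(encoding=encoding)

# Demo on a throwaway file so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("Intelligence has been defined in many ways.", encoding="utf-8")
    loaded = load_text_checked(sample_path)

print(loaded)
```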

In [5]:
# Clean the text

def clean_text(text):
    """Cleans raw text by removing special characters, extra spaces, and converting to lowercase."""
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\[\d+\]', '', text)  # Remove bracketed references such as [12]
    text = re.sub(r'[,:;]', '', text)  # Remove commas, colons, and semicolons
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces and newlines
    return text.strip()

In [6]:
cleaned_text = clean_text(raw_text)

In [None]:
print("Original Text:\n", raw_text)
print("\nCleaned Text:\n", cleaned_text)

Now we have a cleaned version of the original text.
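To see what each step does without opening the full file, here is a small sketch that applies the same four steps to a made-up sentence. The function name `clean_text_demo` and the sample string are assumptions for illustration only. Note how "Apple" and "apple" become indistinguishable after lowercasing, which is the caveat mentioned earlier.

```python
import re

def clean_text_demo(text):
    """Stand-alone copy of the cleaning steps, for illustration only."""
    text = text.lower()                   # lowercase everything
    text = re.sub(r'\[\d+\]', '', text)   # drop bracketed references like [3]
    text = re.sub(r'[,:;]', '', text)     # drop commas, colons, semicolons
    text = re.sub(r'\s+', ' ', text)      # collapse spaces and newlines
    return text.strip()

sample = "Apple released a new phone;[3]\n\n  critics, however,   ate an apple."
cleaned_sample = clean_text_demo(sample)
print(cleaned_sample)
```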

### 2. Tokenization

#### What does tokenization mean?

Tokenization is the process of *splitting text* into smaller units. Sentence tokenization means splitting a text into sentences, whereas word tokenization means splitting sentences into individual words.
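Before using the libraries, the idea can be sketched with plain regular expressions. This is a deliberately naive approximation under my own assumptions (the function names and the sample text are made up), and it is not how NLTK or spaCy tokenize internally; real tokenizers handle abbreviations, quotes, and many other edge cases.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_word_tokenize(sentence):
    """Return words (keeping internal hyphens/apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "Intelligence involves problem-solving. It also requires self-awareness!"
demo_sentences = naive_sent_tokenize(text)
demo_words = naive_word_tokenize(demo_sentences[0])

print(demo_sentences)
print(demo_words)
```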

#### NLTK and spaCy

Both NLTK and spaCy are Python libraries for Natural Language Processing (NLP), but they have different purposes.

NLTK is flexible and includes a wide range of NLP algorithms, making it ideal for research and educational use. However, its extensive resources often require more manual tuning, and it is not optimized for large-scale, industrial applications.

spaCy, on the other hand, is designed for speed and production-ready applications. While it is less flexible than NLTK for custom text processing, it is significantly faster and comes with efficient pre-trained language models (which need to be downloaded separately).

In [8]:
# Import necessary libraries for tokenization

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [None]:
# Download required resources

nltk.download('punkt')
nltk.download('punkt_tab')

In [9]:
def nltk_tokenize(text):
    """Tokenizes text into sentences and words using NLTK."""
    sentences = sent_tokenize(text)  # Sentence tokenization
    words = word_tokenize(text)  # Word tokenization
    return sentences, words

In [10]:
# Tokenize the cleaned version of the text

sentences, words = nltk_tokenize(cleaned_text)

In [None]:
print("Sentence Tokenization (First 3 sentences):")
for i, sentence in enumerate(sentences[:3], 1):
    print(f"{i}. {sentence}")

In [None]:
print("\n Word Tokenization (First 20 words):")
print(words[:20])

Now using spaCy:

In [35]:
import spacy

In [39]:
# Load spaCy model
nlp = spacy.load("en_core_web_sm")

In [41]:
def spacy_tokenize(text):
    """Tokenizes text into sentences and words using spaCy."""
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]  # Sentence tokenization
    words = [token.text for token in doc]  # Word tokenization
    return sentences, words

In [42]:
spacy_sentences, spacy_words = spacy_tokenize(cleaned_text)

In [None]:
print("spaCy Sentence Tokenization (First 3 sentences):")
for i, sentence in enumerate(spacy_sentences[:3], 1):
    print(f"{i}. {sentence}")

In [None]:
print("\nspaCy Word Tokenization (First 20 words):")
print(spacy_words[:20])

In [None]:
all_match = True  

for i, sentence in enumerate(sentences):
    if sentence not in spacy_sentences:
        print(f"Sentence number {i} is not equal to the sentence using the spaCy model")
        all_match = False  

if all_match:
    print("Sentence tokenization is the same with both models")  


In [None]:
different_words = []

for word in words:
    if word not in spacy_words:  
        different_words.append(word)

print(different_words)

Here, NLTK handles compound words like 'self-awareness' and 'problem-solving' better: it keeps them as single tokens, whereas spaCy splits them at the hyphen.
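To make that difference concrete, here is a small sketch that compares two token lists in both directions using sets. The two lists are hand-written assumptions that mimic how each library treats hyphenated words; they are not actual output from this notebook.

```python
# Hand-written token lists (hypothetical), mimicking the two tokenizers
nltk_style = ["self-awareness", "is", "key", "to", "problem-solving", "."]
spacy_style = ["self", "-", "awareness", "is", "key", "to", "problem", "-", "solving", "."]

# Set difference shows which tokens are unique to each side
only_nltk = sorted(set(nltk_style) - set(spacy_style))
only_spacy = sorted(set(spacy_style) - set(nltk_style))

print("Only in the NLTK-style list:", only_nltk)
print("Only in the spaCy-style list:", only_spacy)
```

Using sets also avoids scanning a full list for every word, which matters once the text gets long.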

### 3. Stopword Removal


Stopwords are commonly occurring words like "the", "is", and "and" that don’t carry much meaningful information in many NLP tasks. Removing them helps reduce noise and improve the performance of models.
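The mechanics are simple enough to sketch without any library. The stopword set below is a tiny made-up sample, not NLTK's or spaCy's list, and `remove_stopwords_demo` is a hypothetical name.

```python
# Tiny illustrative stopword set (an assumption, far smaller than the real lists)
DEMO_STOPWORDS = {"the", "is", "and", "a", "of"}

def remove_stopwords_demo(tokens, stop_words=DEMO_STOPWORDS):
    """Keep only tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "mind", "is", "a", "collection", "of", "abilities"]
kept = remove_stopwords_demo(tokens)
print(kept)
```

Comparing the lowercase form matters when the tokens have not been lowercased beforehand; in this notebook the text is already lowercase, so comparing tokens as-is also works.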

In [13]:
from nltk.corpus import stopwords

In [None]:
# Download NLTK stopwords

nltk.download('stopwords')

In [15]:
def nltk_remove_stopwords(words):
    """Removes stopwords from the tokenized words using NLTK."""
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    return filtered_words

In [None]:
filtered_words = nltk_remove_stopwords(words)

print("Words after Stopword Removal (First 20 words):")
print(filtered_words[:20])


Now with spaCy:

In [53]:
def spacy_remove_stopwords(words):
    """Removes stopwords from the tokenized words using spaCy."""
    stop_words = nlp.Defaults.stop_words
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return filtered_words

In [54]:
filtered_words_spacy = spacy_remove_stopwords(spacy_words)

In [None]:
print("Words after Stopword Removal (First 20 words):")
print(filtered_words_spacy[:20])

### 4. Stemming & Lemmatization

Stemming and lemmatization are text preprocessing techniques that reduce word variants to one base form [1]. Even though the goal is the same, the approach is different: **stemming** merely removes common suffixes from the end of word tokens, whereas **lemmatization** ensures the output word is an existing, normalized form that can be found in the dictionary.

There is a trade-off between these two processes. Lemmatization is the more robust technique, but it is also more computationally intensive. **Stemming** is preferable when **speed matters** and some errors are acceptable (e.g., large-scale text processing or quick search indexing). **Lemmatization** is better for **accuracy**, especially in tasks that depend on proper word meaning (e.g., NLP models, sentiment analysis, and linguistics-heavy applications). If computational cost is not an issue, **lemmatization is usually the better choice**, since it preserves the actual dictionary form of words.

#### Why is this necessary?

Natural language is highly redundant, and *a single concept can be represented in multiple ways* (e.g., run, running, ran, runs). Some words have multiple inflected forms (study vs. studies, am vs. is vs. are) that may end up being treated as separate words. **Reducing the number of unique words in a dataset while retaining their true meaning** ensures that similar words are recognized as the same. This has the added benefit of requiring **less memory and computation**, making text processing more efficient.
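The contrast between the two approaches can be sketched with toy versions of each. Both functions below are my own simplifications for illustration: `toy_stem` is a crude suffix stripper (not the Snowball algorithm), and `toy_lemmatize` is a lookup in a tiny hand-written dictionary (not WordNet).

```python
def toy_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma dictionary (an assumption for this sketch)
TOY_LEMMAS = {"studies": "study", "running": "run", "ran": "run", "runs": "run"}

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

demo_forms = ["studies", "running", "ran", "runs"]
demo_stems = [toy_stem(w) for w in demo_forms]
demo_lemmas = [toy_lemmatize(w) for w in demo_forms]

print(demo_stems)
print(demo_lemmas)
```

The stems include non-words such as "stud" and "runn", and the irregular form "ran" is left untouched, while the dictionary lookup maps all three verb forms to "run". That is the speed-versus-accuracy trade-off in miniature.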



First, I will apply stemming:

In [17]:
from nltk.stem import SnowballStemmer

In [18]:
def nltk_stemming(words):
    """Applies stemming to words using NLTK's SnowballStemmer."""
    stemmer = SnowballStemmer("english")  # Create the stemmer once and reuse it
    stemmed_words = [stemmer.stem(word) for word in words]
    return stemmed_words

In [19]:
stemmed_words = nltk_stemming(filtered_words)

In [None]:
print("Words after Stemming (First 20 words):")
print(stemmed_words[:20])

Many of the stems in this list are not real dictionary words, which is expected because stemming only strips suffixes.

Now let's try lemmatizing:

In [20]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [None]:
# Download WordNet for lemmatization
nltk.download('wordnet')

In [22]:
def nltk_lemmatization(words):
    """Applies lemmatization using NLTK's WordNetLemmatizer."""
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return lemmatized_words

In [23]:
lemmatized_words_nltk = nltk_lemmatization(filtered_words)

In [None]:
print("Words after Lemmatization:")
print(lemmatized_words_nltk)
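One caveat: `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, so verb forms can come back unchanged. A possible refinement, sketched here under my own assumptions, is to map Penn Treebank tags (the kind `nltk.pos_tag` returns) to WordNet's single-letter POS codes. The helper name `penn_to_wordnet` and the sample pairs are hypothetical.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r' or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, like the lemmatizer itself

# Hand-written (word, tag) pairs for illustration
tagged_demo = [("defined", "VBD"), ("human", "JJ"), ("quickly", "RB"), ("abilities", "NNS")]
mapped_demo = [(word, penn_to_wordnet(tag)) for word, tag in tagged_demo]
print(mapped_demo)
```

Each code could then be passed along as `lemmatizer.lemmatize(word, pos=code)`.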

Now with spaCy:

In [56]:
def spacy_lemmatization(words):
    """Applies lemmatization using spaCy."""
    doc = nlp(" ".join(words)) 
    lemmatized_words = [token.lemma_ for token in doc]
    return lemmatized_words

In [57]:
lemmatized_words_spacy = spacy_lemmatization(filtered_words_spacy)

In [None]:
print("Words after Lemmatization with spaCy (First 20 words):")
print(lemmatized_words_spacy[:20])

### 5. Part-of-Speech (POS) Tagging

POS tagging is the process of labeling words with their grammatical roles, such as nouns, verbs, adjectives, and adverbs.

In [25]:
from nltk import pos_tag

In [None]:
# Download required NLTK data
nltk.download('averaged_perceptron_tagger_eng')

In [31]:
def nltk_pos_tagging(words):
    """Assigns Part-of-Speech (POS) tags using NLTK."""
    return pos_tag(words)

In [32]:
pos_tags_nltk = nltk_pos_tagging(lemmatized_words_nltk)

In [None]:
print("POS Tags (First 20 words):")
print(pos_tags_nltk[:20])


Now with spaCy:

In [61]:
def spacy_pos_tagging(words):
    """Assigns Part-of-Speech (POS) tags using spaCy."""
    doc = nlp(" ".join(words))  
    return [(token.text, token.pos_) for token in doc]

In [62]:
pos_tags_spacy = spacy_pos_tagging(lemmatized_words_spacy)

In [None]:
print("POS Tags (First 10 words using spaCy):")
print(pos_tags_spacy[:10])

Now I am going to compare how the two models classify words, focusing on nouns, verbs, and adjectives, which are typically the most common word types.

In [None]:
# Nouns

# Match every Penn Treebank noun tag (NN, NNS, NNP, NNPS), not only singular nouns
noun_list_nltk = {word[0] for word in pos_tags_nltk if word[1].startswith('NN')}
noun_list_spacy = {word[0] for word in pos_tags_spacy if word[1] in ('NOUN', 'PROPN')}

diff_nltk = [word for word in noun_list_nltk if word not in noun_list_spacy]
diff_spacy = [word for word in noun_list_spacy if word not in noun_list_nltk]

print("These words are recognized as nouns in the NLTK model but not in the spaCy model:", diff_nltk)
print("These words are recognized as nouns in the spaCy model but not in the NLTK model:", diff_spacy)


In [None]:
# Verbs

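# Match every Penn Treebank verb tag (VB, VBD, VBG, VBN, VBP, VBZ), not only past tense
verb_list_nltk = {word[0] for word in pos_tags_nltk if word[1].startswith('VB')}
# Use == here: ('VERB') is just a string, so `in` would do a substring check
verb_list_spacy = {word[0] for word in pos_tags_spacy if word[1] == 'VERB'}

verb_diff_nltk = [word for word in verb_list_nltk if word not in verb_list_spacy]
verb_diff_spacy = [word for word in verb_list_spacy if word not in verb_list_nltk]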

print("These words are recognized as verbs in the NLTK model but not in the spaCy model:", verb_diff_nltk)
print("These words are recognized as verbs in the spaCy model but not in the NLTK model:", verb_diff_spacy)


In [None]:
# Adjectives

# Match every Penn Treebank adjective tag (JJ, JJR, JJS), not only the base form
adj_list_nltk = {word[0] for word in pos_tags_nltk if word[1].startswith('JJ')}
adj_list_spacy = {word[0] for word in pos_tags_spacy if word[1] == 'ADJ'}

adj_diff_nltk = [word for word in adj_list_nltk if word not in adj_list_spacy]
adj_diff_spacy = [word for word in adj_list_spacy if word not in adj_list_nltk]

print("These words are recognized as adjectives in the NLTK model but not in the spaCy model:", adj_diff_nltk)
print("These words are recognized as adjectives in the spaCy model but not in the NLTK model:", adj_diff_spacy)
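The three comparison cells share the same shape, so they could be folded into one helper. This is only a sketch: `compare_pos` is a hypothetical name, and the two tagged lists are hand-written samples rather than output from this notebook. NLTK tags are matched by Penn Treebank prefix and spaCy tags by membership in a tuple of universal labels.

```python
def compare_pos(nltk_tags, spacy_tags, penn_prefix, universal_labels):
    """Return (words only NLTK put in the class, words only spaCy put in the class)."""
    nltk_words = {word for word, tag in nltk_tags if tag.startswith(penn_prefix)}
    spacy_words = {word for word, tag in spacy_tags if tag in universal_labels}
    return sorted(nltk_words - spacy_words), sorted(spacy_words - nltk_words)

# Hand-written samples (hypothetical) in the (word, tag) format used above
nltk_sample = [("intelligence", "NN"), ("abilities", "NNS"), ("learn", "VB"),
               ("human", "JJ"), ("defined", "VBD")]
spacy_sample = [("intelligence", "NOUN"), ("ability", "NOUN"), ("learn", "VERB"),
                ("human", "NOUN"), ("define", "VERB")]

noun_only_nltk, noun_only_spacy = compare_pos(nltk_sample, spacy_sample, "NN", ("NOUN", "PROPN"))
verb_only_nltk, verb_only_spacy = compare_pos(nltk_sample, spacy_sample, "VB", ("VERB",))

print(noun_only_nltk, noun_only_spacy)
print(verb_only_nltk, verb_only_spacy)
```

Note the trailing comma in `("VERB",)`: without it the value is a plain string, and `in` would perform a substring check instead of a membership test.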


From this, I can conclude that:

* NLTK did a good job with **compound words**, correctly recognizing and classifying them.
* spaCy did a great job at lemmatization, assigning present-tense verb forms, which later proved crucial for classification.
* Overall, spaCy performed better, with fewer misclassifications and more effective lemmatization.

For these reasons, I will stick with spaCy from here on.

### 6. Keyword Extraction


#### Named Entity Recognition (NER)

This is the part in which we identify and categorize *named entities*, such as names of people, organizations, locations and dates.

In [92]:
def spacy_ner(text):
    """Extracts named entities from text using spaCy."""
    doc = nlp(text)  # Convert text to a spaCy document
    return [(ent.text, ent.label_) for ent in doc.ents]

In [96]:
# Remove text references

text_nref = re.sub(r'\[\d+\]', '', raw_text)

In [98]:
entities_spacy = spacy_ner(text_nref)

In [None]:
print("Named Entities:")
print(entities_spacy)
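A flat list of `(text, label)` pairs gets hard to read for longer texts. One option, sketched here under my own assumptions, is to group the entities by label and drop duplicates. The helper name `group_entities` and the sample pairs are made up for illustration; the labels follow the style spaCy uses (PERSON, GPE, ORG, DATE).

```python
from collections import defaultdict

def group_entities(entities):
    """Group (text, label) pairs into {label: sorted unique texts}."""
    grouped = defaultdict(set)
    for text, label in entities:
        grouped[label].add(text.strip())
    return {label: sorted(texts) for label, texts in grouped.items()}

# Hand-written sample in the same format spacy_ner returns
sample_entities = [
    ("Alfred Binet", "PERSON"),
    ("France", "GPE"),
    ("1905", "DATE"),
    ("Alfred Binet", "PERSON"),
    ("Stanford University", "ORG"),
]

grouped_entities = group_entities(sample_entities)
for label, texts in grouped_entities.items():
    print(f"{label}: {texts}")
```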


#### Noun Chunks

These are phrases that usually contain key concepts (e.g., "machine learning model", "financial market trends").

In [100]:
from collections import Counter

In [103]:
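def extract_keywords_spacy(text, top_n=10, min_length=4):
    """Extracts the most frequent keyword phrases using spaCy's noun chunks."""
    doc = nlp(text)  # Reuse the model loaded earlier instead of reloading it

    keywords = [chunk.text.lower() for chunk in doc.noun_chunks if len(chunk.text) >= min_length]
    # Keep chunks made only of letters and spaces, and drop bare stopwords.
    # str.isalpha() alone would reject every multi-word chunk because of the spaces.
    # No set() here: deduplicating first would make every count equal to 1.
    keywords = [kw for kw in keywords
                if kw.replace(" ", "").isalpha() and kw not in nlp.Defaults.stop_words]

    keyword_counts = Counter(keywords).most_common(top_n)

    return keyword_counts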

In [None]:
print("Keywords (Noun Chunks):")
print(extract_keywords_spacy(text_nref))
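When ranking phrases by frequency, the counting has to happen before any deduplication, otherwise every phrase ends up with a count of 1. Here is a small sketch with a hand-written list of chunks (hypothetical data, not output from this notebook):

```python
from collections import Counter

# Hand-written noun chunks for illustration
demo_chunks = [
    "human intelligence", "the ability", "human intelligence",
    "abstract thought", "human intelligence", "the ability",
]

# Counting the raw list preserves frequencies
top_chunks = Counter(demo_chunks).most_common(2)

# Counting after set() flattens every frequency to 1
flattened_counts = Counter(set(demo_chunks))

print(top_chunks)
print(sorted(flattened_counts.values()))
```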

<ins>References</ins>:

[1] https://www.ibm.com/think/topics/stemming-lemmatization