# Introduction: The `Understand` Pillar of Cognitive Systems

The `understand` pillar is about a system's ability to interpret unstructured data, just as a human brain processes the world around it. The most common form of `unstructured data is human language`.

This notebook will guide you through the process of `building a system that can read and comprehend text.` We'll follow the steps a cognitive system takes to move from raw, unstructured data to meaningful, actionable insights.

## Section 1: The Standard NLP Pipeline - A System's Mental Preparation

Before a cognitive system can reason or learn from text, it must first prepare the data. `This pipeline is like the brain's initial sensory processing`, where it filters and organizes information.

### Step-by-step processing using a classic Python library, `NLTK`

As in any data science project, let's begin by suppressing the warnings that would otherwise clutter the cell outputs from time to time.

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
# unstructured text input
raw_text = "Spam alert! This is a GREAT workshop! I learned a lot from their FREE course."

#### Step 1: Lowercasing (Standardization)

This is the system's first attempt at normalization `(you could just as well start with another step, such as noise removal).` By converting all text to lowercase, `we ensure the system recognizes "GREAT" and "great" as the same concept.` <br>This reduces cognitive load and spares the system from learning two different representations for the same word.

`Caution:` A truly advanced cognitive system must know when to break this rule. For `Named Entity Recognition (NER)`, case can be a vital clue to distinguish between "Apple" (the company) and "apple" (the fruit).
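
The trade-off can be sketched with the standard library alone. The `normalize` helper and its `keep_case` flag are hypothetical names introduced for illustration; a real NER system relies on far more than casing.

```python
def normalize(tokens, keep_case=False):
    """Lowercase tokens unless the downstream task needs the original casing."""
    return list(tokens) if keep_case else [t.lower() for t in tokens]

tokens = ["Apple", "sells", "apple", "juice"]

# Lowercasing merges the company and the fruit into one vocabulary entry.
print(len(set(normalize(tokens))))                  # 3 distinct tokens
# Keeping case preserves the distinction for tasks such as NER.
print(len(set(normalize(tokens, keep_case=True))))  # 4 distinct tokens
```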

In [4]:
# The system normalizes its perception of words
text_lower = raw_text.lower()
print(f"Normalized Text: {text_lower}")

Normalized Text: spam alert! this is a great workshop! i learned a lot from their free course.


#### Step 2: Tokenization (Perceptual Segmentation)

Just as the human brain segments a continuous stream of sound into distinct words, `tokenization breaks a sentence into individual units or tokens.` <br>This gives the system a structured way to perceive and process the text.
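
To make the idea concrete before bringing in NLTK, here is a rough regex-based sketch. It is not the algorithm behind `word_tokenize`, which handles contractions, abbreviations, and other edge cases; `simple_tokenize` is a hypothetical helper used only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "spam alert! this is a great workshop!"
print(simple_tokenize(sample))
# ['spam', 'alert', '!', 'this', 'is', 'a', 'great', 'workshop', '!']
```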

Install NLTK (if not already done) and download the necessary data

In [None]:
!pip install nltk   # Natural Language Toolkit



In [None]:
import nltk
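nltk.download('punkt') # Punkt is a pre-trained model that helps NLTK understand where sentences begin and end in text.
# Note: recent NLTK releases may ask for nltk.download('punkt_tab') instead; run that if word_tokenize raises a LookupError.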

[nltk_data] Downloading package punkt to C:\Users\HAMU
[nltk_data]     COMPUTERS\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
from nltk.tokenize import word_tokenize

# The system segments its input into discrete tokens
tokens = word_tokenize(text_lower)
print(f"Perceived Tokens: {tokens}")

Perceived Tokens: ['spam', 'alert', '!', 'this', 'is', 'a', 'great', 'workshop', '!', 'i', 'learned', 'a', 'lot', 'from', 'their', 'free', 'course', '.']


#### Step 3: Stop Word Removal (Attentional Filtering)

A cognitive system must learn to filter out noise and focus on what's important. `Stop words are the linguistic noise—common words (is, a, the)` that provide little semantic value for many tasks. <br>Removing them is an act of attentional filtering, allowing the system to focus its resources on keywords.

In [8]:
# Download stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\HAMU
[nltk_data]     COMPUTERS\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [9]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# The system applies a filter to focus its attention.
# word.isalpha() keeps only purely alphabetic tokens (no punctuation, no numbers).
filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
print(f"Focused Tokens: {filtered_tokens}")

Focused Tokens: ['spam', 'alert', 'great', 'workshop', 'learned', 'lot', 'free', 'course']


#### Step 4: Lemmatization (Conceptual Mapping) (Normalization)

Humans don't just memorize words; we map them to a core concept.   `We know "running," "ran," and "runs" all relate to the action "run."`

`Lemmatization mimics this by reducing words to their base or dictionary form (lemma).` This is a crucial step for conceptual mapping, allowing the system to understand that different word forms represent the same idea.

In [10]:
# Download WordNet data
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to C:\Users\HAMU
[nltk_data]     COMPUTERS\AppData\Roaming\nltk_data...


True

In [11]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [13]:
# The system maps variations to core concepts
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(f"Conceptually Mapped Tokens: {lemmatized_tokens}")

Conceptually Mapped Tokens: ['spam', 'alert', 'great', 'workshop', 'learned', 'lot', 'free', 'course']
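
Notice that 'learned' passes through unchanged in the output above. `WordNetLemmatizer.lemmatize` treats every word as a noun unless told otherwise (its `pos` argument defaults to `'n'`), so verb forms are only reduced when you call it with `pos='v'`. The toy lookup below illustrates why the part of speech matters; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins, not WordNet data or NLTK code.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("learned", "v"): "learn",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("courses", "n"): "course",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("learned"))           # learned  (looked up as a noun)
print(toy_lemmatize("learned", pos="v"))  # learn    (looked up as a verb)
```

In a real pipeline the part of speech would come from a tagger rather than being passed by hand.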


<br>
<br>


## Section 2: The Modern Approach with `spaCy` - An Integrated Cognitive System

Instead of building each component separately, modern cognitive systems use integrated pipelines. `spaCy` is a powerful example, combining all the previous steps into a single, highly optimized workflow.

<br>This is analogous to how a human brain processes information fluidly, without consciously thinking about each sub-step.

In [5]:
# Install the library and its model
!pip install spacy
!python -m spacy download en_core_web_sm



In [14]:
# Import and load the pretrained model
import spacy
nlp = spacy.load("en_core_web_sm") # 'sm' is the small, efficient model

In [15]:
# Messy Data
raw_text = "Hey @UCU_Student, check out this link: https://example.com for a FREE workshop on #NLP in Kampala! It's gonna be lit!"

### Step 1: Process the text with spaCy's NLP pipeline (tokenization, part-of-speech (POS) tagging, lemmatization, etc.)

In [None]:
doc = nlp(raw_text)  # Calling the nlp object like a function runs the entire series of NLP steps on the text at once.

### Step 2: Let's inspect what spaCy gives us for each token

`token.text:` The original token text.

`token.lemma_:` The base (dictionary) form of the token.

`token.is_stop:` Boolean; True if the token is a stop word.

`token.is_punct:` Boolean; True if the token is punctuation.

`token.is_space:` Boolean; True if the token consists of whitespace.

In [22]:
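# Header and rows share the same column widths so the table stays aligned.
# str() is needed because padding a bool directly prints it as 0/1.
print(f"{'Text':<22} {'Lemma':<22} {'Is_Stopword':<13} {'Is_Punct':<10}")
print("-" * 70)

for token in doc:
    print(f"{token.text:<22} {token.lemma_:<22} {str(token.is_stop):<13} {str(token.is_punct):<10}")

Text                   Lemma                  Is_Stopword   Is_Punct
----------------------------------------------------------------------
Hey                    hey                    False         False
@UCU_Student           @ucu_student           False         False
,                      ,                      False         True
check                  check                  False         False
out                    out                    True          False
this                   this                   True          False
link                   link                   False         False
:                      :                      False         True
https://example.com    https://example.com    False         False
for                    for                    True          False
a                      a                      True          False
FREE                   free                   False         False
workshop               workshop               False         False
on                     on                     True          False
#                      #                      False         True
NLP                    NLP                    False         False
in                     in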

### Step 3: Apply our filtering logic: keep if not a stopword, not punctuation, and not a space.

`Use this Logic`

In [30]:
clean_tokens = []


for token in doc:
    # Check the conditions
    if not token.is_stop and not token.is_punct and not token.is_space:
        # If conditions are met, add the lowercase lemma to our list
        clean_tokens.append(token.lemma_.lower())

In [31]:
# Final Result
print("\nClean, Lemmatized Tokens:", clean_tokens)


Clean, Lemmatized Tokens: ['hey', '@ucu_student', 'check', 'link', 'https://example.com', 'free', 'workshop', 'nlp', 'kampala', 'go', 'to', 'light']


`Or this, for a cleaner output`

In [32]:
clean_tokens = []
for token in doc:
    if not token.is_stop and token.is_alpha:
        clean_tokens.append(token.lemma_.lower())

In [33]:
# Final Result
print("\nClean, Lemmatized Tokens:", clean_tokens)


Clean, Lemmatized Tokens: ['hey', 'check', 'link', 'free', 'workshop', 'nlp', 'kampala', 'go', 'to', 'light']
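
The second filter is stricter because `token.is_alpha` is true only for tokens made up entirely of alphabetic characters; spaCy documents it as equivalent to `token.text.isalpha()`. A standard-library sketch of that check on a few token texts from the table above (the `texts` list is copied by hand for illustration):

```python
texts = ["Hey", "@UCU_Student", ",", "link", "https://example.com", "FREE", "#", "NLP"]

# str.isalpha() rejects anything containing '@', '_', ':', '/', '.', ',' or '#'.
alpha_only = [t for t in texts if t.isalpha()]
print(alpha_only)  # ['Hey', 'link', 'FREE', 'NLP']
```

This is why the handle and the URL disappear from the second result without any dedicated rule for mentions or links.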


<br>
<br>

## Section 3: Feature Extraction and Engineering

Now that our cognitive system has a clean representation of the text, it `needs to convert this knowledge into a format it can learn from.` This is where Feature Extraction comes in.

A machine can't understand words; it understands numbers. `TF-IDF (Term Frequency-Inverse Document Frequency)` is a method for knowledge representation. 

It assigns a numerical value to each word `based on how often it appears in a document (TF)` and `how important it is across a collection of documents (IDF).` 

Distinctive keywords like "spam", which appear in only a few documents, get a high score, while words like "the", which appear almost everywhere, get a low score. This lets the system weigh the importance of each piece of information.
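
A hand-rolled sketch of the calculation may help. It assumes the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and skips the L2 normalization that the vectorizer applies to each row afterwards. The three toy documents and the `tfidf` helper are hypothetical.

```python
import math
from collections import Counter

# Three tiny, already tokenized documents (illustrative data only).
docs = [
    ["the", "free", "prize"],
    ["the", "meeting", "agenda"],
    ["the", "project", "report"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    tf = Counter(doc)
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}

weights = tfidf(docs[0])
print(round(weights["the"], 3))   # 1.0   (appears in every document)
print(round(weights["free"], 3))  # 1.693 (appears in one document)
```

The word shared by every document ends up with the lowest weight, while the word unique to one document is weighted higher.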

## Building a Spam Classifier with TF-IDF

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer  # converts raw text into numerical TF-IDF features that a machine learning model can use

from sklearn.naive_bayes import MultinomialNB  # designed for count-like features; in practice it also works with the fractional TF-IDF weights from our TfidfVectorizer

from sklearn.pipeline import make_pipeline

`TfidfVectorizer:` Performs basic text preprocessing (lowercasing, tokenization, optional stop word removal) and feature extraction in one step.

#### Sample labeled dataset
##### 'spam' / 'ham' (ham is the term for non-spam email)

In [36]:
emails = [
    "Win a free prize now! Click here!",         # Spam
    "Your meeting agenda for tomorrow is attached.", # Ham
    "Please review the project report and provide feedback.", # Ham
    "Limited time offer! Buy now and get 50% off!", # Spam
    "Reminder: Your appointment is scheduled for 3 PM.", # Ham
    "You've been selected for a exclusive reward!", # Spam
]
labels = ["spam", "ham", "ham", "spam", "ham", "spam"]

#### Step 1 & 2: Create a Pipeline that does TF-IDF + Model Training

In [None]:
# The TfidfVectorizer handles the basic preprocessing itself: lowercasing, tokenization, and stop word removal (it does not lemmatize)
text_classifier_model = make_pipeline(
    TfidfVectorizer(stop_words='english'), # Applies TF-IDF
    MultinomialNB()                         # Naive Bayes classifier
)

# Train the model with one command
text_classifier_model.fit(emails, labels)
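
To see roughly what the classifier half of the pipeline is doing, here is a minimal standard-library sketch of multinomial Naive Bayes with Laplace smoothing. It is an illustration under simplifying assumptions, not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, and `train_nb`, `predict_nb`, and the tiny training set are hypothetical names and data introduced here. The `alpha=1.0` default mirrors the default smoothing value of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit log class priors and Laplace-smoothed log word likelihoods."""
    vocab = {w for doc in docs for w in doc}
    counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
    priors = {c: math.log(n / len(labels)) for c, n in Counter(labels).items()}
    likelihoods = {}
    for c in priors:
        total = sum(counts[c].values()) + alpha * len(vocab)
        likelihoods[c] = {w: math.log((counts[c][w] + alpha) / total) for w in vocab}
    return priors, likelihoods

def predict_nb(doc, priors, likelihoods):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {
        # Words never seen during training are skipped.
        c: priors[c] + sum(likelihoods[c][w] for w in doc if w in likelihoods[c])
        for c in priors
    }
    return max(scores, key=scores.get)

# Hypothetical, already tokenized training data.
train_docs = [["win", "free", "prize"], ["meeting", "agenda", "attached"],
              ["buy", "now", "free"], ["project", "report", "feedback"]]
train_labels = ["spam", "ham", "spam", "ham"]

priors, likelihoods = train_nb(train_docs, train_labels)
print(predict_nb(["free", "prize", "now"], priors, likelihoods))  # spam
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that was never seen in one class from forcing that class's probability to zero.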

#### Step 3: Test the model on new, unseen emails

In [None]:
new_emails = [
    "Congratulations! You won a free ticket to the conference.", # Clearly spam
    "Hi John, please send me the Q3 sales figures when you get a chance." # Looks ham
]

predictions = text_classifier_model.predict(new_emails)


In [None]:
# the results
for email, label in zip(new_emails, predictions):
    print(f"EMAIL: {email}\nPREDICTION: {label}\n")