# Introduction to Natural Language Processing (NLP)

In this notebook, we introduce basic NLP concepts and
implement simple text processing techniques using Python.


## Goals

- Understand what NLP is
- Perform basic text preprocessing
- Convert text into numerical representations
- Prepare data for machine learning models


In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter

We load a small dataset of nine short text samples, each labeled positive, negative, or neutral.

In [2]:
df = pd.read_csv("datasets/text_samples.csv")
df

Unnamed: 0,id,text,label
0,1,I love artificial intelligence,positive
1,2,Machine learning is fascinating,positive
2,3,I enjoy learning new AI concepts,positive
3,4,Debugging code is frustrating,negative
4,5,I hate software bugs,negative
5,6,Errors make programming stressful,negative
6,7,AI will change the future,neutral
7,8,Technology is evolving rapidly,neutral
8,9,Learning Python is useful,neutral


Text data must be cleaned and normalized before analysis.

Common steps (this notebook applies the first three):
- Lowercasing
- Removing punctuation
- Tokenization
- Removing stopwords
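
The four steps above can be chained into a single function. The following is a minimal sketch rather than the notebook's own method: the `preprocess` helper and the tiny `STOPWORDS` set are assumptions made for illustration, and stopword removal in particular is not performed elsewhere in this notebook.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# real projects usually take a longer list from a library such as NLTK or spaCy.
STOPWORDS = {"i", "is", "the", "will"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stopwords."""
    text = text.lower()                    # 1. lowercasing
    text = re.sub(r"[^a-z\s]", "", text)   # 2. remove punctuation (and digits)
    tokens = text.split()                  # 3. tokenization
    return [tok for tok in tokens if tok not in stopwords]  # 4. stopwords

print(preprocess("AI will change the future!"))
# ['ai', 'change', 'future']
```

Which words count as stopwords depends on the task; for sentiment data, for example, dropping negations such as "not" could remove useful signal.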

In [3]:
df["clean_text"] = df["text"].str.lower()
df

Unnamed: 0,id,text,label,clean_text
0,1,I love artificial intelligence,positive,i love artificial intelligence
1,2,Machine learning is fascinating,positive,machine learning is fascinating
2,3,I enjoy learning new AI concepts,positive,i enjoy learning new ai concepts
3,4,Debugging code is frustrating,negative,debugging code is frustrating
4,5,I hate software bugs,negative,i hate software bugs
5,6,Errors make programming stressful,negative,errors make programming stressful
6,7,AI will change the future,neutral,ai will change the future
7,8,Technology is evolving rapidly,neutral,technology is evolving rapidly
8,9,Learning Python is useful,neutral,learning python is useful


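We remove punctuation using a regular expression. The pattern `[^a-z\s]` drops every character that is not a lowercase letter or whitespace, so digits are removed as well. Our samples contain no punctuation or digits, so the output is unchanged.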

In [4]:
df["clean_text"] = df["clean_text"].apply(
    lambda x: re.sub(r"[^a-z\s]", "", x)
)
df

Unnamed: 0,id,text,label,clean_text
0,1,I love artificial intelligence,positive,i love artificial intelligence
1,2,Machine learning is fascinating,positive,machine learning is fascinating
2,3,I enjoy learning new AI concepts,positive,i enjoy learning new ai concepts
3,4,Debugging code is frustrating,negative,debugging code is frustrating
4,5,I hate software bugs,negative,i hate software bugs
5,6,Errors make programming stressful,negative,errors make programming stressful
6,7,AI will change the future,neutral,ai will change the future
7,8,Technology is evolving rapidly,neutral,technology is evolving rapidly
8,9,Learning Python is useful,neutral,learning python is useful
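
Because the samples in this dataset are already clean, the effect of the pattern is easier to see on a messier string. The sketch below uses a made-up `sample` sentence, and the digit-preserving variant is an assumption added for comparison, not something the notebook uses.

```python
import re

# Made-up example string, already lowercased.
sample = "python 3.12 is out, and it's great!"

# Pattern from the notebook: keep lowercase letters and whitespace only.
letters_only = re.sub(r"[^a-z\s]", "", sample)

# Assumed variant for comparison: also keep digits.
with_digits = re.sub(r"[^a-z0-9\s]", "", sample)

print(letters_only.split())  # ['python', 'is', 'out', 'and', 'its', 'great']
print(with_digits.split())   # ['python', '312', 'is', 'out', 'and', 'its', 'great']
```

Note that removing the apostrophe merges "it's" into "its", and removing the dot turns "3.12" into "312". Whether that is acceptable depends on the downstream task.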


Tokenization splits text into individual units called tokens. Here we split on whitespace, so each token is a word.

In [5]:
df["tokens"] = df["clean_text"].apply(lambda x: x.split())
df

Unnamed: 0,id,text,label,clean_text,tokens
0,1,I love artificial intelligence,positive,i love artificial intelligence,"[i, love, artificial, intelligence]"
1,2,Machine learning is fascinating,positive,machine learning is fascinating,"[machine, learning, is, fascinating]"
2,3,I enjoy learning new AI concepts,positive,i enjoy learning new ai concepts,"[i, enjoy, learning, new, ai, concepts]"
3,4,Debugging code is frustrating,negative,debugging code is frustrating,"[debugging, code, is, frustrating]"
4,5,I hate software bugs,negative,i hate software bugs,"[i, hate, software, bugs]"
5,6,Errors make programming stressful,negative,errors make programming stressful,"[errors, make, programming, stressful]"
6,7,AI will change the future,neutral,ai will change the future,"[ai, will, change, the, future]"
7,8,Technology is evolving rapidly,neutral,technology is evolving rapidly,"[technology, is, evolving, rapidly]"
8,9,Learning Python is useful,neutral,learning python is useful,"[learning, python, is, useful]"
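
A short sketch of how whitespace tokenization behaves on less tidy input. The strings are made up for illustration, and the `re.findall` line shows one possible alternative tokenizer, not the approach used in this notebook.

```python
import re

# Made-up string with a double space and a tab.
text = "machine  learning\tis fascinating"

# split() with no argument splits on any run of whitespace.
print(text.split())     # ['machine', 'learning', 'is', 'fascinating']

# split(" ") splits on single spaces only, leaving an empty string and the tab.
print(text.split(" "))  # ['machine', '', 'learning\tis', 'fascinating']

# A regex tokenizer can skip the separate punctuation-removal step,
# but it splits contractions at the apostrophe.
print(re.findall(r"[a-z]+", "it's great"))  # ['it', 's', 'great']
```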


We flatten the per-document token lists into one list and count how often each word occurs across all documents.


In [6]:
all_words = [word for tokens in df["tokens"] for word in tokens]
word_freq = Counter(all_words)
word_freq

Counter({'is': 4,
         'i': 3,
         'learning': 3,
         'ai': 2,
         'love': 1,
         'artificial': 1,
         'intelligence': 1,
         'machine': 1,
         'fascinating': 1,
         'enjoy': 1,
         'new': 1,
         'concepts': 1,
         'debugging': 1,
         'code': 1,
         'frustrating': 1,
         'hate': 1,
         'software': 1,
         'bugs': 1,
         'errors': 1,
         'make': 1,
         'programming': 1,
         'stressful': 1,
         'will': 1,
         'change': 1,
         'the': 1,
         'future': 1,
         'technology': 1,
         'evolving': 1,
         'rapidly': 1,
         'python': 1,
         'useful': 1})
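
The full counter is long, so it often helps to look only at the top entries. The sketch below rebuilds the counts from the nine cleaned sentences so that it runs on its own; the `docs` list simply repeats the `clean_text` column, and `total` and `rel_freq_is` are names introduced here for illustration.

```python
from collections import Counter

# The nine cleaned sentences from the clean_text column.
docs = [
    "i love artificial intelligence",
    "machine learning is fascinating",
    "i enjoy learning new ai concepts",
    "debugging code is frustrating",
    "i hate software bugs",
    "errors make programming stressful",
    "ai will change the future",
    "technology is evolving rapidly",
    "learning python is useful",
]

word_freq = Counter(word for doc in docs for word in doc.split())

# The three most frequent words and their counts.
print(word_freq.most_common(3))  # [('is', 4), ('i', 3), ('learning', 3)]

# Relative frequency of a word: its count divided by the total token count.
total = sum(word_freq.values())
rel_freq_is = word_freq["is"] / total
print(total)  # 39
```

The most frequent words here ("is", "i") carry little meaning on their own, which is the usual motivation for stopword removal.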

Bag of Words converts text into numerical feature vectors
based on word counts, ignoring word order. The first step is
to build the vocabulary: the sorted list of unique words.

In [7]:
vocab = sorted(set(all_words))
vocab

['ai',
 'artificial',
 'bugs',
 'change',
 'code',
 'concepts',
 'debugging',
 'enjoy',
 'errors',
 'evolving',
 'fascinating',
 'frustrating',
 'future',
 'hate',
 'i',
 'intelligence',
 'is',
 'learning',
 'love',
 'machine',
 'make',
 'new',
 'programming',
 'python',
 'rapidly',
 'software',
 'stressful',
 'technology',
 'the',
 'useful',
 'will']

Each document is represented as a vector of word counts with one entry per vocabulary word, so `X_bow` has 9 rows (documents) and 31 columns (words).

In [8]:
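def vectorize(tokens, vocab):
    # Count how often each vocabulary word occurs in this document
    return [tokens.count(word) for word in vocab]

# Stack the per-document count vectors into a (documents x vocabulary) matrix
X_bow = np.array(df["tokens"].apply(lambda x: vectorize(x, vocab)).to_list())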
X_bow

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        1, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 1, 0]])
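
Calling `tokens.count(word)` once per vocabulary word rescans the document each time. For larger vocabularies, one possible alternative is to count the document once with `Counter` and then look up each word. The `vectorize_counter` name and the four-word `vocab` below are assumptions for this sketch; the result should match the notebook's `vectorize` for the same inputs.

```python
from collections import Counter

def vectorize_counter(tokens, vocab):
    """Bag-of-words vector built from a single pass over the tokens."""
    counts = Counter(tokens)                 # one pass over the document
    return [counts[word] for word in vocab]  # missing words count as 0

# Hypothetical reduced vocabulary, sorted like the notebook's vocab.
vocab = ["ai", "future", "is", "learning"]

# "rust" is not in the vocabulary, so it is silently ignored.
tokens = ["learning", "ai", "is", "learning", "rust"]

print(vectorize_counter(tokens, vocab))  # [1, 0, 1, 2]
```

Words outside the vocabulary contribute nothing to the vector, which is also how unseen words in new documents would be treated under this scheme.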

Unlike numerical data, text:
- Is unstructured
- Has variable length
- Requires feature extraction

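NLP bridges text and machine learning: preprocessing and feature extraction steps such as Bag of Words turn raw text into fixed-length numerical vectors that models can consume.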
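
Most models also expect numeric targets, so the string labels usually need encoding as well. The following is a minimal sketch under the assumption that a plain integer mapping is sufficient; `label_to_id` and `y` are names introduced here, and the `labels` list repeats the `label` column of the dataset.

```python
# Labels in dataset order (the label column above).
labels = [
    "positive", "positive", "positive",
    "negative", "negative", "negative",
    "neutral", "neutral", "neutral",
]

# Assign each class an integer id in alphabetical order.
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}

# Encoded target vector, one integer per document.
y = [label_to_id[label] for label in labels]

print(label_to_id)  # {'negative': 0, 'neutral': 1, 'positive': 2}
print(y)            # [2, 2, 2, 0, 0, 0, 1, 1, 1]
```

Together with the Bag of Words matrix, this gives one feature row and one integer target per document.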