# Mastering Natural Language Processing with Python

Welcome to the practical session on Mastering NLP with Python. In this notebook, we will move from manual text manipulation (String Methods) to pattern-based extraction (Regular Expressions) and finally to automated language processing pipelines (NLTK and spaCy).

## Traditional String Methods

Before using complex AI models, we often use Python’s built-in string methods to perform data cleaning. These methods are fast and effective for tasks like standardizing casing or splitting simple delimited data.
* Lowercasing: Keeps the vocabulary small by treating "Apple" and "apple" as the same word. Without it, every casing variant becomes a separate feature.
* Strip/Replace: Used to remove leading/trailing whitespace or unwanted characters.
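
As a minimal sketch of these methods, the snippet below cleans a made-up string. The sample text and the variable names are illustrative and not part of the session data.

```python
# Hypothetical messy input; only built-in str methods are used.
raw = "   Apple, apple ,APPLE;  Banana \n"

# strip() removes leading/trailing whitespace, lower() standardizes casing.
cleaned = raw.strip().lower()

# replace() swaps unwanted characters, here the stray ';' for ','.
cleaned = cleaned.replace(";", ",")

# split() breaks simple delimited data; strip each piece again.
items = [part.strip() for part in cleaned.split(",")]

print(items)            # ['apple', 'apple', 'apple', 'banana']
print(len(set(items)))  # 2 distinct words after lowercasing
```

Without the `lower()` call, the three spellings of "apple" would count as three distinct words.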

In [2]:
s = "This is a text and the datatype is a string."

In [3]:
type(s)

str

In [4]:
#get all methods for s
dir(s)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'stri

In [5]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |
 |  Methods defined here:
 |
 |  __add__(self, value, /)
 |      Return self+value.
 |
 |  __contains__(self, key, /)
 |      Return bool(key in self).
 |
 |  __eq__(self, value, /)
 |      Return self==value.
 |
 |  __format__(self, format_spec, /)
 |      Return a formatted version of the string as described by format_spec.
 |
 |  __ge__(self, value, /)
 |      Return self>=value.
 |
 |  __getitem__(self, key, /)
 |      Return self[key].
 |
 |  __getnewargs__(...)
 |
 |  _

In [6]:
s.upper()

'THIS IS A TEXT AND THE DATATYPE IS A STRING.'

In [7]:
import pandas as pd
pd.Series(["text", "tet", "dat"]).str.upper()

0    TEXT
1     TET
2     DAT


## Regular Expressions (Regex)

Regex is a specialized "search language" used to identify, extract, and manipulate specific patterns within text. It is more powerful than simple string methods because it allows you to match types of characters (like "any digit") rather than just literal strings.
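
To make the difference concrete, here is a small sketch that contrasts a literal search with a pattern search. The sample sentence and the date format are assumptions chosen for illustration.

```python
import re

# Hypothetical sample text.
text = "Order 17 shipped on 2024-03-05, order 204 on 2024-03-09."

# A string method can only look for one literal value at a time.
literal_pos = text.find("2024-03-05")   # index of that exact date, or -1
print(literal_pos)

# A regex describes the shape of what we want: 4 digits, 2 digits, 2 digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
dates = date_pattern.findall(text)
print(dates)

# sub() replaces every match of the pattern, whatever the concrete digits are.
masked = date_pattern.sub("<DATE>", text)
print(masked)
```

The literal search finds one known date, while the pattern finds every date of that shape without knowing the values in advance.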

In [8]:
import re
text = "Room 402"
# Match the digits
print(re.findall(r'\d+', text))
# Match only uppercase
print(re.findall(r'[A-Z]', text))

['402']
['R']


In [9]:
s = "Paper No. 538"
# Check if string starts with 'Paper'
if re.search(r'^Paper', s):
    print("Valid Start")

# Match 'No' only as a whole word (using boundaries); the trailing '.' is not part of the match
print(re.findall(r'\bNo\b', s))

Valid Start
['No']


In [10]:
text = "<div>Hello</div><div>World</div>"

# Greedy: Matches from first <div> to the very last </div>
print(re.search(r'<div>.*</div>', text).group())

# Non-Greedy (Lazy): Stops at the first </div>
print(re.search(r'<div>.*?</div>', text).group())

<div>Hello</div><div>World</div>
<div>Hello</div>


In [11]:
price = "Cost: $100"
# Extract the number ONLY if it is preceded by a '$'
# (?<=\$) is a positive lookbehind: the '$' must be there but is NOT part of the match.
amt = re.search(r'(?<=\$)\d+', price).group()
print(amt)

100


In [12]:
text = "First Line\nSecond Line"

# Without DOTALL, '.' stops at the newline
print(re.findall(r'First.*Second', text)) # Output: []

# With DOTALL, '.' crosses the newline
print(re.findall(r'First.*Second', text, re.DOTALL))

[]
['First Line\nSecond']


In [13]:
import re

text = "In the study by Meier (2024), results showed a 15% increase. Contact: max_meier@univ.edu. See also (Zimmer & Bauer, 2018)."

# 1. Basic Cleaning
lower_text = text.lower()

# 2. Extracting Emails
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)


print(f"Original: {text}")
print(f"Emails found: {emails}")

Original: In the study by Meier (2024), results showed a 15% increase. Contact: max_meier@univ.edu. See also (Zimmer & Bauer, 2018).
Emails found: ['max_meier@univ.edu']
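
The same sentence also contains years, a percentage, and a parenthesised citation. The sketch below extracts them with `re.findall`. The three patterns are illustrative assumptions that fit this sample text, not general-purpose citation parsers.

```python
import re

text = ("In the study by Meier (2024), results showed a 15% increase. "
        "Contact: max_meier@univ.edu. See also (Zimmer & Bauer, 2018).")

# Four-digit years starting with 19 or 20, matched as whole numbers.
years = re.findall(r"\b(?:19|20)\d{2}\b", text)

# Percentages: capture the digits, require a following '%'.
percentages = re.findall(r"(\d+(?:\.\d+)?)%", text)

# Parenthesised citations of the form "(Authors, YYYY)".
# [^()]+? keeps the match inside a single pair of parentheses.
citations = re.findall(r"\(([^()]+?),\s*(\d{4})\)", text)

print(years)        # ['2024', '2018']
print(percentages)  # ['15']
print(citations)    # [('Zimmer & Bauer', '2018')]
```

Note that "Meier (2024)" is not picked up by the citation pattern because the author name sits outside the parentheses; a second pattern would be needed for that style.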


## NLP Pipeline

Raw text is messy for computers. A processing pipeline transforms unstructured text into a structured format that models can understand.
1.  Tokenization: Breaking text into individual words or "tokens".
2.  Stop Word Removal: Removing high-frequency but low-information words like "the" or "is".
3.  Lemmatization/Stemming: Reducing words to their base form (e.g., "running" $\rightarrow$ "run").
4.  Named Entity Recognition (NER): Identifying real-world objects like People, Organizations, or Locations.
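
Before reaching for a library, the first three steps can be sketched with the standard library alone. This is a toy version under stated assumptions: the stop word list is a tiny hand-picked set, `naive_stem` is a crude suffix stripper rather than a real stemmer or lemmatizer, and NER (step 4) is left out because it needs a trained model.

```python
import re

# Tiny illustrative stop word list (real pipelines use a much larger one).
STOP_WORDS = {"the", "is", "a", "at", "for", "on", "are"}

def tokenize(text):
    # Runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def naive_stem(token):
    # Crude suffix stripping, NOT a real stemmer or lemmatizer.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(text)                               # 1. tokenization
    tokens = [t.lower() for t in tokens]                  # lowercasing
    tokens = [t for t in tokens if t.isalnum()]           # drop punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 2. stop word removal
    return [naive_stem(t) for t in tokens]                # 3. reduce to a base form

result = preprocess("The researchers are analyzing massive datasets.")
print(result)  # ['researcher', 'analyz', 'massive', 'dataset']
```

The output "analyz" shows why stemming is only an approximation: it cuts suffixes mechanically, whereas a lemmatizer would return the dictionary form "analyze".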

In [14]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import pos_tag, ne_chunk

nltk.download('maxent_ne_chunker_tab')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('words')

# Initial setup
text = "Apple is looking at buying U.K. startup for $1 billion."
tokens = word_tokenize(text) # Tokenization
print("Tokens")
print(tokens)

# 1. Lowercasing
lower_tokens = [t.lower() for t in tokens]
print("Lower Cased")
print(lower_tokens)

# 2. Stop word removal
stop_words = set(stopwords.words('english'))
filtered = [t for t in lower_tokens if t not in stop_words]
print("Stop Words removed")
print(filtered)

# 3.a Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in filtered]
print("Lemmatizer")
print(lemmas)

# 3.b Stemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in filtered]
print("Stemmer")
print(stems)

# 4. Removing punctuation
no_punc = [t for t in lemmas if t.isalnum()]
print("No Punctuation")
print(no_punc)

# 5. Part-of-Speech Tagging
tags = pos_tag(tokens)
print("POS")
print(tags)

# 6. Named Entity Recognition (NER)
entities = ne_chunk(tags)
print("Entities:")
print(entities)

[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


Tokens
['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.']
Lower Cased
['apple', 'is', 'looking', 'at', 'buying', 'u.k.', 'startup', 'for', '$', '1', 'billion', '.']
Stop Words removed
['apple', 'looking', 'buying', 'u.k.', 'startup', '$', '1', 'billion', '.']
Lemmatizer
['apple', 'looking', 'buying', 'u.k.', 'startup', '$', '1', 'billion', '.']
Stemmer
['appl', 'look', 'buy', 'u.k.', 'startup', '$', '1', 'billion', '.']
No Punctuation
['apple', 'looking', 'buying', 'startup', '1', 'billion']
POS
[('Apple', 'NNP'), ('is', 'VBZ'), ('looking', 'VBG'), ('at', 'IN'), ('buying', 'VBG'), ('U.K.', 'NNP'), ('startup', 'NN'), ('for', 'IN'), ('$', '$'), ('1', 'CD'), ('billion', 'CD'), ('.', '.')]
Entities:
(S
  (GPE Apple/NNP)
  is/VBZ
  looking/VBG
  at/IN
  buying/VBG
  U.K./NNP
  startup/NN
  for/IN
  $/$
  1/CD
  billion/CD
  ./.)


In [15]:
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# 1. Tokenization & Lowercasing
# SpaCy tokens have attributes ready to use
for token in doc:
    print(f"Token: {token.text}, Lower: {token.lower_}")

# 2. Stop word and punctuation removal
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(filtered_tokens)

# 3. Lemmatization
lemmas = [token.lemma_ for token in doc]
print("Lemmas")
print(lemmas)

# 4. Part-of-Speech Tagging
pos_tags = [(token.text, token.pos_) for token in doc]
print("POS")
print(pos_tags)

# 5. Named Entity Recognition (NER)
entities = [(ent.text, ent.label_) for ent in doc.ents]
print("NER")
print(entities)

Token: Apple, Lower: apple
Token: is, Lower: is
Token: looking, Lower: looking
Token: at, Lower: at
Token: buying, Lower: buying
Token: U.K., Lower: u.k.
Token: startup, Lower: startup
Token: for, Lower: for
Token: $, Lower: $
Token: 1, Lower: 1
Token: billion, Lower: billion
Token: ., Lower: .
['Apple', 'looking', 'buying', 'U.K.', 'startup', '$', '1', 'billion']
Lemmas
['Apple', 'be', 'look', 'at', 'buy', 'U.K.', 'startup', 'for', '$', '1', 'billion', '.']
POS
[('Apple', 'PROPN'), ('is', 'AUX'), ('looking', 'VERB'), ('at', 'ADP'), ('buying', 'VERB'), ('U.K.', 'PROPN'), ('startup', 'VERB'), ('for', 'ADP'), ('$', 'SYM'), ('1', 'NUM'), ('billion', 'NUM'), ('.', 'PUNCT')]
NER
[('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]


In [16]:
import spacy
print(spacy.explain("GPE"))

Countries, cities, states


In [17]:

from spacy import displacy

# Load the model and process text
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Visualize NER (Named Entity Recognition)
# This will highlight 'Apple' as ORG, 'U.K.' as GPE, etc.
displacy.render(doc, style="ent")

In [18]:
# Visualize Dependency Parsing (POS & Syntax)
# This shows arrows representing how words relate to each other
displacy.render(doc, style="dep", options={"distance": 100})

In [19]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

doc = nlp("The researchers are analyzing massive datasets on climate change.")

print(f"{'Token':<15} | {'Lemma':<15} | {'POS Tag'}")
print("-" * 45)
for token in doc:
    print(f"{token.text:<15} | {token.lemma_:<15} | {token.pos_}")

Token           | Lemma           | POS Tag
---------------------------------------------
The             | the             | DET
researchers     | researcher      | NOUN
are             | be              | AUX
analyzing       | analyze         | VERB
massive         | massive         | ADJ
datasets        | dataset         | NOUN
on              | on              | ADP
climate         | climate         | NOUN
change          | change          | NOUN
.               | .               | PUNCT


## Vectorization

In [20]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = ["Paper on AI.", "AI in research."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Visualize as a DataFrame
df_bow = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
print(df_bow)

   ai  in  on  paper  research
0   1   0   1      1         0
1   1   1   0      0         1
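
The bag-of-words table above can be reproduced by hand, which makes clear what the vectorizer counts. The sketch assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of at least two word characters); the helper name `tokenize` is illustrative.

```python
import re
from collections import Counter

corpus = ["Paper on AI.", "AI in research."]

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters,
    # which mirrors CountVectorizer's default settings.
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter of token frequencies per document.
counts = [Counter(tokenize(doc)) for doc in corpus]

# The vocabulary is the sorted set of all tokens; each entry becomes one column.
vocab = sorted(set().union(*counts))

# One row per document, one count per vocabulary entry (0 if the word is absent).
matrix = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Word order is lost in this representation: each document is reduced to how often each vocabulary word occurs.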


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)

# Higher values mark words that are frequent in a document but rare across the corpus
df_tfidf = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vec.get_feature_names_out())
print(df_tfidf)

         ai        in        on     paper  research
0  0.449436  0.000000  0.631667  0.631667  0.000000
1  0.449436  0.631667  0.000000  0.000000  0.631667
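
The TF-IDF values above can be derived from the raw counts. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed inverse document frequency, $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$, followed by L2 normalisation of each row. The counts are copied from the bag-of-words table.

```python
import math

vocab = ["ai", "in", "on", "paper", "research"]
counts = [
    [1, 0, 1, 1, 0],   # "Paper on AI."
    [1, 1, 0, 0, 1],   # "AI in research."
]
n_docs = len(counts)

# Document frequency: in how many documents does each term occur?
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    # Weight each term count by its idf, then scale the row to unit length.
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 6) for v in row])
```

"ai" occurs in both documents, so its idf is 1.0 and it ends up with the lowest weight (0.449436), while words unique to one document get 0.631667.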


In [22]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [23]:
import spacy

# Load a model with pre-trained vectors

nlp = spacy.load("en_core_web_md")
word1 = nlp("coffee")
word2 = nlp("tea")

# View the first 5 dimensions of the 300D dense vector
print(word1.vector[:5])
print(word2.vector[:5])

[-0.71252  -0.035167 -0.18155   0.38755  -0.055459]
[-0.61233   0.19931  -0.16447  -0.4831   -0.070536]


In [24]:
from sklearn.metrics.pairwise import cosine_similarity

# Comparing the TF-IDF vectors of our two sentences
similarity_score = cosine_similarity(X_tfidf[0], X_tfidf[1])
print(f"Similarity: {similarity_score[0][0]:.2f}")

# Comparing two phrases with spaCy's vector-based similarity
doc1 = nlp("Artificial Intelligence in healthcare")
doc2 = nlp("AI for medical research")
print(f"SpaCy Similarity: {doc1.similarity(doc2):.2f}")

Similarity: 0.20
SpaCy Similarity: 0.69
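
The first score is the cosine of the angle between the two TF-IDF rows: $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$. A minimal sketch of that formula, using the TF-IDF values printed earlier as input (the function name `cosine` is illustrative):

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# TF-IDF rows from the earlier output (columns: ai, in, on, paper, research).
row0 = [0.449436, 0.0, 0.631667, 0.631667, 0.0]
row1 = [0.449436, 0.631667, 0.0, 0.0, 0.631667]

score = cosine(row0, row1)
print(f"{score:.2f}")  # 0.20
```

The two sentences share only the word "ai", so only that column contributes to the dot product. The spaCy score is higher because dense word vectors place related words such as "healthcare" and "medical" close together even when the strings differ.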


## Text Classification and Topic Modelling

In [25]:
from sklearn.metrics import classification_report, confusion_matrix

# Example: Actual vs Predicted labels
y_true = ["Spam", "Real", "Spam", "Spam", "Real"]
y_pred = ["Spam", "Real", "Real", "Spam", "Real"]

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

        Real       0.67      1.00      0.80         2
        Spam       1.00      0.67      0.80         3

    accuracy                           0.80         5
   macro avg       0.83      0.83      0.80         5
weighted avg       0.87      0.80      0.80         5
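
The per-class numbers in the report follow from counting true positives (TP), false positives (FP), and false negatives (FN). The sketch below recomputes them by hand; the helper `prf` is illustrative and does not guard against division by zero.

```python
y_true = ["Spam", "Real", "Spam", "Spam", "Real"]
y_pred = ["Spam", "Real", "Real", "Spam", "Real"]

def prf(label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in ("Real", "Spam"):
    p, r, f = prf(label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For "Spam", the model never raises a false alarm (precision 1.00) but misses one of the three spam messages (recall 0.67).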



In [26]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

text = "The research paper was surprisingly good, but the methodology was a bit weak."
scores = analyzer.polarity_scores(text)

# 'compound' is the overall sentiment score from -1 (neg) to 1 (pos)
print(f"Sentiment Scores: {scores}")

Sentiment Scores: {'neg': 0.235, 'neu': 0.549, 'pos': 0.216, 'compound': -0.3182}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
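
To turn the compound score into a label, a common convention is to treat values at or above 0.05 as positive and values at or below -0.05 as negative. The sketch below applies that rule to the scores printed above. The threshold is an assumption to tune per task, and `label_from_compound` is an illustrative helper, not part of NLTK.

```python
# Scores copied from the output above.
scores = {"neg": 0.235, "neu": 0.549, "pos": 0.216, "compound": -0.3182}

def label_from_compound(compound, threshold=0.05):
    # The +/-0.05 cut-off is a convention, not a fixed rule.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

label = label_from_compound(scores["compound"])
print(label)  # negative
```

The mixed sentence ends up slightly negative overall, even though `pos` and `neg` are almost balanced.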


In [27]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Challenges in medicine", "New ideas for health", "Policy for AI", "Government and AI"]
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)

# Create LDA model to find 2 topics
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

# Display the top 3 words for each topic (argsort is ascending, so the highest-weight word comes last)
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [words[j] for j in topic.argsort()[-3:]]
    print(f"Topic {i}: {top_words}")

Topic 0: ['policy', 'government', 'ai']
Topic 1: ['ideas', 'new', 'health']


## Deep Learning

In [28]:
from transformers import pipeline

# 1. Sentiment Analysis
classifier = pipeline("sentiment-analysis", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("This research is groundbreaking!"))

# 2. Named Entity Recognition (NER)
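# aggregation_strategy="simple" merges word pieces into whole entities (replaces the deprecated grouped_entities=True)
ner_tagger = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english")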
print(ner_tagger("Max Meier submitted Paper No. 538 at Stanford University."))

# 3. Text Summarization
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = '''Deep learning is a subset of machine learning in artificial intelligence (AI) that uses neural networks with multiple layers (hence "deep") to learn from data. Inspired by the structure and function of the human brain, these neural networks are designed to recognize patterns in raw data by processing information through various layers of interconnected nodes. Each layer extracts features from the input, with subsequent layers identifying more complex patterns based on the outputs of the previous layers. This hierarchical learning allows deep neural networks to automatically discover intricate structures in high-dimensional data, such as images, audio, and text, without requiring explicit feature engineering by humans.
At its core, deep learning leverages algorithms that iteratively adjust the weights and biases of connections between neurons to minimize a defined loss function. This optimization process, often carried out using techniques like stochastic gradient descent and backpropagation, enables the network to learn representations of the data that are highly effective for tasks like classification, regression, and generation. The "depth" of these networks—meaning the number of hidden layers—is crucial, as it allows them to learn hierarchical features, starting from simple edges or textures in early layers to more abstract concepts like object parts or semantic meanings in deeper layers.
'''
print(summarizer(text))


[{'label': 'POSITIVE', 'score': 0.9998654127120972}]


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



[{'entity_group': 'PER', 'score': np.float32(0.9968558), 'word': 'Max Meier', 'start': 0, 'end': 9}, {'entity_group': 'ORG', 'score': np.float32(0.98669684), 'word': 'Stanford University', 'start': 37, 'end': 56}]



[{'summary_text': ' Deep learning is a subset of machine learning in artificial intelligence (AI) that uses neural networks with multiple layers to learn from data . These neural networks are designed to recognize patterns in raw data by processing information through various layers of interconnected nodes . Each layer extracts features from the input, with subsequent layers identifying more complex patterns based on the output of previous layers .'}]
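
The pipeline outputs are plain Python lists of dictionaries, so they can be post-processed without any extra library. As a sketch, the snippet below filters the NER result by confidence and uses the `start`/`end` character offsets to slice the entity text out of the original sentence. The entity values are copied from the output above with scores rounded to plain floats, and `MIN_SCORE` is a hypothetical threshold.

```python
sentence = "Max Meier submitted Paper No. 538 at Stanford University."

# Entities copied from the NER output above (scores rounded to plain floats).
entities = [
    {"entity_group": "PER", "score": 0.9969, "word": "Max Meier", "start": 0, "end": 9},
    {"entity_group": "ORG", "score": 0.9867, "word": "Stanford University", "start": 37, "end": 56},
]

MIN_SCORE = 0.9  # hypothetical confidence threshold

# start/end are character offsets into the original sentence.
kept = [
    (ent["entity_group"], sentence[ent["start"]:ent["end"]])
    for ent in entities
    if ent["score"] >= MIN_SCORE
]

for group, span in kept:
    print(f"{group}: {span}")
```

Slicing by offset returns the text exactly as it appears in the input, which is safer than relying on `word` when the tokenizer changes casing or spacing.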
