
# **NLP Workshop Notebook**
This workshop covers real-world NLP preprocessing techniques and a hands-on text classification task.

## **Learning Objectives**
- Tokenize text
- Remove stopwords
- Perform stemming and lemmatization
- Apply POS tagging and Named Entity Recognition
- Extract noun chunks (compound terms)
- Train a basic sentiment classifier
- Evaluate model performance

---


## **1. Setup and Imports**

In [49]:
import nltk
import spacy

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## **2. Sample Text**

In [10]:

text = """Natural language processing (NLP) is a field of artificial intelligence
that gives computers the ability to understand text and spoken words. NLP started by applying rule-based techniques. Currently it uses transformers."""
print(text)


Natural language processing (NLP) is a field of artificial intelligence
that gives computers the ability to understand text and spoken words. NLP started by applying rule-based techniques. Currently it uses transformers.


## **3. Tokenization**

In [11]:

tokens = word_tokenize(text)
tokens


['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'field',
 'of',
 'artificial',
 'intelligence',
 'that',
 'gives',
 'computers',
 'the',
 'ability',
 'to',
 'understand',
 'text',
 'and',
 'spoken',
 'words',
 '.',
 'NLP',
 'started',
 'by',
 'applying',
 'rule-based',
 'techniques',
 '.',
 'Currently',
 'it',
 'uses',
 'transformers',
 '.']

In [12]:
# Sentence tokenization
Tokens2 = sent_tokenize(text)
Tokens2

['Natural language processing (NLP) is a field of artificial intelligence\nthat gives computers the ability to understand text and spoken words.',
 'NLP started by applying rule-based techniques.',
 'Currently it uses transformers.']
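
To see roughly what `word_tokenize` is doing, here is a minimal regex-based sketch that uses only the standard library. It is not NLTK's algorithm: the pattern and the `simple_tokenize` name are illustrative assumptions. It only shows the core idea of separating words from punctuation while keeping hyphenated words such as `rule-based` whole.

```python
import re

# Illustrative pattern (an assumption, not NLTK's actual rules):
#   \w+(?:-\w+)*  a word, optionally with internal hyphens ("rule-based")
#   [^\w\s]       any single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

sample = "NLP started by applying rule-based techniques."
print(simple_tokenize(sample))
# ['NLP', 'started', 'by', 'applying', 'rule-based', 'techniques', '.']
```

Unlike `word_tokenize`, this sketch has no special handling for contractions: `I'm` comes out as `I`, `'`, `m` rather than `I`, `'m`.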

**Exercise 1:** Tokenize your own sentence.

In [13]:
your_text = "My name is Rasha. I'm Egyptian. I have 3 children. I live in Cairo."
word_tokenize(your_text)

['My',
 'name',
 'is',
 'Rasha',
 '.',
 'I',
 "'m",
 'Egyptian',
 '.',
 'I',
 'have',
 '3',
 'children',
 '.',
 'I',
 'live',
 'in',
 'Cairo',
 '.']

## **4. Stopword Removal**

In [31]:
# Load NLTK's English stopword list as a set (fast membership tests) and inspect it
stop_words = set(stopwords.words('english'))
print(stop_words)



{'had', "he'll", 'themselves', 'than', 'ain', 'don', 'but', 'shan', 'who', 'both', 'can', "couldn't", 'just', 'against', "i'll", 'aren', "didn't", "it'd", 'from', 'more', 'and', 'until', 'me', "that'll", 'the', 'as', 'does', "they'll", 'an', 'having', 'too', 'off', 'wouldn', 'it', 'with', "you've", 'do', 'so', "she'd", 'couldn', 'own', 'that', 'if', "should've", 'here', 'needn', 'she', 'then', 'should', 'further', 'have', "needn't", 'were', 'about', 'why', "haven't", "they're", "you're", "i'm", "weren't", 'between', 'ours', 'during', "he'd", 'did', 'wasn', 'mightn', 'hasn', 'by', 'into', 'her', 'out', 'once', 'll', 'hadn', 'our', 'because', 'was', 'isn', "you'll", 's', 'up', "hasn't", "we've", 'y', 'been', "hadn't", 'mustn', 'these', "isn't", 'd', 'they', 'after', "doesn't", 'such', 'where', 'nor', 'doing', 'has', 'being', 'only', 'a', 'is', 'under', 'yours', 'shouldn', "won't", 're', "shouldn't", 'of', 'his', 'your', "we'd", 'those', "shan't", 'my', "aren't", "don't", 'haven', 'all', 

In [30]:
# Keep only the tokens that are not stopwords (case-insensitive comparison)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)

['Natural', 'language', 'processing', '(', 'NLP', ')', 'field', 'artificial', 'intelligence', 'gives', 'computers', 'ability', 'understand', 'text', 'spoken', 'words', '.', 'NLP', 'started', 'applying', 'rule-based', 'techniques', '.', 'Currently', 'uses', 'transformers', '.']


In [32]:
# Remove punctuation: keep only purely alphabetic tokens (hyphenated tokens are dropped too)
filtered = [w for w in filtered if w.isalpha()]
print(filtered)

['Natural', 'language', 'processing', 'NLP', 'field', 'artificial', 'intelligence', 'gives', 'computers', 'ability', 'understand', 'text', 'spoken', 'words', 'NLP', 'started', 'applying', 'techniques', 'Currently', 'uses', 'transformers']
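
If hyphenated terms such as `rule-based` matter for your task, `isalpha()` is too strict, because the hyphen is not a letter. One option is to accept tokens whose hyphen-separated parts are all alphabetic. The sketch below assumes a small hand-written stopword set (`DEMO_STOPWORDS`) and hypothetical helpers `is_word` and `clean_tokens`, so that it runs without NLTK data; in the notebook you would pass `stop_words` instead.

```python
# Tiny stand-in stopword set so the example runs without NLTK data (assumption).
DEMO_STOPWORDS = {"is", "a", "of", "that", "the", "to", "and", "by", "it"}

def is_word(token):
    """True for alphabetic tokens, including hyphenated ones like 'rule-based'."""
    parts = token.split("-")
    return all(part.isalpha() for part in parts)

def clean_tokens(tokens, stop_words):
    """Drop stopwords (case-insensitively) and non-word tokens."""
    return [t for t in tokens if t.lower() not in stop_words and is_word(t)]

demo = ["NLP", "started", "by", "applying", "rule-based", "techniques", "."]
print(clean_tokens(demo, DEMO_STOPWORDS))
# ['NLP', 'started', 'applying', 'rule-based', 'techniques']
```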


## **5. Stemming and Lemmatization**

In [50]:
# Stemming with the Porter stemmer
ps = PorterStemmer()
stemmed = [ps.stem(w) for w in filtered]
print(stemmed)

# Stemming with the Snowball (Porter2) stemmer, for comparison
snow = SnowballStemmer(language='english')
stemmed2 = [snow.stem(w) for w in filtered]
print(stemmed2)


['natur', 'languag', 'process', 'nlp', 'field', 'artifici', 'intellig', 'give', 'comput', 'abil', 'understand', 'text', 'spoken', 'word', 'nlp', 'start', 'appli', 'techniqu', 'current', 'use', 'transform']
['natur', 'languag', 'process', 'nlp', 'field', 'artifici', 'intellig', 'give', 'comput', 'abil', 'understand', 'text', 'spoken', 'word', 'nlp', 'start', 'appli', 'techniqu', 'current', 'use', 'transform']


In [34]:
# Lemmatization (WordNet treats every word as a noun unless `pos` is given)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in filtered]
lemmatized

['Natural',
 'language',
 'processing',
 'NLP',
 'field',
 'artificial',
 'intelligence',
 'give',
 'computer',
 'ability',
 'understand',
 'text',
 'spoken',
 'word',
 'NLP',
 'started',
 'applying',
 'technique',
 'Currently',
 'us',
 'transformer']

In [55]:
# Stemming vs. lemmatization on a comparative adjective
print(ps.stem('older'))
print(snow.stem('older'))
print(lemmatizer.lemmatize('older', pos='a'))


older
older
old
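
The `pos='a'` argument matters: `WordNetLemmatizer.lemmatize` treats every word as a noun unless told otherwise, which is why `uses` became `us` and `started` was left unchanged in the earlier output. One way to supply the right tag is to map a part-of-speech label to WordNet's single-letter codes. The helper below is only a sketch: `to_wordnet_pos` is a hypothetical name, and it assumes Universal POS labels such as those in spaCy's `token.pos_`.

```python
# WordNet's part-of-speech codes: 'n' noun, 'v' verb, 'a' adjective, 'r' adverb.
UPOS_TO_WORDNET = {
    "NOUN": "n",
    "PROPN": "n",
    "VERB": "v",
    "AUX": "v",
    "ADJ": "a",
    "ADV": "r",
}

def to_wordnet_pos(upos):
    """Map a Universal POS label to a WordNet code, defaulting to noun."""
    return UPOS_TO_WORDNET.get(upos, "n")

print(to_wordnet_pos("VERB"))   # v
print(to_wordnet_pos("ADJ"))    # a
print(to_wordnet_pos("PUNCT"))  # n (fallback)
```

In the notebook this could be combined as `[lemmatizer.lemmatize(t.text, pos=to_wordnet_pos(t.pos_)) for t in nlp(text)]`, so that tokens tagged as verbs are lemmatized as verbs. The result depends on the tagger assigning the right label.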


## **6. POS Tagging and NER**

In [61]:

text3='Andrew Yan-Tak Ng is a British-American computer scientist and technology entrepreneur focusing on machine learning and artificial intelligence. He worked at Google for 13 years. He established DeepLearning platform in 2002.'
doc = nlp(text3)
pos = [(token.text, token.pos_) for token in doc]
entities = [(ent.text, ent.label_) for ent in doc.ents]

pos, entities


([('Andrew', 'PROPN'),
  ('Yan', 'PROPN'),
  ('-', 'PUNCT'),
  ('Tak', 'PROPN'),
  ('Ng', 'PROPN'),
  ('is', 'AUX'),
  ('a', 'DET'),
  ('British', 'ADJ'),
  ('-', 'PUNCT'),
  ('American', 'ADJ'),
  ('computer', 'NOUN'),
  ('scientist', 'NOUN'),
  ('and', 'CCONJ'),
  ('technology', 'NOUN'),
  ('entrepreneur', 'NOUN'),
  ('focusing', 'VERB'),
  ('on', 'ADP'),
  ('machine', 'NOUN'),
  ('learning', 'NOUN'),
  ('and', 'CCONJ'),
  ('artificial', 'ADJ'),
  ('intelligence', 'NOUN'),
  ('.', 'PUNCT'),
  ('He', 'PRON'),
  ('worked', 'VERB'),
  ('at', 'ADP'),
  ('Google', 'PROPN'),
  ('for', 'ADP'),
  ('13', 'NUM'),
  ('years', 'NOUN'),
  ('.', 'PUNCT'),
  ('He', 'PRON'),
  ('established', 'VERB'),
  ('DeepLearning', 'PROPN'),
  ('platform', 'NOUN'),
  ('in', 'ADP'),
  ('2002', 'NUM'),
  ('.', 'PUNCT')],
 [('Andrew Yan-Tak Ng', 'PERSON'),
  ('British', 'NORP'),
  ('Google', 'ORG'),
  ('13 years', 'DATE'),
  ('DeepLearning', 'ORG'),
  ('2002', 'DATE')])
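
The entity list is easier to read when grouped by label. The sketch below does this with the standard library, using the `(text, label)` pairs copied from the output above; `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above
entities = [
    ("Andrew Yan-Tak Ng", "PERSON"),
    ("British", "NORP"),
    ("Google", "ORG"),
    ("13 years", "DATE"),
    ("DeepLearning", "ORG"),
    ("2002", "DATE"),
]

def group_by_label(pairs):
    """Collect entity texts under their label, preserving order of appearance."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label["ORG"])   # ['Google', 'DeepLearning']
print(by_label["DATE"])  # ['13 years', '2002']
```

If a label such as `NORP` is unfamiliar, `spacy.explain("NORP")` returns a short description of it.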

## **7. Compound Term Extraction**

In [62]:

list(doc.noun_chunks)


[Andrew Yan-Tak Ng,
 a British-American computer scientist and technology entrepreneur,
 machine learning,
 artificial intelligence,
 He,
 Google,
 13 years,
 He,
 DeepLearning platform]
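
Noun chunks also include single-word items such as the pronoun `He`, which are not compound terms. A simple heuristic, sketched below under the assumption that a "compound term" means a chunk of two or more words, is to filter by word count and drop repeats. The strings are copied from the output above (in the notebook they would come from `chunk.text`), and `multiword_terms` is a hypothetical helper name.

```python
# Chunk texts copied from the output above (in the notebook: chunk.text).
chunk_texts = [
    "Andrew Yan-Tak Ng",
    "a British-American computer scientist and technology entrepreneur",
    "machine learning",
    "artificial intelligence",
    "He",
    "Google",
    "13 years",
    "He",
    "DeepLearning platform",
]

def multiword_terms(texts):
    """Keep chunks with at least two words, dropping duplicates in order."""
    seen = set()
    terms = []
    for text in texts:
        if len(text.split()) >= 2 and text not in seen:
            seen.add(text)
            terms.append(text)
    return terms

for term in multiword_terms(chunk_texts):
    print(term)
```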

---
## **8. Text Classification Task**

In [None]:
import pandas as pd

# The cells below assume the CSV has a 'text' column and a 'label' column
df = pd.read_csv("/mnt/data/sentiment_dataset.csv")
df

In [None]:

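# Convert each text into a vector of word counts (bag of words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['text'])
y = df['label']

# Hold out 30% of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a multinomial Naive Bayes classifier and predict on the held-out set
model = MultinomialNB()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Per-class precision, recall and F1
print(classification_report(y_test, predictions))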
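
`CountVectorizer` turns each document into a row of word counts over a shared vocabulary. The standard-library sketch below mimics its default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) on two made-up sentences. The sentences and the `bag_of_words` helper are illustrative assumptions, not part of the workshop dataset.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_rows) for a list of documents."""
    # Same idea as CountVectorizer's default token pattern: 2+ word characters.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = ["I love this movie", "I hate this movie, hate it"]  # made-up examples
vocab, rows = bag_of_words(docs)
print(vocab)  # ['hate', 'it', 'love', 'movie', 'this']
print(rows)   # [[0, 0, 1, 1, 1], [2, 1, 0, 1, 1]]
```

The single-letter word `I` is missing from the vocabulary because the default pattern only keeps tokens of at least two characters.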


**Exercise 2:** Try replacing the classifier with `LogisticRegression` and compare the results.

---
# **Answer Key**


### ✅ Answer for Exercise 1:
Call `word_tokenize(your_text)`; the output should be a list of word and punctuation tokens.

### ✅ Answer for Exercise 2:
Replace the model section with:
```python
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
```
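
When comparing the two reports, it helps to know how the per-class columns are computed. The sketch below derives precision, recall, and F1 for one class by hand. The `y_true` and `y_pred` lists are invented for illustration and are not results from the workshop dataset; `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented labels, for illustration only.
y_true = ["pos", "pos", "neg", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

p, r, f = precision_recall_f1(y_true, y_pred, positive="pos")
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# precision=0.67 recall=0.67 f1=0.67
```

Here two of the three predicted `pos` labels are correct (precision 2/3), and two of the three true `pos` examples are found (recall 2/3).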
