# NLP Preprocessing with NLTK and Scikit-learn

This notebook demonstrates fundamental NLP preprocessing steps including:
- Tokenization
- Stopwords Removal
- Stemming
- Lemmatization
- Bag of Words (BOW) representation

These are foundational techniques used in text preprocessing before applying machine learning or deep learning models.

## Summary
- **Tokenization**: Splits text into words and sentences  
- **Stopwords Removal**: Removes common filler words  
- **Stemming**: Reduces words to crude root forms (may not always be valid words)  
- **Lemmatization**: Converts words to meaningful dictionary root forms  
- **Bag of Words**: Converts text into a numerical representation for ML models  

This workflow forms the foundation for more advanced NLP tasks like text classification, topic modeling, and sentiment analysis. 

In [132]:
!pip install nltk



In [133]:
paragraph = """ Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who is the 47th president of the United States. A member of the Republican Party, he served as the 45th president from 2017 to 2021.

Born into a wealthy family in New York City, Trump graduated from the University of Pennsylvania in 1968 with a bachelor's degree in economics. He became the president of his family's real estate business in 1971, renamed it the Trump Organization, and began acquiring and building skyscrapers, hotels, casinos, and golf courses. He launched side ventures, many licensing the Trump name, and filed for six business bankruptcies in the 1990s and 2000s. From 2004 to 2015, he hosted the reality television show The Apprentice, bolstering his fame as a billionaire. Presenting himself as a political outsider, Trump won the 2016 presidential election against Democratic Party nominee Hillary Clinton.
"""

In [134]:
paragraph

" Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who is the 47th president of the United States. A member of the Republican Party, he served as the 45th president from 2017 to 2021.\n\nBorn into a wealthy family in New York City, Trump graduated from the University of Pennsylvania in 1968 with a bachelor's degree in economics. He became the president of his family's real estate business in 1971, renamed it the Trump Organization, and began acquiring and building skyscrapers, hotels, casinos, and golf courses. He launched side ventures, many licensing the Trump name, and filed for six business bankruptcies in the 1990s and 2000s. From 2004 to 2015, he hosted the reality television show The Apprentice, bolstering his fame as a billionaire. Presenting himself as a political outsider, Trump won the 2016 presidential election against Democratic Party nominee Hillary Clinton.\n"

In [135]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [136]:
## Tokenization: split the paragraph into sentences (word-level tokenization is applied in later cells)

nltk.download("punkt")
nltk.download("punkt_tab")
sentences = nltk.sent_tokenize(paragraph)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/harpreetgill/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/harpreetgill/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [137]:
print(sentences)

[' Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who is the 47th president of the United States.', 'A member of the Republican Party, he served as the 45th president from 2017 to 2021.', "Born into a wealthy family in New York City, Trump graduated from the University of Pennsylvania in 1968 with a bachelor's degree in economics.", "He became the president of his family's real estate business in 1971, renamed it the Trump Organization, and began acquiring and building skyscrapers, hotels, casinos, and golf courses.", 'He launched side ventures, many licensing the Trump name, and filed for six business bankruptcies in the 1990s and 2000s.', 'From 2004 to 2015, he hosted the reality television show The Apprentice, bolstering his fame as a billionaire.', 'Presenting himself as a political outsider, Trump won the 2016 presidential election against Democratic Party nominee Hillary Clinton.']


In [138]:
type(sentences)

list

In [139]:
# Stemming: strips suffixes to reach a crude root form, which may not be a real word

stemmer = PorterStemmer()

In [140]:
stemmer.stem('history')

'histori'

In [141]:
stemmer.stem('thinking')

'think'

In [142]:
stemmer.stem('bankruptcies')

'bankruptci'
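Porter stemming is rule-based suffix stripping, so related words do not always collapse to the same stem. A brief sketch using words that appear in the paragraph; the `words` list and `stems` dictionary are illustrative names, not part of the original notebook:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Two pairs of related words taken from the sample paragraph.
words = ['president', 'presidential', 'family', 'families']
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Here `family` and `families` share the stem `famili`, while `president` and `presidential` end up as `presid` and `presidenti`, so a bag-of-words model built on stems would still treat the latter pair as different features.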

In [143]:
# Lemmatization: maps a word to its dictionary form (lemma), which is always a valid word

from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/harpreetgill/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [144]:
lemmatizer = WordNetLemmatizer()

In [145]:
lemmatizer.lemmatize('history')

'history'

In [146]:
lemmatizer.lemmatize('thinking')

'thinking'

In [147]:
lemmatizer.lemmatize('bankruptcies')

'bankruptcy'

In [148]:
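## Text cleaning: replace every non-letter character with a space, then lowercase

import re
corpus = []
for sentence in sentences:
    # Digits and punctuation become spaces, so "47th" leaves a stray "th" behind
    text = re.sub('[^a-zA-Z]', ' ', sentence)
    text = text.lower()
    corpus.append(text)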

In [149]:
corpus

[' donald john trump  born june           is an american politician  media personality  and businessman who is the   th president of the united states ',
 'a member of the republican party  he served as the   th president from      to      ',
 'born into a wealthy family in new york city  trump graduated from the university of pennsylvania in      with a bachelor s degree in economics ',
 'he became the president of his family s real estate business in       renamed it the trump organization  and began acquiring and building skyscrapers  hotels  casinos  and golf courses ',
 'he launched side ventures  many licensing the trump name  and filed for six business bankruptcies in the     s and     s ',
 'from      to       he hosted the reality television show the apprentice  bolstering his fame as a billionaire ',
 'presenting himself as a political outsider  trump won the      presidential election against democratic party nominee hillary clinton ']
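The cleaned sentences above still contain runs of spaces where digits and punctuation were removed. A minimal sketch of one way to tidy the whitespace using only the standard library; the helper name `clean_sentence` is hypothetical and not part of the original notebook:

```python
import re

def clean_sentence(sentence):
    """Keep letters only, lowercase, and collapse repeated whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', sentence)  # same substitution as the cell above
    lowered = letters_only.lower()
    return re.sub(r'\s+', ' ', lowered).strip()          # squeeze runs of spaces, trim the ends

sample = "A member of the Republican Party, he served as the 45th president from 2017 to 2021."
print(clean_sentence(sample))
# a member of the republican party he served as the th president from to
```

The later cells call `split()` or `nltk.word_tokenize`, both of which skip extra whitespace, so this step is mostly cosmetic here. The stray `th` token left over from `45th` survives either way and would need its own rule if it matters for the model.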

In [150]:
nltk.download('stopwords')

stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/harpreetgill/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on
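The list includes negations such as `no`, `nor`, and `not`, which can carry meaning in tasks like sentiment analysis. A rough sketch of excluding them from the stop list; `base_stop_words` is a small hand-picked subset standing in for `stopwords.words('english')`, and the sample sentence is made up for illustration:

```python
# Hand-picked subset standing in for stopwords.words('english');
# in the notebook, the full NLTK list would be used instead.
base_stop_words = {'a', 'an', 'the', 'is', 'was', 'not', 'no', 'nor'}

# Words we want to keep even though the stop list contains them.
negations = {'not', 'no', 'nor'}
stop_words = base_stop_words - negations

tokens = "the movie was not good".split()
kept = [token for token in tokens if token not in stop_words]
print(kept)  # ['movie', 'not', 'good']
```

Whether to keep negations depends on the downstream task; for the bag-of-words demo in this notebook the default list is used as-is.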

In [151]:
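## Stemming: tokenize each cleaned sentence, drop stop words, stem what remains

stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in corpus:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            print(stemmer.stem(word))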

donald
john
trump
born
june
american
politician
media
person
businessman
th
presid
unit
state
member
republican
parti
serv
th
presid
born
wealthi
famili
new
york
citi
trump
graduat
univers
pennsylvania
bachelor
degre
econom
becam
presid
famili
real
estat
busi
renam
trump
organ
began
acquir
build
skyscrap
hotel
casino
golf
cours
launch
side
ventur
mani
licens
trump
name
file
six
busi
bankruptci
host
realiti
televis
show
apprentic
bolster
fame
billionair
present
polit
outsid
trump
presidenti
elect
democrat
parti
nomine
hillari
clinton


In [152]:
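## Lemmatization: tokenize each cleaned sentence, drop stop words, lemmatize what remains

stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for sentence in corpus:
    for word in nltk.word_tokenize(sentence):
        if word not in stop_words:
            print(lemmatizer.lemmatize(word))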

donald
john
trump
born
june
american
politician
medium
personality
businessman
th
president
united
state
member
republican
party
served
th
president
born
wealthy
family
new
york
city
trump
graduated
university
pennsylvania
bachelor
degree
economics
became
president
family
real
estate
business
renamed
trump
organization
began
acquiring
building
skyscraper
hotel
casino
golf
course
launched
side
venture
many
licensing
trump
name
filed
six
business
bankruptcy
hosted
reality
television
show
apprentice
bolstering
fame
billionaire
presenting
political
outsider
trump
presidential
election
democratic
party
nominee
hillary
clinton


In [153]:
# Rebuild the corpus: clean each sentence, remove stop words, and lemmatize

import re
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
corpus = []
for sentence in sentences:
    text = re.sub('[^a-zA-Z]', ' ', sentence)
    text = text.lower()
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))

In [154]:
## Bag of Words (BOW)
!pip install scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
# binary=True records whether a word appears in a sentence (0/1) rather than how often
cv = CountVectorizer(binary=True)



In [155]:
X = cv.fit_transform(corpus)

In [156]:
cv.vocabulary_

{'donald': 19,
 'john': 31,
 'trump': 63,
 'born': 9,
 'june': 32,
 'american': 1,
 'politician': 47,
 'medium': 36,
 'personality': 45,
 'businessman': 12,
 'th': 62,
 'president': 49,
 'united': 64,
 'state': 60,
 'member': 37,
 'republican': 54,
 'party': 43,
 'served': 55,
 'wealthy': 67,
 'family': 24,
 'new': 39,
 'york': 68,
 'city': 14,
 'graduated': 27,
 'university': 65,
 'pennsylvania': 44,
 'bachelor': 3,
 'degree': 17,
 'economics': 20,
 'became': 5,
 'real': 51,
 'estate': 22,
 'business': 11,
 'renamed': 53,
 'organization': 41,
 'began': 6,
 'acquiring': 0,
 'building': 10,
 'skyscraper': 59,
 'hotel': 30,
 'casino': 13,
 'golf': 26,
 'course': 16,
 'launched': 33,
 'side': 57,
 'venture': 66,
 'many': 35,
 'licensing': 34,
 'name': 38,
 'filed': 25,
 'six': 58,
 'bankruptcy': 4,
 'hosted': 29,
 'reality': 52,
 'television': 61,
 'show': 56,
 'apprentice': 2,
 'bolstering': 8,
 'fame': 23,
 'billionaire': 7,
 'presenting': 48,
 'political': 46,
 'outsider': 42,
 'presid

In [157]:
corpus[0]

'donald john trump born june american politician medium personality businessman th president united state'

In [158]:
X[0].toarray()

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
        0, 0, 0]])