# **Text Preprocessing Basics with NLTK**

**PROBLEM STATEMENT**

Natural Language Processing (NLP) involves preparing raw text data for analysis and extracting meaningful insights. Effective text preprocessing is a crucial step in building robust NLP models. This project addresses two questions:

1. How can we preprocess textual data effectively to prepare it for downstream NLP tasks such as topic modeling, classification, or sentiment analysis?
2. What tools and techniques can be used to clean, tokenize, and transform text for better machine understanding?

This project demonstrates fundamental preprocessing techniques, including tokenization, stopword removal, and stemming, using the Natural Language Toolkit (NLTK). It also briefly shows lemmatization and vectorization with scikit-learn's CountVectorizer and TfidfVectorizer. The processed text lays the groundwork for advanced NLP workflows.

# **Key Features**
**Text Preprocessing:**
1. Sentence and word tokenization.
2. Removal of stopwords to clean the text.
3. Stemming with NLTK's PorterStemmer to reduce words to their stems (for example, "historical" becomes "histor").
     
***Example Paragraph:*** A descriptive paragraph about Latent Semantic Analysis (LSA) serves as the sample text for demonstrating these techniques. The project does not implement LSA itself; it focuses on preparing text for NLP tasks.

In [None]:
!pip install nltk



In [None]:
paragraph="Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.[1]"
paragraph

'Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis). A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.[1]'

In [None]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [None]:
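nltk.download("punkt")
# Note: recent NLTK releases may look for the "punkt_tab" resource instead;
# if sent_tokenize raises a LookupError, try nltk.download("punkt_tab").
sentences=nltk.sent_tokenize(paragraph)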

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
sentences

['Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.',
 'LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis).',
 'A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.',
 'Documents are then compared by cosine similarity between any two columns.',
 'Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.',
 '[1]']
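The last element, `'[1]'`, is a leftover citation marker that the tokenizer treats as a sentence of its own. One way to avoid this is to remove such markers before tokenizing. The sketch below assumes markers always look like a bracketed number; the helper name `strip_citation_markers` is hypothetical and not part of NLTK.

```python
import re

def strip_citation_markers(text):
    """Remove bracketed numeric markers such as [1] or [23] (assumed format)."""
    return re.sub(r"\[\d+\]", "", text).strip()

sample = "Values close to 0 represent very dissimilar documents.[1]"
cleaned_sample = strip_citation_markers(sample)
print(cleaned_sample)
```

Applying a helper like this to `paragraph` before `nltk.sent_tokenize` should leave only the five real sentences.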

In [None]:

stemmer=PorterStemmer()
stemmer

<PorterStemmer>

In [None]:
stemmer.stem("HISTORICAL")

'histor'
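As the result above suggests, a stem is not necessarily a dictionary word. The point of stemming is that related word forms collapse to the same string. A small sketch using words taken from the sample paragraph (the word list itself is chosen for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Singular/plural and adjective/noun variants from the sample paragraph,
# plus the uppercase word used in the cell above.
words = ["piece", "pieces", "semantic", "semantics", "HISTORICAL"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Both members of each pair map to one stem, which is what lets later steps treat them as the same feature.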

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
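lemmatizer=WordNetLemmatizer()
# WordNetLemmatizer does not lowercase its input and treats words as nouns
# unless a pos argument is given, so the uppercase word comes back unchanged.
lemmatizer.lemmatize('HISTORICAL')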

'HISTORICAL'

In [None]:
import re

In [None]:
corpus = []
for sentence in sentences:
    # Replace every character that is not a letter with a space, then lowercase
    cleaned = re.sub("[^a-zA-Z]", " ", sentence)
    cleaned = cleaned.lower()
    corpus.append(cleaned)

In [None]:
corpus

['latent semantic analysis  lsa  is a technique in natural language processing  in particular distributional semantics  of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms ',
 'lsa assumes that words that are close in meaning will occur in similar pieces of text  the distributional hypothesis  ',
 'a matrix containing word counts per document  rows represent unique words and columns represent each document  is constructed from a large piece of text and a mathematical technique called singular value decomposition  svd  is used to reduce the number of rows while preserving the similarity structure among columns ',
 'documents are then compared by cosine similarity between any two columns ',
 'values close to   represent very similar documents while values close to   represent very dissimilar documents ',
 '   ']
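Replacing every non-letter with a space has side effects visible in this output. It leaves runs of consecutive spaces, removes the digits `1` and `0` from the fifth sentence, and turns `'[1]'` into a string of spaces. A hedged sketch of a slightly stricter cleaning step follows; the function name `clean_sentences` is hypothetical. It also collapses whitespace and drops entries that end up empty.

```python
import re

def clean_sentences(sentences):
    """Lowercase, keep letters only, collapse whitespace, drop empty results."""
    cleaned = []
    for sentence in sentences:
        letters_only = re.sub(r"[^a-zA-Z]", " ", sentence).lower()
        normalized = " ".join(letters_only.split())
        if normalized:  # skip entries such as '[1]' that contain no letters
            cleaned.append(normalized)
    return cleaned

demo = clean_sentences(["Latent semantic analysis (LSA) is a technique.", "[1]"])
print(demo)
```

Whether digits should be discarded depends on the task. Here the character class is kept as in the notebook.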

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
x=stopwords.words("english")
x


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [None]:
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for sentence in corpus:
    words = nltk.word_tokenize(sentence)
    for word in words:
        if word not in stop_words:
            print(stemmer.stem(word))

latent
semant
analysi
lsa
techniqu
natur
languag
process
particular
distribut
semant
analyz
relationship
set
document
term
contain
produc
set
concept
relat
document
term
lsa
assum
word
close
mean
occur
similar
piec
text
distribut
hypothesi
matrix
contain
word
count
per
document
row
repres
uniqu
word
column
repres
document
construct
larg
piec
text
mathemat
techniqu
call
singular
valu
decomposit
svd
use
reduc
number
row
preserv
similar
structur
among
column
document
compar
cosin
similar
two
column
valu
close
repres
similar
document
valu
close
repres
dissimilar
document
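Printing each stem is handy for inspection, but later steps usually need the stems as data. A minimal sketch that returns them as a list per sentence follows. It makes two assumptions: whitespace splitting stands in for `nltk.word_tokenize` (the cleaned text contains only letters and spaces), and a tiny stand-in stopword set replaces NLTK's list so the example needs no downloads. The function name `stem_sentence` is hypothetical.

```python
from nltk.stem import PorterStemmer

def stem_sentence(sentence, stop_words, stemmer):
    """Return the stems of all whitespace-separated tokens that are not stopwords."""
    return [stemmer.stem(token) for token in sentence.split() if token not in stop_words]

# Stand-in stopword set for illustration only; in the notebook this would be
# set(stopwords.words("english")).
demo_stop_words = {"are", "then", "by", "between", "any"}
demo_sentence = "documents are then compared by cosine similarity between any two columns"

stems = stem_sentence(demo_sentence, demo_stop_words, PorterStemmer())
print(stems)
```

Joining each list back into a string (for example with `" ".join(stems)`) gives text that can be passed to a vectorizer.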


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
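# binary=True records only whether a feature occurs, not how often;
# ngram_range=(2,2) keeps bigrams (pairs of adjacent tokens) only.
cv=CountVectorizer(binary=True,ngram_range=(2,2))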

In [None]:
x=cv.fit_transform(corpus)

In [None]:
cv.vocabulary_

{'latent semantic': 46,
 'semantic analysis': 75,
 'analysis lsa': 1,
 'lsa is': 48,
 'is technique': 42,
 'technique in': 86,
 'in natural': 38,
 'natural language': 52,
 'language processing': 44,
 'processing in': 65,
 'in particular': 39,
 'particular distributional': 60,
 'distributional semantics': 29,
 'semantics of': 76,
 'of analyzing': 55,
 'analyzing relationships': 2,
 'relationships between': 69,
 'between set': 12,
 'set of': 77,
 'of documents': 57,
 'documents and': 32,
 'and the': 6,
 'the terms': 96,
 'terms they': 87,
 'they contain': 98,
 'contain by': 22,
 'by producing': 14,
 'producing set': 66,
 'of concepts': 56,
 'concepts related': 20,
 'related to': 68,
 'to the': 101,
 'the documents': 93,
 'and terms': 5,
 'lsa assumes': 47,
 'assumes that': 10,
 'that words': 91,
 'words that': 114,
 'that are': 90,
 'are close': 8,
 'close in': 16,
 'in meaning': 37,
 'meaning will': 51,
 'will occur': 111,
 'occur in': 54,
 'in similar': 40,
 'similar pieces': 79,
 'pie

In [None]:
corpus[1]

'lsa assumes that words that are close in meaning will occur in similar pieces of text  the distributional hypothesis  '

In [None]:
x[1].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 1]])
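The row above contains 18 ones, one for each distinct bigram in `corpus[1]`. To make the construction concrete, here is a pure-Python sketch of the same idea. It assumes scikit-learn's default behaviour of keeping only tokens with two or more word characters and ordering the vocabulary alphabetically; the function names are hypothetical and this is not scikit-learn's implementation.

```python
import re

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def document_bigrams(document):
    """Return the adjacent token pairs of a document, each joined by a space."""
    tokens = TOKEN_PATTERN.findall(document.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def binary_bigram_matrix(documents):
    """Build an alphabetical bigram vocabulary and one 0/1 row per document."""
    vocabulary = sorted({b for doc in documents for b in document_bigrams(doc)})
    index = {bigram: i for i, bigram in enumerate(vocabulary)}
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for bigram in document_bigrams(doc):
            row[index[bigram]] = 1  # presence only, as with binary=True
        rows.append(row)
    return index, rows

docs = [
    "lsa is a technique in natural language processing",
    "lsa assumes that words that are close in meaning will occur in similar "
    "pieces of text  the distributional hypothesis  ",
]
index, rows = binary_bigram_matrix(docs)
print("is technique" in index)  # the one-letter token 'a' is dropped first
print(sum(rows[1]))             # number of distinct bigrams in the second document
```

Dropping one-letter tokens before pairing explains why the vocabulary contains entries such as `'is technique'` and `'between set'` that never appear verbatim in the text.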

In [None]:

from sklearn.feature_extraction.text import TfidfVectorizer
# Unigram TF-IDF features; note that this rebinds cv (previously the bigram CountVectorizer)
cv=TfidfVectorizer(ngram_range=(1,1))
x=cv.fit_transform(corpus)
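To show where TF-IDF weights come from, the sketch below recomputes them in plain Python. It assumes what I understand to be scikit-learn's defaults: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The function name `tfidf_rows` and the toy documents are hypothetical; this illustrates the formula and does not replace `TfidfVectorizer`.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def tfidf_rows(documents):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in documents]
    vocabulary = sorted({term for c in counts for term in c})
    n_docs = len(documents)
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for c in counts:
        raw = [c[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])  # empty docs stay all-zero
    return vocabulary, rows

toy_docs = ["lsa assumes that words", "lsa is a technique", ""]
vocabulary, rows = tfidf_rows(toy_docs)
weights = dict(zip(vocabulary, rows[0]))
print(weights)
```

With the six cleaned sentences in this notebook (n = 6), the formula gives ln(7/2) + 1 ≈ 2.25 for a term found in one sentence and ln(7/3) + 1 ≈ 1.85 for a term found in two. Their ratio, about 0.82, matches the ratio of the weights 0.1914 and 0.2334 in the notebook's `x[1].toarray()` output.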

In [None]:
cv.vocabulary_

{'latent': 31,
 'semantic': 52,
 'analysis': 1,
 'lsa': 32,
 'is': 28,
 'technique': 60,
 'in': 27,
 'natural': 36,
 'language': 29,
 'processing': 45,
 'particular': 40,
 'distributional': 21,
 'semantics': 53,
 'of': 39,
 'analyzing': 2,
 'relationships': 49,
 'between': 7,
 'set': 54,
 'documents': 23,
 'and': 3,
 'the': 64,
 'terms': 61,
 'they': 66,
 'contain': 15,
 'by': 8,
 'producing': 46,
 'concepts': 13,
 'related': 48,
 'to': 67,
 'assumes': 6,
 'that': 63,
 'words': 77,
 'are': 5,
 'close': 10,
 'meaning': 35,
 'will': 75,
 'occur': 38,
 'similar': 55,
 'pieces': 43,
 'text': 62,
 'hypothesis': 26,
 'matrix': 34,
 'containing': 16,
 'word': 76,
 'counts': 18,
 'per': 41,
 'document': 22,
 'rows': 51,
 'represent': 50,
 'unique': 69,
 'columns': 11,
 'each': 24,
 'constructed': 14,
 'from': 25,
 'large': 30,
 'piece': 42,
 'mathematical': 33,
 'called': 9,
 'singular': 57,
 'value': 71,
 'decomposition': 19,
 'svd': 59,
 'used': 70,
 'reduce': 47,
 'number': 37,
 'while': 74

In [None]:
corpus[1]

'lsa assumes that words that are close in meaning will occur in similar pieces of text  the distributional hypothesis  '

In [None]:
x[1].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.19139971, 0.2334102 , 0.        , 0.        , 0.        ,
        0.19139971, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.19139971, 0.        , 0.        , 0.        ,
        0.        , 0.2334102 , 0.38279941, 0.        , 0.        ,
        0.        , 0.        , 0.19139971, 0.        , 0.        ,
        0.2334102 , 0.        , 0.        , 0.2334102 , 0.16159278,
        0.        , 0.        , 0.        , 0.2334102 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.19139971, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.19139971, 0.46682041, 0.16159278,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  