## Text Preprocessing
Before analysing text data, it's essential to preprocess it to ensure consistency and to remove noise. Common preprocessing steps include:

1. **Tokenization**: Splitting text into individual words or sentences.

2. **Lowercasing**: Converting all characters to lowercase to maintain uniformity.

3. **Removing Punctuation and Special Characters**: Eliminating symbols that do not contribute to the analysis.

4. **Removing Stopwords**: Filtering out common words (e.g., 'and', 'the', 'is') that may not carry significant meaning.

5. **Stemming**: Reducing words to their root form (e.g., 'running' to 'run').

6. **Lemmatization**: Converting words to their base or dictionary form (e.g., 'better' to 'good').
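
Lowercasing and punctuation removal need nothing beyond Python's standard library. The following is a minimal sketch, assuming plain ASCII punctuation; the `sample` string and the variable names are illustrative and not part of the original notebook.

```python
import string

sample = "Hello, World! NLP is fun: isn't it?"

# Lowercasing: map every character to its lowercase form.
lowered = sample.lower()

# Punctuation removal: build a translation table that deletes every
# character listed in string.punctuation (ASCII punctuation only).
table = str.maketrans("", "", string.punctuation)
cleaned = lowered.translate(table)

print(cleaned)
```

Note that stripping the apostrophe turns "isn't" into "isnt", so whether to remove punctuation before or after tokenization depends on the task.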

We will use the NLTK library to demonstrate text preprocessing.

## Introduction to NLTK
- NLTK (Natural Language Toolkit) is a Python library for Natural Language Processing (NLP) and computational linguistics. It provides tools for text processing, classification, tokenization, stemming, lemmatization, and more.

In [1]:
# install nltk
!pip install nltk

In [6]:
import nltk
# With no arguments, this opens NLTK's interactive downloader for choosing corpora and models
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [22]:
# Example sentence used throughout the preprocessing steps

text = """The voice that navigated was definitely that of a machine, and yet you could tell that the machine was a woman."""

## 1. Tokenization
- Tokenization is the process of splitting text into individual units, such as words or sentences.

#### 1. Sentence Tokenization

In [None]:
 
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text, language="english")
print(sentences)

['The voice that navigated was definitely that of a machine, and yet you could tell that the machine was a woman.']


#### 2. Word Tokenization 

In [26]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens)

['The', 'voice', 'that', 'navigated', 'was', 'definitely', 'that', 'of', 'a', 'machine', ',', 'and', 'yet', 'you', 'could', 'tell', 'that', 'the', 'machine', 'was', 'a', 'woman', '.']
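
To make concrete what the two tokenizers above do, here is a rough approximation using only Python's `re` module. It is a simplified sketch, not NLTK's algorithm: `sent_tokenize` and `word_tokenize` handle abbreviations, contractions, and other cases that these two patterns do not. The `sample` string and the variable names are illustrative.

```python
import re

sample = "The machine spoke. Could you tell it was a machine? I could!"

# Sentence split: break after '.', '!' or '?' when followed by whitespace.
# Naive on purpose: an abbreviation such as "Dr." would be split incorrectly.
simple_sentences = re.split(r"(?<=[.!?])\s+", sample)

# Word split: runs of word characters, or single punctuation marks.
simple_tokens = re.findall(r"\w+|[^\w\s]", simple_sentences[0])

print(simple_sentences)
print(simple_tokens)
```

As in the NLTK output above, punctuation marks come out as separate tokens rather than being attached to the preceding word.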


## 2. Stemming
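- Stemming reduces words to a root form by stripping suffixes with heuristic rules. The resulting stem is not always a dictionary word (e.g., 'happiness' becomes 'happi').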

In [None]:
# First way - using Porter Stemmer
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ['happy', 'happier', 'happiness', 'breathing']

for word in words:
    print(stemmer.stem(word))

happi
happier
happi
breath


In [None]:
# Second way - using Snowball Stemmer

from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
for word in words:
    print(word, '-->', snowball_stemmer.stem(word))

happy --> happi
happier --> happier
happiness --> happi
breathing --> breath
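
Both stemmers work by applying suffix-stripping rules, which is why a stem such as 'happi' need not be a real word. The toy function below illustrates the idea only; it is not the Porter or Snowball algorithm, and the suffix list, the `toy_stem` name, and the minimum stem length are assumptions made for this sketch.

```python
# Toy suffix stripper: NOT the Porter or Snowball algorithm, which use
# many more rules and conditions. The suffix list is a made-up example.
TOY_SUFFIXES = ("ness", "ing", "ed", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["breathing", "happiness", "navigated", "was"]:
    print(w, "-->", toy_stem(w))
```

The real stemmers apply further rules after the first suffix is removed, so their output can differ from this sketch.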


In [32]:
# Let's use PorterStemmer
# Note: stem() also lowercases its input, so 'The' becomes 'the'
print(f"Before: {tokens}")
tokens = [stemmer.stem(token) for token in tokens]
print(f"After: {tokens}")

Before: ['The', 'voice', 'that', 'navigated', 'was', 'definitely', 'that', 'of', 'a', 'machine', ',', 'and', 'yet', 'you', 'could', 'tell', 'that', 'the', 'machine', 'was', 'a', 'woman', '.']
After: ['the', 'voic', 'that', 'navig', 'wa', 'definit', 'that', 'of', 'a', 'machin', ',', 'and', 'yet', 'you', 'could', 'tell', 'that', 'the', 'machin', 'wa', 'a', 'woman', '.']


## 3. Lemmatization
- Lemmatization is similar to stemming, but it maps each word to its dictionary form (lemma), so the result is a valid word.

In [34]:
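from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Note: `tokens` was stemmed in the previous step; lemmatization normally runs on unstemmed tokens
# lemmatize() treats each token as a noun unless a `pos` argument is passed
tokens = [lemmatizer.lemmatize(token) for token in tokens]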

## 4. Filtering Stop Words 
Stop words are common words that do not add much meaning to the text.
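
The filtering step itself is a simple membership test. The sketch below assumes a small hand-picked stop-word set instead of NLTK's own list, so `EXAMPLE_STOP_WORDS` and `example_tokens` are hypothetical names chosen for illustration.

```python
# Hypothetical, hand-picked stop-word set for illustration only;
# NLTK's English stop-word list is considerably longer.
EXAMPLE_STOP_WORDS = {"the", "that", "was", "of", "a", "and", "is"}

example_tokens = ["The", "voice", "that", "navigated", "was", "definitely",
                  "that", "of", "a", "machine"]

# Compare in lowercase so that 'The' matches 'the'.
filtered = [t for t in example_tokens if t.lower() not in EXAMPLE_STOP_WORDS]

print(filtered)
```

Using a `set` keeps each lookup constant-time, which matters once the token list grows large.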

In [None]:
from 