## Common Preprocessing Techniques

### 1. Lowercasing 

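Lowercasing is a normalization step: it maps variants such as "NLP", "Nlp", and "nlp" to a single form so that later steps treat them as the same token.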

In [2]:
# Sample text
text = "Natural Language Processing is AMAZING!"

# Convert to lowercase
cleaned_text = text.lower()
print(cleaned_text)

natural language processing is amazing!
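
For text that is not plain English, str.lower() may not be enough for caseless matching. The following is a minimal sketch, assuming Unicode input, of how Python's built-in str.casefold() differs from str.lower(); the sample word is chosen only for illustration.

```python
# str.casefold() is a more aggressive variant of str.lower(),
# intended for caseless comparison of Unicode text.
word = "Straße"

lowered = word.lower()    # 'ß' is already lowercase, so it is kept
folded = word.casefold()  # 'ß' is folded to 'ss'

print(lowered)
print(folded)
```

For English-only corpora the two methods usually give the same result, so .lower() remains a reasonable default.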


### 2. Punctuation Removal 

Removing punctuation is a common text-cleaning step.

First, we import the re library, which provides regular expression (regex) support.


In [3]:
import re

# Sample text
text = "Hello, world! Welcome to NLP."

# Remove punctuation using regex
cleaned_text = re.sub(r"[^\w\s']", "", text)
print(cleaned_text)

Hello world Welcome to NLP


Explanation:

* The re.sub() function replaces every match of the pattern with an empty string.

* The regex pattern [^\w\s'] matches any character that is not a word character (\w), whitespace (\s), or an apostrophe, so contractions such as "don't" keep their apostrophe.
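
As a point of comparison, here is a small sketch of an alternative that uses str.translate() with string.punctuation instead of a regex. The sample sentence is an assumption made for illustration. Unlike the pattern above, this approach also deletes apostrophes, and string.punctuation covers ASCII punctuation only.

```python
import re
import string

text = "Don't stop, world!"

# Regex from the cell above: keeps word characters, whitespace, and apostrophes
regex_cleaned = re.sub(r"[^\w\s']", "", text)

# Alternative: delete every character listed in string.punctuation,
# which includes the apostrophe
table = str.maketrans("", "", string.punctuation)
translate_cleaned = text.translate(table)

print(regex_cleaned)      # Don't stop world
print(translate_cleaned)  # Dont stop world
```

Which behavior is preferable depends on whether contractions should survive the cleaning step.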

### 3. Removing Extra Whitespaces

In [4]:
# Sample text
text = "  This   is   a   sentence   with   extra   spaces.   "

# Remove extra whitespaces between the words
cleaned_text = " ".join(text.split())

print(cleaned_text)

This is a sentence with extra spaces.


Explanation:

* Called with no arguments, the .split() method splits the text on any run of whitespace and drops leading and trailing whitespace.
* The " ".join() method reassembles the words into a single string with exactly one space between them.
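
The same result can be sketched with a regex, which may be convenient when the cleaning steps are already regex-based. The sample string below (with a tab and a newline added) is an assumption for illustration.

```python
import re

text = "  This \t is   a\nsentence   with   extra   spaces.   "

# Approach from the cell above: split on whitespace, then rejoin
split_join = " ".join(text.split())

# Regex approach: collapse each whitespace run to one space, then trim the ends
regex_version = re.sub(r"\s+", " ", text).strip()

print(split_join)
print(regex_version)
```

Both approaches treat tabs and newlines as whitespace, so line breaks are flattened as well.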

### 4. Removing Numbers

In [5]:
import re

# Sample text
text = "The price is 100 dollars."

# Remove numbers using regex
cleaned_text = re.sub(r"\d+", "", text)
print(cleaned_text)

The price is  dollars.


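Explanation:
* The regex pattern \d+ matches one or more consecutive digits in the text.
* re.sub() replaces the matched digits with an empty string.
* The spaces on both sides of the removed number remain, so the output contains two consecutive spaces between "is" and "dollars".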
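
A possible refinement, sketched below under the assumption that decimals and thousands separators should be removed together with the digits. The helper name remove_numbers and the second sample sentence are hypothetical.

```python
import re

def remove_numbers(text):
    """Remove integers and simple decimals, then collapse leftover spaces."""
    # \d+(?:[.,]\d+)* matches 100, 3.50, and 1,200 as single units
    text = re.sub(r"\d+(?:[.,]\d+)*", "", text)
    # Collapse the gaps left behind by the removed numbers
    return " ".join(text.split())

print(remove_numbers("The price is 100 dollars."))
print(remove_numbers("It costs 3.50 or 1,200 units."))
```

With the plain \d+ pattern, "3.50" would leave a stray "." behind; the extended pattern removes the number as a whole.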

### 5. Handling Case-Specific Words

In [6]:
import re

# Sample text
text = "Stop words like 'and', 'or', and 'but' can be removed."

# Remove specific words (whole-word matches only)
cleaned_text = re.sub(r"\b(and|or|but)\b", "", text)
cleaned_text = " ".join(cleaned_text.split())
print(cleaned_text)

Stop words like '', '', '' can be removed.


Explanation:
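* The regex pattern \b(and|or|but)\b matches the whole words "and", "or", and "but"; the word boundaries (\b) prevent matches inside longer words such as "order".
* The quotation marks around each word remain in the output, because the pattern matches only the words themselves.
* After removing the words, we use the split() and join() methods to clean up extra spaces.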
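
The pattern above is case-sensitive, so a capitalized "And" would not be removed. A minimal sketch of a reusable, case-insensitive variant follows; the helper name remove_words and the sample sentence are assumptions for illustration.

```python
import re

def remove_words(text, words):
    """Remove whole-word, case-insensitive matches of the given words."""
    # re.escape guards against words that contain regex metacharacters
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse the gaps left behind by the removed words
    return " ".join(text.split())

print(remove_words("And then we stop or go, but slowly.", ["and", "or", "but"]))
```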

## Coding Challenges

### Challenge 1: Lowercasing and Punctuation Removal

Task: Write a function clean_text() that takes a string and returns it in lowercase with punctuation removed.

In [7]:
import re

def clean_text(text):
    """Clean text by lowercasing and removing punctuation."""
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r"[^\w\s']", "", text)
    return text

# Test the function
sample_text = "Hello, NLP World!"
print(clean_text(sample_text))

hello nlp world


### Challenge 2: Removing Numbers and Extra Whitespaces

Task: Write a function clean_text_numbers_spaces() that removes numbers and extra whitespaces from a string.

In [8]:
import re

def clean_text_numbers_spaces(text):
    """Clean text by removing numbers and extra spaces."""
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Remove extra spaces
    text = " ".join(text.split())
    return text

# Test the function
sample_text = "This 123 text   has 456 extra spaces and 789 numbers."
print(clean_text_numbers_spaces(sample_text))

This text has extra spaces and numbers.
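
The two challenge solutions can be combined into a single pipeline. The sketch below is one possible ordering of the steps, not the only valid one; the function name preprocess and the sample sentence are hypothetical.

```python
import re

def preprocess(text):
    """Lowercase, strip punctuation and numbers, and normalize whitespace."""
    text = text.lower()                     # 1. lowercase
    text = re.sub(r"[^\w\s']", "", text)    # 2. remove punctuation (keeps apostrophes)
    text = re.sub(r"\d+", "", text)         # 3. remove numbers
    return " ".join(text.split())           # 4. collapse extra whitespace

print(preprocess("Hello, NLP World! It has 123   extra spaces."))
```

Whitespace cleanup runs last because the earlier steps can leave gaps behind.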


## Using NLTK for Text Preprocessing

### 1. Tokenization

Tokenization breaks down text into smaller units (tokens) such as words or sentences, which is essential for most NLP tasks.

In [9]:
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 

In [10]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/dianaterraza/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [11]:
import nltk
nltk.download('punkt')      # Tokenizer models used by older NLTK releases
nltk.download('punkt_tab')  # Tokenizer data required by recent NLTK releases
from nltk.tokenize import word_tokenize

text = "Natural Language Processing is fascinating!"
tokens = word_tokenize(text)  # Tokenize the text into words
print(tokens)  # Display the tokens

['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


Explanation: 

* We import word_tokenize from NLTK.
* The word_tokenize function splits the text into individual words and punctuation marks.
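
To make the idea of splitting into words and punctuation marks concrete, here is a rough regex-based approximation that needs no NLTK data. It is a sketch only and is not how word_tokenize works internally; the function name simple_tokenize is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ grabs runs of word characters; [^\w\s] grabs each punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Natural Language Processing is fascinating!")
print(tokens)
```

This sketch breaks contractions such as "don't" into three pieces, whereas word_tokenize has dedicated rules for them, so it is useful for illustration rather than as a replacement.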

### 2. Stop-Word Removal

In [12]:
from nltk.corpus import stopwords
nltk.download('stopwords')  # Download the stop words list

stop_words = set(stopwords.words('english'))  # Get the English stop words
tokens = ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]  # Remove stop words
print(filtered_tokens)  # Display tokens after stop-word removal

['Natural', 'Language', 'Processing', 'fascinating', '!']



Explanation:
* NLTK provides a pre-defined list of stop words in various languages.
* We filter out tokens that match the stop words list using a list comprehension.
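
Note that the punctuation token '!' survives the filter above, because it is not in the stop-word list. A minimal sketch of a filter that can also drop punctuation-only tokens follows. The small STOP_WORDS set is a hand-picked stand-in for illustration (NLTK's English list is much longer), and the function name is hypothetical.

```python
# Hypothetical, hand-picked stop-word set used only for this sketch
STOP_WORDS = {"is", "a", "the", "and", "or", "but", "of", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, drop_punctuation=True):
    """Drop stop words and, optionally, tokens with no letters or digits."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, compared case-insensitively
        if drop_punctuation and not any(ch.isalnum() for ch in token):
            continue  # token consists only of punctuation
        kept.append(token)
    return kept

tokens = ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
print(remove_stop_words(tokens))
```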

### 3. Stemming

Stemming reduces words to their root form by removing suffixes. This is useful in scenarios like document clustering, where grouping similar terms is essential.

In [6]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # Initialize the stemmer
filtered_tokens = ['Natural', 'Language', 'Processing', 'fascinating', '!']
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]  # Stem each word
print(stemmed_tokens)  # Display stemmed tokens

['natur', 'languag', 'process', 'fascin', '!']


Explanation:
* The PorterStemmer applies a fixed set of suffix-stripping rules to trim words to their stems, and it lowercases its input by default.
* Notice how "Natural" becomes "natur" and "fascinating" becomes "fascin". Stemming truncates words aggressively, so the results may not match dictionary forms.
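
To illustrate what rule-based suffix stripping means, here is a deliberately crude toy stemmer. It is not the Porter algorithm, which has many more rules and conditions; the suffix list, the min_stem threshold, and the function name are all assumptions made for this sketch.

```python
# Toy suffix list, checked in order; the real Porter stemmer is far more elaborate
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["Processing", "flies", "running", "is"]:
    print(w, "->", toy_stem(w))
```

Even this toy version shows the typical trade-off: it is fast and needs no dictionary, but it produces non-words such as "fli" and "runn".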

### 4. Lemmatization

Lemmatization reduces words to their dictionary base form (the lemma) using vocabulary and morphological analysis. For example, the verb "running" becomes "run".

In [7]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # Download the WordNet data

lemmatizer = WordNetLemmatizer()  # Initialize the lemmatizer
filtered_tokens = ['Natural', 'Language', 'Processing', 'fascinating', '!']
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]  # Lemmatize each word
print(lemmatized_tokens)  # Display lemmatized tokens


['Natural', 'Language', 'Processing', 'fascinating', '!']


Explanation:
* The WordNetLemmatizer uses a lexical database (WordNet) to determine the base form of words.
* Unlike stemming, lemmatization returns valid dictionary entries; a word that WordNet does not recognize is returned unchanged.
* In the output above nothing changed, because the lemmatizer treats every word as a noun unless told otherwise.

If we call lemmatizer.lemmatize(word, pos='v') instead of lemmatizer.lemmatize(word), we explicitly tell the lemmatizer to treat the word as a verb; the value 'v' stands for verb. Here is an example:

In [8]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', 'v')

'go'

As a result, we got 'go' instead of 'going'. By default, the lemmatizer treats every word as a noun, so there is no need to specify pos='n'. For other parts of speech, it is better to pass the tag explicitly: 'a' for adjectives, 'r' for adverbs, and 'v' for verbs.
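
The effect of the pos argument can be pictured as a dictionary lookup keyed by both the word and its part of speech. The sketch below is a toy model only: the LEMMAS table and the toy_lemmatize function are hypothetical and are not how WordNet stores or searches its data.

```python
# Hypothetical lookup table keyed by (word, part of speech)
LEMMAS = {
    ("going", "v"): "go",
    ("flies", "n"): "fly",
    ("flies", "v"): "fly",
    ("better", "a"): "good",
    ("better", "r"): "well",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if none is known."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("going"))        # looked up as a noun: no entry, unchanged
print(toy_lemmatize("going", "v"))   # looked up as a verb
print(toy_lemmatize("better", "a"))  # adjective reading
print(toy_lemmatize("better", "r"))  # adverb reading
```

The same surface form can map to different lemmas depending on its part of speech, which is why passing the right tag matters.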

### Difference between Lemmatization and Stemming

Lemmatization produces meaningful base forms by consulting a dictionary. Stemming, however, may produce non-words like "comput" for "computer."

Stemming is faster but less accurate than lemmatization.

In [14]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')  # Needed for POS tagging

words = ["running", "flies", "better", "easily", "happiest"]

# Stem the words
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]

# Lemmatize with POS tagging
lemmatizer = WordNetLemmatizer()

# Function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun

# Get POS tags for each word
pos_tags = nltk.pos_tag(words)

# Lemmatize with POS tags
lemmas = []
for word, tag in pos_tags:
    wn_pos = get_wordnet_pos(tag)
    lemmas.append(lemmatizer.lemmatize(word, pos=wn_pos))

print("Stems:", stems)
print("Lemmas:", lemmas)

Stems: ['run', 'fli', 'better', 'easili', 'happiest']
Lemmas: ['run', 'fly', 'well', 'easily', 'happiest']


