<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/Exercises/blob/main/M1_Exercise_Using_NLTK_Library_Answers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Text Preprocessing with NLTK

## Description
In this exercise, students will use NLTK to perform basic text preprocessing tasks on a given dataset. They will apply various NLTK functionalities to clean and normalize the words, preparing them for further analysis or modeling.

## Instructions
Introduce the concept of text preprocessing to the students, explaining its importance in natural language processing tasks. Discuss common preprocessing steps, such as tokenization, removing punctuation, converting to lowercase, and removing stopwords.

Provide the students with a dataset containing a sample text document. It could be a paragraph or a collection of sentences.

Instruct students to write a Python script that uses NLTK to preprocess the words in the given text document.

Students should apply the following preprocessing steps using NLTK:

1. **Tokenization**: Split the text into individual words or tokens.
2. **Removing punctuation**: Remove any punctuation marks from the words.
3. **Converting to lowercase**: Convert all words to lowercase for consistency.
4. **Removing stopwords**: Remove common stopwords (e.g., "the," "is," "and") that do not contribute much to the meaning of the text.
5. **Lemmatization**: Reduce words to their base dictionary form (lemma).

Students should experiment with additional preprocessing steps or NLTK functionalities based on their understanding and familiarity with the toolkit.

In [1]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation

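# Download required NLTK data
# Note: newer NLTK releases (3.9 and later) may also require
# nltk.download('punkt_tab') for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')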

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [2]:
# Define the preprocessing function
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)

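    # Removing punctuation and converting to lowercase.
    # Test character by character: `token not in punctuation` is a substring
    # check, so multi-character marks such as '...' would slip through.
    punct_chars = set(punctuation)
    tokens = [token.lower() for token in tokens
              if not all(ch in punct_chars for ch in token)]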

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens

# Test the preprocessing function with a sample text
text = "This is a sample sentence for text preprocessing."
preprocessed_tokens = preprocess_text(text)
print(preprocessed_tokens)

['sample', 'sentence', 'text', 'preprocessing']


## Explanation

### Tokenization
Tokenization is the process of splitting text into individual words or tokens. This helps in breaking down the text into manageable pieces for further processing.
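
To see why a dedicated tokenizer helps, the sketch below compares plain whitespace splitting with a rough regular-expression tokenizer. The pattern is an illustrative stand-in chosen for this example, not what `word_tokenize` does internally; it only shows the general idea of separating punctuation from words.

```python
import re

text = "Hello, world! Tokenizers split text."

# Whitespace splitting leaves punctuation attached to the neighbouring word.
whitespace_tokens = text.split()

# Rough stand-in for a tokenizer: runs of word characters, or single
# non-space, non-word characters (punctuation) as their own tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "Hello," and "Hello" would count as different tokens, which is usually not what later steps expect.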

### Removing Punctuation and Converting to Lowercase
Removing punctuation ensures that punctuation marks do not interfere with text analysis. Converting to lowercase makes the text uniform, so that "Text" and "text" are treated as the same word.
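
One subtlety when filtering with `string.punctuation`: it is a plain string, so `token not in punctuation` is a substring test rather than a per-character test. The sketch below, using a hand-written token list as an assumption about what a tokenizer might emit, shows how multi-character marks can survive the substring test and how a character-level check handles them.

```python
from string import punctuation

# Hand-written tokens for illustration (quote marks and an ellipsis included).
tokens = ["Wait", "...", "``", "really", "''", "?", "!"]

# Substring test: only tokens that occur as a contiguous piece of the
# punctuation string are dropped, so '...' and the doubled quotes remain.
substring_filtered = [t.lower() for t in tokens if t not in punctuation]

# Character-level test: drop tokens made up entirely of punctuation characters.
punct_chars = set(punctuation)
char_filtered = [
    t.lower() for t in tokens if not all(ch in punct_chars for ch in t)
]

print(substring_filtered)
print(char_filtered)
```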

### Removing Stopwords
Stopwords are common words (e.g., "the," "is," "and") that do not add significant meaning to the text. Removing them helps in focusing on the words that contribute to the meaning of the text.
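
Stopword lists are generic, and some entries can matter for a given task. NLTK's English list, for example, includes negations such as "not". The sketch below uses a small hand-written stopword set (an illustrative subset, not NLTK's actual list) and a hypothetical `keep` parameter to show one way of protecting selected words from removal.

```python
# Illustrative subset only; stopwords.words('english') is much longer.
STOP_WORDS = {"the", "is", "and", "this", "a", "not"}

# Words to keep even though they appear in the stopword set.
KEEP = {"not", "no", "nor"}


def remove_stopwords(tokens, stop_words=STOP_WORDS, keep=KEEP):
    """Drop stopwords, except those explicitly listed in `keep`."""
    return [t for t in tokens if t not in stop_words or t in keep]


tokens = ["this", "movie", "is", "not", "good"]

plain = [t for t in tokens if t not in STOP_WORDS]
protected = remove_stopwords(tokens)

print(plain)
print(protected)
```

Whether negations should be kept depends on the downstream task; for sentiment-style analysis, dropping "not" can invert the apparent meaning.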

### Lemmatization
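Lemmatization reduces words to their dictionary form, or lemma (e.g., "running" to "run" when the word is lemmatized as a verb). `WordNetLemmatizer.lemmatize` treats words as nouns unless a `pos` argument is given, so the function above mainly normalizes plural nouns. This helps in normalizing the text for better analysis.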
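
The Penn Treebank tags produced by `pos_tag` can be mapped to the single-letter WordNet categories that `lemmatize` accepts through its `pos` argument. The sketch below shows one way to do that. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the lemmatizer is passed in as a parameter so that any object with a `lemmatize(word, pos=...)` method, such as `WordNetLemmatizer()`, can be used.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for everything else


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's POS to the lemmatizer."""
    return [
        lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With `pos_tag` imported from `nltk`, a call would look like `lemmatize_tagged(pos_tag(word_tokenize(text)), WordNetLemmatizer())`.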

## Additional Examples and Exercises

### Example 1: Stemming
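Stemming strips suffixes with heuristic rules to reduce words to a common stem. Unlike lemmatization, it does not consult a dictionary, so the result may not be a real word (e.g., "demonstration" becomes "demonstr" with the Porter stemmer).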

In [3]:
# Import the PorterStemmer
from nltk.stem import PorterStemmer

# Define the stemming function
def stemming_text(text):
    stemmer = PorterStemmer()
    tokens = word_tokenize(text)
    stems = [stemmer.stem(token) for token in tokens]
    return stems

# Test the stemming function with a sample text
text = "This is a demonstration of stemming words like running, runner, and runs."
stemmed_tokens = stemming_text(text)
print(stemmed_tokens)

['thi', 'is', 'a', 'demonstr', 'of', 'stem', 'word', 'like', 'run', ',', 'runner', ',', 'and', 'run', '.']
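
A common use of stems is to group surface forms that share a stem. The sketch below pairs the tokens of the sample sentence (written out by hand here) with the stems printed above and builds a stem-to-words index using only the standard library.

```python
from collections import defaultdict

# Tokens of the sample sentence, written out by hand.
tokens = ["This", "is", "a", "demonstration", "of", "stemming", "words",
          "like", "running", ",", "runner", ",", "and", "runs", "."]

# Stems copied from the output above, in the same order.
stems = ["thi", "is", "a", "demonstr", "of", "stem", "word",
         "like", "run", ",", "runner", ",", "and", "run", "."]

# Index each original token under its stem.
groups = defaultdict(list)
for token, stem in zip(tokens, stems):
    groups[stem].append(token)

print(groups["run"])
print(groups["runner"])
```

"running" and "runs" end up under the same stem, while "runner" keeps its own, which shows that the Porter stemmer does not merge every related form.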


### Example 2: POS Tagging
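Part-of-speech (POS) tagging labels each word with its part of speech, such as noun, verb, or adjective. NLTK's default English tagger uses Penn Treebank tags (e.g., `NN` for a singular noun, `VBZ` for a third-person singular present-tense verb).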

In [5]:
# Import POS tagging
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

# Define the POS tagging function
def pos_tagging_text(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return pos_tags

# Test the POS tagging function with a sample text
text = "This is a simple POS tagging example."
pos_tags = pos_tagging_text(text)
print(pos_tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('POS', 'NNP'), ('tagging', 'VBG'), ('example', 'NN'), ('.', '.')]
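
Because `pos_tag` returns a list of `(word, tag)` tuples, the result can be filtered with ordinary list comprehensions. A minimal sketch, using the tagged output printed above as literal data:

```python
# Tagged tokens copied from the output above.
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'),
          ('POS', 'NNP'), ('tagging', 'VBG'), ('example', 'NN'), ('.', '.')]

# Penn Treebank noun tags start with "NN", verb tags with "VB".
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns)
print(verbs)
```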


### Exercise 1: Named Entity Recognition
Use NLTK to identify named entities in a given text. `ne_chunk` works on POS-tagged tokens, so the text is tokenized and tagged first.

In [8]:
# Import NER
from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Define the NER function
def named_entity_recognition(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    named_entities = ne_chunk(pos_tags)
    return named_entities

# Test the NER function with a sample text
text = "Barack Obama was the 44th President of the United States."
named_entities = named_entity_recognition(text)
print(named_entities)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...


(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  the/DT
  44th/JJ
  President/NNP
  of/IN
  the/DT
  (GPE United/NNP States/NNPS)
  ./.)


[nltk_data]   Unzipping corpora/words.zip.
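
The tree returned by `ne_chunk` mixes plain `(word, tag)` tuples with labelled chunk subtrees. The sketch below collects the chunks as `(label, text)` pairs. The function name `extract_entities` is hypothetical, and the code relies only on the `label()` and `leaves()` methods that NLTK's `Tree` objects provide.

```python
def extract_entities(tree):
    """Return (label, text) pairs for the labelled chunks in an ne_chunk tree."""
    entities = []
    for node in tree:
        # Chunk subtrees expose label(); plain tokens are (word, tag) tuples.
        if hasattr(node, "label"):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities
```

Applied to the tree printed above, `extract_entities(named_entities)` would give `[('PERSON', 'Barack'), ('PERSON', 'Obama'), ('GPE', 'United States')]`.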


### Exercise 2: Text Normalization
Normalize the text by removing numbers and special characters, and by correcting common typos.

In [9]:
import re

# Define the text normalization function
def normalize_text(text):
    # Remove everything except ASCII letters and whitespace
    # (digits and punctuation are deleted, not replaced by a space)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Correct common typos (e.g., "teh" to "the")
    text = re.sub(r'\bteh\b', 'the', text)
    tokens = word_tokenize(text)
    return tokens

# Test the normalization function with a sample text
text = "This is a text with numbers 123 and special characters !@# and a common typo teh."
normalized_tokens = normalize_text(text)
print(normalized_tokens)

['This', 'is', 'a', 'text', 'with', 'numbers', 'and', 'special', 'characters', 'and', 'a', 'common', 'typo', 'the']
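
Deleting unwanted characters outright can glue fragments together or leave stray pieces behind. The sketch below, on a made-up sample string, contrasts deletion with replacing each character by a space and then collapsing the whitespace.

```python
import re

text = "The 44th well-known example"

# Deleting characters: "44th" shrinks to "th" and "well-known" is fused.
deleted = re.sub(r"[^a-zA-Z\s]", "", text)

# Replacing with a space keeps word boundaries; split/join collapses
# the runs of spaces that the replacement leaves behind.
spaced = " ".join(re.sub(r"[^a-zA-Z\s]", " ", text).split())

print(deleted)
print(spaced)
```

Neither variant recovers "44th" as a meaningful token, so whether to strip digits at all is worth deciding per task.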


## Exercise 3: Text Preprocessing Pipeline

### Objective

Create a complete text preprocessing pipeline that incorporates all the preprocessing steps learned. The pipeline should take a raw text input and output the fully processed text along with various annotations (like POS tags and named entities).

### Instructions

1. Define a function `preprocess_pipeline` that takes a raw text input.
2. The function should apply the following preprocessing steps using NLTK:
   - Tokenization
   - Removing punctuation
   - Converting to lowercase
   - Removing stopwords
   - Lemmatization
   - Stemming
   - POS tagging
   - Named entity recognition
   - Text normalization (removing numbers and special characters, correcting common typos); in the solution below this step runs first, on the raw string, before tokenization
3. The function should return:
   - Processed tokens
   - POS tags
   - Named entities

In [10]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk import pos_tag, ne_chunk
from string import punctuation
import re

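# Download required NLTK data
# (pos_tag and ne_chunk also need 'averaged_perceptron_tagger',
# 'maxent_ne_chunker', and 'words', which were downloaded in earlier cells)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')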

def preprocess_pipeline(text):
    # Text normalization: remove numbers and special characters, correct common typos
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\bteh\b', 'the', text)

    # Tokenization
    tokens = word_tokenize(text)

    # Removing punctuation and converting to lowercase
    tokens = [token.lower() for token in tokens if token not in punctuation]

    # Removing stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]

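    # POS tagging
    # Note: this runs on the lowercased, stopword-filtered tokens, so the
    # tagger no longer sees capitalization or the original sentence context.
    pos_tags = pos_tag(tokens)

    # Named entity recognition
    # Note: with lowercased input, no entity chunks are found in the
    # sample output below.
    named_entities = ne_chunk(pos_tags)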

    return {
        'processed_tokens': stemmed_tokens,
        'pos_tags': pos_tags,
        'named_entities': named_entities
    }

# Test the preprocessing pipeline with a sample text
text = "Barack Obama was the 44th President of the United States. He was born in Hawaii and loves eating pineapple."
result = preprocess_pipeline(text)
print("Processed Tokens:", result['processed_tokens'])
print("POS Tags:", result['pos_tags'])
print("Named Entities:", result['named_entities'])


Processed Tokens: ['barack', 'obama', 'th', 'presid', 'unit', 'state', 'born', 'hawaii', 'love', 'eat', 'pineappl']
POS Tags: [('barack', 'NN'), ('obama', 'NN'), ('th', 'NN'), ('president', 'NN'), ('united', 'JJ'), ('states', 'NNS'), ('born', 'VBP'), ('hawaii', 'JJ'), ('loves', 'NNS'), ('eating', 'VBG'), ('pineapple', 'NN')]
Named Entities: (S
  barack/NN
  obama/NN
  th/NN
  president/NN
  united/JJ
  states/NNS
  born/VBP
  hawaii/JJ
  loves/NNS
  eating/VBG
  pineapple/NN)
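
In the output above, no PERSON or GPE chunks appear because tagging and chunking ran on lowercased, stopword-filtered tokens. One possible alternative ordering is sketched below: annotate the original tokens first, then clean. The function name `annotate_then_clean` and its parameters are hypothetical; the individual steps are passed in as callables so the ordering is visible independently of any particular library.

```python
def annotate_then_clean(text, tokenize, tag, chunk, stop_words, stem):
    """Annotate the original tokens first, then build the cleaned tokens."""
    tokens = tokenize(text)

    # Annotations use the untouched tokens, so capitalization and
    # function words are still available to the tagger and chunker.
    pos_tags = tag(tokens)
    named_entities = chunk(pos_tags)

    # Cleaning happens afterwards and does not affect the annotations.
    # isalpha() drops punctuation and mixed tokens such as "44th" entirely.
    processed = [
        stem(token.lower())
        for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]

    return {
        "processed_tokens": processed,
        "pos_tags": pos_tags,
        "named_entities": named_entities,
    }
```

With the imports from the cell above, a call would look like `annotate_then_clean(text, word_tokenize, pos_tag, ne_chunk, set(stopwords.words('english')), PorterStemmer().stem)`.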


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
