# Lab 3: Exploring Text Pre-processing with NLTK

Note: This lab is **NOT** graded. However, you are strongly encouraged to solve the exercises, which will help you with Assignment 2 (to be posted tonight).

In this tutorial, "search" means querying a collection of text documents and retrieving the ones most relevant to the query. The goal is to show how different text pre-processing techniques can improve the accuracy and relevance of the search results.

For pre-processing, we will use (a) basic Python functions for regular expressions and (b) the Natural Language Toolkit (NLTK).

We start by installing NLTK and spaCy.

In [32]:
!pip install nltk
!pip install spacy



Let's create sample documents and queries. You can always replace this with a process that ingests real documents (say, from PDF files) and converts them into lists.

In [33]:
documents = [
    "Text pre-processing is essential for natural language processing.",
    "NLTK and SpaCy are popular libraries for text analysis.",
    "Clean and normalized text improves search accuracy.",
    "Tokenization breaks text into words or phrases.",
    "Stemming reduces words to their root form.",
    "Lemmatization provides the base or dictionary form of a word.",
    "Stop words removal eliminates common words.",
    "Hunpos helps correct spelling errors.",
    "Python is widely used in natural language processing.",
    "SpaCy's lemmatization is more advanced than NLTK's."
]

search_queries = [
    "text analyis",
    "tokenization",
    "lemmatization in Python",
    "natural languge processing tools",
    "reducing word",
    "nltk and spacy",
    "breaking words"
]


# 1. Implementing a basic search algorithm

Let's implement a function that receives a list of query tokens, matches it against all the documents (a list of lists of tokens), and returns a ranked list of documents.

In [34]:
def return_ranked_results(search_query_tokens, all_documents):
    """
    Rank documents by how many query tokens they contain.

    Parameters:
    - search_query_tokens: List of tokens from the search query.
    - all_documents: List of tokenized documents (a list of lists of tokens).

    Returns:
    - List of (document_index, score) tuples, sorted by score in descending order.
    """
    match_score = {}

    for i, document in enumerate(all_documents):
      score = 0
      document_index = i
      for token in search_query_tokens:
        if token in document:
          score += 1
      match_score[document_index] = score

    ranked_documents = sorted(match_score.items(), key=lambda x: x[1], reverse=True)
    return list(ranked_documents)
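
To make the scoring concrete, here is a minimal sketch of the same overlap count on two toy documents. The helper name `overlap_score` and the toy data are assumptions made for illustration; they are not part of the lab code.

```python
def overlap_score(query_tokens, document_tokens):
    """Count how many query tokens appear in the document (exact match)."""
    return sum(1 for token in query_tokens if token in document_tokens)

# Toy data, already split into tokens (hypothetical, for illustration only).
toy_documents = [
    ["stemming", "reduces", "words"],
    ["tokenization", "breaks", "text", "into", "words"],
]
toy_query = ["breaks", "words"]

# Score every document, then sort by score, highest first.
scores = [(i, overlap_score(toy_query, doc)) for i, doc in enumerate(toy_documents)]
ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
print(ranked)  # [(1, 2), (0, 1)]
```

The second document contains both query tokens, so it ranks first with a score of 2.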

Let's run the search without pre-processing the documents or the search queries.

In [35]:
# Convert the documents into a list of lists of tokens using the split() method; no sophisticated techniques are used.


def perform_search_and_show_results(documents, search_queries):
  documents_tokens = []

  for document in documents:
    documents_tokens.append(document.split())

  for query in search_queries:
    search_tokens = query.split()
    results = return_ranked_results(search_tokens, documents_tokens)
    print("--------------------------")
    print(f"Results for query: {query}")
    print("--------------------------")
    for result in results:
      document_id = result[0]
      score = result[1]
      if score != 0:
        print(documents[document_id], score)

perform_search_and_show_results(documents, search_queries)

--------------------------
Results for query: text analyis
--------------------------
NLTK and SpaCy are popular libraries for text analysis. 1
Clean and normalized text improves search accuracy. 1
Tokenization breaks text into words or phrases. 1
--------------------------
Results for query: tokenization
--------------------------
--------------------------
Results for query: lemmatization in Python
--------------------------
Python is widely used in natural language processing. 2
SpaCy's lemmatization is more advanced than NLTK's. 1
--------------------------
Results for query: natural languge processing tools
--------------------------
Text pre-processing is essential for natural language processing. 1
Python is widely used in natural language processing. 1
--------------------------
Results for query: reducing word
--------------------------
--------------------------
Results for query: nltk and spacy
--------------------------
NLTK and SpaCy are popular libraries for text analysis
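
The empty result for the query `tokenization` follows from exact string comparison: `split()` keeps capitalization and any attached punctuation, so `Tokenization` and `phrases.` are different tokens from `tokenization` and `phrases`. A minimal check on one of the sample sentences:

```python
sentence = "Tokenization breaks text into words or phrases."
tokens = sentence.split()

# Capitalization is preserved, so the lowercase query term is not found.
print("tokenization" in tokens)   # False
# The final period stays attached to the last word.
print(tokens[-1])                 # phrases.
print("phrases" in tokens)        # False
```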

# 2. Applying text pre-processing to improve search

We will apply the following text pre-processing steps and check whether the search improves:

1. Cleaning
2. Tokenization and stopword removal
3. Stemming
4. Lemmatization



In [42]:
import re

def clean_and_lowercase_text(text):
    # Implement additional cleaning logic here (e.g., with the re module).
    # Convert to lowercase.
    text = text.lower()
    return text


cleaned_documents = [clean_and_lowercase_text(doc) for doc in documents]
cleaned_search_queries = [clean_and_lowercase_text(query) for query in search_queries]
perform_search_and_show_results(cleaned_documents, cleaned_search_queries)

--------------------------
Results for query: text analyis
--------------------------
text pre-processing is essential for natural language processing. 1
nltk and spacy are popular libraries for text analysis. 1
clean and normalized text improves search accuracy. 1
tokenization breaks text into words or phrases. 1
--------------------------
Results for query: tokenization
--------------------------
tokenization breaks text into words or phrases. 1
--------------------------
Results for query: lemmatization in python
--------------------------
python is widely used in natural language processing. 2
lemmatization provides the base or dictionary form of a word. 1
spacy's lemmatization is more advanced than nltk's. 1
--------------------------
Results for query: natural languge processing tools
--------------------------
text pre-processing is essential for natural language processing. 1
python is widely used in natural language processing. 1
--------------------------
Results for query: r
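
The cleaning step above only lowercases the text. One possible way to fill in the cleaning logic with the `re` module is sketched below. The function name `clean_text_with_regex` and the choice of which characters to strip are assumptions for illustration, not the lab's reference solution.

```python
import re

def clean_text_with_regex(text):
    """Lowercase text and strip punctuation (a sketch; the rules are assumptions)."""
    text = text.lower()
    # Replace anything that is not a letter, digit, whitespace, or hyphen with a space.
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(clean_text_with_regex("NLTK and SpaCy are popular libraries for text analysis."))
# nltk and spacy are popular libraries for text analysis
```

With rules like these, the trailing period no longer sticks to `analysis`, and hyphenated terms such as `pre-processing` are kept intact. Note that apostrophes are also replaced, so `spacy's` would become `spacy s`.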

We see slightly improved results. Let's go over the other steps.

In [47]:
import nltk

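# NLTK needs these datasets for tokenization, stopword removal, and lemmatization; download them once.
# Note: recent NLTK releases may also require nltk.download('punkt_tab') for word_tokenize.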
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [44]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
  tokenized_text = word_tokenize(text)
  filtered_text = [token for token in tokenized_text if token not in stop_words]
  return " ".join(filtered_text)

tokenized_cleaned_documents = [tokenize_and_remove_stopwords(doc) for doc in cleaned_documents]
tokenized_cleaned_search_queries = [tokenize_and_remove_stopwords(query) for query in cleaned_search_queries]
perform_search_and_show_results(tokenized_cleaned_documents, tokenized_cleaned_search_queries)

--------------------------
Results for query: text analyis
--------------------------
text pre-processing essential natural language processing . 1
nltk spacy popular libraries text analysis . 1
clean normalized text improves search accuracy . 1
tokenization breaks text words phrases . 1
--------------------------
Results for query: tokenization
--------------------------
tokenization breaks text words phrases . 1
--------------------------
Results for query: lemmatization python
--------------------------
lemmatization provides base dictionary form word . 1
python widely used natural language processing . 1
spacy 's lemmatization advanced nltk 's . 1
--------------------------
Results for query: natural languge processing tools
--------------------------
text pre-processing essential natural language processing . 2
python widely used natural language processing . 2
--------------------------
Results for query: reducing word
--------------------------
lemmatization provides base dictio
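
In the output above, `word_tokenize` emits punctuation such as `.` as separate tokens, and these survive stopword removal. If that is unwanted, one option is to filter out tokens that consist only of punctuation. The sketch below works on an already tokenized list so that it runs without NLTK; the helper name `drop_punctuation_tokens` is an assumption.

```python
import string

def drop_punctuation_tokens(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [
        token for token in tokens
        if not all(char in string.punctuation for char in token)
    ]

# Token list copied from the output above for the first sample document.
tokens = ["text", "pre-processing", "essential", "natural",
          "language", "processing", "."]
print(drop_punctuation_tokens(tokens))
# ['text', 'pre-processing', 'essential', 'natural', 'language', 'processing']
```

Tokens that mix letters and punctuation, such as `pre-processing` or `'s`, are kept.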

Again, we obtain improved search results. Let's apply stemming.

In [45]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_text(text):
  text = tokenize_and_remove_stopwords(text)
  tokens = text.split()
  stemmed_words = [ps.stem(token) for token in tokens]
  return " ".join(stemmed_words)

stemmed_documents = [stem_text(doc) for doc in cleaned_documents]
stemmed_queries = [stem_text(query) for query in cleaned_search_queries]
perform_search_and_show_results(stemmed_documents, stemmed_queries)

--------------------------
Results for query: text analyi
--------------------------
text pre-process essenti natur languag process . 1
nltk spaci popular librari text analysi . 1
clean normal text improv search accuraci . 1
token break text word phrase . 1
--------------------------
Results for query: token
--------------------------
token break text word phrase . 1
--------------------------
Results for query: lemmat python
--------------------------
lemmat provid base dictionari form word . 1
python wide use natur languag process . 1
spaci 's lemmat advanc nltk 's . 1
--------------------------
Results for query: natur langug process tool
--------------------------
text pre-process essenti natur languag process . 2
python wide use natur languag process . 2
--------------------------
Results for query: reduc word
--------------------------
stem reduc word root form . 2
token break text word phrase . 1
lemmat provid base dictionari form word . 1
stop word remov elimin common word . 1
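
Stemming helps the query `reducing word` because `reducing` and `reduces` collapse to the same stem, `reduc`, as the output above shows. The toy suffix stripper below illustrates the idea only. It is not the Porter algorithm, and the function name `toy_stem` and its suffix list are assumptions made for this sketch.

```python
def toy_stem(word, suffixes=("ing", "es", "s")):
    """Strip the first matching suffix; a crude stand-in for a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

query_stems = {toy_stem(w) for w in "reducing word".split()}
doc_stems = {toy_stem(w) for w in "stemming reduces words".split()}

# Both query terms now overlap with the document.
print(sorted(query_stems & doc_stems))  # ['reduc', 'word']
```

Without stemming, neither `reducing` nor `word` would match `reduces` or `words` exactly, which is why this query returned nothing in the first run.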


Now, let's try lemmatization.

In [48]:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def lemmatize_text(text):
  text = tokenize_and_remove_stopwords(text)
  tokens = text.split()
  lemmas = [wnl.lemmatize(token) for token in tokens]
  return " ".join(lemmas)

lemmatized_documents = [lemmatize_text(doc) for doc in cleaned_documents]
lemmatized_queries = [lemmatize_text(query) for query in cleaned_search_queries]
perform_search_and_show_results(lemmatized_documents, lemmatized_queries)

--------------------------
Results for query: text analyis
--------------------------
text pre-processing essential natural language processing . 1
nltk spacy popular library text analysis . 1
clean normalized text improves search accuracy . 1
tokenization break text word phrase . 1
--------------------------
Results for query: tokenization
--------------------------
tokenization break text word phrase . 1
--------------------------
Results for query: lemmatization python
--------------------------
lemmatization provides base dictionary form word . 1
python widely used natural language processing . 1
spacy 's lemmatization advanced nltk 's . 1
--------------------------
Results for query: natural languge processing tool
--------------------------
text pre-processing essential natural language processing . 2
python widely used natural language processing . 2
--------------------------
Results for query: reducing word
--------------------------
tokenization break text word phrase . 1
ste

Exercise E1. Search terms within a PDF document

For the research paper at https://arxiv.org/abs/1706.03762:
1. Extract all the text and split it into sentences using NLTK's sentence tokenizer.

Example code:

```
from nltk.tokenize import sent_tokenize

text = "God is Great! I won a lottery."
print(sent_tokenize(text))

# Output: ['God is Great!', 'I won a lottery.']
```

2. Define a list of search queries on your own.

3. Reimplement the above techniques and compare the search results before and after applying text pre-processing.