[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nepal-College-of-Information-Technology/AI-Data-Science-Workshop-2024/blob/main/Day%2010%20%3A%20Natural%20Language%20Processing/Part1_Text_Processing_Basics.ipynb)


# Part 1: Text Processing Basics

In this notebook, we will explore the basics of **Text Processing**, a fundamental step in **Natural Language Processing (NLP)**. We will cover four key techniques: **tokenization**, **stopwords removal**, **stemming**, and **lemmatization**. These techniques help prepare raw text data for analysis.

---

## 1. What is Text Processing?

Text processing refers to the steps involved in cleaning and preparing text data for further analysis. Raw text data often contains noise (e.g., punctuation, stopwords) and needs to be standardized before it can be used for tasks like **sentiment analysis**, **text classification**, or **machine translation**.

### Real-World Example:
Suppose you're analyzing customer reviews on an e-commerce platform. The raw text might contain slang, punctuation, or filler words that don't help with analysis. **Text processing** cleans the text so that later steps can focus on the important words.

---

## 2. Tokenization

**Tokenization** is the process of splitting text into smaller pieces called **tokens**. Tokens can be words, sentences, or even sub-words.

### Example:
- Input: `"I love natural language processing!"`
- Tokens: `["I", "love", "natural", "language", "processing", "!"]`

### Code Example:

Let's use **NLTK** to tokenize some text.

In [1]:
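import nltk
from nltk.tokenize import word_tokenize

# Tokenizer data: older NLTK releases use 'punkt'; newer ones look for 'punkt_tab'
nltk.download('punkt')
nltk.download('punkt_tab')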

# Example sentence
text = "I love natural language processing!"

# Tokenize the text into words
tokens = word_tokenize(text)
print("Tokens:", tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Tokens: ['I', 'love', 'natural', 'language', 'processing', '!']
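
The text above mentions that tokens can be words or sentences. As a rough illustration of what a tokenizer does under the hood, here is a minimal sketch using only Python's `re` module. The function names `simple_word_tokenize` and `simple_sentence_split` are made up for this example, and the rules are far simpler than NLTK's, so treat it as a teaching aid rather than a replacement for `word_tokenize`.

```python
import re

def simple_word_tokenize(text):
    # \w+ matches a run of letters/digits/underscores (a word);
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sentence_split(text):
    # Split on whitespace that follows ., ! or ? (a naive rule that
    # will break on abbreviations such as "Dr." or "e.g.").
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "I love natural language processing! It is fun."
print(simple_sentence_split(text))
print(simple_word_tokenize("I love natural language processing!"))
```

On the example sentence, the word-level function produces the same six tokens as the NLTK cell above. Real tokenizers handle many more cases (contractions, abbreviations, URLs), which is why a library is normally preferred.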


---

## 3. Stopwords Removal

**Stopwords** are very common words like "is", "the", and "in" that carry little meaning on their own. Removing them shrinks the vocabulary and lets later steps focus on the words that carry the content.

### Real-World Example:
In a sentence like "The movie was great!", the words "The" and "was" add little meaning, so they can be removed, leaving "movie" and "great".
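
The example above can be sketched without any library. The stopword set below is a tiny, hand-picked assumption for illustration (NLTK's English list is much longer), and `remove_stopwords` and its `drop_punctuation` flag are hypothetical names introduced here, not part of NLTK.

```python
import string

# Hypothetical, hand-picked stopword set for illustration only.
SMALL_STOPWORDS = {"the", "was", "is", "in", "a", "an"}

def remove_stopwords(tokens, stop_words=SMALL_STOPWORDS, drop_punctuation=False):
    kept = []
    for tok in tokens:
        # Compare in lowercase so "The" matches "the".
        if tok.lower() in stop_words:
            continue
        # Optionally discard tokens made up only of punctuation characters.
        if drop_punctuation and all(ch in string.punctuation for ch in tok):
            continue
        kept.append(tok)
    return kept

tokens = ["The", "movie", "was", "great", "!"]
print(remove_stopwords(tokens))                         # ['movie', 'great', '!']
print(remove_stopwords(tokens, drop_punctuation=True))  # ['movie', 'great']
```

Stopword removal alone does not remove punctuation, which is why the optional flag is shown as a separate step.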

### Code Example:

Let's remove stopwords from the tokenized text using **NLTK**.

In [2]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Remove stopwords from tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Filtered Tokens (without stopwords):", filtered_tokens)

Filtered Tokens (without stopwords): ['love', 'natural', 'language', 'processing', '!']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [4]:
# Uncomment to inspect the full set of English stopwords
# stop_words

---

## 4. Stemming

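**Stemming** is the process of reducing words to a root form by stripping suffixes with simple rules. For example, the Porter stemmer maps "running" and "runs" to "run". Because it only strips suffixes, it leaves irregular forms such as "ran" unchanged, and the resulting stem is not always a dictionary word (for example, "natural" becomes "natur"). Stemming helps in grouping related word forms together.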

### Real-World Example:
When analyzing product reviews, words like "excited", "exciting", and "excitedly" all express the same underlying sentiment. Stemming tries to collapse such variants into a single stem so they can be counted together.
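
To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm (which applies several ordered rule phases); the suffix list, the `min_stem_len` threshold, and the name `naive_stem` are all assumptions made for this illustration.

```python
# Longer suffixes come first so that "edly" is tried before "ly" or "ed".
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "s")

def naive_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if enough of the word would remain afterwards.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["excited", "exciting", "excitedly", "ran"]:
    print(w, "->", naive_stem(w))
```

The three "excit-" variants collapse to the same stem, while the irregular form "ran" is left untouched because no suffix rule applies to it. A real stemmer such as `PorterStemmer` uses a more careful rule set and may produce different stems.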

### Code Example:

We will use **PorterStemmer** from NLTK for stemming.

In [5]:
from nltk.stem import PorterStemmer

# Initialize the stemmer
stemmer = PorterStemmer()

# Stem the filtered tokens
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

Stemmed Tokens: ['love', 'natur', 'languag', 'process', '!']


---

## 5. Lemmatization

**Lemmatization** is a more advanced technique compared to stemming. It reduces words to their base or dictionary form (lemma). Unlike stemming, lemmatization looks words up in a vocabulary and uses the word's part of speech, so the result is a real word.

### Real-World Example:
Consider the words "better" and "good". Stemming would not recognize these as related, but **lemmatization** can map "better" to "good" when it is told that "better" is being used as an adjective.
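
To illustrate why the part of speech matters, here is a toy dictionary-based sketch. The lookup table and the function `toy_lemmatize` are hypothetical and hold only a handful of entries; a real lemmatizer such as NLTK's `WordNetLemmatizer` consults the WordNet database instead. The default of `pos="n"` (noun) mirrors the default of `WordNetLemmatizer.lemmatize`.

```python
# Hypothetical mini-dictionary keyed by (lowercased word, part of speech).
# "n" = noun, "v" = verb, "a" = adjective.
LEMMA_TABLE = {
    ("better", "a"): "good",
    ("loved", "v"): "love",
    ("were", "v"): "be",
    ("customers", "n"): "customer",
}

def toy_lemmatize(word, pos="n"):
    # Return the dictionary form if this (word, pos) pair is known;
    # otherwise fall back to the word unchanged.
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("better"))           # treated as a noun: unchanged
print(toy_lemmatize("better", pos="a"))  # treated as an adjective: 'good'
print(toy_lemmatize("loved", pos="v"))   # treated as a verb: 'love'
```

The same word gives different results depending on the part of speech supplied, which is why lemmatization is often paired with a part-of-speech tagger.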

### Code Example:

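We will use **WordNetLemmatizer** from NLTK for lemmatization. Note that its `lemmatize` method treats every word as a noun unless a `pos` argument is given, which is why a word like "processing" is left unchanged in the output below.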

In [6]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the filtered tokens
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatized Tokens: ['love', 'natural', 'language', 'processing', '!']


---

## 6. Putting It All Together

Now that we've covered tokenization, stopwords removal, stemming, and lemmatization, let's apply all these techniques to a real-world example: **Analyzing customer feedback**.

### Example Sentence:
"Customers loved the product, but some were disappointed with the late delivery."

We will tokenize this text, remove stopwords, and apply both stemming and lemmatization.

### Code Example:

In [7]:
# Example sentence for text processing
feedback = "Customers loved the product, but some were disappointed with the late delivery."

# Tokenization
tokens = word_tokenize(feedback)

# Stopwords Removal
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Stemming
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print("Original Text:", feedback)
print("Filtered Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Text: Customers loved the product, but some were disappointed with the late delivery.
Filtered Tokens: ['Customers', 'loved', 'product', ',', 'disappointed', 'late', 'delivery', '.']
Stemmed Tokens: ['custom', 'love', 'product', ',', 'disappoint', 'late', 'deliveri', '.']
Lemmatized Tokens: ['Customers', 'loved', 'product', ',', 'disappointed', 'late', 'delivery', '.']
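
The same sequence of steps can be wrapped in one reusable function. The sketch below uses only the standard library so it runs without any downloads: the regex tokenizer, the small `STOPWORDS` set, the punctuation filter, and the `normalize` parameter are all assumptions made for illustration. In practice you might pass `stemmer.stem` or `lemmatizer.lemmatize` as `normalize` and use NLTK's stopword list.

```python
import re

# Hypothetical subset of stopwords, just enough for the feedback sentence.
STOPWORDS = {"the", "but", "some", "were", "with"}

def preprocess(text, stop_words=STOPWORDS, normalize=str.lower):
    # 1. Tokenize into words and single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    result = []
    for tok in tokens:
        # 2. Skip tokens with no letters or digits (pure punctuation).
        if not any(ch.isalnum() for ch in tok):
            continue
        # 3. Skip stopwords (case-insensitive).
        if tok.lower() in stop_words:
            continue
        # 4. Normalize the surviving token (lowercase by default).
        result.append(normalize(tok))
    return result

feedback = "Customers loved the product, but some were disappointed with the late delivery."
print(preprocess(feedback))
```

Unlike the cell above, this version also drops punctuation tokens such as "," and ".", which is an extra step often added before counting words.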


### Another Example:

In [8]:
story = "Nepal, a small yet diverse country nestled in the lap of the Himalayas, is a land of rich cultural heritage, breathtaking landscapes, and warm hospitality. From the towering peaks of Mount Everest to the lush jungles of Chitwan, the country’s geography offers a unique contrast that captivates both locals and visitors alike. The capital city, Kathmandu, is a vibrant hub of ancient history and modern development, where centuries-old temples like Pashupatinath and Swayambhunath stand alongside bustling markets and contemporary buildings. Nepal is a multilingual and multi-ethnic nation, home to more than 100 ethnic groups and 120 languages, with Nepali being the official language. Festivals like Dashain, Tihar, and Holi bring communities together in celebration, showcasing the country’s deep-rooted traditions. Despite its natural beauty and cultural richness, Nepal faces numerous challenges, including poverty, political instability, and infrastructural development. Many parts of the country, especially in the rural areas, still lack access to quality education, healthcare, and basic amenities. Agriculture remains the primary source of livelihood for the majority of Nepalese, although tourism has emerged as a significant contributor to the economy, drawing trekkers, mountaineers, and spiritual seekers from around the world. As Nepal continues to balance its ancient traditions with the demands of modernity, the spirit of resilience and optimism remains ever-present among its people, making it a truly unique and inspiring nation."

In [9]:

# Tokenization
tokens = word_tokenize(story)

# Stopwords Removal
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Stemming
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]

# Lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print("Original Text:", story)
print("Filtered Tokens:", filtered_tokens)
print("Stemmed Tokens:", stemmed_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)

Original Text: Nepal, a small yet diverse country nestled in the lap of the Himalayas, is a land of rich cultural heritage, breathtaking landscapes, and warm hospitality. From the towering peaks of Mount Everest to the lush jungles of Chitwan, the country’s geography offers a unique contrast that captivates both locals and visitors alike. The capital city, Kathmandu, is a vibrant hub of ancient history and modern development, where centuries-old temples like Pashupatinath and Swayambhunath stand alongside bustling markets and contemporary buildings. Nepal is a multilingual and multi-ethnic nation, home to more than 100 ethnic groups and 120 languages, with Nepali being the official language. Festivals like Dashain, Tihar, and Holi bring communities together in celebration, showcasing the country’s deep-rooted traditions. Despite its natural beauty and cultural richness, Nepal faces numerous challenges, including poverty, political instability, and infrastructural development. Many part

---

### Conclusion:

In this notebook, we explored the basics of text processing, including **tokenization**, **stopwords removal**, **stemming**, and **lemmatization**. These techniques are essential for preparing raw text data for NLP tasks like **sentiment analysis** or **text classification**.

In the next notebook, we will build a **Sentiment Analysis Model** using these text processing techniques.

---