In [30]:
import nltk
import string

# sent_tokenize splits a text into sentences; word_tokenize splits it into
# word and punctuation tokens. Both are standard preprocessing steps in NLP.
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the Punkt models. Punkt is a pre-trained sentence tokenizer;
# word_tokenize also uses it to split text into sentences before tokenizing words.
# Recent NLTK releases may ask for the "punkt_tab" resource instead.
nltk.download("punkt")


# Download the stopword lists returned by stopwords.words().
nltk.download('stopwords')

# - stopwords: a corpus of common words (e.g., 'and', 'the') that are often filtered out in text processing.
# - PorterStemmer: a stemming algorithm that reduces words to their stems.
# - string (imported above): a standard-library module of string constants and helper functions.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

<div style="background-color:#d8b4fe; padding:15px; border-radius:10px;">

### 📘 Exercise 1: Corpus Loading & Tokenization
**Task:**  
1. Load the provided corpus file: `exercise1_corpus.txt`.  
2. Tokenize the text into words using **NLTK**.  
3. Count the frequency of each token.  


</div>

In [33]:
with open("./excercise_corpora/exercise1_corpus.txt") as corp:
    corpus1 = corp.read()  # read the whole corpus into one string

sentences1 = sent_tokenize(corpus1)  # split the text into sentences
# Keep word tokens and drop punctuation. Note that `in string.punctuation` is a
# substring test, so multi-character tokens such as `` or '' are not removed.
tokens1 = [word for word in word_tokenize(corpus1) if word not in string.punctuation]
frequencies1 = nltk.FreqDist(tokens1)  # count how often each token occurs

print("Sentences in corpus 1:")
print(sentences1)
print("-"*100)
print("Tokenized corpus 1:")
print(tokens1)
print("-"*100)
print("Top 5 token frequencies in corpus 1:")
print(frequencies1.most_common(5))
print("-"*100)


Sentences in corpus 1:
['Sunflowers are a type of plant that follow the sun, turning their heads from east to west as the day progresses.', 'They are bright, tall, and full of seeds that humans eat, while birds and insects also rely on them for nourishment.', 'Some cultures even associate sunflowers with loyalty, longevity, and adoration, making them a symbol of positivity.', 'Roses, with their delicate petals and enchanting fragrance, have long been cultivated for ornamental and ceremonial purposes.', 'Their colors, ranging from deep crimson to soft pastels, carry various symbolic meanings such as love, friendship, and remembrance.', 'Tulips emerge in early spring, creating vibrant carpets of red, yellow, purple, and white.', 'Originating in Central Asia, tulips became an iconic flower in Dutch culture, sparking the historical phenomenon known as "tulip mania."', 'Bamboo is a fast-growing grass that can reach impressive heights in a short period.', 'It is highly valued for its versatility
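
One caveat about the `word not in string.punctuation` filter used in this cell: `string.punctuation` is a single string, so `in` performs a substring test rather than a per-token punctuation check. The sketch below shows the difference. `is_punct` is a hypothetical helper written for this note, not an NLTK function.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punct(token):
    """Return True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# `` and '' are how word_tokenize renders double quotes. They are not
# substrings of string.punctuation, so the substring test does not flag them.
for tok in ["``", "''", "...", ".", "don't"]:
    print(repr(tok), "substring test:", tok in string.punctuation, "| is_punct:", is_punct(tok))
```

With the per-character check, multi-character punctuation tokens are filtered out as well, while words that merely contain an apostrophe are kept.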

<div style="background-color:#d8b4fe; padding:15px; border-radius:10px;">

### 📘 Exercise 2: Stopword Removal and Stemming

**Objective:**  
Using NLTK, perform the following tasks on a given corpus:
1. Tokenize text into words.
2. Remove English stopwords.
3. Apply stemming to each remaining token.

**Instructions:**  
- Use the `nltk.corpus.stopwords` for the stopword list.
- Use the `PorterStemmer` for stemming.
- Display:
  - Original tokenized words
  - Tokenized words after stopword removal
  - Tokenized words after stemming

**Write your solution in the cell below.**
</div>

In [34]:
with open("./excercise_corpora/exercise2_corpus.txt") as corp:
    corpus2 = corp.read()

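stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# Lowercase, tokenize, and keep only purely alphanumeric tokens. This drops
# punctuation, but also hyphenated tokens and decimals.
tokens2 = [word for word in word_tokenize(corpus2.lower()) if word.isalnum()]
filtered_tokens2 = [w for w in tokens2 if w not in stop_words]  # remove stopwords
stemmed_tokens2 = [stemmer.stem(w) for w in filtered_tokens2]  # reduce each token to its stem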

print("Tokenized corpus 2:")
print(tokens2)
print("-"*100)

print("Filtered (no stopwords) corpus 2:")
print(filtered_tokens2)
print("-"*100)

print("Stemmed corpus 2:")
print(stemmed_tokens2)
print("-"*100)

Tokenized corpus 2:
['natural', 'language', 'processing', 'nlp', 'is', 'a', 'vital', 'subfield', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'using', 'natural', 'language', 'it', 'combines', 'linguistics', 'computer', 'science', 'and', 'machine', 'learning', 'to', 'process', 'and', 'analyze', 'vast', 'amounts', 'of', 'textual', 'data', 'applications', 'of', 'nlp', 'span', 'numerous', 'domains', 'including', 'chatbots', 'virtual', 'assistants', 'machine', 'translation', 'speech', 'recognition', 'and', 'sentiment', 'analysis', 'advanced', 'nlp', 'models', 'leverage', 'techniques', 'such', 'as', 'tokenization', 'stemming', 'lemmatization', 'and', 'word', 'embeddings', 'to', 'understand', 'context', 'and', 'meaning', 'transformers', 'recurrent', 'neural', 'networks', 'rnns', 'and', 'long', 'memory', 'networks', 'lstms', 'have', 'significantly', 'improved', 'the', 'accuracy', 'of', 'language', 'models', 'recent'
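
The `word.isalnum()` filter in this cell removes punctuation, but it also discards any token that mixes letters or digits with other characters, such as hyphenated words or decimals. The output above shows `'long', 'memory'` with nothing in between, which suggests a hyphenated token was dropped this way. Below is a small sketch of that behaviour and of one possible alternative; `has_alnum` is a hypothetical helper, not part of NLTK.

```python
def has_alnum(token):
    """Keep a token if it contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)

samples = ["nlp", "short-term", "3.5", ",", "''"]

strict = [t for t in samples if t.isalnum()]    # same test as the cell above
lenient = [t for t in samples if has_alnum(t)]  # keeps hyphenated words and decimals

print(strict)   # ['nlp']
print(lenient)  # ['nlp', 'short-term', '3.5']
```

Which filter is appropriate depends on the task; the lenient version keeps more vocabulary, including forms the stemmer may not normalize.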

<div style="background-color:#d8b4fe; padding:15px; border-radius:10px;">
    <h3>📘 Exercise 3: Tokenization, Stopword Removal, and Stemming</h3>
    <strong>Tasks:</strong>
    <ol>
        <li>Load the provided corpus from <code>exercise3_corpus.txt</code>.</li>
        <li>Tokenize the text into words.</li>
        <li>Remove punctuation and stopwords.</li>
        <li>Apply stemming using NLTK's <code>PorterStemmer</code>.</li>
        <li>Print the tokenized, filtered, and stemmed lists.</li>
    </ol>
</div>



In [35]:
with open("./excercise_corpora/exercise3_corpus.txt") as corp:
    corpus3 = corp.read()

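stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# Same pipeline as Exercise 2: lowercase, tokenize, keep alphanumeric tokens
# (which removes punctuation), then remove stopwords and stem.
tokens3 = [word for word in word_tokenize(corpus3.lower()) if word.isalnum()]
filtered_tokens3 = [w for w in tokens3 if w not in stop_words]
stemmed_tokens3 = [stemmer.stem(w) for w in filtered_tokens3]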

print("Tokenized corpus 3:")
print(tokens3)
print("-"*100)

print("Filtered (no stopwords) corpus 3:")
print(filtered_tokens3)
print("-"*100)

print("Stemmed corpus 3:")
print(stemmed_tokens3)
print("-"*100)

Tokenized corpus 3:
['artificial', 'intelligence', 'is', 'revolutionizing', 'multiple', 'industries', 'including', 'healthcare', 'finance', 'and', 'transportation', 'machine', 'learning', 'allows', 'systems', 'to', 'improve', 'automatically', 'through', 'experience', 'natural', 'language', 'understanding', 'helps', 'computers', 'interact', 'with', 'humans', 'seamlessly', 'robotics', 'combines', 'ai', 'with', 'physical', 'systems', 'to', 'perform', 'tasks', 'autonomously', 'ethical', 'considerations', 'are', 'crucial', 'for', 'ai', 'deployment', 'especially', 'regarding', 'privacy', 'and', 'fairness']
----------------------------------------------------------------------------------------------------
Filtered (no stopwords) corpus 3:
['artificial', 'intelligence', 'revolutionizing', 'multiple', 'industries', 'including', 'healthcare', 'finance', 'transportation', 'machine', 'learning', 'allows', 'systems', 'improve', 'automatically', 'experience', 'natural', 'language', 'understanding',

<div style="background-color:#FFF8DC; padding:15px; border-radius:8px; line-height:1.5;">

# NLP Text Preprocessing Concepts (Exercises 1–3)

Exercises 1–3 process text corpora in a few standard steps. Each step has a specific purpose in getting raw text ready for analysis or modeling in NLP. The sections below summarize what each step does and why it matters:

---

## 1. Tokenized Corpus
**What It Is:**  
A tokenized corpus is the text broken down into **tokens**, the basic units that later steps operate on. Tokens can be words, subwords, or punctuation marks.

**For Example:**  
Original sentence: `Sunflowers are bright and beautiful.`

After tokenization:  
`['Sunflowers', 'are', 'bright', 'and', 'beautiful', '.']`

**Why It Matters:**  
- This process turns raw text into manageable pieces for computation.  
- It allows us to perform frequency analysis, create n-gram models, and develop word embeddings.  
- Tokenization is fundamental for nearly all NLP tasks, whether it’s classification, generation, or retrieval.
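
To make the idea concrete without any downloads, here is a minimal regex-based tokenizer. It is only a sketch and not how NLTK's `word_tokenize` works internally; the names `TOKEN_RE` and `simple_tokenize` are made up for this illustration.

```python
import re

# \w+(?:[-']\w+)*  matches a word, optionally with internal hyphens or apostrophes
# [^\w\s]          matches any single punctuation character
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Sunflowers are bright and beautiful."))
# ['Sunflowers', 'are', 'bright', 'and', 'beautiful', '.']
```

The results can differ from `word_tokenize` on harder input. For instance, `word_tokenize` splits contractions such as `don't` into `do` and `n't`, while this pattern keeps them as one token.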

---

## 2. Filtered Corpus (Stopwords Removal & Punctuation Handling)
**What It Is:**  
A filtered corpus is what remains after we remove **stopwords** (very common words such as `the`, `and`, `is` that carry little content on their own) and punctuation, leaving mostly **content words**.

**For Example:**  
Starting with this tokenized list:  
`['Sunflowers', 'are', 'bright', 'and', 'beautiful', '.']`

After filtering, we get:  
`['Sunflowers', 'bright', 'beautiful']`

**Why It Matters:**  
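- This step reduces noise and the dimensionality of our data.  
- It keeps the focus on the words that carry most of the semantic content.  
- It can improve efficiency, and often accuracy, in bag-of-words tasks like text classification and topic modeling. The stopword list should still be checked per task: NLTK's English list includes negations such as `not`, which matter for sentiment analysis.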
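
The filtering step can be sketched as follows. The stopword set here is a tiny hand-picked stand-in for NLTK's `stopwords.words('english')`, and `filter_tokens` is a hypothetical helper name.

```python
import string

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}
PUNCT_CHARS = set(string.punctuation)

def filter_tokens(tokens):
    """Drop punctuation-only tokens and stopwords, keeping the original casing."""
    kept = []
    for tok in tokens:
        if all(ch in PUNCT_CHARS for ch in tok):
            continue  # token is empty or made up only of punctuation
        if tok.lower() in STOP_WORDS:
            continue  # compare in lowercase so 'The' matches 'the'
        kept.append(tok)
    return kept

tokens = ['Sunflowers', 'are', 'bright', 'and', 'beautiful', '.']
print(filter_tokens(tokens))  # ['Sunflowers', 'bright', 'beautiful']
```

The comparison is done in lowercase because NLTK's stopword lists are lowercase, so a capitalized `The` at the start of a sentence would otherwise slip through.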

---

## 3. Stemmed Corpus
**What It Is:**  
A stemmed corpus reduces each word to its **stem** by stripping suffixes with a rule-based algorithm such as NLTK's **PorterStemmer**. A stem is not always a dictionary word (for example, `easili`).

**For Example:**  
Original words: `running, runs, runner, easily, faster`

After stemming: `run, run, runner, easili, faster`

**Why It Matters:**  
- This groups words that are morphologically related under the same stem.  
- It helps reduce the vocabulary size and sparsity.  
- By doing this, models can recognize that different forms of a word share similar meanings.  
- This is particularly useful for information retrieval, search engines, and classification tasks.
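
To see how stemming shrinks the vocabulary, the sketch below groups word forms by stem. The stems are copied from the example above and hard-coded rather than computed, so the snippet runs without NLTK; in the exercises they come from `PorterStemmer().stem`.

```python
from collections import defaultdict

# word -> stem pairs taken from the example above (not computed here)
stem_of = {
    "running": "run",
    "runs": "run",
    "runner": "runner",
    "easily": "easili",
    "faster": "faster",
}

# Collect all word forms that share a stem.
groups = defaultdict(list)
for word, stem in stem_of.items():
    groups[stem].append(word)

print(dict(groups))
print(len(stem_of), "word forms ->", len(groups), "stems")  # 5 word forms -> 4 stems
```

Only `running` and `runs` are merged here. `runner` keeps its own stem, which shows that stemming is a heuristic and does not conflate every related form.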

---

## 4. Frequency Analysis
**What It Is:**  
Frequency analysis involves counting how often each token or stem appears in the corpus to get **word frequencies**.

**For Example:**  
{'sunflowers': 5, 'bright': 3, 'beautiful': 3}

**Why It Matters:**  
- This helps us identify the most common words or stems in our text.  
- It supports tasks like keyword extraction, weighting in TF-IDF, and feature selection for models.  
- Frequency counts are also the basis of subword methods such as BPE, which repeatedly merges the most frequent pair of adjacent symbols.
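
Frequency counting needs nothing beyond the standard library. The sketch below uses `collections.Counter` on a made-up token list; NLTK's `FreqDist`, used in Exercise 1, is a `Counter` subclass and offers the same `most_common` method.

```python
from collections import Counter

# Made-up token list for illustration only.
tokens = ["sunflowers", "bright", "sunflowers", "beautiful", "bright", "sunflowers"]

counts = Counter(tokens)
total = sum(counts.values())

# Relative frequency of each token (count divided by the number of tokens).
relative = {word: count / total for word, count in counts.items()}

print(counts.most_common(2))   # [('sunflowers', 3), ('bright', 2)]
print(relative["sunflowers"])  # 0.5
```

Relative frequencies make counts comparable across corpora of different sizes, which is the same idea behind the term-frequency part of TF-IDF.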

---

## 5. Overall Goal
The entire process of **tokenization → filtering → stemming → frequency analysis** allows us to:  

1. Clean and normalize the raw text.  
2. Reduce vocabulary size and eliminate noise.  
3. Focus on the semantically meaningful units.  
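4. Prepare our data for NLP tasks such as:  
   - Text classification  
   - Sentiment analysis  
   - Topic modeling and information retrieval  

Tasks that depend on word order, casing, or function words, such as named entity recognition and language modeling, usually need tokenization only; stopword removal and stemming tend to discard information they rely on.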

**Key Takeaway:**  
These preprocessing steps produce a **structured representation of raw text** that reduces morphological variation and irrelevant content, which makes patterns easier for NLP algorithms to learn.
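
The four steps can be chained into a single function, as in the sketch below. It is a simplified stand-in for the exercise code: the regex tokenizer replaces `word_tokenize`, and the stemmer is passed in as a parameter that defaults to doing nothing. The names `preprocess`, `TOKEN_RE`, and `PUNCT_CHARS` are assumptions made for this illustration.

```python
import re
import string
from collections import Counter

TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
PUNCT_CHARS = set(string.punctuation)

def preprocess(text, stop_words, stem=lambda w: w):
    """Tokenize, filter, stem, and count; returns a Counter of stems."""
    tokens = TOKEN_RE.findall(text.lower())                 # 1. tokenize
    filtered = [
        t for t in tokens
        if not all(ch in PUNCT_CHARS for ch in t)           # 2a. drop punctuation
        and t not in stop_words                             # 2b. drop stopwords
    ]
    stems = [stem(t) for t in filtered]                     # 3. stem
    return Counter(stems)                                   # 4. count

freqs = preprocess(
    "The sunflowers are bright. Sunflowers follow the sun.",
    stop_words={"the", "are"},
)
print(freqs.most_common(1))  # [('sunflowers', 2)]
```

To reproduce the exercise setup more closely, one could pass `set(stopwords.words('english'))` as `stop_words` and `PorterStemmer().stem` as `stem`.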

</div>
