7) Text Analytics
1. Extract a sample document and apply the following preprocessing methods:
Tokenization, POS tagging, stop word removal, stemming, and lemmatization.
2. Create a representation of the document by calculating Term Frequency (TF) and Inverse Document
Frequency (IDF).

In [1]:
# ------------------------
# 📚 1. Import Required Libraries
# ------------------------

import nltk
import re
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Download the NLTK resources used below (needed once per environment)
nltk.download('punkt_tab')
# If 'punkt_tab' is not available (older NLTK versions), use: nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
# If that is not available (older NLTK versions), use: nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\kedar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kedar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kedar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\kedar\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [3]:
# ------------------------
# 📚 2. Sample Text for Preprocessing
# ------------------------

text = """Tokenization is the first step in text analytics.
The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization."""

In [4]:
print("Original Text:\n", text)

Original Text:
 Tokenization is the first step in text analytics.
The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization.


In [5]:
# ------------------------
# 📚 3. Tokenization
# ------------------------

# Sentence Tokenization
sent_tokens = sent_tokenize(text)
print("\nSentence Tokenization:\n", sent_tokens)


Sentence Tokenization:
 ['Tokenization is the first step in text analytics.', 'The process of breaking down a text paragraph into smaller chunks such as words or sentences is called Tokenization.']


In [6]:
# Word Tokenization
word_tokens = word_tokenize(text)
print("\nWord Tokenization:\n", word_tokens)



Word Tokenization:
 ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'analytics', '.', 'The', 'process', 'of', 'breaking', 'down', 'a', 'text', 'paragraph', 'into', 'smaller', 'chunks', 'such', 'as', 'words', 'or', 'sentences', 'is', 'called', 'Tokenization', '.']


In [7]:
# ------------------------
# 📚 4. POS Tagging (Part of Speech Tagging)
# ------------------------

pos_tags = nltk.pos_tag(word_tokens)
print("\nPOS Tagging:\n", pos_tags)


POS Tagging:
 [('Tokenization', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('first', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('text', 'JJ'), ('analytics', 'NNS'), ('.', '.'), ('The', 'DT'), ('process', 'NN'), ('of', 'IN'), ('breaking', 'VBG'), ('down', 'RP'), ('a', 'DT'), ('text', 'NN'), ('paragraph', 'NN'), ('into', 'IN'), ('smaller', 'JJR'), ('chunks', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('words', 'NNS'), ('or', 'CC'), ('sentences', 'NNS'), ('is', 'VBZ'), ('called', 'VBN'), ('Tokenization', 'NN'), ('.', '.')]


In [8]:
# ------------------------
# 📚 5. Stopwords Removal
# ------------------------

stop_words = set(stopwords.words("english"))


In [9]:
# Replace non-letter characters with spaces, then lowercase and tokenize
text_clean = re.sub(r'[^a-zA-Z]', ' ', text)
tokens = word_tokenize(text_clean.lower())

In [10]:
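# Remove stopwords.
# Note: this filters word_tokens (original case, punctuation kept), not the cleaned
# `tokens` from In [9], so 'The' and '.' remain in the output below.
# Filter `tokens` instead to drop them as well.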
filtered_words = [w for w in word_tokens if w not in stop_words]
print("\nAfter Stopwords Removal:\n", filtered_words)


After Stopwords Removal:
 ['Tokenization', 'first', 'step', 'text', 'analytics', '.', 'The', 'process', 'breaking', 'text', 'paragraph', 'smaller', 'chunks', 'words', 'sentences', 'called', 'Tokenization', '.']


In [11]:
# ------------------------
# 📚 6. Stemming
# ------------------------

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
print("\nAfter Stemming:\n", stemmed_words)


After Stemming:
 ['token', 'first', 'step', 'text', 'analyt', '.', 'the', 'process', 'break', 'text', 'paragraph', 'smaller', 'chunk', 'word', 'sentenc', 'call', 'token', '.']


In [12]:
# ------------------------
# 📚 7. Lemmatization
# ------------------------

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print("\nAfter Lemmatization:\n", lemmatized_words)


After Lemmatization:
 ['Tokenization', 'first', 'step', 'text', 'analytics', '.', 'The', 'process', 'breaking', 'text', 'paragraph', 'smaller', 'chunk', 'word', 'sentence', 'called', 'Tokenization', '.']


In [13]:
# ------------------------
# 📚 8. Document Representation - TF, IDF, TF-IDF
# ------------------------

# Sample Documents
documentA = "Jupiter is the largest Planet"
documentB = "Mars is the fourth planet from the Sun"

In [14]:
# Preprocess Documents: lowercase, remove punctuation, remove stopwords
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Remove special characters
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words]
    return " ".join(filtered_tokens)

docA_processed = preprocess_text(documentA)
docB_processed = preprocess_text(documentB)


In [15]:
print("\nPreprocessed Document A:\n", docA_processed)


Preprocessed Document A:
 jupiter largest planet


In [16]:
print("\nPreprocessed Document B:\n", docB_processed)


Preprocessed Document B:
 mars fourth planet sun


In [17]:
# Create TF-IDF Model
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([docA_processed, docB_processed])

In [18]:
# Show TF-IDF Matrix
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:\n", df_tfidf)


TF-IDF Matrix:
      fourth   jupiter   largest      mars    planet       sun
0  0.000000  0.631667  0.631667  0.000000  0.449436  0.000000
1  0.534046  0.000000  0.000000  0.534046  0.379978  0.534046



---

# 📚 PRACTICAL 7 — THEORY EXPLANATION (Line-by-Line Meaning)

---

## 1. **Import Libraries**

```python
import nltk
import re
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `import nltk` | Import the Natural Language Toolkit library | To perform text preprocessing tasks like tokenization, stopwords removal, stemming, lemmatization |
| `import re` | Import Regular Expression module | To clean text and remove unwanted characters like punctuation |
| `import pandas as pd` | Import pandas library | To organize TF-IDF scores into a table format |
| `from nltk.tokenize import sent_tokenize, word_tokenize` | Import tokenization methods | To split text into sentences and words |
| `from nltk.corpus import stopwords` | Import NLTK's stopword corpus (includes an English list) | To remove very common words that carry little meaning on their own |
| `from nltk.stem import PorterStemmer, WordNetLemmatizer` | Import stemming and lemmatization classes | To normalize words to their root/base form |
| `from sklearn.feature_extraction.text import TfidfVectorizer` | Import TF-IDF tool from scikit-learn | To automatically calculate TF-IDF scores |

---

## 2. **Download NLTK Resources**

```python
nltk.download('punkt_tab')                       # older NLTK versions: 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')  # older NLTK versions: 'averaged_perceptron_tagger'
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `nltk.download('punkt_tab')` | Download tokenizer models | Needed for sentence and word tokenization |
| `nltk.download('stopwords')` | Download stopword list | Needed to remove stopwords |
| `nltk.download('wordnet')` | Download WordNet database | Needed for lemmatization |
| `nltk.download('averaged_perceptron_tagger_eng')` | Download POS tagging model | Needed to tag words with parts of speech |

---

## 3. **Text Preprocessing (Input Text)**

```python
text = """Tokenization is the first step in text analytics..."""
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `text = """...."""` | Define sample paragraph for analysis | To apply all preprocessing steps on this input |

---

## 4. **Tokenization**

```python
sent_tokens = sent_tokenize(text)
word_tokens = word_tokenize(text)
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `sent_tokens = sent_tokenize(text)` | Break paragraph into sentences | To work with the text one sentence at a time |
| `word_tokens = word_tokenize(text)` | Break paragraph into word and punctuation tokens | To process each token individually (note that `.` is returned as its own token) |
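
To make the two levels of tokenization concrete, here is a minimal regex-based sketch. It is a simplified stand-in and not how NLTK's `sent_tokenize` and `word_tokenize` work internally; the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical. It splits sentences after `.`, `!` or `?` and treats each punctuation mark as its own token.

```python
import re

text = (
    "Tokenization is the first step in text analytics.\n"
    "The process of breaking down a text paragraph into smaller chunks "
    "such as words or sentences is called Tokenization."
)

def simple_sent_tokenize(s):
    """Split after '.', '!' or '?' when whitespace follows (naive rule)."""
    return [part for part in re.split(r"(?<=[.!?])\s+", s.strip()) if part]

def simple_word_tokenize(s):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", s)

sentences = simple_sent_tokenize(text)
words = simple_word_tokenize(text)

print(len(sentences), "sentences")
print(len(words), "tokens")
print(words[:9])
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer for real text.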

---

## 5. **POS Tagging**

```python
pos_tags = nltk.pos_tag(word_tokens)
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `pos_tags = nltk.pos_tag(word_tokens)` | Assign a Penn Treebank part-of-speech tag (e.g. `NN` noun, `VBZ` verb, `JJ` adjective) to each token | To understand the grammatical role of each word |
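
Once tokens are tagged, the tags can be used to select words by grammatical role. The sketch below copies the first sentence's `(token, tag)` pairs from the POS tagging output in the notebook and keeps the nouns and verbs, relying on the Penn Treebank convention that noun tags start with `NN` and verb tags with `VB`. The variable names are illustrative.

```python
# (token, tag) pairs copied from the POS tagging output for the first sentence
pos_tags = [
    ("Tokenization", "NN"), ("is", "VBZ"), ("the", "DT"), ("first", "JJ"),
    ("step", "NN"), ("in", "IN"), ("text", "JJ"), ("analytics", "NNS"),
    (".", "."),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
# Verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ)
verbs = [word for word, tag in pos_tags if tag.startswith("VB")]

print("Nouns:", nouns)
print("Verbs:", verbs)
```

Note that `text` in "text analytics" was tagged `JJ` (adjective), so it is not selected as a noun here. This is arguably a tagging error, and a reminder that statistical taggers are not always right.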

---

## 6. **Stopwords Removal**

```python
stop_words = set(stopwords.words("english"))
text_clean = re.sub(r'[^a-zA-Z]', ' ', text)
tokens = word_tokenize(text_clean.lower())
filtered_words = [w for w in tokens if w not in stop_words]
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `stop_words = set(stopwords.words("english"))` | Load list of common English stopwords | To filter out words like "is", "the", "an" |
| `text_clean = re.sub(r'[^a-zA-Z]', ' ', text)` | Remove punctuation and non-letter characters | To clean text before further processing |
| `tokens = word_tokenize(text_clean.lower())` | Tokenize cleaned, lowercased text | To split cleaned text into words |
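| `filtered_words = [w for w in tokens if w not in stop_words]` | Remove stopwords from the cleaned tokens | To keep only meaningful words. Cell In [10] of the notebook filters `word_tokens` instead, which is why its output still contains `The` and `.` |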
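
Filtering the raw tokens and filtering the cleaned, lowercased tokens give different results, because the stopword list is lowercase and contains no punctuation. The following is a minimal standard-library sketch of that difference. `STOP_WORDS` is a small hand-picked set for this sample text (an assumption, NLTK's English list is much longer), and the regex tokenizer is a simplified stand-in for `word_tokenize`.

```python
import re

# Hand-picked stopwords for this text only (assumption: illustrative subset)
STOP_WORDS = {"is", "the", "in", "of", "down", "a", "into", "such", "as", "or"}

text = (
    "Tokenization is the first step in text analytics.\n"
    "The process of breaking down a text paragraph into smaller chunks "
    "such as words or sentences is called Tokenization."
)

# Raw tokens keep their case and punctuation
raw_tokens = re.findall(r"\w+|[^\w\s]", text)
# Cleaned tokens: non-letters replaced by spaces, then lowercased
clean_tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()

filtered_raw = [w for w in raw_tokens if w not in STOP_WORDS]
filtered_clean = [w for w in clean_tokens if w not in STOP_WORDS]

print(filtered_raw)    # still contains 'The' and '.'
print(filtered_clean)  # lowercase words only
```

The capitalised `The` survives in `filtered_raw` because the comparison is case-sensitive, so lowercasing before the stopword check matters.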

---

## 7. **Stemming**

```python
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `ps = PorterStemmer()` | Create a stemmer object | To perform stemming operation |
| `stemmed_words = [ps.stem(word) for word in filtered_words]` | Strip suffixes from each word to get its stem | To reduce word variations (waited → wait). The stem need not be a dictionary word (analytics → analyt) |
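
The idea behind rule-based stemming can be illustrated with a toy suffix stripper. This is naive suffix stripping and not the Porter algorithm, which applies many more rules and conditions; the function name `naive_stem` and the suffix list are assumptions made for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(word):
    """Strip one common suffix; a toy illustration, not the Porter stemmer."""
    word = word.lower()
    if word.endswith("ss"):  # keep words like "process" intact
        return word
    for suffix in SUFFIXES:
        # Only strip if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["waited", "chunks", "Breaking", "called", "process", "text"]:
    print(w, "->", naive_stem(w))
```

Even this small version shows the key property of stemming: it works purely on the spelling of a word, without consulting a dictionary.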

---

## 8. **Lemmatization**

```python
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `lemmatizer = WordNetLemmatizer()` | Create a lemmatizer object | To find dictionary form (lemma) of each word |
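| `lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]` | Look up each word's lemma, treating it as a noun (the default `pos`) | To normalize words to valid dictionary forms (chunks → chunk). Verbs such as `breaking` and `called` stay unchanged unless `pos='v'` is passed |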
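
The tags from the POS tagging step can supply the `pos` argument of `lemmatize`. A minimal sketch of that mapping follows; the helper name `treebank_to_wordnet` is hypothetical, and the letters `'n'`, `'v'`, `'a'`, `'r'` are the WordNet POS codes for noun, verb, adjective and adverb. The mapping itself needs no NLTK data, so the lemmatizer call is only shown as a comment.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun, lemmatize()'s default

# (token, tag) pairs taken from the POS tagging output in the notebook
tagged = [("breaking", "VBG"), ("called", "VBN"), ("chunks", "NNS"), ("smaller", "JJR")]
pairs = [(word, treebank_to_wordnet(tag)) for word, tag in tagged]
print(pairs)

# With NLTK and the WordNet data installed, each pair could be passed on as:
#   lemmatizer.lemmatize(word, pos=pos)
```

Tagging must happen before stopword removal and cleaning, since the tagger relies on the full sentence context.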

---

## 9. **TF-IDF Calculation**

### Step 1: Preprocessing Documents

```python
def preprocess_text(text):
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    tokens = word_tokenize(text.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words]
    return " ".join(filtered_tokens)
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `def preprocess_text(text):` | Define a cleaning function | To reuse cleaning steps |
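| `text = re.sub(r'[^a-zA-Z]', ' ', text)` | Replace every non-letter character (punctuation, digits) with a space | Only alphabetic text remains |
| `tokens = word_tokenize(text.lower())` | Lowercase the text, then tokenize it | So that `The` and `the` count as the same word |
| `stop_words = set(stopwords.words('english'))` | Load the English stopword list as a set | Fast membership checks while filtering |
| `filtered_tokens = [w for w in tokens if w not in stop_words]` | Remove stopwords | Keep only useful words |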
| `return " ".join(filtered_tokens)` | Join tokens back to text | Final cleaned document |

---

### Step 2: Apply Preprocessing

```python
docA_processed = preprocess_text(documentA)
docB_processed = preprocess_text(documentB)
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `docA_processed = preprocess_text(documentA)` | Clean Document A | Prepare for TF-IDF |
| `docB_processed = preprocess_text(documentB)` | Clean Document B | Prepare for TF-IDF |

---

### Step 3: Generate TF-IDF Matrix

```python
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform([docA_processed, docB_processed])
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `vectorizer = TfidfVectorizer()` | Create a TF-IDF vectorizer object | To calculate TF-IDF scores |
| `tfidf_matrix = vectorizer.fit_transform([...])` | Learn the vocabulary and IDF weights from the documents, then return the TF-IDF matrix | Create a numerical representation with one row per document and one column per term |
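
The practical asks for Term Frequency and Inverse Document Frequency, which `TfidfVectorizer` computes internally. The sketch below recomputes them by hand for the two preprocessed documents using only the standard library. It assumes scikit-learn's default settings: raw counts as TF, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The variable and function names are illustrative.

```python
import math
from collections import Counter

docs = ["jupiter largest planet", "mars fourth planet sun"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """TF * IDF for every vocabulary term, scaled to unit L2 length."""
    tf = Counter(doc)  # raw term counts
    weights = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for row in matrix:
    print([round(v, 6) for v in row])
```

`planet` occurs in both documents, so its IDF is 1 (the minimum under this formula) and it gets the lowest non-zero weight in each row, matching the matrix printed by the notebook.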

---

### Step 4: Display TF-IDF Table

```python
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
```

| Line | Meaning | Why it is used |
|:---|:---|:---|
| `df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=...)` | Create a readable table showing TF-IDF values for each word | Easy to analyze importance of words |
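
As one way to analyze the table, the sketch below ranks the terms of each document by weight and compares the two documents. The numbers are copied from the TF-IDF matrix printed by the notebook; the helper `top_terms` is hypothetical. It assumes the rows are L2-normalized (the `TfidfVectorizer` default), in which case the dot product of two rows is their cosine similarity.

```python
terms = ["fourth", "jupiter", "largest", "mars", "planet", "sun"]
rows = [
    [0.000000, 0.631667, 0.631667, 0.000000, 0.449436, 0.000000],  # document A
    [0.534046, 0.000000, 0.000000, 0.534046, 0.379978, 0.534046],  # document B
]

def top_terms(row, k=2):
    """Return the k highest-weighted (term, weight) pairs of one document."""
    ranked = sorted(zip(terms, row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# For unit-length rows, the dot product equals the cosine similarity
similarity = sum(a * b for a, b in zip(rows[0], rows[1]))

print(top_terms(rows[0]))
print(top_terms(rows[1], k=3))
print(round(similarity, 4))
```

The only overlap between the two documents is `planet`, so the similarity is small but non-zero.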

---

# 📚 CONCLUSION (Important Viva Answer)
✅ We have successfully:
- Applied text preprocessing (tokenization, POS tagging, stop word removal, stemming, lemmatization).
- Created a **numerical representation** of the text documents using **TF-IDF**.
- Produced a **TF-IDF matrix** in which terms unique to one document (e.g. `jupiter`, `mars`) receive higher weights than the shared term `planet`.


