# Write a program to preprocess a text document (stop word removal and stemming).

### Libraries to use:
- NLTK (Natural Language Toolkit): This library provides easy access to a collection of stopwords and stemming algorithms.
- Stopwords: We’ll use NLTK's stopwords list.
- Stemmer: We’ll use NLTK's `PorterStemmer`.

In [1]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

In [3]:
# Download the stopword corpus (only needed once per environment)
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [WinError 10061]
[nltk_data]     No connection could be made because the target machine
[nltk_data]     actively refused it>


False

In [5]:
# Initialize the stopword set and the stemmer
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

In [8]:
# Preprocess text: lowercase, strip punctuation, remove stopwords, stem
def preprocess_text(text):
    text = text.lower()  # lowercase

    # remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))

    tokens = text.split()  # split into words on whitespace

    # remove stopwords, then stem the remaining words
    processed_tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]

    return " ".join(processed_tokens)

In [9]:
document = '''Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions - Please note that updating to Notebook 7 might break some of your extensions.'''

In [10]:
# Preprocess the document
processed_document = preprocess_text(document)

In [11]:
# Output the original and processed text
print("Original Document:\n", document)
print("\nProcessed Document:\n", processed_document)

Original Document:
 Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions - Please note that updating to Notebook 7 might break some of your extensions.

Processed Document:
 read migrat plan notebook 7 learn new featur action take use extens pleas note updat notebook 7 might break extens


### **Explanation of the Code and Outputs**

This program **preprocesses a text document** by removing stopwords and applying stemming. Here’s a detailed explanation:

---

### **Code Breakdown**

#### **1. Library Imports and Downloads**
```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
nltk.download('stopwords')
```

- **`nltk` (Natural Language Toolkit)**: A Python library for processing text data.
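- **Stopwords**: Common words in a language (e.g., "and", "the", "is") that carry little meaning on their own.
- **PorterStemmer**: An implementation of the Porter stemming algorithm, which strips suffixes to reduce words to a common stem (e.g., "running" → "run").
- **`string`**: A standard-library module; its `string.punctuation` constant lists the ASCII punctuation characters (such as ".", ",", "?") that the program strips from the text.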
- **`nltk.download('stopwords')`**: Downloads the list of stopwords for English.
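
`nltk.download` reports a failed download by returning `False` rather than raising, as the `False` in the notebook output shows. If the corpus was never installed, a later `stopwords.words("english")` call raises `LookupError`. The following is a minimal sketch of one way to guard against that. The helper name `load_stop_words` and the small `FALLBACK_STOP_WORDS` set are hypothetical and not part of NLTK; the fallback is only a stand-in for illustration, not a substitute for the full list.

```python
# Hypothetical fallback: a tiny hand-picked set, NOT the NLTK stopword list.
FALLBACK_STOP_WORDS = {"a", "an", "and", "are", "if", "is", "of", "the", "to"}


def load_stop_words(language="english"):
    """Return NLTK's stopword set if it is available locally, else the fallback."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError):
        # ImportError: NLTK is not installed.
        # LookupError: NLTK is installed but the corpus has not been downloaded.
        return set(FALLBACK_STOP_WORDS)


stop_words = load_stop_words()
print(len(stop_words), "stopwords loaded")
```

Returning a `set` keeps the `word not in stop_words` membership test fast, which matches how the notebook builds `stop_words`.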

---

#### **2. Preprocessing Function**
```python
def preprocess_text(text):
    text = text.lower()  # Convert text to lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    tokens = text.split()  # Split text into words (tokens)
    processed_tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]  # Remove stopwords and apply stemming
    return " ".join(processed_tokens)  # Join processed words back into a string
```

- **Lowercase Conversion**: Ensures all words are treated uniformly (e.g., "The" and "the").
- **Punctuation Removal**: Deletes every ASCII punctuation character listed in `string.punctuation`.
- **Tokenization**: Breaks the text into individual words by splitting on whitespace.
- **Stopword Removal**: Filters out common, low-information words using `stop_words`.
- **Stemming**: Reduces each remaining word to its stem with `stemmer.stem`.
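
To see what the first three steps do on their own, here is a small trace that uses only the standard library. The sample sentence is made up for illustration. It also shows a side effect worth knowing about: because punctuation is deleted rather than replaced with a space, hyphenated words are fused together and contractions lose their apostrophe, so a contraction may no longer match its entry in a stopword list.

```python
import string

# Hypothetical sample sentence, chosen to show hyphens and an apostrophe.
sample = "Read the plan - Please note: state-of-the-art tools don't wait."

lowered = sample.lower()

# str.maketrans("", "", chars) builds a table that deletes every character in `chars`.
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))

# split() with no arguments splits on any run of whitespace.
tokens = no_punct.split()

print(tokens)
# ['read', 'the', 'plan', 'please', 'note', 'stateoftheart', 'tools', 'dont', 'wait']
```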

---

#### **3. Input Document**
```python
document = '''Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions - Please note that updating to Notebook 7 might break some of your extensions.'''
```

This is the raw input document that we want to preprocess.

---

#### **4. Processed Document**
```python
processed_document = preprocess_text(document)
```

The `preprocess_text` function processes the input and returns the cleaned-up version.

---

### **Output Explanation**

#### **Original Document**
```
Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions - Please note that updating to Notebook 7 might break some of your extensions.
```

This is the original text, containing:
- Mixed-case words.
- Punctuation marks.
- Stopwords (e.g., "the", "to", "and").
- Inflected words (e.g., "updating", "features").

---

#### **Processed Document**
```
read migrat plan notebook 7 learn new featur action take use extens pleas note updat notebook 7 might break extens
```

After preprocessing:
1. **Lowercase Conversion**: All words are lowercase (e.g., "Read" → "read").
2. **Punctuation Removed**: Characters like "." and "-" are gone.
3. **Stopwords Removed**: Words like "the", "to", "if" are eliminated.
4. **Stemming Applied**: Suffixes are stripped, so the results are not always dictionary words:
   - "migration" → "migrat".
   - "features" → "featur".
   - "using" → "use".

---

### **Purpose of Preprocessing**
1. **Stopword Removal**: Drops high-frequency words that carry little content, so downstream analysis focuses on the more informative terms.
2. **Stemming**: Groups related word forms (e.g., "run", "running", "runs") under a single stem, so they are counted as the same term.
3. **Clean Text**: Simplifies the text for further processing in tasks like text classification or sentiment analysis.
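
As one illustration of such further processing, the processed string can be turned into term counts, a common first step toward a bag-of-words representation. This is a minimal sketch using only the standard library; the variable names `term_counts` and `repeated` are chosen here for illustration.

```python
from collections import Counter

# The processed document produced by preprocess_text in this notebook.
processed_document = (
    "read migrat plan notebook 7 learn new featur action take "
    "use extens pleas note updat notebook 7 might break extens"
)

# Count how often each stem occurs.
term_counts = Counter(processed_document.split())

# Keep only the terms that occur more than once.
repeated = {term: n for term, n in term_counts.items() if n > 1}

print(len(term_counts), "distinct terms")  # 17 distinct terms
print(repeated)  # {'notebook': 2, '7': 2, 'extens': 2}
```

Because both occurrences of "extensions" were reduced to "extens", they are counted as a single term.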

---

### **Key Concepts**

#### **Stopwords**
- Words that are common but don’t add significant meaning to a sentence.
- Example: "the", "is", "and", "a".

#### **Stemming**
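- Reduces words to a common stem by stripping suffixes with a rule-based algorithm such as the Porter stemmer; the stem need not be a dictionary word (e.g., "migration" → "migrat").
- Example: "running", "runs" → "run". The Porter stemmer leaves "runner" unchanged.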

#### **Why Preprocess?**
- Preprocessing is essential for preparing text data for **machine learning** models, **sentiment analysis**, or **text summarization**.
- Removes irrelevant parts of text and normalizes it for consistent representation.

