To implement text pre-processing with TF-IDF (Term Frequency-Inverse Document Frequency) in Python, we clean the text, tokenize it, remove stopwords, and then apply TF-IDF. Below is a Python program that uses `re`, `nltk`, `scikit-learn`, and `pandas` to do this.

### Requirements:
1. `nltk` for natural language processing (tokenization, stopword removal).
2. `scikit-learn` for TF-IDF vectorization.
3. `pandas` for displaying the TF-IDF matrix as a DataFrame.
4. `re` (standard library) for regular expression-based text cleaning.

### Installation:
You need to install `nltk`, `scikit-learn`, and `pandas` if you haven't already:

```bash
pip install nltk scikit-learn pandas
```

### Python Program:

```python
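import re
import nltk
import pandas as pd  # needed for the DataFrame built in compute_tfidf
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases (3.9+) load the tokenizer from this resource
nltk.download('stopwords')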

# Step 1: Text cleaning
def clean_text(text):
    # Remove non-alphanumeric characters (except spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # Convert to lower case
    text = text.lower()
    
    return text

# Step 2: Tokenization and stopword removal
def tokenize_and_remove_stopwords(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    
    # Remove stopwords using NLTK's stopwords list
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    
    return " ".join(filtered_words)

# Step 3: Apply TF-IDF
def compute_tfidf(corpus):
    # Initialize TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Transform the corpus into a TF-IDF matrix
    tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
    
    # Convert the matrix into a DataFrame for better readability
    feature_names = tfidf_vectorizer.get_feature_names_out()
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    
    return tfidf_df

if __name__ == "__main__":
    # Sample corpus of text
    corpus = [
        "Text preprocessing is an important part of NLP.",
        "TF-IDF helps in identifying important words.",
        "We are learning how to use TF-IDF with text data.",
        "Text data needs preprocessing before applying machine learning algorithms."
    ]
    
    # Step 1: Clean and preprocess the text
    cleaned_corpus = [clean_text(doc) for doc in corpus]
    
    # Step 2: Tokenize and remove stopwords
    preprocessed_corpus = [tokenize_and_remove_stopwords(doc) for doc in cleaned_corpus]
    
    # Step 3: Compute the TF-IDF matrix
    tfidf_df = compute_tfidf(preprocessed_corpus)
    
    # Show the resulting TF-IDF values
    print(tfidf_df)
```

### Explanation:
1. **Text Cleaning (`clean_text`)**: This function removes non-alphanumeric characters and converts the text to lowercase.
2. **Tokenization and Stopword Removal (`tokenize_and_remove_stopwords`)**: This function tokenizes the text into individual words and removes common stopwords using NLTK's list of English stopwords.
3. **TF-IDF Computation (`compute_tfidf`)**: The function uses `TfidfVectorizer` from scikit-learn to compute the TF-IDF values for the processed text corpus.

### Example Output:

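With NLTK's English stopword list, the printed DataFrame has this layout (scores omitted):

```text
Columns (16): algorithms, applying, data, helps, identifying, important, learning, machine,
              needs, nlp, part, preprocessing, text, tfidf, use, words
Rows (4):     0, 1, 2, 3 (one per document)
```

Each cell holds the TF-IDF score of that column's word in that row's document, and is 0.0 where the word does not occur. Because `clean_text` removes the hyphen, `TF-IDF` appears as the column `tfidf`. For a table this wide, pandas may abbreviate the middle columns with `...` when printing.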

You can easily expand this program to handle larger datasets or integrate with other pre-processing steps, such as stemming or lemmatization, depending on your requirements.
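As one possible extension, stemming can be applied to the tokens before they are joined back into a string. The sketch below uses NLTK's `PorterStemmer`, which needs no extra downloads. The helper `stem_tokens` is a hypothetical addition, not part of the program above.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    # Reduce each token to its stem, e.g. "learning" -> "learn"
    return [stemmer.stem(tok) for tok in tokens]

print(stem_tokens(["learning", "algorithms", "words"]))
```

Stemming merges inflected forms into a single vocabulary column, which shrinks the TF-IDF matrix. The stems are not always dictionary words, so lemmatization may be a better fit when readable column names matter.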

The following sections walk through the code step by step.

### 1. **Imports:**

```python
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
```

- `re`: Regular expressions module, used for text cleaning (e.g., removing non-alphanumeric characters).
- `nltk`: The Natural Language Toolkit, used for natural language processing tasks like tokenization and stopword removal.
- `pandas` (imported as `pd`): Used to present the TF-IDF matrix as a labeled DataFrame.
- `word_tokenize`: A function from `nltk.tokenize` used to split text into words (tokenization).
- `stopwords`: A corpus reader from `nltk.corpus`; `stopwords.words('english')` returns a list of common words (like "the", "is", "and") that are typically removed in text preprocessing.
- `TfidfVectorizer`: A class from `sklearn.feature_extraction.text` used to calculate the TF-IDF values.

### 2. **Downloading NLTK Resources:**

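```python
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases (3.9+) load the tokenizer from this resource
nltk.download('stopwords')
```

These commands download the NLTK resources needed for tokenization (`punkt`, or `punkt_tab` on newer NLTK releases) and stopword removal (`stopwords`). The download is only needed the first time; later calls find the data already on disk and skip it.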

### 3. **Text Cleaning Function:**

```python
def clean_text(text):
    # Remove non-alphanumeric characters (except spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # Convert to lower case
    text = text.lower()
    
    return text
```

- **Purpose:** This function cleans the input text by:
  - Removing all non-alphanumeric characters (except spaces). For example, punctuation marks are removed using the regular expression `[^a-zA-Z0-9\s]`.
  - Converting the entire text to lowercase to standardize the text and prevent case-sensitive discrepancies.
  
  For example, `"Hello, World!"` becomes `"hello world"` after cleaning.
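A quick way to see what the cleaning step keeps and drops is to run it on a few strings. The sketch below repeats `clean_text` so it runs on its own; the sample inputs are made up for illustration. Note that the hyphen in `TF-IDF` is removed, so the term reaches the vectorizer as `tfidf`, and that non-ASCII letters are dropped as well.

```python
import re

def clean_text(text):
    # Same logic as the function above: keep ASCII letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text.lower()

samples = ["Hello, World!", "TF-IDF helps.", "café au lait"]
for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

If accented or non-Latin text matters for your data, the character class in the regular expression would need to be widened.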

### 4. **Tokenization and Stopword Removal Function:**

```python
def tokenize_and_remove_stopwords(text):
    # Tokenize the text into words
    words = word_tokenize(text)
    
    # Remove stopwords using NLTK's stopwords list
    stop_words = set(stopwords.words('english'))
    filtered_words = [word for word in words if word not in stop_words]
    
    return " ".join(filtered_words)
```

- **Purpose:** This function processes the cleaned text by:
  - Tokenizing the text into individual words using the `word_tokenize` function.
  - Removing common stopwords (like "the", "and", "in", etc.) from the tokenized list. `nltk.corpus.stopwords.words('english')` provides a list of common English stopwords.
  - The result is a list of words that do not contain stopwords, which is then converted back into a string using `" ".join(filtered_words)`.

For example, the cleaned text `"text preprocessing is important"` becomes `"text preprocessing important"` after tokenization and stopword removal.
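The filtering logic itself does not depend on NLTK. As a minimal sketch, the version below swaps in `str.split` for `word_tokenize` and a tiny hand-picked stopword set (`SMALL_STOPWORDS`, a hypothetical stand-in that is far smaller than NLTK's list), so it runs without downloading anything:

```python
# Hypothetical stand-in for stopwords.words('english'); NLTK's real list is much longer
SMALL_STOPWORDS = {"is", "an", "of", "in", "the", "and", "to", "with"}

def remove_stopwords_simple(text, stop_words=SMALL_STOPWORDS):
    # str.split() is a simplification of word_tokenize: it splits on whitespace only
    words = text.split()
    kept = [w for w in words if w not in stop_words]
    return " ".join(kept)

print(remove_stopwords_simple("text preprocessing is an important part of nlp"))
```

Passing the stopword set in as an argument also means it is built once, rather than on every call as in `tokenize_and_remove_stopwords`.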

### 5. **TF-IDF Computation Function:**

```python
def compute_tfidf(corpus):
    # Initialize TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer()

    # Transform the corpus into a TF-IDF matrix
    tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
    
    # Convert the matrix into a DataFrame for better readability
    feature_names = tfidf_vectorizer.get_feature_names_out()
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    
    return tfidf_df
```

- **Purpose:** This function calculates the TF-IDF scores for the preprocessed text using the `TfidfVectorizer` from scikit-learn:
  - `TfidfVectorizer` converts the list of text documents (the `corpus`) into a sparse matrix where each row represents a document, each column represents a term in the vocabulary, and each entry is that term's TF-IDF score in that document.
  - `fit_transform(corpus)` learns the vocabulary and IDF weights from the input documents and then transforms the documents into a TF-IDF matrix.
  - `get_feature_names_out()` returns the vocabulary terms in column order, so they can be used as column labels.
  - The `tfidf_matrix` is converted into a dense DataFrame (using `toarray()` and `pd.DataFrame`) for easier viewing. This is fine for a small corpus, but a dense copy of a matrix with a large vocabulary can use a lot of memory.

### 6. **Main Program Execution:**

```python
if __name__ == "__main__":
    # Sample corpus of text
    corpus = [
        "Text preprocessing is an important part of NLP.",
        "TF-IDF helps in identifying important words.",
        "We are learning how to use TF-IDF with text data.",
        "Text data needs preprocessing before applying machine learning algorithms."
    ]
    
    # Step 1: Clean and preprocess the text
    cleaned_corpus = [clean_text(doc) for doc in corpus]
    
    # Step 2: Tokenize and remove stopwords
    preprocessed_corpus = [tokenize_and_remove_stopwords(doc) for doc in cleaned_corpus]
    
    # Step 3: Compute the TF-IDF matrix
    tfidf_df = compute_tfidf(preprocessed_corpus)
    
    # Show the resulting TF-IDF values
    print(tfidf_df)
```

- **Purpose:** This is the main block of the program that executes the text pre-processing and TF-IDF computation steps:
  - **Sample Corpus:** A list of documents (corpus) is defined with text data.
  - **Step 1:** The corpus is cleaned by calling `clean_text` on each document.
  - **Step 2:** The cleaned text is then tokenized and stopwords are removed by calling `tokenize_and_remove_stopwords`.
  - **Step 3:** The preprocessed corpus is passed to `compute_tfidf` to calculate the TF-IDF matrix.
  - **Output:** The resulting TF-IDF matrix (as a DataFrame) is printed, showing the TF-IDF scores for each word across the documents.

### Key Functions in the Code:

- **`clean_text`**: Cleans up the text (removes punctuation, converts to lowercase).
- **`tokenize_and_remove_stopwords`**: Breaks text into words and removes common words like "the", "is", etc.
- **`compute_tfidf`**: Calculates the TF-IDF scores for words in the corpus and returns them as a DataFrame for easier interpretation.

### TF-IDF:

- **Term Frequency (TF):** Measures how often a word appears in a document.
- **Inverse Document Frequency (IDF):** Measures how rare a word is across all documents. Words that appear in many documents get a lower IDF score, and words that appear in few documents get a higher one.
- **TF-IDF:** The product of TF and IDF, which gives a score indicating how important a word is to a document in the corpus.
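scikit-learn's `TfidfVectorizer` uses a specific variant of these definitions by default: raw term counts for TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` (where `n` is the number of documents and `df` is the number of documents containing the term), and L2 normalization of each document vector. The sketch below reproduces that calculation in plain Python for one document, as a way to check the numbers by hand. The helper name `tfidf_row` is illustrative, and the token lists approximate the sample corpus after preprocessing, assuming NLTK's English stopword list.

```python
import math
from collections import Counter

def tfidf_row(doc_tokens, all_docs_tokens):
    """TF-IDF weights for one document, following scikit-learn's default formula."""
    n_docs = len(all_docs_tokens)
    counts = Counter(doc_tokens)  # raw term frequency
    weights = {}
    for term, tf in counts.items():
        # Document frequency: number of documents containing the term
        df = sum(1 for d in all_docs_tokens if term in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        weights[term] = tf * idf
    # L2-normalize so the document vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

corpus_tokens = [
    ["text", "preprocessing", "important", "part", "nlp"],
    ["tfidf", "helps", "identifying", "important", "words"],
    ["learning", "use", "tfidf", "text", "data"],
    ["text", "data", "needs", "preprocessing", "applying",
     "machine", "learning", "algorithms"],
]

row0 = tfidf_row(corpus_tokens[0], corpus_tokens)
for term, weight in sorted(row0.items()):
    print(f"{term:15s} {weight:.3f}")
```

In the first document, a term that occurs in only one document (`nlp`) ends up with a higher weight than one that occurs in three (`text`), which is the behavior the definitions above describe.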

### Example Output:

When you run this code, you'll see a DataFrame showing the TF-IDF scores for each word in the cleaned and preprocessed corpus. Each column is a word and each row is a document. A higher value means the word occurs in that document and is rare in the rest of the corpus; a value of 0 means the word does not occur in that document.

---

This is a basic implementation of text preprocessing with TF-IDF. It helps to extract important terms from text data and is widely used in tasks such as document classification and clustering.
