# Debugging and Problem Solving in Text Cleaning and Preprocessing

Coder: Hisham D Macaraya

---

### Introduction

This project focuses on the process of text cleaning and preprocessing using Python. It provides an opportunity to practice debugging Python code that is intended to clean and prepare a set of user reviews for further natural language processing (NLP) tasks. Preprocessing text data is an essential part of NLP, as it transforms raw text into a format that can be effectively used by machine learning models.

In this exercise, intentional errors have been introduced into the given code, and your goal is to identify and correct these errors. By doing so, you will improve your understanding of text preprocessing techniques such as tokenization, lowercasing, punctuation removal, stopwords removal, and stemming.

### Objectives:
1. Identify and fix the errors.
2. Ensure that the code tokenizes the reviews, converts them to lowercase, removes punctuation and stopwords, and then applies stemming.
3. Compare your results to what you expect them to be based on your understanding of the process.

**Remember:** The process of debugging is not just about making the code run without errors—it's also about ensuring the code produces the correct and expected outcomes. Use print statements or any other method you prefer to check intermediate results.
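
One way to check intermediate results is a small helper that prints a stage's output and asserts that it has the expected shape. The sketch below is only an illustration: `check_stage` is a hypothetical name that is not part of the exercise code, and the sample data is hand-written rather than produced by NLTK.

```python
def check_stage(name, reviews):
    """Print a stage's output and verify it is a list of lists of strings."""
    for i, tokens in enumerate(reviews, start=1):
        print(f"{name} | review {i}: {tokens}")
        assert isinstance(tokens, list), f"{name}: review {i} is not a list"
        for token in tokens:
            # Catches stages that silently produce non-string elements
            assert isinstance(token, str), f"{name}: {token!r} is not a string"
    return reviews


# Hand-written sample data standing in for one pipeline stage
sample = [["best", "purchase"], ["never", "buying"]]
check_stage("lowercased", sample)
```

Because the helper returns its input, it can be wrapped around any stage without changing the rest of the pipeline.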

#### Sample Dataset: User Reviews
```python
reviews = [
    "Best purchase I made this year!!",
    "Totally regret this. Stopped after a month.",
    "Average product. Could be better, could be worse.",
    "Impressed with the quality and performance.",
    "Never buying this again. Complete waste."
]
```

#### Task 1: Tokenization (ERROR INTRODUCED HERE)
```python
tokenized_reviews = [word_tokenize(reviews) for review in reviews]
```

#### Task 2: Lowercasing (ERROR INTRODUCED HERE)
```python
lowercased_reviews = [[word.lower for word in review] for review in tokenized_reviews]
```

#### Task 3: Removing Punctuation and Stopwords
```python
stop_words = set(stopwords.words('english'))
cleaned_reviews = [[word for word in review if word.isalnum() and word not in stop_words] for review in lowercased_reviews]
```

#### Task 4: Stemming (ERROR INTRODUCED HERE)
```python
stemmed_reviews = [[stemmer.stemming(word) for word in review] for review in cleaned_reviews]
```

#### Print Out the Stemmed Reviews
```python
for r in stemmed_reviews:
    print(' '.join(r))
```

---

## Importing Necessary Libraries and Packages

### Importing Libraries
In this section, we import NLTK and the specific components used for tokenization, stop word removal, and stemming.

In [1]:
```python
# Importing NLTK library and specific modules for natural language processing tasks
import nltk
from nltk.tokenize import word_tokenize  # For tokenizing sentences into words
from nltk.corpus import stopwords  # For accessing a list of common stop words
from nltk.stem import PorterStemmer  # For stemming words to their root form
```

## Download required NLTK data

### Downloading NLTK Data
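We need to download 'punkt' for tokenization and 'stopwords' for removing common words that do not add much value to the analysis. Depending on the installed NLTK version, 'word_tokenize' may ask for the 'punkt_tab' resource instead; if a LookupError names it, download that package as well.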

In [2]:
```python
# 'punkt' is used for tokenization, and 'stopwords' provides a list of common words to be removed
nltk.download('punkt')
nltk.download('stopwords')
```

```
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hisha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True
```

## Sample dataset: User reviews

### Sample Dataset: User Reviews
Here we define a list of user reviews for a hypothetical product. These reviews will be processed through different NLP techniques.

In [3]:
```python
# Defining a list of user reviews for a hypothetical product
reviews = [
    "Best purchase I made this year!!",
    "Totally regret this. Stopped after a month.",
    "Average product. Could be better, could be worse.",
    "Impressed with the quality and performance.",
    "Never buying this again. Complete waste."
]
```

## Task 1: Tokenization (FIXED)

### Task 1: Tokenization
Tokenizing each review into individual words. Tokenization helps break down text into smaller units (words) for further analysis.

#### Original Error
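The original code called 'word_tokenize(reviews)' instead of 'word_tokenize(review)'. The loop variable 'review' was never used, so the whole list was passed to the tokenizer on every iteration. 'word_tokenize' expects a single string, so passing a list fails (typically with a TypeError).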
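
The same mistake can be reproduced without NLTK. The sketch below uses `simple_tokenize`, a hypothetical regex-based stand-in that is not `word_tokenize` and will split some text differently; it only serves to show that a string-based tokenizer rejects a list.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in for word_tokenize: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


reviews = ["Best purchase I made this year!!", "Never buying this again."]

# Buggy: the loop variable `review` is never used; the whole list is passed instead
try:
    tokenized = [simple_tokenize(reviews) for review in reviews]
except TypeError as exc:
    print(f"TypeError: {exc}")

# Fixed: pass each individual review string
tokenized = [simple_tokenize(review) for review in reviews]
print(tokenized[0])
```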

In [4]:
```python
# Tokenizing each review into individual words
tokenized_reviews = [word_tokenize(review) for review in reviews]
print("Tokens for each review:")
for i, tokens in enumerate(tokenized_reviews):
    print(f"Review {i + 1} tokens: {tokens}")
```

```
Tokens for each review:
Review 1 tokens: ['Best', 'purchase', 'I', 'made', 'this', 'year', '!', '!']
Review 2 tokens: ['Totally', 'regret', 'this', '.', 'Stopped', 'after', 'a', 'month', '.']
Review 3 tokens: ['Average', 'product', '.', 'Could', 'be', 'better', ',', 'could', 'be', 'worse', '.']
Review 4 tokens: ['Impressed', 'with', 'the', 'quality', 'and', 'performance', '.']
Review 5 tokens: ['Never', 'buying', 'this', 'again', '.', 'Complete', 'waste', '.']
```


## Task 2: Lowercasing (FIXED)

### Task 2: Lowercasing
Converting all tokens to lowercase for consistency. This ensures that words like 'Product' and 'product' are treated as the same word.

#### Original Error
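The original code used 'word.lower' instead of 'word.lower()'. Without the parentheses the method is referenced but never called, so this line raises no error by itself: each list silently fills with method objects instead of lowercase strings. The failure only surfaces in Task 3, where calling 'isalnum()' on a method object raises an AttributeError.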
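
A short, NLTK-free sketch of this behavior, using a hand-written token list as an assumption in place of the real tokenizer output:

```python
tokens = ["Best", "purchase", "!"]

# Buggy: `word.lower` without parentheses refers to the method, it does not call it
buggy = [word.lower for word in tokens]
print(all(callable(item) for item in buggy))  # the elements are method objects, not strings

# The failure only appears when a later step treats the elements as strings
try:
    [item for item in buggy if item.isalnum()]
except AttributeError as exc:
    print(f"AttributeError: {exc}")

# Fixed: call the method
fixed = [word.lower() for word in tokens]
print(fixed)
```

This is why checking intermediate results matters here: the buggy line runs cleanly, and the traceback points at a different line than the one containing the mistake.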

In [5]:
```python
# Converting all tokens to lowercase for consistency
lowercased_reviews = [[word.lower() for word in review] for review in tokenized_reviews]
print("\nLowercased tokens for each review:")
for i, tokens in enumerate(lowercased_reviews):
    print(f"Review {i + 1} lowercased: {tokens}")
```

```

Lowercased tokens for each review:
Review 1 lowercased: ['best', 'purchase', 'i', 'made', 'this', 'year', '!', '!']
Review 2 lowercased: ['totally', 'regret', 'this', '.', 'stopped', 'after', 'a', 'month', '.']
Review 3 lowercased: ['average', 'product', '.', 'could', 'be', 'better', ',', 'could', 'be', 'worse', '.']
Review 4 lowercased: ['impressed', 'with', 'the', 'quality', 'and', 'performance', '.']
Review 5 lowercased: ['never', 'buying', 'this', 'again', '.', 'complete', 'waste', '.']
```


## Task 3: Removing Punctuation and Stopwords

### Task 3: Removing Punctuation and Stopwords
Removing punctuation and common stop words from the reviews. Stop words are common words that do not contribute much meaning, such as 'the', 'is', etc.

#### Method
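We use 'isalnum()' to keep only tokens made up entirely of letters and digits, which drops standalone punctuation tokens such as '.' and '!'. Note that it also drops any token containing a non-alphanumeric character, such as a hyphenated word or a contraction fragment like "n't". The remaining tokens are then filtered against the NLTK English stop word list.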
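
To see what the 'isalnum()' filter keeps and drops, here is a small sketch on a hand-picked token list. The tokens are illustrative assumptions and do not come from the review dataset.

```python
# Hand-picked tokens: a word, punctuation, a contraction fragment,
# a hyphenated word, a mixed letter/digit token, and an ellipsis
tokens = ["great", "!", ",", "n't", "well-made", "3rd", "..."]

kept = [t for t in tokens if t.isalnum()]
dropped = [t for t in tokens if not t.isalnum()]

print("kept:   ", kept)
print("dropped:", dropped)
```

Only 'great' and '3rd' survive. If hyphenated words or negations matter for the downstream task, a different filter would be needed.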

In [6]:
```python
# Removing punctuation and common stop words from the reviews
stop_words = set(stopwords.words('english'))
cleaned_reviews = [[word for word in review if word.isalnum() and word not in stop_words] for review in lowercased_reviews]
print("\nCleaned tokens for each review (no punctuation and stopwords):")
for i, tokens in enumerate(cleaned_reviews):
    print(f"Review {i + 1} cleaned: {tokens}")
```

```

Cleaned tokens for each review (no punctuation and stopwords):
Review 1 cleaned: ['best', 'purchase', 'made', 'year']
Review 2 cleaned: ['totally', 'regret', 'stopped', 'month']
Review 3 cleaned: ['average', 'product', 'could', 'better', 'could', 'worse']
Review 4 cleaned: ['impressed', 'quality', 'performance']
Review 5 cleaned: ['never', 'buying', 'complete', 'waste']
```


## Task 4: Stemming (FIXED)

### Task 4: Stemming
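Applying stemming to reduce words to their stems. Stemming strips suffixes so that related word forms map to a common base, for example, 'running' to 'run'. The result is not always a dictionary word: in the output below, 'purchase' becomes 'purchas' and 'quality' becomes 'qualiti'.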

#### Original Error
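The original code called 'stemmer.stemming(word)' instead of 'stemmer.stem(word)'. 'PorterStemmer' has no 'stemming' method, so the call raises an AttributeError; the correct method is 'stem', which returns the stemmed version of the word. The Task 4 snippet in the exercise also never creates the 'stemmer' object, so the fixed code instantiates 'PorterStemmer()' first.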

In [7]:
```python
# Applying stemming to reduce words to their root forms
stemmer = PorterStemmer()
stemmed_reviews = [[stemmer.stem(word) for word in review] for review in cleaned_reviews]
print("\nStemmed tokens for each review:")
for i, tokens in enumerate(stemmed_reviews):
    print(f"Review {i + 1} stemmed: {tokens}")
```

```

Stemmed tokens for each review:
Review 1 stemmed: ['best', 'purchas', 'made', 'year']
Review 2 stemmed: ['total', 'regret', 'stop', 'month']
Review 3 stemmed: ['averag', 'product', 'could', 'better', 'could', 'wors']
Review 4 stemmed: ['impress', 'qualiti', 'perform']
Review 5 stemmed: ['never', 'buy', 'complet', 'wast']
```


## Final Output

### Final Output: Processed Reviews
Printing the stemmed reviews to show the final processed form of each review.

In [8]:
```python
print("\nFinal Processed Reviews:")
for i, r in enumerate(stemmed_reviews):
    print(f"Review {i + 1}: {' '.join(r)}")
```

```

Final Processed Reviews:
Review 1: best purchas made year
Review 2: total regret stop month
Review 3: averag product could better could wors
Review 4: impress qualiti perform
Review 5: never buy complet wast
```
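
The four steps can also be collected into a single function, which makes the pipeline easier to test in isolation. This is a minimal sketch and not part of the original exercise: `preprocess` and `toy_tokenize` are hypothetical names, and the tokenizer, stop word set, and stemmer used in the demo are simple stand-ins so the block runs without NLTK.

```python
import re


def preprocess(text, tokenize, stop_words, stem):
    """Tokenize, lowercase, drop punctuation and stop words, then stem."""
    tokens = [token.lower() for token in tokenize(text)]
    kept = [t for t in tokens if t.isalnum() and t not in stop_words]
    return [stem(t) for t in kept]


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


toy_stop_words = {"this", "again"}  # tiny stand-in for the NLTK stop word list


def no_stem(word):
    """Stand-in stemmer (assumption): returns the word unchanged."""
    return word


result = preprocess("Never buying this again. Complete waste.",
                    toy_tokenize, toy_stop_words, no_stem)
print(result)
```

With NLTK available, the same function should accept `word_tokenize`, `set(stopwords.words('english'))`, and `PorterStemmer().stem` in place of the stand-ins.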
