### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Spelling Corrections

**Steps**:
1. Data Set: Import a dataset containing text reviews with spelling errors.
2. Apply Corrections: Use a spell-checker from an NLP library to correct spelling mistakes.
3. Verify Improvements: Review the corrections to ensure data quality improvement.
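
To make step 2 less of a black box, here is a minimal sketch of how a dictionary-based spell-checker can work. TextBlob's documentation describes `correct()` as based on Peter Norvig's spelling-corrector approach; the code below is a simplified illustration of that idea, not TextBlob's implementation. The corpus `TOY_CORPUS` and the helper names `edits1`, `correct_word`, and `correct_text` are hypothetical and exist only for this example.

```python
from collections import Counter
import re

# Hypothetical toy corpus; a real corrector would use a large word-frequency list.
TOY_CORPUS = (
    "the product is amazing and the service was excellent "
    "i received a broken item the product quality is fantastic "
    "the software is really good the service is fast"
)
WORD_COUNTS = Counter(re.findall(r"[a-z]+", TOY_CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct_word(word):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing plausible found, leave the token unchanged
    return max(candidates, key=WORD_COUNTS.get)


def correct_text(text):
    """Lowercase the text and correct each alphabetic token, keeping punctuation in place."""
    return re.sub(r"[a-z]+", lambda m: correct_word(m.group()), text.lower())


print(correct_text("The prodct is amzing."))  # prints: the product is amazing.
```

Because the corrector only consults a word list, a token that is itself a known word is never changed, even when it is the wrong word for the context. This is one reason step 3 (reviewing the corrections) matters.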

In [4]:
# write your code from here
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from collections import Counter
from textblob import TextBlob # For spelling correction

# Download NLTK stopwords and punkt tokenizer if not already downloaded
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    print("NLTK 'stopwords' corpus downloaded.")

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
    print("NLTK 'punkt' tokenizer downloaded.")


# --- Previous Task: Handling Noisy Text Data ---
print("### Task: Handling Noisy Text Data ###")
print("\n" + "="*50 + "\n")

# Step 1: Data Set - Obtain a dataset with customer reviews containing noise.
# We'll create a sample DataFrame with various types of noise.
data = {
    'ReviewID': [1, 2, 3, 4, 5, 6],
    'CustomerReview': [
        'Great product! Loved it. #awesome',
        'This is a @good_service, but delivery was slow. 🚚',
        'The quality is amazing!!! 💯 (highly recommended)',
        'Product was ok. Had some issues. $$$$$',
        'Terrible experience. Customer support is non-existent. 😡😡😡',
        'A bit pricey, but worth it. Check out: http://example.com/product'
    ],
    'NoisyText': [
        'Th1s 1s s0me r@nd0m t3xt w1th numb3rs and symb0ls! ^&*()',
        'Another review with [junk] characters and <html tags>.',
        'Good product, but the instructions were confusing. 🚀✨',
        'I received a broken item. Refund requested. 😠😡🤬',
        'Excellent! Very satisfied. 😊�👍',
        'Just some text without much noise.'
    ]
}
df = pd.DataFrame(data)
print("Original Dataset with Noisy Text:")
print(df[['ReviewID', 'CustomerReview', 'NoisyText']])
print("\n" + "="*50 + "\n")

# Step 2: Clean Data - Use regex patterns to clean the noise from text data.
# We'll define a function that applies multiple regex patterns for cleaning.

def clean_text(text):
    """
    Cleans text data by removing various types of noise using regex.

    Args:
        text (str): The input string to be cleaned.

    Returns:
        str: The cleaned string.
    """
    # Convert text to string to handle potential non-string types
    text = str(text)

    # 1. Remove URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # 2. Remove mentions (@username)
    text = re.sub(r'@\w+', '', text)
    # 3. Remove hashtags (#hashtag)
    text = re.sub(r'#\w+', '', text)
    # 4. Remove emojis (common Unicode blocks only; not an exhaustive list)
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
        "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
        "\U00002600-\U000026FF"  # miscellaneous symbols
        "\U00002702-\U000027B0"  # dingbats
        "]+"
    )
    text = emoji_pattern.sub(r'', text)
    # 5. Remove punctuation and special characters (keep alphanumeric and spaces)
    # This regex keeps letters, numbers, and spaces.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # 6. Remove extra whitespace (multiple spaces, leading/trailing spaces)
    text = re.sub(r'\s+', ' ', text).strip()
    # 7. Remove numbers if desired (uncomment if numbers are considered noise)
    # text = re.sub(r'\d+', '', text)

    return text

# Apply the cleaning function to the 'CustomerReview' and 'NoisyText' columns
df['Cleaned_CustomerReview'] = df['CustomerReview'].apply(clean_text)
df['Cleaned_NoisyText'] = df['NoisyText'].apply(clean_text)

# Step 3: Evaluate - Compare the text before and after cleaning for noise.
print("Dataset After Cleaning (Noise Removal):")
print(df[['ReviewID', 'CustomerReview', 'Cleaned_CustomerReview', 'NoisyText', 'Cleaned_NoisyText']])
print("\n" + "="*50 + "\n")

print("Comparison of Original vs. Cleaned Text (Noise Removal):")
for index, row in df.iterrows():
    print(f"Review ID: {row['ReviewID']}")
    print(f"  Original Customer Review: '{row['CustomerReview']}'")
    print(f"  Cleaned Customer Review:  '{row['Cleaned_CustomerReview']}'")
    print(f"  Original Noisy Text:    '{row['NoisyText']}'")
    print(f"  Cleaned Noisy Text:     '{row['Cleaned_NoisyText']}'")
    print("-" * 30)

print("\n" + "#"*50 + "\n")


# --- Previous Task: Removing Stopwords ---
print("### Task: Removing Stopwords ###")
print("\n" + "="*50 + "\n")

# Step 1: Data Set - Use a dataset of text product descriptions.
product_data = {
    'ProductID': [1, 2, 3, 4, 5],
    'Description': [
        'This is a very good product with excellent features and a long battery life.',
        'The quick brown fox jumps over the lazy dog. It is a classic example.',
        'An amazing gadget for everyday use. It has many useful functions.',
        'I am really happy with this purchase. It was delivered quickly.',
        'A high-quality item that will last for many years. You should buy it.'
    ]
}
product_df = pd.DataFrame(product_data)
print("Original Product Descriptions Dataset:")
print(product_df)
print("\n" + "="*50 + "\n")

# Get English stopwords from NLTK
english_stopwords = set(stopwords.words('english'))

# Define a function for stopword removal
def remove_stopwords(text):
    """
    Removes stopwords from a given text.

    Args:
        text (str): The input string.

    Returns:
        str: The text with stopwords removed.
    """
    # Ensure text is a string and lowercase it for consistent matching
    text = str(text).lower()
    # Split on whitespace, then drop tokens that are stopwords once any
    # leading/trailing punctuation is ignored (so "it." still matches "it")
    words = text.split()
    filtered_words = [word for word in words if re.sub(r'^\W+|\W+$', '', word) not in english_stopwords]
    return ' '.join(filtered_words)

# Apply stopword removal to the 'Description' column
product_df['Cleaned_Description_No_Stopwords'] = product_df['Description'].apply(remove_stopwords)

print("Product Descriptions After Stopword Removal:")
print(product_df[['ProductID', 'Description', 'Cleaned_Description_No_Stopwords']])
print("\n" + "="*50 + "\n")

# Step 3: Assess Impact - Examine the effectiveness by analyzing word frequency before and after removal.

def get_word_frequencies(text_series):
    """
    Calculates word frequencies for a given pandas Series of text.

    Args:
        text_series (pd.Series): A Series containing text strings.

    Returns:
        collections.Counter: A Counter object with word frequencies.
    """
    all_words = []
    for text in text_series:
        # Convert to lowercase and split into words
        all_words.extend(str(text).lower().split())
    return Counter(all_words)

print("Word Frequencies Before Stopword Removal:")
original_frequencies = get_word_frequencies(product_df['Description'])
# Print top 20 most common words
for word, count in original_frequencies.most_common(20):
    print(f"  '{word}': {count}")
print("\n")

print("Word Frequencies After Stopword Removal:")
cleaned_frequencies = get_word_frequencies(product_df['Cleaned_Description_No_Stopwords'])
# Print top 20 most common words
for word, count in cleaned_frequencies.most_common(20):
    print(f"  '{word}': {count}")
print("\n" + "="*50 + "\n")

print("Observation:")
print("Common stopwords such as 'is', 'a', 'the', 'with', 'for', 'it', 'has', 'was', 'will', 'you' and 'should' appear far less often, or not at all, in the 'Word Frequencies After Stopword Removal' list, which indicates that the removal step is working.")

print("\n" + "#"*50 + "\n")


# --- New Task: Spelling Corrections ---
print("### Task: Spelling Corrections ###")
print("\n" + "="*50 + "\n")

# Step 1: Data Set - Import a dataset containing text reviews with spelling errors.
review_data = {
    'ReviewID': [101, 102, 103, 104, 105],
    'ReviewText': [
        'The prodct is amzing, I reely like it.',
        'This servise was exelent and very fast.',
        'I recieved a brokn item, very disapointing.',
        'The softwre has sum minor glitces.',
        'Fantstic quality, highly recomended for evryone.'
    ]
}
review_df = pd.DataFrame(review_data)
print("Original Reviews with Spelling Errors:")
print(review_df)
print("\n" + "="*50 + "\n")

# Step 2: Apply Corrections - Use a spell-checker from an NLP library to correct spelling mistakes.

def correct_spelling(text):
    """
    Corrects spelling mistakes in a given text using TextBlob.

    Args:
        text (str): The input string.

    Returns:
        str: The text with spelling corrections applied.
    """
    # Create a TextBlob object
    blob = TextBlob(str(text))
    # Apply spelling correction
    corrected_text = str(blob.correct())
    return corrected_text

# Apply spelling correction to the 'ReviewText' column
review_df['Corrected_ReviewText'] = review_df['ReviewText'].apply(correct_spelling)

print("Reviews After Spelling Correction:")
print(review_df[['ReviewID', 'ReviewText', 'Corrected_ReviewText']])
print("\n" + "="*50 + "\n")

# Step 3: Verify Improvements - Review the corrections to ensure data quality improvement.
print("Comparison of Original vs. Corrected Text (Spelling Corrections):")
for index, row in review_df.iterrows():
    print(f"Review ID: {row['ReviewID']}")
    print(f"  Original Text:  '{row['ReviewText']}'")
    print(f"  Corrected Text: '{row['Corrected_ReviewText']}'")
    print("-" * 30)

print("\nObservation:")
print("Compare each pair above. Typical misspellings such as 'prodct', 'amzing', 'servise', 'recieved' and 'recomended' are the kind of errors TextBlob targets. Its correct() method works word by word from a frequency dictionary, so some corrections may be wrong, and valid words used in the wrong context (for example 'sum' for 'some') are left unchanged. Review the corrected text before relying on it for further analysis.")

[nltk_data] Downloading package punkt to /home/vscode/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


NLTK 'punkt' tokenizer downloaded.
### Task: Handling Noisy Text Data ###


Original Dataset with Noisy Text:
   ReviewID                                     CustomerReview  \
0         1                  Great product! Loved it. #awesome   
1         2  This is a @good_service, but delivery was slow. 🚚   
2         3   The quality is amazing!!! 💯 (highly recommended)   
3         4             Product was ok. Had some issues. $$$$$   
4         5  Terrible experience. Customer support is non-e...   
5         6  A bit pricey, but worth it. Check out: http://...   

                                           NoisyText  
0  Th1s 1s s0me r@nd0m t3xt w1th numb3rs and symb...  
1  Another review with [junk] characters and <htm...  
2  Good product, but the instructions were confus...  
3    I received a broken item. Refund requested. 😠😡🤬  
4                     Excellent! Very satisfied. 😊�👍  
5                 Just some text without much noise.  


Dataset After Cleaning (Noise Removal):


In [2]:
!pip install textblob

Defaulting to user installation because normal site-packages is not writeable
Collecting textblob
  Downloading textblob-0.19.0-py3-none-any.whl.metadata (4.4 kB)
Downloading textblob-0.19.0-py3-none-any.whl (624 kB)
Installing collected packages: textblob
Successfully installed textblob-0.19.0
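
For the "Verify Improvements" step, reading the before/after pairs by eye gets tedious on larger datasets. The following sketch, assuming the corrector keeps the number of whitespace-separated tokens unchanged, lists exactly which tokens were altered so they can be reviewed. The function name `changed_words` and the `corrected` string are hypothetical; the string is an illustrative stand-in, not real TextBlob output.

```python
def changed_words(original, corrected):
    """Pair tokens by position and return the (before, after) pairs that differ."""
    before, after = original.split(), corrected.split()
    if len(before) != len(after):
        # Position-based pairing is only meaningful when token counts match.
        raise ValueError("token counts differ; cannot align by position")
    return [(b, a) for b, a in zip(before, after) if b != a]


original = "The prodct is amzing, I reely like it."
# Hypothetical corrector output, used only to demonstrate the comparison.
corrected = "The product is amazing, I reely like it."

changes = changed_words(original, corrected)
for before, after in changes:
    print(f"{before!r} -> {after!r}")
print(f"{len(changes)} of {len(original.split())} tokens changed")
```

Applied to a DataFrame, the same function could be run over each pair of `ReviewText` and `Corrected_ReviewText` values to flag rows where nothing changed or where a change looks suspicious.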
