### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Handling Noisy Text Data

**Steps**:
1. Data Set: Obtain a dataset with customer reviews containing noise (e.g., random characters).
2. Clean Data: Use regex patterns to clean the noise from text data.
3. Evaluate: Compare the text before and after cleaning for noise.
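
Step 3 is easier to judge with a number attached. The sketch below is one possible measure and is not part of the original exercise: it assumes "noise" means any character other than an ASCII letter, digit, or whitespace, and `noise_ratio` is a hypothetical helper name.

```python
import re

# Assumption: anything that is not an ASCII letter, digit, or whitespace is noise.
NOISE_CHARS = re.compile(r'[^a-zA-Z0-9\s]')

def noise_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that count as noise."""
    if not text:
        return 0.0
    return len(NOISE_CHARS.findall(text)) / len(text)

# Compare one review before and after cleaning.
before = 'Product was ok. Had some issues. $$$$$'
after = 'Product was ok Had some issues'
print(f"before: {noise_ratio(before):.2f}, after: {noise_ratio(after):.2f}")
```

A ratio that drops toward zero after cleaning suggests the patterns removed the targeted characters. It says nothing about whether meaningful content was lost along the way, so it complements a side-by-side reading rather than replacing it.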

import pandas as pd
import re

# Step 1: Data Set - Obtain a dataset with customer reviews containing noise.
# We'll create a sample DataFrame with various types of noise.
data = {
    'ReviewID': [1, 2, 3, 4, 5, 6],
    'CustomerReview': [
        'Great product! Loved it. #awesome',
        'This is a @good_service, but delivery was slow. 🚚',
        'The quality is amazing!!! 💯 (highly recommended)',
        'Product was ok. Had some issues. $$$$$',
        'Terrible experience. Customer support is non-existent. 😡😡😡',
        'A bit pricey, but worth it. Check out: http://example.com/product'
    ],
    'NoisyText': [
        'Th1s 1s s0me r@nd0m t3xt w1th numb3rs and symb0ls! ^&*()',
        'Another review with [junk] characters and <html tags>.',
        'Good product, but the instructions were confusing. 🚀✨',
        'I received a broken item. Refund requested. 😠😡🤬',
        'Excellent! Very satisfied. 😊👍👍',
        'Just some text without much noise.'
    ]
}
df = pd.DataFrame(data)
print("Original Dataset with Noisy Text:")
print(df[['ReviewID', 'CustomerReview', 'NoisyText']])
print("\n" + "="*50 + "\n")

# Step 2: Clean Data - Use regex patterns to clean the noise from text data.
# We'll define a function that applies multiple regex patterns for cleaning.

def clean_text(text):
    """
    Cleans text data by removing various types of noise using regex.

    Args:
        text (str): The input string to be cleaned.

    Returns:
        str: The cleaned string.
    """
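    # Treat missing values as empty text; str(None) or str(NaN) would otherwise
    # yield the literal strings 'None' or 'nan'.
    if pd.isna(text):
        return ''
    # Convert any remaining non-string value (e.g., a number) to a string
    text = str(text)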

    # 1. Remove URLs (http://, https://, or bare www. links)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
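    # 2. Remove mentions (@username). The lookbehind skips an '@' inside a word,
    # so 'r@nd0m' is not mistaken for a mention.
    text = re.sub(r'(?<!\w)@\w+', '', text)
    # 3. Remove hashtags (#hashtag), with the same guard against mid-word matches
    text = re.sub(r'(?<!\w)#\w+', '', text)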
    # 4. Remove emojis. These ranges cover the most common emoji blocks, not every
    # emoji; a more comprehensive pattern may be needed for full coverage.
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
        "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
        "\U00002600-\U000026FF"  # miscellaneous symbols
        "\U00002700-\U000027BF"  # dingbats
        "]+"
    )
    text = emoji_pattern.sub('', text)
    # 5. Remove punctuation and special characters, keeping only ASCII letters,
    # digits, and whitespace. Accented and non-Latin letters are removed as well.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # 6. Remove extra whitespace (multiple spaces, leading/trailing spaces)
    text = re.sub(r'\s+', ' ', text).strip()
    # 7. Remove numbers if desired (uncomment if numbers are considered noise)
    # text = re.sub(r'\d+', '', text)

    return text

# Apply the cleaning function to the 'CustomerReview' and 'NoisyText' columns
df['Cleaned_CustomerReview'] = df['CustomerReview'].apply(clean_text)
df['Cleaned_NoisyText'] = df['NoisyText'].apply(clean_text)

# Step 3: Evaluate - Compare the text before and after cleaning for noise.
print("Dataset After Cleaning:")
print(df[['ReviewID', 'CustomerReview', 'Cleaned_CustomerReview', 'NoisyText', 'Cleaned_NoisyText']])
print("\n" + "="*50 + "\n")

print("Comparison of Original vs. Cleaned Text:")
for _, row in df.iterrows():
    print(f"Review ID: {row['ReviewID']}")
    print(f"  Original Customer Review: '{row['CustomerReview']}'")
    print(f"  Cleaned Customer Review:  '{row['Cleaned_CustomerReview']}'")
    print(f"  Original Noisy Text:    '{row['NoisyText']}'")
    print(f"  Cleaned Noisy Text:     '{row['Cleaned_NoisyText']}'")
    print("-" * 30)

# You can further evaluate by:
# - Manually inspecting a larger sample of cleaned text.
# - Using metrics if you have a ground truth for clean data (less common for noise removal).
# - Applying downstream NLP tasks (e.g., sentiment analysis) and comparing performance before/after cleaning.
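
For the manual inspection suggested in the closing comments, it can help to list exactly which pieces of each review the cleaner removed. This is a minimal sketch using the standard library's `difflib`; `removed_spans` is a hypothetical helper, not part of the code above.

```python
import difflib

def removed_spans(original: str, cleaned: str) -> list[str]:
    """Return the pieces of `original` that do not survive in `cleaned`."""
    matcher = difflib.SequenceMatcher(None, original, cleaned, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'delete' and 'replace' both mean original[i1:i2] is absent from the cleaned text.
        if tag in ('delete', 'replace'):
            spans.append(original[i1:i2])
    return spans

# Example: show what was stripped from one review.
print(removed_spans('Great product! Loved it. #awesome', 'Great product Loved it'))
```

Scanning these spans across a larger sample makes it easier to spot cases where the patterns removed real content, such as a product name written as a hashtag, rather than noise.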


Original Dataset with Noisy Text:
   ReviewID                                     CustomerReview  \
0         1                  Great product! Loved it. #awesome   
1         2  This is a @good_service, but delivery was slow. 🚚
2         3   The quality is amazing!!! 💯 (highly recommended)
3         4             Product was ok. Had some issues. $$$$$   
4         5  Terrible experience. Customer support is non-e...   
5         6  A bit pricey, but worth it. Check out: http://...   

                                           NoisyText  
0  Th1s 1s s0me r@nd0m t3xt w1th numb3rs and symb...  
1  Another review with [junk] characters and <htm...  
2  Good product, but the instructions were confus...  
3    I received a broken item. Refund requested. 😠😡🤬
4                     Excellent! Very satisfied. 😊👍👍
5                 Just some text without much noise.  


Dataset After Cleaning:
   ReviewID                                     CustomerReview  \
0 