### Using NLP for Text Data Quality
**Objective**: Enhance text data quality using NLP techniques.

**Task**: Handling Noisy Text Data

**Steps**:
1. Dataset: Obtain a dataset of customer reviews that contain noise (e.g., random characters, HTML tags, URLs, extra whitespace).
2. Clean Data: Use regex patterns to remove the noise from the text.
3. Evaluate: Compare the text before and after cleaning to check how much noise was removed.
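
One way to make the "Evaluate" step concrete is to count noise occurrences before and after cleaning. The sketch below is illustrative only: the `NOISE_PATTERNS` dictionary, the `count_noise` helper, and the two sample strings are assumptions introduced here, not part of the original exercise.

```python
import re

# Hypothetical noise patterns, chosen to mirror the noise types named in the steps.
NOISE_PATTERNS = {
    "url": re.compile(r'(?:https?://|www\.)\S+'),
    "html_tag": re.compile(r'</?[a-zA-Z][^<>]*>'),
    "extra_whitespace": re.compile(r'\s{2,}'),
}


def count_noise(text):
    """Return how many times each noise pattern occurs in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in NOISE_PATTERNS.items()}


# Made-up example strings standing in for a review before and after cleaning.
before = "<p>Great   value! Visit http://example1.com/review123</p>"
after = "great value! visit"

print(count_noise(before))  # {'url': 1, 'html_tag': 2, 'extra_whitespace': 1}
print(count_noise(after))   # {'url': 0, 'html_tag': 0, 'extra_whitespace': 0}
```

Counting pattern matches gives a more direct signal than text length alone, since a shorter string does not by itself show that the noise is gone.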

import re

import numpy as np
import pandas as pd

# Fix the seed so the generated noise is reproducible.
np.random.seed(42)


def generate_noisy_reviews(num_reviews=100):
    base_reviews = [
        "This product is amazing! I absolutely love it. 😊",
        "The delivery was super fast and the item was as described. Highly recommended. ⭐⭐⭐⭐⭐",
        "It's okay, not great, not terrible. Could be better. 🤔",
        "Disappointed with the quality. It broke after a week. 😠",
        "Excellent customer service and a great user experience. 👍",
        "Worst purchase ever. Complete waste of money. 😡",
        "Pretty good for the price. Would buy again. 👍👍",
        "Item received damaged. <br> Very unhappy. 👎👎",
        "The software is buggy and crashes frequently. Visit our site: http://buggysoftware.com",
        "A reliable choice. Nothing fancy, but gets the job done. 🙌",
        "Could use some improvements. Too many ads. www.ads-everywhere.net",
        "I expected more. The battery life is horrible. 🔋",
        "Fantastic! I'm so glad I bought this. ✨🎉",
        "Mediocre experience. 💯 I've seen better. 😊",
        "Perfect fit and comfortable. ✅",
        "Lagging performance. 🚫 Do not recommend. 😠",
        "Great value for money. 🤑 Will tell my friends. 🤗",
        "The instructions were unclear. ❓❓ Hard to assemble. 🛠️",
        "Solid build quality. A bit heavy though. 🏗️",
        "An absolute game-changer! 🚀 Don't hesitate to buy. 🔥",
    ]
    reviews = []
    for i in range(num_reviews):
        review = np.random.choice(base_reviews)
        # Introduce various types of noise
        if np.random.rand() < 0.3: # Add random characters
            noise_chars = ''.join(np.random.choice(list("!@#$%^&*()_+=-`~[]{}|;:'\",.<>/?"), 5))
            insert_pos = np.random.randint(0, len(review) + 1)
            review = review[:insert_pos] + noise_chars + review[insert_pos:]
        if np.random.rand() < 0.2: # Add HTML tags
            review = f"<p>{review}</p><div></div>"
        if np.random.rand() < 0.15: # Add URLs
            review += f" Check it out: http://example{np.random.randint(1,5)}.com/review{np.random.randint(100,999)}"
        if np.random.rand() < 0.4: # Add extra whitespace
            review = "  " + review + "   "
            if np.random.rand() < 0.5:
                review = review.replace(" ", "   ", np.random.randint(1, 5))
        reviews.append(review)
    return pd.DataFrame({'review_id': range(num_reviews), 'review_text_original': reviews})
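

# Generate 200 synthetic reviews and preview the first rows.
customer_reviews_df = generate_noisy_reviews(num_reviews=200)
print("--- Sample of Original Noisy Customer Reviews ---")
print(customer_reviews_df.head())


def clean_text_data(text):
    """Lowercase a review and strip URLs, HTML tags, symbols, and extra whitespace."""
    text = text.lower()
    # Remove URLs; `http\S+` already covers https links.
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Remove HTML tags. Requiring a letter after `<` and excluding nested
    # brackets keeps a stray `<` from the noise from swallowing real words.
    text = re.sub(r'</?[a-z][^<>]*>', '', text)
    # Drop everything except letters, digits, whitespace, and basic punctuation
    # (this also removes emoji and apostrophes).
    text = re.sub(r'[^a-z0-9\s.,!?]', '', text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text

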
# Apply the cleaning function to every review.
customer_reviews_df['review_text_cleaned'] = (
    customer_reviews_df['review_text_original'].apply(clean_text_data)
)
print("\n--- Sample of Cleaned Customer Reviews ---")
print(customer_reviews_df[['review_text_original', 'review_text_cleaned']].head())

# Show five random reviews before and after cleaning.
print("\n--- Evaluation: Before vs. After Cleaning (Random Samples) ---")
for i in np.random.choice(customer_reviews_df.index, 5, replace=False):
    print(f"\nOriginal Review {i}:")
    print(customer_reviews_df.loc[i, 'review_text_original'])
    print(f"Cleaned Review {i}:")
    print(customer_reviews_df.loc[i, 'review_text_cleaned'])
    print("-" * 50)

# Compare character counts as a rough measure of how much was removed.
original_lengths = customer_reviews_df['review_text_original'].apply(len)
cleaned_lengths = customer_reviews_df['review_text_cleaned'].apply(len)
print(f"\nAverage length of original reviews: {original_lengths.mean():.2f}")
print(f"Average length of cleaned reviews: {cleaned_lengths.mean():.2f}")
print(f"Total characters removed: {(original_lengths - cleaned_lengths).sum()}")
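
One edge case is worth checking when stripping tags with a regex. The noise alphabet in `generate_noisy_reviews` includes `<` and `>`, so a lazy tag pattern such as `<.*?>` can match from a stray `<` to the `>` of a later tag and delete the words in between. The sketch below compares it with a stricter pattern; the names `LAZY_TAG` and `STRICT_TAG` and the sample string are assumptions made for illustration.

```python
import re

# A lazy "anything between angle brackets" pattern.
LAZY_TAG = re.compile(r'<.*?>')
# A stricter pattern: optional slash, a letter, then no further angle brackets.
STRICT_TAG = re.compile(r'</?[a-zA-Z][^<>]*>')

# Made-up review with a stray "<" injected as noise before a word.
sample = "<p>Solid build <quality. A bit heavy though.</p>"

lazy_result = LAZY_TAG.sub('', sample)
strict_result = STRICT_TAG.sub('', sample)

print(repr(lazy_result))    # 'Solid build '
print(repr(strict_result))  # 'Solid build <quality. A bit heavy though.'
```

With the stricter pattern the stray `<` survives tag removal, so it still has to be dropped by a later symbol-stripping step, but the review text itself is preserved.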


--- Sample of Original Noisy Customer Reviews ---
   review_id                               review_text_original
0          0  <p>Pretty good for the price. Would buy again....
1          1  Solid build quality. A _'|$*bit heavy though. ...
2          2    The delivery was super fast and the item was...
3          3                     Perfect fit and comfortable. ✅
4          4    It's okay, not great, not terrible. Could be...

--- Sample of Cleaned Customer Reviews ---
                                review_text_original  \
0  <p>Pretty good for the price. Would buy again....   
1  Solid build quality. A _'|$*bit heavy though. ...   
2    The delivery was super fast and the item was...   
3                     Perfect fit and comfortable. ✅   
4    It's okay, not great, not terrible. Could be...   

                                 review_text_cleaned  
0        pretty good for the price. would buy again.  
1  solid build quality. a bit heavy though. check...  
2  the delivery was 