### Using NLP for Text Data Quality
**Objective**: Improve the quality of text data with basic NLP preprocessing, here regex-based cleaning.

**Task**: Handling Noisy Text Data

**Steps**:
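1. Dataset: Build a small simulated dataset of customer reviews containing noise (digits or symbols substituted for letters, stray symbols, stretched words, and excessive punctuation).
2. Clean Data: Use regex patterns to strip non-letter characters, collapse extra whitespace and repeated characters, and lowercase the text.
3. Evaluate: Compare each review before and after cleaning to see which noise was removed and which information was lost along the way.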
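
Step 3 can also be made quantitative rather than purely visual. The sketch below assumes a simple, hypothetical metric named `noise_ratio` (it is not part of the original notebook): the share of non-whitespace characters that are not ASCII letters. It is only a rough proxy, since it does not detect stretched words such as "Terrrrible".

```python
import re

def noise_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are not ASCII letters.

    Hypothetical helper for illustration; the definition of "noise"
    here is an assumption, not something the notebook prescribes.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0  # empty or whitespace-only text has no noise to measure
    noisy = sum(1 for ch in visible if not re.fullmatch(r"[A-Za-z]", ch))
    return noisy / len(visible)

raw = "U$eless!!!! Waste-of-money 123"
cleaned = "ueless wasteofmoney"
print(f"raw: {noise_ratio(raw):.2f}, cleaned: {noise_ratio(cleaned):.2f}")
```

A ratio of zero after cleaning only says that no symbols or digits remain. It says nothing about whether the surviving words are still readable, so it complements the side-by-side comparison rather than replacing it.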

In [1]:
import pandas as pd
import re

# Step 1: Simulated Dataset with Noisy Customer Reviews
data = {
    'Review': [
        "Th!s pr0duct is #amazing!!!", 
        "W0rth every $$$. L0ve it <3",
        "Terrrrible.....!!! nevvver again!!!", 
        "S000o g0ooood!!! Buy it now!!!",
        "U$eless!!!! Waste-of-money 123"
    ]
}
df = pd.DataFrame(data)

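# Function to clean noisy text using regex
def clean_text(text):
    """Strip non-letter noise from a review and normalize it to lowercase."""
    # Keep only ASCII letters and whitespace. Digits and symbols are dropped,
    # not translated, so "pr0duct" becomes "prduct" and hyphenated words fuse.
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # Collapse any character repeated 3+ times to one occurrence
    # (e.g., Terrrrible -> Terible); doubled letters are left untouched
    text = re.sub(r'(.)\1{2,}', r'\1', text)
    return text.strip().lower()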

# Step 2: Apply cleaning
df['Cleaned_Review'] = df['Review'].apply(clean_text)

# Step 3: Compare before and after
print("Before and After Cleaning:\n")
print(df[['Review', 'Cleaned_Review']])

Before and After Cleaning:

                                Review         Cleaned_Review
0          Th!s pr0duct is #amazing!!!  ths prduct is amazing
1          W0rth every $$$. L0ve it <3      wrth every lve it
2  Terrrrible.....!!! nevvver again!!!    terible never again
3       S000o g0ooood!!! Buy it now!!!      so god buy it now
4       U$eless!!!! Waste-of-money 123    ueless wasteofmoney
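
The table above shows that the cleaning removes noise but also damages some words: "pr0duct" becomes "prduct", "g0ooood" becomes "god", and "Waste-of-money" fuses into "wasteofmoney". A gentler variant might look like the sketch below. The function name `clean_text_gentle` and the look-alike mapping `LEET` are hypothetical choices made for illustration, and the mapping will misfire on text where digits inside words are meaningful.

```python
import re

# Hypothetical mapping of look-alike characters to letters (an assumption;
# adjust it to the substitutions that actually occur in your data).
LEET = {"0": "o", "1": "i", "3": "e", "$": "s", "!": "i"}

# A run of look-alike characters counts only when it sits between two letters,
# so trailing "!!!" or a standalone "123" is left for the next step to remove.
LEET_RUN = re.compile(r"(?<=[A-Za-z])[013$!]+(?=[A-Za-z])")

def clean_text_gentle(text: str) -> str:
    # Translate look-alike characters that appear inside a word
    text = LEET_RUN.sub(lambda m: "".join(LEET[c] for c in m.group()), text)
    # Replace remaining non-letters with a space so hyphenated words stay separate
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Cap runs of the same letter at two, which preserves words like "good"
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1\1", text)
    # Collapse whitespace, trim, and lowercase
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_text_gentle("Th!s pr0duct is #amazing!!!"))     # this product is amazing
print(clean_text_gentle("U$eless!!!! Waste-of-money 123"))  # useless waste of money
```

Capping repeats at two is a trade-off: it keeps "good" and "terrible" intact, but a stretched word such as "nevvver" comes out as "nevver". Fixing those cases would need a dictionary or spell-checking step, which is outside what regex alone can do.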
