**Copyright: © NexStream Technical Education, LLC**.  
All rights reserved

# Regular Expressions Data Cleaning

In this assignment, you will implement a Python function that uses regular expressions to clean text data for a sentiment analyzer.





**Introduction - Regular Expressions**


Resources and References
- Python Regular Expression Documentation: https://docs.python.org/3/library/re.html

- Regular Expression Testing Tools  
  - regex101.com
  - pythex.org

- Best Practices Documentation
  - PEP 8 Style Guide
  - Google Python Style Guide


In [1]:
#Import libraries

import re  # standard library; the only regex module used in this notebook

Create a function that preprocesses a text string according to the following specifications:
- Remove all HTML tags (i.e., character sequences delimited by the '<' and '>' characters).
- Remove all URLs (i.e., any address beginning with 'http://' or 'https://').
- Remove any character that is not an English letter, a decimal digit, or whitespace (i.e., anything other than a-z, A-Z, 0-9, and whitespace).
- Collapse any extra whitespace into a single space and trim the ends.
- Convert the string to lowercase.

Use the following function API:

    
    def preprocess_review(text):
        """
        Parameters:
        -----------
        text : input text string to clean

        Returns:
        -----------
        text : the cleaned text string
        """

In [2]:
def preprocess_review(text):
    """Clean a review string: strip HTML tags, URLs, and special characters."""
    # Remove HTML tags: '<', one or more characters that are not '>', then '>'
    text = re.sub(r'<[^>]+>', '', text)

    # Remove URLs that begin with http:// or https://
    # (this must run before punctuation is stripped, or the '://' is lost)
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

    # Remove every character that is not a letter, a digit, or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Collapse runs of whitespace into one space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text
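
A note on ordering: the URL pattern relies on the literal `://`, so it has to run before the special-character step strips punctuation. The sketch below reuses the two patterns from the cell above under hypothetical constant names (`URL_PATTERN`, `SPECIAL_PATTERN`) and a made-up sample string, and compares both orders.

```python
import re

# Same patterns as in preprocess_review; the constant names are illustrative only.
URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
SPECIAL_PATTERN = r'[^a-zA-Z0-9\s]'

sample = "Check it out at https://example.com now"

# Order used in preprocess_review: URLs first, then special characters.
url_first = re.sub(SPECIAL_PATTERN, '', re.sub(URL_PATTERN, '', sample))

# Reversed order: the punctuation in '://' is removed before the URL step runs,
# so the URL pattern no longer finds anything to match.
special_first = re.sub(URL_PATTERN, '', re.sub(SPECIAL_PATTERN, '', sample))

print(repr(url_first))      # 'Check it out at  now'
print(repr(special_first))  # 'Check it out at httpsexamplecom now'
```

The doubled space left behind in the first result is what the final whitespace-collapsing step cleans up.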

In [3]:
#Test cases
#Do not change the code in this cell - for validation purposes.


test_cases = [
    # HTML and special characters
    "<p>Great product!</p> &#128516;",

    # URLs and links
    "Check it out at https://example.com and www.test.com",

    # Product IDs and ratings
    "Product ID: AB1234 - 5* rating!!!",

    # Mixed case and extra whitespace
    "EXCELLENT    product...   very   HAPPY :)",

    # Numbers and measurements
    "Size: 10.5cm x 20cm - Perfect fit!",

    # Email addresses and contact info
    "Contact support@company.com or call 123-456-7890",

    # Dates and timestamps
    "Purchased on 2024-01-15 12:30PM EST",

    # Multiple line breaks and formatting
    """Product Review:

    Quality: Excellent
    Durability: Good
    Value: 4/5

    Would recommend!"""
]

def test_preprocessing(test_cases):
    cleaned = []
    for i, text in enumerate(test_cases, 1):
        cleaned.append(preprocess_review(text))
        print(f"Test Case {i}:")
        print(f"Original: {text}")
        print(f"Cleaned: {cleaned[-1]}")
        print("-" * 50 + "\n")
    return cleaned

# Run tests
cleaned_text = test_preprocessing(test_cases)
print('Cleaned text:\n', cleaned_text)



import doctest
'''
  >>> print(cleaned_text[0])
  great product 128516
  >>> print(cleaned_text[1])
  check it out at and wwwtestcom
  >>> print(cleaned_text[2])
  product id ab1234 5 rating
  >>> print(cleaned_text[3])
  excellent product very happy
  >>> print(cleaned_text[4])
  size 105cm x 20cm perfect fit
  >>> print(cleaned_text[5])
  contact supportcompanycom or call 1234567890
  >>> print(cleaned_text[6])
  purchased on 20240115 1230pm est
  >>> print(cleaned_text[7])
  product review quality excellent durability good value 45 would recommend
'''

doctest.testmod()

Test Case 1:
Original: <p>Great product!</p> &#128516;
Cleaned: great product 128516
--------------------------------------------------

Test Case 2:
Original: Check it out at https://example.com and www.test.com
Cleaned: check it out at and wwwtestcom
--------------------------------------------------

Test Case 3:
Original: Product ID: AB1234 - 5* rating!!!
Cleaned: product id ab1234 5 rating
--------------------------------------------------

Test Case 4:
Original: EXCELLENT    product...   very   HAPPY :)
Cleaned: excellent product very happy
--------------------------------------------------

Test Case 5:
Original: Size: 10.5cm x 20cm - Perfect fit!
Cleaned: size 105cm x 20cm perfect fit
--------------------------------------------------

Test Case 6:
Original: Contact support@company.com or call 123-456-7890
Cleaned: contact supportcompanycom or call 1234567890
--------------------------------------------------

Test Case 7:
Original: Purchased on 2024-01-15 12:30PM EST
Cleaned: purchased on 20240115 1230pm est

TestResults(failed=0, attempted=8)

**Reflections**

1. Which metacharacters match any character except a newline, match the start of a string, match the end of a string, and escape special characters?

   `.` matches any single character except a newline; `^` matches the start of a string; `$` matches the end of a string; and `\` escapes a special character, so that a metacharacter such as `.` loses its special meaning and is matched literally.

2. What characters are used to delimit a set of letters or digits to match?

   Square brackets `[]` delimit a character class, which matches any single character listed inside the brackets. For example, `[a-z]` matches any one lowercase letter, while `[0-9]` matches any one digit. In addition, the shorthand `\d` matches any digit, and `\w` matches any letter, digit, or underscore.

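3. What is the difference between "greedy" and "non-greedy" matching?

   Greedy quantifiers (`*`, `+`, `?`, `{m,n}`) match as much text as possible while still letting the overall pattern succeed. Non-greedy (lazy) quantifiers, written by appending `?` (for example `*?` or `+?`), match as little text as possible.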
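
The three answers above can be checked with a short experiment. The sketch below uses only the standard `re` module; the sample strings are either taken from the test cases or made up for illustration, and all variable names are hypothetical.

```python
import re

# 1. Metacharacters
# '.' matches any single character except a newline, so 'c\nt' is skipped.
dot_matches = re.findall(r'c.t', 'cat cot c\nt')

# '^' anchors the match at the start of the string, '$' at the end.
starts_with_great = re.search(r'^great', 'great product') is not None
ends_with_great = re.search(r'great$', 'great product') is not None

# Escaping '.' makes it literal; unescaped, it also matches the '2' in '1025'.
literal_dot = re.findall(r'10\.5', '10.5cm and 1025cm')
any_char = re.findall(r'10.5', '10.5cm and 1025cm')

# 2. Character classes and shorthands
sample = 'Product ID: AB1234 - 5* rating!!!'
digits_class = re.findall(r'[0-9]+', sample)
digits_short = re.findall(r'\d+', sample)
# '\w' also accepts the underscore, which [a-zA-Z0-9] does not.
word_short = re.findall(r'\w+', 'user_name')
word_class = re.findall(r'[a-zA-Z0-9]+', 'user_name')

# 3. Greedy vs. non-greedy quantifiers
html = '<p>Great product!</p>'
greedy = re.findall(r'<.+>', html)       # runs to the last '>'
non_greedy = re.findall(r'<.+?>', html)  # stops at the first '>'

print(dot_matches, literal_dot, any_char)
print(digits_class, digits_short, word_short, word_class)
print(greedy, non_greedy)
```

The negated class `<[^>]+>` used in `preprocess_review` finds the same tags as the non-greedy form `<.+?>` on this sample, because it can never run past a closing `>`.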