<h1>Regular Expressions for Corpus Preprocessing
</h1>

<p>Regular expressions, often abbreviated as regex, are a foundational tool in text analysis and natural language processing (NLP). They provide a robust framework for identifying, extracting, and manipulating patterns within text data, making them essential for preprocessing large corpora. Corpus preprocessing involves various tasks designed to clean and standardize text data, which in turn facilitates more effective data analysis and model training.</p>
<p>This guide will explore how regular expressions can be effectively utilized to streamline these preprocessing tasks. From simple operations like searching and replacing text to more complex pattern recognition, regular expressions offer a precise method to handle diverse text processing challenges, ultimately enhancing the accuracy and efficiency of NLP applications.</p>


The corpus is a text file of English-language reviews of food services from the Yelp website: https://www.yelp.com/.

It includes one thousand reviews and is part of a dataset found in the UCI Machine Learning Repository, titled "Sentiment Labelled Sentences": https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#.

<h2>Part 1. Load the Data.</h2>

We load the data from the specified file and obtain a list of 1000 strings/comments.

For now, we only need the NumPy and re libraries, for handling arrays and regular expressions in Python.

In [1]:
import numpy as np    # import Numpy for array handling.
import re             # import re for regular expressions handling.

In [2]:
# Load the corpus from the given file:

with open('text_corpus_food_reviews.txt',        # update the path to your file, if necessary.
          mode='r',     # open the file in read mode.
          ) as f:
    docs = f.readlines()    # read the file into a list, one comment per line

# No explicit f.close() is needed: the with block closes the file on exit.

In [3]:
type(docs) == list   # Verify that variable "docs" is a list

True

In [4]:
len(docs) == 1000  # verify that the length of "docs" is 1000

True

In [5]:
docs[0:10]     # Let's see the first 10 comments

['Wow... Loved this place.\n',
 'Crust is not good.\n',
 'Not tasty and the texture was just nasty.\n',
 'Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.\n',
 'The selection on the menu was great and so were the prices.\n',
 'Now I am getting angry and I want my damn pho.\n',
 "Honeslty it didn't taste THAT fresh.)\n",
 'The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.\n',
 'The fries were great too.\n',
 'A great touch.\n']

<h3>Auxiliary function to print list elements</h3>

In [6]:
# Function to print elements of a list formatted and numbered
def print_elements(list_items):
    """
    Prints each element of the list, aligned in a column, with the index on the left.
    
    Args:
    list_items (list): A list of strings.
    
    Returns:
    None
    """
    # Get the length of the largest number
    max_length = len(str(len(list_items)))
    
    # Print each element with the index aligned
    for index, item in enumerate(list_items, start=1):
        print(f"{index:>{max_length}}. {item}")

<h2>Problem 1</h2>


Search for and remove all newline characters '\n' at the end of each comment.

Once completed, print the first 10 comments from the resulting output.


In [7]:
Q = []
for doc in docs:
  tmp = re.sub(r'\n', "", doc)  
  if tmp:
    Q.append(tmp)

print('First 10 comments from the result:')
print_elements(Q[:10])
print('Total number of strings found:', len(Q))

First 10 comments from the result:
 1. Wow... Loved this place.
 2. Crust is not good.
 3. Not tasty and the texture was just nasty.
 4. Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.
 5. The selection on the menu was great and so were the prices.
 6. Now I am getting angry and I want my damn pho.
 7. Honeslty it didn't taste THAT fresh.)
 8. The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.
 9. The fries were great too.
10. A great touch.
Total number of strings found: 1000
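
Because the newline only appears at the end of each line, the substitution can also be anchored there, or replaced by a plain string method. The following is a minimal sketch on a hypothetical three-line sample (`sample_docs` is an illustrative stand-in, not the corpus itself):

```python
import re

# Hypothetical sample standing in for the first lines of `docs`.
sample_docs = ['Wow... Loved this place.\n', 'Crust is not good.\n', 'A great touch.\n']

# Regex anchored at the end of the string: targets the trailing newline.
with_regex = [re.sub(r'\n$', '', doc) for doc in sample_docs]

# String method with the same effect here; no regex needed for this task.
with_rstrip = [doc.rstrip('\n') for doc in sample_docs]

print(with_regex)
print(with_regex == with_rstrip)
```

For a task this simple the string method is usually easier to read; the regex form becomes useful once the cleanup rule is more elaborate.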


<h2>Problem 2</h2>


Search for and print all words that end with two or more exclamation marks in a row, for example "!!!".

In [8]:
# Function to print elements of a list formatted, numbered, and with the count of exclamation marks
def print_exclamations(list_items):
    """
    Prints a table with each element, its index, and the number of exclamation marks it contains.
    
    Args:
    list_items (list): A list of strings.
    
    Returns:
    None
    """
    # Maximum length of the indices and elements for formatting
    max_length_index = max(len(str(len(list_items))), len("Index"))
    max_length_element = max(len(item) for item in list_items)
    
    # Column headers
    print(f"{'Index':>{max_length_index}} | {'Element':<{max_length_element}} | {'Exclamations':>2}")
    print('-' * (max_length_index + max_length_element + 15))  # 15 for the spaces and the title of the exclamations column
    
    # Print each element with the index and the number of exclamation marks in the word
    for index, element in enumerate(list_items, start=1):
        num_exclamations = element.count('!')  # Count how many exclamation marks are in the element
        # Print the element unchanged, next to its index and its exclamation count
        print(f"{index:>{max_length_index}} | {element:<{max_length_element}} | {num_exclamations:>2}")


In [9]:
Q = []
for doc in docs:
  tmp = re.findall(r'\b\w+(?:!!+)', doc)  
  if tmp:
    Q = Q + tmp

print('Words ending with two or more exclamation marks:')
print_exclamations(Q)
print('Total number of strings found:', len(Q))

Words ending with two or more exclamation marks:
Index | Element                    | Exclamations
----------------------------------------------
    1 | Firehouse!!!!!             |  5
    2 | APPETIZERS!!!              |  3
    3 | amazing!!!                 |  3
    4 | buffet!!!                  |  3
    5 | good!!                     |  2
    6 | it!!!!                     |  4
    7 | DELICIOUS!!                |  2
    8 | amazing!!                  |  2
    9 | shawarrrrrrma!!!!!!        |  6
   10 | yucky!!!                   |  3
   11 | steak!!!!!                 |  5
   12 | delicious!!!               |  3
   13 | far!!                      |  2
   14 | biscuits!!!                |  3
   15 | dry!!                      |  2
   16 | disappointing!!!           |  3
   17 | awesome!!                  |  2
   18 | Up!!                       |  2
   19 | FLY!!!!!!!!                |  8
   20 | here!!!                    |  3
   21 | great!!!!!!!!!!!!!!        | 14
   22 | packed!!   
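
The group in the pattern above is written as non-capturing, `(?:...)`, because `re.findall` returns only the captured groups whenever a pattern contains any. A small sketch on a made-up sentence (not from the corpus) shows the difference:

```python
import re

sample = 'The buffet!!! was good!! but pricey!'

# Non-capturing group: findall returns the full match (word plus marks).
full_matches = re.findall(r'\b\w+(?:!!+)', sample)

# Capturing group: findall returns only what the group captured.
only_marks = re.findall(r'\b\w+(!!+)', sample)

print(full_matches)  # ['buffet!!!', 'good!!']
print(only_marks)    # ['!!!', '!!']
```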

<h2>Problem 3</h2>


Search for and print all words that are written entirely in uppercase. Each match should be a single word.


In [10]:
Q = []
for doc in docs:
  tmp = re.findall(r'\b[A-Z]+\b', doc)  
  if tmp:
    Q = Q + tmp

print('Uppercase words found:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Uppercase words found:
  1. I
  2. I
  3. THAT
  4. A
  5. I
  6. I
  7. I
  8. I
  9. I
 10. I
 11. I
 12. I
 13. I
 14. I
 15. I
 16. I
 17. APPETIZERS
 18. A
 19. I
 20. I
 21. I
 22. I
 23. I
 24. I
 25. I
 26. WILL
 27. NEVER
 28. EVER
 29. STEP
 30. FORWARD
 31. IN
 32. IT
 33. AGAIN
 34. I
 35. LOVED
 36. I
 37. AND
 38. REAL
 39. I
 40. I
 41. I
 42. I
 43. BITCHES
 44. I
 45. I
 46. I
 47. I
 48. NYC
 49. I
 50. I
 51. I
 52. I
 53. I
 54. I
 55. I
 56. STALE
 57. I
 58. I
 59. DELICIOUS
 60. I
 61. I
 62. I
 63. I
 64. I
 65. I
 66. I
 67. I
 68. I
 69. I
 70. I
 71. I
 72. I
 73. I
 74. I
 75. WORST
 76. EXPERIENCE
 77. EVER
 78. I
 79. ALL
 80. I
 81. BARGAIN
 82. I
 83. TV
 84. NONE
 85. I
 86. M
 87. FREEZING
 88. AYCE
 89. I
 90. I
 91. I
 92. I
 93. I
 94. I
 95. I
 96. I
 97. I
 98. I
 99. I
100. I
101. I
102. FLAVOR
103. I
104. I
105. I
106. I
107. I
108. I
109. I
110. I
111. I
112. I
113. I
114. I
115. I
116. I
117. I
118. I
119. I
120. I
121. NEVER
122. 

<p>If we want to see only the unique words that appear, we can do the following:</p>

In [11]:
Q = []
for doc in docs:
  tmp = re.findall(r'\b[A-Z]+\b', doc)  
  if tmp:
    Q = Q + tmp

print('Unique uppercase words found:')
print_elements(set(Q))
print('Total number of unique strings found:', len(set(Q)))

Unique uppercase words found:
 1. UNREAL
 2. FANTASTIC
 3. AYCE
 4. VERY
 5. MUST
 6. RI
 7. ALL
 8. PEOPLE
 9. MGM
10. REAL
11. STALE
12. AN
13. IN
14. THAT
15. APPETIZERS
16. OWNERS
17. OF
18. A
19. AVOID
20. MANAGEMENT
21. NOT
22. BLAND
23. NYC
24. HAPPENED
25. NOW
26. HAVE
27. OK
28. REALLY
29. WHAT
30. OMG
31. MANY
32. ESTABLISHMENT
33. BBQ
34. GO
35. BACK
36. GREAT
37. INCONSIDERATE
38. AGAIN
39. TOTAL
40. SCREAMS
41. BITCHES
42. I
43. DELICIOUS
44. SHOULD
45. BETTER
46. WILL
47. FLY
48. RUDE
49. TV
50. IT
51. AND
52. LEGIT
53. FORWARD
54. TIME
55. FS
56. EVER
57. NONE
58. HAD
59. NEVER
60. FREEZING
61. STEP
62. TOLD
63. THE
64. GC
65. WASTE
66. WORST
67. WEAK
68. FLAVOR
69. BEST
70. LOVED
71. WAY
72. HOUR
73. PERFECT
74. BARGAIN
75. HANDS
76. AZ
77. M
78. NASTY
79. T
80. BARE
81. EXPERIENCE
82. THIS
83. OVERPRICED
84. CONCLUSION
85. NO
Total number of unique strings found: 85
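
A set discards how often each word occurs. If the frequencies matter, `collections.Counter` keeps them. The sketch below uses a hypothetical pair of comments rather than the corpus:

```python
import re
from collections import Counter

# Hypothetical comments; in the notebook this would be `docs`.
sample_docs = ['I LOVED it and I will be back.', 'WORST service EVER, I mean EVER.']

counts = Counter()
for doc in sample_docs:
    counts.update(re.findall(r'\b[A-Z]+\b', doc))

# most_common() sorts by frequency, highest first.
print(counts.most_common(2))  # [('I', 3), ('EVER', 2)]
```

On the real corpus, a count like this would make it obvious that the pronoun "I" dominates the matches.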


<h2>Problem 4</h2>


Search for and print comments where all alphabetical characters (letters) are in uppercase.

In [12]:
Q = []
for doc in docs:
  tmp = re.findall(r'^[^a-z]*$', doc)  # Find lines with no lowercase letters
  if tmp:
    Q = Q + tmp

print('Comments written entirely in uppercase:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Comments written entirely in uppercase:
1. DELICIOUS!!

2. RUDE & INCONSIDERATE MANAGEMENT.

3. WILL NEVER EVER GO BACK AND HAVE TOLD MANY PEOPLE WHAT HAD HAPPENED.

4. TOTAL WASTE OF TIME.

5. AVOID THIS ESTABLISHMENT!

Total number of strings found: 5
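
The pattern `^[^a-z]*$` accepts any line without lowercase ASCII letters, so a line made only of digits or punctuation would also match, and the trailing newline is kept in the result (hence the blank lines in the output). A possible stricter variant, sketched here on made-up strings, strips each line first and requires at least one uppercase letter through a lookahead:

```python
import re

# Hypothetical lines, not taken from the corpus.
sample_docs = ['DELICIOUS!!\n', '10/10\n', 'Not GOOD.\n', 'AVOID THIS PLACE!\n']

# (?=.*[A-Z]) demands at least one uppercase letter;
# [^a-z]* then forbids any lowercase letter in the whole string.
all_caps = re.compile(r'(?=.*[A-Z])[^a-z]*')

shouting = [doc.strip() for doc in sample_docs if all_caps.fullmatch(doc.strip())]
print(shouting)  # ['DELICIOUS!!', 'AVOID THIS PLACE!']
```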


<h2>Problem 5</h2>

Search for and print all words that contain an accented vowel, such as á, é, í, ó, ú.

In [13]:
Q = []
for doc in docs:
  tmp = re.findall(r'\b\w*[áéíóú]\w*\b', doc)  # Find words containing an accented vowel
  if tmp:
    Q = Q + tmp

print('Words with an accented vowel:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Words with an accented vowel:
1. fiancé
2. Café
3. puréed
Total number of strings found: 3
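
The character class `[áéíóú]` lists only five lowercase vowels, so uppercase accented letters, or letters such as `è` and `û`, are missed. One possible generalization, shown as a sketch on an invented sentence, decomposes each word with `unicodedata.normalize` and checks for combining marks. The helper name `has_diacritic` is hypothetical:

```python
import re
import unicodedata

def has_diacritic(word):
    """Return True if any character of the word carries a combining mark."""
    decomposed = unicodedata.normalize('NFD', word)
    return any(unicodedata.combining(ch) for ch in decomposed)

sample = 'My fiancé loved the crème brûlée at the Café.'
words = re.findall(r'\b\w+\b', sample)
accented = [w for w in words if has_diacritic(w)]
print(accented)
```

This treats any diacritic as a hit, which is broader than the five vowels named in the problem statement.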


<h2>Problem 6</h2>


Search for and print all monetary amounts, either whole or with decimals, that start with the dollar sign ($).


In [14]:
Q = []
for doc in docs:
  tmp = re.findall(r'\$[0-9]*\.?[0-9]+', doc)  # Find monetary amounts starting with $
  if tmp:
    Q = Q + tmp

print('Monetary amounts found:')
print_elements(Q)
print('Total number of monetary strings found:', len(Q))

Monetary amounts found:
1. $20
2. $4.00
3. $17
4. $3
5. $35
6. $7.85
7. $12
8. $11.99
Total number of monetary strings found: 8
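
The pattern `\$[0-9]*\.?[0-9]+` makes the integer part optional, so it also accepts amounts such as `$.50`. If the intent is to require digits before an optional decimal part, a stricter form can be sketched as follows (the sample sentence is invented):

```python
import re

sample = 'Lunch was $7.85, the special $12, and a refill only $.50.'

# Pattern used in the cell above: integer part optional.
loose = re.findall(r'\$[0-9]*\.?[0-9]+', sample)

# Stricter sketch: one or more digits, then an optional ".digits" group.
strict = re.findall(r'\$\d+(?:\.\d+)?', sample)

print(loose)   # ['$7.85', '$12', '$.50']
print(strict)  # ['$7.85', '$12']
```

Which behavior is preferable depends on whether amounts without a leading digit should count as valid.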


<h2>Problem 7</h2>


Search for and print all words that are variants of the word "love," regardless of whether they include upper or lower case letters, or how they are conjugated or any other variation of the word.

In [15]:
Q = []
for doc in docs:
  tmp = re.findall(r'\b\w*lov\w*\b', doc, flags=re.IGNORECASE)  # Find all variations of "love"
  if tmp:
    Q = Q + tmp

print('Variants of "love" found:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Variants of "love" found:
 1. Loved
 2. loved
 3. Loved
 4. love
 5. loves
 6. LOVED
 7. lovers
 8. loving
 9. love
10. lovers
11. Love
12. loved
13. loved
14. love
15. love
16. love
17. loved
18. love
19. loved
20. Love
21. LOVED
22. love
23. lovely
24. love
25. lovely
26. love
27. lover
28. loved
29. love
30. love
31. love
32. love
33. gloves
34. love
35. love
36. love
37. love
Total number of strings found: 37
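
Item 33 in the output, "gloves", is a false positive: the leading `\w*` lets any prefix precede the stem. A possible refinement, sketched on an invented sentence, anchors the stem at the start of the word:

```python
import re

sample = 'I LOVED the lovely staff, but the server wore gloves. Food lovers will love it.'

# Pattern from the cell above: any characters may precede "lov".
loose = re.findall(r'\b\w*lov\w*\b', sample, flags=re.IGNORECASE)

# Anchored sketch: the word must start with "lov".
anchored = re.findall(r'\blov\w*\b', sample, flags=re.IGNORECASE)

print(loose)     # ['LOVED', 'lovely', 'gloves', 'lovers', 'love']
print(anchored)  # ['LOVED', 'lovely', 'lovers', 'love']
```

The trade-off is that the anchored form would also skip genuine derivatives with a prefix, such as "beloved".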


<h2>Problem 8</h2>


Search for and print all words that are variants of "so" and "good", which have two or more "o"s in "so" and three or more "o"s in "good".

In [16]:
Q = []
for doc in docs:
  # Find words that are variants of "so" with two or more 'o's and "good" with three or more 'o's
  tmp = re.findall(r'\b(s+o{2,})\b|\b(g+o{3,}d+)\b', doc, flags=re.IGNORECASE)  
  if tmp:
    # Flatten the tuple results and remove empty strings
    Q += [element for tuple_elements in tmp for element in tuple_elements if element]

print('Stretched variants of "so" and "good" found:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Stretched variants of "so" and "good" found:
1. Sooooo
2. soooo
3. gooodd
4. soooooo
5. soooo
Total number of strings found: 5
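
With two capturing groups, `re.findall` returns one tuple per match, which is why the cell above flattens the result. A non-capturing alternation avoids that step. The following is a sketch on an invented sentence:

```python
import re

sample = 'Sooooo goooood, I am soo full. So good!'

# One non-capturing group holding both alternatives: findall returns strings.
pattern = re.compile(r'\b(?:s+o{2,}|g+o{3,}d+)\b', flags=re.IGNORECASE)
stretched = pattern.findall(sample)

print(stretched)  # ['Sooooo', 'goooood', 'soo']
```

The plain forms "So" and "good" are skipped because they do not reach the required number of "o"s.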


<h2>Problem 9</h2>


Search for and print all words that have a length strictly greater than 10 alphabetic characters. Punctuation marks or special characters are not considered in the length of these strings, only uppercase or lowercase alphabetic characters.

In [17]:
Q = []
for doc in docs:
  # Find words that have more than 10 alphabetic characters
  tmp = re.findall(r'\b[a-z]{11,}\b', doc, flags=re.IGNORECASE)  
  if tmp:
    Q += tmp

print('Words longer than 10 letters:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Words longer than 10 letters:
  1. recommendation
  2. recommended
  3. overwhelmed
  4. inexpensive
  5. establishment
  6. imaginative
  7. opportunity
  8. experiencing
  9. underwhelming
 10. relationship
 11. unsatisfying
 12. disappointing
 13. outrageously
 14. disappointing
 15. expectations
 16. restaurants
 17. suggestions
 18. disappointed
 19. considering
 20. Unfortunately
 21. immediately
 22. ingredients
 23. accommodations
 24. maintaining
 25. Interesting
 26. disrespected
 27. accordingly
 28. unbelievable
 29. cheeseburger
 30. descriptions
 31. inexpensive
 32. disappointed
 33. Veggitarian
 34. outstanding
 35. recommendation
 36. disappointed
 37. disappointed
 38. neighborhood
 39. disappointed
 40. corporation
 41. considering
 42. exceptional
 43. shawarrrrrrma
 44. disappointed
 45. vinaigrette
 46. immediately
 47. unbelievably
 48. replenished
 49. disappointed
 50. enthusiastic
 51. Outstanding
 52. comfortable
 53. interesting
 54. INCONSIDERATE
 55. 

<h2>Problem 10</h2>

Search for and print all words that start with an uppercase letter and end with a lowercase letter, but are not the first word of the comment/string.

In [18]:
Q = []
for doc in docs:
  # Find words that start with an uppercase letter and end with a lowercase letter, 
  # and are not the first word in the string
  tmp = re.findall(r'\s([A-Z]\w*[a-z])\b', doc)  
  if tmp:
    Q += tmp

print('Capitalized words that are not the first word of a comment:')
print_elements(Q)
print('Total number of strings found:', len(Q))

Capitalized words that are not the first word of a comment:
  1. Loved
  2. May
  3. Rick
  4. Steve
  5. Cape
  6. Cod
  7. Vegas
  8. Burrittos
  9. Blah
 10. The
 11. They
 12. Mexican
 13. Luke
 14. Our
 15. Hiro
 16. Firehouse
 17. Greek
 18. Greek
 19. Heart
 20. Attack
 21. Grill
 22. Vegas
 23. Dos
 24. Gringos
 25. Jeff
 26. Really
 27. Excalibur
 28. Very
 29. Bad
 30. Customer
 31. Service
 32. Vegas
 33. Rice
 34. Company
 35. Pho
 36. Hard
 37. Rock
 38. Casino
 39. Buffet
 40. Tigerlilly
 41. Yama
 42. Thai
 43. Indian
 44. Not
 45. Vegas
 46. Lox
 47. Subway
 48. Subway
 49. Vegas
 50. Vegas
 51. Mandalay
 52. Bay
 53. Great
 54. Voodoo
 55. Phoenix
 56. Vegas
 57. Khao
 58. Soi
 59. Lemon
 60. Joey
 61. Valley
 62. Phoenix
 63. Magazine
 64. Pho
 65. Fridays
 66. Tasty
 67. Jamaican
 68. Bisque
 69. Bussell
 70. Sprouts
 71. Risotto
 72. Filet
 73. Otto
 74. Yeah
 75. Honestly
 76. Not
 77. Also
 78. Vegas
 79. Greek
 80. Vegas
 81. Veggitarian
 82. Madison
 83. Ironman
 84. Jenni
 85. Pho
 86.
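
The pattern above excludes the first word by requiring a preceding whitespace character and capturing only the word. The same condition can be expressed with a lookbehind, which checks the whitespace without consuming it. A sketch on an invented sentence:

```python
import re

sample = 'Our trip to Las Vegas ended at the Hard Rock Casino.'

# Capturing-group version used above: findall returns the group only.
with_group = re.findall(r'\s([A-Z]\w*[a-z])\b', sample)

# Lookbehind sketch: the whitespace is checked but not part of the match.
with_lookbehind = re.findall(r'(?<=\s)[A-Z]\w*[a-z]\b', sample)

print(with_group)  # ['Las', 'Vegas', 'Hard', 'Rock', 'Casino']
print(with_group == with_lookbehind)  # True
```

Both forms skip "Our" because nothing precedes it, but neither can tell a proper noun from a word that merely starts a later sentence.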

<h2>Problem 11</h2>

Search and print the sequence of two or more words separated by a hyphen, "-", without any spaces between them.

For example, "Go-Kart" would be valid, but "Go -Kart" or "Go - Kart" would not be.

In [19]:
Q = []
for doc in docs:
  # One or more "-word" groups allow sequences of two or more hyphenated words
  tmp = re.findall(r'\b\w+(?:-\w+)+\b', doc)
  # This check must sit inside the loop; otherwise only the last comment is kept
  if tmp:
    Q += tmp
print('First 10 matches from the result:')
print_elements(Q[:10])
print('Total strings found:', len(Q))
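
The examples in the problem statement can be checked directly. The sketch below runs a hyphen pattern that allows repeated `-word` groups on an invented sentence:

```python
import re

sample = 'The Go-Kart track and the all-you-can-eat buffet, but not Go -Kart or Go - Kart.'

# \w+ followed by one or more "-word" groups, with no spaces allowed.
hyphenated = re.findall(r'\b\w+(?:-\w+)+\b', sample)

print(hyphenated)  # ['Go-Kart', 'all-you-can-eat']
```

The spaced forms are rejected because the hyphen must sit directly between two word characters.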


<h2>Problem 12</h2>


Search and print all words ending in "ing" or "ed".

In [20]:
ings = []
eds = []
for doc in docs:
    ing = re.findall(r'\b\w+ing\b', doc, flags=re.IGNORECASE)
    ed = re.findall(r'\b\w+ed\b', doc, flags=re.IGNORECASE)
    if ing:
        ings += ing
    if ed:
        eds += ed

print('First 10 words ending in "ing", followed by the first 10 ending in "ed":')
print_elements(ings[:10] + eds[:10])
print('Total strings ending with "ing":', len(ings))
print('Total strings ending with "ed":', len(eds))

First 10 words ending in "ing", followed by the first 10 ending in "ed":
 1. during
 2. getting
 3. being
 4. being
 5. amazing
 6. running
 7. redeeming
 8. getting
 9. thing
10. dressing
11. Loved
12. Stopped
13. loved
14. ended
15. overpriced
16. tried
17. disgusted
18. shocked
19. recommended
20. performed
Total strings ending with "ing": 280
Total strings ending with "ed": 339
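
When separate counts are not needed, both suffixes can be matched in a single pass with an alternation. A sketch on an invented sentence:

```python
import re

sample = 'We stopped by and loved the amazing dressing, though the bread tasted dry.'

# One pass with an alternation of both suffixes.
matches = re.findall(r'\b\w+(?:ing|ed)\b', sample, flags=re.IGNORECASE)
print(matches)  # ['stopped', 'loved', 'amazing', 'dressing', 'tasted']
```

Suffix matching is purely orthographic: nouns such as "dressing" are captured alongside verb forms, so the result is not a list of conjugated verbs.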


<h2>Corpus Preprocessing</h2>

<h2>Problem 13</h2>

Now perform a cleaning process on the corpus that includes the following steps:

<ul>
<li>Only consider alphabetical characters. This means removing all punctuation marks and special characters.</li>
<li>Transform all alphabetical characters to lowercase.</li>
<li>Remove any additional white spaces found in each comment.</li>
</ul>




In [21]:
def clean_text(text):
    """
    Cleans the text by removing non-alphabetic characters, converting to lowercase,
    and collapsing extra white space.
    
    Args:
        text (str): The original text to process.
    
    Returns:
        str: The cleaned, lowercase text with single spaces between words.
    """
    # Remove non-alphabetic characters and convert to lowercase
    cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    
    # Remove extra white spaces
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    
    return cleaned_text    

In [22]:
clean_data = []
for doc in docs:
    tmp = clean_text(doc)
    if tmp:
        clean_data.append(tmp)
print('First 10 comments from the result:')
print_elements(clean_data[:10])

First 10 comments from the result:
 1. wow loved this place
 2. crust is not good
 3. not tasty and the texture was just nasty
 4. stopped by during the late may bank holiday off rick steve recommendation and loved it
 5. the selection on the menu was great and so were the prices
 6. now i am getting angry and i want my damn pho
 7. honeslty it didnt taste that fresh
 8. the potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer
 9. the fries were great too
10. a great touch
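
Deleting punctuation outright joins words that were separated only by punctuation. Whether that matters depends on the corpus. One possible variant, shown as a sketch with a hypothetical helper named `clean_keep_boundaries`, substitutes a space instead so that word boundaries survive:

```python
import re

def clean_keep_boundaries(text):
    """Hypothetical variant: replace non-letters with a space, then normalize."""
    spaced = re.sub(r'[^a-zA-Z\s]', ' ', text).lower()
    return re.sub(r'\s+', ' ', spaced).strip()

sample = 'Great food...but the service was slow/rude!\n'

# Same steps as the cleaning function above: delete, lowercase, collapse spaces.
deleted = re.sub(r'\s+', ' ', re.sub(r'[^a-zA-Z\s]', '', sample).lower()).strip()

print(deleted)                        # great foodbut the service was slowrude
print(clean_keep_boundaries(sample))  # great food but the service was slow rude
```

The trade-off is that contractions split into two tokens ("didn't" becomes "didn t"), so neither choice is strictly better.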


<h2>Problem 14</h2>

<ul>
    <li>
        With the cleaning result obtained in the previous question, now perform a word tokenization process on the corpus.
        <ul>
            <li>
                That is, at the end of this tokenization process, you should have a list of lists where each comment will be tokenized by words.
            </li>
        </ul>
    </li>
    <li>
        After finishing, calculate the total number of tokens obtained in the entire corpus.
    </li>
</ul>

In [23]:
tokenization = []
num_tokens = 0
for doc in clean_data:
    tmp = re.findall(r'\b\w+\b', doc, flags=re.IGNORECASE)  
    if tmp:
        tokenization.append(tmp)
    num_tokens += len(tmp)
print('First 10 comments from the result:')
print_elements(tokenization[:10])
print("Total tokens: ", num_tokens)


First 10 comments from the result:
 1. ['wow', 'loved', 'this', 'place']
 2. ['crust', 'is', 'not', 'good']
 3. ['not', 'tasty', 'and', 'the', 'texture', 'was', 'just', 'nasty']
 4. ['stopped', 'by', 'during', 'the', 'late', 'may', 'bank', 'holiday', 'off', 'rick', 'steve', 'recommendation', 'and', 'loved', 'it']
 5. ['the', 'selection', 'on', 'the', 'menu', 'was', 'great', 'and', 'so', 'were', 'the', 'prices']
 6. ['now', 'i', 'am', 'getting', 'angry', 'and', 'i', 'want', 'my', 'damn', 'pho']
 7. ['honeslty', 'it', 'didnt', 'taste', 'that', 'fresh']
 8. ['the', 'potatoes', 'were', 'like', 'rubber', 'and', 'you', 'could', 'tell', 'they', 'had', 'been', 'made', 'up', 'ahead', 'of', 'time', 'being', 'kept', 'under', 'a', 'warmer']
 9. ['the', 'fries', 'were', 'great', 'too']
10. ['a', 'great', 'touch']
Total tokens:  10777
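
Since the cleaned text contains only lowercase letters and single spaces, the regex tokenizer should agree with a plain whitespace split. A quick check on one of the cleaned comments shown above:

```python
import re

cleaned = 'now i am getting angry and i want my damn pho'

regex_tokens = re.findall(r'\b\w+\b', cleaned)
split_tokens = cleaned.split()

print(regex_tokens == split_tokens)  # True
print(len(regex_tokens))             # 11
```

The regex form becomes the better choice once the input still contains punctuation, which a whitespace split would leave attached to the tokens.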


<h2>Problem 15</h2>


<ul>
    <li>
        Finally, in this exercise, we will define our set of stopwords, which you will need to remove from the entire corpus.
        <ul>
            <li>
                Remember that examples of stopwords include articles, adverbs, conjunctions, etc., which have very high frequencies of occurrence in any document but do not provide much meaning in terms of the content of a statement.
            </li>
        </ul>
    </li>
    <li>
        Based on the provided list of stopwords, perform a cleaning process by removing all these words from the corpus obtained in the previous exercise.
    </li>
    <li>
        Determine how many tokens/words remain in the entire corpus.
    </li>
    <li>
        Determine how many of these tokens/words are unique, that is, how many unique tokens our vocabulary will have later on.
    </li>
</ul>

In [24]:
# We consider the following as our list of stopwords:
my_stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'should', 'now', 'll']

In [25]:
phrases_tokens = []
for phrase in tokenization:
    token_sin_stopword = []
    for token in phrase:
        if token not in my_stop_words:
            token_sin_stopword.append(token)
    phrases_tokens.append(token_sin_stopword)

In [26]:
flat_phrases = list(np.concatenate(phrases_tokens))

print("Tokens/words remaining in the entire corpus:", len(flat_phrases))
print("Unique tokens/words in the corpus:", len(set(flat_phrases)))

Tokens/words remaining in the entire corpus: 5776
Unique tokens/words in the corpus: 1941
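
Membership tests against a list scan it element by element, so on larger corpora a set is the usual choice for the stopword lookup. The sketch below uses hypothetical miniature stand-ins (`stop_words`, `tokenized`) for `my_stop_words` and `tokenization`, and counts the vocabulary with `collections.Counter`:

```python
from collections import Counter

# Hypothetical miniature stand-ins for `my_stop_words` and `tokenization`.
stop_words = {'the', 'was', 'and', 'so', 'were', 'is'}
tokenized = [['crust', 'is', 'not', 'good'],
             ['the', 'fries', 'were', 'great', 'and', 'so', 'was', 'the', 'crust']]

# Set membership is O(1) on average, versus a linear scan of a list.
filtered = [[tok for tok in doc if tok not in stop_words] for doc in tokenized]

# Counter gives both the remaining token total and the vocabulary size.
vocabulary = Counter(tok for doc in filtered for tok in doc)
print(filtered)
print(sum(vocabulary.values()), len(vocabulary))  # 6 5
```

Note that "not" is kept here, as it is absent from the stopword list above; for sentiment data, negations usually carry meaning.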


In practice, regular expressions support a range of text preprocessing tasks, including text normalization, tokenization, and simple rule-based stemming. Tasks such as lemmatization and edit-distance calculation rely on dictionaries or dedicated algorithms rather than on regex, although regex is often used to prepare their input. Regex also matches complex patterns, which helps when extracting specific information such as dates and email addresses from large datasets. This simplifies data cleaning and supports more advanced search in a variety of contexts, making regex a versatile tool in text analysis and processing.
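
As an illustration of the extraction use case, the sketch below pulls email addresses and ISO-style dates from an invented sentence. Both patterns are deliberately simplified assumptions; real-world emails and date formats vary far more:

```python
import re

sample = 'Contact us at info@example.com before 2024-03-15 or write to help@example.org.'

# Simplified patterns for illustration only.
email_pattern = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')
date_pattern = re.compile(r'\b\d{4}-\d{2}-\d{2}\b')

print(email_pattern.findall(sample))  # ['info@example.com', 'help@example.org']
print(date_pattern.findall(sample))   # ['2024-03-15']
```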

In conclusion, regular expressions are indispensable in the field of NLP due to their efficiency and precision in text manipulation and analysis. They allow greater adaptability to the specific needs of projects and contribute significantly to the effectiveness and accuracy of NLP systems. For these reasons, mastering regex is essential for anyone working in language technology, as it facilitates the implementation of more robust and optimized solutions.