# Regular Expression #
A Regular Expression (RegEx) is a specialized sequence of characters designed to identify and match specific patterns within a string or a group of strings.
<table>
    <thead>
        <tr>
            <th>Function</th> <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>findall</td> <td>Returns a list containing all matches</td>
        </tr>
        <tr>
            <td>search</td> <td>Returns a match object for the first match anywhere in the string, or <code>None</code> if there is no match</td>
        </tr>
    </tbody>
</table>
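
To make the difference between the two functions concrete, here is a minimal sketch on a made-up string (the `sample` text and its phone-like numbers are assumptions for illustration, not part of the Lewis University text used in the cells that follow).

```python
import re

# Hypothetical sample text, used only for this illustration
sample = "Call 555-0100 or 555-0199."

# findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d{3}-\d{4}", sample)
print(all_numbers)        # ['555-0100', '555-0199']

# search returns a match object for the first match only
first = re.search(r"\d{3}-\d{4}", sample)
print(first.group())      # '555-0100'
print(first.span())       # (5, 13)

# search returns None when nothing matches, so check before calling .group()
missing = re.search(r"[a-z]+@", sample)
print(missing)            # None
```

Because `search` can return `None`, calling `.group()` on its result without a check raises an `AttributeError` when the pattern does not match.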

In [1]:
import re

In [2]:
text_lwisu="""Since 1932, Lewis University has been grounded in intentionality, guided by truth, and inspired by innovation. 
We’ve built a learning environment that creates critical thinkers who take on new knowledge and ask the big questions.
We welcome students of all beliefs and backgrounds and encourage them to create a better and more just society.
You can land anywhere from here.
Contact us:1 University Parkway

Romeoville, IL 60446

(815)-838-0500· (800)-897-9000

Webmaster@lewisu.edu

Lasallian Education"""

In [3]:
"(815)-838-0500" in text_lwisu

True

In [4]:
match=re.findall(r'\d{5}|\d{4}|\d{3}',text_lwisu)
print(match)

['1932', '60446', '815', '838', '0500', '800', '897', '9000']


In [219]:
" ".join(match)

'1932 60446 815 838 0500 800 897 9000'

## Find only phone numbers ##

In [5]:
# \( and \) match literal parentheses; inside [...] they would only be members of a character set
pattern=r'\(\d{3}\)-\d{3}-\d{4}'
finds=re.findall(pattern,text_lwisu)
print(finds)

['(815)-838-0500', '(800)-897-9000']


### Get separate groups of the phone number ###
We use parentheses to separate the phone number into groups.

In [7]:
sent="this is my phone number 444-350-0563, call me soon!"
pattern=r'\d{3}-\d{3}-\d{4}'
finds=re.search(pattern,sent)
finds.group()

'444-350-0563'

In [92]:
sent="this is my phone number 444-350-0563, contact me soon!"
pattern=r'(\d{3})-(\d{3})-(\d{4})'
finds=re.search(pattern,sent)
finds.group()

'444-350-0563'
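
Calling `group()` with no argument returns the whole match, so the parentheses make no visible difference in the cell above. A minimal sketch of reading the groups individually follows; the names `area`, `prefix`, and `line` are labels chosen for this illustration only.

```python
import re

sent = "this is my phone number 444-350-0563, contact me soon!"
m = re.search(r"(\d{3})-(\d{3})-(\d{4})", sent)

print(m.group(0))   # whole match: '444-350-0563'
print(m.group(1))   # first group: '444'
print(m.group(3))   # third group: '0563'
print(m.groups())   # all groups as a tuple: ('444', '350', '0563')

# Unpack the groups into illustrative names
area, prefix, line = m.groups()
print(area, prefix, line)
```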

In [221]:
match2=re.findall(r'[Ww]e\w+',text_lwisu)
match2

['welcome', 'Webmaster']

In [222]:
re.findall(r"\w+at",text_lwisu)

['innovat', 'that', 'creat', 'creat', 'Educat']

## Find email address ##
<font color=black>This example uses <b>regular expressions</b> to find email addresses, a common application in web development. Checking that an email is properly formatted matters for any website with user authentication, since it improves data quality and rejects malformed submissions during sign-up or login.</font>

In [8]:
pattern=r'[a-zA-Z0-9]+@[a-zA-Z]+\.(com|edu|gov|net)'
email=re.search(pattern,text_lwisu)
email.group()

'Webmaster@lewisu.edu'

In [9]:
pattern=r'[\w+]+@[\w\.]+'
email2=re.search(pattern,text_lwisu)
email2.group()

'Webmaster@lewisu.edu'
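
The cells above extract an address from a longer text with `re.search`. Validating a single form field is a slightly different task, because the whole string has to match. The sketch below uses `re.fullmatch` for that; `is_valid_email` and `EMAIL_RE` are hypothetical names, and the pattern is the deliberately simple one from this notebook, so it rejects many real addresses (subdomains, other top-level domains, and so on).

```python
import re

# Hypothetical helper, not part of the original notebook.
# The pattern is intentionally simple and is NOT a complete email validator.
EMAIL_RE = re.compile(r"[a-zA-Z0-9]+@[a-zA-Z]+\.(?:com|edu|gov|net)")

def is_valid_email(candidate):
    """Return True only if the entire string matches the pattern."""
    return EMAIL_RE.fullmatch(candidate) is not None

print(is_valid_email("Webmaster@lewisu.edu"))          # True
print(is_valid_email("mail Webmaster@lewisu.edu!"))    # False

# search still finds the address inside the longer string
print(EMAIL_RE.search("mail Webmaster@lewisu.edu!") is not None)   # True
```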

### Another example: ###
<ul>
<li>[\w.-]+ → Matches usernames with letters, numbers, _, . and -.</li>
<li>@[a-zA-Z]+ → Ensures the domain name has only alphabetic characters.</li>
<li>\\.(?:com|edu|gov|net) → Uses (?: ... ) to make the group non-capturing, so findall returns the full email instead of only the text captured by the group.</li>
</ul>

In [13]:
patterns=r'[\w.-]+@[a-zA-Z]+\.(?:com|edu|gov|net)'
text_email="""Email1: jim304@lewisu.edu,
Email2: khaled.alrfou@lewisu.edu
Email3: web_master@lewisu.edu
"""
email3=re.findall(patterns,text_email)
print(email3)

['jim304@lewisu.edu', 'khaled.alrfou@lewisu.edu', 'web_master@lewisu.edu']
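
To see why the non-capturing group matters, the following sketch runs `findall` with both variants of the pattern on a shortened, made-up version of the text. When a pattern contains exactly one capturing group, `findall` returns only what that group captured.

```python
import re

# Shortened sample, assumed for this illustration
text_email = "jim304@lewisu.edu, web_master@lewisu.edu"

# Capturing group: findall returns only the group contents
capturing = re.findall(r"[\w.-]+@[a-zA-Z]+\.(com|edu|gov|net)", text_email)
print(capturing)       # ['edu', 'edu']

# Non-capturing group: findall returns the whole match
non_capturing = re.findall(r"[\w.-]+@[a-zA-Z]+\.(?:com|edu|gov|net)", text_email)
print(non_capturing)   # ['jim304@lewisu.edu', 'web_master@lewisu.edu']
```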


### `\b[Tt]\w*`: match at the beginning of a word ###
\b → Ensures the match starts at a word boundary.

\w* → Matches any following word characters (letters, digits, underscore) until the word ends.

In [229]:
sent2="""The best programming language for NLP is Python.
While other programming languages can be used in NLP, this course will focus exclusively on Python implementations."""
re.findall(r"\b[Tt]\w*",sent2)
#re.findall(r"\b[Tt]\w{2}\b",sent2)

['The', 'this']
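
The effect of `\b` is easiest to see by removing it. This is a small sketch on a made-up sentence (the `sample` string is an assumption): without the boundary, the pattern also starts matching at any `t` inside a word.

```python
import re

# Hypothetical sample sentence
sample = "The cat met another tiger"

# Without \b the match may start in the middle of a word
without_boundary = re.findall(r"[Tt]\w*", sample)
print(without_boundary)   # ['The', 't', 't', 'ther', 'tiger']

# With \b only words that begin with T or t are returned
with_boundary = re.findall(r"\b[Tt]\w*", sample)
print(with_boundary)      # ['The', 'tiger']

# A closing \b plus \w{2} restricts the match to three-letter words
three_letters = re.findall(r"\b[Tt]\w{2}\b", sample)
print(three_letters)      # ['The']
```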

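### `[^\w]` Not alphanumeric ###
Inside square brackets, a leading `^` negates the set, so `[^\w]` matches any single character that is not a letter, digit, or underscore. In the pattern below, the `t` and `h` in `[^th\w]` are redundant because `\w` already covers them. Each match starts with the space that follows a word, which is why the joined output contains double spaces.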

In [237]:
matchall=re.findall(r"\b[^th\w]\w+",sent2)
" ".join(matchall)

' best  programming  language  for  NLP  is  Python  other  programming  languages  can  be  used  in  NLP  course  will  focus  exclusively  on  Python  implementations'
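
As a simpler illustration of the negated class on its own, this sketch lists the non-word characters in a made-up sentence (the `sample` string is an assumption). Adding `\s` to the negated set leaves only punctuation, and `\W` is shorthand for `[^\w]`.

```python
import re

# Hypothetical sample sentence
sample = "Hello, world! NLP_101 is fun."

# Everything that is not a letter, digit, or underscore (spaces included)
non_word = re.findall(r"[^\w]", sample)
print(non_word)      # [',', ' ', '!', ' ', ' ', ' ', '.']

# Excluding whitespace as well leaves only the punctuation
punctuation = re.findall(r"[^\w\s]", sample)
print(punctuation)   # [',', '!', '.']

# \W is equivalent to [^\w]
print(re.findall(r"\W", sample) == non_word)   # True
```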

### `.$` and `\.$` ###
`.$`: Matches any single character (except a newline) at the end of a string. 

`\.$`: Matches a dot (.) only if it appears at the end of a string.

In [185]:
sent4="The end."
mat=re.search(r'\.$', sent4)
mat.group()

'.'

In [191]:
sent4="The end. is this the end?"
matt=re.search(r'\w+.$', sent4)
matt.group()

'end?'
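
The two patterns can be compared side by side. In this sketch (the two test strings are assumptions), `.$` succeeds on both strings, while `\.$` succeeds only when the last character really is a dot.

```python
import re

# Hypothetical test strings
strings = ["The end.", "Is this the end?"]

results = {}
for s in strings:
    any_last = re.search(r".$", s)    # any final character
    dot_last = re.search(r"\.$", s)   # a literal dot at the end, or None
    results[s] = (any_last.group(), dot_last.group() if dot_last else None)
    print(repr(s), results[s])

# 'The end.' ('.', '.')
# 'Is this the end?' ('?', None)
```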

# False positives and false negatives #
The process we just went through was based on fixing two kinds of errors:

**False negatives**: not matching things that we should have matched (The).

**False positives**: matching strings that we should not have matched (there, then, other).


In [213]:
senFpFn="The quick brown fox jumped over the lazy dog. There was a time when the weather was nice."
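# [tT]he\w* is too permissive: it also matches "There" and the "ther" inside "weather"
fp=re.findall(r'[tT]he\w*', senFpFn)

print("this is the type of False positive: ",fp)

# \bthe\b is too strict: it misses the capitalized "The"
fn=re.findall(r'\bthe\b',senFpFn)
print("this is the type of False negative: ",fn)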

this is the type of False positive:  ['The', 'the', 'There', 'the', 'ther']
this is the type of False negative:  ['the', 'the']


In [215]:
# Word boundaries on both sides keep "the" inside longer words (e.g. "bathe") from matching
correct=re.findall(r'\b[Tt]he\b',senFpFn)
print(correct)

['The', 'the', 'the']


### **Error Tradeoff in NLP: Precision vs. Recall** ###
In NLP we are always dealing with these kinds of errors. Two rates measure how often each one occurs:

1. **False Negative Rate (Missed Matches)**:
   $$
   \text{False Negative Rate} = \frac{\text{False Negatives}}{\text{True Positives} + \text{False Negatives}}
   $$

2. **False Positive Rate (Incorrect Matches)**:
   $$
   \text{False Positive Rate} = \frac{\text{False Positives}}{\text{True Negatives} + \text{False Positives}}
   $$

Reducing the error rate for an application often involves two antagonistic efforts:

### Increasing accuracy (or precision) (minimizing false positives) ###

**Precision (Accuracy of Matches)**:
$$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$$

### Increasing coverage (or recall) (minimizing false negatives) ###

**Recall (Coverage of Matches)**:
$$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$$
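
The precision and recall formulas can be applied to the three `the` patterns tried earlier. The sketch below is one possible way to do it, under a stated assumption: the gold standard is every whitespace-separated token that equals "the" once trailing punctuation is stripped and case is ignored. The helper `precision_recall` is a hypothetical name, not something defined earlier in the notebook.

```python
import re

text = ("The quick brown fox jumped over the lazy dog. "
        "There was a time when the weather was nice.")

# Gold standard (an assumption of this sketch): spans of tokens equal to "the",
# ignoring case and surrounding punctuation. No token in this text has
# leading punctuation, so the word starts where the token starts.
gold = set()
for tok in re.finditer(r"\S+", text):
    if tok.group().strip(".,!?").lower() == "the":
        gold.add((tok.start(), tok.start() + 3))

def precision_recall(pattern):
    """Compare the spans matched by pattern against the gold spans."""
    predicted = {m.span() for m in re.finditer(pattern, text)}
    tp = len(predicted & gold)   # correct matches
    fp = len(predicted - gold)   # matches that should not have been made
    fn = len(gold - predicted)   # gold items that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for pattern in (r"[tT]he\w*", r"\bthe\b", r"\b[Tt]he\b"):
    p, r = precision_recall(pattern)
    print(f"{pattern:12} precision={p:.2f} recall={r:.2f}")
```

On this sentence the permissive pattern finds all three gold items but adds two wrong ones (precision 0.60, recall 1.00), the strict pattern makes no wrong matches but misses "The" (precision 1.00, recall 0.67), and the corrected pattern reaches 1.00 on both.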
