In [None]:
# run this cell to ensure that nbformat, nbconvert are current for syncing with GitHub   
!pip install --upgrade nbformat nbconvert


#CST2312 Class 17    
<b>Conceal Phone Numbers Assignment     
[GPT4 Solution](https://chat.openai.com/chat?model=gpt-4)</b>    
    
*The information and explanations in this notebook were provided by OpenAI's GPT-4 AI language model.*    
   
with Professor Patrick on 28-Mar-2023    

<h3><center><u>Abstract</u></center></h3> 

This notebook presents a dialog with OpenAI's GPT4 model while developing a solution to a **Regular Expressions** (`regex`) assignment given to students in a university data science class.  The assignment was to take a long string of text containing phone numbers in various formats and to mask every phone number using **Regular Expressions**.     

The dialog with the model is presented chronologically, with each section split into requests from Professor Patrick and responses from GPT4.    

Outline for the dialog:
1. Problem Presentation
2. Initial Solution
3. Error Resolution
4. Final Solution
5. `regex` Pattern Components     
6. Reference Sources
7. Wrap-up    
    





---



##1. Problem Presentation    


#####<u>Prof Patrick request</u>    

Write a Python script that uses `regular expressions` (`regex`) to conceal all phone numbers in a docstring. Read the below multiline docstring and substitute all phone numbers with `XXX-XXX-XXXX`; that is three X's, a hyphen, three X's, a hyphen, and four X's.   


```
docstring_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""
```





---



##2. Initial Solution    

#####<u>GPT4 response</u>    

In [None]:
import re

def conceal_phone_numbers(docstring_text):
    # Match different phone number formats
    phone_regex = re.compile(r'(\d{1,2}\s*[-.]?\s*)?(?\d3?\d3?[-.\s]?)\d{3}[-.]?\d{4}(\s*(ext|x|X|EXT|#).?\s*\d{1,5})?')

In [None]:
docstring_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""


In [None]:
concealed_docstring = conceal_phone_numbers(docstring_text)

*The statement above raises an `re.error` exception, because the pattern fails to compile.*  

*In the following cell, I peek at the characters in positions 20 through 29 of the `regex` pattern to investigate the error.*    

In [None]:
peek_rstr = r'(\d{1,2}\s*[-.]?\s*)?(?\d3?\d3?[-.\s]?)\d{3}[-.]?\d{4}(\s*(ext|x|X|EXT|#).?\s*\d{1,5})?'
print(peek_rstr[20:30],'\n^^^^^^^^^^','\n0123456789')

*Because of the error, the following line of code from `GPT4` was not executed:*    

```
print(concealed_docstring)    
```    
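
The same investigation can be done programmatically. The sketch below is an addition to the original dialog: it catches the `re.error` raised by `re.compile` and uses the exception's `msg` and `pos` attributes to point at the offending character. In Python's `re` syntax, `(?` opens an extension such as `(?:...)` or `(?P<name>...)`, so the sequence `(?\d` is rejected. The helper name `explain_regex_error` is hypothetical.

```python
import re

# The pattern exactly as it was pasted into the notebook (it does not compile).
bad_pattern = r'(\d{1,2}\s*[-.]?\s*)?(?\d3?\d3?[-.\s]?)\d{3}[-.]?\d{4}(\s*(ext|x|X|EXT|#).?\s*\d{1,5})?'

def explain_regex_error(pattern):
    """Return (message, position) if the pattern fails to compile, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return exc.msg, exc.pos
    return None

result = explain_regex_error(bad_pattern)
if result is not None:
    message, position = result
    print(message)
    print(bad_pattern)
    print(' ' * position + '^')  # caret under the character at the reported position
```

The caret lands on the `?` that directly follows an unescaped `(`, which is consistent with the later finding that a backslash before the parenthesis went missing during copy and paste.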



---



##3. Error Resolution    

*I decided to tell `GPT4` about the error and to ask it to provide a solution.*  

#####Prof Patrick request

The Python code you provided results in an error when the statement `concealed_docstring = conceal_phone_numbers(docstring_text)` is executed.  Explain the following error message and how to correct the code you provided.  Here is the error message.    



```
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-27-56785831cbf1> in <module>
----> 1 concealed_docstring = conceal_phone_numbers(docstring_text)

6 frames
/usr/lib/python3.9/sre_parse.py in _parse(source, state, verbose, nested, first)
    828                     group = None
    829                 else:
--> 830                     raise source.error("unknown extension ?" + char,
    831                                        len(char) + 1)
    832 

error: unknown extension ?\d at position 22
```    




---



##4. Final Solution     

#####GPT4 response    

Apologies for the confusion. The error occurs because there is a question mark (?) in the regular expression, which is being misinterpreted as an extension. To fix this issue, I will update the `phone_regex` pattern to properly escape the question mark.    
    
Here's the corrected code:    

In [None]:
import re

def conceal_phone_numbers(docstring_text):
    # Match different phone number formats
    phone_regex = re.compile(r'(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?')

    # Substitute all phone numbers with 'XXX-XXX-XXXX'
    concealed_text = phone_regex.sub('XXX-XXX-XXXX', docstring_text)

    return concealed_text

docstring_text = """512-234-5234
foo
bar
124-512-5555
biz
125-555-5785
679-397-5255
2126660921
212-998-0902
888-888-2222
801-555-1211
802 555 1212
803.555.1213
(804) 555-1214
1-805-555-1215
1(806)555-1216
807-555-1217-1234
808-555-1218x1234
809-555-1219 ext. 1234
work 1-(810) 555.1220 #1234
"""

concealed_docstring = conceal_phone_numbers(docstring_text)
print(concealed_docstring)


Please note that the error message you provided does not match the code I initially provided. The error message suggests there is an unescaped question mark in the regex pattern, which was not present in the initial code. Nonetheless, the corrected code above should work as expected.



#####Prof Patrick reply

Thank you.

#####GPT4 response    

You're welcome! If you have any more questions or need further assistance, please don't hesitate to ask.
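
A line-by-line check of the pattern above, added here as a hedged aside, suggests that it does not cover every sample. The second separator `[\-\.]?` does not allow a space, so `802 555 1212` is not matched as a whole. Because `\s` also matches a newline, the optional prefix group can also reach back into digits left over on the previous line. The variant below is only a sketch under stated assumptions: `PHONE_RE` and `conceal_phone_numbers_strict` are hypothetical names, the only country code considered is `1`, and a trailing `-1234` is treated as an extension.

```python
import re

# Pattern from the final solution above, unchanged.
GPT4_RE = re.compile(
    r'(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}'
    r'(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?'
)

# Hypothetical tightened variant (an assumption, not part of the GPT4 dialog).
PHONE_RE = re.compile(
    r"""
    (?<!\d)              # do not start in the middle of a run of digits
    (?:1[-. \t]?)?       # optional country code 1 (assumption: NANP numbers only)
    \(?\d{3}\)?          # area code, optionally in parentheses
    [-. \t]?             # separator: hyphen, dot, space, or tab (never a newline)
    \d{3}                # first three digits of the local number
    [-. \t]?             # separator
    \d{4}                # last four digits of the local number
    (?:[ \t]*(?:ext\.?|x|\#|-)[ \t]*\d{1,5})?   # optional extension
    (?!\d)               # do not stop in the middle of a run of digits
    """,
    re.VERBOSE | re.IGNORECASE,
)

def conceal_phone_numbers_strict(text):
    """Replace every match of PHONE_RE with the mask 'XXX-XXX-XXXX'."""
    return PHONE_RE.sub('XXX-XXX-XXXX', text)

sample = (
    "801-555-1211\n802 555 1212\n803.555.1213\n"
    "807-555-1217-1234\n808-555-1218x1234\n"
)
print(GPT4_RE.sub('XXX-XXX-XXXX', sample))    # some digits and line breaks are mishandled
print(conceal_phone_numbers_strict(sample))   # one mask per line
```

Replacing `\s` with `[ \t]` keeps every match on a single line, and the lookarounds `(?<!\d)` and `(?!\d)` stop a match from beginning or ending inside a longer run of digits.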



---



##5. `regex` Pattern Components     

#####Prof Patrick request    

Please provide an explanation breaking down each component of the `regex` pattern you used.  Provide an explanation in a way that could be easily presented to students in a university data science class.

##### GPT4 response    

I'll break down each component of the regular expression pattern used in the <b>`conceal_phone_numbers`</b> function and provide an explanation for each part.



```
(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?

```



1. <b>(\d{1,2}\s*[\-\.]?\s*)?</b>: This part of the regex pattern captures the optional country code or area code prefix. The country code can be 1 or 2 digits long.    

  - <b>\d{1,2}</b>: Matches 1 or 2 digits.
  - <b>\s*</b>: Matches zero or more whitespace characters.
  - <b>[\-\.]?</b>: Matches an optional hyphen (-) or dot (.).
  - <b>\s*</b>: Matches zero or more whitespace characters.
  - <b>()?</b>: Wraps the entire expression in a non-capturing group and makes it optional.    

2. <b>(\(?\d{3}\)?[\-\.\s]?)</b>: This part of the pattern captures the area code.

  - <b>\(?\d{3}\)?</b>: Matches exactly 3 digits, which may or may not be wrapped in parentheses.    
    - <b>\(?</b>: Matches an optional opening parenthesis.    
    - <b>\d{3}</b>: Matches exactly 3 digits.    
    - <b>\)?</b>: Matches an optional closing parenthesis.    
  - <b>[\-\.\s]?</b>: Matches an optional hyphen (-), dot (.), or whitespace character.    

3. <b>\d{3}</b>: This part of the pattern captures the first 3 digits of the local number.    
    
4. <b>[\-\.]?</b>: This part of the pattern matches an optional hyphen (-) or dot (.) as a separator between the first and second parts of the local number.    
    
5. <b>\d{4}</b>: This part of the pattern captures the last 4 digits of the local number.    
    
6. <b>(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?</b>: This part of the pattern captures an optional extension number.    
    
  - <b>\s*</b>: Matches zero or more whitespace characters.    
  - <b>(ext|x|X|EXT|#)</b>: Matches one of the following: "ext", "x", "X", "EXT", or "#".    
  - <b>\.?</b>: Matches an optional dot (.).    
  - <b>\s*</b>: Matches zero or more whitespace characters.    
  - <b>\d{1,5}</b>: Matches 1 to 5 digits for the extension number.    
  - <b>()?</b>: Wraps the entire expression in a non-capturing group and makes it optional.    
      

This regex pattern is designed to match various phone number formats, including optional country and area code prefixes, parentheses around the area code, and optional extension numbers. The pattern accommodates different separators, such as hyphens, dots, and whitespace characters.    

#####Prof Patrick    

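*Note: GPT4 erroneously assumes that telephone country codes are one or two digits long when, in fact, many country codes have three digits. GPT4 also describes the `( ... )?` wrappers as non-capturing groups; in Python's `re` syntax a plain `( ... )` is a capturing group, and the non-capturing form is `(?: ... )`.* 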
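
The group behaviour of the pattern can be checked directly. The sketch below, an addition to the dialog, counts the capturing groups with the compiled pattern's `groups` attribute and compares `re.findall` on the original pattern with a variant whose groups are all written in the non-capturing form `(?: ... )`. The name `non_capturing` is hypothetical.

```python
import re

pattern = r'(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?'

# Same pattern with every group rewritten as non-capturing: (?: ... )
non_capturing = r'(?:\d{1,2}\s*[\-\.]?\s*)?(?:\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(?:\s*(?:ext|x|X|EXT|#)\.?\s*\d{1,5})?'

print(re.compile(pattern).groups)         # 4 capturing groups
print(re.compile(non_capturing).groups)   # 0 capturing groups

text = '809-555-1219 ext. 1234'
print(re.findall(pattern, text))          # one tuple of group contents per match
print(re.findall(non_capturing, text))    # the whole matched text per match
```

The difference does not affect `re.sub`, which replaces the whole match either way, but it matters as soon as the same pattern is reused with `re.findall` or with group references.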



---



##6. Reference Sources    

#####Prof Patrick request   

Is there a common source for that type of phone number matching regex pattern and if there is, what are the most frequent or most reliable sources of that regex pattern?

#####GPT4 response

There is no single common source for a phone number matching regex pattern, as phone number formats vary across countries and regions. However, there are several resources and references that can provide guidance on constructing regex patterns for phone numbers. Some of the most frequent or reliable sources include:    
    
1. Official documentation or standards: For example, the International Telecommunication Union (ITU) and the North American Numbering Plan Administration (NANPA) provide guidelines and specifications for international and regional phone number formats.    
    
2. Online regex communities: Websites like Stack Overflow, GitHub, and Reddit have numerous discussions and examples of regex patterns for phone numbers. You can find patterns that others have shared and tested, but it's essential to verify their accuracy and adapt them to your specific use case.    
    
3. Regex libraries and tools: Many programming languages have libraries or tools that offer pre-built regex patterns for common tasks, such as phone number validation. For example, Python has the phonenumbers library, which can parse, format, and validate phone numbers using Google's libphonenumber library.    
    
4. Online regex pattern repositories: Websites like regex101.com, regexlib.com, and regular-expressions.info provide a collection of regex patterns, including phone number patterns, that have been contributed and tested by the community. These patterns can be a helpful starting point, but it's essential to ensure that they meet your specific requirements.    
    

Keep in mind that phone number formats can vary significantly between countries, so it is essential to tailor your regex pattern to the specific formats you expect to encounter in your data. It may be necessary to create multiple regex patterns or use a combination of regex and other validation methods to ensure accurate phone number matching.
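
As one illustration of combining a regex with another validation step, the sketch below (an addition to the dialog, not something GPT4 proposed) reduces a candidate to its digits and applies two North American Numbering Plan rules: ten digits, optionally preceded by the country code `1`, with the area code and the exchange code each starting with a digit from 2 through 9. The function name `looks_like_nanp_number` is hypothetical, the candidate is expected without any extension, and numbers from other countries are out of scope.

```python
import re

def looks_like_nanp_number(candidate):
    """Rough NANP plausibility check for a string that a regex already matched."""
    digits = re.sub(r'\D', '', candidate)        # keep digits only
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]                      # drop the country code 1
    if len(digits) != 10:
        return False
    # Area code and exchange code must each start with 2-9.
    return digits[0] in '23456789' and digits[3] in '23456789'

for candidate in ['212-998-0902', '1-805-555-1215', '124-512-5555', '555-1212']:
    print(candidate, looks_like_nanp_number(candidate))
```

By these rules `124-512-5555` from the sample text is rejected because its area code starts with 1, so a check like this suits validation rather than the masking task in the assignment.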



---



##7. Wrap-up    

#### Attribution    

#####Prof Patrick request    

I would like to share this information in a brief article for medium.com.  Is that okay?  Is there any citation or attribution that I should consider including? 

#####GPT4 response    

As an AI language model, I'm here to help you with any information or assistance you need. You are welcome to use the information I've provided in your Medium article or any other form of sharing. Since the information comes from a public AI model and not a specific individual, there is no strict requirement for citation or attribution.    
    
However, if you would like to mention the source of the information, you can attribute it to OpenAI's GPT-4 model or simply mention that you received assistance from an AI language model. This could help your readers understand where the information originated and provide context for any limitations that might arise due to the knowledge cutoff date (September 2021) of the AI model.    
    
For example, you could include a statement like this in your article:    
    
"*The information and explanations in this article were provided by OpenAI's GPT-4 AI language model.*"    
    
This type of attribution can help create transparency and give proper credit to the source of the information.    
    

#####Prof Patrick reply    

Thank you

#####GPT4 response    

You're welcome! If you have any more questions or need further assistance, please don't hesitate to ask. Good luck with your Medium article!    


####Errata    

#####Prof Patrick request    

Let's go back to the first regular expression pattern that you used.  Am I correct in assuming that the first regular expression pattern was:    
```
r'(\d{1,2}\s*[-.]?\s*)?(?\d3?\d3?[-.\s]?)\d{3}[-.]?\d{4}(\s*(ext|x|X|EXT|#).?\s*\d{1,5})?'  
```    

#####GPT4 response    

Apologies for any confusion earlier. The first regular expression pattern I provided in the original response was:    

```    
(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?

```    

I'll break it down for you again:    
    
1. <b>(\d{1,2}\s*[\-\.]?\s*)?</b>: Optional country or area code prefix (1 or 2 digits).    
2. <b>(\(?\d{3}\)?[\-\.\s]?)</b>: Area code (3 digits) with optional parentheses and followed by an optional hyphen, dot, or whitespace.    
3. <b>\d{3}</b>: First 3 digits of the local number.    
4. <b>[\-\.]?</b>: Optional hyphen or dot separator.    
5. <b>\d{4}</b>: Last 4 digits of the local number.    
6. <b>(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?</b>: Optional extension number with varying notations.    

Please note that the regex pattern you mentioned in the question contains a typo (a missing backslash before the question mark), which might have caused the confusion. The correct pattern is provided above.

#####Prof Patrick     

*It seems that GPT4 had it right from the start.  The likely cause is a formatting glitch when the first solution was sent to my browser.  GPT4's second solution was displayed in a code window with a button to copy the code, whereas the first attempt was presented as `HTML` text.  I copied and pasted that text into this notebook, but some characters appear to have been lost or altered in the process, as the following comparison of `str1` (the first `regex` pattern I used) with `str2` (the pattern from the second attempt) demonstrates.*    
          

In [None]:
# str1 with the first `regex` pattern used (from a GPT4 text window copy/paste)
# raw strings (r"...") keep the backslashes literal and avoid invalid-escape warnings
str1 = r"(\d{1,2}\s*[-.]?\s*)?(?\d3?\d3?[-.\s]?)\d{3}[-.]?\d{4}(\s*(ext|x|X|EXT|#).?\s*\d{1,5})?"
# str2 with the final `regex` pattern used (using the `copy` button on a code window in GPT4)
str2 = r"(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?"

if str1 == str2: 
    print('Same string')
else:
    print('Not the same string')

Not the same string
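
To see where the two strings differ, rather than only that they differ, the standard library's `difflib.SequenceMatcher` can list the edits that turn `str1` into `str2`. This is a minimal sketch added to the notebook; apart from `str1` and `str2`, the variable names are hypothetical.

```python
import difflib

str1 = r"(\d{1,2}\s*[-.]?\s*)?(?\d3?\d3?[-.\s]?)\d{3}[-.]?\d{4}(\s*(ext|x|X|EXT|#).?\s*\d{1,5})?"
str2 = r"(\d{1,2}\s*[\-\.]?\s*)?(\(?\d{3}\)?[\-\.\s]?)\d{3}[\-\.]?\d{4}(\s*(ext|x|X|EXT|#)\.?\s*\d{1,5})?"

# autojunk=False keeps the matcher from ignoring frequently repeated characters.
matcher = difflib.SequenceMatcher(None, str1, str2, autojunk=False)

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != 'equal':
        # Show what str1 has at this spot and what str2 has instead.
        print(f'{tag:8} str1[{i1}:{i2}]={str1[i1:i2]!r:10} str2[{j1}:{j2}]={str2[j1:j2]!r}')
```

Each printed row is an `insert`, `delete`, or `replace` operation, so the rows show which backslashes, braces, and parentheses are present in `str2` but not in `str1`.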




---

