<img src="images/lect-13.png" height=1000px width=1000px>


<h1 align="center" style="color:red">Regular Expressions </h1>

## Regular Expressions 
https://docs.python.org/3/howto/regex.html#regex-howto

https://docs.python.org/3/library/re.html

## Baseline:
- The concept of regular expressions began in the 1950s, when the American mathematician `Stephen Cole Kleene` formalized the concept of a regular language. 
- Today, regular expressions are used in day-to-day Data Science tasks such as:
 >- Simple pattern matching
 >- Find-and-replace operations in editors such as Sublime Text, Notepad++, and Atom
 >- Information extraction
 >- Web scraping
 >- Text mining (also known as text data mining: the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. By applying analytical techniques such as Naïve Bayes, Support Vector Machines (SVM), and deep learning algorithms, companies are able to explore and discover hidden relationships within their unstructured data.)
 >- Natural language processing

<h1 align="center">A Gentle Introduction to Regular Expressions (Regex)</h1> <br><br>

<img align="center" width="800" height="800"  src="images/re.jpeg"  >
<img align="center" width="500" height="500"  src="images/tm.jpg"  >

<br><br><br><br><br><br><br><br><br>

# <div style="text-align: center; color: red;"> <b> Medical Data Extraction Challenges  </b> </div>
<div style="text-align: center; color: red;">
    <b>
        I'm working on a large-scale data extraction project where I need to parse complex medical records to extract specific patient information such as names, dates of birth, medical conditions, and treatment details. What are the most efficient methods and best practices to handle this type of data extraction in Python?
    </b>
</div>


In [None]:
# Data

patient_records = """
Patient Records:
----------------

Patient ID: 001
Patient Name: John Doe
Date of Birth: 1980-05-25
Medical Condition: Hypertension, Diabetes
Treatment Details: Prescribed medications: Metoprolol, Insulin
Next Appointment: 2023-10-25

Patient ID: 002
Patient Name: Jane Smith
Date of Birth: 1975-08-12
Medical Condition: Asthma, Allergies
Treatment Details: Inhaler, Antihistamines
Next Appointment: 2023-11-05

Patient ID: 003
Patient Name: Michael Johnson
Date of Birth: 1992-11-18
Medical Condition: Migraine, Depression
Treatment Details: Therapy sessions, Sertraline
Next Appointment: 2023-10-28
"""


### First Solution

In [None]:
patient_data = []
records = patient_records.split('Patient ID')[1:]

for record in records:
    data = {}
    data['Patient Name'] = record.split('Patient Name:')[1].split('\n')[0].strip()
    data['Date of Birth'] = record.split('Date of Birth:')[1].split('\n')[0].strip()
    data['Medical Condition'] = record.split('Medical Condition:')[1].split('\n')[0].strip()
    data['Treatment Details'] = record.split('Treatment Details:')[1].split('\n')[0].strip()
    patient_data.append(data)

for idx, patient in enumerate(patient_data, start=1):
    print(f"Patient {idx} Data:")
    for key, value in patient.items():
        print(f"{key}: {value}")
    print()

### Second Solution

In [None]:
import re
pattern = r"Patient ID: (\d+)\nPatient Name: (.*?)\nDate of Birth: (.*?)\nMedical Condition: (.*?)\nTreatment Details: (.*?)\nNext Appointment: (\d{4}-\d{2}-\d{2})"
matches = re.findall(pattern, patient_records, re.DOTALL)
for idx, match in enumerate(matches, start=1):
    print(f"Patient {idx} Data:")
    print("Patient ID:", match[0])
    print("Patient Name:", match[1])
    print("Date of Birth:", match[2])
    print("Medical Condition:", match[3])
    print("Treatment Details:", match[4])
    print("Next Appointment:", match[5])
    print()


# Learning Agenda
**PART-I:**
1. A gentle introduction to Regular Expressions
2. Overview of Regex Metacharacters, Anchors, Quantifiers, Escape Codes and Grouping Constructs
3. Overview of regex101
4. A Step by Step hands-on practical understanding of REs on regex101.com
5. Practical Use Cases
    - Identify valid phone numbers
    - Identify/locate valid names or city codes
    - Identify valid email addresses
    - Identify valid URLs
6. Substitution and Replacement

<br><br>**PART-II:**

Regular Expressions in Python

## Wild Card / Meta Characters
Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression. Some commonly used wild cards or meta characters are listed below:


| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for the string that ends with s such as geeks, ends, s, etc.<br>- `ing$` will check for the string that ends with ing such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except a newline <br> - `a.b` will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to refer to a special sequence. <br> - Since dot `(.)` is a metacharacter, if you want to search for it in a string you have to put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning. 
| **[...]** |Matches a single character in the listed set. If a caret is the first character inside the set, it negates the set<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any character except 0, 1, 2, or 3<br>- `[^a-c]` means any character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any single parenthesis, square bracket, or brace.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |Or symbol works as the OR operator meaning it checks whether the pattern before or after the or symbol is present in the string or not<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal '\|', use `\|`, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Used to capture and group
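
As a quick illustration of the table above, the following sketch tries a few of these metacharacters with Python's `re` module. The word list and sample strings are made up for this example and are not part of the lecture data.

```python
import re

words = ["baba", "simple", "going", "a.b", "acb", "cab"]

# ^b : strings that start with 'b'
starts_with_b = [w for w in words if re.search(r"^b", w)]
# ing$ : strings that end with 'ing'
ends_with_ing = [w for w in words if re.search(r"ing$", w)]
# a.b : 'a', then any single character except newline, then 'b'
any_middle = [w for w in words if re.search(r"a.b", w)]
# a\.b : the backslash makes the dot literal
literal_dot = [w for w in words if re.search(r"a\.b", w)]
# [^a-c] : any single character other than a, b, or c
not_abc = re.findall(r"[^a-c]", "abcdef")
# (c|m)at : either alternative; the parentheses capture the chosen letter
first_letters = re.findall(r"(c|m)at", "cat mat bat")

print(starts_with_b)   # ['baba']
print(ends_with_ing)   # ['going']
print(any_middle)      # ['a.b', 'acb']
print(literal_dot)     # ['a.b']
print(not_abc)         # ['d', 'e', 'f']
print(first_letters)   # ['c', 'm']
```

Note that `a.b` also accepts the literal string `a.b`, because an unescaped dot matches any character, including a dot.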




## Quantifiers
- A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed. The quantifiers are `*`, `+`, `?`, `{m}`, and `{m,n}`. When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match. In short, quantifiers let a pattern match a run of several characters.

| Wild Card | Description         
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times, <br>- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b. Similarly, it will not be matched for abdc because b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both the cases.
| **{n}**   |The preceding character/expression is repeated n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.           
| **{n,}**  |The preceding character/expression is repeated at least n times 
| **{,m}**  |The preceding character/expression is repeated up to m times
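
The quantifier examples from the table can be checked with a short sketch. It uses `re.fullmatch`, so each sample string must match the pattern from start to end; the sample strings are chosen for illustration only.

```python
import re

samples = ["ac", "abc", "abbc", "abdc"]

# ab*c : 'a', zero or more 'b', then 'c'
star = [s for s in samples if re.fullmatch(r"ab*c", s)]
# ab+c : 'a', one or more 'b', then 'c'
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]
# ab?c : 'a', at most one 'b', then 'c'
optional = [s for s in samples if re.fullmatch(r"ab?c", s)]
# a{2,4} : two to four consecutive 'a' characters, as many as possible (greedy)
runs = re.findall(r"a{2,4}", "a aa aaa aaaaa")

print(star)      # ['ac', 'abc', 'abbc']
print(plus)      # ['abc', 'abbc']
print(optional)  # ['ac', 'abc']
print(runs)      # ['aa', 'aaa', 'aaaa']
```

In the last line, the run of five `a` characters yields `aaaa` because `{2,4}` stops at four, and the single `a` left over is too short to match again.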

## Escape Codes
- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. 
- The following list of special sequences isn’t complete.

| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. This is equivalent to [0-9]                              
| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d]                           
| **\s** |Matches any whitespace character. For ASCII text this is equivalent to [ \t\n\r\f\v]                
| **\S** |Matches any non-whitespace character. This is equivalent to [^ \t\n\r\f\v] or [^\s]                         
| **\w** |Matches any alphanumeric character or underscore. For ASCII text this is equivalent to [a-zA-Z0-9_]                  
| **\W** |Matches any character that is not alphanumeric or an underscore. This is equivalent to [^a-zA-Z0-9_] or [^\w]                  
| **\b** |Matches the empty string at the beginning or end of a word, so r"\bain" finds "ain" at the start of a word and r"ain\b" finds it at the end
| **\B** |Matches the empty string where there is NOT a word boundary, so r"\Bain" finds "ain" that is not at the start of a word and r"ain\B" finds "ain" that is not at the end
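
A small sketch of these escape codes in action. Both sample strings are invented for the example; the second one is built around the `ain` pattern used in the table.

```python
import re

line = "Call 615-555-7164 now"
digit_runs = re.findall(r"\d+", line)       # runs of digits
non_digit_runs = re.findall(r"\D+", line)   # everything between the digit runs
word_runs = re.findall(r"\w+", line)        # runs of letters, digits, underscore
spaces = re.findall(r"\s", line)            # individual whitespace characters

rhyme = "ain rain Spain ainu"
starts_with_ain = re.findall(r"\bain\w*", rhyme)   # words that start with 'ain'
ends_with_ain = re.findall(r"\w*ain\b", rhyme)     # words that end with 'ain'
inside_count = len(re.findall(r"\Bain", rhyme))    # 'ain' NOT at the start of a word

print(digit_runs)       # ['615', '555', '7164']
print(non_digit_runs)   # ['Call ', '-', '-', ' now']
print(word_runs)        # ['Call', '615', '555', '7164', 'now']
print(spaces)           # [' ', ' ']
print(starts_with_ain)  # ['ain', 'ainu']
print(ends_with_ain)    # ['ain', 'rain', 'Spain']
print(inside_count)     # 2
```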

##  Practice Regular Expressions


[Visit regex101](https://regex101.com/)


- There are several good online tools that let you write and test almost any regular expression, with color coding, pattern explanation, substitution, and more.
- Regex101 is a popular choice because it is feature-rich and supports several regex flavors. 

#### We will perform the following tasks/activities in regex101.
- Match any single character with `.`
- Match several characters with multiple dots `....`
- Match a literal dot with `\.`
- To search for a literal `\`, use a double backslash `\\`
- To search for `*`, use `\*`; other metacharacters are escaped the same way.
- To search for a single digit such as 1, 2, 3, ..., use `\d`.
- To search for non-digit characters, use `\D`.
- To search for a word boundary, use `\b`, as in `Ha\b`. We can also put a boundary on both sides, as in `\bHa\b`.
- Use the caret symbol, as in `^Ha`, to match a string that starts with `Ha`.
- Search for a valid number using `\d`.
- Search for all valid numbers using `[.-]`, as in `\d\d\d[.-]`
- Search for a range using `[A-Z]` or `[a-z]` (note that `[A-z]` also matches the punctuation characters that lie between `Z` and `a` in ASCII). Note the difference between a `-` placed between two characters (a range) and a `-` at the start or end of the set (a literal hyphen).
- What does `[^0-9]` return? What is the difference between `[0-9]` and `[^0-9]`?
- Find all the words that end with `at` but do not start with `b`.
- Select all the valid hexadecimal numbers using a quantifier: `0[xX][0-9a-fA-F]+\b`
- What is the difference between `\d\d\d[.-]\d\d\d[.-]\d\d\d\d` and `\d{3}[.-]\d{3}[.-]\d{4}`?
- In `M[sr]s?`, the `?` makes the preceding `s` optional.
- Select all the valid names from the given text.

abcdefghijklmnopqrstuvwxyz     
ABCDEFGHIJKLMNOPQRSTUVWXYZ    
1234567890    
Ha HaHa      
MetaCharacters (Need to be escaped):     
.[{()\^$|?*+     
helloworld.me      
321-555-4321         
123.555.1234       
111#923#9234        
cat          
mat        
bat        
0x45         
0X4Ad         
0x2g3       
0x349ABf         
0x      
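
The same experiments can be reproduced outside regex101. The sketch below copies a few lines of the practice text into a Python string (an assumption about how you might load it) and runs three of the patterns from the task list.

```python
import re

sample = """321-555-4321
123.555.1234
111#923#9234
cat
mat
bat
0x45
0X4Ad
0x2g3
0x349ABf
0x"""

# The two phone-number patterns are equivalent: {3} just means "repeat three times"
long_form = re.findall(r"\d\d\d[.-]\d\d\d[.-]\d\d\d\d", sample)
short_form = re.findall(r"\d{3}[.-]\d{3}[.-]\d{4}", sample)

# Valid hexadecimal literals: 0x or 0X, one or more hex digits, then a word boundary
hex_numbers = re.findall(r"0[xX][0-9a-fA-F]+\b", sample)

# Three-letter words ending in 'at' whose first letter is not 'b'
at_words = re.findall(r"\b[^b\s]at\b", sample)

print(long_form)    # ['321-555-4321', '123.555.1234']
print(short_form)   # ['321-555-4321', '123.555.1234']
print(hex_numbers)  # ['0x45', '0X4Ad', '0x349ABf']
print(at_words)     # ['cat', 'mat']
```

`0x2g3` is rejected because `g` is not a hex digit and there is no word boundary between `2` and `g`; the bare `0x` is rejected because `+` needs at least one hex digit.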

#### Select only valid names

Hello World   
Mr. Ehtisham    
Mr Tayyab   
Ms Sonia   
Mrs. Ayesha   
Mr. B   
Learning is Data Science
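
One possible pattern for this task builds on `M[sr]s?` from the task list. The rest of the pattern (optional dot, a space, a capitalised word) is an assumption about what counts as a "valid name" here, not the only correct answer.

```python
import re

lines = [
    "Hello World",
    "Mr. Ehtisham",
    "Mr Tayyab",
    "Ms Sonia",
    "Mrs. Ayesha",
    "Mr. B",
    "Learning is Data Science",
]

# Title (Mr, Ms, Mrs), optional dot, one whitespace, then a capitalised word
name_re = re.compile(r"M[sr]s?\.?\s[A-Z]\w*")

valid_names = [ln for ln in lines if name_re.fullmatch(ln)]
print(valid_names)
# ['Mr. Ehtisham', 'Mr Tayyab', 'Ms Sonia', 'Mrs. Ayesha', 'Mr. B']
```

Because `[sr]` followed by `s?` is fairly permissive, this pattern would also accept a title such as `Mss`; a stricter alternative is `(?:Mr|Ms|Mrs)`.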

#### Check validity of emails

- The regex `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9]+\.[a-zA-Z0-9.-]+` can be used to select the valid emails listed below.

**List of Valid Email Addresses**

ehtisham@pucit.edu.pk     
ehtisham.ds@pu.edu.pk        
ehtishampucit@gmail.com        
ehtisham.pucit@pu.edu.pk          
first+123.5@example.com          
abc%xyz@subdomain.example.com          
my_name@example.com          
first-last@example.com     


**List of Invalid Email Addresses**   

#@%^%#$@#$@#.com     
abc.def@mail      
abc.def@mail#archive.com       
@example.com       
ehtisham sadiq @example.com       
Tayyab#@gmail.com      
Abc.example.com       
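
To check the email pattern against both lists in Python, a minimal sketch could look like this. It uses `fullmatch`, which is an assumption about how the pattern should be applied: the whole string must be an email, not merely contain one.

```python
import re

email_re = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9]+\.[a-zA-Z0-9.-]+")

valid = [
    "ehtisham@pucit.edu.pk",
    "ehtisham.ds@pu.edu.pk",
    "ehtishampucit@gmail.com",
    "ehtisham.pucit@pu.edu.pk",
    "first+123.5@example.com",
    "abc%xyz@subdomain.example.com",
    "my_name@example.com",
    "first-last@example.com",
]
invalid = [
    "#@%^%#$@#$@#.com",
    "abc.def@mail",
    "abc.def@mail#archive.com",
    "@example.com",
    "ehtisham sadiq @example.com",
    "Tayyab#@gmail.com",
    "Abc.example.com",
]

# Keep only the addresses that the pattern accepts from start to end
accepted = [e for e in valid + invalid if email_re.fullmatch(e)]
print(accepted == valid)   # True: every valid address passes, every invalid one fails
```

This pattern is a teaching simplification; for instance, it does not allow a hyphen in the first label of the domain.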

**Select valid URL**
 
- https://www.google.com
- http://ehtisham.me
- https://youtube.com
- https://www.yahoo.com
- http://facebook.com
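
The notes do not fix a URL pattern at this point, so the following is only one possible sketch. The pattern, the capturing group for the domain, and the `bad_urls` list are all assumptions introduced for illustration.

```python
import re

urls = [
    "https://www.google.com",
    "http://ehtisham.me",
    "https://youtube.com",
    "https://www.yahoo.com",
    "http://facebook.com",
]
# Hypothetical malformed inputs, added only to show what gets rejected
bad_urls = ["htp://google.com", "https//yahoo.com", "http://facebook"]

# http or https, optional "www.", then at least two dot-separated labels.
# Group 1 captures the domain without the optional "www." prefix.
url_re = re.compile(r"https?://(?:www\.)?([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)")

domains = [url_re.fullmatch(u).group(1) for u in urls]
rejected = [u for u in bad_urls if url_re.fullmatch(u) is None]

print(domains)   # ['google.com', 'ehtisham.me', 'youtube.com', 'yahoo.com', 'facebook.com']
print(rejected)  # ['htp://google.com', 'https//yahoo.com', 'http://facebook']
```

Real URLs can also carry ports, paths, and query strings, which this simplified pattern deliberately ignores.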

## Practice Questions:

### Example 1: Write a regular expression to find every digit inside a string.
- Input : "My roll number is 25"
- Output : ['2', '5']

In [None]:
import re
targetString = "My roll number is 25"
reg = r"\d"
result = re.findall(reg,targetString)
result

### Write a Python program that matches a string that has an a followed by zero or more b's.

In [None]:
# write your answer

### Write a Python program that matches a string that has an a followed by zero or one 'b'.

Hint :         patterns = 'ab?'

### Write a Python program to find sequences of lowercase letters joined with an underscore.

Hint : ^[a-z]+_[a-z]+$

### Write a Python program that matches a word at the beginning of a string.

In [None]:
# import re
# def text_match(text):
#     pattern = r"^\w+"
#     if re.search(pattern,text):
#         print("Text Found!")
#     else:
#         print("Text not found")
        
# text_match("The quick brown fox jumps over the lazy dog.")
# text_match(" The quick brown fox jumps over the lazy dog.")

### Write a Python program that matches a word at the end of a string, with optional punctuation.

In [None]:
# import re
# def text_match(text):
#         patterns = r'\w+\S*$'
#         if re.search(patterns,  text):
#                 return 'Found a match!'
#         else:
#                 return('Not matched!')

# print(text_match("The quick brown fox jumps over the lazy dog."))
# print(text_match("The quick brown fox jumps over the lazy dog. "))
# print(text_match("The quick brown fox jumps over the lazy dog "))

### Write a Python program that matches a word containing 'z'.

In [None]:
# import re
# def text_match(text):
#         patterns = r'\w*z\w*'
#         if re.search(patterns,  text):
#                 return 'Found a match!'
#         else:
#                 return('Not matched!')

# print(text_match("The quick brown fox jumps over the lazy dog."))
# print(text_match("Python Exercises."))

### Write a Python program to match a string that contains only upper and lowercase letters, numbers, and underscores.


In [None]:
# import re
# def text_match(text):
#         patterns = '^[a-zA-Z0-9_]*$'
#         if re.search(patterns,  text):
#                 return 'Found a match!'
#         else:
#                 return('Not matched!')

# print(text_match("The quick brown fox jumps over the lazy dog."))
# print(text_match("Python_Exercises_1"))

In [None]:
# import re
# def match_num(string):
#     text = re.compile(r"^5")
#     if text.match(string):
#         return True
#     else:
#         return False
# print(match_num('5-2345861'))
# print(match_num('6-2345861'))

<h2 align="center">  Regular Expressions Part-II </h2>

<img align="left" width="600" height="600"  src="images/re.jpeg"  >
<img align="right" width="300" height="300"  src="images/tm.jpg"  >

## Learning Agenda
2. The Python `re` Module
    1. The `re.compile()` Method
    2. The `re.Pattern.search()` Method
    3. The `re.Pattern.match()` Method
    4. The `re.Pattern.findall()` Method
    5. The `re.Pattern.finditer()` Method
3. Practical Example
    1. Extracting Names
    2. Extracting Dates of Birth
    3. Extracting Emails and Usernames
    4. Extracting Valid Cell Phone Numbers
    5. Extracting Domain names from URLs
4. Modifying Strings
    1. The `re.Pattern.split()` Method
    2. The `re.Pattern.sub()`  Method
    3. The `re.Pattern.subn()` Method

### a. Wild Card / Meta Characters
Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression. Some commonly used wild cards or meta characters are listed below:


| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for the string that ends with s such as geeks, ends, s, etc.<br>- `ing$` will check for the string that ends with ing such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except a newline <br> - `a.b` will check for the string that contains any character at the place of the dot such as acb, acbd, abbb, etc<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to refer to a special sequence. <br> - Since dot `(.)` is a metacharacter, if you want to search for it in a string you have to put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning. 
| **[...]** |Matches a single character in the listed set. If a caret is the first character inside the set, it negates the set<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any character except 0, 1, 2, or 3<br>- `[^a-c]` means any character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any single parenthesis, square bracket, or brace.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |Or symbol works as the OR operator meaning it checks whether the pattern before or after the or symbol is present in the string or not<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal '\|', use `\|`, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Used to capture and group

### b. Quantifiers
- A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed. The quantifiers are `*`, `+`, `?`, `{m}`, and `{m,n}`. When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match. In short, quantifiers let a pattern match a run of several characters.

| Wild Card | Description         
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times, <br>- `ab+c` will be matched for the string abc, abbc, dabc, but will not be matched for ac, abdc because there is no b in ac and b is not followed by c in abdc.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will be matched for the string ac, abc, acb, dabc, dac but will not be matched for abbc because there are two b. Similarly, it will not be matched for abdc because b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will be matched for the string aaab, baaaac, gaad, but will not be matched for strings like abc, bc because there is only one a or no a in both the cases.
| **{n}**   |The preceding character/expression is repeated n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.           
| **{n,}**  |The preceding character/expression is repeated at least n times 
| **{,m}**  |The preceding character/expression is repeated up to m times
    
    
Note: The repeat characters (`*` and `+`) perform a greedy search to match the largest possible string. However, you can perform a non-greedy search with `*?` (0 or more characters, non-greedy) and `+?` (1 or more characters, non-greedy).
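
The difference between greedy and non-greedy matching is easiest to see on text with repeated delimiters. A minimal sketch, using a made-up HTML-like string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST '>'
greedy = re.findall(r"<.+>", html)
# Non-greedy: .+? grabs as little as possible, so each tag is matched separately
lazy = re.findall(r"<.+?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```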

### c. Escape Codes
- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. 
- The following list of special sequences isn’t complete.

| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. This is equivalent to [0-9]                              
| **\D** |Matches any non-digit character. This is equivalent to [^0-9] or [^\d]                           
| **\s** |Matches any whitespace character. For ASCII text this is equivalent to [ \t\n\r\f\v]                
| **\S** |Matches any non-whitespace character. This is equivalent to [^ \t\n\r\f\v] or [^\s]                         
| **\w** |Matches any alphanumeric character or underscore. For ASCII text this is equivalent to [a-zA-Z0-9_]                  
| **\W** |Matches any character that is not alphanumeric or an underscore. This is equivalent to [^a-zA-Z0-9_] or [^\w]                  
| **\b** |Matches the empty string at the beginning or end of a word, so r"\bain" finds "ain" at the start of a word and r"ain\b" finds it at the end
| **\B** |Matches the empty string where there is NOT a word boundary, so r"\Bain" finds "ain" that is not at the start of a word and r"ain\B" finds "ain" that is not at the end

## 2. The Python `re` Module

In [1]:
import re
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']


In [6]:
# text = "Today's date is 2023-10-26"
# match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
# print(match.group(0))
# print(match.group(1))
# print(match.group(2))
# print(match.group(3))

### a. The `re.compile()` Method
This method is used to compile a regular expression pattern into a regular expression object, which can be used for matching using its `match()`, `search()` and other methods.

**`re.compile(pattern, flags=0)`**

Where,
   - `pattern` is the regular expression you want to compile, which you will then use to search or modify a string or a corpus of documents.
   - `flags` can be used to modify the expression’s behaviour. Values can be any of the following variables, combined using bitwise OR (the | operator):
        - `IGNORECASE` or `I` to do a case-insensitive search
        - `LOCALE` or `L` to perform a locale-aware match
        - `MULTILINE` or `M` to do multiline matching, affecting `^` and `$`
        - `DOTALL` or `S` to make the '.' special character match any character, including a newline; without this flag, '.' will match anything except a newline.

Once you have a `Pattern` object representing a compiled regular expression, you can use its methods to perform various operations on a string or on a corpus of documents:
- `p.match()`: Determine if the RE matches at the beginning of the string.
- `p.search()`: Scan through a string, looking for any location where this RE matches.
- `p.findall()`: Find all substrings where the RE matches, and returns them as a list.
- `p.finditer()`: Find all substrings where the RE matches, and returns them as an iterator.
- `p.split()`: Used to split string by the occurrences of pattern.
- `p.sub()`: Used to find and replace matches.

**Note:** Creating a regex object with `re.compile` is recommended if you intend to apply the same expression to many strings; reusing the compiled object avoids looking up or recompiling the pattern on every call.
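
As a small sketch of how flags are combined and how one compiled object is reused, consider the following. The sample strings are invented for the example.

```python
import re

text = "apple pie\nApricot jam\nbanana split\nAVOCADO toast"

# Combine flags with |: ignore case, and let ^ match at the start of every line
p = re.compile(r"^a\w*", flags=re.IGNORECASE | re.MULTILINE)
all_lines = p.findall(text)

# The same compiled object can be reused on other strings
reused = p.findall("Almond\nwalnut\nacorn")

# Without MULTILINE, ^ only matches at the very start of the string
first_only = re.findall(r"^a\w*", text, flags=re.IGNORECASE)

print(all_lines)   # ['apple', 'Apricot', 'AVOCADO']
print(reused)      # ['Almond', 'acorn']
print(first_only)  # ['apple']
```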

In [None]:
import re
# The regular expression looks for one or more uppercase 'A' characters followed by
# zero or more lowercase letters; re.M only affects ^ and $, so it changes nothing here
p = re.compile(r"[A]+[a-z]*", flags=re.M)  

In [None]:
print(p)
print(type(p))

Once you have a pattern object, you can use its methods to search a string.

In [None]:
# p.

### b. The `re.Pattern.search()` Method
- Scan through string looking for the first location where the pattern object `p` produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern.

**`p.search(string, pos=0, endpos=9223372036854775807)`**

- Where,
   - `p` is the compiled pattern object
   - `string` is the test string from which we want to search
   - `pos` and `endpos` can be used to specify the portion of test string from where to search
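
The effect of `pos` and `endpos` can be sketched as follows. The string is a shortened version of the one used in these notes, and the index values are chosen just for illustration.

```python
import re

p = re.compile(r"A[a-z]*")
text = "Ehtisham, Ali, Jamil and Ahmad"

first = p.search(text)              # scans from index 0
later = p.search(text, pos=13)      # starts scanning at index 13, just after 'Ali'
window = p.search(text, endpos=10)  # only looks at text[0:10], which has no 'A'

print(first.group(), first.span())  # Ali (10, 13)
print(later.group(), later.span())  # Ahmad (25, 30)
print(window)                       # None
```

The spans are still reported relative to the full string, not to the searched window.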

In [7]:
import re
# The regular expression looks for one or more uppercase 'A' characters followed by
# zero or more lowercase letters; re.M only affects ^ and $, so it changes nothing here
p = re.compile(r"[A]+[a-z]*", flags=re.M)  

In [8]:
str1 = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ali Sadiq."
match = p.search(str1)

In [9]:
print(match)
print(type(match))


<re.Match object; span=(10, 13), match='Ali'>
<class 're.Match'>


### c. The `re.Pattern.match()` Method
- Looks for the pattern at the beginning of the string and, if found, returns a corresponding match object. Returns None if the string does not match the pattern.

**`p.match(string, pos=0, endpos=9223372036854775807)`**

- Where,
   - `p` is the compiled pattern object
   - `string` is the test string from which we want to search
   - `pos` and `endpos` can be used to specify the portion of test string from where to search

Note: 
- Even in MULTILINE mode, `re.match()` will only match at the beginning of the string and not at the beginning of each line.
- If you want to locate a match anywhere in string, use `search()` instead 
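
The note about MULTILINE mode can be verified with a two-line string (made up for this sketch):

```python
import re

text = "first line\nAli is on the second line"
p = re.compile(r"^Ali", flags=re.MULTILINE)

at_start = p.match(text)    # match() only tries position 0 of the string
anywhere = p.search(text)   # search() can use ^ at the start of the second line

print(at_start)         # None
print(anywhere.span())  # (11, 14)
```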

In [10]:
import re
str1 = "Ehtisham, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ali Sadiq."
str2 = "Mr. Ehtisham, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ali Sadiq."

In [14]:
# The regular expression looks for one or more uppercase 'E' characters followed by
# zero or more lowercase letters; re.M only affects ^ and $, so it changes nothing here
p = re.compile(r"[E]+[a-z]*", flags=re.M)  

rv = p.match(str1)
print(rv)
print(type(rv))

<re.Match object; span=(0, 8), match='Ehtisham'>
<class 're.Match'>


In [15]:
# The regular expression looks for one or more uppercase 'E' characters followed by
# zero or more lowercase letters; re.M only affects ^ and $, so it changes nothing here
p = re.compile(r"[E]+[a-z]*", flags=re.M)  

rv = p.match(str2)
print(rv)
print(type(rv))

None
<class 'NoneType'>


### d. The `re.Pattern.findall()` Method
- The `search()` only returns the first match, `match()` only matches at the beginning of the string, while `findall()`  returns all matches in a string.
- Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
- If pattern `p` does not match, it returns an empty list.

**`p.findall(string, pos=0, endpos=9223372036854775807)`**

- Where,
   - `p` is the compiled pattern object
   - `string` is the test string from which we want to search
   - `pos` and `endpos` can be used to specify the portion of test string from where to search
   
Note: The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
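
The way capturing groups change the shape of the `findall()` result can be sketched like this. The text is a made-up line in the style of the patient records above.

```python
import re

text = "DOB: 1980-05-25, next visit: 2023-10-25"

# No groups: a list of whole matches
no_groups = re.findall(r"\d{4}-\d{2}-\d{2}", text)
# Exactly one group: a list of that group's text only
one_group = re.findall(r"(\d{4})-\d{2}-\d{2}", text)
# Several groups: a list of tuples, one tuple per match
many_groups = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
# A non-capturing group (?:...) does not change the result shape
non_capturing = re.findall(r"(?:\d{4})-\d{2}-\d{2}", text)

print(no_groups)      # ['1980-05-25', '2023-10-25']
print(one_group)      # ['1980', '2023']
print(many_groups)    # [('1980', '05', '25'), ('2023', '10', '25')]
print(non_capturing)  # ['1980-05-25', '2023-10-25']
```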

In [16]:
import re
str1 = "Ehtisham, Jamil and Ahmad are good at playing acrobatic games. AAA is triple As. Ali Sadiq."
# The regular expression looks for one or more uppercase 'A' characters followed by
# zero or more lowercase letters; re.M only affects ^ and $, so it changes nothing here
p = re.compile(r"[A]+[a-z]*", flags=re.M)  


rv = p.findall(str1)
print(rv)
print(type(rv))

['Ahmad', 'AAA', 'As', 'Ali']
<class 'list'>


### e. The `re.Pattern.finditer()` Method
- Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.

**`p.finditer(string, pos=0, endpos=9223372036854775807)`**

- Where,
   - `p` is the compiled pattern object
   - `string` is the test string from which we want to search
   - `pos` and `endpos` can be used to specify the portion of test string from where to search
   

In [18]:
import re
str1 = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."

# The regular expression looks for one or more uppercase 'A' characters followed by
# zero or more lowercase letters; re.M only affects ^ and $, so it changes nothing here
p = re.compile(r"[A]+[a-z]*", flags=re.M)  

# matches = p.finditer(str1)
# print(matches)
# print(type(matches))


rv = p.finditer(str1)
print(rv)
print(type(rv))

<callable_iterator object at 0x7f122bafff40>
<class 'callable_iterator'>


>- **Once we have an iterator of `Match` objects, we can iterate over it using a `for` loop.**
>- **Let us see how many match objects there are in this iterator named `rv`.**

In [19]:
for i in rv:
    print(i)
    print(type(i))

<re.Match object; span=(10, 13), match='Ali'>
<class 're.Match'>
<re.Match object; span=(25, 30), match='Ahmad'>
<class 're.Match'>
<re.Match object; span=(69, 72), match='AAA'>
<class 're.Match'>
<re.Match object; span=(83, 85), match='As'>
<class 're.Match'>


>- **Every match object has many associated methods.**
>- **Let us see different attributes of each match object using these methods.**

The **`group()`** method of the match object returns one or more subgroups of the match (if they exist). By default, it returns the entire match.

In [20]:
import re
str1 = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."

p = re.compile(r"[A]+[a-z]*")  
matches = p.finditer(str1)

for m in matches:
    print(m.group(0))

Ali
Ahmad
AAA
As
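
The cell above only prints `group(0)`. When a pattern contains capturing groups, `group()` can also return the individual subgroups. A minimal sketch, using a made-up date string and the hypothetical group names `year`, `month`, and `day`:

```python
import re

text = "Today's date is 2023-10-26"
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)

print(m.group(0))        # the entire match: 2023-10-26
print(m.group(1))        # the first group: 2023
print(m.group("month"))  # a named group can be fetched by name: 10
print(m.group(1, 3))     # several groups at once: ('2023', '26')
print(m.groups())        # all groups as a tuple: ('2023', '10', '26')
print(m.groupdict())     # named groups as a dict
```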


The **`span()`** method of the match object returns a 2-tuple containing the start and end index of the match (end index not inclusive).

In [21]:
import re
str1 = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."

p = re.compile(r"[A]+[a-z]*")  
matches = p.finditer(str1)

for m in matches:
    print(m.group())
    print(m.span())

Ali
(10, 13)
Ahmad
(25, 30)
AAA
(69, 72)
As
(83, 85)


The **`start(group=0)`** method of the match object returns the index of the start of the substring matched by the group.

In [22]:
import re
str1 = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."
str2 = "Mr. Ehtisham, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."
p = re.compile(r"[A]+[a-z]*")  
matches = p.finditer(str1)

for m in matches:
    print(m.start())

10
25
69
83


The **`end(group=0)`** method of the match object returns the index of the end of the substring matched by `group` (not inclusive).

In [23]:
import re
str1 = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."
str2 = "Mr. Ehtisham, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Ehtisham Sadiq."
p = re.compile(r"[A]+[a-z]*")  
matches = p.finditer(str1)

for m in matches:
    print(m.span())
    print(m.end())

(10, 13)
13
(25, 30)
30
(69, 72)
72
(83, 85)
85


In [None]:
# pattern = re.compile(regex)
# pattern.search(string)   -> scans the whole string, but returns only the first match (or None)
# pattern.match(string)    -> matches only at the beginning of the string (or None)
# pattern.findall(string)  -> returns all the matches in a list (strings, or tuples if the pattern has groups)
# pattern.finditer(string) -> returns an iterator of match objects (use group(), span(), start(), end() on each)
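
As a quick side-by-side sketch of the four methods summarized above, the cell below runs them on a shortened version of the sample sentence. The variable names are illustrative only.

```python
import re

text = "Ehtisham, Ali, Jamil and Ahmad are good at playing acrobatic games."
p = re.compile(r"A[a-z]*")

first = p.search(text)         # first match anywhere in the string
at_start = p.match(text)       # None, because the string does not begin with 'A'
all_strings = p.findall(text)  # plain list of matched strings
spans = [m.span() for m in p.finditer(text)]  # match objects expose positions

print(first.group())   # Ali
print(at_start)        # None
print(all_strings)     # ['Ali', 'Ahmad']
print(spans)           # [(10, 13), (25, 30)]
```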

## 3. Practical Example

**Read a text file**

In [24]:
! cat datasets/names_addresses.txt

Mr. Ehtisham Sadiq
615-555-7164
131 Model Town, Lahore
02-01-2001
ehtishampucit@gmail.com
http://www.ehtisham.me


Mrs. Azka Noreen
317.615.9124
33 Garden Town, Lahore
20/02/2000
azka-123@gmail.com
http://azka.pu.edu.pk



Mr. Ahmed Shahzad
321#521#9254
69, A Wapda Town, Lahore
12.09.2000
Ahmed3@yahoo.com
https://www.Ahmed.pu.edu.pk


Ms Aqsa
123.555.1997
56 Joher Town, Lahore
12/08/2001
aqsa_007@gmail.com
http://youtube.com

Mr. B
321-555-4321
19 Township, Lahore
05-07-2002
mrB@yahoo.com
http://facebook.com

In [25]:
with open("datasets/names_addresses.txt", "r") as fd:
    print(fd.read()) # read() method returns all information as a string

Mr. Ehtisham Sadiq
615-555-7164
131 Model Town, Lahore
02-01-2001
ehtishampucit@gmail.com
http://www.ehtisham.me


Mrs. Azka Noreen
317.615.9124
33 Garden Town, Lahore
20/02/2000
azka-123@gmail.com
http://azka.pu.edu.pk



Mr. Ahmed Shahzad
321#521#9254
69, A Wapda Town, Lahore
12.09.2000
Ahmed3@yahoo.com
https://www.Ahmed.pu.edu.pk


Ms Aqsa
123.555.1997
56 Joher Town, Lahore
12/08/2001
aqsa_007@gmail.com
http://youtube.com

Mr. B
321-555-4321
19 Township, Lahore
05-07-2002
mrB@yahoo.com
http://facebook.com


**Let us read the data from the file in a string**

In [26]:
with open("datasets/names_addresses.txt", "r") as fd:
    teststring = fd.read()
type(teststring)
# print(teststring)

str

### a. Extracting Names
- Assume that every name starts with `Mr`, `Ms` or `Mrs`, followed by an optional dot, a space, and then one or more alphanumeric characters

In [27]:
import re
p = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')  


In [28]:
#teststring contains data read from file
matches = p.finditer(teststring)
for match in matches:
    print(match)

<re.Match object; span=(0, 12), match='Mr. Ehtisham'>
<re.Match object; span=(115, 124), match='Mrs. Azka'>
<re.Match object; span=(223, 232), match='Mr. Ahmed'>
<re.Match object; span=(337, 344), match='Ms Aqsa'>
<re.Match object; span=(430, 435), match='Mr. B'>


In [29]:
p = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group())
    

Mr. Ehtisham
Mrs. Azka
Mr. Ahmed
Ms Aqsa
Mr. B


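What if we want to get the complete name?

In [34]:
# A literal space (not \s) before the second name keeps the match on one line;
# \s would also match the newline after 'Ms Aqsa' and 'Mr. B'.
# The second name is optional, so single-word names still match.
p = re.compile(r'(Mr|Ms|Mrs)\.?\s\w+(?: [A-Za-z]+)?')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group(0))
    

Mr. Ehtisham Sadiq
Mrs. Azka Noreen
Mr. Ahmed Shahzad
Ms Aqsa
Mr. B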



In [37]:
# !cat datasets/names_addresses.txt
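
One possible refinement of the name pattern is to label its parts with named groups, so that the title, first name and optional last name can be read by name instead of by number. The sketch below is self-contained: `sample` is a hand-copied excerpt of the file contents shown earlier, and the group names `title`, `first` and `last` are illustrative choices.

```python
import re

# Hand-copied excerpt of the file contents (assumption: same layout as the file)
sample = "Mr. Ehtisham Sadiq\n615-555-7164\nMs Aqsa\n123.555.1997\n"

p = re.compile(r"(?P<title>Mr|Ms|Mrs)\.?[ ](?P<first>\w+)(?:[ ](?P<last>[A-Za-z]+))?")

people = []
for m in p.finditer(sample):
    # A group that did not take part in the match is returned as None
    people.append((m.group("title"), m.group("first"), m.group("last")))

print(people)
```

A missing last name shows up as `None`, which is easier to handle downstream than an empty line in the output.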

### b. Extracting Dates of Birth
- Assume that the day, month and year have two, two and four digits respectively, and that they are separated by a dot, a hyphen or a slash

In [38]:
# [./-] matches only the allowed separators; an unescaped . would match any character
p = re.compile(r'\d{2}[./-]\d{2}[./-]\d{4}')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group())
    

02-01-2001
20/02/2000
12.09.2000
12/08/2001
05-07-2002


What if we want to get the day, month and year separately? We can wrap each part in a capturing group.

In [39]:
p = re.compile(r'(\d{2})[./-](\d{2})[./-](\d{4})')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group())
    

02-01-2001
20/02/2000
12.09.2000
12/08/2001
05-07-2002


In [40]:
p = re.compile(r'(\d{2})[./-](\d{2})[./-](\d{4})')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group(1))
    

02
20
12
12
05


In [41]:
p = re.compile(r'(\d{2})[./-](\d{2})[./-](\d{4})')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group(2))
    

01
02
09
08
07


In [42]:
p = re.compile(r'(\d{2})[./-](\d{2})[./-](\d{4})')  

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group(3))
    

2001
2000
2000
2001
2002
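
Once the three parts are captured as groups, they can be turned into real date objects in one step. The sketch below assumes the day-month-year order stated above and uses a hand-copied `sample` string instead of the file; `births` is an illustrative name.

```python
import re
from datetime import date

# Hand-copied excerpt of the dates in the file (assumption: day-month-year order)
sample = "02-01-2001\n20/02/2000\n12.09.2000\n"

p = re.compile(r"(\d{2})[./-](\d{2})[./-](\d{4})")

births = []
for m in p.finditer(sample):
    # groups() returns the three captured strings; convert each to int
    day, month, year = (int(g) for g in m.groups())
    births.append(date(year, month, day))

print([d.isoformat() for d in births])  # ['2001-01-02', '2000-02-20', '2000-09-12']
```

`date()` raises a `ValueError` for impossible values such as month 13, so this also acts as a light validity check on what the regex matched.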


### c. Extracting Emails and Usernames
**Valid Name Part:**
- Lowercase letters
- Uppercase letters
- Digits: 0123456789
- Dot: `.` (not as the first or last character)
- For simplicity, assume the only other special characters allowed are `_`, `+` and `-`

**Valid Domain Part:**
- Lowercase letters
- Uppercase letters
- Digits: 0123456789
- Hyphen: `-` (not as the first or last character)
- Can contain an IP address surrounded by square brackets: test@[192.168.2.4] or test@[IPv6:2018:db8::1] (the simple pattern used in this section does not cover this form)

In [43]:
!cat datasets/names_addresses.txt

Mr. Ehtisham Sadiq
615-555-7164
131 Model Town, Lahore
02-01-2001
ehtishampucit@gmail.com
http://www.ehtisham.me


Mrs. Azka Noreen
317.615.9124
33 Garden Town, Lahore
20/02/2000
azka-123@gmail.com
http://azka.pu.edu.pk



Mr. Ahmed Shahzad
321#521#9254
69, A Wapda Town, Lahore
12.09.2000
Ahmed3@yahoo.com
https://www.Ahmed.pu.edu.pk


Ms Aqsa
123.555.1997
56 Joher Town, Lahore
12/08/2001
aqsa_007@gmail.com
http://youtube.com

Mr. B
321-555-4321
19 Township, Lahore
05-07-2002
mrB@yahoo.com
http://facebook.com

In [44]:
pattern = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-z]{2,3}')

#teststring contains data read from file
matches = pattern.finditer(teststring)

for match in matches:
    print(match.group())

ehtishampucit@gmail.com
azka-123@gmail.com
Ahmed3@yahoo.com
aqsa_007@gmail.com
mrB@yahoo.com


What if we want to extract just the usernames? We can put each part of the address in its own group.

In [45]:
p = re.compile(r'([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+)(\.[a-z]{2,3})')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:                          
    print(match.group(1))

ehtishampucit
azka-123
Ahmed3
aqsa_007
mrB


In [46]:
p = re.compile(r'([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+)(\.[a-z]{2,3})')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:                          
    print(match.group(2))

gmail
gmail
yahoo
gmail
yahoo


In [47]:
p = re.compile(r'([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+)(\.[a-z]{2,3})')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:                          
    print(match.group(3))

.com
.com
.com
.com
.com
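
Instead of re-running the loop once per group number, the three parts can be named and collected together. This is a sketch on a hand-copied `sample` string; the group names `user`, `domain` and `tld` are illustrative, and the character classes are the same as in the cells above.

```python
import re

# Hand-copied excerpt of the e-mail addresses in the file
sample = "ehtishampucit@gmail.com\nazka-123@gmail.com\nAhmed3@yahoo.com\n"

p = re.compile(
    r"(?P<user>[a-zA-Z0-9_.+-]+)@(?P<domain>[a-zA-Z0-9-]+)(?P<tld>\.[a-z]{2,3})"
)

# groupdict() returns a dict mapping each group name to its matched text
parts = [m.groupdict() for m in p.finditer(sample)]

for row in parts:
    print(row["user"], row["domain"], row["tld"])
```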


In [49]:
# !cat datasets/names_addresses.txt

### d. Extracting Valid Cell Phone Numbers
Assume that every valid phone number consists of 10 digits in three groups of three, three and four digits. The groups are separated by a `-`, `.`, `/` or `#` symbol.

In [52]:
p = re.compile(r'(\d{3})[./#-](\d{3})[./#-](\d{4})')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group())
    

615-555-7164
317.615.9124
321#521#9254
123.555.1997
321-555-4321


You can extract individual parts of a number, such as city codes or country codes, on your own by creating groups inside your regular expressions.
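
As a sketch of using those groups, the cell below pulls out the first three-digit group of each number and rewrites every number with a single separator. It runs on a hand-copied `sample` string; treating the first group as a "city code" is only an assumption for illustration.

```python
import re

# Hand-copied excerpt of the phone numbers in the file
sample = "615-555-7164\n317.615.9124\n321#521#9254\n"

p = re.compile(r"(\d{3})[./#-](\d{3})[./#-](\d{4})")

# group(1) is the first three-digit group (assumed here to be a city code)
first_groups = [m.group(1) for m in p.finditer(sample)]

# groups() returns all three parts, which can be re-joined with one separator
normalized = ["-".join(m.groups()) for m in p.finditer(sample)]

print(first_groups)  # ['615', '317', '321']
print(normalized)    # ['615-555-7164', '317-615-9124', '321-521-9254']
```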

### e. Extracting Domain names from URLs
- Assume simple URLs, having the protocol either `http://` or `https://`
- Then we have optional `www.` string
- Then we have group of characters that make up our domain name, followed by top level domain

In [53]:
p = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group())   

http://www.ehtisham.me
http://azka.pu
https://www.Ahmed.pu
http://youtube.com
http://facebook.com


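Let us extract only the part captured by the third group. For simple hosts such as `youtube.com` this is the top-level domain (TLD). For multi-label hosts such as `azka.pu.edu.pk`, the pattern stops after the first label that follows the domain name, so the group holds `.pu` rather than the real TLD `.pk`.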

In [54]:
p = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group(3))   

.me
.pu
.pu
.com
.com


In [55]:
p = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

#teststring contains data read from file
matches = p.finditer(teststring)

for match in matches:
    print(match.group(2))   

ehtisham
azka
Ahmed
youtube
facebook
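
The pattern above stops after one dot-separated label, which is why `http://azka.pu.edu.pk` is cut to `http://azka.pu` in the output. A possible variant, sketched below on a hand-copied `sample` string, repeats the `\.\w+` part so the whole host is captured, and then takes the last label as the TLD. The non-capturing `(?:...)` groups and the names `hosts` and `tlds` are choices made for this sketch.

```python
import re

# Hand-copied excerpt of the URLs in the file
sample = (
    "http://www.ehtisham.me\n"
    "http://azka.pu.edu.pk\n"
    "https://www.Ahmed.pu.edu.pk\n"
    "http://youtube.com\n"
)

# group(1): first label of the host; group(2): every remaining ".label"
p = re.compile(r"https?://(?:www\.)?(\w+)((?:\.\w+)+)")

hosts = [(m.group(1), m.group(2)) for m in p.finditer(sample)]
# The TLD is whatever follows the last dot of the remaining labels
tlds = [rest.rsplit(".", 1)[-1] for _, rest in hosts]

print(hosts)
print(tlds)  # ['me', 'pk', 'pk', 'com']
```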


## 4. Modifying Strings
- Up to this point, we’ve simply performed searches against a static string. Regular expressions are also commonly used to modify strings in various ways, using the following pattern methods:
    - `split()`: Split the string into a list, splitting it wherever the RE matches
    - `sub()`: Find all substrings where the RE matches, and replace them with a different string
    - `subn()`: Does the same thing as `sub()`, but returns the new string and the number of replacements

### a. The `re.Pattern.split()` Method
- It splits the target string wherever the regular expression pattern matches, and the pieces are returned in the form of a list.

**`p.split(string, maxsplit=0)`**

- Where,
    - `string`: The variable pointing to the target string (i.e., the string we want to split).
    - `maxsplit`: The maximum number of splits to perform. The default value `0` means no limit. If `maxsplit` is 2, at most two splits occur, and the remainder of the string is returned as the final element of the list.
    

Note: 
It’s similar to the `split()` method of strings but provides much more generality in the delimiters that you can split by; the string `split()` method only supports splitting by whitespace or by a fixed string.

In [56]:
# defining string
mystring = "My name             is Ehtisham and my lucky number is 21 and 143"
mylist = mystring.split(sep=' ')
mylist

['My',
 'name',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'is',
 'Ehtisham',
 'and',
 'my',
 'lucky',
 'number',
 'is',
 '21',
 'and',
 '143']

In [57]:
# importing required libraries
import re

# defining string
mystring = "My name             is Ehtisham and my lucky number is 21 and 143"
p = re.compile(r"\s+")
p

re.compile(r'\s+', re.UNICODE)

In [63]:
word_list = p.split(mystring, maxsplit=0)

print(word_list)

['My', 'name', 'is', 'Ehtisham', 'and', 'my', 'lucky', 'number', 'is', '21', 'and', '143']


The `maxsplit` parameter of `split()` limits how many splits are performed. For example, with `maxsplit=3`, at most three splits are done, and the remainder of the string is returned as the final element of the list.

In [64]:
import re

# defining string
mystring = "My name             is Ehtisham and my roll number is 21 and 143"

p = re.compile(r"\s+")

word_list = p.split(mystring, maxsplit=3)

print(word_list)

['My', 'name', 'is', 'Ehtisham and my roll number is 21 and 143']


- The `split()` method of strings allows you to split by whitespace or by a fixed string.
- The regex `split()` method accepts a regex pattern for the delimiter, so a single pattern can describe multiple delimiters.
- For example, using the regex `split()` method, we can split a string by either a `comma` or a `hyphen`.

In [69]:
import re

# defining string
mystring = "12,45,78,85-17-89"

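# Two equivalent ways to say "hyphen or comma"; the second assignment is the one used below
p = re.compile(r"-|,")    # alternation
p = re.compile(r"[-,]")   # character class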

word_list = p.split(mystring)

print(word_list)


['12', '45', '78', '85', '17', '89']


In [68]:
# mystring.split("-")

In [70]:
import re

# defining string
mystring = "12and45, 78and85-17and89-97,54"

p = re.compile(r"and|[\s,-]+")

word_list = p.split(mystring)

print(word_list)


['12', '45', '78', '85', '17', '89', '97', '54']


### b. The `re.Pattern.sub()` and `re.Pattern.subn()` Methods
- Python regex offers the `sub()` and `subn()` methods to `search` and `replace` patterns in a string. Using these methods we can replace one or more occurrences of a regex pattern in the target string with a substitute string.

- The `sub()` method returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in `string` with the replacement `repl`.

**`p.sub(repl, string, count=0)`**

- Where,
    - `repl`: The replacement that we are going to insert for each occurrence of a pattern. The replacement can be a string or function.
    - `string`: The variable pointing to the target string (In which we want to perform the replacement).
    - `count`: The maximum number of occurrences to replace. The default value of zero means find and replace all occurrences of the pattern; `count=n` means replace only the first n occurrences of the pattern.
    
    

- It returns the string obtained by replacing the pattern occurrences in the string with the replacement string. If the pattern isn’t found, the string is returned unchanged.
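
The `count` parameter and the option of passing a function as `repl` can be sketched as follows. The sample sentence is the one used in this section; the second pattern and the lambda are illustrative choices.

```python
import re

text = "Learning Data Science and Machine Learning"

# count=2 replaces only the first two whitespace characters
p_space = re.compile(r"\s")
first_two = p_space.sub("_", text, count=2)
print(first_two)  # Learning_Data_Science and Machine Learning

# repl can be a function: it receives each match object and returns the replacement text
p_word = re.compile(r"[A-Z][a-z]+")
upper = p_word.sub(lambda m: m.group().upper(), text)
print(upper)      # LEARNING DATA SCIENCE and MACHINE LEARNING
```

A function is useful when the replacement depends on what was matched, which a fixed replacement string cannot express.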

Replace white space with underscore character

In [71]:
import re

# defining string
mystring = "Learning Data Science and Machine Learning"

p = re.compile(r"\s")

word_list = p.sub("_", mystring)

print(word_list)

Learning_Data_Science_and_Machine_Learning


Remove whitespaces from a string

In [72]:
import re

# defining string
mystring = "Learning Data Science and Machine Learning"

p = re.compile(r"\s+")

word_list = p.sub("", mystring)

print(word_list)

LearningDataScienceandMachineLearning


Remove leading Spaces from a string

In [73]:
import re

# defining string
mystring = "         Learning Data Science and Machine Learning"


# ^\s+ remove only leading spaces
# caret (^) matches only at the start of the string
p = re.compile(r"^\s+")

word_list = p.sub("", mystring)

print(word_list)

Learning Data Science and Machine Learning


Remove both leading and trailing spaces

In [74]:
import re

# defining string
mystring = "          Learning Data Science and Machine Learning        \t. "

# ^\s+ matches leading whitespace
# \s+$ matches trailing whitespace (here only the single space after the final '.')
p = re.compile(r"^\s+|\s+$")

word_list = p.sub("", mystring)

print(word_list)

Learning Data Science and Machine Learning        	.
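
The `subn()` method works like `sub()` but also reports how many replacements were made. A minimal sketch, with illustrative variable names:

```python
import re

text = "Learning Data Science and Machine Learning"
p = re.compile(r"\s+")

# subn() returns a 2-tuple: (new_string, number_of_replacements)
new_text, n = p.subn("_", text)

print(new_text)  # Learning_Data_Science_and_Machine_Learning
print(n)         # 5
```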


### Write a Python program to find all words starting with 'a' or 'e' in a given string.

In [76]:
import re
# Input.
text = """The following example creates an ArrayList with a capacity of 50 elements. 
        Four elements are then added to the ArrayList and the ArrayList is trimmed accordingly.
"""
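# find all the words starting with 'a' or 'e' (lowercase only)
# \b anchors the match at a word boundary, so letters inside a word are not matched;
# \w* (instead of \w+) also keeps the one-letter word 'a'
p = re.compile(r"\b[ae]\w*")
matches = p.finditer(text)
# Print result.
for m in matches:
    print(m.group())

example
an
a
elements
elements
are
added
and
accordingly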


### Write a Python program to find and print the numbers in a given string along with their positions.

In [77]:
import re
text = "The following example creates an ArrayList with a capacity of 50 elements. Four elements are then added to the ArrayList and the ArrayList is trimmed accordingly."
p = re.compile(r"\d+")
iterobject = p.finditer(text)
for i in iterobject:
    print(i)
    print(i.start())

<re.Match object; span=(62, 64), match='50'>
62


In [79]:
# text[62:64]

In [80]:
import re
text = "Boys hostel no 17, university of the punjab, new campus, Lahore"
p = re.compile(r"Lahore$")   # $ anchors the match at the end of the string
new_text = p.sub("lhr", text)
new_text

'Boys hostel no 17, university of the punjab, new campus, lhr'