

# Regular Expressions

Regular Expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document. 

Regular expressions are notorious for their seemingly strange syntax, which is a byproduct of their flexibility. They have to be able to describe almost any string pattern you can imagine, which is why the pattern format itself is so dense.

Regular expressions are handled using Python's built-in **re** module.

Ref: [doc1](https://docs.python.org/3/library/re.html) and [doc2](https://docs.python.org/3/howto/regex.html) 

Let's begin by explaining how to search for basic patterns in a string!

In [None]:
# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all"

In [None]:
import re

In [None]:
nameage='James is 23 and I am 45422'

In [None]:
'am' in nameage # plain substring check, no regex needed

True

In [None]:
re.findall(r'\d{1,2}',nameage) # runs of 1 to 2 digits; longer numbers are cut into pieces

['23', '45', '42', '2']

In [None]:
re.findall(r'\d{1,7}',nameage)

['23', '45422']

In [None]:
re.findall(r'[A-Za-z][a-z]*',nameage) # searching for words: one letter followed by zero or more lower case letters

['James', 'is', 'and', 'I', 'am']

In [None]:
re.findall(r'[A-Za-z]',nameage) # searching for upper and lower case letters in text

['J', 'a', 'm', 'e', 's', 'i', 's', 'a', 'n', 'd', 'I', 'a', 'm']

In [None]:
re.findall(r'[a-z]',nameage) # searching for lower case letters in text
re.findall(r'[A-Z]',nameage) # searching for upper case letters in text

['a', 'm', 'e', 's', 'i', 's', 'a', 'n', 'd', 'a', 'm']

['J', 'I']
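
A range written as `[A-z]` is an easy mistake to make when the intent is "any letter". The sketch below, using a made-up sample string, shows why the explicit form `[A-Za-z]` is safer: in ASCII a handful of punctuation characters sit between `Z` and `a`, and `[A-z]` matches them too.

```python
import re

# Made-up sample containing letters plus a few punctuation characters
sample = 'a_b^C['

# [A-z] spans ASCII codes 65 to 122, which includes [ \ ] ^ _ ` between 'Z' and 'a'
loose = re.findall(r'[A-z]', sample)

# [A-Za-z] lists the two letter ranges explicitly
strict = re.findall(r'[A-Za-z]', sample)

print(loose)   # ['a', '_', 'b', '^', 'C', '[']
print(strict)  # ['a', 'b', 'C']
```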

In [None]:
re.search('J',nameage)

<re.Match object; span=(0, 1), match='J'>

In [64]:
re.search(r'\w+', nameage) # finds the first run of word characters (here at the start of the string)

<re.Match object; span=(0, 5), match='James'>

In [62]:
re.search('/w+', nameage) # returns None: '/w' is a literal slash followed by 'w', not the \w identifier

In [None]:
inf='this is to inform the information'
re.findall('inform',inf)

['inform', 'inform']

In [None]:
re.search('inform',inf) # returns only the first match, along with its span

<re.Match object; span=(11, 17), match='inform'>

In [None]:
re.findall('inform',inf) # gives back all matched strings

['inform', 'inform']

In [None]:
# finditer() returns an iterator of match objects; print the start and end index of each match
for i in re.finditer('inform',inf):
    print(i.span()) 

(11, 17)
(22, 28)


In [None]:
re.split(" ",inf) # split

['this', 'is', 'to', 'inform', 'the', 'information']

In [None]:
re.split(r"\s", inf)

['this', 'is', 'to', 'inform', 'the', 'information']

In [None]:
re.split(r'\s',inf,maxsplit=3) # specifying the maximum number of splits

['this', 'is', 'to', 'inform the information']

In [None]:
# sub() (short for substitution) replaces the matches with the text of your choice
re.sub(r'\s',' & ',inf)

'this & is & to & inform & the & information'

In [None]:
# control the number of replacements by specifying the count parameter
re.sub(r'\s',' ^ ',inf,count=3)

'this ^ is ^ to ^ inform the information'

In [None]:
name='I am Rachita'
x=re.search(r'\bR\w+',name) # looks for a word that starts with an upper case 'R'; \b is a word boundary

In [None]:
x

<re.Match object; span=(5, 12), match='Rachita'>

In [None]:
x.span() # returns a tuple containing the start and end positions of the match

(5, 12)

In [None]:
x.start()

5

In [None]:
x.end()

12

In [None]:
x.string #returns the string passed into the function

'I am Rachita'

In [None]:
x.group() # returns the part of the string where there was a match

'Rachita'

In [None]:
names=['Braund, Mr. Owen Harris',
 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
 'Heikkinen, Miss. Laina',
 'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
 'Allen, Mr. William Henry',
 'Moran, Mr. James',
 'McCarthy, Mr. Timothy J',
 'Palsson, Master. Gosta Leonard',
 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)',
 'Nasser, Mrs. Nicholas (Adele Achem)',
 'Sandstrom, Miss. Marguerite Rut',
 'Bonnell, Miss. Elizabeth']

In [None]:
for name in names:
    # the text after the first comma starts with a space, so index 1 of the split is the title
    y=name.split(",",1)[1].split(" ")[1]
    print(y)

Mr.
Mrs.
Miss.
Mrs.
Mr.
Mr.
Mr.
Master.
Mrs.
Mrs.
Miss.
Miss.
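
The same titles can also be pulled out with a regular expression instead of chained `split()` calls. This is only a sketch: it assumes every name follows the "Surname, Title. Given names" layout seen in the list, and the pattern and variable names are illustrative.

```python
import re

# A few entries copied from the names list, all in "Surname, Title. Given names" layout
sample_names = ['Braund, Mr. Owen Harris',
                'Heikkinen, Miss. Laina',
                'Palsson, Master. Gosta Leonard']

# After the comma and optional whitespace, capture letters followed by a literal dot
title_pattern = re.compile(r',\s*([A-Za-z]+\.)')

titles = []
for name in sample_names:
    match = title_pattern.search(name)
    # search() returns None when the layout does not match
    titles.append(match.group(1) if match else None)

print(titles)  # ['Mr.', 'Miss.', 'Master.']
```

Unlike the index-based split, this version yields `None` for a malformed entry rather than raising an `IndexError`.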



## Identifiers for Characters in Patterns

Kinds of characters, such as a digit or a whitespace character, have short codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, so Python passes each \ in the pattern through to the regex engine instead of treating it as the start of a string escape sequence.
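
To make the difference concrete, here is a small sketch comparing a plain string and a raw string that look the same on screen. In a plain string Python turns `\b` into a backspace character before the regex engine ever sees it.

```python
import re

name = 'I am Rachita'

plain = '\bR'   # Python reads '\b' as a single backspace character
raw = r'\bR'    # backslash and 'b' both reach the regex engine, which reads \b as a word boundary

print(len(plain), len(raw))     # 2 3
print(re.search(plain, name))   # None, there is no backspace character in name
print(re.search(raw, name))     # a match object for the 'R' in 'Rachita'
```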

Below you can find a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
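
As a quick sanity check, the rows of the table can be verified in code. The sketch below pairs each example pattern with its example match and uses `re.fullmatch()`, which succeeds only when the pattern consumes the entire string.

```python
import re

# (pattern, example) pairs copied from the table above
table_rows = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

# fullmatch() returns a match object only if the whole example fits the pattern
checks = [re.fullmatch(pattern, example) is not None
          for pattern, example in table_rows]

print(checks)  # [True, True, True, True, True, True]
```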

In [None]:
text = "My telephone number is 408-555-1234"
re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

<re.Match object; span=(23, 35), match='408-555-1234'>

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
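
Quantifiers are greedy by default: each one takes as many characters as it is allowed to before the search moves on. The following sketch, with made-up input strings, shows how that plays out for a few rows of the table.

```python
import re

# {2,4} takes as many digits as it can (up to 4) before moving on
chunks = re.findall(r'\d{2,4}', '1234567')
print(chunks)  # ['1234', '567']

# ? makes the preceding 's' optional, so both forms match
forms = re.findall(r'plurals?', 'plural and plurals')
print(forms)  # ['plural', 'plurals']

# * allows zero occurrences, so the 'B' part can be absent entirely
abc = re.fullmatch(r'A*B*C*', 'AAACC')
print(abc is not None)  # True
```

This greedy behavior is also why `\d{1,2}` earlier split `45422` into `45`, `42` and `2`.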

In [None]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>

## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any task that involves splitting a pattern into parts that we can later pull out individually.

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [None]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = phone_pattern.search(text)

In [None]:
# The entire result
results.group()

'408-555-1234'

In [None]:
# Can then also call by group position.
# remember groups were separated by parentheses ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(1),results.group(2),results.group(3)

('408', '555', '1234')
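
Two related conveniences are worth a brief sketch: `groups()` returns all captured groups at once, and the `(?P<name>...)` syntax lets a group be fetched by name instead of by position. The group names used here (`area`, `exchange`, `line`) are illustrative choices, not anything required by the library.

```python
import re

text = "My telephone number is 408-555-1234"

# (?P<name>...) gives a group a name; the names below are illustrative
named_pattern = re.compile(r'(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})')
match = named_pattern.search(text)

print(match.groups())       # ('408', '555', '1234')
print(match.group('area'))  # 408
print(match.groupdict())    # {'area': '408', 'exchange': '555', 'line': '1234'}
```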

## Additional Regex Syntax

### Or operator |

Use the pipe operator to write an **or** statement. For example:

In [None]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [None]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>
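
One thing to watch with alternation is that each option can also match inside a longer word. The sketch below uses a made-up sentence to show the effect, and one possible way to restrict the options to whole words with `\b`.

```python
import re

sentence = 'A man and a woman met a salesman'

# Without boundaries, 'man' is also found at the end of 'salesman'
loose = re.findall(r'man|woman', sentence)

# \b on both sides of each option restricts the match to whole words
strict = re.findall(r'\bman\b|\bwoman\b', sentence)

print(loose)   # ['man', 'woman', 'man']
print(strict)  # ['man', 'woman']
```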

### The Wildcard Character

Use a "wildcard" as a placeholder that will match any single character placed there (except a newline, by default). You can use a simple period **.** for this. For example:

In [None]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [None]:
re.findall(r"...at","The bat went splat") # each dot matches exactly one character (spaces included), so three characters before 'at' are grabbed

['e bat', 'splat']

In [None]:
# However, a fixed number of dots grabs too much for short words. Really we only want words that end with "at".
# One or more non-whitespace characters followed by 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']
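
The `\S+at` pattern works on this sentence, but it can still match the front of a longer word and it skips the bare word "at". A possible refinement, sketched here on a made-up sentence, is to use word boundaries so that only whole words ending in "at" are returned.

```python
import re

sentence = 'The bat went splat at the battle'

# \S+at needs at least one character before 'at' and may stop inside a longer word
partial = re.findall(r'\S+at', sentence)

# \b\w*at\b matches a whole word that ends in 'at', including 'at' itself
whole_words = re.findall(r'\b\w*at\b', sentence)

print(partial)      # ['bat', 'splat', 'bat']
print(whole_words)  # ['bat', 'splat', 'at']
```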

### Starts With and Ends With

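We can use **^** to require that a match begins at the start of the string, and **$** to require that it finishes at the end of the string. Both anchors apply to the string as a whole, not to individual words: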


In [None]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']

In [None]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')

['1']
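
Because the anchors refer to the whole string, a digit elsewhere in the text is not picked up. A short sketch with made-up sentences:

```python
import re

# ^ anchors to the start of the whole string, so a digit in the middle is not found
middle = re.findall(r'^\d', 'The number 1 is in the middle')

# $ anchors to the end of the whole string, so a leading digit is not found
leading = re.findall(r'\d$', '2 comes first here')

print(middle)   # []
print(leading)  # []
```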

### Exclusion

To exclude characters, we can use the **^** symbol at the start of a set of brackets **[ ]**. The set then matches any single character that is not listed inside the brackets. For example:

In [None]:
phrase = "my name is R"
re.findall(r'[^\d]',phrase)

['m', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'R']

In [None]:
# To get the words back together, use a + sign
s=re.findall(r'[^\d]+',phrase) # runs of non-digit characters; phrase contains no digits, so it comes back whole
s

['my name is R']

In [None]:
re.split(" ",''.join(s))

['my', 'name', 'is', 'R']

In [None]:
# removing punctuation
test_phrase = 'String with punctuation. How can we remove it?'
re.findall('[^!.? ]+',test_phrase)

['String', 'with', 'punctuation', 'How', 'can', 'we', 'remove', 'it']

In [None]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))
clean

'String with punctuation How can we remove it'
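
An alternative to finding the pieces and joining them back together is to delete the unwanted characters directly with `re.sub()`. This sketch assumes the same three punctuation marks as the example above.

```python
import re

test_phrase = 'String with punctuation. How can we remove it?'

# Replace every '!', '.' or '?' with an empty string; spaces are left untouched
clean = re.sub(r'[!.?]', '', test_phrase)

print(clean)  # String with punctuation How can we remove it
```

Inside brackets the period is a literal character, so it does not need to be escaped.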

## Brackets for Grouping
As we showed above we can use brackets to group together options, for example if we wanted to find hyphenated words:

In [None]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']

## Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

In [None]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [None]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [None]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [None]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
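
Parentheses also create a capturing group, which changes what `re.findall()` returns: with one group in the pattern it gives back only the group's text, not the whole match. The sketch below shows this, along with the non-capturing form `(?:...)` as one way to get the full word.

```python
import re

text = 'Hello, would you like some catfish?'

# With a capturing group, findall() returns only what the group captured
captured = re.findall(r'cat(fish|nap|claw)', text)

# (?:...) groups the options without capturing, so findall() returns the whole match
whole = re.findall(r'cat(?:fish|nap|claw)', text)

# On a match object, group(1) still gives access to the option that matched
option = re.search(r'cat(fish|nap|claw)', text).group(1)

print(captured)  # ['fish']
print(whole)     # ['catfish']
print(option)    # fish
```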

In [58]:
# Check that a string contains only a certain set of characters (in this case a-z, A-Z and 0-9)
import re

def is_allowed_specific_char(string):
    # Pattern that matches any single character outside the allowed set
    disallowed = re.compile(r'[^a-zA-Z0-9]')
    # No disallowed character found means the string passes (an empty string also passes)
    return disallowed.search(string) is None

print(is_allowed_specific_char("ABCDEFabcdef123450"))
print(is_allowed_specific_char("*&%@#!}{"))

True
False
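
The same check can be phrased positively with `re.fullmatch()`, which requires the whole string to fit the pattern. This is a sketch of an alternative, not a replacement: the helper name `only_alphanumeric` is hypothetical, and because of the `+` quantifier it treats an empty string as invalid.

```python
import re

def only_alphanumeric(string):
    # Hypothetical helper: the whole string must be one or more allowed characters
    return re.fullmatch(r'[a-zA-Z0-9]+', string) is not None

valid = only_alphanumeric("ABCDEFabcdef123450")
invalid = only_alphanumeric("*&%@#!}{")
empty = only_alphanumeric("")

print(valid)    # True
print(invalid)  # False
print(empty)    # False
```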
