# Regular Expressions

In [1]:
#re is Python's built-in package for working with regular expressions
import re
import nltk

The **re** module offers a set of functions for searching a string for text that matches a pattern. This notebook contains code examples from the **Speech and Language Processing** book that we're going through at AMMI.

At the end of the notebook we try to build a very simple Eliza chatbot that uses pattern matching to recognize phrases and give outputs.

We're going to use the following functions in **re**:
- re.findall - Returns a list containing all matches
- re.search - Returns a match object if there is a match anywhere in the string
- re.sub - Replaces one or many matches with a string
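
As a quick orientation, the three functions can be tried on a single sentence. This is only a sketch: the sample sentence and the variable names are illustrative and do not come from the book.

```python
import re

# Illustrative sentence (an assumption, not taken from the book)
sentence = 'The woodchuck saw 2 cats and 13 dogs'

# findall returns every non-overlapping match as a list of strings
numbers = re.findall(r'[0-9]+', sentence)
print(numbers)                            # ['2', '13']

# search returns a match object for the first match, or None if there is none
first = re.search(r'[0-9]+', sentence)
print(first.group(0), first.start())      # 2 18

# sub returns a new string in which every match has been replaced
masked = re.sub(r'[0-9]+', '#', sentence)
print(masked)                             # The woodchuck saw # cats and # dogs
```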

## Basic Regular Expression Patterns

In [26]:
#searching for a simple sequence of characters
#matches the word 'woodchuck' in the text
txt = 'woodchuck'
regexp = re.compile(r'woodchuck')

if regexp.search(txt):
    print('True!')
else:
    print('Not in text')

True!


In [28]:
#searching for the word with a capital first letter
#this finds nothing: the pattern starts with a capital W but the text has a small w
#regular expressions are case sensitive
regexp = re.compile(r'Woodchuck')

if regexp.search(txt):
    print('True!')
else:
    print('Not in text')

Not in text


In [29]:
#to solve the above problem, use square brackets to accept either w or W
regexp = re.compile(r'[wW]oodchuck')

if regexp.search(txt):
    print('True!')
else:
    print('Not in text')

True!


In [31]:
#finding every single capital letter
txt = 'I am so and so and I am 26 years old'
regexp = re.compile(r'[A-Z]')

regexp.findall(txt)

['I', 'I']

In [32]:
#finding every single small letter
regexp = re.compile(r'[a-z]')

regexp.findall(txt)

['a',
 'm',
 's',
 'o',
 'a',
 'n',
 'd',
 's',
 'o',
 'a',
 'n',
 'd',
 'a',
 'm',
 'y',
 'e',
 'a',
 'r',
 's',
 'o',
 'l',
 'd']

In [33]:
#finding every single digit
regexp = re.compile(r'[0-9]')

regexp.findall(txt)

['2', '6']

In [35]:
#specify what a single character cannot be using ^
# ^ needs to be the first symbol after the open square bracket
regexp = re.compile(r'[^a-z]')
regexp.findall(txt)

['I', ' ', ' ', ' ', ' ', ' ', ' ', 'I', ' ', ' ', '2', '6', ' ', ' ']

In [37]:
#matching the previous character or nothing
#looking for woodchucks: it is not found because the text has no final s
txt = 'woodchuck'
regexp = re.compile(r'[wW]oodchucks')

if regexp.search(txt):
    print('True!')
else:
    print('Not in text')

Not in text


? is used to match zero or one instances of the previous character

In [43]:
#we can solve the above by adding the ?, which makes the final s optional
regexp = re.compile(r'[wW]oodchucks?')

if regexp.search(txt):
    print('True!')
else:
    print('Not in text')

True!


In [42]:
#another example: u? makes the u optional, so both color and colour match
txt = input()
regexp = re.compile(r'[cC]olou?r')

if regexp.search(txt):
    print('True!')
else:
    print('Not in text')

colour
True!


The **Kleene star** (\*) matches zero or more occurrences of the immediately preceding character.

In [69]:
#searching for one or more e's: a single e followed by zero or more e's
#txt = input()
txt2 = 'Tund'
regexp = re.compile(r'[e][e]*')

if regexp.search(txt2):
    print('True!')
else:
    print('Not in text')

Not in text


**Kleene +**
Kleene + matches one or more occurrences of the preceding character, so you don't have to write the regular expression for digits twice: \[0-9\]+ replaces \[0-9\]\[0-9\]\*.

In [74]:
#txt = input()
txt2 = '2'
regexp = re.compile(r'[0-9]+') #specifies a sequence of digits

if regexp.search(txt2):
    print('True!')
else:
    print('Not in text')

True!


The wildcard . matches any single character, for example one unknown letter inside a word.

In [None]:
#Example
#the . needs exactly one character between g and n, so 'begn' does not match; try 'begin' or 'begun'
txt2 = 'begn'
regexp = re.compile(r'beg.n')

if regexp.search(txt2):
    print('True!')
else:
    print('Not in text')

To find any line in which a particular word appears twice, we combine the two as .* to stand for any run of characters between the occurrences.

In [81]:
# Example
#try removing the word that is repeated to see how it works
txt2 = 'my name name is she'
regexp = re.compile(r'name.*name') #name, then any run of characters, then name again

if regexp.search(txt2):
    print('True!')
else:
    print('Not in text')

True!


**Anchors** are special characters that anchor regular expressions to particular places in a string. The most common are ^ and $.

Example 1:
'^The' matches the word The only at the start of a line.

In [94]:
#Example 1
txt2 = 'I am the girl is here'
regexp = re.compile(r'^The')

if regexp.search(txt2):
    print('True!')
else:
    print('The is not at the start of the line ')

The is not at the start of the line 


In [95]:
#Example 2
txt2 = 'The girl is here'
regexp = re.compile(r'^The')

if regexp.search(txt2):
    print('True!')
else:
    print('The is not at the start of the line')

True!


**NB**:
The ^ (caret) character has **three** uses:

- To match the start of a line: ^The
- To indicate negation inside square brackets: \[^0-9\] matches any character except a digit
- And just to mean a caret when it is not the first symbol inside the brackets: \[a^\] matches either a or ^
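
The three uses of the caret can be checked side by side. A minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# Illustrative text (an assumption, not from the book)
text = 'The answer is 2^3'

# 1. Outside brackets, ^ anchors the match to the start of the string
starts_with_the = bool(re.search(r'^The', text))
print(starts_with_the)            # True

# 2. As the first symbol inside brackets, ^ negates the character class
non_digits = re.findall(r'[^0-9]', '2^3')
print(non_digits)                 # ['^']

# 3. Anywhere else inside brackets, ^ is just a literal caret
three_or_caret = re.findall(r'[3^]', '2^3')
print(three_or_caret)             # ['^', '3']
```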

Example 2: The $ sign matches the end of the line. For example, ' $' is a useful pattern for matching a space at the end of a line.

In [99]:
txt2 = 'The dog.is here'
regexp = re.compile(r'^The dog')

regexp.findall(txt2)

['The dog']

In [100]:
txt2 = 'The dog.'
regexp = re.compile(r'^The dog\.$') 
# the backslash \ here is used to mean the character . not the wildcard . that we talked about earlier

regexp.findall(txt2)

['The dog.']

There are two other **anchors**:
- \b matches a word boundary
- \B matches a non-boundary

Example below:

In [None]:
txt1 = 'The girl is one that goes to school there'
txt2 = 'The girl is the one that goes to school there'
regexp = re.compile(r'\bthe\b')

if regexp.search(txt2):
    print('True!')
else:
    print('No the with boundaries')

In [102]:
#this matches the standalone number 25; try another number like 325 and it won't match
txt2 = 'The girl is the one that is 25 years old'
regexp = re.compile(r'\b25\b')

if regexp.search(txt2):
    print('True!')
else:
    print('25 is not in the text')

True!
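
The cells above only exercise \b. As a complementary sketch, \B can be compared with \b on the same text; the sentence below is made up for illustration.

```python
import re

# Illustrative text (an assumption): 'the' occurs as a word and inside other words
text = 'the other one bathes there'

# \bthe\b matches 'the' only when it stands alone as a word
whole_words = [m.start() for m in re.finditer(r'\bthe\b', text)]
print(whole_words)      # [0]

# \Bthe\B matches 'the' only when it has word characters on both sides,
# as in 'other' and 'bathes'; 'there' fails because 'the' begins the word
inside_words = [m.start() for m in re.finditer(r'\Bthe\B', text)]
print(inside_words)     # [5, 16]
```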


## Disjunction, Grouping and Precedence

The **disjunction** operator |, also known as the **pipe**, is used when we want to match either one word or the other.

In [106]:
txt2 = 'There is a dog in the house'
regexp = re.compile(r'cat|dog')

if regexp.search(txt2):
    print('True!')
else:
    print('No cat or dog in the text')

True!


To make the disjunction operator apply only to a specific pattern, we need to use the parenthesis operators ( )

In [108]:
#Example: we want to match both puppy and puppies
txt2 = 'There are puppies in the house'
regexp = re.compile(r'pupp(y|ies)')

if regexp.search(txt2):
    print('True!')
else:
    print('No puppy or puppies in the text')

True!


Suppose we want to match repeated instances of a string

In [122]:
txt2 = 'Column 1 Column 2 Column 3'
#matches the word Column followed by a number and optional spaces;
#findall returns every occurrence of the grouped pattern
regexp = re.compile(r'(Column [0-9]+ *)')

regexp.findall(txt2)

['Column 1 ', 'Column 2 ', 'Column 3']
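
To apply a repetition operator to the whole group, as in the book's "repeated zero or more times" reading, a \* can follow the closing parenthesis. The sketch below is one possible way to test this; it uses a non-capturing group (?:...) and re.fullmatch, which are choices made here rather than anything taken from the book.

```python
import re

txt2 = 'Column 1 Column 2 Column 3'

# (?:...) groups without capturing; the trailing * repeats the whole group
repeated = re.compile(r'(?:Column [0-9]+ *)*')

# fullmatch succeeds only if the entire string is a run of such columns
all_columns = bool(repeated.fullmatch(txt2))
print(all_columns)      # True

# A string that contains something else does not match in full
mixed = bool(repeated.fullmatch('Column 1 Row 2'))
print(mixed)            # False
```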

**greedy** vs **non-greedy** matching

*read more on this*

The operator *? is a Kleene star that matches as little text as
possible. The operator +? is a Kleene plus that matches as little text as possible.
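
The difference is easiest to see by running a greedy and a non-greedy version of the same pattern on one string. A small sketch; the tag string and variable names are illustrative.

```python
import re

text = 'once upon a time'

# Greedy: [a-z]* takes as many letters as it can
greedy = re.search(r'[a-z]*', text).group(0)
print(repr(greedy))          # 'once'

# Non-greedy: [a-z]*? is satisfied with zero letters
lazy_star = re.search(r'[a-z]*?', text).group(0)
print(repr(lazy_star))       # ''

# Non-greedy plus: [a-z]+? needs at least one letter and stops there
lazy_plus = re.search(r'[a-z]+?', text).group(0)
print(repr(lazy_plus))       # 'o'

# Illustrative tag string (an assumption): greedy .* runs to the last >
tags = '<b>bold</b>'
greedy_tags = re.findall(r'<.*>', tags)
lazy_tags = re.findall(r'<.*?>', tags)
print(greedy_tags)           # ['<b>bold</b>']
print(lazy_tags)             # ['<b>', '</b>']
```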

In [7]:
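#matches numbers of 500 or more: three digits starting with 5-9, or four or more digits
#45 is too small, so nothing is found
num = '45'
re.findall(r'[5-9][0-9]{2}|[1-9][0-9]{3}[0-9]*', num)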

[]

## Substitutions, Capture Groups and Eliza

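In Python, the sub function is used to replace text in a string.
Called on a compiled pattern, it takes up to 3 arguments:
- The text to replace with
- The text to replace in
- The maximum number of substitutions to make (optional; by default every match is replaced)

The module-level re.sub takes the pattern itself as an extra first argument.

sub returns a string consisting of the given text with the substitution(s) made.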
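
The arguments can be seen in a short sketch. The sample string and names are illustrative; count is the keyword Python uses for the maximum number of substitutions.

```python
import re

# Illustrative string (an assumption, not from the book)
text = 'a1 b22 c333'
pattern = re.compile(r'[0-9]+')

# Compiled pattern: replacement first, then the string to search
all_replaced = pattern.sub('#', text)
print(all_replaced)       # a# b# c#

# count limits how many matches are replaced, starting from the left
first_only = pattern.sub('#', text, count=1)
print(first_only)         # a# b22 c333

# Module-level function: the pattern is passed as the first argument
same_result = re.sub(r'[0-9]+', '#', text)
print(same_result)        # a# b# c#
```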

In [125]:
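#\1 refers back to whatever the first group (.*) matched, so a doubled string collapses to a single copy
print(re.sub(r"(.*)\1", r"\1", "HeyHey"))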

Hey


In [126]:
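#example 1: wrap every number in angle brackets
#the slashes in the book's /.../ notation only delimit the pattern, so they are left out here
mystring = 'There are 35 boxes'
pattern = re.compile(r'([0-9]+)')
newstring = pattern.sub(r"<\1>", mystring)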

In [127]:
newstr2 = re.sub(r'([0-9]+)', r'<\1>', mystring)

In [128]:
newstr2

'There are <35> boxes'

In [129]:
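#the backreference \1 must repeat whatever (.*) captured, so both adjectives have to be the same
mystring2 = 'the Xer they were, the Xer they will be'
pattern = re.compile(r'the (.*)er they were, the \1er they will be')
newstring = pattern.search(mystring2).group(0)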

In [130]:
newstring

'the Xer they were, the Xer they will be'

In [131]:
user1 = "I'm depressed"
pattern = re.compile(r"I'm (depressed|sad)")
pattern.sub(r"I AM SORRY TO HEAR YOU ARE \1", user1)

'I AM SORRY TO HEAR YOU ARE depressed'

In [132]:
user2 = "my name is cate"
pattern = re.compile(r"my name is (.*)")
pattern.sub(r"I AM SORRY TO HEAR YOU ARE \1", user2)

'I AM SORRY TO HEAR YOU ARE cate'

In [134]:
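#s/.../.../ with $1 is the book's substitution notation, not Python syntax;
#in Python the replacement is a separate argument and the captured group is written \1
newstring = pattern.sub(r'Hi \1', user2)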
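
The substitutions above can be combined into a tiny rule-based responder in the spirit of Eliza. This is only a sketch under stated assumptions: the RULES list, the respond function, and the replies HELLO and PLEASE GO ON are hypothetical names and texts chosen for illustration, not the book's implementation.

```python
import re

# Hypothetical rule list: (pattern, reply template), tried in order
RULES = [
    (re.compile(r".*\bI'?m (depressed|sad)\b.*", re.IGNORECASE),
     r"I AM SORRY TO HEAR YOU ARE \1"),
    (re.compile(r".*\bmy name is (\w+).*", re.IGNORECASE),
     r"HELLO \1"),
]

def respond(utterance):
    """Return the reply of the first rule that matches the whole utterance."""
    for pattern, template in RULES:
        match = pattern.fullmatch(utterance)
        if match:
            # expand fills \1 in the template with the captured group
            return match.expand(template)
    # Fallback reply when no rule applies (an assumption)
    return "PLEASE GO ON"

print(respond("I'm depressed"))     # I AM SORRY TO HEAR YOU ARE depressed
print(respond("my name is cate"))   # HELLO cate
print(respond("nice weather"))      # PLEASE GO ON
```

Ordering the rules from most to least specific keeps a general pattern from shadowing a more specific one.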