### Regular expressions

<PRE>
#Square brackets (character classes):
[]

Inside the square brackets, we specify the set of characters to match.
For example, we can list individual characters, digits, or a range of digits or letters.

Example:
[0-9]


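NOTE:
A character class matches exactly ONE character: any one of the characters listed inside the brackets (so the listed characters act like an OR).
Most symbols are treated differently inside the square brackets than in the rest of the regex.
For example:
As the first character inside [], '^' means NOT; outside the brackets it means the match must be at the START of the string.

So '[^ ]+' is close to '\S+', but not identical.
'[^ ]+' means one or more characters that are not a space.
'\S+' means one or more non-whitespace characters, so it also stops at tabs and newlines.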


#The plus symbol and the asterisk symbol
+ : It means 1 or more.
* : It means 0 or more

For example, if we specify:
[0-9]+
This means 1 or more digits.

Another example, if we specify:
[0-9]*
This means 0 or more digits, so it can also match an empty string.
</PRE>





### re.search vs re.findall

<PRE>re.search()
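re.search('regex', 'string')
Returns a match object for the first place where the regex matches in the string, or None if there is no match.
A match object is truthy and None is falsy, so the result can be used directly in an if statement.

re.findall()
re.findall('regex', 'string')
If we actually want the matching strings to be extracted, we use re.findall().

The findall() method returns a list of all the non-overlapping matches in the string.
If nothing matches, it returns an empty list.
</PRE>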


In [11]:
import re
x = 'My 2 favorite numbers are 19 and 42'

y = re.findall('[0-9]+', x)  #one or more digits

z = re.findall('[AEIOU]+', x)  #one or more uppercase vowels (x has none)

print(y)
print(z)

['2', '19', '42']
[]
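
To see what re.search() actually gives back, here is a minimal sketch using the same string as above. The variable names (`line`, `first`, `missing`) are only for illustration.

```python
import re

line = 'My 2 favorite numbers are 19 and 42'

# re.search() returns a match object for the FIRST match, or None
first = re.search('[0-9]+', line)
missing = re.search('[AEIOU]+', line)

if first:                    # a match object is truthy
    print(first.group())     # the text that matched
    print(first.start())     # index in line where the match begins

print(missing)               # None, because line has no uppercase vowels
```

So re.search() is handy for a yes/no test or for the first match only, while re.findall() collects every match.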


### Greedy matching
<PRE>
By default, regex matching is greedy.
This means the regular expression library gives you the longest possible string that matches.
The repeat characters (+ and *) consume as many characters as they can while still letting the rest of the pattern match.
(i.e. Greedy)


Let's say we have a string:
x = 'From: Using the: whatever'

If we define a regex like:
^F.+:
This regex means: start with 'F', followed by one or more of any character ('.+'), followed by a ':'

Here, we might expect to get 'From:' from the string x.
But the regex actually returns 'From: Using the:', which ends at the LAST ':' in the string.
This is because matching is greedy by default.

</PRE>
### So how to avoid the greedy matching?
<PRE>
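We can do this by adding the '?' symbol right after + or * (i.e. +? or *?).
It tells the regex to match as few characters as possible, so the match stops at the first place where the rest of the pattern fits.
</PRE>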



In [16]:
#Greedy matching ON (by default)
import re
x = 'From: Using the: character'
y = re.findall('^F.+:', x)
y

['From: Using the:']

In [17]:
#Greedy matching OFF
import re
x = 'From: Using the: character'
y = re.findall('^F.+?:', x)
y

['From:']
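
Another way to stop at the first ':' is a negated character class instead of '?'. The sketch below compares the three variants on a made-up string with three colons (the string and variable names are assumptions for illustration).

```python
import re

line = 'From: Using the: character: again'

greedy = re.findall('^F.+:', line)       # runs to the LAST ':'
lazy = re.findall('^F.+?:', line)        # stops at the FIRST ':'
negated = re.findall('^F[^:]+:', line)   # still greedy, but [^:] can never cross a ':'

print(greedy)
print(lazy)
print(negated)
```

Here the lazy and negated versions give the same result; the greedy one swallows everything up to the final colon.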

### Matching email

In [20]:
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall(r'\S+@\S+', x)  #one or more non-space characters, then '@', then one or more non-space characters
print(y)


['stephen.marquard@uct.ac.za']


<PRE>NOTE: on the above regex, we didn't specify '?'.
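If we'd added it after the second '\S+', the part after the '@' would match as few characters as possible (a single character), and the result would be 'stephen.marquard@u'
</PRE>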

In [31]:
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall(r'\S+@\S+?', x)  #the trailing '+?' is non-greedy, so only one character is matched after the '@'
print(y)


['stephen.marquard@u']


### Using parentheses
<PRE>
Parentheses are not part of the text being matched.
Instead, they mark where string EXTRACTION starts and stops.
Remember the word 'extraction': with findall(), only the part inside the parentheses is returned.
So, using parentheses, we can specify which part of the matched pattern to extract.

For example:
Let's say we want to extract the email from the string:

"From raileohang@herald.college.edu.np Sat Jan 5"

Here, we can see that the email is preceded by "From " (with a trailing space).
Let's say that this pattern repeats on many lines of a txt file.

Then we can use:
^From (\S+@\S+)

This means the string must start with "From" and a space, followed by some non-space characters, an "@", and 
some more non-space characters.
But only the part that matches the regex inside the parentheses is returned.

NOTE:
i.e. the whole expression is used for matching,
but only the part inside the parentheses is returned.
</PRE>

In [36]:
#Using parentheses
x = 'From raileohang@herald.college.edu.np Sat Jan 5'
y = re.findall(r'^From (\S+@\S+)', x)
y

['raileohang@herald.college.edu.np']
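
A pattern can have more than one pair of parentheses. In that case findall() returns a list of tuples, one item per group. A small sketch, splitting the address into the part before and after the '@' (the names `line` and `parts` are just for illustration):

```python
import re

line = 'From raileohang@herald.college.edu.np Sat Jan 5'

# Two groups: the user name before '@' and the host after it
parts = re.findall(r'^From (\S+)@(\S+)', line)

print(parts)   # a list with one (user, host) tuple
```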

### Matching the domain part of an email
Remember how we did this before: with split(), or by using find() to get the position of the '@' and then slicing from the position that find() returned.

In [45]:
import re
x = 'From raileohang@herald.college.edu.np Sat Jan 5'
y = re.findall(r'@(\S+)', x)  #find '@', then return the non-space characters that follow it
y


['herald.college.edu.np']
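
For comparison, here is a sketch of one way the find() and slicing approach could look next to the regex version. The variable names are assumptions, not taken from the earlier notes.

```python
import re

line = 'From raileohang@herald.college.edu.np Sat Jan 5'

# Manual approach: locate '@', then the next space after it, then slice
at_pos = line.find('@')
space_pos = line.find(' ', at_pos)
host_by_slicing = line[at_pos + 1:space_pos]

# Regex approach: one call does the locating and the extracting
host_by_regex = re.findall(r'@(\S+)', line)[0]

print(host_by_slicing)
print(host_by_regex)
```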

In [48]:
#Alternative:
x = 'From raileohang@herald.college.edu.np Sat Jan 5'
y = re.findall('@([^ ]+)', x)  #Here [^ ] means any one character that is not a space
y

['herald.college.edu.np']
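
The two versions agree here only because the address is followed by a plain space. A sketch with a made-up line that uses a tab instead shows where '[^ ]' and '\S' differ:

```python
import re

# Same line as above, but with a TAB after the address (assumed input)
line = 'From raileohang@herald.college.edu.np\tSat Jan 5'

with_class = re.findall('@([^ ]+)', line)   # a tab is "not a space", so it keeps going
with_S = re.findall(r'@(\S+)', line)        # \S stops at any whitespace, including tab

print(with_class)
print(with_S)
```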

### Let's fine tune this even more

In [52]:
import re
x = 'From raileohang@herald.college.edu.np Sat Jan 5'
y = re.findall(r'From .*@(\S+)', x)
#means "From" with a space, followed by 0 or more of any character (.*), followed by '@', followed by 1 or more non-space characters
#But only return the non-space characters after the "@" i.e. (\S+)

y

['herald.college.edu.np']

In [53]:
x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
y = re.findall(r'\S+?@\S+', x)  #'+?' BEFORE the '@' changes nothing here: the match still has to reach the '@'
y


['stephen.marquard@uct.ac.za']
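
To check where the '?' matters, the sketch below runs all four greedy/non-greedy combinations on the same line. The list `patterns` and the dict `results` are just helper names for this comparison.

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'

patterns = [
    r'\S+@\S+',     # greedy on both sides
    r'\S+?@\S+',    # non-greedy before '@'
    r'\S+@\S+?',    # non-greedy after '@'
    r'\S+?@\S+?',   # non-greedy on both sides
]

results = {p: re.findall(p, line) for p in patterns}

for pattern, matches in results.items():
    print(pattern, matches)
```

The part before the '@' is the same in every case, because the search starts at the first position that can work and has to reach the '@' anyway. Only a '?' after the '@' shortens the result.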