<a href="https://colab.research.google.com/github/ShaunakSen/Natural-Language-Processing/blob/master/Regex101.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Regular Expressions - Basics to Advanced

> https://web.stanford.edu/~jurafsky/slp3/

---






### The need for regular expressions

Regular expressions can be used to specify strings we might want to
extract from a document, from transforming “I need X” in Eliza to defining strings like \$199 or \$24.99 for extracting tables of prices from a document.

#### Text normalization

Normalizing text means converting it to a more convenient, standard form.

1. __Tokenization__

For example, most of what we are going to
do with language relies on first separating out or tokenizing words from running
text, the task of tokenization. English words are often separated from each other by whitespace, but whitespace is not always sufficient. "New York" and "rock ’n’ roll" are sometimes treated as large words despite the fact that they contain spaces, while sometimes we’ll need to separate "I’m" into the two words I and am. For processing tweets or texts we’ll need to tokenize emoticons like :) or hashtags like #nlproc. Some languages, like Japanese, don’t have spaces between words, so word tokenization becomes more difficult.

2. __Lemmatization__:

Another part of text normalization is lemmatization, the task of determining
that two words have the same root, despite their surface differences. For example, the words sang, sung, and sings are forms of the verb sing. The word sing is the common lemma of these words, and a lemmatizer maps from all of these to sing.

Lemmatization is essential for processing morphologically complex languages like
Arabic.

3. __Stemming__:

Stemming refers to a simpler version of lemmatization in which we mainly
just strip suffixes from the end of the word.

4. __Sentence segmentation__:

Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like
periods or exclamation points.
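The normalization steps above can be sketched with Python's `re` module. This is only a rough illustration: the token pattern, the suffix list, and the helper names `tokenize`, `naive_stem`, and `split_sentences` are all assumptions made for this example, not standard tools. Lemmatization is left out because it needs a dictionary or morphological analysis rather than a pattern.

```python
import re

# Illustrative token pattern (an assumption, not a standard tokenizer):
# emoticons and hashtags first, then words with an optional internal
# apostrophe, then any other single non-space character.
TOKEN_PATTERN = re.compile(r"[:;]-?[()DP]|#\w+|\w+(?:['’]\w+)?|\S")


def tokenize(text):
    """Split text into word, emoticon, hashtag and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_sentences(text):
    """Break text after '.', '!' or '?' when whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(tokenize("I'm in New York, tweeting :) #nlproc!"))
print([naive_stem(w) for w in ["sings", "singing", "sung"]])
print(split_sentences("Hello there! This is a test. Is it working?"))
```

Note that this tokenizer still splits "New York" into two tokens, the stemmer cannot relate "sung" to "sing" (that is the lemmatizer's job), and the sentence splitter would wrongly break after abbreviations such as "Dr.".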

Finally, we’ll need to compare words and other strings. We’ll introduce a metric
called __edit distance__ that measures how similar two strings are based on the number
of edits (insertions, deletions, substitutions) it takes to change one string into the
other. Edit distance is an algorithm with applications throughout language processing, from spelling correction to speech recognition to coreference resolution.
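As a minimal sketch of the idea, the function below computes edit distance by dynamic programming. It assumes every insertion, deletion, and substitution costs 1 (other cost schemes exist), and the name `edit_distance` is chosen here just for illustration.

```python
def edit_distance(source, target):
    """Minimum number of unit-cost edits turning source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters reaches the empty prefix
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))       # 3
print(edit_distance("intention", "execution"))  # 5
```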


### Basic Regex

The simplest kind of regular expression is a sequence of simple characters. To search for woodchuck, we type /woodchuck/. The expression /Buttercup/ matches any string containing the substring Buttercup; grep with that expression would return the line I’m called little Buttercup. The search string can consist of a single character (like /!/) or a sequence of characters (like /urgl/).
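A small sketch of this in Python: the slashes in /woodchuck/ are only notation, so the pattern is passed to `re.search` without them. The helper `grep` and the sample lines are made up for this example.

```python
import re

lines = [
    "I'm called little Buttercup",
    "How much wood would a woodchuck chuck?",
    "You've left the burglar behind again!",
]


def grep(pattern, lines):
    """Return the lines that contain a match for pattern, like grep."""
    return [line for line in lines if re.search(pattern, line)]


print(grep(r"Buttercup", lines))  # the first line
print(grep(r"!", lines))          # the third line
print(grep(r"urgl", lines))       # the third line ("burglar")
```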

Regular expressions are case sensitive; lower case /s/ is distinct from upper
case /S/ (/s/ matches a lower case s but not an upper case S). This means that
the pattern /woodchucks/ will not match the string Woodchucks. We can solve this
problem with the use of the square braces [ and ]. The string of characters inside the braces specifies a disjunction of characters to match.

For example, Fig. 2.2 shows
that the pattern /[wW]/ matches patterns containing either w or W.

![](https://i.imgur.com/E8CRgco.png)
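The same behaviour can be checked in Python; the sample sentence is an assumption made for this sketch.

```python
import re

sentence = "Woodchucks are rodents"

# Case sensitive: a lowercase w does not match the capital W.
plain = re.search(r"woodchucks", sentence)

# The brackets list the allowed alternatives for that one position.
bracketed = re.search(r"[wW]oodchucks", sentence)

print(plain)              # None
print(bracketed.group())  # Woodchucks
```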

The square braces can also be used to specify what a single character cannot be,
by use of the caret ^. If the caret ^ is the __first symbol after the open square brace__ [, the resulting pattern is negated. For example, the pattern /[^a]/ matches any single character (including special characters) except a.

This is only true when the caret is the first symbol after the open square brace. If it occurs anywhere else, it usually
stands for a literal caret.

![](https://i.imgur.com/i3WuhN8.png)
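A quick Python check of both caret positions, using made-up test strings:

```python
import re

# Caret first inside the brackets: match any character except 'a'.
not_a = re.findall(r"[^a]", "banana!")

# Caret elsewhere inside the brackets: a literal '^' is one of the options.
a_or_caret = re.findall(r"[a^]", "a^b")

print(not_a)       # ['b', 'n', 'n', '!']
print(a_or_caret)  # ['a', '^']
```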

How can we talk about optional elements, like an optional 's' in woodchuck and
woodchucks? We can’t use the square brackets, because while they allow us to say
“s or S”, they don’t allow us to say “s or nothing”. For this we use the question mark
/?/, which means “the preceding character or nothing”.
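For instance, in Python (the sample strings are assumptions for this sketch):

```python
import re

text = "One woodchuck met two woodchucks"

# The trailing s is optional, so both forms are found.
found = re.findall(r"woodchucks?", text)

# The same idea handles an optional letter in the middle of a word.
spellings = re.findall(r"colou?r", "color or colour")

print(found)      # ['woodchuck', 'woodchucks']
print(spellings)  # ['color', 'colour']
```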

We can think of the question mark as meaning “zero or one instances of the
previous character”. That is, it’s a way of specifying how many of something that we want, something that is very important in regular expressions. For example, consider the language of certain sheep, which consists of strings that look like the
following:

baa!
baaa!
baaaa!
baaaaa!

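One solution: `/baa+!/`

Here `+` matches the previous token one or more times, so the pattern requires a b, one a, and then at least one more a. (The shorter `/ba+!/` would also accept ba!, which has only one a.)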

Another approach: This language consists of strings with a b, followed by at least two a’s, followed by an exclamation point. The set of operators that allows us to say things like “some number of as” is based on the asterisk or *, commonly called the Kleene *.

The Kleene star means “zero or more occurrences
of the immediately previous character or regular expression”. So `/a*/` means “any
string of zero or more as”.
This will match a or aaaaaa, but it will also match Off Minor,
since the string Off Minor has zero a’s. So the regular expression for matching
one or more a is `/aa*/`, meaning one a followed by zero or more a's.
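The difference between `/a*/` and `/aa*/` can be seen directly in Python. The third test string is invented for this sketch.

```python
import re

# a* is satisfied by zero a's, so it matches the empty string at the start.
star = re.search(r"a*", "Off Minor")
print(repr(star.group()))  # ''

# aa* needs at least one a, and "Off Minor" has none.
one_or_more = re.search(r"aa*", "Off Minor")
print(one_or_more)  # None

# Each run of a's is returned as one match.
runs = re.findall(r"aa*", "a baa aaaaaa")
print(runs)  # ['a', 'aa', 'aaaaaa']
```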

More complex
patterns can also be repeated. So /[ab]*/ means “zero or more a’s __or__ b’s” (not
“zero or more right square braces”). This will match strings like aaaa or ababab or
bbbb.

![](https://i.imgur.com/YcQmUJu.png)

Here we match any character in a-z followed by zero or more a's or b's.
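A short Python sketch of both ideas. The pattern `[a-z][ab]*` is inferred from the description of the screenshot, so treat it as an assumption, as are the test strings.

```python
import re

candidates = ["aaaa", "ababab", "bbbb", "", "abc"]

# fullmatch requires the whole string to consist of a's and b's.
whole = {s: bool(re.fullmatch(r"[ab]*", s)) for s in candidates}
print(whole)
# {'aaaa': True, 'ababab': True, 'bbbb': True, '': True, 'abc': False}

# One lowercase letter followed by any run of a's and b's.
chunks = re.findall(r"[a-z][ab]*", "xab cb q")
print(chunks)  # ['xab', 'cb', 'q']
```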

For specifying multiple digits (useful for finding prices) we can extend /[0-9]/,
the regular expression for a single digit. An integer (a string of digits) is thus /[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)

- Because /[0-9]*/ matches zero or more occurrences, it also matches the empty string, so even a string with no digits, such as "abcd", returns a match.

![](https://i.imgur.com/CyaFc2u.png)
- no match

![](https://i.imgur.com/Rb8Jk3w.png)
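The two screenshots can be reproduced in Python; the last test string is an assumption made for this sketch.

```python
import re

# [0-9]* accepts zero digits, so it "matches" the empty string in "abcd".
empty_hit = re.search(r"[0-9]*", "abcd")
print(repr(empty_hit.group()))  # ''

# Requiring one digit first rules out the empty match.
no_hit = re.search(r"[0-9][0-9]*", "abcd")
print(no_hit)  # None

integers = re.findall(r"[0-9][0-9]*", "rooms 42 and 2024")
print(integers)  # ['42', '2024']
```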

Sometimes it’s annoying to have to write the regular expression for digits twice, so there is a shorter way to specify “at least one” of some character. This is the Kleene +, which means “one or more occurrences of the immediately preceding character or regular expression”. Thus, the expression /[0-9]+/ is the normal way to specify “a sequence of digits”. There are thus two ways to specify the sheep language: /baaa*!/ or /baa+!/.
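As a quick check, the sketch below runs both sheep-language patterns, plus the looser `ba+!` for contrast, over a few candidate strings. The candidate list is made up for this example.

```python
import re

sheep = ["ba!", "baa!", "baaa!", "baaaaa!"]
PATTERNS = [r"baaa*!", r"baa+!", r"ba+!"]

# For each pattern, keep the strings it matches in full.
results = {p: [s for s in sheep if re.fullmatch(p, s)] for p in PATTERNS}

for pattern, accepted in results.items():
    print(pattern, accepted)
```

The first two patterns accept the same strings and reject `ba!`, while `ba+!` accepts all four because it only asks for one a.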


One very important special character is the period (/./), a wildcard expression
that matches any single character (except a carriage return).

![](https://i.imgur.com/fv6jBVR.png)

This matches "aardvark", followed by zero or more occurrences of any character, followed by "aardvark" again.

![](https://i.imgur.com/zJFxTbM.png)
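A small Python sketch of the wildcard, with test strings invented for the example. One detail specific to Python's `re` module: by default the period matches any character except the newline `\n`, unless the `re.DOTALL` flag is passed.

```python
import re

# The period stands in for exactly one character.
variants = re.findall(r"beg.n", "begin begun beg'n began")
print(variants)  # ['begin', 'begun', "beg'n", 'began']

# .* lets anything appear between the two occurrences.
span = re.search(r"aardvark.*aardvark",
                 "an aardvark met another aardvark today")
print(span.group())  # aardvark met another aardvark

# Without re.DOTALL the period will not cross a line break.
across_lines = re.search(r"aardvark.*aardvark", "aardvark\naardvark")
print(across_lines)  # None
```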

