# Strings and Regular Expressions

Strings are a native data type in Python, which is one of the reasons Python is a go-to language for NLP tasks.

In this notebook, we will:
 - Explore strings up close
 - Understand the different operations that you can apply to strings
 - Learn string manipulation techniques

We will also briefly cover __Regular Expressions__ in this notebook.

In [2]:
# Zen of python
"""
The Zen of Python is a collection of 19 "guiding principles" 
for writing computer programs that influence the design 
of the Python programming language.
"""
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


## String Under the Hood

- A string is a sequence of characters enclosed within single (') or double (") quotes
- Each string object has an identity which you can get using the ``id()`` function; in CPython this is the object's memory address
- String is a native data-type

In [3]:
new_string = "This is a String"  # storing a string

print('ID:', id(new_string))  # shows the object identifier (address)
print('Type:', type(new_string))  # shows the object type
print('Value:', new_string)  # shows the object value

ID: 140450804196312
Type: <class 'str'>
Value: This is a String


As mentioned earlier in the section on data types, there are various types of strings. The following BNF (Backus-Naur Form) gives the general lexical definitions for producing strings, as seen in the official Python docs. Note that this is the Python 2 form of the grammar: Python 3 drops the `ur` prefix and adds prefixes such as `f` for formatted string literals.

```
stringliteral   ::=  [stringprefix](shortstring | longstring)
stringprefix    ::=  "r" | "u" | "ur" | "R" | "U" | "UR" | "Ur" | "uR"
                     | "b" | "B" | "br" | "Br" | "bR" | "BR"
shortstring     ::=  "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring      ::=  "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::=  shortstringchar | escapeseq
longstringitem  ::=  longstringchar | escapeseq
shortstringchar ::=  <any source character except "\" or newline or the quote>
longstringchar  ::=  <any source character except "\">
escapeseq       ::=  "\" <any ASCII character>
```


The above rules tell us different types of string prefixes exist which can be used with different string types to produce string literals. In simple terms, the following types of string literals are used the most.

- Short strings: These strings are usually enclosed with single (') or double quotes (") around the characters. Some examples would be, 'Hello' and "Hello".

- Long strings: These strings are usually enclosed with three single (''') or double quotes (""") around the characters. Some examples would be, """Hello, I'm a long string""" or '''Hello I\'m a long string '''. Do note the (\') indicates an escape sequence which we shall talk about soon.

- Escape sequences in strings: These strings often have escape sequences embedded in them. As the rule above shows, an escape sequence starts with a backslash (\) followed by any ASCII character, so these strings perform backslash interpolation. Popular escape sequences include (\n) indicating a new line character and (\t) indicating a tab.

- Bytes: These are used to represent bytestrings which create objects of the bytes data type. These strings can be created as bytes('...', encoding) or using the b'...' notation. Examples would be bytes('hello', 'utf-8') and b'hello'.

- Raw strings: These strings were originally created specifically for regular expressions (regex) and creating regex patterns. These strings can be created using the r'...' notation and keep the string in its raw or native form. Hence they do not perform any backslash interpolation and turn off the escape sequences. An example would be r'Hello'.

- Unicode: These strings support Unicode characters in text and they are usually non-ASCII character sequences. These strings are denoted with the u'...' notation. However, in Python 3.x all string literals are Unicode by default, so the prefix is optional. Besides the string notation, there are several specific ways to represent special Unicode characters in the string. The usual ones include the hex byte value escape sequence of the format '\xVV'. Besides this, we also have Unicode escape sequences of the form '\uVVVV' and '\UVVVVVVVV' where the first form uses 4 hex-digits for encoding a 16-bit character and the second uses 8 hex-digits for encoding a 32-bit character. Some examples would be u'H\xe8llo' and u'H\u00e8llo' which represent the string 'Hèllo'.
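
The cells that follow cover simple, multi-line, raw and Unicode strings but not bytes, so here is a minimal sketch of how `str` and `bytes` relate in Python 3. The variable names are illustrative and UTF-8 is an assumed encoding, not something the notebook prescribes.

```python
# str -> bytes requires an explicit encoding in Python 3
text = 'Hèllo'
encoded = text.encode('utf-8')      # equivalent to bytes(text, 'utf-8')
print(encoded)                      # the è is stored as two bytes in UTF-8

# bytes -> str is the reverse step
decoded = encoded.decode('utf-8')
print(decoded == text)

# a bytes literal may only contain ASCII characters
ascii_bytes = b'hello'
print(type(ascii_bytes), len(text), len(encoded))
```

Note that the lengths differ: `len` counts characters for `str` and bytes for `bytes`.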


In [3]:
# simple string
# You concatenate two strings with "+" operator
simple_string = 'Hello!' + " I'm a simple string"
print(simple_string)

Hello! I'm a simple string


In [4]:
# multi-line string, note the \n (newline) escape character automatically created
multi_line_string = """Hello I'm
a multi-line
string!"""

multi_line_string


"Hello I'm\na multi-line\nstring!"

In [5]:
print(multi_line_string)

Hello I'm
a multi-line
string!


In [7]:
# Normal string with escape sequences leading to a wrong file path!
escaped_string = "C:\the_folder\new_dir\file.txt"
print(escaped_string)  # will cause errors if we try to open a file here

C:	he_folder
ew_dirile.txt


In [8]:
# raw string keeping the backslashes in its normal form
# pay attention to "r" before the string quote starts
raw_string = r'C:\the_folder\new_dir\file.txt'
print(raw_string)

C:\the_folder\new_dir\file.txt


In [9]:
# unicode string literals
# python strings extend to unicode characters.
string_with_unicode = 'H\u00e8llo!'
print(string_with_unicode)

Hèllo!


In [10]:
# Let's show some love for pizza using unicode and python
more_unicode = 'I love Pizza 🍕!  Shall we book a cab 🚕 to get pizza?'
print(more_unicode)

I love Pizza 🍕!  Shall we book a cab 🚕 to get pizza?


In [11]:
print(string_with_unicode + '\n' + more_unicode)

Hèllo!
I love Pizza 🍕!  Shall we book a cab 🚕 to get pizza?


In [12]:
# you can join two strings using the ``join`` method
# syntax '<join-char>'.join([list of strings to be joined])
' '.join([string_with_unicode, more_unicode])

'Hèllo! I love Pizza 🍕!  Shall we book a cab 🚕 to get pizza?'

In [13]:
# reversing a string is as simple as this!
more_unicode[::-1]  # reverses the string

'?azzip teg ot 🚕 bac a koob ew llahS  !🍕 azziP evol I'

## String Operations

Strings are iterable sequences, so many operations can be performed on them, especially when processing and parsing textual data into easy-to-consume formats. We have categorized these operations into the following segments.

- Basic operations - concatenation, substrings etc.
- Indexing and Slicing
- Methods
- Formatting
- Regular Expressions 


These cover the most frequently used techniques for working with strings.

### Concatenation

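- The '+' operator concatenates strings; string literals placed next to each other (separated only by whitespace) are concatenated automatically
- Adjacent string literals can also be enclosed in parentheses '(s1 s2 ... sn)' and spread over several lines; they are joined exactly as written, with no separator added
- The '*' operator repeats a given string the specified number of times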

In [14]:
'Hello 😊' + ' and welcome ' + 'to Python 🐍!'

'Hello 😊 and welcome to Python 🐍!'

In [15]:
'Hello 😊' ' and welcome ' 'to Python 🐍!'

'Hello 😊 and welcome to Python 🐍!'

In [16]:
# concat variables and literals
s1 = 'Python 💻!'
'Hello 😊 ' + s1

'Hello 😊 Python 💻!'

In [17]:
# pattern based string concat
# or repeat a given string for specified number of times
s2 = '--🐍Python🐍--'
s2 * 5

'--🐍Python🐍----🐍Python🐍----🐍Python🐍----🐍Python🐍----🐍Python🐍--'

In [18]:
# of course, concatenate and repeat
(s1 + s2)*3

'Python 💻!--🐍Python🐍--Python 💻!--🐍Python🐍--Python 💻!--🐍Python🐍--'

In [19]:
# parentheses let adjacent string literals span multiple lines
s3 = ('This '
      'is another way '
      'to concatenate '
      'several strings!')
s3

'This is another way to concatenate several strings!'

### Substrings

- The "in" operator can be used to check if a sequence of characters is contained in a string

In [20]:
'way' in s3

True

### Length of String
- The "len" function is a simple way to get the length, i.e. the number of characters, of a given string object

In [21]:
len(s3)

51

## String Manipulation
- Strings can be manipulated in the same way we manipulate arrays, i.e. using indices
- Python strings can be sliced from both ends thanks to negative indexing

In [22]:
# creating a string
s = 'PYTHON'
s, type(s)

('PYTHON', str)

### Indexing

Strings are iterables which are sequences of characters. Hence they can be indexed, sliced and iterated through just like other iterables such as lists. Each character has a specific position in the string, called its index, and using indexes we can access specific parts of the string.

Accessing a single character using a specific position or index in the string is termed indexing, and accessing a part of a string, i.e. a substring, using a start and end index is termed slicing.

Python supports two types of indexes: one starts from 0 and increases by 1 per character until the end of the string, and the other starts from -1 at the end of the string and decreases by 1 per character until the beginning of the string.

![](https://i.imgur.com/u9tEKkM.png)

In [23]:
# depicting string indexes
for index, character in enumerate(s):
    print('Character ->', character, 'has index->', index)

Character -> P has index-> 0
Character -> Y has index-> 1
Character -> T has index-> 2
Character -> H has index-> 3
Character -> O has index-> 4
Character -> N has index-> 5


In [24]:
s[0], s[1], s[2], s[3], s[4], s[5]

('P', 'Y', 'T', 'H', 'O', 'N')

In [25]:
s[-1], s[-2], s[-3], s[-4], s[-5], s[-6]

('N', 'O', 'H', 'T', 'Y', 'P')
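
The two indexing schemes described above are tied together by the string length: the character at non-negative index `i` is also at negative index `i - len(s)`. The following is a small sketch of that relationship; the names `n` and `out_of_range` are illustrative.

```python
s = 'PYTHON'
n = len(s)

# each character is reachable through a non-negative and a negative index
for i, ch in enumerate(s):
    print(ch, i, i - n, s[i] == s[i - n])

# indexes outside the range -n .. n-1 raise an IndexError
try:
    s[n]
except IndexError as err:
    out_of_range = str(err)
    print('IndexError:', out_of_range)
```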

### Slicing

In [26]:
s[:] 

'PYTHON'

- Python strings are 0 indexed
- The range mentioned in the square brackets includes the lower bound but not the upper bound

In [27]:
s[1:4]

'YTH'

- You can count from the end as well
- -1 is the last index, -2 is second last and so on

In [29]:
s[-3:]

'HON'

In [30]:
s[:3] , s[-3:]

('PYT', 'HON')

- You can move within string with a stride greater than 1 as well

In [31]:
# slicing with offsets
s[::2]  # print every 2nd character in string

'PTO'
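
The slicing cells above can be summarised by the general form `s[start:stop:step]`. Below is a short sketch with a few more combinations, including the fact that slices (unlike single indexes) tolerate out-of-range bounds. The variable names are illustrative.

```python
s = 'PYTHON'

# general form: s[start:stop:step]; omitted parts fall back to defaults
every_other = s[1:5:2]     # indexes 1 and 3
backwards = s[::-1]        # a negative step walks the string in reverse
clipped = s[4:100]         # out-of-range bounds are clipped, no error
empty = s[3:3]             # start == stop gives an empty string

# a slice object names the same kind of selection
first_three = slice(0, 3)
print(every_other, backwards, clipped, repr(empty), s[first_three])
```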

## Strings are IMMUTABLE
- You cannot change a string in place; any "modification" creates a new string object

In [32]:
# strings are immutable hence assignment throws error
s[0] = 'X'

TypeError: 'str' object does not support item assignment

In [33]:
print('Original String id:', id(s))
# creates a new string
s = 'X' + s[1:]
print(s)
print('New String id:', id(s))

Original String id: 140402619031032
XYTHON
New String id: 140402619112000
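
Because strings are immutable, string methods return new objects rather than editing the original. When several characters need to change, one common approach (a sketch, not the only option) is to go through a mutable list and join the pieces back together.

```python
s = 'PYTHON'

# methods never modify the string in place; they return a new object
lowered = s.lower()
print(s, lowered)

# to edit individual characters, use a mutable list and join it back
chars = list(s)
chars[0] = 'X'
edited = ''.join(chars)
print(edited)
```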


## Case Conversion
- Python loves strings and provides a number of utilities to handle various aspects such as case conversion, etc.
- Different conversion methods are available by default with all string objects

In [34]:
s = 'python is great'

In [35]:
s.capitalize()

'Python is great'

In [36]:
s.upper()

'PYTHON IS GREAT'

In [37]:
s.title()

'Python Is Great'

## Replacement
- To substitute specific characters in a string, simply use ``replace`` method

In [38]:
s.replace('python', 'NLP')

'NLP is great'

## Checks
- Strings can be checked whether they are composed of:
  - numbers -> ``isdecimal``
  - alpha-numeric -> ``isalnum``
  - alphabets -> ``isalpha``

In [39]:
'12345'.isdecimal()

True

In [41]:
'number1'.isalpha()

False

In [42]:
'total'.isalnum()

True

In [43]:
'1+1'.isalnum()

False
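
The numeric checks differ in subtle ways once non-ASCII characters are involved. The sketch below compares ``isdecimal`` with the related ``isdigit`` and ``isnumeric`` methods, and shows that none of them accepts a sign or a decimal point. `looks_like_int` is a hypothetical helper written for this illustration.

```python
samples = ['12345', '²', '½', '12.5', '-3']

# columns: isdecimal, isdigit, isnumeric
for text in samples:
    print(repr(text), text.isdecimal(), text.isdigit(), text.isnumeric())

def looks_like_int(text):
    """Hypothetical helper: optional sign followed by decimal digits only."""
    body = text[1:] if text[:1] in ('+', '-') else text
    return body.isdecimal()

print(looks_like_int('-3'), looks_like_int('12.5'))
```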

## Splits and Joins
- Splitting strings based on specific characters is pretty simple. ``split(<split-char>)``
- Joining is also simple. ``<join-character>.join([list of string objects])``

In [44]:
s = 'I,am,a,comma,separated,string'
s.split(',')

['I', 'am', 'a', 'comma', 'separated', 'string']

In [45]:
' '.join(s.split(','))

'I am a comma separated string'

In [46]:
# stripping whitespace characters
s = '   I am surrounded by spaces    '
s

'   I am surrounded by spaces    '

In [47]:
s.strip()

'I am surrounded by spaces'
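
A few variations on splitting and stripping come up often when parsing text. The following sketch uses a made-up `line` value to show ``split`` without arguments, the ``maxsplit`` argument, ``partition``, and stripping specific characters from one or both sides.

```python
line = '  name = Monty Python ; year=1969  '

# split() with no argument splits on runs of whitespace and drops empty pieces
tokens = line.split()

# partition always returns three parts: before, separator, after
key, _, rest = line.strip().partition(';')

# maxsplit=1 stops after the first '='
name_key, name_value = [part.strip() for part in key.split('=', 1)]

# strip accepts a set of characters; lstrip/rstrip work on one side only
both = '--title--'.strip('-')
left = '--title--'.lstrip('-')
right = '--title--'.rstrip('-')

print(tokens, name_key, name_value, rest.strip(), both, left, right, sep='\n')
```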

## Formatting
- Working with text would be far less fun if we could not format strings
- Python provides a number of ways of formatting strings

In [48]:
'Hello %s' %('Python!')

'Hello Python!'

In [49]:
'Hello %s %s' %('World!', 'How are you?')

'Hello World! How are you?'

In [50]:
'We have %d %s containing %.2f gallons of %s' %(2, 'bottles', 2.5, 'milk')

'We have 2 bottles containing 2.50 gallons of milk'

In [51]:
'We have %d %s containing %.2f gallons of %s' %(5.21, 'jugs', 10.86763, 'juice')

'We have 5 jugs containing 10.87 gallons of juice'

### New Style
- Recent versions of Python prefer the ``str.format`` method for formatting strings

In [52]:
'Hello {} {}, it is a great {} to meet you at {}'.format('Mr.', 'Jones', 'pleasure', 5)

'Hello Mr. Jones, it is a great pleasure to meet you at 5'

In [53]:
'Hello {} {}, it is a great {} to meet you at {} o\' clock'.format('Sir', 'Arthur', 'honor', 9)

"Hello Sir Arthur, it is a great honor to meet you at 9 o' clock"

In [54]:
'I have a {food_item} and a {drink_item} with me'.format(drink_item='soda', food_item='sandwich')

'I have a sandwich and a soda with me'
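
Beyond simple placeholders, ``format`` supports positional indexes and a format specification after a colon. The sketch below also shows f-strings, which assume Python 3.6 or later and are not otherwise used in this notebook. The values are made up for illustration.

```python
item, qty, price = 'milk', 3, 2.5

# positional indexes let arguments be reused or reordered
reused = '{0} and more {0}, {1} times'.format(item, qty)

# the format spec after ':' controls width, alignment and precision
row = '|{:<8}|{:>4d}|{:8.2f}|'.format(item, qty, price)

# f-strings (Python 3.6+) evaluate expressions inline
total = f'{qty} x {price:.2f} = {qty * price:.2f}'

print(reused, row, total, sep='\n')
```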

# Regular Expressions
A regular expression is a sequence of characters that defines a search pattern. Patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Regular expressions, more popularly known as regexes, allow you to create string patterns and use them for searching and substituting specific pattern matches in textual data. Python offers a simple yet powerful module named ``re`` for creating and using regular expressions. Entire books have been written on this topic because it is easy to use but difficult to master. Discussing every aspect of regular expressions is not possible in the current scope, but we will cover the main areas with enough examples to get started.

Regexes are specific patterns often denoted using the raw string notation. These patterns match a specific set of strings based on the rules expressed by the patterns. The patterns are usually compiled into bytecode which is then executed by a matching engine. The re module also provides several flags which can change the way the pattern matches are executed. Some important flags are as follows.

- re.I or re.IGNORECASE is used to match patterns ignoring case sensitivity
- re.S or re.DOTALL causes the period (.) character to match any character including new lines
- re.U or re.UNICODE helps in matching Unicode based characters also (redundant in Python 3.x, where it is already the default for string patterns)
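
As a quick illustration of the flags listed above, the sketch below runs the same patterns with and without ``re.IGNORECASE`` and ``re.DOTALL`` on a made-up three-line string, and combines both flags with the bitwise OR operator.

```python
import re

text = 'Python\npython\nPYTHON rocks'

# IGNORECASE: one pattern, three spellings
all_cases = re.findall('python', text, flags=re.IGNORECASE)

# without DOTALL the dot stops at a newline; with it the dot crosses lines
no_dotall = re.search('Python.+rocks', text)
with_dotall = re.search('Python.+rocks', text, flags=re.DOTALL)

# flags combine with the bitwise OR operator
combined = re.search('^python.+ROCKS$', text, flags=re.IGNORECASE | re.DOTALL)

print(all_cases, no_dotall, with_dotall is not None, combined is not None)
```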


In [4]:
s1 = 'Python is an excellent language'
s2 = 'I love the Python language. I also use Python to build applications at work!'

In [5]:
import re

pattern = 'Python'
# match only returns a match if regex match is found at the beginning of the string
print(re.match(pattern, s1))

<re.Match object; span=(0, 6), match='Python'>


In [7]:
tmp = re.match(pattern, s1)
tmp.group()

'Python'

## Let's find some strings

In [8]:
# the ignore case flag makes matching case-insensitive, so the
# same pattern also matches 'python', 'PYTHON' and other spellings
re.match(pattern, s1, flags=re.IGNORECASE)

<re.Match object; span=(0, 6), match='Python'>

In [9]:
# printing matched string and its indices in the original string
m = re.match(pattern, s1, flags=re.IGNORECASE)
print('Found match {} ranging from index {} - {} in the string "{}"'.format(m.group(0), 
                                                                            m.start(), 
                                                                            m.end(), s1))

Found match Python ranging from index 0 - 6 in the string "Python is an excellent language"


In [10]:
# match does not work when pattern is not there in the beginning of string s2
re.match(pattern, s2, re.IGNORECASE)

In [65]:
# illustrating find and search methods using the re module
re.search(pattern, s2, re.IGNORECASE)

<re.Match object; span=(11, 17), match='Python'>

In [66]:
# find all substrings 
re.findall(pattern, s2, re.IGNORECASE)

['Python', 'Python']

In [67]:
# get an iterator for matches found in a string
match_objs = re.finditer(pattern, s2, re.IGNORECASE)
match_objs

<callable_iterator at 0x7fb2083c04e0>

In [68]:
print("String:", s2)
for m in match_objs:
    print('Found match "{}" ranging from index {} - {}'.format(m.group(0), 
                                                               m.start(), m.end()))

String: I love the Python language. I also use Python to build applications at work!
Found match "Python" ranging from index 11 - 17
Found match "Python" ranging from index 39 - 45


For pattern matching, there are various rules used in regexes. Some popular ones are mentioned as follows.

-	`.` for matching any single character
-	`^` for matching the start of the string
-	`$` for matching the end of the string
-	`*` for matching zero or more cases of the previous mentioned regex before the `*` symbol in the pattern
-	`?` for matching zero or one case of the previous mentioned regex before the `?` symbol in the pattern
-	`[...]` for matching any one of the set of characters inside the square brackets
-	`[^...]` for matching a character not present in the square brackets after the `^` symbol
-	`|` denotes the OR operator for matching either the preceding or the next regex
-	`+` for matching one or more cases of the previous mentioned regex before the `+` symbol in the pattern
-	`\d` for matching decimal digits which is also depicted as `[0-9]`
-	`\D` for matching non-digits, also depicted as `[^0-9]`
-	`\s` for matching white space characters
-	`\S` for matching non whitespace characters
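-	`\w` for matching alpha-numeric characters and the underscore, also depicted as `[a-zA-Z0-9_]` for ASCII text (Unicode letters and digits also match in Python 3 string patterns)
-	`\W` for matching non alpha-numeric characters, also depicted as `[^a-zA-Z0-9_]` for ASCII text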
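
To make the rules above concrete, here is a small sketch that applies several of them with ``re.findall`` and ``re.search``. The `log` string and variable names are made up for this illustration.

```python
import re

log = 'Order #42 shipped on 2021-01-18; order #7 is pending.'

numbers = re.findall(r'\d+', log)                         # + : one or more digits
order_ids = re.findall(r'#\d+', log)                      # literal '#' then digits
statuses = re.findall(r'shipped|pending', log)            # | : either alternative
spellings = re.findall(r'colou?r', 'color colour colr')   # ? : optional 'u'
vowel_free = re.findall(r'[^aeiou\s]+', 'regex rules')    # [^...] : negated set
starts_with_order = bool(re.search(r'^Order', log))       # ^ : start of string
ends_with_period = bool(re.search(r'\.$', log))           # $ : end of string

print(numbers, order_ids, statuses, spellings, vowel_free, sep='\n')
```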


Regular expressions can be compiled into pattern objects and then they can be used with a variety of methods for pattern search and substitution in strings. The main methods offered by the re module for performing these operations are described briefly as follows.

-	`re.compile()`: This method compiles a specified regular expression pattern into a regular expression object, which can be used for matching and searching. Takes a pattern and optional flags as input which we discussed above.
-	`re.match()`: This method is used to match patterns at the beginning of strings.
-	`re.search()`: This method is used to match patterns occurring at any position in the string.
-	`re.findall()`: This method returns all non-overlapping matches of the specified regex pattern in the string.
-	`re.finditer()`: This method returns all matched instances in the form of an iterator, for a specific pattern in a string when scanned from left to right.
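-	`re.sub()`: This method is used to substitute a specified regex pattern in a string with a replacement string. By default it substitutes all non-overlapping occurrences of the pattern; the optional `count` argument limits the number of substitutions, starting from the left-most occurrence.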
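
The methods above are also available on a compiled pattern object, which avoids repeating the pattern and flags on every call. The sketch below reuses the `s2` sentence from the earlier cells; `python_re` is an illustrative name.

```python
import re

s2 = 'I love the Python language. I also use Python to build applications at work!'

# compile once, then reuse the pattern object's methods
python_re = re.compile(r'python', flags=re.IGNORECASE)

at_start = python_re.match(s2)                        # None: not at the beginning
first = python_re.search(s2)                          # first match anywhere
spans = [m.span() for m in python_re.finditer(s2)]    # positions of all matches

# sub replaces every match unless count limits it
replace_all = python_re.sub('Java', s2)
replace_first = python_re.sub('Java', s2, count=1)

print(at_start, first.group(), spans, replace_all, replace_first, sep='\n')
```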


## Time for some Substitution 
Regular expressions are useful in locating different patterns in a given string. This search capability is further enhanced by utilities to replace the substrings as well.

The python ``re`` library provides the following methods for substitution:
- ``sub``
- ``subn``

In [84]:
re  # evaluating the module name shows where it was loaded from

<module 're' from '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py'>

In [69]:
# illustrating pattern substitution using sub and subn methods
re.sub(pattern, 'Java', s2, flags=re.IGNORECASE)

'I love the Java language. I also use Java to build applications at work!'

In [82]:
s3 = "I see Java, Python, foo, Java"
# subn returns a tuple: (new string, number of substitutions made)
re.subn(pattern, 'Java', s3, flags=re.IGNORECASE)

('I see Java, Java, foo, Java', 1)

## Regex and Unicode Strings
Python handles Unicode strings well, and this capability carries over to the ``re`` package. We can search for patterns in a Unicode string, and the patterns themselves may contain Unicode characters.

In [71]:
# dealing with unicode matching using regexes
s = u'H\u00e8llo! this is Python 🐍'
s

'Hèllo! this is Python 🐍'

In [85]:
re.findall(r'\w+', s)

['Hèllo', 'this', 'is', 'Python']

In [86]:
re.findall(r"[A-Z]\w+", s, re.UNICODE)

['Hèllo', 'Python']

In [87]:
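# one character class listing the emoji code point ranges
# (quotes and '|' inside [...] would be matched as literal characters)
emoji_pattern = r"[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"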
re.findall(emoji_pattern, s, re.UNICODE)

['🐍']

## Regex Anchors

Anchors are a special feature of the regex library. They allow us to restrict a pattern search to specific positions in the string. Some anchors are written with the escape character `\`, while others are single metacharacters. The following are a few popular anchors:
- `\A`: Search restricted to the start of the string
- `\Z`: Search restricted to the end of the string
- `^`: Search at the start of a line. Without the `re.MULTILINE` flag this is the same as `\A`, i.e. only the start of the string matches
- `$`: Search at the end of a line. Without the `re.MULTILINE` flag only the end of the string (or just before a trailing newline) matches
- `\b`: Search at a word's boundary

In [88]:
# search at the start of string
bool(re.search(r'\Acat','cater'))

True

In [89]:
# this would return False as cat is not at the starting of string
bool(re.search(r'\Acat','concatenation'))

False

In [90]:
# match the end of the string and append text there
re.sub(r'\Z','er','hack')

'hacker'

In [91]:
# search at start of string
bool(re.search(r'^hi','hi, welcome to python\nhi to everyone'))

True
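
The `^` search above finds only the first `hi` because, by default, `^` anchors to the start of the whole string. The sketch below contrasts that with the ``re.MULTILINE`` flag, under which `^` and `$` match at every line while `\A` still refers to the start of the string.

```python
import re

text = 'hi, welcome to python\nhi to everyone'

default_hits = re.findall(r'^hi', text)                             # start of string only
multiline_hits = re.findall(r'^hi', text, flags=re.MULTILINE)       # start of every line
string_start_hits = re.findall(r'\Ahi', text, flags=re.MULTILINE)   # \A ignores MULTILINE
line_ends = re.findall(r'\w+$', text, flags=re.MULTILINE)           # last word of each line

print(default_hits, multiline_hits, string_start_hits, line_ends)
```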

In [100]:
sentence = "par spar parts spare"

In [93]:
# this will replace all occurrences
re.sub(r'par','X',sentence)

'X sX Xts sXe'

In [95]:
re.sub(r'par', '', sentence)

' s ts se'

In [96]:
# this will replace only matches at the start of a word
re.sub(r'\bpar','X',sentence)

'X spar Xts spare'

In [101]:
# this will replace only matches at the end of a word
re.sub(r'par\b','X',sentence)

'X sX parts spare'

## Alternation and Grouping

``re`` provides smart ways of combining multiple search patterns in a single command.
- `|`: this acts like a logical OR
- Parentheses can be used to group multiple alternatives

In [102]:
pattern = "cat|dog"
sentence = "cats and dogs make good pets"

# this will search multiple patterns at once
re.sub(pattern,'X',sentence)

'Xs and Xs make good pets'

In [103]:
pattern = "re(form|lated)"
sentence = "all reforms are not related to forests"

# this will search multiple patterns at once using grouping
re.sub(pattern,'X',sentence)

'all Xs are not X to forests'
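
Parentheses do more than group alternatives: they also capture the text that matched. The following sketch, reusing the sentence from the cell above, shows how to read a captured group, how ``findall`` behaves when a group is present, the non-capturing form `(?:...)`, and a backreference in a replacement string.

```python
import re

sentence = 'all reforms are not related to forests'

# a capturing group remembers which alternative matched
suffixes = [m.group(1) for m in re.finditer(r're(form|lated)', sentence)]

# findall returns only the group contents when the pattern has one group
findall_groups = re.findall(r're(form|lated)', sentence)

# (?:...) groups without capturing, so findall returns whole matches
whole_matches = re.findall(r're(?:form|lated)', sentence)

# \1 in the replacement reuses the captured text
tagged = re.sub(r're(form|lated)', r'<re-\1>', sentence)

print(suffixes, findall_groups, whole_matches, tagged, sep='\n')
```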

# End to End Example

We covered a number of concepts related to __strings__ and __regular expressions__. Time to work through an end-to-end example to apply everything we have learnt so far.

In [None]:
# get a sample corpus
import nltk
nltk.download('gutenberg')

### Import Libraries

In [None]:
from nltk.corpus import gutenberg
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
bible = gutenberg.open('bible-kjv.txt')
bible = bible.readlines()

# get first few lines
bible[:5]

In [None]:
# remove newline characters from sentences
bible = list(filter(None, [item.strip('\n') for item in bible]))
bible[:5]

## How long are Sentences in the Bible?

In [None]:
# in terms of number of characters per sentence
line_lengths = [len(sentence) for sentence in bible]
h = plt.hist(line_lengths)

### Number of words per Sentence?

In [None]:
tokens = [item.split() for item in bible]
print(tokens[:5])

In [None]:
total_tokens_per_line = [len(sentence.split()) for sentence in bible]
h = plt.hist(total_tokens_per_line, color='orange')

In [None]:
words = [word for sentence in tokens for word in sentence]
print(words[:20])

## Filter out non-alphabetic characters

In [None]:
words = list(filter(None, [re.sub(r'[^A-Za-z]', '', word) for word in words]))
print(words[:20])

### How Frequently are these words used?

In [None]:
from collections import Counter

words = [word.lower() for word in words]
c = Counter(words)
c.most_common(10)

## StopWords?
“Stop words” typically refers to the most common words in a language, which do not contribute much to the content of the corpus.

### Remove Stopwords

``nltk`` to the rescue. ``nltk`` has a long list of stopwords in the English language. We can use this list to clean our word list.


In [None]:
# download stopwords
nltk.download('stopwords')

In [None]:
import nltk 

# get list of stopwords (a set makes the membership test fast)
stopwords = set(nltk.corpus.stopwords.words('english'))
# filter out stop words
words = [word.lower() for word in words if word.lower() not in stopwords]

# get frequency counts of cleaned list
c = Counter(words)
c.most_common(10)
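
The steps of this walkthrough can be condensed into one function. The following is a minimal sketch that runs without ``nltk``: the three sample `lines` and the tiny `stop_words` set are stand-ins chosen for illustration, and `top_words` is a hypothetical helper, not part of any library.

```python
import re
from collections import Counter

# sample lines and a tiny stopword set, standing in for the corpus and nltk's list
lines = [
    'In the beginning God created the heaven and the earth.',
    '',
    'And the earth was without form, and void;',
]
stop_words = {'in', 'the', 'and', 'was'}

def top_words(lines, stop_words, n=3):
    """Strip lines, tokenize, keep letters only, lowercase, filter and count."""
    counts = Counter()
    for line in filter(None, (item.strip() for item in lines)):
        for token in line.split():
            word = re.sub(r'[^A-Za-z]', '', token).lower()
            if word and word not in stop_words:
                counts[word] += 1
    return counts.most_common(n)

print(top_words(lines, stop_words, n=1))
```

Swapping in the Gutenberg lines and the ``nltk`` stopword list from the cells above should give the same kind of frequency table as the step-by-step version.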