
##Collections Module in Python

Python’s collections module has specialized container datatypes that can be used to replace Python’s general-purpose containers (dict, tuple, list, and set).

###Overview of ChainMap
A ChainMap is a class that links multiple mappings together so that they behave as a single unit. Its constructor accepts any number of mappings or dictionaries and turns them into a single view that we can update. Lookups search the mappings in the order they were passed in and return the first match, while updates are applied only to the first mapping.

In [1]:
from collections import ChainMap

car_parts = {"hood": 500, "engine": 5000, "front_door": 750}
car_options = {"A/C": 1000, "Turbo": 2500, "rollbar": 300}
car_accessories = {"cover": 100, "hood_ornament": 150, "seat_cover": 99}
car_pricing = ChainMap(car_accessories, car_options, car_parts)
print(car_pricing["hood"])

500
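
The lookup order and update behaviour can be sketched with a smaller chain. The dictionary names `defaults` and `overrides` are made up for this illustration and are not part of the car example above.

```python
from collections import ChainMap

# Hypothetical mappings used only for this illustration
defaults = {"color": "red", "wheels": 4}
overrides = {"color": "blue"}

settings = ChainMap(overrides, defaults)

# Lookups search the mappings left to right, so the first match wins
print(settings["color"])   # blue
print(settings["wheels"])  # 4

# Writes only ever touch the first mapping in the chain
settings["wheels"] = 6
print(overrides)  # {'color': 'blue', 'wheels': 6}
print(defaults)   # {'color': 'red', 'wheels': 4}
```

Because writes land in the first mapping, putting an empty dictionary first is a common way to keep the underlying mappings untouched.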


###Counter

The collections module also provides us with a neat little tool that supports convenient and fast tallies. This tool is called Counter.

In [3]:
from collections import Counter

print(Counter("superfluous"))


counter = Counter("superfluous")
print(counter["u"])

Counter({'u': 3, 's': 2, 'p': 1, 'e': 1, 'r': 1, 'f': 1, 'l': 1, 'o': 1})
3


###Simple example of counter.elements()
The Counter provides a few methods that might be of interest. For example, we can call elements, which returns an iterator over the elements, repeating each one as many times as its count. Elements come out grouped in the order they were first encountered rather than in their original positions, so we can think of the output as a regrouped version of the string.

In [4]:
from collections import Counter

counter = Counter("superfluous")
print(list(counter.elements()))

['s', 's', 'u', 'u', 'u', 'p', 'e', 'r', 'f', 'l', 'o']


###Testing counter.most_common() function
Another useful method is most_common. We can ask the Counter what the most common items are by passing in a number that represents the top recurring “n” items:

In [8]:
from collections import Counter

counter = Counter("superfluous")
print(counter.most_common(2))

[('u', 3), ('s', 2)]


Here we just ask our Counter what the top two most recurring items were. As we can see, it produced a list of tuples that tells us “u” occurred thrice and “s” occurred twice.

###subtract() method

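The subtract method accepts an iterable or a mapping and subtracts its counts from the Counter in place. It returns None, which is why the second print below shows None, and the resulting counts may drop to zero or go negative.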

In [9]:
from collections import Counter

counter_one = Counter("superfluous")
print(counter_one)


counter_two = Counter("super")
print(counter_one.subtract(counter_two))


print(counter_one)

Counter({'u': 3, 's': 2, 'p': 1, 'e': 1, 'r': 1, 'f': 1, 'l': 1, 'o': 1})
None
Counter({'u': 2, 's': 1, 'f': 1, 'l': 1, 'o': 1, 'p': 0, 'e': 0, 'r': 0})
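
As a rough comparison, the sketch below contrasts subtract() with the - operator, which Counter also supports. The `stock` and `sold` counters are invented for this illustration.

```python
from collections import Counter

# Hypothetical counts used only for this illustration
stock = Counter(apples=3, pears=1)
sold = Counter(apples=1, pears=2)

# The - operator returns a new Counter and drops zero or negative counts
difference = stock - sold
print(difference)  # Counter({'apples': 2})

# subtract() modifies the counter in place, returns None, and keeps negatives
stock.subtract(sold)
print(stock)  # Counter({'apples': 2, 'pears': -1})
```

Which one fits depends on whether negative tallies are meaningful for the data at hand.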


###defaultdict

The collections module has a handy tool called defaultdict. The defaultdict is a subclass of Python’s dict that accepts a default_factory as its first argument. The default_factory is usually a Python type, such as int or list, but we can also use a function or a lambda. Whenever a missing key is accessed, the default_factory is called with no arguments to produce that key’s initial value.

Simple example of counting the occurrence of words


Let’s start by creating a regular Python dictionary that counts the number of times each word is used in a sentence:

In [10]:
sentence = "The red for jumped over the fence and ran to the zoo for food"
words = sentence.split(' ')

reg_dict = {}
for word in words:
    if word in reg_dict:
        reg_dict[word] += 1
    else:
        reg_dict[word] = 1

print(reg_dict)

{'The': 1, 'red': 1, 'for': 2, 'jumped': 1, 'over': 1, 'the': 2, 'fence': 1, 'and': 1, 'ran': 1, 'to': 1, 'zoo': 1, 'food': 1}


Now let’s try doing the same thing with defaultdict!

In [11]:
from collections import defaultdict

sentence = "The red for jumped over the fence and ran to the zoo for food"
words = sentence.split(' ')

d = defaultdict(int)
for word in words:
    d[word] += 1

print(d)

defaultdict(<class 'int'>, {'The': 1, 'red': 1, 'for': 2, 'jumped': 1, 'over': 1, 'the': 2, 'fence': 1, 'and': 1, 'ran': 1, 'to': 1, 'zoo': 1, 'food': 1})


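Trying Python list type as default_factory

As a baseline, here is how a regular dictionary groups values by account number; a defaultdict(list) removes the need for the if/else check: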

In [12]:
my_list = [(1234, 100.23), (345, 10.45), (1234, 75.00),
           (345, 222.66), (678, 300.25), (1234, 35.67)]

reg_dict = {}
for acct_num, value in my_list:
    if acct_num in reg_dict:
        reg_dict[acct_num].append(value)
    else:
        reg_dict[acct_num] = [value]

print(reg_dict)

{1234: [100.23, 75.0, 35.67], 345: [10.45, 222.66], 678: [300.25]}
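
For comparison, here is a minimal sketch of the same grouping with list as the default_factory. It reuses the `my_list` data from the cell above; the variable name `d` is only illustrative.

```python
from collections import defaultdict

my_list = [(1234, 100.23), (345, 10.45), (1234, 75.00),
           (345, 222.66), (678, 300.25), (1234, 35.67)]

d = defaultdict(list)
for acct_num, value in my_list:
    # A missing key is created with an empty list before append runs
    d[acct_num].append(value)

print(dict(d))
```

The result holds the same groups as the regular-dictionary version, without the if/else check.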


Using lambda as default_factory

In [13]:
from collections import defaultdict

animal = defaultdict(lambda: "Monkey")
animal['Sam'] = 'Tiger'

print (animal['Nick'])

print (animal)

Monkey
defaultdict(<function <lambda> at 0x7e3264b8b920>, {'Sam': 'Tiger', 'Nick': 'Monkey'})


Here we create a defaultdict that assigns ‘Monkey’ as the default value to any missing key. We set the first key to ‘Tiger’ and don’t set the next key at all. When we print the second key, we see that ‘Monkey’ is assigned to it. As this shows, it’s basically impossible to cause a KeyError as long as we set the default_factory to something that makes sense. If we set the default_factory to None, however, we will receive a KeyError.

In [14]:
from collections import defaultdict

x = defaultdict(None)
x['Mike']

KeyError: 'Mike'

###deque

According to the Python documentation, “deques are a generalization of stacks and queues”. The name is pronounced “deck” and is short for “double-ended queue”. A deque can stand in for a Python list when we need to add or remove items at both ends. Deques support thread-safe, memory-efficient appends and pops from either side, whereas a list is optimized for fast fixed-length operations and becomes slow when items are inserted or removed at its front. A deque accepts a maxlen argument which sets the bounds for the deque; otherwise the deque will grow to an arbitrary size. When a bounded deque is full, any new items added will cause the same number of items to be popped off the other end.

In [15]:
from collections import deque
import string

d = deque(string.ascii_lowercase)
for letter in d:
    print(letter)

a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z


###Testing different deque methods



In [17]:
from collections import deque
import string

d = deque(string.ascii_lowercase)

d.append("bork")
print(d)

d.appendleft("test")
print(d)

d.rotate(1)
print(d)

deque(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bork'])
deque(['test', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'bork'])
deque(['bork', 'test', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'])
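
The maxlen behaviour described earlier is not exercised by the cells above, so here is a small sketch of it. The name `recent` and the numbers are arbitrary choices for this illustration.

```python
from collections import deque

# A bounded deque that keeps at most three items
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)
print(recent)  # deque([2, 3, 4], maxlen=3)

# Adding on the left pushes an item off the right end
recent.appendleft(99)
print(recent)  # deque([99, 2, 3], maxlen=3)
```

This makes a bounded deque a convenient way to keep only the most recent few items of a stream.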


###namedtuple

Creating namedtuple

In [18]:
from collections import namedtuple

Parts = namedtuple("Parts", "id_num desc cost amount")
auto_parts = Parts(id_num="1234", desc="Ford Engine", cost=1200.00, amount=10)
print(auto_parts.id_num)

1234


One of the benefits of using a namedtuple over a regular tuple is that we no longer have to keep track of each item’s index because now each item is named and accessed via a class property.

In [20]:
# For comparison, a regular tuple: items are accessed by index or by unpacking

auto_parts = ("1234", "Ford Engine", 1200.00, 10)
print(auto_parts[0])

id_num, desc, cost, amount = auto_parts
print(desc)

1234
Ford Engine


In [21]:
# Convert a Python dictionary into a namedtuple object

from collections import namedtuple

# The dictionary keys become the field names of the namedtuple
parts_dict = {'id_num': '1234', 'desc': 'Ford Engine',
              'cost': 1200.00, 'amount': 10}
parts = namedtuple('Parts', parts_dict.keys())(**parts_dict)
print(parts)

Parts(id_num='1234', desc='Ford Engine', cost=1200.0, amount=10)
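
A namedtuple is still a tuple underneath, which the following sketch tries to show. It reuses the Parts definition from the earlier cell; the names `part` and `cheaper` are only illustrative.

```python
from collections import namedtuple

Parts = namedtuple("Parts", "id_num desc cost amount")
part = Parts("1234", "Ford Engine", 1200.00, 10)

# Still a tuple: indexing and unpacking work as before
print(part[1])  # Ford Engine
id_num, desc, cost, amount = part

# Fields are read-only; _replace returns a modified copy
cheaper = part._replace(cost=999.00)
print(part.cost, cheaper.cost)  # 1200.0 999.0

# _asdict converts back to a dictionary
print(part._asdict()["amount"])  # 10
```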


###OrderedDict

Python’s collections module has another great subclass of dict known as OrderedDict. As the name implies, this dictionary keeps track of the order of the keys as they are added. Before Python 3.7, a regular dict was an unordered data collection; since Python 3.7, a regular dict also preserves insertion order, as the output below shows.

In [25]:
d = {"banana": 3, "apple": 4, "pear": 1, "orange": 2}
print(d)

{'banana': 3, 'apple': 4, 'pear': 1, 'orange': 2}


In Python versions before 3.7, the order could differ from one run to the next; in current versions the keys come back in insertion order. Either way, there are times when we will need to loop over the keys of our dictionary in a specific order. For example, we may have a use case where we need the keys sorted so we can loop over them in order.

In [26]:
d = {"banana": 3, "apple": 4, "pear": 1, "orange": 2}
keys = d.keys()
print(keys)


keys = sorted(keys)
print(keys)


for key in keys:
    print(key, d[key])

dict_keys(['banana', 'apple', 'pear', 'orange'])
['apple', 'banana', 'orange', 'pear']
apple 4
banana 3
orange 2
pear 1


In [27]:
#Sorting the dictionary's keys using OrderedDict

from collections import OrderedDict

d = {"banana": 3, "apple": 4, "pear": 1, "orange": 2}
new_d = OrderedDict(sorted(d.items()))
print(new_d)


for key in new_d:
    print(key, new_d[key])

OrderedDict([('apple', 4), ('banana', 3), ('orange', 2), ('pear', 1)])
apple 4
banana 3
orange 2
pear 1


In [28]:
# Testing reverse iteration using OrderedDict

from collections import OrderedDict

d = {"banana": 3, "apple": 4, "pear": 1, "orange": 2}
new_d = OrderedDict(sorted(d.items()))

for key in reversed(new_d):
    print(key, new_d[key])

pear 1
orange 2
banana 3
apple 4
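
Beyond sorting and reversing, OrderedDict has a couple of order-aware features that a plain dict lacks. The sketch below shows move_to_end and order-sensitive equality; the keys and values are arbitrary.

```python
from collections import OrderedDict

od = OrderedDict([("apple", 4), ("banana", 3), ("pear", 1)])

# Move a key to the end, or to the front with last=False
od.move_to_end("apple")
print(list(od))  # ['banana', 'pear', 'apple']

od.move_to_end("pear", last=False)
print(list(od))  # ['pear', 'banana', 'apple']

# Equality between two OrderedDicts is order-sensitive; plain dicts ignore order
a = OrderedDict([("x", 1), ("y", 2)])
b = OrderedDict([("y", 2), ("x", 1)])
print(a == b)              # False
print(dict(a) == dict(b))  # True
```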


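###Why is a list not an iterator?

A list is iterable, but it is not itself an iterator: it can hand out an iterator through iter(), yet next() cannot be called on the list directly.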

In [29]:
my_list = [1, 2, 3]
next(my_list)

TypeError: 'list' object is not an iterator

In [30]:
print (iter(my_list))


list_iterator = iter(my_list)
print (next(list_iterator))


print (next(list_iterator))


print (next(list_iterator))


print (next(list_iterator))

<list_iterator object at 0x7e326548f0d0>
1
2
3


StopIteration: 

In [31]:
# Trying loops with iterator
my_list = [1, 2, 3]
for item in iter(my_list):
    print(item)

1
2
3


In [32]:
# Creating an iterator for string of letters

class MyIterator:

    def __init__(self, letters):
        """
        Constructor
        """

        self.letters = letters
        self.position = 0

    def __iter__(self):
        """
        Returns itself as an iterator
        """

        return self

    def __next__(self):
        """
        Returns the next letter in the sequence or
        raises StopIteration
        """

        if self.position >= len(self.letters):
            raise StopIteration
        letter = self.letters[self.position]
        self.position += 1
        return letter


if __name__ == '__main__':
    i = MyIterator('abcd')
    for item in i:
        print (item)

a
b
c
d


In [33]:
# Creating an infinite iterator

class Doubler:

    """
    An infinite iterator
    """

    def __init__(self):
        """
        Constructor
        """

        self.number = 0

    def __iter__(self):
        """
        Returns itself as an iterator
        """

        return self

    def __next__(self):
        """
        Increments the number each time next is called
        and returns its square.
        """

        self.number += 1
        return self.number * self.number


if __name__ == '__main__':
    doubler = Doubler()
    count = 0

    for number in doubler:
        print (number)
        if count > 5:
            break
        count += 1

1
4
9
16
25
36
49


##Generators

A normal Python function returns once per call, handing back a single object, whether that is a list, an integer or something else. But what if we want a function to yield a series of values, one at a time? That is where generators come in. A generator works by “saving” where it last left off (the point where it yielded) and giving the caller a value. So instead of ending and returning execution to the caller for good, it just gives control back temporarily and resumes from the same spot on the next request. To do this magic, a generator function requires Python’s yield statement.

In [34]:
#Creating a simple generator

def doubler_generator():
    number = 2
    while True:
        yield number
        number *= number

doubler = doubler_generator()
print (next(doubler))


print (next(doubler))


print (next(doubler))


print (type(doubler))

2
4
16
<class 'generator'>


In [35]:
# Generator with three items

def silly_generator():
    yield 'Python'
    yield 'Rocks'
    yield 'So do you!'


gen = silly_generator()
print (next(gen))

print (next(gen))

print (next(gen))

print (next(gen))

Python
Rocks
So do you!


StopIteration: 

###Generator and loops

In [36]:
def silly_generator():
    yield 'Python'
    yield 'Rocks'
    yield 'So do you!'


gen = silly_generator()
for item in gen:
    print (item)

Python
Rocks
So do you!
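
To make the pause-and-resume behaviour visible, the sketch below records what the generator body does in a list. The `countdown` generator and the `events` list are hypothetical names used only for this illustration.

```python
events = []

def countdown(start):
    events.append("started")
    while start > 0:
        yield start
        # Execution resumes here on the following next() call
        events.append("resumed after {}".format(start))
        start -= 1

gen = countdown(2)
print(events)  # [] because calling the function runs none of its body

first = next(gen)
print(first, events)   # 2 ['started']

second = next(gen)
print(second, events)  # 1 ['started', 'resumed after 2']
```

The body only runs when next() asks for a value, and each request picks up right after the previous yield.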


##Regular expressions

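Regular expressions are built from ordinary characters and metacharacters, which have special meanings. Let’s spend a few moments looking at how some of the most common metacharacters work.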

##[ and ]
One of the most common pairs of metacharacters we will encounter are the square brackets: [ and ]. They are used for creating a “character class”, which is a set of characters that we would like to match. We may list the characters individually like this: [xyz]. This will match any of the characters listed between the brackets. We can also use a dash to express a range of characters to match against: [a-g]. In this example, we would match against any of the letters a through g.

##*
To actually do a search though, we would need to add a beginning character to look for and an ending character. To make this easier, we can use the asterisk, which allows repetitions. Instead of matching a literal *, the * tells the regular expression that the previous character or character class may be matched zero or more times.

In [37]:
# Simple example of character matching

import re
text = 'abcdfghijk'
parser = re.search('a[b-f]*f', text)
print (parser)


print (parser.group())

<re.Match object; span=(0, 5), match='abcdf'>
abcdf


##+
There’s another repeating metacharacter which is similar to *. It is +, which will match one or more times. This is a subtle difference from * which matches zero or more times. The + requires at least one occurrence of the character it is looking for.

##?

The last two repeating metacharacters work a bit differently. There is the question mark, ?, which will match either once or zero times. It sort of marks the character before it as optional. A simple example would be “co-?op”. This would match both “coop” and “co-op”.

##{ and }
The final repeating metacharacter is {a,b}, where a and b are decimal integers. It means that there must be at least a repetitions and at most b.

Example: xb{1,4}z matches an x, then between one and four b characters, then a z.
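
The repetition metacharacters +, ? and {a,b} described above can be tried out with a short sketch. The test strings are made up for this illustration, and re.fullmatch is used so that the whole string has to match the pattern.

```python
import re

# + requires at least one "b"; * also accepts none
plus_result = bool(re.fullmatch("ab+c", "ac"))
star_result = bool(re.fullmatch("ab*c", "ac"))
print(plus_result, star_result)  # False True

# ? makes the hyphen optional
optional = re.findall("co-?op", "coop and co-op")
print(optional)  # ['coop', 'co-op']

# {1,4} allows between one and four "b" characters
results = {}
for candidate in ["xz", "xbz", "xbbbbz", "xbbbbbz"]:
    results[candidate] = bool(re.fullmatch("xb{1,4}z", candidate))
print(results)
```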

### Pattern Matching Using Search and Escape Codes

Pattern matching with search()

In [38]:
import re

text = "The ants go marching one by one"

strings = ['the', 'one']

for string in strings:
    match = re.search(string, text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        text_pos = match.span()
        print(text[match.start():match.end()])
    else:
        print('Did not find "{}"'.format(string))

Did not find "the"
Found "one" in "The ants go marching one by one"
one


###Example of compiling using re module

In [39]:
import re

text = "The ants go marching one by one"

strings = ['the', 'one']

for string in strings:
    regex = re.compile(string)
    match = regex.search(text)
    if match:
        print('Found "{}" in "{}"'.format(string, text))
        text_pos = match.span()

Found "one" in "The ants go marching one by one"


###re.A / re.ASCII
The ASCII flag tells Python to only match against ASCII instead of using full Unicode matching when coupled with the following escape codes: \w, \W, \b, \B, \d, \D, \s and \S. There is a re.U / re.UNICODE flag too that exists for backward compatibility purposes; however, that flag is redundant since Python 3 already matches Unicode by default.

###re.DEBUG
This will display debug information about our compiled expression.

###re.I / re.IGNORECASE
If we’d like to perform case-insensitive matching, then this is the flag for us. If our expression was [a-z] and we compiled it with this flag, our pattern will also match uppercase letters. This also works for Unicode, and it’s not affected by the current locale.

###re.M / re.MULTILINE
When we use this flag, we are telling Python to make the ^ pattern character match at both the beginning of the string and at the beginning of each line. It also tells Python that $ should match at the end of the string and the end of each line, which is subtly different from their defaults.

###re.S / re.DOTALL
This fun flag will make the . (period) metacharacter match any character at all. Without the flag, it would match anything except a newline.

###re.X / re.VERBOSE
If we find our regular expressions hard to read, then this flag is just what we need. It will allow us to visually separate logical sections of our regular expressions and even add comments! Whitespace within the pattern will be ignored except when in a character class or when the whitespace is preceded by an unescaped backslash.

In [42]:
# Using a compilation flag

import re

def validate_input(input_email):
    # re.VERBOSE lets the pattern span several lines with inline comments
    re_compilation = re.compile(r"""
        ^([a-z0-9_\.-]+)   # local part before the @ sign
        @                  # literal @ sign
        ([0-9a-z\.-]+)     # domain name
        \.                 # a single literal "."
        ([a-z]{2,6})$      # top-level domain (last part)
        """, re.VERBOSE)

    result = re_compilation.fullmatch(input_email)

    if result:
        print("{} is Valid.".format(input_email))
    else:
        print("{} is Invalid".format(input_email))


validate_input("name@gmail.com")
validate_input("name@.com")

name@gmail.com is Valid.
name@.com is Invalid
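
The other flags described above can be sketched in the same way. The sample strings are invented for this illustration, and the accented word is written with an escape so the example does not depend on file encoding.

```python
import re

text = "Ants march\nants rest"

# re.IGNORECASE: "ants" also matches "Ants"
ignorecase = re.findall("ants", text, re.IGNORECASE)
print(ignorecase)  # ['Ants', 'ants']

# re.MULTILINE: ^ matches at the start of every line, not just the string
multi = re.findall("^[a-z]+", text, re.MULTILINE | re.IGNORECASE)
single = re.findall("^[a-z]+", text, re.IGNORECASE)
print(multi, single)  # ['Ants', 'ants'] ['Ants']

# re.DOTALL: . also matches the newline between the two lines
no_dotall = re.search("march.ants", text)
dotall = re.search("march.ants", text, re.DOTALL)
print(no_dotall, repr(dotall.group()))  # None 'march\nants'

# re.ASCII: \w stops matching non-ASCII letters
word = "caf\u00e9"
unicode_words = re.findall(r"\w+", word)
ascii_words = re.findall(r"\w+", word, re.ASCII)
print(unicode_words, ascii_words)
```

Flags can be combined with the | operator, as the re.MULTILINE line shows.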


In [43]:
# Finding multiple matches using findall()

import re
silly_string = "the cat in the hat"
pattern = "the"
print (re.findall(pattern, silly_string))

['the', 'the']


In [44]:
# Finding multiple matches using finditer()

import re

silly_string = "the cat in the hat"
pattern = "the"

for match in re.finditer(pattern, silly_string):
    s = "Found '{group}' at {begin}:{end}".format(
        group=match.group(), begin=match.start(),
        end=match.end())
    print(s)

Found 'the' at 0:3
Found 'the' at 11:14


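Backslashes need extra care. Suppose we want to search for the literal text \python. To search for this in a regular expression, we need to escape the backslash, but because Python string literals also use the backslash as an escape character, each of those backslashes has to be escaped again, so we end up with the following search pattern, written with four backslashes: "\\\\python".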

In [45]:
testing_string = 'python "\"'
print(testing_string)

python ""


In [46]:
testing_string = 'python "\\"'
print(testing_string)

python "\"


In [47]:
# Python supports raw strings by prefixing the string with the letter r. Backslashes in a raw string are kept as-is, so the search pattern becomes more readable: r"\\python".

testing_string = r'python "\"'
print(testing_string)

python "\"
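
Putting the two spellings side by side, the sketch below searches for a literal backslash followed by "python". The sample `text` is made up for this illustration.

```python
import re

text = r"path: \python"  # the text contains one literal backslash

# Without a raw string, the backslash is escaped once for Python and once for re
plain = re.search("\\\\python", text).group()
print(plain)  # \python

# A raw string needs only the regular-expression escape
raw = re.search(r"\\python", text).group()
print(raw)  # \python

# Both literals spell the same pattern
same = "\\\\python" == r"\\python"
print(same)  # True
```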
