# Collections Module

The collections module is a built-in module that implements specialized container data types providing alternatives to Python’s general purpose built-in containers. We've already gone over the basics: dict, list, set, and tuple.

Now we'll learn about the alternatives that the collections module provides.

## Counter

*Counter* is a *dict* subclass which helps count hashable objects. Inside of it elements are stored as dictionary keys and the counts of the objects are stored as the value.

Let's see how it can be used:

In [2]:
from collections import Counter

**Counter() with lists**

In [7]:
list1 = [1,1,1,2,2,3,3,3,3,4,4,5,1,8]

Counter(list1)


Counter({1: 4, 2: 2, 3: 4, 4: 2, 5: 1, 8: 1})

**Counter with strings**

In [3]:
Counter('hello everyone')

Counter({'h': 1,
         'e': 4,
         'l': 2,
         'o': 2,
         ' ': 1,
         'v': 1,
         'r': 1,
         'y': 1,
         'n': 1})

**Counter with words in a sentence**

In [9]:
st='this is python learning program this is python'
word = st.split()

Counter(word)

Counter({'this': 2, 'is': 2, 'python': 2, 'learning': 1, 'program': 1})

In [15]:
# Methods with Counter()
c = Counter(word)

c.most_common(1)

[('this', 2)]

In [16]:
sum(c.values())

8

In [17]:
list(c)

['this', 'is', 'python', 'learning', 'program']

In [19]:
set(c)

{'is', 'learning', 'program', 'python', 'this'}

In [20]:
dict(c)

{'this': 2, 'is': 2, 'python': 2, 'learning': 1, 'program': 1}

In [21]:
c.items()

dict_items([('this', 2), ('is', 2), ('python', 2), ('learning', 1), ('program', 1)])

In [23]:
Counter(dict(c))

Counter({'this': 2, 'is': 2, 'python': 2, 'learning': 1, 'program': 1})

In [27]:
c.most_common()[:-4-1:-1]

[('program', 1), ('learning', 1), ('python', 2), ('is', 2)]

In [29]:
c += Counter()  
c

Counter({'this': 2, 'is': 2, 'python': 2, 'learning': 1, 'program': 1})

In [33]:
c.clear()

In [34]:
dict(c)

{}

## Common patterns when using the Counter() object

    sum(c.values())                 # total of all counts
    c.clear()                       # reset all counts
    list(c)                         # list unique elements
    set(c)                          # convert to a set
    dict(c)                         # convert to a regular dictionary
    c.items()                       # convert to a list of (elem, cnt) pairs
    Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
    c.most_common()[:-n-1:-1]       # n least common elements
    c += Counter()                  # remove zero and negative counts
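
As a rough, self-contained sketch of a few of these patterns working together, the cell below rebuilds the word counter from earlier. The call to `subtract()` is not used elsewhere in this lecture; it is included here only to produce zero and negative counts so that `c += Counter()` has something to remove.

```python
from collections import Counter

words = 'this is python learning program this is python'.split()
c = Counter(words)

total = sum(c.values())                   # total of all counts
least_two = c.most_common()[:-2 - 1:-1]   # the 2 least common elements, rarest first

# subtract() keeps zero and negative counts in the Counter ...
c.subtract({'this': 2, 'is': 3})
# ... and adding an empty Counter drops every count that is not positive
c += Counter()

print(total)
print(least_two)
print(c)
```

After the last step only the words with a positive count remain.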

## defaultdict

defaultdict is a dictionary-like object which provides all methods provided by a dictionary but takes a first argument (default_factory): a callable that produces the default value for a missing key. Using defaultdict is generally simpler and faster than doing the same with the dict.setdefault method.

**A defaultdict with a default factory will not raise a KeyError on item access (d[key]). Any key that does not exist is inserted with the value returned by the default factory.**
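
To make the comparison with setdefault concrete, here is a small sketch that groups some made-up (category, name) pairs both ways. The data and variable names are only for illustration. The last part shows one caveat: a defaultdict created without a factory behaves like a normal dict for missing keys.

```python
from collections import defaultdict

pairs = [('fruit', 'apple'), ('veg', 'carrot'), ('fruit', 'pear')]

# Plain dict: setdefault() inserts an empty list the first time a key is seen
plain = {}
for kind, name in pairs:
    plain.setdefault(kind, []).append(name)

# defaultdict: the factory (list) is called automatically for missing keys
grouped = defaultdict(list)
for kind, name in pairs:
    grouped[kind].append(name)

# Without a factory, a missing key still raises KeyError
no_factory = defaultdict()
try:
    no_factory['missing']
    raised = False
except KeyError:
    raised = True

print(plain)
print(dict(grouped))
print(raised)
```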

In [36]:
from collections import defaultdict

In [47]:
d= defaultdict(lambda:0)
d[1]='a'
d[2]='b'
print(d[1])
print(d[2])
print(d[3])

a
b
0


In [48]:
y = defaultdict(lambda: 0)

In [49]:
y['one']

0

## OrderedDict
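An OrderedDict is a dictionary subclass that remembers the order in which its contents are added. Since Python 3.7 a regular dictionary also preserves insertion order, so the two loops below print the same thing; the difference shows up when testing for equality.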

For example a normal dictionary:

In [50]:
print('Normal dictionary:')

d = {1:'a',2:'b',3:'c'}
for i,j in d.items():
    print(i,j)

Normal dictionary:
1 a
2 b
3 c


An Ordered Dictionary:

In [53]:
from collections import OrderedDict

print('OrderedDict:')

d = OrderedDict({1:'a',2:'b',3:'c'})

for i,j in d.items():
    print(i,j)

OrderedDict:
1 a
2 b
3 c


## Equality with an Ordered Dictionary
A regular dict looks at its contents when testing for equality. An OrderedDict also considers the order the items were added.

A normal Dictionary:

In [55]:
d1 = {1:'a',2:'b'}
d2 = {2:'b',1:'a'}


if d1==d2:
    print("both are equal")
else:
    print("not equal")    

both are equal


An Ordered Dictionary:

In [56]:
d1 = OrderedDict({1:'a',2:'b'})
d2 = OrderedDict({2:'b',1:'a'})


if d1==d2:
    print("both are equal")
else:
    print("not equal")

not equal
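
As a small extra sketch of the same idea: the order only matters when both sides of the comparison are OrderedDicts. Comparing an OrderedDict with a regular dict ignores order. The `move_to_end()` call is an OrderedDict method not covered above; it is used here just to reorder the keys.

```python
from collections import OrderedDict

od1 = OrderedDict([(1, 'a'), (2, 'b')])
od2 = OrderedDict([(2, 'b'), (1, 'a')])
plain = {2: 'b', 1: 'a'}

print(od1 == od2)     # False: same items, different order
print(od1 == plain)   # True: a regular dict on one side, so order is ignored

od2.move_to_end(2)    # key 2 now comes last, so od2 is ordered 1, 2
print(list(od2))      # [1, 2]
print(od1 == od2)     # True
```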


## namedtuple
The standard tuple uses numerical indexes to access its members, for example:

In [57]:
tup = (1,2,3,4)

In [60]:
tup[2]

3

For simple use cases, this is usually enough. On the other hand, remembering which index should be used for each value can lead to errors, especially if the tuple has a lot of fields and is constructed far from where it is used. A namedtuple assigns names, as well as the numerical index, to each member. 

Each kind of namedtuple is represented by its own class, created by using the namedtuple() factory function. The arguments are the name of the new class and a string containing the names of the elements.

You can basically think of namedtuples as a very quick way of creating a new object/class type with some attribute fields.
For example:

In [62]:
from collections import namedtuple

In [66]:
stud = namedtuple('stud','name age usn')

s1 = stud('sam','19','132')

s2 = stud('joey','20','013')

In [67]:
s1

stud(name='sam', age='19', usn='132')

In [68]:
s2.age

'20'

In [69]:
s1.usn

'132'

In [70]:
s2[0]

'joey'
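
Here is a short sketch of a few more things a namedtuple can do. The class name `Student` and the integer age are choices made for this example; the field names follow the ones used above.

```python
from collections import namedtuple

# Field names can also be given as a list instead of a space-separated string
Student = namedtuple('Student', ['name', 'age', 'usn'])
s1 = Student(name='sam', age=19, usn='132')

name, age, usn = s1            # unpacks like a regular tuple
older = s1._replace(age=20)    # returns a new instance; s1 itself is unchanged
as_dict = s1._asdict()         # field names mapped to values
fields = Student._fields       # tuple of field names

print(name, age, usn)
print(older)
print(as_dict)
print(fields)
```

Like regular tuples, namedtuples are immutable, which is why `_replace()` returns a new object instead of changing `s1`.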

## Conclusion

Hopefully you now see how useful the collections module is in Python. It should be your go-to module for a variety of common tasks!

# datetime

Python has the datetime module to help deal with timestamps in your code. Time values are represented with the time class. Times have attributes for hour, minute, second, and microsecond. They can also include time zone information. The arguments to initialize a time instance are optional, but the default of 0 is unlikely to be what you want.

## time
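Let's take a look at how we can extract time information from the datetime module. A time of day on its own can be created with datetime.time(hour,minute,second,microsecond). The cell below instead calls datetime.datetime.now(), which returns a datetime object holding both the current date and the current time, and then reads its time components: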

In [16]:
import datetime

t = datetime.datetime.now()

# Let's show the different components
print(t)
print('hour  :', t.hour)
print('minute:', t.minute)
print('second:', t.second)
print('microsecond:', t.microsecond)
print('tzinfo:', t.tzinfo)

2021-09-15 10:00:12.024478
hour  : 10
minute: 0
second: 12
microsecond: 24478
tzinfo: None


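Note: A datetime.time instance only holds values of time, and not a date associated with the time. The object printed above is a datetime.datetime, which is why its printed value also includes the date.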
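
For comparison, here is a minimal sketch of a plain time instance. The values 4, 20 and 1 are arbitrary, chosen only for this example.

```python
import datetime

# hour, minute, second; microsecond defaults to 0
t = datetime.time(4, 20, 1)

print(t)                  # 04:20:01
print('hour  :', t.hour)
print('minute:', t.minute)
print('second:', t.second)
print('microsecond:', t.microsecond)

# A time object carries no date information
print(hasattr(t, 'year'))   # False

# The time-of-day part of a datetime can be extracted with .time()
current_time = datetime.datetime.now().time()
print(type(current_time))
```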

We can also check the min and max values a time of day can have in the module:

In [3]:
print('Earliest  :', datetime.time.min)
print('Latest    :', datetime.time.max)
print('Resolution:', datetime.time.resolution)

Earliest  : 00:00:00
Latest    : 23:59:59.999999
Resolution: 0:00:00.000001


The min and max class attributes reflect the valid range of times in a single day.

## Dates
datetime (as you might suspect) also allows us to work with date timestamps. Calendar date values are represented with the date class. Instances have attributes for year, month, and day. It is easy to create a date representing today’s date using the today() class method.

Let's see some examples:

In [12]:
today = datetime.date.today()
print(today)
print('Year :', today.year)
print('Month:', today.month)
print('Day  :', today.day)

2021-09-15
Year : 2021
Month: 9
Day  : 15


As with time, the range of date values supported can be determined using the min and max attributes.

In [5]:
print('Earliest  :', datetime.date.min)
print('Latest    :', datetime.date.max)
print('Resolution:', datetime.date.resolution)

Earliest  : 0001-01-01
Latest    : 9999-12-31
Resolution: 1 day, 0:00:00


Another way to create new date instances uses the replace() method of an existing date. For example, you can change the year, leaving the day and month alone.

In [6]:
d1 = datetime.date(2015, 3, 11)
print('d1:', d1)

d2 = d1.replace(year=1990)
print('d2:', d2)

d1: 2015-03-11
d2: 1990-03-11


## Arithmetic
We can perform arithmetic on date objects to check for time differences. For example:

In [6]:
d1

datetime.date(2015, 3, 11)

In [None]:
d2

In [8]:
d1-d2

datetime.timedelta(9131)

This gives us the difference in days between the two dates. The result is a timedelta object. You can also create your own with the datetime.timedelta class, specifying various units of time (days, hours, minutes, etc.).
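
A short sketch of working with timedelta objects directly, reusing the two dates from above. The one-week and 36.5-hour offsets are arbitrary values picked for illustration.

```python
import datetime

d1 = datetime.date(2015, 3, 11)
d2 = d1.replace(year=1990)

diff = d1 - d2
print(diff.days)          # 9131

# Adding a timedelta to a date gives a new date
one_week_later = d1 + datetime.timedelta(weeks=1)
print(one_week_later)     # 2015-03-18

# Units are normalized to days, seconds and microseconds internally
shift = datetime.timedelta(hours=36, minutes=30)
print(shift.days, shift.seconds)   # 1 45000
print(shift.total_seconds())       # 131400.0
```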

Great! You should now have a basic understanding of how to use datetime with Python to work with timestamps in your code!

# Python Debugger

You've probably used a variety of print statements to try to find errors in your code. A better way of doing this is by using Python's built-in debugger module (pdb). The pdb module implements an interactive debugging environment for Python programs. It includes features to let you pause your program, look at the values of variables, and watch program execution step-by-step, so you can understand what your program actually does and find bugs in the logic.

This is a bit difficult to show since it requires creating an error on purpose, but hopefully this simple example illustrates the power of the pdb module. <br>*Note: Keep in mind it would be pretty unusual to use pdb in an iPython Notebook setting.*

___
Here we will create an error on purpose, trying to add a list to an integer

In [17]:
a = [5,6,7]
b = 1
c = 4

res = c + b
print(res)
res2 = a + b
print(res2)

5


TypeError: can only concatenate list (not "int") to list

Hmmm, looks like we get an error! Let's implement a set_trace() using the pdb module. This will allow us to basically pause the code at the point of the trace and check if anything is wrong.

In [None]:
import pdb

a = [5,6,7]
b = 1
c = 4

res = c + b
print(res)

# Set a trace using Python Debugger
pdb.set_trace()

res2 = a + b
print(res2)


5
--Return--
None
> <ipython-input-21-174ce16d764a>(11)<module>()
      9 
     10 # Set a trace using Python Debugger
---> 11 pdb.set_trace()
     12 
     13 res2 = a + b


Great! Now we could check what the various variables were and check for errors. You can use 'q' to quit the debugger. For more information on general debugging techniques and more methods, check out the official documentation:
https://docs.python.org/3/library/pdb.html

# Timing your code
Sometimes it's important to know how long your code is taking to run, or at least know if a particular line of code is slowing down your entire project. Python has a built-in timing module to do this. 

This module provides a simple way to time small bits of Python code. It has both a Command-Line Interface as well as a callable one. It avoids a number of common traps for measuring execution times. 

Let's learn about timeit!

In [None]:
import timeit

Let's use timeit to time various ways of converting the numbers 0 through 50 to strings.

We'll pass two arguments: the actual line we want to test encapsulated as a string and the number of times we wish to run it. Here we'll choose 10,000 runs to get some high enough numbers to compare various methods.

In [None]:
# Generator expression
timeit.timeit('(str(n) for n in range(51))', number=10000)

In [3]:
# List comprehension
timeit.timeit('([str(n) for n in range(51)])', number=10000)

0.19484614421698643

In [4]:
# Map()
timeit.timeit('(map(str, range(51)))', number=10000)

0.15291817337139246

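We see a time difference in favor of map(). Keep in mind, though, that map() and the generator expression are lazy: as written, these statements only create the iterator object and never call str() on the numbers, while the list comprehension does all of the work. To compare the methods fairly, each result has to be consumed, for example by joining it into a string.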
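
One way to make the comparison fair is sketched below: each statement is wrapped in `"".join(...)`, so the generator expression and the map object are fully consumed and all three statements build the same string. The join wrapper and the labels are additions for this example, and the timings are left unprinted here because they depend on your machine.

```python
import timeit

# Each statement builds the same string, so lazy iterators are fully consumed
statements = {
    'generator expression': '"".join(str(n) for n in range(51))',
    'list comprehension': '"".join([str(n) for n in range(51)])',
    'map()': '"".join(map(str, range(51)))',
}

results = {}
for label, stmt in statements.items():
    # Total time in seconds for 10,000 executions of the statement
    results[label] = timeit.timeit(stmt, number=10000)
    print(f'{label:22s} {results[label]:.4f} s')
```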

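Now let's introduce iPython's magic function **%timeit**<br>
*NOTE: This method is specific to IPython environments such as Jupyter notebooks!*

iPython's %timeit will run the same line of code a certain number of times (loops), repeat that measurement several times (runs), and report the mean and standard deviation of the time per loop. Older versions of iPython reported the fastest time instead (best of 3).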

Let's repeat the above examinations using iPython magic!

In [5]:
%timeit (str(n) for n in range(51))

20.4 µs ± 269 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [6]:
%timeit ([str(n) for n in range(51)])

18.1 µs ± 56.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [7]:
%timeit (map(str, range(51)))

14.4 µs ± 64.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Great! We arrive at the same conclusion. It's also important to note that iPython will limit the amount of *real time* it will spend on its timeit procedure. For instance if running 100000 loops took 10 minutes, iPython would automatically reduce the number of loops to something more reasonable like 100 or 1000.
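
Outside of iPython, something similar to the repeated runs can be sketched with `timeit.repeat()`, which returns one total time per run. The choice of 5 runs and 10,000 loops here is arbitrary.

```python
import timeit

loops = 10000

# One total time (in seconds) per run; each run executes the statement `loops` times
runs = timeit.repeat('[str(n) for n in range(51)]', repeat=5, number=loops)

# The fastest run is usually the least disturbed by other activity on the machine
best_per_loop = min(runs) / loops
print(f'best of {len(runs)}: {best_per_loop * 1e6:.2f} microseconds per loop')
```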

Great! You should now feel comfortable timing lines of your code, both in and out of iPython. Check out the documentation for more information:
https://docs.python.org/3/library/timeit.html

# Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition, to text-matching, and much more. As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).


If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using the <code>re</code> module with Python for this lecture.


Let's get started!

## Searching for Patterns in Text

One of the most common uses for the re module is for finding patterns in text. Let's do a quick example of using the search method in the re module to find some text:

In [None]:
import re

# List of patterns to search for
words = ['Basket', 'rain']

# Text to parse
string = 'We were Playing Basket ball in Basket ball court.'

for i in words:
    print('Searching for "%s" in:\n "%s"\n' %(i,string))
    
    #Check for match
    if re.search(i,string):
        print('Match was found. \n')
    else:
        print('No Match was found.\n')

Now we've seen that <code>re.search()</code> will take the pattern, scan the text, and then return a **Match** object. If the pattern is not found, **None** is returned. To give a clearer picture of this match object, check out the cell below:

In [20]:
import re

# Pattern to search for
word = 'Playing'

# Text to parse
string = 'We were Playing Basket ball in Basket ball court.'

match = re.search(word,string)

type(match)

re.Match

This **Match** object returned by the search() method is more than just a Boolean or None: it contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the match object:

In [21]:
# Show start of match
match.start()

8

In [22]:
# Show end
match.end()

15
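
Here is a compact sketch of the other pieces of information a match object carries, using the same sentence. The variable names are chosen for this example only.

```python
import re

text = 'We were Playing Basket ball in Basket ball court.'
match = re.search('Playing', text)

print(match.span())                       # (8, 15): start and end as a tuple
print(match.group())                      # the text that matched
print(text[match.start():match.end()])    # slicing with start/end gives the same text
print(match.re.pattern)                   # the pattern that was used
print(match.string == text)               # the original input string

# A pattern that does not occur gives None instead of a match object
no_match = re.search('rain', text)
print(no_match)
```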

## Split with regular expressions

Let's see how we can split with the re syntax. This should look similar to how you used the split() method with strings.

In [23]:
# Term to split on
split_term = 'Be'

term = 'Do Good Be kind'

# Split the phrase
re.split(split_term,term)

['Do Good ', ' kind']

Note how <code>re.split()</code> returns a list with the term to split on removed and the terms in the list are a split up version of the string. Create a couple more examples for yourself to make sure you understand!
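
Unlike the string split() method, <code>re.split()</code> accepts a full pattern as the separator. The sketch below uses a made-up color list for the first example; the second shows that wrapping the separator in parentheses keeps it in the result.

```python
import re

# Split on a comma or semicolon followed by any amount of whitespace
parts = re.split(r'[,;]\s*', 'red, green;blue,  yellow')
print(parts)      # ['red', 'green', 'blue', 'yellow']

# A capturing group around the separator keeps it in the output list
with_sep = re.split('(Be)', 'Do Good Be kind')
print(with_sep)   # ['Do Good ', 'Be', ' kind']
```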

## Finding all instances of a pattern

You can use <code>re.findall()</code> to find all the instances of a pattern in a string. For example:

In [24]:
# Returns a list of all matches
re.findall('Basket','We were Playing Basket ball in Basket ball court.')

['Basket', 'Basket']

### Repetition Syntax

There are five ways to express repetition in a pattern:

   1. A pattern followed by the meta-character <code>*</code> is repeated zero or more times. 
   2. Replace the <code>*</code> with <code>+</code> and the pattern must appear at least once. 
   3. Using <code>?</code> means the pattern appears zero or one time. 
   4. For a specific number of occurrences, use <code>{m}</code> after the pattern, where **m** is replaced with the number of times the pattern should repeat. 
   5. Use <code>{m,n}</code> where **m** is the minimum number of repetitions and **n** is the maximum. Leaving out **n** <code>{m,}</code> means the value appears at least **m** times, with no maximum.
    
Now we will see an example of each of these using our multi_re_find function:

In [None]:
def multi_re_find(patterns,phrase):
    for i in patterns:
        print('Searching the phrase using the re check: %r' %(i))
        print(re.findall(i,phrase))
        print('\n')

In [8]:
string = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

pattern = ['sd*',      # s followed by zero or more d's
           'sd+',      # s followed by one or more d's
           'sd?',      # s followed by zero or one d's
           'sd{3}',    # s followed by three d's
           'sd{2,3}',  # s followed by two to three d's
           ]

multi_re_find(pattern,string)

Searching the phrase using the re check: 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching the phrase using the re check: 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching the phrase using the re check: 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching the phrase using the re check: 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']


Searching the phrase using the re check: 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
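
The open-ended form <code>{m,}</code> from the list above is not covered by these patterns, so here is a small sketch of it on the same string. Note the last match: with no maximum, all four d's are included.

```python
import re

text = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by at least two d's, with no upper limit
at_least_two = re.findall('sd{2,}', text)
print(at_least_two)    # ['sddd', 'sddd', 'sddd', 'sdddd']

# For comparison: exactly three d's stops after the third d
exactly_three = re.findall('sd{3}', text)
print(exactly_three)   # ['sddd', 'sddd', 'sddd', 'sddd']
```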




## Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input <code>[ab]</code> searches for occurrences of either **a** or **b**.
Let's see some examples:

In [9]:
string = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

patterns = ['[sd]',    # either s or d
            's[sd]+']  # s followed by one or more s or d

multi_re_find(patterns,string)

Searching the phrase using the re check: '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching the phrase using the re check: 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




It makes sense that the first input <code>[sd]</code> returns every instance of s or d. Also, the second input <code>s[sd]+</code> returns any full strings that begin with an s and continue with s or d characters until another character is reached.

## Exclusion

We can use <code>^</code> to exclude terms by incorporating it into the bracket syntax notation. For example: <code>[^...]</code> will match any single character not in the brackets. Let's see some examples:

In [10]:
phrase = 'Hello! How are you.Where do you live?'

Use <code>[^!.? ]</code> to check for matches that are not a !,.,?, or space. Add a <code>+</code> to check that the match appears at least once. This basically translates into finding the words.

In [11]:
re.findall('[^!.? ]+',phrase)

['Hello', 'How', 'are', 'you', 'Where', 'do', 'you', 'live']

## Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious. A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is <code>[start-end]</code>.

Common use cases are to search for a specific range of letters in the alphabet. For instance, <code>[a-f]</code> would return matches with any occurrence of letters between a and f. 

Let's walk through some examples:

In [12]:

phrase = 'Lets try to find Some Letters.'
patterns = ['[a-z]+',       # sequences of lower case letters
            '[A-Z]+',       # sequences of upper case letters
            '[a-zA-Z]+',    # sequences of lower or upper case letters
            '[A-Z][a-z]+']  # one upper case letter followed by lower case letters
                
multi_re_find(patterns,phrase)

Searching the phrase using the re check: '[a-z]+'
['ets', 'try', 'to', 'find', 'ome', 'etters']


Searching the phrase using the re check: '[A-Z]+'
['L', 'S', 'L']


Searching the phrase using the re check: '[a-zA-Z]+'
['Lets', 'try', 'to', 'find', 'Some', 'Letters']


Searching the phrase using the re check: '[A-Z][a-z]+'
['Lets', 'Some', 'Letters']




## Escape Codes

You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more. For example:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>non-alphanumeric</td>
</tr>
</tbody>
</table>

Escapes are indicated by prefixing the character with a backslash <code> \ </code>. Unfortunately, a backslash must itself be escaped in normal Python strings, and that results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with <code>r</code>, eliminates this problem and maintains readability.

Personally, I think this use of <code>r</code> to escape a backslash is probably one of the things that block someone who is not familiar with regex in Python from being able to read regex code at first. Hopefully after seeing these examples this syntax will become clear.
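
A quick sketch of what the <code>r</code> prefix actually changes may help. The `\b` example is an addition not covered in the table above: in a regex it means a word boundary, but in a normal Python string it is the backspace character, so forgetting the `r` silently changes the pattern.

```python
import re

# The raw string and the doubly-escaped normal string are the same pattern
print(r'\d+' == '\\d+')         # True

# In a normal string '\n' is one newline character; the raw version is two characters
print(len('\n'), len(r'\n'))    # 1 2

text = 'cat concat cat.'

# Raw string: \b reaches the regex engine and means "word boundary"
raw_matches = re.findall(r'\bcat\b', text)
print(raw_matches)              # ['cat', 'cat']

# Normal string: '\b' is a backspace character, which never appears in the text
plain_matches = re.findall('\bcat\b', text)
print(plain_matches)            # []
```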

In [13]:
phrase = 'lets find 1234 and *asterisk'

patterns = [r'\d+',  # sequence of digits
            r'\D+',  # sequence of non-digits
            r'\s+',  # sequence of whitespace
            r'\S+',  # sequence of non-whitespace
            r'\w+',  # alphanumeric characters
            r'\W+',  # non-alphanumeric
            ]

multi_re_find(patterns,phrase)

Searching the phrase using the re check: '\\d+'
['1234']


Searching the phrase using the re check: '\\D+'
['lets find ', ' and *asterisk']


Searching the phrase using the re check: '\\s+'
[' ', ' ', ' ', ' ']


Searching the phrase using the re check: '\\S+'
['lets', 'find', '1234', 'and', '*asterisk']


Searching the phrase using the re check: '\\w+'
['lets', 'find', '1234', 'and', 'asterisk']


Searching the phrase using the re check: '\\W+'
[' ', ' ', ' ', ' *']




## Conclusion

You should now have a solid understanding of how to use the regular expression module in Python. There are many more special characters, but it would be unreasonable to go through every single use case. Instead take a look at the full [documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) if you ever need to look up a particular pattern.

You can also check out the nice summary tables at this [source](http://www.tutorialspoint.com/python/python_reg_expressions.htm).

Good job!


# StringIO Objects and the io Module

Back in **Lecture 24 - Files** we opened files that exist outside of python, and streamed their contents into an in-memory file object. You can also create in-memory file-like objects within your program that Python treats the same way. Text data is stored in a StringIO object, while binary data would be stored in a BytesIO object. This object can then be used as input or output to most functions that would expect a standard file object.

Let's investigate StringIO objects. The best way to show this is by example:

In [2]:
import io

In [3]:
# Arbitrary String
string = 'Have a great day all.'

In [4]:
# Wrap the string in a StringIO file-like object
a = io.StringIO(string)

Now we have an object *a* that we will be able to treat just like a file. For example:

In [5]:
a.read()

'Have a great day all.'

We can also write to it. After read() the cursor sits at the end of the contents, so the new text is appended, and write() returns the number of characters written:

In [6]:
a.write('welcome')

7

In [9]:
# Reset cursor just like you would a file
a.seek(0)

0

In [10]:
# Read again
a.read()

'Have a great day all.welcome'

In [8]:
# Close the object when contents are no longer needed
a.close()

Great! Now you've seen how we can use StringIO to turn normal strings into in-memory file objects in our code. This kind of action has various use cases, especially in web scraping cases where you want to read some string you scraped as a file.
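
As a sketch of passing these objects to functions that expect a file, the cell below writes two made-up CSV lines into a StringIO with print(), then hands the text to csv.reader. The BytesIO part at the end shows the binary counterpart; the byte values are arbitrary.

```python
import csv
import io

# Use a StringIO as the output target of print()
buffer = io.StringIO()
print('name,age', file=buffer)
print('sam,19', file=buffer)

# getvalue() returns the whole contents regardless of the cursor position
text = buffer.getvalue()

# Use a StringIO as the input "file" for csv.reader
rows = list(csv.reader(io.StringIO(text)))
print(rows)        # [['name', 'age'], ['sam', '19']]

# BytesIO works the same way for binary data
raw = io.BytesIO(b'\x00\x01\x02')
first_two = raw.read(2)
print(first_two)   # b'\x00\x01'
```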

For more info on StringIO check out the documentation: https://docs.python.org/3/library/io.html