# Python Collections module

The collections module is a built-in Python module that implements specialized container data types as alternatives to the built-in containers. A container is anything that stores data, such as a list, tuple, or dictionary.

The first class we are going to talk about is Counter. Let's import it here:

### Counter

In [1]:
from collections import Counter

Let's say that we have a list in which some values repeat, and we want to find the number of times each distinct item appears. Normally, we'd have to write a loop that updates a dictionary, but here we can do it with a single call to Counter:

In [2]:
myList = [1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3]

In [5]:
Counter(myList)

Counter({1: 5, 2: 9, 3: 3})

What Counter has done here is count how many 1s, 2s, and 3s there are in the list and return the result as a dictionary-like object that maps each item to its count. Counter also works for other things, not just integers:

In [6]:
ListTwo = ['a','a','a','b','b','b']

In [10]:
Counter(ListTwo)

Counter({'a': 3, 'b': 3})

The Counter also works for strings, not just lists:

In [11]:
Counter('abbcccdddd')

Counter({'a': 1, 'b': 2, 'c': 3, 'd': 4})
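For comparison, here is a minimal sketch of the dictionary-and-loop approach that Counter replaces. The names my_list and manual_counts are made up for this illustration:

```python
from collections import Counter

my_list = [1, 1, 1, 2, 2, 3]

# Manual approach: loop over the items and update a plain dictionary.
manual_counts = {}
for item in my_list:
    # .get() returns 0 the first time we see an item
    manual_counts[item] = manual_counts.get(item, 0) + 1

print(manual_counts)                            # {1: 3, 2: 2, 3: 1}

# Counter produces the same counts in a single call.
print(dict(Counter(my_list)) == manual_counts)  # True
```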

Counter also has lots of methods you can use. For example, the most_common() method returns the (item, count) pairs ordered from most to least common:

In [12]:
c = Counter([1,1,1,1,1,2,2,2,2,2,2,2,2,2,3,3,3])

In [13]:
c.most_common()

[(2, 9), (1, 5), (3, 3)]

##### Common patterns when using the Counter() object

    sum(c.values())                 # total of all counts
    c.clear()                       # reset all counts
    list(c)                         # list unique elements
    set(c)                          # convert to a set
    dict(c)                         # convert to a regular dictionary
    list(c.items())                 # convert to a list of (elem, cnt) pairs
    Counter(dict(list_of_pairs))    # convert from a list of (elem, cnt) pairs
    c.most_common()[:-n-1:-1]       # n least common elements
    c += Counter()                  # remove zero and negative counts

Above are some common uses for the methods of the Counter() object.
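To see a few of these patterns in action, here is a small sketch on a toy Counter. The variable names are only for illustration:

```python
from collections import Counter

c = Counter("abbccc")

total = sum(c.values())             # total of all counts
unique = sorted(c)                  # the distinct elements
pairs = list(c.items())             # list of (elem, cnt) pairs
rebuilt = Counter(dict(pairs))      # back from pairs to a Counter

n = 2
least = c.most_common()[:-n-1:-1]   # the n least common elements

print(total)    # 6
print(unique)   # ['a', 'b', 'c']
print(pairs)    # [('a', 1), ('b', 2), ('c', 3)]
print(least)    # [('a', 1), ('b', 2)]

# Drive the count for 'a' down to zero, then strip zero/negative counts.
c.subtract("a")
c += Counter()
print(c)        # Counter({'c': 3, 'b': 2})
```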

### Default Dictionary

The next thing we will learn about from the collections module is the default dictionary. Let's import it here:

In [16]:
from collections import defaultdict

With normal Python dictionaries, if you try to look up a key that isn't there, you get a KeyError:

In [17]:
d = {1:"a",
    2:"b",
    3:"c"}

In [18]:
d[4]

KeyError: 4

A default dictionary, by contrast, handles a missing key by calling a function you supply, storing the returned default value under that key, and returning it, thus earning its name. Here's how to set one up:

In [19]:
new_d = defaultdict(lambda: 'default value goes here')

In [20]:
new_d[1] = 'a'

In [21]:
new_d[2] = 'b'

In [22]:
new_d[3] = 'c'

In [23]:
#Now, let's try calling new_d[4], even though we haven't defined it
new_d[4]

'default value goes here'

As you can see, the default dictionary returned 'default value goes here', the value produced by the lambda we defined above. Note that the lookup also added the key 4 to the dictionary. Let's see what the default dict actually looks like:

In [25]:
new_d

defaultdict(<function __main__.<lambda>()>,
            {1: 'a', 2: 'b', 3: 'c', 4: 'default value goes here'})
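In practice, the default factory is often a built-in type such as list or int rather than a lambda. Here is a small sketch, with made-up data, that groups words by first letter and counts them:

```python
from collections import defaultdict

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# list() returns [], so every new key starts with an empty list.
groups = defaultdict(list)
for word in words:
    groups[word[0]].append(word)

# int() returns 0, so every new key starts counting from zero.
counts = defaultdict(int)
for word in words:
    counts[word[0]] += 1

print(dict(groups))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
print(dict(counts))
# {'a': 2, 'b': 2, 'c': 1}
```

Neither loop needs to check whether the key already exists, which is the main convenience of the default dictionary.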

### Named Tuple

Similar to how the default dictionary improves on a normal dictionary by supplying default values, the named tuple tries to expand on a normal tuple by having named indices. Let's see what that means:

First, we'll make a standard tuple.

In [26]:
mytuple = (10,20,30)

In [28]:
mytuple[0]

10

This tuple is rather small, so we can easily see which item is at what index and easily pull it out of the tuple. However, we might have a very large tuple in certain situations, or we might not remember what value is at which index. The named tuple has not just an integer relation to each value (in the form of an index) but will also have a named index for that value. Let's see how we can create a named tuple:

In [29]:
from collections import namedtuple

The namedtuple function takes in 2 parameters. The first is a type name, which is what the new tuple class will be identified as. The second is a list of attribute names called 'field names'. Let's create a Dog namedtuple and test it out:

In [35]:
Dog = namedtuple('Dog',['age','breed','name'])

In [36]:
myDog= Dog(age = 5, breed = 'Default Dog', name = 'dog1')

In [37]:
type(myDog)

__main__.Dog

In [38]:
myDog

Dog(age=5, breed='Default Dog', name='dog1')

This named tuple looks like a cross between an object you would make from a class and a tuple. One thing that makes it more like an object is the way you can access its attributes:

In [39]:
myDog.age

5

In [40]:
myDog.breed

'Default Dog'

In [41]:
myDog.name

'dog1'

However, we can also access the values by integer index, making it more like a tuple:

In [42]:
myDog[0]

5

In [43]:
myDog[1]

'Default Dog'

In [44]:
myDog[2]

'dog1'

As you can see, this is very versatile. Now, let's try it for our original example:

In [45]:
my_named_tuple = namedtuple('numbers',['myFirstNum','mySecondNum','myThirdNum'])

In [46]:
new_named_tuple = my_named_tuple(10, 20, 30)

In [47]:
new_named_tuple

numbers(myFirstNum=10, mySecondNum=20, myThirdNum=30)

In [48]:
new_named_tuple.myFirstNum

10

In [49]:
new_named_tuple.mySecondNum

20

In [50]:
new_named_tuple.myThirdNum

30

In [51]:
new_named_tuple[0]

10

In [52]:
new_named_tuple[1]

20

In [53]:
new_named_tuple[2]

30
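Named tuples come with a few more conveniences worth knowing. The sketch below reuses the Dog example and shows unpacking, the _replace() method, and the _asdict() method. Like regular tuples, named tuples are immutable, so _replace() returns a new object:

```python
from collections import namedtuple

Dog = namedtuple("Dog", ["age", "breed", "name"])
my_dog = Dog(age=5, breed="Default Dog", name="dog1")

# Unpack like a normal tuple.
age, breed, name = my_dog

# _replace() builds a new Dog; my_dog itself is unchanged.
older_dog = my_dog._replace(age=6)
print(older_dog)   # Dog(age=6, breed='Default Dog', name='dog1')
print(my_dog.age)  # 5

# _asdict() converts the fields to a dictionary.
as_dict = dict(my_dog._asdict())
print(as_dict)     # {'age': 5, 'breed': 'Default Dog', 'name': 'dog1'}

# Assigning to a field fails because tuples are immutable.
try:
    my_dog.age = 10
except AttributeError:
    print("named tuples cannot be modified in place")
```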

# Python Shutil and OS modules

The shutil and os modules in Python specialize in working with files and folders on your computer. We already know how to open an individual file with Python, but we don't know how to open every file in a directory or move files around on our computer. These two modules allow us to navigate files and directories easily and do things to them, like move or delete them.

Let's start by finding this notebook's location with <code>pwd</code>:

In [1]:
pwd

'C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python'

Now, let's make a test text file:

In [2]:
f = open('practice.txt','w+')
f.write('This is a test string')
f.close()

Now, let's import the OS module:

In [1]:
import os

In [4]:
os.getcwd()

'C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python'

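The nice thing about the os module is that it works across all operating systems. The above function did the same thing as the <code>pwd</code> command, but typing <code>pwd</code> in a cell like this only works in Jupyter (it is an IPython magic), whereas os.getcwd() works in any Python script.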

Now, let's say that we want to list the items in the directory:

In [5]:
#Note: os.listdir() will, if not given any parameters,
#give you the items of the directory that you are running it in.
#If you want it to run in a separate directory, you need to put the location
#of it in the parenthesis.
os.listdir()

['.ipynb_checkpoints',
 'Advanced Modules.ipynb',
 'Blackjack.ipynb',
 'Course docs',
 'Decorators.ipynb',
 'Error Handling.ipynb',
 'Functions.ipynb',
 'Generators.ipynb',
 'Module1.py',
 'Modules and packages.ipynb',
 'MyFirstPackage',
 'Object & data structures.ipynb',
 'Object Oriented Programming.ipynb',
 'practice.txt',
 'Program1.py',
 'Python Statements.ipynb',
 'Tic Tac Toes.ipynb',
 'War.ipynb',
 '__pycache__']

This means we could write a for loop that opens all the files in a directory. Now, let's move some files around with shutil.
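Before moving on, here is a minimal sketch of such a loop. To keep it safe to run anywhere, it works inside a temporary folder created with the tempfile module; the file names are made up for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create two small files to read back.
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(folder, name), "w") as f:
            f.write("contents of " + name)

    # Loop over everything in the folder and read each file.
    contents = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)  # listdir gives names, not full paths
        if os.path.isfile(path):           # skip subfolders
            with open(path) as f:
                contents[name] = f.read()

print(contents)
# {'a.txt': 'contents of a.txt', 'b.txt': 'contents of b.txt'}
```

Note that os.listdir() returns bare names, so they need to be joined with the folder path before opening.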

In [6]:
import shutil

Right now, we have practice.txt located in the Udemy course python folder. Let's move it to the Python folder:

In [7]:
#The shutil.move() method takes in 2 parameters - the file you want to move
#and the location you want to move it to
shutil.move('practice.txt','C:\\Users\\aadia\\OneDrive\\Python')

'C:\\Users\\aadia\\OneDrive\\Python\\practice.txt'

In [9]:
os.listdir('C:\\Users\\aadia\\OneDrive\\Python')

['.ipynb_checkpoints',
 'blackjack.py',
 'Leetcode + Other puzzles.ipynb',
 'practice.txt',
 'ScoreBoard.txt',
 'Seccond course',
 'times tables.ipynb',
 'Udemy course python',
 '__pycache__']

As you can see, practice.txt has been successfully moved. Now, let's look at deleting files. Python gives us 3 ways to delete things: <code>os.unlink(path)</code>, which deletes the file at the path you provide; <code>os.rmdir(path)</code>, which deletes the folder at the path you provide (the folder must be empty); and <code>shutil.rmtree(path)</code>, which is the most dangerous as it removes the folder at the path along with all files and folders inside it. __None of these methods can be reversed (if you make a mistake you won't be able to recover the file)__. Instead of using these dangerous methods, we will use the send2trash module, which is safer as it sends deleted items to the trash bin instead of permanently deleting them. Install it with

pip install send2trash

at the command line. Now, it can be imported:

In [10]:
import send2trash

Let's delete the practice file. First, we need to move the file back into this directory:

In [11]:
shutil.move('C:\\Users\\aadia\\OneDrive\\Python\\practice.txt',os.getcwd())

'C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python\\practice.txt'

In [12]:
send2trash.send2trash('practice.txt')

Now, we can recover the file from the recycle bin if we really want to. 
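If you do want to see how the permanent deletion functions behave, one safe way is to try them inside a throwaway folder. This is only a sketch: it uses tempfile.mkdtemp() to make the folder, and note that rmtree lives in the shutil module:

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()  # a fresh, empty scratch folder

# os.unlink() permanently deletes a single file.
file_path = os.path.join(base, "practice.txt")
with open(file_path, "w") as f:
    f.write("This is a test string")
os.unlink(file_path)

# os.rmdir() deletes a folder, but only if it is empty.
empty_dir = os.path.join(base, "empty")
os.mkdir(empty_dir)
os.rmdir(empty_dir)

# shutil.rmtree() deletes a folder and everything inside it.
full_dir = os.path.join(base, "full")
os.mkdir(full_dir)
with open(os.path.join(full_dir, "inner.txt"), "w") as f:
    f.write("x")
shutil.rmtree(full_dir)

remaining = os.listdir(base)
print(remaining)     # []

shutil.rmtree(base)  # clean up the scratch folder itself
```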

The last thing we will look at in the os module is the walk function. It takes a starting directory, 'top', and walks the directory tree beneath it. For each directory it visits, it yields a 3-tuple: dirpath, dirnames, and filenames. Dirpath is the path to the directory, dirnames is a list of the names of its subfolders, and filenames is a list of the names of its files. Note that walk() is a generator, so it yields these tuples one at a time. We can thus use a for loop to go through it using tuple unpacking:

In [2]:
for folder, sub_folders, files in os.walk(os.getcwd()):
    print(folder,sub_folders,files)

C:\Users\aadia\OneDrive\Python\Udemy course python ['.ipynb_checkpoints', 'Course docs', 'MyFirstPackage', '__pycache__'] ['Advanced Modules.ipynb', 'Blackjack.ipynb', 'Decorators.ipynb', 'Error Handling.ipynb', 'Functions.ipynb', 'Generators.ipynb', 'Module1.py', 'Modules and packages.ipynb', 'Object & data structures.ipynb', 'Object Oriented Programming.ipynb', 'Program1.py', 'Python Statements.ipynb', 'Tic Tac Toes.ipynb', 'War.ipynb']
C:\Users\aadia\OneDrive\Python\Udemy course python\.ipynb_checkpoints [] ['Advanced Modules-checkpoint.ipynb', 'Blackjack-checkpoint.ipynb', 'Calculating PI and other puzzles-checkpoint.ipynb', 'Decorators-checkpoint.ipynb', 'Error Handling-checkpoint.ipynb', 'Functions-checkpoint.ipynb', 'Generators-checkpoint.ipynb', 'Modules and packages-checkpoint.ipynb', 'Object & data structures-checkpoint.ipynb', 'Object Oriented Programming-checkpoint.ipynb', 'Python Statements-checkpoint.ipynb', 'Tic Tac Toes-checkpoint.ipynb', 'times tables-checkpoint.ipynb'

What walk is doing is a depth-first traversal of the directory tree, visiting each folder in turn and listing its subfolders and files. This can be used to go through every file in a directory tree, for example to find one that you can't locate by hand.
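As a sketch of that idea, the hypothetical helper find_files() below walks a tree and collects every file with a given extension. It is tried out on a temporary folder so it runs anywhere:

```python
import os
import tempfile

def find_files(top, extension):
    """Return the paths of all files under top that end with extension."""
    matches = []
    for folder, sub_folders, files in os.walk(top):
        for name in files:
            if name.endswith(extension):
                matches.append(os.path.join(folder, name))
    return matches

with tempfile.TemporaryDirectory() as top:
    # Build a tiny tree: two files at the top and one in a subfolder.
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("a.py", "b.txt", os.path.join("sub", "c.py")):
        with open(os.path.join(top, rel), "w") as f:
            f.write("")

    found = sorted(os.path.relpath(p, top) for p in find_files(top, ".py"))

print(found)  # a.py plus the c.py inside sub
```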

# Python Datetime module

The datetime module allows you to create objects that hold info not only about the date and the time, but also things like time zones, and to do arithmetic between datetime objects, like finding how many seconds or days have passed. Let's explore the time portion of the datetime module by importing it:

In [2]:
import datetime

Let's try out the time class from the datetime module. It takes in a few parameters: hour, minute, second, microsecond, and then time zone information. If you leave some of them out, they are filled in with defaults. Let's only provide the hour and the minute:

In [3]:
mytime = datetime.time(8,26)

In [4]:
mytime.minute

26

In [5]:
mytime.hour

8

In [6]:
print(mytime)

08:26:00


In [7]:
mytime.microsecond

0

In [8]:
mytime.second

0

As you can see, even though we only provided hours and minutes, it automatically set seconds and microseconds to 0. Now, let's try out the date portion by finding the current date:

In [9]:
today = datetime.date.today()

In [10]:
print(today)

2021-04-20


In [11]:
today

datetime.date(2021, 4, 20)

As you can see, this follows the ISO 8601 way of writing dates (YYYY-MM-DD), with the largest unit (year) first and the smallest unit (day) at the end. These are also all attributes and can be pulled out individually:

In [12]:
today.year

2021

In [13]:
today.month

4

In [14]:
today.day

20

Date objects also have a ctime() method, which is another way of formatting the date that includes the weekday and a time field:

In [15]:
today.ctime()

'Tue Apr 20 00:00:00 2021'

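As you can see, it added the weekday and the month name, but the time is shown as 00:00:00 because a date object carries no time information. Let's see how to add that:

To do this, we need to import the datetime class from the datetime module. This class takes in year, month, day, hour, minute, second, microsecond, AND time zone info. Note that after this import, the name datetime refers to the class rather than the module. Let's try and implement it for our own time: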

In [2]:
from datetime import datetime

In [3]:
mydatetime = datetime(2021,4,20,8,36,17)

In [4]:
print(mydatetime)

2021-04-20 08:36:17


Let's say that we accidentally entered 8:36 when it is really 8:26. To change a field in our datetime object, we can use the replace method. It takes keyword arguments naming the part you want to replace (minute, second, year, etc.) and the new value for it.

In [5]:
mydatetime.replace(minute = 26)

datetime.datetime(2021, 4, 20, 8, 26, 17)

Because datetime objects are immutable, replace returns a new object rather than changing the original in place, so we need to assign the result back to mydatetime.

In [6]:
mydatetime = mydatetime.replace(minute = 26)

In [8]:
print(mydatetime)

2021-04-20 08:26:17


Now let's say that you want to see the difference between two dates. You can use simple arithmetic and the date class to do this:

In [9]:
from datetime import date

In [10]:
date1 = date(2022,11,3)
date2 = date(2021,11,3)

In [11]:
date1 - date2

datetime.timedelta(days=365)

Here, it returns a timedelta object that says that the difference between the two dates is 365 days. You can store this timedelta in a variable and then read attributes such as days from it.

In [12]:
result = date1-date2

In [13]:
result

datetime.timedelta(days=365)

In [14]:
result.days

365

In [18]:
result.months

AttributeError: 'datetime.timedelta' object has no attribute 'months'

As you can see, a timedelta has no months attribute. It only stores days, seconds, and microseconds, because months and years vary in length. We can also do arithmetic with datetime objects:

In [19]:
datetime1 = datetime(2022,11,3,12,0)

In [20]:
datetime2 = datetime(2021,11,3,2,0)

In [22]:
result = datetime1 - datetime2

In [23]:
result

datetime.timedelta(days=365, seconds=36000)
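To turn a timedelta like this into other units, we can read its days and seconds attributes or call total_seconds(). The short sketch below also shows that a timedelta can be added to a date; the specific dates are just for illustration:

```python
from datetime import date, datetime, timedelta

result = datetime(2022, 11, 3, 12, 0) - datetime(2021, 11, 3, 2, 0)

print(result.days)             # 365
print(result.seconds)          # 36000 (the leftover part of a day)
print(result.total_seconds())  # 31572000.0 (the whole difference)

# Convert the leftover seconds into hours ourselves.
hours = result.seconds // 3600
print(hours)                   # 10

# A timedelta can also be added to a date or datetime.
one_week_later = date(2021, 4, 20) + timedelta(weeks=1)
print(one_week_later)          # 2021-04-27
```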

# Python Math and Random modules

In [26]:
import math

The math module has a ton of useful math functions. We are going to look at the floor and ceiling functions here. Let's say you have a float and you want to find the largest integer that is less than or equal to that number. For this, you can use the floor function. The ceiling function does the opposite: it returns the smallest integer that is greater than or equal to the number:

In [27]:
value = 4.34

In [28]:
math.floor(value)

4

In [30]:
math.ceil(value)

5

These functions don't round to the nearest integer; they always go down or up. If you want true rounding, you can use the round() function built into Python:

In [32]:
round(4.4)

4

In [33]:
round(4.6)

5
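One detail that can be surprising: when a number is exactly halfway between two integers, Python's round() goes to the nearest even integer rather than always rounding up. A quick sketch:

```python
# Exactly-halfway values round to the nearest even integer.
print(round(2.5))       # 2
print(round(3.5))       # 4

# round() also accepts a number of decimal places.
print(round(4.567, 2))  # 4.57
```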

The math library has lots of useful constants, like pi or e:

In [34]:
math.pi

3.141592653589793

In [35]:
math.e

2.718281828459045

You can also compute logarithms with the math module:

In [42]:
math.log(100)

4.605170185988092

This logarithm uses base e (it is a natural logarithm). You can set your own base, such as 10, by passing it as a second parameter:

In [43]:
math.log(100,10)

2.0

Here, it returned 2.0 because log10(100) is 2. Another thing you can do is trigonometric functions. Note that these take angles in radians, not degrees:

In [48]:
math.sin(80)

-0.9938886539233752

In [46]:
math.cos(120)

0.8141809705265618

In [49]:
math.tan(100)

-0.5872139151569291
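Since the trig functions work in radians, an angle given in degrees has to be converted first. Here is a small sketch using math.radians() and math.degrees(). The results are compared with math.isclose() because floating-point values are rarely exact:

```python
import math

# Convert degrees to radians before calling sin/cos/tan.
sin_30 = math.sin(math.radians(30))
cos_60 = math.cos(math.radians(60))

print(math.isclose(sin_30, 0.5))  # True
print(math.isclose(cos_60, 0.5))  # True

# math.degrees() converts the other way.
print(math.degrees(math.pi))      # 180.0
```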

Now, let's check out the random module, starting with random.randint(a,b), which returns a random integer between a and b, including BOTH a and b.

In [1]:
import random

In [3]:
random.randint(0,100)

90

The random module generates pseudo-random numbers from a seed, which is the starting point of the random number algorithm. If you set the same seed and then make the same calls, you will always get the same results. This is useful if you are debugging code and you want to run it with the same random numbers. Here is how to set the seed:

In [6]:
random.seed(42)

random.randint(0,100)

81

What we have done here is set the seed for random operations to 42 so that random.randint() returns a predictable value. Note that in Jupyter notebooks it is easiest to put random.seed() and random.randint() in the same cell. They don't strictly need to be, but the seed has to be reset before each run if you want the same number again, and re-running a cell that only calls randint() continues the sequence instead. Let's try something out:

In [9]:
random.seed(42)
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))
print(random.randint(0,100))

81
14
3
94


Here, we get four different numbers rather than 81 four times. That is because a seed does not fix just one result; it fixes the whole sequence of random numbers that follows, so running this cell again prints the same four numbers in the same order.
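We can check the reproducibility directly by seeding twice and comparing the two sequences. The sketch also shows random.Random, which gives you a separate generator with its own seed so that other code using the module-level functions is not affected:

```python
import random

# Seed, draw four numbers, then re-seed and draw four more.
random.seed(42)
first_run = [random.randint(0, 100) for _ in range(4)]

random.seed(42)
second_run = [random.randint(0, 100) for _ in range(4)]

print(first_run == second_run)  # True

# An independent generator with its own seed.
rng = random.Random(42)
own_run = [rng.randint(0, 100) for _ in range(4)]
print(own_run == first_run)     # True
```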

Let's check out some more random module methods. One thing we can do is take a random item from a list with the choice() method.

In [10]:
mylist = list(range(10))

In [11]:
mylist

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [12]:
random.choice(mylist)

4

In [13]:
random.choice(mylist)

3

In [14]:
random.choice(mylist)

3

In [15]:
random.choice(mylist)

2

In [16]:
random.choice(mylist)

1

In [17]:
random.choice(mylist)

8

In [18]:
random.choice(mylist)

1

In [19]:
random.choice(mylist)

9

Keep in mind that this does not alter the list:

In [20]:
mylist

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Now, let's say you want to take 5 random numbers from the list. We can use the random.choices() method, which takes two arguments here: population, which is the list you are taking from, and k, which is the number of items you want to take. This samples with replacement, so the same item can be picked multiple times.

In [21]:
random.choices(population = mylist, k = 5)

[4, 0, 2, 5, 0]

Now, let's say we don't want to select the same item twice. We can then use random.sample(), which takes the same parameters.

In [22]:
random.sample(population = mylist, k = 5)

[3, 8, 6, 1, 9]

In [23]:
mylist

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

As you can see, no item was picked twice (this is sampling without replacement), and the original list is not altered. One more thing we can use is the .shuffle() method, which shuffles a list in place:

In [24]:
random.shuffle(mylist)

In [25]:
mylist

[8, 7, 5, 2, 3, 1, 6, 0, 4, 9]

Now, let's say you want a random number drawn from a normal (Gaussian) distribution. You can use .gauss() to do this. It takes in a mean and a standard deviation:

In [30]:
random.gauss(mu = 0, sigma = 1)

0.3671625953860221
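A single draw doesn't tell us much, so one way to convince ourselves that gauss() behaves as described is to draw many samples and check their mean and standard deviation with the statistics module. This is only a rough sketch; the sample size and seed are arbitrary choices:

```python
import random
import statistics

rng = random.Random(0)  # separate, seeded generator for repeatability
samples = [rng.gauss(mu=0, sigma=1) for _ in range(10_000)]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)

# With this many samples, both should land close to the requested values.
print(round(mean, 2), round(stdev, 2))
```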

# Python Debugger

When debugging code in Python, you often end up adding several print statements to track down the error. Fortunately, Python has a built-in debugger that lets you inspect variables while the code is running! Let's check it out by making an error:

In [1]:
x = [1,2,3]
y = 2
z = 3

result = y + z
result2 = x + y

TypeError: can only concatenate list (not "int") to list

This creates an error because we can't add an integer to a list. In this code we know exactly what happened, but imagine that we are dealing with longer code and we get an error. You could try to debug using print statements:

In [2]:
x = [1,2,3]
y = 2
z = 3

result = y + z
print(result)
print(x, y, z)
result2 = x + y

5
[1, 2, 3] 2 3


TypeError: can only concatenate list (not "int") to list

This kind of helps, but in a large program it becomes confusing to have too many print statements. Let's do this with the debugger's set_trace() function instead:

In [1]:
import pdb

In [2]:
x = [1,2,3]
y = 2
z = 3

result_one = y + z

result_two = y + x

TypeError: unsupported operand type(s) for +: 'int' and 'list'

Now, we can use set_trace(). What it does is pause execution mid-script and let you inspect the variables to find out what is going on. You put the call just before the line that fails, which the error message tells us:

In [None]:
x = [1,2,3]
y = 2
z = 3

result_one = y + z

pdb.set_trace()
result_two = y + x

--Return--
> <ipython-input-6-2653c393277e>(7)<module>()->None
-> pdb.set_trace()


The Python debugger then gives us a little text box to type into. We can enter variable names here to see their values at this point in the program and try to find the error.

In [2]:
x = [1,2,3]
y = 2
z = 3

result_one = y + z

pdb.set_trace()
result_two = y + x

--Return--
> <ipython-input-2-2653c393277e>(7)<module>()->None
-> pdb.set_trace()
(Pdb) x
[1, 2, 3]
(Pdb) y
2
(Pdb) z
3
(Pdb) result_one
5
(Pdb) q


BdbQuit: 

What we did was type variable names into the debugger prompt, and it returned their values at that point in time. This is very useful for checking a variable's current value without having a billion print statements. __Important__: to exit the debugger, enter 'q' at the prompt.
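In longer programs, it can be handy to keep the trace call inside a function and switch it on only when needed. The sketch below is one possible way to do that; the function add_values() and its debug flag are made up for this example:

```python
import pdb

def add_values(x, y, debug=False):
    """Add two values, optionally pausing in the debugger first."""
    if debug:
        # Execution pauses here. At the (Pdb) prompt you can type, e.g.:
        #   p x   print the value of x
        #   n     run the next line
        #   c     continue running normally
        #   q     quit the debugger
        pdb.set_trace()
    return x + y

print(add_values(2, 3))  # 5 (runs straight through, no pause)

# Uncomment to pause inside the function and inspect x and y:
# add_values([1, 2, 3], 2, debug=True)
```

On Python 3.7 and later, the built-in breakpoint() function can be used in place of pdb.set_trace() without importing anything.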

# Python regular expressions

We already know that we can look for substrings in a larger string with the 'in' operator. However, this has severe limitations - we need to know the exact string and need to perform additional operations to account for capitalization and punctuation. What if we only knew the pattern structure of the string we're looking for? Like an email or phone number?

Regular Expressions (regex) allow us to search for general patterns in text data! For example, if you are looking for emails in a document, a format could be:

<code> user@email.com </code>

We know we are looking for the pattern 'text' + '@' + 'text' + '.com'. We have the things we do not know, like the user name and the email domain name, but we have the things we think we know, which is the '@' symbol and the '.com'. 

The __re__ library allows us to make specialized pattern strings and then look for matches in the text. The primary skill for regex is understanding the special syntax for the pattern strings. 


Don't feel that you have to memorize the patterns! Just focus on knowing how to look up the information. One example of a regex pattern is this: let's say you are looking for a phone number in the format (555)-555-5555. The regex pattern would be r"\(\d\d\d\)-\d\d\d-\d\d\d\d".

Let's see what is going on here: the r before the opening quote makes this a raw string, which tells Python to leave the backslashes alone instead of treating them as escape characters. Notice that \d is repeated many times. It is an identifier: a placeholder, somewhat like a wildcard, that looks for a match. In this case, \d stands for 'digit', so it is a placeholder for any single digit. The other characters, like the hyphens, are literal parts of the format and must appear in the text exactly as written. The parentheses are a special case: an unescaped parenthesis creates a group in regex, so to match a literal parenthesis we escape it with a backslash, as in \( and \).

We can also use quantifiers to make this easier. By putting a number in curly braces after \d, we tell the pattern to match that many digits in a row. For example, r"\(\d{3}\)-\d{3}-\d{4}" would do the same thing as r"\(\d\d\d\)-\d\d\d-\d\d\d\d".
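Here is a quick check of the phone number pattern, in both the long form and the quantifier form. One detail worth seeing in action: in regex, an unescaped parenthesis starts a group rather than matching a literal '(' character, so the parentheses in the phone number have to be escaped with a backslash. The sample text is made up:

```python
import re

text = "Call (555)-555-5555 today"

long_form = r"\(\d\d\d\)-\d\d\d-\d\d\d\d"
short_form = r"\(\d{3}\)-\d{3}-\d{4}"

# Both patterns find the same number.
print(re.search(long_form, text).group())   # (555)-555-5555
print(re.search(short_form, text).group())  # (555)-555-5555

# Without the backslashes, the parentheses only form a group,
# so the pattern looks for 555-555-5555 and finds nothing here.
unescaped = re.search(r"(\d{3})-\d{3}-\d{4}", text)
print(unescaped)                            # None
```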


We will now learn how to use the re library to find patterns within text. Let's first see how to search for basic patterns. We will create a sample text saying something like "Joe's phone number is 408-555-1234. Call soon!":

In [1]:
text = "Joe's phone number is 408-555-1234. Call Soon!"

If you want to see if a word, like 'phone' is in the text, you can use a simple 'in' operator:

In [2]:
'phone' in text

True

However, we are looking for a pattern not a specific string. Let's import regular expressions to do this:

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern,text)

<re.Match object; span=(6, 11), match='phone'>

What we did was have the .search() function look for an instance of pattern in text. It reported that it did find a match and gave its span (it starts at index 6 and ends just before index 11). Now, let's try searching for some pattern that we know is not in the text and see what happens:

In [6]:
pattern = "NOT IN TEXT"

In [7]:
re.search(pattern,text)

What happened here is that, because it did not find a match, the .search() function just returned None, so nothing was displayed. Now, let's try playing with the original code that worked:

In [8]:
pattern = 'phone'

In [10]:
match = re.search(pattern,text)

In [11]:
match

<re.Match object; span=(6, 11), match='phone'>

Now, we can call some of the methods of this object:

In [12]:
match.span()

(6, 11)

In [13]:
match.start()

6

In [14]:
match.end()

11

However, if a string is found multiple times it will only return the first match:

In [15]:
text = 'my phone once, my phone twice'

In [16]:
match = re.search('phone',text)

In [17]:
match

<re.Match object; span=(3, 8), match='phone'>

As you can see here, it only found the first match and not the next. If we want to get all the instances of that string in the text, we can use the re.findall() method:

In [18]:
matches = re.findall('phone',text)

In [19]:
matches

['phone', 'phone']

It returns a list containing each matched string. However, if you want the actual match objects, you can use re.finditer(), which returns an iterator:

In [20]:
for match in re.finditer('phone',text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(18, 23), match='phone'>


In [21]:
type(re.finditer('phone',text))

callable_iterator

What finditer has done is create an iterator that yields a match object for each instance of the pattern in the text. If we want the actual matched text, we can use the .group() method:

In [23]:
for match in re.finditer('phone',text):
    print(match.group())

phone
phone


Now, here is a table of the character identifiers in the re library:

<table border="1" class="docutils">
<colgroup>
<col width="14%" />
<col width="86%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Code</th>
<th class="head">Meaning</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\d</span></tt></td>
<td>a digit</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\D</span></tt></td>
<td>a non-digit</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\s</span></tt></td>
<td>whitespace (tab, space, newline, etc.)</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\S</span></tt></td>
<td>non-whitespace</td>
</tr>
<tr class="row-even"><td><tt class="docutils literal"><span class="pre">\w</span></tt></td>
<td>alphanumeric or underscore</td>
</tr>
<tr class="row-odd"><td><tt class="docutils literal"><span class="pre">\W</span></tt></td>
<td>anything except alphanumerics and underscore</td>
</tr>
</tbody>
</table>

Let's start with the example that we used in the beginning with the telephone number:

In [2]:
text = 'My phone number is 408-555-1234'

In [26]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [27]:
phone

<re.Match object; span=(19, 31), match='408-555-1234'>

It is important whenever we are using regex to remember to put an r before the string starts. Otherwise, Python may treat the backslashes as the start of escape sequences such as \n or \t. Instead, we want them passed through to the regular expression as written. Now, here is a good use of the .group() method discussed earlier: we can use it on the phone object to see what string we found:

In [29]:
phone.group()

'408-555-1234'

But what if we had a pattern we were looking for that had 20 digits or 100 digits? We can't just go around putting 20 or 100 \d's in our code! We can use quantifiers to show repetition of a character.

![Quantifiers.PNG](attachment:Quantifiers.PNG)

Now, let's transform our original search into quantifier form:

In [5]:
phone = re.search(r'\d{3}-\d{3}-\d{4}',text)

In [6]:
phone

<re.Match object; span=(19, 31), match='408-555-1234'>
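The most common quantifiers can also be tried out directly in code. The sketch below runs a few of them over a made-up string so you can see what each one matches:

```python
import re

text = "a1 b22 c333 d4444"

print(re.findall(r"\d+", text))      # one or more:   ['1', '22', '333', '4444']
print(re.findall(r"\d{3}", text))    # exactly 3:     ['333', '444']
print(re.findall(r"\d{2,3}", text))  # 2 to 3:        ['22', '333', '444']
print(re.findall(r"\d{2,}", text))   # 2 or more:     ['22', '333', '4444']

# ? means the preceding character is optional (zero or one).
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']
```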

Let's say that we want to grab only the area code, the first 3 digits of the phone number. We can use groups for this: wrapping part of a pattern in parentheses marks it as a group that we can pull out separately later. We will also use the re.compile() function, which compiles a pattern string into a reusable pattern object. Our earlier pattern, r'\d{3}-\d{3}-\d{4}', can be thought of as 3 separate pieces connected by hyphens, so we put each piece in its own group. Here is the syntax:

In [9]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

What has happened is that .compile() has turned the pattern, with its 3 parenthesized pieces, into one regular expression object. The pattern still records that these are 3 different groups, so after a search you can call up the groups individually:

In [11]:
results = re.search(phone_pattern,text)

In [12]:
results.group()

'408-555-1234'

In [13]:
results.group(0)

'408-555-1234'

In [14]:
results.group(1)

'408'

In [15]:
results.group(2)

'555'

In [16]:
results.group(3)

'1234'

What we have done is create a pattern, search for it, and then pull out each group by its number. Group 0 (or no argument) is the entire match, and groups 1 onward are the parenthesized pieces in order. This is why groups and the group method are so powerful together.

Now, let's look at some more regular expression syntax. The first thing we will be looking at is the OR operator. Let's say you want to search a string for a cat OR a dog. Right now, we don't have the tools to do that:
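Two related conveniences are worth a quick look: a compiled pattern has its own search() method, and the match object's groups() method returns all the groups at once. The sketch also shows named groups, written (?P<name>...), where the names area, prefix, and line are just labels picked for this example:

```python
import re

text = "My phone number is 408-555-1234"

phone_pattern = re.compile(r"(\d{3})-(\d{3})-(\d{4})")

# A compiled pattern can search directly.
results = phone_pattern.search(text)

# groups() returns every group as a tuple.
print(results.groups())     # ('408', '555', '1234')

# Named groups let you ask for a piece by name instead of number.
named_pattern = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")
named = named_pattern.search(text)
print(named.group("area"))  # 408
print(named.groupdict())    # {'area': '408', 'prefix': '555', 'line': '1234'}
```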

In [5]:
re.search(r'cat','cat is here')

<re.Match object; span=(0, 3), match='cat'>

In [6]:
re.search(r'cat','dog is here')

When we search cat, we can't find dog. To do this, we need to use the pipe operator: | .

In [7]:
re.search(r'cat|dog','dog is here')

<re.Match object; span=(0, 3), match='dog'>

In [8]:
re.search(r'cat|dog','cat is here')

<re.Match object; span=(0, 3), match='cat'>

In [9]:
re.search(r'cat|dog','bird is here')

As you can see, the pipe has made it so that we can search for cat or dog. Now, let's look at the wildcard operator. Let's say you are looking for all occurrences of the letters 'at', and you also want to find the character that comes before each one. Searching for 'at' alone only gives us this:

In [10]:
re.findall(r'at','The cat in the hat sat there')

['at', 'at', 'at']

Here, we want to grab the character before the at as well: not just a letter or an alphanumeric, but any character. We can put a period before the at to do that:

In [11]:
re.findall(r'.at','The cat in the hat sat there')

['cat', 'hat', 'sat']

However, this will only grab one wildcard before at. What if we are looking for the word 'splat'?

In [12]:
re.findall(r'.at','The cat in the hat went splat')

['cat', 'hat', 'lat']

We want the whole string splat, but this only gives us lat. We can just add more wildcards before to make this work:

In [13]:
re.findall(r'...at','The cat in the hat went splat')

['e cat', 'e hat', 'splat']

Note that the wildcard will match any character, including spaces, so it can grab parts of neighboring words, as demonstrated above. Now, let's say we want to find out whether our string starts with a number. We can use the caret symbol (^) combined with an identifier to check:

In [15]:
re.findall(r'^\d','1 is a number')

['1']

Note that the caret anchors the match to the very beginning of the string, so this only checks whether the first character is a digit; digits elsewhere in the string are ignored. Now, let's say you want to check whether a string ends with a digit. You can use the ends-with operator (a dollar sign) for this:

In [2]:
re.findall(r'\d$','This string ends with the number 5')

['5']
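Here are a couple of small variations on the ideas above. If you want whole words ending in 'at' without counting wildcards, one option is to combine the \S identifier (non-whitespace) with the + quantifier. The second half shows the two anchors used as simple yes/no checks:

```python
import re

sentence = "The cat in the hat went splat"

# One or more non-whitespace characters followed by 'at'.
print(re.findall(r"\S+at", sentence))  # ['cat', 'hat', 'splat']

# ^ only matches at the start of the string.
print(bool(re.search(r"^\d", "1 is a number")))    # True
print(bool(re.search(r"^\d", "number 1")))         # False

# $ only matches at the end of the string.
print(bool(re.search(r"\d$", "ends with 5")))      # True
print(bool(re.search(r"\d$", "5 is at the start")))  # False
```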

Now, what if you wanted to take a string but exclude all the digits from it?

In [1]:
phrase = 'there 7 are lots 238 of random 0 numbers inside this 15 sentance'

To do this, we use square brackets with a caret inside, which means 'any character except these':

In [3]:
pattern = r'[^\d]'

In [5]:
re.findall(pattern,phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 'l',
 'o',
 't',
 's',
 ' ',
 ' ',
 'o',
 'f',
 ' ',
 'r',
 'a',
 'n',
 'd',
 'o',
 'm',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 ' ',
 's',
 'e',
 'n',
 't',
 'a',
 'n',
 'c',
 'e']

What this is doing is returning every single character that is not a digit. However, if we want to keep the runs of characters between the digits together, we can add the + quantifier, which matches one or more of the preceding item:

In [6]:
pattern = r'[^\d]+'

In [7]:
re.findall(pattern,phrase)

['there ', ' are lots ', ' of random ', ' numbers inside this ', ' sentance']
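
The chunks still carry the spaces that surrounded each number. One possible way to rebuild a clean sentance from them, shown here only as a sketch, is to join the chunks and then collapse the doubled spaces with split() and join():

```python
import re

phrase = 'there 7 are lots 238 of random 0 numbers inside this 15 sentance'

# Grab each run of non-digit characters, as in the pattern above
chunks = re.findall(r'[^\d]+', phrase)

# Glue the chunks together, then collapse the doubled spaces
# that are left where the digits used to be
no_digits = ' '.join(''.join(chunks).split())

print(no_digits)
# there are lots of random numbers inside this sentance
```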

What if we want to use inclusion instead of exclusion? As in, we want to include everything in those brackets instead of removing them?

In [1]:
text = 'Only find the hyphen-words in this sentance. But you do not know how long-ish they are'

The thing is, we don't know how many letters come before or after the hyphen in the hyphen-words. To find the words that have a hyphen in the middle, we search for a group of one or more alphanumeric characters, then a hyphen, then another group of one or more alphanumeric characters. Here is how to do it:

In [8]:
pattern = r'[\w]+-[\w]+'

In [9]:
re.findall(pattern, text)

['hyphen-words', 'long-ish']
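
As a side note, the square brackets around a single \w are optional, so the same search can be written without them. The sketch below compares both spellings and also shows that \w matches digits and underscores, not just letters (the second test string is invented for this example):

```python
import re

text = 'Only find the hyphen-words in this sentance. But you do not know how long-ish they are'

# [\w]+ and \w+ describe the same thing: one or more word characters
bracketed = re.findall(r'[\w]+-[\w]+', text)
plain = re.findall(r'\w+-\w+', text)
print(bracketed == plain)   # True

# \w also covers digits and underscores
mixed = re.findall(r'\w+-\w+', 'part_1-b and x-2')
print(mixed)                # ['part_1-b', 'x-2']
```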

# Timing code in Python

As you learn more Python, you will find multiple ways to solve the same task, and you might wonder which one is the most efficient. The easiest way to find out is to time your code's performance. We can do this in 3 ways: by simply finding the time elapsed, by using the timeit module, and by using the special %%timeit "magic" in Jupyter notebooks.

Let's use these methods while testing a function that finds the string representation of all the numbers up to an input. Here are a few of the functions:

In [10]:
def func_one(n):
    return [str(num) for num in range(n)]

In [11]:
func_one(10)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

In [12]:
def func_two(n):
    return list(map(str,range(n)))

In [15]:
func_two(10)

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

func_one and func_two are two functions that do the exact same thing. Now, we need to see which one is more efficient. One way of doing this is to record the time before and after the code runs and subtract the start time from the end time. Here's how we will do it using the time module:

In [21]:
import time

#Current Time Before
start_time = time.time()
#Run Code
result = func_one(1000000)
#Current Time After Running Code
end_time = time.time()
#Elapsed Time
elapsed_time = end_time - start_time

In [22]:
print(elapsed_time)

0.2686460018157959


This tells us that func_one took around 0.269 seconds to run with n set to one million. Let's see how func_two stacks up:

In [25]:
#Current Time Before
start_time = time.time()
#Run Code
result = func_two(1000000)
#Current Time After Running Code
end_time = time.time()
#Elapsed Time
elapsed_time = end_time - start_time

In [26]:
print(elapsed_time)

0.18251299858093262


As you can see, func_two was slightly faster. The disadvantage of this method is that it is unreliable when the code is very fast, such as when it takes less than 0.1 of a second to run, because a single measurement that small is easily swamped by noise. It's also harder to see how much faster one thing is than another if they are very close.
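
If you stay with the simple elapsed-time approach, time.perf_counter() is a clock in the time module meant for measuring short durations. The sketch below wraps the start/stop pattern in a helper; the name time_call is hypothetical and only exists for this example, and the numbers you see will differ from machine to machine:

```python
import time

def func_one(n):
    return [str(num) for num in range(n)]

def func_two(n):
    return list(map(str, range(n)))

def time_call(func, n):
    # Hypothetical helper: elapsed seconds for a single call of func(n)
    start = time.perf_counter()
    func(n)
    return time.perf_counter() - start

for func in (func_one, func_two):
    print(func.__name__, time_call(func, 1000000))
```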

Now, let's try the timeit module. The timeit.timeit function takes 3 main parameters: the statement (the code you want to time), the setup (code that defines any variables or functions the statement needs, run once before the timing starts) and the number of times you want the statement to run. The odd thing is that the statement and setup parts are passed in as strings here, not as the raw functions themselves. Let's look at it now by defining the statement variable:

In [28]:
stmt = '''
func_one(100)
'''

What we did here was create a variable called stmt, which holds the call func_one(100) as a string. This is the code that will be run over and over again.

In [29]:
setup = '''
def func_one(n):
    return [str(num) for num in range(n)]
'''

What we did here was define the function that the statement calls. timeit runs the setup code once, before the repeated statement calls begin. Note that by default the timed code cannot see the names defined in your notebook, so every variable or function the statement uses needs to be defined in setup.

In [32]:
num = 100000

What we did here was set num to be 100000, so the statement code will be run 100000 times.

In [33]:
import timeit

In [34]:
timeit.timeit(stmt,setup,number = num)

1.858846199999789

It says here that it took 1.86 seconds to run func_one(100) 100000 times. Now, let's do this again but with function two:

In [37]:
stmt2 = '''
func_two(100)
'''

In [38]:
setup2 = '''
def func_two(n):
    return list(map(str,range(n)))
'''

In [40]:
num = 100000

In [41]:
timeit.timeit(stmt2,setup2,number = num)

1.4314589999999043

As is clear, statement 2 is running quite a bit faster. Now, let's look at the third timing method, which uses the %%timeit magic command in Jupyter notebooks:

In [43]:
%%timeit
func_one(100)

19 µs ± 461 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [44]:
%%timeit
func_two(100)

14.4 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


IMPORTANT: %%timeit must be the first line of the Jupyter cell, otherwise it will not work. As you can see, func_two runs in around 14.4 microseconds per loop while func_one takes around 19.
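
Outside of Jupyter, the timeit module offers a couple of ways to skip the setup string. The sketch below is one possible arrangement, not the only one: it passes a globals dictionary so the statement string can see func_one, passes a zero-argument callable for func_two, and uses timeit.repeat to take several measurements. The loop count of 10000 and repeat count of 3 are arbitrary choices for this example:

```python
import timeit

def func_one(n):
    return [str(num) for num in range(n)]

def func_two(n):
    return list(map(str, range(n)))

# The globals argument tells timeit which names the statement string may use,
# so no setup string is needed
t_one = timeit.timeit('func_one(100)', globals={'func_one': func_one}, number=10000)

# A zero-argument callable can be passed instead of a string
t_two = timeit.timeit(lambda: func_two(100), number=10000)

# repeat() runs the whole measurement several times and returns a list of totals;
# the smallest value is usually the least disturbed by background noise
best_two = min(timeit.repeat(lambda: func_two(100), number=10000, repeat=3))

print(t_one, t_two, best_two)
```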

# Zipping and Unzipping files with built-in tools and shutil

We are going to learn how to zip and unzip files programmatically with Python. We will create a zip file, compress text files and insert them into the zip file, close it, and then unzip the contents into a folder of your choosing.



First, we will create some text files to compress. The first method uses tools built into Python, but we will later do it with the shell utilities module (shutil).

In [13]:
f = open('fileone.txt','w+')

In [14]:
f.write('ONE FILE')

8

In [15]:
f.close()

In [16]:
g = open('filetwo.txt','w+')

In [17]:
g.write('TWO FILE')

8

In [18]:
g.close()
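
The open / write / close steps above can also be written with a with block, which closes the file automatically when the block ends. This is a minimal sketch; it writes into a temporary folder (an assumption made here so the example does not touch your working directory) rather than next to the notebook:

```python
import os
import tempfile

# A temporary folder keeps this sketch self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'fileone.txt')

    # The file is closed as soon as the with block ends
    with open(path, 'w') as f:
        chars_written = f.write('ONE FILE')

    # Read it back to confirm what was written
    with open(path) as f:
        contents = f.read()

print(chars_written, contents)
# 8 ONE FILE
```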

Now, let's say these are really large text files and we want to compress them in order to send them as an email. Let's use zipfile to do this:

In [19]:
import zipfile

In [21]:
#Here, we are creating a zip file. The first parameter
#is the name of the zip file we will create, and the 
#second is the mode (we will be writing to this file
#and thus we are in mode 'w')
comp_file = zipfile.ZipFile('comp_file.zip','w')

Now that we have a zip file, we can compress the text files and insert them into the zip file. 

In [22]:
#Here we are writing to the zip file. The first parameter
#is the name of the file we want to compress, and the second
#is the compression type. We will be using ZIP_DEFLATED
#because it is one of the most common ones.
comp_file.write('fileone.txt',compress_type = zipfile.ZIP_DEFLATED)

In [23]:
comp_file.write('filetwo.txt',compress_type = zipfile.ZIP_DEFLATED)

In [24]:
comp_file.close()

To extract the files from the zip file, we do the following:

In [25]:
zip_obj = zipfile.ZipFile('comp_file.zip','r')

In [27]:
zip_obj.extractall('extracted_content(FOR THE ADVANCED MODULES LECTURE)')

What has happened here is that we opened the zip file in read mode and extracted its contents to a folder called extracted_content(FOR THE ADVANCED MODULES LECTURE), which is created in the current working directory.
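
ZipFile objects also work in a with block, so the explicit close() calls can be dropped. The sketch below repeats the write and extract steps inside a temporary folder (an assumption for this example) and uses namelist() to check what the archive contains before extracting:

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small text file to compress
    text_path = os.path.join(tmp_dir, 'fileone.txt')
    with open(text_path, 'w') as f:
        f.write('ONE FILE')

    zip_path = os.path.join(tmp_dir, 'comp_file.zip')
    with zipfile.ZipFile(zip_path, 'w') as comp_file:
        # arcname stores the file under a short name instead of its full path
        comp_file.write(text_path, arcname='fileone.txt',
                        compress_type=zipfile.ZIP_DEFLATED)

    out_dir = os.path.join(tmp_dir, 'extracted_content')
    with zipfile.ZipFile(zip_path, 'r') as zip_obj:
        names = zip_obj.namelist()   # file names stored in the archive
        zip_obj.extractall(out_dir)

    extracted = sorted(os.listdir(out_dir))

print(names, extracted)
# ['fileone.txt'] ['fileone.txt']
```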


Keep in mind that when you compress or extract things, you are typically working with an entire folder, not just a single file. For that, the shell utilities library (shutil) is a much better tool:

In [28]:
import shutil

What we do next is choose a directory we want to turn into a zip file. Let's turn the extracted_content folder into a zip file.

In [34]:
#this is the variable that represents the directory we want to zip
dir_to_zip = 'C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python\\extracted_content(FOR THE ADVANCED MODULES LECTURE)'

In [33]:
#This is the variable which will be where we want to output the zipped version
output_filename = 'example'

In [35]:
#make_archive takes in 3 parameters here - the base name of the output file
#(the extension is added for us), the archive format (in our case, zip)
#and the directory we want to zip.
shutil.make_archive(output_filename,'zip',dir_to_zip)

'C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python\\example.zip'

We have successfully turned an entire folder into a zip file named 'example'! The return value of the function is the path of the new archive.

To extract the contents of the zip file with shell utilities, we use the unpack_archive method:

In [36]:
#Here, the first parameter is the archive we want to unpack,
#the second is the folder we want to extract it into,
#and the third is the archive format.
shutil.unpack_archive('example.zip','final_unzip','zip')
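
To try the whole round trip without relying on a specific path on your machine, the two shutil calls can be run inside a temporary folder. This is only a sketch: the folder layout and file contents below are invented for the example.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small folder to archive
    dir_to_zip = os.path.join(tmp_dir, 'extracted_content')
    os.makedirs(dir_to_zip)
    with open(os.path.join(dir_to_zip, 'fileone.txt'), 'w') as f:
        f.write('ONE FILE')

    # make_archive adds the .zip extension and returns the archive path
    archive_path = shutil.make_archive(os.path.join(tmp_dir, 'example'),
                                       'zip', dir_to_zip)

    # Unpack the archive into a new folder and list what came out
    final_unzip = os.path.join(tmp_dir, 'final_unzip')
    shutil.unpack_archive(archive_path, final_unzip, 'zip')
    unpacked = os.listdir(final_unzip)

print(os.path.basename(archive_path), unpacked)
# example.zip ['fileone.txt']
```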

# Advanced Modules Puzzle

The puzzle is to unzip a file called unzip_me_for_instructions.zip, open the txt file with python, and see if you can figure out what you need to do.

In [38]:
import shutil

In [39]:
shutil.unpack_archive('C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python\\Course Docs\\12-Advanced Python Modules\\08-Advanced-Python-Module-Exercise\\unzip_me_for_instructions.zip','instructions','zip')

__EXTRACTED INSTRUCTIONS:__


Good work on unzipping the file!
You should now see 5 folders, each with a lot of random .txt files.
Within one of these text files is a telephone number formated ###-###-#### 
Use the Python os module and regular expressions to iterate through each file, open it, and search for a telephone number.
Good luck!

In [48]:
import os

In [46]:
import re

In [50]:
#Variable for the extracted content directory
dir_of_files = 'C:\\Users\\aadia\\OneDrive\\Python\\Udemy course python\\instructions\\extracted_content'

In [71]:
#Solution Function

#Compile the phone number pattern once (###-###-####)
phone_pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

#Walk through every subfolder of the extracted content
for folder, sub_folders, files in os.walk(dir_of_files):
    #Go through each of the text files
    for text_file in files:
        location = os.path.join(folder, text_file)
        #The with block closes the file once it has been read
        with open(location) as the_file:
            file_contents = the_file.read()
        #Search once and reuse the match object
        number = phone_pattern.search(file_contents)
        if number is not None:
            print(f'The phone number is {number.group()} and was found at {location}')

The phone number is 719-266-2837 and was found at C:\Users\aadia\OneDrive\Python\Udemy course python\instructions\extracted_content\Four\EMTGPSXQEJX.txt
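
The same search can be packaged as a reusable function and tried on a throwaway folder tree. This is a sketch under stated assumptions: find_phone_numbers is a hypothetical helper name, and the folders, file names and the number 555-123-4567 are made-up test data, not part of the puzzle.

```python
import os
import re
import tempfile

PHONE_PATTERN = re.compile(r'\d{3}-\d{3}-\d{4}')

def find_phone_numbers(root_dir):
    # Hypothetical helper: returns (number, path) pairs for every match under root_dir
    found = []
    for folder, sub_folders, files in os.walk(root_dir):
        for file_name in files:
            path = os.path.join(folder, file_name)
            with open(path) as the_file:
                for number in PHONE_PATTERN.findall(the_file.read()):
                    found.append((number, path))
    return found

# Build a tiny folder tree with one file that contains a made-up number
with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, 'One'))
    os.makedirs(os.path.join(tmp_dir, 'Two'))
    with open(os.path.join(tmp_dir, 'One', 'a.txt'), 'w') as f:
        f.write('nothing to see here')
    with open(os.path.join(tmp_dir, 'Two', 'b.txt'), 'w') as f:
        f.write('call 555-123-4567 today')
    results = find_phone_numbers(tmp_dir)

print(results[0][0])
# 555-123-4567
```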
