# Session 5
- Command Line Processing
- Regular Expressions
- File Formats
- Various helpful modules + some more from personal experience

## Command Line Processing
```shell
python3 -m program_name argument_1 argument_2
python3 program_name.py argument_1 argument_2
```
We can create command line tools in Python with several modules from the standard library:
1. [sys argv](https://docs.python.org/3/library/sys.html#sys.argv)
2. [argparse](https://docs.python.org/3/library/argparse.html)
3. [optparse](https://docs.python.org/3/library/optparse.html)
4. [getopt](https://docs.python.org/3/library/getopt.html)   # Don't use this one

And good external ones are:
1. [click](https://click.palletsprojects.com/en/stable/)
2. [msgspec-click](https://github.com/ofek/msgspec-click)
3. [typer](https://typer.tiangolo.com/)

<br>

The goal is to give a general idea of what to expect.  
You will need to experiment with it yourself outside of this Notebook.  

In [None]:
import sys

# len(sys.argv) plays the role that argc has in C
if len(sys.argv) != 4:
    print("Three arguments needed")
    sys.exit(1)  # a non-zero exit status signals an error

print(f'The name of the program: {sys.argv[0]}')
print(f'The first argument: {sys.argv[1]}')
print(f'The second argument: {sys.argv[2]}')
print(f'The third argument: {sys.argv[3]}')

One could work with `sys.argv` directly, but every option and argument then has to be matched by hand.  
That makes it a lot more error prone.  
I prefer the other options available, as they are more forgiving.  
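
To make the matching by hand concrete, here is a minimal sketch of parsing a `-v` flag and an `-o FILE` option without any helper module. The function name `parse_argv` and the option names are made up for illustration, and the function takes a list so it can be tried without a real command line.

```python
def parse_argv(argv):
    """Hand-rolled parsing of '-v', '-o FILE' and positional arguments (illustrative only)."""
    verbose = False
    output = None
    rest = []
    i = 1  # argv[0] is the program name
    while i < len(argv):
        arg = argv[i]
        if arg == '-v':
            verbose = True
        elif arg in ('-o', '--output'):
            # The option needs a value, so we have to check that one follows
            if i + 1 >= len(argv):
                raise SystemExit(f'{arg} needs a value')
            i += 1
            output = argv[i]
        elif arg.startswith('-'):
            raise SystemExit(f'Unknown option: {arg}')
        else:
            rest.append(arg)
        i += 1
    return verbose, output, rest


print(parse_argv(['prog.py', '-v', '-o', 'out.txt', 'a.txt', 'b.txt']))
```

Every new option adds another branch, and things like `--output=out.txt` or help text are still missing, which is the work the modules below take over.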

In [None]:
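import optparse


def process(args, output=None, verbose=False):
    # Hypothetical stand-in for the real work of the program
    print(args, output, verbose)


if __name__ == '__main__':
    parser = optparse.OptionParser()
    parser.add_option('-o', '--output')
    parser.add_option('-v', dest='verbose', action='store_true')
    opts, args = parser.parse_args()
    process(args, output=opts.output, verbose=opts.verbose)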

In [None]:
import argparse


def process(args, output=None, verbose=False):
    # Hypothetical stand-in for the real work of the program
    print(args, output, verbose)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output')
    parser.add_argument('-v', dest='verbose', action='store_true')
    parser.add_argument('rest', nargs='*')
    args = parser.parse_args()
    process(args.rest, output=args.output, verbose=args.verbose)

**The difference between optparse and argparse**  
optparse returns the options and the positional arguments separately.  
optparse gives more control and is less 'opinionated'.  
For most use cases, argparse is the recommended choice.  
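
The difference is easiest to see by feeding both parsers the same arguments. This is a small sketch; the `argv` list is a made-up command line, passed in explicitly so the cell runs inside a notebook.

```python
import argparse
import optparse

argv = ['-v', '-o', 'out.txt', 'a.txt', 'b.txt']  # made-up command line

# optparse: options and positional arguments come back as two separate values
op = optparse.OptionParser()
op.add_option('-o', '--output')
op.add_option('-v', dest='verbose', action='store_true')
opts, positional = op.parse_args(argv)
print(opts.output, opts.verbose, positional)

# argparse: everything, positionals included, lives on one namespace
ap = argparse.ArgumentParser()
ap.add_argument('-o', '--output')
ap.add_argument('-v', dest='verbose', action='store_true')
ap.add_argument('rest', nargs='*')
args = ap.parse_args(argv)
print(args.output, args.verbose, args.rest)
```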

In [1]:
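# Works with decorators -> https://realpython.com/primer-on-python-decorators/
import functools
import click

def decorator(fn):
    # A decorator takes a function and returns a wrapped replacement for it
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        print('before the call')
        result = fn(*args, **kwargs)
        print('after the call')
        return result
    return wrapper

# click's decorators map command line options and arguments onto the function parameters
@click.command()
@click.option('-i', '--interactive', is_flag=True, help='Interactive Mode')
@click.argument('ip')
@click.argument('port', type=int)
def send_cmd(interactive: bool, ip: str, port: int):
    pass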

## Regular Expressions
I keep banging on about it, but read the [Python Docs](https://docs.python.org/3/howto/regex.html).  
Regular expressions provide pattern matching.  
A pattern is a regular string in which some characters, the 'metacharacters', have a special meaning instead of matching themselves.  

**All Meta characters**
|Character|Meaning|
|--|--|
|\.|Matches any character (except newline)|
|\[\]|Matches any one character from a set of characters|
|\*|Zero or more occurrences|
|\+|One or more occurrences|
|{}|Exactly the specified number of occurrences (or a range, as in `{1,2}`)|
|?|Zero or one occurrence|
|^|Starts with|
|$|Ends with|
|\||Either one or the other|
|()|Capture a group|
|\\|Special sequence or escape character|
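
A quick way to get a feel for the quantifiers and anchors in the table is to run them against one short string. The sample text below is made up for this sketch.

```python
import re

text = "colour color colouur clr"  # made-up sample text

print(re.findall(r"colou?r", text))    # ? -> zero or one 'u'
print(re.findall(r"colou*r", text))    # * -> zero or more 'u'
print(re.findall(r"colou+r", text))    # + -> one or more 'u'
print(re.findall(r"colou{2}r", text))  # {2} -> exactly two 'u'
print(re.findall(r"c.l", text))        # . -> any single character between c and l
print(re.findall(r"^col|clr$", text))  # 'col' at the start OR 'clr' at the end
```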

**Special Sequences**
|Character|Meaning|
|--|--|
|\A|Matches only at the beginning of the string|
|\b|Matches the empty string at the beginning or at the end of a word|
|\B|Matches the empty string, but NOT at the beginning or at the end of a word|
|\d|Matches any digit (0-9)|
|\D|Matches any character that is NOT a digit|
|\s|Matches any whitespace character|
|\S|Matches any character that is NOT whitespace|
|\w|Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _)|
|\W|Matches any character that is NOT a word character|
|\Z|Matches only at the end of the string|
|\n|Newline character|
|\r|Carriage return|
|\t|Tabulation|
|\\/|Forward slash|
|\\\\|Backward slash|
|\\"|Double Quote|
|\\'|Single Quote|
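
The special sequences are easier to remember after seeing them pick things out of a string. A small sketch with made-up sample strings:

```python
import re

words = "cat catalog concat"  # made-up sample text

# \b marks a word boundary, so it controls where 'cat' may start or end
print(re.findall(r"\bcat\b", words))   # 'cat' as a whole word only
print(re.findall(r"\bcat\w*", words))  # words that start with 'cat'
print(re.findall(r"\w*cat\b", words))  # words that end with 'cat'

# \d matches digits, \s whitespace, \w word characters (including _)
print(re.findall(r"\d+", "Order 66 was given at 19:05"))
print(re.split(r"\s+", "a  b\tc\nd"))
print(re.findall(r"\w+", "snake_case, kebab-case"))
```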


In [2]:
import re

# the r in front of the string tells python to use a raw string
# so it doesn't try to convert parts itself according to standard Python string interpretations
# the compile 'compiles' a regular expression into a Pattern Object
find_a = re.compile(r"a+")
answer = find_a.findall("aardvark")
print(answer)

['aa', 'a']


In [3]:
# Finding all occurrences and returning them
import re

print(re.findall(r'a+', 'aardvark'))

['aa', 'a']


In [4]:
# Finding the location of the match
import re

# re.search returns a Match object for the first occurrence in the string
# Adjacent string literals are joined, so no backslash continuations are needed
m = re.search(r'a+',
              'Aardvarks are medium-sized, burrowing, nocturnal mammals native to Africa. '
              'Aardvarks are the only living species of the family Orycteropodidae and the order Tubulidentata. '
              'They have a long proboscis, similar to a pig\'s snout, which is used to sniff out food.')
print(f'{m.group()} is found at index {m.start()}')

# Methods of the Match object are
# group -> returns the matched text
# start -> returns the index where the match starts
# end   -> returns the index just past the end of the match

a is found at index 1


In [5]:
import re

# re.match returns a Match Object if the string starts with the pattern
m = re.match(r'a+', 'Look out for the banana')
if m:
    print(f'{m.group()} starts the string')
else:
    print('The pattern does not start at the beginning of the string')

The pattern does not start at the beginning of the string


In [6]:
# finditer is handy when you need more than the matched text, such as positions
import re

# finditer returns an iterator of Match objects for all occurrences found
mlist = re.finditer(r"a+", "Look out! A dangerous aardvark is on the loose !")
for m in mlist:
    print(f"{m.group()} starts at index {m.start()} and ends at index {m.end()}.")

a starts at index 13 and ends at index 14.
aa starts at index 22 and ends at index 24.
a starts at index 27 and ends at index 28.


In [8]:
import re

# ball, bell, bill, boll, bull, we want them all
# [] designates a set of characters, any one of which matches
slist = re.findall(r"[Bb][aeiou]ll", 
                    "Bill Gates and Uwe Boll drank Red Bull at a football match in Campbell.")
print(slist)

['Bill', 'Boll', 'Bull', 'ball', 'bell']


In [9]:
import re

mlist = re.finditer(r"ba+","A sheep says ' baaaaah ' to Ali Baba.")
for m in mlist:
    print(f"{m.group()} is found at {m.start()}.")

baaaaa is found at 15.
ba is found at 34.


In [10]:
import re

# Capture a group of characters with ()
# \d{1,2} means one or two digits
# group(0) is the whole match
# group(1), group(2), ... are the individual capture groups, in order
# groups() returns a tuple of all the capture groups
date = re.compile(r"(\d{1,2})-(\d{1,2})-(\d{4})")
m = date.search("In response to your letter of 25-3-2015 , I decided to hire a hitman to get you." )
if m:
    print(f"Date {m.group(0)}; day {m.group(1)}; month {m.group(2)}; year {m.group(3)}")
    print(m.groups())

Date 25-3-2015; day 25; month 3; year 2015
('25', '3', '2015')


In [13]:
import re

# findall returns a list of all matches
# in all examples so far that was a list of strings
# but with MULTIPLE groups, each match is a tuple that contains the groups
date = re.compile(r"(\d{1,2})-(\d{1,2})-(\d{4})")
dlist = date.findall("In response to your letter of 25-3-2015, on 27-3-2015 I decided to hire a hitman to get you.")
for d in dlist:  # a separate name, so the compiled pattern in `date` is not overwritten
    print(d)

('25', '3', '2015')
('27', '3', '2015')


In [14]:
import re

# We can name the groups too!
# ?P<name> before the pattern inside of the capture group
date = re.compile(r"(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4})")
m = date.search("In response to your letter of 25-3-2015, I have to admit that you are a butt ugly motherfucker")
if m:
    print(f"day is {m.group('day')}")
    print(f"month is {m.group('month')}")
    print(f"year is {m.group('year')}")

day is 25
month is 3
year is 2015


In [15]:
import re

# And lastly, we can replace stuff too!
# refer back to a group by number with \g<number>
# the replacement is a raw string too, otherwise Python warns about the invalid escape \g
s = re.sub(r"([iy])se", r"\g<1>ze",
           "Whether you categorise, emphasise, or analyse, you should use American spelling !")
print(s)

Whether you categorize, emphasize, or analyze, you should use American spelling !
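
Named groups work in replacements as well, and the replacement may also be a function. The sketch below reuses the date pattern from the earlier cells; the sample text and the helper name `to_iso` are made up for illustration.

```python
import re

date = re.compile(r"(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4})")
text = "Letters dated 25-3-2015 and 7-11-2016"  # made-up sample text

# \g<name> refers to a named group inside the replacement string
print(date.sub(r"\g<year>/\g<month>/\g<day>", text))


def to_iso(m):
    # A replacement function receives each Match object and returns the new text
    return f"{m.group('year')}-{int(m.group('month')):02d}-{int(m.group('day')):02d}"


print(date.sub(to_iso, text))
```

A function is the way to go when the replacement needs computation, such as the zero padding here.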


## File Formats
1. Comma Separated Values
2. Pickling
3. JavaScript Object Notation (JSON)
4. HTML & XML

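We can open any CSV file the normal way, strip the newline, and split each line on commas.  
But CSV is not strictly standardized: delimiters, quoting, and line endings differ between programs.  
Python has a module that deals with all these dialects.  
It is, creatively, called csv.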
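
As a small sketch of why splitting by hand breaks down, consider semicolon-delimited data with a quoted field. The sample data is made up, and `io.StringIO` stands in for a real file.

```python
import csv
import io

# Semicolon-delimited data with a quoted field that contains the delimiter
raw = 'MOVIE;RATING\n"Monty Python; the Holy Grail";8\nLife of Brian;8.5\n'

# Naive splitting cuts the quoted field into two pieces
print(raw.splitlines()[1].split(';'))

# The csv module honours the quotes; DictReader also uses the header row as keys
rows = list(csv.DictReader(io.StringIO(raw), delimiter=';'))
for row in rows:
    print(row['MOVIE'], float(row['RATING']))
```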

In [19]:
import csv

# recommendation from Python docs to add newline=''
with open('movies.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['MOVIE', 'RATING'])
    writer.writerow(["Monty Python and the Holy Grail", 8])
    writer.writerow(["Monty Python's Life of Brian", 8.5])
    writer.writerow(["Monty Python's Meaning of Life", 7])
    
with open('movies.csv', 'r', newline='') as f:
    reader = csv.reader(f, delimiter=',')
    
    # Skip the header row by advancing the reader once
    next(reader)
    for line in reader:
        print(line)

['Monty Python and the Holy Grail', '8']
["Monty Python's Life of Brian", '8.5']
["Monty Python's Meaning of Life", '7']


In [20]:
# Only unpickle data you trust: pickle.load can execute arbitrary code from the file
from pickle import dump, load

cheeseshop = [("Roquefort", 12, 15.23), ("White Stilton", 25, 19.02), ("Cheddar", 5, 0.67)]

# Store as binary file
with open("cheese.pck", "wb") as f:
    dump(cheeseshop, f)
print("Cheeseshop was pickled")

# Open as binary file
with open("cheese.pck", "rb") as f:
    buffer = load(f)
    print(type(buffer))
    print(buffer)

Cheeseshop was pickled
<class 'list'>
[('Roquefort', 12, 15.23), ('White Stilton', 25, 19.02), ('Cheddar', 5, 0.67)]


In [21]:
from json import dump, load

# dump and load work with a file object
# dumps and loads work with a string
# JSON has no tuple type, so the tuples come back as lists

cheeseshop = [("Roquefort", 12, 15.23), ("White Stilton", 25, 19.02), ("Cheddar", 5, 0.67)]

with open("cheese.json", "w") as f:
    dump(cheeseshop, f)

with open("cheese.json", "r") as f:
    buffer = load(f)
    print(type(buffer))
    print(buffer)

<class 'list'>
[['Roquefort', 12, 15.23], ['White Stilton', 25, 19.02], ['Cheddar', 5, 0.67]]
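
The string variants behave the same way without a file. A minimal sketch, reusing one entry of the cheeseshop data:

```python
import json

cheese = ("Roquefort", 12, 15.23)

text = json.dumps(cheese)  # Python object -> JSON string
print(text)
back = json.loads(text)    # JSON string -> Python object
print(back)

# JSON has no tuple type, so the round trip is not an exact copy
print(back == list(cheese), back == cheese)
```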


In [22]:
# for HTML and XML parsing -> https://beautiful-soup-4.readthedocs.io/en/latest/
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
print(soup.title)

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

<title>The Dormouse's story</title>


## Useful Modules
1. Datetime / Time
2. Collections
3. Urllib
4. Glob

In [24]:
from datetime import datetime

print(datetime.now())

2025-02-27 21:15:46.006377
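
Beyond `now()`, the parts of `datetime` I reach for most are parsing, arithmetic, and formatting. A minimal sketch, using the date from the regex examples as made-up input:

```python
from datetime import datetime, timedelta

# strptime parses a string with an explicit format; %d-%m-%Y is day-month-year
letter = datetime.strptime("25-3-2015", "%d-%m-%Y")
print(letter)

# timedelta is a duration that can be added to or subtracted from a datetime
reply = letter + timedelta(days=2)

# strftime and isoformat turn a datetime back into text
print(reply.strftime("%A %d %B %Y"))
print(reply.date().isoformat())
```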


In [25]:
from collections import Counter, deque

# Counter is a dict subclass that counts how often each item occurs
data = ['apple', 'pear', 'apple', 'apple', 'chicken nugget']
c = Counter(data)
print(c)
print(c.most_common())

# Double Ended Queue
dq = deque([1,2,3])
dq.appendleft(4)
dq.extendleft([3, 4])
print(dq)

Counter({'apple': 3, 'pear': 1, 'chicken nugget': 1})
[('apple', 3), ('pear', 1), ('chicken nugget', 1)]
deque([4, 3, 4, 1, 2, 3])
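
The order in the deque output can look surprising, so here is a short sketch of what `extendleft` does, plus two features that tend to come in handy. The values are made up.

```python
from collections import Counter, deque

# extendleft pushes items one at a time, so they end up in reverse order
dq = deque([1, 2, 3])
dq.extendleft([10, 20])
print(dq)

# A deque with maxlen drops items from the opposite end once it is full
last_three = deque(maxlen=3)
for n in range(6):
    last_three.append(n)
print(last_three)

# Counters can be updated later and asked for only the top n entries
c = Counter(['apple', 'pear', 'apple'])
c.update(['pear', 'pear', 'kiwi'])
print(c.most_common(1))
```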


In [26]:
# urlopen hands back raw bytes, here still gzip-compressed (the output starts with \x1f\x8b)
# a solid reason why so many people use requests instead
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from sys import exit

try:
    u = urlopen('http://www.python.org')
except HTTPError as e:
    print(f'HTTP Error: {e}')
    exit()
except URLError as e:
    print(f'URL Error: {e}')
    exit()
    
for i in range(5):
    text = u.readline()
    print(text)

u.close()

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed}}w\xdb6\xb2\xf7\xff9g\xbf\x03\xaa\x9c\x1b;\xdbP\xb2\xe4w\xc7V\xd7I\x9c\xd4\xdd\xbc\xb8u\xd2\xdc{{{r(\x8a\x92hK$#\x92\xb6\xd5\xdd\xfd\xee\xcfo\x06\x00\tR$%\xd9n\xda>\'\xddm-Q 0\x18\xcc;\x06\x83\xc3o\xfa\x81\x13\xcfBW\x8c\xe2\xc9\xb8\xfb\xe0\xf0\x1b\xcb\xfa\xc5\x1b\x88q,NO\xc4\xee\xaf]!\xc4!\xfd$\x9c\xb1\x1dEG\r?\xb0."\xe1\xb9;hby\xee\xae\xfc\xb3\'\xff\xec7\xb8\xfd7\xbf\xb8~\xdf\x1b\xfcjYY\x87ioU\x1d\x96\xf5\x84\xb6\xf4OU\x87{\x0c\x1e5(\x830\x07\x93l\xa7\xff[\xda\xe1\x90\xa7\x8c>\t\x07\xdd\xf9.\x1bbl\xfb\xc3\xa3\x86\xeb7D\xdf\x9b\x1e5\xc6\xf1\x94\xe6K\xcds\x1d>8\x1c\xb9v\xbf\xfb@\x82nY\xe2U\x10\x0c\xc7\xae\x88\xed\xa1X\x1f\xe2\xbf\xcd\x8b\xe8\xb1 \xe4p\x8b\xc8\x99za,\xech\xe6;"\x9a:G\x8dQ\x1c\x87\xd1A\xabu}}\xdd\x1c\xf2\xbbxib\xfb\xf6\xd0\x9d6\x9d`\xd2\xa2NZ\x17\xd1w^\xff\xe8\x95\xf5\xfe\xe5\xe6\xf6\xff\xbc\xdc\x7f\xfe\xf3\xf7\x8d\xeeaK\xf6\x96\xebZ~\x11\xe2\xda\xf3\xfb\xc1u\xb3o\xc7\xf6k{\xe6N\xc5\xd1\xfc\xa3\x7f\xff[\xfc\xf2\xebS\x06L\x88A
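
The output starts with `\x1f\x8b`, the gzip magic number, which suggests the body arrived compressed and `urlopen` passed it through untouched. Here is a minimal offline sketch of handling that; the HTML snippet is a made-up stand-in for the response body, so no network access is needed.

```python
import gzip

# Stand-in for a compressed response body (made-up content)
body = gzip.compress(b"<!doctype html>\n<html>...</html>\n")

# gzip data always starts with the magic bytes 1f 8b
print(body[:2])

if body[:2] == b"\x1f\x8b":
    body = gzip.decompress(body)

# Going from bytes to str needs an explicit decode
print(body.decode("utf-8").splitlines()[0])
```

With a real response, the same check could be applied to the bytes returned by `u.read()`.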

In [29]:
# use iglob for an iterator instead of a list
from glob import iglob

data = iglob('*.csv')
print(type(data))
for name in data:
    print(name)

<class 'generator'>
movies.csv
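
Patterns can also reach into subdirectories. A small sketch in a throwaway directory, with made-up file names:

```python
import tempfile
from glob import glob
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'sub').mkdir()
    for name in ('a.csv', 'b.txt', 'sub/c.csv'):
        (root / name).touch()

    # * does not cross directory boundaries
    flat = sorted(glob(str(root / '*.csv')))
    # ** with recursive=True also descends into subdirectories
    deep = sorted(glob(str(root / '**' / '*.csv'), recursive=True))

    print([Path(p).name for p in flat])
    print([Path(p).name for p in deep])
```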
