# 1. Getting started with Python 👋

- <span style='color:orange'>Don't use floating point for money</span>
    - Can lead to strange rounding errors
    - Instead, store the amount as an integer number of cents, e.g. `1025` to represent $10.25
- You can also change text colour in notes using code :
    - `<span style='color:aqua'>` <span style='color:aqua'>text u want colored</span> `</span>`
![](https://user-images.githubusercontent.com/18719295/187836098-ae7589a5-8cec-4462-ada7-cf173625c5dd.png)
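
A minimal sketch of the integer-cents idea from the money note above. The variable names (`float_sum`, `price_cents`, `total_cents`) are made up for illustration and are not from the original notes.

```python
# Floats cannot represent most decimal fractions exactly
float_sum = 0.1 + 0.2
print(float_sum)             # 0.30000000000000004
print(float_sum == 0.3)      # False

# Storing cents as an int keeps the arithmetic exact
price_cents = 1025           # represents $10.25
total_cents = price_cents * 3
dollars, cents = divmod(total_cents, 100)
print(f'${dollars}.{cents:02d}')   # $30.75
```

Only convert to dollars and cents at the moment of display; all the maths stays in whole numbers.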

## Expressions

In [6]:
# int() can convert a str that contains digits
my_str = '123'
int(my_str) + 1

# name = input('Who are you')
# print(f'Hi', name)

int(88.6)

88

## Conditionals

In [14]:
x = 5
if x == 5: # think of == as a question
    print('x equals 5')
    
for i in range(5):
    if i > 2:
        print('larger than 2')
print('All done')


x equals 5
larger than 2
larger than 2
All done


In [None]:
# If code might raise a traceback, you can wrap it in try/except
astr = 'Hello Bob'
try:
    istr = int(astr) # will fail, goes to except
except:
    istr = -1
print('First', istr)

astr = '123'
try:
    istr = int(astr) # will pass, skips except
except:
    istr = -1
print('Second', istr)

First -1
Second 123



- `try/except` works a bit like if/else: if the code in `try` fails, the `except` block runs instead of a traceback
- Don't put your entire program in the `try:` block
- When a line inside `try` fails, the rest of the `try` block is skipped and execution jumps straight to `except`
- Best practice is to keep the `try` block as small as possible (only the risky lines that might fail)
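
A hedged sketch of the "keep `try` small" advice. The helper name `to_int` and its `default` parameter are hypothetical, and catching `ValueError` specifically (rather than a bare `except:`) is one common refinement, not something these notes require.

```python
def to_int(text, default=-1):
    # Only the risky conversion lives inside try
    try:
        value = int(text)
    except ValueError:       # raised when text is not a valid integer
        return default
    return value

print(to_int('123'))         # 123
print(to_int('Hello Bob'))   # -1
```

Naming the exception means other, unexpected errors still show a traceback instead of being silently swallowed.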

In [15]:
# Example of try/except 
rawstr = input('Enter a number:')
try:
    ival = int(rawstr)
except:
    ival = -1 

if ival > 0: 
    print('Nice work!')
else:
    print('Not a number')

Nice work!


Write code that pays 1.5 times the rate after someone has worked more than 40 hrs. 

In [2]:
sh = input('Enter hours:')
sr = input('Enter rate: ')
try:
    fh = float(sh)
    fr = float(sr)
except:
    print('Error, please enter numeric input')
    quit()  # stop here: fh and fr are undefined if the conversion failed
    
if fh > 40:
    pay = (40 * fr) + (fh - 40)*(fr * 1.5)
    print(pay)
else:
    pay = fh * fr 
    print(pay)

400.0
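
The overtime rule above can also be sketched as a small function (functions are covered in the next section). The name `compute_pay` and the sample inputs are assumptions for illustration only.

```python
def compute_pay(hours, rate):
    # Hours beyond 40 are paid at 1.5 times the normal rate
    if hours > 40:
        return 40 * rate + (hours - 40) * rate * 1.5
    return hours * rate

print(compute_pay(40, 10.0))   # 400.0
print(compute_pay(45, 10.0))   # 475.0
```

Keeping the calculation separate from `input()` makes it easy to check with known values.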


## Functions

In [3]:
def greet(lang):
    if lang == 'es':
        print('Hola')
    elif lang == 'fr':
        print('Bonjour')
    else:
        print('Hello')
greet('es')

Hola


In [4]:
# return provides val of the function 
def greet():
    return 'Hello'
print(greet(), 'Minh')

# Note: return ends the function if run 

Hello Minh


In [6]:
# Can have more than 1 parameter 
def addtwo(a, b):
    added = a + b 
    return added 
x = addtwo(3, 5)
print(x)

8


## Loops and Iteration

In [6]:
n = 3
while n > 0: # checks if true
    print(n)
    n = n - 1 # goes back up 
print('Blastoff')
print(n)

# while like an if statement
# iteration var is what changes so loop isn't infinite

3
2
1
Blastoff
0


### Breaking out of a Loop

#### `break` statement 
- ends the current loop and jumps to the code right after it

In [None]:
while True:
    line = input('> ')
    if line == 'done' :
        break 
    print(line)
print('Done!')

#### `continue` statement 
- ends current iteration and jumps to top of loop to start next iteration

In [None]:
while True:
    line = input('> ')
    if line.startswith('#'):  # safe even when line is empty, unlike line[0]
        continue
    if line == 'done' :
        break
    print(line)
print('Done')

### Definite Loops

In [5]:
for i in range(1, 4): # can pick anything for i 
    print(i)
print('Blastoff')

1
2
3
Blastoff


In [7]:
# Definite loops with strings 
friends = ['Thanh', 'Peter', 'Thomas']
for friend in friends:
    print('Happy New Year:', friend)
print('Done')

Happy New Year: Thanh
Happy New Year: Peter
Happy New Year: Thomas
Done


### Smarter loops

In [8]:
# Making 'smart' loops 
# E.g. finding the largest number 
largest_so_far = -1 
print('Before', largest_so_far)
for num in [9, 41, 12, 3, 74, 13]:
    if num > largest_so_far:
        largest_so_far = num
    print(largest_so_far, num)
print('After', largest_so_far)

Before -1
9 9
41 41
41 12
41 3
74 74
74 13
After 74


In [9]:
# Counting in a loop 
count = 0 
print('Before', count)
for thing in [9, 41, 12, 3, 74, 15]:
    count = count + 1
    print(count, thing)
print('After', count)

Before 0
1 9
2 41
3 12
4 3
5 74
6 15
After 6


In [10]:
# Summing in a loop
total = 0
print('Before', total)
for thing in [9, 41, 12, 3, 74, 15]:
    total = total + thing 
    print(total, thing)
print('After', total)

Before 0
9 9
50 41
62 12
65 3
139 74
154 15
After 154


In [11]:
# Putting if in a loop (filtering)
print('Before')
for value in [9, 41, 12, 3, 74, 15]:
    if value > 20:
        print('Large number', value)
print('After')

Before
Large number 41
Large number 74
After


In [16]:
# Search using a Boolean Variable 
found = False
print('Before', found)
for value in [9, 41, 12, 3, 74, 15]:
    if value == 3:
        found = True 
    print(found, value)
print('After', found)

Before False
False 9
False 41
False 12
True 3
True 74
True 15
After True


In [19]:
# Finding the smallest value 
smallest = None 
print('Before', smallest)
for num in [9, 41, 12, 3, 74, 15]:
    if smallest is None: # if it's empty, grab the first num
        smallest = num
    elif num < smallest:
        smallest = num 
    print(smallest, num)
print('After', smallest) 



Before None
9 9
9 41
9 12
3 3
3 74
3 15
After 3


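- Using -1 as the starting flag value is flawed: it gives the wrong answer if every number in the list is below -1
- Better to start with `None` and assign the first number seen, as in the smallest-value loop above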
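
A short sketch of why the -1 flag breaks, assuming a made-up list where every value is negative. The names `flawed_largest` and `values` are illustrative.

```python
values = [-9, -41, -3]

# Flag value -1: no number in the list ever beats it
flawed_largest = -1
for num in values:
    if num > flawed_largest:
        flawed_largest = num
print(flawed_largest)    # -1, which is not even in the list

# None as the starting value: the first number is always taken
largest = None
for num in values:
    if largest is None or num > largest:
        largest = num
print(largest)           # -3
```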

### `is` and `is not`

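- Both are operators that return True or False
- `is` is similar to `==` but stricter: it checks whether two names refer to the very same object, not just equal values, so `1 == 1.0` is True even though the int and the float are different objects (don't overuse; mainly for `None`, `True`, `False`)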
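
A minimal sketch of the difference between `==` and `is`, using made-up lists `a`, `b` and `c`.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # same contents, but a separate list object
c = a           # a second name for the same object as a

print(a == b)   # True  (equal values)
print(a is b)   # False (different objects)
print(a is c)   # True  (same object)

smallest = None
print(smallest is None)   # True, the typical use of is
```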

Write a program where the user repeatedly enters numbers until they enter 'done'. After 'done', print the total, count and average of the numbers. If they enter something other than a number, detect the mistake: use `try` & `except` to print an error, then skip to the next number.

In [1]:
count = 0
tot = 0.0                       # start the running total as a float (plain 0 would also work)
while True:
    sval = input('Enter a number: ')
    if sval == 'done':
        break                   # gets out if 'done' 
    try:
        fval = float(sval)
    except:
        print('invalid input')
        continue                # code goes up to while again
    print(fval)
    count = count + 1           # counter 
    tot = tot + fval            # total
print('ALL DONE')
if count > 0:
    print(tot, count, tot/count)    # total, counter, average
else:
    print('No numbers entered')     # avoids dividing by zero

invalid input
invalid input
1.0
2.0
invalid input
invalid input


Write a program similar to the one above where the user enters values, and track the smallest & largest numbers. If an input isn't a number, print an error and keep asking. When the user enters 'done', print the smallest & largest numbers.

In [16]:
largest = None
smallest = None
num_list = []

while True:
    num = input("Enter a number: ")
    if num == "done":
        break
    try:
        fnum = float(num)
    except:
        print('Invalid input')
        continue
    
    num_list.append(fnum)
    # compare only the new number; no need to rescan the whole list
    if smallest is None or fnum < smallest:
        smallest = fnum
    if largest is None or fnum > largest:
        largest = fnum
print('ALL DONE')
print(num_list)        
print("Maximum is", largest)
print("Minimum is", smallest)     

ALL DONE
[]
Maximum is None
Minimum is None


# 2. Data Structures 

## Strings

In [26]:
fruit = 'banana'
letter = fruit[1]
print(letter)

x = 3
w = fruit[x - 1]
print(w)

print(len(fruit))

# Looping through strings:
idx = 0 
len(fruit)
while idx < len(fruit):
    letter = fruit[idx]
    print(idx, letter)
    idx = idx + 1 
    
# simpler way to loop, unless you need the index:
for letter in fruit:
    print(letter)

a
n
6
0 b
1 a
2 n
3 a
4 n
5 a
b
a
n
a
n
a


#### Looping and Counting

In [28]:
word = 'banana'
count = 0
for letter in word: 
    if letter == 'a': 
        count = count + 1 
print(count)

3


#### Slicing Strings

In [1]:
s = 'Monty Python'
print(s[0:4])
print(s[6:7])
print(s[6:20])

# the second number is non-inclusive [ : )

Mont
P
Python


### Manipulating strings

#### Using `in` as a Logical Operator 

In [6]:
fruit = 'banana'
print('nan' in fruit)
if 'a' in fruit:
    print('Found it!')

True
Found it!


#### String Comparison

In [26]:
word = 'x'

if word == 'banana':
    print('All right, bananas.')

if word < 'banana':
    print('Your word, ' + word + ', comes before banana')
elif word > 'banana':
    print('Your word, ' + word + ', comes after banana')
else:
    print('All right, bananas.')

# note: uppercase letters are less than lowercase (A < a)

Your word, x, comes after banana


#### String Library Functions

In [9]:
greet = 'Hello Bob'
zap = greet.lower() 
print(zap)
# calls the lower() method on greet; greet itself is unchanged

print('HI THERE'.lower())
# can pass a constant to lower()

hello bob
hi there


In [12]:
stuff = 'hello world'
type(stuff) # tells u class string (str)
# dir(stuff) tells methods u can do in class str 

stuff.capitalize() # capitalizes 1st letter only 


'Hello world'

#### Searching a String

In [14]:
fruit = 'banana'
idx = fruit.find('na')
print(idx)

idx2 = fruit.find('z')
print(idx2)

2
-1


- use find() to search for a substring in a str
- returns the index of the first occurrence of the substring only
- if not found, returns -1
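
Since `find()` only reports the first occurrence, one way to collect every position is to call it repeatedly with its optional second argument (the index to start searching from). This is a sketch; the list `positions` is an illustrative name.

```python
fruit = 'banana'
positions = []

idx = fruit.find('na')
while idx != -1:                      # -1 means no further match
    positions.append(idx)
    idx = fruit.find('na', idx + 1)   # continue just past the last hit

print(positions)   # [2, 4]
```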

#### Search & Replace

In [15]:
greet = 'Hello Bob'
nstr = greet.replace('Bob', 'Jane')
print(nstr)
nstr = greet.replace('o', 'X')
print(nstr)

Hello Jane
HellX BXb


- replace() replaces `all` occurrences of the search string (every 'o' became 'X' above) and returns a new string; `greet` itself is unchanged

#### Stripping Whitespace

In [30]:
# lstrip() and rstrip() remove whitespace at left or right
# strip() removes both beginning & end whitespace

greet = '     Hello Bob      '
print(greet.lstrip())
print(greet.rstrip())
greet.strip()

Hello Bob      
     Hello Bob


'Hello Bob'

#### `startswith()`

In [32]:
line = 'Please have a nice day'
print(line.startswith('Please'))
line.startswith('p')

True


False

#### Extracting String

In [41]:
data = 'From stephen.marquard@uct.ac.za Sat Jan'
# want to extract uct.ac.za - first find the idx range

start = data.find('@')
print(start)
# finds idx of first @

end = data.find(' ', start) 
print(end)
# second parameter of find() is where to start searching
# here we tell find() to start at @ 

host = data[start+1 : end]
print(host)
# slice the boundary 22 (not incl. @) to 31 (not incl. space)


21
31
uct.ac.za


### Assignment 6.5
Take number from following python code and store as float:

`str = 'X-DSPAM-Confidence: 0.8475'`

In [60]:
str =  'X-DSPAM-Confidence: 0.8475'
start = str.find(' ')
print(start)
print(str[start:])

end = str.find('5', start)
print(end)
print(str[start : end])

num = str[start+1 : end+1] # remember end is non-incl. 
num = float(num)
print(num)

print(num + 42.0) # confirms its a float

19
 0.8475
25
 0.847
0.8475
42.8475
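
A shorter sketch of the same extraction, assuming the number always follows the colon. The variable is named `text` here (an assumption, not the assignment's name) to avoid shadowing Python's built-in `str`.

```python
text = 'X-DSPAM-Confidence: 0.8475'

colon = text.find(':')
num = float(text[colon + 1:])   # float() ignores the leading space

print(num)          # 0.8475
print(type(num))    # <class 'float'>
```

Slicing to the end of the string avoids having to search for the last digit.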


## Files

- Before we can read file contents, must tell python which files will be used and what will be done 
- `open()` function returns a `"file handle"` (variable used to perform operations on file)
- Similar to "File -> Open" in Word

Using `open()`
- `handle = open(filename, mode)`
    - returns a handle used to manipulate the file
    - `filename` is a str
    - `mode` is optional: `'r'` to read the file (the default), `'w'` to write it

Example:
`fhand = open('mbox.txt', 'r')`

What is a Handle?
- Just a way to get to file but not the actual data 
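
A self-contained sketch of opening a file for writing and then for reading. The file name `demo.txt` and the temporary folder are assumptions so the example runs anywhere; the `with` statement is an addition to these notes that closes the handle automatically.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name) to read back
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as fhand:        # 'w' = write mode
    fhand.write('first line\nsecond line\n')

lines = []
with open(path, 'r') as fhand:        # 'r' = read mode (the default)
    for line in fhand:                # the handle yields one line at a time
        lines.append(line.rstrip())

print(lines)   # ['first line', 'second line']
```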

#### The Newline Character `\n`

In [65]:
stuff = 'Hello\nWorld!'
print(stuff)
len(stuff)



Hello
World!


12

- \n is still 1 character not 2
- \n is a whitespace 
- Files are long str of characters with \n at end of line 

### Processing Files

In [None]:
xfile = open('mbox.txt')
for cheese in xfile: 
    print(cheese)

- file handle open for read treated as `sequence` of lines 
- can use for statement to iterate through `sequence` of lines
- Remember: a `sequence` is an ordered collection

#### Counting Lines in a File

In [None]:
fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print('Line count:', count)

#### Reading the `Whole` File

In [69]:
fhand = open('mbox-short.txt')
inp = fhand.read() # store entire file into inp
print(len(inp)) 
print(inp[:20]) # prints first 20 chars of whole file

94626
From stephen.marquar


- Does not split into lines, shows \n too
- Whereas for loop splits into lines

#### Searching Through a File

In [None]:
fhand = open('mbox.txt')
for line in fhand:
    if line.startswith('From'):
        print(line)
# searches through the file, iterating line by line
# each line already ends with \n and print() adds another, so blank lines appear

#### `rstrip()` to Remove Newline

In [None]:
# To remove empty newline made from print, use rstrip()
fhand = open('mbox.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From'):
        print(line)

#### Alternative: Skipping with Continue

In [None]:
fhand = open('mbox.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From'):
        continue
    print(line)
# Lines that don't start with From are skipped 
# Only lines that start with From are printed

#### Using `in` to Select `lines`

In [None]:
fhand = open('mbox.txt')
for line in fhand:
    line = line.rstrip()
    if '@uct.ac.za' not in line:
        continue
    print(line)

#### Prompt for File Name

In [None]:
fname = input('Enter file name: ')
try: 
    fhand = open(fname)
except: 
    # if user enters a bad file name
    print('File cannot be opened', fname)
    quit() 
    # code after isn't run 
count = 0
for line in fhand:
    if line.startswith('Subject:'):
        count = count + 1
print('There were', count, 'subject lines in', fname)

# could use a while loop to ask for input again after
# bad file name given 

### Assignment 7.1
Write a program to read through a file and print the contents of the file (line by line) all in uppercase.

In [None]:
fh = open('mbox-short.txt')
# just a file handle (like a portal to file not file itself)

for line in fh:
    line = line.rstrip() # removes the blank lines from print()
    print(line.upper())

### Assignment 7.2

## Lists
- More complicated `data structure` (ways of organising data)
- Others include dictionaries & tuples
- A list is a kind of `collection` 
    - Can store more than 1 `value` in a single variable
    - Carries `many values` in 1 convenient package 

- We've already used them in `Definite Loops`

#### List constants [ ] & Elements

In [83]:
x = ['red', 24, 98.6, [5, 6]]
print(x)
print(len(x))  # len() gives num of elements in list - 4 here
len(x)

print([])
# A list element can be any object - even another list
# A list can be empty

['red', 24, 98.6, [5, 6]]
4
[]


#### Looking Inside Lists

In [77]:
# Lists use the same indexing method
friends = ['Joseph', 'Glenn', 'Sally']
print(friends[1])

Glenn


#### Lists Are `Mutable` 
- Strings are `immutable` - cannot be changed 
- List can be changed using `indexing`

In [80]:
lotto = [2, 14, 26, 41, 63]
print(lotto)
lotto[2] = 28
print(lotto)

[2, 14, 26, 41, 63]
[2, 14, 28, 41, 63]


Using the `Range` Function
- `range()` returns a sequence of numbers from 0 to one less than the end `parameter` (a `range` object in Python 3; wrap it in `list()` to get a list)
- Can make an index loop using `for` and an integer `iterator`

In [89]:
friends = ['Joseph', 'Glenn', 'Sally']
print(len(friends)) # 3 items 
print(range(len(friends))) # range(0, 3) end non-incl.
# can use to make a list

3
range(0, 3)


#### `Counted Loop`

In [91]:
friends = ['Joseph', 'Glenn', 'Sally']
for friend in friends:
    print('Happy New Year:', friend)

# Alternatively can use range and i to iterate
for i in range(len(friends)):
    friend = friends[i]
    print('Happy New Year', friend)

# Benefit is that in each iteration, we know what i is
# This is called a counted loop

Happy New Year: Joseph
Happy New Year: Glenn
Happy New Year: Sally
Happy New Year Joseph
Happy New Year Glenn
Happy New Year Sally


### Manipulating Lists

#### Concatenating Lists 

In [92]:
a = [1, 2, 3]
b = [4, 5, 6]
c = a + b
print(c)
print(a)

[1, 2, 3, 4, 5, 6]
[1, 2, 3]


#### Lists can be `Sliced` using `:`
- Remember: Just like in str, 2nd num is `up to but not incl`

In [97]:
t = [ 9, 41, 12, 3, 74, 15]

print(t[1 : 3])

print(t[:4])

print(t[3:])

[41, 12]
[9, 41, 12, 3]
[3, 74, 15]


#### List Methods
- append(), count(), pop(), etc. 

In [100]:
x = list()
type(x)
# dir(x) tells u all the methods

list

#### Building List from Scratch 
- Create empty list then `append` (add) elements to end
- Maintains same order

In [102]:
stuff = list()
stuff.append('book')
stuff.append(99)
print(stuff)

['book', 99]


#### Searching a List (`in`)

In [104]:
some = [1, 9, 21, 10, 16]
print(9 in some)
20 not in some

True


True

#### `Sort()` Lists

In [105]:
friends = ['Joseph', 'Glenn', 'Sally']
friends.sort() # sorts in alphabetical order
print(friends)

['Glenn', 'Joseph', 'Sally']


Built-in Functions `min`, `max`, `sum`

In [106]:
nums = [3, 41, 12]
print(min(nums))
print(max(nums))
print(sum(nums))
print(sum(nums)/len(nums))

3
41
56
18.666666666666668


#### `Manual` loop vs. loop using `List`

In [None]:
# Remember when we had to find average of nums user inputted
total = 0
count = 0
while True:
    inp = input('Enter a number: ')
    if inp == 'done': break 
    value = float(inp)
    total = total + value 
    count = count + 1

average = total / count
print('Average:', average)

In [None]:
# This is the same calculation using a list
numlist = list()
while True:
    inp = input('Enter a number: ')
    if inp == 'done': break
    value = float(inp)
    numlist.append(value)

average = sum(numlist) / len(numlist)
print('Average:', average)

- Both methods do the same thing 
- Pick which one you like 
- 2nd method has to keep all numbers in `memory` to calculate the average whereas 1st method doesn't
- `1st method better` if you have `millions` of numbers
- `2nd method great` if you already have the numbers in a list

### Lists and Strings 
- `split()` breaks str into parts and makes list
- by default on whitespace (treats many spaces as 1 space)
- can specify a delimiter
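
A quick sketch of the delimiter behaviour, using made-up sample strings.

```python
line = 'first;second;third'
no_delim = line.split()        # no whitespace, so the whole string is one item
with_delim = line.split(';')   # split on the semicolon instead

print(no_delim)     # ['first;second;third']
print(with_delim)   # ['first', 'second', 'third']

spaced = 'a   lot   of   spaces'
print(spaced.split())   # ['a', 'lot', 'of', 'spaces']
```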


In [111]:
abc = 'With three words' 
stuff = abc.split()
print(stuff)
print(len(stuff))
print(stuff[0])

for w in stuff:
    print(w)

['With', 'three', 'words']
3
With
With
three
words


In [None]:
# From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From '): continue 
    words = line.split() # makes a list of words from line
    print(words[2])
    # Extracts day from line
    # Easier method than using find which stops at 1st space


### Double Split Method 

In [None]:
# From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
# Extract uct.ac.za 

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From '): continue 
    words = line.split()       # 1st split
    email = words[1]           #stephen.marquard@uct.ac.za
    pieces = email.split('@')  # 2nd split
    domain = pieces[1]         # uct.ac.za
    print(domain)
    

### Assignment 8.4

### Assignment 8.5

### List Exercise: `Guardian Pattern` 
- Protect code from traceback

In [None]:
# Let's try extracting day:
# From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

han = open('mbox-short.txt')
for line in han:
    line = line.rstrip()
    words = line.split()

    # Only want lines starting with 'From'
    # Guardian protects against blank lines & lines with fewer than 3 words:
    if len(words) < 3 or words[0] != 'From':
        continue

    print(words[2]) # prints Day

- Order is important
- `len(words) < 3` is checked first (the guardian), then `words[0] != 'From'`
- If the order were flipped, a blank line would traceback on `words[0]`
- If the guardian is True, `or` short-circuits and never evaluates `words[0] != 'From'`
- __Always put the guardian first__
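
A runnable sketch of the guardian pattern that does not need `mbox-short.txt`. The sample `lines` list is made up to include a blank line and a line containing only 'From', the two cases that would otherwise cause a traceback.

```python
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008',
    '',                               # blank line -> words is []
    'From',                           # too short -> no words[2]
    'Subject: hello there world',     # long enough but not a From line
]

days = []
for line in lines:
    words = line.split()
    # Guardian first: the length check protects words[0] and words[2]
    if len(words) < 3 or words[0] != 'From':
        continue
    days.append(words[2])

print(days)   # ['Sat']
```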

### 9.1 `Dictionaries`
List 
- a precise linear collection of values that stay in that order
    
Dictionaries
- A 'bag' of `values`, each with its own label (`key`); items are looked up by key, not by position (since Python 3.7 a dict does remember insertion order)
- Python's most powerful collection
- index things put in the dict w/ a lookup tag (key)
- `mutable`

In [133]:
purse = dict() # empty dictionary
purse['money'] = 12 # adds a key-value pair to the dict
purse['candy'] = 3
purse['tissues'] = 75

print(purse)
print(purse['candy'])

purse['candy'] = purse['candy'] + 2 # can add/subtract etc
print(purse['candy'])

purse['money'] = 0                  # can reassign values
print(purse['money'])

{'money': 12, 'candy': 3, 'tissues': 75}
3
5
0


#### Counting with Dictionaries

In [134]:
nam = dict()
nam['csev'] = 1
nam['cwen'] = 1
print(nam)

nam['cwen'] = nam['cwen'] + 1
print(nam['cwen'])
# everytime name seen, can update dict to make histogram

{'csev': 1, 'cwen': 1}
2


#### Dictionary Tracebacks 
- Can't look for key not in dict 
- Can use `in` to see if key in dict

In [None]:
nam =  dict()
print(nam['csev']) 
# returns traceback since key not in dict

In [139]:
nam =  dict()
'csev' in nam
# returns False since key not in dict
# Can write an if statement to add them in 

False

#### When We See a New Name

In [141]:
counts = dict()
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']

for name in names:
    if name not in counts:   # if name not in dict, set name to 1 
        counts[name] = 1
    else:
        counts[name] = counts[name] + 1  # if name in dict, add 1 to it
print(counts)

{'csev': 2, 'cwen': 2, 'zqian': 1}


#### Easier in-built Alternative
- The pattern of checking whether a `key` is in the dict and assuming a default value if it isn't is very common
- There is a `method` called `get()` that does this
- Parameters are `get(key, default)`
- `default` is the value returned if the key does not exist
- Used many times
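
A small sketch of how `get()` behaves on its own, using a made-up `counts` dict. Note that `get()` only reads; it does not add the missing key.

```python
counts = {'csev': 2}

existing = counts.get('csev', 0)    # key exists -> its value
missing = counts.get('zqian', 0)    # key missing -> the default

print(existing)            # 2
print(missing)             # 0
print('zqian' in counts)   # False, get() did not insert anything
```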

In [149]:
counts = dict()
names = ['csev', 'cwen', 'csev', 'zqian', 'cwen']

for name in names:
    counts[name] = counts.get(name, 0) + 1  # this is an idiom used a lot, get used to it
print(counts)           # get(key, default)

# get() searches if key exists, if doesn't defaults 0, 
# then we add 1 to it 

{'csev': 2, 'cwen': 2, 'zqian': 1}


### 9.3 Dictionaries and Files
- read text and find most common word

#### Counting Pattern

In [150]:
# General Pattern:
# Empty dict, split line to get list then loop thru list 
# and use a dict to keep track of count of each word

counts = dict()
print('Enter a line of text:')
line = input('')

words = line.split()  # puts all words into list

print('Words:', words)

print('Counting. . .')
for word in words :  # looping through words 
    counts[word] = counts.get(word, 0) + 1  # update count
print('Counts', counts)

Enter a line of text:
Words: ['Hi', 'Hi', 'how', 'u', 'u', 'u', 'u', 'doinnnnn']
Counting. . .
Counts {'Hi': 2, 'how': 1, 'u': 4, 'doinnnnn': 1}


#### Definite Loops and Dictionaries
- Although not in order, can write `for` loop to go through all `keys` and `lookup` values in dict
- for loop for dict goes thru `keys` not `values`

In [162]:
counts = { 'chuck' : 1, 'fred' : 42, 'jan' : 100 }
for k in counts:  # k takes values of keys not values 
    print(k, counts[k]) # counts[key] extracts value

chuck 1
fred 42
jan 100


Retrieving `lists` of Keys, Values or Items (Both) from dict

In [157]:
names  = { 'chuck' : 1, 'fred' : 42, 'jan' : 100 }
print(list(names)) # list() takes key only 

print(names.keys())

print(names.values()) # order of values matches order of keys
# (since Python 3.7 both follow the order the items were inserted)

print(names.items()) # items() gives a view of (key, value) pairs
# 3 elements here, each one a tuple

['chuck', 'fred', 'jan']
dict_keys(['chuck', 'fred', 'jan'])
dict_values([1, 42, 100])
dict_items([('chuck', 1), ('fred', 42), ('jan', 100)])


#### Looping thru `key-value` Pairs 
- Two Iteration Variables!
- In each iteration:
    - 1st iteration var is `key`
    - 2nd iteration var is `value`

In [163]:
names  = { 'chuck' : 1, 'fred' : 42, 'jan' : 100 }
for k, v in names.items():  # k - key, v - value
    print(k, v)

chuck 1
fred 42
jan 100


#### Checking Understanding

In [None]:
name = input('Enter file:') # try entering test.txt
handle = open(name) # gets file handle 

counts = dict()
for line in handle: 
    words = line.split() # don't need rstrip() as split removes whitespace
    for word in words:
        counts[word] = counts.get(word, 0) + 1 # counter, make sure get() using circle brackets, not square


bigcount = None # biggest count seen so far 
bigword = None # word associated w biggest count
for word, count in counts.items():  # key-value pair iteration
    if bigcount is None or count > bigcount: # if no word or count  is bigger than previous count
        bigword = word
        bigcount = count

print(bigword, bigcount)

### Assignment 9.4 

### Dictionary Exercise:

In [190]:
fname = input('Enter file: ')
if len(fname) < 1: fname = 'test.txt' 
# can hit enter and it enters file name automatically 
han = open(fname)

di = dict()
for line in han:
    line = line.rstrip()
    print(line) # prints lines
    words = line.split()
    print(words) # prints words in a list
    for word in words:
        
#        if word in di:
#            di[word] = di[word] + 1 
#            print('**Existing**')
#        else:
#            di[word] = 1
#            print('**New**')
        di[word] = di.get(word, 0) + 1 # this line is a more efficient counter than the block above
        
        print(word, di[word]) # prints each word & value

print(di)

# Now find the most common word - loop thru dict

largest = -1 # counts are never negative so -1 works, although None is better practice
largest_word = None 
for k, v in di.items():
    if v > largest:
        largest = v
        largest_word = k # capture/remember largest word
print('Done', largest_word, largest)

Hi My name is Minh and this is a test test test test should be biggest count
['Hi', 'My', 'name', 'is', 'Minh', 'and', 'this', 'is', 'a', 'test', 'test', 'test', 'test', 'should', 'be', 'biggest', 'count']
Hi 1
My 1
name 1
is 1
Minh 1
and 1
this 1
is 2
a 1
test 1
test 2
test 3
test 4
should 1
be 1
biggest 1
count 1
{'Hi': 1, 'My': 1, 'name': 1, 'is': 2, 'Minh': 1, 'and': 1, 'this': 1, 'a': 1, 'test': 4, 'should': 1, 'be': 1, 'biggest': 1, 'count': 1}
Done test 4


### 10. Tuples
- Uses `( )` instead of `[ ]`
- Like list but `immutable`
- Can't `sort()`, `append()` or `pop()` on a tuple
- Because of this, tuples are more efficient and stored more compactly than lists

In [195]:
x = ('Glenn', 'Sally', 'Joseph') 
print(x[2])
# x[1] = 'John' will traceback as tuple immutable

x = [9, 8, 7]
x[1] = 0  # lists are mutable
print(x)

Joseph
[9, 0, 7]


#### Tuples and Assignment 
- can put tuple on `LHS` of assignment statement
- can omit parentheses 

In [197]:
(x, y) = (4, 'Fred')
print(y)

(a, b) = 99, 98
print(a)

# Can't do (a, b) = 98, since both sides need the same number of items
# tuples like these are also what .items() produces for a dict

Fred
99


#### Tuples are Comparable
- if 1st item equal goes to next element and so on until it finds an element that differs
- Once it finds element that differs, it `stops`

In [201]:
print((0, 100, 2) < (5, 0, 0))
# 0 < 5 so it stops there, doesn't check 100 < 0

# Same with strings 
('Jones', 'Sally') < ('Jones', 'Sam')

True


True

#### Sorting Lists of Tuples 
- sort dict by key using `items()` and `sorted()` function

In [206]:
d = {'a':10, 'b':1, 'c':22}
print(d.items())
sorted(d.items())
# sorts based on first thing (bc can't have 2 same keys in dict)

for k, v in sorted(d.items()): # loop thru dict in key order
    print(k, v)
    


dict_items([('a', 10), ('b', 1), ('c', 22)])
a 10
b 1
c 22


#### Sort Tuples by Values Instead of Key
- Need to make list of `tuples` in form `(value, key)`
- Can do this with a `for` loop that makes a list of tuples

In [23]:
c = {'a':10, 'b':1, 'c':22}
tmp = list() # makes temp list to append tuples to 
for k, v in c.items():
    tmp.append((v, k)) # makes a new tuple but reverses item order 
print(tmp)    # value is 1st item, key is 2nd

tmp = sorted(tmp, reverse=True) # sorts values in desc ord
print(tmp) 

[(10, 'a'), (1, 'b'), (22, 'c')]
[(22, 'c'), (10, 'a'), (1, 'b')]


#### Practical Uses: Top 10 Words

In [None]:
fhand = open('test.txt')
counts = dict()
for line in fhand:
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
# creates histogram of most common words in dict 

#----------------------------------------
lst = list()
for k, v in counts.items():
    newtup = (v, k)
    lst.append(newtup)
# creates a list of tuples from dict

lst = sorted(lst, reverse=True)
# sorts highest to lowest val
#----------------------------------------

for v, k in lst[:10]:  # lst[:10] keeps the first 10 items; lst[10:] would skip them instead
    print(k, v) 
# prints top 10 words, reverses so key 1st again



#### Even shorter method
- Can do what's in b/w ---- above in one line of code
- Using `List Comprehension` 
    - creates dynamic list, reversed tuples then sorts it (very succinct code)
- Pick which method makes more sense 

In [220]:
c = {'a':10, 'b':1, 'c':22}

# List comprehension that flips the tuples, then sorted() orders them
print(sorted( [ (v,k) for k,v in c.items()], reverse=True))
# not expected to understand but cool 

[(22, 'c'), (10, 'a'), (1, 'b')]
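
As an alternative sketch (not the method used in these notes), `sorted()` also accepts a `key` function that says which part of each item to sort on, so the flip to `(value, key)` tuples can be skipped. The `lambda` and the name `by_value` are assumptions that go beyond what the notes cover.

```python
c = {'a': 10, 'b': 1, 'c': 22}

# pair is a (key, value) tuple; pair[1] is the value
by_value = sorted(c.items(), key=lambda pair: pair[1], reverse=True)

print(by_value)   # [('c', 22), ('a', 10), ('b', 1)]
```

The result keeps the original `(key, value)` order inside each tuple, unlike the flipped version above.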


### Exercise: Tuples and Sorting

In [3]:
# Program that finds the 5 most common word (need sorting)

fname = input('Enter File: ')
if len(fname) < 1: fname = 'test.txt'
hand = open(fname)

di = dict()
for line in hand:
    line = line.rstrip()
    words = line.split()
    for word in words:
        # idiom: retrieve/create/update dictionary
        di[word] = di.get(word, 0) + 1

# print('di', di)
# print(di.items()) # makes tuple k-v pairs in a list 

# iterate and store each k-v pair as tuple in list 
tmp = list() 
for k, v in di.items():
    newtup = (v, k)  # reverses order so value first (v, k)
    tmp.append(newtup) 
print('Flipped', tmp)

tmp = sorted(tmp, reverse=True) # sorts by value since flipped before
print('Sorted', tmp[:5]) # gets most common 5 words in DESC ord
# if there's a tie, goes in reverse alphabetical ord

# to print nicer 
for v, k in tmp[:5]: # rememb: flipped list 
    print(k, v)


Flipped [(1, 'Hi'), (1, 'My'), (1, 'name'), (2, 'is'), (1, 'Minh'), (1, 'and'), (1, 'this'), (1, 'a'), (4, 'test'), (1, 'should'), (1, 'be'), (1, 'biggest'), (1, 'count')]
Sorted [(4, 'test'), (2, 'is'), (1, 'this'), (1, 'should'), (1, 'name')]
test 4
is 2
this 1
should 1
name 1


# 3. Python to Access Web Data

## 11.1 Regular Expressions
- AKA `regex` or `regexp`; provides a concise & flexible way of matching strings of text
- Smart & powerful way of searching thru lots of text
- Don't have to write as much code, but harder to learn
- Instead of programming w lines, program w characters 

![regexp.png](attachment:regexp.png)
- Must import regexp using `import re`
- `re.search()` to see if string matches a regexp, similar to `find()`
- `re.findall()` to extract portions of a string that match regexp, similar to combo of `find()` & slicing: `var[5:10]`

### Using `re.search()` like `find()`

In [None]:
# find() method:
hand = open('mbox-short.txt')
for line in hand: # iterates thru entire text line by line
    line = line.rstrip() 
    if line.find('From:') >= 0: # i.e. not -1, so 'From:' was found
        print(line)

In [None]:
# re.search() method:
# no special regexp characters are used in this case
# would prolly not use regexp for smth this simple

import re 

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From:', line): # returns a match object (truthy) if found, else None
        print(line)

### Using `re.search` Like `startswith()`


In [None]:
# startswith() method:
hand = open('mbox-short.txt')
for line in hand: # iterates thru entire text line by line
    line = line.rstrip() 
    if line.startswith('From:'): # True only if the line begins with 'From:'
        print(line)

In [None]:
# re.search method:
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From:', line): 
        print(line)
        
# tweak the matching string by adding the special character ^
# ^ says the match must start at the beginning of the line

### Wild-Card Characters 

`^X.*:`
- Translates to: look for a string that starts with X, followed by any number of characters, then a colon

What each char means:
- `^` indicates startswith 
- `.` matches any character 
- `*` indicates 0 or more repeats 

Will match following `strings`:
- `X-Sieve:` CMU Sieve 2.3
- `X-DSPAM-Result:` Innocent
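
A quick sketch to check the pattern against a few lines. The first two sample lines are the ones listed above; the third is just added here for contrast (an assumption, not from the course file).

```python
import re

# Two header lines from the notes plus one line that should NOT match
lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'From: stephen.marquard@uct.ac.za',
]

# Keep only the lines where ^X.*: finds a match
matched = [line for line in lines if re.search('^X.*:', line)]
print(matched)
```

Only the two `X-` lines should come back, since the third one does not start with X.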

### Fine-Tuning the match
`^X.*:`
- will match `X-Plane is behind schedule:` two weeks
- even though we don't want it to match

Can do `^X-\S+:`
- `\S` matches any non-whitespace character 
- `+` one or more times (>= 1) 
- will no longer match: 

X-Plane is behind schedule: two weeks
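
To see the difference side by side, here is a small sketch comparing the loose and the stricter pattern on the same three lines (the variable names `loose` and `strict` are just made up for this example).

```python
import re

lines = [
    'X-Sieve: CMU Sieve 2.3',
    'X-DSPAM-Result: Innocent',
    'X-Plane is behind schedule: two weeks',
]

# Loose: X, then anything at all, then a colon
loose = [line for line in lines if re.search(r'^X.*:', line)]

# Strict: X-, then one or more NON-whitespace characters, then a colon
strict = [line for line in lines if re.search(r'^X-\S+:', line)]

print(len(loose))   # how many lines the loose pattern accepts
print(strict)       # lines the strict pattern accepts
```

The strict pattern drops the X-Plane line because a space shows up before the colon is reached.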

### 11.2 Extracting Data from Regexp
- `re.search()` only tells you whether there is a match (it returns a match object, or `None` if no match)
- If want to `extract` string, use `re.findall`
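
A small sketch of what `re.search()` actually hands back. The sample line is made up for illustration; `group()` is the standard match-object method for pulling out the matched text.

```python
import re

line = 'From: stephen.marquard@uct.ac.za'

match = re.search(r'^From: (\S+)', line)
print(bool(match))       # a match object counts as True in an if
print(match.group(0))    # the whole matched text
print(match.group(1))    # only the part inside the parentheses

miss = re.search(r'^To:', line)
print(miss)              # no match gives None, which counts as False
```

So `if re.search(...)` works like a yes/no test, but the match object can also be used to extract text when only one match is needed.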

In [8]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[0-9]+', x)
print(y)

# [0-9]+ indicates 1 or more digits 
# without the + it would be 1 digit only
# Don't have to do split, for loop, much quicker 
# Outputs as STRINGS not INTS in a list 

['2', '19', '42']


When using re.findall(), it returns a list of zero or more sub-strings that match regexp 

In [11]:
import re
x = 'My 2 favorite numbers are 19 and 42'
y = re.findall('[AEIOU]+', x) # 1 or more uppercase vowel
print(y)
# no uppercase vowels, so empty list 

[]


#### Warning: `Greedy` Matching
- The `repeat` characters (`*` & `+`) push `outward` in both directions (greedy) 
- <span style='color:orange'>Prefers the largest possible string</span>


In [14]:
import re 
x = 'From: Using the : character'
y = re.findall('^F.+:', x)
print(y)

# Notice how there's 2 colons and both r matched 
# Can think of * and + as very pushy n greedy
# They want the largest possible string

['From: Using the :']


#### `Non-Greedy` Matching
- If you add a `?` right after the `+` or `*`, they chill out a bit
- Think of `?` as don't be greedy (<span style='color:greenyellow'>chooses the shortest string</span>)

In [17]:
import re 
x = 'From: Using the : character'
y = re.findall('^F.+?:', x)
print(y)


['From:']


#### Fine-Tuning String Extracting
- Can refine match for `re.findall()` & separately determine which portion of match is to be extracted using parentheses

In [22]:
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

y = re.findall(r'\S+@\S+', x) # raw string (r'...') keeps the backslashes as-is
print(y)
# look at cheatsheet \S is any non-blank character (i.e. not a whitespace)
# The + added to repeat as many times 

# It finds the @ but pushes outwards of non-blank characters 
# Without the + (i.e. '\S@\S') u only get 'd@u'

['stephen.marquard@uct.ac.za']


#### Parentheses to Fine-Tune
- Parentheses are not part of the match but they tell where to `start` & `stop` what string to extract 

In [39]:
x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

y = re.findall(r'\S+@\S+', x)
print(y)

y = re.findall(r'^From (\S+@\S+)', x)
print(y)
# From is part of the match but you are not extracting it 
# Only extracting part in parentheses 
# Can be used to fine-tune what to extract 
# E.g. only extract if has prefix From space 

['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']


In [28]:
# Remember when we wanted to get uct.ac.za 
# we had to use a double split pattern 
# Using Regex is less code
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('@([^ ]*)', x)
print(y)

['uct.ac.za']


`@([^ ]*)`
- `@( )` finds @ and extract things in brackets
- `[^ ]` matches non-blank character
    - anything after ^ is match everything `BUT`
        - so here it matches everything but space
- `*` match many of them

In [29]:
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
y = re.findall('^From .*@([^ ]*)', x)
print(y)

['uct.ac.za']


- Adding `^From .*` fine-tunes this
- `^From` asks to match with anything starting with From
- ` [space].*@` asks to match with anything with a space after From and any number of character until @


#### Example Regex Extracting


In [30]:
import re 
hand = open('mbox-short.txt')
numlist = list()
for line in hand:
    line = line.rstrip()
    stuff = re.findall('^X-DSPAM-Confidence: ([0-9.]+)', line)
    if len(stuff) != 1: continue  
    # line we're looking for should only have 1 extraction (floating num) in a list 
    num = float(stuff[0]) # converts num from string to float
    numlist.append(num) # adds it to the list 
print('Maximum:', max(numlist))

# ^X-DSPAM-Confidence: ([0-9.]+)
    # start extracting in brackets ( )
    # take any digits incl. period [0-9.]
    # one or more times (+)

Maximum: 0.9907


#### Escape Character 
- If u want a special regexp character to behave `normally` (just match itself), prefix it with `'\'`

In [32]:
import re 
x = 'We just got $10.00 for cookies.'
y = re.findall(r'\$[0-9.]+', x)
print(y)

['$10.00']


- `\$` matches a literal $ (without the `\`, $ means end of line)
- `[0-9.]` look for digit or period 
- `+` At least one or more
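
To see why the backslash matters, here is a small sketch running the same search with and without it (variable names `escaped` and `unescaped` are just for this example).

```python
import re

x = 'We just got $10.00 for cookies.'

# \$ means a real dollar sign
escaped = re.findall(r'\$[0-9.]+', x)

# a bare $ means "end of line", so nothing can follow it
unescaped = re.findall(r'$[0-9.]+', x)

print(escaped)
print(unescaped)
```

The unescaped version finds nothing, because it asks for digits after the end of the line.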

Try to see what this matches w/o running:

In [1]:
import re

x = 'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008'
y = re.findall(r'\S+?@\S+', x)

print(y)

['stephen.marquard@uct.ac.za']


## 12.1 Networked Tech

### Sockets in Python
Python has built-in support for TCP Sockets

In [3]:
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect( ('data.pr4e.org', 80))
#                 ^HOST           ^ PORT 80

## 12.2 Hypertext Transfer Protocol (HTTP)

### <span style='color:orange'>What is a protocol</span>
- Set of rules that all parties follow so we can predict each other's behaviour
- And not bump into each other

<span style='color:orange'>Since Python gives us a reliable `socket`, what do we want to do with it?</span>
- Application Protocols (different rules depending on app)
    - mail
    - world wide web
- HTTP
    - dominant App Layer Protocol on the Internet 
    - Invented for the Web to Retrieve HTML, Imgs, Docs, etc.
    - Extended to be data in addition to docs - RSS, Web Services, etc. 
        - Basic Concept - Make a Connection - Request a doc - Retrieve the Doc - Close the Connection

<span style='color:orange'>HTTP is a set of rules to allow browsers to retrieve web docs from servers over the Internet</span>

<span style='color:yellowgreen'>http://</span><span style='color:magenta'>www.dr-chuck.com</span><span style='color:orange'>/page1.htm</span>

- <span style='color:yellowgreen'>protocol</span>
- <span style='color:magenta'>host</span>
- <span style='color:orange'>document</span>
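
The same three pieces can be pulled out of a URL in code. A minimal sketch using the standard library's `urllib.parse.urlparse` (not part of the course example, just a way to check the breakdown):

```python
from urllib.parse import urlparse

parts = urlparse('http://www.dr-chuck.com/page1.htm')

print(parts.scheme)   # protocol
print(parts.netloc)   # host
print(parts.path)     # document
```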

### <span style='color:greenyellow'>Getting Data From The Server</span>
- Each time user clicks on tag w `href=` value to switch to a new page, the browser makes a connection to the web server & issues a `GET` request - to GET the content of the page at specified URL
- The server returns the HTML doc to the browser which formats & displays the doc to the user

### <span style='color:orange'>Request Response Cycle:</span>
![HTTP.png](attachment:HTTP.png)


- Click on link 
- Browser looks at HTML to say what web server + port to connect to and doc to retrieve 
- Browser makes socket connection to port 80 & sends it <span style='color:yellow'>request</span>
- Web Server decodes request to send <span style='color:greenyellow'>response</span> as HTML
    - `<h1>` - header 1
    - `<p>` - start of paragraph
    - `<a>` - anchor for href clickable text
- Browser reads response & makes page show up

### `An HTTP Request in Python`
TLDR make a request, get data back

In [7]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) # makes doorway but not yet connected to anything 
mysock.connect(('data.pr4e.org', 80)) # attempts connect to port 80 & establish socket connected to web server
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode() # builds the request; encode() converts the str (unicode) to UTF-8 bytes that we'll send
mysock.send(cmd) # send to server

while True: 
    data = mysock.recv(512) # receive up to 512 bytes
    if (len(data) < 1): # if we get no data, stop loop
        break
    print(data.decode()) # if get data, decode
mysock.close() # close socket


HTTP/1.1 200 OK
Date: Fri, 08 Dec 2023 23:18:37 GMT
Server: Apache/2.4.18 (Ubuntu)
Last-Modified: Sat, 13 May 2017 11:22:22 GMT
ETag: "a7-54f6609245537"
Accept-Ranges: bytes
Content-Length: 167
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Pragma: no-cache
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Connection: close
Content-Type: text/plain

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already s
ick and pale with grief
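
Notice the `s` / `ick` break in the output above: each `recv(512)` chunk gets printed separately, and `print()` adds a newline after each one. A rough sketch of one way around it is to glue the chunks together before decoding. The `chunks` list below is made-up stand-in data (shortened headers) so it runs without a network connection.

```python
# Pretend these came back from two successive mysock.recv(512) calls
chunks = [
    b'HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nWho is already s',
    b'ick and pale with grief\n',
]

response = b''.join(chunks)                      # glue the raw bytes together first
headers, body = response.split(b'\r\n\r\n', 1)   # a blank line separates headers from body

print(headers.decode().split('\r\n')[0])         # status line
print(body.decode(), end='')                     # the document itself, no extra newline
```

Joining before decoding also avoids problems if a multi-byte UTF-8 character happens to be cut in half between two chunks.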



### Using the Developer Console to Explore HTTP

Can find various things about <span style='color:orange'>request response cycle</span>
- On the Brave Browser, Go to settings -> more tools -> developer tools `(or ctrl + shift + i)`
- Try it with this link: https://www.dr-chuck.com/page1.html
    - Try looking under the `network` tab

## Graded: Understanding the Request/Response Cycle

## 12.3 Unicode Characters & Strings

How do computers know the difference b/w `H` and `h`?
- had to come up with a mapping b/w numbers & letters
- E.g. ASCII: 
    - `H` is 72
    - `h` is 104

### <span style='color:orange'> Representing Simple Strings </span>
- Each chr is represented by a num b/w 0-255 stored in 8 bits of memory
- We refer to '8 bits of memory' as a <span style='color:yellowgreen'> **byte**</span> 
- The `ord()` function tells us the numeric value of a simple ASCII character

In [22]:
print(ord('H'))
print(ord('h'))
# explains why h > H in python

print('h' > 'H')

72
104
True
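
`chr()` goes the other way (number to character). A small sketch showing the round trip and what the ordering means for sorting; the list of names is made up for illustration.

```python
print(chr(72))                 # number -> character
gap = ord('h') - ord('H')      # distance b/w lowercase and uppercase
print(gap)

names = sorted(['bob', 'Zed', 'amy'])
print(names)                   # uppercase letters have smaller numbers, so they sort first
```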


Way back, American computers couldn't talk to Japanese computers bc ASCII has no Asian characters 
- That's why `unicode` was invented
    - Has <span style='color:red'>**wayy**</span> more characters than ASCII which only has 128

First thing we tried was:
#### <span style='color:orange'>Multi-Byte Characters</span>
- UTF-16 - 2 bytes for most characters, 4 bytes for the rest (so not truly fixed length)
- UTF-32 - fixed length - 4 bytes - (4 times as much data for single char; not efficient)
- <span style='color:yellowgreen'>**UTF-8**</span> - 1-4 bytes
    - Automatic detection b/w ASCII & UTF-8
    - <span style='color:yellowgreen'>UTF-8 is recommended practice for encoding data to be exchanged b/w systems</span>
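
A quick sketch to see how many bytes each encoding actually uses. The three sample characters are picked here for illustration, and the `-le` codec names are used so Python doesn't add a byte-order mark that would inflate the counts.

```python
samples = ['H', '食', '👋']
encodings = ('utf-8', 'utf-16-le', 'utf-32-le')

# For every character, record the encoded length in each encoding
sizes = {ch: {enc: len(ch.encode(enc)) for enc in encodings} for ch in samples}

for ch, row in sizes.items():
    print(ch, row)
```

Plain ASCII stays at 1 byte in UTF-8, which is a big part of why it is so cheap for mostly-English text. Note that the emoji needs 4 bytes even in UTF-16.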

In [21]:
unicode = u'食べ物'
print(type(unicode)) # unicode is type str

byte = b'abc'
print(type(byte)) # but byte str & regular str diff

# Byte is raw, unencoded 
# Could be UTF-8, UTF-16, ASCII, don't know its encoding 
# Have to manage this when dealing w outside data

<class 'str'>
<class 'bytes'>


#### Python 3 & Unicode
In Python 3, all str internally are `UNICODE` not ASCII or UTF
- Bc of this, working w str variables in Python programs & reading data from files usually 'just works'
- BUT now when we talk to a network resource using sockets or talk to a database, <span style='color:orange'>we have to encode & decode data (usually to UTF-8)</span>

#### Python Strings to Bytes 

`ENCODING (unicode -> bytes encoded as UTF-8)`

- Since we send bytes when talking to external resource (e.g. network socket), we need to encode Python 3 strings which are unicode into bytes

In [None]:
import socket 

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode() # take this str, make into bytes encoded in UTF-8 (default)
mysock.send(cmd) # send as bytes

`DECODING (bytes -> unicode)`

 When we read data from external resource, we must decode it based on the character set (e.g. UTF-8, ASCII) so it's represented in Python 3 as string

In [None]:
# This is why we need a decode operation (Bytes -> Unicode)
while True:
    data = mysock.recv(512) # This is BYTES
    if ( len(data) < 1) :
        break
    mystring = data.decode()  # This becomes UNICODE
    print(mystring)

# Can put the character set inside the .decode() argument, but it defaults to UTF-8 (ASCII is a subset of UTF-8, so ASCII data decodes fine too)
# Very rare you'll get smth other than UTF-8 or ASCII
# By the end of this decoding, it'll be a string

![encode decode.png](<attachment:encode decode.png>)
Sounds more confusing than it is - this is the gist of it

Before sending 
- `encode()` - from `String (Unicode)` -> to `Bytes (UTF-8)`
- send() as bytes

Receiving back
- recv() as bytes 
- `decode()` from `Bytes` -> to `String (Unicode)`

Think of as a barrier b/w inside & outside world
- Allows consistent inside data 
- So we can mix strings from various sources regardless of the character set of those strings
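
The whole round trip can be tried without any network. A minimal sketch, reusing the Japanese string from earlier in these notes:

```python
text = '食べ物'

raw = text.encode()          # str -> bytes (UTF-8 by default)
print(type(raw), len(raw))   # bytes, and more bytes than characters

back = raw.decode()          # bytes -> str (UTF-8 by default)
print(back == text)

# Decoding with the wrong character set fails instead of guessing
try:
    raw.decode('ascii')
except UnicodeDecodeError as err:
    print('ascii cannot decode this:', err.reason)
```

This is why it matters to know (or be told) what encoding outside data is in.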

## 12.4 Retrieving Web Pages

### Using `urllib` in Python
Since HTTP is so common, we have a library that does all the socket work for us and makes web pages look like a file
- <span style='color:orange'>Only 4 lines of code!</span>

In [25]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt') # doesn't read data, remember, only makes the portal
for line in fhand: # iterates through all lines
    print(line.decode().strip()) # iteration is byte array, not string, need to decode

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
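
If you want the whole document as one string instead of looping line by line, something like the sketch below should work. `fetch_text` is a made-up helper name, not part of `urllib`, and the default encoding of UTF-8 is an assumption about the page.

```python
import urllib.request

def fetch_text(url, encoding='utf-8'):
    """Return the whole document at url as one str (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:   # 'with' closes the connection for us
        return response.read().decode(encoding)     # read() gives bytes, so decode

# Usage (needs network access):
# text = fetch_text('http://data.pr4e.org/romeo.txt')
# print(text.splitlines()[0])
```

`read()` pulls everything into memory at once, which is fine for small pages but not for huge files.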


#### Working with HTML files

In [30]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

counts = dict() # making a dictionary 
for line in fhand:
    words = line.decode().split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1 # idiom counter
print(counts)

# identical to opening a file, reading all words & counting them like we've done before

{'But': 1, 'soft': 1, 'what': 1, 'light': 1, 'through': 1, 'yonder': 1, 'window': 1, 'breaks': 1, 'It': 1, 'is': 3, 'the': 3, 'east': 1, 'and': 3, 'Juliet': 1, 'sun': 2, 'Arise': 1, 'fair': 1, 'kill': 1, 'envious': 1, 'moon': 1, 'Who': 1, 'already': 1, 'sick': 1, 'pale': 1, 'with': 1, 'grief': 1}


#### Reading Web Pages

In [33]:
import urllib.request, urllib.parse, urllib.error

fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')
for line in fhand:
    print(line.decode().strip())

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


## 12.5 Parsing Web Files

### <span style='color:orange'>What is Web Scraping</span>
- When a program or script pretends to be a browser & retrieves web pages, looks at those web pages, extracts info, and then looks at more web pages
- Search engines scrape web pages - we call this 'spidering the web' or 'web crawling'

#### <span style='color:orange'>Why Scrape?</span>
- Pull data - particularly social data - who links to who?
- Monitor a site for new info
- Get your own data from some system that has no 'export capability'

#### <span style='color:orange'>Ethics</span>
- Some controversy about web page scraping and some sites are snippy about it 
- Republishing copyrighted info is not allowed
- Violating ToS is not allowed
- Be careful what you scrape 
- If caught scraping when u shouldn't 
    - Network, computer or ur account could be blocked

### `Beautiful Soup` - Searching the Web the Easy Way

<span style='color:orange'> Lots of html on the web has syntax errors </span> (very hard to use `regexp`, `find`, `split`, etc. to parse the web)
- Someone has alr found all the variations since html is so flexible
- Use free software library called [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) to make life easier 

### BeautifulSoup Installation - 2 Ways

In [5]:
# WAY 1
# Run the pip command below (in a terminal) to install BeautifulSoup for all python programs on your computer 
# pip install beautifulsoup4

# WAY 2
# Or download the file:
# http://www.py4e.com/code3/bs4.zip
# unzip it in the same directory as the file you run code on 
# useful when on campus computer and can't install software

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
# import should work if installed correctly 

### Example: Retrieving Anchor Tags (href links)

In [9]:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

url = input('Just Press Enter Only') 
if len(url) < 1: url =  'http://www.dr-chuck.com/page1.htm'
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser') 

tags = soup('a') 
for tag in tags:
    print(tag.get('href', None)) 

# Much easier than using regex or other methods 

http://www.dr-chuck.com/page2.htm


- <span style='color:cyan'>'html.parser'</span> tells BS which parser to use to clean up the html and make an object (`BeautifulSoup(html, 'html.parser')`)
    - Can assign the object to any name, like `soup`
- Now you ask BeautifulSoup some questions:
    - Retrieving a list of all anchor tags (`soup('a')`)
    - Using get(), retrieve href, if none, return None (`tag.get('href', None)`)
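
If BeautifulSoup can't be installed at all, the standard library's `html.parser.HTMLParser` can do a rougher version of the same job. This is a different technique from the BeautifulSoup example above, shown only as a fallback sketch; `LinkCollector` is a made-up class name, and the html string is the page1.htm content printed earlier so it runs offline.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html = '''<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>'''

parser = LinkCollector()
parser.feed(html)
print(parser.links)
```

BeautifulSoup is still the easier option when it is available, since it copes better with messy html and needs far less code.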

### Graded: Scraping HTML Data with BeautifulSoup

111

### Assignment: Following Links in HTML Using BeautifulSoup

## 13.1 Data on the Web