### Arithmetic Operators

The `//` operator performs <font color='royalblue'><b>*integer or floored division*</b></font>, which rounds the quotient down to the nearest integer (toward negative infinity), while the `%` operator calculates the remainder:
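As a small illustration of the two operators (the operand values are arbitrary), the sketch below shows that `//` rounds down rather than toward zero, which matters for negative operands:

```python
# Floor division rounds down (toward negative infinity), not toward zero.
quotient = 17 // 5        # 3
remainder = 17 % 5        # 2
neg_quotient = -17 // 5   # -4, because -3.4 is floored to -4
neg_remainder = -17 % 5   # 3, the remainder takes the sign of the divisor

# The two operators always satisfy: a == (a // b) * b + (a % b)
assert 17 == quotient * 5 + remainder
assert -17 == neg_quotient * 5 + neg_remainder
print(quotient, remainder, neg_quotient, neg_remainder)
```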


### Operator Order

Python generally evaluates expressions from left to right, but operators with higher precedence are applied first. The following table summarizes the <font color="royalblue"><b>*operator precedence*</b></font> for all the operators we have seen so far, from highest precedence to lowest:



|Operator|Meaning|
|:-- |:-- |
|`()`|Grouping|
|`x[i], x[i:j:k], x(...), x.attr`|Indexing, slicing, call, attribute reference|
|`**`|Exponentiation|
|`+x, -x`|Identity, negation|
|`*, /, //, %`|Multiplication (repetition), division, integer division, remainder (format)|
|`+, -`|Addition (concatenation), subtraction|
|`<, <=, >, >=, ==, !=, in, not in, is, is not`|Comparisons, including membership tests and identity tests|
|`not`|Logical Negation|
|`and`|Logical AND|
|`or`|Logical OR|

In [1]:
(True or True and False) == (True or (True and False)) # and binds tighter than or, so both sides are equivalent

True

The built-in function `dir()`, when called without arguments, returns the list of all the names (functions and variables) belonging to the namespace from which it is called.

Deleting a name using the `del` statement removes the name from the namespace:
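A minimal sketch of this behavior, using a made-up name `rate`; membership in `dir()` is checked before and after the `del` statement:

```python
rate = 0.05                        # bind a new name in the current namespace
present_before = 'rate' in dir()   # True: the name is listed
del rate                           # remove the name from the namespace
present_after = 'rate' in dir()    # False: the name is gone
print(present_before, present_after)
```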

A line break typed inside a string literal is a newline character (`\n`), which a single- or double-quoted literal cannot contain directly. We can still make such a literal span multiple lines by ending each line with a backslash (`\`), which escapes the newline so that it does not become part of the string:

In [None]:
'Python Programming \
for Business Analytics'

'Python Programming for Business Analytics'

Strings support `+` (concatenation) and `*` (repetition), but not `-` and `/`.
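The following sketch, with made-up string values, illustrates concatenation and repetition on strings and the `TypeError` raised when `-` is applied to them:

```python
title = 'Python' + ' ' + 'Programming'   # + concatenates strings
divider = '-' * 10                        # * repeats a string
print(title)
print(divider)

try:
    'Python' - 'P'                        # - is not defined for strings
except TypeError as error:
    message = str(error)
    print(message)
```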

### String comparison

-  The comparison operators (e.g., `==`, `>`, `<=`) compare strings ***lexicographically***, i.e., strings are ordered based on the <b>*alphabetical order*</b> (more precisely, the Unicode code points) of their component characters:

   - In alphabetical ordering, digits come before letters and capital letters come before lowercase letters.
     - i.e., digits (as characters) < uppercase letters < lowercase letters.
   - The leftmost characters are compared first. If they differ, the result (`True` or `False`) is decided there; otherwise the comparison continues with the next characters until a difference is found. If one string ends first, the shorter string is the smaller one.
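The rules above can be checked with a few comparisons. The strings are arbitrary examples, and `ord()` is used only to expose the code points behind the ordering:

```python
comparisons = {
    "'apple' < 'banana'": 'apple' < 'banana',   # 'a' < 'b' decides at the first character
    "'Zebra' < 'apple'": 'Zebra' < 'apple',     # uppercase letters sort before lowercase ones
    "'10' < '9'": '10' < '9',                   # '1' < '9': strings are not compared numerically
    "'app' < 'apple'": 'app' < 'apple',         # a proper prefix is smaller than the longer string
}
for expression, result in comparisons.items():
    print(expression, '->', result)

# ord() shows the code points that drive the ordering
print(ord('9'), ord('A'), ord('a'))
```

Every comparison in this sketch evaluates to `True`.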


A slice `[i:j]` includes the item at index `i` but excludes the item at index `j`.
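A short sketch of the half-open slicing rule, using a made-up ticker string:

```python
ticker = 'GOOGL'
first_three = ticker[0:3]   # indices 0, 1, 2 -> 'GOO'
last_two = ticker[3:5]      # indices 3, 4 -> 'GL'

# Because index j is excluded, adjacent slices join back without overlap
assert first_three + last_two == ticker

# The length of a slice [i:j] is j - i (when both indices are within range)
middle_length = len(ticker[1:4])
print(first_three, last_two, middle_length)
```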

In [None]:
# access the help system
help(print)

Help on built-in function print in module builtins:

print(*args, sep=' ', end='\n', file=None, flush=False)
    Prints the values to a stream, or to sys.stdout by default.
    
    sep
      string inserted between values, default a space.
    end
      string appended after the last value, default a newline.
    file
      a file-like object (stream); defaults to the current sys.stdout.
    flush
      whether to forcibly flush the stream.



In [None]:
print('purchase', shares, 'shares of',  stock, "at $", price, 'per share', end='\t')
print('purchase', shares, 'shares of',  stock, "at $", price, 'per share', sep='-')

purchase 3.2 shares of Apple at $ 443.05 per share	purchase-3.2-shares of-Apple-at $-443.05-per share


In [None]:
stock = 'Google'; percentage = 0.1845; week = 52.5
f"{stock}'s stock is trading {percentage:.1%} off of {week:.0f}-week highs"

"Google's stock is trading 18.4% off of 52-week highs"

In [None]:
f"{{a + b}}"      # Treated as a literal string

'{a + b}'

In [None]:
f"{{{a + b}}}" # The innermost braces are evaluated as an expression

NameError: name 'a' is not defined

### Type Conversion
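- A string that holds a decimal number must be converted step by step, `str` -> `float` -> `int`, e.g., `int(float('4.999'))`; calling `int('4.999')` directly raises a `ValueError`.
- `bool(0)`, `bool(0.0)`, and `bool("")` all return `False`.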
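The conversions above can be sketched as follows. The variable names are made up for illustration:

```python
price_text = '4.999'

try:
    int(price_text)                 # int() cannot parse a decimal string directly
except ValueError as error:
    print('ValueError:', error)

as_float = float(price_text)        # str -> float
as_int = int(as_float)              # float -> int (truncates toward zero)
negative_int = int(-4.999)          # truncation, not flooring
print(as_float, as_int, negative_int)

# Zero and the empty string convert to False; any non-empty string is True
truthy_strings = (bool('0'), bool(' '))
print(bool(0), bool(0.0), bool(''), truthy_strings)
```

Note that `int()` truncates toward zero, unlike `//`, which rounds down.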

In [2]:
int(4.999)

4

`str.split()` returns a list of the substrings in the string:

In [None]:
'Business-Applications-Development-in-Python'.split()

['Business-Applications-Development-in-Python']

In [None]:
'Business-Applications-Development-in-Python'.split('-')

['Business', 'Applications', 'Development', 'in', 'Python']

Use `str.split(' - ', maxsplit=1)` to split only at the first `' - '`. This avoids unpacking errors when the string contains several `' - '` separators.
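A minimal sketch of the problem that `maxsplit` avoids. The `record` string is a made-up example containing two `' - '` separators:

```python
record = 'ISOM 3400 - Python Programming - Spring'   # made-up string with two ' - ' separators

try:
    code, title = record.split(' - ')                # three parts cannot be unpacked into two names
except ValueError as error:
    print('ValueError:', error)

code, title = record.split(' - ', maxsplit=1)        # split only at the first ' - '
print(code)
print(title)
```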

In [None]:
help(course.split)

Help on built-in function split:

split(sep=None, maxsplit=-1) method of builtins.str instance
    Return a list of the substrings in the string, using sep as the separator string.

      sep
        The separator used to split the string.

        When set to None (the default value), will split on any whitespace
        character (including \n \r \t \f and spaces) and will discard
        empty strings from the result.
      maxsplit
        Maximum number of splits.
        -1 (the default value) means no limit.

    Splitting starts at the front of the string and works to the end.

    Note, str.split() is mainly useful for data that has been intentionally
    delimited.  With natural text that includes punctuation, consider using
    the regular expression module.



In [None]:
help(course.join)

Help on built-in function join:

join(iterable, /) method of builtins.str instance
    Concatenate any number of strings.
    
    The string whose method is called is inserted in between each given string.
    The result is returned as a new string.
    
    Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'



In [4]:
# To create a tuple with only one item, you need to add a comma after the item
print(type((1)), type((1,)))

<class 'int'> <class 'tuple'>


In [None]:
# a one-item slice still returns a sequence
squares[-3:-2]

(9,)

In [10]:
# Term-wise comparison and stop comparing when the first unequal term is found
['large',10, 'big'] < ['small',1, 'little']

True

In [None]:
[10,'large', 'big'] < [1,'small', 'little']

False

In [15]:
[1,2,3,1,2,3,1,2,3].index(2,2,5) # find the first 2, searching from index 2 (included) up to index 5 (excluded)

4

In [None]:
# Signature: list.sort(*, key=None, reverse=False) sorts the list in ascending order in place (modifies the original list)

In [None]:
# Key must be a function that takes one argument and returns a value to be used for sorting purposes
numbers=[0.12, 0.78, 0.5, -0.43, -0.87, 1.0, 0.64]
numbers.sort(key=abs, reverse=True)
numbers

[1.0, -0.87, 0.78, 0.64, 0.5, -0.43, 0.12]

If any comparison fails (e.g., between an `int` and a `str`), the entire sort fails with a `TypeError`.

- `list.sort()` returns `None` and changes the original list.
- `sorted(list)` returns a new list without changing the original list.
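The difference between the two can be sketched with a small made-up list of prices:

```python
prices = [443.05, 120.5, 310.0]

ordered = sorted(prices)          # returns a new list; prices is unchanged
unchanged = list(prices)
print(ordered, prices)

result = prices.sort()            # sorts in place and returns None
print(result, prices)

mixed = [3, 'two', 1]
try:
    mixed.sort()                  # int and str cannot be compared with <
except TypeError as error:
    print('TypeError:', error)
```

A common mistake is writing `prices = prices.sort()`, which rebinds `prices` to `None`.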


### The `*` Operator
- When used before a **name on the left of an assignment**, it makes that name gather up any superfluous elements during sequence unpacking into a **list**:

In [17]:
fruits = ['apple', 'orange', 'banana', 'mango']
((first, *remaining), *others) = fruits
first, remaining, others

('a', ['p', 'p', 'l', 'e'], ['orange', 'banana', 'mango'])

- When used before a sequence inside a list or tuple display, `*` unpacks its individual values:




In [18]:
first, *remaining, *others 

('a', 'p', 'p', 'l', 'e', 'orange', 'banana', 'mango')




The assignment of each item  to the **loop variable** can leverage sequence unpacking to make the handling of nested data easier:






In [19]:
gradebook = [['Alice', 95], ['Troy', 92], ['James', 89], ['Charles', 100], ['Bryn', 59]]
name_only = [name for name, score in gradebook]
name_only

['Alice', 'Troy', 'James', 'Charles', 'Bryn']

Dictionary keys are ***unique*** and can be any ***immutable*** data type (e.g., numbers, strings, Booleans, tuples containing only immutable elements). A list cannot be used as a key.
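A brief sketch of these key rules, using a made-up `locations` dictionary keyed by tuples:

```python
locations = {('HK', 'Central'): 12, ('HK', 'Mong Kok'): 8}   # tuples of strings are valid keys
print(locations[('HK', 'Central')])

try:
    invalid = {['HK', 'Central']: 12}                        # a list is mutable, hence unhashable
except TypeError as error:
    print('TypeError:', error)

# Keys are unique: assigning to an existing key replaces its value
locations[('HK', 'Central')] = 15
print(len(locations), locations[('HK', 'Central')])
```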

Assigning a new key and value adds an entry:

In [None]:
gradebook['Charles'] = 100
gradebook

{'Alice': 95, 'Troy': 92, 'James': 89, 'Charles': 100}

Many of the operators and built-in functions that can be used with sequences work with dictionaries as well. But they operate ***primarily on keys of dictionaries***:

- The membership operator: `in` or `not in`
- The `*` operator
- The built-in functions `len()`, `max()`, and `sorted()`

In [23]:
gradebook = dict(gradebook)
gradebook, 'Alice' in gradebook, 95 in gradebook, max(gradebook), sorted(gradebook)

({'Alice': 95, 'Troy': 92, 'James': 89, 'Charles': 100, 'Bryn': 59},
 True,
 False,
 'Troy',
 ['Alice', 'Bryn', 'Charles', 'James', 'Troy'])

- We can use the `|` operator (available since Python 3.9) to create a new dictionary by merging two existing dictionaries. For a key present in both, the value from the right-hand dictionary is used. The original dictionaries are not changed:

In [26]:
gradebook = {'Alice': 95, 'Troy': 90, 'James': 89}
gradebook | {'Charles': 100}, gradebook | {'Alice': 100}

({'Alice': 95, 'Troy': 90, 'James': 89, 'Charles': 100},
 {'Alice': 100, 'Troy': 90, 'James': 89})

- To update the original dictionary in place, use `|=` (analogous to `+=`):

In [28]:
gradebook |= {'Alice': 100}
gradebook

{'Alice': 100, 'Troy': 90, 'James': 89}



- [`.get(<key>[, <default>])`](https://docs.python.org/3/library/stdtypes.html#dict.get) returns the value for `key` if `key` is present, else `default` (defaulting to `None`):

In [None]:
fruits = ['apple', 'pear', 'peach', 'banana', 'apple',
          'strawberry', 'lemon', 'apple', 'blueberry', 'banana']
fruits_count = {}
for name in fruits:
  fruits_count[name] = fruits_count.get(name, 0) + 1
fruits_count

- [`values()`](https://docs.python.org/3/library/stdtypes.html#dict.values) ([`keys()`](https://docs.python.org/3/library/stdtypes.html#dict.keys)) returns a **dictionary view** consisting of only `value`s (`key`s):
- [`items()`](https://docs.python.org/3/library/stdtypes.html#dict.items) returns a **dictionary view** consisting of `(key, value)` pairs:

In [29]:
gradebook.values()   # a dictionary view object

dict_values([100, 90, 89])

In [31]:
gradebook.items()     # iterating over this view yields (key, value) tuples, whereas iterating over the dictionary yields only keys

dict_items([('Alice', 100), ('Troy', 90), ('James', 89)])

We can use the dictionary-like syntax in comprehensions (i.e., **dictionary comprehensions**) to construct new dictionaries from existing ones:

In [None]:
{name: score + 5 for name, score in gradebook.items()}

{'Alice': 100, 'Troy': 97, 'James': 94}


A dictionary can also be constructed by casting a collection of key-value pairs to the `dict` type with the `dict()` function:

In [33]:
dict([("red", 34), ("green", 30), ("brown", 31)])  # or dict((("red", 34), ("green", 30), ("brown", 31)))

{'red': 34, 'green': 30, 'brown': 31}

In [32]:
list(gradebook)           # the keys will be used by default

['Alice', 'Troy', 'James']


A set is an ***unordered***, ***mutable*** collection of ***distinct immutable*** elements.

***Non-empty*** sets can be created by using the curly brace notation (`{}`):


In [None]:
integers = {5, 2, 3, 1, 4, 2}
integers # The duplicate is eliminated

{1, 2, 3, 4, 5}

An ***empty*** set can only be defined with the `set()` function.

In [34]:
type({}), type(set())

(dict, set)

Elements of a set must be immutable, so a list cannot be an element of a set.

Sets do not support indexing, slicing, or other ***sequence-like*** behavior, because they do not record element position or order of insertion.
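These set properties can be sketched with a made-up set of names:

```python
visited = {'Alice', 'Troy', 'James'}
visited.add('Troy')                 # adding an existing element has no effect
visited.add('Bryn')
print(len(visited), 'Bryn' in visited)

try:
    visited[0]                      # sets have no positions, so indexing fails
except TypeError as error:
    print('TypeError:', error)

try:
    visited.add(['Sam'])            # a list is mutable and cannot be a set element
except TypeError as error:
    print('TypeError:', error)

ordered = sorted(visited)           # convert to a sorted list when order or indexing is needed
print(ordered[0])
```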

Non-Boolean values can be used in place of `<condition>`. Rules for deciding the truthiness or falsehood of a non-Boolean value:

- All ***non-zero*** numbers and all ***non-empty*** strings are true;

- `0` and the ***empty*** string (`""`) are false;

- Other built-in data types that can be considered to be ***empty*** or ***not empty*** follow the same pattern.

In [37]:
bool([]), bool(()), bool({}), bool(''), bool(0), bool(None) # all return False

(False, False, False, False, False, False)


### Iterating over Index-value Pairs with **`enumerate()`**

The [`enumerate()`](https://docs.python.org/3/library/functions.html#enumerate) function provides a simpler way to generate both indices and items for a loop.

In [42]:
s = "python is a high-level language for general-purpose programming"
# find the index of all 'a's
[index for index, letter in enumerate(s) if letter == 'a']

[10, 24, 28, 41, 57]

In [None]:
print("Main Menu:")
for position, option in enumerate(menu, start=1):
    print(f"{position}. {option}")

Main Menu:
1. Big Mac
2. McChicken
3. French Fries
4. Apple Pie
5. Coca-Cola



### Parallel Traversals with **`zip()`**


The built-in [`zip()`](https://docs.python.org/3/library/functions.html#zip) function allows us to visit multiple sequences ***in parallel***.


`zip()` takes one or more iterables as arguments and returns an iterator of tuples that pair up parallel items taken from those iterables:

<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/zip1.svg" width=500/>





In [36]:
names = ['John', 'Danny', 'Tyrion', 'Sam']
balances = [20, 10, 5, 40]
students = ['Yes', 'No', 'Yes', 'No']
outcomes = [False, False, True, True]

zipped = zip(names, balances, students, outcomes)
zipped, list(zipped), dict(zip(names, students))

(<zip at 0x1d65095e380>,
 [('John', 20, 'Yes', False),
  ('Danny', 10, 'No', False),
  ('Tyrion', 5, 'Yes', True),
  ('Sam', 40, 'No', True)],
 {'John': 'Yes', 'Danny': 'No', 'Tyrion': 'Yes', 'Sam': 'No'})

`zip()` truncates results at the length of the shortest sequence when the argument lengths differ:

In [None]:
list(zip('abc', 'xyz123'))

[('a', 'x'), ('b', 'y'), ('c', 'z')]

In [None]:
# Nested list comprehension
students = ['Alice', 'Troy', 'James', 'Charles', 'Bryn']
[(name1, name2) for index, name1 in enumerate(students) for name2 in students[index+1:]]

[('Alice', 'Troy'),
 ('Alice', 'James'),
 ('Alice', 'Charles'),
 ('Alice', 'Bryn'),
 ('Troy', 'James'),
 ('Troy', 'Charles'),
 ('Troy', 'Bryn'),
 ('James', 'Charles'),
 ('James', 'Bryn'),
 ('Charles', 'Bryn')]

In [None]:
[song for rating, index, song in sorted(zip(ratings, range(len(ratings)), songs), reverse=True)] 
# Sorts by rating, then by original position, then by song name, all in descending order
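To make the tuple-based ordering concrete, here is a sketch with hypothetical `songs` and `ratings` lists (the data is invented for illustration). It also shows one possible variant that keeps the original order among equally rated songs:

```python
# Hypothetical data: songs in playlist order with their ratings
songs = ['Yesterday', 'Imagine', 'Hey Jude', 'Let It Be']
ratings = [4, 5, 4, 3]

# Tuples are compared item by item: rating first, then position, then name
ranked = [song for rating, index, song
          in sorted(zip(ratings, range(len(ratings)), songs), reverse=True)]
print(ranked)

# Variant: highest rating first, but ties keep their playlist order
ranked_stable = [song for rating, index, song
                 in sorted(zip(ratings, range(len(ratings)), songs),
                           key=lambda item: (-item[0], item[1]))]
print(ranked_stable)
```

With `reverse=True`, tied ratings are ordered by descending position, so 'Hey Jude' comes before 'Yesterday' in the first result.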


### 5 Interrupting Loops: `break` and `continue`

Python provides two keywords, `break` and `continue`, that interrupt the normal flow of a loop:



- The `break` and `continue` statements are usually further nested in an `if` test to take action in response to some `<condition>`.

    - The `continue` statement terminates the ***current*** iteration immediately.
    
    - The `break` statement immediately terminates a loop ***entirely***.

We can use **nested break and nested continue statements** in loops in Python. These statements can work inside nested loops (loops within loops) to control the flow of execution. However, their behavior depends on which loop they are located in.

<pre class="lang-python">
<span style="color:#2767C5";>for</span> <span style="color:#BB2F29";>&lt;variable&gt;</span> <span style="color:#2767C5";>in</span> <span style="color:#BB2F29";>&lt;iterable&gt;</span> or <span style="color:#2767C5";>while</span> <span style="color:#BB2F29";>&lt;condition&gt;</span>:

    statement(s)   

    <span style="color:#2767C5";>for</span> <span style="color:#BB2F29";>&lt;variable&gt;</span> <span style="color:#2767C5";>in</span> <span style="color:#BB2F29";>&lt;iterable&gt;</span> or <span style="color:#2767C5";>while</span> <span style="color:#BB2F29";>&lt;condition&gt;</span>:
        statement(s)  
        <span style="color:#2767C5";>continue</span> or <span style="color:#2767C5";>break</span>              <span style="font-style: italic; color:dark teal;"># apply to the inner loop</span>

    <span style="color:#2767C5";>continue</span> or <span style="color:#2767C5";>break</span>                  <span style="font-style: italic; color:dark teal;"># apply to the outer loop</span></pre>  
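A runnable sketch of the skeleton above. The `batches` data is hypothetical; each `continue` or `break` affects only the loop that directly contains it:

```python
# Hypothetical data: each inner list holds one customer's transactions
batches = [[120, -5, 80], [], [60, 0, 30]]

processed = []
for batch in batches:              # outer loop
    if not batch:
        continue                   # skips the rest of this outer iteration only
    for amount in batch:           # inner loop
        if amount < 0:
            continue               # skips this amount; the inner loop goes on
        if amount == 0:
            break                  # leaves the inner loop; the outer loop goes on
        processed.append(amount)

print(processed)
```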




### 7 Exception Handling: The `try` Statement


Even if a statement or expression is syntactically correct, it may cause an error/exception:




Here is a code skeleton that shows the full potential of the `try` statement:

<pre class="lang-python">
<span style="color:#2767C5";>try</span>:
   statement(s)         

<span style="color:#2767C5";>except</span> <span style="color:#BB2F29";>&lt;Type 1 Error&gt;</span>:
   statement(s)
   
   ...

<span style="color:#2767C5";>except</span> <span style="color:#BB2F29";>&lt;Type n Error&gt;</span>:
   statement(s)   
   
<span style="color:#2767C5";>else</span>:
   statement(s)      

<span style="color:#2767C5";>finally</span>:                      <span style="font-style: italic; color:dark teal;"># All clause headers are at the same indentation level.</span>
   statement(s)

following statement(s)
</pre>


- Explicit exception type specification is not mandatory for an `except` clause.

- An arbitrary number of `except` clauses can be specified under the `try` clause to handle different exceptions, e.g., `RuntimeError`, `TypeError`, `NameError`, etc.


- If no exception occurs in the `try` clause, all the following `except` clauses are skipped.

- If an exception occurs in the `try` clause, the rest of the `try` clause is skipped. Then

    - If the exception type is matched by an `except` clause, that clause is executed and then execution continues after the `try` statement.
    
    - If the exception is not matched by any of the `except` clauses, it propagates outward; if nothing handles it, execution stops and Python prints the standard error message (traceback).

- In effect, at most one handler will be executed. An `except` clause may name multiple exceptions as a parenthesized tuple, for example: `except (RuntimeError, TypeError, NameError): `

- The `try` statement has an optional `else` clause, which must follow all `except` clauses and is executed only when the `try` clause does not raise an exception.

- The `try` statement supports another optional `finally` clause, which is intended to define clean-up actions that must be executed under all circumstances, such as closing a file or releasing a lock, regardless of whether an exception was raised or not.
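The rules above can be sketched with a hypothetical helper, `safe_ratio`, which records the clauses that run for each outcome:

```python
def safe_ratio(numerator, denominator):
    '''Hypothetical helper that shows which clauses run for each outcome.'''
    steps = []
    try:
        ratio = numerator / denominator
    except ZeroDivisionError:
        steps.append('except ZeroDivisionError')
        ratio = None
    except TypeError:
        steps.append('except TypeError')
        ratio = None
    else:
        steps.append('else')       # runs only if the try clause raised nothing
    finally:
        steps.append('finally')    # runs in every case
    return ratio, steps

print(safe_ratio(10, 4))
print(safe_ratio(10, 0))
print(safe_ratio(10, 'four'))
```

In each call at most one `except` clause runs, `else` runs only on success, and `finally` always runs.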



In [None]:
lst = [1, 'text', 5, 12]
for i in range(5):
    try:
        print(lst[i] / i)
    except (TypeError, ZeroDivisionError) as error1:  # you can give a name to the captured error using the as keyword
        print(error1)
    except IndexError as error2:
        print(error2)

division by zero
unsupported operand type(s) for /: 'str' and 'int'
2.5
4.0
list index out of range


Below are some common types of `Error`s in Python:

|Type| Description| Example|
|:-- |:-- | :-- |
|`AttributeError`| Raised when an attribute reference or assignment fails. |  `a_tuple = 1, 2, 3` <br> `a_tuple.sort()`|
|`ImportError`| Raised when the import statement has trouble loading a module. | `import panda` |
|`IndexError`|Raised when the index of a sequence is out of range.| `numbers = [1, 2, 3, 4]` <br>`numbers[6]`|
|`NameError`| Raised when a variable is not found in the local or global scope. | `course = "ISOM3400"` <br> `print(corse)` |
|`TypeError`| Raised when a function or operation is applied to an object of an incorrect type. | `3 + '0.4' `|
|`ValueError`|Raised when a function gets an argument of correct type but improper value.|`int('I have $3.8 in my pocket')` |
|`ZeroDivisionError`|Raised when the second operand of a division or modulo operation is zero.| `5 / 0`|


### Range

In [39]:
[i for i in range(1, 10, 2)] # start at 1, stop before 10, step by 2

[1, 3, 5, 7, 9]


#### 1.4.3 Mixing the Two Matching Schemes

**Positional matching** and **keyword matching** (matching by name) can be mixed during a function call. But arguments using **positional matching** must precede arguments using **keyword matching**.
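As a sketch of valid and invalid mixes, the hypothetical function `solve_quadratic(a, b, c)` below is an assumption made for illustration and is not necessarily how `quad_1` is defined:

```python
def solve_quadratic(a, b, c):
    '''Return the two roots of a*x**2 + b*x + c = 0 (hypothetical helper).'''
    x1 = -b / (2 * a)
    x2 = (b ** 2 - 4 * a * c) ** 0.5 / (2 * a)
    return (x1 + x2), (x1 - x2)

# Positional arguments first, keyword arguments after
roots_mixed = solve_quadratic(1, c=2, b=-3)      # a=1 by position; b and c by name
roots_positional = solve_quadratic(1, -3, 2)     # same call, all by position
print(roots_mixed, roots_positional)

try:
    solve_quadratic(2, b=3, a=1)                 # 2 already fills a, so a=1 is a second value
except TypeError as error:
    message = str(error)
    print('TypeError:', message)
```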

In [None]:
quad_1(1, c=2, 3)

SyntaxError: positional argument follows keyword argument (<ipython-input-18-1a64c887bd0c>, line 1)

In [None]:
quad_1(2, b=3, a=1)

TypeError: quad_1() got multiple values for argument 'a'

### 1.5.1 Default Parameter Values

In a function definition, we can specify a default value for a parameter with the form `parameter=expression`:
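A minimal sketch of default values. The function `future_value` and its parameters are hypothetical names chosen for illustration:

```python
def future_value(principal, rate=0.05, years=1):
    '''Hypothetical helper: value of principal after compounding annually.'''
    return principal * (1 + rate) ** years

one_year = future_value(1000)                  # uses rate=0.05 and years=1
higher_rate = future_value(1000, 0.1)          # overrides rate by position
two_years = future_value(1000, years=2)        # overrides years by name, keeps rate=0.05
print(one_year, higher_rate, two_years)
```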

Note that parameters with default values ***must follow*** those without default values:

In [None]:
def quad_2(a=1.0, b, c):
    x1 = -b / (2 * a)
    x2 = (b ** 2 - 4 * a * c) ** 0.5 / (2 * a)
    return (x1 + x2), (x1 - x2)

SyntaxError: non-default argument follows default argument (<ipython-input-38-b479b308fe47>, line 1)

### `*` and `**`

We can put `*` before a parameter in a function definition to indicate that it can take a ***variable*** number of *positional arguments* and pack them ***into a tuple***. (By contrast, in an unpacking assignment such as `first, *rest = sequence`, the starred name is bound to a **list**.)

In [None]:
# write your codes here
def greet(*names):
  for name in names:
    print("Hello", name)

greet("Monica", "Luke", "Steve", "John")

Hello Monica
Hello Luke
Hello Steve
Hello John



When the `*` operator is used in a function call, it unpacks a sequence into individual positional arguments. Note that `*` here is used when calling a function rather than when defining one.

In [None]:
print(*[1, 2, 3, 4, 5, 6])

1 2 3 4 5 6




`**` is used to indicate that a parameter can take a ***variable*** sequence of *keyword*/*named arguments*, and pack them ***into a dictionary***:


In [45]:
def update_detail(**info):
    print(info)
    for k, v in info.items():
      print(f"{k} -> {v}")

update_detail(name='Sam', id='1902034', grade="A+")

{'name': 'Sam', 'id': '1902034', 'grade': 'A+'}
name -> Sam
id -> 1902034
grade -> A+


Similarly, `**` used in a call unpacks a mapping into individual keyword arguments:

In [None]:
details = {'name': 'Sam', 'id': '1902034', 'major': 'Business Analytics', 'year': 3}
update_detail(**details)      # equivalent to update_detail(name='Sam', id='1902034', major='Business Analytics', year=3)

{'name': 'Sam', 'id': '1902034', 'major': 'Business Analytics', 'year': 3}
name -> Sam
id -> 1902034
major -> Business Analytics
year -> 3



###  Rules of Using `*args` & `**kwargs`

- By definition, `*args` cannot have default values and cannot take keyword arguments.

- Parameters after `*args` can take keyword arguments only; parameters before `*args` can take keyword arguments only if there are no additional positional arguments intended for `*args`.

In [None]:
def print_args_1(x1, x2, *args, y1, y2):
    print("x1 is: ", x1)
    print("x2 is: ", x2)
    print("args is: ", args)
    print("y1 is: ", y1)
    print("y2 is: ", y2)

In [None]:
# the position of y1 and y2 can be swapped
print_args_1("a", "b", y1=1, y2=2)

x1 is:  a
x2 is:  b
args is:  ()
y1 is:  1
y2 is:  2


In [None]:
# the position of y1 and y2 can be swapped
print_args_1("a", "b", "c", "d", y1=1, y2=2)

x1 is:  a
x2 is:  b
args is:  ('c', 'd')
y1 is:  1
y2 is:  2


In [None]:
print_args_1(x2="b", x1="a", y1=1, y2=2)

x1 is:  a
x2 is:  b
args is:  ()
y1 is:  1
y2 is:  2


In [None]:
# recall: positional arguments cannot come after keyword arguments
print_args_1(x2="b", x1="a", "c", "d", y1=1, y2=2)

SyntaxError: positional argument follows keyword argument (<ipython-input-65-1bbc4fea801b>, line 2)

- By definition, `**kwargs` cannot have default values and cannot take positional arguments. It can only appear at the end of a parameter list.

In [None]:
def print_args_2(y1, **kwargs, y2):
    print("y1 is: ", y1)
    print("y2 is: ", y2)
    print("kwargs is: ", kwargs)

SyntaxError: arguments cannot follow var-keyword argument (<ipython-input-68-2bebf0c92214>, line 1)

- `**kwargs` does not impose any restrictions on the type of arguments that come before it.

In [None]:
def print_args_2(y1, y2, **kwargs):
    print("y1 is: ", y1)
    print("y2 is: ", y2)
    print("kwargs is: ", kwargs)

In [None]:
print_args_2(1, 2, x1="a", x2="b")

y1 is:  1
y2 is:  2
kwargs is:  {'x1': 'a', 'x2': 'b'}


In [None]:
print_args_2(x1="a", x2="b", y2=2, y1=1)

y1 is:  1
y2 is:  2
kwargs is:  {'x1': 'a', 'x2': 'b'}


In [None]:
print_args_2(y2=2, y1=1)

y1 is:  1
y2 is:  2
kwargs is:  {}


In [None]:
# still, positional arguments cannot come after keyword arguments
print_args_2(y1=1, 2, x1="a", x2="b")

SyntaxError: positional argument follows keyword argument (<ipython-input-73-13b14e8ac4af>, line 2)


###  Parameter Ordering

Ordinary parameters, `*args`, and `**kwargs` can be combined in the same parameter list:



In [None]:
def print_args_3(x1, x2, *args, y1, y2, **kwargs):
    print("x1 is: ", x1)
    print("x2 is: ", x2)
    print("args is: ", args)
    print("y1 is: ", y1)
    print("y2 is: ", y2)
    print("kwargs is: ", kwargs)

In [None]:
# the positions of y1 and y2 can be swapped
print_args_3('a', 'b', 'c', 'd', y1=1, y2=2, y3=3, y4=4)

x1 is:  a
x2 is:  b
args is:  ('c', 'd')
y1 is:  1
y2 is:  2
kwargs is:  {'y3': 3, 'y4': 4}


In [None]:
# all arguments can be keyword ones as long as there're no arguments intended for *args
print_args_3(y3=3, y4=4, x1='a', x2='b', y1=1, y2=2)

x1 is:  a
x2 is:  b
args is:  ()
y1 is:  1
y2 is:  2
kwargs is:  {'y3': 3, 'y4': 4}



When used in a parameter list, `/` forces all parameters before it to take positional arguments only during a call.

In [None]:
def print_args_4(x1, x2, /, *args, y1, y2, **kwargs):
    print("x1 is: ", x1)
    print("x2 is: ", x2)
    print("args is: ", args)
    print("y1 is: ", y1)
    print("y2 is: ", y2)
    print("kwargs is: ", kwargs)

In [None]:
print_args_4('a', 'b', 'c', 'd', y1=1, y2=2, y3=3, y4=4)

x1 is:  a
x2 is:  b
args is:  ('c', 'd')
y1 is:  1
y2 is:  2
kwargs is:  {'y3': 3, 'y4': 4}


In [None]:
# no longer valid
print_args_4(x1='a', x2='b', y1=1, y2=2, y3=3, y4=4)

TypeError: print_args_4() missing 2 required positional arguments: 'x1' and 'x2'

### Python's parameter ordering rule:

We can mix ordinary parameters, `*args` (positional arguments), and `**kwargs` (keyword arguments) in a function's parameter specification, but they must appear in a particular order:

- Parameters for positional matching (those without default values come first) > `/` > `*args` (captures additional positional arguments) > parameters for keyword matching > `**kwargs` (captures additional keyword arguments)

- Amendment:  The rule that *parameters with default values can only follow those without default values* only applies to parameters for positional matching.

In [None]:
def print_args_5(x1, x2, /, *args, y1=1, y2, **kwargs):     # y1=1 can come before y2
    '''Documentation string''' # description of the function
    print("x1 is: ", x1)
    print("x2 is: ", x2)
    print("args is: ", args)
    print("y1 is: ", y1)
    print("y2 is: ", y2)
    print("kwargs is: ", kwargs)

print_args_5('a', 'b', 'c', 'd', y3=3, y4=4, y2=2)

x1 is:  a
x2 is:  b
args is:  ('c', 'd')
y1 is:  1
y2 is:  2
kwargs is:  {'y3': 3, 'y4': 4}




### 1.7 `lambda` Expressions


Besides the `def` statement, Python provides an expression form to generate function objects, known as [`lambda` expressions](https://docs.python.org/3/reference/expressions.html#lambdas).

`lambda` functions, also known as anonymous functions, can be considered a degenerate kind of function: they don't have a name and carry only a ***single*** expression whose result is returned:


<pre class='lang-python'>
<span style="color:#2767C5";>lambda</span> <span style="color:#BB2F29";>&lt;parameter 1&gt;</span>, <span style="color:#BB2F29";>&lt;parameter 2&gt;</span>, ...: <span style="color:#BB2F29";>&lt;a single expression using parameters&gt;</span>
</pre>

In [None]:
# sort sublists by their second elements
sorted(gradebook, key=lambda x: x[1])

[['Bryn', 59], ['James', 89], ['Troy', 92], ['Alice', 95], ['Charles', 100]]

**<font color='steelblue' > Question</font>**:

1) What if we want to sort the gradebook by score range in descending order (e.g., 10-point ranges such as 91-100, 81-90, and so on)?



In [None]:
gradebook = [['Troy', 92], ['Alice', 95], ['James', 89], ['Charles', 99], ['Mike', 86], ['Bryn', 59]]

# write your codes here
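# (score - 1) // 10 places 91-100 in one range, 81-90 in the next, and so on (integer scores assumed)
sorted(gradebook, key=lambda x: (x[1] - 1) // 10, reverse = True)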

[['Troy', 92],
 ['Alice', 95],
 ['Charles', 99],
 ['James', 89],
 ['Mike', 86],
 ['Bryn', 59]]

2) What if we want to further sort students by name in ascending order within each range?


In [None]:
# write your codes here
sorted(gradebook, key=lambda x: (-x[1]//10, x[0]))

[['Alice', 95],
 ['Charles', 99],
 ['Troy', 92],
 ['James', 89],
 ['Mike', 86],
 ['Bryn', 59]]


### 2.1 Defining a Class

Imagine we want to define a new type of object to represent individual customers for a bank's personal loan business.

**Common properties**:


| Name |  Income | Years | Criminal |  
|-----|-----|-----|-----|
| Amy | 27 |4.2 |  No |  
| Sam | 32 |1.5 |  No |  
| Jane | 55 | 3.5  | Yes |
|...|


**Common operations**: A routine determines whether to grant a loan to a customer.

<br>

[The `class` statement](https://docs.python.org/3.7/reference/compound_stmts.html#class-definitions) creates a class object and assigns it a name.

The body of a class is where we specify the attributes of the class, including both data and method attributes.


In [47]:
class Customer:  # names for user-defined classes begin with uppercase letters by convention
    '''A class for bank customers.'''

    def __init__(self, name, income, years, criminal='No'):
        '''populate the attributes of a particular customer'''
        self.name = name                   # assignments to attributes of self
        self.income = income
        self.years = years
        self.criminal = criminal

    def apply_loan(self):                  # a function attribute, a.k.a. a method of the class
        if self.income >= 70:
            result = 'Approve'
        elif self.income >= 30:
            if self.years >= 2:
                result = 'Approve'
            else:
                result = 'Reject'
        else:
            if self.criminal == "No":
                result = 'Approve'
            else:
                result = 'Reject'
        return result

In [None]:
Customer

__main__.Customer

In [None]:
Customer.__init__

<function __main__.Customer.__init__(self, name, income, years, criminal='No')>

In [None]:
Customer.apply_loan   # methods are just functions defined inside the class

In [None]:
Customer.__doc__   # a special attribute of a class that contains the object's docstring

'A class for bank customers.'

In [None]:
Customer.__dict__ # Names in a class's namespace can be exposed by the built-in `__dict__` attribute

mappingproxy({'__module__': '__main__',
              '__doc__': 'A class for bank customers.',
              '__init__': <function __main__.Customer.__init__(self, name, income, years, criminal='No')>,
              'apply_loan': <function __main__.Customer.apply_loan(self)>,
              '__dict__': <attribute '__dict__' of 'Customer' objects>,
              '__weakref__': <attribute '__weakref__' of 'Customer' objects>})

In [48]:
customer_1 = Customer("Amy", 27, 4.2)
customer_1

<__main__.Customer at 0x1d64fb67ce0>

In [None]:
# instance-level attributes
customer_1.name, customer_1.income, customer_1.years, customer_1.criminal

('Amy', 27, 4.2, 'No')

In [None]:
# class-level data attributes
customer_1.__doc__

'A class for bank customers.'

In [None]:
# class-level method attributes
# the instance is implicitly passed as the argument to self
customer_1.apply_loan()

'Approve'

Both class-level attributes and instance-level attributes can be added on the fly via assignments with qualified names:

In [None]:
customer_1.__dict__

{'name': 'Amy', 'income': 27, 'years': 4.2, 'criminal': 'No'}

In [49]:
customer_1.name = "John"  # change the name of the customer
customer_1.name

'John'
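To illustrate adding attributes on the fly at both levels, here is a sketch with a minimal hypothetical class `Account` (not part of the `Customer` example):

```python
class Account:
    '''Minimal hypothetical class used only for this illustration.'''

    def __init__(self, owner):
        self.owner = owner

first = Account('Amy')
second = Account('Sam')

first.balance = 100            # instance-level: only `first` gets the attribute
Account.currency = 'HKD'       # class-level: visible through every instance

print(first.__dict__)                       # {'owner': 'Amy', 'balance': 100}
print(hasattr(second, 'balance'))           # False
print(first.currency, second.currency)      # HKD HKD
print('currency' in first.__dict__)         # False: it lives in the class namespace
```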


### 2.4 `__X__()` Methods


Running `dir(customer_1)` shows that some `__X__()` (pronounced "dunder X") methods are available to `customer_1` (inherited from somewhere).


In Python, [dunder methods](https://docs.python.org/3/reference/datamodel.html#special-method-names) are methods that allow instances of a class to interact with the built-in functions and operators of the language. The word “dunder” comes from “double underscore”, because the names of dunder methods start and end with two underscores.

Dunder methods exist for nearly every operation available to built-in types. The mapping from each of these operations to a dunder method is ***fixed*** and ***unchangeable***. The table below lists a few of the most straightforward ones:







|Operation | Expression (or Statement) | And Python Calls
|:-- |:-- |:-- |
|Addition |`a + b`| `a.__add__(b)` |
|Subtraction| `a - b`|`a.__sub__(b)`|
|Multiplication |`a * b`|`a.__mul__(b)`|
| Division |`a / b`|`a.__truediv__(b)`|
| Equality |`a == b`|`a.__eq__(b)`|
| Inequality |`a != b`|`a.__ne__(b)`|
| Less than |`a < b`|`a.__lt__(b)`|
| Greater than |`a > b`|`a.__gt__(b)`|
| Less than or equal to|	`a <= b`|`a.__le__(b)`|
| Greater than or equal to|	`a >= b`|`a.__ge__(b)`|
| Length |	`len(s)`|`s.__len__()`|
| Membership tests  |	`x in s` | `s.__contains__(x)`|
| Attribute listing  |	`dir(x)` | `x.__dir__()`|
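
To illustrate the mapping in the table, here is a small sketch with a hypothetical `Basket` class (an assumption for this example, not part of the lecture code). Defining `__add__()`, `__eq__()`, `__len__()`, and `__contains__()` lets its instances work with `+`, `==`, `len()`, and `in`:

```python
class Basket:
    '''A hypothetical container of item names.'''

    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):
        # invoked by: basket_a + basket_b
        return Basket(self.items + other.items)

    def __eq__(self, other):
        # invoked by: basket_a == basket_b
        return isinstance(other, Basket) and self.items == other.items

    def __len__(self):
        # invoked by: len(basket)
        return len(self.items)

    def __contains__(self, item):
        # invoked by: item in basket
        return item in self.items


combined = Basket(["apple"]) + Basket(["pear", "plum"])
print(len(combined))                                   # number of items
print("pear" in combined)                              # membership test
print(combined == Basket(["apple", "pear", "plum"]))   # equality test
```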





The built-in `hasattr()` function returns `True` if an object has the given named attribute and `False` if it does not.

In [None]:
hasattr("string", "__len__") # 'string' has a method called __len__

True

In [None]:
hasattr(1, "__len__") # 1 (integer) does not have a method called __len__

False


### `__repr__()`


Typing the name directly into the interpreter prints out its string representation:

In [None]:
customer_1

<__main__.Customer at 0x7e968750d910>

Behind the scenes, [`__repr__()`](https://docs.python.org/3/reference/datamodel.html#object.__repr__) is invoked to return a string representing the object.
It's what we get when the Python interpreter shows an object:


In [None]:
customer_1.__repr__()

'<__main__.Customer object at 0x7e968750d910>'

We can override the default `__repr__()` to produce a more readable string representation:

In [None]:
class Customer:
    '''A class for bank customers.'''

    def __init__(self, name, income, years, criminal='No'):
        '''populate the attributes of a particular customer'''
        self.name = name
        self.income = income
        self.years = years
        self.criminal = criminal

    def __repr__(self):
        '''define a string representation of a given instance'''
        return f"Customer: {self.__dict__}"

In [None]:
customer_1 = Customer("Amy", 27, 4.2)
customer_1

Customer: {'name': 'Amy', 'income': 27, 'years': 4.2, 'criminal': 'No'}



### `__str__()`

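The `__str__()` method is called when we use the `str()` function on an object or when we `print()` it. If a class does not define `__str__()`, Python falls back to `__repr__()`, which is why `print()` currently shows the same text as before: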

In [None]:
print(customer_1)

Customer: {'name': 'Amy', 'income': 27, 'years': 4.2, 'criminal': 'No'}


In [None]:
class Customer:
    '''A class for bank customers.'''

    def __init__(self, name, income, years, criminal='No'):
        '''populate the attributes of a particular instance'''
        self.name = name
        self.income = income
        self.years = years
        self.criminal = criminal

    def __repr__(self):
        '''defines a string representation of a given instance'''
        return f"Customer: {self.__dict__}"

    def __str__(self):
        '''defines a printable representation of a given instance'''
        prefix = "Customer:\n"
        return prefix + '\n'.join([f'{k.title()} -> {v}' for k, v in self.__dict__.items()])   # The title() method returns a string where the first character in every word is upper case.
        # The return value must be a string object


In [None]:
customer_1 = Customer("Amy", 27, 4.2)
print(customer_1)

Customer:
Name -> Amy
Income -> 27
Years -> 4.2
Criminal -> No
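
The sketch below shows which built-in operations pick which method, using a hypothetical `Point` class so that it runs independently of the `Customer` class. `str()`, `print()`, and f-strings use `__str__()`, whereas `repr()` and the display of objects inside containers use `__repr__()`:

```python
class Point:
    '''A hypothetical class used only for this illustration.'''

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __repr__(self):
        return f"Point(x={self.x}, y={self.y})"

    def __str__(self):
        return f"({self.x}, {self.y})"


p = Point(1, 2)
print(str(p))     # uses __str__
print(f"{p}")     # f-strings also use __str__ by default
print(repr(p))    # uses __repr__
print([p])        # containers show their elements with __repr__
```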



###  `__iter__()`


- `__X__()` methods for many commonly used operations (iteration, for example) are not provided by default, so the corresponding operations are not supported for a class's instances unless we define them.
- `__iter__()` makes a class's instances iterable. We implement it by using a generator expression to yield the components one after the other:

In [None]:
class Customer:
    '''A class for bank customers.'''

    def __init__(self, name, income, years, criminal='No'):
        '''populate the attributes of a particular instance'''
        self.name = name
        self.income = income
        self.years = years
        self.criminal = criminal

    def __repr__(self):
        '''define a string representation of a given instance'''
        return f"Customer: {self.__dict__}"

    def __str__(self):
        '''define a printable string representation of a given instance'''
        prefix = "Customer:\n"
        return prefix + '\n'.join([f'{k.title()} -> {v}' for k, v in self.__dict__.items()])    # The return value must be a string object

    def __iter__(self):
        '''make an instance iterable'''
        return ((k, v) for k, v in self.__dict__.items())


In [None]:
customer_1 = Customer("Amy", 27, 4.2)
for k, v in customer_1:
    print(f"{k} -> {v}")

name -> Amy
income -> 27
years -> 4.2
criminal -> No
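
Because many built-ins accept any iterable, defining `__iter__()` pays off beyond `for` loops. The following sketch uses a hypothetical `Record` class whose `__iter__()` is written as a generator function (an alternative to returning a generator expression) and passes an instance to `list()` and `dict()`:

```python
class Record:
    '''A hypothetical class used only for this illustration.'''

    def __init__(self, name, income):
        self.name = name
        self.income = income

    def __iter__(self):
        # a generator function: each call returns a fresh iterator
        yield from self.__dict__.items()


record = Record("Amy", 27)
print(list(record))      # a list of (attribute, value) pairs
print(dict(record))      # the pairs can be turned back into a dictionary
first, second = record   # unpacking also relies on __iter__()
print(first, second)
```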



### 2.5 Defining a Derived Class (Optional)



Python allows us to form a **derived class** (**subclass**) from one or more than one **base class** (**superclass**) to specialize behaviors while reusing existing code.

To create a subclass, we just list the base class in parentheses in the `class` statement's header (separated by `,` if there is more than one base class):

In [None]:
class Person:

    def __init__(self, name, date_of_birth, gender):
        self.name = name
        self.date_of_birth = date_of_birth
        self.gender = gender

    def __repr__(self):
        '''define a string representation of a given instance'''
        return f"{self.__dict__}"


class Customer(Person):

    def __init__(self, name, date_of_birth, gender, income, years, criminal="No"):
        # a method call notation; self should not be present
        super().__init__(name, date_of_birth, gender)
        self.income = income
        self.years = years
        self.criminal = criminal

    def apply_loan(self):
        if self.income >= 70:
            result = 'Approve'
        elif self.income >= 30:
            if self.years >= 2:
                result = 'Approve'
            else:
                result = 'Reject'
        else:
            if self.criminal == "No":
                result = 'Approve'
            else:
                result = 'Reject'
        return result

In [None]:
a_person = Person("Amy", "Oct. 10, 1999", "F")
a_person

{'name': 'Amy', 'date_of_birth': 'Oct. 10, 1999', 'gender': 'F'}

In [None]:
customer_3 = Customer("Amy", "Oct. 10, 1999", "F", 27, 4.2)
customer_3

{'name': 'Amy', 'date_of_birth': 'Oct. 10, 1999', 'gender': 'F', 'income': 27, 'years': 4.2, 'criminal': 'No'}

Each instance inherits names from the class it's generated from, as well as from all of that class's superclasses.

To resolve attribute references, Python goes through the following steps:

1. The search first checks the instance's namespace and then its class's namespace.
2. If the requested attribute is not found in the class's namespace, the search proceeds to the class's superclass(es).
3. This rule is applied ***recursively*** upward through the **hierarchy of classes** until all superclasses are searched.

The search stops at the first occurrence of the attribute name.
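
The lookup order can be inspected directly. In this sketch, the classes `Base`, `Middle`, and `Leaf` are hypothetical; `__mro__` is the standard class attribute that lists the classes Python searches, in order:

```python
class Base:
    greeting = "hello from Base"
    origin = "Base"


class Middle(Base):
    origin = "Middle"      # shadows Base.origin


class Leaf(Middle):
    pass


obj = Leaf()
obj.nickname = "leafy"     # lives in the instance's namespace

print(obj.nickname)   # found in the instance
print(obj.origin)     # not in the instance or Leaf; found first in Middle
print(obj.greeting)   # found only after searching up to Base
print([cls.__name__ for cls in Leaf.__mro__])
```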


### Modules

Most of the functionality in Python is provided by **modules**, which are typically Python program files that contain definitions and statements we want to import to use.





### The `import` Statement

Useful modules that form the [standard library](https://docs.python.org/3/library/) include `os`, `sys`, `math`, `random`, `shutil`, and so on.

In [None]:
import random   # the module must be imported before its names can be used

help(random.randint)

Help on method randint in module random:

randint(a, b) method of random.Random instance
    Return random integer in range [a, b], including both end points.



Alternatively, we can import all names in a module to the current namespace using the `from` form of the `import` statement:

In [None]:
from random import *


Then we don't need to use the prefix every time we use something from it:


In [None]:
randint(0, 10)

9

However, this form should be used with caution, as it can create <b>*name collisions*</b>:

In [None]:
randint = 2
randint(0, 10)

TypeError: 'int' object is not callable
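
Two common ways to reduce the risk of such collisions are to import only the names we need or to keep the module prefix (optionally under a shorter alias). A minimal sketch, where the alias `rnd` is an arbitrary choice:

```python
import random as rnd          # keep a (short) prefix
from random import randint    # import a single name explicitly

rnd.seed(42)                  # make the draws reproducible
a = rnd.randint(0, 10)        # qualified access cannot clash with local names
b = randint(0, 10)            # unqualified access to the one name we imported
print(a, b)
```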


### HTML Basics



[**HyperText Markup Language**](https://developer.mozilla.org/en-US/docs/Web/HTML) (HTML for short) is a markup language for describing Web documents.


It is plain text, but includes a rich collection of "tags" that define the structure of the document and allow documents to include a variety of page elements.

- Tags are always enclosed in angle brackets: `<tagname>`.

<img src="https://www.dropbox.com/s/1xc7ep9hu035ky7/elment.png?raw=1" width=400/>

- Tags usually travel in pairs and contain something in between. An opening (or start) tag begins a section of page content, and a closing (or end) tag ends it, e.g.,  `<tagname>content</tagname>`.

    - [`<h1>`, `<h2>`, ..., `<h6>`](https://www.w3schools.com/tags/tag_hn.asp): largest headings, second largest headings, etc.;

    - [`<p>`](https://www.w3schools.com/tags/tag_p.asp): paragraphs;

    - [`<ul>`](https://www.w3schools.com/tags/tag_ul.asp) or [`<ol>`](https://www.w3schools.com/tags/tag_ol.asp): unordered (bulleted) or ordered (numbered) lists;

    - [`<li>`](https://www.w3schools.com/tags/tag_li.asp): individual list items;

    - [`<div>`](https://www.w3schools.com/tags/tag_div.asp): divisions or sections;

    - [`<a>`](https://www.w3schools.com/tags/tag_a.asp): anchors;

    - and [many others ...](https://www.w3schools.com/tags/default.asp)

    


---
- There are also a few *self-closing* tags, e.g.:

    - [`<br>`](https://www.w3schools.com/tags/tag_br.asp)

    - ```html
      <img src="https://idp.ust.hk/idp/images/logo.png" alt="UST Logo">
      ```
      See [`<img>`](https://www.w3schools.com/tags/tag_img.asp)


- Tags can have attributes, which are always specified in the start tag and come in `name="value"` pairs. E.g.:

    - ```html
      <a href="https://docs.python.org/" target="_blank">Python documentation</a>
      ```
      See [`<a>`](https://www.w3schools.com/tags/tag_a.asp)
      
      <img src="https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Structuring_content/Basic_HTML_syntax/grumpy-cat-attribute-small.png" width=600/>


- There are a few attributes that are common to all tags, among which the two most used ones are `id` and `class`:

    - [`id`](https://www.w3schools.com/html/html_id.asp) allows us to specify an identifier for an element, which must be unique in the whole document. E.g.:
    
        - ```html
          <p id="unique-one">This is a paragraph with id "unique-one".</p>
          ```    

    - [`class`](https://www.w3schools.com/html/html_classes.asp) allows us to assign multiple elements (possibly of different types) to the same group identified by a class name. E.g.:

        - ```html
          <p class="class-one"> This is a paragraph of class "class-one". </p>
          ```

        - An element can also be assigned to multiple groups that are specified by a *space-separated* list of class names. E.g.:
        
            - ```html
              <div class="col m-1 border class-one">Some Content</div>
              ```
- For Web scraping, the  `class` and/or `id` attributes can be leveraged to differentiate the section we want from other sections. To do so, we need to use something called the **CSS selector**.


### CSS Basics


[**Cascading Style Sheets**](https://developer.mozilla.org/en-US/docs/Web/CSS) (CSS for short) is a style sheet language for describing the presentation of a document written in a markup language.

HTML dictates the content and structure of a webpage, while CSS controls the design and display of HTML elements.




[3 ways](https://www.w3schools.com/css/css_howto.asp) of creating and applying CSS rules:

- An inline style created right on an HTML start tag, using `style="property: value;"`;

- An [embedded/internal style sheet](https://jwang.people.ust.hk/sample_embeddedCSS.html) included by a `<style>` tag nested within the `<head>` section;

- As an [external style sheet](https://jwang.people.ust.hk/sample_externalCSS.html) in a [separate file](https://jwang.people.ust.hk/example.css).


<img src="https://4.bp.blogspot.com/-yQFU_PhmTRg/U7viQ7bMNjI/AAAAAAAADJE/ctlBTLl-fhY/s640/css-selectors-lrg.png" width=500px />

A CSS style sheet is a set of style rules, and a CSS ruleset consists of:

- A CSS selector: a pattern used to select the element(s) we want to style;

-  A style declaration block, which contains a list of style properties and their values:

    - Style properties and values come in pairs, with each pair separated with a semicolon (`;`).

    - See a [list](https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Properties_Reference) of common properties and what values are accepted for them.







CSS selectors can be combined to create more specific selectors. For example,
to select
```html
<div class="col m-1 border class-one">Some Content</div>
```
we can use:

- `div.class-one`;

- `div.m-1.col` (the order does not matter);

- `div[class="col m-1 border class-one"]`.


BeautifulSoup helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects.

The most commonly used object in the BeautifulSoup library is the `BeautifulSoup` object.

In [50]:
from bs4 import BeautifulSoup
import bs4
bs4.__version__

'4.12.3'


The use of the `BeautifulSoup` constructor:

- The 1st argument is the HTML text the object is based on;

- The 2nd argument specifies the parser that we want BeautifulSoup to use in order to create a `BeautifulSoup` object.


> <code>html.parser</code> is the HTML parser included in the standard Python 3 library. Information on other HTML parsers is available <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser">here</a>.




Let's create a `BeautifulSoup` object for the tacky HTML code above to "***make the soup***":



In [51]:
html_sample_code = ('<!DOCTYPE html><html lang="en"><head><title>Sample HTML Page</title></head>'
                    '<body><h1>This is a heading.</h1>'
                    '<p>This is a typical paragraph.</p>'
                    '<p class="class-one">This is a paragraph of class "class-one".</p>'
                    '<ol><li class="class-one"><a href="sample.html">The 1st item</a></li></ol>'
                    '<p id="unique-one">This is a paragraph with id "unique-one".</p>'
                    '<div class="col m-3 border class-one">This is a division.</div></body></html>')

sample_soup = BeautifulSoup(html_sample_code, 'html.parser')


We can use the `prettify()` method to format the HTML source code so as to get a better idea of its structure:


In [None]:
from bs4.formatter import HTMLFormatter

# We can customize the indent size using an HTMLFormatter object
print(sample_soup.prettify(formatter = HTMLFormatter(indent=4)))

<!DOCTYPE html>
<html lang="en">
    <head>
        <title>
            Sample HTML Page
        </title>
    </head>
    <body>
        <h1>
            This is a heading.
        </h1>
        <p>
            This is a typical paragraph.
        </p>
        <p class="class-one">
            This is a paragraph of class "class-one".
        </p>
        <ol>
            <li class="class-one">
                <a href="sample.html">
                    The 1st item
                </a>
            </li>
        </ol>
        <p id="unique-one">
            This is a paragraph with id "unique-one".
        </p>
        <div class="col m-3 border class-one">
            This is a division.
        </div>
    </body>
</html>





These textual components form a ["family tree"](https://jwang.people.ust.hk/familytree.html), where the top-level `<html>` tag contains the `<head>` and `<body>` tags, which further contain other textual contents and tags, and so on.


We can define parents, children, siblings, descendants, and ancestors for each tag by referencing this family tree (based on the indentation levels of related tags).


### What's in a Soup?

A `BeautifulSoup` object is essentially a hierarchical collection of `Tag` objects, each of which is equipped with attributes for navigating and iterating over the child tags and strings it contains.

The simplest way to navigate this collection is to say the name of a child tag we want:

In [None]:
# get the <h1> tag nested two layers deep into the BeautifulSoup object structure (html → body → h1)
sample_soup.html.body.h1        # equivalently, sample_soup.body.h1 or sample_soup.h1

<h1>This is a heading.</h1>

A tag's children are available in a list via its `.contents` attribute:

In [None]:
sample_soup.body.contents

[<h1>This is a heading.</h1>,
 <p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <ol><li class="class-one"><a href="sample.html">The 1st item</a></li></ol>,
 <p id="unique-one">This is a paragraph with id "unique-one".</p>,
 <div class="col m-3 border class-one">This is a division.</div>]

In [None]:
sample_soup.body.get_text()

'This is a heading.This is a typical paragraph.This is a paragraph of class "class-one".The 1st itemThis is a paragraph with id "unique-one".This is a division.'

In [None]:
sample_soup.ol.get_text()

'The 1st item'

In [None]:
sample_soup.a.get('href')   # alternatively, it can also be accessed with dictionary-like indexing, e.g., sample_soup.a['href']

'sample.html'
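
Besides moving down the tree, each tag can also be navigated upward and sideways. The sketch below assumes Beautiful Soup is installed and builds a small soup from a shortened, hypothetical HTML string so that it runs on its own; it uses the `.parent`, `.children`, `.next_sibling`, and `.parents` attributes:

```python
from bs4 import BeautifulSoup

# a shortened, hypothetical document for illustration
html = '<body><h1>Heading</h1><ol><li><a href="sample.html">Item</a></li></ol><p>Text</p></body>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.a
print(link.parent.name)                               # the <a> tag sits inside <li>
print([child.name for child in soup.body.children])   # direct children of <body>
print(soup.h1.next_sibling.name)                      # the tag right after <h1>
print([tag.name for tag in link.parents])             # ancestors, innermost first
```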



### `.find()` & `.find_all()`

Using a tag name as an attribute gives us only the first tag by that name.

If we need to get all tags with a certain name, we need to use `find_all()`.

Both `find_all()` and `find()` accept a variety of filters (tag names, attribute values, etc.) to locate the desired tags.

`find_all()` returns a list of all matching elements, while `find()` returns only the first matching element (or `None` if nothing matches).

In [None]:
sample_soup.find_all('p')                            # perform a match against that exact string; return a list of tags

[<p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <p id="unique-one">This is a paragraph with id "unique-one".</p>]

In [None]:
sample_soup.find('p')                           # find only the first match of a tag

<p>This is a typical paragraph.</p>

**Note:** `.find_all()` always returns a **list**, even if there is only one match.

In [None]:
sample_soup.find_all(['p', 'a'])                     # perform a string match against any item in that list

[<p>This is a typical paragraph.</p>,
 <p class="class-one">This is a paragraph of class "class-one".</p>,
 <a href="sample.html">The 1st item</a>,
 <p id="unique-one">This is a paragraph with id "unique-one".</p>]

In [None]:
sample_soup.find_all('p', {'class': 'class-one'}) # Always a list, even if only one match is found

[<p class="class-one">This is a paragraph of class "class-one".</p>]

In [None]:
sample_soup.find_all(['div', 'p'], {'class': 'class-one'})

[<p class="class-one">This is a paragraph of class "class-one".</p>,
 <div class="col m-3 border class-one">This is a division.</div>]

In [None]:
# filter elements only on attributes' values
# None is indispensable if using positional matching
sample_soup.find_all(None, {'class': 'class-one'})

[<p class="class-one">This is a paragraph of class "class-one".</p>,
 <li class="class-one"><a href="sample.html">The 1st item</a></li>,
 <div class="col m-3 border class-one">This is a division.</div>]

In [None]:
# equivalently, keyword matching
sample_soup.find_all(attrs = {'class': 'class-one'})

[<p class="class-one">This is a paragraph of class "class-one".</p>,
 <li class="class-one"><a href="sample.html">The 1st item</a></li>,
 <div class="col m-3 border class-one">This is a division.</div>]

We can also search for the exact string value of the class attribute:

In [None]:
sample_soup.find('div', {'class': 'col m-3 border class-one'})

<div class="col m-3 border class-one">This is a division.</div>

But searching for a variant of the string value (e.g., a subset of the classes or a different order) won't work, i.e., there is **no partial matching** on the full string:

In [None]:
sample_soup.find_all('div', {'class': 'col class-one'})

[]
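
When only some of the classes are known, one option is to pass a single class name, which Beautiful Soup matches against each of the element's classes; another is to chain classes in a CSS selector via `.select()`. A minimal sketch on a one-element document (assuming Beautiful Soup is installed):

```python
from bs4 import BeautifulSoup

html = '<div class="col m-3 border class-one">This is a division.</div>'
soup = BeautifulSoup(html, 'html.parser')

# a single class name matches any one of the element's classes
by_one_class = soup.find_all('div', {'class': 'col'})

# a reordered or partial string of classes does not match
by_variant = soup.find_all('div', {'class': 'col class-one'})

# a CSS selector can require several classes, in any order
by_selector = soup.select('div.class-one.col')

print(len(by_one_class), len(by_variant), len(by_selector))
```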


### `.select()` & `.select_one()`





A `BeautifulSoup` object has a `.select()` (or `.select_one()`) method, which allows us to query elements using CSS selectors.

`select()` returns a list of all matching elements; `select_one()` returns the first matching element.


Different types of selectors:

| Type | Notation |  Example | Explanation |
|-----|-----|-----|-----|
| type | `elementname` | `p` | Select all `<p>` elements|
| id | `#idname` | `#unique-one` | Select the element with `id="unique-one"` |
| class | `.classname` | `.class-one`  | Select all elements whose `class` attribute includes `class-one` |
| attribute | `[attributename]` |`[alt]` | Select all elements with an `alt` attribute |
| attribute value | `[attr=value]` |`[alt='UST Logo']` |  Select all elements whose `alt` attribute has the value `'UST Logo'` |


In [None]:
sample_soup.select("#unique-one")

[<p id="unique-one">This is a paragraph with id "unique-one".</p>]

In [None]:
# match elements against two or more classes
sample_soup.select('.class-one.m-3')

[<div class="col m-3 border class-one">This is a division.</div>]

In [None]:
# selectors can be combined to create more specific selectors
sample_soup.select_one('p.class-one')    # equivalent to sample_soup.find('p', {'class': 'class-one'})

<p class="class-one">This is a paragraph of class "class-one".</p>

In [None]:
sample_soup.select("div[class='col m-3 border class-one']")  # equivalent to sample_soup.find_all('div', {'class': 'col m-3 border class-one'})

[<div class="col m-3 border class-one">This is a division.</div>]

The selector syntax also supports a set of relational symbols, called **combinators**, which allow us to describe target elements in terms of their relations to others.

| Name | Notation |  Example | Explanation |
|-----|-----|-----|-----|
| Descendant| ` `| `div p` | Select all `<p>` elements inside `<div>` elements |
| Child | `>`| `div > p` | Select all `<p>` elements whose parent is a `<div>` element|


In [None]:
sample_soup.select("ol a")  # select all <a> elements that are inside <ol> elements

[<a href="sample.html">The 1st item</a>]

In [None]:
sample_soup.select("ol > a")  # select all <a> elements whose parent is a <ol> element

[]

Then use `.get_text()` to extract only the text, with `strip=True` to remove leading and trailing whitespace:

In [None]:
lies_soup.find_all('span', {'class': 'short-desc'})[0].find("strong").get_text(strip=True)   

In [None]:
# starter code
import requests
import pandas as pd
# open the URL; set timeout to 3 seconds
response = requests.get("https://jwang.people.ust.hk/countries.html", timeout=3)
# accesses the raw binary content (HTML source code) of the webpage
country_soup = BeautifulSoup(response.content, 'html.parser')

# provide your code below

item_list = country_soup.find_all('div', {'class': 'col-md-4 country'})
name_list = []; capital_list = []; population_list = []; area_list = []

for item in item_list:
    name_list.append(item.h3.get_text(strip=True))
    capital_list.append(item.find('span', {'class': 'country-capital'}).get_text())
    population_list.append(item.find('span', {'class': 'country-population'}).get_text())
    area_list.append(item.find('span', {'class': 'country-area'}).get_text())


country_df = pd.DataFrame({'Name': name_list, 'capital': capital_list, 'Population': population_list, 'Area': area_list})
country_df


Unnamed: 0,Name,capital,Population,Area
0,Andorra,Andorra la Vella,84000,468.0
1,United Arab Emirates,Abu Dhabi,4975593,82880.0
2,Afghanistan,Kabul,29121286,647500.0
3,Antigua and Barbuda,St. John's,86754,443.0
4,Anguilla,The Valley,13254,102.0
...,...,...,...,...
242,Yemen,Sanaa,23495361,527970.0
243,Mayotte,Mamoudzou,159042,374.0
244,South Africa,Pretoria,49000000,1219912.0
245,Zambia,Lusaka,13460305,752614.0


### 1 `Series` and `DataFrame`

A [`DataFrame`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) represents a ***2-dimensional***, ***tabular*** data structure containing an ***ordered*** collection of columns, each of which can be a different value type (numeric, string, Boolean, etc.).

Previously in web scraping, we have seen:
```python
lie_df = pd.DataFrame({'date': date_list, 'lie': lie_list, 'explanation': explanation_list, 'url': url_list})
```


A `DataFrame` can be thought of as a specialization of a Python dictionary. It maps names (i.e., column names or indices) to a sequence of data series that share the same set of labels (i.e., row names or indices).


<img src="https://raw.githubusercontent.com/justinjiajia/img/master/python/DataFrame.png" width=700/>

<br>

Let's build up the above `DataFrame` from scratch  based on this component view (column by column):

In [54]:
import pandas as pd

# a Series can be thought of as a 1-dimensional array with attached labels
# a set of default indices, consisting of the integers 0 through n-1, are automatically attached
district_name = pd.Series(['Wan Chai', 'North', 'Sai Kung',  'Sha Tin'])
district_name

0    Wan Chai
1       North
2    Sai Kung
3     Sha Tin
dtype: object

In [55]:
# Return Series as array
district_name.values

array(['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'], dtype=object)

A NumPy array is similar to a Python list, but it requires all elements to be of the same data type. This characteristic is beneficial for certain operations, especially mathematically intensive ones, as it allows for more efficient data processing.
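
A brief sketch of the difference, assuming NumPy is installed (it is a dependency of pandas): mixing integers and floats in an array upcasts everything to one common type, and arithmetic applies to all elements at once, whereas `*` on a list means repetition:

```python
import numpy as np

values_list = [1, 2.5, 3]
values_array = np.array(values_list)

print(values_array.dtype)    # one common type for all elements
print(values_array * 2)      # element-wise multiplication
print(values_list * 2)       # list repetition, not arithmetic
```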

In [56]:
list(district_name.index)

[0, 1, 2, 3]

In [103]:
district_population = pd.Series([150900, 310800, 448600, 648200],
                                index=district_name)
district_population

Wan Chai    150900
North       310800
Sai Kung    448600
Sha Tin     648200
dtype: int64

In [104]:
district_area = pd.Series([9.83, 136.61, 129.65, 68.71],
                          index=district_name)
district_area

Wan Chai      9.83
North       136.61
Sai Kung    129.65
Sha Tin      68.71
dtype: float64

In [105]:

HK_district1 =  pd.DataFrame({'Population': district_population,
                              'Area': district_area},
                             index=district_name)
HK_district1

Unnamed: 0,Population,Area
Wan Chai,150900,9.83
North,310800,136.61
Sai Kung,448600,129.65
Sha Tin,648200,68.71


A `Series` can also be created with a user-supplied index.

Apart from making data more readable, the explicit index definition gives the `Series` object additional capabilities.

In [63]:
district_population2 = pd.Series([448600, 648200, 150900, 310800],
                                 index=['Sai Kung',  'Sha Tin', 'Wan Chai', 'North'])
district_population2

Sai Kung    448600
Sha Tin     648200
Wan Chai    150900
North       310800
dtype: int64
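
Two of these capabilities are label-based access and automatic alignment on labels during arithmetic. The sketch below reuses the population and area figures from this section; the variable names are chosen for this example, and the two `Series` list the districts in different orders to show that values are matched by label rather than by position:

```python
import pandas as pd

population = pd.Series([448600, 648200, 150900, 310800],
                       index=['Sai Kung', 'Sha Tin', 'Wan Chai', 'North'])
# same districts, listed in a different order
area = pd.Series([9.83, 136.61, 129.65, 68.71],
                 index=['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])

print(population['Wan Chai'])          # label-based access
print(population['Sha Tin':'North'])   # label-based slicing (end label included)

density = population / area            # values are aligned by label first
print(density.round(2))
```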

The columns of a `DataFrame` can also be accessed using the attribute reference notation, as if they were attributes of the `DataFrame`.

In [65]:
HK_district1.Area

Wan Chai      9.83
North       136.61
Sai Kung    129.65
Sha Tin      68.71
Name: Area, dtype: float64

Because pandas is built on top of NumPy,  `Series` and `DataFrame` objects support **vectorized operations**.

Vectorized operations are a powerful feature of NumPy and pandas that let us apply a function or an operation to all elements of an array, a `Series`, or a `DataFrame` at once, instead of writing explicit loops. This typically runs faster and makes the code more readable.

In [109]:
# this assignment form of indexing creates a new column
HK_district1['Density'] = HK_district1.Population / HK_district1.Area
HK_district1

Unnamed: 0,Population,Area,Density
Wan Chai,150900,9.83,15350.966429
North,310800,136.61,2275.089671
Sai Kung,448600,129.65,3460.084844
Sha Tin,648200,68.71,9433.852423
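
For comparison, the sketch below computes the density twice on standalone copies of the two columns: once with an explicit Python-level loop and once with a single vectorized expression. Both give the same numbers; the vectorized form is shorter and delegates the looping to compiled code:

```python
import pandas as pd

districts = ['Wan Chai', 'North', 'Sai Kung', 'Sha Tin']
population = pd.Series([150900, 310800, 448600, 648200], index=districts)
area = pd.Series([9.83, 136.61, 129.65, 68.71], index=districts)

# explicit loop: one division per district
density_loop = pd.Series({name: population[name] / area[name] for name in districts})

# vectorized: one expression for the whole column
density_vec = population / area

print(density_vec.round(2))
```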




Another common way of creating a `DataFrame` is to call `pd.DataFrame(data, columns=[column_names], index=[row_names])` explicitly.




In [68]:
df = pd.DataFrame(data=[[3.0, 8.0], [4.0, 7.0]], columns=['a', 'b'], index=['first','second'])
df

Unnamed: 0,a,b
first,3.0,8.0
second,4.0,7.0





### 2 Data Selection in `DataFrame`s


`DataFrame`s support both ***label-based*** indexing and ***location-based*** indexing.

Pandas provides two indexer attributes that explicitly expose which indexing scheme to apply:

- The `loc` attribute uses ***label-based*** indexing:



In [106]:
HK_district1.loc['Sai Kung', 'Area']

129.65

With `loc`, both the start and the end labels of a slice are **inclusive**:

In [107]:
# slicing selects contiguous rows and columns
# but the last label is inclusive this time

HK_district1.loc['Wan Chai':'Sai Kung', :'Area']

Unnamed: 0,Population,Area
Wan Chai,150900,9.83
North,310800,136.61
Sai Kung,448600,129.65


In [110]:
# accepts a NumPy array for row selection via boolean indexing

import numpy as np
HK_district1.loc[np.array([True, False, True, False]), ['Population', 'Density']]

Unnamed: 0,Population,Density
Wan Chai,150900,15350.966429
Sai Kung,448600,3460.084844


In [111]:
HK_district1.Area > 100

Wan Chai    False
North        True
Sai Kung     True
Sha Tin     False
Name: Area, dtype: bool

In [112]:
# accepts a Pandas Series for row selection via boolean indexing

HK_district1.loc[HK_district1.Area > 100, ['Population', 'Density']]

Unnamed: 0,Population,Density
North,310800,2275.089671
Sai Kung,448600,3460.084844


In [113]:
# the element-wise Boolean operators ~, &, and | are used to combine conditions

HK_district1.loc[~(HK_district1.Area > 100) & (HK_district1.Population > 200000),
                 ['Population', 'Density']]

Unnamed: 0,Population,Density
Sha Tin,648200,9433.852423


Pandas also provides a handy helper method, `.query()`, which filters `DataFrame` rows using a concise and readable expression string. It is designed specifically for row selection.


In [114]:
HK_district1.query('~ (Area > 100) & (Population > 200000)')

Unnamed: 0,Population,Area,Density
Sha Tin,648200,68.71,9433.852423


In [115]:
HK_district1.query('index == "Sai Kung"')

Unnamed: 0,Population,Area,Density
Sai Kung,448600,129.65,3460.084844


In [116]:
# write your code here
HK_district1.query('Density < 5000 | Area > 50')

Unnamed: 0,Population,Area,Density
North,310800,136.61,2275.089671
Sai Kung,448600,129.65,3460.084844
Sha Tin,648200,68.71,9433.852423
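
Query strings can also refer to ordinary Python variables by prefixing their names with `@`. A minimal sketch on a standalone copy of the table above, where the frame name `df` and the threshold name `min_area` are arbitrary choices:

```python
import pandas as pd

df = pd.DataFrame({'Population': [150900, 310800, 448600, 648200],
                   'Area': [9.83, 136.61, 129.65, 68.71]},
                  index=['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])

min_area = 100
# @min_area is looked up in the surrounding Python scope
large = df.query('Area > @min_area and Population < 400000')
print(large)
```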


In [117]:
# Can take a 1-argument function. The x passed to the lambda is the whole DataFrame being sliced.

HK_district1.loc[lambda x: [i[0]=='S' for i in x.index], :]

Unnamed: 0,Population,Area,Density
Sai Kung,448600,129.65,3460.084844
Sha Tin,648200,68.71,9433.852423


If the second argument (column labels) is omitted, `.loc` will return all columns for the specified rows.

In [118]:
HK_district1.loc[lambda x: [i[0]=='S' for i in x.index]]

Unnamed: 0,Population,Area,Density
Sai Kung,448600,129.65,3460.084844
Sha Tin,648200,68.71,9433.852423


- The `iloc` attribute uses Python-style ***location-based*** indexing, which means the right-hand end of a slice is **exclusive**:

In [121]:
HK_district1

Unnamed: 0,Population,Area,Density
Wan Chai,150900,9.83,15350.966429
North,310800,136.61,2275.089671
Sai Kung,448600,129.65,3460.084844
Sha Tin,648200,68.71,9433.852423


In [125]:
# the last index is exclusive as with regular Python slicing
HK_district1.iloc[1:3,:2]

Unnamed: 0,Population,Area
North,310800,136.61
Sai Kung,448600,129.65
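
To contrast the two slicing conventions side by side, the sketch below selects the same two rows once by label and once by position, on a standalone copy of the table (the name `df` is an arbitrary choice):

```python
import pandas as pd

df = pd.DataFrame({'Population': [150900, 310800, 448600, 648200],
                   'Area': [9.83, 136.61, 129.65, 68.71]},
                  index=['Wan Chai', 'North', 'Sai Kung', 'Sha Tin'])

by_label = df.loc['North':'Sai Kung']   # end label is included
by_position = df.iloc[1:3]              # end position 3 is excluded

print(by_label.equals(by_position))     # both select North and Sai Kung
```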


In [126]:
# select non-contiguous rows and columns
HK_district1.iloc[[1, 3], [2, 0]]

Unnamed: 0,Density,Population
North,2275.089671,310800
Sha Tin,9433.852423,648200


The `.iloc` is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

In [128]:
# the comparison HK_district1.Area > 100 returns a Series
# iloc can only take a NumPy array (not Pandas Series), which can be accessed via .values
HK_district1.iloc[(HK_district1.Area > 100).values, [True, False, True]]

Unnamed: 0,Population,Density
North,310800,2275.089671
Sai Kung,448600,3460.084844


In [None]:
# Can take a 1-argument function. The x passed to the lambda is the DataFrame being sliced.
HK_district2.iloc[lambda x: [i for i in range(len(x)) if i % 2 == 0],
              lambda x: [i[0]=='A' for i in x.columns]]

Unnamed: 0,Area
North,136.61
Sha Tin,68.71



### 3 Importing and Exporting Data

Pandas features a number of [functions](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for reading tabular data as a `DataFrame` object. Among them, [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) is likely the one we'll use the most:


In [130]:
sp = pd.read_csv("https://raw.githubusercontent.com/daisydream00/datafiles/refs/heads/main/adj_closing_sub.csv")
sp

Unnamed: 0,Date,GOOG,APPL,AMZN
0,2015/5/1,537.900024,120.220688,422.869995
1,2015/5/4,540.780029,119.987633,423.040009
2,2015/5/5,530.799988,117.283951,421.190002
3,2015/5/6,524.219971,116.547424,419.100006
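
For the exporting side, `DataFrame.to_csv()` is the counterpart of `read_csv()`. The sketch below is a minimal round trip. It assumes an in-memory buffer (`io.StringIO`) in place of a file or URL so that it runs without network access, and it reuses two rows from the table above:

```python
import io

import pandas as pd

# Two rows copied from the output above, held in memory instead of a file
csv_text = (
    "Date,GOOG,APPL\n"
    "2015/5/1,537.900024,120.220688\n"
    "2015/5/4,540.780029,119.987633\n"
)
prices = pd.read_csv(io.StringIO(csv_text))

# index=False leaves out the row labels, so no extra unnamed column appears on re-import
buffer = io.StringIO()
prices.to_csv(buffer, index=False)

round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```

Passing a file path such as `"prices.csv"` (a hypothetical name) to `to_csv()` instead of the buffer writes the same text to disk.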



### 4 Computing Summary and Descriptive Statistics


`DataFrame` objects are equipped with common mathematical and statistical [methods](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats) for column-wise computations (or row-wise by setting `axis=1`):

- Most of them produce aggregates:




In [None]:
sp[['GOOG', 'APPL']].mean()

Unnamed: 0,0
GOOG,533.425003
APPL,118.509924


The following table summarizes some built-in Pandas aggregations:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``nunique()``            | Number of distinct items        |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |
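
To make the column-wise versus row-wise distinction concrete, here is a small sketch on made-up numbers (the frame `df` and its values are illustrative only). Because `mad()` is not available in pandas 2.0 and later, the last line computes the mean absolute deviation by hand:

```python
import pandas as pd

# Illustrative values, not taken from the dataset above
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

col_means = df.mean()        # one value per column (axis=0 is the default)
row_means = df.mean(axis=1)  # one value per row
distinct = df.nunique()      # number of distinct items per column

# Mean absolute deviation per column: average distance from the column mean
mean_abs_dev = (df - df.mean()).abs().mean()

print(col_means)
print(row_means)
print(mean_abs_dev)
```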


- Some statistics are computed from pairs of columns:

In [None]:
sp.cov(numeric_only=True)

Unnamed: 0,GOOG,APPL,AMZN
GOOG,55.248513,13.269122,13.454445
APPL,13.269122,3.488251,3.236487
AMZN,13.454445,3.236487,3.364861


In [None]:
sp[['APPL', 'AMZN']].cov()

Unnamed: 0,APPL,AMZN
APPL,3.488251,3.236487
AMZN,3.236487,3.364861


- Some produce multiple summary statistics in one shot:

In [131]:
# by default, summarize numeric columns only
sp.describe()

Unnamed: 0,GOOG,APPL,AMZN
count,4.0,4.0,4.0
mean,533.425003,118.509924,421.550003
std,7.432934,1.867686,1.834356
min,524.219971,116.547424,419.100006
25%,529.154984,117.099819,420.667503
50%,534.350006,118.635792,422.029999
75%,538.620025,120.045897,422.912499
max,540.780029,120.220688,423.040009



### 5 Handling Missing Values

In [None]:
import numpy as np

df_w_nan = pd.DataFrame({'A': [1, 2, np.nan],
                         'B': [5, np.nan, np.nan],
                         'C': [4, 5, 6]})
df_w_nan

Unnamed: 0,A,B,C
0,1.0,5.0,4
1,2.0,,5
2,,,6



Pandas provides several useful methods for detecting, removing, and replacing missing values in pandas data structures:

- [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html) generates a boolean mask indicating missing values, while [`notnull()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html) produces the opposite:


In [None]:
df_w_nan.isnull()

Unnamed: 0,A,B,C
0,False,False,False
1,False,True,False
2,True,True,False



- [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) returns a filtered version of the data:

In [None]:
df_w_nan.dropna(axis=0)  # axis=0 (the default) drops rows that contain any missing value

Unnamed: 0,A,B,C
0,1.0,5.0,4



- [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) returns a copy of the data with missing values filled or imputed (set `inplace=True` to modify it in place):

In [None]:
# Replace all NaN elements with 0s.
df_w_nan.fillna(0)

Unnamed: 0,A,B,C
0,1.0,5.0,4
1,2.0,0.0,5
2,0.0,0.0,6


In [None]:
# ffill() forward-fills each missing value with the last valid value from the previous row (axis=0) or the previous column (axis=1)
df_w_nan.ffill(axis=0)

Unnamed: 0,A,B,C
0,1.0,5.0,4
1,2.0,5.0,5
2,2.0,5.0,6


In [None]:
# impute the missing values in column B with the mean of its non-missing values
df_w_nan.loc[df_w_nan.B.isnull(), 'B'] = df_w_nan.B.mean()
df_w_nan

Unnamed: 0,A,B,C
0,1.0,5.0,4
1,2.0,5.0,5
2,,5.0,6
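
The methods above take a few further options that are easy to overlook. The sketch below rebuilds the same small frame (under the illustrative name `df`, so the in-place edit above does not affect it). It shows dropping columns instead of rows, keeping rows with a minimum number of valid values via `thresh`, and imputing every column with its own mean in one call:

```python
import numpy as np
import pandas as pd

# Same values as df_w_nan before it was modified in place
df = pd.DataFrame({"A": [1, 2, np.nan],
                   "B": [5, np.nan, np.nan],
                   "C": [4, 5, 6]})

complete_cols = df.dropna(axis=1)   # keep only columns without missing values
at_least_two = df.dropna(thresh=2)  # keep rows with at least 2 non-missing values
imputed = df.fillna(df.mean())      # fill each column with its own mean

print(complete_cols)
print(at_least_two)
print(imputed)
```

Passing a Series to `fillna()` matches its labels against the column names, which is why each column receives its own mean.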



### 6 Computing Group-wise Summary Statistics



Categorizing a dataset and applying a function to each group (whether an aggregation or a transformation) is often a critical component of a data analysis workflow.




Splitting data in a `DataFrame` into groups can be done by calling the `DataFrame`'s [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method, passing the name of the desired key column:



In [None]:
stock = pd.read_csv("https://raw.githubusercontent.com/daisydream00/datafiles/refs/heads/main/pricevolume_sub.csv")
stock

Unnamed: 0,Date,Symbol,Price,Volume
0,2015/5/1,GOOG,537.900024,1768200
1,2015/5/4,GOOG,540.780029,1308000
2,2015/5/5,GOOG,530.799988,1383100
3,2015/5/6,GOOG,524.219971,1567000
4,2015/5/1,APPL,120.220688,58512600
5,2015/5/4,APPL,119.987633,50988300
6,2015/5/5,APPL,117.283951,49271400
7,2015/5/6,APPL,116.547424,72141000
8,2015/5/1,AMZN,422.869995,3565800
9,2015/5/4,AMZN,423.040009,2270400


In [None]:
stock_by_symbol = stock.groupby('Symbol')

# what was returned is a GroupBy object
# wrap it in a loop to examine the resulting groupings
for key, group in stock_by_symbol:
    print(f"Group: {key}")
    print(group)
    print("-" * 40)  # Separator for clarity

Group: AMZN
        Date Symbol       Price   Volume
8   2015/5/1   AMZN  422.869995  3565800
9   2015/5/4   AMZN  423.040009  2270400
10  2015/5/5   AMZN  421.190002  2856400
11  2015/5/6   AMZN  419.100006  2552500
----------------------------------------
Group: APPL
       Date Symbol       Price    Volume
4  2015/5/1   APPL  120.220688  58512600
5  2015/5/4   APPL  119.987633  50988300
6  2015/5/5   APPL  117.283951  49271400
7  2015/5/6   APPL  116.547424  72141000
----------------------------------------
Group: GOOG
       Date Symbol       Price   Volume
0  2015/5/1   GOOG  537.900024  1768200
1  2015/5/4   GOOG  540.780029  1308000
2  2015/5/5   GOOG  530.799988  1383100
3  2015/5/6   GOOG  524.219971  1567000
----------------------------------------


In [None]:
stock_by_date_symbol = stock.groupby(['Date', 'Symbol'])

for key, group in stock_by_date_symbol:
    print(f"Group: {key}")
    print(group)
    print("-" * 40)  # Separator for clarity

Group: ('2015/5/1', 'AMZN')
       Date Symbol       Price   Volume
8  2015/5/1   AMZN  422.869995  3565800
----------------------------------------
Group: ('2015/5/1', 'APPL')
       Date Symbol       Price    Volume
4  2015/5/1   APPL  120.220688  58512600
----------------------------------------
Group: ('2015/5/1', 'GOOG')
       Date Symbol       Price   Volume
0  2015/5/1   GOOG  537.900024  1768200
----------------------------------------
Group: ('2015/5/4', 'AMZN')
       Date Symbol       Price   Volume
9  2015/5/4   AMZN  423.040009  2270400
----------------------------------------
Group: ('2015/5/4', 'APPL')
       Date Symbol       Price    Volume
5  2015/5/4   APPL  119.987633  50988300
----------------------------------------
Group: ('2015/5/4', 'GOOG')
       Date Symbol       Price   Volume
1  2015/5/4   GOOG  540.780029  1308000
----------------------------------------
Group: ('2015/5/5', 'AMZN')
        Date Symbol       Price   Volume
10  2015/5/5   AMZN  421.190002  

In [None]:
# mean() applies only to numeric columns, so restrict the aggregation to them
stock_by_symbol.mean(numeric_only=True)

Unnamed: 0_level_0,Price,Volume
Symbol,Unnamed: 1_level_1,Unnamed: 2_level_1
AMZN,421.550003,2811275.0
APPL,118.509924,57728325.0
GOOG,533.425003,1506575.0


To suppress using group keys as indices in the aggregated output, we can pass `as_index=False` to `groupby()` when first creating the `GroupBy` object:

In [None]:
stock_by_symbol = stock.groupby('Symbol', as_index=False)
stock_by_symbol.mean(numeric_only=True)

Unnamed: 0,Symbol,Price,Volume
0,AMZN,421.550003,2811275.0
1,APPL,118.509924,57728325.0
2,GOOG,533.425003,1506575.0


To sort the results of an aggregation (e.g., `sum()`, `mean()`), use the `.sort_values()` method:

In [None]:
sorted_group = stock.groupby('Symbol')['Volume'].mean().sort_values(ascending=False)
print(sorted_group)

Symbol
APPL    57728325.0
AMZN     2811275.0
GOOG     1506575.0
Name: Volume, dtype: float64
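
When different columns need different statistics, the `agg()` method of a `GroupBy` object can compute them in one pass. This is a sketch rather than part of the original exercise: it uses four rows copied from the `stock` table above, and the output column names `mean_price` and `total_volume` are our own choices:

```python
import pandas as pd

# Four rows copied from the stock table above
stock = pd.DataFrame({
    "Symbol": ["GOOG", "GOOG", "APPL", "APPL"],
    "Price": [537.900024, 540.780029, 120.220688, 119.987633],
    "Volume": [1768200, 1308000, 58512600, 50988300],
})

# Named aggregation: new_column=(source_column, function)
summary = stock.groupby("Symbol").agg(
    mean_price=("Price", "mean"),
    total_volume=("Volume", "sum"),
)
print(summary)
```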


### Sample Web Scraping

In [None]:
# starter code

import requests
from bs4 import BeautifulSoup
import pandas as pd
response = requests.get("https://www.imdb.com/title/tt13016388/reviews/?spoilers=EXCLUDE", timeout=3, headers={'User-Agent':'Mozilla/5.0'})
review_soup = BeautifulSoup(response.content, 'html.parser')

# provide your code below

In [None]:
review_list = review_soup.find_all('article',{'class':'user-review-item'})

title_list = []
text_list = []
ID_list = []
date_list = []
rating_list = []

for result in review_list:
    title = result.find('h3', {'class':'ipc-title__text'}).get_text(strip=True)
    title_list.append(title)

    text = result.find('div', {'class':'ipc-html-content'})
    if text:
        text_list.append(text.get_text(strip=True))
    else:
        text_list.append("")

    ID = result.find('a',{'class':'ipc-link'}).get_text(strip=True)
    ID_list.append(ID)

    date = result.find('li', {'class':'review-date'}).get_text(strip=True)
    date_list.append(date)

    rating_element = result.find('span', {'class':'ipc-rating-star--rating'})
    if rating_element:
        rating_list.append(rating_element.get_text(strip=True))
    else:
        rating_list.append(None)
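
The loop above leaves five parallel lists. One plausible next step (an assumption, since the exercise does not show it) is to combine them into a `DataFrame`. In the sketch below the list contents are made-up placeholders standing in for scraped values, and the column names are our own:

```python
import pandas as pd

# Placeholder values standing in for the scraped lists (not real IMDb data)
title_list = ["Example title 1", "Example title 2"]
text_list = ["Example review text.", ""]
ID_list = ["user_a", "user_b"]
date_list = ["Jan 1, 2025", "Jan 2, 2025"]
rating_list = ["8", None]

reviews = pd.DataFrame({
    "title": title_list,
    "text": text_list,
    "user": ID_list,
    "date": date_list,
    "rating": rating_list,
})

# Ratings are scraped as text; convert them to numbers, with missing ones becoming NaN
reviews["rating"] = pd.to_numeric(reviews["rating"], errors="coerce")
print(reviews)
```

Appending an empty string or `None` for missing elements, as the loop does, keeps all five lists the same length, which the `DataFrame` constructor requires.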

### Streamlit

- Text
    - st.write()
    - st.markdown()
    - st.html()
    - st.code()
    - st.divider()
- Data
    - st.dataframe(): Interactive and scrollable format to display a dataframe
    - st.table(): Static and non-interactive
    - st.data_editor(): Editable table for user to modify in real-time
    - st.column_config: column types (e.g., LinkColumn, ProgressColumn) for customizing how columns are displayed
- Media
    - st.image()
    - st.video()

In [3]:
import streamlit as st

passage = """
    Streamlit is an open-source Python framework for data scientists and AI/ML engineers to deliver interactive data apps – in only a few lines of code.
    """
st.write(passage)

In [None]:
import pandas as pd
st.write(
    pd.DataFrame({"first column": [1, 2, 3], "second column": [10, 20, 30]}), 
    passage, 
    123
)

In [None]:
html_source = '''<div>
    <a href="www.ust.hk">HKUST</a>
</div>
'''
st.code(html_source, language="html", line_numbers=True)

In [None]:
st.data_editor(
    df, 
    column_config={"name": "App name",
                   "url": st.column_config.LinkColumn("App URL", disabled=True),
                   "likes": st.column_config.ProgressColumn("Stars", max_value=1000, format="%d ⭐"),
                   "views_history": st.column_config.LineChartColumn("Views", y_min=0, y_max=5000),
                   "in_progress": st.column_config.SelectboxColumn("In progress?")}
)

![image.png](attachment:image.png)

In [None]:
st.image(["static/cat.jpg", "static/dog.jpg", "static/owl.jpg"], width=200)

In [None]:
st.video("static/sora_gen.mp4", 
         start_time="5s", end_time="17s",
         autoplay=True, muted=True, loop=True)

- Chart
    - st.line_chart()


![image.png](attachment:image.png)

If we align color with a categorical variable (i.e., a column that contains discrete values), data points will be grouped into lines of the same color based on the value of this variable.

In [None]:
st.line_chart(medals, x="year", y="total", color="type",
              width=720, height=500)

    - st.area_chart()

By stacking multiple area segments on top of each other, an area chart also lets us visualize the cumulative total of the values at each point, along with the individual trends.

In [None]:
st.area_chart(medals, x="year", y="total", color="type", stack=True,
              width=720, height=500)

Streamlit also allows us to customize the colors used for different categories with all built-in charting functions except st.map(). However, to leverage this color customization feature, we must first transform the DataFrame into wide format.

In wide format, categories are no longer represented as distinct values within a single column (e.g., type). Rather, they are mapped into independent columns (e.g., Gold, Silver, and Bronze).
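
One way to perform this long-to-wide transformation is `DataFrame.pivot()`. The sketch below assumes a long-format frame with the columns `year`, `type`, and `total` used in the chart calls above. The medal counts are made up for illustration:

```python
import pandas as pd

# Long format: one row per (year, type) pair; the counts are made up
medals = pd.DataFrame({
    "year": [2016, 2016, 2016, 2020, 2020, 2020],
    "type": ["Gold", "Silver", "Bronze"] * 2,
    "total": [10, 8, 6, 12, 9, 7],
})

# Wide format: one row per year, one column per medal type
medals_w = medals.pivot(index="year", columns="type", values="total").reset_index()
print(medals_w)
```

The new columns come out in sorted order (`Bronze`, `Gold`, `Silver`), so a list of colors passed to a chart should follow the same order as the column names given in `y`.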

![image.png](attachment:image.png)

In [None]:
# "#A77044", "#FEE101", and "#A7A7AD" are the Hex color codes for Bronze, Gold, and Silver, respectively
# https://www.schemecolor.com/olympic-medals-color-scheme.php

st.area_chart(medals_w, x="year", y=['Bronze', 'Gold', 'Silver'], 
              color=["#A77044", "#FEE101", "#A7A7AD"],
              width=720, height=500)

    - st.scatter_chart()


In [None]:
st.scatter_chart(iris, x="sepal_length", y="sepal_width", 
                 width=720, height=500)

    - st.bar_chart()

In [None]:
st.bar_chart(tips, x="sex", y="total_bill", color="smoker", horizontal=True,
            width=720, height=500)

### Selection data
`DataFrame.query()` can be used to filter data for plotting.
For example, if the variable holding the selected stocks is called `symbols` and the variable holding the specified year is called `year`, then `df.query(f"date < {year + 1} and date >= {year} and symbol in {symbols}")` (where `df` is the DataFrame being filtered) returns the desired subset.
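
As an alternative to building the query string with an f-string, `query()` can reference Python variables directly with the `@` prefix. The sketch below uses made-up prices and explicit `Timestamp` bounds for the year. The frame and its values are illustrative, not the dataset used in class:

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.DataFrame({
    "date": pd.to_datetime(["2005-03-01", "2006-03-01", "2006-06-01", "2007-03-01"]),
    "symbol": ["AMZN", "AMZN", "MSFT", "AAPL"],
    "price": [35.0, 37.0, 23.0, 90.0],
})

symbols = ["AMZN", "AAPL"]
year = 2006
start = pd.Timestamp(year=year, month=1, day=1)
end = pd.Timestamp(year=year + 1, month=1, day=1)

# @name refers to a Python variable in the calling scope
subset = prices.query("date >= @start and date < @end and symbol in @symbols")

# The same filter written with boolean masks
mask = (prices["date"].dt.year == year) & prices["symbol"].isin(symbols)
print(subset)
```

Both forms select the same rows. The `@` form avoids having to worry about how lists and strings are rendered inside the query text.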

### Widgets
- st.selectbox()

In [None]:
contact_str = st.selectbox(
    "How would you like to be contacted?",
    ("Email", "Home phone", "Mobile phone")
)

![image.png](attachment:image.png)

- st.radio()

In [None]:
genre_str = st.radio(
    "What's your favorite movie genre?",
    ["Comedy", "Drama", "Documentary"]
)

![image.png](attachment:image.png)

- st.text_input()

![image.png](attachment:image.png)

- st.text_area(): a larger, multi-line input box; otherwise it works like st.text_input()

### list

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [None]:
slider_int_range = st.slider(
    'Choose two values', min_value=10, max_value=50, # if min_value or max_value is a float, the default step is 0.01 (1 for integers)
    value=[30, 40] # default value; passing a pair of values makes this a range slider
)

### Integer or Float (default).
Similar to the slider: the step depends on the type of min/max, and `value=` sets the default value.

![image.png](attachment:image.png)

### Bool

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

### Sample Dynamic Chart

In [None]:
import streamlit as st

import pandas as pd

source = pd.read_csv(
    "https://raw.githubusercontent.com/vega/vega-datasets/main/data/stocks.csv", 
    parse_dates=['date'], date_format="%b %d %Y"
    ).query(f"date < 2010 and date >= 2005")


st.dataframe(source)

stocks = st.multiselect("Select stocks for comparison", source.symbol.unique(), default=['AMZN', 'AAPL']) # .unique() returns a NumPy array of the unique values in the column
all_years = st.checkbox("Show prices for the whole period")
year = st.slider("Select a year", min_value=2005, max_value=2009, value=2006, disabled=all_years) # disabled=True means the slider is not clickable


if not stocks:
    st.error("Please select at least one stock!")
else:
    if not all_years:
        st.markdown(f"### Stock prices in {year}")
        chart_data = source.query(f"date < {year + 1} and date >= {year} and symbol in {stocks}")
    else:
        st.markdown(f"### Stock prices in all years")
        chart_data = source.query(f"symbol in {stocks}")  
    st.line_chart(chart_data, x="date", y="price", color="symbol", 
                    width=720, height=500)

### Side bar
Each page can feature a sidebar implemented by st.sidebar. Passing an element to st.sidebar positions it to the left of the main viewport, allowing users to focus on the content in our app.

In [None]:
import streamlit as st
import pandas as pd

source = pd.read_csv("https://raw.githubusercontent.com/vega/vega-datasets/main/data/stocks.csv", 
                     parse_dates=['date'], date_format="%b %d %Y")

with st.sidebar:
    stock = st.selectbox("Select a stock", source.symbol.unique())
    year = st.slider("Select a year", 2004, 2010)
# Equivalent to (kept as comments, since Streamlit's magic would display a bare string literal in the app):
#     stock = st.sidebar.selectbox("Select a stock", source.symbol.unique())
#     year = st.sidebar.slider("Select a year", 2004, 2010)

query_str = f"date < {year + 1} and date >= {year} and symbol == '{stock}'"
chart_data = source.query(query_str)

st.markdown(f"### Prices for `{stock}` in {year}")
st.area_chart(chart_data, x="date", y="price")

![image.png](attachment:image.png)

### Column

In [None]:
col1, col2, col3 = st.columns(3, vertical_alignment="top") # "top" is the default
# equivalent to st.columns([1/3, 1/3, 1/3]); passing a list lets us adjust the relative width of each column

# use the `with` notation:
with col1:
   st.header("A cat")
   st.image("static/cat.jpg")

with col2:
   st.header("A dog")
   st.image("static/dog.jpg")

with col3:
   st.header("An owl")
   st.image("static/owl.jpg")

### Container
st.container allows us to insert an invisible container into our app that can hold multiple elements.

By default, the container grows to fit its content and no border is shown.

When a fixed height is provided, a grey border is shown around the container and scrolling is also enabled for large content:

In [None]:
container = st.container(height=150)
container.write("This is inside the container")
st.write("This is outside the container")
container.write(":dog:")

![image.png](attachment:image.png)

st.container can be used along with st.columns to make a grid of containers like the following:

In [None]:
row1 = st.columns(4)
row2 = st.columns(4)

for col in row1 + row2:
    tile = col.container(border=True)
    tile.write(":dog:")

![image.png](attachment:image.png)

## Sample

### Task 1: Class representation

 You are required to represent each course's exam as an instance of the class `Exam`.<br>Complete the following dunder methods for the class:
 - `__init__()`: constructor
 - `__repr__()`: official string representation
 - `__str__()`: invoked when an object is printed using print().

<br>

As an example, given the following information:
<div class='course'>
	<div class='courseanchor' style='position:relative; float:left; visibility:hidden;'><a></a></div>
	<div class='subject'><b>ISOM 0000 - Some Demo Course</b></div>
	<table class='sections'>
		<tbody>
			<tr>
				<th>&nbsp;</th>
				<th>Section</th>
				<th>Date</th>
				<th>Time</th>
				<th>Venue</th>
				<th>Remarks</th>
			</tr>
			<tr class='newsect secteven'>
				<td align='center'></td>
				<td align='center'>L1</td>
				<td class='date'>12-May-2025</td>
				<td class='time'>08:30AM - 10:30AM</td>
				<td class='venue'>Venue A<br>Venue B</td>
				<td class='remarks' align='center' colspan=''> </td>
			</tr>
				<tr class='newsect sectodd'>
				<td align='center'></td>
				<td align='center'>L2</td>
				<td class='date'>12-May-2025</td>
				<td class='time'>08:30AM - 10:30AM</td>
				<td class='venue'>Venue A<br>Venue B</td>
				<td class='remarks' align='center' colspan=''> </td>
			</tr>
		</tbody>
	</table>
</div>
<em>*Please refer to the actual website for the HTML code and exam information. The above representation may not be 100% accurate.</em>

The properties of this Exam object are:

| Property | Type | Example Value |
|----------|------|---------------|
|`course_code`|string|`'ISOM 0000'`|
|`course_name`|string|`'Some Demo Course'`|
|`date`|string|`'12-May-2025'`|
|`start`|string|`'08:30AM'`|
|`end`|string|`'10:30AM'`|
|`venues`|**list** of strings|`['Venue A', 'Venue B']`|

The constructor call should be:
```Python
some_exam = Exam('ISOM 0000', 'Some Demo Course', '12-May-2025', '08:30AM', '10:30AM', ['Venue A', 'Venue B'])
```

The official string representation (`__repr__()`) should return the course code of the exam, i.e. **ISOM 0000**

The object when printed (`__str__()`) should return exam information as `[course_code]: [date] [start]-[end] at [venues]`, e.g.:
```Python
'ISOM 0000: 12-May-2025 08:30AM-10:30AM at Venue A / Venue B'
```
*Note that if there are multiple venues, each venue should be separated by a forward slash (`/`). If there is only one venue, nothing special is required.





In [None]:
class Exam:

    # Complete the constructor for the Exam class below.
    def __init__(self, course_code, course_name, date, start, end, venues):
        # *Please use the same property names as shown in the example table above.*
        self.course_code = course_code
        self.course_name = course_name
        self.date = date
        self.start = start
        self.end = end
        self.venues = venues


    def __repr__(self):
        # repr: defines "official" representation of the object when you type its name, e.g., 'ISOM 0000'
        # Please complete the return value for this method.
        return self.course_code


    def __str__(self):
        # str: defines what you see when you print it out, e.g, 'ISOM 0000: 12-May-2025 08:30AM-10:30AM at Venue A / Venue B'
        # Please complete the return value for this method

        return f'{self.course_code}: {self.date} {self.start}-{self.end} at {" / ".join(self.venues)}'


Test your code by instantiating the object and showing its representations.

In [None]:
some_exam = Exam('ISOM 0000', 'Some Demo Course', '12-May-2025', '08:30AM', '10:30AM', ['Venue A', 'Venue B'])
some_exam

ISOM 0000

In [None]:
print(some_exam)

ISOM 0000: 12-May-2025 08:30AM-10:30AM at Venue A / Venue B
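
The split between `__repr__()` and `__str__()` matters most once objects are stored in a container: printing a list shows each element through `__repr__()`, not `__str__()`. The sketch below uses a hypothetical stand-in class `Course` (not part of the assignment) with the same split as `Exam`:

```python
class Course:
    """Hypothetical stand-in with the same __repr__/__str__ split as Exam."""

    def __init__(self, code, name):
        self.code = code
        self.name = name

    def __repr__(self):
        return self.code

    def __str__(self):
        return f"{self.code}: {self.name}"


demo = Course("ISOM 0000", "Some Demo Course")

print(demo)          # print() uses __str__
print([demo])        # a list shows its elements with __repr__
print(f"{demo!r}")   # !r forces __repr__ inside an f-string
```

This is why a short `__repr__()` such as the course code keeps a printed list of exams readable.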


In [7]:
# Starter code
from bs4 import BeautifulSoup
import requests

In [None]:
# Do not run this code cell if you are using requests.get().
# open() and requests.get() retrieve HTML in different ways
# open() is the built-in function to read and write local files
# so, you need to download the webpage manually and save it in your file system beforehand
# In contrast, requests.get() is to retrieve HTML over the Internet
with open('ISOM_2430_exam.html', 'r') as html_code:
    soup = BeautifulSoup(html_code, 'html.parser') # <- use soup for your scraping operations, or you can change the variable name if you like.

Hints:
1. Consider using the `.split()` method to separate the course code from the course name. Specify `maxsplit=1` to handle names that contain the separator more than once, e.g., 'ISOM 3380 - Advanced Network Management (CISCO - ICND)'
2. To extract all the venues (in the case of multiple venues), use a separator (e.g., `|`) when calling `get_text()` method to join multiple pieces of text from within the HTML element. Then split the extracted text into a list, using the same symbol (e.g., `|`) as the separator.
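
Both hints can be tried on small inputs before touching the real page. The sketch below uses the subject string from hint 1 and a single venue cell modeled on the sample HTML above. The variable names are our own:

```python
from bs4 import BeautifulSoup

# Hint 1: maxsplit=1 splits only at the first ' - '
subject = "ISOM 3380 - Advanced Network Management (CISCO - ICND)"
code, name = subject.split(" - ", maxsplit=1)

# Hint 2: a separator marks where <br> divided the text, so it can be split back into a list
cell = BeautifulSoup("<td class='venue'>Venue A<br>Venue B</td>", "html.parser")
venues = cell.get_text("|", strip=True).split("|")
joined = cell.get_text()  # without a separator the venue names run together

print(code, "|", name)
print(venues)
print(joined)
```

Without `maxsplit=1`, the subject string would split into three parts and the two-name unpacking would raise a `ValueError`.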

In [None]:
exam_list = []
# Write your code below

all_courses = soup.select('div.course')
for course in all_courses:
    # Structure of subject: 'ISOM XXXX - Course Name'. There is only one subject per div, so select_one is a better option.
    course_code, course_name = course.select_one('div.subject').get_text().split(' - ', maxsplit=1)
    # splitting with ' - ' without maxsplit: error with ISOM 3380 - Advanced Network Management (CISCO - ICND)
    # If course code is 5XXX or above, skip
    if int(course_code[5]) > 4:
        continue
    # Alternatively: break can also be used, but continue seems to be a safer option (may break prematurely if website not in order of course code)
    # For this assignment & this website: break is fine.

    # Check if an exam is scheduled on the website, i.e. if there is a valid time and date.
    exam_date = course.select_one('td.date').get_text()
    # If this course does not have a date (or time): we skip this course
    if exam_date == '-':
        continue

    # Process remaining information:
    exam_start, exam_end = course.select_one('td.time').get_text().split(' - ')
    exam_venues = course.select_one('td.venue').get_text('|', strip = True).split('|') # '|' symbol as a placeholder for <br>.

    # If a placeholder symbol is not provided, get_text will remove <br> and join the venues together -> more difficult to process
    this_exam = Exam(course_code, course_name, exam_date, exam_start, exam_end, exam_venues)
    exam_list.append(this_exam)


### Streamlit

In [None]:
import streamlit as st
import pandas as pd


# Configure page
st.set_page_config(page_title="Stock Dashboard", layout="wide")

st.markdown("""### :material/description:  Requirements


- Please download `In-Class Exercise_5_part1.ipynb` from Canvas. Complete the code to extract relevant info from the **`yfinance`** library to generate two files, `ticker_info.csv` and `stock_data.csv` (the two data files are also available on Canvas if you want to start with the second part first). Place these two files in your Streamlit project folder and complete the second part of this exercise, which is to set up a Streamlit application.
            
- Include a sidebar (`with st.sidebar`) that contains the following widgets:
    - A checkbox widget (`st.checkbox()`) that allows the user to control whether to show a bar chart that visualizes the market cap of S&P 100 stocks by sector; the default value is checked.
    - A multiselect widget (`st.multiselect()`) that allows the user to choose which stocks should be included for plotting; the options are S&P 100 tickers; the default values are `'TSLA'`, `'NVDA'`, and `'AAPL'`.
    - A slider widget (`st.slider()`) that allows the user to specify the year range over which stocks are compared; the range is 2020 - 2025; the default value is 2024 - 2025.

- If the checkbox is checked, display a bar chart that visualizes the market cap of S&P 100 stocks by sector.

- If none of the stocks are selected, display an error message: "Please select at least one stock!"; 
  if stocks are selected, show a line chart for closing prices (using the column 'Close') and a bar chart for volume (using the column 'Volume').
            
- Tips:
    - Use the `.unique()` method to obtain the unique values of `Ticker` column in the Pandas dataframe `ticker_info`;
    - [`DataFrame.query()`](https://pandas.pydata.org/docs/user_guide/indexing.html#the-query-method) can be used to filter data for plotting, 
        - e.g., if the variable holding the selected stocks is `selected_tickers` and the variable holding the year range is `selected_years`, `query(f"Date < {selected_years[1] + 1} and Date >= {selected_years[0]} and Ticker in {selected_tickers}")` will give you the desired subset of data.
""")          


# load stock info and data
ticker_info = pd.read_csv("ticker_info.csv")
stock_data = pd.read_csv("stock_data.csv", parse_dates=['Date'])

# extract S&P 100 tickers from ticker_info
tickers_100 = ticker_info.Ticker.unique()

# set page title
st.title("S&P 100 Stock Dashboard 📊")  #:bar_chart:

# Sidebar controls
with st.sidebar:
    st.header("Sidebar Widgets")

    show_sector = st.checkbox("Show Market Cap by Sector", True)

    selected_tickers = st.multiselect(
        "Select Companies",
        options=tickers_100,
        default=['TSLA', 'NVDA', 'AAPL']
    )

    # Year range slider
    selected_years = st.slider(
        "Select Year Range",
        min_value=2020,
        max_value=2025,
        value=(2024, 2025)
    )

if show_sector:
    st.header("Market Cap by Sector")
    st.bar_chart(ticker_info, x="Sector", y="Market Cap (B)", color="Ticker", horizontal=True,
            width=720, height=500)

if selected_tickers:

    chart_data = stock_data.query(f"Date < {selected_years[1] + 1} and Date >= {selected_years[0]} and Ticker in {selected_tickers}")

    st.header(f"Stock Trend Analysis ({selected_years[0]} - {selected_years[1]})")
    
    # Closing Prices
    st.subheader("Closing Prices")
    st.line_chart(chart_data, x="Date", y="Close", color="Ticker",
                  width=720, height=500)

    
    # Volume Analysis
    st.subheader("Trading Volume")
    st.bar_chart(chart_data, x="Date", y="Volume", color="Ticker", 
                  width=720, height=500)

else:
    st.error("Please select at least one stock!")