**note**: for anyone trying to view this as a slideshow, run 'pip install rise', reload your notebook, then click on the button that looks like a bar graph to the right of the command palette at the top.

In [29]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style = 'darkgrid')

from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE





In [30]:
from IPython.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

<center><h1>Discussion 5</h1></center>
<center><h2>DSC 20, Fall 2023</h2></center>

<center><h2>Meme of the week</h2></center>

<center><img src='imgs/meme.png' width=800></center>

<center><h2>Agenda</h2></center>
<ul>
    <li> Midterm Feedback</li>
    <li> <b>Content</b> </li>
    <ul>
        <li> lambda
        <li> map/filter
        <li> HOF
        <li> args, kwargs, defaults
    </ul>
    <li> Practice Questions </li>
    <li> (if we have time) project foreshadowing: machine learning </li>
</ul>

<center><h2> About the Midterm </h2> </center>

- grades will come out when they do

- make sure to look through your score and submit regrade requests for sections you disagree with

reminder:
<center><img src='imgs/exam.png' height = 200></center>

<center><h3> Lambda Functions </h3></center>
<ul>
    <li> known as anonymous functions (they are simple enough that they don't need a name)
    <li> syntax: lambda (input): (expression to return)
    <li> within the scope of this course, lambda is used in conjunction with map and filter
</ul>

In [4]:
def add_2(x):
    return x+2
add_2(1)

3

In [3]:
func = lambda x: x+2
func(1)

3

<center><h3 style = 'color:blue'> Checkpoint </h3></center>

<center>Are the following 2 functions equivalent?</center>

In [5]:
def strip_caps(string):
    output = ''
    for char in string:
        if not char.isupper():
            output+=char
    return output

In [7]:
lambda_strip = lambda x: x if x.isupper() else ''

<center><h3 style = 'color:blue'> Checkpoint Solution</h3></center>

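<center> nope! strip_caps drops the uppercase characters, while lambda_strip looks at one character at a time and keeps only the uppercase ones. </center>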

In [8]:
example = 'AEIOUaeiou'

In [9]:
strip_caps(example)

'aeiou'

In [14]:
''.join(list(map(lambda_strip, example)))

'AEIOU'
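For comparison, here is one possible way to match strip_caps using only lambda and filter. The names keep_non_upper and strip_caps_filter are made up for this sketch and are not part of the original slides.

```python
# Hypothetical rewrite: keep a character only when it is NOT uppercase.
keep_non_upper = lambda char: not char.isupper()

def strip_caps_filter(string):
    # filter yields the characters that pass the test; join rebuilds the string
    return ''.join(filter(keep_non_upper, string))

print(strip_caps_filter('AEIOUaeiou'))  # aeiou
```

The condition is flipped relative to lambda_strip, and filter (rather than map) does the dropping.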

<center><h3>Map</h3></center>
Map -
<b>Syntax: map(function, iterable)</b>
<ul>
    <li> Map allows you to apply a function to every element of an iterable input
    <li> very common to use a lambda function as the function to apply
    <li> returns an iterator that applies the function lazily, one element at a time, as it is traversed
</ul>

In [5]:
data = [1,2,3,4,5]
list(map(lambda x:x+2, data))

[3, 4, 5, 6, 7]
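Because map returns an iterator, its values can only be read once. A small sketch of that behavior (the variable names are just for illustration):

```python
data = [1, 2, 3, 4, 5]
mapped = map(lambda x: x + 2, data)

first_pass = list(mapped)   # consumes the iterator
second_pass = list(mapped)  # nothing left to yield

print(first_pass)   # [3, 4, 5, 6, 7]
print(second_pass)  # []
```

If you need the results more than once, store them in a list first.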

<center><h3>Filter</h3></center>

Filter - 
<b>Syntax: filter(function, iterable)</b>
<ul>
    <li> Filter takes in a function that returns a boolean and only keeps elements that satisfy the function (i.e. return True).
    <li> Very common to use a lambda function as the function to apply, but keep in mind the function <b>should return a boolean</b> (filter keeps every element whose result is truthy).
    <li> Returns an iterator over the iterable object that only yields values that pass the function.
</ul>

In [1]:
data = [1,2,3,4,5]
list(filter(lambda x:x%2==0,data))

[2, 4]
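One way to sanity-check a filter call is to compare it with a list comprehension that uses the same test. This is only a sketch; is_even is a name chosen for the example.

```python
data = [1, 2, 3, 4, 5]
is_even = lambda x: x % 2 == 0

with_filter = list(filter(is_even, data))
with_comprehension = [x for x in data if is_even(x)]

print(with_filter)         # [2, 4]
print(with_comprehension)  # [2, 4]
```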

<center><h3 style = 'color:blue'> Checkpoint </h3></center>

<center>Are the following 2 statements equivalent?</center>

In [15]:
data = list(range(0,101))

In [None]:
lambda_map = lambda x: x*2 if x%2==0 else 0
sum(map(lambda_map, data))

In [None]:
lambda_filter = lambda x: x%2==0
sum(map(lambda_map, filter(lambda_filter, data)))

<center><h3 style = 'color:blue'> Checkpoint Solution</h3></center>

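<center> yep! The filter drops the odd numbers, which lambda_map would have turned into 0 anyway, so the two sums match. </center>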

In [16]:
data = list(range(0,101))

In [21]:
lambda_map = lambda x: x*2 if x%2==0 else 0
sum(map(lambda_map, data))

5100

In [22]:
lambda_filter = lambda x: x%2==0
sum(map(lambda_map, filter(lambda_filter, data)))

5100

<center><h3>Higher-Order Functions (HOF)</h3></center>

- A design pattern for writing generalized, reusable code

- Functions that either return functions or take other functions as arguments

- Helper functions!

- Many uses, including abstraction, scope protection, etc.

In [1]:
def summation_i(n):
    return (n*(n+1)) / 2

def summation_i2(n):
    return (n*(n+1)*(2*n+1)) / 6

def summation_formulas(n, form):
    if form=='i':
        return summation_i(n)
    elif form=='i**2':
        return summation_i2(n)
    else:
        raise ValueError("unsupported form")
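To illustrate the "take other functions as arguments" side of HOFs, here is a rough sketch where the term being summed is passed in as a function. The name summation and its term parameter are assumptions made for this example, not code from the course.

```python
def summation(n, term):
    """Sum term(i) for i = 1..n; `term` is any one-argument function."""
    return sum(map(term, range(1, n + 1)))

print(summation(10, lambda i: i))       # 55
print(summation(10, lambda i: i ** 2))  # 385
```

The results agree with the closed-form formulas n(n+1)/2 and n(n+1)(2n+1)/6 for n = 10.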

<center><h3>HOF (cont.)</h3></center>

In [3]:
def add_1(x):
    return x + 1
def minus_1(x):
    return x - 1

def operate(op_type):
    if op_type == 'add':
        return add_1
    else:
        return minus_1
operate('add')

<function __main__.add_1(x)>

<center><h3 style = 'color:blue'> Checkpoint </h3></center>

<center>Given the previous code, what is the result of these calls?</center>

In [31]:
def add_1(x): # code from last slide
    return x + 1
def minus_1(x):
    return x - 1

def operate(op_type):
    if op_type == 'add':
        return add_1
    else:
        return minus_1

In [34]:
temp = operate('add')

In [None]:
temp(1)

<center><h3 style = 'color:blue'> Checkpoint Solution</h3></center>

In [35]:
temp

<function __main__.add_1(x)>

In [36]:
temp(1)

2

<center><h3>*args</h3></center>

<ul>
    <li> Used when an unknown number of positional arguments will be passed into a function
    <li> Denoted by * in the function header (IMPORTANT)
    <li> collected into a tuple, so it can be looped over and indexed much like a list
</ul>

In [10]:
def summation(*nums):
    return sum(nums)
print(summation())
print(summation(1,2,3,4,5))

0
15
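A quick sketch of what actually arrives inside the function, and of using * at the call site to unpack a list. The function describe is hypothetical and exists only for this illustration.

```python
def describe(*nums):
    # nums arrives as a tuple holding every positional argument
    return type(nums).__name__, len(nums)

values = [1, 2, 3]
print(describe(values))   # ('tuple', 1): the whole list is one argument
print(describe(*values))  # ('tuple', 3): * unpacks the list into three arguments
```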


<center><h3 style = 'color:blue'> Checkpoint </h3></center>

<center>What is the result of this function call?</center>

In [None]:
def generate_names(*name_parts):
    output = []
    for name in name_parts:
        output.append(name*2)
    return output
generate_names('pika', 'niko', 'oro')

<center><h3 style = 'color:blue'> Checkpoint Solution </h3></center>

In [28]:
def generate_names(*name_parts):
    output = []
    for name in name_parts:
        output.append(name*2)
    return output
generate_names('pika', 'niko', 'oro')

['pikapika', 'nikoniko', 'orooro']

<center><h3>**kwargs</h3></center>

<ul>
    <li> Used when an unknown number of <b>keyword</b> arguments will be passed into a function
    <li> Denoted by ** in the function header (IMPORTANT)
    <li> collected into a dictionary that maps each argument name to its value
</ul>

In [15]:
marina = {'marina':1}
def create_dct(**entry):
    return dict(entry)
print(create_dct())
print(create_dct(marina=1, langlois=2))
print(marina)

{}
{'marina': 1, 'langlois': 2}
{'marina': 1}
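The ** symbol also works at the call site, where it unpacks an existing dictionary into keyword arguments. A small sketch reusing create_dct from above (result is a name introduced just for this example):

```python
def create_dct(**entry):
    return dict(entry)

marina = {'marina': 1}
result = create_dct(**marina)  # same as create_dct(marina=1)
result['langlois'] = 2         # changing the result leaves the original alone

print(result)  # {'marina': 1, 'langlois': 2}
print(marina)  # {'marina': 1}
```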


<center><h3>Default arguments</h3></center>

<ul>
    <li> Basically normal arguments, but with a default value
    <li> if no value is passed, the default value is used
    <li> if a value is passed, it replaces the default for that call
</ul>

In [23]:
def check_legal_age(age=18):
    return age>=21
print(check_legal_age())
print(check_legal_age(21))

False
True
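Defaults are handy for values that rarely change. In this sketch the cutoff itself becomes a default argument; is_old_enough and limit are hypothetical names, not part of the slides.

```python
def is_old_enough(age, limit=21):
    # limit falls back to 21 unless the caller supplies another value
    return age >= limit

print(is_old_enough(18))            # False
print(is_old_enough(18, 18))        # True
print(is_old_enough(18, limit=16))  # True
```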


<center><h3 style = 'color:blue'> Checkpoint </h3></center>

<center>What is the result of this function call?</center>

In [30]:
def filter_dict(t=2, **items_in):
    return {k:v for k,v in items_in.items() if len(v)>t}
filter_dict(0, temp=[1,2], test=[3,4,5], a=[])

<center><h3 style = 'color:blue'> Checkpoint Solution </h3></center>

In [18]:
def filter_dict(t=2, **items_in):
    return {k:v for k,v in items_in.items() if len(v)>t}
filter_dict(0, temp=[1,2], test=[3,4,5], a=[])

{'temp': [1, 2], 'test': [3, 4, 5]}

<center><h3>note</h3></center>

<center>complex argument ordering gets really messy</center>

In [40]:
def func(norm, *args, darg=2, **kwargs):
    return [norm, list(args), darg, dict(kwargs)]
func(42,1,1,1,1,1,1,3,darg=4, test=1)

[42, [1, 1, 1, 1, 1, 1, 3], 4, {'test': 1}]

In [2]:
def func(norm, darg=2, *args, **kwargs):
    return [norm, list(args), darg, dict(kwargs)]
func(42,4,1,1,1,1,1,1,3,test=1)

[42, [1, 1, 1, 1, 1, 1, 3], 4, {'test': 1}]
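To see why the order matters, here are the same two signatures called with identical arguments. The functions are renamed star_first and default_first purely for this sketch.

```python
def star_first(norm, *args, darg=2, **kwargs):
    # darg comes after *args, so it can only be set by keyword
    return [norm, list(args), darg, dict(kwargs)]

def default_first(norm, darg=2, *args, **kwargs):
    # darg comes before *args, so the second positional value fills it
    return [norm, list(args), darg, dict(kwargs)]

print(star_first(42, 4, 1))     # [42, [4, 1], 2, {}]
print(default_first(42, 4, 1))  # [42, [1], 4, {}]
```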

<center><h2>practice questions</h2></center>
<br>
<center>Time to do some practice questions! Take about 10-15 minutes to work on the questions. Feel free to flag me down if you need help/clarification.</center>
<br>
<center> If you finish early, head over to gradescope and complete the discussion attendance assignment </center>

In [32]:
def query_data(database, source, quality):
    """
    Write a function that takes in a list of dictionaries and returns a list 
    of item names from source that are at least of quality level.
    Requirement: map/filter/lambda only
    args:
        database(list): list of dictionary data entries
        source (str): string for source of items to be pulled
        quality(int): numerical representation of quality
    returns:
        a list of item names from source that are at least of quality level
    
    >>> data = [
    ... {'name':'a', 'quality':4, 'source':'dsc'}, 
    ... {'name':'b','quality':10, 'source':'lign'}, 
    ... {'name':'c','quality':2, 'source':'dsc'}, 
    ... {'name':'d','quality':5, 'source':'dsc'}
    ... ]
    >>> query_data(data, 'dsc', 4)
    ['a', 'd']
    """
    # Write your implementation here
    return

def summary_stat(operation):
    """
    Write a function that takes in a specified operation and returns a function that 
    will take in a set of numbers and calculate the operation accordingly.
    
    possible operations:
        min -> finds the minimum value
        max -> finds the maximum value
        range -> finds the range of the values
        median -> finds the median of the values

    >>> med = summary_stat('median')
    >>> med([1,2,3,4,5,6])
    3.5
    >>> ran = summary_stat('range')
    >>> ran([1,2,3,4,5,6])
    5
    """
    # Write your implementation here
    return


# args, default
def count_len_lsts(*lists, counter=4):
    """
    Write a function that takes in an unknown
    number of lists and returns the sum of the
    length of the first 'counter' lists, default
    value of 4.
    
    Args:
        lists(args): unknown number of lists
        counter(int): number of lists length to count
    Returns:
        sum of the lengths of the first counter lists
    
    >>> count_len_lsts([],[1],[1],[1])
    3
    >>> count_len_lsts([],[],[1,2,3],[4,5], counter=2)
    0
    """
    # Write your implementation here
    return

# kwargs
def foo(**kwargs):
    output = []
    for k,v in kwargs:
        output.append((k,v))
    return output
foo(temp=1,fizz=2,buzz=3)

<center><h2>practice question solutions</h2></center>

<center>Write 2 functions, one to calculate the median and another to calculate the spread of a set of numbers.</center>

In [25]:
def median(vals):
    length = len(vals)
    data = sorted(vals)
    if length%2==0:
        return (data[(length-1)//2] + data[length//2])/2
    return data[length//2]

def spread(vals):
    return max(vals) - min(vals)

In [29]:
def summary_stat(operation):
    """
    Write a function that takes in a specified operation 
    and returns a function that will take in a set of 
    numbers and calculate the operation accordingly.
    
    possible operations:
        min -> finds the minimum value
        max -> finds the maximum value
        range -> finds the range of the values
        median -> finds the median of the values

    >>> med = summary_stat('median')
    >>> med([1,2,3,4,5,6])
    3.5
    >>> ran = summary_stat('range')
    >>> ran([1,2,3,4,5,6])
    5
    """
    if operation=='min':
        return min
    if operation=='max':
        return max
    if operation=='median':
        return median
    if operation=='range':
        return spread
    raise ValueError("unsupported operation")
print(summary_stat('median')([1,2,3,4,5,6]))
print(summary_stat('range')([1,2,3,4,5,6]))

3.5
5


<center>Write a function that takes in an <b>unknown</b> number of lists 
and returns the sum of the length of the first 'counter' 
    lists, default value of 4.</center>

In [10]:
def count_len_lsts(*lists, counter=4):
    """
    Write a function that takes in an unknown
    number of lists and returns the sum of the
    length of the first 'counter' lists, default
    value of 4.
    
    Args:
        lists(args): unknown number of lists
        counter(int): number of lists length to count
    Returns:
        sum of the lengths of the first counter lists
    
    >>> count_len_lsts([],[1],[1],[1])
    3
    >>> count_len_lsts([],[],[1,2,3],[4,5], counter=2)
    0
    """
    return sum([len(x) for x in lists[:counter]])
print(count_len_lsts([],[1],[1],[1]))
print(count_len_lsts([],[],[1,2,3],[4,5], counter=2))

3
0


In [2]:
data = [{'name':'a', 'quality':4, 'source':'dsc'}, 
    {'name':'b','quality':10, 'source':'lign'}, 
    {'name':'c','quality':2, 'source':'dsc'}, 
    {'name':'d','quality':5, 'source':'dsc'}]

In [4]:
def query_data(database, source, quality):
    """
    Write a function that takes in a list of dictionaries 
    and returns a list of item names from source that 
    are at least of quality level.

    >>> data = [{'name':'a', 'quality':4, 'source':'dsc'}, 
    ... {'name':'b','quality':10, 'source':'lign'}, 
    ... {'name':'c','quality':2, 'source':'dsc'}, 
    ... {'name':'d','quality':5, 'source':'dsc'}]
    >>> query_data(data, 'dsc', 4)
    ['a', 'd']
    """
    data_check = lambda x: x['source']==source and x['quality']>=quality
    filtered = filter(data_check, database)
    data_yield = lambda x:x['name']
    return list(map(data_yield, filtered))
query_data(data, 'dsc', 4)

['a', 'd']

<center>What is the result of the following function and function call?</center>

In [20]:
def foo(**kwargs):
    output = []
    for k,v in kwargs:
        output.append((k,v))
    return output
foo(temp=1,fizz=2,buzz=3)

ValueError: too many values to unpack (expected 2)
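Why the error? Looping over a dictionary directly yields only its keys, so Python tries to unpack the 4-character string 'temp' into the two names k and v. A possible fix, sketched under the assumption that the goal was a list of (key, value) pairs (foo_fixed is a made-up name):

```python
def foo_fixed(**kwargs):
    output = []
    # .items() yields (key, value) pairs, which unpack cleanly into k and v
    for k, v in kwargs.items():
        output.append((k, v))
    return output

print(foo_fixed(temp=1, fizz=2, buzz=3))  # [('temp', 1), ('fizz', 2), ('buzz', 3)]
```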

<center><h2>Discussion Attendance</h2></center>
<center>Take 2 minutes and head to gradescope to complete discussion attendance. The assignment is called Discussion 5 Participation.</center>

<center><h2>Machine Learning</h2></center>

<center><h3>Preface</h3></center>

<center><b style='color:blue'>Recall from discussion 1:</b></center>

<center>"Why bother learning how to code? So that we can do cool things with the tools that people have invented!"</center>

<center>The start of those cool things is always foundational machine learning - solving problems and finding answers through code that can "learn".</center>

<center> Learning always starts with data. Without having some "ground truth" to base your learning on, whatever you learn is ultimately meaningless. For today's discussion, we'll take a look at a hallmark dataset for data science, penguins: </center>

<center> To simplify the problem, we will not be using all features. Instead, we will only consider <b>'body_mass_g', 'flipper_length_mm', and 'species'</b>.</center>

<center><img src='imgs/penguins_set.png' width=400></center>

In [31]:
data = pd.read_csv('data/penguins.csv')
data = data.dropna()[['flipper_length_mm', 'body_mass_g','species']]

In [33]:
X = data[['flipper_length_mm', 'body_mass_g']]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=69)

<center><h3>Classifiers</h3></center>

<center>Classification is one of the major tasks in machine learning. Its goal (as its name implies) is to ingest data associated with a label and be able to predict the label of future unseen data by learning some sort of pattern. Within our penguins dataset, this would involve predicting the penguin species ('species') based on its 'flipper_length_mm' and 'body_mass_g'.</center>

<center><img src='imgs/class_scatter.png' width=600></center>

In [34]:
clf = SVC()

In [35]:
clf.fit(X_train, y_train);
training_accuracy = (clf.predict(X_train) == y_train).mean()
testing_accuracy = (clf.predict(X_test) == y_test).mean()

<center>training accuracy: 0.745</center>
<center>testing accuracy: 0.6940298507462687</center>

<center><h2>Why are we talking about this?</h2></center>

<center>Foreshadowing! You may (or may not) need to apply some of this knowledge (relatively) soon! Possibly for your project</center>

<center><h3>K Nearest Neighbors (KNN) for Classification</h3></center>

<center><b style='color:blue'>idea</b>: Given an unseen pair of flipper_length_mm and body_mass_g values, how can we determine which penguin species it belongs to? </center>

<center>The KNN approach is simple: look at the k nearest points to the new value and take the most common species among them. The new point is <b>most likely</b> that same species. </center>

<center><img src='imgs/class_scatter.png' width=600></center>

<center><h3>Procedure</h3></center>
<br>

<center><b>Step 1</b>: Given a new point <i>X</i>, compute the distance between <i>X</i> and every point in the training data</center>

<center><b>Step 2</b>: Take the <b style = 'color:blue'>k-nearest</b> points to <i>X</i> into consideration</center>

<center><b>Step 3</b>: Classify <i>X</i> as the most common label among the neighbors</center>

<br>

<center><b>note: </b> As the name implies, the key parts of this algorithm are <b style = 'color:blue'>k</b> and <b style = 'color:blue'>nearest</b>. </center> 
    
<center><b style = 'color:blue'>k</b> is often referred to as a <b>hyper-parameter</b>: a value that you choose yourself rather than one the model learns from the data. There are procedures for selecting a good k, but the best choice depends on the context. </center>
    
<center><b style = 'color:blue'>Nearest</b> refers to quantifying distance. One such way is to use Euclidean distance (the distance formula), but there are many other "distances" that can be used (e.g. Manhattan distance).</center>
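The three steps above can be sketched in plain Python. This is only an illustration, not the project's required implementation: knn_predict is a hypothetical name, the training points are made up (they are NOT rows from penguins.csv), and Euclidean distance is assumed.

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Made-up (flipper_length_mm, body_mass_g) points, NOT taken from penguins.csv
train_points = [(190, 3700), (185, 3500), (195, 3900),
                (197, 4100), (215, 5200), (220, 5500)]
train_labels = ['Adelie', 'Adelie', 'Chinstrap',
                'Chinstrap', 'Gentoo', 'Gentoo']

def knn_predict(new_point, points, labels, k):
    # Step 1: distance from the new point to every training point
    distances = map(lambda p: dist(new_point, p), points)
    # Step 2: keep the (distance, label) pairs of the k nearest points
    nearest = sorted(zip(distances, labels))[:k]
    # Step 3: most common label among those neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((192, 3750), train_points, train_labels, k=1))  # Adelie
print(knn_predict((192, 3750), train_points, train_labels, k=3))  # Adelie
```

Note that body mass (in grams) spans a much larger numeric range than flipper length (in mm), so it dominates the unscaled Euclidean distance here. Ties between labels are not handled specially in this sketch.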

<center> <h2 style = "color:blue">Checkpoint</h2> </center>
<center>Given the new data point (201, 4750), what would a KNN classifier classify the new point as for <b>k=1</b>? What about for <b>k=4</b>? What about for <b>k=10</b>?</center>
<center><img src='imgs/knn_inject_new.png' width=600></center>

<center> <h2 style = "color:blue">Checkpoint Solution</h2> </center>

<b>for k=1</b>: Seems like the orange point is closest according to an eye test, so Chinstrap

<b>for k=4</b>: The 3 closest points each seem to be a different color, and the fourth is hard to determine by eye, so the answer could easily be Adelie or Chinstrap

<b>for k=10</b>: The bulk of the points nearby are blue, so Adelie


<center> <h1>Thanks for coming!</h1></center>