**note**: for anyone trying to view this as a slideshow, run 'pip install rise', reload your notebook, then click on the button that looks like a bar graph to the right of the command palette at the top.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style = 'darkgrid')

from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

In [2]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

<center><h1>Discussion 5</h1></center>
<center><h2>DSC 20, Spring 2023</h2></center>

<center><h3>map/filter/lambda, Machine Learning</h3></center>
<center>Since we just had the first midterm, today's content is an interlude from the usual material</center>

<center><h3>Midterm 1 Grades</h3></center>
<center>Don't ask us about this! The grades will be out when they're out!</center>

<center><h3>Agenda</h3></center>
<ul>
    <li> Map/Filter/Lambda</li>
    <li> Machine Learning Preface </li>
    <li> Vocabulary </li>
    <li> Classifiers vs Regressors </li>
    <li> KNN </li>
</ul>


<center><h3> Lambda Functions </h3></center>
<ul>
    <li> known as anonymous functions (functions simple enough that they don't need a name)
    <li> syntax: lambda (input): (some expression)
    <li> ex: lambda x: x + 2 (returns its input plus 2)
    <li> within the scope of this course, lambda is used in conjunction with map and filter
</ul>

<b>note: </b>lambda functions can't include statements (ex. return, assert)
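
As a minimal sketch of the points above, here is a named function next to an equivalent lambda. The names add_two, add_two_lambda, and parity are illustrative only and do not come from the course materials.

```python
# A named function and an equivalent lambda (all names are illustrative).
def add_two(x):
    return x + 2

add_two_lambda = lambda x: x + 2

# Both behave identically when called.
print(add_two(5))         # 7
print(add_two_lambda(5))  # 7

# A lambda body is a single expression, so a conditional *expression*
# is allowed even though an if *statement* is not.
parity = lambda n: 'even' if n % 2 == 0 else 'odd'
print(parity(3))          # odd
```

Assigning a lambda to a name is done here only for demonstration; in practice a lambda is usually passed directly as an argument.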

<center><h3>Map</h3></center>

<b>Syntax: map(function, iterable)</b>
<ul>
    <li> Map allows you to apply a function to every element of an iterable input
    <li> very common to use a lambda function as the function to apply
    <li> returns an iterator through the iterable object, applying the function as it traverses
</ul>


In [3]:
test = ['dsc20', 'midterm', 'one']
sample_map = map(lambda x: x[:3], test)
list(sample_map)

['dsc', 'mid', 'one']

<center><h3>Filter </h3></center>

<b>Syntax: filter(function, iterable)</b>
<ul>
    <li> Filter takes in a function that returns a boolean and only keeps elements that satisfy the function (i.e. return True).
    <li> Very common to use a lambda function as the function to apply, but keep in mind the function <b>should return a boolean</b> (filter keeps any element for which the result is truthy).
    <li> Returns an iterator through the iterable object that only yields values that pass the function.
</ul>

<b>note: </b>filter and map are not interchangeable. map always yields exactly one output per input, so it <b>cannot</b> drop elements on its own; use filter whenever elements need to be removed.
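
To make the difference between the two concrete, here is a small sketch that applies the same predicate with both map and filter. The names words and has_vowel are illustrative, not part of the course materials.

```python
words = ['dsc20', 'midterm', 'one']

# Predicate: does the word contain at least one lowercase vowel?
has_vowel = lambda w: any(c in 'aeiou' for c in w)

# map keeps one output per input: here, the predicate's result for each word.
mapped = list(map(has_vowel, words))

# filter keeps the original elements for which the predicate is True.
filtered = list(filter(has_vowel, words))

print(mapped)    # [False, True, True]
print(filtered)  # ['midterm', 'one']
```

Note that the map result has the same length as the input, while the filter result can be shorter.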

In [4]:
test = ['dsc20', 'midterm', 'one']
sample_filter = filter(lambda x: any([c in 'aeiou' for c in x]), test)
list(sample_filter)

['midterm', 'one']

<center> <h2 style = "color:blue">Checkpoint</h2> </center>
<center> Given a list of integers, use only map/filter/lambda to create a new list containing only even numbers from the original list, then multiply the remaining elements by 3.
 </center>

In [6]:
def even_by_3(int_list):
    """
    Function that utilizes only map/filter/lambda to 
    remove odd numbers from a given list and multiply 
    the remaining elements by 3.
    
    Args:
        int_list (list): list of ints to be considered
    Returns:
        a list of even numbers from the original list,
        multiplied by 3.
    
    >>> lst = [1,2,3,4,5,6]
    >>> even_by_3(lst)
    [6, 12, 18]
    """
    # Write your implementation here
    return 

<center> <h2 style = "color:blue">Checkpoint Solution</h2> </center>

In [7]:
def even_by_3(int_list):
    """
    Function that utilizes only map/filter/lambda to 
    remove odd numbers from a given list and multiply 
    the remaining elements by 3.
    
    Args:
        int_list (list): list of ints to be considered
    Returns:
        a list of even numbers from the original list,
        multiplied by 3.
    
    >>> lst = [1,2,3,4,5,6]
    >>> even_by_3(lst)
    [6, 12, 18]
    """
    only_evens = filter(lambda x: x%2==0, int_list)
    mult_by_3 = map(lambda x: x*3, only_evens)
    return list(mult_by_3)

In [8]:
even_by_3([1,2,3,4,5,6]) 

[6, 12, 18]

<center><h3>What happens if I don't cast the output as a list?</h3></center>

In [9]:
def even_by_3(int_list):
    only_evens = filter(lambda x: x%2==0, int_list)
    mult_by_3 = map(lambda x: x*3, only_evens)
    return mult_by_3

In [10]:
even_by_3([1,2,3,4,5,6]) 

<map at 0x7f93fa54b550>

<center>map and filter return iterators, which are lazy: no function calls happen until something consumes their values. Displaying the map object only shows its type and memory address, so to get the desired output we had to convert it into a list, which runs through every element.</center>
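
The sketch below illustrates this behavior under a simple setup: a helper (triple, a hypothetical name) records every call it receives, so we can see when the work actually happens. It also shows that an iterator can only be consumed once.

```python
calls = []  # records each input the helper is actually called with

def triple(x):
    calls.append(x)
    return x * 3

lazy = map(triple, [2, 4, 6])
calls_before = len(calls)  # 0: creating the map object runs nothing

first = next(lazy)         # runs triple once, on the first element
rest = list(lazy)          # consumes the remaining elements
again = list(lazy)         # the iterator is now exhausted

print(calls_before)  # 0
print(first)         # 6
print(rest)          # [12, 18]
print(again)         # []
```

Because of the last line, a map or filter result that is needed more than once is usually stored as a list first.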

<center><h2>Machine Learning</h2></center>

<center><h3>Preface</h3></center>

<center><b style='color:blue'>Recall from discussion 1:</b></center>

<center>"Why bother learning how to code? So that we can do cool things with the tools that people have invented!"</center>

<center>The start of those cool things is always foundational machine learning - solving problems and finding answers through code that can "learn".</center>

<center> Learning always starts with data. Without having some "ground truth" to base your learning on, whatever you learn is ultimately meaningless. For today's discussion, we'll take a look at a hallmark dataset for data science, penguins: </center>

<center> To simplify the problem, we will not be using all features. Instead, we will only consider <b>'body_mass_g', 'flipper_length_mm', and 'species'</b>.</center>

<center><img src='imgs/penguins_set.png' width=400></center>

In [11]:
data = pd.read_csv('data/penguins.csv')
data = data.dropna()[['flipper_length_mm', 'body_mass_g','species']]

In [25]:
X = data[['flipper_length_mm', 'body_mass_g']]
y = data['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=69)

<center><h3> ML Vocabulary </h3></center>

<ul>
    <li> <b style = 'color:blue'>fit</b> - train the model on (training) data so it can learn a pattern
    <li> <b style = 'color:blue'>predict</b> - use the fitted model to predict the outcome for new data (based on what it learned from the training data)
    <li> <b style = 'color:blue'>error/accuracy</b> - a metric that measures how well the model's predictions match the true values
</ul>

<br></br>

<center>The goal of machine learning is to model some real-life phenomenon so that a reasonable prediction can be derived from past trends. A very popular example within HDSI is predicting Data Science salaries based on education and years of experience. In this example, we would <b style = 'color:blue'>fit</b> our model with past data about salaries, education, and experience, measure its <b style = 'color:blue'>error/accuracy</b> by running it on held-out past data that the model has not seen, and then, once we achieve an acceptable accuracy, begin using the model to <b style='color:blue'>predict</b> future salaries!</center>
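
To tie the three vocabulary words together, here is a minimal sketch that uses a deliberately naive, hypothetical model. MajorityClassifier and accuracy are made-up names (not scikit-learn classes), and the tiny data set is invented purely for illustration.

```python
from collections import Counter

class MajorityClassifier:
    """Hypothetical toy model: always predicts the most common training label."""

    def fit(self, X, y):
        # "fit": learn something from the training data (here, one label).
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # "predict": produce one label per new row.
        return [self.label_ for _ in X]

def accuracy(predicted, actual):
    # "accuracy": fraction of predictions that match the true labels.
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Invented flipper lengths (mm) and species labels.
X_train = [[181], [186], [195], [217]]
y_train = ['Adelie', 'Adelie', 'Adelie', 'Gentoo']
X_test = [[190], [220]]
y_test = ['Adelie', 'Gentoo']

model = MajorityClassifier().fit(X_train, y_train)
test_accuracy = accuracy(model.predict(X_test), y_test)
print(test_accuracy)  # 0.5
```

The model ignores the features entirely, which is why it only gets half of the held-out points right; real models follow the same fit, predict, measure workflow with a more useful learning step.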

<center><h3>Classifiers</h3></center>

<center>Classification is one of the major tasks in machine learning. Its goal (as its name implies) is to ingest data associated with a label and be able to predict the label of future unseen data by learning some sort of pattern. Within our penguins dataset, this would involve predicting the penguin species ('species') based on its 'flipper_length_mm' and 'body_mass_g'.</center>

<center><img src='imgs/class_scatter.png' width=600></center>

In [26]:
sns.scatterplot(data = data, x = 'flipper_length_mm', y='body_mass_g', hue = 'species')
plt.title('flipper_length vs body_mass by species')
plt.savefig('imgs/class_scatter.png')
plt.close()

In [27]:
clf = SVC()

In [33]:
clf.fit(X_train, y_train);
training_accuracy = (clf.predict(X_train) == y_train).mean()
testing_accuracy = (clf.predict(X_test) == y_test).mean()

<center>training accuracy: 0.745</center>
<center>testing accuracy: 0.6940298507462687</center>

<center><h3>Regressors</h3></center>
    
<center>Regression is the other major task in machine learning. Its goal is to ingest data with a continuous target and predict a numeric value for unseen data by learning some sort of pattern. Within our penguins dataset, this would be like predicting 'body_mass_g' based on 'flipper_length_mm'.</center>
    
<center><img src='imgs/class_less_scatter.png' width=600></center>

In [34]:
sns.scatterplot(data = data, x = 'flipper_length_mm', y='body_mass_g')
plt.title('flipper_length vs body_mass')
plt.savefig('imgs/class_less_scatter.png')
plt.close()

In [38]:
X = data[['flipper_length_mm']]
y = data['body_mass_g']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=69)
clf = LinearRegression()

In [39]:
clf.fit(X_train, y_train);
MSE(y_train, clf.predict(X_train))  # sklearn's argument order is (y_true, y_pred)
MSE(y_test, clf.predict(X_test))

<center>training MSE: 148670.31768681708</center>
<center>testing MSE: 162810.850300213</center>

In [40]:
print('training MSE: ' + str(MSE(y_train, clf.predict(X_train))))
print('testing MSE: ' + str(MSE(y_test, clf.predict(X_test))))

training MSE: 148670.31768681708
testing MSE: 162810.850300213
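
For reference, mean squared error is the average of the squared differences between the true and predicted values. The sketch below computes it by hand on three invented body masses; the function name and the numbers are illustrative and are not taken from the penguins data.

```python
def mean_squared_error(actual, predicted):
    # Average of the squared differences between truth and prediction.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)

actual = [3750, 3800, 3250]     # invented true body masses (g)
predicted = [3700, 3900, 3250]  # invented model predictions (g)

mse = mean_squared_error(actual, predicted)  # (2500 + 10000 + 0) / 3
rmse = mse ** 0.5                            # back in grams

print(mse)
print(rmse)
```

Since MSE is measured in squared units (grams squared here), taking the square root gives an error in grams; for the training MSE above, that works out to roughly 386 g.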


<center><h3>Result</h3></center>

<center><img src='imgs/reg_scatter.png' width=700></center>

In [22]:
sns.regplot(data = data, x = 'flipper_length_mm', y='body_mass_g',line_kws = {"color": "red"})
plt.title('flipper_length vs body_mass with regression line')
plt.savefig('imgs/reg_scatter.png')
plt.close()

<center><h2>Why are we talking about this?</h2></center>

<center>You'll have to implement a very basic machine learning algorithm later in the quarter for your project. The model itself is very simple, but it's generally effective and in some cases it's actually the optimal solution.</center>

<center><h3>K Nearest Neighbors (KNN) for Classification</h3></center>

<center><b style='color:blue'>idea</b>: Given an unseen value of flipper_length and body_mass, how can we determine what species of penguins it belongs to? </center>

<center>The KNN approach is simple - look at the k nearest points to the new value and find the most common species among them. The new point is <b>most likely</b> that same species. </center>

<center><img src='imgs/class_scatter.png' width=600></center>

In [23]:
sns.scatterplot(data = data, x = 'flipper_length_mm', y='body_mass_g', hue = 'species')
plt.title('flipper_length vs body_mass by species')
plt.close()

<center><h3>Procedure</h3></center>
<br>

<center><b>Step 1</b>: Given a new point <i>X</i>, compute the distance between <i>X</i> and every point in the training data</center>

<center><b>Step 2</b>: Take the <b style = 'color:blue'>k-nearest</b> points to <i>X</i> into consideration</center>

<center><b>Step 3</b>: Classify <i>X</i> as the most common label among the neighbors</center>
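
The three steps above can be sketched in plain Python as follows. This is one possible illustration, not the implementation expected for the project: knn_classify and the toy points are hypothetical, Euclidean distance is assumed, and ties are left to whichever label Counter happens to report first.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, new_point, k):
    # Step 1: distance from the new point to every training point.
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep only the k nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: return the most common label among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two invented clusters of 2D points.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = ['A', 'A', 'A', 'B', 'B', 'B']

print(knn_classify(points, labels, (1, 1), k=3))  # A
print(knn_classify(points, labels, (4, 4), k=3))  # B
```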

<br>

<center><b>note: </b> As the name implies, the key parts of this algorithm are <b style = 'color:blue'>k</b> and <b style = 'color:blue'>nearest</b>. </center> 
    
<center><b style = 'color:blue'>k</b> is often referred to as a <b>hyper-parameter</b>, or a value that you choose independently. There are often procedures to select a good k, but this varies depending on the context. </center>
    
<center><b style = 'color:blue'>Nearest</b> refers to quantifying distance - one such way is to use Euclidean distance (the distance formula), but there are many other "distances" that can be used (ex. Manhattan distance).</center>
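
As a small illustration of how the choice of distance changes the numbers, the sketch below computes both distances for the same pair of points. The function names euclidean and manhattan are illustrative.

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Grid distance: sum of the absolute differences along each axis.
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

Because the two measures can rank neighbors differently, the same k can lead to different classifications depending on which distance is used.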

<center> <h2 style = "color:blue">Checkpoint</h2> </center>
<center>Given the new data point (201, 4550), what would a KNN classifier classify the new point as for <b>k=1</b>? What about for <b>k=4</b>? What about for <b>k=10</b>?</center>
<center><img src='imgs/knn_inject_new.png' width=600></center>

In [24]:
sns.scatterplot(data = data, x = 'flipper_length_mm', y='body_mass_g', hue = 'species')
plt.title('flipper_length vs body_mass by species')
plt.plot(201, 4550, marker="o", markersize=8,markerfacecolor="red")
plt.savefig('imgs/knn_inject_new.png')
plt.close()

<center> <h2 style = "color:blue">Checkpoint Solution</h2> </center>

<b>for k=1</b>: Seems like the orange point is closest according to an eye test, so Chinstrap

<b>for k=4</b>: The 3 closest points each seem to be a different color, and the fourth point is hard to determine by eye. Could easily be Adelie or Chinstrap

<b>for k=10</b>: The bulk of the points nearby are blue, so Adelie


<center> <h1>Thanks for coming!</h1></center>
<center> There's a discussion quiz on canvas! </center>