In [1]:
#Necessary so that each cell can produce multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## 3.2.2: Finding the Distance Between Two Points
***
- Learn how to use Python to determine the Euclidean __distance between two points__, each represented as a NumPy array
<p align="center">
  <img width="474" height="274" src="euclid_dist_drawing.1.jpg">
</p>

Here we have two points and want to find the distance between them (marked with a '?' on the diagram).

We can draw lines horizontally and vertically to form a right triangle whose side lengths we know:
<p align="center">
  <img width="474" height="274" src="euclid_dist_drawing.2.jpg">
</p>

The side lengths are $|x_2-x_1|$ and $|y_2-y_1|$.
If we call the length of the longest side (the hypotenuse) $d$, then the Pythagorean theorem gives $d^2 = (x_2-x_1)^2+(y_2-y_1)^2$, so the distance is:
$d = \sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$

Now let's implement these ideas in code. Since we're dealing with vectors, we'll use NumPy for this
 - From a mathematical perspective, it will be easier to use column vectors, but row vectors are easier to use in NumPy
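
To make the row-versus-column distinction concrete, here is a small sketch. The names `row` and `col` are ours and do not appear in the lesson. A one-dimensional NumPy array has shape `(2,)`, while an explicit column vector needs a second axis:

```python
import numpy as np

row = np.array([1, 1])      # 1-D array, the form used in this section
col = row.reshape(-1, 1)    # explicit column vector with 2 rows and 1 column

print(row.shape)  # (2,)
print(col.shape)  # (2, 1)
```

The one-dimensional form is shorter to write and works directly with elementwise arithmetic, which is presumably why it is preferred here.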



In [2]:
import numpy as np
p1 = np.array([1,1])
p2 = np.array([4,4])
p2-p1

array([3, 3])

Next, we square each element of this difference using the np.power() function


In [3]:
np.power(p2-p1,2)

array([9, 9])
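
As a side note, the `**` operator is applied elementwise to NumPy arrays, so it should give the same result as `np.power()` here. A minimal check, where `squared_a` and `squared_b` are illustrative names:

```python
import numpy as np

p1 = np.array([1, 1])
p2 = np.array([4, 4])

squared_a = np.power(p2 - p1, 2)   # function form used in this section
squared_b = (p2 - p1) ** 2         # operator form

print(np.array_equal(squared_a, squared_b))  # True
```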

We now have the squared differences between the x and y coordinates of these two points.
Next we sum these two terms and take the square root of the result

In [4]:
np.sqrt(np.sum(np.power(p2-p1,2)))

4.242640687119285

Now let's turn this code into a function

In [5]:
import numpy as np
def distance(p1, p2):
    '''Find the distance between points p1 and p2'''
    return np.sqrt(np.sum(np.power(p2-p1,2)))

p1 = np.array([1,1])
p2 = np.array([4,4])

distance(p1,p2)
#gives 4.242640687119285

4.242640687119285

This gives us the same answer as before, so the function works!
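
One way to cross-check the function is to compare it with `np.linalg.norm()`, which by default computes the Euclidean length of a vector. The sketch below repeats the definition of `distance` so that it runs on its own:

```python
import numpy as np

def distance(p1, p2):
    '''Find the distance between points p1 and p2'''
    return np.sqrt(np.sum(np.power(p2 - p1, 2)))

p1 = np.array([1, 1])
p2 = np.array([4, 4])

# The norm of the difference vector is the Euclidean distance
print(np.isclose(distance(p1, p2), np.linalg.norm(p2 - p1)))  # True
```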

## 3.2.3: Finding the Majority Vote
***
- learn how to find the __most common vote__ in an array or a sequence of votes
- __Compare__ two different methods for finding the most common vote

NOTE: While this method is commonly called "majority vote," what is actually determined is the plurality vote, because the most common vote does not need to represent a majority of votes. We have used the standard naming convention of majority vote here.


- For building our kNN classifier, we need to be able to calculate a majority vote
    - Given an array/sequence of votes (ex. 1,2,3), we need to determine how many times each occurs, then find the most common vote
    - Ex: 1,1,2,2,2,3: majority vote is 2
    - We need to count the # of times each vote occurs, but will not be returning the count (we'll return the observation corresponding to the highest count)

Let's build a function called majority_vote (which will be similar to the count_words function in the previous case study)

Here's the count_words function that we previously defined:

In [6]:
def count_words(text):
    '''
    Counts the number of times each word occurs in the text (str). Returns dictionary where keys are
    unique words and values are word counts. Skip punctuation. 
    '''
    
    text = text.lower()
    skips = [".", ",", ";", ":", "'", '"', "!", "?"]
    for ch in skips:
        text = text.replace(ch, "")

    word_counts= {}
    for word in text.split(' '):
        if word in word_counts:
            word_counts[word] += 1
        else: 
            word_counts[word] = 1
    return word_counts

Let's define our new function by editing the old one:

In [7]:
def majority_vote(votes):
    '''
    Count how many times each vote occurs in votes (for now, return the counts dictionary)
    '''
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else: 
            vote_counts[vote] = 1
    return vote_counts

Now let's try it out with some data

In [9]:
votes = [1,2,3,1,2,3,1,2,3,3,3,3]
vote_counts = majority_vote(votes)
vote_counts

{1: 3, 2: 3, 3: 6}

In this dictionary, the keys are the different votes that occurred, and the values are the number of occurrences of each vote.

Looking at this dictionary, we can tell that 3 occurred the most times, but how can we get Python to do this?
- Given a dictionary where the values are counts, how can we get Python to return the key that corresponds to the largest value?

Let's start by finding the maximum count.


In [17]:
#the following code returns the maximum key
max(vote_counts) 
# or 
max(vote_counts.keys())

#We want the maximum VALUE
max_counts = max(vote_counts.values())
max_counts

3

3

6

Let's add this code to our editor for future use

In [18]:
def majority_vote(votes):
    '''
    Count how many times each vote occurs in votes (for now, return the counts dictionary)
    '''
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else: 
            vote_counts[vote] = 1
    return vote_counts

votes = [1,2,3,1,2,3,1,2,3,3,3,3]
vote_counts = majority_vote(votes)

max_counts = max(vote_counts.values())

Now we can loop over all entries in the dictionary and find which entry or entries correspond to the maximum count, using the items method of dictionaries

In [19]:
for vote, count in vote_counts.items():
    print(vote, count)

1 3
2 3
3 6


The above code loops over the dictionary and, for each entry, prints the key and the value associated with it. This is a good starting point for our code.

What we want to know now is if the current entry we're looking at has the maximum number of votes

In [23]:
#Use an empty list 'winners' for keeping track of the keys that correspond to the highest values
winners = []
max_count = max(vote_counts.values())

for vote, count in vote_counts.items():
    if count == max_count:
        winners.append(vote)
winners

[3]
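
For comparison, Python's built-in `max()` can return the key with the largest value directly when given a `key` function. This is a sketch of an alternative, not the approach used in this section, and the names `best`, `tied_counts` and `first_of_tie` are illustrative:

```python
vote_counts = {1: 3, 2: 3, 3: 6}

# max() compares the keys by their counts and returns a single key
best = max(vote_counts, key=vote_counts.get)
print(best)  # 3

# With a tie, only the first maximal key in iteration order is returned
tied_counts = {1: 3, 2: 6, 3: 6}
first_of_tie = max(tied_counts, key=tied_counts.get)
print(first_of_tie)  # 2
```

Because `max()` returns only one key, it hides ties. Collecting every key with the maximum count in a list keeps all tied winners available.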

If we have a tie, we can just pick one of the winners at random, since there's no difference. Let's modify the code accordingly, and also turn it into a function:


In [26]:
import random

def majority_vote(votes):
    '''
    Return the most common element in votes
    '''
    vote_counts = {}
    for vote in votes:
        if vote in vote_counts:
            vote_counts[vote] += 1
        else: 
            vote_counts[vote] = 1

    winners = []
    max_count = max(vote_counts.values())
    for vote, count in vote_counts.items():
        if count == max_count:
            winners.append(vote)
    
    return random.choice(winners)
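
The counting loop can also be delegated to `collections.Counter` from the standard library. The following is a sketch of that variant under the same tie-breaking rule. The function name `majority_vote_counter` is hypothetical and is not part of the lesson:

```python
import random
from collections import Counter

def majority_vote_counter(votes):
    '''Return the most common element in votes, breaking ties at random'''
    vote_counts = Counter(votes)   # maps each vote to its number of occurrences
    max_count = max(vote_counts.values())
    # Keep every vote whose count equals the maximum
    winners = [vote for vote, count in vote_counts.items() if count == max_count]
    return random.choice(winners)

print(majority_vote_counter([1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3]))  # 3
```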



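Let's try our function on the previous example dataset, extended with three extra votes for 2 so that 2 and 3 are tied at six votes each. Because ties are broken at random, repeated calls can return different winners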

In [39]:
votes = [1,2,3,1,2,3,1,2,3,3,3,3,2,2,2]
winner = majority_vote(votes)
winner
winner = majority_vote(votes)
winner
winner = majority_vote(votes)
winner

2

3

2
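
To see the random tie-breaking more systematically, the function can be called many times and the results tallied. This is a sketch: the seed value and the number of trials are arbitrary choices, and `results` is an illustrative name.

```python
import random

def majority_vote(votes):
    '''Return the most common element in votes'''
    vote_counts = {}
    for vote in votes:
        vote_counts[vote] = vote_counts.get(vote, 0) + 1
    max_count = max(vote_counts.values())
    winners = [vote for vote, count in vote_counts.items() if count == max_count]
    return random.choice(winners)

votes = [1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3, 2, 2, 2]

random.seed(0)  # arbitrary seed, used only to make the run repeatable
results = [majority_vote(votes) for _ in range(1000)]

# Both tied values should appear; 1 should never win
print(sorted(set(results)))
print(results.count(2), results.count(3))
```

Over many trials each tied value should win roughly half of the time.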

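The most commonly occurring element in a sequence is called the __mode__.
Finding the mode is a common statistical operation, so we could use a shorter method based on scipy.stats.mstats.mode(). Note that this returns a single mode even when there is a tie, so it does not break ties at random as our function does.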



In [None]:
import scipy.stats as ss
def majority_vote_short(votes):
    '''
    Return the most common element in votes
    '''
    mode, count = ss.mstats.mode(votes)
    return mode
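
A similar short version can be written with NumPy alone. The sketch below uses `np.unique` with `return_counts=True` instead of SciPy. The function name `majority_vote_unique` is hypothetical. Ties are resolved in favor of the smallest value, because `np.unique` returns sorted values and `np.argmax` returns the first maximum:

```python
import numpy as np

def majority_vote_unique(votes):
    '''Return the most common element in votes (smallest value on ties)'''
    # values is sorted; counts[i] is the number of occurrences of values[i]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

print(majority_vote_unique([1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3]))           # 3
print(majority_vote_unique([1, 2, 3, 1, 2, 3, 1, 2, 3, 3, 3, 3, 2, 2, 2]))  # 2
```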