Reading: Chapter 4. Chapter 16 pages 290 - 292.

Lecture Topics:

1. Statistics
2. Correlation
3. K-means Clustering Algorithm

### I. Statistics Definitions



*   A *population* is a complete collection of all subjects we wish to study.
*   A *sample* is a subset of the population.
*   A *variable* is a characteristic of each population member that we wish to study. In data science and machine learning a variable is often called a *feature*.
*   *Data* are measurements of a variable or variables.



a) The *mean* of a vector $x$ is defined as $\bar{x} = \frac{\Sigma_{i = 1}^nx_i}{n}$. It is a measure of center for the data vector $x$.

b) The *standard deviation* of a vector $x$ is defined as $s_x = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n}}$. It is a measure of spread for the data vector $x$.  

c) A variable that is rescaled via the equation $z_i = (x_i - \bar{x})/s_x$ is  *standardized*. $z_i$ gives the number of standard deviations that $x_i$ is above (or below) the mean.
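
As a minimal sketch, the three definitions above translate directly into NumPy. The vector `v` is an arbitrary illustration (an assumption, not data from the reading). Note that `np.std` divides by $n$ by default (`ddof=0`), which matches the definition in (b).

```python
import numpy as np

# Illustrative data vector (an assumption for this sketch)
v = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = v.shape[0]

# (a) mean: sum of the entries divided by n
v_bar = np.sum(v) / n

# (b) standard deviation: square root of the average squared deviation (divides by n)
s_v = np.sqrt(np.sum((v - v_bar) ** 2) / n)

# (c) standardized values: number of standard deviations each entry is from the mean
z = (v - v_bar) / s_v

print(v_bar, s_v)
print(z)
```

After standardizing, `z` has mean 0 and standard deviation 1, which is a quick way to check the computation.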







**Example**: Compute the sample mean and sample standard deviation for ```x = np.array([-1, 2, 9, 20, 2, -3, -4, 9, 10, 0])```. Find the number of standard deviations ```x[0]``` is from the mean.

In [None]:
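import numpy as np

x = np.array([-1, 2, 9, 20, 2, -3, -4, 9, 10, 0])

# Mean from the definition: sum of the entries divided by n
n = x.shape[0]  # same as len(x)
x_bar = sum(x) / n

print(n)
print(x_bar)

# Same result with NumPy's built-in mean
x_bar = np.mean(x)

print(x_bar)

# Standard deviation: np.std divides by n by default, matching the definition above
# By hand: s_x = np.sqrt(np.sum((x - x_bar)**2) / n)
s_x = np.std(x)

print(s_x)

# z-score of x[0]: number of standard deviations x[0] is from the mean
z_0 = (x[0] - x_bar) / s_x
print(z_0)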

10
4.4
4.4
7.08801805866774
-0.7618490747771856


In [None]:
import numpy as np
import math

x = np.array([3.0,4,2,3,5,3,4,1,2,1])
y = np.array([1.0,3,5,5,2,2,6,5,4,2])
print(x)
print(y)
x_bar = np.mean(x)
print(x_bar)
y_bar = np.mean(y)
# Sums of squared deviations from the mean
x_x = np.sum((x - x_bar)**2)
y_y = np.sum((y - y_bar)**2)
len_x = x.shape[0]
len_y = y.shape[0]
# Square root of the sum of the two variances (each divides by n)
s_xy = math.sqrt((x_x / len_x) + (y_y / len_y))
print(s_xy)  # roughly 2.05

[3. 4. 2. 3. 5. 3. 4. 1. 2. 1.]
[1. 3. 5. 5. 2. 2. 6. 5. 4. 2.]
2.8

**Example**: Standardize the data vector x from the previous example using the ```preprocessing``` module from the ```sklearn``` library.

In [None]:
from sklearn import preprocessing

x = np.array([[-1], [2], [9], [20], [2], [-3], [-4], [9], [10], [0]])

scaler = preprocessing.StandardScaler()  # construct the scaler object
standardized_x = scaler.fit_transform(x)  # fit (compute mean and standard deviation), then transform
# For a 1-D vector, reshape it into a column first: scaler.fit_transform(x.reshape(-1, 1))
print(standardized_x)

[[-0.76184907]
 [-0.33859959]
 [ 0.64898255]
 [ 2.20089733]
 [-0.33859959]
 [-1.0440154 ]
 [-1.18509856]
 [ 0.64898255]
 [ 0.79006571]
 [-0.62076591]]


**Exercise**: Write a Python function ```standardize(v)``` that takes an arbitrary vector ```v``` and returns a standardized vector. (See Lab 2.)

### II. Correlation

The Pearson correlation coefficient and cosine similarity both measure how closely two data vectors correspond with each other. The Pearson correlation coefficient $r$ measures the strength of the linear relationship between $x$ and $y$:

$r = \frac{\Sigma_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\Sigma_{i=1}^n(x_i-\bar{x})^2}\sqrt{\Sigma_{i=1}^n(y_i-\bar{y})^2}}$

Define the *de-meaned* (centered) vectors $\tilde{x} = x - \bar{x}$ and $\tilde{y} = y - \bar{y}$. Then

$r = \frac{\tilde{x}^T\tilde{y}}{||\tilde{x}|| ||\tilde{y}||}$.
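
The identity above can be checked numerically. The following is a small sketch on made-up vectors (an assumption for illustration): it computes $r$ as the cosine of the angle between the centered vectors and compares it with the off-diagonal entry of `np.corrcoef`.

```python
import numpy as np

# Illustrative vectors (assumptions for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# De-mean (center) each vector
x_tilde = x - np.mean(x)
y_tilde = y - np.mean(y)

# r is the cosine of the angle between the centered vectors
r = (x_tilde @ y_tilde) / (np.linalg.norm(x_tilde) * np.linalg.norm(y_tilde))

# Cross-check against NumPy's correlation matrix (off-diagonal entry)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r, r_numpy)
```

In other words, the Pearson correlation coefficient is the cosine similarity of the two vectors after centering.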

**Questions**:

1. What are the largest and smallest values that $r$ can take?

2. Is it possible to have $r = 0$? What does this mean geometrically?

**Exercise**: Find and interpret the correlation coefficient for the following vector pairs.

In [None]:
import numpy as np

x1 = np.array([1, 2, 3])
y1 = np.array([2, 4, 7]) # What do you notice about these two vectors?

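# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
# (np.correlate computes a cross-correlation, not the correlation coefficient)
col1 = np.corrcoef(x1, y1)
print(col1)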

x2 = np.array([-10, 2, 9, 20, 21, -3, -4, 9, 10, 0])
y2 = np.array([-2, -13, 8, 1, 0, 0, 12, 2, 10, 1]) # I randomly chose the numbers for these vectors

col2 = np.corrcoef(x2, y2)
print(col2)


[[1.         0.99339927]
 [0.99339927 1.        ]]
[[1.         0.08200585]
 [0.08200585 1.        ]]


### III. K-Means Clustering

*Supervised Learning*: Algorithm is trained on labeled data, then its accuracy/effectiveness is tested by having the algorithm predict labels on other data and measuring how well it does. Example: email spam filter.

*Unsupervised Learning*: Algorithm is run (trained) on unlabeled data, so there is no way to see if it did what it was "supposed to do." Example: clustering vectors that are "similar."

In supervised learning we define an objective function that uses the labeled outcomes in our dataset to check for algorithm convergence. Since K-Means is an unsupervised method, there are no labels to use and we need to come up with the objective function ourselves. This can be done in many different ways. The most common is to use Euclidean distance to measure how tightly a particular grouping is actually "grouped."

More formally, if $z_1, ..., z_k$ are the group centroid vectors and $c_i = j$, with $j \in \{1, ... , k\}$, means that vector $x_i$ is assigned to group $j$, we define the objective function as

$J^{clust} = (||x_1 - z_{c_1}||^2 + ||x_2 - z_{c_2}||^2 + ... + ||x_n - z_{c_n}||^2)/n.$

We have found an "optimal" grouping when $J^{clust}$ is minimized. For fixed centroids, we can minimize $J^{clust}$ by minimizing each individual $||x_i - z_{c_i}||$ (minimizing a distance also minimizes its square), that is, by assigning each $x_i$ to its closest centroid. Thus, the optimal grouping solution is found by

$\begin{align} \underset{j = 1, 2, ..., k}{\min} ||x_1 - z_j|| + \underset{j = 1, 2, ..., k}{\min}||x_2 - z_j|| + ... + \underset{j = 1, 2, ..., k}{\min}||x_n - z_j||.\end{align}$
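
As a rough sketch of the objective, the hypothetical helper `j_clust` below evaluates $J^{clust}$ for a given assignment. The data, centroids, and function name are assumptions made for illustration, and groups are indexed from 0 (rather than 1) to match Python indexing.

```python
import numpy as np

def j_clust(X, centroids, assignments):
    """Mean squared distance from each row of X to its assigned centroid."""
    diffs = X - centroids[assignments]  # row i holds x_i - z_{c_i}
    return np.sum(diffs ** 2) / X.shape[0]

# Illustrative data: two tight pairs of points, far apart
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

# Nearest-centroid assignment: c_i = argmin_j ||x_i - z_j||
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
nearest = np.argmin(dists, axis=1)

# A deliberately poor assignment for comparison
poor = np.array([0, 1, 0, 1])

print(nearest, j_clust(X, centroids, nearest))
print(poor, j_clust(X, centroids, poor))
```

For these fixed centroids, the nearest-centroid assignment gives a much smaller objective value than the poor assignment.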

K-Means Clustering Algorithm:

1. Choose *k*, the number of groups. Randomly initialize *k* centroids.
2. Compute the Euclidean distance between each data observation (vector) and each of the *k* centroids.
3. Assign each vector to the group with the closest centroid.
4. Update each centroid as the average of all vectors assigned to that centroid's group.
5. Repeat steps 2-4 until convergence.  
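
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans_sketch`, the demo data, and the choice to pass the initial centroids in explicitly (instead of choosing them randomly) are all made up for this example, and "convergence" is taken to mean that the assignments stop changing.

```python
import numpy as np

def kmeans_sketch(X, init_centroids, max_iter=100):
    """Plain K-Means following steps 2-5; the caller supplies step 1 (initial centroids)."""
    centroids = np.array(init_centroids, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Step 2: Euclidean distance from every vector to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 3: assign each vector to the group with the closest centroid
        new_labels = np.argmin(dists, axis=1)
        # Step 5: stop once the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the average of its group
        for j in range(centroids.shape[0]):
            members = X[labels == j]
            if len(members) > 0:  # an empty group keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Illustrative data: two well-separated clumps of three points each
X_demo = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]], dtype=float)

# Assumed initialization: use the first two data vectors as centroids
labels, centroids = kmeans_sketch(X_demo, init_centroids=X_demo[[0, 1]])

print(labels)
print(centroids)
```

Even with both initial centroids in the same clump, the updates pull one centroid toward each clump within a few iterations. A different initialization can lead to a different final grouping, which is why library implementations typically run the algorithm several times.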

**Example**: Group the following data vectors into two groups (by hand). Assume that z0 = x2 and z1 = x2 are randomly chosen to be the initial group centroids. Assume that both variables were measured on the same scale (why is this important?).

x0 = [1, 1],
x1 = [1, 0],
x2 = [0, 2],
x3 = [2, 4],
x4 = [3, 5].



In [None]:
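import numpy as np

X = np.array([[1, 1], [1,0], [0, 2], [2, 4], [3, 5]])

# Starting centroids: z_0 equals the mean of x0 and x1;
# z_1 is close to the mean of x2, x3, x4 ([1.67, 3.67])
z_0 = np.array([1,0.5])
z_1 = np.array([1.7, 3.7])

# Objective with x0, x1, x2 assigned to group 0 and x3, x4 assigned to group 1
J_1 = (np.dot(X[0]-z_0, X[0]- z_0) +  np.dot(X[1]-z_0, X[1]- z_0) +  np.dot(X[2]-z_0, X[2]- z_0) +  np.dot(X[3]-z_1, X[3]- z_1) +  np.dot(X[4]-z_1, X[4]- z_1))/5

# Update each centroid to the average of the vectors assigned to its group
z_0 = (X[0] + X[1] + X[2])/3
z_1 = (X[3] + X[4])/2

# Objective after the centroid update
J_2 = (np.dot(X[0]-z_0, X[0]- z_0) +  np.dot(X[1]-z_0, X[1]- z_0) +  np.dot(X[2]-z_0, X[2]- z_0) +  np.dot(X[3]-z_1, X[3]- z_1) +  np.dot(X[4]-z_1, X[4]- z_1))/5
print(J_2)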


0.7333333333333334


**Example**: Use the ```cluster``` module from the ```sklearn``` library in Python to group the vectors in the previous example.

In [None]:
from sklearn import cluster

# Initialize the cluster object
kmeans_ex = cluster.KMeans(n_clusters=2, n_init = 10, random_state=2)

In [None]:
# Run the algorithm
label = kmeans_ex.fit_predict(X)

In [None]:
import matplotlib.pyplot as plt

# Get unique labels for plotting purposes
u_labels = np.unique(label)

#plot
centers = np.array(kmeans_ex.cluster_centers_)
for i in u_labels:
    plt.scatter(X[label == i , 0] , X[label == i , 1] , label = i)
plt.scatter(centers[:,0], centers[:,1], marker="x", color='k')
plt.legend()
plt.show()

In [None]:
# Use results
kmeans_ex.labels_
kmeans_ex.cluster_centers_
kmeans_ex.predict([[0, 0], [12, 3]])

### IV. The Iris Dataset

The "Iris" dataset is a classic in statistics (read about it at https://en.wikipedia.org/wiki/Iris_flower_data_set). Ronald Fisher, widely considered the father of modern statistics and the developer of many techniques that are fundamental to the discipline, introduced it to demonstrate the effectiveness of his Linear Discriminant Analysis (LDA) method. LDA is a fundamental classification method: for the iris problem it takes the four numerical variables as inputs and predicts which iris species a flower belongs to. It is a supervised learning technique because it makes predictions using labeled data (the dataset includes the observed iris species for each flower). We will revisit this dataset in the second half of the class when we study classification methods.





    



In [None]:
# Step 1: Initialize the import object to give colab access to our computer

from google.colab import files

uploaded = files.upload()

Saving iris.csv to iris.csv


In [None]:
# Step 2: Import and view dataset

import pandas as pd

iris = pd.read_csv("iris.csv")
iris.head()

**Questions**:

1. Does this dataset contain measurements on a population or a sample?
2. After importing the iris dataset, we can view the ```variety``` variable. Will the K-Means method utilize this part of the dataset? (Hint: We're about to group the examples in the dataset using K-Means, which is an _____ learning method.)

**Example**: Use ```sklearn``` to group the data in the ```iris``` dataset.

In [None]:
# Step 3: Preprocess the dataset by dropping the last column (the variety labels) and converting the pandas DataFrame to a NumPy ndarray
X = iris.iloc[:, :-1].to_numpy()
print(X)

In [None]:
# Step 4: Preprocess the dataset by standardizing each variable
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
X = scaler.fit_transform(X)

In [None]:
from sklearn import cluster
import numpy as np
import matplotlib.pyplot as plt

# Step 5: Initialize K-Means cluster object
kmeans_iris = cluster.KMeans(n_clusters=3, n_init = 10, random_state=0)

# Step 6: Use kmeans_iris's "fit_predict()" method to predict the groups for each value in the dataset
label = kmeans_iris.fit_predict(X)

# Step 7: Get unique labels (group labels) using numpy's "unique" method
u_labels = np.unique(label)

# Step 8: Plot the results:
centers = np.array(kmeans_iris.cluster_centers_)
for i in u_labels:
    plt.scatter(X[label == i , 0] , X[label == i , 1], label = i)
plt.scatter(centers[:,0], centers[:,1], marker="x", color='k')
plt.legend()
plt.show()



**Questions**:

1. Why does the horizontal axis start at -2?
2. An *outlier* is a data value that falls far from the mean. There is no universally accepted definition of an outlier, but a commonly used rule of thumb is "farther than 2 standard deviations from the mean." Do there appear to be any outliers in the iris dataset?
3. Why do the groups appear to overlap? What does this indicate about the way the K-Means algorithm grouped the irises?
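
The rule of thumb in question 2 can be sketched with standardized values. The example below applies it to the data vector from Section I rather than to the iris data, so the question itself is left open. The cutoff of 2 is the rule of thumb quoted above, not a universal standard.

```python
import numpy as np

# Data vector from the Section I example
x = np.array([-1, 2, 9, 20, 2, -3, -4, 9, 10, 0])

# Standardize: number of standard deviations each value is from the mean
z = (x - np.mean(x)) / np.std(x)

# Flag values farther than 2 standard deviations from the mean
is_outlier = np.abs(z) > 2

print(x[is_outlier])
```

For a standardized 2-D array such as the output of `StandardScaler`, the same comparison `np.abs(X) > 2` flags entries column by column.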