Dimensionality Reduction & Visualization



In [1]:
import numpy as np

# import sklearn stuff
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# set up for plotting as interactive figures in the notebook
%matplotlib notebook
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

## Load some data


In [2]:
# load the iris dataset
iris = datasets.load_iris()

In [3]:
# note that the iris data is 4-dimensional
iris.data.shape

(150L, 4L)

In [4]:
# let's look at the first 10 elements
iris.data[:10]


array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])

## Plot the data
Since the iris dataset has 4 features, we'll need to plot them as pairs; we can use color to represent class label.  Here is an example of plotting the first two dimensions:

In [5]:
plt.figure() # make a new figure to plot in
# let's set up a list of colors for the different class labels:
colors = ['darkred', 'blue', 'orange']
# now we'll loop over the points in our data set, and plot them one at a time
for i in range(len(iris.data)):
    # use the first 2 dimensions as our x and y coordinates
    x = iris.data[i][0]
    y = iris.data[i][1]
    # use the target (which we know is 0, 1, or 2) to select the color for this point
    c = colors[iris.target[i]]
    # plot the point as a single point in a scatter graph
    plt.scatter(x, y, color=c)

# now let's add some axis labels; we'll use the names from the data set
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
# if we want a key, we'll need to make "handles" attaching colors to names
red = mpatches.Patch(color='darkred', label='setosa')
blue = mpatches.Patch(color='blue', label='versicolor')
orange = mpatches.Patch(color='orange', label='virginica')
# now actually show the legend
plt.legend(handles=[red, blue, orange])

# let's add a title
plt.title('Iris dataset (first two dimensions)')


Text(0.5,1,'Iris dataset (first two dimensions)')

# Plot the other combinations of axes
You will need a total of 6 plots (including the one above) to plot all possible combinations of dimensions; the remaining 5 are left to you, but you should be able to copy the example above and make minor modifications to it.

In [6]:
# todo: plot 2
plt.figure() # make a new figure to plot in
# let's set up a list of colors for the different class labels:
colors = ['darkred', 'blue', 'orange']
# now we'll loop over the points in our data set, and plot them one at a time
for i in range(len(iris.data)):
    # use the second and third dimensions as our x and y coordinates
    x = iris.data[i][1]
    y = iris.data[i][2]
    # use the target (which we know is 0, 1, or 2) to select the color for this point
    c = colors[iris.target[i]]
    # plot the point as a single point in a scatter graph
    plt.scatter(x, y, color=c)

# now let's add some axis labels; we'll use the names from the data set
plt.xlabel(iris.feature_names[1])
plt.ylabel(iris.feature_names[2])
# if we want a key, we'll need to make "handles" attaching colors to names
red = mpatches.Patch(color='darkred', label='setosa')
blue = mpatches.Patch(color='blue', label='versicolor')
orange = mpatches.Patch(color='orange', label='virginica')
# now actually show the legend
plt.legend(handles=[red, blue, orange])

# let's add a title
plt.title('Iris dataset (second and third dimensions)')


Text(0.5,1,'Iris dataset (second and third dimensions)')

In [73]:

plt.figure() # make a new figure to plot in
# let's set up a list of colors for the different class labels:
colors = ['darkred', 'blue', 'orange']
# now we'll loop over the points in our data set, and plot them one at a time
for i in range(len(iris.data)):
    # use the third and fourth dimensions as our x and y coordinates
    x = iris.data[i][2]
    y = iris.data[i][3]
    # use the target (which we know is 0, 1, or 2) to select the color for this point
    c = colors[iris.target[i]]
    # plot the point as a single point in a scatter graph
    plt.scatter(x, y, color=c)

# now let's add some axis labels; we'll use the names from the data set
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
# if we want a key, we'll need to make "handles" attaching colors to names
red = mpatches.Patch(color='darkred', label='setosa')
blue = mpatches.Patch(color='blue', label='versicolor')
orange = mpatches.Patch(color='orange', label='virginica')
# now actually show the legend
plt.legend(handles=[red, blue, orange])

# let's add a title
plt.title('Iris dataset (third and fourth dimensions)')


Text(0.5,1,'Iris dataset (third and fourth dimensions)')

In [7]:

plt.figure() # make a new figure to plot in
# let's set up a list of colors for the different class labels:
colors = ['darkred', 'blue', 'orange']
# now we'll loop over the points in our data set, and plot them one at a time
for i in range(len(iris.data)):
    # use the fourth and first dimensions as our x and y coordinates
    x = iris.data[i][3]
    y = iris.data[i][0]
    # use the target (which we know is 0, 1, or 2) to select the color for this point
    c = colors[iris.target[i]]
    # plot the point as a single point in a scatter graph
    plt.scatter(x, y, color=c)

# now let's add some axis labels; we'll use the names from the data set
plt.xlabel(iris.feature_names[3])
plt.ylabel(iris.feature_names[0])
# if we want a key, we'll need to make "handles" attaching colors to names
red = mpatches.Patch(color='darkred', label='setosa')
blue = mpatches.Patch(color='blue', label='versicolor')
orange = mpatches.Patch(color='orange', label='virginica')
# now actually show the legend
plt.legend(handles=[red, blue, orange])

# let's add a title
plt.title('Iris dataset (fourth and first dimensions)')


Text(0.5,1,'Iris dataset (fourth and first dimensions)')

In [75]:

plt.figure() # make a new figure to plot in
# let's set up a list of colors for the different class labels:
colors = ['darkred', 'blue', 'orange']
# now we'll loop over the points in our data set, and plot them one at a time
for i in range(len(iris.data)):
    # use the first and third dimensions as our x and y coordinates
    x = iris.data[i][0]
    y = iris.data[i][2]
    # use the target (which we know is 0, 1, or 2) to select the color for this point
    c = colors[iris.target[i]]
    # plot the point as a single point in a scatter graph
    plt.scatter(x, y, color=c)

# now let's add some axis labels; we'll use the names from the data set
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[2])
# if we want a key, we'll need to make "handles" attaching colors to names
red = mpatches.Patch(color='darkred', label='setosa')
blue = mpatches.Patch(color='blue', label='versicolor')
orange = mpatches.Patch(color='orange', label='virginica')
# now actually show the legend
plt.legend(handles=[red, blue, orange])

# let's add a title
plt.title('Iris dataset (first and third dimensions)')


Text(0.5,1,'Iris dataset (first and third dimensions)')

In [8]:

plt.figure() # make a new figure to plot in
# let's set up a list of colors for the different class labels:
colors = ['darkred', 'blue', 'orange']
# now we'll loop over the points in our data set, and plot them one at a time
for i in range(len(iris.data)):
    # use the second and fourth dimensions as our x and y coordinates
    x = iris.data[i][1]
    y = iris.data[i][3]
    # use the target (which we know is 0, 1, or 2) to select the color for this point
    c = colors[iris.target[i]]
    # plot the point as a single point in a scatter graph
    plt.scatter(x, y, color=c)

# now let's add some axis labels; we'll use the names from the data set
plt.xlabel(iris.feature_names[1])
plt.ylabel(iris.feature_names[3])
# if we want a key, we'll need to make "handles" attaching colors to names
red = mpatches.Patch(color='darkred', label='setosa')
blue = mpatches.Patch(color='blue', label='versicolor')
orange = mpatches.Patch(color='orange', label='virginica')
# now actually show the legend
plt.legend(handles=[red, blue, orange])

# let's add a title
plt.title('Iris dataset (second and fourth dimensions)')


Text(0.5,1,'Iris dataset (second and fourth dimensions)')
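
The six plots above repeat the same code with different column indices. As a rough sketch (not part of the original exercise), the index pairs can be generated with `itertools.combinations` and drawn into one grid of subplots. The 2x3 layout, the figure size, and the variable names `pairs`, `fig`, and `axes` are choices made for this illustration only.

```python
from itertools import combinations

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
colors = ['darkred', 'blue', 'orange']

# all unordered pairs of the 4 feature indices: (0, 1), (0, 2), ..., (2, 3)
pairs = list(combinations(range(iris.data.shape[1]), 2))

# one subplot per pair, arranged in a 2x3 grid (layout is an arbitrary choice)
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (a, b) in zip(axes.flat, pairs):
    # one scatter call per class, selecting that class's rows with a boolean mask
    for label, color in enumerate(colors):
        rows = iris.target == label
        ax.scatter(iris.data[rows, a], iris.data[rows, b],
                   color=color, label=iris.target_names[label])
    ax.set_xlabel(iris.feature_names[a])
    ax.set_ylabel(iris.feature_names[b])

# a single legend is enough, since every subplot uses the same colors
axes.flat[0].legend()
fig.tight_layout()
```

Plotting one class at a time with a mask also avoids calling `plt.scatter` once per data point, which matters little for 150 points but adds up on larger datasets.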

## Run PCA
Here, we'll apply principal component analysis (PCA) to the dataset.  We'll use `n_components=2` to indicate that we want to reduce the dimensionality to 2.

In [9]:
# set up a PCA learner
pca = PCA(n_components = 2)
# actually run the fit algorithm
eigenbasis = pca.fit(iris.data)
# transform our data using the learned transform
iris2d = eigenbasis.transform(iris.data)

In [10]:
# note that our transformed data is now 2-dimensional
iris2d.shape

(150L, 2L)

In [11]:
# again, let's look at the first 10 elements; note that they are 2 dimensional, rather than 4
iris2d[:10]

array([[-2.68420713,  0.32660731],
       [-2.71539062, -0.16955685],
       [-2.88981954, -0.13734561],
       [-2.7464372 , -0.31112432],
       [-2.72859298,  0.33392456],
       [-2.27989736,  0.74778271],
       [-2.82089068, -0.08210451],
       [-2.62648199,  0.17040535],
       [-2.88795857, -0.57079803],
       [-2.67384469, -0.1066917 ]])

### Examining components
We can look at the actual "principal components," which we're using as the basis for our transformed data space.  Since each component is a vector in the original data space, we can see which direction ("axis") in the original space carries the most variance.

Since we said to use the top 2 components, we're going to have two vectors, each of length 4 (since our original data was 4 dimensional).

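We can also show the fraction of the total variance explained by each component (`explained_variance_ratio_`), which tells us how "important" each one is.  Note that these values are fractions between 0 and 1, not percentages.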

In [12]:
# the actual components
print("principal components:\n", pca.components_)
# let's also look at how much of the total variance we were able to cover with 2 dimensions
print('percentage of variance explained by first 2 principal components:', pca.explained_variance_ratio_)

('principal components:\n', array([[ 0.36158968, -0.08226889,  0.85657211,  0.35884393],
       [ 0.65653988,  0.72971237, -0.1757674 , -0.07470647]]))
('percentage of variance explained by first 2 principal components:', array([0.92461621, 0.05301557]))
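
To make the meaning of the components more concrete, here is a minimal sketch that recomputes them with plain NumPy, assuming the standard formulation of PCA as a singular value decomposition of the mean-centered data. The names `X_centered`, `manual_components`, `manual_ratio`, and `manual_2d` are introduced here for illustration; the sign of each component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data
pca = PCA(n_components=2).fit(X)

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)

# rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
manual_components = Vt[:2]

# the variance along each direction is the squared singular value over (n - 1)
variances = S ** 2 / (len(X) - 1)
manual_ratio = variances[:2] / variances.sum()

# projecting onto the components is a matrix product with the centered data
manual_2d = X_centered @ pca.components_.T

print(np.allclose(np.abs(manual_components), np.abs(pca.components_)))
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))
print(np.allclose(manual_2d, pca.transform(X)))
```

Each check compares the hand-computed quantity with the corresponding attribute or method of the fitted `PCA` object.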


## Make a plot of the 2D "transformed" data
First, here's an example adapted from http://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_lda.html#sphx-glr-auto-examples-decomposition-plot-pca-vs-lda-py:

In [13]:
# make a new figure
plt.figure()
# pick some colors to use
colors = ['navy', 'turquoise', 'darkorange']

# plot our points with colors and labels
# (use a plain loop variable for the class name rather than assigning to an attribute of `iris`)
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(iris2d[iris.target == i, 0], iris2d[iris.target == i, 1], color=color, label=target_name)
plt.legend(loc='best')
plt.title('PCA of IRIS dataset')


Text(0.5,1,'PCA of IRIS dataset')

In [82]:
# here's an alternative version of plotting this data that may be easier to understand:
colors = ['red', 'blue', 'green']
plt.figure()
# loop over examples, and plot each one
for i in range(len(iris2d)):
    point = iris2d[i]
    classLabel = iris.target[i]
    # plot a dot at an (x, y) coordinate, using the specified color.
    plt.scatter(point[0], point[1], color=colors[classLabel])


# Compare this to the 6 plots from before
Does this plot seem "better" than the plots we made before using the original axes?  Explain the *pros* and *cons* of the two ways of visualizing the data (note that you should have at minimum one of each):

***
One thing both views have in common is that setosa stays clearly separated and is easy to distinguish. Versicolor and virginica remain very close to each other in the PCA plot as well. We could try to separate them with a line, but a few points from each of these two classes would end up on the other side of it.

Pros:
- Allows estimating probabilities in high-dimensional data
- Faster processing

Cons:
- Might be too expensive for many applications
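
The remark above about a separating line can be checked roughly in code. The sketch below is an illustration rather than part of the original answer: it assumes a logistic regression with default settings (imported from `sklearn.linear_model`, which the notebook does not otherwise use) as one possible way to draw a straight line between versicolor and virginica in the 2D PCA space, and counts how many points fall on the wrong side. The names `X_pair`, `y_pair`, `line`, and `wrong_side` are hypothetical.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
iris2d = PCA(n_components=2).fit_transform(iris.data)

# keep only versicolor (label 1) and virginica (label 2)
mask = iris.target != 0
X_pair, y_pair = iris2d[mask], iris.target[mask]

# logistic regression yields a straight-line decision boundary in 2D
line = LogisticRegression().fit(X_pair, y_pair)

# count the points that the line assigns to the wrong class
wrong_side = int((line.predict(X_pair) != y_pair).sum())
print("points on the wrong side of the line:", wrong_side, "out of", len(y_pair))
```

A different linear method, or different regularization, would place the line slightly differently, so the count should be read as a rough indication rather than a fixed property of the data.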