### **Biostatistics with R**

# **Assignment 10: _K_-Means and Hierarchical Cluster Analysis**

### In this lesson you will learn about:

* K-Means Cluster Analysis
* Using square bracket `[ ]` indexing for subsetting data
* Convert numerical values in a dataframe into a matrix
* The Elbow Method for estimating the correct K value
* Hierarchical Cluster Analysis


### **WARNING:** You will need to install the R packages `r-cluster` and `r-factoextra` before you can run the next cell

In a terminal window on your laptop you will need to run the following command **_BEFORE_** you can run this lesson.

`conda install -c conda-forge r-cluster` <br>
`conda install -c conda-forge r-factoextra`


In [None]:
# Load libraries (they must already be installed)
library(cluster)
library(factoextra)

### Lesson Setup

Run the next cell to print your current working directory, which should be the folder that contains the files for this lesson.

In [None]:
print(paste("My current working directory is", getwd()))

# Introduction to Cluster Analysis

**_Cluster analysis_** is a data analysis technique that explores the naturally occurring groups within a data set, known as **_clusters_**. Cluster analysis doesn't need to group data points into any predefined groups, which means that it is an **_unsupervised_** learning method. Like regression analysis, cluster analysis is considered a form of **_Machine Learning_**, which is one of the main components of **_Artificial Intelligence_** or **_AI_**.

![__](AI.png)

The two main forms of Cluster Analysis are **_K-Means Cluster Analysis_** and **_Hierarchical Cluster Analysis_**. _K_-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of **_K_** groups (i.e. $k$ clusters), where **_K_** represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., _clusters_), such that objects within the same cluster are as similar as possible (i.e., high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (i.e., low inter-class similarity). In _K_-Means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

There are two potential problems with the K-means clustering algorithm. First, it is a flat clustering method. After observations are assigned to their clusters, they are all considered to be similar within the same cluster. That is, the observations are not further separated based on dissimilarity within a cluster. Secondly, we need to specify the number of clusters K _a priori_. Finding the appropriate number of clusters is not trivial, and the selected number has a substantial impact on the results.

An alternative approach that avoids these issues is **hierarchical clustering**. The result of this method is a dendrogram (a tree). The root of the dendrogram is its highest level and contains all _n_ observations. The leaves of the tree are its lowest level and are each a unique observation.

In this lesson, the cluster analysis approach is introduced by using K-Means Cluster analysis to classify three species of iris flowers based on their petal and sepal dimensions. At the end of the lesson we look at an example of hierarchical cluster analysis.

# **The famous _Iris flower_ dataset**

In this lesson we will perform _K_-Means and Hierarchical Cluster analyses on the famous _Iris flower dataset_. The Iris flower data set, or Fisher's _Iris_ data set, is a multivariate data set used and made famous by the British statistician Ronald Fisher in his 1936 paper _The use of multiple measurements in taxonomic problems_ as an example of linear discriminant analysis [1]. (NOTE: A PDF of this paper has been included with this lesson). It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The dataset contains measurements of 4 morphological characteristics of an iris flower: (1) sepal length, (2) sepal width, (3) petal length and (4) petal width. These measurements were taken from 3 different iris species: (1) _Iris setosa_, (2) _Iris virginica_ and (3) _Iris versicolor_. The image below shows the flower of each of these species.

![__](IrisSpecies.png)


The next figure illustrates the difference between a **_sepal_** and a **_petal_** in an Iris flower.

![__](SepalPetal.png)


### Read Iris flower dataset and store the data in a dataframe

The cell below uses the function `read.csv()` to read the file `iris.csv` located in the current working directory. It should be noted that the header information for `iris.csv` has been simplified to make it easier to use as a teaching example, while the actual physical measurements themselves have been left intact.

The iris flower data read from the file `iris.csv` will be stored in a dataframe called `iris_df`. The `head()` function is used to print out the first 6 records to make sure the file was read correctly.   

In [None]:
# Read Iris data set

# Use read.csv() to read datafile
iris_df <- read.csv('iris.csv', sep=',')

# Print out the first six records
head(iris_df)


If your code is set up correctly you should see the following output.

![__](Ex0.jpg)

### Dataframe indexing: Extracting subsets from a dataframe

From the output above you can see that the dataframe `iris_df` contains the following 5 columns:

1. **sepalLen:** The length of the flower's sepal
2. **sepalWid:** The width of the flower's sepal
3. **petalLen:** The length of the flower's petal
4. **petalWid:** The width of the flower's petal
5. **Species:** The species name of the flower

In this lesson, a series of _Examples_ will be given to illustrate how to perform certain tasks using the _sepal dimensions_. You will be asked to complete a series of **_Exercises_** in which you need to perform similar tasks using just the **_petal_** _dimensions_.

One common way to extract specific information from a dataframe is **_indexing_** with square brackets `[Row, Column]`. The first number within the square brackets specifies the _row_ number while the second number specifies the _column_ number. A comma is used to separate the two numbers. Finally, leaving the first number _blank_ selects _all_ rows, and leaving the second number _blank_ selects _all_ columns.
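The indexing rules can be tried out on a tiny dataframe before applying them to the Iris data. The sketch below is only an illustration: `toy_df` and its column names are made up for this purpose and are not part of the lesson's data.

```r
# Hypothetical toy dataframe, used only to illustrate [Row, Column] indexing
toy_df <- data.frame(len = c(5.1, 4.9, 6.3),
                     wid = c(3.5, 3.0, 2.9),
                     grp = c("a", "a", "b"),
                     stringsAsFactors = FALSE)

toy_df[2, 1]             # row 2, column 1: the single value 4.9
toy_df[2, ]              # row 2, all columns (column index left blank)
toy_df[ , 1:2]           # all rows (row index left blank), columns 1 through 2
toy_df[c(1, 3), "grp"]   # rows 1 and 3 of the column named "grp"
```

Columns can be selected by number or by name. Numbers are used throughout this lesson.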


### Example 1: Extract sepal dimensions from `iris_df`

In _Example 1_, the square bracket notation `[ ,1:2]` means _all_ rows and just columns number 1 through 2. Since the sepal dimensions are stored in columns 1 and 2, we can use this square bracket index to extract just the sepal dimensions.

In [None]:
# Example 1: Print out just data in columns 1 and 2 for all the rows

head(iris_df[ , 1:2])

### **Exercise 1:** Extract _petal_ dimensions from `iris_df`

In **_Exercise 1_**, use the `head()` function in combination with square bracket notation to print out the _petal_ dimensions stored in columns 3 and 4.  

In [None]:
# Insert your code for Exercise 1 here


If your code is correct you should see the following output.

![](Ex1.jpg)

### Example 2: Extract _sepal_ dimensions and _species_ names from `iris_df`

Square bracket indexing is quite flexible. In this example we use the _concatenate_ function `c` to specify which columns to include in our subset. In _Example 2_, the square bracket notation `[ , c(1,2,5)]` means _all_ rows (the initial blank space) and columns number 1, 2 and 5.

Since the sepal dimensions are stored in columns 1 and 2, and the species names are stored in column 5, we can use this square bracket index to extract the sepal dimensions _and_ the species names. The letter `c` stands for _concatenate_ which essentially means to "add together" in this context. 

In [None]:
# Example 2: Print out the data in columns 1, 2 and 5 for all rows

head(iris_df[ , c(1,2,5)])

### **Exercise 2:** Extract _petal_ dimensions and _species_ names from `iris_df`

In **_Exercise 2_**, use the `head()` function in combination with square bracket notation to print out the _petal_ dimensions stored in columns 3 and 4 and the species names stored in column 5.  

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following output.

![](Ex2.jpg)

# K-Means Cluster Analysis

In statistics, a _cluster_ refers to a collection of data points aggregated together because of certain similarities. For example, _Iris_ flowers from the same species might be expected to form a cluster if each species has a relatively unique combination of petal and sepal lengths and widths. In this lesson, we will use K-Means Cluster Analysis to see if _Iris_ flowers form distinct clusters based on their sepal lengths (`sepalLen`) and sepal widths (`sepalWid`). As an exercise, you will perform K-Means Cluster analysis on the petal dimensions.

The basic idea is to  "plot the data" in an _n-dimensional_ space. For the _Iris_ data we can plot the data on a standard XY scatterplot since we will only be using 2 independent variables at a time. In the _Examples_, we will be using _sepal length_, which we plot along the X-axis and _sepal width_, which we will plot along the Y-axis. In the **_Exercises_** you will be using _petal length_ and _petal width_. 
 

## Centroids

Once the independent variables have been plotted in their _n-dimensional_ space, your computer will randomly place a small number of imaginary "objects" called _centroids_ somewhere within the data. How many _centroids_ will be placed is up to you. The number of centroids is called **_K_**, which explains, in part, the name **_K_**-_Means Cluster_ analysis. For the _Iris_ flower data, we will be using 3 centroids (_K_=3) since we will be looking for 3 data clusters, one for each flower species.

## How the _K_-Means Cluster algorithm works

Once your computer has randomly placed the _K_ centroids within the data points, it calculates how far each data point is from each centroid. Data points that are closest to a particular centroid are assigned to that centroid to form a cluster.

Since the centroids were placed randomly, it is very unlikely that they ended up in the true center of a particular cluster. This is where the "Machine Learning" comes in. Your computer then performs a series of "loops" (_iterations_) in which it starts _learning about the data_. During each iteration, it moves the centroids and checks whether the new locations are "better" than the old ones. The computer keeps moving the centroids around until their locations are as good as it can make them.

Making these calculations manually would take a considerable amount of time and effort. However, these are _exactly_ the kind of small repetitive computations that your laptop is supremely good at, and can perform at blazing speeds.

In this lesson, we will be asking your computer to make up to 50 iterations. In other words, your computer will repeat the _K_-Means Cluster computational procedure over and over, up to 50 times. How many iterations to allow (or how long you want to wait for an answer) is up to you. You can set the maximum number of iterations using the argument `iter.max = `. Each time the algorithm runs, it uses the results of the last run to improve its search. After every run, your computer will (hopefully) get closer (better) approximations to the solution of the problem, such as separating all 150 plants in the Iris flower dataset into 3 clusters, one for each species.

To make the algorithm run faster, the _K_-Means Cluster software is smart enough to know when the locations of the centroids have stabilized and there is no change in their values. At this point the "looping" will stop and the results will be returned. However, if the centroids have not stabilized, the program terminates when the maximum number of iterations has been reached.
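The loop described above can be sketched in a few lines of base R. This is a simplified, Lloyd-style version written only for illustration. It is not the code inside `kmeans()`, which by default uses a different algorithm (Hartigan-Wong). As an assumption, R's built-in `iris` dataset stands in for `iris.csv`, and the names `pts`, `centroids` and `cluster_id` are made up for this sketch.

```r
# Minimal sketch of a K-means loop (Lloyd-style); NOT the code inside kmeans().
# Assumption: R's built-in `iris` data stands in for iris.csv.
set.seed(42)                          # make the random start repeatable
pts <- as.matrix(iris[ , 1:2])        # sepal length and sepal width
K <- 3
max_iter <- 50

# Step 1: pick K random data points as the starting centroids
centroids <- pts[sample(nrow(pts), K), , drop = FALSE]

for (iter in 1:max_iter) {
    # Step 2: squared distance from every point to every centroid (150 x K matrix)
    d2 <- sapply(1:K, function(k) {
        (pts[ , 1] - centroids[k, 1])^2 + (pts[ , 2] - centroids[k, 2])^2
    })

    # Step 3: assign each point to its nearest centroid
    cluster_id <- max.col(-d2, ties.method = "first")

    # Step 4: move each centroid to the mean of the points assigned to it
    new_centroids <- centroids
    for (k in 1:K) {
        members <- cluster_id == k
        if (any(members)) {
            new_centroids[k, ] <- colMeans(pts[members, , drop = FALSE])
        }
    }

    # Step 5: stop early once the centroids no longer move
    if (all(abs(new_centroids - centroids) < 1e-9)) break
    centroids <- new_centroids
}

print(iter)        # number of iterations actually used
print(centroids)   # final centroid coordinates
```

Changing the value passed to `set.seed()` changes the random start, and can change both the number of iterations needed and which cluster receives which number.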

## Estimating _K_ using an _Elbow Plot_

One of the major challenges when using _K_-Means analysis is deciding on the number of clusters the algorithm should be looking for. With the _Iris_ data, we already know to look for 3 clusters, one for each plant species. But in most cases we won't know the correct number of clusters.

There is a widely accepted technique called the _Elbow Method_ that gives an estimate for _K_. In the Elbow Method, the data is analyzed using a small range (e.g. 1 to 10) of _K_ values. For each _K_ value, the **WCSS** (Within-Cluster Sum of Squares) value is computed. The _WCSS_ is the sum, over all clusters, of the squared distances between each data point and the centroid of its cluster. The lower the WCSS, the "better" the fit between the centroids and the data.
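Written as a formula, the WCSS for a given _K_ looks like this. The symbols are notation chosen for this note rather than taken from the lesson: $C_k$ is the set of data points assigned to cluster $k$, $x_i$ is a single data point, and $\mu_k$ is the centroid of cluster $k$.

$$
\mathrm{WCSS}(K) = \sum_{k=1}^{K} \; \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$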

It is always easier to fit data with more and more parameters. So as the _K_ value increases, the WCSS value will get smaller and smaller. However, plotting the WCSS as a function of the _K_ value usually produces a line with a shape similar to an _elbow_. As the number of clusters increases from 1, to 2, to 3, and so on, the WCSS value initially decreases rapidly, but at some value of _K_ the rate of decline suddenly slows. The _K_ value at which this _elbow_-shaped line bends is either the optimal number of clusters, or very close to it.

The R code in Example 4 below generates a line plot (i.e. _Elbow plot_) of the WCSS for _K_ values from 1 to 10 for the sepal dimensions using the function `fviz_nbclust()`. This function is part of the `factoextra` library that was loaded at the start of the lesson.

The `fviz_nbclust()` function needs purely numerical input, which in this lesson we supply as a numerical _matrix_ rather than a dataframe. Therefore, before we can generate our elbow plot, we will first convert the sepal dimensions in `iris_df` into a numerical matrix called `sepal_mat` using the `data.matrix()` function as shown in Example 3.
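To see what goes into an elbow plot, the WCSS values can also be computed by hand with base R. The sketch below is an illustration only. It assumes R's built-in `iris` dataset as a stand-in for `iris.csv`, and `sepal_demo`, `k_values` and `wcss` are names made up for this sketch. It reads the total WCSS from the `tot.withinss` component that `kmeans()` returns.

```r
# Sketch: compute the WCSS for K = 1..10 and draw a simple elbow plot.
# Assumption: R's built-in `iris` data stands in for iris.csv.
set.seed(123)                               # repeatable random starts
sepal_demo <- data.matrix(iris[ , 1:2])     # sepal length and sepal width
k_values <- 1:10

# Run kmeans() once per K and keep the total within-cluster sum of squares
wcss <- sapply(k_values, function(k) {
    kmeans(sepal_demo, centers = k, iter.max = 50, nstart = 10)$tot.withinss
})

# Base-graphics elbow plot
plot(k_values, wcss,
     type = "b", pch = 19,
     main = "Elbow plot (computed by hand)",
     xlab = "Number of clusters K",
     ylab = "Total WCSS")
```

The shape of this curve should resemble the one drawn by `fviz_nbclust()`, although the exact values can differ slightly because each run starts from random centers.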

### Example 3: Convert data in a dataframe into a numerical matrix

The R code in the cell below uses the function `data.matrix()` to convert the sepal dimensions in `iris_df` into a numerical matrix called `sepal_mat`. 

In [None]:
# Example 3: Convert sepal dimensions into a numerical matrix

# Use data.matrix() function to create matrix of sepal dimensions
sepal_mat <- data.matrix(iris_df[ , 1:2])

# List the first 6 rows of sepal_mat
head(sepal_mat)

### **Exercise 3:** Convert _petal_ data into a numerical matrix

In the code cell below, write the R code to convert the _petal_ dimensions in `iris_df` into a numerical matrix called `petal_mat` and then print out the first 6 rows.

In [None]:
# Insert your code for Exercise 3 here



If your code is correct you should see the following output.

![](Ex3.jpg)

### Example 4: Elbow plot for sepal dimensions

The code in the cell below uses the function `fviz_nbclust()` to generate an elbow plot from the numerical matrix `sepal_mat`.

In [None]:
# Example 4: Generate elbow plot

# Generate elbow plot of sepal dimensions
fviz_nbclust(sepal_mat, kmeans, method = "wss")


Notice that the WSS (the same quantity as the WCSS described above) decreases rapidly as _K_ increases from 1 to 3, but then declines more slowly after _K_ = 3.

### **Exercise 4:** Elbow plot for _petal_ dimensions

In the cell below, use the function `fviz_nbclust()` to generate an elbow plot from the numerical matrix `petal_mat`.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see the following output.

![](Ex4.jpg)

### Example 5: K-Means Cluster analysis of sepal dimensions

The code in the cell below uses the function `kmeans()` to perform the _K_-Means cluster analysis of the sepal data. The number of clusters, _K_, is set to $3$ based on visual inspection of the elbow plot in Example 4.

The argument `iter.max` is the maximum number of iterations the algorithm is allowed to perform before results are returned. The algorithm works by finding a minimum of a cost function through iterations (passes) through the data.

The argument `nstart` is the number of random starting configurations that are tried. When the algorithm is initialized, it needs initial coordinates for the cluster centers, and `kmeans()` picks these by choosing _K_ rows of the data at random. With `nstart = 10`, this is repeated for 10 different random sets of starting centers, and the run with the lowest total within-cluster sum of squares is returned.


In [None]:
# K-means clustering: Sepal dimensions

# Set the number of clusters
K = 3

# Perform Kmean for K=3
sepal_km_results <- kmeans(iris_df[,1:2], K, iter.max=50, nstart = 10)

# Print out the results
print(sepal_km_results$centers)

The variable `sepal_km_results` contains a number of statistical measures about the results of the _K_-Means analysis. For this lesson, we will focus on the co-ordinates of the _centroid centers_ which we will use later in our analysis. 
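The object returned by `kmeans()` is a list, so its pieces can be pulled out with the `$` operator. The sketch below shows a few of them. It assumes R's built-in `iris` dataset as a stand-in for `iris.csv`, and `demo_km`, `centroid_x` and `centroid_y` are names made up for this sketch.

```r
# Sketch: look inside a kmeans() result.
# Assumption: R's built-in `iris` data stands in for iris.csv.
set.seed(1)
demo_km <- kmeans(iris[ , 1:2], centers = 3, iter.max = 50, nstart = 10)

names(demo_km)          # all components stored in the result
demo_km$size            # number of flowers in each cluster
demo_km$tot.withinss    # total within-cluster sum of squares (WCSS)
demo_km$iter            # number of iterations that were needed
head(demo_km$cluster)   # cluster number assigned to the first 6 flowers

# Centroid coordinates, one row per cluster, rounded to 2 decimal places
centroid_x <- round(demo_km$centers[ , 1], 2)   # first column: sepal lengths
centroid_y <- round(demo_km$centers[ , 2], 2)   # second column: sepal widths
print(centroid_x)
print(centroid_y)
```

Reading the centroids from `$centers` avoids copying numbers by hand, which matters because the order of the clusters can change from one run to the next.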

### **Exercise 5:** K-Means Cluster analysis of petal dimensions

In the code cell below, use the function `kmeans()` to perform the _K_-Means cluster analysis of the _petal_ data. Set the number of clusters, _K_, to $3$ based on the elbow plot in Exercise 4. Set `iter.max` to 50 and `nstart` to 10. Call the results of your analysis `petal_km_results`.

In [None]:
# Insert your code for Exercise 5 here



If your code is correct you should see the following output.

![__](Ex5.jpg)

### Example 6: Visualize K-Means Cluster analysis for sepal dimensions

The code in the cell below uses the function `fviz_cluster()` to plot the results of the _K_-Means cluster analysis of the sepal data.


In [None]:
# Example 6: Visualize kmeans clustering

fviz_cluster(sepal_km_results, iris_df[,1:2], 
             ellipse.type = "norm", 
             geom=c("point"), 
             pointsize = 3, 
             ggtheme = theme_minimal())


If your code is working correctly, you should see the following cluster plot.

![__](Example_06.jpg)

### **Exercise 6:** Visualize K-Means Cluster analysis for petal dimensions

In the code cell below, use the function `fviz_cluster()` to plot the results of your _K_-Means cluster analysis of the petal data. 


In [None]:
# Insert your code for Exercise 6 here



If your code is correct you should see the following output.

![__](Ex6.jpg)

## Determining the relationship between clusters and flower species

Now that _K_-Means Cluster analysis has divided the _Iris_ flower data into 3 clusters based on sepal dimensions (Example 6) and petal dimensions (Exercise 6), the next step is to determine the relationship between species and clusters. In other words, which cluster contains which species? Perhaps the most direct way to answer this question is to generate XY scatter plots of the Iris dataset in which the data points are color-coded according to the flower's species.
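A quick numerical complement to the scatter plots is a cross-tabulation of cluster number against species. The sketch below is one possible way to do this and is not part of the lesson's required steps. It assumes R's built-in `iris` dataset as a stand-in for `iris.csv`. With the lesson's own objects, the equivalent inputs would be `sepal_km_results$cluster` and `iris_df$Species`.

```r
# Sketch: cross-tabulate K-means cluster numbers against species names.
# Assumption: R's built-in `iris` data stands in for iris.csv.
set.seed(1)
demo_km <- kmeans(iris[ , 1:2], centers = 3, iter.max = 50, nstart = 10)

# Rows are clusters, columns are species; each cell counts flowers
cross_tab <- table(Cluster = demo_km$cluster, Species = iris$Species)
print(cross_tab)
```

A cluster that corresponds cleanly to one species has most of its flowers in a single column of the table.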

### Example 7: Generate logical (boolean) mask for Iris setosa

Logical (boolean) masks are one way to separate data in a dataframe according to some condition, in this case `Species`. The code in the cell below shows how to generate a boolean mask for the Iris species `Iris setosa`. As you can see from the output, a boolean mask is simply a vector containing either `TRUE` or `FALSE` for each row in `iris_df`.

In [None]:
# Example 7: Use logical indexing to generate a boolean mask for I setosa

setosa_mask = iris_df$Species == 'Iris-setosa' 

# Print out the first 6 values in setosa_mask
head(setosa_mask)

### **Exercise 7:** Generate logical (boolean) masks for Iris versicolor and Iris virginica

In the cell below, generate a boolean mask called `versicolor_mask` for the Iris species `Iris versicolor` and another mask called `virginica_mask` for the species `Iris virginica`. Use the `head()` function to print out the first 6 records in `versicolor_mask`.

In [None]:
# Insert your code for Exercise 7 here



If your code is correct you should see the following output.

`FALSE·FALSE·FALSE·FALSE·FALSE·FALSE`

### Example 8: Calculate the number of Iris setosa from logical mask. 

Since R considers the boolean value `TRUE` to be equal to the number $1$ and the boolean value `FALSE` to be equal to the number $0$, we can use the logical mask `setosa_mask` to calculate the number of flowers in `iris_df` that belong to the species `Iris setosa` as shown in the code cell below. The variable `num_setosa` is used to store this value for later use.

In [None]:
# Calculate the number of each species

# Since TRUE=1 and FALSE=0, sum to find total
num_setosa = sum(setosa_mask)

# Print out the numbers
print(paste("Number of Iris setosa =", num_setosa))
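
Summing a logical mask counts one species at a time. As an aside, base R's `table()` function counts every species in a single call. The sketch below assumes R's built-in `iris` dataset as a stand-in for `iris.csv`. Note that its species labels are `"setosa"`, `"versicolor"` and `"virginica"`, without the `Iris-` prefix used in the lesson's file.

```r
# Sketch: count flowers per species in one call.
# Assumption: R's built-in `iris` data stands in for iris.csv.
species_counts <- table(iris$Species)
print(species_counts)

# The same count for one species, using a logical mask as in Example 8
demo_mask <- iris$Species == "setosa"
sum(demo_mask)
```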

### **Exercise 8:** Calculate the number of Iris versicolor and virginica from logical mask. 

In the code cell below, use the logical masks that you created in Exercise 7 to calculate the total number of plants in `iris_df` that belong to the species `I. versicolor` and `I. virginica`. Store these values in the variables `num_versicolor` and `num_virginica`, respectively. Print out the contents of both variables as shown in Example 8. 

In [None]:
# Insert your code for Exercise 8 here



If your code is correct you should see the following output.

`[1] "Number of Iris versicolor = 50"` <br>
`[1] "Number of Iris virginica = 50"`

### Example 9: Generate vectors to hold sepal dimensions by species

When using vectors it is necessary to first initialize them so R knows how big (long) the vector should be and what type of values will be stored in it. The code in the cell below utilizes the function `vector()` to initialize 6 different vectors. These 6 vectors will be used later to hold the sepal lengths and sepal widths of all the flowers in the `iris_df` dataframe. Since there are 3 species, we will need a total of 6 vectors to hold the sepal length and sepal width dimensions. The `mode` of these vectors is set to `numeric` since we will be storing numbers. The vectors are named `Xsep_1`, `Ysep_1`, etc. to remind us that the vectors contain **_sepal_** values.

In [None]:
# Example 9: Initialize 6 vectors to hold sepal dimensions 

# Initialize XY arrays for plotting flowers by species
Xsep_1 <- vector(mode="numeric", length=num_setosa)       # X values for setosa
Ysep_1 <- vector(mode="numeric", length=num_setosa)       # Y values for setosa
Xsep_2 <- vector(mode="numeric", length=num_versicolor)   # X values for versicolor
Ysep_2 <- vector(mode="numeric", length=num_versicolor)   # Y values for versicolor
Xsep_3 <- vector(mode="numeric", length=num_virginica)    # X values for virginica
Ysep_3 <- vector(mode="numeric", length=num_virginica)    # Y values for virginica


### **Exercise 9:** Generate vectors to hold petal dimensions by species

In the cell below, initialize 6 vectors to hold the _petal_ dimensions for the 3 different Iris species. Using Example 9 as a template, your vectors should be called `Xpet_1`, `Ypet_1`, `Xpet_2`, `Ypet_2`, `Xpet_3` and `Ypet_3`. 

In [None]:
# Insert your code for Exercise 9 here



### Example 10: Extract sepal dimensions according to species

The code below extracts the sepal dimensions from `iris_df` into one of the six vectors previously generated in Example 9. Like many programming tasks there are several ways this problem could be solved. The one used here is not especially "elegant", but it has the advantage of being easy to understand. 

The code uses a `while` loop to cycle through the `iris_df` dataframe, one row at a time. The number of the current row is held in the variable `row_num`. This variable starts at $1$, the first row in `iris_df`. After each "loop" through the dataframe, the value of `row_num` is increased by $1$ and the loop runs again. This continues until every row in the dataframe, as counted by the function `nrow(iris_df)`, has been processed.

During each loop through the data, the species name stored in column 5 is compared to the three species names. If the two names are the same (`==`), the logical condition is met and the next lines of code are executed. First, the sepal _length_ stored in `iris_df[row_num, 1]` is placed in the `Xsep` vector for that species. Second, the sepal _width_ stored in `iris_df[row_num, 2]` is placed in the `Ysep` vector for that species. For example, if the condition for `Iris-versicolor` is met, the sepal length and width are stored in `Xsep_2` and `Ysep_2`.

After the `while` loop has finished, the total number of loops (rows) completed is printed out to verify the code ran correctly.


In [None]:
# Example 10: Separate sepal dimensions by species

# Initialize counter variables (R vectors are indexed starting at 1)
row_num = 1  # loop counter
i = 1        # I setosa counter
j = 1        # I versicolor counter
k = 1        # I virginica counter
totalFlowers = nrow(iris_df)

# Use a while loop to step through iris_df one row at a time
while (row_num <= totalFlowers) {
    if (iris_df[row_num, 5] == 'Iris-setosa') {
        # Iris setosa
        Xsep_1[i] = iris_df[row_num, 1]
        Ysep_1[i] = iris_df[row_num, 2]
        i = i + 1
    } else if (iris_df[row_num, 5] == 'Iris-versicolor') {
        # Iris versicolor
        Xsep_2[j] = iris_df[row_num, 1]
        Ysep_2[j] = iris_df[row_num, 2]
        j = j + 1
    } else if (iris_df[row_num, 5] == 'Iris-virginica') {
        # Iris virginica
        Xsep_3[k] = iris_df[row_num, 1]
        Ysep_3[k] = iris_df[row_num, 2]
        k = k + 1
    }
    # Increment row number
    row_num = row_num + 1
}

# Print out the total number of loops (row_num is now one past the last row)
print(paste("The total number of loops=", row_num - 1))
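
As noted above, there are several ways to solve this problem. One alternative, sketched below, uses logical masks to pull out all the values for a species in a single step, with no loop and no counters. It assumes R's built-in `iris` dataset as a stand-in for `iris.csv` (species labels there have no `Iris-` prefix), and `X_demo`, `Y_demo` and `sepal_len_by_species` are names made up for this sketch.

```r
# Sketch: extract sepal dimensions by species without a loop.
# Assumption: R's built-in `iris` data stands in for iris.csv.

# A logical mask selects the rows for one species
setosa_rows <- iris$Species == "setosa"
X_demo <- iris[setosa_rows, 1]   # sepal lengths of setosa
Y_demo <- iris[setosa_rows, 2]   # sepal widths of setosa
length(X_demo)                   # number of setosa flowers extracted

# split() separates one column into a list with one vector per species
sepal_len_by_species <- split(iris[ , 1], iris$Species)
sapply(sepal_len_by_species, length)
```

The `while` loop makes each step visible, which is why it is used in this lesson. The masked version is shorter and does not depend on counters.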

### **Exercise 10:** Extract petal dimensions according to species

In the cell below, write the R code to extract the petal dimensions from `iris_df` into one of the six vectors you previously generated in Exercise 9. Remember, the _petal length_ is stored in `iris_df[row_num, 3]` and should be placed in the `Xpet` vector for that species. Also, the _petal width_ is stored in `iris_df[row_num, 4]` and should be placed in the `Ypet` vector for that species. After the `while` loop has finished, print out the number of loops to verify the code ran correctly.   


In [None]:
# Insert your code for Exercise 10 here



If your code is correct you should see the following output.

`[1] "The total number of loops= 150"`

### Example 11: Prepare vectors to plot the centroids of the 3 sepal clusters

Before we can generate our XY scatterplot of sepal widths (Y) as a function of sepal lengths (X), we need to prepare two vectors with the coordinates of the cluster centroids. The centroid values for the _sepal_ data were generated in Example 5. Here is a copy of the Example 5 output:

`Cluster means:` <br>
`  sepalLen sepalWid` <br>
`1 5.773585 2.692453` <br>
`2 5.006000 3.418000` <br>
`3 6.812766 3.074468` <br>

**NOTE:** These are the centroid centers for the **_sepal_** cluster analysis. The centroids for the _petal_ cluster analysis are different, and can be found in the output of Exercise 5 if your code ran correctly.

In the cell below, we create 2 vectors. The first vector, called `CentroidXsepLen`, holds the sepal **lengths** (X-values) for the 3 centroids from the sepal analysis. The second vector, called `CentroidYsepWid`, holds the sepal **widths** (Y-values) for the 3 centroids.

The last line of code prints out the 3 X-values stored in `CentroidXsepLen`. 

In [None]:
# Example 11: Create vectors to store centroid coordinates

CentroidXsepLen <-c(5.77,5.01,6.81)    #This vector stores sepalLen
CentroidYsepWid <-c(2.69, 3.42, 3.07)  #This vector stores sepalWid
CentroidXsepLen

### **Exercise 11:** Prepare vectors to plot the centroids of the 3 petal clusters

In the cell below, create 2 vectors. Call the first vector `CentroidXpetLen` and fill it with the 3 petal **lengths** (X-values) for the 3 centroids from the _petal_ cluster analysis. Call your second vector `CentroidYpetWid` and fill it with the 3 petal **widths** (Y-values) for the 3 centroids from the _petal_ cluster analysis. Print out the 3 values stored in `CentroidXpetLen`.

Remember, the X and Y coordinates for the centroids were generated above in **_Exercise 5_**.

In [None]:
# Insert your code for Exercise 11 here



If your code is correct you should see the following output.

`1.46·5.59·4.27` 

### Example 12: Generate XY Scatterplot of sepal dimensions color-coded by species name

We are now ready to generate an XY scatterplot in which the X-value for each point is the sepal length of a particular flower and the point's Y-value is the flower's sepal width. Since we have already separated the XY sepal dimensions by **_species_** in Example 10, it is simple to plot these sepal dimensions using different colors for the three different Iris species. 

There are several graphics packages for R that could be used to generate these XY scatterplots--the most sophisticated ("complex") being a package called `ggplot2`. Instead we will use what is called _base graphics_ that comes embedded in the R language. 

In the code below, we start by generating a simple XY plot of the sepal dimensions for all of the flowers in `iris_df` that belong to the species _Iris setosa_ using the function `plot()`. 

The data points in this plot will be colored blue. The argument `ylim` is used to set the Y-limits of the plot to the values between 1.5 and 5, while the argument `xlim` sets the X-limits between 4 and 8.5. These values must be adjusted _manually_ to center the data points in the graph. The argument `main` sets the main title of the graph, while `xlab` sets the label for the X-axis and `ylab` sets the label for the Y-axis.

After the initial XY plot of the _Iris setosa_ sepal data, we _add_ the data points for _Iris versicolor_, _Iris virginica_, and the 3 cluster centroids using the function called `points()`. The data points for _Iris versicolor_ are red, the points for _Iris virginica_ are green and the centroids are yellow with a black border.

The code for adding a legend to the graph is shown at the bottom.


In [None]:
# Example 12: Plot sepal dimensions color-coded by species name

# Plot Iris setosa as blue
plot(Xsep_1, Ysep_1,
     pch = 19,
     cex = 1.2,
     col = 'blue',
     ylim = c(1.5,5),
     xlim = c(4,8.5),
     main = "Sepal Dimensions by Species",
     xlab = "Sepal Length (cm)",
     ylab = "Sepal Width (cm)")

# Plot Iris versicolor as red
points(Xsep_2, Ysep_2,
     pch = 19,
     cex = 1.2,
     col = 'red')

# Plot Iris virginica as green
points(Xsep_3, Ysep_3,
     pch = 19,
     cex = 1.2,
     col = 'green')

# Plot Centroid borders as black
points(CentroidXsepLen, CentroidYsepWid,
     pch = 19,
     cex = 2.2,
     col = 'black')

# Plot Centroid centers as yellow
points(CentroidXsepLen, CentroidYsepWid,
     pch = 19,
     cex = 1.5,
     col = 'yellow')


# Add a legend
legend("topleft", legend=c("Cluster #1: Iris versicolor", 
                           "Cluster #2: Iris virginica", 
                           "Cluster #3: Iris setosa", 
                           "Centroid"),
       col=c("red", "green", "blue", "yellow"), 
       pch = 19,
       bty = 'n',
       lty=1:2, cex=1.2)

If your code is correct you should see the following graph.

![__](Example12.jpg)

### Species Identification of the K Clusters

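Comparing the XY scatterplot above with the _K_-Means Cluster plot generated in Example 6, we can identify the species in the 3 clusters as follows. Cluster numbers (e.g. `Cluster #1`) refer to the `cluster` numbers in the legend of Example 6. Because `kmeans()` starts from randomly chosen centers, the numbering can change from one run to the next, so match clusters to species by their position on the plot.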

1. Cluster #1: _Iris versicolor_ shown in red on the graph above.
2. Cluster #2: _Iris virginica_ shown in green on the graph above.
3. Cluster #3: _Iris setosa_ shown in blue on the graph above.


### **Exercise 12:** Generate XY Scatterplot of petal dimensions color-coded by species name

In the cell below, write the R code to generate an XY plot showing the petal dimensions of the flowers in the dataframe `iris_df`. The data points for _Iris setosa_ should be color-coded blue, the data points for _Iris versicolor_ should be color-coded red, and the data points for _Iris virginica_ should be color-coded green. Also include in your XY plot the 3 cluster centroids color-coded yellow with a black border. Finally, include a legend for your figure.

For full credit, don't forget to change the main title, the X-axis label and Y-axis label to match the data in your plot. You will also need to adjust the values in `ylim` and `xlim` in order to center your data points in the middle of your graph.

In [None]:
# Insert your code for Exercise 12 here



If your code is correct you should see the following graph.

![_](Ex12.jpg)

# Hierarchical Cluster Analysis

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

* Agglomerative: This is a "bottom-up" approach: Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
* Divisive: This is a "top-down" approach: All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.

Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances. On the other hand, except for the special case of single-linkage distance, none of the algorithms (other than exhaustive search) can be guaranteed to find the optimum solution.

### Example 13: Hierarchical clustering of sepal dimensions

The code in the cell below uses the function `hcut()` to perform a hierarchical clustering analysis of the sepal dimensions in the `iris_df` dataframe. It then uses the function `fviz_dend()` to generate a dendrogram of the results.


In [None]:
# Example 13: Hierarchical clustering of sepal dimensions

# Use hcut(), which computes the hierarchical clustering and cuts the tree into k groups
sepal_hc.cut <- hcut(iris_df[ , 1:2], k = 3, hc_method = "complete")

# Visualize dendrogram
fviz_dend(sepal_hc.cut, 
          guides = "none",
          show_labels = FALSE,
          main = "Dendrogram of Sepal Dimensions",
          k_colors = c("green", "red", "blue"),
          rect = FALSE)
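
For comparison, the same kind of analysis can be sketched with base R only, which makes the individual steps visible: build a distance matrix, cluster it, then cut the tree. This is an illustration of the general approach, not a description of how `hcut()` is implemented internally. It assumes R's built-in `iris` dataset as a stand-in for `iris.csv`, and `dist_mat`, `hc` and `groups` are names made up for this sketch.

```r
# Sketch: agglomerative hierarchical clustering with base R.
# Assumption: R's built-in `iris` data stands in for iris.csv.
sepal_demo <- iris[ , 1:2]                       # sepal length and sepal width

# Step 1: matrix of Euclidean distances between every pair of flowers
dist_mat <- dist(sepal_demo, method = "euclidean")

# Step 2: bottom-up (agglomerative) clustering with complete linkage
hc <- hclust(dist_mat, method = "complete")

# Step 3: cut the tree into 3 groups and count the flowers in each
groups <- cutree(hc, k = 3)
print(table(groups))

# Step 4: draw the dendrogram with base graphics
plot(hc, labels = FALSE, hang = -1,
     main = "Dendrogram of Sepal Dimensions (base R)",
     xlab = "", sub = "")
```

Notice that `hclust()` only receives the distance matrix, never the original measurements. Any valid distance measure could be substituted in Step 1.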



### **Exercise 13:** Hierarchical clustering of petal dimensions

Write the R code in the cell below to apply the function `hcut()` to the _petal_ dimensions in the `iris_df` dataframe. Then use the function `fviz_dend()` to generate a dendrogram of the results.


In [None]:
# Insert your code for Exercise 13 here



If your code is correct you should see the following graph.

![_](Ex13.jpg)