# Assignment 12 - Case Study 1

## <u>Case Study 1</u>: Clustering "Large" Artificial Datasets with Mini-Batch k-Means

In this case study in assignment 12, we will be using the same dataset that we used in assignment 11 with BIRCH.

We are again dealing with mostly the same issues as before, which are described below.

### Issue 1: "Working Memory Constraints"
The 30 csv files (i.e., batch files) in the attached data folder each contain 100 rows of a two-dimensional dataset. We would like to create an insightful clustering of the entire dataset. However, we will assume THROUGHOUT THIS WHOLE ASSIGNMENT that we can only read in one of these csvs at a time. In this analysis we will assume that a dataframe of **over 300 rows** is "too large" to store in the Jupyter notebook working memory. (This is not actually the case, but assume it for now.)

Thus, whenever you read in one of these csv files, you must either delete it or overwrite it with another, so you are not storing two or more of these datasets in your working memory at any given time.
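
As a minimal sketch of this read-then-overwrite pattern, the example below writes three small synthetic csv batches to a temporary folder and then reads them back one at a time. The file names (`batch_0.csv`, ...), the column names `x` and `y`, and the temporary folder are assumptions for illustration only; the files in the attached data folder may be named differently.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the attached data folder: three synthetic csv batches.
data_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
for i in range(3):
    batch = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"])
    batch.to_csv(os.path.join(data_dir, f"batch_{i}.csv"), index=False)
del batch

rows_seen = 0
for i in range(3):
    # Re-using the name `df` overwrites the previous batch,
    # so only one batch dataframe is held at any given time.
    df = pd.read_csv(os.path.join(data_dir, f"batch_{i}.csv"))
    rows_seen += len(df)

del df  # drop the last batch once we are done with it
```

Overwriting the same variable name on each pass and deleting it after the loop are both ways to satisfy the constraint.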

### Issue 2: We KNOW these batches are not random samples!

We will also assume that we know these 30 batches are not random samples, and thus that no single batch is representative of the whole dataset of 3000 observations.


### Issue 3: We do not know how many clusters are in the entire dataset.

Because we cannot store the entire dataset in our Jupyter notebook working memory, we cannot use t-SNE plots on the whole dataset like we did in the past. We will try to figure this out in this analysis **in a different way than how we did in assignment 11**.

### Issue 4: We do not know the underlying clustering structure of the dataset.

Similarly, because we cannot store the entire dataset in our Jupyter notebook working memory, we cannot use t-SNE plots on the whole dataset like we did in the past. We will try to figure this out in this analysis **in a different way than how we did in assignment 11**.



### Imports

<hr>

## <u>Tutorials</u>:

You may need the following tutorial information to complete this case study.

### Creating an Empty Dataframe

We can create an empty dataframe by using the **pd.DataFrame()** function and supplying just the intended column names.

In [1]:
import pandas as pd
tmp1 = pd.DataFrame(columns=['x','y'])
tmp1

Unnamed: 0,x,y


### Vertically Concatenating Two Dataframes

We can vertically concatenate two dataframes (stacking the rows of one below the other) by using the **pd.concat()** function with the **axis=0** parameter.

In [2]:
tmp2 = pd.DataFrame({'x': [1,3,5,7], 'y':[9,0,-1,10]})
tmp2

Unnamed: 0,x,y
0,1,9
1,3,0
2,5,-1
3,7,10


In [3]:
pd.concat([tmp1,tmp2], axis=0)

Unnamed: 0,x,y
0,1,9
1,3,0
2,5,-1
3,7,10


### Random Sample of a Dataframe

We can randomly sample the rows of a dataframe (without replacement) by using the **.sample()** method. We can specify the number of rows to sample as well as a random state, which makes the sample reproducible.

In [4]:
sampledf = tmp2.sample(3, random_state=100)
sampledf

Unnamed: 0,x,y
2,5,-1
1,3,0
3,7,10


### Index of a Dataframe

We can extract the index of a given dataframe by using the **.index** attribute.

In [5]:
sampledf_index = sampledf.index
sampledf_index

Int64Index([2, 1, 3], dtype='int64')

### Setting the Index of a Dataframe

When creating a dataframe, in addition to defining the columns, we can also define the index values.


In [6]:
tmp3 = pd.DataFrame({'x': [8,9,10], 'y':[11,12,13]}, index=sampledf_index)
tmp3

Unnamed: 0,x,y
2,8,11
1,9,12
3,10,13


<hr>

## 1. Initial Implementation of Mini-Batch k-Means

In this analysis, we will read in each of the 30 csv data batches into the mini-batch k-means algorithm. We know from assignment 11 that these batches are not random samples of the whole dataset.

### 1.1. Reassignment Ratio

First, given that we know that our batches will not be random, should we set our reassignment ratio to be higher or lower? Explain.

### 1.2. Algorithm Instantiation

Now, instantiate your Mini-Batch k-Means model with the following specifications:
* the number of clusters you selected in part 2
* a reassignment ratio of 0.4
* a random state of 100.
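
The specifications above could be passed to scikit-learn's `MiniBatchKMeans` roughly as follows. This is only a sketch: the value `n_clusters = 3` is a placeholder assumption and should be replaced with the number of clusters you actually selected.

```python
from sklearn.cluster import MiniBatchKMeans

# Placeholder assumption: substitute the number of clusters you selected.
n_clusters = 3

mbk = MiniBatchKMeans(
    n_clusters=n_clusters,
    reassignment_ratio=0.4,
    random_state=100,
)
```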


### 1.3. Reading in the Batches

Then, one at a time, read in each of your csv files and update the mini-batch k-means algorithm with this batch of data. Once you are finished updating the algorithm with this dataframe, either delete it or overwrite it.
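
One way to update the model batch by batch is the `partial_fit()` method of `MiniBatchKMeans`. The sketch below uses synthetic, deliberately non-random batches written to a temporary folder; the file names, the column names `x` and `y`, and the choice of 3 clusters are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in data: each batch is drawn around a single centre,
# so no batch is a random sample of the whole dataset.
data_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (0, 5)]
n_batches = 6
for i in range(n_batches):
    pts = rng.normal(loc=centers[i % 3], scale=0.5, size=(100, 2))
    pd.DataFrame(pts, columns=["x", "y"]).to_csv(
        os.path.join(data_dir, f"batch_{i}.csv"), index=False
    )

mbk = MiniBatchKMeans(n_clusters=3, reassignment_ratio=0.4, random_state=100)

for i in range(n_batches):
    df = pd.read_csv(os.path.join(data_dir, f"batch_{i}.csv"))  # overwrites the previous batch
    mbk.partial_fit(df[["x", "y"]])  # one incremental update with this batch

del df
```

After the loop, `mbk.cluster_centers_` holds the centroids as updated by every batch in turn.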

### 1.4. Random Sample of the Full Dataset

Ideally, we would like to collect a random sample of size 300 from the full dataset. One at a time, for each of the csv batches, do the following.
* Read in the csv into a dataframe.
* Collect a random sample of 10 observations from this dataframe (without replacement), using a random state of 100 for each sample.
* Concatenate this random sample dataframe to a "total sample dataframe" which will eventually contain 300 observations (= your 30 random samples of size 10).

Once you are finished with a given batch dataframe, either delete it or overwrite it.

Show your "total sample dataframe" below.
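
The sampling steps above can be sketched as follows, again with synthetic batches, hypothetical file names, and assumed column names `x` and `y`. This variant gathers the small samples in a list and concatenates them once at the end, rather than growing an empty dataframe step by step; either approach keeps at most one full batch in memory.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for the data folder (3 batches instead of 30).
data_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
n_batches = 3
for i in range(n_batches):
    pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"]).to_csv(
        os.path.join(data_dir, f"batch_{i}.csv"), index=False
    )

samples = []
for i in range(n_batches):
    df = pd.read_csv(os.path.join(data_dir, f"batch_{i}.csv"))  # overwrites the previous batch
    samples.append(df.sample(10, random_state=100))  # 10 rows, without replacement
del df

# Stack the per-batch samples into the "total sample dataframe".
total_sample = pd.concat(samples, axis=0)
```

In this synthetic setup every batch has the same default index and the same random state is used each time, so the same ten index labels recur in every sample. The index of `total_sample` is therefore neither in order nor unique, which matters when aligning it with other dataframes later.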

### 1.5. Cluster Labels for your Random Sample

Next, get the mini-batch k-means cluster labels for each of the 300 observations in your "total sample dataframe."
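
A sketch of this step, assuming the model has already been updated with the batches: the `predict()` method assigns each row to its nearest learned centroid. The model, the data, and the column names below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Hypothetical stand-ins for an already-updated model and the total sample dataframe.
train = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"])
mbk = MiniBatchKMeans(n_clusters=3, random_state=100).partial_fit(train)
total_sample = train.sample(10, random_state=100)

# Keep the sample's index on the labels so that they stay aligned with its rows.
labels = pd.Series(
    mbk.predict(total_sample[["x", "y"]]),
    index=total_sample.index,
    name="cluster",
)
```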

### 1.6. t-SNE Algorithm

Finally, using 6 different perplexity values and at least two random states for each perplexity value, map your random sample of 300 observations onto a two-dimensional dataset with the t-SNE algorithm. Show your projected coordinates in a scatterplot for each combination of random states and perplexity value. Color code the points in your t-SNE plots by the mini-batch k-means cluster labels.
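
One possible way to organize the grid of t-SNE runs is to loop over the perplexity values and random states and store each projection in a dictionary. The perplexity values, random states, and the synthetic 60-row sample below are assumptions chosen only to keep the sketch quick; each perplexity must be smaller than the number of rows.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Hypothetical stand-in for the total sample dataframe.
total_sample = pd.DataFrame(rng.normal(size=(60, 2)), columns=["x", "y"])

perplexities = [2, 5, 10, 15, 20, 25]  # assumed values
random_states = [100, 200]             # assumed values

embeddings = {}
for perp in perplexities:
    for rs in random_states:
        tsne = TSNE(n_components=2, perplexity=perp, random_state=rs)
        coords = tsne.fit_transform(total_sample[["x", "y"]])
        # Store the projection with the same index as the sample.
        embeddings[(perp, rs)] = pd.DataFrame(
            coords, columns=["tsne_1", "tsne_2"], index=total_sample.index
        )
```

Each entry of `embeddings` can then be drawn as its own scatterplot, colored by the cluster labels.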

**Important Hint:**
The output of **tsne.fit_transform()** is a numpy array. When we created t-SNE plots in the past, we converted this numpy array into a dataframe by using the **pd.DataFrame()** function. When you do this here, you should set the index of this dataframe to be the same as the index of your 'total sample dataframe'.

If you do not, concatenating the t-SNE coordinates dataframe with the 'total sample dataframe' will cause a problem, because the indexes of the two dataframes will not match. This is because the index of the 'total sample dataframe' is no longer in the default order 0, 1, 2, ...: each sampled row keeps the index value it had in its batch.
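
The hint can be illustrated with a small sketch. The data, the column names `tsne_1` and `tsne_2`, and the perplexity value are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
full = pd.DataFrame(rng.normal(size=(100, 2)), columns=["x", "y"])

# Hypothetical stand-in for the total sample: its index is not 0, 1, 2, ...
total_sample = full.sample(40, random_state=100)

tsne = TSNE(n_components=2, perplexity=10, random_state=100)
coords = tsne.fit_transform(total_sample[["x", "y"]])  # plain numpy array, no index

# Give the coordinates the same index as the sample so the rows line up.
tsne_df = pd.DataFrame(coords, columns=["tsne_1", "tsne_2"], index=total_sample.index)
combined = pd.concat([total_sample, tsne_df], axis=1)
```

Because both dataframes share the same index, `combined` has one row per sampled observation with no missing values.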

### 1.7. t-SNE Plot Interpretation

Use your t-SNE plots above to answer the following questions.

1. How many clusters do the t-SNE plots above suggest are in the random sample of 300 observations?
2. To what extent did the results of our mini-batch k-means algorithm agree with the clustering structure suggested by the t-SNE plots above? Explain.

Choose the combination of random state and perplexity value that best reflects your answer to 1.7.1 above and show this plot below.

## 2. Parameter Selection for Mini-Batch k-Means

### 2.1. Testing out Reassignment Ratios

Next, we would like to see if we can select a reassignment ratio in which the mini-batch k-means cluster labels will agree *more* with the clusters suggested by the t-SNE plots.

For each of the following reassignment ratios (0.01, 0.1, 0.2, 0.4, and 0.8), do the following:
1. Instantiate mini-batch k-means with:
    * 3 clusters
    * a batch size of 100
    * the reassignment ratio that you are considering
    * a random state of 100
2. Read in each of the 30 csv batches and update the mini-batch k-means model that you instantiated with each of these batches, one at a time. Again, make sure that you delete or overwrite a batch dataframe when you are done with it.
3. Using your final fitted mini-batch k-means results for that given reassignment ratio, get the cluster labels for your total random sample of 300 observations.
4. Plot your t-SNE plot of this total random sample, color coding your points by the given cluster labels.
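
Steps 1 to 3 above can be sketched as a loop over the reassignment ratios. The synthetic batches, file names, and column names are assumptions for illustration, and the plotting in step 4 is left out.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in data: 6 non-random batches, each drawn around one centre.
data_dir = tempfile.mkdtemp()
rng = np.random.default_rng(0)
centers = [(0, 0), (5, 5), (0, 5)]
paths = []
for i in range(6):
    pts = rng.normal(loc=centers[i % 3], scale=0.5, size=(100, 2))
    path = os.path.join(data_dir, f"batch_{i}.csv")
    pd.DataFrame(pts, columns=["x", "y"]).to_csv(path, index=False)
    paths.append(path)

# Total random sample: 10 rows from each batch, one batch in memory at a time.
total_sample = pd.concat(
    [pd.read_csv(p).sample(10, random_state=100) for p in paths], axis=0
)

ratios = [0.01, 0.1, 0.2, 0.4, 0.8]
labels_by_ratio = {}
for ratio in ratios:
    mbk = MiniBatchKMeans(
        n_clusters=3, batch_size=100, reassignment_ratio=ratio, random_state=100
    )
    for p in paths:
        df = pd.read_csv(p)              # overwrites the previous batch
        mbk.partial_fit(df[["x", "y"]])  # update the model with this batch
    del df
    # Cluster labels for the total sample under this reassignment ratio.
    labels_by_ratio[ratio] = mbk.predict(total_sample[["x", "y"]])
```

Each array in `labels_by_ratio` can then be used to color the same t-SNE projection of the total sample, which makes the ratios directly comparable.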



### 2.2. Selecting a Reassignment Ratio

Of the reassignment ratios that you considered, which one produced clustering results that agreed most with the clustering structure suggested by your t-SNE plot?