# Finding the Best k

In this activity, you’ll apply the elbow method to iteratively run the K-means algorithm and find the optimal number of clusters, or value for `k`.

## Instructions

1. Read in the `option_trades.csv` file from the Resources folder and create a DataFrame. Use the “date” column to create the DateTime Index. Be sure to include parameters for `parse_dates` and `infer_datetime_format`.

    > **Note** The data provided for this activity contains the prices of various stock options on the S&P 500, measured every four hours. The options differ in several characteristics, including the time to expiration.

2. Create two lists: one for the range of `k` values to analyze (`range(1, 11)`, which covers k = 1 through 10) and another to hold the list of inertia scores.

3. For each instance of k, define and fit a K-means model, and append the model’s inertia to the empty inertia list that you created in Step 2.

4. Store the `k` and inertia lists in a DataFrame called `df_elbow_data`.

5. Using hvPlot, plot the `df_elbow_data` DataFrame to visualize the elbow curve. Be sure to style and format your plot. 

6. Answer the following question: Considering the plot, what’s the best number of clusters to choose, or value of k?

## References

[scikit-learn Python Library](https://scikit-learn.org)

[scikit-learn ML Algorithms](https://scikit-learn.org/stable/user_guide.html)

[scikit-learn K-means](https://scikit-learn.org/stable/modules/clustering.html#k-means)




In [6]:
# Import the required libraries and dependencies
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans

## Step 1: Read in the `option_trades.csv` file from the Resources folder and create a DataFrame. Use the “date” column to create the DateTime Index. Be sure to include parameters for `parse_dates` and `infer_datetime_format`.

In [7]:
# Read the CSV file into a Pandas DataFrame
# Use the date column to create the DateTime Index
df_options = pd.read_csv(Path('../Resources/option_trades.csv'), index_col='date', parse_dates=True, infer_datetime_format=True)

# Review the DataFrame
df_options

date,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
2020-08-04 08:30:00,195.631965,210.025058,202.829513,221.569809,215.823048,212.112938,197.524908,214.564618,207.460115,209.855990,...,212.817158,201.262083,213.933774,206.132907,219.661568,204.972118,199.161883,194.000531,201.362749,205.688881
2020-08-04 11:00:00,218.833616,193.663638,182.807302,213.005657,194.657965,216.787274,201.662100,215.953316,201.586270,204.233793,...,218.560756,203.906526,196.645644,189.943663,196.537013,215.602311,217.919553,195.033360,202.346823,209.713289
2020-08-04 15:00:00,222.549239,200.632362,204.053803,198.749230,193.896719,201.005404,199.516591,209.182859,205.425138,197.457472,...,202.110909,219.896820,189.815097,198.069253,184.975622,198.668261,189.010191,204.879033,185.872788,196.961774
2020-08-05 08:30:00,177.901221,167.170212,178.674226,180.081992,197.030368,182.861254,182.138259,163.847409,175.976501,170.643134,...,173.560308,165.625163,177.090720,193.282793,187.996491,172.252274,183.706807,191.109464,179.242510,181.603642
2020-08-05 11:00:00,180.847294,186.696453,184.825757,180.116009,190.997511,177.779359,180.832512,173.574245,174.426271,148.636061,...,185.786780,171.388340,169.806288,168.503200,198.223226,183.767643,183.771038,203.553074,187.438263,155.905713
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-11-03 10:00:00,81.939634,88.430788,86.270229,88.109898,89.688909,87.261466,95.978367,87.265970,83.753496,97.294381,...,94.226554,91.973535,100.917195,86.636626,81.259035,77.619161,77.066175,78.276264,70.347567,88.658503
2020-11-03 14:00:00,81.734849,81.312948,93.025768,78.743969,78.917070,88.522835,78.838711,92.795458,87.478699,80.863024,...,101.713473,92.470463,91.297655,92.924256,99.606585,87.341789,95.643514,66.443985,75.685857,72.340522
2020-11-04 08:30:00,44.059735,48.110673,35.495850,41.166459,58.969224,44.610467,34.738626,22.289603,47.994218,34.954226,...,48.444567,47.579644,39.239714,58.670127,35.434913,63.176094,60.078842,25.258316,32.334332,58.251580
2020-11-04 10:00:00,36.935261,52.366055,55.049215,37.831209,33.176954,41.599295,51.847649,39.298211,43.193773,25.485094,...,36.504154,43.218135,53.844972,57.466938,48.939584,25.891143,30.721429,44.619639,38.122299,44.135863
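To confirm that `index_col` together with `parse_dates=True` produces a DateTime Index, the same call can be tried on a small in-memory CSV. This is a minimal sketch: the three rows and two columns below are made-up sample values for illustration (the real file has columns labeled 0 through 41), and `df_sample` is a hypothetical name. Note that in pandas 2.0 and later `infer_datetime_format` is deprecated, so the sketch omits it.

```python
import io

import pandas as pd

# Hypothetical sample rows for illustration only; not the real option data
csv_text = """date,0,1
2020-08-04 08:30:00,195.6,210.0
2020-08-04 11:00:00,218.8,193.7
2020-08-04 15:00:00,222.5,200.6
"""

# Use the date column as the index and parse it as datetimes
df_sample = pd.read_csv(io.StringIO(csv_text), index_col="date", parse_dates=True)

# The index should be a DatetimeIndex rather than plain strings
print(type(df_sample.index).__name__)
print(df_sample.index.dtype)
```

If the index type prints as `Index` with an `object` dtype instead, the dates were not parsed and any time-based slicing will fail.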


## Step 2: Create two lists: one for the range of `k` values to analyze (`range(1, 11)`, which covers k = 1 through 10) and another to hold the list of inertia scores.

In [8]:
# Create a list for the range of k's to analyze in the elbow plot
# range(1, 11) covers k = 1 through 10 (the stop value is excluded)
k = list(range(1, 11))


In [9]:
# Create an empty list to hold inertia scores
inertia = []


## Step 3: For each instance of k, define and fit a K-means model, and append the model’s inertia to the empty inertia list that you just created.

In [14]:
# For each k, define and fit a K-means model and append its inertia to the above list
# Hint: This will require the creation of a for loop. 
for i in k:
    model = KMeans(n_clusters=i, random_state=0)
    model.fit(df_options)
    inertia.append(model.inertia_)

    
# View the inertia list
inertia



[10804651.957374888,
 3367798.7347745816,
 1660546.9227245785,
 1247312.1570758787,
 935906.6738774017,
 798592.8554840287,
 719132.4847767286,
 661257.2363072736,
 616441.4716543406,
 576976.3355050476]
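Inertia is the sum of squared distances from each sample to the centroid of its assigned cluster, which is why the values above shrink as k grows. The sketch below checks that definition by hand. It assumes synthetic data from `make_blobs` rather than the option prices, and `manual_inertia` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data for illustration only (not the option_trades data)
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each sample
assigned_centers = model.cluster_centers_[model.labels_]

# Sum of squared distances from each sample to its centroid
manual_inertia = float(np.sum((X - assigned_centers) ** 2))

print(manual_inertia, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding.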

## Step 4: Store the `k` and inertia lists in a DataFrame called `df_elbow_data`.

In [15]:
# Create a dictionary with the data to plot the Elbow curve
elbow_data = {
    "k": k,
    "inertia": inertia
}


In [17]:
# Create a DataFrame from the dictionary holding the values for k and inertia.
df_elbow_data = pd.DataFrame(elbow_data)


## Step 5: Using hvPlot, plot the `df_elbow_data` DataFrame to visualize the elbow curve. Be sure to style and format your plot.

In [20]:
# Plot the elbow curve using hvPlot.
df_elbow_data.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)


## Step 6: Answer the following question:

**Question** Considering the plot, what’s the best number of clusters to choose, or value of k?

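**Answer** The elbow appears at k = 3. Inertia falls steeply from k = 1 (about 10.8 million) to k = 3 (about 1.66 million) and declines much more gradually after that.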
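A visual reading of the elbow curve can be cross-checked numerically. The sketch below takes the inertia values printed in Step 3 (rounded here to whole numbers) and computes the percentage drop in inertia from each k to the next. The names `df_check` and `pct_drop` are illustrative choices, not part of the activity.

```python
import pandas as pd

k = list(range(1, 11))

# Inertia values from Step 3, rounded to the nearest whole number
inertia = [
    10804652, 3367799, 1660547, 1247312, 935907,
    798593, 719132, 661257, 616441, 576976,
]

df_check = pd.DataFrame({"k": k, "inertia": inertia})

# Percentage decrease in inertia relative to the previous k
df_check["pct_drop"] = -df_check["inertia"].pct_change() * 100

print(df_check.round(1))
```

The drop is above 50% at k = 2 and k = 3 and falls to about 25% at k = 4, which is consistent with an elbow at k = 3.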