# Finding k

You have been analyzing the pricing data on one of the stocks your firm owns. Specifically, you were examining the relationship between the day's trading volume and the spread between the high and low trading price. 

Use the elbow method to determine the optimal number of clusters, `k`, that should be used to segment these trades. Once the elbow curve has been established, evaluate the two most likely values for `k` using the K-means algorithm and a scatter plot.


In [4]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
%matplotlib inline

## Read in the `stock_data.csv` file from the Resources folder and create a DataFrame. Set the “date” column to create the DatetimeIndex. Be sure to include parameters for `parse_dates` and `infer_datetime_format`.

In [9]:
# Read in the CSV file as a Pandas DataFrame
spread_df = pd.read_csv(Path("../Resources/stock_data.csv"),
    index_col="date",
    parse_dates=True, 
    infer_datetime_format=True   
)

# Review the DataFrame

spread_df


date,close,volume,open,high,low,returns,hi_low_spread
2009-04-30,3.61,18193730,3.55,3.73,3.53,0.028490,0.20
2009-05-01,3.82,16233940,3.55,3.90,3.55,0.058172,0.35
2009-05-04,4.26,21236940,3.90,4.30,3.83,0.115183,0.47
2009-05-05,4.32,16369170,4.36,4.39,4.11,0.014085,0.28
2009-05-06,4.31,15075630,4.45,4.45,4.12,-0.002315,0.33
...,...,...,...,...,...,...,...
2019-04-23,27.97,41583740,28.18,28.49,27.79,-0.007452,0.70
2019-04-24,28.46,51487330,28.10,28.85,27.93,0.017519,0.92
2019-04-25,27.66,56709000,28.67,28.86,27.36,-0.028110,1.50
2019-04-26,27.88,48736860,27.66,27.90,27.05,0.007954,0.85
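
As a quick check on the date handling, the sketch below reads a small in-memory sample and confirms that `index_col` plus `parse_dates=True` produces a `DatetimeIndex`. The two rows are copied from the output above; the names `sample_csv` and `sample_df` are illustrative only. It leaves out `infer_datetime_format` because, to my knowledge, pandas 2.0 and later deprecate that parameter and emit a warning when it is passed, so treat that omission as an assumption about your installed version.

```python
import io

import pandas as pd

# Two rows in the same layout as stock_data.csv (in-memory stand-in for the file)
sample_csv = io.StringIO(
    "date,close,volume,open,high,low,returns,hi_low_spread\n"
    "2009-04-30,3.61,18193730,3.55,3.73,3.53,0.028490,0.20\n"
    "2009-05-01,3.82,16233940,3.55,3.90,3.55,0.058172,0.35\n"
)

# index_col selects the "date" column; parse_dates=True converts it to datetimes
sample_df = pd.read_csv(sample_csv, index_col="date", parse_dates=True)

# Confirm that the index was parsed as dates rather than left as strings
is_datetime_index = isinstance(sample_df.index, pd.DatetimeIndex)
print(is_datetime_index)
print(sample_df.index.min(), sample_df.index.max())
```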


## Scale the data using the `StandardScaler` to normalize the DataFrame values

In [10]:
# Scale the DataFrame data
scaler = StandardScaler()

scaled_array = scaler.fit_transform(spread_df.values)
scaled_array

array([[-0.67997366, -0.51148838, -0.69052414, ..., -0.68293938,
         0.73196558, -0.35535951],
       [-0.64385227, -0.56986441, -0.69052414, ..., -0.6794055 ,
         1.5365806 ,  0.04340536],
       [-0.56816936, -0.42084066, -0.63034486, ..., -0.62993125,
         3.08205423,  0.36241725],
       ...,
       [ 3.45678542,  0.63576126,  3.62862863, ...,  3.5276729 ,
        -0.80234372,  3.10060268],
       [ 3.49462688,  0.3982961 ,  3.45496843, ...,  3.47289783,
         0.17526532,  1.37262158],
       [ 3.46194562,  0.26511122,  3.49623422, ...,  3.55241003,
        -0.22508463,  0.81435077]])
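
To make the scaling step less opaque, here is a minimal sketch showing that `StandardScaler` matches a manual z-score: subtract each column's mean and divide by its population standard deviation. The four rows are the `close` and `volume` values from the first rows of the data; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales: close price and volume
sample = np.array([
    [3.61, 18193730.0],
    [3.82, 16233940.0],
    [4.26, 21236940.0],
    [4.32, 16369170.0],
])

scaled = StandardScaler().fit_transform(sample)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

matches_manual = np.allclose(scaled, manual)
print(matches_manual)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Scaling matters here because K-means relies on Euclidean distance, so an unscaled `volume` column (in the tens of millions) would otherwise dominate every other feature.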

## Create a new DataFrame with the scaled data

In [11]:
# Create a DataFrame with the scaled data.
# Hint: You can use the original DataFrame's `columns` and `index` attributes to set the column names and the index of the new DataFrame.
spread_scaled_df = pd.DataFrame(columns=spread_df.columns, data = scaled_array, index=spread_df.index)

# Show sample data
spread_scaled_df

date,close,volume,open,high,low,returns,hi_low_spread
2009-04-30,-0.679974,-0.511488,-0.690524,-0.670844,-0.682939,0.731966,-0.355360
2009-05-01,-0.643852,-0.569864,-0.690524,-0.642324,-0.679406,1.536581,0.043405
2009-05-04,-0.568169,-0.420841,-0.630345,-0.575219,-0.629931,3.082054,0.362417
2009-05-05,-0.557849,-0.565836,-0.551252,-0.560121,-0.580457,0.341459,-0.142685
2009-05-06,-0.559569,-0.604367,-0.535777,-0.550055,-0.578690,-0.103095,-0.009763
...,...,...,...,...,...,...,...
2019-04-23,3.510107,0.185227,3.544378,3.482952,3.603651,-0.242357,0.973857
2019-04-24,3.594391,0.480224,3.530622,3.543346,3.628388,0.434556,1.558712
2019-04-25,3.456785,0.635761,3.628629,3.545024,3.527673,-0.802344,3.100603
2019-04-26,3.494627,0.398296,3.454968,3.383972,3.472898,0.175265,1.372622


## Create two lists: one to hold the inertia scores and another for the k values to analyze (1 through 10, that is, `range(1, 11)`).

In [12]:
# Create a list to store inertia values
inertia = []

# Create a list to store the values of k
k = list(range(1, 11))

## Using a for-loop to evaluate each instance of k, define a K-means model, fit the K-means model based on the scaled DataFrame, and append the model’s inertia to the empty inertia list that you created in the previous step.

In [13]:
# Create a for-loop where each value of k is evaluated using the K-means algorithm
for i in k:
    # Define the model with i clusters
    k_model = KMeans(n_clusters=i, random_state=1)

    # Fit the model using the spread_scaled_df DataFrame
    k_model.fit(spread_scaled_df)

    # Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
    inertia.append(k_model.inertia_)
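
For intuition about what the loop collects, the sketch below recomputes inertia by hand as the sum of squared distances from each point to the centroid of its assigned cluster, then compares it with `inertia_`. The data is synthetic (two made-up blobs), and `n_init=10` is set explicitly only so the sketch behaves the same across scikit-learn versions; neither comes from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two well-separated blobs (hypothetical, for illustration only)
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centroids) ** 2).sum()

inertia_matches = bool(np.isclose(manual_inertia, model.inertia_))
print(inertia_matches)
```

Because adding clusters can only shorten these distances, inertia falls as k grows, which is why the elbow method looks for the point where the decline levels off rather than for the minimum.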

## Store the values for k and the inertia in a Dictionary called `elbow_data`. Use `elbow_data` to create a Pandas DataFrame called `df_elbow`.

In [18]:
# Create a Dictionary that holds the list values for k and inertia
elbow_data = {"k": k, "inertia": inertia}

# Create a DataFrame using the elbow_data Dictionary
df_elbow = pd.DataFrame(elbow_data)

# Review the DataFrame
df_elbow.head()

,k,inertia
0,1,17612.0
1,2,8696.386368
2,3,6260.146033
3,4,5332.582793
4,5,4694.732543


## Using hvPlot, plot the `df_elbow` DataFrame to visualize the elbow curve.

In [19]:
# Plot the DataFrame
df_elbow.hvplot.line(
    x= "k",
    y="inertia",
    xticks = k,
    title = "Elbow Curve"
)
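
Reading an elbow by eye can be subjective. One rough way to put numbers on it, sketched here under the assumption that a percentage drop is a reasonable proxy, is to compute how much inertia falls at each step. The inertia values are copied from the first five rows of the `df_elbow` output; `elbow_sample` and `pct_drop` are illustrative names.

```python
import pandas as pd

# First five inertia values copied from the df_elbow output
elbow_sample = pd.DataFrame({
    "k": [1, 2, 3, 4, 5],
    "inertia": [17612.0, 8696.386368, 6260.146033, 5332.582793, 4694.732543],
})

# Percentage reduction in inertia relative to the previous value of k
elbow_sample["pct_drop"] = -elbow_sample["inertia"].pct_change() * 100

print(elbow_sample.round(2))
```

The drops shrink from roughly 51% at k=2 to about 28% at k=3, 15% at k=4, and 12% at k=5, which suggests the curve begins to flatten around k=3 or k=4.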

## Perform the following tasks for each of the two most likely values of `k`:

* Define a K-means model using `k` to define the clusters, fit the model, and make predictions. Then add the prediction values to a copy of the scaled DataFrame called `spread_df_predictions`.

* Plot the clusters. The x-axis should reflect the "hi_low_spread", and the y-axis should reflect the "close" price.

In [20]:
# Define the model with the lower value of k clusters
# Use a random_state of 1 to generate the model
model = KMeans(n_clusters=3, random_state=1)

# Fit the model
model.fit(spread_scaled_df)

# Make predictions
k_lower = model.predict(spread_scaled_df)

# Create a copy of the DataFrame and name it as spread_df_predictions
spread_df_predictions = spread_scaled_df.copy()

# Add a class column with the labels to the spread_df_predictions DataFrame
spread_df_predictions['clusters_lower'] = k_lower

spread_df_predictions

date,close,volume,open,high,low,returns,hi_low_spread,clusters_lower
2009-04-30,-0.679974,-0.511488,-0.690524,-0.670844,-0.682939,0.731966,-0.355360,0
2009-05-01,-0.643852,-0.569864,-0.690524,-0.642324,-0.679406,1.536581,0.043405,0
2009-05-04,-0.568169,-0.420841,-0.630345,-0.575219,-0.629931,3.082054,0.362417,0
2009-05-05,-0.557849,-0.565836,-0.551252,-0.560121,-0.580457,0.341459,-0.142685,0
2009-05-06,-0.559569,-0.604367,-0.535777,-0.550055,-0.578690,-0.103095,-0.009763,0
...,...,...,...,...,...,...,...,...
2019-04-23,3.510107,0.185227,3.544378,3.482952,3.603651,-0.242357,0.973857,1
2019-04-24,3.594391,0.480224,3.530622,3.543346,3.628388,0.434556,1.558712,1
2019-04-25,3.456785,0.635761,3.628629,3.545024,3.527673,-0.802344,3.100603,1
2019-04-26,3.494627,0.398296,3.454968,3.383972,3.472898,0.175265,1.372622,1


In [22]:
# Plot the clusters
spread_df_predictions.hvplot.scatter(
    x = "hi_low_spread",
    y = "close",
    by = "clusters_lower"
)


In [25]:
# Define the model with the higher value of k clusters
# Use a random_state of 1 to generate the model
model = KMeans(n_clusters=4, random_state=1)

# Fit the model
model.fit(spread_scaled_df)

# Make predictions
k_higher = model.predict(spread_scaled_df)

# Create a copy of the DataFrame and name it as spread_df_predictions
spread_df_predictions = spread_scaled_df.copy()

# Add a class column with the labels to the spread_df_predictions DataFrame
spread_df_predictions['clusters_higher'] = k_higher
spread_df_predictions

date,close,volume,open,high,low,returns,hi_low_spread,clusters_higher
2009-04-30,-0.679974,-0.511488,-0.690524,-0.670844,-0.682939,0.731966,-0.355360,0
2009-05-01,-0.643852,-0.569864,-0.690524,-0.642324,-0.679406,1.536581,0.043405,0
2009-05-04,-0.568169,-0.420841,-0.630345,-0.575219,-0.629931,3.082054,0.362417,0
2009-05-05,-0.557849,-0.565836,-0.551252,-0.560121,-0.580457,0.341459,-0.142685,0
2009-05-06,-0.559569,-0.604367,-0.535777,-0.550055,-0.578690,-0.103095,-0.009763,0
...,...,...,...,...,...,...,...,...
2019-04-23,3.510107,0.185227,3.544378,3.482952,3.603651,-0.242357,0.973857,1
2019-04-24,3.594391,0.480224,3.530622,3.543346,3.628388,0.434556,1.558712,1
2019-04-25,3.456785,0.635761,3.628629,3.545024,3.527673,-0.802344,3.100603,1
2019-04-26,3.494627,0.398296,3.454968,3.383972,3.472898,0.175265,1.372622,1


In [30]:
# Plot the clusters
spread_df_predictions.hvplot.scatter(
    x = "hi_low_spread",
    y = "close",
    by = "clusters_higher"
)
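
The two cells above overwrite `spread_df_predictions`, so only one set of labels is available at a time. As an optional variation, this sketch fits both candidate values of k in a loop and keeps each set of labels in its own column, which makes it easy to compare cluster sizes side by side. It runs on synthetic data; `demo_df`, `comparison_df`, and the `clusters_3`/`clusters_4` column names are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled DataFrame (hypothetical values)
rng = np.random.default_rng(1)
demo_df = pd.DataFrame(
    rng.normal(size=(200, 2)), columns=["hi_low_spread", "close"]
)

comparison_df = demo_df.copy()
for n_clusters in (3, 4):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=1)
    # Fit on demo_df (features only) so earlier label columns are never used as features
    comparison_df[f"clusters_{n_clusters}"] = model.fit_predict(demo_df)

# Number of rows assigned to each cluster, for each candidate k
cluster_sizes = {
    n_clusters: comparison_df[f"clusters_{n_clusters}"].value_counts().sort_index()
    for n_clusters in (3, 4)
}
print(cluster_sizes[3])
print(cluster_sizes[4])
```

A very small cluster at the higher k can hint that the extra cluster splits an existing group rather than revealing a new one, though the scatter plots remain the primary evidence here.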


## Answer the following question

**Question:** Considering the plot, what’s the best number of clusters to choose, or value of k? 

**Answer:** # YOUR ANSWER HERE