# Segmenting Customer Data

One of the world's biggest banks has launched a machine learning competition on [Kaggle](https://www.kaggle.com/), an online community of data scientists and machine learning practitioners. The bank wants to improve its marketing campaigns by identifying the optimal number of customer segments among its credit card clients. The $5,000 reward caught your interest, so you decided to enter the competition and put your unsupervised learning skills into practice.

The bank provided a dataset of customer data with ten features. The columns were anonymized with generic names to protect customers' privacy, and the values were already normalized.

Use the starter code to accomplish the following tasks:

1. Load the raw data into a Pandas DataFrame.

2. Use the Elbow Method to determine the optimal number of clusters.

3. Segment the data with K-means using the optimal number of clusters.

In [3]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path

## Part 1: Load the raw data into a Pandas DataFrame

In [4]:
# Set the file path
file_path = Path("../Resources/customers.csv")

# Read the csv file into a pandas DataFrame
customers_df = pd.read_csv(file_path)

# Review the DataFrame
customers_df

,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
0,1.148534,4.606077,2.699069,-2.661824,1.526433,1.236671,0.211421,1.482896,-4.445627,-1.936831
1,-1.149410,-1.650549,2.530167,-3.227088,0.572138,4.162600,-0.291679,-1.237575,3.604765,-1.635689
2,0.332427,-0.887985,-0.309216,0.399891,0.828492,3.641945,-0.916946,-1.978024,1.056772,-1.882747
3,2.245599,3.826309,0.264039,0.095471,1.984380,0.373991,-0.280279,1.602786,-5.993331,-2.258925
4,0.705503,-1.312329,0.895406,-0.405408,1.116187,3.699562,-1.427985,-1.494409,1.156908,-1.434964
...,...,...,...,...,...,...,...,...,...,...
995,1.923516,2.387442,1.746617,-0.850014,1.333114,-0.522750,-0.699195,1.876106,-4.063120,-0.244857
996,-0.760810,-2.490720,1.530053,-1.501746,0.423792,5.947200,-1.271437,-3.398691,4.745373,-1.616856
997,1.259010,2.469579,2.766727,-2.218555,1.203872,0.255983,-0.411843,1.691254,-3.021626,-0.452561
998,-3.063652,-2.770077,2.086373,-3.500722,-0.767900,5.048482,0.444592,-3.050005,7.259299,-1.254483


In [5]:
# Use the Pandas "dtypes" attribute to validate the data types
customers_df.dtypes

feature_1     float64
feature_2     float64
feature_3     float64
feature_4     float64
feature_5     float64
feature_6     float64
feature_7     float64
feature_8     float64
feature_9     float64
feature_10    float64
dtype: object
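The `dtypes` output confirms the column types but says nothing about missing values. A minimal sketch of a null check follows, using a small made-up DataFrame (`sample_df` is a hypothetical stand-in with one deliberately missing value, not the bank's data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for customers_df, with one missing value on purpose
sample_df = pd.DataFrame(
    {
        "feature_1": [1.15, -1.15, 0.33],
        "feature_2": [4.61, np.nan, -0.89],
    }
)

# Count missing values per column
null_counts = sample_df.isna().sum()
print(null_counts)

# Total number of missing cells across the whole DataFrame
total_nulls = int(null_counts.sum())
print(total_nulls)
```

On the real data, `customers_df.isna().sum()` would apply the same check, and `customers_df.info()` reports non-null counts and data types together.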

In [6]:
# Use the Pandas "describe()" function to compute summary statistics
customers_df.describe()

,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,-0.022428,0.805748,1.942896,-2.36403,0.85498,1.232422,0.146269,0.833486,-0.53432,-1.219393
std,2.382021,2.335796,1.411307,1.716566,1.742986,3.250231,1.635576,2.039563,4.211831,1.979172
min,-6.259471,-4.649286,-2.894995,-8.735778,-4.641509,-9.11147,-4.260013,-4.911903,-9.522425,-6.083462
25%,-2.091657,-1.214774,1.026128,-3.438149,-0.23531,-0.333722,-0.967569,-0.894817,-4.129561,-2.505366
50%,0.16167,1.096439,1.905107,-2.437602,1.084556,1.367371,-0.222299,1.519069,-0.536849,-1.706372
75%,2.030005,2.513648,2.851613,-1.22973,2.287268,3.637304,1.061269,2.298862,2.626514,-0.553571
max,6.275723,7.955158,5.897102,4.296552,4.74135,8.705423,7.123969,5.789222,10.047819,5.413623


## Part 2: Use the Elbow Method to determine the optimal number of clusters

In [5]:
# Import the KMeans class from scikit-learn
from sklearn.cluster import KMeans

In [6]:
# Create a list to store the inertia values
inertia = []

# Create a list to set the range of k values to test
k = list(range(1, 11))

In [7]:
# Create a for-loop where each value of k is evaluated using the K-means algorithm
# Fit the model using the "customers_df" DataFrame
# Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
for i in k:
    k_model = KMeans(n_clusters=i, random_state=0)
    k_model.fit(customers_df)
    inertia.append(k_model.inertia_)
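To make the quantity collected in this loop concrete: scikit-learn documents `inertia_` as the sum of squared distances from each sample to its closest cluster center. The sketch below recomputes that by hand on synthetic data and compares it with the fitted model's value. `demo_points` is made up for illustration, and `n_clusters=3` and `n_init=10` are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 200 points, 2 features (not the bank's dataset)
rng = np.random.default_rng(0)
demo_points = rng.normal(size=(200, 2))

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(demo_points)

# Squared distance from every point to every centroid, shape (200, 3)
diffs = demo_points[:, None, :] - model.cluster_centers_[None, :, :]
squared_dist = (diffs ** 2).sum(axis=2)

# Each point contributes its squared distance to the nearest centroid
manual_inertia = squared_dist.min(axis=1).sum()

print(manual_inertia, model.inertia_)
```

Because inertia can only shrink as more centroids are added, the Elbow Method looks for the k where the decrease levels off rather than for the minimum.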

In [8]:
# Define a DataFrame to hold the values for k and the corresponding inertia
new_df = pd.DataFrame({"k": k, "inertia": inertia})

# Review the first five rows of the DataFrame
new_df.head()

,k,inertia
0,1,58103.759171
1,2,32183.537923
2,3,17080.936423
3,4,14890.068176
4,5,12816.235532


In [9]:
# Plot the DataFrame to identify the optimal value for k
# YOUR CODE HERE
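The elbow is usually read off the plot by eye. As a rough numeric cross-check, the sketch below computes how much the inertia falls at each step, using only the five values shown in the output above. `elbow_df` and the `relative_drop` column are names chosen for this illustration.

```python
import pandas as pd

# k and inertia values copied from the output shown above (first five rows only)
elbow_df = pd.DataFrame(
    {
        "k": [1, 2, 3, 4, 5],
        "inertia": [
            58103.759171,
            32183.537923,
            17080.936423,
            14890.068176,
            12816.235532,
        ],
    }
)

# Fractional drop in inertia relative to the previous k (NaN for the first row)
elbow_df["relative_drop"] = -elbow_df["inertia"].pct_change()
print(elbow_df.round(3))
```

On these values the drop is about 45% and 47% for k = 2 and k = 3, then roughly 13% to 14% afterwards, which points to an elbow near k = 3. In the notebook, `new_df.hvplot.line(x="k", y="inertia", title="Elbow Curve", xticks=k)` would draw the curve itself.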

## Part 3: Segment the data with K-means using the optimal number of clusters

In [10]:
# Define the model with optimal number of clusters
model = # YOUR CODE HERE

# Fit the model
# YOUR CODE HERE

# Make predictions
# YOUR CODE HERE

# Create a copy of the customers_df DataFrame
# YOUR CODE HERE

# Add a class column with the labels to the new DataFrame
# YOUR CODE HERE
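The steps in this cell can be sketched as follows. This is one possible shape, not the reference solution: it assumes k = 3 (suggested by the inertia values shown earlier), runs on synthetic stand-in data because `customers.csv` is not available here, and uses the hypothetical names `demo_df`, `predictions_df`, and `customer_segment`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for customers_df: three well-separated blobs, two features
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
demo_df = pd.DataFrame(points, columns=["feature_1", "feature_2"])

# Define the model with the assumed number of clusters
model = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit the model
model.fit(demo_df)

# Make predictions: one integer cluster label per row
segments = model.predict(demo_df)

# Copy the DataFrame so the original stays free of the label column
predictions_df = demo_df.copy()

# Add a column with the cluster labels to the copy
predictions_df["customer_segment"] = segments
print(predictions_df["customer_segment"].value_counts())
```

Working on a copy keeps the label column out of the original features, so the model can be refit later without accidentally clustering on its own output.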

In [11]:
# Plot the clusters using the "feature_1" and "feature_2" columns
# YOUR CODE HERE
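A possible shape for this plot is an hvPlot scatter grouped by the cluster label. The helper below is a sketch: `segment_scatter`, `demo_df`, and the `customer_segment` column are hypothetical names, and hvPlot is imported inside the function so the definition itself does not depend on it.

```python
import pandas as pd


def segment_scatter(df, x, y, label_col):
    """Scatter x against y with one color per value of label_col."""
    import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor)

    return df.hvplot.scatter(x=x, y=y, by=label_col)


# Tiny made-up frame standing in for the labeled customer data
demo_df = pd.DataFrame(
    {
        "feature_1": [1.1, -1.1, 0.3, 2.2],
        "feature_2": [4.6, -1.7, -0.9, 3.8],
        "customer_segment": [0, 1, 1, 0],
    }
)
```

In the notebook, calling `segment_scatter(demo_df, "feature_1", "feature_2", "customer_segment")` as the last expression of a cell would render the plot.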

#### Optional Challenge: Use hvPlot to display the clusters using the `feature_1` column against the other nine columns.

**Hint:** To display all the plot combinations, you can use the [`subplots` parameter](https://hvplot.holoviz.org/user_guide/Subplots.html) of hvPlot.

In [12]:
# Plot the clusters using the "feature_1" column against the other nine columns

# YOUR CODE HERE
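One way to approach the challenge, offered as an alternative to the `subplots` parameter from the hint, is to build one scatter per remaining feature and combine them in a HoloViews layout. The names `other_feature_columns`, `feature_1_grid`, and `customer_segment` are hypothetical, and the plotting libraries are imported inside the function so the column-selection logic can be checked on its own.

```python
import pandas as pd


def other_feature_columns(df, label_col, x_col="feature_1"):
    """Return every column except the x-axis column and the label column."""
    return [col for col in df.columns if col not in (x_col, label_col)]


def feature_1_grid(df, label_col, n_cols=3):
    """Lay out one scatter per remaining feature, each with feature_1 on the x-axis."""
    import holoviews as hv
    import hvplot.pandas  # noqa: F401  (registers the .hvplot accessor)

    plots = [
        df.hvplot.scatter(x="feature_1", y=col, by=label_col)
        for col in other_feature_columns(df, label_col)
    ]
    # Arrange the individual plots in a grid with n_cols columns
    return hv.Layout(plots).cols(n_cols)


# Column names matching the dataset, plus a hypothetical label column
demo_columns = [f"feature_{i}" for i in range(1, 11)] + ["customer_segment"]
demo_df = pd.DataFrame([[0.0] * 10 + [0]], columns=demo_columns)
print(other_feature_columns(demo_df, "customer_segment"))
```

With ten features this yields nine panels, which a three-column layout arranges as a 3 by 3 grid.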