# Segmenting Customers Data with PCA

One of the world's biggest banks launched a machine learning competition in Kaggle, an online community of data scientists and machine learning practitioners. They want to improve their marketing campaigns by identifying the optimal number of customer segments for their clients. They offer a reward of $5,000 that gained your interest, so you decided to put your unsupervised learning skills into practice to participate in the competition.

The bank provided a dataset that consists of customer data that includes ten different features. The data columns were anonymized using generic names to protect customers' privacy, and data values were already normalized.

The PCA technique for dimensional reduction has just come to your attention. At this point, you have already segmented the data based on all of the factors, but are wondering if PCA would alter the segmentation results.

Using the starter code and the customer data provided, reduce the factors to only two dimensions using PCA, determine the optimal value for k using the PCA DataFrame, and then segment the data by using the K-Means algorithm and the optimal value for k. Once these steps are complete, segment the preprocessed customer DataFrame by using the K-means algorithm and that same value for k, and then compare the segmentation results.

In [21]:
# Import the modules
import pandas as pd
import hvplot.pandas
from pathlib import Path
from sklearn.cluster import KMeans

## Read in the CSV file and prepare the Pandas DataFrame

In [50]:
# Read the csv file into a pandas DataFrame
customer_pca = pd.read_csv(
    Path("../Resources/customers.csv")
)

# Review the DataFrame
customer_pca.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10
0,1.148534,4.606077,2.699069,-2.661824,1.526433,1.236671,0.211421,1.482896,-4.445627,-1.936831
1,-1.14941,-1.650549,2.530167,-3.227088,0.572138,4.1626,-0.291679,-1.237575,3.604765,-1.635689
2,0.332427,-0.887985,-0.309216,0.399891,0.828492,3.641945,-0.916946,-1.978024,1.056772,-1.882747
3,2.245599,3.826309,0.264039,0.095471,1.98438,0.373991,-0.280279,1.602786,-5.993331,-2.258925
4,0.705503,-1.312329,0.895406,-0.405408,1.116187,3.699562,-1.427985,-1.494409,1.156908,-1.434964


## Part 1: Use PCA to reduce the dimensionality of the transformed customers DataFrame to 2 principal components

### Step 1: Import the PCA module from SKLearn

In [23]:
# Import the PCA module
from sklearn.decomposition import PCA

### Step 2: Instantiate the instace of the PCA model declaring the number of principal components as 2

In [24]:
# Instantiate the PCA instance and declare the number of PCA variables
pca = PCA(n_components=2)

### Step 3: Using the `fit_transform` function from PCA, fit the PCA model to the `customers_transformed_df` DataFrame. Review the first 5 rows of list data.

In [48]:
# Fit the PCA model on the transformed customers DataFrame
customer_transformed_df = pca.fit_transform(customer_pca_df)

# Review the first 5 rows of the array of list data
customer_transformed_df[:5]

array([[-4.72382358, -0.60489964],
       [ 5.85571568, -1.98331135],
       [ 2.43063042, -3.15456594],
       [-6.96050326, -1.35772617],
       [ 2.47746793, -3.29412896]])

### Step 4: Using the `explained_variance_ratio_` function from PCA, calculate the percentage of the total variance that is captured by the two PCA variables.  

In [49]:
# Calculate the PCA explained variance ratio
pca.explained_variance_ratio_

array([0.55083554, 0.30256389])

**Question:** What is the explained variance ratio captured by the two PCA variables?
    
**Answer:** YOUR ANSWER HERE

### Step 5: Using the `customer_pca` data, create a Pandas DataFrame called `customers_pca_df`. The columns of the DataFrame should be called "PCA1" and "PCA2".

In [39]:
# Create the PCA DataFrame
customer_pca_df = pd.DataFrame(
    customer_,
    columns=["PCA1", "PCA2"]
)

# Review the PCA DataFrame
customer_pca_df.head()

Unnamed: 0,PCA1,PCA2
0,-4.712383,-0.675527
1,5.924676,-1.966166
2,2.521497,-3.195785
3,-6.934801,-1.466815
4,2.568987,-3.331562


## Part 2: Using the `customers_pca_df` Dataframe, utilize the elbow method to determine the optimal value of k.

In [28]:
# Create a a list to store inertia values and the values of k
inertia = []
k = list(range(1, 11))

In [40]:
# Create a for-loop where each value of k is evaluated by using the K-means algorithm
# Fit the model by using the service_ratings DataFrame
# Append the value of the computed inertia from the `inertia_` attribute of the KMeans model instance
for i in k:
    k_model = KMeans(n_clusters=i, random_state=1)
    k_model.fit(customer_pca_df)
    inertia.append(k_model.inertia_)

In [30]:
# Define a DataFrame to hold the values for k and the corresponding inertia
elbow_data = {"k": k, "inertia": inertia}

# Create the DataFrame from the elbow data
df_elbow = pd.DataFrame(elbow_data)

# Review the DataFrame
df_elbow.head()

Unnamed: 0,k,inertia
0,1,49585.714978
1,2,23750.95547
2,3,8773.172935
3,4,6840.237425
4,5,5378.897735


In [31]:
# Plot the DataFrame
df_elbow.hvplot.line(
    x="k", 
    y="inertia", 
    title="Elbow Curve", 
    xticks=k
)

## Part 3: Segment the `customers_pca_df` DataFrame using the K-means algorithm.

In [41]:
# Define the model Kmeans model by using the optimal value of k for the number of clusters.
model = KMeans(n_clusters=3, random_state=0)

# Fit the model
model.fit(customer_pca_df)

# Make predictions
k_3 = model.predict(customer_pca_df)

# Create a copy of the customers_pca_df DataFrame
customer_pca_predictions_df = customer_pca_df.copy()

# Add a class column with the labels
customer_pca_predictions_df["customer_segments"] = k_3

In [42]:
# Plot the clusters
customer_pca_predictions_df.hvplot.scatter(
    x="PCA1",
    y="PCA2",
    by="customer_segments"
)

## Part 4: Segment the `customers_transformed_df` DataFrame with all factors using the K-means algorithm

In [43]:
# Define the model Kmeans model by using k=3 clusters
model = KMeans(n_clusters=3, random_state=0)

# Fit the model
model.fit(customers_transformed_df)

# Make predictions
k_3 = model.predict(customers_transformed_df)

# Create a copy of the customers_transformed_df DataFrame
customer_transformed_predictions_df = customers_transformed_df.copy()

# Add a class column with the labels
customers_transformed_df["customer_segments"] = k_3

In [45]:
customer_transformed_predictions_df

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,customer_segments
0,1.148534,4.606077,2.699069,-2.661824,1.526433,1.236671,0.211421,1.482896,-4.445627,-1.936831,1
1,-1.149410,-1.650549,2.530167,-3.227088,0.572138,4.162600,-0.291679,-1.237575,3.604765,-1.635689,0
2,0.332427,-0.887985,-0.309216,0.399891,0.828492,3.641945,-0.916946,-1.978024,1.056772,-1.882747,0
3,2.245599,3.826309,0.264039,0.095471,1.984380,0.373991,-0.280279,1.602786,-5.993331,-2.258925,1
4,0.705503,-1.312329,0.895406,-0.405408,1.116187,3.699562,-1.427985,-1.494409,1.156908,-1.434964,0
...,...,...,...,...,...,...,...,...,...,...,...
995,1.923516,2.387442,1.746617,-0.850014,1.333114,-0.522750,-0.699195,1.876106,-4.063120,-0.244857,1
996,-0.760810,-2.490720,1.530053,-1.501746,0.423792,5.947200,-1.271437,-3.398691,4.745373,-1.616856,0
997,1.259010,2.469579,2.766727,-2.218555,1.203872,0.255983,-0.411843,1.691254,-3.021626,-0.452561,1
998,-3.063652,-2.770077,2.086373,-3.500722,-0.767900,5.048482,0.444592,-3.050005,7.259299,-1.254483,0


In [44]:
# Plot the clusters by using the age and spending columns
customer_transformed_predictions_df.hvplot.scatter(
    x="",
    y="PCA2",
    by="customer_segments"
)

DataError: Supplied data does not contain specified dimensions, the following dimensions were not found: ['PCA1', 'PCA2']

PandasInterface expects tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html

## Part 5: Compare the segmentation results between the PCA DataFrame and the full-factored DataFrame

**Answer:** # YOUR ANSWER HERE