# Apply PCA and Clustering to Wholesale Customer Data
In this homework, we'll examine the [Wholesale Customers Dataset](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), which we'll get from the UCI Machine Learning Repository. This dataset contains the purchase records of clients of a wholesale distributor. It records each client's total annual spending, in monetary units (m.u.), across the product categories listed in the data dictionary below:

| Category | Description |
|----------|-------------|
| CHANNEL | 1 = Hotel/Restaurant/Cafe, 2 = Retailer (Nominal) |
| REGION | Geographic region of Portugal for each order (Nominal) |
| FRESH | Annual spending (m.u.) on fresh products (Continuous) |
| MILK | Annual spending (m.u.) on milk products (Continuous) |
| GROCERY | Annual spending (m.u.) on grocery products (Continuous) |
| FROZEN | Annual spending (m.u.) on frozen products (Continuous) |
| DETERGENTS_PAPER | Annual spending (m.u.) on detergents and paper products (Continuous) |
| DELICATESSEN | Annual spending (m.u.) on delicatessen products (Continuous) |

**TASK**: Read in wholesale_customers_data.csv from the datasets folder and store it in a dataframe. Store the Channel column in a separate variable, and then drop the Channel and Region columns from the dataframe. Scale the data and use PCA to engineer new features (Principal Components). Print out the explained variance for each principal component.

## K-Means, but Without All the Supervision
**Challenge**: Use K-Means clustering on the wholesale_customers dataset, and then again on a version of this dataset transformed by PCA.

1. Read in the data from the wholesale_customers_data.csv file contained within the datasets folder.

2. Store the Channel column in a separate variable, and then drop the Region and Channel columns from the dataframe. Channel will act as our labels to tell us what class of customer each datapoint actually is, in case we want to check the accuracy of our clustering.

3. Scale the data, fit a k-means object to it, and then visualize the data and the clustering.

4. Use PCA to transform the data, and then use k-means clustering on it to see if our results are any better.

**Challenge**: Use scikit-learn's `confusion_matrix` function (in `sklearn.metrics`) to create a confusion matrix and see how accurate our clustering algorithms were. Which did better: the scaled data, or the data transformed by PCA?
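
The clustering and confusion-matrix steps described above can be sketched as follows. This is a minimal illustration, not the notebook's solution: it uses synthetic two-group data from `make_blobs` as a stand-in for the customer data, and the cluster centers, sample count, and `random_state` values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data: 2 well-separated groups, 6 features.
X, true_labels = make_blobs(
    n_samples=200,
    centers=[[0.0] * 6, [5.0] * 6],
    cluster_std=1.0,
    random_state=0,
)

# Scale so every feature contributes equally to the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means with as many clusters as there are known classes.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Rows are true classes, columns are cluster ids.
cm = confusion_matrix(true_labels, cluster_ids)

# Cluster ids are arbitrary, so score both possible labelings and keep the better one.
accuracy = max(np.trace(cm), np.trace(cm[:, ::-1])) / cm.sum()
print(cm)
print(f"accuracy after aligning labels: {accuracy:.2f}")
```

On the real data, `Channel` takes the values 1 and 2 while k-means returns cluster ids 0 and 1, so the labels would likely need remapping (for example `Channel - 1`) before being passed to `confusion_matrix`.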

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn import preprocessing
from sklearn import metrics
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy.stats

## 1. Read in wholesale_customers_data.csv from the datasets folder and store in a dataframe. 
## 2. Store the Channel column in a separate variable, and then drop the Channel and Region columns from the dataframe.

In [2]:
data = pd.read_csv('wholesale_customers_data.csv')

Channel = data['Channel']
data = data.drop(columns=['Channel', 'Region'])
print(data)

     Fresh   Milk  Grocery  Frozen  Detergents_Paper  Delicassen
0    12669   9656     7561     214              2674        1338
1     7057   9810     9568    1762              3293        1776
2     6353   8808     7684    2405              3516        7844
3    13265   1196     4221    6404               507        1788
4    22615   5410     7198    3915              1777        5185
..     ...    ...      ...     ...               ...         ...
435  29703  12051    16027   13135               182        2204
436  39228   1431      764    4510                93        2346
437  14531  15488    30243     437             14841        1867
438  10290   1981     2232    1038               168        2125
439   2787   1698     2510      65               477          52

[440 rows x 6 columns]


## 3. Scale the data and use PCA to engineer new features (Principal Components).

In [3]:
# PCA computation by sklearn
pca = PCA(n_components=6)

# Find the principal components of the 6 spending features.
# Note: the data is passed to PCA unscaled here, so high-variance columns dominate.
X_r = pca.fit_transform(data)

print(X_r)

[[   650.02212207   1585.51909007    -95.39064375   4540.78048148
    -356.63711837    226.71184804]
 [ -4426.80497937   4042.45150884   1534.80474393   2567.65565913
     -44.39428259    468.93801652]
 [ -4841.9987068    2578.762176     3801.38479014   2273.49433697
    5245.3854378   -2141.12332875]
 ...
 [  4555.11499863  26201.75860287  -5887.43291863  -2082.90687562
     -29.79580385  -1030.68216765]
 [ -2734.37092005  -7070.77533531   -790.70302471   1344.54788768
    1448.4127231    -219.12615672]
 [-10370.12531409  -6161.46490876  -1017.14238084   1283.65788577
     -80.6767127     297.82017043]]
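
The cell above passes the raw spending values to PCA, although the step heading asks for scaled data. A hedged sketch of the scale-then-PCA pattern is shown below using `StandardScaler`. The data is synthetic (three independent features with very different spreads, an assumption made purely for illustration), so the numbers it prints say nothing about the wholesale dataset itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 features whose spreads differ by orders of magnitude.
X = rng.normal(size=(300, 3)) * np.array([1.0, 100.0, 10000.0])

# Without scaling, the widest feature accounts for nearly all the variance.
raw_ratio = PCA(n_components=3).fit(X).explained_variance_ratio_

# Standardize each column to zero mean and unit variance, then fit PCA.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=3).fit(X_scaled).explained_variance_ratio_

print("unscaled:", raw_ratio.round(3))
print("scaled:  ", scaled_ratio.round(3))
```

Applied to the notebook, the same pattern would be `StandardScaler().fit_transform(data)` followed by `pca.fit_transform(...)` on the result. The outputs printed in this notebook come from the unscaled fit and would change.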


## 4. Print out the explained variance for each principal component.

In [4]:
# Variance captured by each component, each component's share of the total,
# and the running (cumulative) share.
print(pca.explained_variance_)
print("--")
print(pca.explained_variance_ratio_)
print("--")
print(pca.explained_variance_ratio_.cumsum())

[1.64995904e+08 1.45452098e+08 2.51399785e+07 1.58039005e+07
 5.39276364e+06 2.20364065e+06]
--
[0.45961362 0.40517227 0.07003008 0.04402344 0.01502212 0.00613848]
--
[0.45961362 0.86478588 0.93481597 0.97883941 0.99386152 1.        ]
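
The cumulative ratios are commonly used to decide how many components to keep. The sketch below applies that idea to the ratios printed above. The 0.90 cutoff is an assumption chosen only for illustration, not a value from the assignment.

```python
import numpy as np

# Explained-variance ratios copied from the output above.
ratios = np.array([0.45961362, 0.40517227, 0.07003008,
                   0.04402344, 0.01502212, 0.00613848])
cumulative = ratios.cumsum()

threshold = 0.90  # assumed cutoff, for illustration only

# Index of the first cumulative value that reaches the threshold, plus one.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep)
```

With this cutoff the first three components are kept, since the first two together explain about 86% of the variance and the third brings the total to about 93%.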


## How to calculate the correlation of the principal components:

In [5]:
print('Correlation of PCA Component:')
print(scipy.stats.pearsonr(X_r[:, 0], X_r[:, 1]))

Correlation of PCA Component:
(2.0816681711721685e-17, 1.0000000000001332)
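
In the tuple above, the first value is the Pearson correlation coefficient (zero up to floating-point error) and the second is the p-value. Principal components are uncorrelated by construction, so the same result is expected for every pair of components. A small sketch that checks all pairs at once with `np.corrcoef` follows. It runs on synthetic correlated data, which is an assumption standing in for the customer table.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 4 correlated features built by mixing random columns.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

scores = PCA(n_components=4).fit_transform(X)

# rowvar=False treats each column (component) as a variable.
corr = np.corrcoef(scores, rowvar=False)

# Subtract the identity so only the between-component correlations remain.
off_diagonal = corr - np.eye(4)
print(np.abs(off_diagonal).max())
```

In the notebook, the equivalent check would be `np.corrcoef(X_r, rowvar=False)`, which should be close to the 6 x 6 identity matrix.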
