## Apply PCA and Clustering to Wholesale Customer Data
https://github.com/Make-School-Courses/DS-2.1-Machine-Learning/blob/master/Assignments/Home_Work_PCA_Kmeans.ipynb

In this homework, we'll examine the [Wholesale Customers Dataset](https://archive.ics.uci.edu/ml/datasets/Wholesale+customers), which we'll get from the UCI Machine Learning Datasets repository. This dataset contains the purchase records from clients of a wholesale distributor. It details the total annual purchases across categories seen in the data dictionary below:

**Category** | **Description**
:-----|:-----
CHANNEL| 1= Hotel/Restaurant/Cafe, 2=Retailer (Nominal)
REGION| Geographic region of Portugal for each order (Nominal)
FRESH| Annual spending (m.u.) on fresh products (Continuous)
MILK| Annual spending (m.u.) on milk products (Continuous)
GROCERY| Annual spending (m.u.)on grocery products (Continuous)
FROZEN| Annual spending (m.u.)on frozen products (Continuous)
DETERGENTS_PAPER| Annual spending (m.u.) on detergents and paper products (Continuous)
DELICATESSEN| Annual spending (m.u.)on and delicatessen products (Continuous)


**TASK**: Read in `wholesale_customers_data.csv` from the datasets folder and store in a dataframe. Store the `Channel` column in a separate variable, and then drop the Channel and Region columns from the dataframe. Scale the data and use PCA to engineer new features (Principal Components). Print out the explained variance for each principal component.

## K-Means, but Without All the Supervision
**Challenge**: Use K-Means clustering on the `wholesale_customers` dataset, and then again on a version of this dataset transformed by PCA.

Read in the data from the `wholesale_customers_data.csv` file contained within the datasets folder.

Store the `Channel` column in a separate variable, and then drop the `Region` and `Channel` columns from the dataframe. `Channel` will act as our labels to tell us what class of customer each datapoint actually is, in case we want to check the accuracy of our clustering.

Scale the data, fit a k-means object to it, and then visualize the data and the clustering.

Use PCA to transform the data, and then use k-means clustering on it to see if our results are any better.

**Challenge**: Use the confusion matrix function to create a confusion matrix and see how accurate our clustering algorithms were. Which did better--scaled data, or data transformed by PCA?

## Task
### Read in wholesale_customers_data.csv from the datasets folder and store in a dataframe

In [7]:
import pandas as pd

df = pd.read_csv('../dataset/wholesale_customers_data.csv')
df.head()

Unnamed: 0,Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,2,3,12669,9656,7561,214,2674,1338
1,2,3,7057,9810,9568,1762,3293,1776
2,2,3,6353,8808,7684,2405,3516,7844
3,1,3,13265,1196,4221,6404,507,1788
4,2,3,22615,5410,7198,3915,1777,5185


### Store the Channel column in a separate variable

In [15]:
target = df["Channel"]

### Drop the Channel and Region columns from the dataframe

In [16]:
df.drop(['Channel', 'Region'], axis=1)

Unnamed: 0,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
0,12669,9656,7561,214,2674,1338
1,7057,9810,9568,1762,3293,1776
2,6353,8808,7684,2405,3516,7844
3,13265,1196,4221,6404,507,1788
4,22615,5410,7198,3915,1777,5185
...,...,...,...,...,...,...
435,29703,12051,16027,13135,182,2204
436,39228,1431,764,4510,93,2346
437,14531,15488,30243,437,14841,1867
438,10290,1981,2232,1038,168,2125


### Scale the data and use PCA to engineer new features (Principal Components)
- Use standard_scaler from Day 1 notes

In [17]:
from sklearn import preprocessing

standard_scaler = preprocessing.StandardScaler()
X_ss = standard_scaler.fit_transform(df)
print(X_ss)

[[ 1.44865163  0.59066829  0.05293319 ... -0.58936716 -0.04356873
  -0.06633906]
 [ 1.44865163  0.59066829 -0.39130197 ... -0.27013618  0.08640684
   0.08915105]
 [ 1.44865163  0.59066829 -0.44702926 ... -0.13753572  0.13323164
   2.24329255]
 ...
 [ 1.44865163  0.59066829  0.20032554 ... -0.54337975  2.51121768
   0.12145607]
 [-0.69029709  0.59066829 -0.13538389 ... -0.41944059 -0.56977032
   0.21304614]
 [-0.69029709  0.59066829 -0.72930698 ... -0.62009417 -0.50488752
  -0.52286938]]


In [18]:
from sklearn.decomposition import PCA

pca = PCA(n_components=6)
X_r = pca.fit_transform(X_ss)

### Print out the explained variance for each principal component.

In [19]:
print(pca.explained_variance_)

[3.10707136 1.79404441 1.0140786  0.74007428 0.55790035 0.46035462]


## K-Means

Challenge: Use K-Means clustering on the wholesale_customers dataset, and then again on a version of this dataset transformed by PCA.

Read in the data from the wholesale_customers_data.csv file contained within the datasets folder.

Store the Channel column in a separate variable, and then drop the Region and Channel columns from the dataframe. Channel will act as our labels to tell us what class of customer each datapoint actually is, in case we want to check the accuracy of our clustering.

Scale the data, fit a k-means object to it, and then visualize the data and the clustering.

Use PCA to transform the data, and then use k-means clustering on it to see if our results are any better.