## Clustering
In this exercise, you will use K-Means clustering to segment customer data into five clusters.

### Import the Libraries
You will use the **KMeans** class to create your model. This will require a vector of features, so you will also use the **VectorAssembler** class.

In [2]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

### Load Source Data
The source data for your clusters is in a comma-separated values (CSV) file, and includes the following features:
- CustomerName: The customer's name
- Age: The customer's age in years
- MaritalStatus: The customer's marital status (1 = Married, 0 = Unmarried)
- IncomeRange: The upper bound of the customer's income range (for example, a value of 25,000 means the customer earns up to 25,000)
- Gender: A numeric value indicating gender (1 = female, 2 = male)
- TotalChildren: The total number of children the customer has
- ChildrenAtHome: The number of children the customer has living at home
- Education: A numeric value indicating the highest level of education the customer has attained (1 = Started High School to 5 = Post-Graduate Degree)
- Occupation: A numeric value indicating the type of occupation of the customer (0 = Unskilled manual work to 5 = Professional)
- HomeOwner: A numeric code indicating home ownership (1 = home owner, 0 = not a home owner)
- Cars: The number of cars owned by the customer

In [4]:
customers = spark.read.csv('wasb://spark@<YOUR_ACCOUNT>.blob.core.windows.net/data/customers.csv', inferSchema=True, header=True)
customers.show()

### Create the K-Means Model
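You will use the features in the customer data to create a K-Means model with a k value of 5, which generates five clusters. The CustomerName column is left out of the feature vector because it is an identifier rather than a numeric feature.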

In [6]:
# Combine the numeric feature columns into a single vector column named "features"
assembler = VectorAssembler(
    inputCols=["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren",
               "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"],
    outputCol="features")
train = assembler.transform(customers)

# Fit K-Means with k=5, using a fixed random seed and writing cluster IDs to "cluster"
kmeans = KMeans(featuresCol=assembler.getOutputCol(), predictionCol="cluster", k=5, seed=0)
model = kmeans.fit(train)
print("Model Created!")
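
To make the assembly step concrete, here is a minimal plain-Python sketch of what **VectorAssembler** does conceptually for a single row: it takes the listed input columns, in order, and packs them into one numeric vector. The `assemble` helper and the sample row are hypothetical and only for illustration. The real class operates on Spark DataFrames and produces Spark vector types.

```python
# Column order matches the inputCols list passed to the assembler in this exercise.
FEATURE_COLS = ["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren",
                "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"]


def assemble(row, input_cols=FEATURE_COLS):
    """Hypothetical helper: pack the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# Made-up customer record (not taken from customers.csv).
sample_row = {
    "CustomerName": "Example Customer",
    "Age": 34, "MaritalStatus": 1, "IncomeRange": 50000, "Gender": 2,
    "TotalChildren": 2, "ChildrenAtHome": 1, "Education": 3,
    "Occupation": 4, "HomeOwner": 1, "Cars": 2,
}

features = assemble(sample_row)
print(features)
```

CustomerName is present in the row but does not appear in the resulting vector, because it is not listed among the input columns.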

### Get the Cluster Centers
Each cluster center is returned as a vector with one coordinate per feature, in the same order as the assembler's input columns.

In [8]:
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
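
Because the printed centers are unlabeled arrays, it can help to pair each coordinate with its feature name. The following is a minimal sketch, assuming the same column order that was passed to the assembler. The `label_center` helper and the example values are hypothetical and are not output from this model.

```python
# Same order as the assembler's inputCols in this exercise.
FEATURE_COLS = ["Age", "MaritalStatus", "IncomeRange", "Gender", "TotalChildren",
                "ChildrenAtHome", "Education", "Occupation", "HomeOwner", "Cars"]


def label_center(center, names=FEATURE_COLS):
    """Hypothetical helper: map each coordinate of a center to its feature name."""
    if len(center) != len(names):
        raise ValueError("center length does not match the number of features")
    return {name: float(value) for name, value in zip(names, center)}


# Made-up center, standing in for one element of model.clusterCenters().
example_center = [41.5, 0.6, 62000.0, 1.5, 1.8, 0.9, 3.2, 3.1, 0.7, 1.4]

labeled = label_center(example_center)
for name, value in labeled.items():
    print(f"{name}: {value}")
```

In the notebook, the same helper could be applied to each array returned by `model.clusterCenters()`.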

### Predict Clusters
Now that you have trained the model, you can use it to segment the customer data into five clusters and show each customer with their allocated cluster.

In [10]:
prediction = model.transform(train)
prediction.groupBy("cluster").count().orderBy("cluster").show()

In [11]:
prediction.select("CustomerName", "cluster").show(50)
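
With the default Euclidean distance measure, assigning a customer to a cluster amounts to picking the center closest to that customer's feature vector. The sketch below illustrates the idea in plain Python with only two features (age and income range) and three made-up centers. The helper functions and all values are hypothetical and are not part of the PySpark API.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Return the index of the center closest to the given point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


# Made-up centers in (Age, IncomeRange) space.
centers = [
    [30.0, 25000.0],
    [45.0, 75000.0],
    [60.0, 125000.0],
]

# Made-up customers, also in (Age, IncomeRange) space.
customers_2d = [[28.0, 30000.0], [50.0, 80000.0], [58.0, 120000.0]]

assignments = [nearest_center(c, centers) for c in customers_2d]
print(assignments)
```

In this toy example the income coordinate dominates the distance, because its values are much larger than the ages.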