# K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm. In contrast to supervised machine learning algorithms, K-Means groups data without first being trained on labeled examples. Once the algorithm has been run and the groups are defined, any new data point can be assigned to the group whose centre is nearest to it.

The real-world applications of K-Means include:
- customer profiling
- market segmentation
- computer vision
- search engines
- astronomy


### How it works:

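It is an iterative algorithm.
It chooses k centres at random, assigns every point to its nearest centre, and then moves each centre to the mean of the points assigned to it. These two steps repeat until the centres stop changing, and each repetition reduces the error (generally taken as the mean squared distance between the points and their centres) or leaves it unchanged.
Here is an illustration: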

![image](http://shabal.in/visuals/kmeans/left.gif)
 

GIF Credits: http://shabal.in/visuals/kmeans/left.gif
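
Each iteration can also be written in symbols. The notation below is chosen for illustration and is not part of the original material: $x_i$ is a data point, $\mu_j$ is the centre of cluster $j$, $c_i$ is the index of the cluster assigned to $x_i$, and $C_j$ is the set of points currently assigned to cluster $j$.

$$
c_i = \arg\min_{j \in \{1, \dots, k\}} \lVert x_i - \mu_j \rVert^2
\qquad \text{(assignment step)}
$$

$$
\mu_j = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
\qquad \text{(update step)}
$$

Neither step can increase the total squared error $\sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2$, which is why repeating them converges.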

### Choosing the right number of clusters

Often the data you’ll be working with will have multiple dimensions, making it difficult to visualise. As a consequence, the optimum number of clusters is no longer obvious. Fortunately, we have a way of estimating it mathematically.

We graph the relationship between the number of clusters and the Within Cluster Sum of Squares (WCSS), then select the number of clusters at which the decrease in WCSS begins to level off. This is known as the elbow method.

WCSS is defined as the sum of the squared distances between each member of a cluster and that cluster's centroid, added up over all clusters.

![image.png](https://miro.medium.com/max/1400/1*vLTnh9xdgHvyC8WDNwcQQw.png)

For example, the computed WCSS for figure 1 would be greater than the WCSS calculated for figure 2.

![img.png](https://miro.medium.com/max/1400/1*0naSz4RFw_m5VqiRXo2SRw.png)
Figure 1

![img2.png](https://miro.medium.com/max/2000/1*vNsFrDUvGn9yTjlnXLgW8A.png)
Figure 2

Image credits: miro.medium.com


#### Now let's see how we can implement the k-means algorithm using Python.

We start by importing libraries.

Q1. Write code to import:
- pandas
- numpy
- pyplot from matplotlib
- KMeans from sklearn.cluster
- make_blobs from sklearn.datasets


In [None]:
# A1

We will perform the first trial on a dataset generated using the make_blobs function.
This function is available in the sklearn.datasets module. The 'centers' parameter specifies the number of clusters.
Changing 'random_state' will change the data points you obtain. 'n_samples' controls the number of data points generated and 'cluster_std' controls the 'tightness' of the clusters.

In [None]:
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
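
To see what 'cluster_std' does, the sketch below generates the same blobs with a small and a large value and compares how spread out the clusters are. The helper `cluster_spread` and the values 0.30 and 1.50 are chosen only for illustration.

```python
from sklearn.datasets import make_blobs

def cluster_spread(cluster_std):
    """Average per-coordinate standard deviation of the points inside each cluster."""
    X, y = make_blobs(n_samples=300, centers=4, cluster_std=cluster_std, random_state=0)
    spreads = [X[y == label].std(axis=0).mean() for label in range(4)]
    return float(sum(spreads) / len(spreads))

tight = cluster_spread(0.30)
loose = cluster_spread(1.50)
print(tight, loose)  # the second value should be noticeably larger
```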

Q.2. Display the created dataset using the scatter function in pyplot.

In [None]:
#A2

Even though we already know the optimal number of clusters, we can still benefit from determining it using the elbow method.

To get the values used in the graph, we train multiple models, each with a different number of clusters, and store the value of the inertia_ attribute (WCSS) every time.

In [None]:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

Q.3. Write pyplot code to visualise the elbow method graph.
- plot WCSS against the number of clusters for 1 to 10 clusters, i.e. range(1, 11)
- label the x and y axes appropriately (the x axis should be 'Number of clusters' and the y axis 'WCSS')
- change the title of the plot to 'Elbow Method Graph'
- display the graph

In [None]:
#A3

Next, we’ll categorize the data using the optimum number of clusters (4) we determined in the last step. Using k-means++ to pick the initial centres makes it much less likely that you fall into the random initialization trap.

Read more about the random initialisation trap here:
https://www.geeksforgeeks.org/ml-k-means-algorithm/

In [None]:
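# Fit k-means with 4 clusters and get the cluster index of every point.
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(X)

# Colour each point by its assigned cluster, then overlay the centres in red.
plt.scatter(X[:,0], X[:,1], c=pred_y)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()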

### Great! So far, we have used the sklearn library to fit the k-means model easily.

Here is an exercise that will allow you to actually visualise the working of the algorithm.

Q4. Develop code to compute k-means clusters with k known a priori.
Take a random 2D dataset of 20-25 points. (You can choose the points carefully so that they give you some good clustering.) Take k = 3 or 4. Write code that minimises the within-cluster error and finds the optimal centres. Make sure you plot graphs wherever you can to visualise the algorithm.


In [None]:
#A4

#### Wonderful, now let's try using k-means on a real-life dataset

We will be using the Titanic dataset (available [here](https://www.kaggle.com/c/titanic)).

Some information about the data:
- The training set contains records about the passengers of the Titanic (hence the name of the dataset).
- It has 12 columns capturing information such as passenger class, port of embarkation and passenger fare.
- The dataset's label is the 'Survived' column, which denotes the survival status of a particular passenger.
- Your task is to cluster the records into two groups, i.e. the ones who survived and the ones who did not.

You might be wondering how a labeled dataset can be used for a clustering task. You just have to drop the 'Survived' column from the dataset to make it unlabeled. It is then the task of K-Means to cluster the records according to whether the passengers survived or not. The label is provided so that supervised algorithms can also be applied to the data.

Q.5 Import dependencies (no need to re-import if done previously)

- pandas as pd
- numpy as np
- KMeans from sklearn.cluster
- LabelEncoder and MinMaxScaler from sklearn.preprocessing
- seaborn
- pyplot


In [None]:
#A5

Q.6. Load data using pandas from the following URLs:
[train](http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv)
[test](http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv) 

Preview the dataset by printing some samples and get some initial statistics of the data such as count, mean, std etc.


In [None]:
#A6

It is important to note that the k-means algorithm cannot handle missing values, so we need to deal with them at the pre-processing stage.

There are a couple of ways to handle missing values:

- Remove rows with missing values
- Impute missing values

The latter is preferred because removing rows with missing values shrinks the dataset, which can leave too little data to train the machine learning model well.

Now, there are several ways you can perform the imputation:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another machine learning model.

Any imputation performed on the train set will have to be performed on test data in the future when predictions are needed from the final machine learning model. This needs to be taken into consideration when choosing how to impute the missing values.
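
One way to keep the two sets consistent is to compute the replacement value on the training data and reuse it for the test data. The sketch below shows the idea with mean imputation on made-up data; the `toy_train` and `toy_test` frames are hypothetical and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each set.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Compute the mean on the training set only (NaN values are skipped by default).
age_mean = toy_train["Age"].mean()

# Apply the same value to both sets.
toy_train["Age"] = toy_train["Age"].fillna(age_mean)
toy_test["Age"] = toy_test["Age"].fillna(age_mean)

print(toy_train["Age"].tolist())  # [20.0, 30.0, 40.0]
print(toy_test["Age"].tolist())   # [30.0, 30.0]
```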

Q.7. Find which fields contain missing values and impute them with the mean of the column.

In [None]:
#A7

Now that you have imputed the missing values in the dataset, it's time to see if the dataset still has any missing values.


In [None]:
print(train.isnull().sum())
print(test.isnull().sum())

Yes, you can see there are still some missing values in the Cabin and Embarked columns.
This is because these values are non-numeric, so no mean could be computed for them.
In order to perform mean imputation, the values need to be in numeric form. There are ways to convert a non-numeric value to a numeric one.

You can see that the following features are non-numeric:

- Name
- Sex
- Ticket
- Cabin
- Embarked

Before converting them into numeric ones, you might want to do some feature selection. For this exercise, we assume that features like Name, Ticket, Cabin and Embarked have little impact on the survival status of the passengers. Often, it is better to train your model with only the significant features than with all the features, including unnecessary ones. It not only helps in efficient modelling, but also lets the model train in much less time.

The features Name, Ticket, Cabin and Embarked can therefore be dropped without a significant impact on the training of the K-Means model.

Q.8 Drop the unnecessary fields mentioned above. 

In [None]:
#A8

Now that the dropping part is done, let's convert the 'Sex' feature to a numerical one ('Sex' is the only non-numeric feature remaining). You will do this using a technique called label encoding.

In [None]:
# Fit the encoder once on the training data, then reuse the same mapping for both sets.
labelEncoder = LabelEncoder()
labelEncoder.fit(train['Sex'])
train['Sex'] = labelEncoder.transform(train['Sex'])
test['Sex'] = labelEncoder.transform(test['Sex'])
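
To make the mapping concrete, here is a small standalone sketch of what LabelEncoder does with a list of strings. The input list and the variable names `encoder` and `codes` are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["male", "female", "female", "male"])

# The classes are stored in sorted order; each label is replaced by its index.
print(encoder.classes_.tolist())                   # ['female', 'male']
print(codes.tolist())                              # [1, 0, 0, 1]
print(encoder.inverse_transform([0, 1]).tolist())  # ['female', 'male']
```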

Q.9. Investigate the datatype of all remaining features to ensure that everything is in numeric format

In [None]:
#A9

In [None]:
# Drop the 'Survived' column from the data with the drop() function.

X = np.array(train.drop(['Survived'], axis=1).astype(float))
y = np.array(train['Survived'])


Q.10. Cluster the data using KMeans from sklearn with k=2 (because we want two clusters: survived and not survived)

In [None]:
#A10

In the output of the cell above, you can see the other parameters of the model besides n_clusters.

Let's see how well the model is doing by looking at the percentage of passenger records that were clustered correctly.

In [None]:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1

print(correct/len(X))
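
Keep in mind that K-Means numbers its clusters arbitrarily, so cluster 0 does not necessarily correspond to Survived = 0. A minimal sketch of one way to account for this, assuming exactly two clusters and 0/1 labels, is to score both possible matchings and keep the better one. The helper name `clustering_accuracy` and the toy inputs are hypothetical.

```python
import numpy as np

def clustering_accuracy(cluster_ids, labels):
    """Accuracy for two clusters, allowing the cluster numbering to be swapped."""
    cluster_ids = np.asarray(cluster_ids)
    labels = np.asarray(labels)
    direct = np.mean(cluster_ids == labels)         # cluster 0 matched to label 0
    flipped = np.mean((1 - cluster_ids) == labels)  # cluster 0 matched to label 1
    return float(max(direct, flipped))

# The cluster numbering is the reverse of the labels for four of the five points.
print(clustering_accuracy([1, 1, 0, 0, 1], [0, 0, 1, 1, 1]))  # 0.8
```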

Your model clustered 50% of the records correctly (this is the accuracy of your model). To improve the performance of the model, you could tweak some of its parameters. Here are some of the parameters which the scikit-learn implementation of K-Means provides:

- algorithm
- max_iter
- n_jobs (only in older scikit-learn releases; it was removed from KMeans in version 1.0)
 
Let's tweak the values of these parameters and see if there is a change in the result.

In the scikit-learn documentation, you will find solid information about these parameters, which you should dig into further.

One of the reasons for the low accuracy is that you have not scaled the values of the different features that you are feeding to the model. The features in the dataset have very different ranges of values. Because k-means relies on distances between points, a feature with a large range (such as the fare) dominates the distance, while a change in a feature with a small range barely affects it. So, it is important to scale the values of the features to the same range.

Let's do that now. For this experiment you are going to use 0 - 1 as the uniform value range across all the features.
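
Min-max scaling maps each column to the range 0 - 1 by subtracting the column minimum and dividing by the column range. The sketch below does this by hand with NumPy on made-up numbers to show the arithmetic; the helper `min_max_scale` is hypothetical and is only meant to illustrate roughly what a min-max scaler computes.

```python
import numpy as np

def min_max_scale(values):
    """Rescale every column of a 2D array to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return (values - col_min) / col_range

# Two features with very different ranges (hypothetical toy data).
toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
print(min_max_scale(toy))  # rows become [0, 0], [0.5, 0.5], [1, 1]
```

After scaling, both columns contribute on the same scale to the distances that k-means uses.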

Q.11. Use MinMaxScaler() and its transform() function to normalise all values to the range 0-1

Q.12. Fit k-means on the newly transformed features and find the accuracy as we did previously.

In [None]:
#A11
#A12

### Well done!