# K-Means Clustering

In this project we use K-Means clustering to group compounds into 3 clusters, one for each source library:
        FDA,
        Biofacquim,
        AFRODB

Note that we have the "Library" labels for this data set, 
but we will NOT use them for the K-Means clustering algorithm,
because it is an unsupervised learning algorithm.

Here we use the labels only to get an idea of 
how well the algorithm performed; you usually cannot do 
this for K-Means.

So the classification report and confusion matrix at the end of 
this project would not truly make sense in a real-world setting!

## Import Libraries

**Import the libraries you usually use for data analysis.**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data

In [None]:
pwd

Set your working directory (or adjust the path below) so that the .csv file can be read.

In [None]:
Data = pd.read_csv("Data_cluster.csv",index_col=0)

In [None]:
Data.columns

In [None]:
#Select descriptors to train model (Numerical Data)
features = ['HBA', 'HBD', 'RB', 'LogP', 'TPSA', 'MW', 
            'Heavy Atom', 'Ring Count', 'Fraction CSP3']

In [None]:
#Create a new data frame with numerical data
df = Data[features]

In [None]:
#df.info()

In [None]:
#Statistical values
df.describe()

## Exploratory Data Analysis

It's time to create some data visualizations!

In [None]:
sns.heatmap(df.corr())

In [None]:
sns.set_style('whitegrid')
# Recent seaborn versions require keyword x/y and use `height` instead of `size`
sns.lmplot(x='LogP',y='MW',data=Data, hue='Library',
           palette='cool',height=6,aspect=1,fit_reg=False)

In [None]:
sns.set_style('whitegrid')
sns.lmplot(x='TPSA',y='MW',data=Data, hue='Library',
           palette='cool',height=6,aspect=1,fit_reg=False)

In [None]:
# Histogram of MW per library (`height` replaces the old `size` argument)
g = sns.FacetGrid(Data,hue="Library",palette='cool',height=6,aspect=2)
g = g.map(plt.hist,'MW',bins=20,alpha=0.7)

**Create similar histograms for the TPSA and Heavy Atom columns.**

In [None]:
sns.set_style('darkgrid')
g = sns.FacetGrid(Data,hue="Library",palette='cool',height=6,aspect=2)
g = g.map(plt.hist,'TPSA',bins=20,alpha=0.7)

In [None]:
sns.set_style('darkgrid')
g = sns.FacetGrid(Data,hue="Library",palette='cool',height=6,aspect=2)
g = g.map(plt.hist,'Heavy Atom',bins=20,alpha=0.7)

## K Means Cluster Creation

In [None]:
from sklearn.cluster import KMeans

**Create an instance of a K Means model with 3 clusters.**

In [None]:
kmeans = KMeans(n_clusters=3)

**Fit the model to all the data except for the Library label.**

In [None]:
kmeans.fit(df)

**What are the cluster center vectors?**

In [None]:
kmeans.cluster_centers_

In [None]:
kmeans

In [None]:
kmeans.labels_
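
To make the fitted attributes concrete, here is a minimal self-contained sketch on synthetic data. The blobs, the `demo_` names, and the `n_init` and `random_state` values are illustrative assumptions; this is not the compound data set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the descriptor table: 150 rows, 9 numeric columns
X_demo, _ = make_blobs(n_samples=150, n_features=9, centers=3, random_state=0)

demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
demo_kmeans.fit(X_demo)

# cluster_centers_ has one row per cluster and one column per feature
centers_shape = demo_kmeans.cluster_centers_.shape
# labels_ has one cluster index per input row
labels_shape = demo_kmeans.labels_.shape
unique_labels = np.unique(demo_kmeans.labels_)
print(centers_shape, labels_shape, unique_labels)
```

The cluster indices are arbitrary identifiers: index 0 in one run does not have to correspond to index 0 in another run, nor to any particular library.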

## Evaluation

There is no perfect way to evaluate clustering if you don't have the labels. Since this is just an exercise, we do have the labels, so we take advantage of them to evaluate our clusters. Keep in mind that you usually won't have this luxury in the real world.
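
When labels are not available, one common (imperfect) option is an internal metric such as the silhouette score, which only looks at the features and the assigned clusters. The sketch below uses synthetic data and hypothetical names (`X_demo`, `scores`); it is not a result for the compound data set.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data used only for illustration
X_demo, _ = make_blobs(n_samples=200, n_features=4, centers=3, random_state=0)

# Silhouette score for several candidate numbers of clusters
scores = {}
for k in range(2, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # Ranges from -1 to 1; higher means tighter, better separated clusters
    scores[k] = silhouette_score(X_demo, model.labels_)

for k, score in scores.items():
    print(k, round(score, 3))
```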

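**Create a new column in Data called 'Cluster' that holds the predicted cluster index (0, 1 or 2) for each compound.**

For comparison, the libraries are encoded as:
    AFRODB -> 0,
    Biofacquim -> 1,
    FDA -> 2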
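
The integer encoding listed above can be built from the 'Library' column with a dictionary mapping. This is a sketch on a toy data frame; the exact strings stored in the real 'Library' column and the column name 'Target' for the encoded result are assumptions.

```python
import pandas as pd

# Toy stand-in for Data; the real 'Library' strings may differ (assumption)
toy = pd.DataFrame({'Library': ['FDA', 'AFRODB', 'Biofacquim', 'FDA']})

# Encoding of the libraries as integers
library_to_int = {'AFRODB': 0, 'Biofacquim': 1, 'FDA': 2}
toy['Target'] = toy['Library'].map(library_to_int)
print(toy)
```

Any value missing from the dictionary becomes NaN after `map`, which is a quick way to spot unexpected library names.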

In [None]:
#Add predictions to Data
Data['Cluster'] = kmeans.labels_

In [None]:
#Plot scatter plots again
#Now hue is "Cluster" (predicted result)
sns.set_style('whitegrid')
sns.lmplot(x='LogP',y='MW',data=Data, hue='Cluster',
           palette='cool',height=6,aspect=1,fit_reg=False)

In [None]:
sns.set_style('whitegrid')
sns.lmplot(x='TPSA',y='MW',data=Data, hue='Cluster',
           palette='cool',height=6,aspect=1,fit_reg=False)

In these last plots the patterns are similar to those in the plots colored by library, 
so we can say that the cluster predictions are not so bad.

**Create a confusion matrix and classification report to see how well the K-Means clustering worked without being given any labels.**

In [None]:
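from sklearn.metrics import confusion_matrix,classification_report
# Data['Target'] must already hold the integer-encoded library (0, 1, 2).
# Cluster indices are arbitrary, so rows and columns may not line up by library.
print(confusion_matrix(Data['Target'],kmeans.labels_))
print(classification_report(Data['Target'],kmeans.labels_))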
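
Because K-Means numbers its clusters arbitrarily, cluster 0 is not guaranteed to correspond to target 0, and the report above can look worse than the clustering really is. One possible way to handle this is to relabel the clusters so they best match the reference labels before computing the metrics. The sketch below uses the Hungarian algorithm from SciPy; `align_clusters` is a hypothetical helper, and it assumes both label sets are integers 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_clusters(y_true, y_pred):
    """Relabel cluster ids to best match the reference labels (hypothetical helper).

    Assumes both arrays contain integer labels 0..k-1.
    """
    cm = confusion_matrix(y_true, y_pred)
    # One-to-one pairing of true labels (rows) and clusters (columns)
    # that maximizes the total number of agreeing samples
    row_ind, col_ind = linear_sum_assignment(cm, maximize=True)
    mapping = {cluster: label for label, cluster in zip(row_ind, col_ind)}
    return np.array([mapping[c] for c in y_pred])

# Toy example: a perfect clustering whose ids are permuted
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
aligned = align_clusters(y_true, y_pred)
print(aligned)  # [0 0 1 1 2 2]
```

With the notebook's variables this would be called as `align_clusters(Data['Target'], kmeans.labels_)`, and the aligned array would then be passed to `confusion_matrix` and `classification_report`.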