# K-Means Clustering

Your assignment is to use the "Breast Cancer Wisconsin (Diagnostic) Data Set" from Kaggle to try and cluster types of cancer cells. 

It may be helpful to use PCA to reduce the dimensions of your data first, but then again, maybe not. I dunno, you're the data scientist, you tell me. 🤪

Here's the original dataset for your reference:

<https://www.kaggle.com/uciml/breast-cancer-wisconsin-data>

## This is a supervised learning dataset

(Because it has **labels** - The "diagnosis" column.)

In [244]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA # You don't necessarily have to use this
from sklearn.cluster import KMeans # You don't necessarily have to use this
from sklearn.preprocessing import StandardScaler # You don't necessarily have to use this

df = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/datasets/master/Cancer_Cells.csv")
print(df.shape)
df.head()
answers = df['diagnosis']

(569, 33)


## Now it's an unsupervised learning dataset

(Because we've removed the diagnosis label) - Use this version.

In [245]:
df = df.drop('diagnosis', axis=1)
df.head()


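df.isna().sum()
# 'Unnamed: 32' is an empty artifact column from the CSV export, so drop it.
# Note: the 'id' column is still present and will be treated as a feature unless it is dropped too.
df = df.drop('Unnamed: 32', axis=1)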

# You take it from here!

See what you can come up with. You have all the know-how! 

- You might want to do some data exploration to see if you can find specific columns that will help you find distinct clusters of cells
- You might want to do PCA on this data to see if that helps you find distinct linearly-separable clusters.
  - (In the real world, truly linearly-separable clusters are rare.)
- You might want to use an elbow chart to decide on the number of clusters to use.
- You might want to use a scree plot to decide how many principal components to include in your clustering.
- You might want to standardize your data before PCA (if you decide to use PCA), since PCA is sensitive to the scale of each feature.
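
As a rough sketch of how the scree and elbow suggestions above could fit together, the snippet below standardizes the features, keeps enough principal components to reach a variance threshold, and records the k-means inertia for a few values of k. Two things here are assumptions rather than part of the assignment: scikit-learn's bundled copy of the dataset (`load_breast_cancer`) stands in for the CSV, and the 90% variance cutoff is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset replaces the CSV (30 features, no id column,
# labels stored separately in .target and never touched here).
X = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X)

# Scree values: cumulative share of variance captured by the components.
pca = PCA().fit(X_std)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.90) + 1)  # smallest count reaching 90% (assumed cutoff)
X_pca = pca.transform(X_std)[:, :n_components]

# Elbow values: within-cluster sum of squares (inertia) for k = 1..6.
ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca).inertia_
    for k in ks
]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
print(f"{n_components} components explain {cum_var[n_components - 1]:.1%} of the variance")
```

Plotting `ks` against `inertias` (for example with `plt.plot(ks, inertias, marker='o')`) gives the elbow chart, and plotting `pca.explained_variance_ratio_` gives the scree plot.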

## Manage your time and don't spend it all on data exploration or something like that. You got this!

In [246]:
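def standardize(data):
    # Z-score one column: subtract the mean, divide by the sample standard deviation (pandas uses ddof=1).
    return (data - data.mean()) / data.std()

def std_df(df):
    # Standardize every column of the DataFrame.
    return df.apply(standardize)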
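
One detail worth knowing about hand-rolled standardization: pandas' `.std()` divides by n - 1 by default, while `StandardScaler` uses the population standard deviation (divides by n). The sketch below, on made-up data (the `toy` frame and its column names are purely illustrative), shows that the two conventions differ only by a constant factor, so they should not change which points end up closest to which centroid.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 100 rows, 3 numeric columns.
toy = pd.DataFrame(rng.normal(5.0, 2.0, size=(100, 3)), columns=["a", "b", "c"])

z_sample = (toy - toy.mean()) / toy.std()            # pandas default, ddof=1
z_population = (toy - toy.mean()) / toy.std(ddof=0)  # the convention StandardScaler uses

# Each version has unit spread under its own convention.
print(z_sample.std().round(6).tolist())
print(z_population.std(ddof=0).round(6).tolist())

# The two scalings differ by the same factor in every column: sqrt(n / (n - 1)).
ratio = toy.std() / toy.std(ddof=0)
print(ratio.round(6).tolist())
```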

In [247]:
df_std = std_df(df)

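# Manual covariance of the standardized data, dividing by n - 1 = 568:
#M_cov = pd.DataFrame(np.dot(df_std.T, df_std))
#M_cov = M_cov/568

M_covN = np.cov(df_std.T.values, ddof=0)  # NumPy covariance matrix (population, divides by n)
M_covN = pd.DataFrame(M_covN)

M_covP = df_std.cov()  # pandas covariance matrix (sample, divides by n - 1)

# The covariance matrix is symmetric, so eigh applies and always returns real values.
val, vec = np.linalg.eigh(M_covP)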

In [248]:
idx = val.argsort()[::-1]   # indices that put the eigenvalues in descending order
val = val[idx]
vec = vec[:,idx]            # reorder the eigenvector columns to match

In [249]:
# Project the standardized data onto the eigenvectors (the columns of vec).
# Each row of P is one sample expressed in principal-component coordinates.
P = df_std.values.dot(vec)

P = pd.DataFrame(P)
P.shape

(569, 31)
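
To sanity-check a hand-rolled PCA like the one above, it can help to confirm that the variance of each projected column equals its eigenvalue. Here is a minimal NumPy-only sketch of that check; the correlated random data is an assumption standing in for the standardized cancer features, and the eigenvalue shares it prints are the heights you would draw in a scree plot.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumption: correlated synthetic data stands in for the real features.
raw = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

cov = np.cov(X, rowvar=False)      # sample covariance (ddof=1 by default)
val, vec = np.linalg.eigh(cov)     # eigh: solver for symmetric matrices
order = np.argsort(val)[::-1]      # largest eigenvalue first
val, vec = val[order], vec[:, order]

explained = val / val.sum()        # share of variance per component
scores = X @ vec                   # rows = samples, columns = components

print(explained.round(3))
# Variance of each projected column should match its eigenvalue.
print(np.allclose(scores.var(axis=0, ddof=1), val))
```

Keeping only the first few columns of `scores` is what actually reduces the dimensionality; keeping all of them just rotates the data, which leaves Euclidean distances (and therefore k-means) unchanged.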

In [250]:
from scipy.spatial import distance

def find_nearest_centroid(df, centroids):
    # Distance from every row to every centroid; shape (n_rows, n_centroids).
    distances = distance.cdist(df, centroids, 'euclidean')

    # Work on a copy so the caller's frame is not modified in place.
    df = df.copy()
    df['cluster'] = np.argmin(distances, axis=1)

    return df

def get_centroids(df, column_header):
    # New centroid = mean of the rows assigned to each cluster.
    new_centroids = df.groupby(column_header).mean()
    return new_centroids

In [251]:
cluster = P.copy()
centroids = cluster.sample(2)

In [252]:
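for _ in range(20):
    # Drop the previous assignment (if any) so only the feature columns are used for distances.
    features = cluster.drop(columns='cluster', errors='ignore')
    cluster = find_nearest_centroid(features, centroids)
    centroids = get_centroids(cluster, 'cluster')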
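
The loop above always runs 20 rounds. A common variation is to stop as soon as the assignments stop changing. The following is a minimal NumPy sketch of that idea, not a replacement for the notebook's code: the function name `kmeans`, its parameters, and the two-blob test data are all made up for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm on a float array X; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points, similar to cluster.sample(2).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing: converged
        labels = new_labels
        # Recompute each centroid; keep the old one if a cluster is empty.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Hypothetical test data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),
    rng.normal(10.0, 1.0, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(np.bincount(labels))
```

Fixing the seed makes the run repeatable, which the notebook's `cluster.sample(2)` is not unless you pass it a `random_state`.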

In [263]:
cluster['cluster'].value_counts()
cluster['answers'] = answers

# Cluster ids are arbitrary: which cluster ends up as 0 depends on the random
# starting centroids, so this 0 -> 'M', 1 -> 'B' mapping may need to be swapped.
cluster['cluster'] = cluster['cluster'].map({0: 'M', 1: 'B'})

cluster[cluster['cluster'] == cluster['answers']].shape

(519, 33)

# Stretch Goal:

Once you are satisfied with your clustering, go back and add the labels from the original dataset back in to check how accurate your clustering was. Remember that this will not be possible in true unsupervised learning, but it might be helpful for your learning to be able to check your work against the "ground truth". Try different approaches, see which one is the most successful, and try to understand why that might be the case. If you go back and try different methods, don't ever include the actual "diagnosis" labels in your clustering or PCA.
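
Because cluster ids carry no meaning of their own, one way to check against the ground truth without hard-coding which id means 'M' or 'B' is to try every one-to-one mapping and keep the best score. Below is a small sketch of that idea; `best_match_accuracy` is a hypothetical helper, the six-row example is invented, and the approach assumes there are no more clusters than classes.

```python
from itertools import permutations

import numpy as np

def best_match_accuracy(cluster_ids, truth):
    """Accuracy under the best one-to-one mapping from cluster ids to class labels.

    Assumes the number of distinct cluster ids does not exceed the number of classes.
    """
    cluster_ids = np.asarray(cluster_ids)
    truth = np.asarray(truth)
    ids = np.unique(cluster_ids)
    classes = np.unique(truth)
    best = 0.0
    # Try every assignment of distinct class labels to the cluster ids.
    for perm in permutations(classes, len(ids)):
        mapping = dict(zip(ids, perm))
        predicted = np.array([mapping[c] for c in cluster_ids])
        best = max(best, float(np.mean(predicted == truth)))
    return best

# Invented example: 5 of the 6 rows agree once cluster 0 is read as 'B'.
toy_clusters = [0, 0, 1, 1, 1, 0]
toy_truth = ['B', 'B', 'M', 'M', 'B', 'B']
acc = best_match_accuracy(toy_clusters, toy_truth)
print(round(acc, 3))
```

The labels are only used for scoring here, after the clustering is finished, which keeps them out of the clustering and PCA steps.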

**Side Note:** Data Science is never DONE. You just reach a point where the benefit isn't worth the cost anymore. There are always more moderate to small improvements that we could make. Don't be a perfectionist, be a pragmatist.