# K-Means Clustering

Your assignment is to use the "Breast Cancer Wisconsin (Diagnostic) Data Set" from Kaggle to try and cluster types of cancer cells. 

It may be helpful to use PCA to reduce the dimensions of your data first, but then again, maybe not. I dunno, you're the data scientist, you tell me. 🤪

Here's the original dataset for your reference:

<https://www.kaggle.com/uciml/breast-cancer-wisconsin-data>

## This is a supervised learning dataset

(Because it has **labels** - The "diagnosis" column.)

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://raw.githubusercontent.com/ryanleeallred/datasets/master/Cancer_Cells.csv")

## Now it's an unsupervised learning dataset

(Because we've removed the diagnosis label)

In [2]:
labels = df['diagnosis']
df = df.drop('diagnosis', axis=1)

## Cleaning Dataset

In [3]:
df.isna().sum()  # count missing values per column
df = df.drop('Unnamed: 32', axis=1)  # drop the empty 'Unnamed: 32' column

## Standardizing the dataset (me)

In [4]:
def standardize(df):
    """Return a z-scored copy of df (pandas std() uses ddof=1)."""
    df = df.copy()  # work on a copy so the caller's DataFrame is not mutated
    for column in df.columns:
        mean = df[column].mean()
        std = df[column].std()
        df[column] = (df[column] - mean) / std
    return df

In [5]:
df_std1 = standardize(df)

## Standardizing the dataset (scikit-learn)

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df)
df_std2 = pd.DataFrame(scaler.transform(df))
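The two standardizations are not quite identical: pandas' `std()` divides by n - 1 by default, while `StandardScaler` uses the population standard deviation (divides by n). A minimal sketch of that difference, using a made-up `toy` frame rather than the cancer data and emulating the scikit-learn behavior with `std(ddof=0)`:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only (not the cancer dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(50, 3)),
                   columns=["a", "b", "c"])

# Manual z-score with the pandas default: sample std (ddof=1).
sample_z = (toy - toy.mean()) / toy.std()

# Same z-score with the population std (ddof=0), which is the
# convention StandardScaler follows.
population_z = (toy - toy.mean()) / toy.std(ddof=0)

# The two results differ only by a constant factor sqrt((n - 1) / n).
n = len(toy)
ratio = np.sqrt((n - 1) / n)
same_up_to_scale = np.allclose(sample_z.to_numpy(),
                               population_z.to_numpy() * ratio)
print(same_up_to_scale)
```

For a few hundred rows the factor is very close to 1, so the two versions should lead to nearly the same principal components.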

# Principal Component Analysis

## Calculate covariance matrix (me)

In [7]:
cov1 = np.dot(df_std1.T, df_std1)  # X^T X of the standardized (zero-mean) data

cov1 /= len(df_std1) - 1  # divide by n - 1 (568 for 569 rows) to get the sample covariance

cov1 = pd.DataFrame(cov1)  # convert to a pandas DataFrame

## Calculate covariance matrix (pandas)

In [8]:
cov1_2 = df_std1.cov()  # my standardization with pandas covariance
cov2 = df_std2.cov()    # scikit-learn standardization with pandas covariance
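As a sanity check, the manual covariance (X^T X divided by n - 1 on centered data) should match what `DataFrame.cov()` returns. A small sketch on synthetic data; the `toy` frame and variable names are hypothetical:

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(40, 4)))

# Center the columns; the covariance formula below assumes zero-mean data.
centered = toy - toy.mean()
n = len(centered)

# Manual sample covariance: X^T X / (n - 1).
manual_cov = centered.to_numpy().T @ centered.to_numpy() / (n - 1)

# pandas computes the same quantity (it also uses n - 1).
pandas_cov = centered.cov().to_numpy()

match = np.allclose(manual_cov, pandas_cov)
print(match)
```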

## Find eigenvectors (numpy)

In [9]:
val1_2, vec1_2 = np.linalg.eig(cov1_2)  # my std, pandas cov
val2, vec2 = np.linalg.eig(cov2)        # scikit-learn std, pandas cov
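Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it always returns real eigenvalues, in ascending order. A minimal sketch on synthetic data (the matrix `X` is made up for illustration):

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
cov = np.cov(X, rowvar=False)  # 5 x 5 symmetric covariance matrix

# eigh is specialized for symmetric matrices; eigenvalues come back ascending.
vals, vecs = np.linalg.eigh(cov)

# Reverse so the largest eigenvalue (most variance) comes first.
vals = vals[::-1]
vecs = vecs[:, ::-1]

# Each column v of vecs should satisfy cov @ v = lambda * v.
ok = np.allclose(cov @ vecs, vecs * vals)
print(ok)
```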

### Sort eigenvectors by eigenvalue

In [10]:
idx = val1_2.argsort()[::-1]   
val1_2 = val1_2[idx]
vec1_2 = vec1_2[:,idx]

idx = val2.argsort()[::-1]   
val2 = val2[idx]
vec2 = vec2[:,idx]
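Once the eigenvalues are sorted in descending order, each one divided by their sum gives the share of variance its component explains, which is one common way to decide how many components to keep. A sketch with hypothetical eigenvalues and an assumed 90% threshold:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order.
vals = np.array([4.0, 2.5, 1.0, 0.5])

# Share of total variance explained by each component.
explained = vals / vals.sum()

# Running total of explained variance.
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 90%
# (the threshold is an arbitrary choice for this example).
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(explained, cumulative, k)
```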

## Project the standardized dataset onto the eigenvectors (me)

In [12]:
P1_2 = df_std1.dot(vec1_2)
P2 = df_std2.dot(vec2)
P1_2

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -9.175127 | 1.969538 | 1.170595 | 3.636133 | -1.192050 | 1.371074 | -0.371517 | -2.178612 | 0.231406 | 0.090317 | ... | -0.107276 | -0.069573 | 0.085157 | 0.175473 | -0.150642 | 0.200630 | -0.252970 | 0.033882 | -0.045532 | -0.047124 |
| 1 | -2.381202 | -3.750159 | 0.579719 | 1.126447 | 0.624301 | 0.126505 | 0.288016 | -0.044895 | -0.426541 | 0.659359 | ... | 0.075125 | 0.091660 | -0.213737 | -0.010359 | -0.170061 | 0.042382 | 0.180491 | -0.032601 | 0.005897 | -0.001843 |
| 2 | -5.737424 | -1.079401 | 0.532619 | 0.902676 | -0.180443 | 0.401502 | -0.462780 | 0.714580 | 0.010703 | 0.082233 | ... | -0.303018 | 0.058878 | -0.074079 | -0.103742 | 0.170599 | -0.005061 | 0.049843 | -0.046981 | -0.003287 | 0.000735 |
| 3 | -7.118121 | 10.263195 | 3.147392 | 0.121405 | -2.965446 | 2.559416 | -1.948463 | -1.286858 | -1.271519 | 1.170347 | ... | -0.410504 | 0.204940 | -0.135203 | -0.158520 | 0.075617 | 0.272808 | 0.184188 | -0.042428 | 0.068580 | -0.019983 |
| 4 | -3.942225 | -1.957967 | -1.399946 | 2.934973 | 0.540377 | -1.232216 | 0.205255 | 0.958485 | -0.628566 | 0.166208 | ... | 0.117238 | 0.020405 | 0.135312 | 0.004870 | 0.002882 | -0.039602 | 0.032558 | 0.034760 | -0.005178 | 0.021180 |
| 5 | -2.369066 | 3.961425 | 2.926767 | 0.924875 | -1.060235 | -0.483634 | 0.026780 | -0.500173 | 0.095093 | 0.113824 | ... | 0.004730 | 0.101716 | 0.032060 | -0.003064 | 0.122264 | 0.030085 | -0.084501 | -0.000730 | 0.019731 | 0.003460 |
| 6 | -2.231563 | -2.671725 | 1.674262 | 0.150511 | 0.041665 | -0.055764 | 0.262037 | 0.277535 | -0.126071 | 0.034920 | ... | 0.050346 | -0.091927 | -0.141962 | 0.080258 | -0.277295 | 0.026922 | 0.020561 | 0.046731 | 0.013041 | 0.005477 |
| 7 | -2.149542 | 2.325645 | 0.810015 | -0.149277 | -1.435638 | -1.366281 | -0.127345 | -0.952738 | 0.754346 | 0.044545 | ... | -0.044894 | 0.130001 | -0.244557 | -0.099145 | -0.196379 | 0.058059 | 0.067393 | -0.026143 | 0.002504 | 0.001660 |
| 8 | -3.162944 | 3.405830 | 3.116534 | -0.612514 | -1.522157 | 0.508742 | -0.172164 | 0.207145 | 0.808389 | -0.305701 | ... | -0.094375 | -0.197309 | -0.056171 | 0.088434 | 0.130454 | 0.055346 | 0.037997 | -0.037129 | -0.011658 | 0.005280 |
| 9 | -6.349373 | 7.716749 | 4.234564 | -3.414140 | 1.695312 | -1.027747 | -0.803853 | -2.439721 | -0.525152 | 0.573496 | ... | 0.244319 | 0.717707 | 0.161039 | -0.216508 | 0.134084 | -0.311743 | -0.001353 | -0.135604 | -0.038083 | 0.003787 |
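One way to check a projection like this is to confirm that the projected columns are uncorrelated and that their variances equal the sorted eigenvalues. The sketch below does that on synthetic data and then keeps only the first two components, which is one possible (assumed) choice of input for clustering; all names here are hypothetical:

```python
import numpy as np

# Synthetic correlated data for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# Standardize columns (sample std, ddof=1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigendecomposition of the covariance matrix, sorted descending.
cov = np.cov(Z, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
order = vals.argsort()[::-1]
vals, vecs = vals[order], vecs[:, order]

# Project the standardized data onto the eigenvectors.
P = Z @ vecs

# The covariance of the projection should be diagonal,
# with the eigenvalues on the diagonal.
proj_cov = np.cov(P, rowvar=False)
diagonal = np.allclose(proj_cov, np.diag(vals))

# Keep only the first two principal components.
P2 = P[:, :2]
print(diagonal, P2.shape)
```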


# K-Means Clustering

## Clustering (me)