# BREAST CANCER CLASSIFICATION

#### Data Attribute Information:
1. Sample code number: id number 
2. Clump Thickness: 1 - 10 
3. Uniformity of Cell Size: 1 - 10 
4. Uniformity of Cell Shape: 1 - 10 
5. Marginal Adhesion: 1 - 10 
6. Single Epithelial Cell Size: 1 - 10 
7. Bare Nuclei: 1 - 10 
8. Bland Chromatin: 1 - 10 
9. Normal Nucleoli: 1 - 10 
10. Mitoses: 1 - 10 
11. Class: (2 for benign, 4 for malignant)


More information about the dataset can be found here: [Breast Cancer Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29).

### Aim

* Predict whether a tumor is benign or malignant.
* Learn to package the solution properly using pipelines, an open-source project structure, and a `config.yaml` file.


### Author
* Mithuran Gajendran

### Imports 

In [1]:
from statistics import mean
import numpy as np
import matplotlib.pyplot as plt 
from matplotlib import style
import pickle
import random 
from sklearn import preprocessing, neighbors
import pandas as pd
from sklearn.model_selection import train_test_split
 

### Load Dataset 

In [6]:
df = pd.read_csv('../data/breast-cancer-wisconsin.data')
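As a minimal sketch of an alternative way to load the data, the `na_values` parameter of `pd.read_csv` can turn placeholder strings into `NaN` at read time. The two rows below are made up and only mimic the file layout; treating `?` as the missing-value marker is an assumption based on the replacement step later in this notebook.

```python
import io
import pandas as pd

# Two made-up rows in a layout similar to the dataset (illustrative values only).
sample_csv = io.StringIO(
    "id,clump_thickness,bare_nuclei,class\n"
    "1,5,1,2\n"
    "2,8,?,4\n"
)

# na_values="?" makes pandas parse the placeholder as NaN instead of a string,
# so the column is read as numeric rather than as object.
sample_df = pd.read_csv(sample_csv, na_values="?")

print(sample_df.dtypes)
print(sample_df.isnull().sum())
```

With this option, `isnull().sum()` reports the placeholder rows directly, and no sentinel value is needed.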

### Exploratory Data Analysis

In [8]:
df.columns

Index(['id', 'clump_thickness', ' unif_cell_size', ' unif_cell_shape',
       'marg_adhesion', 'single_epith_cell_size', 'bare_nuclei', 'bland_chrom',
       'norm_nucleoli', 'mitoses', 'class'],
      dtype='object')
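The index above shows two labels with a leading space (`' unif_cell_size'` and `' unif_cell_shape'`), which makes `df['unif_cell_size']` fail with a `KeyError`. A small sketch of one way to normalize the labels, using a hypothetical one-row frame rather than the real data:

```python
import pandas as pd

# Hypothetical frame reproducing the stray leading spaces seen in the column index.
demo = pd.DataFrame({"id": [1], " unif_cell_size": [1], " unif_cell_shape": [1]})

# Strip surrounding whitespace from every column label.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```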

In [10]:
df.head()

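| | id | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_cell_size | bare_nuclei | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |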


In [11]:
df.tail() 

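| | id | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_cell_size | bare_nuclei | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 694 | 776715 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 695 | 841769 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 696 | 888820 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 697 | 897471 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |
| 698 | 897471 | 4 | 8 | 8 | 5 | 4 | 5 | 10 | 4 | 1 | 4 |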


In [17]:
# Summary statistics for every column, numeric and non-numeric
df.describe(include='all')

| | id | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_cell_size | bare_nuclei | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 | 699.0 |
| unique | | | | | | | 11.0 | | | | |
| top | | | | | | | 1.0 | | | | |
| freq | | | | | | | 402.0 | | | | |
| mean | 1071704.0 | 4.41774 | 3.134478 | 3.207439 | 2.806867 | 3.216023 | | 3.437768 | 2.866953 | 1.589413 | 2.689557 |
| std | 617095.7 | 2.815741 | 3.051459 | 2.971913 | 2.855379 | 2.2143 | | 2.438364 | 3.053634 | 1.715078 | 0.951273 |
| min | 61634.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | | 1.0 | 1.0 | 1.0 | 2.0 |
| 25% | 870688.5 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | | 2.0 | 1.0 | 1.0 | 2.0 |
| 50% | 1171710.0 | 4.0 | 1.0 | 1.0 | 1.0 | 2.0 | | 3.0 | 1.0 | 1.0 | 2.0 |
| 75% | 1238298.0 | 6.0 | 5.0 | 5.0 | 4.0 | 4.0 | | 5.0 | 4.0 | 1.0 | 4.0 |


In [18]:
# count missing values 
df.isnull().sum() 

id                        0
clump_thickness           0
 unif_cell_size           0
 unif_cell_shape          0
marg_adhesion             0
single_epith_cell_size    0
bare_nuclei               0
bland_chrom               0
norm_nucleoli             0
mitoses                   0
class                     0
dtype: int64
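A zero count here does not prove the data is complete: `isnull()` only detects `NaN`/`None`, not placeholder strings such as `?`. The `describe()` output above lists 11 unique values for `bare_nuclei`, although its scale runs from 1 to 10, which hints at one extra non-numeric token. A small sketch, on made-up values, of counting such placeholders explicitly:

```python
import pandas as pd

# Made-up values; the '?' entries stand in for missing measurements.
demo = pd.DataFrame({
    "bare_nuclei": ["1", "10", "?", "2", "?"],
    "mitoses": [1, 1, 2, 1, 1],
})

# isnull() does not treat the '?' strings as missing ...
null_counts = demo.isnull().sum()

# ... but an explicit membership test finds them.
placeholder_counts = demo.isin(["?"]).sum()

print(null_counts)
print(placeholder_counts)
```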

In [20]:
# Count samples per class (2 = benign, 4 = malignant)
df['class'].value_counts()

2    458
4    241
Name: class, dtype: int64
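These counts give a useful reference point: a model that always predicts the majority class already reaches a certain accuracy, and any trained classifier should beat it. A short sketch of that arithmetic, using only the counts printed above:

```python
# Class counts taken from the value_counts() output above.
class_counts = {2: 458, 4: 241}  # 2 = benign, 4 = malignant

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Always predicting benign would be right for 458 of 699 samples, about 0.655.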

In [22]:
#replace "?" with -99999
df.replace('?', -99999, inplace=True)
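The sentinel value keeps every row but leaves an extreme number in the feature matrix, and the column can stay as `object` dtype. As a hedged alternative (not what this notebook does), the column can be coerced to numeric and the gaps either dropped or filled. The frame below is made up for illustration:

```python
import pandas as pd

# Made-up values; '?' marks a missing measurement.
demo = pd.DataFrame({"bare_nuclei": ["1", "10", "?", "2"], "class": [2, 4, 2, 2]})

# errors="coerce" turns anything non-numeric (here '?') into NaN.
demo["bare_nuclei"] = pd.to_numeric(demo["bare_nuclei"], errors="coerce")

# Option A: drop the incomplete rows.
dropped = demo.dropna(subset=["bare_nuclei"])

# Option B: fill the gaps with the column median.
filled = demo.fillna({"bare_nuclei": demo["bare_nuclei"].median()})

print(dropped)
print(filled)
```

Which option is preferable depends on how many rows are affected; the sketch only shows the mechanics.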

In [23]:
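# Drop the id column. Note: drop() returns a new DataFrame; because the result
# is not assigned back (df = df.drop(...)), df itself still contains 'id'.
df.drop(['id'], axis=1)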

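| | clump_thickness | unif_cell_size | unif_cell_shape | marg_adhesion | single_epith_cell_size | bare_nuclei | bland_chrom | norm_nucleoli | mitoses | class |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 694 | 3 | 1 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 2 |
| 695 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 |
| 696 | 5 | 10 | 10 | 3 | 7 | 3 | 8 | 10 | 2 | 4 |
| 697 | 4 | 8 | 6 | 4 | 3 | 4 | 10 | 6 | 1 | 4 |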
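The table above is the return value of `drop()`, not the state of `df`. By default `DataFrame.drop` returns a new object and leaves the original untouched. A minimal sketch on a made-up two-row frame shows the difference:

```python
import pandas as pd

# Made-up frame with an id column, mirroring the structure used here.
demo = pd.DataFrame({
    "id": [1000025, 1002945],
    "clump_thickness": [5, 5],
    "class": [2, 2],
})

# Without assignment the result is only returned; demo is unchanged.
demo.drop(["id"], axis=1)
still_has_id = "id" in demo.columns

# Assigning the result makes the change stick.
demo = demo.drop(columns=["id"])
has_id_after_assign = "id" in demo.columns

print(still_has_id, has_id_after_assign)
```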


### Splitting the Data

In [24]:
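# Define features X and target y
# (columns= replaces the positional axis argument, which pandas 2.x no longer accepts)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])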

In [25]:
# Hold out 20% of the samples as a test set (a single train/test split, not cross-validation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
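The split above is random on every run, so the reported accuracy changes between executions. A hedged sketch of a reproducible, class-balanced variant follows; `random_state=42` and the synthetic arrays are illustrative assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 9 features scored 1-10, labels 2 and 4.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(100, 9))
y_demo = np.array([2] * 65 + [4] * 35)

# random_state fixes the shuffle; stratify keeps the class ratio similar
# in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```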

### Training Models

In [26]:
# Create the classifier and fit it to the training data
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
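Since one aim of this notebook is to package the solution with pipelines, the sketch below shows what a scikit-learn `Pipeline` around the same classifier could look like. The `StandardScaler` step and the synthetic data are assumptions added for illustration; the notebook itself fits the classifier on unscaled features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, well-separated data: label 2 has features in 1-3, label 4 in 8-10.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200) * 2 + 2
X_demo = rng.integers(1, 4, size=(200, 3)) + (y_demo[:, None] == 4) * 7

# Scaling and classification are fitted and applied as one unit.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

knn_pipeline.fit(X_demo[:160], y_demo[:160])
demo_accuracy = knn_pipeline.score(X_demo[160:], y_demo[160:])
print(demo_accuracy)
```

Bundling the steps means the scaler is fitted on the training portion only and reused unchanged at prediction time.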

In [27]:
#test the accuracy
accuracy = clf.score(X_test, y_test)
accuracy

0.5857142857142857
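This accuracy is below the share of the majority class (458 of 699 samples, about 0.66, are benign). One plausible cause, stated as an assumption rather than a confirmed diagnosis: the earlier `df.drop(['id'], axis=1)` call does not assign its result, so `id` is still among the features, and KNN distances are dominated by that large, uninformative number. The sketch below reproduces the effect on synthetic data; `fake_id` and all values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 400

# Labels 2 or 4; label 2 has features in 1-3, label 4 in 8-10.
y_demo = rng.integers(0, 2, size=n) * 2 + 2
features = rng.integers(1, 4, size=(n, 3)) + (y_demo[:, None] == 4) * 7

# Uninformative identifier on a much larger scale than the features.
fake_id = rng.integers(60_000, 13_000_000, size=(n, 1))

def knn_accuracy(X, y):
    """Fit a default KNN on an 80/20 split and return the test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    return KNeighborsClassifier().fit(X_tr, y_tr).score(X_te, y_te)

acc_with_id = knn_accuracy(np.hstack([fake_id, features]), y_demo)
acc_without_id = knn_accuracy(features, y_demo)

print(acc_with_id, acc_without_id)
```

If the same holds for the real data, assigning the result of the `drop` call before building `X` should raise the score noticeably.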

In [28]:
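# Predict labels for the test set; X_test is already 2-D, so the reshape leaves its shape unchanged
for_testing = X_test.reshape(len(X_test),-1)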
prediction = clf.predict(for_testing)
prediction

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2,
       2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4,
       4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 2, 4, 4, 4, 2, 2, 2, 4,
       2, 2, 2, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2,
       2, 2, 2, 4, 4, 2, 4, 2, 2, 4, 4, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2,
       2, 2, 4, 2, 2, 2, 4, 2], dtype=int64)
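A raw array of predictions is hard to judge by eye, and accuracy alone does not show which class the errors fall in. A confusion matrix separates missed malignant cases from false alarms. The sketch below uses six made-up labels, not the notebook's `y_test` and `prediction`:

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels (2 = benign, 4 = malignant).
y_true_demo = [2, 2, 4, 4, 2, 4]
y_pred_demo = [2, 2, 2, 4, 2, 4]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[2, 4])

# Treating 4 (malignant) as the positive class.
tn, fp, fn, tp = cm.ravel()
malignant_recall = tp / (tp + fn)

print(cm)
print(f"missed malignant cases: {fn}, recall for malignant: {malignant_recall:.2f}")
```

In the notebook itself the same call would take `y_test` and `prediction` as arguments.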