# Prediction of cancer type using k-nearest neighbors algorithm

##  Data description

Title: Wisconsin Diagnostic Breast Cancer (WDBC)

1. Number of instances: 569
2. Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
3. Attribute information
* 1) ID number
* 2) Diagnosis (M = malignant, B = benign)
* 3-32) Ten real-valued features are computed for each cell nucleus:

* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry
* fractal dimension ("coastline approximation" - 1)

The papers cited in the original dataset documentation contain detailed descriptions of how these features are computed.

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
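
The field layout described above can be sketched in a few lines. This is only an illustration: the `_se` suffix is an assumption (the table output further down shows only `_mean` and `_worst` columns), and `field_number` is a hypothetical helper, not part of the dataset tooling.

```python
# Ten base features, in the order listed above (underscores as in the CSV header).
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Three statistics per feature; "_se" is an assumed suffix for the standard error.
statistics = ["mean", "se", "worst"]

# All ten means come first, then the ten standard errors, then the ten worst values.
feature_columns = [f"{feat}_{stat}" for stat in statistics for feat in base_features]


def field_number(column):
    """Return the 1-based field number of a feature column (hypothetical helper)."""
    # Fields 1 and 2 are the ID and the diagnosis, so the features start at field 3.
    return feature_columns.index(column) + 3


print(len(feature_columns))  # 30
print(field_number("radius_mean"), field_number("radius_se"), field_number("radius_worst"))  # 3 13 23
```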

All feature values are recorded with four significant digits.

4. Missing attribute values: none
5. Class distribution: 357 benign, 212 malignant

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as skl
from scipy import stats
from sklearn.preprocessing import StandardScaler
 
data = pd.read_csv('CancerDiagnosis.csv',sep=';')
data.head()

,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


We recode the 'diagnosis' column as numbers, because machine learning algorithms and other calculations work more conveniently with numeric codes for categorical variables.

B - Benign - 0

M - Malignant - 1

In [2]:
# Map the labels to integers; assigning the result back avoids chained in-place replacement
data['diagnosis'] = data['diagnosis'].map({'B': 0, 'M': 1})
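
A quick sanity check after recoding can catch unexpected labels. The sketch below runs on a toy Series with made-up values rather than the real column, and relies on the fact that `Series.map` yields `NaN` for any label missing from the dictionary.

```python
import pandas as pd

# Toy labels standing in for the 'diagnosis' column.
labels = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# map() returns NaN for any label missing from the dictionary,
# so an unexpected code shows up as a missing value instead of passing silently.
encoded = labels.map({"B": 0, "M": 1})

assert encoded.notna().all(), "found a label other than 'B' or 'M'"
print(encoded.value_counts().sort_index().to_dict())  # {0: 3, 1: 2}
```

On the real data, the same count should match the class distribution given in the data description: 357 zeros and 212 ones.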

# Principal component analysis
With 30 features we work in 30 dimensions, which can significantly slow down the k-nearest neighbors algorithm. Principal component analysis (PCA) can reduce the dimensionality before classification.

##  Standardization of features

In [3]:
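# Columns 0 and 1 hold the ID and the diagnosis; the 30 features start at column 2
df = data.iloc[:, 2:]
diagnosis = data.iloc[:, 1]
# Standardize each feature to zero mean and unit variance
x = StandardScaler().fit_transform(df)
x = pd.DataFrame(x, columns=df.columns)
x.head()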

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave_points_mean,symmetry_mean,fractal_dimension_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave_points_worst,symmetry_worst,fractal_dimension_worst
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,2.217515,2.255747,...,1.88669,-1.359293,2.303601,2.001237,1.307686,2.616665,2.109526,2.296076,2.750622,1.937015
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,0.001392,-0.868652,...,1.805927,-0.369203,1.535126,1.890489,-0.375612,-0.430444,-0.146749,1.087084,-0.24389,0.28119
2,1.579888,0.456187,1.566503,1.558884,0.94221,1.052926,1.363478,2.037231,0.939685,-0.398008,...,1.51187,-0.023974,1.347475,1.456285,0.527407,1.082932,0.854974,1.955,1.152255,0.201391
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,2.867383,4.910919,...,-0.281464,0.133984,-0.249939,-0.550021,3.394275,3.893397,1.989588,2.175786,6.046041,4.93501
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.53934,1.371011,1.428493,-0.00956,-0.56245,...,1.298575,-1.46677,1.338539,1.220724,0.220556,-0.313395,0.613179,0.729259,-0.868353,-0.3971
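
Standardization matters here because the raw features live on very different scales (area values are in the hundreds or thousands, smoothness around 0.1), and both PCA and kNN are driven by variances and distances. A minimal sketch on a made-up matrix (the numbers are illustrative, not taken from the dataset) showing that `StandardScaler` computes z = (x - mean) / std per column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with two features on very different scales.
raw = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# StandardScaler divides by the population standard deviation (ddof=0).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=0)
scaled = StandardScaler().fit_transform(raw)

# The manual formula and the scaler agree; each column ends up with mean 0 and std 1.
print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```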


## PCA implementation
Using PCA, we reduce the dimensionality from 30 to 7 while retaining at least 90 percent of the variance.

In [4]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.9)
principalComponents = pca.fit_transform(x)
print('Number of components = {}'.format(pca.n_components_))


Number of components = 7
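
Passing a float to `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below makes the selection explicit. It assumes that the copy of WDBC bundled with scikit-learn (`load_breast_cancer`) can stand in for `CancerDiagnosis.csv`; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy of WDBC stands in for the CSV file.
features, _ = load_breast_cancer(return_X_y=True)
scaled = StandardScaler().fit_transform(features)

# Fit all components and accumulate their share of the total variance.
full_pca = PCA().fit(scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share exceeds 90 percent.
n_needed = int(np.argmax(cumulative > 0.9)) + 1

# A float n_components asks PCA to make the same selection internally.
auto_pca = PCA(n_components=0.9).fit(scaled)
print(n_needed, auto_pca.n_components_)
print(cumulative[:n_needed].round(3))
```

Printing the cumulative shares also shows how much variance each additional component contributes, which helps when deciding whether a different threshold would be more appropriate.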


In [9]:
principalDf = pd.DataFrame(
    data=principalComponents,
    columns=['pc1', 'pc2', 'pc3', 'pc4', 'pc5', 'pc6', 'pc7'],
)
principalDf['diagnosis'] = diagnosis
principalDf.head()


,pc1,pc2,pc3,pc4,pc5,pc6,pc7,diagnosis
0,9.192837,1.948583,-1.123166,3.633731,-1.19511,1.411424,2.15937,1
1,2.387802,-3.768172,-0.529293,1.118264,0.621775,0.028656,0.013358,1
2,5.733896,-1.075174,-0.551748,0.912083,-0.177086,0.541452,-0.668166,1
3,7.122953,10.275589,-3.23279,0.152547,-2.960878,3.053422,1.429911,1
4,3.935302,-1.948072,1.389767,2.940639,0.546747,-1.226495,-0.936213,1


# kNN algorithm

##  Splitting data into train and test data

In [38]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    principalDf.iloc[:, :7], principalDf['diagnosis'], test_size=0.2
)
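
The split above is random and unseeded, so the evaluation numbers change from run to run. A possible refinement, sketched here on synthetic labels with the class balance from the data description, is to fix a seed and stratify by class. Both `random_state=0` and `stratify` are additions for illustration and were not used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the class balance from the data description:
# 357 benign (0) and 212 malignant (1).
labels = np.array([0] * 357 + [1] * 212)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels,
    test_size=0.2,
    stratify=labels,   # keep the benign/malignant ratio in both parts
    random_state=0,    # arbitrary seed, fixed only for reproducibility
)

print(len(y_tr), len(y_te))  # 455 114
# Share of malignant samples in each part; both stay close to 212/569.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```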

##  Fitting kNN
The number of nearest neighbors, k, is chosen somewhat arbitrarily. Because we have only 2 classes and a modest amount of data, k=5 seems reasonable.

In [39]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)

KNeighborsClassifier()
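
Instead of fixing k by hand, the choice could be checked with cross-validation. The sketch below is one way to do that, under some assumptions: it uses the WDBC copy bundled with scikit-learn in place of the CSV file, and the grid of odd k values from 1 to 15 is arbitrary. Putting the scaler and PCA inside a `Pipeline` means they are refit on the training part of every fold, so no information from the held-out samples leaks into the preprocessing.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for the CSV file.
# Its target is coded 0 = malignant, 1 = benign, the reverse of this notebook.
features, target = load_breast_cancer(return_X_y=True)

# Scaling and PCA live inside the pipeline, so they are refit on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.9)),
    ("knn", KNeighborsClassifier()),
])

# Candidate values of k; the odd numbers 1..15 are an arbitrary illustrative grid.
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 16, 2))},
    cv=5,
    scoring="accuracy",
)
search.fit(features, target)

print(search.best_params_)
print(round(search.best_score_, 3))
```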

## Predicting

In [40]:
prediction = knn.predict(X_test)

## Evaluation

In [44]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, prediction))

[[67  1]
 [ 2 44]]


In [45]:
knn.score(X_test,y_test)

0.9736842105263158

In [46]:
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98        68
           1       0.98      0.96      0.97        46

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
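
The figures in the report can be re-derived by hand from the confusion matrix printed earlier, which is a useful check on how to read it. In scikit-learn's layout, rows are true classes and columns are predicted classes. The variable names below are illustrative.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 67, 1   # true benign:    predicted benign, predicted malignant
fn, tp = 2, 44   # true malignant: predicted benign, predicted malignant

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Metrics for the malignant class (label 1).
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)
f1_malignant = (2 * precision_malignant * recall_malignant
                / (precision_malignant + recall_malignant))

print(total, round(accuracy, 4))  # 114 0.9737
print(round(precision_malignant, 2), round(recall_malignant, 2), round(f1_malignant, 2))  # 0.98 0.96 0.97
```

The two false negatives (malignant cases predicted as benign) are what lowers the recall for class 1 to 0.96.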



## Conclusions
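The k-nearest neighbors classifier reached about 97% accuracy on the test set (111 of 114 samples classified correctly). Precision and recall are at least 0.96 for both classes, so the model performs well on benign and malignant cases alike.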

#  Remark
Since I am still learning and want to improve my coding and machine learning skills, I would be grateful for any feedback and advice.

24.06.2022