This is my first simple machine learning program, which uses a database of breast cancer tumor information.

Using Naive Bayes and k-nearest neighbors (kNN) classifiers, I will predict whether a tumor is malignant or benign.

## Stage 1: 

### Load the data

The data used here is already available in `sklearn.datasets`; I just converted it into a CSV file.
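For reference, the conversion mentioned above could look like the following minimal sketch. It assumes the CSV was produced from `sklearn.datasets.load_breast_cancer` with the target stored in a column named `Labels` (matching the column list printed in Stage 2). The helper name `export_cancer_csv` is hypothetical, and the demo writes to a temporary directory only to keep the example side-effect free.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_breast_cancer


def export_cancer_csv(path):
    """Write the sklearn breast cancer dataset to a CSV file (hypothetical helper)."""
    bunch = load_breast_cancer(as_frame=True)  # as_frame requires scikit-learn >= 0.23
    frame = bunch.data.copy()                  # the 30 feature columns
    frame['Labels'] = bunch.target             # assumed name of the target column
    frame.to_csv(path, index=False)            # index=False avoids an extra index column
    return frame


# Round trip: write the file, then read it back the way the notebook does
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'Cancer_data.csv')
    frame = export_cancer_csv(csv_path)
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
```

In the notebook itself, calling `export_cancer_csv('Cancer_data.csv')` would place the file next to the notebook so that `pd.read_csv('Cancer_data.csv')` can find it.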

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('Cancer_data.csv')
cancer_data = data.copy()  # work on a deep copy so the original data stays untouched

## Stage 2:

### Explore the data

Check all column names

In [3]:
cancer_data.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'Labels'],
      dtype='object')

Check for null values

In [4]:
cancer_data.isnull().sum()

mean radius                0
mean texture               0
mean perimeter             0
mean area                  0
mean smoothness            0
mean compactness           0
mean concavity             0
mean concave points        0
mean symmetry              0
mean fractal dimension     0
radius error               0
texture error              0
perimeter error            0
area error                 0
smoothness error           0
compactness error          0
concavity error            0
concave points error       0
symmetry error             0
fractal dimension error    0
worst radius               0
worst texture              0
worst perimeter            0
worst area                 0
worst smoothness           0
worst compactness          0
worst concavity            0
worst concave points       0
worst symmetry             0
worst fractal dimension    0
Labels                     0
dtype: int64

There are no null values in our data.

Check the type of our data

In [5]:
cancer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

In [None]:
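import seaborn as sns  # needed for pairplot; not imported in any earlier cell

# Pairwise plots of all 30 features form a 30x30 grid, which is slow to render
sns.pairplot(cancer_data, hue='Labels')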

KeyboardInterrupt: 

So, the data contains 569 rows and 31 columns with no null values: 30 `float64` columns and one `int64` column.
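Since `info()` prints one row per column, a more compact check of the same facts can be handy. The sketch below rebuilds the frame from `sklearn.datasets.load_breast_cancer` as a stand-in for `Cancer_data.csv` (an assumption, since the CSV itself is not available here) and counts the columns per dtype.

```python
from sklearn.datasets import load_breast_cancer

# Stand-in for pd.read_csv('Cancer_data.csv'); assumes the CSV mirrors this dataset
bunch = load_breast_cancer(as_frame=True)
cancer_data = bunch.data.copy()
cancer_data['Labels'] = bunch.target

n_rows, n_cols = cancer_data.shape
dtype_counts = cancer_data.dtypes.value_counts()      # number of columns per dtype
total_nulls = int(cancer_data.isnull().sum().sum())   # nulls over the whole frame

print(n_rows, n_cols)
print(dtype_counts)
print(total_nulls)
```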

Now, we can proceed to the next step.

## Stage 3:

### Preparing our data

Choosing the features and the target

All the columns will be the features except for the last which will be the target

In [9]:
features_name = cancer_data.columns[:-1]
target_name = cancer_data.columns[-1]

Get the features and target data

In [13]:
features = cancer_data[features_name]
target = cancer_data[target_name]

Our target contains only 0 and 1, which stand for malignant and benign respectively.
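To see how many tumors fall into each class, the labels can be mapped back to their names and counted. This is a small sketch that assumes the `Labels` column uses the same encoding as `load_breast_cancer` (0 = malignant, 1 = benign), as stated above; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)
target = bunch.target.rename('Labels')  # assumed to match the CSV's target column

# Map the numeric codes to readable names: {0: 'malignant', 1: 'benign'}
label_names = {code: str(name) for code, name in enumerate(bunch.target_names)}
class_counts = target.map(label_names).value_counts()

print(class_counts)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading accuracy figures later on.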

Now, let's organize the data by splitting it into a training set and a test set.

sklearn already provides a function that performs this split.

In [14]:
from sklearn.model_selection import train_test_split

Let's split it and get 30% of the original data to be our test data.

In [15]:
train, test, train_target, test_target = train_test_split(features, target, test_size=0.30, random_state=20)
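A quick look at the resulting shapes confirms what the 30% split produces. The sketch below again loads the data from `sklearn.datasets` as a stand-in for the CSV (an assumption). The second call, with `stratify=target`, is an optional variant that is not part of the original notebook; it keeps the class proportions roughly the same in both sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

features, target = load_breast_cancer(return_X_y=True, as_frame=True)

# Same call as in the notebook: 30% of the rows go to the test set
train, test, train_target, test_target = train_test_split(
    features, target, test_size=0.30, random_state=20)
print(train.shape, test.shape)

# Optional variant (not in the original): preserve class proportions in each set
train_s, test_s, train_target_s, test_target_s = train_test_split(
    features, target, test_size=0.30, random_state=20, stratify=target)
print(test_target_s.value_counts(normalize=True))
```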

It seems our data is ready, now we can build our model

## Stage 4:

### Build the model

First, let's use the Naive Bayes classifier.

We can directly use the existing `GaussianNB` class in sklearn.

In [16]:
from sklearn.naive_bayes import GaussianNB

Create the Naive Bayes model

In [17]:
nb = GaussianNB()

Train the classifier

In [18]:
nb.fit(train, train_target)

Make a prediction

In [19]:
pred_nb = nb.predict(test)

We have made predictions with the Naive Bayes classifier; now let's use the kNN classifier.

sklearn also provides a ready-made class for the kNN classifier.

In [20]:
from sklearn.neighbors import KNeighborsClassifier

Create the kNN classifier, and let's choose 2 neighbors

In [23]:
kNN = KNeighborsClassifier(n_neighbors=2)

Train the kNN classifier

In [24]:
kNN.fit(train, train_target)

Make a prediction

In [25]:
pred_kNN = kNN.predict(test)
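The value `n_neighbors=2` is one choice among many, and an even number of neighbors can lead to tied votes between the two classes. One possible way to pick the value, sketched below, is to compare a few odd candidates with 5-fold cross-validation on the training set. The `StandardScaler` step and the candidate list are assumptions that are not part of the original notebook; scaling is added because kNN relies on distances and the features have very different ranges.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, target = load_breast_cancer(return_X_y=True, as_frame=True)
train, test, train_target, test_target = train_test_split(
    features, target, test_size=0.30, random_state=20)

cv_scores = {}
for k in (1, 3, 5, 7, 9):  # illustrative candidates, odd to avoid tied votes
    # Scale the features, then classify; the scaler is refit inside each fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, train, train_target, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)  # k with the highest mean CV accuracy
print(cv_scores)
print(best_k)
```

Only the training set is used here, so the test set stays untouched for the final evaluation.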

## Stage 5:

### Evaluating each model

We can evaluate our models using two different methods.

##### Method 1

Get the accuracy of each model

- Naive Bayes classifier

In [26]:
from sklearn.metrics import accuracy_score

In [27]:
accuracy_score(test_target, pred_nb)

0.9532163742690059

- kNN classifier

In [28]:
accuracy_score(test_target, pred_kNN)

0.8654970760233918

On this test split, the Naive Bayes classifier (accuracy of about 0.95) performs better than the kNN classifier (about 0.87).
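Accuracy is simply the fraction of test samples whose predicted label matches the true one. The toy arrays below are made up for illustration; the last two lines relate the scores reported above to counts out of the 171 test samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])

manual_accuracy = np.mean(y_true == y_pred)         # 6 of 8 predictions match
library_accuracy = accuracy_score(y_true, y_pred)   # same value via sklearn
print(manual_accuracy, library_accuracy)

# The scores reported above correspond to whole counts out of 171 test samples
nb_accuracy = 163 / 171    # Naive Bayes: 163 correct predictions
knn_accuracy = 148 / 171   # kNN: 148 correct predictions
print(nb_accuracy, knn_accuracy)
```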

##### Method 2

Get a full classification report for each model

In [29]:
from sklearn.metrics import classification_report

- Naive Bayes classifier

In [31]:
print(classification_report(test_target, pred_nb))

              precision    recall  f1-score   support

           0       0.94      0.94      0.94        64
           1       0.96      0.96      0.96       107

    accuracy                           0.95       171
   macro avg       0.95      0.95      0.95       171
weighted avg       0.95      0.95      0.95       171



- kNN classifier

In [32]:
print(classification_report(test_target, pred_kNN))

              precision    recall  f1-score   support

           0       0.77      0.91      0.83        64
           1       0.94      0.84      0.89       107

    accuracy                           0.87       171
   macro avg       0.86      0.87      0.86       171
weighted avg       0.88      0.87      0.87       171



Using `classification_report`, we get the precision, recall, F1-score and accuracy of each model in one go.
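The precision and recall values in these reports come from counting correct and incorrect predictions per class, which a confusion matrix makes explicit. The following sketch recomputes both for class 1 (benign) with the Naive Bayes model. It loads the data from `sklearn.datasets` as a stand-in for the CSV (an assumption), so the exact numbers may differ from the reports above if the CSV differs in any way.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

features, target = load_breast_cancer(return_X_y=True, as_frame=True)
train, test, train_target, test_target = train_test_split(
    features, target, test_size=0.30, random_state=20)

pred_nb = GaussianNB().fit(train, train_target).predict(test)

# Rows are true classes, columns are predicted classes, both ordered as [0, 1]
cm = confusion_matrix(test_target, pred_nb, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()  # class 1 (benign) is treated as the positive class

precision_1 = tp / (tp + fp)  # of all samples predicted benign, how many are benign
recall_1 = tp / (tp + fn)     # of all truly benign samples, how many were found

print(cm)
print(precision_1, recall_1)
```

For this dataset, the recall of class 0 (malignant) is arguably the most important figure, since it measures how many malignant tumors the model actually catches.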