
In [2]:
import sklearn

In [13]:
import pandas as pd

In [23]:
from sklearn.model_selection import train_test_split

In [27]:
from sklearn.naive_bayes import GaussianNB

In [39]:
from sklearn.metrics import accuracy_score

The dataset we will be working with is the [Breast Cancer Wisconsin Diagnostic Database](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-wisconsin-diagnostic-database). It contains measurements of breast cancer tumors along with a classification label for each one: malignant or benign. The dataset has 569 instances and 30 attributes, or features, such as the radius, texture, smoothness, and area of the tumor.

In [3]:
# import and load the dataset included in scikit-learn
from sklearn.datasets import load_breast_cancer

In [4]:
data = load_breast_cancer()

In [22]:
type(data)

sklearn.utils.Bunch

## What’s in our Dictionary (Bunch)?

Scikit-learn's `Bunch` is a dictionary-like object that also exposes its keys as attributes. Let’s begin exploring it by looking at its keys.

`data.keys()`

We get the following keys:

*  *`data`* is all the feature data (the attributes of the scan that help us identify if the tumor is malignant or benign, such as radius, area, etc.) in a NumPy array
*  *`target`* is the target data (the variable you want to predict, in this case whether the tumor is malignant or benign) in a NumPy array

These two keys hold the actual data. The remaining keys (below) serve a descriptive purpose. It’s important to note that all of scikit-learn's built-in datasets are divided into `data` and `target`. `data` holds the features, which are the variables the model learns from in order to make predictions. `target` holds the actual labels. In our case, the target is a single column that classifies each tumor as either 0 (malignant) or 1 (benign).

*  *`feature_names`* are the names of the feature variables, in other words names of the columns in data
*  *`target_names`* is the name(s) of the target variable(s), in other words name(s) of the target column(s)
*  *`DESCR`*, short for DESCRIPTION, is a description of the dataset
*  *`filename`* is the path to the actual file of the data in CSV format.
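
As a quick check of the keys described above, the following sketch prints the keys, the shapes of `data` and `target`, and the class names. It only uses the `load_breast_cancer()` call already shown in this notebook; the exact list of keys printed may vary with the installed scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# List the keys stored in the Bunch (the exact set can differ between versions)
print(list(data.keys()))

# Features: one row per tumor, one column per attribute
print(data.data.shape)

# Targets: one integer label per tumor
print(data.target.shape)

# Class names, ordered by label: index 0 is 'malignant', index 1 is 'benign'
print(data.target_names)

# A Bunch supports both attribute access and dictionary-style access
same_object = data.data is data["data"]
print(same_object)
```

Looking at `target_names` is the simplest way to confirm which integer label corresponds to which diagnosis before interpreting any predictions.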

In [17]:
# create a pandas df with the features
df = pd.DataFrame(data.data, columns=data.feature_names)

In [19]:
# adding the target column
df['target'] = data.target

In [20]:
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

#### Now we have to split our data into a training set and a test set.

In [26]:
train, test = train_test_split(df, test_size=0.33, random_state=42, shuffle=True)
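
To see what the split produced, here is a minimal sketch that rebuilds the same DataFrame and reports the size and class balance of each subset. The stratified variant at the end is an assumption added for comparison; it is not what this notebook uses. The names `train_s` and `test_s` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Same call as in the notebook
train, test = train_test_split(df, test_size=0.33, random_state=42, shuffle=True)

# The test size is rounded up, so 569 rows give 381 training rows and 188 test rows
print(len(train), len(test))

# Share of each class (0 = malignant, 1 = benign) in the two subsets
print(train["target"].value_counts(normalize=True))
print(test["target"].value_counts(normalize=True))

# Assumption, not part of the original notebook: a stratified split keeps
# the class proportions nearly identical in both subsets
train_s, test_s = train_test_split(
    df, test_size=0.33, random_state=42, stratify=df["target"]
)
print(test_s["target"].value_counts(normalize=True))
```

Fixing `random_state` makes the split reproducible, which is why rerunning the notebook yields the same rows in each subset.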

## Building and Evaluating the Model

There are many models for machine learning, and each model has its
own strengths and weaknesses. In this case, we will focus on a simple
algorithm that usually performs well in binary classification tasks,
namely [Naive Bayes (NB)](https://scikit-learn.org/stable/modules/naive_bayes.html).
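
As a brief sketch of what the Gaussian variant assumes (the standard textbook form, not something derived in this notebook): the features are treated as conditionally independent given the class $y$, and each feature $x_i$ is modeled with a per-class normal distribution whose mean $\mu_{iy}$ and variance $\sigma_{iy}^{2}$ are estimated from the training data. The predicted class is the one with the highest posterior:

$$
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{30} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{iy}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{iy})^{2}}{2\sigma_{iy}^{2}}\right)
$$

Here $P(y)$ is the prior probability of each class, and the product runs over the 30 features of the dataset.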

In [28]:
# initialize our classifier
gnb = GaussianNB()

In [35]:
# train our classifier
model = gnb.fit(train.loc[:, :'worst fractal dimension'], train['target'])

Now we can use the trained model to make predictions on our test set, which we do using the `predict()` method.
The `predict()` method returns an array with one predicted label for each
instance in the test set.

In [37]:
# make predictions
preds = gnb.predict(test.loc[:, :'worst fractal dimension'])
preds

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1])
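
The raw 0/1 array can be made easier to read. The sketch below, which repeats the setup so it runs on its own, maps each prediction to its class name through `data.target_names` and also asks the classifier for class probabilities with `predict_proba()`. The variable `feature_cols` is a hypothetical helper name; selecting columns by name is an alternative to the `.loc` slice used above and picks the same 30 feature columns.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
train, test = train_test_split(df, test_size=0.33, random_state=42, shuffle=True)

# Hypothetical helper: the 30 feature columns, excluding 'target'
feature_cols = list(data.feature_names)

gnb = GaussianNB()
gnb.fit(train[feature_cols], train["target"])

# Integer predictions, one per test row
preds = gnb.predict(test[feature_cols])

# Translate 0/1 into 'malignant'/'benign' by indexing the class-name array
pred_names = data.target_names[preds]
print(pred_names[:5])

# Per-class probabilities: column 0 is malignant, column 1 is benign
proba = gnb.predict_proba(test[feature_cols])
print(proba[:5].round(3))
```

Probabilities are useful when a borderline case should be flagged for review rather than assigned a hard label.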

##  Evaluating the Model’s Accuracy

Using the target column of the test DataFrame, we can evaluate the accuracy of our
model’s predictions by comparing the two series (the `target` column
vs. `preds`). We will use the sklearn function `accuracy_score()` to
determine the accuracy of our machine learning classifier.

In [40]:
# evaluate accuracy
accuracy_score(test['target'], preds)

0.9414893617021277

The NB classifier is 94.15% accurate on the test set. This means that it correctly predicted whether the tumor was malignant or benign for 94.15 percent of the test instances (177 of 188). These results suggest that our feature set of 30 attributes is a good indicator of tumor class.
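
Accuracy alone does not show which kind of mistake the classifier makes. As a possible follow-up (an addition, not part of the original analysis), the sketch below repeats the pipeline and prints a confusion matrix and a per-class report using `confusion_matrix()` and `classification_report()` from `sklearn.metrics`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target
train, test = train_test_split(df, test_size=0.33, random_state=42, shuffle=True)

# Train and predict exactly as in the cells above
gnb = GaussianNB()
gnb.fit(train.loc[:, :"worst fractal dimension"], train["target"])
preds = gnb.predict(test.loc[:, :"worst fractal dimension"])

# Overall fraction of correct predictions
acc = accuracy_score(test["target"], preds)
print(acc)

# Rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
cm = confusion_matrix(test["target"], preds)
print(cm)

# Precision, recall, and F1 for each class, labeled with the class names
print(classification_report(test["target"], preds, target_names=data.target_names))
```

In a diagnostic setting, the off-diagonal cell for malignant tumors predicted as benign is usually the most costly error, so recall for the malignant class deserves attention alongside overall accuracy.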