# Classification

Classification is a technique that assigns a class label to an observation based on its observed values. Models that perform classification tasks are usually called classifiers. Classifiers are commonly used in face recognition, spam identification, and similar tasks.

# Steps for building a Classifier 

## STEP 1: Import scikit-learn

Scikit-learn (imported as sklearn) is a powerful machine learning library with built-in datasets and support for many widely used machine learning algorithms.

In [1]:
import sklearn

## STEP 2: Import Data

We will use one of scikit-learn's built-in datasets, the Breast Cancer Wisconsin (Diagnostic) dataset. It contains information about breast cancer tumours along with classification labels of malignant or benign. The dataset has 569 instances, one per tumour, and 30 attributes (features) such as the radius of the tumour, its texture, and its smoothness.

In [2]:
#import the dataset
from sklearn.datasets import load_breast_cancer

#now we load the dataset
data = load_breast_cancer()

#### Keys for the data 

Below are important dictionary keys for the data:

Classification label names (target_names)

The actual labels (target)

The attribute/feature names (feature_names)

The attribute values (data)

Using these keys, we can assign the important parts of the data to variables.

In [3]:
label_names = data['target_names']

labels = data['target']

feature_names = data['feature_names']

features = data['data']

In [4]:
#we will print the label names to see which classes we are working with
print(label_names)

['malignant' 'benign']
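Before moving on, it can help to confirm that the loaded data matches the description above. The following is a minimal sketch that reloads the dataset on its own; the variable names n_samples, n_features and class_counts are our own and are not part of the notebook.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# One row per tumour, one column per feature
n_samples, n_features = data['data'].shape

# Translate each numeric label (0 or 1) into its name and count the occurrences
class_counts = Counter(str(data['target_names'][label]) for label in data['target'])

print(n_samples, n_features)
print(class_counts)

# Look at the first few feature names
print(data['feature_names'][:3])
```

The shape should come out as 569 rows by 30 columns, and the counts show that the classes are not balanced: 212 tumours are malignant and 357 are benign.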


## STEP 3: Organizing Data into sets

Next we will split the data into a training set and a test set so that we can evaluate the model on data it has not seen during training. We will hold out 40% of the data for testing and train on the remaining 60%.

In [5]:
#we will import sklearn's train_test_split function
from sklearn.model_selection import train_test_split

#Now we will split the data
train_data, test_data, train_labels, test_labels = train_test_split(features, labels, test_size=0.40, random_state=42) 
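To see what the split produced, we can check the size of each set. This is a small self-contained sketch that repeats the same split; n_train and n_test are names introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()

# Same arguments as the split above: 40% test data, fixed seed for repeatability
train_data, test_data, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)

n_train = len(train_data)
n_test = len(test_data)
print(n_train, n_test)
```

With 569 samples and test_size=0.40, the test set is rounded up to 228 samples, which leaves 341 for training. As an optional variation that this notebook does not use, train_test_split also accepts stratify=labels, which keeps the proportion of malignant and benign tumours the same in both sets.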

## STEP 4: Building a classifier model

Later on we will build several classifiers, but for now we are going to use the Gaussian Naive Bayes algorithm.

In [6]:
from sklearn.naive_bayes import GaussianNB

In [7]:
#Now to initialize it
gnb = GaussianNB()

In [8]:
#Now we are going to fit the model with our training data
model = gnb.fit(train_data, train_labels)

## STEP 5: Evaluating the model using performance measures

We are going to get the predictions of our model and score them using four performance measures: accuracy, precision, recall, and F1 score.

First we are going to compute the predictions for the test set.

In [9]:
prediction = gnb.predict(test_data)
#now we are going to output the predictions for the test set
print("Prediction:")
print(prediction)

Prediction:
[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1 0 1 0 1 1 1 1 0 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1
 0 0 1 1 0 1]


The series of 0s and 1s above are the predicted tumour classes. Following the order of label_names, 0 means malignant and 1 means benign.

By comparing test_labels (the true classes) with prediction (the predicted classes), we can compute each of our performance measures.
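One way to see what this comparison involves is a confusion matrix, which counts how many test samples fall into each combination of true and predicted class. The sketch below repeats the steps so far in one place and uses scikit-learn's confusion_matrix function; the names tn, fp, fn and tp are our own. Note that it treats label 1 (benign) as the positive class, which is also scikit-learn's default for the scores computed later.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train_data, test_data, train_labels, test_labels = train_test_split(
    data['data'], data['target'], test_size=0.40, random_state=42
)
prediction = GaussianNB().fit(train_data, train_labels).predict(test_data)

# Rows are the true classes and columns are the predicted classes, in label order (0, 1).
# Flattening the 2x2 matrix gives the four counts in this order.
tn, fp, fn, tp = confusion_matrix(test_labels, prediction).ravel()

print("true negatives (malignant predicted as malignant):", tn)
print("false positives (malignant predicted as benign):", fp)
print("false negatives (benign predicted as malignant):", fn)
print("true positives (benign predicted as benign):", tp)
```

The four counts always add up to the number of test samples, and accuracy, precision, recall and F1 can all be written in terms of them.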

In [10]:
#import the performance measures

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

Now we will explain each of the performance measures we just imported.

#### Accuracy

This is the number of predictions our model got right divided by the total number of predictions it made on the test set.

In [12]:
accuracy = accuracy_score(test_labels, prediction) 
print("Accuracy score:")
print(accuracy)

Accuracy score:
0.9517543859649122
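To make the calculation concrete, here is a minimal sketch of accuracy written by hand. The accuracy function and the toy labels are made up for illustration and are not part of the notebook. The count of 217 correct predictions is also not printed anywhere above; it is inferred from the reported score and the 228 test samples.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels (made up): 4 of the 5 predictions match
toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.8

# Inferred, not printed by the notebook: the score above times the test set size
inferred_correct = round(0.9517543859649122 * 228)
print(inferred_correct, "correct out of 228 gives", inferred_correct / 228)
```

In other words, the reported score is consistent with the model classifying 217 of the 228 test tumours correctly.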


#### Precision

This is the fraction of the model's positive predictions that are actually positive. It is expressed as true positives / (true positives + false positives). By default, scikit-learn treats label 1 (benign in this dataset) as the positive class.

In [14]:
precision = precision_score(test_labels, prediction)
print("Precision Score:")
print(precision)

Precision Score:
0.9536423841059603
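As a rough illustration, precision can be computed by hand as true positives divided by all positive predictions. The precision function, its positive parameter and the toy labels below are assumptions made for this sketch; scikit-learn's own equivalent of the positive parameter is called pos_label.

```python
def precision(y_true, y_pred, positive=1):
    """Of everything predicted as `positive`, the fraction that really is."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp)


# Toy labels (made up): six predictions of class 1, four of them correct
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 1, 1, 1, 0, 1]

toy_precision = precision(toy_true, toy_pred)
print(toy_precision)

# The score changes when class 0 is treated as the positive class instead
toy_precision_class0 = precision(toy_true, toy_pred, positive=0)
print(toy_precision_class0)
```

The second call shows why it matters which class counts as positive: the same predictions give a different precision for class 0 than for class 1.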


#### Recall

This is similar to precision, but recall measures the fraction of actual positives that the model correctly identified. It is expressed as true positives / (true positives + false negatives).

In [15]:
recall = recall_score(test_labels, prediction)
print("Recall:")
print(recall)

Recall:
0.972972972972973
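Recall can be sketched by hand in the same way, this time dividing the true positives by all samples that are actually positive. As before, the recall function, its positive parameter and the toy labels are made up for illustration.

```python
def recall(y_true, y_pred, positive=1):
    """Of everything that really is `positive`, the fraction the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


# Toy labels (made up): five samples are truly class 1, four of them are found
toy_true = [1, 1, 1, 0, 0, 1, 0, 1]
toy_pred = [1, 0, 1, 1, 1, 1, 0, 1]

toy_recall = recall(toy_true, toy_pred)
print(toy_recall)

# Recall for class 0: three samples are truly class 0, only one is found
toy_recall_class0 = recall(toy_true, toy_pred, positive=0)
print(toy_recall_class0)
```

A model can score well on one of the two measures and poorly on the other, for example by predicting the positive class very often, which raises recall at the cost of precision.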


#### F1

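This is the harmonic mean of precision and recall, expressed as 2 * (precision * recall) / (precision + recall). It is high only when both precision and recall are high.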

In [17]:
f1 = f1_score(test_labels, prediction)
print("F1 Score")
print(f1)

F1 Score
0.9632107023411371
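The F1 score is conventionally defined as the harmonic mean of precision and recall, so we can check the reported value using the two scores printed earlier. This is a small sketch using only Python's standard library; the variable names are our own.

```python
from statistics import harmonic_mean

# Scores copied from the precision and recall outputs above
precision = 0.9536423841059603
recall = 0.972972972972973

# Harmonic mean of two numbers, written out explicitly
f1 = 2 * precision * recall / (precision + recall)
print(f1)

# The standard library gives the same result
print(harmonic_mean([precision, recall]))

# For comparison, the ordinary (arithmetic) mean is never lower than the harmonic mean
arithmetic_mean = (precision + recall) / 2
print(arithmetic_mean)
```

Both calculations should come out at approximately 0.9632, in line with the output of f1_score. The harmonic mean is pulled towards the smaller of the two values, so a model cannot get a high F1 score by doing well on only one of precision and recall.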
