## Chapter 11: Data Classification

When data analysts mine data, they must often assign a classification label to data (such as labeling a pet as a dog, cat, or horse) or determine whether an object is or isn’t something (such as whether a loan is approved or disapproved). In data mining, classification is the use of a supervised machine-learning algorithm to assign an observation to a specific category. Depending on the application, the classification may be binary (two classes):

    •	Is the customer a male or female?
    •	Is the pet a dog or a cat?
    •	Should the loan be approved or disapproved?
    •	Is the wine red or white?
    •	Is the tumor benign or malignant?

Or, the classification may use multiple classes (categories):

    •	Is a person’s facial expression happy, sad, angry?
    •	What breed of dog is my pet?
    •	Which user’s biometric data was entered?
    •	Which voice-recognition option did the caller select?
    •	How well will a book sell (poorly, well, very well, extremely well)?

Classification algorithms work by examining an input “training set” of data to learn how the data values combine to create a result. Such a training set, for example, might contain heights, weights, colors, and temperaments of different dogs and the resulting breeds, or it might contain the sizes, shapes, dimensions, and locations of tumors that are malignant as well as similar data for tumors that are benign. In other words, the training data contains predictive values and the correct classification results.  

After the learning algorithm learns from and models the training data, a “test” dataset (for which the correct results are known) is run against the model to determine its accuracy, such as 97%. With knowledge of the accuracy in hand, the data analyst can then use the model to classify other data values.

Normally, the training set and testing set come from the same dataset of values that are known to be correct or observed. The data analyst will specify, for example, that 70% of the data will be used for training and 30% for testing. Across the Web, you can find many different datasets of known or observed data that you can use to try different classification algorithms. This notebook will use several commonly used algorithms to:

    •	Determine to which type of Iris a flower belongs.
    •	Determine if a breast-cancer tumor is malignant or benign.
    •	Determine, based on chemical composition, a wine’s type.
    •	Determine whether breast-surgery patients are likely to survive five or more years after surgery.
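
The 70%/30% split described above can be sketched as follows. This is a minimal illustration, not one of the chapter's scripts: it assumes scikit-learn's bundled copy of the Iris data (`load_iris`) rather than a CSV file, and the `random_state` seed is an arbitrary choice that makes the split repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load scikit-learn's bundled copy of the Iris data (150 observations).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the observations for testing. The random_state value is
# an arbitrary seed (an assumption here) so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

print('Training observations:', len(X_train))
print('Testing observations: ', len(X_test))
```

The model only ever learns from `X_train` and `y_train`; the held-out `X_test` and `y_test` are used afterward to estimate accuracy.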

Some of the scripts presented in this notebook use several Python libraries that have been pre-installed for you. If you needed to install these libraries on your own, you would issue the following commands:

```python
! pip install --user pandas
! pip install --user numpy
! pip install --user scikit-learn
! pip install --user matplotlib
! pip install --user graphviz
```

# Applying the K-Nearest Neighbors (KNN) Algorithm

The KNN classification algorithm is based on the premise “If it walks like a duck and quacks like a duck, it’s a duck.” To classify a value, KNN finds the K values in the training data that are most similar (nearest) to it and assigns the class that is most common among those neighbors.
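
To make the neighbor vote concrete, the following is a minimal from-scratch sketch of the idea. It is not how scikit-learn implements KNN; the function name `knn_predict` and the toy pet measurements are hypothetical and made up for illustration.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, point, k):
    """Classify `point` by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - point, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # The most common label among those neighbors wins
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (height in cm, weight in kg) for two kinds of pets
X_toy = np.array([[25.0, 4.0], [28.0, 5.0], [23.0, 3.5],
                  [60.0, 30.0], [65.0, 32.0], [58.0, 27.0]])
y_toy = np.array(['cat', 'cat', 'cat', 'dog', 'dog', 'dog'])

print(knn_predict(X_toy, y_toy, np.array([26.0, 4.5]), k=3))
print(knn_predict(X_toy, y_toy, np.array([62.0, 29.0]), k=3))
```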

When you use the KNN algorithm to classify data, you must specify the value of K, the number of neighbors that vote on each classification. If you specify a value of K that is too small, you may “overfit” the model, meaning the model may start to treat noise or errant data as valid training data. Likewise, if you specify too large a value for K, you may “underfit” the model, which means the model is too coarse to capture the patterns in the training data.
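
One way to see the effect of K is to train the same model with several values and compare the accuracy on the training data with the accuracy on the test data. The sketch below assumes scikit-learn's bundled Iris data; the K values and the seed are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# random_state=0 is an arbitrary seed so the comparison is repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

results = {}
for k in (1, 5, 25, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the fraction of observations classified correctly
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

for k, (train_acc, test_acc) in results.items():
    print(f'K={k:3d}  train accuracy={train_acc:.3f}  test accuracy={test_acc:.3f}')
```

With K=100, nearly the whole training set votes on every prediction, so the model can no longer tell the classes apart; that is the underfitting case.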

The following Python script, IrisKNN.py, opens the Iris dataset and loads the data into two arrays, one containing the petal and sepal data (X) and one containing the known classifications (y). The code then splits the arrays into a training dataset that contains 70% of the values and a testing dataset that contains the remaining 30%.

The script then uses the KNN (K-nearest neighbors) algorithm with K=3 to calculate and display the model’s accuracy. Using the model, the script then predicts the classification for three sets of sepal and petal measurements:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 1
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
print("\nFrom the test data")
print('Index\tPredicted\t\tActual')
for i in range(len(pred)):
  if pred[i] != y_test[i]:
    print(i, '     ', pred[i],'       ', y_test[i], ' ****')
    
DataToPredict = np.array([[5.2,3.5,1.4,0.2],[5.7,2.9,3.6,1.3],[5.8,3.0,5.1,1.8]])
pred = knn.predict(DataToPredict)

print("\nPredicted Results")
for i in range(len(pred)):
    print(DataToPredict[i], pred[i])

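To help you understand which of the model’s predictions were correct and which were wrong, you can display the model’s confusion matrix, which summarizes the results of the predictions. Each row of the matrix corresponds to an actual class and each column to a predicted class, so the correct predictions lie along the diagonal.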
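
As a small illustration of how to read a confusion matrix, the sketch below builds one from seven hand-written labels. The labels are hypothetical and made up for this example; only `confusion_matrix` itself comes from scikit-learn.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, made up for illustration
actual    = ['cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'dog']
predicted = ['cat', 'cat', 'dog', 'dog', 'dog', 'dog', 'cat']

# Rows are actual classes and columns are predicted classes,
# both in the order given by `labels`.
cm = confusion_matrix(actual, predicted, labels=['cat', 'dog'])
print(cm)
```

Here the first row says that two of the three cats were classified correctly and one was mistaken for a dog; the second row says that three of the four dogs were classified correctly and one was mistaken for a cat.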

The following Python script, ConfusionIrisKNN.py, uses KNN to model the Iris-flower data. The script displays the accuracy score and the confusion matrix:

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
print("\nFrom the test data")
print('Index\tPredicted\t\tActual')
for i in range(len(pred)):
  if pred[i] != y_test[i]:
    print(i, '\t', pred[i], '\t', y_test[i], ' ****')
    
print('\nConfusion Matrix\n', confusion_matrix(y_test, pred))

# Predicting Wine Types

The Wine dataset, available at the University of California Irvine (UCI) Machine Learning Repository, contains 13 chemical attributes that characterize a wine. The dataset contains data for three types of wines, identified by the category values 1, 2, and 3. The dataset consists of 178 records.

The following Python script, WineKNN.py, uses the Wine dataset with K=5 to predict the types for four sets of wine values:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 2
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

names = ['class', 'Alcohol','Malic Acid','Ash','Alcalinity','Magnesium','Total Phenols','Flavanoids', 'NonFlavanoid Phenols', 'Proanthocyanins', 'Color Intensity', 'Hue', 'OD280/OD315', 'Proline' ]

df = pd.read_csv('wine.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:14])
y = np.array(df['class'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
    
DataToPredict = np.array(
[[14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065],
[12.64,1.36,2.02,16.8,100,2.02,1.41,.53,.62,5.75,.98,1.59,450],
[12.53,5.51,2.64,25,96,1.79,.6,.63,1.1,5,.82,1.69,515],
[13.49,3.59,2.19,19.5,88,1.62,.48,.58,.88,5.7,.81,1.82,580]])

pred = knn.predict(DataToPredict)

print("\nPredicted Results")
for i in range(len(pred)):
    print('\t', DataToPredict[i], '\t', 'Class: ', pred[i])

# Predicting Breast Cancer Malignancy Using KNN

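The breast-cancer data, available at the University of California Irvine (UCI) Machine Learning Repository, contains attributes that can be used to determine whether a breast-cancer tumor is malignant or benign. The diagnostic version of the dataset contains 32 columns and 569 records. The scripts in this notebook read breast.data.csv using eleven column names: a sample identifier, nine cell attributes, and the class.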

The following Python script, CancerPredictKNN.py, uses the dataset with K=5 to train and test the model:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 3
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:10])  # the nine attributes; do not include sample field
y = np.array(df['class'])

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred = knn.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
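
The column selection in the script relies on `iloc` slices being half-open: the start index is included and the end index is not. The sketch below shows this on a hypothetical three-row table (the values are made up) laid out like the breast-cancer file, with an identifier first and the class last.

```python
import pandas as pd

# Hypothetical three-row table, made up for illustration: an identifier
# first, the class label last, and the attributes in between.
df = pd.DataFrame({
    'Sample': [1001, 1002, 1003],
    'Clump Thickness': [5, 3, 8],
    'Mitoses': [1, 1, 7],
    'class': [2, 2, 4],
})

# iloc slices are half-open: 1:3 selects columns 1 and 2, but not column 3.
X = df.iloc[:, 1:3]
y = df['class']

print(list(X.columns))
print(list(y))
```

Leaving the identifier out matters because a sample number carries no information about the tumor, yet a distance-based model such as KNN would still treat it as a measurement.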

# Classification Using Naïve Bayes

There are many different classification algorithms, each of which approaches the data-grouping process differently. The Naïve Bayes classification algorithm is so named because it is based on Bayes’ theorem, which calculates the probability that an item is a member of a category based upon knowledge of related conditions. The algorithm is called “naïve” in that it treats the dataset attributes as independent of one another and combines the probability calculated for each.
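
The calculation can be sketched by hand. The example below is a simplified illustration, not scikit-learn's implementation: it assumes each attribute follows a normal (Gaussian) distribution within each class, uses scikit-learn's bundled Iris data, and the function name `posterior` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# Per-class statistics learned from the data
priors = np.array([np.mean(y == c) for c in classes])           # P(class)
means = np.array([X[y == c].mean(axis=0) for c in classes])     # one row per class
variances = np.array([X[y == c].var(axis=0) for c in classes])  # one row per class

def posterior(x):
    """Return P(class | x) for each class, assuming independent Gaussian attributes."""
    # Log of the Gaussian density of each attribute value, for each class
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)
                             + (x - means) ** 2 / variances)
    # The "naive" independence assumption: the joint likelihood is the product
    # of the per-attribute likelihoods, which is the sum of their logs.
    log_joint = np.log(priors) + log_likelihood.sum(axis=1)
    # Normalize so that the class probabilities sum to 1
    joint = np.exp(log_joint - log_joint.max())
    return joint / joint.sum()

probs = posterior(np.array([5.2, 3.5, 1.4, 0.2]))
print(np.round(probs, 4))
print('Most probable class index:', probs.argmax())
```

Class index 0 in the bundled data is Iris setosa, so a flower with short, narrow petals like this one receives nearly all of its probability there.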

The following Python script, NaiveBayes.py, uses the GaussianNB class to predict which class of Iris a flower observation aligns with:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = GaussianNB().fit(X_train, y_train)
pred = model.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
print("\nFrom the test data")
print('Index\tPredicted\t\tActual')
for i in range(len(pred)):
  if pred[i] != y_test[i]:
    print(i, '\t', pred[i], '\t', y_test[i], ' ****')
    
DataToPredict = np.array([[5.2,3.5,1.4,0.2],[5.7,2.9,3.6,1.3],[5.8,3.0,5.1,1.8]])
pred = model.predict(DataToPredict)

print("\nPredicted Results")
for i in range(len(pred)):
    print('\t', DataToPredict[i], '\t', pred[i])

As discussed, Naïve Bayes works by calculating probabilities. To display the probability that the model assigns to each class for each observation, you can use the predict_proba method, as shown here:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)


model = GaussianNB().fit(X_train, y_train)
pred = model.predict(X_test)
print('probabilities')
print (model.predict_proba(X_test))


# Classification Using Logistic Regression

The logistic-regression classifier is best suited for binary dependent variables, meaning classifications for which there are only two classes, such as gender or whether a tumor is malignant or benign. You can also use logistic regression for multiclass problems; however, your results may not prove as accurate as those of other methods.

A logistic-regression classifier does not model the dependent variable (the class we are trying to predict) directly; rather, it models the logit, the natural logarithm of the odds that an observation belongs to a class, as a linear function of the predictor variables. For this reason, logistic regression is often called the “logit” model. The odds of an event are the probability that the event occurs divided by the probability that it does not. The classifier converts the log-odds it computes from the predictor variables into the probability that the data belongs to each class.
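
The relationship between probability, odds, and the logit can be worked through in a few lines. This is a sketch for illustration only; the helper names and the intercept and slope values are hypothetical, not coefficients fitted from any dataset in this notebook.

```python
import math

def odds(p):
    """Odds of an event: probability it occurs over probability it does not."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the natural logarithm of the odds."""
    return math.log(odds(p))

def logistic(z):
    """Inverse of the logit: converts log-odds back into a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
print('odds: ', odds(p))
print('logit:', logit(p))
print('back to probability:', logistic(logit(p)))

# Hypothetical coefficients for a model with one predictor variable
intercept, slope = -4.0, 0.5
for x in (4, 8, 12):
    z = intercept + slope * x          # the linear part gives the log-odds
    print(f'x={x:2d}  log-odds={z:+.1f}  probability={logistic(z):.3f}')
```

When the log-odds are zero the two outcomes are equally likely, so the probability is 0.5; larger log-odds push the probability toward 1.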

The following Python script, LogisticRegressionIris.py, uses the model to predict Iris flower types:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 4
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# with more than two classes, the lbfgs solver fits a multinomial model by default
model = LogisticRegression(solver='lbfgs', max_iter=200).fit(X_train, y_train)

pred = model.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, pred))

print('\nIndex     Predicted         Actual')
for i in range(len(pred)):
  if pred[i] != y_test[i]:
    print(i, '      ',pred[i], '  ', y_test[i], ' ****')
    
DataToPredict = np.array([[5.2,3.5,1.4,0.2],[5.7,2.9,3.6,1.3],[5.8,3.0,5.1,1.8]])
pred = model.predict(DataToPredict)

print("\nPredicted Results")
for i in range(len(pred)):
    print(DataToPredict[i], pred[i])


As stated, logistic regression is best suited for a binary dependent variable. The following Python script, LogisticRegressionCancer.py, uses the approach to predict whether a tumor is benign or malignant:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 5
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:10])  # the nine attributes; do not include sample field
y = np.array(df['class'])

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

pred = model.predict(X_test)

print('Accuracy score: ', accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))

# Classification Using a Neural Network

Neural networks are at the heart of many machine-learning applications, including classification. In this section, you will examine the MLPClassifier class, so named because it uses multilayer perceptrons to accomplish its processing, in this case classification.

In a neural network, a perceptron is a supervised-learning unit that computes a weighted sum of its inputs (a linear function) and applies a threshold to produce an output. A single perceptron can therefore separate only classes that a straight line (or hyperplane) can divide. Many real-world problems are not linear in nature, so additional layers of perceptrons must be combined to model them.
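
The limitation of a single perceptron can be demonstrated with a small from-scratch sketch. This is the classic perceptron learning rule, not the training procedure MLPClassifier uses; the function names, the number of epochs, and the learning rate are assumptions chosen for illustration.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, learning_rate=1.0):
    """Train a single perceptron with the classic perceptron learning rule."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in zip(X, y):
            # Weighted sum of the inputs followed by a threshold
            output = 1 if inputs @ weights + bias > 0 else 0
            # Nudge the weights only when the output is wrong
            error = target - output
            weights += learning_rate * error * inputs
            bias += learning_rate * error
    return weights, bias

def predict(X, weights, bias):
    return (X @ weights + bias > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])   # linearly separable
y_xor = np.array([0, 1, 1, 0])   # not linearly separable

print('AND:', predict(X, *train_perceptron(X, y_and)))
print('XOR:', predict(X, *train_perceptron(X, y_xor)))
```

The perceptron learns AND because one straight line separates the single positive case from the rest. No straight line separates the XOR cases, so the single perceptron cannot reproduce them, which is the motivation for adding layers.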

The following Python script, MLPIris.py, uses a multilayer-perceptron model to predict Iris flower types. MLP will iterate through the dataset until either convergence occurs (when the score is no longer improving over a certain period), or the specified maximum number of iterations is reached:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 6
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = MLPClassifier(max_iter=1000)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print ('Accuracy: ', accuracy_score(y_test, pred))

print('Index\t  Predicted\t   Actual')
for i in range(len(pred)):
  if pred[i] != y_test[i]:
    print(i, '       ', pred[i], ' ', y_test[i], ' ****')
    

DataToPredict = np.array([[5.2,3.5,1.4,0.2],[5.7,2.9,3.6,1.3],[5.8,3.0,5.1,1.8]])
pred = model.predict(DataToPredict)

print("\nPredicted Results")
for i in range(len(pred)):
    print('\t', DataToPredict[i], '\t', pred[i])

# Classification Using Decision Trees

A decision tree is a graph-based data structure that a program can use to follow a series of decision paths to arrive at a decision. 

Within machine learning, a decision-tree classifier creates a similar structure with decision points that are based upon the different dataset attributes. As you might guess, as the number and complexity of the attributes increase, so too does the complexity of the underlying decision tree.
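
To see the decision points a classifier learns, you can print a fitted tree as text. The sketch below assumes scikit-learn's bundled Iris data and limits the tree to two levels so the output stays short; `max_depth=2` and the seed are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# max_depth=2 keeps the printed tree small; random_state is an arbitrary seed
tree_model = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_model.fit(iris.data, iris.target)

# export_text renders each decision point as an indented rule
rules = export_text(tree_model, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a test on one attribute, and each line that starts with `class:` is a leaf where the tree stops and assigns a label.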

The following Python script, DecisionTreeCancer.py, uses a decision tree to predict whether a breast-cancer tumor is benign or malignant:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 7
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import sklearn.tree as tree
from sklearn.metrics import confusion_matrix

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:10])  # the nine attributes; do not include sample field
y = np.array(df['class'])

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

DT = tree.DecisionTreeClassifier()
DT.fit(X_train, y_train)
pred = DT.predict(X_test)
print ('Accuracy Score: ', accuracy_score(y_test, pred))
print('Confusion Matrix\n', confusion_matrix(y_test, pred))

# Classifying Data Using Random Forests

In the previous section you learned how to use decision-tree modeling to classify data. Depending on the dataset and model, there may be times when the decision tree becomes very deep (many levels of nodes). Often, such decision trees will overfit the data and will have a large variance.
A random-forest classification model reduces this problem by building many decision trees, each trained on a random sample of the data and, at each split, considering only a random subset of the attributes. The forest then classifies an observation by combining the predictions of its trees.
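
The sketch below compares a single decision tree with a forest of 100 trees. It is an illustration under stated assumptions: it uses scikit-learn's bundled breast-cancer data (`load_breast_cancer`, the 569-record diagnostic dataset), which is not the same file as breast.data.csv, and the seed and parameter values are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# One unrestricted tree, free to grow as deep as the data allows
single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 trees; max_features='sqrt' limits each split to a random subset
# of the attributes
forest = RandomForestClassifier(
    n_estimators=100, max_features='sqrt', random_state=0
).fit(X_train, y_train)

print('Trees in the forest:      ', len(forest.estimators_))
print('Single tree test accuracy:', single_tree.score(X_test, y_test))
print('Forest test accuracy:     ', forest.score(X_test, y_test))
```

The exact accuracies depend on the split, so treat a single run as indicative rather than conclusive.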

The following Python script, RandomForests.py, uses a random forest to predict whether a tumor is benign or malignant:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 8
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:10])  # the nine attributes; do not include sample field
y = np.array(df['class'])

# split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = RandomForestClassifier()
model.fit(X_train, y_train)
pred = model.predict(X_test)
print ('Accuracy Score: ', accuracy_score(y_test, pred))
print('Confusion Matrix\n', confusion_matrix(y_test, pred))

# Classifying Data Using a Support Vector Machine

The Support Vector Machine (SVM) classifier, which scikit-learn provides as SVC (Support Vector Classifier), classifies data by separating the classes with a boundary called a hyperplane (in two dimensions, a line). Many datasets cannot be separated by a straight boundary, so the SVC algorithm uses kernel functions to extend the separation capabilities to such non-linear solutions.
SVC is ideal for binary-classification problems, such as whether a loan will be approved or disapproved. You can also use SVC for multiclass problems; however, your solution may not be as accurate as other methods.
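
The difference between a linear and a non-linear boundary can be seen on data that no straight line can separate. The sketch below is an illustration only: it assumes a synthetic dataset of two concentric rings generated with `make_circles`, and the sample size, noise level, and seed are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: an inner ring of one class surrounded by a ring of the other
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

accuracy = {}
for kernel in ('linear', 'rbf'):
    # 'linear' draws a straight boundary; 'rbf' allows a curved one
    model = SVC(kernel=kernel, gamma='auto').fit(X_train, y_train)
    accuracy[kernel] = model.score(X_test, y_test)
    print(f'{kernel:6s} kernel test accuracy: {accuracy[kernel]:.3f}')
```

SVC uses the `rbf` kernel when no kernel is specified, which is why the scripts in this section can model non-linear boundaries without any extra arguments.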

The following Python script, SVCIris.py, uses SVC to predict Iris-flower types:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('iris.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 0:4]) 	
y = np.array(df['class'])

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = SVC(gamma='auto').fit(X_train, y_train)
pred = model.predict(X_test)

print ('\nModel accuracy score: ', accuracy_score(y_test, pred))
print("\nFrom the test data")
print('Index\tPredicted\t\tActual')
for i in range(len(pred)):
  if pred[i] != y_test[i]:
    print(i,'     ', pred[i], '       ', y_test[i], ' ****')
    
DataToPredict = np.array([[5.2,3.5,1.4,0.2],[5.7,2.9,3.6,1.3],[5.8,3.0,5.1,1.8]])
pred = model.predict(DataToPredict)

print("\nPredicted Results")
for i in range(len(pred)):
    print(DataToPredict[i], pred[i])

As discussed, SVC is ideal for binary classifications. The following Python script, SVCcancer.py, uses SVC to predict whether a breast tumor is malignant or benign:

In [None]:
######################################
# Chapter 11 (Python) / Deliverable 9
######################################

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

names = ['Sample', 'Clump Thickness','Uniformity of Cell Size','Uniformity of Cell Shape','Marginal Adhesion','Single Epithelial Cell Size','Bare Nuclei','Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'class']

df = pd.read_csv('breast.data.csv', header=None, names=names)
X = np.array(df.iloc[:, 1:10]) # designate the attributes to be used in the classification (all nine, skipping the sample field)
y = np.array(df['class'])

# split the data into the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

model = SVC(gamma='auto').fit(X_train, y_train)
pred = model.predict(X_test)

print ('Model accuracy score: ', accuracy_score(y_test, pred))