# Classification

Classification is perhaps the most fundamental information processing task. It consists of mapping an input to one or more outputs drawn from a discrete set. We perform classification when we take a job title and assign an occupation code, when we look at a survey response and deem it "acceptable" or "error, requiring further follow up", and when we examine two records from different data sets and determine whether they describe the "same" individual or "different" individuals. Classification tasks are everywhere in data processing.

For many tasks, automatic classification is easy. We can easily write a computer program that determines whether a reported employment number is within a specified range, whether a required survey response was provided, or whether one number is bigger than another. When tasks involve complicated inputs like natural language, imagery, or sound, however, hand-written rules of this kind are no longer feasible. If you have enough training data, there is an approach that works well: supervised machine learning. This is the basis for most commercial "Artificial Intelligence" today, and the focus of this handbook.
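
As a minimal sketch of the easy case, a rule-based check like the range test mentioned above might look like the following. The function name `classify_employment` and the bounds are hypothetical, chosen only for illustration.

```python
def classify_employment(reported, low=0, high=10_000):
    """Label a reported employment number using a fixed range rule.

    The bounds are illustrative assumptions, not values from a real survey.
    """
    # a missing required response is flagged for follow up
    if reported is None:
        return "error, requiring further follow up"
    # values inside the allowed range are accepted
    if low <= reported <= high:
        return "acceptable"
    return "error, requiring further follow up"


labels = [classify_employment(n) for n in (25, -3, None)]
print(labels)
```

Rules like this are easy to write because the input is a single number. No comparably short rule exists for free-text narratives, which is where learning from examples becomes useful.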

# Supervised Machine Learning

Building a supervised machine learning system consists of the following steps:
1. Gather data containing the relevant inputs to a task and the desired outputs
2. Convert the inputs, if necessary, into a format suitable for your learning algorithm
3. Use a learning algorithm to "learn" from the data
4. Evaluate the performance of the system and modify steps 1-3 as needed

We illustrate this by building a simple "Part of Body" autocoder below.

## 1. Gather the Data

For this example we will use data from the Mine Safety and Health Administration (MSHA). Each record consists of a narrative describing an injury to a mining worker and the part of body injured. Our goal is to create a classifier that can automatically determine which part of body was injured based on the written narrative. We begin by using the pandas library to read in and examine an Excel file containing training data.

In [4]:
import pandas as pd

# read the Excel file into a dataframe
df = pd.read_excel(r"..\data\training_data.xlsx")
# display the first 5 rows of the dataframe
df.head()

Unnamed: 0,MINE_ID,INJ_BODY_PART,NARRATIVE
0,2602512,FINGER(S)/THUMB,The employee's finger was pinched between the ...
1,4201444,"FACE,NEC",EE had just clocked in and had sat down on a c...
2,4609029,"HEAD,NEC",The employee received fatal injuries when he w...
3,200024,FOREARM/ULNAR/RADIUS,"Employee was cutting a 1"" piece of pipe secure..."
4,4606578,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),Employee was injured on 4-11-11. While operati...


## 2. Convert the Inputs

In this example, our input is text, but the learning algorithm we will be using requires numeric inputs. We will convert the text to numbers using the "bag-of-words" representation. That is, we define a vector in which each position corresponds to a word that occurs somewhere in our narratives. We then convert each narrative to such a vector by placing, in each position, the number of times the corresponding word occurs in that narrative.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# create a vectorizer object which will map words to vector positions
vectorizer = CountVectorizer()
# map each word occurring in df['NARRATIVE'] to a vector position
vectorizer.fit(df['NARRATIVE'])
# transform our narratives into their vector representation
bag_of_words_representation = vectorizer.transform(df['NARRATIVE'])
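
To see what these vectors look like, the following sketch applies the same CountVectorizer steps to two invented narratives that are small enough to check by hand. The names `toy_narratives`, `toy_vectorizer`, and `toy_counts` are hypothetical and are not part of the autocoder above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# two made-up narratives, used only for illustration
toy_narratives = [
    "employee cut finger",
    "employee hurt finger and hurt hand",
]

toy_vectorizer = CountVectorizer()
# fit_transform combines the fit and transform steps shown above
toy_counts = toy_vectorizer.fit_transform(toy_narratives)

# vocabulary_ maps each word to its vector position; list the words in position order
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)

# the result is a sparse matrix; it is made dense here only because the example is tiny
print(toy_counts.toarray())
```

Each row is one narrative and each column is one word. The word "hurt" occurs twice in the second narrative, so the second row holds a 2 in the "hurt" position, while words absent from a narrative get a 0.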

## 3. Learn from the Data

There are many algorithms for learning from data, and the right algorithm depends on the task, but a popular option for text classification tasks like autocoding is LogisticRegression. LogisticRegression assumes the probability of each classification is a function of the weighted sum of the inputs, in this case the word counts from our bag-of-words representation. LogisticRegression "learns" by using the training data to estimate the weights that give the best performance. Since there are many possible injured body parts, behind the scenes our program is fitting a separate LogisticRegression model for each possible injured part.

In [9]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X=bag_of_words_representation, y=df['INJ_BODY_PART'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
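
The one-model-per-class behaviour corresponds to the `multi_class='ovr'` setting visible in the output above. Newer scikit-learn releases use different defaults, so a sketch that makes the one-vs-rest scheme explicit could wrap the model in `OneVsRestClassifier`. The toy narratives and labels below are invented for illustration, and the hand calculation at the end is only meant to show the "function of the weighted sum" idea for one of the per-class models.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# made-up training data, used only for illustration
toy_narratives = [
    "cut finger on blade", "pinched finger in door",
    "strained back lifting", "twisted back while shoveling",
    "dust in eye", "metal chip struck eye",
]
toy_labels = ["FINGER", "FINGER", "BACK", "BACK", "EYE", "EYE"]

X = CountVectorizer().fit_transform(toy_narratives)

# fit one binary logistic regression per class (one-vs-rest)
ovr_model = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
ovr_model.fit(X, toy_labels)
print(ovr_model.classes_, len(ovr_model.estimators_))

# reproduce the first per-class model's probabilities by hand:
# weighted sum of word counts plus intercept, passed through the logistic function
first_model = ovr_model.estimators_[0]
weighted_sum = X.toarray() @ first_model.coef_.ravel() + first_model.intercept_[0]
manual_probs = 1.0 / (1.0 + np.exp(-weighted_sum))
print(np.allclose(manual_probs, first_model.predict_proba(X)[:, 1]))
```

There is one fitted estimator per distinct label, and each estimator's probability is the logistic function applied to a weighted sum of the word counts.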

## 4. Evaluate the Model

There are many ways to evaluate the performance of a model, but two of the most popular metrics are accuracy, which is the proportion of classifications correctly assigned, and the macro-F1-score, which is the unweighted average of the per-classification F1-scores (each F1-score being the harmonic mean of precision and recall for that classification). Because every classification counts equally in the macro-F1-score, it is more sensitive than accuracy to poor performance on rare classifications. When evaluating complex models, it is very important to evaluate on data the model was not trained on, because complex models can easily memorize the training data without learning patterns that are useful on unseen data.

In [11]:
# Read in the data we will evaluate our model against
validation_df = pd.read_excel(r"..\data\evaluation_data.xlsx")
# Note that the validation data already has part of body codes assigned. We will
# assume these are the "correct" values when calculating accuracy and macro-f1 score
validation_df.head()

Unnamed: 0,MINE_ID,INJ_BODY_PART,NARRATIVE
0,3600017,BODY SYSTEMS,Possible heart attack.
1,3503757,SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA),Employee was cleaning up plant spillage into a...
2,1103189,FINGER(S)/THUMB,"Employee was putting a drag on a shuttle car, ..."
3,4600016,HAND (NOT WRIST OR FINGERS),While using a cutting torch to remove a bearin...
4,3609997,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),Employee was operating a Cat D400E articulated...


In [13]:
# convert the validation data into its bag of words representation using our vectorizer
validation_bag_of_words = vectorizer.transform(validation_df['NARRATIVE'])
# use our model to predict the injured body parts and save these to our dataframe
validation_df['Predicted Part'] = model.predict(validation_bag_of_words)
validation_df.head()

Unnamed: 0,MINE_ID,INJ_BODY_PART,NARRATIVE,Predicted Part
0,3600017,BODY SYSTEMS,Possible heart attack.,BODY SYSTEMS
1,3503757,SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA),Employee was cleaning up plant spillage into a...,SHOULDERS (COLLARBONE/CLAVICLE/SCAPULA)
2,1103189,FINGER(S)/THUMB,"Employee was putting a drag on a shuttle car, ...",FINGER(S)/THUMB
3,4600016,HAND (NOT WRIST OR FINGERS),While using a cutting torch to remove a bearin...,HAND (NOT WRIST OR FINGERS)
4,3609997,BACK (MUSCLES/SPINE/S-CORD/TAILBONE),Employee was operating a Cat D400E articulated...,BACK (MUSCLES/SPINE/S-CORD/TAILBONE)


In [18]:
# calculate accuracy and macro f1_score
from sklearn.metrics import accuracy_score, f1_score

print('accuracy:', accuracy_score(y_true=validation_df['INJ_BODY_PART'],
                                  y_pred=validation_df['Predicted Part']))
print('macro-f1:', f1_score(y_true=validation_df['INJ_BODY_PART'],
                            y_pred=validation_df['Predicted Part'],
                            average='macro'))

accuracy: 0.7426641567932676
macro-f1: 0.5105508119093995
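
To make the gap between the two metrics concrete, here is a small worked sketch on invented labels (not the MSHA data). It computes the per-class F1-scores by hand and compares the results with scikit-learn. The helper `f1_for` is hypothetical and assumes every class is predicted at least once, so no division by zero occurs.

```python
from sklearn.metrics import accuracy_score, f1_score

# made-up labels: a common class (BACK) and a rare class (EYE)
y_true = ["BACK", "BACK", "BACK", "BACK", "EYE", "EYE"]
y_pred = ["BACK", "BACK", "BACK", "BACK", "BACK", "EYE"]


def f1_for(label):
    """F1-score for one class: harmonic mean of its precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# accuracy: share of predictions that match the true label
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# macro-F1: plain average of the per-class F1-scores
manual_macro_f1 = (f1_for("BACK") + f1_for("EYE")) / 2

print("accuracy:", manual_accuracy, accuracy_score(y_true, y_pred))
print("macro-f1:", manual_macro_f1, f1_score(y_true, y_pred, average="macro"))
```

Here five of six predictions are correct, but the single error falls on the rare class, which lowers that class's F1-score to 2/3 and pulls the macro-F1-score (7/9) below the accuracy (5/6).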


