# Label Recommendation System

Recommendation systems are among the most common use cases in machine learning. They show the user relevant and personalized information. Today, I'll present the label recommendation system I developed for a car shop that wanted its employees to label each bill in the database.


## Data and Data Collection
The data was provided by the car shop. The data set stores each bill description together with its label. The data was labeled with OpenAI's API into one of the following categories:

| index | category |
| --- | --- |
| 0 | Shocks, Control Arms, Tires, Alignment |
| 1 | Oil Change, Ignition, Fuel System |
| 2 | Manufacturer Service Intervals |
| 3 | Dashboard, Door Locks, Windows |
| 4 | Check Engine Light, Inspections |
| 5 | Alternator, Battery, Starter, Switches |
| 6 | AC System, Blower Motor |
| 7 | ABS Control Module, Brake Lines, Brake Pads |
| 8 | No category |

To start, we load the data set into a DataFrame and clean it.

In [1]:
import pandas as pd
DF = pd.read_csv('labeled_data.csv')
#GET RID OF UNNECESSARY SPACES AT THE BEGINNING AND END OF THE DESCRIPTION
DF['description'] = DF['description'].str.strip()
#AND THE COMMAS AND BLANK SPACES AT THE EDGES OF THE LABEL COLUMN
DF['label'] = DF['label'].str.strip(' ,')
#WE DROP THE ROWS WITH A MISSING DESCRIPTION OR LABEL
DF.dropna(subset=['description', 'label'], inplace=True)

#FINALLY, WE LOAD THE DATAFRAME THAT CONTAINS THE CATEGORIES
categories_DF = pd.read_csv('categories.txt')
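
To make the cleaning step concrete, here is a minimal sketch of what the two strip calls do, shown on plain Python strings (pandas' `.str.strip` applies the same stripping rule to each element). The sample values are hypothetical and are not taken from the shop's data.

```python
# Hypothetical raw values, for illustration only (not from the real data set).
raw_description = "  oil change, rear tie rod pass side  "
raw_label = " 1,2, "

# strip() with no argument removes leading and trailing whitespace.
clean_description = raw_description.strip()

# strip(' ,') removes leading and trailing spaces AND commas,
# but leaves the commas between labels untouched.
clean_label = raw_label.strip(" ,")

print(repr(clean_description))
print(repr(clean_label))
```

Only the edges of each string are affected, so the comma-separated structure inside the label is preserved for the later `split(',')`.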

## Label Formatting
Before the labels can be fed to the classifier, they must be formatted as a string of zeros and ones in a "one-hot" encoding.

In [2]:
#WE CREATE A NEW COLUMN WITH A PROPER LABEL USEFUL TO TRAIN THE MODEL
def one_hot_label(label: str):
    place_holder_list = [0] * 9
    for ind_label in label.split(','):
        #LABEL k SETS POSITION k-1 OF THE LIST
        if int(ind_label)<=9:
            place_holder_list[int(ind_label)-1] = 1
    return ','.join(map(str, place_holder_list))
#THE OUTPUT IS A STRING OF ZEROS AND ONES SEPARATED BY COMMAS (1,0,0,0,1,0,0,0,1)

DF['emb_label'] = DF['label'].apply(one_hot_label)
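
As a quick worked example, the sketch below repeats the same encoding logic on two hypothetical label strings. It assumes the labels are 1-based (1 to 9), which is what the `-1` offset in the function suggests; that assumption is not confirmed by the data shown here.

```python
def one_hot_label(label: str) -> str:
    # Same logic as the function above: label k sets position k-1.
    slots = [0] * 9
    for ind_label in label.split(','):
        if int(ind_label) <= 9:
            slots[int(ind_label) - 1] = 1
    return ','.join(map(str, slots))

# Hypothetical label strings, assumed to be 1-based (1..9).
single = one_hot_label('2')
multi = one_hot_label('1,5,9')

print(single)  # 0,1,0,0,0,0,0,0,0
print(multi)   # 1,0,0,0,1,0,0,0,1
```

Note that with this indexing a label of `0` would set the last position, because Python treats index `-1` as the last element of a list.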

Then, the entries with empty labels are discarded.

In [3]:
empty_label = ','.join(list(map(str, [0 for _ in range(9)])))
DF = DF[DF['emb_label'] != empty_label]
DF.shape

(4159, 13)

## Modeling

The classifier is trained on 75% of the data. Because the labels are not binary (each distinct one-hot string is treated as its own class), a one-vs-rest classifier with logistic regression as the base estimator is fitted.

In [16]:
#FIRST, WE PREPARE THE TRAIN AND TEST SETS

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(DF['description'], DF['emb_label'], test_size=0.25)

#WE TRAIN THE ONE VS. REST CLASSIFIER
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression


vectorizer = TfidfVectorizer()
vectorizer.fit(X_train)
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

base_classifier = LogisticRegression()
clf = OneVsRestClassifier(base_classifier)
clf.fit(X_train_tfidf, Y_train)


Then, the accuracy score is displayed for the training set and the testing set:

In [27]:
print('Accuracy in test set:', round(accuracy_score(Y_test, clf.predict(X_test_tfidf)), 3))
print('Accuracy in train set:', round(accuracy_score(Y_train, clf.predict(X_train_tfidf)), 3))

Accuracy in test set: 0.691
Accuracy in train set: 0.786


Even though the accuracy scores are fairly low, the classifier is still very useful, because we will use not only the top prediction but a collection of the best predictions. This increases the chance of recommending an adequate label.
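
The intuition can be illustrated with a top-k accuracy: a prediction counts as correct if the true class is anywhere among the k most probable classes. The sketch below is a minimal pure-Python version; the function name `top_k_accuracy` and the probability values are hypothetical and are not results from the classifier above.

```python
def top_k_accuracy(probabilities, true_classes, k=3):
    """Fraction of rows whose true class index is among the k most probable."""
    hits = 0
    for row, true_class in zip(probabilities, true_classes):
        # Class indices sorted from most to least probable.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_class in ranked[:k]:
            hits += 1
    return hits / len(true_classes)

# Made-up probabilities for 4 descriptions over 4 classes.
probabilities = [
    [0.60, 0.25, 0.10, 0.05],
    [0.30, 0.40, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.35, 0.05, 0.45, 0.15],
]
true_classes = [0, 0, 1, 1]

top_1 = top_k_accuracy(probabilities, true_classes, k=1)
top_3 = top_k_accuracy(probabilities, true_classes, k=3)
print(top_1)  # 0.25
print(top_3)  # 0.75
```

On this toy input only one of four top predictions is right, yet three of four true classes appear among the top three, which is the effect the recommendation function relies on.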

## Building the Recommendation Function

With the classifier ready, we need to map its outputs (a string of zeros and ones such as '0,0,0,0,1,0,0,0,0') to the label descriptions. We also want to select not just the best prediction but the best three.

In [6]:
#THIS FUNCTION TRANSFORMS THE CLASSIFIER OUTPUTS INTO A LIST OF DESCRIPTIVE LABEL RECOMMENDATIONS.
#THE LABEL DESCRIPTIONS ARE STORED IN A DIFFERENT DATAFRAME
def one_hot_to_label_description(emb_label: str):
    recommendation_list = []
    emb_label_list = [int(i) for i in emb_label.split(',')]
    for index, element in enumerate(emb_label_list):
        if element == 1:
            recommendation_list.append(categories_DF['category'][index])
    return recommendation_list
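
For illustration, the same mapping can be tried without the data files. In the sketch below a plain list stands in for `categories_DF['category']` (an assumption made for this example), filled with the category descriptions from the table above in index order; the helper name `one_hot_to_descriptions` is hypothetical.

```python
# Category descriptions in index order; a plain list replaces
# categories_DF['category'] for this self-contained sketch.
categories = [
    'Shocks, Control Arms, Tires, Alignment',
    'Oil Change, Ignition, Fuel System',
    'Manufacturer Service Intervals',
    'Dashboard, Door Locks, Windows',
    'Check Engine Light, Inspections',
    'Alternator, Battery, Starter, Switches',
    'AC System, Blower Motor',
    'ABS Control Module, Brake Lines, Brake Pads',
    'No category',
]

def one_hot_to_descriptions(emb_label: str) -> list:
    # Keep the description of every position flagged with a 1.
    flags = [int(i) for i in emb_label.split(',')]
    return [categories[index] for index, flag in enumerate(flags) if flag == 1]

one_label = one_hot_to_descriptions('0,0,0,0,1,0,0,0,0')
two_labels = one_hot_to_descriptions('1,1,0,0,0,0,0,0,0')
print(one_label)   # ['Check Engine Light, Inspections']
print(two_labels)  # ['Shocks, Control Arms, Tires, Alignment', 'Oil Change, Ignition, Fuel System']
```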

Then, we define the function that takes a string (the bill description) and returns the labels of the top N predictions, selected according to the probability values calculated by the classifier. By default, this function returns the top three recommendations, but it can return any number of labels if specified.

In [29]:
def top_label_recommendation(description: str, number_of_recommendations = 3):
    probability_values = clf.predict_proba(vectorizer.transform([description]))
    #WE TAKE ADVANTAGE OF THE DATAFRAME PROPERTIES TO SORT THE PROBABILITIES
    #WHILE KEEPING THE INDEX THAT LINKS EACH PROBABILITY TO ITS CLASS
    probability_values_DF = pd.DataFrame(probability_values.transpose(), columns=['probability'])
    probability_values_DF.sort_values(by='probability', ascending=False, inplace=True, ignore_index=False)

    label_recommendation=[]
    for index in probability_values_DF.index[:number_of_recommendations]:
        label_recommendation += one_hot_to_label_description(clf.classes_[index])
    #set() REMOVES DUPLICATE LABELS BUT DOES NOT PRESERVE THE RANKING ORDER
    return list(set(label_recommendation))
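
The function above removes duplicates with `set`, which does not keep the labels in probability order. If the ranking should be visible to the user, one possible alternative is `dict.fromkeys`, which drops duplicates while preserving first-seen order. The sketch below is a standalone illustration, not the function used in this project; `ranked_unique_labels`, the probabilities, and the per-class label lists are all hypothetical.

```python
def ranked_unique_labels(probabilities, class_labels, top_n=3):
    """Labels of the top_n most probable classes, best first, without duplicates."""
    # Class indices sorted from most to least probable.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    labels = []
    for index in ranked[:top_n]:
        labels.extend(class_labels[index])
    # dict.fromkeys keeps the first occurrence of each label, in order.
    return list(dict.fromkeys(labels))

# Made-up probabilities for one description over 4 classes, and the
# label descriptions each class decodes to.
probabilities = [0.10, 0.55, 0.05, 0.30]
class_labels = [
    ['No category'],
    ['Oil Change, Ignition, Fuel System'],
    ['AC System, Blower Motor'],
    ['Oil Change, Ignition, Fuel System', 'Shocks, Control Arms, Tires, Alignment'],
]

recommendations = ranked_unique_labels(probabilities, class_labels)
for element in recommendations:
    print(element)
```

Here the most probable class comes first, and the repeated "Oil Change" label from the second-ranked class appears only once.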

Let's run some examples:

In [31]:
#EX: 1
for element in top_label_recommendation('oil change, rear tie rod pass side'):
    print(element)

Oil Change, Ignition, Fuel System
No category
Shocks, Control Arms, Tires, Alignment


In [33]:
#EX: 2
for element in top_label_recommendation('oxygen sensor up stream replacement', 4):
    print(element)

ABS Control Module, Brake Lines, Brake Pads
Oil Change, Ignition, Fuel System
No category
Check Engine Light, Inspections


In [32]:
#EX: 3
for element in top_label_recommendation('r+r rear differential support', 5):
    print(element)

ABS Control Module, Brake Lines, Brake Pads
Shocks, Control Arms, Tires, Alignment
Dashboard, Door Locks, Windows
Oil Change, Ignition, Fuel System
No category


## Conclusions

The 69% test accuracy, combined with the next-best recommendations, makes for a capable label recommendation system. This work will help the car shop's employees maintain a clean and tidy database.

## Author
Leonardo Mier\
leo97mier@gmail.com