## Supplier Name Standardization using LSTM

The primary application of LSTMs in spend analysis is vendor name normalization: predicting a single standardized name for each raw vendor name. Large companies that account for a significant share of your spend often appear under many different names across your data systems.

Aggregating these variants under a single name shows how much spend goes to each supplier, which lets you identify your key suppliers.

For example:
[DELL FINANCIAL SERVICES, DELL MARKETING LP, DELL NV, DELLEMC, DMI DELL CORP BUS] becomes DELL
[ORACLE, ORACLE AMERICA INC, ORACLE CORPORATION, ORACLE FINANCIAL SERVICES, ORACLE USA INC] becomes ORACLE
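
To make the aggregation concrete, here is a minimal sketch that rolls spend up from raw vendor names to a standardized name. The mapping reuses a few of the DELL and ORACLE variants listed above; the invoice amounts and the `aggregate_spend` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Raw name -> standardized name, using variants from the examples above.
NAME_MAP = {
    "DELL FINANCIAL SERVICES": "DELL",
    "DELL MARKETING LP": "DELL",
    "DELLEMC": "DELL",
    "ORACLE AMERICA INC": "ORACLE",
    "ORACLE USA INC": "ORACLE",
}

def aggregate_spend(invoices, name_map):
    """Sum spend per standardized supplier; unmapped names keep their raw form."""
    totals = defaultdict(float)
    for raw_name, amount in invoices:
        totals[name_map.get(raw_name, raw_name)] += amount
    return dict(totals)

# Hypothetical invoice amounts, for illustration only.
invoices = [
    ("DELL FINANCIAL SERVICES", 1200.0),
    ("DELLEMC", 800.0),
    ("ORACLE USA INC", 500.0),
    ("ORACLE AMERICA INC", 250.0),
]
totals = aggregate_spend(invoices, NAME_MAP)
print(totals)
```

In the notebook, the role of `NAME_MAP` is played by the trained model: it predicts the standardized name instead of looking it up.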

If you want to skip the hassle, you can find the full code here:

In [2]:
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

In [10]:
df = pd.read_csv(r"vendor_data.csv")

In [11]:
df.shape

(21, 3)

In [12]:
df.head()

Unnamed: 0,Supplier Code,Supplier Name,predicted_name
0,103,AMEREN ILLINOIS,AMEREN ILLINOIS
1,1601,AMEREN ILLINOIS,AMEREN ILLINOIS
2,1026,AT & T,ATT
3,931,AT & T MOBILITY,ATT
4,820,AT&T,ATT


In [15]:
df = df.drop(columns=['Supplier Code'])
df.head()

Unnamed: 0,Supplier Name,predicted_name
0,AMEREN ILLINOIS,AMEREN ILLINOIS
1,AMEREN ILLINOIS,AMEREN ILLINOIS
2,AT & T,ATT
3,AT & T MOBILITY,ATT
4,AT&T,ATT


In [42]:
# Step 1: Preprocess the text data
# Tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['Supplier Name'])
X = tokenizer.texts_to_sequences(df['Supplier Name'])
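
As a rough illustration of what the tokenization step produces, the sketch below approximates the `Tokenizer` defaults in plain Python: lowercase the text, replace punctuation with spaces, split on whitespace, and number words from 1 by descending frequency. `simple_tokens` and `build_word_index` are hypothetical helpers, not part of Keras, and they only approximate its behavior.

```python
from collections import Counter

# Characters the Keras Tokenizer filters out by default (assumed to match its defaults).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_FILTER_TABLE = str.maketrans({ch: " " for ch in DEFAULT_FILTERS})

def simple_tokens(name):
    """Lowercase, turn filtered characters into spaces, split on whitespace."""
    return name.lower().translate(_FILTER_TABLE).split()

def build_word_index(names):
    """Map each word to an integer id, most frequent word first, ids starting at 1."""
    counts = Counter(tok for name in names for tok in simple_tokens(name))
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

names = ["AT & T", "AT & T MOBILITY", "AT&T"]  # rows taken from df.head()
word_index = build_word_index(names)
sequences = [[word_index[tok] for tok in simple_tokens(n)] for n in names]
print(word_index)
print(sequences)
```

Under these defaults `AT&T` and `AT & T` reduce to the same token sequence, so the model receives those two variants as identical inputs.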

In [43]:
# Pad sequences to ensure uniform input shape
max_seq_len = max(len(seq) for seq in X)  # Max length of sequences
X = pad_sequences(X, maxlen=max_seq_len, padding='post')
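
The padding step can be sketched in plain Python as follows. `pad_post` is a hypothetical stand-in for `pad_sequences(..., padding='post')`; it assumes `maxlen` is at least 1 and mirrors the Keras default of truncating longer sequences from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep only the last `maxlen` items of longer ones."""
    padded = []
    for seq in sequences:
        trimmed = list(seq)[-maxlen:]  # front truncation, like truncating='pre'
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

sequences = [[1, 2], [1, 2, 3], [4]]
max_len = max(len(seq) for seq in sequences)
print(pad_post(sequences, max_len))      # [[1, 2, 0], [1, 2, 3], [4, 0, 0]]
print(pad_post([[1, 2, 3, 4, 5]], 4))    # [[2, 3, 4, 5]]
```

The second call shows a possible pitfall at prediction time: a name with more tokens than `max_seq_len` loses its leading tokens, and in names such as DELL MARKETING LP the leading token is the brand. Passing `truncating='post'` to `pad_sequences` would keep the leading tokens instead.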

In [44]:
# Step 2: Encode the labels (standardized supplier names in 'predicted_name')
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['predicted_name'])
y = to_categorical(y)  # Convert labels to one-hot encoding
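
The two encoding steps can be pictured with a small plain-Python sketch. `encode_labels` is a hypothetical helper that imitates what `LabelEncoder` followed by `to_categorical` does: sort the unique labels, assign each an integer id, and turn each id into a one-hot row.

```python
def encode_labels(labels):
    """Return (classes, integer ids, one-hot rows) for a list of label strings."""
    classes = sorted(set(labels))  # LabelEncoder also orders classes by sorting
    index = {name: i for i, name in enumerate(classes)}
    ids = [index[label] for label in labels]
    one_hot = [[1.0 if j == i else 0.0 for j in range(len(classes))] for i in ids]
    return classes, ids, one_hot

# Labels from the first five rows shown by df.head()
labels = ["AMEREN ILLINOIS", "AMEREN ILLINOIS", "ATT", "ATT", "ATT"]
classes, ids, one_hot = encode_labels(labels)
print(classes)  # ['AMEREN ILLINOIS', 'ATT']
print(ids)      # [0, 0, 1, 1, 1]
print(one_hot)
```

Because the classes are sorted, column 0 of `y` corresponds to AMEREN ILLINOIS and column 1 to ATT, which matches the first rows of the array printed in the notebook.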

In [45]:
y

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.]], dtype=float32)

In [48]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

In [49]:
y_train

array([[0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]], dtype=float32)

In [50]:
y_test

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0.]], dtype=float32)
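
With a split this small it helps to check how many examples of each class ended up in the test set. The sketch below counts classes from one-hot rows; `class_counts` is a hypothetical helper, and the rows are copied from the `y_test` output above.

```python
from collections import Counter

def class_counts(one_hot_rows, n_classes):
    """Count how many rows belong to each class index."""
    counts = Counter(
        max(range(n_classes), key=lambda i: row[i]) for row in one_hot_rows
    )
    return [counts.get(c, 0) for c in range(n_classes)]

# Copied from the y_test output shown above
y_test_rows = [
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
test_counts = class_counts(y_test_rows, 5)
print(test_counts)  # [0, 1, 1, 1, 2]
```

Class 0 (AMEREN ILLINOIS, which has only two rows in the data) has no test example, so any test metric is silent about that class. With five test rows, each single prediction also moves accuracy by 20 percentage points.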

In [51]:
# Step 3: Build the LSTM Model
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64, input_length=max_seq_len),
    LSTM(128, return_sequences=False),
    Dense(y.shape[1], activation='softmax')  # Output layer with softmax for multi-class classification
])

In [52]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 4, 64)             1344      
                                                                 
 lstm_2 (LSTM)               (None, 128)               98816     
                                                                 
 dense_2 (Dense)             (None, 5)                 645       
                                                                 
Total params: 100,805
Trainable params: 100,805
Non-trainable params: 0
_________________________________________________________________
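
The parameter counts in the summary can be checked by hand. The sketch below applies the standard formulas for each layer; the function names are hypothetical, and the vocabulary size of 21 is inferred from the Embedding row (1344 / 64).

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 21  # len(tokenizer.word_index) + 1, inferred from 1344 / 64
emb = embedding_params(vocab_size, 64)
lstm = lstm_params(64, 128)
dense = dense_params(128, 5)
total = emb + lstm + dense
print(emb, lstm, dense, total)  # 1344 98816 645 100805
```

Roughly 100,000 trainable parameters are being fitted on 16 training rows, so the accuracy figures from this demo dataset are best read as a smoke test of the pipeline.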


In [54]:
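# Step 4: Train the Model
# Note: the test split is also passed as validation data, so the test accuracy
# reported afterwards is not an independent estimate if it guides any tuning.
model.fit(X_train, y_train, epochs=10, batch_size=3, validation_data=(X_test, y_test))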

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1b9a0ec5040>

In [55]:
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy * 100:.2f}%")

Test Accuracy: 80.00%


In [57]:
# Predict the standardized name for a supplier string (this one also appears in the dataset)
new_text = ["AT & T MOBILITY"]
new_text_seq = tokenizer.texts_to_sequences(new_text)
new_text_padded = pad_sequences(new_text_seq, maxlen=max_seq_len, padding='post')
prediction = model.predict(new_text_padded)

# Decode the predicted label
predicted_label = label_encoder.inverse_transform([np.argmax(prediction)])
print("Predicted Supplier Name:", predicted_label[0])

Predicted Supplier Name: ATT


In [58]:
# Predicting the supplier name for a new text
new_text = ["PITNEY BOWES GLOBAL FINANCIAL"]
new_text_seq = tokenizer.texts_to_sequences(new_text)
new_text_padded = pad_sequences(new_text_seq, maxlen=max_seq_len, padding='post')
prediction = model.predict(new_text_padded)

# Decode the predicted label
predicted_label = label_encoder.inverse_transform([np.argmax(prediction)])
print("Predicted Supplier Name:", predicted_label[0])

Predicted Supplier Name: PITNEY BOWES


In [67]:
# Predicting the supplier name for a new text
new_text = ["DELL GLOBAL FINANCIAL"]
new_text_seq = tokenizer.texts_to_sequences(new_text)
new_text_padded = pad_sequences(new_text_seq, maxlen=max_seq_len, padding='post')
prediction = model.predict(new_text_padded)

# Decode the predicted label
predicted_label = label_encoder.inverse_transform([np.argmax(prediction)])
print("Predicted Supplier Name:", predicted_label[0])

Predicted Supplier Name: DELL
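
The three prediction cells above all end with `np.argmax`, which always returns some class, even for a name the model knows nothing about. A `Tokenizer` created without `oov_token` silently drops unseen words, so a completely unfamiliar name becomes an all-zero sequence and still receives a label. One possible safeguard, sketched here under assumptions, is to reject predictions whose top probability is low. `decode_prediction`, the 0.6 threshold, and the probability vectors are all hypothetical.

```python
def decode_prediction(probabilities, class_names, min_confidence=0.6):
    """Return (label, confidence); label is None when the top probability is below the threshold."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    confidence = float(probabilities[best])
    label = class_names[best] if confidence >= min_confidence else None
    return label, confidence

# Standardized names that appear in this notebook (a subset of the five classes)
class_names = ["AMEREN ILLINOIS", "ATT", "DELL", "PITNEY BOWES"]

# Hypothetical softmax outputs, for illustration only
confident = [0.02, 0.03, 0.90, 0.05]
uncertain = [0.30, 0.25, 0.25, 0.20]

print(decode_prediction(confident, class_names))  # ('DELL', 0.9)
print(decode_prediction(uncertain, class_names))  # (None, 0.3)
```

In the notebook this could be called as `decode_prediction(prediction[0], label_encoder.classes_)`, with names that come back as `None` routed to manual review. The threshold would need tuning on real data.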
