Clustering Product Names with Python — Part 1
Using Natural Language Processing (NLP) and K-Means to cluster unlabelled text in Python

In [5]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import plotly.express as px

#Libraries for preprocessing
from gensim.parsing.preprocessing import remove_stopwords
import string
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import webcolors

#Download once if using NLTK for preprocessing
import nltk
nltk.download('punkt')

#Libraries for vectorisation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV
from fuzzywuzzy import fuzz

#Libraries for clustering
from sklearn.cluster import KMeans

#Load data set
df = pd.read_csv('df1014_2.csv')
text1 = df['GDS_NM']

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\min\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


Introduction
Natural Language Processing (NLP) refers to the automatic computational processing of human language like text and speech.

It is particularly useful for quickly extracting meaning from large amounts of unlabelled text, which is exactly the problem we face when categorising eCommerce products. Products can be labelled with the wrong category or not labelled at all, and manual categorisation is inefficient and, for some data sets, practically impossible.

Today we are going to talk about how we can use NLP and K-means in Python to automatically cluster unlabelled product names to quickly understand what kinds of products are in a data set. This method is unsupervised (the categories and number of categories are not set) and differs from classification which is supervised and allocates product names to target labels (known categories).

For this guide I’ll be using a data set from the Australian Food Composition Database which contains data on the nutrient content of Australian foods. I’ll show you how I clustered 1,534 food names with a lot of uniqueness…

The method consists of the following steps:

1. Preprocessing the text (the food names) into clean words so that we can turn it into numerical data.
2. Vectorisation, which is the process of turning words into numerical features to prepare for machine learning.
3. Applying K-means clustering, an unsupervised machine learning algorithm, to group food names with similar words together.
4. Assessing cluster quality through cluster labelling and visualisation.
5. Finetuning steps 1–4 to improve cluster quality.
This article is Part 1 and will cover: Preprocessing and Vectorisation.

Be sure to also check out Part 2 which will cover: K-means Clustering, Assessing Cluster Quality and Finetuning.

Full disclosure: this data set actually comes with a column ‘Classification Name’ with 268 categories but for demonstration purposes, let’s pretend it’s not there ;)

Preprocessing
The aim of the game here is to remove unnecessary words and characters so that the words in our food names are meaningful for clustering later.

There are many preprocessing techniques and selecting which ones to use depends on how they’ll affect the clusters. Here are the techniques I used and why.

Removing stopwords, punctuation and numbers
Stopwords are common words such as ‘the’, ‘a’, ‘is’ and ‘and’. A cluster that forms only because all of its food names contain the word ‘and’, for example, isn’t useful, because ‘and’ says nothing about what the foods are.

We’ll remove stopwords using the Gensim library, and punctuation and numbers using Python’s built-in string module.

In [6]:
#Remove stopwords, punctuation and numbers
#Build the translation table once: it deletes every punctuation character and digit
remove_chars = str.maketrans('', '', string.punctuation + string.digits)
text2 = [remove_stopwords(x).translate(remove_chars) for x in text1]
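
To see what this step does to a single name, here is a minimal sketch that uses only the standard library. The sample name and the three-word stopword set are made up for illustration; the cell above relies on Gensim’s much longer stopword list instead.

```python
import string

# Hypothetical stand-in for Gensim's stopword list (the real list is much longer)
SAMPLE_STOPWORDS = {"and", "with", "the"}

# Translation table that deletes every punctuation character and digit
STRIP_TABLE = str.maketrans("", "", string.punctuation + string.digits)

def clean_name(name):
    """Drop the sample stopwords, then delete punctuation and digits."""
    kept = [word for word in name.split() if word not in SAMPLE_STOPWORDS]
    return " ".join(kept).translate(STRIP_TABLE)

# Made-up food name, not taken from the data set
sample = "Beef, mince, 5% fat, fried with oil and salt"
cleaned = clean_name(sample)
print(cleaned.split())  # ['Beef', 'mince', 'fat', 'fried', 'oil', 'salt']
```

A token made up entirely of digits and punctuation (here ‘5%’) leaves an extra space behind, which is harmless because the later steps split the names on whitespace again.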

Stemming and making words lower case
Stemming words involves shortening them to their root forms. For example ‘apple’ and ‘apples’ both become ‘appl’ and are treated as the same word in the vectorisation stage.

Note: lemmatisation would reduce both words to the real word ‘apple’ based on context. It is more computationally expensive and wasn’t required for this exercise as I could easily tell what the stemmed words referred to.

NLTK’s PorterStemmer also converts words to lower case by default. This is useful so that ‘Appl’ and ‘appl’ are treated as the same word in the vectorisation stage.

In [7]:
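#Stem and make lower case
porter = PorterStemmer()  #create the stemmer once and reuse it for every name

def stemSentence(sentence):
    #Split the name into tokens, stem each one (this also lower-cases it), then rejoin
    token_words = word_tokenize(sentence)
    stem_sentence = [porter.stem(word) for word in token_words]
    return ' '.join(stem_sentence)

text3 = pd.Series([stemSentence(x) for x in text2])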
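
As a quick check of the stemming and lower-casing described above, the sketch below runs NLTK’s PorterStemmer directly on a few words. The word list is made up for illustration, and the stemmer is called word by word, so the snippet does not need word_tokenize or the punkt download.

```python
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Made-up sample words, not taken from the data set
words = ["apple", "Apples", "APPLE"]

# Each word is lower-cased and reduced to its stem
stems = [porter.stem(word) for word in words]
print(stems)  # ['appl', 'appl', 'appl']
```

All three spellings collapse to the same stem, so they will count as one feature in the vectorisation stage.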


Removing colours
Having colours in our food names will likely yield clusters of same-coloured but otherwise unrelated foods. We’ll remove colours using the list of CSS3 colour names from the webcolors library, but keep the colours that are also foods (e.g. ‘chocolate’ and ‘lime’).

In [None]:
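#Remove colours, keeping colour names that are also foods
#Assumes a webcolors release that exposes the CSS3_NAMES_TO_HEX mapping
food_colors = ('bisque','blanchedalmond','chocolate','honeydew','lime',
               'olive','orange','plum','salmon','tomato','wheat')
#Stem the colour names so that they match the stemmed food names
colors = {stemSentence(x) for x in webcolors.CSS3_NAMES_TO_HEX if x not in food_colors}
#The loop variable is called 'name' so that it does not shadow the imported string module
text4 = [' '.join([x for x in name.split() if x not in colors]) for name in text3]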
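
The filtering logic can be tried on a toy example without the webcolors dependency. In the sketch below the colour set, the food-word exclusions and the sample names are all made up for illustration, and the words are assumed to be lower-cased already (stemming is left out to keep the example dependency-free).

```python
# Hypothetical colour names; the real list comes from webcolors
ALL_COLORS = {"red", "green", "brown", "white", "lime", "salmon"}

# Colour names that are also foods and should be kept
FOOD_WORDS = {"lime", "salmon"}

COLORS_TO_REMOVE = ALL_COLORS - FOOD_WORDS

def remove_colors(name):
    """Drop every word that is a colour but not a food."""
    return " ".join(word for word in name.split() if word not in COLORS_TO_REMOVE)

# Made-up food names, not taken from the data set
names = ["red capsicum raw", "green capsicum raw", "salmon smoked", "brown rice boiled"]
without_colors = [remove_colors(name) for name in names]
print(without_colors)
# ['capsicum raw', 'capsicum raw', 'salmon smoked', 'rice boiled']
```

The two capsicum names become identical, which is what lets them land in the same cluster later, while ‘salmon’ survives because it is on the exclusion list.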

Some Python libraries used in the vectorisation stage have some of these techniques built-in. However, if testing multiple vectorisation models, it’s best to start with a consistent, clean text to be able to compare output.

Vectorisation
We now want to turn our cleaned text into numerical data so that we can perform statistical analysis on it.

Just like preprocessing, there are many techniques to choose from. These are the models I tested.

Bag of words
Bag of words (using scikit-learn’s CountVectorizer) is a basic model that counts the occurrences of words in a document. Here, each row (one food name) is a document. The result is a matrix with one column (feature) for each distinct word in the text and one row (vector) for each food name; each value is the number of times that word occurs in that name.

In [None]:
#Bag of words
vectorizer_cv = CountVectorizer(analyzer='word')
X_cv = vectorizer_cv.fit_transform(text4)
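
To make the shape of that matrix concrete, here is a minimal pure-Python sketch of the same idea on three made-up, already-cleaned names. It splits on whitespace only, so it is an approximation of what CountVectorizer does (the scikit-learn class also lower-cases the text and, by default, ignores single-character tokens, and it returns a sparse matrix rather than nested lists).

```python
from collections import Counter

# Made-up, already-stemmed names; each one is a "document"
docs = ["appl juic", "appl pie", "beef pie beef"]

# One feature (column) per distinct word, in alphabetical order
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row (vector) per document
matrix = [to_vector(doc) for doc in docs]

print(vocabulary)  # ['appl', 'beef', 'juic', 'pie']
for row in matrix:
    print(row)
# [1, 0, 1, 0]
# [1, 0, 0, 1]
# [0, 2, 0, 1]
```

Names that share words end up with non-zero values in the same columns, which is the property a clustering algorithm can use to group them.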