
# CEM300 Natural Language Processing Coursework

In this work, I explore the Product Classification and Clustering Dataset.
<br>
Sources & Frameworks:

Dataset Chosen - https://archive.ics.uci.edu/dataset/837/product+classification+and+clustering

<br>


# ----------------------------------------------------
# Section 1 - Dataset
# ----------------------------------------------------

## Task 1.1: Loading the Dataset

In this section, I will load and explore the Product Classification and Clustering Dataset taken from the UCI Machine Learning Repository.

I will be using pandas and scikit-learn to inspect its features.

My goal in this coursework is to use NLP techniques to represent the text and compare different algorithms for classifying products into their correct categories.

In [None]:

# Importing required libraries
import pandas as pd #using pandas to view data in dataframes
import io #io is used to load the data after uploading
from google.colab import files #we import the files package from google.colab framework to be able to upload files

# Uploading dataset (same idea as Lab 01)
uploaded = files.upload()  # expect 'pricerunner_aggregate.csv'

# Reading dataset into a dataframe
# Note: header=None keeps the CSV's own header line as row 0 of the dataframe
product_file = io.BytesIO(uploaded['pricerunner_aggregate.csv'])
product_df = pd.read_csv(product_file, header=None)

# Inspecting the dataframe
print("Shape of dataset:", product_df.shape)
product_df.head()


Shape of dataset: (35312, 8)


Unnamed: 0,Product_ID,Product_Title,Merchant_ID,Cluster_ID,Cluster_Label,Category_ID,Category_Label,title_length
0,Product ID,Product Title,Merchant ID,Cluster ID,Cluster Label,Category ID,Category Label,13
1,1,apple iphone 8 plus 64gb silver,1,1,Apple iPhone 8 Plus 64GB,2612,Mobile Phones,31
2,2,apple iphone 8 plus 64 gb spacegrau,2,1,Apple iPhone 8 Plus 64GB,2612,Mobile Phones,35
3,3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,3,1,Apple iPhone 8 Plus 64GB,2612,Mobile Phones,70
4,4,apple iphone 8 plus 64gb space grey,4,1,Apple iPhone 8 Plus 64GB,2612,Mobile Phones,35


In [None]:
product_file = io.BytesIO(uploaded['pricerunner_aggregate.csv'])
product_df = pd.read_csv(product_file, header=None)
print(product_df)

       Product_ID                                      Product_Title  \
0      Product ID                                      Product Title   
1               1                    apple iphone 8 plus 64gb silver   
2               2                apple iphone 8 plus 64 gb spacegrau   
3               3  apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...   
4               4                apple iphone 8 plus 64gb space grey   
...           ...                                                ...   
35307       47350  smeg fab28 60cm retro style right hand hinge f...   
35308       47351  smeg fab28 60cm retro style left hand hinge fr...   
35309       47352  smeg fab28 60cm retro style left hand hinge fr...   
35310       47355     candy 60cm built under larder fridge cru160nek   
35311       47358           neff k4316x7gb built under larder fridge   

        Merchant_ID   Cluster_ID             Cluster_Label   Category_ID  \
0       Merchant ID   Cluster ID             Cluster Label 

The dataframe reveals several useful facts about the PriceRunner dataset just uploaded:
1. There are 7 columns: 6 features and a label column with 11 unique labels.
2. There are 35312 rows. Because the file was read with `header=None`, row 0 holds the column names rather than a product, so the dataset contains 35311 product examples.
3. The columns include Product ID, Merchant ID, etc.
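
Row 0 of the output above repeats the column names, which suggests the CSV carries its own header line. The following is a minimal sketch, assuming that is the case, of how `pd.read_csv` behaves with `header=None` versus `header=0`. The two-product `sample_csv` string is made up for illustration and is not the real file.

```python
import io
import pandas as pd

# Hypothetical two-product sample that mimics the layout of the real CSV
sample_csv = (
    "Product ID,Product Title,Merchant ID,Cluster ID,Cluster Label,Category ID,Category Label\n"
    "1,apple iphone 8 plus 64gb silver,1,1,Apple iPhone 8 Plus 64GB,2612,Mobile Phones\n"
    "2,apple iphone 8 plus 64 gb spacegrau,2,1,Apple iPhone 8 Plus 64GB,2612,Mobile Phones\n"
)

# header=None treats the header line as an ordinary data row
raw_df = pd.read_csv(io.StringIO(sample_csv), header=None)

# header=0 uses the first line as column names, so only products remain
clean_df = pd.read_csv(io.StringIO(sample_csv), header=0)

print(raw_df.shape)    # includes one extra row: the header line
print(clean_df.shape)  # products only
```

Reading with `header=0` would also keep the literal string "Product Title" out of later text statistics and out of the TF-IDF vocabulary.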

## Task 1.2: Understanding the Dataset

This dataset, the Product Classification and Clustering dataset (loaded here from `pricerunner_aggregate.csv`), contains records of retail products aggregated from online merchants.  
Each record includes a **Product ID**, textual information such as **Product Title** and **Cluster Label**, identifiers like **Merchant ID** and **Cluster ID**, and category information given by **Category ID** and **Category Label**.

The *Category Label* column serves as the target for classification, while the *Product Title* and *Cluster Label* provide textual data that require preprocessing before use in machine-learning models.  

Because these text fields are unstructured, this section will load, inspect, and clean the dataset.
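
Since *Category Label* is the classification target, one quick inspection step is to count how many titles fall under each label. The sketch below uses a toy dataframe with made-up rows; the column names `Product_Title` and `Category_Label` follow the names assigned in the rename cell of this notebook, and `toy_df` and `label_counts` are hypothetical names.

```python
import pandas as pd

# Toy frame with made-up rows, standing in for the real product dataframe
toy_df = pd.DataFrame({
    "Product_Title": [
        "apple iphone 8 plus 64gb silver",
        "smeg fab28 60cm retro style fridge",
        "apple iphone 8 plus 64gb space grey",
    ],
    "Category_Label": ["Mobile Phones", "Fridges", "Mobile Phones"],
})

# Count how many titles belong to each target label
label_counts = toy_df["Category_Label"].value_counts()
print(label_counts)
```

On the real data the same call would show whether some categories are much smaller than others, which matters when comparing classifiers.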


In [None]:
# --- Rename columns ---
product_df.columns = [
    "Product_ID", "Product_Title", "Merchant_ID",
    "Cluster_ID", "Cluster_Label", "Category_ID", "Category_Label"
]

# --- Checking and handling missing values ---
print(product_df.isnull().sum())
product_df = product_df.dropna(subset=["Product_Title", "Category_Label"])

Product_ID        0
Product_Title     0
Merchant_ID       0
Cluster_ID        0
Cluster_Label     0
Category_ID       0
Category_Label    0
title_length      0
dtype: int64


## Task 1.3: Exploring and Preparing Text Data

This task focuses on understanding the textual content of the dataset before vectorisation.  
Because clustering relies purely on text similarity, it is important to inspect the product titles, decide how aggressively to clean them, and justify the chosen preprocessing strategy.


In [None]:
# Inspect example titles and basic text statistics
product_df['title_length'] = product_df['Product_Title'].astype(str).apply(len)

#print the mean to see the average length
print("Average title length:", product_df['title_length'].mean())
print("\nSample titles:")
for t in product_df['Product_Title'].head(10):
    print("-", t)


Average title length: 52.98108291798822

Sample titles:
- Product Title
- apple iphone 8 plus 64gb silver
- apple iphone 8 plus 64 gb spacegrau
- apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim free smartphone in gold
- apple iphone 8 plus 64gb space grey
- apple iphone 8 plus gold 5.5 64gb 4g unlocked sim free
- apple iphone 8 plus gold 5.5 64gb 4g unlocked sim free
- apple iphone 8 plus 64 gb space grey
- apple iphone 8 plus 64gb space grey
- apple iphone 8 plus 64gb space grey


In [None]:
# Cleaning product titles for clustering
import re, nltk                           # import regex and nltk for text processing
from nltk.corpus import stopwords         # import stopwords list
from nltk.stem import WordNetLemmatizer   # import lemmatiser for word normalisation

nltk.download('stopwords')                # download stopwords
nltk.download('wordnet')                  # download wordnet for lemmatisation

stop_words = set(stopwords.words('english'))     # define stopword list
lemmatizer = WordNetLemmatizer()                 # create lemmatiser object

# define text pre-processing function
def preprocess_text(text):
    text = str(text).lower()                      # convert to lowercase
    text = re.sub(r'[^a-z0-9 ]', ' ', text)       # replace punctuation/special chars with spaces
    tokens = text.split()                         # split into words
    tokens = [t for t in tokens if t not in stop_words]  # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # lemmatise words
    return " ".join(tokens)                       # join back into sentence

# apply function and create new column
product_df["Clean_Title"] = product_df["Product_Title"].apply(preprocess_text)

# display original vs cleaned titles
product_df[["Product_Title","Clean_Title"]].head(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,Product_Title,Clean_Title
0,Product Title,product title
1,apple iphone 8 plus 64gb silver,apple iphone 8 plus 64gb silver
2,apple iphone 8 plus 64 gb spacegrau,apple iphone 8 plus 64 gb spacegrau
3,apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim...,apple mq8n2b iphone 8 plus 64gb 5 5 12mp sim f...
4,apple iphone 8 plus 64gb space grey,apple iphone 8 plus 64gb space grey
5,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,apple iphone 8 plus gold 5 5 64gb 4g unlocked ...
6,apple iphone 8 plus gold 5.5 64gb 4g unlocked ...,apple iphone 8 plus gold 5 5 64gb 4g unlocked ...
7,apple iphone 8 plus 64 gb space grey,apple iphone 8 plus 64 gb space grey
8,apple iphone 8 plus 64gb space grey,apple iphone 8 plus 64gb space grey
9,apple iphone 8 plus 64gb space grey,apple iphone 8 plus 64gb space grey
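
The cleaned titles above show a side effect of replacing every non-alphanumeric character with a space: decimal sizes such as `5.5` are split into `5 5`. The sketch below is one possible variant, not the method used in this notebook, that keeps a dot when it sits between two digits. `clean_keep_decimals` is a hypothetical helper, and it skips stopword removal and lemmatisation so that it stays self-contained.

```python
import re

def clean_keep_decimals(text):
    text = str(text).lower()
    # replace every character that is not a letter, digit, space, or dot
    text = re.sub(r'[^a-z0-9. ]', ' ', text)
    # replace dots that do not have a digit on both sides
    text = re.sub(r'(?<!\d)\.|\.(?!\d)', ' ', text)
    # collapse repeated whitespace
    return " ".join(text.split())

print(clean_keep_decimals("apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim free"))
```

Whether keeping `5.5` as one token helps clustering is an open question; it may reduce noise from stray single digits, at the cost of a slightly larger vocabulary.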


# ----------------------------------------------------
# Section 2 - Representation Learning
# ----------------------------------------------------

## Task 2 – Representation Learning

To perform clustering, the cleaned product titles must be converted into numeric vectors that capture word importance.  
Here the **TF-IDF (Term Frequency - Inverse Document Frequency)** method is used.  
TF-IDF assigns higher values to words that appear frequently within one title but are less common across the dataset.  
This helps emphasise unique identifiers such as brand and model numbers while reducing the weight of very common terms.
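
To make the weighting concrete, the sketch below computes TF-IDF by hand on three made-up titles. It assumes raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which to my understanding matches the default settings of scikit-learn's `TfidfVectorizer`. The names `docs`, `doc_freq`, and `tfidf_vector` are hypothetical, and unlike the default vectoriser this sketch keeps single-character tokens such as `8`.

```python
import math
from collections import Counter

# Toy corpus of three made-up cleaned titles
docs = [
    "apple iphone 8 plus 64gb silver",
    "apple iphone 8 plus 64gb gold",
    "smeg fab28 retro fridge",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many titles each term occurs
doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # term count multiplied by the smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    # L2-normalise so every title vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf_vector(tokenised[0])
print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

In the first title, `silver` occurs in only one document while `apple` occurs in two, so `silver` receives the larger weight.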

TF-IDF is efficient and interpretable for short text, making it suitable for grouping similar products.  
Each product title becomes a sparse vector representing the strength of its key tokens.  
These vectors form the numerical feature space on which unsupervised algorithms, such as *k-means*, can measure distance and identify clusters.  
This representation is a common starting point in NLP pipelines and serves as a solid baseline before exploring more advanced embeddings like Word2Vec or BERT.
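
As a rough illustration of what "measuring distance" means for sparse title vectors, the sketch below computes cosine similarity between vectors stored as term-to-weight dictionaries. The weights are made up rather than taken from the real TF-IDF matrix, and `cosine_similarity` is a hypothetical helper written for this example.

```python
import math

def cosine_similarity(a, b):
    # dot product over the terms that both vectors share
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)

# Made-up sparse vectors (term -> weight)
silver_phone = {"apple": 0.4, "iphone": 0.4, "64gb": 0.4, "silver": 0.7}
gold_phone = {"apple": 0.4, "iphone": 0.4, "64gb": 0.4, "gold": 0.7}
fridge = {"smeg": 0.6, "fab28": 0.6, "fridge": 0.5}

print(cosine_similarity(silver_phone, gold_phone))  # shares three terms
print(cosine_similarity(silver_phone, fridge))      # shares no terms
```

For unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine similarity`, so a Euclidean method such as k-means ranks L2-normalised TF-IDF vectors in the same order as cosine similarity would.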


In [23]:
# -------------------------------------------------------------
# Section 2 - Representation Learning
# -------------------------------------------------------------
# In clustering we need numeric features to compare similarity.
# TF-IDF (Term Frequency – Inverse Document Frequency) converts
# cleaned text into numeric vectors showing how important each
# word is across all product titles.
# -------------------------------------------------------------

from sklearn.feature_extraction.text import TfidfVectorizer   # import TF-IDF tool

# create TF-IDF vectoriser
# max_features keeps only the 5000 most frequent terms across the corpus (for efficiency)
vectoriser = TfidfVectorizer(max_features=5000)

# fit the vectoriser on the cleaned titles and transform to matrix
X_tfidf = vectoriser.fit_transform(product_df["Clean_Title"])

# show size of the resulting matrix (rows = products, cols = words)
print("TF-IDF matrix shape:", X_tfidf.shape)

# print out the first few feature names
print("Sample features:", vectoriser.get_feature_names_out()[:15])


TF-IDF matrix shape: (35312, 5000)
Sample features: ['00' '001' '00ghz' '01' '02' '02m' '03m' '05' '06' '06ghz' '0cf' '0ghz'
 '0in' '0inch' '0lcd']
