# Task 2: Text Classification

In [1]:
# obtain the data
# !pip install -U pandas
# !pip install -U nltk
import pandas as pd
data = pd.read_csv('testset_C.csv', sep=';')
data.head()

| Unnamed: 0 | id | productgroup | main_text | add_text | manufacturer |
|---|---|---|---|---|---|
| 0 | 26229701 | WASHINGMACHINES | WAQ284E25 | WASCHMASCHINEN | BOSCH |
| 1 | 16576864 | USB MEMORY | LEEF IBRIDGE MOBILE SPEICHERERWEITERUNG FUER I... | PC__1100COMPUTINGMEMORY__1110MEMORYCARDS | LEEF |
| 2 | 26155618 | USB MEMORY | SANDISK 32GB ULTRA FIT USB 3.0 | W1370 | |
| 3 | 25646138 | BICYCLES | HOLLANDRAD DAMEN 28 ZOLL TUSSAUD 3-GAENGE RH 5... | FAHRRAEDER // SPORTFAHRRAEDER | SCHALOW & KROH GMBH |
| 4 | 19764614 | BICYCLES | DAHON SPEED D7 SCHWARZ ? FALTRAD | SPORTS__30000WHEELED__30070BIKES | DAHON |


Before developing any classification models, I would first take a look at (i) how many entries the dataset has, and (ii) the distribution of the labels (productgroup).

In [2]:
from nltk import FreqDist
category_list = data['productgroup'].tolist()
unique_ctg = list(set(category_list))
print('num of product group', len(unique_ctg))

ctg_fd = FreqDist(category_list)
ctg_fd.most_common()

num of product group 4


[('WASHINGMACHINES', 2000),
 ('USB MEMORY', 2000),
 ('BICYCLES', 2000),
 ('CONTACT LENSES', 2000)]

Hence the dataset is balanced. There are three columns from which I can make predictions: *main_text*, *add_text* and *manufacturer*. Among them, manufacturer is categorical information, while main_text and add_text are unstructured free text. I therefore go from easy to hard: first use only the manufacturer to make predictions, and then bring in the free-text columns.

In [3]:
# count how many unique manufacturers we have, and the frequency of each manufacturer
from nltk import FreqDist
manufacture_list = data['manufacturer'].tolist()
unique_mnf = list(set(manufacture_list))
print('num of unique manufacturers', len(unique_mnf))

mnf_fd = FreqDist(manufacture_list)
mnf_fd.most_common()

num of unique manufacturers 624


[(nan, 1344),
 ('COOPER', 343),
 ('CIBA', 243),
 ('SIEMENS', 205),
 ('MIELE', 197),
 ('B&L', 181),
 ('SANDISK', 176),
 ('J&J', 175),
 ('BOSCH', 173),
 ('ALCON', 172),
 ('TRANSCEND', 161),
 ('AEG', 149),
 ('VERBATIM', 140),
 ('CUBE', 126),
 ('KINGSTON', 125),
 ('INTENSO', 103),
 ('VESTEL GERMANY GMBH', 86),
 ('BERGAMONT', 80),
 ('SAMSUNG', 75),
 ('BAUKNECHT', 74),
 ('BEKO', 72),
 ('SERIOUS', 71),
 ('MPG&E', 68),
 ('GHOST', 66),
 ('MERIDA', 63),
 ('SILICON POWER', 61),
 ('KS CYCLING', 59),
 ('HAMA', 54),
 ('ACUVUE', 52),
 ('EMTEC', 52),
 ('TOSHIBA', 52),
 ('WINORA', 52),
 ('CORSAIR', 51),
 ('BICYCLES', 48),
 ('FUJI', 48),
 ('BOCAS', 43),
 ('CONSTRUCTA', 42),
 ('LEXAR', 42),
 ('JOHNSON & JOHNSON', 39),
 ('ORTLER', 38),
 ('OTTO GMBH & CO. KG', 38),
 ('GOODRAM', 36),
 ('SCHALOW & KROH GMBH', 35),
 ('GORENJE', 35),
 ('PUKY', 35),
 ('SONY', 34),
 ('GAZELLE', 33),
 ('RALEIGH', 33),
 ("S'COOL", 33),
 ('COOPER VISION', 32),
 ('BEKO DEUTSCHLAND GMBH', 32),
 ('XLYNE', 32),
 ('FIXIE INC.', 30),
 ('

In [4]:
# build the manufacturer-category matrix, to see whether there 
# exists a strong link between certain manufacturers and categories
from nltk import ConditionalFreqDist
mnf_ctg_matrix = ConditionalFreqDist(
    # iterate over all rows instead of hard-coding the dataset size
    (manufacture_list[idx], category_list[idx]) for idx in range(len(category_list))
)
print('manufacturer\tnum. of corresponding categories')
for mnf in unique_mnf:
    print(mnf, '\t', len(mnf_ctg_matrix[mnf].most_common()))

manufacturer	num. of corresponding categories
nan 	 4
ACUVUE 	 1
HP 	 1
COLOURVUE 	 1
BACH 	 1
DEXXON DATA MEDIA 	 1
MPG & E 	 1
HAMA GMBH & CO. KG 	 1
MPG&E-CONTACTLINSEN 	 1
MADD GEAR 	 1
SWISSLENS 	 1
BOSCH -W- 	 1
PENDING SYSTEM GMBH&CO.KG,,956 	 1
BÃ¼NTING SYSTEMKUNDEN HANDELS-GMBH 	 1
STAPLES 	 1
GORENJE HAUSGERAETE 	 1
IGA-OPTIK 	 1
COOPER VISION WES 	 1
BOSCH 	 1
BOSCH KG 	 1
PERFORMANCE 	 1
CONSTRUCTA ENERGY 	 1
DIVERSE 	 1
PLATINUM 	 1
XD 	 1
BACHTENKIRCH INTERBIKE 	 1
KAWASAKI 	 1
BAUSCH  & LOMB GMBH 	 1
TAKEMS SONDERPOSTEN 	 1
MERALENS 	 1
HYDROGEL V 	 1
BAUSCH + LOMB 	 1
ELECTROLUX HAUSGERäTE GMBH AEG 	 1
I.ONIK 	 1
BEKO GRUNDIG Deutschland GmbH 	 1
AD AUTODRIVE 	 1
BÜROBOSS.DE/LOGISTIK 	 1
GALFILA CONTACTLINSEN GMBH 	 1
SIEMENS-Electrogeräte 	 1
ZANKER 	 1
CHARGE 	 1
BABBOE 	 1
GRUNDIG 	 1
COOPER VISION 	 1
IMATION 	 1
DIVERSE MARKEN MEDIMAX 	 1
ELECTROLUX HAUSGERäTE VERTIEBS 	 1
CONSUMER 	 1
PRIMA-OPTICS 	 1
HAIBIKE 	 1
CORSAIR 	 1
SEG HAUSGERAETE GMBH 	 1
VISTAKON 	 1
H

SMEG HAUSGERAETE GMBH 	 1
SAMSUNG ELECTRONICS GMBH 	 1
BOSCH HAUSGERAETE GMBH 	 1
HEAD 	 1
3503460 	 1
BIOMEDICS 	 1
MIELE & CIE. KG 	 1
SONY 	 1
LEXAR MEDIA 	 1
BAUKNECHT HAUSGERäTE GMBH 	 1
WÖHLK-CONTACTLINSEN GMBH 	 1
COOPER EIGENMARKE 	 1
VCM MORGENTHALER GMBH 	 1
<NONE> 	 1
ZEG 	 1
BÃ¼NTING GROÃŸHANDEL & SERVICE GMBH 	 1
Alno AG 	 1
CIBAVISION 	 1
SON 	 1
HAMLET 	 1
AEG HAUSGERäTE 	 1
NO-NAME 	 1
PHOTOFAST 	 1
HUDORA 	 1
Bauknecht 	 1
IGA IGK 	 1
LG ELECTRONICS 	 1
HP ENTERPRISE 	 1
ELECTRONICPARTNER 	 1
IUK 	 1
XLAYER 	 1
3531470 	 1
ADATA TECHNOLOGY (USA) CO., LTD 	 1
CREME 	 1
3500470 	 1
TRANSCEND 	 1
SOENNECKEN / LS: 1 	 1
PURE FIX CYCLES 	 1
CONTA OPTIC GMBH 	 1
GROUP SFIT 	 1
KOENIC 	 1
ARP 	 1
SANDISK CORPORATION 	 1
JOHNSON+JOHNSON 	 1
TOSHIBA - MASS STORAGE 	 1
VIDEOSEVEN 	 1
BIOFINITY 	 1
PC?: WORTMANN AG 	 1
WOEHLKZEIS 	 1
JOHNSON & JOHNSON MEDICAL 	 1
MPG&E KATALOG 	 1
BIANCHI 	 1
GALIFA 	 1
JOHNSON&JOHNSON 	 1
BAUKNECHT HAUSGERAETE GMBH 	 1
BOSCH HAUSGERAETE GMBH9603

I find that most manufacturers correspond to only one category. Hence I first develop a simple *Naive Bayes*-style lookup model as a baseline: estimate the probabilities p(category|manufacturer) from the training data; if a manufacturer corresponds to multiple categories, predict the most likely one.

In [12]:
# split data into train, dev and test set by ratio 6:2:2
import random
train_idx, dev_idx, test_idx = [], [], []
for i in range(len(category_list)):
    rnd = random.random()
    if rnd <= 0.6: train_idx.append(i)
    elif rnd <= 0.8: dev_idx.append(i)
    else: test_idx.append(i)

print('train size', len(train_idx))
print('dev size', len(dev_idx))
print('test size', len(test_idx))

train size 4743
dev size 1599
test size 1658
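
Because the split above draws an unseeded random number per row, the set sizes only approximate 6:2:2 and change from run to run. A minimal sketch of a seeded, per-class split is shown below; it uses only the standard library, and `stratified_split`, `toy_labels` and the seed value are hypothetical names chosen for illustration, not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return (train, dev, test) index lists, splitting each class by `ratios`."""
    rng = random.Random(seed)              # local RNG, so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_class):         # sorted for a deterministic class order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = int(round(ratios[0] * len(idxs)))
        n_dev = int(round(ratios[1] * len(idxs)))
        train += idxs[:n_train]
        dev += idxs[n_train:n_train + n_dev]
        test += idxs[n_train + n_dev:]     # whatever remains goes to test
    return train, dev, test

# Toy labels standing in for data['productgroup'] (25 rows per class).
toy_labels = ['WASHINGMACHINES', 'USB MEMORY', 'BICYCLES', 'CONTACT LENSES'] * 25
train_s, dev_s, test_s = stratified_split(toy_labels)
print(len(train_s), len(dev_s), len(test_s))
```

Calling the function again with the same `seed` returns the same three index lists, which makes later accuracy figures directly comparable across runs.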


In [13]:
# a baseline model: predict based on p(category|manufacturer)
baseline_pred = []
mnf_ctg_matrix = ConditionalFreqDist(
    (manufacture_list[idx], category_list[idx]) for idx in train_idx+dev_idx
)

for mnf in [manufacture_list[i] for i in test_idx]:
    # if the manufacturer was seen in the training data, predict its most frequent category
    if mnf in mnf_ctg_matrix: baseline_pred.append(mnf_ctg_matrix[mnf].max())
    # otherwise, pick one of the categories uniformly at random
    else: baseline_pred.append(random.choice(unique_ctg))

test_true_catgy = [category_list[i] for i in test_idx]
acc = len([1 for (pred,true) in zip(baseline_pred,test_true_catgy) if pred==true])*1./len(baseline_pred)
print('manufacturer-only baseline accuracy', acc)

manufacturer-only baseline accuracy 0.8516284680337757
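
To make the p(category|manufacturer) lookup behind this baseline explicit, the sketch below writes it out with the standard library on a few made-up rows. The function names and the fixed `fallback` label for unseen manufacturers are assumptions for illustration; the notebook itself uses `ConditionalFreqDist` and a random category for unseen manufacturers.

```python
from collections import Counter, defaultdict

def fit_lookup(manufacturers, categories):
    """Count how often each category occurs for each manufacturer."""
    counts = defaultdict(Counter)
    for mnf, ctg in zip(manufacturers, categories):
        counts[mnf][ctg] += 1
    return counts

def conditional_probs(counts, mnf):
    """Return {category: p(category | manufacturer)} for a seen manufacturer."""
    total = sum(counts[mnf].values())
    return {ctg: n / total for ctg, n in counts[mnf].items()}

def predict_category(counts, mnf, fallback):
    """Most frequent category for a seen manufacturer, else the fallback label."""
    if mnf in counts:
        return counts[mnf].most_common(1)[0][0]
    return fallback

# Made-up training rows: SAMSUNG appears under two categories.
toy_mnf = ['BOSCH', 'BOSCH', 'SANDISK', 'SAMSUNG', 'SAMSUNG', 'SAMSUNG']
toy_ctg = ['WASHINGMACHINES', 'WASHINGMACHINES', 'USB MEMORY',
           'USB MEMORY', 'WASHINGMACHINES', 'USB MEMORY']

lookup = fit_lookup(toy_mnf, toy_ctg)
print(conditional_probs(lookup, 'SAMSUNG'))
print(predict_category(lookup, 'SAMSUNG', fallback='BICYCLES'))
print(predict_category(lookup, 'UNSEEN BRAND', fallback='BICYCLES'))
```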


The baseline leaves clear room for improvement, so next I use the text information from main_text and add_text to make predictions. More specifically, I use the tf-idf vectors of the texts as input to train a Logistic Regression classifier.

In [14]:
# use tf-idf vectors to represent text, use logistic regression as classifier
# !pip install -U scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# prepare data
main_text_list = data['main_text'].tolist()
add_text_list = data['add_text'].tolist()

wanted_text = 'merged' # or 'only add' or 'only main'
if 'only main' == wanted_text:
    text_list = main_text_list
elif 'only add' == wanted_text:
    text_list = add_text_list
else:
    text_list = [repr(main_text_list[i])+' '+repr(add_text_list[i]) for i in range(len(main_text_list))]

train_text = [repr(text_list[i]) for i in train_idx]
train_cat = [repr(category_list[i]) for i in train_idx]
dev_text = [repr(text_list[i]) for i in dev_idx]
dev_cat = [repr(category_list[i]) for i in dev_idx]
test_text = [repr(text_list[i]) for i in test_idx]
test_cat = [repr(category_list[i]) for i in test_idx]

# this function facilitates trying out different hyperparameters of the classifier
def train_model(nrange, mf_num):
    tfidf_vectorizer = TfidfVectorizer(ngram_range=nrange,max_features=mf_num,sublinear_tf=True)
    train_vecs = tfidf_vectorizer.fit_transform(train_text)
    dev_vecs = TfidfVectorizer(vocabulary=tfidf_vectorizer.vocabulary_,ngram_range=nrange,max_features=mf_num,sublinear_tf=True).fit_transform(dev_text)
    clf = LogisticRegression().fit(train_vecs, train_cat)
    acc = clf.score(dev_vecs, dev_cat)
    return clf, acc, tfidf_vectorizer.vocabulary_
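
`train_model` builds a second `TfidfVectorizer` for the dev texts and calls `fit_transform` on them, which re-estimates the IDF weights from the dev data. A common alternative is to fit the vectorizer once on the training text and only `transform` everything else, which a scikit-learn `Pipeline` does automatically. The sketch below illustrates that pattern on toy texts; the `toy_*` names and `train_pipeline` are hypothetical, and this is not the code that produced the numbers reported in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy texts and labels, for illustration only.
toy_train_text = ['usb stick 32gb', 'usb 3.0 speicher',
                  'damen fahrrad 28 zoll', 'faltrad schwarz']
toy_train_cat = ['USB MEMORY', 'USB MEMORY', 'BICYCLES', 'BICYCLES']
toy_dev_text = ['usb speicher 64gb', 'fahrrad herren']
toy_dev_cat = ['USB MEMORY', 'BICYCLES']

def train_pipeline(nrange, mf_num):
    """Fit tf-idf and the classifier on the training text only, then score on dev."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=nrange, max_features=mf_num, sublinear_tf=True),
        LogisticRegression(),
    )
    model.fit(toy_train_text, toy_train_cat)        # vocabulary and IDF come from train
    dev_acc = model.score(toy_dev_text, toy_dev_cat)  # dev text is only transformed
    return model, dev_acc

model, dev_acc = train_pipeline((1, 1), 100)
print(dev_acc)
print(model.predict(['usb 3.0']))
```

Returning the fitted pipeline also means a single object carries both the vocabulary and the classifier, so later cells do not need to rebuild a vectorizer from `vocabulary_`.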

In [15]:
# find the best hyper-parameter combination on the dev set

ngram_list = [(1,1),(1,2),(1,3),(2,3)]
mfnum_list = [100,500,1000,3000,5000,8000,10000,15000,20000]
best_acc = -1
best_clf = None
best_setup = None
best_vocab = None

for ngram in ngram_list:
    for mfnum in mfnum_list:
        clf, acc, vocab = train_model(ngram,mfnum)
        print(ngram, mfnum, acc)
        if acc > best_acc:
            best_acc = acc
            best_clf = clf
            best_setup = (ngram, mfnum)
            best_vocab = vocab
            
print('\n===========')
print('best setup', best_setup)
print('best acc', best_acc)

(1, 1) 100 0.9724828017510945
(1, 1) 500 0.991869918699187
(1, 1) 1000 0.9956222639149468
(1, 1) 3000 0.9956222639149468
(1, 1) 5000 0.9956222639149468
(1, 1) 8000 0.9962476547842402
(1, 1) 10000 0.9962476547842402
(1, 1) 15000 0.9962476547842402
(1, 1) 20000 0.9962476547842402
(1, 2) 100 0.9668542839274546
(1, 2) 500 0.9881175734834271
(1, 2) 1000 0.9931207004377736
(1, 2) 3000 0.9956222639149468
(1, 2) 5000 0.9956222639149468
(1, 2) 8000 0.9956222639149468
(1, 2) 10000 0.9956222639149468
(1, 2) 15000 0.9956222639149468
(1, 2) 20000 0.9956222639149468
(1, 3) 100 0.9649781113195748
(1, 3) 500 0.9862414008755472
(1, 3) 1000 0.9924953095684803
(1, 3) 3000 0.9956222639149468
(1, 3) 5000 0.9956222639149468
(1, 3) 8000 0.9956222639149468
(1, 3) 10000 0.9956222639149468
(1, 3) 15000 0.9956222639149468
(1, 3) 20000 0.9956222639149468
(2, 3) 100 0.6779237023139462
(2, 3) 500 0.858036272670419
(2, 3) 1000 0.8974358974358975
(2, 3) 3000 0.9355847404627893
(2, 3) 5000 0.9418386491557224
(2, 3) 80

In [16]:
# check the classifier's performance on test set
test_vecs = TfidfVectorizer(vocabulary=best_vocab,ngram_range=best_setup[0],max_features=best_setup[1],sublinear_tf=True).fit_transform(test_text)
test_acc = best_clf.score(test_vecs, test_cat)
print(test_acc)

0.9981905910735827


## Analysis
I repeated the experiments 10 times, each time with a different train/dev/test split (in the same ratios). I made the following major observations.
* Using *main_text* and *add_text* together yields the best performance (accuracy consistently above 0.99 on both the dev and test sets), followed by using *main_text* only (acc >0.98) and *add_text* only (acc >0.95), all better than the manufacturer-based baseline by a large margin.
* In each run, the accuracies on the test and dev sets are very similar, suggesting that the model is not overfitted.
* All the weights learnt by the logistic regression model look reasonable (e.g. no extremely large or small values), suggesting that the learned classifier is sensible. The most and least informative features for each class are presented below.

In [17]:
# check the most and least informative features, in terms of their corresponding weights
# !pip install -U numpy
import numpy as np
weights_matrix = best_clf.coef_
for i,catgy in enumerate(best_clf.classes_):
    print('\n======={}========'.format(catgy))
    sorted_weights = sorted(weights_matrix[i])
    weights = list(weights_matrix[i])
    dic = {word:weight for (word,weight) in zip(best_vocab,weights)}
    sorted_dic = {k: v for k, v in sorted(dic.items(), key=lambda item: np.abs(item[1]))}
    print('---most informative features---')
    for ww in list(sorted_dic.keys())[-20:]:
        print(ww,'\t', sorted_dic[ww])
    print('---least informative features---')
    for ww in list(sorted_dic.keys())[:10]:
        print(ww, '\t', sorted_dic[ww])


---most informative features---
21gang 	 -1.6110399724510758
interner 	 1.6532246330497573
action 	 -1.6709486602478556
11100 	 1.7009710201668746
silicone 	 1.7891281468306897
wasch 	 -1.8719913533184505
flexx 	 1.872802482067686
342 	 -1.9392541548188869
tdosxl 	 -1.9409259210996976
merida 	 1.9692421916966345
leistungsfaehiger 	 1.9692421916966345
summe7613317058850 	 1.9802554466582585
6513612106 	 2.011903661688615
vim 	 2.1143789242201496
49l 	 2.2400677199092387
109 	 2.4232009769753757
dokumenten 	 -2.777697221817248
ecoedition 	 2.9299554906285192
presbyopia 	 5.037362979109216
brake 	 5.831228885405922
---least informative features---
prob 	 -0.00025130474849383206
pd2gh2grtsbbl 	 -0.0004889499904894042
763513 	 -0.0005884401774260179
keine 	 -0.0006737667653951939
extra90 	 -0.0006996215216652044
igk 	 -0.0007398801682650466
wp12t227 	 -0.0008122199285723724
masse 	 -0.0008299100816094179
preiseinheit 	 -0.0008725725351952962
micu3 	 -0.0008811204470398108

---most informat
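
When reading such a listing, note that `vocabulary_` maps each term to its column index in the tf-idf matrix, while iterating over the dictionary follows insertion order, which need not match the column order. A hedged sketch of pairing each weight with its term via the column index is given below, on toy data; `terms`, `top_features` and the toy texts are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus with three classes, so that coef_ has one row per class.
toy_text = ['usb stick 32gb', 'usb 3.0 speicher', 'damen fahrrad 28 zoll',
            'faltrad schwarz', 'waschmaschine frontlader', 'waschmaschine 7kg']
toy_cat = ['USB MEMORY', 'USB MEMORY', 'BICYCLES', 'BICYCLES',
           'WASHINGMACHINES', 'WASHINGMACHINES']

vectorizer = TfidfVectorizer(sublinear_tf=True)
clf = LogisticRegression().fit(vectorizer.fit_transform(toy_text), toy_cat)

# Sort terms by their column index, so terms[j] names column j of clf.coef_.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

def top_features(class_label, k=3):
    """Return the k (term, weight) pairs with the largest absolute weight."""
    row = list(clf.classes_).index(class_label)
    ranked = sorted(zip(terms, clf.coef_[row]),
                    key=lambda tw: abs(tw[1]), reverse=True)
    return ranked[:k]

for label in clf.classes_:
    print(label, top_features(label))
```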

## API For The Trained Classifier

I have no experience in developing REST APIs and, given the time limit, did not implement one. In principle, I would first save *best_clf*, *best_vocab* and *best_setup* using *pickle*, so that these variables can be re-used to vectorise new text and perform the classification. A very simple Python function that plays this role is implemented below.

In [18]:
# API for the trained classifier
def make_predictions(input_text):
    test_vecs = TfidfVectorizer(vocabulary=best_vocab,ngram_range=best_setup[0],max_features=best_setup[1],sublinear_tf=True).fit_transform(input_text)
    predictions = best_clf.predict(test_vecs)
    return predictions

test_text = ['guten tag','USB 3.0','hallo walt!']
print(make_predictions(test_text))

["'WASHINGMACHINES'" "'USB MEMORY'" "'WASHINGMACHINES'"]
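
The pickle-based persistence described in this section could be sketched as follows. This is a minimal illustration under assumptions: it bundles the vectorizer and classifier into one scikit-learn pipeline trained on toy data, and round-trips it through `pickle.dumps`/`pickle.loads` in memory (a real service would write to and read from a file whose path is left open here). `predict_with` and the toy data are hypothetical names.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, for illustration only.
toy_text = ['usb stick 32gb', 'usb 3.0 speicher', 'damen fahrrad 28 zoll',
            'faltrad schwarz', 'waschmaschine frontlader', 'waschmaschine 7kg']
toy_cat = ['USB MEMORY', 'USB MEMORY', 'BICYCLES', 'BICYCLES',
           'WASHINGMACHINES', 'WASHINGMACHINES']

trained = make_pipeline(TfidfVectorizer(sublinear_tf=True), LogisticRegression())
trained.fit(toy_text, toy_cat)

# Serialise the fitted pipeline; pickle.dump(trained, f) would write it to a file.
blob = pickle.dumps(trained)

# Later, e.g. at service start-up: restore the pipeline and reuse it as-is.
restored = pickle.loads(blob)

def predict_with(model, input_text):
    """Classify a list of raw strings with an already fitted pipeline."""
    return list(model.predict(input_text))

print(predict_with(restored, ['usb 3.0', 'waschmaschine 8kg']))
```

Pickled scikit-learn objects are generally only safe to load with the same library version that created them, and pickle files should only be loaded from trusted sources.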
