Multi-label classification techniques:

1. OneVsRest 
Problem: Each label gets its own independent binary classifier, so any underlying correlation between the labels is ignored. (Mutual exclusivity of classes is an assumption of the multi-class setting only, not of the multi-label one.)

2. Binary Relevance
Advantage: This is a simple approach (easy to implement).
Problem: Does not work well when there are dependencies between the labels.

3. Classifier Chains
Advantage: Can take into account label correlations.
Problem: The number of classifiers equals the number of labels, but training is more involved: each classifier also takes the preceding labels in the chain as input, so the result depends on the chain order.

4. Label Powerset
Advantage: This approach does take possible correlations between class labels into account.
Problem: This method trains one multi-class classifier over label combinations, with up to 2^|C| classes in the worst case, and has a high computational complexity.
As the number of labels increases, the number of distinct label combinations can grow exponentially. This easily leads to combinatorial explosion and thus computational infeasibility. Furthermore, some label combinations will have very few positive examples.

5. Adapted Algorithms (scikit-multilearn) - http://scikit.ml/
Example: MLkNN

6. Other possible avenues to explore:
- Ensemble models.
- OneVsRestClassifier or MultiOutputClassifier: which is better?
- Which classifiers can we use: SVM, Naive Bayes, Logistic Regression (note: remember to tune the hyperparameters).
- Class imbalance problem?
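
To make the difference between Binary Relevance and Label Powerset concrete, here is a minimal sketch on a made-up label matrix (the data and variable names are hypothetical, not taken from train.json). It counts the distinct label combinations that Label Powerset would treat as classes and compares that with the 2^|C| upper bound.

```python
import numpy as np

# Toy label matrix: 6 samples, 3 labels (hypothetical data).
Y = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
])

# Label Powerset: each distinct row of Y becomes one class of a multi-class problem.
combos, powerset_class = np.unique(Y, axis=0, return_inverse=True)
powerset_class = powerset_class.ravel()  # some NumPy versions return a 2-D inverse

n_labels = Y.shape[1]
print("distinct combinations seen:", len(combos))
print("upper bound 2^|C|:", 2 ** n_labels)
print("multi-class target:", powerset_class)

# Binary Relevance, by contrast, trains one binary problem per column of Y.
binary_targets = [Y[:, j] for j in range(n_labels)]
print("binary problems:", len(binary_targets))
```

With 2302 author labels, as in this notebook, the upper bound is far too large to enumerate, which is why only the combinations actually observed matter in practice.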

In [1]:
import pandas as pd
import numpy as np
import json
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

with open('Data/comp90051-22-s1-p1/train.json', 'r') as f:
    data = json.load(f)

In [2]:
df = pd.DataFrame.from_dict(data, orient='index').reset_index(drop=True)
df

Unnamed: 0,venue,keywords,year,author
0,,"[64, 1, 322, 134, 136, 396, 270, 144, 476, 481...",2017,"[1605, 759]"
1,0,"[258, 260, 389, 261, 390, 396, 400, 17, 146, 2...",2013,[2182]
2,1,"[320, 454, 266, 462, 17, 339, 404, 342, 407, 2...",2007,[2176]
3,2,"[260, 132, 333, 15, 400, 272, 146, 401, 278, 3...",2013,[1107]
4,3,"[64, 385, 449, 450, 71, 73, 268, 80, 216, 25, ...",2009,[1414]
...,...,...,...,...
26103,252,"[384, 320, 136, 457, 75, 17, 146, 465, 468, 21...",2011,"[656, 595]"
26104,50,"[318, 70, 457, 459, 396, 77, 146, 404, 468, 40...",2008,[876]
26105,6,"[320, 260, 69, 9, 265, 461, 156, 476, 166, 425...",2008,[535]
26106,138,"[450, 70, 198, 233, 394, 300, 492, 368, 246, 4...",2015,[1954]


In [3]:
dataTypeSeries = df.dtypes
print('Data type of each column of Dataframe :')
print(dataTypeSeries)

Data type of each column of Dataframe :
venue       object
keywords    object
year         int64
author      object
dtype: object


In [4]:
df['venue'] = df['venue'].replace('', -1)
dataTypeSeries = df.dtypes
print('Data type of each column of Dataframe :')
print(dataTypeSeries)

Data type of each column of Dataframe :
venue        int64
keywords    object
year         int64
author      object
dtype: object


In [5]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)

df = df.join(
            pd.DataFrame.sparse.from_spmatrix(
                mlb.fit_transform(df.pop('keywords')),
                index=df.index,
                columns=mlb.classes_))
df

Unnamed: 0,venue,year,author,0,1,2,3,4,5,6,...,490,491,492,493,494,495,496,497,498,499
0,-1,2017,"[1605, 759]",0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,2013,[2182],0,0,0,0,0,0,0,...,0,0,1,1,1,0,0,0,0,0
2,1,2007,[2176],0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,2,2013,[1107],0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,3,2009,[1414],0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26103,252,2011,"[656, 595]",0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
26104,50,2008,[876],0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
26105,6,2008,[535],0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
26106,138,2015,[1954],0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
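
For reference, the effect of MultiLabelBinarizer in the cell above can be reproduced on a tiny frame. This is only an illustrative sketch: `toy`, `toy_mlb` and the keyword lists are made up.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for the real frame: three papers with keyword ID lists.
toy = pd.DataFrame({
    "year": [2017, 2013, 2007],
    "keywords": [[1, 3], [3], [0, 1]],
})

toy_mlb = MultiLabelBinarizer(sparse_output=True)
# Sparse 0/1 indicator matrix with one column per distinct keyword ID.
indicator = toy_mlb.fit_transform(toy.pop("keywords"))

toy = toy.join(
    pd.DataFrame.sparse.from_spmatrix(
        indicator, index=toy.index, columns=toy_mlb.classes_
    )
)
print(toy_mlb.classes_)  # sorted keyword IDs that became columns
print(toy)
```

Only IDs that actually occur get a column, so the column set depends on the data the binarizer was fitted on.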


In [6]:
df = df.reindex(columns = [col for col in df.columns if col != 'author'] + ['author'])
df

Unnamed: 0,venue,year,0,1,2,3,4,5,6,7,...,491,492,493,494,495,496,497,498,499,author
0,-1,2017,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,"[1605, 759]"
1,0,2013,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,0,[2182]
2,1,2007,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,[2176]
3,2,2013,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,[1107]
4,3,2009,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,[1414]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26103,252,2011,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,"[656, 595]"
26104,50,2008,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,[876]
26105,6,2008,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,[535]
26106,138,2015,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,[1954]


In [7]:
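X_train = df.iloc[:, :-1].to_numpy(dtype='int')
y = df.iloc[:, -1:]
# Use a separate binarizer for authors so the keyword-fitted `mlb` is not overwritten.
mlb_author = MultiLabelBinarizer(sparse_output=True)
y = y.join(
            pd.DataFrame.sparse.from_spmatrix(
                mlb_author.fit_transform(y.pop('author')),
                index=df.index,
                columns=mlb_author.classes_))
y_train = y.to_numpy()
y_train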

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)

In [8]:
X_train

array([[  -1, 2017,    0, ...,    0,    0,    0],
       [   0, 2013,    0, ...,    0,    0,    0],
       [   1, 2007,    0, ...,    0,    0,    0],
       ...,
       [   6, 2008,    0, ...,    0,    0,    0],
       [ 138, 2015,    0, ...,    0,    0,    0],
       [  33, 2009,    0, ...,    0,    0,    0]])

In [9]:
#X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.33, shuffle=True)

In [10]:
print(y_train.shape)
y_train.ndim

(26108, 2302)


2

In [11]:
#Feature Scaling
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
print(X_train.shape)
X_train.ndim
X_train

(26108, 502)


array([[-1.23352645e-02,  3.99334470e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  3.98542532e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 1.23352645e-02,  3.97354626e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 7.40115869e-02,  3.97552611e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 1.70226650e+00,  3.98938501e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 4.07063728e-01,  3.97750595e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])
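
The large values in the second column (around 399) follow from `with_mean=False`: the scaler divides each column by its standard deviation without centering it first. A small sketch on made-up numbers (`X_toy` is hypothetical) shows the equivalence:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mini feature matrix: [venue ID, year, one keyword indicator].
X_toy = np.array([
    [-1, 2017, 0],
    [0, 2013, 1],
    [1, 2007, 0],
    [6, 2008, 0],
], dtype=float)

toy_scaler = StandardScaler(with_mean=False)
X_scaled = toy_scaler.fit_transform(X_toy)

# Without centering, each column is only divided by its population standard deviation.
manual = X_toy / X_toy.std(axis=0)
print(np.allclose(X_scaled, manual))
print(X_scaled[:, 1])  # the year column stays far from zero
```

Skipping the centering step keeps zero entries at zero, which preserves sparsity in the keyword indicator columns. Whether scaling the venue ID (a categorical code) is meaningful at all is a separate question.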

In [12]:
#Creating X_test on test data
with open('Data/comp90051-22-s1-p1/test.json', 'r') as f:
    test_data = json.load(f)

test_df = pd.DataFrame.from_dict(test_data, orient='index').reset_index(drop=True)
test_df['venue'] = test_df['venue'].replace('', -1)
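# Note: a binarizer fitted on the test set only lines up with X_train's columns
# if the test set contains exactly the same keyword IDs as the training set.
mlb_test = MultiLabelBinarizer(sparse_output=True)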
test_df = test_df.join(
            pd.DataFrame.sparse.from_spmatrix(
                mlb_test.fit_transform(test_df.pop('keywords')),
                index=test_df.index,
                columns=mlb_test.classes_))

#remove coauthor (to match X_train)
coauthor = test_df.pop('coauthor')
#remove target and store separately
target = test_df.pop('target')
target_list = target.tolist()

X_test = test_df.to_numpy(dtype='int')

#Feature Scaling on test data
X_test = scaler.transform(X_test)
print(X_test.shape)
X_test.ndim
X_test

(2000, 502)


array([[-1.23352645e-02,  3.99334470e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 1.15951486e+00,  3.99730439e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 3.82393199e-01,  3.98740517e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 7.15445340e-01,  3.96760673e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [ 7.40115869e-02,  3.99136486e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       [-1.23352645e-02,  3.98146564e+02,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])
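
The shapes match here (502 columns in both sets), but that relies on the test keywords covering the same IDs as the training keywords. A hedged sketch of the safer pattern, on made-up keyword lists, is to fit the binarizer once on the training data and call `transform` on the test data:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical keyword lists; the test split happens to lack keyword 2.
train_keywords = [[0, 1], [2], [1, 2]]
test_keywords = [[1], [0, 1]]

kw_mlb = MultiLabelBinarizer(sparse_output=True)
train_ind = kw_mlb.fit_transform(train_keywords)  # learns columns 0, 1, 2

# transform() keeps the training columns, so the widths agree.
test_aligned = kw_mlb.transform(test_keywords)

# Fitting a second binarizer on the test split alone yields only two columns.
test_refit = MultiLabelBinarizer(sparse_output=True).fit_transform(test_keywords)

print(train_ind.shape, test_aligned.shape, test_refit.shape)
```

In this notebook that would mean keeping the keyword-fitted binarizer around instead of reusing the same object for the author labels.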

In [18]:
#Linear SVC, Classifier Chains Approach
from skmultilearn.problem_transform import ClassifierChain
#from sklearn.multioutput import ClassifierChain
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC
import time
from sklearn.model_selection import StratifiedKFold

from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

#classifier = ClassifierChain(LogisticRegression(multi_class='multinomial'), order='random', random_state=0)
#classifier = ClassifierChain(SVC(kernel="linear", probability=True), order='random', random_state=0)

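start = time.time()
# Linear SVM; class_weight='balanced' reweights the rare positive class of each label.
linearsvc = LinearSVC(dual=False, class_weight='balanced')
skf = StratifiedKFold(n_splits=5, shuffle=True)
# LinearSVC has no predict_proba, so wrap it in CalibratedClassifierCV to get probabilities.
classifier = ClassifierChain(
    classifier=CalibratedClassifierCV(linearsvc, cv=skf),
    require_dense=[False, True]
)
classifier.fit(X_train, y_train)
print('training time taken: ', round(time.time() - start, 0), 'seconds')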



training time taken:  22172.0 seconds
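
Since the full run takes over six hours, it can help to check the chain mechanics on a small problem first. The sketch below uses scikit-learn's own `sklearn.multioutput.ClassifierChain` (not the skmultilearn class used above) with LogisticRegression on synthetic data; the data and names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic data (hypothetical): three labels driven by overlapping features,
# so the labels are correlated with each other.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
Y_toy = np.column_stack([
    X_toy[:, 0] > 0,
    X_toy[:, 0] + X_toy[:, 1] > 0,
    X_toy[:, 1] - X_toy[:, 2] > 0,
]).astype(int)

# Each link in the chain sees the features plus the labels earlier in the order.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain.fit(X_toy, Y_toy)

proba = chain.predict_proba(X_toy[:5])
print(proba.shape)             # one probability per (sample, label)
print(len(chain.estimators_))  # one fitted classifier per label
```

The same pattern scales linearly in the number of labels, which is consistent with the long training time for 2302 authors.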


In [19]:
predictions = classifier.predict_proba(X_test)
print(predictions[0])

  (0, 0)	0.0007702923935728773
  (0, 1)	0.0005830463275433445
  (0, 2)	0.00010760831511253516
  (0, 3)	0.00016053538439036904
  (0, 4)	0.0008751635932282458
  (0, 5)	0.00016861254983949916
  (0, 6)	6.953180062819344e-05
  (0, 7)	6.755221083460853e-05
  (0, 8)	0.0004375332772179409
  (0, 9)	0.0001593513543917558
  (0, 10)	0.00015287285891532032
  (0, 11)	0.00013283031750474926
  (0, 12)	3.221417672173833e-05
  (0, 13)	0.00042244583101511265
  (0, 14)	5.534279312099652e-05
  (0, 15)	5.721565046386987e-05
  (0, 16)	0.00028375142311684283
  (0, 17)	0.0014518486013591289
  (0, 18)	0.0011259054392372858
  (0, 19)	0.00017999654417341998
  (0, 20)	0.00014360120952112632
  (0, 21)	9.076084617188211e-05
  (0, 22)	0.000550601624125089
  (0, 23)	0.00023052282076617748
  (0, 24)	0.0007548744857464811
  :	:
  (0, 2277)	0.0002093231555897153
  (0, 2278)	0.016415777489208274
  (0, 2279)	0.0002146790887035197
  (0, 2280)	0.0012805402297649438
  (0, 2281)	0.0022178539124347715
  (0, 2282)	0.000104653384

In [20]:
predictions_final = predictions.tocsr()
predictions_final[0,0]

0.0007702923935728773

In [21]:
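import csv

prediction_file = "submission_cc_balanced.csv"

# Probability of the target author for each test row.
# Note: this indexes columns by author ID, which assumes that author IDs
# coincide with the column positions of the binarized label matrix.
prob_list = []
for i in range(X_test.shape[0]):
    author_index = target_list[i]
    prob_list.append([i, predictions_final[i, author_index]])

# newline='' stops the csv module from writing blank rows on Windows.
with open(prediction_file, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Predicted"])
    writer.writerows(prob_list)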
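
The lookup above uses the author ID directly as a column index. If some author ID never appeared in the training labels, the columns would shift, so an explicit ID-to-column mapping built from the binarizer's `classes_` is safer. A minimal sketch on made-up data (all names here are hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical author lists with a gap: ID 2 never appears.
authors = [[0, 3], [1], [3]]
author_mlb = MultiLabelBinarizer()
author_mlb.fit(authors)

# Map each author ID to its column position in the binarized label matrix.
col_of = {int(author_id): j for j, author_id in enumerate(author_mlb.classes_)}

# Stand-in for the predict_proba output: 2 test rows x 3 label columns.
proba = csr_matrix(np.array([[0.1, 0.2, 0.7],
                             [0.6, 0.3, 0.1]]))
targets = [3, 1]  # author IDs to score, one per test row

rows = [[i, proba[i, col_of[t]]] for i, t in enumerate(targets)]
print(rows)
```

Here author 3 lives in column 2, so indexing by the raw ID would have gone out of bounds.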