# 1. References

Title: Multi-Label Classification(Blog Tags Prediction)using NLP

Link: https://medium.com/coinmonks/multi-label-classification-blog-tags-prediction-using-nlp-b0b5ee6686fc

Naive approach is to do x -> y1, x -> y1, y2, x -> y1, y2, y3

# 2. Imports

In [1]:
import pandas as pd

# 3. Creation of a valid dataframe - one hot encoding style

In [2]:
DATA_DIR = "../../data/raw/"
INPUT_FILE_NAME = 'subset_raw.parquet'

In [3]:
df = pd.read_parquet(DATA_DIR + INPUT_FILE_NAME)
df.head()

Unnamed: 0_level_0,speaker,headline,description,duration,tags,transcript,WC
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,0:16:17,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ...",2281.0
2,Amy Smith,Simple designs to save a life,Fumes from indoor cooking fires kill more than...,0:15:06,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ...",2687.0
3,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,0:18:45,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r...",2506.0
4,Burt Rutan,The real future of space exploration,"In this passionate talk, legendary spacecraft ...",0:19:37,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst...",3092.0
5,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,0:20:04,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac...",3781.0


In [4]:
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2475 entries, 1 to 2804
Data columns (total 7 columns):
speaker        2475 non-null object
headline       2475 non-null object
description    2475 non-null object
duration       2475 non-null object
tags           2475 non-null object
transcript     2386 non-null object
WC             2386 non-null float64
dtypes: float64(1), object(6)
memory usage: 154.7+ KB


## 3.1 Remove nan transcripts

In [5]:
df = df.dropna(subset=['transcript'])
df = df.reset_index(drop=True)
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2386 entries, 0 to 2385
Data columns (total 7 columns):
speaker        2386 non-null object
headline       2386 non-null object
description    2386 non-null object
duration       2386 non-null object
tags           2386 non-null object
transcript     2386 non-null object
WC             2386 non-null float64
dtypes: float64(1), object(6)
memory usage: 130.6+ KB


## 3.2 Finding the unique tags

In [6]:
joined_tags = df['tags'].str.cat(sep=',').split(',')
all_tags = pd.Series(joined_tags).str.strip().str.lower()
all_tags = list(dict.fromkeys(all_tags))
all_tags.remove('')
print(all_tags)
print(len(all_tags))

['cars', 'alternative energy', 'culture', 'politics', 'science', 'climate change', 'environment', 'sustainability', 'global issues', 'technology', 'macarthur grant', 'simplicity', 'industrial design', 'invention', 'engineering', 'design', 'corruption', 'poverty', 'economics', 'investment', 'military', 'policy', 'global development', 'entrepreneur', 'business', 'aircraft', 'flight', 'nasa', 'rocket science', 'transportation', 'art', 'biotech', 'oceans', 'genetics', 'dna', 'biology', 'biodiversity', 'ecology', 'computers', 'software', 'interface design', 'music', 'media', 'entertainment', 'performance', 'new york', 'memory', 'interview', 'death', 'architecture', 'disaster relief', 'cities', 'urban planning', 'collaboration', 'robots', 'education', 'innovation', 'social change', 'obesity', 'disease', 'health', 'health care', 'food', 'primates', 'africa', 'animals', 'nature', 'wunderkind', 'cancer', 'creativity', 'love', 'gender', 'relationships', 'cognitive science', 'psychology', 'evolut

## 3.3 Creating a new dataframe

In [7]:
def create_one_hot_encode(df=df):
    complete_transcripts_tags = []
    for rows, value in df.iterrows():
        one_hot_encoding = [0] * 417
        transcript = [value['transcript']]
        indiv_tags = value['tags'].split(',')
        for tags in indiv_tags:
            if tags == '':
                continue
            index = all_tags.index(tags.lower().lstrip(' '))
            one_hot_encoding[index] = 1
        indiv_transcript_tags = transcript + one_hot_encoding
        complete_transcripts_tags.append(indiv_transcript_tags)
    return pd.DataFrame(complete_transcripts_tags, columns=['transcript'] + all_tags)

In [8]:
ted_tags = create_one_hot_encode()
ted_tags

Unnamed: 0,transcript,cars,alternative energy,culture,politics,science,climate change,environment,sustainability,global issues,...,anthropocene,syria,movies,ted residency,ted-ed,telescopes,ted en espanol,alzheimer's,ted en español,epidemiology
0,"0:14\r\r\rThank you so much, Chris.\rAnd it's ...",1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,"0:11\r\r\rIn terms of invention,\rI'd like to ...",0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,"0:12\r\r\rA public, Dewey long ago observed,\r...",0,0,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,"0:11\r\r\rI want to start off by saying, Houst...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"0:12\r\r\rWhat I want to talk about is, as bac...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2381,0:11\r\r\rImagine that when you walked\rin her...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2382,0:11\r\r\rPaying close attention to something:...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2383,"0:11\r\r\rSo, this happy pic of me\rwas taken ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2384,0:12\r\r\rMy seven-year-old grandson\rsleeps j...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 4. Cleaning the data

In [9]:
'''
1. Numbers
2. Apostrophe
3. All punctuations
4. Weird symbols
5. Stop words
6. lemmatization
'''

import string
from nltk.corpus import stopwords
from nltk.tokenize import ToktokTokenizer
from sklearn.feature_extraction import stop_words
from nltk.stem.wordnet import WordNetLemmatizer
sets=[stop_words.ENGLISH_STOP_WORDS]
sklearnStopWords = [list(x) for x in sets][0]
token=ToktokTokenizer()
lemma=WordNetLemmatizer()
stopWordList=stopwords.words('english')
stopWords = stopWordList + sklearnStopWords
stopWords = list(dict.fromkeys(stopWords))


def stopWordsRemove(text):
    wordList=[x.lower().strip() for x in token.tokenize(text)]
    removedList=[x + ' ' for x in wordList if not x in stopWords]
    text=''.join(removedList)
    return text


def lemitizeWords(text):
    words=token.tokenize(text)
    listLemma=[]
    for w in words:
        x=lemma.lemmatize(w,'v')
        listLemma.append(x)
    return text


# There is a mispelt word that needs to be replaced
ted_tags['transcript'] = ted_tags['transcript'].str.replace('\r',' ')
ted_tags['transcript'] = ted_tags['transcript'].str.replace("\'s"," is")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("\'m"," am")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("\'ll"," will")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("Can\'t","cannot")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("Sha\'t","shall not")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("Won\'t","would not")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("n\'t"," not")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("\'ve"," have")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("\'re"," are")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("\'d"," would")
ted_tags['transcript'] = ted_tags['transcript'].str.replace(r"\(([^)]+)\)","")
# Deal with Mr. and Dr.
ted_tags['transcript'] = ted_tags['transcript'].str.replace("mr. ","mr")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("Mr. ","mr")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("dr. ","dr")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("mrs. ","mrs")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("Mrs. ","mrs")
ted_tags['transcript'] = ted_tags['transcript'].str.replace("Dr. ","dr")

ted_tags['transcript'] = ted_tags['transcript'].str.replace(r'\d+','')
ted_tags['transcript'] = ted_tags['transcript'].str.replace(r'<.*?>','')
for i in string.punctuation:
    if i == "'":
        ted_tags['transcript'] = ted_tags['transcript'].str.replace(i,'')
    else:
        ted_tags['transcript'] = ted_tags['transcript'].str.replace(i,' ')
ted_tags['transcript'] = ted_tags['transcript'].map(lambda com : stopWordsRemove(com))
ted_tags['transcript'] = ted_tags['transcript'].map(lambda com : lemitizeWords(com))
ted_tags['transcript'] = ted_tags['transcript'].str.replace('\s+',' ')


In [10]:
ted_tags

Unnamed: 0,transcript,cars,alternative energy,culture,politics,science,climate change,environment,sustainability,global issues,...,anthropocene,syria,movies,ted residency,ted-ed,telescopes,ted en espanol,alzheimer's,ted en español,epidemiology
0,thank chris truly great honor opportunity come...,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,terms invention like tell tale favorite projec...,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,public dewey long ago observed constituted dis...,0,0,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,want start saying houston problem entering sec...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,want talk background idea cars art actually qu...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2381,imagine walked evening discovered everybody ro...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2382,paying close attention easy attention pulled d...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2383,happy pic taken senior college right dance pra...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2384,seven year old grandson sleeps hall wakes lot ...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# !pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

totalText=''
for x in ted_tags['transcript']:
    totalText=totalText+''+x
wc=WordCloud(background_color='black',max_font_size=50).generate(totalText)
plt.figure(figsize=(16,12))
plt.imshow(wc, interpolation='bilinear')

# 5. Machine Learning part

In [11]:
x=ted_tags.iloc[:,0].values
y=ted_tags.iloc[:,1:-1].values


In [12]:
from sklearn.feature_extraction.text import CountVectorizer
body = ted_tags.transcript
cv = CountVectorizer().fit(body)
article = pd.DataFrame(cv.transform(body).todense(),columns=cv.get_feature_names())

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfart=TfidfTransformer().fit(article)
art=pd.DataFrame(tfidfart.transform(article).todense())

In [14]:
# !pip install scikit-multilearn

# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
xtrain,xtest,ytrain,ytest=train_test_split(art,y)
classifier = BinaryRelevance(GaussianNB())

# train
classifier.fit(xtrain.astype(float), ytrain.astype(float))

# predictions = classifier.predict(xtest.astype(float))
# predictions = classifier.predict_proba(xtest.astype(float))
# predictions.toarray()


BinaryRelevance(classifier=GaussianNB(priors=None, var_smoothing=1e-09),
                require_dense=[True, True])

In [15]:
predictions = classifier.predict(xtest.astype(float))

In [16]:
# from sklearn.metrics import accuracy_score
# accuracy_score(ytest.astype(float),predictions)

In [17]:
print(ytest)
print(ytest.sum(axis=1))

[[0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[ 7 14 13  8  4 10  5  7  9 15  9  5 15 15  4  2  3  5  9  7  6 10  6  4
  5  9 18  7  5  5  5  6 10  5 16  3  4 17  4 18  7  5  5  7  6  4  6  4
  5  5  8  5  5  3  4  7  6  4 15 22 20  4 14  7  5  3  5  3  6 18  6  9
  4  6  5  4  7 13  6  5  7  4  6  7  4  9  9  3  4 12  2  7  7 13 20  6
  3  4  7  6  8  7  3  7 17  8  4 25  3  6  5  7 16 12  3  9  6  6 13  6
  8  7  9  7  4 16  9  2  4  7  7  5 10  7 12  3  6  3 18  3  3  8  8  6
  5  3  6 10  7  5  5 16 13  4 11  5  6 11  9  6 10  2  4  9  5  4  6  6
  3 16  7  6 10  8  9 10  6  6  5  3 10  5  6 11  5  8 20  8  6  6  5 13
  3  6  8  6  8  3 10  3  4  5  5  7 10  4  6  8  7  4  4 15  5  4 21  6
  9  3  8  9  6  3 11  6  6  8  8 10  3  8  6  5  4  9  6  6 11  4  3  7
  7 14  4 15  4 10  6  2  3  9  5 10 17  4  6  7  3  6  6 10 13 14  7  5
  4  4  5  7 10  4 20  9 10 14  2  9  7  7 16 19  5  3 14  4 17  5 10  6
  5 

In [18]:
def compute_accuracy(test, predict):
    true_positive_ls = []
    true_negative_ls = []
    false_positive_ls = []
    false_negative_ls = []
    labelled_tags = 0
    # overall_correct_one = 0
    for index_pred, value_pred in enumerate(predict):
        true_positive = 0
        true_negative = 0
        false_positive = 0
        false_negative = 0
        for index_pred_indiv, value_pred_indiv in enumerate(value_pred):
            if test[index_pred][index_pred_indiv] == 1:
                labelled_tags += 1
            if value_pred_indiv == test[index_pred][index_pred_indiv]:
                if value_pred_indiv == 1:
                    # overall_correct_one += 1
                    true_positive += 1
                else:
                    true_negative += 1
            else:
                if value_pred_indiv == 1:
                    # Test is 0 but we predict 1
                    false_positive += 1
                else:
                    false_negative += 1
        true_positive_ls.append(true_positive)
        true_negative_ls.append(true_negative)
        false_positive_ls.append(false_positive)
        false_negative_ls.append(false_negative)
    return true_positive_ls, true_negative_ls, false_positive_ls, false_negative_ls
#     print(correct_one_ls)
#     print(wrong_ls)
#     print(labelled_tags)
#     print(overall_correct_one/labelled_tags)

true_positive_ls, true_negative_ls, false_positive_ls, false_negative_ls = compute_accuracy(ytest, predictions.toarray())

true_pos = sum(true_positive_ls)
true_neg = sum(true_negative_ls)
false_pos = sum(false_positive_ls)
false_neg = sum(false_negative_ls)
# print(true_pos)
# print(true_neg)
# print(false_pos)
# print(false_neg)
precision = true_pos/(true_pos + false_pos)
recall = true_pos/(true_pos + false_neg)
accuracy = (true_pos + true_neg) / (false_pos + false_neg + true_pos + true_neg)
weighted_harmonic_mean = (2 * precision * recall) / (precision + recall)
print('The precision is {}'.format(precision))
print('The recall is {}'.format(recall))
print('The accuracry (naive) is {}'.format(accuracy))
print('The weighted harmonic mean/F1 score is {}'.format(weighted_harmonic_mean))

The precision is 0.09982014388489209
The recall is 0.024400967245548473
The accuracry (naive) is 0.978099632779281
The weighted harmonic mean/F1 score is 0.0392156862745098


In [19]:
print(predictions.toarray().sum(axis=1))
print(type(predictions))

[  0.   0.   0.   0.   1.   1.   0.   0.   3.   0.   0.   1.   0.   1.
   1.   0.   0.   0.   1.   0.   0.   0.   0.   1.   0.   0.   0.   0.
   0.   0.   0.   3.   1.   0.   0.   0.   0.   0.   0.   1.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   0.
   0.   0.   0.   0.   0.   0.   2.   0.   1.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   0.   0.   0.
   1.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
   0.   1.   1.   0.   0.   0.   0.   1.   0.   0.   1.   0.   2.   0.
   0.   0.   0.   0.   0.   1.   0.   0.   0.   1.   0.   1.   1.   0.
   6.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   8.
   1.   0.   0.   0.   0.   1.   0. 395.   0.   6.   1.   0.   0.   1.
   1.   0.   1.   0.   0.   0.   0.   0.   3.   0.   1.   0.   0.   0.
   0.   0.   1.   0.   0.   0.   0.   0.   0.   1.   0.   0.   1.   0.
   0. 

# 6. Save the model

In [20]:
from joblib import dump, load
#dump(classifier, 'classifier_binary_relevance.joblib') 