# 1. References

Title: Multi-Label Classification(Blog Tags Prediction)using NLP

Link: https://medium.com/coinmonks/multi-label-classification-blog-tags-prediction-using-nlp-b0b5ee6686fc

Naive approach is to do x -> y1, x -> y1, y2, x -> y1, y2, y3

# 2. Imports

In [15]:
import pandas as pd
import string

# 3. Creation of a valid dataframe - one hot encoding style

In [3]:
DATA_DIR = "../../data/raw/"
INPUT_FILE_NAME = 'subset_raw.parquet'

In [4]:
df = pd.read_parquet(DATA_DIR + INPUT_FILE_NAME)
df.head()

Unnamed: 0_level_0,speaker,headline,description,duration,tags,transcript,WC
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Al Gore,Averting the climate crisis,With the same humor and humanity he exuded in ...,0:16:17,"cars,alternative energy,culture,politics,scien...","0:14\r\r\rThank you so much, Chris.\rAnd it's ...",2281.0
2,Amy Smith,Simple designs to save a life,Fumes from indoor cooking fires kill more than...,0:15:06,"MacArthur grant,simplicity,industrial design,a...","0:11\r\r\rIn terms of invention,\rI'd like to ...",2687.0
3,Ashraf Ghani,How to rebuild a broken state,Ashraf Ghani's passionate and powerful 10-minu...,0:18:45,"corruption,poverty,economics,investment,milita...","0:12\r\r\rA public, Dewey long ago observed,\r...",2506.0
4,Burt Rutan,The real future of space exploration,"In this passionate talk, legendary spacecraft ...",0:19:37,"aircraft,flight,industrial design,NASA,rocket ...","0:11\r\r\rI want to start off by saying, Houst...",3092.0
5,Chris Bangle,Great cars are great art,American designer Chris Bangle explains his ph...,0:20:04,"cars,industrial design,transportation,inventio...","0:12\r\r\rWhat I want to talk about is, as bac...",3781.0


In [5]:
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2475 entries, 1 to 2804
Data columns (total 7 columns):
speaker        2475 non-null object
headline       2475 non-null object
description    2475 non-null object
duration       2475 non-null object
tags           2475 non-null object
transcript     2386 non-null object
WC             2386 non-null float64
dtypes: float64(1), object(6)
memory usage: 154.7+ KB


## 3.1 Remove nan transcripts

In [6]:
df = df.dropna(subset=['transcript'])
df = df.reset_index(drop=True)
df.iloc[:,:15].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2386 entries, 0 to 2385
Data columns (total 7 columns):
speaker        2386 non-null object
headline       2386 non-null object
description    2386 non-null object
duration       2386 non-null object
tags           2386 non-null object
transcript     2386 non-null object
WC             2386 non-null float64
dtypes: float64(1), object(6)
memory usage: 130.6+ KB


## 3.2 Finding the unique tags

In [7]:
joined_tags = df['tags'].str.cat(sep=',').split(',')
all_tags = pd.Series(joined_tags).str.strip().str.lower()
all_tags = list(dict.fromkeys(all_tags))
all_tags.remove('')
print(all_tags)
print(len(all_tags))

['cars', 'alternative energy', 'culture', 'politics', 'science', 'climate change', 'environment', 'sustainability', 'global issues', 'technology', 'macarthur grant', 'simplicity', 'industrial design', 'invention', 'engineering', 'design', 'corruption', 'poverty', 'economics', 'investment', 'military', 'policy', 'global development', 'entrepreneur', 'business', 'aircraft', 'flight', 'nasa', 'rocket science', 'transportation', 'art', 'biotech', 'oceans', 'genetics', 'dna', 'biology', 'biodiversity', 'ecology', 'computers', 'software', 'interface design', 'music', 'media', 'entertainment', 'performance', 'new york', 'memory', 'interview', 'death', 'architecture', 'disaster relief', 'cities', 'urban planning', 'collaboration', 'robots', 'education', 'innovation', 'social change', 'obesity', 'disease', 'health', 'health care', 'food', 'primates', 'africa', 'animals', 'nature', 'wunderkind', 'cancer', 'creativity', 'love', 'gender', 'relationships', 'cognitive science', 'psychology', 'evolut

## 3.3 Creating a new dataframe

In [8]:
def create_one_hot_encode(df=df):
    complete_transcripts_tags = []
    for rows, value in df.iterrows():
        one_hot_encoding = [0] * 417
        transcript = [value['transcript']]
        indiv_tags = value['tags'].split(',')
        for tags in indiv_tags:
            if tags == '':
                continue
            index = all_tags.index(tags.lower().lstrip(' '))
            one_hot_encoding[index] = 1
        indiv_transcript_tags = transcript + one_hot_encoding
        complete_transcripts_tags.append(indiv_transcript_tags)
    return pd.DataFrame(complete_transcripts_tags, columns=['transcript'] + all_tags)

In [9]:
ted_tags = create_one_hot_encode()
ted_tags

Unnamed: 0,transcript,cars,alternative energy,culture,politics,science,climate change,environment,sustainability,global issues,...,anthropocene,syria,movies,ted residency,ted-ed,telescopes,ted en espanol,alzheimer's,ted en español,epidemiology
0,"0:14\r\r\rThank you so much, Chris.\rAnd it's ...",1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,"0:11\r\r\rIn terms of invention,\rI'd like to ...",0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,"0:12\r\r\rA public, Dewey long ago observed,\r...",0,0,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,"0:11\r\r\rI want to start off by saying, Houst...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"0:12\r\r\rWhat I want to talk about is, as bac...",1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2381,0:11\r\r\rImagine that when you walked\rin her...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2382,0:11\r\r\rPaying close attention to something:...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2383,"0:11\r\r\rSo, this happy pic of me\rwas taken ...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2384,0:12\r\r\rMy seven-year-old grandson\rsleeps j...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 4. Cleaning the data

In [10]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import ToktokTokenizer
from nltk.corpus import stopwords
stopWordList=stopwords.words('english')
stopWordList.remove('no')
stopWordList.remove('not')
lemma=WordNetLemmatizer()
token=ToktokTokenizer()

In [11]:
import unicodedata

def removeAscendingChar(data):
    data=unicodedata.normalize('NFKD', data).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return data

def removeCharDigit(text):
    str="`1234567890-=~@#$%^&*()_+[!{;”:\’><.,/?”}]"
    for w in text:
        if w in str:
            text=text.replace(w,'')
    return text

def lemitizeWords(text):
    words=token.tokenize(text)
    listLemma=[]
    for w in words:
        x=lemma.lemmatize(w,'v')
        listLemma.append(x)
    return text


def stopWordsRemove(text):
    wordList=[x.lower().strip() for x in token.tokenize(text)]
    removedList=[x for x in wordList if not x in stopWordList]
    text=''.join(removedList)
    return text

In [12]:
import re 

def PreProcessing(text):
    text=removeCharDigit(text)
    text=removeAscendingChar(text)
    text=lemitizeWords(text)
    text=stopWordsRemove(text)
    return(text)


banned = ["\\\"", "\""]
def clean_text(text):
    text = text.lower()
    text = re.sub(r"what’s", "what is", text)
    text = re.sub(r"’s", "", text)
    text = re.sub(r"'s", "", text)
    text = re.sub(r"’ve", ' have', text)
    text = re.sub(r"'ve", ' have', text)
    text = re.sub(r"can’t", "cannot", text)
    text = re.sub(r"can't", "not", text)
    text = re.sub(r"won’t", "would not", text)
    text = re.sub(r"won't", "would not", text)
    text = re.sub(r"i’m", "i am", text)
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"i\'m", "i am", text)
    text = re.sub(r"’re", " are", text)
    text = re.sub(r"'re", " are", text)
    text = re.sub(r"’d", " would", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"’ll", " will", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"\’scuse", "excuse", text)
    text = re.sub(r"\'scuse", "excuse", text)
#     text = re.sub('\W', '', text)
    #text = re.sub('\s+', ' ', text)
    # Stuff that is in this document
    for i in banned:
        text = text.replace(i, " " )
    text = text.replace("'", "" )
    text = re.sub('\s+', ' ', text)
    return text

In [13]:
from copy import deepcopy
ted_tags_copy = deepcopy(ted_tags)
ted_tags_copy['transcript'] = ted_tags_copy['transcript'].map(lambda com : clean_text(com))
ted_tags_copy['transcript'] = ted_tags_copy['transcript'].map(lambda com : PreProcessing(com))
ted_tags_copy

Unnamed: 0,transcript,cars,alternative energy,culture,politics,science,climate change,environment,sustainability,global issues,...,anthropocene,syria,movies,ted residency,ted-ed,telescopes,ted en espanol,alzheimer's,ted en español,epidemiology
0,thankmuchchristrulygreathonoropportunitycomest...,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
1,termsinventionwouldliketelltaleonefavoriteproj...,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,publicdeweylongagoobservedconstituteddiscussio...,0,0,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3,wantstartsayinghoustonproblementeringsecondgen...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,wanttalkbackgroundideacarsartactuallyquitemean...,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2381,imaginewalkedeveningdiscoveredeverybodyroomloo...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2382,payingcloseattentionsomethingnoteasyattentionp...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2383,happypictakenseniorcollegerightdancepracticere...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2384,sevenyearoldgrandsonsleepshallwakeslotmornings...,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
ted_tags_glenn = deepcopy(ted_tags)

ted_tags_glenn['transcript'] = ted_tags_glenn['transcript'].str.replace(r'\d+','')
ted_tags_glenn['transcript'] = ted_tags_glenn['transcript'].str.replace(r'<.*?>','')
for i in string.punctuation:
    if i == "'":
        ted_tags_glenn['transcript'] = ted_tags_glenn['transcript'].str.replace(i,'')
    else:
        ted_tags_glenn['transcript'] = ted_tags_glenn['transcript'].str.replace(i,' ')
ted_tags_glenn['transcript'] = ted_tags_glenn['transcript'].str.replace('\s+',' ')
ted_tags_glenn['transcript'] = ted_tags_glenn['transcript'].map(lambda com : PreProcessing(com))

In [17]:
ted_tags['transcript'][0]

'0:14\r\r\rThank you so much, Chris.\rAnd it\'s truly a great honor\rto have the opportunity\rto come to this stage twice;\rI\'m extremely grateful.\rI have been blown away by this conference,\rand I want to thank all of you\rfor the many nice comments\rabout what I had to say the other night.\rAnd I say that sincerely,\rpartly because (Mock sob)\rI need that.\r\r\r\r\r 0:40\r\r\r(Laughter)\r\r\r\r\r 0:45\r\r\rPut yourselves in my position.\r\r\r\r\r 0:47\r\r\r(Laughter)\r\r\r\r\r 0:54\r\r\rI flew on Air Force Two for eight years.\r\r\r\r\r 0:57\r\r\r(Laughter)\r\r\r\r\r 0:59\r\r\rNow I have to take off my shoes\ror boots to get on an airplane!\r\r\r\r\r 1:02\r\r\r(Laughter)\r\r\r\r\r 1:05\r\r\r(Applause)\r\r\r\r\r 1:11\r\r\rI\'ll tell you one quick story\rto illustrate what\rthat\'s been like for me.\r\r\r\r\r 1:16\r\r\r(Laughter)\r\r\r\r\r 1:18\r\r\rIt\'s a true story \revery bit of this is true.\r\r\r\r\r 1:21\r\r\rSoon after Tipper and I left the \r(Mock sob) White House \r\r\r\r\r

In [18]:
ted_tags_copy['transcript'][0]

'thankmuchchristrulygreathonoropportunitycomestagetwiceextremelygratefulblownawayconferencewantthankmanynicecommentssaynightsaysincerelypartlymocksobneedlaughterputpositionlaughterflewairforcetwoeightyearslaughtertakeshoesbootsgetairplanelaughterapplausetellonequickstoryillustratelikelaughtertruestoryeverybittruesoontipperleftmocksobwhitehouselaughterdrivinghomenashvillelittlefarmmileseastnashvilledrivinglaughterknowsoundslikelittlethinglaughterlookedrearviewmirrorsuddenhitnomotorcadebacklaughterheardphantomlimbpainlaughterrentedfordtauruslaughterdinnertimestartedlookingplaceeatgotexitlebanontennesseegotexitfoundshoneyrestaurantlowcostfamilyrestaurantchaindontknowwentsatboothwaitresscamemadebigcommotiontipperlaughtertookorderwentcoupleboothnextusloweredvoicemuchreallystrainhearsayingsaidyesformervicepresidentalgorewifetippermansaidcomelongwayhasntlaughterapplausekindseriesepiphanieslaughternextdaycontinuingtotallytruestorygotgvflyafricamakespeechnigeriacitylagostopicenergybeganspeechte

In [19]:
ted_tags_glenn['transcript'][0]

'thankmuchchristrulygreathonoropportunitycomestagetwiceimextremelygratefulblownawayconferencewantthankmanynicecommentssaynightsaysincerelypartlymocksobneedlaughterputpositionlaughterflewairforcetwoeightyearslaughtertakeshoesbootsgetairplanelaughterapplauseilltellonequickstoryillustratethatslikelaughtertruestoryeverybittruesoontipperleftmocksobwhitehouselaughterdrivinghomenashvillelittlefarmmileseastnashvilledrivinglaughterknowsoundslikelittlethinglaughterlookedrearviewmirrorsuddenhitnomotorcadebacklaughteryouveheardphantomlimbpainlaughterrentedfordtauruslaughterdinnertimestartedlookingplaceeatgotexitlebanontennesseegotexitfoundshoneysrestaurantlowcostfamilyrestaurantchaindontknowwentsatboothwaitresscamemadebigcommotiontipperlaughtertookorderwentcoupleboothnextusloweredvoicemuchreallystrainhearsayingsaidyesthatsformervicepresidentalgorewifetippermansaidhescomelongwayhasntlaughterapplausethereskindseriesepiphanieslaughternextdaycontinuingtotallytruestorygotgvflyafricamakespeechnigeriacit

In [21]:
# !pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

totalText=''
for x in ted_tags_copy['transcript']:
    ps=PreProcessing(x)
    totalText=totalText+''+ps
wc=WordCloud(background_color='black',max_font_size=50, max_words=100).generate(totalText)
plt.figure(figsize=(16,12))
plt.imshow(wc, interpolation='bilinear')

KeyboardInterrupt: 

# 5. Machine Learning part

In [None]:
x=ted_tags_copy.iloc[:,0].values
y=ted_tags_copy.iloc[:,1:-1].values


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
body = ted_tags_copy.transcript
cv = CountVectorizer().fit(body)
article = pd.DataFrame(cv.transform(body).todense(),columns=cv.get_feature_names())

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfart=TfidfTransformer().fit(article)
art=pd.DataFrame(tfidfart.transform(article).todense())

In [None]:
#!pip install scikit-multilearn

# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
xtrain,xtest,ytrain,ytest=train_test_split(art,y)
classifier = BinaryRelevance(GaussianNB())

# train
classifier.fit(xtrain.astype(float), ytrain.astype(float))

predictions = classifier.predict(xtest.astype(float))
predictions.toarray()
from sklearn.metrics import accuracy_score
accuracy_score(ytest.astype(float),predictions)

In [None]:
print(ytest)
print(ytest.sum(axis=1))

In [None]:
predictions = classifier.predict(xtest.astype(float))

In [None]:
print(predictions.toarray().sum(axis=1))
print(type(predictions))

In [None]:
print(ted_tags_copy.transcript)