Project Description: 
*********************
Classification is probably the most common machine learning task you will deal with in practice.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about a writer from the text alone, without knowing anything else about him/her, is a challenge.
We are going to create a classifier that predicts multiple attributes of the author of a given text.
We have framed it as a multi-label classification problem.


DataSet 
*******

Blog Authorship Corpus 
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. 
The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words 
per person.


Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups: 
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs (ages 23-27),
2994 "30s" blogs (ages 33-47).

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.
Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blogauthorship-corpus.zip/2at

1. Load the dataset (5 points)

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
#blogtxt = pd.read_csv("./blog-authorship-corpus/blogtext.csv",engine = "python")

blogtxt = pd.read_table("blog-authorship-corpus.zip", sep = ",")

In [3]:
blogtxt.shape

(681284, 7)

a. Tip: As the dataset is large, use fewer rows. Check what is working well on your
machine and decide accordingly.

In [7]:
# take an explicit copy so new columns can be added later without a SettingWithCopyWarning
blogtxt1 = blogtxt.head(15000).copy()

In [8]:
blogtxt1.shape

(15000, 7)

In [9]:
blogtxt1.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [6]:
del blogtxt

2. Preprocess rows of the “text” column
***************************************
a. Remove unwanted characters
b. Convert text to lowercase
c. Remove unwanted spaces
d. Remove stopwords

In [10]:
import re
import nltk
from bs4 import BeautifulSoup
import unicodedata
from nltk.stem import WordNetLemmatizer
warnings.filterwarnings("ignore")

In [11]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

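def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace and (optionally) digits; drop everything else.
    # Caution: the range A-z also spans [ \ ] ^ _ and `, so those characters
    # are kept as well; write A-Z to strip them.
    pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
    text = re.sub(pattern, '', text)
    return text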

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()]) 

def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text
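
The character class used in remove_special_characters deserves a closer look, because the range A-z is wider than it appears. The following is a minimal sketch on a made-up sample string (the names sample, loose and strict are illustrative, not part of the notebook) comparing the pattern as written with the same pattern closed at uppercase Z.

```python
import re

# Made-up input containing the six ASCII characters that sit between 'Z' and 'a'.
sample = "snake_case [tag] a^b `quoted` done!"

# Pattern as written in the notebook: the range A-z covers ASCII 65..122,
# which also includes [ \ ] ^ _ and the backtick.
loose = re.sub(r'[^a-zA-z0-9\s]', '', sample)

# Same pattern with the range ending at uppercase Z.
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(loose)   # snake_case [tag] a^b `quoted` done
print(strict)  # snakecase tag ab quoted done
```

With the wider range, underscores and brackets survive cleaning, which is one plausible reason tokens made only of underscores can end up in a vocabulary built from the cleaned text.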

In [12]:
def normalize_corpus(corpus, html_stripping=True, accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    
    # Note: the stopword_removal flag is not used by any step below.
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # strip HTML
        if html_stripping:
            doc = strip_html_tags(doc)
        # remove accented characters
        if accented_char_removal:
            doc = remove_accented_chars(doc)
        # lowercase the text    
        if text_lower_case:
            doc = doc.lower()
        # collapse newlines into a space (inside [...] the '|' is a literal, so pipes are replaced too)
        doc = re.sub(r'[\r|\n|\r\n]+', ' ',doc)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # remove special characters and\or digits    
        if special_char_removal:
            # insert spaces between special characters to isolate them    
            special_char_pattern = re.compile(r'([{.(-)!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)  
        # remove extra whitespace
        doc = re.sub(' +', ' ', doc)
            
        normalized_corpus.append(doc)
        
    return normalized_corpus
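
normalize_corpus takes a stopword_removal flag but contains no stopword step, so step d of the task list is not applied to the text. A small helper along the following lines could fill that gap. This is only a sketch under assumptions: the name remove_stopwords and the tiny STOPWORDS set are hypothetical and chosen for illustration; a real run would more likely use nltk.corpus.stopwords.words('english') after downloading that corpus.

```python
# Illustrative stopword list only (an assumption, not the notebook's list).
STOPWORDS = {"i", "a", "an", "the", "it", "is", "was", "to", "of", "and", "had", "most"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from already lowercased, whitespace-separated text."""
    return ' '.join(word for word in text.split() if word not in stopwords)

cleaned = remove_stopwords("i had the most wonderful dream last night")
print(cleaned)  # wonderful dream last night
```

Because the helper expects lowercased text, it would fit after the lowercasing step and before the final whitespace cleanup. Note that applying it would change the cleaned text and therefore every downstream count and score.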

In [13]:
blogtxt1['clean text'] = normalize_corpus(blogtxt1['text'])


In [14]:
blogtxt1.tail(100)

Unnamed: 0,id,gender,age,topic,sign,date,text,clean text
14900,727002,male,23,Internet,Leo,"10,October,2003",i had the most wonderful dream la...,i had the most wonderful dream last night it w...
14901,727002,male,23,Internet,Leo,"25,November,2003","so, i've been playing Knights of...",so ive been playing knight of the old republic...
14902,727002,male,23,Internet,Leo,"24,November,2003",last night's dreams were much bet...,last nights dream were much better a little on...
14903,727002,male,23,Internet,Leo,"24,November,2003","so, i had the strangest dream las...",so i had the strangest dream last night
14904,727002,male,23,Internet,Leo,"23,November,2003",My life is rated R. What is y...,my life is rated r what is your life rated
...,...,...,...,...,...,...,...,...
14995,727002,male,23,Internet,Leo,"09,May,2004","well, trying to get Blogger's new...",well trying to get bloggers new feature to wor...
14996,727002,male,23,Internet,Leo,"09,May,2004",jeebus! gas prices jumped a dime ...,jeebus gas price jumped a dime in the last wee...
14997,727002,male,23,Internet,Leo,"09,May,2004",*yawn* it's been more than 48 hou...,yawn its been more than hour since ive even be...
14998,727002,male,23,Internet,Leo,"08,May,2004",talked to mr. translator the othe...,talked to mr translator the other day he is in...


In [16]:
blogtxt1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          50000 non-null  int64 
 1   gender      50000 non-null  object
 2   age         50000 non-null  int64 
 3   topic       50000 non-null  object
 4   sign        50000 non-null  object
 5   date        50000 non-null  object
 6   text        50000 non-null  object
 7   clean text  50000 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.1+ MB


3. As we want to make this into a multi-label classification problem, you are required to merge
all the label columns together, so that all the labels for a particular text are held in one place.

a. Label columns to merge: “gender”, “age”, “topic”, “sign”
b. After completing the previous step, there should be only two columns in your data
frame, i.e. “text” and “labels”, as shown in the image below


In [15]:
blogdf = pd.DataFrame()
blogdf['text'] = blogtxt1['clean text']

In [16]:
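# merge the four label columns into one list per row; age is cast to str so all labels share a type
blogdf['labels'] = blogtxt1.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)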

In [94]:
blogdf.labels.dtypes

dtype('O')

In [95]:
blogdf

Unnamed: 0,text,labels
0,info ha been found pages and mb of pdf files n...,"[male, 15, Student, Leo]"
1,these are the team members drewes van der laag...,"[male, 15, Student, Leo]"
2,in het kader van kernfusie op aarde maak je ei...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks to yahoo s toolbar i can now capture th...,"[male, 33, InvestmentBanking, Aquarius]"
...,...,...
14995,well trying to get bloggers new feature to wor...,"[male, 23, Internet, Leo]"
14996,jeebus gas price jumped a dime in the last wee...,"[male, 23, Internet, Leo]"
14997,yawn its been more than hour since ive even be...,"[male, 23, Internet, Leo]"
14998,talked to mr translator the other day he is in...,"[male, 23, Internet, Leo]"


4. Separate features and labels, and split the data into training and testing (5 points)

In [96]:
from sklearn.model_selection import train_test_split
Xtrain,Xtest, ytrain, ytest = train_test_split(blogdf.text, blogdf.labels, random_state=2,test_size = 0.25)

In [97]:
print(Xtrain.shape)
print(ytrain.shape)
print(Xtest.shape, ytest.shape)

(11250,)
(11250,)
(3750,) (3750,)


5. Vectorize the features (5 points)
a. Create a Bag of Words using count vectorizer
i. Use ngram_range=(1, 2)
ii. Vectorize training and testing features

In [98]:
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer(ngram_range=(1,2)).fit(Xtrain)
len(cvect.vocabulary_)

711436
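
To make the effect of ngram_range=(1, 2) concrete, here is a hand-rolled sketch of a bag of words over unigrams and bigrams. It is not how CountVectorizer is implemented: the helper word_ngrams is hypothetical, the two documents are toy inputs, and tokens are split on whitespace, whereas CountVectorizer's default token pattern also drops single-character tokens.

```python
from collections import Counter

def word_ngrams(text, ngram_range=(1, 2)):
    """Return the word n-grams of `text` for every n in the inclusive range."""
    tokens = text.split()
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

docs = ["testing testing", "gas price jumped"]

# Vocabulary: every distinct n-gram, sorted, mapped to a column index.
vocabulary = sorted({gram for doc in docs for gram in word_ngrams(doc)})
index = {gram: i for i, gram in enumerate(vocabulary)}

# One Counter per document plays the role of a sparse row of counts.
counts = [Counter(word_ngrams(doc)) for doc in docs]

print(vocabulary)
print(dict(counts[0]))  # {'testing': 2, 'testing testing': 1}
```

Five distinct words already yield seven features here, which illustrates why adding bigrams makes the vocabulary grow so quickly on real text.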

In [99]:
cvect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [22]:
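# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
list(cvect.get_feature_names_out())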

['__',
 '__ ___',
 '__ and',
 '__ being',
 '__ bit',
 '__ chance',
 '__ congrats',
 '__ gleejoyhappiness',
 '__ goodluck',
 '__ he',
 '__ ill',
 '__ im',
 '__ keke',
 '__ or',
 '__ so',
 '__ that',
 '__ the',
 '__ this',
 '__ until',
 '__ which',
 '__ yes',
 '__ you',
 '___',
 '___ ___',
 '___ and',
 '___ but',
 '___ currently',
 '___ had',
 '___ happy',
 '___ im',
 '___ ive',
 '___ like',
 '___ my',
 '___ of',
 '___ ok',
 '___ slip',
 '___ so',
 '___ the',
 '___ to',
 '___ urllink',
 '___ very',
 '___ wa',
 '___ yes',
 '___ you',
 '____',
 '____ and',
 '____ days',
 '____ good',
 '____ played',
 '____ ring',
 '____ so',
 '____ telephone',
 '____ then',
 '_____',
 '_____ however',
 '_____ im',
 '_____ insert',
 '_____ lose',
 '_____ the',
 '______',
 '______ in',
 '_______',
 '_______ please',
 '________',
 '________ get',
 '_________',
 '_________ or',
 '_________ service',
 '__________',
 '__________ but',
 '__________ monday',
 '__________ traditions',
 '____________',
 '___________

In [23]:
#Vectorizing training and test features
Xtrain_t = cvect.transform(Xtrain)
Xtest_t = cvect.transform(Xtest)

In [27]:
Xtrain_t

<37500x1974351 sparse matrix of type '<class 'numpy.int64'>'
	with 10491383 stored elements in Compressed Sparse Row format>

b. Print the term-document matrix

In [100]:
print(Xtrain_t[0])

  (0, 679)	4
  (0, 946)	1
  (0, 1104)	1
  (0, 1669)	1
  (0, 2010)	1
  (0, 3046)	1
  (0, 3054)	1
  (0, 5072)	1
  (0, 5329)	1
  (0, 5723)	1
  (0, 5773)	1
  (0, 8254)	2
  (0, 8256)	1
  (0, 8279)	1
  (0, 8381)	1
  (0, 9058)	1
  (0, 9383)	2
  (0, 9538)	1
  (0, 9674)	1
  (0, 10707)	1
  (0, 10933)	1
  (0, 13800)	3
  (0, 14711)	1
  (0, 15107)	1
  (0, 15120)	1
  :	:
  (0, 690689)	1
  (0, 690694)	1
  (0, 690803)	1
  (0, 691017)	1
  (0, 691154)	1
  (0, 691559)	1
  (0, 691794)	1
  (0, 692397)	1
  (0, 695391)	3
  (0, 695470)	1
  (0, 695744)	2
  (0, 697126)	1
  (0, 697127)	1
  (0, 697824)	2
  (0, 698294)	1
  (0, 698450)	1
  (0, 702077)	1
  (0, 702179)	1
  (0, 702417)	1
  (0, 702424)	1
  (0, 704310)	1
  (0, 704328)	1
  (0, 705174)	2
  (0, 706213)	1
  (0, 706222)	1


In [101]:
cvect.vocabulary_ 

{'file': 205137,
 'under': 645090,
 'too': 631574,
 'weird': 674547,
 'to': 622980,
 'be': 60470,
 'true': 638847,
 'dated': 144756,
 'guy': 248763,
 'few': 203672,
 'year': 702417,
 'ago': 10707,
 'we': 670741,
 'were': 676415,
 'friend': 221148,
 'before': 67536,
 'and': 23811,
 'after': 8381,
 'hooked': 278378,
 'upbroke': 649872,
 'up': 648633,
 'this': 613206,
 'in': 291690,
 'itself': 314062,
 'is': 305186,
 'miracle': 377964,
 'since': 536148,
 'most': 385329,
 'of': 417014,
 'have': 258413,
 'discovered': 159083,
 'breakups': 86260,
 'mean': 369431,
 'never': 401580,
 'having': 260760,
 'say': 514302,
 'anything': 38001,
 'ever': 188336,
 'again': 9383,
 'let': 340129,
 'alone': 16376,
 'sorry': 552535,
 'the': 591009,
 'fact': 196180,
 'that': 586739,
 'did': 155060,
 'remain': 496726,
 'good': 239395,
 'wa': 661447,
 'very': 657262,
 'cool': 130449,
 'both': 83146,
 'other': 438759,
 'people': 454375,
 'for': 213213,
 'awhile': 53467,
 'everything': 189888,
 'fine': 206862,
 

6. Create a dictionary to get the count of every label, i.e. the key will be the label name and the value will
be the total count of that label. See the image below for reference (5 points)

In [92]:
lbl = list(blogtxt1['gender'] + "," + blogtxt1['age'].astype('str') + "," + blogtxt1['topic'] + "," + blogtxt1['sign'])
lbl

['male,15,Student,Leo',
 'male,15,Student,Leo',
 'male,15,Student,Leo',
 'male,15,Student,Leo',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,33,InvestmentBanking,Aquarius',
 'male,

In [102]:
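label = CountVectorizer()
ytrainlabel_matrix = label.fit_transform(lbl)
# vocabulary_ maps each token to its column index in ytrainlabel_matrix, not to its count;
# per-label counts are the column sums, ytrainlabel_matrix.sum(axis=0).
# The default tokenizer also lowercases and splits hyphenated labels such as Non-Profit.
label.vocabulary_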


{'male': 54,
 '15': 2,
 'student': 70,
 'leo': 51,
 '33': 10,
 'investmentbanking': 48,
 'aquarius': 28,
 'female': 42,
 '14': 1,
 'indunk': 46,
 'aries': 30,
 '25': 7,
 'capricorn': 36,
 '17': 4,
 'gemini': 43,
 '23': 5,
 'non': 59,
 'profit': 61,
 'cancer': 35,
 'banking': 33,
 '37': 14,
 'sagittarius': 65,
 '26': 8,
 '24': 6,
 'scorpio': 67,
 '27': 9,
 'education': 39,
 '45': 22,
 'engineering': 40,
 'libra': 52,
 'science': 66,
 '34': 11,
 '41': 18,
 'communications': 37,
 'media': 56,
 'businessservices': 34,
 'sports': 69,
 'recreation': 63,
 'virgo': 75,
 'taurus': 71,
 'arts': 31,
 'pisces': 60,
 '44': 21,
 '16': 3,
 'internet': 47,
 'museums': 58,
 'libraries': 53,
 'accounting': 25,
 '39': 16,
 '35': 12,
 'technology': 72,
 '36': 13,
 'law': 49,
 '46': 23,
 'consulting': 38,
 'automotive': 32,
 '42': 19,
 'religion': 64,
 '13': 0,
 'fashion': 41,
 '38': 15,
 '43': 20,
 'publishing': 62,
 '40': 17,
 'marketing': 55,
 'lawenforcement': 50,
 'security': 68,
 'humanresources': 45
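
A dictionary of label counts, with the label name as key and the number of occurrences as value, can also be built directly from the merged label lists. The sketch below uses collections.Counter on three rows copied from the labels column shown earlier; the variable names are illustrative.

```python
from collections import Counter

# Three rows in the shape of the merged `labels` column.
labels = [
    ['male', '15', 'Student', 'Leo'],
    ['male', '15', 'Student', 'Leo'],
    ['male', '33', 'InvestmentBanking', 'Aquarius'],
]

# Flatten the per-row lists and count each label once per occurrence.
label_counts = dict(Counter(label for row in labels for label in row))
print(label_counts)
# {'male': 3, '15': 2, 'Student': 2, 'Leo': 2, '33': 1, 'InvestmentBanking': 1, 'Aquarius': 1}
```

On the notebook's data the same expression would presumably be applied to blogdf['labels']. Unlike tokenizing a comma-joined string, this keeps labels such as Non-Profit intact and preserves their original casing.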

7. Transform the labels - (7.5 points)
As we have noticed before, in this task each example can have multiple tags. To deal with
this kind of prediction, we need to transform the labels into a binary form, and the prediction will be
a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
a. Convert your train and test labels using MultiLabelBinarizer

In [24]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
ytrain_label = pd.DataFrame(mlb.fit_transform(ytrain), columns=mlb.classes_)
ytrain_label

Unnamed: 0,13,14,15,16,17,23,24,25,26,27,...,Sports-Recreation,Student,Taurus,Technology,Telecommunications,Transportation,Virgo,female,indUnk,male
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11245,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
11246,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
11247,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
11248,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1


In [25]:
mlb.classes_

array(['13', '14', '15', '16', '17', '23', '24', '25', '26', '27', '33',
       '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44',
       '45', '46', '47', 'Accounting', 'Advertising', 'Agriculture',
       'Aquarius', 'Architecture', 'Aries', 'Arts', 'Automotive',
       'Banking', 'BusinessServices', 'Cancer', 'Capricorn',
       'Communications-Media', 'Consulting', 'Education', 'Engineering',
       'Fashion', 'Gemini', 'Government', 'HumanResources', 'Internet',
       'InvestmentBanking', 'Law', 'LawEnforcement-Security', 'Leo',
       'Libra', 'Marketing', 'Military', 'Museums-Libraries',
       'Non-Profit', 'Pisces', 'Publishing', 'Religion', 'Sagittarius',
       'Science', 'Scorpio', 'Sports-Recreation', 'Student', 'Taurus',
       'Technology', 'Telecommunications', 'Transportation', 'Virgo',
       'female', 'indUnk', 'male'], dtype=object)

In [103]:
ytest_label = mlb.transform(ytest)

In [104]:
ytest_label.shape,ytrain_label.shape

((3750, 71), (11250, 71))
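
The binary form produced here can be illustrated without scikit-learn. The following is a simplified sketch of the idea behind MultiLabelBinarizer, not its actual implementation: the class list is the sorted set of all labels seen, and each row becomes a 0/1 mask over that list. The helper inverse is a hypothetical name for the reverse mapping.

```python
rows = [
    ['male', '15', 'Student', 'Leo'],
    ['female', '17', 'indUnk', 'Virgo'],
]

# Sorted distinct labels define the column order (digits, then uppercase, then lowercase).
classes = sorted({label for row in rows for label in row})

# One 0/1 mask per row: 1 where the row carries that class.
matrix = [[1 if c in row else 0 for c in classes] for row in rows]

def inverse(mask):
    """Map a 0/1 mask back to the tuple of class names it selects."""
    return tuple(c for c, bit in zip(classes, mask) if bit)

print(classes)             # ['15', '17', 'Leo', 'Student', 'Virgo', 'female', 'indUnk', 'male']
print(matrix[0])           # [1, 0, 1, 1, 0, 0, 0, 1]
print(inverse(matrix[0]))  # ('15', 'Leo', 'Student', 'male')
```

The reverse mapping returns labels in class order rather than in the original row order, which is why predicted tuples later appear sorted.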

8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the
base classifier, use LogisticRegression. It is one of the simplest methods, but it often
performs well enough in text classification tasks. Training might take some time because the
number of classifiers to train is large.
a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on
every label
b. As One-vs-Rest approach might not have been discussed in the sessions, we are
providing you the code for that




In [26]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver = 'lbfgs',n_jobs = -1, max_iter = 1000,verbose = 1)
clf = OneVsRestClassifier(clf,n_jobs = -1)

9. Fit the classifier, make predictions and get the accuracy (5 points)
a. Print the following
i. Accuracy score
ii. F1 score
iii. Average precision score
iv. Average recall score
v. Tip: Make sure you are familiar with all of them. How would you expect the
things to work for the multi-label scenario? Read about micro/macro/weighted
averaging

In [28]:
ytrain_label.shape

(11250, 71)

In [27]:
clf.fit(Xtrain_t,ytrain_label)
yPred = clf.predict(Xtest_t)

In [29]:
yPred.shape

(3750, 71)

In [109]:
from sklearn.metrics import f1_score,accuracy_score,average_precision_score,recall_score, precision_score,classification_report

In [106]:
accuracy_score(ytest_label,yPred)

0.20986666666666667

In [107]:
f1_score(ytest_label,yPred,average="weighted")

0.531431468378459

In [37]:
recall_score(ytest_label, yPred, average='weighted')

0.4590666666666667

In [38]:
precision_score(ytest_label, yPred, average='micro')

0.7530621172353456

In [108]:
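# average_precision_score is meant for continuous scores such as clf.predict_proba(Xtest_t);
# with hard 0/1 predictions it only gives a coarse summary
average_precision_score(ytest_label, yPred, average='weighted')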

0.4347856729237682
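
The scores above mix averaging modes, and in the multi-label setting accuracy_score reports subset accuracy, where a row counts as correct only if every label matches. The sketch below works through micro, macro and weighted F1 by hand on a toy 4x3 indicator matrix. The data and names are made up for illustration; the definitions follow the usual ones (macro averages per-label scores, weighted averages them by support, micro pools the counts).

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: 4 examples, 3 labels.
y_true = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [0, 0, 1]]

n_labels = len(y_true[0])
stats = []  # one (tp, fp, fn) triple per label
for j in range(n_labels):
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    stats.append((tp, fp, fn))

per_label = [f1(*s) for s in stats]             # [1.0, 0.0, 0.5]
support = [tp + fn for tp, fp, fn in stats]     # [3, 1, 2]

macro = sum(per_label) / n_labels                                          # 0.5
weighted = sum(f * s for f, s in zip(per_label, support)) / sum(support)   # 4/6
micro = f1(*(sum(col) for col in zip(*stats)))                             # 8/11
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.25

print(round(macro, 4), round(weighted, 4), round(micro, 4), subset_accuracy)
```

Only one of the four rows is predicted exactly, so subset accuracy is 0.25 even though micro F1 is about 0.73. The same gap appears in the notebook between the accuracy score and the precision and F1 scores.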

In [113]:
print(classification_report(ytest_label, yPred))

              precision    recall  f1-score   support

           0       0.75      0.13      0.22        23
           1       0.38      0.05      0.09        99
           2       0.64      0.17      0.26       210
           3       0.73      0.27      0.39       181
           4       0.65      0.28      0.39       423
           5       0.67      0.22      0.33       397
           6       0.80      0.36      0.50       321
           7       0.58      0.19      0.28       236
           8       0.12      0.01      0.02        93
           9       0.74      0.33      0.46       364
          10       0.66      0.21      0.32        98
          11       0.95      0.59      0.72       184
          12       0.73      0.45      0.55       592
          13       0.92      0.48      0.63       440
          14       0.00      0.00      0.00         9
          15       0.75      0.23      0.35        13
          16       0.50      0.04      0.07        28
          17       0.00    

10. Print true label and predicted label for any five examples (7.5 points)

In [46]:
print(ytest_label[:5,:])  # Actual values for first 5

[[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1]]


In [40]:
print(yPred[:5,:])  #Predicted values for first 5

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1],
       [0, 0, 0,

In [41]:
yPredLabel = mlb.inverse_transform(yPred)  #Transforming binary labels into original text format

In [50]:
yPredLabel[20:25] #Displaying 5 Predicted values

[('36', 'Fashion', 'male'),
 ('Aries', 'female'),
 ('36', 'Aries', 'Fashion', 'male'),
 ('female', 'indUnk'),
 ('17', 'female', 'indUnk')]

In [51]:
ytest[20:25] #Displaying the 5 corresponding actual values

6470          [male, 36, Fashion, Aries]
12808    [female, 23, Government, Virgo]
7192          [male, 36, Fashion, Aries]
12521         [male, 16, Student, Libra]
13661        [female, 17, indUnk, Virgo]
Name: labels, dtype: object
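
The two outputs above are easier to compare side by side. This sketch copies the five predicted tuples and the five true label lists shown above and uses set operations to report, per example, which labels were hit, which were missed, and which were predicted wrongly. The variable names are illustrative.

```python
# Predicted and true labels for test rows 20..24, copied from the outputs above.
predicted = [
    ('36', 'Fashion', 'male'),
    ('Aries', 'female'),
    ('36', 'Aries', 'Fashion', 'male'),
    ('female', 'indUnk'),
    ('17', 'female', 'indUnk'),
]
actual = [
    ['male', '36', 'Fashion', 'Aries'],
    ['female', '23', 'Government', 'Virgo'],
    ['male', '36', 'Fashion', 'Aries'],
    ['male', '16', 'Student', 'Libra'],
    ['female', '17', 'indUnk', 'Virgo'],
]

rows = []
for true_labels, pred_labels in zip(actual, predicted):
    hit = sorted(set(true_labels) & set(pred_labels))     # correctly predicted
    missed = sorted(set(true_labels) - set(pred_labels))  # true but not predicted
    wrong = sorted(set(pred_labels) - set(true_labels))   # predicted but not true
    rows.append((hit, missed, wrong))
    print(f"hit={hit} missed={missed} wrong={wrong}")
```

Only the third example is predicted exactly. The fourth gets no label right, and several predictions contain fewer than four labels because each one-vs-rest classifier decides independently.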