**Project:**
Classification is probably the most common task you will deal with in practice.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about a writer from the text alone, without knowing anything else about them, is a challenge.
We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multilabel classification problem.

In [0]:
Dataset Information:
Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age, but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17)
8086 "20s" blogs (ages 23-27)
2994 "30s" blogs (ages 33-47)
 
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.
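As a quick sanity check on the description above, the following minimal sketch encodes the three age brackets and re-derives the totals. The names `AGE_GROUPS` and `age_group` are hypothetical helpers introduced only for this illustration; the numbers are copied from the corpus description.

```python
# Age brackets as stated in the corpus description.
AGE_GROUPS = {"10s": (13, 17), "20s": (23, 27), "30s": (33, 47)}

def age_group(age):
    """Return the corpus age group for an age, or None if it falls in a gap."""
    for name, (low, high) in AGE_GROUPS.items():
        if low <= age <= high:
            return name
    return None

# Blog counts per group and corpus totals, copied from the description.
blogs_per_group = {"10s": 8240, "20s": 8086, "30s": 2994}
total_bloggers = 19320
total_posts = 681288

posts_per_blogger = total_posts / total_bloggers
print(age_group(15), age_group(33), age_group(20))
print(sum(blogs_per_group.values()), round(posts_per_blogger, 1))
```

The group counts add up to the 19,320 bloggers quoted above, and the average works out to roughly 35 posts per blogger. Ages between the brackets (for example 20) belong to no group.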
Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

Mount Google Drive and read the dataset "blog-authorship-corpus"

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


**Drop the NAs while the dataset is being read**

In [0]:
import pandas as pd
import numpy as np
data_blog = pd.read_csv('/gdrive/My Drive/Residency-8-NLP/NLP_Project1_Blog/blog-authorship-corpus.zip').dropna()
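To make the effect of `.dropna()` concrete, here is a small sketch on a hypothetical toy frame (the column values are made up for illustration). Called without arguments, `dropna()` removes every row that has a missing value in any column; the `subset` argument restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2, 3],
    "topic": ["Student", np.nan, "indUnk"],
    "text": ["hello", "world", None],
})

clean = toy.dropna()                      # drops any row with at least one missing value
text_only = toy.dropna(subset=["text"])   # drops rows only when "text" is missing

print(len(toy), len(clean), len(text_only))
print(list(clean.index))
```

Note that a placeholder string such as `indUnk` is an ordinary value, not a missing one, so rows carrying it are kept.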

In [46]:
data_blog.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


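**Limit the dataframe to the first 4,900 rows**

In [0]:
# .copy() makes df an independent frame, so later column assignments do not raise SettingWithCopyWarning
df = data_blog.iloc[:4900].copy()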

In [48]:
df.shape

(4900, 7)
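The slice above is positional. A minimal sketch with hypothetical toy data shows what that means when earlier filtering has left gaps in the index, and why working on an explicit copy can be a useful precaution before assigning to columns.

```python
import pandas as pd

toy = pd.DataFrame({"text": ["a", None, "c", "d"]}).dropna()
# After dropna the index labels are 0, 2, 3 (label 1 was dropped).

first_two = toy.iloc[:2].copy()   # first two rows by position, as an independent copy
first_two["text"] = first_two["text"].str.upper()   # does not touch toy

print(list(first_two.index), list(first_two["text"]), list(toy["text"]))
```

`iloc` counts rows, not index labels, so the result always has the requested number of rows as long as the frame is long enough.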



**Preprocessing of NLP text (convert to lowercase and remove non-essentials)**

In [50]:
import nltk
nltk.download("stopwords")
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
# Build the stop-word set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
TAG_RE = re.compile(r'<.*?>')
tokenizer = RegexpTokenizer(r'\w+')

def preprocess(sentence):
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    sentence = TAG_RE.sub('', sentence)            # strip HTML tags
    sentence = re.sub(r'http\S+', '', sentence)    # strip URLs
    sentence = re.sub(r'[0-9]+', '', sentence)     # strip digits
    tokens = tokenizer.tokenize(sentence)
    # keep tokens longer than two characters that are not stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    # Tokens are returned as-is: no stemming or lemmatization is applied
    # (the stemmer and lemmatizer created above are unused).
    return " ".join(filtered_words)


df['text'] = df['text'].map(preprocess)
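The cleaning steps can be traced on a single string. The sketch below mirrors them with the standard library only; `TOY_STOP_WORDS` is a hypothetical, tiny stop-word set used purely for illustration (the notebook uses NLTK's English list), and `clean_text` is an assumed name.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's English stop-word list instead.
TOY_STOP_WORDS = {"the", "and", "has", "been", "for"}

def clean_text(text, stop_words=TOY_STOP_WORDS):
    text = str(text).lower()
    text = re.sub(r"<.*?>", "", text)      # drop HTML tags
    text = re.sub(r"http\S+", "", text)    # drop URLs
    text = re.sub(r"[0-9]+", "", text)     # drop digits
    tokens = re.findall(r"\w+", text)      # word tokens, as with the \w+ tokenizer
    kept = [t for t in tokens if len(t) > 2 and t not in stop_words]
    return " ".join(kept)

sample = "Info has been found (+/- 100 pages) <b>see</b> http://example.com for the PDF files"
print(clean_text(sample))
```

Punctuation disappears because only `\w+` runs are kept, and short tokens or stop words are filtered afterwards.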

In [52]:
df.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found pages pdf files wait untill team le...
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie aarde maak eigen water...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture urls popups means...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",interesting conversation dad morning talking k...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",somehow coca cola way summing things well earl...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",anything korea country extremes everything see...
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",take read news article urllink joongang ilbo n...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",surf english news sites lot looking tidbits ko...


In [53]:
df.shape

(4900, 7)

**Merge “gender”, “age”, “topic”, “sign” into a single label**

In [0]:
df['label']=df['gender'].astype(str)+','+df['age'].astype(str)+','+df['topic'].astype(str)+','+df['sign'].astype(str)

In [0]:
# select the two columns by name and copy, so new columns can be added to df1 safely
df1 = df[['text', 'label']].copy()

In [56]:
df1.head(10)

Unnamed: 0,text,label
0,info found pages pdf files wait untill team le...,"male,15,Student,Leo"
1,team members drewes van der laag urllink mail ...,"male,15,Student,Leo"
2,het kader van kernfusie aarde maak eigen water...,"male,15,Student,Leo"
3,testing testing,"male,15,Student,Leo"
4,thanks yahoo toolbar capture urls popups means...,"male,33,InvestmentBanking,Aquarius"
5,interesting conversation dad morning talking k...,"male,33,InvestmentBanking,Aquarius"
6,somehow coca cola way summing things well earl...,"male,33,InvestmentBanking,Aquarius"
7,anything korea country extremes everything see...,"male,33,InvestmentBanking,Aquarius"
8,take read news article urllink joongang ilbo n...,"male,33,InvestmentBanking,Aquarius"
9,surf english news sites lot looking tidbits ko...,"male,33,InvestmentBanking,Aquarius"


In [57]:
df1.shape

(4900, 2)


In [59]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4900 entries, 0 to 4899
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    4900 non-null   object
 1   label   4900 non-null   object
dtypes: object(2)
memory usage: 114.8+ KB


**Define a function that converts the label string into a list of labels; this helps identify the different labels contained in the same row**

In [0]:
def label_to_list(label):
    # split the comma-joined label string into a list of individual labels
    return str(label).split(',')

df1['labels'] = df1['label'].map(label_to_list)
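Joining the four columns into one string and splitting it again works as long as no label value itself contains a comma. As a hedged alternative, the list of labels can be built straight from the columns; the sketch below uses a hypothetical two-row frame to show that both routes give the same result.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 34],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Sagittarius"],
})

# Route used in the notebook: join into one string, then split on commas.
joined = (toy["gender"].astype(str) + "," + toy["age"].astype(str) + ","
          + toy["topic"].astype(str) + "," + toy["sign"].astype(str))
via_split = joined.map(lambda s: s.split(","))

# Alternative: build the list of labels directly from the columns.
direct = toy[["gender", "age", "topic", "sign"]].astype(str).values.tolist()

print(via_split.tolist() == direct)
```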

In [61]:
df1.head(10)

Unnamed: 0,text,label,labels
0,info found pages pdf files wait untill team le...,"male,15,Student,Leo","[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"male,15,Student,Leo","[male, 15, Student, Leo]"
2,het kader van kernfusie aarde maak eigen water...,"male,15,Student,Leo","[male, 15, Student, Leo]"
3,testing testing,"male,15,Student,Leo","[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"
5,interesting conversation dad morning talking k...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"
6,somehow coca cola way summing things well earl...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"
7,anything korea country extremes everything see...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"
8,take read news article urllink joongang ilbo n...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"
9,surf english news sites lot looking tidbits ko...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"


**Separate features and labels, and split the data into training and test sets in a 75:25 ratio**

In [0]:
from sklearn.model_selection import train_test_split
x = df1['text']
y = df1['labels']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.25,random_state=1)

In [63]:
x_train.shape


(3675,)

In [64]:
x_test.shape

(1225,)

In [65]:
y_train.shape

(3675,)

In [66]:
y_test.shape

(1225,)
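The shapes above follow directly from `test_size=0.25`: 25% of 4,900 rows is 1,225, leaving 3,675 for training. The same behaviour can be checked on a hypothetical eight-item list; fixing `random_state` makes the shuffle reproducible.

```python
from sklearn.model_selection import train_test_split

texts = [f"post {i}" for i in range(8)]
labels = [["male", "15"]] * 4 + [["female", "34"]] * 4

x_tr, x_te, y_tr, y_te = train_test_split(texts, labels, test_size=0.25, random_state=1)
# Same seed, same split.
x_tr2, x_te2, _, _ = train_test_split(texts, labels, test_size=0.25, random_state=1)

print(len(x_tr), len(x_te))
```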

**Vectorize the features**

In [67]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 2))
print("create Matrix")
x_train_ct = cv.fit_transform(x_train)
x_test_ct = cv.transform(x_test)

create Matrix
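With `ngram_range=(1, 2)` the vocabulary holds both single words and pairs of adjacent words, which is why it grows so large on real text. A minimal sketch on hypothetical two-document training data shows the learned terms and what happens to words that appear only at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["save money", "save oil money"]
test_docs = ["spend money"]

toy_cv = CountVectorizer(ngram_range=(1, 2))
train_matrix = toy_cv.fit_transform(train_docs)   # learns the vocabulary
test_matrix = toy_cv.transform(test_docs)         # reuses it; unseen terms are dropped

print(sorted(toy_cv.vocabulary_))
print(train_matrix.shape, test_matrix.shape, test_matrix.sum())
```

Fitting only on the training split, as the cell above does, keeps test-set vocabulary from leaking into the features.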


In [68]:
cv.vocabulary_

{'girl': 85699,
 'sometimes': 200184,
 'forget': 78774,
 'visions': 234138,
 'grandeur': 90200,
 'weekend': 239086,
 'walked': 235312,
 'aisles': 4271,
 'discount': 56007,
 'auto': 14094,
 'parts': 156667,
 'store': 206858,
 'thought': 219156,
 'wanted': 236185,
 'save': 185383,
 'money': 138602,
 'figured': 74674,
 'could': 44742,
 'put': 170948,
 'oil': 150472,
 'coolant': 43960,
 'car': 30850,
 'change': 33599,
 'air': 4069,
 'filters': 75094,
 'pick': 160929,
 'steering': 205438,
 'wheel': 241042,
 'cover': 46122,
 'barely': 17018,
 'get': 84159,
 'damn': 49108,
 'stretch': 207705,
 'pry': 170094,
 'finger': 75823,
 'known': 115009,
 'already': 5485,
 'beaten': 18016,
 'hell': 97085,
 'thinking': 218483,
 'ended': 63730,
 'forgoing': 78891,
 'filter': 75076,
 'learned': 118002,
 'would': 245917,
 'need': 144358,
 'special': 202175,
 'tool': 223784,
 'tried': 225925,
 'figure': 74581,
 'rectangular': 176462,
 'went': 240323,
 'engine': 64093,
 'old': 150708,
 'something': 199724,
 '

In [69]:
len(cv.vocabulary_)

250170

**Document-term matrix (DTM) for train and test data**

In [70]:
x_train_ct.shape

(3675, 250170)

In [72]:
print(x_train_ct[0])

  (0, 85699)	2
  (0, 200184)	1
  (0, 78774)	1
  (0, 234138)	1
  (0, 90200)	1
  (0, 239086)	2
  (0, 235312)	1
  (0, 4271)	1
  (0, 56007)	1
  (0, 14094)	1
  (0, 156667)	1
  (0, 206858)	1
  (0, 219156)	2
  (0, 236185)	1
  (0, 185383)	1
  (0, 138602)	1
  (0, 74674)	1
  (0, 44742)	2
  (0, 170948)	3
  (0, 150472)	8
  (0, 43960)	1
  (0, 30850)	4
  (0, 33599)	1
  (0, 4069)	2
  (0, 75094)	1
  :	:
  (0, 2817)	1
  (0, 6509)	1
  (0, 117988)	1
  (0, 199838)	1
  (0, 68073)	1
  (0, 85789)	1
  (0, 132502)	1
  (0, 140943)	1
  (0, 19327)	1
  (0, 68100)	1
  (0, 114342)	1
  (0, 64486)	1
  (0, 31675)	1
  (0, 84795)	1
  (0, 174881)	1
  (0, 20960)	1
  (0, 226338)	1
  (0, 55592)	1
  (0, 175595)	1
  (0, 145584)	1
  (0, 157785)	1
  (0, 244653)	1
  (0, 164841)	1
  (0, 123663)	1
  (0, 30890)	1
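Each printed entry has the form `(row, column) count`, where the column is an index into the vocabulary. To read such a row in terms of words, the vocabulary mapping can be inverted. The following sketch does this on hypothetical toy documents; `index_to_term` is an assumed helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer()
matrix = toy_cv.fit_transform(["oil oil car", "car wheel"])

# Invert the term -> column mapping so columns can be looked up by index.
index_to_term = {idx: term for term, idx in toy_cv.vocabulary_.items()}

row = matrix[0]  # sparse row for the first document
counts = {index_to_term[int(j)]: int(c) for j, c in zip(row.indices, row.data)}
print(counts)
```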


In [73]:
x_test_ct.shape

(1225, 250170)

In [74]:
print(x_test_ct[0])

  (0, 5719)	1
  (0, 11268)	1
  (0, 13042)	1
  (0, 14704)	1
  (0, 15115)	1
  (0, 19625)	1
  (0, 31662)	5
  (0, 34977)	2
  (0, 35309)	1
  (0, 39975)	1
  (0, 39982)	1
  (0, 44624)	1
  (0, 44742)	2
  (0, 44897)	1
  (0, 44991)	1
  (0, 48304)	1
  (0, 57462)	1
  (0, 57495)	1
  (0, 59070)	1
  (0, 59837)	1
  (0, 59934)	1
  (0, 61032)	1
  (0, 61104)	1
  (0, 62507)	1
  (0, 62929)	1
  :	:
  (0, 220626)	3
  (0, 221358)	1
  (0, 221373)	1
  (0, 222085)	1
  (0, 222187)	1
  (0, 231748)	2
  (0, 231955)	1
  (0, 232268)	1
  (0, 235012)	1
  (0, 239687)	1
  (0, 239830)	1
  (0, 240317)	1
  (0, 240323)	3
  (0, 240491)	1
  (0, 242274)	2
  (0, 242661)	1
  (0, 243511)	1
  (0, 243621)	1
  (0, 244360)	1
  (0, 244782)	1
  (0, 245917)	2
  (0, 246142)	1
  (0, 247698)	1
  (0, 247884)	1
  (0, 248207)	1


**Create a dictionary to get the count of every label, i.e. the key will be the label name and the value will be the total count of that label**

In [0]:
text=[]
text.append(df1['label'].str.cat(sep=','))

In [76]:
def convert(lst): 
    return (lst[0].split(',')) 
  
# Calling code 
text = convert(text)
print(text)

['male', '15', 'Student', 'Leo', 'male', '15', 'Student', 'Leo', 'male', '15', 'Student', 'Leo', 'male', '15', 'Student', 'Leo', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'Investment

In [77]:
dictionary = {}
for label in text :
  dictionary[label] = dictionary.get(label, 0) + 1
dictionary

{'14': 170,
 '15': 339,
 '16': 67,
 '17': 231,
 '23': 137,
 '24': 353,
 '25': 268,
 '26': 96,
 '27': 86,
 '33': 101,
 '34': 540,
 '35': 2307,
 '36': 60,
 '37': 19,
 '39': 79,
 '41': 14,
 '42': 9,
 '44': 3,
 '45': 14,
 '46': 7,
 'Accounting': 2,
 'Aquarius': 329,
 'Aries': 2483,
 'Arts': 31,
 'Automotive': 14,
 'Banking': 16,
 'BusinessServices': 87,
 'Cancer': 94,
 'Capricorn': 84,
 'Communications-Media': 61,
 'Consulting': 16,
 'Education': 118,
 'Engineering': 119,
 'Gemini': 86,
 'Internet': 20,
 'InvestmentBanking': 70,
 'Law': 3,
 'Leo': 190,
 'Libra': 414,
 'Museums-Libraries': 2,
 'Non-Profit': 47,
 'Pisces': 67,
 'Religion': 4,
 'Sagittarius': 704,
 'Science': 33,
 'Scorpio': 308,
 'Sports-Recreation': 75,
 'Student': 569,
 'Taurus': 100,
 'Technology': 2332,
 'Virgo': 41,
 'female': 1606,
 'indUnk': 1281,
 'male': 3294}
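The same counts can be produced more compactly with `collections.Counter`. This is a sketch on three hypothetical label strings rather than the full column; on the real data one would iterate over `df1['label']` instead.

```python
from collections import Counter

rows = [
    "male,15,Student,Leo",
    "male,33,InvestmentBanking,Aquarius",
    "female,17,indUnk,Scorpio",
]
label_counts = Counter(label for row in rows for label in row.split(","))
print(label_counts.most_common(1))
```

As a consistency check on the dictionary above, the two gender counts (1606 female and 3294 male) add up to the 4,900 rows in the frame.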

**Transform the labels**

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train_mlbin=mlb.fit_transform(y_train)
y_test_mlbin=mlb.transform(y_test)

In [79]:
print(y_train_mlbin)
print(y_train_mlbin.shape)

[[0 0 0 ... 1 1 0]
 [0 1 0 ... 1 0 0]
 [0 1 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 [0 1 0 ... 0 0 1]]
(3675, 54)


In [80]:
print(y_test_mlbin)
print(y_test_mlbin.shape)

[[0 0 0 ... 0 0 1]
 [0 0 0 ... 1 1 0]
 [0 0 0 ... 0 0 1]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]]
(1225, 54)


In [81]:
list(mlb.classes_)

['14',
 '15',
 '16',
 '17',
 '23',
 '24',
 '25',
 '26',
 '27',
 '33',
 '34',
 '35',
 '36',
 '37',
 '39',
 '41',
 '42',
 '44',
 '45',
 '46',
 'Accounting',
 'Aquarius',
 'Aries',
 'Arts',
 'Automotive',
 'Banking',
 'BusinessServices',
 'Cancer',
 'Capricorn',
 'Communications-Media',
 'Consulting',
 'Education',
 'Engineering',
 'Gemini',
 'Internet',
 'InvestmentBanking',
 'Law',
 'Leo',
 'Libra',
 'Museums-Libraries',
 'Non-Profit',
 'Pisces',
 'Religion',
 'Sagittarius',
 'Science',
 'Scorpio',
 'Sports-Recreation',
 'Student',
 'Taurus',
 'Technology',
 'Virgo',
 'female',
 'indUnk',
 'male']
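A small example may help in reading the indicator matrices above. `MultiLabelBinarizer` sorts all labels seen during `fit` into `classes_` and emits one 0/1 column per class; `inverse_transform` maps a row back to its labels. The data below is hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_mlb = MultiLabelBinarizer()
train_y = toy_mlb.fit_transform([["male", "15"], ["female", "35"]])
test_y = toy_mlb.transform([["male", "35"]])

print(list(toy_mlb.classes_))
print(test_y.tolist())
print(toy_mlb.inverse_transform(test_y))
```

Because the classes are sorted as strings, age labels come first and `male` comes last, which matches the column order listed above.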

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

**Choose a classifier**

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

In [86]:
clf.fit(x_train_ct,y_train_mlbin)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)
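`OneVsRestClassifier` trains one binary logistic regression per label column and predicts each column independently, so a predicted row is not forced to contain exactly one gender, age, topic and sign. A minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: 6 samples, 2 count features, 2 independent labels.
X = np.array([[3, 0], [2, 0], [0, 3], [0, 2], [3, 3], [0, 0]])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
toy_clf.fit(X, Y)

pred = toy_clf.predict(X)
print(len(toy_clf.estimators_))   # one binary classifier per label column
print(pred.shape)
```

With 54 label columns, the notebook's model therefore contains 54 separate binary classifiers.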

In [87]:
y_pred=clf.predict(x_test_ct)
y_pred

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

In [88]:
y_pred.shape

(1225, 54)

In [0]:
y_pred_ = mlb.inverse_transform(y_pred)

In [90]:
print(y_pred_)

[('male',), ('male',), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('Student', 'female'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('female',), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('indUnk', 'male'), ('35', 'Aries', 'Technology', 'male'), ('female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('male',), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Techno

In [0]:
y_test_ = mlb.inverse_transform(y_test_mlbin)

In [92]:
print(y_test_)

[('17', 'Sagittarius', 'Student', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('17', 'Leo', 'Student', 'female'), ('23', 'Sagittarius', 'indUnk', 'male'), ('35', 'Aries', 'Technology', 'male'), ('26', 'Gemini', 'indUnk', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Gemini', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('39', 'Education', 'Virgo', 'male'), ('35', 'Aries', 'Technology', 'male'), ('24', 'Scorpio', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 

In [0]:
from sklearn import metrics
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [94]:
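print("F1: " + str(f1_score(y_test_mlbin,y_pred,average='micro')))
print("Recall: " + str(recall_score(y_test_mlbin,y_pred,average='micro')))
# average_precision_score summarizes the precision-recall curve; it is not the plain precision (sklearn.metrics.precision_score)
print("Average precision: " + str(average_precision_score(y_test_mlbin, y_pred,average='micro')))
# accuracy_score on multilabel targets is subset accuracy: every indicator in a row must match
print("Accuracy: " + str(metrics.accuracy_score(y_test_mlbin,y_pred)))

F1: 0.7588843904633378
Recall: 0.6885714285714286
Average precision: 0.6050427309645746
Accuracy: 0.5542857142857143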
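The four numbers measure different things, and a toy example makes the distinction easier to see. The sketch below uses hypothetical indicator matrices and `precision_score` (plain micro-averaged precision, which the cell above does not call). It also back-computes the micro-precision implied by the reported F1 and recall, under the assumption that both are micro-averaged over the same predictions, using P = F·R / (2R − F).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([[1, 0, 1], [0, 1, 1]])
y_hat = np.array([[1, 1, 0], [0, 1, 1]])   # one false positive and one false negative in row 0

precision = precision_score(y_true, y_hat, average="micro")  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_hat, average="micro")        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_hat, average="micro")
subset_acc = accuracy_score(y_true, y_hat)                   # only row 1 matches exactly

# Micro-precision implied by the reported F1 and recall: P = F * R / (2R - F).
f_reported, r_reported = 0.7588843904633378, 0.6885714285714286
implied_precision = f_reported * r_reported / (2 * r_reported - f_reported)
print(precision, recall, f1, subset_acc, round(implied_precision, 3))
```

Subset accuracy is the strictest of these measures, since a single wrong indicator makes the whole row count as incorrect.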


In [95]:
y_test[0:350]

836       [male, 17, Student, Sagittarius]
4536     [female, 34, indUnk, Sagittarius]
2583         [male, 35, Technology, Aries]
2098         [male, 35, Technology, Aries]
3119         [male, 35, Technology, Aries]
                       ...                
1399         [male, 35, Technology, Aries]
4168     [female, 34, indUnk, Sagittarius]
4739    [female, 23, Automotive, Aquarius]
3138         [male, 35, Technology, Aries]
3586         [male, 35, Technology, Aries]
Name: labels, Length: 350, dtype: object

In [98]:
y_test[4739]

['female', '23', 'Automotive', 'Aquarius']

In [99]:
y_test[3119]

['male', '35', 'Technology', 'Aries']

In [100]:
y_test[836]

['male', '17', 'Student', 'Sagittarius']
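The lookups above work because `y_test` keeps the original row labels after the shuffle, so `y_test[4739]` is a label lookup, not "the 4739th element". A short sketch with a hypothetical two-row Series shows the explicit forms, `.loc` for labels and `.iloc` for positions.

```python
import pandas as pd

toy_y = pd.Series([["male", "17"], ["female", "34"]], index=[836, 4536], name="labels")

by_label = toy_y.loc[836]      # row whose index label is 836
by_position = toy_y.iloc[0]    # first row, whatever its label

print(by_label, by_position)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index consists of integers.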