# Building a classifier that predicts multiple features of the author of a given text.

#### Mounting the drive to work on Google Colab

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- ### Let's import the necessary packages for building the model

In [0]:
import nltk
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
project_path = '/content/drive/My Drive/AIML_Projects/Statistical NLP/'          # Defining project path

In [0]:
import os
os.chdir(project_path)                                                          # Changing directory location to project path

In [0]:
os.listdir()                                                                    # Checking the contents in the project_path 

['blog-authorship-corpus.zip', 'blogtext.csv']

In [0]:
zip_path = project_path + 'blog-authorship-corpus.zip'                          # Specifying the zip_path

In [0]:
from zipfile import ZipFile                                                     # Extracting the zip file 
with ZipFile(zip_path , 'r') as z:
  z.extractall()

In [0]:
blog = pd.read_csv(r'blogtext.csv')

In [0]:
blog.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [0]:
blog.shape

(681284, 7)

## Let's take 30% of the dataset to build our model, which classifies multiple features of the author of a given text.

In [0]:
df = blog.sample(frac=0.3,random_state=4)

In [0]:
df.shape

(204385, 7)
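
As a quick arithmetic check on the sample size, here is a minimal sketch. It assumes pandas rounds `frac * n_rows` to the nearest integer when sampling; the row counts are the ones printed above.

```python
n_rows = 681284   # rows in the full dataset, from blog.shape
frac = 0.3        # fraction passed to blog.sample

# Assumption: the sample size is frac * n_rows rounded to the nearest integer
expected_rows = round(frac * n_rows)
print(expected_rows)
```

The result matches the 204385 rows reported by `df.shape`.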

### Let's look at the first few records in the `'text'` column.

In [0]:
df.iloc[0,6]

"            'To give thanks in solitude is enough. Thanksgiving has wings and goes where it must go.'  -- Victor Hugo         "

In [0]:
df.iloc[1,6]

"       Quingar watched, excited, as the door handle turned. Chance opened the door, and gasped. 'Hello, dear cousin.' Quingar said. 'Let's cut right to the chase. You've had my men killed. I cared about them. Now it's your turn.' He laughed, and opened one of Chance's closets. Something fell out. Chance's eyes brimmed with tears when he saw what it was. Skyler's body, a katana sticking out of the back of her head.         "

- ### From the two results above, we can see that the text column needs cleaning.
- ## We need to perform the following text cleaning steps:
> - a. Remove unwanted characters
> - b. Convert text to lowercase
> - c. Remove unwanted spaces
> - d. Remove stopwords
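
The four steps above can be sketched as a single helper. This is only an illustration: `clean_text` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's English list so the snippet runs without any download.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {"to", "in", "is", "has", "and", "where", "it"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    text = text.strip()                              # remove leading/trailing spaces
    text = re.sub(r"[^A-Za-z0-9 ]+", "", text)       # remove unwanted characters
    text = text.lower()                              # convert to lowercase
    # split() also collapses repeated inner spaces before stopwords are dropped
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "   'To give thanks in solitude is enough.'  -- Victor Hugo   "
cleaned = clean_text(sample)
print(cleaned)
```

The cells below apply the same steps one at a time to the whole `text` column, in a slightly different order.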

### Information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage.

In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 204385 entries, 237952 to 443796
Data columns (total 7 columns):
id        204385 non-null int64
gender    204385 non-null object
age       204385 non-null int64
topic     204385 non-null object
sign      204385 non-null object
date      204385 non-null object
text      204385 non-null object
dtypes: int64(2), object(5)
memory usage: 22.5+ MB


### Descriptive statistics to summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values. 
### Analyzing both numeric and object columns.

In [0]:
df.describe(include = 'all')

Unnamed: 0,id,gender,age,topic,sign,date,text
count,204385.0,204385,204385.0,204385,204385,204385,204385
unique,,2,,40,12,2082,194524
top,,male,,indUnk,Cancer,"02,August,2004",urlLink
freq,,103649,,75025,19513,4964,117
mean,2400247.0,,23.917132,,,,
std,1247010.0,,7.774584,,,,
min,5114.0,,13.0,,,,
25%,1241488.0,,17.0,,,,
50%,2608756.0,,24.0,,,,
75%,3526127.0,,26.0,,,,


### Removing unwanted leading and trailing spaces in text column using strip() method.

In [0]:
print('Before removing unwanted spaces\n', df.iloc[0,6])

Before removing unwanted spaces
             'To give thanks in solitude is enough. Thanksgiving has wings and goes where it must go.'  -- Victor Hugo         


In [0]:
df['text'] = df['text'].apply(lambda x: x.strip())

In [0]:
print('After removing unwanted spaces\n', df.iloc[0,6])

After removing unwanted spaces
 'To give thanks in solitude is enough. Thanksgiving has wings and goes where it must go.'  -- Victor Hugo


### Removing unwanted characters in text column using regular expressions.

In [0]:
import re

In [0]:
print('Before removing unwanted characters\n', df.iloc[0,6])

Before removing unwanted characters
 'To give thanks in solitude is enough. Thanksgiving has wings and goes where it must go.'  -- Victor Hugo


In [0]:
df['text'] = df['text'].apply(lambda x: re.sub(r'[^A-Za-z0-9 ]+', '', x))       # Keeping only letters, digits and spaces

In [0]:
print('After removing unwanted characters\n', df.iloc[0,6])

After removing unwanted characters
 To give thanks in solitude is enough Thanksgiving has wings and goes where it must go   Victor Hugo


### Converting text to lowercase using lower() method.

In [0]:
print('Before making all characters lower case\n', df.iloc[0,6])

Before making all characters lower case
 To give thanks in solitude is enough Thanksgiving has wings and goes where it must go   Victor Hugo


In [0]:
df['text'] = df['text'].apply(lambda x: x.lower())

In [0]:
print('After making all characters lower case\n', df.iloc[0,6])

After making all characters lower case
 to give thanks in solitude is enough thanksgiving has wings and goes where it must go   victor hugo


### Removing stopwords from text using package "NLTK".

In [0]:
from nltk.corpus import stopwords

In [0]:
print('Before removing stopwords\n', df.iloc[0,6])

Before removing stopwords
 to give thanks in solitude is enough thanksgiving has wings and goes where it must go   victor hugo


In [0]:
print('Length of text before removing stopwords -', len(df.iloc[0,6]))

Length of text before removing stopwords - 99


In [0]:
from collections import Counter
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

df['text'] = df['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords_dict]))

In [0]:
print('After removing stopwords\n', df.iloc[0,6])

After removing stopwords
 give thanks solitude enough thanksgiving wings goes must go victor hugo


In [0]:
print('Length of text after removing stopwords -', len(df.iloc[0,6]))

Length of text after removing stopwords - 71


### Merging all the label columns together, so that we have all the labels together for a particular sentence.
> - #### Label columns to merge: “gender”, “age”, “topic”, “sign”

In [0]:
#df['new'] = (blog[['gender', 'age', 'topic', 'sign']].iloc[0:6,:].apply(lambda x: ' '.join(str(x)),axis = 0)).tolist()

In [0]:
df['labels'] = df[['gender', 'age', 'topic', 'sign']].apply(lambda x: [','.join(x.astype(str))],axis=1)

In [0]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,labels
237952,449628,male,34,indUnk,Aries,"23,November,2003",give thanks solitude enough thanksgiving wings...,"[male,34,indUnk,Aries]"
414527,3611601,male,17,indUnk,Leo,"29,June,2004",quingar watched excited door handle turned cha...,"[male,17,indUnk,Leo]"
195465,1951423,female,24,Arts,Scorpio,"30,March,2004",miyah nice summary furze brought memories back...,"[female,24,Arts,Scorpio]"
295975,3053026,male,17,indUnk,Aquarius,"06,June,2004",mary admire commitment statements easy hide be...,"[male,17,indUnk,Aquarius]"
441337,2930795,male,23,Technology,Aquarius,"17,May,2004",urllink ullis roy orbison clingfilm site possi...,"[male,23,Technology,Aquarius]"
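
To make the merged format concrete, here is a minimal pure-Python sketch using a hypothetical row that mirrors the first record above. It shows the one-element list stored in `labels` and how later cells recover the individual labels from it.

```python
# Hypothetical row values, mirroring the first record shown above
row = {"gender": "male", "age": 34, "topic": "indUnk", "sign": "Aries"}

# Join the four label columns into one comma-separated string, wrapped in a
# one-element list (the same shape the notebook stores in df['labels'])
merged = [",".join(str(row[col]) for col in ["gender", "age", "topic", "sign"])]
print(merged)

# Later cells recover the individual labels with entry[0].split(',')
recovered = merged[0].split(",")
print(recovered)
```

Note that the age becomes a string, so every label lives in one shared label space.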


### Since we only need the text and labels columns, we will create a separate dataframe with `“text”` and `“labels”`.


In [0]:
blog_df = df[['text','labels']]

In [0]:
blog_df.head()

Unnamed: 0,text,labels
237952,give thanks solitude enough thanksgiving wings...,"[male,34,indUnk,Aries]"
414527,quingar watched excited door handle turned cha...,"[male,17,indUnk,Leo]"
195465,miyah nice summary furze brought memories back...,"[female,24,Arts,Scorpio]"
295975,mary admire commitment statements easy hide be...,"[male,17,indUnk,Aquarius]"
441337,urllink ullis roy orbison clingfilm site possi...,"[male,23,Technology,Aquarius]"


### Separating features and labels, and splitting the data into training and test sets.

In [0]:
features = blog_df['text']
labels = blog_df['labels']

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.30, random_state=5)

In [0]:
print(X_train.shape)

(143069,)


In [0]:
print(X_test.shape)

(61316,)


In [0]:
print(y_train.shape)

(143069,)


In [0]:
print(y_test.shape)

(61316,)
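
The split sizes can be checked by hand. This sketch assumes scikit-learn rounds the test share up and gives the remainder to the training set when only `test_size` is passed.

```python
import math

n_samples = 204385   # rows in the sampled DataFrame
test_size = 0.30     # value passed to train_test_split

# Assumption: the test set size is rounded up; training gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Both numbers agree with the shapes printed above.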


#### Importing and instantiating CountVectorizer with default parameters, except ngram_range=(1, 2)

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1, 2))

### Learning the 'vocabulary' of the `'text'` column.

In [0]:
vect.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
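
To illustrate what `ngram_range=(1, 2)` means, here is a rough pure-Python sketch of unigram and bigram extraction. `word_ngrams` is a hypothetical helper, not scikit-learn's implementation; it only borrows the default `token_pattern` shown in the output above.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    # Tokens of two or more word characters, as in the default token_pattern
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n consecutive tokens over the document
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

features = word_ngrams("give thanks solitude enough")
print(features)
```

Every distinct unigram and bigram becomes one column, which is why the vocabulary grows to millions of terms on this corpus.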

#### Transforming training data (X_train) into a 'document-term matrix'.

In [0]:
X_train_dtm = vect.transform(X_train)

#### Transforming test data (X_test) into a 'document-term matrix'.

In [0]:
X_test_dtm = vect.transform(X_test)

In [0]:
type(X_train_dtm)

scipy.sparse.csr.csr_matrix

In [0]:
print(X_train_dtm.shape)

(143069, 7626998)


In [0]:
print(X_test_dtm.shape)

(61316, 7626998)
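
Both matrices have the same number of columns because the vocabulary is learned from the training data only. A minimal pure-Python sketch of that fit/transform idea follows; the toy documents and the `transform` helper are hypothetical.

```python
train_docs = ["give thanks", "victor hugo"]    # hypothetical toy data
test_docs = ["thanks victor unseenword"]

# "fit": learn the vocabulary from the training documents only
vocab = {t: i for i, t in enumerate(sorted({w for d in train_docs for w in d.split()}))}

def transform(docs):
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:      # terms unseen during fit are dropped
                row[vocab[word]] += 1
        rows.append(row)
    return rows

train_matrix = transform(train_docs)
test_matrix = transform(test_docs)
print(len(train_matrix[0]), len(test_matrix[0]))
print(test_matrix)
```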


### Displaying Document Term Matrix for X_train

In [0]:
print(X_train_dtm)

  (0, 264411)	5
  (0, 264685)	1
  (0, 264864)	1
  (0, 265241)	1
  (0, 265553)	1
  (0, 266114)	1
  (0, 266452)	1
  (0, 266465)	1
  (0, 274723)	1
  (0, 275379)	1
  (0, 329889)	1
  (0, 331555)	1
  (0, 375168)	1
  (0, 375200)	1
  (0, 547513)	1
  (0, 547538)	1
  (0, 579596)	1
  (0, 579677)	1
  (0, 716165)	1
  (0, 717220)	1
  (0, 746577)	1
  (0, 748772)	1
  (0, 836580)	1
  (0, 836666)	1
  (0, 837886)	1
  :	:
  (143068, 3128619)	2
  (143068, 3128693)	1
  (143068, 3666061)	1
  (143068, 3666063)	1
  (143068, 4294908)	1
  (143068, 4294979)	1
  (143068, 4401521)	1
  (143068, 4404098)	1
  (143068, 4422730)	1
  (143068, 4807067)	1
  (143068, 4807083)	1
  (143068, 6631109)	1
  (143068, 6632708)	1
  (143068, 6884788)	1
  (143068, 6884899)	1
  (143068, 7323389)	1
  (143068, 7331820)	1
  (143068, 7414351)	1
  (143068, 7416466)	1
  (143068, 7416850)	1
  (143068, 7417210)	1
  (143068, 7497691)	1
  (143068, 7498502)	1
  (143068, 7617284)	1
  (143068, 7617285)	1
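
Each printed line reads `(row, column)  count`: the document index, the vocabulary index of a term, and how often that term occurs in the document. Zero entries are not stored. A minimal sketch of that layout on hypothetical toy documents, assuming column indices follow the sorted vocabulary:

```python
from collections import Counter

docs = ["give thanks give", "thanks hugo"]   # hypothetical toy documents

# Give each distinct term a column index in sorted order
vocab = {term: idx for idx, term in enumerate(sorted({w for d in docs for w in d.split()}))}

# Store only the non-zero counts as {(row, column): count}
triplets = {}
for row, doc in enumerate(docs):
    for term, count in Counter(doc.split()).items():
        triplets[(row, vocab[term])] = count

for (row, col), count in sorted(triplets.items()):
    print(f"  ({row}, {col})\t{count}")
```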


### Displaying Document Term Matrix for X_test

In [0]:
print(X_test_dtm)

  (0, 39014)	1
  (0, 46323)	3
  (0, 52256)	1
  (0, 71275)	1
  (0, 71621)	1
  (0, 75168)	1
  (0, 77281)	1
  (0, 88790)	1
  (0, 98285)	2
  (0, 101073)	1
  (0, 101691)	2
  (0, 144339)	1
  (0, 194028)	1
  (0, 200621)	1
  (0, 205982)	1
  (0, 220259)	4
  (0, 301565)	2
  (0, 341369)	1
  (0, 345119)	1
  (0, 353366)	1
  (0, 412042)	1
  (0, 414620)	1
  (0, 419936)	1
  (0, 420199)	1
  (0, 425014)	1
  :	:
  (61315, 1626308)	1
  (61315, 2099672)	1
  (61315, 2616558)	1
  (61315, 2616643)	1
  (61315, 3329223)	1
  (61315, 3722236)	1
  (61315, 3724931)	1
  (61315, 3755708)	1
  (61315, 4911318)	2
  (61315, 5065586)	1
  (61315, 5065800)	1
  (61315, 5466173)	1
  (61315, 5466397)	1
  (61315, 5544048)	1
  (61315, 5544349)	1
  (61315, 6044611)	1
  (61315, 6045278)	1
  (61315, 6523048)	1
  (61315, 6525066)	1
  (61315, 6631109)	1
  (61315, 6632658)	1
  (61315, 7047468)	1
  (61315, 7174823)	1
  (61315, 7292067)	1
  (61315, 7292756)	1


### Creating a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label.

#### Creating empty dict

In [0]:
myDict = dict()

In [0]:
for j in labels:
    my_list = j[0].split(',')                    # Splitting the merged label string into individual labels

    for item in my_list:
        if item in myDict:
            myDict[item] += 1
        else:
            myDict[item] = 1
for key, value in myDict.items():
    print("% s : % d" % (key, value))

male :  103649
34 :  6282
indUnk :  75025
Aries :  19423
17 :  24253
Leo :  16226
female :  100736
24 :  24041
Arts :  9798
Scorpio :  17158
Aquarius :  15036
23 :  21904
Technology :  12657
Fashion :  1443
Engineering :  3529
27 :  13975
Student :  46182
Capricorn :  14642
Banking :  1180
35 :  5310
38 :  2218
Manufacturing :  695
33 :  5287
Libra :  18780
Communications-Media :  6107
Pisces :  16226
16 :  21971
Taurus :  18787
25 :  20136
Virgo :  17988
HumanResources :  903
LawEnforcement-Security :  550
26 :  16500
Marketing :  1416
Environment :  156
Gemini :  15495
48 :  1072
Internet :  4886
37 :  2709
Military :  968
Publishing :  2388
Architecture :  495
Non-Profit :  4382
47 :  666
BusinessServices :  1348
15 :  12467
14 :  8192
Education :  8931
Biotech :  685
40 :  1553
Cancer :  19513
Sagittarius :  15111
46 :  829
Consulting :  1759
45 :  1324
13 :  3955
Government :  1990
Telecommunications :  1173
Museums-Libraries :  966
Transportation :  688
Science :  2163
36 :  4278

In [0]:
print(myDict)

{'male': 103649, '34': 6282, 'indUnk': 75025, 'Aries': 19423, '17': 24253, 'Leo': 16226, 'female': 100736, '24': 24041, 'Arts': 9798, 'Scorpio': 17158, 'Aquarius': 15036, '23': 21904, 'Technology': 12657, 'Fashion': 1443, 'Engineering': 3529, '27': 13975, 'Student': 46182, 'Capricorn': 14642, 'Banking': 1180, '35': 5310, '38': 2218, 'Manufacturing': 695, '33': 5287, 'Libra': 18780, 'Communications-Media': 6107, 'Pisces': 16226, '16': 21971, 'Taurus': 18787, '25': 20136, 'Virgo': 17988, 'HumanResources': 903, 'LawEnforcement-Security': 550, '26': 16500, 'Marketing': 1416, 'Environment': 156, 'Gemini': 15495, '48': 1072, 'Internet': 4886, '37': 2709, 'Military': 968, 'Publishing': 2388, 'Architecture': 495, 'Non-Profit': 4382, '47': 666, 'BusinessServices': 1348, '15': 12467, '14': 8192, 'Education': 8931, 'Biotech': 685, '40': 1553, 'Cancer': 19513, 'Sagittarius': 15111, '46': 829, 'Consulting': 1759, '45': 1324, '13': 3955, 'Government': 1990, 'Telecommunications': 1173, 'Museums-Libra
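
The same counts could also be produced with `collections.Counter`. A minimal sketch on hypothetical entries in the same one-element-list format (`demo_labels` and `label_counts` are illustrative names):

```python
from collections import Counter

# Hypothetical label entries in the same format as df['labels']
demo_labels = [["male,34,indUnk,Aries"], ["male,17,indUnk,Leo"], ["female,24,Arts,Scorpio"]]

# Split every merged string and count each individual label
label_counts = Counter(item for entry in demo_labels for item in entry[0].split(","))
print(label_counts["male"], label_counts["indUnk"], label_counts["Arts"])
```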

### Transforming the labels

In [0]:
list_class = [] 
for key in myDict.keys(): 
    list_class.append(key) 
list_class_array=np.array(list_class)

In [0]:
list_class_array

array(['male', '34', 'indUnk', 'Aries', '17', 'Leo', 'female', '24',
       'Arts', 'Scorpio', 'Aquarius', '23', 'Technology', 'Fashion',
       'Engineering', '27', 'Student', 'Capricorn', 'Banking', '35', '38',
       'Manufacturing', '33', 'Libra', 'Communications-Media', 'Pisces',
       '16', 'Taurus', '25', 'Virgo', 'HumanResources',
       'LawEnforcement-Security', '26', 'Marketing', 'Environment',
       'Gemini', '48', 'Internet', '37', 'Military', 'Publishing',
       'Architecture', 'Non-Profit', '47', 'BusinessServices', '15', '14',
       'Education', 'Biotech', '40', 'Cancer', 'Sagittarius', '46',
       'Consulting', '45', '13', 'Government', 'Telecommunications',
       'Museums-Libraries', 'Transportation', 'Science', '36',
       'Sports-Recreation', 'Agriculture', '41', 'RealEstate', '44',
       'Law', 'Advertising', '43', 'Accounting', 'Automotive',
       'Chemicals', 'Tourism', 'Religion', '42', 'Construction', '39',
       'InvestmentBanking', 'Maritime'], dtyp

In [0]:
# Converting each merged label string into a set of labels, the format MultiLabelBinarizer accepts
y_train_pass = [set(i[0].split(',')) for i in y_train]
y_test_pass = [set(i[0].split(',')) for i in y_test]

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
mlb = MultiLabelBinarizer()

In [0]:
mlb.fit(y_train_pass)

MultiLabelBinarizer(classes=None, sparse_output=False)

In [0]:
mlb.transform(y_train_pass)

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 1, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0]])

In [0]:
mlb.transform(y_test_pass)

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 1, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0]])

In [0]:
len(mlb.transform(y_test_pass))

61316

In [0]:
# Retrieving the labels
mlb.classes_

array(['13', '14', '15', '16', '17', '23', '24', '25', '26', '27', '33',
       '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44',
       '45', '46', '47', '48', 'Accounting', 'Advertising', 'Agriculture',
       'Aquarius', 'Architecture', 'Aries', 'Arts', 'Automotive',
       'Banking', 'Biotech', 'BusinessServices', 'Cancer', 'Capricorn',
       'Chemicals', 'Communications-Media', 'Construction', 'Consulting',
       'Education', 'Engineering', 'Environment', 'Fashion', 'Gemini',
       'Government', 'HumanResources', 'Internet', 'InvestmentBanking',
       'Law', 'LawEnforcement-Security', 'Leo', 'Libra', 'Manufacturing',
       'Maritime', 'Marketing', 'Military', 'Museums-Libraries',
       'Non-Profit', 'Pisces', 'Publishing', 'RealEstate', 'Religion',
       'Sagittarius', 'Science', 'Scorpio', 'Sports-Recreation',
       'Student', 'Taurus', 'Technology', 'Telecommunications', 'Tourism',
       'Transportation', 'Virgo', 'female', 'indUnk', 'male'],
      dtype
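
As the sorted `classes_` output suggests, each label gets one column and each sample becomes a row of 0/1 flags. A rough pure-Python sketch of that encoding and its inverse follows; `encode` and `decode` are hypothetical helpers, not the scikit-learn implementation.

```python
label_sets = [{"male", "34", "Aries"}, {"female", "24", "Arts"}]   # hypothetical samples

# One column per distinct label, in sorted order
classes = sorted(set().union(*label_sets))

def encode(labels):
    # 1 where the sample carries the label, 0 otherwise
    return [1 if c in labels else 0 for c in classes]

def decode(row):
    # Map the 1-flags back to label names
    return tuple(c for c, flag in zip(classes, row) if flag)

encoded = [encode(s) for s in label_sets]
print(classes)
print(encoded)
print(decode(encoded[0]))
```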

In [0]:
y_trn_mlb = mlb.transform(y_train_pass)

In [0]:
y_test_mlb = mlb.transform(y_test_pass)

## Choose a classifier

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

### Initializing the logistic regression model with the 'lbfgs' solver, which supports an L2 penalty or no penalty.

In [0]:
clf = LogisticRegression(solver = 'lbfgs')
clf = OneVsRestClassifier(clf)

In [0]:
clf.fit(X_train_dtm, y_trn_mlb)                                                 # Fitting the classifier 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)
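
The log above reports that the iteration limit was reached and suggests increasing `max_iter`. Below is a minimal sketch on hypothetical toy data showing where that parameter goes; the value 1000 is an untuned assumption, and `demo_clf`, `X` and `Y` are illustrative names. It also shows that `OneVsRestClassifier` fits one binary estimator per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: 6 samples, 2 count features, 2 binary label columns
X = np.array([[3, 0], [2, 0], [4, 1], [0, 3], [1, 4], [0, 2]])
Y = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

# max_iter raised above the default of 100 (assumed value, not tuned)
demo_clf = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
demo_clf.fit(X, Y)

print(len(demo_clf.estimators_))     # one fitted estimator per label column
print(demo_clf.predict(X).shape)     # predictions have the same shape as Y
```

On the full document-term matrix a higher `max_iter` will also lengthen training time, so the value needs to be chosen with that in mind.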

In [0]:
y_pred_class = clf.predict(X_test_dtm)                                          # Predicting on test data

In [0]:
from sklearn import metrics                                                     
metrics.accuracy_score(y_test_mlb, y_pred_class)

0.1293

In [0]:
print(metrics.classification_report(y_test_mlb, y_pred_class))

              precision    recall  f1-score   support

           0       0.73      0.20      0.32       337
           1       0.75      0.22      0.34       708
           2       0.74      0.29      0.42      1303
           3       0.77      0.37      0.50      1651
           4       0.68      0.31      0.42      2512
           5       0.63      0.21      0.32      2129
           6       0.68      0.30      0.41      2367
           7       0.65      0.22      0.33      1758
           8       0.67      0.25      0.36      1625
           9       0.65      0.24      0.35      1619
          10       0.51      0.10      0.17       539
          11       0.92      0.49      0.64       486
          12       0.75      0.25      0.37       959
          13       0.85      0.37      0.52       663
          14       0.35      0.09      0.14       143
          15       0.83      0.26      0.39       167
          16       0.71      0.17      0.27       120
          17       0.44    

  _warn_prf(average, modifier, msg_start, len(result))


In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

In [0]:
print("F1_macro: " , (f1_score(y_test_mlb, y_pred_class, average='macro')))
print("Recall micro: " , recall_score(y_test_mlb, y_pred_class, average='micro'))
print("F1_micro: " , (f1_score(y_test_mlb, y_pred_class, average='micro')))
print("Recall macro: " , recall_score(y_test_mlb, y_pred_class, average='macro'))
print("Average Precision: " ,(average_precision_score(y_test_mlb, y_pred_class, average='micro')))
print("Accuracy:" , (accuracy_score(y_test_mlb, y_pred_class))) 

F1_macro:  0.30538949765367895
Recall micro:  0.3916625
F1_micro:  0.5051713435819717
Recall macro:  0.2120484063104803
Average Precision:  0.3090148252939908
Accuracy: 0.1293
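
Accuracy is much lower than micro F1 here because, for multilabel targets, accuracy counts a sample as correct only when every one of its labels matches. A minimal pure-Python sketch on hypothetical multi-hot data shows the difference between that exact-match accuracy, micro F1 (pooled over all labels) and macro F1 (averaged per label):

```python
y_true = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 0, 0]]   # hypothetical multi-hot labels
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

# Exact-match (subset) accuracy: all labels of a sample must be right
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# True positives, false positives and false negatives per label column
counts = []
for k in range(len(y_true[0])):
    tp = sum(t[k] == 1 and p[k] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[k] == 0 and p[k] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[k] == 1 and p[k] == 0 for t, p in zip(y_true, y_pred))
    counts.append((tp, fp, fn))

micro_f1 = f1(*[sum(c[i] for c in counts) for i in range(3)])   # pool counts, then F1
macro_f1 = sum(f1(*c) for c in counts) / len(counts)            # F1 per label, then mean
print(subset_accuracy, micro_f1, macro_f1)
```

A model that gets the frequent labels right but misses the rare ones scores well on micro F1 and poorly on exact-match accuracy, which is the pattern in the results above.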


## Print true label and predicted label for any five examples

In [0]:
y_test_pred_inversed = mlb.inverse_transform(y_pred_class)
y_test_inversed = mlb.inverse_transform(y_test_mlb)
for i in range(15,20):
    print('True labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        ','.join(y_test_inversed[i]),
        ','.join(y_test_pred_inversed[i])
    ))

True labels:	36,Pisces,Technology,male
Predicted labels:	Pisces,male


True labels:	25,Cancer,Non-Profit,male
Predicted labels:	male


True labels:	35,Aries,Technology,male
Predicted labels:	male


True labels:	25,Gemini,indUnk,male
Predicted labels:	male


True labels:	17,Virgo,indUnk,male
Predicted labels:	17,Virgo,indUnk,male




# This was run on a small portion of the dataset, so the accuracy is not great. I am still working on the model, but it takes a long time to run, so the latest code file has not been updated yet.