# Project Description:

Classification is one of the most common machine learning tasks you will meet in practice.
Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to infer
information about a writer from the text alone, without knowing anything else about them.
We are going to create a classifier that predicts several attributes of the author of a given text.
We have framed it as a multi-label classification problem.

# Dataset information:

### Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person. 


Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)


All bloggers included in the corpus fall into one of three age groups:

8240 "10s" blogs (ages 13-17)

8086 "20s" blogs (ages 23-27)

2994 "30s" blogs (ages 33-47)



For each age group, there is an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped, with two exceptions: individual posts by a single blogger are separated by the
date of the following post, and links within a post are denoted by the label urllink.


Link to dataset:
https://www.kaggle.com/rtatman/blog-authorship-corpus/


In [0]:
# import necessary libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,f1_score, accuracy_score, recall_score, precision_score
from nltk.stem import WordNetLemmatizer

# Step 1:

Load the dataset (5 points)
  
Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
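
One way to follow this tip is to cap the number of rows at read time instead of loading the full file and slicing afterwards. The sketch below is illustrative only: it assumes pandas is installed, and the in-memory CSV is a made-up stand-in for `blogtext.csv` with the same column names.

```python
import io

import pandas as pd

# Hypothetical stand-in for blogtext.csv; the rows are invented for illustration.
csv_text = '''id,gender,age,topic,sign,date,text
1,male,15,Student,Leo,"14,May,2004",first post
1,male,15,Student,Leo,"13,May,2004",second post
2,female,24,indUnk,Scorpio,"29,December,2003",third post
'''

# nrows stops parsing after the requested number of data rows,
# so the rest of a large file is never loaded into memory.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample.shape)
```

With the real file, the same `nrows` argument would be passed alongside the file path.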

In [115]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
# Read Data
blog = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Stat_NLP/blogtext.csv')

In [117]:
blog.head()

|   | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


In [118]:
# Take only the first 10,000 rows for preprocessing and training
blog_df = blog[:10000]
print(blog_df.shape)
blog_df["text"].loc[0]

(10000, 7)


'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

In [119]:
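# Note: without parentheses this displays the bound method (in effect the frame's repr);
# call blog_df.info() to get the dtype and non-null summary instead.
blog_df.info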

<bound method DataFrame.info of            id  ...                                               text
0     2059027  ...             Info has been found (+/- 100 pages,...
1     2059027  ...             These are the team members:   Drewe...
2     2059027  ...             In het kader van kernfusie op aarde...
3     2059027  ...                   testing!!!  testing!!!          
4     3581210  ...               Thanks to Yahoo!'s Toolbar I can ...
...       ...  ...                                                ...
9995  1705136  ...          take me home with you forever where I ...
9996  1705136  ...          seductive secretness behind doors warn...
9997  1705136  ...          For being so kind to me when I need yo...
9998  1705136  ...          blurry outside sounds as people mingle...
9999  1705136  ...          my body feels broken while my mind rej...

[10000 rows x 7 columns]>

# Step 2:

Preprocess rows of the “text” column (7.5 points)

a. Remove unwanted characters

b. Convert text to lowercase

c. Remove unwanted spaces

d. Remove stopwords
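
The four steps can also be expressed as one small function, which makes the order of operations explicit. This is a minimal sketch using only the standard library. The `clean_text` name and the tiny `STOPWORDS` set are assumptions made for illustration; the notebook itself uses NLTK's English stopword list.

```python
import re

# Tiny illustrative stopword set (an assumption); NLTK's English list is much longer.
STOPWORDS = {"has", "been", "and", "of", "i", "have", "to", "our", "it", "now"}


def clean_text(text):
    text = re.sub(r"[^A-Za-z]", " ", text)  # a. replace non-letters with spaces
    text = text.lower()                     # b. lowercase
    tokens = text.split()                   # c. split() discards all extra whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # d. drop stopwords
    return " ".join(tokens)


raw = "   Info has been found (+/- 100 pages, and 4.5 MB of .pdf files)   "
print(clean_text(raw))  # info found pages mb pdf files
```

Applied to a DataFrame column, a function like this could be passed to `Series.apply`.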

In [120]:
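# Replace every character that is not an ASCII letter with a space
# (regex=True is required on recent pandas versions, where the default is a literal match)
blog_df['text'] = blog_df['text'].str.replace('[^A-Za-z]', ' ', regex=True)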
blog_df["text"].loc[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


'           Info has been found          pages  and     MB of  pdf files  Now i have to wait untill our team leader has processed it and learns html          '
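
The SettingWithCopyWarning shown above appears because `blog_df` was created by slicing `blog`, so pandas cannot tell whether the assignment is meant to reach the original frame. A common way to avoid it is to take an explicit copy of the slice before assigning to it. The sketch below uses a made-up frame and assumes pandas is installed.

```python
import pandas as pd

# Made-up data for illustration.
full = pd.DataFrame({"text": ["Hello, World!", "4.5 MB of .pdf", "third row"]})

# .copy() makes the subset an independent frame, so later assignments are unambiguous.
subset = full[:2].copy()
subset["text"] = subset["text"].str.replace("[^A-Za-z]", " ", regex=True)

print(subset["text"].tolist())
print(full["text"].tolist())  # the original frame is left untouched
```

In the notebook, the equivalent change would be to create the working frame as `blog[:10000].copy()`.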

In [121]:
# Converting to lower case
blog_df['text'] = blog_df['text'].str.lower()
blog_df["text"].loc[0]


'           info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html          '

In [122]:
# Removing leading and trailing whitespace (repeated inner spaces are dropped by split() below)
blog_df["text"] = blog_df["text"].str.strip()
blog_df["text"].loc[0]


'info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html'

In [123]:
blog_df["text"] = blog_df["text"].str.split()  # splitting each row of text data into individual words.
# So it can be iterated through to remove only stopwords in next steps.


### Removing Stopwords

In [124]:
nltk.download('stopwords')
stop = stopwords.words('english')
def removestopwords(y):
    """Return the tokens in y that are not stopwords, joined by single spaces."""
    stopwordremoved = [w for w in y if w not in stop]
    return " ".join(stopwordremoved)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [125]:
text_column_size = blog_df["text"].size
print("text column size :", text_column_size)

# Initialize an empty list to hold the text after stop word removal
cleaner_blog_df_sample_text = []

# Loop over each text
for i in range( 0, text_column_size):
    cleaner_blog_df_sample_text.append(removestopwords(blog_df["text"][i]))

text column size : 10000


In [126]:
cleaner_blog_df_sample_text[5015]

'hee hee two towers comes dvd tuesday let forget except probably buy mom wait buy three together sigh go rent watch eight times wait till return king comes come go see probably finals something day last year chantele went afterwards looking pics stuff though hee hee eowyn still favorite'

In [127]:
#Replace text column with cleaner_blog_df_sample_text 
blog_df["text"] = cleaner_blog_df_sample_text
blog_df["text"][5015]


'hee hee two towers comes dvd tuesday let forget except probably buy mom wait buy three together sigh go rent watch eight times wait till return king comes come go see probably finals something day last year chantele went afterwards looking pics stuff though hee hee eowyn still favorite'

### Lemmatization

In [128]:
nltk.download('wordnet')
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    lemm = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    return(" ".join(lemm)) 

blog_df["text"] = blog_df.text.apply(lemmatize_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



In [129]:
blog_df["text"][5015] # Lemmatized output

'hee hee two tower come dvd tuesday let forget except probably buy mom wait buy three together sigh go rent watch eight time wait till return king come come go see probably final something day last year chantele went afterwards looking pic stuff though hee hee eowyn still favorite'

# Step 3:

As we want to make this into a multi-label classification problem, you are required to merge
all the label columns together, so that we have all the labels together for a particular sentence
(7.5 points)

a. Label columns to merge: “gender”, “age”, “topic”, “sign”

b. After completing the previous step, there should be only two columns in your data
frame, i.e. “text” and “labels”
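
As an illustration of the merge, the sketch below collects the four label columns into one list per row on a made-up frame. Keeping the labels as a list rather than a comma-joined string is an assumption of this sketch, not what the notebook does; it avoids having to split the string again before binarizing. The `label_cols` name and the sample values are hypothetical.

```python
import pandas as pd

# Made-up rows with the same column names as the corpus.
df = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["Student", "Non-Profit"],
    "sign": ["Leo", "Scorpio"],
    "text": ["testing testing", "ponder finished book"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast to str so the numeric age can sit next to the text labels,
# then gather each row's four values into one Python list.
rows = df[label_cols].astype(str).values.tolist()
df["labels"] = pd.Series(rows, index=df.index)

merged = df[["text", "labels"]]
print(merged)
```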

In [130]:
#name of available columns 
blog_df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [131]:
blog_df.shape

(10000, 7)

In [132]:
blog_df.sample(2)

|   | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 443 | 649790 | female | 24 | indUnk | Scorpio | 29,December,2003 | ponder finished book francine river called sho... |
| 1402 | 589736 | male | 35 | Technology | Aries | 05,August,2004 | perhaps email sent confirmation email sent bac... |


In [133]:
# Merge 'gender', 'age', 'topic' and 'sign' into a single comma-separated 'labels' column
blog_df['age'] = blog_df['age'].astype(str)
blog_df['labels'] = blog_df[['gender','age','topic','sign']].apply(lambda x: ','.join(x), axis = 1) 
blog_df_merged = blog_df.drop(labels = ['date','gender', 'age','topic','sign','id'], axis = 1)
blog_df_merged.head()


|   | text | labels |
|---|---|---|
| 0 | info found page mb pdf file wait untill team l... | male,15,Student,Leo |
| 1 | team member drewes van der laag urllink mail r... | male,15,Student,Leo |
| 2 | het kader van kernfusie op aarde maak je eigen... | male,15,Student,Leo |
| 3 | testing testing | male,15,Student,Leo |
| 4 | thanks yahoo toolbar capture url popups mean s... | male,33,InvestmentBanking,Aquarius |


In [134]:
blog_df_merged.shape

(10000, 2)

# Step 4: 

Separate features and labels, and split the data into training and testing (5 points)

In [135]:
feature = blog_df_merged['text']
blog_df_merged['labels'] = blog_df_merged['labels'].str.lower()
labels = blog_df_merged['labels']
X_train, X_test, Y_train, Y_test = train_test_split(feature,labels, test_size = 0.33, random_state = 143)
Y_train.shape

(6700,)

# Step 5:

Vectorize the features (5 points)
a. Create a Bag of Words using count vectorizer

  1. Use ngram_range=(1, 2)
  2. Vectorize training and testing features
  
b. Print the term-document matrix
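
To make the effect of `ngram_range=(1, 2)` concrete, here is a small sketch on made-up sentences, assuming scikit-learn is installed. The vocabulary is learned from the training text only and then reused for the test text, so n-grams that never occurred in training are dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
train_docs = ["two tower come dvd", "return king come"]
test_docs = ["two king come"]

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_tr = vec.fit_transform(train_docs)       # learn the vocabulary on training text only
X_te = vec.transform(test_docs)            # reuse that vocabulary for the test text

print(sorted(vec.vocabulary_))  # the learned unigrams and bigrams
print(X_tr.toarray())           # dense view of the document-term counts
print(X_te.toarray())
```

The bigram "two king" from the test sentence has no column, because it was not seen during fitting.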

In [136]:
# Creating Bag of words
vectorizer = CountVectorizer(min_df = 2,ngram_range = (1,2),stop_words = "english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
print("X_train shape & sample",X_train.shape)
X_train[0]

X_train shape & sample (6700, 56767)


<1x56767 sparse matrix of type '<class 'numpy.int64'>'
	with 59 stored elements in Compressed Sparse Row format>

In [137]:
print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=2,
                ngram_range=(1, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)


In [138]:
print(X_train)

  (0, 53628)	1
  (0, 47750)	1
  (0, 45744)	2
  (0, 31071)	2
  (0, 37522)	2
  (0, 5375)	2
  (0, 30935)	1
  (0, 23953)	1
  (0, 6466)	1
  (0, 23299)	2
  (0, 30533)	1
  (0, 2147)	1
  (0, 15695)	1
  (0, 47976)	1
  (0, 41349)	1
  (0, 28732)	1
  (0, 1447)	1
  (0, 1183)	1
  (0, 23017)	1
  (0, 45980)	2
  (0, 5005)	2
  (0, 41392)	1
  (0, 37551)	1
  (0, 8310)	1
  (0, 31159)	1
  :	:
  (6699, 49089)	1
  (6699, 3211)	1
  (6699, 29977)	1
  (6699, 12832)	1
  (6699, 23580)	1
  (6699, 8831)	1
  (6699, 56185)	1
  (6699, 18407)	1
  (6699, 18461)	1
  (6699, 43834)	1
  (6699, 33143)	1
  (6699, 3207)	1
  (6699, 14866)	1
  (6699, 3680)	1
  (6699, 43452)	1
  (6699, 2132)	1
  (6699, 4429)	1
  (6699, 3567)	1
  (6699, 16498)	1
  (6699, 43573)	1
  (6699, 49134)	1
  (6699, 50103)	1
  (6699, 36396)	1
  (6699, 44080)	1
  (6699, 7022)	1


# Step 6:

Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. (5 points)

In [139]:
vectorizer_labels = CountVectorizer(min_df = 1,ngram_range = (1,1),stop_words = "english")
labels_vector = vectorizer_labels.fit_transform(labels)
vectorizer_labels.vocabulary_  # note: maps each label token to its column index, not to its count

{'13': 0,
 '14': 1,
 '15': 2,
 '16': 3,
 '17': 4,
 '23': 5,
 '24': 6,
 '25': 7,
 '26': 8,
 '27': 9,
 '33': 10,
 '34': 11,
 '35': 12,
 '36': 13,
 '37': 14,
 '38': 15,
 '39': 16,
 '40': 17,
 '41': 18,
 '42': 19,
 '43': 20,
 '44': 21,
 '45': 22,
 '46': 23,
 'accounting': 24,
 'aquarius': 25,
 'aries': 26,
 'arts': 27,
 'automotive': 28,
 'banking': 29,
 'businessservices': 30,
 'cancer': 31,
 'capricorn': 32,
 'communications': 33,
 'consulting': 34,
 'education': 35,
 'engineering': 36,
 'fashion': 37,
 'female': 38,
 'gemini': 39,
 'humanresources': 40,
 'indunk': 41,
 'internet': 42,
 'investmentbanking': 43,
 'law': 44,
 'lawenforcement': 45,
 'leo': 46,
 'libra': 47,
 'libraries': 48,
 'male': 49,
 'marketing': 50,
 'media': 51,
 'museums': 52,
 'non': 53,
 'pisces': 54,
 'profit': 55,
 'publishing': 56,
 'recreation': 57,
 'religion': 58,
 'sagittarius': 59,
 'science': 60,
 'scorpio': 61,
 'security': 62,
 'sports': 63,
 'student': 64,
 'taurus': 65,
 'technology': 66,
 'telecommun

In [140]:
# Extracting only the keys of the dictionary above, which are the unique labels. This set of labels is used as the classes
# for MultiLabelBinarizer below.
label_classes = []  
for key in vectorizer_labels.vocabulary_.keys():
    label_classes.append(key)
    
print(sorted(label_classes))

['13', '14', '15', '16', '17', '23', '24', '25', '26', '27', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', 'accounting', 'aquarius', 'aries', 'arts', 'automotive', 'banking', 'businessservices', 'cancer', 'capricorn', 'communications', 'consulting', 'education', 'engineering', 'fashion', 'female', 'gemini', 'humanresources', 'indunk', 'internet', 'investmentbanking', 'law', 'lawenforcement', 'leo', 'libra', 'libraries', 'male', 'marketing', 'media', 'museums', 'non', 'pisces', 'profit', 'publishing', 'recreation', 'religion', 'sagittarius', 'science', 'scorpio', 'security', 'sports', 'student', 'taurus', 'technology', 'telecommunications', 'virgo']
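
The `vocabulary_` attribute above maps each token to a column index, so it lists the unique labels but not how often each one occurs. The `non` and `profit` entries also suggest that a hyphenated topic was split in two by the default tokenizer. A minimal alternative sketch using only the standard library is shown below; the label strings are made up, and `label_counts` is a hypothetical name.

```python
from collections import Counter

# Made-up label strings in the same "gender,age,topic,sign" form used above.
label_strings = [
    "male,15,student,leo",
    "male,33,investmentbanking,aquarius",
    "female,24,non-profit,leo",
]

# Splitting on commas keeps a hyphenated topic such as "non-profit" as one label.
label_counts = Counter(
    label for row in label_strings for label in row.split(",")
)

print(dict(label_counts))
```

Here the key is the label name and the value is the number of rows carrying it, which is the dictionary the step asks for.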


# Step 7:

Transform the labels - (7.5 points)
As noted before, each example in this task can have multiple tags. To handle
this kind of prediction, we need to transform the labels into binary form, so that the prediction is
a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

a. Convert your train and test labels using MultiLabelBinarizer
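
As a small illustration of what the transformation produces, the sketch below binarizes made-up label lists, assuming scikit-learn is installed. When no `classes` argument is passed, the binarizer learns the classes from the data it is fitted on and sorts them; the `mlb_demo` name is hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists for illustration.
train_labels = [["male", "15", "student", "leo"],
                ["female", "24", "indunk", "scorpio"]]
test_labels = [["male", "24", "student", "scorpio"]]

mlb_demo = MultiLabelBinarizer()
Y_tr = mlb_demo.fit_transform(train_labels)  # learns the sorted set of classes
Y_te = mlb_demo.transform(test_labels)       # same column order as in training

print(mlb_demo.classes_)
print(Y_te)
```

Each row of the result has one column per class, with a 1 wherever that label applies to the example.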

In [0]:
mlb = MultiLabelBinarizer(classes = label_classes)  # initialising multilabelbinariser with all unique possible classes

In [142]:
# Converting the entire set of labels into the list-of-lists format required by mlb
labels = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in labels]]
labels[30]

['male', '33', 'investmentbanking', 'aquarius']

In [143]:
labels_trans = mlb.fit(labels) # fitting mlb on the entire set of labels
labels_trans

MultiLabelBinarizer(classes=['male', '15', 'student', 'leo', '33',
                             'investmentbanking', 'aquarius', 'female', '14',
                             'indunk', 'aries', '25', 'capricorn', '17',
                             'gemini', '23', 'non', 'profit', 'cancer',
                             'banking', '37', 'sagittarius', '26', '24',
                             'scorpio', '27', 'education', '45', 'engineering',
                             'libra', ...],
                    sparse_output=False)

In [144]:
# Convert Y_train into the format required by mlb
Y_train = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in Y_train]]
Y_train[30]

['female', '16', 'indunk', 'capricorn']

In [145]:
Y_train_trans = mlb.transform(Y_train) # transforming train labels using mlb, which was fitted on all unique labels in the entire data set
Y_train_trans[30]


array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

In [146]:
Y_train_trans.shape

(6700, 69)

In [147]:
#Convert Y_test into a format as required by mlb 
Y_test = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in Y_test]]
Y_test_trans = mlb.transform(Y_test) # transforming test labels.
print(Y_test[30])

['male', '35', 'technology', 'aries']



In [148]:
Y_test_trans[30]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

In [149]:
len(mlb.classes_)

69

In [150]:
mlb.classes_

array(['male', '15', 'student', 'leo', '33', 'investmentbanking',
       'aquarius', 'female', '14', 'indunk', 'aries', '25', 'capricorn',
       '17', 'gemini', '23', 'non', 'profit', 'cancer', 'banking', '37',
       'sagittarius', '26', '24', 'scorpio', '27', 'education', '45',
       'engineering', 'libra', 'science', '34', '41', 'communications',
       'media', 'businessservices', 'sports', 'recreation', 'virgo',
       'taurus', 'arts', 'pisces', '44', '16', 'internet', 'museums',
       'libraries', 'accounting', '39', '35', 'technology', '36', 'law',
       '46', 'consulting', 'automotive', '42', 'religion', '13',
       'fashion', '38', '43', 'publishing', '40', 'marketing',
       'lawenforcement', 'security', 'humanresources',
       'telecommunications'], dtype=object)

In [151]:
Y_train_trans[10]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0])

In [152]:
Y_train[10]

['male', '36', 'fashion', 'aries']

# Step 8:

Choose a classifier - (5 points)

In this task, we suggest using the One-vs-Rest approach, which is implemented in
OneVsRestClassifier class. In this approach, k classifiers (one per tag) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

b. As One-vs-Rest approach might not have been discussed in the sessions, we are
providing you the code for that
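
To show what "one classifier per tag" means in practice, here is a toy sketch on made-up documents and tags, assuming scikit-learn is installed. After fitting, the wrapper holds one fitted logistic regression for each column of the binarized label matrix; the `ovr` name is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up toy data for illustration.
docs = ["homework exam school", "school exam teacher",
        "market stock bank", "bank stock invest"]
tags = [["student", "leo"], ["student", "aries"],
        ["banking", "leo"], ["banking", "aries"]]

X = CountVectorizer().fit_transform(docs)
Y = MultiLabelBinarizer().fit_transform(tags)  # 4 label columns

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
ovr.fit(X, Y)

print(len(ovr.estimators_))  # one fitted binary model per label column
print(ovr.predict(X).shape)  # predictions come back as a 0/1 mask per example
```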

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

In [0]:
clf = LogisticRegression(solver='lbfgs',max_iter=1000)
clf = OneVsRestClassifier(clf)

In [155]:
clf.fit(X_train,Y_train_trans) # Fitting on  train data


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=1000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

# Step 9:

Fit the classifier, make predictions and get the accuracy (5 points)

a. Print the following

  i. Accuracy score

  ii. F1 score

  iii. Average precision score

  iv. Average recall score
  
  v. Tip: Make sure you are familiar with all of them. How would you expect the
things to work for the multi-label scenario? Read about micro/macro/weighted
averaging
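
The difference between the averaging modes is easiest to see when the scores are computed by hand. The sketch below uses only the standard library and made-up 0/1 label masks; the helper name `f1` is hypothetical. Micro averaging pools the true positive, false positive and false negative counts over all labels, macro averaging computes F1 per label and takes the plain mean, and accuracy in the multi-label setting counts an example only when every label matches.

```python
# Rows are examples, columns are labels (made-up values for illustration).
y_true = [[1, 0, 1],
          [1, 1, 0],
          [1, 0, 0],
          [0, 0, 1]]
y_pred = [[1, 0, 0],
          [1, 0, 0],
          [1, 0, 1],
          [0, 0, 1]]


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


n_labels = len(y_true[0])
counts = []
for j in range(n_labels):
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    counts.append((tp, fp, fn))

# Micro: pool the counts over all labels, then compute a single F1.
micro_f1 = f1(*(sum(c[k] for c in counts) for k in range(3)))
# Macro: compute F1 per label, then take the unweighted mean.
macro_f1 = sum(f1(*c) for c in counts) / n_labels
# Subset accuracy: an example counts only if its whole label mask is correct.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1, macro_f1, subset_acc)
```

In this toy case the frequent first label is predicted perfectly while the rare second label is always missed, so micro F1 (8/11) is well above macro F1 (0.5), and subset accuracy (0.25) is lower than both. A gap of this kind between micro and macro scores usually indicates that rare labels are predicted poorly.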

In [156]:
# Train Accuracy
print("Train Accuracy:",clf.score(X_train,Y_train_trans))

Train Accuracy: 0.9188059701492537


In [0]:
Y_pred = clf.predict(X_test)

In [158]:
print("Test Accuracy:" + str(accuracy_score(Y_test_trans, Y_pred)))
print("F1: " + str(f1_score(Y_test_trans, Y_pred, average='micro')))
print("F1_macro: " + str(f1_score(Y_test_trans, Y_pred, average='macro')))
print("Precision: " + str(precision_score(Y_test_trans, Y_pred, average='micro')))
print("Precision_macro: " + str(precision_score(Y_test_trans, Y_pred, average='macro')))
print("Recall: " + str(recall_score(Y_test_trans, Y_pred, average='micro')))
print("Recall_macro: " + str(recall_score(Y_test_trans, Y_pred, average='macro')))

Test Accuracy:0.333030303030303
F1: 0.6532452186091295
F1_macro: 0.2530621472043838
Precision: 0.7653573992411035
Precision_macro: 0.40256843112142954
Recall: 0.5697816460528325
Recall_macro: 0.2002225960042618



# Step 10:

Print true label and predicted label for any five examples (7.5 points)
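
A compact way to print several examples side by side is to loop over the inverse-transformed masks. The sketch below uses made-up labels and hand-made "predictions", and assumes NumPy and scikit-learn are installed. Note that `inverse_transform` returns labels in the binarizer's class order, which is why they can appear in a different order than in the original label lists.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

mlb_demo = MultiLabelBinarizer()
Y_true = mlb_demo.fit_transform([["male", "35", "aries"],
                                 ["female", "24", "leo"]])

# Pretend predictions (made up): first row fully right, second row only partly right.
Y_hat = np.array(Y_true, copy=True)
Y_hat[1] = 0
Y_hat[1, list(mlb_demo.classes_).index("female")] = 1

predicted = mlb_demo.inverse_transform(Y_hat)
actual = mlb_demo.inverse_transform(Y_true)

for i, (pred, act) in enumerate(zip(predicted, actual), start=1):
    print(f"Example {i} - predicted: {pred} | actual: {act}")
```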

In [0]:
Y_pred_inv = mlb.inverse_transform(Y_pred)   # inverse transforming predicted label data
Y_test_trans_inv = mlb.inverse_transform(Y_test_trans) # inverse transforming original test label data

In [160]:
print("Example 1 - predicted :",Y_pred_inv[0])
print("Example 1 - Actual :",Y_test_trans_inv[0])
print("Example 1 - Actual_before mlb transformation :",Y_test[0])

Example 1 - predicted : ('male',)
Example 1 - Actual : ('female', 'indunk', '24', 'scorpio')
Example 1 - Actual_before mlb transformation : ['female', '24', 'indunk', 'scorpio']


In [161]:
print("Example 2 - predicted :",Y_pred_inv[30])
print("Example 2 - Actual :",Y_test_trans_inv[30])
print("Example 2 - Actual_before mlb transformation :",Y_test[30])

Example 2 - predicted : ('female', 'aries')
Example 2 - Actual : ('male', 'aries', '35', 'technology')
Example 2 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [162]:
print("Example 3 - predicted :",Y_pred_inv[46])
print("Example 3 - Actual :",Y_test_trans_inv[46])
print("Example 3 - Actual_before mlb transformation :",Y_test[46])

Example 3 - predicted : ('male', 'aries', '35', 'technology')
Example 3 - Actual : ('male', 'aries', '35', 'technology')
Example 3 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [163]:
print("Example 4 - predicted :",Y_pred_inv[826])
print("Example 4 - Actual :",Y_test_trans_inv[826])
print("Example 4 - Actual_before mlb transformation :",Y_test[826])

Example 4 - predicted : ('male', 'aries', '36', 'fashion')
Example 4 - Actual : ('male', 'aries', '36', 'fashion')
Example 4 - Actual_before mlb transformation : ['male', '36', 'fashion', 'aries']


In [165]:
print("Example 5 - predicted :",Y_pred_inv[12])
print("Example 5 - Actual :",Y_test_trans_inv[12])
print("Example 5 - Actual_before mlb transformation :",Y_test[12])

Example 5 - predicted : ('male', 'aries', '35', 'technology')
Example 5 - Actual : ('male', '23', 'cancer', 'education')
Example 5 - Actual_before mlb transformation : ['male', '23', 'education', 'cancer']
