##Project
Statistical NLP
Welcome to this NLP-based case study. The case study (described below, 60 points)
covers the traditional models taught in the NLP course.

##Project Description

Classification is probably the most common task you will deal with in real life.
Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict
information about a writer without knowing anything about them.
We are going to create a classifier that predicts multiple features of the author of a given text.
We have designed it as a multi-label classification problem.

##Dataset

###Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs (ages 23-27),
2994 "30s" blogs (ages 33-47).
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.

Link to dataset:
https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

##Approach & Steps
1. Load the dataset (5 points)

  a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
2. Preprocess rows of the “text” column (7.5 points)

  a. Remove unwanted characters

  b. Convert text to lowercase

  c. Remove unwanted spaces

  d. Remove stopwords

3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

  a. Label columns to merge: “gender”, “age”, “topic”, “sign”

  b. After completing the previous step, there should be only two columns in 
  your dataframe i.e. “text” and “labels” as shown in the below image
4. Separate features and labels, and split the data into training and testing (5 points)
5. Vectorize the features (5 points)

  a. Create a Bag of Words using count vectorizer

    i. Use ngram_range=(1, 2)

    ii. Vectorize training and testing features

  b. Print the term-document matrix

6. Create a dictionary to get the count of every label i.e. the key will be label name and value will
be the total count of the label. Check below image for reference (5 points)
7. Transform the labels - (7.5 points)
As we have noticed before, each example in this task can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that each prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

  a. Convert your train and test labels using MultiLabelBinarizer

8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

  a. Use a linear classifier of your choice and wrap it in OneVsRestClassifier to train it on every label

  b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that

9. Fit the classifier, make predictions and get the accuracy (5 points)

  a. Print the following

    i. Accuracy score

    ii. F1 score

    iii. Average precision score

    iv. Average recall score

    v. Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging.

10. Print true label and predicted label for any five examples (7.5 points)

##Project submissions and Evaluation Criteria
While we encourage peer collaboration and contribution, plagiarism (copying code from other sources or peers) defeats the purpose of joining this program. We expect the highest order of ethical behavior.

You are provided with the basic approach and the steps that you need to implement. We expect you to do your own research on how to implement the steps and to learn about anything that looks new to you.
####Submit the code on Olympus.
Submit the project as a Jupyter notebook on Olympus for evaluation.

##Project Support
You can clarify your queries by sending an email to Olympus.
Happy Learning!

Import required libraries

In [361]:
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

2.1.0


In [0]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [0]:
os.chdir('/content/drive/My Drive/GreatLearning/myprojects/Residency8/')

1. Load the dataset (5 points)

  a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

  https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

In [364]:
!pwd

/content/drive/My Drive/GreatLearning/myprojects/Residency8


In [365]:
!ls

'1a. Classification_MNIST_CNN_Keras.ipynb'
'2a. Download flowers Dataset.ipynb'
'2b. Visualize an Image.ipynb'
'2c. Image Classification - Flowers.ipynb'
'2d. Image Augmentation.ipynb'
'2e. Image Classification - Flowers with Augmentation and Custom Batch Generator .ipynb'
'2e. Image Classification - with Data Augmentation - classwork.ipynb'
'2e. Image Classification - with Data Augmentation.ipynb'
 blog-authorship-corpus.zip
 blogtext.csv
 CNN
 CV_Project2_Dog_Breed_Classification_Questions_Residency8.ipynb
 DogBreed_Classification
 images
 images.zip
 NLP
 Notebooks
 Notebooks-20200131T034024Z-001.zip
 Project_SNLP_R8.ipynb
'Python Generators.ipynb'
 R8_Internal_Lab_ACV_NLP_Question.ipynb
'Statistical NLP Project Brief.pdf'
 tweets.csv
 vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [0]:
from zipfile import ZipFile
with ZipFile('blog-authorship-corpus.zip','r') as z:
  z.extractall()


In [0]:
df = pd.read_csv('blogtext.csv')
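The tip for this step suggests working with fewer rows. The notebook loads the full file and later keeps the first 10,000 rows with `df.head(n)`. As a minimal sketch of two alternatives, assuming the same CSV layout, the rows can be limited at load time with `nrows`, or sampled at random after loading. The in-memory CSV and the names `csv_text`, `n_rows`, `df_first` and `df_random` below are hypothetical stand-ins so the example runs on its own.

```python
import io
import pandas as pd

# Tiny stand-in for blogtext.csv so the sketch runs on its own (hypothetical data).
csv_text = "id,gender,age,topic,sign,date,text\n" + "".join(
    f'{i},male,{20 + i % 5},Student,Leo,"01,May,2004",post number {i}\n'
    for i in range(100)
)

n_rows = 10  # number of rows to keep

# Option 1: parse only the first n_rows rows instead of loading the whole file.
df_first = pd.read_csv(io.StringIO(csv_text), nrows=n_rows)

# Option 2: load everything, then draw a reproducible random sample.
df_all = pd.read_csv(io.StringIO(csv_text))
df_random = df_all.sample(n=n_rows, random_state=1).reset_index(drop=True)

print(df_first.shape, df_random.shape)
```

Taking the first rows is cheap but keeps consecutive posts from the same few bloggers, while a random sample spreads the rows across more authors at the cost of loading the whole file first.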

2. Preprocess rows of the “text” column (7.5 points)

  a. Remove unwanted characters

  b. Convert text to lowercase

  c. Remove unwanted spaces

  d. Remove stopwords

In [369]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [370]:
print('Columns:\n',df.columns)

Columns:
 Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')


In [371]:
print('Number of unique entries in topic:',len(df.topic.unique()))
print('Unique topic values:\n',df.topic.unique())

Number of unique entries in topic: 40
Unique topic values:
 ['Student' 'InvestmentBanking' 'indUnk' 'Non-Profit' 'Banking' 'Education'
 'Engineering' 'Science' 'Communications-Media' 'BusinessServices'
 'Sports-Recreation' 'Arts' 'Internet' 'Museums-Libraries' 'Accounting'
 'Technology' 'Law' 'Consulting' 'Automotive' 'Religion' 'Fashion'
 'Publishing' 'Marketing' 'LawEnforcement-Security' 'HumanResources'
 'Telecommunications' 'Military' 'Government' 'Transportation'
 'Architecture' 'Advertising' 'Agriculture' 'Biotech' 'RealEstate'
 'Manufacturing' 'Construction' 'Chemicals' 'Maritime' 'Tourism'
 'Environment']


In [372]:
print('Number of unique entries in date:',len(df.date.unique()))
print('Unique date values:\n',df.date.unique())

Number of unique entries in date: 2616
Unique date values:
 ['14,May,2004' '13,May,2004' '12,May,2004' ... '05,august,2004'
 '04,august,2004' '02,august,2004']


In [373]:
print('Number of unique entries in sign:',len(df.sign.unique()))
print('Unique sign values:\n',df.sign.unique())

Number of unique entries in sign: 12
Unique sign values:
 ['Leo' 'Aquarius' 'Aries' 'Capricorn' 'Gemini' 'Cancer' 'Sagittarius'
 'Scorpio' 'Libra' 'Virgo' 'Taurus' 'Pisces']


In [374]:
print('Number of unique entries in gender:',len(df.gender.unique()))
print('Unique gender values:\n',df.gender.unique())

Number of unique entries in gender: 2
Unique gender values:
 ['male' 'female']


In [375]:
print('Number of unique entries in age:',len(df.age.unique()))
print('Unique age values:\n',df.age.unique())
print('minimum and maximum age of the authors:',(min(df.age),max(df.age)))

Number of unique entries in age: 26
Unique age values:
 [15 33 14 25 17 23 37 26 24 27 45 34 41 44 16 39 35 36 46 42 13 38 43 40
 47 48]
minimum and maximum age of the authors: (13, 48)


In [376]:
df.shape

(681284, 7)

Because the dataset is large, we take a subset (the first 10,000 rows) for evaluation and modeling.

In [0]:
n = 10000

In [0]:
df_subset = df.head(n)
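Because `df.head(n)` returns a slice of `df`, the later cells that assign new columns to `df_subset` print a SettingWithCopyWarning. One common way to avoid it, sketched here on a hypothetical three-row frame (`toy_df` and `toy_subset` are illustrative names), is to take an explicit copy of the slice before modifying it.

```python
import pandas as pd

toy_df = pd.DataFrame({"text": ["Hello World", "TESTING!!!", "Some Post"]})

# An explicit copy makes toy_subset an independent DataFrame,
# so assigning to its columns does not touch (or warn about) toy_df.
toy_subset = toy_df.head(2).copy()
toy_subset["text"] = toy_subset["text"].str.lower()

print(toy_subset["text"].tolist())  # ['hello world', 'testing!!!']
print(toy_df["text"].tolist())      # the original frame is unchanged
```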

In [379]:
df_subset.shape

(10000, 7)

In [380]:
df_subset['text'] = df_subset['text'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [381]:
df_subset.text[2]

"           in het kader van kernfusie op aarde:  maak je eigen waterstofbom   how to build an h-bomb from: ascott@tartarus.uwa.edu.au (andrew scott) newsgroups: rec.humor subject: how to build an h-bomb (humorous!) date: 7 feb 1994 07:41:14 gmt organization: the university of western australia  original file dated 12th november 1990. seemed to be a transcript of a 'seven days' article. poorly formatted and corrupted. i have added the text between 'examine under a microscope' and 'malleable, like gold,' as it was missing. if anyone has the full text, please distribute. i am not responsible for the accuracy of this information. converted to html by dionisio@infinet.com 11/13/98. (did a little spell-checking and some minor edits too.) stolen from  urllink http://my.ohio.voyager.net/~dionisio/fun/m...own-h-bomb.html  and reformatted the html. it now validates to xhtml 1.0 strict. how to build an h-bomb making and owning an h-bomb is the kind of challenge real americans seek. who wants to 

In [382]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
# Import NLTK's English stopword list; a set makes the membership test below fast.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

In [384]:
# Exclude stopwords with a list comprehension applied to each row via pandas.Series.apply.
df_subset['text_without_stopwords'] = df_subset['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


In [385]:
df_subset.text_without_stopwords[1]

'team members: drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering (me) urllink mail'

In [0]:
import string

# Strip punctuation, digits and non-ASCII characters, and lowercase the text.
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = "".join(v for v in txt if not v.isdigit())
    txt = txt.encode("utf8").decode("ascii", 'ignore')
    return txt


corpus = [clean_text(x) for x in df_subset.text_without_stopwords]

In [387]:
corpus[2]

'het kader van kernfusie op aarde maak je eigen waterstofbom build hbomb from ascotttartarusuwaeduau andrew scott newsgroups rechumor subject build hbomb humorous date  feb   gmt organization university western australia original file dated th november  seemed transcript seven days article poorly formatted corrupted added text examine microscope malleable like gold missing anyone full text please distribute responsible accuracy information converted html dionisioinfinetcom  did little spellchecking minor edits too stolen urllink httpmyohiovoyagernetdionisiofunmownhbombhtml reformatted html validates xhtml  strict build hbomb making owning hbomb kind challenge real americans seek wants passive victim nuclear war when little effort active participant bomb shelters losers wants huddle together underground eating canned spam winners want push button themselves making hbomb big step nuclear assertiveness training  called taking charge were sure enjoy risks heady thrill playing nuclear chick

In [388]:
type(corpus)

list

In [389]:
corpus[:5]

['info found   pages  mb pdf files wait untill team leader processed learns html',
 'team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail',
 'het kader van kernfusie op aarde maak je eigen waterstofbom build hbomb from ascotttartarusuwaeduau andrew scott newsgroups rechumor subject build hbomb humorous date  feb   gmt organization university western australia original file dated th november  seemed transcript seven days article poorly formatted corrupted added text examine microscope malleable like gold missing anyone full text please distribute responsible accuracy information converted html dionisioinfinetcom  did little spellchecking minor edits too stolen urllink httpmyohiovoyagernetdionisiofunmownhbombhtml reformatted html validates xhtml  strict build hbomb making owning hbomb kind challenge real americans seek wants passive victim nuclear war when little effort active participant bomb shelters losers wants huddle together undergrou

In [390]:
df_subset['processed_text'] = corpus


In [391]:
df_subset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_without_stopwords,processed_text
0,2059027,male,15,Student,Leo,"14,May,2004","info has been found (+/- 100 pages,...","info found (+/- 100 pages, 4.5 mb .pdf files) ...",info found pages mb pdf files wait untill t...
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members: drewe...,team members: drewes van der laag urllink mail...,team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde...,het kader van kernfusie op aarde: maak je eige...,het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo!'s toolbar i can ...,thanks yahoo!'s toolbar 'capture' urls popups....,thanks yahoos toolbar capture urls popupswhich...
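The processed text above still contains runs of spaces where digits were removed, and tokens such as "hbomb" where punctuation was deleted inside a word. A minimal alternative sketch, not the notebook's method, performs the four preprocessing steps in a single function: it replaces non-letters with a space and lets `split()` collapse the whitespace. The function name `preprocess` and the small `STOPWORDS` set are hypothetical; the notebook uses NLTK's English list.

```python
import re

# Hypothetical small stopword set for illustration only.
STOPWORDS = {"the", "are", "to", "a", "has", "been", "i", "can"}

def preprocess(text, stopwords=STOPWORDS):
    text = text.lower()                      # b. convert to lowercase
    text = re.sub(r"[^a-z\s]", " ", text)    # a. replace non-letters with a space
    tokens = text.split()                    # c. split() collapses repeated whitespace
    return " ".join(t for t in tokens if t not in stopwords)  # d. drop stopwords

print(preprocess("  Info has been found (+/- 100 pages,  4.5 MB .pdf files)  "))
# info found pages mb pdf files
```

Replacing punctuation with a space keeps "h-bomb" as two tokens ("h bomb") rather than fusing it into one, which changes the vocabulary the vectorizer sees; which behaviour is preferable depends on the data.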


3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

  a. Label columns to merge: “gender”, “age”, “topic”, “sign”

  b. After completing the previous step, there should be only two columns in your dataframe i.e. “text” and “labels” as shown in the below image

In [392]:
df_subset['labels1'] = df_subset['gender'] + df_subset['age'].astype(str) + df_subset['topic'] + df_subset['sign']


In [393]:
df_subset['labels'] = (df_subset[['gender','age','topic','sign']].astype(str)).values.tolist()


In [394]:
df_subset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_without_stopwords,processed_text,labels1,labels
0,2059027,male,15,Student,Leo,"14,May,2004","info has been found (+/- 100 pages,...","info found (+/- 100 pages, 4.5 mb .pdf files) ...",info found pages mb pdf files wait untill t...,male15StudentLeo,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members: drewe...,team members: drewes van der laag urllink mail...,team members drewes van der laag urllink mail ...,male15StudentLeo,"[male, 15, Student, Leo]"
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde...,het kader van kernfusie op aarde: maak je eige...,het kader van kernfusie op aarde maak je eigen...,male15StudentLeo,"[male, 15, Student, Leo]"
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,male15StudentLeo,"[male, 15, Student, Leo]"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo!'s toolbar i can ...,thanks yahoo!'s toolbar 'capture' urls popups....,thanks yahoos toolbar capture urls popupswhich...,male33InvestmentBankingAquarius,"[male, 33, InvestmentBanking, Aquarius]"


In [395]:
df_subset['labels1'] = df_subset['labels1'].str.lower()


In [396]:
df_subset['labels1']

0                      male15studentleo
1                      male15studentleo
2                      male15studentleo
3                      male15studentleo
4       male33investmentbankingaquarius
                     ...               
9995               female25indunkpisces
9996               female25indunkpisces
9997               female25indunkpisces
9998               female25indunkpisces
9999               female25indunkpisces
Name: labels1, Length: 10000, dtype: object

In [0]:
df1 = df_subset[['processed_text' , 'labels']]

In [398]:
df1.rename(columns ={'processed_text' : 'text'},inplace = True)


In [399]:
df1.head()

Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill t...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoos toolbar capture urls popupswhich...,"[male, 33, InvestmentBanking, Aquarius]"


4. Separate features and labels, and split the data into training and testing (5 points)

In [0]:
X = df1['text']
y = df1['labels']

In [0]:
# split the new DataFrame into training and testing sets [Default test size = 25%]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [402]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(7500,)
(2500,)
(7500,)
(2500,)
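A random row-level split most likely places posts by the same blogger in both the training and the test set, because each blogger contributes many rows with identical labels. As a hedged sketch of one alternative (not what the notebook does), scikit-learn's `GroupShuffleSplit` can split by blogger `id` so that no author appears on both sides. The frame `toy_posts` below is hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical posts: four bloggers, eight posts in total.
toy_posts = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 3, 3, 4],
    "text": [f"post {i}" for i in range(8)],
})

# Hold out 25% of the bloggers (groups), not 25% of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
train_idx, test_idx = next(splitter.split(toy_posts["text"], groups=toy_posts["id"]))

train_ids = set(toy_posts.loc[train_idx, "id"])
test_ids = set(toy_posts.loc[test_idx, "id"])
print(train_ids & test_ids)  # set(): no blogger appears on both sides
```

Scores from a grouped split are usually lower than from a row-level split, but they say more about how the classifier would handle an author it has not seen before.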


5. Vectorize the features (5 points)

  a. Create a Bag of Words using count vectorizer

        i. Use ngram_range=(1, 2)

        ii. Vectorize training and testing features

  b. Print the term-document matrix

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2),lowercase=True)

In [0]:
x_train_dtm = vect.fit_transform(x_train)

In [405]:
# rows are documents, columns are terms
x_train_dtm.shape

(7500, 549824)

In [0]:
x_test_dtm = vect.transform(x_test)

In [407]:
x_train_dtm

<7500x549824 sparse matrix of type '<class 'numpy.int64'>'
	with 1161622 stored elements in Compressed Sparse Row format>

In [408]:
print(x_train_dtm)

  (0, 524906)	1
  (0, 160465)	1
  (0, 191109)	1
  (0, 25768)	1
  (0, 434501)	1
  (0, 426820)	1
  (0, 475806)	1
  (0, 275899)	1
  (0, 45180)	1
  (0, 473147)	1
  (0, 14258)	1
  (0, 397857)	1
  (0, 228765)	1
  (0, 225817)	1
  (0, 62148)	1
  (0, 394790)	1
  (0, 325280)	1
  (0, 66899)	1
  (0, 541389)	1
  (0, 304839)	1
  (0, 257403)	1
  (0, 279684)	1
  (0, 543981)	1
  (0, 532945)	1
  (0, 525195)	1
  :	:
  (7498, 79408)	1
  (7498, 279578)	1
  (7498, 257870)	1
  (7498, 466684)	1
  (7498, 271077)	1
  (7498, 107039)	1
  (7498, 27108)	1
  (7498, 522899)	1
  (7498, 108033)	1
  (7498, 515680)	1
  (7498, 449393)	1
  (7498, 75621)	1
  (7498, 406501)	1
  (7498, 167583)	1
  (7498, 428531)	1
  (7498, 388937)	1
  (7498, 129635)	1
  (7498, 129650)	1
  (7498, 393317)	1
  (7498, 32304)	1
  (7498, 425070)	1
  (7498, 152394)	1
  (7499, 128138)	1
  (7499, 532282)	1
  (7499, 128216)	1
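The printed sparse matrix above lists (row, column) count entries, which is hard to read without the term behind each column index. A minimal sketch on a two-document toy corpus (hypothetical data; `toy_vect`, `toy_dtm` and `dtm_df` are illustrative names) shows how the same matrix can be displayed with its unigram and bigram terms as column headers. This is only practical for small vocabularies, since `toarray()` makes the matrix dense.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["team members mail", "testing testing"]

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_dtm = toy_vect.fit_transform(docs)

# vocabulary_ maps each term to its column index; order the terms by that index.
terms = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
dtm_df = pd.DataFrame(toy_dtm.toarray(), columns=terms)
print(dtm_df)
```

With `ngram_range=(1, 2)` the second document contributes both the unigram "testing" (count 2) and the bigram "testing testing" (count 1), which is why the vocabulary in the notebook grows to several hundred thousand columns.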


6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)

In [409]:
print(df_subset.gender.value_counts())
print(df_subset.sign.value_counts())
print(df_subset.age.value_counts())
print(df_subset.topic.value_counts())

male      5916
female    4084
Name: gender, dtype: int64
Aries          4198
Sagittarius    1097
Scorpio         971
Taurus          812
Aquarius        571
Cancer          504
Libra           491
Pisces          454
Leo             301
Virgo           236
Capricorn       215
Gemini          150
Name: sign, dtype: int64
35    2315
36    1708
17    1185
27    1054
24     655
15     602
34     553
16     440
25     386
23     253
26     234
14     212
33     136
39      79
38      46
13      42
37      33
41      20
45      16
42      14
46       7
43       6
44       3
40       1
Name: age, dtype: int64
indUnk                     3287
Technology                 2654
Fashion                    1622
Student                    1137
Education                   270
Marketing                   156
Engineering                 127
Internet                    118
Communications-Media         99
BusinessServices             91
Sports-Recreation            80
Non-Profit                   71
Invest

In [0]:
labels_dict = dict(df_subset.gender.value_counts())
labels_dict.update(dict(df_subset.age.value_counts()))
labels_dict.update(dict(df_subset.topic.value_counts()))
labels_dict.update(dict(df_subset.sign.value_counts()))

Dictionary of labels is as below:

In [411]:
print(labels_dict)

{'male': 5916, 'female': 4084, 35: 2315, 36: 1708, 17: 1185, 27: 1054, 24: 655, 15: 602, 34: 553, 16: 440, 25: 386, 23: 253, 26: 234, 14: 212, 33: 136, 39: 79, 38: 46, 13: 42, 37: 33, 41: 20, 45: 16, 42: 14, 46: 7, 43: 6, 44: 3, 40: 1, 'indUnk': 3287, 'Technology': 2654, 'Fashion': 1622, 'Student': 1137, 'Education': 270, 'Marketing': 156, 'Engineering': 127, 'Internet': 118, 'Communications-Media': 99, 'BusinessServices': 91, 'Sports-Recreation': 80, 'Non-Profit': 71, 'InvestmentBanking': 70, 'Science': 63, 'Arts': 45, 'Consulting': 21, 'Museums-Libraries': 17, 'Banking': 16, 'Automotive': 14, 'Law': 11, 'LawEnforcement-Security': 10, 'Religion': 9, 'Accounting': 4, 'Publishing': 4, 'Telecommunications': 2, 'HumanResources': 2, 'Aries': 4198, 'Sagittarius': 1097, 'Scorpio': 971, 'Taurus': 812, 'Aquarius': 571, 'Cancer': 504, 'Libra': 491, 'Pisces': 454, 'Leo': 301, 'Virgo': 236, 'Capricorn': 215, 'Gemini': 150}
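In the dictionary above the age keys are integers, while the merged `labels` column stores every label as a string. A minimal alternative sketch, assuming the merged label lists are available, counts directly from those lists with `collections.Counter`, so the keys match the label strings exactly. The list `toy_labels` is hypothetical.

```python
from collections import Counter
from itertools import chain

# Hypothetical merged label lists, one per post.
toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["female", "25", "indUnk", "Pisces"],
]

# Flatten the per-row label lists and count every label in one pass.
label_counts = dict(Counter(chain.from_iterable(toy_labels)))
print(label_counts)
```

Counting from the merged lists also avoids silently overwriting an entry if two label columns ever shared a value, which chained `dict.update` calls would do.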


7. Transform the labels - (7.5 points)

  As we have noticed before, each example in this task can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that each prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

  a. Convert your train and test labels using MultiLabelBinarizer
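To make the "mask of 0s and 1s" concrete, here is a minimal sketch on hypothetical toy labels (`toy_mlb`, `train_labels` and `test_labels` are illustrative names). The binarizer learns one column per distinct label from the training lists, then encodes any label list as a row with a 1 in each matching column; `inverse_transform` maps a row back to its labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

train_labels = [["male", "15", "Student", "Leo"],
                ["female", "25", "indUnk", "Pisces"]]
test_labels = [["male", "25", "Student", "Pisces"]]

toy_mlb = MultiLabelBinarizer()
y_train_bin = toy_mlb.fit_transform(train_labels)  # learn the columns from training labels
y_test_bin = toy_mlb.transform(test_labels)        # reuse the same columns for test labels

print(toy_mlb.classes_)                       # column order (sorted labels)
print(y_test_bin)                             # [[0 1 0 1 1 0 0 1]]
print(toy_mlb.inverse_transform(y_test_bin))  # [('25', 'Pisces', 'Student', 'male')]
```

Fitting only on the training labels keeps the column layout fixed, so predictions and test labels can be compared column by column.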

In [412]:
y_train

651               [male, 24, Engineering, Libra]
6560                  [male, 36, Fashion, Aries]
8974                  [male, 15, Student, Virgo]
2348               [male, 35, Technology, Aries]
5670                [female, 27, indUnk, Taurus]
                          ...                   
2895               [male, 35, Technology, Aries]
7813                  [male, 36, Fashion, Aries]
905     [male, 17, Sports-Recreation, Capricorn]
5192               [female, 17, indUnk, Scorpio]
235                [male, 15, Student, Aquarius]
Name: labels, Length: 7500, dtype: object

Convert y_train and y_test to lists of label lists to pass to MultiLabelBinarizer.

In [0]:
y_train_list = y_train.tolist()
y_test_list = y_test.tolist()

In [414]:
y_train_list

[['male', '24', 'Engineering', 'Libra'],
 ['male', '36', 'Fashion', 'Aries'],
 ['male', '15', 'Student', 'Virgo'],
 ['male', '35', 'Technology', 'Aries'],
 ['female', '27', 'indUnk', 'Taurus'],
 ['male', '36', 'Fashion', 'Aries'],
 ['male', '39', 'Communications-Media', 'Libra'],
 ['female', '24', 'indUnk', 'Scorpio'],
 ['male', '36', 'Fashion', 'Aries'],
 ['female', '27', 'indUnk', 'Taurus'],
 ['female', '36', 'indUnk', 'Pisces'],
 ['male', '15', 'Student', 'Virgo'],
 ['female', '23', 'Automotive', 'Aquarius'],
 ['male', '35', 'Technology', 'Aries'],
 ['male', '35', 'Technology', 'Aries'],
 ['male', '35', 'Technology', 'Aries'],
 ['female', '27', 'Education', 'Aquarius'],
 ['male', '35', 'Technology', 'Aries'],
 ['female', '17', 'Student', 'Pisces'],
 ['male', '35', 'Technology', 'Aries'],
 ['male', '35', 'Technology', 'Aries'],
 ['male', '36', 'Fashion', 'Aries'],
 ['female', '17', 'Student', 'Sagittarius'],
 ['male', '36', 'Fashion', 'Aries'],
 ['male', '15', 'Student', 'Virgo'],
 [

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train_encoded = mlb.fit_transform(y_train_list)
y_test_encoded = mlb.transform(y_test_list)

In [416]:
y_train_encoded

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 0, 1],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 1, ..., 0, 0, 1]])

In [417]:
mlb.classes_

array(['13', '14', '15', '16', '17', '23', '24', '25', '26', '27', '33',
       '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44',
       '45', '46', 'Accounting', 'Aquarius', 'Aries', 'Arts',
       'Automotive', 'Banking', 'BusinessServices', 'Cancer', 'Capricorn',
       'Communications-Media', 'Consulting', 'Education', 'Engineering',
       'Fashion', 'Gemini', 'HumanResources', 'Internet',
       'InvestmentBanking', 'Law', 'LawEnforcement-Security', 'Leo',
       'Libra', 'Marketing', 'Museums-Libraries', 'Non-Profit', 'Pisces',
       'Publishing', 'Religion', 'Sagittarius', 'Science', 'Scorpio',
       'Sports-Recreation', 'Student', 'Taurus', 'Technology',
       'Telecommunications', 'Virgo', 'female', 'indUnk', 'male'],
      dtype=object)

In [418]:
len(mlb.classes_)

64

In [0]:
y_train_encoded_df = pd.DataFrame(y_train_encoded, columns = mlb.classes_)
y_test_encoded_df = pd.DataFrame(y_test_encoded, columns = mlb.classes_)

In [420]:
y_train.head()

651     [male, 24, Engineering, Libra]
6560        [male, 36, Fashion, Aries]
8974        [male, 15, Student, Virgo]
2348     [male, 35, Technology, Aries]
5670      [female, 27, indUnk, Taurus]
Name: labels, dtype: object

In [421]:
y_train_encoded_df.head()

Unnamed: 0,13,14,15,16,17,23,24,25,26,27,33,34,35,36,37,38,39,40,41,42,43,44,45,46,Accounting,Aquarius,Aries,Arts,Automotive,Banking,BusinessServices,Cancer,Capricorn,Communications-Media,Consulting,Education,Engineering,Fashion,Gemini,HumanResources,Internet,InvestmentBanking,Law,LawEnforcement-Security,Leo,Libra,Marketing,Museums-Libraries,Non-Profit,Pisces,Publishing,Religion,Sagittarius,Science,Scorpio,Sports-Recreation,Student,Taurus,Technology,Telecommunications,Virgo,female,indUnk,male
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1
3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0


In [422]:
y_test_encoded_df.head()

Unnamed: 0,13,14,15,16,17,23,24,25,26,27,33,34,35,36,37,38,39,40,41,42,43,44,45,46,Accounting,Aquarius,Aries,Arts,Automotive,Banking,BusinessServices,Cancer,Capricorn,Communications-Media,Consulting,Education,Engineering,Fashion,Gemini,HumanResources,Internet,InvestmentBanking,Law,LawEnforcement-Security,Leo,Libra,Marketing,Museums-Libraries,Non-Profit,Pisces,Publishing,Religion,Sagittarius,Science,Scorpio,Sports-Recreation,Student,Taurus,Technology,Telecommunications,Virgo,female,indUnk,male
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0
1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1
2,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0


In [423]:
mlb.get_params()

{'classes': None, 'sparse_output': False}

8. Choose a classifier - (5 points)

  In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

  a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
  
  b. As the One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for it

9. Fit the classifier, make predictions and get the accuracy (5 points)

  a. Print the following

    i. Accuracy score

    ii. F1 score

    iii. Average precision score

    iv. Average recall score

    v. Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging

10. Print true label and predicted label for any five examples (7.5 points)

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver = 'lbfgs', max_iter = 5000)
clf = OneVsRestClassifier(clf)

In [426]:
clf.fit(x_train_dtm, y_train_encoded_df)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=5000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [427]:
from sklearn.metrics import accuracy_score
y_pred_train = clf.predict(x_train_dtm)
print ("Training Accuracy={}".format(accuracy_score(y_train_encoded_df, y_pred_train)))

Training Accuracy=0.9577333333333333


In [428]:
from sklearn.metrics import accuracy_score
y_pred = clf.predict(x_test_dtm)
print ("Test Accuracy={}".format(accuracy_score(y_test_encoded_df, y_pred)))

Test Accuracy=0.318
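
For multilabel indicator targets, accuracy_score reports subset accuracy: an example counts as correct only if every one of its labels matches. This is much stricter than per-label accuracy, which helps explain a low test accuracy alongside reasonable F1 scores. A minimal sketch on made-up toy arrays (not the blog data) illustrates the difference:

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy multilabel targets: 2 examples, 4 labels (hypothetical data)
y_true_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 1]])
# The second example misses exactly one of its labels
y_pred_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 0]])

# Subset accuracy: only the first row matches exactly
subset_acc = accuracy_score(y_true_toy, y_pred_toy)

# Per-label accuracy: 7 of the 8 individual label decisions are correct
per_label_acc = 1 - hamming_loss(y_true_toy, y_pred_toy)

print("Subset accuracy={}".format(subset_acc))
print("Per-label accuracy={}".format(per_label_acc))
```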


In [429]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score, precision_score, classification_report

def evaluation_scores(y_val, predicted):
    # Note: precision is macro-averaged while recall is micro-averaged,
    # so the two figures are not directly comparable.
    # zero_division=1 scores a label as 1 when its metric is undefined (0/0);
    # precision_score omits it here and may emit a warning instead.
    print("Testing Accuracy={}".format(accuracy_score(y_val, predicted)))
    print("Precision Score={}".format(precision_score(y_val, predicted, average='macro')))
    print("Recall Score={}".format(recall_score(y_val, predicted, average='micro', zero_division=1)))
    print("F1_macro={}".format(f1_score(y_val, predicted, average='macro', zero_division=1)))
    print("F1_micro={}".format(f1_score(y_val, predicted, average='micro', zero_division=1)))
    print("F1_wted={}".format(f1_score(y_val, predicted, average='weighted', zero_division=1)))

print('Evaluation Scores:')
evaluation_scores(y_test_encoded_df, y_pred)
y_pred_scores = clf.decision_function(x_test_dtm)
print ("Average Precision Score={}".format(average_precision_score(y_test_encoded_df, y_pred_scores)))

Evaluation Scores:
Testing Accuracy=0.318
Precision Score=0.5142657442278666
Recall Score=0.5371
F1_macro=0.29471758492143063
F1_micro=0.6376209414139016
F1_wted=0.5977362817819609


  _warn_prf(average, modifier, msg_start, len(result))


Average Precision Score=nan


  recall = tps / tps[-1]
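
A likely cause of the nan above is that some label columns have no positive examples in the test split, so recall is undefined for them. One way around this, sketched here under that assumption, is to average only over labels that actually occur in y_true. The helper name macro_ap_present_labels and the toy arrays are hypothetical:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def macro_ap_present_labels(y_true, y_score):
    """Macro average precision over labels with at least one positive in y_true."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    present = y_true.sum(axis=0) > 0  # boolean mask over label columns
    return average_precision_score(y_true[:, present], y_score[:, present], average="macro")

# Toy data: the third label never occurs in y_true and is therefore skipped
y_true_toy = np.array([[1, 0, 0],
                       [0, 1, 0],
                       [1, 0, 0],
                       [0, 1, 0]])
y_score_toy = np.array([[0.9, 0.2, 0.1],
                        [0.1, 0.8, 0.3],
                        [0.7, 0.4, 0.2],
                        [0.3, 0.6, 0.1]])

ap = macro_ap_present_labels(y_true_toy, y_score_toy)
print("Average Precision (present labels only)={}".format(ap))
```

In the notebook this could be called as macro_ap_present_labels(y_test_encoded_df, y_pred_scores); the result then describes only the labels seen in the test set.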


In [430]:
# print the classification report
print(classification_report(y_test_encoded_df, y_pred, zero_division=1))

              precision    recall  f1-score   support

           0       1.00      0.30      0.46        10
           1       0.71      0.09      0.16        57
           2       0.72      0.22      0.34       141
           3       0.69      0.21      0.32       117
           4       0.79      0.31      0.44       289
           5       1.00      0.02      0.04        51
           6       0.70      0.09      0.16       180
           7       0.44      0.04      0.08        93
           8       0.33      0.02      0.04        50
           9       0.81      0.39      0.53       239
          10       1.00      0.33      0.50        33
          11       0.98      0.76      0.86       158
          12       0.72      0.64      0.68       563
          13       0.93      0.49      0.65       455
          14       0.00      0.00      0.00         7
          15       1.00      0.07      0.12        15
          16       1.00      0.00      0.00        31
[output truncated]

Printing below the true labels and predicted labels for five examples

In [0]:
true_labels = np.array(y_test_encoded_df[200:205])

In [0]:
predicted_labels = y_pred[200:205]

In [433]:
mlb.inverse_transform(true_labels)

[('25', 'Leo', 'Student', 'female'),
 ('36', 'Aries', 'Fashion', 'male'),
 ('34', 'Sagittarius', 'female', 'indUnk'),
 ('15', 'Aquarius', 'Student', 'female'),
 ('35', 'Aries', 'Technology', 'male')]

In [434]:
mlb.inverse_transform(predicted_labels)

[('male',),
 ('36', 'Aries', 'Fashion', 'male'),
 ('34', 'Sagittarius', 'female', 'indUnk'),
 ('male',),
 ('35', 'Aries', 'Technology', 'male')]

A macro-average computes the metric independently for each class and then takes the unweighted mean (hence treating all classes equally), whereas a micro-average pools the contributions of all classes (true positives, false positives, and false negatives) before computing the metric. In a multi-class or multi-label setup, the micro-average is preferable if you suspect class imbalance (i.e., you may have many more examples of one class than of others).
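
The difference between the averaging modes can be seen on a small made-up example (toy arrays, not the blog data) in which a frequent label is always predicted correctly and a rare label is always missed:

```python
import numpy as np
from sklearn.metrics import f1_score

# Label 0 is frequent and always correct; label 1 is rare and never predicted
y_true_toy = np.array([[1, 0],
                       [1, 0],
                       [1, 0],
                       [1, 1]])
y_pred_toy = np.array([[1, 0],
                       [1, 0],
                       [1, 0],
                       [1, 0]])

# Macro: mean of per-label F1 scores, (1.0 + 0.0) / 2
f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)
# Micro: pooled counts TP=4, FP=0, FN=1, giving 8 / 9
f1_micro = f1_score(y_true_toy, y_pred_toy, average='micro', zero_division=0)
# Weighted: per-label F1 weighted by support, (4 * 1.0 + 1 * 0.0) / 5
f1_weighted = f1_score(y_true_toy, y_pred_toy, average='weighted', zero_division=0)

print("F1_macro={}".format(f1_macro))
print("F1_micro={}".format(f1_micro))
print("F1_wted={}".format(f1_weighted))
```

The macro score is pulled down by the rare label, while the micro and weighted scores are dominated by the frequent one.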



As an experiment, we try another approach: all labels are concatenated into a single string before encoding, instead of being kept as a list of labels. The approach above still looks more correct, since it treats each example as having multiple labels.

Create a dataframe of text and labels as df2 and repeat the sequence of steps given in the problem.

In [435]:
# rename() returns a new DataFrame, which avoids the SettingWithCopyWarning raised by inplace=True on a slice
df2 = df_subset[['processed_text', 'labels1']].rename(columns={'processed_text': 'text', 'labels1': 'labels'})
df2.head()


Unnamed: 0,text,labels
0,info found pages mb pdf files wait untill t...,male15studentleo
1,team members drewes van der laag urllink mail ...,male15studentleo
2,het kader van kernfusie op aarde maak je eigen...,male15studentleo
3,testing testing,male15studentleo
4,thanks yahoos toolbar capture urls popupswhich...,male33investmentbankingaquarius


In [436]:
labels_dict2 = dict(df2.labels.value_counts())
labels_dict2

{'female13indunklibra': 9,
 'female13studentaquarius': 31,
 'female13studentsagittarius': 2,
 'female14indunkaquarius': 5,
 'female14indunkaries': 21,
 'female14indunkcancer': 7,
 'female14indunkleo': 4,
 'female14studentcancer': 10,
 'female14studentcapricorn': 2,
 'female14studenttaurus': 11,
 'female15artspisces': 2,
 'female15indunkaquarius': 11,
 'female15indunkaries': 26,
 'female15studentaquarius': 62,
 'female15studentcancer': 9,
 'female15studentgemini': 5,
 'female15studentleo': 17,
 'female15studentlibra': 131,
 'female15studentpisces': 74,
 'female15studentvirgo': 4,
 'female16educationcancer': 24,
 'female16indunkcapricorn': 60,
 'female16indunkleo': 16,
 'female16indunksagittarius': 4,
 'female16indunktaurus': 20,
 'female16studentaquarius': 7,
 'female16studentcancer': 1,
 'female16studentcapricorn': 4,
 'female16studentlibra': 16,
 'female16studenttaurus': 32,
 'female17indunkcancer': 139,
 'female17indunkleo': 56,
 'female17indunkscorpio': 558,
 'female17studentaries':

In [0]:
X2 = df2['text']
y2 = df2['labels']

In [0]:
# split the new DataFrame into training and testing sets [Default test size = 25%]
from sklearn.model_selection import train_test_split
x_train2, x_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=1)

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2),lowercase=True)

In [0]:
x_train2_dtm = vect.fit_transform(x_train2)
x_test2_dtm = vect.transform(x_test2)

In [441]:
print(x_train2_dtm)

  (0, 524906)	1
  (0, 160465)	1
  (0, 191109)	1
  (0, 25768)	1
  (0, 434501)	1
  (0, 426820)	1
  (0, 475806)	1
  (0, 275899)	1
  (0, 45180)	1
  (0, 473147)	1
  (0, 14258)	1
  (0, 397857)	1
  (0, 228765)	1
  (0, 225817)	1
  (0, 62148)	1
  (0, 394790)	1
  (0, 325280)	1
  (0, 66899)	1
  (0, 541389)	1
  (0, 304839)	1
  (0, 257403)	1
  (0, 279684)	1
  (0, 543981)	1
  (0, 532945)	1
  (0, 525195)	1
  :	:
  (7498, 79408)	1
  (7498, 279578)	1
  (7498, 257870)	1
  (7498, 466684)	1
  (7498, 271077)	1
  (7498, 107039)	1
  (7498, 27108)	1
  (7498, 522899)	1
  (7498, 108033)	1
  (7498, 515680)	1
  (7498, 449393)	1
  (7498, 75621)	1
  (7498, 406501)	1
  (7498, 167583)	1
  (7498, 428531)	1
  (7498, 388937)	1
  (7498, 129635)	1
  (7498, 129650)	1
  (7498, 393317)	1
  (7498, 32304)	1
  (7498, 425070)	1
  (7498, 152394)	1
  (7499, 128138)	1
  (7499, 532282)	1
  (7499, 128216)	1


Converting the labels to a list of lists for input to MultiLabelBinarizer

In [0]:
def listoflists(lst): 
    return [[el] for el in lst]


y_train2_list = y_train2.tolist()
y_test2_list = y_test2.tolist()
y_train2_listoflist = listoflists(y_train2_list)
y_test2_listoflist = listoflists(y_test2_list)
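
The wrapping matters because MultiLabelBinarizer iterates over each sample to collect its labels. A bare string would therefore be split into single characters. A small sketch with two label strings taken from the data above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels_toy = ["male15studentleo", "female17indunkscorpio"]

# Each string wrapped in its own list: one class per concatenated label
mlb_wrapped = MultiLabelBinarizer().fit([[label] for label in labels_toy])
wrapped_classes = list(mlb_wrapped.classes_)

# Bare strings: every character becomes its own class
mlb_chars = MultiLabelBinarizer().fit(labels_toy)
char_classes = list(mlb_chars.classes_)

print(wrapped_classes)
print(char_classes)
```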

In [443]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb2 = MultiLabelBinarizer()
y_train2_encoded = mlb2.fit_transform(y_train2_listoflist)
y_test2_encoded = mlb2.transform(y_test2_listoflist)

  .format(sorted(unknown, key=str)))


In [444]:
y_train2_encoded
mlb2.classes_

array(['female13indunklibra', 'female13studentaquarius',
       'female13studentsagittarius', 'female14indunkaquarius',
       'female14indunkaries', 'female14indunkcancer', 'female14indunkleo',
       'female14studentcancer', 'female14studentcapricorn',
       'female14studenttaurus', 'female15artspisces',
       'female15indunkaquarius', 'female15indunkaries',
       'female15studentaquarius', 'female15studentcancer',
       'female15studentgemini', 'female15studentleo',
       'female15studentlibra', 'female15studentpisces',
       'female15studentvirgo', 'female16educationcancer',
       'female16indunkcapricorn', 'female16indunkleo',
       'female16indunksagittarius', 'female16indunktaurus',
       'female16studentaquarius', 'female16studentcapricorn',
       'female16studentlibra', 'female16studenttaurus',
       'female17indunkcancer', 'female17indunkleo',
       'female17indunkscorpio', 'female17studentaries',
       'female17studentcapricorn', 'female17studentgemini',
       

In [0]:
y_train2_encoded_df = pd.DataFrame(y_train2_encoded, columns = mlb2.classes_)
y_test2_encoded_df = pd.DataFrame(y_test2_encoded, columns = mlb2.classes_)

In [447]:
y_train2_encoded_df.head()

Unnamed: 0,female13indunklibra,female13studentaquarius,female13studentsagittarius,female14indunkaquarius,female14indunkaries,female14indunkcancer,female14indunkleo,female14studentcancer,female14studentcapricorn,female14studenttaurus,female15artspisces,female15indunkaquarius,female15indunkaries,female15studentaquarius,female15studentcancer,female15studentgemini,female15studentleo,female15studentlibra,female15studentpisces,female15studentvirgo,female16educationcancer,female16indunkcapricorn,female16indunkleo,female16indunksagittarius,female16indunktaurus,female16studentaquarius,female16studentcapricorn,female16studentlibra,female16studenttaurus,female17indunkcancer,female17indunkleo,female17indunkscorpio,female17studentaries,female17studentcapricorn,female17studentgemini,female17studentleo,female17studentpisces,female17studentsagittarius,female17studentscorpio,female17studenttaurus,...,male24telecommunicationssagittarius,male25accountinglibra,male25artsaries,male25businessservicessagittarius,male25communications-mediapisces,male25indunktaurus,male25internetaries,male25non-profitcancer,male25technologyaries,male25technologypisces,male26educationlibra,male26indunkgemini,male26indunkleo,male26indunksagittarius,male26museums-librariesleo,male26sciencescorpio,male26sports-recreationleo,male26technologylibra,male26technologyscorpio,male27educationaries,male27educationscorpio,male27educationvirgo,male27technologypisces,male33engineeringaries,male33investmentbankingaquarius,male33non-profitgemini,male33technologysagittarius,male35indunkscorpio,male35technologyaries,male36fashionaries,male36indunktaurus,male37businessservicessagittarius,male37lawenforcement-securityaquarius,male39communications-medialibra,male39educationvirgo,male41communications-medialibra,male42religionaries,male44indunktaurus,male45humanresourcesaquarius,male46consultinggemini
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf2 = LogisticRegression(solver = 'lbfgs', max_iter = 5000)
clf2 = OneVsRestClassifier(clf2)

In [459]:
clf2.fit(x_train2_dtm, y_train2_encoded_df)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=5000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [460]:
from sklearn.metrics import accuracy_score
y_pred2_train = clf2.predict(x_train2_dtm)
print ("Training Accuracy={}".format(accuracy_score(y_train2_encoded_df, y_pred2_train)))

Training Accuracy=0.9565333333333333


In [461]:
from sklearn.metrics import accuracy_score
y_pred2_test = clf2.predict(x_test2_dtm)
print ("Test Accuracy={}".format(accuracy_score(y_test2_encoded_df, y_pred2_test)))

Test Accuracy=0.3744


In [462]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score, precision_score, classification_report

def evaluation_scores(y_val, predicted):
    # Note: precision is macro-averaged while recall is micro-averaged,
    # so the two figures are not directly comparable.
    # zero_division=1 scores a label as 1 when its metric is undefined (0/0);
    # precision_score omits it here and may emit a warning instead.
    print("Accuracy={}".format(accuracy_score(y_val, predicted)))
    print("Precision Score={}".format(precision_score(y_val, predicted, average='macro')))
    print("Recall Score={}".format(recall_score(y_val, predicted, average='micro', zero_division=1)))
    print("F1_macro={}".format(f1_score(y_val, predicted, average='macro', zero_division=1)))
    print("F1_micro={}".format(f1_score(y_val, predicted, average='micro', zero_division=1)))
    print("F1_wted={}".format(f1_score(y_val, predicted, average='weighted', zero_division=1)))

print('Evaluation Scores:')
evaluation_scores(y_test2_encoded_df, y_pred2_test)
y_pred2_scores = clf2.decision_function(x_test2_dtm)
print ("Average Precision Score={}".format(average_precision_score(y_test2_encoded_df, y_pred2_scores)))

Evaluation Scores:
Accuracy=0.3744
Precision Score=0.18916200643993844
Recall Score=0.3784541449739688


  _warn_prf(average, modifier, msg_start, len(result))


F1_macro=0.2335870061956319
F1_micro=0.5169584245076586
F1_wted=0.46352797835525766


  recall = tps / tps[-1]


Average Precision Score=nan


In [463]:
# print the classification report
print(classification_report(y_test2_encoded_df, y_pred2_test, zero_division=1))

              precision    recall  f1-score   support

           0       1.00      0.00      0.00         1
           1       1.00      0.50      0.67         8
           2       1.00      0.00      0.00         1
           3       1.00      1.00      1.00         1
           4       1.00      0.00      0.00         7
           5       1.00      1.00      1.00         0
           6       1.00      0.00      0.00         1
           7       0.00      1.00      0.00         0
           8       1.00      0.00      0.00         1
           9       1.00      0.00      0.00         2
          10       1.00      1.00      1.00         0
          11       0.00      0.00      0.00         2
          12       1.00      0.33      0.50         6
          13       0.33      0.08      0.13        12
          14       1.00      0.00      0.00         2
          15       1.00      0.00      0.00         1
          16       1.00      0.33      0.50         6
[output truncated]

Printing below the true labels and predicted labels for five examples

In [0]:
true_labels2 = np.array(y_test2_encoded_df[200:205])

In [0]:
predicted_labels2 = y_pred2_test[200:205]

In [466]:
mlb2.inverse_transform(true_labels2)

[('female25studentleo',),
 ('male36fashionaries',),
 ('female34indunksagittarius',),
 ('female15studentaquarius',),
 ('male35technologyaries',)]

In [467]:
mlb2.inverse_transform(predicted_labels2)

[(),
 ('male36fashionaries',),
 ('female34indunksagittarius',),
 (),
 ('male35technologyaries',)]
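
The empty tuples mean that none of the one-vs-rest classifiers fired for those posts. One possible remedy, offered here as an assumption rather than as part of the assignment, is to fall back to the highest-scoring label whenever no score crosses the threshold. The helper name predict_at_least_one and the toy scores are hypothetical; in the notebook the scores would come from clf2.decision_function(x_test2_dtm).

```python
import numpy as np

def predict_at_least_one(scores, threshold=0.0):
    """Threshold decision scores, then assign the top label to rows left empty."""
    scores = np.asarray(scores, dtype=float)
    pred = (scores > threshold).astype(int)
    # Rows where no label crossed the threshold
    empty_rows = np.flatnonzero(pred.sum(axis=1) == 0)
    if empty_rows.size:
        # Fall back to the single best-scoring label for each empty row
        pred[empty_rows, scores[empty_rows].argmax(axis=1)] = 1
    return pred

# Toy decision scores: the first row has no positive score
scores_toy = np.array([[-1.0, -0.2, -3.0],
                       [0.5, -1.0, 2.0]])
pred_toy = predict_at_least_one(scores_toy)
print(pred_toy)
```

Since each example in this second approach carries exactly one concatenated label, always taking the top-scoring class is a natural fit; whether it improves the scores on this data would need to be checked.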