##Project
Statistical NLP
Welcome to this NLP case study. The case study (described below, 60 points)
covers concepts from the traditional-models part of the NLP course.

##Project Description

Classification is probably the most common task you will deal with in real life.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting
information about a writer from the text alone, without knowing anything else about them, is a challenge.
We are going to create a classifier that predicts multiple features of the author of a given text.
We have designed it as a multi-label classification problem.

##Dataset

###Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs (ages 23-27),
2994 "30s" blogs (ages 33-47).
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped, with two exceptions: individual posts within a single blogger are separated by the
date of the following post, and links within a post are denoted by the label urllink.

Link to dataset:
https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

##Approach & Steps
1. Load the dataset (5 points)

  a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
2. Preprocess rows of the “text” column (7.5 points)

  a. Remove unwanted characters

  b. Convert text to lowercase

  c. Remove unwanted spaces

  d. Remove stopwords

3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

  a. Label columns to merge: “gender”, “age”, “topic”, “sign”

  b. After completing the previous step, there should be only two columns in 
  your dataframe i.e. “text” and “labels” as shown in the below image
4. Separate features and labels, and split the data into training and testing(5 points)
5. Vectorize the features (5 points)

  a. Create a Bag of Words using count vectorizer

    i. Use ngram_range=(1, 2)

    ii. Vectorize training and testing features

  b. Print the term-document matrix

6. Create a dictionary to get the count of every label i.e. the key will be label name and value will
be the total count of the label. Check below image for reference (5 points)
7. Transform the labels - (7.5 points)
As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that the prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

  a. Convert your train and test labels using MultiLabelBinarizer

8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a basic classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers to train is large.

  a. Use a linear classifier of your choice and wrap it in OneVsRestClassifier to train it on every label

  b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that

9. Fit the classifier, make predictions and get the accuracy (5 points)

  a. Print the following

    i. Accuracy score

    ii. F1 score

    iii. Average precision score

    iv. Average recall score

    v. Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging.

10. Print true label and predicted label for any five examples (7.5 points)
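
Steps 7 to 9 can be sketched end to end on toy data. The block below is only an illustration under stated assumptions, not the code provided by the course: the texts and labels are made up, and the model is evaluated on its own training data purely to keep the example short. It shows a LogisticRegression wrapped in OneVsRestClassifier and how the `average` argument changes the multi-label scores.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (hypothetical): every text carries several labels at once.
texts = [
    "homework exam school teacher",
    "school exam friends party",
    "market stocks bank meeting",
    "bank meeting client report",
    "exam homework teacher class",
    "stocks market client bank",
]
labels = [
    ["male", "student"],
    ["female", "student"],
    ["male", "banking"],
    ["female", "banking"],
    ["female", "student"],
    ["male", "banking"],
]

# Bag of words with unigrams and bigrams, as in step 5.
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# One 0/1 column per distinct label, as in step 7.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary LogisticRegression is trained per label column.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
Y_pred = clf.predict(X)  # training data reused here only to keep the sketch short

# For multi-label targets, accuracy_score is the exact-match (subset) accuracy.
print("Subset accuracy:", accuracy_score(Y, Y_pred))
for avg in ("micro", "macro", "weighted"):
    print(
        avg,
        "F1:", f1_score(Y, Y_pred, average=avg, zero_division=0),
        "precision:", precision_score(Y, Y_pred, average=avg, zero_division=0),
        "recall:", recall_score(Y, Y_pred, average=avg, zero_division=0),
    )
```

Micro averaging pools the decisions of all labels before computing the score, macro averaging computes the score per label and takes the plain mean, and weighted averaging weights each label's score by its support.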

##Project submissions and Evaluation Criteria
While we encourage peer collaboration and contribution, plagiarism or copying code from other sources or peers defeats the purpose of joining this program. We expect the highest order of ethical behavior.

You are provided with the basic approach and the steps that you need to implement. We expect you to do your own research about implementing the steps and knowing about the things that might look new to you.
####Submit the code on Olympus.
Submit the project as a Jupyter notebook on Olympus for evaluation.

##Project Support
You can clarify your queries by sending an email to Olympus.
Happy Learning!

Import required libraries

In [1]:
%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)

TensorFlow 2.x selected.
2.1.0


In [0]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [0]:
os.chdir('/content/drive/My Drive/GreatLearning/myprojects/Residency8/')

1. Load the dataset (5 points)

  a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

  https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at
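
As a minimal sketch of the tip above, there are two common ways to work with fewer rows: read only the first rows with `nrows`, or read the whole file and draw a random sample. The in-memory CSV below is a made-up stand-in for blogtext.csv so that the example runs without the real file.

```python
import io

import pandas as pd

# Made-up rows in the same column layout as blogtext.csv (an assumption for illustration).
raw_csv = """id,gender,age,topic,sign,date,text
1,male,15,Student,Leo,"14,May,2004",first post
2,male,15,Student,Leo,"13,May,2004",second post
3,female,25,indUnk,Aries,"12,May,2004",third post
4,female,33,Banking,Aquarius,"11,June,2004",fourth post
5,male,24,Engineering,Libra,"10,June,2004",fifth post
"""

# Option 1: read only the first 3 rows; the rest of the file is never loaded.
first_rows = pd.read_csv(io.StringIO(raw_csv), nrows=3)

# Option 2: load everything, then draw a reproducible random sample of 3 rows.
random_rows = pd.read_csv(io.StringIO(raw_csv)).sample(n=3, random_state=1)

print(first_rows.shape, random_rows.shape)
```

Taking the first rows is cheaper, but consecutive rows come from the same few bloggers, so the label distribution can be heavily skewed. A random sample costs a full read but is usually more representative.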

In [4]:
!pwd

/content/drive/My Drive/GreatLearning/myprojects/Residency8


In [5]:
!ls

'1a. Classification_MNIST_CNN_Keras.ipynb'
'2a. Download flowers Dataset.ipynb'
'2b. Visualize an Image.ipynb'
'2c. Image Classification - Flowers.ipynb'
'2d. Image Augmentation.ipynb'
'2e. Image Classification - Flowers with Augmentation and Custom Batch Generator .ipynb'
'2e. Image Classification - with Data Augmentation - classwork.ipynb'
'2e. Image Classification - with Data Augmentation.ipynb'
 blog-authorship-corpus.zip
 blogtext.csv
 CNN
 CV_Project2_Dog_Breed_Classification_Questions_Residency8.ipynb
 DogBreed_Classification
 images
 images.zip
 NLP
 Notebooks
 Notebooks-20200131T034024Z-001.zip
 Project_SNLP_R8.ipynb
'Python Generators.ipynb'
 R8_Internal_Lab_ACV_NLP_Question.ipynb
'Statistical NLP Project Brief.pdf'
 tweets.csv
 vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5


In [0]:
from zipfile import ZipFile
with ZipFile('blog-authorship-corpus.zip','r') as z:
  z.extractall()

In [0]:
df = pd.read_csv('blogtext.csv')

2. Preprocess rows of the “text” column (7.5 points)

  a. Remove unwanted characters

  b. Convert text to lowercase

  c. Remove unwanted spaces

  d. Remove stopwords
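
The four sub-steps can also be combined into one function. The sketch below is one possible arrangement, not necessarily the one used in the cells that follow: the `preprocess` name and the tiny stopword set are hypothetical (the notebook uses NLTK's English list), and unwanted characters are replaced by a space rather than deleted so that neighbouring words are not glued together.

```python
import re

# Tiny illustrative stopword set (an assumption); swap in NLTK's list for real use.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "and", "of", "in", "has", "been"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase, strip unwanted characters, collapse spaces, drop stopwords."""
    text = text.lower()                        # b. convert to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # a. replace unwanted characters with a space
    tokens = text.split()                      # c. split() discards repeated/leading/trailing spaces
    return " ".join(t for t in tokens if t not in stopwords)  # d. remove stopwords

print(preprocess("  Info has been found (+/- 100 pages,   and 4.5 MB of .pdf files)  "))
# prints: info found 100 pages 4 5 mb pdf files
```

Lowercasing before the stopword check matters, because stopword lists are usually lowercase and "The" would otherwise survive.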

In [9]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [10]:
print('Columns:\n',df.columns)

Columns:
 Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')


In [11]:
print('Number of unique entries in topic:',len(df.topic.unique()))
print('Unique topic values:\n',df.topic.unique())

Number of unique entries in topic: 40
Unique topic values:
 ['Student' 'InvestmentBanking' 'indUnk' 'Non-Profit' 'Banking' 'Education'
 'Engineering' 'Science' 'Communications-Media' 'BusinessServices'
 'Sports-Recreation' 'Arts' 'Internet' 'Museums-Libraries' 'Accounting'
 'Technology' 'Law' 'Consulting' 'Automotive' 'Religion' 'Fashion'
 'Publishing' 'Marketing' 'LawEnforcement-Security' 'HumanResources'
 'Telecommunications' 'Military' 'Government' 'Transportation'
 'Architecture' 'Advertising' 'Agriculture' 'Biotech' 'RealEstate'
 'Manufacturing' 'Construction' 'Chemicals' 'Maritime' 'Tourism'
 'Environment']


In [12]:
print('Number of unique entries in date:',len(df.date.unique()))
print('Unique date values:\n',df.date.unique())

Number of unique entries in date: 2616
Unique date values:
 ['14,May,2004' '13,May,2004' '12,May,2004' ... '05,august,2004'
 '04,august,2004' '02,august,2004']


In [13]:
print('Number of unique entries in sign:',len(df.sign.unique()))
print('Unique sign values:\n',df.sign.unique())

Number of unique entries in sign: 12
Unique sign values:
 ['Leo' 'Aquarius' 'Aries' 'Capricorn' 'Gemini' 'Cancer' 'Sagittarius'
 'Scorpio' 'Libra' 'Virgo' 'Taurus' 'Pisces']


In [14]:
print('Number of unique entries in gender:',len(df.gender.unique()))
print('Unique gender values:\n',df.gender.unique())

Number of unique entries in gender: 2
Unique gender values:
 ['male' 'female']


In [15]:
print('Number of unique entries in age:',len(df.age.unique()))
print('Unique age values:\n',df.age.unique())
print('minimum and maximum age of the authors:',(min(df.age),max(df.age)))

Number of unique entries in age: 26
Unique age values:
 [15 33 14 25 17 23 37 26 24 27 45 34 41 44 16 39 35 36 46 42 13 38 43 40
 47 48]
minimum and maximum age of the authors: (13, 48)


In [16]:
df.shape

(681284, 7)

In [0]:
n = 10000

In [0]:
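# Take an explicit copy so that later column assignments do not trigger SettingWithCopyWarning
df_subset = df.head(n).copy()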

In [19]:
df_subset.shape

(10000, 7)

In [20]:
df_subset['text'] = df_subset['text'].str.lower()

In [21]:
df_subset.text[2]

"           in het kader van kernfusie op aarde:  maak je eigen waterstofbom   how to build an h-bomb from: ascott@tartarus.uwa.edu.au (andrew scott) newsgroups: rec.humor subject: how to build an h-bomb (humorous!) date: 7 feb 1994 07:41:14 gmt organization: the university of western australia  original file dated 12th november 1990. seemed to be a transcript of a 'seven days' article. poorly formatted and corrupted. i have added the text between 'examine under a microscope' and 'malleable, like gold,' as it was missing. if anyone has the full text, please distribute. i am not responsible for the accuracy of this information. converted to html by dionisio@infinet.com 11/13/98. (did a little spell-checking and some minor edits too.) stolen from  urllink http://my.ohio.voyager.net/~dionisio/fun/m...own-h-bomb.html  and reformatted the html. it now validates to xhtml 1.0 strict. how to build an h-bomb making and owning an h-bomb is the kind of challenge real americans seek. who wants to 

In [22]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
# Import stopwords with nltk; a set makes the membership test below fast.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

In [24]:
# Exclude stopwords with a list comprehension and pandas.Series.apply.
df_subset['text_without_stopwords'] = df_subset['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

In [25]:
df_subset.text_without_stopwords[1]

'team members: drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering (me) urllink mail'

In [0]:
import string

# Remove punctuation, lowercase the text, and drop non-ASCII characters
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    # The encode/decode round trip silently discards any non-ASCII character
    txt = txt.encode("utf8").decode("ascii", 'ignore')
    return txt


corpus = [clean_text(x) for x in df_subset.text_without_stopwords]

In [27]:
corpus[2]

'het kader van kernfusie op aarde maak je eigen waterstofbom build hbomb from ascotttartarusuwaeduau andrew scott newsgroups rechumor subject build hbomb humorous date 7 feb 1994 074114 gmt organization university western australia original file dated 12th november 1990 seemed transcript seven days article poorly formatted corrupted added text examine microscope malleable like gold missing anyone full text please distribute responsible accuracy information converted html dionisioinfinetcom 111398 did little spellchecking minor edits too stolen urllink httpmyohiovoyagernetdionisiofunmownhbombhtml reformatted html validates xhtml 10 strict build hbomb making owning hbomb kind challenge real americans seek wants passive victim nuclear war when little effort active participant bomb shelters losers wants huddle together underground eating canned spam winners want push button themselves making hbomb big step nuclear assertiveness training  called taking charge were sure enjoy risks heady thr

In [28]:
type(corpus)

list

In [29]:
corpus[:5]

['info found  100 pages 45 mb pdf files wait untill team leader processed learns html',
 'team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail',
 'het kader van kernfusie op aarde maak je eigen waterstofbom build hbomb from ascotttartarusuwaeduau andrew scott newsgroups rechumor subject build hbomb humorous date 7 feb 1994 074114 gmt organization university western australia original file dated 12th november 1990 seemed transcript seven days article poorly formatted corrupted added text examine microscope malleable like gold missing anyone full text please distribute responsible accuracy information converted html dionisioinfinetcom 111398 did little spellchecking minor edits too stolen urllink httpmyohiovoyagernetdionisiofunmownhbombhtml reformatted html validates xhtml 10 strict build hbomb making owning hbomb kind challenge real americans seek wants passive victim nuclear war when little effort active participant bomb shelters losers w

In [30]:
df_subset['processed_text'] = corpus

In [31]:
df_subset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_without_stopwords,processed_text
0,2059027,male,15,Student,Leo,"14,May,2004","info has been found (+/- 100 pages,...","info found (+/- 100 pages, 4.5 mb .pdf files) ...",info found 100 pages 45 mb pdf files wait unt...
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members: drewe...,team members: drewes van der laag urllink mail...,team members drewes van der laag urllink mail ...
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde...,het kader van kernfusie op aarde: maak je eige...,het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo!'s toolbar i can ...,thanks yahoo!'s toolbar 'capture' urls popups....,thanks yahoos toolbar capture urls popupswhich...


3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

  a. Label columns to merge: “gender”, “age”, “topic”, “sign”

  b. After completing the previous step, there should be only two columns in your dataframe i.e. “text” and “labels” as shown in the below image
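
One way to read this step is to keep the four values as a list of separate labels per row, so that each row really has several labels. The sketch below assumes that reading; the `df_demo` frame and its rows are made up for illustration.

```python
import pandas as pd

# Made-up rows with the same label columns as the corpus.
df_demo = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 33],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Aquarius"],
    "text": ["testing testing", "thanks yahoos toolbar"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Build one list of lowercase string labels per row; age is converted to a string.
df_demo["labels"] = pd.Series(
    [[str(v).lower() for v in row] for row in df_demo[label_cols].itertuples(index=False)],
    index=df_demo.index,
)

# Keep only the two columns the brief asks for.
df_demo = df_demo[["text", "labels"]]
print(df_demo)
```

With labels stored this way, a multi-label binarizer produces one column per distinct value (for example `male`, `15`, `student`, `leo`) instead of one column per combination of values.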

In [32]:
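# Note: string concatenation yields one combined label per row (e.g. 'male15StudentLeo'),
# not four separate labels.
df_subset['labels'] = df_subset['gender'] + df_subset['age'].astype(str) + df_subset['topic'] + df_subset['sign']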

In [33]:
df_subset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_without_stopwords,processed_text,labels
0,2059027,male,15,Student,Leo,"14,May,2004","info has been found (+/- 100 pages,...","info found (+/- 100 pages, 4.5 mb .pdf files) ...",info found 100 pages 45 mb pdf files wait unt...,male15StudentLeo
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members: drewe...,team members: drewes van der laag urllink mail...,team members drewes van der laag urllink mail ...,male15StudentLeo
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde...,het kader van kernfusie op aarde: maak je eige...,het kader van kernfusie op aarde maak je eigen...,male15StudentLeo
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,male15StudentLeo
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo!'s toolbar i can ...,thanks yahoo!'s toolbar 'capture' urls popups....,thanks yahoos toolbar capture urls popupswhich...,male33InvestmentBankingAquarius


In [34]:
df_subset['labels'] = df_subset['labels'].str.lower()

In [35]:
df_subset.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_without_stopwords,processed_text,labels
0,2059027,male,15,Student,Leo,"14,May,2004","info has been found (+/- 100 pages,...","info found (+/- 100 pages, 4.5 mb .pdf files) ...",info found 100 pages 45 mb pdf files wait unt...,male15studentleo
1,2059027,male,15,Student,Leo,"13,May,2004",these are the team members: drewe...,team members: drewes van der laag urllink mail...,team members drewes van der laag urllink mail ...,male15studentleo
2,2059027,male,15,Student,Leo,"12,May,2004",in het kader van kernfusie op aarde...,het kader van kernfusie op aarde: maak je eige...,het kader van kernfusie op aarde maak je eigen...,male15studentleo
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing!!! testing!!!,testing testing,male15studentleo
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks to yahoo!'s toolbar i can ...,thanks yahoo!'s toolbar 'capture' urls popups....,thanks yahoos toolbar capture urls popupswhich...,male33investmentbankingaquarius


In [0]:
df1 = df_subset[['processed_text' , 'labels']]

In [37]:
df1.head()

Unnamed: 0,processed_text,labels
0,info found 100 pages 45 mb pdf files wait unt...,male15studentleo
1,team members drewes van der laag urllink mail ...,male15studentleo
2,het kader van kernfusie op aarde maak je eigen...,male15studentleo
3,testing testing,male15studentleo
4,thanks yahoos toolbar capture urls popupswhich...,male33investmentbankingaquarius


4. Separate features and labels, and split the data into training and testing (5 points)

In [0]:
X = df1['processed_text']
y = df1['labels']

In [0]:
# split the new DataFrame into training and testing sets [Default test size = 25%]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [40]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(7500,)
(2500,)
(7500,)
(2500,)


5. Vectorize the features (5 points)

  a. Create a Bag of Words using count vectorizer

        i. Use ngram_range=(1, 2)

        ii. Vectorize training and testing features

  b. Print the term-document matrix

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2),lowercase=True)

In [0]:
x_train_dtm = vect.fit_transform(x_train)

In [43]:
# rows are documents, columns are terms (aka "tokens" or "features")
x_train_dtm.shape

(7500, 559037)

In [0]:
x_test_dtm = vect.transform(x_test)

In [45]:
x_train_dtm

<7500x559037 sparse matrix of type '<class 'numpy.int64'>'
	with 1174698 stored elements in Compressed Sparse Row format>

6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)
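
If each row holds a list of separate labels, the count dictionary can be built with `collections.Counter`. The label lists below are made up for illustration.

```python
from collections import Counter

# Made-up label lists, one list per blog post.
label_lists = [
    ["male", "15", "student", "leo"],
    ["male", "15", "student", "leo"],
    ["female", "33", "indunk", "aquarius"],
]

# Flatten the lists and count every individual label.
label_counts = dict(Counter(label for labels in label_lists for label in labels))
print(label_counts)
```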

In [46]:
print('Number of unique labels:',len(df1.labels.unique()))
print('Unique labels values:\n',df1.labels.unique())

Number of unique labels: 200
Unique labels values:
 ['male15studentleo' 'male33investmentbankingaquarius'
 'female14indunkaries' 'female25indunkcapricorn' 'female17studentgemini'
 'female17studentaries' 'female23indunkaquarius' 'male25non-profitcancer'
 'female33bankingaquarius' 'female37indunkaquarius'
 'female25indunksagittarius' 'male15studentaquarius' 'male26indunkleo'
 'female24indunkscorpio' 'female27educationaquarius'
 'female45indunksagittarius' 'male24engineeringlibra' 'male15sciencelibra'
 'female15studentgemini' 'female34indunkscorpio'
 'male41communications-medialibra' 'male24businessservicescancer'
 'male23indunksagittarius' 'male17studentsagittarius'
 'female14indunkcancer' 'male17sports-recreationcapricorn'
 'male14studentscorpio' 'female23indunklibra' 'female23indunkvirgo'
 'female25indunktaurus' 'female15artspisces' 'male44indunktaurus'
 'female15studentcancer' 'male23indunktaurus' 'female15studentlibra'
 'female27educationgemini' 'female33indunksagittarius'
 'female16

In [47]:
df1.labels.value_counts()

male35technologyaries        2294
male36fashionaries           1616
female27indunktaurus          605
female17indunkscorpio         558
female34indunksagittarius     532
                             ... 
female36studentgemini           1
female33indunksagittarius       1
male16indunkpisces              1
male25non-profitcapricorn       1
female40internetaquarius        1
Name: labels, Length: 200, dtype: int64

In [0]:
labels_dict = dict(df1.labels.value_counts())

In [49]:
labels_dict

{'female13indunklibra': 9,
 'female13studentaquarius': 31,
 'female13studentsagittarius': 2,
 'female14indunkaquarius': 5,
 'female14indunkaries': 21,
 'female14indunkcancer': 7,
 'female14indunkleo': 4,
 'female14studentcancer': 10,
 'female14studentcapricorn': 2,
 'female14studenttaurus': 11,
 'female15artspisces': 2,
 'female15indunkaquarius': 11,
 'female15indunkaries': 26,
 'female15studentaquarius': 62,
 'female15studentcancer': 9,
 'female15studentgemini': 5,
 'female15studentleo': 17,
 'female15studentlibra': 131,
 'female15studentpisces': 74,
 'female15studentvirgo': 4,
 'female16educationcancer': 24,
 'female16indunkcapricorn': 60,
 'female16indunkleo': 16,
 'female16indunksagittarius': 4,
 'female16indunktaurus': 20,
 'female16studentaquarius': 7,
 'female16studentcancer': 1,
 'female16studentcapricorn': 4,
 'female16studentlibra': 16,
 'female16studenttaurus': 32,
 'female17indunkcancer': 139,
 'female17indunkleo': 56,
 'female17indunkscorpio': 558,
 'female17studentaries':

7. Transform the labels - (7.5 points)

  As we have noticed before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that the prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

  a. Convert your train and test labels using MultiLabelBinarizer
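
A small sketch of what MultiLabelBinarizer does with several labels per example, on made-up label lists. The binarizer is fitted on the training labels only and then reused for the test labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists: one list of labels per example.
train_labels = [["male", "15", "student", "leo"], ["female", "33", "indunk", "aquarius"]]
test_labels = [["female", "15", "student", "leo"]]

mlb_demo = MultiLabelBinarizer()
train_y = mlb_demo.fit_transform(train_labels)  # learns the sorted set of distinct labels
test_y = mlb_demo.transform(test_labels)        # encodes with the already learned labels

print(mlb_demo.classes_)
print(train_y)
print(mlb_demo.inverse_transform(test_y))       # back from the 0/1 mask to label tuples
```

Each row of the result is a 0/1 mask over `classes_`. A label that appears only in the test data has no column, so `transform` ignores it and emits a warning.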

In [50]:
y_train

651               male24engineeringlibra
6560                  male36fashionaries
8974                  male15studentvirgo
2348               male35technologyaries
5670                female27indunktaurus
                      ...               
2895               male35technologyaries
7813                  male36fashionaries
905     male17sports-recreationcapricorn
5192               female17indunkscorpio
235                male15studentaquarius
Name: labels, Length: 7500, dtype: object

Create a list of lists from y_train and y_test to pass as input to MultiLabelBinarizer, which expects one iterable of labels per sample.

In [0]:
y_train_list = y_train.tolist()
y_test_list = y_test.tolist()

In [0]:
def listoflists(lst):
    # Wrap each label in its own single-element list
    return [[el] for el in lst]

In [0]:
y_train_listoflist = listoflists(y_train_list)

In [0]:
y_test_listoflist = listoflists(y_test_list)

In [55]:
y_train_listoflist

[['male24engineeringlibra'],
 ['male36fashionaries'],
 ['male15studentvirgo'],
 ['male35technologyaries'],
 ['female27indunktaurus'],
 ['male36fashionaries'],
 ['male39communications-medialibra'],
 ['female24indunkscorpio'],
 ['male36fashionaries'],
 ['female27indunktaurus'],
 ['female36indunkpisces'],
 ['male15studentvirgo'],
 ['female23automotiveaquarius'],
 ['male35technologyaries'],
 ['male35technologyaries'],
 ['male35technologyaries'],
 ['female27educationaquarius'],
 ['male35technologyaries'],
 ['female17studentpisces'],
 ['male35technologyaries'],
 ['male35technologyaries'],
 ['male36fashionaries'],
 ['female17studentsagittarius'],
 ['male36fashionaries'],
 ['male15studentvirgo'],
 ['male17indunkaquarius'],
 ['female17studentaries'],
 ['female34indunksagittarius'],
 ['male36fashionaries'],
 ['female16indunkcapricorn'],
 ['male35technologyaries'],
 ['male36fashionaries'],
 ['male35technologyaries'],
 ['female17indunkscorpio'],
 ['male36fashionaries'],
 ['male35technologyaries'],

In [56]:
y_test_listoflist

[['female16indunkcapricorn'],
 ['male14studentpisces'],
 ['female17indunkscorpio'],
 ['female36indunkpisces'],
 ['female17indunkscorpio'],
 ['male14studentcancer'],
 ['male35technologyaries'],
 ['male35technologyaries'],
 ['male35technologyaries'],
 ['female16indunkcapricorn'],
 ['male23internetaquarius'],
 ['male35technologyaries'],
 ['male36fashionaries'],
 ['female26accountingaquarius'],
 ['male36fashionaries'],
 ['female17indunkcancer'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male35technologyaries'],
 ['female15studentaquarius'],
 ['male35technologyaries'],
 ['male35technologyaries'],
 ['female17studentaries'],
 ['male35technologyaries'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male35technologyaries'],
 ['female24indunksagittarius'],
 ['female27indunktaurus'],
 ['male36fashionaries'],
 ['male36fashionaries'],
 ['male24engineeringlibra'],
 ['male36fashionari

In [57]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train_encoded = mlb.fit_transform(y_train_listoflist)
y_test_encoded = mlb.transform(y_test_listoflist)

In [58]:
y_train_encoded

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [59]:
mlb.classes_

array(['female13indunklibra', 'female13studentaquarius',
       'female13studentsagittarius', 'female14indunkaquarius',
       'female14indunkaries', 'female14indunkcancer', 'female14indunkleo',
       'female14studentcancer', 'female14studentcapricorn',
       'female14studenttaurus', 'female15artspisces',
       'female15indunkaquarius', 'female15indunkaries',
       'female15studentaquarius', 'female15studentcancer',
       'female15studentgemini', 'female15studentleo',
       'female15studentlibra', 'female15studentpisces',
       'female15studentvirgo', 'female16educationcancer',
       'female16indunkcapricorn', 'female16indunkleo',
       'female16indunksagittarius', 'female16indunktaurus',
       'female16studentaquarius', 'female16studentcapricorn',
       'female16studentlibra', 'female16studenttaurus',
       'female17indunkcancer', 'female17indunkleo',
       'female17indunkscorpio', 'female17studentaries',
       'female17studentcapricorn', 'female17studentgemini',
       

In [0]:
y_train_encoded_df = pd.DataFrame(y_train_encoded, columns = mlb.classes_)
y_test_encoded_df = pd.DataFrame(y_test_encoded, columns = mlb.classes_)

In [61]:
y_train.head()

651     male24engineeringlibra
6560        male36fashionaries
8974        male15studentvirgo
2348     male35technologyaries
5670      female27indunktaurus
Name: labels, dtype: object

In [62]:
y_train_encoded_df.head()

Unnamed: 0,female13indunklibra,female13studentaquarius,female13studentsagittarius,female14indunkaquarius,female14indunkaries,female14indunkcancer,female14indunkleo,female14studentcancer,female14studentcapricorn,female14studenttaurus,female15artspisces,female15indunkaquarius,female15indunkaries,female15studentaquarius,female15studentcancer,female15studentgemini,female15studentleo,female15studentlibra,female15studentpisces,female15studentvirgo,female16educationcancer,female16indunkcapricorn,female16indunkleo,female16indunksagittarius,female16indunktaurus,female16studentaquarius,female16studentcapricorn,female16studentlibra,female16studenttaurus,female17indunkcancer,female17indunkleo,female17indunkscorpio,female17studentaries,female17studentcapricorn,female17studentgemini,female17studentleo,female17studentpisces,female17studentsagittarius,female17studentscorpio,female17studenttaurus,...,male24telecommunicationssagittarius,male25accountinglibra,male25artsaries,male25businessservicessagittarius,male25communications-mediapisces,male25indunktaurus,male25internetaries,male25non-profitcancer,male25technologyaries,male25technologypisces,male26educationlibra,male26indunkgemini,male26indunkleo,male26indunksagittarius,male26museums-librariesleo,male26sciencescorpio,male26sports-recreationleo,male26technologylibra,male26technologyscorpio,male27educationaries,male27educationscorpio,male27educationvirgo,male27technologypisces,male33engineeringaries,male33investmentbankingaquarius,male33non-profitgemini,male33technologysagittarius,male35indunkscorpio,male35technologyaries,male36fashionaries,male36indunktaurus,male37businessservicessagittarius,male37lawenforcement-securityaquarius,male39communications-medialibra,male39educationvirgo,male41communications-medialibra,male42religionaries,male44indunktaurus,male45humanresourcesaquarius,male46consultinggemini
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [63]:
y_test_encoded_df.head()

Unnamed: 0,female13indunklibra,female13studentaquarius,female13studentsagittarius,female14indunkaquarius,female14indunkaries,female14indunkcancer,female14indunkleo,female14studentcancer,female14studentcapricorn,female14studenttaurus,female15artspisces,female15indunkaquarius,female15indunkaries,female15studentaquarius,female15studentcancer,female15studentgemini,female15studentleo,female15studentlibra,female15studentpisces,female15studentvirgo,female16educationcancer,female16indunkcapricorn,female16indunkleo,female16indunksagittarius,female16indunktaurus,female16studentaquarius,female16studentcapricorn,female16studentlibra,female16studenttaurus,female17indunkcancer,female17indunkleo,female17indunkscorpio,female17studentaries,female17studentcapricorn,female17studentgemini,female17studentleo,female17studentpisces,female17studentsagittarius,female17studentscorpio,female17studenttaurus,...,male24telecommunicationssagittarius,male25accountinglibra,male25artsaries,male25businessservicessagittarius,male25communications-mediapisces,male25indunktaurus,male25internetaries,male25non-profitcancer,male25technologyaries,male25technologypisces,male26educationlibra,male26indunkgemini,male26indunkleo,male26indunksagittarius,male26museums-librariesleo,male26sciencescorpio,male26sports-recreationleo,male26technologylibra,male26technologyscorpio,male27educationaries,male27educationscorpio,male27educationvirgo,male27technologypisces,male33engineeringaries,male33investmentbankingaquarius,male33non-profitgemini,male33technologysagittarius,male35indunkscorpio,male35technologyaries,male36fashionaries,male36indunktaurus,male37businessservicessagittarius,male37lawenforcement-securityaquarius,male39communications-medialibra,male39educationvirgo,male41communications-medialibra,male42religionaries,male44indunktaurus,male45humanresourcesaquarius,male46consultinggemini
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [64]:
mlb.get_params()

{'classes': None, 'sparse_output': False}

8. Choose a classifier - (5 points)

  In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k binary classifiers are trained, one per tag (k = number of tags). As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

  a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
  
  b. As the One-vs-Rest approach might not have been discussed in the sessions, we are providing the code for it

9. Fit the classifier, make predictions and get the accuracy (5 points)

  a. Print the following

    i. Accuracy score

    ii. F1 score

    iii. Average precision score

    iv. Average recall score

    v. Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging.

10. Print true label and predicted label for any five examples (7.5 points)
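The One-vs-Rest idea from step 8 can be illustrated on toy data before running it on the full corpus. The sketch below is not the assignment's pipeline: the feature matrix `X_toy`, the label matrix `Y_toy`, and the `max_iter` setting are all made up for illustration. It only shows that one binary classifier is fitted per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: 40 samples, 5 features, 3 binary label columns.
rng = np.random.RandomState(0)
X_toy = rng.rand(40, 5)
Y_toy = (X_toy[:, :3] > 0.5).astype(int)

# One LogisticRegression is trained per label column.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_toy, Y_toy)

n_estimators = len(ovr.estimators_)   # one fitted estimator per label
pred_shape = ovr.predict(X_toy).shape  # one 0/1 prediction per sample and label
print(n_estimators, pred_shape)
```

With several hundred combined labels, as in this case study, the same loop fits several hundred binary models, which is why training can be slow.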

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
def train_classifier(x_train, y_train):
    
    # Create and fit a LogisticRegression wrapped in OneVsRestClassifier.
    
    model = OneVsRestClassifier(LogisticRegression(penalty='l2', C=1.0))
    model.fit(x_train, y_train)
    return model
    
 
classifier = train_classifier(x_train_dtm, y_train_encoded_df)

y_pred_labels = classifier.predict(x_test_dtm)
y_pred_scores = classifier.decision_function(x_test_dtm)

In [66]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score, precision_score, classification_report

def evaluation_scores(y_val, predicted):
    # On multi-label targets, accuracy_score is subset accuracy:
    # a row counts as correct only if every label matches.
    print("Accuracy={}".format(accuracy_score(y_val, predicted)))
    # Note: precision is macro-averaged while recall is micro-averaged,
    # so these two numbers are not directly comparable. precision_score
    # also keeps the default zero_division, which warns and scores
    # labels with no predicted samples as 0.
    print("Precision Score={}".format(precision_score(y_val, predicted, average='macro')))
    print("Recall Score={}".format(recall_score(y_val, predicted, average='micro', zero_division=1)))
    print("F1_macro={}".format(f1_score(y_val, predicted, average='macro', zero_division=1)))
    print("F1_micro={}".format(f1_score(y_val, predicted, average='micro', zero_division=1)))
    print("F1_wted={}".format(f1_score(y_val, predicted, average='weighted', zero_division=1)))
    
print('Tfidf')
evaluation_scores(y_test_encoded_df, y_pred_labels)
print ("Average Precision Score={}".format(average_precision_score(y_test_encoded_df, y_pred_scores)))

Tfidf
Accuracy=0.3736
Precision Score=0.1875174282684632
Recall Score=0.37725270324389265
F1_macro=0.2319321380268177
F1_micro=0.516730663741086
F1_wted=0.46254147920679295
Average Precision Score=nan
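The `nan` above most likely arises because some label columns in the test split contain no positive examples, which leaves recall undefined for those labels. One possible workaround, sketched here under that assumption, is to average only over labels that actually occur in the test set. The helper name `average_precision_present_labels` and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def average_precision_present_labels(y_true, y_scores, average="macro"):
    """Average precision over label columns with at least one positive."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    present = y_true.sum(axis=0) > 0  # mask of labels that occur in y_true
    return average_precision_score(
        y_true[:, present], y_scores[:, present], average=average
    )

# Toy data: the third label never occurs, so it is excluded from the average.
y_true_toy = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
y_scores_toy = np.array([[0.9, 0.2, 0.1],
                         [0.3, 0.8, 0.4],
                         [0.7, 0.1, 0.2],
                         [0.2, 0.6, 0.3]])
ap_toy = average_precision_present_labels(y_true_toy, y_scores_toy)
print(ap_toy)
```

In the notebook, the same helper could be called as `average_precision_present_labels(y_test_encoded_df, y_pred_scores)`. The resulting number describes only the labels present in the test split, so it should be reported as such.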


In [67]:
# print the classification report
print(classification_report(y_test_encoded_df, y_pred_labels, zero_division=1))

              precision    recall  f1-score   support

           0       1.00      0.00      0.00         1
           1       1.00      0.50      0.67         8
           2       1.00      0.00      0.00         1
           3       1.00      1.00      1.00         1
           4       1.00      0.00      0.00         7
           5       1.00      1.00      1.00         0
           6       1.00      0.00      0.00         1
           7       0.00      1.00      0.00         0
           8       1.00      0.00      0.00         1
           9       1.00      0.00      0.00         2
          10       1.00      1.00      1.00         0
          11       0.00      0.00      0.00         2
          12       1.00      0.33      0.50         6
          13       0.33      0.08      0.13        12
          14       1.00      0.00      0.00         2
          15       1.00      0.00      0.00         1
          16       1.00      0.33      0.50         6
         ...

In [0]:
true_labels = np.array(y_test_encoded_df[200:205])

In [0]:
predicted_labels = y_pred_labels[200:205]

In [70]:
mlb.inverse_transform(true_labels)

[('female25studentleo',),
 ('male36fashionaries',),
 ('female34indunksagittarius',),
 ('female15studentaquarius',),
 ('male35technologyaries',)]

In [71]:
mlb.inverse_transform(predicted_labels)

[(),
 ('male36fashionaries',),
 ('female34indunksagittarius',),
 (),
 ('male35technologyaries',)]
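For step 10 it can be easier to read the true and predicted labels side by side. The following is a minimal sketch on made-up data: `toy_mlb`, the label names, and the prediction matrix are hypothetical stand-ins for `mlb`, `true_labels`, and `predicted_labels`. An all-zero prediction row decodes to an empty tuple, which matches the `()` entries in the output above.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical labels; classes are sorted alphabetically: a, b, c.
toy_mlb = MultiLabelBinarizer()
toy_true = toy_mlb.fit_transform([("a",), ("b",), ("c",)])

# Hypothetical predictions: the second row has no label switched on.
toy_pred = np.array([[1, 0, 0],
                     [0, 0, 0],
                     [0, 0, 1]])

pairs = list(zip(toy_mlb.inverse_transform(toy_true),
                 toy_mlb.inverse_transform(toy_pred)))
for i, (true_tags, pred_tags) in enumerate(pairs):
    print("example {}: true={} predicted={}".format(i, true_tags, pred_tags))
```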

A macro-average computes the metric independently for each class and then takes the unweighted mean, so all classes are treated equally. A micro-average pools the contributions of all classes (true positives, false positives, and false negatives) before computing the metric, so every example carries equal weight. In a multi-class classification setup, the micro-average is preferable if you suspect class imbalance (i.e., you may have many more examples of one class than of the others).
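The difference between the averaging modes is easiest to see on a small imbalanced example. The sketch below uses made-up arrays with two labels: the frequent label is predicted perfectly and the rare label is missed entirely. Only `f1_score` and its `average` and `zero_division` parameters come from scikit-learn; everything else is illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

# Label 0 occurs four times and is always predicted correctly.
# Label 1 occurs once and is never predicted.
y_true_demo = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]])
y_pred_demo = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 0]])

# Macro: mean of per-label F1, i.e. (1.0 + 0.0) / 2.
f1_macro = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)
# Micro: pooled counts TP=4, FP=0, FN=1, i.e. 2*4 / (2*4 + 0 + 1).
f1_micro = f1_score(y_true_demo, y_pred_demo, average="micro", zero_division=0)
# Weighted: per-label F1 weighted by support, i.e. (4*1.0 + 1*0.0) / 5.
f1_weighted = f1_score(y_true_demo, y_pred_demo, average="weighted", zero_division=0)

print(f1_macro, f1_micro, f1_weighted)
```

The macro score drops sharply because the rare label counts as much as the frequent one, while the micro and weighted scores stay high because they follow the number of examples. This is the same pattern as the gap between `F1_macro` and `F1_micro` in the results above.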

