## Project description

Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about a writer without knowing anything about them is a challenge. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a multi-label classification problem.

# Dataset

Blog Authorship Corpus: over 600,000 posts from more than 19 thousand bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age, but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

*   8240 "10s" blogs (ages 13-17)
*   8086 "20s" blogs (ages 23-27)
*   2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

# Approach and Steps


1. Load the dataset (5 points)
    a. Tip: As the dataset is large, use fewer rows. Check what works well on your machine and decide accordingly.
2. Preprocess rows of the “text” column (7.5 points)
    a. Remove unwanted characters
    b. Convert text to lowercase
    c. Remove unwanted spaces
    d. Remove stopwords
3. As we want to make this a multi-label classification problem, you are required to merge all the label columns together, so that all the labels for a particular sentence are in one place (7.5 points)
    a. Label columns to merge: “gender”, “age”, “topic”, “sign”
    b. After completing the previous step, there should be only two columns in your data frame, i.e. “text” and “labels”, as shown in the image below

![image#1](https://drive.google.com/file/d/1eVPo-opGSk5xXJRF_XIcnIry9rCHO_Co/view?usp=sharing)

4. Separate features and labels, and split the data into training and testing (5 points)
5. Vectorize the features (5 points)
    a. Create a Bag of Words using count vectorizer
        i. Use ngram_range=(1, 2)
        ii. Vectorize training and testing features
    b. Print the term-document matrix
6. Create a dictionary to get the count of every label, i.e. the key will be the label name and the value will be the total count of that label. Check the image below for reference (5 points)

![image#2](https://drive.google.com/file/d/13AMyhdMqkxIZI9DTihhXCxwpX35O9apk/view?usp=sharing)

7. Transform the labels (7.5 points). As noted before, each example in this task can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
    a. Convert your train and test labels using MultiLabelBinarizer
8. Choose a classifier (5 points). In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a basic classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers to train is large.
    a. Use a linear classifier of your choice and wrap it in OneVsRestClassifier to train it on every label
    b. As the One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for it

![image#3](https://drive.google.com/file/d/1YjWRl2lxCPBUS2lj5Mri94l8JcGntowD/view?usp=sharing)

9. Fit the classifier, make predictions and get the accuracy (5 points)
    a. Print the following
        i. Accuracy score
        ii. F1 score
        iii. Average precision score
        iv. Average recall score
        v. Tip: Make sure you are familiar with all of them. How would you expect things to work in the multi-label scenario? Read about micro/macro/weighted averaging
10. Print the true label and predicted label for any five examples (7.5 points)

### 1. Mount Google Drive, read the dataset (blog-authorship-corpus), and drop NA rows while reading

In [55]:
from google.colab import drive
drive.mount('/gdrive')

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).


In [0]:
import pandas as pd
import numpy as np
# read the zipped CSV into pandas using an absolute Drive path. Please change the path as needed
data = pd.read_csv('/gdrive/My Drive/Colab Notebook - AIML/Project 10 (NLP)/blog-authorship-corpus.zip').dropna()

In [57]:
data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


### 1.b Limit the dataframe to 4857 rows

In [0]:
# take a copy so that adding columns later does not write into a slice of data
df = data.iloc[:4857].copy()

In [59]:
df

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
...,...,...,...,...,...,...,...
4852,1233335,male,25,BusinessServices,Sagittarius,"27,May,2004",There is khaos in my life... and now th...
4853,1233335,male,25,BusinessServices,Sagittarius,"26,May,2004","heather, i taped it for you. you're goi..."
4854,1233335,male,25,BusinessServices,Sagittarius,"25,May,2004",I made a doctor's appointment today to ...
4855,1233335,male,25,BusinessServices,Sagittarius,"25,May,2004","Well, just got back from the Skar and f..."
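The tip about using fewer rows can also be applied at read time. The sketch below assumes a small in-memory CSV (`csv_text`, a hypothetical stand-in with the same column names as the corpus file) and uses the `nrows` parameter of `pd.read_csv` so that only the first rows are parsed. Note that this drops NA rows after limiting, so the resulting row count may differ slightly from slicing a fully loaded, already cleaned frame as done above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: same column names, three rows.
csv_text = (
    'id,gender,age,topic,sign,date,text\n'
    '1,male,15,Student,Leo,"14,May,2004",first post\n'
    '2,female,25,Arts,Aries,"15,May,2004",second post\n'
    '3,male,35,Technology,Libra,"16,May,2004",third post\n'
)

# nrows stops parsing after the requested number of data rows.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2).dropna()
print(sample.shape)
```

With the real file, the same `nrows` argument would be passed alongside the Drive path used above.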


### 2. Preprocess NLP text (lowercase, remove special characters, remove numbers, remove emails, etc)

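A `preprocess` function is intended to handle lowercase conversion, removal of HTML tags and URLs, and stemming and lemmatization of the words. The cell below downloads and imports the NLTK resources (stopwords, WordNet lemmatizer, Porter stemmer) used for this.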

In [60]:
import nltk
nltk.download("stopwords")
nltk.download('wordnet')
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
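As an illustration of the preprocessing steps listed in the approach (remove unwanted characters, lowercase, remove extra spaces, remove stopwords), here is a minimal sketch. It is not the notebook's own function: the `STOPWORDS` set is a tiny hard-coded stand-in for NLTK's English stopword list, and stemming and lemmatization are left out to keep the example self-contained.

```python
import re

# Assumption: a tiny hard-coded stopword set standing in for
# nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "i"}

def preprocess(text):
    """Lowercase, strip links and non-letters, collapse spaces, drop stopwords."""
    text = text.lower()
    # remove URLs and the corpus's "urllink" placeholder
    text = re.sub(r"https?://\S+|www\.\S+|urllink", " ", text)
    # keep only letters and whitespace
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() also collapses repeated whitespace
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = preprocess("  Testing!!! The URL is urllink http://example.com   and 42 cats ")
print(cleaned)
```

On a dataframe, such a function could be applied with something like `df['text'].map(preprocess)` before vectorizing.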



In [62]:
df.dtypes

id         int64
gender    object
age        int64
topic     object
sign      object
date      object
text      object
dtype: object

### 3. Create a label column by merging “gender”, “age”, “topic”, “sign”



In [0]:
df['label']=df['gender'].astype(str)+','+df['age'].astype(str)+','+df['topic'].astype(str)+','+df['sign'].astype(str)

In [64]:
df

Unnamed: 0,id,gender,age,topic,sign,date,text,label
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","male,15,Student,Leo"
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,"male,15,Student,Leo"
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,"male,15,Student,Leo"
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,"male,15,Student,Leo"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,"male,33,InvestmentBanking,Aquarius"
...,...,...,...,...,...,...,...,...
4852,1233335,male,25,BusinessServices,Sagittarius,"27,May,2004",There is khaos in my life... and now th...,"male,25,BusinessServices,Sagittarius"
4853,1233335,male,25,BusinessServices,Sagittarius,"26,May,2004","heather, i taped it for you. you're goi...","male,25,BusinessServices,Sagittarius"
4854,1233335,male,25,BusinessServices,Sagittarius,"25,May,2004",I made a doctor's appointment today to ...,"male,25,BusinessServices,Sagittarius"
4855,1233335,male,25,BusinessServices,Sagittarius,"25,May,2004","Well, just got back from the Skar and f...","male,25,BusinessServices,Sagittarius"


In [0]:
# keep only the text and label columns; copy so new columns can be added safely
df1 = df[['text', 'label']].copy()

In [66]:
df1

Unnamed: 0,text,label
0,"Info has been found (+/- 100 pages,...","male,15,Student,Leo"
1,These are the team members: Drewe...,"male,15,Student,Leo"
2,In het kader van kernfusie op aarde...,"male,15,Student,Leo"
3,testing!!! testing!!!,"male,15,Student,Leo"
4,Thanks to Yahoo!'s Toolbar I can ...,"male,33,InvestmentBanking,Aquarius"
...,...,...
4852,There is khaos in my life... and now th...,"male,25,BusinessServices,Sagittarius"
4853,"heather, i taped it for you. you're goi...","male,25,BusinessServices,Sagittarius"
4854,I made a doctor's appointment today to ...,"male,25,BusinessServices,Sagittarius"
4855,"Well, just got back from the Skar and f...","male,25,BusinessServices,Sagittarius"


Defined a function that splits the comma-separated label string into a list, which makes it possible to identify the individual labels contained in the same row.

In [0]:
def convert2dict(sentence):
    sentence=str(sentence)
    return sentence.split(',')

df1['labels']=df1['label'].map(lambda s:convert2dict(s))

I've intentionally kept two columns, `label` and `labels`:

*   `label`: the comma-separated string, used to build the label-count dictionary in step 6
*   `labels`: the list obtained by splitting that comma-separated string

In [68]:
df1

Unnamed: 0,text,label,labels
0,"Info has been found (+/- 100 pages,...","male,15,Student,Leo","[male, 15, Student, Leo]"
1,These are the team members: Drewe...,"male,15,Student,Leo","[male, 15, Student, Leo]"
2,In het kader van kernfusie op aarde...,"male,15,Student,Leo","[male, 15, Student, Leo]"
3,testing!!! testing!!!,"male,15,Student,Leo","[male, 15, Student, Leo]"
4,Thanks to Yahoo!'s Toolbar I can ...,"male,33,InvestmentBanking,Aquarius","[male, 33, InvestmentBanking, Aquarius]"
...,...,...,...
4852,There is khaos in my life... and now th...,"male,25,BusinessServices,Sagittarius","[male, 25, BusinessServices, Sagittarius]"
4853,"heather, i taped it for you. you're goi...","male,25,BusinessServices,Sagittarius","[male, 25, BusinessServices, Sagittarius]"
4854,I made a doctor's appointment today to ...,"male,25,BusinessServices,Sagittarius","[male, 25, BusinessServices, Sagittarius]"
4855,"Well, just got back from the Skar and f...","male,25,BusinessServices,Sagittarius","[male, 25, BusinessServices, Sagittarius]"
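The same `labels` lists can be built in one step, without the intermediate comma-joined string. The sketch below is an alternative under stated assumptions: `toy` is a hypothetical two-row frame with the notebook's column names, and every label is cast to `str` because `age` is an integer column.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the notebook's df.
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 25],
    "topic": ["Student", "Arts"],
    "sign": ["Leo", "Aries"],
    "text": ["first post", "second post"],
})

label_cols = ["gender", "age", "topic", "sign"]
# Cast every label to str and collect each row's values into a list.
toy["labels"] = toy[label_cols].astype(str).values.tolist()
toy = toy[["text", "labels"]]
print(toy)
```

A side benefit of skipping the join and split is that a label value containing a comma would not be broken apart.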


### 4. Separate features and labels, and split the data into training and testing (5 points)



In [0]:
from sklearn.model_selection import train_test_split

x = df1['text']
y = df1['labels']
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.3,random_state=1)

In [70]:
print(x_train.shape)
print(x_test.shape)

(3399,)
(1458,)


In [71]:
print(y_train.shape)
print(y_test.shape)

(3399,)
(1458,)


In [72]:
y_train

1151                [female, 15, Student, Libra]
3854                 [male, 14, Student, Pisces]
1523               [male, 35, Technology, Aries]
545            [female, 27, Education, Aquarius]
2354               [male, 35, Technology, Aries]
                          ...                   
2895               [male, 35, Technology, Aries]
2763               [male, 35, Technology, Aries]
905     [male, 17, Sports-Recreation, Capricorn]
3980                     [male, 25, Arts, Aries]
235                [male, 15, Student, Aquarius]
Name: labels, Length: 3399, dtype: object

### 5. Vectorize the features

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 2))

# create document-term matrices
x_train_ct = cv.fit_transform(x_train)
x_test_ct = cv.transform(x_test)

In [74]:
#checking the vocabulary size
len(cv.vocabulary_)

263259

In [75]:
#what is there in vocabulary
cv.vocabulary_

{'goodness': 90926,
 'have': 97078,
 'had': 94052,
 'nothing': 152651,
 'to': 230418,
 'do': 63185,
 'this': 226883,
 'whole': 253518,
 'day': 57217,
 'seriously': 193495,
 'went': 250219,
 'online': 160440,
 'for': 81217,
 'few': 77878,
 'hours': 105185,
 'called': 40188,
 'bob': 33966,
 'because': 28553,
 'all': 8720,
 'of': 154397,
 'the': 217296,
 'gentlemen': 87499,
 'were': 250328,
 'over': 164600,
 'there': 224754,
 'and': 12496,
 'found': 83222,
 'out': 163957,
 'that': 215469,
 'they': 225433,
 'leaving': 125664,
 'in': 108716,
 'little': 129128,
 'bit': 32409,
 'anywho': 18562,
 'so': 200150,
 'stayed': 206618,
 'home': 103803,
 'dang': 56689,
 'fell': 77546,
 'asleep': 22467,
 'though': 228178,
 'around': 20756,
 '45': 2840,
 'until': 239706,
 'almost': 9561,
 'was': 245945,
 'quite': 178859,
 'sleepy': 199039,
 'really': 181311,
 'hoping': 104594,
 'tomarrow': 233679,
 'will': 254163,
 'be': 26968,
 'better': 31292,
 'but': 38166,
 'who': 253085,
 'knows': 122776,
 'what': 

Building DTM and vectorizing train and test features

In [76]:
x_train_ct.shape

(3399, 263259)

In [77]:
print(x_train_ct[0])

  (0, 90926)	1
  (0, 97078)	2
  (0, 94052)	1
  (0, 152651)	1
  (0, 230418)	7
  (0, 63185)	1
  (0, 226883)	1
  (0, 253518)	2
  (0, 57217)	3
  (0, 193495)	1
  (0, 250219)	1
  (0, 160440)	1
  (0, 81217)	4
  (0, 77878)	1
  (0, 105185)	1
  (0, 40188)	1
  (0, 33966)	1
  (0, 28553)	3
  (0, 8720)	2
  (0, 154397)	3
  (0, 217296)	5
  (0, 87499)	1
  (0, 250328)	3
  (0, 164600)	1
  (0, 224754)	2
  :	:
  (0, 26008)	1
  (0, 206214)	1
  (0, 109539)	1
  (0, 126304)	1
  (0, 215222)	1
  (0, 141235)	1
  (0, 2098)	1
  (0, 57673)	1
  (0, 27298)	1
  (0, 73506)	1
  (0, 157822)	1
  (0, 119318)	1
  (0, 93385)	1
  (0, 216390)	1
  (0, 130050)	1
  (0, 237966)	1
  (0, 142120)	1
  (0, 233965)	1
  (0, 217938)	1
  (0, 34746)	1
  (0, 114192)	1
  (0, 68154)	1
  (0, 144917)	1
  (0, 49310)	1
  (0, 198571)	1


In [78]:
x_test_ct.shape

(1458, 263259)

In [79]:
x_test_ct[0]

<1x263259 sparse matrix of type '<class 'numpy.int64'>'
	with 486 stored elements in Compressed Sparse Row format>
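To make the effect of `ngram_range=(1, 2)` easier to see than in the 263,259-column matrix above, here is a small sketch on a hypothetical three-document corpus. It shows that the vocabulary holds both single words and word pairs, and that terms not seen during `fit_transform` are ignored by `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not taken from the corpus.
train_docs = ["the cat sat", "the dog sat"]
test_docs = ["the cat ran"]

cv_demo = CountVectorizer(ngram_range=(1, 2))
train_dtm = cv_demo.fit_transform(train_docs)  # learns unigrams and bigrams
test_dtm = cv_demo.transform(test_docs)        # reuses the vocabulary; unseen terms are dropped

# Sorted vocabulary terms correspond to the column order of the matrices.
terms = sorted(cv_demo.vocabulary_)
print(terms)
print(train_dtm.toarray())
print(test_dtm.toarray())
```

In the test document, "ran" and "cat ran" never appeared in training, so only "cat", "the", and "the cat" are counted.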

### 6. Create a dictionary to get the count of every label

In [0]:
text=[]
text.append(df1['label'].str.cat(sep=','))

In [81]:
text

['male,15,Student,Leo,male,15,Student,Leo,male,15,Student,Leo,male,15,Student,Leo,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,InvestmentBanking,Aquarius,male,33,

In [82]:
def convert(lst): 
    return (lst[0].split(',')) 
  
# Driver code 
text = convert(text)
print(text)

['male', '15', 'Student', 'Leo', 'male', '15', 'Student', 'Leo', 'male', '15', 'Student', 'Leo', 'male', '15', 'Student', 'Leo', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'InvestmentBanking', 'Aquarius', 'male', '33', 'Investment

In [83]:
dictionary = {}
for label in text :
  dictionary[label] = dictionary.get(label, 0) + 1

dictionary

{'14': 170,
 '15': 339,
 '16': 67,
 '17': 215,
 '23': 137,
 '24': 353,
 '25': 241,
 '26': 96,
 '27': 86,
 '33': 101,
 '34': 540,
 '35': 2307,
 '36': 60,
 '37': 19,
 '39': 79,
 '41': 14,
 '42': 9,
 '44': 3,
 '45': 14,
 '46': 7,
 'Accounting': 2,
 'Aquarius': 329,
 'Aries': 2483,
 'Arts': 31,
 'Automotive': 14,
 'Banking': 16,
 'BusinessServices': 60,
 'Cancer': 94,
 'Capricorn': 84,
 'Communications-Media': 61,
 'Consulting': 16,
 'Education': 118,
 'Engineering': 119,
 'Gemini': 86,
 'Internet': 20,
 'InvestmentBanking': 70,
 'Law': 3,
 'Leo': 190,
 'Libra': 414,
 'Museums-Libraries': 2,
 'Non-Profit': 47,
 'Pisces': 67,
 'Religion': 4,
 'Sagittarius': 677,
 'Science': 33,
 'Scorpio': 292,
 'Sports-Recreation': 75,
 'Student': 569,
 'Taurus': 100,
 'Technology': 2332,
 'Virgo': 41,
 'female': 1590,
 'indUnk': 1265,
 'male': 3267}
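The label counts can also be computed directly from the per-row label lists, without joining everything into one long string first. This is a sketch of an alternative, assuming label lists shaped like the `labels` column (`label_lists` here is hypothetical toy data), using `collections.Counter` from the standard library.

```python
from collections import Counter

# Hypothetical label lists in the same shape as the notebook's `labels` column.
label_lists = [
    ["male", "15", "Student", "Leo"],
    ["male", "15", "Student", "Leo"],
    ["female", "25", "Arts", "Aries"],
]

# Flatten the per-row lists and count each label in one pass.
label_counts = Counter(label for row in label_lists for label in row)
print(dict(label_counts))
```

`Counter` is a `dict` subclass, so it can be used wherever the plain dictionary above is expected.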

### 7. Transform the labels

As noted before, each example in this task can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train_mlb=mlb.fit_transform(y_train)
y_test_mlb=mlb.transform(y_test)

In [85]:
print(y_train_mlb)
print(y_train_mlb.shape)

[[0 1 0 ... 1 0 0]
 [1 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 [0 1 0 ... 0 0 1]]
(3399, 54)


In [86]:
print(y_test_mlb)
print(y_test_mlb.shape)

[[0 0 0 ... 1 0 0]
 [0 0 0 ... 1 1 0]
 [0 1 0 ... 1 0 0]
 ...
 [0 0 0 ... 1 1 0]
 [0 0 0 ... 1 0 0]
 [0 0 0 ... 0 0 1]]
(1458, 54)


In [87]:
list(mlb.classes_)

['14',
 '15',
 '16',
 '17',
 '23',
 '24',
 '25',
 '26',
 '27',
 '33',
 '34',
 '35',
 '36',
 '37',
 '39',
 '41',
 '42',
 '44',
 '45',
 '46',
 'Accounting',
 'Aquarius',
 'Aries',
 'Arts',
 'Automotive',
 'Banking',
 'BusinessServices',
 'Cancer',
 'Capricorn',
 'Communications-Media',
 'Consulting',
 'Education',
 'Engineering',
 'Gemini',
 'Internet',
 'InvestmentBanking',
 'Law',
 'Leo',
 'Libra',
 'Museums-Libraries',
 'Non-Profit',
 'Pisces',
 'Religion',
 'Sagittarius',
 'Science',
 'Scorpio',
 'Sports-Recreation',
 'Student',
 'Taurus',
 'Technology',
 'Virgo',
 'female',
 'indUnk',
 'male']
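A small sketch may help in reading the 54-column masks above. It uses hypothetical toy labels and shows that `MultiLabelBinarizer` assigns one column per distinct label, in sorted order (which is why the ages come first in `classes_`), and that `transform` reuses the columns learned from the training labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy labels, not taken from the corpus.
train_labels = [["male", "15", "Leo"], ["female", "25", "Aries"]]
test_labels = [["male", "25", "Leo"]]

mlb_demo = MultiLabelBinarizer()
y_train_bin = mlb_demo.fit_transform(train_labels)  # learns the label set
y_test_bin = mlb_demo.transform(test_labels)        # reuses the same columns

print(list(mlb_demo.classes_))
print(y_train_bin)
print(y_test_bin)
```

Each row is the 0/1 mask for one example; a 1 in column j means the example carries the label `classes_[j]`.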

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

### For testing purposes, I created a pipeline for steps 8 & 9 using OneVsRestClassifier and LinearSVC, trained the model on every label, and computed the accuracy, F1, recall, and precision scores.

In [90]:
classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(x_train, y_train_mlb)
predicted = classifier.predict(x_test)
print(predicted)

print("Accuracy Score using OvR and Linear SVC: ",accuracy_score(y_test_mlb, predicted))
print("F1 using OvR and Linear SVC: " + str(f1_score(y_test_mlb,predicted,average='micro')))
print("Recall using OvR and Linear SVC: " + str(recall_score(y_test_mlb,predicted,average='micro')))
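# Note: average_precision_score summarizes the precision-recall curve and is a different metric from precision_score
print("Precision using OvR and Linear SVC: " + str(average_precision_score(y_test_mlb, predicted,average='micro')))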

[[0 0 0 ... 1 0 0]
 [0 0 0 ... 1 1 0]
 [0 0 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 1]]
Accuracy Score using OvR and Linear SVC:  0.5473251028806584
F1 using OvR and Linear SVC: 0.7720616570327553
Recall using OvR and Linear SVC: 0.6870713305898491
Precision using OvR and Linear SVC: 0.6285217707591445


### 8. Choose a classifier

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a basic classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers to train is large.

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf= LogisticRegression(solver='lbfgs')
clf= OneVsRestClassifier(clf)  

### 9. Fit the classifier, make predictions and get the accuracy

In [92]:
clf.fit(x_train_ct,y_train_mlb)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)
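The convergence warning above suggests raising `max_iter` or scaling the data. The sketch below illustrates the first option on hypothetical toy data; the value `max_iter=1000` is an assumption and not a tuned setting, and whether it is enough for the full bag-of-words matrix would need to be checked by rerunning the fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical toy data: label 0 depends on feature 0, label 1 on feature 1.
X = np.array([[0.0, 0.0], [0.0, 3.0], [3.0, 0.0], [3.0, 3.0]])
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# One LogisticRegression is trained per label column; max_iter=1000 is an assumed value.
ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
ovr.fit(X, Y)

pred = ovr.predict(X)
print(len(ovr.estimators_))
print(pred)
```

The number of fitted estimators equals the number of label columns, which is why training on 54 labels takes noticeably longer than a single binary model.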

In [93]:
y_pred=clf.predict(x_test_ct)
y_pred

array([[0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1]])

In [94]:
y_pred.shape

(1458, 54)

In [0]:
y_pred_ = mlb.inverse_transform(y_pred)

In [96]:
print(y_pred_)

[('25', 'female'), ('female', 'indUnk'), ('female',), ('male',), ('Aries', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('male',), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('24', 'Engineering', 'Libra', 'male'), ('35', 'Aries', 'Technology', 'male'), ('Aries', 'male'), ('23', '25', 'Libra', 'Student', 'Taurus', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('female',), ('35', 'Aries', 'Technology', 'male'), ('15', 'Libra', 'Student', 'male'), ('male',), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('female',), ('female', 'indUnk'), ('17', 'Capricorn', 'Sports-Recreation', 'female', 

In [97]:
y_test_mlb.shape

(1458, 54)

In [0]:
y_test_ = mlb.inverse_transform(y_test_mlb)

In [99]:
print(y_test_)

[('25', 'Aries', 'BusinessServices', 'female'), ('34', 'Sagittarius', 'female', 'indUnk'), ('15', 'Cancer', 'Student', 'female'), ('35', 'Aries', 'Technology', 'male'), ('39', 'Communications-Media', 'Libra', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('14', 'Pisces', 'Student', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('24', 'Scorpio', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('24', 'Engineering', 'Libra', 'male'), ('35', 'Aries', 'Technology', 'male'), ('23', 'Sagittarius', 'indUnk', 'male'), ('25', 'Libra', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'), ('17', 'Leo', 'Student', 'female'), ('35', 'Aries', 'Technology', 'male'), ('15', 'Libra', 'Student', 'female'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('35', 'Aries', 'Technology', 'male'), ('34', 'Sagittarius', 'female', 'indUnk'), ('35', 'Aries', 'Technology', 'male'),

In [0]:
from sklearn import metrics
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

In [101]:
print("F1: " + str(f1_score(y_test_mlb,y_pred,average='micro')))
print("Recall: " + str(recall_score(y_test_mlb,y_pred,average='micro')))
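# Note: average_precision_score summarizes the precision-recall curve and is a different metric from precision_score
print("Precision: " + str(average_precision_score(y_test_mlb, y_pred,average='micro')))
print("Accuracy:" + str(metrics.accuracy_score(y_test_mlb,y_pred)))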

F1: 0.7498365860491175
Recall: 0.688443072702332
Precision: 0.5898404268346338
Accuracy:0.5384087791495199
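Two points about these metrics are worth illustrating. First, the value printed as "Precision" comes from `average_precision_score`, which summarizes a precision-recall curve; plain averaged precision is given by `precision_score`. Second, for multi-label indicator matrices `accuracy_score` counts a row as correct only when all of its labels match. The sketch below uses hypothetical 3x3 indicator matrices to show micro and macro averaging with `precision_score`, `recall_score`, and `f1_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical indicator matrices: 3 samples, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_hat = np.array([[1, 0, 0],
                  [0, 1, 1],
                  [1, 0, 0]])

# micro: pool true/false positives over all labels, then compute the ratio
micro_p = precision_score(y_true, y_hat, average="micro")
micro_r = recall_score(y_true, y_hat, average="micro")
micro_f1 = f1_score(y_true, y_hat, average="micro")
# macro: compute precision per label, then take the unweighted mean
macro_p = precision_score(y_true, y_hat, average="macro")
# subset accuracy: a row counts only if every label matches
subset_acc = accuracy_score(y_true, y_hat)

print(micro_p, micro_r, micro_f1, macro_p, subset_acc)
```

Here the pooled counts are 3 true positives, 1 false positive, and 2 false negatives, so micro precision is 3/4 and micro recall is 3/5, while no row is predicted exactly, so subset accuracy is 0.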


### 10. Print true label and predicted label for any five examples


In [102]:
y_test[0:330]

3975      [female, 25, BusinessServices, Aries]
4307          [female, 34, indUnk, Sagittarius]
3789              [female, 15, Student, Cancer]
2682              [male, 35, Technology, Aries]
3707    [male, 39, Communications-Media, Libra]
                         ...                   
4604          [female, 34, indUnk, Sagittarius]
2742              [male, 35, Technology, Aries]
1710              [male, 35, Technology, Aries]
4320          [female, 34, indUnk, Sagittarius]
2807              [male, 35, Technology, Aries]
Name: labels, Length: 330, dtype: object

In [103]:
y_test[100]

['female', '17', 'Student', 'Gemini']

In [104]:
y_test[3789]

['female', '15', 'Student', 'Cancer']

In [105]:
y_pred_[12]

('24', 'Engineering', 'Libra', 'male')

In [106]:
y_pred_[4]

('Aries', 'male')

In [107]:
y_pred_[29]

('female', 'indUnk')

In [108]:
y_pred_[2]

('female',)
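One caveat about the lookups above: `y_test[100]` and `y_test[3789]` select by the original dataframe index label, while `y_pred_[12]` and similar select by position in the test set, so the printed values do not necessarily refer to the same example. A sketch of a positional pairing follows, using hypothetical stand-ins for `y_test_mlb` and `y_pred`; since `inverse_transform` keeps row order, position i in both lists describes the same example.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-ins for the notebook's y_test_mlb and y_pred.
mlb_demo = MultiLabelBinarizer()
y_true_bin = mlb_demo.fit_transform([["male", "15"], ["female", "25"], ["male", "25"]])
y_pred_bin = np.array([[1, 0, 0, 1],
                       [0, 1, 1, 0],
                       [0, 0, 0, 1]])

true_sets = mlb_demo.inverse_transform(y_true_bin)
pred_sets = mlb_demo.inverse_transform(y_pred_bin)

# Both lists share the same row order, so zip pairs each example with its prediction.
pairs = list(zip(true_sets, pred_sets))
for i, (true, pred) in enumerate(pairs[:5]):
    print(i, "true:", true, "predicted:", pred)
```

With the notebook's variables, the same loop over `zip(y_test_, y_pred_)` would print the first five aligned pairs.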

Following the outlined approach, and with my limited knowledge of basic NLP concepts, I have implemented these steps based on my own understanding and research.