#Statistical_NLP_Project

We welcome you all to this NLP based case study. The case study (described below - 60 points) covers concepts taught in traditional models in the NLP course.

##Project Description

Classification is probably the most popular task that you would deal with in real life. Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict the information about the writer without knowing about him/her. We are going to create a classifier that predicts multiple features of the author of a given text. We have designed it as a Multilabel classification problem.

##Dataset
Blog Authorship Corpus: over 600,000 posts from more than 19 thousand bloggers.
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.) All bloggers included in the corpus fall into one of three age groups:
- 8,240 "10s" blogs (ages 13-17)
- 8,086 "20s" blogs (ages 23-27)
- 2,994 "30s" blogs (ages 33-47)
For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.
###Link to dataset:
https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

##Approach & Steps
1. Load the dataset (5 points)
- a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
2. Preprocess rows of the “text” column (7.5 points)
- a. Remove unwanted characters
- b. Convert text to lowercase
- c. Remove unwanted spaces
- d. Remove stopwords
3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
- a. Label columns to merge: “gender”, “age”, “topic”, “sign”
- b. After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image
4. Separate features and labels, and split the data into training and testing (5 points)
5. Vectorize the features (5 points)
- a. Create a Bag of Words using count vectorizer
- i. Use ngram_range=(1, 2)
- ii. Vectorize training and testing features
- b. Print the term-document matrix
6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)
7. Transform the labels - (7.5 points)
- As noted before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn
- a. Convert your train and test labels using MultiLabelBinarizer
8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a basic classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.
- a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label.
- b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that

9. Fit the classifier, make predictions and get the accuracy (5 points)
- a. Print the following
- i. Accuracy score
- ii. F1 score
- iii. Average precision score
- iv. Average recall score
- v. Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging.
10. Print true label and predicted label for any five examples (7.5 points)
##Project submissions and Evaluation Criteria
- While we encourage peer collaboration and contribution, plagiarism or copying code from other sources or peers will defeat the purpose of coming to this program. We expect the highest order of ethical behavior.
- You are provided with the basic approach and the steps that you need to implement. We expect you to do your own research about implementing the steps and knowing about the things that might look new to you.

In [151]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [152]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics
from sklearn.preprocessing import MultiLabelBinarizer
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
import re
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1.Load the dataset (5 points)

In [0]:
blogtext = pd.read_csv('/content/drive/My Drive/Great_Lakes_Assignments/11_Statistical_NLP_Project_R9_Project1/blogtext.csv')

In [154]:
blogtext.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [155]:
blogtext.shape

(681284, 7)

In [156]:
blogtext.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [157]:
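# Without parentheses this only displays the bound method, whose repr echoes the frame;
# call blogtext.info() to get the dtype and memory summary instead.
blogtext.info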

<bound method DataFrame.info of              id  ...                                               text
0       2059027  ...             Info has been found (+/- 100 pages,...
1       2059027  ...             These are the team members:   Drewe...
2       2059027  ...             In het kader van kernfusie op aarde...
3       2059027  ...                   testing!!!  testing!!!          
4       3581210  ...               Thanks to Yahoo!'s Toolbar I can ...
...         ...  ...                                                ...
681279  1713845  ...         Dear Susan,  I could write some really ...
681280  1713845  ...         Dear Susan,  'I have the second yeast i...
681281  1713845  ...         Dear Susan,  Your 'boyfriend' is fuckin...
681282  1713845  ...         Dear Susan:    Just to clarify, I am as...
681283  1713845  ...         Hey everybody...and Susan,  You might a...

[681284 rows x 7 columns]>

In [158]:
blogtext.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,681284.0,2397802.0,1247723.0,5114.0,1239610.0,2607577.0,3525660.0,4337650.0
age,681284.0,23.93233,7.786009,13.0,17.0,24.0,26.0,48.0


In [159]:
blogtext.id.value_counts()

449628     4221
734562     2301
589736     2294
1975546    2261
958176     2244
           ... 
3993280       1
3483063       1
4165047       1
3575447       1
3599127       1
Name: id, Length: 19320, dtype: int64

In [160]:
# Blogger ids with more than 200 posts
blogtextcount = blogtext["id"].value_counts()
greaterthanhundind = blogtextcount[blogtextcount >200].index
greaterthanhundind

Int64Index([ 449628,  734562,  589736, 1975546,  958176, 1107146,  303162,
             942828, 1270648, 1784456,
            ...
            1131517, 4177216, 1726011, 2169579, 2680773, 1209865, 1032153,
             956218, 1552252, 1624111],
           dtype='int64', length=561)

######The dataset is huge, so to avoid memory issues the current analysis uses a random sample of 20,000 rows

In [161]:
# Taking random sample of 20k rows from the main dataset
data =  blogtext.sample(n = 20000)
data.shape

(20000, 7)
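
Two common pandas ways to work with a subset are reading only the first rows with `nrows` and sampling with a fixed `random_state` so the subset can be reproduced. The sketch below is only an illustration on an in-memory toy CSV; the rows and the seed 42 are made up and are not what the notebook used.

```python
import io
import pandas as pd

# Toy stand-in for blogtext.csv (hypothetical rows, same column names).
csv_text = """id,gender,age,topic,sign,date,text
1,male,15,Student,Leo,"14,May,2004",first post
2,female,24,indUnk,Aries,"15,May,2004",second post
3,male,33,Technology,Libra,"16,May,2004",third post
4,female,17,Student,Pisces,"17,May,2004",fourth post
"""

# Option 1: read only the first N rows of the file.
head_only = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Option 2: load everything, then draw a reproducible random sample.
full = pd.read_csv(io.StringIO(csv_text))
sample_a = full.sample(n=3, random_state=42)
sample_b = full.sample(n=3, random_state=42)

print(head_only.shape)            # two rows, seven columns
print(sample_a.equals(sample_b))  # same seed, same rows
```

A random sample spreads the rows across many bloggers, whereas the first N rows of this file come from only a few authors, so sampling is usually the better fit for an authorship task.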

In [162]:
data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
416756,1981768,female,15,indUnk,Aries,"27,June,2004",Updating Quiz Blog Now
465488,3371144,male,16,indUnk,Sagittarius,"16,May,2004",Despite the best efforts of Nate Lucas ...
399418,3040326,female,26,Publishing,Sagittarius,"10,October,2002",Bitchin You can watch Dali and Bunuel...
474112,4257680,female,14,Student,Pisces,"18,August,2004",urlLink hana makin a funny face...
599648,1089661,female,26,indUnk,Libra,"14,October,2003","Heidi is in Paris. Wish her well,..."


2.Preprocess rows of the “text” column (7.5 points)
- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [0]:
def cleanHtml(sentence):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext

def cleanPunc(sentence):
    # Drop quotes, '?', '!', '#' and '|' outright, then turn '.', ',', brackets and '/' into spaces.
    cleaned = re.sub(r'[?!\'"#|]', r'', sentence)
    cleaned = re.sub(r'[.,()/]', r' ', cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n", " ")
    return cleaned

def keepAlpha(sentence):
    # Replace every run of non-letter characters inside a word with a space.
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

# Lowercase first, then strip HTML tags, punctuation and non-alphabetic characters.
data['text'] = data['text'].str.lower()
data['text'] = data['text'].apply(cleanHtml)
data['text'] = data['text'].apply(cleanPunc)
data['text'] = data['text'].apply(keepAlpha)

In [0]:
# Removing stop words
stop_words = set(stopwords.words('english'))
stop_words.update(['zero','one','two','three','four','five','six','seven','eight','nine','ten','may','also','across','among','beside','however','yet','within'])
# The pattern requires a non-word character after the stop word, so a stop word
# at the very end of a text (e.g. "now" in the first row below) is left in place.
re_stop_words = re.compile(r"\b(" + "|".join(stop_words) + ")\\W", re.I)

def removeStopWords(sentence):
    return re_stop_words.sub(" ", sentence)

data['text'] = data['text'].apply(removeStopWords)

In [165]:
data.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
416756,1981768,female,15,indUnk,Aries,"27,June,2004",updating quiz blog now
465488,3371144,male,16,indUnk,Sagittarius,"16,May,2004",despite best efforts nate lucas fix camera...
399418,3040326,female,26,Publishing,Sagittarius,"10,October,2002",bitchin watch dali bunuels classic surreali...
474112,4257680,female,14,Student,Pisces,"18,August,2004",urllink hana makin funny face usual lol nbsp...
599648,1089661,female,26,indUnk,Libra,"14,October,2003",heidi paris wish well people told take ...
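
The cleaned rows above still contain trailing stop words and runs of spaces. One possible variant, shown here as a sketch rather than the notebook's method, uses word boundaries on both sides and collapses whitespace afterwards. The tiny stop-word set is an assumption for the sake of a self-contained example; the notebook uses NLTK's English list.

```python
import re

# Tiny illustrative stop-word set (an assumption; not the NLTK list).
STOP_WORDS = {"the", "is", "it", "now", "a", "to"}

# \b on both sides matches whole words anywhere, including at the end of the string.
STOP_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, sorted(STOP_WORDS))) + r")\b",
    re.I,
)

def remove_stop_words(sentence: str) -> str:
    without_stops = STOP_RE.sub(" ", sentence)
    # Collapse the runs of spaces left behind by the substitutions.
    return re.sub(r"\s+", " ", without_stops).strip()

cleaned = remove_stop_words("updating the quiz blog now")
print(cleaned)  # updating quiz blog
```

Because `\b` only matches at word edges, words that merely contain a stop word (such as "items" containing "it") are left untouched.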


3.As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
- Label columns to merge: “gender”, “age”, “topic”, “sign”
- After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image

In [0]:
## Removing id and date column as this is not needed for our current analysis
data = data.drop(['date', 'id'], axis=1)

In [167]:
data.head()

Unnamed: 0,gender,age,topic,sign,text
416756,female,15,indUnk,Aries,updating quiz blog now
465488,male,16,indUnk,Sagittarius,despite best efforts nate lucas fix camera...
399418,female,26,Publishing,Sagittarius,bitchin watch dali bunuels classic surreali...
474112,female,14,Student,Pisces,urllink hana makin funny face usual lol nbsp...
599648,female,26,indUnk,Libra,heidi paris wish well people told take ...


In [168]:
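# Preview of a reset index: reset_index() returns a new frame, so without
# assignment or inplace=True `data` itself keeps its old index here.
data.reset_index()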

Unnamed: 0,index,gender,age,topic,sign,text
0,416756,female,15,indUnk,Aries,updating quiz blog now
1,465488,male,16,indUnk,Sagittarius,despite best efforts nate lucas fix camera...
2,399418,female,26,Publishing,Sagittarius,bitchin watch dali bunuels classic surreali...
3,474112,female,14,Student,Pisces,urllink hana makin funny face usual lol nbsp...
4,599648,female,26,indUnk,Libra,heidi paris wish well people told take ...
...,...,...,...,...,...,...
19995,598858,female,23,indUnk,Taurus,got bored hair today went bye byes to...
19996,1890,male,35,Technology,Aries,know ang service awfully crummy dont eve...
19997,302912,female,15,Student,Sagittarius,alison youre cocker spaniel bones youre p...
19998,18620,female,14,indUnk,Libra,cant wait tomorrow thats luvluvluv brittany ...


In [169]:
data.shape

(20000, 5)

In [170]:
data.isnull().sum()

gender    0
age       0
topic     0
sign      0
text      0
dtype: int64

In [0]:
## As there are no null values ,we proceed with merging of gender,age,topic,sign into labels col
data = data.assign(labels = data.gender.astype(str) + ', ' + \
  data.age.astype(str) + ', ' + data.topic.astype(str) + ', '+ \
  data.sign.astype(str))

In [172]:
data.head()

Unnamed: 0,gender,age,topic,sign,text,labels
416756,female,15,indUnk,Aries,updating quiz blog now,"female, 15, indUnk, Aries"
465488,male,16,indUnk,Sagittarius,despite best efforts nate lucas fix camera...,"male, 16, indUnk, Sagittarius"
399418,female,26,Publishing,Sagittarius,bitchin watch dali bunuels classic surreali...,"female, 26, Publishing, Sagittarius"
474112,female,14,Student,Pisces,urllink hana makin funny face usual lol nbsp...,"female, 14, Student, Pisces"
599648,female,26,indUnk,Libra,heidi paris wish well people told take ...,"female, 26, indUnk, Libra"
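
An alternative to joining the labels into one string is to store a list of labels per row, which avoids having to split and strip the string again later. This is a sketch on a two-row toy frame (values copied from the preview above; the name `toy` is hypothetical), not the approach the notebook takes.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["female", "male"],
    "age": [15, 16],
    "topic": ["indUnk", "indUnk"],
    "sign": ["Aries", "Sagittarius"],
    "text": ["updating quiz blog now", "despite best efforts"],
})

label_cols = ["gender", "age", "topic", "sign"]

# One list of string labels per row; no separator, so nothing to strip later.
label_lists = toy[label_cols].astype(str).values.tolist()
toy["labels"] = pd.Series(label_lists, index=toy.index)

# Keep only the two columns the task asks for.
toy = toy[["text", "labels"]]
print(toy["labels"].iloc[0])
```

A column of lists can be passed to `MultiLabelBinarizer` directly, since it accepts any iterable of label collections.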


In [0]:
## Dropping gender/age/topic/sign from the df as they are now merged into labels
data.drop(labels = ['gender','age','topic','sign'], axis=1,inplace=True)

In [174]:
data.head()

Unnamed: 0,text,labels
416756,updating quiz blog now,"female, 15, indUnk, Aries"
465488,despite best efforts nate lucas fix camera...,"male, 16, indUnk, Sagittarius"
399418,bitchin watch dali bunuels classic surreali...,"female, 26, Publishing, Sagittarius"
474112,urllink hana makin funny face usual lol nbsp...,"female, 14, Student, Pisces"
599648,heidi paris wish well people told take ...,"female, 26, indUnk, Libra"


In [0]:
data.reset_index(inplace=True)

In [176]:
data.head()

Unnamed: 0,index,text,labels
0,416756,updating quiz blog now,"female, 15, indUnk, Aries"
1,465488,despite best efforts nate lucas fix camera...,"male, 16, indUnk, Sagittarius"
2,399418,bitchin watch dali bunuels classic surreali...,"female, 26, Publishing, Sagittarius"
3,474112,urllink hana makin funny face usual lol nbsp...,"female, 14, Student, Pisces"
4,599648,heidi paris wish well people told take ...,"female, 26, indUnk, Libra"


In [0]:
data.drop(labels = ['index'], axis=1,inplace=True)

In [178]:
data.head()

Unnamed: 0,text,labels
0,updating quiz blog now,"female, 15, indUnk, Aries"
1,despite best efforts nate lucas fix camera...,"male, 16, indUnk, Sagittarius"
2,bitchin watch dali bunuels classic surreali...,"female, 26, Publishing, Sagittarius"
3,urllink hana makin funny face usual lol nbsp...,"female, 14, Student, Pisces"
4,heidi paris wish well people told take ...,"female, 26, indUnk, Libra"



4.Separate features and labels, and split the data into training and testing (5 points)
- The features are the text column; the labels are the labels column

In [0]:
X = data.text
y = data.labels

In [180]:
X.shape

(20000,)

In [181]:
y.shape

(20000,)

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42, shuffle=True)

In [183]:
X_train.shape

(14000,)

In [184]:
X_test.shape

(6000,)

In [185]:
y_train.shape

(14000,)

In [186]:
y_test.shape

(6000,)

In [0]:
import pickle

# Save each split to Drive so a later session can skip the preprocessing above.
base_dir = "/content/drive/My Drive/Great_Lakes_Assignments/11_Statistical_NLP_Project_R9_Project1/"
splits = [("X_train", X_train), ("y_train", y_train), ("X_test", X_test), ("y_test", y_test)]
for name, obj in splits:
    # The with-block closes the file even if pickling fails.
    with open(base_dir + name + ".pickle", "wb") as pickle_out:
        pickle.dump(obj, pickle_out)

In [0]:
# Reload the saved splits (useful when resuming in a fresh session).
base_dir = "/content/drive/My Drive/Great_Lakes_Assignments/11_Statistical_NLP_Project_R9_Project1/"

def load_pickle(name):
    with open(base_dir + name + ".pickle", "rb") as pickle_in:
        return pickle.load(pickle_in)

X_train, y_train = load_pickle("X_train"), load_pickle("y_train")
X_test, y_test = load_pickle("X_test"), load_pickle("y_test")

5.Vectorize the features (5 points)
A. Create a Bag of Words using count vectorizer

- Use ngram_range=(1, 2)
- Vectorize training and testing features

Print the term-document matrix

In [0]:
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)

In [190]:
print("No. of X_train document term matrix - ", X_train_dtm.shape)
print("No. of X_test document term matrix - ", X_test_dtm.shape)

No. of X_train document term matrix -  (14000, 1050336)
No. of X_test document term matrix -  (6000, 1050336)
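
To see what `ngram_range=(1, 2)` produces, here is a minimal sketch on two made-up documents. Every unigram and every pair of adjacent words becomes one column, which is why the real matrix above has over a million columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["blog about cats", "blog about dogs"]  # toy documents, not from the corpus

vec = CountVectorizer(ngram_range=(1, 2))
dtm = vec.fit_transform(docs)  # sparse matrix: rows = documents, columns = n-grams

# vocabulary_ maps each n-gram to its column index; columns follow sorted order.
vocab = sorted(vec.vocabulary_)
print(vocab)
print(dtm.toarray())
```

With a vocabulary this wide, options such as `min_df` or `max_features` on `CountVectorizer` are a common way to shrink the matrix, at the cost of dropping rare n-grams.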


In [191]:
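# Last 40 vocabulary entries. On scikit-learn 1.0+ prefer get_feature_names_out();
# get_feature_names() was removed in 1.2.
print(vectorizer.get_feature_names()[-40:])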

['zzz take', 'zzz want', 'zzzaaappp', 'zzzaaappp well', 'zzzaaappp zzzaaappp', 'zzzin', 'zzzin arhx', 'zzzp', 'zzzp hes', 'zzzup', 'zzzup back', 'zzzz', 'zzzz coz', 'zzzz feelin', 'zzzz meeting', 'zzzz oh', 'zzzz repeat', 'zzzz school', 'zzzz sister', 'zzzz sucky', 'zzzzt', 'zzzzt weak', 'zzzzz', 'zzzzz ok', 'zzzzz sudden', 'zzzzz zzzzz', 'zzzzzz', 'zzzzzz didnt', 'zzzzzz watching', 'zzzzzz zzzzz', 'zzzzzzz', 'zzzzzzz zzzzz', 'zzzzzzzz', 'zzzzzzzzz', 'zzzzzzzzz raining', 'zzzzzzzzzz', 'zzzzzzzzzz yankees', 'zzzzzzzzzz yeah', 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzz', 'zzzzzzzzzzzzzzzzzzzzzzzzzzzzz huh']


6.Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)

In [192]:
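# NOTE: `blogtext_trimmed` is never defined in the cells above, so this cell raises
# NameError in a fresh top-to-bottom run. The counts printed below come from a
# 10,000-row frame (male + female = 10,000), not from the 20,000-row `data` sample.
countsgender = blogtext_trimmed['gender'].value_counts().to_dict()
countsage = blogtext_trimmed['age'].value_counts().to_dict()
countstopic = blogtext_trimmed['topic'].value_counts().to_dict()
countssign = blogtext_trimmed['sign'].value_counts().to_dict()

# Merge the four per-column dictionaries into one label -> count mapping.
dictlabels = {**countsgender, **countsage, **countstopic, **countssign}

print(' Dictionary to get the count of every label  :\n')
print(dictlabels)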

 Dictionary to get the count of every label  :

{'male': 5098, 'female': 4902, 24: 1189, 17: 1138, 23: 1133, 16: 1067, 25: 957, 26: 795, 27: 686, 15: 622, 14: 395, 34: 329, 35: 265, 33: 259, 36: 207, 13: 185, 37: 134, 38: 100, 39: 78, 40: 71, 41: 65, 45: 64, 43: 61, 46: 53, 48: 46, 42: 40, 47: 31, 44: 30, 'indUnk': 3675, 'Student': 2272, 'Technology': 631, 'Arts': 480, 'Education': 442, 'Communications-Media': 282, 'Internet': 227, 'Non-Profit': 220, 'Engineering': 166, 'Law': 138, 'Government': 105, 'Publishing': 102, 'Consulting': 92, 'Science': 89, 'Fashion': 83, 'Religion': 75, 'Advertising': 73, 'Chemicals': 64, 'Banking': 62, 'Marketing': 59, 'BusinessServices': 56, 'Accounting': 55, 'HumanResources': 52, 'Telecommunications': 51, 'Military': 51, 'Sports-Recreation': 44, 'Biotech': 42, 'Manufacturing': 39, 'RealEstate': 38, 'Museums-Libraries': 36, 'Tourism': 35, 'Transportation': 32, 'LawEnforcement-Security': 31, 'Architecture': 21, 'Agriculture': 20, 'Construction': 16, 'Autom
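
Since only the merged `labels` column remains at this point, the same dictionary can also be built from the label strings themselves. A minimal sketch with `collections.Counter`, using three label strings from the preview in step 3 (the variable names are hypothetical):

```python
from collections import Counter

label_strings = [
    "female, 15, indUnk, Aries",
    "male, 16, indUnk, Sagittarius",
    "female, 26, Publishing, Sagittarius",
]

# Split each row on commas, strip the padding, and count every label once per row.
label_counts = Counter(
    label.strip() for row in label_strings for label in row.split(",")
)
print(dict(label_counts))
```

Here ages are counted as strings ("15") rather than integers, which matches how they appear after the merge.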

7.Transform the labels - (7.5 points)

As noted before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

- Convert your train and test labels using MultiLabelBinarizer

In [193]:
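# Split each label string into a set of labels (a set, not a dictionary).
# Splitting on ',' rather than ', ' keeps the leading space on every label
# except the first, which is why the classes below read ' 16', ' Student', etc.
y_train = [set(i.split(',')) for i in y_train]
y_test = [set(i.split(',')) for i in y_test]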
print(y_train[0])
print(y_test[0])

{' 16', ' Student', 'female', ' Aries'}
{' Technology', ' 17', 'male', ' Virgo'}


In [0]:
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)

In [195]:
y_test[0]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1])

In [196]:
y_train[0]

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0])

In [197]:
mlb.classes_

array([' 13', ' 14', ' 15', ' 16', ' 17', ' 23', ' 24', ' 25', ' 26',
       ' 27', ' 33', ' 34', ' 35', ' 36', ' 37', ' 38', ' 39', ' 40',
       ' 41', ' 42', ' 43', ' 44', ' 45', ' 46', ' 47', ' 48',
       ' Accounting', ' Advertising', ' Agriculture', ' Aquarius',
       ' Architecture', ' Aries', ' Arts', ' Automotive', ' Banking',
       ' Biotech', ' BusinessServices', ' Cancer', ' Capricorn',
       ' Chemicals', ' Communications-Media', ' Construction',
       ' Consulting', ' Education', ' Engineering', ' Environment',
       ' Fashion', ' Gemini', ' Government', ' HumanResources',
       ' Internet', ' InvestmentBanking', ' Law',
       ' LawEnforcement-Security', ' Leo', ' Libra', ' Manufacturing',
       ' Maritime', ' Marketing', ' Military', ' Museums-Libraries',
       ' Non-Profit', ' Pisces', ' Publishing', ' RealEstate',
       ' Religion', ' Sagittarius', ' Science', ' Scorpio',
       ' Sports-Recreation', ' Student', ' Taurus', ' Technology',
       ' Telecommuni
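
A small sketch of what `MultiLabelBinarizer` does, on made-up label lists without the leading spaces seen in `mlb.classes_` above. The names `mlb_demo`, `train_labels` and `test_labels` are hypothetical.

```python
from sklearn.preprocessing import MultiLabelBinarizer

train_labels = [
    ["female", "16", "Student", "Aries"],
    ["male", "17", "Technology", "Virgo"],
]
test_labels = [["male", "16", "Student", "Virgo"]]

mlb_demo = MultiLabelBinarizer()
y_train_demo = mlb_demo.fit_transform(train_labels)  # learns the sorted label set
y_test_demo = mlb_demo.transform(test_labels)        # reuses the same columns

print(list(mlb_demo.classes_))
print(y_test_demo)
print(mlb_demo.inverse_transform(y_test_demo))
```

Each column corresponds to one entry of `classes_`, and `inverse_transform` maps a 0/1 row back to the label names, which is how step 10 recovers readable labels.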

8.Choose a classifier - (5 points)

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As a basic classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

In [0]:
clf = LogisticRegression(solver = 'lbfgs')
clf = OneVsRestClassifier(clf)
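
To make the "k classifiers" point concrete, the following sketch fits the same wrapper on a tiny made-up problem with three label columns and checks that three underlying estimators are trained. The arrays are invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Six toy samples with two features and three binary labels each.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
Y_demo = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
ovr.fit(X_demo, Y_demo)

# One independent binary LogisticRegression per label column.
print(len(ovr.estimators_))
# predict returns a 0/1 indicator row per sample, one column per label.
pred = ovr.predict(X_demo)
print(pred.shape)
```

Because every label gets its own binary model, nothing forces a prediction to contain exactly one gender, one age, one topic and one sign; a row may come back with several signs or none.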

9.Fit the classifier, make predictions and get the accuracy (5 points)

In [0]:
clf.fit(X_train_dtm, y_train)
y_pred_class = clf.predict(X_test_dtm)

In [200]:
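# For multilabel indicator arrays, accuracy_score is subset accuracy:
# a sample counts as correct only if every one of its labels is predicted exactly.
print("Accuracy score :",(str(metrics.accuracy_score(y_test, y_pred_class)*100)) + "%")
# Micro averaging pools all label decisions; macro averages the per-label scores,
# so the many rare labels with near-zero F1 pull the macro value down.
print("F1 score micro :", (str(metrics.f1_score(y_test, y_pred_class.round(),average='micro')*100)) + "%")
print("F1 score macro :",(str(metrics.f1_score(y_test, y_pred_class.round(),average='macro')*100)) + "%")
# Precision and recall below are averaged over labels, weighted by label support.
print("Average precision score :",(str(metrics.precision_score(y_test, y_pred_class.round(),average='weighted')*100)) + "%")
print("Average Recall score :",(str(metrics.recall_score(y_test, y_pred_class.round(),average='weighted')*100)) + "%")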

Accuracy score : 0.35000000000000003%
F1 score micro : 29.243261191101453%
F1 score macro : 4.181671681665517%
Average precision score : 33.99253114771598%
Average Recall score : 20.183333333333334%
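
The gap between the micro and macro F1 values can be reproduced by hand. The sketch below computes subset accuracy, micro F1 and macro F1 in plain Python on a made-up 3-sample, 3-label example, following the usual definitions (F1 = 2·TP / (2·TP + FP + FN), taken as 0 when the denominator is 0).

```python
# Toy ground truth and predictions: rows = samples, columns = labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

n_labels = len(y_true[0])
counts = []  # one (tp, fp, fn) triple per label column
for j in range(n_labels):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
    counts.append((tp, fp, fn))

# Macro: average the per-label F1 scores, so every label weighs the same.
macro_f1 = sum(f1_from_counts(*c) for c in counts) / n_labels
# Micro: pool the counts over all labels first, then compute one F1.
micro_f1 = f1_from_counts(*(sum(col) for col in zip(*counts)))
# Subset accuracy: a row counts only if all of its labels match.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1)         # 0.75
print(macro_f1)
print(subset_accuracy)
```

The third label is never predicted, so its F1 is 0 and it drags the macro average down while barely affecting the micro score. The same effect, across dozens of rare age, topic and sign labels, is consistent with the low macro F1 reported above.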


In [201]:
from sklearn.metrics import classification_report
print ("Classification Report")
print(classification_report(y_test, y_pred_class.round()))

Classification Report
              precision    recall  f1-score   support

           0       0.43      0.02      0.04       132
           1       0.23      0.03      0.06       212
           2       0.32      0.04      0.07       373
           3       0.37      0.06      0.10       617
           4       0.38      0.08      0.13       721
           5       0.28      0.03      0.06       673
           6       0.20      0.03      0.05       719
           7       0.11      0.01      0.02       617
           8       0.19      0.02      0.03       497
           9       0.18      0.01      0.03       404
          10       0.00      0.00      0.00       138
          11       0.50      0.03      0.07       172
          12       0.27      0.02      0.04       140
          13       0.00      0.00      0.00       131
          14       0.57      0.06      0.10        70
          15       1.00      0.02      0.03        66
          16       0.00      0.00      0.00        49
     

10.Print true label and predicted label for any five examples (7.5 points)

In [204]:
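y_test_pred_inversed = mlb.inverse_transform(y_pred_class)
y_test_inversed = mlb.inverse_transform(y_test)
# Caveat: X_test.get(i) looks up the index *label* i, while the label lists are
# indexed by position. get() returns None when label i is not in the test split,
# hence the "Title: None" rows below; X_test.iloc[i] would select by position.
for i in range(5000,5005):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test.get(i),
        ','.join(y_test_inversed[i]),
        ','.join(y_test_pred_inversed[i])
    ))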

Title:	   go  getting   wants   person   scared    want  unsure    go  getting    throw  caution  wind  go     risk  falling flat   face    enjoy      hope    want comes  way    hmmm im opting  throwing  caution   wind  going   keep ur fingers crossed    might need it
True labels:	 16, Pisces, Student,male
Predicted labels:	 17, Leo, Taurus, indUnk,male


Title:	None
True labels:	 16, Aries, Student,female
Predicted labels:	female


Title:	None
True labels:	 33, Education, Scorpio,male
Predicted labels:	female


Title:	None
True labels:	 23, Pisces, indUnk,female
Predicted labels:	 Student,male


Title:	None
True labels:	 24, Gemini, indUnk,female
Predicted labels:	male
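
The `Title: None` rows above come from label-based lookup on a shuffled index. A minimal sketch of the difference between `Series.get` and `Series.iloc`, on a toy series whose index mimics a shuffled train/test split (the values are made up):

```python
import pandas as pd

# After train_test_split the Series keeps its original, shuffled index labels.
texts = pd.Series(["first text", "second text", "third text"], index=[7, 2, 9])

by_label_missing = texts.get(0)   # no row is labelled 0, so get() falls back to None
by_label_present = texts.get(2)   # the row *labelled* 2, which is the second row
by_position = texts.iloc[0]       # the first row, whatever its label is

print(by_label_missing)
print(by_label_present)
print(by_position)
```

Since `mlb.inverse_transform` returns plain lists ordered by position, pairing them with `X_test.iloc[i]` keeps each text aligned with its true and predicted labels.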


