# Project: Statistical NLP 

Classification is probably the most common task you will deal with in practice.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting information about
a writer without knowing anything else about them is a challenge.
We are going to create a classifier that predicts multiple attributes of the author of a given text.
We have framed it as a multi-label classification problem.

Dataset:

Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.
Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)
All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs (ages 23-27),
2994 "30s" blogs (ages 33-47).

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions. Individual posts within a single blogger are separated by the
date of the following post and links within a post are denoted by the label urllink.
Link to dataset:
http://www.cs.biu.ac.il/~koppel/blogs/blogs.zip
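
Because each file name encodes the blogger's attributes, the labels can be recovered by splitting the name on dots. The following is a minimal sketch, assuming names of the form `<id>.<gender>.<age>.<industry>.<sign>.xml`. The helper `parse_blog_filename` and the sample file name are made up for illustration and are not part of the corpus tooling.

```python
def parse_blog_filename(filename):
    """Split a corpus file name into the blogger id and the four labels."""
    parts = filename.split(".")
    # parts[0] is the blogger id, parts[1:5] are gender, age, industry and sign,
    # and the last part is the file extension.
    blogger_id = parts[0]
    gender, age, industry, sign = parts[1:5]
    return blogger_id, [gender, age, industry, sign]


# Hypothetical file name, used only for illustration.
blogger_id, labels = parse_blog_filename("1234567.female.37.indUnk.Leo.xml")
print(blogger_id, labels)
```

The age is kept as a string here, so that all four labels share one type when they are later binarized together.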

## Import the necessary libraries

In [342]:
import os
import re
import zipfile
import numpy as np
import pandas as pd
from lxml import etree
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

## 1. Load the dataset (5 points)
Tip: As the dataset is large, use fewer rows. Check what is working well on your
machine and decide accordingly.

In [343]:
#with zipfile.ZipFile("Data/blogs.zip", 'r') as zf:
#    zf.extractall("Data/")

In [344]:
parser = etree.XMLParser(recover=True, encoding="iso-8859-5")

In [345]:
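text = []
labels = []
data_dir = "Data/blogs/"

for filename in os.listdir(data_dir):
    # File names look like <id>.<gender>.<age>.<industry>.<sign>.xml,
    # so fields 1-4 are the four labels.
    label = filename.split(".")[1:5]

    parsed_xml = etree.parse(os.path.join(data_dir, filename), parser=parser)
    # Each <post> element holds the text of one blog post.
    for node in parsed_xml.getroot():
        if node.tag == 'post':
            text.append(node.text)
            labels.append(label)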

In [346]:
len(text), len(labels)

(679596, 679596)

In [347]:
data = pd.DataFrame(list(zip(text, labels)), columns=['text', 'labels'])

In [348]:
data.head()

| | text | labels |
|---|---|---|
| 0 | \n\n\t \n Well, everyone got up and going... | [female, 37, indUnk, Leo] |
| 1 | \n\n\t \n My four-year old never stops ta... | [female, 37, indUnk, Leo] |
| 2 | \n\n\t \n Actually it's not raining yet, ... | [female, 37, indUnk, Leo] |
| 3 | \n\n\t \n Ha! Just set up my RSS feed - t... | [female, 37, indUnk, Leo] |
| 4 | \n\n\t \n Oh, which just reminded me, we ... | [female, 37, indUnk, Leo] |


In [349]:
data.shape

(679596, 2)

In [350]:
data = data.sample(50000)

In [351]:
data.reset_index(drop=True, inplace=True)

In [352]:
data.shape

(50000, 2)

## 2. Preprocess rows of the “text” column (7.5 points)
a. Remove unwanted characters
b. Convert text to lowercase
c. Remove unwanted spaces
d. Remove stopwords

In [353]:
# A set makes the per-word membership test in preprocess_text much faster than a list.
stop_words = set(stopwords.words("english"))

In [354]:
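def preprocess_text(text):
    # Lowercase, then strip newlines, tabs, apostrophes and plus signs.
    new_text = str(text).lower()
    new_text = re.sub(r"[\n\t'+]", "", new_text)
    # Splitting on single spaces leaves empty strings where spaces repeat;
    # drop them together with the stop words.
    words = new_text.split(" ")
    new_text = " ".join([word for word in words if word and word not in stop_words])
    return new_text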

In [355]:
data['text'] = data['text'].apply(preprocess_text)
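
To see what these steps do to a single string, here is a small self-contained sketch. It mirrors the logic of `preprocess_text`, but uses a tiny made-up stop word set (`demo_stop_words`, an assumption for illustration) instead of the NLTK list, so it runs without downloading any corpus.

```python
import re

# Tiny stand-in stop word set (assumption); the notebook uses NLTK's English list.
demo_stop_words = {"the", "is", "a", "and", "it"}


def preprocess_demo(text):
    """Lowercase, drop newlines/tabs/apostrophes/plus signs, remove stop words."""
    new_text = str(text).lower()
    new_text = re.sub(r"[\n\t'+]", "", new_text)
    words = new_text.split(" ")
    return " ".join(w for w in words if w and w not in demo_stop_words)


cleaned = preprocess_demo("\n\t The weather  is NICE and it's sunny")
print(cleaned)  # weather nice its sunny
```

Note that other punctuation, such as commas and periods, is left in place by this pattern.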

## 3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence
(7.5 points)
a. Label columns to merge: “gender”, “age”, “topic”, “sign”
b. After completing the previous step, there should be only two columns in your data
frame i.e. “text” and “labels”
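
In this notebook the labels were already collected into one list per post while reading the file names, so no merge is needed. If the data instead came with separate label columns, the merge could be sketched as below. The rows are made up for illustration, and the column names are the ones listed in the task.

```python
import pandas as pd

# Hypothetical rows with separate label columns, for illustration only.
df = pd.DataFrame({
    "text": ["first post", "second post"],
    "gender": ["female", "male"],
    "age": [37, 24],
    "topic": ["indUnk", "Student"],
    "sign": ["Leo", "Cancer"],
})

label_cols = ["gender", "age", "topic", "sign"]
# Cast to string so ages and names share one label type,
# then collect each row's four values into a single list.
df["labels"] = df[label_cols].astype(str).values.tolist()
df = df[["text", "labels"]]
print(df)
```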

In [356]:
data.head()

| | text | labels |
|---|---|---|
| 0 | cant sleep, means stay looking wacky news item... | [female, 23, indUnk, Pisces] |
| 1 | havent posted last couple days havent known wr... | [female, 25, Education, Cancer] |
| 2 | tonight good night. driving home, felt alive c... | [female, 25, indUnk, Aries] |
| 3 | dear president bush, one basic definitions "so... | [male, 24, Student, Cancer] |
| 4 | message. university waterloo. ironically, pers... | [male, 17, indUnk, Virgo] |


## 4. Separate features and labels, and split the data into training and testing (5 points)

In [357]:
X = data['text']

In [358]:
y = data['labels']

In [359]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 1)

In [360]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((40000,), (10000,), (40000,), (10000,))

## 5. Vectorize the features (5 points)
a. Create a Bag of Words using count vectorizer
i. Use ngram_range=(1, 2)
ii. Vectorize training and testing features
b. Print the term-document matrix

In [361]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer(ngram_range=(1,2))

In [362]:
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [363]:
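# Vocabulary size (unigrams + bigrams); vocabulary_ works across scikit-learn
# versions, whereas get_feature_names() was removed in scikit-learn 1.2.
len(vect.vocabulary_)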

2670887

In [364]:
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
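
With more than 2.6 million columns the real document-term matrix is too large to inspect by eye. The toy sketch below, on two made-up documents, shows what `ngram_range=(1, 2)` produces: one column per unigram and per bigram, and one row per document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only.
docs = ["good night moon", "good morning"]

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_dtm = toy_vect.fit_transform(docs)

# vocabulary_ maps each unigram/bigram to its column index.
features = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(features)
print(toy_dtm.toarray())
```

The result of `fit_transform` is a sparse matrix; `toarray()` is only practical for small examples like this one.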

## 6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. (5 points)

In [365]:
mlb = MultiLabelBinarizer()

In [366]:
labels_tr = mlb.fit_transform(data['labels'])

In [367]:
labels_df = pd.DataFrame(labels_tr, columns=mlb.classes_)

In [368]:
labels_dict = dict(labels_df.sum())

In [369]:
labels_dict

{'13': 992,
 '14': 2001,
 '15': 3131,
 '16': 5296,
 '17': 5989,
 '23': 5385,
 '24': 5770,
 '25': 4897,
 '26': 3939,
 '27': 3346,
 '33': 1362,
 '34': 1644,
 '35': 1205,
 '36': 1045,
 '37': 670,
 '38': 556,
 '39': 422,
 '40': 372,
 '41': 271,
 '42': 214,
 '43': 318,
 '44': 155,
 '45': 327,
 '46': 215,
 '47': 175,
 '48': 303,
 'Accounting': 270,
 'Advertising': 350,
 'Agriculture': 103,
 'Aquarius': 3704,
 'Architecture': 107,
 'Aries': 4630,
 'Arts': 2349,
 'Automotive': 83,
 'Banking': 292,
 'Biotech': 173,
 'BusinessServices': 374,
 'Cancer': 4712,
 'Capricorn': 3654,
 'Chemicals': 291,
 'Communications-Media': 1425,
 'Construction': 81,
 'Consulting': 439,
 'Education': 2123,
 'Engineering': 854,
 'Environment': 37,
 'Fashion': 363,
 'Gemini': 3922,
 'Government': 481,
 'HumanResources': 203,
 'Internet': 1231,
 'InvestmentBanking': 91,
 'Law': 666,
 'LawEnforcement-Security': 134,
 'Leo': 3883,
 'Libra': 4480,
 'Manufacturing': 173,
 'Maritime': 22,
 'Marketing': 357,
 'Military': 20
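
The same kind of label-to-count dictionary can also be built without binarizing first. This is an alternative sketch using `collections.Counter` on made-up label lists, not the approach used in the cells above.

```python
from collections import Counter

# Made-up label lists, for illustration only.
sample_labels = [
    ["female", "37", "indUnk", "Leo"],
    ["male", "24", "Student", "Cancer"],
    ["female", "25", "indUnk", "Aries"],
]

# Flatten the per-post label lists and count every occurrence of each label.
label_counts = Counter(label for row in sample_labels for label in row)
print(dict(label_counts))
```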

## 7. Transform the labels - (7.5 points)
As we have noticed before, each example in this task can have multiple tags. To handle
this kind of prediction, we need to transform the labels into binary form, and the prediction will be
a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
a. Convert your train and test labels using MultiLabelBinarizer

In [370]:
y_train_tr = mlb.transform(y_train)

In [371]:
y_test_tr = mlb.transform(y_test)
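
To make the binary mask concrete, here is a toy sketch on two made-up label lists. Each column of the mask corresponds to one entry of `classes_`, and `inverse_transform` maps a mask back to label tuples.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label lists, for illustration only.
toy_labels = [
    ["female", "37", "Leo"],
    ["male", "24", "Leo"],
]

toy_mlb = MultiLabelBinarizer()
toy_mask = toy_mlb.fit_transform(toy_labels)

print(list(toy_mlb.classes_))  # column order of the mask
print(toy_mask)
print(toy_mlb.inverse_transform(toy_mask))
```

In the notebook the binarizer is fitted on the labels of the whole sample before the train and test labels are transformed, so every label in the test split has a column.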

## 8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in
OneVsRestClassifier class. In this approach k classifiers (= number of tags) are trained. As a
basic classifier, use LogisticRegression. It is one of the simplest methods, but it often
performs well enough in text classification tasks. It might take some time because the
number of classifiers to train is large.
a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on
every label
b. As One-vs-Rest approach might not have been discussed in the sessions, we are
providing you the code for that

In [372]:
lgr = LogisticRegression(solver='lbfgs', multi_class="ovr")

In [373]:
ovrc = OneVsRestClassifier(lgr)
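
The statement that k classifiers are trained can be checked on a toy problem. The sketch below uses a made-up feature matrix and a three-label binary target, and default `LogisticRegression` settings rather than the ones chosen above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up feature matrix (4 samples, 2 features) and a 3-label binary target.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
Y_toy = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]])

toy_ovr = OneVsRestClassifier(LogisticRegression())
toy_ovr.fit(X_toy, Y_toy)

# One fitted binary classifier per label column.
print(len(toy_ovr.estimators_))
# The prediction is again a 0/1 mask with one column per label.
print(toy_ovr.predict(X_toy).shape)
```

This is why training is slow here: with dozens of label columns and millions of n-gram features, each column gets its own logistic regression fit.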

## 9. Fit the classifier, make predictions and get the accuracy (5 points)
a. Print the following
i. Accuracy score
ii. F1 score
iii. Average precision score
iv. Average recall score
v. Tip: Make sure you are familiar with all of them. How would you expect the
things to work for the multi-label scenario? Read about micro/macro/weighted
averaging

In [374]:
ovrc.fit(X_train_dtm, y_train_tr)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='ovr', n_jobs=None,
                                                 penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [375]:
print("Training Accuracy: ", metrics.accuracy_score(y_train_tr, ovrc.predict(X_train_dtm)))

Training Accuracy:  0.9022


In [376]:
y_pred = ovrc.predict(X_test_dtm)

In [377]:
print("Testing Accuracy: ", metrics.accuracy_score(y_test_tr, y_pred))

Testing Accuracy:  0.0108
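
Part of the reason the score on the held-out data is so low is that, for multi-label targets, `accuracy_score` computes subset accuracy: a sample counts as correct only if its whole label mask matches. The toy sketch below, on made-up masks, contrasts this with the share of individual label decisions that are correct, computed here as one minus the Hamming loss.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up true and predicted label masks (3 samples, 4 labels).
y_true_toy = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]])
y_pred_toy = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]])

# Subset accuracy: a row counts only if all four labels match.
subset_acc = accuracy_score(y_true_toy, y_pred_toy)
# Per-label accuracy: share of individual 0/1 entries that match.
per_label_acc = 1 - hamming_loss(y_true_toy, y_pred_toy)

print(subset_acc, per_label_acc)
```

Only the first row matches exactly, so the subset accuracy is 1/3, while 9 of the 12 individual entries agree.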


- Accuracy is measured differently in multi-label classification than in single-label classification. A multi-label prediction is not simply right or wrong: predicting a subset of the true labels is better than predicting none of them. However, `accuracy_score` counts a sample as correct only when all of its labels match, so it gives no credit for partial matches.

- In micro-averaging, the TPs, FPs and FNs of all classes are summed, and the metric is computed once from these totals. The micro-average therefore aggregates the contributions of all classes.

- Micro-averaging can be a useful measure when the classes are known to be imbalanced, because every individual label decision carries the same weight.

- Macro-averaging computes the metric independently for each class and then takes the unweighted mean, i.e. it treats all classes equally.

- Macro-averaging is useful when we want to know how the system performs across all classes, including the rare ones.

- In weighted averaging, each class's contribution to the average is weighted by its support, i.e. the number of true examples available for it.
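
The three averaging schemes can be worked through by hand on a toy example. The sketch below uses made-up masks in which label 0 is frequent and always predicted correctly, while label 1 is rare and never predicted. It computes F1 from the TP/FP/FN counts and compares the results with `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up masks: label 0 is frequent, label 1 is rare.
y_true_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [0, 1]])
y_pred_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 0]])

# Per-class counts (one entry per label column).
tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1), axis=0)
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1), axis=0)
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0), axis=0)

# Per-class F1 = 2*TP / (2*TP + FP + FN), taken as 0 when the denominator is 0.
denom = 2 * tp + fp + fn
per_class_f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
support = y_true_toy.sum(axis=0)

micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())  # from pooled counts
macro = per_class_f1.mean()                                   # unweighted mean
weighted = np.sum(per_class_f1 * support) / support.sum()     # weighted by support

print(micro, macro, weighted)
print(f1_score(y_true_toy, y_pred_toy, average="micro", zero_division=0))
print(f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0))
print(f1_score(y_true_toy, y_pred_toy, average="weighted", zero_division=0))
```

Here the micro-average is 0.8, the macro-average 0.5 and the weighted average about 0.67: the missed rare label pulls the macro score down much more than the micro score, which is the same pattern as in the scores reported for this model.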

In [378]:
print("Micro-averaging F1 Score: ", metrics.f1_score(y_test_tr, y_pred, average='micro'))
print("Macro-averaging F1 Score: ", metrics.f1_score(y_test_tr, y_pred, average='macro'))
print("Weighted-averaging F1 Score: ", metrics.f1_score(y_test_tr, y_pred, average='weighted'))

Micro-averaging F1 Score:  0.3212034894071569
Macro-averaging F1 Score:  0.0707366509287803
Weighted-averaging F1 Score:  0.2541110811878727


In [379]:
print("Micro-averaging Precision Score: ", metrics.precision_score(y_test_tr, y_pred, average='micro'))
print("Macro-averaging Precision Score: ", metrics.precision_score(y_test_tr, y_pred, average='macro'))
print("Weighted-averaging Precision Score: ", metrics.precision_score(y_test_tr, y_pred, average='weighted'))

Micro-averaging Precision Score:  0.5578849721706864
Macro-averaging Precision Score:  0.25029888833293557
Weighted-averaging Precision Score:  0.4153331348633759


In [380]:
print("Micro-averaging Recall Score: ", metrics.recall_score(y_test_tr, y_pred, average='micro'))
print("Macro-averaging Recall Score: ", metrics.recall_score(y_test_tr, y_pred, average='macro'))
print("Weighted-averaging Recall Score: ", metrics.recall_score(y_test_tr, y_pred, average='weighted'))

Micro-averaging Recall Score:  0.225525
Macro-averaging Recall Score:  0.049242089378424045
Weighted-averaging Recall Score:  0.225525


## 10. Print true label and predicted label for any five examples (7.5 points)

In [384]:
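# Convert the masks back to label tuples once, then show five random test examples
# (true labels first, predicted labels second). Index 0 is included in the draw.
true_labels = mlb.inverse_transform(y_test_tr)
pred_labels = mlb.inverse_transform(y_pred)
for ii in np.random.randint(0, len(y_test_tr), 5):
    print(true_labels[ii])
    print(pred_labels[ii])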

('17', 'Libra', 'female', 'indUnk')
('Student', 'female')
('27', 'Leo', 'female', 'indUnk')
('Taurus', 'male')
('33', 'Biotech', 'Capricorn', 'female')
('female',)
('34', 'Aquarius', 'Education', 'male')
('34', 'Aquarius', 'Education', 'male')
('14', 'Capricorn', 'indUnk', 'male')
('male',)


- We used only 50,000 data points, and training took almost 3 hours to reach a training accuracy of 90%.

- The testing accuracy of the model is very poor, only about 1%. The gap to the training accuracy suggests that the model overfits. More iterations and model tuning are needed to improve it.

- The true and predicted labels above show the same thing. The model does not predict a label of every type for each example, and the labels it does predict are often wrong.

- We also see a significant difference between the micro- and macro-averaged precision and recall scores. The label classes appear to be highly imbalanced, with counts ranging from 1 to about 5,000 data points, so micro-averaging is the better choice here.