# Blog Authorship Corpus

## Statistical NLP

Classification is probably the most common NLP task you will encounter in practice.
Text in the form of blogs, posts, and articles is written every second, and it is a challenge to predict
information about the writer without knowing anything about them.
We are going to build a classifier that predicts multiple attributes of the author of a given text.
We have framed it as a multi-label classification problem.

Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting
has been stripped with two exceptions: individual posts within a single blogger are separated by the
date of the following post, and links within a post are denoted by the label urllink.

### Steps

1. Load the dataset (5 points)
    a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
2. Preprocess rows of the “text” column (7.5 points)
    a. Remove unwanted characters
    b. Convert text to lowercase
    c. Remove unwanted spaces
    d. Remove stopwords
3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
    a. Label columns to merge: “gender”, “age”, “topic”, “sign”
    b. After completing the previous step, there should be only two columns in your dataframe i.e. “text” and “labels”
4. Separate features and labels, and split the data into training and testing (5 points)
5. Vectorize the features (5 points)
    a. Create a Bag of Words using count vectorizer
        i. Use ngram_range=(1, 2)
        ii. Vectorize training and testing features
    b. Print the term-document matrix
6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)
7. Transform the labels - (7.5 points)
As noted before, in this task each example can have multiple tags. To handle this kind of prediction, we need to transform the labels into binary form, so that each prediction is a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.
    a. Convert your train and test labels using MultiLabelBinarizer
8. Choose a classifier - (5 points)
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the
base classifier, use LogisticRegression. It is one of the simplest methods, but it often
performs well enough in text classification tasks. Training might take some time because the
number of classifiers to train is large.
    a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
    b. As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that
9. Fit the classifier, make predictions and get the accuracy (5 points)
    a. Print the following
        i. Accuracy score
        ii. F1 score
        iii. Average precision score
        iv. Average recall score
        v. Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging.
10. Print true label and predicted label for any five examples (7.5 points)

### Download and load Blog Authorship Corpus
Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

#### Dataset description
You can find the dataset description here
https://www.kaggle.com/rtatman/blog-authorship-corpus

#### Load the contents of zip file

In [None]:
from zipfile import ZipFile

with ZipFile('blog-authorship-corpus.zip', 'r') as zipdata:
    data_csv = zipdata.open('blogtext.csv')

#### Read the csv using pandas

In [None]:
import pandas as pd

df = pd.read_csv(data_csv)

#### Delete the data_csv variable to free memory

In [None]:
del data_csv

#### Get the names of the columns

In [None]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

#### Have a look at some column values

In [None]:
df.head(5)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


#### Check if there is any null value, and get the total count.

In [None]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

### Cut the data (skip this step in final run)
Work with a small slice of the data during development, so that the whole pipeline runs quickly and you can find and fix errors fast.
When everything works, skip this step and run your code on the entire dataset.

In [None]:
df = df.head(3000)
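
As an alternative to loading everything and then truncating, pandas can stop parsing after a fixed number of rows through the `nrows` parameter of `read_csv`. The sketch below is only an illustration: the in-memory CSV is a made-up stand-in for `blogtext.csv` with the same column names, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for blogtext.csv (hypothetical rows, same column names).
csv_text = (
    "id,gender,age,topic,sign,date,text\n"
    "1,male,15,Student,Leo,\"14,May,2004\",first post\n"
    "1,male,15,Student,Leo,\"13,May,2004\",second post\n"
    "2,female,24,indUnk,Aries,\"11,June,2004\",third post\n"
)

# nrows limits how many data rows are parsed, so later rows are never loaded.
sample_df = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(sample_df.shape)
```

With the real file, the same argument could be passed alongside the file handle opened from the zip archive, which avoids holding the full corpus in memory during development.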

## Preprocess text
Preprocess values of text column

- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [None]:
# Keep letters only; every other run of characters becomes a single space
import re
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

# Convert text to lowercase
df.text = df.text.apply(lambda x: x.lower())

# Strip leading and trailing spaces
df.text = df.text.apply(lambda x: x.strip())

# Remove stopwords (requires the NLTK stopwords corpus: nltk.download('stopwords'))
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name, so the imported module is not shadowed
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
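
The four cleaning steps can also be collected into one function, which makes them easy to check on a single string. This is a minimal sketch: the `preprocess` function and the small `STOP_WORDS` set are illustrative assumptions, not the NLTK list used in the notebook.

```python
import re

# Small illustrative stopword set (an assumption; the notebook uses NLTK's English list).
STOP_WORDS = {"the", "a", "an", "is", "to", "and", "of", "i", "it"}

def preprocess(text, stop_words=STOP_WORDS):
    """Apply the four cleaning steps to one string."""
    text = re.sub(r"[^A-Za-z]+", " ", text)  # keep letters only
    text = text.lower()                       # lowercase
    text = text.strip()                       # trim leading/trailing spaces
    # split() also collapses repeated whitespace before stopwords are dropped
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = preprocess("  Testing!!! It is the 2nd TEST of Yahoo!'s toolbar. ")
print(cleaned)
```

Lowercasing before the stopword filter matters here, because the stopword set contains lowercase words only.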

Verify the preprocessing steps by looking over some values

In [None]:
df.text[6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'

### Merge the label columns

Merge all the label columns together, so that each post has all of its tags in a single list.

In [None]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)
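
The merge casts `age` to a string so that every label has the same type; a mix of integers and strings could not be sorted together later on. A possible alternative to the row-wise `apply` is sketched below on a hypothetical two-row frame (`toy` and `label_cols` are made-up names).

```python
import pandas as pd

# Hypothetical two-row frame with the same label columns as the corpus.
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Aries"],
    "text": ["testing testing", "hello world"],
})

label_cols = ["gender", "age", "topic", "sign"]
# astype(str) turns age into a string; tolist() on the 2-D values
# yields one list of labels per row.
toy["labels"] = pd.Series(toy[label_cols].astype(str).values.tolist(), index=toy.index)
print(toy["labels"].tolist())
```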

### Select only required columns from your dataframe

In [None]:
df = df[['text','labels']]

### Print final dataframe

In [None]:
df.head()

| | text | labels |
|---|---|---|
| 0 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | testing testing | [male, 15, Student, Leo] |
| 4 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


## Create training and testing data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.text.values, df.labels.values, test_size=0.20, random_state=42)

## Vectorize the data

### Create Bag of Words
- Use CountVectorizer
- Transform the training and testing data

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
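
To see what `ngram_range=(1, 2)` and `binary=True` do, it may help to run the same settings on a tiny corpus. The two documents below are made up for illustration, and `toy_vectorizer`, `toy_bow`, and `vocab` are hypothetical names.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stop thinking feel", "stop thinking stop"]  # toy corpus, not from the dataset

toy_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
toy_bow = toy_vectorizer.fit_transform(docs)

# vocabulary_ maps each unigram/bigram to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
# binary=True records presence (1) rather than the number of occurrences
print(toy_bow.toarray())
```

In the second document "stop" occurs twice, yet its cell holds 1, because binary mode only records whether a term is present.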

#### Have a look at some feature names

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
vectorizer.get_feature_names()[:5]

['aa', 'aa compared', 'aa nice', 'aaa', 'aaa take']

#### View term-document matrix

In [None]:
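# toarray() builds a dense copy of the sparse matrix; fine for a small sample, costly on the full corpus
X_train_bow.toarray()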

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Create a dictionary to get label counts

In [None]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1
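
The same tally can be written more compactly with `collections.Counter` from the standard library. The sketch below uses a few hypothetical label lists in the same shape as `df.labels`; `toy_labels` and `toy_counts` are made-up names.

```python
from collections import Counter
from itertools import chain

# Toy label lists in the same shape as df.labels (hypothetical values).
toy_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
    ["female", "15", "Student", "Aries"],
]

# chain.from_iterable flattens the list of lists; Counter tallies each label.
toy_counts = Counter(chain.from_iterable(toy_labels))
print(dict(toy_counts))
```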

#### Print the dictionary

In [None]:
label_counts

{'male': 2272,
 '15': 299,
 'Student': 403,
 'Leo': 55,
 '33': 94,
 'InvestmentBanking': 70,
 'Aquarius': 286,
 'female': 728,
 '14': 74,
 'indUnk': 452,
 'Aries': 1699,
 '25': 110,
 'Capricorn': 77,
 '17': 147,
 'Gemini': 21,
 '23': 93,
 'Non-Profit': 46,
 'Cancer': 76,
 'Banking': 16,
 '37': 19,
 'Sagittarius': 113,
 '26': 43,
 '24': 334,
 'Scorpio': 243,
 '27': 86,
 'Education': 118,
 '45': 14,
 'Engineering': 119,
 'Libra': 313,
 'Science': 33,
 '34': 6,
 '41': 14,
 'Communications-Media': 14,
 'BusinessServices': 21,
 'Sports-Recreation': 75,
 'Virgo': 39,
 'Taurus': 76,
 'Arts': 2,
 'Pisces': 2,
 '44': 3,
 '16': 25,
 'Internet': 20,
 'Museums-Libraries': 2,
 'Accounting': 2,
 '39': 32,
 '35': 1607,
 'Technology': 1607}

## Multi label binarizer

Load a multilabel binarizer and fit it on the labels.

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
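
A small example may make the binary form concrete. The labels below are hypothetical, and `toy_mlb` is fitted without an explicit `classes` argument, so it orders the columns by sorting the labels it sees.

```python
from sklearn.preprocessing import MultiLabelBinarizer

toy_y = [["male", "15", "Leo"], ["female", "15", "Aries"]]  # hypothetical labels

toy_mlb = MultiLabelBinarizer()
toy_bin = toy_mlb.fit_transform(toy_y)

print(toy_mlb.classes_)  # column order of the indicator matrix
print(toy_bin)           # one row per sample, one column per label
# inverse_transform maps each 0/1 row back to a tuple of label names
print(toy_mlb.inverse_transform(toy_bin))
```

Because the columns follow sorted order, labels recovered with `inverse_transform` come back sorted as well, not in the original gender, age, topic, sign order.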

## Classifier

Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label.

In [None]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)
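
The One-vs-Rest idea (one binary classifier per label) can be checked on toy data. The sketch below uses made-up features and labels; `X_toy`, `Y_toy`, and `ovr` are hypothetical names and the data carries no meaning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (hypothetical): 6 samples, 2 features, 3 binary labels.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [0, 2]], dtype=float)
Y_toy = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
ovr.fit(X_toy, Y_toy)

# One fitted binary LogisticRegression per label column
print(len(ovr.estimators_))
# predict returns a 0/1 indicator matrix with the same number of columns as Y_toy
print(ovr.predict(X_toy).shape)
```

In the notebook the number of label columns equals the number of keys in `label_counts`, which is why fitting takes noticeably longer than a single logistic regression.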

### Fit the classifier

In [None]:
clf.fit(X_train_bow, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

## Make predictions
- Get predicted labels and scores

In [None]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)

### Get inverse transform for predicted labels and test labels

In [None]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

### Print some samples

In [None]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	pink already done sure phoenix tho
True labels:	35,Aries,Technology,male
Predicted labels:	35,Aries,Technology,male


Title:	woohoo tomorrow probably means need clean place bit uh oh since jen going get car able least grab furniture somewhere sit timed perfectly pool opening weekend slaving away work lie pool fight cicadas got hdtv cable hookup unfortunately demand function seems broken meantime get early today someone come look brighter side things techie managed blag two u dual piii machines free swap mobo reasonable case hook home idea use yet betting munch seti units meantime figure one last thing hell alt gr button us keyboards annoys type alt euro symbol remember code accented e etc know could change keymap uk whatever alt gr option something end rant
True labels:	25,Aries,Internet,male
Predicted labels:	male


Title:	actually johnathan called late last night sounding groggy thanking something could barely make innane babble wondering saying hoped guess name
True labels:	3

## Calculate accuracy
- Accuracy
- F1-score
- Precision
- Recall

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

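def print_evaluation_scores(y_val, predicted):
    # On multi-label indicator arrays, accuracy_score is subset accuracy (all labels must match)
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    # average_precision_score is designed for continuous scores; here it receives the 0/1 predictions
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))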
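
Two different scikit-learn functions have similar names here: `precision_score` works on hard 0/1 predictions, while `average_precision_score` summarizes a precision-recall curve and is normally given continuous scores, such as the `predicted_scores` obtained earlier from `decision_function`. The sketch below contrasts the two on made-up arrays; all values and the names `y_true`, `y_pred`, and `y_score` are hypothetical.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_score

# Toy multi-label ground truth and model outputs (hypothetical values).
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0]])  # hard 0/1 predictions
y_score = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.6], [0.7, 0.3, 0.2]])  # continuous scores

# Micro-averaged precision: TP / (TP + FP) pooled over every label cell.
micro_precision = precision_score(y_true, y_pred, average="micro")

# Average precision summarizes the precision-recall curve, so it expects scores.
micro_ap = average_precision_score(y_true, y_score, average="micro")

print(micro_precision, micro_ap)
```

Which of the two the assignment means by "average precision score" is a judgment call; reporting both, clearly named, is one way to avoid ambiguity.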

In [None]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.5233333333333333
F1 score:  0.7215575885526625
Average precision score:  0.5596074546125939
Average recall score:  0.6408333333333334
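
The accuracy is lower than the F1 score because, for multi-label indicator arrays, `accuracy_score` counts a sample as correct only when every label matches, while micro-averaged metrics pool the counts of individual label cells. The plain-Python sketch below works this through on made-up arrays; the values are hypothetical and unrelated to the results above.

```python
# Toy indicator rows (hypothetical): 3 samples, 4 labels each.
y_true = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 1], [0, 1, 0, 0], [1, 0, 0, 1]]

# Subset accuracy: a sample counts only if every label matches.
subset_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Micro averaging pools true/false positives and false negatives over all cells.
pairs = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(subset_acc, precision, recall, f1)
```

Only one of the three samples is fully correct, yet most individual labels are right, so subset accuracy ends up well below the micro-averaged F1. Macro averaging would instead compute each metric per label and take the unweighted mean, which gives rare labels the same weight as frequent ones.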
