# Project Description 

 
Classification is probably the most popular task that you would deal with in real life.
Text in the form of blogs, posts, articles, etc. is written every second. It is a challenge to predict the information about the writer without knowing about him/her. 

We are going to create a classifier that predicts multiple features of the author of a given text.
We have designed it as a Multilabel classification problem.


# Dataset 


Blog Authorship Corpus
Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:
8240 "10s" blogs (ages 13-17),
8086 "20s" blogs(ages 23-27)
2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at


# Approach & Steps 

1.	Load the dataset (5 points)

a.	Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

2.	Preprocess rows of the “text” column (7.5 points)

a.	Remove unwanted characters

b.	Convert text to lowercase

c.	Remove unwanted spaces

d.	Remove stopwords

3.	As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)

a.	Label columns to merge: “gender”, “age”, “topic”, “sign”
b.	After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image

4.	Separate features and labels, and split the data into training and testing (5 points)

5.	Vectorize the features (5 points)

a.	Create a Bag of Words using count vectorizer
i.	Use ngram_range=(1, 2)
ii.	Vectorize training and testing features

b.	Print the term-document matrix

6.	Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label. Check below image for reference (5 points)

7. Transform the labels - (7.5 points)
As we have noticed before, in this task each example can have multiple tags. To deal with such kind of prediction, we need to transform labels in a binary form and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn
a.	Convert your train and test labels using MultiLabelBinarizer

8.	 Choose a classifier - (5 points)

In this task, we suggest using the One-vs-Rest approach, which is implemented in OneVsRestClassifier class. In this approach k classifiers (= number of tags) are trained. As a basic classifier, use LogisticRegression. It is one of the simplest methods, but often it performs good enough in text classification tasks. It might take some time because the number of classifiers to train is large.

a.	Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label

b.	As One-vs-Rest approach might not have been discussed in the sessions, we are providing you the code for that

9.	Fit the classifier, make predictions and get the accuracy (5 points)

a.	Print the following

i.	Accuracy score
ii.	F1 score
iii.	Average precision score
iv.	Average recall score
v.	Tip: Make sure you are familiar with all of them. How would you expect the things to work for the multi-label scenario? Read about micro/macro/weighted averaging

10.	 Print true label and predicted label for any five examples (7.5 points)


# 1. Load the dataset (5 points)

a. Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.

In [0]:
import numpy as np
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
import nltk
import spacy
from nltk.corpus import stopwords,wordnet
from nltk.tokenize import word_tokenize, RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import string
import re
from sklearn.feature_extraction import text 
%matplotlib inline 

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
data = pd.read_csv("blogtext.csv")

In [6]:
data.shape

(55511, 7)

In [7]:
data['text']

0                   Info has been found (+/- 100 pages,...
1                   These are the team members:   Drewe...
2                   In het kader van kernfusie op aarde...
3                         testing!!!  testing!!!          
4                     Thanks to Yahoo!'s Toolbar I can ...
                               ...                        
55506                                     urlLink         
55507                                     urlLink         
55508                                     urlLink         
55509                                     urlLink         
55510                                                  NaN
Name: text, Length: 55511, dtype: object

In [0]:
new_data = data.head(8000)

In [9]:
new_data['label'] = new_data[new_data.columns[1:5]].apply(lambda x: (','.join(x.dropna().astype(str))),axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [10]:
new_data

Unnamed: 0,id,gender,age,topic,sign,date,text,label
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...","male,15,Student,Leo"
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,"male,15,Student,Leo"
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,"male,15,Student,Leo"
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,"male,15,Student,Leo"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,"male,33,InvestmentBanking,Aquarius"
...,...,...,...,...,...,...,...,...
7995,2635745,female,15,Student,Pisces,"01,August,2004",Today was good for me! I had an excell...,"female,15,Student,Pisces"
7996,2635745,female,15,Student,Pisces,"01,August,2004",OH MY GOODNESS! OH MY GOODNESS! OH MY...,"female,15,Student,Pisces"
7997,2635745,female,15,Student,Pisces,"01,August,2004","Well, my day was...okay, average, long,...","female,15,Student,Pisces"
7998,2635745,female,15,Student,Pisces,"01,August,2004",Sorry for the title. It was more like ...,"female,15,Student,Pisces"


# 3. As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
a. Label columns to merge: “gender”, “age”, “topic”, “sign” b. After completing the previous step, there should be only two columns in your data frame i.e. “text” and “labels” as shown in the below image



In [0]:
new_data = new_data.drop(columns=['gender','age','topic','sign','id','date'])

In [12]:
new_data

Unnamed: 0,text,label
0,"Info has been found (+/- 100 pages,...","male,15,Student,Leo"
1,These are the team members: Drewe...,"male,15,Student,Leo"
2,In het kader van kernfusie op aarde...,"male,15,Student,Leo"
3,testing!!! testing!!!,"male,15,Student,Leo"
4,Thanks to Yahoo!'s Toolbar I can ...,"male,33,InvestmentBanking,Aquarius"
...,...,...
7995,Today was good for me! I had an excell...,"female,15,Student,Pisces"
7996,OH MY GOODNESS! OH MY GOODNESS! OH MY...,"female,15,Student,Pisces"
7997,"Well, my day was...okay, average, long,...","female,15,Student,Pisces"
7998,Sorry for the title. It was more like ...,"female,15,Student,Pisces"


In [13]:
print(len(new_data.text))

8000


In [14]:
new_data.text[0]

'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

In [0]:
nlp = spacy.load('en_core_web_sm')
l1 = ('btw','zza','zzzexy','zzzzz','youuuuu')
nlp.Defaults.stop_words.add(l1)


In [17]:
new_data.text

0                  Info has been found (+/- 100 pages,...
1                  These are the team members:   Drewe...
2                  In het kader van kernfusie op aarde...
3                        testing!!!  testing!!!          
4                    Thanks to Yahoo!'s Toolbar I can ...
                              ...                        
7995           Today was good for me!  I had an excell...
7996           OH MY GOODNESS!  OH MY GOODNESS!  OH MY...
7997           Well, my day was...okay, average, long,...
7998           Sorry for the title.  It was more like ...
7999           READY FOR LOVE: You're sensitive but no...
Name: text, Length: 8000, dtype: object

In [21]:
for i in range(len(new_data.text)):
    tokenizer = RegexpTokenizer(r'\w+')
    new_data.text[i] = new_data.text[i].lower()
    word_tokens = tokenizer.tokenize(new_data.text[i])
    filtered_sentence = [w for w in word_tokens if not w in stopwords.words('english')] 
    filtered_sentence = [] 
    for w in word_tokens: 
        if w not in (nlp.Defaults.stop_words or string.punctuation):
            #if not w.isalpha():
            filtered_sentence.append(re.sub(r"[^a-zA-Z0-9]+", ' ',w ))
   
    new_data.text[i] = " ".join(filtered_sentence)
%time

LookupError: ignored

In [0]:
X = new_data.text
Y = new_data.label

In [0]:

# spliting into training and testing data sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=103)

In [25]:
# defining a function that accepts a vectorizer and calculates the accuracy
def tokenize_test(vect):
    X_train_dtm = vect.fit_transform(X_train)
    print('Features: ', X_train_dtm.shape[1])
    X_test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(X_train_dtm, y_train)
    y_train_pred = nb.predict(X_train_dtm)
    y_pred_class = nb.predict(X_test_dtm)
    print('Train Accuracy for NB : ', metrics.accuracy_score(y_train,y_train_pred))
    print('Test Accuracy for NB: ', metrics.accuracy_score(y_test, y_pred_class))
    logreg = LogisticRegression(C=1e9)
    logreg.fit(X_train_dtm, y_train)
    y_pred_class_LR = logreg.predict(X_test_dtm)
#print(metrics.accuracy_score(y_test, y_pred_class))
    y_train_LR = logreg.predict(X_train_dtm)
    print('Train Accuracy for LR: ',metrics.accuracy_score(y_train, y_train_LR))
    print('Test Accuracy for LR: ',metrics.accuracy_score(y_test, y_pred_class_LR))
%time

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 4.29 µs


In [26]:
# Include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  394714
Train Accuracy for NB :  0.653
Test Accuracy for NB:  0.383




Train Accuracy for LR:  0.998
Test Accuracy for LR:  0.6075


In [27]:
X_train_dtm = vect.fit_transform(X_train)
print('Features: ', X_train_dtm.shape[1])
X_test_dtm = vect.transform(X_test)
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
    #feature = nb.feature_count_
   # print(nb.feature_count_.shape)
y_train_pred = nb.predict(X_train_dtm)
y_pred_class = nb.predict(X_test_dtm)
print('Train Accuracy for NB : ', metrics.accuracy_score(y_train,y_train_pred))
print('Test Accuracy for NB: ', metrics.accuracy_score(y_test, y_pred_class))

Features:  394714
Train Accuracy for NB :  0.653
Test Accuracy for NB:  0.383


In [28]:
# features names
feature_names = vect.get_feature_names()
print(feature_names[50:500])

['00 the', '00 then', '00 this', '00 to', '00 too', '00 week', '00 when', '00 while', '00 would', '00 yes', '000', '000 000', '000 100', '000 20', '000 acre', '000 address', '000 amish', '000 and', '000 animals', '000 anything', '000 are', '000 armored', '000 automobile', '000 big', '000 bombers', '000 cats', '000 cds', '000 checks', '000 coffee', '000 companion', '000 compared', '000 copies', '000 credits', '000 do', '000 dollar', '000 dollars', '000 english', '000 families', '000 feet', '000 fighter', '000 figure', '000 flights', '000 for', '000 he', '000 hearing', '000 households', '000 how', '000 human', '000 in', '000 iraqi', '000 it', '000 japanese', '000 jesus', '000 jobs', '000 kids', '000 lakes', '000 layers', '000 loan', '000 miles', '000 minnesota', '000 minnesotans', '000 more', '000 my', '000 new', '000 next', '000 nice', '000 oeople', '000 of', '000 on', '000 options', '000 or', '000 people', '000 per', '000 points', '000 prospects', '000 ramsey', '000 renovation', '000 s

In [29]:
len(y_train)

6000

In [0]:
ds = {}

In [0]:
ds = y_train.apply(lambda x : pd.value_counts(x.split(","))).sum(axis = 0).to_dict()

In [32]:
ds

{'13': 8.0,
 '14': 131.0,
 '15': 289.0,
 '16': 57.0,
 '17': 707.0,
 '23': 102.0,
 '24': 281.0,
 '25': 201.0,
 '26': 87.0,
 '27': 526.0,
 '33': 82.0,
 '34': 411.0,
 '35': 1721.0,
 '36': 1246.0,
 '37': 14.0,
 '38': 32.0,
 '39': 61.0,
 '41': 13.0,
 '42': 12.0,
 '44': 2.0,
 '45': 11.0,
 '46': 6.0,
 'Aquarius': 270.0,
 'Aries': 3070.0,
 'Arts': 24.0,
 'Automotive': 12.0,
 'Banking': 14.0,
 'BusinessServices': 63.0,
 'Cancer': 166.0,
 'Capricorn': 73.0,
 'Communications-Media': 51.0,
 'Consulting': 16.0,
 'Education': 92.0,
 'Engineering': 86.0,
 'Fashion': 1184.0,
 'Gemini': 64.0,
 'Internet': 66.0,
 'InvestmentBanking': 57.0,
 'Law': 2.0,
 'Leo': 160.0,
 'Libra': 321.0,
 'Museums-Libraries': 2.0,
 'Non-Profit': 30.0,
 'Pisces': 75.0,
 'Religion': 7.0,
 'Sagittarius': 562.0,
 'Science': 27.0,
 'Scorpio': 669.0,
 'Sports-Recreation': 63.0,
 'Student': 461.0,
 'Taurus': 539.0,
 'Technology': 1752.0,
 'Virgo': 31.0,
 'female': 2328.0,
 'indUnk': 1991.0,
 'male': 3672.0}

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer

In [0]:
mlb = MultiLabelBinarizer()

In [0]:
y_train_mlb = mlb.fit_transform(y_train)

In [0]:
y_test_mlb = mlb.transform(y_test)

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

In [0]:
LR = LogisticRegression(solver = 'lbfgs',random_state= 101)

In [39]:
LR

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=101, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
clf = OneVsRestClassifier(LR)

In [0]:
names = vect.get_feature_names()

In [42]:
y_train_mlb

array([[1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [43]:
clf.fit(X_train_dtm,y_train_mlb)
%time

  str(classes[c]))
  str(classes[c]))
  str(classes[c]))
  str(classes[c]))
  str(classes[c]))


CPU times: user 4 µs, sys: 3 µs, total: 7 µs
Wall time: 7.63 µs




In [0]:
y_pred_clf = clf.predict(X_test_dtm)

In [45]:
print(metrics.accuracy_score(y_test_mlb,y_pred_clf))

0.325


In [0]:
from sklearn.metrics import classification_report

In [47]:
print(classification_report(y_test_mlb, y_pred_clf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2000
           1       0.69      0.22      0.33        41
           2       0.87      0.55      0.68       365
           3       0.71      0.36      0.48       398
           4       0.85      0.86      0.85      1275
           5       0.86      0.49      0.63       273
           6       0.77      0.63      0.69       761
           7       0.91      0.54      0.68       499
           8       0.81      0.41      0.54       377
           9       0.67      0.14      0.24        14
          10       0.50      0.06      0.10        18
          11       0.84      0.84      0.84      1156
          12       1.00      0.21      0.34        39
          13       0.76      0.25      0.38        99
          14       0.77      0.16      0.27        62
          15       0.95      0.58      0.72       432
          16       0.00      0.00      0.00        24
          17       0.88    

  'precision', 'predicted', average, warn_for)


In [48]:
metrics.average_precision_score(y_test_mlb, y_pred_clf,average='micro')

0.8121708958508178

In [49]:
metrics.recall_score(y_test_mlb, y_pred_clf, labels=None, pos_label=1, average='micro', sample_weight=None)

0.8263946220073122

In [50]:
y_pred_clf[10:15]

array([[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 1],
       [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 0]])

In [51]:
y_test_mlb[10:15]

array([[1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
        1, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
        1, 1, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 0, 0, 0, 1],
       [1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 0]])