**Blog Authorship Corpus**

Over 600,000 posts from more than 19 thousand bloggers

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

8240 "10s" blogs (ages 13-17)

8086 "20s" blogs (ages 23-27)

2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at


In [0]:
import pandas as pd
import numpy as np

In [0]:
from google.colab import drive 
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### 1.	Load the dataset (5 points)

  a.	Tip: As the dataset is large, use fewer rows. Check what is working well on your machine and decide accordingly.
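One way to follow this tip is to let pandas stop parsing early instead of loading everything and truncating afterwards. The sketch below is a minimal illustration, assuming a small in-memory archive with made-up rows that stands in for the real `blog-authorship-corpus.zip`; `sample_df` is a hypothetical name.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory archive so the sketch runs without the real dataset.
# The file name and columns mirror the corpus; the rows are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "blogtext.csv",
        "id,gender,age,topic,sign,date,text\n"
        '1,male,15,Student,Leo,"14,May,2004",first post\n'
        '1,male,15,Student,Leo,"13,May,2004",second post\n'
        '2,female,34,indUnk,Sagittarius,"11,June,2004",third post\n',
    )
buffer.seek(0)

# nrows stops parsing after the requested number of data rows,
# so the full file never has to be held in memory.
with zipfile.ZipFile(buffer) as zf, zf.open("blogtext.csv") as fh:
    sample_df = pd.read_csv(fh, nrows=2)

print(sample_df.shape)
```

With the real archive, the same `nrows` argument can be raised or lowered until loading is comfortable on the machine at hand.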


In [0]:
import zipfile

# Read the CSV straight from the archive without extracting it to disk
with zipfile.ZipFile('/content/gdrive/My Drive/AIML/NLP Stat/blog-authorship-corpus.zip') as zf:
    with zf.open('blogtext.csv') as f:
        df = pd.read_csv(f)

In [0]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [0]:
df.shape

(681284, 7)

In [0]:
df['text'].head(10)

0               Info has been found (+/- 100 pages,...
1               These are the team members:   Drewe...
2               In het kader van kernfusie op aarde...
3                     testing!!!  testing!!!          
4                 Thanks to Yahoo!'s Toolbar I can ...
5                 I had an interesting conversation...
6                 Somehow Coca-Cola has a way of su...
7                 If anything, Korea is a country o...
8                 Take a read of this news article ...
9                 I surf the English news sites a l...
Name: text, dtype: object

In [0]:
trunc_df = df.head(5000).copy()

In [0]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (the older -1 value is deprecated)
trunc_df.text[2]

"           In het kader van kernfusie op aarde:  MAAK JE EIGEN WATERSTOFBOM   How to build an H-Bomb From: ascott@tartarus.uwa.edu.au (Andrew Scott) Newsgroups: rec.humor Subject: How To Build An H-Bomb (humorous!) Date: 7 Feb 1994 07:41:14 GMT Organization: The University of Western Australia  Original file dated 12th November 1990. Seemed to be a transcript of a 'Seven Days' article. Poorly formatted and corrupted. I have added the text between 'examine under a microscope' and 'malleable, like gold,' as it was missing. If anyone has the full text, please distribute. I am not responsible for the accuracy of this information. Converted to HTML by Dionisio@InfiNet.com 11/13/98. (Did a little spell-checking and some minor edits too.) Stolen from  urlLink http://my.ohio.voyager.net/~dionisio/fun/m...own-h-bomb.html  and reformatted the HTML. It now validates to XHTML 1.0 Strict. How to Build an H-Bomb Making and owning an H-bomb is the kind of challenge real Americans seek. Who wants to 

In [0]:
trunc_df.head(2)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewes van der Laag urlLink mail Ruiyu Xie urlLink mail Bryan Aaldering (me) urlLink mail


### 2.	Preprocess rows of the “text” column (7.5 points)
a.	Remove unwanted characters

b.	Convert text to lowercase

c.	Remove unwanted spaces

d.	Remove stopwords
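The four steps can be sketched as a single helper. This is a minimal illustration, not necessarily the order the notebook applies them in: `clean_text` and the small `STOPS` set are hypothetical, and the notebook itself uses the NLTK English stopword list. Filtering stopwords after stripping punctuation means tokens such as `(me)` are also caught.

```python
import re

# Hypothetical mini stopword list for illustration only; the notebook
# uses nltk.corpus.stopwords.words('english') instead.
STOPS = {"these", "are", "the", "me", "i", "to", "and", "of", "it"}


def clean_text(text, stops=STOPS):
    text = text.lower()                        # b. convert to lowercase
    text = re.sub(r"[^a-z\d_]+", " ", text)    # a. replace unwanted characters
    text = re.sub(r"\s+", " ", text).strip()   # c. collapse and trim spaces
    # d. drop stopwords once punctuation no longer clings to the tokens
    return " ".join(word for word in text.split() if word not in stops)


print(clean_text("These are the team members:   Bryan Aaldering (me) urlLink mail"))
```

On a DataFrame the helper could be applied with `trunc_df['text'].map(clean_text)`.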

In [0]:
trunc_df['text']=trunc_df['text'].map(lambda x: x.lower())

In [0]:
trunc_df.text[2]

"           in het kader van kernfusie op aarde:  maak je eigen waterstofbom   how to build an h-bomb from: ascott@tartarus.uwa.edu.au (andrew scott) newsgroups: rec.humor subject: how to build an h-bomb (humorous!) date: 7 feb 1994 07:41:14 gmt organization: the university of western australia  original file dated 12th november 1990. seemed to be a transcript of a 'seven days' article. poorly formatted and corrupted. i have added the text between 'examine under a microscope' and 'malleable, like gold,' as it was missing. if anyone has the full text, please distribute. i am not responsible for the accuracy of this information. converted to html by dionisio@infinet.com 11/13/98. (did a little spell-checking and some minor edits too.) stolen from  urllink http://my.ohio.voyager.net/~dionisio/fun/m...own-h-bomb.html  and reformatted the html. it now validates to xhtml 1.0 strict. how to build an h-bomb making and owning an h-bomb is the kind of challenge real americans seek. who wants to 

In [0]:
trunc_df.dtypes

id        int64 
gender    object
age       int64 
topic     object
sign      object
date      object
text      object
dtype: object

In [0]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stops = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
import re
trunc_df.text=trunc_df['text'].apply(lambda x: re.sub(' +',' ',x))

In [0]:
trunc_df.text=trunc_df['text'].apply(lambda x: ' '.join([text for text in x.split() if text not in (stops)]))

In [0]:
#trunc_df.text=trunc_df['text'].apply(lambda x: ' '.join([text for text in x.split() if text.isalpha()]))

In [0]:
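# Replace every run of characters other than letters, digits and underscores with a space.
# regex=True must be passed explicitly in pandas 2.0+, where the default is a literal match.
trunc_df.text = trunc_df['text'].str.replace(r"[^a-zA-Z\d_]+", ' ', regex=True)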

In [0]:
trunc_df['label'] = trunc_df[trunc_df.columns[1:5]].apply(lambda x: (','.join(x.dropna().astype(str))),axis=1)

In [0]:
trunc_df.head(2)

Unnamed: 0,id,gender,age,topic,sign,date,text,label
0,2059027,male,15,Student,Leo,"14,May,2004",info found 100 pages 4 5 mb pdf files wait untill team leader processed learns html,"male,15,Student,Leo"
1,2059027,male,15,Student,Leo,"13,May,2004",team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail,"male,15,Student,Leo"


In [0]:
trunc_df.text[1]

'team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail'

### 3.	As we want to make this into a multi-label classification problem, you are required to merge all the label columns together, so that we have all the labels together for a particular sentence (7.5 points)
a.	Label columns to merge: “gender”, “age”, “topic”, “sign”
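A comma-joined string is one way to merge the columns; another is to keep each row's labels as a list, which is the shape multi-label tools usually expect. The sketch below is a minimal illustration on made-up rows; the `toy` frame and the `labels` column name are assumptions.

```python
import pandas as pd

# Toy rows shaped like the corpus; values are illustrative only.
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 34],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Sagittarius"],
    "text": ["info found", "another post"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast to str so the numeric age can sit next to the text labels,
# then collect each row's values into a list of labels.
toy["labels"] = toy[label_cols].astype(str).values.tolist()

print(toy[["text", "labels"]])
```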


In [0]:
trunc_df = trunc_df.drop(columns=['gender','age','topic','sign','id','date'])

In [0]:
trunc_df.head(2)

Unnamed: 0,text,label
0,info found 100 pages 4 5 mb pdf files wait untill team leader processed learns html,"male,15,Student,Leo"
1,team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering me urllink mail,"male,15,Student,Leo"


### 4.	Separate features and labels, and split the data into training and testing (5 points)

In [0]:
X = trunc_df.text
Y = trunc_df.label

In [0]:
print('X Shape',X.shape,'\nY Shape',Y.shape)

X Shape (5000,) 
Y Shape (5000,)


In [0]:
from sklearn.model_selection import train_test_split
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=123)

### 5.	Vectorize the features (5 points)
a.	Create a Bag of Words using count vectorizer

i.	Use ngram_range=(1, 2)

ii.	Vectorize training and testing features

b.	Print the term-document matrix

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1, 2),min_df=2,max_df=0.8)
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [0]:
# feature names (get_feature_names() was removed in scikit-learn 1.2)
feature_names = vect.get_feature_names_out().tolist()
print(feature_names[0:50],'\nNumber of features names:',len(feature_names))

['00', '00 30', '00 fun', '00 got', '00 mom', '00 morning', '00 pm', '000', '000 000', '000 miles', '000 new', '000 people', '000 usd', '000 won', '000 year', '000 years', '007', '00am', '00pm', '01', '02', '03', '03 pm', '04', '05', '06', '06 04', '06 10', '06 pm', '07', '07 24', '08', '09', '10', '10 00', '10 000', '10 11', '10 12', '10 15', '10 20', '10 30', '10 45', '10 am', '10 days', '10 euros', '10 feet', '10 games', '10 gas', '10 hour', '10 hours'] 
Number of features names: 38978


In [0]:
y_train.head()

2413    male,35,Technology,Aries    
1471    male,35,Technology,Aries    
1196    male,25,Internet,Aries      
1509    male,35,Technology,Aries    
4110    female,34,indUnk,Sagittarius
Name: label, dtype: object

### 6.	Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label.  (5 points) 

In [0]:
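# Count labels per row, then sum over rows (pd.value_counts is deprecated in favour of Series.value_counts)
y_train.apply(lambda x: pd.Series(x.split(",")).value_counts()).sum(axis=0).to_dict()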

{'14': 129.0,
 '15': 250.0,
 '16': 49.0,
 '17': 239.0,
 '23': 100.0,
 '24': 267.0,
 '25': 210.0,
 '26': 76.0,
 '27': 64.0,
 '33': 82.0,
 '34': 398.0,
 '35': 1738.0,
 '36': 45.0,
 '37': 13.0,
 '39': 57.0,
 '41': 10.0,
 '42': 7.0,
 '44': 3.0,
 '45': 10.0,
 '46': 3.0,
 'Accounting': 2.0,
 'Aquarius': 259.0,
 'Aries': 1876.0,
 'Arts': 19.0,
 'Automotive': 11.0,
 'Banking': 13.0,
 'BusinessServices': 67.0,
 'Cancer': 72.0,
 'Capricorn': 64.0,
 'Communications-Media': 43.0,
 'Consulting': 10.0,
 'Education': 88.0,
 'Engineering': 79.0,
 'Gemini': 61.0,
 'Internet': 17.0,
 'InvestmentBanking': 56.0,
 'Law': 3.0,
 'Leo': 145.0,
 'Libra': 289.0,
 'Museums-Libraries': 2.0,
 'Non-Profit': 38.0,
 'Pisces': 50.0,
 'Religion': 3.0,
 'Sagittarius': 511.0,
 'Science': 22.0,
 'Scorpio': 318.0,
 'Sports-Recreation': 57.0,
 'Student': 419.0,
 'Taurus': 75.0,
 'Technology': 1764.0,
 'Virgo': 30.0,
 'female': 1266.0,
 'indUnk': 1037.0,
 'male': 2484.0}
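The same dictionary can also be built without creating one Series per row. The sketch below is a minimal alternative using `collections.Counter` on toy labels in the same comma-joined format; it yields integer counts rather than floats.

```python
from collections import Counter

# Toy labels in the notebook's comma-joined format (illustrative values).
toy_labels = [
    "male,15,Student,Leo",
    "male,35,Technology,Aries",
    "female,35,indUnk,Aries",
]

label_counts = Counter()
for joined in toy_labels:
    label_counts.update(joined.split(","))   # add one count per label in the row

print(dict(label_counts))
```

On the real data, `toy_labels` would be replaced by `y_train`.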

### 7.	Transform the labels (7.5 points)
As noted earlier, each example in this task can carry multiple tags. To handle this kind of prediction, the labels need to be transformed into binary form, so that each prediction is a mask of 0s and 1s. MultiLabelBinarizer from sklearn is convenient for this.

a.	Convert your train and test labels using MultiLabelBinarizer
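MultiLabelBinarizer expects each sample to be an iterable of labels. A plain string is itself an iterable of characters, so a comma-joined label string passed unsplit yields one class per character rather than one per label. The sketch below is a minimal illustration on toy labels, assuming the strings are split on commas first; `mlb_split` and `mlb_chars` are hypothetical names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labels in the notebook's comma-joined format (illustrative values).
toy_labels = ["male,15,Student,Leo", "female,34,indUnk,Sagittarius"]

# Split first: each sample becomes a list of four labels.
mlb_split = MultiLabelBinarizer()
y_split = mlb_split.fit_transform([s.split(",") for s in toy_labels])
print(list(mlb_split.classes_))
print(y_split)

# Unsplit: each string is iterated character by character,
# so the classes are single characters, including the comma.
mlb_chars = MultiLabelBinarizer()
mlb_chars.fit(toy_labels)
print(list(mlb_chars.classes_)[:5])
```

With the notebook's data, the split version would be `mlb.fit_transform(y_train.str.split(','))` and `mlb.transform(y_test.str.split(','))`.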


In [0]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
y_train_mlb = mlb.fit_transform(y_train)
y_test_mlb = mlb.transform(y_test)

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver = 'lbfgs')
clf = OneVsRestClassifier(clf)

clf.fit(X_train_dtm,y_train_mlb)


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [0]:
y_pred_clf = clf.predict(X_test_dtm)

In [0]:
y_pred_clf.shape

(1250, 48)

### 9.	Fit the classifier, make predictions and get the accuracy (5 points)
a.	Print the following

  i.	Accuracy score

  ii.	F1 score

  iii.	Average precision score

  iv.	Average recall score
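The four scores can be sketched on a toy multilabel example. This is a minimal illustration with made-up indicator matrices, not the notebook's results. For multilabel input, `accuracy_score` is subset accuracy: a row counts as correct only if every label matches. The `zero_division` argument assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn import metrics

# Toy indicator matrices: 3 samples x 3 labels (illustrative values only).
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

# Subset accuracy: only the first row matches exactly.
acc = metrics.accuracy_score(y_true, y_pred)

# zero_division=0 scores a label as 0 instead of warning when it has
# no predicted positives (precision) or no true positives (recall).
f1_macro = metrics.f1_score(y_true, y_pred, average="macro", zero_division=0)
precision_macro = metrics.precision_score(y_true, y_pred, average="macro", zero_division=0)
recall_macro = metrics.recall_score(y_true, y_pred, average="macro", zero_division=0)

print("Accuracy", acc)
print("F1 (macro)", f1_macro)
print("Precision (macro)", precision_macro)
print("Recall (macro)", recall_macro)
```

Swapping `average="macro"` for `"weighted"` weights each label by its support, which favours frequent labels.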

In [0]:
from sklearn import metrics
print('Accuracy',metrics.accuracy_score(y_test_mlb,y_pred_clf),
      '\nf1 Score_macro',metrics.f1_score(y_test_mlb,y_pred_clf,average='macro'),
      '\nf1 Score_weighted',metrics.f1_score(y_test_mlb,y_pred_clf,average='weighted'),
      '\nPrecision Score_macro',metrics.precision_score(y_test_mlb,y_pred_clf,average='macro'),
      '\nPrecision Score_weighted',metrics.precision_score(y_test_mlb,y_pred_clf,average='weighted'),'\n')

Accuracy 0.4864 
f1 Score_macro 0.6322582291971052 
f1 Score_weighted 0.8744470211120133 
Precision Score_macro 0.814199713913004 
Precision Score_weighted 0.9061492403348755 





### 10.	 Print true label and predicted label for any five examples (7.5 points)

In [0]:
y_predict_feature=mlb.inverse_transform(y_pred_clf)
y_true=mlb.inverse_transform(y_test_mlb)

In [0]:
print('True output\n',y_test_mlb[0:5])

True output
 [[1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1
  1 1 1 0 0 1 1 0 0 0 0 1]
 [1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1
  1 1 1 0 0 1 1 0 0 0 0 1]
 [1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 1 1 0 1 1 1
  1 1 0 0 0 1 1 1 1 0 0 0]
 [1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 0 0 1 1 1
  1 1 1 1 0 1 0 0 0 0 0 0]
 [1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 1 1
  1 1 0 0 0 1 1 0 0 0 0 0]]


In [0]:
print('Predicted output\n',y_pred_clf[0:5])

Predicted output
 [[1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 1 1 0 1
  1 1 1 0 0 1 1 0 0 0 0 1]
 [1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1
  1 1 1 0 0 1 1 0 0 0 0 1]
 [1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 0 1
  1 1 1 0 0 1 1 0 0 0 0 1]
 [1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 1 1 1 0 1
  1 1 1 0 0 1 1 0 0 0 0 1]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1
  1 1 0 0 0 1 0 0 0 0 0 0]]


In [0]:
print('True Value',y_true[0:3],'\n\nPredicted Value',y_predict_feature[0:3])

True Value [(',', '3', '5', 'A', 'T', 'a', 'c', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'y'), (',', '3', '5', 'A', 'T', 'a', 'c', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'y'), (',', '3', '4', 'S', 'U', 'a', 'd', 'e', 'f', 'g', 'i', 'k', 'l', 'm', 'n', 'r', 's', 't', 'u')] 

Predicted Value [(',', '3', '5', 'A', 'T', 'a', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'y'), (',', '3', '5', 'A', 'T', 'a', 'c', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'y'), (',', '3', '5', 'A', 'a', 'c', 'e', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'r', 's', 'y')]
