# Project: Statistical NLP 

Classification is probably the most common task you will deal with in practice.
Text in the form of blogs, posts, articles, etc. is written every second. It is challenging to infer
information about an author from the text alone, without knowing anything else about them.
We are going to build a classifier that predicts several attributes of the author of a given text.
We frame it as a multi-label classification problem.

### Dataset
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry and astrological sign. (All are labeled for gender and age but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group there are an equal number of male and female bloggers.

Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions. Individual posts within a single blogger are separated by the date of the following post and links within a post are denoted by the label urllink.

The dataset is available at
https://www.kaggle.com/rtatman/blog-authorship-corpus/  
The file to download is blog-authorship-corpus.zip.

#### Step 1.  Load the dataset

In [331]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import nltk
import re
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mansvi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [332]:
data = pd.read_csv("blog-authorship-corpus/blogtext.csv")  # forward slash works on Windows, Linux and macOS
data.head(10)

|  | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |
| 5 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | I had an interesting conversation... |
| 6 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | Somehow Coca-Cola has a way of su... |
| 7 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | If anything, Korea is a country o... |
| 8 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | Take a read of this news article ... |
| 9 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 09,June,2004 | I surf the English news sites a l... |


In [333]:
data.shape

(681284, 7)

In [334]:
print(data.info())
data.isnull().sum().sort_values(ascending=False) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 681284 entries, 0 to 681283
Data columns (total 7 columns):
id        681284 non-null int64
gender    681284 non-null object
age       681284 non-null int64
topic     681284 non-null object
sign      681284 non-null object
date      681284 non-null object
text      681284 non-null object
dtypes: int64(2), object(5)
memory usage: 36.4+ MB
None


text      0
date      0
sign      0
topic     0
age       0
gender    0
id        0
dtype: int64

#Taking 5000 records as sample
data = data.sample(5000)
data.reset_index(drop=True, inplace=True)
data.shape

In [335]:
data = data[:5000]
print(data.shape)
data["text"].loc[0]

(5000, 7)


'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

#### Step 2.  Preprocess rows of the “text” column (7.5 points)
<br>a. Remove unwanted characters
<br>b. Convert text to lowercase
<br>c. Remove unwanted spaces
<br>d. Remove stopwords
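
The four steps above can be sketched on a single string before applying them to the whole column. This is a minimal illustration, not the notebook's exact pipeline: `clean_text` is a hypothetical helper, and `STOPWORDS` is a deliberately tiny, made-up stopword set so the example runs without downloading NLTK data.

```python
import re

# Hypothetical, deliberately tiny stopword set (assumption: the real run uses NLTK's English list).
STOPWORDS = {"i", "a", "the", "to", "and", "of", "has", "have", "been", "it", "our", "now"}

def clean_text(text, stopwords=STOPWORDS):
    """Apply steps a-d to one raw post and return the cleaned string."""
    text = re.sub(r"[^A-Za-z]", " ", text)               # a. replace non-letters with spaces
    text = text.lower()                                   # b. convert to lowercase
    tokens = text.split()                                 # c. split() drops leading, trailing and repeated spaces
    tokens = [t for t in tokens if t not in stopwords]    # d. remove stopwords
    return " ".join(tokens)

raw = "   Info has been found (+/- 100 pages, and 4.5 MB of .pdf files)  "
print(clean_text(raw))
```

Splitting on whitespace after the character filter handles step c without a separate regular expression, because every run of spaces collapses when the tokens are re-joined.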

In [336]:
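# a. Replace every non-letter character with a space
#    (regex=True must be explicit since pandas 2.0, where the default became False)
data['text'] = data['text'].str.replace('[^A-Za-z]', ' ', regex=True)
# b. Convert to lowercase
data['text'] = data['text'].str.lower()
# c. Strip surrounding whitespace; split() also collapses repeated spaces
data["text"] = data["text"].str.strip()
data["text"] = data["text"].str.split()

# d. Stopword removal; a set makes each membership test O(1)
stop = set(stopwords.words('english'))
def removestopwords(y):
    """Drop stopwords from a list of tokens and re-join the rest into one string."""
    stopwordremoved = [w for w in y if w not in stop]
    return " ".join(stopwordremoved)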


In [337]:
text_column_size = data["text"].size
print("text column size :", text_column_size)

# Initialize an empty list to hold the text after stop word removal
data_cleaner = []

# Loop over each text
for i in range( 0, text_column_size):
    data_cleaner.append(removestopwords(data["text"][i]))

text column size : 5000


In [338]:
data_cleaner[5]

'interesting conversation dad morning talking koreans put money invariably lot real estate cash cash would include short term investments one year well savings accounts reason real estate makes money lot money seen surveys seoul real estate rising per year long stretches even taking account crisis referred imf crisis although imf bailed korea compare korean corporate bonds fell modestly recovered local stock market represented kospi version dow jones index gone appreciably high points points see urllink link see real estate makes sense back conversation noted real big elite real estate investor billion usd see urllink converter properties dad seemed little flabbergasted heck need million dollars need much retire maybe lot risk take real estate south korean asset example north toots horn louder make move country usd worth cents also denominated imf crisis dropped vis vis usd also make bad investment fall victim scam latest urllink good morning city project toast saw lady tv lost everyth

In [339]:
data["text"] = data_cleaner

### Lemmatization



In [340]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mansvi\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [341]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    lemm = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    return(" ".join(lemm)) 

data["text"] = data.text.apply(lemmatize_text)


#### Step 3
Since we are treating this as a multi-label classification problem, merge all the label columns so that each text has all of its labels in one place (7.5 points)
<br>a. Label columns to merge: “gender”, “age”, “topic”, “sign”
<br>b. After this step, the data frame should contain only two columns, “text” and “labels”

In [342]:
data['age'] = data['age'].astype(str)
data['labels'] = data[['gender','age','topic','sign']].apply(lambda x: ','.join(x), axis = 1) 
data_merged = data.drop(labels = ['date','gender', 'age','topic','sign','id'], axis = 1)
data_merged.head()


|  | text | labels |
|---|---|---|
| 0 | info found page mb pdf file wait untill team l... | male,15,Student,Leo |
| 1 | team member drewes van der laag urllink mail r... | male,15,Student,Leo |
| 2 | het kader van kernfusie op aarde maak je eigen... | male,15,Student,Leo |
| 3 | testing testing | male,15,Student,Leo |
| 4 | thanks yahoo toolbar capture url popups mean s... | male,33,InvestmentBanking,Aquarius |


### Step 4
Separate features and labels, and split the data into training and testing

In [343]:
X = data_merged['text']
y = data_merged['labels'].str.lower()
labels = data_merged['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 143)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((3350,), (1650,), (3350,), (1650,))

### Step 5
Vectorize the features 
<br>a. Create a Bag of Words using count vectorizer
<br>i. Use ngram_range=(1, 2)
<br>ii. Vectorize training and testing features
<br>b. Print the term-document matrix
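
To make `ngram_range=(1, 2)` concrete, here is a small library-free sketch of a bag of words built from unigrams and bigrams. It is a simplification, not a reimplementation of `CountVectorizer`: it tokenizes on whitespace only and applies no `min_df` or stopword filtering, and the names `ngrams`, `docs`, `vocab` and `matrix` are illustrative.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return every n-gram (joined by a space) with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Two made-up, already cleaned documents.
docs = ["real estate makes money", "real estate investor"]

# Count unigrams and bigrams per document.
counts = [Counter(ngrams(d.split())) for d in docs]

# The vocabulary is the sorted union of all n-grams; it defines the column order.
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary term.
matrix = [[c[term] for term in vocab] for c in counts]

for term, column in zip(vocab, zip(*matrix)):
    print(f"{term!r}: {column}")
```

The bigram "real estate" gets its own column next to "real" and "estate", which is what lets a linear model weight the phrase differently from its parts.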
    

In [344]:
vectorizer = CountVectorizer(min_df = 2,ngram_range = (1,2),stop_words = "english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
print("X_train shape & sample",X_train.shape)
X_train[0]

X_train shape & sample (3350, 27104)


<1x27104 sparse matrix of type '<class 'numpy.int64'>'
	with 68 stored elements in Compressed Sparse Row format>

### Step 6:
Create a dictionary to get the count of every label i.e. the key will be label name and value will  be the total count of the label.
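
One straightforward way to build such a dictionary is `collections.Counter` over the comma-separated label strings. The sketch below uses four made-up rows in the same "gender,age,topic,sign" format; `merged_labels` and `label_counts` are illustrative names. Splitting on commas keeps a hyphenated topic such as "Non-Profit" as a single label, which a word tokenizer would break into two tokens.

```python
from collections import Counter

# Made-up sample rows in the "gender,age,topic,sign" format of the merged labels column.
merged_labels = [
    "male,15,Student,Leo",
    "male,15,Student,Leo",
    "male,33,InvestmentBanking,Aquarius",
    "female,25,Non-Profit,Cancer",
]

# Key: lowercased label name, value: total number of rows carrying that label.
label_counts = Counter(
    label.strip().lower()
    for row in merged_labels
    for label in row.split(",")
)

print(dict(label_counts))
```

With the real data frame, the same expression could be fed from `data_merged['labels']` instead of the sample list.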

In [345]:
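# Tokenize the merged label strings to collect the distinct label tokens
vectorizer_labels = CountVectorizer(min_df = 1,ngram_range = (1,1),stop_words = "english")
labels_vector = vectorizer_labels.fit_transform(labels)
# Note: vocabulary_ maps each token to its column index in labels_vector, not to its count
vectorizer_labels.vocabulary_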

{'male': 42,
 '15': 1,
 'student': 54,
 'leo': 39,
 '33': 9,
 'investmentbanking': 37,
 'aquarius': 21,
 'female': 33,
 '14': 0,
 'indunk': 35,
 'aries': 22,
 '25': 6,
 'capricorn': 28,
 '17': 3,
 'gemini': 34,
 '23': 4,
 'non': 45,
 'profit': 47,
 'cancer': 27,
 'banking': 25,
 '37': 13,
 'sagittarius': 50,
 '26': 7,
 '24': 5,
 'scorpio': 52,
 '27': 8,
 'education': 31,
 '45': 18,
 'engineering': 32,
 'libra': 40,
 'science': 51,
 '34': 10,
 '41': 15,
 'communications': 29,
 'media': 43,
 'businessservices': 26,
 'sports': 53,
 'recreation': 48,
 'virgo': 57,
 'taurus': 55,
 'arts': 23,
 'pisces': 46,
 '44': 17,
 '16': 2,
 'internet': 36,
 'museums': 44,
 'libraries': 41,
 'accounting': 20,
 '39': 14,
 '35': 11,
 'technology': 56,
 '36': 12,
 'law': 38,
 '46': 19,
 'consulting': 30,
 'automotive': 24,
 '42': 16,
 'religion': 49}

In [346]:
# Extract the vocabulary keys; this set of labels is used as the classes for MultiLabelBinarizer
l_classes = list(vectorizer_labels.vocabulary_.keys())


### Step 7
Transform the labels
<br>As noted above, each example in this task can have multiple tags. To handle
this kind of prediction, we transform the labels into binary form, so that each prediction is
a mask of 0s and 1s. MultiLabelBinarizer from sklearn is convenient for this purpose.
<br>a. Convert your train and test labels using MultiLabelBinarizer
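
The binary form described above can be illustrated without any library. The sketch below is a plain-Python stand-in for the idea behind `MultiLabelBinarizer`, not its implementation; `fit_classes` and `binarize` are hypothetical helpers. Labels that are not in the class list are silently skipped here, so a label spelled differently at transform time ends up as all zeros for that entry.

```python
def fit_classes(label_lists):
    """Collect the sorted set of distinct labels; this fixes the column order."""
    return sorted({label for labels in label_lists for label in labels})

def binarize(label_lists, classes):
    """Return one 0/1 row per example; labels missing from `classes` are ignored."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            if label in index:
                row[index[label]] = 1
        rows.append(row)
    return rows

# Two made-up examples, each with a gender, age, topic and sign label.
train = [["male", "35", "technology", "aries"],
         ["female", "24", "indunk", "scorpio"]]

classes = fit_classes(train)
print(classes)
print(binarize(train, classes))

# "nonprofit" was never seen when the classes were collected, so only "male" is set.
print(binarize([["male", "nonprofit"]], classes))
```

Checking that every label string used at transform time also appears in the class list is a cheap way to catch such mismatches early.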

In [347]:
mlb = MultiLabelBinarizer(classes = l_classes)
labels = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in labels]]  # raw string avoids an invalid-escape warning

In [348]:
labels_trans = mlb.fit(labels) # fitting on the entire set of labels
labels_trans

MultiLabelBinarizer(classes=['male', '15', 'student', 'leo', '33',
                             'investmentbanking', 'aquarius', 'female', '14',
                             'indunk', 'aries', '25', 'capricorn', '17',
                             'gemini', '23', 'non', 'profit', 'cancer',
                             'banking', '37', 'sagittarius', '26', '24',
                             'scorpio', '27', 'education', '45', 'engineering',
                             'libra', ...],
                    sparse_output=False)

In [349]:
# Keep only word characters in each label (raw strings avoid an invalid-escape warning)
y_train = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in y_train]]
y_train_tr = mlb.transform(y_train)
y_test = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in y_test]]
y_test_tr = mlb.transform(y_test)


In [350]:
mlb.classes_

array(['male', '15', 'student', 'leo', '33', 'investmentbanking',
       'aquarius', 'female', '14', 'indunk', 'aries', '25', 'capricorn',
       '17', 'gemini', '23', 'non', 'profit', 'cancer', 'banking', '37',
       'sagittarius', '26', '24', 'scorpio', '27', 'education', '45',
       'engineering', 'libra', 'science', '34', '41', 'communications',
       'media', 'businessservices', 'sports', 'recreation', 'virgo',
       'taurus', 'arts', 'pisces', '44', '16', 'internet', 'museums',
       'libraries', 'accounting', '39', '35', 'technology', '36', 'law',
       '46', 'consulting', 'automotive', '42', 'religion'], dtype=object)

In [351]:
print(y_test_tr[10])
y_test[10]

[1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]


['male', '35', 'technology', 'aries']

In [352]:
len(mlb.classes_)

58

### Step 8:

In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers to train is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on  every label 
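
Conceptually, One-vs-Rest turns the multi-hot label matrix into k independent binary problems, one per column. The sketch below only shows that decomposition, with no classifier training; `per_label_targets` and `constant_labels` are hypothetical helpers and the tiny matrix is made up. A column that is all 0s (or all 1s) gives its binary classifier nothing to separate, so it is worth checking for such labels before fitting.

```python
def per_label_targets(Y, classes):
    """Split a multi-hot matrix into one binary target vector per label."""
    return {label: [row[j] for row in Y] for j, label in enumerate(classes)}

def constant_labels(Y, classes):
    """Labels whose column holds a single value, so there is nothing to learn."""
    targets = per_label_targets(Y, classes)
    return [label for label, column in targets.items() if len(set(column)) < 2]

# Made-up training labels: three examples, three classes.
classes = ["male", "female", "non"]
Y = [[1, 0, 0],
     [0, 1, 0],
     [1, 0, 0]]

for label, target in per_label_targets(Y, classes).items():
    print(label, target)   # each vector is the target of one binary classifier

print("constant columns:", constant_labels(Y, classes))
```

Each binary classifier sees the same feature matrix and only its own target vector, which is why training time grows with the number of labels.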

In [353]:
clf = LogisticRegression(solver = 'lbfgs', max_iter = 1000)  # initiating the classifier
clf = OneVsRestClassifier(clf)

### Step 9:

Fit the classifier, make predictions and get the accuracy

a. Print the following
<br>i. Accuracy score
<br>ii. F1 score
<br>iii. Average precision score
<br>iv. Average recall score
 
Tip: Make sure you are familiar with all of them. How would you expect the  things to work for the multi-label scenario? Read about micro/macro/weighted  averaging 
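
The averaging modes mentioned in the tip can be illustrated with a small library-free sketch. It assumes multi-hot label matrices given as lists of 0/1 rows; `f1` and `multilabel_scores` are hypothetical helpers, not scikit-learn functions. Micro averaging pools the true positive, false positive and false negative counts over all labels before computing the score, macro averaging computes the score per label and takes the unweighted mean, and accuracy is taken as subset accuracy, where a row counts only if every label matches.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score (tp = fp = fn = 0)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def multilabel_scores(y_true, y_pred):
    """Subset accuracy plus micro- and macro-averaged F1 for 0/1 label matrices."""
    n_labels = len(y_true[0])
    tp, fp, fn = [0] * n_labels, [0] * n_labels, [0] * n_labels
    exact = 0
    for t_row, p_row in zip(y_true, y_pred):
        exact += int(t_row == p_row)              # the whole row must match
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tp[j] += int(t == 1 and p == 1)
            fp[j] += int(t == 0 and p == 1)
            fn[j] += int(t == 1 and p == 0)
    return {
        "subset_accuracy": exact / len(y_true),
        "f1_micro": f1(sum(tp), sum(fp), sum(fn)),                  # pool counts, then score
        "f1_macro": sum(f1(tp[j], fp[j], fn[j])
                        for j in range(n_labels)) / n_labels,       # score per label, then average
    }

# Made-up example: label 0 is frequent and always right, labels 1 and 2 are rare and never predicted.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

scores = multilabel_scores(y_true, y_pred)
print(scores)
```

In this toy case micro F1 is 0.8 while macro F1 is only 1/3, because the two missed rare labels each contribute an F1 of 0 to the macro mean. A large gap in that direction usually means the frequent labels are handled well and the rare ones are not.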

In [354]:
clf.fit(X_train,y_train_tr)


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=1000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [355]:
print("Train Accuracy:",clf.score(X_train,y_train_tr))

Train Accuracy: 0.9561194029850746


In [364]:
Y_pred = clf.predict(X_test) 

In [365]:
from sklearn.metrics import f1_score, accuracy_score, recall_score, precision_score
print("Test Accuracy:" + str(accuracy_score(y_test_tr, Y_pred)))
print("F1: " + str(f1_score(y_test_tr, Y_pred, average='micro')))
print("F1_macro: " + str(f1_score(y_test_tr, Y_pred, average='macro')))
print("Precision: " + str(precision_score(y_test_tr, Y_pred, average='micro')))
print("Precision_macro: " + str(precision_score(y_test_tr, Y_pred, average='macro')))
print("Recall: " + str(recall_score(y_test_tr, Y_pred, average='micro')))
print("Recall_macro: " + str(recall_score(y_test_tr, Y_pred, average='macro')))

Test Accuracy:0.5315151515151515
F1: 0.7432240847302143
F1_macro: 0.25418706963707105
Precision: 0.8188166115398751
Precision_macro: 0.42299519494609034
Recall: 0.6804092227821041
Recall_macro: 0.2056911276465752


### Step 10
Print true label and predicted label for any five examples

In [366]:
Y_pred_inv = mlb.inverse_transform(Y_pred)   # inverse-transform the predicted labels
Y_test_trans_inv =  mlb.inverse_transform(y_test_tr) # inverse-transform the true test labels

In [367]:
print("Example 1 - predicted :",Y_pred_inv[0])
print("Example 1 - Actual :",Y_test_trans_inv[0])
print("Example 1 - Actual_before mlb transformation :",y_test[0])

Example 1 - predicted : ('male', 'aries', '35', 'technology')
Example 1 - Actual : ('male', 'aries', '35', 'technology')
Example 1 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [368]:
print("Example 2 - predicted :",Y_pred_inv[30])
print("Example 2 - Actual :",Y_test_trans_inv[30])
print("Example 2 - Actual_before mlb transformation :",y_test[30])

Example 2 - predicted : ('male',)
Example 2 - Actual : ('leo', 'female', 'indunk', '16')
Example 2 - Actual_before mlb transformation : ['female', '16', 'indunk', 'leo']


In [369]:
print("Example 3 - predicted :",Y_pred_inv[39])
print("Example 3 - Actual :",Y_test_trans_inv[39])
print("Example 3 - Actual_before mlb transformation :",y_test[39])

Example 3 - predicted : ('female', 'indunk', 'scorpio')
Example 3 - Actual : ('female', 'indunk', '24', 'scorpio')
Example 3 - Actual_before mlb transformation : ['female', '24', 'indunk', 'scorpio']


In [370]:
print("Example 4 - predicted :",Y_pred_inv[300])
print("Example 4 - Actual :",Y_test_trans_inv[300])
print("Example 4 - Actual_before mlb transformation :",y_test[300])

Example 4 - predicted : ('male', 'leo', 'indunk', '26')
Example 4 - Actual : ('male', 'leo', 'indunk', '26')
Example 4 - Actual_before mlb transformation : ['male', '26', 'indunk', 'leo']


In [371]:
print("Example 5 - predicted :",Y_pred_inv[89])
print("Example 5 - Actual :",Y_test_trans_inv[89])
print("Example 5 - Actual_before mlb transformation :",y_test[89])

Example 5 - predicted : ('female', 'indunk', 'pisces', '36')
Example 5 - Actual : ('female', 'indunk', 'pisces', '36')
Example 5 - Actual_before mlb transformation : ['female', '36', 'indunk', 'pisces']


# Conclusions:
- We also tried this model with 50k, 20k and 3k records and could not get test accuracy above 53%.
- For this run, we used 5000 data points.
- The test accuracy of the model is modest, close to 53%, while the train accuracy is about 96%, which suggests overfitting. Further iterations and model tuning are needed to improve it.
- This is also reflected in the true and predicted labels above: the model does not produce a prediction for every class, and some of the classes it does predict are wrong.
- Lemmatization was added as an extra preprocessing step, but it did not improve model generalisation.