# Project - Statistical NLP

Welcome to this NLP case study. The case study (described below, 60 points)
covers the traditional models taught in the NLP course.

# Project Description
Classification is probably the most common task you will deal with in real life.
Text in the form of blogs, posts, articles, etc. is written every second. Predicting
information about a writer from the text alone, without knowing anything else about them, is a challenge.

We are going to create a classifier that predicts multiple features of the author of a given text.
We have designed it as a Multilabel classification problem.

# Dataset
Blog Authorship Corpus

Over 600,000 posts from more than 19 thousand bloggers.
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from
blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million
words - or approximately 35 posts and 7250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the
blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and
age but for many, industry and/or sign is marked as unknown.)
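
As a small illustration of the file-naming scheme described above, the sketch below splits a file name into its fields. The dot-separated layout `id.gender.age.industry.sign.xml` and the sample file name are assumptions made for this example; the notebook itself reads a pre-assembled `blogtext.csv` and never parses file names.

```python
from typing import NamedTuple


class BloggerInfo(NamedTuple):
    blogger_id: str
    gender: str
    age: int
    industry: str
    sign: str


def parse_blog_filename(filename: str) -> BloggerInfo:
    """Split an assumed 'id.gender.age.industry.sign.xml' file name into fields."""
    stem = filename.rsplit(".", 1)[0]  # drop the assumed '.xml' extension
    blogger_id, gender, age, industry, sign = stem.split(".")
    return BloggerInfo(blogger_id, gender, int(age), industry, sign)


# Made-up file name, used only to exercise the helper.
info = parse_blog_filename("1234567.male.25.indUnk.Leo.xml")
print(info)
```

An industry value such as `indUnk` is what the dataset description calls "unknown"; it is kept as an ordinary label in the rest of the notebook.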

# Link to dataset: 

https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blogauthorship-corpus.zip/2at

# Importing libraries:

In [0]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, f1_score, accuracy_score, recall_score, precision_score
from nltk.stem import WordNetLemmatizer

# 1. Loading Dataset:

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import os
os.chdir('/content/drive/My Drive/Colab Notebooks')

In [0]:
corpus_df = pd.read_csv("blogtext.csv")

In [0]:
corpus_df.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...


In [0]:
# Taking only the first 3,000 rows for training

In [0]:
corpus_df = corpus_df[:3000].copy()  # copy so later column assignments do not act on a slice of the original frame
print(corpus_df.shape)
corpus_df["text"].loc[0]

(3000, 7)


'           Info has been found (+/- 100 pages, and 4.5 MB of .pdf files) Now i have to wait untill our team leader has processed it and learns html.         '

# 2. Preprocess rows of the “text” column 
a. Remove unwanted characters

b. Convert text to lowercase

c. Remove unwanted spaces

d. Remove stopwords
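
The four steps listed above can also be sketched as a single function on one string. This is a minimal illustration, not the notebook's implementation: `clean_text` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list so that the example runs without downloads.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
SAMPLE_STOPWORDS = {"has", "been", "and", "of", "i", "have", "to", "it", "our", "now"}


def clean_text(text, stopwords=SAMPLE_STOPWORDS):
    text = re.sub(r"[^A-Za-z]", " ", text)  # a. replace non-letters with spaces
    text = text.lower()                     # b. convert to lowercase
    tokens = text.split()                   # c. split() discards extra whitespace
    tokens = [t for t in tokens if t not in stopwords]  # d. drop stop words
    return " ".join(tokens)


cleaned = clean_text("   Info has been found (+/- 100 pages, and 4.5 MB of .pdf files)  ")
print(cleaned)
```

Splitting on whitespace and rejoining with single spaces also collapses the runs of spaces that `str.strip()` alone leaves inside the text.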

In [0]:
#Removing unwanted / special characters
corpus_df['text'] = corpus_df['text'].str.replace(r'[^A-Za-z]', ' ', regex=True)  # regex=True is needed in pandas 2.x, where the default is a literal match
corpus_df["text"].loc[0]

'           Info has been found          pages  and     MB of  pdf files  Now i have to wait untill our team leader has processed it and learns html          '

In [0]:
#Converting letters to lower case
corpus_df['text'] = corpus_df['text'].str.lower()
corpus_df["text"].loc[0]

'           info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html          '

In [0]:
#Space removal
corpus_df["text"] = corpus_df["text"].str.strip()
corpus_df["text"].loc[0]

'info has been found          pages  and     mb of  pdf files  now i have to wait untill our team leader has processed it and learns html'

In [0]:
#Splitting each row of text into individual words so that stopwords can be filtered out in the next steps.
corpus_df["text"] = corpus_df["text"].str.split()  


In [0]:
#Removing stopwords

In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
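stop = set(stopwords.words('english'))  # a set makes membership checks fast

def removestopwords(y):
    # y is a list of tokens; keep the ones that are not stopwords and rejoin them
    stopwordsremoved = [w for w in y if w not in stop]
    return " ".join(stopwordsremoved)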

In [0]:
text_column_size = corpus_df["text"].size
print("text column size :", text_column_size)

# Initialize an empty list to hold the text after stop word removal
cleaner_corpus_df_text = []

# Looping over each text
for i in range( 0, text_column_size):
    cleaner_corpus_df_text.append(removestopwords(corpus_df["text"][i]))

text column size : 3000


In [0]:
cleaner_corpus_df_text[10]

'ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea n

In [0]:
#Replacing text column with cleaner_corpus_df_text 

corpus_df["text"] = cleaner_corpus_df_text
corpus_df["text"][10]

'ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea n

In [0]:
#Lemmatization:

In [0]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    lemm = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    return(" ".join(lemm)) 

corpus_df["text"] = corpus_df.text.apply(lemmatize_text)

In [0]:
corpus_df["text"][10] 

'ah korean language look difficult first figure read hanguel korea surprisingly easy learn alphabet character seems easy vocabulary start oh backwards u sentence structure yikes luckily many option u slow witted foreigner take language course could list urllink joongang article say lot resource urllink well guy motivation jeon ji hyun latest something actually star movie cf hear mean commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly title make sense like website korean english look quite good actually urllink movie shown theatre subtitle special time info urllink list many theatre seoul click urllink urllink great reason learn korean already married went foreigner well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea nothing xxx sensibil

# 3. Merging:
Because we want to treat this as a multi-label classification problem, you are required to merge
all the label columns, so that every post carries all of its labels in one place.

a. Label columns to merge: “gender”, “age”, “topic”, “sign”
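
One way to picture the merge is on a two-row toy frame. The sketch below is an assumption-laden alternative to joining the columns into a comma-separated string: it keeps each row's labels as a Python list, which is the shape `MultiLabelBinarizer` expects later. The frame `toy` and its values are made up.

```python
import pandas as pd

# Made-up rows with the same four label columns as the corpus.
toy = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 24],
    "topic": ["Student", "Non-Profit"],
    "sign": ["Leo", "Scorpio"],
})
label_cols = ["gender", "age", "topic", "sign"]

# One list of lowercase string labels per row; age is converted to text.
merged = [
    [str(value).lower() for value in row]
    for row in toy[label_cols].itertuples(index=False)
]
print(merged)
```

Keeping lists avoids having to split the merged string on commas again before binarizing, and a hyphenated topic such as `non-profit` stays a single label.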

In [0]:
corpus_df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [0]:
corpus_df.shape

(3000, 7)

In [0]:
corpus_df.head(5)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004",info found page mb pdf file wait untill team l...
1,2059027,male,15,Student,Leo,"13,May,2004",team member drewes van der laag urllink mail r...
2,2059027,male,15,Student,Leo,"12,May,2004",het kader van kernfusie op aarde maak je eigen...
3,2059027,male,15,Student,Leo,"12,May,2004",testing testing
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",thanks yahoo toolbar capture url popups mean s...


In [0]:
# merging gender, age, topic and sign


In [0]:
corpus_df['age'] = corpus_df['age'].astype(str)
corpus_df['labels'] = corpus_df[['gender','age','topic','sign']].apply(lambda x: ','.join(x), axis = 1) 
corpus_df_merged = corpus_df.drop(labels = ['date','gender', 'age','topic','sign','id'], axis = 1)
corpus_df_merged.head()

Unnamed: 0,text,labels
0,info found page mb pdf file wait untill team l...,"male,15,Student,Leo"
1,team member drewes van der laag urllink mail r...,"male,15,Student,Leo"
2,het kader van kernfusie op aarde maak je eigen...,"male,15,Student,Leo"
3,testing testing,"male,15,Student,Leo"
4,thanks yahoo toolbar capture url popups mean s...,"male,33,InvestmentBanking,Aquarius"


In [0]:
corpus_df_merged.shape

(3000, 2)

# 4. Separate features and labels, and split the data into training and testing

In [0]:
feature = corpus_df_merged['text']
corpus_df_merged['labels'] = corpus_df_merged['labels'].str.lower()
labels = corpus_df_merged['labels']
X_train, X_test, Y_train, Y_test = train_test_split(feature,labels, test_size = 0.33, random_state = 143)
Y_train.shape

(2010,)

# 5. Vectorize the features
a. Create a Bag of Words using count vectorizer

    1. Use ngram_range=(1, 2)
    2. Vectorize training and testing features

b. Print the term-document matrix
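
To make the effect of `ngram_range=(1, 2)` concrete, here is a minimal sketch on made-up documents. It shows that the vocabulary is learned from the training text only and then reused for the test text, so unseen terms in the test text are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["learn korean language", "korean movie night"]  # made-up documents
test_docs = ["korean language course"]

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X_tr = vec.fit_transform(train_docs)       # learn the vocabulary on training text
X_te = vec.transform(test_docs)            # reuse that vocabulary for test text

print(sorted(vec.vocabulary_))  # unigram and bigram terms
print(X_tr.toarray())           # term-document counts for the training documents
print(X_te.toarray())           # "course" is unseen, so it has no column
```

The matrix is stored in sparse form, so `toarray()` is only practical for small examples like this one.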

In [0]:
vectorizer = CountVectorizer(min_df = 2,ngram_range = (1,2),stop_words = "english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
print("X_train shape & sample",X_train.shape)
X_train[0]

X_train shape & sample (2010, 16018)


<1x16018 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>

# 6. Create a dictionary to get the count of every label i.e. the key will be label name and value will be the total count of the label.

In [0]:
vectorizer_labels = CountVectorizer(min_df = 1,ngram_range = (1,1),stop_words = "english")
labels_vector = vectorizer_labels.fit_transform(labels)
vectorizer_labels.vocabulary_


{'14': 0,
 '15': 1,
 '16': 2,
 '17': 3,
 '23': 4,
 '24': 5,
 '25': 6,
 '26': 7,
 '27': 8,
 '33': 9,
 '34': 10,
 '35': 11,
 '37': 12,
 '39': 13,
 '41': 14,
 '44': 15,
 '45': 16,
 'accounting': 17,
 'aquarius': 18,
 'aries': 19,
 'arts': 20,
 'banking': 21,
 'businessservices': 22,
 'cancer': 23,
 'capricorn': 24,
 'communications': 25,
 'education': 26,
 'engineering': 27,
 'female': 28,
 'gemini': 29,
 'indunk': 30,
 'internet': 31,
 'investmentbanking': 32,
 'leo': 33,
 'libra': 34,
 'libraries': 35,
 'male': 36,
 'media': 37,
 'museums': 38,
 'non': 39,
 'pisces': 40,
 'profit': 41,
 'recreation': 42,
 'sagittarius': 43,
 'science': 44,
 'scorpio': 45,
 'sports': 46,
 'student': 47,
 'taurus': 48,
 'technology': 49,
 'virgo': 50}

In [0]:
## Extracting only the keys from the above dictionary, which are the unique labels.

In [0]:
# Collect the vocabulary keys (the unique label tokens) in insertion order
label_classes = list(vectorizer_labels.vocabulary_.keys())

print(sorted(label_classes))

['14', '15', '16', '17', '23', '24', '25', '26', '27', '33', '34', '35', '37', '39', '41', '44', '45', 'accounting', 'aquarius', 'aries', 'arts', 'banking', 'businessservices', 'cancer', 'capricorn', 'communications', 'education', 'engineering', 'female', 'gemini', 'indunk', 'internet', 'investmentbanking', 'leo', 'libra', 'libraries', 'male', 'media', 'museums', 'non', 'pisces', 'profit', 'recreation', 'sagittarius', 'science', 'scorpio', 'sports', 'student', 'taurus', 'technology', 'virgo']
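
Note that `CountVectorizer.vocabulary_` maps each term to its column index, not to how often it occurs. If the goal is a dictionary of label counts, a sketch along the following lines may be closer to what the task asks for; `sample_labels` is made-up data in the same `gender,age,topic,sign` layout.

```python
from collections import Counter

# Made-up merged label strings.
sample_labels = [
    "male,15,student,leo",
    "male,15,student,leo",
    "female,24,indunk,scorpio",
]

# Count every individual label across all rows.
label_counts = Counter(
    label for row in sample_labels for label in row.split(",")
)
print(dict(label_counts))
```

With the real data, the same expression could be applied to the `labels` column, for example by iterating over `corpus_df_merged['labels']`.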


# 7. Transform the labels 
As we have noticed before, in this task each example can have multiple tags. To handle
this kind of prediction, we need to transform the labels into binary form; the prediction will be
a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

a. Convert your train and test labels using MultiLabelBinarizer
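
A minimal sketch of what `MultiLabelBinarizer` does, on made-up tag lists. Here the binarizer is fitted on the toy training tags only, which is one possible choice rather than the approach taken in the cells below.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag lists, one list per example.
train_tags = [["male", "15", "student", "leo"], ["female", "24", "indunk", "scorpio"]]
test_tags = [["male", "24", "student", "scorpio"]]

binarizer = MultiLabelBinarizer()
Y_tr = binarizer.fit_transform(train_tags)  # learn the classes and build the 0/1 mask
Y_te = binarizer.transform(test_tags)       # reuse the learned classes

print(list(binarizer.classes_))             # column order of the mask
print(Y_te)
print(binarizer.inverse_transform(Y_te))    # back from mask to label tuples
```

Each column of the mask corresponds to one entry of `classes_`, and `inverse_transform` returns the labels in that column order rather than in the original input order.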

In [0]:
# initialising multilabelbinariser with all unique possible classes
mlb = MultiLabelBinarizer(classes = label_classes)

In [0]:
# Converting entire set of labels into format required by mlb
labels = [["".join(re.findall(r"\w",f)) for f in lst] for lst in [s.split(",") for s in labels]]  # raw string avoids an invalid-escape warning
labels[30]

['male', '33', 'investmentbanking', 'aquarius']

In [0]:
labels_trans = mlb.fit(labels) # fitting on the entire set of labels
labels_trans

MultiLabelBinarizer(classes=['male', '15', 'student', 'leo', '33',
                             'investmentbanking', 'aquarius', 'female', '14',
                             'indunk', 'aries', '25', 'capricorn', '17',
                             'gemini', '23', 'non', 'profit', 'cancer',
                             'banking', '37', 'sagittarius', '26', '24',
                             'scorpio', '27', 'education', '45', 'engineering',
                             'libra', ...],
                    sparse_output=False)

In [0]:
#Convert Y_train into format as required by mlb 

Y_train = [["".join(re.findall(r"\w",f)) for f in lst] for lst in [s.split(",") for s in Y_train]]
Y_train[30]


['male', '24', 'engineering', 'libra']

In [0]:
# transforming train labels using mlb, which was fitted on all unique labels in the entire data set
Y_train_trans = mlb.transform(Y_train)
Y_train_trans[30]




array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [0]:
Y_train_trans.shape


(2010, 51)

In [0]:
#Converting Y_test into a format as required by mlb 

Y_test = [["".join(re.findall(r"\w",f)) for f in lst] for lst in [s.split(",") for s in Y_test]]
Y_test_trans = mlb.transform(Y_test) # transforming test labels.
print(Y_test[30])

['male', '35', 'technology', 'aries']




In [0]:
Y_test_trans[30]


array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1])

In [0]:
len(mlb.classes_)

51

In [0]:
Y_train_trans[10]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [0]:
Y_train[10]

['male', '25', 'nonprofit', 'cancer']
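
The row above contains four labels, but its mask `Y_train_trans[10]` has only three ones. A likely cause is that the class list was built with `CountVectorizer`, which splits a hyphenated topic into `non` and `profit`, while the cleaning step turns the same topic into `nonprofit`, a class the binarizer does not know. The sketch below reproduces the mismatch with plain regular expressions and shows one possible way to build a consistent class list; `sample_rows` is made-up data and the hyphenated spelling `non-profit` is an assumption about the lowercased raw label.

```python
import re

raw_topic = "non-profit"  # assumed raw spelling of the topic after lowercasing

# CountVectorizer's default token pattern keeps runs of two or more word
# characters, so the hyphen splits the label into two tokens.
vectorizer_tokens = re.findall(r"(?u)\b\w\w+\b", raw_topic)

# The notebook's cleaning step keeps only word characters: one token.
cleaned_label = "".join(re.findall(r"\w", raw_topic))

print(vectorizer_tokens)
print(cleaned_label)

# One possible fix: derive the class list from the cleaned labels themselves.
sample_rows = ["male,25,non-profit,cancer", "female,24,indunk,scorpio"]
cleaned_rows = [
    ["".join(re.findall(r"\w", part)) for part in row.split(",")]
    for row in sample_rows
]
consistent_classes = sorted({label for row in cleaned_rows for label in row})
print(consistent_classes)
```

Passing a class list built this way to `MultiLabelBinarizer(classes=...)`, or letting the binarizer infer the classes during `fit`, should keep every cleaned label representable in the mask.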

# 8. Choose a classifier:
In this task, we suggest using the One-vs-Rest approach, which is implemented in the
OneVsRestClassifier class. In this approach, k classifiers (k = number of tags) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
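
The following toy sketch illustrates the "one classifier per tag" idea: after fitting, the wrapper holds one fitted estimator for each column of the label mask. The arrays `X_toy` and `Y_toy` are made-up values chosen only so that each label has both positive and negative examples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up bag-of-words counts (4 documents, 3 terms) and a 4 x 2 label mask.
X_toy = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
Y_toy = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
ovr.fit(X_toy, Y_toy)

print(len(ovr.estimators_))      # one fitted LogisticRegression per label column
print(ovr.predict(X_toy).shape)  # predictions come back as a mask of the same width
```

Because every label gets its own binary problem, a prediction can contain any number of ones, including none at all.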


In [0]:
base_clf = LogisticRegression(solver = 'lbfgs',max_iter = 1000)  # base linear classifier
clf = OneVsRestClassifier(base_clf)  # wraps the base classifier so that one copy is trained per label

# 9. Fit the classifier, make predictions and get the accuracy :
a. Print the following

  i. Accuracy score

  ii. F1 score

  iii. Average precision score

  iv. Average recall score

  v. Tip: Make sure you are familiar with all of them. How would you expect
them to work in the multi-label scenario? Read about micro/macro/weighted
averaging.
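
To see how micro and macro averaging can diverge, the sketch below computes both F1 variants by hand on a made-up mask in which one frequent label is always predicted correctly and two rare labels are never predicted. Labels with no true positives are given an F1 of 0 here, which is an explicit choice for this example.

```python
# Made-up true and predicted masks: 4 examples, 3 labels.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_label_f1 = []
total_tp = total_fp = total_fn = 0
for j in range(3):
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    per_label_f1.append(f1(tp, fp, fn))
    total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn

macro_f1 = sum(per_label_f1) / len(per_label_f1)  # every label weighs the same
micro_f1 = f1(total_tp, total_fp, total_fn)       # every decision weighs the same

print("macro F1:", macro_f1)
print("micro F1:", micro_f1)
```

Macro averaging is pulled down by the rare labels that are never predicted, while micro averaging is dominated by the frequent label. A large gap between the two on real data may therefore point to poor performance on rare labels.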



In [0]:
clf.fit(X_train,Y_train_trans) # Fitting on  train data


OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=1000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

# Accuracy:

In [0]:
print("Train Accuracy:",clf.score(X_train,Y_train_trans))

Train Accuracy: 0.972139303482587


In [0]:
Y_pred = clf.predict(X_test)

In [0]:
#Accuracy, F1 score, Average precision score, Average recall score

In [0]:
print("Test Accuracy:" + str(accuracy_score(Y_test_trans, Y_pred)))
print("F1: " + str(f1_score(Y_test_trans, Y_pred, average='micro')))
print("F1_macro: " + str(f1_score(Y_test_trans, Y_pred, average='macro')))
print("Precision: " + str(precision_score(Y_test_trans, Y_pred, average='micro')))
print("Precision_macro: " + str(precision_score(Y_test_trans, Y_pred, average='macro')))
print("Recall: " + str(recall_score(Y_test_trans, Y_pred, average='micro')))
print("Recall_macro: " + str(recall_score(Y_test_trans, Y_pred, average='macro')))

Test Accuracy:0.5686868686868687
F1: 0.7584394023242943
F1_macro: 0.2777544085541704
Precision: 0.8270971635485818
Precision_macro: 0.42504160995159523
Recall: 0.7003065917220235
Recall_macro: 0.2290073113591835
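
For multi-label targets, `accuracy_score` reports subset accuracy: an example counts as correct only when its entire mask matches. That is much stricter than scoring each label decision separately, as the made-up masks in this sketch show.

```python
# Made-up true and predicted masks: 3 examples, 4 labels.
true_masks = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]]
pred_masks = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]

# Subset accuracy: the whole row must match.
subset_accuracy = sum(t == p for t, p in zip(true_masks, pred_masks)) / len(true_masks)

# Per-label agreement: each of the 12 individual decisions is scored on its own.
n_cells = sum(len(row) for row in true_masks)
per_label_agreement = sum(
    ti == pi
    for t, p in zip(true_masks, pred_masks)
    for ti, pi in zip(t, p)
) / n_cells

print("subset accuracy:", subset_accuracy)
print("per-label agreement:", per_label_agreement)
```

Two of the three rows miss a single label, so subset accuracy drops to one third even though most individual decisions are right. This is one reason a test accuracy can look low next to a much higher micro-averaged F1.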




# 10. Print true label and predicted label for any five examples

In [0]:
Y_pred_inv = mlb.inverse_transform(Y_pred)   # inverse transforming predicted label data
Y_test_trans_inv =  mlb.inverse_transform(Y_test_trans) # inverse transforming original test label data

In [0]:
#Example 1

In [0]:
print("Ex. 1 - predicted :",Y_pred_inv[0])
print("Ex. 1 - Actual :",Y_test_trans_inv[0])
print("Ex. 1 - Actual_before mlb transformation :",Y_test[0])

Ex. 1 - predicted : ('male', 'aries', '35', 'technology')
Ex. 1 - Actual : ('aquarius', 'female', '27', 'education')
Ex. 1 - Actual_before mlb transformation : ['female', '27', 'education', 'aquarius']


In [0]:
#Example 2

In [0]:
print("Ex. 2 - predicted :",Y_pred_inv[20])
print("Ex. 2 - Actual :",Y_test_trans_inv[20])
print("Ex. 2 - Actual_before mlb transformation :",Y_test[20])

Ex. 2 - predicted : ('male', 'aries', '35', 'technology')
Ex. 2 - Actual : ('male', 'aries', '35', 'technology')
Ex. 2 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [0]:
#Example 3

In [0]:
print("Ex. 3 - predicted :",Y_pred_inv[40])
print("Ex. 3 - Actual :",Y_test_trans_inv[40])
print("Ex. 3 - Actual_before mlb transformation :",Y_test[40])

Ex. 3 - predicted : ('male', 'aries', '35', 'technology')
Ex. 3 - Actual : ('male', 'aries', '35', 'technology')
Ex. 3 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [0]:
#Example 4

In [0]:
print("Ex. 4 - predicted :",Y_pred_inv[225])
print("Ex. 4 - Actual :",Y_test_trans_inv[225])
print("Ex. 4 - Actual_before mlb transformation :",Y_test[225])

Ex. 4 - predicted : ('female',)
Ex. 4 - Actual : ('female', 'indunk', '24', 'scorpio')
Ex. 4 - Actual_before mlb transformation : ['female', '24', 'indunk', 'scorpio']


In [0]:
#Example 5

In [0]:
print("Ex. 5 - predicted :",Y_pred_inv[250])
print("Ex. 5 - Actual :",Y_test_trans_inv[250])
print("Ex. 5 - Actual_before mlb transformation :",Y_test[250])

Ex. 5 - predicted : ('male', 'aries', '35', 'technology')
Ex. 5 - Actual : ('male', 'aries', '35', 'technology')
Ex. 5 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']
