# Project
## Statistical NLP

## Project Description

Classification is probably the task you are most likely to meet in practice.
Text in the form of blogs, posts, and articles is written every second, and predicting information about a writer from the text alone is a challenge.
We are going to build a classifier that predicts several attributes of the author of a given text.
We frame this as a multi-label classification problem.

## Dataset

Blog Authorship Corpus: over 600,000 posts from more than 19 thousand bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age, but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups:

- 8240 "10s" blogs (ages 13-17)
- 8086 "20s" blogs (ages 23-27)
- 2994 "30s" blogs (ages 33-47)

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions: individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at
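The figures quoted in the corpus description are mutually consistent, which a few lines of arithmetic can confirm. This is only a sanity check; the variable names are illustrative, and the word count is treated as a lower bound because the description says "over 140 million".

```python
# Figures quoted in the corpus description.
bloggers = 19_320
posts = 681_288
words = 140_000_000  # "over 140 million", so this is a lower bound
age_groups = {"10s": 8240, "20s": 8086, "30s": 2994}

# The three age groups should account for every blogger.
total_in_groups = sum(age_groups.values())

posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

print(total_in_groups)               # 19320
print(round(posts_per_blogger, 1))   # 35.3
print(round(words_per_blogger))      # 7246
```

The per-blogger averages agree with the "approximately 35 posts and 7250 words" stated in the description.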

### Approach & Steps

In [1]:
# 1.Load the dataset

# imports
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report,f1_score, accuracy_score, recall_score, precision_score
from nltk.stem import WordNetLemmatizer


# Read data 
corpus_df = pd.read_csv("blogtext.csv")
corpus_df.head(10)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [2]:
corpus_df.shape

(681284, 7)

In [3]:
corpus_df_sample = corpus_df[:3000].copy()   # take the first 3000 rows; .copy() avoids pandas' SettingWithCopyWarning on later column assignments
print(corpus_df_sample.shape)
corpus_df_sample["text"].loc[1]

(3000, 7)


'           These are the team members:   Drewes van der Laag           urlLink mail  Ruiyu Xie                     urlLink mail  Bryan Aaldering (me)          urlLink mail          '

In [4]:
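# 2. Preprocess rows of the “text” column
# a. Remove unwanted characters (everything that is not a letter becomes a space)
corpus_df_sample['text'] = corpus_df_sample['text'].str.replace('[^A-Za-z]', ' ', regex=True)  # regex=True is required in pandas >= 2.0, where the default is a literal match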
corpus_df_sample["text"].loc[1]


'           These are the team members    Drewes van der Laag           urlLink mail  Ruiyu Xie                     urlLink mail  Bryan Aaldering  me           urlLink mail          '

In [5]:
# b. Convert text to lowercase
corpus_df_sample['text'] = corpus_df_sample['text'].str.lower()
corpus_df_sample["text"].loc[1]


'           these are the team members    drewes van der laag           urllink mail  ruiyu xie                     urllink mail  bryan aaldering  me           urllink mail          '

In [6]:
# c. Remove unwanted spaces
corpus_df_sample["text"] = corpus_df_sample["text"].str.strip()
corpus_df_sample["text"].loc[1]


'these are the team members    drewes van der laag           urllink mail  ruiyu xie                     urllink mail  bryan aaldering  me           urllink mail'
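Steps a to c can be mirrored in plain Python, which makes it easier to see what each one does to a single string. The sketch below is a stand-alone illustration, not the notebook's pandas code; `clean_text` and the shortened `sample` string are made up for the example.

```python
import re

def clean_text(raw):
    """Mirror steps a-c: keep letters only, lowercase, trim, then split on whitespace."""
    letters_only = re.sub(r"[^A-Za-z]", " ", raw)   # a. replace every non-letter with a space
    lowered = letters_only.lower()                   # b. lowercase
    trimmed = lowered.strip()                        # c. remove leading/trailing spaces
    return trimmed.split()                           # splitting also collapses inner runs of spaces

sample = "   These are the team members:   Bryan Aaldering (me)   urlLink mail   "
tokens = clean_text(sample)
print(tokens)
```

Note that `strip()` only trims the ends of the string, which is why the output above still contains runs of spaces between words; they disappear only once the text is split into tokens.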

In [7]:
corpus_df_sample["text"] = corpus_df_sample["text"].str.split()


In [8]:
# d.Remove stopwords
import nltk
from nltk.corpus import stopwords

In [9]:
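stop = set(stopwords.words('english'))   # a set makes each membership test O(1)
def removestopwords(y):
    """Drop English stop words from a list of tokens and re-join the rest with spaces."""
    stopwordremoved = [w for w in y if w not in stop]
    return " ".join(stopwordremoved)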

In [10]:
text_column_size = corpus_df_sample["text"].size
print("text column size :", text_column_size)

# Initialize an empty list to hold the text after stop word removal
cleaner_corpus_df_sample_text = []

# Loop over each text
for i in range( 0, text_column_size):
    cleaner_corpus_df_sample_text.append(removestopwords(corpus_df_sample["text"][i]))

text column size : 3000


In [11]:
cleaner_corpus_df_sample_text[1]

'team members drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering urllink mail'
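The filtering step itself does not depend on NLTK. As a minimal sketch, assuming a deliberately tiny and hypothetical stop list (`STOP_WORDS` below is not NLTK's list), the same idea looks like this:

```python
# Hypothetical, deliberately tiny stop list; the notebook uses NLTK's English list instead.
STOP_WORDS = frozenset({"these", "are", "the", "me", "a", "of"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, joined by single spaces."""
    return " ".join(t for t in tokens if t not in stop_words)

tokens = ["these", "are", "the", "team", "members", "bryan", "aaldering", "me"]
cleaned = remove_stop_words(tokens)
print(cleaned)  # team members bryan aaldering
```

Keeping the stop words in a set (or frozenset) rather than a list matters mostly for speed when the loop runs over many thousands of posts.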

In [12]:

#Replace text column with cleaner_corpus_df_sample_text 
corpus_df_sample["text"] = cleaner_corpus_df_sample_text
corpus_df_sample["text"][10]


'ah korean language looks difficult first figure read hanguel korea surprisingly easy learn alphabet characters seems easy vocabulary starts oh backwards us sentence structure yikes luckily many options us slow witted foreigners take language course could list urllink joongang article says lot resources urllink well guy motivation jeon ji hyun latest something actually star movies cfs hear means commercial feature positive saw latest movie sunday night hard describe name english version windstruck korean version yeochinso short ne yeojachingu rul sogayhamnida like introduce girlfriend surprisingly titles make sense like website korean english looks quite good actually urllink movie shown theatres subtitles special times info urllink list many theatres seoul click urllink urllink great reason learn korean already married went foreigners well local korean national course korean take picture put urllink movie hof bar update bud mine passed urllink link giordano ad apparently aired korea n

In [13]:
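corpus_df_sample = corpus_df_sample.dropna(subset=["text"])   # drop rows with missing text; dropna(inplace=True) on a selected column does not reliably change the DataFrame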


In [14]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\malash01\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [15]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [16]:

# w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
# lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    lemm = [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]
    return(" ".join(lemm)) 

corpus_df_sample["text"] = corpus_df_sample.text.apply(lemmatize_text)


In [17]:
corpus_df_sample["text"][1] # Lemmatized output

'team member drewes van der laag urllink mail ruiyu xie urllink mail bryan aaldering urllink mail'

In [18]:
# 3. As we want to make this into a multi-label classification problem, 
# you are required to merge all the label columns together, 
# so that we have all the labels together for a particular sentence

# a. Label columns to merge: “gender”, “age”, “topic”, “sign”
print(corpus_df_sample.columns)
print(corpus_df_sample.shape)
print(corpus_df_sample.head(2))

corpus_df_sample['age'] = corpus_df_sample['age'].astype(str)
corpus_df_sample['labels'] = corpus_df_sample[['gender','age','topic','sign']].apply(lambda x: ','.join(x), axis = 1) 
corpus_df_sample_merged = corpus_df_sample.drop(labels = ['date','gender', 'age','topic','sign','id'], axis = 1)
corpus_df_sample_merged.head()


Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')
(3000, 7)
        id gender  age    topic sign         date  \
0  2059027   male   15  Student  Leo  14,May,2004   
1  2059027   male   15  Student  Leo  13,May,2004   

                                                text  
0  info found page mb pdf file wait untill team l...  
1  team member drewes van der laag urllink mail r...  


Unnamed: 0,text,labels
0,info found page mb pdf file wait untill team l...,"male,15,Student,Leo"
1,team member drewes van der laag urllink mail r...,"male,15,Student,Leo"
2,het kader van kernfusie op aarde maak je eigen...,"male,15,Student,Leo"
3,testing testing,"male,15,Student,Leo"
4,thanks yahoo toolbar capture url popups mean s...,"male,33,InvestmentBanking,Aquarius"
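Joining the four label columns into one comma-separated string works, but the labels have to be split apart again before binarizing. One alternative, sketched here on two toy rows copied from the `head()` output, is to store each row's labels as a list from the start. The names `df`, `label_cols`, and `merged` are illustrative.

```python
import pandas as pd

# Two toy rows shaped like the corpus (values copied from the head() output).
df = pd.DataFrame({
    "gender": ["male", "male"],
    "age": [15, 33],
    "topic": ["Student", "InvestmentBanking"],
    "sign": ["Leo", "Aquarius"],
    "text": ["testing testing", "thanks yahoo toolbar"],
})

label_cols = ["gender", "age", "topic", "sign"]

# Cast to str (age is numeric), lowercase, and collect each row's labels into a list.
rows = df[label_cols].astype(str).values.tolist()
df["labels"] = pd.Series(
    [[value.lower() for value in row] for row in rows],
    index=df.index,
)

merged = df[["text", "labels"]]
print(merged["labels"].tolist())
```

A list per row can be passed to `MultiLabelBinarizer` directly, without any further string handling.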


In [19]:
# b.After completing the previous step, 
# there should be only two columns in your data frame i.e. “text” and “labels”

In [20]:
corpus_df_sample_merged.shape

(3000, 2)

In [21]:
# 4. Separate features and labels, and split the data into training and testing
feature = corpus_df_sample_merged['text']
corpus_df_sample_merged['labels'] = corpus_df_sample_merged['labels'].str.lower()
labels = corpus_df_sample_merged['labels']
X_train, X_test, Y_train, Y_test = train_test_split(feature,labels, test_size = 0.33, random_state = 143)
Y_train.shape

(2010,)
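The split sizes follow directly from `test_size = 0.33` on 3000 rows. A small sketch with a stand-in list (`rows` is hypothetical and replaces the real features) reproduces the 2010 training rows shown above:

```python
from sklearn.model_selection import train_test_split

rows = list(range(3000))  # stand-in for the 3000 sampled posts
train_rows, test_rows = train_test_split(rows, test_size=0.33, random_state=143)

print(len(train_rows), len(test_rows))
```

Fixing `random_state` makes the shuffle repeatable, so rerunning the cell yields the same partition.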

In [22]:
# 5. Vectorize the features
    # a. Create a Bag of Words using count vectorizer: i. Use ngram_range=(1, 2); ii. Vectorize training and testing features
    # b. Print the term-document matrix
vectorizer = CountVectorizer(min_df = 2,ngram_range = (1,2),stop_words = "english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
print("X_train shape & sample",X_train.shape)
X_train[0]

X_train shape & sample (2010, 16018)


<1x16018 sparse matrix of type '<class 'numpy.int64'>'
	with 27 stored elements in Compressed Sparse Row format>
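To see what `ngram_range=(1, 2)` adds, it helps to vectorize two toy documents and inspect the columns. The sketch below is an illustration only: `docs` is made up, and `min_df` and `stop_words` are left at their defaults so that no term is filtered out.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["team member mail", "team mail"]  # toy documents, not taken from the corpus

vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index; indices follow alphabetical order,
# so sorting the terms gives them in column order.
features = sorted(vectorizer.vocabulary_)
print(features)
print(matrix.toarray())
```

Each row of the dense array is one document, and each column counts one unigram or bigram. The bigram `team mail` appears only in the second document, even though both documents contain `team` and `mail`.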

In [23]:
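# 6. Create a dictionary to get the count of every label
# i.e. the key will be label name and value will be the total count of the label.
# Note: vocabulary_ maps each token to its column index in labels_vector, not to its count.
# The default tokenizer also splits hyphenated topics, so "non-profit" yields the two tokens 'non' and 'profit'.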
vectorizer_labels = CountVectorizer(min_df = 1,ngram_range = (1,1),stop_words = "english")
labels_vector = vectorizer_labels.fit_transform(labels)
vectorizer_labels.vocabulary_

{'male': 36,
 '15': 1,
 'student': 47,
 'leo': 33,
 '33': 9,
 'investmentbanking': 32,
 'aquarius': 18,
 'female': 28,
 '14': 0,
 'indunk': 30,
 'aries': 19,
 '25': 6,
 'capricorn': 24,
 '17': 3,
 'gemini': 29,
 '23': 4,
 'non': 39,
 'profit': 41,
 'cancer': 23,
 'banking': 21,
 '37': 12,
 'sagittarius': 43,
 '26': 7,
 '24': 5,
 'scorpio': 45,
 '27': 8,
 'education': 26,
 '45': 16,
 'engineering': 27,
 'libra': 34,
 'science': 44,
 '34': 10,
 '41': 14,
 'communications': 25,
 'media': 37,
 'businessservices': 22,
 'sports': 46,
 'recreation': 42,
 'virgo': 50,
 'taurus': 48,
 'arts': 20,
 'pisces': 40,
 '44': 15,
 '16': 2,
 'internet': 31,
 'museums': 38,
 'libraries': 35,
 'accounting': 17,
 '39': 13,
 '35': 11,
 'technology': 49}
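The values in the dictionary above are column indices, not frequencies. If actual label counts are wanted, one option is `collections.Counter` over the comma-separated label strings. This is a sketch on toy strings in the same `gender,age,topic,sign` format; `label_strings` is made up for the example.

```python
from collections import Counter

# Toy label strings in the same "gender,age,topic,sign" format as the labels column.
label_strings = [
    "male,15,student,leo",
    "male,33,investmentbanking,aquarius",
    "female,25,non-profit,leo",
]

# Splitting on commas keeps hyphenated topics such as "non-profit" in one piece.
label_counts = Counter(
    label for row in label_strings for label in row.split(",")
)

print(label_counts["male"], label_counts["leo"], label_counts["non-profit"])  # 2 2 1
```

`Counter` is a `dict` subclass, so it already has the key-to-count shape that step 6 asks for.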

In [24]:
#7. Convert your train and test labels using MultiLabelBinarizer
label_classes = []  
for key in vectorizer_labels.vocabulary_.keys():
    label_classes.append(key)
    
print(sorted(label_classes))

['14', '15', '16', '17', '23', '24', '25', '26', '27', '33', '34', '35', '37', '39', '41', '44', '45', 'accounting', 'aquarius', 'aries', 'arts', 'banking', 'businessservices', 'cancer', 'capricorn', 'communications', 'education', 'engineering', 'female', 'gemini', 'indunk', 'internet', 'investmentbanking', 'leo', 'libra', 'libraries', 'male', 'media', 'museums', 'non', 'pisces', 'profit', 'recreation', 'sagittarius', 'science', 'scorpio', 'sports', 'student', 'taurus', 'technology', 'virgo']


In [25]:
mlb = MultiLabelBinarizer(classes = label_classes)

In [26]:
labels = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in labels]]  # raw string avoids the invalid "\w" escape
labels[30]

['male', '33', 'investmentbanking', 'aquarius']

In [27]:
labels_trans = mlb.fit(labels) # fit the binarizer on the entire set of labels
labels_trans

MultiLabelBinarizer(classes=['male', '15', 'student', 'leo', '33',
                             'investmentbanking', 'aquarius', 'female', '14',
                             'indunk', 'aries', '25', 'capricorn', '17',
                             'gemini', '23', 'non', 'profit', 'cancer',
                             'banking', '37', 'sagittarius', '26', '24',
                             'scorpio', '27', 'education', '45', 'engineering',
                             'libra', ...],
                    sparse_output=False)

In [28]:
Y_train = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in Y_train]]
Y_train[30]

['male', '24', 'engineering', 'libra']

In [29]:

Y_train_trans = mlb.transform(Y_train) # transform the train labels with mlb, which was fitted on the labels of the entire data set; labels missing from mlb.classes_ (such as 'nonprofit') are ignored with a warning
Y_train_trans[30]


array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [30]:
Y_train_trans.shape

(2010, 51)

In [31]:
Y_test = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in Y_test]]
Y_test_trans = mlb.transform(Y_test) # transform the test labels with the same binarizer
print(Y_test[30])
print(Y_test_trans[30])

['male', '35', 'technology', 'aries']
[1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1]


In [32]:
len(mlb.classes_)

51

In [33]:
mlb.classes_

array(['male', '15', 'student', 'leo', '33', 'investmentbanking',
       'aquarius', 'female', '14', 'indunk', 'aries', '25', 'capricorn',
       '17', 'gemini', '23', 'non', 'profit', 'cancer', 'banking', '37',
       'sagittarius', '26', '24', 'scorpio', '27', 'education', '45',
       'engineering', 'libra', 'science', '34', '41', 'communications',
       'media', 'businessservices', 'sports', 'recreation', 'virgo',
       'taurus', 'arts', 'pisces', '44', '16', 'internet', 'museums',
       'libraries', 'accounting', '39', '35', 'technology'], dtype=object)

In [34]:

Y_train_trans[10]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0])

In [35]:
Y_train[10]

['male', '25', 'nonprofit', 'cancer']
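The encoded row above has only three ones although the label list has four entries. The label `'nonprofit'` is not among `mlb.classes_`, which contains `'non'` and `'profit'` instead, so the binarizer drops it. One possible remedy, sketched here on toy data rather than on the notebook's variables, is to let `MultiLabelBinarizer` collect its classes from the label lists themselves. The name `mlb_demo` is illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists; the first one is the row printed above.
label_lists = [
    ["male", "25", "nonprofit", "cancer"],
    ["female", "15", "student", "leo"],
]

# With no classes argument, fit() collects the classes from the label lists themselves.
mlb_demo = MultiLabelBinarizer()
encoded = mlb_demo.fit_transform(label_lists)

print(list(mlb_demo.classes_))
print(encoded.sum(axis=1))   # number of active labels per row
```

With the classes derived this way, every row keeps all four of its labels, and hyphenated topics are no longer lost.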

In [36]:
# 8. In this task, we suggest using the One-vs-Rest approach, which is implemented in OneVsRestClassifier class. 
# In this approach k classifiers (= number of tags) are trained. As a basic classifier, use LogisticRegression. 
# It is one of the simplest methods, but it often performs well enough in text classification tasks. 
# It might take some time because the number of classifiers to train is large.
# a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on every label
clf = LogisticRegression(solver = 'lbfgs')
clf = OneVsRestClassifier(clf)
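As a minimal sketch of what the wrapper does, the toy example below fits `OneVsRestClassifier` on made-up data with three binary labels and checks that one `LogisticRegression` is trained per label column. `X_toy`, `Y_toy`, and `ovr` are hypothetical names that are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data: 6 samples, 2 count features, 3 binary labels (all values are made up).
X_toy = np.array([[3, 0], [2, 0], [4, 1], [0, 3], [0, 2], [1, 4]])
Y_toy = np.array([
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
])

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs"))
ovr.fit(X_toy, Y_toy)

print(len(ovr.estimators_))        # one fitted LogisticRegression per label column
print(ovr.predict(X_toy).shape)    # one 0/1 decision per sample and label
```

Because each label gets its own independent binary classifier, nothing stops the model from predicting two ages or two signs for the same post.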

In [37]:
# 9. Fit the classifier, make predictions and get the accuracy

clf.fit(X_train,Y_train_trans)



OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [38]:
# Print the following  
# i. Accuracy score
# ii. F1 score  
# iii. Average precision score
# iv. Average recall score 
# v. Tip: Make sure you are familiar with all of them. 
    #   How would you expect the things to work for the multi-label scenario? 
    #   Read about micro/macro/weighted averaging

print("Train Accuracy:",clf.score(X_train,Y_train_trans))

Y_pred = clf.predict(X_test)

print("Test Accuracy:" + str(accuracy_score(Y_test_trans, Y_pred)))
print("F1: " + str(f1_score(Y_test_trans, Y_pred, average='micro')))
print("F1_macro: " + str(f1_score(Y_test_trans, Y_pred, average='macro')))
print("Precision: " + str(precision_score(Y_test_trans, Y_pred, average='micro')))
print("Precision_macro: " + str(precision_score(Y_test_trans, Y_pred, average='macro')))
print("Recall: " + str(recall_score(Y_test_trans, Y_pred, average='micro')))
print("Recall_macro: " + str(recall_score(Y_test_trans, Y_pred, average='macro')))



Train Accuracy: 0.972139303482587
Test Accuracy:0.5686868686868687
F1: 0.7584394023242943
F1_macro: 0.2777544085541704
Precision: 0.8270971635485818
Precision_macro: 0.42504160995159523
Recall: 0.7003065917220235
Recall_macro: 0.2290073113591835
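The gap between the micro and macro scores is easier to interpret on a toy case. In the sketch below (made-up data, two labels), label 0 is common and always predicted, while label 1 is rare and never predicted. Micro averaging pools true positives, false positives, and false negatives over all labels; macro averaging computes F1 per label and takes the unweighted mean. For multi-label input, `accuracy_score` is subset accuracy, so a row counts as correct only when every label matches.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy example: label 0 is common, label 1 is rare and never predicted.
y_true = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

subset_acc = accuracy_score(y_true, y_pred)   # a row counts only if every label matches
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Pooled counts: TP=3, FP=1, FN=1 -> micro F1 = 0.75
# Per label: F1(label 0) = 6/7, F1(label 1) = 0 -> macro F1 = 3/7
print(subset_acc, micro_f1, round(macro_f1, 4))
```

The same pattern appears in the scores above: a few frequent labels predicted well keep micro F1 high, while many rare labels with an F1 near zero pull macro F1 down.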



In [39]:
#10. Print true label and predicted label for any five examples

In [40]:
Y_pred_inv = mlb.inverse_transform(Y_pred)   # map the predicted indicator rows back to label tuples
Y_test_trans_inv =  mlb.inverse_transform(Y_test_trans) # map the true test indicator rows back to label tuples

In [41]:
print("Example 1 - predicted :",Y_pred_inv[0])
print("Example 1 - Actual :",Y_test_trans_inv[0])
print("Example 1 - Actual_before mlb transformation :",Y_test[0])

Example 1 - predicted : ('male', 'aries', '35', 'technology')
Example 1 - Actual : ('aquarius', 'female', '27', 'education')
Example 1 - Actual_before mlb transformation : ['female', '27', 'education', 'aquarius']


In [42]:
print("Example 2 - predicted :",Y_pred_inv[1])
print("Example 2 - Actual :",Y_test_trans_inv[1])
print("Example 2 - Actual_before mlb transformation :",Y_test[1])

Example 2 - predicted : ('male', 'aries', '35', 'technology')
Example 2 - Actual : ('male', 'aries', '35', 'technology')
Example 2 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [43]:
print("Example 3 - predicted :",Y_pred_inv[13])
print("Example 3 - Actual :",Y_test_trans_inv[13])
print("Example 3 - Actual_before mlb transformation :",Y_test[13])

Example 3 - predicted : ('male', 'aries', '35', 'technology')
Example 3 - Actual : ('male', 'aries', '35', 'technology')
Example 3 - Actual_before mlb transformation : ['male', '35', 'technology', 'aries']


In [44]:
print("Example 4 - predicted :",Y_pred_inv[139])
print("Example 4 - Actual :",Y_test_trans_inv[139])
print("Example 4 - Actual_before mlb transformation :",Y_test[139])

Example 4 - predicted : ('male', '15', 'student', 'aries', '35', 'technology')
Example 4 - Actual : ('15', 'student', 'aquarius', 'female')
Example 4 - Actual_before mlb transformation : ['female', '15', 'student', 'aquarius']


In [45]:
print("Example 5 - predicted :",Y_pred_inv[110])
print("Example 5 - Actual :",Y_test_trans_inv[110])
print("Example 5 - Actual_before mlb transformation :",Y_test[110])

Example 5 - predicted : ('male', 'aries', '35', 'technology')
Example 5 - Actual : ('male', 'indunk', '23', 'sagittarius')
Example 5 - Actual_before mlb transformation : ['male', '23', 'indunk', 'sagittarius']
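Since `inverse_transform` returns labels in class order rather than in the original order, comparing the tuples as sets is less error-prone than reading them side by side. The helper below is a hypothetical addition (`compare_labels` is not part of the notebook) and is applied here to the values printed for Example 4.

```python
def compare_labels(predicted, actual):
    """Split the labels of one example into correct, missed, and spurious ones."""
    pred_set, true_set = set(predicted), set(actual)
    return {
        "correct": sorted(pred_set & true_set),    # predicted and true
        "missed": sorted(true_set - pred_set),     # true but not predicted
        "spurious": sorted(pred_set - true_set),   # predicted but not true
    }

# Values copied from Example 4.
predicted = ("male", "15", "student", "aries", "35", "technology")
actual = ("15", "student", "aquarius", "female")

result = compare_labels(predicted, actual)
print(result)
```

For Example 4 this shows that age and topic are correct, gender and sign are missed, and a second age and topic are predicted alongside the correct ones.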
