
**Project Description:**
Classification is probably the most common task you will deal with in practice. Text in the form of blogs, posts, articles, and so on is written every second, and predicting information about a writer without knowing anything about them is a challenge. We are going to create a classifier that predicts multiple attributes of the author of a given text, and we have framed it as a multi-label classification problem.

**Data set info:**
Blog Authorship Corpus: over 600,000 posts from more than 19 thousand bloggers.

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person.

Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger's self-provided gender, age, industry, and astrological sign. (All are labeled for gender and age, but for many, industry and/or sign is marked as unknown.)

All bloggers included in the corpus fall into one of three age groups: 8,240 "10s" blogs (ages 13-17), 8,086 "20s" blogs (ages 23-27), and 2,994 "30s" blogs (ages 33-47).

For each age group, there is an equal number of male and female bloggers. Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions: individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label urllink.

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads
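The per-blogger averages quoted in the description can be checked with a few lines of arithmetic. This is only a sanity check on the published figures; the word count is given as "over 140 million", so it is treated as a lower bound here, and the variable names are illustrative.

```python
# Figures taken from the corpus description
bloggers = 19_320
posts = 681_288
words = 140_000_000  # "over 140 million", so treated here as a lower bound

age_groups = {"10s": 8_240, "20s": 8_086, "30s": 2_994}

posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

# The three age groups should add up to the total number of bloggers
print(sum(age_groups.values()) == bloggers)
print(round(posts_per_blogger, 1))
print(round(words_per_blogger))
```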

**Step 1**
Load the dataset (5 points)

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
import re
import unicodedata
from nltk.stem import WordNetLemmatizer


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [5]:
import numpy as np
import pandas as pd

In [16]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/Statistical NLP/blogtext.csv')

In [17]:
df.shape

(681284, 7)

In [18]:
df.head()

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |


Let us take 5000 rows of the data

In [19]:
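# Take an explicit copy so that adding columns later does not act on a slice of the full frame
df = df.iloc[:5000].copy()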

**Step 2:** Preprocess rows of the “text” column.

a. Remove unwanted characters

b. Convert text to lowercase

c. Remove unwanted spaces

d. Remove stopwords
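As a compact illustration of the four sub-steps, here is a minimal sketch that applies them to a single string. The function name `clean_text` and the tiny `STOP_WORDS` set are hypothetical and chosen only for this example; a real run would use a full stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list used only for illustration
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "i"}

def clean_text(text):
    """Apply steps a-d to a single string."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text)              # a. remove unwanted characters
    text = text.lower()                                   # b. convert to lowercase
    tokens = text.split()                                 # c. split() discards extra spaces
    tokens = [t for t in tokens if t not in STOP_WORDS]   # d. remove stop words
    return " ".join(tokens)

print(clean_text("  The   testing of a  NEW layout!!! "))
```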

In [20]:
df=df.drop(['id','date'],axis=1)

In [21]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
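A quick usage sketch of the same normalization idea, written as a standalone function so it can be run on its own (the name `strip_accents` is illustrative). Note that characters without an ASCII base letter, such as "ø", are dropped entirely rather than transliterated.

```python
import unicodedata

def strip_accents(text):
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors='ignore' then drops the combining mark
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

print(strip_accents("café naïve"))
```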

In [22]:
df['text']

0                  Info has been found (+/- 100 pages,...
1                  These are the team members:   Drewe...
2                  In het kader van kernfusie op aarde...
3                        testing!!!  testing!!!          
4                    Thanks to Yahoo!'s Toolbar I can ...
                              ...                        
4995           So... I had another one of those dreams...
4996           mmm... strawberry tea for breakfast. To...
4997           Yay for a new layout!!  Yeah, I know, I...
4998           Ok, so I lied... Fed up isn't playing F...
4999           well, today I went to church and talked...
Name: text, Length: 5000, dtype: object

In [14]:
df.shape

(5000, 5)

In [23]:
df['text_new']=df['text'].str.lower()

In [24]:
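def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace and, unless remove_digits is set, digits.
    # The range is A-Z: A-z would also match the characters [ \ ] ^ _ `
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text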

In [25]:
def lemmatize_text(text):

    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize(word) for word in text.split()])

In [26]:
def simple_stemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

In [27]:
def normalize_corpus(corpus, html_stripping=True, accented_char_removal=True, text_lower_case=True, 
                     text_lemmatization=True, special_char_removal=True, 
                     stopword_removal=True, remove_digits=True):
    # html_stripping, accented_char_removal and stopword_removal are accepted but not
    # applied in this function; stop words are dropped later by CountVectorizer(stop_words='english')
    normalized_corpus = []
    # normalize each document in the corpus
    for doc in corpus:
        # lowercase the text
        if text_lower_case:
            doc = doc.lower()
        # replace runs of newlines / carriage returns with a single space
        doc = re.sub(r'[\r\n]+', ' ', doc)
        # remove special characters and/or digits
        if special_char_removal:
            # insert spaces around selected punctuation to isolate it before removal
            special_char_pattern = re.compile(r'([{.()!}])')
            doc = special_char_pattern.sub(" \\1 ", doc)
            doc = remove_special_characters(doc, remove_digits=remove_digits)
        # lemmatize text
        if text_lemmatization:
            doc = lemmatize_text(doc)
        # collapse repeated spaces
        doc = re.sub(' +', ' ', doc)

        normalized_corpus.append(doc)

    return normalized_corpus

In [28]:
df['text_new'] =normalize_corpus(df['text_new'])

In [29]:
df.head()

| | gender | age | topic | sign | text | text_new |
|---|---|---|---|---|---|---|
| 0 | male | 15 | Student | Leo | Info has been found (+/- 100 pages,... | info ha been found page and mb of pdf file now... |
| 1 | male | 15 | Student | Leo | These are the team members: Drewe... | these are the team member drewes van der laag ... |
| 2 | male | 15 | Student | Leo | In het kader van kernfusie op aarde... | in het kader van kernfusie op aarde maak je ei... |
| 3 | male | 15 | Student | Leo | testing!!! testing!!! | testing testing |
| 4 | male | 33 | InvestmentBanking | Aquarius | Thanks to Yahoo!'s Toolbar I can ... | thanks to yahoo s toolbar i can now capture th... |


In [30]:
df.shape

(5000, 6)

In [32]:
df.shape

(5000, 6)

**Step 3:**
As this is a multi-label classification problem, merge all the label columns together so that all the labels for a particular text are available in one place.

a. Label columns to merge: “gender”, “age”, “topic”, “sign”

In [33]:
df['label'] =df["gender"].map(str)+ ', ' +df["age"].map(str)+ ', ' +df["topic"].map(str)+ ', ' +df["sign"]

In [34]:
df.head()

| | gender | age | topic | sign | text | text_new | label |
|---|---|---|---|---|---|---|---|
| 0 | male | 15 | Student | Leo | Info has been found (+/- 100 pages,... | info ha been found page and mb of pdf file now... | male, 15, Student, Leo |
| 1 | male | 15 | Student | Leo | These are the team members: Drewe... | these are the team member drewes van der laag ... | male, 15, Student, Leo |
| 2 | male | 15 | Student | Leo | In het kader van kernfusie op aarde... | in het kader van kernfusie op aarde maak je ei... | male, 15, Student, Leo |
| 3 | male | 15 | Student | Leo | testing!!! testing!!! | testing testing | male, 15, Student, Leo |
| 4 | male | 33 | InvestmentBanking | Aquarius | Thanks to Yahoo!'s Toolbar I can ... | thanks to yahoo s toolbar i can now capture th... | male, 33, InvestmentBanking, Aquarius |


In [35]:
data=df[['text_new','label']]

In [37]:
data.head()

| | text_new | label |
|---|---|---|
| 0 | info ha been found page and mb of pdf file now... | male, 15, Student, Leo |
| 1 | these are the team member drewes van der laag ... | male, 15, Student, Leo |
| 2 | in het kader van kernfusie op aarde maak je ei... | male, 15, Student, Leo |
| 3 | testing testing | male, 15, Student, Leo |
| 4 | thanks to yahoo s toolbar i can now capture th... | male, 33, InvestmentBanking, Aquarius |


In [38]:
data['label'].nunique()

96
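The count of 96 refers to distinct combined strings, so a classifier trained on this column treats each gender/age/topic/sign combination as a single class. A possible alternative, sketched here on a toy frame with illustrative values, keeps the four tags as a list per row, which is the input shape a multi-label binarizer expects. The names `toy`, `label_cols` and `labels` are hypothetical.

```python
import pandas as pd

# Toy frame with the same four label columns as the notebook (values are illustrative)
toy = pd.DataFrame({
    "gender": ["male", "male"],
    "age": [15, 33],
    "topic": ["Student", "InvestmentBanking"],
    "sign": ["Leo", "Aquarius"],
})

label_cols = ["gender", "age", "topic", "sign"]
# One list of four string tags per row, instead of one joined string
labels = [list(row) for row in toy[label_cols].astype(str).itertuples(index=False)]

print(labels)
```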

In [39]:
data.astype

<bound method NDFrame.astype of                                                text_new                                  label
0     info ha been found page and mb of pdf file now...                 male, 15, Student, Leo
1     these are the team member drewes van der laag ...                 male, 15, Student, Leo
2     in het kader van kernfusie op aarde maak je ei...                 male, 15, Student, Leo
3                                       testing testing                 male, 15, Student, Leo
4     thanks to yahoo s toolbar i can now capture th...  male, 33, InvestmentBanking, Aquarius
...                                                 ...                                    ...
4995  so i had another one of those dream last night...            female, 17, indUnk, Scorpio
4996  mmm strawberry tea for breakfast tomorrow i th...            female, 17, indUnk, Scorpio
4997  yay for a new layout yeah i know i need to get...            female, 17, indUnk, Scorpio
4998  ok so i lied

**Step 4:**
Separate features and labels, and split the data into training and testing

In [40]:
X=data['text_new']
Y=data['label']
print(X.shape)
print(Y.shape)

(5000,)
(5000,)


In [41]:
from sklearn.model_selection import train_test_split

In [42]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,random_state=2)

In [43]:
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(3750,) (3750,)
(1250,) (1250,)


**Step 5:**
Vectorizing the features.

a. Create a Bag of Words using count vectorizer

i. Use ngram_range=(1, 2)

ii. Vectorize training and testing features

b. Print the term-document matrix

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1, 2),stop_words='english')

In [45]:
cv.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 2), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [46]:
len(cv.vocabulary_)

228837

In [47]:
x_train_dtm=cv.transform(X_train)

In [48]:
x_test_dtm=cv.transform(X_test)

In [49]:
x_train_dtm[0]

<1x228837 sparse matrix of type '<class 'numpy.int64'>'
	with 104 stored elements in Compressed Sparse Row format>

In [50]:
x_test_dtm[0]

<1x228837 sparse matrix of type '<class 'numpy.int64'>'
	with 12 stored elements in Compressed Sparse Row format>

In [51]:
print(x_train_dtm.shape)
print(x_test_dtm.shape)

(3750, 228837)
(1250, 228837)


In [52]:
x_train_dtm

<3750x228837 sparse matrix of type '<class 'numpy.int64'>'
	with 437929 stored elements in Compressed Sparse Row format>

In [53]:
x_test_dtm

<1250x228837 sparse matrix of type '<class 'numpy.int64'>'
	with 85708 stored elements in Compressed Sparse Row format>

**Step 7:**
Transform the labels. As noted before, each example in this task can have multiple tags. To handle this kind of prediction, we need to transform the labels into a binary form, and the prediction will be a mask of 0s and 1s. For this purpose, it is convenient to use MultiLabelBinarizer from sklearn.

a. Convert your train and test labels using MultiLabelBinarizer 
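As a small illustration of the 0/1 mask described above, the sketch below binarizes two toy rows whose tags mirror the first rows of the data. It assumes each row is already a list of tags; the names `train_labels`, `toy_mlb` and `mask` are illustrative.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each row is a list of tags (toy values mirroring the first rows of the data)
train_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
]

toy_mlb = MultiLabelBinarizer()
mask = toy_mlb.fit_transform(train_labels)

print(toy_mlb.classes_)   # one column per distinct tag, in sorted order
print(mask)               # one row per example, 1 where the tag is present
# inverse_transform maps a 0/1 mask back to tuples of tags
print(toy_mlb.inverse_transform(mask))
```

Passing a plain string instead of a list of tags makes the binarizer iterate over the string's characters, which is why the input has to be an iterable of tag collections.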

In [54]:
from sklearn.preprocessing import MultiLabelBinarizer
# Split each combined label string into its individual tags, so that every row
# is a list of labels rather than one string
mlb = MultiLabelBinarizer()
Y_train_bin = mlb.fit_transform(Y_train.str.split(', '))
# Tags that occur only in the test rows are ignored (with a warning) by transform
Y_test_bin = mlb.transform(Y_test.str.split(', '))

In [55]:
x_train_dtm.shape

(3750, 228837)

In [56]:
print(Y_train.shape)
print(Y_test.shape)

(3750,)
(1250,)


**Step 8:**
In this task, we suggest using the One-vs-Rest approach, which is implemented in the OneVsRestClassifier class. In this approach, k classifiers (one per tag) are trained. As the base classifier, use LogisticRegression. It is one of the simplest methods, but it often performs well enough in text classification tasks. Training might take some time because the number of classifiers is large.

a. Use a linear classifier of your choice, wrap it up in OneVsRestClassifier to train it on  every label 
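A minimal sketch of the One-vs-Rest setup on a toy corpus, assuming the targets have already been binarized into a 0/1 indicator matrix. The texts, tags and variable names are made up for illustration; the point is that one LogisticRegression is fitted per tag column.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus and tag lists (illustrative only)
texts = ["python code bug", "java code compile", "football match goal",
         "tennis match point", "python football script", "java tennis app"]
tags = [["tech"], ["tech"], ["sport"], ["sport"], ["tech", "sport"], ["tech", "sport"]]

X_toy = CountVectorizer().fit_transform(texts)
tag_binarizer = MultiLabelBinarizer()
Y_toy = tag_binarizer.fit_transform(tags)   # shape: (6 documents, 2 tags)

ovr = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=2000))
ovr.fit(X_toy, Y_toy)

print(len(ovr.estimators_))       # one fitted LogisticRegression per tag
print(ovr.predict(X_toy).shape)   # one 0/1 column per tag
```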

In [57]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='lbfgs',max_iter=2000)
clf1vR = OneVsRestClassifier(clf)

**Step 9:**
Fit the classifier, make predictions and get the accuracy

a. Print the following

i. Accuracy score

ii. F1 score

iii. Average precision score

iv. Average recall score

Tip: Make sure you are familiar with all of them. How would you expect them to work in the multi-label scenario? Read about micro/macro/weighted averaging.
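To make the averaging options concrete, here is a sketch that computes the requested scores on small hand-made indicator matrices. It reads "average precision/recall" as `precision_score` and `recall_score` with an `average` argument, which is an assumption about the task's intent; the arrays are illustrative, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hand-made indicator matrices: 4 samples, 3 labels (illustrative only)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]])

# For multi-label input, accuracy_score is subset accuracy:
# a sample counts only if every one of its labels matches
subset_accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", subset_accuracy)

scores = {}
for avg in ("micro", "macro", "weighted"):
    scores[avg] = {
        "f1": f1_score(y_true, y_pred, average=avg),
        "precision": precision_score(y_true, y_pred, average=avg),
        "recall": recall_score(y_true, y_pred, average=avg),
    }
    print(avg, scores[avg])
```

Micro averaging pools true/false positives over all labels before computing the score, macro averaging takes the unweighted mean of per-label scores, and weighted averaging weights each label's score by its support.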

In [59]:
clf1vR.fit(x_train_dtm,Y_train) # Fitting on  train data

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=2000,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [62]:
predicted=clf1vR.predict(x_test_dtm)

In [63]:
from sklearn.metrics import accuracy_score
print( "Accuracy Score: ",accuracy_score(Y_test, predicted))

Accuracy Score:  0.6928


**Step 10:**
Print true label and predicted label for any five examples

In [64]:
print(Y_train,predicted)

4715     male, 25, Technology, Aries
3576     male, 35, Technology, Aries
4996     female, 17, indUnk, Scorpio
2556     male, 35, Technology, Aries
611     male, 24, Engineering, Libra
                    ...             
3335     male, 35, Technology, Aries
1099      female, 15, Student, Libra
2514     male, 35, Technology, Aries
3606     male, 35, Technology, Aries
2575     male, 35, Technology, Aries
Name: label, Length: 3750, dtype: object ['male, 35, Technology, Aries' 'female, 34, indUnk, Sagittarius'
 'male, 35, Technology, Aries' ... 'male, 35, Technology, Aries'
 'male, 39, Communications-Media, Libra' 'female, 34, indUnk, Sagittarius']
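Since the predictions are made on the test split, they are only comparable with the held-out labels, matched by position rather than by index value. The sketch below shows one way to list five such pairs; `pair_examples`, `toy_true` and `toy_pred` are hypothetical stand-ins for a helper, `Y_test` and `predicted`.

```python
import numpy as np
import pandas as pd

def pair_examples(y_true, y_pred, n=5):
    """Return the first n (true, predicted) pairs, aligned by position."""
    return list(zip(list(y_true)[:n], list(y_pred)[:n]))

# Hypothetical held-out labels and predictions, standing in for Y_test and predicted
toy_true = pd.Series(["a", "b", "a", "c", "b", "a"], index=[40, 7, 13, 2, 99, 5])
toy_pred = np.array(["a", "b", "b", "c", "b", "c"])

for true_label, predicted_label in pair_examples(toy_true, toy_pred):
    print(f"true: {true_label!s:<5} predicted: {predicted_label!s}")
```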


**Learnings / Conclusions:**
The model was also run with 25k, 30k, and 60k samples, and each time it overfit, as in the result above.
Lemmatization was added as an extra preprocessing step, but it did not improve the model's generalisation.
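One way to make an overfitting claim measurable is to compare accuracy on the training split with accuracy on the test split. The sketch below does this on synthetic data, which stands in for the vectorized blog posts purely as an assumption; the variable names are illustrative and the numbers it prints say nothing about the blog corpus.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the vectorized blog posts (assumption)
X_syn, y_syn = make_classification(n_samples=400, n_features=50, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=2)

model = LogisticRegression(solver='lbfgs', max_iter=2000).fit(X_tr, y_tr)

train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
# A large positive gap (train much higher than test) is the usual sign of overfitting
print(f"train={train_acc:.3f} test={test_acc:.3f} gap={train_acc - test_acc:.3f}")
```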