<font size="4"> <b> • DOMAIN: </b>Digital content management</font>

<font size="4"> <b> • CONTEXT: </b>Classification is probably the most common task you will deal with in real life. Text in the form of blogs, posts, articles,
etc. is written every second. It is a challenge to predict information about a writer without knowing anything about them. We are going to create a classifier that predicts multiple features of the author of a given text. We have framed it as a multi-label classification problem.

<font size="4"> <b> • DATA DESCRIPTION: </b>Over 600,000 posts from more than 19 thousand bloggers. The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7250 words per person. Each blog is presented as a separate file, the name of which indicates a blogger id# and the blogger’s self-provided gender, age, industry, and astrological sign. (All are labelled for gender and age, but for many, industry and/or sign is marked as unknown.)
    
All bloggers included in the corpus fall into one of three age groups:
    
• 8240 "10s" blogs (ages 13-17),
• 8086 "20s" blogs (ages 23-27) and
• 2994 "30s" blogs (ages 33-47)
    
For each age group, there is an equal number of male and female bloggers.
Each blog in the corpus includes at least 200 occurrences of common English words. All formatting has been stripped with two exceptions.
    
Individual posts within a single blogger are separated by the date of the following post, and links within a post are denoted by the label urllink.
Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus

<font size="4"> <b> • PROJECT OBJECTIVE: </b>The need is to build an NLP classifier that can use the input text to determine the label(s) of the blog.

<b>Steps and tasks:</b>
    
1. Import and analyse the data set. 
2. Perform data pre-processing on the data:
    
>Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase.
    
>Target/label merger and transformation
    
>Train and test split
  
>Vectorisation, etc.
    
3. Design, train, tune and test the best text classifier
    
&nbsp;
4. Display and explain in detail the classification report
    
&nbsp;
5. Print the true vs predicted labels for any 5 entries from the dataset.
    
<b>Hint: The aim here is to import the text and process it in such a way that it can be taken as an input to the ML/NN classifiers. Be analytical and experimental here in trying new
    approaches to design the best model.</b>
    
</font>

<font size="5"><p style="color:black"> <b>1. Import and analyse the data set. </p></font>

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">1.1 Importing dataset and libraries

In [1]:
import os
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

In [2]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sathya99\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
df = pd.read_csv('Dataset+-+blogtext.csv',nrows=100000)
df.head(20)

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o..."
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...


In [4]:
df.shape

(100000, 7)

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">1.2 Checking for duplicates

In [5]:
dupes = df.duplicated()
sum(dupes)

836
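
The 836 duplicated rows are counted here, but the cells that follow keep working with all 100,000 rows. As a minimal sketch of how the duplicates could be dropped, here is the same check on a small hypothetical frame (`toy` is illustrative data, not part of the corpus):

```python
import pandas as pd

# Hypothetical toy frame standing in for the blog data
toy = pd.DataFrame({
    "id": [1, 1, 2],
    "text": ["hello world", "hello world", "another post"],
})

# duplicated() flags rows that are identical to an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```

Applied to the notebook, the equivalent call would be `df = df.drop_duplicates()`; whether that is desirable depends on whether repeated posts are considered noise.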

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">1.3 Checking for Missing values

In [6]:
def missing_check(df):
    total = df.isnull().sum().sort_values(ascending=False)   # total number of null values
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)  # fraction of values that are null (0.0 to 1.0)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])  # putting the above two together
    return missing_data # return the dataframe
missing_check(df)

Unnamed: 0,Total,Percent
id,0,0.0
gender,0,0.0
age,0,0.0
topic,0,0.0
sign,0,0.0
date,0,0.0
text,0,0.0


<font size="5"><p style="color:black"> <b> 2. Perform data pre-processing on the data:</p></font>

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">2.1 Data cleansing by removing unwanted characters, spaces, stop words etc. Convert text to lowercase.

**Using Regular expressions to cleanse data to make it suitable for further modelling**

In [7]:
df['cleaned_text']=df['text'].apply(lambda x: re.sub(r'[^A-Za-z]+',' ',x))

In [8]:
df['cleaned_text']=df['cleaned_text'].apply(lambda x: x.lower())

In [9]:
df['cleaned_text']=df['cleaned_text'].apply(lambda x: x.strip())
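
The three cleaning steps above (replace non-letters, lowercase, trim) can also be folded into one helper. This is only a sketch; `clean_text` is a hypothetical name and the sample string is made up for illustration:

```python
import re

def clean_text(text: str) -> str:
    """Replace every run of non-letters with one space, lowercase, and trim."""
    text = re.sub(r'[^A-Za-z]+', ' ', text)
    return text.lower().strip()

sample = "  Korea's history: invaded 700 times!  "
print(clean_text(sample))
```

Note that apostrophes are treated as non-letters, so "Korea's" becomes "korea s". The stray single letters survive this step unless they are later removed as stop words.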

**Actual Text**

In [10]:
print("Actual text:\n\n", df['text'][7])

Actual text:

              If anything, Korea is a country of extremes.  Everything here seems fad-based.  I think it may come from Korea's history.  It has been invaded a reported 700 times over the years, and each time they got independence I imagine they had to move quickly to get to the next level before the next war or occupation.  Lately (well, not really lately...in 1945) the Japanese Occupation ended.  Then the Korean War occurred from 1950-3.  After that there was turmoil, but in 1961 Park Chung Hee took over as dictator/president.  He had elections, in which everyone was 'encouraged' to vote, but he was still a dictator.  After his assassination in 1979 the next few leaders were basically of the same ilk.  President Park did some amazing things in his time, however.  He took an incredibly backward country and set it on the road to industrialization. Japan had stripped Korea of its resources, people and even its language and culture (many buildings and palaces were razed and 

**Text after Data wrangling**

In [11]:
print("Text after Data wrangling:\n\n", df['cleaned_text'][7])

Text after Data wrangling:

 if anything korea is a country of extremes everything here seems fad based i think it may come from korea s history it has been invaded a reported times over the years and each time they got independence i imagine they had to move quickly to get to the next level before the next war or occupation lately well not really lately in the japanese occupation ended then the korean war occurred from after that there was turmoil but in park chung hee took over as dictator president he had elections in which everyone was encouraged to vote but he was still a dictator after his assassination in the next few leaders were basically of the same ilk president park did some amazing things in his time however he took an incredibly backward country and set it on the road to industrialization japan had stripped korea of its resources people and even its language and culture many buildings and palaces were razed and japanese was the official language here from but president pa

**Remove all stop words**

In [12]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # separate name so the imported stopwords module is not shadowed

In [13]:
df['cleaned_text']=df['cleaned_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [14]:
df['cleaned_text'][7]

'anything korea country extremes everything seems fad based think may come korea history invaded reported times years time got independence imagine move quickly get next level next war occupation lately well really lately japanese occupation ended korean war occurred turmoil park chung hee took dictator president elections everyone encouraged vote still dictator assassination next leaders basically ilk president park amazing things time however took incredibly backward country set road industrialization japan stripped korea resources people even language culture many buildings palaces razed japanese official language president park determined change orchestrated han river miracle han river hangang main river seoul korea korea made terrific strides expense civil liberties fastforward present point see korea world wired nation canada finland way beyond u craze pc pc bangs rooms everywhere country well instead playstation like games players go computer one two people korean gamers always 

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">2.2 Target/label merger and transformation

In [15]:
df['labels']=df.apply(lambda row: [row['gender'],str(row['age']),row['topic'],row['sign']], axis=1)  # axis=1 passes one row at a time


In [16]:
df.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,cleaned_text,labels
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"


In [17]:
df1 = df.drop(['id','date'],axis=1)
df1

Unnamed: 0,gender,age,topic,sign,text,cleaned_text,labels
0,male,15,Student,Leo,"Info has been found (+/- 100 pages,...",info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,male,15,Student,Leo,These are the team members: Drewe...,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,male,15,Student,Leo,In het kader van kernfusie op aarde...,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,male,15,Student,Leo,testing!!! testing!!!,testing testing,"[male, 15, Student, Leo]"
4,male,33,InvestmentBanking,Aquarius,Thanks to Yahoo!'s Toolbar I can ...,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"
...,...,...,...,...,...,...,...
99995,male,27,Student,Virgo,THE HINDU - 125 YEARS ...,hindu years great see special edition hindu co...,"[male, 27, Student, Virgo]"
99996,male,27,Student,Virgo,DILBERT & IIT-ans ...,dilbert iit ans global iit brand finds space u...,"[male, 27, Student, Virgo]"
99997,male,27,Student,Virgo,Case Study : How HP won $3 billion...,case study hp billion p g outsourcing deal bea...,"[male, 27, Student, Virgo]"
99998,male,27,Student,Virgo,Championing Chennai ...,championing chennai bangalore iim hyderabad ho...,"[male, 27, Student, Virgo]"


In [18]:
df2=df1[['cleaned_text','labels']]

In [19]:
df2

Unnamed: 0,cleaned_text,labels
0,info found pages mb pdf files wait untill team...,"[male, 15, Student, Leo]"
1,team members drewes van der laag urllink mail ...,"[male, 15, Student, Leo]"
2,het kader van kernfusie op aarde maak je eigen...,"[male, 15, Student, Leo]"
3,testing testing,"[male, 15, Student, Leo]"
4,thanks yahoo toolbar capture urls popups means...,"[male, 33, InvestmentBanking, Aquarius]"
...,...,...
99995,hindu years great see special edition hindu co...,"[male, 27, Student, Virgo]"
99996,dilbert iit ans global iit brand finds space u...,"[male, 27, Student, Virgo]"
99997,case study hp billion p g outsourcing deal bea...,"[male, 27, Student, Virgo]"
99998,championing chennai bangalore iim hyderabad ho...,"[male, 27, Student, Virgo]"


<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">2.3 Train and test split

In [20]:
x=df2['cleaned_text']
y=df2['labels']

In [21]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3, train_size = 0.7, random_state =12)

In [22]:
x_train.shape, x_test.shape,y_train.shape,y_test.shape

((70000,), (30000,), (70000,), (30000,))

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">2.4 Vectorisation, MultiLabel Binarizer etc.

**Vectorizer**

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(ngram_range=(1,2))

**Vocabulary size (distinct unigrams and bigrams) learned from the training set**

In [24]:
vectorizer.fit(x_train)
len(vectorizer.vocabulary_)

3738224
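
With `ngram_range=(1,2)` the fitted vocabulary holds every distinct unigram and bigram in the training text, which is how it reaches 3,738,224 entries. The sketch below shows the effect on two hypothetical sentences. The `min_df=2` setting is an assumption added for illustration (the notebook does not use it) and is one possible way to shrink the feature space:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran"]  # hypothetical toy corpus

# Unigrams and bigrams, no pruning (same setting as the notebook)
full = CountVectorizer(ngram_range=(1, 2)).fit(docs)

# Assumption: keep only terms that occur in at least 2 documents
pruned = CountVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)

full_terms = sorted(full.vocabulary_)
pruned_terms = sorted(pruned.vocabulary_)
print(full_terms)
print(pruned_terms)
```

Most bigrams in a large corpus occur only once, so a document-frequency threshold tends to remove a large share of the columns.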

In [25]:
vectorizer.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() on newer versions

['aa',
 'aa aa',
 'aa alexisonfire',
 'aa amazing',
 'aa anyway',
 'aa back',
 'aa batteries',
 'aa beautiful',
 'aa brown',
 'aa button',
 'aa candle',
 'aa charge',
 'aa class',
 'aa coming',
 'aa compared',
 'aa damn',
 'aa done',
 'aa eating',
 'aa ended',
 'aa enough',
 'aa eriol',
 'aa fc',
 'aa flights',
 'aa forms',
 'aa gai',
 'aa gaye',
 'aa going',
 'aa great',
 'aa grins',
 'aa haha',
 'aa hey',
 'aa htm',
 'aa hyper',
 'aa jaeyin',
 'aa join',
 'aa keeps',
 'aa kenyan',
 'aa kk',
 'aa knows',
 'aa lets',
 'aa like',
 'aa lizzy',
 'aa lovedocmartens',
 'aa man',
 'aa meeting',
 'aa meetings',
 'aa milne',
 'aa months',
 'aa motoring',
 'aa motw',
 'aa much',
 'aa nbsp',
 'aa ncaa',
 'aa need',
 'aa nothing',
 'aa one',
 'aa page',
 'aa people',
 'aa players',
 'aa pona',
 'aa process',
 'aa raha',
 'aa restaurant',
 'aa right',
 'aa screed',
 'aa sd',
 'aa snake',
 'aa soon',
 'aa southpaw',
 'aa species',
 'aa st',
 'aa sudden',
 'aa suppose',
 'aa tax',
 'aa think',
 'aa 

In [26]:
x_train=vectorizer.transform(x_train)

In [27]:
x_train.shape

(70000, 3738224)

In [28]:
x_train

<70000x3738224 sparse matrix of type '<class 'numpy.int64'>'
	with 11753157 stored elements in Compressed Sparse Row format>

In [29]:
x_test=vectorizer.transform(x_test)

In [30]:
x_test.shape

(30000, 3738224)

In [31]:
x_test

<30000x3738224 sparse matrix of type '<class 'numpy.int64'>'
	with 3787222 stored elements in Compressed Sparse Row format>

In [32]:
vectorizer.get_feature_names()[7:12]

['aa beautiful', 'aa brown', 'aa button', 'aa candle', 'aa charge']

**Counting label occurrences in the dataset using a loop**

In [33]:
label_counts=dict()

for labels in df2.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label]+=1
        else:
            label_counts[label]=1


In [34]:
label_counts

{'male': 53358,
 '15': 6532,
 'Student': 22122,
 'Leo': 8230,
 '33': 2835,
 'InvestmentBanking': 244,
 'Aquarius': 9050,
 'female': 46642,
 '14': 3540,
 'indUnk': 33097,
 'Aries': 10637,
 '25': 8660,
 'Capricorn': 8723,
 '17': 12755,
 'Gemini': 9225,
 '23': 10757,
 'Non-Profit': 1326,
 'Cancer': 9253,
 'Banking': 354,
 '37': 863,
 'Sagittarius': 7366,
 '26': 8059,
 '24': 11814,
 'Scorpio': 7049,
 '27': 8007,
 'Education': 5553,
 '45': 906,
 'Engineering': 2332,
 'Libra': 7250,
 'Science': 1090,
 '34': 2388,
 '41': 772,
 'Communications-Media': 2830,
 'BusinessServices': 626,
 'Sports-Recreation': 406,
 'Virgo': 7134,
 'Taurus': 8530,
 'Arts': 5031,
 'Pisces': 7553,
 '44': 76,
 '16': 8406,
 'Internet': 2251,
 'Museums-Libraries': 308,
 'Accounting': 528,
 '39': 568,
 '35': 4720,
 'Technology': 8484,
 '36': 3045,
 'Law': 360,
 '46': 914,
 'Consulting': 905,
 'Automotive': 124,
 '42': 156,
 'Religion': 1081,
 '13': 1497,
 'Fashion': 1898,
 '38': 801,
 '43': 505,
 'Publishing': 1079,
 '40'
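
The same tally can be produced with the standard library's `collections.Counter`. This is a sketch on hypothetical label lists shaped like `df2['labels']`, not the notebook's actual data:

```python
from collections import Counter
from itertools import chain

# Hypothetical label lists in the same shape as df2['labels']
labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
    ["female", "15", "Student", "Aries"],
]

# Flatten the list of lists and count each label once per occurrence
label_counts = Counter(chain.from_iterable(labels))

print(label_counts["male"], label_counts["Student"], len(label_counts))
```

In the notebook this would read `Counter(chain.from_iterable(df2['labels']))`, and `label_counts.most_common()` would list the labels from most to least frequent.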

**MultiLabel Binarizer**

In [35]:
from sklearn.preprocessing import MultiLabelBinarizer
binarizer=MultiLabelBinarizer(classes=sorted(label_counts.keys()))

In [36]:
y_train = binarizer.fit_transform(y_train)
y_test = binarizer.transform(y_test)

In [37]:
x_train.shape

(70000, 3738224)

In [38]:
x_test.shape

(30000, 3738224)

In [39]:
y_train.shape

(70000, 80)

In [40]:
y_test.shape

(30000, 80)
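
Each of the 80 columns corresponds to one distinct label, and a row has a 1 in the column of every label attached to that post. A minimal sketch of the encoding and its inverse, using hypothetical labels:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label lists (three labels per post for brevity)
train_labels = [["male", "15", "Leo"], ["female", "33", "Aries"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(train_labels)

classes = list(mlb.classes_)              # column order: sorted labels
decoded = mlb.inverse_transform(encoded)  # tuples of labels per row

print(classes)
print(encoded.tolist())
print(decoded)
```

Because the columns follow the sorted class order, `inverse_transform` returns labels in that order rather than in the original gender, age, topic, sign order.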

<font size="5"><p style="color:black"> <b>3. Design, train, tune and test the best text classifier </p></font> 

<span style="font-family: Arial; font-weight:bold;font-size:1.3em;color:#00b3e5;">3.1 Logistic Regression Classifier (LRC)

In [41]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
model=LogisticRegression(solver='lbfgs', max_iter=1000)
model=OneVsRestClassifier(model)
model.fit(x_train,y_train)

OneVsRestClassifier(estimator=LogisticRegression(max_iter=1000))
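
`OneVsRestClassifier` fits one independent binary classifier per label column. The sketch below illustrates this on a tiny hypothetical dataset (the texts and labels are made up) and is not a reproduction of the notebook's model:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: two "Student" posts and two "Banking" posts
texts = [
    "school homework exam",
    "exam school teacher",
    "stocks market bank",
    "bank market trading",
]
labels = [["Student", "15"], ["Student", "15"], ["Banking", "33"], ["Banking", "33"]]

X = CountVectorizer().fit_transform(texts)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One logistic regression is trained for each of the 4 label columns
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
proba = clf.predict_proba(X)  # one probability per sample and label

print(len(clf.estimators_), proba.shape)
```

In the notebook, `y_train` has 80 columns, so 80 binary models are fitted. Since each one decides independently, a prediction can contain fewer than four labels, or more than one label of the same kind.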

In [42]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import classification_report
from sklearn import metrics

prediction_LR_train = model.predict(x_train)
prediction_LR_test = model.predict(x_test)

LRtrain = metrics.accuracy_score(y_train, prediction_LR_train)
LRtest = metrics.accuracy_score(y_test, prediction_LR_test)

LR_precision_train = precision_score(y_train, prediction_LR_train,average='micro')
LR_recall_train = recall_score(y_train, prediction_LR_train,average='micro')

LR_precision_test = precision_score(y_test, prediction_LR_test,average='micro')
LR_recall_test = recall_score(y_test, prediction_LR_test,average='micro')

LR_F1 = 2 * (LR_precision_test * LR_recall_test) / (LR_precision_test + LR_recall_test)

<font size="5"><p style="color:black"> <b>4. Display and explain in detail the classification report </p></font> 

In [44]:
resultsDf = pd.DataFrame({'Method':['Logistic Regression'], 'accuracy_Train': [LRtrain],'accuracy_Test': [LRtest],'Precision_Train': [LR_precision_train],'Precision_Test':[LR_precision_test],'Recall_Train':[LR_recall_train],'Recall_Test':[LR_recall_test], 'F1-Score':[LR_F1]})
resultsDf = resultsDf[['Method', 'accuracy_Train','accuracy_Test','Precision_Train','Precision_Test','Recall_Train','Recall_Test','F1-Score']]
resultsDf

Unnamed: 0,Method,accuracy_Train,accuracy_Test,Precision_Train,Precision_Test,Recall_Train,Recall_Test,F1-Score
0,Logistic Regression,0.907514,0.1073,0.996253,0.710154,0.944,0.359492,0.477344
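
The notebook imports `classification_report` but the cells shown never call it. As a hedged sketch of what a per-label report looks like for multi-label targets, here is the call on small hypothetical indicator matrices (the label names and values are made up):

```python
import numpy as np
from sklearn.metrics import classification_report

classes = ["15", "33", "female", "male"]  # hypothetical label set

# Rows are posts, columns are labels (1 = label present)
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 1, 0],
                   [1, 0, 1, 0]])
y_pred = np.array([[1, 0, 0, 1],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]])

# zero_division=0 reports 0 instead of warning when a label is never predicted
report = classification_report(y_true, y_pred, target_names=classes, zero_division=0)
report_dict = classification_report(
    y_true, y_pred, target_names=classes, zero_division=0, output_dict=True
)
print(report)
```

For the notebook's model, the corresponding call would presumably be `classification_report(y_test, prediction_LR_test, target_names=binarizer.classes_, zero_division=0)`, which gives precision, recall, and F1 for each of the 80 labels.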


<font size="5"><p style="color:black"> <b>5. Print the true vs predicted labels for any 5 entries from the dataset. </p></font> 

In [74]:
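import random

def print_predicted(y_predicted=prediction_LR_test, y_test=y_test, n=5):
    # randrange excludes the upper bound, so every index is valid
    # (randint(0, len(y_test)) could return len(y_test) and raise an IndexError)
    j = [random.randrange(len(y_test)) for _ in range(n)]
    print(j)

    # decode the indicator matrices once instead of on every iteration
    true_labels = binarizer.inverse_transform(y_test)
    predicted_labels = binarizer.inverse_transform(y_predicted)

    for k in j:
        print('ORIGINAL:', true_labels[k])
        print('PREDICTED:', predicted_labels[k])
        print("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")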

In [75]:
print_predicted(y_predicted=prediction_LR_test,y_test=y_test, n= 5)

[11272, 21730, 19155, 24704, 9145]
ORIGINAL: ('16', 'Leo', 'Student', 'male')
PREDICTED: ('male',)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ORIGINAL: ('23', 'Sagittarius', 'indUnk', 'male')
PREDICTED: ('female',)
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ORIGINAL: ('34', 'Aquarius', 'Education', 'male')
PREDICTED: ('34', 'Aquarius', 'Education', 'male')
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ORIGINAL: ('27', 'Aquarius', 'Technology', 'female')
PREDICTED: ('27', 'Aquarius', 'female')
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ORIGINAL: ('13', 'Aries', 'female', 'indUnk')
PREDICTED: ('female', 'indUnk')
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
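
To see how the exact-match accuracy and the micro-averaged scores relate, the five pairs printed above can be scored by hand. This is a worked illustration on those five entries only, not a re-evaluation of the model:

```python
# The five ORIGINAL / PREDICTED pairs printed above, as sets of labels
pairs = [
    ({'16', 'Leo', 'Student', 'male'}, {'male'}),
    ({'23', 'Sagittarius', 'indUnk', 'male'}, {'female'}),
    ({'34', 'Aquarius', 'Education', 'male'}, {'34', 'Aquarius', 'Education', 'male'}),
    ({'27', 'Aquarius', 'Technology', 'female'}, {'27', 'Aquarius', 'female'}),
    ({'13', 'Aries', 'female', 'indUnk'}, {'female', 'indUnk'}),
]

exact = sum(true == pred for true, pred in pairs)    # fully correct posts
tp = sum(len(true & pred) for true, pred in pairs)   # correctly predicted labels
n_pred = sum(len(pred) for _, pred in pairs)         # all predicted labels
n_true = sum(len(true) for true, _ in pairs)         # all true labels

subset_accuracy = exact / len(pairs)
micro_precision = tp / n_pred
micro_recall = tp / n_true
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(subset_accuracy, round(micro_precision, 3), micro_recall, round(micro_f1, 3))
```

Only one of the five posts has every label right, so exact-match accuracy is 0.2, while 10 of the 11 predicted labels are correct. The same pattern (few complete matches, mostly correct individual labels, many labels missed) is what the full test-set scores show.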


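### The Logistic Regression model above reaches 10.73% accuracy, 71% precision, and 36% recall (micro-averaged) on the test data. For multi-label targets, `accuracy_score` counts a prediction as correct only when every label of a post matches, which is why accuracy is far lower than precision. The gap between training accuracy (90.75%) and test accuracy (10.73%) points to overfitting, so the model is best treated as a baseline that can be improved with more data and tuning before it is used in production.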