## Project - Statistical NLP

Link to dataset: https://www.kaggle.com/rtatman/blog-authorship-corpus/downloads/blog-authorship-corpus.zip/2at

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [0]:
# Import the os module
import os  
    
# Get the current working directory (CWD). 
cwd = os.getcwd()  
    
# Print the current working directory (CWD).
print("Current working directory:", cwd)  

Current working directory: /content


In [0]:
!ls

drive  sample_data


In [0]:
os.chdir('/content/drive/My Drive/Blog-Authorship-Corpus/')

In [0]:
project_path = "/content/drive/My Drive/Blog-Authorship-Corpus/"

### Load the contents of the zip file

In [0]:
from zipfile import ZipFile
with ZipFile(project_path+'blog-authorship-corpus.zip', 'r') as z:
  z.extractall()

In [0]:
!ls

blog-authorship-corpus.zip  blogtext.csv


In [0]:
import pandas as pd
df = pd.read_csv("blogtext.csv")

In [0]:
df.columns

Index(['id', 'gender', 'age', 'topic', 'sign', 'date', 'text'], dtype='object')

In [0]:
df.head(8)

| | id | gender | age | topic | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |
| 3 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | testing!!! testing!!! |
| 4 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 11,June,2004 | Thanks to Yahoo!'s Toolbar I can ... |
| 5 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | I had an interesting conversation... |
| 6 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | Somehow Coca-Cola has a way of su... |
| 7 | 3581210 | male | 33 | InvestmentBanking | Aquarius | 10,June,2004 | If anything, Korea is a country o... |


In [0]:
df.isnull().sum()

id        0
gender    0
age       0
topic     0
sign      0
date      0
text      0
dtype: int64

In [0]:
df.shape

(681284, 7)

## As the dataset is large, use fewer rows.

In [0]:
# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
df = df.head(4000).copy()

## 2) Preprocess rows of the “text” column
Apply the following steps to each value of the text column:

- Remove unwanted characters
- Convert text to lowercase
- Remove unwanted spaces
- Remove stopwords

In [0]:
# Keep only alphabetic characters; every other run of characters becomes a single space
import re
df.text = df.text.apply(lambda x: re.sub('[^A-Za-z]+', ' ', x))

# Convert text to lowercase
df.text = df.text.apply(lambda x: x.lower())

# Strip leading and trailing spaces
df.text = df.text.apply(lambda x: x.strip())

# Remove stopwords
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# Use a separate name so the imported `stopwords` corpus is not shadowed
stop_words = set(stopwords.words('english'))
df.text = df.text.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
df.text[6]

'somehow coca cola way summing things well early flagship jingle like buy world coke tune like teach world sing pretty much summed post woodstock era well add much sales catchy tune korea coke theme urllink stop thinking feel pretty much sums lot korea koreans look relaxed couple stopped thinking started feeling course high regard education math logic deep think many koreans really like work emotion anything else westerners seem sublimate moreso least display different way maybe scratch westerners koreans probably pretty similar context different anyways think losing korea repeat stop thinking feel stop thinking feel stop thinking feel everything alright'
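The four preprocessing steps can also be bundled into one function, which makes them easier to check on a single string. This is a minimal sketch, not the notebook's own code: `clean_text`, `DEMO_STOPWORDS`, and the sample sentence are made up for illustration, and the tiny stopword set stands in for NLTK's much longer English list.

```python
import re

# Hypothetical, deliberately tiny stopword set; the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "has", "i"}

def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Apply the four preprocessing steps to one string."""
    text = re.sub(r"[^A-Za-z]+", " ", text)  # non-letter runs become one space
    text = text.lower().strip()              # lowercase, trim outer spaces
    # Drop stopwords; split() also collapses any repeated spaces
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "Testing!!! Testing 1, 2, 3... Is THIS the Blog?"
print(clean_text(sample))
```

With the real data, the same function could be applied as `df.text.apply(clean_text, stop_words=stop_words)` once the NLTK list has been loaded.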

## 3.1) Merge all the label columns together, so that we have all the tags together for a particular sentence.

In [0]:
df['labels'] = df.apply(lambda row: [row['gender'], str(row['age']), row['topic'], row['sign']], axis=1)

## 3.2) Select only the required columns from the dataframe

In [0]:
df = df[['text','labels']]

In [0]:
df.head()

| | text | labels |
|---|---|---|
| 0 | info found pages mb pdf files wait untill team... | [male, 15, Student, Leo] |
| 1 | team members drewes van der laag urllink mail ... | [male, 15, Student, Leo] |
| 2 | het kader van kernfusie op aarde maak je eigen... | [male, 15, Student, Leo] |
| 3 | testing testing | [male, 15, Student, Leo] |
| 4 | thanks yahoo toolbar capture urls popups means... | [male, 33, InvestmentBanking, Aquarius] |


## 4) Separate features and labels, and split the data into training and testing.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.text.values, 
                                                    df.labels.values, test_size=0.20, random_state=42)
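As a quick sanity check on the split sizes, the sketch below runs the same call on a stand-in list of 4000 items (the number of rows kept earlier). The `demo_*` names are illustrative only.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 4000 retained posts; only the lengths matter here.
demo_rows = list(range(4000))

demo_train, demo_test = train_test_split(demo_rows, test_size=0.20, random_state=42)

# 20% of the rows are held out for testing, the rest are used for training.
print(len(demo_train), len(demo_test))
```

So the metrics reported later are computed on 800 held-out posts.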

## 5) Vectorize the features.

- Create a bag of words using `CountVectorizer` with `ngram_range=(1, 2)`.
- Vectorize training and testing features.

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

In [0]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
vectorizer.get_feature_names()[:5]

['aa', 'aa compared', 'aaa', 'aaa take', 'aaa travel']

#### Print the document-term matrix

In [0]:
X_train_bow.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
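The real matrix is too large and sparse to read, so here is a small sketch on two made-up documents showing what `binary=True` and `ngram_range=(1, 2)` produce. The documents and the `demo_*` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (hypothetical, already preprocessed)
docs = ["stop thinking feel", "stop thinking"]

demo_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
demo_matrix = demo_vectorizer.fit_transform(docs)

# Column order follows the indices stored in vocabulary_
vocab = sorted(demo_vectorizer.vocabulary_, key=demo_vectorizer.vocabulary_.get)
print(vocab)                  # unigrams and bigrams
print(demo_matrix.toarray())  # one row per document, 1 = n-gram present
```

Each row is a document and each column a unigram or bigram. With `binary=True`, an entry records only whether the n-gram occurs, not how often.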

## 6) Create a dictionary to get the count of every label.  

In [0]:
label_counts = dict()

for labels in df.labels.values:
    for label in labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

In [0]:
label_counts

{'14': 129,
 '15': 319,
 '16': 26,
 '17': 191,
 '23': 106,
 '24': 347,
 '25': 157,
 '26': 43,
 '27': 86,
 '33': 94,
 '34': 6,
 '35': 2307,
 '36': 60,
 '37': 19,
 '39': 79,
 '41': 14,
 '44': 3,
 '45': 14,
 'Accounting': 2,
 'Aquarius': 291,
 'Aries': 2449,
 'Arts': 20,
 'Banking': 16,
 'BusinessServices': 43,
 'Cancer': 79,
 'Capricorn': 77,
 'Communications-Media': 61,
 'Education': 118,
 'Engineering': 119,
 'Gemini': 35,
 'Internet': 20,
 'InvestmentBanking': 70,
 'Leo': 111,
 'Libra': 360,
 'Museums-Libraries': 2,
 'Non-Profit': 46,
 'Pisces': 67,
 'Sagittarius': 117,
 'Science': 33,
 'Scorpio': 292,
 'Sports-Recreation': 75,
 'Student': 476,
 'Taurus': 83,
 'Technology': 2294,
 'Virgo': 39,
 'female': 919,
 'indUnk': 605,
 'male': 3081}
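The same tally can be written more compactly with `collections.Counter`. The sketch below uses three toy label lists; the first two mirror rows shown earlier and the third is made up.

```python
from collections import Counter
from itertools import chain

demo_labels = [
    ["male", "15", "Student", "Leo"],
    ["male", "33", "InvestmentBanking", "Aquarius"],
    ["female", "15", "Student", "Aries"],  # hypothetical row
]

# Flatten the per-row lists and count each label once per occurrence
demo_counts = Counter(chain.from_iterable(demo_labels))
print(demo_counts)
```

On the notebook's data, `Counter(chain.from_iterable(df.labels.values))` should give the same counts as the loop above.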

## 7) Transform the labels

### Convert your train and test labels using MultiLabelBinarizer.


In [0]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer(classes=sorted(label_counts.keys()))
y_train = mlb.fit_transform(y_train)
y_test = mlb.transform(y_test)
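To see what the binarizer produces, here is a small sketch on two toy label lists. The class list and the `demo_*` names are illustrative, not taken from the notebook.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy class list in the same sorted order the notebook uses
# (digits, then capitalized names, then lowercase names)
demo_classes = ["15", "33", "Aquarius", "Leo", "Student", "male"]
demo_mlb = MultiLabelBinarizer(classes=demo_classes)

encoded = demo_mlb.fit_transform([
    ["male", "15", "Student", "Leo"],
    ["male", "33", "Aquarius"],
])
print(encoded)                              # one 0/1 column per class
print(demo_mlb.inverse_transform(encoded))  # back to label tuples
```

Each row becomes a 0/1 vector with one position per class, and `inverse_transform` recovers the labels in class order.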

## 8) Classifier

a. Use a linear classifier of your choice and wrap it in OneVsRestClassifier so that one binary model is trained per label.

In [0]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(solver='lbfgs')
clf = OneVsRestClassifier(clf)

In [0]:
clf.fit(X_train_bow, y_train)

OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False),
                    n_jobs=None)

In [0]:
predicted_labels = clf.predict(X_test_bow)
predicted_scores = clf.decision_function(X_test_bow)
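Scores from `decision_function` are the usual input for ranking metrics such as average precision, whereas hard 0/1 predictions discard the ranking. The toy sketch below illustrates the difference; the arrays are made up, and thresholding the scores at 0 is an assumption used only to derive matching predictions.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical ground truth and decision scores for 2 samples x 4 labels
y_true_demo = np.array([[1, 0, 1, 1],
                        [0, 1, 1, 0]])
scores_demo = np.array([[ 2.0, -1.5, 1.2, -0.2],
                        [-2.0,  0.8, 1.5, -0.9]])

# Hard predictions obtained by thresholding the scores at 0
preds_demo = (scores_demo > 0).astype(int)

ap_from_scores = average_precision_score(y_true_demo, scores_demo, average='micro')
ap_from_preds = average_precision_score(y_true_demo, preds_demo, average='micro')
print(ap_from_scores, ap_from_preds)
```

In this example every true label is ranked above every false one, so the score-based value is higher than the one computed from thresholded predictions.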

In [0]:
pred_inversed = mlb.inverse_transform(predicted_labels)
y_test_inversed = mlb.inverse_transform(y_test)

In [0]:
for i in range(5):
    print('Title:\t{}\nTrue labels:\t{}\nPredicted labels:\t{}\n\n'.format(
        X_test[i],
        ','.join(y_test_inversed[i]),
        ','.join(pred_inversed[i])
    ))

Title:	wake early morning play together little girls smocked dresses bows hair run around high pitched voices learning share toys brought beach family vacation little girl cousins special thing automatic playmates best friends share genes beginning share lives growing clamor beach might play sand hours filling buckets dragging water surf hunting shells adorable bathing suits skin tans easily even super strength sun block fathers mothers take water hold safely waves tidal pools perfect depth rafts best toys sometimes hands sand saltwater get dirtier ever never seem notice grittiness stickiness want go house even hours playing love rinse hose semi grown shower little girl watching nieces relive cousin time beach salter path family campground bogue sound playground sound favorite things floating raft calm water clam digging cousins obsessed shells hunted tirelessly throughout days beach inlet north end island end shell hunts surveyed treasurer memorized shapes names olives whale eyes baby

## 9) Print the following.
i.   Accuracy score.  
ii.  F1 score.  
iii. Average precision score.  
iv.  Average recall score.  

In [0]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score

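def print_evaluation_scores(y_val, predicted):
    # For multi-label targets, accuracy_score is exact-match (subset) accuracy
    print('Accuracy score: ', accuracy_score(y_val, predicted))
    print('F1 score: ', f1_score(y_val, predicted, average='micro'))
    # Note: average_precision_score is designed for continuous scores (e.g. predicted_scores);
    # it is given binary predictions here
    print('Average precision score: ', average_precision_score(y_val, predicted, average='micro'))
    print('Average recall score: ', recall_score(y_val, predicted, average='micro'))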

In [0]:
print('Bag-of-words')
print_evaluation_scores(y_test, predicted_labels)

Bag-of-words
Accuracy score:  0.55625
F1 score:  0.7498267498267498
Average precision score:  0.5959546721099015
Average recall score:  0.67625
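The accuracy is well below the F1 score because, for multi-label targets, `accuracy_score` counts a sample as correct only when every label matches, while micro-averaged metrics count individual label decisions. A toy sketch with made-up arrays:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical 2 samples x 4 labels
y_true_toy = np.array([[1, 0, 1, 1],
                       [0, 1, 1, 0]])
y_pred_toy = np.array([[1, 0, 1, 0],   # one true label missed
                       [0, 1, 1, 0]])  # exact match

acc = accuracy_score(y_true_toy, y_pred_toy)                      # exact-match accuracy
prec = precision_score(y_true_toy, y_pred_toy, average='micro')   # TP / (TP + FP)
rec = recall_score(y_true_toy, y_pred_toy, average='micro')       # TP / (TP + FN)
f1 = f1_score(y_true_toy, y_pred_toy, average='micro')
print(acc, prec, rec, f1)
```

Here one missed label out of five halves the accuracy (one of two samples is no longer an exact match), while micro recall only drops to 4/5 and micro F1 to 8/9.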
