# DMG2 Assignment : Problem 6

_Naive Bayes Text Classifier_

Number of classes : 20

In each class, there are a number of documents, each one corresponding to a date. The test-train split will be based on the date. 

**Preprocessing in each document :**
* Keep only From, Subject, Host, Organization, Data
* Remove special characters, stop words
* Stem the words
* The data contains numbers (addresses, phone numbers, currency amounts, etc.). Should they be removed?
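
The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual pipeline: the header names in `KEPT_HEADERS`, the tiny `STOP_WORDS` set, and the `keep_numbers` and `stem` parameters are all assumptions introduced here. The `stem` hook defaults to the identity; a stemmer such as the `LancasterStemmer().stem` method imported below could be passed in instead.

```python
import re
from email import message_from_string

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}
# Assumed header names for "From, Subject, Host, Organization".
KEPT_HEADERS = ("From", "Subject", "Nntp-Posting-Host", "Organization")

def preprocess(raw_text, keep_numbers=False, stem=lambda w: w):
    """Return a list of cleaned tokens from one newsgroup post."""
    msg = message_from_string(raw_text)
    # Keep only the selected header values plus the message body.
    parts = [msg.get(h, "") for h in KEPT_HEADERS]
    parts.append(msg.get_payload())
    text = " ".join(parts).lower()
    # Replace special characters with spaces; digits survive only if keep_numbers is set.
    pattern = r"[^a-z0-9]+" if keep_numbers else r"[^a-z]+"
    tokens = re.sub(pattern, " ", text).split()
    # Drop stop words and single-character tokens, then stem what is left.
    return [stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 1]

sample = (
    "From: someone@example.edu\n"
    "Subject: Price of the disk drive\n"
    "Lines: 2\n"
    "\n"
    "Selling a drive for $1000, call 555-1234.\n"
)
tokens = preprocess(sample)
tokens_with_numbers = preprocess(sample, keep_numbers=True)
print(tokens)
print(tokens_with_numbers)
```

Running both variants on a few real posts is one way to judge whether the numeric tokens carry any class signal before deciding to drop them.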

In [1]:
import os,re
import pandas as pd
import numpy as np
import nltk,unicodedata
import operator,math

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.lancaster import LancasterStemmer
from sklearn.feature_extraction.text import CountVectorizer

## Reading Files

In [2]:
DATA_DIR = 'D:\\ISB\\Term3\\DMG2\\assignment\\datasets\\20_newsgroups'

In [3]:
labels,files_list = [],[]
for root, dirs, files in os.walk(DATA_DIR):
    for file in files:
        # The class label is the folder path relative to DATA_DIR (avoids repeating the hard-coded path)
        labels.append(os.path.relpath(root, DATA_DIR))
        files_list.append(os.path.join(root,file))

In [4]:
files_df = pd.DataFrame({'filename':files_list, 'label' : labels})
list(files_df['label'].unique())

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Train - Test Split

Within each class, the documents are split 70-30: the first 70% of the files (in the order returned by `os.walk`) form the training set and the remaining 30% form the test set.

In [5]:
train = pd.DataFrame(columns=['filename','label'])
test = pd.DataFrame(columns=['filename','label'])

for label in list(files_df['label'].unique()):
    threshold = files_df.loc[files_df['label'] == label].shape[0] * 0.7
    threshold = int(np.floor(threshold))
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    class_rows = files_df.loc[files_df['label'] == label]
    train = pd.concat([train, class_rows.iloc[:threshold,:]], ignore_index=True)
    test = pd.concat([test, class_rows.iloc[threshold:,:]], ignore_index=True)

print(train.shape[0],test.shape[0])

13997 6000


## Creating dictionary of 5000 most frequent words in each class

For each class, P(W|C) is computed for every word in the class dictionary, using additive (Laplace) smoothing with a smoothing parameter of 30.

The `CountVectorizer` class from scikit-learn is used to build the document-term matrix; it has a built-in preprocessing step (lowercasing, tokenization and English stop-word removal). Summing the matrix over the documents of a class gives the total number of occurrences of each word in that class. This is stored in the column `count_docs`, although it is a term count rather than a count of documents. The 5000 most frequent words are kept, and for each of them P(W|C) = (count + 30) / (total count of the kept words + 5000 * 30).
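
As a small worked illustration of the smoothing formula, the sketch below computes P(W|C) for one class from a flat list of tokens. The function name `smoothed_word_probs` and the returned `unseen` value are assumptions introduced here: `unseen` is simply what the same formula gives for a word with a count of zero, which the notebook itself does not store.

```python
from collections import Counter

def smoothed_word_probs(tokens, vocab_size=5000, alpha=30):
    """Return ({word: P(w|c)}, P(zero-count word|c)) for one class's training tokens."""
    # Keep only the vocab_size most frequent words of the class.
    top = Counter(tokens).most_common(vocab_size)
    total = sum(count for _, count in top)
    # Same denominator as in the notebook: total kept count + vocab_size * alpha.
    denom = total + vocab_size * alpha
    probs = {word: (count + alpha) / denom for word, count in top}
    # Assumption: the value the formula assigns to a word that was never seen in the class.
    unseen = alpha / denom
    return probs, unseen

# Toy class with a 2-word dictionary and alpha = 1.
toy_tokens = ["disk", "disk", "disk", "drive", "price"]
probs, unseen = smoothed_word_probs(toy_tokens, vocab_size=2, alpha=1)
print(probs, unseen)
```

In the toy run the kept counts are 3 and 1, so the denominator is 4 + 2 * 1 = 6 and the two kept probabilities are 4/6 and 2/6. A larger smoothing parameter pulls all kept probabilities closer together.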

In [73]:
# Dictionary to hold vectorizer objects
vect_dict = {}
# Dictionary to hold Document term matrix for each class.
# The document term matrix is converted to a Pandas DataFrame
class_dict = {}
for label in list(train['label'].unique()):
    vect_dict[label] = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    class_dict[label] = pd.DataFrame(vect_dict[label].fit_transform(list(train.loc[train['label'] == label]['filename'])).todense().T)
    class_dict[label]['count_docs'] = class_dict[label].sum(axis=1)
    class_dict[label]['word'] = vect_dict[label].get_feature_names()
    class_dict[label] = class_dict[label].sort_values(by='count_docs',ascending=False).iloc[:5000,:]
    tot_freq = class_dict[label]['count_docs'].sum()
    class_dict[label]['p(w|c)'] =  (class_dict[label]['count_docs'] + 30) / (tot_freq + (5000 * 30))

Among the 25 most frequent words of each class, are there any that appear in every class?

In [74]:
# top25_list = []
# for label in list(train['label'].unique()):
#     top25_list.append(class_dict[label].iloc[:25,:]['word'])
# intersect = set(top25_list[0])
# for list_ in top25_list[1:]:
#     intersect.intersection_update(list_)
# print(intersect)

Removing these words from each dictionary and recalculating the probabilities (the cell below is left commented out):

In [66]:
# for label in list(train['label'].unique()):
#     class_dict[label] = class_dict[label].loc[~ class_dict[label]['word'].isin(['cmu', 'cs', 'message', 'edu', 'com', 'subject', 'srv'])]
# for label in list(train['label'].unique()):
#     tot_freq = class_dict[label]['count_docs'].sum()
#     class_dict[label]['p(w|c)'] =  (class_dict[label]['count_docs'] + 30) / (tot_freq + (5000 * 30))

In [75]:
# Final Word Dictionary for each class
for label in list(train['label'].unique()):
    class_dict[label] = pd.Series(class_dict[label]['p(w|c)'].values,index=class_dict[label]['word']).to_dict()

In [70]:
#class_dict['comp.graphics'].shape

The words **cmu, edu, com, cs** could be removed from the dictionaries, which may give better results.

## Calculating Class Priors

In [9]:
class_priors_dict = {}
total_freq = 0
for label in list(files_df['label'].unique()):
    class_priors_dict[label] = files_df.loc[files_df['label'] == label].shape[0]
    total_freq += class_priors_dict[label]
for label in list(files_df['label'].unique()):
    class_priors_dict[label] = np.round(class_priors_dict[label] / total_freq, 4)

In [10]:
class_priors_dict

{'alt.atheism': 0.05,
 'comp.graphics': 0.05,
 'comp.os.ms-windows.misc': 0.05,
 'comp.sys.ibm.pc.hardware': 0.05,
 'comp.sys.mac.hardware': 0.05,
 'comp.windows.x': 0.05,
 'misc.forsale': 0.05,
 'rec.autos': 0.05,
 'rec.motorcycles': 0.05,
 'rec.sport.baseball': 0.05,
 'rec.sport.hockey': 0.05,
 'sci.crypt': 0.05,
 'sci.electronics': 0.05,
 'sci.med': 0.05,
 'sci.space': 0.05,
 'soc.religion.christian': 0.0499,
 'talk.politics.guns': 0.05,
 'talk.politics.mideast': 0.05,
 'talk.politics.misc': 0.05,
 'talk.religion.misc': 0.05}

### Calculating Training Accuracy

In [76]:
train_predicted = pd.DataFrame(columns=['predicted','max_class_posterior_prob'])
for train_doc in list(train['filename']):
    vect_train = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    train_docterm = pd.DataFrame(vect_train.fit_transform([train_doc]).todense().T)
    #print(vect_train.get_feature_names())
    log_posterior_dict = class_priors_dict
    log_posterior_dict = dict([(k,math.log(v)) for (k,v) in log_posterior_dict.items()])
    for word in vect_train.get_feature_names():
        for k,v in log_posterior_dict.items():
            try:
                log_posterior_dict[k] = log_posterior_dict[k] + math.log(class_dict[k][word])
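            except KeyError:
                # Word is not in this class's dictionary: it is skipped, so the class's score is unchanged
                pass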
    log_posterior_dict = dict([(k,np.exp(v)) for (k,v) in log_posterior_dict.items()])
    train_predicted = train_predicted.append({'predicted':max(log_posterior_dict, key=log_posterior_dict.get),'max_class_posterior_prob':max(log_posterior_dict.values())},ignore_index=True)
train_predicted['actual'] = train['label']   
print('Training Accuracy : {}'.format(np.round(train_predicted.loc[train_predicted['predicted'] == train_predicted['actual']].shape[0]/train_predicted.shape[0],4)))            

Training Accuracy : 0.0175


### Calculating Test Accuracy

In [77]:
test_predicted = pd.DataFrame(columns=['predicted','max_class_posterior_prob'])
for test_doc in list(test['filename']):
    vect_test = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    test_docterm = pd.DataFrame(vect_test.fit_transform([test_doc]).todense().T)
    #print(vect_train.get_feature_names())
    log_posterior_dict = class_priors_dict
    log_posterior_dict = dict([(k,math.log(v)) for (k,v) in log_posterior_dict.items()])
    for word in vect_test.get_feature_names():
        for k,v in log_posterior_dict.items():
            try:
                log_posterior_dict[k] = log_posterior_dict[k] + math.log(class_dict[k][word])
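            except KeyError:
                # Word is not in this class's dictionary: it is skipped, so the class's score is unchanged
                pass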
    log_posterior_dict = dict([(k,np.exp(v)) for (k,v) in log_posterior_dict.items()])
    test_predicted = test_predicted.append({'predicted':max(log_posterior_dict, key=log_posterior_dict.get),'max_class_posterior_prob':max(log_posterior_dict.values())},ignore_index=True)
test_predicted['actual'] = test['label']   
print('Testing Accuracy : {}'.format(np.round(test_predicted.loc[test_predicted['predicted'] == test_predicted['actual']].shape[0]/test_predicted.shape[0],4)))            

Testing Accuracy : 0.0228


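Both accuracies are very low: 1.75% on the training set and 2.28% on the test set. With 20 classes of nearly equal size, random guessing would score about 5%, so the classifier is doing worse than chance. This points to a flaw in the scoring rather than in the dictionary size: a word that is missing from a class's dictionary is skipped instead of being penalized. Since every matched word adds a negative log-probability, classes that match fewer of a document's words tend to end up with the highest score.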
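
One possible remedy for the skipped words is to give every out-of-dictionary word a small floor probability, so that a class pays a price for each word it has never seen. The sketch below is a toy illustration under that assumption; the function names, the `unseen_probs` dictionary and all the numbers are made up here and are not part of the notebook. It also keeps the comparison in log space instead of exponentiating.

```python
import math

def predict_with_floor(words, log_priors, word_probs, unseen_probs):
    """Pick the class with the highest log-posterior, penalizing unseen words."""
    scores = {}
    for label, log_prior in log_priors.items():
        score = log_prior
        for word in words:
            # Fall back to the class's floor probability instead of skipping the word.
            score += math.log(word_probs[label].get(word, unseen_probs[label]))
        scores[label] = score
    return max(scores, key=scores.get)

def predict_skipping(words, log_priors, word_probs):
    """Same scoring, but unseen words are skipped, as in the loops above."""
    scores = {}
    for label, log_prior in log_priors.items():
        scores[label] = log_prior + sum(
            math.log(word_probs[label][w]) for w in words if w in word_probs[label]
        )
    return max(scores, key=scores.get)

# Toy two-class example (all numbers are made up for illustration).
log_priors = {"autos": math.log(0.5), "space": math.log(0.5)}
word_probs = {
    "autos": {"engine": 0.5, "car": 0.4},
    "space": {"orbit": 0.6, "engine": 0.3},
}
unseen_probs = {"autos": 0.01, "space": 0.01}
doc = ["engine", "car"]

with_floor = predict_with_floor(doc, log_priors, word_probs, unseen_probs)
skipping = predict_skipping(doc, log_priors, word_probs)
print(with_floor, skipping)
```

In this toy case the document matches both of the "autos" words but only one "space" word. Skipping the unmatched word lets "space" win because it accumulates fewer negative terms, while the floor probability restores "autos" as the prediction.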

## Creating dictionary of 10,000 most frequent words in each class

In [31]:
# Dictionary to hold vectorizer objects
vect_dict = {}
# Dictionary to hold Document term matrix for each class.
# The document term matrix is converted to a Pandas DataFrame
class_dict = {}
for label in list(train['label'].unique()):
    vect_dict[label] = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    class_dict[label] = pd.DataFrame(vect_dict[label].fit_transform(list(train.loc[train['label'] == label]['filename'])).todense().T)
    class_dict[label]['count_docs'] = class_dict[label].sum(axis=1)
    class_dict[label]['word'] = vect_dict[label].get_feature_names()
    class_dict[label] = class_dict[label].sort_values(by='count_docs',ascending=False).iloc[:10000,:]
    tot_freq = class_dict[label]['count_docs'].sum()
    class_dict[label]['p(w|c)'] =  (class_dict[label]['count_docs'] + 30) / (tot_freq + (10000 * 30))

In [32]:
# Final Word Dictionary for each class
for label in list(train['label'].unique()):
    class_dict[label] = pd.Series(class_dict[label]['p(w|c)'].values,index=class_dict[label]['word']).to_dict()

### Calculating Training Accuracy

In [36]:
train_predicted = pd.DataFrame(columns=['predicted','max_class_posterior_prob'])
for train_doc in list(train['filename']):
    vect_train = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    train_docterm = pd.DataFrame(vect_train.fit_transform([train_doc]).todense().T)
    #print(vect_train.get_feature_names())
    log_posterior_dict = class_priors_dict
    log_posterior_dict = dict([(k,math.log(v)) for (k,v) in log_posterior_dict.items()])
    for word in vect_train.get_feature_names():
        for k,v in log_posterior_dict.items():
            try:
                log_posterior_dict[k] = log_posterior_dict[k] + math.log(class_dict[k][word])
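            except KeyError:
                # Word is not in this class's dictionary: it is skipped, so the class's score is unchanged
                pass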
    log_posterior_dict = dict([(k,np.exp(v)) for (k,v) in log_posterior_dict.items()])
    train_predicted = train_predicted.append({'predicted':max(log_posterior_dict, key=log_posterior_dict.get),'max_class_posterior_prob':max(log_posterior_dict.values())},ignore_index=True)
train_predicted['actual'] = train['label']   
print('Training Accuracy : {}'.format(np.round(train_predicted.loc[train_predicted['predicted'] == train_predicted['actual']].shape[0]/train_predicted.shape[0],4)))            

Training Accuracy : 0.0294


### Calculating Test Accuracy

In [35]:
test_predicted = pd.DataFrame(columns=['predicted','max_class_posterior_prob'])
for test_doc in list(test['filename']):
    vect_test = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    test_docterm = pd.DataFrame(vect_test.fit_transform([test_doc]).todense().T)
    #print(vect_train.get_feature_names())
    log_posterior_dict = class_priors_dict
    log_posterior_dict = dict([(k,math.log(v)) for (k,v) in log_posterior_dict.items()])
    for word in vect_test.get_feature_names():
        for k,v in log_posterior_dict.items():
            try:
                log_posterior_dict[k] = log_posterior_dict[k] + math.log(class_dict[k][word])
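            except KeyError:
                # Word is not in this class's dictionary: it is skipped, so the class's score is unchanged
                pass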
    log_posterior_dict = dict([(k,np.exp(v)) for (k,v) in log_posterior_dict.items()])
    test_predicted = test_predicted.append({'predicted':max(log_posterior_dict, key=log_posterior_dict.get),'max_class_posterior_prob':max(log_posterior_dict.values())},ignore_index=True)
test_predicted['actual'] = test['label']   
print('Testing Accuracy : {}'.format(np.round(test_predicted.loc[test_predicted['predicted'] == test_predicted['actual']].shape[0]/test_predicted.shape[0],4)))            

Testing Accuracy : 0.0363


**Increasing the dictionary to 10,000 words raises the training and test accuracies to 2.94% and 3.63%, which is still below the roughly 5% expected from random guessing over 20 classes. This can be improved further.**
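
A separate numerical point concerns the step that exponentiates the log-posteriors before taking the maximum. For long documents the sums of log-probabilities become very negative, and exponentiating them directly underflows to 0.0, which makes the classes indistinguishable. The sketch below, with made-up scores and a hypothetical helper `normalize_log_scores`, shows the underflow and one common way around it: subtract the largest log-score before exponentiating, then normalize.

```python
import math

def normalize_log_scores(log_scores):
    """Convert unnormalized log-posteriors into probabilities that sum to 1."""
    peak = max(log_scores.values())
    # After subtracting the peak, the largest exponent is 0, so its exp() is exactly 1.
    shifted = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(shifted.values())
    return {k: v / total for k, v in shifted.items()}

# Made-up log-posteriors of the size a long document could produce.
log_scores = {"autos": -1000.0, "space": -1002.0}

# Direct exponentiation: both values underflow to 0.0.
naive = {k: math.exp(v) for k, v in log_scores.items()}

posterior = normalize_log_scores(log_scores)
print(naive)
print(posterior)
```

If only the predicted label is needed, taking the maximum of the log-scores directly is enough, because the exponential is monotonic; the normalization matters only when a posterior probability is to be reported.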

In [80]:
train_predicted = pd.DataFrame(columns=['predicted','max_class_posterior_prob'])
for train_doc in list(train.iloc[:1,:]['filename']):
    vect_train = CountVectorizer(input='filename',analyzer='word',stop_words='english',decode_error='ignore')
    train_docterm = pd.DataFrame(vect_train.fit_transform([train_doc]).todense().T)
    #print(vect_train.get_feature_names())
    log_posterior_dict = class_priors_dict
    log_posterior_dict = dict([(k,math.log(v)) for (k,v) in log_posterior_dict.items()])
    for word in vect_train.get_feature_names():
        for k,v in log_posterior_dict.items():
            try:
                log_posterior_dict[k] = log_posterior_dict[k] + math.log(class_dict[k][word])
                print('Word {0} matched from {1} class'.format(word,k))
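            except KeyError:
                # Word is not in this class's dictionary: it is skipped, so the class's score is unchanged
                pass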
    log_posterior_dict = dict([(k,np.exp(v)) for (k,v) in log_posterior_dict.items()])
    train_predicted = train_predicted.append({'predicted':max(log_posterior_dict, key=log_posterior_dict.get),'max_class_posterior_prob':max(log_posterior_dict.values())},ignore_index=True)
train_predicted['actual'] = train['label']   
print('Training Accuracy : {}'.format(np.round(train_predicted.loc[train_predicted['predicted'] == train_predicted['actual']].shape[0]/train_predicted.shape[0],4)))            

Word 071 matched from alt.atheism class
Word 1000 matched from talk.politics.mideast class
Word 1000 matched from sci.electronics class
Word 1000 matched from comp.windows.x class
Word 1000 matched from sci.space class
Word 1000 matched from comp.sys.ibm.pc.hardware class
Word 1000 matched from rec.motorcycles class
Word 1000 matched from comp.sys.mac.hardware class
Word 1000 matched from misc.forsale class
Word 1000 matched from comp.os.ms-windows.misc class
Word 1000 matched from rec.autos class
Word 1000 matched from sci.med class
Word 1000 matched from talk.politics.guns class
Word 1000 matched from comp.graphics class
Word 1000 matched from sci.crypt class
Word 11 matched from talk.politics.mideast class
Word 11 matched from sci.electronics class
Word 11 matched from comp.windows.x class
Word 11 matched from sci.space class
Word 11 matched from alt.atheism class
Word 11 matched from comp.sys.ibm.pc.hardware class
Word 11 matched from talk.politics.misc class
Word 11 matched from t