# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint
### Not for Grading

### Learning Objectives:

At the end of the experiment, you will be able to:

*  Preprocess text data
*  Represent a text document using Bag of Words
*  Classify Bag of Words represented text data with K-nearest neighbours

### Dataset
In this experiment we use the 20 Newsgroups dataset.

**Description**

This dataset is a collection of approximately 20,000 newsgroup documents, partitioned across 20 different newsgroups. That is, approximately one thousand documents are taken from each of the following newsgroups:

    alt.atheism
    comp.graphics
    comp.os.ms-windows.misc
    comp.sys.ibm.pc.hardware
    comp.sys.mac.hardware
    comp.windows.x
    misc.forsale
    rec.autos
    rec.motorcycles
    rec.sport.baseball
    rec.sport.hockey
    sci.crypt
    sci.electronics
    sci.med
    sci.space
    soc.religion.christian
    talk.politics.guns
    talk.politics.mideast
    talk.politics.misc
    talk.religion.misc

The dataset consists of **Usenet** posts, which are essentially emails sent by subscribers to a newsgroup. They typically contain quotes from previous posts as well as cross-posts, i.e. a few posts may have been sent to more than one newsgroup.

Each newsgroup is stored in a subdirectory, with each post stored as a separate file.

Data source for this experiment: http://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups

### Domain Information
A newsgroup, despite the name, has nothing to do with news. It is what we would call today a mailing list or a discussion forum. *Usenet* is a distributed discussion system designed and developed in 1979 and deployed in 1980.  

Members joined newsgroups of interest to them and made *posts* to them. Posts are very similar to email -- in later years, newsgroups became mailing lists and people posted via email.

The problem that we are attempting is "text classification". This is a broadly defined task that is common to many services and products: for example, Gmail classifies incoming mail into different sections such as Updates, Forums, etc.


### Bag of Words (BoW)

* Bag-of-words is a simple-to-understand representation of documents. Each word gets a one-hot vector over the vocabulary, and a document is represented as the sum of the one-hot vectors of all the words it contains, i.e. a vector of word counts.
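
As a minimal sketch of this idea (the toy vocabulary and sentence below are made up for illustration and are not taken from the dataset), each word maps to a one-hot vector and the document vector is their element-wise sum:

```python
# Toy vocabulary (hypothetical, for illustration only)
vocabulary = ["cat", "dog", "sat", "mat"]
word_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return the one-hot vector of a word over the toy vocabulary."""
    vector = [0] * len(vocabulary)
    vector[word_index[word]] = 1
    return vector

def bag_of_words(document):
    """Sum the one-hot vectors of all in-vocabulary words in the document."""
    bow = [0] * len(vocabulary)
    for word in document.split():
        if word in word_index:
            bow = [a + b for a, b in zip(bow, one_hot(word))]
    return bow

print(bag_of_words("the cat sat on the mat with the cat"))  # [2, 0, 1, 1]
```

Words outside the vocabulary ("the", "on", "with") are simply ignored, which is also how the notebook treats words removed during pre-processing.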
 
### Challenges

* The dimension of each vector is the number of words in the vocabulary, so we will encounter the *curse of dimensionality*.
* The Bag of Words representation doesn't consider the semantic relation between words.
* Nor does it capture the grammar of the language, such as parts of speech or word order.
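
The last two points can be illustrated with a small sketch. The sentences are made up, and `bow_counts` is a hypothetical helper, not part of this notebook:

```python
from collections import Counter

def bow_counts(sentence):
    """Count word occurrences, ignoring word order."""
    return Counter(sentence.lower().split())

# Word order is lost: opposite meanings, identical representation
a = bow_counts("dog bites man")
b = bow_counts("man bites dog")
print(a == b)  # True

# Semantics are lost: synonyms occupy unrelated dimensions
shared = set(bow_counts("car")) & set(bow_counts("automobile"))
print(shared)  # set()
```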

In [None]:
! wget -qq https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/AIML_DS_NEWSGROUPS_PICKELFILE.pkl
    

### Importing required Packages


In [None]:
import pickle
import re
import operator
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np
import math
import collections

### Load the dataset

Pickling is the process of converting a Python object into a byte stream. For a better understanding of pickle, refer to the link: https://drive.google.com/file/d/17BfVZ57B726H91hVjzV8XqNppnToP3HY/view?usp=sharing
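
A minimal round-trip sketch, using a made-up toy dictionary (the structure of class name mapped to a list of documents is an assumption that mirrors how the dataset is used later in this notebook):

```python
import pickle

# Hypothetical toy object: class name -> list of documents
toy = {"sci.space": [["the", "moon"]], "rec.autos": [["fast", "car"]]}

payload = pickle.dumps(toy)       # Python object -> bytes
restored = pickle.loads(payload)  # bytes -> Python object

print(type(payload))    # <class 'bytes'>
print(restored == toy)  # True
```

Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.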

In [None]:
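# Open the file in binary mode; the with-block closes it once loading is done
with open('AIML_DS_NEWSGROUPS_PICKELFILE.pkl', 'rb') as pickle_file:
    dataset = pickle.load(pickle_file)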
print(type(dataset))
print(dataset.keys())

To get a sense of our data, let's start by counting the number of documents in each target class.

In [None]:
# Print frequencies of dataset
print("Class : count")
print("--------------")

for key in dataset:
    print(key, ':', len(dataset[key]))

Next, let's split our dataset, which consists of about 1000 samples per class, into train and test sets. We use about 95% of the samples from each class for training and the remainder for testing.



In [None]:
train_set = {}
test_set = {}
    
# Split the dataset into 95% - 5% for training and testing
n_train = 0
n_test = 0
for k in dataset:
    split = int(0.95*len(dataset[k]))
    train_set[k] = dataset[k][0:split]
    test_set[k] = dataset[k][split:]
    n_train += len(train_set[k])
    n_test += len(test_set[k])

## 1. Bag-of-Words

Let's begin our journey into text classification with one of the simplest but most commonly used feature representations for news documents: Bag-of-Words.

As you might have realized, machine learning algorithms need good feature representations of their inputs. Concretely, we would like to represent each news article $D$ in terms of a feature vector $V$, which can be used for classification. The feature vector $V$ is made up of the number of occurrences of each word in the vocabulary.

Let's begin by counting the number of occurrences of every word in the news documents in the training set.

### 1.1 Word frequency

Let's understand which kinds of words appear frequently and which occur rarely. We now count the frequencies of words:

In [None]:
# Initialize a dictionary to store frequencies of words.
# Key:Value === Word:Count
frequency = defaultdict(int)
    
for key in train_set:
    for f in train_set[key]:
        
        # Find all words of 3 to 10 letters: the first letter may be upper or lower case,
        # the remaining letters must be lowercase.
        # We ignore all special characters such as !.$ and words containing numbers
        words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', ' '.join(f))
    
        for word in words:
            frequency[word] += 1

sorted_words = sorted(frequency.items(), key=operator.itemgetter(1), reverse=True)

print("Top-10 most frequent words:")
for word in sorted_words[:10]:
    print(word)

print('----------------------------')
print("10 least frequent words:")
for word in sorted_words[-10:]:
    print(word)
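
To see what the regular expression in the cell above keeps and drops, here is a small sketch on a made-up sentence (the sentence is illustrative only):

```python
import re

# Same pattern as in the word-frequency cell
pattern = r'(\b[A-Za-z][a-z]{2,9}\b)'
sample = "The GPU is fast, e-mail me at 3pm about Hubble"  # made-up sentence

print(re.findall(pattern, sample))
# ['The', 'fast', 'mail', 'about', 'Hubble']
```

Short words ("is", "me", "at"), all-caps words ("GPU") and tokens containing digits ("3pm") are dropped. Matching is case-sensitive, so "The" and "the" are counted as different words.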

Next, let's plot a bar chart of the counts of the 100 most frequent words in descending order.

Could you comment on the relationship between the frequency of the most frequent word and that of the second most frequent word?
And what about the third most frequent word?

(Hint - Check the relative frequencies of the first, second and third most frequent words)

(After answering, you can visit https://en.wikipedia.org/wiki/Zipf%27s_law for further reading)

In [None]:
fig = plt.figure()
fig.set_size_inches(20,10)

plt.bar(range(len(sorted_words[:100])), [v for k, v in sorted_words[:100]] , align='center')
plt.xticks(range(len(sorted_words[:100])), [k for k, v in sorted_words[:100]])
locs, labels = plt.xticks()
plt.setp(labels, rotation=90)
plt.show()
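
Zipf's law states that the frequency of a word is roughly inversely proportional to its rank, so the second most frequent word should appear about half as often as the first, and the third about a third as often. The sketch below checks this on made-up counts; `rank_ratios` is a hypothetical helper, and in this notebook one could pass it `[v for k, v in sorted_words]` instead:

```python
def rank_ratios(counts, top=3):
    """Return counts[0] / counts[r] for the first `top` ranks.

    `counts` must be sorted in descending order.
    """
    return [counts[0] / counts[r] for r in range(top)]

# Made-up counts that follow Zipf's law exactly: the count at rank r is 1200 / r
toy_counts = [1200 // r for r in range(1, 6)]  # [1200, 600, 400, 300, 240]

print(rank_ratios(toy_counts))  # [1.0, 2.0, 3.0]
```

On real text the ratios are expected to be only approximately 1, 2, 3.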

### 1.2 Pre-processing to remove most and least frequent words

We saw that different words appear with different frequencies.

The most common words appear in almost all documents. Hence, for a classification task, information about those words' frequencies does not matter much, since they appear frequently in every type of document. To get a good feature representation, we eliminate them since they add little value.

Additionally, notice how the least frequent words appear so rarely that they might not be useful either.

Let's pre-process our news articles now to remove the most frequent and least frequent words by thresholding their counts:

In [None]:
valid_words = defaultdict(int) 

print('Number of words before preprocessing:', len(sorted_words))

# Remove the top 25 most frequent words, and the words which appear 10 times or fewer
remove_most_frequent = 25
freq_thresh = 10
feature_number = 0
for word, word_frequency in sorted_words[remove_most_frequent:]:
    if word_frequency > freq_thresh:    
        valid_words[word] = feature_number
        feature_number += 1
        
print('Number of words after preprocessing:', len(valid_words))

word_vector_size = len(valid_words)

### 1.3 Bag-of-Words representation

The simplest way to represent a document $D$ as a vector $V$ would be to now count the relevant words in the document. 

For each document, make a vector of the count of each of the words in the vocabulary (excluding the words removed in the previous step - the "stopwords").

In [None]:
def convert_to_BoW(dataset, number_of_documents):
    bow_representation = np.zeros((number_of_documents, word_vector_size))
    labels = np.zeros((number_of_documents, 1))
    
    i = 0
    for label, class_name in enumerate(dataset):
        
        # For each file
        for f in dataset[class_name]:
            
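            # Join the document and split it on single spaces.
            # Note: tokens with attached punctuation (e.g. "word,") will not match valid_words
            text = ' '.join(f).split(' ')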
            
            # For each word
            for word in text:
                if word in valid_words:
                    bow_representation[i, valid_words[word]] += 1
            
            # Label of document
            labels[i] = label
            
            # Increment document counter
            i += 1
    
    return bow_representation, labels

# Convert the dataset into their bag of words representation treating train and test separately
train_bow_set, train_bow_labels = convert_to_BoW(train_set, n_train)
test_bow_set, test_bow_labels = convert_to_BoW(test_set, n_test)

### 1.4 Document classification using Bag-of-Words

For the test documents, let's use a distance metric (cosine, Euclidean, etc.) to find similar news articles in the training set and classify them using kNN.
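
To make the two metrics concrete, here is a minimal pure-Python sketch on made-up count vectors (the vectors and helper names are illustrative assumptions, not part of the notebook):

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

doc = [2, 0, 1, 1]        # made-up BoW vector
doc_twice = [4, 0, 2, 2]  # the same document repeated twice
other = [0, 3, 0, 1]      # a document with a different word mix

print(euclidean_distance(doc, doc_twice))  # sqrt(6), about 2.449
print(cosine_distance(doc, doc_twice))     # approximately 0
print(cosine_distance(doc, other))         # clearly larger than 0
```

Cosine distance ignores document length and compares only the mix of words, whereas Euclidean distance also grows with the difference in counts. By default, `KNeighborsClassifier` uses the Euclidean distance; its `metric` parameter can select another.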

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create an object for the KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1)

# Fit the model; ravel() flattens the (n, 1) label array to the 1-D shape scikit-learn expects
model.fit(train_bow_set, train_bow_labels.ravel())

Computing accuracy for the bag-of-words features on the full test set:

In [None]:
model.score(test_bow_set, test_bow_labels.ravel())  # This cell may take some time to finish its execution
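
The score above is the mean accuracy: the fraction of test documents whose predicted label matches the true label. As a rough sketch of what a 1-nearest-neighbour classifier does under the hood, here is a pure-Python version on made-up data. It is an illustration under simplifying assumptions (Euclidean distance, brute-force search), not scikit-learn's implementation:

```python
def squared_distance(u, v):
    """Squared Euclidean distance (same nearest neighbour as the true distance)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def predict_1nn(train_vectors, train_labels, query):
    """Return the label of the training vector closest to the query."""
    best = min(range(len(train_vectors)),
               key=lambda i: squared_distance(train_vectors[i], query))
    return train_labels[best]

def accuracy(train_vectors, train_labels, test_vectors, test_labels):
    """Fraction of test vectors whose predicted label equals the true label."""
    correct = sum(predict_1nn(train_vectors, train_labels, x) == y
                  for x, y in zip(test_vectors, test_labels))
    return correct / len(test_labels)

# Made-up BoW vectors for two classes (0 and 1)
train_x = [[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 4, 3]]
train_y = [0, 0, 1, 1]
test_x = [[3, 1, 0], [0, 2, 2], [2, 2, 1]]
test_y = [0, 1, 1]

print(accuracy(train_x, train_y, test_x, test_y))  # 2 of 3 correct
```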

### Summary

From the above experiment we can observe that the bag-of-words representation yields one vector per document. These vectors serve as the features that different algorithms, such as the kNN classifier used here, take as input to classify the text.