# Data Mining: Classification analysis on textual data

## Idea

Our brains easily grasp the meaning of a word from the context in which it is used. How can a machine learning algorithm do the same? This idea is quite interesting, isn't it? Suppose we are in a library and ask the librarian to sort 10,000 books by genre. It would take a couple of days to do so. What if we have 10,000 digital books and ask an algorithm to do the sorting? Is it possible? Yes, it is.

## Data and what it represents

I've used the 20 Newsgroups dataset for classification analysis of textual data. You can download the dataset from [here](https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups).

## Dependencies

This project was coded in Python 3.7.1 in an Anaconda Jupyter Notebook environment.
Please install the following packages before running the code. The notebook also imports pandas to display tables.

1. nltk v3.2.2
2. numpy v1.11.3
3. matplotlib v2.0.0
4. sklearn v0.18.1

# Part 1: Model Text Data and Feature Extraction

The algorithm can't read textual data directly, so the text needs to be encoded as integers or floating-point values.
Let's go through this step by step.

Step 1: Tokenization: Splitting text into individual words (tokens).

Step 2: Feature extraction (or vectorization): Encoding words as integers to be fed as input to the algorithm.
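
As a rough illustration of the two steps above, the sketch below tokenizes one sentence with a regular expression and then maps each distinct token to an integer id. The token pattern (runs of two or more word characters) and the helper names `tokenize` and `build_vocabulary` are assumptions made for this example only.

```python
import re

# Assumption for this sketch: a token is a run of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Step 1: split raw text into lowercase word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(tokens):
    """Step 2: give every distinct token an integer id, in alphabetical order."""
    return {token: idx for idx, token in enumerate(sorted(set(tokens)))}

text = "The drive needs a new drive controller"
tokens = tokenize(text)
vocabulary = build_vocabulary(tokens)
encoded = [vocabulary[token] for token in tokens]

print(tokens)      # ['the', 'drive', 'needs', 'new', 'drive', 'controller']
print(vocabulary)  # {'controller': 0, 'drive': 1, 'needs': 2, 'new': 3, 'the': 4}
print(encoded)     # [4, 1, 2, 3, 1, 0]
```

The single-letter word "a" is dropped by the pattern, and the repeated word "drive" gets the same id both times.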

Read more about how to do this [here](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/).

Also read about the 'bag of words' and 'TF-IDF' methods. These convert textual data into a vectorised array of floating-point values.
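
To make the two methods a little more concrete, here is a minimal sketch on a made-up, already tokenized corpus. It uses the textbook weighting, term count times ln(N / document frequency); scikit-learn applies a smoothed variant, so its numbers differ. The corpus and the helper names are illustrative assumptions.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for this example).
docs = [
    ["mac", "drive", "drive"],
    ["mac", "window"],
    ["window", "driver", "window"],
]

# One column per distinct term, in alphabetical order.
vocabulary = sorted({term for doc in docs for term in doc})

def bag_of_words(doc):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def tf_idf(doc):
    """Textbook weighting: term count times ln(N / document frequency)."""
    n_docs = len(docs)
    return [tf * math.log(n_docs / document_frequency(term))
            for term, tf in zip(vocabulary, bag_of_words(doc))]

counts = [bag_of_words(doc) for doc in docs]
weights = [tf_idf(doc) for doc in docs]

print(vocabulary)  # ['drive', 'driver', 'mac', 'window']
print(counts)      # [[2, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 2]]
```

In the first document, "drive" ends up with a larger weight than "mac" because it occurs twice and appears in only one document, while "mac" appears in two.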

### Let's import the dataset from the scikit-learn library

In [2]:
from sklearn.datasets import fetch_20newsgroups

There are a total of 20 classes. You can find the [list of classes here](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html).

For the sake of understanding, we will consider only the following four classes: 

    'comp.graphics',

    'comp.os.ms-windows.misc',
 
    'comp.sys.ibm.pc.hardware',
 
    'comp.sys.mac.hardware'.
 
**Since all four are about computer technology, we will henceforth treat them as one class, 'Computer Technology', with the four groups above as its subclasses.**

Let's then make a list of these subclasses as follows.

In [3]:
computer_technology_subclasses=['comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.sys.mac.hardware']

In [4]:
computer_technology_subclasses

['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware']

# Preprocessing data

We need to make train and test data for the above class. This is quite easy to do.
If you look at the docstring of fetch_20newsgroups below, you can see that subset can be set to 'train', 'test', or 'all'.

In [6]:
help(fetch_20newsgroups)

Help on function fetch_20newsgroups in module sklearn.datasets.twenty_newsgroups:

fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)
    Load the filenames and data from the 20 newsgroups dataset (classification).
    
    Download it if necessary.
    
    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features                  text
    
    Read more in the :ref:`User Guide <20newsgroups_dataset>`.
    
    Parameters
    ----------
    data_home : optional, default: None
        Specify a download and cache folder for the datasets. If None,
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
    
    subset : 'train' or 'test', 'all', optional
        Select the dataset to load: 'train' for the training set, 'test'
        for the test set, 'all' for both, with shuffled ordering.
    
    categories : None or collect

**We also need to remove headers, footers, quotes, punctuation, and stop words (e.g., the, and, or). Stemming reduces related words to a common root by stripping their suffixes.**
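
The sketch below shows the idea behind stop-word removal and stemming on a handful of words. It is a deliberately naive toy: the stop-word set and the suffix list are assumptions for this example, and the suffix stripping is not the Snowball algorithm used later in this notebook.

```python
# A tiny, hand-picked stop-word list (real lists are much longer).
STOP_WORDS = {"the", "and", "or", "a", "is", "are"}

# Suffixes stripped by this toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalise(words):
    """Drop stop words, then stem what is left."""
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

words = ["the", "drives", "and", "the", "printing", "jumpers", "are", "connected"]
print(normalise(words))  # ['drive', 'print', 'jumper', 'connect']
```

After this step, "drives" and "drive" would count as the same feature, which is the point of stemming.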

In [8]:
# Forming train and test data
computer_train=fetch_20newsgroups(subset='train',categories=computer_technology_subclasses,shuffle=True,random_state=42,
                                  remove=('headers','footers','quotes'))
computer_test=fetch_20newsgroups(subset='test',categories=computer_technology_subclasses,shuffle=True,random_state=42,
                                 remove=('headers','footers','quotes'))

In [9]:
from nltk.stem.snowball import SnowballStemmer

In [12]:
#defining the stemmer to be used in preprocessing the data
stemmer=SnowballStemmer("english")
stemmer

<nltk.stem.snowball.SnowballStemmer at 0x1a303cd1a90>

In [13]:
#defining the list of punctuation characters to be trimmed off the data in the preprocessing stage
punctuations='[! \" # $ % \& \' \( \) \ * + , \- \. \/ : ; <=> ? @ \[ \\ \] ^ _ ` { \| } ~]'
punctuations

'[! " # $ % \\& \' \\( \\) \\ * + , \\- \\. \\/ : ; <=> ? @ \\[ \\ \\] ^ _ ` { \\| } ~]'

In [16]:
import re 
#You can find more information by using Shift+Tab on re

In [17]:
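#function for stemming and removing punctuation (modifies the list in place)
def preprocess(data):
    for i in range(len(data)):
        #split on punctuation and spaces, then stem each resulting token
        data[i]=" ".join([stemmer.stem(token) for token in re.split(punctuations,data[i])])
        #note: newlines, tabs and carriage returns are deleted, not replaced by a space,
        #so words separated only by one of them end up merged
        data[i]=data[i].replace('\n','').replace('\t','').replace('\r','')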

In [18]:
#preprocess the two datasets
preprocess(computer_train.data)
preprocess(computer_test.data)
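
The following sketch isolates the split-and-join part of the preprocessing (stemming is left out) to show how line breaks are handled. The pattern is built from `string.punctuation` plus a space, which approximates the pattern defined above rather than copying it, and both function names are made up for this example.

```python
import re
import string

# Approximation of the pattern above: ASCII punctuation plus the space character.
SPLIT_PATTERN = "[" + re.escape(string.punctuation) + " ]"

def clean_dropping_whitespace(text):
    """Split on punctuation and spaces, re-join, then delete line breaks and tabs."""
    tokens = re.split(SPLIT_PATTERN, text)
    joined = " ".join(tokens)
    return joined.replace("\n", "").replace("\t", "").replace("\r", "")

def clean_keeping_boundaries(text):
    """Alternative: treat any whitespace as a separator and drop empty tokens."""
    tokens = re.split("[" + re.escape(string.punctuation) + r"\s]+", text)
    return " ".join(token for token in tokens if token)

sample = "disk with\nan error"
print(clean_dropping_whitespace(sample))  # disk withan error
print(clean_keeping_boundaries(sample))   # disk with an error
```

With the first approach, two words separated only by a newline are glued together, which would explain vocabulary entries such as 'withan' in the output further down. The second variant is one possible way to keep word boundaries, but it would change the term counts reported in this notebook.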

In [20]:
type(computer_train.data)

list

The **CountVectorizer** provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

The **TfidfVectorizer** will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternatively, if you already have a learned **CountVectorizer**, you can use it with a **TfidfTransformer** to just calculate the inverse document frequencies and start encoding documents.
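
The "learn a vocabulary, then encode new documents with it" idea can be sketched in a few lines of plain Python. `TinyCountVectorizer` below is a hypothetical stand-in written for this example; it is not scikit-learn's implementation and skips options such as stop words.

```python
import re
from collections import Counter

class TinyCountVectorizer:
    """Toy illustration of fit/transform; not scikit-learn's implementation."""

    def __init__(self):
        self.vocabulary_ = {}

    def _tokenize(self, text):
        # Assumption: tokens are runs of two or more word characters.
        return re.findall(r"\b\w\w+\b", text.lower())

    def fit(self, documents):
        """Learn the vocabulary: one column index per distinct term."""
        terms = sorted({t for doc in documents for t in self._tokenize(doc)})
        self.vocabulary_ = {term: idx for idx, term in enumerate(terms)}
        return self

    def transform(self, documents):
        """Encode documents as count vectors over the learned vocabulary."""
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for term, count in Counter(self._tokenize(doc)).items():
                if term in self.vocabulary_:  # unseen terms are ignored
                    row[self.vocabulary_[term]] = count
            rows.append(row)
        return rows

train_docs = ["mac hardware question", "windows driver question"]
toy_vectorizer = TinyCountVectorizer().fit(train_docs)
new_counts = toy_vectorizer.transform(["mac mac printer question"])

print(toy_vectorizer.vocabulary_)
# {'driver': 0, 'hardware': 1, 'mac': 2, 'question': 3, 'windows': 4}
print(new_counts)  # [[0, 0, 2, 1, 0]]
```

The word "printer" never appeared during fitting, so it has no column and is silently dropped when the new document is encoded.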

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Let's create our tfidf matrix. 

CountVectorizer will create the vocabulary of known words.

In [43]:
#Creating the instance of CountVectorizer class and removing stop words.
#min_df = 2 means that any word appearing in fewer than 2 documents is considered irrelevant
# and discarded.
vectorizer=CountVectorizer(min_df=2,stop_words ='english')
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
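
The effect of `min_df=2` can be sketched without scikit-learn: a term is kept only if it occurs in at least two different documents, no matter how often it is repeated inside a single one. The toy documents and the function name `filter_by_document_frequency` are assumptions for this example.

```python
from collections import Counter

# Toy corpus, already tokenized (made up for this example).
tokenized_docs = [
    ["vram", "socket", "ram"],
    ["ram", "slot"],
    ["socket", "ram", "ram"],
]

def filter_by_document_frequency(docs, min_df):
    """Keep terms that appear in at least `min_df` different documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

kept = filter_by_document_frequency(tokenized_docs, min_df=2)
print(kept)  # ['ram', 'socket']
```

Here "vram" and "slot" each occur in a single document and are discarded.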

In [45]:
#tokenize and build vocab
vectorizer_counts=vectorizer.fit_transform(computer_train.data+computer_test.data)
vectorizer_counts

<3903x17594 sparse matrix of type '<class 'numpy.int64'>'
	with 190613 stored elements in Compressed Sparse Row format>

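A sparse matrix stores only its non-zero entries. Here only 190,613 of the 3903 x 17594 cells (about 68.7 million) are non-zero, roughly 0.28%, so almost every entry is zero.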
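
The output above mentions the Compressed Sparse Row (CSR) format. As a rough sketch of that layout, the function below converts a small dense matrix into the three arrays CSR uses: the non-zero values, their column indices, and row pointers. The function `to_csr` is written for this example; SciPy stores the same information in NumPy arrays.

```python
def to_csr(dense):
    """Convert a list-of-lists matrix into CSR-style arrays."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it sits in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

dense = [
    [0, 0, 3, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 4],
]
data, indices, indptr = to_csr(dense)

print(data)     # [3, 1, 2, 4]
print(indices)  # [2, 0, 1, 3]
print(indptr)   # [0, 1, 2, 2, 4]
```

Row `i` owns the slice `data[indptr[i]:indptr[i + 1]]`, so the empty third row costs nothing beyond one repeated pointer.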

In [67]:
#Summarize
print(vectorizer.vocabulary_)

{'lc': 9692, 'iii': 8505, 'pop': 12355, 'reveal': 13322, 'socket': 14205, 'addit': 3081, 'vram': 16457, '72': 2236, 'pin': 12203, 'ram': 12999, 'flat': 7089, 'pack': 11907, 'processor': 12528, 'direct': 5971, 'slot': 14142, 'pds': 12051, 'ident': 8460, 'ii': 8497, 'withan': 16860, 'set': 13899, '32': 1171, 'path': 12000, 'guess': 7819, 'board': 4296, 'powerpc': 12404, 'chip': 4886, 'onli': 11635, 'place': 12249, 'hi': 8107, 'believ': 4093, 'undocu': 15911, 'featur': 6949, 'window': 16811, 'directori': 5975, 'microsoft': 10545, 'diagnost': 5907, 'ver': 16277, '00': 0, 'specif': 14315, 'depth': 5836, 'explan': 6785, 'legend': 9727, 'memori': 10424, 'map': 10242, 'report': 13249, 'thank': 15022, 'dh': 5899, 'doe': 6109, 'anyon': 3508, 'mountain': 10782, 'tape': 14877, 'backup': 3975, 'note': 11290, 'jumper': 9278, 'softwar': 14211, 'know': 9495, 'contact': 5295, 'maker': 10207, 'drive': 6196, 'network': 11120, 'solut': 14232, '800': 2439, '458': 1532, '0300': 46, 'general': 7502, 'number'

This creates a dictionary mapping words to indices. Each word is a key, and its value is the index of that word's column in the count matrix.
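
A small sketch of how such a dictionary can be used, based on four entries copied from the printed output above. Inverting it gives a lookup from column index back to the term; the name `index_to_term` is made up for this example.

```python
# A few entries copied from the printed vocabulary above.
vocabulary = {"lc": 9692, "iii": 8505, "pop": 12355, "00": 0}

# Invert the mapping so a column index can be turned back into its term.
index_to_term = {index: term for term, index in vocabulary.items()}

# Terms ordered by their column index.
terms_by_column = sorted(vocabulary, key=vocabulary.get)

print(vocabulary["pop"])  # 12355
print(index_to_term[0])   # 00
print(terms_by_column)    # ['00', 'iii', 'lc', 'pop']
```

In these entries the column order matches the alphabetical order of the terms, which suggests the indices are assigned after sorting the vocabulary.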

In [74]:
import pandas as pd
pd.DataFrame(vectorizer.vocabulary_, index = [0])

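| | lc | iii | pop | reveal | socket | addit | vram | 72 | pin | ram | ... | sympathi | samea | genlock | infam | auxiliari | 64mb | 432 | windoz | problemof | selector |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9692 | 8505 | 12355 | 13322 | 14205 | 3081 | 16457 | 2236 | 12203 | 12999 | ... | 14757 | 13675 | 7513 | 8672 | 3842 | 1976 | 1501 | 16825 | 12509 | 13851 |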


In [75]:
#Encode the document
tfidf_transformer=TfidfTransformer()
tfidf_transformer

TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)
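
With the parameters shown above (`use_idf=True`, `smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), the weighting can be reproduced by hand on a tiny count matrix. This is a sketch of the documented formula, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row; the count matrix is made up and the variable names are illustrative.

```python
import math

# Toy count matrix: 2 documents x 3 terms (made up for this example).
counts = [
    [2, 0, 1],
    [0, 1, 1],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many rows is each column non-zero?
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# smooth_idf=True: idf = ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalise(row):
    """Scale a row so that its Euclidean length is 1 (norm='l2')."""
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row] if length else row

# sublinear_tf=False: raw counts are used as the term frequency.
tfidf = [l2_normalise([tf * w for tf, w in zip(row, idf)]) for row in counts]

for row in tfidf:
    print([round(v, 4) for v in row])
```

A term that occurs in every document (the third column here) gets an idf of exactly 1, so it is kept but never boosted, and every non-empty row ends up with unit length.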

In [76]:
vectorizer_tfidf = tfidf_transformer.fit_transform(vectorizer_counts)
vectorizer_tfidf

<3903x17594 sparse matrix of type '<class 'numpy.float64'>'
	with 190613 stored elements in Compressed Sparse Row format>

So, we have converted the documents of the 4 classes into numerical feature vectors by first tokenising each document into words and then excluding stop words and punctuation.

After that, we created TFxIDF vector representations.


We will conclude by reporting the number of terms we extracted.


In [77]:
print('Min Frequency: 2')
print('Number of Terms: '+str(vectorizer_tfidf.shape[1]))

Min Frequency: 2
Number of Terms: 17594


In [78]:
#Summarize encoded vector
print(vectorizer_tfidf.shape)


(3903, 17594)


In [79]:
print(vectorizer_tfidf.toarray())

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.22448033 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [80]:
import pandas as pd

In [81]:
pd.DataFrame(vectorizer_tfidf.toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,17584,17585,17586,17587,17588,17589,17590,17591,17592,17593
0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.22448,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [82]:
pd.DataFrame(vectorizer_tfidf.data)

| | 0 |
|---|---|
| 0 | 0.178612 |
| 1 | 0.121869 |
| 2 | 0.385321 |
| 3 | 0.113675 |
| 4 | 0.079552 |
| 5 | 0.170767 |
| 6 | 0.094418 |
| 7 | 0.111479 |
| 8 | 0.144066 |
| 9 | 0.141086 |
