Dataset: Large Movie Review Dataset (IMDb), http://ai.stanford.edu/%7Eamaas/data/sentiment/


### Text Classification

Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software, for example, uses text classification to decide whether incoming mail goes to the inbox or is filtered into the spam folder.
This is an example of topic classification: categorizing a text document into one of a predefined set of topics. In many topic classification problems, this categorization is based primarily on keywords in the text.

Another common type of text classification is sentiment analysis, whose goal is to identify the polarity of text content: the type of opinion it expresses.

#### Steps for Solving a Problem with Machine Learning
- Gather Data
- Explore your data 
- Choose a Model
- Prepare your data
- Build, train and evaluate your model
- Tune hyperparameters
- Deploy your model

#### Types of Models that can be used for Text Classification
- Models can be broadly classified into two categories: those that use word ordering information (sequence models), and ones that just see text as “bags” (sets) of words (n-gram models). 
- Types of sequence models include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and their variations. Types of n-gram models include logistic regression, simple multi-layer perceptrons (MLPs, or fully-connected neural networks), gradient boosted trees and support vector machines.
- It has been observed that the ratio of “number of samples” (S) to “number of words per sample” (W) correlates with which model performs well.
- When the value for this ratio is small (<1500), small multi-layer perceptrons that take n-grams as input perform better or at least as well as sequence models. 
- MLPs are simple to define and understand, and they take much less compute time than sequence models. 
- When the value for this ratio is large (>= 1500), use a sequence model.
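
The rule of thumb above can be written as a small helper. This is a minimal sketch: the function name `choose_model_family` and the example numbers are illustrative assumptions, and only the 1500 threshold comes from the text.

```python
def choose_model_family(num_samples, median_words_per_sample, threshold=1500):
    """Pick a model family from the S/W ratio described above.

    num_samples: number of samples in the dataset (S).
    median_words_per_sample: typical number of words per sample (W).
    """
    if median_words_per_sample <= 0:
        raise ValueError("median_words_per_sample must be positive")
    ratio = num_samples / median_words_per_sample
    # Below the threshold, an n-gram MLP is expected to do at least as well
    # as a sequence model while being cheaper to train.
    return "n-gram MLP" if ratio < threshold else "sequence model"


# Hypothetical dataset sizes, for illustration only.
print(choose_model_family(25000, 174))
print(choose_model_family(400000, 100))
```

The threshold is a heuristic, so it is kept as a parameter rather than hard-coded.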

#### Algorithm for Data Preparation and Model Building
1. Calculate the number of samples/number of words per sample ratio.
2. If this ratio is less than 1500, tokenize the text as n-grams and use a simple multi-layer perceptron (MLP) model to classify them (left branch in the flowchart below):
  - Split the samples into word n-grams; convert the n-grams into vectors.
  - Score the importance of the vectors and then select the top 20K using the scores.
  - Build an MLP model.
3. If the ratio is 1500 or greater, tokenize the text as sequences and use a sepCNN model to classify them (right branch in the flowchart below):
  - Split the samples into words; select the top 20K words based on their frequency.
  - Convert the samples into word sequence vectors.
  - If the original number of samples/number of words per sample ratio is less
     than 15K, using a fine-tuned pre-trained embedding with the sepCNN
     model will likely provide the best results.
4. Measure the model performance with different hyperparameter values to find
   the best model configuration for the dataset.
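
The sequence branch (step 3) can be sketched with the standard library alone. The helper names `build_vocabulary` and `texts_to_sequences`, the whitespace tokenizer, and the choice of 0 as the padding id are assumptions made for illustration; a real pipeline would more likely use a library tokenizer.

```python
from collections import Counter


def build_vocabulary(texts, top_k=20000):
    """Map the top_k most frequent words to integer ids starting at 1.

    Id 0 is reserved for padding.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common(top_k))}


def texts_to_sequences(texts, vocab, max_len):
    """Convert each text to a fixed-length list of word ids.

    Words outside the vocabulary are dropped, long texts are truncated,
    and short texts are padded with 0 on the right.
    """
    sequences = []
    for text in texts:
        ids = [vocab[w] for w in text.lower().split() if w in vocab][:max_len]
        sequences.append(ids + [0] * (max_len - len(ids)))
    return sequences


# Toy data, for illustration only.
toy_texts = ["good good good film", "bad film"]
toy_vocab = build_vocabulary(toy_texts, top_k=2)
toy_sequences = texts_to_sequences(toy_texts, toy_vocab, max_len=3)
print(toy_vocab)
print(toy_sequences)
```

With `top_k=2`, the word "bad" falls outside the vocabulary, so the second sample keeps only the id for "film" followed by padding.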

#### Preparing the Data before it can be fed to the Model
- Before the data can be fed to a model, it needs to be transformed to a format that the model can understand. 
- The order in which samples appear should carry no information about the relationship between texts and labels.
- For example, if a dataset is sorted by class and is then split into training/validation sets, these sets will not be representative of the overall distribution of data.
- A simple best practice to ensure the model is not affected by data order is to always shuffle the data before doing anything else. 
- If your data is already split into training and validation sets, make sure to transform your validation data the same way you transform your training data.
- If you don’t already have separate training and validation sets, you can split the samples after shuffling; it’s typical to use 80% of the samples for training and 20% for validation.
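
A shuffle followed by an 80/20 split might look like the sketch below. The function name `shuffle_and_split` and its parameters are assumptions for illustration. It shuffles a list of indices rather than the data itself, so each text stays paired with its label.

```python
import random


def shuffle_and_split(texts, labels, validation_fraction=0.2, seed=123):
    """Shuffle texts and labels together, then split into train/validation."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    indices = list(range(len(texts)))
    # A dedicated Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    num_val = int(len(indices) * validation_fraction)
    val_idx, train_idx = indices[:num_val], indices[num_val:]
    train = ([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([texts[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val


# Toy data sorted by class, the situation the text warns about.
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [0] * 5 + [1] * 5
(train_x, train_y), (val_x, val_y) = shuffle_and_split(toy_texts, toy_labels)
print(len(train_x), len(val_x))
```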

Machine learning algorithms take numbers as inputs. This means that we will need to convert the texts into numerical vectors. There are two steps to this process:
- Tokenization: Divide the texts into words or smaller sub-texts, which will enable good generalization of the relationship between the texts and the labels. This determines the “vocabulary” of the dataset (set of unique tokens present in the data).
- Vectorization: Define a good numerical measure to characterize these texts.
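
To make the two steps concrete, here is a minimal sketch that tokenizes texts into word unigrams and bigrams and then vectorizes them as n-gram counts. The helper names, the whitespace tokenizer, and the use of raw counts are assumptions for illustration; other measures such as tf-idf are common alternatives. The vocabulary is fitted on the training texts only and then reused unchanged for validation texts, in line with the advice above to transform both sets the same way.

```python
from collections import Counter


def word_ngrams(text, ngram_range=(1, 2)):
    """Tokenization: split a text into word n-grams (unigrams and bigrams by default)."""
    words = text.lower().split()
    min_n, max_n = ngram_range
    return [" ".join(words[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(words) - n + 1)]


def fit_vocabulary(train_texts, ngram_range=(1, 2)):
    """Assign a column index to every n-gram seen in the training texts."""
    vocab = {}
    for text in train_texts:
        for token in word_ngrams(text, ngram_range):
            vocab.setdefault(token, len(vocab))
    return vocab


def count_vectorize(texts, vocab, ngram_range=(1, 2)):
    """Vectorization: one count vector per text; unseen n-grams are ignored."""
    vectors = []
    for text in texts:
        counts = Counter(t for t in word_ngrams(text, ngram_range) if t in vocab)
        vector = [0] * len(vocab)
        for token, count in counts.items():
            vector[vocab[token]] = count
        vectors.append(vector)
    return vectors


# Toy data, for illustration only.
toy_train = ["the mouse ran", "the mouse"]
toy_vocab = fit_vocabulary(toy_train)
toy_vectors = count_vectorize(toy_train, toy_vocab)
print(toy_vocab)
print(toy_vectors)
```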

In [1]:
# Directory that contains the extracted 'aclImdb' folder; adjust for your machine.
data_path = '/home/deepshikha/Deepshikha'

In [11]:
import os
import random
import numpy as np
import matplotlib.pyplot as plt
imdb_data_path = os.path.join(data_path, 'aclImdb')

In [6]:
# Load the training reviews; label 1 = positive, 0 = negative.
train_texts = []
train_labels = []
for category in ['pos', 'neg']:
    train_path = os.path.join(imdb_data_path, 'train', category)
    for fname in sorted(os.listdir(train_path)):
        if fname.endswith('.txt'):
            # Read as UTF-8 explicitly so the result does not depend on the platform default.
            with open(os.path.join(train_path, fname), encoding='utf-8') as f:
                train_texts.append(f.read())
            train_labels.append(0 if category == 'neg' else 1)

In [9]:
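# Load the test reviews the same way; label 1 = positive, 0 = negative.
test_texts = []
test_labels = []
for category in ['pos', 'neg']:
    test_path = os.path.join(imdb_data_path, 'test', category)
    for fname in sorted(os.listdir(test_path)):
        if fname.endswith('.txt'):
            # Read as UTF-8 explicitly so the result does not depend on the platform default.
            with open(os.path.join(test_path, fname), encoding='utf-8') as f:
                test_texts.append(f.read())
            test_labels.append(0 if category == 'neg' else 1)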

In [13]:
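# Shuffle texts and labels with the same seed so both lists receive the
# same permutation and each review stays paired with its label.
random.seed(123)
random.shuffle(train_texts)
random.seed(123)
random.shuffle(train_labels)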

In [14]:
def get_num_words_per_sample(sample_texts):
    """Return the median number of whitespace-separated words per sample."""
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

def plot_sample_length_distribution(sample_texts):
    """Plot a 50-bin histogram of sample lengths, measured in characters."""
    plt.hist([len(s) for s in sample_texts], 50)
    plt.xlabel('Length of a sample (characters)')
    plt.ylabel('Number of samples')
    plt.title('Sample length distribution')
    plt.show()