# Document Classification

To explore the objectives of this project, refer to the [Read Me](README.md).

If you are running this project for the first time, first install the required libraries:
```
pip install -r requirements.txt
```

Then run `main.py` to collect online resources with:
```
py main.py
```

This notebook deals with the classification of blog posts. Resource files are linked through the Google Colaboratory/Drive integration; adjust them to local paths if necessary.

In [None]:
# To shuffle data keys
import numpy as np

# To manipulate data
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# For encoding labels, splitting data and measuring performance
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# For generating visual presentations
import seaborn as sns
import matplotlib.pyplot as plt

# Trained models
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# To access local resources
import file_manager as fm

## Preprocessing



In [None]:
links = fm.get("_all_links")
data_block = fm.get("_data_block")
str_list = []
tags = []

keys = np.array(list(data_block.keys()))
np.random.shuffle(keys)
for k in keys:
    full_string = " ".join([token for token in data_block[k]['tokens'] if len(token) >= 2])
    str_list.append(full_string)
    tags.append(data_block[k]['tag'])
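
The loop above indexes each entry of `data_block` with `'tokens'` and `'tag'`, so it appears to expect a dict of dicts holding a token list and a tag string. That structure is inferred from the code, not from `file_manager` itself. A minimal sketch with made-up entries (all names and values below are hypothetical) shows what the joined strings and tags end up holding. The seeded generator is an addition so the shuffle is reproducible:

```python
import numpy as np

# Hypothetical stand-in for fm.get("_data_block"); the real contents come from main.py.
toy_block = {
    "post-1": {"tokens": ["a", "quick", "recipe", "for", "soup"], "tag": "food"},
    "post-2": {"tokens": ["python", "3", "release", "notes"], "tag": "tech"},
}

rng = np.random.default_rng(0)  # seeded so the shuffle is reproducible
toy_keys = rng.permutation(list(toy_block.keys()))

toy_strings, toy_tags = [], []
for k in toy_keys:
    # Drop single-character tokens, then rejoin into one string per post.
    toy_strings.append(" ".join(t for t in toy_block[k]["tokens"] if len(t) >= 2))
    toy_tags.append(toy_block[k]["tag"])

print(list(zip(toy_strings, toy_tags)))
```

Each string stays aligned with its tag because both lists are filled in the same pass over the shuffled keys.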

## Streamlining

The functions below are used in the model training section to streamline training and validation and to keep the code snippets readable.

In [None]:
def cross_validation_wrapper(data, labels, vector_type, model, test_split_ratio = 0.1):
    X_train, X_test, y_train, y_test = data_formatter(data, labels, vector_type, test_split_ratio)
    results = cross_validation(model, X_train, y_train, 10)
    showResults(results)

In [None]:
def data_formatter(data, labels, vec_type, test_size = 0.1):
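    if vec_type == 'bow':
        # Bag-of-words: raw term counts
        cv = CountVectorizer()
        X = cv.fit_transform(data)
    elif vec_type == 'tfidf':
        # TF-IDF weighted term counts
        tfidf = TfidfVectorizer()
        X = tfidf.fit_transform(data)
    else:
        # Fail early instead of hitting a NameError on X below
        raise ValueError(f"Unknown vec_type: {vec_type!r}; expected 'bow' or 'tfidf'")

    # Optional dimensionality reduction (requires sklearn.decomposition.TruncatedSVD):
    # svd = TruncatedSVD(n_components = 10, random_state = 123)
    # X = svd.fit_transform(X)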
    le = LabelEncoder()
    y = le.fit_transform(labels)
    return train_test_split(X, y, test_size = test_size)
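
`data_formatter` fits the vectorizer on the whole corpus before splitting, so vocabulary and IDF statistics from the held-out rows also shape the training features. Whether that changes the results on this dataset is not measured here. One alternative, which is not the notebook's method, is to split the raw strings first and let a scikit-learn `Pipeline` fit the vectorizer on the training portion only. The corpus and variable names below are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus standing in for str_list / tags.
docs = ["soup recipe onion", "bread recipe flour", "python release notes",
        "rust compiler update", "cake recipe sugar", "linux kernel patch"]
labels = ["food", "food", "tech", "tech", "food", "tech"]

# Split the raw text first so the test rows stay unseen by the vectorizer.
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=2, random_state=0, stratify=labels)

pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(docs_train, y_train)   # vocabulary and IDF learned from training rows only
pred = pipe.predict(docs_test)  # test rows are transformed with that fitted vocabulary
print(pred)
```

The same pipeline object can be passed to `cross_validate` together with the raw strings, in which case the vectorizer is refitted inside every fold.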

In [None]:
def cross_validation(model, _X, _y, _cv):
      '''Function to perform K-Fold Cross-Validation
       Parameters
       ----------
      model: estimator
              The scikit-learn estimator (machine learning algorithm) to cross-validate.
      _X: array
           This is the matrix of features.
      _y: array
           This is the target variable.
      _cv: int
          Determines the number of folds for cross-validation.
       Returns
       -------
       The function returns a dictionary containing the metrics 'accuracy', 'precision',
       'recall', 'f1' for both training set and validation set.
      '''
      _scoring = ['accuracy', 'precision', 'recall', 'f1']
      results = cross_validate(estimator=model,
                               X=_X,
                               y=_y,
                               cv=_cv,
                               scoring=_scoring,
                               return_train_score=True)
      
      return {"Training Accuracy scores": results['train_accuracy'],
              "Mean Training Accuracy": results['train_accuracy'].mean()*100,
              "Training Precision scores": results['train_precision'],
              "Mean Training Precision": results['train_precision'].mean(),
              "Training Recall scores": results['train_recall'],
              "Mean Training Recall": results['train_recall'].mean(),
              "Training F1 scores": results['train_f1'],
              "Mean Training F1 Score": results['train_f1'].mean(),
              "Validation Accuracy scores": results['test_accuracy'],
              "Mean Validation Accuracy": results['test_accuracy'].mean()*100,
              "Validation Precision scores": results['test_precision'],
              "Mean Validation Precision": results['test_precision'].mean(),
              "Validation Recall scores": results['test_recall'],
              "Mean Validation Recall": results['test_recall'].mean(),
              "Validation F1 scores": results['test_f1'],
              "Mean Validation F1 Score": results['test_f1'].mean()
              }
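
The scorer names `'precision'`, `'recall'` and `'f1'` use binary averaging in scikit-learn, so the function above assumes a two-class target. The notebook does not state how many tags the dataset has. If there are more than two, averaged variants such as `'precision_macro'` are one option. A sketch on synthetic three-class data, where every name is chosen for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the vectorized blog posts.
X_demo, y_demo = make_classification(n_samples=150, n_features=10, n_informative=5,
                                     n_classes=3, random_state=0)

macro_scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
cv_out = cross_validate(DecisionTreeClassifier(random_state=0), X_demo, y_demo,
                        cv=5, scoring=macro_scoring, return_train_score=True)

# Result keys follow the pattern test_<scorer> and train_<scorer>.
for name in macro_scoring:
    print(name, cv_out[f"test_{name}"].mean())
```

Macro averaging weights every class equally. `'f1_weighted'` and its siblings weight by class frequency instead, which may suit imbalanced tags better.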

In [None]:
def showResults(results):
    for k in results.keys():
        print(k)
        try:
            # Per-fold score arrays: print one line per fold
            for i in range(len(results[k])):
                print(f'Fold {i + 1}: ', results[k][i])
        except TypeError:
            # Mean values are scalars and have no len()
            print(results[k])

In [None]:
def trainModel(estimator, data, tags, vector_type, test_split_ratio):
    X_train, X_test, y_train, y_test = data_formatter(data, tags, vector_type, test_split_ratio)
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    print(f'Accuracy = {accuracy_score(y_test, y_pred)}')
    print('Classification Report')
    print(classification_report(y_test,y_pred))
    plt.figure(figsize = (5,5))
    sns.heatmap(confusion_matrix(y_test, y_pred), annot = True)
    plt.show()

## Model Training

We have trained 5 different models, using 80% of the data as the training dataset and the remaining 20% for testing.

These models are:
- Multinomial Naive Bayes
- Decision Tree
- Random Forest
- Support Vector Machines (SVM)
- Recurrent Neural Network (RNN)

For each model we have used both the non-weighted (bag-of-words counts) and the TF-IDF weighted versions of the data, and we compare them as we go along.
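
To make the two representations concrete, the sketch below vectorizes a three-document toy corpus (made up for this example) with both `CountVectorizer` and `TfidfVectorizer` at their default settings. It is an illustration under those assumptions, not output from the project's data:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["recipe soup soup", "recipe bread", "recipe python"]

bow = CountVectorizer()
counts = bow.fit_transform(corpus).toarray()

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus).toarray()

# Column order of both matrices, recovered from the fitted vocabulary.
vocab = sorted(bow.vocabulary_, key=bow.vocabulary_.get)
print(vocab)
print(counts)            # raw term counts per document
print(weights.round(2))  # rows are L2-normalised; terms found in every document get less weight
```

Here `recipe` occurs in every document, so TF-IDF assigns it a lower weight than the rarer terms in the same row, while the count matrix treats all terms alike.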

---


### Multinomial Naive Bayes

The Multinomial Naive Bayes algorithm is a Bayesian learning approach popular in Natural Language Processing (NLP). It predicts the tag of a text, such as an email or a newspaper story, using Bayes' theorem: it calculates the probability of each tag for a given sample and outputs the tag with the highest probability.
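
The per-tag probabilities described above can be inspected with `predict_proba`. This is a minimal sketch on a made-up four-document corpus, using scikit-learn's default smoothing (`alpha=1.0`). The documents and tags are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["soup recipe onion", "bread recipe flour",
              "python release notes", "compiler release update"]
train_tags = ["food", "food", "tech", "tech"]

vec = CountVectorizer()
nb = MultinomialNB()  # default Laplace smoothing (alpha=1.0)
nb.fit(vec.fit_transform(train_docs), train_tags)

sample = vec.transform(["onion soup recipe"])
probs = nb.predict_proba(sample)[0]  # one posterior probability per tag
for tag, p in zip(nb.classes_, probs):
    print(f"P({tag} | sample) = {p:.3f}")
print("predicted:", nb.predict(sample)[0])  # the tag with the highest posterior
```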

#### BOW

In [None]:
clf = MultinomialNB()
cross_validation_wrapper(str_list, tags, 'bow', clf)

In [None]:
clf = MultinomialNB()
trainModel(clf, str_list, tags, 'bow', 0.2)

#### TF-IDF

In [None]:
clf = MultinomialNB()
cross_validation_wrapper(str_list, tags, 'tfidf', clf)

In [None]:
clf = MultinomialNB()
trainModel(clf, str_list, tags, 'tfidf', 0.2)

### Decision Tree

Decision tree classifiers provide a readable classification model that can be accurate in many different application contexts. The classifier creates the model by building a decision tree. Each node in the tree specifies a test on an attribute, and each branch descending from that node corresponds to one of the possible outcomes of that test. Each leaf holds the class label assigned to the instances that reach it. An instance is classified by navigating it from the root of the tree down to a leaf, according to the outcome of the tests along the path.
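
The node tests and leaves can be printed with scikit-learn's `export_text`. The sketch below fits a tree on a made-up bag-of-words corpus. The documents, tags and variable names are assumptions for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

tree_docs = ["soup recipe onion", "bread recipe flour",
             "python release notes", "compiler release update"]
tree_tags = ["food", "food", "tech", "tech"]

tree_vec = CountVectorizer()
X_tree = tree_vec.fit_transform(tree_docs)

tree = DecisionTreeClassifier(random_state=0).fit(X_tree, tree_tags)

# Column order of the count matrix, recovered from the fitted vocabulary.
feature_names = sorted(tree_vec.vocabulary_, key=tree_vec.vocabulary_.get)
print(export_text(tree, feature_names=feature_names))

tree_pred = tree.predict(tree_vec.transform(["recipe for onion soup"]))[0]
print(tree_pred)
```

Each printed line of the form `word <= 0.50` is a test on whether the word is absent from the document, and the indented `class:` lines are the leaves.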

#### BOW

In [None]:
dtc = DecisionTreeClassifier()
cross_validation_wrapper(str_list, tags, 'bow', dtc)

In [None]:
dtc = DecisionTreeClassifier()
trainModel(dtc,str_list, tags, 'bow', 0.2)

#### TF-IDF

In [None]:
dtc = DecisionTreeClassifier()
cross_validation_wrapper(str_list, tags, 'tfidf', dtc)


In [None]:
dtc = DecisionTreeClassifier()
trainModel(dtc,str_list, tags, 'tfidf', 0.2)

### Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set. Random forests generally outperform single decision trees, but their accuracy is often lower than that of gradient boosted trees, and data characteristics can affect their performance.
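
The sketch below looks inside a fitted forest of 15 trees, the same `n_estimators` value passed positionally in the cells that follow. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities instead of counting hard votes; with fully grown trees the two usually agree. The synthetic data and names are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data standing in for the vectorized posts.
X_rf, y_rf = make_classification(n_samples=200, n_features=8, n_informative=4,
                                 random_state=0)

forest = RandomForestClassifier(n_estimators=15, random_state=0).fit(X_rf, y_rf)

sample_rf = X_rf[:1]
# Each fitted tree is available in estimators_; collect their class probabilities.
per_tree = np.array([t.predict_proba(sample_rf)[0] for t in forest.estimators_])
averaged = per_tree.mean(axis=0)

print("per-tree votes:", per_tree.argmax(axis=1))
print("averaged probabilities:", averaged)
print("forest prediction:", forest.predict(sample_rf)[0])
```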

#### BOW

In [None]:
rfc = RandomForestClassifier(15)
cross_validation_wrapper(str_list, tags, 'bow', rfc)


In [None]:
rfc = RandomForestClassifier(15)
trainModel(rfc,str_list, tags, 'bow', 0.2)

#### TF-IDF

In [None]:
rfc = RandomForestClassifier(15)
cross_validation_wrapper(str_list, tags, 'tfidf', rfc)


In [None]:
rfc = RandomForestClassifier(15)
trainModel(rfc,str_list, tags, 'tfidf', 0.2)

### Support Vector Machines

Support Vector Machines (SVMs) are one of the most robust prediction methods, being based on statistical learning frameworks. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM maps training examples to points in space so as to maximise the width of the gap between the two categories. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
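
The gap and the "which side" rule can be seen in a small two-class sketch with a linear kernel, the kernel used in the cells below. The synthetic clusters and variable names are made up for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters standing in for two tags.
X_svm, y_svm = make_blobs(n_samples=40, centers=[[-3, -3], [3, 3]],
                          cluster_std=0.8, random_state=0)

svm = SVC(kernel="linear").fit(X_svm, y_svm)

# For a linear kernel the boundary is w . x + b = 0 and the margin width is 2 / ||w||.
w, b = svm.coef_[0], svm.intercept_[0]
margin_width = 2 / (w @ w) ** 0.5

new_points = [[-2.5, -2.0], [2.0, 3.5]]
scores = svm.decision_function(new_points)  # the sign gives the side of the boundary
svm_pred = svm.predict(new_points)
print("signed scores:", scores)
print("predictions:", svm_pred)
print("margin width:", margin_width)
```

With more than two tags, scikit-learn's `SVC` combines pairwise (one-vs-one) binary classifiers, so the two-category description still applies to each pair.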

#### BOW

In [None]:
svc = SVC(kernel = 'linear')
cross_validation_wrapper(str_list, tags, 'bow', svc)


In [None]:
svc = SVC(kernel = 'linear')
trainModel(svc,str_list, tags, 'bow', 0.2)

#### TF-IDF

In [None]:
svc = SVC(kernel = 'linear')
cross_validation_wrapper(str_list, tags, 'tfidf', svc)


In [None]:
svc = SVC(kernel = 'linear')
trainModel(svc,str_list, tags, 'tfidf', 0.2)

### Recurrent Neural Network

#### BOW

In [None]:
pass

#### TF-IDF

In [None]:
pass

## Comparison