# NLP981 Final Project - Phase #1

*   Instructor: Javad PourMostafa
*   Teaching Assistant: Parsa Abbasi
*   University of Guilan, 1st semester of 2019
*   GitHub repository : *https://github.com/JoyeBright/NLP*

In [0]:
# !pip install nltk

In [0]:
# import nltk
# nltk.download('punkt')

In [0]:
# sentence = "Hello, world!"
# print(nltk.word_tokenize(sentence))

This is the first phase of your final project for the *NLP981* course. The main idea behind this phase is to portray the development side of *NLP*.

You must write your code inside this Python notebook. I highly recommend using the *Google Colab* environment.

If you have any questions, feel free to ask.
You can use the [*Quera*](https://quera.ir/course/4385/) platform for your general questions.



## Introduction

In this phase of the project, you will build a category predictor.

The predictor takes a text as input and predicts a category for it.

For this purpose, you need to:

1.   Load the dataset
2.   Preprocess the text data
3.   Implement a word representation method to represent each text as a numeric vector
4.   Implement a classification model and train that using the training set
5.   Predict a category for each validation sample using the implemented model
6.   Evaluate your work using a confusion matrix and some common metrics

**Important Note:** You can use any library you want in sections 1 and 2, but everything in sections 3-6 must be implemented from scratch.



## 1) Dataset

The dataset you will use in this phase is called *Divar*, which was released by the *CafeBazaar* research team.

It contains more than 900,000 posts from the *Divar* ads platform. We split this dataset into training, validation, and testing sets.

The testing set is not accessible to you; we will use it to evaluate your work on the presentation day.

You can download the dataset files (training and validation sets) directly from the following link:

> *https://drive.google.com/open?id=1oj-fqpymjDr8QsOK-zQliiqXbVqakrFo*


### 1.1) Import

In [0]:
%%time

import pandas as pd

# Loading data from Google Drive.
# from google.colab import drive
# drive.mount('/gdrive')
# train_set = pd.read_csv('/gdrive/My Drive/trainset.csv')
# valid_set = pd.read_csv('/gdrive/My Drive/validationset.csv')

# Loading data from local.
train_set = pd.read_csv('trainset.csv')
valid_set = pd.read_csv('validationset.csv')

# Combining the title and description columns together.
train_set['doc'] = train_set.title + ' ' + train_set.desc
valid_set['doc'] = valid_set.title + ' ' + valid_set.desc

### 1.2) Analyzing

Display the first 10 rows of the training set.

In [0]:
train_set.head(10)

How many rows are stored in the training and validation sets?

In [0]:
train_len = len(train_set)
valid_len = len(valid_set)
print('rows in the training set:', train_len)
print('rows in the validation set:', valid_len)

How many posts are in each first-level category of the training set?

In [0]:
train_set.cat1.hist(figsize=(12, 6))

## 2) Preprocessing

There are two kinds of text data in the dataset: *Title* and *Description*.
You can use one or both of them as the text input of your classification model. Choose the combination that gives you the highest evaluation score.

You need to apply some preprocessing procedures to your text data first. We want at least **4** preprocessing steps from you, such as removing stop words, removing punctuation, removing or replacing digits, stemming, lemmatizing, normalization, and so on.

You can use the [*Stopwords Guilan NLP*](https://github.com/JoyeBright/stopwords_guilannlp) library to access a collection of Persian stop words.
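
As a rough illustration of how four such steps can be chained, here is a minimal sketch in plain Python. It uses an English sentence and a tiny hypothetical stop-word set (`STOP_WORDS`) purely for readability; the function name `simple_preprocess` is also an assumption, and a real solution for Persian text would rely on proper normalization and a full stop-word list.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration.
STOP_WORDS = {"a", "the", "for", "in", "is"}


def simple_preprocess(text):
    """Apply four toy preprocessing steps and return the remaining tokens."""
    text = text.lower()                     # 1) normalization (case folding)
    text = re.sub(r"[^\w\s]", " ", text)    # 2) remove punctuation
    text = re.sub(r"\d+", " ", text)        # 3) remove digits
    tokens = text.split()                   # split on any whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # 4) remove stop words


tokens = simple_preprocess("A red car for sale in 2019, the price is 5000!")
print(tokens)
```

The order of the steps matters: punctuation and digits are removed before tokenizing, so the stop-word check only ever sees clean tokens.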

In [0]:
!pip install hazm stopwords_guilannlp

In [0]:
# This importation is required for the 'hazm' library.
from __future__ import unicode_literals

import re

from hazm import Normalizer, Stemmer, Lemmatizer, word_tokenize
from stopwords_guilannlp import stopwords_output


class PreProcessor:
    """
    Document preprocessor.
    """

    def __init__(self):
        self._normalizer = Normalizer()
        self._stemmer = Stemmer()
        self._lemmatizer = Lemmatizer()
        # The 'set' output type is faster than others.
        self._stopwords = stopwords_output('persian', 'set')

    def pre_process(self, doc):
        """
        Extract clean words from the document.
        """
        scrubbed = self._scrub(doc)
        normalized = self._normalizer.normalize(scrubbed)
        tokens = word_tokenize(normalized)
        words = []
        for token in tokens:
            # Check if token is not a stop-word.
            if token not in self._stopwords:
                stemmed = self._stemmer.stem(token)
                # Sometimes stem method returns an empty string,
                # so we should check that and pass the original token to lemmatize.
                lemmatized = self._lemmatizer.lemmatize(stemmed or token)
                words.append(lemmatized)
        return words

    def _scrub(self, text):
        """
        Remove unexpected characters from the text.
        """
        # Remove Persian digits and punctuation.
        text = re.sub(r'[۰۱۲۳۴۵۶۷۸۹…«»،؟]+', ' ', text)
        # Remove ASCII punctuation and digits.
        text = re.sub(r'[-!@#$%^&*()_+|~=`{}\[\]:";\'<>?,.\\/\d]+', ' ', text)
        # Remove extra spaces.
        text = re.sub(r'\s+', ' ', text)
        # Trim the text.
        return text.strip()


pre_processor = PreProcessor()


def preprocessing(text):
    return pre_processor.pre_process(text)

## 3) Word Representation

As you know, classification models can't deal with strings directly, and you have to represent your texts in a numerical form.

### 3.1) Tf-idf

You have to implement the tf-idf vectorization method from scratch in this step. 

Furthermore, you must implement a function that takes a text input and returns its tf-idf vectorized representation.

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

*tf* (term frequency) is the number of occurrences of the word `t` in a specific text `d`.

*idf* (inverse document frequency) is a term that decreases as the number of texts containing the given word grows. It can be calculated this way:
$$\text{idf}(t) = \log\frac{1 + n_d}{1 + n_{d(t)}} + 1$$
where $n_d$ is the total number of texts and $n_{d(t)}$ is the number of texts that contain the word `t`.
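
To make the two formulas concrete, the following sketch evaluates them on a toy corpus with plain Python. The corpus, the helper name `tf_idf_toy`, and the use of the natural logarithm are assumptions made for illustration; it is not the vectorizer you are expected to submit.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized texts (hypothetical data).
docs = [
    ["car", "sale", "car"],
    ["house", "sale"],
    ["car", "rent"],
]

n_d = len(docs)

# n_d(t): number of texts that contain the word t at least once.
doc_freq = Counter(word for doc in docs for word in set(doc))

# idf(t) = log((1 + n_d) / (1 + n_d(t))) + 1, using the natural logarithm.
idf = {word: math.log((1 + n_d) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tf_idf_toy(doc):
    """Return a {word: tf-idf} mapping for one tokenized text."""
    tf = Counter(doc)
    # Words that never appeared in the corpus have no idf and are skipped.
    return {word: count * idf[word] for word, count in tf.items() if word in idf}


weights = tf_idf_toy(docs[0])
print(weights)
```

Here "car" appears in two of the three texts, so its idf is lower than that of "house", which appears in only one; the tf factor then doubles the weight of "car" in the first text because it occurs twice there.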

In [0]:
from collections import defaultdict

import numpy as np
import scipy.sparse as sp


class Vectorizer:
    """
    TF-IDF vectorizer.
    """

    def __init__(self, pre_process):
        """
        Parameters:
            pre_process: callable
                A function that returns the cleaned tokens of a document.
        """
        self._pre_process = pre_process
        self._vocabulary = None
        self._idf = None

    def fit(self, docs):
        """
        Learn the vocabulary and build the IDF matrix for the documents.

        Returns:
            The TF matrix of the documents, so that fit_transform can reuse it.
        """
        tf = self._tf(docs)
        samples = len(docs)
        features = len(self._vocabulary)
        # Count number of occurrences of each column index (word).
        ndt = np.bincount(tf.indices, minlength=features)
        # Calculate the IDF vector.
        idf = np.log((1 + samples) / (1 + ndt)) + 1
        # For creating a TF-IDF matrix we should multiply each row of the TF matrix
        # by the IDF vector, so to simplify the calculation it's better to convert
        # the IDF vector to a diagonal CSR matrix.
        self._idf = sp.diags(idf, format='csr', shape=(features, features))
        return tf

    def fit_transform(self, docs):
        """
        Learn vocabulary and create a TF-IDF matrix.
        """
        tf = self.fit(docs)
        return self._transform(tf)

    def transform(self, docs):
        """
        Create a TF-IDF matrix for the documents by multiplying
        the TF matrix by the (diagonal) IDF matrix.
        """
        tf = self._tf(docs)
        return self._transform(tf)

    def _transform(self, tf):
        return tf @ self._idf

    def _tf(self, docs):
        """
        Create a Term-Frequency matrix for the documents.
        """
        # If it's the first time (fitting time), we create a dictionary
        # that uses words as a key and generates an auto-increment index
        # for them as a value.
        vocabulary = self._vocabulary or defaultdict(lambda: len(vocabulary))
        data = []  # List of frequencies.
        indices = []  # List of column indices for each frequency.
        indptr = [0]  # Index pointer: row i spans data[indptr[i]:indptr[i + 1]].
        for doc in docs:
            # We use a default dictionary to count each word in the current document.
            word_count = defaultdict(int)
            # Extract clear words from the document.
            for word in self._pre_process(doc):
                try:
                    index = vocabulary[word]
                    word_count[index] += 1
                except KeyError:
                    # If it's not fitting time and the word doesn't exist in the vocabulary,
                    # ignore it.
                    continue
            indices.extend(word_count.keys())
            data.extend(word_count.values())
            indptr.append(len(indices))
        if not self._vocabulary:
            self._vocabulary = dict(vocabulary)
        # Since the TF matrix has a lot of zeros, we should use a sparse matrix.
        # A Compressed Sparse Row (CSR) matrix is more efficient for arithmetic operations.
        samples, features = len(docs), len(vocabulary)
        return sp.csr_matrix((data, indices, indptr), shape=(samples, features))
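
The `(data, indices, indptr)` triple used above can be hard to picture, so here is a small sketch that builds a CSR matrix for two toy documents over a three-word vocabulary. The numbers are made up for illustration only.

```python
import scipy.sparse as sp

# Two toy documents over a 3-word vocabulary (hypothetical counts):
# document 0 contains word 0 twice and word 2 once,
# document 1 contains word 1 once.
data = [2, 1, 1]      # the non-zero frequencies, row by row
indices = [0, 2, 1]   # the column (word) index of each frequency
indptr = [0, 2, 3]    # row i owns the entries data[indptr[i]:indptr[i + 1]]

tf = sp.csr_matrix((data, indices, indptr), shape=(2, 3))
dense = tf.toarray()
print(dense)
```

Row 0 owns the first two stored entries and row 1 owns the third, which is exactly what the differences between consecutive `indptr` values encode.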

In [0]:
%%time

# Estimated time: 2min 20s

vectorizer = Vectorizer(preprocessing)


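def tf_idf(text):
    # Accept a single text as well as a collection of texts;
    # a bare string would otherwise be iterated character by character.
    docs = [text] if isinstance(text, str) else text
    return vectorizer.transform(docs)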


x_train = vectorizer.fit_transform(train_set.doc)
x_valid = tf_idf(valid_set.doc)

## 4) Classification

![alt text](https://cdn.lynda.com/course/578082/578082-637075371482276339-16x9.jpg)

### 4.1) Logistic Regression

The Logistic Regression classifier must be implemented from scratch here.

After implementing logistic regression, you can fit the classifier to the training data.
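
For intuition, the sketch below trains a binary logistic regression on four toy samples with batch gradient ascent on the mean log-likelihood. The data, the learning rate `alpha`, and the number of iterations are arbitrary assumptions; a multi-class model would repeat this procedure once per class (one-vs-rest).

```python
import numpy as np


def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1 / (1 + np.exp(-z))


# Toy data (hypothetical): a bias column of ones plus one centered feature.
x = np.array([[1.0, -1.5], [1.0, -0.5], [1.0, 0.5], [1.0, 1.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(x.shape[1])  # weights, initialized with zeros
alpha = 0.5                   # learning rate (arbitrary choice)

for _ in range(200):
    pred = sigmoid(x @ theta)          # current probabilities
    grad = (y - pred) @ x / len(y)     # gradient of the mean log-likelihood
    theta += alpha * grad              # gradient ascent step

labels = (sigmoid(x @ theta) >= 0.5).astype(int)
print(theta, labels)
```

Because the update adds the gradient instead of subtracting it, this maximizes the log-likelihood, which is equivalent to minimizing the cross-entropy loss.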

In [0]:
import numpy as np
import scipy.sparse as sp


class Classifier:
    """
    Logistic Regression classifier.
    """

    def __init__(self, alpha=0.01, epochs=100):
        """
        Parameters:
            alpha: float
                Learning rate.
            epochs: int
                Maximum number of iterations.
        """
        self._alpha = alpha
        self._epochs = epochs
        self._classes = None
        self._thetas = None

    def fit(self, x, y):
        """
        Fit the model according to the given training data.
        """
        x = self._add_bias(x)
        samples, features = x.shape
        # Find unique classes in the targets.
        self._classes = np.unique(y)
        self._thetas = []
        # For a multi-class model, we must fit the weights for each class separately.
        for clazz in self._classes:
            # Create a binary vector of targets, by one-vs-rest strategy,
            # where the current class is one and others are zero.
            bin_y = np.where(clazz == y, 1, 0)
            # Create a new weights vector for the class and initialize it with zeros.
            theta = np.zeros(features)
            # Repeat this for the given iterations number.
            for _ in range(self._epochs):
                # Evaluate the output probability.
                pred = self._probability(x, theta)
                # Calculate the gradient of the mean log-likelihood.
                grad = (bin_y - pred) @ x / samples
                # Update the weights by gradient ascent.
                theta += self._alpha * grad
            self._thetas.append(theta)
        return self

    def predict(self, x):
        """
        Predict the classification for each sample in the given list.
        """
        x = self._add_bias(x)
        # For each class, calculate the confidence scores of the given samples.
        scores = [self._probability(x, theta) for theta in self._thetas]
        # Find the maximum score in each column and give the row index of it,
        # which indicates the corresponding class index.
        indices = np.argmax(scores, axis=0)
        return self._classes[indices]

    def _add_bias(self, x):
        """
        Add a bias vector with values of one, as a new column,
        to the start of the data matrix.
        """
        samples = x.shape[0]
        bias = np.ones((samples, 1))
        if sp.isspmatrix(x):
            # If the original data is a sparse matrix, use the scipy module instead.
            # The CSR format keeps the repeated products in `fit` efficient.
            return sp.hstack([bias, x], format='csr')
        return np.hstack([bias, x])

    def _probability(self, x, theta):
        """
        Calculate the confidence score for the given data by its weights.
        """
        # Linear model equation.
        z = x @ theta
        # Apply the sigmoid function to the output.
        return 1 / (1 + np.exp(-z))

## 5) Prediction

Now you can predict a category for each validation sample using the implemented classifier.

In [0]:
%%time

# Estimated time: 10min 40s

y_train = train_set.cat1
y_valid = valid_set.cat1

classifier = Classifier(alpha=0.1, epochs=1000)
classifier.fit(x_train, y_train)
predicted = classifier.predict(x_valid)

## 6) Evaluation

It's time to evaluate your model using predicted categories for validation data.

You need to create a confusion matrix based on your predictions and the real labels. Then you can use this confusion matrix to calculate the other evaluation metrics.

As this problem is a multi-class problem, the calculation formula is a little different from the general case. Read [this article](https://towardsdatascience.com/multi-class-metrics-made-simple-part-i-precision-and-recall-9250280bddc2) for more information.
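
As a small worked example of the multi-class case, the sketch below derives per-class and macro-averaged metrics from a hypothetical 3-class confusion matrix. It assumes that rows hold the predicted classes and columns hold the true classes; if your matrix is laid out the other way around, swap the two axes.

```python
import numpy as np

# Hypothetical confusion matrix for 3 classes.
# Rows are predicted classes, columns are true classes.
cm = np.array([
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
])

tp = np.diag(cm)                  # correctly classified samples per class
precision = tp / cm.sum(axis=1)   # TP / all samples predicted as the class
recall = tp / cm.sum(axis=0)      # TP / all samples truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()    # all correct samples / all samples

# Macro averaging: every class counts equally, regardless of its size.
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()

print(accuracy, macro_precision, macro_recall, macro_f1)
```

Each class is treated as "positive" in turn, so its false positives are the rest of its row and its false negatives are the rest of its column.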

In [0]:
import numpy as np
import pandas as pd


class Metrics:
    """
    Classification metrics based on the Confusion Matrix.
    """

    def __init__(self):
        self._cm = None

    def confusion_matrix(self, y_true, y_pred):
        """
        Create the Confusion Matrix.
        """
        # Find unique classes in both the true and the predicted labels,
        # so that a class missing from y_true still has a row and a column.
        classes = np.union1d(y_true, y_pred)
        # Create a data-frame for the confusion matrix,
        # because it's easy to work with later.
        # Rows of the DF represent the predicted classes.
        # Columns of the DF represent the true classes.
        self._cm = pd.DataFrame(0, index=classes, columns=classes)
        for true, pred in zip(y_true, y_pred):
            # Use a single lookup; chained indexing may update a copy.
            self._cm.at[pred, true] += 1
        return self._cm

    def classification_report(self):
        """
        Show the main classification metrics.
        """
        classes = self._cm.index
        return pd.DataFrame([{
            'precision': self.precision_score(clazz),
            'recall': self.recall_score(clazz),
            'f1-score': self.f1_score(clazz),
            'support': self._cm[clazz].sum()
        } for clazz in classes], index=classes)

    def precision_score(self, clazz=None, average=True):
        """
        Calculate the Precision Score from the confusion matrix.

        Parameters:
            clazz: object
                If given, the score will be calculated just for this class.
            average: bool
                If True, return the macro average; otherwise, return the score of each class.
        """
        if clazz is not None:
            return self._portion(self._cm.loc[clazz])
        precisions = self._cm.apply(self._portion, axis=1)
        return precisions.mean() if average else precisions

    def recall_score(self, clazz=None, average=True):
        """
        Calculate the Recall Score from the confusion matrix.

        Parameters:
            clazz: object
                If given, the score will be calculated just for this class.
            average: bool
                If True, return the macro average; otherwise, return the score of each class.
        """
        if clazz is not None:
            return self._portion(self._cm[clazz])
        recalls = self._cm.apply(self._portion, axis=0)
        return recalls.mean() if average else recalls

    def f1_score(self, clazz=None, average=True):
        """
        Calculate the F1 Score from the confusion matrix.

        Parameters:
            clazz: object
                If given, the score will be calculated just for this class.
            average: bool
                If True, return the macro average; otherwise, return the score of each class.
        """
        precision = self.precision_score(clazz, False)
        recall = self.recall_score(clazz, False)
        f1 = 2 * recall * precision / (recall + precision)
        return f1.mean() if average else f1

    def accuracy_score(self):
        """
        Calculate the Accuracy Score from the confusion matrix.
        """
        cm = self._cm.values
        return cm.diagonal().sum() / cm.sum()

    def _portion(self, vector):
        """
        Calculate the share of the vector's own class (its name) in the vector's total.
        """
        return vector[vector.name] / vector.sum()

### 6.1) Confusion matrix

In [0]:
# Estimated time: 15s

metrics = Metrics()
metrics.confusion_matrix(y_valid, predicted)

In [0]:
metrics.classification_report()

### 6.2) Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

In [0]:
accuracy = metrics.accuracy_score()
print(accuracy)

### 6.3) Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

In [0]:
precision = metrics.precision_score()
print(precision)

### 6.4) Recall

$$\text{Recall} = \frac{TP}{TP + FN}$$

In [0]:
recall = metrics.recall_score()
print(recall)

### 6.5) F1 score

$$\text{F1 score} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$

In [0]:
f1_score = metrics.f1_score()
print(f1_score)

## 7) K-Fold Cross Validation *(Optional)*

Evaluate your model using the K-Fold Cross Validation approach. This step is optional and earns a few extra points.
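
The core of K-Fold Cross Validation is the index bookkeeping: shuffle the sample indices once, cut them into `k` parts, and let each part serve as the validation fold exactly once. The sketch below shows one way to do this; the helper name `k_fold_indices` and its `seed` parameter are hypothetical, and `np.array_split` is used so that no sample is dropped when the number of samples is not divisible by the number of folds.

```python
import numpy as np


def k_fold_indices(samples, folds, seed=0):
    """Yield (train, valid) index arrays for each fold (requires folds >= 2)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(samples)        # shuffle once, up front
    parts = np.array_split(indices, folds)    # nearly equal-sized parts
    for i in range(folds):
        valid = parts[i]
        # Everything outside the current part is used for training.
        train = np.concatenate(parts[:i] + parts[i + 1:])
        yield train, valid


splits = list(k_fold_indices(samples=10, folds=3))
for train, valid in splits:
    print(len(train), len(valid))
```

The final score is then the average of the metric over all folds, so every sample contributes to validation exactly once.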

In [0]:
%%time

# Estimated time: 12min 40s

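import numpy as np
import pandas as pd
import scipy.sparse as sp

# The CSR format supports the row indexing used in `split`.
x = sp.vstack([x_train, x_valid], format='csr')
# Series.append was removed in pandas 2.0; pd.concat works across versions.
y = pd.concat([y_train, y_valid])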


def split(folds):
    samples = x.shape[0]
    fold = samples // folds
    indices = np.arange(samples)
    np.random.shuffle(indices)
    for i in range(folds):
        start = fold * i
        end = fold * (i + 1)
        valid = indices[start:end]
        left = indices[:start]
        right = indices[end:]
        train = np.concatenate([left, right])
        yield x[train], x[valid], y.iloc[train], y.iloc[valid]


def cross_validation():
    scores = []
    classifier = Classifier(alpha=0.1, epochs=200)
    for x_t, x_v, y_t, y_v in split(5):
        classifier.fit(x_t, y_t)
        preds = classifier.predict(x_v)
        metrics.confusion_matrix(y_v, preds)
        score = metrics.accuracy_score()
        scores.append(score)
    return np.average(scores)


cv_accuracy = cross_validation()
print('K-Fold Cross Validation Accuracy:', cv_accuracy)