<a href="https://colab.research.google.com/github/CCIR-Academy/Techcamp2021S-Phase3/blob/main/Project__2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 2: Further on NLP



## Intro
- As we progress to the final section of the Techcamp, we will explore NLP further by taking more analytical approaches to processing text written in natural language. These approaches are mostly inspired by theories and practices from linguistics and powered by deep learning on modern hardware. The goal is to handle such data with an adequate understanding of the dynamics of human verbal communication and of the contextual meaning that people give to language in actual use.  
- In this project, we will use Colab as our environment. It gives us a more controlled and powerful runtime and a more consistent experience, which could otherwise vary a great deal for numerous reasons.  


## Expectation
1. As in the last project, the capabilities behind common NLP applications are brought to you by researchers and developers building on substantial mathematical and linguistic theories and breakthroughs, which can be overwhelming for people who are new to machine learning. For this reason, the lecturing and coding sessions focus on implementing a series of out-of-the-box applications provided by production-oriented libraries such as `spaCy`.
2. However, compared with the previous project, the mechanisms, algorithms, and parameters involved in these applications can feel like a "black box" to us, meaning that we don't have many ways to tweak and experiment with them other than feeding them more data.
3. To address this issue and make the course experience more tangible and inspiring, we are adopting a few strategies in our exploration:
    - We would like to encourage you to integrate the NLP features with what we have covered about developing and deploying a bot on `Chai.ml`, so that you end up with a complete, "playable" experience.
    - We would like to encourage you to explore the [spaCy Universe](https://spacy.io/universe) freely. It is a collection of many great resources developed with or for `spaCy`, and it can give you inspiration for what to build after you learn something interesting about NLP. Projects are encouraged to stem from your own interests or needs in any aspect of life, as long as you feel driven and enjoy the process of developing them (say, automatically creating a good meme XD).
4. A good place to start would be the resources on a community platform called `HuggingFace Hub`. You can learn more about it here: https://spacy.io/usage/projects#huggingface_hub


## Part 1: Spam mail classification
> Note: This part is heavily inspired by this blog article: https://dataaspirant.medium.com/how-to-build-an-effective-email-spam-classification-model-with-spacy-python-af2217bf4e30

In this part of the project, we will use the default configurations to train a local model on a public dataset from Kaggle. You will need to download the dataset from the link below and upload it to the Colab runtime: https://www.kaggle.com/venky73/spam-mails-dataset

### Preparation
1. Keep your code and other files updated on GitHub under the group CCIR-Academy. You can do so by either creating a new repo or continuing to use the repo from your previous sections.
2. For your code to access the dataset, you need to either upload it to the runtime or store it in Google Drive and mount the drive in the runtime. Note that the runtime does not keep uploaded files after it has been terminated (e.g., after you close the Colab web page), so you might need to upload the dataset again.
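
The two options above can be sketched as follows. This is only an illustration: the Drive location `/content/drive/MyDrive/spam_ham_dataset.csv` and the helper names `mount_drive_if_in_colab` and `resolve_dataset_path` are assumptions made for this example, so adjust them to wherever you actually keep the file.

```python
from pathlib import Path

# Both locations are assumptions; change them to match your own setup.
UPLOAD_PATH = "/content/spam_ham_dataset.csv"
DRIVE_PATH = "/content/drive/MyDrive/spam_ham_dataset.csv"


def mount_drive_if_in_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return True if mounted."""
    try:
        from google.colab import drive  # only importable inside a Colab runtime
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def resolve_dataset_path(candidates):
    """Return the first candidate that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


dataset_path = resolve_dataset_path([UPLOAD_PATH, DRIVE_PATH])
print("Dataset found at:", dataset_path)
```

If you store the dataset in Google Drive, call `mount_drive_if_in_colab()` before resolving the path. A result of `None` means the file still needs to be uploaded or the drive still needs to be mounted.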

### Loading the dataset
At the beginning of our entry function `main()`, we use `pandas` to load our dataset, as is commonly done with `spaCy` and many other machine learning libraries. Before doing so, we define the basic parameters simply as module-level variables. We also print a line to check that the dataset has been loaded properly.
```python
## Required packages
import pandas as pd

## Paths
data_path = "/content/spam_ham_dataset.csv"

def main():
    # Load dataset
    data = pd.read_csv(data_path)
    observations = len(data.index)
    print("Dataset Size: {}".format(observations))
```
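
Printing the size only confirms that something was read. As a slightly stronger check, the sketch below verifies that the `text` and `label` columns used by the sample code in the appendix are present, and counts the rows per label. The helper `check_dataset` and the three sample rows are made up for illustration; they are not part of the Kaggle dataset.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ("text", "label")  # the columns the appendix code reads


def check_dataset(data):
    """Raise ValueError if a required column is missing; return label counts."""
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError("Missing columns: {}".format(missing))
    return data["label"].value_counts().to_dict()


# Tiny in-memory stand-in for the real CSV (made-up rows, illustration only)
sample_csv = io.StringIO(
    "label,text\n"
    "ham,Subject: meeting notes\n"
    "spam,Subject: win a prize now\n"
    "ham,Subject: lunch tomorrow\n"
)
sample = pd.read_csv(sample_csv)
label_counts = check_dataset(sample)
print(label_counts)
```

With the real file, you would pass the `data` frame loaded in `main()` to `check_dataset` instead of `sample`.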

### Creating a blank spaCy model and a simple Pipeline
- To understand a dataset at the level of natural language, a typical model processes the data in multiple steps until it can make informed predictions or evaluations. On one hand, with a trained model, we can get output directly by providing input; on the other hand, we can train a new model by providing a dataset that contains both the input and the expected output.
- In our project, we create a blank model and add a pipeline component from the "textcat" template, which can categorize text after appropriate training.
```python
    # Create an empty spaCy model (this sample uses the spaCy v2 API)
    nlp = spacy.blank("en")

    # Create the TextCategorizer with exclusive classes and "bow" architecture
    text_cat = nlp.create_pipe(
                  "textcat",
                  config={
                    "exclusive_classes": True,
                    "architecture": "bow"})

    # Adding the TextCategorizer to the created empty model
    nlp.add_pipe(text_cat)
```
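
The `"bow"` setting selects a bag-of-words style architecture, which looks at which words occur in a text rather than the order they occur in. The sketch below illustrates only that idea in plain Python; it is not how `spaCy` implements the component, and the function name `bag_of_words` is made up for this example.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count each token (order is discarded)."""
    return Counter(text.lower().split())


bow_a = bag_of_words("Free prize free entry")
bow_b = bag_of_words("entry free prize FREE")

print(bow_a)
print(bow_a == bow_b)  # same words and counts, different order
```

Because both texts contain the same words the same number of times, they end up with identical representations, which is the main simplification a bag-of-words model makes.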

### Prepare for training
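
Following the sample code in the appendix, preparation mainly means pairing every text with a `cats` dictionary that marks which label applies. Here is a minimal sketch of that conversion, assuming the two labels `ham` and `spam`; the helper `to_cats` and the example texts are made up for illustration.

```python
def to_cats(label, classes=("ham", "spam")):
    """Build the {'cats': {...}} annotation with True for the matching class."""
    return {"cats": {name: label == name for name in classes}}


# Made-up examples standing in for rows of the dataset
texts = ["Subject: meeting notes", "Subject: win a prize now"]
labels = ["ham", "spam"]

# Pair each text with its annotation, as the appendix does with zip()
train_data = list(zip(texts, [to_cats(label) for label in labels]))
print(train_data[1])
```

In the appendix, the same pairing is applied to the training and test portions produced by `train_test_split`.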

### Appendix: Sample Code (For the purpose of instruction and self-learning)

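```python
# The sample code below uses the spaCy v2 API, so spaCy is pinned below 3.0.
# The scikit-learn package is installed as "scikit-learn" and imported as "sklearn".
%pip install pandas "spacy<3" seaborn scikit-learn matplotlib
```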

```python
"""
===============================================
Objective: Building email classifier with spacy
Author: saimadhu.polamuri
Blog: dataaspirant.com
Date: 2020-07-17
===============================================
"""

## Required packages
import random
import spacy
import pandas as pd
import seaborn as sns
from spacy.util import minibatch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt

## Paths
data_path = "/content/spam_ham_dataset.csv"

## Configurations
sns.set(style="darkgrid")


def train_model(model, train_data, optimizer, batch_size, epochs=10):
    losses = {}
    random.seed(1)

    for epoch in range(epochs):
        # Shuffle the examples so each epoch sees them in a different order
        random.shuffle(train_data)

        batches = minibatch(train_data, size=batch_size)
        for batch in batches:
            # Split batch into texts and labels
            texts, labels = zip(*batch)

            # Update model with texts and labels
            model.update(texts, labels, sgd=optimizer, losses=losses)
        # print("Loss: {}".format(losses['textcat']))

    return losses['textcat']


def get_predictions(model, texts):
    # Use the model's tokenizer to tokenize each input text
    docs = [model.tokenizer(text) for text in texts]

    # Use textcat to get the scores for each doc
    textcat = model.get_pipe('textcat')
    scores, _ = textcat.predict(docs)

    # From the scores, find the label with the highest score/probability
    predicted_labels = scores.argmax(axis=1)
    predicted_class = [textcat.labels[label] for label in predicted_labels]

    return predicted_class


######## Main method ########

def main():
    # Load dataset
    data = pd.read_csv(data_path)
    observations = len(data.index)
    print("Dataset Size: {}".format(observations))

    # Create an empty spaCy model
    nlp = spacy.blank("en")

    # Create the TextCategorizer with exclusive classes and "bow" architecture
    text_cat = nlp.create_pipe(
        "textcat",
        config={
            "exclusive_classes": True,
            "architecture": "bow"})

    # Adding the TextCategorizer to the created empty model
    nlp.add_pipe(text_cat)

    # Add labels to text classifier
    text_cat.add_label("ham")
    text_cat.add_label("spam")

    # Split data into train and test datasets
    x_train, x_test, y_train, y_test = train_test_split(
        data['text'], data['label'], test_size=0.33, random_state=7)

    train_labels = [{'cats': {'ham': label == 'ham',
                              'spam': label == 'spam'}} for label in y_train]
    test_labels = [{'cats': {'ham': label == 'ham',
                             'spam': label == 'spam'}} for label in y_test]

    # Spacy model data
    train_data = list(zip(x_train, train_labels))
    test_data = list(zip(x_test, test_labels))

    # Model configurations
    optimizer = nlp.begin_training()
    batch_size = 5
    epochs = 10

    # Training the model
    train_model(nlp, train_data, optimizer, batch_size, epochs)

    # Sample predictions
    # print(train_data[0])
    # sample_test = nlp(train_data[0][0])
    # print(sample_test.cats)

    # Train and test accuracy
    train_predictions = get_predictions(nlp, x_train)
    test_predictions = get_predictions(nlp, x_test)
    train_accuracy = accuracy_score(y_train, train_predictions)
    test_accuracy = accuracy_score(y_test, test_predictions)

    print("Train accuracy: {}".format(train_accuracy))
    print("Test accuracy: {}".format(test_accuracy))

    # Creating the confusion matrix graphs
    cf_train_matrix = confusion_matrix(y_train, train_predictions)
    plt.figure(figsize=(10,8))
    sns.heatmap(cf_train_matrix, annot=True, fmt='d')

    cf_test_matrix = confusion_matrix(y_test, test_predictions)
    plt.figure(figsize=(10,8))
    sns.heatmap(cf_test_matrix, annot=True, fmt='d')


if __name__ == "__main__":
    main()
```
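
To see what the training loop and the accuracy figures are doing without installing anything, the sketch below reproduces the two ideas in plain Python: shuffling the examples and cutting them into batches, then measuring the share of correct predictions. `simple_minibatch` and `accuracy` are hypothetical stand-ins written for this illustration; they are not the `spacy.util.minibatch` or `sklearn.metrics.accuracy_score` implementations.

```python
import random


def simple_minibatch(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    pairs = list(zip(true_labels, predicted_labels))
    return sum(true == pred for true, pred in pairs) / len(pairs)


random.seed(1)
examples = list(range(12))  # stand-ins for (text, annotation) pairs
random.shuffle(examples)

batches = list(simple_minibatch(examples, size=5))
print([len(batch) for batch in batches])

score = accuracy(["ham", "spam", "ham", "spam"],
                 ["ham", "spam", "spam", "spam"])
print(score)
```

Twelve examples with a batch size of 5 give batches of 5, 5, and 2, so the last batch of an epoch is usually smaller than the rest. Three correct predictions out of four give an accuracy of 0.75.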

## Part 2: Sentiment Analysis

https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

https://spacy.io/usage/rule-based-matching#example3

https://realpython.com/sentiment-analysis-python/