<a href="https://colab.research.google.com/github/DaanKuyper/DocumentSplitting/blob/master/NLP_Models_Pipeline_for_retrieval_of_Implicit_Feedback.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline for retrieval of Implicit Feedback via NLP models

This Python notebook contains the code needed to generate NLP-based insights from a large collection of text input. To achieve this, it uses several classes from the Hugging Face transformers library: https://huggingface.co/docs/transformers/index. When running this notebook in Google Colab, change the runtime type to GPU hardware acceleration and set the configuration (explained later) to use the GPU as well, as this drastically improves performance.

For a detailed description of what this code represents and what its potential uses are, please refer to my paper on the subject, available on the GitHub repository: https://github.com/DaanKuyper/ImplicitFeedback

In [125]:
# For the best performance, a GPU should be used for hardware acceleration. 
# The code below can be used to view what runtime is connected.
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
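
The runtime can also be checked from Python rather than by reading the `nvidia-smi` output. The following is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical helper, and it assumes the convention used by the Hugging Face `pipeline` `device` argument, where `0` is the first GPU and `-1` is the CPU.

```python
def pick_device():
    """Return 0 (first CUDA GPU) when PyTorch can see one, else -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate, so fall back to CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


# The result can be used to decide whether to pass useGPU=True
# to the configuration class defined later in this notebook.
use_gpu = pick_device() == 0
print(f"useGPU={use_gpu}")
```

On a runtime like the one shown above, where the NVIDIA driver cannot be reached, this sketch would be expected to fall back to the CPU.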



# Imports

All the necessary Python libraries are installed and imported below.

In [126]:
%%capture
!pip install transformers
!pip install pytorch-lightning
!pip install sentencepiece

In [127]:
import os
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from google.colab import drive
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import Dataset
from bs4 import BeautifulSoup

# Model definition

The two classes used for insight generation are defined below. The first is ClassifierConfig, which defines what results are desired from the input and which trained models are required to generate them. The second is Insight_Generator, which creates and calls the classifiers that generate insights based on the passed configurations.

To generate insights we make use of the Hugging Face pipeline class: https://huggingface.co/docs/transformers/main_classes/pipelines. This pipeline class classifies an input for a specific task; after construction, pipeline instances are therefore referred to here as **classifiers**. Because classifiers can perform many different operations, there are different kinds, such as text classification, summarization or translation. Each configuration must therefore state which kind of classifier it uses.

Classifiers can be instantiated in two different ways. We can pass the name of the desired pre-trained model for the specific task, which limits us to models uploaded to the extensive Hugging Face model hub: https://huggingface.co/models. Alternatively, we can pass our own instances of a model and tokenizer class, again for a specific task, which allows us to fine-tune models on our own input and validation datasets. This also allows us to save and load configurations locally. This process is explained in detail in the excellent Hugging Face documentation: https://huggingface.co/course/chapter3/1?fw=pt.
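
The two options can be sketched as a choice between two sets of keyword arguments for `transformers.pipeline`. This is an illustrative sketch only: `pipeline_kwargs` is a hypothetical helper that does not appear in the notebook, and the actual `pipeline` call is left as a comment so the block runs without downloading a model.

```python
def pipeline_kwargs(task, name=None, model=None, tokenizer=None, device=-1):
    """Build keyword arguments for transformers.pipeline(...)."""
    if model is not None and tokenizer is not None:
        # Option 2: pass already-loaded model and tokenizer instances,
        # for example fine-tuned ones loaded from a local directory.
        return {"task": task, "model": model,
                "tokenizer": tokenizer, "device": device}
    if name is None:
        raise ValueError("Pass a model name, or both a model and a tokenizer.")
    # Option 1: pass the name of a pre-trained model on the model hub.
    return {"task": task, "model": name, "device": device}


kwargs = pipeline_kwargs(
    "sentiment-analysis",
    name="nlptown/bert-base-multilingual-uncased-sentiment")
print(kwargs)

# With transformers installed, the classifier would then be created as:
#   from transformers import pipeline
#   classifier = pipeline(**kwargs)
```

Falling back to the model name whenever the model or the tokenizer is missing mirrors the `useName` flag of the configuration class defined below.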

In [153]:
class ClassifierConfig():
  """
  Args:
      task (`str`): name of specific classifier task.
      name (`str`): name of pre-trained model.
      model (`AutoModelClass`, *optional*): HuggingFace model instance.
      tokenizer (`AutoTokenizer`, *optional*): HuggingFace tokenizer instance.
      languages (`List[str]`, *optional*): specific languages for the model.
      args (`List[str]`, *optional*): additional arguments for the classifier.
      useGPU (`bool`, *optional*): determines if the classifier uses GPU.
  """

  def __init__(self, 
               task, 
               name, 
               model=None, 
               tokenizer=None, 
               languages=None,
               args=None,
               useGPU=False):
    self.task = task
    self.name = name
    self.model = model
    self.tokenizer = tokenizer
    # Create fresh lists per instance: mutable default arguments would be
    # shared between all configurations.
    self.languages = list(languages) if languages else []
    self.args = list(args) if args else []
    # Pipeline device index: 0 is the first GPU, -1 is the CPU.
    self.device = 0 if useGPU else -1
    # Load the model by name unless both a model and a tokenizer are given.
    self.useName = model is None or tokenizer is None 


  def toString(self):
    # The combination of task, name and language form a unique identifier
    # for different configurations, which allows us to access them later.
    return f"{self.task} on '{self.name}' for '{self.languages}' languages"


In [154]:
class Insight_Generator():
  """
  Args:
      input (`pandas.Series`): Series of input texts, named 'comment'.
      languageModel (`str`, *optional*): Model used for language detection.
  """

  def __init__(
      self,
      input,
      languageModel='papluca/xlm-roberta-base-language-detection'
  ):
    self.input = self.preprocess(input)
    self.languageModel = languageModel

    # Subsets of input per specific language.
    self.language_subsets = {}

    # Dictionary of model insight results.
    self.insights = {}

    # Enable truncation so that comments longer than the maximum sequence
    # length of 512 tokens are cut off before classification.
    self.tokenizer_kwargs = {'truncation':True}

  
  def preprocess(self, input):
    # Replace NaN values with empty strings.
    df = pd.DataFrame(input)
    df.fillna('', inplace=True)

    # Strip all HTML elements and replace newlines with spaces.
    # The parser is named explicitly to avoid a BeautifulSoup warning.
    safe_input = [''.join(BeautifulSoup(comment, 'html.parser').find_all(string=True))\
                  .replace('\n', ' ') for comment in df.comment]

    return safe_input


  def detectLanguages(self):
    ld_df = self.generateInsight(ClassifierConfig("text-classification", 
                                                  self.languageModel))

    print(f"Detected languages: {np.unique(ld_df.label)}")

    # Link original comment back to language detection results.
    ld_df['comment'] = self.input

    # Split dataset in sets of detected language, visiting each language once.
    for language in ld_df.label.unique():
      self.language_subsets[language] = \
        ld_df.loc[ld_df.label == language, 'comment'].tolist()

    # Set result as class property to allow for later inspection.
    self.ld_df = ld_df


  def generateInsight(self, config):
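    # Load by model name unless a model and tokenizer instance were supplied.
    # config.device is 0 for the first GPU and -1 for the CPU.
    classifier = pipeline(config.task, model=config.name, device=config.device) if config.useName \
      else pipeline(config.task, model=config.model, tokenizer=config.tokenizer, device=config.device)

    # Without a language filter all comments are used. Otherwise only the
    # comments detected as one of the configured languages are used, which
    # requires detectLanguages() to have been called first.
    input = self.input if config.languages == [] else \
      self.commentsByLanguages(config.languages).comment.tolist()

    return pd.DataFrame(classifier(input, *config.args, **self.tokenizer_kwargs))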


  def generateInsights(self, configs):
    for config in configs:
      self.insights[config.toString()] = self.generateInsight(config)


  def commentsByLanguages(self, languages):
    return self.ld_df.loc[self.ld_df.label.isin(languages)]


# Input

The NLP pipeline takes a list of texts as input, formatted as a pandas.Series named 'comment'. There are two ways to supply it, each shown in its own code block below. The first mounts Google Drive and reads a .csv file from it; the example loads the file containing TicketVise comments, which is available in the GitHub repository. The second simply defines a static list of comments.

In [155]:
# Mount google drive for easy access to an input csv file.
drive.mount('/content/drive', force_remount=False)

train_path = '/content/drive/MyDrive/TicketVise_comments.csv'
delimiter = ';'

comments = pd.read_csv(train_path, delimiter=delimiter).comment

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [131]:
# Write input as static list of comments.
example_comments = [
 'These are some example comments', 
 'They can be written in differing lengths', 
 'And in multiple languages - for testing purposes', 
 'Dit is bijvoorbeeld commentaar in het nederlands!'
]

comments = pd.Series(example_comments, name='comment')

# Interaction

Below is shown how the configuration class and the insight generator class can be instantiated and used to generate results.

In [156]:
# In the constructor of the generator class the input is preprocessed 
# and stored as a class property. 
generator = Insight_Generator(comments)

### Language detection

In [157]:
%%time
# To use language-specific models later, we first run a language 
# detection model and split the input according to the detected languages.
generator.detectLanguages()

Detected languages: ['en' 'hi' 'nl' 'pt' 'sw' 'th' 'tr' 'ur']
CPU times: user 9min 3s, sys: 5.22 s, total: 9min 8s
Wall time: 9min 6s


In [158]:
# The following can be used to view the confidence scores for specific languages.
generator.commentsByLanguages(['nl'])

| index | label | score | comment |
|---|---|---|---|
| 1 | nl | 0.994471 | Beste Leen / TA's, Bij mij is vraag 1 van t... |
| 3 | nl | 0.994270 | Beste docent en assistenten, Ik ben vandaag... |
| 4 | nl | 0.995082 | Besta TA's, Ik loop bij Opgave 4 vast op he... |
| 6 | nl | 0.995098 | Beste studenten asistente, Ik ben deze week... |
| 8 | nl | 0.995516 | Hoi, Ik snap het nut van de GlobDecl child ... |
| ... | ... | ... | ... |
| 1382 | nl | 0.994399 | Hai, Ik zit vast bij de Go puzzelopdracht, ... |
| 1384 | nl | 0.994828 | Ik merk dat ik van best een aantal punten die ... |
| 1385 | nl | 0.995542 | Is een cyclische stroom aanwezig zodra het For... |
| 1386 | nl | 0.995837 | Geachte ta's en docenten van programmeertalen,... |
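
The split into language subsets can be illustrated without loading a model. The following is a plain-Python sketch, assuming the detection results arrive as one `{'label', 'score'}` dictionary per comment, which is the shape a text-classification pipeline returns. `split_by_language` is a hypothetical helper and the scores are made up for illustration.

```python
from collections import defaultdict


def split_by_language(comments, detections):
    """Group comments by the language label detected for each of them."""
    subsets = defaultdict(list)
    for comment, detection in zip(comments, detections):
        subsets[detection["label"]].append(comment)
    return dict(subsets)


comments = [
    "These are some example comments",
    "Dit is bijvoorbeeld commentaar in het nederlands!",
    "They can be written in differing lengths",
]
# Illustrative detection results; the scores are not real model output.
detections = [
    {"label": "en", "score": 0.99},
    {"label": "nl", "score": 0.99},
    {"label": "en", "score": 0.98},
]

subsets = split_by_language(comments, detections)
for language, texts in subsets.items():
    print(language, len(texts))
```

Grouping in a single pass keeps the original comment order within each language.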


### Insight generation

In [159]:
# Desired classifiers are defined as ClassifierConfig.
# A list of these can then be passed to the insight generator.
configs = [
    ClassifierConfig("sentiment-analysis", 
                     "nlptown/bert-base-multilingual-uncased-sentiment")
]
# Don't forget to pass 'useGPU=True' when utilizing GPU!!!

In [160]:
# Example of zero-shot classification, which classifies the input according 
# to labels chosen at inference time, after model training has taken place. 
# These candidate labels are additional arguments for the classifier and are 
# passed to the 'args' property of the configuration.
configs.append(
    ClassifierConfig("zero-shot-classification", 
                     "joeddav/xlm-roberta-large-xnli",
                     languages=["en"],
                     args=["urgent"])
)

In [161]:
# Example of passing model and tokenizer instances, which allows for the use of
# locally stored, fine-tuned models and tokenizers.
# For this example, however, a model from the Hugging Face hub is loaded
# instead of a local directory.
model_directory = "distilbert-base-uncased-finetuned-sst-2-english"

configs.append(
  ClassifierConfig("sentiment-analysis", 
                   "fine-tuned-model",
                   model=AutoModelForSequenceClassification.from_pretrained(model_directory), 
                   tokenizer=AutoTokenizer.from_pretrained(model_directory),
                   languages=["en"])
)

In [None]:
%%time
generator.generateInsights(configs)

In [None]:
# Results are stored within the generator.
generator.insights

# Visualization

The data gathered by the insight generator is stored in its class properties. It can therefore be extracted later to present the results in more insightful ways. Specific results are accessed by their unique identifier, a string containing the task, model name and languages of the configuration. Three separate ways of visualizing the results are shown below. Zero-shot classification returns a list of scores and labels per input, for which the visualization functions are currently not optimized. Variants of the functions could easily be implemented, or the existing ones made more robust. Feel free to experiment with which model result is passed to the functions and which insights can be extracted from them. 
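
One possible way to make zero-shot results fit the visualization functions is to keep only the best-scoring label per input. The sketch below is an assumption about how this could be done, not part of the notebook: `top_label_rows` is a hypothetical helper, and the example results are made up but follow the `sequence` / `labels` / `scores` layout that the zero-shot pipeline returns.

```python
def top_label_rows(zero_shot_results):
    """Reduce each zero-shot result to its best label and score."""
    rows = []
    for result in zero_shot_results:
        scores = result["scores"]
        # Index of the highest score; labels and scores are aligned lists.
        best = max(range(len(scores)), key=scores.__getitem__)
        rows.append({
            "label": result["labels"][best],
            "score": scores[best],
            "comment": result["sequence"],
        })
    return rows


# Illustrative results; the scores are not real model output.
results = [
    {"sequence": "My deadline is in one hour, please help",
     "labels": ["urgent", "not urgent"], "scores": [0.9, 0.1]},
    {"sequence": "Thanks for the feedback",
     "labels": ["not urgent", "urgent"], "scores": [0.8, 0.2]},
]

rows = top_label_rows(results)
print(rows)
# pd.DataFrame(rows) would then have the 'label' and 'score' columns
# that the visualization functions in this section expect.
```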

In [None]:
# Specific insights are retrieved by their unique identifiers:
for key in generator.insights.keys():
  print(key)
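
Because the identifiers are long, it may be convenient to rebuild them from their parts instead of typing them by hand. A small sketch, assuming the same format string as `ClassifierConfig.toString`; `insight_key` is a hypothetical helper that is not defined in the notebook.

```python
def insight_key(task, name, languages=()):
    """Rebuild the identifier under which an insight is stored."""
    return f"{task} on '{name}' for '{list(languages)}' languages"


print(insight_key("sentiment-analysis",
                  "nlptown/bert-base-multilingual-uncased-sentiment"))
print(insight_key("sentiment-analysis", "fine-tuned-model", ["en"]))
```

The result could then be used as `generator.insights[insight_key(...)]`.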

### Extreme edge values

Retrieves the extreme edge values for a specific label, sorted by the confidence score assigned by the Hugging Face model. Because the sort is ascending, the rows with the lowest scores for that label are returned.

In [None]:
def EdgeValues(dataframe, label, amount=5):
  return dataframe.loc[dataframe.label == label].sort_values(by=['score'])[:amount]

In [None]:
EdgeValues(generator.insights["sentiment-analysis on 'nlptown/bert-base-multilingual-uncased-sentiment' for '[]' languages"], '5 stars')

In [None]:
EdgeValues(generator.insights["sentiment-analysis on 'fine-tuned-model' for '['en']' languages"], 'POSITIVE')
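
The same selection can be made at either end of the score range. The following plain-Python sketch works on a list of `{'label', 'score'}` dictionaries rather than on a DataFrame; `edge_values` and its `highest` flag are hypothetical additions, and the rows are made up for illustration.

```python
def edge_values(rows, label, amount=5, highest=False):
    """Return the lowest (or highest) scoring rows for one label."""
    matching = [row for row in rows if row["label"] == label]
    return sorted(matching, key=lambda row: row["score"],
                  reverse=highest)[:amount]


# Illustrative rows; the scores are not real model output.
rows = [
    {"label": "POSITIVE", "score": 0.99},
    {"label": "NEGATIVE", "score": 0.97},
    {"label": "POSITIVE", "score": 0.55},
    {"label": "POSITIVE", "score": 0.80},
]

print(edge_values(rows, "POSITIVE", amount=2))                # least confident
print(edge_values(rows, "POSITIVE", amount=2, highest=True))  # most confident
```

The least confident rows tend to be the borderline cases, while the most confident ones show what the model considers a clear example of the label.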

### Label Pie plot

Pie plot depicting the share of each label in the model results.

In [None]:
def LabelPiePlot(dataframe):
  labels = np.unique(dataframe.label)
  ld_gg = dataframe.groupby(by=dataframe.label)

  means = []
  for label in labels:
      means.append(len(ld_gg.groups[label]) / len(dataframe))

  fig, ax = plt.subplots()
  ax.pie(means, labels=labels, autopct='%1.1f%%', startangle=90)
  plt.show()

In [None]:
LabelPiePlot(generator.insights["sentiment-analysis on 'nlptown/bert-base-multilingual-uncased-sentiment' for '[]' languages"])

In [None]:
LabelPiePlot(generator.insights["sentiment-analysis on 'fine-tuned-model' for '['en']' languages"])
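
The fractions behind the pie plot can also be inspected as numbers. A minimal sketch using only the standard library; `label_shares` is a hypothetical helper and the labels are made up for illustration.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of results per label, sorted by label name."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Illustrative labels; not real model output.
shares = label_shares(["5 stars", "1 star", "5 stars", "5 stars"])
print(shares)
```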

### Scores Histogram
Histogram depicting the distribution of confidence scores across the input.

In [None]:
def ScoreHistogram(dataframe):
  plt.hist([round(score, 1) for score in dataframe.score], 
           lw=1, ec="yellow", fc="green", alpha=0.5, align='right')
  
  plt.show()

In [None]:
ScoreHistogram(generator.insights["sentiment-analysis on 'nlptown/bert-base-multilingual-uncased-sentiment' for '[]' languages"])

In [None]:
ScoreHistogram(generator.insights["sentiment-analysis on 'fine-tuned-model' for '['en']' languages"])