# Hands-On NLP Workshop: *Detecting Sentiment Using Tweets to US Airlines*

In this hands-on workshop you will analyse tweets from airline customers about the airlines' performance. The tweets were scraped from Twitter in 2015 and will be categorised into positive, neutral or negative sentiment. 
  
The steps which have been carried out are:
1. Load The Data 
2. Data Visualization 
3. Text Preprocessing and Cleaning  
4. Handling Imbalance         
5. Model Building  

The code in this notebook has been adapted from the brilliant work done by [Meisam Raz](https://www.kaggle.com/meisamraz/sentiment-analysis-96-acc-eda-text-preprocessing/notebook). 

In [None]:
# Colab has a load of packages pre-loaded into the environment. Installing the additional ones we require here.
!pip install seldon-deploy-sdk==1.4.1
!pip install google-cloud-storage

## Deploying the Model
As we have seen in the previous sections, the tweets are pre-processed using a variety of techniques. We have 2 options for handling this pre-processing logic in production:
1. **Custom Model:** Incorporate the pre-processing directly in the `predict` method of a custom model. This provides simplicity when creating the deployment as there is only a single code base to worry about and a single component to be deployed.
2. **Input Transformer:** Make use of a separate container to perform all of the input transformation and then pass the vectors to the model for prediction. The schematic below outlines how this would work. 
```
            ________________________________________
            |            SeldonDeployment          |
            |                                      |
Request -->  Input transformer   -->     Model --> Response
            |  (Pre-processing)          (SKLearn) |
            |______________________________________|
```
The use of an input transformer allows us to separate the pre-processing logic from the prediction logic. This means we can leverage the pre-packaged SKLearn server provided by Seldon to serve our model, and each of the components can be upgraded independently of one another. However, it does introduce additional complexity in the deployment which is generated, and how that then interacts with advanced monitoring components such as outlier and drift detectors. 

This workshop will focus on the generation of a **custom model for this case**. We therefore need to define an `__init__` method and a `predict` method, which load the artefacts and perform inference respectively in our new deployment. 
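
As a minimal sketch of the shape such a class takes, the example below uses a hypothetical `EchoModel` (not part of this workshop) with the two methods described above. It assumes that predictor parameters arrive as keyword arguments to `__init__`, and that `predict` receives the request payload as its first argument.

```python
class EchoModel:
    """Hypothetical stand-in showing the shape of a Python custom model."""

    def __init__(self, greeting="hello"):
        # Predictor parameters are assumed to be passed in as keyword arguments.
        self.greeting = greeting

    def predict(self, X, names=None, meta=None):
        # X holds the request payload, here an iterable of strings.
        return [f"{self.greeting}: {item}" for item in X]


model = EchoModel(greeting="hi")
result = model.predict(["first tweet", "second tweet"])
print(result)
```

The real `TweetSentiment` class follows the same pattern, with artefact loading in `__init__` and the cleaning, vectorising and classification steps in `predict`.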

--------- 

## Setup

We then define our Seldon custom model. The component parts required to build the custom model are outlined below. Each of the files plays a key part in building the eventual Seldon docker container.

---
### TweetSentiment.py
This is the critical file as it contains the logic associated with the deployment wrapped as part of a class by the same name as the Python file. 

A key thing to note about the way this has been structured is that we have focused on making this deployment reusable. The `__init__` method accepts two custom predictor parameters; one for the saved model (`model_path`), and the other for the TF-IDF vectorizer (`tfidf_path`). 

The advantage of this is that it allows us to upgrade the model or vectorizer without having to re-build the container image. Additionally, if the logic was more general it could be used to accept a wider variety of objects for greater reusability. 
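
To illustrate why path parameters make the class reusable, here is a small sketch under stated assumptions: `fake_bucket`, `load_artefact` and `ConfigurableModel` are hypothetical names, a dictionary stands in for the GCS bucket, and the standard library's `pickle` stands in for `joblib`. Pointing the same class at a different path loads a different artefact with no code change.

```python
import pickle
from io import BytesIO

# Hypothetical in-memory "bucket": object path -> serialised bytes.
fake_bucket = {
    "v1/model.pkl": pickle.dumps({"version": 1}),
    "v2/model.pkl": pickle.dumps({"version": 2}),
}


def load_artefact(path):
    buffer = BytesIO()
    buffer.write(fake_bucket[path])  # stands in for blob.download_to_file(buffer)
    buffer.seek(0)  # rewind, otherwise the loader starts at the end of the buffer
    return pickle.load(buffer)


class ConfigurableModel:
    def __init__(self, model_path):
        # The artefact is chosen by a parameter, not baked into the code.
        self.model = load_artefact(model_path)


old = ConfigurableModel("v1/model.pkl")
new = ConfigurableModel("v2/model.pkl")
print(old.model, new.model)
```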

In [2]:
from seldon_deploy_sdk import EnvironmentApi, Configuration, ApiClient, SeldonDeploymentsApi, OutlierDetectorApi, DriftDetectorApi, ModelMetadataServiceApi
from seldon_deploy_sdk.auth import OIDCAuthenticator


In [None]:
%%writefile TweetSentiment.py

from joblib import load
import logging
import pandas as pd
import numpy as np
import re

# For downloading the model and TF-IDF vectorizer from GCS
from io import BytesIO
from google.cloud import storage

import nltk
from nltk.corpus import stopwords
nltk.download("stopwords", download_dir="./nltk")
nltk.data.path.append("./nltk")

logger = logging.getLogger(__name__)


class TweetSentiment(object):

    def __init__(self, model_path, tfidf_path):
        logger.info(f"Connecting to GCS")
        self.client = storage.Client.create_anonymous_client()
        self.bucket = self.client.bucket('tom-seldon-examples')

        logger.info(f"Model name: {model_path}")
        self.model_path = model_path

        logger.info(f"TF-IDF Name: {tfidf_path}")
        self.tfidf_path = tfidf_path

        logger.info("Loading model file and TF-IDF vectorizer.")
        # Mark as not ready first; load_deployment_artefacts() sets ready = True on success.
        self.ready = False
        self.load_deployment_artefacts()

    def load_deployment_artefacts(self):
        logger.info("Loading model")
        model_file = BytesIO()
        model_blob = self.bucket.get_blob(f'{self.model_path}')
        model_blob.download_to_file(model_file)
        model_file.seek(0)  # Rewind the buffer so joblib reads from the start
        self.model = load(model_file)

        logger.info("Loading TF-IDF vectorizer")
        tfidf_file = BytesIO()
        tfidf_blob = self.bucket.get_blob(f'{self.tfidf_path}')
        tfidf_blob.download_to_file(tfidf_file)
        tfidf_file.seek(0)  # Rewind the buffer so joblib reads from the start
        self.tfidf = load(tfidf_file)
        
        self.ready = True

    # Remove stop words
    def remove_stopwords(self, text):
        # Build the stop-word set once per call rather than once per word.
        stop_words = set(stopwords.words('english'))
        text = ' '.join([word for word in text.split() if word not in stop_words])
        return text

    
    def char(self, text):
        substitute = re.sub(r'[^a-zA-Z]',' ',text)
        return substitute

    def predict(self, tweets, names=[], meta={}):
        try:
            # Load the artefacts on demand, then carry on with the prediction.
            if not self.ready:
                self.load_deployment_artefacts()

            final_text = []

            # NOTE: the cleaning helpers called below (other than `char` and
            # `remove_stopwords`) are not shown in this listing; they are assumed
            # to be methods mirroring the earlier pre-processing section.
            for text in tweets:
                # Apply functions to tweets
                text = self.remove_username(text)
                text = self.remove_url(text)
                text = self.remove_emoji(text)
                text = self.decontraction(text)
                text = self.seperate_alphanumeric(text)
                text = self.unique_char(self.cont_rep_char,text)
                text = self.char(text)
                text = text.lower()
                text = self.remove_stopwords(text)
                final_text.append(text)

            logger.info(f"Final text to be embedded: {final_text}")
            embeddings = self.tfidf.transform(final_text)
            sentiment = self.model.predict(embeddings)
            return sentiment

        except Exception as ex:
            logger.exception(f"Failed during predict: {ex}")
            # Re-raise so the caller sees an error instead of an empty response.
            raise
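
The two cleaning steps defined in the class above can be tried in isolation. The sketch below is a standalone approximation: `keep_letters` reuses the regex from `char`, and a tiny hard-coded `STOP_WORDS` set (an assumption for illustration) replaces NLTK's English list so that nothing needs downloading.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"how", "can", "you", "not", "my", "on", "to", "in", "about", "should"}


def keep_letters(text):
    # Same regex as the `char` method: replace every non-letter with a space.
    return re.sub(r"[^a-zA-Z]", " ", text)


def remove_stopwords(text, stop_words=STOP_WORDS):
    # Drop every whitespace-separated token found in the stop-word set.
    return " ".join(word for word in text.split() if word not in stop_words)


tweet = "How can you not put my bag on plane to Seattle. Flight 1212."
cleaned = remove_stopwords(keep_letters(tweet).lower())
print(cleaned)
```

Lower-casing happens before stop-word removal, as in `predict`, because the stop-word list is lower-case.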


## Testing Locally
To check that `TweetSentiment.py` works correctly, we can use the `seldon_core` Python package to run our model locally and test the endpoint. 
```
seldon-core-microservice TweetSentiment --service-type MODEL \
                                        --parameters='[{ 
                                                        "name": "model_path",
                                                        "value": "nlp-workshop/<YOUR NAME>/model.joblib",
                                                        "type": "STRING"
                                                       }, {
                                                        "name": "tfidf_path",
                                                        "value": "nlp-workshop/<YOUR NAME>/tfidf.joblib",
                                                        "type": "STRING"
                                                       }]'
```
The local endpoint can then be tested by sending it a request with cURL: 
```
curl -H 'Content-Type: application/json' -d '{"data": {"ndarray": ["@united how can you not put my bag on plane to Seattle. Flight 1212. Waiting  in line to talk to someone about my bag. Status should matter."]}}' http://localhost:9000/api/v1.0/predictions
```
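
The same request can also be built from Python. The following is a sketch using only the standard library; `build_request` and `send` are hypothetical helper names, and the URL assumes the microservice is listening on `localhost:9000` as in the cURL command above.

```python
import json
from urllib import request

LOCAL_URL = "http://localhost:9000/api/v1.0/predictions"


def build_request(tweets, url=LOCAL_URL):
    # Wrap the tweets in the same {"data": {"ndarray": [...]}} payload as the cURL call.
    body = json.dumps({"data": {"ndarray": list(tweets)}}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req, timeout=10):
    # Perform the HTTP call and decode the JSON response body.
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_request(["@united how can you not put my bag on plane to Seattle."])
# send(req)  # uncomment once the microservice is running locally
```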

### .s2i/environment
In order for the Seldon base image to correctly convert your source code to an image it requires certain environment variables. In this case only 3 variables are needed: 
* `MODEL_NAME`: The model name matches the name of the Python file and class which is created. 
* `SERVICE_TYPE`: Seldon allows you to create many different components each specialised for a different purpose e.g. `TRANSFORMER` for performing pre or post-processing steps. 
* `PERSISTENCE`: In some cases you would like to save the state of your deployments to Redis e.g. when scaling up multi-armed bandits.

This is our environment file:
```
MODEL_NAME=TweetSentiment
SERVICE_TYPE=MODEL
PERSISTENCE=0
```
---
### requirements.txt
List of Python packages which the deployment requires to run.
```
joblib
pandas
numpy
seldon_core
google-cloud-storage
scikit-learn
nltk
```

## Building the Image

We can then build the custom model using source-to-image (s2i). First install s2i locally as per [the documentation](https://github.com/openshift/source-to-image), then run this command: 
```
s2i build . seldonio/seldon-core-s2i-python3:1.12.0-dev tweet-sentiment:0.3
```

The built image is then pushed to Docker Hub, from where it can be pulled for deployment. In an enterprise setting, the container registry would be customised to the client's needs. 
```
docker tag tweet-sentiment:0.3 tomfarrand/tweet-sentiment:0.3
docker push tomfarrand/tweet-sentiment:0.3
```

In this case we will use my pre-built container image for speed and simplicity, which we can now deploy using the SDK. 

In [None]:
!s2i build . seldonio/seldon-core-s2i-python3:1.12.0-dev tweet-sentiment:0.3

## Using the SDK
Now that we have our trained model artefact and preprocessor, alongside our custom-built container, we can use the Seldon Deploy SDK to deploy our model! 

First we set up some authentication: 

In [3]:
SD_IP = "34.73.238.47"

config = Configuration()
config.host = f"http://{SD_IP}/seldon-deploy/api/v1alpha1"
config.oidc_client_id = "sd-api"
config.oidc_server = f"http://{SD_IP}/auth/realms/deploy-realm"
config.oidc_client_secret = "sd-api-secret"
config.auth_method = 'client_credentials'

def auth():
    auth = OIDCAuthenticator(config)
    config.id_token = auth.authenticate()
    api_client = ApiClient(configuration=config, authenticator=auth)
    return api_client


Next, we define the parameters which we shall feed to the SDK.

!!! Again, remember to fill in the `YOUR_NAME` parameter !!!

In [4]:
YOUR_NAME = "jman"
MODEL_NAME = "tweet-sentiment"

DEPLOYMENT_NAME = f"airline-sentiment"
CONTAINER_NAME = f"tomfarrand/tweet-sentiment:0.3"

NAMESPACE = "seldon-demos"

CPU_REQUESTS = "0.1"
MEMORY_REQUESTS = "1Gi"

CPU_LIMITS = "0.1"
MEMORY_LIMITS = "1Gi"

MODEL_PATH = f"nlp-workshop/tom-farrand/model.joblib"
TFIDF_PATH = f"nlp-workshop/tom-farrand/tfidf.joblib"

The deployment specification is then defined. 

In [5]:
mldeployment = {
    "kind": "SeldonDeployment",
    "metadata": {
        "name": DEPLOYMENT_NAME,
        "namespace": NAMESPACE,
        "labels": {
            "fluentd": "true"
        }
    },
    "apiVersion": "machinelearning.seldon.io/v1alpha2",
    "spec": {
        "name": DEPLOYMENT_NAME,
        "annotations": {
            "seldon.io/engine-seldon-log-messages-externally": "true"
        },
        "protocol": "seldon",
        "predictors": [
            {
                "componentSpecs": [
                    {
                        "spec": {
                            "containers": [
                                {
                                    "name": f"{DEPLOYMENT_NAME}-container",
                                    "image": CONTAINER_NAME,
                                    "resources": {
                                        "requests": {
                                            "cpu": CPU_REQUESTS,
                                            "memory": MEMORY_REQUESTS
                                        },
                                        "limits": {
                                            "cpu": CPU_LIMITS,
                                            "memory": MEMORY_LIMITS
                                        }
                                    }
                                }
                            ]
                        }
                    }
                ],
                "name": "default",
                "replicas": 1,
                "traffic": 100,
                "graph": {
                    "name": f"{DEPLOYMENT_NAME}-container",
                    "parameters": [
                        {
                            "name":"model_path",
                            "value":MODEL_PATH,
                            "type":"STRING"
                        },
                        {
                            "name":"tfidf_path",
                            "value":TFIDF_PATH,
                            "type":"STRING"
                        }
                    ],
                    "children": [],
                    "logger": {
                        "mode": "all"
                    }
                }
            }
        ]
    },
    "status": {}
}
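
In the specification above, the graph node name and the container name are both built from `DEPLOYMENT_NAME`, so they match. A quick sanity check for that relationship could look like the sketch below. `unmatched_graph_nodes` is a hypothetical helper, and it assumes that nodes with an `implementation` field are served by a pre-packaged server and so need no container of their own.

```python
def unmatched_graph_nodes(deployment):
    """Return graph node names that have no container with the same name."""
    missing = []
    for predictor in deployment["spec"]["predictors"]:
        container_names = {
            container["name"]
            for component in predictor.get("componentSpecs", [])
            for container in component["spec"]["containers"]
        }
        # Walk the inference graph depth-first, including any children.
        stack = [predictor["graph"]]
        while stack:
            node = stack.pop()
            custom = "implementation" not in node
            if custom and node["name"] not in container_names:
                missing.append(node["name"])
            stack.extend(node.get("children", []))
    return missing


# Cut-down example specification with the same layout as the one above.
example = {
    "spec": {
        "predictors": [
            {
                "componentSpecs": [
                    {"spec": {"containers": [{"name": "airline-sentiment-container"}]}}
                ],
                "graph": {"name": "airline-sentiment-container", "children": []},
            }
        ]
    }
}
problems = unmatched_graph_nodes(example)
print(problems)
```

An empty list means every custom graph node has a matching container.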

Finally, we deploy the model using a few simple API calls. 

In [6]:
deployment_api = SeldonDeploymentsApi(auth())
deployment_api.create_seldon_deployment(namespace=NAMESPACE, mldeployment=mldeployment)

{'api_version': 'machinelearning.seldon.io/v1alpha2',
 'kind': 'SeldonDeployment',
 'metadata': {'annotations': None,
              'cluster_name': None,
              'creation_timestamp': None,
              'deletion_grace_period_seconds': None,
              'deletion_timestamp': None,
              'finalizers': None,
              'generate_name': None,
              'generation': None,
              'labels': {'fluentd': 'true'},
              'managed_fields': None,
              'name': 'airline-sentiment',
              'namespace': 'seldon-demos',
              'owner_references': None,
              'resource_version': None,
              'self_link': None,
              'uid': None},
 'spec': {'annotations': {'seldon.io/engine-seldon-log-messages-externally': 'true'},
          'name': 'airline-sentiment',
          'oauth_key': None,
          'oauth_secret': None,
          'predictors': [{'annotations': None,
                          'component_specs': [{'hpa_spec': No

Once our endpoint becomes available we can then test the deployment using the following request: 

```
{
   "data":{
      "ndarray":[
         "@united how can you not put my bag on plane to Seattle. Flight 1212. Waiting  in line to talk to someone about my bag. Status should matter."
      ]
   }
}
```
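
As a rough sketch of how this request might be addressed and its response read from Python, the helpers below are hypothetical. The URL pattern is an assumption based on the common Seldon Core ingress convention and may differ in your cluster, and `example_response` is an illustrative body, not real model output.

```python
def prediction_url(host, namespace, deployment):
    # Assumed ingress path convention; check your cluster's actual routing.
    return f"http://{host}/seldon/{namespace}/{deployment}/api/v1.0/predictions"


def extract_predictions(response):
    # Responses are assumed to mirror the request layout: {"data": {"ndarray": [...]}}.
    return response.get("data", {}).get("ndarray", [])


url = prediction_url("34.73.238.47", "seldon-demos", "airline-sentiment")

# Hypothetical response body, for illustration only (not real model output).
example_response = {"data": {"names": [], "ndarray": ["negative"]}, "meta": {}}
labels = extract_predictions(example_response)
print(url, labels)
```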