# Review Sentiment Classification Notebook


## Summary
Text classification assigns a text instance to one or more classes from a predefined set of classes.

## Description 
### Use Case Description
A company, such as a bank, wants to analyze customer feedback to gain additional insight for market campaign prediction. The bank collects customer feedback from a public website. The task is to build a pipeline that automatically analyzes customer feedback messages and provides the overall sentiment toward the bank. The aim is to help a bank that wants to predict the success of telemarketing calls for selling long-term deposits more accurately, by gaining extra features from social media.

#### Use Case Data
The data used in this use case is the [BankReview dataset](https://www.creditkarma.com/reviews/banking/single/id/Simple#single-review-listingPaper), a publicly available dataset collected from the Credit Karma website. The data comprises approximately 120 customer reviews.

We shared the review data as a blob in a public Windows Azure Storage account. You can use this shared data to follow the steps in this template, or you can collect more feedback from the Credit Karma website.

Each instance in the data set has 2 fields:
 
* sentiment - the polarity of the feedback (1 = strongly negative, 2 = negative, 3 = neutral, 4 = positive, 5 = strongly positive)
* review - the text of the feedback 
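
As a small illustration of the record layout described above, the sketch below maps the five sentiment codes to their labels. The helper name `describe` and the two sample rows are made up for illustration; they are not taken from the dataset.

```python
# Sentiment codes as listed in the field description above.
SENTIMENT_LABELS = {
    1: "strongly negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "strongly positive",
}


def describe(record):
    """Return a short readable summary of one (sentiment, review) record."""
    label = SENTIMENT_LABELS[record["sentiment"]]
    return f"{label}: {record['review']}"


# Hypothetical rows, used only to show the two fields side by side.
sample = [
    {"sentiment": 5, "review": "Great service."},
    {"sentiment": 2, "review": "Transfers are slow."},
]
summaries = [describe(r) for r in sample]
print(summaries)
```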

### Review Sentiment Operationalization

### Schema Generation
To deploy the model as a web service, we first need to define the functions used to generate the schema file for the service.

In [23]:
# This script generates the scoring and schema files
# necessary to operationalize the Market Campaign prediction sample
# Init and run functions

from azureml.api.schema.dataTypes import DataTypes
from azureml.api.schema.sampleDefinition import SampleDefinition
from azureml.api.realtime.services import generate_schema

In [24]:
import pandas as pd
import string

In [25]:
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [30]:
# Prepare the web service definition by authoring
# init() and run() functions. Test the functions
# before deploying the web service.

def init():
    from sklearn.externals import joblib

    # load the model file
    global model
    model = joblib.load('./code/reviewsentiment/model_30.pkl')

In [31]:
def run(input_df):
    import json
    
    input_df.columns = ['input_column'] 
    
    stop_words_df = pd.read_csv('./data/StopWords.csv')
    stop_words = set(stop_words_df["Col1"].tolist())
    for item in string.ascii_lowercase:  # also treat single letters (except "i") as stop words
        if item != "i":
            stop_words.add(item)

    input_column = []
    for line in input_df.input_column:
        value = " ".join(item.lower()
                         for item in RegexpTokenizer(r'\w+').tokenize(line)
                         if item.lower() not in stop_words)
        input_column.append(value)
    input_df.input_column = input_column

    stemmer = PorterStemmer()
    input_list = input_df["input_column"].tolist()

    # Tokenize the sentences in input_list and remove morphological affixes from words.

    def stem_tokens(tokens, stemmer_model):
        '''
        :param tokens: tokenized word list
        :param stemmer_model: stemmer used to remove morphological affixes
        :return:  tokenized and stemmed words
        '''
        return [stemmer_model.stem(original_word) for original_word in tokens]

    def tokenize(text):
        '''
        :param text: raw text
        :return: tokenized and stemmed words
        '''
        tokens = text.strip().split(" ")
        return stem_tokens(tokens, stemmer)

    # Initialize the TfidfVectorizer to compute tf-idf for each word

    tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english', max_df=160000,
                            min_df=1, norm="l2", use_idf=True)
    tfs = tfidf.fit_transform(input_list)
    
    pred = model.predict(tfs[0, :30])
    return json.dumps(str(pred[0]))
    #return pred[0]
print('executed')

executed
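
The stop-word filtering inside `run` can be tried in isolation. The following is a minimal sketch rather than the notebook's code: it swaps `re.findall` in for NLTK's `RegexpTokenizer` so that it runs with the standard library only (for the pattern `\w+` the two should yield the same tokens), and the three-word stop list is a made-up stand-in for `StopWords.csv`. The function name `remove_stop_words` is also an assumption.

```python
import re
import string


def remove_stop_words(text, stop_words):
    """Split text on word characters, lowercase it, and drop stop words."""
    tokens = re.findall(r"\w+", text)
    return " ".join(t.lower() for t in tokens if t.lower() not in stop_words)


# Hypothetical stop list; the notebook reads its list from a file instead.
stop_words = {"the", "is", "a"}
# As in run(): every single letter except "i" is also treated as a stop word.
stop_words |= {c for c in string.ascii_lowercase if c != "i"}

cleaned = remove_stop_words("The bank's app is great, I love it!", stop_words)
print(cleaned)  # bank app great i love it
```

Note that the possessive `s` split off from `bank's` is removed by the single-letter rule, which appears to be the reason for adding those letters.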


In [32]:
df = pd.DataFrame(data=[["I absolutely love my bank. There's a reason this bank's customer base is so strong--their customer service actually acts like people and not robots. I love that anytime my card is swiped, I'm instantly notified. And the built in budgeting app is something that really makes life easier. The biggest setback is not being able to deposit cash (you have to get a money order), and if you have another, non-simple bank account, transferring money between accounts can take a few days, which frankly isn't acceptable with most ACH taking a business day or less. Overall, it's a great bank, and I would recommend it to anyone."]], columns=['review'])
df.dtypes
df

Unnamed: 0,review
0,I absolutely love my bank. There's a reason th...


In [33]:
init()
input1 = pd.DataFrame(data=[["I absolutely love my bank. There's a reason this bank's customer base is so strong--their customer service actually acts like people and not robots. I love that anytime my card is swiped, I'm instantly notified. And the built in budgeting app is something that really makes life easier. The biggest setback is not being able to deposit cash (you have to get a money order), and if you have another, non-simple bank account, transferring money between accounts can take a few days, which frankly isn't acceptable with most ACH taking a business day or less. Overall, it's a great bank, and I would recommend it to anyone."]], columns=['review'])
input1

Unnamed: 0,review
0,I absolutely love my bank. There's a reason th...


In [34]:
run(input1)

'"0"'
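
The return value is a JSON-encoded string rather than a bare number, because `run` converts the prediction with `str` before calling `json.dumps`. A caller could decode it as in the sketch below. The variable names are illustrative, and the conversion to `int` assumes the model emits integer class labels, as the output above suggests.

```python
import json

# The exact string returned by run(input1) in the cell above.
response = '"0"'

# json.loads undoes json.dumps and yields the inner string.
label = json.loads(response)

# Assumption: the class label is an integer, so it can be parsed as one.
prediction = int(label)
print(repr(label), prediction)
```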

In [35]:
inputs = {"input_df": SampleDefinition(DataTypes.PANDAS, df)}

# The prepare statement writes the scoring file (main.py) and
# the schema file (senti_service_schema.json) to the output folder.

generate_schema(run_func=run, inputs=inputs, filepath='senti_service_schema.json')

{'input': {'input_df': {'internal': 'gANjYXp1cmVtbC5hcGkuc2NoZW1hLnBhbmRhc1V0aWwKUGFuZGFzU2NoZW1hCnEAKYFxAX1xAihYDAAAAGNvbHVtbl90eXBlc3EDXXEEY251bXB5CmR0eXBlCnEFWAIAAABPOHEGSwBLAYdxB1JxCChLA1gBAAAAfHEJTk5OSv////9K/////0s/dHEKYmFYCgAAAHNjaGVtYV9tYXBxC31xDFgGAAAAcmV2aWV3cQ1oCHNYDAAAAGNvbHVtbl9uYW1lc3EOXXEPaA1hWAUAAABzaGFwZXEQSwFLAYZxEXViLg==',
   'swagger': {'example': [{'review': "I absolutely love my bank. There's a reason this bank's customer base is so strong--their customer service actually acts like people and not robots. I love that anytime my card is swiped, I'm instantly notified. And the built in budgeting app is something that really makes life easier. The biggest setback is not being able to deposit cash (you have to get a money order), and if you have another, non-simple bank account, transferring money between accounts can take a few days, which frankly isn't acceptable with most ACH taking a business day or less. Overall, it's a great bank, and I would recommend it to anyo

### Scoring Function
Next, we define a scoring function that scores a new instance.
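
The scoring script follows a two-function pattern: `init()` loads the model once into a global, and `run()` scores each request. A stripped-down sketch of that pattern is shown here. `StubModel` is a hypothetical stand-in for the pickled classifier so that the example runs without the model file, and this `run` takes a plain list of strings instead of a DataFrame and skips all preprocessing.

```python
import json

model = None  # populated by init()


class StubModel:
    """Hypothetical stand-in for the trained classifier; always predicts 0."""

    def predict(self, rows):
        return [0 for _ in rows]


def init():
    """Load the model once, before any request is scored."""
    global model
    model = StubModel()


def run(reviews):
    """Score a list of review strings and return the first label as JSON."""
    pred = model.predict(reviews)
    return json.dumps(str(pred[0]))


init()
result = run(["I love my bank."])
print(result)
```

Keeping the load in `init()` means the cost of reading the model file is paid once rather than on every call to `run()`.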

In [36]:
import pandas as pd
import string

In [37]:
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

In [38]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [42]:
def init():
    import numpy
    import scipy
    from sklearn.linear_model import LogisticRegression

    global model
    import pickle
    # Load the trained model; the with-block closes the file automatically.
    with open('./code/reviewsentiment/model_30.pkl', 'rb') as f:
        model = pickle.load(f)

In [45]:
# run takes an input dataframe and performs sentiment prediction
def run(input_df):
    import json
    import pickle
    
    input_df.columns = ['input_column'] 
    
    # Load the pickled stop-word table; the with-block closes the file automatically.
    with open('./code/reviewsentiment/stopwords.pkl', 'rb') as f:
        stop_words_df = pickle.load(f)
    
    stop_words = set(stop_words_df["Col1"].tolist())
    for item in string.ascii_lowercase:  # also treat single letters (except "i") as stop words
        if item != "i":
            stop_words.add(item)

    input_column = []
    for line in input_df.input_column:
        value = " ".join(item.lower()
                         for item in RegexpTokenizer(r'\w+').tokenize(line)
                         if item.lower() not in stop_words)
        input_column.append(value)
    input_df.input_column = input_column

    stemmer = PorterStemmer()
    input_list = input_df["input_column"].tolist()

    # Tokenize the sentences in input_list and remove morphological affixes from words.

    def stem_tokens(tokens, stemmer_model):
        '''
        :param tokens: tokenized word list
        :param stemmer_model: stemmer used to remove morphological affixes
        :return:  tokenized and stemmed words
        '''
        return [stemmer_model.stem(original_word) for original_word in tokens]

    def tokenize(text):
        '''
        :param text: raw text
        :return: tokenized and stemmed words
        '''
        tokens = text.strip().split(" ")
        return stem_tokens(tokens, stemmer)

    # Initialize the TfidfVectorizer to compute tf-idf for each word

    tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english', max_df=160000,
                            min_df=1, norm="l2", use_idf=True)
    tfs = tfidf.fit_transform(input_list)
    
    pred = model.predict(tfs[0, :30])
    return json.dumps(str(pred[0]))
    #return pred[0]
print('executed')

executed
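
To make the `TfidfVectorizer` step in `run` more concrete, here is a standard-library sketch of the weighting it is configured for. It assumes scikit-learn's documented defaults for the options the notebook does not set (smoothed idf, raw term counts), combined with the `norm="l2"` and `use_idf=True` arguments passed above. The function name `tfidf_l2` and the toy document are made up, and this is an approximation of the behavior, not the library's implementation.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Compute smoothed tf-idf weights with L2 row normalization.

    docs is a list of token lists; returns one {term: weight} dict per doc.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                   for t, c in counts.items()}
        # Guard against an empty document, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf_l2([["bank", "love", "bank"]])
print(rows[0])
```

One consequence of this formula: when the weights are fitted on a single document, as happens inside `run`, every term has an idf of 1, so the result reduces to L2-normalized term counts.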


In [46]:
if __name__ == '__main__':
    init()
    # Named sample_input to avoid shadowing the built-in input().
    sample_input = pd.DataFrame(data=[["I absolutely love my bank. There's a reason this bank's customer base is so strong--their customer service actually acts like people and not robots. I love that anytime my card is swiped, I'm instantly notified. And the built in budgeting app is something that really  makes life easier. The biggest setback is not being able to deposit cash (you have to get a money order), and if you have another, non-simple bank account, transferring money between accounts can take a few days, which frankly isn't acceptable with most ACH taking a business day or less. Overall, it's a great bank, and I would recommend it to anyone."]], columns=['review'])
    print(run(sample_input))
    #input = "{}"
    #run(input)

"0"
